# Chapter 3 Filtering

In [None]:
import pandas as pd

# Read titanic dataset
tnc = pd.read_csv("./datasets/titanic.csv")

# Print dataframe
tnc.head()

## Filter records using boolean series

We can filter rows by passing a boolean series (or a list of booleans) to the DataFrame's indexing operator. Only the rows whose corresponding boolean value is True are returned.

**Syntax: df[boolean_series]**

In [None]:
# Build a list of booleans, one per row, for the first 5 passengers
boolean = [True, False, True, True, False]

# Print only the records that correspond to True
tnc.head()[boolean]
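To see the mechanics without the Titanic file, here is a minimal sketch on a made-up five-row table (the `people` frame and its values are hypothetical). It suggests that a plain list is matched to rows by position, while a boolean series is matched by index label; with the default index both give the same result.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee", "Eve"], "age": [22, 35, 17, 41, 29]}
)

# A plain list of booleans is matched to the rows by position
mask_list = [True, False, True, True, False]
by_list = people[mask_list]

# A boolean series is matched to the rows by index label
mask_series = pd.Series(mask_list, index=people.index)
by_series = people[mask_series]

print(by_list)
```

The mask must have exactly one boolean per row of the dataframe it is applied to.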

### Generating a boolean series with Python comparison operators

Useful comparison operators are ==, !=, >, >=, < and <=

Ex: df["age"] == 18

In [None]:
# Generate boolean series for male passengers
tnc["gender"] == "male"

In [None]:
# Keep only the male passengers by indexing the dataframe with the boolean series
tnc[tnc["gender"] == "male"]

In [None]:
# Find the count of male passengers
len(tnc[tnc["gender"] == "male"])
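As an alternative to building the filtered dataframe just to count it, the boolean series can be summed directly, since True counts as 1 and False as 0. The sketch below uses a small hypothetical `passengers` frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
passengers = pd.DataFrame({"gender": ["male", "female", "male", "male"]})

is_male = passengers["gender"] == "male"

# Summing a boolean series counts the True values
male_count = int(is_male.sum())

# The mean of a boolean series is the fraction of True values
male_share = float(is_male.mean())

print(male_count, male_share)
```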

In [None]:
# Extract the passengers who survived
tnc[tnc["survived"] == 1]

In [None]:
# Extract the passengers who didn't survive
tnc[tnc["survived"] != 1]

In [None]:
# Extract the passengers whose age is greater than or equal to 18
# (left commented out: age is stored as strings, so comparing it with a number raises a TypeError)
# tnc[tnc["age"] >= 18]

In [None]:
# Verify that the datatype of age is string (python object)
tnc.age

In [None]:
# Because age has object dtype (strings), ordering comparisons against numbers fail,
# and ordering comparisons between strings are lexicographic rather than numeric.
# Equality against a string value works as expected.
tnc[tnc.age == "18"]
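One possible way to make numeric comparisons work on such a column is to convert it to numbers first. The sketch below assumes that missing ages are marked with '?' and uses a hypothetical `raw` frame; `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN, and NaN never satisfies a comparison.

```python
import pandas as pd

# Hypothetical toy data: ages stored as strings, '?' assumed to mean "missing"
raw = pd.DataFrame({"age": ["29", "?", "18", "2", "47"]})

# String ordering is lexicographic, so "2" sorts after "18"
lexicographic = "2" > "18"

# errors="coerce" converts unparseable entries such as "?" to NaN
age_num = pd.to_numeric(raw["age"], errors="coerce")

# NaN >= 18 evaluates to False, so the '?' row is dropped
adults = raw[age_num >= 18]

print(adults)
```

The converted series is kept separate here so the original column stays untouched; assigning it back to the column is equally possible.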

In [None]:
# Check for numeric columns in dataframe that do not contain '?'
def check(col):
    for row in tnc[col]:
        if row == '?':
            return False
    return True

for col in tnc.columns:
    if check(col) and tnc[col].dtype == "int64":
        print(col)

In [None]:
# Filter records whose count of siblings or spouse is greater than 1
tnc[tnc["sibsp"] > 1]

In [None]:
# Read houses dataset
houses = pd.read_csv("./datasets/kc_house_data.csv")

# Print dataframe
houses.head()

In [None]:
# Extract houses whose price is greater than 5,000,000
price_cond = houses["price"] > 5_000_000
houses[price_cond]

In [None]:
# Extract houses whose no. of bedrooms is greater than or equal to 10
bdroom_cond = houses["bedrooms"] >= 10
houses[bdroom_cond]

In [None]:
# Extract houses whose no. of bathrooms is less than 1
bathroom_cond = houses["bathrooms"] < 1
houses[bathroom_cond]

In [None]:
# Extract houses whose grade is less than or equal to 7
grd_cond = houses["grade"] <= 7
houses[grd_cond]

## Filter records within a range

To filter records within a range we use the method:

**DataFrame.column.between(min, max)**

Both endpoints are included by default.

Ex: df["age"].between(13, 19)

In [None]:
# Extract houses whose latitude lies in the range 47.30 to 47.35
lat_cond = houses["lat"].between(47.30, 47.35)
houses[lat_cond]
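The endpoint handling of `between` can be checked on a small series. This is a sketch on made-up ages; the string values for the `inclusive` parameter assume pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
ages = pd.Series([12, 13, 16, 19, 20])

# Default: both endpoints are part of the range
both = ages.between(13, 19)

# Exclude both endpoints (string options assume pandas >= 1.3)
neither = ages.between(13, 19, inclusive="neither")

# The default is equivalent to combining two comparisons
manual = (ages >= 13) & (ages <= 19)

print(both.tolist())
print(neither.tolist())
```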

## Filter records whose values lie in a set

To filter records based on whether their values lie in a set we use the method:

**DataFrame.column.isin([val1, val2, ..., valN])**

Ex: df["age"].isin([10, 20, 30, 40])

In [None]:
# Extract houses built in the years 1940, 1965 and 2008
yr_cond = houses["yr_built"].isin([1940, 1965, 2008])
houses[yr_cond]
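To illustrate what `isin` saves us from writing, the sketch below compares it with a chain of equality tests joined by `|` on a hypothetical `homes` frame. Note that the values are passed as a single list.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
homes = pd.DataFrame({"id": [1, 2, 3, 4], "yr_built": [1940, 1999, 2008, 1965]})

# The candidate values go in as one list-like argument
in_set = homes["yr_built"].isin([1940, 1965, 2008])

# Equivalent long form: one equality test per value, joined with |
long_form = (
    (homes["yr_built"] == 1940)
    | (homes["yr_built"] == 1965)
    | (homes["yr_built"] == 2008)
)

print(homes[in_set])
```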

## Filter records by combining conditions

We can combine multiple filter conditions with the bitwise **AND &** and **OR |** operators.

Ex: Return records of illegal child marriages of Indian girl children

    teenage = df["age"] <= 21
    indian = df["country"] == "India"
    girl = df["gender"] == "female"
    married = df["married"] == True
    
**Combine the logic:**
    
    df[married & indian & teenage & girl]

In [None]:
# Extract male passengers who survived having siblings or spouse count greater than or equal to 2
male = tnc["gender"] == "male"
double = tnc["sibsp"] >= 2
survived = tnc["survived"] == 1

# Combine all the logic
tnc[male & survived & double]

In [None]:
# Extract passengers who are male and not in pclass 1, or female and in pclass 1
male = tnc["gender"] == "male"
male_pclass = tnc["pclass"] != 1

female = tnc["gender"] == "female"
female_pclass = tnc["pclass"] == 1

tnc[(male & male_pclass) | (female & female_pclass)]
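Two pitfalls are worth a quick sketch when conditions are written inline instead of being stored in variables first. `&` and `|` bind more tightly than comparisons such as `==`, so each comparison needs its own parentheses, and Python's `and`/`or` keywords do not work element-wise on a series. The `df` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame(
    {"gender": ["male", "female", "male", "female"], "pclass": [1, 1, 3, 2]}
)

# Each comparison is wrapped in parentheses before combining with & and |
combined = df[
    ((df["gender"] == "male") & (df["pclass"] != 1))
    | ((df["gender"] == "female") & (df["pclass"] == 1))
]

# `and` asks for a single truth value of a whole series, which is ambiguous
try:
    (df["gender"] == "male") and (df["pclass"] != 1)
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__

print(combined)
print(and_error)
```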

## Filter records that do not meet the required conditions

To extract all the records that do not satisfy the required conditions we can use the bitwise negation **~** operator.

In [None]:
# Extract houses that are new
# Note that a house is said to be new if it is built or renovated in the year 2014 or after
yr_blt = houses["yr_built"] >= 2014
yr_rnt = houses["yr_renovated"] >= 2014

houses[yr_blt | yr_rnt]

In [None]:
# Extract houses that are old
# Note that a house is said to be old if it is not new
houses[~(yr_blt | yr_rnt)]
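Negating a combined condition can also be written by negating each part and swapping `|` for `&` (De Morgan's law). The following sketch checks this on a hypothetical `homes` frame, assuming a renovation year of 0 means "never renovated".

```python
import pandas as pd

# Hypothetical toy data; yr_renovated == 0 is assumed to mean "never renovated"
homes = pd.DataFrame(
    {"yr_built": [2014, 1990, 1975, 2015], "yr_renovated": [0, 2014, 0, 0]}
)

built_new = homes["yr_built"] >= 2014
renovated_new = homes["yr_renovated"] >= 2014

# "Not (built new or renovated new)"
old_a = ~(built_new | renovated_new)

# Same thing: "not built new and not renovated new"
old_b = ~built_new & ~renovated_new

print(homes[old_a])
```

The parentheses in the first form matter: `~built_new | renovated_new` would negate only the first condition.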

## Filter records based on whether they are null or non-null

**Note: NaN (Not a Number) refers to missing or null values in pandas nomenclature.**

We can select records whose value is NaN (null) using the **DataFrame.column.isna()** method.

And records whose value is not NaN (non-null) using the **DataFrame.column.notna()** method.

In [None]:
# Read sales dataset
sales = pd.read_csv("./datasets/sales.csv")

# Print dataframe
sales

In [None]:
# Extract sales that have null ratings
na = sales["rating"].isna()
sales[na]

In [None]:
# Extract sales that have non-null shipping_zip
not_na = sales["shipping_zip"].notna()
sales[not_na]
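A dedicated method is needed because NaN never compares equal to anything, including itself, so `== np.nan` cannot find missing values. A minimal sketch on a hypothetical `orders` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating
orders = pd.DataFrame(
    {"item": ["pen", "ink", "pad"], "rating": [4.5, np.nan, 3.0]}
)

# isna() flags the missing entry
missing = orders["rating"].isna()

# An equality test against NaN is False for every row
eq_nan = orders["rating"] == np.nan

# notna() is the complement of isna()
rated = orders[orders["rating"].notna()]

print(missing.tolist())
print(eq_nan.tolist())
```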