# Missing Data and Applying a Mask

## Missing Values

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({"feature_1": [0.1, np.nan, np.nan, 0.4],
                   "feature_2": [1.1, 2.2, np.nan, np.nan]
                  })
df

Unnamed: 0,feature_1,feature_2
0,0.1,1.1
1,,2.2
2,,
3,0.4,


### Check if each value is missing

In [3]:
df.isnull()

Unnamed: 0,feature_1,feature_2
0,False,False
1,True,False
2,True,True
3,False,True


### Check if any values in a column or row are `True`


In [4]:
df_booleans = pd.DataFrame({"col_1": [True,True,False],
                            "col_2": [True,False,False]
                           })
df_booleans

Unnamed: 0,col_1,col_2
0,True,True
1,True,False
2,False,False


- By default, `pandas.DataFrame.any()` checks each column: if at least one value in the column is `True`, it returns `True` for that column.
- If all values in a column are `False`, it returns `False` for that column.

In [5]:
df_booleans.any()

col_1    True
col_2    True
dtype: bool

- Setting `axis=0` (the default) gives the same column-wise check: is any item in each column `True`?

In [6]:
df_booleans.any(axis=0)

col_1    True
col_2    True
dtype: bool

- Setting `axis=1` checks if any item in a **row** is `True`, and if so, returns `True` for that row.
- Similarly, the function returns `False` for a row only when all values in that row are `False`.

In [7]:
df_booleans.any(axis=1)

0     True
1     True
2    False
dtype: bool
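
To tie this back to missing data, here is a minimal sketch that applies `any(axis=1)` to the output of `isnull()`, reusing the `feature_1`/`feature_2` data from the start of this section. The variable name `row_has_missing` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature_1": [0.1, np.nan, np.nan, 0.4],
                   "feature_2": [1.1, 2.2, np.nan, np.nan]})

# isnull() gives one boolean per cell; any(axis=1) collapses each row
# to True if at least one cell in that row is missing.
row_has_missing = df.isnull().any(axis=1)
print(row_has_missing.tolist())
```

Only the first row has no missing values, so it is the only row flagged `False`.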

### Sum booleans

In [8]:
series_booleans = pd.Series([True,True,False])
series_booleans

0     True
1     True
2    False
dtype: bool

- When applying `sum` to a series (or list) of booleans, the `sum` function treats `True` as 1 and `False` as 0, so the result is the number of `True` values.

In [9]:
sum(series_booleans)

2

You will make use of these functions in this week's assignment!
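
As one possible way these functions fit together (a sketch, not necessarily the approach the assignment expects), the three steps can be chained to count how many rows contain at least one missing value. The name `n_rows_with_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature_1": [0.1, np.nan, np.nan, 0.4],
                   "feature_2": [1.1, 2.2, np.nan, np.nan]})

# Step 1: isnull() marks each missing cell with True.
# Step 2: any(axis=1) marks each row that has at least one True.
# Step 3: sum() counts the True entries, treating True as 1.
n_rows_with_missing = sum(df.isnull().any(axis=1))
print(n_rows_with_missing)
```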

## Apply a Mask

Use a 'mask', a boolean series with one entry per row, to keep only the rows of a dataframe where the mask is `True`.

In [10]:
import pandas as pd

In [11]:
df = pd.DataFrame({"feature_1": [0,1,2,3,4]})
df

Unnamed: 0,feature_1
0,0
1,1
2,2
3,3
4,4


In [12]:
mask = df["feature_1"] >= 3
mask

0    False
1    False
2    False
3     True
4     True
Name: feature_1, dtype: bool

In [13]:
df[mask]

Unnamed: 0,feature_1
3,3
4,4
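
The following sketch repeats the filtering step and inspects the result a little further. It assumes the same single-column dataframe as above; `filtered` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})

mask = df["feature_1"] >= 3

# Indexing with the mask keeps the rows where the mask is True.
filtered = df[mask]

# The kept rows retain their original index labels.
print(filtered.index.tolist())

# Summing the mask counts how many rows pass the filter.
print(mask.sum())
```

Note that the filtered dataframe is not renumbered from zero: its index still refers to the row positions in the original data.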


### Combining comparison operators

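You'll want to be careful when combining more than one comparison operator, to avoid errors.
- Using the `and` operator on a series will result in a `ValueError`, because `and` needs a single `True` or `False` for each operand, and the truth value of a whole series is ambiguous.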

In [14]:
df["feature_1"] >= 2

0    False
1    False
2     True
3     True
4     True
Name: feature_1, dtype: bool

In [15]:
df["feature_1"] <= 3

0     True
1     True
2     True
3     True
4    False
Name: feature_1, dtype: bool

In [16]:
# NOTE: This will result in a ValueError
df["feature_1"] >= 2 and df["feature_1"] <= 3

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
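
If it helps to see the failure without stopping a notebook run, a small sketch like the one below catches the exception instead. The flag `raised` is a hypothetical name used only for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})

raised = False
try:
    # `and` asks Python for a single truth value of the left-hand series,
    # which pandas refuses to provide.
    result = (df["feature_1"] >= 2) and (df["feature_1"] <= 3)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```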

### How to combine two logical operators for Series
What we want is to look at the same row of each of the two series, and compare each pair of items, one row at a time. To do this, use:
- the `&` operator instead of `and`
- the `|` operator instead of `or`
- Also, you'll need to surround each comparison with parentheses `(...)`, because `&` and `|` bind more tightly than comparison operators such as `>=` and `<=`

In [17]:
# This will compare the series, one row at a time
(df["feature_1"] >= 2) & (df["feature_1"] <= 3)

0    False
1    False
2     True
3     True
4    False
Name: feature_1, dtype: bool
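
Putting the two sections together, a combined comparison can itself be used as a mask. The sketch below shows both `&` and `|` on the same dataframe; the names `between` and `outside` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})

# Row-wise AND: keep values from 2 to 3 inclusive.
between = (df["feature_1"] >= 2) & (df["feature_1"] <= 3)
print(df[between]["feature_1"].tolist())

# Row-wise OR: keep values below 1 or above 3.
outside = (df["feature_1"] < 1) | (df["feature_1"] > 3)
print(df[outside]["feature_1"].tolist())
```

Without the parentheses, Python would evaluate `2 & df["feature_1"]` first, which is not the intended comparison.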