# Missing Data

Let's look at a few convenient methods for dealing with missing data in pandas:

### Import `numpy` with the alias `np` and `pandas` with the alias `pd`

In [0]:
import numpy as np
import pandas as pd

### Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount our Google Drive before we can access the files on it. Run the Python cell below, click the link it generates, and paste the authorization code into the cell.



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [4]:
directory = "teacher"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/intro-to-python/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/intro-to-python

/content/drive/Shared drives/Rubrik/Data Science Track/intro-to-python


#### Load data into a variable called `df`



```python
# I've given you the path this time!
df = pd.read_csv('./data/rhode-island-police-stops.csv')
```

In [5]:
df = pd.read_csv('./data/rhode-island-police-stops.csv')



<hr>
<br>
<br>

## <span style="color:red"> Checking For Null Values </span>
When you load a dataframe, you often want a quick overview of how many null values it contains.

### `.isnull()`

This returns an ENTIRE dataframe with the same shape as the original, but every cell holds either `True` or `False` instead of its original value.

1) If the value was MISSING: `True`
<br>
2) If the value was NOT MISSING: `False`


```python
df.isnull()
```

In [6]:
df.isnull()

Unnamed: 0,date_and_time,police_department,driver_gender,driver_age_raw,driver_age,driver_race,violation,search_conducted,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
0,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
2,False,False,True,True,True,True,True,False,True,False,True,True,True,True,False,False
3,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509676,True,True,True,True,True,True,True,True,True,False,True,True,True,True,False,False
509677,True,True,True,True,True,True,True,True,True,False,True,True,True,True,False,False
509678,True,True,True,True,True,True,True,True,True,False,True,True,True,True,False,False
509679,True,True,True,True,True,True,True,True,True,False,True,True,True,True,False,False


Chaining `.sum()` adds up the boolean values down each column (`True` counts as 1, `False` as 0). The result is a pandas series, where each `index` value is a `column_name` and the matching `value` is the `sum` of all `True` values for that column. In other words, 
<br>
`df.isnull().sum()` returns the number of missing values for each column.

```python
df.isnull().sum()
```

In [7]:
df.isnull().sum()

date_and_time             10
police_department         10
driver_gender          29097
driver_age_raw         29049
driver_age             30695
driver_race            29073
violation              29073
search_conducted          10
search_type           491919
contraband_found           0
stop_outcome           29073
is_arrested            29073
stop_duration          29073
out_of_state           29881
drugs_related_stop         0
district                   0
dtype: int64
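
To make the pattern easy to check by hand, here is a minimal sketch on a made-up three-row dataframe. The `toy` frame and its values are illustrative assumptions, not part of the police-stops data. It also shows that swapping `.sum()` for `.mean()` gives the fraction of missing values per column, which can be easier to read on a large dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "age": [20.0, np.nan, 31.0],
    "gender": ["M", None, None],
    "district": ["Zone K1", "Zone X4", "Zone X1"],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Fraction of missing values per column (True counts as 1, False as 0)
missing_fraction = toy.isnull().mean()

print(missing_counts)
print(missing_fraction)
```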

<hr>
<br>
<br>

## <span style="color:red"> Drop Rows or Columns </span>
Sometimes it is appropriate to simply drop a `row`, a `column`, or several of either in order to deal with missing data.

### `.dropna()`

[dropna method documentation](https://www.geeksforgeeks.org/python-pandas-dataframe-dropna/)

#### `Default Behavior`: Drop ALL `rows` with ANY NaN values.
If a `row` contains at least one `NaN` value, `.dropna()` removes it. This applies to every such `row` in the `dataframe`, so only fully populated rows remain.

```python
# default axis=0 aka 'rows'
df.dropna()
```

In [8]:
df.dropna()

Unnamed: 0,date_and_time,police_department,driver_gender,driver_age_raw,driver_age,driver_race,violation,search_conducted,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
9,2005-01-24 20:32:00,600,M,1987.0,18.0,White,Speeding,True,Probable Cause,True,Citation,False,0-15 Min,True,True,Zone K1
10,2005-02-09 03:05:00,500,M,1976.0,29.0,White,Registration/plates,True,"Probable Cause,Protective Frisk",False,Citation,False,0-15 Min,False,False,Zone X4
83,2005-08-28 01:00:00,0,M,1979.0,26.0,White,Moving violation,True,"Incident to Arrest,Protective Frisk",False,Arrest Driver,True,16-30 Min,True,False,Zone X1
93,2005-09-15 02:20:00,500,M,1988.0,17.0,White,Moving violation,True,Incident to Arrest,False,Arrest Driver,True,16-30 Min,False,False,Zone X4
114,2005-09-24 02:20:00,300,M,1987.0,18.0,White,Moving violation,True,Incident to Arrest,False,Arrest Driver,True,16-30 Min,False,False,Zone K3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509311,2015-12-28 11:05:00,300,F,1996.0,19.0,White,Other,True,Incident to Arrest,True,Citation,False,16-30 Min,True,True,Zone K3
509475,2015-12-30 01:37:00,500,M,1979.0,36.0,White,Registration/plates,True,Protective Frisk,False,Citation,False,0-15 Min,False,False,Zone X4
509508,2015-12-30 08:51:00,300,F,1987.0,28.0,Hispanic,Speeding,True,"Probable Cause,Reasonable Suspicion",True,Citation,False,30+ Min,True,True,Zone K3
509539,2015-12-30 13:15:00,200,M,1992.0,23.0,White,Seat belt,True,Incident to Arrest,False,Arrest Passenger,True,16-30 Min,True,False,Zone X3


**Note:** This operation is not done in place by default. It returns a new dataframe and leaves `df` unchanged.
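
Dropping every row with any `NaN` can discard most of a dataset, as the output above suggests. The sketch below, on a hypothetical four-row frame (the `toy` data is an assumption for illustration), compares the default with two less aggressive options that `.dropna()` accepts: `how="all"` and `subset`. It also confirms that the original frame is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "age": [20.0, np.nan, np.nan, 31.0],
    "gender": ["M", "F", None, None],
})

# Default: drop rows that contain ANY missing value
any_dropped = toy.dropna()

# Drop only rows where EVERY value is missing
all_dropped = toy.dropna(how="all")

# Drop rows only when a specific column is missing
subset_dropped = toy.dropna(subset=["age"])

# toy itself still has all four rows
print(len(toy), len(any_dropped), len(all_dropped), len(subset_dropped))
```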

<br>

#### `axis=1`: Drop ALL `columns` with ANY `NaN` values.

```python
df.dropna(axis=1)
```

In [9]:
df.dropna(axis=1)

Unnamed: 0,contraband_found,drugs_related_stop,district
0,False,False,Zone K1
1,False,False,Zone X4
2,False,False,Zone X1
3,False,False,Zone X4
4,False,False,Zone X4
...,...,...,...
509676,False,False,Zone NA
509677,False,False,Zone NA
509678,False,False,Zone NA
509679,False,False,Zone NA


**Note:** This operation is not done in place by default. It returns a new dataframe and leaves `df` unchanged.
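
With `axis=1`, a single missing value is enough to remove a whole column, which is why so few columns survive above. One possible middle ground is the `thresh` parameter, which keeps a column only if it has at least that many non-missing values. The sketch below uses a hypothetical toy frame, and the 50% cutoff is an arbitrary choice for illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "search_type": [np.nan, np.nan, np.nan, "Probable Cause"],
    "driver_age": [20.0, np.nan, 33.0, 19.0],
    "district": ["Zone K1", "Zone X4", "Zone X1", "Zone X4"],
})

# Strict: drop every column that has ANY missing value
strict = toy.dropna(axis=1)

# Lenient: keep columns with at least half of their values present
min_non_missing = len(toy) // 2
mostly_complete = toy.dropna(axis=1, thresh=min_non_missing)

print(list(strict.columns))
print(list(mostly_complete.columns))
```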

<hr>
<br>
<br>

## <span style="color:red"> Fill in NaNs </span>
Sometimes it is appropriate to fill in the missing data, either by some specified value or by some aggregate statistic, like the median of a certain column.

### `.fillna()`

#### Fill ALL `NaN`s in `df` with some value.

<br>

#### Replace all missing values in `driver_age` with the average of the column.

```python
# calc average age
avg_age = df['driver_age'].mean()

# Pass in avg_age to .fillna()
df['driver_age'].fillna(value = avg_age)
```

In [10]:
# calc average age
avg_age = df['driver_age'].mean()
 
# Pass in avg_age to .fillna()
df['driver_age'].fillna(value = avg_age)

0         20.000000
1         18.000000
2         33.982027
3         19.000000
4         27.000000
            ...    
509676    33.982027
509677    33.982027
509678    33.982027
509679    33.982027
509680    33.982027
Name: driver_age, Length: 509681, dtype: float64
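
The mean is only one possible fill value. As a rough sketch on a hypothetical toy frame, the example below fills a numeric column with its median, which is less sensitive to outliers, and fills a text column with a placeholder label in the same call by passing a dictionary to `.fillna()`. The `"Unknown"` label and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "driver_age": [20.0, np.nan, 40.0, 90.0],
    "driver_gender": ["M", None, "F", None],
})

# Median of the non-missing ages
median_age = toy["driver_age"].median()

# A dict maps each column name to its own fill value
filled = toy.fillna({"driver_age": median_age, "driver_gender": "Unknown"})

print(filled)
```

As with the mean example, `toy` itself is unchanged here because the result is assigned to a new variable.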

Remember that most pandas methods return a new object and leave the original unchanged, so the result must be saved explicitly.
The two options are:

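1) Pass `inplace=True` to the pandas method. (In recent pandas versions, calling an `inplace` method on a column selected with `df['col']` may raise a warning and may not update `df`, so option 2 is the more robust choice.)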
```python
df['driver_age'].fillna(value = avg_age, inplace=True)
```

2) Save over `dataframe` or `column` in question.
```python
df['driver_age'] = df['driver_age'].fillna(value = avg_age)
```

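3) DO NOT COMBINE! A method called with `inplace=True` modifies the object directly and typically returns `None`, so the assignment can overwrite the column with `None`.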
```python
df['driver_age'] = df['driver_age'].fillna(value = avg_age, inplace=True) # <---- Very Wrong
```

In [0]:
df['driver_age'].fillna(value = avg_age, inplace=True)

<br>

#### Filling in missing values is called imputation

Resources for imputation:
- [Missing Values](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/03.04-Missing-Values.ipynb)
- [Data cleaning and missing values](https://www.neuraldesigner.com/learning/tutorials/data-set#MissingValues)