# Missing Data

Let's look at a few convenient methods for dealing with missing data in pandas:

### Import `numpy` with the alias `np` and `pandas` with the alias `pd`
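A minimal sketch of the import cell this step asks for, using the conventional aliases that the rest of the notebook assumes:

```python
# Conventional aliases used throughout this notebook
import numpy as np
import pandas as pd
```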

### Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount our Google Drive before we can access the files on it. Run the Python cell below, click the link it generates, and paste the authorization code into the cell's prompt.



```python
from google.colab import drive
drive.mount('/content/drive')
```



##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

```python
# %cd is an IPython magic, so this cell only runs inside a notebook
directory = "student"
if directory == "student":
  %cd drive/Colab\ Notebooks/intro-to-python/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/intro-to-python
```



#### Load data into a variable called `df`



```python
# I've given you the path this time!
df = pd.read_csv('./data/rhode-island-police-stops.csv')
```

<hr>
<br>
<br>

## <span style="color:red"> Checking For Null Values </span>
When you load in a dataframe, it is often useful to get a quick overview of how many null values it contains.

### `.isnull()`

This returns an ENTIRE dataframe with the same shape as the original, but instead of the original values, every cell holds either `True` or `False`.

1) If the value was MISSING: `True`
<br>
2) If the value was NOT MISSING: `False`


```python
df.isnull()
```

Chaining `.sum()` adds up the boolean values down each column (`True` counts as 1, `False` as 0). The result is a pandas series in which each `index` label is a `column_name` and the matching `value` is the number of `True` values in that column. In other words,
<br>
`df.isnull().sum()` returns the number of missing values for each column.

```python
df.isnull().sum()
```
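To make the counting concrete, here is a small sketch on a hypothetical toy dataframe (the values are made up and are not taken from the police-stops file). It also shows `.mean()` on the boolean mask, which gives the share of missing values per column rather than the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real police-stops dataset
toy = pd.DataFrame({
    "driver_age": [25.0, np.nan, 40.0, np.nan],
    "driver_gender": ["M", "F", None, "F"],
    "stop_time": ["01:10", "02:30", "03:45", "04:00"],
})

mask = toy.isnull()              # same shape as toy, True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, so this counts missing values per column
missing_share = mask.mean()      # fraction of missing values per column

print(missing_per_column)
print(missing_share)
```

Here `driver_age` has 2 missing values (a share of 0.5), `driver_gender` has 1, and `stop_time` has none.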

<hr>
<br>
<br>

## <span style="color:red"> Drop Rows or Columns </span>
Sometimes it is appropriate to simply drop a `row`, a `column`, or multiple of either, in order to deal with missing data.

### `.dropna()`

[dropna method documentation](https://www.geeksforgeeks.org/python-pandas-dataframe-dropna/)

#### `Default Behavior`: Drop ALL `rows` with ANY `NaN` values.
By default, `.dropna()` drops every `row` that contains at least one `NaN` value, anywhere in the `dataframe`.

```python
# default axis=0 aka 'rows'
df.dropna()
```

**Note:** This operation is not done in place by default; `df` itself is unchanged unless you assign the result back.

<br>

#### `axis=1`: Drop ALL `columns` with ANY `NaN` values.

```python
df.dropna(axis=1)
```

**Note:** This operation is not done in place by default; `df` itself is unchanged unless you assign the result back.
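The sketch below compares both `axis` settings on a hypothetical toy dataframe (made-up values, not the real dataset). It also uses the `subset` parameter of `.dropna()`, which is not covered above, to drop rows based on one column only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real police-stops dataset
toy = pd.DataFrame({
    "driver_age": [25.0, np.nan, 40.0, 31.0],
    "driver_gender": ["M", "F", None, "F"],
    "stop_time": ["01:10", "02:30", "03:45", "04:00"],
})

rows_dropped = toy.dropna()                   # drops every row with at least one missing value
cols_dropped = toy.dropna(axis=1)             # drops every column with at least one missing value
age_only = toy.dropna(subset=["driver_age"])  # only looks at driver_age when deciding which rows to drop

# toy itself is unchanged because none of these calls is done in place
print(toy.shape, rows_dropped.shape, cols_dropped.shape, age_only.shape)
```

Dropping columns is usually the more drastic choice: a single missing value is enough to remove the entire column, so checking `df.isnull().sum()` first is a sensible habit.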

<hr>
<br>
<br>

## <span style="color:red"> Fill in NaNs </span>
Sometimes it is appropriate to fill in the missing data, either by some specified value or by some aggregate statistic, like the median of a certain column.

### `.fillna()`

#### Fill ALL `NaN`s in `df` with some value.
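A minimal sketch of this case, using a hypothetical all-numeric toy dataframe (the column `search_count` is invented for illustration). Passing a single value fills every `NaN` in the frame; passing a dictionary lets each column get its own fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; both column values are made up
toy = pd.DataFrame({
    "driver_age": [25.0, np.nan, 40.0],
    "search_count": [np.nan, 1.0, 2.0],
})

# One value for every NaN in the dataframe
filled_all = toy.fillna(0)

# A different value per column, keyed by column name
filled_per_column = toy.fillna({"driver_age": 30.0, "search_count": 0.0})

print(filled_all)
print(filled_per_column)
```

A single fill value rarely makes sense across columns with different meanings or types, which is why the per-column form is often the more practical one.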

<br>

#### Replace all missing values in `driver_age` with the average of the column.

```python
# calc average age
avg_age = df['driver_age'].mean()

# Pass in avg_age to .fillna()
df['driver_age'].fillna(value = avg_age)
```
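One detail worth checking on a small example: `.mean()` skips `NaN` values by default, so the average is computed only from the values that are present. The sketch below uses a hypothetical toy series and also computes the median, the other statistic mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, NOT taken from the real dataset
ages = pd.Series([20.0, np.nan, 40.0, np.nan], name="driver_age")

avg_age = ages.mean()       # NaNs are skipped: (20 + 40) / 2
median_age = ages.median()  # NaNs are skipped here as well

filled_mean = ages.fillna(value=avg_age)

print(avg_age, median_age)
print(filled_mean)
```

The median is often preferred when a column has outliers, since a few extreme values can pull the mean away from a typical value.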

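Remember that most pandas methods, `.fillna()` included, return a new object and leave the original untouched, so the cell above does not change `df`. To keep the result, there are two options:

1) Pass `inplace`=`True` to the pandas function. Call it on `df` itself: in recent pandas versions, calling it on the selection `df['driver_age']` may raise a chained-assignment warning and leave `df` unchanged.
```python
df.fillna({'driver_age': avg_age}, inplace=True)
```

2) Save over the `dataframe` or `column` in question.
```python
df['driver_age'] = df['driver_age'].fillna(value = avg_age)
```

3) DO NOT COMBINE! With `inplace=True`, the call has traditionally returned `None`, so the assignment overwrites the column with `None` instead of the filled values.
```python
df['driver_age'] = df['driver_age'].fillna(value = avg_age, inplace=True) # <---- Very Wrong
```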
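To see why the result has to be kept, here is a small sketch on a hypothetical toy dataframe. It calls `.fillna()` once without saving the result and once with the assign-back pattern, counting the remaining missing values each time.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real police-stops dataset
toy = pd.DataFrame({"driver_age": [20.0, np.nan, 40.0]})
avg_age = toy["driver_age"].mean()

# Result is discarded, so toy still contains its NaN
toy["driver_age"].fillna(value=avg_age)
still_missing = int(toy["driver_age"].isnull().sum())

# Assigning the result back is what actually updates toy
toy["driver_age"] = toy["driver_age"].fillna(value=avg_age)
now_missing = int(toy["driver_age"].isnull().sum())

print(still_missing, now_missing)
```

Assigning back behaves the same way across pandas versions, which makes it the safer habit of the two options.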

<br>

#### Filling in missing values is called imputation

Resources for imputation:
- [Missing Values](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/03.04-Missing-Values.ipynb)
- [Data cleaning and missing values](https://www.neuraldesigner.com/learning/tutorials/data-set#MissingValues)