# Unit 3 - missing values
---

1. Find rows with missing values
2. Remove missing values using dropna()  
3. Fill missing values using fillna()
4. Fill missing values using interpolate()





In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

<a id='section1'></a>

`null` / `na` - no value was recorded

`NaN` - **N**ot **a** **N**umber - the marker pandas uses for a missing value. By default it is skipped in calculations such as `.mean()`
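
As a small illustration of how `NaN` behaves in calculations, here is a sketch on a made-up three-value Series (the values are not from the vaccination data). It assumes the default `skipna=True` behaviour of pandas reductions.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
s = pd.Series([1.0, np.nan, 3.0])

mean_skipping_nan = s.mean()            # NaN is skipped: (1 + 3) / 2
mean_with_nan = s.mean(skipna=False)    # NaN propagates into the result
numpy_mean = np.mean(s.to_numpy())      # plain NumPy does not skip NaN

# None in a numeric column is stored as NaN as well
none_is_missing = pd.Series([1.0, None]).isnull().tolist()

print(mean_skipping_nan, mean_with_nan, numpy_mean, none_is_missing)
```

So "ignored" is the pandas default, not a property of `NaN` itself: the same values passed to plain NumPy give `nan`.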


### 1. Find rows with missing values

In [3]:
vacc_df.isnull().sum()

location                                  0
iso_code                                  0
date                                      0
total_vaccinations                     4126
people_vaccinated                      4760
people_fully_vaccinated                6553
daily_vaccinations_raw                 5172
daily_vaccinations                      201
total_vaccinations_per_hundred         4126
people_vaccinated_per_hundred          4760
people_fully_vaccinated_per_hundred    6553
daily_vaccinations_per_million          201
dtype: int64

`isnull()` is a pandas function: call it as a method on a DataFrame or Series, or pass the object to `pd.isnull()`

In [4]:
pd.isnull(vacc_df).sum()

location                                  0
iso_code                                  0
date                                      0
total_vaccinations                     4126
people_vaccinated                      4760
people_fully_vaccinated                6553
daily_vaccinations_raw                 5172
daily_vaccinations                      201
total_vaccinations_per_hundred         4126
people_vaccinated_per_hundred          4760
people_fully_vaccinated_per_hundred    6553
daily_vaccinations_per_million          201
dtype: int64

In [5]:
vacc_df['daily_vaccinations'].notnull().sum()

10964

In [None]:
vacc_df['daily_vaccinations'].isnull().sum()
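
The counts above say how many values are missing per column, but not which rows they are in. One possible way to list the rows themselves is to combine `isnull()` with `any(axis=1)`. The sketch below uses a small made-up frame (`toy` is a hypothetical name) so it runs without downloading the data.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'location': ['A', 'A', 'B'],
    'total_vaccinations': [10.0, np.nan, 5.0],
    'daily_vaccinations': [np.nan, 3.0, 5.0],
})

# Boolean mask: True for rows with at least one missing value
rows_with_any_nan = toy[toy.isnull().any(axis=1)]

# Share of missing values per column (mean of a boolean column)
share_missing = toy.isnull().mean()

print(rows_with_any_nan)
print(share_missing)
```

The same pattern should work on `vacc_df`, for example `vacc_df[vacc_df.isnull().any(axis=1)]`.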

`isnan` is a NumPy function and works on numeric columns only

In [None]:
np.isnan(vacc_df['daily_vaccinations']).sum()
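
A quick sketch of the difference between the two, on made-up data: `np.isnan` is expected to raise a `TypeError` on a text (object) column, while `isnull()` handles it. The Series names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
numbers = pd.Series([1.0, np.nan, 3.0])
names = pd.Series(['Zimbabwe', None, 'Africa'], dtype=object)

# Works: the Series is numeric
nan_count_numeric = int(np.isnan(numbers).sum())

# Fails on object data, so we catch the error
try:
    np.isnan(names)
    isnan_works_on_text = True
except TypeError:
    isnan_works_on_text = False

# isnull() works regardless of dtype
null_count_text = int(names.isnull().sum())

print(nan_count_numeric, isnan_works_on_text, null_count_text)
```

For mixed DataFrames like `vacc_df`, `isnull()` is therefore the safer choice.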

### 2. Remove missing values using dropna() 

The rows for Zimbabwe contain missing values

In [None]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']
zimbabwe.head(10)

In [None]:
zimbabwe['total_vaccinations'].isnull().sum()

In [None]:
zimbabwe['total_vaccinations'].notnull().sum()

`count()` returns the number of non-missing values in each column, so we can see the difference between columns

In [None]:
zimbabwe.count()

Remove the rows in which a specific column is missing

In [None]:
zimbabwe.dropna(subset = ['total_vaccinations']).count()

In [None]:
zimbabwe.dropna(subset = ['total_vaccinations', 'daily_vaccinations_per_million']).head()

Remove the rows with a missing value in any column

In [None]:
zimbabwe.dropna()

Note: `dropna()`, like most other functions in the pandas API, returns a new DataFrame
(a copy of the original with the changes applied). The original is left unchanged, so you need to assign the result if you want to keep the changes:

In [None]:
zimbabwe.count()

Assign the result to a variable:

In [None]:
zimbabwe2 = zimbabwe.dropna()
zimbabwe2


---
>A summary of the functions so far:
>
>* `.isnull()` - returns `True` for every missing value (add `.sum()` to count them)
>* `.notnull()` - returns `True` for every non-missing value
>* `.dropna()` - remove rows with missing values according to parameters:
    * `.dropna()` (default) - drops a row if at least one column has NaN
    * `.dropna(subset=[...])` - looks for NaN only in the listed columns
    * `.dropna(how='all')` - drops a row only if all of its columns are NaN
    * `.dropna(thresh=k)` - keeps only rows with at least k non-null values (k=3 means the row must contain at least 3 non-null values)
    * `.dropna(axis=1)` - drop columns instead of rows
> 

See the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).
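
To make the `dropna()` parameters concrete, here is a minimal sketch on a made-up 4x3 frame (the name `df` and its values are invented for illustration). Row 2 is entirely missing and row 3 has a single non-null value.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [1.0, 2.0, np.nan, np.nan],
})

any_nan = df.dropna()              # keep rows with no NaN at all
all_nan = df.dropna(how='all')     # drop only rows that are entirely NaN
at_least_2 = df.dropna(thresh=2)   # keep rows with at least 2 non-null values
by_column = df.dropna(axis=1)      # drop columns that contain any NaN

print(list(any_nan.index))      # rows kept by the default
print(list(all_nan.index))      # rows kept by how='all'
print(list(at_least_2.index))   # rows kept by thresh=2
print(by_column.shape)          # every column here has a NaN, so none remain
```

Note that `axis=1` removes every column in this toy frame, which is a reminder that dropping by column can discard a lot of data.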

---


### 3. Fill missing values using fillna()

Use `.fillna()` to fill missing dataframe values with:
* Whatever value you choose
* Mean, median, mode

Replace all NaNs with 0s

In [None]:
vacc_df.fillna(0, inplace = False )
vacc_df

>`inplace = False` is the default. This doesn't change the vacc_df dataframe. 
>
>To change it you need:
>
>`vacc_df.fillna(0 , inplace = True)`
>
>or to assign:
>
>`vacc_df = vacc_df.fillna(0)`
>
>But we won't do that! This is where some **business understanding** comes in: it's not a good idea to fill a column like `total_vaccinations` with 0s. 
>
>See what happens:

In [None]:
vacc_df.fillna(0).head(15)

So we'll use 0s only for the `daily_vaccinations` column, and perhaps for some other columns (which?)
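
One way to fill only selected columns in a single call is to pass a dictionary that maps column names to fill values. This is a sketch on a made-up frame (`toy` is a hypothetical name); columns not listed in the dictionary are left as they are.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'total_vaccinations': [100.0, np.nan, 130.0],
    'daily_vaccinations': [np.nan, 10.0, np.nan],
})

# Fill daily_vaccinations with 0 and leave total_vaccinations alone
filled = toy.fillna({'daily_vaccinations': 0})

print(filled)
```

The call returns a new DataFrame, so `toy` itself still contains its missing values.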

In [None]:
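# Assign the result back: inplace=True on a selected column may not update vacc_df in recent pandas versions
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(0)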

Check some of the data to see that it worked

In [None]:
vacc_df.iloc[0:3,[0,2,7]]

What about `total_vaccinations`?

In [None]:
vacc_df.iloc[52:62,[0,2,3]]

For `total_vaccinations` we'll use `ffill` (forward fill), which fills each missing value with the last non-missing value that occurs before it.

Yes, `bfill` exists as well. It does what you think it does :-)
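
A minimal sketch of both directions on a made-up Series (values invented for illustration), using the `.ffill()` and `.bfill()` methods:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

ffilled = s.ffill()   # carry the last known value forward
bfilled = s.bfill()   # carry the next known value backward

print(pd.DataFrame({'original': s, 'ffill': ffilled, 'bfill': bfilled}))
```

With `ffill` the leading `NaN` stays missing because nothing precedes it, and with `bfill` the trailing `NaN` stays missing because nothing follows it.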

In [None]:
# .ffill() is equivalent to fillna(method='ffill'), which is deprecated in recent pandas versions
vacc_df['total_vaccinations'].ffill()[52:62]
#vacc_df['total_vaccinations'][52:62]

The first value for a country might be NaN, in which case a plain `ffill` copies the last value of the previous country into it.

Business understanding: this isn't good enough! We need to fill within each country separately!

In [None]:
vacc_df.iloc[57:62,[0,2,3]]

Use `groupby()` and `apply`

(This is more advanced and we will learn it later)


In [None]:
# group_keys=False keeps the original row index so the result can be assigned as a column
vacc_df['newTotal'] = vacc_df.groupby('location', group_keys=False)['total_vaccinations'].apply(lambda x: x.ffill())
vacc_df.iloc[52:62,[0,2,3,12]]
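
To see why the grouping matters, here is a sketch with two made-up countries, `A` and `B` (names and values invented for illustration). It uses the built-in `ffill()` of a grouped column, which is one alternative to `apply`.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B'],
    'total_vaccinations': [10.0, np.nan, np.nan, 5.0],
})

# Plain ffill: B's first row receives A's last value
plain = toy['total_vaccinations'].ffill()

# Grouped ffill: values are only carried forward within the same location
grouped = toy.groupby('location')['total_vaccinations'].ffill()

print(pd.DataFrame({'location': toy['location'], 'plain': plain, 'grouped': grouped}))
```

In the grouped result the first row of `B` stays `NaN`, since there is no earlier value for that country to carry forward.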

Other options - using measures of central tendency:

(this is without grouping by country)

In [None]:
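# Using median (assign the result back instead of using inplace=True on a selected column)
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].median())
  
# Using mean
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mean())
  
# Using mode (mode() returns a Series, so take its first value)
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mode()[0])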
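
The cell above uses one median for the whole column. As a sketch of the per-country variant, the example below fills each gap with the median of its own group using `groupby(...).transform('median')`. The frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'daily_vaccinations': [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
})

# One median for the whole column
global_filled = toy['daily_vaccinations'].fillna(toy['daily_vaccinations'].median())

# One median per location, aligned row by row with the original frame
group_median = toy.groupby('location')['daily_vaccinations'].transform('median')
group_filled = toy['daily_vaccinations'].fillna(group_median)

print(pd.DataFrame({'global': global_filled, 'per_location': group_filled}))
```

The global median is pulled toward the larger country, so the per-location fill is usually the more defensible choice when countries differ a lot in size.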

### 4. Fill missing values using interpolate()

In [7]:
# Assign the result back instead of using inplace=True on a selected column
vacc_df['total_vaccinations'] = vacc_df['total_vaccinations'].interpolate(method='linear')
vacc_df.iloc[52:62,[0,2,3]]

   location        date  total_vaccinations
52   Africa  2021-02-07            549151.0
53   Africa  2021-02-08            593502.0
54   Africa  2021-02-09            661263.0
55   Africa  2021-02-10            795836.0
56   Africa  2021-02-11            908796.0
57   Africa  2021-02-12           1165581.0
58   Africa  2021-02-13           1446178.0
59   Africa  2021-02-14           1606588.5
60   Africa  2021-02-15           1766999.0
61   Africa  2021-02-16           1965819.0
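
Row 59 in the output above is the interpolated one: with linear interpolation, a single gap is filled with the midpoint of its two neighbours, here (1446178 + 1766999) / 2 = 1606588.5. The sketch below reproduces that value on a three-element Series and adds a made-up example with two consecutive gaps.

```python
import numpy as np
import pandas as pd

# The two known neighbours of row 59, with the gap between them
s = pd.Series([1446178.0, np.nan, 1766999.0])
filled = s.interpolate(method='linear')

# Toy data with two consecutive gaps: they are spaced evenly between 0 and 3
gap = pd.Series([0.0, np.nan, np.nan, 3.0]).interpolate()

print(filled.iloc[1])
print(gap.tolist())
```

Like a plain `ffill`, interpolating the whole column at once does not know about country boundaries, so a gap at the start of one country may be filled using values from the previous one.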


---
>A summary of the functions so far:
>
>* `.fillna()` - fill missing values according to parameters:
    * `.fillna(k)`  - with value k, returning a new dataframe
    * `.fillna(k, inplace = True)` - with value k, modifying the existing dataframe
    * `.fillna(method='ffill')` or `.ffill()` - fill with the last non-missing value that occurs before it (the `method=` form is deprecated in recent pandas versions)
    * `.fillna(method='bfill')` or `.bfill()` - fill with the first non-missing value that occurs after it
> * `.interpolate()` - fill using an interpolation technique (linear by default)
>
>See the documentation:
>
>* [Missing data handling documentation](https://pandas-docs.github.io/pandas-docs-travis/reference/frame.html#missing-data-handling)
---