# Missing Data

Let's look at a few convenient methods for dealing with missing data in pandas:

In [1]:
import numpy as np
import pandas as pd

In [2]:
df1=pd.read_csv("weather_data.csv")

In [11]:
df1

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,,7.0,Sunny
2,01-03-2017,28.0,,Snow
3,01-04-2017,,7.0,
4,01-05-2017,32.0,,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,


In [12]:
type(df1.day[0])

str

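Each `day` value is a plain string. Now set the `day` column as the index. `set_index` returns a new DataFrame, so `df1` itself stays unchanged unless the result is assigned back: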

In [13]:
df1.set_index('day')

day,temperature,windspeed,event
01-01-2017,32.0,6.0,Rain
01-02-2017,,7.0,Sunny
01-03-2017,28.0,,Snow
01-04-2017,,7.0,
01-05-2017,32.0,,Rain
01-06-2017,31.0,2.0,Sunny
01-06-2017,34.0,5.0,
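
As a minimal sketch of that behaviour, the snippet below uses a hypothetical three-row `sample` frame (an assumption standing in for `weather_data.csv`, which is not shown here) to check that `set_index` leaves the original frame alone and that only the returned frame is indexed by `day`.

```python
import pandas as pd

# Hypothetical three-row sample standing in for weather_data.csv.
sample = pd.DataFrame({
    "day": ["01-01-2017", "01-02-2017", "01-03-2017"],
    "temperature": [32.0, None, 28.0],
})

# set_index returns a new DataFrame; `sample` is not modified.
indexed = sample.set_index("day")

print(list(sample.columns))   # 'day' is still an ordinary column here
print(indexed.index.name)     # the returned frame is indexed by 'day'
```

To keep the new index, assign the result back, for example `sample = sample.set_index("day")`.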


A common first step is to replace every NaN with a single fixed value:

In [14]:
#Here NaN will be replaced by Zero
new_df=df1.fillna(0)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,0.0,7.0,Sunny
2,01-03-2017,28.0,0.0,Snow
3,01-04-2017,0.0,7.0,0
4,01-05-2017,32.0,0.0,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,0


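But zero is not a sensible fill value for the event column, since every other value in that column is a string. Passing a dictionary to `fillna` sets a separate fill value for each column: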

In [15]:
new_df=df1.fillna({'temperature':0,'windspeed':0,'event':'no event'})
new_df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,0.0,7.0,Sunny
2,01-03-2017,28.0,0.0,Snow
3,01-04-2017,0.0,7.0,no event
4,01-05-2017,32.0,0.0,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,no event
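
Before choosing fill values it can help to see how many values are missing in each column. The sketch below rebuilds the table printed above by hand (an assumption, since the CSV file itself is not shown; the name `weather` is illustrative), counts the missing values with `isna().sum()`, and then applies the same per-column dictionary.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown above (assumed to match the CSV).
weather = pd.DataFrame({
    "day": ["01-01-2017", "01-02-2017", "01-03-2017", "01-04-2017",
            "01-05-2017", "01-06-2017", "01-06-2017"],
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, 31.0, 34.0],
    "windspeed": [6.0, 7.0, np.nan, 7.0, np.nan, 2.0, 5.0],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan],
})

# Number of missing values in each column before filling.
missing_before = weather.isna().sum()

# One fill value per column; columns not listed would be left untouched.
fill_values = {"temperature": 0, "windspeed": 0, "event": "no event"}
filled = weather.fillna(fill_values)

# Number of missing values in each column after filling.
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
```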


But filling temperature with zero is not a good choice either, since it creates a much larger jump from the previous day's temperature than is realistic. Instead we can forward fill, which carries the last valid value forward:

In [16]:
new_df=df1.fillna(method="ffill")
new_df


Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,32.0,7.0,Sunny
2,01-03-2017,28.0,7.0,Snow
3,01-04-2017,28.0,7.0,Snow
4,01-05-2017,32.0,7.0,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,Sunny


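Here each missing value is filled with the value from the previous row, that is, the previous day. We can also backward fill, which copies the value from the next row instead. A missing value in the last row stays NaN, because there is no later value to copy: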

In [17]:
new_df=df1.fillna(method="bfill")
new_df


Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,28.0,7.0,Sunny
2,01-03-2017,28.0,7.0,Snow
3,01-04-2017,32.0,7.0,Rain
4,01-05-2017,32.0,2.0,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,
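
Newer pandas releases deprecate the `method` argument of `fillna` in favour of the dedicated `ffill()` and `bfill()` methods. The sketch below shows the equivalent calls on a hand-built copy of the table (an assumption, since the CSV is not shown; the names `weather`, `forward`, and `backward` are illustrative).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown above (assumed to match the CSV).
weather = pd.DataFrame({
    "day": ["01-01-2017", "01-02-2017", "01-03-2017", "01-04-2017",
            "01-05-2017", "01-06-2017", "01-06-2017"],
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, 31.0, 34.0],
    "windspeed": [6.0, 7.0, np.nan, 7.0, np.nan, 2.0, 5.0],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan],
})

# Same result as fillna(method="ffill"): copy the previous row's value down.
forward = weather.ffill()

# Same result as fillna(method="bfill"): copy the next row's value up.
backward = weather.bfill()

print(forward)
print(backward)
```

The last `event` value is still missing after `bfill()`, because no row follows it.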


In [18]:
new_df=df1.fillna(method='ffill')
new_df


Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,32.0,7.0,Sunny
2,01-03-2017,28.0,7.0,Snow
3,01-04-2017,28.0,7.0,Snow
4,01-05-2017,32.0,7.0,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,Sunny


Here the values are copied down each column, from one row to the next, because `axis` defaults to 0 (the rows).
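
To make the effect of `axis` concrete, here is a small sketch on a hypothetical all-numeric frame (`readings` and its column names are invented for illustration and are not part of the weather data). It compares the default fill down each column with a fill across each row using `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame: three readings per day.
readings = pd.DataFrame({
    "morning": [10.0, np.nan, 14.0],
    "noon": [np.nan, 18.0, np.nan],
    "evening": [12.0, np.nan, 15.0],
})

# Default axis=0: each NaN takes the value from the row above, within its column.
down = readings.ffill()

# axis=1: each NaN takes the value from the column to its left, within its row.
across = readings.ffill(axis=1)

print(down)
print(across)
```

A NaN in the first row (for `axis=0`) or the first column (for `axis=1`) stays missing, since there is nothing earlier to copy from.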

In [20]:
new_df=df1.fillna(method="ffill",limit=1)
new_df


Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,32.0,7.0,Sunny
2,01-03-2017,28.0,7.0,Snow
3,01-04-2017,28.0,7.0,Snow
4,01-05-2017,32.0,7.0,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,Sunny


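`limit=1` caps the forward fill at one consecutive NaN per gap. In this data no column has two NaN values in a row, so the result is the same as the unlimited forward fill.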
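
The effect of `limit` only shows up when several NaN values follow each other. The sketch below uses a hypothetical series `temps` (not taken from the weather data) with a gap of three missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical series with three consecutive missing values.
temps = pd.Series([32.0, np.nan, np.nan, np.nan, 30.0])

# Unlimited forward fill: the whole gap is filled with 32.0.
unlimited = temps.ffill()

# limit=1: only the first NaN in the gap is filled, the rest stay missing.
limited = temps.ffill(limit=1)

print(unlimited.tolist())
print(limited.tolist())
```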

**Drop Na**

In [21]:
#Drop rows with null values
df1.dropna()

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
5,01-06-2017,31.0,2.0,Sunny


In [22]:
#Drop columns with null values
df1.dropna(axis=1)

Unnamed: 0,day
0,01-01-2017
1,01-02-2017
2,01-03-2017
3,01-04-2017
4,01-05-2017
5,01-06-2017
6,01-06-2017


In [23]:
# Drop only the rows in which every value is null.
# how accepts one of two strings: 'any' drops the row/column if ANY value is null,
# and 'all' drops it only if ALL values are null.
df1.dropna(how="all")

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,,7.0,Sunny
2,01-03-2017,28.0,,Snow
3,01-04-2017,,7.0,
4,01-05-2017,32.0,,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,
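
On this data `how="all"` removes nothing, because no row is entirely null. The difference between the two settings is easier to see on a hypothetical frame (`partial` is invented for illustration) that contains one fully empty row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing.
partial = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan],
    "windspeed": [6.0, 7.0, np.nan],
})

# how="any" (the default): drop a row if at least one value is null.
any_dropped = partial.dropna(how="any")

# how="all": drop a row only if every value in it is null.
all_dropped = partial.dropna(how="all")

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
```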


In [24]:
#thresh: an integer giving the minimum number of non-null values a row needs in order to be kept.
df1.dropna(thresh=2)

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32.0,6.0,Rain
1,01-02-2017,,7.0,Sunny
2,01-03-2017,28.0,,Snow
3,01-04-2017,,7.0,
4,01-05-2017,32.0,,Rain
5,01-06-2017,31.0,2.0,Sunny
6,01-06-2017,34.0,5.0,


`thresh=2` keeps only the rows that have at least 2 non-null values. Every row here has at least two (`day` plus one more), so nothing is dropped.
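
To see `thresh` actually remove rows, the sketch below counts the non-null values in each row and tries a few thresholds. It works on a hand-built copy of the table (an assumption, since the CSV is not shown; the name `weather` is illustrative).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown above (assumed to match the CSV).
weather = pd.DataFrame({
    "day": ["01-01-2017", "01-02-2017", "01-03-2017", "01-04-2017",
            "01-05-2017", "01-06-2017", "01-06-2017"],
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, 31.0, 34.0],
    "windspeed": [6.0, 7.0, np.nan, 7.0, np.nan, 2.0, 5.0],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan],
})

# Non-null values per row; dropna(thresh=n) keeps rows where this is >= n.
non_null_counts = weather.notna().sum(axis=1)

keep_2 = weather.dropna(thresh=2)   # every row has at least 2 non-null values
keep_3 = weather.dropna(thresh=3)   # drops the row with only 2 non-null values
keep_4 = weather.dropna(thresh=4)   # keeps only the fully populated rows

print(non_null_counts.tolist())
print(keep_3.index.tolist())
print(keep_4.index.tolist())
```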