# Handling Missing Data part-1
---------------
1) fillna() method
2) Forward fill method
3) Backward fill method
4) dropna() method
     


In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('datasets/season.csv')

# This is our dataframe
df

Unnamed: 0,dates,day,temp,wind-speed
0,2/1/2012,sunny,45.0,12
1,3/1/2012,rainy,46.0,34
2,4/1/2012,hot,47.0,45
3,5/1/2012,,,56
4,6/1/2012,hot,49.0,Not available
5,7/1/2012,,,Not available
6,8/1/2012,hot,12.0,45
7,9/1/2012,rainy,23.0,41
8,10/1/2012,,,
9,11/1/2012,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dates       10 non-null     object 
 1   day         6 non-null      object 
 2   temp        6 non-null      float64
 3   wind-speed  8 non-null      object 
dtypes: float64(1), object(3)
memory usage: 448.0+ bytes


### Filling all the NaN with any number or string

NaN values get in the way of data analysis: many operations either skip them or propagate them, so the results can be misleading. Sometimes it is better to fill them with a number or string.
* To fill every cell containing NaN with a number or string, use
   > ##### fillna ( your_value_goes_here )

Here I am replacing all the NaN values with zeros.

In [3]:
df2=df.fillna(0)
df2

Unnamed: 0,dates,day,temp,wind-speed
0,2/1/2012,sunny,45.0,12
1,3/1/2012,rainy,46.0,34
2,4/1/2012,hot,47.0,45
3,5/1/2012,0,0.0,56
4,6/1/2012,hot,49.0,Not available
5,7/1/2012,0,0.0,Not available
6,8/1/2012,hot,12.0,45
7,9/1/2012,rainy,23.0,41
8,10/1/2012,0,0.0,0
9,11/1/2012,0,0.0,0
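
A single fill value does not have to be shared by every column: `fillna()` also accepts a dictionary that maps column names to replacement values. The sketch below uses a small made-up frame (`sample`) instead of the season data, and the replacement values `"unknown"` and `0` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the season data (values are illustrative only)
sample = pd.DataFrame({
    "day": ["sunny", np.nan, "hot"],
    "temp": [45.0, np.nan, 47.0],
})

# Map each column name to its own replacement value
filled = sample.fillna({"day": "unknown", "temp": 0})
print(filled)
```

Columns that are not listed in the dictionary are left untouched.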


### Forward Fill Method

In many cases, filling every NaN with the same value leads to wrong conclusions. In the fillna(0) example, a zero in the "day" column is not meaningful because zero does not describe any kind of day. It is often better to fill each NaN with data that resembles the rest of its column. <br>
The forward fill method fills each NaN with the previous value in the same column. For example, if the temp column has a NaN in the fourth row, that NaN is filled with the temp value from the third row.
 * This method can be implemented by 
         
         fillna(method="ffill")

   Recent pandas releases (2.1 and later) deprecate the method argument of fillna(); df.ffill() gives the same result.

In [6]:
df3 = df.ffill()  # same result as df.fillna(method='ffill'), which newer pandas deprecates
df3

Unnamed: 0,dates,day,temp,wind-speed
0,2/1/2012,sunny,45.0,12
1,3/1/2012,rainy,46.0,34
2,4/1/2012,hot,47.0,45
3,5/1/2012,hot,47.0,56
4,6/1/2012,hot,49.0,Not available
5,7/1/2012,hot,49.0,Not available
6,8/1/2012,hot,12.0,45
7,9/1/2012,rainy,23.0,41
8,10/1/2012,rainy,23.0,41
9,11/1/2012,rainy,23.0,41


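Notice the change in the 4th and 6th rows of the day and temp columns: each NaN now holds the value from the row above it. The 9th and 10th rows are filled the same way, including the wind-speed column.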

### Backward Fill method 

This is the opposite of the forward fill method. Each NaN is filled with the value from the next row of the same column. For example, if the temp column has a NaN in the 4th row, it is filled with the temp value from the 5th row.
* This can be implemented by   
> fillna(method="bfill")

In [5]:
df

Unnamed: 0,dates,day,temp,wind-speed
0,2/1/2012,sunny,45.0,12
1,3/1/2012,rainy,46.0,34
2,4/1/2012,hot,47.0,45
3,5/1/2012,,,56
4,6/1/2012,hot,49.0,Not available
5,7/1/2012,,,Not available
6,8/1/2012,hot,12.0,45
7,9/1/2012,rainy,23.0,41
8,10/1/2012,,,
9,11/1/2012,,,


In [7]:
df4 = df.bfill()  # same result as df.fillna(method="bfill"), which newer pandas deprecates
df4

Unnamed: 0,dates,day,temp,wind-speed
0,2/1/2012,sunny,45.0,12
1,3/1/2012,rainy,46.0,34
2,4/1/2012,hot,47.0,45
3,5/1/2012,hot,49.0,56
4,6/1/2012,hot,49.0,Not available
5,7/1/2012,hot,12.0,Not available
6,8/1/2012,hot,12.0,45
7,9/1/2012,rainy,23.0,41
8,10/1/2012,,,
9,11/1/2012,,,
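
The last two rows above are still empty because backward fill needs a later valid value to copy from, and none exists. The sketch below shows the same edge effect on a small made-up Series (`temps` is a hypothetical stand-in, not the season data): backward fill leaves trailing NaN values, and forward fill leaves leading ones.

```python
import numpy as np
import pandas as pd

# Hypothetical series with NaN at the start, in the middle, and at the end
temps = pd.Series([np.nan, 47.0, np.nan, 49.0, np.nan])

back = temps.bfill()  # each NaN takes the next valid value below it
fwd = temps.ffill()   # each NaN takes the last valid value above it

print(back.tolist())  # the final NaN stays: nothing comes after it
print(fwd.tolist())   # the first NaN stays: nothing comes before it
```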


* <b>limit=m :</b> If your dataframe has a run of "n" consecutive NaN values and you want the forward fill or backward fill method to fill only "m" of them (the ones nearest the value being copied), pass limit as an extra argument:
> <b>fillna(method="ffill",limit=m)</b>
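
As a minimal sketch of how limit behaves, the example below forward fills a made-up Series (`temps`, not the season data) that has three consecutive NaN values. It uses `Series.ffill(limit=...)`, the shorthand for forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a run of three consecutive NaN values
temps = pd.Series([45.0, np.nan, np.nan, np.nan, 23.0])

fill_one = temps.ffill(limit=1)  # fills only the first NaN of the run
fill_two = temps.ffill(limit=2)  # fills the first two NaN of the run

print(fill_one.tolist())
print(fill_two.tolist())
```

The NaN values beyond the limit are left as they are.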

In [8]:
df

Unnamed: 0,dates,day,temp,wind-speed
0,2/1/2012,sunny,45.0,12
1,3/1/2012,rainy,46.0,34
2,4/1/2012,hot,47.0,45
3,5/1/2012,,,56
4,6/1/2012,hot,49.0,Not available
5,7/1/2012,,,Not available
6,8/1/2012,hot,12.0,45
7,9/1/2012,rainy,23.0,41
8,10/1/2012,,,
9,11/1/2012,,,


In [9]:
df['dates'] = pd.to_datetime(df['dates'])

df = df.set_index('dates')

df

dates,day,temp,wind-speed
2012-02-01,sunny,45.0,12
2012-03-01,rainy,46.0,34
2012-04-01,hot,47.0,45
2012-05-01,,,56
2012-06-01,hot,49.0,Not available
2012-07-01,,,Not available
2012-08-01,hot,12.0,45
2012-09-01,rainy,23.0,41
2012-10-01,,,
2012-11-01,,,


### Dropping the rows containing NaN

#### Rows containing NaN can be dropped in the following ways -

 *  Dropping all the rows which contain at least one NaN <br>
 The dropna() function drops every row that contains at least one NaN.

In [10]:
df2=df.dropna()
df2

dates,day,temp,wind-speed
2012-02-01,sunny,45.0,12
2012-03-01,rainy,46.0,34
2012-04-01,hot,47.0,45
2012-06-01,hot,49.0,Not available
2012-08-01,hot,12.0,45
2012-09-01,rainy,23.0,41


  *  Dropping the rows in which every value is NaN <br>
To drop only the rows whose values are all NaN (the index, here the date, does not count), pass how="all" to the dropna() function.


In [10]:
df3=df.dropna(how = "all")
df3

dates,day,temp,wind-speed
2012-02-01,sunny,45.0,12
2012-03-01,rainy,46.0,34
2012-04-01,hot,47.0,45
2012-05-01,,,56
2012-06-01,hot,49.0,Not available
2012-07-01,,,Not available
2012-08-01,hot,12.0,45
2012-09-01,rainy,23.0,41


*  Dropping the rows which contain fewer than n valid values (n is any natural number) <br>
Pass thresh=n to dropna() to keep only the rows that have at least n valid (non-NaN) values; every row with fewer is dropped. n can be 1, 2, 3, and so on, up to the number of columns, and is called the threshold value. Here thresh=1 keeps every row that has at least one valid value.

In [None]:
df4=df.dropna(thresh = 1)
df4

dates,day,temp,wind-speed
2012-02-01,sunny,45.0,12
2012-03-01,rainy,46.0,34
2012-04-01,hot,47.0,45
2012-05-01,,,56
2012-06-01,hot,49.0,Not available
2012-07-01,,,Not available
2012-08-01,hot,12.0,45
2012-09-01,rainy,23.0,41


You can see that every row with at least one valid value is kept, and the rows without any valid value are deleted. "Not available" is not NaN, so the dropna() function treats it as a valid value.
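
To make the threshold concrete, here is a small sketch on a made-up three-row frame (`sample` is hypothetical, with values chosen to mirror the cases above). The rows hold three, one, and zero valid values, so raising thresh from 1 to 2 changes which rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0, 1 and 2 hold 3, 1 and 0 valid values
sample = pd.DataFrame({
    "day": ["sunny", np.nan, np.nan],
    "temp": [45.0, np.nan, np.nan],
    "wind-speed": ["12", "Not available", np.nan],
})

at_least_one = sample.dropna(thresh=1)  # drops only the all-NaN row
at_least_two = sample.dropna(thresh=2)  # also drops the row with one valid value

print(at_least_one.index.tolist())
print(at_least_two.index.tolist())
```

The string "Not available" counts toward the threshold because it is an ordinary value, not NaN.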