 # Missing Values

- The quality of the dataset is very important for any data project. `Garbage in, garbage out.`
- It is therefore essential to put in the extra effort to make the dataset as complete, and as free of errors and missing values, as possible.
- Pandas, a widely used data analysis and manipulation library, is flexible in how it handles missing values.
- In Pandas, missing values are represented by both `NaN` and `None`.
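
As a small illustration of the last point, the sketch below builds two hypothetical Series (`s_num` and `s_obj` are names made up for this example) and checks that both `None` and `np.nan` are reported as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series, not part of the datasets used in this notebook
s_num = pd.Series([1.0, None, np.nan])   # None is converted to NaN in a float Series
s_obj = pd.Series(["a", None, np.nan])   # text column containing both markers

print(s_num.dtype)             # float64
print(s_num.isna().tolist())   # [False, True, True]
print(s_obj.isna().tolist())   # [False, True, True]
```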

**Reasons for missing values in data**
  - The data was intentionally left blank, especially in an optional field (the data does not exist).
  - Human error.
  - If it was a survey, participants might quit the survey halfway.
  - Data being corrupted.

## Checking for missing values
**The Pandas library offers four functions to detect missing values.**

- `isnull()`
- `notnull()`
- `isna()`
- `notna()`

The last two are aliases of the first two: `isnull()` is equivalent to `isna()`, and `notnull()` is equivalent to `notna()`.

- All four functions return boolean values, one per cell of the DataFrame or Series.
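
The equivalence can be checked directly. This is a minimal sketch on a hypothetical frame called `demo`, which is not the CSV used in the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({"Name": ["AAA", None, "CCC"],
                     "English": [50.0, np.nan, 45.0]})

same_mask = demo.isnull().equals(demo.isna())      # isnull and isna give the same mask
inverse_mask = demo.notna().equals(~demo.isna())   # notna is the element-wise inverse of isna

print(same_mask, inverse_mask)   # True True
```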

In [2]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv(r"data\std_multinan.csv")

In [3]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [5]:
df.isnull()  #Returns boolean value

Unnamed: 0,Name,English,Science
0,False,False,False
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
5,True,False,True
6,False,False,True
7,True,False,True
8,False,False,False


In [7]:
df.notnull()   #Returns boolean values

Unnamed: 0,Name,English,Science
0,True,True,True
1,True,False,True
2,True,False,True
3,True,False,True
4,True,False,True
5,False,True,False
6,True,True,False
7,False,True,False
8,True,True,True


In [8]:
df.isna()   #Returns boolean values

Unnamed: 0,Name,English,Science
0,False,False,False
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
5,True,False,True
6,False,False,True
7,True,False,True
8,False,False,False


In [10]:
df.notna()   #Returns boolean values

Unnamed: 0,Name,English,Science
0,True,True,True
1,True,False,True
2,True,False,True
3,True,False,True
4,True,False,True
5,False,True,False
6,True,True,False
7,False,True,False
8,True,True,True


**To check whether a DataFrame or Series contains any missing values, apply `.values.any()` to the result of `isnull()`.**

In [12]:
df.isnull().values.any()   # True if at least one value in the DataFrame is NaN

True

**To count the missing values in each column of a DataFrame (or in a Series), apply `.sum()` to the result of `isnull()`.**

In [13]:
df.isnull().sum()

Name        2
English     4
Science     3
dtype: int64
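
The same boolean mask supports a few other summaries. The sketch below assumes a small hypothetical frame `demo` and shows per-row counts, a grand total, and the fraction missing per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    "Name": ["AAA", "BBB", None, "GGG"],
    "English": [50.0, np.nan, 40.0, 45.0],
    "Science": [40.0, 45.0, np.nan, np.nan],
})

mask = demo.isna()
per_column = mask.sum()           # missing values per column
per_row = mask.sum(axis=1)        # missing values per row
total = int(mask.sum().sum())     # missing values in the whole frame
share = mask.mean()               # fraction of missing values per column

print(per_row.tolist())   # [0, 1, 2, 1]
print(total)              # 4
```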

# Handling Missing values

There is no single optimal way to handle missing values. Depending on the characteristics of the dataset and the task, we can choose to:
- Drop the missing values (`dropna`, `drop`)
- Fill the missing values
- Replace the missing values

**`Drop the missing values` simply means deleting the records in which missing values occur. The `dropna()` function does this.**

In [21]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [22]:
df.dropna()   # drops every row that contains at least one NaN

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
8,KKK,90.0,90.0


#### Parameters of `dropna`

- `axis`: 0 (the default) drops rows; 1 drops columns.
- `how`: `'any'` (the default) drops a row or column that has at least one missing value; `'all'` drops it only if every value is missing.
- `thresh`: an integer giving the minimum number of non-missing values a row (or a column, when `axis=1`) must contain to be kept.
- `subset`: the labels to check for missing values; when dropping rows, this is the list of columns to look at.
- `inplace`: `True` modifies the DataFrame itself and returns `None`; `False` (the default) returns a new DataFrame and leaves the original unchanged.
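
A minimal sketch of how these parameters behave on rows, using a small hypothetical frame `demo` rather than the CSV loaded above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    "Name": ["AAA", "BBB", None, None],
    "English": [50.0, np.nan, 40.0, np.nan],
    "Science": [40.0, 45.0, np.nan, np.nan],
})

rows_complete = demo.dropna()                # defaults: axis=0, how="any"
rows_thresh = demo.dropna(thresh=2)          # keep rows with at least 2 non-missing values
cols_complete = demo.dropna(axis=1)          # keep only columns without any NaN
rows_named = demo.dropna(subset=["Name"])    # only look at the Name column

print(rows_complete.index.tolist())   # [0]
print(rows_thresh.index.tolist())     # [0, 1]
print(cols_complete.shape)            # (4, 0)
print(rows_named.index.tolist())      # [0, 1]
```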

In [45]:
df.dropna()

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
8,KKK,90.0,90.0


In [46]:
#axis

In [24]:
df.dropna(axis=0)  # axis=0 is the default: drop rows that contain NaN

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
8,KKK,90.0,90.0


In [25]:
df.dropna(axis=1)   # axis=1: drop columns that contain NaN (every column does here, so only the index remains)

0
1
2
3
4
5
6
7
8


In [47]:
#thresh

In [42]:
df.dropna(axis=1,thresh=7)   # keeps only the columns with at least 7 non-missing values

Unnamed: 0,Name
0,AAA
1,BBB
2,CCC
3,DDD
4,EEE
5,
6,GGG
7,
8,KKK


In [44]:
df.dropna(axis=1,thresh=6).count()    # non-missing counts of the columns that have at least 6 non-missing values

Name        7
Science     6
dtype: int64

**how**

In [54]:
df.dropna(how="any")

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
8,KKK,90.0,90.0


In [55]:
df.dropna(how="any").count()

Name        2
English     2
Science     2
dtype: int64
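
The cells above only exercise `how="any"`. For contrast, here is a small sketch of `how="all"` on a hypothetical frame `demo`, where only the last row is entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    "Name": ["AAA", "BBB", None],
    "English": [50.0, np.nan, np.nan],
})

any_dropped = demo.dropna(how="any")   # drops rows 1 and 2 (each has at least one NaN)
all_dropped = demo.dropna(how="all")   # drops only row 2 (every value is missing)

print(any_dropped.index.tolist())   # [0]
print(all_dropped.index.tolist())   # [0, 1]
```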

**subset**

In [71]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [70]:
df.dropna(subset=["Name","English"])

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
6,GGG,45.0,
8,KKK,90.0,90.0


**inplace**

In [75]:
df.dropna()  # returns a new DataFrame; df itself is unchanged

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
8,KKK,90.0,90.0


In [76]:
df   

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [79]:
df.dropna(inplace=True) # modifies df itself, so the change is permanent

In [78]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
8,KKK,90.0,90.0
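
A common alternative to `inplace=True` is to assign the result to a name. The sketch below uses hypothetical frames `demo` and `demo2` and also shows that the in-place call returns `None`.

```python
import pandas as pd

# Hypothetical frames for illustration only
demo = pd.DataFrame({"Name": ["AAA", None], "English": [50.0, 40.0]})
result = demo.dropna(inplace=True)   # demo is modified; the call returns None

demo2 = pd.DataFrame({"Name": ["AAA", None], "English": [50.0, 40.0]})
cleaned = demo2.dropna()             # demo2 is untouched; cleaned holds the result

print(result)                        # None
print(len(demo), len(demo2), len(cleaned))   # 1 2 1
```

Assigning the result keeps the original data available, which can help when comparing several cleaning strategies.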


**drop**

- `drop` removes a specified row or column from the DataFrame.
- Use `drop` when a column has too many missing values to be useful.

In [5]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [7]:
df.drop("English",axis=1)        #drops English column from the dataframe

Unnamed: 0,Name,Science
0,AAA,40.0
1,BBB,45.0
2,CCC,58.0
3,DDD,90.0
4,EEE,55.0
5,,
6,GGG,
7,,
8,KKK,90.0


In [8]:
df.drop(1)      #drops the specified row from the dataframe

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [12]:
df.columns

Index(['Name', 'English', 'Science '], dtype='object')

**Dropping multiple columns**

In [14]:
df.drop(['English','Science '],axis=1)       # drops the listed columns (note the trailing space in 'Science ')

Unnamed: 0,Name
0,AAA
1,BBB
2,CCC
3,DDD
4,EEE
5,
6,GGG
7,
8,KKK


In [16]:
df.drop(columns=['English','Science '] )    # the columns parameter does the same thing

Unnamed: 0,Name
0,AAA
1,BBB
2,CCC
3,DDD
4,EEE
5,
6,GGG
7,
8,KKK
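
One way to decide which columns have too many missing values is to compare the missing fraction of each column against a cut-off. This is only a sketch: the frame `demo` and the cut-off `max_missing = 0.5` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    "Name": ["AAA", "BBB", "CCC", "DDD"],
    "English": [50.0, np.nan, np.nan, np.nan],
    "Science": [40.0, 45.0, np.nan, 90.0],
})

max_missing = 0.5                          # hypothetical cut-off, adjust to the task
missing_share = demo.isna().mean()         # fraction of NaN per column
too_sparse = missing_share[missing_share > max_missing].index
trimmed = demo.drop(columns=too_sparse)    # drop the columns above the cut-off

print(list(too_sparse))        # ['English']
print(list(trimmed.columns))   # ['Name', 'Science']
```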


### Filling the missing values

**The `fillna()` function can be applied to a DataFrame to replace (fill) missing values with a user-specified value.**

Ways to fill NaN values:
- User-specified value
- Forward fill (`ffill`) or backward fill (`bfill`)
- Numerical column
  - Mean
  - Median
- Categorical column
  - Mode
- Interpolation

**User-specified value: to fill missing values with 0 (0 is an arbitrary choice; any other value works the same way):**
- df.fillna(0)

In [4]:
df=pd.read_csv(r"data\std_multinan.csv")

In [83]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [85]:
df.fillna(0)    #All the NaN values are filled with 0 

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,0.0,45.0
2,CCC,0.0,58.0
3,DDD,0.0,90.0
4,EEE,0.0,55.0
5,0,40.0,0.0
6,GGG,45.0,0.0
7,0,58.0,0.0
8,KKK,90.0,90.0


In [86]:
df.fillna(0,axis=1)

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,0.0,45.0
2,CCC,0.0,58.0
3,DDD,0.0,90.0
4,EEE,0.0,55.0
5,0.0,40.0,0.0
6,GGG,45.0,0.0
7,0.0,58.0,0.0
8,KKK,90.0,90.0
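
Filling every column with the same value puts a numeric 0 into the `Name` column as well. `fillna()` also accepts a dictionary that maps column names to fill values. The sketch below assumes a hypothetical frame `demo` and a made-up placeholder `"Unknown"`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    "Name": ["AAA", None, "GGG"],
    "English": [50.0, 40.0, np.nan],
    "Science": [40.0, np.nan, np.nan],
})

# One fill value per column; columns not listed (Science) are left as they are
filled = demo.fillna({"Name": "Unknown", "English": 0})

print(filled["Name"].tolist())            # ['AAA', 'Unknown', 'GGG']
print(filled["English"].tolist())         # [50.0, 40.0, 0.0]
print(int(filled["Science"].isna().sum()))   # 2
```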


**`bfill` is short for backward fill. It fills each missing value with the next valid value below it in the Series/DataFrame.**

In [87]:
df.fillna(method="bfill")

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,40.0,45.0
2,CCC,40.0,58.0
3,DDD,40.0,90.0
4,EEE,40.0,55.0
5,GGG,40.0,90.0
6,GGG,45.0,90.0
7,KKK,58.0,90.0
8,KKK,90.0,90.0


**`ffill` is short for forward fill. It fills each missing value with the last valid value above it in the Series/DataFrame.**

In [88]:
df.fillna(method="ffill")

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,50.0,45.0
2,CCC,50.0,58.0
3,DDD,50.0,90.0
4,EEE,50.0,55.0
5,EEE,40.0,55.0
6,GGG,45.0,55.0
7,GGG,58.0,55.0
8,KKK,90.0,90.0
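
Pandas also provides dedicated `ffill()` and `bfill()` methods, and since pandas 2.1 the `method=` argument of `fillna()` is deprecated in their favour. The sketch below uses a hypothetical Series `s`; the `limit` argument caps how many consecutive NaNs are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration only
s = pd.Series([50.0, np.nan, np.nan, np.nan, 40.0])

forward = s.ffill()           # carry the last valid value downwards
backward = s.bfill()          # pull the next valid value upwards
limited = s.ffill(limit=1)    # fill at most one consecutive NaN

print(forward.tolist())    # [50.0, 50.0, 50.0, 50.0, 40.0]
print(backward.tolist())   # [50.0, 40.0, 40.0, 40.0, 40.0]
print(int(limited.isna().sum()))   # 2
```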


**Fill missing values with the mean: to fill the missing values in a column with that column's mean, you can type the following:**

In [None]:
#mean

In [89]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [91]:
df.dtypes

Name         object
English     float64
Science     float64
dtype: object

In [92]:
df["English"].fillna(df["English"].mean())

0    50.0
1    56.6
2    56.6
3    56.6
4    56.6
5    40.0
6    45.0
7    58.0
8    90.0
Name: English, dtype: float64

In [96]:
df.columns

Index(['Name', 'English', 'Science '], dtype='object')

In [98]:
df["Science "].fillna(df["Science "].mean())

0    40.0
1    45.0
2    58.0
3    90.0
4    55.0
5    63.0
6    63.0
7    63.0
8    90.0
Name: Science , dtype: float64

**Fill missing values with the median: to fill the missing values in a column with that column's median, you can type the following:**

In [None]:
#median

In [99]:
df["Science "].fillna(df["Science "].median())

0    40.0
1    45.0
2    58.0
3    90.0
4    55.0
5    56.5
6    56.5
7    56.5
8    90.0
Name: Science , dtype: float64

In [100]:
df["English"].fillna(df["English"].median())

0    50.0
1    50.0
2    50.0
3    50.0
4    50.0
5    40.0
6    45.0
7    58.0
8    90.0
Name: English, dtype: float64
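
The cells above fill one column at a time. As a sketch of filling every numeric column in one call, `fillna()` can be given the Series of column means (or medians). The frame `demo` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    "Name": ["AAA", "BBB", None],
    "English": [50.0, np.nan, 40.0],
    "Science": [40.0, 50.0, np.nan],
})

column_means = demo.mean(numeric_only=True)   # one mean per numeric column
filled = demo.fillna(column_means)            # each NaN gets its own column's mean

print(filled["English"].tolist())   # [50.0, 45.0, 40.0]
print(filled["Science"].tolist())   # [40.0, 50.0, 45.0]
```

The text column `Name` is not in `column_means`, so its missing value is left untouched. Swapping `mean` for `median` gives the median variant.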

**Fill missing values with the mode: to fill the missing values in a categorical column with that column's mode (its most frequent value), you can type the following:**


In [101]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [112]:
df1=pd.read_csv(r"C:\Users\HP\Desktop\Dataset_Pandas\std_dropna.csv")

In [113]:
df1

Unnamed: 0,StdID,StdName,Class,Location,Total
0,111.0,A,10.0,,180.0
1,,B,11.0,Viz,
2,113.0,,12.0,Blr,150.0
3,,,,,
4,115.0,E,10.0,Viz,
5,,F,11.0,Blr,140.0
6,117.0,G,,Hyd,188.0
7,,,,,
8,119.0,I,9.0,Blr,


In [114]:
df1.dtypes

StdID       float64
StdName      object
Class       float64
Location     object
Total       float64
dtype: object

In [115]:
df1.columns

Index(['StdID', 'StdName', 'Class', 'Location', 'Total'], dtype='object')

In [None]:
#mode

In [121]:
df1['Location'].mode()   # mode() returns a Series

0    Blr
Name: Location, dtype: object

In [124]:
df1['Location'].mode()[0] # the first element is the mode itself (a string here)

'Blr'

In [123]:
df1['Location'].fillna(df1['Location'].mode()[0])

0    Blr
1    Viz
2    Blr
3    Blr
4    Viz
5    Blr
6    Hyd
7    Blr
8    Blr
Name: Location, dtype: object
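
Because `mode()` returns a Series, it can hold several values when there is a tie and can be empty when every value is missing. The sketch below guards against the empty case. The Series `location` and the fallback `"Unknown"` are hypothetical.

```python
import pandas as pd

# Hypothetical Series for illustration only
location = pd.Series([None, "Viz", "Blr", None, "Blr"], name="Location")

modes = location.mode()   # most frequent value(s), NaN ignored
# "Unknown" is a made-up fallback for the case where no mode exists
fill_value = modes.iloc[0] if not modes.empty else "Unknown"
filled = location.fillna(fill_value)

print(fill_value)        # Blr
print(filled.tolist())   # ['Blr', 'Viz', 'Blr', 'Blr', 'Blr']
```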

**inplace=True: none of the `fillna` calls we have seen so far changes the DataFrame permanently. To make the change permanent, use the `inplace` parameter.**

In [127]:
df1['Location'].fillna(df1['Location'].mode()[0]) #not a permanent change

0    Blr
1    Viz
2    Blr
3    Blr
4    Viz
5    Blr
6    Hyd
7    Blr
8    Blr
Name: Location, dtype: object

In [126]:
df1

Unnamed: 0,StdID,StdName,Class,Location,Total
0,111.0,A,10.0,,180.0
1,,B,11.0,Viz,
2,113.0,,12.0,Blr,150.0
3,,,,,
4,115.0,E,10.0,Viz,
5,,F,11.0,Blr,140.0
6,117.0,G,,Hyd,188.0
7,,,,,
8,119.0,I,9.0,Blr,


In [128]:
df1['Location'].fillna(df1['Location'].mode()[0],inplace=True) 

In [129]:
df1

Unnamed: 0,StdID,StdName,Class,Location,Total
0,111.0,A,10.0,Blr,180.0
1,,B,11.0,Viz,
2,113.0,,12.0,Blr,150.0
3,,,,Blr,
4,115.0,E,10.0,Viz,
5,,F,11.0,Blr,140.0
6,117.0,G,,Hyd,188.0
7,,,,Blr,
8,119.0,I,9.0,Blr,


In [130]:
df["English"].fillna(df["English"].mean()) # not a permanent change

0    50.0
1    56.6
2    56.6
3    56.6
4    56.6
5    40.0
6    45.0
7    58.0
8    90.0
Name: English, dtype: float64

In [131]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,,45.0
2,CCC,,58.0
3,DDD,,90.0
4,EEE,,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0


In [132]:
df["English"].fillna(df["English"].mean(),inplace=True)

In [133]:
df

Unnamed: 0,Name,English,Science
0,AAA,50.0,40.0
1,BBB,56.6,45.0
2,CCC,56.6,58.0
3,DDD,56.6,90.0
4,EEE,56.6,55.0
5,,40.0,
6,GGG,45.0,
7,,58.0,
8,KKK,90.0,90.0
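
Recent pandas releases warn when `fillna(..., inplace=True)` is called on a column selected with `df["col"]`, and with Copy-on-Write enabled the original frame may not be updated at all. A pattern that does not depend on this behaviour is to assign the filled column back. The frame `demo` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({"English": [50.0, np.nan, 40.0]})

# Assign the filled column back instead of using inplace=True on the column
demo["English"] = demo["English"].fillna(demo["English"].mean())

print(demo["English"].tolist())   # [50.0, 45.0, 40.0]
```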


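**One more way to handle missing values in a DataFrame is the `interpolate()` function. By default it uses linear interpolation, replacing each NaN with a value between the previous and next valid values. Trailing NaNs, which have no next value, are filled with the last valid value, as the final rows of the second example show.**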

In [5]:
dftemp = pd.read_csv(r"data/hyd_outliers.csv")

In [137]:
dftemp

Unnamed: 0,Date,Temp
0,21-01-2022,30.0
1,22-01-2022,28.0
2,23-01-2022,
3,24-01-2022,29.0
4,25-01-2022,4450.0
5,26-01-2022,32.0
6,27-01-2022,34.0
7,28-01-2022,-50.0
8,29-01-2022,36.0
9,30-01-2022,38.0


In [139]:
dftemp.interpolate()

Unnamed: 0,Date,Temp
0,21-01-2022,30.0
1,22-01-2022,28.0
2,23-01-2022,28.5
3,24-01-2022,29.0
4,25-01-2022,4450.0
5,26-01-2022,32.0
6,27-01-2022,34.0
7,28-01-2022,-50.0
8,29-01-2022,36.0
9,30-01-2022,38.0


In [6]:
dfage = pd.read_csv(r"data\std_interp.csv")

In [143]:
dfage

Unnamed: 0,Stdname,StdClass,Age
0,AAA,10,15.0
1,BBB,11,
2,CCC,12,18.0
3,DDD,10,
4,EEE,11,16.0
5,FFF,12,18.0
6,GGG,11,16.0
7,HHH,10,15.0
8,JJJ,12,
9,KKK,10,


In [144]:
dfage.interpolate()

Unnamed: 0,Stdname,StdClass,Age
0,AAA,10,15.0
1,BBB,11,16.5
2,CCC,12,18.0
3,DDD,10,17.0
4,EEE,11,16.0
5,FFF,12,18.0
6,GGG,11,16.0
7,HHH,10,15.0
8,JJJ,12,15.0
9,KKK,10,15.0
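
The edge behaviour is easier to see on a single numeric Series. The sketch below uses hypothetical Series `age` and `late_start`; it shows that a leading NaN stays missing by default and is filled only with `limit_direction="both"`. Recent pandas versions may also warn or raise when `interpolate()` is called on a frame that contains text columns, so selecting the numeric column first is a cautious choice.

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration only
age = pd.Series([15.0, np.nan, 18.0, np.nan, 16.0, np.nan])
linear = age.interpolate()     # linear by default; the trailing NaN repeats the last valid value

late_start = pd.Series([np.nan, 15.0, np.nan, 18.0])
forward_only = late_start.interpolate()                     # leading NaN stays missing
both_ways = late_start.interpolate(limit_direction="both")  # leading NaN is filled too

print(linear.tolist())      # [15.0, 16.5, 18.0, 17.0, 16.0, 16.0]
print(both_ways.tolist())   # [15.0, 15.0, 16.5, 18.0]
```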
