# Handling missing values

Data cleaning is one of those things that everyone does but no one really talks about. Sure, it’s not the "sexiest" part of machine learning. And no, there aren’t hidden tricks and secrets to uncover.



However, proper data cleaning can make or break your project. Professional data scientists usually spend a very large portion of their time on this step.

Why? Because of a simple truth in machine learning:

<b color="Red">Better data beats fancier algorithms.</b>

## Approach 1

Here are several useful functions for detecting, replacing, and removing null values in a pandas DataFrame:

a. isnull()

b. notnull()

c. dropna()

d. fillna()

e. replace()

<h2>Reading the data</h2>

In [58]:
import pandas as pd
import numpy as np

titanic_train = pd.read_csv(
    "https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv",
    sep='\t',
)

In [61]:
# Take rows 20 through 29 as a small working sample
data = titanic_train[20:30]

In [62]:
df = data[['Name', 'Sex', 'Age', 'Ticket', 'Fare']]
df

Unnamed: 0,Name,Sex,Age,Ticket,Fare
20,"Fynney, Mr. Joseph J",male,35.0,239865,26.0
21,"Beesley, Mr. Lawrence",male,34.0,248698,13.0
22,"McGowan, Miss. Anna ""Annie""",female,15.0,330923,8.0292
23,"Sloper, Mr. William Thompson",male,28.0,113788,35.5
24,"Palsson, Miss. Torborg Danira",female,8.0,349909,21.075
25,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,347077,31.3875
26,"Emir, Mr. Farred Chehab",male,,2631,7.225
27,"Fortune, Mr. Charles Alexander",male,19.0,19950,263.0
28,"O'Dwyer, Miss. Ellen ""Nellie""",female,,330959,7.8792
29,"Todoroff, Mr. Lalio",male,,349216,7.8958


In [73]:
df.isnull()

Unnamed: 0,Name,Sex,Age,Ticket,Fare
20,False,False,False,False,False
21,False,False,False,False,False
22,False,False,False,False,False
23,False,False,False,False,False
24,False,False,False,False,False
25,False,False,False,False,False
26,False,False,True,False,False
27,False,False,False,False,False
28,False,False,True,False,False
29,False,False,True,False,False
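
A boolean table like the one above gets hard to read on a bigger DataFrame, so it often helps to summarise it. The sketch below is only an illustration: `sample` is a hypothetical hand-typed stand-in for the Titanic slice (shortened names, same pattern of missing ages), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the slice above: three of the five ages are missing.
sample = pd.DataFrame({
    "Name": ["Fynney", "Emir", "Fortune", "O'Dwyer", "Todoroff"],
    "Age": [35.0, np.nan, 19.0, np.nan, np.nan],
    "Fare": [26.0, 7.225, 263.0, 7.8792, 7.8958],
})

# True counts as 1, so summing the boolean mask counts the NaN values per column.
missing_per_column = sample.isnull().sum()

# Taking the mean of the mask gives the fraction of missing values per column.
missing_share = sample.isnull().mean()

print(missing_per_column)
print(missing_share)
```

Here only `Age` has a non-zero count, which matches the `True` entries in the table above.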


<h2> a. isnull() </h2>

In [63]:
# Return the rows that contain at least one NaN value
firstdf = df[df.isnull().any(axis=1)]
firstdf

Unnamed: 0,Name,Sex,Age,Ticket,Fare
26,"Emir, Mr. Farred Chehab",male,,2631,7.225
28,"O'Dwyer, Miss. Ellen ""Nellie""",female,,330959,7.8792
29,"Todoroff, Mr. Lalio",male,,349216,7.8958


In [64]:
# Return the rows where one particular column (here 'Age') is NaN
seconddf = pd.isnull(df['Age'])
df[seconddf]

Unnamed: 0,Name,Sex,Age,Ticket,Fare
26,"Emir, Mr. Farred Chehab",male,,2631,7.225
28,"O'Dwyer, Miss. Ellen ""Nellie""",female,,330959,7.8792
29,"Todoroff, Mr. Lalio",male,,349216,7.8958


<h2> b. notnull() </h2>

In [65]:
# Return the rows where 'Age' is not NaN
seconddf = pd.notnull(df['Age'])
df[seconddf]

Unnamed: 0,Name,Sex,Age,Ticket,Fare
20,"Fynney, Mr. Joseph J",male,35.0,239865,26.0
21,"Beesley, Mr. Lawrence",male,34.0,248698,13.0
22,"McGowan, Miss. Anna ""Annie""",female,15.0,330923,8.0292
23,"Sloper, Mr. William Thompson",male,28.0,113788,35.5
24,"Palsson, Miss. Torborg Danira",female,8.0,349909,21.075
25,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,347077,31.3875
27,"Fortune, Mr. Charles Alexander",male,19.0,19950,263.0
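
`notnull()` is simply the inverse of `isnull()`, which makes it handy when you want to work only with the values you actually have. A minimal sketch, using a small hypothetical `ages` Series rather than the Titanic column:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; two of the four are missing.
ages = pd.Series([35.0, np.nan, 19.0, np.nan], name="Age")

# notnull() marks the present values; it is the element-wise negation of isnull().
mask = ages.notnull()
same_as_negated_isnull = mask.equals(~ages.isnull())

# Keep only the known ages and compute a statistic on them.
known_ages = ages[mask]
print(same_as_negated_isnull)
print(known_ages.mean())
```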


<h2> c. dropna() </h2>

In [67]:
# Remove every row that contains at least one NaN value
thirddf = df.dropna()
thirddf


Unnamed: 0,Name,Sex,Age,Ticket,Fare
20,"Fynney, Mr. Joseph J",male,35.0,239865,26.0
21,"Beesley, Mr. Lawrence",male,34.0,248698,13.0
22,"McGowan, Miss. Anna ""Annie""",female,15.0,330923,8.0292
23,"Sloper, Mr. William Thompson",male,28.0,113788,35.5
24,"Palsson, Miss. Torborg Danira",female,8.0,349909,21.075
25,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,347077,31.3875
27,"Fortune, Mr. Charles Alexander",male,19.0,19950,263.0


### Alternative approaches

In [44]:
dict = {'First Score':[100, np.nan, np.nan, 95], 
        'Second Score': [30, np.nan, 45, 56], 
        'Third Score':[52, np.nan, 80, 98], 
        'Fourth Score':[60, 67, 68, 65]}
df1 = pd.DataFrame(dict) 
df1

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,60
1,,,,67
2,,45.0,80.0,68
3,95.0,56.0,98.0,65


In [45]:
# axis=1 drops every column that contains at least one NaN value
df1.dropna(axis=1)

Unnamed: 0,Fourth Score
0,60
1,67
2,68
3,65


In [46]:
# axis=0 (the default) drops every row that contains at least one NaN value
df1.dropna(axis=0)

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,60
3,95.0,56.0,98.0,65


In [72]:
# thresh=3 keeps only the rows that have at least 3 non-NaN values
print("\nKeeping rows with at least 3 non-NaN values using the 'thresh' parameter\n",'-'*68, sep='')
print(df1.dropna(axis=0, thresh=3))


Keeping rows with at least 3 non-NaN values using the 'thresh' parameter
--------------------------------------------------------------------
   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0            60
2          NaN          45.0         80.0            68
3         95.0          56.0         98.0            65
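
Besides `axis` and `thresh`, `dropna()` also takes `how` and `subset`, which are useful when you only care about certain columns. The sketch below rebuilds the same score table under the hypothetical name `scores` so it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the score table above, rebuilt here so the example is self-contained.
scores = pd.DataFrame({
    "First Score": [100, np.nan, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, np.nan, 80, 98],
    "Fourth Score": [60, 67, 68, 65],
})

# how="all" drops a row only if every value in it is NaN.
# No row qualifies here because 'Fourth Score' is always present.
dropped_if_all_missing = scores.dropna(how="all")

# subset limits the check to the listed columns:
# keep only the rows where 'First Score' is present.
first_score_present = scores.dropna(subset=["First Score"])

print(dropped_if_all_missing)
print(first_score_present)
```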


<h2> d. fillna() </h2>

In [75]:
# Replace every NaN value with a placeholder string
fourthdf = df.fillna('missing value')
fourthdf

Unnamed: 0,Name,Sex,Age,Ticket,Fare
20,"Fynney, Mr. Joseph J",male,35,239865,26.0
21,"Beesley, Mr. Lawrence",male,34,248698,13.0
22,"McGowan, Miss. Anna ""Annie""",female,15,330923,8.0292
23,"Sloper, Mr. William Thompson",male,28,113788,35.5
24,"Palsson, Miss. Torborg Danira",female,8,349909,21.075
25,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38,347077,31.3875
26,"Emir, Mr. Farred Chehab",male,missing value,2631,7.225
27,"Fortune, Mr. Charles Alexander",male,19,19950,263.0
28,"O'Dwyer, Miss. Ellen ""Nellie""",female,missing value,330959,7.8792
29,"Todoroff, Mr. Lalio",male,missing value,349216,7.8958


In [80]:
dict = {'First Score':[ 'Mango', np.nan, np.nan, 'Apple'], 
        'Second Score': ['Tom', np.nan, 'Jerry', 'Noody'], 
        'Third Score':[52, np.nan, 80, 98], 
        'Fourth Score':[60, 67, 68, 65]}
df2 = pd.DataFrame(dict) 
df2

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,Mango,Tom,52.0,60
1,,,,67
2,,Jerry,80.0,68
3,Apple,Noody,98.0,65


In [81]:
# Replace every NaN value with an empty string
sixthdf = df2.fillna('')
sixthdf

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,Mango,Tom,52.0,60
1,,,,67
2,,Jerry,80.0,68
3,Apple,Noody,98.0,65


In [83]:
# Fill every NaN, in all columns (including the text ones), with the mean of 'Fourth Score'
print(df2.fillna(value=df2['Fourth Score'].mean()))

  First Score Second Score  Third Score  Fourth Score
0       Mango          Tom         52.0            60
1          65           65         65.0            67
2          65        Jerry         80.0            68
3       Apple        Noody         98.0            65
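
Filling text columns with a number, as in the output above, is rarely what you want in practice. `fillna()` also accepts a dict that maps column names to fill values, so each column can get its own replacement. A minimal sketch, assuming a placeholder string `"unknown"` for the text columns (the placeholder and the name `mixed` are illustrative choices, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Same values as df2 above, rebuilt here so the example is self-contained.
mixed = pd.DataFrame({
    "First Score": ["Mango", np.nan, np.nan, "Apple"],
    "Second Score": ["Tom", np.nan, "Jerry", "Noody"],
    "Third Score": [52, np.nan, 80, 98],
    "Fourth Score": [60, 67, 68, 65],
})

# One fill value per column: a placeholder for text, the column's own mean for numbers.
fill_values = {
    "First Score": "unknown",
    "Second Score": "unknown",
    "Third Score": mixed["Third Score"].mean(),
}
filled = mixed.fillna(value=fill_values)

print(filled)
```

Columns that are not listed in the dict (here `Fourth Score`) are left untouched.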


<h2> e. replace() </h2>

In [40]:
# Replace every NaN value in the DataFrame with -10
fivedf = df.replace(to_replace=np.nan, value=-10)
fivedf

Unnamed: 0,Name,Sex,Age,Ticket,Fare
20,"Fynney, Mr. Joseph J",male,35.0,239865,26.0
21,"Beesley, Mr. Lawrence",male,34.0,248698,13.0
22,"McGowan, Miss. Anna ""Annie""",female,15.0,330923,8.0292
23,"Sloper, Mr. William Thompson",male,28.0,113788,35.5
24,"Palsson, Miss. Torborg Danira",female,8.0,349909,21.075
25,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,347077,31.3875
26,"Emir, Mr. Farred Chehab",male,-10.0,2631,7.225
27,"Fortune, Mr. Charles Alexander",male,19.0,19950,263.0
28,"O'Dwyer, Miss. Ellen ""Nellie""",female,-10.0,330959,7.8792
29,"Todoroff, Mr. Lalio",male,-10.0,349216,7.8958
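
`replace()` is just as useful in the other direction: real datasets often encode missing values with a sentinel such as `-999` or `"?"`, and turning those into NaN lets `isnull()`, `dropna()`, and `fillna()` see them. The sketch below uses a hypothetical `raw` frame and made-up sentinels purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where -999.0 and "?" stand in for missing values.
raw = pd.DataFrame({
    "Age": [35.0, -999.0, 19.0],
    "Ticket": ["239865", "?", "19950"],
})

# A nested dict maps each column to its own {old value: new value} pairs.
cleaned = raw.replace({
    "Age": {-999.0: np.nan},
    "Ticket": {"?": np.nan},
})

# The sentinels are now visible to the missing-value functions.
print(cleaned.isnull().sum())
```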
