<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Data Cleaning: Basics

## Step 1: Finding and counting missing data





Now that we know how missing values arise in a dataset, we will look at how to work with such a dataset. The first step is to find the missing values and count them.



In [1]:
# We first import the dataset that we will use for the examples. We do this by first
# importing the pandas library
import pandas as pd


# then storing the url location of our dataset to the variable url
url = 'http://bit.ly/TitanicDataset1'

# We will read the dataset from above url and store the dataframe in the variable df
df = pd.read_csv(url)

In [2]:
# Challenge 1
# In parallel, we will also work with a second dataset, which we will use for practice.

# Let's first store our url location just like we did above
gov_dataset = 'http://bit.ly/GovProjectFinanceDataset1'

# Then read the dataset from url and store it in our variable of choice
# 
dfgov = pd.read_csv(gov_dataset)

# And familiarize ourselves with the dataframe by viewing its first 5 rows
# 
dfgov.head()

Unnamed: 0,Implementing_Agency,Total_-_GOK_Budget_Est_KES,Total_-_Loan_Budget_Est_KES,Total_-_Grant_Budget_Est_KES,Total_Budget_Supported__by_Donors_KES,Total_2013/2014_Budget_KES,OBJECTID
0,,0.0,0,3524395000,3524395000,3524395000,0
1,MOEW&NR National Environment Management Auth...,0.0,0,300000000,300000000,300000000,1
2,"Ministry of Agriculture, Livestock and Fisheries",2306364000.0,23354779585,8732081258,32086860843,34393225231,2
3,The Presidency,,0,0,0,0,3
4,MOAL&F Kenya Plant Health Inspectorate Serv...,0.0,0,66916163,66916163,66916163,4


In [3]:
# We count the number of non-missing values in each column of the df dataframe
#
df.count()

pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64

In [4]:
# Challenge 2: Government Project Dataset
# We also count the number of non-missing values in our government project dataset
# 
dfgov.count()

Implementing_Agency                      39
Total_-_GOK_Budget_Est_KES               39
Total_-_Loan_Budget_Est_KES              40
Total_-_Grant_Budget_Est_KES             40
Total_Budget_Supported__by_Donors_KES    40
Total_2013/2014_Budget_KES               40
OBJECTID                                 40
dtype: int64

In [5]:
# A longer method is to subtract the number of non-missing values in each column from the
# total number of rows in the dataframe, which gives the number of missing values per column
# 
num_rows = df.shape[0]
num_missing = num_rows - df.count() 
num_missing

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64

In [7]:
# Challenge 3: Government Project Dataset
# We now also subtract the number of non-missing values in each column from the total number
# of rows to determine the number of missing values in our government project dataset
#
num_rowsgov = dfgov.shape[0]
num_missinggov = num_rowsgov - dfgov.count() 
num_missinggov

Implementing_Agency                      1
Total_-_GOK_Budget_Est_KES               1
Total_-_Loan_Budget_Est_KES              0
Total_-_Grant_Budget_Est_KES             0
Total_Budget_Supported__by_Donors_KES    0
Total_2013/2014_Budget_KES               0
OBJECTID                                 0
dtype: int64

In [11]:
# Another method is to count the missing values in the whole dataframe by applying
# the count_nonzero function from numpy to the result of the isnull() method.

# Before we do that, we need to import numpy
# 
import numpy as np

# Then count those missing values in our dataframe
#
np.count_nonzero(df.isnull())

3869

In [12]:
# Challenge 4: Government Project Dataset
# Let's repeat what we did in the previous cell and find the number of missing values
# in our dataset using the count_nonzero function from numpy
#
np.count_nonzero(dfgov.isnull())

2

In [13]:
# We can also count the missing values in a particular column by selecting
# that column and again using the count_nonzero function from numpy
# 
np.count_nonzero(df['body'].isnull())

1189

In [14]:
# Challenge 5: Government Project Dataset
# And again find how many missing values we have in the column: Total_-_GOK_Budget_Est_KES
#
np.count_nonzero(dfgov['Total_-_GOK_Budget_Est_KES'].isnull())

1
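
The per-column and whole-dataframe counts above can also be obtained by chaining `isnull()` with `sum()`. The sketch below uses a small made-up dataframe (the `toy` name and its values are assumptions for illustration, not part of either course dataset) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "cabin": ["B5", None, None, None],
    "fare": [211.34, 151.55, 151.55, 7.88],
})

# isnull() marks every missing cell as True; sum() counts the True values per column
missing_per_column = toy.isnull().sum()

# Adding up the per-column counts gives the total for the whole dataframe
total_missing = int(missing_per_column.sum())

# The total matches the numpy approach used in the cells above
assert total_missing == np.count_nonzero(toy.isnull())

print(missing_per_column)
print(total_missing)
```

Both approaches rely on the same idea: `True` is treated as 1 (nonzero) and `False` as 0, so counting the `True` cells counts the missing values.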

## Step 2: Clean Missing Data

Now that we have been able to identify our missing values in our datasets, we can deal with them in the following ways.

### a) We can recode/replace these values
Here, we can do this by using the fillna method and recoding/replacing the missing values with another value or other values. In this example, we will recode the missing values to 0.

In [15]:
# We can recode missing values with 0 by doing the following.
# Do note that this will appear as 0.0 in numeric (float) columns and as 0 in text columns.
# 
df_recode = df.fillna(0)

# then preview the first 5 rows for the recoded dataframe
#
df_recode.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2,0.0,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11,0.0,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,0,0.0,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,0,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,0,0.0,"Montreal, PQ / Chesterville, ON"


In [17]:
# After recoding, let's check for non-missing values in the recoded dataframe
#
df_recode.count()

pclass       1310
survived     1310
name         1310
sex          1310
age          1310
sibsp        1310
parch        1310
ticket       1310
fare         1310
cabin        1310
embarked     1310
boat         1310
body         1310
home.dest    1310
dtype: int64

In [18]:
# Challenge 6: Government Project Dataset
# Nice! Let's now replace the missing values in our dataset with 0 again 
# just as in our previous example
# 
df_recodegov = dfgov.fillna(0)
df_recodegov.count()

Implementing_Agency                      40
Total_-_GOK_Budget_Est_KES               40
Total_-_Loan_Budget_Est_KES              40
Total_-_Grant_Budget_Est_KES             40
Total_Budget_Supported__by_Donors_KES    40
Total_2013/2014_Budget_KES               40
OBJECTID                                 40
dtype: int64
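
Recoding every column with the same value is not always appropriate, for example when a text column would end up holding the number 0. As a minimal sketch, `fillna` also accepts a dictionary that maps column names to replacement values. The `toy` dataframe and the `"Unknown"` label below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "cabin": ["B5", None, "C22"],
})

# Map each column to the value that should replace its missing entries
fill_values = {"age": 0, "cabin": "Unknown"}

# fillna returns a new dataframe; toy itself is left unchanged
toy_recode = toy.fillna(value=fill_values)

print(toy_recode)
```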

**Challenge 7:** Together with your peer now discuss the cases where you might use the above recoding technique.

### b) We can fill forward missing values

When data is filled forward, each missing value is replaced with the last known/recorded value that comes before it. Let's see how that works.
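
Before turning to the full dataset, here is a minimal sketch of forward filling on a small made-up series (the values are assumptions for illustration). It uses the `ffill()` method, which carries the last known value forward.

```python
import numpy as np
import pandas as pd

# Hypothetical values, made up for illustration only
body = pd.Series([np.nan, 135.0, np.nan, np.nan, 22.0])

# Each missing value takes the last known value that precedes it
body_filled = body.ffill()

print(body_filled.tolist())
# [nan, 135.0, 135.0, 135.0, 22.0]
```

The first entry stays missing because there is no earlier value to carry forward.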

In [19]:
# First let's preview our df, and note the missing data in the boat and body columns
#
df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665,14.4542,,C,,,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,,C,,,
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,315082,7.8750,,S,,,


In [20]:
# And now let's fill forward
# NB: df.ffill() is equivalent to df.fillna(method='ffill'), which recent pandas versions deprecate
df_fill_forward = df.ffill()

# Then preview our dataframe and try to understand it
#
df_fill_forward

# Did you notice what happens when the column begins with a missing value? 
# Discuss this with your peer

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,11,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,11,135.0,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3.0,0.0,"Zabour, Miss. Thamine",female,14.5000,1.0,0.0,2665,14.4542,F38,C,C,328.0,"Antwerp, Belgium / Stanton, OH"
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,2656,7.2250,F38,C,C,304.0,"Antwerp, Belgium / Stanton, OH"
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,2670,7.2250,F38,C,C,304.0,"Antwerp, Belgium / Stanton, OH"
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,315082,7.8750,F38,S,C,304.0,"Antwerp, Belgium / Stanton, OH"


In [21]:
# Challenge 8: Government Project Dataset
# Onto our other dataset, let's now fill forward the missing values and see what happens
# NB: ffill() is equivalent to fillna(method='ffill'), which recent pandas versions deprecate
df_fill_forwardgov = dfgov.ffill()
df_fill_forwardgov

Unnamed: 0,Implementing_Agency,Total_-_GOK_Budget_Est_KES,Total_-_Loan_Budget_Est_KES,Total_-_Grant_Budget_Est_KES,Total_Budget_Supported__by_Donors_KES,Total_2013/2014_Budget_KES,OBJECTID
0,,0.0,0,3524395000,3524395000,3524395000,0
1,MOEW&NR National Environment Management Auth...,0.0,0,300000000,300000000,300000000,1
2,"Ministry of Agriculture, Livestock and Fisheries",2306364000.0,23354779585,8732081258,32086860843,34393225231,2
3,The Presidency,2306364000.0,0,0,0,0,3
4,MOAL&F Kenya Plant Health Inspectorate Serv...,0.0,0,66916163,66916163,66916163,4
5,Ministry of Foreign Affairs,0.0,0,46000000,46000000,46000000,5
6,MRDA Ewaso Ngiro North Dev. Authority (ENNDA),6560292000.0,0,0,0,6560292000,6
7,MOE&P Kenya Electricity Transmission Company...,0.0,1304000000,0,1304000000,1304000000,7
8,Office of The Attorney General and Department...,0.0,0,1351905890,1351905890,1351905890,8
9,State Department of Infrastructure,12270000.0,0,0,0,12270000,9


**Challenge 9:** Together with your peer discuss the cases where you might use the above fill forward technique.

### c) We can fill backward missing values

We can also do the opposite, and fill our missing values backward. When doing this, each missing value is replaced with the next known value that comes after it.
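
As a minimal sketch on made-up values (an assumption for illustration, not data from the course datasets), the `bfill()` method pulls the next known value backward:

```python
import numpy as np
import pandas as pd

# Hypothetical values, made up for illustration only
body = pd.Series([np.nan, 312.0, np.nan, np.nan, 328.0, np.nan])

# Each missing value takes the next known value that follows it
body_filled = body.bfill()

print(body_filled.tolist())
# [312.0, 312.0, 328.0, 328.0, 328.0, nan]
```

The last entry stays missing because no later value exists to fill it from.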

In [23]:
# Again, let's first preview df (last 10 items), and note the missing data in the boat and body columns
#
df.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1300,3.0,1.0,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1.0,0.0,2659.0,14.4542,,C,,,
1301,3.0,0.0,"Youseff, Mr. Gerious",male,45.5,0.0,0.0,2628.0,7.225,,C,,312.0,
1302,3.0,0.0,"Yousif, Mr. Wazli",male,,0.0,0.0,2647.0,7.225,,C,,,
1303,3.0,0.0,"Yousseff, Mr. Gerious",male,,0.0,0.0,2627.0,14.4583,,C,,,
1304,3.0,0.0,"Zabour, Miss. Hileni",female,14.5,1.0,0.0,2665.0,14.4542,,C,,328.0,
1305,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,2665.0,14.4542,,C,,,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5,0.0,0.0,2656.0,7.225,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0,0.0,0.0,2670.0,7.225,,C,,,
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0,0.0,0.0,315082.0,7.875,,S,,,
1309,,,,,,,,,,,,,,


In [24]:
# And now fill backward
# NB: df.bfill() is equivalent to df.fillna(method='bfill'), which recent pandas versions deprecate
df_fill_backward = df.bfill()

# Then preview our dataframe (last 10 items) and try to understand what took place
#
df_fill_backward.tail(10)

# After this, we will discuss what happens when a column ends with a missing value.

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1300,3.0,1.0,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1.0,0.0,2659.0,14.4542,,C,,312.0,
1301,3.0,0.0,"Youseff, Mr. Gerious",male,45.5,0.0,0.0,2628.0,7.225,,C,,312.0,
1302,3.0,0.0,"Yousif, Mr. Wazli",male,14.5,0.0,0.0,2647.0,7.225,,C,,328.0,
1303,3.0,0.0,"Yousseff, Mr. Gerious",male,14.5,0.0,0.0,2627.0,14.4583,,C,,328.0,
1304,3.0,0.0,"Zabour, Miss. Hileni",female,14.5,1.0,0.0,2665.0,14.4542,,C,,328.0,
1305,3.0,0.0,"Zabour, Miss. Thamine",female,26.5,1.0,0.0,2665.0,14.4542,,C,,304.0,
1306,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5,0.0,0.0,2656.0,7.225,,C,,304.0,
1307,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0,0.0,0.0,2670.0,7.225,,C,,,
1308,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0,0.0,0.0,315082.0,7.875,,S,,,
1309,,,,,,,,,,,,,,


In [25]:
# Challenge 10: Government Project Dataset
# Back to our government dataset. We now fill backward the missing values 
# and try to understand the changes that took place
# 
df_fill_backwardgov = dfgov.bfill()
df_fill_backwardgov

Unnamed: 0,Implementing_Agency,Total_-_GOK_Budget_Est_KES,Total_-_Loan_Budget_Est_KES,Total_-_Grant_Budget_Est_KES,Total_Budget_Supported__by_Donors_KES,Total_2013/2014_Budget_KES,OBJECTID
0,MOEW&NR National Environment Management Auth...,0.0,0,3524395000,3524395000,3524395000,0
1,MOEW&NR National Environment Management Auth...,0.0,0,300000000,300000000,300000000,1
2,"Ministry of Agriculture, Livestock and Fisheries",2306364000.0,23354779585,8732081258,32086860843,34393225231,2
3,The Presidency,0.0,0,0,0,0,3
4,MOAL&F Kenya Plant Health Inspectorate Serv...,0.0,0,66916163,66916163,66916163,4
5,Ministry of Foreign Affairs,0.0,0,46000000,46000000,46000000,5
6,MRDA Ewaso Ngiro North Dev. Authority (ENNDA),6560292000.0,0,0,0,6560292000,6
7,MOE&P Kenya Electricity Transmission Company...,0.0,1304000000,0,1304000000,1304000000,7
8,Office of The Attorney General and Department...,0.0,0,1351905890,1351905890,1351905890,8
9,State Department of Infrastructure,12270000.0,0,0,0,12270000,9


**Challenge 11:** Together with your peer discuss the cases where you might use the above fill backward technique.


### d) We can interpolate missing values

Interpolation estimates missing values from the known values around them. It does this by treating the missing values as if they were equally spaced between the neighbouring known values. By default, the interpolate function fills in the missing values linearly, as shown in the example below.
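
To make the arithmetic concrete, the sketch below interpolates a short made-up series with known values 135.0 and 22.0 at the two ends and five gaps in between. The series is an assumption for illustration. Each gap moves one equal step from the first known value toward the second.

```python
import numpy as np
import pandas as pd

# Hypothetical series: two known values separated by five missing ones
body = pd.Series([135.0, np.nan, np.nan, np.nan, np.nan, np.nan, 22.0])

# Six equal steps separate the two known values: (22 - 135) / 6
step = (body.iloc[-1] - body.iloc[0]) / (len(body) - 1)

# interpolate() uses the linear method by default
body_interpolated = body.interpolate()

print(round(step, 6))
print(body_interpolated.round(6).tolist())
# [135.0, 116.166667, 97.333333, 78.5, 59.666667, 40.833333, 22.0]
```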

In [33]:
# Once again, let's first print out df, and note the missing data in the boat and body columns
# NB: iloc gets rows (or columns) at particular positions in the index
#
df.iloc[0:10, 11:13]

Unnamed: 0,boat,body
0,2,
1,11,
2,,
3,,135.0
4,,
5,3,
6,10,
7,,
8,D,
9,,22.0


In [34]:
# Then perform our interpolation
#
df_interpolate = df.interpolate().iloc[0:10, 11:13]
df_interpolate.head(10)

Unnamed: 0,boat,body
0,2,
1,11,
2,,
3,,135.0
4,,116.166667
5,3,97.333333
6,10,78.5
7,,59.666667
8,D,40.833333
9,,22.0


In [40]:
# Challenge 12: Government Project Dataset
# Interpolate the missing values in our dataset, noting the changes
df_interpolategov = dfgov.interpolate().iloc[0:10, 0:2]
df_interpolategov.head(5)

Unnamed: 0,Implementing_Agency,Total_-_GOK_Budget_Est_KES
0,,0.0
1,MOEW&NR National Environment Management Auth...,0.0
2,"Ministry of Agriculture, Livestock and Fisheries",2306364000.0
3,The Presidency,1153182000.0
4,MOAL&F Kenya Plant Health Inspectorate Serv...,0.0


**Challenge 13:** Together with your peer discuss the cases where you might use the above interpolate technique.

### e) We can drop/delete missing values

The other way to work with missing data is to drop/delete the records that contain it. This is a judgement that you have to make based on your research. Sometimes keeping every record, including those with missing values, can leave you with a useless dataset. On the other hand, the missing data may not be random, so dropping those records could leave you with a biased dataset, or with insufficient data for analysis. We will learn about all of this in more depth during the course of the program.

For now, let's see how we can drop missing values from a dataset.

In [47]:
# Let's find out the size of our dataset
# 
df.shape

(1310, 14)

In [48]:
# If we keep only the complete rows, meaning we drop any record with a missing value, then
#
df_dropped = df.dropna()

# We are left with no rows of data
#
df_dropped.shape

(0, 14)

In [49]:
# Printing out df_dropped
df_dropped

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
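
Dropping every row that has any missing value removed the whole dataset here, so a less aggressive rule may be more useful. As a sketch on a small made-up dataframe (the `toy` data is an assumption for illustration), `dropna` can be restricted with its `how` and `subset` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the body column is entirely missing, the last row is entirely empty
toy = pd.DataFrame({
    "name": ["Allen", "Allison", "Zabour", None],
    "age": [29.0, 2.0, np.nan, np.nan],
    "body": [np.nan, np.nan, np.nan, np.nan],
})

# Default behaviour: drop rows with at least one missing value (nothing survives here)
all_complete = toy.dropna()

# how="all" drops only the rows in which every value is missing
not_empty = toy.dropna(how="all")

# subset limits the check to the listed columns
has_age = toy.dropna(subset=["age"])

print(all_complete.shape, not_empty.shape, has_age.shape)
# (0, 3) (3, 3) (2, 3)
```

Which rule fits depends on which columns the analysis actually needs.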


In [54]:
# Challenge 14: Financial Allocation Dataset
# Let's now drop the records that have missing values
url = 'http://bit.ly/MSFinancialDataset'
#
dff = pd.read_csv(url)
dff_dropped = dff.dropna()
dff_dropped

Unnamed: 0,NG_Ministry,GOK_-_Draft_Estimated_KES,GOK_-_Supplementary_Estimated_I__KES,GOK_-_Supplementary_Estimated_II__KES,GOK_-_Off-Budget_Estimated_KES,Total_-_GOK_Budget_Est_KES,Loan_-_Draft_Estimated_KES,Loan_-_Supplementary_Estimated_I_KES,Loan_-_Supplementary_Estimated_II_KES,Loan_-_Off-Budget_Estimated_KES,Total_-_Loan_Budget_Est_KES,Grant_-_Draft_Estimated_KES,Grant_-_Supplementary_Estimated_I_KES,Grant_-_Supplementary_Estimated_II_KES,Grant_-_Off-Budget_Estimated_KES,Total_-_Grant_Budget_Est_KES,Total_2013/2014_Budget_KES,OBJECTID
1,State Department of Transport,320000000.0,400000000.0,6650000000.0,0.0,7370000000,14621100000.0,14780000000.0,13674700000.0,0.0,43075801108,800000000.0,885000000.0,1007000000.0,0.0,2692000000,53137801108,1
2,State Department of planning,6827202000.0,1608256000.0,0.0,0.0,8435458686,1417654000.0,1811446000.0,1360906000.0,0.0,4590006224,10096310000.0,10062560000.0,9208231000.0,0.0,29367102979,42392567889,2
4,The National Treasury,13000000.0,54000000.0,0.0,0.0,67000000,2168500000.0,1729750000.0,1509570000.0,0.0,5407819701,528475000.0,609340200.0,842175000.0,0.0,1979990195,7454809896,4
6,The Presidency,130000000.0,0.0,0.0,0.0,0,582712300.0,929693600.0,929693600.0,0.0,2442099540,120000000.0,0.0,0.0,0.0,120000000,2562099540,6
10,State Department of Infrastructure,18920800000.0,4618801000.0,0.0,0.0,23539602883,40968580000.0,37647000000.0,29118400000.0,0.0,107734000000,6754200000.0,10857100000.0,5820000000.0,0.0,23431300000,154705000000,10
13,State Department for Livestock,23401260.0,23401260.0,18480000.0,0.0,65282524,265535500.0,398551400.0,367701400.0,0.0,1031788340,24743460.0,6683460.0,6683460.0,0.0,38110380,1135181244,13
14,State Department for Agriculture,594543000.0,1218903000.0,369534000.0,0.0,2182980303,6162332000.0,7859592000.0,6965260000.0,0.0,20987183617,2329761000.0,2825607000.0,3915138000.0,0.0,9070505156,32240669076,14
15,Ministry of Health,1565363000.0,1541067000.0,0.0,0.0,3106429528,2266500000.0,5497925000.0,6909925000.0,0.0,14674350000,20595410000.0,23655330000.0,12414790000.0,3524395000.0,60189929679,77970709207,15
16,Ministry of Land Housing and Urban Development,585992400.0,1424325000.0,0.0,0.0,2010317134,11369260000.0,11221780000.0,9948744000.0,0.0,32539782975,1573795000.0,911852200.0,405750500.0,0.0,2891397492,37381497601,16
18,State Department for Devolution,337017000.0,50266000.0,0.0,0.0,387283000,2231534000.0,3459671000.0,3011121000.0,0.0,8702325421,3302602000.0,3246825000.0,2319062000.0,1689286000.0,10557774111,19647382532,18


**Challenge 15:** Together with your peer discuss the cases where you might use the above drop/delete technique.