# Module 6 - Working with Missing Data

### Prelude

Import pandas as pd and numpy as np

In [1]:
import pandas as pd
import numpy as np

Read the file enronRenamed.csv 

In [2]:
df=pd.read_csv(r'C:\Users\mwbarr\OneDrive - Federal Bureau of Investigation\Python_for_Data_Analysis\Python for Data Analysis\enronRenamed.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Subject,From,To,cc,bcc,Folder,Origin,FileName,content,user,labeled,dollarValueMentionedinMessage,messageLength,Date
0,0,"RTO Orders - Grid South, SE Trans, SPP and Ent...","Fulton, Donna </O=ENRON/OU=NA/CN=RECIPIENTS/CN...","Kean, Steven </O=ENRON/OU=NA/CN=RECIPIENTS/CN=...",,,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,The Southeast RTO orders are out and have foll...,whalley-g,False,,2570,2001-07-13 19:47:20
1,1,More UC/CSU Info,"Delainey, David </O=ENRON/OU=NA/CN=RECIPIENTS/...","Greg Whalley/HOU/ECT@ENRON, Lavorato, John </O...",,,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,---------------------- Forwarded by David W De...,whalley-g,False,,3761,2001-07-12 11:36:58
2,2,California Update 07.09.01,"Dasovich, Jeff </O=ENRON/OU=NA/CN=RECIPIENTS/C...","skean@enron.com, Shapiro, Richard </O=ENRON/OU...",,,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,The Bond Legislation The Democrats in the Asse...,whalley-g,False,,1348,2001-07-10 00:47:51
3,3,Davis & Company -- incompetence personified,"Shapiro, Richard </O=ENRON/OU=NA/CN=RECIPIENTS...","Lavorato, John </O=ENRON/OU=NA/CN=RECIPIENTS/C...",,,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,FYI ---------------------- Forwarded by Richar...,whalley-g,False,37.0,2592,2001-07-06 20:45:05
4,4,Link to DWR contract info,"Shapiro, Richard </O=ENRON/OU=NA/CN=RECIPIENTS...","Lavorato, John </O=ENRON/OU=NA/CN=RECIPIENTS/C...",,,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,FYI ---------------------- Forwarded by Richar...,whalley-g,False,43.0,1774,2001-07-06 20:44:33


### Part 1: Identifying and Removing Missing Values

**1a** Get the number of missing values in the cc column

In [3]:
df['cc'].isnull().sum()

12149

**1b** Get the proportion of null values in the cc column

In [4]:
df['cc'].isnull().mean()

0.6981782656169185
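Both answers come from the same Boolean mask: `isnull()` marks each missing entry as `True`, so `sum()` counts the missing values and `mean()` gives their share. A minimal sketch on made-up data (the Series `toy` and its values are illustrative, not taken from the Enron file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from enronRenamed.csv
toy = pd.Series(['a@x.com', np.nan, None, 'b@x.com'], name='cc')

mask = toy.isnull()          # True where the value is missing (NaN or None)
n_missing = mask.sum()       # True counts as 1, so this is the number of missing values
share_missing = mask.mean()  # the mean of 0/1 values is the proportion missing

print(n_missing, share_missing)
```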

**1c** Filter the DataFrame to include only rows that have a value in the cc column

In [5]:
onlyCC = df[df['cc'].notnull()]
onlyCC.head()

Unnamed: 0.1,Unnamed: 0,Subject,From,To,cc,bcc,Folder,Origin,FileName,content,user,labeled,dollarValueMentionedinMessage,messageLength,Date
6,6,CPUC Decision Suspending Direct Access,James D Steffes <James D Steffes/NA/Enron@ENRON>,David W Delainey <David W Delainey/HOU/EES@EES...,Harry Kingerski <Harry Kingerski/NA/Enron@Enro...,,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,The following points highlight legal/regulator...,whalley-g,False,,2042,2001-06-21 21:29:00
7,7,California 7/31,"Johnston, Robert </O=ENRON/OU=NA/CN=RECIPIENTS...","Whalley, Greg </O=ENRON/OU=NA/CN=RECIPIENTS/CN...","Hickerson, Gary </O=ENRON/OU=NA/CN=RECIPIENTS/...",,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,The train wreck known as California utility cr...,whalley-g,False,,5417,2001-07-31 22:31:57
8,8,FW: California Update 7/26/01 p.3,"Johnston, Robert </O=ENRON/OU=NA/CN=RECIPIENTS...","Whalley, Greg </O=ENRON/OU=NA/CN=RECIPIENTS/CN...","Hickerson, Gary </O=ENRON/OU=NA/CN=RECIPIENTS/...",,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,Senator Burton spoke with the Speaker Wednesda...,whalley-g,False,2.0,4064,2001-07-27 11:32:47
9,9,FW: California Update 7/26/01,"Johnston, Robert </O=ENRON/OU=NA/CN=RECIPIENTS...","Whalley, Greg </O=ENRON/OU=NA/CN=RECIPIENTS/CN...","Hickerson, Gary </O=ENRON/OU=NA/CN=RECIPIENTS/...",,\GWHALLE (Non-Privileged)\Markets,WHALLEY-G,GWHALLE (Non-Privileged).pst,This is a long one... Let me know if you have ...,whalley-g,False,500.0,10947,2001-07-27 11:22:43
14,14,Amber and StorageApps Acquired,"Garland, Kevin </O=ENRON/OU=NA/CN=RECIPIENTS/C...","Rice, Ken </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Not...","Sheldon, Steven </O=ENRON/OU=NA/CN=RECIPIENTS/...",,\GWHALLE (Non-Privileged)\Merchant Investments,WHALLEY-G,GWHALLE (Non-Privileged).pst,"In a absolutely dead tech market, two of the E...",whalley-g,False,350.0,813,2001-07-25 22:13:40
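The same filter can probably also be written with `dropna` and its `subset` argument, which drops the rows where the listed columns are missing. A small sketch on a made-up frame (the name `toy` and its contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from enronRenamed.csv
toy = pd.DataFrame({
    'Subject': ['s1', 's2', 's3'],
    'cc': [np.nan, 'a@x.com', np.nan],
})

by_mask = toy[toy['cc'].notnull()]     # keep rows where cc has a value
by_dropna = toy.dropna(subset=['cc'])  # drop rows where cc is missing

# Both approaches keep the same rows and the original index labels
print(by_mask.equals(by_dropna))
```

Either way the surviving rows keep their original index labels, which is why the output above starts at 6 rather than 0.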


### Part 2: Filling missing values

**2a** Fill all missing values in the dollarValueMentionedinMessage column with 0

In [7]:
filledDollars=df['dollarValueMentionedinMessage'].fillna(0)
filledDollars.head()

0     0.0
1     0.0
2     0.0
3    37.0
4    43.0
Name: dollarValueMentionedinMessage, dtype: float64
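Note that `fillna` returns a new Series and leaves the original column untouched. To keep the filled values in the DataFrame, assign the result back. A minimal sketch with an invented column (`amount` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from enronRenamed.csv
toy = pd.DataFrame({'amount': [np.nan, 37.0, np.nan]})

filled = toy['amount'].fillna(0)              # new Series with NaN replaced by 0
still_missing = toy['amount'].isnull().sum()  # the original column is unchanged

toy['amount'] = filled                        # assign back to keep the result
print(still_missing, toy['amount'].tolist())
```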

**2b** Fill all missing values in the dollarValueMentionedinMessage column with the next non-missing value that follows each one (backward fill)

In [12]:
# fillna(method='bfill') is deprecated in recent pandas; bfill() gives the same result
filledDollars = df['dollarValueMentionedinMessage'].bfill()
filledDollars.head()

0    37.0
1    37.0
2    37.0
3    37.0
4    43.0
Name: dollarValueMentionedinMessage, dtype: float64
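Backward fill copies the next valid value upward, so missing values at the very end of a column have nothing to copy from and stay missing. Forward fill (`ffill`) has the mirror-image gap at the start. A short sketch on an invented Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to show the edge cases
s = pd.Series([np.nan, 37.0, np.nan, 43.0, np.nan])

back = s.bfill()  # each NaN takes the next valid value; the last NaN remains
fwd = s.ffill()   # each NaN takes the previous valid value; the first NaN remains

print(back.tolist())
print(fwd.tolist())
```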

**2c** Fill all missing values in the dollarValueMentionedinMessage column with the value in messageLength from that row

In [13]:
filledDollars=df['dollarValueMentionedinMessage'].fillna(df['messageLength'])
filledDollars.head()

0    2570.0
1    3761.0
2    1348.0
3      37.0
4      43.0
Name: dollarValueMentionedinMessage, dtype: float64
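When `fillna` receives a Series, it matches replacement values by index label rather than by position. That works here because both columns come from the same DataFrame and share an index. A sketch with invented Series (all names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the two columns used above
amount = pd.Series([np.nan, 37.0, np.nan], index=[0, 1, 2])
length = pd.Series([2570, 3761, 1348], index=[0, 1, 2])

same_index = amount.fillna(length)  # row 0 gets 2570, row 2 gets 1348

# A filler with the same labels in a different order is still matched by label
shuffled = pd.Series([10, 20, 30], index=[2, 1, 0])
by_label = amount.fillna(shuffled)  # row 0 gets 30, row 2 gets 10

print(same_index.tolist(), by_label.tolist())
```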