### Null Handling

How to detect and clean up null values.

In [8]:
import pandas as pd
import numpy as np

**Filepath and Dataframe**

In [27]:
## Store the path to the source data file in a variable
filepath = '/Users/MarkHinojosa/pandas/Data/Employee Attrition.csv'

## Load the CSV at that path into a DataFrame
employeeDF = pd.read_csv(filepath)

**Info() Function**

In [28]:
## The info() method gives a high-level count of non-null values for each column.
employeeDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15787 entries, 0 to 15786
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Emp ID                 14999 non-null  float64
 1   satisfaction_level     14999 non-null  float64
 2   last_evaluation        14999 non-null  float64
 3   number_project         14999 non-null  float64
 4   average_montly_hours   14999 non-null  float64
 5   time_spend_company     14999 non-null  float64
 6   Work_accident          14999 non-null  float64
 7   promotion_last_5years  14999 non-null  float64
 8   dept                   14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(8), object(2)
memory usage: 1.2+ MB
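
Because `info()` reports non-null counts, the number of missing values has to be worked out by subtraction (here 15787 entries minus 14999 non-null values leaves 788 missing values per column). As a minimal sketch, the nulls can also be counted directly by chaining `isnull()` and `sum()`. The `toy` DataFrame below is hypothetical stand-in data, not the employee file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the employee file
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11, 0.72],
    "dept": ["sales", None, "sales", "hr"],
})

# isnull() marks each missing cell as True; sum() then counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts)
```

The same call on `employeeDF` would give one missing-value count per column, which is often easier to scan than the `info()` summary.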


**Creating NaN Values**

In [30]:
## We can assign NumPy's np.nan constant to force NaN values into specific rows.
## The row labels must be passed to .loc as a list.
employeeDF.loc[[1, 10, 100, 1000, 1010, 2000]] = np.nan

In [32]:
employeeDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15787 entries, 0 to 15786
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Emp ID                 14993 non-null  float64
 1   satisfaction_level     14993 non-null  float64
 2   last_evaluation        14993 non-null  float64
 3   number_project         14993 non-null  float64
 4   average_montly_hours   14993 non-null  float64
 5   time_spend_company     14993 non-null  float64
 6   Work_accident          14993 non-null  float64
 7   promotion_last_5years  14993 non-null  float64
 8   dept                   14993 non-null  object 
 9   salary                 14993 non-null  object 
dtypes: float64(8), object(2)
memory usage: 1.2+ MB
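
The non-null counts drop from 14999 to 14993 because six rows were blanked out. The sketch below reproduces the same idea on a small scale. The `toy` DataFrame and its values are made up for illustration; only the `.loc` pattern with a list of row labels matters.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame with one numeric and one text column
toy = pd.DataFrame({
    "last_evaluation": [0.53, 0.86, 0.88, 0.87, 0.52],
    "salary": ["low", "medium", "medium", "low", "low"],
})

# A list inside .loc selects several row labels at once;
# assigning np.nan blanks out every column in those rows
toy.loc[[1, 3]] = np.nan

# Each column should now report two missing values
print(toy.isnull().sum())
```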


In [39]:
## Using the isnull() method, we can see that row index 1 (the second row) now contains null values.
employeeDF[0:5].isnull()

Unnamed: 0,Emp ID,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,dept,salary
0,False,False,False,False,False,False,False,False,False,False
1,True,True,True,True,True,True,True,True,True,True
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
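
Scanning a grid of True/False values only works for a handful of rows. One possible way to go straight to the affected rows is to reduce the `isnull()` result to one boolean per row and use it as a filter. This is a sketch on hypothetical data; `toy`, `has_null`, and `null_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical data where only row index 1 is missing values
toy = pd.DataFrame({
    "number_project": [2.0, np.nan, 7.0],
    "dept": ["sales", np.nan, "sales"],
})

# any(axis=1) collapses each row to True if at least one cell in it is null
has_null = toy.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged True
null_rows = toy[has_null]
print(null_rows.index.tolist())
```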


**Handling NaN Values**

In [43]:
## By default, the .dropna() method drops every row that has a null in any field.
droppedDF = employeeDF.dropna()
droppedDF[0:5].isnull() ## Notice that row index 1 no longer exists

Unnamed: 0,Emp ID,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,dept,salary
0,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False
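
Dropping a row for a single missing field can discard more data than intended. `dropna()` also accepts `how` and `subset` parameters that narrow the rule. The sketch below compares them on hypothetical data; the column choices are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is fully empty, rows 2 and 3 are partially empty
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11, np.nan],
    "dept": ["sales", np.nan, np.nan, "hr"],
})

# Default behaviour: drop a row if any field is null
drop_any = toy.dropna()

# how="all": drop a row only when every field is null
drop_all = toy.dropna(how="all")

# subset: only consider the listed columns when looking for nulls
drop_subset = toy.dropna(subset=["dept"])

# The original index labels survive a drop; reset_index(drop=True) renumbers from 0
renumbered = drop_all.reset_index(drop=True)

print(drop_any.index.tolist(), drop_all.index.tolist(), drop_subset.index.tolist())
```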


In [47]:
## The .fillna() function will populate NaN with values that the user specifies.
fillDF = employeeDF.fillna('Test')
fillDF[0:5].isnull() ## Does not return row index 1 as being null

Unnamed: 0,Emp ID,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,dept,salary
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False


In [48]:
## Notice that the string 'Test' has been populated for every field in row index 1.
## Always be cognizant of the type of data you are filling in, so that numeric columns do not end up holding strings.
fillDF.head()

Unnamed: 0,Emp ID,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,dept,salary
0,1.0,0.38,0.53,2.0,157.0,3.0,0.0,0.0,sales,low
1,Test,Test,Test,Test,Test,Test,Test,Test,Test,Test
2,3.0,0.11,0.88,7.0,272.0,4.0,0.0,0.0,sales,medium
3,4.0,0.72,0.87,5.0,223.0,5.0,0.0,0.0,sales,low
4,5.0,0.37,0.52,2.0,159.0,3.0,0.0,0.0,sales,low
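
One way to avoid mixing strings into numeric columns is to pass `fillna()` a dictionary that maps each column to its own fill value. The sketch below assumes, purely for illustration, that the column mean is an acceptable fill for a numeric field and that `'unknown'` is an acceptable label for a text field; neither choice comes from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one text column
toy = pd.DataFrame({
    "average_montly_hours": [157.0, np.nan, 272.0],
    "salary": ["low", np.nan, "medium"],
})

# A single string fill puts text into the numeric column as well,
# which typically changes that column away from a float dtype
mixed = toy.fillna("Test")
print(mixed.dtypes)

# A dict gives each column a fill value that matches its own type
fill_values = {
    "average_montly_hours": toy["average_montly_hours"].mean(),  # mean() skips NaN by default
    "salary": "unknown",
}
typed = toy.fillna(fill_values)
print(typed.dtypes)
```

With the per-column fill, the numeric column stays `float64`, so later arithmetic and aggregations on it still behave as expected.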
