# How to Deal with Missing Data in Python

## The Employee Dataset

We will be working with a very small Employee Dataset for this tutorial. Download the dataset in CSV format from my GitHub repo and store it in your current working directory: [employees.csv](https://github.com/ChaitanyaBaweja/Programming-Tutorials/tree/master/Missing-Data-Pandas)

Let’s import this dataset into Python and take a look at it. 


In [24]:
# Importing libraries
import pandas as pd
import numpy as np

# Read csv file into a pandas dataframe
df = pd.read_csv("employees.csv")

# Prints out the first few rows
print(df.head())


  First Name  Gender Start Date Last Login Time  Salary Bonus %  \
0    Douglas    Male   08-06-93        12:42 PM   97308   6.945   
1     Thomas    Male  3/31/1996         6:53 AM   61933     NaN   
2      Maria  Female  4/23/1993             NaN  130590  11.858   
3      Jerry    Male   03-04-05         1:00 PM     NaN    9.34   
4      Larry    Male  1/24/1998         4:47 PM  101004   1.389   

  Senior Management             Team  
0              TRUE        Marketing  
1              TRUE              NaN  
2             FALSE          Finance  
3              TRUE          Finance  
4              TRUE  Client Services  


There are 1000 rows with 8 columns. You can get some basic statistics using the `.dtypes` attribute and the `.describe()` method.

In [25]:
print(df.dtypes)
print(df.describe())

First Name           object
Gender               object
Start Date           object
Last Login Time      object
Salary               object
Bonus %              object
Senior Management    object
Team                 object
dtype: object
       First Name  Gender Start Date Last Login Time Salary Bonus %  \
count         931     852        996             997    998     997   
unique        201       3        968             719    993     968   
top       Marilyn  Female   03-01-04         1:35 PM      ?  12.182   
freq           11     428          2               5      2       3   

       Senior Management             Team  
count                932              957  
unique                 4               13  
top                 TRUE  Client Services  
freq                 467              105  


Notice that the dtype of every column is object. This shouldn't be the case for Salary, Bonus % and Senior Management. It happens because we have **corrupt values in these columns** (for example, the most frequent Salary value in the output above is `?`). Once we have marked these values as missing, we will convert the columns to the required types using the `.astype()` method.
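
As a rough illustration of why one stray symbol keeps a whole column as object, the sketch below builds a tiny made-up Series (the values are hypothetical, not read from employees.csv) and coerces it to numbers with `pd.to_numeric(..., errors="coerce")`, which turns anything it cannot parse into NaN. This is an alternative shown for illustration only; the rest of this tutorial marks corrupt values while reading the file instead.

```python
import pandas as pd

# Hypothetical salary values; the "?" stands in for a corrupt entry.
# dtype=object mirrors what read_csv produced for the Salary column above.
salary = pd.Series(["97308", "61933", "?", "101004"], dtype=object)
print(salary.dtype)  # object

# errors="coerce" replaces values that cannot be parsed with NaN
salary_numeric = pd.to_numeric(salary, errors="coerce")
print(salary_numeric.dtype)           # float64
print(salary_numeric.isnull().sum())  # 1
```

The result is float64 rather than int64 because NaN is itself a floating-point value.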

## How to mark invalid/ corrupt values as missing

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. Other markers, such as na and ?, are not recognized by Pandas by default. Let’s focus on the Salary column.

In [26]:
print('Salary')
print(df['Salary'].head(10))


Salary
0     97308
1     61933
2    130590
3       NaN
4    101004
5    115163
6     65476
7     45906
8       NaN
9    139852
Name: Salary, dtype: object


At index 8 there’s an empty cell and at index 3 there is an NA; Pandas automatically fills both with NaN. But what happens with other symbols, like `?` or `n.a.`? Let's look at the Gender column.

In [28]:
print('Gender')
print(df['Gender'].head(10))

Gender
0      Male
1      Male
2    Female
3      Male
4      Male
5      n.a.
6    Female
7    Female
8       NaN
9    Female
Name: Gender, dtype: object


We notice that n.a. isn't converted to NaN and remains in its original form.
We can pass these formats to `pd.read_csv()` through its `na_values` parameter so that Pandas recognizes them as missing values. Take a look:

In [29]:
# a list with all missing value formats
missing_value_formats = ["n.a.", "?", "NA", "n/a", "na", "--"]
df = pd.read_csv("employees.csv", na_values=missing_value_formats)

# print Gender again
print(df['Gender'].head(10))

0      Male
1      Male
2    Female
3      Male
4      Male
5       NaN
6    Female
7    Female
8       NaN
9    Female
Name: Gender, dtype: object
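
To see the effect without the real file, here is a minimal sketch that parses a small made-up CSV string (the names and values are hypothetical) through `io.StringIO`. It also shows that `na_values` can be a dict, so that each marker applies only to the named column; that is a variation on the list used above, not something the original dataset requires.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "Name,Gender,Salary\nAnn,Female,50000\nBob,n.a.,?\nCy,Male,NA\n"

# Default parsing: "NA" becomes NaN, but "n.a." and "?" stay as strings
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.isnull().sum())

# A dict applies each list of markers only to the named column;
# the default markers (such as "NA") are still recognized as well
custom_df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Gender": ["n.a."], "Salary": ["?"]},
)
print(custom_df.isnull().sum())
print(custom_df["Salary"].dtype)
```

Once the `?` is treated as missing, the Salary column in `custom_df` is parsed as numeric instead of object.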


In Pandas, we have two functions for detecting missing values:
- `isnull()` returns True for every NaN value in the dataset and False for everything else.
- `notnull()` does the opposite: it returns False for every NaN value and True for everything else.


In [30]:
print(df['Gender'].isnull().head(10)) # NaN values are marked True
print(df['Gender'].notnull().head(10)) # non-NaN values are marked True


0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8     True
9    False
Name: Gender, dtype: bool
0     True
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8    False
9     True
Name: Gender, dtype: bool
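
Because True counts as 1 when summed, these boolean masks are also a convenient way to count missing values. The sketch below uses a small made-up DataFrame (hypothetical values, not the employee data) to count missing entries per column and in total.

```python
import numpy as np
import pandas as pd

# Small made-up frame; np.nan and None both count as missing
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", None],
    "Team": ["Finance", "Marketing", np.nan, "Finance"],
})

# Summing the boolean mask gives the number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the frame
total_missing = int(toy.isnull().sum().sum())
print(total_missing)  # 3
```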


We can use the outputs of the `isnull()` and `notnull()` functions for filtering. Let’s print all the rows of the dataset for which Gender is not missing.

In [34]:
# returns True on indices for which Gender is not NaN
null_filter = df['Gender'].notnull()
print(df[null_filter].head())

  First Name  Gender Start Date Last Login Time    Salary  Bonus %  \
0    Douglas    Male   08-06-93        12:42 PM   97308.0    6.945   
1     Thomas    Male  3/31/1996         6:53 AM   61933.0      NaN   
2      Maria  Female  4/23/1993             NaN  130590.0   11.858   
3      Jerry    Male   03-04-05         1:00 PM       NaN    9.340   
4      Larry    Male  1/24/1998         4:47 PM  101004.0    1.389   

  Senior Management             Team  
0              True        Marketing  
1              True              NaN  
2             False          Finance  
3              True          Finance  
4              True  Client Services  
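
The same idea extends to other filters. As a hedged sketch on a small made-up DataFrame (hypothetical values), the code below selects the rows where Gender is missing, and then the rows where both Gender and Team are present by combining two masks with `&`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria", "Jerry"],
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Team": ["Marketing", "Finance", np.nan, "Finance"],
})

# Rows where Gender is missing
missing_gender = toy[toy["Gender"].isnull()]
print(missing_gender)

# Rows where both Gender and Team are present;
# & combines the two boolean masks element-wise
complete = toy[toy["Gender"].notnull() & toy["Team"].notnull()]
print(complete)
```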


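Once missing values are marked, the simplest way to handle them is to delete the rows that contain them using the `dropna()` method.

In [35]:
# drop all rows with NaN values
df.dropna(axis=0, inplace=True)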

# check if we have any NaN values in our dataset
print(df.isnull().values.any())


False
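
Note that the cell above modifies `df` itself because of `inplace=True`. The sketch below, on a tiny made-up DataFrame (hypothetical values), contrasts that with the default behaviour, where `dropna()` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value
toy = pd.DataFrame({"Salary": [97308.0, np.nan], "Team": ["Marketing", "Finance"]})

# Default: a new DataFrame is returned and toy keeps both rows
cleaned = toy.dropna()
print(len(toy), len(cleaned))  # 2 1

# inplace=True: toy itself is modified and the call returns None
result = toy.dropna(inplace=True)
print(result, len(toy))  # None 1
```

Keeping the original and assigning the result to a new name makes it easier to compare the data before and after cleaning.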


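The `axis` parameter of `dropna()` selects rows (`axis=0`) or columns (`axis=1`), and the `how` parameter controls whether they are dropped when any or all of their values are missing:

In [None]:
# drop all rows with at least one NaN
new_df = df.dropna(axis=0, how='any')

# drop all rows in which every value is NaN
new_df = df.dropna(axis=0, how='all')

# drop all columns with at least one NaN
new_df = df.dropna(axis=1, how='any')

# drop all columns in which every value is NaN
new_df = df.dropna(axis=1, how='all')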
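
To make the difference between these options concrete, here is a minimal sketch on a small made-up DataFrame (hypothetical values). It also uses two further `dropna()` parameters that do not appear above: `subset`, which restricts the check to the listed columns, and `thresh`, which keeps rows with at least that many non-missing values.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, rows 1 and 3 have one NaN, row 2 is all NaN
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, np.nan, "Female"],
    "Salary": [97308.0, 61933.0, np.nan, np.nan],
    "Team": ["Marketing", "Finance", np.nan, "Finance"],
})

any_rows = toy.dropna(axis=0, how="any")   # keeps only fully populated rows
all_rows = toy.dropna(axis=0, how="all")   # drops only the fully empty row
by_subset = toy.dropna(subset=["Gender"])  # checks the Gender column only
at_least_two = toy.dropna(thresh=2)        # keeps rows with >= 2 non-missing values

print(len(any_rows), len(all_rows), len(by_subset), len(at_least_two))  # 1 3 2 3
```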
