# Data Cleaning

* Anywhere from 50% to 80% of data science is data cleaning
    * of course I hear 70% of statistics are made up on the spot
* Dealing with dirty data is a fact of life when doing data-intensive research
* Especially if you are collecting or creating the data yourself
* Fortunately, Pandas is excellent at data cleaning and once you get the hang of it you might even enjoy it!


In [None]:
# load the necessary libraries
import pandas as pd
import numpy as np


## Missing Values 

* One of the challenges you may face when working with messy data is *missing* or **null** values
* There are multiple conventions for representing null values when doing data science in Python
* There is a Pythonic way using the `None` object
* There is a NumPy/Pandas-y way using `NaN`

### None - Pythonic Missing Data

* `None` is the standard way of representing nothing in plain Python
* It is useful, but it is a full Python object rather than a native numeric value
* It can sit alongside numbers in a list or array, but only as a generic Python object

In [None]:
# create a NumPy array of numbers and a null value represented by None
some_numbers = np.array([1,None,3,4])
some_numbers

* Because NumPy arrays (and Pandas Series/columns) must hold a single data type, an array containing `None` falls back to the generic `object` dtype, which is the most flexible but also the least efficient
    * Note: Pandas will automatically convert `None` to `NaN`, so we use `np.array` here
* This means any operation over the array/column/Series will run slower than it would with a numeric data type
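
* A minimal sketch of the difference described above; the names `as_array` and `as_series` are only for illustration

```python
import numpy as np
import pandas as pd

# NumPy keeps None as a Python object, so the whole array becomes dtype=object
as_array = np.array([1, None, 3, 4])
print(as_array.dtype)   # object

# Pandas converts None to NaN and stores the values as floats
as_series = pd.Series([1, None, 3, 4])
print(as_series.dtype)  # float64
```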

In [None]:
# create an array of objects and an array of integers
# compute their sums and time how long each takes
for dtype in ['object','int']:
    print("data type = ", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

* Notice the integer array was ***a lot*** faster than the object array
* Also, vectorized math operations fail on `None`: summing the array below raises a `TypeError`

In [None]:
some_numbers.sum()
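
* If you want to see the error without stopping the notebook, one option is to catch it; this is only a sketch, and the `failed` flag is a hypothetical name

```python
import numpy as np

some_numbers = np.array([1, None, 3, 4])

try:
    some_numbers.sum()
except TypeError as err:
    # adding an int and None is not defined in Python
    print("TypeError:", err)
    failed = True
else:
    failed = False
```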

### NaN - Numpy/Pandas-y Missing Numeric data

* The third-party NumPy library has a mechanism for representing missing numeric values
* Under the hood, `NaN` is a standards-compliant (IEEE 754) floating-point value
    * Note for R users: There is no `Null` only `NaN`
* This means you can use them with other numeric arrays for fast computations
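
* A small sketch of what "NaN is a float" means in practice; the array name `values` is only for illustration

```python
import numpy as np

# NaN is an ordinary Python float
print(type(np.nan))        # <class 'float'>

# NaN is never equal to anything, including itself
print(np.nan == np.nan)    # False

# an array containing NaN keeps a fast numeric dtype
values = np.array([1, np.nan, 3, 4])
print(values.dtype)        # float64

# plain NumPy arithmetic propagates NaN, while nansum ignores it
print(values.sum())        # nan
print(np.nansum(values))   # 8.0
```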

In [None]:
# Create a numeric Pandas Series with a missing value
nanny = pd.Series([1, np.nan, 3, 4])
nanny.dtype

* Now we can use all the fast and easy computations in Pandas without worrying about missing values

In [None]:
# compute the sum of all the numbers in the Series
nanny.sum()
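
* The sum works because Pandas skips `NaN` by default; a short sketch of turning that off with the `skipna` parameter (the variable names are only for illustration)

```python
import numpy as np
import pandas as pd

nanny = pd.Series([1, np.nan, 3, 4])

# default behavior: NaN values are skipped
total_skipping = nanny.sum()
print(total_skipping)   # 8.0

# skipna=False makes the missing value propagate into the result
total_strict = nanny.sum(skipna=False)
print(total_strict)     # nan
```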

## Operating on Null Values

* There are four functions in Pandas that are useful for working with missing data
* The examples below operate on Series, but they work on DataFrames as well


### Null value functions

* `isna()` - Generate a boolean mask of the missing values (can also use `isnull()`)
* `notna()` - Do the opposite of `isna()` (can also use `notnull()`)
* `dropna()` - Create a filtered copy of the data with no null values
* `fillna(value)` - Create a copy of the data with null values filled in

In [None]:
# display the Series
nanny

In [None]:
# what values are null
nanny.isna()

In [None]:
# what values are not null
nanny.notna()

* These masks can be used to filter the data and show only the missing or non-missing values

In [None]:
# not super useful in a Series, but handy with DataFrames
nanny[nanny.isna()]
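
* A sketch of why masks are handy with DataFrames: a mask built from one column selects whole rows; the `people` table and its values are made up for illustration

```python
import numpy as np
import pandas as pd

# hypothetical data: Bob's age is missing
people = pd.DataFrame({"name": ["Ann", "Bob", "Cy"],
                       "age": [34, np.nan, 51]})

# rows where the age is missing
missing_age = people[people["age"].isna()]

# rows where the age is known
known_age = people[people["age"].notna()]

print(missing_age)
print(known_age)
```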

* Rather than filtering with a mask, we can create *copies* of the data with the null values removed or filled in

In [None]:
# Just get rid of all the null values
no_null_nanny = nanny.dropna()
no_null_nanny

In [None]:
# fill in the null values with zero
fill_null_nanny = nanny.fillna(0)
fill_null_nanny

In [None]:
# fill in the null values with a different value
fill_null_nanny = nanny.fillna(999)
fill_null_nanny

In [None]:
# The original nanny Series remains untouched #noreboot
# Fran Drescher frowns with disappointment
nanny
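
* The fill value does not have to be a constant; one common choice, shown here only as a sketch, is the mean of the non-missing values (the name `mean_filled` is hypothetical)

```python
import numpy as np
import pandas as pd

nanny = pd.Series([1, np.nan, 3, 4])

# mean() skips NaN, so this is the average of 1, 3 and 4
mean_filled = nanny.fillna(nanny.mean())
print(mean_filled)

# fillna returned a copy, so the original still has its missing value
print(nanny.isna().sum())   # 1
```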

* These functions work with DataFrames as well
* But you will need to pay closer attention to what they are doing

In [None]:
df_nanny = pd.DataFrame([[1, np.nan, 2],
                        [2, 3, 5],
                        [np.nan, 4, 6]])
df_nanny

* Dropping null values with `dropna()` removes the entire axis (row or column) and returns a new copy of the dataframe
* You can specify dropping rows or columns with the axis parameter

In [None]:
# dropna gets rid of rows by default
df_nanny.dropna() # axis="rows" or axis=0

In [None]:
# use the axis="columns" or axis=1 to drop columns
df_nanny.dropna(axis="columns")

* A couple of other parameters let you control this behavior
* For example, `how="all"` drops only rows/columns where every value is null, and `thresh` sets the minimum number of non-null values needed to keep a row/column
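
* A sketch of the `how` and `thresh` parameters of `dropna()`; the all-null fourth row is added here purely for illustration and is not part of the original `df_nanny`

```python
import numpy as np
import pandas as pd

df_nanny = pd.DataFrame([[1, np.nan, 2],
                         [2, 3, 5],
                         [np.nan, 4, 6],
                         [np.nan, np.nan, np.nan]])  # extra all-null row

# how="all" drops a row only when every value in it is null
only_all_null_dropped = df_nanny.dropna(how="all")
print(only_all_null_dropped)   # rows 0, 1 and 2 remain

# thresh=3 keeps only rows with at least 3 non-null values
at_least_three = df_nanny.dropna(thresh=3)
print(at_least_three)          # only row 1 remains
```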

## Working with null values in real data

* Here is an example of some real data, the diabetes data from week 2

In [None]:
# Import data file into a Pandas dataframe
df = pd.read_csv("../2 - data python two/diabetes.csv")

# Display the first 5 rows of the data
df.head()

In [None]:
# Display the metadata about the data, including the non-null count for each column
df.info() 

* If we look closely at this information we can see there are a few null values in this dataset
* There are 403 rows, but some columns have fewer than 403 non-null values
* Now let's check which values in the dataset are missing

In [None]:
# Create a boolean mask where True indicates a null value
df.isna().head()

* Gak! Too much data. How can we get a quick count of the null values?
* What if we combined `isna()` with the `sum()` function? Each `True` counts as 1, so the sum is the number of null values

In [None]:
# Use the sum function to count the True values in the boolean mask
df.isna().sum()

* If we wanted to look at a specific column we can do the same operation 
* These functions work with Series as well as DataFrames

In [None]:
# How many null values in the chol column
df["chol"].isnull().sum()
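
* The same trick gives the *share* of missing values if you swap `sum()` for `mean()`; this sketch uses a tiny made-up table (`toy`) rather than the diabetes data, so the numbers are purely illustrative

```python
import numpy as np
import pandas as pd

# hypothetical stand-in data, not the real diabetes dataset
toy = pd.DataFrame({"chol": [203.0, np.nan, 228.0, 78.0],
                    "weight": [121.0, 218.0, np.nan, np.nan]})

# number of null values per column
null_counts = toy.isna().sum()
print(null_counts)

# fraction of null values per column (mean of the True/False mask)
null_share = toy.isna().mean()
print(null_share)
```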

* Now let's deal with missing values
* Solution 1: Remove rows with empty values
* If there are only a few null values and you know that deleting values will not cause adverse effects on your result, remove them from your DataFrame
* Make sure to save the new dataframe to a new variable!

In [None]:
# Display missing value counts
print("Missing values before dropping rows: ")
print(df.isnull().sum())


# Make a copy of the dataframe with the rows containing null values removed
mod_df = df.dropna()
print("Missing values after dropping rows: ")
print(mod_df.isnull().sum())
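
* Before committing to this approach it is worth checking how many rows it costs you; a sketch on a small made-up table (`toy`), where the `subset` parameter limits the check to the columns you actually care about

```python
import numpy as np
import pandas as pd

# hypothetical stand-in data, not the real diabetes dataset
toy = pd.DataFrame({"chol": [203.0, np.nan, 228.0, 78.0, 249.0],
                    "weight": [121.0, 218.0, np.nan, 119.0, 183.0]})

# drop every row that has a null value in any column
all_complete = toy.dropna()
rows_lost = len(toy) - len(all_complete)
print("rows lost:", rows_lost)   # rows lost: 2

# drop rows only when the chol column is null
chol_complete = toy.dropna(subset=["chol"])
print("rows kept:", len(chol_complete))   # rows kept: 4
```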
