# Pandas Cheat Sheet - Data Cleaning
Data cleaning is arguably the most important part of data science. Luckily, pandas has quite a few useful tools to help you clean your data.

In [1]:
%pip install -Uqq pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np

## The data
Let's create a DataFrame to use:

In [3]:
data = {
    'biscuit_1452': [6,5,2,1,5,6,5],
    'scone_2035': [4,8,9,np.nan,2,np.nan,6],  # np.nan, since the np.NaN alias was removed in NumPy 2.0
    'donut_2367': [9,4,8,12,13,10,18],
    'muffin_2011': [4,7,2,8,8,np.nan,4],
}
df = pd.DataFrame(data)
df

Unnamed: 0,biscuit_1452,scone_2035,donut_2367,muffin_2011
0,6,4.0,9,4.0
1,5,8.0,4,7.0
2,2,9.0,8,2.0
3,1,,12,8.0
4,5,2.0,13,8.0
5,6,,10,
6,5,6.0,18,4.0


## Renaming
One of the first things you might want to do with this dataset is rename the columns and the row index labels.

In [4]:
df.columns = ['Biscuit', 'Scone', 'Donut', 'Muffin']
df

Unnamed: 0,Biscuit,Scone,Donut,Muffin
0,6,4.0,9,4.0
1,5,8.0,4,7.0
2,2,9.0,8,2.0
3,1,,12,8.0
4,5,2.0,13,8.0
5,6,,10,
6,5,6.0,18,4.0


In [5]:
df.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df

Unnamed: 0,Biscuit,Scone,Donut,Muffin
Monday,6,4.0,9,4.0
Tuesday,5,8.0,4,7.0
Wednesday,2,9.0,8,2.0
Thursday,1,,12,8.0
Friday,5,2.0,13,8.0
Saturday,6,,10,
Sunday,5,6.0,18,4.0
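
Assigning to `df.columns` and `df.index` replaces every label at once, so the new list has to match the old length. If you only want to change some labels, `DataFrame.rename` takes a mapping instead. Here is a minimal sketch on a hypothetical two-row frame (`raw` and `renamed` are illustrative names, not part of the dataset above):

```python
import pandas as pd

# Hypothetical two-row frame that reuses two of the raw column names
raw = pd.DataFrame({'biscuit_1452': [6, 5], 'scone_2035': [4.0, 8.0]})

# Map only the labels you want to change; anything not listed is left alone
renamed = raw.rename(
    columns={'biscuit_1452': 'Biscuit', 'scone_2035': 'Scone'},
    index={0: 'Monday', 1: 'Tuesday'},
)
print(renamed)
```

`rename` returns a new DataFrame and leaves `raw` untouched, so assign the result if you want to keep it.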


## Null Values
There are many reasons a dataset might have null values, and the reason depends largely on how the data was collected. Maybe a sensor malfunctioned, maybe someone mistakenly left out a piece of data, or maybe a null value simply means **0** in that data.

Regardless of the reason, you will likely have to find null values and filter/change them in the data before you can do your analysis. Here are a few ways pandas helps you do that:

In [6]:
df.isnull()  # Detects null values in each cell

Unnamed: 0,Biscuit,Scone,Donut,Muffin
Monday,False,False,False,False
Tuesday,False,False,False,False
Wednesday,False,False,False,False
Thursday,False,True,False,False
Friday,False,False,False,False
Saturday,False,True,False,True
Sunday,False,False,False,False


In [7]:
df.isnull().sum()  # Shows how many null values exist in each column

Biscuit    0
Scone      2
Donut      0
Muffin     1
dtype: int64
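
The same boolean mask can also be summed across rows, or used to pull out just the rows that have something missing. A small sketch, rebuilding the renamed DataFrame so it runs on its own (`nulls_per_row` and `rows_with_nulls` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'Biscuit': [6, 5, 2, 1, 5, 6, 5],
        'Scone': [4, 8, 9, np.nan, 2, np.nan, 6],
        'Donut': [9, 4, 8, 12, 13, 10, 18],
        'Muffin': [4, 7, 2, 8, 8, np.nan, 4],
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
)

# Count null values per row instead of per column
nulls_per_row = df.isnull().sum(axis=1)

# Keep only the rows that contain at least one null value
rows_with_nulls = df[df.isnull().any(axis=1)]

print(nulls_per_row)
print(rows_with_nulls)
```

`df.isna()` is an alias for `df.isnull()`, so either spelling works here.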

In [8]:
df.dropna()  # Drops rows containing null data

Unnamed: 0,Biscuit,Scone,Donut,Muffin
Monday,6,4.0,9,4.0
Tuesday,5,8.0,4,7.0
Wednesday,2,9.0,8,2.0
Friday,5,2.0,13,8.0
Sunday,5,6.0,18,4.0


In [9]:
df.dropna(axis=1)  # Drops columns containing null data

Unnamed: 0,Biscuit,Donut
Monday,6,9
Tuesday,5,4
Wednesday,2,8
Thursday,1,12
Friday,5,13
Saturday,6,10
Sunday,5,18
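
Dropping every row or column that has any null can throw away a lot of data. `dropna` has a few parameters for being more selective; the sketch below rebuilds the DataFrame so it runs on its own, and the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'Biscuit': [6, 5, 2, 1, 5, 6, 5],
        'Scone': [4, 8, 9, np.nan, 2, np.nan, 6],
        'Donut': [9, 4, 8, 12, 13, 10, 18],
        'Muffin': [4, 7, 2, 8, 8, np.nan, 4],
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
)

# Drop rows only when 'Muffin' is null (removes Saturday, keeps Thursday)
muffin_known = df.dropna(subset=['Muffin'])

# Keep columns that have at least 6 non-null values (removes 'Scone')
mostly_full_cols = df.dropna(axis=1, thresh=6)

# Drop rows only when every value is null (nothing is removed here)
not_empty = df.dropna(how='all')

print(muffin_known)
print(mostly_full_cols)
print(not_empty)
```

Like the calls above, each of these returns a new DataFrame and leaves `df` unchanged.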


In [10]:
df.fillna(0)  # Replaces all null values with 0 (returns a new DataFrame; df itself is unchanged)

Unnamed: 0,Biscuit,Scone,Donut,Muffin
Monday,6,4.0,9,4.0
Tuesday,5,8.0,4,7.0
Wednesday,2,9.0,8,2.0
Thursday,1,0.0,12,8.0
Friday,5,2.0,13,8.0
Saturday,6,0.0,10,0.0
Sunday,5,6.0,18,4.0
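
A single fill value is not always right for every column. `fillna` also accepts a dict that maps column names to fill values. In this sketch, filling `Scone` with its column mean and `Muffin` with 0 is just an assumption for illustration; which value makes sense depends on what a null means in your data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'Biscuit': [6, 5, 2, 1, 5, 6, 5],
        'Scone': [4, 8, 9, np.nan, 2, np.nan, 6],
        'Donut': [9, 4, 8, 12, 13, 10, 18],
        'Muffin': [4, 7, 2, 8, 8, np.nan, 4],
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
)

# Per-column fill values: the column mean for 'Scone' (mean() skips nulls), 0 for 'Muffin'
filled = df.fillna({'Scone': df['Scone'].mean(), 'Muffin': 0})
print(filled)
```

As with `df.fillna(0)`, the result is a new DataFrame, so assign it back (`df = df.fillna(...)`) if you want the change to stick.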
