# Data Cleaning Practice

This notebook is to help us practice our data cleaning skills. The following were done to deal with null values:
- Dropping rows,
- Dropping columns,
- Replacing values,
- Filling null values,
- Backward filling,
- Forward filling,

and finally sorting the index.

In [296]:
# Importing necessary libraries
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

In [297]:
# Importing and Previewing
sales_df = pd.read_csv('dirty_cafe_sales.csv')
sales_df

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11
...,...,...,...,...,...,...,...,...
9995,TXN_7672686,Coffee,2,2.0,4.0,,UNKNOWN,2023-08-30
9996,TXN_9659401,,3,,3.0,Digital Wallet,,2023-06-02
9997,TXN_5255387,Coffee,4,2.0,8.0,Digital Wallet,,2023-03-02
9998,TXN_7695629,Cookie,3,,3.0,Digital Wallet,,2023-12-02


In [298]:
# Dataframe Info
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


**Note:** Some columns should not be objects. Some like `Quantity` should be an integer,`Price Per Unit` and `Total Spent` should be float.

The `Transaction Date` column can also be converted into a datetime format

In [299]:
# Check Total Null Values
sales_df.isnull().sum()

Transaction ID         0
Item                 333
Quantity             138
Price Per Unit       179
Total Spent          173
Payment Method      2579
Location            3265
Transaction Date     159
dtype: int64

In [300]:
# Check Unique values in Item column
sales_df['Item'].unique()

array(['Coffee', 'Cake', 'Cookie', 'Salad', 'Smoothie', 'UNKNOWN',
       'Sandwich', nan, 'ERROR', 'Juice', 'Tea'], dtype=object)

In [301]:
# Assigning rows to be dropped where Item is Unknown or Error
to_drop = sales_df[(sales_df['Item'] == 'UNKNOWN') | (sales_df['Item'] == 'ERROR') | (sales_df['Item'].isna())]

In [302]:
# Dropping the affected rows
sales_df.drop(to_drop.index, axis = 0, inplace = True)

In [303]:
# Confirming that they have been dropped
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9031 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    9031 non-null   object
 1   Item              9031 non-null   object
 2   Quantity          8910 non-null   object
 3   Price Per Unit    8873 non-null   object
 4   Total Spent       8877 non-null   object
 5   Payment Method    6701 non-null   object
 6   Location          6086 non-null   object
 7   Transaction Date  8888 non-null   object
dtypes: object(8)
memory usage: 635.0+ KB


In [304]:
# Checking null values
sales_df.isna().sum()

Transaction ID         0
Item                   0
Quantity             121
Price Per Unit       158
Total Spent          154
Payment Method      2330
Location            2945
Transaction Date     143
dtype: int64

In [305]:
# Checking null values for Item column
sales_df['Item'].isna().sum()

0

In [306]:
# Confirming there are no placeholder values in Item Column 
sales_df['Item'].unique()

array(['Coffee', 'Cake', 'Cookie', 'Salad', 'Smoothie', 'Sandwich',
       'Juice', 'Tea'], dtype=object)

In [307]:
# Identifying and removing rows with 'ERROR' and 'UNKNOWN' values in Quantity
quantity_error = sales_df[(sales_df['Quantity'] == 'ERROR') | (sales_df['Quantity'] == 'UNKNOWN')]
sales_df.drop(quantity_error.index, axis = 0, inplace = True)

In [308]:
# Identifying and removing rows with 'ERROR' and 'UNKNOWN' values in Price Per Unit
ppu_error = sales_df[(sales_df['Price Per Unit'] == 'ERROR') | (sales_df['Price Per Unit'] == 'UNKNOWN')]
sales_df.drop(ppu_error.index, axis = 0, inplace = True)

In [309]:
# Identifying and removing rows with 'ERROR' and 'UNKNOWN' values in Total Spent
total_error = sales_df[(sales_df['Total Spent'] == 'ERROR') | (sales_df['Total Spent'] == 'UNKNOWN')]
sales_df.drop(total_error.index, axis = 0, inplace = True)

In [310]:
# There are still null values in Quantity, Price Per Unit and Total Spent
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8139 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    8139 non-null   object
 1   Item              8139 non-null   object
 2   Quantity          8027 non-null   object
 3   Price Per Unit    7988 non-null   object
 4   Total Spent       7989 non-null   object
 5   Payment Method    6042 non-null   object
 6   Location          5491 non-null   object
 7   Transaction Date  8011 non-null   object
dtypes: object(8)
memory usage: 572.3+ KB


In [311]:
# Filling those null values with 0
sales_df[['Quantity', 'Price Per Unit', 'Total Spent']] = sales_df[['Quantity', 'Price Per Unit', 'Total Spent']].fillna(0)

In [312]:
# Converting Quantity values to int
sales_df['Quantity'] = sales_df['Quantity'].astype(int)
# Converting Price Per Unit values to float
sales_df['Price Per Unit'] = sales_df['Price Per Unit'].astype(float)
# Converting Total Spent values to float
sales_df['Total Spent'] = sales_df['Total Spent'].astype(float)

In [313]:
# There are no null values in Quantity, Price Per Unit and Total Spent
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8139 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    8139 non-null   object 
 1   Item              8139 non-null   object 
 2   Quantity          8139 non-null   int32  
 3   Price Per Unit    8139 non-null   float64
 4   Total Spent       8139 non-null   float64
 5   Payment Method    6042 non-null   object 
 6   Location          5491 non-null   object 
 7   Transaction Date  8011 non-null   object 
dtypes: float64(2), int32(1), object(5)
memory usage: 540.5+ KB


In [314]:
# Check the unique values in Payment Method column
sales_df['Payment Method'].value_counts()

Digital Wallet    1868
Credit Card       1857
Cash              1840
ERROR              244
UNKNOWN            233
Name: Payment Method, dtype: int64

In [315]:
# Replacing ERROR with Other in Payment Method column
sales_df['Payment Method'] = sales_df['Payment Method'].replace('ERROR', "Other")
# Replacing UNKNOWN with Other in Payment Method column
sales_df['Payment Method'] = sales_df['Payment Method'].replace('UNKNOWN', "Other")
# Filling null values with Other in Payment Method column
sales_df['Payment Method'] = sales_df['Payment Method'].fillna("Other")

In [316]:
# Checking for unique values again
sales_df['Payment Method'].value_counts()

Other             2574
Digital Wallet    1868
Credit Card       1857
Cash              1840
Name: Payment Method, dtype: int64

In [317]:
# Payment Method has no null values
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8139 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    8139 non-null   object 
 1   Item              8139 non-null   object 
 2   Quantity          8139 non-null   int32  
 3   Price Per Unit    8139 non-null   float64
 4   Total Spent       8139 non-null   float64
 5   Payment Method    8139 non-null   object 
 6   Location          5491 non-null   object 
 7   Transaction Date  8011 non-null   object 
dtypes: float64(2), int32(1), object(5)
memory usage: 540.5+ KB


In [318]:
#  Checking unique values in Location column
sales_df['Location'].value_counts()

In-store    2463
Takeaway    2451
ERROR        300
UNKNOWN      277
Name: Location, dtype: int64

In [319]:
# Filling null values with forward fill
sales_df['Location'] = sales_df['Location'].fillna(method = 'ffill')

# Backward fill (method = "bfill")
# Forward fill (method = "ffill")

In [320]:
# Location has no null values
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8139 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    8139 non-null   object 
 1   Item              8139 non-null   object 
 2   Quantity          8139 non-null   int32  
 3   Price Per Unit    8139 non-null   float64
 4   Total Spent       8139 non-null   float64
 5   Payment Method    8139 non-null   object 
 6   Location          8139 non-null   object 
 7   Transaction Date  8011 non-null   object 
dtypes: float64(2), int32(1), object(5)
memory usage: 540.5+ KB


In [321]:
# Checking unique values in Location
sales_df['Location'].value_counts()

In-store    3649
Takeaway    3644
ERROR        428
UNKNOWN      418
Name: Location, dtype: int64

In [322]:
# Replacing ERROR and UNKNOWN with Unknown in the Location Column
sales_df['Location'] = sales_df['Location'].replace('ERROR', 'Unknown')
sales_df['Location'] = sales_df['Location'].replace('UNKNOWN', 'Unknown')

In [323]:
# Confirming unique values again in Location column
sales_df['Location'].value_counts()

In-store    3649
Takeaway    3644
Unknown      846
Name: Location, dtype: int64

In [324]:
# Let's now deal with the Transaction Date column
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8139 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    8139 non-null   object 
 1   Item              8139 non-null   object 
 2   Quantity          8139 non-null   int32  
 3   Price Per Unit    8139 non-null   float64
 4   Total Spent       8139 non-null   float64
 5   Payment Method    8139 non-null   object 
 6   Location          8139 non-null   object 
 7   Transaction Date  8011 non-null   object 
dtypes: float64(2), int32(1), object(5)
memory usage: 540.5+ KB


In [325]:
# Filling the null values with backfilling
sales_df['Transaction Date'] = sales_df['Transaction Date'].fillna(method = "bfill")

In [326]:
# All null values filled. Now to change the datatype to datetime
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8139 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    8139 non-null   object 
 1   Item              8139 non-null   object 
 2   Quantity          8139 non-null   int32  
 3   Price Per Unit    8139 non-null   float64
 4   Total Spent       8139 non-null   float64
 5   Payment Method    8139 non-null   object 
 6   Location          8139 non-null   object 
 7   Transaction Date  8139 non-null   object 
dtypes: float64(2), int32(1), object(5)
memory usage: 540.5+ KB


In [327]:
# Identifying rows with ERROR and UNKNOWN values
date_error = sales_df[(sales_df['Transaction Date'] == 'UNKNOWN') | (sales_df['Transaction Date'] == 'ERROR')]

In [328]:
# Dropping them
sales_df.drop(date_error.index, axis = 0, inplace = True)

In [329]:
# Converting the Transaction Date Column to datetime format
# errors = 'coerce': forces all to convert to the specified format
# Values that do not conform to the format are converted to NaT (Not a Time)
sales_df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], format= "%Y-%m-%d", errors = 'coerce')

In [330]:
# There are a couple of null values (NaT) in the Transaction Date Column 
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7886 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Transaction ID    7886 non-null   object        
 1   Item              7886 non-null   object        
 2   Quantity          7886 non-null   int32         
 3   Price Per Unit    7886 non-null   float64       
 4   Total Spent       7886 non-null   float64       
 5   Payment Method    7886 non-null   object        
 6   Location          7886 non-null   object        
 7   Transaction Date  7761 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int32(1), object(4)
memory usage: 523.7+ KB


In [331]:
# Backfilling to fill the NaT rows in Transaction Date column
sales_df['Transaction Date'] = sales_df['Transaction Date'].fillna(method = 'bfill')

In [332]:
# We have a clean DataFrame
# Datatypes check
# Null values eradicated
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7886 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Transaction ID    7886 non-null   object        
 1   Item              7886 non-null   object        
 2   Quantity          7886 non-null   int32         
 3   Price Per Unit    7886 non-null   float64       
 4   Total Spent       7886 non-null   float64       
 5   Payment Method    7886 non-null   object        
 6   Location          7886 non-null   object        
 7   Transaction Date  7886 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int32(1), object(4)
memory usage: 523.7+ KB


In [333]:
# Oh no. Our index is f***** up!
sales_df

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
3,TXN_7034554,Salad,2,5.0,10.0,Other,Unknown,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11
5,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,In-store,2023-03-31
...,...,...,...,...,...,...,...,...
9993,TXN_4766549,Smoothie,2,4.0,0.0,Cash,In-store,2023-10-20
9995,TXN_7672686,Coffee,2,2.0,4.0,Other,Unknown,2023-08-30
9997,TXN_5255387,Coffee,4,2.0,8.0,Digital Wallet,Unknown,2023-03-02
9998,TXN_7695629,Cookie,3,0.0,3.0,Digital Wallet,Unknown,2023-12-02


In [334]:
# Resetting the index of our DataFrame
sales_df = sales_df.reset_index()

In [335]:
# Dropping the column labelled "index"
sales_df.drop('index', axis = 1, inplace = True)

In [336]:
# All good
sales_df

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_7034554,Salad,2,5.0,10.0,Other,Unknown,2023-04-27
3,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11
4,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,In-store,2023-03-31
...,...,...,...,...,...,...,...,...
7881,TXN_4766549,Smoothie,2,4.0,0.0,Cash,In-store,2023-10-20
7882,TXN_7672686,Coffee,2,2.0,4.0,Other,Unknown,2023-08-30
7883,TXN_5255387,Coffee,4,2.0,8.0,Digital Wallet,Unknown,2023-03-02
7884,TXN_7695629,Cookie,3,0.0,3.0,Digital Wallet,Unknown,2023-12-02
