<h1 style="text-align:center;">Data Cleaning</h1>



First, we import pandas, read a sample of sales data from a British online shop, and clean up the dataframe.

In [1]:
import pandas as pd

In [2]:
# Read the file once and keep an untouched copy to compare row counts later
df_raw = pd.read_csv("../data/raw/data.csv", encoding='latin1')
df = df_raw.copy()
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [3]:
# 1. Clean df
df = df.dropna()
df = df.drop_duplicates()
df['CustomerID'] = df['CustomerID'].astype(int)

# 2. Delete cancelled invoices (InvoiceNo starts with 'C') and outliers above the 99.9th percentile
df = df[~df['InvoiceNo'].str.startswith('C')]
df = df[df['Quantity'] < df['Quantity'].quantile(0.999)]
df = df[df['UnitPrice'] < df['UnitPrice'].quantile(0.999)]

# 3. Convert to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
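
To make the effect of the quantile filter in step 2 easier to see, here is a minimal sketch on a toy column. The values are invented for illustration and are not taken from the shop data; only the filtering pattern matches the cell above.

```python
import pandas as pd

# Toy data (invented): 999 ordinary quantities and one extreme value
toy = pd.DataFrame({"Quantity": [5] * 999 + [80000]})

# Value below which 99.9 % of the observations fall
# (pandas interpolates linearly between neighbouring values by default)
cutoff = toy["Quantity"].quantile(0.999)

# Strict '<' keeps only the rows below the cut-off, as in the cleaning cell
trimmed = toy[toy["Quantity"] < cutoff]

print("cut-off:", cutoff)
print("rows before:", len(toy), "rows after:", len(trimmed))
```

In this toy case the single extreme value is dropped and the ordinary rows survive. On real data the cut-off depends on the distribution, so it is worth printing it before trusting the filter.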

Now we define validation rules so that the dataset contains only meaningful values: quantities, unit prices, and customer IDs must all be positive.

In [4]:
valid = (
    (df['Quantity'] > 0) &
    (df['UnitPrice'] > 0) &
    (df['CustomerID'] > 0)
)

df = df[valid]
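
A combined mask does not show which rule rejected a row. The following sketch, run on invented toy rows, keeps the same three rules in a dictionary so that violations can be counted per rule. The names `rules` and `violations` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy rows (invented): one valid row and one violation per rule
toy = pd.DataFrame({
    "Quantity":   [6, 0, 3, 2],
    "UnitPrice":  [2.55, 1.25, 0.0, 4.95],
    "CustomerID": [17850, 17850, 13047, 0],
})

# One boolean Series per rule, keyed by a readable name
rules = {
    "Quantity > 0": toy["Quantity"] > 0,
    "UnitPrice > 0": toy["UnitPrice"] > 0,
    "CustomerID > 0": toy["CustomerID"] > 0,
}

# Number of rows that violate each rule
violations = {name: int((~mask).sum()) for name, mask in rules.items()}
print(violations)

# A row is kept only if it passes every rule
valid = pd.concat(rules, axis=1).all(axis=1)
print(toy[valid])
```

The final `valid` mask is equivalent to chaining the conditions with `&`; the dictionary only adds the per-rule report.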

Next, we create the columns needed for our further analysis: the revenue per invoice line and the month, weekday, and quarter of each invoice.

In [5]:
# 4. Create revenue column
df['Revenue'] = df['Quantity'] * df['UnitPrice']

# 5. Derive date parts (month, weekday, quarter) from the invoice date
df['Month'] = df['InvoiceDate'].dt.month
df['Weekday'] = df['InvoiceDate'].dt.day_name()
df['Quarter'] = df['InvoiceDate'].dt.quarter

df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Revenue,Month,Weekday,Quarter
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,12,Wednesday,4
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,12,Wednesday,4
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,12,Wednesday,4
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,12,Wednesday,4
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,12,Wednesday,4


Lastly, we check how many rows were removed and validate whether the dataset is meaningful. If it is, we save it to a new file.

In [6]:
print("Rows before cleaning:", len(df_raw))
print("Rows after cleaning:", len(df))
print("Rows removed:", len(df_raw) - len(df))

Rows before cleaning: 541909
Rows after cleaning: 391891
Rows removed: 150018
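
The total above does not say which cleaning step removed how many rows. One possible way to get that breakdown is sketched below on invented toy rows. The `steps` dictionary and the loop are a hypothetical helper, not code from this notebook, and only three of the cleaning steps are reproduced.

```python
import pandas as pd

# Toy rows (invented): one duplicate, one cancelled invoice, one missing CustomerID
toy = pd.DataFrame({
    "InvoiceNo":  ["536365", "536365", "C536379", "536380", "536381"],
    "Quantity":   [6, 6, -1, 4, 2],
    "CustomerID": [17850.0, 17850.0, 14527.0, None, 13047.0],
})

# Hypothetical helper: each cleaning step as a named function, applied in order
steps = {
    "dropna": lambda d: d.dropna(),
    "drop_duplicates": lambda d: d.drop_duplicates(),
    "drop cancelled": lambda d: d[~d["InvoiceNo"].str.startswith("C")],
}

removed = {}
current = toy
for name, step in steps.items():
    before = len(current)
    current = step(current)
    removed[name] = before - len(current)  # rows dropped by this step only

print(removed)
print("Rows removed in total:", len(toy) - len(current))
```

Because the steps run in sequence, each count refers to the rows still present at that point, so the per-step numbers add up to the total.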


In [7]:
checks_ok = (
    (df['Revenue'] > 0).all() and
    (df['UnitPrice'] > 0).all() and
    (df['Quantity'] > 0).all() and
    # Exact equality is safe here: Revenue was computed with this same expression
    (df['Revenue'] == df['Quantity'] * df['UnitPrice']).all()
)
print(checks_ok)

True


In [8]:
df.to_csv("../data/cleaned/clean_data.csv", index=False)
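
CSV files do not store column types, so dtypes set during cleaning have to be restored when the file is read back. The sketch below illustrates this with an invented one-row frame and an in-memory buffer instead of the real file. The choice to read `InvoiceNo` as a string is an assumption about what later analysis needs.

```python
import io
import pandas as pd

# Toy frame (invented) with the same column types as the cleaned data
toy = pd.DataFrame({
    "InvoiceNo": ["536365"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00"]),
    "CustomerID": [17850],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without hints, the date comes back as text and InvoiceNo as a number
buffer.seek(0)
plain = pd.read_csv(buffer)

# With hints, the original types are restored
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["InvoiceDate"], dtype={"InvoiceNo": str})

print(plain.dtypes)
print(restored.dtypes)
```

When loading `clean_data.csv` in a later notebook, passing the same `parse_dates` and `dtype` arguments to `pd.read_csv` should avoid repeating the datetime conversion by hand.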