<h1 style="text-align:center;">Data Cleaning</h1>



First, we import pandas and load the sample of sales data from a British online shop.

In [1]:
import pandas as pd

In [2]:
df_raw = pd.read_csv("../data/raw/data.csv", encoding='latin1')  # untouched copy for row-count comparisons
df = df_raw.copy()
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


To choose a suitable missing-value strategy, we first check which columns contain missing values and how many.

In [3]:
missing = df.isnull().sum()
print(missing)

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
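
Absolute counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the values and the name `summary` are assumptions for illustration, not part of the shop data); it puts the missing count and the percentage per column side by side.

```python
import pandas as pd

# Toy frame standing in for the shop data (values are invented for illustration).
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "MUG", "HOLDER"],
    "CustomerID": [17850.0, None, None, 13047.0],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(2),
})
print(summary)
```

The same two expressions can be applied to `df` directly.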


Because only 1454 rows lack a description, we delete those rows. CustomerID is missing far more often (135080 rows), so instead of dropping these rows we replace the missing IDs with 0 as a placeholder.

In [4]:
df = df.dropna(subset=['Description'])
df['CustomerID'] = df['CustomerID'].fillna(0)

print("Rows removed:", len(df_raw) - len(df))

Rows removed: 1454
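
Using 0 as a placeholder means that every later per-customer analysis has to treat ID 0 as "unknown customer" rather than as a real customer. The sketch below uses a made-up toy frame (an assumption for illustration) to show the same drop-and-fill strategy and how rows with known customers can be selected afterwards.

```python
import pandas as pd

# Toy frame: one row without a description, one row without a customer ID.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "MUG"],
    "CustomerID": [17850.0, 13047.0, None],
})

# Drop rows without a description, then fill missing IDs with the placeholder 0.
cleaned = toy.dropna(subset=["Description"]).copy()
cleaned["CustomerID"] = cleaned["CustomerID"].fillna(0).astype(int)

# Rows that belong to an identifiable customer (placeholder excluded).
known = cleaned[cleaned["CustomerID"] != 0]

print("Rows removed:", len(toy) - len(cleaned))
print("Rows with known customer:", len(known))
```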


After handling the missing values, we can clean the dataframe up.

In [5]:
# 1. Clean df
df = df.drop_duplicates()
df['CustomerID'] = df['CustomerID'].astype(int)

# 2. Delete cancelled invoices and outliers
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
df = df[df['Quantity'] < df['Quantity'].quantile(0.999)]
df = df[df['UnitPrice'] < df['UnitPrice'].quantile(0.999)]

# 3. Convert to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

print("Rows removed:", len(df_raw) - len(df) - 1454)

Rows removed: 15595
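
To make the two filters in step 2 easier to follow, here is a small sketch on invented data. The toy values and the 0.9 quantile are assumptions chosen only because the toy frame is tiny; the notebook itself uses the 0.999 quantile.

```python
import pandas as pd

# Toy invoices: one cancellation (prefix "C") and one extreme quantity.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "536367", "536368"],
    "Quantity": [1, 2, 3, 4, 1000],
})

# Remove cancelled invoices: their invoice number starts with "C".
not_cancelled = toy[~toy["InvoiceNo"].astype(str).str.startswith("C")]

# Keep only rows strictly below the chosen quantile of the remaining data.
threshold = not_cancelled["Quantity"].quantile(0.9)
kept = not_cancelled[not_cancelled["Quantity"] < threshold]

print(kept)
```

Note that the threshold is computed after the cancellations are removed, so the order of the two filters affects which rows count as outliers.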


Now we apply validation rules so that the dataset contains only meaningful values: positive quantities, positive unit prices, and non-negative customer IDs.

In [6]:
# 4. Validation rules
valid = (
    (df['Quantity'] > 0) &
    (df['UnitPrice'] > 0) &
    (df['CustomerID'] >= 0)
)

df = df[valid]

print("Rows removed:", len(df_raw) - len(df) - 1454 - 15595)

Rows removed: 1048
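
A combined mask tells us how many rows fail overall, but not which rule they fail. As a possible refinement, the sketch below keeps each rule as a named mask and counts failures per rule. The toy data and the rule names are assumptions for illustration.

```python
import pandas as pd

# Toy rows: one zero quantity, one zero price, placeholder customer ID 0 is allowed.
toy = pd.DataFrame({
    "Quantity": [6, 0, 3, 2],
    "UnitPrice": [2.55, 1.25, 0.0, 3.39],
    "CustomerID": [17850, 0, 13047, 0],
})

# One boolean mask per rule (names are hypothetical).
rules = {
    "quantity_positive": toy["Quantity"] > 0,
    "price_positive": toy["UnitPrice"] > 0,
    "customer_id_non_negative": toy["CustomerID"] >= 0,
}

# Number of rows violating each rule.
failures = {name: int((~mask).sum()) for name, mask in rules.items()}
print(failures)

# A row is valid only if it passes every rule.
valid = pd.concat(rules, axis=1).all(axis=1)
toy_valid = toy[valid]
```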


Next, we standardize the descriptions and create additional columns for the later analysis.

In [7]:
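# 5. Clean description column
df['Description'] = (
    df['Description']
    .astype(str)
    .str.upper()
    .str.replace(r"[^A-Z0-9\s\-]", "", regex=True)  # drop punctuation, keep whitespace and hyphens
    .str.replace(r"\s+", " ", regex=True)           # collapse whitespace last so removals leave no double spaces
    .str.strip()
)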

# 6. Create revenue column
df['Revenue'] = df['Quantity'] * df['UnitPrice']

# 7. Derive date parts
df['Month'] = df['InvoiceDate'].dt.month
df['Weekday'] = df['InvoiceDate'].dt.day_name()
df['Quarter'] = df['InvoiceDate'].dt.quarter

df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Revenue,Month,Weekday,Quarter
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,12,Wednesday,4
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,12,Wednesday,4
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,12,Wednesday,4
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,12,Wednesday,4
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,12,Wednesday,4
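
A quick way to check a cleaning chain like this is to run it on a few hand-picked strings. The sketch below wraps the steps in a hypothetical helper `clean_description` (the name and the sample strings are assumptions) and collapses whitespace after removing characters, so that dropping a character such as `&` cannot leave a double space behind.

```python
import pandas as pd

def clean_description(s: pd.Series) -> pd.Series:
    """Upper-case, drop punctuation, collapse whitespace, trim."""
    return (
        s.astype(str)
        .str.upper()
        .str.replace(r"[^A-Z0-9\s\-]", "", regex=True)  # keep letters, digits, whitespace, hyphens
        .str.replace(r"\s+", " ", regex=True)           # collapse runs of whitespace
        .str.strip()                                    # trim leading/trailing spaces
    )

samples = pd.Series([
    "  red woolly hottie white heart. ",
    "MUG & SPOON",
    "T-LIGHT   HOLDER",
])
cleaned = clean_description(samples)
print(cleaned.tolist())
```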


Lastly, we check how many rows were removed in total and validate whether the dataset is meaningful. If it is, we save it to a new file.

In [8]:
print("Rows before cleaning:", len(df_raw))
print("Rows after cleaning:", len(df))
print("Rows removed in total:", len(df_raw) - len(df))

Rows before cleaning: 541909
Rows after cleaning: 523812
Rows removed in total: 18097


In [9]:
checks_ok = (
    (df['Revenue'] > 0).all() and
    (df['UnitPrice'] > 0).all() and
    (df['Quantity'] > 0).all() and
    (df['Revenue'] == df['Quantity'] * df['UnitPrice']).all()
)
print(checks_ok)

True
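
The exact comparison of `Revenue` with `Quantity * UnitPrice` works here because the column was computed with the same expression. If the revenue has been rounded or comes from another source, an exact float comparison can fail even though the values agree. The sketch below uses invented values to show the difference and `numpy.isclose` as one possible tolerant alternative.

```python
import numpy as np
import pandas as pd

# Toy rows where the stored revenue was rounded to two decimals.
toy = pd.DataFrame({
    "Quantity": [3, 2],
    "UnitPrice": [1.1, 0.5],
    "Revenue": [3.3, 1.0],
})

recomputed = toy["Quantity"] * toy["UnitPrice"]

# Exact comparison: 3 * 1.1 is not exactly 3.3 in floating point.
exact_ok = bool((toy["Revenue"] == recomputed).all())

# Tolerant comparison: equal up to a small relative/absolute tolerance.
tolerant_ok = bool(np.isclose(toy["Revenue"], recomputed).all())

print(exact_ok, tolerant_ok)
```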


In [10]:
df.to_csv("../data/cleaned/clean_data.csv", index=False)
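
One thing to keep in mind when loading the cleaned file later: CSV does not store column types, so `InvoiceDate` comes back as text unless it is parsed again. The sketch below demonstrates this with an in-memory buffer instead of a real file (the toy values are assumptions for illustration).

```python
import io
import pandas as pd

# Toy frame with a datetime column, written to an in-memory CSV.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00", "2010-12-01 08:28:00"]),
    "Revenue": [15.3, 20.34],
})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without parse_dates leaves the dates as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime type.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["InvoiceDate"])

print(plain["InvoiceDate"].dtype)
print(parsed["InvoiceDate"].dtype)
```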