# Data Cleaning & Preparation – Online Retail II Dataset

This notebook performs data cleaning and transformation on the [UCI Online Retail II Dataset](https://archive.ics.uci.edu/dataset/502/online+retail+ii). The cleaned dataset will be exported as a CSV for SQL analysis in a separate notebook.

---

## 1. Load the Dataset

The Excel workbook contains two sheets, one per period (2009-2010 and 2010-2011). We read each sheet separately and merge them with **pd.concat**.

In [1]:
import pandas as pd

file_path = 'Online_Retail_Raw_Data.xlsx'

df1 = pd.read_excel(file_path, sheet_name='Year 2009-2010')
df2 = pd.read_excel(file_path, sheet_name='Year 2010-2011')

In [2]:
df = pd.concat([df1, df2], ignore_index=True)
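
As a minimal sketch of what `pd.concat` with `ignore_index=True` does here, the toy frames below (hypothetical stand-ins for the two sheets, not the real data) are stacked row-wise and renumbered from 0, so the combined row count should equal the sum of the parts:

```python
import pandas as pd

# Hypothetical stand-ins for the two yearly sheets (same columns, separate indexes)
sheet_a = pd.DataFrame({"Invoice": ["489434", "489435"], "Quantity": [12, 24]})
sheet_b = pd.DataFrame({"Invoice": ["536365"], "Quantity": [6]})

combined = pd.concat([sheet_a, sheet_b], ignore_index=True)

# Row-wise stacking: no rows lost, index renumbered 0..n-1
assert len(combined) == len(sheet_a) + len(sheet_b)
print(combined.index.tolist())
```

The same sanity check on the real data would be `len(df) == len(df1) + len(df2)`.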

## 2. Initial Data Exploration

In [3]:
# Dataset shape
print("Shape:", df.shape)

# Dataset info
df.info()

# Summary statistics
df.describe(include='all')

# Check for missing values
df.isnull().sum()

Shape: (1067371, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   Invoice      1067371 non-null  object        
 1   StockCode    1067371 non-null  object        
 2   Description  1062989 non-null  object        
 3   Quantity     1067371 non-null  int64         
 4   InvoiceDate  1067371 non-null  datetime64[ns]
 5   Price        1067371 non-null  float64       
 6   Customer ID  824364 non-null   float64       
 7   Country      1067371 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 65.1+ MB


Invoice             0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
Price               0
Customer ID    243007
Country             0
dtype: int64

The dataset has 8 columns and 1,067,371 rows. Only two columns contain null values: **Description** (4,382 missing) and **Customer ID** (243,007 missing).
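
To put the null counts in proportion, a small helper like the one below can report the count and percentage of missing values per column. This is only a sketch: `missing_report` is a hypothetical helper name, and the frame is toy data rather than the real dataset.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and percentage per column (hypothetical helper)."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    report = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually have gaps, largest first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data, not the real dataset
toy = pd.DataFrame({
    "Description": ["MUG", None, "BOWL", "BOX"],
    "Customer ID": [13085.0, np.nan, np.nan, 13078.0],
    "Country": ["United Kingdom"] * 4,
})
print(missing_report(toy))
```

Applied to the counts reported above, roughly 22.8% of rows lack a Customer ID and roughly 0.4% lack a Description.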

## 3. Data Cleaning

Next, we handle the null values. We want to store **Customer ID** as an integer, so we fill its missing values with **0** before casting. Missing **Description** values are filled with **Unknown**.

In [4]:
# Fill missing customer IDs with 0 (sentinel for "no customer recorded")
df['Customer ID'] = df['Customer ID'].fillna(0)

In [5]:
# Changing data type of customer id
df['Customer ID'] = df['Customer ID'].astype(int)
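
The fill has to come before the cast because a float column containing NaN cannot be converted to a plain integer dtype. The sketch below shows the sentinel approach on toy values. It also shows pandas' nullable `Int64` dtype as an alternative, which is not what this notebook uses; it keeps missing IDs as `<NA>` instead of 0.

```python
import numpy as np
import pandas as pd

ids = pd.Series([13085.0, np.nan, 13078.0], name="Customer ID")  # toy values

# Approach used in this notebook: 0 acts as a sentinel for "no customer"
filled = ids.fillna(0).astype(int)

# Alternative (assumption, not used here): nullable integers keep the gap explicit
nullable = ids.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

With the sentinel approach, per-customer analysis downstream should exclude rows where the ID is 0, otherwise all anonymous purchases are treated as one very large customer.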

In [6]:
# Ensure InvoiceDate is datetime (read_excel already parsed it; unparseable values would become NaT)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst=True, errors='coerce')
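
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting them after the conversion. A minimal sketch on toy strings (the real column is already `datetime64[ns]`, so no `NaT` values are expected there):

```python
import pandas as pd

raw = pd.Series(["01/12/2009 07:45", "not a date"])  # toy strings, day-first

parsed = pd.to_datetime(raw, dayfirst=True, errors="coerce")

# Unparseable entries become NaT instead of raising an error
n_bad = int(parsed.isna().sum())
print(n_bad)
```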

In [7]:
# Fill missing Description with "Unknown"
df['Description'] = df['Description'].fillna("Unknown")

Inspecting the **Invoice** column shows that some orders were cancelled: their invoice numbers start with **C** and their **Quantity** is negative. We keep these rows so the cancellation rate can be analysed later, and flag them in a separate column, **IsCancelled**.

In [8]:
# Flag cancelled orders (invoice number starts with 'C')
df['IsCancelled'] = df['Invoice'].astype(str).str.startswith('C')
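
As a sketch of how the flag could feed a cancellation rate later, the toy example below counts each invoice once, since one invoice spans several line items. The invoice numbers and quantities are made up for illustration.

```python
import pandas as pd

# Toy line items; invoice "C489449" is a cancellation
toy = pd.DataFrame({
    "Invoice": ["489434", "489434", "C489449", "489435"],
    "Quantity": [12, 24, -12, 6],
})
toy["IsCancelled"] = toy["Invoice"].astype(str).str.startswith("C")

# One row per invoice, so multi-line invoices are not over-counted
per_invoice = toy.groupby("Invoice")["IsCancelled"].first()
cancellation_rate = per_invoice.mean()
print(f"{cancellation_rate:.1%}")
```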

For easier calculation, we add two more columns: **Revenue** and **YearMonth**.

In [9]:
# Add Revenue (Quantity * Price)
df['Revenue'] = df['Quantity'] * df['Price']

In [10]:
# Add YearMonth (Period)
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
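
The two derived columns make monthly aggregation a one-liner. The sketch below uses toy rows and excludes cancelled lines, whose negative **Quantity** would otherwise subtract from revenue. Whether to exclude them is an assumption that depends on the analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    "Invoice": ["489434", "489435", "C489449", "490000"],
    "Quantity": [12, 10, -2, 4],
    "Price": [6.95, 2.0, 6.95, 2.5],
    "InvoiceDate": pd.to_datetime(
        ["2009-12-01 07:45", "2009-12-01 07:46", "2009-12-02 10:00", "2010-01-04 09:00"]
    ),
})
toy["Revenue"] = toy["Quantity"] * toy["Price"]
toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")
toy["IsCancelled"] = toy["Invoice"].str.startswith("C")

# Monthly revenue from non-cancelled lines only
monthly = toy.loc[~toy["IsCancelled"]].groupby("YearMonth")["Revenue"].sum()
print(monthly)
```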

In [11]:
# Select relevant columns (explicit copy, so later changes to df_clean never touch df)
df_clean = df[['Invoice', 'StockCode', 'Description', 'Quantity',
               'InvoiceDate', 'Price', 'Customer ID', 'Country',
               'Revenue', 'YearMonth', 'IsCancelled']].copy()

In [12]:
# Export to cleaned CSV (for use in SQL Notebook)
df_clean.to_csv('Online_Retail_Clean_Data.csv', index=False, encoding='utf-8')
print("Cleaned dataset saved")

Cleaned dataset saved
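
CSV stores everything as text, so dtypes such as datetimes and periods are not preserved on export. The sketch below round-trips a toy frame through an in-memory buffer (a stand-in for the real file) to show what a consumer would get back and how `parse_dates` restores the datetime column.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2009-12-01 07:45:00"]),
    "IsCancelled": [False],
})
toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")

buffer = io.StringIO()  # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Dates come back as strings unless parse_dates is given
restored = pd.read_csv(buffer, parse_dates=["InvoiceDate"])
print(restored.dtypes)
```

In this sketch the period column comes back as plain text such as `2009-12`, which is worth keeping in mind when defining column types on the SQL side.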


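**For consistency in SQL, we normalize the column names of the cleaned DataFrame to lowercase with underscores.** Note that this cell runs after the export in In [12], so the CSV written above keeps the original column names unless the export is re-run.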

In [13]:
df_clean.columns = (
    df_clean.columns
    .str.strip()             
    .str.lower()             
    .str.replace(' ', '_')   
    .str.replace(r'[^\w]', '', regex=True) 
)
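
To make the effect of each step in the chain concrete, here is the same sequence applied to a few sample names. The names with padding and parentheses are hypothetical, chosen to exercise every step.

```python
import pandas as pd

names = pd.Index(["Customer ID", " InvoiceDate ", "Price (GBP)"])  # toy, hypothetical names

normalized = (
    names
    .str.strip()                              # drop surrounding whitespace
    .str.lower()                              # lowercase
    .str.replace(" ", "_")                    # spaces -> underscores
    .str.replace(r"[^\w]", "", regex=True)    # drop remaining non-word characters
)
print(normalized.tolist())
# → ['customer_id', 'invoicedate', 'price_gbp']
```

Only names that contain a space gain an underscore, which is why `InvoiceDate` becomes `invoicedate` rather than `invoice_date`.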

In [14]:
df_clean.head(10)

Unnamed: 0,invoice,stockcode,description,quantity,invoicedate,price,customer_id,country,revenue,yearmonth,iscancelled
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085,United Kingdom,83.4,2009-12,False
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom,81.0,2009-12,False
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom,81.0,2009-12,False
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085,United Kingdom,100.8,2009-12,False
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085,United Kingdom,30.0,2009-12,False
5,489434,22064,PINK DOUGHNUT TRINKET POT,24,2009-12-01 07:45:00,1.65,13085,United Kingdom,39.6,2009-12,False
6,489434,21871,SAVE THE PLANET MUG,24,2009-12-01 07:45:00,1.25,13085,United Kingdom,30.0,2009-12,False
7,489434,21523,FANCY FONT HOME SWEET HOME DOORMAT,10,2009-12-01 07:45:00,5.95,13085,United Kingdom,59.5,2009-12,False
8,489435,22350,CAT BOWL,12,2009-12-01 07:46:00,2.55,13085,United Kingdom,30.6,2009-12,False
9,489435,22349,"DOG BOWL , CHASING BALL DESIGN",12,2009-12-01 07:46:00,3.75,13085,United Kingdom,45.0,2009-12,False


In [15]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 11 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   invoice      1067371 non-null  object        
 1   stockcode    1067371 non-null  object        
 2   description  1067371 non-null  object        
 3   quantity     1067371 non-null  int64         
 4   invoicedate  1067371 non-null  datetime64[ns]
 5   price        1067371 non-null  float64       
 6   customer_id  1067371 non-null  int32         
 7   country      1067371 non-null  object        
 8   revenue      1067371 non-null  float64       
 9   yearmonth    1067371 non-null  period[M]     
 10  iscancelled  1067371 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(2), int32(1), int64(1), object(4), period[M](1)
memory usage: 78.4+ MB


The cleaned dataset has been exported and is ready to be used for **SQL queries and RFM segmentation** in the follow-up notebook.