# Data Cleaning & Preparation – Online Retail Dataset

This notebook performs data cleaning and transformation on the [UCI Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail). The cleaned dataset will be exported as a CSV for SQL analysis in a separate notebook.

---

## 1. Load the Dataset

In [1]:
import pandas as pd
df = pd.read_excel('/kaggle/input/online-retail/Online Retail.xlsx')

## 2. Initial Data Exploration


In [2]:
# Dataset shape
print("Shape:", df.shape)

# Dataset info
df.info()

# Summary statistics (not rendered below: a cell displays only its last expression)
df.describe(include='all')

# Check for missing values
df.isnull().sum()

Shape: (541909, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
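
The missing-value counts above are easier to judge as shares of the 541,909 rows. A minimal sketch that uses the counts printed above instead of the DataFrame itself (the variable names are illustrative, not part of the notebook):

```python
# Counts copied from the output above; variable names are illustrative only.
total_rows = 541909
missing = {"Description": 1454, "CustomerID": 135080}

# Share of rows with a missing value, as a percentage
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}
print(missing_pct)
```

On the loaded DataFrame, `df.isnull().mean() * 100` should give the same shares for every column at once.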

## 3. Data Cleaning


In [3]:
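# InvoiceDate is already datetime64[ns] after read_excel (see df.info() above);
# this call is a safeguard, and errors='coerce' turns unparseable values into NaT
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst=True, errors='coerce')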

In [4]:
# Fill missing Description with "Unknown"
df['Description'] = df['Description'].fillna("Unknown")

In [5]:
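# Convert CustomerID to text and fill missing values with a blank string.
# The column is float64, so cast to nullable Int64 first to avoid IDs like '17850.0'.
df['CustomerID'] = df['CustomerID'].astype('Int64').astype('string').fillna('')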
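
Because `CustomerID` is loaded as float64, how it is converted to text matters: plain `astype(str)` on float values yields strings such as `17850.0`. A minimal sketch on a toy Series (the values are illustrative) comparing that with a conversion through the nullable integer dtype:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the float64 CustomerID column (values are illustrative)
ids = pd.Series([17850.0, np.nan, 13047.0])

# Plain string conversion keeps the float formatting
naive = ids.fillna('').astype(str)

# Going through the nullable integer dtype first drops the ".0"
clean = ids.astype('Int64').astype('string').fillna('')

print(naive.tolist())
print(clean.tolist())
```

The second form assumes every non-missing ID is a whole number, which is what the cast to `Int64` requires.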

In [6]:
# Flag cancelled orders (InvoiceNo starts with 'C'); rows are kept, not removed
df['IsCancelled'] = df['InvoiceNo'].astype(str).str.startswith('C')
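
The flag keeps cancelled rows in the data, so later analysis can include or exclude them as needed. A minimal sketch on a toy frame (invoice numbers and quantities are illustrative) showing how the flag might be used to split the two groups:

```python
import pandas as pd

# Toy invoices; a leading 'C' marks a cancellation (values are illustrative)
toy = pd.DataFrame({
    'InvoiceNo': [536365, 'C536379', 536381],
    'Quantity': [6, -1, 2],
})

# Same flag logic as the cell above
toy['IsCancelled'] = toy['InvoiceNo'].astype(str).str.startswith('C')

# Boolean indexing separates regular sales from cancellations
sales = toy[~toy['IsCancelled']]
cancellations = toy[toy['IsCancelled']]
print(len(sales), len(cancellations))
```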

In [7]:
# Add Revenue (Quantity * UnitPrice); negative for cancellations and returns
df['Revenue'] = df['Quantity'] * df['UnitPrice']

In [8]:
# Add YearMonth (Period)
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
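
A monthly period column is convenient for grouping, and it is written to CSV as text such as `2010-12`. A minimal sketch on a toy frame (dates and amounts are illustrative) showing both behaviours:

```python
import pandas as pd

# Toy rows spanning two months (values are illustrative)
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2010-12-01', '2010-12-15', '2011-01-04']),
    'Revenue': [15.30, 20.34, 10.00],
})
toy['YearMonth'] = toy['InvoiceDate'].dt.to_period('M')

# Total revenue per calendar month
monthly = toy.groupby('YearMonth')['Revenue'].sum()
print(monthly)

# String form of a monthly Period, as it would appear in a CSV export
print(str(toy['YearMonth'].iloc[0]))
```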

In [9]:
# Select relevant columns
df_clean = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity',
               'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country',
               'Revenue', 'YearMonth', 'IsCancelled']]


In [12]:
# Export to cleaned CSV (for use in SQL Notebook)
df_clean.to_csv('/kaggle/working/OnlineRetail_Clean.csv', index=False, encoding='utf-8')
print("Cleaned dataset saved: OnlineRetail_Clean.csv")

Cleaned dataset saved: OnlineRetail_Clean.csv
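
CSV does not store column types, so whatever reads the file back has to restore them. A minimal sketch, assuming the file is reloaded with pandas; the toy CSV text below only mimics a few of the exported columns, and its values are illustrative:

```python
import io
import pandas as pd

# Toy CSV text mimicking a few columns of the exported file (values illustrative)
csv_text = (
    "InvoiceNo,InvoiceDate,CustomerID,YearMonth\n"
    "536365,2010-12-01 08:26:00,17850,2010-12\n"
    "C536379,2010-12-01 09:41:00,,2010-12\n"
)

loaded = pd.read_csv(
    io.StringIO(csv_text),
    # Read identifiers as text so they are not re-parsed as numbers
    dtype={'InvoiceNo': str, 'CustomerID': str, 'YearMonth': str},
    parse_dates=['InvoiceDate'],
    keep_default_na=False,  # keep a blank CustomerID as '' instead of NaN
)
print(loaded.dtypes)
```

For the real file, the same arguments could be passed with the path `/kaggle/working/OnlineRetail_Clean.csv` in place of the `StringIO` object.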
