Data Vetting: I used .head(), .tail(), .info(), and .describe() to surface the main issues: negative quantities, non-positive unit prices, missing Customer IDs and descriptions, and InvoiceDate stored as a string (object) instead of a datetime.

Data Cleaning: I dropped rows with missing descriptions, removed duplicate rows, converted InvoiceDate to a proper datetime, and filtered out non-sale transactions (returns and free items). Missing Customer IDs are left for the next notebook.

Feature Engineering: I created the critical Revenue column.

First Insight: I generated the Top 10 Countries by Revenue for NovaShop, giving them their first actionable data point from the clean set.

In [3]:
import pandas as pd

In [11]:
# Data loading and sampling
FILE_PATH = '../data/Online_Retail.csv'
SAMPLE_FRACTION = 0.1  # We will sample 10% of the data
full_df = pd.read_csv(FILE_PATH, encoding='unicode_escape')

# Take a random 10% sample for faster processing
sampled_df = full_df.sample(frac=SAMPLE_FRACTION, random_state=42).reset_index(drop=True)

In [12]:
sampled_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,555200,71459,HANGING JAM JAR T-LIGHT HOLDER,24,2011-06-01 12:05:00,0.85,17315.0,United Kingdom
1,554974,21128,GOLD FISHING GNOME,4,2011-05-27 17:14:00,6.95,14031.0,United Kingdom
2,550972,21086,SET/6 RED SPOTTY PAPER CUPS,4,2011-04-21 17:05:00,0.65,14031.0,United Kingdom
3,576652,22812,PACK 3 BOXES CHRISTMAS PANETTONE,3,2011-11-16 10:39:00,1.95,17198.0,United Kingdom
4,546157,22180,RETROSPOT LAMP,2,2011-03-10 08:40:00,9.95,13502.0,United Kingdom


In [13]:
sampled_df.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
54186,581433,22960,JAM MAKING SET WITH JARS,6,2011-12-08 15:54:00,4.25,14911.0,EIRE
54187,551993,85132C,CHARLIE AND LOLA FIGURES TINS,4,2011-05-05 15:39:00,1.95,14697.0,United Kingdom
54188,C559253,22326,ROUND SNACK BOXES SET OF4 WOODLAND,-1,2011-07-07 11:44:00,2.95,12472.0,Germany
54189,559491,21078,SET/20 STRAWBERRY PAPER NAPKINS,1,2011-07-08 13:53:00,2.46,,United Kingdom
54190,548992,21452,TOADSTOOL MONEY BOX,12,2011-04-05 13:52:00,2.95,13089.0,United Kingdom


In [14]:
sampled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54191 entries, 0 to 54190
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    54191 non-null  object 
 1   StockCode    54191 non-null  object 
 2   Description  54042 non-null  object 
 3   Quantity     54191 non-null  int64  
 4   InvoiceDate  54191 non-null  object 
 5   UnitPrice    54191 non-null  float64
 6   CustomerID   40551 non-null  float64
 7   Country      54191 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 3.3+ MB


In [15]:
sampled_df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,54191.0,54191.0,40551.0
mean,10.638962,4.667784,15279.710167
std,321.689757,88.434068,1715.070628
min,-2472.0,0.0,12346.0
25%,1.0,1.25,13950.5
50%,3.0,2.08,15122.0
75%,10.0,4.13,16791.0
max,74215.0,13541.33,18287.0


Handling Missing Values
1. Missing Descriptions
2. Missing Customer IDs (not needed for this first EDA notebook; handled in the next notebook)
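
To size both gaps before deciding what to drop, a per-column missing-value summary can help. The sketch below uses a small toy frame (`toy_df`, with made-up rows) in place of `sampled_df`; the column names match the notebook, everything else is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for sampled_df (hypothetical rows, same column names)
toy_df = pd.DataFrame({
    'Description': ['RETROSPOT LAMP', None, 'TOADSTOOL MONEY BOX', 'GOLD FISHING GNOME'],
    'CustomerID': [13502.0, None, None, 14031.0],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'missing_count': toy_df.isna().sum(),
    'missing_pct': (toy_df.isna().mean() * 100).round(2),
})
print(missing_summary)
```

Running the same two expressions on `sampled_df` shows at a glance that CustomerID has far more gaps than Description, which is why it gets separate treatment later.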

In [16]:
# Drop rows where the Description is missing
sampled_df.dropna(subset=['Description'], inplace=True)
# checking for NULL count
sampled_df['Description'].isnull().sum()

0

Data Type Conversion and Duplicate Removal

In [17]:
# Convert InvoiceDate from object (string) to datetime
sampled_df['InvoiceDate'] = pd.to_datetime(sampled_df['InvoiceDate'])
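
A possible variant of the conversion, sketched on toy data: passing an explicit `format` and `errors='coerce'` turns unparseable strings into `NaT` instead of raising, so bad timestamps can be counted. The format string below is an assumption based on how the dates appear in the `.head()` output; the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the InvoiceDate column (second value is deliberately malformed)
raw_dates = pd.Series(['2011-06-01 12:05:00', 'not a date', '2011-12-08 15:54:00'])

# Explicit format is assumed from the displayed data; errors='coerce' yields NaT on failure
parsed_dates = pd.to_datetime(raw_dates, format='%Y-%m-%d %H:%M:%S', errors='coerce')

num_unparsed = parsed_dates.isna().sum()
print(parsed_dates.dtype)
print(f'Unparseable timestamps: {num_unparsed}')
```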

In [19]:
# Count fully duplicated rows before removing them
num_duplicate = sampled_df.duplicated().sum()
print(f'Found {num_duplicate} rows.')

Found 62 rows.


In [21]:
# Remove duplicates, keeping the first instance
sampled_df.drop_duplicates(inplace=True)
print(f'Dataframe size after removing duplicates is: {len(sampled_df)} rows.')

Dataframe size after removing duplicates is: 53980 rows.
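
Before dropping duplicates it can be worth looking at them. A minimal sketch on a toy frame (hypothetical rows): `duplicated(keep=False)` flags every copy of a repeated row, while `drop_duplicates(keep='first')` keeps one of each.

```python
import pandas as pd

# Toy stand-in for sampled_df: the first two rows are exact duplicates
toy_df = pd.DataFrame({
    'InvoiceNo': ['555200', '555200', '554974'],
    'StockCode': ['71459', '71459', '21128'],
    'Quantity': [24, 24, 4],
})

# keep=False marks all copies, which makes the duplicated groups easy to inspect
all_copies = toy_df[toy_df.duplicated(keep=False)]
print(all_copies)

# keep='first' (the default) retains the first occurrence of each row
deduped_df = toy_df.drop_duplicates(keep='first')
print(f'Rows after de-duplication: {len(deduped_df)}')
```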


Filtering Out Returns and Errors:
* Handling Negative Quantities and Prices


In [22]:
# Filter: Keep only transactions where Quantity is positive (i.e., sales, not returns)
sampled_df = sampled_df[sampled_df['Quantity'] > 0] 

# Filter: Keep only transactions where UnitPrice is positive (i.e., not free or an error)
sampled_df = sampled_df[sampled_df['UnitPrice'] > 0]

print(f'Dataframe size after filtering is: {len(sampled_df)} rows.')

Dataframe size after filtering is: 52933 rows.
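
The two filters can also be written as named boolean masks, which makes it easy to count what each one removes. This is a sketch on toy data; treating an InvoiceNo that starts with 'C' as a cancellation is an assumption suggested by the `C559253` row in the `.tail()` output, not something the notebook verifies.

```python
import pandas as pd

# Toy stand-in for sampled_df (hypothetical rows)
toy_df = pd.DataFrame({
    'InvoiceNo': ['555200', 'C559253', '554974', '550972'],
    'Quantity': [24, -1, 4, 4],
    'UnitPrice': [0.85, 2.95, 6.95, 0.0],
})

# Assumption: a leading 'C' in InvoiceNo marks a cancelled invoice
is_cancellation = toy_df['InvoiceNo'].astype(str).str.startswith('C')
is_return = toy_df['Quantity'] <= 0
is_free_or_error = toy_df['UnitPrice'] <= 0

print(f'Cancellations: {is_cancellation.sum()}, returns: {is_return.sum()}, '
      f'free/error prices: {is_free_or_error.sum()}')

# Apply both filters at once; .copy() gives an independent frame for later column assignments
sales_df = toy_df[~is_return & ~is_free_or_error].copy()
print(f'Rows kept: {len(sales_df)}')
```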


Cleaning Text Columns
* Strip Whitespace
* Standardize Country

In [23]:
# Cleaning Text Columns
# Strip leading/trailing whitespace from Description
sampled_df['Description'] = sampled_df['Description'].str.strip()

# Handle specific country name inconsistencies
# EIRE is a common inconsistency in this dataset for Ireland
# Assign the result back instead of calling replace(..., inplace=True) on the column:
# the chained inplace form raises a FutureWarning and will stop working in pandas 3.0
sampled_df['Country'] = sampled_df['Country'].replace('EIRE', 'IRELAND')

print('Text columns cleaned and standardized.')

Text columns cleaned and standardized.


Feature Engineering: 

Creating the Revenue Column

In [27]:
sampled_df['Revenue'] = sampled_df['Quantity'] * sampled_df['UnitPrice']

print('Revenue feature created.')

Revenue feature created.


First Insight: 
Which Countries Drive Revenue?

In [30]:
# Group by Country, sum the Revenue, and sort for the top 10
top_countries = sampled_df.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10)

print('\n--- Top 10 Countries by Revenue (GBP) ---')
print(top_countries)


--- Top 10 Countries by Revenue (GBP) ---
Country
United Kingdom    941268.661
Netherlands        27435.830
IRELAND            26066.000
France             23645.330
Germany            22389.510
Australia          12429.990
Spain               5600.900
Switzerland         5483.890
Hong Kong           3597.850
Belgium             3593.510
Name: Revenue, dtype: float64
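
Absolute totals hide how concentrated the revenue is, so a share-of-total view is a natural follow-up. The sketch below uses a toy Series with made-up figures (not the values above) to show the calculation; on the real data the same expression would be applied to the full per-country sums, not just the top 10.

```python
import pandas as pd

# Toy per-country revenue totals (hypothetical figures, chosen to sum to 1000)
revenue_by_country = pd.Series(
    {'United Kingdom': 800.0, 'Netherlands': 120.0, 'France': 80.0},
    name='Revenue',
)

# Each country's share of total revenue, in percent
share_pct = (revenue_by_country / revenue_by_country.sum() * 100).round(1)
print(share_pct.sort_values(ascending=False))
```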


In [32]:
# Export the cleaned data for the next notebook
sampled_df.to_csv('../data/online_retail_cleaned.csv', index=False)