# Online Retail Business Analysis
## Data Cleaning

In this first section, we set up our Extract, Transform and Load (ETL) pipeline.
It cleans and prepares the data so that the later analysis is accurate and streamlined.

In [None]:
import pandas as pd # import pandas for data manipulation


In [None]:
file_path = '../data/raw_data/online_retail.csv'
df = pd.read_csv(file_path, encoding='ISO-8859-1')
df.head()  # Display the first few rows of the DataFrame

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


Now that the dataset is loaded, we first need to inspect it.

In [12]:
# Shape of the dataset
print("Shape:", df.shape)

# First few rows
df.head()

Shape: (541909, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


The shape is (541909, 8): 541909 rows by 8 columns.
Next, let us look at a summary of our DataFrame:

In [13]:
# Summary for numeric columns (not rendered here: only a cell's last expression is displayed)
df.describe()

# Data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   541909 non-null  int64  
 7   Country      541909 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 33.1+ MB
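
To see both summaries from a single cell, each result can be printed explicitly. The following is a minimal sketch on a small hypothetical DataFrame; `sample`, `summary` and `info_text` are illustrative names, and the values are not taken from the dataset:

```python
import io
import pandas as pd

# Hypothetical sample mirroring a few of the dataset's columns
sample = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER"],
    "Quantity": [6, 8, -2],
    "UnitPrice": [3.39, 2.75, 0.0],
})

# Keep the numeric summary in a variable and print it explicitly
summary = sample.describe()
print(summary)

# info() writes to stdout by default; passing a buffer captures the text
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

Printing `summary` also makes minimum values visible, which is a quick way to spot negative or zero entries in numeric columns.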


Let us check for missing values:

In [14]:
# Count missing values per column
df.isnull().sum().sort_values(ascending=False)

Description    1454
InvoiceNo         0
StockCode         0
Quantity          0
InvoiceDate       0
UnitPrice         0
CustomerID        0
Country           0
dtype: int64

We can see that the 'Description' column has 1454 missing entries.
Because we will use the product description in later analysis, we remove the rows where it is missing so that incomplete records do not distort the results.

In [15]:
# Drop rows with missing product descriptions
df = df[df['Description'].notnull()]
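
The same drop can also be written with `dropna`, which makes it straightforward to report how many rows were removed. This is a minimal sketch on a small hypothetical DataFrame; `sample`, `cleaned` and `rows_removed` are illustrative names:

```python
import pandas as pd

# Hypothetical rows; the second one has no product description
sample = pd.DataFrame({
    "StockCode": ["85123A", "71053", "84406B"],
    "Description": ["WHITE HANGING HEART T-LIGHT HOLDER", None, "CREAM CUPID HEARTS COAT HANGER"],
})

rows_before = len(sample)

# Drop only rows where Description is missing; other columns are not checked
cleaned = sample.dropna(subset=["Description"])

rows_removed = rows_before - len(cleaned)
print(f"Removed {rows_removed} rows with missing Description.")
```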

In [16]:
# Verify that no missing values remain in any column
df.isnull().sum().sort_values(ascending=False)

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

We can now see that no missing values remain in the 'Description' column.
With no missing values left, let us move on to removing negative or zero quantities and prices:

In [17]:
# Filter out non-positive quantity and price
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

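Let's try to log how many rows were removed. Because the filter was already applied in the previous cell, re-applying it here removes nothing; to measure the filter's effect, the row count must be captured before the filter first runs:

In [18]:
# Row count here is already post-filter (the filter ran in the previous cell)
initial_rows = df.shape[0]

# Re-applying the same filter does not remove any further rows
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]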

final_rows = df.shape[0]
print(f"Removed {initial_rows - final_rows} rows with non-positive Quantity or UnitPrice.")

Removed 0 rows with non-positive Quantity or UnitPrice.
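
For the removal count to be meaningful, the number of rows has to be recorded before the filter is first applied. A minimal sketch of that ordering on a small hypothetical DataFrame follows; `sample`, `mask` and `filtered` are illustrative names:

```python
import pandas as pd

# Hypothetical rows: three of the five have a non-positive Quantity or UnitPrice
sample = pd.DataFrame({
    "Quantity": [6, -1, 8, 0, 3],
    "UnitPrice": [2.55, 3.39, 0.0, 1.25, 4.95],
})

# Record the row count BEFORE filtering
initial_rows = sample.shape[0]

# Keep only rows where both Quantity and UnitPrice are strictly positive
mask = (sample["Quantity"] > 0) & (sample["UnitPrice"] > 0)
filtered = sample[mask]

final_rows = filtered.shape[0]
removed_rows = initial_rows - final_rows
print(f"Removed {removed_rows} rows with non-positive Quantity or UnitPrice.")
```

Assigning the result to a new name keeps the unfiltered frame available, so the two can still be compared afterwards.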


In [19]:
print("Shape:", df.shape)

Shape: (530104, 8)


Comparing this with the original 541909 rows, we have removed 11,805 rows in total: 1,454 with a missing description and 10,351 with a non-positive Quantity or UnitPrice.
We now have 530104 rows remaining in our dataset.
Our next step is to check for duplicate entries:

In [20]:
# Count duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Duplicate rows: {duplicate_count}")

Duplicate rows: 5226


We can see that we also have 5226 duplicate rows.
We need to remove these so that repeated records are not counted more than once in our analysis:

In [21]:
# Drop duplicates
df = df.drop_duplicates()

# Confirm change
print(f"Shape after removing duplicates: {df.shape}")

Shape after removing duplicates: (524878, 8)
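
Before dropping duplicates it can help to look at the rows involved. By default `duplicated()` flags only the later repeats, while `keep=False` flags every member of a duplicate group. This is a minimal sketch on hypothetical data; `sample`, `all_members` and `deduplicated` are illustrative names:

```python
import pandas as pd

# Hypothetical rows: rows 0, 1 and 3 are identical
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536365"],
    "StockCode": ["71053", "71053", "22633", "71053"],
    "Quantity": [6, 6, 6, 6],
})

# Default keep="first": only the later repeats are flagged
repeat_count = sample.duplicated().sum()

# keep=False: every row belonging to a duplicate group is flagged
all_members = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group
deduplicated = sample.drop_duplicates()

print(f"Repeats: {repeat_count}")
print(f"Rows in duplicate groups: {len(all_members)}")
print(f"Rows kept: {len(deduplicated)}")
```

The number of rows dropped equals the default `duplicated().sum()` count, which matches the difference between the two shapes reported above (530104 - 5226 = 524878).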


### Feature Engineering
Now that we have handled missing values, non-positive values and duplicates, we are ready to add features to our DataFrame.
We have determined that it would be useful for our analysis to have a 'TotalPrice' column.
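
One plausible definition of this column is the line total, `Quantity * UnitPrice`. That formula is an assumption here, since the text does not spell it out. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows with the two columns the assumed formula needs
sample = pd.DataFrame({
    "Quantity": [6, 8, 2],
    "UnitPrice": [2.55, 2.75, 3.39],
})

# Assumed definition: line total = units sold * price per unit,
# rounded to 2 decimals to avoid floating-point noise in currency values
sample["TotalPrice"] = (sample["Quantity"] * sample["UnitPrice"]).round(2)
print(sample)
```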