# Portfolio Project: Online Retail EDA with Python

# Project Overview  


As an entry-level data analyst, I’m exploring transactional data from an online retail company to uncover insights that can drive better business decisions. This dataset includes customer purchases, product details, quantities, prices, and timestamps.  


### Objectives  
- Perform **Exploratory Data Analysis (EDA)** to identify key trends.  
- Analyze **sales performance, customer behavior, and popular products**.  
- Provide **data-driven recommendations** to optimize online retail strategies.  


### My Approach  
To tackle this project, I’ll start with an **ETL (Extract, Transform, Load)** pass to clean and prepare the dataset. Then I’ll analyze the data in depth to surface trends such as **busiest sales periods, top-selling products, and high-value customers**. Let's dive in!  


# Dataset Overview


For this project, I'll be working with the **Online Retail** dataset, which contains transactional data from an online store between 2010 and 2011. The dataset is in a `.csv` file named **`online_retail.csv`**, and it includes details about purchases such as product descriptions, quantities, prices, timestamps, and customer IDs.  


### Data Columns  
The dataset consists of the following fields:  
- **InvoiceNo** – Unique invoice number for each transaction.  
- **StockCode** – Unique product identifier.  
- **Description** – Product name/description.  
- **Quantity** – Number of units purchased.  
- **InvoiceDate** – Timestamp of the transaction.  
- **UnitPrice** – Price per unit of the product.  
- **CustomerID** – Unique identifier for each customer.  
- **Country** – Country where the transaction took place.  


### My Approach  

To analyze this dataset effectively, I’ll break the process into key steps:  

1. **Load the data** into a Pandas DataFrame and inspect the first few rows.  
2. **Clean the dataset** by handling missing values and removing unnecessary columns.  
3. **Explore basic statistics** to understand distributions and trends.  
4. **Visualize the data** using plots such as histograms, bar charts, and scatter plots.  
5. **Analyze sales trends** over time to identify peak sales periods.  
6. **Identify top-selling products and countries** based on quantity sold.  
7. **Detect anomalies or outliers** that may impact the analysis.  
8. **Summarize key findings** and insights from the data.  

Let's dive in and explore the dataset!  


# ETL

## 01. Load the data

Import the required libraries and load the dataset.

In [1]:
import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns

# Set the encoding explicitly to avoid a UnicodeDecodeError (encoding="Windows-1252" also works)
df = pd.read_csv("source/online_retail.csv", encoding="ISO-8859-1")
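
To see why the `encoding` argument matters, here is a minimal sketch using a tiny in-memory CSV instead of the real file. The sample row and the `raw` and `sample` names are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Hypothetical one-row CSV, encoded as ISO-8859-1 (not taken from the real dataset).
raw = "Description,UnitPrice\nCAFÉ SIGN,2.55\n".encode("ISO-8859-1")

# Decoding these bytes as UTF-8 fails, because "É" is a single byte in ISO-8859-1
# and that byte is not a valid UTF-8 sequence on its own.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding to read_csv decodes the text correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")

print(f"Decodes as UTF-8: {utf8_ok}")
print(sample)
```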

Explore the dataset to get familiar with its structure.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [None]:
df.describe()

# TODO: Found some negative values in quantity and unit price. We need to check if they are valid or not.

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
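
The negative minimums in `Quantity` and `UnitPrice` can be inspected with boolean filters. The sketch below runs on a small hypothetical frame (the values are made up, only the column names come from the dataset). On the real data, the same filters would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample rows for illustration; only the column names come from the dataset.
sample = pd.DataFrame({
    "Quantity": [6, -2, 3, 1],
    "UnitPrice": [2.55, 1.25, -4.00, 0.85],
})

# Select the rows where each column is below zero.
negative_qty = sample[sample["Quantity"] < 0]
negative_price = sample[sample["UnitPrice"] < 0]

print(f"Rows with negative Quantity: {len(negative_qty)}")
print(f"Rows with negative UnitPrice: {len(negative_price)}")
```

Counting these rows first helps decide whether they are a handful of corrections or a meaningful share of the data.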


## 02. Clean the dataset

Now that we have identified the data types of each column and detected any missing (null) values, we have a clearer understanding of how to approach the ETL process.

Before proceeding, let's create a copy of the dataframe to preserve the original data in its unaltered state.

In [3]:
df_clean = df.copy()

### Transforming data types

With the copy created, we will begin by modifying the data types of specific columns.  
In this case, we will convert the `Country`, `InvoiceNo`, and `StockCode` columns from the object type to the category type.  
This transformation will optimize memory usage and improve performance when handling these columns in Pandas.


In [4]:
df_clean['Country'] = df['Country'].astype('category')
df_clean['InvoiceNo'] = df['InvoiceNo'].astype('category')
df_clean['StockCode'] = df['StockCode'].astype('category')

# Ensure the data types were set correctly with: df_clean.info()
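
One way to check the memory claim is to compare `memory_usage(deep=True)` before and after the conversion. This is a rough sketch on a hypothetical low-cardinality column. The exact savings on the real dataset depend on how many distinct values each column has, and columns with many distinct values benefit less.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times.
countries = pd.Series(["United Kingdom", "France", "Germany"] * 1000, name="Country")

# deep=True also counts the memory used by the string values themselves.
as_object = countries.memory_usage(deep=True)
as_category = countries.astype("category").memory_usage(deep=True)

print(f"Original dtype: {as_object} bytes")
print(f"Category dtype: {as_category} bytes")
```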

### Handling missing (null) values

#### Handling missing `CustomerID` values

The dataset contains missing values in the `CustomerID` column, but these transactions are still valid purchases. Instead of dropping them or imputing arbitrary values (which could introduce bias), I will leave them as `NaN`.

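Why?
- Removing these rows would result in **loss of actual transaction data**.
- Imputing fake IDs would be **misleading**, as customer IDs are unique identifiers.
- Pandas **skips `NaN` by default** in aggregations such as `sum` and `mean`, and Matplotlib simply omits `NaN` points. One caveat: `groupby` drops `NaN` keys by default, so per-customer summaries exclude these rows unless `dropna=False` is set.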
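
As a small illustration of how Pandas treats missing `CustomerID` values, the sketch below uses a hypothetical four-row frame. Overall totals keep every row, while a per-customer `groupby` leaves out rows with a `NaN` key unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of them have no CustomerID.
sample = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 17850.0, np.nan],
    "Quantity": [6, 2, 4, 1],
})

# Column-level aggregations use every row, regardless of CustomerID.
total_units = sample["Quantity"].sum()

# By default, groupby drops rows whose key is NaN.
per_customer = sample.groupby("CustomerID")["Quantity"].sum()

# dropna=False keeps a separate NaN group for the unidentified customers.
per_customer_all = sample.groupby("CustomerID", dropna=False)["Quantity"].sum()

print(total_units)
print(per_customer)
print(per_customer_all)
```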

#### Handling missing `Description` values

The dataset contains null values in the `Description` column. Dropping these rows would discard valid transactions, so we impute the missing descriptions from the corresponding `StockCode` values, a column that has no nulls and identifies each product.

For that purpose, we follow these steps:
1. **Create a mapping dictionary** where each `StockCode` points to its correct `Description` (using only rows with non-null descriptions)
2. **Fill null values** by matching each missing `Description` with its `StockCode`'s known description

**Key Note**: If a `StockCode` has no valid description in the dataset, its `NaN` values will remain.

In [17]:
# Step 1: Map StockCode to Description (drop duplicates to ensure 1:1 mapping)
stock_to_desc = (
    df_clean.dropna(subset=['Description'])
    .drop_duplicates('StockCode')
    .set_index('StockCode')['Description']
)

# Step 2: Fill NaN Descriptions using the mapped StockCode values
df_clean['Description'] = df_clean['Description'].fillna(df_clean['StockCode'].map(stock_to_desc))
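
To make the behavior of these two steps concrete, here is the same logic on a hypothetical four-row frame. The stock codes and product names are made up for illustration. A code with a known description elsewhere gets filled, while a code that never has one stays `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: "A1" has one known and one missing description,
# "Z9" has no description anywhere.
sample = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "Z9"],
    "Description": ["RED MUG", np.nan, "BLUE PLATE", np.nan],
})

# Step 1: build the StockCode -> Description lookup from non-null rows.
stock_to_desc = (
    sample.dropna(subset=["Description"])
    .drop_duplicates("StockCode")
    .set_index("StockCode")["Description"]
)

# Step 2: fill missing descriptions from the lookup.
sample["Description"] = sample["Description"].fillna(
    sample["StockCode"].map(stock_to_desc)
)

print(sample)
```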

After this first imputation pass, we check whether any null values remain in the `Description` column before moving on to further analysis.

In [19]:
# Compare the null counts in the 'Description' column before and after imputation.
# The f-strings use double quotes outside so the single-quoted keys also parse on Python versions before 3.12.
print(f"Original Description column null values: {df['Description'].isna().sum()}")
print(f"Updated Description column null values: {df_clean['Description'].isna().sum()}")

Original Description column null values: 1454
Updated Description column null values: 112


##### Imputing Remaining Null Values

After checking for null values, we found that 112 missing descriptions remain out of the initial 1,454 null values. To ensure we don't lose valuable transaction data, we will impute these remaining null values with the placeholder `'Unknown'`. This decision allows us to retain all rows in the dataset while clearly marking the transactions with missing descriptions.

In [23]:
df_clean['Description'] = df_clean['Description'].fillna('Unknown')

# To make sure this worked as intended: print(df_clean['Description'].isnull().sum())

By doing this, we preserve the full dataset while handling missing descriptions in a way that keeps the integrity of our analysis intact.

### Handling duplicates

### Handling negative invalid data