# Portfolio Project: Online Retail EDA with Python

## Project Overview  

This project explores an **online retail transactions dataset**, focusing on **data cleaning, exploratory data analysis (EDA), and deriving business insights**. The dataset contains information about customer purchases, including invoice details, product descriptions, quantities, prices, and customer IDs. 


### Objectives  
- Perform **Exploratory Data Analysis (EDA)** to identify key trends.  
- Analyze **sales performance, customer behavior, and popular products**.  
- Provide **data-driven recommendations** to optimize online retail strategies.  


### My Approach  
To tackle this project, I’ll start with an **ETL (Extract, Transform, Load)** pass to clean and prepare the dataset. Then, I’ll conduct an in-depth analysis to identify key trends and insights such as **busiest sales periods, top-selling products, and high-value customers**. Let's dive in!  


## Dataset Overview


For this project, I'll be working with the **Online Retail** dataset, which contains transactional data from an online store between 2010 and 2011. The dataset is in a `.csv` file named **`online_retail.csv`**, and it includes details about purchases such as product descriptions, quantities, prices, timestamps, and customer IDs.  


### Data Columns  
The dataset consists of the following fields:  
- **InvoiceNo** – Unique invoice number for each transaction.  
- **StockCode** – Unique product identifier.  
- **Description** – Product name/description.  
- **Quantity** – Number of units purchased.  
- **InvoiceDate** – Timestamp of the transaction.  
- **UnitPrice** – Price per unit of the product.  
- **CustomerID** – Unique identifier for each customer.  
- **Country** – Country where the transaction took place.  


### My Approach  

To analyze this dataset effectively, I’ll break the process into key steps:  

1. **Load the data** into a Pandas DataFrame and inspect the first few rows.  
2. **Clean the dataset** by handling missing values and removing unnecessary data.  
3. **Explore basic statistics** to understand distributions and trends.  
4. **Visualize the data** using plots such as histograms, bar charts, and scatter plots.  
5. **Analyze sales trends** over time to identify peak sales periods.  
6. **Identify top-selling products and countries** based on quantity sold.  
7. **Detect anomalies or outliers** that may impact the analysis.  
8. **Summarize key findings** and insights from the data.  

Let's dive in and explore the dataset!  


# ETL

## 01. Load the data

Import the required libraries and load the dataset.

In [33]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("source/online_retail.csv", encoding="ISO-8859-1")  # We use encoding to avoid UnicodeDecodeError (or encoding="Windows-1252")

Explore the dataset to get familiar with its structure.

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [35]:
df.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |


## 02. Clean the dataset

Now that we have identified the data types of each column and detected any missing (null) values, we have a clearer understanding of how to approach the ETL process.

Before proceeding, let's create a copy of the dataframe to preserve the original data in its unaltered state.

In [36]:
df_clean = df.copy()

### Type Casting

With the copy created, we will begin by modifying the data types of specific columns.  
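In this case, we will convert the `Country`, `InvoiceNo`, and `StockCode` columns from the object type to the category type.  
This transformation can reduce memory usage and speed up operations such as grouping, especially for columns with few distinct values like `Country`. The benefit is smaller for columns with many distinct values, such as `InvoiceNo`.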


In [37]:
df_clean['Country'] = df_clean['Country'].astype('category')
df_clean['InvoiceNo'] = df_clean['InvoiceNo'].astype('category')
df_clean['StockCode'] = df_clean['StockCode'].astype('category')

# Ensure the data types were set correctly with: df_clean.info()
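
As a rough illustration of the memory argument, the sketch below compares the footprint of the same column stored as `object` and as `category`. The data is a toy series invented for this example, not the retail dataset, so the exact numbers will differ from what `df_clean` reports.

```python
import pandas as pd

# Toy column with few distinct values repeated many times (illustrative data only)
countries = pd.Series(["United Kingdom", "France", "Germany"] * 10_000)

# deep=True counts the memory held by the Python strings themselves
as_object = countries.astype("object").memory_usage(deep=True)
as_category = countries.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The saving comes from storing each distinct string once and keeping small integer codes per row, which is why it shrinks as the number of distinct values grows.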

After that, we also convert `CustomerID` from `float` to the nullable integer type `Int64`. Customer IDs are whole numbers, and `Int64` (unlike plain `int64`) can still hold the missing values in this column.

In [38]:
df_clean['CustomerID'] = df_clean['CustomerID'].astype('Int64')
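
`InvoiceDate` is still stored as `object` at this point. A possible additional cast, shown here only as a sketch, parses it into a datetime so that time-based analysis becomes easier later on. The month/day/year order is an assumption inferred from values such as `12/16/2010 8:41` in the outputs of this notebook.

```python
import pandas as pd

# Sample timestamps copied from rows shown in this notebook
raw_dates = pd.Series(["5/3/2011 12:34", "12/16/2010 8:41", "10/12/2011 16:17"])

# Assumption: month/day/year order, as suggested by values such as 12/16/2010
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %H:%M")

print(parsed.dt.month.tolist())
print(parsed.dt.hour.tolist())
```

Passing an explicit `format` avoids ambiguous day/month guessing and fails loudly if a value does not match the expected layout.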

### Handling Missing Values

#### Handling missing `CustomerID` values

The dataset contains missing values in the `CustomerID` column, but these transactions are still valid purchases. Instead of dropping them or imputing arbitrary values (which could introduce bias), I will leave them as missing (`<NA>` after the `Int64` cast).

 Why?
- Removing these rows would result in **loss of actual transaction data**.
- Imputing fake IDs would be **misleading**, as customer IDs are unique identifiers.
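- Pandas and Matplotlib **handle missing values gracefully** in most operations: Pandas aggregations such as `sum` and `mean` skip them by default, although `groupby` drops rows whose key is missing unless `dropna=False` is passed.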
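
One consequence of keeping missing IDs is worth seeing on a small example. The sketch below uses a toy frame with made-up values to show that a per-customer `groupby` leaves out rows without a `CustomerID` unless `dropna=False` is passed.

```python
import pandas as pd

# Toy data: one purchase has no customer ID (illustrative values only)
toy = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, None],
    "Quantity": [6, 2, 10],
})

# Default behavior: the row with a missing key is left out of the result
by_customer = toy.groupby("CustomerID")["Quantity"].sum()

# dropna=False keeps a separate group for the missing key
by_customer_all = toy.groupby("CustomerID", dropna=False)["Quantity"].sum()

print(by_customer.sum())      # total over identified customers only
print(by_customer_all.sum())  # total over all rows
```

Totals computed per customer can therefore be lower than totals computed over the whole frame, which matters when comparing the two.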

#### Handling missing `Description` values

The dataset contains null values in the `Description` column. Since these rows cannot be dropped without losing valuable data, we impute the missing descriptions from other rows that share the same `StockCode` (a column with no missing values that identifies the product).

For that purpose, we follow these steps:
1. **Create a mapping dictionary** where each `StockCode` points to its correct `Description` (using only rows with non-null descriptions)
2. **Fill null values** by matching each missing `Description` with its `StockCode`'s known description

**Key Note**: If a `StockCode` has no valid description in the dataset, its `NaN` values will remain.

In [39]:
# Step 1: Map StockCode to Description (drop duplicates to ensure 1:1 mapping)
stock_to_desc = df_clean.dropna(subset=['Description']).drop_duplicates('StockCode').set_index('StockCode')['Description']

# Step 2: Fill NaN Descriptions using the mapped StockCode values
df_clean['Description'] = df_clean['Description'].fillna(df_clean['StockCode'].map(stock_to_desc))
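
To see what these two steps do, here is the same logic applied to a tiny toy frame. The stock codes and descriptions are invented for illustration. One code has a known description elsewhere and gets filled, while the other never has one and stays missing.

```python
import pandas as pd

# Toy data (illustrative): 'A1' has one known and one missing description,
# while 'Z9' never has a description
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "Z9"],
    "Description": ["RED MUG", None, "BLUE PLATE", None],
})

# Step 1: build the StockCode -> Description lookup from non-null rows
lookup = (
    toy.dropna(subset=["Description"])
       .drop_duplicates("StockCode")
       .set_index("StockCode")["Description"]
)

# Step 2: fill missing descriptions from the lookup
toy["Description"] = toy["Description"].fillna(toy["StockCode"].map(lookup))

print(toy)
```

Because `drop_duplicates` keeps the first occurrence, a stock code with several different descriptions is filled with whichever one appears first.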

After handling the preliminary missing values in the `Description` column, it's important to verify if any null values still remain. We will perform this check to ensure that all missing descriptions have been properly handled before moving forward with further analysis.

To do so, we'll check for any remaining nulls in the column.

In [40]:
# This will give us an updated count of the missing values in the 'Description' column
print(f"Original Description column null values: {df['Description'].isna().sum()}")
print(f"Updated Description column null values: {df_clean['Description'].isna().sum()}")

Original Description column null values: 1454
Updated Description column null values: 112


##### Imputing Remaining Null Values

After checking for null values, we found that 112 missing descriptions remain out of the initial 1,454 null values. To ensure we don't lose valuable transaction data, we will impute these remaining null values with the placeholder `'Unknown'`. This decision allows us to retain all rows in the dataset while clearly marking the transactions with missing descriptions.

In [41]:
df_clean['Description'] = df_clean['Description'].fillna('Unknown')

# To make sure this worked as intended: print(df_clean['Description'].isnull().sum())

By doing this, we preserve the full dataset while handling missing descriptions in a way that keeps the integrity of our analysis intact.

### String Processing

Before removing duplicates, we need to ensure that all truly identical rows are recognized as such by Pandas. To achieve this, we will standardize string formatting to eliminate inconsistencies.

We will focus on two key columns: `Description` and `Country`, as they contain string-type data.

`Description`: We will remove leading, trailing, and extra in-between whitespaces and standardize all text to uppercase for consistency.

`Country`: Similarly, we will trim unnecessary spaces and format country names in title case (first letter uppercase, the rest lowercase).

These transformations will help ensure that duplicate records are correctly identified and handled in the next step.

In [42]:
# Clean Description: Remove leading/trailing spaces, handle in-between extra spaces, and standardize to uppercase
df_clean['Description'] = df_clean['Description'].str.strip().str.replace(r'\s+', ' ', regex=True).str.upper()

# Clean Country: Remove leading/trailing spaces, handle in-between extra spaces, and title-case the country names
df_clean['Country'] = df_clean['Country'].str.strip().str.replace(r'\s+', ' ', regex=True).str.title()
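
The effect on duplicate detection can be checked on a small example. The sketch below uses three toy rows (invented values) that differ only in spacing and letter case, and counts duplicates before and after applying the same normalization.

```python
import pandas as pd

# Toy rows that differ only in whitespace and letter case (illustrative values)
toy = pd.DataFrame({
    "Description": ["Red  Mug ", "RED MUG", " red mug"],
    "Country": ["united kingdom", "United Kingdom ", "United  Kingdom"],
})

duplicates_before = toy.duplicated().sum()

# Same normalization as above
toy["Description"] = toy["Description"].str.strip().str.replace(r"\s+", " ", regex=True).str.upper()
toy["Country"] = toy["Country"].str.strip().str.replace(r"\s+", " ", regex=True).str.title()

duplicates_after = toy.duplicated().sum()

print(duplicates_before, duplicates_after)
```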

### Removing Duplicates

Duplicated values can introduce bias and lead to incorrect insights, making it essential to handle them properly.

To begin, we will check for duplicate records in the dataset. Since individual columns may contain duplicate values, our focus will be on identifying and removing rows where all columns are identical.

For this, we will use Pandas' `.drop_duplicates()` method, which efficiently eliminates fully duplicated rows, ensuring data integrity for analysis.

In [43]:
print(f'Number of duplicate rows: {df_clean.duplicated().sum()}')

Number of duplicate rows: 5268


In [44]:
# Remove exact duplicate rows
df_clean = df_clean.drop_duplicates()

### Validating Negative Values

Based on the results of `df.describe()` in the *Load the Data* step, we identified negative values in the `Quantity` and `UnitPrice` columns. Since these values are not expected in a standard sales dataset, we will handle them systematically.

In [45]:
# Check for negative values in Quantity and UnitPrice
print(df_clean[df_clean['Quantity'] < 0].shape[0])
print(df_clean[df_clean['UnitPrice'] < 0].shape[0])

10587
2


Since the number of negative values in the `Quantity` column is significantly higher, we will address them first.

#### Analyzing the negative values in `Quantity`

In [46]:
# Take a look at the negative values in Quantity and look for patterns
df_clean[df_clean['Quantity'] < 0].sample(20)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
173241,551663,84548,CROCHET BEAR RED/BLUE KEYRING,-70,5/3/2011 12:34,0.0,,United Kingdom
394088,C570867,23349,ROLL WRAP VINTAGE CHRISTMAS,-12,10/12/2011 16:17,1.25,12607.0,Usa
250161,C559007,23110,PARISIENNE KEY CABINET,-2,7/5/2011 12:42,5.75,13534.0,United Kingdom
221645,556262,37327,SOLD AS SET ON DOTCOM,-95,6/9/2011 18:04,0.0,,United Kingdom
431996,C573779,22928,YELLOW GIANT GARDEN THERMOMETER,-1,11/1/2011 10:57,5.95,16360.0,United Kingdom
291607,562464,20701,PINK CAT FLORAL CUSHION COVER,-24,8/5/2011 11:31,0.0,,United Kingdom
143310,548684,72798C,SET/4 GARDEN ROSE DINNER CANDLE,-2,4/1/2011 16:46,0.0,,United Kingdom
294923,C562728,22849,BREAD BIN DINER STYLE MINT,-1,8/9/2011 9:41,14.95,12406.0,Denmark
538063,C581384,51008,AFGHAN SLIPPER SOCK PAIR,-2,12/8/2011 13:06,3.45,17673.0,United Kingdom
454402,C575573,21535,RED RETROSPOT SMALL MILK JUG,-1,11/10/2011 11:34,2.55,16705.0,United Kingdom


Upon examining the data, we can observe some patterns:  

1. Most `InvoiceNo` values associated with negative quantities begin with "C," which likely indicates a **credit transaction** (returns).  
2. Some `Description` values suggest special cases, such as "DAMAGED," "DISCOUNT," or "?".  

To better understand these cases, we will analyze how many unique `Description` values are associated with negative quantities.


In [47]:
negative_descriptions = df_clean[df_clean['Quantity'] < 0]['Description'].value_counts()
print(negative_descriptions.count())

2471
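
The first observation (most negative quantities sit on invoices starting with "C") can also be quantified rather than judged from a sample. The sketch below shows one way to do this on toy rows modeled on the sample above. The values are illustrative, and the same expressions would need to be run on `df_clean` to get the real share.

```python
import pandas as pd

# Toy rows modeled on the sample above (values are illustrative)
toy = pd.DataFrame({
    "InvoiceNo": ["C570867", "551663", "C559007", "556262", "536365"],
    "Quantity": [-12, -70, -2, -95, 6],
})

negatives = toy[toy["Quantity"] < 0]

# True where the invoice number carries the credit-note prefix
is_credit_note = negatives["InvoiceNo"].astype(str).str.startswith("C")

print(is_credit_note.value_counts())
print(f"Share on credit notes: {is_credit_note.mean():.0%}")
```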


Given the large number of unique descriptions associated with negative quantities, we will focus on the most frequently occurring ones. This will help us identify common patterns and determine which descriptions may represent special cases that require specific handling.

In [48]:
print(negative_descriptions.head(30))  # Show the top 30 most frequent negative descriptions

Description
MANUAL                                 244
REGENCY CAKESTAND 3 TIER               180
POSTAGE                                126
CHECK                                  123
UNKNOWN                                 97
JAM MAKING SET WITH JARS                87
DISCOUNT                                77
SET OF 3 CAKE TINS PANTRY DESIGN        75
SAMPLES                                 61
DAMAGED                                 57
STRAWBERRY CERAMIC TRINKET BOX          54
ROSES REGENCY TEACUP AND SAUCER         54
RECIPE BOX PANTRY YELLOW DESIGN         47
DAMAGES                                 46
JUMBO BAG RED RETROSPOT                 44
LUNCH BAG RED RETROSPOT                 44
WOOD 2 DRAWER CABINET WHITE FINISH      43
RED RETROSPOT CAKE STAND                42
WHITE HANGING HEART T-LIGHT HOLDER      42
GREEN REGENCY TEACUP AND SAUCER         42
?                                       42
SMALL GLASS HEART TRINKET POT           40
SET OF 3 REGENCY CAKE TINS              37

#### Handling the negative values in `Quantity`

The output shows that some `Description` values represent regular products, while others indicate special cases (discounts, damaged products, or ambiguous values like "?"). We implement a unified classification system:

1. Categorize All Transactions

    Create a `TransactionType` column with three distinct labels:

    * **Return:** Transactions where `InvoiceNo` starts with "C" (credit notes).
    * **SpecialCase:** Transactions with descriptions matching predefined non-product terms (DISCOUNT, DAMAGED, SAMPLES, ?, etc.).
    * **Sale:** All other regular transactions.

2. Process Negative Quantities

    * **Returns:** Keep negatives (valid refund records).
    * **Special Cases:** Preserve original values (context-dependent).
    * **Sales:** Convert negatives to positives (assumed data entry errors).

**Note:** To distinguish legitimate negative quantities from data entry errors:

* Identified the top 30 most frequent descriptions for negative quantities

* Manually selected non-product terms (e.g., `DISCOUNT`, `DAMAGED`, `?`)

Resulting in the curated `special_case_list` used for classification.

In [49]:
# Predefined list of special cases for descriptions
special_case_list = [
    'DISCOUNT', 'DAMAGED', 'DAMAGES', 'SAMPLES', 'CHECK', 'MANUAL', 'POSTAGE', 
    'UNKNOWN', '?', 'AMAZON FEE', 'DOTCOM POSTAGE'
]

In [50]:
def classify_transaction(invoice_no, description):
    """Classify transaction as 'Return', 'SpecialCase', or 'Sale'."""
    if str(invoice_no).startswith('C'):
        return 'Return'
    elif description in special_case_list:
        return 'SpecialCase'
    else:
        return 'Sale'

# Apply the classifier row by row: the built-in map() walks both columns in parallel
df_clean['TransactionType'] = list(map(classify_transaction, df_clean['InvoiceNo'], df_clean['Description']))
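
For larger frames, a vectorized alternative may be worth considering. The sketch below is not the approach used in this notebook. It reproduces the same priority order with `numpy.select` on a toy frame, where `toy` and `special_cases` are hypothetical stand-ins for `df_clean` and `special_case_list`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_clean and special_case_list
toy = pd.DataFrame({
    "InvoiceNo": ["C556445", "537632", "556444", "C558359"],
    "Description": ["MANUAL", "AMAZON FEE", "PICNIC BASKET WICKER 60 PIECES", "SAMPLES"],
})
special_cases = ["MANUAL", "AMAZON FEE", "SAMPLES"]

is_return = toy["InvoiceNo"].astype(str).str.startswith("C")
is_special = toy["Description"].isin(special_cases)

# np.select uses the first matching condition, so 'Return' keeps priority
toy["TransactionType"] = np.select(
    [is_return, is_special], ["Return", "SpecialCase"], default="Sale"
)

print(toy["TransactionType"].tolist())
```

Listing the return condition first mirrors the `if`/`elif` order of `classify_transaction`, so a credit note with a special-case description is still labeled `Return`.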

In [51]:
# Convert negatives to positives ONLY for regular Sales
df_clean.loc[
    (df_clean['TransactionType'] == 'Sale') & 
    (df_clean['Quantity'] < 0), 
    'Quantity'
] = df_clean['Quantity'].abs()  # Or: *= -1

To verify this approach worked as intended, we check if there are any negative `Quantity` values remaining that are neither classified as "Return" nor "SpecialCase".

In [52]:
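# Count negative quantities that are neither Returns nor SpecialCases
remaining_negatives = df_clean[
    (df_clean['Quantity'] < 0) & 
    (~df_clean['TransactionType'].isin(['Return', 'SpecialCase']))
].shape[0]
print(f"Non Return or SpecialCase negative `Quantity` values: {remaining_negatives}")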

Non Return or SpecialCase negative `Quantity` values: 0


Since the result is 0, it confirms that all negative `Quantity` values have been correctly handled according to our classification. We can now proceed to the next step.

#### Analyzing and handling the negative values in `UnitPrice`

As observed earlier, only two transactions have negative `UnitPrice` values. Given their small number, we can inspect them directly as follows:

In [53]:
df_clean[df_clean['UnitPrice'] < 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TransactionType
299983,A563186,B,ADJUST BAD DEBT,1,8/12/2011 14:51,-11062.06,,United Kingdom,Sale
299984,A563187,B,ADJUST BAD DEBT,1,8/12/2011 14:52,-11062.06,,United Kingdom,Sale


Since they are labeled as "ADJUST BAD DEBT," these transactions appear to represent financial adjustments rather than product sales.

As these records are valid, we will keep them in the dataset and classify them under "SpecialCase" in the `TransactionType` column while preserving their negative values.


In [54]:
# Ensure "ADJUST BAD DEBT" transactions are marked as SpecialCase
df_clean.loc[df_clean['Description'] == 'ADJUST BAD DEBT', 'TransactionType'] = 'SpecialCase'

### Handling Outliers

The output from our initial exploration of `UnitPrice` shows that some transactions contain exceptionally high values. Upon further inspection, we found that several of these do not represent actual product sales but rather special cases, such as fees, postage, or adjustments.

1. Identifying High `UnitPrice` Transactions

To detect potential outliers, we examine the top transactions sorted by `UnitPrice`:

- We start with `df_clean.nlargest(10, 'UnitPrice')` to inspect the highest values.

- Many high `UnitPrice` values correspond to non-product transactions, some of which are **not yet classified as SpecialCases** (e.g., "THROW AWAY", "MOULDY, THROWN AWAY.").

- To refine the analysis, we iteratively exclude known non-product descriptions and re-run it to identify the remaining cases.

In [55]:
df_clean.nlargest(10, 'UnitPrice')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TransactionType
222681,C556445,M,MANUAL,-1,6/10/2011 15:31,38970.0,15098.0,United Kingdom,Return
524602,C580605,AMAZONFEE,AMAZON FEE,-1,12/5/2011 11:36,17836.46,,United Kingdom,Return
43702,C540117,AMAZONFEE,AMAZON FEE,-1,1/5/2011 9:55,16888.02,,United Kingdom,Return
43703,C540118,AMAZONFEE,AMAZON FEE,-1,1/5/2011 9:57,16453.71,,United Kingdom,Return
15016,C537630,AMAZONFEE,AMAZON FEE,-1,12/7/2010 15:04,13541.33,,United Kingdom,Return
15017,537632,AMAZONFEE,AMAZON FEE,1,12/7/2010 15:08,13541.33,,United Kingdom,SpecialCase
16356,C537651,AMAZONFEE,AMAZON FEE,-1,12/7/2010 15:49,13541.33,,United Kingdom,Return
16232,C537644,AMAZONFEE,AMAZON FEE,-1,12/7/2010 15:34,13474.79,,United Kingdom,Return
524601,C580604,AMAZONFEE,AMAZON FEE,-1,12/5/2011 11:35,11586.5,,United Kingdom,Return
299982,A563185,B,ADJUST BAD DEBT,1,8/12/2011 14:50,11062.06,,United Kingdom,SpecialCase


2. Filtering Out Non-Product Cases
We applied a filtering step to exclude descriptions that represent fees, adjustments, or other special cases, ensuring that we focus on actual product-related outliers. The filtered descriptions include:

    "AMAZON FEE", "MANUAL", "DOTCOM POSTAGE", "BANK CHARGES", "ADJUST BAD DEBT", "POSTAGE", "DISCOUNT", "CRUK COMMISSION"

In [56]:
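# Descriptions that represent fees, adjustments, or other non-product entries.
# Other non-product descriptions noted during this inspection:
# 'THROW AWAY', 'UNSALEABLE, DESTROYED.', 'MOULDY, THROWN AWAY.'
non_product_descriptions = [
    'AMAZON FEE', 'MANUAL', 'DOTCOM POSTAGE', 'BANK CHARGES',
    'ADJUST BAD DEBT', 'POSTAGE', 'DISCOUNT', 'CRUK COMMISSION'
]

df_clean[~df_clean['Description'].isin(non_product_descriptions)].nlargest(10, 'UnitPrice')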

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TransactionType
222680,556444,22502,PICNIC BASKET WICKER 60 PIECES,60,6/10/2011 15:28,649.5,15098.0,United Kingdom,Sale
222682,556446,22502,PICNIC BASKET WICKER 60 PIECES,1,6/10/2011 15:33,649.5,15098.0,United Kingdom,Sale
242589,C558359,S,SAMPLES,-1,6/28/2011 15:10,570.0,,United Kingdom,Return
4989,536835,22655,VINTAGE RED KITCHEN CABINET,1,12/2/2010 18:06,295.0,13145.0,United Kingdom,Sale
32484,539080,22655,VINTAGE RED KITCHEN CABINET,1,12/16/2010 8:41,295.0,16607.0,United Kingdom,Sale
36165,C539438,22655,VINTAGE RED KITCHEN CABINET,-1,12/17/2010 15:11,295.0,16607.0,United Kingdom,Return
51636,540647,22655,VINTAGE RED KITCHEN CABINET,1,1/10/2011 14:57,295.0,17406.0,United Kingdom,Sale
82768,543253,22655,VINTAGE RED KITCHEN CABINET,1,2/4/2011 15:32,295.0,14842.0,United Kingdom,Sale
87141,C543632,22655,VINTAGE RED KITCHEN CABINET,-1,2/10/2011 16:22,295.0,14842.0,United Kingdom,Return
118769,546480,22656,VINTAGE BLUE KITCHEN CABINET,1,3/14/2011 11:38,295.0,13452.0,United Kingdom,Sale


To handle the identified outliers, we expand the `special_case_list` to include these descriptions and reuse the `classify_transaction` function, ensuring they are correctly categorized in the dataset.

In [57]:
# Expanding the special_case_list with newly identified non-product descriptions
# (entries already in the list, such as "AMAZON FEE" or "POSTAGE", are not repeated)
special_case_list.extend([
    "THROW AWAY", "UNSALEABLE, DESTROYED.", "MOULDY, THROWN AWAY.",
    "BANK CHARGES", "ADJUST BAD DEBT", "CRUK COMMISSION"
])

# Reapplying the classification function to update TransactionType
df_clean['TransactionType'] = list(map(classify_transaction, df_clean['InvoiceNo'], df_clean['Description']))


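Then, we can re-run the high `UnitPrice` analysis to confirm that every extreme value is either flagged as a non-product entry or is a legitimate product sale. Note that `classify_transaction` checks the invoice prefix first, so fee and adjustment rows on credit notes (invoice numbers starting with "C") keep the `Return` label rather than `SpecialCase`.

With this approach, outlier handling is not arbitrary but data-driven: no rows are removed, and non-product entries are labeled so they can be separated from legitimate product transactions.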

In [58]:
df_clean.nlargest(10, 'UnitPrice')

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TransactionType
222681,C556445,M,MANUAL,-1,6/10/2011 15:31,38970.0,15098.0,United Kingdom,Return
524602,C580605,AMAZONFEE,AMAZON FEE,-1,12/5/2011 11:36,17836.46,,United Kingdom,Return
43702,C540117,AMAZONFEE,AMAZON FEE,-1,1/5/2011 9:55,16888.02,,United Kingdom,Return
43703,C540118,AMAZONFEE,AMAZON FEE,-1,1/5/2011 9:57,16453.71,,United Kingdom,Return
15016,C537630,AMAZONFEE,AMAZON FEE,-1,12/7/2010 15:04,13541.33,,United Kingdom,Return
15017,537632,AMAZONFEE,AMAZON FEE,1,12/7/2010 15:08,13541.33,,United Kingdom,SpecialCase
16356,C537651,AMAZONFEE,AMAZON FEE,-1,12/7/2010 15:49,13541.33,,United Kingdom,Return
16232,C537644,AMAZONFEE,AMAZON FEE,-1,12/7/2010 15:34,13474.79,,United Kingdom,Return
524601,C580604,AMAZONFEE,AMAZON FEE,-1,12/5/2011 11:35,11586.5,,United Kingdom,Return
299982,A563185,B,ADJUST BAD DEBT,1,8/12/2011 14:50,11062.06,,United Kingdom,SpecialCase
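
Since the top ten rows are all Returns or SpecialCases, a more direct check is to restrict the ranking to rows labeled `Sale`. The sketch below illustrates the idea on a toy frame built from a few values shown in the outputs above. On the real data the same filter would be applied to `df_clean`.

```python
import pandas as pd

# Toy rows based on the outputs above; TransactionType is assumed to be already assigned
toy = pd.DataFrame({
    "Description": [
        "MANUAL", "AMAZON FEE",
        "PICNIC BASKET WICKER 60 PIECES", "VINTAGE RED KITCHEN CABINET",
    ],
    "UnitPrice": [38970.0, 13541.33, 649.5, 295.0],
    "TransactionType": ["Return", "SpecialCase", "Sale", "Sale"],
})

# Rank unit prices among regular sales only
top_sales = toy[toy["TransactionType"] == "Sale"].nlargest(2, "UnitPrice")

print(top_sales[["Description", "UnitPrice"]])
```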


## 03. Explore basic statistics

After cleaning the dataset, we proceed to explore its basic statistics. This initial analysis provides a foundation for understanding the data’s structure, identifying early patterns, and evaluating potential anomalies or irregularities. It sets the stage for more advanced analytical steps by offering a general snapshot of the dataset’s composition.

We begin by generating summary statistics and exploring key aspects of the data.

In [59]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 536641 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   InvoiceNo        536641 non-null  category
 1   StockCode        536641 non-null  category
 2   Description      536641 non-null  object  
 3   Quantity         536641 non-null  int64   
 4   InvoiceDate      536641 non-null  object  
 5   UnitPrice        536641 non-null  float64 
 6   CustomerID       401604 non-null  Int64   
 7   Country          536641 non-null  object  
 8   TransactionType  536641 non-null  object  
dtypes: Int64(1), category(2), float64(1), int64(1), object(4)
memory usage: 52.8+ MB


**Key Observations:**

- **Rows**: 536,641 transactions
- **Columns**: 9
- **Missing Values**: Most columns are complete, except for `CustomerID`, which is missing in approximately 25% of the records
- **Derived Column**: A new column, `TransactionType`, classifies each row as a `Sale`, `Return`, or `SpecialCase` based on invoice patterns and product descriptions

In [60]:
df_clean['TransactionType'].value_counts(normalize=True)

TransactionType
Sale           0.977879
Return         0.017239
SpecialCase    0.004882
Name: proportion, dtype: float64

The majority of transactions are regular **Sales**, followed by a smaller proportion of **Returns** and **Special Cases**, as expected based on our previous classification process.

### Overview of each transaction type

Having all transactions categorized under `TransactionType`, we can now analyze each group independently without conflating their behaviors. This allows for more accurate insights and better handling of specific cases.

To begin, we use the `.describe()` method to generate summary statistics for each transaction type.

In [61]:
df_clean[df_clean['TransactionType'] == 'Sale'].drop(columns=['CustomerID']).describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 524770.0 | 524770.0 |
| mean | 10.987383 | 3.274727 |
| std | 159.878853 | 4.460465 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 1.25 |
| 50% | 4.0 | 2.08 |
| 75% | 12.0 | 4.13 |
| max | 80995.0 | 649.5 |


In [62]:
df_clean[df_clean['TransactionType'] == 'Return'].drop(columns=['CustomerID']).describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 9251.0 | 9251.0 |
| mean | -29.78705 | 48.57043 |
| std | 1147.997592 | 667.926393 |
| min | -80995.0 | 0.01 |
| 25% | -6.0 | 1.45 |
| 50% | -2.0 | 2.95 |
| 75% | -1.0 | 5.95 |
| max | -1.0 | 38970.0 |


In [63]:
df_clean[df_clean['TransactionType'] == 'SpecialCase'].drop(columns=['CustomerID']).describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 2620.0 | 2620.0 |
| mean | 4.115649 | 121.476855 |
| std | 232.965881 | 580.267542 |
| min | -3000.0 | -11062.06 |
| 25% | 1.0 | 2.95 |
| 50% | 1.0 | 18.0 |
| 75% | 2.0 | 118.3425 |
| max | 5368.0 | 13541.33 |


**Key Takeaways:**

- **Returns** correctly reflect negative quantities, consistent with refund transactions.
- **Special Cases** exhibit high variability in both `Quantity` and `UnitPrice`, supporting their exclusion from regular sales analysis.
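- **Sales** now contain only positive quantities, and unit prices fall mostly in a plausible range, indicating reliable data for core business analysis. The minimum `UnitPrice` of 0.0 and the maximum `Quantity` of 80,995 may still warrant a closer look.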
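
The three separate `.describe()` calls can also be condensed into a single grouped summary. The following is a minimal sketch on a toy frame with made-up rows. The choice of statistics is arbitrary and only meant to show the pattern.

```python
import pandas as pd

# Toy data with one or two rows per transaction type (illustrative values)
toy = pd.DataFrame({
    "TransactionType": ["Sale", "Sale", "Return", "SpecialCase"],
    "Quantity": [60, 1, -1, 1],
    "UnitPrice": [649.5, 295.0, 295.0, 13541.33],
})

# One row per transaction type, one column per (field, statistic) pair
summary = (
    toy.groupby("TransactionType")[["Quantity", "UnitPrice"]]
       .agg(["count", "mean", "min", "max"])
)

print(summary)
```

Having all groups in one table makes it easier to compare them side by side.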


### Quick Revenue metric

As a preliminary step toward deeper financial analysis, we introduce a new column called `Revenue`, calculated as the product of `Quantity` and `UnitPrice`.

The column is computed for every row, so returns and special cases carry their own (often negative) values. Filtering on `TransactionType == 'Sale'` then gives a quick estimate of the revenue generated from regular sales and establishes a foundation for further performance metrics in the following sections.


In [64]:
df_clean['Revenue'] = df_clean['Quantity'] * df_clean['UnitPrice']
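
Because `Revenue` is computed for every row, totals should be taken per transaction type rather than over the whole frame. The sketch below shows this on a toy frame with invented rows. It is only an illustration of the filtering step, not a result from the dataset.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "TransactionType": ["Sale", "Sale", "Return", "SpecialCase"],
    "Quantity": [60, 1, -1, 1],
    "UnitPrice": [649.5, 295.0, 295.0, 13541.33],
})
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]

# Revenue split by transaction type
revenue_by_type = toy.groupby("TransactionType")["Revenue"].sum()

# Revenue from regular sales only
sales_revenue = toy.loc[toy["TransactionType"] == "Sale", "Revenue"].sum()

print(revenue_by_type)
print(sales_revenue)
```

Summing the column without a filter would mix refunds and fee rows into the sales figure.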