# 🛠️ ETL Process Documentation - Hackathon Project

## 📌 Project: Online Retail Transaction Analysis

### 🎯 Objective
Clean, transform, and prepare the *Online Retail* dataset for analysis by removing errors, correcting inconsistencies, and structuring the data for visualisation and insights.

---

In [13]:
import pandas as pd

df = pd.read_csv(r'../data/online_retail.csv')

#### Create a Data Dictionary of the imported dataset 

In [14]:
# Custom Function to create a comprehensive data dictionary for Online Retail datasets
# Takes a DataFrame and returns a data dictionary
def create_data_dictionary(df):
    # Official descriptions from UCI ML Repository Variables Table: https://archive.ics.uci.edu/dataset/352/online+retail
    descriptions = {
        'InvoiceNo': 'Invoice number - 6-digit integral number uniquely assigned to each transaction. If starts with letter "c", indicates a cancellation',
        'StockCode': 'Product (item) code - 5-digit integral number uniquely assigned to each distinct product',
        'Description': 'Product (item) name/description',
        'Quantity': 'The quantities of each product (item) per transaction',
        'InvoiceDate': 'Invoice date and time - the day and time when each transaction was generated',
        'UnitPrice': 'Unit price - product price per unit in sterling (£)',
        'CustomerID': '5-digit integral number uniquely assigned to each customer',
        'Country': 'Country name - the name of the country where each customer resides'
    }
    
    dictionary_data = []
    for column in df.columns:
        # Get 3 sample values (non-null)
        sample_values = df[column].dropna().head(3).tolist()
        sample_str = ', '.join([str(x) for x in sample_values])
        
        dictionary_data.append({
            'Column': column,
            'Data Type': str(df[column].dtype),
            'Missing Values': df[column].isnull().sum(),
            'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
            'Unique Values': df[column].nunique(),
            'Sample Values': sample_str,
            'Description': descriptions.get(column, 'Custom/Engineered column - description needed')
        })
    return pd.DataFrame(dictionary_data)

# Store the dictionary in a variable
raw_data_dictionary = create_data_dictionary(df)

# Display data dictionary
raw_data_dictionary


| | Column | Data Type | Missing Values | Missing % | Unique Values | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| 0 | InvoiceNo | object | 0 | 0.0 | 25900 | 536365, 536365, 536365 | Invoice number - 6-digit integral number uniqu... |
| 1 | StockCode | object | 0 | 0.0 | 4070 | 85123A, 71053, 84406B | Product (item) code - 5-digit integral number ... |
| 2 | Description | object | 1454 | 0.27 | 4223 | WHITE HANGING HEART T-LIGHT HOLDER, WHITE META... | Product (item) name/description |
| 3 | Quantity | int64 | 0 | 0.0 | 722 | 6, 6, 8 | The quantities of each product (item) per tran... |
| 4 | InvoiceDate | object | 0 | 0.0 | 23260 | 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010... | Invoice date and time - the day and time when ... |
| 5 | UnitPrice | float64 | 0 | 0.0 | 1630 | 2.55, 3.39, 2.75 | Unit price - product price per unit in sterlin... |
| 6 | CustomerID | int64 | 0 | 0.0 | 4372 | 17850, 17850, 17850 | 5-digit integral number uniquely assigned to e... |
| 7 | Country | object | 0 | 0.0 | 38 | United Kingdom, United Kingdom, United Kingdom | Country name - the name of the country where e... |


#### Check Invoice Column

- We used Tableau Prep to quickly identify data concerns. The first thing we noticed was that `InvoiceNo` contains string values.

In [15]:
def analyse_invoice_column(df, column_name='InvoiceNo'):
    """
    Analyse the InvoiceNo column to identify different invoice types
    
    Parameters:
    df (pandas.DataFrame): The dataframe to analyze
    column_name (str): The name of the invoice column to analyze (default: 'InvoiceNo')
    
    Prints analysis results including counts and sample values for regular and non-standard invoices.
    """
    print(f"Invoice Column records: {len(df)}")
    
    # Check for different invoice patterns
    regular_invoices = df[df[column_name].astype(str).str.match(r'^\d{6}$')]
    non_standard_invoices = df[~df[column_name].astype(str).str.match(r'^\d{6}$')]
    
    print(f"\nRegular invoices (exactly 6 digits): {len(regular_invoices)} ({len(regular_invoices)/len(df)*100:.1f}%)")
    print(f"Non-standard invoices (not exactly 6 digits): {len(non_standard_invoices)} ({len(non_standard_invoices)/len(df)*100:.1f}%)")
    
    # Show sample invoice numbers for each type
    print("\nExpected invoice format:")
    print(regular_invoices[column_name].head(5).tolist())
    print()
    
    if len(non_standard_invoices) > 0:
        print("\nInvoices with string values:")
        print(non_standard_invoices[column_name].head(10).tolist())
        print()

# Run the analysis on the current dataset
analyse_invoice_column(df)

### Filter out rows where InvoiceNo contains 'C' (cancellations)
- The dataset documentation at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/352/online+retail) states that invoice numbers beginning with 'C' are cancellations. These are outside the scope of our user stories, so we remove them from the dataset.

In [16]:
df = df[~df['InvoiceNo'].astype(str).str.contains('C', case=False, na=False)]
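
The UCI description refers to invoice numbers that *start* with 'C', while the filter above matches a 'C' anywhere in the value. The sketch below compares the two on a small hypothetical frame; `toy` and its values are invented for illustration, and the prefix-only variant is an alternative, not the filter this notebook uses.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration only.
toy = pd.DataFrame({
    'InvoiceNo': ['536365', 'C536379', 'A563185', '536366'],
    'Quantity': [6, -1, 1, 8],
})

invoice = toy['InvoiceNo'].astype(str)

# Filter used in this notebook: a 'C' anywhere, case-insensitive.
contains_c = invoice.str.contains('C', case=False, na=False)

# Stricter alternative: only a leading 'C', as in the UCI description.
starts_with_c = invoice.str.upper().str.startswith('C')

kept_contains = toy[~contains_c]
kept_prefix = toy[~starts_with_c]
```

On this sample both masks keep the same three rows. They would differ only if a 'C' appeared somewhere other than the first character of an invoice number.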


#### Remove rows with zero or negative quantity

In [17]:
df = df[df['Quantity'] > 0]

### Drop rows with null product descriptions

In [18]:
df = df.dropna(subset=['Description'])

### Remove cancelled orders (already filtered out above)
Note: invoices containing 'C' (cancellations) were already removed in the earlier filtering step.

In [19]:
# This step is now redundant as we already filtered out invoices containing 'C' above
# df = df[~df['InvoiceNo'].astype(str).str.contains('C', case=False, na=False)]

### Remove duplicates

In [20]:
df = df.drop_duplicates(subset=['InvoiceNo', 'StockCode', 'Quantity', 'CustomerID'])
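
Because the `subset` above leaves out `Description`, `InvoiceDate`, `UnitPrice` and `Country`, rows that differ only in those columns are treated as duplicates. The sketch below counts how many rows each definition of "duplicate" would flag before anything is dropped. The `toy` frame and the `key_columns` name are hypothetical and only illustrate the behaviour.

```python
import pandas as pd

# Hypothetical sample: the third row differs from the first only in UnitPrice.
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536365', '536366'],
    'StockCode': ['85123A', '85123A', '85123A', '71053'],
    'Quantity': [6, 6, 6, 8],
    'UnitPrice': [2.55, 2.55, 2.95, 3.39],
    'CustomerID': [17850, 17850, 17850, 17850],
})

key_columns = ['InvoiceNo', 'StockCode', 'Quantity', 'CustomerID']

# Rows flagged when only the key columns are compared.
subset_dupes = int(toy.duplicated(subset=key_columns).sum())

# Rows flagged when every column must match.
full_row_dupes = int(toy.duplicated().sum())

# Same call as in the notebook, applied to the sample.
deduplicated = toy.drop_duplicates(subset=key_columns)
```

Here the subset-based check flags two rows while the full-row check flags one, so the row with a different `UnitPrice` is dropped as well. Whether that is desirable depends on the analysis.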

### Remove missing or unspecified countries

In [21]:
df = df[df['Country'].notna()]
df = df[df['Country'] != 'Unspecified']

### Re-analyse Invoice Column After Cleaning
Now let's check the invoice column again after all our cleaning steps to see how the data looks.

In [22]:
# Use the function to re-analyze the invoice column after cleaning
print("=== AFTER CLEANING ===")
analyse_invoice_column(df)

=== AFTER CLEANING ===
Invoice Column records: 524931

Regular invoices (exactly 6 digits): 524928 (100.0%)
Non-standard invoices (not exactly 6 digits): 3 (0.0%)

Expected invoice format:
['536365', '536365', '536365', '536365', '536365']


Invoices with string values:
['A563185', 'A563186', 'A563187']
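
Three invoices with an 'A' prefix survive the cleaning steps. The UCI description quoted in this notebook documents only the 'C' prefix, so these rows need manual inspection. One way to list every prefix is sketched below on a hypothetical sample; `toy` is a stand-in for the cleaned `df`, and the value `536366` is invented.

```python
import pandas as pd

# Stand-in for the cleaned dataframe; '536366' is an invented value.
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536366', 'A563185', 'A563186', 'A563187'],
})

# Capture any leading non-digit characters; purely numeric values give ''.
prefix = toy['InvoiceNo'].astype(str).str.extract(r'^(\D*)', expand=False)

# Count rows per prefix, labelling numeric-only invoices as 'none'.
prefix_counts = prefix.replace('', 'none').value_counts()
```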



### Save the cleaned dataset to CSV for visualisation

In [23]:
df.to_csv('../data/clean_online_retail.csv', index=False)
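
The cleaning rules applied in this notebook can be re-checked in one place before the saved file is used downstream. The helper below is a hypothetical sketch, not part of the original pipeline. The name `check_clean_invariants` and the `toy` frame are assumptions made for illustration.

```python
import pandas as pd

def check_clean_invariants(frame):
    """Hypothetical helper: map each cleaning rule to whether all rows pass."""
    invoice = frame['InvoiceNo'].astype(str)
    country_ok = frame['Country'].notna() & (frame['Country'] != 'Unspecified')
    return {
        'no_cancellations': not invoice.str.contains('C', case=False).any(),
        'positive_quantity': bool((frame['Quantity'] > 0).all()),
        'description_present': bool(frame['Description'].notna().all()),
        'country_specified': bool(country_ok.all()),
    }

# Hypothetical sample standing in for the cleaned dataframe.
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536366'],
    'Quantity': [6, 8],
    'Description': ['WHITE HANGING HEART T-LIGHT HOLDER', 'SAMPLE ITEM'],
    'Country': ['United Kingdom', 'United Kingdom'],
})

results = check_clean_invariants(toy)

# A frame that breaks one rule, to show a failing check.
bad_results = check_clean_invariants(toy.assign(Quantity=[6, -1]))
```

Calling `check_clean_invariants(df)` just before `to_csv` would surface any rule that a later edit to the notebook accidentally breaks.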

#### Create a Data Dictionary of the cleaned dataset 

In [24]:
# Store the dictionary in a variable
clean_data_dictionary = create_data_dictionary(df)

# Display data dictionary
clean_data_dictionary

| | Column | Data Type | Missing Values | Missing % | Unique Values | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| 0 | InvoiceNo | object | 0 | 0.0 | 20123 | 536365, 536365, 536365 | Invoice number - 6-digit integral number uniqu... |
| 1 | StockCode | object | 0 | 0.0 | 3925 | 85123A, 71053, 84406B | Product (item) code - 5-digit integral number ... |
| 2 | Description | object | 0 | 0.0 | 4077 | WHITE HANGING HEART T-LIGHT HOLDER, WHITE META... | Product (item) name/description |
| 3 | Quantity | int64 | 0 | 0.0 | 383 | 6, 6, 8 | The quantities of each product (item) per tran... |
| 4 | InvoiceDate | object | 0 | 0.0 | 18630 | 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010... | Invoice date and time - the day and time when ... |
| 5 | UnitPrice | float64 | 0 | 0.0 | 1282 | 2.55, 3.39, 2.75 | Unit price - product price per unit in sterlin... |
| 6 | CustomerID | int64 | 0 | 0.0 | 4335 | 17850, 17850, 17850 | 5-digit integral number uniquely assigned to e... |
| 7 | Country | object | 0 | 0.0 | 37 | United Kingdom, United Kingdom, United Kingdom | Country name - the name of the country where e... |
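
To see what the cleaning changed, the raw and cleaned dictionaries can be joined on `Column`. The sketch below uses small stand-in frames whose figures are copied from the two dictionary outputs in this notebook. In the notebook itself, `raw_data_dictionary` and `clean_data_dictionary` would take their place. The `Unique Values Removed` column name is an assumption.

```python
import pandas as pd

# Stand-ins for raw_data_dictionary and clean_data_dictionary,
# limited to three columns with figures taken from the outputs above.
raw = pd.DataFrame({
    'Column': ['InvoiceNo', 'StockCode', 'Country'],
    'Unique Values': [25900, 4070, 38],
})
clean = pd.DataFrame({
    'Column': ['InvoiceNo', 'StockCode', 'Country'],
    'Unique Values': [20123, 3925, 37],
})

# Join the two dictionaries side by side on the column name.
comparison = raw.merge(clean, on='Column', suffixes=(' (raw)', ' (clean)'))

# Number of distinct values that disappeared during cleaning.
comparison['Unique Values Removed'] = (
    comparison['Unique Values (raw)'] - comparison['Unique Values (clean)']
)
```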
