# 🛠️ ETL Process Documentation - Hackathon Project

## 📌 Project: Online Retail Transaction Analysis

### 🎯 Objective
Clean, transform, and prepare the *Online Retail* dataset for analysis by removing errors, correcting inconsistencies, and structuring the data for visualisation and insights.

---

In [None]:
import pandas as pd

df = pd.read_csv(r'../data/online_retail.csv')

#### Create a Data Dictionary of the imported dataset 

In [None]:
# Custom Function to create a comprehensive data dictionary for Online Retail datasets
# Takes a DataFrame and returns a data dictionary
def create_data_dictionary(df):
    # Official descriptions from UCI ML Repository Variables Table: https://archive.ics.uci.edu/dataset/352/online+retail
    descriptions = {
        'InvoiceNo': 'Invoice number - 6-digit integral number uniquely assigned to each transaction. If it starts with the letter "C", it indicates a cancellation',
        'StockCode': 'Product (item) code - 5-digit integral number uniquely assigned to each distinct product',
        'Description': 'Product (item) name/description',
        'Quantity': 'The quantities of each product (item) per transaction',
        'InvoiceDate': 'Invoice date and time - the day and time when each transaction was generated',
        'UnitPrice': 'Unit price - product price per unit in sterling (£)',
        'CustomerID': '5-digit integral number uniquely assigned to each customer',
        'Country': 'Country name - the name of the country where each customer resides'
    }
    
    dictionary_data = []
    for column in df.columns:
        # Get 3 sample values (non-null)
        sample_values = df[column].dropna().head(3).tolist()
        sample_str = ', '.join([str(x) for x in sample_values])
        
        dictionary_data.append({
            'Column': column,
            'Data Type': str(df[column].dtype),
            'Missing Values': df[column].isnull().sum(),
            'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
            'Unique Values': df[column].nunique(),
            'Sample Values': sample_str,
            'Description': descriptions.get(column, 'Custom/Engineered column - description needed')
        })
    return pd.DataFrame(dictionary_data)

# Store the dictionary in a variable
raw_data_dictionary = create_data_dictionary(df)

# Display data dictionary
raw_data_dictionary


#### Check Invoice Column

- We used Tableau Prep to quickly identify data concerns. The first thing we noticed was that InvoiceNo contains string values.

In [None]:
def analyse_invoice_column(df, column_name='InvoiceNo'):
    """
    Analyse the InvoiceNo column to identify different invoice types
    
    Parameters:
    df (pandas.DataFrame): The dataframe to analyze
    column_name (str): The name of the invoice column to analyze (default: 'InvoiceNo')
    
    Prints analysis results including counts and sample values for regular and non-standard invoices.
    """
    print(f"Invoice Column records: {len(df)}")
    
    # Check for different invoice patterns
    regular_invoices = df[df[column_name].astype(str).str.match(r'^\d{6}$')]
    non_standard_invoices = df[~df[column_name].astype(str).str.match(r'^\d{6}$')]
    
    print(f"\nRegular invoices (exactly 6 digits): {len(regular_invoices)} ({len(regular_invoices)/len(df)*100:.1f}%)")
    print(f"Non-standard invoices (not exactly 6 digits): {len(non_standard_invoices)} ({len(non_standard_invoices)/len(df)*100:.1f}%)")
    
    # Show sample invoice numbers for each type
    print("\nExpected invoice format:")
    print(regular_invoices[column_name].head(5).tolist())
    
    if len(non_standard_invoices) > 0:
        print("\nInvoices with string values:")
        print(non_standard_invoices[column_name].head(10).tolist())

# Run the analysis on the current dataset
analyse_invoice_column(df)

### Filter out rows where InvoiceNo contains 'C' (cancellations)
- Dataset research ([link](https://archive.ics.uci.edu/dataset/352/online+retail)) shows that invoice numbers beginning with C are cancellations. These are not within the scope of our user stories and will be removed from the dataset.

In [None]:
df = df[~df['InvoiceNo'].astype(str).str.contains('C', case=False, na=False)]
analyse_invoice_column(df)
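
The dataset description says cancellations *begin* with C, while the cell above drops any invoice that contains the letter anywhere. As a minimal sketch of the stricter prefix check, assuming a small made-up frame (`toy`, `invoice_str` and `is_cancellation` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the real dataset).
# The last value is an int to mimic a mixed-type column.
toy = pd.DataFrame({"InvoiceNo": ["536365", "C536379", "A563185", 536366]})

# Cast to string first, since the column can hold mixed int/str values
invoice_str = toy["InvoiceNo"].astype(str)

# True only where the invoice number starts with 'C' or 'c'
is_cancellation = invoice_str.str.upper().str.startswith("C")

# Keep everything that is not flagged as a cancellation
kept = toy[~is_cancellation]
print(kept["InvoiceNo"].tolist())
```

Whether the prefix check and the `contains` check give the same result on the real data depends on whether any invoice has a C in another position, which this sketch does not verify.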

### Filter out rows where InvoiceNo contains 'A' (Adjust bad debt)
- Dataset research in Tableau Prep shows that all A records have the description "Adjust bad debt". These are outside the scope of our analysis and will be removed.

In [None]:
df = df[~df['InvoiceNo'].astype(str).str.contains('A', case=False, na=False)]
analyse_invoice_column(df)

### Filter out rows where Quantity is zero or negative

In [None]:
df = df[df['Quantity'] > 0]
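
Before applying a filter like this, it can help to count how many rows it will remove. A minimal sketch on made-up values (`toy`, `non_positive` and `n_dropped` are illustrative names):

```python
import pandas as pd

# Toy quantities, made up for illustration
toy = pd.DataFrame({"Quantity": [6, -1, 0, 12]})

# Flag rows the filter would drop: zero and negative quantities
non_positive = toy["Quantity"] <= 0

# Number of rows that will be removed
n_dropped = int(non_positive.sum())
print(f"Rows to drop: {n_dropped} of {len(toy)}")

# Equivalent to keeping only Quantity > 0
toy_clean = toy[~non_positive]
```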

### Drop rows with null product descriptions

In [None]:
df = df.dropna(subset=['Description'])

### Remove cancelled orders (already filtered out above)
- Note: invoices containing 'C' (cancellations) were already removed in an earlier step, so the cell below is left commented out.

In [None]:
# This step is now redundant as we already filtered out invoices containing 'C' above
# df = df[~df['InvoiceNo'].astype(str).str.contains('C', case=False, na=False)]

### Remove duplicates

In [None]:
df = df.drop_duplicates(subset=['InvoiceNo', 'StockCode', 'Quantity', 'CustomerID'])
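
To see which rows count as duplicates under this subset before dropping them, `DataFrame.duplicated` with `keep=False` flags every member of a duplicate group. A minimal sketch on made-up rows (`toy`, `key_cols`, `dupes` and `deduped` are illustrative names):

```python
import pandas as pd

# Toy rows, made up for illustration: the first two share all four key columns
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "22633"],
    "Quantity": [6, 6, 6],
    "CustomerID": [17850.0, 17850.0, 17850.0],
})

# Same key columns as the drop_duplicates call above
key_cols = ["InvoiceNo", "StockCode", "Quantity", "CustomerID"]

# keep=False marks every row in a duplicate group, not just the later copies
dupes = toy[toy.duplicated(subset=key_cols, keep=False)]
print(dupes)

# Default keep='first' retains the first row of each group
deduped = toy.drop_duplicates(subset=key_cols)
```

Because the subset leaves out columns such as `UnitPrice` and `InvoiceDate`, rows that differ only in those columns are also treated as duplicates.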

### Remove missing or unspecified countries

In [None]:
df = df[df['Country'].notna()]
df = df[df['Country'] != 'Unspecified']

### Re-analyse Invoice Column After Cleaning
Now let's check the invoice column again after all our cleaning steps to see how the data looks.

In [None]:
# Use the function to re-analyze the invoice column after cleaning
print("=== AFTER CLEANING ===")
analyse_invoice_column(df)

### Save the cleaned dataset to CSV for visualisation

In [None]:
df.to_csv('../data/clean_online_retail.csv', index=False)
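
A quick sanity check before handing the file over for visualisation is to count how many rows still break each rule. The sketch below is one possible approach rather than part of the original pipeline; `summarise_violations` is a hypothetical helper and the toy frame is made up:

```python
import pandas as pd

def summarise_violations(frame):
    """Count rows that break each cleaning rule (hypothetical helper)."""
    invoice = frame["InvoiceNo"].astype(str)
    country = frame["Country"]
    return {
        # Invoice numbers that are not exactly six digits
        "non_standard_invoice": int((~invoice.str.match(r"^\d{6}$")).sum()),
        # Zero or negative quantities
        "non_positive_quantity": int((frame["Quantity"] <= 0).sum()),
        # Missing product descriptions
        "missing_description": int(frame["Description"].isna().sum()),
        # Missing or 'Unspecified' countries
        "bad_country": int((country.isna() | (country == "Unspecified")).sum()),
    }

# Toy frame, made up for illustration: the second row breaks every rule
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "Quantity": [6, -1],
    "Description": ["SAMPLE PRODUCT", None],
    "Country": ["United Kingdom", "Unspecified"],
})

violations = summarise_violations(toy)
print(violations)
```

On the real data, calling `summarise_violations(df)` after cleaning should give zeros for quantity, description and country. The invoice count could be non-zero if the dataset holds non-standard invoice numbers other than the C and A records removed above.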

#### Create a Data Dictionary of the cleaned dataset 

In [None]:
# Store the dictionary in a variable
clean_data_dictionary = create_data_dictionary(df)

# Display data dictionary
clean_data_dictionary