## 1) Extraction
The first step in the ETL process is to extract data from various source systems. In this case, we will extract data from UC Irvine's Machine Learning Repository. To do this, we will use the `ucimlrepo` package to fetch the dataset.

In [151]:
# Load necessary Libraries
import pandas as pd
from datetime import datetime

# Load the data
data = pd.read_csv("Data/online_retail_features.csv")
data.head()

Unnamed: 0,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


### a) Description of the data

In [152]:
# Describe the data
print(f"This is a description of the data:\n{data.info()}")

# Check for missing values
print(f"Missing values in each column:\n{data.isnull().sum()}")

# Check for duplicates
print(f"Duplicate rows:\n{data.duplicated().sum()}")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Description  540455 non-null  object 
 1   Quantity     541909 non-null  int64  
 2   InvoiceDate  541909 non-null  object 
 3   UnitPrice    541909 non-null  float64
 4   CustomerID   406829 non-null  float64
 5   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(3)
memory usage: 24.8+ MB
This is a description of the data:
None
Missing values in each column:
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
Duplicate rows:
6007
Duplicate rows:
6007


- The data has 6 records and 541909 records.
- All columns are in the correct format apart from `InvoiceDate` which is in string format and should be converted to datetime, and `CustomerID` which is in float format and should be converted to string.
- The data contains some missing values in the `CustomerID` and `Description` columns.
- There are duplicate rows in the data.

### b) Data Cleaning
This process will involve:
- Dropping the missing values since they are not significant enough to impute.
- Converting the `InvoiceDate` column to datetime format.
- Converting the `CustomerID` column to string format.
- Removing duplicate rows.


In [153]:
# Converting InvoiceDate to datetime
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

# Converting CustomerID to string
data['CustomerID'] = data['CustomerID'].astype(str)

# Removing duplicate rows
data = data.drop_duplicates()

# Remove missing values in specific columns
data = data.dropna(subset=['CustomerID', 'Description'])

# Resetting the index
data = data.reset_index(drop=True)
print(f"Data size after cleaning: {data.shape}")
data.head()

Data size after cleaning: (534532, 6)


Unnamed: 0,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


## 2) Transformation
The transformation process will involve:
- Create dimensions like extract where you group by `CustomerID` to create customer summary.
- Creating new calculated columns: `TotalPrice` = `Quantity` * `UnitPrice`
- Filtering data to the sales of the year. The entire year 2011.
- Handle outliers by removing values whose `Quantity` < 0 and `UnitPrice` < 0

In [154]:
# Last value of invoice
last_invoice_date = data['InvoiceDate'].max()
print(f"Last value of invoice: {last_invoice_date}")

Last value of invoice: 2011-12-09 12:50:00


### a) Incremental Extraction
Selecting data from the year `2011-01-01` onwards.

In [155]:
# Filter data for the last year
data = data[data['InvoiceDate'] >= '2011-01-01']

In [156]:
# Create a TotalPrice column
data['TotalPrice'] = data['Quantity'] * data['UnitPrice']

# Remove outliers
data = data[(data['Quantity'] > 0) & (data['UnitPrice'] > 0)]
print(f"Data size after transformations: {data.shape}")
data.describe()

Data size after transformations: (483353, 7)


Unnamed: 0,Quantity,InvoiceDate,UnitPrice,TotalPrice
count,483353.0,483353,483353.0,483353.0
mean,10.785327,2011-07-22 04:04:34.842113280,3.84262,20.307545
min,1.0,2011-01-04 10:00:00,0.001,0.001
25%,1.0,2011-04-21 19:51:00,1.25,3.9
50%,4.0,2011-08-05 16:34:00,2.08,9.95
75%,12.0,2011-10-25 12:11:00,4.13,17.7
max,80995.0,2011-12-09 12:50:00,11062.06,168469.6
std,162.491437,,31.563522,281.680944


In [157]:
# Creating customer summary
customer_summary = data.groupby('CustomerID').agg(
    TotalSales=('TotalPrice', 'sum'),
    AverageSales=('TotalPrice', 'mean'),
    PurchaseCount=('InvoiceDate', 'nunique'),
    FirstPurchase=('InvoiceDate', 'min'),
    LastPurchase=('InvoiceDate', 'max'),
    Country=('Country', 'first')
).reset_index()
customer_summary.head()

Unnamed: 0,CustomerID,TotalSales,AverageSales,PurchaseCount,FirstPurchase,LastPurchase,Country
0,12346.0,77183.6,77183.6,1,2011-01-18 10:01:00,2011-01-18 10:01:00,United Kingdom
1,12347.0,3598.21,23.829205,6,2011-01-26 14:30:00,2011-12-07 15:52:00,Iceland
2,12348.0,904.44,64.602857,3,2011-01-25 10:42:00,2011-09-25 13:13:00,Finland
3,12349.0,1757.55,24.076027,1,2011-11-21 09:51:00,2011-11-21 09:51:00,Italy
4,12350.0,334.4,19.670588,1,2011-02-02 16:01:00,2011-02-02 16:01:00,Norway


In [158]:
# Saving the transformed data to a CSV file
data.to_csv('Data/transformed_data.csv', index=False)

## 3) Loading Data into SQLite Database

In this stage, we will load the transformed data into a SQLite database. We will:
1. Create a database file (retail_dw.db)
2. Create dimension tables (CustomerDim, ProductDim, TimeDim)
3. Create a fact table (SalesFact)
4. Load the transformed data into these tables

In [159]:
# Import necessary libraries
import sqlite3
import pandas as pd
import os

# Load the transformed data
data_path = os.path.join('Data', 'transformed_data.csv')
df = pd.read_csv(data_path)

print(f"Loaded transformed data with {df.shape[0]} rows and {df.shape[1]} columns")
print("Data columns:", df.columns.tolist())
df.head()

Loaded transformed data with 483353 rows and 7 columns
Data columns: ['Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country', 'TotalPrice']


Unnamed: 0,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalPrice
0,JUMBO BAG PINK POLKADOT,10,2011-01-04 10:00:00,1.95,13313.0,United Kingdom,19.5
1,BLUE POLKADOT WRAP,25,2011-01-04 10:00:00,0.42,13313.0,United Kingdom,10.5
2,RED RETROSPOT WRAP,25,2011-01-04 10:00:00,0.42,13313.0,United Kingdom,10.5
3,RECYCLING BAG RETROSPOT,5,2011-01-04 10:00:00,2.1,13313.0,United Kingdom,10.5
4,RED RETROSPOT SHOPPER BAG,10,2011-01-04 10:00:00,1.25,13313.0,United Kingdom,12.5


In [160]:
def connect_to_db():
    """Connect to the SQLite database"""
    conn = sqlite3.connect('Retail_dw.db')
    cursor = conn.cursor()
    conn.execute('PRAGMA foreign_keys = ON;')  # Enable foreign key constraints
    print("Connected to SQLite database.")
    return conn


In [161]:
def populate_tables():
    """Populate tables from CSV"""
    conn = connect_to_db()
    cursor = conn.cursor()

    for _, row in df.iterrows():
        # Parse date
        try:
            invoice_date = datetime.strptime(row["InvoiceDate"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            try:
                invoice_date = datetime.strptime(row["InvoiceDate"], "%m/%d/%Y %H:%M")
            except:
                continue  # Skip invalid dates

        day = invoice_date.day
        month = invoice_date.month
        quarter = (month - 1) // 3 + 1
        year = invoice_date.year
        full_date_desc = invoice_date.strftime("%A, %B %d, %Y")

        # Insert into Time_TB
        cursor.execute('''
        INSERT OR IGNORE INTO Time_TB (TimeID, Date, FullDateDescription, Day, Month, Quarter, Year)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', (
            int(invoice_date.strftime("%Y%m%d")),  # TimeID
            invoice_date.date().isoformat(),
            full_date_desc,
            day, month, quarter, year
        ))
        time_id = int(invoice_date.strftime("%Y%m%d"))

        # Insert into Product_TB
        cursor.execute('''
        INSERT OR IGNORE INTO Product_TB (ProductID, StockCode, ProductName, ProductCategory, ProductSubcategory, Brand)
        VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            row["StockCode"],
            row["StockCode"],
            row["Description"],
            None,  # ProductCategory unknown
            None,  # ProductSubcategory unknown
            None   # Brand unknown
        ))
        product_id = row["StockCode"]

        # Insert into Customer_TB
        cursor.execute('''
        INSERT OR IGNORE INTO Customer_TB (CustomerID, CustomerName, City, State, Country, AgeGroup)
        VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            row["CustomerID"],
            None,  # CustomerName unknown
            None,  # City unknown
            None,  # State unknown
            row["Country"],
            None   # AgeGroup unknown
        ))
        customer_id = row["CustomerID"]

        # Store_TB — leaving empty since no store data in CSV
        store_id = None

        # Insert into FactSales_TB
        cursor.execute('''
        INSERT INTO FactSales_TB (TimeID, ProductID, CustomerID, StoreID, SalesAmount, QuantitySold, DiscountAmount)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', (
            time_id,
            product_id,
            customer_id,
            store_id,
            row["Quantity"] * row["UnitPrice"],  # SalesAmount
            row["Quantity"],
            0  # Discount unknown
        ))

    conn.commit()
    conn.close()
print("Data populated successfully!")

Data populated successfully!
