### *Load and preprocess the dataset*

In [214]:
import hashlib
import random
from faker import Faker
import pandas as pd
from datetime import timedelta
import os

*Description:*

*This chunk loads the raw CSV file into a pandas DataFrame, passing `encoding='ISO-8859-1'` so that special characters in the file decode without errors. The `InvoiceDate` column is converted to a datetime type so that date arithmetic and the `.dt` accessors are available. To make the dataset look current for analysis, every invoice date is then shifted forward by 14 years, which moves the original 2010-2011 timestamps to 2024-2025. Finally, the new date range is printed as a sanity check that the shift was applied correctly.*


In [215]:
# Load dataset CSV into pandas DataFrame.
# Encoding ISO-8859-1 is used to handle special characters.
df = pd.read_csv('../Data/Online_Retail.csv', encoding='ISO-8859-1')

# Convert 'InvoiceDate' column to datetime type for easier date/time operations.
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Shift invoice dates forward by 14 years to simulate current data (2024-2025).
df['InvoiceDate'] = df['InvoiceDate'] + pd.DateOffset(years=14)

# Verify date range after shifting.
print("Date range after shifting:", df['InvoiceDate'].min(), "to", df['InvoiceDate'].max())

Date range after shifting: 2024-12-01 08:26:00 to 2025-12-09 12:50:00
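As a small illustration of what the 14-year shift does, the sketch below applies the same `pd.DateOffset(years=14)` to the two boundary timestamps implied by the printed range (the 2010 and 2011 values are obtained by subtracting 14 years from that output). It suggests that month, day, and time of day are preserved while the day of the week is not, which may matter for any weekday-based analysis.

```python
import pandas as pd

# Boundary timestamps of the original data, derived from the printed range minus 14 years.
raw = pd.Series(pd.to_datetime(["2010-12-01 08:26:00", "2011-12-09 12:50:00"]))

# Same calendar shift as in the notebook: year changes, month/day/time stay put.
shifted = raw + pd.DateOffset(years=14)

# Weekday names before and after the shift, to show that they no longer line up.
weekday_before = raw.dt.day_name().tolist()
weekday_after = shifted.dt.day_name().tolist()

print(shifted.tolist())
print(weekday_before, "->", weekday_after)
```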


### *Create Time Dimension Table (TimeDim)*
*Description:*

*In this step, a Time Dimension table is constructed. A time dimension is a standard data-warehousing structure that gives sales facts their temporal context. The process strips the time component from every invoice timestamp and keeps each calendar date once, so only dates on which at least one invoice exists appear in the table. A unique `TimeID` is generated for each date as an integer in YYYYMMDD format, which serves as the join key to the fact table. Additional columns break each date down into day, month, quarter, year, and ISO week number, all of which are useful for time-based grouping and trend analysis. Storing these attributes directly on each dimension row simplifies querying and reporting over time.*


In [216]:
# 2. Create the Time Dimension (TimeDim) table
# Create empty DataFrame for Time Dimension
time_dim = pd.DataFrame()

# Extract unique dates from the 'InvoiceDate' column (date only, no time)
time_dim['FullDate'] = pd.to_datetime(df['InvoiceDate'].dt.date.unique())

# Generate a unique TimeID for each date in YYYYMMDD integer format
time_dim['TimeID'] = time_dim['FullDate'].dt.strftime('%Y%m%d').astype(int)

# Extract useful date attributes for analysis
time_dim['Day'] = time_dim['FullDate'].dt.day
time_dim['Month'] = time_dim['FullDate'].dt.month
time_dim['Quarter'] = time_dim['FullDate'].dt.quarter
time_dim['Year'] = time_dim['FullDate'].dt.year
time_dim['WeekOfYear'] = time_dim['FullDate'].dt.isocalendar().week

# Reorder columns for clarity
time_dim = time_dim[['TimeID', 'FullDate', 'Day', 'Month', 'Quarter', 'Year', 'WeekOfYear']]
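Because the table above only contains dates that have invoices, days without sales are missing from it. If a gap-free calendar is preferred, one possible variant is sketched below. The function name `build_time_dim` and the sample timestamps are hypothetical; the column names follow the notebook.

```python
import pandas as pd

def build_time_dim(dates: pd.Series) -> pd.DataFrame:
    """Build a gap-free time dimension spanning the min and max of `dates`."""
    # One row per calendar day, including days with no invoices.
    days = pd.date_range(dates.min().normalize(), dates.max().normalize(), freq="D")
    dim = pd.DataFrame({"FullDate": days})

    # Surrogate key in YYYYMMDD integer form, as in the notebook.
    dim["TimeID"] = dim["FullDate"].dt.strftime("%Y%m%d").astype(int)
    dim["Day"] = dim["FullDate"].dt.day
    dim["Month"] = dim["FullDate"].dt.month
    dim["Quarter"] = dim["FullDate"].dt.quarter
    dim["Year"] = dim["FullDate"].dt.year
    # ISO week number; cast from the nullable UInt32 that isocalendar() returns.
    dim["WeekOfYear"] = dim["FullDate"].dt.isocalendar().week.astype(int)

    return dim[["TimeID", "FullDate", "Day", "Month", "Quarter", "Year", "WeekOfYear"]]

# Toy input: two invoices on 1 December, none on the 2nd, one on the 3rd.
sample = pd.Series(pd.to_datetime(["2024-12-01 08:26", "2024-12-03 10:00", "2024-12-01 09:00"]))
time_dim_sketch = build_time_dim(sample)
print(time_dim_sketch)
```

Whether to include empty days is a modelling choice: a gap-free calendar makes "days with zero sales" visible in reports, while the invoice-only version keeps the table minimal.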



### *Create Customer Dimension Table (CustomerDim)*

*Description:*

*This chunk builds the Customer Dimension table, which profiles customers using a combination of actual and synthetic data. The real `CustomerID` and `Country` values are taken directly from the transactions so that the fact table can still reference them. Since personally identifying information such as names and cities is not available, synthetic values are generated to enrich the dataset while preserving privacy. Customer names are created by hashing the `CustomerID`, which yields consistent yet anonymous identifiers. Cities are generated with the Faker library: customers in the United Kingdom get an `en_GB` generator, and all other countries fall back to Faker's default locale, so city names outside the UK are not country-specific. Gender and age (18 to 75) are assigned at random to simulate demographic attributes. Finally, `CustomerSince` is set to the earliest invoice date in the whole dataset, so every customer receives the same placeholder registration date.*


In [217]:
fake = Faker()

def hash_customer_name(cust_id):
    # Generate a synthetic name by hashing the CustomerID
    return hashlib.sha256(str(cust_id).encode()).hexdigest()[:10]

def generate_city_based_on_country(country):
    # Use Faker locale based on country for city name if possible, else default locale
    # Here we simplify: if country is UK use en_GB, else en_US or default
    if country == 'United Kingdom':
        fake_local = Faker('en_GB')
    else:
        fake_local = Faker()
    return fake_local.city()

# Extract unique customers with country
customer_dim = df[['CustomerID', 'Country']].drop_duplicates().copy()

# Create synthetic CustomerName by hashing CustomerID
customer_dim['CustomerName'] = customer_dim['CustomerID'].apply(hash_customer_name)

# Generate synthetic City based on Country
customer_dim['City'] = customer_dim['Country'].apply(generate_city_based_on_country)

# Generate reasonable synthetic Gender and Age
gender_choices = ['Male', 'Female', 'Other']
customer_dim['Gender'] = [random.choice(gender_choices) for _ in range(len(customer_dim))]
customer_dim['Age'] = [random.randint(18, 75) for _ in range(len(customer_dim))]

# Set CustomerSince to the earliest InvoiceDate in the whole dataset
# (a single placeholder value shared by every customer)
customer_dim['CustomerSince'] = df['InvoiceDate'].min()

customer_dim.head()

| | CustomerID | Country | CustomerName | City | Gender | Age | CustomerSince |
|---|---|---|---|---|---|---|---|
| 0 | 17850.0 | United Kingdom | 54cde5dbb6 | Amberport | Female | 42 | 2024-12-01 08:26:00 |
| 9 | 13047.0 | United Kingdom | 86314fa849 | West Joshua | Other | 26 | 2024-12-01 08:26:00 |
| 26 | 12583.0 | France | dcff63cd99 | East Shannonbury | Female | 52 | 2024-12-01 08:26:00 |
| 46 | 13748.0 | United Kingdom | 590354e49f | Malcolmberg | Female | 33 | 2024-12-01 08:26:00 |
| 65 | 15100.0 | United Kingdom | 58ec7997a6 | Port Raymond | Other | 62 | 2024-12-01 08:26:00 |
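The cell above gives every customer the same `CustomerSince` value and draws gender and age from an unseeded random generator, so reruns produce different tables. A possible variant is sketched below on toy data, assuming a per-customer first purchase date is wanted. The sample rows, the seed `42`, and the choice to drop rows without a `CustomerID` are all assumptions, not part of the original notebook.

```python
import hashlib
import random
import pandas as pd

# Toy transactions (invented values); the last row has no CustomerID.
sales = pd.DataFrame({
    "CustomerID": [17850.0, 13047.0, 17850.0, None],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
    "InvoiceDate": pd.to_datetime([
        "2024-12-01 08:26", "2024-12-01 08:34", "2025-02-10 14:38", "2024-12-01 11:52",
    ]),
})

# Assumption: anonymous rows are excluded and IDs are stored as integers.
known = sales.dropna(subset=["CustomerID"]).copy()
known["CustomerID"] = known["CustomerID"].astype(int)

# Earliest invoice per customer instead of one dataset-wide date.
first_purchase = (
    known.groupby("CustomerID", as_index=False)["InvoiceDate"].min()
    .rename(columns={"InvoiceDate": "CustomerSince"})
)

# One row per CustomerID, then attach the per-customer date.
customers = (
    known[["CustomerID", "Country"]]
    .drop_duplicates("CustomerID")
    .merge(first_purchase, on="CustomerID", how="left", validate="one_to_one")
)

def hash_customer_name(cust_id: int) -> str:
    # Same hashing idea as the notebook: first 10 hex digits of SHA-256.
    return hashlib.sha256(str(cust_id).encode()).hexdigest()[:10]

customers["CustomerName"] = customers["CustomerID"].map(hash_customer_name)

# Seeded generator so Gender and Age are reproducible across runs.
rng = random.Random(42)
customers["Gender"] = [rng.choice(["Male", "Female", "Other"]) for _ in range(len(customers))]
customers["Age"] = [rng.randint(18, 75) for _ in range(len(customers))]

print(customers)
```

Note that hashing the integer `17850` gives a different name than hashing the float string `'17850.0'` used in the notebook, so switching ID types changes the synthetic names.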


### *Create Store Dimension Table (StoreDim)*


*Description:*

*This step constructs the Store Dimension table, which represents the sales locations for transactions. Since the dataset records countries rather than specific store locations, each unique country is treated as a distinct store. A unique numeric `StoreID`, starting at 1, is assigned to each country for use as a foreign key. Store names include the country name for clarity and uniqueness. The sales channel is hardcoded as "Online", reflecting the dataset's nature as online retail transactions. A synthetic city is also generated for each store: the United Kingdom store uses an `en_GB` Faker instance, while all other countries use Faker's default locale, so their city names are not country-specific. This approach allows analysis at the store level without requiring real-world addresses.*

In [218]:
# Extract unique countries as stores
store_dim = df[['Country']].drop_duplicates().reset_index(drop=True)

# Assign unique StoreID starting from 1
store_dim['StoreID'] = store_dim.index + 1

# Assign StoreName with country suffix for uniqueness
store_dim['StoreName'] = store_dim['Country'].apply(lambda x: f"Online Store - {x}")

# Assign Channel as 'Online' (dataset is online retail)
store_dim['Channel'] = 'Online'

# Generate synthetic City based on country using Faker locales
def generate_city(country):
    if country == 'United Kingdom':
        fake_local = Faker('en_GB')
    else:
        fake_local = Faker()
    return fake_local.city()

store_dim['City'] = store_dim['Country'].apply(generate_city)

store_dim.head()

| | Country | StoreID | StoreName | Channel | City |
|---|---|---|---|---|---|
| 0 | United Kingdom | 1 | Online Store - United Kingdom | Online | Marianmouth |
| 1 | France | 2 | Online Store - France | Online | North Teresa |
| 2 | Australia | 3 | Online Store - Australia | Online | Aguilarport |
| 3 | Netherlands | 4 | Online Store - Netherlands | Online | East Stephanie |
| 4 | Germany | 5 | Online Store - Germany | Online | Jamestown |
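The sample output shows English-style city names for France, the Netherlands, and Germany, because only the United Kingdom is mapped to a specific locale. One way to extend this is a country-to-locale lookup with a fallback, sketched below. The mapping `COUNTRY_TO_LOCALE` and the helper `locale_for` are hypothetical and cover only the countries visible in the output above; the returned string would then be passed to `Faker(...)`, ideally creating one instance per locale rather than one per row.

```python
# Hypothetical mapping from dataset country names to Faker locale codes.
COUNTRY_TO_LOCALE = {
    "United Kingdom": "en_GB",
    "France": "fr_FR",
    "Australia": "en_AU",
    "Netherlands": "nl_NL",
    "Germany": "de_DE",
}

# Faker's default locale, used when a country has no explicit mapping.
DEFAULT_LOCALE = "en_US"

def locale_for(country: str) -> str:
    """Return the locale code for a country, falling back to the default."""
    return COUNTRY_TO_LOCALE.get(country, DEFAULT_LOCALE)

# "Atlantis" stands in for any country that is missing from the mapping.
countries = ["United Kingdom", "France", "Australia", "Netherlands", "Germany", "Atlantis"]
locales = {country: locale_for(country) for country in countries}
print(locales)
```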


In [219]:
fake = Faker()

# 4. Create Product Dimension Table (product_dim)
# ------------------------------------------------
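# Extract one row per product (StockCode) with its Description and UnitPrice.
# A StockCode can appear with several descriptions or prices; the first occurrence
# is kept so that ProductID stays unique in the dimension.
product_dim = df[['StockCode', 'Description', 'UnitPrice']].drop_duplicates(subset='StockCode').copy()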

# Rename columns to fit dimensional model schema
product_dim = product_dim.rename(columns={
    'StockCode': 'ProductID',
    'Description': 'ProductName',
    'UnitPrice': 'UnitCost'
})

# Define a simple function to categorize products based on keywords in ProductName
def categorize_product(name):
    if pd.isna(name):
        return 'Miscellaneous'
    name = name.lower()
    if any(keyword in name for keyword in ['electronic', 'computer', 'usb', 'laptop', 'cable']):
        return 'Electronics'
    elif any(keyword in name for keyword in ['shirt', 'clothing', 'dress', 't-shirt', 'jeans']):
        return 'Clothing'
    elif any(keyword in name for keyword in ['book', 'novel', 'journal']):
        return 'Books'
    elif any(keyword in name for keyword in ['toy', 'game']):
        return 'Toys & Games'
    else:
        return 'Miscellaneous'

# Apply the category function to create a Category column
product_dim['Category'] = product_dim['ProductName'].apply(categorize_product)

# Generate a synthetic Brand name using Faker company names for each product
product_dim['Brand'] = [fake.company() for _ in range(len(product_dim))]

# Display sample of product_dim to verify
print(product_dim.head())


  ProductID                          ProductName  UnitCost       Category  \
0    85123A   WHITE HANGING HEART T-LIGHT HOLDER      2.55  Miscellaneous   
1     71053                  WHITE METAL LANTERN      3.39  Miscellaneous   
2    84406B       CREAM CUPID HEARTS COAT HANGER      2.75  Miscellaneous   
3    84029G  KNITTED UNION FLAG HOT WATER BOTTLE      3.39  Miscellaneous   
4    84029E       RED WOOLLY HOTTIE WHITE HEART.      3.39  Miscellaneous   

                            Brand  
0  Carter, Richardson and Frazier  
1         Smith, Scott and Obrien  
2                    Ware-Sellers  
3            Liu, Moss and Garcia  
4    Bailey, Salazar and Phillips  
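The keyword check in `categorize_product` uses substring matching, so a name such as "ADDRESS BOOK" would be classed as Clothing because it contains "dress". A stricter variant that matches whole words only is sketched below. The name `categorize_product_strict`, the pattern table, and the optional plural `s` are assumptions; the keyword lists are copied from the cell above.

```python
import re

# Keyword lists copied from the notebook's categorize_product function.
CATEGORY_KEYWORDS = {
    "Electronics": ["electronic", "computer", "usb", "laptop", "cable"],
    "Clothing": ["shirt", "clothing", "dress", "t-shirt", "jeans"],
    "Books": ["book", "novel", "journal"],
    "Toys & Games": ["toy", "game"],
}

# One compiled pattern per category: whole words only, optional plural "s".
CATEGORY_PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")s?\b",
        re.IGNORECASE,
    )
    for category, keywords in CATEGORY_KEYWORDS.items()
}

def categorize_product_strict(name) -> str:
    """Return the first category whose keywords appear as whole words."""
    # Missing descriptions arrive as NaN (a float), not as strings.
    if not isinstance(name, str):
        return "Miscellaneous"
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(name):
            return category
    return "Miscellaneous"

for example in ["ADDRESS BOOK", "VINTAGE DRESS", "WHITE METAL LANTERN"]:
    print(example, "->", categorize_product_strict(example))
```

Categories are still checked in dictionary order, so a name that matches two categories gets the first one, as in the original `if`/`elif` chain.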


In [220]:
# Convert 'InvoiceDate' to datetime if not already
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a new column 'InvoiceDateOnly' normalized to midnight (date only, no time)
df['InvoiceDateOnly'] = df['InvoiceDate'].dt.normalize()

# Ensure 'FullDate' in time_dim is datetime type for correct merging
time_dim['FullDate'] = pd.to_datetime(time_dim['FullDate'])


### *Prepare FactSales Table (Fact Table)*

*Description:*

*This final chunk assembles the FactSales table, which records individual sales lines linked to the dimension tables through foreign keys. First, the `InvoiceNo` column name is repaired: it carries a mis-decoded byte-order mark (`ï»¿`) left over from reading the file as ISO-8859-1. Leftover suffixed columns (`TimeID_x`, `TimeID_y`, and the `StoreID` equivalents) and duplicated column names from earlier merge runs are then cleaned up. The Time Dimension is merged in on the normalized invoice date to supply the `TimeID` foreign key, after which `InvoiceDate` is reduced to a plain date to match the date-only grain of the Time Dimension. `ProductID` is copied from the original stock code, and `StoreID` is merged in by country. `TotalSales` is calculated as quantity multiplied by unit price, giving the revenue per transaction line, and a `Discount` column with a value of 0 is added if the data has none. Finally, only the columns required by the fact table schema are selected into the `fact_sales` DataFrame, ready for analytical queries or database loading.*


In [221]:
# Strip the mis-decoded UTF-8 byte-order mark from the first column name
df = df.rename(columns={'ï»¿InvoiceNo': 'InvoiceNo'})
# Use the 'TimeID' and 'StoreID' columns without suffixes if present
if 'TimeID' not in df.columns:
    if 'TimeID_y' in df.columns:
        df['TimeID'] = df['TimeID_y']
    elif 'TimeID_x' in df.columns:
        df['TimeID'] = df['TimeID_x']

if 'StoreID' not in df.columns:
    if 'StoreID_y' in df.columns:
        df['StoreID'] = df['StoreID_y']
    elif 'StoreID_x' in df.columns:
        df['StoreID'] = df['StoreID_x']


In [222]:
# Keep only the first occurrence of any duplicated column name
df = df.loc[:, ~df.columns.duplicated()]
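The suffix handling above is only needed when a merge cell is run more than once, since pandas then renames the clashing columns to `TimeID_x` and `TimeID_y`. An alternative is to make the merge itself safe to rerun by dropping any stale key column first. The helper `attach_key` below is a hypothetical sketch of that idea on toy data, not the notebook's own method.

```python
import pandas as pd

def attach_key(facts: pd.DataFrame, dim: pd.DataFrame, key: str,
               left_on: str, right_on: str) -> pd.DataFrame:
    """Left-join `key` from `dim` onto `facts`; safe to call repeatedly."""
    # Remove copies of the key left behind by an earlier run of the merge.
    stale = [c for c in facts.columns if c in (key, f"{key}_x", f"{key}_y")]
    # Align the join column name so no extra column is carried over.
    lookup = dim[[right_on, key]].rename(columns={right_on: left_on})
    return facts.drop(columns=stale).merge(lookup, on=left_on, how="left")

# Toy fact rows and a matching two-row time dimension.
facts = pd.DataFrame({
    "InvoiceDateOnly": pd.to_datetime(["2024-12-01", "2024-12-02"]),
    "Quantity": [6, 8],
})
dim = pd.DataFrame({
    "FullDate": pd.to_datetime(["2024-12-01", "2024-12-02"]),
    "TimeID": [20241201, 20241202],
})

once = attach_key(facts, dim, "TimeID", "InvoiceDateOnly", "FullDate")
twice = attach_key(once, dim, "TimeID", "InvoiceDateOnly", "FullDate")
print(list(twice.columns))
```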

In [223]:
# Merge df with time_dim to get TimeID by matching on normalized date columns
df = df.merge(
    time_dim[['TimeID', 'FullDate']],
    left_on='InvoiceDateOnly',
    right_on='FullDate',
    how='left'
)

# Convert 'InvoiceDate' column to just date (drop time component) for fact table compatibility
df['InvoiceDate'] = df['InvoiceDate'].dt.date

# Assign 'ProductID' as the same value as 'StockCode' for clarity and schema matching
df['ProductID'] = df['StockCode']

# Merge df with store_dim on 'Country' to get StoreID foreign key
df = df.merge(
    store_dim[['StoreID', 'Country']],
    on='Country',
    how='left'
)

# Calculate total sales amount per transaction line
df['TotalSales'] = df['Quantity'] * df['UnitPrice']

# Add a Discount column if missing (assumed 0)
if 'Discount' not in df.columns:
    df['Discount'] = 0

# Select the fact table columns, including Discount
fact_sales = df[['InvoiceNo', 'InvoiceDate', 'TimeID', 'ProductID', 'CustomerID', 'StoreID',
                 'Quantity', 'UnitPrice', 'Discount', 'TotalSales']].copy()
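A left merge silently adds rows if the dimension holds duplicate keys, and silently leaves the foreign key empty if a key is missing. The sketch below shows one way to guard the store lookup using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy sales lines and a two-row store dimension (invented values).
sales = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536370"],
    "Country": ["United Kingdom", "United Kingdom", "France"],
    "Quantity": [6, 8, 24],
    "UnitPrice": [2.55, 3.39, 3.75],
})
stores = pd.DataFrame({"StoreID": [1, 2], "Country": ["United Kingdom", "France"]})

# validate raises MergeError if a Country appears twice in `stores`;
# indicator adds a `_merge` column that flags rows with no matching store.
merged = sales.merge(
    stores, on="Country", how="left", validate="many_to_one", indicator=True
)
unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")

# Same line-level metric as the notebook.
merged["TotalSales"] = merged["Quantity"] * merged["UnitPrice"]

print("Rows without a store:", unmatched)
print(merged)
```

The same pattern could be applied to the `TimeID` merge, where an unmatched row would indicate a date missing from the Time Dimension.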

In [224]:
# Define the folder path
folder_path = 'Task_2_ETL_Process_Implementation/synthetic_data'

# Create folder if it does not exist (won't error if it exists)
os.makedirs(folder_path, exist_ok=True)

# Save each DataFrame to a separate CSV file inside the folder
time_dim.to_csv(os.path.join(folder_path, 'TimeDim.csv'), index=False)
customer_dim.to_csv(os.path.join(folder_path, 'CustomerDim.csv'), index=False)
store_dim.to_csv(os.path.join(folder_path, 'StoreDim.csv'), index=False)
product_dim.to_csv(os.path.join(folder_path, 'ProductDim.csv'), index=False)
fact_sales.to_csv(os.path.join(folder_path, 'FactSales.csv'), index=False)

print(f"All dimension and fact tables have been saved to the folder '{folder_path}'.")

# List all files in the folder to confirm
print("Files in the folder now:")
print(os.listdir(folder_path))


All dimension and fact tables have been saved to the folder 'Task_2_ETL_Process_Implementation/synthetic_data'.
Files in the folder now:
['CustomerDim.csv', 'FactSales.csv', 'ProductDim.csv', 'StoreDim.csv', 'TimeDim.csv']
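CSV files do not store column types, so date columns such as `FullDate` come back as plain text unless they are parsed again on load. A minimal round-trip check is sketched below. It writes to a temporary directory rather than the project folder, and the two-row sample table is an assumption for illustration.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Small stand-in for the real TimeDim table.
time_dim_sample = pd.DataFrame({
    "TimeID": [20241201, 20241202],
    "FullDate": pd.to_datetime(["2024-12-01", "2024-12-02"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "TimeDim.csv"
    time_dim_sample.to_csv(path, index=False)
    # parse_dates restores the datetime type that CSV cannot store.
    reloaded = pd.read_csv(path, parse_dates=["FullDate"])

# Compare values column by column after the round trip.
ids_match = reloaded["TimeID"].tolist() == time_dim_sample["TimeID"].tolist()
dates_match = bool((reloaded["FullDate"] == time_dim_sample["FullDate"]).all())
print("IDs match:", ids_match, "| dates match:", dates_match)
```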
