## *Extract, Transform & Load: Retail_Data*

In [4]:
import pandas as pd
import sqlite3
from datetime import datetime
import logging
import os
from glob import glob

### *Step 1: Extract*

In [9]:
# Define your raw data folder path
RAW_PATH = 'synthetic_data'

# Find all CSV files in the folder (sorted, because glob does not guarantee an order)
csv_files = sorted(glob(os.path.join(RAW_PATH, "*.csv")))

# Print message about matched files
print("The matched raw dataset csv_files are:")

# Display the list of CSV file paths
for file in csv_files:
    print(file)


The matched raw dataset csv_files are:
synthetic_data\CustomerDim.csv
synthetic_data\FactSales.csv
synthetic_data\ProductDim.csv
synthetic_data\StoreDim.csv
synthetic_data\TimeDim.csv


## *Preview the Extracted Datasets*

In [10]:
# Create dictionary with cleaned dataset names as keys and DataFrames as values
datasets = {}
for file in csv_files:
    # Extract base file name without extension (e.g., TimeDim.csv → TimeDim)
    base = os.path.splitext(os.path.basename(file))[0]
    
    # Load each CSV into the dictionary under the cleaned base name key
    datasets[base] = pd.read_csv(file, encoding='ISO-8859-1')


In [11]:
# Loop through each dataset in the dictionary
for name, df in datasets.items():
    # Print dataset name and its dimensions
    print(f"\n--- Previewing Dataset: {name.upper()} ---")
    print(f"{name} — {df.shape[0]} rows × {df.shape[1]} columns loaded.\n")
    
    # Display the first 5 rows of the dataset
    display(df.head())
    
    # Add a line separator for clarity
    print("-" * 80)



--- Previewing Dataset: CUSTOMERDIM ---
CustomerDim — 4389 rows × 7 columns loaded.



| | CustomerID | Country | CustomerName | City | Gender | Age | CustomerSince |
|---|---|---|---|---|---|---|---|
| 0 | 17850.0 | United Kingdom | 54cde5dbb6 | Amberport | Female | 42 | 2024-12-01 08:26:00 |
| 1 | 13047.0 | United Kingdom | 86314fa849 | West Joshua | Other | 26 | 2024-12-01 08:26:00 |
| 2 | 12583.0 | France | dcff63cd99 | East Shannonbury | Female | 52 | 2024-12-01 08:26:00 |
| 3 | 13748.0 | United Kingdom | 590354e49f | Malcolmberg | Female | 33 | 2024-12-01 08:26:00 |
| 4 | 15100.0 | United Kingdom | 58ec7997a6 | Port Raymond | Other | 62 | 2024-12-01 08:26:00 |


--------------------------------------------------------------------------------

--- Previewing Dataset: FACTSALES ---
FactSales — 541909 rows × 10 columns loaded.



| | InvoiceNo | InvoiceDate | TimeID | ProductID | CustomerID | StoreID | Quantity | UnitPrice | Discount | TotalSales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2024-12-01 | 20241201 | 85123A | 17850.0 | 1 | 6 | 2.55 | 0 | 15.3 |
| 1 | 536365 | 2024-12-01 | 20241201 | 71053 | 17850.0 | 1 | 6 | 3.39 | 0 | 20.34 |
| 2 | 536365 | 2024-12-01 | 20241201 | 84406B | 17850.0 | 1 | 8 | 2.75 | 0 | 22.0 |
| 3 | 536365 | 2024-12-01 | 20241201 | 84029G | 17850.0 | 1 | 6 | 3.39 | 0 | 20.34 |
| 4 | 536365 | 2024-12-01 | 20241201 | 84029E | 17850.0 | 1 | 6 | 3.39 | 0 | 20.34 |


--------------------------------------------------------------------------------

--- Previewing Dataset: PRODUCTDIM ---
ProductDim — 18053 rows × 5 columns loaded.



| | ProductID | ProductName | UnitCost | Category | Brand |
|---|---|---|---|---|---|
| 0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | Miscellaneous | Carter, Richardson and Frazier |
| 1 | 71053 | WHITE METAL LANTERN | 3.39 | Miscellaneous | Smith, Scott and Obrien |
| 2 | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | Miscellaneous | Ware-Sellers |
| 3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | Miscellaneous | Liu, Moss and Garcia |
| 4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | Miscellaneous | Bailey, Salazar and Phillips |


--------------------------------------------------------------------------------

--- Previewing Dataset: STOREDIM ---
StoreDim — 38 rows × 5 columns loaded.



| | Country | StoreID | StoreName | Channel | City |
|---|---|---|---|---|---|
| 0 | United Kingdom | 1 | Online Store - United Kingdom | Online | Marianmouth |
| 1 | France | 2 | Online Store - France | Online | North Teresa |
| 2 | Australia | 3 | Online Store - Australia | Online | Aguilarport |
| 3 | Netherlands | 4 | Online Store - Netherlands | Online | East Stephanie |
| 4 | Germany | 5 | Online Store - Germany | Online | Jamestown |


--------------------------------------------------------------------------------

--- Previewing Dataset: TIMEDIM ---
TimeDim — 305 rows × 7 columns loaded.



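| | TimeID | FullDate | Day | Month | Quarter | Year | WeekOfYear |
|---|---|---|---|---|---|---|---|
| 0 | 20241201 | 2024-12-01 | 1 | 12 | 4 | 2024 | 48 |
| 1 | 20241202 | 2024-12-02 | 2 | 12 | 4 | 2024 | 49 |
| 2 | 20241203 | 2024-12-03 | 3 | 12 | 4 | 2024 | 49 |
| 3 | 20241205 | 2024-12-05 | 5 | 12 | 4 | 2024 | 49 |
| 4 | 20241206 | 2024-12-06 | 6 | 12 | 4 | 2024 | 49 |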


--------------------------------------------------------------------------------


# *T-Transform*
### *Transformation Strategy*

*Following the full extraction, we proceeded with a* ***Full Transformation*** *approach. Each dataset was inspected and cleaned independently to ensure data quality and consistency before merging.*
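
*One way to keep the per-table cleaning independent is a small registry that maps each table name to its own cleaning function. The sketch below is illustrative only: `clean_customer_dim`, `clean_generic`, `CLEANERS`, and `clean_all` are hypothetical names, the cleaning rules are assumptions rather than the rules applied later in this notebook, and the toy frames stand in for the real tables.*

```python
import pandas as pd


def clean_customer_dim(df):
    # Assumed rule: drop rows without a CustomerID, then de-duplicate on it
    return df.dropna(subset=["CustomerID"]).drop_duplicates(subset=["CustomerID"])


def clean_generic(df):
    # Fallback for tables with no table-specific rule: drop exact duplicate rows
    return df.drop_duplicates()


# Hypothetical registry: table name -> cleaning function
CLEANERS = {"CustomerDim": clean_customer_dim}


def clean_all(tables):
    # Clean every table on a copy, so the raw extracts stay untouched
    return {
        name: CLEANERS.get(name, clean_generic)(df.copy())
        for name, df in tables.items()
    }


# Toy stand-ins for the extracted tables (values are illustrative only)
raw = {
    "CustomerDim": pd.DataFrame(
        {"CustomerID": [1.0, 1.0, None], "Country": ["UK", "UK", "FR"]}
    ),
    "StoreDim": pd.DataFrame({"StoreID": [1, 1], "Channel": ["Online", "Online"]}),
}

cleaned = clean_all(raw)
for name, df in cleaned.items():
    print(name, df.shape)
```

*In this notebook the same pattern would be called as `clean_all(datasets)`, with one registry entry per table that needs special handling.*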


### *Inspecting Dataset*

In [12]:
# Iterate through each dataset and display info
for name, df in datasets.items(): # Looping through each key-value pair in the datasets dictionary
    print(f"\n--- Structure of Dataset: {name.upper()} ---\n") # Printing the dataset name in uppercase for clarity
    df.info() # Displaying concise summary of the DataFrame (columns, types, non-null counts, memory usage)
    print("\n" + "-" * 80) # Printing a separator line for readability between datasets


--- Structure of Dataset: CUSTOMERDIM ---

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4389 entries, 0 to 4388
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   CustomerID     4380 non-null   float64
 1   Country        4389 non-null   object 
 2   CustomerName   4389 non-null   object 
 3   City           4389 non-null   object 
 4   Gender         4389 non-null   object 
 5   Age            4389 non-null   int64  
 6   CustomerSince  4389 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 240.2+ KB

--------------------------------------------------------------------------------

--- Structure of Dataset: FACTSALES ---

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   InvoiceDate  541909 n


---

*The `info()` output summarizes the structure and basic data quality of each table:*

* *CustomerDim contains 4,389 rows; 9 of them have no CustomerID (4,380 non-null), and CustomerSince is stored as a string.*

* *FactSales holds 541,909 sales transactions, some with missing CustomerIDs, and stores InvoiceDate as a string.*

* *ProductDim includes 18,053 products, some with missing product names.*

* *StoreDim has 38 stores with complete data.*

* *TimeDim consists of 305 unique dates; most of its date columns are integers, but FullDate is a string.*

*Overall, the structure is consistent across tables, and the data is ready for cleaning and transformation.*
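
*The two recurring issues noted above, missing values and dates stored as strings, can be quantified and addressed with a few pandas calls. The following is a minimal sketch on toy data, not the notebook's actual transformation code: `missing_report` and `datasets_demo` are hypothetical names, and coercing unparseable dates to `NaT` is an assumed policy.*

```python
import pandas as pd

# Toy stand-ins for two of the loaded tables (values are illustrative only)
customer_dim = pd.DataFrame({
    "CustomerID": [17850.0, None, 12583.0],
    "CustomerSince": ["2024-12-01 08:26:00"] * 3,
})
time_dim = pd.DataFrame({
    "TimeID": [20241201, 20241202],
    "FullDate": ["2024-12-01", "2024-12-02"],
})
datasets_demo = {"CustomerDim": customer_dim, "TimeDim": time_dim}


def missing_report(tables):
    """Return one row per (table, column) that has at least one null value."""
    rows = []
    for name, df in tables.items():
        nulls = df.isna().sum()
        for col, n in nulls[nulls > 0].items():
            rows.append({"table": name, "column": col, "missing": int(n)})
    return pd.DataFrame(rows, columns=["table", "column", "missing"])


report = missing_report(datasets_demo)
print(report)

# Parse string date columns; errors="coerce" turns unparseable values into NaT
customer_dim["CustomerSince"] = pd.to_datetime(
    customer_dim["CustomerSince"], errors="coerce"
)
time_dim["FullDate"] = pd.to_datetime(time_dim["FullDate"], errors="coerce")
print(customer_dim.dtypes)
```

*Applied to the real tables, `missing_report(datasets)` would list the exact null counts per column, which makes statements such as "some missing CustomerIDs" verifiable.*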



### *Transformation Step-by-Step*