## *1. Extract, Transform & Load – Superstore Dataset*

*Author: Mohamed (ETL Lead) – Group 8*

### *Importing Necessary Libraries*

In [25]:
import pandas as pd
import sqlite3
import os

### *Defining Paths for Easy Reference*

In [22]:
RAW_PATH = "data/superstore_sales_data.csv"
TRANSFORMED_PATH = "data/superstore_transformed.csv"
FINAL_CSV_PATH = "data/superstore_final.csv"
FINAL_DB_PATH = "data/superstore_final.db"
TABLE_NAME = "superstore_sales"

### *Extracting from the dataset*
In the cell below, I load the raw dataset into a DataFrame, then check its shape and the data type of each column.

In [15]:
print("Extracting raw data...")
df_raw = pd.read_csv(RAW_PATH)

print("\nInitial Overview:")
print(f"Rows: {df_raw.shape[0]}, Columns: {df_raw.shape[1]}")
print(df_raw.dtypes)
df_raw

Extracting raw data...

Initial Overview:
Rows: 9994, Columns: 21
Row ID             int64
Order ID          object
Order Date        object
Ship Date         object
Ship Mode         object
Customer ID       object
Customer Name     object
Segment           object
Country           object
City              object
State             object
Postal Code        int64
Region            object
Product ID        object
Category          object
Sub-Category      object
Product Name      object
Sales            float64
Quantity           int64
Discount         float64
Profit           float64
dtype: object


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2,0.00,41.9136
1,2,CA-2016-152156,11/8/2016,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3,0.00,219.5820
2,3,CA-2016-138688,6/12/2016,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2,0.00,6.8714
3,4,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.0310
4,5,US-2015-108966,10/11/2015,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,1/21/2014,1/23/2014,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,33180,South,FUR-FU-10001889,Furniture,Furnishings,Ultra Door Pull Handle,25.2480,3,0.20,4.1028
9990,9991,CA-2017-121258,2/26/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,FUR-FU-10000747,Furniture,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.9600,2,0.00,15.6332
9991,9992,CA-2017-121258,2/26/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,TEC-PH-10003645,Technology,Phones,Aastra 57i VoIP phone,258.5760,2,0.20,19.3932
9992,9993,CA-2017-121258,2/26/2017,3/3/2017,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,OFF-PA-10004041,Office Supplies,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.6000,4,0.00,13.3200
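One optional check at this point is to confirm that the file contains the columns the later cells rely on. The sketch below runs on a small in-memory CSV rather than the real file. The helper `missing_columns` and the `REQUIRED_COLUMNS` list are assumptions made for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical subset of raw column names that later cells depend on.
REQUIRED_COLUMNS = ["Order ID", "Order Date", "Ship Date", "Sales", "Profit"]


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in for the raw CSV (values copied from the first row shown above).
sample_csv = io.StringIO(
    "Order ID,Order Date,Ship Date,Sales,Profit\n"
    "CA-2016-152156,11/8/2016,11/11/2016,261.96,41.9136\n"
)
sample = pd.read_csv(sample_csv)

absent = missing_columns(sample, REQUIRED_COLUMNS)
print("Missing columns:", absent)
```

In the notebook itself, the same helper could be called on `df_raw` right after `pd.read_csv(RAW_PATH)`, stopping early if the returned list is not empty.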


### *Cleaning The Dataset*
In this cell, I clean the dataset: I rename the columns, check for missing values, drop rows that are missing key fields, and remove duplicate rows.


In [19]:
df = df_raw.copy()
df.columns = df.columns.str.strip().str.replace(" ", "_").str.replace("-", "_")

print("\nMissing Values Summary:")
print(df.isnull().sum())


df.dropna(subset=["Order_ID", "Order_Date", "Sales", "Profit"], inplace=True)


initial_shape = df.shape
df.drop_duplicates(inplace=True)
print(f"\nRemoved {initial_shape[0] - df.shape[0]} duplicate rows.")


Missing Values Summary:
Row_ID           0
Order_ID         0
Order_Date       0
Ship_Date        0
Ship_Mode        0
Customer_ID      0
Customer_Name    0
Segment          0
Country          0
City             0
State            0
Postal_Code      0
Region           0
Product_ID       0
Category         0
Sub_Category     0
Product_Name     0
Sales            0
Quantity         0
Discount         0
Profit           0
dtype: int64

Removed 0 duplicate rows.
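Since the real dataset has no missing values or duplicates, the cleaning steps have no visible effect above. The sketch below applies the same renaming and de-duplication to a made-up three-row DataFrame (the `toy` data is invented for illustration) so the behavior is easier to see. Passing `regex=False` is an optional addition that makes the replacements literal regardless of pandas version.

```python
import pandas as pd

# Invented toy data: one column name has a leading space, one has a hyphen,
# and the first two rows are exact duplicates.
toy = pd.DataFrame(
    {
        " Order ID": ["A-1", "A-1", "B-2"],
        "Sub-Category": ["Chairs", "Chairs", "Paper"],
    }
)

# Same normalization as the cleaning cell: trim, then replace spaces and hyphens.
toy.columns = (
    toy.columns.str.strip()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
)

before = len(toy)
toy = toy.drop_duplicates()
removed = before - len(toy)

print(list(toy.columns))
print(f"Removed {removed} duplicate rows.")
```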


### *Transformation*

Now I apply the following transformations to improve data quality and structure:
1. **Date Conversion**: Convert `Order_Date` and `Ship_Date` to datetime objects.
2. **Numeric Conversion**: Ensure columns like `Sales`, `Profit`, `Quantity`, and `Discount` are treated as numeric types.
3. **Text Standardization**: Standardize text fields such as `Customer_Name`, `City`, and `State` using `.str.title()`.
4. **Enrichment**: Create a new column called `Profit_Margin` by dividing `Profit` by `Sales`, which will be useful for any later financial analysis.

These transformations will help downstream processes like mining and visualization.


In [24]:

df["Order_Date"] = pd.to_datetime(df["Order_Date"], errors="coerce")
df["Ship_Date"] = pd.to_datetime(df["Ship_Date"], errors="coerce")

numeric_cols = ["Sales", "Quantity", "Discount", "Profit"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

df["Customer_Name"] = df["Customer_Name"].str.title()
df["City"] = df["City"].str.title()
df["State"] = df["State"].str.title()

# Infinite margins (Sales == 0) become NaN; float("nan") keeps the column a float dtype
df["Profit_Margin"] = (df["Profit"] / df["Sales"]).round(2).replace([float("inf"), -float("inf")], float("nan"))

df.to_csv(TRANSFORMED_PATH, index=False)
print(f"\nCleaned data saved to {TRANSFORMED_PATH}")


Cleaned data saved to data/superstore_transformed.csv
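Because `errors="coerce"` silently turns unparseable values into `NaT` or `NaN`, it can be worth counting them after the transformation. The following is a minimal sketch on invented data, not the notebook's actual cell. The explicit `format="%m/%d/%Y"` is an assumption based on the month/day/year dates shown in the preview above.

```python
import pandas as pd

# Invented toy data: the second row has a bad date and zero sales.
toy = pd.DataFrame(
    {
        "Order_Date": ["11/8/2016", "not a date"],
        "Sales": ["261.96", "0"],
        "Profit": ["41.9136", "5"],
    }
)

# Unparseable dates become NaT instead of raising an error.
toy["Order_Date"] = pd.to_datetime(
    toy["Order_Date"], format="%m/%d/%Y", errors="coerce"
)
toy[["Sales", "Profit"]] = toy[["Sales", "Profit"]].apply(
    pd.to_numeric, errors="coerce"
)

# Division by zero sales yields +/-inf, which is replaced with NaN.
margin = (toy["Profit"] / toy["Sales"]).round(2)
toy["Profit_Margin"] = margin.replace([float("inf"), -float("inf")], float("nan"))

unparsed_dates = int(toy["Order_Date"].isna().sum())
undefined_margins = int(toy["Profit_Margin"].isna().sum())
print(f"Unparsed dates: {unparsed_dates}, undefined margins: {undefined_margins}")
```

Running the same two counts on `df` before saving would show whether coercion dropped any information from the real dataset.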


### *Loading the Dataset*
In this cell, I load the dataset into a SQLite database using `sqlite3` and save the `.db` file.

In [26]:
conn = sqlite3.connect(FINAL_DB_PATH)
df.to_sql(TABLE_NAME, conn, if_exists="replace", index=False)
conn.close()

print(f"SQLite DB saved to {FINAL_DB_PATH}")

print("\nETL pipeline complete.")

SQLite DB saved to data/superstore_final.db

ETL pipeline complete.
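A simple way to confirm the load is to read the table back and compare row counts. The sketch below uses an in-memory SQLite database and two invented rows so that it does not touch the real `.db` file. The `try`/`finally` wrapper is a suggested pattern, not something the original cell does.

```python
import sqlite3

import pandas as pd

# Invented two-row stand-in for the transformed DataFrame.
toy = pd.DataFrame({"Order_ID": ["A-1", "B-2"], "Sales": [261.96, 14.62]})

# ":memory:" creates a throwaway database; the notebook uses FINAL_DB_PATH instead.
conn = sqlite3.connect(":memory:")
try:
    toy.to_sql("superstore_sales", conn, if_exists="replace", index=False)
    # Read the row count back from the table that was just written.
    result = pd.read_sql_query("SELECT COUNT(*) AS n FROM superstore_sales", conn)
    row_count = int(result["n"].iloc[0])
finally:
    # Close the connection even if the write or the query fails.
    conn.close()

print(f"Rows written: {len(toy)}, rows read back: {row_count}")
```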
