#### **Extract, Transform and Load (ETL) and Data Modelling in Python** 
- **The following scripts demonstrate the ETL and data modelling process:**
1. Data extraction: Read the Orders and Details datasets with pandas.
2. Initial exploration: Checked the structure and key attributes of the two dataframes, such as size, null counts, key uniqueness, and descriptive statistics.
3. Data transformation: Joined the two dataframes and renamed the columns for consistency and readability.
4. Data modelling: Created the customer, product, and date dimension dataframes, each with a surrogate key, then merged the dataframes to create the orders fact dataframe, also with a surrogate key.
5. Loading to SQL Server: Connected to a SQL Server database named OnlineSales and pushed the orders fact, customer dimension, product dimension, and date dimension dataframes to it.

##### **Data source** (Orders and Details csv files): https://www.kaggle.com/datasets/samruddhi4040/online-sales-data?select=Orders.csv


In [55]:
# Import libraries
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
import urllib
from urllib.parse import quote_plus
import pypyodbc as odbc  

In [56]:
# Read in the two tables, Orders and Details
df_orders = pd.read_csv(r'C:\Users\drtil\Desktop\PythonETL_SQL_PowerBI_Online Sales\Datasets\Orders.csv')
df_details = pd.read_csv(r'C:\Users\drtil\Desktop\PythonETL_SQL_PowerBI_Online Sales\Datasets\Details.csv')

In [57]:
# Check the first five rows of the Orders dataframe
df_orders.head()

Unnamed: 0,Order ID,Order Date,CustomerName,State,City
0,B-26055,10-03-2018,Harivansh,Uttar Pradesh,Mathura
1,B-25993,03-02-2018,Madhav,Delhi,Delhi
2,B-25973,24-01-2018,Madan Mohan,Uttar Pradesh,Mathura
3,B-25923,27-12-2018,Gopal,Maharashtra,Mumbai
4,B-25757,21-08-2018,Vishakha,Madhya Pradesh,Indore


In [58]:
# Show the structural summary of the Orders dataframe
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Order ID      500 non-null    object
 1   Order Date    500 non-null    object
 2   CustomerName  500 non-null    object
 3   State         500 non-null    object
 4   City          500 non-null    object
dtypes: object(5)
memory usage: 19.7+ KB


In [59]:
# Parse 'Order Date' (DD-MM-YYYY strings) as datetime, with the time set to midnight
df_orders['Order Date'] = pd.to_datetime(df_orders['Order Date'], dayfirst=True).dt.normalize()
df_orders['Order Date'].dtype

dtype('<M8[ns]')
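
The `dayfirst=True` flag matters because the source dates are written as DD-MM-YYYY. The following is a minimal sketch, using three made-up date strings rather than the real file, of how an explicit `format` makes that assumption visible, and of what happens to an ambiguous string when the flag is left out:

```python
import pandas as pd

# Hypothetical sample in the same DD-MM-YYYY layout as the 'Order Date' column
raw_dates = pd.Series(['10-03-2018', '03-02-2018', '24-01-2018'])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(raw_dates, format='%d-%m-%Y')

# Without dayfirst, an ambiguous string such as '10-03-2018' is read month-first
month_first = pd.to_datetime('10-03-2018')
day_first = pd.to_datetime('10-03-2018', dayfirst=True)

print(parsed.dt.month.tolist())
print(month_first.month, day_first.month)
```

An explicit `format` also raises an error on values that do not match the pattern, which can be preferable to silent misparsing.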

In [60]:
# Show the descriptive statistics of the Orders dataframe
df_orders.describe(include='all')

Unnamed: 0,Order ID,Order Date,CustomerName,State,City
count,500,500,500,500,500
unique,500,,336,19,25
top,B-26055,,Shreya,Maharashtra,Indore
freq,1,,6,94,71
mean,,2018-06-17 04:13:26.400000,,,
min,,2018-01-01 00:00:00,,,
25%,,2018-03-08 18:00:00,,,
50%,,2018-06-02 12:00:00,,,
75%,,2018-10-05 00:00:00,,,
max,,2018-12-31 00:00:00,,,


In [61]:
# Check the first five rows of the Details dataframe
df_details.head()

Unnamed: 0,Order ID,Amount,Profit,Quantity,Category,Sub-Category,PaymentMode
0,B-25681,1096,658,7,Electronics,Electronic Games,COD
1,B-26055,5729,64,14,Furniture,Chairs,EMI
2,B-25955,2927,146,8,Furniture,Bookcases,EMI
3,B-26093,2847,712,8,Electronics,Printers,Credit Card
4,B-25602,2617,1151,4,Electronics,Phones,Credit Card


In [62]:
# Show the structural summary of the Details dataframe
df_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Order ID      1500 non-null   object
 1   Amount        1500 non-null   int64 
 2   Profit        1500 non-null   int64 
 3   Quantity      1500 non-null   int64 
 4   Category      1500 non-null   object
 5   Sub-Category  1500 non-null   object
 6   PaymentMode   1500 non-null   object
dtypes: int64(3), object(4)
memory usage: 82.2+ KB


In [63]:
# Show the descriptive statistics of the Details dataframe
df_details.describe(include='all')

Unnamed: 0,Order ID,Amount,Profit,Quantity,Category,Sub-Category,PaymentMode
count,1500,1500.0,1500.0,1500.0,1500,1500,1500
unique,500,,,,3,17,5
top,B-25656,,,,Clothing,Saree,COD
freq,12,,,,949,211,684
mean,,291.847333,24.642,3.743333,,,
std,,461.92462,168.55881,2.184942,,,
min,,4.0,-1981.0,1.0,,,
25%,,47.75,-12.0,2.0,,,
50%,,122.0,8.0,3.0,,,
75%,,326.25,38.0,5.0,,,
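
The `info()` and `describe()` outputs above already imply that there are no nulls and that `Order ID` is unique in Orders but repeated in Details. Those checks can also be made explicit. The sketch below uses small, hypothetical stand-ins for the two tables (the names `orders` and `details` and their values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the Orders and Details tables
orders = pd.DataFrame({
    'Order ID': ['B-1', 'B-2', 'B-3'],
    'CustomerName': ['Asha', 'Ravi', 'Asha'],
})
details = pd.DataFrame({
    'Order ID': ['B-1', 'B-1', 'B-2', 'B-3'],
    'Amount': [100, 250, 75, 40],
})

# Null count per column
null_counts = orders.isna().sum()

# 'Order ID' should be unique in Orders (one row per order) ...
orders_key_unique = orders['Order ID'].is_unique
# ... but may repeat in Details (one row per order line)
details_key_unique = details['Order ID'].is_unique

# Every Details row should point at an existing order
orphans = details[~details['Order ID'].isin(orders['Order ID'])]

print(null_counts.to_dict(), orders_key_unique, details_key_unique, len(orphans))
```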


In [64]:
# Join the two dataframes on 'Order ID'
merged_orders_details = df_orders.merge(df_details, on='Order ID', how='left')
merged_orders_details

Unnamed: 0,Order ID,Order Date,CustomerName,State,City,Amount,Profit,Quantity,Category,Sub-Category,PaymentMode
0,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,5729,64,14,Furniture,Chairs,EMI
1,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,671,114,9,Electronics,Phones,Credit Card
2,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,443,11,1,Clothing,Saree,COD
3,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,57,7,2,Clothing,Shirt,UPI
4,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,227,48,5,Clothing,Stole,COD
...,...,...,...,...,...,...,...,...,...,...,...
1495,B-25742,2018-08-03,Ashwin,Goa,Goa,11,-8,2,Clothing,Skirt,UPI
1496,B-26088,2018-03-26,Bhavna,Sikkim,Gangtok,11,5,2,Clothing,Hankerchief,UPI
1497,B-25707,2018-07-01,Shivani,Maharashtra,Mumbai,8,-6,1,Clothing,Stole,COD
1498,B-25758,2018-08-22,Shubham,Himachal Pradesh,Simla,8,-2,1,Clothing,Stole,COD
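
A left join like the one above can silently duplicate rows or leave unmatched orders. As a hedged sketch on invented data, the `validate` and `indicator` parameters of `DataFrame.merge` can make both cases visible:

```python
import pandas as pd

# Hypothetical miniature tables; B-3 deliberately has no detail rows
orders = pd.DataFrame({'Order ID': ['B-1', 'B-2', 'B-3'],
                       'City': ['Delhi', 'Mumbai', 'Indore']})
details = pd.DataFrame({'Order ID': ['B-1', 'B-1', 'B-2'],
                        'Amount': [100, 250, 75]})

merged = orders.merge(
    details,
    on='Order ID',
    how='left',
    validate='one_to_many',  # raises MergeError if 'Order ID' repeats in orders
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Orders that found no matching detail rows
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), unmatched['Order ID'].tolist())
```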


In [65]:
# Rename columns
merged_orders_details.rename(columns={'Order ID': 'order_id', 'Order Date': 'order_date', 'CustomerName': 'customer_name', 'State': 'state', 
                                      'City': 'city', 'Amount': 'amount', 'Profit': 'profit', 'Quantity': 'quantity', 'Category': 'category', 
                                      'Sub-Category': 'sub_category', 'PaymentMode': 'payment_mode'}, inplace=True)

merged_orders_details.head()

Unnamed: 0,order_id,order_date,customer_name,state,city,amount,profit,quantity,category,sub_category,payment_mode
0,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,5729,64,14,Furniture,Chairs,EMI
1,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,671,114,9,Electronics,Phones,Credit Card
2,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,443,11,1,Clothing,Saree,COD
3,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,57,7,2,Clothing,Shirt,UPI
4,B-26055,2018-03-10,Harivansh,Uttar Pradesh,Mathura,227,48,5,Clothing,Stole,COD


In [66]:
# Create customers dimension dataframe
dim_customers = (
    merged_orders_details[['customer_name', 'state', 'city']]
    .drop_duplicates()
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={'index': 'customer_key'})
)

# Add a surrogate key starting from 1
dim_customers['customer_key'] += 1  

In [67]:
# Show the dim_customers dataframe created
dim_customers

Unnamed: 0,customer_key,customer_name,state,city
0,1,Harivansh,Uttar Pradesh,Mathura
1,2,Madhav,Delhi,Delhi
2,3,Madan Mohan,Uttar Pradesh,Mathura
3,4,Gopal,Maharashtra,Mumbai
4,5,Vishakha,Madhya Pradesh,Indore
...,...,...,...,...
400,401,Hemangi,Delhi,Delhi
401,402,Dinesh,Tamil Nadu,Chennai
402,403,Ashwin,Goa,Goa
403,404,Shivani,Maharashtra,Mumbai


In [68]:
# Create products dimension dataframe
dim_products = (
    merged_orders_details[['category', 'sub_category']]
    .drop_duplicates()
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={'index': 'product_key'})
)

# Add a surrogate key starting from 1
dim_products['product_key'] += 1

In [69]:
# Show the dim_products dataframe created
dim_products

Unnamed: 0,product_key,category,sub_category
0,1,Furniture,Chairs
1,2,Electronics,Phones
2,3,Clothing,Saree
3,4,Clothing,Shirt
4,5,Clothing,Stole
5,6,Clothing,T-shirt
6,7,Electronics,Printers
7,8,Furniture,Bookcases
8,9,Furniture,Furnishings
9,10,Furniture,Tables
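
The customer and product dimensions are built with the same pattern: select columns, drop duplicates, number the rows from 1. One possible way to avoid repeating it is a small helper. The function name `build_dimension` and the sample data below are hypothetical and not part of the notebook:

```python
import pandas as pd

def build_dimension(df, columns, key_name):
    """Return the distinct combinations of `columns` with a 1-based surrogate key."""
    dim = df[columns].drop_duplicates().reset_index(drop=True)
    # Insert the key as the first column, numbered 1..n in order of first appearance
    dim.insert(0, key_name, range(1, len(dim) + 1))
    return dim

# Invented rows standing in for merged_orders_details
sales = pd.DataFrame({
    'category': ['Furniture', 'Clothing', 'Furniture', 'Clothing'],
    'sub_category': ['Chairs', 'Saree', 'Chairs', 'Shirt'],
})

dim_products_demo = build_dimension(sales, ['category', 'sub_category'], 'product_key')
print(dim_products_demo)
```

Because the key follows the order of first appearance, it is only stable while the input rows keep the same order between runs.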


In [70]:
# Merge merged_orders_details with dim_customers and dim_products to attach the dimension keys
merged_table = merged_orders_details.merge(
    dim_customers, on=['customer_name', 'state', 'city'], how='left').merge(
        dim_products, on=['category', 'sub_category'], how='left')
            
# Create a surrogate key starting from 1
merged_table = merged_table.reset_index(drop=True).reset_index().rename(columns={'index': 'orders_key'})
merged_table['orders_key'] += 1

# Select columns for the fact_orders dataframe
fact_orders = merged_table[[
    'orders_key', 'customer_key', 'product_key', 'order_id', 'order_date', 
    'amount', 'profit', 'quantity', 'payment_mode'
]]

In [71]:
# Show the fact_orders dataframe created
fact_orders

Unnamed: 0,orders_key,customer_key,product_key,order_id,order_date,amount,profit,quantity,payment_mode
0,1,1,1,B-26055,2018-03-10,5729,64,14,EMI
1,2,1,2,B-26055,2018-03-10,671,114,9,Credit Card
2,3,1,3,B-26055,2018-03-10,443,11,1,COD
3,4,1,4,B-26055,2018-03-10,57,7,2,UPI
4,5,1,5,B-26055,2018-03-10,227,48,5,COD
...,...,...,...,...,...,...,...,...,...
1495,1496,403,15,B-25742,2018-08-03,11,-8,2,UPI
1496,1497,286,11,B-26088,2018-03-26,11,5,2,UPI
1497,1498,404,5,B-25707,2018-07-01,8,-6,1,COD
1498,1499,405,5,B-25758,2018-08-22,8,-2,1,COD
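
Since the dimension keys are attached with left merges, a fact row whose attributes are missing from a dimension would end up with a null key. A minimal sketch of that check, using invented data in which one row deliberately has no match:

```python
import pandas as pd

# Hypothetical dimension and order lines; 'Tables' is absent from the dimension
dim = pd.DataFrame({'product_key': [1, 2],
                    'sub_category': ['Chairs', 'Saree']})
lines = pd.DataFrame({'sub_category': ['Chairs', 'Saree', 'Tables'],
                      'amount': [500, 120, 900]})

fact = lines.merge(dim, on='sub_category', how='left')

# A left merge leaves NaN where no dimension row matched
missing_key_rows = fact[fact['product_key'].isna()]

# Keys that are present should all exist in the dimension
valid_keys = fact['product_key'].dropna().isin(dim['product_key']).all()
print(len(missing_key_rows), valid_keys)
```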


In [72]:
# Show the structural summary of the fact_orders dataframe
fact_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   orders_key    1500 non-null   int64         
 1   customer_key  1500 non-null   int64         
 2   product_key   1500 non-null   int64         
 3   order_id      1500 non-null   object        
 4   order_date    1500 non-null   datetime64[ns]
 5   amount        1500 non-null   int64         
 6   profit        1500 non-null   int64         
 7   quantity      1500 non-null   int64         
 8   payment_mode  1500 non-null   object        
dtypes: datetime64[ns](1), int64(6), object(2)
memory usage: 105.6+ KB
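
The step list mentions a date dimension, but the cells in this notebook do not show it being built. The sketch below is one possible way to derive it from the order dates. The column names (`date_key`, `year`, `quarter`, `month`, `month_name`, `day`, `day_name`) and the YYYYMMDD integer key are illustrative assumptions, not taken from the notebook:

```python
import pandas as pd

# Hypothetical order dates standing in for fact_orders['order_date']
order_dates = pd.Series(pd.to_datetime(['2018-01-01', '2018-01-03', '2018-01-03']))

# One row per calendar day between the first and last order date
calendar = pd.date_range(order_dates.min(), order_dates.max(), freq='D')

dim_date_demo = pd.DataFrame({'date': calendar})
# Assumed key format: the date written as a YYYYMMDD integer
dim_date_demo.insert(0, 'date_key',
                     dim_date_demo['date'].dt.strftime('%Y%m%d').astype(int))
dim_date_demo['year'] = dim_date_demo['date'].dt.year
dim_date_demo['quarter'] = dim_date_demo['date'].dt.quarter
dim_date_demo['month'] = dim_date_demo['date'].dt.month
dim_date_demo['month_name'] = dim_date_demo['date'].dt.month_name()
dim_date_demo['day'] = dim_date_demo['date'].dt.day
dim_date_demo['day_name'] = dim_date_demo['date'].dt.day_name()

print(dim_date_demo)
```

Generating a continuous calendar, rather than only the dates that have orders, keeps days without sales available for time-based reporting.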


In [None]:
# Connect to the OnlineSales database (created beforehand in SQL Server)
DRIVER_NAME = 'ODBC Driver 17 for SQL Server'
SERVER_NAME = r'MSI\SQLEXPRESS'
DATABASE_NAME = 'OnlineSales'

# Remove or replace the user name (UID) and password (PWD) as needed
params = urllib.parse.quote_plus(
    f"DRIVER={{{DRIVER_NAME}}};"
    f"SERVER={SERVER_NAME};DATABASE={DATABASE_NAME};UID=sa;PWD=sqlpw"
)

engine = sqlalchemy.create_engine(f"mssql+pyodbc:///?odbc_connect={params}")

# Push the tables to the SQL Server database
# WARNING: if_exists='replace' drops and recreates any table that already exists in the
# database, so the data model created earlier will be deleted. Use it with caution.
fact_orders.to_sql('fact_orders', engine, if_exists='replace', index=False)
dim_customers.to_sql('dim_customers', engine, if_exists='replace', index=False)
dim_products.to_sql('dim_products', engine, if_exists='replace', index=False)

17
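
To try the load step without a SQL Server instance, the same `to_sql` call can be pointed at an in-memory SQLite database. This is only a stand-in for the SQL Server connection above, and the table contents are invented. The sketch also shows that `if_exists='replace'` leaves a single copy of the data however many times the cell is rerun:

```python
import sqlite3
import pandas as pd

# Invented rows standing in for dim_products
dim_demo = pd.DataFrame({'product_key': [1, 2, 3],
                         'sub_category': ['Chairs', 'Phones', 'Saree']})

# In-memory SQLite database used only as a stand-in for SQL Server
conn = sqlite3.connect(':memory:')

# if_exists='replace' drops and recreates the table on every run,
# so writing twice does not duplicate rows
dim_demo.to_sql('dim_products', conn, if_exists='replace', index=False)
dim_demo.to_sql('dim_products', conn, if_exists='replace', index=False)

# Read the row count back to confirm what was written
row_count = pd.read_sql('SELECT COUNT(*) AS n FROM dim_products', conn)['n'].iloc[0]
print(row_count)
conn.close()
```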