# ETL Pipeline and Data Warehouse Using Python and PostgreSQL

Here, I will create a simple ETL pipeline using only Python and PostgreSQL, without relying on AWS.

### Basic ETL Workflow
1. Extract data from CSV files 
2. Transform the data using Pandas
3. Load the transformed data into a data warehouse created using PostgreSQL

### For Data warehouse
- Create Data Model illustrating star schema with fact and dimension tables
- Fact table: Product reviews
- Dimension tables: Product info, brand info, date info, rating info, author info

In [1]:
#* Imports 
# sqlalchemy and psycopg2 will be used to connect to the postgresql database
from sqlalchemy import create_engine
import psycopg2
# For data processing
import pandas as pd 
# For data visualization
import matplotlib.pyplot as plt

In [2]:
# Dataset Paths
dataset1_path = "datasets/product_info.csv"
dataset2_path = "datasets/product_reviews.csv"

## 1. Extract Step
- In this step, the data is extracted from CSV files, which are the data source in this case. There are two datasets: one contains product information and the other contains customer product reviews. Both files are part of an e-commerce dataset and serve as the data source for this project.
- Alternatively, data from an API, a relational database, or web scraping could have been used for extraction.
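
As a minimal sketch of the extract step, the helper below reads a CSV source and fails early if an expected column is missing. The function name `extract_csv` and the in-memory sample are assumptions for illustration and are not part of the notebook itself.

```python
import io

import pandas as pd


def extract_csv(source, required_columns):
    """Read a CSV source and fail early if an expected column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns in source: {missing}")
    return df


# Hypothetical in-memory CSV standing in for datasets/product_info.csv
sample = io.StringIO("product_id,brand_id,rating\nP1,10,4.5\nP2,11,3.9\n")
products = extract_csv(sample, ["product_id", "brand_id"])
print(products.shape)
```

Checking the columns at extraction time surfaces schema changes in the source before any transformation runs.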

Reading Datasets From CSV Files

In [3]:
# Reading the dataset using pandas to explore the data
df1 = pd.read_csv(dataset1_path)
print(f"Rows:{df1.shape[0]}, Columns:{df1.shape[1]}")
df1.head()

Rows:8494, Columns:27


Unnamed: 0,product_id,product_name,brand_id,brand_name,loves_count,rating,reviews,size,variation_type,variation_value,...,online_only,out_of_stock,sephora_exclusive,highlights,primary_category,secondary_category,tertiary_category,child_count,child_max_price,child_min_price
0,P473671,Fragrance Discovery Set,6342,19-69,6320,3.6364,11.0,,,,...,1,0,0,"['Unisex/ Genderless Scent', 'Warm &Spicy Scen...",Fragrance,Value & Gift Sets,Perfume Gift Sets,0,,
1,P473668,La Habana Eau de Parfum,6342,19-69,3827,4.1538,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,85.0,30.0
2,P473662,Rainbow Bar Eau de Parfum,6342,19-69,3253,4.25,16.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
3,P473660,Kasbah Eau de Parfum,6342,19-69,3018,4.4762,21.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
4,P473658,Purple Haze Eau de Parfum,6342,19-69,2691,3.2308,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0


In [4]:
df2 = pd.read_csv(dataset2_path,index_col=0)
print(f"Rows:{df2.shape[0]}, Columns:{df2.shape[1]}")
df2.head()

Rows:49977, Columns:18




Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1945004256,5,1.0,0.0,2,2,0,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,5478482359,3,1.0,0.333333,3,2,1,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,29002209922,5,1.0,1.0,2,0,2,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,7391078463,5,1.0,1.0,2,0,2,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,1766313888,5,1.0,1.0,13,0,13,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0


Considering only the required columns from both dataframes

In [5]:
df1 = df1[['product_id', 'brand_id', 'loves_count','rating', 'reviews','primary_category','child_count']]
df1.head()

Unnamed: 0,product_id,brand_id,loves_count,rating,reviews,primary_category,child_count
0,P473671,6342,6320,3.6364,11.0,Fragrance,0
1,P473668,6342,3827,4.1538,13.0,Fragrance,2
2,P473662,6342,3253,4.25,16.0,Fragrance,2
3,P473660,6342,3018,4.4762,21.0,Fragrance,2
4,P473658,6342,2691,3.2308,13.0,Fragrance,2


In [6]:
df2 = df2[['author_id', 'rating', 'submission_time', 'review_text',
       'review_title', 'skin_tone', 'eye_color', 'skin_type', 'hair_color',
       'product_id', 'product_name', 'brand_name', 'price_usd']]
df2.head()

Unnamed: 0,author_id,rating,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1945004256,5,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,5478482359,3,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,29002209922,5,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,7391078463,5,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,1766313888,5,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0


## 2. Transform Step
Merging

In [7]:
# Inner join to merge these 2 datasets
df_merged = pd.merge(df1,df2,how="inner",on="product_id")
print(df_merged.shape)
df_merged.head()

(49977, 19)


Unnamed: 0,product_id,brand_id,loves_count,rating_x,reviews,primary_category,child_count,author_id,rating_y,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_name,brand_name,price_usd
0,P453818,6018,11763,4.464,125.0,Skincare,0,6921691467,5,2023-01-02,This product is amazing. Makes my skin feel so...,Will keep buying!,,hazel,dry,blonde,GENIUS Collagen Calming Relief,Algenist,58.0
1,P453818,6018,11763,4.464,125.0,Skincare,0,40727014792,5,2022-11-06,I pair this with the algae niacinamide moistur...,Must have,,blue,combination,blonde,GENIUS Collagen Calming Relief,Algenist,58.0
2,P453818,6018,11763,4.464,125.0,Skincare,0,7186952566,5,2022-10-05,Definitely my favorite I use it for under eye ...,Praying it’s Preventing wrinkles,,,,,GENIUS Collagen Calming Relief,Algenist,58.0
3,P453818,6018,11763,4.464,125.0,Skincare,0,2117812169,5,2022-09-15,I bought this with the toner as I was looking ...,Rosacea relief,light,green,combination,brown,GENIUS Collagen Calming Relief,Algenist,58.0
4,P453818,6018,11763,4.464,125.0,Skincare,0,12538328524,5,2022-06-02,Been using for months now. I went in and asked...,,fair,,,,GENIUS Collagen Calming Relief,Algenist,58.0
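
The `rating_x` and `rating_y` columns above appear because both dataframes contain a `rating` column, so pandas appends its default suffixes (`_x` for the left frame, `_y` for the right). A small sketch on toy data, where the frames and values are made up for illustration:

```python
import pandas as pd

products = pd.DataFrame({"product_id": ["P1", "P2"], "rating": [4.5, 3.9]})
reviews = pd.DataFrame({"product_id": ["P1", "P1", "P3"], "rating": [5, 4, 2]})

# validate="one_to_many" raises a MergeError if product_id repeats on the left side
merged = pd.merge(products, reviews, how="inner", on="product_id",
                  validate="one_to_many")
print(merged.columns.tolist())
print(len(merged))
```

The inner join keeps only the two reviews of `P1`: `P2` has no reviews and `P3` has no matching product row. The `validate` argument is optional here; it is one way to confirm that the product table really holds one row per `product_id`.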


Renaming columns in merged dataframe

In [8]:
# Old name: New name
rename_dict = {
    "loves_count": "favorites_count",
    "rating_x": "avg_product_rating",
    "reviews": "product_reviews_count",
    "primary_category": "product_category",
    "child_count": "variations_count",
    "rating_y": "review_rating",
    "submission_time": "full_date",
    "skin_tone": "reviewer_skin_tone",
    "skin_type": "reviewer_skin_type",
    "eye_color": "reviewer_eye_color",
    "hair_color": "reviewer_hair_color",
    "price_usd": "product_price",
    "author_id": "reviewer_id"
}
df_merged = df_merged.rename(columns=rename_dict)

Cleaning the dataset

In [9]:
df_merged.isna().sum()

product_id                   0
brand_id                     0
favorites_count              0
avg_product_rating           0
product_reviews_count        0
product_category             0
variations_count             0
reviewer_id                  0
review_rating                0
full_date                    0
review_text                 59
review_title             14378
reviewer_skin_tone        7201
reviewer_eye_color        6260
reviewer_skin_type        3631
reviewer_hair_color       8851
product_name                 0
brand_name                   0
product_price                0
dtype: int64

In [10]:
df_merged[df_merged['review_title'].isna() & df_merged['review_text'].notna()].shape

(14319, 19)

- Our main fact table is product reviews, and review_text holds the main review a reviewer wrote for a product, so rows with NaN in review_text are dropped from the dataset.
- review_title, on the other hand, is an optional short title provided by the user, so NaN values in this column can be filled with a default value. Another reason not to drop these rows is that more than 14,000 rows have NaN in review_title but an actual review in review_text. I will therefore fill the NaN values in review_title with a default title, "Review Provided".

In [11]:
df_merged = df_merged.dropna(subset=["review_text"])
df_merged["review_title"] = df_merged["review_title"].fillna("Review Provided")
df_merged.isna().sum()

product_id                  0
brand_id                    0
favorites_count             0
avg_product_rating          0
product_reviews_count       0
product_category            0
variations_count            0
reviewer_id                 0
review_rating               0
full_date                   0
review_text                 0
review_title                0
reviewer_skin_tone       7198
reviewer_eye_color       6256
reviewer_skin_type       3629
reviewer_hair_color      8846
product_name                0
brand_name                  0
product_price               0
dtype: int64

NaN values now remain only in reviewer_skin_tone, reviewer_eye_color, reviewer_skin_type and reviewer_hair_color. These are categorical columns that describe the reviewer. To clean them, we can either drop the affected rows or fill the NaN values with the most frequent value, i.e. the mode of each column. I have chosen to fill the NaN values in these columns with the mode.

In [12]:
df_merged["reviewer_skin_tone"] = df_merged["reviewer_skin_tone"].fillna(df_merged["reviewer_skin_tone"].mode()[0])
df_merged["reviewer_eye_color"] = df_merged["reviewer_eye_color"].fillna(df_merged["reviewer_eye_color"].mode()[0])
df_merged["reviewer_skin_type"] = df_merged["reviewer_skin_type"].fillna(df_merged["reviewer_skin_type"].mode()[0])
df_merged["reviewer_hair_color"] = df_merged["reviewer_hair_color"].fillna(df_merged["reviewer_hair_color"].mode()[0])
# checking nan values again
df_merged.isna().sum()

product_id               0
brand_id                 0
favorites_count          0
avg_product_rating       0
product_reviews_count    0
product_category         0
variations_count         0
reviewer_id              0
review_rating            0
full_date                0
review_text              0
review_title             0
reviewer_skin_tone       0
reviewer_eye_color       0
reviewer_skin_type       0
reviewer_hair_color      0
product_name             0
brand_name               0
product_price            0
dtype: int64
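
The same mode imputation can be written as a loop over the affected columns. This is a sketch on toy data, not a replacement for the cell above. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "reviewer_skin_tone": ["light", None, "light", "fair"],
    "reviewer_eye_color": ["brown", "brown", None, "blue"],
})

for col in ["reviewer_skin_tone", "reviewer_eye_color"]:
    # mode() returns the most frequent values in sorted order;
    # [0] picks the first one when several values are tied
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isna().sum().sum())
```

Note that mode imputation inflates the share of the most common category, which matters if these columns are later used for analysis.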

Now that there are no NaN values left, I will transform the full_date column to create three columns for the date dimension, i.e. year, month and day.

In [13]:
df_merged["full_date"] = pd.to_datetime(df_merged["full_date"])
df_merged["year"] = df_merged["full_date"].dt.year
df_merged["month"] = df_merged["full_date"].dt.month
df_merged["day"] = df_merged["full_date"].dt.day
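
A self-contained sketch of the same decomposition on two sample dates taken from the dataset. The extra `date_key` column is an assumption added for illustration: date dimensions are often keyed by an integer of the form YYYYMMDD, although this notebook uses the timestamp itself as the key.

```python
import pandas as pd

dates = pd.DataFrame({"full_date": ["2023-01-02", "2022-11-06"]})
dates["full_date"] = pd.to_datetime(dates["full_date"])

# The .dt accessor exposes the components of each timestamp
dates["year"] = dates["full_date"].dt.year
dates["month"] = dates["full_date"].dt.month
dates["day"] = dates["full_date"].dt.day

# Hypothetical integer surrogate key (not used elsewhere in this notebook)
dates["date_key"] = dates["full_date"].dt.strftime("%Y%m%d").astype(int)
print(dates[["year", "month", "day", "date_key"]])
```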

In [14]:
df_merged.head()

Unnamed: 0,product_id,brand_id,favorites_count,avg_product_rating,product_reviews_count,product_category,variations_count,reviewer_id,review_rating,full_date,...,reviewer_skin_tone,reviewer_eye_color,reviewer_skin_type,reviewer_hair_color,product_name,brand_name,product_price,year,month,day
0,P453818,6018,11763,4.464,125.0,Skincare,0,6921691467,5,2023-01-02,...,light,hazel,dry,blonde,GENIUS Collagen Calming Relief,Algenist,58.0,2023,1,2
1,P453818,6018,11763,4.464,125.0,Skincare,0,40727014792,5,2022-11-06,...,light,blue,combination,blonde,GENIUS Collagen Calming Relief,Algenist,58.0,2022,11,6
2,P453818,6018,11763,4.464,125.0,Skincare,0,7186952566,5,2022-10-05,...,light,brown,combination,brown,GENIUS Collagen Calming Relief,Algenist,58.0,2022,10,5
3,P453818,6018,11763,4.464,125.0,Skincare,0,2117812169,5,2022-09-15,...,light,green,combination,brown,GENIUS Collagen Calming Relief,Algenist,58.0,2022,9,15
4,P453818,6018,11763,4.464,125.0,Skincare,0,12538328524,5,2022-06-02,...,fair,brown,combination,brown,GENIUS Collagen Calming Relief,Algenist,58.0,2022,6,2


In [15]:
df_merged.columns

Index(['product_id', 'brand_id', 'favorites_count', 'avg_product_rating',
       'product_reviews_count', 'product_category', 'variations_count',
       'reviewer_id', 'review_rating', 'full_date', 'review_text',
       'review_title', 'reviewer_skin_tone', 'reviewer_eye_color',
       'reviewer_skin_type', 'reviewer_hair_color', 'product_name',
       'brand_name', 'product_price', 'year', 'month', 'day'],
      dtype='object')

In [16]:
df_merged.shape

(49918, 22)

Converting the reviewer id column to int

In [139]:
# checking for non-numeric reviewer_id (True where the id consists only of digits)
is_numeric = df_merged["reviewer_id"].astype(str).str.isdigit()
# removing non numeric reviewer ids; .copy() avoids a SettingWithCopyWarning below
df_merged = df_merged[is_numeric].copy()
# converting reviewer_id to int
df_merged["reviewer_id"] = df_merged["reviewer_id"].apply(int)
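
An alternative sketch for the same cleanup uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN. The sample ids below, including the non-numeric one, are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"reviewer_id": ["6921691467", "not-a-number", 40727014792]})

# errors="coerce" replaces values that cannot be parsed as numbers with NaN
numeric_ids = pd.to_numeric(df["reviewer_id"], errors="coerce")
mask = numeric_ids.notna()

df = df.loc[mask].copy()
df["reviewer_id"] = numeric_ids[mask].astype("int64")
print(df["reviewer_id"].tolist())
```

Unlike a digits-only check, `pd.to_numeric` also accepts values such as `"-3"` or `"1.5"`, so it is slightly more permissive. Which behavior is appropriate depends on what the non-numeric ids in the source actually look like.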

Creating multiple dataframes for fact and dimension tables

In [140]:
# Product Reviews Table: Fact Table
reviews_df = df_merged[['product_id','brand_id','reviewer_id','full_date','review_title','review_text','review_rating']]
reviews_df = reviews_df.rename(columns={'full_date': 'date_id'})
reviews_df = reviews_df.reset_index(drop=True)
reviews_df.insert(0, 'review_id', reviews_df.index + 1)
print(reviews_df.shape)
reviews_df.head()

(49915, 8)


Unnamed: 0,review_id,product_id,brand_id,reviewer_id,date_id,review_title,review_text,review_rating
0,1,P453818,6018,6921691467,2023-01-02,Will keep buying!,This product is amazing. Makes my skin feel so...,5
1,2,P453818,6018,40727014792,2022-11-06,Must have,I pair this with the algae niacinamide moistur...,5
2,3,P453818,6018,7186952566,2022-10-05,Praying it’s Preventing wrinkles,Definitely my favorite I use it for under eye ...,5
3,4,P453818,6018,2117812169,2022-09-15,Rosacea relief,I bought this with the toner as I was looking ...,5
4,5,P453818,6018,12538328524,2022-06-02,Review Provided,Been using for months now. I went in and asked...,5


In [141]:
# Product table: Dimension table
product_df = df_merged[['product_id', 'product_name', 'avg_product_rating', 'product_price', 'product_reviews_count', 'favorites_count', 'variations_count', 'product_category']]
print(product_df.shape)
# To keep only unique product descriptions in product_df
product_df = product_df.drop_duplicates("product_id").reset_index(drop=True)
print(product_df.shape)
product_df.head()

(49915, 8)
(1104, 8)


Unnamed: 0,product_id,product_name,avg_product_rating,product_price,product_reviews_count,favorites_count,variations_count,product_category
0,P453818,GENIUS Collagen Calming Relief,4.464,58.0,125.0,11763,0,Skincare
1,P442859,ALIVE Prebiotic Balancing Mask,4.3729,38.0,118.0,14367,0,Skincare
2,P388262,GENIUS Ultimate Anti-Aging Eye Cream,3.7759,74.0,116.0,6866,0,Skincare
3,P457694,Blue Algae Vitamin C Dark Spot Correcting Peel,4.213,85.0,108.0,11488,0,Skincare
4,P447504,AA (Alguronic Acid) Barrier Serum,3.97,85.0,100.0,3877,0,Skincare


In [142]:
# Brand table: dimension table
brand_df = df_merged[['brand_id', 'brand_name']]
print(brand_df.shape)
# To keep only the unique brand details in brands dataframe
brand_df = brand_df.drop_duplicates("brand_id").reset_index(drop=True)
print(brand_df.shape)
brand_df.head()

(49915, 2)
(122, 2)


Unnamed: 0,brand_id,brand_name
0,6018,Algenist
1,6283,Alpha-H
2,6312,alpyn beauty
3,5746,Anastasia Beverly Hills
4,6356,Augustinus Bader


In [143]:
import numpy as np

In [144]:
def checkNumeric(val):
    return str(val).isdigit()

In [145]:
# Reviewer table: dimension table
reviewer_df = df_merged[['reviewer_id', 'reviewer_skin_tone', 'reviewer_skin_type', 'reviewer_eye_color', 'reviewer_hair_color']]
print(reviewer_df.shape)
# To keep only unique reviewer details
reviewer_df = reviewer_df.drop_duplicates("reviewer_id").reset_index(drop=True)
print(reviewer_df.shape)
reviewer_df.head()

(49915, 5)
(38855, 5)


Unnamed: 0,reviewer_id,reviewer_skin_tone,reviewer_skin_type,reviewer_eye_color,reviewer_hair_color
0,6921691467,light,dry,hazel,blonde
1,40727014792,light,combination,blue,blonde
2,7186952566,light,combination,brown,brown
3,2117812169,light,combination,green,brown
4,12538328524,fair,combination,brown,brown
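
`drop_duplicates("reviewer_id")` keeps the first row it sees for each id, so if the same reviewer appears with different attribute values, the later values are silently discarded. A minimal sketch, on made-up data, of how such conflicts could be detected beforehand:

```python
import pandas as pd

reviewers = pd.DataFrame({
    "reviewer_id": [1, 1, 2],
    "reviewer_skin_type": ["dry", "oily", "combination"],
})

# Count distinct attribute values per reviewer; more than one means a conflict
distinct_counts = reviewers.groupby("reviewer_id")["reviewer_skin_type"].nunique()
conflicting_ids = distinct_counts[distinct_counts > 1].index.tolist()
print(conflicting_ids)
```

The same check applies to the product and brand dimensions, using `product_id` or `brand_id` as the grouping key.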


In [146]:
# Date table: dimension table
date_df = df_merged[['full_date', 'year', 'month', 'day']]
date_df = date_df.rename(columns={"full_date": "date_id"})
date_df = date_df.drop_duplicates("date_id").reset_index(drop=True)
print(date_df.shape)
date_df.head()

(3490, 4)


Unnamed: 0,date_id,year,month,day
0,2023-01-02,2023,1,2
1,2022-11-06,2022,11,6
2,2022-10-05,2022,10,5
3,2022-09-15,2022,9,15
4,2022-06-02,2022,6,2


In [147]:
# saving transformed dataframes to local disk
product_df.to_csv("datasets/product.csv",index=False)
brand_df.to_csv("datasets/brand.csv",index=False)
reviewer_df.to_csv("datasets/reviewer.csv",index=False)
date_df.to_csv("datasets/date.csv",index=False)
reviews_df.to_csv("datasets/reviews.csv",index=False)
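
Because the fact table will reference the dimension tables through foreign keys, it can help to confirm before loading that every key in the fact table exists in its dimension. The helper `orphan_keys` and the toy frames below are assumptions for illustration.

```python
import pandas as pd


def orphan_keys(fact_df, dim_df, key):
    """Return the fact-table key values that are absent from the dimension table."""
    missing = ~fact_df[key].isin(dim_df[key])
    return sorted(fact_df.loc[missing, key].unique().tolist())


fact = pd.DataFrame({"review_id": [1, 2, 3], "product_id": ["P1", "P2", "P9"]})
dim = pd.DataFrame({"product_id": ["P1", "P2"]})
print(orphan_keys(fact, dim, "product_id"))
```

In this notebook, a call such as `orphan_keys(reviews_df, product_df, "product_id")` would be expected to return an empty list, since every dimension is derived from the same merged dataframe.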

## 3. Load Step
In this step, the transformed dataframes are loaded into the tables of the data warehouse.

Before loading the data, we create a database in PostgreSQL along with the fact and dimension tables.

In [102]:
# Loading data from env file
import os
from dotenv import load_dotenv
load_dotenv()
db_host = os.environ.get('DB_HOST')
db_user = os.environ.get('DEFAULT_PG_USER')
db_pwd = os.environ.get('DEFAULT_PG_PASSWORD')
db_default_db = os.environ.get('DEFAULT_DB_NAME')
datawarehouse_name = os.environ.get('DATAWAREHOUSE_NAME')
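
`os.environ.get` returns `None` for a variable that is not set, which would only surface later as a confusing connection error. A small sketch of failing early instead; the helper `require_env` and the sample values are hypothetical.

```python
import os


def require_env(names, env=None):
    """Return the requested settings, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Hypothetical values standing in for the contents of the .env file
settings = require_env(
    ["DB_HOST", "DATAWAREHOUSE_NAME"],
    env={"DB_HOST": "localhost", "DATAWAREHOUSE_NAME": "dw"},
)
print(settings["DB_HOST"])
```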

In [103]:
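# IF EXISTS lets the queries succeed even when a table has not been created yet
DROP_TABLE_QUERIES = [
    """
    DROP TABLE IF EXISTS tbl_product_reviews;
    """,
    """
    DROP TABLE IF EXISTS tbl_product;
    """,
    """
    DROP TABLE IF EXISTS tbl_brand;
    """,
    """
    DROP TABLE IF EXISTS tbl_reviewer;
    """,
    """
    DROP TABLE IF EXISTS tbl_date;
    """
]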

In [135]:
CREATE_TABLE_QUERIES = [
    """ 
    CREATE TABLE IF NOT EXISTS tbl_product(
        product_id TEXT PRIMARY KEY,
        product_name TEXT,
        avg_product_rating FLOAT,
        product_price FLOAT,
        product_reviews_count FLOAT,
        favorites_count INT,
        variations_count INT,
        product_category TEXT
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS tbl_brand(
        brand_id INT PRIMARY KEY,
        brand_name TEXT UNIQUE
    )
    """,
    """ 
    CREATE TABLE IF NOT EXISTS tbl_reviewer(
        reviewer_id BIGINT PRIMARY KEY,
        reviewer_skin_tone TEXT,
        reviewer_skin_type TEXT,
        reviewer_eye_color TEXT,
        reviewer_hair_color TEXT
    )
    """,
    """ 
    CREATE TABLE IF NOT EXISTS tbl_date(
        date_id TIMESTAMP PRIMARY KEY,
        year INT,
        month INT,
        day INT
    )
    """,
    """ 
    CREATE TABLE IF NOT EXISTS tbl_product_reviews(
        review_id INT PRIMARY KEY,
        product_id TEXT REFERENCES tbl_product(product_id),
        brand_id INT REFERENCES tbl_brand(brand_id),
        reviewer_id BIGINT REFERENCES tbl_reviewer(reviewer_id),
        date_id TIMESTAMP REFERENCES tbl_date(date_id),
        review_title TEXT,
        review_text TEXT,
        review_rating INT
    )
    """
]

In [129]:
# Parameterized INSERT queries, keyed by table, for inserting dataframe rows directly
INSERT_TABLE_QUERIES = {
    "product": 
        """ 
        INSERT INTO tbl_product(product_id, product_name, avg_product_rating, product_price, product_reviews_count, favorites_count, variations_count, product_category) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)
        """,
    "brand":
        """ 
        INSERT INTO tbl_brand(brand_id, brand_name) VALUES (%s,%s)
        """,
    "reviewer":
        """ 
        INSERT INTO tbl_reviewer(reviewer_id, reviewer_skin_tone, reviewer_skin_type, reviewer_eye_color, reviewer_hair_color) VALUES (%s,%s,%s,%s,%s)
        """,
    "date":
        """ 
        INSERT INTO tbl_date(date_id, year, month, day) VALUES (%s,%s,%s,%s)
        """,
    "reviews":
        """ 
        INSERT INTO tbl_product_reviews(review_id, product_id, brand_id, reviewer_id, date_id, review_title, review_text, review_rating) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)
        """
}

In [106]:
TRUNCATE_TABLE_QUERIES = [
    """ 
    TRUNCATE TABLE tbl_product_reviews;
    """,
    """ 
    TRUNCATE TABLE tbl_product CASCADE;
    """,
    """ 
    TRUNCATE TABLE tbl_brand CASCADE;
    """,
    """ 
    TRUNCATE TABLE tbl_reviewer CASCADE;
    """,
    """ 
    TRUNCATE TABLE tbl_date CASCADE;
    """
]

In [29]:
# Creating datawarehouse
def create_datawarehouse():
    print("Started creating data warehouse")
    # connection to connect to default postgresql database
    db_conn_string = f"host={db_host} dbname={db_default_db} user={db_user} password={db_pwd}"
    conn = psycopg2.connect(db_conn_string)
    conn.set_session(autocommit=True)
    cur = conn.cursor() 
    
    # creating the datawarehouse 
    cur.execute(f"DROP DATABASE IF EXISTS {datawarehouse_name}")
    cur.execute(f"CREATE DATABASE {datawarehouse_name}")
    # close connection to default database
    print("Data warehouse created succesfully")
    conn.close()
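
The database name above is interpolated into the SQL text with an f-string, so it is worth validating before use. The sketch below is one simple approach using only the standard library. The helper `safe_identifier` and the sample database name are assumptions for illustration. psycopg2 also ships a `psycopg2.sql` module with an `Identifier` class intended for composing identifiers into queries, which may be the better fit in real code.

```python
import re

# Plain identifiers only: a letter or underscore, then letters, digits, underscores
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name):
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return name


# Hypothetical warehouse name
print(f"CREATE DATABASE {safe_identifier('product_reviews_dw')}")
```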

In [31]:
def connect_datawarehouse():
    # creating connection to datawarehouse 
    dw_string = f"host={db_host} dbname={datawarehouse_name} user={db_user} password={db_pwd}"
    dw_conn = psycopg2.connect(dw_string)
    # creating datawarehouse cursor
    dw_cur = dw_conn.cursor()
    print(f"Connected to datawarehouse")
    return dw_conn, dw_cur

In [32]:
def restart_connection(conn):
    # close the current datawarehouse connection
    conn.close()
    # start new connection
    dw_conn, dw_cur = connect_datawarehouse() 
    return dw_conn, dw_cur

In [33]:
def drop_existing_tables(cur,conn,drop_table_queries):
    print("Dropping existing tables from datawarehouse")
    try:
        for query in drop_table_queries: 
            cur.execute(query)
        conn.commit() 
        print("Existing tables dropped")
    except Exception as e:
        conn.rollback() 
        print(f"Tables weren't dropped.\n Following error encountered: {e}")

In [34]:
def truncate_existing_tables(cur,conn,truncate_table_queries):
    print("Truncating existing tables from datawarehouse")
    try:
        for query in truncate_table_queries: 
            cur.execute(query)
        conn.commit() 
        print("Existing tables truncated")
    except Exception as e:
        conn.rollback() 
        print(f"Tables weren't truncated.\n Following error encountered: {e}")

In [35]:
# Creating tables for fact table and dimension table in the datawarehouse
def create_tables(cur,conn,create_table_queries):
    print(f"Creating tables inside data warehouse")
    try:
        for query in create_table_queries:
            cur.execute(query)
        conn.commit() 
        print("Tables created successfully")
    except Exception as e:
        conn.rollback()
        print(f"Tables weren't created.\n Following error encountered: {e}")

In [84]:
def insert_data_to_tables(cur, conn, insert_table_queries):
    last_val = 0
    try:
        # insert into product table from product df
        for i, row in product_df.iterrows():
            cur.execute(insert_table_queries["product"],list(row))
        print("Inserted to product")
        # insert into brands table from brand df
        for i, row in brand_df.iterrows():
            cur.execute(insert_table_queries["brand"],list(row))
        print("Inserted to Brand")
        # insert into reviewer table from reviewer df
        for i, row in reviewer_df.iterrows():
            last_val = i
            cur.execute(insert_table_queries["reviewer"],list(row))
        print("Inserted to Reviewer")
        # insert into date table from date df 
        for i,row in date_df.iterrows():
            cur.execute(insert_table_queries["date"],list(row))
        print("Inserted to Date")
        # insert into reviews table from reviews df 
        for i,row in reviews_df.iterrows():
            last_val = i
            cur.execute(insert_table_queries["reviews"],list(row))
        print("Inserted to Product Reviews")
        conn.commit()
        print("All Data was inserted successfully into tables")
    except Exception as e:
        print(last_val)
        print("Query execution failed, so data wasn't inserted")
        print(f"Following error occurred: {e}")
        conn.rollback()

In [37]:
# creating data warehouse
create_datawarehouse() 

Started creating data warehouse
Data warehouse created succesfully


In [58]:
dw_conn, dw_cur= connect_datawarehouse() 

Connected to datawarehouse


In [136]:
# dropping existing tables
drop_existing_tables(dw_cur,dw_conn,DROP_TABLE_QUERIES)

Dropping existing tables from datawarehouse
Existing tables dropped


In [61]:
truncate_existing_tables(dw_cur,dw_conn,TRUNCATE_TABLE_QUERIES)

Truncating existing tables from datawarehouse
Existing tables truncated


In [137]:
# creating tables
create_tables(dw_cur,dw_conn,CREATE_TABLE_QUERIES)

Creating tables inside data warehouse
Tables created successfully


In [48]:
# close the current datawarehouse connection
dw_conn.close()

In [60]:
dw_conn, dw_cur= restart_connection(dw_conn)

Connected to datawarehouse


In [148]:
insert_data_to_tables(dw_cur, dw_conn, INSERT_TABLE_QUERIES)

Inserted to product
Inserted to Brand
Inserted to Reviewer
Inserted to Date
Inserted to Product Reviews
All Data was inserted successfully into tables


In [87]:
product_df[product_df["product_id"]== "P402992"]

Unnamed: 0,product_id,product_name,avg_product_rating,product_price,product_reviews_count,favorites_count,variations_count,product_category


In [49]:
"""
Using SQLAlchemy to create an engine and a connection for inserting the data.
Alternatively, we can loop over the INSERT queries to insert the data into each table manually.
"""

db_url = f"postgresql://{db_user}:{db_pwd}@{db_host}:5432/{datawarehouse_name}"

# create the engine
engine = create_engine(db_url)

# test the connection
try:
    dw_conn = engine.connect()
    print("Connected successfully!")
except Exception as e:
    print("Error:", str(e))

Connected successfully!


In [52]:
dw_conn.close()

In [None]:
# Inserting data to the tables in datawarehouse from pandas dataframe
print("Inserting data in dimension table: Product")
product_df.to_sql('tbl_product',con=engine,if_exists='append',index=False)
print("Inserting data in dimension table: Brand")
brand_df.to_sql('tbl_brand',con=engine,if_exists='append',index=False)
print("Inserting data in dimension table: Reviewer")
reviewer_df.to_sql('tbl_reviewer',con=engine,if_exists='append',index=False)
print("Inserting data in dimension table: Date")
date_df.to_sql('tbl_date',con=engine,if_exists='append',index=False)
print("Inserting data in fact table: Product Reviews")
reviews_df.to_sql('tbl_product_reviews',con=engine,if_exists='append',index=False)
print("All data loaded successfully")
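
A runnable sketch of the `to_sql` approach on a small sample. An in-memory `sqlite3` connection stands in for the SQLAlchemy engine so the example needs no database server; that substitution is an assumption for illustration. The `chunksize` value is arbitrary.

```python
import sqlite3

import pandas as pd

brand_sample = pd.DataFrame({
    "brand_id": [6018, 6283, 6312],
    "brand_name": ["Algenist", "Alpha-H", "alpyn beauty"],
})

conn = sqlite3.connect(":memory:")  # stand-in for the SQLAlchemy engine
# chunksize limits how many rows are written per batch
brand_sample.to_sql("tbl_brand", con=conn, if_exists="append", index=False,
                    chunksize=2)

loaded = pd.read_sql("SELECT COUNT(*) AS n FROM tbl_brand", conn)["n"].iloc[0]
print(loaded)
```

Using `if_exists='append'` matters when the tables were created beforehand with explicit keys and constraints: `'replace'` would drop the table and let pandas recreate it from the dataframe's dtypes, losing those constraints.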

#### Columns Description
1. **product_id**: Unique Product Identifier
2. **product_name**: Full name of Product
3. **brand_id**: Unique Brand Identifier
4. **brand_name**: Full name of product brand
5. **loves_count**: No of people who marked this product as favorite
6. **rating**: Average rating of product based on user reviews
7. **reviews**: No of user reviews for the product 
8. **size**: Product size, may be in oz, mL, g, packs, or other units
9. **variation_type**: The type of variation parameter for the product (e.g. Size, Color)
10. **variation_value**: The specific value of the variation parameter for the product (e.g. 100 mL, Golden Sand)
11. **variation_desc**: A description of the variation parameter for the product (e.g. tone for fairest skin)
12. **ingredients**: A list of ingredients included in the product, for example: [‘Product variation 1:’, ‘Water, Glycerin’, ‘Product variation 2:’, ‘Talc, Mica’] or if no variations [‘Water, Glycerin’]
13. **price_usd**: The price of the product in US dollars
14. **value_price_usd**: The potential cost savings of the product, presented on the site next to the regular price
15. **sale_price_usd**: The sale price of the product in US dollars
16. **limited_edition**: Indicates whether the product is a limited edition or not (1-true, 0-false)
17. **new**: Indicates whether the product is new or not (1-true, 0-false)
18. **online_only**: Indicates whether the product is only sold online or not (1-true, 0-false)
19. **out_of_stock**: Indicates whether the product is currently out of stock or not (1 if true, 0 if false)
20. **sephora_exclusive**: Indicates whether the product is exclusive to Sephora or not (1 if true, 0 if false)
21. **highlights**: A list of tags or features that highlight the product's attributes (e.g. [‘Vegan’, ‘Matte Finish’])
22. **primary_category**: First category in the breadcrumb section
23. **secondary_category**: Second category in the breadcrumb section
24. **tertiary_category**: Third category in the breadcrumb section
25. **child_count**: The number of variations of the product available
26. **child_max_price**: The highest price among the variations of the product
27. **child_min_price**: The lowest price among the variations of the product

Among these columns, I will be considering only product_id, product_name, brand_id, brand_name, rating, reviews, price_usd

Reviewing some columns before deciding whether to select them or not

In [36]:
# df1[['size', 'variation_type', 'variation_value','variation_desc', 'ingredients']].head()

Unnamed: 0,size,variation_type,variation_value,variation_desc,ingredients
0,,,,,"['Capri Eau de Parfum:', 'Alcohol Denat. (SD A..."
1,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."
2,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."
3,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."
4,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."


In [37]:
# df1[['size', 'variation_type', 'variation_value','variation_desc', 'ingredients']].isna().sum()

size               1631
variation_type     1444
variation_value    1598
variation_desc     7244
ingredients         945
dtype: int64

The columns size, variation_type, variation_value, variation_desc and ingredients have a lot of NaN values. They are also not relevant to the data warehouse we are creating.
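
Raw NaN counts are easier to judge as a fraction of the rows. The sketch below computes that fraction per column and lists the columns above a cutoff. The toy data and the 0.5 threshold are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["3.4 oz", None, None, None],
    "price_usd": [35.0, 195.0, 195.0, 195.0],
})

# isna().mean() gives the fraction of missing values per column
missing_ratio = df.isna().mean()
sparse_columns = missing_ratio[missing_ratio > 0.5].index.tolist()
print(sparse_columns)
```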

In [38]:
# df1[['price_usd', 'value_price_usd','sale_price_usd']].head()

Unnamed: 0,price_usd,value_price_usd,sale_price_usd
0,35.0,,
1,195.0,,
2,195.0,,
3,195.0,,
4,195.0,,


In [39]:
df1[['price_usd', 'value_price_usd','sale_price_usd']].isna().sum()

price_usd             0
value_price_usd    8043
sale_price_usd     8224
dtype: int64

Here value_price_usd and sale_price_usd are almost entirely NaN (8043 and 8224 of 8494 rows). Cleaning these columns is not necessary, as we are only concerned with the original price in this task.

In [44]:
df1[['primary_category','secondary_category', 'tertiary_category']].head()

Unnamed: 0,primary_category,secondary_category,tertiary_category
0,Fragrance,Value & Gift Sets,Perfume Gift Sets
1,Fragrance,Women,Perfume
2,Fragrance,Women,Perfume
3,Fragrance,Women,Perfume
4,Fragrance,Women,Perfume


In [45]:
df1[['primary_category','secondary_category', 'tertiary_category']].isna().sum()

primary_category        0
secondary_category      8
tertiary_category     990
dtype: int64

In [46]:
df1['primary_category'].value_counts()

primary_category
Skincare           2420
Makeup             2369
Hair               1464
Fragrance          1432
Bath & Body         405
Mini Size           288
Men                  60
Tools & Brushes      52
Gifts                 4
Name: count, dtype: int64

I will consider only the primary category, as it is the main category and is sufficient for our task.

In [47]:
product_subset = ["product_id","product_name","rating","reviews","loves_count","price_usd", "child_count","primary_category"]
brand_subset = ["brand_id","brand_name"]

In [48]:
df_product = df1[product_subset]
df_product.head()

Unnamed: 0,product_id,product_name,rating,reviews,loves_count,price_usd,child_count,primary_category
0,P473671,Fragrance Discovery Set,3.6364,11.0,6320,35.0,0,Fragrance
1,P473668,La Habana Eau de Parfum,4.1538,13.0,3827,195.0,2,Fragrance
2,P473662,Rainbow Bar Eau de Parfum,4.25,16.0,3253,195.0,2,Fragrance
3,P473660,Kasbah Eau de Parfum,4.4762,21.0,3018,195.0,2,Fragrance
4,P473658,Purple Haze Eau de Parfum,3.2308,13.0,2691,195.0,2,Fragrance


In [49]:
df_product.shape

(8494, 8)

In [50]:
df_product.isna().sum()

product_id            0
product_name          0
rating              278
reviews             278
loves_count           0
price_usd             0
child_count           0
primary_category      0
dtype: int64

In [51]:
# Replace NaN in rating and reviews with 0, treating 0 as the lowest possible average rating and the lowest possible number of reviews
# fillna with a dict returns a new DataFrame, which avoids the SettingWithCopyWarning raised by in-place fills on a slice of df1
df_product = df_product.fillna({"rating": 0, "reviews": 0})


In [52]:
df_product.isna().sum()

product_id          0
product_name        0
rating              0
reviews             0
loves_count         0
price_usd           0
child_count         0
primary_category    0
dtype: int64
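Filling missing values on a DataFrame that was sliced out of another one can trigger pandas' SettingWithCopyWarning. A minimal sketch of a pattern that avoids it is shown below, assuming a toy frame `source` in place of `df1`; all values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1 (made-up values)
source = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "rating": [4.5, np.nan, 3.0],
    "reviews": [10.0, np.nan, 2.0],
    "brand_id": [1, 1, 2],
})

# Take an explicit copy of the column subset so later edits do not act on a slice of `source`
products = source[["product_id", "rating", "reviews"]].copy()

# fillna with a dict fills each listed column with its own value and returns a new DataFrame
products = products.fillna({"rating": 0, "reviews": 0})
print(products.isna().sum())
```

The original frame is left untouched, so the raw NaN values remain available if a different fill strategy is needed later.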

In [53]:
df_product.head()

Unnamed: 0,product_id,product_name,rating,reviews,loves_count,price_usd,child_count,primary_category
0,P473671,Fragrance Discovery Set,3.6364,11.0,6320,35.0,0,Fragrance
1,P473668,La Habana Eau de Parfum,4.1538,13.0,3827,195.0,2,Fragrance
2,P473662,Rainbow Bar Eau de Parfum,4.25,16.0,3253,195.0,2,Fragrance
3,P473660,Kasbah Eau de Parfum,4.4762,21.0,3018,195.0,2,Fragrance
4,P473658,Purple Haze Eau de Parfum,3.2308,13.0,2691,195.0,2,Fragrance


In [54]:
df_brand = df1[brand_subset]
df_brand.head()

Unnamed: 0,brand_id,brand_name
0,6342,19-69
1,6342,19-69
2,6342,19-69
3,6342,19-69
4,6342,19-69


In [55]:
df_brand.isna().sum()

brand_id      0
brand_name    0
dtype: int64

In [56]:
df_brand['brand_name'].value_counts()

brand_name
SEPHORA COLLECTION     352
CLINIQUE               179
Dior                   136
tarte                  131
NEST New York          115
                      ... 
Aquis                    1
Narciso Rodriguez        1
Jillian Dempsey          1
DOMINIQUE COSMETICS      1
iluminage                1
Name: count, Length: 304, dtype: int64
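`df_brand` still has one row per product, so each brand repeats. A brand dimension table would presumably need one row per brand; a minimal sketch of that deduplication follows, using a toy frame in which the second brand is made up.

```python
import pandas as pd

# Toy stand-in for df_brand: one row per product, so brands repeat
brands_raw = pd.DataFrame({
    "brand_id": [6342, 6342, 6342, 1111],
    "brand_name": ["19-69", "19-69", "19-69", "Example Brand"],
})

# Keep one row per brand and renumber the index
brands = brands_raw.drop_duplicates(subset="brand_id").reset_index(drop=True)
print(brands)
```

Deduplicating on `brand_id` assumes that each id maps to exactly one brand name, which would be worth checking on the real data.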

Reading the 2nd dataset containing product_reviews

In [8]:
df2 = pd.read_csv(dataset2_path,index_col=0)
print(f"Rows: {df2.shape[0]}, Columns: {df2.shape[1]}")
df2.head()

Rows: 49977, Columns: 18




Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1945004256,5,1.0,0.0,2,2,0,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,5478482359,3,1.0,0.333333,3,2,1,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,29002209922,5,1.0,1.0,2,0,2,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,7391078463,5,1.0,1.0,2,0,2,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,1766313888,5,1.0,1.0,13,0,13,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0


In [9]:
df2.columns

Index(['author_id', 'rating', 'is_recommended', 'helpfulness',
       'total_feedback_count', 'total_neg_feedback_count',
       'total_pos_feedback_count', 'submission_time', 'review_text',
       'review_title', 'skin_tone', 'eye_color', 'skin_type', 'hair_color',
       'product_id', 'product_name', 'brand_name', 'price_usd'],
      dtype='object')

### Columns Description
1. **author_id**: The unique identifier for the author of the review on the website
2. **rating**: The rating given by the author for the product on a scale of 1 to 5
3. **is_recommended**: Indicates if the author recommends the product or not (1-true, 0-false)
4. **helpfulness**: The ratio of positive ratings to all ratings for the review: helpfulness = total_pos_feedback_count / total_feedback_count
5. **total_feedback_count**: Total number of feedback (positive and negative ratings) left by users for the review
6. **total_neg_feedback_count**: The number of users who gave a negative rating for the review
7. **total_pos_feedback_count**: The number of users who gave a positive rating for the review
8. **submission_time**: Date the review was posted on the website in the 'yyyy-mm-dd' format
9. **review_text**: The main text of the review written by the author
10. **review_title**: The title of the review written by the author
11. **skin_tone**: Author's skin tone (e.g. fair, tan, etc.)
12. **eye_color**: Author's eye color (e.g. brown, green, etc.)
13. **skin_type**: Author's skin type (e.g. combination, oily, etc.)
14. **hair_color**: Author's hair color (e.g. brown, auburn, etc.)
15. **product_id**: Unique identifier for the product on the website
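The helpfulness formula from the list above can be checked against the sample rows. The sketch below recomputes it from the feedback counts of the first three rows of `df2.head()`; masking reviews with zero feedback to NaN is an assumption about how such rows might be handled.

```python
import pandas as pd

# Feedback counts taken from the first three rows of df2.head()
feedback = pd.DataFrame({
    "total_feedback_count": [2, 3, 2],
    "total_pos_feedback_count": [0, 1, 2],
})

# helpfulness = positive feedback / total feedback
# Rows with zero total feedback are masked to NaN to avoid 0 / 0 (an assumed convention)
total = feedback["total_feedback_count"]
feedback["helpfulness"] = feedback["total_pos_feedback_count"] / total.where(total > 0)
print(feedback)
```

The recomputed values match the `helpfulness` column shown in the sample output (0.0, 0.333333 and 1.0).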

From here we can create a review table and a reviewer table

In [12]:
author_subset = ['author_id','skin_tone','eye_color','skin_type','hair_color']
review_subset = ['review_text','review_title']
product_subset = ['product_id', 'product_name', 'brand_name', 'price_usd','rating']

In [14]:
df_author = df2[author_subset]
df_author.head()

Unnamed: 0,author_id,skin_tone,eye_color,skin_type,hair_color
0,1945004256,lightMedium,green,combination,
1,5478482359,mediumTan,brown,oily,black
2,29002209922,lightMedium,brown,dry,black
3,7391078463,lightMedium,brown,combination,blonde
4,1766313888,mediumTan,brown,combination,black


In [16]:
df_author.shape

(49977, 5)

In [15]:
df_author.isna().sum()

author_id        0
skin_tone     7201
eye_color     6260
skin_type     3631
hair_color    8851
dtype: int64

In [37]:
df_author[df_author[["skin_tone", "eye_color", "skin_type", "hair_color"]].isna().all(axis=1)].shape

(3511, 5)

A total of 3511 rows have NaN in every author feature. These rows carry no information about the author, so they can be dropped.

In [38]:
condition = (df_author["skin_tone"].isna() & df_author["eye_color"].isna() & df_author["skin_type"].isna() & df_author["hair_color"].isna())

# Drop the rows that satisfy the condition
df_author = df_author.drop(df_author[condition].index)
df_author.isna().sum()

author_id        0
skin_tone     3690
eye_color     2749
skin_type      120
hair_color    5340
dtype: int64
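Dropping rows in which every feature is missing can also be expressed with `DataFrame.dropna` and its `subset` and `how` parameters. A minimal sketch on a toy frame (made-up values standing in for `df_author`):

```python
import numpy as np
import pandas as pd

feature_columns = ["skin_tone", "eye_color", "skin_type", "hair_color"]

# Toy stand-in for df_author (made-up values)
authors = pd.DataFrame({
    "author_id": [1, 2, 3],
    "skin_tone": ["light", np.nan, np.nan],
    "eye_color": ["green", np.nan, "brown"],
    "skin_type": ["dry", np.nan, np.nan],
    "hair_color": [np.nan, np.nan, "black"],
})

# Drop a row only when every feature column is missing
authors_clean = authors.dropna(subset=feature_columns, how="all")
print(authors_clean)
```

Only author 2 is removed here; rows with at least one known feature are kept for later imputation.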

For the remaining rows, the missing values are replaced with the mode of each column.

In [43]:
df_author["skin_tone"] = df_author["skin_tone"].fillna(df_author["skin_tone"].mode()[0])
df_author["eye_color"] = df_author["eye_color"].fillna(df_author["eye_color"].mode()[0])
df_author["skin_type"] = df_author["skin_type"].fillna(df_author["skin_type"].mode()[0])
df_author["hair_color"] = df_author["hair_color"].fillna(df_author["hair_color"].mode()[0])
df_author.isna().sum()

author_id     0
skin_tone     0
eye_color     0
skin_type     0
hair_color    0
dtype: int64

All of the columns with NaN values are categorical, so a reasonable cleaning step is to replace the missing values with the most frequent value, i.e. the mode.
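Mode imputation of several categorical columns can be written as a short loop. The sketch below uses a toy frame with made-up values in place of `df_author`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_author (made-up values)
authors = pd.DataFrame({
    "skin_type": ["dry", "oily", "dry", np.nan],
    "eye_color": ["brown", np.nan, "brown", "green"],
})

# Fill each categorical column with its own most frequent value
for column in ["skin_type", "eye_color"]:
    most_frequent = authors[column].mode()[0]  # mode() ignores NaN by default
    authors[column] = authors[column].fillna(most_frequent)
print(authors)
```

Note that mode imputation inflates the share of the most common category, which may matter if these features are later used for analysis.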

In [18]:
# Copy the subset so that adding a column does not raise SettingWithCopyWarning
df_review = df2[review_subset].copy()
df_review["review_id"] = range(1, len(df_review) + 1)
df_review = df_review[["review_id",'review_title',"review_text"]]
df_review.head()


Unnamed: 0,review_id,review_title,review_text
0,1,A must have!,I absolutely L-O-V-E this oil. I have acne pro...
1,2,it keeps oily skin under control,I gave this 3 stars because it give me tiny li...
2,3,Worth the money!,Works well as soon as I wash my face and pat d...
3,4,best face oil,"this oil helped with hydration and breakouts, ..."
4,5,Maskne miracle,This is my first product review ever so that s...


In [19]:
df_review.isna().sum()

review_id           0
review_title    14378
review_text        59
dtype: int64

In this case, I believe it is better to replace NaN values with placeholder text such as "A product review".

In [28]:
both_empty = df_review["review_text"].isna() & df_review["review_title"].isna()
print(f'Both Review title and Review text empty: {both_empty.sum()}')
df_review[both_empty].head()

Both Review title and Review text empty: 59


Unnamed: 0,review_id,review_title,review_text
993,994,,
1154,1155,,
2570,2571,,
2961,2962,,
3128,3129,,


All 59 rows with NaN in review_text also have no review title. These rows should be removed: the product review is the core component of this data warehouse, and a missing review cannot be replaced with a custom placeholder.

In [33]:
df_review = df_review.dropna(subset=["review_text"])
df_review.isna().sum()

review_id           0
review_title    14319
review_text         0
dtype: int64

For the remaining rows, NaN in review_title is replaced with a default placeholder title.

In [34]:
df_review["review_title"] = df_review["review_title"].fillna("Honest customer review")
df_review.isna().sum()

review_id       0
review_title    0
review_text     0
dtype: int64

In [47]:
df_product = df2[product_subset]
df_product.head()

Unnamed: 0,product_id,product_name,brand_name,price_usd,rating
0,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5
1,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,3
2,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5
3,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5
4,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5


In [48]:
df_product.isna().sum()

product_id      0
product_name    0
brand_name      0
price_usd       0
rating          0
dtype: int64

In [49]:
df2.head()

Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1945004256,5,1.0,0.0,2,2,0,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,5478482359,3,1.0,0.333333,3,2,1,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,29002209922,5,1.0,1.0,2,0,2,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,7391078463,5,1.0,1.0,2,0,2,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,1766313888,5,1.0,1.0,13,0,13,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0


In [52]:
df_date = pd.DataFrame()
df_date["full_date"] = pd.to_datetime(df2["submission_time"])
df_date["year"] = df_date["full_date"].dt.year
df_date["month"] = df_date["full_date"].dt.month
df_date["day"] = df_date["full_date"].dt.day
df_date.head()

Unnamed: 0,full_date,year,month,day
0,2022-12-10,2022,12,10
1,2021-12-17,2021,12,17
2,2021-06-07,2021,6,7
3,2021-05-21,2021,5,21
4,2021-03-29,2021,3,29
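`df_date` as built here has one row per review, so the same date can appear many times. If the date dimension is meant to hold one row per calendar day, a possible follow-up is sketched below. The `date_id` key in `yyyymmdd` form is an assumption for illustration, not something defined in this notebook.

```python
import pandas as pd

# Submission dates from df2.head(), with one repeated to show deduplication
submission_time = pd.Series(["2022-12-10", "2021-12-17", "2021-06-07", "2021-12-17"])

dates = pd.DataFrame({"full_date": pd.to_datetime(submission_time)})
# One row per distinct calendar day, in chronological order
dates = dates.drop_duplicates().sort_values("full_date").reset_index(drop=True)

dates["year"] = dates["full_date"].dt.year
dates["month"] = dates["full_date"].dt.month
dates["day"] = dates["full_date"].dt.day
# Hypothetical surrogate key in yyyymmdd form (an assumption, not from the notebook)
dates["date_id"] = dates["full_date"].dt.strftime("%Y%m%d").astype(int)
print(dates)
```

An integer key derived from the date stays stable across reloads, which is one reason it is sometimes preferred over a running counter for date dimensions.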
