# ETL Pipeline and Data Warehouse Using Python and PostgreSQL

Here, I will create a simple ETL pipeline using only Python and PostgreSQL, without any AWS services.

### Basic ETL Workflow
1. Extract data from CSV files 
2. Transform the data using Pandas
3. Load the transformed data into a data warehouse built on PostgreSQL
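
The three steps can be sketched end to end as shown here. This is a minimal illustration, not the pipeline built in this notebook: the CSV content is an inline sample, the table name `product_dim` is hypothetical, and an in-memory SQLite database stands in for PostgreSQL so that the snippet runs without a database server.

```python
import io
import sqlite3

import pandas as pd

# Extract: read CSV data (an inline sample stands in for a file on disk)
raw_csv = io.StringIO(
    "product_id,product_name,rating\n"
    "P1,Sample Serum,4.5\n"
    "P2,Sample Cream,\n"
)
df = pd.read_csv(raw_csv)

# Transform: fill the missing rating with 0
df["rating"] = df["rating"].fillna(0)

# Load: write the cleaned rows to a table (SQLite is an assumption, used as a stand-in)
conn = sqlite3.connect(":memory:")
df.to_sql("product_dim", conn, index=False, if_exists="replace")

# Read the table back to confirm what was loaded
loaded = pd.read_sql_query("SELECT * FROM product_dim ORDER BY product_id", conn)
print(loaded)
```

With a real PostgreSQL target, the `conn` object would instead come from a SQLAlchemy engine, and the rest of the flow would stay largely the same.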

### For Data warehouse
- Create a data model that follows a star schema, with fact and dimension tables
- Fact table: Product reviews
- Dimension tables: Product info, brand info, date info, rating info, author info
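
A star schema of this shape might look roughly like the sketch below. The table names, column names, and the `YYYYMMDD` date key are all assumptions made for illustration, only three of the five dimensions are shown, and SQLite is used in place of PostgreSQL so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables hold descriptive attributes; the fact table references them by key
conn.executescript("""
CREATE TABLE dim_product (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_author  (author_id  TEXT PRIMARY KEY, skin_type TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_review (
    review_id  INTEGER PRIMARY KEY,
    product_id TEXT    REFERENCES dim_product(product_id),
    author_id  TEXT    REFERENCES dim_author(author_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    rating     INTEGER
);
""")

# Made-up sample rows, one per table
conn.execute("INSERT INTO dim_product VALUES ('P1', 'Sample Serum')")
conn.execute("INSERT INTO dim_author VALUES ('A1', 'oily')")
conn.execute("INSERT INTO dim_date VALUES (20221210, '2022-12-10')")
conn.execute("INSERT INTO fact_review VALUES (1, 'P1', 'A1', 20221210, 5)")

# A typical query joins the fact table to the dimensions it needs
row = conn.execute("""
    SELECT p.product_name, d.full_date, f.rating
    FROM fact_review AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
""").fetchone()
print(row)
```

Keeping measures such as the rating in the fact table and descriptive attributes in the dimensions is what gives the schema its star shape.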

In [2]:
#* Imports 
# sqlalchemy and psycopg2 will be used to connect to the postgresql database
from sqlalchemy import create_engine
import psycopg2
# pygrametl is used to create ETL flow
import pygrametl 
# To read data from external csv files
from pygrametl.datasources import CSVSource
# pygrametl provides a set of classes for interacting with dimension and fact tables
from pygrametl.tables import CachedDimension, FactTable 
# For data processing
import pandas as pd 
# For data visualization
import matplotlib.pyplot as plt

In [3]:
# Dataset Paths
dataset1_path = "datasets/product_info.csv"
dataset2_path = "datasets/product_reviews.csv"
# dataset3 is just a bigger version of dataset2
dataset3_path = "datasets/product_reviews_extra.csv"

## 1. Extract Step
- In this step, data is extracted from CSV files, which are the data source here. There are two datasets: one contains product information and the other contains customer product reviews. Both files belong to the same e-commerce dataset and serve as the data source for this project.
- Alternatively, data from an API, a relational database, or web scraping could have been used for extraction.

In [6]:
# Reading the dataset using pandas to explore the data
df1 = pd.read_csv(dataset1_path)
print(f"Rows:{df1.shape[0]}, Columns:{df1.shape[1]}")
df1.head()

Rows:8494, Columns:27


Unnamed: 0,product_id,product_name,brand_id,brand_name,loves_count,rating,reviews,size,variation_type,variation_value,...,online_only,out_of_stock,sephora_exclusive,highlights,primary_category,secondary_category,tertiary_category,child_count,child_max_price,child_min_price
0,P473671,Fragrance Discovery Set,6342,19-69,6320,3.6364,11.0,,,,...,1,0,0,"['Unisex/ Genderless Scent', 'Warm &Spicy Scen...",Fragrance,Value & Gift Sets,Perfume Gift Sets,0,,
1,P473668,La Habana Eau de Parfum,6342,19-69,3827,4.1538,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,85.0,30.0
2,P473662,Rainbow Bar Eau de Parfum,6342,19-69,3253,4.25,16.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
3,P473660,Kasbah Eau de Parfum,6342,19-69,3018,4.4762,21.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
4,P473658,Purple Haze Eau de Parfum,6342,19-69,2691,3.2308,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0


In [7]:
# Checking the columns to determine the necessary columns for this project
df1.columns

Index(['product_id', 'product_name', 'brand_id', 'brand_name', 'loves_count',
       'rating', 'reviews', 'size', 'variation_type', 'variation_value',
       'variation_desc', 'ingredients', 'price_usd', 'value_price_usd',
       'sale_price_usd', 'limited_edition', 'new', 'online_only',
       'out_of_stock', 'sephora_exclusive', 'highlights', 'primary_category',
       'secondary_category', 'tertiary_category', 'child_count',
       'child_max_price', 'child_min_price'],
      dtype='object')

#### Columns Description
1. **product_id**: Unique Product Identifier
2. **product_name**: Full name of Product
3. **brand_id**: Unique Brand Identifier
4. **brand_name**: Full name of the product brand
5. **loves_count**: No of people who marked this product as favorite
6. **rating**: Average rating of product based on user reviews
7. **reviews**: No of user reviews for the product 
8. **size**: Product size, may be in oz, mL, g, packs, or other units
9. **variation_type**: The type of variation parameter for the product (e.g. Size, Color)
10. **variation_value**: The specific value of the variation parameter for the product (e.g. 100 mL, Golden Sand)
11. **variation_desc**: A description of the variation parameter for the product (e.g. tone for fairest skin)
12. **ingredients**: A list of ingredients included in the product, for example: [‘Product variation 1:’, ‘Water, Glycerin’, ‘Product variation 2:’, ‘Talc, Mica’] or if no variations [‘Water, Glycerin’]
13. **price_usd**: The price of the product in US dollars
14. **value_price_usd**: The potential cost savings of the product, presented on the site next to the regular price
15. **sale_price_usd**: The sale price of the product in US dollars
16. **limited_edition**: Indicates whether the product is a limited edition or not (1-true, 0-false)
17. **new**: Indicates whether the product is new or not (1-true, 0-false)
18. **online_only**: Indicates whether the product is only sold online or not (1-true, 0-false)
19. **out_of_stock**: Indicates whether the product is currently out of stock or not (1 if true, 0 if false)
20. **sephora_exclusive**: Indicates whether the product is exclusive to Sephora or not (1 if true, 0 if false)
21. **highlights**: A list of tags or features that highlight the product's attributes (e.g. [‘Vegan’, ‘Matte Finish’])
22. **primary_category**: First category in the breadcrumb section
23. **secondary_category**: Second category in the breadcrumb section
24. **tertiary_category**: Third category in the breadcrumb section
25. **child_count**: The number of variations of the product available
26. **child_max_price**: The highest price among the variations of the product
27. **child_min_price**: The lowest price among the variations of the product

Among these columns, I will mainly consider product_id, product_name, brand_id, brand_name, rating, reviews and price_usd.

A few other columns are reviewed below before deciding whether to keep them.

In [36]:
df1[['size', 'variation_type', 'variation_value', 'variation_desc', 'ingredients']].head()

Unnamed: 0,size,variation_type,variation_value,variation_desc,ingredients
0,,,,,"['Capri Eau de Parfum:', 'Alcohol Denat. (SD A..."
1,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."
2,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."
3,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."
4,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,,"['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra..."


In [37]:
df1[['size', 'variation_type', 'variation_value', 'variation_desc', 'ingredients']].isna().sum()

size               1631
variation_type     1444
variation_value    1598
variation_desc     7244
ingredients         945
dtype: int64

The columns size, variation_type, variation_value, variation_desc and ingredients have many NaN values and are not relevant to the data warehouse being built, so they are left out.
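
Raw NaN counts are easier to compare as a share of all rows. For example, variation_desc is missing in 7244 of 8494 rows, which is about 85%. The helper below is a small sketch of that idea; the function name `missing_share` and the sample frame are illustrative and not part of the notebook.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Tiny made-up sample: "size" is missing in 3 of 4 rows
sample = pd.DataFrame({
    "size": [None, "3.4 oz", None, None],
    "price_usd": [35.0, 195.0, 195.0, 195.0],
})
share = missing_share(sample)
print(share)
```

Calling `missing_share(df1)` would rank every column of the product dataset the same way.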

In [38]:
df1[['price_usd', 'value_price_usd', 'sale_price_usd']].head()

Unnamed: 0,price_usd,value_price_usd,sale_price_usd
0,35.0,,
1,195.0,,
2,195.0,,
3,195.0,,
4,195.0,,


In [39]:
df1[['price_usd', 'value_price_usd', 'sale_price_usd']].isna().sum()

price_usd             0
value_price_usd    8043
sale_price_usd     8224
dtype: int64

Here value_price_usd and sale_price_usd have many NaN values. Cleaning them is unnecessary because only the original price (price_usd) matters for this task.

In [44]:
df1[['primary_category','secondary_category', 'tertiary_category']].head()

Unnamed: 0,primary_category,secondary_category,tertiary_category
0,Fragrance,Value & Gift Sets,Perfume Gift Sets
1,Fragrance,Women,Perfume
2,Fragrance,Women,Perfume
3,Fragrance,Women,Perfume
4,Fragrance,Women,Perfume


In [45]:
df1[['primary_category','secondary_category', 'tertiary_category']].isna().sum()

primary_category        0
secondary_category      8
tertiary_category     990
dtype: int64

In [46]:
df1['primary_category'].value_counts()

primary_category
Skincare           2420
Makeup             2369
Hair               1464
Fragrance          1432
Bath & Body         405
Mini Size           288
Men                  60
Tools & Brushes      52
Gifts                 4
Name: count, dtype: int64

I will keep only the primary category, since it is the main category and is sufficient for this task.

In [47]:
product_subset = ["product_id","product_name","rating","reviews","loves_count","price_usd", "child_count","primary_category"]
brand_subset = ["brand_id","brand_name"]

In [48]:
df_product = df1[product_subset]
df_product.head()

Unnamed: 0,product_id,product_name,rating,reviews,loves_count,price_usd,child_count,primary_category
0,P473671,Fragrance Discovery Set,3.6364,11.0,6320,35.0,0,Fragrance
1,P473668,La Habana Eau de Parfum,4.1538,13.0,3827,195.0,2,Fragrance
2,P473662,Rainbow Bar Eau de Parfum,4.25,16.0,3253,195.0,2,Fragrance
3,P473660,Kasbah Eau de Parfum,4.4762,21.0,3018,195.0,2,Fragrance
4,P473658,Purple Haze Eau de Parfum,3.2308,13.0,2691,195.0,2,Fragrance


In [49]:
df_product.shape

(8494, 8)

In [50]:
df_product.isna().sum()

product_id            0
product_name          0
rating              278
reviews             278
loves_count           0
price_usd             0
child_count           0
primary_category      0
dtype: int64

In [51]:
# Replace NaN in rating and reviews with 0, treating 0 as the lowest possible average rating and review count
# Assigning the result back (instead of inplace=True on a column of a slice) avoids SettingWithCopyWarning
df_product = df_product.fillna({"rating": 0, "reviews": 0})


In [52]:
df_product.isna().sum()

product_id          0
product_name        0
rating              0
reviews             0
loves_count         0
price_usd           0
child_count         0
primary_category    0
dtype: int64
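
When a column subset is taken from a larger DataFrame and then modified, pandas may emit a SettingWithCopyWarning. One common way to avoid it is sketched here, with made-up sample data: copy the subset explicitly and assign the filled result back.

```python
import pandas as pd

source = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "rating": [4.5, None],
    "reviews": [11.0, None],
    "brand_name": ["A", "B"],
})

# .copy() makes the subset an independent DataFrame rather than a possible view
subset = source[["product_id", "rating", "reviews"]].copy()

# Assign the filled result back instead of using inplace=True on a single column
subset = subset.fillna({"rating": 0, "reviews": 0})
print(subset)
```

The original `source` frame keeps its NaN values, so the cleaning step does not silently alter the raw data.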

In [53]:
df_product.head()

Unnamed: 0,product_id,product_name,rating,reviews,loves_count,price_usd,child_count,primary_category
0,P473671,Fragrance Discovery Set,3.6364,11.0,6320,35.0,0,Fragrance
1,P473668,La Habana Eau de Parfum,4.1538,13.0,3827,195.0,2,Fragrance
2,P473662,Rainbow Bar Eau de Parfum,4.25,16.0,3253,195.0,2,Fragrance
3,P473660,Kasbah Eau de Parfum,4.4762,21.0,3018,195.0,2,Fragrance
4,P473658,Purple Haze Eau de Parfum,3.2308,13.0,2691,195.0,2,Fragrance


In [54]:
df_brand = df1[brand_subset]
df_brand.head()

Unnamed: 0,brand_id,brand_name
0,6342,19-69
1,6342,19-69
2,6342,19-69
3,6342,19-69
4,6342,19-69


In [55]:
df_brand.isna().sum()

brand_id      0
brand_name    0
dtype: int64

In [56]:
df_brand['brand_name'].value_counts()

brand_name
SEPHORA COLLECTION     352
CLINIQUE               179
Dior                   136
tarte                  131
NEST New York          115
                      ... 
Aquis                    1
Narciso Rodriguez        1
Jillian Dempsey          1
DOMINIQUE COSMETICS      1
iluminage                1
Name: count, Length: 304, dtype: int64

Reading the second dataset, which contains the product reviews

In [8]:
df2 = pd.read_csv(dataset2_path,index_col=0)
print(f"Rows: {df2.shape[0]}, Columns: {df2.shape[1]}")
df2.head()

Rows: 49977, Columns: 18



Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1945004256,5,1.0,0.0,2,2,0,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,5478482359,3,1.0,0.333333,3,2,1,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,29002209922,5,1.0,1.0,2,0,2,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,7391078463,5,1.0,1.0,2,0,2,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,1766313888,5,1.0,1.0,13,0,13,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0


In [9]:
df2.columns

Index(['author_id', 'rating', 'is_recommended', 'helpfulness',
       'total_feedback_count', 'total_neg_feedback_count',
       'total_pos_feedback_count', 'submission_time', 'review_text',
       'review_title', 'skin_tone', 'eye_color', 'skin_type', 'hair_color',
       'product_id', 'product_name', 'brand_name', 'price_usd'],
      dtype='object')

#### Columns Description
1. **author_id**: The unique identifier for the author of the review on the website
2. **rating**: The rating given by the author for the product on a scale of 1 to 5
3. **is_recommended**: Indicates if the author recommends the product or not (1-true, 0-false)
4. **helpfulness**: The ratio of positive ratings to all ratings for the review: helpfulness = total_pos_feedback_count / total_feedback_count
5. **total_feedback_count**: Total number of feedback (positive and negative ratings) left by users for the review
6. **total_neg_feedback_count**: The number of users who gave a negative rating for the review
7. **total_pos_feedback_count**: The number of users who gave a positive rating for the review
8. **submission_time**: Date the review was posted on the website in the 'yyyy-mm-dd' format
9. **review_text**: The main text of the review written by the author
10. **review_title**: The title of the review written by the author
11. **skin_tone**: Author's skin tone (e.g. fair, tan, etc.)
12. **eye_color**: Author's eye color (e.g. brown, green, etc.)
13. **skin_type**: Author's skin type (e.g. combination, oily, etc.)
14. **hair_color**: Author's hair color (e.g. brown, auburn, etc.)
15. **product_id**: Unique identifier for the product on the website
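
The helpfulness formula in the list above can be reproduced as in this sketch. The first two sample rows mirror the first two reviews shown earlier. The third row, and the choice to leave helpfulness as NaN when a review has no feedback, are assumptions for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    "total_feedback_count": [2, 3, 0],
    "total_pos_feedback_count": [0, 1, 0],
})

total = reviews["total_feedback_count"]

# Mask zero totals as NaN so reviews without feedback get NaN instead of a division by zero
reviews["helpfulness"] = reviews["total_pos_feedback_count"] / total.where(total > 0)
print(reviews)
```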

From these columns we can create a review table and a reviewer (author) table.

In [12]:
author_subset = ['author_id','skin_tone','eye_color','skin_type','hair_color']
review_subset = ['review_text','review_title']
product_subset = ['product_id', 'product_name', 'brand_name', 'price_usd','rating']

In [14]:
df_author = df2[author_subset]
df_author.head()

Unnamed: 0,author_id,skin_tone,eye_color,skin_type,hair_color
0,1945004256,lightMedium,green,combination,
1,5478482359,mediumTan,brown,oily,black
2,29002209922,lightMedium,brown,dry,black
3,7391078463,lightMedium,brown,combination,blonde
4,1766313888,mediumTan,brown,combination,black


In [16]:
df_author.shape

(49977, 5)

In [15]:
df_author.isna().sum()

author_id        0
skin_tone     7201
eye_color     6260
skin_type     3631
hair_color    8851
dtype: int64

In [37]:
df_author[df_author["skin_tone"].isna() & df_author["eye_color"].isna() & df_author["skin_type"].isna() & df_author["hair_color"].isna()].shape

(3511, 5)

A total of 3511 rows have NaN in every one of the author's feature columns, so these rows carry no information and can be dropped.

In [38]:
condition = (df_author["skin_tone"].isna() & df_author["eye_color"].isna() & df_author["skin_type"].isna() & df_author["hair_color"].isna())

# Drop the rows that satisfy the condition
df_author = df_author.drop(df_author[condition].index)
df_author.isna().sum()

author_id        0
skin_tone     3690
eye_color     2749
skin_type      120
hair_color    5340
dtype: int64
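
The same row-dropping logic can be expressed more compactly with `dropna(subset=..., how="all")`, which removes a row only when every listed column is missing. The sample data in this sketch is made up.

```python
import pandas as pd

authors = pd.DataFrame({
    "author_id": ["1", "2", "3"],
    "skin_tone": ["fair", None, None],
    "eye_color": ["brown", None, "green"],
})
feature_cols = ["skin_tone", "eye_color"]

# Drop a row only when every listed feature column is missing
cleaned = authors.dropna(subset=feature_cols, how="all")
print(cleaned)
```

Author 3 is kept because one feature is present, and only author 2, with no features at all, is removed.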

The remaining NaN values are replaced with the mode of each column.

In [43]:
df_author["skin_tone"] = df_author["skin_tone"].fillna(df_author["skin_tone"].mode()[0])
df_author["eye_color"] = df_author["eye_color"].fillna(df_author["eye_color"].mode()[0])
df_author["skin_type"] = df_author["skin_type"].fillna(df_author["skin_type"].mode()[0])
df_author["hair_color"] = df_author["hair_color"].fillna(df_author["hair_color"].mode()[0])
df_author.isna().sum()

author_id     0
skin_tone     0
eye_color     0
skin_type     0
hair_color    0
dtype: int64

All of these columns are categorical, so a reasonable cleaning step is to replace missing values with the most frequent value, i.e. the mode.
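
The mode-based fill can also be written as a loop over the categorical columns, which avoids repeating one line per column. This is a minimal sketch with made-up values.

```python
import pandas as pd

authors = pd.DataFrame({
    "skin_type": ["oily", "dry", None, "oily"],
    "hair_color": ["black", None, "black", "blonde"],
})

for col in ["skin_type", "hair_color"]:
    # mode() ignores NaN and returns a Series (ties are possible); take the first entry
    most_frequent = authors[col].mode()[0]
    authors[col] = authors[col].fillna(most_frequent)

print(authors)
```

Mode imputation shifts the distribution toward the most common category, which matters more the larger the share of missing values.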

In [18]:
# Copy the subset so the new column is added to an independent DataFrame (avoids SettingWithCopyWarning)
df_review = df2[review_subset].copy()
# Surrogate key: sequential review_id starting at 1
df_review["review_id"] = range(1, len(df_review) + 1)
df_review = df_review[["review_id", "review_title", "review_text"]]
df_review.head()


Unnamed: 0,review_id,review_title,review_text
0,1,A must have!,I absolutely L-O-V-E this oil. I have acne pro...
1,2,it keeps oily skin under control,I gave this 3 stars because it give me tiny li...
2,3,Worth the money!,Works well as soon as I wash my face and pat d...
3,4,best face oil,"this oil helped with hydration and breakouts, ..."
4,5,Maskne miracle,This is my first product review ever so that s...


In [19]:
df_review.isna().sum()

review_id           0
review_title    14378
review_text        59
dtype: int64

At first glance, it seems better to replace NaN values here with placeholder text such as "A product review".

In [28]:
# Rows where both the review title and the review text are missing
both_missing = df_review["review_text"].isna() & df_review["review_title"].isna()
print(f"Both Review title and Review text empty: {both_missing.sum()}")
df_review[both_missing].head()

Both Review title and Review text empty: 59


Unnamed: 0,review_id,review_title,review_text
993,994,,
1154,1155,,
2570,2571,,
2961,2962,,
3128,3129,,


This indicates that rows with NaN in review_text should be removed: the product review is the core component of this data warehouse, and a custom placeholder cannot stand in for the review itself.

In [33]:
df_review = df_review.dropna(subset=["review_text"])
df_review.isna().sum()

review_id           0
review_title    14319
review_text         0
dtype: int64

The remaining NaN values in review_title are replaced with a default placeholder title.

In [34]:
df_review["review_title"] = df_review["review_title"].fillna("Honest customer review")
df_review.isna().sum()

review_id       0
review_title    0
review_text     0
dtype: int64
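
The two review-cleaning steps can be bundled into one small function. The sketch below uses a hypothetical helper name, `clean_reviews`, and made-up sample rows.

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame, placeholder: str) -> pd.DataFrame:
    """Drop reviews without text, then fill missing titles with a placeholder."""
    out = df.dropna(subset=["review_text"]).copy()
    out["review_title"] = out["review_title"].fillna(placeholder)
    return out


sample = pd.DataFrame({
    "review_title": ["Great", None, None],
    "review_text": ["Works well", "Too oily", None],
})
cleaned = clean_reviews(sample, "Honest customer review")
print(cleaned)
```

The order matters: dropping rows without text first means the placeholder title is only applied to reviews that are actually kept.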

In [47]:
df_product = df2[product_subset]
df_product.head()

Unnamed: 0,product_id,product_name,brand_name,price_usd,rating
0,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5
1,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,3
2,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5
3,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5
4,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0,5


In [48]:
df_product.isna().sum()

product_id      0
product_name    0
brand_name      0
price_usd       0
rating          0
dtype: int64

In [49]:
df2.head()

Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1945004256,5,1.0,0.0,2,2,0,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,5478482359,3,1.0,0.333333,3,2,1,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,29002209922,5,1.0,1.0,2,0,2,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,7391078463,5,1.0,1.0,2,0,2,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,1766313888,5,1.0,1.0,13,0,13,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0


In [52]:
df_date = pd.DataFrame()
df_date["full_date"] = pd.to_datetime(df2["submission_time"])
df_date["year"] = df_date["full_date"].dt.year
df_date["month"] = df_date["full_date"].dt.month
df_date["day"] = df_date["full_date"].dt.day
df_date.head()

Unnamed: 0,full_date,year,month,day
0,2022-12-10,2022,12,10
1,2021-12-17,2021,12,17
2,2021-06-07,2021,6,7
3,2021-05-21,2021,5,21
4,2021-03-29,2021,3,29
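
Because many reviews share a submission date, a date dimension usually holds one row per distinct date. This sketch shows one possible way to build it. The deduplication step and the `date_id` key in `YYYYMMDD` form are assumptions and not something this notebook has defined.

```python
import pandas as pd

# Made-up sample of submission dates, with one duplicate
submission_time = pd.Series(["2022-12-10", "2021-12-17", "2022-12-10"])

# Keep each calendar date once, in chronological order
dates = pd.to_datetime(submission_time).drop_duplicates().sort_values()

dim_date = pd.DataFrame({
    # Hypothetical surrogate key in YYYYMMDD form, e.g. 20221210
    "date_id": dates.dt.strftime("%Y%m%d").astype(int),
    "full_date": dates,
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
}).reset_index(drop=True)
print(dim_date)
```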
