##### 1. Project Overview

Goal:
Build a relational database in PostgreSQL from the Olist Brazilian E-commerce dataset and use Python (Pandas + Plotly Express) for cleaning, feature engineering, and insights.

Tools:

- PostgreSQL (backend storage)
- Python + Pandas (data cleaning, ETL)
- SQLAlchemy (database connection engine)
- Plotly Express (visualization)

Database name: `olist_db`


##### 2. Create the database in PostgreSQL


In pgAdmin or the psql shell:

```sql
CREATE DATABASE olist_db;
```

To confirm, use these psql meta-commands:

`\l` -> list all databases

`\c olist_db` -> connect to the database


##### 3. Dataset files (download from Kaggle)

Dataset link:
[Olist Brazilian E-Commerce dataset](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)

Download and extract the CSV files:

```text
olist_customers_dataset.csv
olist_orders_dataset.csv
olist_order_items_dataset.csv
olist_order_payments_dataset.csv
olist_order_reviews_dataset.csv
olist_products_dataset.csv
olist_sellers_dataset.csv
olist_geolocation_dataset.csv
product_category_name_translation.csv
```
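Before loading anything, it can help to confirm that the archive was extracted completely. The sketch below is only an illustration: the helper name `missing_files` is hypothetical, and the folder name `olist_brazil_dataset` is taken from the loading cells later in this notebook.

```python
from pathlib import Path

# The nine CSV files listed above
EXPECTED_FILES = [
    "olist_customers_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "product_category_name_translation.csv",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in expected if not (folder / name).is_file()]


# An empty list means every file was found
print(missing_files("olist_brazil_dataset"))
```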


##### 4. Connect Python to PostgreSQL


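In [1]:
from sqlalchemy import create_engine
import pandas as pd

# Create the engine; %40 is the URL-encoded "@" in the password
engine = create_engine("postgresql+psycopg2://postgres:2013%40Wewe@localhost:5432/olist_db")

# create_engine() is lazy, so open a connection to confirm the database is reachable
with engine.connect():
    print("connected successfully")

connected successfully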
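The `%40` in the connection string is the URL-encoded form of `@`, which would otherwise be read as the separator between the credentials and the host. A minimal sketch of doing that escaping programmatically with the standard library follows; the helper `build_postgres_url` and the password `p@ss/word` are made up for illustration.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a SQLAlchemy-style PostgreSQL URL with an escaped password."""
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}:{port}/{database}"


# Hypothetical password containing characters that are special in URLs
url = build_postgres_url("postgres", "p@ss/word", "localhost", 5432, "olist_db")
print(url)  # postgresql+psycopg2://postgres:p%40ss%2Fword@localhost:5432/olist_db
```

The resulting string can be passed to `create_engine` in place of a hand-escaped one.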


##### 5. Load and clean data with pandas


In [2]:
customers = pd.read_csv("olist_brazil_dataset/olist_customers_dataset.csv")
orders = pd.read_csv("olist_brazil_dataset/olist_orders_dataset.csv")
order_items = pd.read_csv("olist_brazil_dataset/olist_order_items_dataset.csv")
order_payments = pd.read_csv("olist_brazil_dataset/olist_order_payments_dataset.csv")
order_reviews = pd.read_csv("olist_brazil_dataset/olist_order_reviews_dataset.csv")
products = pd.read_csv("olist_brazil_dataset/olist_products_dataset.csv")
sellers = pd.read_csv("olist_brazil_dataset/olist_sellers_dataset.csv")
geoloc = pd.read_csv("olist_brazil_dataset/olist_geolocation_dataset.csv")
category_translation = pd.read_csv("olist_brazil_dataset/product_category_name_translation.csv")


In [12]:
base_path = 'olist_brazil_dataset/'

datasets = {
    "category_translation": "product_category_name_translation.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "products": "olist_products_dataset.csv",
    "customers": "olist_customers_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "order_payments": "olist_order_payments_dataset.csv",
    "order_reviews": "olist_order_reviews_dataset.csv",
}


dfs = {name: pd.read_csv(base_path + file) for name, file in datasets.items()}

print("loaded all CSV files")

loaded all CSV files
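A quick overview of what was loaded makes the later cleaning decisions easier to judge. The sketch below assumes a dictionary of DataFrames shaped like `dfs`; the `summarize` helper is hypothetical and the toy frames hold made-up values, not Olist data.

```python
import pandas as pd


def summarize(frames):
    """Return one row per DataFrame: row count, column count, total missing cells."""
    rows = [
        {
            "table": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing": int(df.isna().sum().sum()),
        }
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index("table")


# Toy stand-ins for the real tables (illustrative values only)
toy = {
    "sellers": pd.DataFrame({"seller_id": ["s1", "s2"], "seller_city": ["sao paulo", None]}),
    "products": pd.DataFrame({"product_id": ["p1"], "product_weight_g": [None]}),
}
summary = summarize(toy)
print(summary)
```

In the notebook itself, `summarize(dfs)` would give the same overview for the nine real tables.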


##### Light cleaning


In [13]:
# Convert timestamp columns in orders
date_cols_orders = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]

for col in date_cols_orders:
    dfs['orders'][col] = pd.to_datetime(dfs['orders'][col], errors='coerce')

# Convert timestamp columns in reviews
date_cols_reviews = ['review_creation_date', 'review_answer_timestamp']
for col in date_cols_reviews:
    dfs['order_reviews'][col] = pd.to_datetime(dfs['order_reviews'][col], errors='coerce')

# Convert timestamp columns in order_items
dfs["order_items"]["shipping_limit_date"] = pd.to_datetime(
    dfs["order_items"]["shipping_limit_date"], errors="coerce"
)

# Fill missing product weight and dimensions with 0
dfs['products'].fillna(
    {
        "product_weight_g": 0,
        "product_length_cm": 0,
        "product_height_cm": 0,
        "product_width_cm": 0,
    },
    inplace=True
)

# Drop exact duplicate rows, just in case
for name, df in dfs.items():
    dfs[name] = df.drop_duplicates()

print("Cleaned and standardized data")

Cleaned and standardized data
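To make the effect of `errors='coerce'` concrete, here is a small sketch on toy data (made-up values): unparseable or missing timestamps become `NaT` instead of raising an error. It also derives a `delivery_days` column as one possible feature; that column name is an assumption for illustration, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy orders: one valid delivery date, one unparseable string, one missing value
toy_orders = pd.DataFrame(
    {
        "order_purchase_timestamp": [
            "2018-01-01 10:00:00",
            "2018-01-05 09:30:00",
            "2018-01-07 12:00:00",
        ],
        "order_delivered_customer_date": ["2018-01-04 15:00:00", "not a date", None],
    }
)

# Values that cannot be parsed are coerced to NaT
for col in list(toy_orders.columns):
    toy_orders[col] = pd.to_datetime(toy_orders[col], errors="coerce")

# Whole days between purchase and delivery; NaN where the delivery date is NaT
toy_orders["delivery_days"] = (
    toy_orders["order_delivered_customer_date"] - toy_orders["order_purchase_timestamp"]
).dt.days

print(toy_orders)
```

Counting `NaT` values after conversion, for example with `dfs['orders'].isna().sum()`, shows how many rows were affected by the coercion or were already missing.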


##### Load data into PostgreSQL


In [14]:
def load_table(df, tablename):
    """Load a single DataFrame into PostgreSQL, replacing any existing table."""
    df.to_sql(tablename, engine, if_exists='replace', index=False)
    print(f"loaded table :{tablename} ({len(df):,} rows)")

for name, df in dfs.items():
    load_table(df, name)

print("All tables successfully loaded into PostgreSQL")

loaded table :category_translation (71 rows)
loaded table :geolocation (738,332 rows)


InternalError: (psycopg2.errors.DependentObjectsStillExist) cannot drop table sellers because other objects depend on it
DETAIL:  constraint order_items_seller_id_fkey on table order_items depends on table sellers
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

[SQL: 
DROP TABLE sellers]
(Background on this error at: https://sqlalche.me/e/20/2j85)
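The error above occurs because `if_exists='replace'` makes pandas issue `DROP TABLE sellers`, and PostgreSQL refuses to drop a table that a foreign key on `order_items` still references. One possible workaround, sketched below, is to keep the existing tables, empty them with `TRUNCATE ... CASCADE`, and then append rows with parent tables loaded before their children. This is a sketch under stated assumptions, not a verified fix: only the `order_items` -> `sellers` dependency is confirmed by the error message; the other entries in `FK_PARENTS` are guesses based on the shared ID columns, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Assumed foreign-key dependencies: child table -> parent tables.
# Only order_items -> sellers is confirmed by the error message;
# the rest are assumptions and should be checked against the real schema.
FK_PARENTS = {
    "orders": {"customers"},
    "order_items": {"orders", "products", "sellers"},
    "order_payments": {"orders"},
    "order_reviews": {"orders"},
}


def load_order(table_names, fk_parents=FK_PARENTS):
    """Return the table names ordered so every parent comes before its children."""
    known = set(table_names)
    graph = {name: fk_parents.get(name, set()) & known for name in table_names}
    return list(TopologicalSorter(graph).static_order())


def truncate_statement(table_names):
    """Build one TRUNCATE statement that empties the tables without dropping them."""
    return "TRUNCATE TABLE " + ", ".join(table_names) + " CASCADE"


def reload_tables(dfs, engine):
    """Empty the existing tables, then append each DataFrame, parents first."""
    from sqlalchemy import text  # imported here so the helpers above need no database driver

    with engine.begin() as conn:
        conn.execute(text(truncate_statement(list(dfs))))
    for name in load_order(list(dfs)):
        dfs[name].to_sql(name, engine, if_exists="append", index=False)
```

Calling `reload_tables(dfs, engine)` would replace the loop in the cell above. Appending only works if the DataFrame columns match the existing table definitions and the data satisfies the existing constraints, so rows that violate a primary or foreign key would still raise an error.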