##### 1. Project Overview

Goal:
Build a relational database in PostgreSQL from the Olist Brazilian E-commerce dataset and use Python (Pandas + Plotly Express) for cleaning, feature engineering, and insights.

Tools:

- PostgreSQL (backend storage)
- Python + Pandas (data cleaning, ETL)
- SQLAlchemy (database connection engine)
- Plotly Express (visualization)

Database name: `olist_db`


##### 2. Create the database in PostgreSQL


In pgAdmin or the psql shell:

```sql
CREATE DATABASE olist_db;
```

To confirm, use these psql meta-commands:

`\l` -> list all databases

`\c olist_db` -> connect to the database


##### 3. Dataset files (download from Kaggle)

Dataset link:
[Olist Brazilian E-Commerce dataset](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)

Download and extract the CSV files:

```text
olist_customers_dataset.csv
olist_orders_dataset.csv
olist_order_items_dataset.csv
olist_order_payments_dataset.csv
olist_order_reviews_dataset.csv
olist_products_dataset.csv
olist_sellers_dataset.csv
olist_geolocation_dataset.csv
product_category_name_translation.csv
```
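Before loading anything, it can help to confirm that the extraction produced all nine files. The sketch below is one way to do that with the standard library; the helper name `missing_files` is hypothetical, and the folder name `olist_brazil_dataset` is assumed from the loading step later in this notebook.

```python
from pathlib import Path

# The nine CSV files listed above
EXPECTED_FILES = [
    "olist_customers_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "product_category_name_translation.csv",
]


def missing_files(base_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_file()]


# Folder name assumed from the loading step; adjust to your extract location
missing = missing_files("olist_brazil_dataset")
print("All files present" if not missing else f"Missing: {missing}")
```

An empty list means every file is in place; otherwise the list names what still needs to be downloaded or moved.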


##### 4. Connect Python to PostgreSQL


In [7]:
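from sqlalchemy import create_engine, text
import pandas as pd

# Create the engine. The password is URL-encoded: %40 stands for "@".
# create_engine() is lazy, so no connection is opened until the first query runs.
engine = create_engine("postgresql+psycopg2://postgres:2013%40Wewe@localhost:5432/olist_db")

print("connected successfully")

connected successfully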
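The `%40` in the connection string is the percent-encoded form of `@`, which would otherwise be read as the separator between the credentials and the host. A minimal sketch of building such a URL programmatically is shown here; `build_pg_url` is a hypothetical helper and the credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, port, database):
    """Build a SQLAlchemy PostgreSQL URL with user and password percent-encoded."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
url = build_pg_url("postgres", "p@ss:word", "localhost", 5432, "olist_db")
print(url)  # postgresql+psycopg2://postgres:p%40ss%3Aword@localhost:5432/olist_db
```

The resulting string can be passed to `create_engine`. In practice the password could also be read from an environment variable instead of being written into the notebook.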


##### 5. Load and clean data with pandas


In [2]:
base_path = 'olist_brazil_dataset/'

datasets = {
    "category_translation": "product_category_name_translation.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "products": "olist_products_dataset.csv",
    "customers": "olist_customers_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "order_payments": "olist_order_payments_dataset.csv",
    "order_reviews": "olist_order_reviews_dataset.csv",
}


# Read every CSV into a DataFrame, keyed by its target table name
dfs = {name: pd.read_csv(base_path + file) for name, file in datasets.items()}

print("loaded all CSV files")

loaded all CSV files


##### Light cleaning


In [3]:
# Convert timestamp columns in orders
date_cols_orders =[
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]

for col in date_cols_orders:
    dfs['orders'][col] = pd.to_datetime(dfs['orders'][col],errors='coerce')
    
# Convert timestamp columns in reviews
date_cols_reviews = ['review_creation_date','review_answer_timestamp']
for col in date_cols_reviews:
    dfs['order_reviews'][col] = pd.to_datetime(dfs['order_reviews'][col],errors='coerce')
    
# Convert timestamp columns in order_items
dfs["order_items"]["shipping_limit_date"] = pd.to_datetime(
    dfs["order_items"]["shipping_limit_date"], errors="coerce"
)

# Fill missing product weight and dimensions with 0
dfs['products'].fillna(
    {
        "product_weight_g": 0,
        "product_length_cm": 0,
        "product_height_cm": 0,
        "product_width_cm": 0,
    },
    inplace=True
)

# drop duplicates just in case
for name,df in dfs.items():
    dfs[name] = df.drop_duplicates()
    
print("Cleaned and standardized data")

Cleaned and standardized data
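The `errors='coerce'` argument used in the conversions above turns values that cannot be parsed into `NaT` instead of raising an exception. A small illustration on made-up values (not Olist data):

```python
import pandas as pd

# Made-up sample: one valid timestamp, one invalid string, one missing value
raw = pd.Series(["2017-10-02 10:56:33", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Invalid and missing entries both become NaT rather than raising an error
print(parsed.isna().sum())  # 2
```

Running `dfs['orders'][date_cols_orders].isna().sum()` after the conversion would give a per-column count of values that were missing or failed to parse.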


##### Load data into PostgreSQL


In [4]:
def load_table(df,tablename):
    """Load a single dataframe to PostgreSQL"""
    df.to_sql(tablename,engine,if_exists='replace',index=False)
    print(f"loaded table :{tablename} ({len(df):,} rows)")
    
for name , df in dfs.items():
    load_table(df,name)

print("All tables successfully loaded into PostgreSQL")

loaded table :category_translation (71 rows)
loaded table :geolocation (738,332 rows)
loaded table :sellers (3,095 rows)
loaded table :products (32,951 rows)
loaded table :customers (99,441 rows)
loaded table :orders (99,441 rows)
loaded table :order_items (112,650 rows)
loaded table :order_payments (103,886 rows)
loaded table :order_reviews (99,224 rows)
All tables successfully loaded into PostgreSQL
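Because `load_table` passes `if_exists='replace'`, re-running the cell drops and recreates each table rather than appending to it. The sketch below illustrates that behaviour with an in-memory SQLite database standing in for PostgreSQL (an assumption made only so the example runs without a server); the table and data are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for PostgreSQL here (illustration only)
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({"order_id": ["a", "b"], "price": [10.0, 20.0]})

# Load the same frame twice, as happens when the notebook cell is re-run
for _ in range(2):
    demo.to_sql("orders_demo", conn, if_exists="replace", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM orders_demo").fetchone()[0]
print(row_count)  # 2 rows, not 4: the second load replaced the first
```

With `if_exists='append'` the same loop would leave four rows. One caveat on the PostgreSQL side: once foreign keys reference a table, dropping it requires `CASCADE`, so re-running a `replace` load after the constraints are in place would likely fail for the referenced tables.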


##### Basic checks


In [9]:
with engine.connect() as conn:
    # scalar() returns the first column of the first row, here the count
    orders_count = conn.execute(text("SELECT COUNT(*) FROM orders")).scalar()
    print("Orders table count :", orders_count)

    customers_count = conn.execute(text("SELECT COUNT(*) FROM customers")).scalar()
    print("Customers table count :", customers_count)

print("ETL process completed successfully")

Orders table count : 99441
Customers table count : 99441
ETL process completed successfully


`to_sql` creates the tables without any keys, so run the following in psql or pgAdmin to add the constraints:

```sql
-- Sanitize order_reviews first: review_id must be unique before it can be a primary key.
-- For each repeated review_id, keep only the row with the highest ctid.
DELETE FROM order_reviews a
USING order_reviews b
WHERE
    a.ctid < b.ctid AND
    a.review_id = b.review_id;

-- Primary Keys
ALTER TABLE customers ADD PRIMARY KEY (customer_id);
ALTER TABLE sellers ADD PRIMARY KEY (seller_id);
ALTER TABLE products ADD PRIMARY KEY (product_id);
ALTER TABLE orders ADD PRIMARY KEY (order_id);
ALTER TABLE category_translation ADD PRIMARY KEY (product_category_name);
ALTER TABLE order_reviews ADD PRIMARY KEY (review_id);
ALTER TABLE order_items ADD PRIMARY KEY (order_id, order_item_id);
ALTER TABLE order_payments ADD PRIMARY KEY (order_id, payment_sequential);

-- Foreign Keys
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customers FOREIGN KEY (customer_id) REFERENCES customers(customer_id);

ALTER TABLE order_items
  ADD CONSTRAINT fk_items_orders FOREIGN KEY (order_id) REFERENCES orders(order_id),
  ADD CONSTRAINT fk_items_products FOREIGN KEY (product_id) REFERENCES products(product_id),
  ADD CONSTRAINT fk_items_sellers FOREIGN KEY (seller_id) REFERENCES sellers(seller_id);

ALTER TABLE order_payments
  ADD CONSTRAINT fk_payments_orders FOREIGN KEY (order_id) REFERENCES orders(order_id);

ALTER TABLE order_reviews
  ADD CONSTRAINT fk_reviews_orders FOREIGN KEY (order_id) REFERENCES orders(order_id);
```

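- `order_reviews` is sanitized first because `review_id` repeats in the loaded table, and a primary key cannot be added while duplicates remain. The `DELETE` keeps one row per `review_id`, so every review is still represented.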
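Primary and foreign key statements fail if the data violates them, so it can be useful to check the DataFrames before running the SQL. The following is a minimal sketch of two such checks in pandas; `duplicate_keys` and `orphan_rows` are hypothetical helpers, and the sample frames are made up rather than taken from the dataset.

```python
import pandas as pd


def duplicate_keys(df, key_cols):
    """Return the rows whose key_cols combination occurs more than once."""
    return df[df.duplicated(subset=key_cols, keep=False)]


def orphan_rows(child, parent, col):
    """Return child rows whose col value has no match in parent[col]."""
    return child[~child[col].isin(parent[col])]


# Tiny made-up frames, not Olist data
orders = pd.DataFrame({"order_id": ["o1", "o2"]})
reviews = pd.DataFrame(
    {"review_id": ["r1", "r1", "r2"], "order_id": ["o1", "o2", "o9"]}
)

print(len(duplicate_keys(reviews, ["review_id"])))    # 2 rows share review_id "r1"
print(len(orphan_rows(reviews, orders, "order_id")))  # 1 row points at missing "o9"
```

In the notebook, the same calls could be applied to the loaded frames, for example `duplicate_keys(dfs["order_reviews"], ["review_id"])` or `orphan_rows(dfs["order_items"], dfs["orders"], "order_id")`.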
