##### E-Commerce Sales Insights – Olist Dataset (PostgreSQL + Pandas + Plotly)


##### 1. Introduction

In this notebook, we explore the Olist Brazilian E-Commerce dataset hosted in PostgreSQL.

We'll:

1. Connect to PostgreSQL
2. Query and join related tables
3. Clean and engineer features
4. Visualize insights using Plotly Express

Focus KPIs:

- Sales & revenue trends
- Top categories, products, and sellers
- Delivery time performance
- Review score distribution


##### 2. Setup and connection


In [5]:
import pandas as pd
import plotly.express as px
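from sqlalchemy import create_engine, text

# Connect to PostgreSQL; "%40" is the URL-encoded "@" in the password
engine = create_engine("postgresql+psycopg2://postgres:2013%40Wewe@localhost:5432/olist_db")

# create_engine() is lazy, so run a trivial query to confirm the connection works
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
print("Connected successfully")

Connected successfully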
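
The `%40` in the connection string is the percent-encoded `@` of the password. The following is a minimal sketch of building the same kind of URL without hard-coding credentials. The helper `build_pg_url` and the environment variable names `OLIST_DB_USER` and `OLIST_DB_PASSWORD` are hypothetical and not part of the original notebook, and the fallback password is a placeholder.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(user, password, host="localhost", port=5432, db="olist_db"):
    """Return a PostgreSQL URL string with the password percent-encoded."""
    # quote_plus escapes characters such as "@" and "/" that would otherwise
    # be read as URL delimiters.
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}:{port}/{db}"


# Hypothetical variable names; the fallbacks are placeholders, not real credentials.
user = os.environ.get("OLIST_DB_USER", "postgres")
password = os.environ.get("OLIST_DB_PASSWORD", "p@ss/word")

url = build_pg_url(user, password)
print(url)
```

The resulting string can be passed to `create_engine` in place of the literal URL.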


##### 3. Preview data


In [8]:
tables = ['customers','orders','order_items','order_payments','order_reviews','products','sellers']
for t in tables:
    q = f"SELECT COUNT(*) FROM {t}"
    count = pd.read_sql_query(q,engine).iloc[0,0]
    print(f"{t:20s} :{count:,} rows")

customers            :99,441 rows
orders               :99,441 rows
order_items          :112,650 rows
order_payments       :103,886 rows
order_reviews        :98,410 rows
products             :32,951 rows
sellers              :3,095 rows


##### 4. Query and join tables (core dataset)


In [57]:
query = """
SELECT 
    o.order_id,
    c.customer_state,
    p.product_category_name,
    ct.product_category_name_english AS category_name,
    oi.price,
    oi.freight_value,
    op.payment_value,
    op.payment_type,
    o.order_purchase_timestamp,
    o.order_delivered_customer_date,
    r.review_score
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
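-- One output row per (item, payment) pair: orders with several items or
-- several payments are repeated, so sums over this frame can double count.
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON p.product_id = oi.product_id
JOIN order_payments op ON op.order_id = o.order_id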
LEFT JOIN order_reviews r ON r.order_id = o.order_id
LEFT JOIN category_translation ct ON ct.product_category_name = p.product_category_name
WHERE o.order_delivered_customer_date IS NOT NULL
"""

df = pd.read_sql_query(query,engine)
df.head()


| | order_id | customer_state | product_category_name | category_name | price | freight_value | payment_value | payment_type | order_purchase_timestamp | order_delivered_customer_date | review_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | RJ | cool_stuff | cool_stuff | 58.9 | 13.29 | 72.19 | credit_card | 2017-09-13 08:59:02 | 2017-09-20 23:43:48 | 5.0 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | SP | pet_shop | pet_shop | 239.9 | 19.93 | 259.83 | credit_card | 2017-04-26 10:53:06 | 2017-05-12 16:04:24 | 4.0 |
| 2 | 000229ec398224ef6ca0657da4fc703e | MG | moveis_decoracao | furniture_decor | 199.0 | 17.87 | 216.87 | credit_card | 2018-01-14 14:33:31 | 2018-01-22 13:19:16 | 5.0 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | SP | perfumaria | perfumery | 12.99 | 12.79 | 25.78 | credit_card | 2018-08-08 10:00:35 | 2018-08-14 13:32:39 | 4.0 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | SP | ferramentas_jardim | garden_tools | 199.9 | 18.14 | 218.04 | credit_card | 2017-02-04 13:57:51 | 2017-03-01 16:42:31 | 5.0 |
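
Because the query joins both `order_items` and `order_payments` to `orders`, an order with several items and several payments appears once per (item, payment) combination. The sketch below reproduces that effect on a toy in-memory SQLite database. The table and column names mirror the Olist schema, but the rows are made up for illustration, and SQLite stands in for PostgreSQL only to keep the example self-contained.

```python
import sqlite3

# Toy data: one order ("A") with two items and two payments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id TEXT, price REAL, freight_value REAL);
CREATE TABLE order_payments (order_id TEXT, payment_type TEXT, payment_value REAL);
INSERT INTO order_items VALUES ('A', 50.0, 10.0), ('A', 30.0, 10.0);
INSERT INTO order_payments VALUES ('A', 'credit_card', 60.0), ('A', 'voucher', 40.0);
""")

# Joining items and payments on order_id alone yields 2 x 2 = 4 rows,
# so every payment is counted once per item.
naive_rows, naive_paid = conn.execute("""
    SELECT COUNT(*), SUM(op.payment_value)
    FROM order_items oi
    JOIN order_payments op ON op.order_id = oi.order_id
""").fetchone()

# Summing the payments table directly gives the amount actually paid.
true_paid = conn.execute(
    "SELECT SUM(payment_value) FROM order_payments"
).fetchone()[0]

print(naive_rows, naive_paid, true_paid)
```

One way to avoid the inflation is to aggregate payments (and items) per order in separate queries or subqueries before joining, so that each order contributes a single row to each side.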


##### 5. Feature Engineering


In [58]:
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])
df['order_delivered_customer_date'] = pd.to_datetime(df['order_delivered_customer_date'])

df['delivery_days'] = (df['order_delivered_customer_date'] - df['order_purchase_timestamp']).dt.days
df['total_value'] = df['price'] + df['freight_value']

df['order_month'] = df['order_purchase_timestamp'].dt.to_period("M").astype(str)
df['order_year'] = df['order_purchase_timestamp'].dt.year
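
Note that `.dt.days` keeps only whole days and drops the partial day. A small sketch with the standard library, using the timestamps from the first row of the preview, shows the difference. In pandas, a fractional version could be obtained with `.dt.total_seconds() / 86400` instead.

```python
from datetime import datetime

# Timestamps taken from the first row of the preview.
purchased = datetime.fromisoformat("2017-09-13 08:59:02")
delivered = datetime.fromisoformat("2017-09-20 23:43:48")

delta = delivered - purchased
whole_days = delta.days                          # whole days only, like .dt.days
fractional_days = delta.total_seconds() / 86400  # keeps the partial day

print(whole_days, round(fractional_days, 2))
```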

##### 6. Sales Trend Over Time


In [59]:
monthly_sales = df.groupby('order_month')['total_value'].sum().reset_index()

fig = px.line(monthly_sales, x='order_month', y='total_value',
              title="Monthly Total Sales Over Time",
              markers=True)

fig.update_layout(xaxis_title='Month', yaxis_title="Total Sales (BRL)")
fig.show()

##### 7. Payment Type Analysis


In [60]:
# Sort by amount (not by the payment type label) so the largest share comes first
payment_summary = df.groupby('payment_type')['payment_value'].sum().reset_index().sort_values('payment_value', ascending=False)

fig = px.pie(payment_summary, names='payment_type', values='payment_value', title="Payment Type Share")
fig.show()

##### 8. Top Product Categories


In [62]:
category_sales = df.groupby('category_name')['total_value'].sum().reset_index()
top_categories = category_sales.nlargest(10,'total_value')

fig = px.bar(top_categories, y='category_name', x='total_value',
             title='🏷️ Top 10 Product Categories by Sales',
             text_auto='.2s')
fig.update_layout(yaxis_title="Product Category", xaxis_title="Total Sales",yaxis = dict(autorange="reversed"))
fig.show()


##### 9. Delivery Performance


In [63]:
delivery_perf = df.groupby('customer_state')['delivery_days'].mean().reset_index()

fig = px.bar(delivery_perf,x='customer_state',y='delivery_days',
             title="Average Delivery Time by State",
             text_auto='.1f')
fig.update_layout(xaxis_title = "State",yaxis_title ="Avg Delivery Days")
fig.show()

##### 10. Review Score Distribution


In [64]:
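# Count distinct orders per score; a plain row count would repeat an order once per item and payment row
review_dist = df.groupby('review_score')['order_id'].nunique().reset_index().rename(columns={"order_id":"num_reviews"})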

fig = px.bar(review_dist, x='review_score', y='num_reviews',
             title='⭐ Review Score Distribution',
             text_auto=True)
fig.update_layout(xaxis_title="Review Score", yaxis_title="Number of Reviews")
fig.show()

##### 11. Correlation: Delivery Time vs Review Score


In [65]:
corr_df = df.groupby('review_score')['delivery_days'].mean().reset_index()

# trendline="ols" requires the statsmodels package to be installed
fig = px.scatter(corr_df, x='review_score', y='delivery_days',
                 size='delivery_days', color='review_score',
                 title='⏱️ Delivery Days vs Review Score',
                 trendline="ols")
fig.update_layout(xaxis_title="Review Score", yaxis_title="Average Delivery Days")
fig.show()
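
The chart compares average delivery days per review score, which shows the direction of the relationship but not its strength. A Pearson coefficient computed over individual rows is one way to quantify it. The sketch below implements the formula in plain Python on made-up values, so the printed number says nothing about the Olist data. On the real frame, `df['delivery_days'].corr(df['review_score'])` should give the row-level equivalent.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values for illustration only; not taken from the Olist data.
delivery_days = [3, 7, 10, 20, 35]
review_score = [5, 5, 4, 2, 1]

r = pearson(delivery_days, review_score)
print(round(r, 3))
```

A value close to -1 would indicate that longer deliveries go with lower scores; a value near 0 would indicate little linear relationship.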


##### 12. State-level Sales Performance


In [66]:
state_sales = df.groupby('customer_state')['total_value'].sum().reset_index().sort_values('total_value', ascending=False)

fig = px.bar(state_sales, x='customer_state', y='total_value',
             title='🌍 Total Sales by Customer State',
             text_auto='.2s')
fig.update_layout(xaxis_title="State", yaxis_title="Total Sales")
fig.show()


##### 13. Summary


##### Key Insights

- **Sales Growth:** Steady increase in sales volume until the holiday season peak (November–December).
- **Top Categories:** Tech, home, and electronics dominate revenue share.
- **Payment Preferences:** Credit cards account for the majority of transactions.
- **Delivery Delays:** Some southern regions have longer delivery times (>10 days avg).
- **Customer Experience:** Longer delivery times tend to reduce review scores.

💡 _Action Recommendation:_  
Optimize logistics in high-delay regions to improve customer satisfaction and reduce low-star reviews.


##### 14. Export to CSV


In [None]:
df.to_csv("olist_analysis_dataset.csv", index=False)
print("Clean analysis dataset exported.")
