# **Exploratory Analysis of the Dataset**
*This notebook is the third stage (Analysis) of the project. The data has already been extracted from the source, transformed in Python, and loaded into a clean, relational `db_olist_analytics` database.*

**The objective of this analysis is to answer the core business questions defined in our `ANALYTICS_FRAMEWORK.md` file and to engineer the features for customer segmentation.**

> **Note on Project Structure:**
> This notebook contains the *full, detailed analytical process*. The key findings and final visualizations will be summarized in the main project `README.md` and the final Power BI dashboard.

## 1. Setup: Libraries & Database Connection
In this step, we import the necessary libraries and establish a connection to the analytics database.

In [1]:
# Database access and environment configuration
import os
import sqlalchemy as db
from sqlalchemy import create_engine
from dotenv import load_dotenv

# Data manipulation
import json
import pandas as pd

# Plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
# Access the database
# Load the variables from the .env file into the environment, then read the credentials
load_dotenv()
db_user = os.getenv("DB_USER")
db_password = os.getenv("DB_PASSWORD")
db_host = os.getenv("DB_HOST")
db_port = os.getenv("DB_PORT")
db_name_analysis = os.getenv("DB_ANALYTICS")

# Connection string
connection_string = f"mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name_analysis}"

# Create the SQLAlchemy engine and open a connection.
# create_engine() is lazy, so the connection is only verified by connect().
try:
    engine = db.create_engine(connection_string)
    conn = engine.connect()
    print(f"Connection to '{db_name_analysis}' successful.")
except Exception as e:
    print(f"Connection failed: {e}")
    raise

# Test query 
test_query = "SELECT * FROM dim_customers LIMIT 5;"
df_test = pd.read_sql(test_query, conn)
        
print("Test query successful:")
display(df_test)

Connection to 'db_olist_analytics' successful.
Test query successful:


| | customer_id | customer_sk | customer_unique_id | customer_unique_sk | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 1 | 248ffe10d632bebe4f7267f1f44844c9 | 1 | 41016 | Covington | Kentucky |
| 1 | 000161a058600d5901f007fab4c27140 | 2 | b0015e09bb4b6e47c52844fab5fb6638 | 2 | 10546 | Millwood | New York |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 3 | 94b11d37cd61cb2994a194d11f89682b | 3 | 62279 | Renault | Illinois |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 4 | 4893ad4ea28b2c5b3ddf4e82e79db9e6 | 4 | 4742 | Fort Fairfield | Maine |
| 4 | 000379cdec625522490c315e70c7a9fb | 5 | 0b83f73b19c2019e182fd552c048a22c | 5 | 41016 | Covington | Kentucky |
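
The connection string in the cell above interpolates the credentials directly, which can produce an invalid URL if the password contains characters such as `@` or `/`. The following is a minimal sketch of one way to guard against that, assuming the same five environment variable names used above. `build_connection_string` and the example credentials are hypothetical and are not part of the project code.

```python
import os
from urllib.parse import quote_plus

# The environment variable names read in the connection cell
REQUIRED_VARS = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_ANALYTICS")

def build_connection_string(env=os.environ):
    """Build a MySQL connection URL from environment-style variables.

    Hypothetical helper: fails early if a variable is missing and
    URL-escapes the user name and password.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return (
        f"mysql+mysqlconnector://{user}:{password}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_ANALYTICS']}"
    )

# Example with made-up credentials (not the project's real values)
example_env = {
    "DB_USER": "analyst",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "3306",
    "DB_ANALYTICS": "db_olist_analytics",
}
print(build_connection_string(example_env))
```

Failing early on a missing variable avoids a connection string that silently contains the text `None`, which would otherwise only surface as a less obvious error when the engine tries to connect.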


In [3]:
# Get the list of tables from the database 
with engine.connect() as conn: 
    table_names_query = "SHOW TABLES;"
    table_names = pd.read_sql(table_names_query, conn).iloc[:,0].tolist()
print(f"Found {len(table_names)} tables: {table_names}")

# Loop through the list of table names and load them into a dictionary 
table_dict = {}
with engine.connect() as conn:
    for table in table_names:
        query = f"SELECT * FROM {table};"
        table_dict[table] = pd.read_sql(query,conn)
print(f"All tables from the {db_name_analysis} database have been loaded into the 'table_dict' dictionary")

Found 8 tables: ['dim_customers', 'dim_geolocation', 'dim_orders', 'dim_products', 'dim_sellers', 'fct_order_items', 'fct_order_payment', 'fct_order_review']
All tables from the db_olist_analytics database have been loaded into the 'table_dict' dictionary
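
Once the tables are in memory, a quick sanity check of their sizes and missing values can catch loading problems before the analysis starts. The sketch below shows one possible check on a toy stand-in for `table_dict`. `summarize_tables` is a hypothetical helper, and the toy columns `order_sk` and `price` and all toy values are assumptions made for illustration.

```python
import pandas as pd

def summarize_tables(tables):
    """Return one row per table: row count, column count, and number of null cells."""
    rows = [
        {
            "table": name,
            "n_rows": len(df),
            "n_cols": df.shape[1],
            "n_null_cells": int(df.isna().sum().sum()),
        }
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows).set_index("table")

# Toy stand-in for table_dict (made-up values, not the real data)
toy_tables = {
    "dim_customers": pd.DataFrame(
        {"customer_sk": [1, 2, 3], "customer_city": ["A", None, "C"]}
    ),
    "fct_order_items": pd.DataFrame(
        {"order_sk": [10, 11], "price": [9.5, 20.0]}
    ),
}

summary = summarize_tables(toy_tables)
print(summary)
```

In the notebook itself, the same function could be called as `summarize_tables(table_dict)`.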


## 2. Analytical Methodology
The analysis will follow the structure of our four planned reports. We will investigate each theme, extract key insights, and then build the features for our segmentation model.

1.  [**Executive Sales Analysis**](#section-sales)
    * *Questions:* Is the business growing? Are there seasonal trends?
2.  [**Logistics & Operations Analysis**](#section-logistics)
    * *Questions:* Does delivery time affect satisfaction? Where are our bottlenecks?
3.  [**Product & Category Analysis**](#section-product)
    * *Questions:* What are our best-selling products? Are there regional preferences?
4.  [**Customer Insights & RFM**](#section-customer)
    * *Questions:* Where do our best customers live? How do they behave?
5.  [**Summary of Key Findings**](#section-summary)
    * A summary of the most critical insights to take to the next stage.

## 3. Executive Sales Analysis
<a id="section-sales"></a>

We start by loading the main `fact_orders` table to analyze overall sales performance.
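
To address the growth and seasonality questions, one possible first step is to aggregate revenue by month and compute the month-over-month change. The sketch below uses toy data. The column names `order_purchase_timestamp` and `payment_value` are assumptions and may not match the columns in the analytics database.

```python
import pandas as pd

# Toy order-level data; column names and values are assumptions for illustration
orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-01-05", "2017-01-20", "2017-02-03", "2017-03-15", "2017-03-28"]
    ),
    "payment_value": [100.0, 50.0, 120.0, 90.0, 90.0],
})

# Aggregate revenue and order counts per calendar month
monthly = (
    orders
    .assign(month=orders["order_purchase_timestamp"].dt.to_period("M"))
    .groupby("month", as_index=False)
    .agg(revenue=("payment_value", "sum"), n_orders=("payment_value", "size"))
)

# Month-over-month revenue growth in percent (undefined for the first month)
monthly["mom_growth_pct"] = monthly["revenue"].pct_change() * 100
print(monthly)
```

Plotting `revenue` against `month` over the full date range is a simple way to see whether peaks recur in the same months each year.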