## Analyzing Brazilian E-Commerce Dynamics  

**Objective**: Investigate customer behavior, seller performance, and operational bottlenecks in Olist’s marketplace using real-world transactional data.  

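**Dataset**: The Olist Brazilian e-commerce dataset from Kaggle (`olistbr/brazilian-ecommerce`), with roughly 100k orders spread across 9 relational tables (customers, sellers, products, reviews, etc.).  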

**Methodology**:  
1. **Data Cleaning**: Geolocation validation, payment anomaly detection.  
2. **EDA**: Customer segmentation, delivery time vs. satisfaction trends.  
3. **Hypothesis Testing**: Pricing bias across regions, seller efficiency impact.  
4. **Advanced Analytics**: PCA for customer behavior patterns.  

**Business Impact**: Recommendations for improving customer retention, delivery logistics, and seller performance.  
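
As a rough illustration of step 2, the relationship between delivery time and satisfaction could be measured with a Spearman rank correlation. This is a sketch under stated assumptions, not the analysis this notebook actually runs: the helper name `delivery_vs_score` is hypothetical, and it assumes the `order_purchase_timestamp`, `order_delivered_customer_date`, and `review_score` columns of the Olist orders and reviews tables.

```python
import pandas as pd
from scipy.stats import spearmanr


def delivery_vs_score(orders: pd.DataFrame, reviews: pd.DataFrame):
    """Spearman correlation between delivery time (days) and review score.

    Hypothetical helper; column names are assumed to match the Olist schema.
    """
    # Attach each order's review score via the shared order_id key.
    df = orders.merge(reviews[["order_id", "review_score"]], on="order_id")
    purchased = pd.to_datetime(df["order_purchase_timestamp"])
    delivered = pd.to_datetime(df["order_delivered_customer_date"])
    df["delivery_days"] = (delivered - purchased).dt.total_seconds() / 86400
    # Undelivered orders have no delivery date; drop them before ranking.
    df = df.dropna(subset=["delivery_days", "review_score"])
    rho, p_value = spearmanr(df["delivery_days"], df["review_score"])
    return rho, p_value
```

A negative `rho` would suggest that longer deliveries tend to go with lower scores. A rank correlation is used here because review scores are ordinal (1 to 5) rather than continuous.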

In [3]:
# Core Imports
import os
import zipfile
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

# Statistical Analysis
from scipy import stats
from scipy.stats import spearmanr, kruskal

# Machine Learning & Dimensionality Reduction
# Note: TargetEncoder requires scikit-learn >= 1.3
from sklearn.preprocessing import StandardScaler, TargetEncoder
from sklearn.decomposition import PCA
# Provided by the umap-learn package
import umap.umap_ as umap

# Kaggle Dataset Download
import kagglehub

In [4]:
try:
    # kagglehub caches the dataset locally and returns the path to it.
    dataset_path = kagglehub.dataset_download("olistbr/brazilian-ecommerce")
    print(f'Dataset downloaded to {dataset_path}')

    # Fallback in case the returned path is a zip archive.
    # A directory path fails this check and is used as-is.
    if zipfile.is_zipfile(dataset_path):
        with zipfile.ZipFile(dataset_path, 'r') as zip_ref:
            zip_ref.extractall('data/')
            print('Dataset extracted to data/ directory')
    else:
        print('Downloaded file is not a zip archive.')
except Exception as e:
    print(f'Error downloading dataset: {e}')
    print("Alternative: Download manually from https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce")

Downloading from https://www.kaggle.com/api/v1/datasets/download/olistbr/brazilian-ecommerce?dataset_version_number=2...
100%|██████████| 42.6M/42.6M [00:04<00:00, 9.06MB/s]
Extracting files...

Dataset downloaded to /home/krmsh1n5/.cache/kagglehub/datasets/olistbr/brazilian-ecommerce/versions/2
Downloaded file is not a zip archive.
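
Since `kagglehub.dataset_download` returned a directory here rather than an archive, listing its contents is a quick way to confirm which CSV files are available before loading them. A minimal sketch; `list_csv_files` is a hypothetical helper, not part of kagglehub.

```python
from pathlib import Path


def list_csv_files(directory: str) -> list[str]:
    """Return the sorted names of CSV files directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    # Non-recursive: only files at the top level of the directory are listed.
    return sorted(p.name for p in root.glob("*.csv"))
```

In this notebook it could be called as `list_csv_files(dataset_path)`.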


In [11]:
data_files = {
    "customers": "olist_customers_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "products": "olist_products_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv"
}

datasets = {}
try:
    for name, file in data_files.items():
        file_path = os.path.join(dataset_path, file)
        datasets[name] = pd.read_csv(file_path, encoding='latin-1')
        print(f"Loaded {name} dataset: {datasets[name].shape}")
except FileNotFoundError as e:
    print(f'File not found: {e}')
    print(f"Verify dataset files are in {dataset_path}")

Loaded customers dataset: (99441, 5)
Loaded orders dataset: (99441, 8)
Loaded order_items dataset: (112650, 7)
Loaded products dataset: (32951, 9)
Loaded sellers dataset: (3095, 4)
Loaded reviews dataset: (99224, 7)
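
The shapes above hint at how the tables relate: `customers` and `orders` have the same row count, while `order_items` has more rows than `orders`. Before merging, the join keys could be checked along these lines. This is only a sketch; `check_key` is a hypothetical helper, and the key column names (`customer_id`, `order_id`) are assumed from the Olist schema.

```python
import pandas as pd


def check_key(parent: pd.DataFrame, child: pd.DataFrame, key: str) -> dict:
    """Summarize how `child[key]` relates to `parent[key]`."""
    parent_keys = parent[key]
    child_keys = child[key]
    return {
        # True when each key value appears at most once in the parent table.
        "parent_unique": bool(parent_keys.is_unique),
        # Child rows whose key has no match in the parent table.
        "orphan_rows": int((~child_keys.isin(parent_keys)).sum()),
        # Largest number of child rows sharing one key value (0 if empty).
        "max_children_per_key": (
            int(child_keys.value_counts().max()) if len(child_keys) else 0
        ),
    }
```

For example, `check_key(datasets["orders"], datasets["order_items"], "order_id")` would show whether every item row belongs to a known order and how many items the largest order contains.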
