# 🧭 Exploratory Data Analysis: Corporación Favorita

This notebook explores the Favorita sales dataset using the custom DataLoader.
The goal is to understand the temporal, regional, and categorical patterns that influence sales.

All visualizations use a minimalist dark theme (`plotly_dark`) and are automatically exported as PNGs
into the `img/reports/eda/` folder for later use in the Streamlit app.

In [1]:
import sys
from pathlib import Path

# Detect project root first (manually, since utils isn't yet importable)
ROOT_CANDIDATE = Path().resolve()
if ROOT_CANDIDATE.name == "notebooks":
    PROJECT_ROOT = ROOT_CANDIDATE.parent
else:
    PROJECT_ROOT = ROOT_CANDIDATE

# Add to sys.path so we can import from src
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

print(f"📦 Bootstrapped project root: {PROJECT_ROOT}")

📦 Bootstrapped project root: /Users/kiko/Desktop/github/Corporacion-Favorita-Grocery-Sales-Forecasting


In [2]:
from src.utils.files import set_project_root, ensure_dirs
from src.data_loader import DataLoader

# Set proper working dir
PROJECT_ROOT = set_project_root()

# Verify and prepare
IMG_DIR = PROJECT_ROOT / "img" / "reports" / "eda"
ensure_dirs(IMG_DIR)

loader = DataLoader()
df = loader.load_train_data()
print(f"✅ Loaded dataset with {len(df):,} rows and {len(df.columns)} columns.")

📂 Working directory set to project root: /Users/kiko/Desktop/github/Corporacion-Favorita-Grocery-Sales-Forecasting

✅ Loaded dataset with 103,857,647 rows and 6 columns.


In [3]:
# --- Imports & Theme (run first) ---
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio

# Dark theme for all figures
pio.templates.default = "plotly_dark"

# Optional: higher default export resolution (PNG export requires 'kaleido')
# pio.kaleido.scope.default_scale is deprecated; pio.defaults is its replacement.
pio.defaults.default_scale = 2

## 1️⃣ Data Overview

We begin by examining the structure of the dataset: the available columns, data types,
and a preview of the first few rows to understand what kind of information we're working with.

In [4]:
df.info()
display(df.head())
display(df.describe(include='all').T.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103857647 entries, 0 to 103857646
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    float64
 3   item_nbr     float64
 4   unit_sales   float64
 5   onpromotion  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 4.6+ GB


| | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 25.0 | 103665.0 | 7.0 | |
| 1 | 1 | 2013-01-01 | 25.0 | 105574.0 | 1.0 | |
| 2 | 2 | 2013-01-01 | 25.0 | 105575.0 | 2.0 | |
| 3 | 3 | 2013-01-01 | 25.0 | 108079.0 | 1.0 | |
| 4 | 4 | 2013-01-01 | 25.0 | 108701.0 | 1.0 | |


| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 103857647.0 | | | | 51928823.0 | 29981120.370763 | 0.0 | 25964411.5 | 51928823.0 | 77893234.5 | 103857646.0 |
| date | 103857647.0 | 1479.0 | 2016-12-18 | 114252.0 | | | | | | | |
| store_nbr | 103857646.0 | | | | 27.35018 | 16.34798 | 1.0 | 12.0 | 28.0 | 43.0 | 54.0 |
| item_nbr | 103857646.0 | | | | 933398.967123 | 498219.53854 | 96995.0 | 511394.0 | 936341.0 | 1260242.0 | 2124052.0 |
| unit_sales | 103857646.0 | | | | 8.647428 | 23.480741 | -15372.0 | 2.0 | 4.0 | 9.0 | 89440.0 |
| onpromotion | 82199995.0 | 2.0 | False | 76874911.0 | | | | | | | |
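
The `info()` output shows `date` and `onpromotion` stored as `object` and the store and item identifiers as `float64`, which contributes to the 4.6+ GB footprint. The following is a minimal sketch of tightening those dtypes, assuming the value ranges reported by `describe()` hold for the whole frame. The helper name `compact_dtypes` and the sample rows are hypothetical and are not part of the project's `DataLoader`.

```python
import pandas as pd


def compact_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the training frame with narrower dtypes (hypothetical helper)."""
    out = frame.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Nullable integer dtypes keep missing identifiers as <NA> instead of failing.
    out["store_nbr"] = out["store_nbr"].astype("Int8")    # observed max: 54
    out["item_nbr"] = out["item_nbr"].astype("Int32")     # observed max: 2,124,052
    out["unit_sales"] = out["unit_sales"].astype("float32")
    # Nullable boolean preserves rows whose promotion status is unknown.
    out["onpromotion"] = out["onpromotion"].astype("boolean")
    return out


# Made-up rows that mimic the loaded columns, including missing values.
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "date": ["2013-01-01", "2013-01-01", "2013-01-02"],
    "store_nbr": [25.0, 25.0, None],
    "item_nbr": [103665.0, 105574.0, 105575.0],
    "unit_sales": [7.0, 1.0, -2.0],
    "onpromotion": [None, True, False],
})
compact = compact_dtypes(sample)
print(compact.dtypes)
```

On the full 103,857,647-row frame the `copy()` would temporarily double memory use, so in practice the same dtypes would more likely be applied column by column or at load time.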


## 2️⃣ Temporal Sales Patterns

Understanding how sales evolve over time helps us detect seasonality, holidays, or long-term trends.

In [5]:
# Aggregate total sales per date
daily_sales = df.groupby('date')['unit_sales'].sum().reset_index()
fig = px.line(daily_sales, x='date', y='unit_sales', title='Total Sales Over Time', template='plotly_dark')
fig.show()
fig.write_image(str(IMG_DIR / 'total_sales_over_time.png'))
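
A raw daily line mixes the weekly cycle with any longer trend. One common way to separate the two is a centered 7-day rolling mean; the sketch below uses a synthetic stand-in for `daily_sales` (the values are made up), and the column names `trend_7d` and `weekly_residual` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `daily_sales`: four weeks with a weekend bump.
dates = pd.date_range("2013-01-07", periods=28, freq="D")  # 2013-01-07 is a Monday
units = np.where(dates.dayofweek >= 5, 150.0, 100.0)
daily = pd.DataFrame({"date": dates, "unit_sales": units})

# A centered 7-day mean averages over exactly one weekly cycle,
# so the weekday/weekend swing cancels out and only the trend remains.
daily["trend_7d"] = daily["unit_sales"].rolling(window=7, center=True).mean()

# What is left after removing the trend is the within-week pattern.
daily["weekly_residual"] = daily["unit_sales"] - daily["trend_7d"]

print(daily.iloc[3:10])
```

The first and last three rows have no full window and stay `NaN`, which is preferable to padding them with partial averages.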



In [6]:
# Sales by day of week
df['day_of_week'] = pd.to_datetime(df['date']).dt.day_name()
dow = df.groupby('day_of_week')['unit_sales'].mean().reindex([
    'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'
]).reset_index()
fig = px.bar(dow, x='day_of_week', y='unit_sales', title='Average Sales by Day of Week', template='plotly_dark')
fig.show()
fig.write_image(str(IMG_DIR / 'avg_sales_by_dayofweek.png'))
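
The cell above parses every row's date string, although `describe()` reports only 1,479 distinct dates. A possible alternative, sketched here on made-up rows, is to parse each distinct date once and map the day names back; an ordered categorical then keeps the Monday-to-Sunday order without a manual `reindex`. The variable names are illustrative.

```python
import pandas as pd

rows = pd.DataFrame({
    "date": ["2013-01-05", "2013-01-05", "2013-01-06", "2013-01-07", "2013-01-07"],
    "unit_sales": [4.0, 6.0, 9.0, 1.0, 3.0],
})

# Parse each distinct date string once, then broadcast the result back to every row.
unique_dates = pd.Series(rows["date"].unique())
day_lookup = dict(zip(unique_dates, pd.to_datetime(unique_dates).dt.day_name()))
rows["day_of_week"] = rows["date"].map(day_lookup)

# An ordered categorical sorts groups in calendar order rather than alphabetically.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
rows["day_of_week"] = pd.Categorical(rows["day_of_week"], categories=order, ordered=True)

dow_mean = rows.groupby("day_of_week", observed=True)["unit_sales"].mean()
print(dow_mean)
```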



## 3️⃣ Store Performance

We next explore how different stores perform. Large discrepancies may reveal population density differences or supply-chain challenges.

In [7]:
store_perf = df.groupby('store_nbr')['unit_sales'].mean().reset_index().sort_values('unit_sales', ascending=False)
fig = px.bar(store_perf.head(30), x='store_nbr', y='unit_sales', title='Top 30 Stores by Average Sales', template='plotly_dark')
fig.show()
fig.write_image(str(IMG_DIR / 'top_stores_avg_sales.png'))
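
The chart above ranks stores by the mean of `unit_sales` per row, which reflects typical volume per item-day rather than each store's contribution to overall volume. To check how concentrated sales are across stores, a cumulative share of total units may be more direct. This is a sketch on made-up numbers; the column names in `concentration` are assumptions.

```python
import pandas as pd

sales = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2, 3, 3, 4],
    "unit_sales": [50.0, 30.0, 10.0, 5.0, 3.0, 1.0, 1.0],
})

# Total units per store, largest first.
store_totals = (
    sales.groupby("store_nbr")["unit_sales"].sum().sort_values(ascending=False)
)

# Fraction of all units contributed by each store, plus the running total.
share = store_totals / store_totals.sum()
concentration = pd.DataFrame({
    "total_units": store_totals,
    "share": share,
    "cumulative_share": share.cumsum(),
})
print(concentration)
```

Reading `cumulative_share` from the top shows how many stores are needed to reach a given fraction of total units.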



## 4️⃣ Product Families

Now we look at which product families account for the most unit sales (the dataset records units sold, not revenue). This can inform inventory and marketing strategies.

In [8]:
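# 'family' is not among the six columns returned by load_train_data();
# it must be joined in from item metadata first, so guard the plot.
if 'family' in df.columns:
    families = df.groupby('family')['unit_sales'].sum().sort_values(ascending=False).reset_index()
    fig = px.bar(families.head(15), x='unit_sales', y='family', orientation='h', title='Top 15 Product Families by Total Sales', template='plotly_dark')
    fig.show()
    fig.write_image(str(IMG_DIR / 'top_product_families.png'))
else:
    print("Column 'family' not found; join item metadata on 'item_nbr' before running this cell.")
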
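
`family` is not among the six columns listed by `df.info()`, so a family breakdown needs item metadata joined onto the sales rows. The sketch below assumes a lookup table with `item_nbr` and `family` columns, as in the competition's `items.csv`; how the project's `DataLoader` exposes that table is not shown in this notebook, so both frames here are tiny made-up stand-ins.

```python
import pandas as pd

# Tiny stand-ins: `train` mimics the loaded frame, `items` mimics item metadata.
train = pd.DataFrame({
    "item_nbr": [103665, 105574, 105574, 999999],
    "unit_sales": [7.0, 1.0, 2.0, 4.0],
})
items = pd.DataFrame({
    "item_nbr": [103665, 105574],
    "family": ["BREAD/BAKERY", "GROCERY I"],  # illustrative labels
})

# Left join keeps every sales row; validate= raises if the lookup table has
# duplicated item ids, which would otherwise silently multiply sales rows.
merged = train.merge(items, on="item_nbr", how="left", validate="many_to_one")

# Items without metadata get an explicit label instead of vanishing from the groupby.
merged["family"] = merged["family"].fillna("UNKNOWN")

family_totals = (
    merged.groupby("family")["unit_sales"].sum().sort_values(ascending=False)
)
print(family_totals)
```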

## 5️⃣ Promotions & Sales

Promotions can heavily influence short-term demand. Let's visualize how they correlate with average sales volume.

In [None]:
promo_df = df.groupby('onpromotion')['unit_sales'].mean().reset_index()
fig = px.bar(promo_df, x='onpromotion', y='unit_sales', title='Average Sales vs Promotion Status', template='plotly_dark')
fig.show()
fig.write_image(str(IMG_DIR / 'promo_vs_sales.png'))
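
According to the `describe()` output, `onpromotion` is populated for 82,199,995 of 103,857,647 rows, so roughly 21.7 million rows have no promotion status. `groupby` drops missing keys by default, which means the comparison above would silently exclude them. A minimal sketch that keeps them visible as a third group, using made-up values and a hypothetical `promo_status` column:

```python
import pandas as pd

promo = pd.DataFrame({
    "onpromotion": [None, None, False, False, True, True],
    "unit_sales": [8.0, 6.0, 3.0, 5.0, 10.0, 14.0],
})

# Default groupby drops rows whose key is missing: only False and True remain.
default_means = promo.groupby("onpromotion")["unit_sales"].mean()

# Labeling the missing status keeps those rows as their own group.
status = (
    promo["onpromotion"]
    .map({True: "promoted", False: "not promoted"})
    .fillna("unknown")
)
labeled = promo.assign(promo_status=status)

# Reporting the count next to the mean shows how much data backs each bar.
means_with_unknown = labeled.groupby("promo_status")["unit_sales"].agg(["mean", "count"])
print(means_with_unknown)
```

If the unknown group's mean differs noticeably from the other two, the missing values are probably not random, and the promoted versus not-promoted gap should be read with that in mind.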

## 6️⃣ Oil Prices & Macroeconomic Impact

Oil prices in Ecuador may have macro effects on transportation costs and consumer behavior.
We compare oil price fluctuations with total sales to observe potential relationships.

In [None]:
if 'dcoilwtico' in df.columns:
    oil_corr = df.groupby('date')[['unit_sales','dcoilwtico']].mean().reset_index()
    fig = px.line(oil_corr, x='date', y=['unit_sales','dcoilwtico'], title='Sales vs Oil Price Trends', template='plotly_dark')
    fig.show()
    fig.write_image(str(IMG_DIR / 'sales_vs_oil_price.png'))
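
`dcoilwtico` is not one of the six loaded columns, so the guarded cell above is skipped as the data stands. The sketch below shows one way the comparison could be set up, assuming a separate oil price table with `date` and `dcoilwtico` columns (as in the competition's `oil.csv`) and assuming quotes are missing on some days. All values are made up.

```python
import pandas as pd

daily_sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"]),
    "unit_sales": [100.0, 120.0, 90.0, 110.0],
})
# The price table has gaps on days without a quote.
oil = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-04"]),
    "dcoilwtico": [93.0, 95.0],
})

joined = daily_sales.merge(oil, on="date", how="left").sort_values("date")

# Carry the last known price forward; a leading gap stays NaN rather than being invented.
joined["dcoilwtico"] = joined["dcoilwtico"].ffill()

# Pearson correlation; pandas skips the days where either series is missing.
corr = joined["unit_sales"].corr(joined["dcoilwtico"])
print(joined)
print(f"correlation: {corr:.3f}")
```

Because sales and oil prices sit on very different scales, plotting them on a shared y-axis tends to flatten one of the lines; a correlation or a secondary axis is usually easier to read.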

## 7️⃣ Regional Distribution

Finally, we compare regional sales levels to identify key contributing locations.

In [None]:
if 'city' in df.columns:
    region_sales = df.groupby('city')['unit_sales'].sum().sort_values(ascending=False).reset_index()
    fig = px.bar(region_sales.head(15), x='city', y='unit_sales', title='Top Cities by Total Sales', template='plotly_dark')
    fig.show()
    fig.write_image(str(IMG_DIR / 'top_cities_sales.png'))

## 🧩 Summary of Insights

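The temporal and store-level cells ran in this session, and their charts let us check for:
- Clear weekly seasonality (higher sales on weekends)
- Noticeable peaks during holiday periods
- A few stores driving a large share of total sales

The remaining cells produced no output here: `family`, `dcoilwtico`, and `city` are not among the six loaded columns, and the promotion cell shows no execution count. The following are therefore hypotheses to verify once the metadata is joined and the cells are rerun:
- Specific product families dominate unit sales
- Promotions significantly boost sales
- Oil prices may have a mild inverse correlation with consumer demand
- Some cities exhibit far stronger purchasing activity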

These patterns will later inform our feature engineering and model tuning in the forecasting pipeline.