In [32]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.0f}'.format)

import warnings
warnings.filterwarnings("ignore")

In [33]:
df = pd.read_csv('../data/cleaned_data.csv')

df['date'] = pd.to_datetime(df['date'])
df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion,holiday_type,locale,transferred,dcoilwtico,city,state,store_type,cluster,transactions,year,month,week,quarter,day_of_week,is_crisis,sales_lag_7,rolling_mean_7,is_weekend,is_holiday,promo_last_7_days
0,2013-01-01,1,AUTOMOTIVE,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
1,2013-01-01,1,BABY CARE,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
2,2013-01-01,1,BEAUTY,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
3,2013-01-01,1,BEVERAGES,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
4,2013-01-01,1,BOOKS,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0


## 🔍 1. How have total sales evolved over time?

To understand the overall business trend, we sum sales across all stores and product families for each day.

In [34]:
sales_over_time = df.groupby('date')['sales'].sum().reset_index()
sales_over_time

Unnamed: 0,date,sales
0,2013-01-01,2512
1,2013-01-02,496092
2,2013-01-03,361461
3,2013-01-04,354460
4,2013-01-05,477350
...,...,...
1679,2017-08-11,826374
1680,2017-08-12,792631
1681,2017-08-13,865640
1682,2017-08-14,760922


### Key Findings:
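- In the rows shown, daily sales range from about 2.5K on 2013-01-01 (a national holiday) to more than 860K on 2017-08-13.
- Daily totals rise from roughly 350K to 500K in early January 2013 to roughly 760K to 865K in mid-August 2017, which suggests an upward trend; seasonal fluctuations are likely present (to be analyzed in later steps).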

> Corresponding plots will be in cell `4` in `visualization_demo.ipynb`
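
One way to check the upward trend noted in the findings is to smooth the daily series before comparing periods. The sketch below is illustrative only: it uses a small synthetic series as a stand-in for `sales_over_time` (the values are made up), and the helper name `summarize_trend` is hypothetical. It returns monthly totals and a 7-day rolling mean.

```python
import numpy as np
import pandas as pd

def summarize_trend(daily: pd.DataFrame, window: int = 7):
    """Return (monthly totals, rolling mean) for a frame with 'date' and 'sales' columns."""
    series = daily.set_index("date")["sales"].sort_index()
    monthly = series.resample("MS").sum()      # one total per calendar month
    smoothed = series.rolling(window).mean()   # NaN for the first window-1 days
    return monthly, smoothed

# Synthetic stand-in for sales_over_time: 90 days with a linear upward drift.
dates = pd.date_range("2013-01-01", periods=90, freq="D")
demo = pd.DataFrame({"date": dates, "sales": 1000.0 + 10.0 * np.arange(90)})

monthly, smoothed = summarize_trend(demo)
print(monthly)
print(smoothed.dropna().head())
```

On the real data, the same call would be `summarize_trend(sales_over_time)`. Comparing the same month across years, rather than adjacent months, helps separate trend from seasonality.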

## 🔍 2. Which products or categories contribute the most to total revenue?

Summing sales by product family gives the following ranking (top 20 families shown):

In [48]:
top_products = df.groupby('family')['sales'].sum().sort_values(ascending=False).head(20)
top_products

family
GROCERY I             350,827,298
BEVERAGES             221,663,540
PRODUCE               125,447,968
CLEANING               99,421,019
DAIRY                  65,823,605
BREAD/BAKERY           42,959,924
POULTRY                32,494,451
MEATS                  31,650,996
PERSONAL CARE          25,100,482
DELI                   24,585,627
HOME CARE              16,409,522
EGGS                   15,881,196
FROZEN FOODS           14,646,940
PREPARED FOODS          8,966,728
LIQUOR,WINE,BEER        7,937,172
SEAFOOD                 2,051,636
GROCERY II              2,004,966
HOME AND KITCHEN I      1,905,076
HOME AND KITCHEN II     1,556,511
CELEBRATION               779,502
Name: sales, dtype: float64

1. **GROCERY I**: $350,827,298
2. **BEVERAGES**: $221,663,540
3. **PRODUCE**: $125,447,968
4. **CLEANING**: $99,421,019
5. **DAIRY**: $65,823,605

The top five families total roughly $863 million, about 79% of the combined revenue of the 20 families listed above, and **GROCERY I** alone exceeds **BEVERAGES** by more than $129 million. Families at the bottom of this list, such as **CELEBRATION** and **HOME AND KITCHEN II**, each contribute under $2 million.

Essential categories (groceries, beverages, and produce) lead in sales, which may reflect steady consumer demand. Further analysis could explore seasonality and trends within these top categories.

> Corresponding plots will be in cell `6` in `visualization_demo.ipynb`
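
To put a number on how concentrated revenue is, each family's share and cumulative share of the total can be computed alongside the raw sums. This is a minimal sketch on a made-up frame; `revenue_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def revenue_share(frame: pd.DataFrame, by: str, value: str = "sales") -> pd.DataFrame:
    """Total, share, and cumulative share of `value` per group, largest first."""
    totals = frame.groupby(by)[value].sum().sort_values(ascending=False)
    out = totals.to_frame("total")
    out["share"] = out["total"] / out["total"].sum()
    out["cum_share"] = out["share"].cumsum()
    return out

# Tiny made-up frame standing in for df; the numbers are illustrative only.
demo = pd.DataFrame({
    "family": ["A", "A", "B", "C", "C", "D"],
    "sales":  [40.0, 20.0, 25.0, 5.0, 5.0, 5.0],
})
shares = revenue_share(demo, "family")
print(shares)
```

On the real data this would be `revenue_share(df, 'family')`. Because the notebook sets `display.float_format` to `'{:,.0f}'`, the share columns would print as 0 or 1 there unless they are multiplied by 100 or formatted separately.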

## 🔍 3. Which stores, cities, or states are the top performers in terms of revenue?

In [36]:
top_stores = df.groupby('store_nbr')['sales'].sum().sort_values(ascending=False)
top_cities = df.groupby('city')['sales'].sum().sort_values(ascending=False)
top_regions = df.groupby('state')['sales'].sum().sort_values(ascending=False)

print("Top Stores by Revenue:")
print(top_stores.head())  

print("\n \nTop Cities by Revenue:")
print(top_cities.head()) 

print("\n \nTop States by Revenue:")
print(top_regions.head()) 

Top Stores by Revenue:
store_nbr
44   63,356,137
45   55,689,022
47   52,024,476
3    51,533,528
49   44,346,823
Name: sales, dtype: float64

 
Top Cities by Revenue:
city
Quito           568,679,349
Guayaquil       125,572,186
Cuenca           50,194,046
Ambato           41,159,773
Santo Domingo    36,617,572
Name: sales, dtype: float64

 
Top States by Revenue:
state
Pichincha                        597,585,883
Guayas                           168,649,985
Azuay                             50,194,046
Tungurahua                        41,159,773
Santo Domingo de los Tsachilas    36,617,572
Name: sales, dtype: float64


Based on the total sales data, the following stores, cities, and states are the top performers:

### **Top Stores by Revenue:**
1. **Store 44**: $63,356,137
2. **Store 45**: $55,689,022
3. **Store 47**: $52,024,476
4. **Store 3**: $51,533,528
5. **Store 49**: $44,346,823

### **Top Cities by Revenue:**
1. **Quito**: $568,679,349
2. **Guayaquil**: $125,572,186
3. **Cuenca**: $50,194,046
4. **Ambato**: $41,159,773
5. **Santo Domingo**: $36,617,572

### **Top States by Revenue:**
1. **Pichincha**: $597,585,883
2. **Guayas**: $168,649,985
3. **Azuay**: $50,194,046
4. **Tungurahua**: $41,159,773
5. **Santo Domingo de los Tsachilas**: $36,617,572

**Quito** leads at the city level with more than four times the revenue of second-placed **Guayaquil**, and **Pichincha** is the highest-performing state at about 3.5 times the revenue of **Guayas**. Among stores, Store 44 generates the highest revenue, roughly 14% ahead of Store 45.

These rankings show where revenue is concentrated and can help identify areas for growth and focus, particularly in high-revenue cities and states.
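
City and state totals depend partly on how many stores each one has, so a per-store view can complement the raw ranking. The sketch below assumes each `store_nbr` belongs to exactly one city; the frame is made up and `revenue_per_store` is a hypothetical helper.

```python
import pandas as pd

def revenue_per_store(frame, by, store_col="store_nbr", value="sales"):
    """Total revenue, store count, and revenue per store for each group in `by`."""
    grouped = frame.groupby(by)
    out = pd.DataFrame({
        "total": grouped[value].sum(),
        "stores": grouped[store_col].nunique(),  # distinct stores in the group
    })
    out["per_store"] = out["total"] / out["stores"]
    return out.sort_values("per_store", ascending=False)

# Made-up data: city X has three stores, city Y has one.
demo = pd.DataFrame({
    "city": ["X", "X", "X", "Y"],
    "store_nbr": [1, 2, 3, 4],
    "sales": [100.0, 100.0, 100.0, 150.0],
})
ranking = revenue_per_store(demo, "city")
print(ranking)
```

In this toy example city X has the larger total but city Y has the higher revenue per store, which is the kind of difference the raw totals cannot show.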

## 🔍 4. What is the average order size across stores, regions, and categories?

In [37]:
df['transactions'].dtype

dtype('float64')

In [47]:
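# Keep only rows with a positive transaction count to avoid dividing by zero.
df_filtered = df[df['transactions'] > 0]

# Average order size = total sales / total transactions within each group.
# Caveat: if `transactions` is a per-store daily count repeated on every family row,
# the store and state denominators below count each day once per family.
avg_order_size_store = df_filtered.groupby('store_nbr').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)
avg_order_size_region = df_filtered.groupby('state').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)
avg_order_size_category = df_filtered.groupby('family').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)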

print("Average Order Size by Store:")
print(avg_order_size_store.head())  

print("\nAverage Order Size by Region:")
print(avg_order_size_region.head()) 

print("\nAverage Order Size by Category:")
print(avg_order_size_category.head())  

Average Order Size by Store:
store_nbr
51   0
42   0
21   0
29   0
52   0
dtype: float64

Average Order Size by Region:
state
Azuay      0
Manabi     0
El Oro     0
Pastaza    0
Los Rios   0
dtype: float64

Average Order Size by Category:
family
GROCERY I   2
BEVERAGES   2
PRODUCE     1
CLEANING    1
DAIRY       0
dtype: float64
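
The zeros printed for stores and states deserve a second look. Two things may contribute. First, the notebook sets `display.float_format` to `'{:,.0f}'`, which rounds any ratio below 0.5 to `0`. Second, if `transactions` is a per-store, per-day count repeated on every `family` row (an assumption that is consistent with the `df.head()` output but not verified here), summing it over all rows inflates the denominator. A minimal sketch of a calculation that de-duplicates first, using a made-up frame and a hypothetical helper `avg_order_size`:

```python
import pandas as pd

def avg_order_size(frame, by):
    """Sales per transaction for each group in `by` (a column of `frame`).

    Assumes `transactions` is recorded once per (date, store_nbr) and repeated
    on every family row, so it is de-duplicated before summing.
    """
    frame = frame[frame["transactions"] > 0]
    sales = frame.groupby(by)["sales"].sum()
    keys = list(dict.fromkeys(["date", "store_nbr", by]))  # unique keys, order kept
    tx = frame.drop_duplicates(subset=keys).groupby(by)["transactions"].sum()
    return (sales / tx).sort_values(ascending=False)

# Made-up data: two stores, two families, one day.
demo = pd.DataFrame({
    "date": ["2013-01-02"] * 4,
    "store_nbr": [1, 1, 2, 2],
    "state": ["P", "P", "G", "G"],
    "family": ["A", "B", "A", "B"],
    "sales": [30.0, 10.0, 24.0, 6.0],
    "transactions": [20.0, 20.0, 10.0, 10.0],
})
by_store = avg_order_size(demo, "store_nbr")
by_family = avg_order_size(demo, "family")
print(by_store.round(2))
print(by_family.round(2))
```

For `by='family'` every (date, store, family) row is already unique, so the result matches the original per-category calculation; only the store and state figures change. Rounding explicitly with `.round(2)` does not override the notebook's float format, so that option would also need to be relaxed to see the decimals.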
