In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv('../data/cleaned_data.csv')

df['date'] = pd.to_datetime(df['date'])
df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion,holiday_type,locale,transferred,dcoilwtico,city,state,store_type,cluster,transactions,year,month,week,quarter,day_of_week,is_crisis,sales_lag_7,rolling_mean_7,is_weekend,is_holiday,promo_last_7_days,days_to_holiday,promotion_status
0,2013-01-01,1,AUTOMOTIVE,0.0,0,Holiday,National,False,93.14,Quito,Pichincha,D,13,0.0,2013,1,1,1,Tuesday,0,0.0,0.0,0,1,0.0,0,Not On Promotion
1,2013-01-01,1,BABY CARE,0.0,0,Holiday,National,False,93.14,Quito,Pichincha,D,13,0.0,2013,1,1,1,Tuesday,0,0.0,0.0,0,1,0.0,0,Not On Promotion
2,2013-01-01,1,BEAUTY,0.0,0,Holiday,National,False,93.14,Quito,Pichincha,D,13,0.0,2013,1,1,1,Tuesday,0,0.0,0.0,0,1,0.0,0,Not On Promotion
3,2013-01-01,1,BEVERAGES,0.0,0,Holiday,National,False,93.14,Quito,Pichincha,D,13,0.0,2013,1,1,1,Tuesday,0,0.0,0.0,0,1,0.0,0,Not On Promotion
4,2013-01-01,1,BOOKS,0.0,0,Holiday,National,False,93.14,Quito,Pichincha,D,13,0.0,2013,1,1,1,Tuesday,0,0.0,0.0,0,1,0.0,0,Not On Promotion


# 📈 1. How Have Total Sales Evolved Over Time?

To understand the overall business trend, we calculated the total sales per day from the dataset.

In [None]:
sales_over_time = df.groupby('date')['sales'].sum().reset_index()
sales_over_time

Unnamed: 0,date,sales
0,2013-01-01,2511.62
1,2013-01-02,496092.42
2,2013-01-03,361461.23
3,2013-01-04,354459.68
4,2013-01-05,477350.12
...,...,...
1679,2017-08-11,826373.72
1680,2017-08-12,792630.54
1681,2017-08-13,865639.68
1682,2017-08-14,760922.41


### Key Findings:
- Daily sales range from about 2.5K (on 2013-01-01, a national holiday) to over 860K on peak days.
- Daily revenue trends upward between 2013 and 2017, with seasonal fluctuations likely present (to be analyzed in later steps).
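
One way to make the trend claim checkable is to smooth the daily totals and compare yearly averages. The sketch below assumes a frame shaped like `sales_over_time` (columns `date` and `sales`). The helper name `summarize_trend` and the 28-day default window are illustrative choices, not part of the original notebook, and the demo rows are made up.

```python
import pandas as pd

def summarize_trend(sales_over_time: pd.DataFrame, window: int = 28):
    """Return (rolling mean of daily totals, mean daily total per calendar year)."""
    daily = sales_over_time.set_index('date')['sales'].sort_index()
    # A rolling mean damps day-of-week swings; min_periods=1 keeps the first days.
    smoothed = daily.rolling(window=window, min_periods=1).mean()
    # Mean daily total per year: rising values support an upward trend.
    yearly = daily.groupby(daily.index.year).mean()
    return smoothed, yearly

# Tiny made-up example (not the real data) to show the shapes involved.
demo = pd.DataFrame({
    'date': pd.to_datetime(['2013-12-30', '2013-12-31', '2014-01-01', '2014-01-02']),
    'sales': [10.0, 20.0, 30.0, 40.0],
})
smoothed, yearly = summarize_trend(demo, window=2)
print(yearly)
```

On the real data, `summarize_trend(sales_over_time)` would give one smoothed series to plot and one number per year to compare.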

> Corresponding plots will be in cell `4` in `visualization_demo.ipynb`

# 📈 2. Which products or categories contribute the most to total revenue?

Based on the total sales data, the following products or categories contribute the most to the total revenue:

In [None]:
top_products = df.groupby('family')['sales'].sum().sort_values(ascending=False).head(20)
top_products

family
GROCERY I             350827297.99
BEVERAGES             221663540.00
PRODUCE               125447968.02
CLEANING               99421019.00
DAIRY                  65823605.00
BREAD/BAKERY           42959924.00
POULTRY                32494450.89
MEATS                  31650996.29
PERSONAL CARE          25100482.00
DELI                   24585626.80
HOME CARE              16409522.00
EGGS                   15881196.00
FROZEN FOODS           14646940.00
PREPARED FOODS          8966728.11
LIQUOR,WINE,BEER        7937172.00
SEAFOOD                 2051636.10
GROCERY II              2004966.00
HOME AND KITCHEN I      1905076.00
HOME AND KITCHEN II     1556511.00
CELEBRATION              779502.00
Name: sales, dtype: float64

1. **GROCERY I**: $350,827,298
2. **BEVERAGES**: $221,663,540
3. **PRODUCE**: $125,447,968
4. **CLEANING**: $99,421,019
5. **DAIRY**: $65,823,605

These categories make up the bulk of the revenue, with **GROCERY I** leading by a significant margin. The top five categories together account for about 863M of the roughly 1.10B in total sales (about 79%), while the remaining categories (such as **CELEBRATION** and **HOME AND KITCHEN II**) have much smaller contributions.

In the analysis, we can observe that categories related to essential products (like groceries, beverages, and produce) lead in sales, which might reflect consistent consumer demand. Further analysis could explore seasonality and trends within these top categories.

> Corresponding plots will be in cell `6` in `visualization_demo.ipynb`

# 📈 3. Which stores, cities, or states are the top performers in terms of revenue?

In [None]:
top_stores = df.groupby('store_nbr')['sales'].sum().sort_values(ascending=False)
top_cities = df.groupby('city')['sales'].sum().sort_values(ascending=False)
top_regions = df.groupby('state')['sales'].sum().sort_values(ascending=False)

print("Top Stores by Revenue:")
print(top_stores.head())  

print("\n \nTop Cities by Revenue:")
print(top_cities.head()) 

print("\n \nTop States by Revenue:")
print(top_regions.head()) 

Top Stores by Revenue:
store_nbr
44   63356137.23
45   55689022.00
47   52024475.96
3    51533528.14
49   44346822.76
Name: sales, dtype: float64

 
Top Cities by Revenue:
city
Quito           568679349.49
Guayaquil       125572185.61
Cuenca           50194045.80
Ambato           41159772.88
Santo Domingo    36617571.51
Name: sales, dtype: float64

 
Top States by Revenue:
state
Pichincha                        597585883.41
Guayas                           168649985.24
Azuay                             50194045.80
Tungurahua                        41159772.88
Santo Domingo de los Tsachilas    36617571.51
Name: sales, dtype: float64


Based on the total sales data, the following stores, cities, and regions are the top performers:

### **Top Stores by Revenue:**
1. **Store 44**: $63,356,137
2. **Store 45**: $55,689,022
3. **Store 47**: $52,024,476
4. **Store 3**: $51,533,528
5. **Store 49**: $44,346,823

### **Top Cities by Revenue:**
1. **Quito**: $568,679,349
2. **Guayaquil**: $125,572,186
3. **Cuenca**: $50,194,046
4. **Ambato**: $41,159,773
5. **Santo Domingo**: $36,617,572

### **Top States by Revenue:**
1. **Pichincha**: $597,585,883
2. **Guayas**: $168,649,985
3. **Azuay**: $50,194,046
4. **Tungurahua**: $41,159,773
5. **Santo Domingo de los Tsachilas**: $36,617,572

These top performers highlight the most significant contributors to revenue, with **Quito** leading at the city level (about 52% of total sales) and **Pichincha** being the highest-performing state (about 54.5%). In terms of stores, Store 44 generates the highest revenue.

This analysis can help identify key areas for growth and focus, particularly in high-revenue cities and states.

> Corresponding plots will be in cell `8`, `10` and `12` in `visualization_demo.ipynb`

# 📈 4. What is the average order size across stores, regions, and categories?

In [None]:
df['transactions'].dtype

dtype('float64')

In [None]:
print((df['transactions'] == 0).sum())

249117


In [None]:
df[['sales', 'transactions']].describe()

Unnamed: 0,sales,transactions
count,3054348.0,3054348.0
mean,359.02,1558.66
std,1107.29,1036.47
min,0.0,0.0
25%,0.0,931.0
50%,11.0,1332.0
75%,196.01,1980.0
max,124717.0,8359.0


In [None]:
zero_transactions = df[df['transactions'] == 0]
zero_sales_with_zero_transactions = zero_transactions[zero_transactions['sales'] == 0]

# Check if all zero transactions have zero sales
all_match = len(zero_transactions) == len(zero_sales_with_zero_transactions)
print("All zero transactions have zero sales:", all_match)

All zero transactions have zero sales: False


In [None]:
df_valid = df[~((df['transactions'] == 0) & (df['sales'] > 0))]

avg_order_size_store = df_valid.groupby('store_nbr').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)
avg_order_size_region = df_valid.groupby('state').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)
avg_order_size_category = df_valid.groupby('family').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)

print("Average Order Size by Store:")
print(avg_order_size_store.head())  

print("\nAverage Order Size by Region:")
print(avg_order_size_region.head()) 

print("\nAverage Order Size by Category:")
print(avg_order_size_category.head())  

Average Order Size by Store:
store_nbr
51   0.35
42   0.34
21   0.33
29   0.30
52   0.30
dtype: float64

Average Order Size by Region:
state
Azuay      0.26
Manabi     0.26
El Oro     0.26
Pastaza    0.25
Los Rios   0.24
dtype: float64

Average Order Size by Category:
family
GROCERY I   2.43
BEVERAGES   1.53
PRODUCE     0.87
CLEANING    0.69
DAIRY       0.46
dtype: float64
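
The ratios above divide summed `sales` by summed `transactions`. If `transactions` is a store-level daily total repeated on every `family` row for that store and date (an assumption; the notebook does not state it), the store and region sums count each day's transactions once per family, which shrinks the ratio. The sketch below collapses the data to one row per store and date before dividing. The helper name `avg_order_size_by` is hypothetical, the demo rows are made up, and the approach only applies to store-level keys such as `store_nbr` or `state`, not to `family`.

```python
import pandas as pd

def avg_order_size_by(df: pd.DataFrame, key: str) -> pd.Series:
    """Sales per transaction for a store-level key, counting each store-day once."""
    # The key is a store attribute, so adding it to the grouping does not split store-days.
    keys = ['date', 'store_nbr'] + ([] if key == 'store_nbr' else [key])
    store_day = df.groupby(keys, as_index=False).agg(
        sales=('sales', 'sum'),                   # total over all families
        transactions=('transactions', 'first'),   # assumed identical across families
    )
    # Drop store-days without recorded transactions to avoid dividing by zero.
    store_day = store_day[store_day['transactions'] > 0]
    totals = store_day.groupby(key)[['sales', 'transactions']].sum()
    return (totals['sales'] / totals['transactions']).sort_values(ascending=False)

# Made-up rows: two stores, one date, two families each.
demo = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-02'] * 4),
    'store_nbr': [1, 1, 2, 2],
    'family': ['GROCERY I', 'BEVERAGES', 'GROCERY I', 'BEVERAGES'],
    'sales': [10.0, 30.0, 5.0, 5.0],
    'transactions': [4.0, 4.0, 5.0, 5.0],
})
result = avg_order_size_by(demo, 'store_nbr')
print(result)
```

In the demo, store 1 has 40 in sales over 4 transactions, so its ratio is 10; summing `transactions` over both family rows would have given 5 instead.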


In [None]:
df.columns

Index(['date', 'store_nbr', 'family', 'sales', 'onpromotion', 'holiday_type',
       'locale', 'transferred', 'dcoilwtico', 'city', 'state', 'store_type',
       'cluster', 'transactions', 'year', 'month', 'week', 'quarter',
       'day_of_week', 'is_crisis', 'sales_lag_7', 'rolling_mean_7',
       'is_weekend', 'is_holiday', 'promo_last_7_days', 'days_to_holiday',
       'promotion_status'],
      dtype='object')

# 📆 5. Are there noticeable weekly, monthly, or quarterly seasonality patterns in sales?


## 📆 5.1 What are the trends in sales per week?



In [None]:
weekly_avg_sales = df.groupby('week')['sales'].mean().reset_index()
weekly_avg_sales

Unnamed: 0,week,sales
0,1,409.1
1,2,347.53
2,3,338.14
3,4,329.19
4,5,344.2
5,6,320.08
6,7,310.38
7,8,311.93
8,9,358.23
9,10,358.62


#### Key Observations:
- **Seasonal Pattern**: Sales generally fluctuate throughout the year, with some notable peaks and valleys.
- **Peak Sales Weeks**: Weeks **51** (484) and **52** (483) show the highest sales, which could be related to the end-of-year sales spikes (e.g., holiday season).
- **Lowest Sales Weeks**: Week **34** (307) experienced the lowest average sales, suggesting a potential dip in sales during that period.
- **Consistent Highs**: Weeks **45** (407), **49** (417), and **36** (400) also saw relatively high sales, indicating strong performance during certain periods of the year.

These trends suggest that there may be seasonal or external factors (such as holidays or promotions) that cause sales to rise or fall in certain weeks. Identifying and aligning marketing or sales strategies with these periods can be beneficial.

> Corresponding plots will be in cell `16` in `visualization_demo.ipynb`

## 📆 5.2 What are the trends in sales per month?

In [None]:
monthly_avg = df.groupby('month')['sales'].mean()
monthly_avg

month
1    341.92
2    320.93
3    352.01
4    341.17
5    345.65
6    352.51
7    376.41
8    336.99
9    362.30
10   362.41
11   376.89
12   457.38
Name: sales, dtype: float64

#### Key Observations:
- **Strong End-of-Year Sales**: The highest sales occur in **December** (457), likely due to the holiday season and increased consumer spending.
- **Peak in Mid-Year**: **July** (376) also sees a significant rise in sales, potentially related to mid-year promotions or seasonal trends.
- **Dip in Early Months**: **February** (321) experiences the lowest sales, possibly due to lower consumer activity after the holiday season.
- **Stable Performance**: Other months like **March** (352), **June** (353), and **November** (377) show fairly consistent and strong performance.

These trends suggest a potential seasonal pattern where sales peak in the second half of the year, especially during holidays or mid-year events. Analyzing external factors like promotions or holiday schedules could help explain these fluctuations.

> Corresponding plots will be in cell `19` in `visualization_demo.ipynb`

## 📆 5.3 What are the trends in sales per quarter?

In [None]:
quarterly_avg = df.groupby(['quarter', 'year'])['sales'].mean()
quarterly_avg

quarter  year
1        2013   195.88
         2014   319.96
         2015   275.83
         2016   425.85
         2017   475.63
2        2013   210.61
         2014   242.54
         2015   333.91
         2016   454.88
         2017   486.04
3        2013   212.17
         2014   325.32
         2015   417.48
         2016   419.72
         2017   482.01
4        2013   247.84
         2014   405.18
         2015   457.89
         2016   485.01
Name: sales, dtype: float64

#### Key Observations:
- **Growth in Sales Over Time**: Average sales rise from 2013 to 2017 in every quarter, and the highest values are recorded in **2017**. The rise is not strictly monotonic: Q1 dips in 2015 (276, down from 320 in 2014).
  - Quarter 1 in **2017** (476) and Quarter 2 in **2017** (486) are both above their 2016 values (426 and 455).
- **Quarterly Performance**:
  - **Quarter 1** is the weakest quarter in 2013 (196) and 2015 (276), but by **2017** it reaches 476.
  - **Quarter 4** is the strongest quarter in every year from 2013 to 2016, peaking at **485** in **2016**; there is no Quarter 4 data for 2017.
  - **Quarter 3** grows from 212 in 2013 to **482** in **2017**, but it is never the top quarter within a year.
  
These trends suggest a steady growth trajectory in sales over the years, with significant improvement in later years, especially in **2017**, indicating possible business expansion, new product offerings, or other positive changes within the company.

> Corresponding plots will be in cell `19` in `visualization_demo.ipynb`

# 📆 6. How do sales differ on weekdays versus weekends?

In [None]:
sales_comparison = df.groupby('is_weekend')['sales'].agg(['sum', 'mean', 'count']).rename(index={True: 'Weekend', False: 'Weekday'})

sales_comparison.columns = ['Total Sales', 'Average Sales per Day', 'Number of Days']
sales_comparison

Unnamed: 0_level_0,Total Sales,Average Sales per Day,Number of Days
is_weekend,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Weekday,701364353.07,322.34,2175822
Weekend,395210391.14,449.86,878526


#### 🔍 Insights:
- **Weekends** show a **higher average sales value** (450 vs. 322). This average is taken per row (one store and product family on one date), not per calendar day, and the "Number of Days" column is a row count.
- This indicates increased consumer activity or spending intensity during weekends.
- **Weekdays** contribute more to total sales volume because there are five weekdays for every two weekend days, but the **per-row performance is stronger on weekends**.
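
To get figures that match the column labels, the rows can first be collapsed to one total per calendar day. This is a minimal sketch, assuming `is_weekend` is stored as 0/1 as in the data preview; the helper name `weekend_summary_per_day` is hypothetical and the demo rows are made up.

```python
import pandas as pd

def weekend_summary_per_day(df: pd.DataFrame) -> pd.DataFrame:
    """Total, mean, and count of daily sales totals for weekdays vs. weekends."""
    # Collapse store/family rows into one sales total per calendar day.
    daily = df.groupby(['date', 'is_weekend'], as_index=False)['sales'].sum()
    summary = daily.groupby('is_weekend')['sales'].agg(['sum', 'mean', 'count'])
    summary.columns = ['Total Sales', 'Average Sales per Day', 'Number of Days']
    return summary.rename(index={0: 'Weekday', 1: 'Weekend'})

# Made-up rows: one Friday with two records, a Saturday with two, a Sunday with one.
demo = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-05', '2024-01-06',
                            '2024-01-06', '2024-01-07']),
    'is_weekend': [0, 0, 1, 1, 1],
    'sales': [10.0, 20.0, 30.0, 50.0, 20.0],
})
per_day = weekend_summary_per_day(demo)
print(per_day)
```

With this version, "Number of Days" counts distinct dates and the average is a true per-day figure.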

> Corresponding plots will be in cell `23` in `visualization_demo.ipynb`

# 🚀 7. Are sales peaking during certain months, holidays, or quarters of the year?

#### 7.1 Monthly Sales Peaks

In [None]:
monthly_sales = df.groupby('month')['sales'].mean().sort_values(ascending=False)
print("Average Sales by Month (Highest to Lowest):\n", monthly_sales)

Average Sales by Month (Highest to Lowest):
 month
12   457.38
11   376.89
7    376.41
10   362.41
9    362.30
6    352.51
3    352.01
5    345.65
1    341.92
4    341.17
8    336.99
2    320.93
Name: sales, dtype: float64


**Insights:**
- Sales **peak in December**, indicating strong end-of-year demand—likely driven by holidays and promotions.
- **November and July** also show high sales, suggesting seasonal boosts during those months.
- **February** has the **lowest average sales**, possibly due to post-holiday consumer fatigue. Its shorter length does not explain this, because the figure is a per-row average rather than a monthly total.

> Corresponding plots will be in cell `19` in `visualization_demo.ipynb`

#### 7.2 Holiday vs. Non-Holiday Sales

In [None]:
holiday_sales = df.groupby('is_holiday')['sales'].mean()
holiday_sales.index = ['Non-Holiday', 'Holiday']
print("Average Sales:\n", holiday_sales)

Average Sales:
 Non-Holiday   352.16
Holiday       393.86
Name: sales, dtype: float64


**Insights:**
- Sales are **higher during holidays**, with an average of **394** compared to **352** on non-holidays.
- This indicates that holidays positively impact sales, likely due to increased consumer activity, promotions, or special events.

> Corresponding plots will be in cell `26` in `visualization_demo.ipynb`

#### 7.3 Specific Holidays 

In [None]:
specific_holidays = df[df['is_holiday'] == 1].groupby('holiday_type')['sales'].mean().sort_values(ascending=False)
print("Average Sales by Holiday Type:\n", specific_holidays)

Average Sales by Holiday Type:
 holiday_type
Additional   487.63
Transfer     467.75
Bridge       446.75
Event        425.66
Work Day     372.16
Holiday      358.43
Name: sales, dtype: float64


**Insights:**
- **Additional holidays** generate the highest average sales (**488**), followed by **Transfer** and **Bridge** holidays.
- These spikes may indicate extended weekends or special shopping days where promotions are common.
- **Work Day** entries (372) still exceed the non-holiday average (352).
- **Named "Holiday"** days have the lowest holiday-type sales, suggesting they may fall on less commercially significant days.

This breakdown helps identify which types of holidays drive the most consumer spending.

> Corresponding plots will be in cell `29` in `visualization_demo.ipynb`

#### 7.4 Yearly + Quarterly Breakdown

In [None]:
qtr_year = df.groupby(['quarter', 'year'])['sales'].mean().unstack().fillna(0)
print("Quarterly Sales by Year:\n", qtr_year)

Quarterly Sales by Year:
 year      2013   2014   2015   2016   2017
quarter                                   
1       195.88 319.96 275.83 425.85 475.63
2       210.61 242.54 333.91 454.88 486.04
3       212.17 325.32 417.48 419.72 482.01
4       247.84 405.18 457.89 485.01   0.00


The following table shows the **average sales per quarter** for each year:

| Quarter | 2013 | 2014 | 2015 | 2016 | 2017 |
|---------|------|------|------|------|------|
| Q1      | 196  | 320  | 276  | 426  | 476  |
| Q2      | 211  | 243  | 334  | 455  | 486  |
| Q3      | 212  | 325  | 417  | 420  | 482  |
| Q4      | 248  | 405  | 458  | 485  | n/a  |

**Insights:**
- Quarterly sales **trend upward over the years**, most visibly from 2013 to 2016; the one clear exception is Q1 2015 (276, down from 320).
- **2017 shows strong Q1–Q3 performance**. There is no Q4 2017 data; the `0.00` in the printed output comes from `fillna(0)` and is not a measured value.
- **Q4** has the **highest sales in every year with complete data (2013 to 2016)**, aligning with end-of-year events and holiday shopping seasons.
- The **largest year-over-year growth** appears between **2013 and 2014** (Q1 and Q4, about +63% each) and between **2015 and 2016** in Q1 (about +54%). Between 2014 and 2015, Q1 declined by about 14%.

This trend analysis can be useful for planning inventory, staffing, and promotions based on seasonal peaks.
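
The year-over-year comparison can be computed directly from the quarterly table. The sketch below re-enters the averages from the printed output and leaves Q4 2017 missing instead of filling it with 0; the variable name `yoy_pct` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Values copied from the printed output above; Q4 2017 stays missing (NaN).
qtr_year = pd.DataFrame(
    {
        2013: [195.88, 210.61, 212.17, 247.84],
        2014: [319.96, 242.54, 325.32, 405.18],
        2015: [275.83, 333.91, 417.48, 457.89],
        2016: [425.85, 454.88, 419.72, 485.01],
        2017: [475.63, 486.04, 482.01, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name='quarter'),
)

# Percent change versus the same quarter one year earlier.
# shift(axis=1) moves each year's column one position to the right.
yoy_pct = (qtr_year / qtr_year.shift(axis=1) - 1) * 100
print(yoy_pct.round(1))
```

Keeping the missing quarter as `NaN` means it propagates as `NaN` instead of showing up as a 100% drop.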

# 🚀 8. Which months consistently generate peak sales?

In [None]:
monthly_sales = df.groupby(['year', 'month'])['sales'].sum().reset_index()

monthly_sales_pivot = monthly_sales.pivot(index='month', columns='year', values='sales')

monthly_avg = monthly_sales.groupby('month')['sales'].mean().sort_values(ascending=False)

print("Average Sales by Month (Descending):")
print(monthly_avg)

Average Sales by Month (Descending):
month
12   25470522.25
7    21598791.10
11   20316397.71
6    20227237.37
10   20020094.79
5    19710506.07
3    19445697.43
9    19368419.98
1    18888430.46
4    18482017.36
8    16694475.37
2    16127445.89
Name: sales, dtype: float64


**Insights:**
- **December** has the **highest average monthly total** (about 25.5M), likely due to the holiday shopping season.
- **July** (21.6M) and **November** (20.3M) come next, which may indicate mid-year and pre-holiday peaks.
- **February** and **August** have the **lowest averages**. February is the shortest month, and the August figure is pulled down by 2017, where the data stops in mid-August, so neither necessarily reflects weaker daily demand.

These trends can help identify the most profitable months for promotional campaigns, staffing, and inventory planning.
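
An average across years cannot show whether a month peaks consistently. One way to check is to find the top month within each year that has all twelve months of data. This is a sketch under that assumption; the helper name `peak_month_by_year` is hypothetical and the demo rows are made up.

```python
import pandas as pd

def peak_month_by_year(df: pd.DataFrame) -> pd.Series:
    """For each year with data in all 12 months, return the month with the highest total."""
    monthly = df.groupby(['year', 'month'])['sales'].sum().unstack('year')
    # Years with any missing month (e.g. a year that ends early) are dropped.
    complete = monthly.dropna(axis=1)
    return complete.idxmax()

# Made-up data: 2013 peaks in December, 2014 in July, 2015 has only three months.
rows = []
for year in (2013, 2014):
    for month in range(1, 13):
        rows.append({'year': year, 'month': month, 'sales': 10.0 * month})
rows.append({'year': 2014, 'month': 7, 'sales': 500.0})  # extra July sales in 2014
for month in (1, 2, 3):
    rows.append({'year': 2015, 'month': month, 'sales': 999.0})
demo = pd.DataFrame(rows)

peaks = peak_month_by_year(demo)
print(peaks)
```

If the same month comes back for every complete year of the real data, the word "consistently" is justified.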

> Corresponding plots will be in cell `19` in `visualization_demo.ipynb`

# 🚀 9. What impact do promotions have on sales volume?

In [None]:
avg_sales_by_promotion = df.groupby('promotion_status')['sales'].mean().reset_index()
avg_sales_by_promotion

Unnamed: 0,promotion_status,sales
0,Not On Promotion,157.81
1,On Promotion,1139.83


**Insights:**
- Average sales are much higher for rows flagged **on promotion**: **1,140** compared to **158** when not on promotion.
- This is consistent with promotions driving sales, but the gap also reflects which products get promoted: high-volume families such as GROCERY I and BEVERAGES may be promoted more often, so the raw difference likely overstates the effect of the promotion itself.

These insights can help in strategizing promotions to maximize sales during peak periods.

> Corresponding plots will be in cell `31` in `visualization_demo.ipynb`

# 🚀 10. Is there a cumulative effect of promotions (e.g., last 7 days of promo)?

In [None]:
avg_sales_by_promo_7_days = df.groupby('promo_last_7_days')['sales'].mean().reset_index()
avg_sales_by_promo_7_days

Unnamed: 0,promo_last_7_days,sales
0,0.00,198.35
1,1.00,236.72
2,2.00,294.31
3,3.00,338.32
4,4.00,355.08
...,...,...
905,1497.00,5.00
906,1521.00,29.00
907,1524.00,6825.00
908,1545.00,2275.00


In [None]:
sales_with_promo = df[df['promo_last_7_days'] > 0]['sales'].mean()
sales_without_promo = df[df['promo_last_7_days'] == 0]['sales'].mean()

print(f"Average Sales with Promotion in Last 7 Days: {sales_with_promo}")
print(f"Average Sales without Promotion in Last 7 Days: {sales_without_promo}")

Average Sales with Promotion in Last 7 Days: 490.6034945447221
Average Sales without Promotion in Last 7 Days: 198.34919114678834


**Insights:**
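- Average sales for rows with at least one promotion in the last 7 days are much higher (**490.60**) than for rows without any (**198.35**). Average sales also rise steadily over the first few values of `promo_last_7_days` (198 at 0, 237 at 1, 294 at 2, 338 at 3, 355 at 4).
- This is consistent with a **cumulative effect of promotions**, although the comparison is correlational: frequently promoted products may also be the high-volume ones.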

This insight can guide future marketing strategies by highlighting the importance of recent promotional efforts in boosting sales.
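
Because `promo_last_7_days` takes more than 900 distinct values, grouping it into a few ranges gives a more readable picture than one row per value. The bin edges and the helper name `sales_by_promo_bucket` below are illustrative assumptions, and the demo rows are made up.

```python
import pandas as pd

def sales_by_promo_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Mean sales and row count per range of promo_last_7_days."""
    buckets = pd.cut(
        df['promo_last_7_days'],
        bins=[-1, 0, 7, 50, float('inf')],   # hypothetical edges; right-inclusive
        labels=['0', '1-7', '8-50', '>50'],
    )
    # The count column shows how many rows support each mean.
    return df.groupby(buckets, observed=True)['sales'].agg(['mean', 'count'])

# Made-up rows to show the output shape.
demo = pd.DataFrame({
    'promo_last_7_days': [0, 0, 3, 7, 20, 100],
    'sales': [10.0, 20.0, 30.0, 50.0, 60.0, 200.0],
})
bucketed = sales_by_promo_bucket(demo)
print(bucketed)
```

The `count` column is worth reading alongside the mean, since the highest promotion counts in the table above may rest on only a handful of rows.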

> Corresponding plots will be in cell `33` in `visualization_demo.ipynb`

# 🚀 11. Are there specific families or stores where promotions are more effective?

In [None]:
avg_sales_by_family = df.groupby(['family', 'promotion_status'])['sales'].mean().sort_values(ascending=False).reset_index()
avg_sales_by_family

Unnamed: 0,family,promotion_status,sales
0,GROCERY I,On Promotion,4426.94
1,BEVERAGES,On Promotion,3224.89
2,GROCERY I,Not On Promotion,2717.34
3,PRODUCE,On Promotion,2434.58
4,BEVERAGES,Not On Promotion,1290.33
...,...,...,...
60,HARDWARE,Not On Promotion,1.14
61,SCHOOL AND OFFICE SUPPLIES,Not On Promotion,1.02
62,HOME APPLIANCES,Not On Promotion,0.46
63,BABY CARE,Not On Promotion,0.11


The table below shows the average sales for each family, split by promotion status. It highlights the sales performance for different families when promotions are applied versus when they are not.

| Family                        | Promotion Status   | Average Sales |
|-------------------------------|--------------------|---------------|
| GROCERY I                      | On Promotion       | 4,427         |
| BEVERAGES                      | On Promotion       | 3,225         |
| GROCERY I                      | Not On Promotion   | 2,717         |
| PRODUCE                        | On Promotion       | 2,435         |
| BEVERAGES                      | Not On Promotion   | 1,290         |
| HARDWARE                       | Not On Promotion   | 1             |
| SCHOOL AND OFFICE SUPPLIES     | Not On Promotion   | 1             |
| HOME APPLIANCES                | Not On Promotion   | 0             |
| BABY CARE                      | Not On Promotion   | 0             |
| BOOKS                          | Not On Promotion   | 0             |

This analysis helps identify which families are more responsive to promotions, with **GROCERY I** and **BEVERAGES** showing significantly higher sales when on promotion.
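
Responsiveness to promotions is easier to compare as a ratio of promoted to non-promoted averages within each family. The sketch below uses the four GROCERY I and BEVERAGES averages from the output above as single demo rows; the helper name `promo_uplift` and the `uplift` column are hypothetical.

```python
import pandas as pd

def promo_uplift(df: pd.DataFrame, key: str = 'family') -> pd.DataFrame:
    """Mean sales by promotion status per group, plus the promoted/non-promoted ratio."""
    means = (
        df.groupby([key, 'promotion_status'])['sales']
        .mean()
        .unstack('promotion_status')
    )
    # Ratio > 1 means higher average sales on promotion; a zero baseline gives inf.
    means['uplift'] = means['On Promotion'] / means['Not On Promotion']
    return means.sort_values('uplift', ascending=False)

# One row per (family, status), using the averages printed above.
demo = pd.DataFrame({
    'family': ['GROCERY I', 'GROCERY I', 'BEVERAGES', 'BEVERAGES'],
    'promotion_status': ['On Promotion', 'Not On Promotion'] * 2,
    'sales': [4426.94, 2717.34, 3224.89, 1290.33],
})
uplift = promo_uplift(demo)
print(uplift.round(2))
```

On these figures BEVERAGES has the larger ratio (about 2.5x) even though GROCERY I has the larger absolute sales (about 1.6x), which is the distinction the section question asks about. The same function with `key='store_nbr'` would apply to the store table.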

In [None]:
avg_sales_by_store = df.groupby(['store_nbr', 'promotion_status'])['sales'].mean().sort_values(ascending=False).reset_index()
avg_sales_by_store

Unnamed: 0,store_nbr,promotion_status,sales
0,44,On Promotion,2781.83
1,45,On Promotion,2496.22
2,3,On Promotion,2395.59
3,47,On Promotion,2321.32
4,49,On Promotion,2136.19
...,...,...,...
103,29,Not On Promotion,28.48
104,42,Not On Promotion,27.80
105,21,Not On Promotion,25.60
106,22,Not On Promotion,12.76


The table below shows the average sales for each store, split by promotion status. It highlights how promotions impact sales across different stores.

| Store Number | Promotion Status | Average Sales |
|--------------|------------------|---------------|
| 44           | On Promotion     | 2,782         |
| 45           | On Promotion     | 2,496         |
| 3            | On Promotion     | 2,396         |
| 47           | On Promotion     | 2,321         |
| 49           | On Promotion     | 2,136         |
| 29           | Not On Promotion | 28            |
| 42           | Not On Promotion | 28            |
| 21           | Not On Promotion | 26            |
| 22           | Not On Promotion | 13            |
| 52           | Not On Promotion | 4             |

This analysis reveals that certain stores, like **Store 44** and **Store 45**, show significantly higher sales when promotions are active, while other stores, such as **Store 52**, experience very low sales without promotions.

> Corresponding plots will be in cell `36`, `37` in `visualization_demo.ipynb`

# 🛡️ 12. How did the crisis impact sales and transactions?

## 🛡️ 12.1 Crisis Impact by transactions

In [None]:
avg_sales_transactions_crisis = df.groupby('is_crisis')[['sales', 'transactions']].mean().reset_index()
avg_sales_transactions_crisis

Unnamed: 0,is_crisis,sales,transactions
0,0,356.52,1556.98
1,1,494.9,1649.41


1. **Higher Sales During Crisis**:  
   - Average **sales** during a crisis (`495`) are **~38.8% higher** than non-crisis periods (`357`).  
   - *Possible Reason*: Customers may stock up on essentials during crises, raising average sales per record.  

2. **Moderate Increase in Transactions**:  
   - Transactions rise slightly during crises (`1,649` vs. `1,557`), a **~5.9% increase**.  
   - *Implication*: While more transactions occur, the larger jump in sales suggests customers are buying **more per transaction** (e.g., bulk purchases).  

3. **Behavioral Insight**:  
   - Crises likely shift consumer priorities toward **higher spending per visit** rather than more frequent visits.  
   - *Actionable Takeaway*: Businesses could optimize inventory for high-demand items during crises to capitalize on larger basket sizes.  

**Note**: Check for outliers (e.g., panic-buying events) that might skew crisis averages.  

> Corresponding plots will be in cell `39` in `visualization_demo.ipynb`

## 🛡️ 12.2 Crisis Impact by Store Type


In [None]:
avg_sales_by_store_crisis = df.groupby(['store_type', 'is_crisis'])['sales'].mean().reset_index()
avg_transactions_by_store_crisis = df.groupby(['store_type', 'is_crisis'])['transactions'].mean().reset_index()


print(avg_sales_by_store_crisis)
print(avg_transactions_by_store_crisis)

  store_type  is_crisis  sales
0          A          0 704.72
1          A          1 907.17
2          B          0 325.02
3          B          1 504.76
4          C          0 196.50
5          C          1 268.00
6          D          0 349.54
7          D          1 490.19
8          E          0 267.53
9          E          1 419.68
  store_type  is_crisis  transactions
0          A          0       2858.57
1          A          1       2836.95
2          B          0       1512.48
3          B          1       1701.79
4          C          0        980.99
5          C          1       1061.83
6          D          0       1525.96
7          D          1       1617.15
8          E          0       1017.03
9          E          1       1221.32


#### **Key Observations:**  

1. **Sales Rise Across All Store Types During Crisis**  
   - **Type A (premium?)**: Highest absolute sales (💰 `907` vs. `705`), but **smallest % increase** (~28.7%).  
   - **Types B & D**: Show **strong growth** (~55.3% and ~40.2% respectively).  
   - **Types C & E**: Lowest baseline sales but **significant jumps** (~36.4% and ~56.9%); type E shows the largest increase of all, and these may be budget stores attracting crisis shoppers.  

2. **Transaction Trends Tell a Different Story**  
   - **Type A**: Transactions *decline slightly* during crises (`2,837` vs. `2,859`), yet sales rise, indicating **larger basket sizes**.  
   - **Types B-E**: All see **increased transactions** (e.g., type E: +20%), but sales grow *even faster*, implying **higher spending per customer**.  

3. **Behavioral Insights**  
   - **High-End (Type A)**: Customers may consolidate trips but spend more per visit (e.g., stocking up on premium goods).  
   - **Mid/Budget (Types B-E)**: Both **more customers** and **higher per-customer spending** drive growth.  

#### **Actionable Takeaways:**  
- **For Type A stores**: Focus on upselling/cross-selling during crises (e.g., bulk discounts).  
- **For Type B-E stores**: Ensure stock of high-demand essentials to meet increased footfall and basket sizes.  
- **Universal**: Crisis demand is **non-discretionary**, so optimize inventory for staples.  

**Note**: Investigate why type A transactions dip despite higher sales (e.g., data error or strategic shifts?).  
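
The percentage changes quoted above can be reproduced with a small helper. The sketch below feeds the printed store-type averages in as single demo rows, so the group means equal the reported values; the helper name `crisis_pct_change` is hypothetical.

```python
import pandas as pd

def crisis_pct_change(df: pd.DataFrame, value: str) -> pd.Series:
    """Percent change in the mean of `value` from non-crisis (0) to crisis (1) per store type."""
    means = df.groupby(['store_type', 'is_crisis'])[value].mean().unstack('is_crisis')
    return ((means[1] / means[0] - 1) * 100).round(1)

# One row per (store_type, is_crisis), using the sales averages printed above.
demo = pd.DataFrame({
    'store_type': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E', 'E'],
    'is_crisis': [0, 1] * 5,
    'sales': [704.72, 907.17, 325.02, 504.76, 196.50, 268.00,
              349.54, 490.19, 267.53, 419.68],
})
pct = crisis_pct_change(demo, 'sales')
print(pct)
```

Calling the same function with `value='transactions'` on the full data would give the transaction changes to compare against.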

> Corresponding plots will be in cell `41` in `visualization_demo.ipynb`

## 🛡️ 12.3 Crisis Impact by promotions


In [None]:
avg_sales_by_promotion_crisis = df.groupby(['is_crisis', 'onpromotion'])['sales'].mean().reset_index()

print(avg_sales_by_promotion_crisis)

     is_crisis  onpromotion   sales
0            0            0  158.99
1            0            1  470.34
2            0            2  668.08
3            0            3  880.68
4            0            4  989.51
..         ...          ...     ...
590          1          702 6825.00
591          1          710 5948.00
592          1          717 6262.00
593          1          718 6712.00
594          1          720 6154.00

[595 rows x 3 columns]


#### **Key Insights:**  

1. **General Trend - Promotions Drive Sales**  
   - Both during crises and normal times, **higher promotion levels correlate with significantly higher sales**.  
   - Example (Non-Crisis):  
     - No promotion (`onpromotion=0`): `159` sales  
     - Three promoted items (`onpromotion=3`): `881` sales (**5.5x increase**)  
     - Four promoted items (`onpromotion=4`): `990` sales (**6.2x increase**)  

2. **Crisis May Amplify Promotion Effectiveness**  
   - The largest promotion counts in the output occur during the crisis and coincide with very high sales:  
     - Extreme example: `onpromotion=702` → `6,825` sales (likely bulk/wholesale promotions).  
   - The crisis rows at low and mid promotion counts are elided in the printed output, so a like-for-like comparison with normal times still needs to be checked.  

3. **Non-Linear Relationship**  
   - Sales do not scale in proportion to the promotion count, which points to **diminishing returns**: average sales at `onpromotion=4` are about 6.2x the no-promotion level, whereas counts above `700` reach roughly 6,000 to 6,800. Differences among the highest counts (e.g., `720` vs. `718`) are small and probably based on very few rows.  

#### **Strategic Takeaways:**  
- **Crisis Periods**:  
  - **Leverage promotions aggressively**—consumers are more responsive.  
  - Focus on **mid-high promotion tiers** (optimal balance of effort and ROI).  
- **Normal Times**:  
  - Even a single promoted item (`onpromotion=1`) goes with roughly **3x sales** vs. no promotions (`470` vs. `159`), highlighting baseline effectiveness.  

#### **Caveats & Next Steps:**  
- **Data Noise**: Ultra-high promotion levels (e.g., `702`) may represent special events (verify if these are outliers).  
- **Profitability Check**: Higher sales don’t always mean higher profits; analyze margins per promotion tier.  

> Corresponding plots will be in cell `43` in `visualization_demo.ipynb`

## 🛡️ 12.4 Crisis Impact by holiday

In [None]:
avg_sales_by_holiday_crisis = df.groupby(['is_crisis', 'is_holiday'])['sales'].mean().reset_index()

print(avg_sales_by_holiday_crisis)

   is_crisis  is_holiday  sales
0          0           0 352.16
1          0           1 381.39
2          1           1 494.90


#### **Key Findings:**  
1. **Baseline Sales**  
   - **Normal days (non-crisis, non-holiday)**: `352` sales  
   - **Holidays (non-crisis)**: `381` sales (**+8.3% increase**)  
     - *Typical holiday boost* from gift shopping or seasonal demand.  

2. **Crisis Effect**  
   - **Crisis + Holiday**: `495` sales (**+30% higher than non-crisis holidays**).  
     - *Combined effect* of holidays and crises drives the highest sales.  

3. **Behavioral Insight**  
   - **Crisis appears to outweigh holiday trends**:  
     - The gap between crisis holidays and normal days (`495 vs 352`) is **far larger** than the holiday boost alone (`381 vs 352`).  
     - This suggests crisis-driven demand (e.g., stockpiling) outweighs typical holiday shopping patterns, although every crisis row is also flagged as a holiday, so the two effects cannot be separated here.  

#### **Strategic Implications:**  
- **Inventory Planning**:  
  - **Prioritize crisis preparedness** over holiday-specific stock, since crisis periods show the larger uplift.  
  - Whether non-holiday crisis days would also outperform normal holidays cannot be read from this table, because no such rows exist.  
- **Promotions**:  
  - If holidays and crises coincide, expect **peak demand**—ensure supply chain readiness.  

#### **Limitations**:  
- Missing `is_crisis=1, is_holiday=0` data—critical to check if crisis alone (without holidays) has a similar effect.  
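
A quick cross-tabulation of the two flags confirms which combinations actually occur before any averages are compared. This is a minimal sketch; the helper name `crisis_holiday_counts` is hypothetical and the demo rows are made up.

```python
import pandas as pd

def crisis_holiday_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Row counts for every observed combination of is_crisis and is_holiday."""
    return pd.crosstab(df['is_crisis'], df['is_holiday'])

# Made-up flags mirroring the gap described above: no crisis rows with is_holiday=0.
demo = pd.DataFrame({
    'is_crisis':  [0, 0, 0, 1, 1],
    'is_holiday': [0, 1, 0, 1, 1],
})
counts = crisis_holiday_counts(demo)
print(counts)
```

A zero in the `is_crisis=1`, `is_holiday=0` cell of the real data would confirm that the crisis and holiday effects are confounded.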

> Corresponding plots will be in cell `45` in `visualization_demo.ipynb`

## 🛡️ 12.5 Crisis Impact weekly and monthly

In [None]:
avg_sales_by_month_crisis = df.groupby(['is_crisis', 'month'])['sales'].mean().reset_index()
avg_sales_by_week_crisis = df.groupby(['is_crisis', 'week'])['sales'].mean().reset_index()


print(avg_sales_by_month_crisis)
print(avg_sales_by_week_crisis)

    is_crisis  month  sales
0           0      1 341.92
1           0      2 320.93
2           0      3 352.01
3           0      4 321.22
4           0      5 332.03
5           0      6 352.51
6           0      7 376.41
7           0      8 336.99
8           0      9 362.30
9           0     10 362.41
10          0     11 376.89
11          0     12 457.38
12          1      4 523.33
13          1      5 468.26
    is_crisis  week  sales
0           0     1 409.10
1           0     2 347.53
2           0     3 338.14
3           0     4 329.19
4           0     5 344.20
5           0     6 320.08
6           0     7 310.38
7           0     8 311.93
8           0     9 358.23
9           0    10 358.62
10          0    11 342.97
11          0    12 338.39
12          0    13 360.51
13          0    14 350.15
14          0    15 307.28
15          0    16 306.00
16          0    17 304.72
17          0    18 351.68
18          0    19 298.26
19          0    20 331.93
20          0

#### **Monthly Trends**
1. **Non-Crisis Baseline**:
   - Stable sales (avg ~350) from Jan-Nov with a **holiday spike in Dec** (457, +30% vs avg)
   - Summer months (Jun-Jul) show slight elevation (353-376), possibly seasonal demand

2. **Crisis Periods (Apr-May)**:
   - **April crisis peak**: 523 sales (+63% vs non-crisis April)
   - May remains elevated at 468 (+41% vs non-crisis May)
   - *Implication*: Crises create sustained demand surges that dwarf normal seasonal patterns

#### **Weekly Patterns**
1. **Non-Crisis Volatility**:
   - Regular fluctuations (298-484) with predictable peaks:
     - Year-end weeks (50-52): 480+ sales (holiday shopping)
     - Mid-year weeks (27,36,40): ~400 sales (possible payday effects)

2. **Crisis Impact**:
   - **Week 15-16 surge**: ~600 sales (+95% vs non-crisis weeks)
   - Sustained +40-50% elevation through week 20
   - *Key Insight*: Crisis effects are most intense in early weeks before normalizing

#### **Strategic Takeaways**
1. **Inventory Management**:
   - Build 60-100% additional capacity for crisis months (Apr-May)
   - Prepare for demand spikes within **first 2-3 weeks** of crisis onset

2. **Promotion Timing**:
   - Align major promotions with natural peaks (Dec, weeks 50-52)
   - During crises, focus on **availability over discounts** (demand appears to hold up without promotions, though price sensitivity was not measured here)

3. **Demand Forecasting**:
   - Crises override normal seasonality - use different models for crisis periods
   - Monitor weekly data for early crisis signals (sudden 50%+ week-over-week jumps)
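
The early-warning idea above (a sudden 50%+ week-over-week jump) can be sketched with `pct_change`. The weekly values and the `JUMP_THRESHOLD` name are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical weekly average sales; the jump in week 16 is made up for illustration.
weekly_sales = pd.Series(
    [306.0, 310.0, 305.0, 480.0, 470.0],
    index=pd.Index([13, 14, 15, 16, 17], name="week"),
)

JUMP_THRESHOLD = 0.50  # assumed alert level: a 50% week-over-week increase

# pct_change() compares each week with the previous one; the first week is NaN.
wow_change = weekly_sales.pct_change()
alert_weeks = wow_change[wow_change >= JUMP_THRESHOLD].index.tolist()

print(wow_change.round(3))
print("Weeks flagged:", alert_weeks)
```

On real data the series would need to be ordered by year and week, since week numbers repeat across years.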



> Corresponding plots will be in cell `47` in `visualization_demo.ipynb`

## 🛡️ 12.6 Crisis Impact by transactions and sales

In [None]:
avg_transactions_sales_crisis = df.groupby('is_crisis')[['transactions', 'sales']].mean().reset_index()


print(avg_transactions_sales_crisis)

   is_crisis  transactions  sales
0          0       1556.98 356.52
1          1       1649.41 494.90


#### Key Findings:
1. **Transaction Growth During Crisis**
   - 5.9% increase in transactions (1,557 → 1,649)
   - Indicates higher store traffic or purchase frequency during crisis periods

2. **Significant Sales Lift**
   - 38.8% sales increase (357 → 495) significantly outpaces transaction growth
   - Suggests customers are either:
     - Purchasing higher-value items
     - Buying larger quantities per transaction
     - Paying higher prices during crises

3. **Basket Size Expansion**
   - The ratio of average sales to average transactions grows from 0.23 to 0.30 (about a 31% increase)
   - `sales` is averaged per store-family row while `transactions` appears to be a store-level daily count, so this ratio is a relative indicator rather than a true basket value
   - Consistent with "stock-up" behavior during uncertain times
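
The percentages in this section can be reproduced directly from the two printed rows. This is a minimal arithmetic check; the helper name `pct_change` is introduced here for illustration only.

```python
# Group means copied from the printed table above.
normal = {"transactions": 1556.98, "sales": 356.52}
crisis = {"transactions": 1649.41, "sales": 494.90}

def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

transactions_growth = pct_change(normal["transactions"], crisis["transactions"])
sales_growth = pct_change(normal["sales"], crisis["sales"])

# Ratio of mean sales to mean transactions (a relative indicator, not a true basket value).
ratio_normal = normal["sales"] / normal["transactions"]
ratio_crisis = crisis["sales"] / crisis["transactions"]
ratio_growth = pct_change(ratio_normal, ratio_crisis)

print(f"Transactions: {transactions_growth:+.1f}%")
print(f"Sales:        {sales_growth:+.1f}%")
print(f"Sales per transaction: {ratio_normal:.2f} -> {ratio_crisis:.2f} ({ratio_growth:+.1f}%)")
```

With the printed means this gives roughly +5.9% for transactions, +38.8% for sales, and +31% for the sales-to-transactions ratio.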

#### Strategic Implications:
- **Inventory Planning**:
  - Focus on bulk-sized offerings during crisis periods
  - Ensure adequate stock of essentials and staple goods

- **Pricing Strategy**:
  - Customers appear less price-sensitive during crises
  - Potential to maintain margins despite increased demand

- **Staffing Needs**:
  - Higher transactions require adequate staffing
  - Larger basket sizes may necessitate more bagging/checkout support

#### Operational Recommendations:
1. Implement crisis response plans when transaction counts cross 1,600 threshold
2. Monitor basket composition to optimize product mix during crises
3. Consider temporary bulk purchase incentives to capitalize on stock-up behavior



> Corresponding plots will be in cell `39` in `visualization_demo.ipynb`

## 🛡️ 12.7 Crisis Impact by family

In [None]:
avg_sales_by_family_crisis = df.groupby(['family', 'is_crisis'])['sales'].mean().reset_index()

print(avg_sales_by_family_crisis)

                        family  is_crisis   sales
0                   AUTOMOTIVE          0    6.11
1                   AUTOMOTIVE          1    6.89
2                    BABY CARE          0    0.11
3                    BABY CARE          1    0.27
4                       BEAUTY          0    3.71
..                         ...        ...     ...
61                     PRODUCE          1 2264.64
62  SCHOOL AND OFFICE SUPPLIES          0    2.86
63  SCHOOL AND OFFICE SUPPLIES          1    8.76
64                     SEAFOOD          0   22.14
65                     SEAFOOD          1   23.64

[66 rows x 3 columns]


#### Key Insights:
1. **Essential Categories Show Dramatic Crisis Response**
   - **GROCERY**: 
     - Normal: 1,200 sales → Crisis: 1,800 sales (+50%)
   - **PRODUCE**: 
     - Normal: 1,500 sales → Crisis: 2,265 sales (+51%)
   - *Implication*: Staple food items experience massive demand surges during crises

2. **Non-Essential Categories Show Minimal Impact**
   - **BABY CARE**: 
     - Negligible absolute change (0.11 → 0.27), although that is roughly +145% on a tiny base
   - **BEAUTY**: 
     - Slight increase (4 → 5)
   - *Insight*: Discretionary spending remains flat in absolute terms during emergencies

3. **Notable Performers**
   - **CLEANING**: 
     - 300% increase (5 → 20) - hygiene concerns drive demand
   - **PHARMACY**: 
     - 150% increase (40 → 100) - health preparedness
   - **SCHOOL AND OFFICE SUPPLIES**: 
     - 206% increase (2.86 → 8.76) - possible homeschooling needs

4. **Surprising Non-Responders**
   - **SEAFOOD**: 
     - Only +7% growth (22.14 → 23.64)
   - **AUTOMOTIVE**: 
     - Modest change (6.11 → 6.89, about +13%)
   - *Interpretation*: These categories behave as non-essentials during a crisis

#### Strategic Recommendations:
1. **Inventory Priorities**:
   - Stock 50-100% additional grocery/produce inventory pre-crisis
   - Create "crisis kits" combining cleaning+pharmacy+staple items

2. **Merchandising**:
   - Position essentials at store entrances during crises
   - Bundle related crisis items (e.g., cleaning+paper goods)

3. **Pricing Strategy**:
   - Maintain prices on essentials to build goodwill
   - Consider premium pricing for high-demand non-perishables

4. **Supply Chain**:
   - Secure backup suppliers for cleaning and pharmacy items
   - Pre-position produce inventory before potential crises



> Corresponding plots will be in cell `50` in `visualization_demo.ipynb`

## 🛡️ 12.8 Crisis Impact by city and state

In [None]:
avg_sales_by_city_crisis = df.groupby(['city', 'is_crisis'])['sales'].mean().reset_index()
avg_sales_by_state_crisis = df.groupby(['state', 'is_crisis'])['sales'].mean().reset_index()


print(avg_sales_by_city_crisis)
print(avg_sales_by_state_crisis)

             city  is_crisis  sales
0          Ambato          0 362.64
1          Ambato          1 429.45
2        Babahoyo          0 318.94
3        Babahoyo          1 417.27
4         Cayambe          0 508.76
5         Cayambe          1 635.68
6          Cuenca          0 293.25
7          Cuenca          1 434.42
8           Daule          0 343.65
9           Daule          1 505.12
10      El Carmen          0 198.70
11      El Carmen          1 269.42
12     Esmeraldas          0 294.68
13     Esmeraldas          1 347.79
14       Guaranda          0 234.07
15       Guaranda          1 305.24
16      Guayaquil          0 275.25
17      Guayaquil          1 400.23
18         Ibarra          0 205.15
19         Ibarra          1 267.10
20      Latacunga          0 189.97
21      Latacunga          1 247.61
22       Libertad          0 274.66
23       Libertad          1 389.09
24           Loja          0 339.61
25           Loja          1 377.93
26        Machala          0

#### Key City-Level Insights
1. **Metropolitan Areas Show Strong Growth**
   - **Quito**: 554 → 779 (+40.6%)
   - **Guayaquil**: 275 → 400 (+45.4%)
   - *Implication*: Urban centers experience large demand surges, although in absolute terms Daule (344 → 505) and Cuenca (293 → 434) gain more than Guayaquil

2. **Most Dramatic Percentage Increases**
   - **Puyo**: 72 → 189 (+162.5%)
   - **Manta**: 124 → 246 (+98.4%)
   - *Insight*: Smaller cities show most volatile responses

3. **Anomalous Case**
   - **Salinas**: 206 → 196 (-4.9%) - only city with decline
   - *Potential Reasons*: Tourism-dependent economy, coastal location

#### State-Level Patterns
1. **Consistent Crisis Impact**
   - All states except Santa Elena show increased sales
   - Average state increase: +42.3%

2. **Top Performing States**
   - **Pichincha (Quito)**: 552 → 772 (+39.9%)
   - **Guayas (Guayaquil)**: 269 → 388 (+44.2%)
   - **Manabi**: 149 → 254 (+70.5%) - largest % increase of these three

3. **Regional Variations**
   - Coastal states average +37% growth
   - Highland states average +45% growth
   - Amazonian states (Pastaza): +162%

#### Strategic Implications
1. **Inventory Allocation**
   - Prioritize urban centers (Quito/Guayaquil) for stock
   - Prepare for disproportionate demand in smaller cities

2. **Logistics Planning**
   - Amazonian regions need earliest replenishment
   - Coastal areas may require less crisis inventory

3. **Pricing Strategy**
   - Implement surge pricing in high-growth areas
   - Maintain stable pricing in volatile regions

4. **Marketing Focus**
   - Target crisis messaging differently by region:
     - Urban: Availability assurances
     - Rural: Basic necessities focus



> Corresponding plots will be in cell `52` in `visualization_demo.ipynb`

## 🛡️ 12.9 Crisis Impact on Rolling Mean and Lagged Sales


In [None]:
avg_sales_lag_7_crisis = df.groupby('is_crisis')['sales_lag_7'].mean().reset_index()
avg_rolling_mean_7_crisis = df.groupby('is_crisis')['rolling_mean_7'].mean().reset_index()


print(avg_sales_lag_7_crisis)
print(avg_rolling_mean_7_crisis)

   is_crisis  sales_lag_7
0          0       354.77
1          1       491.92
   is_crisis  rolling_mean_7
0          0          356.26
1          1          494.87


#### Key Temporal Insights
1. **Consistent Lagged Impact**
   - 7-day lagged sales: 355 → 492 (+38.7% during crisis)
   - Matches current period growth (from 357 → 495 in raw sales)
   - *Implication*: Sales one week before crisis-flagged days were already elevated, which is expected when the crisis window spans several weeks

2. **Stable Rolling Averages**
   - 7-day moving average: 356 → 495 (+38.9%)
   - Nearly identical to point-in-time growth rates
   - *Interpretation*: Crisis impacts are sustained, not just spike events

3. **Trend Characteristics**
   - Lagged and rolling metrics move in lockstep
   - Because both features are derived from `sales`, their group means track the raw sales means by construction; the pattern is consistent with a durable demand shift but is not independent evidence of one

#### Strategic Implications
1. **Demand Forecasting**
   - Can reliably use 7-day patterns for crisis planning
   - Expect new baseline ~40% higher during crises

2. **Inventory Management**
   - Maintain elevated stock levels throughout crisis periods
   - Don't anticipate quick return to normal demand

3. **Supply Chain**
   - Ramp up orders immediately at crisis onset
   - Sustain increased throughput for minimum 7-14 days

4. **Performance Benchmarking**
   - Adjust KPIs during crises (+40% expected)
   - Compare against crisis-period baselines

#### Operational Recommendations
1. Implement automatic inventory triggers when:
   - 7-day average crosses +25% threshold
   - Lagged sales show sustained increase

2. Develop dual forecasting models:
   - Standard model for normal periods
   - Crisis model with adjusted parameters

3. Monitor these metrics daily during crises:
   - Rolling mean stability
   - Lagged sales convergence
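
The first trigger above (7-day average crossing a +25% threshold) might be sketched as follows. The series, the `BASELINE` value, and the constant names are assumptions for illustration; on the real data the baseline would have to be estimated per store or per family.

```python
import pandas as pd

# Hypothetical daily sales for one store: 14 normal days followed by 7 elevated days.
dates = pd.date_range("2016-04-01", periods=21, freq="D")
daily_sales = pd.Series([350.0] * 14 + [500.0] * 7, index=dates)

BASELINE = 350.0   # assumed normal-period average
THRESHOLD = 1.25   # trigger once the 7-day mean is 25% above baseline

# The first six values are NaN because a full 7-day window is not yet available.
rolling_mean_7 = daily_sales.rolling(window=7).mean()
triggered = rolling_mean_7 >= BASELINE * THRESHOLD

# First date on which the trigger fires (None if it never does).
first_trigger = triggered.idxmax() if triggered.any() else None
print(first_trigger)
```

In this toy series the elevated period starts on 2016-04-15 but the trigger only fires on 2016-04-19, which illustrates the delay a 7-day average introduces; a shorter window or a lower threshold would react sooner at the cost of more false alarms.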



> Corresponding plots will be in cell `54` in `visualization_demo.ipynb`

## 🛡️ 12.10 Crisis and Store Cluster Performance

In [None]:
avg_sales_by_cluster_crisis = df.groupby(['cluster', 'is_crisis'])['sales'].mean().reset_index()
avg_transactions_by_cluster_crisis = df.groupby(['cluster', 'is_crisis'])['transactions'].mean().reset_index()


print(avg_sales_by_cluster_crisis)
print(avg_transactions_by_cluster_crisis)

    cluster  is_crisis   sales
0         1          0  325.09
1         1          1  432.20
2         2          0  258.65
3         2          1  390.24
4         3          0  193.84
5         3          1  253.97
6         4          0  296.79
7         4          1  338.18
8         5          0 1112.94
9         5          1 1509.84
10        6          0  340.84
11        6          1  532.79
12        7          0  138.27
13        7          1  221.38
14        8          0  644.43
15        8          1  895.94
16        9          0  274.34
17        9          1  350.82
18       10          0  254.84
19       10          1  374.55
20       11          0  602.00
21       11          1  813.84
22       12          0  322.15
23       12          1  512.48
24       13          0  321.72
25       13          1  547.77
26       14          0  708.31
27       14          1  852.59
28       15          0  198.38
29       15          1  257.40
30       16          0  236.20
31      

#### High-Level Findings
1. **Universal Sales Lift**  
   - All 17 clusters show increased sales during crises (+14% to +80%)
   - Average cluster growth: +39.5% (matches overall trend)

2. **Three Distinct Performance Groups**  
   - **Premium Clusters (5,8,11,14,17)**:  
     - Highest absolute sales ($800-$1,500)  
     - +35% average growth  
     - *Example*: Cluster 5: $1,113 → $1,510  
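   - **Mid-Tier Clusters (1,6,12,13,16)**:  
     - Strongest % growth (+56-80% for clusters 6, 12, 13, and 16; cluster 1 is lower at +33%)  
     - *Star Performer*: Cluster 16: +80% ($236 → $424)  
   - **Value Clusters (2,3,4,7,9,10,15)**:  
     - Lowest absolute sales ($200-$400)  
     - +37% average growth, ranging from +14% (cluster 4) to +60% (cluster 7)  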

3. **Transaction Patterns Reveal Behavior Shifts**  
   - **High-Value Clusters**: Minimal transaction growth (+1-4%) but large sales increases → **Bigger baskets**  
   - **Growth Clusters**: Both transactions (+12-19%) and sales rise → **More customers + bigger purchases**  
   - **Anomalies**:  
     - Cluster 4: -4% transactions but +14% sales  
     - Cluster 9: Flat transactions but +28% sales  
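
The per-cluster growth rates and tier averages can be recomputed from the printed sales table. The sketch below covers only clusters 1-15, because the rows for clusters 16 and 17 are cut off in the output above; the tier membership is taken from the text, and the tier averages are therefore partial for the Premium and Mid-Tier groups.

```python
import pandas as pd

# (non-crisis, crisis) mean sales copied from the printed table; clusters 16 and 17
# are omitted because their rows are cut off in the output above.
cluster_sales = {
    1: (325.09, 432.20), 2: (258.65, 390.24), 3: (193.84, 253.97),
    4: (296.79, 338.18), 5: (1112.94, 1509.84), 6: (340.84, 532.79),
    7: (138.27, 221.38), 8: (644.43, 895.94), 9: (274.34, 350.82),
    10: (254.84, 374.55), 11: (602.00, 813.84), 12: (322.15, 512.48),
    13: (321.72, 547.77), 14: (708.31, 852.59), 15: (198.38, 257.40),
}
table = pd.DataFrame.from_dict(cluster_sales, orient="index", columns=["normal", "crisis"])
table["growth_pct"] = (table["crisis"] / table["normal"] - 1) * 100

# Tier membership as listed in the text, restricted to clusters with visible rows.
tiers = {
    "Premium": [5, 8, 11, 14],
    "Mid-Tier": [1, 6, 12, 13],
    "Value": [2, 3, 4, 7, 9, 10, 15],
}
tier_growth = {name: table.loc[ids, "growth_pct"].mean() for name, ids in tiers.items()}

print(table["growth_pct"].round(1))
print({name: round(value, 1) for name, value in tier_growth.items()})
```

Among the visible rows, cluster 4 has the smallest lift (about +14%) and the seven Value clusters average about +37%.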

#### Strategic Recommendations

**For Premium Clusters**  
- Focus on **inventory depth** for high-ticket items  
- Implement **concierge services** to maximize basket size  
- *Example*: Cluster 5 can absorb 35% more inventory  

**For Growth Clusters**  
- Expand **staffing/payment stations** (higher traffic)  
- Promote **cross-selling** (customers buying more per trip)  
- *Priority*: Cluster 16 needs 80% more stock  

**For Value Clusters**  
- Optimize **essential goods** assortment  
- Limited need for operational changes  

#### Operational Insights  
1. **Labor Allocation**  
   - Staff 20% more in growth clusters (transactions up)  
   - Reassign staff to stocking in premium clusters  

2. **Inventory Planning**  
   - **Cluster 5**: Needs about $400 of additional inventory daily; **Cluster 14**: about $144  
   - **Cluster 16**: Nearly double budget for key items  

3. **Marketing Focus**  
   - Premium: Emphasize quality/availability  
   - Growth: Highlight value bundles  

#### Anomaly Investigation  
- **Cluster 4 & 9**:  
  - Negative/neutral traffic but sales up →  
  - Likely **neighborhood consolidation** (fewer trips, bigger hauls)  
  - Check average basket size changes  

#### Performance Benchmarks  
| Cluster Tier | Crisis Prep Target |  
|--------------|--------------------|  
| Premium      | +35% inventory     |  
| Growth       | +60% inventory     |  
| Value        | +25% inventory     |  



> Corresponding plots will be in cell `56` in `visualization_demo.ipynb`

# 🎉 13. How do sales differ on holidays vs. non-holidays overall?


In [None]:
holiday_sales_comparison = df.groupby('is_holiday')['sales'].mean().reset_index()
holiday_sales_comparison

Unnamed: 0,is_holiday,sales
0,0,352.16
1,1,393.86


#### Key Findings
1. **Overall Holiday Lift**
   - Average sales increase by **11.8%** on holidays (352 → 394)
   - Confirms meaningful but moderate impact of holiday periods

2. **Strategic Context**
   - Holiday boost is **1/3 the impact** of crises (+12% vs +39% growth)
   - Suggests holidays drive incremental growth rather than transformative change

#### Comparative Insights
| Period        | Avg Sales | % Change | Key Characteristics       |
|---------------|-----------|----------|---------------------------|
| Normal        | 352       | -        | Baseline performance      |
| Holiday       | 394       | +12%     | Celebratory purchasing    |
| Crisis        | 495       | +39%     | Necessity-driven surge    |
| Crisis+Holiday| 495*      | +41%     | Same rows as Crisis (every crisis row is also flagged as a holiday) |

*From previous crisis+holiday analysis

#### Behavioral Interpretation
- **Holiday Shoppers**:
  - Likely purchasing gifts/special items
  - More discretionary spending than essentials
- **Crisis Shoppers**:
  - Focused on staples and necessities
  - Exhibit stockpiling behavior

#### Actionable Recommendations
1. **Inventory Planning**
   - Moderate holiday prep (+10-15% stock)
   - Focus on giftables and seasonal items

2. **Staffing**
   - Schedule 15% more staff during holidays
   - Prioritize customer service over stocking

3. **Promotions**
   - Bundle holiday-themed items
   - Limited-time holiday specials perform well

4. **Marketing**
   - Launch holiday campaigns 2-3 weeks in advance
   - Emphasize gift-giving solutions


In [None]:
holiday_sales_comparison = df.groupby('is_holiday')['sales'].mean()

percent_difference = ((holiday_sales_comparison[1] - holiday_sales_comparison[0]) / holiday_sales_comparison[0]) * 100

print(f"Holiday sales are {percent_difference:.2f}% {'higher' if percent_difference > 0 else 'lower'} than non-holiday sales.")

Holiday sales are 11.84% higher than non-holiday sales.


In [None]:
holiday_df = df[df['is_holiday'] == 1]

avg_sales_by_holiday_type = holiday_df.groupby('holiday_type')['sales'].mean()

non_holiday_avg_sales = df[df['is_holiday'] == 0]['sales'].mean()

percent_difference_by_holiday_type = ((avg_sales_by_holiday_type - non_holiday_avg_sales) / non_holiday_avg_sales) * 100

holiday_impact = avg_sales_by_holiday_type.reset_index()
holiday_impact['percent_difference_vs_normal'] = percent_difference_by_holiday_type.values

holiday_impact = holiday_impact.sort_values(by='percent_difference_vs_normal', ascending=False)

print(holiday_impact)

  holiday_type  sales  percent_difference_vs_normal
0   Additional 487.63                         38.47
4     Transfer 467.75                         32.82
1       Bridge 446.75                         26.86
2        Event 425.66                         20.87
5     Work Day 372.16                          5.68
3      Holiday 358.43                          1.78


### Holiday Sales Impact Analysis by Holiday Type

#### Performance Spectrum (Ranked by Impact)
1. **Additional Holidays** (+38%)
   - Peak performance: 488 sales
   - *Likely includes*: Extended weekends, special shopping days

2. **Transfer Holidays** (+33%)  
   - Near-premium lift: 468 sales  
   - *Characteristic*: Date-shifted official holidays  

3. **Bridge Holidays** (+27%)  
   - Strong performance: 447 sales  
   - *Definition*: Days creating long weekends  

4. **Event Holidays** (+21%)  
   - Moderate lift: 426 sales  
   - *Examples*: Cultural festivals, local celebrations  

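5. **Work Day Holidays** (+6%)  
   - Minimal impact: 372 sales  
   - *Note*: In the Favorita holiday calendar, which this data appears to follow, a Work Day is a make-up working day (typically a Saturday) that compensates for a Bridge day, so it is not a day off  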

6. **Regular Holidays** (+2%)  
   - Baseline lift: 358 sales  
   - *Interpretation*: Standard fixed-date holidays  

#### Strategic Implications

**For High-Impact Holidays (Additional/Transfer/Bridge)**
- **Inventory**: Stock 30-40% more premium/impulse items
- **Staffing**: Schedule 25% additional staff
- **Marketing**: Launch targeted campaigns 3 weeks prior
- *Example*: Create "long weekend specials" bundles

**For Event Holidays**
- Localize assortments to match festivities
- Thematic window displays increase footfall

**For Low-Impact Holidays (Work Day/Regular)**
- Maintain normal operations
- Focus on operational efficiency

#### Key Insights
- **Day-of-week matters**: Long weekends outperform midweek holidays
- **Flexibility drives sales**: Transfer/Additional holidays show consumers value date flexibility
- **Cultural relevance**: Event holidays outperform generic ones

#### Action Plan
1. **Calendar Optimization**:
   - Mark high-impact holidays in red 6 months ahead
   - Develop type-specific playbooks

2. **Performance Tracking**:
   - Set tiered sales targets:
     - Additional: +35-40%
     - Transfer: +30-35%
     - Bridge: +25-30%

3. **Labor Management**:
   - High-impact: Temporary hires
   - Low-impact: Cross-trained existing staff



> Corresponding plots will be in cell `58` in `visualization_demo.ipynb`

# 🎉 14. Which type of holiday (national, regional, local) drives the highest sales?

In [None]:
holiday_df = df[df['is_holiday'] == 1]

avg_sales_by_holiday_type = holiday_df.groupby('holiday_type')['sales'].mean().reset_index()

avg_sales_by_holiday_type = avg_sales_by_holiday_type.sort_values(by='sales', ascending=False)

print(avg_sales_by_holiday_type)

  holiday_type  sales
0   Additional 487.63
4     Transfer 467.75
1       Bridge 446.75
2        Event 425.66
5     Work Day 372.16
3      Holiday 358.43


### Holiday Sales Performance by Type and Scope

#### Sales Impact Ranking
| Holiday Type   | Avg Sales | % Lift vs Non-Holiday | Likely Scope         | Key Characteristics               |
|----------------|-----------|-----------------------|----------------------|-----------------------------------|
| Additional     | 488       | +38%                  | National/Regional    | Extended weekends, special days   |
| Transfer       | 468       | +33%                  | National             | Date-shifted official holidays    |
| Bridge         | 447       | +27%                  | National             | Creates 4-day weekends            |
| Event          | 426       | +21%                  | Local/Regional       | Cultural festivals, local events  |
| Work Day       | 372       | +6%                   | National             | Make-up working days that offset a Bridge day |
| Regular        | 358       | +2%                   | National             | Traditional fixed-date holidays   |

#### Key Insights

1. **Flexible-Date Holidays Dominate**
   - Top 3 performers (Additional/Transfer/Bridge) all involve date flexibility
   - *Consumer behavior*: Value long weekends > fixed dates

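2. **National vs Local Impact**
   - **National flexible** (Transfer/Bridge): +27-33% lift
   - **Local/event-based**: +21% lift
   - *Implication*: Date flexibility appears to beat cultural relevance, but the scope labels here are inferred from `holiday_type`; the `locale` column was not used in this comparison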

3. **Unexpected Underperformers**
   - Traditional fixed-date holidays (+2%) show minimal impact
   - Workday holidays (+6%) barely outperform normal days
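
Since the section question is about national, regional, and local holidays, a grouping on the `locale` column would address it more directly than `holiday_type`. The sketch below uses a made-up frame (`demo` and all of its values are hypothetical, including the placeholder locale on non-holiday rows); on the notebook data the same steps would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows standing in for df; the values are made up for illustration.
# The locale on non-holiday rows is a placeholder and is never used below.
demo = pd.DataFrame({
    "is_holiday": [1, 1, 1, 1, 0, 0],
    "locale":     ["National", "National", "Regional", "Local", "n/a", "n/a"],
    "sales":      [400.0, 420.0, 300.0, 360.0, 350.0, 354.0],
})

non_holiday_avg = demo.loc[demo["is_holiday"] == 0, "sales"].mean()

# Average holiday sales per locale and the lift relative to non-holiday days.
by_locale = (
    demo[demo["is_holiday"] == 1]
    .groupby("locale")["sales"]
    .mean()
    .to_frame("avg_sales")
)
by_locale["lift_pct"] = (by_locale["avg_sales"] / non_holiday_avg - 1) * 100
by_locale = by_locale.sort_values("avg_sales", ascending=False)
print(by_locale.round(2))
```

Local and regional holidays only apply to some cities or states, so on real data it would be worth checking that `is_holiday` is set only for the stores a holiday actually covers.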

#### Strategic Recommendations

**For Retailers:**
1. **Prioritize Flexible Holidays**
   - Allocate 35-40% more inventory for Additional/Transfer holidays
   - Schedule 30% more staff for Bridge holidays

2. **Localized Approach for Events**
   - Tailor assortments to regional festivals
   - Example: Special displays for local harvest festivals

3. **Re-evaluate Fixed-Date Holidays**
   - Reduce special preparations for regular holidays
   - Focus on operational efficiency instead

**For Marketing:**
- Create "Extended Weekend Sales" campaigns
- Develop "Holiday Transfer" promotions (pre/post-holiday deals)
- Localize messaging for Event holidays

#### Operational Benchmarks
- **Staffing Guide**:
  - Additional holidays: +35% staff
  - Bridge holidays: +25% staff
  - Regular holidays: Normal staffing

- **Inventory Planning**:
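  ```python
  # Illustrative planning multipliers per holiday type (targets, not fitted values).
  def stock_multiplier(holiday_type):
      if holiday_type == 'Additional':
          return 1.40  # +40%
      elif holiday_type in ['Transfer', 'Bridge']:
          return 1.30  # +30%
      elif holiday_type == 'Event':
          return 1.20  # +20% (regional focus)
      return 1.00      # other holiday types: normal stock

  stock = 1000 * stock_multiplier('Bridge')  # e.g. 1,000 baseline units become 1,300
  ```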

> Corresponding plots will be in cell `60` in `visualization_demo.ipynb`

# 🎉 15. Which product families see the biggest sales boost during holidays?


In [None]:
family_holiday_sales = df.groupby(['family', 'is_holiday'])['sales'].mean().reset_index()
family_holiday_sales

Unnamed: 0,family,is_holiday,sales
0,AUTOMOTIVE,0,6.03
1,AUTOMOTIVE,1,6.56
2,BABY CARE,0,0.11
3,BABY CARE,1,0.13
4,BEAUTY,0,3.64
...,...,...,...
61,PRODUCE,1,1513.98
62,SCHOOL AND OFFICE SUPPLIES,0,2.87
63,SCHOOL AND OFFICE SUPPLIES,1,3.45
64,SEAFOOD,0,22.15


By analyzing the average sales per product family on holidays vs. non-holidays, we can identify which categories experience the **largest holiday sales boost**.

#### 🔝 Top Boosted Families
Some product families see a **significant spike** in average sales during holidays:

- **PRODUCE**  
  - 📆 Regular Days: *1,165*  
  - 🎉 Holidays: *1,514*  
  - ✅ **+30% increase**  
  - 🍎 Fresh food seems to be a popular choice for holiday meals and gatherings.

- **DELI**  
  - 📆 Regular Days: *492*  
  - 🎉 Holidays: *617*  
  - ✅ **+25% increase**  
  - 🥪 Likely driven by ready-to-eat convenience foods for celebrations.

- **MEATS**  
  - 📆 Regular Days: *528*  
  - 🎉 Holidays: *655*  
  - ✅ **+24% increase**  
  - 🥩 Traditional family meals and barbecues might be a contributing factor.

- **BEVERAGES**  
  - 📆 Regular Days: *313*  
  - 🎉 Holidays: *376*  
  - ✅ **+20% increase**  
  - 🥤 Reflects higher consumption during social events and gatherings.

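#### ➖ Minimal or No Change
Some families showed **little absolute change** in sales, such as:
- **BABY CARE** (0.11 → 0.13)
- **SEAFOOD**
- **SCHOOL AND OFFICE SUPPLIES** (2.87 → 3.45)

Because their baselines are tiny, the relative lift can still be sizable (about +20% for School and Office Supplies), but the added volume is negligible, so these categories matter little for holiday planning.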

#### 📌 Takeaway
Promotional efforts and stock planning during holidays should prioritize categories like **Produce**, **Meats**, and **Deli**, which see the most notable boost in demand.

> 🎯 **Recommendation:** Focus marketing and inventory strategies around high-performing categories during holidays to maximize sales impact.


> Corresponding plots will be in cell `62` in `visualization_demo.ipynb`

# 🎉 16. Are certain stores or store types more sensitive to holiday sales spikes?


In [None]:
store_type_holiday_sales = df.groupby(['store_type', 'is_holiday'])['sales'].mean().reset_index()
store_type_holiday_sales

Unnamed: 0,store_type,is_holiday,sales
0,A,0,693.25
1,A,1,785.19
2,B,0,320.93
3,B,1,365.59
4,C,0,194.72
5,C,1,213.4
6,D,0,346.23
7,D,1,381.81
8,E,0,264.25
9,E,1,300.94


### 🏬 Are Certain Stores or Store Types More Sensitive to Holiday Sales Spikes?

Analyzing average sales across store types during holidays versus regular days reveals which store formats are **most responsive to holiday demand**.

#### 📊 Holiday Sales Uplift by Store Type

| Store Type | Regular Sales | Holiday Sales | % Increase |
|------------|----------------|----------------|------------|
| A          | 693            | 785            | **+13.3%** |
| B          | 321            | 366            | **+13.9%** |
| C          | 195            | 213            | **+9.6%**  |
| D          | 346            | 382            | **+10.3%** |
| E          | 264            | 301            | **+13.9%** |

#### 🔍 Key Insights

- **Store Types B and E** show the **highest relative increase** in average sales during holidays (**+13.9%**).
- **Store Type A**, despite already having the **highest base sales**, still sees a meaningful increase during holidays (**+13.3%**), indicating strong holiday demand in high-volume stores.
- **Store Type C** is the **least sensitive**, with just a **+9.6%** boost, possibly due to smaller size, limited assortment, or customer demographics.
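
The uplift column can be derived from the printed output by pivoting the long table into one row per store type. This is a minimal sketch; the names `wide` and `pct_increase` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the store_type_holiday_sales output above.
store_type_holiday_sales = pd.DataFrame({
    "store_type": ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"],
    "is_holiday": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "sales": [693.25, 785.19, 320.93, 365.59, 194.72, 213.40,
              346.23, 381.81, 264.25, 300.94],
})

# One row per store type, one column per is_holiday value.
wide = store_type_holiday_sales.pivot(index="store_type", columns="is_holiday", values="sales")
wide.columns = ["regular", "holiday"]  # column 0 -> regular, column 1 -> holiday
wide["pct_increase"] = (wide["holiday"] / wide["regular"] - 1) * 100

print(wide.sort_values("pct_increase", ascending=False).round(1))
```

Computing the percentages from the unrounded means avoids the small drift that appears when they are derived from whole-number sales figures.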

#### 🧠 Strategic Takeaway
- **Large-format stores (Type A)** and **mid-tier stores (Types B & E)** benefit the most from holiday demand, making them ideal targets for holiday-specific campaigns.
- Store-specific strategies may be necessary to optimize holiday performance for **less responsive types** like C.

> 📈 **Recommendation:** Enhance promotional efforts and inventory planning for Store Types A, B, and E during holidays to capitalize on their higher sales sensitivity.


> Corresponding plots will be in cell `64` in `visualization_demo.ipynb`

# 🎉 17. How many days before a holiday does sales start increasing?

In [None]:
sales_increase_before_holiday = df[df['days_to_holiday'] < 7].groupby('is_holiday')['sales'].mean().reset_index()
sales_increase_before_holiday

Unnamed: 0,is_holiday,sales
0,0,232.08
1,1,134.64


### ⏳ How Many Days Before a Holiday Do Sales Start Increasing?

To understand if there's a **build-up in sales leading up to holidays**, we filtered rows with `days_to_holiday < 7` and averaged sales separately for holiday and non-holiday rows:

| Is Holiday | Avg Sales (`days_to_holiday < 7`) |
|------------|-------------------------|
| No         | 232                     |
| Yes        | 135                     |

#### 🔍 Key Insight

Within this window, **non-holiday rows average higher sales (232) than holiday rows (135)**:
- Both values sit well below the overall averages reported earlier (352 for non-holidays, 394 for holidays), so the `days_to_holiday < 7` slice is not representative of typical days.
- The split compares holiday rows with non-holiday rows inside the window; it does not show how sales change as a holiday approaches.

#### 📉 Unexpected Pattern

- **Holiday rows in this range show lower average sales**. This could be due to:
  - The filter keeping only part of the holiday rows (the overall holiday average is 394, not 135).
  - How `days_to_holiday` is defined on and around the holiday itself.
  - Data filtering that needs refinement to properly isolate the *pre-holiday window*.

#### 🧠 Takeaway

This data slice does **not** show a sales increase in the 7 days *just before* a holiday, but it also cannot locate when a build-up starts. Averaging sales for each value of `days_to_holiday` on non-holiday rows, over a broader window (e.g., 14 or 21 days), would give clearer insight into **early shopping behavior**.

> 📌 **Next Step Recommendation:** Expand the analysis to look at **sales trends over a 14–21 day range** before holidays to detect the actual start of increased shopping activity.
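
One possible shape for that follow-up is a sales profile by distance to the holiday. The sketch below uses a made-up frame and assumes `days_to_holiday` counts the days until the next holiday; `WINDOW`, the 5% cut-off, and all values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical non-holiday rows; days_to_holiday is assumed to count days until the next holiday.
demo = pd.DataFrame({
    "is_holiday":      [0, 0, 0, 0, 0, 0, 0, 0],
    "days_to_holiday": [1, 1, 2, 3, 7, 7, 14, 21],
    "sales":           [420.0, 400.0, 390.0, 370.0, 352.0, 348.0, 351.0, 349.0],
})

WINDOW = 21  # assumed look-back window in days

pre_holiday = demo[(demo["is_holiday"] == 0) & (demo["days_to_holiday"].between(1, WINDOW))]
profile = pre_holiday.groupby("days_to_holiday")["sales"].mean().sort_index()

# Use the farthest day in the window as the reference level.
reference = profile.iloc[-1]
lift_pct = (profile / reference - 1) * 100

# Largest distance at which sales are at least 5% above the reference (None if no day qualifies).
elevated = lift_pct[lift_pct >= 5].index
build_up_start = int(elevated.max()) if len(elevated) else None
print(profile)
print("Build-up starts about", build_up_start, "days before the holiday")
```

Reading the profile from the largest distance toward zero shows where sales begin to rise, which is the quantity the section question asks about.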


> Corresponding plots will be in cell `66` in `visualization_demo.ipynb`

# 🎉 18. Do sales drop after holidays ("post-holiday effect")?


In [None]:
post_holiday_sales = df[(df['is_holiday'] == 0) & (df['days_to_holiday'] > 0)]['sales'].mean()
pre_holiday_sales = df[(df['is_holiday'] == 1) & (df['days_to_holiday'] < 7)]['sales'].mean()
print(f"Post-holiday sales: {post_holiday_sales}, Pre-holiday sales: {pre_holiday_sales}")

Post-holiday sales: 352.15918056230004, Pre-holiday sales: 134.64134125364757


### 📉 Do Sales Drop After Holidays? ("Post-Holiday Effect")

To examine whether there's a **drop in sales after holidays**, we compared:

- **Pre-Holiday Sales** (sales on holidays within 7 days of the event)
- **Post-Holiday Sales** (sales on non-holidays occurring *after* a holiday)

| Metric              | Avg Sales |
|---------------------|-----------|
| 🎉 Pre-Holiday Days | 135       |
| 📆 Post-Holiday Days| 352       |

#### 🔍 Key Insight

Contrary to the common "post-holiday slump" expectation, **sales actually increase after holidays**:
- 📈 **Post-Holiday Avg Sales (352)** is **2.6x higher** than Pre-Holiday Avg Sales (135)
- This might suggest a **rebound effect**, where:
  - Shoppers resume normal purchase habits.
  - Holiday promotions lead to continued shopping momentum.
  - Businesses restock and replenish, leading to increased sales activity.

#### 🧠 Takeaway

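The data **does not support a post-holiday sales drop**: the group labeled post-holiday averages well above the group labeled pre-holiday. Since that group is not limited to the days right after a holiday, this is better read as an absence of evidence for a slump than as proof of a post-holiday spike.

> 📌 **Recommendation:** Businesses could consider extending promotions and stocking popular items *after* holidays, ideally after confirming the pattern on a fixed post-holiday window.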
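
A stricter check of the post-holiday effect could isolate a fixed window after each holiday. The sketch below shows one possible way to do that on toy data. The helper `add_days_since_holiday`, the `days_since_holiday` column, and the 3-day window are illustrative assumptions, not part of the notebook, and the helper expects one row per date (for example after aggregating `df` by `date`).

```python
import pandas as pd

def add_days_since_holiday(daily: pd.DataFrame) -> pd.DataFrame:
    """Add a days_since_holiday column to a frame with one row per date."""
    daily = daily.sort_values('date').reset_index(drop=True)
    # Keep the date only on holiday rows, then carry the latest holiday date forward.
    last_holiday = daily['date'].where(daily['is_holiday'] == 1).ffill()
    # NaN for dates that precede the first holiday in the data.
    daily['days_since_holiday'] = (daily['date'] - last_holiday).dt.days
    return daily

# Toy data (hypothetical): ten days with a single holiday on 2024-01-03.
toy = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'is_holiday': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    'sales': [100.0, 110.0, 150.0, 80.0, 90.0, 95.0, 100.0, 105.0, 100.0, 100.0],
})
toy = add_days_since_holiday(toy)

# Days 1 to 3 after a holiday versus all other non-holiday days.
in_window = toy['days_since_holiday'].between(1, 3)
post_mean = toy.loc[in_window, 'sales'].mean()
baseline_mean = toy.loc[(toy['is_holiday'] == 0) & ~in_window, 'sales'].mean()
print(f"Post-holiday window: {post_mean:.2f}, baseline: {baseline_mean:.2f}")
```

On the real data, store-specific local holidays would need the same calculation per store rather than on a single daily series.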


> Corresponding plots will be in cell `68` in `visualization_demo.ipynb`

# 🎉 19. Which city or state benefits the most from local holidays?


In [None]:
city_state_holiday_sales = df.groupby(['city', 'state', 'holiday_type'])['sales'].mean().reset_index()
city_state_holiday_sales

Unnamed: 0,city,state,holiday_type,sales
0,Ambato,Tungurahua,Additional,526.78
1,Ambato,Tungurahua,Bridge,505.75
2,Ambato,Tungurahua,Event,397.39
3,Ambato,Tungurahua,Holiday,363.35
4,Ambato,Tungurahua,Normal Day,356.99
...,...,...,...,...
149,Santo Domingo,Santo Domingo de los Tsachilas,Event,262.32
150,Santo Domingo,Santo Domingo de los Tsachilas,Holiday,214.04
151,Santo Domingo,Santo Domingo de los Tsachilas,Normal Day,211.30
152,Santo Domingo,Santo Domingo de los Tsachilas,Transfer,286.28


### 🏙️ Which City or State Benefits the Most from Local Holidays?

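We analyzed average sales across different **cities and states** during various **holiday types** to determine which locations benefit the most from holidays. The grouping uses `holiday_type` rather than `locale`, so it covers every holiday type observed in a city, not only locally scoped holidays.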

#### 🔍 Key Observations:

- **Ambato (Tungurahua)** shows notably high sales during:
  - 🏷️ *Additional Holidays*: **527**
  - 🛣️ *Bridge Holidays*: **506**
- Compared to **normal days** (357), these are uplifts of roughly **48%** and **42%**, respectively.
- **Santo Domingo** shows a moderate increase during:
  - 🎉 *Transfer Holidays*: **286**
  - Normal days average **211**, so the uplift is roughly **35%**.

#### 🧠 Insight:

- Cities like **Ambato** benefit most from Additional and Bridge days, with smaller gains on Event (397) and regular Holiday (363) days, suggesting **strong local engagement** and possibly **event-driven shopping patterns**.
- **States such as Tungurahua** could be **priority regions** for localized promotions, especially around Bridge and Additional holidays.

#### 📌 Takeaway:

- Holiday impacts **aren’t uniform across regions**—local context matters.
- Businesses should target high-performing cities like **Ambato** with **event-based promotions** and **inventory planning** around key holidays.
- Further filtering for % uplift vs. normal days can sharpen these insights.

> 🗺️ **Next step:** Visualize these trends with a heatmap or bar chart to better highlight top-performing regions during holidays.
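
The percentage uplift over normal days mentioned in the takeaway can be sketched as follows. This is a minimal example that reuses a few of the averages shown in the output above; the helper name `holiday_uplift` is an assumption for illustration.

```python
import pandas as pd

def holiday_uplift(avg_sales: pd.DataFrame, baseline: str = 'Normal Day') -> pd.DataFrame:
    """Percentage uplift of each holiday type over the baseline, per city/state."""
    wide = avg_sales.pivot_table(index=['city', 'state'],
                                 columns='holiday_type', values='sales')
    base = wide[baseline]
    # (holiday average - normal-day average) / normal-day average, in percent.
    uplift = wide.drop(columns=baseline).sub(base, axis=0).div(base, axis=0) * 100
    return uplift.round(1)

# A few rows copied from the grouped output above.
sample = pd.DataFrame({
    'city': ['Ambato'] * 3 + ['Santo Domingo'] * 2,
    'state': ['Tungurahua'] * 3 + ['Santo Domingo de los Tsachilas'] * 2,
    'holiday_type': ['Additional', 'Bridge', 'Normal Day', 'Normal Day', 'Transfer'],
    'sales': [526.78, 505.75, 356.99, 211.30, 286.28],
})
uplift = holiday_uplift(sample)
print(uplift)
```

Passing the full `city_state_holiday_sales` frame instead of `sample` should give one row per city; holiday types a city never observed come out as `NaN`.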


> Corresponding plots will be in cell `70` in `visualization_demo.ipynb`

# 🎉 20. Is there a difference in sales between transferred holidays and non-transferred holidays?


In [None]:
transferred_sales = df[df['transferred'] == 1]['sales'].mean()
non_transferred_sales = df[df['transferred'] == 0]['sales'].mean()
print(f"Transferred sales: {transferred_sales}, Non-transferred sales: {non_transferred_sales}")


Transferred sales: 311.5602281117783, Non-transferred sales: 359.2714177512478


### 🔄 Transferred Holidays vs. Non-Transferred Holidays: Is There a Sales Difference?

To assess the impact of **transferred holidays** (when a holiday is moved to a different date), we compared the average sales between:

- **Transferred Holidays** (`transferred = 1`)
- **Non-Transferred Holidays** (`transferred = 0`); note that this filter is applied to all rows, so it may also include ordinary non-holiday days

#### 📊 Results:

| Holiday Type        | Avg Sales |
|---------------------|-----------|
| 🔄 Transferred      | 312       |
| 📅 Non-Transferred  | 359       |

#### 🔍 Insight:

- Sales during **transferred holidays** are **~13% lower** than during non-transferred ones.
- This suggests that moving a holiday to a different date might **reduce its commercial impact**, possibly due to:
  - Less anticipation or confusion among consumers
  - Misalignment with traditional shopping behavior

#### 🧠 Takeaway:

- **Non-transferred holidays** may offer better opportunities for marketing and promotions.
- When planning campaigns, **focus more on fixed-date holidays** to leverage stronger and more predictable consumer behavior.

> 📌 **Recommendation:** Retailers should track and adjust for transferred holidays in their calendars to avoid overestimating sales potential.
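
Because `transferred == 0` can also match ordinary days, a tighter comparison might restrict both groups to holiday rows first. The sketch below uses toy data, and the `transfer_label` column is an assumption introduced for readability.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): four holiday rows and two ordinary rows.
toy = pd.DataFrame({
    'is_holiday': [1, 1, 1, 1, 0, 0],
    'transferred': [True, True, False, False, False, False],
    'sales': [300.0, 320.0, 400.0, 380.0, 350.0, 360.0],
})

# Keep holiday rows only, so ordinary days cannot dilute the non-transferred group.
holiday_rows = toy[toy['is_holiday'] == 1].copy()
holiday_rows['transfer_label'] = np.where(holiday_rows['transferred'] == 1,
                                          'transferred', 'not_transferred')

# Report the group size next to the mean to show how much data backs each figure.
by_transfer = holiday_rows.groupby('transfer_label')['sales'].agg(['mean', 'count'])
print(by_transfer)
```

Reporting `count` alongside `mean` helps judge whether the transferred group is large enough to support a conclusion.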


> Corresponding plots will be in cell `72` in `visualization_demo.ipynb`

# 🎉 21. Do crisis periods reduce the usual holiday sales spike?


In [None]:
crisis_holiday_sales = df[(df['is_crisis'] == 1) & (df['is_holiday'] == 1)]['sales'].mean()
non_crisis_holiday_sales = df[(df['is_crisis'] == 0) & (df['is_holiday'] == 1)]['sales'].mean()
print(f"Crisis holiday sales: {crisis_holiday_sales}, Non-crisis holiday sales: {non_crisis_holiday_sales}")

Crisis holiday sales: 494.9040719977209, Non-crisis holiday sales: 381.385802875458


### ⚠️ Do Crisis Periods Reduce the Usual Holiday Sales Spike?

To evaluate whether crisis periods (such as economic downturns or health crises) diminish the typical boost in sales seen during holidays, we compared:

- 🆘 **Holiday sales during crisis periods** (`is_crisis = 1`)
- ✅ **Holiday sales during normal periods** (`is_crisis = 0`)

#### 📊 Results:

| Scenario                  | Avg Holiday Sales |
|---------------------------|-------------------|
| 🆘 Crisis Holidays         | **495**           |
| ✅ Non-Crisis Holidays     | **381**           |

#### 🔍 Insight:

- Surprisingly, **holiday sales are higher during crisis periods** by nearly **30%**.
- This could be attributed to:
  - **Stockpiling behavior** or **panic buying**
  - Retailers offering **steeper discounts or promotions** to stimulate demand
  - Consumers prioritizing spending during holidays for morale or tradition despite external conditions

#### 🧠 Takeaway:

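- **Crisis does not always dampen holiday sales**; under specific contexts it can even amplify them. Because this compares raw holiday averages, it does not show whether the holiday *spike* over ordinary days in the same period grew or shrank.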
- Businesses should be ready to adapt to shifting consumer behavior during crises, especially around holidays.
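
To address the "spike" part of the question, one option is to measure the holiday lift against non-holiday days within each period. The following is a minimal sketch on toy data; the `holiday_lift_%` column name is an assumption.

```python
import pandas as pd

# Toy data (hypothetical): two rows per crisis/holiday combination.
toy = pd.DataFrame({
    'is_crisis':  [0, 0, 0, 0, 1, 1, 1, 1],
    'is_holiday': [0, 0, 1, 1, 0, 0, 1, 1],
    'sales': [300.0, 320.0, 380.0, 400.0, 450.0, 470.0, 490.0, 500.0],
})

# Mean sales per (is_crisis, is_holiday) cell, with holiday status as columns.
means = (toy.groupby(['is_crisis', 'is_holiday'])['sales'].mean()
            .unstack('is_holiday')
            .rename(columns={0: 'non_holiday', 1: 'holiday'}))

# Holiday lift relative to non-holiday days in the same period.
means['holiday_lift_%'] = (means['holiday'] - means['non_holiday']) / means['non_holiday'] * 100
print(means.round(2))
```

In this toy example the crisis period has higher holiday sales in absolute terms but a smaller lift over its own baseline, which is the distinction the raw averages cannot show.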



> Corresponding plots will be in cell `74` in `visualization_demo.ipynb`

# 🎉 22. Which quarter has the highest number of holidays and how does that affect total sales?


In [None]:
quarter_holiday_sales = df.groupby(['quarter', 'is_holiday'])['sales'].sum().reset_index()
quarter_holiday_sales

Unnamed: 0,quarter,is_holiday,sales
0,1,0,258703353.46
1,1,1,13604515.44
2,2,0,215526817.35
3,2,1,76571986.67
4,3,0,230369979.73
5,3,1,38570032.55
6,4,0,194048098.24
7,4,1,69179960.77


### 📊 Which Quarter Has the Highest Number of Holidays and How Does That Affect Total Sales?

To understand the relationship between holidays in each quarter and total sales, we summed sales separately for holiday and non-holiday days in each quarter.

#### 📅 Results:

| Quarter | Is Holiday | Total Sales |
|---------|------------|-------------|
| Q1      | No         | **258,703,353** |
| Q1      | Yes        | **13,604,515**  |
| Q2      | No         | **215,526,817** |
| Q2      | Yes        | **76,571,987**  |
| Q3      | No         | **230,369,980** |
| Q3      | Yes        | **38,570,033**  |
| Q4      | No         | **194,048,098** |
| Q4      | Yes        | **69,179,961**  |

#### 🔍 Insight:

- **Quarter 1 (Q1)** has the highest **non-holiday** sales (~258.7 million) but the lowest **holiday** sales (~13.6 million).
- Adding both rows per quarter, **Q2** has the highest total (~292.1 million), ahead of Q1 (~272.3 million), Q3 (~268.9 million), and Q4 (~263.2 million).
- The **holiday sales** are notably higher in **Q2** compared to other quarters:
  - Sales during holidays in Q2 are **~76.57 million**, compared to **13.6 million** in Q1, **38.57 million** in Q3, and **69.18 million** in Q4.

#### 🧠 Takeaway:

- **Q2** has both the highest overall sales and the highest holiday sales; **Q1** leads only on non-holiday days.
- The **high holiday sales in Q2** suggest that holidays in this quarter could be a key period for promotional campaigns or special offers. This table reports sales totals rather than the number of holidays, so a higher holiday total may simply reflect more holiday days in that quarter.
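
The first half of the question, the number of holidays per quarter, can be answered by counting distinct holiday dates. A minimal sketch on toy data follows; the column names `holiday_days` and `sales_per_holiday_day` are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): several rows can share one date, as in the real frame.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-02-10',
                            '2024-04-01', '2024-05-01', '2024-05-01']),
    'quarter': [1, 1, 1, 2, 2, 2],
    'is_holiday': [1, 1, 0, 1, 1, 1],
    'sales': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

holiday_rows = toy[toy['is_holiday'] == 1]
per_quarter = holiday_rows.groupby('quarter').agg(
    holiday_days=('date', 'nunique'),   # distinct holiday dates, not rows
    holiday_sales=('sales', 'sum'),
)
# Normalizing by the day count separates "more holidays" from "stronger holidays".
per_quarter['sales_per_holiday_day'] = per_quarter['holiday_sales'] / per_quarter['holiday_days']
print(per_quarter)
```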
  



> Corresponding plots will be in cell `76` in `visualization_demo.ipynb`

# 🎉 23. Is there a cumulative effect of promotion and holiday together?

In [None]:
promo_holiday_sales_combined = df[(df['onpromotion'] > 0) & (df['is_holiday'] == 1)]['sales'].mean()
promo_holiday_sales_combined

1203.8338764363868

### 📊 Cumulative Effect of Promotion and Holiday on Sales

We analyzed whether promotions and holidays combined have a cumulative effect on sales. 

#### 📅 Results:

- **Average Sales during Promotion and Holiday**: **1,203.83**

#### 🔍 Insight:

- Rows that combine a **promotion** and a **holiday** average **1,203.83**, well above the holiday averages reported in section 21 (381 to 495). This cell does not compute promotion-only or holiday-only averages, so how much of the boost comes from each factor is not established here.
- This is consistent with consumers buying more when **promotions** and **holidays** align, possibly due to increased consumer demand during these periods.

#### 🧠 Takeaway:

- Retailers should **maximize the overlap of promotions with holidays** to fully capitalize on increased consumer spending.
- Consider focusing on **major holiday promotions** to boost sales further during these peak periods.
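
Whether the effect is cumulative can be checked by comparing all four promotion/holiday combinations. The sketch below uses toy data and compares the observed combined average with a simple additive expectation; the labels `promo`, `day_type`, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical).
toy = pd.DataFrame({
    'onpromotion': [0, 0, 2, 0, 3],
    'is_holiday':  [0, 0, 0, 1, 1],
    'sales': [100.0, 100.0, 300.0, 150.0, 400.0],
})
toy['promo'] = np.where(toy['onpromotion'] > 0, 'promo', 'no_promo')
toy['day_type'] = np.where(toy['is_holiday'] == 1, 'holiday', 'normal')

# 2x2 table of mean sales.
cell_means = toy.groupby(['promo', 'day_type'])['sales'].mean().unstack('day_type')

base = cell_means.loc['no_promo', 'normal']
# If the two effects simply added up, the combined cell would equal this value.
additive = cell_means.loc['promo', 'normal'] + cell_means.loc['no_promo', 'holiday'] - base
# Positive values mean the combination exceeds the sum of the separate effects.
interaction = cell_means.loc['promo', 'holiday'] - additive
print(cell_means)
print(f"Additive expectation: {additive:.1f}, interaction: {interaction:.1f}")
```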




> Corresponding plots will be in cell `78` in `visualization_demo.ipynb`

# 🛒 24. What is the relationship between the number of transactions and sales?


In [None]:
correlation = df['transactions'].corr(df['sales'])
print(f'Correlation between Transactions and Sales: {correlation:.4f}')

Correlation between Transactions and Sales: 0.2330


## 🛒 Insight: Relationship Between Transactions and Sales

The correlation between the number of transactions and row-level sales is **0.2330**.

This indicates a **weak positive relationship**: more transactions are associated with higher sales, but the link is **not particularly strong**.  
This suggests that simply increasing the number of customer visits does not always guarantee a proportional increase in total revenue. Other factors such as **average order size**, **promotions**, **product types**, or **seasonality** may have a more significant impact on overall sales. Note that each row holds the sales of a single product family, so if `transactions` is a store-level daily count, part of the weakness may come from comparing the two at different levels of detail.

🔍 **Key Takeaway:**  
To drive meaningful revenue growth, strategies should not only focus on increasing transactions but also on maximizing the **sales per transaction** (e.g., cross-selling, up-selling, targeted promotions).
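
One way to test the granularity concern is to aggregate sales to one row per store and day before correlating. The sketch below uses toy data and assumes `transactions` repeats the same store-level count on every family row for a given store and date.

```python
import pandas as pd

# Toy data (hypothetical): one store, three days, two product families per day.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01'] * 2 + ['2024-01-02'] * 2 + ['2024-01-03'] * 2),
    'store_nbr': [1] * 6,
    'family': ['A', 'B'] * 3,
    'sales': [1.0, 99.0, 2.0, 198.0, 3.0, 297.0],
    'transactions': [100, 100, 200, 200, 300, 300],
})

# Row-level correlation, as computed in the cell above.
row_corr = toy['transactions'].corr(toy['sales'])

# Store-day level: total sales across families against the store's transaction count.
store_day = toy.groupby(['date', 'store_nbr']).agg(
    total_sales=('sales', 'sum'),
    transactions=('transactions', 'first'),
).reset_index()
store_day_corr = store_day['transactions'].corr(store_day['total_sales'])

print(f"Row level: {row_corr:.2f}, store-day level: {store_day_corr:.2f}")
```

In the toy example the store-day correlation is perfect while the row-level one is weak, purely because families differ in size.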


> Corresponding plots will be in cell `80` in `visualization_demo.ipynb`


# 🛒 25. Are certain clusters (cluster) more sensitive to promotions or holidays?


In [None]:
normal_sales = df[(df['is_holiday'] == 0) & (df['onpromotion'] == 0)].groupby('cluster')['sales'].mean().reset_index()
normal_sales.rename(columns={'sales': 'avg_sales_normal'}, inplace=True)

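# Note: == 1 keeps only rows where onpromotion equals 1; if the column is a count, > 0 would include every promoted row
promo_sales = df[df['onpromotion'] == 1].groupby('cluster')['sales'].mean().reset_index()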
promo_sales.rename(columns={'sales': 'avg_sales_promo'}, inplace=True)

holiday_sales = df[df['is_holiday'] == 1].groupby('cluster')['sales'].mean().reset_index()
holiday_sales.rename(columns={'sales': 'avg_sales_holiday'}, inplace=True)

cluster_sensitivity = normal_sales.merge(promo_sales, on='cluster', how='left').merge(holiday_sales, on='cluster', how='left')

cluster_sensitivity['promo_lift_%'] = ((cluster_sensitivity['avg_sales_promo'] - cluster_sensitivity['avg_sales_normal']) / cluster_sensitivity['avg_sales_normal']) * 100
cluster_sensitivity['holiday_lift_%'] = ((cluster_sensitivity['avg_sales_holiday'] - cluster_sensitivity['avg_sales_normal']) / cluster_sensitivity['avg_sales_normal']) * 100

cluster_sensitivity = cluster_sensitivity.sort_values('promo_lift_%', ascending=False)

cluster_sensitivity


Unnamed: 0,cluster,avg_sales_normal,avg_sales_promo,avg_sales_holiday,promo_lift_%,holiday_lift_%
6,7,63.82,286.31,156.55,348.64,145.32
10,11,242.45,911.07,672.07,275.78,177.2
1,2,99.39,362.79,278.19,265.02,179.9
11,12,159.94,577.12,355.92,260.84,122.54
9,10,105.8,379.18,288.05,258.41,172.27
14,15,109.38,384.33,213.25,251.38,94.97
2,3,89.48,313.46,209.38,250.32,134.0
15,16,131.48,442.81,260.4,236.79,98.05
5,6,137.18,431.83,383.51,214.79,179.57
12,13,138.76,367.81,363.53,165.07,161.98


From our analysis:

- **Promo Lift (%)** measures how much sales increase during promotions compared to normal periods.
- **Holiday Lift (%)** measures how much sales increase during holidays compared to normal periods.

### 📊 Key Insights:
- **Cluster 7** shows the highest **promotion sensitivity**, with sales increasing by **+349%** during promotions.
- **Cluster 11** and **Cluster 2** also demonstrate strong promotion lifts, around **+276%** and **+265%**, respectively.
- **Cluster 2** also has one of the **highest holiday sensitivities**, with a **+180%** lift in sales during holidays.
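- **Cluster 13** shows **balanced sensitivity**, with nearly equal promo (+165%) and holiday (+162%) lifts. **Cluster 5** is described as similarly balanced, although it is not among the ten rows shown above.
- **Cluster 14** is reported as **more holiday sensitive** than promo sensitive (higher holiday lift % compared to promo lift %); it is also outside the rows shown above, where promo lift exceeds holiday lift for every cluster.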

### 📌 Overall:
- Promotions tend to drive **even larger spikes** in sales than holidays for most clusters.
- Knowing which clusters are more responsive can help in **targeting promotions and planning inventory** more effectively.
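
In the cell above, the promotion group can include holiday rows and the holiday group can include promoted rows. A possible refinement is to split rows into mutually exclusive segments before computing lifts. This is a sketch on toy data; `cluster_lifts` and the segment names are assumptions, and it uses `onpromotion > 0` rather than `== 1`.

```python
import numpy as np
import pandas as pd

def cluster_lifts(frame: pd.DataFrame) -> pd.DataFrame:
    """Percentage lift over normal rows for promo-only, holiday-only and combined rows."""
    promo = frame['onpromotion'] > 0
    holiday = frame['is_holiday'] == 1
    # Each row falls into exactly one segment.
    segment = np.select(
        [~promo & ~holiday, promo & ~holiday, ~promo & holiday],
        ['normal', 'promo_only', 'holiday_only'],
        default='both',
    )
    means = (frame.assign(segment=segment)
                  .groupby(['cluster', 'segment'])['sales'].mean()
                  .unstack('segment'))
    normal = means['normal']
    lifts = means.drop(columns='normal').sub(normal, axis=0).div(normal, axis=0) * 100
    return lifts.add_suffix('_lift_%')

# Toy data (hypothetical): cluster 2 has no row that is both promoted and a holiday.
toy = pd.DataFrame({
    'cluster':     [1, 1, 1, 1, 2, 2, 2],
    'onpromotion': [0, 2, 0, 3, 0, 1, 0],
    'is_holiday':  [0, 0, 1, 1, 0, 0, 1],
    'sales': [100.0, 300.0, 150.0, 400.0, 50.0, 100.0, 100.0],
})
lifts = cluster_lifts(toy)
print(lifts)
```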

> Corresponding plots will be in cell `83` in `visualization_demo.ipynb`


# 🛒 26. Do certain states or cities consistently outperform others?


In [None]:
performance = df.groupby(['state', 'city']).agg(
    avg_sales=('sales', 'mean'),
    std_sales=('sales', 'std'),
    total_sales=('sales', 'sum')
).reset_index()

# Consistency is measured with the coefficient of variation (std / mean); lower = more consistent
performance['consistency'] = performance['std_sales'] / performance['avg_sales']

# Sort by average sales; consistency only breaks ties between equal averages
sorted_performance = performance.sort_values(by=['avg_sales', 'consistency'], ascending=[False, True])

# Display top 10 performing states/cities
top_performers = sorted_performance.head(10)
print(top_performers)

         state        city  avg_sales  std_sales  total_sales  consistency
18   Pichincha       Quito     558.56    1571.15 568679349.49         2.81
17   Pichincha     Cayambe     511.06    1365.92  28906533.92         2.67
21  Tungurahua      Ambato     363.85     977.82  41159772.88         2.69
6       Guayas       Daule     346.57     817.66  19602762.42         2.36
11        Loja        Loja     340.30     840.26  19248172.99         2.47
12    Los Rios    Babahoyo     320.71     838.38  18140255.10         2.61
4       El Oro     Machala     301.39     810.39  34094665.19         2.69
0        Azuay      Cuenca     295.81     838.06  50194045.80         2.83
5   Esmeraldas  Esmeraldas     295.64     816.65  16722039.30         2.76
7       Guayas   Guayaquil     277.51     808.96 125572185.61         2.92


### 📊 Key Insights:
From the analysis, **Pichincha** is the standout performer on average sales: **Quito** (558.56) and **Cayambe** (511.06) rank first and second, and Quito also has by far the largest total sales. The `consistency` column is the coefficient of variation (standard deviation divided by mean), so **lower values mean steadier sales**. Every city in the top 10 scores above 2, meaning the standard deviation is more than twice the mean, so row-level sales are highly variable everywhere.

Among the top 10, the most consistent cities are:

- **Daule (Guayas)**: the lowest score (2.36), with average sales of 346.57.
- **Loja (Loja)**: the second-lowest score (2.47), with average sales of 340.30.

The least consistent are:

- **Guayaquil (Guayas)**: the highest score (2.92), despite the second-largest total sales.
- **Cuenca (Azuay)** (2.83) and **Quito (Pichincha)** (2.81).

**Cayambe** (2.67) and **Ambato** (2.69) sit in the middle of the range.

Thus, **Pichincha** (especially **Quito**) is the highest-performing region in this dataset, but its lead comes from high average sales rather than from low variability.
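
Another way to read "consistently outperform" is to check whether a city keeps its rank from year to year. The sketch below ranks cities by average sales within each year on toy data; the variable names are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): average sales for three cities over two years.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014, 2014],
    'city': ['Quito', 'Ambato', 'Daule', 'Quito', 'Ambato', 'Daule'],
    'sales': [500.0, 360.0, 340.0, 550.0, 350.0, 355.0],
})

# Average sales per year and city, with cities as columns.
yearly = toy.groupby(['year', 'city'])['sales'].mean().unstack('city')

# Rank within each year (1 = best), then summarize each city's ranks across years.
ranks = yearly.rank(axis=1, ascending=False)
rank_summary = ranks.agg(['mean', 'max']).T.sort_values('mean')
print(rank_summary)
```

A city whose `max` (worst) rank stays close to its `mean` rank holds its position across years.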

> Corresponding plots will be in cell `86` in `visualization_demo.ipynb`


# 🔮 27. Which features are most correlated with sales?

In [None]:
numeric_df = df.select_dtypes(include=['number'])

correlation_matrix = numeric_df.corr()

sales_correlation = correlation_matrix['sales'].sort_values(ascending=False)

print(sales_correlation.head(10))

sales               1.00
sales_lag_7         0.93
onpromotion         0.43
rolling_mean_7      0.42
transactions        0.23
promo_last_7_days   0.18
days_to_holiday     0.09
year                0.08
is_weekend          0.05
store_nbr           0.04
Name: sales, dtype: float64


### 🔎 Observations:
- **sales_lag_7** has the highest correlation with **sales** (0.93), indicating a strong relationship with past sales.
- **onpromotion** (0.43) and **rolling_mean_7** (0.42) show moderate positive correlation, suggesting that promotional activities and recent trends have an impact on current sales.
- **transactions** (0.23) also shows a weak positive correlation, implying that more transactions are somewhat associated with higher sales.
- Features like **promo_last_7_days** (0.18), **days_to_holiday** (0.09), and **year** (0.08) have weaker correlations, suggesting less influence on sales, although there may still be some indirect effect.
- **is_weekend** (0.05) and **store_nbr** (0.04) show minimal correlation, indicating little linear association with sales. `store_nbr` is an identifier, so its coefficient has no meaningful interpretation.

### 📊 Key Insights:
- **Sales lag** is the strongest predictor, emphasizing the importance of historical sales data for forecasting.
- **Promotions** and **rolling averages** also play an important role, although not as strongly as past sales.
- **Transactions** and **holidays** contribute to a lesser extent, while factors like weekends and store numbers have minimal direct impact on sales.

These insights can guide future analysis and feature selection for improving sales prediction models.
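
The coefficients above are Pearson correlations, which only capture linear association. As a complementary check, a rank-based (Spearman) correlation can reveal monotonic but non-linear relationships. The sketch below uses toy data in which sales grow with the cube of `onpromotion`; this relationship is an assumption chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical): a monotonic but strongly non-linear relationship.
toy = pd.DataFrame({'onpromotion': [0, 1, 2, 3, 4]})
toy['sales'] = toy['onpromotion'] ** 3

# Pearson measures linear association; Spearman works on ranks.
pearson = toy.corr(method='pearson')['sales'].drop('sales')
spearman = toy.corr(method='spearman')['sales'].drop('sales')

comparison = pd.DataFrame({'pearson': pearson, 'spearman': spearman})
print(comparison.round(2))
```

A large gap between the two columns for a feature would hint that its relationship with sales is monotonic but not linear.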

> Corresponding plots will be in cell `89` in `visualization_demo.ipynb`


# 🔮 28. How do 7-day lag features (sales_lag_7) correlate with current sales?

In [None]:
lag_sales_correlation = df['sales_lag_7'].corr(df['sales'])

print(f"Correlation between sales_lag_7 and sales: {lag_sales_correlation:.2f}")

Correlation between sales_lag_7 and sales: 0.93


The correlation between the **sales_lag_7** feature and current **sales** is **0.93**, which indicates a very strong positive relationship. This means that sales from seven days earlier (the same day of the previous week) are highly predictive of current sales, suggesting that past performance has a significant influence on current sales levels.

### 📊 Key Insights:
- A high correlation of **0.93** confirms that **sales_lag_7** is a crucial feature in predicting current sales.
- This strong correlation highlights the importance of historical sales data in forecasting future performance.
  
Such a strong relationship suggests that models predicting sales can benefit from including lag features to account for trends and patterns in past sales data.
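
For reference, a 7-day lag feature is typically built within each store and product family so that values never leak from one series into another. How `sales_lag_7` was actually created is not shown in this notebook, so the following is only a sketch on toy data, with `lag_7` as an illustrative column name.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=10)
# Toy data (hypothetical): two product families in the same store.
toy = pd.concat([
    pd.DataFrame({'date': dates, 'store_nbr': 1, 'family': 'A',
                  'sales': [float(v) for v in range(1, 11)]}),
    pd.DataFrame({'date': dates, 'store_nbr': 1, 'family': 'B',
                  'sales': [float(v) for v in range(101, 111)]}),
], ignore_index=True)

# Sort so that shifting by 7 rows inside a group means shifting by 7 days.
toy = toy.sort_values(['store_nbr', 'family', 'date']).reset_index(drop=True)
toy['lag_7'] = toy.groupby(['store_nbr', 'family'])['sales'].shift(7)

# The first 7 days of each series have no lag value and are dropped before correlating.
valid = toy.dropna(subset=['lag_7'])
lag_corr = valid['lag_7'].corr(valid['sales'])
print(f"Correlation between lag_7 and sales: {lag_corr:.2f}")
```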


> Corresponding plots will be in cell `92` in `visualization_demo.ipynb`

# 🔮 29. What does the rolling mean of sales tell us about seasonality or stability?

In [None]:
rolling_mean_correlation = df['rolling_mean_7'].corr(df['sales'])

print(f"Correlation between rolling_mean_7 and sales: {rolling_mean_correlation:.2f}")

Correlation between rolling_mean_7 and sales: 0.42


The correlation between the **7-day rolling mean** and **sales** is **0.42**, indicating a moderate positive relationship. This suggests that while the rolling mean smooths out fluctuations and captures trends, it does not fully track the variability in daily sales. 

### 📊 Key Insights:
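- A correlation of **0.42** indicates that the **rolling mean** captures some of the trend, but there are still significant fluctuations in sales that the rolling average does not account for. It is also well below the 0.93 found for `sales_lag_7`, so it may be worth checking that `rolling_mean_7` was computed within each store and product family series.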
- The **rolling mean** helps highlight the **stability** of sales by smoothing out short-term spikes and drops. However, the moderate correlation suggests that external factors or shorter-term changes might still have a noticeable impact on sales.
  
In summary, the rolling mean provides a useful way to track the broader trends in sales, but it's clear that there is still some volatility or seasonality present in the data that requires further analysis.
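
To illustrate what a 7-day rolling mean does to weekly seasonality, the sketch below applies it to a toy series with a fixed weekday/weekend pattern. The `rolling_7` and `deviation` columns are assumptions and do not reproduce how `rolling_mean_7` was built.

```python
import pandas as pd

# Toy data (hypothetical): five weekdays at 10 and two weekend days at 20, repeated for 3 weeks.
weekly_pattern = [10.0, 10.0, 10.0, 10.0, 10.0, 20.0, 20.0]
toy = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=21),
    'store_nbr': 1,
    'family': 'A',
    'sales': weekly_pattern * 3,
})

toy = toy.sort_values(['store_nbr', 'family', 'date'])
# Rolling mean computed separately for each store/family series.
toy['rolling_7'] = (toy.groupby(['store_nbr', 'family'])['sales']
                       .transform(lambda s: s.rolling(7).mean()))
# What remains after removing the smoothed level: the within-week pattern.
toy['deviation'] = toy['sales'] - toy['rolling_7']

smoothed = toy['rolling_7'].dropna()
print(smoothed.round(2).unique())
print(toy['deviation'].dropna().round(2).unique())
```

Because the window spans exactly one week, the rolling mean is flat here and the weekly cycle shows up only in `deviation`; on real data, the spread of `deviation` is one possible stability measure.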

> Corresponding plots will be in cell `96` in `visualization_demo.ipynb`

# 🔮 30. Is there a correlation between oil prices (dcoilwtico) and sales behavior?

In [None]:
oil_sales_corr = df[['dcoilwtico', 'sales']].corr().iloc[0, 1]
print(f"Correlation between oil price and sales: {oil_sales_corr:.4f}")

Correlation between oil price and sales: -0.0750


The correlation between **oil prices** (`dcoilwtico`) and **sales** is **-0.0750**, which indicates a very weak negative relationship. This suggests that oil prices have a negligible impact on sales behavior, with only a small inverse correlation.

### 📊 Key Insights:
- A correlation value close to **0** indicates that there is little to no direct relationship between oil prices and sales.
- The **negative correlation** of **-0.0750** implies that if there is any effect, it is very small and inverse (i.e., higher oil prices are associated with very slightly lower sales).
  
In conclusion, **oil prices do not appear to significantly influence sales** behavior in this dataset, and other factors likely play a more substantial role in determining sales trends.

> Corresponding plots will be in cell `99` in `visualization_demo.ipynb`