In [120]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.0f}'.format)

import warnings
warnings.filterwarnings("ignore")

In [121]:
df = pd.read_csv('../data/cleaned_data.csv')

df['date'] = pd.to_datetime(df['date'])
df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion,holiday_type,locale,transferred,dcoilwtico,city,state,store_type,cluster,transactions,year,month,week,quarter,day_of_week,is_crisis,sales_lag_7,rolling_mean_7,is_weekend,is_holiday,promo_last_7_days
0,2013-01-01,1,AUTOMOTIVE,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
1,2013-01-01,1,BABY CARE,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
2,2013-01-01,1,BEAUTY,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
3,2013-01-01,1,BEVERAGES,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0
4,2013-01-01,1,BOOKS,0,0,Holiday,National,False,93,Quito,Pichincha,D,13,0,2013,1,1,1,Tuesday,0,0,0,0,1,0


## 🔍 1. How Have Total Sales Evolved Over Time?

To understand the overall business trend, we calculated the total sales per day from the dataset.

In [122]:
sales_over_time = df.groupby('date')['sales'].sum().reset_index()
sales_over_time

Unnamed: 0,date,sales
0,2013-01-01,2512
1,2013-01-02,496092
2,2013-01-03,361461
3,2013-01-04,354460
4,2013-01-05,477350
...,...,...
1679,2017-08-11,826374
1680,2017-08-12,792631
1681,2017-08-13,865640
1682,2017-08-14,760922


### Key Findings:
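- Among the rows shown, daily totals range from about 2.5K (2013-01-01, a national holiday) to over 860K (2017-08-13).
- Daily totals rise from roughly 350K-500K in early January 2013 to roughly 760K-870K in mid-August 2017, which points to an upward trend. Seasonal fluctuations are likely present (to be analyzed in later steps).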
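
One way to check the trend claim is to smooth the daily totals and compare each month with the same month a year earlier. The sketch below assumes a frame shaped like `sales_over_time` (columns `date` and `sales`); the helper name `monthly_trend` and the toy data are illustrative and not part of the notebook.

```python
import pandas as pd

def monthly_trend(daily: pd.DataFrame) -> pd.DataFrame:
    """Average the daily totals per month and add year-over-year growth."""
    monthly = (
        daily.set_index("date")["sales"]
        .resample("MS")  # month-start bins
        .mean()
        .to_frame("avg_daily_sales")
    )
    # Compare each month with the same month one year earlier
    monthly["yoy_growth"] = monthly["avg_daily_sales"].pct_change(periods=12)
    return monthly

# Toy data (hypothetical): flat daily sales of 100 in 2013 and 150 in 2014
toy = pd.DataFrame({"date": pd.date_range("2013-01-01", "2014-12-31", freq="D")})
toy["sales"] = toy["date"].dt.year.map({2013: 100.0, 2014: 150.0})
trend = monthly_trend(toy)
```

On the real data this would be called as `monthly_trend(sales_over_time)`; a mostly positive `yoy_growth` column would support the upward-trend reading.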

> Corresponding plots will be in cell `4` in `visualization_demo.ipynb`

## 🔍 2. Which products or categories contribute the most to total revenue?

Based on the total sales data, the following products or categories contribute the most to the total revenue:

In [123]:
top_products = df.groupby('family')['sales'].sum().sort_values(ascending=False).head(20)
top_products

family
GROCERY I             350,827,298
BEVERAGES             221,663,540
PRODUCE               125,447,968
CLEANING               99,421,019
DAIRY                  65,823,605
BREAD/BAKERY           42,959,924
POULTRY                32,494,451
MEATS                  31,650,996
PERSONAL CARE          25,100,482
DELI                   24,585,627
HOME CARE              16,409,522
EGGS                   15,881,196
FROZEN FOODS           14,646,940
PREPARED FOODS          8,966,728
LIQUOR,WINE,BEER        7,937,172
SEAFOOD                 2,051,636
GROCERY II              2,004,966
HOME AND KITCHEN I      1,905,076
HOME AND KITCHEN II     1,556,511
CELEBRATION               779,502
Name: sales, dtype: float64

1. **GROCERY I**: $350,827,298
2. **BEVERAGES**: $221,663,540
3. **PRODUCE**: $125,447,968
4. **CLEANING**: $99,421,019
5. **DAIRY**: $65,823,605

These five categories account for roughly 79% of total sales (about 863M of 1,097M), with **GROCERY I** alone contributing about 32%. The remaining categories contribute far less; **CELEBRATION** and **HOME AND KITCHEN II**, at the bottom of the top 20, each total under 2M.

In the analysis, we can observe that categories related to essential products (like groceries, beverages, and produce) lead in sales, which might reflect consistent consumer demand. Further analysis could explore seasonality and trends within these top categories.
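
To make "the bulk of the revenue" measurable, each category's share and the running cumulative share can be computed directly. This is a minimal sketch; the helper name `revenue_share` and the toy frame are illustrative, and it assumes only the `family` and `sales` columns used above.

```python
import pandas as pd

def revenue_share(df: pd.DataFrame, by: str = "family") -> pd.DataFrame:
    """Total sales per group with share and cumulative share of the grand total."""
    totals = df.groupby(by)["sales"].sum().sort_values(ascending=False)
    out = totals.to_frame("total_sales")
    out["share"] = out["total_sales"] / out["total_sales"].sum()
    out["cumulative_share"] = out["share"].cumsum()
    return out

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "family": ["A", "A", "B", "C"],
    "sales": [60.0, 20.0, 15.0, 5.0],
})
shares = revenue_share(toy)
```

The same helper works for `store_nbr`, `city`, or `state` by changing the `by` argument.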

> Corresponding plots will be in cell `6` in `visualization_demo.ipynb`

## 🔍 3. Which stores, cities, or states are the top performers in terms of revenue?

In [124]:
top_stores = df.groupby('store_nbr')['sales'].sum().sort_values(ascending=False)
top_cities = df.groupby('city')['sales'].sum().sort_values(ascending=False)
top_regions = df.groupby('state')['sales'].sum().sort_values(ascending=False)

print("Top Stores by Revenue:")
print(top_stores.head())  

print("\n \nTop Cities by Revenue:")
print(top_cities.head()) 

print("\n \nTop States by Revenue:")
print(top_regions.head()) 

Top Stores by Revenue:
store_nbr
44   63,356,137
45   55,689,022
47   52,024,476
3    51,533,528
49   44,346,823
Name: sales, dtype: float64

 
Top Cities by Revenue:
city
Quito           568,679,349
Guayaquil       125,572,186
Cuenca           50,194,046
Ambato           41,159,773
Santo Domingo    36,617,572
Name: sales, dtype: float64

 
Top States by Revenue:
state
Pichincha                        597,585,883
Guayas                           168,649,985
Azuay                             50,194,046
Tungurahua                        41,159,773
Santo Domingo de los Tsachilas    36,617,572
Name: sales, dtype: float64


Based on the total sales data, the following stores, cities, and regions are the top performers:

### **Top Stores by Revenue:**
1. **Store 44**: $63,356,137
2. **Store 45**: $55,689,022
3. **Store 47**: $52,024,476
4. **Store 3**: $51,533,528
5. **Store 49**: $44,346,823

### **Top Cities by Revenue:**
1. **Quito**: $568,679,349
2. **Guayaquil**: $125,572,186
3. **Cuenca**: $50,194,046
4. **Ambato**: $41,159,773
5. **Santo Domingo**: $36,617,572

### **Top States by Revenue:**
1. **Pichincha**: $597,585,883
2. **Guayas**: $168,649,985
3. **Azuay**: $50,194,046
4. **Tungurahua**: $41,159,773
5. **Santo Domingo de los Tsachilas**: $36,617,572

These top performers highlight the most significant contributors to revenue, with **Quito** leading at the city level and **Pichincha** being the highest-performing state. In terms of stores, Store 44 generates the highest revenue.

This analysis can help identify key areas for growth and focus, particularly in high-revenue cities and states.

> Corresponding plots will be in cells `8`, `10`, and `12` in `visualization_demo.ipynb`

## 🔍 4. What is the average order size across stores, regions, and categories?

In [125]:
df['transactions'].dtype

dtype('float64')

In [126]:
print((df['transactions'] == 0).sum())

249117


In [127]:
df[['sales', 'transactions']].describe()

Unnamed: 0,sales,transactions
count,3054348,3054348
mean,359,1559
std,1107,1036
min,0,0
25%,0,931
50%,11,1332
75%,196,1980
max,124717,8359


In [128]:
zero_transactions = df[df['transactions'] == 0]
zero_sales_with_zero_transactions = zero_transactions[zero_transactions['sales'] == 0]

# Check if all zero transactions have zero sales
all_match = len(zero_transactions) == len(zero_sales_with_zero_transactions)
print("All zero transactions have zero sales:", all_match)

All zero transactions have zero sales: False


In [129]:
df_valid = df[~((df['transactions'] == 0) & (df['sales'] > 0))]

avg_order_size_store = df_valid.groupby('store_nbr').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)
avg_order_size_region = df_valid.groupby('state').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)
avg_order_size_category = df_valid.groupby('family').apply(lambda x: x['sales'].sum() / x['transactions'].sum()).sort_values(ascending=False)

print("Average Order Size by Store:")
print(avg_order_size_store.head())  

print("\nAverage Order Size by Region:")
print(avg_order_size_region.head()) 

print("\nAverage Order Size by Category:")
print(avg_order_size_category.head())  

Average Order Size by Store:
store_nbr
51   0
42   0
21   0
29   0
52   0
dtype: float64

Average Order Size by Region:
state
Azuay      0
Manabi     0
El Oro     0
Pastaza    0
Los Rios   0
dtype: float64

Average Order Size by Category:
family
GROCERY I   2
BEVERAGES   2
PRODUCE     1
CLEANING    1
DAIRY       0
dtype: float64
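
The zeros in the store and state outputs look like an artifact rather than a finding. Two things are probably at work, both assumptions that have not been verified against the raw data: the global `display.float_format` rounds every value to a whole number, and `transactions` appears to be a per-store, per-date count that is repeated on every family row, so summing it across rows counts each store-day once per family. A minimal sketch that counts each store-day once follows; the helper name `avg_order_size` and the toy frame are illustrative.

```python
import pandas as pd

def avg_order_size(df: pd.DataFrame, by: str) -> pd.Series:
    """Sales per transaction for a store-level grouping (store_nbr, city, state).

    Assumes `transactions` is a per-(date, store_nbr) count that is repeated
    on every family row of that store and date.
    """
    # Total sales per group, summed over all families
    sales = df.groupby(by)["sales"].sum()
    # Keep one row per store-day so each day's transactions are counted once
    store_days = df.drop_duplicates(subset=["date", "store_nbr"])
    transactions = store_days.groupby(by)["transactions"].sum()
    ratio = sales / transactions
    # Drop groups with no recorded transactions (ratio would be inf or NaN)
    return ratio[transactions > 0].sort_values(ascending=False)

# Toy data (hypothetical): one store, two days, two families per day
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"]),
    "store_nbr": [1, 1, 1, 1],
    "family": ["A", "B", "A", "B"],
    "sales": [10.0, 30.0, 20.0, 40.0],
    "transactions": [4.0, 4.0, 6.0, 6.0],
})
order_size = avg_order_size(toy, by="store_nbr")
naive = toy["sales"].sum() / toy["transactions"].sum()
```

In the toy frame the deduplicated ratio is 10 while the row-wise ratio is 5, which shows how the repeated column deflates the result. Under the same assumption, `transactions` is not split by family, so a per-family order size cannot be derived from this column; printing with `round(2)` or a less aggressive float format would also keep small ratios visible.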


In [130]:
df.columns

Index(['date', 'store_nbr', 'family', 'sales', 'onpromotion', 'holiday_type',
       'locale', 'transferred', 'dcoilwtico', 'city', 'state', 'store_type',
       'cluster', 'transactions', 'year', 'month', 'week', 'quarter',
       'day_of_week', 'is_crisis', 'sales_lag_7', 'rolling_mean_7',
       'is_weekend', 'is_holiday', 'promo_last_7_days'],
      dtype='object')

## ⏳ 5. Are there noticeable weekly, monthly, or quarterly seasonality patterns in sales?

### What are the trends in sales per day of the week?


In [131]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [132]:
avg_sales_by_day = df.groupby('day_of_week')['sales'].mean().reindex(day_order)
avg_sales_by_day

day_of_week
Monday      348
Tuesday     320
Wednesday   331
Thursday    287
Friday      327
Saturday    435
Sunday      465
Name: sales, dtype: float64

#### Key Observations:
- **Weekend Effect**: Average sales are noticeably higher on **Saturday** (435) and **Sunday** (465) than on any weekday, with Sunday the highest.
- **Weekday Pattern**: Weekday averages range from 287 to 348, with **Thursday** (287) the lowest.
- **Weekday Consistency**: **Monday** (348), **Tuesday** (320), **Wednesday** (331), and **Friday** (327) fall within a narrow band; **Thursday** is the only clear dip.
  
These figures suggest that sales are highest on weekends, potentially due to increased customer activity, while weekdays (especially Thursday) are weaker.

### What are the trends in sales per week?

In [133]:
weekly_avg_sales = df.groupby('week')['sales'].mean().reset_index()
weekly_avg_sales

Unnamed: 0,week,sales
0,1,409
1,2,348
2,3,338
3,4,329
4,5,344
5,6,320
6,7,310
7,8,312
8,9,358
9,10,359


#### Key Observations:
- **Seasonal Pattern**: Sales generally fluctuate throughout the year, with some notable peaks and valleys.
- **Peak Sales Weeks**: Weeks **51** (484) and **52** (483) show the highest sales, which could be related to the end-of-year sales spikes (e.g., holiday season).
- **Lowest Sales Weeks**: Week **34** (307) experienced the lowest average sales, suggesting a potential dip in sales during that period.
- **Other Strong Weeks**: Weeks **1** (409), **45** (407), **49** (417), and **36** (400) also show relatively high averages, indicating strong performance at several points in the year.

These trends suggest that there may be seasonal or external factors (such as holidays or promotions) that cause sales to rise or fall in certain weeks. Identifying and aligning marketing or sales strategies with these periods can be beneficial.

### What are the trends in sales per month?

In [134]:
monthly_avg = df.groupby('month')['sales'].mean()
monthly_avg

month
1    342
2    321
3    352
4    341
5    346
6    353
7    376
8    337
9    362
10   362
11   377
12   457
Name: sales, dtype: float64

#### Key Observations:
- **Strong End-of-Year Sales**: The highest sales occur in **December** (457), likely due to the holiday season and increased consumer spending.
- **Peak in Mid-Year**: **July** (376) also sees a significant rise in sales, potentially related to mid-year promotions or seasonal trends.
- **Dip in Early Months**: **February** (321) experiences the lowest sales, possibly due to lower consumer activity after the holiday season.
- **Stable Performance**: Other months like **March** (352), **June** (353), and **November** (377) show fairly consistent and strong performance.

These trends suggest a potential seasonal pattern where sales peak in the second half of the year, especially during holidays or mid-year events. Analyzing external factors like promotions or holiday schedules could help explain these fluctuations.

### What are the trends in sales per quarter?

In [135]:
quarterly_avg = df.groupby(['quarter', 'year'])['sales'].mean()
quarterly_avg

quarter  year
1        2013   196
         2014   320
         2015   276
         2016   426
         2017   476
2        2013   211
         2014   243
         2015   334
         2016   455
         2017   486
3        2013   212
         2014   325
         2015   417
         2016   420
         2017   482
4        2013   248
         2014   405
         2015   458
         2016   485
Name: sales, dtype: float64

#### Key Observations:
- **Growth in Sales Over Time**: Average sales rise from 2013 to 2017 in every quarter, with one exception: Quarter 1 dips from 320 in 2014 to 276 in 2015.
  - Quarter 1 in **2017** (476) and Quarter 2 in **2017** (486) are the highest values recorded for those quarters.
- **Quarterly Performance**:
  - **Quarter 1** is the weakest quarter in 2013 (196), 2015 (276), and 2017 (476), although its level more than doubles over the period.
  - **Quarter 4** is the strongest quarter in every year with complete data (2013-2016), peaking at **485** in **2016**. There are no Quarter 4 records for **2017**.
  - **Quarter 3** grows from 212 in 2013 to **482** in **2017**, but it is not the top quarter in any year, and its 2017 figure covers only part of the quarter because the data end in mid-August.
  
These figures suggest a growth trajectory in average sales over the years, with the highest levels in 2016 and 2017, which may reflect business expansion, new product offerings, or other positive changes within the company.

## ⏳ 6. How do sales differ on weekdays versus weekends?

In [136]:
sales_comparison = df.groupby('is_weekend')['sales'].agg(['sum', 'mean', 'count']).rename(index={True: 'Weekend', False: 'Weekday'})

sales_comparison.columns = ['Total Sales', 'Average Sales per Record', 'Number of Records']
sales_comparison

Unnamed: 0_level_0,Total Sales,Average Sales per Record,Number of Records
is_weekend,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Weekday,701364353,322,2175822
Weekend,395210391,450,878526


#### 🔍 Insights:
- Each record is one store, family, and date combination, so the mean and count describe records rather than calendar days.
- **Weekends** show a **higher average sales per record** (450 versus 322, about 40% higher), which points to more intense spending on weekends.
- **Weekdays** contribute more to total sales (about 701M versus 395M) because they account for roughly 71% of the records, but the **per-record performance is stronger on weekends**.
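
The comparison above is computed over store-family-date rows. A calendar-day version first sums sales per date and then compares weekdays with weekends. The sketch below assumes the `date`, `is_weekend`, and `sales` columns shown earlier; the helper name `daily_weekend_comparison` and the toy data are illustrative.

```python
import pandas as pd

def daily_weekend_comparison(df: pd.DataFrame) -> pd.DataFrame:
    """Compare weekday and weekend sales using one total per calendar day."""
    # Collapse all stores and families into a single total per date
    daily = df.groupby(["date", "is_weekend"], as_index=False)["sales"].sum()
    summary = daily.groupby("is_weekend")["sales"].agg(["sum", "mean", "count"])
    # 0/1 and False/True both match these keys
    summary = summary.rename(index={0: "Weekday", 1: "Weekend"})
    summary.columns = ["Total Sales", "Average Sales per Day", "Number of Days"]
    return summary

# Toy data (hypothetical): a Friday and a Saturday, two rows each
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-06", "2017-01-06", "2017-01-07", "2017-01-07"]),
    "is_weekend": [0, 0, 1, 1],
    "sales": [10.0, 20.0, 30.0, 50.0],
})
summary = daily_weekend_comparison(toy)
```

Here `Number of Days` really counts days, so the average is total sales per calendar day rather than per record.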

## ⏳ 7. Are sales peaking during certain months, holidays, or quarters of the year?

#### 1. Monthly Sales Peaks

In [172]:
monthly_sales = df.groupby('month')['sales'].mean().sort_values(ascending=False)
print("Average Sales by Month (Highest to Lowest):\n", monthly_sales)

Average Sales by Month (Highest to Lowest):
 month
12   457
11   377
7    376
10   362
9    362
6    353
3    352
5    346
1    342
4    341
8    337
2    321
Name: sales, dtype: float64


**Insights:**
- Sales **peak in December** (457), indicating strong end-of-year demand, likely driven by holidays and promotions.
- **November** (377) and **July** (376) also show high sales, suggesting seasonal boosts during those months.
- **February** has the **lowest average sales** (321). Because this is a per-record mean, the shorter month does not explain it; a post-holiday lull is one possible cause.

#### 2. Holiday vs. Non-Holiday Sales

In [173]:
holiday_sales = df.groupby('is_holiday')['sales'].mean()
holiday_sales.index = ['Non-Holiday', 'Holiday']
print("Average Sales:\n", holiday_sales)

Average Sales:
 Non-Holiday   352
Holiday       394
Name: sales, dtype: float64


**Insights:**
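- Sales are **higher during holidays**, with an average of **394** compared to **352** on non-holidays.
- This suggests that holidays are associated with higher sales, likely due to increased consumer activity, promotions, or special events. Part of the gap comes from crisis-flagged records, all of which are also flagged as holidays: in the crisis breakdown by holiday later in this notebook, the holiday average without them is 381.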

#### 3. Specific Holidays

In [174]:
specific_holidays = df[df['is_holiday'] == 1].groupby('holiday_type')['sales'].mean().sort_values(ascending=False)
print("Average Sales by Holiday Type:\n", specific_holidays)

Average Sales by Holiday Type:
 holiday_type
Additional   488
Transfer     468
Bridge       447
Event        426
Work Day     372
Holiday      358
Name: sales, dtype: float64


**Insights:**
- **Additional holidays** generate the highest average sales (**488**), followed by **Transfer** and **Bridge** holidays.
- These spikes may indicate extended weekends or special shopping days where promotions are common.
- Even regular **Work Days** labeled as holidays see a boost compared to non-holiday averages.
- **Named "Holiday"** days have the lowest holiday-type sales, suggesting they may fall on less commercially significant days.

This breakdown helps identify which types of holidays drive the most consumer spending.

#### 4. Yearly + Quarterly Breakdown

In [175]:
qtr_year = df.groupby(['quarter', 'year'])['sales'].mean().unstack().fillna(0)
print("Quarterly Sales by Year:\n", qtr_year)

Quarterly Sales by Year:
 year     2013  2014  2015  2016  2017
quarter                              
1         196   320   276   426   476
2         211   243   334   455   486
3         212   325   417   420   482
4         248   405   458   485     0


The following table shows the **average sales per quarter** for each year:

| Quarter | 2013 | 2014 | 2015 | 2016 | 2017 |
|---------|------|------|------|------|------|
| Q1      | 196  | 320  | 276  | 426  | 476  |
| Q2      | 211  | 243  | 334  | 455  | 486  |
| Q3      | 212  | 325  | 417  | 420  | 482  |
| Q4      | 248  | 405  | 458  | 485  | n/a  |

**Insights:**
- Quarterly averages **rise over the years** in almost every case; the one decline is Q1, which falls from 320 in 2014 to 276 in 2015.
- **2017 shows strong Q1 to Q3 performance.** The `0` printed for Q4 2017 comes from `fillna(0)`: the data end in mid-August 2017, so Q4 2017 has no records and Q3 2017 is only partially covered.
- **Q4** has the **highest average in every year with complete data** (2013-2016), aligning with end-of-year events and holiday shopping seasons.
- The **largest year-over-year increases** are Q4 from 2013 to 2014 (+157), Q1 from 2015 to 2016 (+150), and Q1 from 2013 to 2014 (+124).

This trend analysis can be useful for planning inventory, staffing, and promotions based on seasonal peaks.
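
Year-over-year changes are easier to read when they are computed rather than eyeballed, and when missing quarters stay missing instead of being filled with `0`. The sketch below assumes the `quarter`, `year`, and `sales` columns used above; the helper name `quarterly_yoy` and the toy data are illustrative.

```python
import pandas as pd

def quarterly_yoy(df: pd.DataFrame):
    """Quarter-by-year mean sales plus year-over-year change in percent."""
    table = df.groupby(["quarter", "year"])["sales"].mean().unstack("year")
    # Quarter-year combinations without data stay NaN (no fillna(0))
    previous = table.shift(1, axis="columns")
    yoy = (table - previous) / previous * 100
    return table, yoy

# Toy data (hypothetical): Q4 has no 2017 records
toy = pd.DataFrame({
    "quarter": [1, 1, 4],
    "year": [2016, 2017, 2016],
    "sales": [100.0, 150.0, 200.0],
})
table, yoy = quarterly_yoy(toy)
```

Leaving the gap as `NaN` keeps a quarter without records from being read as a quarter with zero sales, and it propagates naturally into the growth table.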

## ⏳ 8. Which months consistently generate peak sales?

In [177]:
monthly_sales = df.groupby(['year', 'month'])['sales'].sum().reset_index()

monthly_sales_pivot = monthly_sales.pivot(index='month', columns='year', values='sales')

monthly_avg = monthly_sales.groupby('month')['sales'].mean().sort_values(ascending=False)

print("Average Sales by Month (Descending):")
print(monthly_avg)

Average Sales by Month (Descending):
month
12   25,470,522
7    21,598,791
11   20,316,398
6    20,227,237
10   20,020,095
5    19,710,506
3    19,445,697
9    19,368,420
1    18,888,430
4    18,482,017
8    16,694,475
2    16,127,446
Name: sales, dtype: float64


**Insights:**
- **December** has the **highest average monthly total** (about 25.5M), roughly 18% above the next month, likely due to the holiday shopping season.
- **July** (21.6M) and **November** (20.3M) come next, which may indicate mid-year and pre-holiday peaks.
- **February** (16.1M) and **August** (16.7M) have the **lowest averages**. February is a shorter month, and the data end in mid-August 2017, so that year's August total is partial; both effects pull these figures down independently of demand.

These trends can help identify the most profitable months for promotional campaigns, staffing, and inventory planning.
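
An average across years does not show on its own whether a month peaks consistently. One way to check is to rank the months within each year and count how often each month lands near the top. This is a minimal sketch under the same column assumptions as the cell above (`year`, `month`, `sales`); the helper name `month_rank_by_year`, the `top_n` parameter, and the toy data are illustrative.

```python
import pandas as pd

def month_rank_by_year(df: pd.DataFrame, top_n: int = 3):
    """Rank months by total sales within each year and count top-N finishes."""
    totals = df.groupby(["year", "month"])["sales"].sum().unstack("year")
    # Rank 1 is the best month of that year; months without data stay NaN
    ranks = totals.rank(ascending=False, axis="index")
    # Number of years in which each month ranks within the top N
    top_counts = (ranks <= top_n).sum(axis="columns")
    return ranks, top_counts

# Toy data (hypothetical): three months in two years
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "month": [1, 2, 12, 1, 2, 12],
    "sales": [10.0, 5.0, 30.0, 20.0, 8.0, 25.0],
})
ranks, top_counts = month_rank_by_year(toy, top_n=1)
```

Partial months (such as the last month of the dataset) will rank lower than they would with full data, so they are worth excluding before drawing conclusions.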

## 💸 9. What impact do promotions have on sales volume?

In [185]:
df['promotion_status'] = df['onpromotion'].apply(lambda x: 'On Promotion' if x > 0 else 'Not On Promotion')
print(df['promotion_status'].value_counts())

promotion_status
Not On Promotion    2428528
On Promotion         625820
Name: count, dtype: int64


In [186]:
avg_sales_by_promotion = df.groupby('promotion_status')['sales'].mean().reset_index()
avg_sales_by_promotion

Unnamed: 0,promotion_status,sales
0,Not On Promotion,158
1,On Promotion,1140


**Insights:**
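- Average sales are far higher for records **on promotion** (**1,140**) than for records not on promotion (**158**), roughly a sevenfold difference.
- The gap suggests that promotions are associated with higher sales, but it likely overstates their direct effect: the family breakdown later in this notebook shows high-volume families such as GROCERY I and BEVERAGES averaging well above 1,000 per record even without promotion, so part of the gap may reflect which families are promoted.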

These insights can help in strategizing promotions to maximize sales during peak periods.

## 💸 10. Is there a cumulative effect of promotions (e.g., last 7 days of promo)?

In [191]:
avg_sales_by_promo_7_days = df.groupby('promo_last_7_days')['sales'].mean().reset_index()
avg_sales_by_promo_7_days

Unnamed: 0,promo_last_7_days,sales
0,0,198
1,1,237
2,2,294
3,3,338
4,4,355
...,...,...
905,1497,5
906,1521,29
907,1524,6825
908,1545,2275


In [189]:
sales_with_promo = df[df['promo_last_7_days'] > 0]['sales'].mean()
sales_without_promo = df[df['promo_last_7_days'] == 0]['sales'].mean()

print(f"Average Sales with Promotion in Last 7 Days: {sales_with_promo}")
print(f"Average Sales without Promotion in Last 7 Days: {sales_without_promo}")

Average Sales with Promotion in Last 7 Days: 490.6034945447221
Average Sales without Promotion in Last 7 Days: 198.34919114678834


**Insights:**
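- Records with at least one promotion in the last 7 days average **490.60** in sales, compared with **198.35** for records with none.
- Average sales also rise step by step across the lowest values of `promo_last_7_days` (198, 237, 294, 338, 355 for values 0 through 4), which is consistent with a **cumulative effect of promotions**. Values at the high end of the range are noisy (for example, 5 at 1,497 and 6,825 at 1,524), probably because few records share them.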

This insight can guide future marketing strategies by highlighting the importance of recent promotional efforts in boosting sales.
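
Grouping by every distinct value of `promo_last_7_days` produces hundreds of rows, many of them backed by very few records. Binning the column gives a steadier picture. The sketch below assumes `promo_last_7_days` is a non-negative count, which the displayed values suggest; the bucket edges, labels, and helper name `sales_by_promo_bucket` are hypothetical choices.

```python
import pandas as pd

def sales_by_promo_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Mean sales and record count per bucket of recent promotion activity."""
    edges = [0, 1, 8, 50, float("inf")]   # hypothetical cut points
    labels = ["0", "1-7", "8-49", "50+"]
    # right=False makes each bucket include its lower edge: [0, 1), [1, 8), ...
    buckets = pd.cut(df["promo_last_7_days"], bins=edges, labels=labels, right=False)
    return df.groupby(buckets, observed=True)["sales"].agg(["mean", "count"])

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "promo_last_7_days": [0, 0, 3, 10],
    "sales": [10.0, 20.0, 60.0, 100.0],
})
bucketed = sales_by_promo_bucket(toy)
```

Reporting `count` next to `mean` shows how much data supports each bucket, which matters most at the sparse high end.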

## 💸 11. Are there specific families or stores where promotions are more effective?

In [192]:
avg_sales_by_family = df.groupby(['family', 'promotion_status'])['sales'].mean().sort_values(ascending=False).reset_index()
avg_sales_by_family

Unnamed: 0,family,promotion_status,sales
0,GROCERY I,On Promotion,4427
1,BEVERAGES,On Promotion,3225
2,GROCERY I,Not On Promotion,2717
3,PRODUCE,On Promotion,2435
4,BEVERAGES,Not On Promotion,1290
...,...,...,...
60,HARDWARE,Not On Promotion,1
61,SCHOOL AND OFFICE SUPPLIES,Not On Promotion,1
62,HOME APPLIANCES,Not On Promotion,0
63,BABY CARE,Not On Promotion,0


The table below shows the average sales for each family, split by promotion status. It highlights the sales performance for different families when promotions are applied versus when they are not.

| Family                        | Promotion Status   | Average Sales |
|-------------------------------|--------------------|---------------|
| GROCERY I                      | On Promotion       | 4,427         |
| BEVERAGES                      | On Promotion       | 3,225         |
| GROCERY I                      | Not On Promotion   | 2,717         |
| PRODUCE                        | On Promotion       | 2,435         |
| BEVERAGES                      | Not On Promotion   | 1,290         |
| HARDWARE                       | Not On Promotion   | 1             |
| SCHOOL AND OFFICE SUPPLIES     | Not On Promotion   | 1             |
| HOME APPLIANCES                | Not On Promotion   | 0             |
| BABY CARE                      | Not On Promotion   | 0             |
| BOOKS                          | Not On Promotion   | 0             |

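Comparing the two rows for the same family gives a first read on responsiveness: **BEVERAGES** averages 3,225 on promotion versus 1,290 off promotion (about 2.5x), and **GROCERY I** averages 4,427 versus 2,717 (about 1.6x). Sorting by raw average mostly reflects family size, so the within-family ratio is the more useful measure of promotion response.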
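
Responsiveness is easier to judge when the on-promotion and off-promotion averages sit side by side for each family. A minimal sketch follows, assuming the `family`, `onpromotion`, and `sales` columns used above; the helper name `promo_lift`, the output column names, and the toy data are illustrative.

```python
import pandas as pd

def promo_lift(df: pd.DataFrame, by: str = "family") -> pd.DataFrame:
    """Mean sales on and off promotion per group, plus their ratio (lift)."""
    on_promo = (df["onpromotion"] > 0).rename("on_promo")
    means = df.groupby([by, on_promo])["sales"].mean().unstack("on_promo")
    # Make sure both columns exist even if one status never occurs
    means = means.reindex(columns=[False, True])
    means = means.rename(columns={False: "not_on_promo", True: "on_promo"})
    # Skip the ratio where the off-promotion baseline is zero or missing
    baseline = means["not_on_promo"].where(means["not_on_promo"] > 0)
    means["lift"] = means["on_promo"] / baseline
    return means.sort_values("lift", ascending=False)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "family": ["A", "A", "A", "B", "B"],
    "onpromotion": [0, 0, 2, 0, 1],
    "sales": [10.0, 30.0, 60.0, 100.0, 150.0],
})
lift = promo_lift(toy)
```

Passing `by="store_nbr"` gives the same comparison per store. A high lift is still only an association, since promoted days may differ from other days in ways this table does not capture.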

In [193]:
avg_sales_by_store = df.groupby(['store_nbr', 'promotion_status'])['sales'].mean().sort_values(ascending=False).reset_index()
avg_sales_by_store

Unnamed: 0,store_nbr,promotion_status,sales
0,44,On Promotion,2782
1,45,On Promotion,2496
2,3,On Promotion,2396
3,47,On Promotion,2321
4,49,On Promotion,2136
...,...,...,...
103,29,Not On Promotion,28
104,42,Not On Promotion,28
105,21,Not On Promotion,26
106,22,Not On Promotion,13


The table below shows the average sales for each store, split by promotion status. It highlights how promotions impact sales across different stores.

| Store Number | Promotion Status | Average Sales |
|--------------|------------------|---------------|
| 44           | On Promotion     | 2,782         |
| 45           | On Promotion     | 2,496         |
| 3            | On Promotion     | 2,396         |
| 47           | On Promotion     | 2,321         |
| 49           | On Promotion     | 2,136         |
| 29           | Not On Promotion | 28            |
| 42           | Not On Promotion | 28            |
| 21           | Not On Promotion | 26            |
| 22           | Not On Promotion | 13            |
| 52           | Not On Promotion | 4             |

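The top of this ranking consists of the same stores that lead total revenue (44, 45, 3, 47, and 49), so it mainly reflects store size. Judging where promotions are more effective requires comparing each store's on-promotion average with its own off-promotion average. At the other end, stores such as **Store 52** (4) and **Store 22** (13) record very low average sales without promotions.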

## 🌍 12. Crisis Impact Analysis

### Crisis Impact by transactions

In [143]:
avg_sales_transactions_crisis = df.groupby('is_crisis')[['sales', 'transactions']].mean().reset_index()
avg_sales_transactions_crisis

Unnamed: 0,is_crisis,sales,transactions
0,0,357,1557
1,1,495,1649


### Crisis Impact by Store Type

In [144]:
avg_sales_by_store_crisis = df.groupby(['store_type', 'is_crisis'])['sales'].mean().reset_index()
avg_transactions_by_store_crisis = df.groupby(['store_type', 'is_crisis'])['transactions'].mean().reset_index()


print(avg_sales_by_store_crisis)
print(avg_transactions_by_store_crisis)

  store_type  is_crisis  sales
0          A          0    705
1          A          1    907
2          B          0    325
3          B          1    505
4          C          0    196
5          C          1    268
6          D          0    350
7          D          1    490
8          E          0    268
9          E          1    420
  store_type  is_crisis  transactions
0          A          0         2,859
1          A          1         2,837
2          B          0         1,512
3          B          1         1,702
4          C          0           981
5          C          1         1,062
6          D          0         1,526
7          D          1         1,617
8          E          0         1,017
9          E          1         1,221


### Crisis Impact by promotions

In [145]:
avg_sales_by_promotion_crisis = df.groupby(['is_crisis', 'onpromotion'])['sales'].mean().reset_index()


print(avg_sales_by_promotion_crisis)

     is_crisis  onpromotion  sales
0            0            0    159
1            0            1    470
2            0            2    668
3            0            3    881
4            0            4    990
..         ...          ...    ...
590          1          702  6,825
591          1          710  5,948
592          1          717  6,262
593          1          718  6,712
594          1          720  6,154

[595 rows x 3 columns]


### Crisis Impact by holiday

In [146]:
avg_sales_by_holiday_crisis = df.groupby(['is_crisis', 'is_holiday'])['sales'].mean().reset_index()


print(avg_sales_by_holiday_crisis)

   is_crisis  is_holiday  sales
0          0           0    352
1          0           1    381
2          1           1    495


### Crisis Impact weekly and monthly

In [147]:
avg_sales_by_month_crisis = df.groupby(['is_crisis', 'month'])['sales'].mean().reset_index()
avg_sales_by_week_crisis = df.groupby(['is_crisis', 'week'])['sales'].mean().reset_index()


print(avg_sales_by_month_crisis)
print(avg_sales_by_week_crisis)

    is_crisis  month  sales
0           0      1    342
1           0      2    321
2           0      3    352
3           0      4    321
4           0      5    332
5           0      6    353
6           0      7    376
7           0      8    337
8           0      9    362
9           0     10    362
10          0     11    377
11          0     12    457
12          1      4    523
13          1      5    468
    is_crisis  week  sales
0           0     1    409
1           0     2    348
2           0     3    338
3           0     4    329
4           0     5    344
5           0     6    320
6           0     7    310
7           0     8    312
8           0     9    358
9           0    10    359
10          0    11    343
11          0    12    338
12          0    13    361
13          0    14    350
14          0    15    307
15          0    16    306
16          0    17    305
17          0    18    352
18          0    19    298
19          0    20    332
20          0

### Crisis Impact by transactions, sales

In [148]:
avg_transactions_sales_crisis = df.groupby('is_crisis')[['transactions', 'sales']].mean().reset_index()


print(avg_transactions_sales_crisis)

   is_crisis  transactions  sales
0          0         1,557    357
1          1         1,649    495


### Crisis Impact by family

In [149]:
avg_sales_by_family_crisis = df.groupby(['family', 'is_crisis'])['sales'].mean().reset_index()


print(avg_sales_by_family_crisis)

                        family  is_crisis  sales
0                   AUTOMOTIVE          0      6
1                   AUTOMOTIVE          1      7
2                    BABY CARE          0      0
3                    BABY CARE          1      0
4                       BEAUTY          0      4
..                         ...        ...    ...
61                     PRODUCE          1  2,265
62  SCHOOL AND OFFICE SUPPLIES          0      3
63  SCHOOL AND OFFICE SUPPLIES          1      9
64                     SEAFOOD          0     22
65                     SEAFOOD          1     24

[66 rows x 3 columns]


### Crisis Impact by city and state

In [150]:
avg_sales_by_city_crisis = df.groupby(['city', 'is_crisis'])['sales'].mean().reset_index()


avg_sales_by_state_crisis = df.groupby(['state', 'is_crisis'])['sales'].mean().reset_index()


print(avg_sales_by_city_crisis)
print(avg_sales_by_state_crisis)

             city  is_crisis  sales
0          Ambato          0    363
1          Ambato          1    429
2        Babahoyo          0    319
3        Babahoyo          1    417
4         Cayambe          0    509
5         Cayambe          1    636
6          Cuenca          0    293
7          Cuenca          1    434
8           Daule          0    344
9           Daule          1    505
10      El Carmen          0    199
11      El Carmen          1    269
12     Esmeraldas          0    295
13     Esmeraldas          1    348
14       Guaranda          0    234
15       Guaranda          1    305
16      Guayaquil          0    275
17      Guayaquil          1    400
18         Ibarra          0    205
19         Ibarra          1    267
20      Latacunga          0    190
21      Latacunga          1    248
22       Libertad          0    275
23       Libertad          1    389
24           Loja          0    340
25           Loja          1    378
26        Machala          0

### Crisis Impact on Rolling Mean and Lagged Sales

In [151]:
avg_sales_lag_7_crisis = df.groupby('is_crisis')['sales_lag_7'].mean().reset_index()
avg_rolling_mean_7_crisis = df.groupby('is_crisis')['rolling_mean_7'].mean().reset_index()


print(avg_sales_lag_7_crisis)
print(avg_rolling_mean_7_crisis)

   is_crisis  sales_lag_7
0          0          355
1          1          492
   is_crisis  rolling_mean_7
0          0             356
1          1             495


### Crisis and Store Cluster Performance

In [152]:
avg_sales_by_cluster_crisis = df.groupby(['cluster', 'is_crisis'])['sales'].mean().reset_index()
avg_transactions_by_cluster_crisis = df.groupby(['cluster', 'is_crisis'])['transactions'].mean().reset_index()


print(avg_sales_by_cluster_crisis)
print(avg_transactions_by_cluster_crisis)

    cluster  is_crisis  sales
0         1          0    325
1         1          1    432
2         2          0    259
3         2          1    390
4         3          0    194
5         3          1    254
6         4          0    297
7         4          1    338
8         5          0  1,113
9         5          1  1,510
10        6          0    341
11        6          1    533
12        7          0    138
13        7          1    221
14        8          0    644
15        8          1    896
16        9          0    274
17        9          1    351
18       10          0    255
19       10          1    375
20       11          0    602
21       11          1    814
22       12          0    322
23       12          1    512
24       13          0    322
25       13          1    548
26       14          0    708
27       14          1    853
28       15          0    198
29       15          1    257
30       16          0    236
31       16          1    424
32       1

## 📅 13. Holidays & Events

### How do sales differ on holidays vs. non-holidays overall?

In [153]:
holiday_sales_comparison = df.groupby('is_holiday')['sales'].mean().reset_index()
holiday_sales_comparison

Unnamed: 0,is_holiday,sales
0,0,352
1,1,394


In [154]:

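# Mean sales on non-holiday (0) and holiday (1) rows
holiday_sales_comparison = df.groupby('is_holiday')['sales'].mean()

# Percent difference of the holiday mean relative to the non-holiday mean
percent_difference = ((holiday_sales_comparison[1] - holiday_sales_comparison[0]) / holiday_sales_comparison[0]) * 100

# abs() keeps the printed magnitude positive when the wording switches to "lower"
print(f"Holiday sales are {abs(percent_difference):.2f}% {'higher' if percent_difference > 0 else 'lower'} than non-holiday sales.")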


Holiday sales are 11.84% higher than non-holiday sales.


In [155]:

# Restrict to holiday rows and average sales per holiday type
holiday_df = df[df['is_holiday'] == 1]
avg_sales_by_holiday_type = holiday_df.groupby('holiday_type')['sales'].mean()

# Baseline: average sales on non-holiday rows
non_holiday_avg_sales = df[df['is_holiday'] == 0]['sales'].mean()

# Percent difference of each holiday type relative to the baseline
percent_difference_by_holiday_type = ((avg_sales_by_holiday_type - non_holiday_avg_sales) / non_holiday_avg_sales) * 100

holiday_impact = avg_sales_by_holiday_type.reset_index()
holiday_impact['percent_difference_vs_normal'] = percent_difference_by_holiday_type.values

# Largest uplift first
holiday_impact = holiday_impact.sort_values(by='percent_difference_vs_normal', ascending=False)

print(holiday_impact)


  holiday_type  sales  percent_difference_vs_normal
0   Additional    488                            38
4     Transfer    468                            33
1       Bridge    447                            27
2        Event    426                            21
5     Work Day    372                             6
3      Holiday    358                             2


### Which type of holiday (national, regional, local) drives the highest sales?

In [156]:

holiday_df = df[df['is_holiday'] == 1]


avg_sales_by_holiday_type = holiday_df.groupby('holiday_type')['sales'].mean().reset_index()


avg_sales_by_holiday_type = avg_sales_by_holiday_type.sort_values(by='sales', ascending=False)


print(avg_sales_by_holiday_type)


  holiday_type  sales
0   Additional    488
4     Transfer    468
1       Bridge    447
2        Event    426
5     Work Day    372
3      Holiday    358
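
The heading asks about national, regional, and local holidays, which is the information carried by the `locale` column rather than `holiday_type`. A minimal sketch of that grouping follows, run on a small made-up frame; the values `"Regional"` and `"Local"` are assumed here (only `"National"` is visible in the preview of `df`), and `avg_holiday_sales_by_locale` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "is_holiday": [1, 1, 1, 1, 0, 0],
    "locale": ["National", "National", "Regional", "Local", None, None],
    "sales": [400.0, 600.0, 300.0, 200.0, 100.0, 150.0],
})


def avg_holiday_sales_by_locale(frame: pd.DataFrame) -> pd.Series:
    """Mean sales on holiday rows, grouped by locale, highest first."""
    holidays = frame[frame["is_holiday"] == 1]
    return holidays.groupby("locale")["sales"].mean().sort_values(ascending=False)


by_locale = avg_holiday_sales_by_locale(toy)
print(by_locale)
```

On the real data the same helper could be called as `avg_holiday_sales_by_locale(df)`, provided `locale` is populated for holiday rows.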


### Promotion vs. Holiday Impact

In [157]:

holiday_with_promo = holiday_df[holiday_df['onpromotion'] > 0]
holiday_without_promo = holiday_df[holiday_df['onpromotion'] == 0]


avg_sales_with_promo = holiday_with_promo.groupby('holiday_type')['sales'].mean()
avg_sales_without_promo = holiday_without_promo.groupby('holiday_type')['sales'].mean()

print("With Promotion:")
print(avg_sales_with_promo)
print("Without Promotion:")
print(avg_sales_without_promo)


With Promotion:
holiday_type
Additional   1,426
Bridge       1,176
Event        1,201
Holiday      1,161
Transfer     1,048
Work Day     1,330
Name: sales, dtype: float64
Without Promotion:
holiday_type
Additional   196
Bridge        91
Event        113
Holiday      152
Transfer      91
Work Day     161
Name: sales, dtype: float64


### Which product families see the biggest sales boost during holidays?

In [158]:
family_holiday_sales = df.groupby(['family', 'is_holiday'])['sales'].mean().reset_index()
family_holiday_sales

Unnamed: 0,family,is_holiday,sales
0,AUTOMOTIVE,0,6
1,AUTOMOTIVE,1,7
2,BABY CARE,0,0
3,BABY CARE,1,0
4,BEAUTY,0,4
...,...,...,...
61,PRODUCE,1,1514
62,SCHOOL AND OFFICE SUPPLIES,0,3
63,SCHOOL AND OFFICE SUPPLIES,1,3
64,SEAFOOD,0,22
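
The long table above lists holiday and non-holiday means on separate rows, which makes the "biggest boost" hard to read off. One possible way to rank families is to pivot the two means side by side and compute absolute and percent uplift. The sketch below uses a made-up frame and a hypothetical helper `holiday_uplift`; families with a zero non-holiday mean get a missing percent uplift instead of a division by zero.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "family": ["PRODUCE", "PRODUCE", "BEAUTY", "BEAUTY", "BOOKS", "BOOKS"],
    "is_holiday": [0, 1, 0, 1, 0, 1],
    "sales": [1000.0, 1200.0, 4.0, 5.0, 0.0, 0.0],
})


def holiday_uplift(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Holiday vs. non-holiday mean sales per `key`, with absolute and percent uplift."""
    means = frame.pivot_table(index=key, columns="is_holiday", values="sales", aggfunc="mean")
    means = means.rename(columns={0: "non_holiday", 1: "holiday"})
    means["abs_uplift"] = means["holiday"] - means["non_holiday"]
    # Mask zero baselines so the percent uplift is NaN rather than inf
    baseline = means["non_holiday"].where(means["non_holiday"] > 0)
    means["pct_uplift"] = means["abs_uplift"] / baseline * 100
    return means.sort_values("pct_uplift", ascending=False)


family_uplift = holiday_uplift(toy, "family")
print(family_uplift)
```

Percent uplift favors small families (a move from 4 to 5 units is 25%), so sorting by `abs_uplift` as well may give a more balanced picture.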


### Are certain stores or store types more sensitive to holiday sales spikes?

In [159]:
store_type_holiday_sales = df.groupby(['store_type', 'is_holiday'])['sales'].mean().reset_index()
store_type_holiday_sales

Unnamed: 0,store_type,is_holiday,sales
0,A,0,693
1,A,1,785
2,B,0,321
3,B,1,366
4,C,0,195
5,C,1,213
6,D,0,346
7,D,1,382
8,E,0,264
9,E,1,301
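
To compare sensitivity across store types, the means above can be turned into a relative uplift per type. The sketch below simply re-enters the rounded values printed in the table, so the resulting percentages are approximate in the last digit.

```python
import pandas as pd

# Rounded means copied from the table above (holiday vs. non-holiday, by store type)
store_type_means = pd.DataFrame({
    "store_type": ["A", "B", "C", "D", "E"],
    "non_holiday": [693, 321, 195, 346, 264],
    "holiday": [785, 366, 213, 382, 301],
}).set_index("store_type")

# Relative holiday uplift in percent
store_type_means["pct_uplift"] = (
    store_type_means["holiday"] / store_type_means["non_holiday"] - 1
) * 100

print(store_type_means.sort_values("pct_uplift", ascending=False).round(1))
```

With these rounded inputs every store type shows an uplift between roughly 9% and 14%, with type C the lowest, so the differences between types are modest.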


### Do transactions (customer visits) increase significantly during holidays?

In [160]:
holiday_transactions = df[df['is_holiday'] == 1]['transactions'].sum()
non_holiday_transactions = df[df['is_holiday'] == 0]['transactions'].sum()
print(holiday_transactions, non_holiday_transactions)

825639144.0 3935038272.0
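
The two totals above are hard to compare directly, because there are far more non-holiday rows than holiday rows. In addition, the preview of `df` suggests that `transactions` is a store-day value repeated on every `family` row, which would inflate the sums; this is an assumption about the data layout. A sketch of a per-store-day average that avoids both issues, on a made-up frame:

```python
import pandas as pd

# Toy stand-in for the notebook's `df`: one store, two families, three days.
# `transactions` is repeated on each family row of the same store-day (assumed layout).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"] * 2 + ["2017-01-02"] * 2 + ["2017-01-03"] * 2),
    "store_nbr": [1] * 6,
    "family": ["A", "B"] * 3,
    "is_holiday": [1, 1, 0, 0, 0, 0],
    "transactions": [900, 900, 600, 600, 700, 700],
})

# Keep one row per store-day so repeated transaction counts are not double counted
store_days = toy.drop_duplicates(subset=["date", "store_nbr"])

# Average transactions per store-day, holiday vs. non-holiday
avg_transactions = store_days.groupby("is_holiday")["transactions"].mean()
pct_change = (avg_transactions.loc[1] / avg_transactions.loc[0] - 1) * 100

print(avg_transactions)
print(f"Holiday transactions per store-day differ by {pct_change:.1f}%")
```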


### How many days before a holiday does sales start increasing?

In [161]:
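# Despite its name, this column holds the number of days elapsed since the first
# date in the dataset, so the filter below selects the first 7 days of data.
df['days_to_holiday'] = (df['date'] - df['date'].min()).dt.days
sales_increase_before_holiday = df[df['days_to_holiday'] < 7].groupby('is_holiday')['sales'].mean().reset_index()
sales_increase_before_holiday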

Unnamed: 0,is_holiday,sales
0,0,232
1,1,135
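
To see how many days before a holiday sales start rising, each date needs its distance to the next holiday. A minimal sketch of one way to derive that, on a made-up daily series: keep the date only on holiday rows, back-fill it, and subtract. The column name `days_to_next_holiday` and the toy numbers are illustrative, not taken from the notebook.

```python
import pandas as pd

# Toy daily series (made-up numbers). With the notebook's data one would first
# collapse to one row per date, for example:
#   daily = df.groupby("date").agg(sales=("sales", "sum"),
#                                  is_holiday=("is_holiday", "max")).reset_index()
daily = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "is_holiday": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    "sales": [100, 100, 110, 130, 200, 90, 100, 100, 140, 210],
}).sort_values("date").reset_index(drop=True)

# Keep the date on holiday rows only, then back-fill so every row carries the
# date of the next holiday (a holiday row carries its own date).
next_holiday = daily["date"].where(daily["is_holiday"] == 1).bfill()
daily["days_to_next_holiday"] = (next_holiday - daily["date"]).dt.days

# Average sales by distance to the next holiday (0 = the holiday itself)
lead_up_profile = (
    daily.groupby("days_to_next_holiday")["sales"].mean().sort_index(ascending=False)
)
print(lead_up_profile)
```

Reading the profile from the largest distance down to 0 shows where the average starts to climb; dates after the last holiday in the data get a missing distance and drop out of the profile.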


### Do sales drop after holidays ("post-holiday effect")?

In [162]:
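# `days_to_holiday` counts days since the first date in the dataset (see cell 161), so:
#   post_holiday_sales = non-holiday rows on any date after the first one
#   pre_holiday_sales  = holiday rows within the first 7 days of data
post_holiday_sales = df[(df['is_holiday'] == 0) & (df['days_to_holiday'] > 0)]['sales'].mean()
pre_holiday_sales = df[(df['is_holiday'] == 1) & (df['days_to_holiday'] < 7)]['sales'].mean()
print(f"Post-holiday sales: {post_holiday_sales}, Pre-holiday sales: {pre_holiday_sales}")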

Post-holiday sales: 352.15918056230004, Pre-holiday sales: 134.64134125364757
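
A post-holiday effect is easier to test with explicit windows around each holiday. The sketch below, on a made-up daily series, compares average sales in the `window` days before a holiday with the `window` days after one. The window length of 2 and all column names other than `date`, `is_holiday`, and `sales` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy daily series (made-up numbers), one row per date
daily = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "is_holiday": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    "sales": [100, 100, 110, 130, 200, 90, 100, 100, 140, 210],
}).sort_values("date").reset_index(drop=True)

holiday_date = daily["date"].where(daily["is_holiday"] == 1)

# Distance to the next holiday (back-fill) and from the previous one (forward-fill)
daily["days_to_next"] = (holiday_date.bfill() - daily["date"]).dt.days
daily["days_since_last"] = (daily["date"] - holiday_date.ffill()).dt.days

window = 2  # illustrative window length in days

# Rows 1..window days before a holiday, and 1..window days after a holiday;
# missing distances compare as False and are excluded.
pre_mask = daily["days_to_next"].between(1, window)
post_mask = daily["days_since_last"].between(1, window)

pre_mean = daily.loc[pre_mask, "sales"].mean()
post_mean = daily.loc[post_mask, "sales"].mean()
print(f"Pre-holiday mean: {pre_mean:.1f}, post-holiday mean: {post_mean:.1f}")
```

When holidays are close together, a single date can fall into both windows; widening or narrowing `window` is a judgment call that depends on the holiday calendar.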


### Which city or state benefits the most from local holidays?

In [163]:
city_state_holiday_sales = df.groupby(['city', 'state', 'holiday_type'])['sales'].mean().reset_index()
city_state_holiday_sales

Unnamed: 0,city,state,holiday_type,sales
0,Ambato,Tungurahua,Additional,527
1,Ambato,Tungurahua,Bridge,506
2,Ambato,Tungurahua,Event,397
3,Ambato,Tungurahua,Holiday,363
4,Ambato,Tungurahua,Normal Day,357
...,...,...,...,...
149,Santo Domingo,Santo Domingo de los Tsachilas,Event,262
150,Santo Domingo,Santo Domingo de los Tsachilas,Holiday,214
151,Santo Domingo,Santo Domingo de los Tsachilas,Normal Day,211
152,Santo Domingo,Santo Domingo de los Tsachilas,Transfer,286
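
The grouping above is by `holiday_type`, so it does not isolate local holidays. A sketch of one way to do so: filter holiday rows whose `locale` is `"Local"` and compare each city's mean with its own non-holiday mean. The label `"Local"` is an assumption (only `"National"` appears in the preview of `df`), and the frame below is made up.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "city": ["Quito", "Quito", "Ambato", "Ambato", "Cuenca", "Cuenca"],
    "is_holiday": [0, 1, 0, 1, 0, 1],
    "locale": [None, "Local", None, "Local", None, "National"],
    "sales": [100.0, 150.0, 200.0, 220.0, 80.0, 90.0],
})

# Mean sales on local-holiday rows and on non-holiday rows, per city
local_mean = toy[(toy["is_holiday"] == 1) & (toy["locale"] == "Local")].groupby("city")["sales"].mean()
normal_mean = toy[toy["is_holiday"] == 0].groupby("city")["sales"].mean()

# Percent uplift per city; cities without any local holiday drop out
local_uplift = ((local_mean / normal_mean - 1) * 100).dropna().sort_values(ascending=False)
print(local_uplift)
```

Grouping by `["city", "state"]` instead of `"city"` would keep the state alongside each city, as in the original cell.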


### Is there a difference in sales between transferred holidays and non-transferred holidays?

In [164]:
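# Neither filter restricts to holiday rows, so `non_transferred_sales` averages
# over every row where `transferred` is False, not only non-transferred holidays.
transferred_sales = df[df['transferred'] == 1]['sales'].mean()
non_transferred_sales = df[df['transferred'] == 0]['sales'].mean()
print(f"Transferred sales: {transferred_sales}, Non-transferred sales: {non_transferred_sales}")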


Transferred sales: 311.5602281117783, Non-transferred sales: 359.2714177512478
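
To compare transferred holidays with non-transferred holidays specifically, the comparison can be limited to holiday rows first. A minimal sketch on a made-up frame; the label column `transfer_status` is introduced only for readability.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "is_holiday": [1, 1, 1, 1, 0, 0],
    "transferred": [True, True, False, False, False, False],
    "sales": [300.0, 320.0, 400.0, 420.0, 350.0, 360.0],
})

# Keep holiday rows only, then label them by transfer status
holidays_only = toy[toy["is_holiday"] == 1].copy()
holidays_only["transfer_status"] = holidays_only["transferred"].map(
    {True: "transferred", False: "not_transferred"}
)

sales_by_transfer = holidays_only.groupby("transfer_status")["sales"].mean()
print(sales_by_transfer)
```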


### Do crisis periods reduce the usual holiday sales spike?

In [165]:
crisis_holiday_sales = df[(df['is_crisis'] == 1) & (df['is_holiday'] == 1)]['sales'].mean()
non_crisis_holiday_sales = df[(df['is_crisis'] == 0) & (df['is_holiday'] == 1)]['sales'].mean()
print(f"Crisis holiday sales: {crisis_holiday_sales}, Non-crisis holiday sales: {non_crisis_holiday_sales}")

Crisis holiday sales: 494.9040719977209, Non-crisis holiday sales: 381.385802875458
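
Comparing the two holiday means alone does not show whether the holiday spike shrinks, because the sales level itself differs between crisis and non-crisis periods. One option is to compute the holiday uplift relative to the non-holiday baseline within each regime. Note that the crisis-holiday mean above (494.90) equals the overall crisis mean of 494.90 printed in cell 171, which may indicate that there are no (or very few) non-holiday crisis rows; in that case the crisis uplift is undefined. The sketch below uses made-up numbers.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "is_crisis": [0, 0, 0, 0, 1, 1, 1, 1],
    "is_holiday": [0, 0, 1, 1, 0, 0, 1, 1],
    "sales": [100.0, 100.0, 120.0, 130.0, 200.0, 200.0, 210.0, 210.0],
})

# Mean sales per (is_crisis, is_holiday) cell, with holiday flag as columns.
# reindex guarantees both columns exist even if one combination has no rows.
cell_means = (
    toy.groupby(["is_crisis", "is_holiday"])["sales"].mean()
    .unstack("is_holiday")
    .reindex(columns=[0, 1])
)

# Holiday uplift in percent within each regime (NaN if a baseline is missing)
holiday_uplift_pct = (cell_means[1] / cell_means[0] - 1) * 100
print(holiday_uplift_pct)
```

In this toy example the uplift is 25% outside the crisis and 5% during it, which is the kind of gap that would indicate a dampened holiday spike.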


### Are weekend holidays (holidays falling on Saturday/Sunday) more profitable than weekday holidays?

In [166]:
df['is_weekend_holiday'] = (df['is_holiday'] == 1) & (df['day_of_week'].isin(['Saturday', 'Sunday']))
weekend_holiday_sales = df[df['is_weekend_holiday'] == 1]['sales'].mean()
# `is_weekend_holiday == 0` also matches every non-holiday row, so this mean
# covers all rows that are not weekend holidays, not only weekday holidays.
weekday_holiday_sales = df[df['is_weekend_holiday'] == 0]['sales'].mean()
print(f"Weekend holiday sales: {weekend_holiday_sales}, Weekday holiday sales: {weekday_holiday_sales}")

Weekend holiday sales: 447.83476584036254, Weekday holiday sales: 354.4439809744127
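
For a like-for-like comparison, both groups can be drawn from holiday rows only. A minimal sketch on a made-up frame; `holiday_day_kind` is a hypothetical label column.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "is_holiday": [1, 1, 1, 1, 0, 0],
    "day_of_week": ["Saturday", "Sunday", "Tuesday", "Friday", "Saturday", "Monday"],
    "sales": [500.0, 460.0, 380.0, 400.0, 420.0, 300.0],
})

# Restrict to holidays, then split them into weekend and weekday holidays
holidays = toy[toy["is_holiday"] == 1].copy()
holidays["holiday_day_kind"] = (
    holidays["day_of_week"].isin(["Saturday", "Sunday"])
    .map({True: "weekend", False: "weekday"})
)

sales_by_day_kind = holidays.groupby("holiday_day_kind")["sales"].mean()
print(sales_by_day_kind)
```

Because weekends tend to differ from weekdays even without a holiday, comparing these two means against non-holiday weekends and weekdays would help separate the holiday effect from the ordinary weekend effect.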


### Which quarter has the highest number of holidays and how does that affect total sales?

In [167]:
quarter_holiday_sales = df.groupby(['quarter', 'is_holiday'])['sales'].sum().reset_index()
quarter_holiday_sales

Unnamed: 0,quarter,is_holiday,sales
0,1,0,258703353
1,1,1,13604515
2,2,0,215526817
3,2,1,76571987
4,3,0,230369980
5,3,1,38570033
6,4,0,194048098
7,4,1,69179961
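
The table above gives sales totals but not the number of holidays per quarter. A sketch of how the holiday-day count could be added, counting distinct holiday dates so that the many store and family rows per date are not counted repeatedly. The frame is made up, and `summary` with its column names is illustrative.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`: two rows per date (e.g. two families).
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-01"] * 2 + ["2017-01-02"] * 2
        + ["2017-04-01"] * 2 + ["2017-04-02"] * 2 + ["2017-04-03"] * 2
    ),
    "is_holiday": [1, 1, 0, 0, 1, 1, 1, 1, 0, 0],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0, 50.0, 60.0, 60.0, 20.0, 20.0],
})
toy["quarter"] = toy["date"].dt.quarter

by_quarter = toy.groupby("quarter")
summary = pd.DataFrame({
    # Distinct holiday dates per quarter
    "holiday_days": toy[toy["is_holiday"] == 1].groupby("quarter")["date"].nunique(),
    # Distinct dates and total sales per quarter
    "total_days": by_quarter["date"].nunique(),
    "total_sales": by_quarter["sales"].sum(),
}).fillna({"holiday_days": 0})

# Normalizing by the number of days makes quarters of different length comparable
summary["sales_per_day"] = summary["total_sales"] / summary["total_days"]
print(summary)
```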


### Is there a cumulative promotion and holiday effect?

In [168]:
promo_holiday_sales_combined = df[(df['onpromotion'] > 0) & (df['is_holiday'] == 1)]['sales'].mean()
promo_holiday_sales_combined

1203.8338764363868
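
The single mean above covers only rows that are both promoted and on a holiday. To judge whether the two effects add up, it helps to see all four combinations and compare the combined lift with the sum of the separate lifts. A minimal sketch on made-up numbers; `promo` and `interaction` are names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all numbers are made up for illustration.
toy = pd.DataFrame({
    "onpromotion": [0, 0, 0, 3, 5],
    "is_holiday": [0, 0, 1, 0, 1],
    "sales": [100.0, 100.0, 120.0, 300.0, 350.0],
})

# 2x2 table of mean sales: rows = promotion flag, columns = holiday flag
cell_means = (
    toy.assign(promo=(toy["onpromotion"] > 0).astype(int))
    .groupby(["promo", "is_holiday"])["sales"].mean()
    .unstack("is_holiday")
)

baseline = cell_means.loc[0, 0]                  # no promotion, no holiday
promo_lift = cell_means.loc[1, 0] - baseline     # promotion only
holiday_lift = cell_means.loc[0, 1] - baseline   # holiday only
combined_lift = cell_means.loc[1, 1] - baseline  # both together

# Positive: the combination exceeds the sum of the parts; negative: it falls short
interaction = combined_lift - (promo_lift + holiday_lift)
print(cell_means)
print(f"Interaction: {interaction:.1f}")
```

This is a descriptive comparison of group means, not a causal estimate; promoted items may differ systematically from non-promoted ones.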

# 🌍 9. External Factors



### Is there a correlation between oil prices (dcoilwtico) and sales behavior?


In [169]:

oil_sales_corr = df[['dcoilwtico', 'sales']].corr().iloc[0, 1]
print(f"Correlation between oil price and sales: {oil_sales_corr:.4f}")


Correlation between oil price and sales: -0.0750
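
The correlation above is computed over individual store and family rows, so it mixes differences between stores and product families with the relationship over time. Since the oil price takes a single value per date, a date-level view may be more informative. A sketch of that aggregation on a made-up frame:

```python
import pandas as pd

# Toy stand-in for the notebook's `df`: four dates, two families per date.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-01"] * 2 + ["2017-01-02"] * 2 + ["2017-01-03"] * 2 + ["2017-01-04"] * 2
    ),
    "family": ["A", "B"] * 4,
    "dcoilwtico": [100.0, 100.0, 90.0, 90.0, 80.0, 80.0, 70.0, 70.0],
    "sales": [1.0, 9.0, 2.0, 18.0, 3.0, 27.0, 4.0, 36.0],
})

# One row per date: total sales and that day's oil price
daily = toy.groupby("date").agg(
    total_sales=("sales", "sum"),
    oil_price=("dcoilwtico", "first"),
)

row_level_corr = toy["dcoilwtico"].corr(toy["sales"])
daily_corr = daily["oil_price"].corr(daily["total_sales"])
print(f"Row-level correlation: {row_level_corr:.4f}")
print(f"Daily-level correlation: {daily_corr:.4f}")
```

In this toy example the daily correlation is much stronger than the row-level one, because the spread between the two families masks the time pattern at row level. Whether the same holds for the real data would need to be checked.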


In [170]:

lag_corr = df[['sales', 'sales_lag_7']].corr().iloc[0, 1]
print(f"Correlation between sales and sales_lag_7: {lag_corr:.4f}")


Correlation between sales and sales_lag_7: 0.9310


In [171]:

crisis_sales = df[df['is_crisis'] == 1]['sales'].mean()
non_crisis_sales = df[df['is_crisis'] == 0]['sales'].mean()


percent_change = ((crisis_sales - non_crisis_sales) / non_crisis_sales) * 100

print(f"Average Sales During Crisis: {crisis_sales:.2f}")
print(f"Average Sales Outside Crisis: {non_crisis_sales:.2f}")
print(f"Sales changed by {percent_change:.2f}% during crisis.")


Average Sales During Crisis: 494.90
Average Sales Outside Crisis: 356.52
Sales changed by 38.82% during crisis.
