# 📈 Elasticity Project: Model Summary

This model focuses on **Price Elasticity of Demand (PED)** and its effect on total revenue. It allows users to explore how changes in price influence quantity demanded and overall sales performance.

## ✅ **Key Components:**

1. **Price Elasticity of Demand (PED) Calculation**

   The elasticity is calculated using the midpoint formula to provide stable and realistic elasticity estimates:

   $$
   E_d = \frac{\frac{Q_2 - Q_1}{(Q_2 + Q_1)/2}}{\frac{P_2 - P_1}{(P_2 + P_1)/2}}
   $$

   Where:
   - \( Q_1 \), \( Q_2 \) = Original and new quantity demanded.
   - \( P_1 \), \( P_2 \) = Original and new price.

2. **Elasticity Classification**

   The model classifies elasticity by its magnitude (for a downward-sloping demand curve, \( E_d \) itself is negative):
   - **Elastic** if \( |E_d| > 1 \)
   - **Inelastic** if \( |E_d| < 1 \)
   - **Unitary Elastic** if \( |E_d| = 1 \)

3. **Revenue Impact Calculation**

   We calculate **Total Revenue (TR)** before and after the price change:

   $$
   TR_1 = P_1 \times Q_1
   $$

   $$
   TR_2 = P_2 \times Q_2
   $$

   The **change in revenue** is expressed as:

   $$
   \Delta TR = TR_2 - TR_1
   $$

4. **Visualizations**

   - **Demand Curve Plot:**
     Shows the movement along the demand curve from the original to the new price/quantity point, based on user input.
   - **Revenue Comparison:**
     Displays side-by-side revenue before and after the price change.

5. **User Inputs (via Sliders):**
   - Initial price (\( P_1 \))
   - Initial quantity (\( Q_1 \))
   - % change in price (\( \%\Delta P \))

6. **Output:**
   - New price & quantity estimates.
   - Elasticity classification (with interpretation).
   - Revenue before & after (with impact summary).
   - Interactive graph updates in real-time.
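
The three calculations above (midpoint elasticity, classification, revenue change) can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual implementation: the function names and the example numbers are made up, and the classification uses the absolute value of \( E_d \) on the assumption that demand elasticities are reported as negative numbers.

```python
def midpoint_elasticity(p1, q1, p2, q2):
    """Arc (midpoint) price elasticity of demand between two price/quantity points."""
    pct_dq = (q2 - q1) / ((q2 + q1) / 2)
    pct_dp = (p2 - p1) / ((p2 + p1) / 2)
    if pct_dp == 0:
        raise ValueError("price did not change, so elasticity is undefined")
    return pct_dq / pct_dp


def classify_elasticity(e_d, tol=1e-9):
    """Classify by magnitude, because E_d is normally negative for a demand curve."""
    magnitude = abs(e_d)
    if abs(magnitude - 1) <= tol:
        return "unitary elastic"
    return "elastic" if magnitude > 1 else "inelastic"


def revenue_change(p1, q1, p2, q2):
    """Return (TR_1, TR_2, delta TR) for the two price/quantity points."""
    tr1 = p1 * q1
    tr2 = p2 * q2
    return tr1, tr2, tr2 - tr1


# Hypothetical inputs: price rises from 10 to 12 and quantity falls from 100 to 80.
e_d = midpoint_elasticity(10, 100, 12, 80)
label = classify_elasticity(e_d)
tr1, tr2, delta_tr = revenue_change(10, 100, 12, 80)

print(f"E_d = {e_d:.3f} ({label})")
print(f"TR_1 = {tr1}, TR_2 = {tr2}, change = {delta_tr}")
```

With these made-up numbers demand is elastic, so the price increase lowers total revenue, which is the textbook relationship between elasticity and revenue.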


In [4]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [5]:
processed_data = pd.read_csv('../data/processed/processed_data.csv')

## ✅ Check data is clean

In [6]:
processed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 843482 entries, 0 to 843481
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Date           843482 non-null  object
 1   Store          843482 non-null  int64 
 2   DayOfWeek      843482 non-null  int64 
 3   Sales          843482 non-null  int64 
 4   Customers      843482 non-null  int64 
 5   Open           843482 non-null  int64 
 6   Promo          843482 non-null  int64 
 7   StateHoliday   843482 non-null  int64 
 8   SchoolHoliday  843482 non-null  int64 
dtypes: int64(8), object(1)
memory usage: 57.9+ MB


In [7]:
processed_data.head()


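|   | Date       | Store | DayOfWeek | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|------------|-------|-----------|-------|-----------|------|-------|--------------|---------------|
| 0 | 2015-07-31 | 1     | 5         | 5263  | 555       | 1    | 1     | 0            | 1             |
| 1 | 2015-07-31 | 2     | 5         | 6064  | 625       | 1    | 1     | 0            | 1             |
| 2 | 2015-07-31 | 3     | 5         | 8314  | 821       | 1    | 1     | 0            | 1             |
| 3 | 2015-07-31 | 4     | 5         | 13995 | 1498      | 1    | 1     | 0            | 1             |
| 4 | 2015-07-31 | 5     | 5         | 4822  | 559       | 1    | 1     | 0            | 1             |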


## 🔥 First elasticity-style insight: Promo effect
- We can directly model the effect of `Promo` (binary: 0/1) on `Sales`. This tells you how much more (or less) you sell when running a promo vs. not running one.

- Even a simple OLS regression gives you the coefficient for `Promo`, which acts like a proxy elasticity for how responsive sales are to promotions.
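
To see why the `Promo` coefficient can be read as "how much more you sell on promo days", here is a small sketch on synthetic data (the numbers are invented and are not from the project's dataset). With `Promo` as the only feature, the OLS coefficient equals the difference between the two group means. Once other features are added, it becomes that difference adjusted for those features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic toy data (illustrative only): four non-promo days and four promo days.
toy = pd.DataFrame({
    "Promo": [0, 0, 0, 0, 1, 1, 1, 1],
    "Sales": [4000, 4200, 3900, 4100, 6100, 6300, 5900, 6200],
})

# OLS with Promo as the single regressor.
model = LinearRegression().fit(toy[["Promo"]], toy["Sales"])
promo_coef = model.coef_[0]

# Plain difference between mean sales on promo and non-promo days.
group_means = toy.groupby("Promo")["Sales"].mean()
mean_gap = group_means.loc[1] - group_means.loc[0]

print(f"OLS coefficient on Promo:  {promo_coef:.1f}")
print(f"Difference in group means: {mean_gap:.1f}")
```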

## 💡 Let’s draft the steps:


### 1️⃣ Convert Date as before:

In [8]:
processed_data['Date'] = pd.to_datetime(processed_data['Date'])
processed_data['Month'] = processed_data['Date'].dt.month
processed_data['Year'] = processed_data['Date'].dt.year
processed_data['WeekOfYear'] = processed_data['Date'].dt.isocalendar().week


### 2️⃣ Filter to open stores only (because closed = 0 sales):

In [9]:
data_open = processed_data[processed_data['Open'] == 1]


### 3️⃣ Set up features & target:

In [10]:
features = ['Promo', 'StateHoliday', 'SchoolHoliday', 'DayOfWeek', 'Month', 'Year']
X = data_open[features]
y = data_open['Sales']


### 4️⃣ Linear regression:

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Train R^2:", lr.score(X_train, y_train))
print("Test R^2:", lr.score(X_test, y_test))


Train R^2: 0.1502372392327186
Test R^2: 0.1487795746920464


### 5️⃣ Elasticity-like insight: Promo effect
After training, check the coefficients:

In [12]:
coef_table = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr.coef_
})
print(coef_table)


         Feature   Coefficient
0          Promo  2.158871e+03
1   StateHoliday  4.831691e-13
2  SchoolHoliday  7.052238e+01
3      DayOfWeek -1.367336e+02
4          Month  8.163864e+01
5           Year  2.024603e+02


## 🔍 Analysis of the model

### 🟢 Promo: +2158.87
💥 BOOM—this is your headline stat.

✅ On average, holding the other features fixed, sales are about 2,159 units higher on days when a promo is running than on days when there’s no promo.

📈 This is your "promotion elasticity proxy"—while it’s not a percentage change (since we don’t have price), it tells you how sensitive sales are to the presence of a promotion.
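
If a percentage-style figure is wanted, two common options are to divide the coefficient by the non-promo baseline, or to regress log sales on `Promo` and convert the coefficient with `exp(b) - 1`. The sketch below uses synthetic data built so that promo days sell exactly 25% more. It is an illustration of the idea only, not a result from this dataset, and the log version assumes every `Sales` value is strictly positive.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic toy data: each promo-day value is exactly 1.25x a non-promo value.
toy = pd.DataFrame({
    "Promo": [0, 0, 0, 1, 1, 1],
    "Sales": [4000.0, 5000.0, 8000.0, 5000.0, 6250.0, 10000.0],
})

# Option 1: level model, then express the coefficient relative to the baseline mean.
level_model = LinearRegression().fit(toy[["Promo"]], toy["Sales"])
baseline = toy.loc[toy["Promo"] == 0, "Sales"].mean()
uplift_vs_baseline = level_model.coef_[0] / baseline

# Option 2: log model, where exp(coefficient) - 1 is the proportional change.
log_model = LinearRegression().fit(toy[["Promo"]], np.log(toy["Sales"]))
uplift_from_logs = np.exp(log_model.coef_[0]) - 1

print(f"Uplift vs. baseline mean: {uplift_vs_baseline:.1%}")
print(f"Uplift from log model:    {uplift_from_logs:.1%}")
```

The two figures agree here only because the toy data was constructed that way. On real data they usually differ somewhat, since one compares arithmetic means and the other geometric means.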



### 🟠 StateHoliday: ~ 0 (4.8e-13)
That’s super tiny: numerically zero.

Taken at face value, this would say:

🏖️ Whether it’s a state holiday or not doesn’t impact sales in your data.

But a coefficient this close to zero is a warning sign. Did this column have any real variation (were there holidays at all?), or was it constant? Worth checking with:

In [13]:
print(processed_data['StateHoliday'].value_counts())


StateHoliday
0    843482
Name: count, dtype: int64


## This tells us:

- ✅ 100% of your data points (843,482 rows) have StateHoliday = 0.
- ❌ No actual state holidays are present.

### 💡 Why did the model give us that tiny coefficient (~4.8e-13)?
- Because the StateHoliday feature is constant: it never changes, so it gives the model no signal at all.

- In linear regression, a feature with no variation is perfectly collinear with the intercept, so it can’t contribute to prediction. The solver still returns a coefficient for it, but the value is numerically zero (4.8e-13 is floating-point noise) and it’s doing nothing.
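
The behaviour described above is easy to reproduce. The sketch below fits a regression on synthetic data with one real feature and one all-zero column standing in for `StateHoliday`. The data-generating numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

promo = rng.integers(0, 2, size=n)   # varies between 0 and 1
state_holiday = np.zeros(n)          # constant column, like StateHoliday in this data
sales = 5000 + 2000 * promo + rng.normal(0, 100, size=n)

X_toy = np.column_stack([promo, state_holiday])
toy_model = LinearRegression().fit(X_toy, sales)

promo_coef, holiday_coef = toy_model.coef_
print(f"Promo coefficient:        {promo_coef:.1f}")
print(f"StateHoliday coefficient: {holiday_coef:.3e}")  # zero up to floating-point error
```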

### ✅ Next Steps?
- Remove StateHoliday from the feature list going forward because:
    - It’s useless here (no variation = no predictive power).
    - It adds clutter to future models without adding information (tree-based models simply never split on a constant feature, so it can’t help there either).
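
Rather than spotting constant columns by eye, they can be filtered out programmatically. A minimal sketch, where `drop_constant_columns` is a hypothetical helper (not part of pandas) and the toy frame is made up:

```python
import pandas as pd


def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without columns that hold a single distinct value."""
    varying = frame.nunique(dropna=False) > 1  # boolean Series indexed by column name
    return frame.loc[:, varying].copy()


# Toy frame where StateHoliday never changes.
toy = pd.DataFrame({
    "Promo": [0, 1, 0, 1],
    "StateHoliday": [0, 0, 0, 0],
    "SchoolHoliday": [1, 0, 0, 1],
})

kept = drop_constant_columns(toy)
print(list(kept.columns))
```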

### 🟡 SchoolHoliday: +70.5
This one's interesting:

- When there’s a school holiday, sales increase by ~71 units on average.
- Not a massive effect, but it’s positive.

✅ This makes intuitive sense—families might shop more when kids are out of school.

### 🔵 DayOfWeek: -136.7
This one tells us that as the day of the week increases (Monday=1 up to Sunday=7; the sample rows for 2015-07-31, a Friday, have `DayOfWeek` = 5):

- Sales drop about 137 units per day going later in the week.
- It’s linear here, so it might not fully capture patterns like weekend spikes—this could be better handled later with dummy variables (categoricals).
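
To illustrate the point about dummy variables, the sketch below compares a single numeric `DayOfWeek` feature against one dummy column per day. The weekend-spike pattern is synthetic and chosen purely for illustration. It is not the pattern in this dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic pattern (illustrative only): flat weekdays with a spike on days 6 and 7.
day_means = {1: 7000, 2: 6500, 3: 6500, 4: 6500, 5: 7000, 6: 9000, 7: 9000}
toy = pd.DataFrame({"DayOfWeek": list(day_means) * 10})
toy["Sales"] = toy["DayOfWeek"].map(day_means)

# Option A: DayOfWeek as one numeric feature, so the model can only fit a straight line.
numeric_fit = LinearRegression().fit(toy[["DayOfWeek"]], toy["Sales"])
r2_numeric = numeric_fit.score(toy[["DayOfWeek"]], toy["Sales"])

# Option B: one dummy per day; the first is dropped because the intercept covers it.
dummies = pd.get_dummies(toy["DayOfWeek"], prefix="dow", drop_first=True, dtype=float)
dummy_fit = LinearRegression().fit(dummies, toy["Sales"])
r2_dummies = dummy_fit.score(dummies, toy["Sales"])

print(f"R^2 with numeric DayOfWeek: {r2_numeric:.3f}")
print(f"R^2 with day dummies:       {r2_dummies:.3f}")
```

Because the toy sales depend only on the day, the dummy model fits them exactly, while the straight-line version cannot follow the weekend jump.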

### 🟣 Month: +81.6
Each later month in the year is associated with ~82 units more in sales.

- This may reflect seasonality trends (e.g., Q4 increases), but it’s a pretty small per-month bump.

### 🟤 Year: +202.5
Each additional calendar year is associated with ~202 extra sales units.

- This suggests an upward trend year over year (maybe business growth, inflation, or other market factors).

## 🚦 What’s the Big Takeaway?

| 📊 **Feature**      | 💥 **Interpretation**                                                                                      |
|---------------------|----------------------------------------------------------------------------------------------------------|
| **Promo**           | 🔥 **Major impact: +2159 sales boost.** This is your *main elasticity-like driver.*                       |
| **StateHoliday**    | 💤 **No real effect.**                                                                                   |
| **SchoolHoliday**   | 👍 Small positive bump (~71 units).                                                                      |
| **DayOfWeek**       | 📉 Sales **decline by ~137 units per day** later in the week (might hint at a weekend lull; worth deeper analysis). |
| **Month**           | 📈 Slight positive trend across months (~82 units increase per month).                                    |
| **Year**            | 🚀 Solid +200 unit boost per year—suggests business growth or other long-term upward trend.               |


**Note:** The `StateHoliday` feature was removed from further modeling because the dataset contains no actual state holidays (100% of rows have `StateHoliday = 0`), making it a constant feature with no predictive value.


In [14]:
features = ['Promo', 'SchoolHoliday', 'DayOfWeek', 'Month', 'Year']


### 2️⃣ 🔄 Re-split your data:
Let’s keep things clean:

In [15]:
X = data_open[features]
y = data_open['Sales']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


### 3️⃣ 🚀 Refit the model:

In [16]:
lr = LinearRegression()
lr.fit(X_train, y_train)

print("Train R^2:", lr.score(X_train, y_train))
print("Test R^2:", lr.score(X_test, y_test))


Train R^2: 0.15023723923271926
Test R^2: 0.14877957469204728


### 4️⃣ 🧐 Get the updated coefficients:

In [17]:
coef_table = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr.coef_
})
print(coef_table)


         Feature  Coefficient
0          Promo  2158.870979
1  SchoolHoliday    70.522376
2      DayOfWeek  -136.733631
3          Month    81.638638
4           Year   202.460311


### ✅ What are the coefficients telling us?

| **Feature**       | **Coefficient** | **What it means**                                                                                           |
|-------------------|-----------------|------------------------------------------------------------------------------------------------------------|
| **Promo**         | 2158.87         | ➔ When `Promo` changes from 0 → 1 (no promo → promo), **sales increase by ~2159 units.**                    |
| **SchoolHoliday** | 70.52           | ➔ When `SchoolHoliday` changes from 0 → 1, **sales increase by ~71 units.**                                 |
| **DayOfWeek**     | -136.73         | ➔ For each *increment* in `DayOfWeek` (e.g., Monday=1 → Tuesday=2), **sales decrease by ~137 units.**       |
| **Month**         | 81.64           | ➔ For each *increment* in `Month` (e.g., January=1 → February=2), **sales increase by ~82 units.**          |
| **Year**          | 202.46          | ➔ For each *increment* in `Year` (one calendar year → the next), **sales increase by ~202 units.**          |


### 🔄 Model Update: Removed `StateHoliday`

We re-ran the model after removing the `StateHoliday` feature (constant = 0). The updated model shows:

- ✅ Identical `Promo` effect (~ +2,159 units).
- ✅ All other coefficients unchanged to the printed precision.
- ✅ Model performance unchanged (R² ≈ 0.15 on both train and test), confirming `StateHoliday` had no predictive value.


In [19]:
print("Train R^2:", lr.score(X_train, y_train))
print("Test R^2:", lr.score(X_test, y_test))


Train R^2: 0.15023723923271926
Test R^2: 0.14877957469204728


### **Model Performance:**

- **Train R²:** 0.15
- **Test R²:** 0.15

This indicates the model explains ~15% of the variance in sales. While this is relatively low, it reflects the noisy nature of sales data and the limited feature set (no price data, no detailed store/product information). The model successfully captures general patterns (e.g., the strong positive effect of promotions) but is not suitable for high-precision forecasting in its current form.
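
To make the "~15% of the variance" reading concrete: R² is one minus the ratio of the residual sum of squares to the total sum of squares, which is the definition `LinearRegression.score` uses. The sketch below computes it by hand on made-up numbers, and `r_squared` is a hypothetical helper name.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [4000, 6000, 5000, 7000]

perfect = r_squared(y_true, y_true)        # predictions match exactly
mean_only = r_squared(y_true, [5500] * 4)  # always predict the mean (5500)

print(f"Perfect predictions: R^2 = {perfect:.2f}")
print(f"Mean-only baseline:  R^2 = {mean_only:.2f}")
```

On this scale, an R² of 0.15 means the model's squared error is about 85% of what you would get by predicting the average sales figure every time.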

#### **Next steps:**
- Introduce categorical encoding for `DayOfWeek` and `Month`.
- Add `Store` as a feature.
- Explore non-linear models (e.g., RandomForest).
- Investigate feature interactions (e.g., `Promo * DayOfWeek`).
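
As a rough sketch of the interaction idea (`Promo * DayOfWeek`), the example below builds day dummies, multiplies each by `Promo`, and compares the fit with and without those interaction columns. The data is synthetic, with a promo lift that is deliberately larger on day 6, and the column prefixes are arbitrary choices.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: promo lift is 1000 on days 1-5 but 3000 on day 6.
rows = []
for day in range(1, 7):
    for promo in (0, 1):
        lift = 3000 if day == 6 else 1000
        rows.append({"DayOfWeek": day, "Promo": promo, "Sales": 5000 + promo * lift})
toy = pd.DataFrame(rows)

# Main effects: Promo plus one dummy per day (first day dropped).
day_dummies = pd.get_dummies(toy["DayOfWeek"], prefix="dow", drop_first=True, dtype=float)
X_main = pd.concat([toy[["Promo"]], day_dummies], axis=1)

# Interaction columns: each day dummy multiplied by Promo.
interactions = day_dummies.mul(toy["Promo"], axis=0).add_prefix("promo_x_")
X_inter = pd.concat([X_main, interactions], axis=1)

r2_main = LinearRegression().fit(X_main, toy["Sales"]).score(X_main, toy["Sales"])
r2_inter = LinearRegression().fit(X_inter, toy["Sales"]).score(X_inter, toy["Sales"])

print(f"R^2, main effects only: {r2_main:.3f}")
print(f"R^2, with interactions: {r2_inter:.3f}")
```

Without the interaction columns the model has to assign one promo lift to every day. With them, it can give day 6 its own lift.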


### Check to see how many stores are included in the store data

In [20]:
print(data_open['Store'].value_counts())
print(data_open['Store'].nunique())


Store
562    918
85     918
423    918
262    918
682    918
      ... 
909    607
100    606
744    605
348    597
644    592
Name: count, Length: 1115, dtype: int64
1115


### The data shows there are 1,115 unique store IDs

### Why Move from Linear Regression to RandomForest?

The initial linear regression model provided useful directional insights but yielded low explanatory power (R² ~0.15). This is expected because linear regression assumes purely linear relationships between features and sales. However, real-world retail sales are influenced by complex, non-linear patterns—such as store-specific behavior, varying promo effectiveness, and seasonal effects.

Key reasons for adopting RandomForest:

- **Non-linear modeling:** RandomForest captures complex, non-linear relationships automatically, without requiring manual feature engineering (e.g., interaction terms between Promo and Store).
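- **Simpler handling of high-cardinality categoricals:** While linear regression requires one-hot encoding (adding 1,100+ dummy variables for `Store`), RandomForest can work directly with integer-coded labels by splitting on them repeatedly. The ordering of the codes is arbitrary, so this is a pragmatic shortcut rather than true categorical handling.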
- **Automatic interaction learning:** RandomForest naturally identifies important interactions, such as certain stores being more sensitive to promotions on specific days.
- **Potentially improved predictive power:** Tree-based models often achieve higher R² than a plain linear model on noisy retail datasets with the same features, though this needs to be confirmed on the held-out test set.

For these reasons, RandomForest was selected as the next modeling step to improve performance and capture deeper patterns in the sales data.
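
A minimal sketch of that comparison, on synthetic data only: each hypothetical store gets its own baseline and the promo lift depends on the store, which is the kind of structure a linear model with a raw `Store` ID cannot represent. The store count, effect sizes, and hyperparameters are all assumptions made for illustration. Results on the real data may look quite different.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000

toy = pd.DataFrame({
    "Store": rng.integers(1, 21, size=n),      # 20 hypothetical stores
    "Promo": rng.integers(0, 2, size=n),
    "DayOfWeek": rng.integers(1, 8, size=n),
})

# Synthetic sales: arbitrary per-store baseline, and a promo lift that differs by store.
store_base = 4000 + 300 * ((toy["Store"] * 7) % 11)
promo_lift = np.where(toy["Store"] % 2 == 0, 3000, 1000)
toy["Sales"] = store_base + toy["Promo"] * promo_lift + rng.normal(0, 300, size=n)

toy_features = ["Store", "Promo", "DayOfWeek"]
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[toy_features], toy["Sales"], test_size=0.2, random_state=42
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

r2_linear = linear.score(X_te, y_te)
r2_forest = forest.score(X_te, y_te)

print(f"Linear regression test R^2: {r2_linear:.3f}")
print(f"Random forest test R^2:     {r2_forest:.3f}")
```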
