In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Load dataset
data = pd.read_csv("weekly_sales_dataset.csv")

# Features and target
X = data[['Advertising_Spend', 'Price', 'Competitor_Price']]
y = data['Weekly_Sales']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Model evaluation
print("R2 Score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

# Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("Average CV R2:", cv_scores.mean())

# Coefficients
coef_table = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_
})
print(coef_table)


R2 Score: 0.8005903658927165
MAE: 93.46242431692248
RMSE: 119.23467253853967
Average CV R2: 0.8370832722474602
             Feature  Coefficient
0  Advertising_Spend     1.901454
1              Price   -20.089307
2   Competitor_Price    12.857481



## 1. Baseline Linear Regression Model
We train a multiple linear regression model using:

- Advertising Spend
- Product Price
- Competitor Price

These represent the main business drivers of sales.


In [2]:

features = ['Advertising_Spend', 'Price', 'Competitor_Price']
target = 'Weekly_Sales'

X = data[features]
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

preds = model.predict(X_test)

r2 = r2_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))

print("R2:", r2)
print("MAE:", mae)
print("RMSE:", rmse)

pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_
})


R2: 0.8005903658927165
MAE: 93.46242431692248
RMSE: 119.23467253853967


             Feature  Coefficient
0  Advertising_Spend     1.901454
1              Price   -20.089307
2   Competitor_Price    12.857481



### Interpretation

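- The model explains roughly 80% of the variance in weekly sales on the test set (R2 = 0.80), and 5-fold cross-validation gives a similar average R2 of 0.84.
- The mean absolute error is about 93 units (RMSE about 119), against average predicted weekly sales of roughly 912 units on the test set, which we treat as acceptable for operational planning.
- Higher advertising spend and higher competitor prices are associated with higher sales, while a higher own price is associated with lower sales.

This makes the model a suitable basis for budget and pricing decisions.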



## 2. Counterfactual Experiment

We simulate increasing advertising spend by **20%**, while keeping all other factors constant.

This helps evaluate whether predicted changes are economically reasonable.


In [3]:

X_cf = X_test.copy()
X_cf['Advertising_Spend'] *= 1.20

cf_preds = model.predict(X_cf)

avg_change = (cf_preds - preds).mean()

print("Average predicted sales (baseline):", preds.mean())
print("Average predicted sales after +20% advertising:", cf_preds.mean())
print("Average increase in sales:", avg_change)


Average predicted sales (baseline): 912.4378379934234
Average predicted sales after +20% advertising: 1107.60615660913
Average increase in sales: 195.1683186157067



### Interpretation

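Predicted sales rise by about 195 units on average (from about 912 to about 1,108) when advertising increases by 20%, which is economically logical in direction.

The magnitude follows directly from the advertising coefficient: in a linear model, the average change equals the coefficient (1.90) times the average increase in advertising spend. The check therefore confirms that the model is internally consistent, not that the effect is causal.

This supports advertising budget increases as a valid strategy, provided the fitted relationship continues to hold at the higher spend level.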
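The link between the simulated change and the coefficient can be checked numerically. The sketch below uses synthetic data as a stand-in for `weekly_sales_dataset.csv` (the values and the data-generating formula are assumptions made for illustration; only the column names come from the notebook) and compares the simulated average change with the closed-form value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the notebook's data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Advertising_Spend": rng.normal(500, 100, n),
    "Price": rng.normal(50, 5, n),
    "Competitor_Price": rng.normal(55, 5, n),
})
y = (2.0 * X["Advertising_Spend"] - 20.0 * X["Price"]
     + 13.0 * X["Competitor_Price"] + rng.normal(0, 100, n))

model = LinearRegression().fit(X, y)

# Counterfactual: scale advertising by 20%, leave the other columns unchanged.
X_cf = X.copy()
X_cf["Advertising_Spend"] *= 1.20
simulated_change = (model.predict(X_cf) - model.predict(X)).mean()

# Closed form for a linear model: coefficient x 0.20 x mean advertising spend.
ad_coef = model.coef_[X.columns.get_loc("Advertising_Spend")]
expected_change = ad_coef * 0.20 * X["Advertising_Spend"].mean()

print("Simulated change:", simulated_change)
print("Closed-form change:", expected_change)
```

Applied to the notebook's own numbers, the same identity gives 195.17 / (0.20 × 1.901454) ≈ 513, which implies an average test-set advertising spend of roughly 513 units. That is in line with the values shown by `data.head()`.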



## 3. Intentional Misspecification

We now remove Advertising Spend, a key variable, and retrain the model.

This simulates a situation where important business drivers are missing.


In [4]:

X_miss = data[['Price', 'Competitor_Price']]

Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    X_miss, y, test_size=0.2, random_state=42
)

miss_model = LinearRegression()
miss_model.fit(Xm_train, ym_train)

miss_preds = miss_model.predict(Xm_test)

print("Misspecified R2:", r2_score(ym_test, miss_preds))
print("Misspecified MAE:", mean_absolute_error(ym_test, miss_preds))
print("Misspecified RMSE:", np.sqrt(mean_squared_error(ym_test, miss_preds)))


Misspecified R2: 0.32340491694348117
Misspecified MAE: 175.69033044243196
Misspecified RMSE: 219.63113893388834



### Interpretation

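Performance drops sharply when advertising is removed: R2 falls from 0.80 to 0.32, and MAE rises from about 93 to about 176 units.

The model becomes unreliable, and the price effects can become distorted. If advertising is correlated with price or competitor price, the remaining coefficients absorb part of its effect as the model tries to compensate for the missing information (omitted-variable bias).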
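The distortion of price effects can be illustrated on synthetic data. In the sketch below, advertising is constructed to be negatively correlated with price. That is an assumption for the demonstration, since the notebook does not report how the two are related in the real dataset. Dropping advertising then pushes its effect into the price coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (assumption): higher prices coincide with lower advertising.
price = rng.normal(50, 5, n)
advertising = 1000 - 8 * price + rng.normal(0, 40, n)
sales = 2.0 * advertising - 20.0 * price + rng.normal(0, 50, n)

# Full model sees both drivers; reduced model omits advertising.
full = LinearRegression().fit(np.column_stack([advertising, price]), sales)
reduced = LinearRegression().fit(price.reshape(-1, 1), sales)

price_coef_full = full.coef_[1]
price_coef_reduced = reduced.coef_[0]

print("Price coefficient with advertising:", price_coef_full)
print("Price coefficient without advertising:", price_coef_reduced)
```

In this construction the true price effect is -20, while the reduced model's estimate moves toward -20 + 2.0 × (-8) = -36, because price now also stands in for the advertising it is correlated with.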

### Business Risk

- Advertising importance becomes hidden.
- Budget allocation decisions become incorrect.
- Sales predictions become unstable.

This creates real financial risk.



## 4. When We Refuse Deployment

Even with good metrics, deployment should be refused if market conditions change significantly.

Examples:
- Economic slowdown,
- New competitor entry,
- Supply chain disruptions,
- Regulatory changes.

In such cases, historical relationships no longer hold, making predictions unreliable.

Thus, deployment must consider business reality, not just model accuracy.
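One way to make this check operational is to compare incoming feature data with the training data before trusting new predictions. The sketch below is a minimal illustration: the function name `drift_report` and both thresholds are hypothetical choices, not part of the original analysis. It only detects shifts in the inputs. A change in the relationship between inputs and sales can only be detected once fresh sales outcomes are available.

```python
import pandas as pd

def drift_report(train: pd.DataFrame, new: pd.DataFrame,
                 z_threshold: float = 2.0, range_threshold: float = 0.10) -> pd.DataFrame:
    """Compare each feature in `new` against its training distribution."""
    rows = []
    for col in train.columns:
        mu, sigma = train[col].mean(), train[col].std()
        # Shift of the new mean, expressed in training standard deviations.
        shift = (new[col].mean() - mu) / sigma
        # Share of new rows that fall outside the range seen in training.
        outside = ((new[col] < train[col].min()) | (new[col] > train[col].max())).mean()
        rows.append({
            "feature": col,
            "mean_shift_in_std": float(shift),
            "share_outside_train_range": float(outside),
            "flag": bool(abs(shift) > z_threshold or outside > range_threshold),
        })
    return pd.DataFrame(rows)

# Toy example: competitor prices drop well below anything seen in training.
train = pd.DataFrame({"Price": [48, 50, 52, 49, 51],
                      "Competitor_Price": [54, 55, 56, 53, 57]})
new = pd.DataFrame({"Price": [49, 50, 51],
                    "Competitor_Price": [40, 41, 42]})

report = drift_report(train, new)
print(report)
```

A flagged feature would be a reason to pause and review the model, not an automatic verdict.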


In [None]:

# Question 2: Non-Parametric Algorithms — Accountability, Robustness & Deployment


A) Algorithm accountability using a non-parametric model  
B) Robustness and stability testing  

The analysis includes:
1. Selection and explanation of a non-parametric algorithm
2. Algorithm assumptions and risks
3. Business situations where higher accuracy may be rejected
4. Stability testing via data perturbation
5. Comparison with Linear Regression
6. Deployment recommendation for regulated businesses



In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error


In [6]:
features = ['Advertising_Spend', 'Price', 'Competitor_Price']
target = 'Weekly_Sales'

X = data[features]
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

data.head()

   Advertising_Spend  Price  Competitor_Price  Weekly_Sales
0             559.61  43.37             57.30       1191.85
1             483.41  45.52             45.54        750.17
2             577.72  55.98             58.09        938.41
3             682.76  54.88             61.49       1360.39
4             471.90  49.83             54.89        839.55



## A1. Chosen Non-Parametric Method: Random Forest Regression

Random Forest is selected because:

• It captures nonlinear relationships.  
• It handles complex interactions automatically.  
• It provides strong predictive accuracy.

Unlike linear regression, it does not assume a fixed mathematical form.


In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training rows:", X_train.shape[0])
print("Testing rows:", X_test.shape[0])


Training rows: 240
Testing rows: 60


In [8]:
rf_model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_model.fit(X_train, y_train)


In [9]:
predictions = rf_model.predict(X_test)

print("Sample predictions:")
print(predictions[:10])


Sample predictions:
[ 868.9817  1485.45295  762.69495 1135.50975  757.91445 1293.3091
  728.6735   862.4379   833.4655   967.19535]


In [10]:
r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)

print("R2 Score:", r2)
print("Mean Absolute Error:", mae)


R2 Score: 0.7489342802544712
Mean Absolute Error: 104.66098166666659


In [11]:
importances = pd.DataFrame({
    "Feature": features,
    "Importance": rf_model.feature_importances_
}).sort_values(by="Importance", ascending=False)

importances


             Feature  Importance
0  Advertising_Spend    0.590834
1              Price    0.295826
2   Competitor_Price    0.113340


In [12]:
# Spread of predictions across the test rows
# (a descriptive statistic, not a retraining-based sensitivity test)
import numpy as np

print("Prediction variability check:")
print("Std deviation of predictions:", np.std(predictions))


Prediction variability check:
Std deviation of predictions: 220.54334213274183



## B1. Train Non-Parametric Model with Hyperparameter Tuning

We tune the number of trees (`n_estimators`) in Random Forest.


In [25]:
# Split dataset into training and testing sets
# Training data teaches the model
# Testing data evaluates model performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [26]:
# Testing different numbers of trees to find stable performance
tree_options = [50, 100, 200]

for trees in tree_options:
    
    # Create model with selected number of trees
    model = RandomForestRegressor(
        n_estimators=trees,
        random_state=42
    )
    
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    preds = model.predict(X_test)
    
    # Print performance for comparison
    print("Trees:", trees)
    print("R2:", r2_score(y_test, preds))
    print("MAE:", mean_absolute_error(y_test, preds))
    print("-------------------")


Trees: 50
R2: 0.7330453203537741
MAE: 109.73258333333332
-------------------
Trees: 100
R2: 0.7361611975764759
MAE: 108.13050333333331
-------------------
Trees: 200
R2: 0.7489342802544712
MAE: 104.66098166666659
-------------------


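The hyperparameter tuned here is `n_estimators`, the number of decision trees in the forest.

Of the three values tested, 200 trees gives the highest test R2 (0.749 vs 0.733 and 0.736) and the lowest MAE (104.66), so we use `n_estimators=200`. The differences between the settings are small.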
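The comparison above selects `n_estimators` by its score on the test set. A common alternative is to tune with cross-validation on the training data only, so that the test set stays untouched for the final evaluation. The sketch below shows that pattern with scikit-learn's `GridSearchCV`. It uses synthetic data from `make_regression` as a stand-in (an assumption); in the notebook the existing `X` and `y` would be used instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); the notebook would use its own X and y.
X, y = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation over the same candidate values as the loop above.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100, 200]},
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)  # the test set is not used during tuning

print("Best n_estimators:", search.best_params_["n_estimators"])
print("Mean CV R2 of best setting:", search.best_score_)
print("Held-out test R2:", search.score(X_test, y_test))
```

By default `GridSearchCV` refits the best setting on the full training set, so `search` can be used directly for prediction afterwards.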

In [27]:
# Train final model using best hyperparameter
rf_model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_model.fit(X_train, y_train)

# Store predictions for later comparison
rf_preds = rf_model.predict(X_test)


## B2. Stability Test via Data Perturbation

We slightly modify data by removing 5% of rows and retrain the model.

In [29]:
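# Remove 5% of rows randomly
# Caveat: the re-split below draws a new train/test partition, so the perturbed
# training set differs from the original by more than 5% and can include rows
# that belong to the original X_test.
perturbed_data = data.sample(frac=0.95, random_state=42)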

Xp = perturbed_data[features]
yp = perturbed_data[target]

Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    Xp, yp, test_size=0.2, random_state=42
)


In [31]:
# Train model on slightly changed data
rf_model_perturbed = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_model_perturbed.fit(Xp_train, yp_train)

# Predict using original test data
rf_preds_perturbed = rf_model_perturbed.predict(X_test)
rf_preds_perturbed

array([ 897.78865, 1645.59115,  773.90405, 1327.396  ,  828.19135,
       1197.26205,  727.64255,  840.2891 ,  897.0587 ,  979.73915,
        747.1558 ,  836.18585,  702.74785,  521.06755,  806.16435,
        582.7911 ,  796.46205, 1004.7718 , 1092.8934 ,  809.18595,
        454.6519 , 1106.55755,  959.7295 ,  416.80825,  761.08595,
        776.74845, 1112.86185,  828.53555, 1223.4553 ,  663.56855,
       1000.8712 , 1101.9789 ,  522.9826 ,  799.8605 ,  775.5101 ,
        965.9186 , 1027.667  ,  952.36345,  641.5012 ,  896.4427 ,
        466.48115,  599.56275,  918.5659 ,  643.83265,  688.40105,
        428.0776 , 1508.8887 , 1105.65965, 1004.8753 , 1093.0857 ,
        974.7892 , 1110.31215, 1133.1709 ,  906.20275, 1120.22325,
       1040.6968 ,  705.8111 ,  764.4354 ,  946.55545,  886.8638 ])

In [32]:
# Measure how much predictions changed
prediction_shift = np.mean(
    np.abs(rf_preds - rf_preds_perturbed)
)

print("Average prediction change:", prediction_shift)


Average prediction change: 65.22483500000035


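After removing 5% of the data and retraining, predictions on the original test set changed by about 65 sales units on average, which is more than half of the model's own test MAE (about 105 units).

A stable model should produce nearly the same predictions when the data changes slightly. Here the predictions shift noticeably, which points to moderate instability in the Random Forest model. One caveat: the perturbed data was split again before training, so the perturbed training set differs from the original by more than 5% and can contain rows from the original test set. The 65-unit figure therefore reflects the new split as well as the removed rows.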
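A perturbation test that isolates the effect of the removed rows drops 5% of the training rows only and leaves the test set unchanged. The sketch below shows one way to set that up. It runs on synthetic data from `make_regression` (an assumption), so its output is not comparable with the 65-unit figure above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the notebook would use its own X and y.
X, y = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model trained on the full training set.
base = RandomForestRegressor(n_estimators=200, random_state=42)
base.fit(X_train, y_train)

# Keep a random 95% of the training rows; the test rows are never touched.
rng = np.random.default_rng(42)
keep = rng.choice(len(X_train), size=int(0.95 * len(X_train)), replace=False)

perturbed = RandomForestRegressor(n_estimators=200, random_state=42)
perturbed.fit(X_train[keep], y_train[keep])

# Mean absolute change in predictions on the same, unseen test rows.
shift = np.mean(np.abs(base.predict(X_test) - perturbed.predict(X_test)))

print("Training rows kept:", len(keep), "of", len(X_train))
print("Average prediction change:", shift)
```

Even with a fixed `random_state`, the bootstrap samples differ once rows are removed, so part of any measured shift comes from the forest's own randomness and not only from the missing data.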

In [None]:
# In practice, data changes every quarter due to:
# - Seasonal demand
# - Market conditions
# - Pricing changes
# - Customer behavior shifts
#
# If predictions move by around 65 units simply due to small data changes, then:
# - Budget allocation decisions may fluctuate
# - Sales forecasts may become inconsistent
# - Business planning becomes less reliable

In [33]:
# Train simpler parametric model for comparison
lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

# Predictions
lr_preds = lr_model.predict(X_test)

# Evaluate performance
print("Linear R2:", r2_score(y_test, lr_preds))
print("Linear MAE:", mean_absolute_error(y_test, lr_preds))


Linear R2: 0.8005903658927165
Linear MAE: 93.46242431692248


In [34]:
# Compare predictions of both models
lr_shift = np.mean(np.abs(lr_preds - rf_preds))

print("Average RF vs Linear difference:", lr_shift)


Average RF vs Linear difference: 56.4761755028745


In [35]:
# Show linear regression coefficients
# Helps understand business impact
pd.DataFrame({
    "Feature": features,
    "Coefficient": lr_model.coef_
})

             Feature  Coefficient
0  Advertising_Spend     1.901454
1              Price   -20.089307
2   Competitor_Price    12.857481


## Interpretation of Linear Regression Coefficients

The linear regression coefficients show how each business variable affects weekly sales while holding other factors constant.

### Advertising Spend (Coefficient = 1.90)
An increase of one unit in advertising spend leads to an increase of approximately 1.9 units in weekly sales.  
This confirms that advertising investment positively drives sales growth.

### Product Price (Coefficient = -20.09)
A one-unit increase in product price reduces sales by about 20 units.  
This indicates customers are highly price sensitive, and price increases can significantly reduce demand.

### Competitor Price (Coefficient = 12.86)
When competitor prices increase by one unit, our sales increase by around 13 units.  
This suggests customers shift toward our product when competitors become more expensive.

### Business Interpretation
These coefficients provide clear and direct business insight:
• Advertising boosts sales,
• Higher prices reduce demand,
• Competitive pricing affects customer choices.

This transparency makes linear regression easy for managers and regulators to understand, supporting stable business decision-making.
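Because the model is linear, the effect of any combination of changes is the sum of each coefficient times the corresponding change. The small helper below works through that arithmetic using the coefficients reported above. The function name `predicted_sales_change` and the example scenario are illustrative assumptions.

```python
# Coefficients as reported in the notebook's output table.
COEFFICIENTS = {
    "Advertising_Spend": 1.901454,
    "Price": -20.089307,
    "Competitor_Price": 12.857481,
}

def predicted_sales_change(deltas: dict) -> float:
    """Change in predicted weekly sales for the given feature changes."""
    return sum(COEFFICIENTS[name] * delta for name, delta in deltas.items())

# Hypothetical scenario: raise the price by 1 unit and advertising by 10 units.
scenario = {"Advertising_Spend": 10.0, "Price": 1.0}
change = predicted_sales_change(scenario)

print(f"Predicted change in weekly sales: {change:.2f}")
```

In this scenario the extra advertising (10 × 1.90 ≈ 19.01) almost offsets the price increase (-20.09), leaving a predicted change of about -1.07 units. This holds only as far as the fitted linear relationship does.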


## Predictive Stability and Interpretability Comparison

### Predictive Stability

Predictive stability refers to how much model predictions change when the training data changes slightly.

In our experiment, after modifying the dataset, Random Forest predictions changed by about 65 sales units on average. Separately, predictions from Random Forest and Linear Regression differ by about 56 units on average. This second figure measures disagreement between the two models, not the stability of either one.

The 65-unit shift indicates that Random Forest predictions are sensitive to data variations, because each tree adapts to the specific rows it is trained on.

In contrast, Linear Regression estimates only three coefficients and an intercept, so it is expected to produce more stable predictions under the same perturbation, although this notebook does not run that test for the linear model.

For business decision-making, stable predictions are important because unstable forecasts can lead to fluctuating budgets and planning errors.
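A like-for-like comparison would apply the same perturbation to both models and average over several random draws. The sketch below does this on synthetic data. Both the data and the helper name `mean_shift` are assumptions, and `make_regression` produces a linear relationship, which favors Linear Regression. The result therefore illustrates the procedure and says nothing about the sales dataset itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, linear stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def mean_shift(make_model, n_repeats=5, frac=0.95):
    """Average absolute change in test predictions after dropping training rows."""
    base_preds = make_model().fit(X_train, y_train).predict(X_test)
    shifts = []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)
        keep = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)
        preds = make_model().fit(X_train[keep], y_train[keep]).predict(X_test)
        shifts.append(np.mean(np.abs(base_preds - preds)))
    return float(np.mean(shifts))

lr_shift = mean_shift(LinearRegression)
rf_shift = mean_shift(lambda: RandomForestRegressor(n_estimators=100, random_state=0))

print("Linear Regression average shift:", lr_shift)
print("Random Forest average shift:", rf_shift)
```

Running the same function on the notebook's `X_train`, `y_train` and `X_test` would turn the stability comparison into a measured result.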

---

### Interpretability

Interpretability refers to how easily business stakeholders can understand model decisions.

Random Forest is less interpretable because predictions result from many decision trees working together, making it difficult to explain why a specific prediction occurs.

Linear Regression, however, directly shows how each variable affects sales through coefficients. For example, managers can clearly see how advertising or pricing changes influence sales.

Because stakeholders and regulators often require explanation of decisions, Linear Regression provides a significant advantage in transparency.

---

### Summary

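Random Forest provides higher flexibility and can be more accurate when relationships are nonlinear. On this dataset, however, it was less accurate than Linear Regression (test R2 0.75 vs 0.80), and its predictions are less stable and harder to explain.

Linear Regression offers better interpretability and is expected to give more stable predictions, which is important for regulated or business-critical decisions.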


### Deployment Recommendation

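Random Forest can capture nonlinear patterns, but on this dataset it scored lower than Linear Regression (test R2 0.75 vs 0.80, MAE 104.7 vs 93.5), and its predictions shifted by about 65 units when the data was perturbed.

Linear Regression is the more accurate model here and also offers: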
• Stable predictions
• Clear interpretation
• Regulatory transparency

Therefore, Linear Regression is recommended for regulated deployment.
