In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

In [9]:
import pandas as pd

df = pd.read_csv(
    r"C:\Users\Chaitra C P\OneDrive\Desktop\BIG DATA AND MACHINE LEARNING\weekly_sales_dataset.csv"
)

df.head()

Unnamed: 0,Advertising_Spend,Price,Competitor_Price,Weekly_Sales
0,559.61,43.37,57.3,1191.85
1,483.41,45.52,45.54,750.17
2,577.72,55.98,58.09,938.41
3,682.76,54.88,61.49,1360.39
4,471.9,49.83,54.89,839.55


In [16]:
df.shape

(300, 4)

In [6]:
## 1. Baseline Model Training and Performance
X = df[['Advertising_Spend', 'Price', 'Competitor_Price']]
y = df['Weekly_Sales']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train baseline linear regression model
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

# Predictions
y_pred = baseline_model.predict(X_test)

# Performance metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

r2, mae    

Interpretation: The baseline linear regression model explains approximately 80% of the variance in weekly sales on the test set (R² ≈ 0.80), indicating a strong fit.
The mean absolute error is about 93 units, so a typical weekly prediction misses actual sales by roughly that amount.
For weekly budgeting, this level of accuracy is acceptable and offers a reasonable balance between performance and interpretability.

(0.8005903658927165, 93.46242431692248)
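
The R² and MAE above come from a single 80/20 split, so they depend on which rows landed in the test set. A minimal sketch of a 5-fold cross-validation check follows. It runs on synthetic data, because the CSV is not available here; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the weekly sales data (assumption, not the real CSV).
rng = np.random.default_rng(0)
n = 300
X_demo = np.column_stack([
    rng.normal(500, 100, n),  # advertising spend
    rng.normal(50, 5, n),     # price
    rng.normal(55, 5, n),     # competitor price
])
y_demo = (2.0 * X_demo[:, 0] - 20.0 * X_demo[:, 1]
          + 12.0 * X_demo[:, 2] + rng.normal(0, 100, n))

# Score the same model on five different train/test partitions.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
)
cv_r2 = scores["test_r2"]
cv_mae = -scores["test_neg_mean_absolute_error"]  # flip sign back to a positive error

print("R2 mean/std:", cv_r2.mean(), cv_r2.std())
print("MAE mean/std:", cv_mae.mean(), cv_mae.std())
```

A small standard deviation across folds would suggest that the single-split figures are representative; a large one would suggest they are sensitive to the split.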

In [7]:
## 2. Counterfactual Analysis
2.1 Increase Advertising Spend by 20% (Holding Other Features Constant)

X_cf = X_test.copy()
X_cf['Advertising_Spend'] = X_cf['Advertising_Spend'] * 1.20


y_cf_pred = baseline_model.predict(X_cf)

# Average change in predicted sales
avg_change = np.mean(y_cf_pred - y_pred)
avg_change

2.2 Economic Plausibility Analysis
Interpretation: In the counterfactual experiment, a 20% increase in advertising spend raises the model's predicted weekly sales by approximately 195 units on average.
The direction is economically plausible: advertising is expected to increase demand when all other factors are held constant.
The change is proportional to advertising spend by construction, since the model is linear; this makes the estimate easy to use in budget planning, provided the linear relationship holds over the range considered.

195.1683186157067
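
For a linear model, the counterfactual average can also be read directly from the coefficient: the mean change equals the advertising coefficient times 0.20 times the mean advertising spend in the rows being scored. With the fitted coefficient of about 1.90, the 195-unit figure implies a mean test-set advertising spend of roughly 513. The sketch below checks the identity on synthetic data; the data and the names `X_demo`, `simulated`, and `closed_form` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): sales driven by ad spend and price plus noise.
rng = np.random.default_rng(1)
n = 200
ad = rng.normal(500, 100, n)
price = rng.normal(50, 5, n)
X_demo = np.column_stack([ad, price])
y_demo = 2.0 * ad - 20.0 * price + rng.normal(0, 50, n)

model = LinearRegression().fit(X_demo, y_demo)

# Simulated counterfactual: scale ad spend by 1.20, hold price fixed.
X_cf_demo = X_demo.copy()
X_cf_demo[:, 0] *= 1.20
simulated = np.mean(model.predict(X_cf_demo) - model.predict(X_demo))

# Closed form for a linear model: coefficient * 0.20 * mean ad spend.
closed_form = model.coef_[0] * 0.20 * X_demo[:, 0].mean()

print(simulated, closed_form)
```

The two numbers agree up to floating-point error, which is why the counterfactual adds no information beyond the coefficient itself for this model class.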

In [8]:
## 3. Intentional Model Misspecification
3.1 Removing a Key Variable (Advertising_Spend)

X_miss = df[['Price', 'Competitor_Price']]
y = df['Weekly_Sales']

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_miss, y, test_size=0.2, random_state=42
)

miss_model = LinearRegression()
miss_model.fit(X_train_m, y_train_m)

y_miss_pred = miss_model.predict(X_test_m)

# Performance comparison
miss_r2 = r2_score(y_test_m, y_miss_pred)
miss_mae = mean_absolute_error(y_test_m, y_miss_pred)

baseline_model.coef_, miss_model.coef_, miss_r2, miss_mae

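3.2 Impact on Coefficients and Predictions

Interpretation: Removing advertising spend causes a sharp decline in model performance, with R² dropping from 0.80 to 0.32 and MAE rising from about 93 to about 176.
The coefficients of the remaining variables shift only modestly (Price from -20.09 to -21.64, Competitor_Price from 12.86 to 12.04), which suggests that advertising spend is only weakly correlated with the two price variables, so omitted variable bias in these coefficients is small here.
The main damage is to predictive accuracy: the model can no longer account for sales variation driven by advertising, which makes its predictions unreliable.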

3.3 Business Risk from Misspecification
From a business perspective, this misspecification creates serious risk.
A model without advertising spend cannot estimate the return on advertising, so budget decisions based on it may wrongly reduce advertising investment even though advertising is a key sales driver.
Such errors can result in revenue loss, inefficient resource allocation, and poor strategic decisions, even though the remaining coefficients look plausible on their own.

## 4. Scenario Where the Model Should Not Be Deployed
Even if performance metrics are strong, this parametric model should not be deployed when the relationship between inputs and sales is structurally unstable, such as during major market disruptions, regulatory changes, or shifts in consumer behavior.
In such cases, linear assumptions break down, and relying on the model could lead to systematically incorrect budget decisions and financial losses.


(array([  1.90145437, -20.08930699,  12.85748125]),
 array([-21.64002089,  12.04379481]),
 0.32340491694348117,
 175.69033044243196)
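
How far a remaining coefficient moves when a variable is dropped depends on how strongly the dropped variable is correlated with it. The sketch below illustrates this on synthetic, standardized data: `price_coef_shift` is a hypothetical helper, and the true coefficients (2.0 for advertising, -1.0 for price) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def price_coef_shift(corr, n=5000, seed=0):
    """Return the price coefficient with and without ad spend in the model."""
    rng = np.random.default_rng(seed)
    price = rng.normal(0, 1, n)
    # Build ad spend with correlation `corr` to price.
    ad = corr * price + np.sqrt(1 - corr**2) * rng.normal(0, 1, n)
    y = 2.0 * ad - 1.0 * price + rng.normal(0, 0.5, n)
    full = LinearRegression().fit(np.column_stack([ad, price]), y)
    reduced = LinearRegression().fit(price.reshape(-1, 1), y)
    return full.coef_[1], reduced.coef_[0]

full_0, reduced_0 = price_coef_shift(corr=0.0)  # uncorrelated: little bias
full_6, reduced_6 = price_coef_shift(corr=0.6)  # correlated: bias near 2.0 * 0.6

print("corr=0.0:", full_0, reduced_0)
print("corr=0.6:", full_6, reduced_6)
```

With uncorrelated predictors the price coefficient barely moves, while with correlation 0.6 it absorbs part of the advertising effect. In both cases the reduced model loses explanatory power.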

In [10]:
Question 2: B) Robustness and Stability Testing
## 1. Train a Non-Parametric Model and Tune a Hyperparameter
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Load dataset
df = pd.read_csv(
    r"C:\Users\Chaitra C P\OneDrive\Desktop\BIG DATA AND MACHINE LEARNING\weekly_sales_dataset.csv"
)

df.head()
X = df[['Advertising_Spend', 'Price', 'Competitor_Price']]
y = df['Weekly_Sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train Decision Tree with a fixed depth limit (max_depth=4); no search over depths is run here
tree_model = DecisionTreeRegressor(max_depth=4, random_state=42)
tree_model.fit(X_train, y_train)

tree_pred = tree_model.predict(X_test)
tree_mae = mean_absolute_error(y_test, tree_pred)

tree_mae

Interpretation: A Decision Tree regressor is trained as the chosen non-parametric model.
The hyperparameter max_depth is set to 4 to limit overfitting; the value is fixed by hand here rather than selected by a search over candidate depths.
The resulting test MAE of about 149 is noticeably higher than the linear baseline's 93, and the model remains flexible and sensitive to data changes.

149.3580598999215
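
One way to choose max_depth from the data rather than by hand is a cross-validated grid search. The following is a minimal sketch on synthetic data. The candidate depths and the names `X_demo`, `y_demo`, `best_depth`, and `best_cv_mae` are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption, stands in for the training split).
rng = np.random.default_rng(2)
n = 300
X_demo = rng.normal(0, 1, (n, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, n)

# Evaluate each candidate depth with 5-fold cross-validation on MAE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 8, None]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_depth = search.best_params_["max_depth"]
best_cv_mae = -search.best_score_  # flip sign back to a positive error

print(best_depth, best_cv_mae)
```

Running the search on the training split only, and keeping the test set for a final check, avoids tuning against the data used to report performance.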

In [12]:
## 2. Stability Test
2.1 Slightly Perturb the Data (Remove 5% of Rows)
df_perturbed = df.sample(frac=0.95, random_state=42)

X_p = df_perturbed[['Advertising_Spend', 'Price', 'Competitor_Price']]
y_p = df_perturbed['Weekly_Sales']

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X_p, y_p, test_size=0.2, random_state=42
)


In [13]:
2.2 Re-train the Model
tree_model_p = DecisionTreeRegressor(max_depth=4, random_state=42)
tree_model_p.fit(X_train_p, y_train_p)

tree_pred_p = tree_model_p.predict(X_test)


In [14]:
2.3 Compare Predictions Before and After Perturbation
# Both prediction arrays come from the same X_test, so they already have equal length
prediction_difference = np.mean(np.abs(tree_pred - tree_pred_p))
prediction_difference

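Interpretation: After removing just 5% of the data, predictions on the original test set change by about 115 units on average, which is large relative to the tree's own MAE of about 149.
This shows that the Decision Tree model is not highly stable, as small changes in the dataset lead to noticeable changes in predictions.
Note that the perturbed data was re-split before retraining, so the new training set differs from the original by more than the 5% of removed rows and can include rows from the original test set; part of the measured change reflects the new split rather than the perturbation alone.
Such instability poses a risk in business environments where consistency over time is important.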

114.70709464164253
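
A stricter version of this stability test drops 5% of rows from the training set only and leaves the test set untouched, so the two models differ by exactly the removed rows. The sketch below shows that setup on synthetic data. The variable names (`X_tr`, `keep`, `shift`) and the data are assumptions, and the result is not comparable to the 114.7 figure above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption, not the weekly sales CSV).
rng = np.random.default_rng(3)
n = 300
X_demo = rng.normal(0, 1, (n, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Keep a random 95% of the training rows; the test rows are never touched.
keep = rng.choice(len(X_tr), size=int(round(0.95 * len(X_tr))), replace=False)

tree_full = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr, y_tr)
tree_drop = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr[keep], y_tr[keep])

# Mean absolute change in predictions on the identical test set.
shift = np.mean(np.abs(tree_full.predict(X_te) - tree_drop.predict(X_te)))
print(shift)
```

Repeating this over several random draws of `keep` would give a distribution of shifts instead of a single number.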

In [15]:
## 3. Comparison with Linear Regression: Predictive Stability
# Train baseline Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)

# Retrain on perturbed data
lr_model_p = LinearRegression()
lr_model_p.fit(X_train_p, y_train_p)

lr_pred_p = lr_model_p.predict(X_test)

# Compare prediction stability
tree_diff = np.mean(np.abs(tree_pred - tree_pred_p))
lr_diff = np.mean(np.abs(lr_pred - lr_pred_p))

tree_diff, lr_diff

Interpretation: Predictive Stability
The Decision Tree's predictions change by about 115 units on average after the perturbation, roughly nine times the 12-unit change for linear regression.
Linear regression remains relatively stable, indicating stronger robustness to small data changes.

Interpretability Comparison:
1. Decision Trees generate complex, conditional rules that may change across retraining cycles.
2. Linear regression provides stable coefficients with clear economic interpretation.
3. From a governance perspective, linear regression is easier to audit and explain.

(114.70709464164253, 12.340604995489171)
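
The interpretability contrast can be made concrete by printing what each model exposes to a reviewer: one coefficient per feature for linear regression, and a set of nested threshold rules for the tree. The sketch below uses synthetic data with the notebook's column names; the data, the depth of 2, and the names `coef_table` and `rules` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

features = ["Advertising_Spend", "Price", "Competitor_Price"]

# Synthetic stand-in data (assumption, not the real CSV).
rng = np.random.default_rng(4)
n = 300
X_demo = np.column_stack([
    rng.normal(500, 100, n),
    rng.normal(50, 5, n),
    rng.normal(55, 5, n),
])
y_demo = (2.0 * X_demo[:, 0] - 20.0 * X_demo[:, 1]
          + 12.0 * X_demo[:, 2] + rng.normal(0, 100, n))

linear = LinearRegression().fit(X_demo, y_demo)
tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Linear model: one number per feature, readable as units of sales per unit change.
coef_table = dict(zip(features, linear.coef_.round(2)))
print(coef_table)

# Tree: nested if/else thresholds; these can change when the model is retrained.
rules = export_text(tree, feature_names=features)
print(rules)
```

Even at depth 2 the tree needs several conditional rules to describe its predictions, and at the depth of 4 used above the rule set grows to as many as 16 leaves.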

In [None]:
## 4. Deployment Recommendation for a Regulated Business
Recommendation
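For a regulated business environment, linear regression should be preferred over the non-parametric Decision Tree model, even if the Decision Tree occasionally shows higher accuracy. In this analysis the linear model was also the more accurate of the two (test MAE of about 93 versus about 149).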

Justification
Although Decision Trees are flexible and capture non-linear patterns, they demonstrate lower stability and higher sensitivity to small data changes.
In regulated industries, consistency, interpretability, and auditability are more important than marginal accuracy gains.
Linear regression offers predictable behavior, transparent decision logic, and lower governance risk, making it more suitable for deployment in regulated business contexts.