In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

import warnings
warnings.simplefilter("ignore")

In [2]:
df=pd.read_excel("cleaned_advertising_data.xlsx")
df.head()

Unnamed: 0,sales,total_sales
0,22100,337100
1,10400,128900
2,9300,132400
3,18500,251300
4,12900,250000


In [3]:
x = df[["total_sales"]]  # 2-D feature matrix with a single predictor
y = df["sales"]          # 1-D target

In [4]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2)

**Modelling**

In [24]:
from sklearn.preprocessing import PolynomialFeatures
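# Expand total_sales into [x, x^2, x^3]; no bias column because LinearRegression fits its own intercept
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)
# Fit the transformer on the training split only, then reuse it unchanged on the test split
x_train_poly = pd.DataFrame(polynomial_converter.fit_transform(x_train))
x_test_poly = pd.DataFrame(polynomial_converter.transform(x_test))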

In [25]:
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(x_train_poly,y_train)

print('Coefficients:',model.coef_)
print('intercept:',model.intercept_)

Coefficients: [ 6.62854182e-02 -9.61804834e-08  1.66520026e-13]
intercept: 3430.514152904623
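
Written out, the printed values describe the cubic below, where x is total_sales. This assumes the coefficients are in the order PolynomialFeatures emits for a single input (x, x^2, x^3); the numbers are rounded from the output above.

```latex
\widehat{\text{sales}} \approx 3430.51
  + 6.6285 \times 10^{-2}\, x
  - 9.6180 \times 10^{-8}\, x^{2}
  + 1.6652 \times 10^{-13}\, x^{3}
```

As a rough check, the first row of df.head() has x = 337100, which gives about 3430.5 + 22344.8 - 10929.6 + 6378.9 ≈ 21225, against a recorded sales value of 22100.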


**Evaluation**

In [26]:
from sklearn.metrics import r2_score

ypred_train=model.predict(x_train_poly)
print("Train R2",r2_score(y_train,ypred_train))

from sklearn.model_selection import cross_val_score
print("Train cv",cross_val_score(model,x_train_poly,y_train,cv=5,scoring="r2").mean())

ypred_test=model.predict(x_test_poly)
print("Test R2",r2_score(y_test,ypred_test))

Train R2 0.8162547370874194
Train cv 0.8002243459015939
Test R2 0.3480208378992392


**Key Conclusions**

Overfitting:
- The large gap between the Train R2 (0.8162) and Test R2 (0.3480) indicates that the model is overfitting.
- Overfitting occurs when the model learns the noise or specific patterns in the training data that do not generalize well to unseen data.

High Variance:
- The model appears to have high variance, meaning it may be too complex for the dataset. Polynomial regression is prone to this: higher-degree terms let it fit the training data closely while failing to generalize.

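Cross-Validation vs. Test Score:
- The Train CV R2 (0.8002) is close to the Train R2 (0.8162), so the model scores consistently on held-out folds of the training data. The Test R2 (0.3480) is far lower than both, which means the poor generalization shows up only on the test split. Besides overfitting, a small or unrepresentative test set (for example, one containing a few outliers) can produce this pattern, so it is worth checking before drawing firm conclusions.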
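
One way to probe whether the low Test R2 comes from the model or from the particular split is to repeat the train/test split with several seeds and look at the spread of test scores. The sketch below is illustrative only: it uses synthetic data as a stand-in for cleaned_advertising_data.xlsx, and it adds a StandardScaler before the polynomial expansion for numerical stability, which the notebook does not do.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real file)
rng = np.random.default_rng(0)
total_sales = rng.uniform(50_000, 350_000, size=200)
sales = 3_000 + 0.05 * total_sales + rng.normal(0, 3_000, size=200)
x = pd.DataFrame({"total_sales": total_sales})
y = pd.Series(sales, name="sales")

test_scores = []
for seed in range(10):
    # Same 80/20 split as the notebook, but with a different seed each time
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.2, random_state=seed
    )
    pipe = make_pipeline(
        StandardScaler(),  # added for conditioning; not in the notebook
        PolynomialFeatures(degree=3, include_bias=False),
        LinearRegression(),
    )
    pipe.fit(x_tr, y_tr)
    test_scores.append(r2_score(y_te, pipe.predict(x_te)))

print("Test R2 per seed:", np.round(test_scores, 3))
print("mean:", np.mean(test_scores), "std:", np.std(test_scores))
```

If the test score swings widely from seed to seed on the real data, the single value reported for random_state=2 says more about that split than about the model.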



**What Should You Do Next?**

1. Reduce Model Complexity
- Lower the Polynomial Degree: The current model uses a 3rd-degree polynomial (degree=3). Try reducing the degree to 2 or 1 (plain linear regression) to see whether the model generalizes better.
   - Example: polynomial_converter = PolynomialFeatures(degree=2, include_bias=False)
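
A minimal sketch of comparing degrees 1 to 3 by cross-validated R2 on the training split follows. The data is synthetic (an assumption standing in for the Excel file), and the StandardScaler step is an addition for numerical stability rather than part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real file)
rng = np.random.default_rng(0)
total_sales = rng.uniform(50_000, 350_000, size=200)
sales = 3_000 + 0.05 * total_sales + rng.normal(0, 3_000, size=200)
x = pd.DataFrame({"total_sales": total_sales})
y = pd.Series(sales, name="sales")
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2
)

cv_r2 = {}
for degree in [1, 2, 3]:
    pipe = make_pipeline(
        StandardScaler(),  # added for conditioning; not in the notebook
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # Mean R2 over 5 folds of the training split only
    scores = cross_val_score(pipe, x_train, y_train, cv=5, scoring="r2")
    cv_r2[degree] = scores.mean()
    print(f"degree={degree}: mean CV R2 = {cv_r2[degree]:.4f}")

best_degree = max(cv_r2, key=cv_r2.get)
print("Best degree by CV:", best_degree)
```

Wrapping the transformer and the regressor in one pipeline means the polynomial expansion is refit inside each fold, so no information leaks from the validation fold.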

2. Use Regularization
- Regularization techniques like Ridge Regression or Lasso Regression can help reduce overfitting by penalizing large coefficients.
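
The sketch below shows the shrinkage effect with Ridge on the same degree-3 features. It runs on synthetic data (an assumption), and alpha=10.0 is an arbitrary illustrative value, not a tuned one. Lasso from sklearn.linear_model could be swapped in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real file)
rng = np.random.default_rng(0)
total_sales = rng.uniform(50_000, 350_000, size=200)
sales = 3_000 + 0.05 * total_sales + rng.normal(0, 3_000, size=200)
x = pd.DataFrame({"total_sales": total_sales})
y = pd.Series(sales, name="sales")
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2
)


def build(regressor):
    # Scaling first matters for Ridge: the penalty depends on feature scale
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=3, include_bias=False),
        regressor,
    )


ols = build(LinearRegression()).fit(x_train, y_train)
ridge = build(Ridge(alpha=10.0)).fit(x_train, y_train)  # alpha is illustrative

# The last pipeline step is the fitted regressor
ols_norm = np.linalg.norm(ols[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
print("OLS Test R2:  ", ols.score(x_test, y_test))
print("Ridge Test R2:", ridge.score(x_test, y_test))
```

Ridge trades a little training fit for smaller coefficients; whether that improves the test score depends on the data, so both scores are printed rather than assumed.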

3. Collect More Data
- If possible, collect more data to help the model generalize better. Polynomial regression models often require a large amount of data to avoid overfitting.
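
Before collecting more data, a learning curve can hint at whether it would help: if the validation score is still rising as the training size grows, more data is likely to be useful. This is a sketch on synthetic data (an assumption), using scikit-learn's learning_curve with the degree-3 model plus an added StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real file)
rng = np.random.default_rng(0)
total_sales = rng.uniform(50_000, 350_000, size=200)
sales = 3_000 + 0.05 * total_sales + rng.normal(0, 3_000, size=200)
x = pd.DataFrame({"total_sales": total_sales})
y = pd.Series(sales, name="sales")

pipe = make_pipeline(
    StandardScaler(),  # added for conditioning; not in the notebook
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)

# Score the model at 20%, 40%, ..., 100% of the available training rows
train_sizes, train_scores, val_scores = learning_curve(
    pipe,
    x,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="r2",
    shuffle=True,
    random_state=0,
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train R2={tr:.3f}  validation R2={va:.3f}")
```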

4. Feature Selection
- This model uses a single feature (total_sales), so feature selection only becomes relevant if more predictors are added. With several features, keep only the most important ones, because polynomial expansion multiplies the number of terms quickly.

5. Cross-Validation for Hyperparameter Tuning
- Use cross-validation to tune hyperparameters like the polynomial degree or regularization strength (alpha in Ridge/Lasso).
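
As a sketch, the polynomial degree and the Ridge alpha can be tuned together with GridSearchCV. The data is synthetic (an assumption), and the step names ("scale", "poly", "ridge") and the grid values are illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real file)
rng = np.random.default_rng(0)
total_sales = rng.uniform(50_000, 350_000, size=200)
sales = 3_000 + 0.05 * total_sales + rng.normal(0, 3_000, size=200)
x = pd.DataFrame({"total_sales": total_sales})
y = pd.Series(sales, name="sales")
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(include_bias=False)),
        ("ridge", Ridge()),
    ]
)

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(x_train, y_train)  # tuning uses the training split only

print("Best params:", search.best_params_)
print("Best CV R2:", search.best_score_)
print("Test R2:", search.score(x_test, y_test))
```

The test split is scored once, after the search, so it stays an honest estimate of performance on unseen data.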

6. Evaluate Other Models
- If polynomial regression continues to overfit, consider trying other models like:
   - Random Forest
   - Gradient Boosting
   - Support Vector Regression (SVR)
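
The three alternatives can be compared on equal footing with the same 5-fold cross-validation. This is a sketch on synthetic data (an assumption) with default or near-default hyperparameters. SVR is sensitive to the scale of both features and target, so it is wrapped here with StandardScaler and TransformedTargetRegressor, which is a design choice of this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the notebook's data (assumption, not the real file)
rng = np.random.default_rng(0)
total_sales = rng.uniform(50_000, 350_000, size=200)
sales = 3_000 + 0.05 * total_sales + rng.normal(0, 3_000, size=200)
x = pd.DataFrame({"total_sales": total_sales})
y = pd.Series(sales, name="sales")
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2
)

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    # Scale the feature and the target so SVR's default C and epsilon are sensible
    "SVR": TransformedTargetRegressor(
        regressor=make_pipeline(StandardScaler(), SVR()),
        transformer=StandardScaler(),
    ),
}

model_cv_r2 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, x_train, y_train, cv=5, scoring="r2")
    model_cv_r2[name] = scores.mean()
    print(f"{name}: mean CV R2 = {model_cv_r2[name]:.4f}")
```

Tree ensembles can also overfit a small single-feature dataset, so the cross-validated scores, not the training scores, are the ones to compare.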

**Final Thoughts**

- The current model is overfitting, as evidenced by the large gap between the Train R2 and Test R2.
- To improve generalization, reduce model complexity, use regularization, or try alternative models.
- Always validate your model's performance on unseen data (test set) to ensure it generalizes well.