<a href="https://colab.research.google.com/github/PSamita/MIT-Course/blob/main/Effects_of_Advertising_on_Sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Effects of Advertising on Sales

### LVC 1 - Introduction to Supervised Learning: Regression


## Context and Problem

- An interesting application of regression is quantifying the effect of advertising on sales. Common advertising channels include newspaper, TV, and radio.
- In this case study, we look at a company's advertising data and examine how spending on each channel relates to sales.
- We also try to predict sales from the advertising budgets.


## Data Information

The data has three features describing advertising spend, and the target variable is net sales. The attributes are:

- TV        - Independent variable quantifying the budget for TV ads
- Radio     - Independent variable quantifying the budget for radio ads
- Newspaper - Independent variable quantifying the budget for newspaper ads
- Sales     - Dependent variable

### Let us start by importing necessary packages

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

In [None]:
# Let us import the file from our system. Note that you can also load the data from Google Drive.
# The code below applies only if you are working on Google Colab. In a local Jupyter Notebook, you can call pd.read_csv(filename) directly to load the data into a DataFrame.

#from google.colab import files
#uploaded = files.upload()

In [None]:
Ad_df = pd.read_csv('Advertising.csv')

# we have loaded the data into the Ad_df data frame. Let us now have a quick look.
Ad_df.head()

In [None]:
# we can drop the first column as it is just the index
Ad_df.drop(columns = 'Unnamed: 0', inplace=True)

In [None]:
Ad_df

In [None]:
Ad_df.info()

**Observations:** All the variables are of float data type.

### Let us now start with the simple linear regression. We will use one feature at a time and have a look at the target variable. 

In [None]:
# The dataset is stored in a pandas DataFrame. Extract each variable as a NumPy column vector
# of shape (n_samples, 1), which is the 2-D layout scikit-learn expects for features.
Sales = Ad_df['Sales'].values.reshape(-1, 1)
TV = Ad_df['TV'].values.reshape(-1, 1)
Radio = Ad_df['Radio'].values.reshape(-1, 1)
Newspaper = Ad_df['Newspaper'].values.reshape(-1, 1)

In [None]:
# let us fit the simple linear regression model with the TV feature
tv_model = linear_model.LinearRegression()
tv_model.fit(TV, Sales)
coeffs_tv = np.array(list(tv_model.intercept_.flatten()) + list(tv_model.coef_.flatten()))
coeffs_tv = list(coeffs_tv)

# let us fit the simple linear regression model with the Radio feature
radio_model = linear_model.LinearRegression()
radio_model.fit(Radio, Sales)
coeffs_radio = np.array(list(radio_model.intercept_.flatten()) + list(radio_model.coef_.flatten()))
coeffs_radio = list(coeffs_radio)

# let us fit the simple linear regression model with the Newspaper feature
newspaper_model = linear_model.LinearRegression()
newspaper_model.fit(Newspaper, Sales)
coeffs_newspaper = np.array(list(newspaper_model.intercept_.flatten()) + list(newspaper_model.coef_.flatten()))
coeffs_newspaper = list(coeffs_newspaper)

# let us store the above results in a dictionary and then display using a dataframe
dict_Sales = {}
dict_Sales["TV"] = coeffs_tv
dict_Sales["Radio"] = coeffs_radio
dict_Sales["Newspaper"] = coeffs_newspaper

metric_Df_SLR =  pd.DataFrame(dict_Sales)
metric_Df_SLR.index = ['Intercept', 'Coefficient']
metric_Df_SLR
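
For a single feature, the intercept and coefficient of a least-squares line have a closed form: the slope is the covariance of the feature and the target divided by the variance of the feature, and the intercept follows from the two means. The sketch below checks this against `np.polyfit`. The data is synthetic (the values 7.0, 0.05 and the noise level are made up for illustration and are not estimates from the Advertising data).

```python
import numpy as np

# Synthetic stand-in for one ad channel and sales (hypothetical values, not the real data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Closed-form least-squares estimates for y = intercept + slope * x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit solves the same least-squares problem, so the two should agree
slope_np, intercept_np = np.polyfit(x, y, deg=1)
print("closed form:", intercept, slope)
print("np.polyfit :", intercept_np, slope_np)
```

The same two numbers are what `intercept_` and `coef_` hold after fitting `LinearRegression` on one feature.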

In [None]:
# Let us now calculate R^2
tv_rsq = tv_model.score(TV, Sales)
radio_rsq = radio_model.score(Radio, Sales)
newspaper_rsq = newspaper_model.score(Newspaper, Sales)

print("TV simple linear regression R-Square :", tv_rsq)
print("Radio simple linear regression R-Square :", radio_rsq)
print("Newspaper simple linear regression R-Square :", newspaper_rsq)
list_rsq = [tv_rsq, radio_rsq, newspaper_rsq]
list_rsq

In [None]:
metric_Df_SLR.loc['R-Squared'] = list_rsq
metric_Df_SLR

**Observations:** TV has the highest R^2 value, about 61%, followed by Radio and then Newspaper.
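
The R^2 value returned by `score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. A minimal sketch that computes it by hand on synthetic data (made-up values, not the Advertising data) and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data (hypothetical values)
rng = np.random.default_rng(1)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X[:, 0] + rng.normal(0, 3, size=200)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)       # variation the model leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("manual R^2 :", r2_manual)
print("model.score:", model.score(X, y))
```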

Let us visualize the best-fit line for each feature on top of a scatter plot of the data.

In [None]:
plt.scatter(TV, Sales,  color='red')
plt.xlabel('TV Ads')
plt.ylabel('Sales')
plt.plot(TV, tv_model.predict(TV), color='blue', linewidth=3)
plt.show()

plt.scatter(Radio, Sales,  color='red')
plt.xlabel('Radio Ads')
plt.ylabel('Sales')
plt.plot(Radio, radio_model.predict(Radio), color='blue', linewidth=3)
plt.show()

plt.scatter(Newspaper, Sales,  color='red')
plt.xlabel('Newspaper Ads')
plt.ylabel('Sales')
plt.plot(Newspaper, newspaper_model.predict(Newspaper), color='blue', linewidth=3)
plt.show()


## Multiple Linear Regression

- Let us now build a multiple linear regression model.

In [None]:
mlr_model = linear_model.LinearRegression()
mlr_model.fit(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])

In [None]:
Ad_df['Sales_Predicted']  = mlr_model.predict(Ad_df[['TV', 'Radio', 'Newspaper']]) 
Ad_df['Error'] = (Ad_df['Sales_Predicted'] - Ad_df['Sales'])**2
MSE_MLR = Ad_df['Error'].mean()

In [None]:
MSE_MLR

In [None]:
mlr_model.score(Ad_df[['TV', 'Radio', 'Newspaper']], Ad_df['Sales'])

**Observations:** The R^2 value for the multiple linear regression comes out to be 89.7%, well above the best simple linear regression (about 61% for TV).

Let us now use statsmodels to get a more detailed model interpretation.

In [None]:
# Let us fit the same model with statsmodels to get inferential statistics.
import statsmodels.formula.api as smf
lm1 = smf.ols(formula= 'Sales ~ TV+Radio+Newspaper', data = Ad_df).fit()
print(lm1.summary())  # coefficients, standard errors, p-values, confidence intervals

In [None]:
print("*************Parameters**************")
print(lm1.params)
print("*************P-Values**************")
print(lm1.pvalues)
print("************Standard Errors***************")
print(lm1.bse) 
print("*************Confidence Interval**************")
print(lm1.conf_int())
print("*************Parameter Covariance Matrix**************")
print(lm1.cov_params())
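
The standard errors printed above are the square roots of the diagonal of the parameter covariance matrix, which for ordinary least squares is the residual variance times the inverse of X'X. The sketch below reproduces that calculation with NumPy only. The data is synthetic, and the third feature is given a true coefficient of zero purely as an illustration of a feature with no effect.

```python
import numpy as np

# Synthetic data: three features, the third has no real effect (hypothetical setup)
rng = np.random.default_rng(2)
n = 200
features = rng.uniform(0, 100, size=(n, 3))
y = 3.0 + features @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=n)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), features])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
dof = n - X.shape[1]                         # residual degrees of freedom
sigma2 = residuals @ residuals / dof         # unbiased estimate of the noise variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # covariance matrix of the estimates
std_err = np.sqrt(np.diag(cov_beta))         # standard errors
t_values = beta / std_err                    # t-statistics for H0: coefficient = 0

print("coefficients   :", beta)
print("standard errors:", std_err)
print("t-values       :", t_values)
```

A coefficient whose t-value is small relative to its standard error gets a large p-value, which is the basis for judging a feature as not significant.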


### Visualizing the confidence bands in Simple linear regression

In [None]:
import seaborn as sns
sns.lmplot(x = 'TV', y = 'Sales', data = Ad_df)

sns.lmplot(x = 'Radio', y = 'Sales', data = Ad_df )

sns.lmplot(x = 'Newspaper', y = 'Sales', data = Ad_df)

# LVC 2 - Model Evaluation: Cross-Validation and Bootstrapping

- The statsmodels summary shows a high p-value for Newspaper, so it can be dropped from the list of significant features.
- Let us now rerun the regression with an added multiplicative (interaction) feature, TV x Radio.

In [None]:
Ad_df['TVandRadio'] = Ad_df['TV']*Ad_df['Radio']

In [None]:
# let us remove the sales_predicted and the error column generated earlier
Ad_df.drop(columns = ["Error", "Sales_Predicted"], inplace = True)

In [None]:
# Let us fit the model again with the new interaction feature.
import statsmodels.formula.api as smf
lm2 = smf.ols(formula= 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit()
print(lm2.summary())  # inferential statistics

**Observations**
- The R-squared increases once the interaction term is added. However, is this model useful for prediction? Does it predict well on unseen data? Let us find out!
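
One reason to be cautious: R-squared measured on the same data used for fitting can never go down when a feature is added to a linear model, whether or not the feature is useful. The sketch below illustrates this on synthetic data. The variable names and the ground-truth coefficients are hypothetical and chosen so that a real interaction exists.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic budgets and sales with a genuine TV x Radio interaction (made-up coefficients)
rng = np.random.default_rng(3)
n = 200
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
sales = 5.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 0.5, size=n)

X_base = np.column_stack([tv, radio])
X_inter = np.column_stack([tv, radio, tv * radio])   # add the product as a third column

r2_base = LinearRegression().fit(X_base, sales).score(X_base, sales)
r2_inter = LinearRegression().fit(X_inter, sales).score(X_inter, sales)

print("in-sample R^2 without interaction:", r2_base)
print("in-sample R^2 with interaction   :", r2_inter)
```

Because the in-sample comparison is tilted in favour of the larger model, the fair comparison is on rows the model has not seen.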

## Performance assessment, testing and validation

### Train, Test, and Validation set
- We will split data into three sets, one to train the model, one to validate the model performance (not seen during training) and make improvements, and the last to test the model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
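# Base features exclude the target and the interaction term; the extended set excludes only the target.
features_base = [col for col in Ad_df.columns if col not in ("Sales", "TVandRadio")]
features_added = [col for col in Ad_df.columns if col != "Sales"]  # `not in "Sales"` would be a substring test
target = 'Sales'
# Hold out 10% of the rows as the test set (no random_state is set, so the split differs between runs)
train, test = train_test_split(Ad_df, test_size = 0.10, train_size = 0.9)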

In [None]:
train, validation = train_test_split(train, test_size = 0.2, train_size = 0.80)

In [None]:
train.shape, validation.shape,test.shape
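
With 200 rows, the two successive splits give 144 training, 36 validation, and 20 test rows. The sketch below reproduces the arithmetic on a placeholder frame and fixes `random_state` so the split is repeatable. The seed value 42 and the frame itself are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 200 rows standing in for the advertising data
df = pd.DataFrame({"row": np.arange(200)})

# First split: 10% test. Second split: 20% of the remainder for validation.
train_full, test = train_test_split(df, test_size=0.10, random_state=42)
train, validation = train_test_split(train_full, test_size=0.20, random_state=42)

print(len(train), len(validation), len(test))
```

Fixing the seed makes the metrics comparable between runs, at the cost of tying the results to one particular split.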

In [None]:
# now let us start with the modelling
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()
mlr.fit(train[features_base], train[target])
print("*********Training set Metrics**************")
print("R-Squared:", mlr.score(train[features_base], train[target]))
se_train = (train[target] - mlr.predict(train[features_base]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", mlr.score(validation[features_base], validation[target]))
se_val = (validation[target] - mlr.predict(validation[features_base]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)

In [None]:
# Can we increase the model performance by adding the new feature? 
# We found that to be the case in the analysis above but let's check the same for the validation dataset

mlr_added_feature = LinearRegression()
mlr_added_feature.fit(train[features_added], train[target])
print("*********Training set Metrics**************")
print("R-Squared:", mlr_added_feature.score(train[features_added], train[target]))
se_train = (train[target] - mlr_added_feature.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", mlr_added_feature.score(validation[features_added], validation[target]))
se_val = (validation[target] - mlr_added_feature.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)

**Observations**
- The R-squared increased, as we would expect after adding a feature, and the error decreased. Let us now fit a regularized model.

## Regularization 

In [None]:
features_added

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

# fitting Ridge with the default hyperparameters (alpha = 1.0)
ridge = Ridge()
ridge.fit(train[features_added], train[target])

print("*********Training set Metrics**************")
print("R-Squared:", ridge.score(train[features_added], train[target]))
se_train = (train[target] - ridge.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", ridge.score(validation[features_added], validation[target]))
se_val = (validation[target] - ridge.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)

In [None]:
# fitting Lasso with the default hyperparameters (alpha = 1.0); Lasso was imported in the cell above
lasso = Lasso()
lasso.fit(train[features_added], train[target])

print("*********Training set Metrics**************")
print("R-Squared:", lasso.score(train[features_added], train[target]))
se_train = (train[target] - lasso.predict(train[features_added]))**2
mse_train = se_train.mean()
print('MSE: ', mse_train)
print("********Validation set Metrics**************")
print("R-Squared:", lasso.score(validation[features_added], validation[target]))
se_val = (validation[target] - lasso.predict(validation[features_added]))**2
mse_val = se_val.mean()
print('MSE: ', mse_val)
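
Both models above use the default penalty strength `alpha = 1.0`. The effect of the penalty is easier to see by sweeping `alpha`: Ridge shrinks the coefficient vector smoothly, while Lasso sets coefficients exactly to zero once the penalty is strong enough. The sketch below uses synthetic, roughly unit-scale features. The alpha grid and the true coefficients are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data with four features; the third has no real effect (made-up values)
rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 + X @ np.array([3.0, 1.5, 0.0, 0.5]) + rng.normal(0, 1.0, size=n)

alphas = [0.01, 1.0, 100.0]
ridge_norms = []     # length of the Ridge coefficient vector at each alpha
lasso_nonzero = []   # number of non-zero Lasso coefficients at each alpha

for alpha in alphas:
    ridge_fit = Ridge(alpha=alpha).fit(X, y)
    lasso_fit = Lasso(alpha=alpha).fit(X, y)
    ridge_norms.append(float(np.linalg.norm(ridge_fit.coef_)))
    lasso_nonzero.append(int(np.sum(lasso_fit.coef_ != 0)))
    print(f"alpha={alpha}: ridge norm={ridge_norms[-1]:.3f}, lasso non-zero={lasso_nonzero[-1]}")
```

The penalties act on the raw coefficient sizes, so they are sensitive to feature scale. With features on very different scales, standardizing them before fitting is a common step; choosing `alpha` by validation score rather than keeping the default is another.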

In [None]:
#Let us predict on the unseen data using Ridge

rsq_test = ridge.score(test[features_added], test[target])
se_test = (test[target] - ridge.predict(test[features_added]))**2
mse_test = se_test.mean()

print("*****************Test set Metrics******************")

print("Rsquared: ", rsq_test)
print("MSE: ", mse_test)
print("Intercept is {} and Coefficients are {}".format(ridge.intercept_, ridge.coef_))

- We will now evaluate the performance using the K-Fold and LOOCV (leave-one-out cross-validation) methods.

### K-Fold and LOOCV

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

In [None]:
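ridgeCV = Ridge()
# 10-fold cross-validation; each score is the R-squared on one held-out fold
cvs = cross_val_score(ridgeCV, Ad_df[features_added], Ad_df[target], cv = 10)
print("Mean Score:")
print(cvs.mean(), "\n")
# Mean +/- one standard deviation of the fold scores: a rough spread, not a formal confidence interval
print("Mean Score +/- 1 Std Dev:")
print(cvs.mean() - cvs.std(), cvs.mean() + cvs.std())

# Note: setting cv to n (here 200) gives LOOCV, but the default R-squared score is undefined on a
# single held-out row, so an error metric such as scoring='neg_mean_squared_error' is needed there.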
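
A minimal sketch of both schemes side by side, scored with mean squared error so that leave-one-out folds of a single row are well defined. The data is synthetic and smaller than the real dataset to keep it quick. `KFold` with `shuffle=True` and the seed 0 are choices made here, not something the notebook uses.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic regression data (hypothetical values)
rng = np.random.default_rng(5)
n = 60
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=n)

model = Ridge()

# 10-fold CV with shuffled folds; scikit-learn returns negated MSE, so flip the sign
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
kfold_mse = -cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")

# Leave-one-out CV: one fold per row, each fold's MSE is a single squared error
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")

print("10-fold mean MSE:", kfold_mse.mean())
print("LOOCV mean MSE  :", loo_mse.mean())
```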

## Extra: Using statsmodels to fit a regularized model

In [None]:
# You can also use statsmodels for regularization using the code below.
# import statsmodels.formula.api as smf
# Note: fit_regularized defaults to alpha = 0 (no penalty), so pass alpha > 0 to actually regularize.


# Lasso: elastic net with all of the penalty weight on the L1 term
# lm3 = smf.ols(formula= 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit_regularized(method = 'elastic_net', L1_wt = 1)
# print("*************Parameters**************")
# print(lm3.params)

# Ridge: elastic net with all of the penalty weight on the L2 term
# lm4 = smf.ols(formula= 'Sales ~ TV+Radio+Newspaper+TVandRadio', data = Ad_df).fit_regularized(method = 'elastic_net', L1_wt = 0)
# print("*************Parameters**************")
# print(lm4.params)

## Bootstrapping

In [None]:
# Fit a simple linear regression of Sales on TV with statsmodels; its slope and
# confidence interval are the reference for the bootstrap below.
import statsmodels.formula.api as smf
lm2 = smf.ols(formula= 'Sales ~ TV', data = Ad_df).fit()
print(lm2.summary())  # inferential statistics

In [None]:
# Now, let us estimate the slope 1000 times using bootstrapping

import statsmodels.formula.api as smf

Slope = []
for i in range(1000):
    # resample the rows with replacement to build one bootstrap dataset of the same size
    bootstrap_df = Ad_df.sample(n = len(Ad_df), replace = True)
    lm3 = smf.ols(formula= 'Sales ~ TV', data = bootstrap_df).fit()
    Slope.append(lm3.params.TV)
    # overlay the line fitted to this bootstrap sample
    plt.plot(bootstrap_df['TV'], lm3.predict(bootstrap_df['TV']), color='green', linewidth=3)

# the axis labels only need to be set once, outside the loop
plt.xlabel('TV Ads')
plt.ylabel('Sales')
plt.scatter(Ad_df['TV'], Ad_df['Sales'],  color=(0,0,0.5))
plt.show()


In [None]:
# Let us now find the 2.5th and 97.5th percentiles of the bootstrapped slopes (a 95% percentile interval).
# np.percentile does not need sorted input.
Slope = np.array(Slope)

Slope_limits = np.percentile(Slope, (2.5, 97.5))
Slope_limits

In [None]:
# Plotting the slopes and the upper and the lower limits

plt.hist(Slope, 50)
plt.axvline(Slope_limits[0], color = 'r')
plt.axvline(Slope_limits[1], color = 'r')
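
The bootstrap procedure above can also be written with NumPy alone, which makes the three steps explicit: resample row indices with replacement, refit the line, and take percentiles of the refitted slopes. The sketch below runs on synthetic data with made-up values and uses `np.polyfit` in place of statsmodels for the refit.

```python
import numpy as np

# Synthetic single-feature data (hypothetical values, not the Advertising data)
rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=n)

# Slope estimated on the full sample
slope_hat = np.polyfit(x, y, deg=1)[0]

n_boot = 1000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)                      # row indices drawn with replacement
    boot_slopes[b] = np.polyfit(x[idx], y[idx], deg=1)[0]  # slope refitted on the resample

# 95% percentile interval for the slope
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
print("slope estimate:", slope_hat)
print("95% bootstrap interval:", lower, upper)
```

The interval can be compared with the confidence interval for the TV coefficient in the statsmodels summary; the two are computed differently but should usually be close when the model assumptions hold reasonably well.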