<a href="https://colab.research.google.com/github/HowardHNguyen/Machine-Learning-Deep-Learning/blob/main/Regression_Models_Techniques_for_Evaluating.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/_Python/Data-Science-for-Marketing-Analytics-Second-Ed/Chapter06/Activity6.01/offer_responses.csv")
df.head()

Unnamed: 0,responses,offer_discount,offer_quality,offer_reach
0,4151.0,26.0,10.25768,31344.0
1,3397.0,35.0,15.19438,24016.0
2,3274.0,21.0,13.971468,28832.0
3,3426.0,27.0,6.054338,26747.0
4,5745.0,42.0,16.801365,46968.0


Extract the target variable (y) and the predictor variables (X) from the data:

In [3]:
X = df[['offer_quality','offer_discount','offer_reach']]

y = df['responses']

Import train_test_split from sklearn and use it to split the data into
training and test sets, using responses as the y variable and all others as the
predictor (X) variables.

Use random_state=10 for train_test_split:

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
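
The split above does not pass a test_size, so scikit-learn falls back to its default of holding out 25% of the rows. The sketch below illustrates this on a small synthetic table; the `demo` data and the `Xa_*`/`Xb_*` names are made up for illustration and are not part of the activity's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the offer data (hypothetical values, 100 rows)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "offer_discount": rng.integers(10, 50, size=100),
    "offer_reach": rng.integers(20000, 50000, size=100),
})
demo["responses"] = 80 * demo["offer_discount"] + 0.05 * demo["offer_reach"]

X_demo = demo[["offer_discount", "offer_reach"]]
y_demo = demo["responses"]

# With no test_size given, 25% of the rows are held out for testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, random_state=10)
print(Xa_train.shape, Xa_test.shape)

# Reusing the same random_state reproduces exactly the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, random_state=10)
print(Xa_test.index.equals(Xb_test.index))
```

Fixing random_state matters in this activity because every model variant is compared on the same test rows.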

Import LinearRegression and mean_squared_error from sklearn. Fit
the model to the training data (using all the predictors), get predictions from the model on the test data, and print out the calculated RMSE on the test data:

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train,y_train)

predictions = model.predict(X_test)

print('RMSE with all variables: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE with all variables: 966.2461828577946


Create X_train2 and X_test2 by dropping the offer_quality column
from X_train and X_test. Train and evaluate the RMSE of the model using
X_train2 and X_test2:

In [6]:
X_train2 = X_train.drop('offer_quality',axis=1)
X_test2 = X_test.drop('offer_quality',axis=1)

model = LinearRegression()
model.fit(X_train2,y_train)

predictions = model.predict(X_test2)

print('RMSE without offer quality: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE without offer quality: 965.5346123758474


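As we can see, dropping the offer_quality column lowered the RMSE only slightly, from about 966.25 to 965.53 (a change of less than 0.1%). A difference this small means the model predicts essentially as well without offer_quality, so the column adds little predictive value here. It is not strong evidence that the model has become more accurate or more robust.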

Repeat the previous step, but this time drop the
offer_discount column instead of the offer_quality column:

In [7]:
X_train3 = X_train.drop('offer_discount',axis=1)
X_test3 = X_test.drop('offer_discount',axis=1)

model = LinearRegression()
model.fit(X_train3,y_train)

predictions = model.predict(X_test3)

print('RMSE without offer discount: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE without offer discount: 1231.6766556327284


Perform the same sequence of steps, but this time dropping the
offer_reach column:

In [8]:
X_train4 = X_train.drop('offer_reach',axis=1)
X_test4 = X_test.drop('offer_reach',axis=1)

model = LinearRegression()
model.fit(X_train4,y_train)

predictions = model.predict(X_test4)

print('RMSE without offer reach: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE without offer reach: 1185.8456831644116


Let's summarize the RMSE values we have obtained so far:

RMSE with all variables: 966.2461828577946

RMSE without offer quality: 965.5346123758474

RMSE without offer discount: 1231.6766556327284

RMSE without offer reach: 1185.8456831644116

Notice that the RMSE went up considerably when the offer_reach column or the offer_discount column was removed from the model, but stayed about the same when the offer_quality column was removed. The size of the change in RMSE gives a rough indication of how much the model relies on the removed feature.

For example, if removing a feature lowers the RMSE, the reduced model predicts the test data more accurately than the original model. Similarly, if removing a feature raises the RMSE, the predictions have moved further from the actual values, which means that the feature carried useful information and should be kept.

This suggests that the offer_quality column is not contributing to the accuracy
of the model and could be safely removed to simplify the model.
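
The drop-one-column comparison above can also be written as a loop. The following is a minimal sketch on synthetic data, assuming two informative features and one pure-noise feature; the helper `rmse_without`, the `toy` table, and its column names are hypothetical and only stand in for the offer data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_without(X_tr, X_te, y_tr, y_te, dropped=None):
    """Fit LinearRegression without `dropped` (if given); return test RMSE."""
    if dropped is not None:
        X_tr = X_tr.drop(columns=dropped)
        X_te = X_te.drop(columns=dropped)
    fitted = LinearRegression().fit(X_tr, y_tr)
    return mean_squared_error(y_te, fitted.predict(X_te)) ** 0.5


# Synthetic data (hypothetical): two informative features, one pure noise
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_y = 5 * toy["signal_a"] + 3 * toy["signal_b"] + rng.normal(scale=0.5, size=n)

Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    toy, toy_y, random_state=10)

# RMSE with every feature, then with each feature removed in turn
ablation = {"all features": rmse_without(Xt_train, Xt_test, yt_train, yt_test)}
for col in toy.columns:
    ablation["without " + col] = rmse_without(
        Xt_train, Xt_test, yt_train, yt_test, dropped=col)

for label, value in ablation.items():
    print(f"{label}: {value:.3f}")
```

In this toy setup, removing either informative column should raise the RMSE sharply, while removing the noise column should leave it almost unchanged, which mirrors the pattern seen with offer_quality.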

# **USING RFE TO CHOOSE FEATURES FOR PREDICTING CUSTOMER SPEND**

In [9]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/_Python/Data-Science-for-Marketing-Analytics-Second-Ed/Chapter06/Activity6.02/customer_spend.csv")
df.head()

Unnamed: 0,cur_year_spend,prev_year_spend,days_since_last_purchase,days_since_first_purchase,total_transactions,age,income,engagement_score
0,5536.46,1681.26,7,61,34,61,97914.93,-0.652392
1,871.41,1366.74,12,34,33,68,30904.69,0.007327
2,2046.74,1419.38,10,81,22,54,48194.59,0.221666
3,4662.7,1561.21,12,32,34,49,93551.98,1.149641
4,3539.46,1397.6,17,72,34,66,66267.57,0.835834


In [10]:
# Extract the target variable (y) and the predictor variables (X) from the data:
cols = df.columns[1:]
X = df[cols]
y = df['cur_year_spend']

In [11]:
# Use train_test_split from sklearn to split the data into training and test
# sets, with random_state=100 and cur_year_spend as the y variable:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)

In [12]:
# Import RFE from sklearn and use LinearRegression as the estimator. Use
# n_features_to_select = 3, since we only want the top three features:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)

In [13]:
# Next, fit RFE (created in the previous step) on the training dataset:
rfe.fit(X_train,y_train)

In [14]:
# Print the columns that were selected by RFE, along with their rank:

for featureNum in range(X_train.shape[1]):
  # If feature was selected
  if rfe.support_[featureNum]:
    # Print feature name and rank
    print("Feature: {}, Rank: {}".format(X_train.columns[featureNum],rfe.ranking_[featureNum]))

Feature: days_since_first_purchase, Rank: 1
Feature: total_transactions, Rank: 1
Feature: engagement_score, Rank: 1


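Notice that only three features were selected by RFE and that all of them have rank 1. In scikit-learn's RFE, rank 1 simply marks a feature as selected; it does not mean that the three features are equally important. Eliminated features receive ranks of 2 and higher, with higher ranks given to features that were eliminated earlier.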
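
To see the ranks of the eliminated features as well, we can look at the whole ranking_ array rather than only the selected columns. This is a small sketch on synthetic data, assuming five features of which only the first three drive the target; `feats`, `target`, and `selector` are illustrative names, not the customer spend data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): f0, f1, f2 drive the target; f3, f4 do not
rng = np.random.default_rng(0)
n = 300
feats = pd.DataFrame(rng.normal(size=(n, 5)),
                     columns=["f0", "f1", "f2", "f3", "f4"])
target = (4 * feats["f0"] + 3 * feats["f1"] + 2 * feats["f2"]
          + rng.normal(scale=0.1, size=n))

selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(feats, target)

# Rank 1 = selected; larger ranks = eliminated, earliest removals ranked last
ranks = pd.Series(selector.ranking_, index=feats.columns).sort_values()
print(ranks)
```

With LinearRegression as the estimator, RFE eliminates features based on coefficient magnitude, which depends on each feature's scale, so standardizing the predictors first may change which features survive.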

In [15]:
# Using the information from the preceding step, create a reduced dataset having
# just the columns selected by RFE:

X_train_reduced = X_train[X_train.columns[rfe.support_]]
X_test_reduced = X_test[X_train.columns[rfe.support_]]

In [16]:
# Next, use the reduced training dataset to fit a new linear regression model:

rfe_model = LinearRegression()
rfe_model.fit(X_train_reduced,y_train)

In [17]:
# Import mean_squared_error from sklearn and use it to calculate the
# RMSE of the linear regression model on the test data:

from sklearn.metrics import mean_squared_error

rfe_predictions = rfe_model.predict(X_test_reduced)
print(mean_squared_error(y_test, rfe_predictions)**0.5)

1075.9083016269917


RMSE (root mean squared error) is a common metric for evaluating regression models. It is the square root of the mean of the squared differences between the actual target values (y_test) and the predicted values (rfe_predictions), so it is expressed in the same units as the target and penalizes large errors more heavily than small ones. A lower RMSE indicates better model performance.
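
As a quick check of what RMSE computes, the sketch below works it out by hand on four made-up values and compares it with the mean absolute error (MAE). The `actual` and `predicted` arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 300.0, 440.0])

errors = predicted - actual  # 10, -10, 0, 40

# RMSE: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same value via scikit-learn's mean_squared_error
rmse_sklearn = mean_squared_error(actual, predicted) ** 0.5

# MAE: plain average of the absolute errors
mae = np.mean(np.abs(errors))

print(rmse_manual, rmse_sklearn, mae)
```

Here the single large error of 40 pulls the RMSE above the MAE, which is why RMSE should not be read as a plain average error.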

Here's the interpretation of our RMSE value:

An RMSE of 1075.91 means that our model's predictions typically miss the actual value by roughly 1075.91, in the same units as cur_year_spend.
It's important to consider the scale and context of our problem when interpreting RMSE. Whether this value is good or bad depends on the specific problem and on the range of values in our target variable. In some cases an RMSE of 1075.91 might be acceptable, while in others it might not be.

In general, when working with RMSE or any other evaluation metric, it's good practice to compare the model's performance to a baseline or to other models to determine whether it meets our performance requirements or whether further improvement is needed.

In [18]:
mean_cur_year_expense = df['cur_year_spend'].mean()
mean_cur_year_expense

3419.0223200000005
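
One way to put the RMSE in context is to relate it to the mean of the target computed above, which gives roughly 31% of mean spend. A stronger check is to compare against a model that always predicts the training mean. The sketch below does the first with the two values printed in this notebook and illustrates the second on synthetic data; `X_syn`, `y_syn`, and the `Xs_*`/`ys_*` names are hypothetical, and the baseline figures it produces say nothing about the customer spend data itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Values printed earlier in this notebook
rfe_rmse = 1075.9083016269917
mean_spend = 3419.0223200000005
relative_rmse = rfe_rmse / mean_spend
print(f"RMSE as a fraction of mean spend: {relative_rmse:.3f}")

# Synthetic data (hypothetical) to illustrate a mean-prediction baseline
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(400, 2))
y_syn = (3000 + 800 * X_syn[:, 0] + 400 * X_syn[:, 1]
         + rng.normal(scale=200, size=400))
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=100)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(Xs_train, ys_train)
linear = LinearRegression().fit(Xs_train, ys_train)

baseline_rmse = mean_squared_error(ys_test, baseline.predict(Xs_test)) ** 0.5
linear_rmse = mean_squared_error(ys_test, linear.predict(Xs_test)) ** 0.5
print(f"baseline RMSE: {baseline_rmse:.1f}, linear RMSE: {linear_rmse:.1f}")
```

Applying the same DummyRegressor comparison to X_train_reduced and y_train would show how much the RFE-based model improves on simply predicting the average spend.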