<a href="https://colab.research.google.com/github/Sukanya1901/CAM_DS/blob/main/Draft_CAM_DS_C101_Activity_2_3_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update your Course 1 notebook with links to your own work once completed!

# Activity 2.3.5 Building models and interpreting results

## Scenario
You are a data scientist working for a retail consumer goods company that seeks to deepen its understanding of customer behaviour to enhance loyalty. Your client, a national retail chain, is keen to leverage its extensive customer loyalty data set to reveal the relationship between loyalty and potential influencing factors, such as perceived product quality, brand awareness, and the impact of negative publicity.

Your primary goal is to develop a predictive model that helps the client optimise its strategic initiatives: refining its marketing efforts, improving product offerings, optimising customer engagement, and driving growth and market share.


## Objective
The goal is to:
- review and reinforce key concepts related to regression analysis, including understanding coefficients and residuals
- explore the importance of evaluating regression models for their predictive power and the practical application of metrics like $R^2$, adjusted-$R^2$, and RSS.

## Assessment criteria
Evidence critical analysis of outputs to evaluate predictive models, accurately interpreting relevant metrics to ensure generalisability.


## Activity guidance
1. Import the necessary libraries and load the data set.
2. Split the data set into features (`X`) and the target variable (`y`). In this case the variable 'Loyalty' is the target variable.
3. Check if there are any features that are not suitable or required to be included in the regression model.
4. Create a regression model using Python (utilise scikit-learn) to predict the target variable based on the features.
5. Calculate the $R^2$ and adjusted-$R^2$ values to assess the model's explanatory power.
6. Calculate the residual sum of squares (RSS) to quantify the model's error.
7. Experiment with different sets of independent variables or model configurations to see how they affect $R^2$, adjusted-$R^2$, and RSS.
8. Explore the data set, experiment with model configurations, and think critically about the practical implications of your findings.
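
For reference, the three metrics named in the steps above are conventionally defined as follows. The symbols are chosen here for illustration: $y_i$ is an observed value, $\hat{y}_i$ the corresponding model prediction, $\bar{y}$ the mean of the observed values, $n$ the number of observations, and $k$ the number of predictors.

$$
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
$$

Because the factor $\frac{n - 1}{n - k - 1}$ grows with $k$, adjusted $R^2$ only improves when a new predictor reduces RSS by more than the penalty for adding it.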


> Start your activity here. Select the pen from the toolbar to add your entry.

In [None]:
#Step 1: Import the libraries you need for this model.
import pandas as pd
# Load the data.
data = pd.read_csv("https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/refs/heads/main/LoyaltyData.csv")
data

| Unnamed: 0 | CustomerID | Loyalty | Quality | Brand awareness | Negative publicity |
|---|---|---|---|---|---|
| 0 | 1525 | 6.145 | 0.87 | -0.07 | 0.04 |
| 1 | 1531 | 6.033 | 0.93 | 0.14 | 0.05 |
| 2 | 1526 | 6.531 | 0.86 | -0.02 | 0.06 |
| 3 | 1523 | 6.834 | 0.92 | 0.29 | 0.06 |
| 4 | 1524 | 6.642 | 0.85 | 0.05 | 0.07 |
| ... | ... | ... | ... | ... | ... |
| 1706 | 200 | 5.249 | 0.79 | 0.23 | 0.98 |
| 1707 | 414 | 5.385 | 0.79 | -0.20 | 0.98 |
| 1708 | 1500 | 4.815 | 0.77 | -0.18 | 0.98 |
| 1709 | 1005 | 5.467 | 0.93 | -0.30 | 0.98 |
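
The guidance asks whether any columns are unsuitable as features. In the preview above, `Unnamed: 0` looks like a saved row index and `CustomerID` is an identifier, so neither should carry information about loyalty. The following is a minimal sketch of one way to flag such columns. It assumes a small synthetic frame (made-up values, same column names) so that it runs without the download, and `id_like` and `candidates` are hypothetical names. The integer-and-unique rule is only a heuristic, not a definitive test.

```python
import pandas as pd

# Synthetic stand-in for the loyalty data (values are made up for illustration).
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "CustomerID": [1525, 1531, 1526, 1523, 1524],
    "Loyalty": [6.1, 6.0, 6.5, 6.8, 6.6],
    "Quality": [0.87, 0.93, 0.86, 0.92, 0.85],
    "Brand awareness": [-0.07, 0.14, -0.02, 0.29, 0.05],
    "Negative publicity": [0.04, 0.05, 0.06, 0.06, 0.07],
})

# Heuristic: integer columns with one distinct value per row behave like identifiers.
id_like = [
    col for col in demo.columns
    if pd.api.types.is_integer_dtype(demo[col]) and demo[col].is_unique
]
print("Identifier-like columns:", id_like)

# Everything that is neither an identifier nor the target is a candidate feature.
candidates = demo.drop(columns=id_like + ["Loyalty"])
print("Candidate features:", list(candidates.columns))

# Missing values would also need handling before fitting a regression.
print(demo.isna().sum())
```

On the real data, the same check could be run on `data` before choosing the feature list.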


In [None]:
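# Step 2: Choose the features (X) and the target (y).
features = ['Quality', 'Brand awareness', 'Negative publicity']
target = ['Loyalty']  # list selection keeps y as a one-column DataFrame

from sklearn.linear_model import LinearRegression

# Note: the model is fitted and scored on the full data set (no train/test split),
# so every metric below is an in-sample (training) metric.
X_train = data[features]
y_train = data[target]
model = LinearRegression()
model.fit(X_train, y_train)

# Compute RSS (residual sum of squares); the result is a one-element Series.
y_pred = model.predict(X_train)
y_true = y_train
RSS = ((y_true - y_pred) ** 2).sum()
print("This is RSS:", RSS)

# Compute R2
R2 = model.score(X_train, y_train)
print("This is R2:", R2)

# Adjusted R2
n = X_train.shape[0]   # number of observations
k = X_train.shape[1]   # number of predictors
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print("R²:", R2)
print("Adjusted R²:", adj_R2)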


This is RSS: Loyalty    886.432853
dtype: float64
This is R2: 0.5980449406336068
R²: 0.5980449406336068
Adjusted R²: 0.5973385169791843


In [None]:
# Step 7: Try different feature sets by uncommenting one line at a time.
#features = ['Quality', 'Brand awareness', 'Negative publicity']
features = ['Quality', 'Brand awareness']
#features = ['Quality', 'Negative publicity']
#features = ['Brand awareness', 'Negative publicity']

target = ['Loyalty']  # list selection keeps y as a one-column DataFrame

from sklearn.linear_model import LinearRegression

# Note: fitted and scored on the full data set, so these are training metrics.
X_train = data[features]
y_train = data[target]
model = LinearRegression()
model.fit(X_train, y_train)

# Compute RSS (residual sum of squares); the result is a one-element Series.
y_pred = model.predict(X_train)
y_true = y_train
RSS = ((y_true - y_pred) ** 2).sum()
print("This is RSS:", RSS)

# Compute R2
R2 = model.score(X_train, y_train)
print("This is R2:", R2)

# Adjusted R2
n = X_train.shape[0]   # number of observations
k = X_train.shape[1]   # number of predictors
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print("R²:", R2)
print("Adjusted R²:", adj_R2)


This is RSS: Loyalty    1040.620686
dtype: float64
This is R2: 0.5281281058191742
R²: 0.5281281058191742
Adjusted R²: 0.5275755626175573
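
Rather than editing the feature list by hand for each experiment, the comparison could be automated. The sketch below loops over every non-empty subset of the three features and tabulates RSS, $R^2$ and adjusted $R^2$. `fit_metrics` is a hypothetical helper, and the data are synthetic (made-up values from an assumed linear relationship) so the block runs on its own. The numbers it prints are not results for the real data set.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_metrics(df, features, target):
    """Fit OLS on `features` and return in-sample RSS, R2 and adjusted R2."""
    X, y = df[features], df[target]
    model = LinearRegression().fit(X, y)
    rss = float(((y - model.predict(X)) ** 2).sum())
    r2 = float(model.score(X, y))
    n, k = X.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return {"features": ", ".join(features), "RSS": rss, "R2": r2, "adj_R2": adj_r2}


# Synthetic stand-in for the loyalty data (assumed relationship, demo only).
rng = np.random.default_rng(1)
n_rows = 300
demo = pd.DataFrame({
    "Quality": rng.uniform(0.5, 1.0, n_rows),
    "Brand awareness": rng.normal(0.0, 0.2, n_rows),
    "Negative publicity": rng.uniform(0.0, 1.0, n_rows),
})
demo["Loyalty"] = (
    5.0
    + 2.0 * demo["Quality"]
    + 0.5 * demo["Brand awareness"]
    - 1.5 * demo["Negative publicity"]
    + rng.normal(0.0, 0.3, n_rows)
)

all_features = ["Quality", "Brand awareness", "Negative publicity"]

# Evaluate every non-empty combination of features.
rows = [
    fit_metrics(demo, list(subset), "Loyalty")
    for size in range(1, len(all_features) + 1)
    for subset in combinations(all_features, size)
]
results = pd.DataFrame(rows).sort_values("adj_R2", ascending=False)
print(results.to_string(index=False))
```

Passing the notebook's `data` frame instead of `demo` would give the same table for the real features.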


# Reflect

Write a brief paragraph highlighting your process and the rationale to showcase critical thinking and problem-solving.

How to read the results:

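- R² (train) never decreases when a predictor is added to an ordinary least squares model, so it almost always goes up as complexity grows.
- Adjusted R² penalises that complexity.
- R² (test) tells us about generalisation. No held-out test set was used in this notebook, so every value reported above is a training metric.
- RSS (train) naturally decreases as models get more flexible.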
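
Since the cells above score the model on the same rows it was fitted on, a held-out split is one way to obtain a test $R^2$. The following is a minimal sketch using scikit-learn's `train_test_split`. The 80/20 split, the `random_state` value and the synthetic data (made-up values from an assumed linear relationship) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loyalty data (assumed relationship, demo only).
rng = np.random.default_rng(2)
n_rows = 400
demo = pd.DataFrame({
    "Quality": rng.uniform(0.5, 1.0, n_rows),
    "Brand awareness": rng.normal(0.0, 0.2, n_rows),
    "Negative publicity": rng.uniform(0.0, 1.0, n_rows),
})
demo["Loyalty"] = (
    5.0
    + 2.0 * demo["Quality"]
    + 0.5 * demo["Brand awareness"]
    - 1.5 * demo["Negative publicity"]
    + rng.normal(0.0, 0.3, n_rows)
)

features = ["Quality", "Brand awareness", "Negative publicity"]
X = demo[features]
y = demo["Loyalty"]

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# A test R2 close to the training R2 suggests the model generalises;
# a much lower test R2 would point to overfitting.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print("R² (train):", r2_train)
print("R² (test): ", r2_test)
```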

Based on experiments with the data, here are the key findings from the different model configurations:

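With all three features, the model explains about 60% of the variance in Loyalty (R² ≈ 0.598, adjusted R² ≈ 0.597, RSS ≈ 886.4). Dropping 'Negative publicity' lowers R² to about 0.528 and raises RSS to about 1040.6, and adjusted R² falls as well, so that feature adds real explanatory power rather than just complexity. These moderate R² values still leave roughly 40% of the variance unexplained, which suggests that the features above are useful but are not the only drivers of customer loyalty. In a real-world scenario, this would indicate that the company needs to explore and incorporate other factors, such as customer service, product pricing, or competitor actions, to build a more robust and accurate predictive model.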







