Part (c) - Regularization
UE23CS342AA2 - Data Analytics

There are 4 sections in this worksheet.

Pranav Rao P - pranavraop2023@gmail.com

Name: Nandan D

SRN: PES2UG23CS363

Sec: F

## Importance of Regularization
In predictive modeling, a model that is fitted too closely to the training data can pick up noise and random fluctuations in addition to the true underlying patterns. This phenomenon, known as overfitting, results in poor generalization, i.e. the model performs well on the training data but loses accuracy when applied to new, unseen data.

Regularization addresses this by adding a penalty term to the model's loss function. The penalty discourages overly complex fits (very large coefficient values) and pushes the model to balance goodness of fit against simplicity. As a result, the model generalizes better beyond the training set.
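The penalized loss described above can be written compactly. The notation here is an assumption made for illustration (it does not come from the worksheet): $y_i$ is the rating of player $i$, $x_i$ the vector of $p$ predictors, $\beta_0$ the intercept, $\beta$ the coefficient vector, and $\lambda \ge 0$ the penalty strength.

$$
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2 + \lambda\, P(\beta),
\qquad
P(\beta) =
\begin{cases}
\sum_{j=1}^{p} \beta_j^{2} & \text{(Ridge, L2)} \\
\sum_{j=1}^{p} \lvert \beta_j \rvert & \text{(Lasso, L1)}
\end{cases}
$$

With $\lambda = 0$ this reduces to ordinary least squares, and larger $\lambda$ shrinks the coefficients more strongly. Software packages may scale the loss and the penalty by constant factors, so a given numeric value of $\lambda$ is only comparable within one implementation.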

We discuss two well-known regularization methods in this section.

* **Ridge Regression (L2 Regularization)**: Adds a penalty proportional to the sum of the squared coefficients. It shrinks all coefficients toward zero but does not set them exactly to zero, which makes it useful when the predictors are multicollinear.

* **Lasso Regression (L1 Regularization)**: Adds a penalty proportional to the sum of the absolute values of the coefficients. It can set some coefficients to exactly zero, thereby performing automatic feature selection.
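The difference between the two penalties can be seen in a small sketch. It assumes the idealized case of an orthonormal design, where the Ridge and Lasso solutions have simple closed forms in terms of the OLS estimates. The coefficient values, `lambda`, and the helper names `ridge_shrink` and `lasso_shrink` are hypothetical and are not part of the worksheet or of `glmnet`.

```r
# Toy OLS estimates for an orthonormal design (hypothetical values)
beta_ols <- c(big = 2.0, medium = 0.6, small = 0.1)
lambda   <- 0.5

# Ridge (objective: 1/2 * RSS + lambda/2 * sum(beta^2)):
# proportional shrinkage, a nonzero estimate never becomes exactly zero
ridge_shrink <- function(b, lambda) b / (1 + lambda)

# Lasso (objective: 1/2 * RSS + lambda * sum(abs(beta))):
# soft-thresholding, estimates within lambda of zero become exactly zero
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

beta_ridge <- ridge_shrink(beta_ols, lambda)
beta_lasso <- lasso_shrink(beta_ols, lambda)

print(rbind(ols = beta_ols, ridge = beta_ridge, lasso = beta_lasso))
```

In this toy case the `small` coefficient is removed entirely by the Lasso rule, while the Ridge rule only scales it down. Real data with correlated predictors does not follow these formulas exactly, but the qualitative behaviour is the same.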

Let's have a look at the task at hand and the data that it uses.



### Task: Predicting Player Rating in Valorant  
You're working as a data analyst for an esports coaching team. Your task is to build a predictive model that estimates a **player’s match rating** based on in-game performance metrics.

You’ll use `Valorant_Player_Data.csv`, which includes features like:  
- Kills  
- Deaths  
- Average Combat Score (ACS)  
- Head-shot %  
- First Blood Count and more 

You’ll compare **Ridge and Lasso Regression** and evaluate which model generalizes better.


### Data Visualisation

In [71]:
library(tidyverse)  

# Load the Dataset
df <- read_csv("/kaggle/input/worksheet-2-lasso-ridge/Valorant_Player_Data.csv", show_col_types = FALSE)

head(df)

| playerName | team | rating | region | playerCategory | average_combat_score | kill_deaths | kill_assists_survived_traded | average_damage_per_round | kills_per_round | assists_per_round | first_kills_per_round | first_deaths_per_round | headshot_percentage | clutch_success_percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| Kouf | AMB | 1.53 | Americas | vct-challengers | 298.0 | 1.48 | 74% | 185.9 | 1.03 | 0.36 | 0.13 | 0.03 | 35% | 15% |
| nelu | 63 | 1.31 | Americas | vct-challengers | 266.7 | 1.15 | 75% | 182.0 | 0.86 | 0.4 | 0.06 | 0.06 | 30% | 21% |
| welyy | Blue | 1.31 | Americas | vct-challengers | 240.9 | 1.26 | 74% | 164.1 | 0.82 | 0.39 | 0.07 | 0.04 | 27% | 10% |
| ShoT_UP | TOR | 1.29 | Americas | vct-challengers | 240.2 | 1.25 | 78% | 158.6 | 0.82 | 0.4 | 0.06 | 0.06 | 25% | 25% |
| mada | NRG | 1.26 | Americas | vct-challengers | 268.8 | 1.34 | 76% | 172.9 | 0.93 | 0.19 | 0.24 | 0.13 | 26% | 11% |
| MattyIce | Equi | 1.26 | Americas | vct-challengers | 256.5 | 1.27 | 58% | 166.1 | 1.0 | 0.06 | 0.03 | 0.18 | 42% | 17% |


In [72]:
str(df)

spc_tbl_ [3,123 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ playerName                  : chr [1:3123] "Kouf" "nelu" "welyy" "ShoT_UP" ...
 $ team                        : chr [1:3123] "AMB" "63" "Blue" "TOR" ...
 $ rating                      : num [1:3123] 1.53 1.31 1.31 1.29 1.26 1.26 1.24 1.24 1.23 1.23 ...
 $ region                      : chr [1:3123] "Americas" "Americas" "Americas" "Americas" ...
 $ playerCategory              : chr [1:3123] "vct-challengers" "vct-challengers" "vct-challengers" "vct-challengers" ...
 $ average_combat_score        : num [1:3123] 298 267 241 240 269 ...
 $ kill_deaths                 : num [1:3123] 1.48 1.15 1.26 1.25 1.34 1.27 1.19 1.32 1.13 1.23 ...
 $ kill_assists_survived_traded: chr [1:3123] "74%" "75%" "74%" "78%" ...
 $ average_damage_per_round    : num [1:3123] 186 182 164 159 173 ...
 $ kills_per_round             : num [1:3123] 1.03 0.86 0.82 0.82 0.93 1 0.89 0.94 0.86 0.92 ...
 $ assists_per_round           : num [1:3123] 0.36 0.4 

**1)** What steps did you take to clean the input data before modeling? Mention how infinities, nulls, and constants were handled. (0.5 points)

In [73]:
# Convert percentage columns to numeric
df <- df %>%
  mutate(across(c(headshot_percentage, clutch_success_percentage), 
                ~ as.numeric(str_remove(., "%"))))

# Check missing values
print(colSums(is.na(df)))

# Remove rows with a missing value in any column
# (this drops at least the 913 rows that lack clutch_success_percentage)
df <- df %>% drop_na()

# Remove constant columns (if any)
df <- df %>% select(where(~ n_distinct(.) > 1))

                  playerName                         team 
                           0                            3 
                      rating                       region 
                         105                            0 
              playerCategory         average_combat_score 
                           0                            0 
                 kill_deaths kill_assists_survived_traded 
                           0                          105 
    average_damage_per_round              kills_per_round 
                          21                            0 
           assists_per_round        first_kills_per_round 
                           0                           21 
      first_deaths_per_round          headshot_percentage 
                          21                          105 
   clutch_success_percentage 
                         913 
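The question also mentions infinities, which the cell above does not handle explicitly. The following is a minimal base R sketch of one way to treat percentage strings, infinities, nulls, and constant columns together. The data frame `toy` and its column names are hypothetical stand-ins, not the worksheet's dataset.

```r
# Toy data frame (hypothetical values) with the same kinds of problems
toy <- data.frame(
  rating   = c(1.2, 1.1, NA, 0.9, 1.0),
  kd       = c(1.3, Inf, 1.0, 0.8, 1.1),
  hs_pct   = c("35%", "30%", "27%", "25%", "26%"),
  constant = rep(1, 5),
  stringsAsFactors = FALSE
)

# 1. Percentage strings -> numeric
toy$hs_pct <- as.numeric(sub("%", "", toy$hs_pct, fixed = TRUE))

# 2. Infinities -> NA, so they are dropped together with the nulls
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], function(x) replace(x, is.infinite(x), NA))

# 3. Drop rows that contain any NA
toy <- toy[complete.cases(toy), , drop = FALSE]

# 4. Drop columns that hold a single distinct value
keep <- vapply(toy, function(x) length(unique(x)) > 1, logical(1))
toy  <- toy[, keep, drop = FALSE]

print(toy)
```

Turning infinities into `NA` first keeps the later steps unchanged, since a single `complete.cases()` call then removes both kinds of bad rows.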


### **I.** Ridge Regression

**1)** What value of λ (lambda) was chosen for optimal Ridge regression? What does this say about the need for regularization in your dataset? (hint: use glmnet) (1 point)

In [74]:
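#enter code here
# Split the cleaned data into training and test sets.
# NOTE: the split used for the results below is not shown in this worksheet;
# the 80/20 ratio and the seed are assumptions, so the printed numbers
# may not be reproduced exactly.
set.seed(42)
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train_df  <- df[train_idx, ]
test_df   <- df[-train_idx, ]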

**2)** With the optimal lambda, print the coefficients of the various independent (predictor) variables. (1 point)

In [75]:
#enter code here
X_train <- train_df %>% 
  select(average_combat_score, kill_deaths, average_damage_per_round, kills_per_round,
         assists_per_round, first_kills_per_round, first_deaths_per_round,
         headshot_percentage, clutch_success_percentage) %>%
  as.matrix()

X_test <- test_df %>% 
  select(average_combat_score, kill_deaths, average_damage_per_round, kills_per_round,
         assists_per_round, first_kills_per_round, first_deaths_per_round,
         headshot_percentage, clutch_success_percentage) %>%
  as.matrix()

y_train <- train_df$rating
y_test  <- test_df$rating

In [76]:
library(glmnet)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
lambda_ridge <- cv_ridge$lambda.min
cat("Optimal lambda (Ridge):", lambda_ridge, "\n")

Optimal lambda (Ridge): 0.01585656 
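`cv.glmnet` chooses λ by k-fold cross-validation, and `lambda.min` is the value with the lowest mean cross-validated error. The sketch below illustrates that idea in base R with a closed-form Ridge fit on simulated data. It is not `glmnet`'s internal algorithm or parameterization; the data, the λ grid, the fold count, and the helper `ridge_fit` are all assumptions made for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # nearly collinear with x1
X  <- scale(cbind(x1, x2))             # standardized predictors
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n, sd = 0.3)
y_c <- y - mean(y)                     # centered once for brevity (a simplification)

# Closed-form Ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- 10^seq(-3, 3, length.out = 25)
k       <- 5
folds   <- sample(rep(seq_len(k), length.out = n))

# Mean held-out squared error for each candidate lambda
cv_mse <- sapply(lambdas, function(l) {
  mean(sapply(seq_len(k), function(f) {
    tr <- folds != f
    b  <- ridge_fit(X[tr, ], y_c[tr], l)
    mean((y_c[!tr] - X[!tr, ] %*% b)^2)
  }))
})

lambda_min <- lambdas[which.min(cv_mse)]
cat("Lambda with lowest CV error:", lambda_min, "\n")
```

Because the folds are assigned at random, the selected λ can change from run to run unless a seed is set before the cross-validation call.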


In [77]:
#enter code here
ridge_coef <- coef(cv_ridge, s = "lambda.min")
print(ridge_coef)


10 x 1 sparse Matrix of class "dgCMatrix"
                                     s1
(Intercept)                0.0754683852
average_combat_score       0.0007568795
kill_deaths                0.4026596102
average_damage_per_round   0.0013872485
kills_per_round            0.2468639702
assists_per_round          0.2154167118
first_kills_per_round     -0.1408338322
first_deaths_per_round    -0.4488714160
headshot_percentage        0.0002042393
clutch_success_percentage  0.0001439914


**3)** Using your cross‑validated Ridge model (cv_ridge), calculate R² for both the training set and the test set. Report RMSE and adjusted R² for the test set. (2 points)

In [78]:
#enter code here
library(caret)  # provides postResample()

ridge_train_pred <- predict(cv_ridge, X_train, s = "lambda.min")
ridge_test_pred  <- predict(cv_ridge, X_test, s = "lambda.min")

# Metrics (RMSE, Rsquared, MAE)
ridge_train_met <- postResample(ridge_train_pred, y_train)
ridge_test_met  <- postResample(ridge_test_pred, y_test)

cat(sprintf("Ridge Train R²: %.3f | RMSE: %.3f\n", ridge_train_met["Rsquared"], ridge_train_met["RMSE"]))
cat(sprintf("Ridge Test R²: %.3f | RMSE: %.3f\n", ridge_test_met["Rsquared"], ridge_test_met["RMSE"]))

Ridge Train R²: 0.951 | RMSE: 0.037
Ridge Test R²: 0.949 | RMSE: 0.038


In [79]:
n_test <- nrow(X_test)
p <- ncol(X_test)
adj_r2_test <- 1 - (1 - ridge_test_met["Rsquared"]) * ((n_test - 1)/(n_test - p - 1))
cat(sprintf("Ridge Test Adjusted R²: %.3f\n", adj_r2_test))


Ridge Test Adjusted R²: 0.948
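The metrics used above can also be computed by hand, which makes the definitions explicit. As far as I know, `postResample` reports R² as the squared correlation between predictions and observations, which can differ from the `1 - SSE/SST` definition when predictions are biased. The helper `regression_metrics` and the toy vectors below are hypothetical and only illustrate the formulas.

```r
# RMSE, two R² definitions, and adjusted R² for p predictors
regression_metrics <- function(pred, obs, p) {
  n      <- length(obs)
  sse    <- sum((obs - pred)^2)
  sst    <- sum((obs - mean(obs))^2)
  rmse   <- sqrt(sse / n)
  r2_cor <- cor(pred, obs)^2          # squared correlation
  r2_sse <- 1 - sse / sst             # coefficient of determination
  adj_r2 <- 1 - (1 - r2_sse) * (n - 1) / (n - p - 1)
  c(RMSE = rmse, R2_cor = r2_cor, R2_sse = r2_sse, Adj_R2 = adj_r2)
}

# Toy example: predictions are perfectly correlated but shifted by 0.1
obs  <- c(1.0, 1.2, 0.9, 1.1, 1.3, 0.8)
pred <- obs + 0.1

m <- regression_metrics(pred, obs, p = 2)
print(round(m, 3))
```

In the toy example the squared correlation is 1 even though every prediction is off by 0.1, while the `1 - SSE/SST` version is penalized for the bias. When both versions are close, as they would be for well-calibrated predictions, the choice matters little.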


### **II.** Lasso Regression

**1)** How many coefficients were exactly zero in the Lasso model? What does this suggest? Which were the top two in terms of weights? (0.5 points)

In [80]:
# Fit Lasso regression
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)

# Optimal lambda
lambda_lasso <- cv_lasso$lambda.min
cat("Optimal lambda (Lasso):", lambda_lasso, "\n")

# Coefficients
lasso_coef <- coef(cv_lasso, s = "lambda.min")
print(lasso_coef)

# Number of coefficients exactly zero
zero_coef <- sum(lasso_coef == 0)
cat("Number of coefficients exactly zero in Lasso:", zero_coef, "\n")

# Top two coefficients by absolute value (intercept excluded)
coef_vec  <- as.matrix(lasso_coef)[-1, 1]   # named numeric vector
top2_coef <- head(sort(abs(coef_vec), decreasing = TRUE), 2)
top2_coef


Optimal lambda (Lasso): 0.0004114814 
10 x 1 sparse Matrix of class "dgCMatrix"
                                     s1
(Intercept)                7.851336e-02
average_combat_score       4.700407e-04
kill_deaths                5.814104e-01
average_damage_per_round   1.847583e-03
kills_per_round            .           
assists_per_round          1.966282e-01
first_kills_per_round     -1.618982e-01
first_deaths_per_round    -3.537040e-01
headshot_percentage        1.376214e-05
clutch_success_percentage  6.616355e-05
Number of coefficients exactly zero in Lasso: 1 


**2)** Did Lasso outperform OLS and Ridge in terms of Test R² and RMSE? Why or why not? (Answer based on VIF standings.) (1 point)

In [81]:
# Predictions
lasso_train_pred <- predict(cv_lasso, X_train, s = "lambda.min")
lasso_test_pred  <- predict(cv_lasso, X_test, s = "lambda.min")

# Metrics
lasso_train_met <- postResample(lasso_train_pred, y_train)
lasso_test_met  <- postResample(lasso_test_pred, y_test)

cat(sprintf("Lasso Train R²: %.3f | RMSE: %.3f\n", lasso_train_met["Rsquared"], lasso_train_met["RMSE"]))
cat(sprintf("Lasso Test R²: %.3f | RMSE: %.3f\n", lasso_test_met["Rsquared"], lasso_test_met["RMSE"]))

Lasso Train R²: 0.956 | RMSE: 0.035
Lasso Test R²: 0.950 | RMSE: 0.038


In [82]:
library(car)  

X_df <- as.data.frame(X_train)

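# VIF depends only on the predictors, so any response can be used;
# a random dummy response is enough to fit the model that vif() needs
dummy_y <- rnorm(nrow(X_df))  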

lm_model <- lm(dummy_y ~ ., data = X_df)

vif_values <- vif(lm_model)

vif_df <- data.frame(
  feature = names(vif_values),
  VIF = as.numeric(vif_values)
) %>% arrange(desc(VIF))

print(vif_df)

                    feature       VIF
1      average_combat_score 51.721661
2           kills_per_round 36.996842
3  average_damage_per_round 22.292431
4               kill_deaths  8.808300
5     first_kills_per_round  2.765628
6    first_deaths_per_round  1.890572
7         assists_per_round  1.528715
8       headshot_percentage  1.147457
9 clutch_success_percentage  1.028060
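The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it in base R in two equivalent ways on simulated data. The variables `x1`, `x2`, `x3` and the helper `vif_manual` are hypothetical and are not part of the worksheet or of the `car` package.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # strongly correlated with x1
x3 <- rnorm(n)                  # unrelated to the others
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# Definition: regress each column on the remaining columns
vif_manual <- function(X) {
  v <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
  setNames(v, colnames(X))
}

vif_a <- vif_manual(X)
# Shortcut: diagonal of the inverse correlation matrix
vif_b <- diag(solve(cor(X)))

print(round(rbind(regression = vif_a, inverse_cor = vif_b), 2))
```

Neither computation involves a response variable, which is why a random dummy response is harmless in the cell above. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.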


### **III.** Inferences

**1)** Suggest at least two additional features (not in the current dataset) that could improve player rating prediction

Suggested additional features:
1. Player reaction time per round
2. Average time alive per round

**2)** How does regularization help reduce overfitting in both Ridge and Lasso?

How regularization reduces overfitting:
* Ridge shrinks large coefficients toward zero, which lowers the variance of the fitted model.
* Lasso also shrinks coefficients and sets some of them to exactly zero, performing automatic feature selection.


### **IV.** Prediction of player ratings


**1)** Using the model that performs better, predict the rating for the following hypothetical player. (0.5 points)
* kills_per_round              : 0.78
* average_damage_per_round     : 160
* average_combat_score         : 245
* kill_deaths                  : 1.30
* assists_per_round            : 0.32
* first_kills_per_round        : 0.18
* first_deaths_per_round       : 0.14

In [83]:
#enter code here
hypothetical <- data.frame(
  average_combat_score      = 245,
  kill_deaths               = 1.30,
  average_damage_per_round  = 160,
  kills_per_round           = 0.78,
  assists_per_round         = 0.32,
  first_kills_per_round     = 0.18,
  first_deaths_per_round    = 0.14,
  headshot_percentage       = NA,  # if unknown, set to 0 or the mean of training
  clutch_success_percentage = NA   # if unknown, set to 0 or the mean of training
)

for(col in names(hypothetical)){
  if(is.na(hypothetical[[col]])){
    hypothetical[[col]] <- mean(X_train[, col], na.rm = TRUE)
  }
}


pred_rating <- predict(cv_ridge, as.matrix(hypothetical), s = "lambda.min")
pred_rating

lambda.min
1.187492
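The prediction is a plain linear combination, so most of it can be checked by hand from the Ridge coefficients printed earlier. The sketch below adds up the intercept and the seven known features. The two percentage features are left out because their training means are not shown in this worksheet, so the result is only a partial sum.

```r
# Ridge coefficients as printed for lambda.min
ridge_coef <- c(
  "(Intercept)"            =  0.0754683852,
  average_combat_score     =  0.0007568795,
  kill_deaths              =  0.4026596102,
  average_damage_per_round =  0.0013872485,
  kills_per_round          =  0.2468639702,
  assists_per_round        =  0.2154167118,
  first_kills_per_round    = -0.1408338322,
  first_deaths_per_round   = -0.4488714160
)

# Known feature values of the hypothetical player
player <- c(
  average_combat_score     = 245,
  kill_deaths              = 1.30,
  average_damage_per_round = 160,
  kills_per_round          = 0.78,
  assists_per_round        = 0.32,
  first_kills_per_round    = 0.18,
  first_deaths_per_round   = 0.14
)

# Intercept plus the contributions of the seven known features
partial <- ridge_coef[["(Intercept)"]] + sum(ridge_coef[names(player)] * player)

# What is left for the two mean-imputed percentage features
remainder <- 1.187492 - partial

cat(sprintf("Partial sum: %.4f | Remainder: %.4f\n", partial, remainder))
```

The small remainder is what the mean-imputed `headshot_percentage` and `clutch_success_percentage` contribute, which is consistent with their very small coefficients.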


**2)** Use the same model to get ratings for ten players at random from test dataset and compare the values.

In [84]:
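#enter code here 
set.seed(123)
sample_idx <- sample(seq_len(nrow(X_test)), 10)
sample_players <- X_test[sample_idx, ]

# Same model as in In [83]: the cross-validated Ridge fit
best_model  <- cv_ridge
sample_pred <- predict(best_model, sample_players, s = "lambda.min")

# Compare predicted and actual ratings side by side
comparison <- data.frame(
  actual    = y_test[sample_idx],
  predicted = as.numeric(sample_pred)
)
comparison$error <- comparison$predicted - comparison$actual
print(comparison)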
