# Predictive Analysis of Student Performance in Math Courses using R and glmnet

In this Machine Learning (ML) tutorial, we'll explore the realm of predictive analytics with a focus on student performance in math courses. Our goal is to leverage predictive modeling to identify students who may require additional academic support, potentially guiding them towards private lessons to enhance their learning outcomes. Utilizing the `glmnet` library in R, we aim to demonstrate a comprehensive approach to predicting student grades based on various socio-economic and school-related factors.

## Dataset Overview

We base our analysis on the student datasets, stored as `.Rdata` files and sourced from detailed records of student achievements in Portuguese schools. These datasets cover students' math grades, socio-economic backgrounds, and school-related characteristics. Below is a brief overview of the types of data we'll be examining:

- **Math Grade:** The primary focus of our analysis, representing students' final grades in math.
- **Socio-economic Characteristics:** Attributes detailing students' family backgrounds, parents' occupations, home resources like internet access, and additional factors that could influence academic performance.
- **School Related Features:** Information covering school attendance, study habits, previous academic failures, absences, and other relevant variables.

For an in-depth look at all the variables included in our datasets, please consult the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance) for a comprehensive data description.

## Tutorial Outline

This tutorial is structured to guide participants through the entire process of a machine learning project, from initial data handling to the final stages of model evaluation and prediction. Here’s what you can expect:

### Data Preparation
Using the `r_functions.r` script, we'll streamline loading, cleaning, and preparing our data for analysis. These functions ensure our data is in the right format for modeling.

### Model Training
With the `glmnet` package, we'll fit a generalized linear model to our training data. This phase involves feature selection and model parameter tuning to optimize our predictive model's performance.

### Evaluation and Prediction
We'll assess our model's accuracy using the test dataset, evaluating its effectiveness in predicting math grades. This step will enable us to identify students who could benefit from additional educational support.

This tutorial offers more than just a walkthrough of machine learning implementation; it provides a deeper understanding of the data and strategic insights necessary for impactful predictive modeling in the educational domain. Whether you're an experienced data scientist or a newcomer to the field, this tutorial promises to enrich your knowledge and skills in educational data analytics.


## Loading Training and Testing Datasets in R

In this code snippet, we load the training and test datasets needed for our analysis. They are stored as `.Rdata` files, R's format for saving and restoring objects. The `load()` function restores the saved objects into the R environment under their original names and returns only those names, so its result should not be assigned to `train` or `test`. The files are located in the `scripts_and_data` directory inside the course's `self_study_tutorial` directory. The paths are absolute, so R can find the files regardless of the current working directory, although they must be adapted when running on a different machine.

In [5]:
# Load the training dataset
# (assumes the file stores a data frame named `train`)
load("/home/jupyter-mlcourseuser/M02-Machine-Learning/self_study_tutorial/scripts_and_data/student-mat-train.Rdata")

# Load the test dataset
# (assumes the file stores a data frame named `test`)
load("/home/jupyter-mlcourseuser/M02-Machine-Learning/self_study_tutorial/scripts_and_data/student-mat-test.Rdata")
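
One detail of `load()` is easy to miss: it restores objects under the names they were saved with and returns only those names as a character vector. The sketch below illustrates this with a temporary file and a toy data frame; `toy_train` and the file path are made up for illustration and are not part of the course materials.

```r
# Toy data frame standing in for a training sample (hypothetical)
toy_train <- data.frame(G3 = c(10, 12, 15), studytime = c(1, 2, 3))

# Save it to a temporary .Rdata file, then remove it from the session
path <- tempfile(fileext = ".Rdata")
save(toy_train, file = path)
rm(toy_train)

# load() recreates `toy_train` and returns the names of the restored objects
restored <- load(path)
print(restored)          # character vector of object names
print(class(toy_train))  # the data frame is back under its saved name
```

Assigning the return value of `load()` to a variable would therefore store a string such as `"toy_train"`, not the data.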


Sourcing the `r_functions.r` script makes the functions we developed in the previous tutorial available for use. Among these are functions for data downloading, preprocessing, and splitting, which are foundational for any machine learning project.

For instance, when we aim to predict student performance in a math course, a critical first step is obtaining and preparing the data.

# Estimating Linear Regression to Predict Student Performance

In this section of our tutorial, we focus on estimating a linear regression model to predict students' final math grades (G3) based on a variety of socio-economic and school-related features. Linear regression is a basic yet powerful statistical method that elucidates the relationship between one dependent variable and one or more independent variables.

The core idea is to fit a linear equation to observed data, which, in our case, involves using the `lm()` function in R. The `lm()` function, which stands for linear model, requires a formula specifying the model to be fitted and the dataset for the model fitting.

After fitting our model, we assess its performance by predicting math grades for the test dataset and calculating the Mean Squared Error (MSE) of these predictions. The MSE is the average squared difference between the observed outcomes and the model's predictions. All else equal, a lower MSE indicates more accurate predictions.
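
As a minimal, self-contained sketch of this workflow, the example below fits `lm()` on simulated data and computes the test-sample MSE. The data, the variable names, and the 150/50 split are assumptions made purely for illustration; they are unrelated to the student dataset.

```r
set.seed(1)

# Simulate 200 observations with two predictors and a known linear relationship
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

# Split into a training and a test sample
sim_train <- sim[1:150, ]
sim_test  <- sim[151:200, ]

# `y ~ .` regresses y on all other columns of the data frame
fit <- lm(y ~ ., data = sim_train)

# Predict on the test sample and compute the mean squared error
pred <- predict(fit, newdata = sim_test)
mse_test <- mean((sim_test$y - pred)^2)

# Benchmark: always predict the training-sample mean of y
mse_baseline <- mean((sim_test$y - mean(sim_train$y))^2)
print(c(model = mse_test, baseline = mse_baseline))
```

Comparing against a naive benchmark such as the training mean gives the raw MSE value a reference point.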

Let's proceed with the code implementation:


In [6]:
library(glmnet)
library(dplyr)

# Fit the linear regression model
ols <- lm(G3 ~ ., data = train)
# Display the summary of the model to understand its performance
summary(ols)

# Predicting the math grades for the test dataset
test$predols <- predict(ols, newdata = test)

# Calculating the Mean Squared Error (MSE) for our predictions
predMSEols <- mean((test$G3 - test$predols)^2)
# Print the MSE to the console
print(predMSEols)

Loading required package: Matrix

Loaded glmnet 4.1-8


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union





Call:
lm(formula = G3 ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5524 -1.9313  0.1568  1.8190  8.5320 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.62145    3.68541   3.967 0.000103 ***
sex         -0.62733    0.49221  -1.275 0.204052    
age         -0.16293    0.19173  -0.850 0.396517    
address     -0.42789    0.53423  -0.801 0.424171    
famsize     -0.47520    0.46329  -1.026 0.306346    
Pstatus      0.26090    0.63117   0.413 0.679816    
Medu         0.34875    0.25380   1.374 0.171046    
Fedu         0.09652    0.23729   0.407 0.684661    
traveltime   0.22792    0.32065   0.711 0.478090    
studytime    0.68759    0.27570   2.494 0.013496 *  
failures    -0.69699    0.32437  -2.149 0.032934 *  
schoolsup   -3.47010    0.67917  -5.109 7.91e-07 ***
famsup      -0.76877    0.46120  -1.667 0.097202 .  
paid        -0.39143    0.44383  -0.882 0.378945    
activities   0.22703    0.42261   0.537 0.591760    
n

[1] 9.300887


## Mean Squared Error (MSE) Overview

The **Mean Squared Error (MSE)** is a critical metric in supervised learning, used to evaluate the performance of a predictive model. It calculates the average squared difference between the actual observed values and the model's predictions:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where $n$ is the number of observations, $y_i$ the actual value, and $\hat{y}_i$ the predicted value.
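
As a small worked example of the formula, the sketch below computes the MSE for four hypothetical students; the numbers are invented for illustration.

```r
# Actual and predicted values for four hypothetical students
y     <- c(10, 12, 8, 15)
y_hat <- c(11, 10, 9, 15)

# Errors are -1, 2, -1, 0; their squares are 1, 4, 1, 0
mse  <- mean((y - y_hat)^2)
rmse <- sqrt(mse)

print(mse)   # 1.5
print(rmse)  # about 1.22, in the same units as y
```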

### Objective

- **Minimize MSE**: A lower MSE indicates a model that predicts more closely to the actual observations, showing higher accuracy.

### Interpreting MSE: 9.300887

- **Contextual Value**: MSE needs to be interpreted in relation to the scale of the target variable and compared to benchmarks or other models.
- **Indication of Error**: An MSE of 9.300887 means that the squared deviations between predictions and actual grades average about 9.3. Taking the square root gives a root mean squared error of roughly 3.05 grade points, which is easier to interpret because it is on the same scale as the grades (0 to 20 in this dataset).
- **Improvement Marker**: Reductions in MSE across model iterations signal improvements in predictive accuracy, keeping in mind the balance with overfitting concerns.

The goal is to develop a model that not only minimizes MSE but also generalizes well across different datasets.


## Predicting Student Performance with Lasso Regression

The **Lasso** (Least Absolute Shrinkage and Selection Operator) minimizes the sum of squared residuals, with a penalty on the absolute size of the coefficients ($\beta_j$). The objective function for Lasso is defined as:

$$ \min_{\beta} \left\{ \sum_{i=1}^{N} (Y_i - \gamma - \sum_{j=1}^{p} X_{ij} \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, $$

where $p$ is the total number of control variables, and $\lambda \geq 0$ is the penalty term that regulates the degree of penalization. The Lasso performs both variable selection and regularization:

- When $\lambda = 0$, there's no penalization, making the Lasso equivalent to OLS.
- As $\lambda$ increases, some coefficients are shrunk towards zero, leading to exact zero coefficients for sufficiently large $\lambda$.
- The constant term ($\gamma$) is not penalized.
- Zero coefficients imply exclusion of corresponding variables from the model, making Lasso a model selection tool.
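
To see why the L1 penalty produces exact zeros, consider the special case of orthonormal predictors ($X^\top X = I$), which is an assumption made here only for illustration. In that case the minimizer of the objective above has a closed form: each OLS coefficient is moved towards zero by $\lambda/2$ and set to zero once it would cross it. The sketch below applies this rule to hypothetical OLS coefficients.

```r
# Soft-thresholding: Lasso solution for orthonormal predictors under the
# objective RSS + lambda * sum(|beta_j|); the threshold is lambda / 2
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda / 2, 0)
}

# Hypothetical OLS coefficients
beta_ols <- c(3.0, -1.2, 0.4, -0.1)

# Larger lambda values set more coefficients exactly to zero
for (lambda in c(0, 0.5, 1, 3)) {
  cat("lambda =", lambda, ":", soft_threshold(beta_ols, lambda), "\n")
}
```

With $\lambda = 0$ the OLS coefficients are returned unchanged, while at $\lambda = 3$ only the largest coefficient survives.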

## Tuning Parameter: $\lambda$

The penalty term $\lambda$ is crucial for both Lasso and Ridge. Optimal $\lambda$ can be determined through **cross-validation**:

- Partition the sample into $k$ folds of equal size.
- Specify a grid of $\lambda$ values.
- For each $\lambda$, estimate the model on $k-1$ folds, predict for the held-out fold, and measure the prediction error (e.g., MSE).
- Repeat for all folds, averaging the prediction error.
- Select $\lambda$ with the best average prediction error or apply the one-standard-error rule.

Before estimation, control variables are standardized, ensuring variable scaling does not impact the results.
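
The cross-validation steps listed above can be sketched by hand as follows. The simulated data, the grid of 20 $\lambda$ values, and all variable names are assumptions for illustration; in practice `cv.glmnet()` carries out this procedure for you.

```r
library(glmnet)
set.seed(1)

# Simulated data: 10 predictors, only the first three matter
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(2, -1.5, 1, rep(0, p - 3)) + rnorm(n))

# Step 1: assign each observation to one of k folds of equal size
k <- 5
fold_id <- sample(rep(1:k, length.out = n))

# Step 2: grid of lambda values, from large to small
lambda_grid <- exp(seq(log(1), log(0.01), length.out = 20))

# Steps 3-4: fit on k-1 folds, predict the held-out fold, store the MSE
cv_err <- matrix(NA_real_, nrow = k, ncol = length(lambda_grid))
for (f in 1:k) {
  held_out <- fold_id == f
  fit  <- glmnet(X[!held_out, ], y[!held_out], alpha = 1, lambda = lambda_grid)
  pred <- predict(fit, newx = X[held_out, ], s = lambda_grid)
  cv_err[f, ] <- colMeans((y[held_out] - pred)^2)
}

# Step 5: average over folds and pick lambda
mean_err   <- colMeans(cv_err)
se_err     <- apply(cv_err, 2, sd) / sqrt(k)
i_min      <- which.min(mean_err)
lambda_min <- lambda_grid[i_min]
# One-standard-error rule: largest lambda within one SE of the minimum
lambda_1se <- max(lambda_grid[mean_err <= mean_err[i_min] + se_err[i_min]])

print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```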






### Estimating a Lasso Model and Determining Optimal Lambda with Cross-Validation

In this part of our tutorial, we'll dive into how to estimate a Lasso regression model and use cross-validation to find the optimal lambda value, which balances model complexity and prediction accuracy. This process is crucial in preventing overfitting and underfitting, ensuring that our model generalizes well to new, unseen data.

### Setting the Stage

Before we begin, it's important to set a seed for replicability. This ensures that our results are consistent and can be reproduced by anyone rerunning this analysis. Here we use `set.seed(27112019)`.


In [13]:
set.seed(27112019)
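
The seed matters here because `cv.glmnet()` assigns observations to folds at random. The short sketch below mimics such a random fold assignment for 20 hypothetical observations and shows that resetting the same seed reproduces it.

```r
# Same seed, same random draw: the fold assignment is reproduced exactly
set.seed(27112019)
folds_a <- sample(rep(1:5, length.out = 20))

set.seed(27112019)
folds_b <- sample(rep(1:5, length.out = 20))

print(identical(folds_a, folds_b))  # TRUE
```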

## Estimating the Lasso Model

The Lasso model is estimated using the `glmnet` package in R, which requires the predictor variables to be in a matrix format and the response variable to be numeric. Here, we're using the first 25 columns of our training dataset as predictors to estimate the model for predicting the `G3` variable, which represents student grades.


In [7]:
# Estimate a Lasso model using the first 25 columns as predictors
lasso <- glmnet(as.matrix(train[,c(1:25)]), train$G3, alpha = 1)
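
`as.matrix()` works here because the predictors are assumed to be numerically coded already (the regression output above lists a single coefficient per variable). If a data frame contains factor or character columns, `as.matrix()` yields a character matrix that `glmnet` cannot use; one common remedy is `model.matrix()`, which expands factors into dummy variables. A small sketch with a hypothetical data frame:

```r
# Hypothetical data frame with one numeric and one factor predictor
df <- data.frame(
  studytime = c(1, 2, 3, 2),
  school    = factor(c("GP", "MS", "GP", "MS"))
)

# as.matrix() on mixed columns falls back to a character matrix
print(is.numeric(as.matrix(df)))  # FALSE

# model.matrix() builds a numeric design matrix; drop the intercept column
# because glmnet adds its own intercept
x <- model.matrix(~ ., data = df)[, -1, drop = FALSE]
print(x)
```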

## Cross-Validation to Determine Optimal Lambda

Cross-validation is a technique for assessing how well a fitted model will generalize to an independent dataset. In the context of Lasso regression, we use it to determine the optimal lambda value. Lambda controls the amount of shrinkage: the larger the lambda, the more coefficients are driven to zero. This helps with feature selection and with preventing overfitting.

The `cv.glmnet` function performs the cross-validation automatically; here it is set to use Mean Squared Error (MSE) as the accuracy measure. We've chosen 5-fold cross-validation, meaning the data is split into 5 parts, with each part serving as the validation set exactly once.


In [8]:
# Perform cross-validation to find the optimal lambda value
lasso.cv <- cv.glmnet(as.matrix(train[,c(1:25)]), train$G3, type.measure = "mse", nfolds = 5, alpha = 1)


## Results and Next Steps

After running the cross-validation, the optimal value of $\lambda$ is stored within `lasso.cv`. It can be accessed using `lasso.cv$lambda.min` for the lambda that minimizes the cross-validated MSE, or `lasso.cv$lambda.1se` for the most regularized model whose error is within one standard error of the minimum.

Understanding and selecting the optimal lambda allows us to balance model complexity and accuracy, tailoring our Lasso regression model to perform optimally on unseen data.

In the next steps of our analysis, we would use this optimal lambda value to re-estimate our Lasso model or proceed with model evaluation, such as calculating MSE on a test set to gauge the model's prediction accuracy.


In [9]:
# Print the optimal lambda value
print(paste0("Optimal lambda that minimizes cross-validated MSE: ", lasso.cv$lambda.min))
print(paste0("Optimal lambda using one-standard-error-rule: ", lasso.cv$lambda.1se))

[1] "Optimal lambda that minimizes cross-validated MSE: 0.156611125450941"
[1] "Optimal lambda using one-standard-error-rule: 0.435779759613754"
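
To see how the two choices differ in practice, the following sketch runs `cv.glmnet()` on simulated data (an assumption for illustration, not the student dataset) and counts the non-zero coefficients at each $\lambda$. By construction `lambda.1se` is at least as large as `lambda.min`, so it typically yields a sparser model.

```r
library(glmnet)
set.seed(1)

# Simulated data: 20 predictors, only the first three have non-zero effects
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(2, -1.5, 1, rep(0, p - 3)) + rnorm(n))

cv_fit <- cv.glmnet(X, y, type.measure = "mse", nfolds = 5, alpha = 1)

# Count non-zero slope coefficients (row 1 is the intercept)
n_nonzero <- function(s) sum(as.matrix(coef(cv_fit, s = s))[-1, 1] != 0)

print(c(lambda.min = cv_fit$lambda.min, lambda.1se = cv_fit$lambda.1se))
print(c(nonzero_min = n_nonzero("lambda.min"),
        nonzero_1se = n_nonzero("lambda.1se")))
```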


## Analyzing Lasso Coefficients and Calculating Test Sample MSE

In the code snippet below, we examine the coefficients of the Lasso model corresponding to the optimal lambda value determined through cross-validation. In Lasso regression, certain coefficients may take on a value of zero, indicating that the associated control variables are excluded from the model. This property of Lasso aids in simplifying the model by removing less relevant variables, thereby enhancing interpretability and generalization to unseen data.


In [10]:
# Print Lasso coefficients
print(coef(lasso.cv, s = "lambda.min"))

# Save for later comparison
coef_lasso1 <- coef(lasso.cv, s = "lambda.min") 


26 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) 12.88293900
sex         -0.34763178
age         -0.04783289
address      .         
famsize     -0.11991735
Pstatus      .         
Medu         0.24728969
Fedu         .         
traveltime   .         
studytime    0.32696194
failures    -0.73390530
schoolsup   -2.81748136
famsup      -0.38144513
paid        -0.11329474
activities   0.02107000
nursery      .         
higher       .         
internet     0.40280870
romantic     .         
famrel       0.02443835
freetime     .         
goout        .         
Dalc        -0.02576019
Walc        -0.32702432
health       .         
absences    -0.05943503
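
In the sparse-matrix printout, a dot marks a coefficient that is exactly zero. To list the selected variables programmatically, the coefficient object can be converted to an ordinary matrix and filtered. The sketch below rebuilds the first four rows of the output above by hand so that it runs on its own; with the fitted model one would start from `coef(lasso.cv, s = "lambda.min")` instead.

```r
library(Matrix)

# First four rows of the printed coefficients, entered manually
coefs <- Matrix(
  c(12.88293900, -0.34763178, -0.04783289, 0),
  ncol = 1, sparse = TRUE,
  dimnames = list(c("(Intercept)", "sex", "age", "address"), "s1")
)

# Convert to a dense matrix and keep the names of non-zero slope coefficients
dense    <- as.matrix(coefs)
selected <- setdiff(rownames(dense)[dense[, 1] != 0], "(Intercept)")
dropped  <- rownames(dense)[dense[, 1] == 0]

print(selected)  # "sex" "age"
print(dropped)   # "address"
```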


### Evaluating Model Performance with Test Sample Mean Squared Error (MSE)

Following the coefficient analysis, we proceed to calculate the Mean Squared Error (MSE) in the test sample. The test sample MSE enables us to assess the performance of our Lasso model on unseen data and compare it with other estimators. By evaluating the model's accuracy on a separate test dataset, we ensure that our Lasso regression model can generalize effectively beyond the training data, providing reliable predictions in practical scenarios.

In [11]:
test$predlasso <- predict(lasso.cv, newx = as.matrix(test[,c(1:25)]), s = lasso.cv$lambda.min)

# Calculate the MSE
predMSElasso <- mean((test$G3 - test$predlasso)^2)
print(paste0("MSE: ", predMSElasso))

[1] "MSE: 8.90119284460168"



# Ridge Regression

The **Ridge** regression minimizes the sum of squared residuals, with a penalty on the squared size of the coefficients ($\beta_j$). Its objective function is:

$$ \min_{\beta} \left\{ \sum_{i=1}^{N} (Y_i - \gamma - \sum_{j=1}^{p} X_{ij} \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}. $$

Similar to Lasso, larger $\lambda$ values imply more penalization. Unlike Lasso, however, Ridge does not set coefficients exactly to zero. Ridge is therefore not a variable selection tool; it is more suitable when the true underlying model is dense, meaning that many predictors have small but non-zero effects.
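
For the objective above, the Ridge coefficients have a closed-form solution once the predictors and outcome are centered (so that the unpenalized intercept drops out): $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below evaluates it on simulated data, which is an assumption for illustration. Note that `glmnet` scales its objective differently, so its $\lambda$ values are not directly comparable to the ones used here.

```r
set.seed(1)

# Simulated data with three predictors
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
y <- as.numeric(X %*% c(2, -1, 0.5) + rnorm(n))

# Center predictors and outcome so the intercept can be ignored
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# Closed-form Ridge estimate for a given lambda
ridge_coef <- function(lambda) {
  as.numeric(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
}

# lambda = 0 reproduces OLS; larger values shrink towards (but not to) zero
for (lambda in c(0, 10, 100, 1000)) {
  cat("lambda =", lambda, ":", round(ridge_coef(lambda), 3), "\n")
}
```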




## Ridge Regression Cross-Validation and Optimal Lambda Selection

We follow the same steps as for Lasso regression, this time for Ridge regression. We begin by setting the seed with `set.seed(27112019)` for reproducibility.

Next, we'll conduct cross-validation for the Ridge regression model using the `cv.glmnet` function, aiming to determine the optimal lambda value. By specifying `type.measure = "mse"` and `nfolds = 5`, we ensure that the model's performance is evaluated based on Mean Squared Error (MSE) using a 5-fold cross-validation strategy.

After completing the cross-validation process, we'll extract and output the optimal lambda value. This lambda value, accessible through `ridge.cv$lambda.min`, represents the parameter that minimizes the cross-validated MSE, indicating the optimal level of regularization for our Ridge regression model. Additionally, we'll obtain the lambda value based on the one-standard-error rule using `ridge.cv$lambda.1se`, providing an alternative perspective on selecting the regularization parameter.

By performing Ridge regression cross-validation and obtaining the optimal lambda value, we'll ensure that our model is appropriately regularized, balancing bias and variance for optimal performance on unseen data.


In [13]:
# Set the seed for reproducibility
set.seed(27112019)
ridge <- glmnet(as.matrix(train[,c(1:25)]), train$G3, alpha = 0)
# Cross-validate the Ridge model 
ridge.cv <- cv.glmnet(as.matrix(train[,c(1:25)]), train$G3, type.measure = "mse", nfolds = 5, alpha = 0)

In [14]:
# Print the optimal lambda value
print(paste0("Optimal lambda that minimizes cross-validated MSE: ", ridge.cv$lambda.min))
print(paste0("Optimal lambda using one-standard-error-rule: ", ridge.cv$lambda.1se))

[1] "Optimal lambda that minimizes cross-validated MSE: 1.97620826614986"
[1] "Optimal lambda using one-standard-error-rule: 10.5464291346655"


In [15]:
# Print Ridge coefficients
print(coef(ridge.cv, s = "lambda.min"))

# Save for later comparison
coef_ridge <- coef(ridge.cv, s = "lambda.min") 

26 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) 12.94080662
sex         -0.42225544
age         -0.08314731
address     -0.24984549
famsize     -0.30876169
Pstatus      0.11796154
Medu         0.21089603
Fedu         0.09683688
traveltime   0.04969724
studytime    0.36062997
failures    -0.55560665
schoolsup   -2.08642928
famsup      -0.44044860
paid        -0.24836700
activities   0.20087023
nursery     -0.11340779
higher       0.38613173
internet     0.50139834
romantic    -0.15049638
famrel       0.11637309
freetime     0.02250704
goout       -0.12642101
Dalc        -0.17216585
Walc        -0.19404329
health      -0.05676600
absences    -0.04516606


In contrast to the Lasso model, which set 10 of the 25 coefficients to zero at `lambda.min`, the Ridge model keeps all control variables. Accordingly, Ridge is suited for dense models. In comparison to OLS, the Ridge coefficients are shrunken towards zero.





Following the cross-validation and optimal lambda selection for Ridge regression, we predict the test sample using the fitted model. We obtain the predictions with the `predict()` function, evaluated at `ridge.cv$lambda.min`, and calculate the Mean Squared Error (MSE) on the test data. This shows how well the Ridge regression model generalizes to new, unseen data.

In [16]:
test$predridge <- predict(ridge, newx = as.matrix(test[,c(1:25)]), s = ridge.cv$lambda.min)

# Calculate the MSE
predMSEridge <- mean((test$G3 - test$predridge)^2)
print(paste0("MSE: ", predMSEridge))

[1] "MSE: 8.31374220052241"


# Comparing Regression Models: OLS, Lasso, and Ridge

The output below reports the test-sample mean squared error (MSE) of the three regression models applied to our dataset: Ordinary Least Squares (OLS), Lasso, and Ridge regression. A lower MSE indicates better predictive accuracy. Each model's MSE is discussed in turn:

- **OLS Regression MSE (9.300887):** OLS is the simplest form of linear regression and applies no regularization. Its MSE is the highest of the three, which may indicate that the unregularized model overfits the training data to some degree.

- **Lasso Regression MSE (8.901193):** Lasso regression uses an L1 penalty that leads to coefficient sparsity (some coefficients are set exactly to zero). Its MSE is about 4.3% lower than that of OLS, which points to better generalization, likely because dropping weak predictors reduces overfitting.

- **Ridge Regression MSE (8.313742):** Ridge regression, which uses an L2 penalty to shrink coefficients without setting them to zero, achieves the lowest MSE, about 10.6% below OLS. This may reflect its ability to handle correlated predictors more gracefully than OLS, and in this case it also does better than Lasso.

In essence, for this dataset and this particular train/test split, Ridge regression outperforms both Lasso and OLS in terms of predictive accuracy. The differences are modest and come from a single split, so they should not be over-interpreted, but they illustrate how regularization (both L1 and L2) can improve out-of-sample performance by mitigating overfitting.


In [17]:
print(c(predMSEols, predMSElasso, predMSEridge))

[1] 9.300887 8.901193 8.313742
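
To put these numbers on a more interpretable footing, the reported MSE values can be converted to root mean squared errors (in grade points) and to percentage improvements over OLS. The values below are copied from the output above.

```r
# Test-sample MSE values reported above
mse <- c(OLS = 9.300887, Lasso = 8.901193, Ridge = 8.313742)

# RMSE is on the same scale as the grades themselves
rmse <- sqrt(mse)

# Percentage reduction in MSE relative to OLS
pct_gain <- 100 * (1 - mse / mse["OLS"])

print(round(rmse, 2))
print(round(pct_gain, 1))
```

On this split, Ridge lowers the test MSE by roughly 10.6% relative to OLS, and Lasso by roughly 4.3%.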
