<a href="https://colab.research.google.com/github/McNealFielies/McNealFielies.github.io/blob/main/Student_Performance_using_Hyperparameter_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# Loading in the dataset

df = pd.read_csv('/content/Student_Performance.csv')

In [5]:
# Looking at the first 5 rows of the dataset

df.head()

,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0


In [6]:
# See if there are any missing values in the dataset

df.isnull().sum()

Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64

In [7]:
# Information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     10000 non-null  int64  
 1   Previous Scores                   10000 non-null  int64  
 2   Extracurricular Activities        10000 non-null  object 
 3   Sleep Hours                       10000 non-null  int64  
 4   Sample Question Papers Practiced  10000 non-null  int64  
 5   Performance Index                 10000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB


In [8]:
# Description of the dataset

df.describe()

,Hours Studied,Previous Scores,Sleep Hours,Sample Question Papers Practiced,Performance Index
count,10000.0,10000.0,10000.0,10000.0,10000.0
mean,4.9929,69.4457,6.5306,4.5833,55.2248
std,2.589309,17.343152,1.695863,2.867348,19.212558
min,1.0,40.0,4.0,0.0,10.0
25%,3.0,54.0,5.0,2.0,40.0
50%,5.0,69.0,7.0,5.0,55.0
75%,7.0,85.0,8.0,7.0,71.0
max,9.0,99.0,9.0,9.0,100.0


In [11]:
# Convert the Extracurricular Activities column to numerical values

from sklearn.preprocessing import LabelEncoder


label_encoder = LabelEncoder()

df['Extracurricular Activities'] = label_encoder.fit_transform(df['Extracurricular Activities'])
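
`LabelEncoder` assigns integer codes in the sorted order of the labels, so "No" should become 0 and "Yes" should become 1. The sketch below checks this on a small toy column; `toy` and `mapping` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for df['Extracurricular Activities'] (illustrative values only)
toy = pd.Series(['Yes', 'No', 'Yes', 'Yes', 'No'])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the labels in sorted order; a label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Inspecting `classes_` after fitting is a quick way to confirm which label received which code before the column is overwritten.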

In [12]:
# Let's have a look and see if the values have been changed to numerical values

df.head()

,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,1,9,1,91.0
1,4,82,0,4,2,65.0
2,8,51,1,7,2,45.0
3,5,52,1,5,2,36.0
4,7,75,0,8,5,66.0


# **Regression Models**

In [17]:
# Define the list of regression models
model_list = [
    LinearRegression(),
    DecisionTreeRegressor(),
    RandomForestRegressor(random_state=0),
    GradientBoostingRegressor(),
]

In [23]:
# Create a dictionary of hyperparameters for each model
model_hyperparameters = {
    'linear_reg_hyperparameter': {
        'fit_intercept': [True, False],
    },
    'decision_tree_hyperparameters': {
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
    },
    'random_forest_hyperparameter': {
        'n_estimators': [10, 20, 50, 100],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
    },
    'gradient_boosting_hyperparameter': {
        'n_estimators': [10, 50, 100],
        'learning_rate': [0.1, 0.01, 0.001],
        'max_depth': [3, 4, 5],
    },
}
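
To get a feel for how much work each grid implies, the sketch below counts the parameter combinations with scikit-learn's `ParameterGrid` and multiplies by the number of folds. The grids are copied from the dictionary above; `cv_folds = 5` is an assumption chosen to match a 5-fold search, and `counts` is an illustrative name.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from model_hyperparameters above
grids = {
    'linear_reg_hyperparameter': {'fit_intercept': [True, False]},
    'decision_tree_hyperparameters': {
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
    },
    'random_forest_hyperparameter': {
        'n_estimators': [10, 20, 50, 100],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
    },
    'gradient_boosting_hyperparameter': {
        'n_estimators': [10, 50, 100],
        'learning_rate': [0.1, 0.01, 0.001],
        'max_depth': [3, 4, 5],
    },
}

cv_folds = 5  # assumed number of cross-validation folds

# Number of candidate settings per grid, and the number of model fits that implies
counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, n_candidates in counts.items():
    print(f'{name}: {n_candidates} candidates, {n_candidates * cv_folds} fits')
```

The random forest grid is the largest by a wide margin, which is why that model tends to dominate the total search time.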

# **Apply GridSearchCV**

In [24]:
# Create a function for model selection and hyperparameter tuning
def ModelSelection(list_of_models, Hyperparameter_dictionary, X, y):
    result = []

    # Pair each model with its grid by position (dictionaries preserve insertion order)
    for model, params in zip(list_of_models, Hyperparameter_dictionary.values()):

        print(model)
        print(params)
        print('------------------------------------------------------------------')

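        # 5-fold cross-validated grid search; the best setting is refit on all of X, y
        regressor = GridSearchCV(model, params, scoring='neg_mean_squared_error', cv=5)
        regressor.fit(X, y)

        # Note: these metrics are computed on the same data used for fitting (in-sample)
        y_pred = regressor.predict(X)
        mse = mean_squared_error(y, y_pred)
        r2 = r2_score(y, y_pred)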

        result.append({
            'model used': model.__class__.__name__,
            'best hyperparameter': regressor.best_params_,
            'mean squared error': mse,
            'R-squared (R2)': r2,
        })

    result_dataframe = pd.DataFrame(result, columns=['model used', 'best hyperparameter', 'mean squared error', 'R-squared (R2)'])
    return result_dataframe
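
Besides the best parameters, a fitted `GridSearchCV` also exposes `best_score_`, the mean cross-validated score of the winning setting. With `scoring='neg_mean_squared_error'` that score is the negated MSE, so flipping the sign gives a cross-validated MSE. The sketch below illustrates this on synthetic data from `make_regression`; `X_demo`, `y_demo`, and `cv_mse` are hypothetical names and the data is not the student dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic regression data used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

search = GridSearchCV(
    LinearRegression(),
    {'fit_intercept': [True, False]},
    scoring='neg_mean_squared_error',
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE across the 5 folds; negate it to get an MSE
cv_mse = -search.best_score_
print(search.best_params_)
print(cv_mse)
```

Recording this value alongside the in-sample metrics would be one way to compare models on data they were not fit on.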

In [25]:
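# Separate the features from the target
# (assumption: 'Performance Index' is the target; the cell that defined X and y is not shown)
X = df.drop(columns=['Performance Index'])
y = df['Performance Index']

# Call the ModelSelection function with your data
result_df = ModelSelection(model_list, model_hyperparameters, X, y)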

LinearRegression()
{'fit_intercept': [True, False]}
------------------------------------------------------------------
DecisionTreeRegressor()
{'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
------------------------------------------------------------------
RandomForestRegressor(random_state=0)
{'n_estimators': [10, 20, 50, 100], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
------------------------------------------------------------------
GradientBoostingRegressor()
{'n_estimators': [10, 50, 100], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5]}
------------------------------------------------------------------


In [26]:
# Print the results
print(result_df)

                  model used  \
0           LinearRegression   
1      DecisionTreeRegressor   
2      RandomForestRegressor   
3  GradientBoostingRegressor   

                                 best hyperparameter  mean squared error  \
0                            {'fit_intercept': True}            4.151351   
1  {'max_depth': 10, 'min_samples_leaf': 4, 'min_...            3.559434   
2  {'max_depth': 10, 'min_samples_leaf': 4, 'min_...            3.237936   
3  {'learning_rate': 0.1, 'max_depth': 4, 'n_esti...            3.986742   

   R-squared (R2)  
0        0.988752  
1        0.990356  
2        0.991227  
3        0.989198  
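
The hyperparameter dictionaries above are cut off because pandas truncates long cell values when printing. A possible way to see them in full is to lift the `display.max_colwidth` limit temporarily, as sketched here on a one-row frame whose values are copied from the results; `demo` and `full_text` are illustrative names.

```python
import pandas as pd

# One-row frame with a long dictionary value, mirroring the results table
demo = pd.DataFrame([
    {
        'model used': 'DecisionTreeRegressor',
        'best hyperparameter': {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2},
    }
])

# Inside this context pandas does not truncate cell contents when rendering
with pd.option_context('display.max_colwidth', None):
    full_text = repr(demo)

print(full_text)
```

Using `option_context` keeps the change local, so other outputs in the notebook keep the default width.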


In [28]:
import pandas as pd

# Create a list of dictionaries with the result information
result_list = [
    {
        'model used': 'LinearRegression',
        'best hyperparameter': {'fit_intercept': True},
        'mean squared error': 4.151351,
        'R-squared (R2)': 0.988752,
    },
    {
        'model used': 'DecisionTreeRegressor',
        'best hyperparameter': {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2},
        'mean squared error': 3.559434,
        'R-squared (R2)': 0.990356,
    },
    {
        'model used': 'RandomForestRegressor',
        'best hyperparameter': {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100},
        'mean squared error': 3.237936,
        'R-squared (R2)': 0.991227,
    },
    {
        'model used': 'GradientBoostingRegressor',
        'best hyperparameter': {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100},
        'mean squared error': 3.986742,
        'R-squared (R2)': 0.989198,
    }
]

# Create a pandas DataFrame from the list of dictionaries
result_df = pd.DataFrame(result_list)

# Reorder the columns
result_df = result_df[['model used', 'R-squared (R2)', 'best hyperparameter', 'mean squared error']]

# Print the DataFrame
print(result_df)


                  model used  R-squared (R2)  \
0           LinearRegression        0.988752   
1      DecisionTreeRegressor        0.990356   
2      RandomForestRegressor        0.991227   
3  GradientBoostingRegressor        0.989198   

                                 best hyperparameter  mean squared error  
0                            {'fit_intercept': True}            4.151351  
1  {'max_depth': 10, 'min_samples_leaf': 4, 'min_...            3.559434  
2  {'max_depth': 10, 'min_samples_leaf': 4, 'min_...            3.237936  
3  {'learning_rate': 0.1, 'max_depth': 4, 'n_esti...            3.986742  


# **Conclusion**

In this analysis, we applied four regression algorithms to predict the Performance Index from the student dataset: Linear Regression, Decision Tree Regression, Random Forest Regression, and Gradient Boosting Regression. Each model was tuned with a 5-fold grid search.

### **Linear Regression:**

Linear Regression provided a good fit to the data with an R-squared (R2) value of approximately **0.989**.
The best hyperparameter setting for Linear Regression was `fit_intercept=True`.


### **Decision Tree Regression:**

Decision Tree Regression achieved an R-squared (R2) value of approximately **0.990**.
The best hyperparameters were a **maximum depth of 10, minimum samples per leaf of 4, and minimum samples to split of 2**.


### **Random Forest Regression:**

Random Forest Regression had the highest R-squared (R2) value of the four models, approximately **0.991**.
The best hyperparameters were **100 estimators, a maximum depth of 10, minimum samples per leaf of 4, and minimum samples to split of 2**.


### **Gradient Boosting Regression:**

Gradient Boosting Regression achieved a high R-squared (R2) value of approximately **0.989**.
The best hyperparameters were a **learning rate of 0.1, maximum depth of 4, and 100 estimators**.


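Overall, the Random Forest Regression model with 100 estimators, a maximum depth of 10, minimum samples per leaf of 4, and minimum samples to split of 2 gave the best results, with the highest R-squared (about 0.991) and the lowest mean squared error (about 3.24). The margin is small: all four models reached an R2 between 0.988 and 0.992, and both metrics were computed on the same data the models were fit on.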
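
Since the scores in this notebook come from predictions on the rows used for fitting, a held-out split is one way to check how well the chosen settings generalise. The sketch below is a minimal illustration on synthetic data from `make_regression`, not a rerun of the analysis; `X_demo`, `y_demo`, the 80/20 split, and `random_state=0` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Random forest with the hyperparameters reported as best in the conclusion
model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=4,
    min_samples_split=2,
    random_state=0,
)
model.fit(X_train, y_train)

# Compare the fit on seen rows with the fit on unseen rows
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_r2, test_r2, test_mse)
```

On the student data, the same pattern would apply with `X` and `y` in place of the synthetic arrays, and a noticeable gap between `train_r2` and `test_r2` would suggest overfitting.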