# Student Performance Indicator: Model Training

### About the Dataset
This project examines how students' performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch type, and test preparation course.

**Source:** https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

### Description
- **gender:** Sex of the student
- **race/ethnicity:** Ethnic group of the student
- **parental level of education:** Highest level of education attained by the student's parents
- **lunch:** Type of lunch the student had before the test (standard or free/reduced)
- **test preparation course:** Whether the student completed the test preparation course before the test (completed or none)
- **math score:** The student's math score
- **reading score:** The student's reading score
- **writing score:** The student's writing score

### Objective
Our objective is to predict a student's math score from the remaining features.

## 1. Importing libraries

In [21]:
# Essentials
import numpy as np
import pandas as pd

# Feature Engineering
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Model Selection & Evaluation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Regression metrics (the target, math score, is continuous)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")

# Display scikit-learn estimators as diagrams
from sklearn import set_config
set_config(display='diagram')

# Configuring pandas display settings
pd.set_option('display.max_columns', None)
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

OSError: [WinError 8] Not enough memory resources are available to process this command

## 2. Getting the dataset

In [2]:
# Creating a dataframe
df = pd.read_csv('./data/Students_Performance.csv')
df.sample(5)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
911,female,group A,some college,standard,none,69,84,82
979,female,group C,associate's degree,standard,none,91,95,94
25,male,group A,master's degree,free/reduced,none,73,74,72
434,male,group C,some high school,standard,none,73,66,66
200,female,group C,associate's degree,standard,completed,67,84,86


## 3. Feature Engineering

In [3]:
# Separating the target column from the rest of the dataset
Y = df['math score']
Y.sample(5)

661    73
953    58
250    47
16     88
508    79
Name: math score, dtype: int64

In [4]:
# Dropping target column from the dataset
X = df.drop(columns=['math score'])
X.sample(5)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,reading score,writing score
200,female,group C,associate's degree,standard,completed,84,86
81,male,group B,high school,free/reduced,none,45,45
688,male,group A,high school,free/reduced,none,58,44
241,female,group E,bachelor's degree,standard,none,83,83
115,male,group C,high school,standard,none,77,74


In [5]:
# Performing train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42, shuffle=True)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((750, 7), (250, 7), (750,), (250,))

In [10]:
# Extracting all the numerical features
numerical_features = [feature for feature in df.select_dtypes(exclude='object').columns if feature not in ['math score']]

# Extracting all the categorical features
categorical_features = [feature for feature in df.select_dtypes(include='object').columns]

In [7]:
# Capping outliers in X_train using the IQR method
# (limits are computed from X_train only, so the test set does not leak into them)
for feature in numerical_features:
    percentile25 = X_train[feature].quantile(0.25)
    percentile75 = X_train[feature].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5*iqr
    lower_limit = percentile25 - 1.5*iqr
    # Values beyond the limits are replaced by the limits; no rows are dropped
    X_train[feature] = X_train[feature].clip(lower=lower_limit, upper=upper_limit)

In [8]:
# Capping outliers in X_test using the IQR method
# (limits come from X_train, so the test set is treated exactly like the training set)
for feature in numerical_features:
    percentile25 = X_train[feature].quantile(0.25)
    percentile75 = X_train[feature].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5*iqr
    lower_limit = percentile25 - 1.5*iqr
    # Values beyond the limits are replaced by the limits; no rows are dropped
    X_test[feature] = X_test[feature].clip(lower=lower_limit, upper=upper_limit)
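
The two capping cells repeat the same logic, so it may help to see the idea factored into a "fit on train, apply to both" pair of helpers. The following is only a sketch: `fit_iqr_limits`, `apply_iqr_limits`, and the toy frames are hypothetical names and values introduced for illustration, not part of the notebook.

```python
import pandas as pd


def fit_iqr_limits(train: pd.DataFrame, features: list, k: float = 1.5) -> dict:
    """Compute (lower, upper) capping limits per feature from training data only."""
    limits = {}
    for feature in features:
        q1 = train[feature].quantile(0.25)
        q3 = train[feature].quantile(0.75)
        iqr = q3 - q1
        limits[feature] = (q1 - k * iqr, q3 + k * iqr)
    return limits


def apply_iqr_limits(data: pd.DataFrame, limits: dict) -> pd.DataFrame:
    """Return a copy of `data` with each feature clipped to its stored limits."""
    capped = data.copy()
    for feature, (lower, upper) in limits.items():
        capped[feature] = capped[feature].clip(lower=lower, upper=upper)
    return capped


# Toy data (invented for illustration, not taken from the dataset)
toy_train = pd.DataFrame({"reading score": [50, 60, 70, 80, 90, 5]})
toy_test = pd.DataFrame({"reading score": [0, 75, 100]})

# Limits are learned once from the training frame and reused for the test frame
limits = fit_iqr_limits(toy_train, ["reading score"])
toy_train_capped = apply_iqr_limits(toy_train, limits)
toy_test_capped = apply_iqr_limits(toy_test, limits)

print(limits)
print(toy_test_capped)
```

In this toy example the training quartiles are 52.5 and 77.5, so the limits work out to 15 and 115: the training value 5 and the test value 0 are both raised to 15, while 100 is left alone.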

In [12]:
# Creating pipeline for numerical features
num_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

In [13]:
# Creating pipeline for categorical features
cat_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('one_hot_encoder', OneHotEncoder()),
        ('scaler', StandardScaler(with_mean=False))
    ]
)

In [15]:
# Packing the preprocessors together
preprocessor = ColumnTransformer(
    [
        ('num_pipeline', num_pipeline, numerical_features),
        ('cat_pipeline', cat_pipeline, categorical_features)
    ], verbose=True
)
preprocessor

In [16]:
# Performing preprocessing on X_train
X_train = pd.DataFrame(preprocessor.fit_transform(X_train))
X_train.sample(5)

[ColumnTransformer] .. (1 of 2) Processing num_pipeline, total=   0.0s
[ColumnTransformer] .. (2 of 2) Processing cat_pipeline, total=   0.0s


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
282,-1.317084,-1.124349,0.0,2.003772,0.0,0.0,2.136752,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,2.099358,0.0,2.093829,0.0
280,0.016281,0.089572,2.003772,0.0,0.0,2.552497,0.0,0.0,0.0,2.403636,0.0,0.0,0.0,0.0,0.0,2.099358,0.0,0.0,2.093829
707,-0.124073,-0.584829,0.0,2.003772,0.0,2.552497,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,2.099358,0.0,2.093829
69,-0.545136,-0.584829,2.003772,0.0,0.0,0.0,2.136752,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,2.099358,0.0,0.0,2.093829
2,-0.334604,-0.517389,2.003772,0.0,0.0,0.0,2.136752,0.0,0.0,0.0,0.0,0.0,4.349306,0.0,0.0,2.099358,0.0,0.0,2.093829


In [17]:
# Performing preprocessing on X_test
X_test = pd.DataFrame(preprocessor.transform(X_test))
X_test.sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
145,-0.615313,-0.922029,2.003772,0.0,0.0,2.552497,0.0,0.0,0.0,2.403636,0.0,0.0,0.0,0.0,0.0,2.099358,0.0,0.0,2.093829
210,-0.404782,-0.517389,0.0,2.003772,0.0,2.552497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.649846,0.0,2.099358,0.0,2.093829
188,-0.264427,-0.315069,0.0,2.003772,0.0,0.0,2.136752,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,2.099358,0.0,2.093829
213,-0.053896,0.494212,2.003772,0.0,0.0,0.0,2.136752,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.649846,0.0,2.099358,0.0,2.093829
211,-0.545136,-0.315069,2.003772,0.0,0.0,0.0,2.136752,0.0,0.0,0.0,0.0,0.0,0.0,2.388487,0.0,2.099358,0.0,0.0,2.093829


## 4. Model Selection

### 4.1 Initialising models

In [18]:
# Linear Model
lr = LinearRegression()

# K Nearest Neighbours
knc = KNeighborsRegressor()

# Tree Model
dt = DecisionTreeRegressor()

# Ensemble Model
rf = RandomForestRegressor()
et = ExtraTreesRegressor()
abc = AdaBoostRegressor()
gb = GradientBoostingRegressor()

# LightGBM Regressor
lgbm = LGBMRegressor()

# XGBoost Regressor
xgb = XGBRegressor()

NameError: name 'LinearRegression' is not defined
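
Once the models are initialised, a common next step is to compare them with cross-validation. The sketch below shows one way this could look. It is not the notebook's own evaluation code: it uses synthetic data from `make_regression` as a stand-in for the preprocessed features, only a subset of the scikit-learn models, and hypothetical names (`X_demo`, `y_demo`, `models`, `mean_r2`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed training data (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Candidate regressors keyed by a readable name
models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Plain K-fold splitting, since the target is continuous
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Mean cross-validated R^2 for each model
mean_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    mean_r2[name] = scores.mean()

# Report the models from best to worst
for name, score in sorted(mean_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

`KFold` is used here rather than the `StratifiedKFold` imported earlier because stratified splitting expects class labels and does not accept a continuous target such as a math score.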