# This notebook corresponds to the first exercise of the second TP.

Exercise: Data Preparation for Modelling.

Objective: Make the dataset suitable for a machine learning model.

Dataset: [Student Performance dataset](https://data.mendeley.com/datasets/5b82ytz489/1)

Dataset Description:


The Student Performance Metrics Dataset provides a diverse collection of academic and non-academic attributes aimed at evaluating factors influencing student performance in higher education. It enables researchers to analyse relationships between student demographics, academic achievements, socio-economic factors, and extracurricular activities.

<span style="color:green; font-weight:bold">Dataset Dictionary:</span> 
|Column|Description|
|:-|:-|
|Department|The academic department the student is enrolled in (e.g., Computer Science, Business, etc.)|
|Gender|The gender of the student|
|HSC|Score obtained in higher secondary education|
|SSC|Score obtained in secondary school education|
|Income|Monthly income of the student's family|
|Hometown|The type of area where the student resides (e.g., urban, rural)|
|Computer|Proficiency level in computer usage|
|Preparation|Time spent on study preparation outside class hours|
|Gaming|Time spent on gaming activities daily|
|Attendance|Regularity in class participation|
|Job|Indicates if the student has a part-time job|
|English|Proficiency in English communication skills|
|Extra|Participation in extracurricular activities|
|Semester|Current semester the student is enrolled in|
|Last|Performance in the last semester|
|Overall|Cumulative Grade Point Average (CGPA)|


##### Importing libraries:

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

##### Loading dataset:

In [31]:
dataset = "./../dataset/student_performance.csv"
data = pd.read_csv(dataset, sep=",")

##### Load the dataset and display its basic information:

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Department   493 non-null    object 
 1   Gender       493 non-null    object 
 2   HSC          493 non-null    float64
 3   SSC          493 non-null    float64
 4   Income       493 non-null    object 
 5   Hometown     493 non-null    object 
 6   Computer     493 non-null    int64  
 7   Preparation  493 non-null    object 
 8   Gaming       493 non-null    object 
 9   Attendance   493 non-null    object 
 10  Job          493 non-null    object 
 11  English      493 non-null    int64  
 12  Extra        493 non-null    object 
 13  Semester     493 non-null    object 
 14  Last         493 non-null    float64
 15  Overall      493 non-null    float64
dtypes: float64(4), int64(2), object(10)
memory usage: 61.8+ KB


In [33]:
# After checking the English column, we found its values are ordered categories from 1 to 5.
data["English"] = pd.Categorical(data["English"], categories=[1, 2, 3, 4, 5], ordered=True)


In [34]:
# After checking the Computer column, we found its values are ordered categories from 1 to 5.
data["Computer"] = pd.Categorical(data["Computer"], categories=[1, 2, 3, 4, 5], ordered=True)
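
To illustrate what the ordered conversion provides, here is a minimal sketch on made-up ratings (the `ratings` values are hypothetical and not taken from the dataset). It shows that an ordered categorical supports comparisons and exposes 0-based integer codes that follow the declared category order.

```python
import pandas as pd

# Hypothetical ratings on the same 1-to-5 scale (not taken from the dataset)
ratings = pd.Series([3, 1, 5, 2])
ordered = pd.Series(pd.Categorical(ratings, categories=[1, 2, 3, 4, 5], ordered=True))

# Integer codes follow the declared category order and start at 0
codes = ordered.cat.codes.tolist()

# Comparisons are only allowed because ordered=True
at_least_3 = (ordered >= 3).tolist()

print(codes)
print(at_least_3)
```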

##### Identify and separate features (X) and target (y) columns:

In [35]:
X, Y = data.drop(["Overall"], axis=1), data[["Overall"]]

##### Handle missing values if any:


In [36]:
X.isnull().sum()

Department     0
Gender         0
HSC            0
SSC            0
Income         0
Hometown       0
Computer       0
Preparation    0
Gaming         0
Attendance     0
Job            0
English        0
Extra          0
Semester       0
Last           0
dtype: int64

In [37]:
Y.isnull().sum()

Overall    0
dtype: int64

##### Encode categorical variables using LabelEncoder or OneHotEncoder:

In [38]:
# Select qualitative columns
Qaulitative_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
Qaulitative_cols

['Department',
 'Gender',
 'Income',
 'Hometown',
 'Computer',
 'Preparation',
 'Gaming',
 'Attendance',
 'Job',
 'English',
 'Extra',
 'Semester']

In [39]:
# Check unique values in qualitative columns
X[Qaulitative_cols].nunique()

Department     10
Gender          2
Income         10
Hometown        2
Computer        5
Preparation     3
Gaming          3
Attendance      4
Job             2
English         5
Extra           2
Semester       11
dtype: int64

In [40]:
X["Semester"].value_counts()

Semester
2nd     183
10th     57
8th      54
3rd      35
5th      33
9th      30
6th      29
7th      24
4th      23
11th     21
12th      4
Name: count, dtype: int64

In [41]:
# Identify ordinal columns by inspecting each one.
Ordinal_cols = ["English", "Computer", "Income", "Preparation", "Gaming", "Attendance", "Semester"]

In [42]:
# Nominal columns are the remaining qualitative columns.
Nominal_cols = [col for col in Qaulitative_cols if col not in Ordinal_cols]

In [43]:
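# Encode ordinal columns.
# Caveat: with the default categories='auto', OrdinalEncoder orders each column's
# values by sorting them, so string labels such as "10th" are ranked before "2nd".
ordinal_encoder = OrdinalEncoder()
X[Ordinal_cols] = ordinal_encoder.fit_transform(X[Ordinal_cols])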
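
A default `OrdinalEncoder()` assigns codes by sorting the labels, which does not match the academic order for string labels such as those in `Semester`. The sketch below contrasts the default with an explicit `categories` list, assuming the academic order is simply 2nd through 12th (the labels are the ones shown by `value_counts()`; the `toy` frame is hypothetical). The other ordinal columns would need their own ordered label lists, which are not shown in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Semester labels from the value_counts() output, listed in assumed academic order
semester_order = ["2nd", "3rd", "4th", "5th", "6th", "7th",
                  "8th", "9th", "10th", "11th", "12th"]

# Toy frame for illustration only
toy = pd.DataFrame({"Semester": ["10th", "2nd", "9th", "12th"]})

# Default: labels are sorted as strings, so "10th" < "12th" < "2nd" < "9th"
default_codes = OrdinalEncoder().fit_transform(toy).ravel().tolist()
# → [0.0, 2.0, 3.0, 1.0]

# Explicit order: the code is the position of the label in semester_order
explicit_codes = OrdinalEncoder(categories=[semester_order]).fit_transform(toy).ravel().tolist()
# → [8.0, 0.0, 7.0, 10.0]

print(default_codes)
print(explicit_codes)
```

When several columns are encoded at once, `categories` takes one list per column, in the same order as the columns passed to `fit_transform`.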


In [44]:
# One-hot encode nominal columns; drop='first' removes one redundant dummy column per feature.
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')
X_nominal_encoded = onehot_encoder.fit_transform(X[Nominal_cols])
X_nominal_encoded_df = pd.DataFrame(X_nominal_encoded, columns=onehot_encoder.get_feature_names_out(Nominal_cols))
X = pd.concat([X.drop(Nominal_cols, axis=1).reset_index(drop=True), X_nominal_encoded_df.reset_index(drop=True)], axis=1)

##### Split the data into training and test sets using train_test_split (e.g., 80% train, 20% test).

In [46]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

##### Scale numerical features using StandardScaler or MinMaxScaler.

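StandardScaler: new_x = (x - mean) / std

MinMaxScaler: new_x = (x - min) / (max - min)

In both cases the statistics (mean, std, min, max) are computed per feature, on the training set only.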
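
The notebook does not include a scaling cell, so the following is only a sketch of how the step could look. It assumes the numerical features are `HSC`, `SSC` and `Last`, uses made-up values in a hypothetical `toy` frame, and fits `StandardScaler` on the training split before applying it to the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "HSC":  [4.0, 4.5, 5.0, 3.5, 4.2, 4.8, 3.9, 4.4, 4.1, 4.7],
    "SSC":  [4.3, 4.9, 5.0, 3.8, 4.0, 4.6, 4.1, 4.5, 4.2, 4.8],
    "Last": [3.1, 3.6, 3.9, 2.8, 3.0, 3.7, 2.9, 3.4, 3.2, 3.8],
})
numeric_cols = ["HSC", "SSC", "Last"]  # assumed numerical features

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

scaler = StandardScaler()
# Learn mean and std from the training split only
toy_train[numeric_cols] = scaler.fit_transform(toy_train[numeric_cols])
# Reuse the training statistics on the test split
toy_test[numeric_cols] = scaler.transform(toy_test[numeric_cols])

print(toy_train[numeric_cols].mean().round(6))
print(toy_train[numeric_cols].std(ddof=0).round(6))
```

Swapping `StandardScaler` for `MinMaxScaler` keeps the same fit-on-train, transform-on-test pattern; only the formula applied to each column changes.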

##### Print shapes of training and test sets to confirm the split.

In [47]:
x_train.shape

(394, 23)

In [49]:
x_test.shape

(99, 23)