# Predicting Higher Education Students Performance

This project will use ML to predict whether students will have a good final grade based on personal characteristics, family characteristics, and education habits. If it is possible to predict which students would likely have a poor outcome, those students could be offered extra services to help them succeed.
<br>The dataset has the final grades of students encoded by letter, this will be divided into 2 groups: those with a "C" or higher are considered "good" and those with less than a "C" are considered to be "bad."
<br>The dataset used is "Higher Education Students Performance Evaluation" made available by the Faculty of Engineering and Faculty of Educational Sciences students in 2019 and was downloaded from the UC Irvine Machine Learing Repository (https://archive.ics.uci.edu/dataset/856/higher+education+students+performance+evaluation).  

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation: Yilmaz,Nevriye and Şekeroğlu,Boran. (2023). Higher Education Students Performance Evaluation. UCI Machine Learning Repository. https://doi.org/10.24432/C51G82.

<hr>
From the dataset documentation:

Student ID<br/>
1- Student Age (1: 18-21, 2: 22-25, 3: above 26)<br/>
2- Sex (1: female, 2: male)<br/>
3- Graduated high-school type: (1: private, 2: state, 3: other)<br/>
4- Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full)<br/>
5- Additional work: (1: Yes, 2: No)<br/>
6- Regular artistic or sports activity: (1: Yes, 2: No)<br/>
7- Do you have a partner: (1: Yes, 2: No)<br/>
8- Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410)<br/>
9- Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other)<br/>
10- Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other)<br/>
11- Mothersâ€™ education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)<br/>
12- Fathersâ€™ education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)<br/>
13- Number of sisters/brothers (if available): (1: 1, 2:, 2, 3: 3, 4: 4, 5: 5 or above)<br/>
14- Parental status: (1: married, 2: divorced, 3: died - one of them or both)<br/>
15- Mothersâ€™ occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other)<br/>
16- Fathersâ€™ occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other)<br/>
17- Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours)<br/>
18- Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)<br/>
19- Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often)<br/>
20- Attendance to the seminars/conferences related to the department: (1: Yes, 2: No)<br/>
21- Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral)<br/>
22- Attendance to classes (1: always, 2: sometimes, 3: never)<br/>
23- Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable)<br/>
24- Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never)<br/>
25- Taking notes in classes: (1: never, 2: sometimes, 3: always)<br/>
26- Listening in classes: (1: never, 2: sometimes, 3: always)<br/>
27- Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always)<br/>
28- Flip-classroom: (1: not useful, 2: useful, 3: not applicable)<br/>
29- Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)<br/>
30- Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)<br/>
31- Course ID<br/>
32- OUTPUT Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA)<br/>

<hr> The dataset already has its categorical features coded to integers, and no feature has more than 10 values, so One-Hot-Encoding is not needed for this set.

In [40]:
# Import the dependencies.
from pathlib import Path
from sqlalchemy import create_engine
from config import db_password
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
import sklearn as skl
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [2]:
# Create an engine to retrieve the data from PostgreSQL database.
engine = create_engine(f"postgresql://postgres:{db_password}@localhost:5432/student_data_db")

In [11]:
# Retrieve the data and place it into a DataFrame
sql_query = "SELECT * FROM student_data"
student_df = pd.read_sql(sql_query, engine)
student_df.head(10)

# Preprocessing the data

In [13]:
# Drop Student ID column
student_df = student_df.drop('STUDENT ID', axis=1)
student_df.head()

In [16]:
# check number of unique values for each feature
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145 entries, 0 to 144
Data columns (total 32 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   1          145 non-null    int64
 1   2          145 non-null    int64
 2   3          145 non-null    int64
 3   4          145 non-null    int64
 4   5          145 non-null    int64
 5   6          145 non-null    int64
 6   7          145 non-null    int64
 7   8          145 non-null    int64
 8   9          145 non-null    int64
 9   10         145 non-null    int64
 10  11         145 non-null    int64
 11  12         145 non-null    int64
 12  13         145 non-null    int64
 13  14         145 non-null    int64
 14  15         145 non-null    int64
 15  16         145 non-null    int64
 16  17         145 non-null    int64
 17  18         145 non-null    int64
 18  19         145 non-null    int64
 19  20         145 non-null    int64
 20  21         145 non-null    int64
 21  22         145 n

In [15]:
# Replace the "2" with "0" used for "no" and "male" in features 2, 5, 6, 7, 20
mapping_dict = {2:0, 1:1}  # Defines the mapping dictionary
mapped_student_df = student_df
mapped_student_df.loc[:,['2', '5', '6', '7', '20']] = mapped_student_df[['2', '5', '6', '7', '20']].applymap(mapping_dict.get)
mapped_student_df.head(10)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,23,24,25,26,27,28,29,30,COURSE ID,GRADE
0,2,0,3,3,1,0,0,1,1,1,...,1,1,3,2,1,2,1,1,1,1
1,2,0,3,3,1,0,0,1,1,1,...,1,1,3,2,3,2,2,3,1,1
2,2,0,2,3,0,0,0,2,4,2,...,1,1,2,2,1,1,2,2,1,1
3,1,1,1,3,1,0,1,2,1,2,...,1,2,3,2,2,1,3,2,1,1
4,2,0,1,3,0,0,1,3,1,4,...,2,1,2,2,2,1,2,2,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,2,1,2,3,1,1,0,1,1,2,...,1,1,2,1,2,1,3,3,9,5
141,1,1,2,4,0,0,0,1,4,2,...,1,1,3,2,2,1,5,3,9,5
142,1,1,1,4,0,0,0,1,1,1,...,1,1,3,3,2,1,4,3,9,1
143,2,1,2,4,1,1,1,5,2,3,...,2,1,2,1,2,1,5,3,9,4


"Machine learning: one-hot encoding vs integer encoding" by Stéphanie Crêteur,Published in Geek Culture, Dec 16, 2022, https://medium.com/geekculture/machine-learning-one-hot-encoding-vs-integer-encoding-f180eb831cf1 was consulted when deciding which categories made sense to leave integer encoded, and which to ues One-Hot Encoding on.

In [18]:
# Get_dummies for categorical data where numeric values don't make sense
# These are columns 3, 9, 10, 14, 15, 16, 23, 24, 28, COURSE ID
encoded_student_df = pd.get_dummies(mapped_student_df, 
                                     columns=['3', '9', '10', '14', '15', '16', '23', '24', '28', 'COURSE ID'])
encoded_student_df.head()

Unnamed: 0,1,2,4,5,6,7,8,11,12,13,...,28_3,COURSE ID_1,COURSE ID_2,COURSE ID_3,COURSE ID_4,COURSE ID_5,COURSE ID_6,COURSE ID_7,COURSE ID_8,COURSE ID_9
0,2,0,3,1,0,0,1,1,2,3,...,0,1,0,0,0,0,0,0,0,0
1,2,0,3,1,0,0,1,2,3,2,...,0,1,0,0,0,0,0,0,0,0
2,2,0,3,0,0,0,2,2,2,2,...,0,1,0,0,0,0,0,0,0,0
3,1,1,3,1,0,1,2,1,2,5,...,0,1,0,0,0,0,0,0,0,0
4,2,0,3,0,0,1,3,3,3,2,...,0,1,0,0,0,0,0,0,0,0
5,2,0,3,0,0,0,2,3,3,2,...,0,1,0,0,0,0,0,0,0,0
6,1,0,4,0,0,0,1,1,3,1,...,1,1,0,0,0,0,0,0,0,0
7,1,1,3,1,1,1,2,4,3,1,...,0,1,0,0,0,0,0,0,0,0
8,2,1,3,0,1,1,1,2,4,2,...,0,1,0,0,0,0,0,0,0,0
9,2,1,3,0,0,1,3,1,2,3,...,0,1,0,0,0,0,0,0,0,0


The documentation indicates that the values of the rest of the columns are very close, so no need to apply StandardScaler.

## Binomial Logistic Regression
(This is the type covered in the bootcamp simply called "Logistic Regression." There are 2 values for the target variable.)

In [24]:
# In preparation for turning this into a "good grade/bad grade scenario" (C's get degrees?)
# Check to see that there are both "good" (4, 5, 6, 7) and "bad" (0, 1, 2, 3) grades
encoded_student_df['GRADE'].value_counts()


1    35
2    24
3    21
5    17
7    17
6    13
4    10
0     8
Name: GRADE, dtype: int64

In [25]:
# To make this a binary problem, replace "bad" grades with "0", and "good"
# grades with "1"
binary_student_df = encoded_student_df
binary_student_df['GRADE'] = binary_student_df['GRADE'].replace({1:0, 2:0, 3:0, 4:1, 5:1, 6:1, 7:1})
binary_student_df.tail()

Unnamed: 0,1,2,4,5,6,7,8,11,12,13,17,18,19,20,21,22,25,26,27,29,30,GRADE,3_1,3_2,3_3,9_1,9_2,9_3,9_4,10_1,10_2,10_3,10_4,14_1,14_2,14_3,15_1,15_2,15_3,15_4,15_5,16_1,16_2,16_3,16_4,16_5,23_1,23_2,23_3,24_1,24_2,24_3,28_1,28_2,28_3,COURSE ID_1,COURSE ID_2,COURSE ID_3,COURSE ID_4,COURSE ID_5,COURSE ID_6,COURSE ID_7,COURSE ID_8,COURSE ID_9
0,2,0,3,1,0,0,1,1,2,3,3,2,2,1,1,1,3,2,1,1,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0
1,2,0,3,1,0,0,1,2,3,2,2,2,2,1,1,1,3,2,3,2,3,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0
2,2,0,3,0,0,0,2,2,2,2,2,1,2,1,1,1,2,2,1,2,2,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0
3,1,1,3,1,0,1,2,1,2,5,3,1,2,1,1,1,3,2,2,3,2,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0
4,2,0,3,0,0,1,3,3,3,2,2,1,1,1,1,1,2,2,2,2,2,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,2,1,3,1,1,0,1,1,2,2,3,3,2,1,1,1,2,1,2,3,3,1,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1
141,1,1,4,0,0,0,1,1,1,5,3,2,2,0,1,2,3,2,2,5,3,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1
142,1,1,4,0,0,0,1,3,4,4,2,2,2,1,1,1,3,3,2,4,3,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1
143,2,1,4,1,1,1,5,4,4,1,2,2,1,1,1,1,2,1,2,5,3,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1


In [26]:
# Separate Features X from Target y
y = binary_student_df['GRADE']
X = binary_student_df.drop(columns='GRADE')

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y)
X_train.shape

(108, 63)

In [39]:
# Create Logistic Regression Model and check accuracy
classifier = LogisticRegression(solver='lbfgs', max_iter=200, random_state=10)
classifier.fit(X_train, y_train)
target_names = ["Bad Grade", "Good Grade"]
predictions = classifier.predict(X_test)
results_df = pd.DataFrame({'Prediction': predictions, 'Actual': y_test})
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
print(f"Accuracy Score: {accuracy_score(y_test, predictions)}")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions, target_names=target_names))
results_df.tail(10)

Training Data Score: 0.9444444444444444
Testing Data Score: 0.8108108108108109
Accuracy Score: 0.8108108108108109
[[21  1]
 [ 6  9]]
              precision    recall  f1-score   support

   Bad Grade       0.78      0.95      0.86        22
  Good Grade       0.90      0.60      0.72        15

    accuracy                           0.81        37
   macro avg       0.84      0.78      0.79        37
weighted avg       0.83      0.81      0.80        37



Unnamed: 0,Prediction,Actual
70,1,1
128,0,0
79,1,0
11,0,0
117,0,0
105,1,1
64,0,0
13,0,0
10,0,0
125,0,0


# Model Optimization
Let's see if we can do better. Accuracy score is 81%, but several "Good" grades are being predicted as "Bad". If a professor were using this algorithm to grant permission to attempt their course, some students would be needlessly rejected.<br><br>
"Fine-tuning parameters in Logistic Regression" By Saturn Cloud, Monday, July 10, 2023, https://saturncloud.io/blog/finetuning-parameters-in-logistic-regression/ was consulted for ways to optimize this model.

In [27]:
# Create a function to be used for optimization of the model
def BinomialLogRegress(C, class_weight, solver, max_iter, X_train, X_test, y_train, y_test):
    classifier = LogisticRegression(solver='lbfgs', max_iter=maxiters, random_state=10)

    # Fit (train) model using the training data
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    results = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
    acc_score = accuracy_score(y_test, predictions)
    # Determine performance 
    print(results)
    target_names = ["Bad Grade", "Good Grade"]
    print(f"Number of iterations: {maxiters}")
    print(f"Training Data Score: {classifier.score(X_train, y_train)}")
    print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
    print(f"Accuracy Score: {acc_score}")
    print(classification_report(y_test, predictions, target_names=target_names))

In [29]:
# See if precision for "Bad Grade" and recall for "Good Grade" improves with more iterations & where it stops improving
summary_column_labels = ['C', 'Class Weight', 'Solver', 'Max Iterations', 'Accuracy']
C_vals = [0.0, 0.25, 0.5, 0.75, 1.0]
class_weights = ['None', 'balanced']
solvers = ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']
max_iters = [100, 150, 200, 300]
for C in C_vals:
    for weight in class_weights:
        for solver in solvers:
            for iter in max_iters:
                BinomialLogRegress(C, weight, solver, iter, X_train, X_test, y_train, y_test)

     Prediction  Actual
37            0       0
0             0       0
17            0       0
15            0       0
45            0       0
40            0       0
85            1       1
89            1       1
30            0       1
27            0       0
115           0       0
93            0       1
126           0       0
60            0       0
77            0       1
121           0       0
96            1       1
101           0       1
33            0       0
74            1       1
57            1       1
32            0       0
80            0       1
6             0       1
18            0       0
107           1       1
72            1       1
70            1       1
128           0       0
79            1       0
11            0       0
117           0       0
105           1       1
64            0       0
13            0       0
10            0       0
125           0       0
Number of iterations: 100
Training Data Score: 0.9444444444444444
Testing Data Score: 0.

KeyError: "None of [Int64Index([  0,   1,   3,   4,   7,   8,   9,  10,  11,  12,\n            ...\n            131, 132, 133, 135, 137, 138, 139, 140, 141, 143],\n           dtype='int64', length=116)] are in the [columns]"

# Try Multinomial Logistic Regression

So, that's nice. All of the students who recieved "bad" grades were accurately flagged, along with some students who actually received good grades. So, services could be offered to the flagged students. It won't hurt those who don't need the services, but no one who needs them is omitted. Nice. <br><hr>
<strong>But,</strong> the original dataset had the grades subdivided into more "normal" categories, roughly corresponding to letter grades. To classify this, we need a Multinomial Logistic Regression.<br>
<br>
"Changing logistic regression from binomial to multinomial probability requires a change to the loss function used to train the model (e.g. log loss to cross-entropy loss), and a change to the output from a single probability value to one probability for each class label." by Jason Brownlee on January 1, 2021 in Python Machine Learning, https://machinelearningmastery.com/multinomial-logistic-regression-with-python/<br>
<br>The Multinomial Logistic Regression Model used here is shown on the above referenced website.

In [None]:
multi_student_df=student_df = pd.read_sql(sql_query, engine)
multi_student_df=multi_student_df.drop("STUDENT ID", axis=1)
multi_student_df

In [None]:
# Separate Features X from Target y using the multi_student_df
y = multi_student_df['GRADE']
X = multi_student_df.drop(columns='GRADE')

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y)
X_train.shape

In [None]:
# Create a Multinomial Logistic Regression Model
classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=10)

# Fit (train) model using the training data
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})


In [None]:
# Determine performance
target_names = ["0","1", "2", "3", "4", "5", "6", "7"]
print(classification_report(y_test, predictions, target_names=target_names))

In [None]:
# Define a function to use to test different max_iterations
def LogRegress(maxiters, X_train, X_test, y_train, y_test):
    classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=maxiters, random_state=10)

    # Fit (train) model using the training data
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    pd.DataFrame({"Prediction": predictions, "Actual": y_test})
    
    # Determine performance 
    target_names = ["0","1", "2", "3", "4", "5", "6", "7"]
    print(f"Number of iterations: {maxiters}")
    print(classification_report(y_test, predictions, target_names=target_names))

In [None]:
# See if accuracy improves with more iterations & where it stops improving
test_iters = [200, 300, 400, 500, 600, 700, 800, 900, 1000]
for iter in test_iters:
    LogRegress(iter, X_train, X_test, y_train, y_test)

##### Even with 1000 iterations, the max Accuracy achieved was 24% and that was with 300 iterations. After this point, Accuracy decreased with more iterations.
Next try tuning the penalty

In [None]:
# Create a list of models using max_iters=300 and various values of "C" which
# is the penalty weight
def make_models():
    multi_log_reg_models = dict()
    for penalty in [0.0, 0.0001, 0.001, 0.01, 0.1, 1.0]:
        # create model name
        key = '%.4f' % penalty
        # set penalty to "none" when penalty = 0.0
        if penalty==0.0:
            multi_log_reg_models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=300, random_state=10, penalty='none')
        else:
            multi_log_reg_models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=300, random_state=10, penalty='l2', C=penalty)
    return multi_log_reg_models

In [None]:
# Evaluate model
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    return score

In [None]:
models = make_models()
results, names = list(), list()
for name, model in models.items():
    score = evaluate_model(model, X_train,X_test, y_train, y_test)
    results.append(score)  # store the accuracy for the model
    names.append(name)      # store the model name
    print(f"{name} has an accuracy of {score}")

The warning says to scale the data, so let's do that.

In [None]:
# Create the StandardScaler instance
scaler = StandardScaler()
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)
# Scale the training data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [None]:
# Create a Multinomial Logistic Regression Model
classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=300, random_state=10, penalty='l2', C=1.0)

# Fit (train) model using the training data
classifier.fit(X_train_scaled, y_train)

predictions = classifier.predict(X_test_scaled)

accuracy_score(y_test, predictions)

The accuracy score for the Multinomial Logistic Regression for this dataset is an abysmal 24% most likely because the dataset does not have enough observations for the number of features. There is a standard in ML of min observations = 10 * features, so this dataset needs to be double the size. Perhaps we can remove some of the features. Also, there may be features producing "noise". Next step, have a look at how each of the features correlates to the target "GRADE". An explanation of how to do this is here: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e. <br>
To visualize feature correlation to the target, produce correlation matrix with heatmap.

In [None]:
multi_student_df

In [None]:
# Select the data
X = multi_student_df.iloc[:,0:31]  #independent columns
y = multi_student_df.iloc[:,31]    #target column i.e GRADE

#get correlations of each feature in the dataset
correlation_matrix = multi_student_df.corr()
top_corr_features = correlation_matrix.index
plt.figure(figsize=(20,20))

#plot heat map
g=sns.heatmap(multi_student_df[top_corr_features].corr(), annot=True, cmap="Pastel1")


In [None]:
# get the correlation values to determine which features to use in the model
correlations = correlation_matrix.iloc[-1,:]

# take the absolute values of the correlations and sort them in descending order
abs_corr = correlations.abs()
sorted_abs_corr = abs_corr.sort_values(ascending=False)
print(f"Absolute values of correlations with GRADE: {sorted_abs_corr}")

In [None]:
# Extract the column names for the features with 10 highest absolute values of correlations to use for the model
top_corr_features = sorted_abs_corr.index[0:15]
print(top_corr_features)

In [None]:
reduced_multi_student_df = multi_student_df[top_corr_features]
reduced_multi_student_df

In [None]:
# Separate Features X from Target y using the reduced_multi_student_df
y = reduced_multi_student_df['GRADE']
X = reduced_multi_student_df.drop(columns='GRADE')

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y)
X_train.shape

In [None]:
# Create a Multinomial Logistic Regression Model
classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=400, random_state=10, penalty='l2', C=1.0)

# Fit (train) model using the training data
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

accuracy_score(y_test, predictions)

Accuracy is now 27%, and the best that I can understand from https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression , the solver and penalty are optimized as well as the number of max_iters. Next try the top 10 correlated features.

In [None]:
top_10_corr_features = sorted_abs_corr.index[0:11]
print(top_10_corr_features)

In [None]:
ten_multi_student_df = multi_student_df[top_10_corr_features]
ten_multi_student_df

In [None]:
# Separate Features X from Target y using the reduced_multi_student_df
y = ten_multi_student_df['GRADE']
X = ten_multi_student_df.drop(columns='GRADE')

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y)
X_train.shape

In [None]:
# Create a Multinomial Logistic Regression Model
classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=400, random_state=10, penalty='l2', C=1.0)

# Fit (train) model using the training data
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

accuracy_score(y_test, predictions)

# Try Decision Tree

In [None]:
# The data is already split into taining and testing datasets
# Create decision tree classifier instance
decision_tree_model = tree.DecisionTreeClassifier()

# Fitting the model
decision_tree_model = decision_tree_model.fit(X_train_scaled, y_train)

# Make predictions
decision_tree_predictions = decision_tree_model.predict(X_test_scaled)

In [None]:
# Calculate and display the classification report
print("Classification Report")
print(classification_report(y_test, decision_tree_predictions))

# Try Random Forest

In [None]:
# Create and fit Random Forest classifier
random_forest_model = RandomForestClassifier(n_estimators=3, random_state=10)
random_forest_model = random_forest_model.fit(X_train_scaled, y_train)

# Make Predictions
random_forest_predictions = random_forest_model.predict(X_test_scaled)

# Display Classification Report
print("Classification Report")
print(classification_report(y_test, random_forest_predictions))


# Try Neural Network

In [None]:
# Create a Keras Sequential model
keras_nn_model = tf.keras.models.Sequential()

# Add an input Layer
keras_nn_model.add(tf.keras.layers.Dense(units=8, activation="relu", input_dim=107))

# Add the output layer
keras_nn_model.add(tf.keras.layers.Dense(units=1, activation="softmax"))

# Check the structure of the Sequential model
keras_nn_model.summary()

In [None]:
# Compile the Sequential model and customize metrics
keras_nn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Fit the model
fit_keras_nn_model = keras_nn_model.fit(X_train_scaled, y_train, epochs=100)

In [None]:
# Evaluate the model using the test data
model_loss, model_accuracy = keras_nn_model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

The accuracy is very low. Suspect there may not be enough data points for the number of features. Also, there may be features producing "noise". Next step, have a look at how each of the features correlates to the target "GRADE". An explanation of how to do this is here: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e.<br>
To visualize feature correlation to the target, produce correlation matrix with heatmap.

In [None]:
student_df

In [None]:
# Select the data
X = student_df.iloc[:,0:31]  #independent columns
y = student_df.iloc[:,31]    #target column i.e GRADE

#get correlations of each feature in the dataset
correlation_matrix = student_df.corr()
top_corr_features = correlation_matrix.index
plt.figure(figsize=(20,20))

#plot heat map
g=sns.heatmap(student_df[top_corr_features].corr(), annot=True, cmap="RdYlGn")


In [None]:
# get the correlation values to determine which features to use in the model
correlations = correlation_matrix.iloc[-1,:]

# take the absolute values of the correlations and sort them in descending order
abs_corr = correlations.abs()
sorted_abs_corr = abs_corr.sort_values(ascending=False)
print(f"Absolute values of correlations with GRADE: {sorted_abs_corr}")

In [None]:
# Extract the column names for the features with 10 highest absolute values of correlations to use for the model
top_corr_features = sorted_abs_corr.index[0:15]
print(top_corr_features)

# 2nd Logistic Regression Model

Following the rule of 10 and using features that are more highly correlated than the rest.<br>
Also, after reading this article: https://medium.com/geekculture/machine-learning-one-hot-encoding-vs-integer-encoding-f180eb831cf1 , I will try not using OneHotEncoding as the data is already integer encoded. If the results are sub par, then use OneHotEncoding.

In [None]:
reduced_student_df = student_df[top_corr_features]
reduced_student_df

In [None]:
# Upon refering to the documentation for the dataset:
# binary features: 2, 20, 5 , reassign the value '2' to '0' for these features 
# Features to OneHotEncode because no ordinal relationship between the values: 21, 9, COURSE ID, 28, 3
mapping_dict = {2:0, 1:1}  # Defines the mapping dictionary
reduced_student_df.loc[:,['2', '20', '5']] = reduced_student_df[['2', '20', '5']].applymap(mapping_dict.get)
reduced_student_df

In [None]:
enc2 = OneHotEncoder(sparse_output=False)
categorical2 = ['21', '9', 'COURSE ID', '28', '3']

# Fit and transfor OneHotEncoder
encode_reduced_student_df = pd.DataFrame(enc2.fit_transform(reduced_student_df[categorical2]))

# Add the encoded features to the dataframe
encode_reduced_student_df.columns = enc2.get_feature_names_out(categorical2)
encode_reduced_student_df.head()

In [None]:
# Merge OneHotEncoder features and drop the originals
reduced_student_encode_df = reduced_student_df
reduced_student_encode_df = reduced_student_encode_df.merge(encode_reduced_student_df, left_index=True, right_index=True)
reduced_student_encode_df = reduced_student_encode_df.drop(columns=categorical2)
reduced_student_encode_df.head()

In [None]:
# Split preprocessed reduced data into features and target arrays
y = reduced_student_encode_df['GRADE'].values
X = reduced_student_encode_df.drop("GRADE", axis=1).values

# Split data into testing and training datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=10)

In [None]:
# Data values are very similar, StandardScaler is not needed
# Create a Logistic Regression Model
classifier = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=1)

# Fit (train) model using the training data
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test}).tail(10)

In [None]:
# Determine performance
target_names = ["0","1", "2", "3", "4", "5", "6", "7"]
print(classification_report(y_test, predictions, target_names=target_names))

The accuracy went from 38% to 24%. <br>
## Decision Tree Model

In [None]:
# The data is already split into taining and testing datasets
# Create decision tree classifier instance
decision_tree_model = tree.DecisionTreeClassifier()

# Fitting the model
decision_tree_model = decision_tree_model.fit(X_train, y_train)

# Make predictions
decision_tree_predictions = decision_tree_model.predict(X_test)

In [None]:
# Calculate and display the classification report
print("Classification Report")
print(classification_report(y_test, decision_tree_predictions))

In [None]:
y_test

In [None]:
decision_tree_predictions