<div align="center">

# 🧠 Machine Learning Models

###  **Artin Tavasoli** 👋🏻 
📘 **Student ID:** `810102543`

</div>


<div align="left">

#### 📚 Import Needed Libraries

</div>


In [361]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder

<div align="left">

#### ⏳ Load Dataset

</div>


In [362]:
grades_df = pd.read_csv('AI-Course-Grades.csv')

<div align="left">

# 📤 Data Preprocessing

</div>


<div align="left">

### 🧩 Handling Missing Values

#### 👻 Handle Absent Data

</div>


In [363]:
print(grades_df.isnull().any())

university           False
sex                  False
age                  False
address              False
motherEducation      False
fatherEducation      False
motherJob            False
fatherJob            False
reason               False
travelTime           False
studyTime            False
failures             False
universitySupport    False
paid                 False
higher               False
internet             False
romantic             False
freeTime             False
goOut                False
Dalc                 False
Walc                 False
absences             False
EPSGrade             False
DSGrade              False
finalGrade           False
dtype: bool


<div align="left">

#### ❗ Handle Unassigned Values
There are several strategies for dealing with missing data, each suited to different circumstances:

1. Remove Rows with Missing Values: This involves discarding data points (rows) that have missing entries. While simple, it's often impractical if many data points have missing features or if those points contain otherwise valuable information.

2. Remove Columns with Missing Values: This approach drops the entire feature (column) when only a small portion of the dataset contains values for it. It's useful when the missing data renders the feature unreliable or irrelevant.

3. Impute with Statistical Measures: Missing values can be filled using statistical metrics such as the mean, median, or mode of the corresponding column. This method is simple and maintains the dataset's size, though it may introduce bias.

4. Predict Missing Values Using Machine Learning: A supervised learning model, such as a classification or k-nearest neighbors (KNN) model, can predict missing values using patterns in the existing data. This method is more sophisticated and potentially more accurate.

5. Manual Imputation Based on Domain Knowledge: In some cases, domain experts can analyze the data and assign missing values based on contextual understanding or existing labels.


</div>
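As a rough illustration of the first three strategies, the sketch below applies them to a small made-up DataFrame. The name `toy_df`, its columns, and the 50% cut-off are assumptions for demonstration only; none of them come from `AI-Course-Grades.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from AI-Course-Grades.csv.
toy_df = pd.DataFrame({
    "age": [18, 17, np.nan, 16],
    "reason": ["course", None, "home", "course"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Strategy 1: drop every row that has at least one missing value.
rows_dropped = toy_df.dropna(axis=0)

# Strategy 2: drop columns where more than half of the values are missing
# (the 50% threshold is an arbitrary choice for this sketch).
missing_share = toy_df.isnull().mean()
cols_dropped = toy_df.loc[:, missing_share <= 0.5]

# Strategy 3: impute with the median (numeric) or the mode (categorical).
imputed = cols_dropped.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["reason"] = imputed["reason"].fillna(imputed["reason"].mode().iloc[0])

print(rows_dropped)
print(imputed)
```

Dropping rows keeps only one of the four toy rows, which shows why that strategy can be impractical when gaps are spread across many data points.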


In [364]:
print(f"number of unassigned values in {grades_df.shape[0]} data points")

print((grades_df == 'other').sum())

number of unassigned values in 397 data points
university             0
sex                    0
age                    0
address                0
motherEducation        0
fatherEducation        0
motherJob            142
fatherJob            217
reason                37
travelTime             0
studyTime              0
failures               0
universitySupport      0
paid                   0
higher                 0
internet               0
romantic               0
freeTime               0
goOut                  0
Dalc                   0
Walc                   0
absences               0
EPSGrade               0
DSGrade                0
finalGrade             0
dtype: int64



<span style="font-size:24px; font-weight:bold; color:pink;">About 36% of motherJob values and 55% of fatherJob values are unassigned ('other'), so removing these two features is the most reasonable option here.</span>
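The percentages can be checked from the counts printed above (142, 217, and 37 out of 397 rows). The sketch below redoes that arithmetic; the 30% cut-off used to pick columns for removal is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Counts of 'other' per column, copied from the output above.
other_counts = pd.Series({"motherJob": 142, "fatherJob": 217, "reason": 37})
n_rows = 397  # number of data points in the dataset

# Share of unassigned values per column, in percent.
other_share = (other_counts / n_rows * 100).round(1)
print(other_share)

# Hypothetical rule: drop a feature when more than 30% of it is unassigned.
to_drop = other_share[other_share > 30].index.tolist()
print(to_drop)
```

Under that assumed threshold, `reason` (about 9% unassigned) is kept and repaired instead of being dropped.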


In [365]:
del grades_df['motherJob']
del grades_df['fatherJob']
print((grades_df == 'other').sum())

university            0
sex                   0
age                   0
address               0
motherEducation       0
fatherEducation       0
reason               37
travelTime            0
studyTime             0
failures              0
universitySupport     0
paid                  0
higher                0
internet              0
romantic              0
freeTime              0
goOut                 0
Dalc                  0
Walc                  0
absences              0
EPSGrade              0
DSGrade               0
finalGrade            0
dtype: int64


<span style="font-size:24px; font-weight:bold; color:pink;">Assign values to the "reason" feature: </span>

If a student's travel time is less than 15 minutes (encoded as `travelTime == 1` in the code), the reason was probably how close the university is to their home, so set the reason to 'home'.

Split the data points into two groups based on their university (CM or PR) and create one pandas DataFrame for each.

    For the remaining unassigned reasons in each newly created DataFrame:

        if 'reputation' is a more common reason than 'course'

            set the 'other' reasons to 'reputation', because this suggests the university is a very reputable one and the other students probably chose it for the same reason

        else

            set the 'other' reasons to 'course'


In [366]:
def assign_reasons(df):
    """Replace 'other' in the reason column of one university's DataFrame."""
    # Rule 1: a travel time under 15 minutes (travelTime == 1) suggests the
    # university was chosen because it is close to home.
    near_home = (df['reason'] == 'other') & (df['travelTime'] == 1)
    df.loc[near_home, 'reason'] = 'home'

    # Rule 2: fill the remaining 'other' entries with whichever of
    # 'reputation' and 'course' is more common in this DataFrame.
    mask = df['reason'] == 'other'
    reason_counts = df['reason'].value_counts()
    # .get(..., 0) avoids comparing None when a reason never occurs.
    if reason_counts.get('reputation', 0) > reason_counts.get('course', 0):
        df.loc[mask, 'reason'] = 'reputation'
    else:
        df.loc[mask, 'reason'] = 'course'
    return df

df_cm = grades_df[grades_df['university'] == 'CM'].copy()
df_pr = grades_df[grades_df['university'] == 'PR'].copy()
df_cm = assign_reasons(df_cm)
df_pr = assign_reasons(df_pr)

print("CM head:\n", df_cm[['university','travelTime','reason']].head(), "\n")
print("PR head:\n", df_pr[['university','travelTime','reason']].head())
df_cm.to_csv('df_cm.csv', index=False)
grades_df = pd.concat([df_pr, df_cm], ignore_index=True)

CM head:
     university  travelTime  reason
349         CM           2  course
350         CM           3    home
351         CM           2  course
352         CM           1  course
353         CM           3    home 

PR head:
   university  travelTime  reason
0         PR           2  course
1         PR           1  course
2         PR           1    home
3         PR           1    home
4         PR           1    home


Clean the data.

Absences range from 0 to 93; classify them into 4 groups:

0 : 0-2

1 : 3-5

2 : 6-9

3 : 10 or more

Classify EPSGrade, DSGrade, and finalGrade into 4 groups:

0 : 10 or lower

1 : above 10, up to 14

2 : above 14, up to 17

3 : above 17
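The same grouping can also be expressed with `pd.cut`, which bins a whole column at once. This is only a sketch of an alternative, assuming right-inclusive bin edges; the sample values in `sample_absences` and `sample_grades` are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values chosen to hit every bin boundary.
sample_absences = pd.Series([0, 2, 3, 5, 6, 9, 10, 93])
sample_grades = pd.Series([0, 10, 11, 14, 15, 17, 18, 20])

# Right-inclusive bins: (-inf, 2], (2, 5], (5, 9], (9, inf)
absence_groups = pd.cut(
    sample_absences, bins=[-np.inf, 2, 5, 9, np.inf], labels=False
)

# Right-inclusive bins: (-inf, 10], (10, 14], (14, 17], (17, inf)
grade_groups = pd.cut(
    sample_grades, bins=[-np.inf, 10, 14, 17, np.inf], labels=False
)

print(absence_groups.tolist())
print(grade_groups.tolist())
```

Passing `labels=False` returns the integer bin index directly, so the result uses the same 0 to 3 group numbers as the list above.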


In [367]:
def classify_absences(number_of_absences):
    if number_of_absences <= 2:
        return 0
    elif number_of_absences <= 5:
        return 1
    elif number_of_absences <= 9:
        return 2
    else:
        return 3

def classify_grade(grade):
    if grade <= 10:
        return 0
    elif grade <= 14:
        return 1
    elif grade <= 17:
        return 2
    else:
        return 3

grades_df['absences'] = grades_df['absences'].apply(classify_absences)
grades_df['EPSGrade'] = grades_df['EPSGrade'].apply(classify_grade)
grades_df['DSGrade'] = grades_df['DSGrade'].apply(classify_grade)
grades_df['finalGrade'] = grades_df['finalGrade'].apply(classify_grade)

print(grades_df.head())

  university sex  age address  motherEducation  fatherEducation  reason  \
0         PR   F   18       U                4                4  course   
1         PR   F   17       U                1                1  course   
2         PR   F   15       U                1                1    home   
3         PR   F   15       U                4                2    home   
4         PR   F   16       U                3                3    home   

   travelTime  studyTime  failures  ... internet romantic freeTime goOut Dalc  \
0           2          2         0  ...       no       no        3     4    1   
1           1          2         0  ...      yes       no        3     3    1   
2           1          2         3  ...      yes       no        3     2    2   
3           1          3         0  ...      yes      yes        2     2    1   
4           1          2         0  ...       no       no        3     2    1   

   Walc  absences  EPSGrade  DSGrade  finalGrade  
0     1    

Use label encoding to convert the categorical (text) columns into integer codes.
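As a small illustration of how category codes are assigned, the sketch below encodes a made-up `university` column. Codes follow the sorted order of the category labels, and the mapping back to the original labels can be recovered from `cat.categories`. The series `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical column with the two university labels used in the dataset.
toy = pd.Series(["PR", "CM", "PR", "CM"], name="university")

as_category = toy.astype("category")

# Integer codes follow the sorted category order: CM -> 0, PR -> 1.
codes = as_category.cat.codes

# Keep the code-to-label mapping so encoded values stay interpretable.
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```

Saving such a mapping before overwriting a column makes it possible to read the encoded values later, since the integer codes alone do not carry the original labels.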

In [368]:
# Collect the text and categorical columns.
cat_cols = grades_df.select_dtypes(include=['object','category']).columns.tolist()

# Convert each one to a categorical dtype and replace it with its integer codes.
for col in cat_cols:
    grades_df[col] = grades_df[col].astype('category').cat.codes

print(grades_df.head())


   university  sex  age  address  motherEducation  fatherEducation  reason  \
0           1    0   18        1                4                4       0   
1           1    0   17        1                1                1       0   
2           1    0   15        1                1                1       1   
3           1    0   15        1                4                2       1   
4           1    0   16        1                3                3       1   

   travelTime  studyTime  failures  ...  internet  romantic  freeTime  goOut  \
0           2          2         0  ...         0         0         3      4   
1           1          2         0  ...         1         0         3      3   
2           1          2         3  ...         1         0         3      2   
3           1          3         0  ...         1         1         2      2   
4           1          2         0  ...         0         0         3      2   

   Dalc  Walc  absences  EPSGrade  DSGrade  finalG