<h1 align=center style="line-height:200%;color:#0099cc">
Student GPA Prediction
</h1>

<p style="text-align: justify; line-height:200%; font-size:medium">
In this question, the goal is to estimate students' Grade Point Average (GPA) from a dataset of student information. After preprocessing the data, you must carry out feature engineering and build an appropriate model. Only your model will be evaluated in the end, but the better your preprocessing and feature engineering, the better the model you will ultimately achieve.

</p>


<h2 style="line-height:200%;color:#0099cc">
Dataset Introduction
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    In the initial file for this question, there is a folder named <code>data</code>.
    This folder includes two files named <code>train.csv</code> and <code>test.csv</code>, which are the training and test datasets, respectively.
    The training dataset for this question includes 1913 rows and 14 columns, and
        the test dataset has 479 rows and is missing only the <code>GPA</code> column.
    The descriptions of the columns are as follows:
</p>

<center>
<div style="line-height:200%;font-size:medium">

|      <b>Feature Name</b>       |                            <b>Feature Description</b>                             |
| :----------------------------: | :-------------------------------------------------------------------------------: |
|     <code>StudentID</code>     |                                    Student ID                                     |
|        <code>Age</code>        |                                   Student's age                                   |
|      <code>Gender</code>       |                       Gender, 0 for males and 1 for females                       |
|     <code>Ethnicity</code>     |                                Student's ethnicity                                |
| <code>ParentalEducation</code> |                     Education level of the student's parents                      |
|  <code>StudyTimeWeekly</code>  |                       Weekly study hours from 0 to 20 hours                       |
|     <code>Absences</code>      |           Number of student absences in one academic year from 0 to 30            |
|     <code>Tutoring</code>      |               Tutoring status, 0 indicating no and 1 indicating yes               |
|  <code>ParentalSupport</code>  |                     Level of parental support for the student                     |
|  <code>Extracurricular</code>  | Participation in extracurricular activities, 0 indicating no and 1 indicating yes |
|      <code>Sports</code>       |      Participation in sports programs, 0 indicating no and 1 indicating yes       |
|       <code>Music</code>       |       Participation in music programs, 0 indicating no and 1 indicating yes       |
|   <code>Volunteering</code>    |   Participation in volunteering programs, 0 indicating no and 1 indicating yes    |
|        <code>GPA</code>        |                 Grade Point Average in the range of zero to four                  |

<p style="text-align: justify; line-height:500%; font-size:medium">
<font color="red" size=3><b>Note:</b></font>
<font size=3>
The evaluation data may contain missing values (NaN).
</font>
</p>
</div>
</center>
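
<p style="text-align: justify; line-height:200%; font-size:medium">
    Since the evaluation data may contain missing values, it can help to count them per column before choosing an imputation strategy. The snippet below is a minimal sketch on a made-up toy frame (<code>toy_df</code> and its values are assumptions, not part of the dataset); the same two lines could be applied to the real dataframes once they are loaded.
</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [15, np.nan, 17, 16],
    "StudyTimeWeekly": [4.2, 10.5, np.nan, np.nan],
    "Tutoring": [0, 1, 1, 0],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy_df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```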


<h2 style="line-height:200%;color:#0099cc">
Reading the Dataset
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    First, read the dataset files. The training samples are saved in the <code>train.csv</code> file, and the test samples, for which you must predict the value of the target variable, are saved in the <code>test.csv</code> file. If you find it useful, you can set aside a portion of the training set as a validation set.
</p>


In [154]:
import pandas as pd
import numpy as np

In [155]:
from pathlib import Path

# Reading the dataset files
MAIN_DATA_DIR = Path('../data')
train_df = pd.read_csv(MAIN_DATA_DIR / 'train.csv')
test_df = pd.read_csv(MAIN_DATA_DIR / 'test.csv')

TARGET = 'GPA'
X = train_df.drop(columns=[TARGET])
y = train_df[TARGET]

# Print the shape of the training and test datasets 
print(f'Train dataset shape: {train_df.shape}\nTest dataset shape: {test_df.shape}')


Train dataset shape: (1913, 14)
Test dataset shape: (479, 13)


<h2 style="line-height:200%;color:#0099cc">
Preprocessing and Feature Engineering
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
        In this question, you are free to use any preprocessing/feature engineering technique of your choice.
    <br>
    The techniques you use will <b>not</b> be directly evaluated by the judging system. They do, however, affect your model's accuracy: the more your preprocessing/feature engineering improves the model, the higher the score you will achieve for this question.

</p>
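
<p style="text-align: justify; line-height:200%; font-size:medium">
    As one possible illustration of feature engineering, the sketch below derives two extra columns from columns listed in the dataset table. The feature names <code>ActivityCount</code> and <code>StudyPerAbsence</code> and the helper <code>add_engineered_features</code> are hypothetical, the toy values are made up, and the code assumes the four activity columns are numeric 0/1 as described in the table. Whether such features actually help should be checked on a validation set.
</p>

```python
import pandas as pd

# Toy rows using column names from the dataset table; values are made up.
toy_df = pd.DataFrame({
    "Extracurricular": [0, 1, 1],
    "Sports": [1, 1, 0],
    "Music": [0, 1, 0],
    "Volunteering": [0, 1, 0],
    "StudyTimeWeekly": [5.0, 12.0, 0.0],
    "Absences": [3, 0, 30],
})

activity_columns = ["Extracurricular", "Sports", "Music", "Volunteering"]


def add_engineered_features(df):
    """Return a copy of df with two hypothetical derived features."""
    out = df.copy()
    # Number of activities the student takes part in (0 to 4).
    out["ActivityCount"] = out[activity_columns].sum(axis=1)
    # Weekly study hours per absence; the +1 avoids division by zero.
    out["StudyPerAbsence"] = out["StudyTimeWeekly"] / (out["Absences"] + 1)
    return out


engineered = add_engineered_features(toy_df)
print(engineered[["ActivityCount", "StudyPerAbsence"]])
```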


In [156]:
# Preprocessing step
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Exclude the identifier column: StudentID carries no predictive information
feature_columns = [c for c in X.columns if c != 'StudentID']

# Separate numeric and categorical features for better preprocessing
numeric_features = X[feature_columns].select_dtypes(include=['number']).columns.tolist()
categorical_features = [c for c in feature_columns if c not in numeric_features]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'
)

print(f'Numeric features: {len(numeric_features)} \nCategorical features: {len(categorical_features)}')

Numeric features: 9 
Categorical features: 3


<h2 style="line-height:200%;color:#0099cc">
Model Training
</h2>

<p style="text-align: justify;line-height:200%;font-size:medium">
    Now that you have cleaned the data and perhaps added or removed features, it is time to train a model that can predict the target variable for this problem.
</p>


In [157]:
# Model design
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Hold-out validation split based on 20% of the training data for evaluation
X_train, X_valid, y_train, y_valid = train_test_split(
    X[feature_columns], y, test_size=0.2, random_state=42
)

# Build pipeline with preprocessor and random forest regressor
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('rf', RandomForestRegressor(
        n_estimators=500,
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        n_jobs=-1,
        random_state=42
    ))
])

model.fit(X_train, y_train)
print('Model trained on training split.')

Model trained on training split.
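
<p style="text-align: justify; line-height:200%; font-size:medium">
    A single hold-out split yields one estimate that depends on how the rows happen to be divided. K-fold cross-validation is one possible alternative that averages over several splits. The sketch below runs on synthetic data (<code>X_demo</code>, <code>y_demo</code>, and <code>demo_model</code> are illustrative names, and the small forest size is chosen only to keep the demo fast); the same call should also accept a full preprocessing pipeline as the estimator.
</p>

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the student features (illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validation scored with R2, the metric used in this question.
cv_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="r2")
print(f"R2 per fold: {cv_scores.round(3)}")
print(f"Mean R2: {cv_scores.mean():.3f}")
```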


<h2 style="line-height:200%;color:#0099cc">
Evaluation Metric
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    The metric we have chosen to evaluate the model's performance is called <code>r2_score</code>.
    <br>
    This metric is the quality assessment measure for your model. In other words, the judging system also uses this exact metric for scoring.
    <br>
    It is suggested that you evaluate your model's performance on the training or validation set based on this metric.
</p>
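
<p style="text-align: justify; line-height:200%; font-size:medium">
    For intuition, the coefficient of determination can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The pure-Python sketch below (the helper <code>r2_manual</code> is illustrative and assumes the true values are not all identical) shows that perfect predictions score 1, while always predicting the mean scores 0.
</p>

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true_demo = [1.0, 2.0, 3.0, 4.0]
perfect = r2_manual(y_true_demo, [1.0, 2.0, 3.0, 4.0])
baseline = r2_manual(y_true_demo, [2.5, 2.5, 2.5, 2.5])
print(perfect, baseline)  # 1.0 0.0
```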

<p style="text-align: justify; line-height:200%; font-size:medium">
<b style="color:red;">Note:</b>
    To receive a score for this question, your model's <code>r2_score</code> must be greater than the threshold of 0.4.
    If your model's <code>r2_score</code> is less than 0.4, your score will be 
    <b>zero</b>;
    otherwise, it will be calculated with the following formula:
</p>


In [158]:

# evaluate your model
from sklearn.metrics import r2_score

y_valid_pred = model.predict(X_valid)
r2 = r2_score(y_valid, y_valid_pred)
print(f'Validation R2: {r2:.4f}')

# Ensure threshold guidance is visible
if r2 < 0.4:
    print('Warning: R2 below 0.4 threshold. Consider tuning or trying another model.')

Validation R2: 0.9233


<h2 style="line-height:200%;color:#0099cc">
Prediction on Test Data and Output
</h2>

<p style="text-align: justify;line-height:200%;font-size:medium">
    Save your model's predictions on the test data in a pandas <code>DataFrame</code> with the following format.
</p>

<p style="text-align: justify;line-height:200%;font-size:medium">
    Note that the dataframe name must be <code>submission</code>; otherwise, the judging system will not be able to evaluate your output.
    This dataframe contains only 1 column named <code>GPA</code> and has 479 rows.
    <br>
    For each row in the test dataset, you must have one predicted value.
    For example, the table below shows the first 5 rows of the <code>submission</code> dataframe. However, these numbers are hypothetical, and the numbers in the <code>GPA</code> column in your answer may be different.
</p>

<center>
<div style="line-height:200%;font-size:medium">
    
||<code>GPA</code>|
|:----:|:-----:|
|0|2.6765|
|1|3.9865434|
|2|1.0323434|
|3|0.0434253|
|4|2.060680|

</div>
</center>
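
<p style="text-align: justify;line-height:200%;font-size:medium">
    Before generating the output file, a quick sanity check of the format described above may catch mistakes early. The helper <code>check_submission</code> below is a hypothetical sketch, not part of the judging system; it is demonstrated on a three-row toy frame, and for the real test set the expected row count would be 479.
</p>

```python
import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(df.columns) != ["GPA"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if "GPA" in df.columns and df["GPA"].isna().any():
        problems.append("GPA column contains NaN")
    return problems


# Toy example with made-up values.
toy_submission = pd.DataFrame({"GPA": [2.68, 3.99, 1.03]})
print(check_submission(toy_submission, expected_rows=3))
```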


In [159]:
# Refit an unfitted copy of the same pipeline on the full training data.
# clone() keeps the hyperparameters but does not share fitted state with `model`.
from sklearn.base import clone

final_model = clone(model)

final_model.fit(X[feature_columns], y)

# Predict on test data
test_pred = final_model.predict(test_df[feature_columns])
submission = pd.DataFrame({'GPA': test_pred})
print("-"*10)
print(submission.head())
print('\nSubmission shape:', submission.shape)
print("-"*10)


----------
        GPA
0  1.265554
1  3.031956
2  1.824635
3  3.447805
4  0.514094

Submission shape: (479, 1)
----------


<h2 style="line-height:200%;color:#0099cc">
<b>Submission File Generator Cell</b>
</h2>

<p style="text-align: justify; line-height:200%;font-size:medium">
    Run the cell below to create the <code>result.zip</code> file. Note that you must save the changes made in the notebook (<code>ctrl+s</code>) before running the cell below, otherwise, your score will change to zero at the end of the contest.
    <br>
    Also, if you are using Colab to run this notebook file, download the latest version of your notebook and include it in the submission file before sending <code>result.zip</code>.
</p>


In [160]:
import zipfile
import os

if not os.path.exists(os.path.join(os.getcwd(), 'Student_GPA.ipynb')):
    # The %notebook magic is intended to export the current notebook content.
    # It is only available inside IPython/Jupyter, so fall back to a warning elsewhere.
    try:
        get_ipython().run_line_magic('notebook', '-e Student_GPA.ipynb')
    except NameError:
        print("Warning: Not running in a Jupyter environment with magics enabled. Please ensure 'Student_GPA.ipynb' is saved manually.")

def compress(file_names):
    print("File Paths:")
    print(file_names)
    compression = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile("result.zip", mode="w") as zf:
        for file_name in file_names:
            zf.write('./' + file_name, file_name, compress_type=compression)

submission.to_csv('submission.csv', index=False)

file_names = ['Student_GPA.ipynb', 'submission.csv']
compress(file_names)

File Paths:
['Student_GPA.ipynb', 'submission.csv']
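
<p style="text-align: justify; line-height:200%;font-size:medium">
    After the archive has been created, listing its contents is a simple way to confirm that the expected files were included. The sketch below is self-contained and works on a small demo archive in a temporary directory (<code>list_archive</code> and <code>demo.zip</code> are illustrative names); calling <code>list_archive("result.zip")</code> would inspect the real file.
</p>

```python
import os
import tempfile
import zipfile


def list_archive(zip_path):
    """Return the names of the files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


# Self-contained demo: build a small archive in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "submission.csv")
    with open(csv_path, "w") as f:
        f.write("GPA\n2.5\n")

    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname="submission.csv")

    archive_names = list_archive(demo_zip)

print(archive_names)
```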
