# **CRUSHgebra**

The goal of this project is to design and train a single neural network that can perform two different tasks simultaneously (Multi-Task Learning).
1. Task 1 is regression, aiming to predict the student's final grade, `G3` (a number from 0 to 20).
2. Task 2 is classification, aiming to determine whether the student is in a `romantic` relationship or not.
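
To make the idea concrete, here is a minimal sketch of what "one network, two tasks" can look like in PyTorch: a shared trunk feeding a regression head and a classification head. The class name `TwoHeadNet`, the hidden size, and the single-layer trunk are illustrative assumptions, not the architecture this project actually trains.

```python
import torch
from torch import nn


class TwoHeadNet(nn.Module):
    """Hypothetical multi-task network: one shared trunk, two task heads."""

    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        # Shared layers learn one representation used by both tasks
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        # Regression head: a single output for the grade G3
        self.grade_head = nn.Linear(hidden, 1)
        # Classification head: two logits (not romantic / romantic)
        self.romantic_head = nn.Linear(hidden, 2)

    def forward(self, x):
        h = self.trunk(x)
        return self.grade_head(h).squeeze(-1), self.romantic_head(h)


# Toy forward pass: a batch of 4 samples with 10 made-up features
model = TwoHeadNet(n_features=10)
grade_pred, romantic_logits = model(torch.randn(4, 10))
print(grade_pred.shape, romantic_logits.shape)
```

Both heads read the same hidden representation, which is what lets the two tasks share what they learn.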
   
This notebook walks through all the code and experiments that this repository implements as Python scripts. Every snippet here was run with Python 3.14, the same interpreter used for the entire project. After making sure you are using Python 3.14, install the libraries the project needs:

In [1]:
!pip install -U pandas scikit-learn torch ucimlrepo



Here I will list all the imports used throughout the notebook (and project) so that I will not need to run any functional code snippet twice:

In [2]:
import pandas as pd
import torch

from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from torch.utils.data import Dataset, DataLoader

Now let's start by using UC Irvine's Python API to retrieve the "Student Performance" dataset:

In [3]:
student_performance = fetch_ucirepo(name='Student Performance')
student_performance.metadata.additional_info.summary

'This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).'

### Attributes of the dataset:
1. school: student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2. sex: student's sex (binary: "F" - female or "M" - male)
3. age: student's age (numeric: from 15 to 22)
4. address: student's home address type (binary: "U" - urban or "R" - rural)
5. famsize: family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6. Pstatus: parent's cohabitation status (binary: "T" - living together or "A" - apart)
7. Medu: mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8. Fedu: father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9. Mjob: mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10. Fjob: father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11. reason: reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12. guardian: student's guardian (nominal: "mother", "father" or "other")
13. traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures: number of past class failures (numeric: n if $1 \leq n <3$, else 4)
16. schoolsup: extra educational support (binary: yes or no)
17. famsup: family educational support (binary: yes or no)
18. paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities: extra-curricular activities (binary: yes or no)
20. nursery: attended nursery school (binary: yes or no)
21. higher: wants to take higher education (binary: yes or no)
22. internet: Internet access at home (binary: yes or no)
23. ***romantic***: with a romantic relationship (binary: yes or no) [TARGET]
24. famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime: free time after school (numeric: from 1 - very low to 5 - very high)
26. goout: going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health: current health status (numeric: from 1 - very bad to 5 - very good)
30. absences: number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

31. G1: first period grade (numeric: from 0 to 20)
32. G2: second period grade (numeric: from 0 to 20)
33. ***G3***: final grade (numeric: from 0 to 20) [TARGET]

## Part 1: Data Preprocessing (`preprocessing.py`)

Like most machine learning models, neural networks depend heavily on the quality of the data used to train and evaluate them. Additionally, neural networks only accept numeric input, so categorical data needs to be converted to a numerical format as well.

Usually, one would check for missing values, but since the UCI website states that **there are no missing values** in the dataset, we can skip this step.
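
If you would rather verify that claim than take it on trust, the check is cheap. The sketch below runs on a toy DataFrame with made-up values; on the real data, the same call would presumably be applied to `student_performance.data.features`.

```python
import pandas as pd

# Toy stand-in for the feature table (hypothetical values)
toy = pd.DataFrame({"age": [15, 16, 17], "school": ["GP", "MS", "GP"]})

# Count missing values per column; all zeros means there is nothing to impute
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("total missing:", total_missing)
```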

In this order, we will:
1. Separate target variables
2. Perform a train/test split
3. Encode categorical variables into numerical ones
4. Normalize/standardize numerical variables

First, we separate the target variables from the rest of the dataset. As returned by the API, the grades (`G1`, `G2`, `G3`) are in `student_performance.data.targets`, while everything else, including `romantic`, is under `student_performance.data.features`.

From the [webpage for the dataset](https://archive.ics.uci.edu/dataset/320/student+performance), we read:
> Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

So we will ignore `G1` and `G2` altogether, separating `romantic`, `G3`, and all the remaining features of `student_performance.data.features`:

In [4]:
G3 = student_performance.data.targets["G3"]
romantic = student_performance.data.features["romantic"]
X = student_performance.data.features.drop(columns=["romantic"])
y = pd.DataFrame({'G3': G3, 'romantic': romantic})
og_columns = list(X.columns)

Now we perform a train/test split. We fix `random_state` for reproducibility (95 for Lightning McQueen) and pass `stratify` so that both splits keep the same proportion of `romantic` labels as the full dataset:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=95, stratify=y['romantic'])

G3_train, G3_test = y_train['G3'], y_test['G3']
romantic_train, romantic_test = y_train['romantic'], y_test['romantic']
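
To see what `stratify` buys us, here is a small sketch on made-up labels with a 30/70 imbalance. The toy data and variable names are assumptions for illustration only; the call itself mirrors the one above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a 30/70 class imbalance (hypothetical data)
labels = pd.Series(["yes"] * 30 + ["no"] * 70)
features = pd.DataFrame({"x": range(100)})

_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.2, random_state=95, stratify=labels
)

# Share of "yes" in each split; stratification keeps them aligned
train_share = (lab_train == "yes").mean()
test_share = (lab_test == "yes").mean()
print(train_share, test_share)
```

Without `stratify`, the two shares would only match by chance, which matters more the smaller the test set is.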

We will now check which of our variables are categorical and which are numerical, and then start by converting the categorical variables into numerical format:

In [6]:
for col in X_train.columns:
    print(f"{col}: {X_train[col].dtype}")

school: object
sex: object
age: int64
address: object
famsize: object
Pstatus: object
Medu: int64
Fedu: int64
Mjob: object
Fjob: object
reason: object
guardian: object
traveltime: int64
studytime: int64
failures: int64
schoolsup: object
famsup: object
paid: object
activities: object
nursery: object
higher: object
internet: object
famrel: int64
freetime: int64
goout: int64
Dalc: int64
Walc: int64
health: int64
absences: int64


As we can see, for this particular dataset, the datatype for all of the categorical variables is `"object"`, so we can proceed with the following:

In [7]:
categorical_columns = []
numerical_columns = []
for col in X_train.columns:
    if X_train[col].dtype == "object":
        categorical_columns.append(col)
    else:
        numerical_columns.append(col)

print(categorical_columns)
print(numerical_columns)

['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet']
['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


For both categorical and numerical variables, we fit the encoder or scaler on `X_train` only, and then apply the fitted transformer to `X_test`. This prevents data leakage.
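
As a minimal illustration of that rule, the sketch below fits a `StandardScaler` on a toy training column and reuses its statistics on a toy test value. The numbers are made up; the point is only that the test data never influences the mean or standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature splits (hypothetical values)
train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[6.0]])

scaler = StandardScaler().fit(train)   # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the train mean and std

print("train mean/std:", scaler.mean_, scaler.scale_)
print("scaled test value:", test_scaled)
```

The scaled test value lands well outside the training range, which is exactly what we want: the model sees it the way it would see genuinely unseen data.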

As for the strategy of encoding categorical variables:
- For binary categorical variables, binary encoding doesn't create artificial ordinal relationships. Each variable will simply map to 0 or 1.
- For nominal variables (`Mjob`, `Fjob`, `reason`, `guardian`), one-hot encoding creates binary columns for the categories, preventing the model from assuming false ordinal relationships (e.g., "teacher" > "health"). The overhead is small, as none of these variables takes more than five possible values.
- Ordinal categorical variables are already stored as integers in this dataset, so we treat them as numerical features. We standardize all numerical columns. This is necessary because even though ordinal variables have meaningful ordering, they exist on different scales (some 0-4, others 1-5, and `absences` goes 0-93), which would cause the neural network to give disproportionate weight to larger-scale features during gradient descent. Standardization puts all features on the same scale ($\mu=0, \sigma=1$) so the network can learn their relative importance based on predictive power, not magnitude.

In [8]:
# Binary categorical variables (encode as 0/1)
binary_columns = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet']
# Nominal categorical variables (one-hot encode)
nominal_columns = ['Mjob', 'Fjob', 'reason', 'guardian']

# Create the preprocessor
column_transformer = ColumnTransformer(
    transformers=[
        ('binary', OrdinalEncoder(), binary_columns),
        ('nominal', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), nominal_columns),
        ('numerical', StandardScaler(), numerical_columns),
    ]
)

# Fit on training data ONLY
column_transformer.fit(X_train)

# Transform both training and test sets
X_train_encoded = column_transformer.transform(X_train)
X_test_encoded = column_transformer.transform(X_test)

Converting back to DataFrame for easier inspection:

In [9]:
feature_names = (
    binary_columns + 
    column_transformer.named_transformers_['nominal'].get_feature_names_out(nominal_columns).tolist() +
    numerical_columns
)

X_train_encoded = pd.DataFrame(X_train_encoded, columns=feature_names, index=X_train.index)
X_test_encoded = pd.DataFrame(X_test_encoded, columns=feature_names, index=X_test.index)

Let's create a custom PyTorch dataset (`CrushSet.py`):

In [10]:
class CrushSet(Dataset):
    def __init__(self, X: pd.DataFrame, y: pd.DataFrame):
        """
        Create a dataset for heartbroken nerds. (Not for me I swear, I have the most beautiful and loving girlfriend in the history of the world)

        Args:
            X: (n_samples, n_features) -> preprocessed features
            y: (n_samples, 2) -> targets DataFrame containing both 'G3' and 'romantic' columns
        """
        self.X = torch.FloatTensor(X.values)
        self.y_grade = torch.FloatTensor(y['G3'].values)
        self.y_romantic = torch.LongTensor((y['romantic'] == 'yes').astype(int).values)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        """
        Returns three items: (x_data, y_grade_data, y_romantic_data)

        Returns:
            x_data: feature vector, shape (n_features,)
            y_grade_data: grade target (scalar)
            y_romantic_data: romantic status target (0 or 1)
        """
        return self.X[idx], self.y_grade[idx], self.y_romantic[idx]

Completing the train/validation/test split (64%/16%/20% of the data overall) and creating the DataLoaders:

In [11]:
# Split train into train and validation sets (80/20 split of training data)
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train_encoded, y_train, test_size=0.2, random_state=95, stratify=y_train['romantic'])

# Create Dataset instances
train_dataset = CrushSet(X_train_final, y_train_final)
test_dataset = CrushSet(X_test_encoded, y_test)
val_dataset = CrushSet(X_val, y_val)

# Create DataLoaders
BS = 32  # a common batch size (note: DataLoader's own default is batch_size=1)

train_loader = DataLoader(train_dataset, batch_size=BS, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BS, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BS, shuffle=False)
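
To show how these three-item batches might be consumed, here is a hedged sketch of a single pass over a loader with a joint loss. It uses a `TensorDataset` of random tensors as a stand-in for `CrushSet`, placeholder predictions instead of a real model, and an unweighted sum of the two losses; all of these are assumptions for illustration, not the project's training code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for CrushSet: 8 samples, 5 features (random, hypothetical data)
torch.manual_seed(95)
X = torch.randn(8, 5)
y_grade = torch.rand(8) * 20             # float grades, as MSE loss expects
y_romantic = torch.randint(0, 2, (8,))   # int64 class indices, as cross-entropy expects

loader = DataLoader(TensorDataset(X, y_grade, y_romantic), batch_size=4, shuffle=False)

for x_batch, grade_batch, romantic_batch in loader:
    # Placeholder predictions; a real model would compute these from x_batch
    grade_pred = torch.zeros(len(x_batch))
    romantic_logits = torch.zeros(len(x_batch), 2)
    # One possible joint objective: unweighted sum of the two task losses
    loss = (
        nn.functional.mse_loss(grade_pred, grade_batch)
        + nn.functional.cross_entropy(romantic_logits, romantic_batch)
    )
    print(x_batch.shape, grade_batch.dtype, romantic_batch.dtype, loss.item())
```

This is also why `CrushSet` stores the grade as a float tensor and the romantic label as a long tensor: each dtype matches what its loss function expects.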