# HW 1: Applying the ML pipeline

### COSC 410B: Spring 2025, Colgate University



## Instructions

In this HW you will apply the ML pipeline by using KNN models to predict student exam scores from several different features. Concretely, you will be working with the [Student Performance Factors](https://www.kaggle.com/datasets/lainguyn123/student-performance-factors?resource=download) dataset on Kaggle. 

**Your task is to predict exam scores in this dataset as well as you can.**  

#### The ML Pipeline

1. Preprocess the data and split it into train and test sets (no need for a validation set because we will use k-fold cross-validation). For this homework, use 80% of the data for training and 20% for evaluation. 
2. Explore the training data: What input features could you use? Which features do you want to use and explore? Why? The answers can be either theory-driven or data-driven. 
3. Explore different features and hyperparameters with k-fold cross-validation. Pick the best model (i.e., feature set and hyperparameter combination).
4. Evaluate the best model and discuss the results.

You should structure the code and writing in this notebook in a format that makes it easy to follow your thought process and argument. 


#### Questions to answer before you start
* Will you use KNN-regression or KNN-classification? Why? 
* How will you handle non-numerical columns?
* You will notice that many of the columns are on different scales. Why is this a problem for KNN models? How can you handle this? 

**[WRITE YOUR ANSWERS HERE]**

My thought process and code are written below, but I'll answer all the questions here.

Pre-lab questions:
1) I used KNN regression because the target, exam score, is numeric. I converted the categorical columns into numbers first; otherwise Euclidean distance can't be computed on them and most of the features would be useless.
2) As noted above, I converted them to numerical data, ordering the categories based on general assumptions about their impact on performance.
3) Features with larger ranges dominate the Euclidean distance, so I scaled the features down to roughly the 0-2 range. That way no feature should have drastically more weight than another.
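
To make answer 3 concrete, here is a minimal sketch of how an unscaled feature can dominate Euclidean distance and how min-max scaling evens things out. The four rows are made up for illustration, and `min_max_scale` and `nearest_to_first` are hypothetical helpers, not part of `util`.

```python
import numpy as np

# Made-up rows: [Attendance (percent), Motivation_Level (0-2)]
X = np.array([
    [60.0, 2.0],
    [62.0, 0.0],
    [70.0, 2.0],
    [100.0, 1.0],
])

def min_max_scale(X_fit, X_apply):
    """Scale each column to [0, 1] using the min/max of X_fit."""
    lo = X_fit.min(axis=0)
    hi = X_fit.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (X_apply - lo) / span

def nearest_to_first(X):
    """Index of the row closest to row 0 (excluding row 0 itself)."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_neighbor = nearest_to_first(X)
scaled_neighbor = nearest_to_first(min_max_scale(X, X))
print(raw_neighbor, scaled_neighbor)  # prints: 1 2
```

On the raw values, row 1 looks closest because a 2-point attendance gap outweighs a full swing in motivation. After scaling, row 2 (same motivation, somewhat higher attendance) becomes the nearest neighbor. For held-out data, the min and max would come from the training rows only.
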
Lab Discussion:
2) I want to use the features that carry the most information, which to me means features that aren't just split between two or three classes. Continuous features like attendance and previous scores are therefore really attractive. On the theory side, I expect certain features to be more impactful than others, such as Sleep_Hours, since sleep is really important.
4) Originally I was getting an F-score of .05 or .06, but now it's around .01 or .00. I know this is terrible, but I'm not sure what is wrong with my approach. I believe the k-fold procedure itself is correct. I think an ideal feature set would contain maybe half of the features; I don't think only two or three could work. For the number of folds, I went with a classic 10-fold split. Increasing the number of folds seemed to slightly increase accuracy, and 10 seems like a nice balance between accuracy and runtime (runtime increases with the number of folds).
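
One possible reason for the near-zero scores in point 4: F-score is a classification metric, and exact matches are rare when KNN regression averages neighbor scores into non-integer predictions. I have not checked how `util.fscore` handles continuous values, so this is an assumption. Below is a minimal sketch of regression metrics (RMSE, MAE, R^2) that could be reported instead. The numbers are made up and `regression_report` is a hypothetical helper.

```python
import numpy as np

def regression_report(pred, true):
    """Return RMSE, MAE, and R^2 for continuous predictions."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    err = pred - true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((true - true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Made-up exam scores and KNN-style averaged predictions
true = [67, 70, 72, 65]
pred = [66.4, 70.8, 70.2, 66.6]

# Every prediction is within 2 points, yet none matches exactly
exact_matches = int(np.sum(np.asarray(pred) == np.asarray(true)))
report = regression_report(pred, true)
print(exact_matches, report)
```

In this toy case the predictions are off by 1.2 points on average, which an exact-match metric would score as a complete failure.
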

## Your report

Add code and markdown chunks for your data analysis report here


First, we clean the data and then split it into folds.


In [137]:
import pandas as pd
import numpy as np
import KNN
import util
from importlib import reload
reload(util)
reload(KNN)

## feature set development

columns_to_remove = [#'Hours_Studied',
                     #'Attendance',
                     'Parental_Involvement',
                     #'Access_to_Resources',
                     #'Extracurricular_Activities',
                     #'Sleep_Hours',
                     #'Previous_Scores',
                     'Motivation_Level',
                     'Internet_Access',
                     #'Tutoring_Sessions',
                     #'Family_Income',
                     #'Teacher_Quality',
                     'School_Type',
                     #'Peer_Influence',
                     'Physical_Activity',
                     #'Learning_Disabilities',
                     #'Parental_Education_Level',
                     'Distance_from_Home',
                     'Gender',
                     #'Exam_Score'
                    ]

## clean data into all continuous

df = pd.read_csv("StudentPerformanceFactors.csv")

df = df.drop(columns=columns_to_remove)

## scale

df['Hours_Studied'] = df['Hours_Studied'] / 30
df['Attendance'] = df['Attendance'] / 100
df['Sleep_Hours'] = df['Sleep_Hours'] / 10
df['Previous_Scores'] = df['Previous_Scores'] / 100

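## convert all remaining categorical values to numbers
## (one shared mapping applied to every column)

df.replace({
    "Low": 0, "Medium": 1, "High": 2,
    "Negative": 0, "Neutral": 1, "Positive": 2,
    "Public": 0, "Private": 1,
    "High School": 0, "College": 1, "Postgraduate": 2,
    "Far": 0, "Moderate": 1, "Near": 2,
    "Male": 0, "Female": 1,
    "No": 0, "Yes": 1,
}, inplace=True)

## NOTE: index=True writes the row index as an extra first column; unless
## util.splitData drops it, that column becomes an unscaled feature
df.to_csv("Clean_Data.csv", index=True)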

## split into a held-out test set and 10 cross-validation folds

split_data = util.splitData("Clean_Data.csv", 10)
test_data = split_data[0]   # held-out test rows
split_data = split_data[1]  # list of training folds
myKnn = KNN.KNN("Regression", 5)
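
The global `df.replace` above applies one mapping to every column. A per-column alternative makes each ordering explicit and surfaces unmapped or missing values instead of silently leaving them in place. This is a minimal sketch on a made-up three-row frame: the `ORDINAL_MAPS` name and the sample values are illustrative assumptions, and only column names from the list above are used.

```python
import pandas as pd

# Hypothetical per-column orderings (assumed, mirroring the mapping above)
ORDINAL_MAPS = {
    "Access_to_Resources": {"Low": 0, "Medium": 1, "High": 2},
    "Peer_Influence": {"Negative": 0, "Neutral": 1, "Positive": 2},
    "Learning_Disabilities": {"No": 0, "Yes": 1},
}

# Made-up sample rows, including one missing value
sample = pd.DataFrame({
    "Access_to_Resources": ["Low", "High", "Medium"],
    "Peer_Influence": ["Positive", None, "Negative"],
    "Learning_Disabilities": ["No", "Yes", "No"],
})

encoded = sample.copy()
for col, mapping in ORDINAL_MAPS.items():
    # Series.map turns any value not found in the mapping into NaN
    encoded[col] = sample[col].map(mapping)

# Count values that failed to encode, per column
unmapped = encoded.isna().sum()
print(unmapped[unmapped > 0])

# Drop incomplete rows before computing distances
encoded = encoded.dropna().astype(int)
print(encoded)
```

Dropping rows is only one option; filling missing values with a column's most common category would keep more data. Which choice is better for this dataset is not something this sketch settles.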






Now we run the analysis on each fold.

In [138]:
cumulative_fscores = 0
for i in range(len(split_data)):
    ## fold i is the validation fold; every other fold is training data
    x_train = []
    y_train = []
    for j in range(len(split_data)):

        ## collect training rows from every fold except the validation fold
        if (i != j):
            for k in range(len(split_data[j])): 
                x_train.append(split_data[j][k][:-1])  # features
                y_train.append(split_data[j][k][-1])   # target (last column)

    ## fit on the training folds
    myKnn.fit(x_train, y_train)

    ## now let's test on the test fold
    pred = []
    true = []
    for j in range(len(split_data[i])):
        pred.append(myKnn.predict(split_data[i][j][:-1]))
        true.append(split_data[i][j][-1])
    
    ## now let's compare values and get an f-score
    pred = np.array(pred)
    true = np.array(true)
    evaluation = util.fscore(pred, true, 1)
    cumulative_fscores += evaluation

## time to average cumulative_fscores
print("Average fscore", str(cumulative_fscores/len(split_data)))



Average fscore 0.006
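
The loop above scores one fixed setting (5 neighbors). Step 3 of the pipeline also asks for a comparison of hyperparameters, which could look like the sketch below. It is self-contained: it uses synthetic data and a small NumPy KNN regressor (`knn_predict` and `cross_val_rmse` are hypothetical names) rather than the course's `KNN` and `util` modules, whose interfaces are not assumed here. It scores with RMSE instead of F-score.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors):
    """Predict each query row as the mean target of its nearest neighbors."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    return y_train[nearest].mean(axis=1)

def cross_val_rmse(X, y, n_neighbors, n_folds=10, seed=0):
    """Average validation RMSE over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[val_idx], n_neighbors)
        scores.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return float(np.mean(scores))

# Synthetic stand-in for the scaled training data
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 60 + 20 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=200)

# Same folds (same seed) for every candidate, so scores are comparable
results = {k: cross_val_rmse(X, y, k) for k in (1, 3, 5, 9, 15)}
best_k = min(results, key=results.get)
print(results, best_k)
```

The same outer loop could iterate over candidate feature sets as well as neighbor counts, keeping whichever combination gives the lowest average validation error.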


Now we evaluate the chosen hyperparameters and feature set on test_data.

In [139]:
pred = []
true = []
for j in range(len(test_data)):
    pred.append(myKnn.predict(test_data[j][:-1]))
    true.append(test_data[j][-1])

## now let's compare values and get an f-score
pred = np.array(pred)
true = np.array(true)
evaluation = util.fscore(pred, true, 1)
print(evaluation)

0.0
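
Note that `myKnn` in the cell above still holds the fit from the last cross-validation iteration, so it was trained without the last fold. Before the final test, the model would normally be refit on every training fold. Below is a minimal sketch of flattening fold-structured rows (last element is the target, matching how `split_data` is indexed above) into one training set. The tiny folds are made up and `flatten_folds` is a hypothetical helper.

```python
import numpy as np

def flatten_folds(folds):
    """Stack a list of folds (rows of [features..., target]) into X and y."""
    rows = np.array([row for fold in folds for row in fold], dtype=float)
    return rows[:, :-1], rows[:, -1]

# Made-up folds: each row is [feature_1, feature_2, Exam_Score]
folds = [
    [[0.5, 0.8, 66.0], [0.7, 0.9, 72.0]],
    [[0.2, 0.6, 61.0], [0.9, 1.0, 75.0]],
    [[0.4, 0.7, 65.0]],
]

X_train, y_train = flatten_folds(folds)
print(X_train.shape, y_train.shape)

# A mean-only baseline gives a reference RMSE that a useful model should beat
test_scores = np.array([68.0, 70.0])
baseline_pred = np.full_like(test_scores, y_train.mean())
baseline_rmse = float(np.sqrt(np.mean((baseline_pred - test_scores) ** 2)))
print(baseline_rmse)
```

With the course modules, this would be followed by `myKnn.fit(X_train, y_train)` before predicting on `test_data`, assuming `fit` accepts the same inputs as in the cross-validation loop.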
