<a href="https://colab.research.google.com/github/EudaimonicPi/MLsessions/blob/main/ISEA_Week5_ML_HW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your Task: Predict Student Success

The purpose of this HW is to get you hands-on with real data, trying out the modeling techniques we talked about.

You are free to use gen-ai with this project to help with the coding (of course, you don't have to!). [Hands on Machine Learning](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/) is also a great resource.

Your code needs to run, but I want you to focus less on the specifics of the code and more on understanding the component steps that go into building and validating a model. Creating code is now pretty easy; creating a "good" model is hard.

For this exercise we will use open data on student dropout from Portugal. Full documentation is available [here](https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success)

M.V. Martins, D. Tolledo, J. Machado, L. M.T. Baptista, V. Realinho. (2021) "Early prediction of student's performance in higher education: a case study" Trends and Applications in Information Systems and Technologies, vol.1, in Advances in Intelligent Systems and Computing series. Springer. DOI: 10.1007/978-3-030-72657-7_16

You will turn in, on the class website, a Google Slides deck that has:
1. A cover slide containing your name (and all group member names if working together) and a link to your Colab notebook (**Create slide 1 now**)
2. 3 slides answering the questions below. They are clearly indicated as you go through the Colab notebook.


# Get the data

Here I provide some code to get the data for you.

In [None]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

In [None]:
# fetch dataset
predict_students_dropout_and_academic_success = fetch_ucirepo(id=697)

# data (as pandas dataframes)
X = predict_students_dropout_and_academic_success.data.features
y = predict_students_dropout_and_academic_success.data.targets

# metadata
print(predict_students_dropout_and_academic_success.metadata)

# variable information
print(predict_students_dropout_and_academic_success.variables)

# 1 Data Checking

- Look at your outcome variable - any cases to exclude?
- Determine the base-rate accuracy for a naive model
- Create Test and Training Sets
- Look at distributions of x variables, look up metadata, decide which to include

At the end of this section you should have `x_train`, `x_test`, `y_train`, `y_test`, and an estimate of the base-rate accuracy.
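
As a rough illustration of the first three checklist items, here is a minimal sketch on a tiny made-up table. The names `X_toy` and `y_toy` and all of their values are hypothetical stand-ins for the real `X` and `y`; the 70/30 ratio and the fixed seed are assumptions, not requirements of the assignment.

```python
import pandas as pd

# Hypothetical toy data standing in for the real X and y DataFrames.
X_toy = pd.DataFrame({"age": [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]})
y_toy = pd.DataFrame({"Target": ["Graduate", "Dropout", "Enrolled", "Graduate",
                                 "Graduate", "Dropout", "Graduate", "Enrolled",
                                 "Graduate", "Dropout"]})

# 1. Exclude rows that have no final outcome yet.
keep = y_toy["Target"] != "Enrolled"
X_kept, y_kept = X_toy[keep], y_toy[keep]

# 2. Base rate = share of the most frequent remaining class.
class_shares = y_kept["Target"].value_counts(normalize=True)
base_rate = class_shares.max()
print("Most frequent class:", class_shares.idxmax(), "base rate:", base_rate)

# 3. Shuffle once with a fixed seed, then cut 70/30.
shuffled = X_kept.join(y_kept).sample(frac=1, random_state=0)
n_train = int(0.7 * len(shuffled))
train_df, test_df = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
x_train_toy, y_train_toy = train_df.drop(columns="Target"), train_df[["Target"]]
x_test_toy, y_test_toy = test_df.drop(columns="Target"), test_df[["Target"]]
print("Train rows:", len(x_train_toy), "Test rows:", len(x_test_toy))
```

Joining the features and target before filtering and shuffling keeps each row's features attached to its own outcome, which is easy to get wrong when the two frames are sliced separately.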

In [None]:
""" the goal of this function is to get an estimate of the base rate.
    the base rate is the proportion of the most frequent outcome
"""
def get_counts():
    # helper function that gets graduate and dropout counts
    grad_count = 0
    dropout_count = 0

    # y is a df, so we have to iterate through the "Target" column
    for target in y['Target']:
        if target == "Graduate":
            grad_count += 1
        elif target == "Dropout":
            dropout_count += 1
        elif target == "Enrolled":
            continue
        else:
            print("We have a missing target ", target)
    return grad_count, dropout_count

def estimate_base_rate():
    # The base rate is the accuracy of a naive model that always predicts the
    # most frequent outcome. "Enrolled" rows are excluded because they have no
    # final outcome yet.
    grad_count, dropout_count = get_counts()

    print("Graduate count is", grad_count)  # 2209
    print("Dropout count is", dropout_count)  # 1421
    total_count = grad_count + dropout_count
    # max() picks the majority class; if the counts were equal, either class
    # would give a base rate of 0.5.
    return max(grad_count, dropout_count) / total_count

# Rows with Target == "Enrolled" must be filtered out, and the matching rows of X
# have to be dropped along with them. Joining X and y first keeps them aligned.
# The remaining rows are split 70/30 into training and test sets.
def split_dataset(training_percentage=0.7, seed=42):
    # Join X and y on their row index so features and targets stay aligned,
    # then drop the students who are still enrolled.
    joint_xy_df = X.join(y)
    filtered_df = joint_xy_df[joint_xy_df['Target'] != "Enrolled"]

    # Shuffle with a fixed seed so the split does not depend on row order
    # and is reproducible.
    filtered_df = filtered_df.sample(frac=1, random_state=seed)

    dataset_length = len(filtered_df)
    num_train = int(training_percentage * dataset_length)  # index where the test set starts
    num_test = dataset_length - num_train
    print("Number of training and number of testing", num_train, num_test)

    # Separate the features from the target column
    filtered_y_df = filtered_df[['Target']]
    filtered_x_df = filtered_df.drop('Target', axis=1)

    x_train = filtered_x_df[:num_train]
    x_test = filtered_x_df[num_train:]
    y_train = filtered_y_df[:num_train]
    y_test = filtered_y_df[num_train:]
    return x_train, x_test, y_train, y_test



base_accuracy = estimate_base_rate()
print("Base accuracy estimate is", base_accuracy)
x_train, x_test, y_train, y_test = split_dataset()


"""
NOTES
 * The outcome variable is y, the actual outcomes (not y_pred).
 * Base-rate accuracy is the proportion of the most frequent outcome:
   count dropouts vs. graduates and take the larger share.
 * Accuracy in general is (TP + TN) / (TP + TN + FP + FN); the base rate is
   that accuracy for a naive model that always predicts the most frequent outcome.
"""

"""
QUESTIONS
 * Should the total for base-rate accuracy exclude "Enrolled"?
 * Should the test and training sets be created from the raw data?
 * Which measures should be used to look at the distributions of the x variables?
 * Could the split just be 80/20? Chat suggested 70/30 to 80/20.
"""

# 2 Train a Model
* Pick one of the models we discussed today and train it.
* Report its accuracy and print a confusion matrix.
   * How much better is your model than the base rate?
   * How does accuracy on the train set compare to accuracy on the test set?
   * **Report Slide 2: Include an image of the confusion matrix, the base rate accuracy, train-set accuracy and test-set accuracy**
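
To make the reporting items above concrete, here is a small sketch of how a confusion matrix and accuracy could be computed by hand. The helper names `confusion_counts` and `accuracy` and the demo labels are hypothetical; the encoding 1 = Dropout, 0 = Graduate is an assumption chosen for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels (1 = Dropout, 0 = Graduate)
y_true_demo = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
model_acc = accuracy(y_true_demo, y_pred_demo)

# Base rate for the same labels: always predict the majority class
majority_share = max(np.mean(y_true_demo), 1 - np.mean(y_true_demo))
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy:", model_acc, "Base rate:", majority_share)
```

Running the same two functions once on the training predictions and once on the test predictions gives the train-set and test-set accuracy asked for on Slide 2.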

In [None]:
import numpy as np
import pandas as pd

# goal: our model should return a vector of y_pred
# how to train the weights, these are adjusted through gradient descent
# x_train, y_train



def compute_cost(m, b, x, y):
    n = len(x)

    # Use column vectors so y and y_pred both have shape (n, 1); a flat y would
    # broadcast against (n, 1) predictions into an (n, n) matrix.
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    y_pred = np.asarray(sigmoid(np.dot(x, m) + b)).reshape(-1, 1)  # Apply sigmoid for predictions
    y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)  # Avoid log(0)

    # Compute the binary cross-entropy cost
    cost = (-1/n) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    return cost
# def compute_cost(m, b, x, y):
#     # Number of data points
#     n = len(x)
#     # Compute the predictions
#     y_pred = np.dot(x, m) + b  # Assuming x is a 2D matrix, and m is a vector (dot product)
#     # Calculate the cost using the Mean Squared Error (MSE) formula
#     cost = (1 / (2 * n)) * np.sum((y_pred - y) ** 2)
#     return cost

def sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

def predict(m, b, x): # goes from linear to sigmoid
    z = np.dot(x, m) + b  # Linear combination of inputs and weights
    return sigmoid(z)  # Apply sigmoid to get probabilities

# confusion matrix: counts of TP, TN, FP, FN
# base-rate accuracy: proportion of the most frequent target
# train-set accuracy: (TP + TN) / all, computed on the training set
# test-set accuracy: (TP + TN) / all, computed on the test set
# def predict(m, b, x):
#     # y = mx*b
#     # y pred is a vector with mx + b applied to each par
#     y_pred = m*x + b # this is vector multiplication (m is vec, no?)
#     return y_pred

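def compute_gradients(x, y, y_pred):
    # Gradients of the binary cross-entropy cost with respect to m and b.
    # Everything is converted to float arrays, with y and y_pred as (n, 1)
    # column vectors so the subtraction is element-wise.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1, 1)
    n = len(x)
    dm = (-1/n) * np.dot(x.T, (y - y_pred))  # Gradient for m (weights), shape (n_features, 1)
    db = (-1/n) * np.sum(y - y_pred)  # Gradient for b (bias), a scalar

    return dm, db  # Gradient with respect to m, b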

def update_parameters(m, b, dm, db, learning_rate=0.01):
    # Gradient descent step: move each parameter against its gradient
    m = m - learning_rate * dm
    b = b - learning_rate * db
    return m, b

def initialize_weights_xavier(n_features):
    # Xavier (Glorot) uniform initialization for n_features inputs and 1 output.
    # The bias is initialized separately by the caller.
    limit = np.sqrt(6) / np.sqrt(n_features + 1)
    m = np.random.uniform(-limit, limit, (n_features, 1))
    return m


def train_ML(m, b, x, y, epochs=10):
    # Encode the target as 0/1 without modifying the caller's DataFrame
    y = y['Target'].map({'Graduate': 0, 'Dropout': 1}).to_numpy(dtype=float).reshape(-1, 1)
    x = np.asarray(x, dtype=float)
    # NOTE: the features are used unscaled here; standardizing them first
    # usually helps gradient descent converge.

    for epoch in range(epochs):
        # make predictions
        y_pred = predict(m, b, x)

        # compute gradients
        dm, db = compute_gradients(x, y, y_pred)

        # update parameters
        m, b = update_parameters(m, b, dm, db)

        cost = compute_cost(m, b, x, y)
        print(f"Epoch {epoch}: Cost = {cost:.4f}")

    # Return the trained parameters so they can be used for prediction
    return m, b


initial_bias = 0
# m has shape (n_features, 1): one row per feature
m_vector2 = initialize_weights_xavier(x_train.shape[1])
m_final, b_final = train_ML(m_vector2, initial_bias, x_train, y_train)




# 3 Train a Different Model
* Repeat all the steps in 2, but use a different model
* In addition, compare the accuracy of Model 1 and Model 2
* **Report Slide 3: Model 2 confusion matrix, train-set accuracy and test-set accuracy. Comparison of Model 1 and Model 2 accuracy**
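
One possible shape for this comparison is sketched below using scikit-learn. The two model classes (`LogisticRegression` and `DecisionTreeClassifier`), the `max_depth=4` setting, and the synthetic data from `make_classification` are all assumptions made for illustration; they are not necessarily the models discussed in class, and the real notebook would use `x_train`, `x_test`, `y_train`, `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for the student dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Two candidate models, trained and scored in exactly the same way
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_pred = model.predict(X_te)
    results[name] = {
        "train_acc": accuracy_score(y_tr, model.predict(X_tr)),
        "test_acc": accuracy_score(y_te, test_pred),
        "confusion": confusion_matrix(y_te, test_pred),
    }
    print(name, "train:", results[name]["train_acc"], "test:", results[name]["test_acc"])
    print(results[name]["confusion"])
```

Scoring both models on the same held-out rows keeps the comparison fair, and a large gap between train and test accuracy for either model is a sign of overfitting.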

# 4 Reflection
* **Type responses on Slide 4**
* Contextualizing accuracy - think about different use cases for your model. Which ones would you feel it's accurate enough to use for? I only asked you to look at overall accuracy; is that good enough?
* Contextualizing features - think about these same use cases, are the prediction features you included appropriate for these uses?
* Generalizability - again thinking about your features, could you use this model in other educational contexts? How hard would it be to get that same data? Are there issues with it generalizing over time and location?
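
For the question of whether overall accuracy is enough, a small worked example may help. The counts below are made up: a hypothetical test set of 100 students in which 10 drop out, and a model that flags only 3 students. The helper `precision_recall` is likewise a hypothetical name.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class (here: Dropout)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical confusion-matrix counts for 100 students, 10 of whom drop out
tp, fp, fn, tn = 2, 1, 8, 89

overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall = precision_recall(tp, fp, fn)

print("Accuracy:", overall_accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

In this made-up case the accuracy looks high while the model finds only 2 of the 10 students who drop out, which matters a great deal if the use case is targeting support at students at risk.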

# 5 Extra Credit
* Consider ensembling your two models. Does that perform better?
* Check accuracy for different subgroups
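
Both extra-credit ideas can be sketched in a few lines. This is one simple option, assuming each model outputs a dropout probability: average the two probabilities and threshold at 0.5. The function names, the probabilities, and the subgroup labels `"A"` and `"B"` are all hypothetical.

```python
import numpy as np

def ensemble_average(prob_a, prob_b, threshold=0.5):
    """Average two models' predicted probabilities and threshold the result."""
    avg = (np.asarray(prob_a, dtype=float) + np.asarray(prob_b, dtype=float)) / 2
    return (avg >= threshold).astype(int)

def subgroup_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each value in `groups`."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        str(g): float(np.mean(y_true[groups == g] == y_pred[groups == g]))
        for g in np.unique(groups)
    }

# Hypothetical predicted dropout probabilities from two models
p_model_1 = [0.9, 0.4, 0.2, 0.7, 0.6, 0.1]
p_model_2 = [0.7, 0.7, 0.1, 0.5, 0.8, 0.3]
y_true_demo = [1, 1, 0, 0, 1, 0]
group_demo = ["A", "A", "A", "B", "B", "B"]

ensemble_pred = ensemble_average(p_model_1, p_model_2)
by_group = subgroup_accuracy(y_true_demo, ensemble_pred, group_demo)
print("Ensemble predictions:", ensemble_pred.tolist())
print("Accuracy by subgroup:", by_group)
```

Averaging probabilities is only one way to ensemble; majority voting or weighting the better model more heavily are alternatives. Comparing the per-group numbers shows whether the overall accuracy hides a subgroup for which the model performs noticeably worse.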