# Starter Notebook

To help you get started, we have provided a basic ML workflow for this project, along with a few possible avenues to explore.

## Section 1: ML Workflow for Submitting *(g,h)* pairs

### 1.0 Pip Installs and Imports

We will be using the package *dill*, a variant of *pickle* that can serialize a wider range of Python objects, including functions. This package is essential for saving your *(g,h)* pairs!

In [87]:
!pip install dill
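
To see why *dill* matters here, the sketch below tries to serialize a function with the standard *pickle* module and then with *dill*. The lambda `g` and the helper `can_serialize` are hypothetical names used only for illustration, and the *dill* part is skipped if the package is not installed.

```python
import pickle

# Hypothetical group function written as a lambda (illustration only)
g = lambda row: row["AGEP"] >= 75

def can_serialize(dumps, obj):
    """Return True if dumps(obj) succeeds, False if it raises a pickling error."""
    try:
        dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# pickle stores functions by name, so an anonymous function cannot be saved
pickle_ok = can_serialize(pickle.dumps, g)
print("pickle can save g:", pickle_ok)

try:
    import dill
    # dill serializes the function itself, so it can be restored and called
    g_restored = dill.loads(dill.dumps(g))
    print("dill round trip:", g_restored({"AGEP": 80}))
except ImportError:
    print("dill is not installed; run the pip cell first")
```

Because *dill* stores the function body rather than just its name, a saved *g* or *h* can be loaded in a process that never defined it, which is presumably what the submission server needs to do.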



Here is a non-exhaustive list of packages you may find helpful:

In [88]:
# Imports
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn import *
import dill as pkl
import math
from sklearn.metrics import mean_squared_error


### 1.1 Download/Load Data

Navigate to the project [webpage](https://declancharrison.github.io/CIS_5230_Bias_Bounty_2023/) and click "Download Training Data". Extract the .zip files in the folder where this notebook is located, then run the cell below.

In [89]:
x_train = pd.read_csv('training_data.csv') 
y_train = np.genfromtxt('training_labels.csv', delimiter=',', dtype = float)

### 1.2 Define a (g,h) pair

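Below is an example that searches over Random Forest, Gradient Boosting, and MLP regressors for the group of individuals aged 75 or older (`AGEP >= 75`). It holds out 15% of the group as a validation set and keeps the model and hyperparameters with the lowest validation RMSE.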

In [116]:
from itertools import product

# hyperparameter grids for each model family
rf_grid = list(product((75, 100, 125, 150), (5, 10, 15)))              # (n_estimators, max_depth)
gb_grid = list(product((75, 100, 125, 150), (5, 10, 15), (2, 5, 10)))  # (n_estimators, max_depth, min_samples_split)
nn_grid = [10, 20, (10, 100), 50, 200, (20, 50, 20), (20, 100, 20), (20, 50, 100), (20, 50, 75)]  # hidden_layer_sizes

# one (print tag, model name, grid, constructor) entry per model family
searches = [
    ("RF", "Random Forest", rf_grid,
     lambda p: sk.ensemble.RandomForestRegressor(n_estimators = p[0], max_depth = p[1])),
    ("GB", "Gradient Boost", gb_grid,
     lambda p: sk.ensemble.GradientBoostingRegressor(n_estimators = p[0], max_depth = p[1], min_samples_split = p[2])),
    ("NN", "Neural Network", nn_grid,
     lambda p: sk.neural_network.MLPRegressor(hidden_layer_sizes = p)),
]

best_model, best_param, best_rmse = None, None, float("inf")

# restrict to the group (age 75 and over) and hold out 15% of it for validation
indices = (x_train['AGEP'] >= 75)
x_train_subset, x_val, y_train_subset, y_val = sk.model_selection.train_test_split(x_train[indices], y_train[indices], test_size = .15, random_state = 42)

for tag, name, grid, make_model in searches:
    for param in grid:
        clf = make_model(param)
        clf.fit(x_train_subset, y_train_subset)
        rmse = math.sqrt(mean_squared_error(y_val, clf.predict(x_val)))
        print(f"{tag}: {param}: {rmse}")
        # keep the configuration with the lowest validation RMSE
        if rmse < best_rmse:
            best_model, best_param, best_rmse = name, param, rmse

    

    




NN: 10: 30988.799399515283
NN: 20: 27377.25295845386
NN: (10, 100): 22470.022021042718
NN: 50: 23718.126053918135
NN: 200: 23011.915782655775
NN: (20, 50, 20): 22153.750790916984
NN: (20, 100, 20): 22337.091659421407
NN: (20, 50, 100): 22317.754324282254
NN: (20, 50, 75): 22226.468139135075
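
The numbers printed by the search are validation RMSE values, so lower is better. As a minimal sketch of what that metric computes, the function below takes the square root of the mean squared difference between labels and predictions. The labels and predictions are made-up values for illustration, not rows from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical labels and predictions (illustration only)
y_true = [30000.0, 50000.0, 80000.0]
y_pred = [33000.0, 46000.0, 80000.0]

# errors are 3000, 4000 and 0, so the result is sqrt((9e6 + 16e6 + 0) / 3)
print(rmse(y_true, y_pred))
```

RMSE is in the same units as the labels, so a value near 22000 means predictions are typically off by about that much on the validation rows.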


In [105]:
print(f"{best_model, best_rmse, best_param}")
def get_g(X):
    # Tried CIT ==1 and CIT ==2 and RACP == 1
    return (X['AGEP'] >= 75)

def get_h(x_train, y_train):
    clf = sk.ensemble.GradientBoostingRegressor(n_estimators=75, max_depth=5,min_samples_split=2, random_state = 42)
    
    # find group indices on data
    indices = get_g(x_train)

    # fit model specifically to group
    clf.fit(x_train[indices], y_train[indices])

    # define hypothesis function as bound clf.predict
    h = clf.predict
    return h

(1, 1, (0, 0))


In [107]:
h = get_h(x_train, y_train)

### 1.3 Save Objects

The following cell will save your group function *g* with filename *g.pkl*, and your hypothesis function *h* with filename *h.pkl*.

In [109]:
# save group function to g.pkl
with open('g.pkl', 'wb') as file:
    pkl.dump(get_g, file)

# save hypothesis function to h.pkl
with open('h.pkl', 'wb') as file:
    pkl.dump(h, file)

### 1.4 Upload Models to Google Drive and Submit Pull Request with Links

Follow the instructions in the GitHub repository to submit a *(g,h)* pair update request!

## Section 2: Reducing Workflow Time Requirements by Creating a Local PDL

As you have probably noticed, submitting a *(g,h)* pair to the GitHub repository can take a long time depending on the current workload of the server. To estimate whether an update will be accepted, we have provided you with the PDL (pointer decision list) architecture file and a workflow that mimics your team's private PDL maintained by the server.

**NOTE: One major caveat is that the validation data this workflow uses is cut from the training data, so you should not train on it; otherwise you risk overfitting to it.**

To work around this without wasting data, we suggest training a *(g,h)* pair on the subset of the training data that excludes the validation set, then attempting the update on the local PDL. If the pair is rejected, you can continue tuning hyperparameters or searching for new groups. If the pair is accepted, you can retrain the *(g,h)* pair over ALL the training data and submit that pair to the server for an update. This lets you "squeeze all the juice" from your training data and test potential updates much more quickly.
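
The accept-or-reject step can be pictured with a toy sketch. It assumes (the provided `pdl` file is the authority here, and this is not confirmed by the text above) that an update is accepted when *h* has a lower validation error than the current model on the group selected by *g*. The names `group_rmse`, `accepts_update`, and `current_model` are hypothetical, and rows are plain dictionaries rather than DataFrame rows.

```python
import math

def group_rmse(predict, g, rows, labels):
    """RMSE of `predict` on the rows where g(row) is True, or None if the group is empty."""
    pairs = [(predict(row), y) for row, y in zip(rows, labels) if g(row)]
    if not pairs:
        return None
    return math.sqrt(sum((p - y) ** 2 for p, y in pairs) / len(pairs))

def accepts_update(current_model, g, h, rows, labels):
    """ASSUMED rule: accept when h beats the current model on group g."""
    old_error = group_rmse(current_model, g, rows, labels)
    new_error = group_rmse(h, g, rows, labels)
    if old_error is None:
        return False  # g selects nobody in the validation set
    return new_error < old_error

# Toy validation set (made-up values)
val_rows = [{"AGEP": 30}, {"AGEP": 80}, {"AGEP": 90}, {"AGEP": 45}]
val_labels = [50.0, 20.0, 10.0, 60.0]

current_model = lambda row: 35.0    # stand-in for the current PDL's predictions
g = lambda row: row["AGEP"] >= 75   # group: age 75 and over
h = lambda row: 15.0                # candidate model for that group

update_flag = accepts_update(current_model, g, h, val_rows, val_labels)
print(update_flag)
```

On the two rows in the group, the stand-in model is off by 15 and 25 while the candidate is off by 5 on each, so the candidate would be accepted under the assumed rule.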

In [51]:
### DON'T CHANGE THIS CELL ###
from pdl import PointerDecisionList

x_train_subset, x_val, y_train_subset, y_val = sk.model_selection.train_test_split(x_train, y_train, test_size = .15, random_state = 42)
base_clf = sk.tree.DecisionTreeRegressor(max_depth = 1, random_state = 42)
base_clf.fit(x_train_subset, y_train_subset)
PDL = PointerDecisionList(base_clf, x_train_subset, y_train_subset, x_val, y_val, 1, 1)

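Train your *(g,h)* pair on the subset of training data, then attempt an update using the following syntax:

In [52]:
h = get_h(x_train_subset, y_train_subset)  # fit on the subset only, so the validation rows stay unseen
update_flag = PDL.update(get_g, h, x_train_subset, y_train_subset, x_val, y_val)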

Update Accepted!


If the update was accepted, you can retrain the pair on the whole training dataset:

In [53]:
if update_flag:

    # refit the model on the full group; get_h recomputes the group indices with get_g
    h = get_h(x_train, y_train)

Submit *(g,h)* pair to GitHub!

**NOTE: You can save your PDL, but a saved PDL is only valid as long as your validation set does not change. Do not change the random state used to split your training data once you create your PDL.**

In [54]:
# save PDL
PDL.save_model()

# open PDL structure
with open('PDL/model.pkl', 'rb') as file:
    PDL = pkl.load(file)

# reload group/hypothesis functions to PDL
PDL.reload_functions()
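
The note about keeping the random state fixed can be illustrated with a small sketch. It uses Python's `random` module as a stand-in for `train_test_split` (a swapped-in technique, chosen so the example runs without scikit-learn), and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a seeded RNG and split off a validation part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_rows * test_size))
    return indices[n_val:], indices[:n_val]

# Same seed twice, then a different seed
train_a, val_a = split_indices(20, 0.15, seed=42)
train_b, val_b = split_indices(20, 0.15, seed=42)
train_c, val_c = split_indices(20, 0.15, seed=7)

print("same seed, same validation rows:", val_a == val_b)
print("validation rows with seed 42:", sorted(val_a))
print("validation rows with seed 7: ", sorted(val_c))
```

A saved PDL was evaluated on one particular set of validation rows, so reloading it alongside a split made with a different seed would compare new pairs against errors measured on different rows.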