# Starter Notebook

To help minimize start-up difficulties, we have provided a basic ML workflow for this project, along with a few possible avenues to explore.

## Section 1: ML Workflow for Submitting *(g,h)* pairs

### 1.0 Pip Installs and Imports

We will use the *dill* package, a variant of *pickle* that can serialize a wider range of Python objects, including functions defined interactively. This package is essential for saving your *(g,h)* pairs!

In [1]:
!pip install dill
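
To illustrate why plain *pickle* is not enough here, the following minimal sketch shows that the standard library refuses to serialize lambdas and nested functions, which is the shape a group function often takes. The names `make_group` and `can_pickle` are hypothetical and exist only for this illustration; *dill* is designed to handle these cases.

```python
import pickle

def make_group(threshold):
    """Return a nested function, similar in shape to a group function g."""
    def g(x):
        return x >= threshold
    return g

def can_pickle(obj):
    """Report whether the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# Both objects are functions that pickle cannot look up by name,
# so the standard library cannot serialize them.
nested_ok = can_pickle(make_group(40))
lambda_ok = can_pickle(lambda x: x + 1)
print(nested_ok, lambda_ok)
```

This is why the notebook imports `dill as pkl` and calls `pkl.dump` when saving a pair.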



Here is a non-exhaustive list of packages you may find helpful:

In [2]:
# Imports

import numpy as np
import pandas as pd
import sklearn as sk
from sklearn import *
import dill as pkl
import math
import time
from sklearn.metrics import mean_squared_error
import random
from google.oauth2 import service_account
from google.oauth2.credentials import Credentials
from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaFileUpload
from github import Github
import string
import os

### 1.1 Download/Load Data

Navigate to the project [webpage](https://declancharrison.github.io/CIS_5230_Bias_Bounty_2023/) and click "Download Training Data". Extract the .zip files into the folder where this notebook is located, then run the cell below.

In [3]:
x_train = pd.read_csv('training_data.csv') 
y_train = np.genfromtxt('training_labels.csv', delimiter=',', dtype = float)
prediction_path = 'training_predictions.csv'
x_predictions = np.loadtxt(prediction_path, delimiter=",", dtype=str).astype('float64')
print(f'x_predictions.shape: {x_predictions.shape}')

x_predictions.shape: (340134,)


### 1.2 Define a (g,h) pair

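A *(g,h)* pair consists of a group function *g*, which selects a subset of individuals, and a hypothesis function *h*, which predicts labels for that subset. The cells in this section define helpers for uploading a pickled pair to Google Drive and opening a pull request, then search for pairs over combinations of feature values.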
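
As a rough sketch of what a pair looks like, the example below builds a toy group function and hypothesis function and compares the error of *h* against existing predictions on the group. The data values, the constant-valued `h`, and the "lower RMSE on the group" comparison are assumptions made for illustration; they are not taken from the project's evaluation code.

```python
import numpy as np
import pandas as pd

# Toy data standing in for x_train, y_train and the current predictions
X = pd.DataFrame({"AGEP": [25, 40, 67, 71], "SEX": [1.0, 2.0, 2.0, 1.0]})
y = np.array([30000.0, 52000.0, 18000.0, 21000.0])
current = np.array([35000.0, 50000.0, 40000.0, 40000.0])

def g(X):
    """Group function: boolean mask selecting rows with 63.2 <= AGEP < 78.6."""
    return (X["AGEP"] >= 63.2) & (X["AGEP"] < 78.6)

def h(X):
    """Hypothesis function: a constant predictor, used here only as a placeholder."""
    return np.full(len(X), 20000.0)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

mask = g(X).to_numpy()
rmse_current = rmse(y[mask], current[mask])
rmse_h = rmse(y[mask], h(X[mask]))

# Assumed criterion: the pair is useful if h beats the current predictions on the group
improves = rmse_h < rmse_current
print(mask.sum(), improves)
```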

In [4]:
def submit_to_drive(file_name):
    """Overwrite an existing file in the shared Drive folder with the local copy."""
    # Authenticate with a service-account key stored next to the notebook
    scopes = ['https://www.googleapis.com/auth/drive']
    creds = service_account.Credentials.from_service_account_file('./s_key.json', scopes=scopes)
    service = build('drive', 'v3', credentials=creds)

    # Find the target folder by name
    folder_name = "423 3"
    query = "mimeType='application/vnd.google-apps.folder' and trashed=false and name='%s'" % folder_name
    folder = service.files().list(q=query).execute().get('files', [])[0]

    # Find the file to overwrite inside that folder
    query = "mimeType='application/octet-stream' and trashed=false and name='%s' and parents in '%s'" % (file_name, folder['id'])
    file = service.files().list(q=query).execute().get('files', [])[0]

    # Upload the local file as the new content of the existing Drive file
    file_metadata = {'name': file_name}
    media = MediaFileUpload(f'./{file_name}', mimetype='application/octet-stream')
    file = service.files().update(fileId=file['id'], body=file_metadata, media_body=media, fields='id').execute()

    print("File overwritten successfully.")

In [5]:
submit_to_drive('g_2.pkl')

File overwritten successfully.
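
The lookups in `submit_to_drive` build their query strings with `%` formatting. As a hedged sketch, a small helper could assemble the same kind of query while escaping single quotes in names and using the `'<folder id>' in parents` form shown in the Drive API search documentation. The helper names and the `FOLDER_ID` placeholder are hypothetical.

```python
def escape_drive_value(value):
    """Escape single quotes so the value can sit inside a quoted query literal."""
    return value.replace("'", "\\'")

def build_file_query(file_name, parent_id, mime_type="application/octet-stream"):
    """Build a `q` string matching a non-trashed file by name inside one folder."""
    return (
        f"mimeType='{mime_type}' and trashed=false "
        f"and name='{escape_drive_value(file_name)}' "
        f"and '{escape_drive_value(parent_id)}' in parents"
    )

# FOLDER_ID is a placeholder, not a real Drive folder id
query = build_file_query("g_2.pkl", "FOLDER_ID")
print(query)
```

The resulting string would be passed as `q=query` to `service.files().list(...)`, in the same way as the queries in the function above.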


In [6]:
def github_request(name):
    """Create branch `name`, point request.yaml at the Drive files, and open a pull request."""
    g = Github('insert github key')
    repo = g.get_repo("Declancharrison/CIS_5230_Bias_Bounty_2023")
    branch = repo.get_branch("main")

    # Create a new branch from the head of main
    repo.create_git_ref(f"refs/heads/{name}", branch.commit.sha)
    file = repo.get_contents("competitors/request.yaml", ref=name)
    g_drive = "https://drive.google.com/file/d/1pq--9QJIaTFtsMsVH76iRbtxpgB--Tjs/view?usp=share_link"
    h_drive = "https://drive.google.com/file/d/1BQ8u2JD9DMTPtESEABHkcJHkQKbx0GBf/view?usp=share_link"

    file_content = f"# insert URL links to each model\n\n# google drive group function\n\ng_url : \"{g_drive}\"\n\n# google drive hypothesis function\n\nh_url : \"{h_drive}\"\n\n### DO NOT DELETE NEWLINE"

    # Commit the updated request file to the new branch
    repo.update_file(
        file.path,
        "commit message",
        file_content,
        file.sha,
        branch=name
    )

    # Open the pull request; on failure (e.g. rate limiting), wait 10 minutes and retry
    while True:
        try:
            pull = repo.create_pull(
                title="auto",
                body="auto",
                head=name,
                base=branch.name
            )
            break
        except Exception as e:
            print(e)
            time.sleep(600)
    print(f"Created pull request #{pull.number}: {pull.title}")
    return pull
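
The retry loop in `github_request` waits a fixed 10 minutes and never gives up. One possible alternative, sketched here under assumptions, is a bounded retry with exponentially growing delays. The helper `retry_with_backoff`, its parameters, and the fake `flaky_action` are hypothetical and not part of PyGithub or the project.

```python
import time

def retry_with_backoff(action, max_attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call `action` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception as error:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)

# Demo: a fake action that fails twice, and a fake sleep that only records delays
calls = {"count": 0}
waits = []

def flaky_action():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "pull request created"

outcome = retry_with_backoff(flaky_action, base_delay=1.0, sleep=waits.append)
print(outcome, waits)
```

In the notebook, `action` could be a small function wrapping the `repo.create_pull(...)` call.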
    
    

In [7]:
def generate_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str
generate_random_string(10)

'seawtzztij'

In [8]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the US Census data into a Pandas DataFrame
df = pd.read_csv('training_data.csv')

# Define the list of column codes
columns = ['COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT', 'DRAT', 'DREM', 'ENG', 'FER', 'JWTRNS', 'MAR', 'MIL', 'SCHL', 'SEX', 'RAC1P', 'CIT', 'AGEP', 'WKHP', 'OCCP']

# Define an empty dictionary to store the results
result = {}

# Loop through each column in the list of column codes
for column in columns:

    # Check if the column is categorical or continuous
    if column in [ 'CIT', 'COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT', 'DRAT', 'DREM', 'ENG', 'FER', 'JWTRNS', 'MAR', 'MIL', 'SCHL', 'SEX', 'RAC1P']:

        # For categorical columns, store the unique values observed in the data
        result[column] = df[column].unique().tolist()

    elif column == 'AGEP':

        # For continuous columns, store the edges of 5 equal-width bins (here 17 to 94)
        result[column] = [17, 32.4, 47.8, 63.2, 78.6, 94.0]

    elif column == 'WKHP':

        # Edges of 5 equal-width bins from 1 to 99
        result[column] = [1, 20.6, 40.2, 59.8, 79.4, 99.0]

    elif column == 'OCCP':

        # Edges of 5 equal-width bins from 0 to 10000
        result[column] = [0, 2000, 4000, 6000, 8000, 10000]
        
# Print the resulting dictionary
print(result)

{'COW': [1.0, 6.0, 4.0, 2.0, 7.0, 3.0, 5.0, 8.0], 'DDRS': [2.0, 1.0], 'DEAR': [2.0, 1.0], 'DEYE': [2.0, 1.0], 'DOUT': [2.0, 1.0], 'DRAT': [0.0, 6.0, 3.0, 2.0, 4.0, 5.0, 1.0], 'DREM': [2.0, 1.0], 'ENG': [0.0, 4.0, 1.0, 3.0, 2.0], 'FER': [2.0, 0.0, 1.0], 'JWTRNS': [0.0, 1.0, 11.0, 12.0, 8.0, 10.0, 3.0, 2.0, 9.0, 7.0, 6.0, 5.0, 4.0], 'MAR': [5.0, 2.0, 3.0, 4.0, 1.0], 'MIL': [4.0, 1.0, 2.0, 3.0], 'SCHL': [16.0, 19.0, 1.0, 22.0, 15.0, 5.0, 20.0, 14.0, 18.0, 21.0, 17.0, 13.0, 7.0, 23.0, 24.0, 10.0, 11.0, 12.0, 9.0, 6.0, 8.0, 4.0, 2.0, 3.0], 'SEX': [2.0, 1.0], 'RAC1P': [2.0, 1.0, 3.0, 6.0, 8.0, 9.0, 7.0, 4.0, 5.0], 'CIT': [1.0, 5.0, 4.0, 2.0, 3.0], 'AGEP': [17, 32.4, 47.8, 63.2, 78.6, 94.0], 'WKHP': [1, 20.6, 40.2, 59.8, 79.4, 99.0], 'OCCP': [0, 2000, 4000, 6000, 8000, 10000]}
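
The hard-coded bin edges for `AGEP` and `WKHP` can be reproduced from the minimum, the maximum, and the number of bins. The sketch below uses two hypothetical helpers, `equal_width_edges` and `bin_index`, to show the arithmetic and the half-open interval test (`>=` lower edge, `<` upper edge) that the group functions later in the notebook apply.

```python
def equal_width_edges(low, high, n_bins):
    """Return n_bins + 1 edges that split [low, high] into equal-width bins."""
    width = (high - low) / n_bins
    return [round(low + i * width, 1) for i in range(n_bins + 1)]

def bin_index(value, edges):
    """Return i such that edges[i] <= value < edges[i + 1], or None if no bin matches."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None

agep_edges = equal_width_edges(17, 94, 5)
wkhp_edges = equal_width_edges(1, 99, 5)
print(agep_edges)
print(wkhp_edges)
print(bin_index(40, agep_edges), bin_index(94, agep_edges))
```

Because the upper edge of each bin is exclusive, a value equal to the largest edge (for example `AGEP == 94`) falls into no bin, so any such rows would not be covered by the groups generated from these edges.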


In [13]:
### Automation code: pair each bin of the continuous variables (AGEP, WKHP, OCCP) with every value of each categorical variable
### We run two more notebooks in parallel (with different g_drive links and folders) to test other combinations
def write_pkl(g, h):
    with open('g_2.pkl', 'wb') as file:
        pkl.dump(g, file)

    with open('h_2.pkl', 'wb') as file:
        pkl.dump(h, file)
pull = "init"
models = [sk.ensemble.RandomForestRegressor, sk.ensemble.GradientBoostingRegressor]
# Hyperparameter grids: (n_estimators, max_depth) for random forest;
# (n_estimators, max_depth, min_samples_split) for gradient boosting
params = [[(75, 100, 125, 150), (5, 10, 15)], [(75, 100, 125, 150), (5, 10, 15), (2, 5, 10)]]
result_keys = list(result.keys())
continuous_vars = ['AGEP', 'WKHP', 'OCCP']
for ii in range(len(continuous_vars)):
    for jj in range(len(result_keys)):
        key1 = continuous_vars[ii]
        key2 = result_keys[jj]
        if key2 in continuous_vars:
            continue
        for val1 in range(len(result[key1]) - 1):
            for val2 in range(len(result[key2])):
                best_model = -1
                best_param = -1
                best_rmse = 1000000000
                indices = ((x_train[key1] >= result[key1][val1]) & (x_train[key1] < result[key1][val1 + 1]) & 
                                      (x_train[key2] == result[key2][val2]))
                # Skip groups that are too small to split into train and validation sets
                if len(x_train[indices]) <= 5:
                    continue

                # Bind the current loop values as defaults so the pickled g keeps them
                def get_g(X, k1 = key1, k2 = key2, v1 = result[key1][val1], v11 = result[key1][val1 + 1], v2 = result[key2][val2]):
                    return ((X[k1] >= v1) & (X[k1] < v11) & (X[k2] == v2))

                print(f"{key1}: {result[key1][val1]} {result[key1][val1 + 1]} -- {key2}: {result[key2][val2]}")
                try:
                    x_train_subset, x_val, y_train_subset, y_val = sk.model_selection.train_test_split(x_train[indices], y_train[indices], test_size = .15, random_state = 42)
                except ValueError:
                    continue
                if (len(x_train_subset) <=3):
                    continue
                print(len(x_train_subset))
                # NOTE: range(len(models) - 1) covers only index 0, so only the random forest
                # grid is searched; use range(len(models)) to include gradient boosting as well
                for i in range(len(models) - 1):
                    if (i == 0):
                        j = params[0]
                        for k in range(len(j[0])):
                            for kk in range(len(j[1])):
                                clf = models[i](n_estimators = j[0][k], max_depth = j[1][kk])
                                clf.fit(x_train_subset, y_train_subset)
                                mse = mean_squared_error(clf.predict(x_val), y_val)
                                rmse = math.sqrt(mse)
                                if (rmse < best_rmse):
                                    best_rmse = rmse
                                    best_model = 0
                                    best_param = (j[0][k], j[1][kk])

                        
                    if (i == 1):
                        j = params[1]
                        for k in range(len(j[0])): 
                            for kk in range(len(j[1])):
                                for kkk in range(len(j[2])):
                                    clf = models[i](n_estimators = j[0][k], max_depth = j[1][kk], min_samples_split = j[2][kkk])
                                    clf.fit(x_train_subset, y_train_subset)
                                    mse = mean_squared_error(clf.predict(x_val), y_val)
                                    rmse = math.sqrt(mse)
                                    #print(f"GB: {j[0][k] , j[1][kk], j[2][kkk]}: {rmse}")
                                    if (rmse < best_rmse):
                                        best_rmse = rmse
                                        best_model = 1
                                        best_param = (j[0][k], j[1][kk], j[2][kkk])
                                    
            
                    
                if best_model == 0: 
                    def get_h(x_train_h, y_train_h, n_estim = best_param[0], max_d = best_param[1]):
                        clf = sk.ensemble.RandomForestRegressor(n_estimators = n_estim, max_depth = max_d)
                        # find group indices on data
                        indices_h = get_g(x_train_h)

                        # fit model specifically to group
                        clf.fit(x_train_h[indices_h], y_train_h[indices_h])

                        # define hypothesis function as bound clf.predict
                        h = clf.predict

                        return h
                    h = get_h(x_train, y_train)
                    write_pkl(get_g, h)
                

        
                if best_model == 1:
                    def get_h(x_train_h, y_train_h, n_estim = best_param[0], max_d = best_param[1], min_s_s = best_param[2]):
                        clf = sk.ensemble.GradientBoostingRegressor(n_estimators = n_estim, max_depth = max_d, min_samples_split = min_s_s)
                        # find group indices on data
                        indices_h = get_g(x_train_h)

                        # fit model specifically to group
                        clf.fit(x_train_h[indices_h], y_train_h[indices_h])

                        # define hypothesis function as bound clf.predict
                        h = clf.predict
                        return h
                    h = get_h(x_train, y_train)
                    write_pkl(get_g, h)
                
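                if pull != "init":
                    print("waiting")
                    # Poll until the previous pull request is closed
                    while pull is not None and pull.state == "open":
                        try:
                            pull.update()
                        except Exception:
                            pass
                        # Pause between polls to avoid hammering the GitHub API (30 s is an arbitrary choice)
                        time.sleep(30)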
                
                submit_to_drive('g_2.pkl')
                submit_to_drive('h_2.pkl')
                pull = github_request(f'f1eAWO-{generate_random_string(10)}')
                    

                
                
                


WKHP: 1 20.6 -- MAR: 5.0
17568
File overwritten successfully.
File overwritten successfully.
Created pull request #7569: auto
WKHP: 1 20.6 -- MAR: 2.0
1487
waiting
File overwritten successfully.
File overwritten successfully.
Created pull request #7571: auto
WKHP: 1 20.6 -- MAR: 3.0
3423
waiting
File overwritten successfully.
File overwritten successfully.
Created pull request #7573: auto
WKHP: 1 20.6 -- MAR: 4.0
544
waiting
File overwritten successfully.
File overwritten successfully.
Created pull request #7576: auto
WKHP: 1 20.6 -- MAR: 1.0
15474
waiting
File overwritten successfully.
File overwritten successfully.
Created pull request #7578: auto
WKHP: 20.6 40.2 -- MAR: 5.0
65636
waiting
File overwritten successfully.
File overwritten successfully.
403 {"message": "You have exceeded a secondary rate limit and have been temporarily blocked from content creation. Please retry your request again later.", "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-

KeyboardInterrupt: 
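
The nested `for k` / `for kk` loops in the search above enumerate every combination of hyperparameters. As a sketch of one way to flatten them, `itertools.product` can iterate over the same grid. The `validation_rmse` function here is a stand-in with made-up scores, assumed only for illustration; in the notebook it would fit a model on `x_train_subset` and score it on `x_val`.

```python
import itertools
import math

# Grid copied from the random forest entry of `params`
param_grid = {"n_estimators": (75, 100, 125, 150), "max_depth": (5, 10, 15)}

def validation_rmse(n_estimators, max_depth):
    """Stand-in for fitting a model and scoring it on the validation split."""
    return abs(n_estimators - 125) / 100 + abs(max_depth - 10) / 10

best_rmse, best_params = math.inf, None
names = list(param_grid)
n_candidates = 0
for values in itertools.product(*(param_grid[name] for name in names)):
    candidate = dict(zip(names, values))
    n_candidates += 1
    score = validation_rmse(**candidate)
    if score < best_rmse:
        best_rmse, best_params = score, candidate

print(n_candidates, best_params)
```

Since the keys match scikit-learn's keyword arguments, the winning dictionary could be unpacked directly, for example `RandomForestRegressor(**best_params)`.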