# Training and Serializing the Model

Things to consider in this notebook:

1. Preparing a dataset. Watch out for data that might bias your model. The approach taken here is to leave out the features that could introduce bias. This may well hurt the model's performance, but it is a deliberate trade-off to avoid a biased model.
2. Training the model on the training data. No tuning is done in this notebook, but it is definitely something that should be done to improve the model's performance.
3. Serializing the model to be used in production.

You can also write your own scorer function to evaluate a model's performance while taking into account any bias that might be present in the model.

In [14]:
import pandas as pd
from sklearn.pipeline import make_pipeline, Pipeline

# 1. Prepare the Dataset

Read the original dataset from disk and take a look at it.

In [15]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,observation_id,Type,Date,Part of a standard enforcement protocol,Galactic X,Galactic Y,Reproduction,Age range,Self-defined species category,Officer-defined species category,Governing law,Object of inspection,Inspection involving more than just outerwear,Enforcement station,Inspection Outcome
0,8ca4b11e-2814-42ab-b8e9-bf467d6f9bc1,Entity inspection,2239-12-14 13:22:00+00:00,False,8.528523,-1398.677772,Asexual,Senior,Zoltrax - Diverse Clans,Zoltrax,Galactic Enforcement and Evidence Code 3984 (C...,Vandalism Tools,,Galactic Hub X101-IO,False
1,93930d11-d5cc-46a0-91eb-96fea69eea2f,Entity and Spaceship search,2241-06-07 14:00:00+00:00,False,23.330735,-3826.24054,Asexual,Adult,Zoltrax - Diverse Clans,Silicar,Interstellar Substance Control Ordinance 4719 ...,Regulated Star Substances,,Galactic Hub X101-IO,False
2,b68b83bc-9eba-484b-9369-9707132592c8,Entity inspection,2241-09-17 15:30:00+00:00,,-128.358204,21050.745456,Sexual,Adult,Terran - Northern Cluster,Terran,Interstellar Substance Control Ordinance 4719 ...,Regulated Star Substances,False,Helios Command C210-ZR,True
3,2bdea27c-2168-4884-9bda-68a862aea7e1,Entity and Spaceship search,2241-10-11 08:10:00+00:00,True,126.623742,-20766.293688,Asexual,Adult,Terran - Northern Cluster,Terran,Interstellar Substance Control Ordinance 4719 ...,Regulated Star Substances,False,Krypton Dock D175-WU,False
4,c8782f5f-1c50-4710-a129-d495a404c1c3,Entity and Spaceship search,2240-05-01 20:40:00+00:00,False,-23.578062,3866.802168,Asexual,Young Adult,Zoltrax - Diverse Clans,Zoltrax,Interstellar Substance Control Ordinance 4719 ...,Regulated Star Substances,,Galactic Hub X101-IO,False


Let's get rid of a few features that don't hold anything particularly useful, or that could introduce bias, and take another peek.

In [16]:
features_we_want = ['Type', 'Part of a standard enforcement protocol',
       'Age range', 'Object of inspection', 'Enforcement station']
full_ds = df[features_we_want + ['Inspection Outcome']].copy()

Now let's split it into X_train and y_train

In [17]:
X_train = full_ds.drop('Inspection Outcome', axis=1)
y_train = full_ds['Inspection Outcome']
X_train.head()

Unnamed: 0,Type,Part of a standard enforcement protocol,Age range,Object of inspection,Enforcement station
0,Entity inspection,False,Senior,Vandalism Tools,Galactic Hub X101-IO
1,Entity and Spaceship search,False,Adult,Regulated Star Substances,Galactic Hub X101-IO
2,Entity inspection,,Adult,Regulated Star Substances,Helios Command C210-ZR
3,Entity and Spaceship search,True,Adult,Regulated Star Substances,Krypton Dock D175-WU
4,Entity and Spaceship search,False,Young Adult,Regulated Star Substances,Galactic Hub X101-IO


# 2. Build the Pipeline

In [18]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier

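# String-valued categorical columns: fill gaps with a 'missing' category, then one-hot encode
categorical_transformer_others = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Boolean protocol flag: treat a missing value as True, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=True)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])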

preprocessor = ColumnTransformer(
    transformers=[
        ('cat_others', categorical_transformer_others, ['Type','Age range',
                    'Object of inspection', 'Enforcement station']),
        ('cat_enforcement', categorical_transformer, ['Part of a standard enforcement protocol'])
    ])

pipeline = make_pipeline(
    preprocessor,
    RandomForestClassifier(max_depth=3, min_samples_leaf=.03, class_weight="balanced", random_state=42, n_jobs=-1),
)

pipeline.fit(X_train, y_train)
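
No tuning is done in this notebook, but here is a minimal sketch of what a hyperparameter search over a pipeline like this one could look like. The toy data, the `toy_pipeline` name, and the grid values are all assumptions made for illustration, not settings that were tried on the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for X_train / y_train (hypothetical values, not the real dataset)
X_toy = pd.DataFrame({
    "Type": ["Entity inspection", "Entity and Spaceship search"] * 10,
    "Age range": ["Adult", "Senior", "Young Adult", "Adult"] * 5,
})
y_toy = pd.Series([True, False, False, True] * 5)

toy_pipeline = make_pipeline(
    ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Type", "Age range"])]
    ),
    RandomForestClassifier(class_weight="balanced", random_state=42),
)

# make_pipeline names each step after its lowercased class name, so the
# forest's parameters are addressed as "randomforestclassifier__<param>"
param_grid = {
    "randomforestclassifier__max_depth": [2, 3],
    "randomforestclassifier__n_estimators": [10, 20],
}

# Every combination in the grid is scored with cross-validated F1
search = GridSearchCV(toy_pipeline, param_grid, scoring="f1", cv=2)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

With the real data you would pass `pipeline`, `X_train` and `y_train` instead, and probably a larger grid and more folds.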

## 2.1 Evaluate

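Check how well the model does on the whole training set. We are not going to improve it, just measure it. Keep in mind that a score computed on the same data the model was trained on tends to be optimistic. For techniques on how to improve the model, check other specializations.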

In [20]:
import json
from sklearn import metrics


def fairness_score_f1(sensitive_class):
    '''
    Computes a fairness score corresponding to the largest difference
     in f1-score among sensitive classes. Classes with 40 or fewer
     observations are ignored.
    :param sensitive_class: dictionary with the following structure
      {
       'sensitive_class_1': {'outcomes': [...], 'predictions': [...]},
       'sensitive_class_2': {'outcomes': [...], 'predictions': [...]},
       ...
      }
    :return: maximum difference between f1-scores in sensitive classes
    '''
    f1_scores = []
    for s_class in sensitive_class.keys():
        outcomes = sensitive_class[s_class]["outcomes"]
        predictions = sensitive_class[s_class]["predictions"]
        # Only classes with more than 40 observations are compared
        if len(outcomes) > 40:
            f1_scores.append(metrics.f1_score(outcomes, predictions))

    f1_score_diff = 0
    if f1_scores:
        f1_score_diff = max(f1_scores) - min(f1_scores)

    return f1_score_diff


def score(data):
    '''
    Calculates a fairness-adjusted f1-score for a set of predictions
    :param data: DataFrame with the original columns plus 'real'
      (true outcome) and 'pred' (model prediction)
    :return: score as a float
    '''

    outcomes = []
    predictions = []
    sensitive_class_species = {}
    sensitive_class_reproduction = {}
    for row in data.itertuples():
        outcome = row.real
        prediction = row.pred
        # Anything other than True/False counts as a wrong prediction
        if prediction not in [True, False]:
            prediction = not outcome
        reproduction = row.Reproduction.lower()
        if reproduction not in sensitive_class_reproduction:
            sensitive_class_reproduction[reproduction] = {"outcomes": [], "predictions": []}

        sensitive_class_reproduction[reproduction]["outcomes"].append(outcome)
        sensitive_class_reproduction[reproduction]["predictions"].append(prediction)

        try:
            # itertuples() gives positional names to columns that are not valid
            # identifiers; with this DataFrame's column order, _9 should be
            # 'Self-defined species category'
            specie = row._9.lower()
        except AttributeError:
            # No usable species value: the row is left out of the species
            # and overall lists (it was already counted for reproduction)
            continue

        if specie not in sensitive_class_species:
            sensitive_class_species[specie] = {"outcomes": [], "predictions": []}

        sensitive_class_species[specie]["outcomes"].append(outcome)
        sensitive_class_species[specie]["predictions"].append(prediction)

        predictions.append(prediction)
        outcomes.append(outcome)

    overall_f1 = metrics.f1_score(outcomes, predictions, pos_label=True)

    species_diff = fairness_score_f1(sensitive_class_species)
    reproduction_diff = fairness_score_f1(sensitive_class_reproduction)
    max_diff = max(species_diff, reproduction_diff)

    # The penalty only applies when the largest f1 gap exceeds 0.15
    fairness_penalty = 0.0
    if max_diff > 0.15:
        fairness_penalty += max_diff

    final_score = overall_f1 - (0.1*fairness_penalty)

    return max(0, final_score)
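
The per-class f1 gap computed by `fairness_score_f1` can also be written more compactly with a pandas `groupby`. The sketch below is a simplified variant, not a drop-in replacement: the `f1_gap_by_group` name and the toy data are made up for illustration, it assumes the frame already has `real` and `pred` columns, and unlike the loop above it does not lowercase the class labels or repair invalid predictions.

```python
import pandas as pd
from sklearn.metrics import f1_score


def f1_gap_by_group(frame, group_col, min_size=40):
    """Largest f1 difference between groups with more than `min_size` rows."""
    group_f1 = [
        f1_score(group["real"], group["pred"])
        for _, group in frame.groupby(group_col)  # rows with a missing key are dropped
        if len(group) > min_size
    ]
    return max(group_f1) - min(group_f1) if group_f1 else 0.0


# Hypothetical toy data: every prediction is right for one class
# and wrong for the other
toy = pd.DataFrame({
    "Reproduction": ["Asexual"] * 50 + ["Sexual"] * 50,
    "real": [True, False] * 50,
    "pred": [True, False] * 25 + [False, True] * 25,
})
print(f1_gap_by_group(toy, "Reproduction"))  # 1.0
```

A gap of 1.0 is the worst case: the model is perfect for one class and useless for the other.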


In [21]:
y_pred = pipeline.predict(X_train)
f1_score(y_train, y_pred)

0.4789303079416532
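
Since this f1-score is measured on the same rows the model was fitted on, it is worth comparing it against predictions made on held-out rows. Here is a minimal sketch using cross-validation, again on made-up toy data and a hypothetical `toy_pipeline` rather than the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for X_train / y_train (hypothetical values)
X_toy = pd.DataFrame({"Age range": ["Adult", "Senior", "Young Adult", "Adult"] * 10})
y_toy = pd.Series([True, False, False, True] * 10)

toy_pipeline = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), ["Age range"])]),
    RandomForestClassifier(n_estimators=10, random_state=42),
)

# Each prediction comes from a model that did not see that row during fitting
held_out_pred = cross_val_predict(toy_pipeline, X_toy, y_toy, cv=5)
held_out_f1 = f1_score(y_toy, held_out_pred)
print(held_out_f1)
```

On the real data, a held-out score well below the training score would suggest overfitting.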

In [22]:
df['pred'] = y_pred
df['real'] = y_train
score(df)

0.43940131578947367

# 3. Serialization of the necessary components

In [23]:
with open('columns.json', 'w') as fh:
    json.dump(X_train.columns.tolist(), fh)

In [24]:
import joblib


joblib.dump(pipeline, 'pipeline.pickle') 

['pipeline.pickle']
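
To see how these two files might be used on the serving side, here is a rough sketch that saves a toy pipeline the same way, loads it back, and predicts for a single JSON-like payload. The toy data, the temporary directory, and the payload values are assumptions for illustration; the actual app code is not part of this notebook.

```python
import json
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the trained pipeline (hypothetical data)
X_toy = pd.DataFrame({"Age range": ["Adult", "Senior"] * 10})
y_toy = [True, False] * 10
toy_pipeline = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), ["Age range"])]),
    RandomForestClassifier(n_estimators=10, random_state=42),
).fit(X_toy, y_toy)

# Serialize into a scratch directory, mirroring the two cells above
workdir = tempfile.mkdtemp()
joblib.dump(toy_pipeline, os.path.join(workdir, "pipeline.pickle"))
with open(os.path.join(workdir, "columns.json"), "w") as fh:
    json.dump(X_toy.columns.tolist(), fh)

# What a serving process could do once at startup
with open(os.path.join(workdir, "columns.json")) as fh:
    columns = json.load(fh)
loaded = joblib.load(os.path.join(workdir, "pipeline.pickle"))

# What it could do per request: keep only the training columns, in training
# order; extra keys in the payload are ignored, missing ones become NaN
payload = {"observation_id": "abc", "Age range": "Adult", "Galactic X": 1.0}
obs = pd.DataFrame([payload], columns=columns)
prediction = bool(loaded.predict(obs)[0])
print(prediction)
```

Saving the column list alongside the pipeline is what lets the server rebuild a DataFrame with the same layout the model was trained on.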

# 4. Testing the App

In [142]:
import requests
import json


# To test against a local server instead, use: url = "http://localhost:8000/predict"
url = "https://railway-model-deploy-production-5b36.up.railway.app/predict"
data = {
    'observation_id': 'e4cb04ab-2611-46ec-82d7-eac3777306a1',
    'Type': 'Entity inspection',
    'Date': '2240-12-02 00:00:00+00:00',
    'Part of a standard enforcement protocol': None,
    'Galactic X': -406.232176,
    'Galactic Y': 66622.076864,
    'Reproduction': 'Asexual',
    'Age range': 'Young Adult',
    'Self-defined species category': 'Terran - Northern Cluster', 
    'Officer-defined species category': 'Terran',
    'Governing law': 'Interstellar Substance Control Ordinance 4719 (Directive 23)',
    'Object of inspection': 'Regulated Star Substances',
    'Inspection involving more than just outerwear': False,
    'Enforcement station': 'Dorset Delta I55-FO'
}
headers = {'Content-Type': 'application/json'}
r = requests.post(url, data=json.dumps(data), headers=headers)
r.json()