# Introduction
This series of notebooks illustrates how to connect a model to Certifai, run a scan, and perform some simple analyses.

## Part 1 - Train models
The first notebook in the series (this one) performs the data preprocessing needed for model building and trains two simple models on the `German Credit` dataset.

This dataset was sourced from Kaggle: https://www.kaggle.com/uciml/german-credit
The original source is: https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Note: this notebook does not use Certifai itself. It only generates some models to work with in the later examples.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
import random
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Set random seeds so we have a deterministic result
random.seed(42)
np.random.seed(42)
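Seeding the global generators works because scikit-learn estimators fall back to NumPy's global random state when `random_state` is left unset, but the result then depends on the order in which cells are run. As a minimal sketch of an alternative (not what this notebook does), the seed can be passed to the estimator directly. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data, only for illustration
rng = np.random.RandomState(0)
toy_X = rng.rand(50, 3)
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 1.0).astype(int)

# Two trees built with the same explicit seed on the same data
tree_a = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)
tree_b = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)

# Compare the feature chosen at every node of the two fitted trees
identical = bool(np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature))
print(identical)
```

With an explicit `random_state`, the fitted tree no longer depends on what else consumed random numbers earlier in the session.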

# Data Preparation

Load the data and separate the label column from the features.

In [2]:
base_path = '..'
all_data_file = f"{base_path}/datasets/german_credit_eval.csv"

df = pd.read_csv(all_data_file)

cat_columns = [
    'checkingstatus',
    'history',
    'purpose',
    'savings',
    'employ',
    'status',
    'others',
    'property',
    'age',
    'otherplans',
    'housing',
    'job',
    'telephone',
    'foreign'
    ]

label_column = 'outcome'

# Separate outcome
y = df[label_column]
X = df.drop(label_column, axis=1)
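After a split like this, it can be worth checking that every declared categorical column is present in the feature frame and seeing how the label is distributed. The sketch below does this on a small made-up frame. The names `toy_df`, `toy_cat_columns` and `toy_label` are hypothetical stand-ins, and the values are not taken from the real dataset.

```python
import pandas as pd

# Made-up stand-in for the real data, only for illustration
toy_df = pd.DataFrame({
    "housing": ["own", "rent", "own", "free"],
    "duration": [6, 48, 12, 24],
    "outcome": [1, 2, 1, 1],
})
toy_cat_columns = ["housing"]
toy_label = "outcome"

# Same separation as in the cell above
toy_y = toy_df[toy_label]
toy_X = toy_df.drop(toy_label, axis=1)

# Every declared categorical column should exist among the features
missing = [c for c in toy_cat_columns if c not in toy_X.columns]
assert not missing, f"Missing categorical columns: {missing}"

# Number of rows per label value
label_counts = toy_y.value_counts().to_dict()
print(label_counts)
```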

# Train Models
Train two simple models. We use a logistic regression classifier and a decision tree for the sake of example, but any model family would work.

*Note*: the one-hot encoding is part of the model pipeline, so the resulting models accept unpreprocessed input. A real pipeline might instead preprocess the data once and always work on pre-encoded data, but wrapping everything in the model pipeline keeps the number of intermediate assets in this example to a minimum.
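To make the idea concrete, here is a minimal, self-contained sketch of a pipeline that accepts raw rows. The column names and values are made up, and the encoder is left at its defaults, so this is an illustration of the pattern rather than the exact configuration used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data: one categorical and one numeric column
toy_X = pd.DataFrame({
    "housing": ["own", "rent", "own", "free", "rent", "own"],
    "amount": [1000.0, 5000.0, 1500.0, 7000.0, 6000.0, 1200.0],
})
toy_y = [1, 0, 1, 0, 0, 1]

# Scale the numeric column and one-hot encode the categorical one
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["amount"]),
    ("cat", OneHotEncoder(), ["housing"]),
])

# The encoding lives inside the pipeline, in front of the model
toy_pipe = Pipeline([("encoder", toy_preprocessor), ("model", LogisticRegression())])
toy_pipe.fit(toy_X, toy_y)

# Raw, un-encoded rows go straight into predict
raw_rows = pd.DataFrame({"housing": ["own", "rent"], "amount": [1100.0, 6500.0]})
toy_predictions = toy_pipe.predict(raw_rows)
print(toy_predictions)
```

Callers of `toy_pipe` never see the encoded representation, which is what lets a downstream tool pass in rows in the dataset's original form.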

In [3]:
# Reuse the categorical column list defined during data preparation
cat_cols = list(cat_columns)

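# Column positions (rather than names) of the categorical and numeric features
categorical_features = [X.columns.get_loc(c) for c in cat_cols]
numeric_features = [X.columns.get_loc(c) for c in X.columns if c not in cat_cols]
numeric_transformer = StandardScaler()
# `sparse_output` replaced `sparse` in scikit-learn 1.2 (`sparse` was removed in 1.4);
# on scikit-learn < 1.2 use OneHotEncoder(sparse=False) instead
categorical_transformer = preprocessing.OneHotEncoder(sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Note: Pipeline.fit below refits the preprocessor on the training data only,
# so this initial fit on the full dataset does not carry over into the models
preprocessor.fit(X)

def build_model(train_data, train_labels, test_data, test_labels, model, name):
    """Fit preprocessor + model as one pipeline and report holdout accuracy."""
    # Both pipelines share the same preprocessor instance (Pipeline does not clone its steps)
    pipe = Pipeline([('encoder', preprocessor), (name, model)])
    pipe.fit(train_data, train_labels)
    print(f"Model {name} accuracy on holdout set: {pipe.score(test_data, test_labels)}")
    return pipe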

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

logistic_model = build_model(X_train, y_train, X_test, y_test, LogisticRegression(), 'Logistic')
dtree_model = build_model(X_train, y_train, X_test, y_test, DecisionTreeClassifier(), 'Decision tree')

Model Logistic accuracy on holdout set: 0.77
Model Decision tree accuracy on holdout set: 0.69


# Save the trained models

We pickle the models for use in subsequent notebooks.

In [4]:
model_dict = {
    'logistic': logistic_model,
    'dtree': dtree_model
}

with open('models.pkl', 'wb') as f:
    pickle.dump(model_dict, f)
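A later notebook can read the dictionary back with `pickle.load`. The sketch below shows the round trip on a toy model written to a temporary file. The file name `toy_models.pkl` and the toy training data are hypothetical, chosen so the example runs on its own without `models.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Made-up stand-in for one of the trained pipelines
toy_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_models.pkl")

    # Save a dictionary of models, as in the cell above
    with open(toy_path, "wb") as f:
        pickle.dump({"logistic": toy_model}, f)

    # Load it back, as a subsequent notebook would
    with open(toy_path, "rb") as f:
        loaded = pickle.load(f)

# The restored model should predict exactly like the original
restored = loaded["logistic"]
probe = [[0.5], [2.5]]
same_predictions = list(restored.predict(probe)) == list(toy_model.predict(probe))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.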