In [None]:
# Only execute if you haven't already. Make sure to restart the kernel if these libraries have not been previously installed.
!pip install xgboost==0.82 --user
!pip install scikit-learn==0.20.4 --user

# Import Python packages

Execute the cell below (__Shift + Enter__) to load all the Python libraries we'll need for the lab.

In [24]:
import datetime
import pickle
import os

import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.utils import shuffle

from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

import custom_transforms

import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# Download and process data

The models you'll build will predict an individual's income level, that is, whether the individual makes more than $50,000 per year, given 14 data points about that individual. You'll train your models on the UCI [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Adult).

We'll read the data into a Pandas DataFrame to see what we'll be working with. It's important to shuffle our data in case the original dataset is ordered in a specific way. We use the sklearn utility `shuffle`, which we imported above, to do this. Passing a fixed `random_state` makes the shuffle reproducible:

In [3]:
train_csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

raw_train_data = pd.read_csv(train_csv_path, names=COLUMNS, skipinitialspace=True)
raw_train_data = shuffle(raw_train_data, random_state=4)

`raw_train_data.head()` lets us preview the first five rows of our dataset in Pandas.

In [4]:
raw_train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-level
28762,25,Private,307643,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,<=50K
4823,34,Private,424988,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,45,United-States,<=50K
3106,42,Local-gov,245307,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1977,48,United-States,>50K
11293,44,Private,56483,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,37,United-States,<=50K
7008,49,Private,215389,Bachelors,13,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,0,0,48,United-States,<=50K


The `income-level` column is the thing our model will predict. This is the binary outcome of whether the individual makes more than $50,000 per year. To see the distribution of income levels in the dataset, run the following:

In [5]:
print(raw_train_data['income-level'].value_counts())

<=50K    24720
>50K      7841
Name: income-level, dtype: int64
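The counts above show that the two classes are imbalanced, which is worth keeping in mind when judging accuracy later. As a rough sketch (the variable names are illustrative, and the counts are copied from the output above), the share of the majority class is the accuracy of a trivial model that always predicts `<=50K`:

```python
# Counts copied from the value_counts() output above.
label_counts = {'<=50K': 24720, '>50K': 7841}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_share = label_counts[majority_label] / total

# Accuracy of a baseline that always predicts the majority class.
print(f'{majority_label}: {majority_share:.3f} of {total} rows')
```

Any model we train should beat this baseline to be considered useful.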


As explained in [this paper](http://cseweb.ucsd.edu/classes/sp15/cse190-c/reports/sp15/048.pdf), each entry in the dataset contains the following information about an individual:

* __age__: the age of an individual.
* __workclass__: a general term for the employment status of an individual.
* __fnlwgt__: final weight. In other words, this is the number of people the census believes the entry represents...
* __education__: the highest level of education achieved by an individual.
* __education-num__: the highest level of education achieved, in numerical form.
* __marital-status__: the marital status of an individual.
* __occupation__: the general type of occupation of an individual.
* __relationship__: what this individual is relative to others. For example, an individual could be a Husband. Each entry has only one relationship attribute, and it is somewhat redundant with marital status.
* __race__: a description of an individual's race.
* __sex__: the biological sex of the individual.
* __capital-gain__: capital gains for an individual.
* __capital-loss__: capital loss for an individual.
* __hours-per-week__: the hours an individual has reported to work per week.
* __native-country__: the country of origin for an individual.
* __income-level__: whether or not an individual makes more than $50,000 annually.

An important concept in machine learning is train / test split. We'll take the majority of our data and use it to train our model, and we'll set aside the rest for testing our model on data it's never seen before. There are many ways to create training and test datasets. Fortunately, for our census data we can simply download a pre-defined test set. 

In [6]:
test_csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
raw_test_data = pd.read_csv(test_csv_path, names=COLUMNS, skipinitialspace=True, skiprows=1)

In [7]:
raw_test_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-level
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.
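For datasets that do not ship with a pre-defined test set, a holdout split has to be created by hand. The following is a minimal sketch of that idea using only the standard library; `holdout_split`, its parameters, and the toy `rows` list are hypothetical and are not used elsewhere in this lab:

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=4):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for dataset rows
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split identical across runs, so results stay comparable.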


Since we don't want to train a model on our labels, we're going to separate them from the features in both the training and test datasets. Also, notice that `income-level` is a string datatype. For machine learning, it's better to convert this to a binary integer datatype. We do this in the next cell.

In [9]:
raw_train_features = raw_train_data.drop('income-level', axis=1).values
raw_test_features = raw_test_data.drop('income-level', axis=1).values

# Create training labels list
train_labels = (raw_train_data['income-level'] == '>50K').values.astype(int)
test_labels = (raw_test_data['income-level'] == '>50K.').values.astype(int)
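Note that the test file writes its labels with a trailing period (`>50K.`), which is why the two comparisons above use different strings. As an illustration, a hypothetical helper (`to_binary_label` is not part of the notebook) that accepts both spellings could look like this:

```python
def to_binary_label(raw_label):
    """Map an income-level string to 1 (>50K) or 0 (<=50K)."""
    # Drop surrounding whitespace and the trailing period used in adult.test.
    cleaned = raw_label.strip().rstrip('.')
    if cleaned not in ('>50K', '<=50K'):
        raise ValueError(f'unexpected label: {raw_label!r}')
    return int(cleaned == '>50K')

print([to_binary_label(s) for s in ('<=50K', '>50K', '<=50K.', '>50K.')])
```

Raising on unknown strings makes a silent all-zeros label vector less likely if the file format changes.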

Now you're ready to build and train your first model!

# Build a First Model

The model we build will closely follow a template for the [census dataset found on AI Hub](https://aihub.cloud.google.com/p/products%2F526771c4-9b36-4022-b9c9-63629e9e3289). For our model we'll use an XGBoost classifier. However, before we can train our model we have to pre-process the data a little. We'll build a processing pipeline using [Scikit-Learn's Pipeline constructor](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), applying some custom transformations that are defined in `custom_transforms.py`. Open the file `custom_transforms.py` and inspect the code. Our features are either numerical or categorical. The numerical features we use are `age` and `hours-per-week` (columns 0 and 12). These features are standardized with [Scikit-Learn's StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). The categorical features we use are `workclass`, `education`, `marital-status`, and `relationship` (columns 1, 3, 5, and 7). These features are [one-hot encoded](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).

In [10]:
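# Column positions in COLUMNS: age (0) and hours-per-week (12)
numerical_indices = [0, 12]
# workclass (1), education (3), marital-status (5), relationship (7)
categorical_indices = [1, 3, 5, 7]

# Categorical branch: select the columns, apply StripString, one-hot encode
p1 = make_pipeline(
    custom_transforms.PositionalSelector(categorical_indices),
    custom_transforms.StripString(),
    custom_transforms.SimpleOneHotEncoder()
)
# Numerical branch: select the columns, then standardize them
p2 = make_pipeline(
    custom_transforms.PositionalSelector(numerical_indices),
    StandardScaler()
)
# Concatenate the outputs of both branches into one feature matrix
p3 = FeatureUnion([
    ('categoricals', p1),
    ('numericals', p2),
])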
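To make the two transformations concrete, here is a small standard-library sketch of what standardization and one-hot encoding do to a handful of values. It is a generic illustration under simplifying assumptions, not the code in `custom_transforms.py`; the helper names `standardize` and `one_hot` are hypothetical:

```python
import statistics

def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Return (categories, rows) with one indicator column per category."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

ages = [25, 35, 45]
scaled_ages = standardize(ages)
categories, encoded = one_hot(['Private', 'Local-gov', 'Private'])
print(scaled_ages)
print(categories, encoded)
```

Each categorical value becomes a row with a single 1 in the column of its category, while each numerical column ends up centered on zero.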

To finalize the pipeline we attach an XGBoost classifier at the end. The complete pipeline object takes the raw data we loaded from csv files, processes the categorical features, processes the numerical features, concatenates the two, and then passes the result through the XGBoost classifier.   

In [11]:
pipeline = make_pipeline(
    p3,
    xgb.sklearn.XGBClassifier(max_depth=4)
)

We can train our model with a single call to the pipeline's `fit()` method, passing it our training features and labels.

In [12]:
pipeline.fit(raw_train_features, train_labels)



Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=None,
       transformer_list=[('numericals', Pipeline(memory=None,
     steps=[('positionalselector', PositionalSelector(positions=[1, 3, 5, 7])), ('stripstring', StripString()), ('simpleonehotencoder', SimpleOneHotEncoder())])), ('categoricals', Pipeline...
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))])

Let's go ahead and save our model as a pickle file. Executing the command below will save the trained model in the file `model.pkl` in the same directory as this notebook. 

In [13]:
with open('model.pkl', 'wb') as model_file:
    pickle.dump(pipeline, model_file)
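Before uploading a pickle file it can be worth confirming that it loads back correctly. The sketch below shows the round trip with a stand-in dictionary instead of the trained pipeline and a temporary directory instead of the notebook directory; both substitutions are assumptions made to keep the example self-contained:

```python
import os
import pickle
import tempfile

# Stand-in for the trained pipeline; any picklable object works the same way.
original = {'model': 'stand-in', 'max_depth': 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    with open(path, 'wb') as model_file:
        pickle.dump(original, model_file)
    with open(path, 'rb') as model_file:
        restored = pickle.load(model_file)

print(restored == original)
```

For the real pipeline, unpickling only works where `custom_transforms` is importable, because pickle stores references to the custom classes rather than their code. Only unpickle files you trust.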

# Save Trained Model to AI Platform

We've got our model working locally, but it would be nice if we could make predictions on it from anywhere (not just this notebook!). In this step we'll deploy it to the cloud. For detailed instructions on how to do this, visit [the official documentation](https://cloud.google.com/ai-platform/prediction/docs/exporting-for-prediction). Note that since we have custom components in our data pipeline, we need to go through a few extra steps.

## Create a Cloud Storage bucket for the model

We first need to create a storage bucket to store our pickled model file. We'll point Cloud AI Platform at this file when we deploy. Run this gsutil command to create a bucket, replacing `your-gcp-project` with your Qwiklabs project ID. Naming the bucket after your project ID ensures that the bucket name is globally unique.

In [15]:
%%bash

QWIKLABS_PROJECT_ID=your-gcp-project

gsutil mb gs://$QWIKLABS_PROJECT_ID

## Package custom transform code

Since we're using custom transformation code, we need to package it up and direct AI Platform to it when we ask it to make predictions. To package our custom code we create a source distribution. The following code creates this distribution and then copies the distribution and the model file to the bucket we created.

In [None]:
%%bash

QWIKLABS_PROJECT_ID=your-gcp-project

python setup.py sdist --formats=gztar

gsutil cp model.pkl gs://$QWIKLABS_PROJECT_ID/
gsutil cp dist/custom_transforms-0.1.tar.gz gs://$QWIKLABS_PROJECT_ID/

## Create and Deploy Model

The following ai-platform gcloud command will create a new model in your project. We'll call this one `census_income_classifier`.

In [20]:
!gcloud ai-platform models create census_income_classifier --regions us-central1

Created ml engine model [projects/nytaxi-query-test/models/census_income_classifier].


Now it's time to deploy the model. We can do that with this gcloud command (remember to replace `your-gcp-project` with your Qwiklabs project ID):

In [None]:
%%bash

QWIKLABS_PROJECT_ID=your-gcp-project

MODEL_DIR="gs://$QWIKLABS_PROJECT_ID/"
CUSTOM_CODE_PATH="gs://$QWIKLABS_PROJECT_ID/custom_transforms-0.1.tar.gz"
VERSION_NAME="v1"
MODEL_NAME="census_income_classifier"
FRAMEWORK="SCIKIT_LEARN"

gcloud beta ai-platform versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --origin $MODEL_DIR \
  --runtime-version=1.15 \
  --framework $FRAMEWORK \
  --python-version=3.7 \
  --package-uris=$CUSTOM_CODE_PATH

While this is running, check the [models section](https://console.cloud.google.com/ai-platform/models) of your AI Platform console. You should see your new version deploying there. When the deploy completes successfully you'll see a green check mark where the loading spinner is. The deploy should take 2-3 minutes.

## Test the deployed model

To make sure your deployed model is working, test it out using gcloud to make a prediction. First, save a JSON file with one test instance for prediction:

In [22]:
%%writefile predictions.json
[25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct", "Own-child", "Black", "Male", 0, 0, 40, "United-States"]

Writing predictions.json
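To test more than one instance, the file passed to `--json-instances` can hold several instances, one JSON value per line. The sketch below builds such a payload from the first two rows of the test set shown earlier; the variable names are illustrative, and writing `payload` to `predictions.json` is left out to keep the example free of side effects:

```python
import json

# Feature order must match the training features (the label is not included).
instances = [
    [25, 'Private', 226802, '11th', 7, 'Never-married', 'Machine-op-inspct',
     'Own-child', 'Black', 'Male', 0, 0, 40, 'United-States'],
    [38, 'Private', 89814, 'HS-grad', 9, 'Married-civ-spouse', 'Farming-fishing',
     'Husband', 'White', 'Male', 0, 0, 50, 'United-States'],
]

# Newline-delimited JSON: one line per instance.
payload = '\n'.join(json.dumps(instance) for instance in instances)
print(payload)
```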


Test your model by running this code:

In [23]:
!gcloud ai-platform predict --model=census_income_classifier --json-instances=predictions.json --version=v1

[0]


You should see your model's prediction in the output.

# What-If Tool

To connect the What-If Tool to your AI Platform models, you need to pass it a subset of your test examples along with the ground truth values for those examples. Let's create a NumPy array of 2000 of our test examples, with the label appended as the last column.

In [39]:
num_datapoints = 2000  

test_examples = np.hstack(
    (raw_test_features[:num_datapoints], 
     test_labels[:num_datapoints].reshape(-1,1)
    )
)
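The `reshape(-1, 1)` call is what lets `np.hstack` append the labels as an extra column. A small sketch with two made-up rows (the values and variable names are illustrative only) shows the shapes involved:

```python
import numpy as np

# Two illustrative rows; dtype=object keeps numbers and strings side by side.
features = np.array([[25, 'Private'], [38, 'Local-gov']], dtype=object)
labels = np.array([0, 1])

# reshape(-1, 1) turns the 1-D label vector into a single column,
# so hstack can place it to the right of the feature columns.
combined = np.hstack((features, labels.reshape(-1, 1)))
print(combined.shape)
print(combined.tolist())
```

Because the label ends up as the last column, the column names passed to the What-If Tool need to list the label last as well, which is the case for `COLUMNS`.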

Instantiating the What-If Tool is as simple as creating a `WitConfigBuilder` object and pointing it at the AI Platform model we deployed, using `set_ai_platform_model`. Replace `nytaxi-query-test` in the cell below with your own project ID.

We also pass a prediction adjustment function. It is needed because this XGBoost model's prediction returns just a score for the positive class of the binary classification, whereas the What-If Tool expects a list of scores for each class (in this case, both the negative class and the positive class).

The visualization may take a minute to load.


In [47]:
def adjust_prediction(pred):
    return [1 - pred, pred]

config_builder = (
    WitConfigBuilder(test_examples.tolist(), COLUMNS)
    .set_ai_platform_model('nytaxi-query-test', 'census_income_classifier', 'v1', adjust_prediction=adjust_prediction)
    .set_target_feature('income-level')
    .set_model_type('classification')
)

WitWidget(config_builder, height=800)

WitWidget(config={'inference_address': 'nytaxi-query-test', 'label_vocab': [], 'uses_json_input': True, 'model…
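To see what the adjustment function does on its own, the sketch below applies it to a few example scores. The function body is the one defined above; the sample scores are made up for illustration:

```python
def adjust_prediction(pred):
    """Expand a positive-class score into [negative score, positive score]."""
    return [1 - pred, pred]

for score in (0, 0.25, 1):
    print(score, '->', adjust_prediction(score))
```

The two entries always sum to 1, so the What-If Tool can treat them as scores for the `<=50K` and `>50K` classes respectively.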

# Narrative for identifying bias...

Aha! We found a bias

# Make changes to model to identify bias

Redeploy to AI Platform. Train in AI Platform?

# What-If Tool to show new model is less biased