# Predicting Income with the Census Income Dataset Using Scikit Learn on ML Engine
To use Scikit Learn on ML Engine for predictions, you need a pre-trained model that you can upload to ML Engine. This Datalab notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to create a simple model, train it, upload it to ML Engine, and finally use it to make predictions.

# Part 0: Setup
* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)
* [Enable Cloud Machine Learning Engine and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)

## Download the data
The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample
uses for training is hosted by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/). The cell below downloads
the two files directly from the repository into `census_data/`:

 * Training file is `adult.data`
 * Evaluation file is `adult.test`


### Disclaimer
This dataset comes from a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

In [None]:
# Create a directory to hold the data
! mkdir census_data

# Download the data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output census_data/adult.data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output census_data/adult.test
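
The downloaded files separate fields with a comma followed by a space, which is why the label comparisons later in this notebook use strings with a leading space (`' >50K'`), and why the test labels carry a trailing period (`' >50K.'`). The sketch below parses two rows written in the format of the files using only the standard library. The helper names `parse_row` and `is_high_income` are hypothetical and not part of the notebook; they only illustrate how the raw labels could be normalized.

```python
import csv
import io

# Two rows in the format of adult.data and adult.test (used here for illustration).
SAMPLE_TRAIN_ROW = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
                    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n")
SAMPLE_TEST_ROW = ("25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
                   "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n")


def parse_row(line):
    """Split one CSV line; the space after each comma stays attached to the field."""
    return next(csv.reader(io.StringIO(line)))


def is_high_income(raw_label):
    """Normalize a raw label (leading space, optional trailing period) to a boolean."""
    return raw_label.strip().rstrip('.') == '>50K'


train_row = parse_row(SAMPLE_TRAIN_ROW)
test_row = parse_row(SAMPLE_TEST_ROW)

print(repr(train_row[-1]))  # label as stored in the training file
print(repr(test_row[-1]))   # label as stored in the test file
print(is_high_income(train_row[-1]), is_high_income(test_row[-1]))
```

Normalizing the labels this way is one option; the notebook instead compares against the exact raw strings, which works as long as the file format does not change.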

# Part 1: Train the model 
First, the data is loaded into pandas DataFrames that Scikit Learn can use. Then a simple model is created and fit to the training data.

In [None]:
import numpy as np
import pandas as pd
import pickle
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
import tensorflow as tf  # Tensorflow is only used to retrieve data from the census data files


# Define the format of your input data including unused columns (These are the columns from the census data files)
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value to be used by Scikit Learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)


# Load the training census dataset
with tf.gfile.Open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# Remove the column we are trying to predict ('income-level') from our features list
train_features = raw_training_data.drop('income-level', axis=1)
# Create our training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# Load the test census dataset
with tf.gfile.Open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# Remove the column we are trying to predict ('income-level') from our features list
test_features = raw_testing_data.drop('income-level', axis=1)
# Create our testing labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')


# Convert the categorical columns to a numerical value in both the training and testing dataset
from sklearn.preprocessing import LabelEncoder
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}

for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])

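# Reuse the encoders fitted on the training data so that each category maps to the
# same integer in both datasets (transform raises ValueError on unseen categories)
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].transform(test_features[col])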


# Create and train a classifier
classifier = RandomForestClassifier()
classifier.fit(train_features, train_labels)

# Export the classifier to a file
joblib.dump(classifier, 'model.joblib')

print('Done')
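
A label encoding assigns integers to categories in sorted order of the values seen at fit time, so the codes depend on which dataset the encoder was fitted on. The sketch below uses a plain-Python stand-in for `LabelEncoder` (the helpers `fit_encoding` and `apply_encoding` are hypothetical, and the category lists are illustrative) to show how the same category can receive different codes when an encoding is refit on data that is missing a category.

```python
def fit_encoding(values):
    """Map each distinct category to an integer by sorted order of the categories."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def apply_encoding(mapping, values):
    """Encode values with an existing mapping; raises KeyError on unseen categories."""
    return [mapping[value] for value in values]


# Illustrative values: the test data lacks one category present in training.
train_countries = [' Canada', ' Holand-Netherlands', ' United-States']
test_countries = [' Canada', ' United-States']

train_mapping = fit_encoding(train_countries)

# Reusing the training mapping keeps codes consistent with what the model saw.
consistent = apply_encoding(train_mapping, test_countries)

# Refitting on the test data shifts the code for ' United-States'.
refit = apply_encoding(fit_encoding(test_countries), test_countries)

print(consistent)  # [0, 2]
print(refit)       # [0, 1]
```

A model trained on the first encoding would misread the second, which is why the encoding used at prediction time should match the one used during training.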

# Part 2: Upload the model
Next, you'll need to upload the model to your project's storage bucket in GCS.

**Replace** `BUCKET_ID` with the bucket id you used in the setup step.

Note: The exact file name of the exported model you upload to GCS is important! Your model must be named “model.joblib”, “model.pkl”, or “model.bst”, depending on the library you used to export it. This restriction ensures that the model is reconstructed with the same technique for import as was used during export.

In [None]:
! gsutil cp ./model.joblib gs://[BUCKET_ID]/model.joblib
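
Because the file name has to match the export library, it can help to check the name before uploading. The following is a minimal sketch of such a check; the dictionary keys (`'joblib'`, `'pickle'`, `'xgboost'`) and the helper names are assumptions made for illustration, while the three file names come from the note above.

```python
import os

# Assumed mapping from export library to the required file name.
MODEL_FILENAMES = {
    'joblib': 'model.joblib',
    'pickle': 'model.pkl',
    'xgboost': 'model.bst',
}


def expected_model_filename(export_library):
    """Return the file name the exported model is expected to have."""
    try:
        return MODEL_FILENAMES[export_library]
    except KeyError:
        raise ValueError('Unknown export library: {}'.format(export_library))


def has_valid_model_name(path, export_library):
    """Check that the final path component matches the expected file name."""
    return os.path.basename(path) == expected_model_filename(export_library)


print(has_valid_model_name('./model.joblib', 'joblib'))       # True
print(has_valid_model_name('./census_model.joblib', 'joblib'))  # False
```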

# Part 3: Create a Model resource
**Replace** `MODEL_NAME` with the name for your model. (We used "scikit_learn_prediction".)

In [None]:
! gcloud ml-engine models create [MODEL_NAME] --regions us-central1

# Part 4: Create a Version resource

**Replace:**
* `VERSION` - with your version name, such as "v1"
* `MODEL_NAME` - with the model name you used above, such as "scikit_learn_prediction"
* `PROJECT_ID` - with your project's ID
* `BUCKET_ID` - with the bucket ID you created in the setup step

Note: If you require a feature of XGBoost that isn’t available in the publicly released version yet, you can specify “runtimeVersion”: “HEAD” instead to get the latest version of XGBoost available from the GitHub repo. Otherwise, the following versions are used:
* XGBoost: 0.6a2
* Scikit-learn: 0.19.0

In [None]:
! curl -X POST -H "Content-Type: application/json" \
   -d '{"name": "[VERSION]", "deploymentUri": "gs://[BUCKET_ID]/", "runtimeVersion": "1.4", "framework": "SCIKIT_LEARN"}' \
   -H "Authorization: Bearer `gcloud auth print-access-token`" \
    https://ml.googleapis.com/v1/projects/[PROJECT_ID]/models/[MODEL_NAME]/versions
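
The JSON payload in the `curl` command above can be easier to maintain when it is built from variables. The sketch below constructs the same request body and URL in Python; the helper names and the placeholder values (`'v1'`, `'my-bucket'`, `'my-project'`) are hypothetical, and the sketch only builds the request, it does not send it.

```python
import json


def build_version_body(version, bucket_id, runtime_version='1.4',
                       framework='SCIKIT_LEARN'):
    """Build the JSON body used to create a Version resource."""
    return {
        'name': version,
        'deploymentUri': 'gs://{}/'.format(bucket_id),
        'runtimeVersion': runtime_version,
        'framework': framework,
    }


def versions_url(project_id, model_name):
    """Build the REST endpoint that the curl command posts to."""
    return 'https://ml.googleapis.com/v1/projects/{}/models/{}/versions'.format(
        project_id, model_name)


# Placeholder values for illustration only.
body = build_version_body('v1', 'my-bucket')
url = versions_url('my-project', 'scikit_learn_prediction')

print(json.dumps(body, sort_keys=True))
print(url)
```

Note that `deploymentUri` points at the directory containing the model file, not at the file itself.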

# Part 5: Get online predictions
To get online predictions, the test features need to be converted from a pandas DataFrame to JSON lists of standard Python integers.

In [None]:
import json

data = []
for i in range(len(test_features)):
  # Ignore the 'income-level' column as it is not in the feature set, and
  # convert from numpy integers to standard integers so json can serialize them
  data.append([int(test_features[col][i]) for col in COLUMNS[:-1]])

# Write the test data to a json file
with open('data.json', 'w') as outfile:
  json.dump(data, outfile)

# Get one person that makes <50K and one that makes >50K to test our model.
print('Show a person that makes <50K:')
print('\tFeatures: {0} --> Label: {1}\n'.format(data[0], test_labels[0]))

with open('less_than_50K.json', 'w') as outfile:
  json.dump(data[0], outfile)

  
print('Show a person that makes >50K:')
print('\tFeatures: {0} --> Label: {1}'.format(data[2], test_labels[2]))

with open('more_than_50K.json', 'w') as outfile:
  json.dump(data[2], outfile)

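## Use gcloud to make online predictions
Use the two people gathered in the previous step (shown in the table) for the gcloud predictions. The encoded categorical values depend on how the label encoders were fit, so the values in your output may differ slightly.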

| **Person** | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income-level (label) |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| **1** | 25 | 4 | 226802 | 1 | 7 | 4 | 7 | 3 | 2 | 1 | 0 | 0 | 40 | 38 | False (<50K) |
| **2** | 28 | 2 | 336951 | 7 | 12 | 2 | 11 | 0 | 4 | 1 | 0 | 0 | 40 | 38 | True (>50K) |

**Replace:**
* `VERSION` - with your version name, such as "v1"
* `MODEL_NAME` - with the model name you used above, such as "scikit_learn_prediction"

Test the model with an online prediction using the data of a person who makes <50K.

In [None]:
! gcloud ml-engine predict --model [MODEL_NAME] --version [VERSION] --json-instances less_than_50K.json

Test the model with an online prediction using the data of a person who makes >50K.

In [None]:
! gcloud ml-engine predict --model [MODEL_NAME] --version [VERSION] --json-instances more_than_50K.json
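
The `--json-instances` flag reads a newline-delimited file in which each line is one JSON instance, so several people can be sent in a single call by writing one list per line. A minimal sketch, assuming hypothetical helper names and a temporary file path, using the two rows from the table above:

```python
import json
import os
import tempfile


def write_json_instances(instances, path):
    """Write one JSON-encoded instance per line."""
    with open(path, 'w') as outfile:
        for instance in instances:
            outfile.write(json.dumps(instance) + '\n')


def read_json_instances(path):
    """Read the file back, skipping blank lines."""
    with open(path) as infile:
        return [json.loads(line) for line in infile if line.strip()]


# The two people from the table above, as encoded feature lists.
people = [
    [25, 4, 226802, 1, 7, 4, 7, 3, 2, 1, 0, 0, 40, 38],
    [28, 2, 336951, 7, 12, 2, 11, 0, 4, 1, 0, 0, 40, 38],
]

path = os.path.join(tempfile.mkdtemp(), 'two_people.json')
write_json_instances(people, path)
round_trip = read_json_instances(path)
print(round_trip == people)
```

The resulting file could then be passed to `gcloud ml-engine predict` in place of the single-person files.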

## Use Python to make online predictions
Test the model with the entire test set.

In [None]:
import googleapiclient.discovery

# TODO: Fill in these variables with your own values
PROJECT_ID = 'project_id'
VERSION = 'version'  # Use your version name, such as "v1"
MODEL_NAME = 'model_name'  # Use the model name you used above, such as "scikit_learn_prediction"

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION)

response = service.projects().predict(
    name=name,
    body={'instances': data}
).execute()

if 'error' in response:
  print(response['error'])
else:
  print(response['predictions'])
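
Printing the raw predictions for the entire test set is hard to read, so a summary such as accuracy may be more useful. The sketch below assumes the predictions come back as a list of booleans in the same order as the submitted instances; `accuracy` is a hypothetical helper and the sample lists are made up for illustration.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    if not labels:
        raise ValueError('labels must not be empty')
    correct = sum(1 for predicted, actual in zip(predictions, labels)
                  if bool(predicted) == bool(actual))
    return correct / float(len(labels))


# Made-up values for illustration only.
sample_predictions = [False, False, True, True]
sample_labels = [False, True, True, True]

print(accuracy(sample_predictions, sample_labels))  # 0.75
```

With the notebook's variables, this could be called as `accuracy(response['predictions'], list(test_labels))` when the response contains no error.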