# Predicting Income with the Census Income Dataset Using Scikit Learn on ML Engine
To use Scikit Learn on ML Engine for predictions, you will need to have a pre-trained model that you can upload to ML Engine. This datalab use the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to create a simple model, train the model, upload the model to ML Engine, and lastly use the model to make predictions. 

# Part 0: Setup
* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)
* [Enable Cloud Machine Learning Engine and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282)

** Replace: **
* `PROJECT_ID <true-ability-192918>` - with your project's id
* `BUCKET_ID <true-ability-192918>` - with the bucket id you created above
* `MODEL <scikit_learn_prediction>` - with your model name, such as "scikit_learn_prediction"
* `VERSION <v1>` - with your version name, such as "v1"

In [3]:
%env PROJECT_ID true-ability-192918

env: PROJECT_ID=true-ability-192918


In [5]:
%env BUCKET_ID true-ability-192918

env: BUCKET_ID=true-ability-192918


In [12]:
%env MODEL scikit_learn_prediction

env: MODEL=scikit_learn_prediction


In [13]:
%env VERSION v1

env: VERSION=v1


## Download the data
The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample
uses for training is hosted by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/). We have hosted the data
on Google Cloud Storage in a slightly cleaned form:

 * Training file is `adult.data.csv`
 * Evaluation file is `adult.test.csv`


### Disclaimer
This dataset is provided by a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

In [10]:
# Create a directory to hold the data
! mkdir census_data

# Download the data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output census_data/adult.data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output census_data/adult.test

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3881k  100 3881k    0     0  2733k      0  0:00:01  0:00:01 --:--:-- 2733k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1956k  100 1956k    0     0  2034k      0 --:--:-- --:--:-- --:--:-- 2033k


# Part 1: Train the model 
First the data is loaded into a numpy array that can be used by Scikit Learn. Then a simple model is created and fit against the training data. 

In [19]:
import numpy as np
import pandas as pd
import pickle
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib

# Define the format of your input data including unused columns (These are the columns from the census data files)
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value to be used by Scikit Learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)


# Load the training census dataset
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# Remove the column we are trying to predict ('income-level') from our features list
train_features = raw_training_data.drop('income-level', axis=1)
# Create our training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# Load the test census dataset
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# Remove the column we are trying to predict ('income-level') from our features list
test_features = raw_testing_data.drop('income-level', axis=1)
# Create our training labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')


# Convert the categorical columns to a numerical value in both the training and testing dataset
from sklearn.preprocessing import LabelEncoder
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}

for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])

for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])


# Create and train a classifier
classifier = RandomForestClassifier()
classifier.fit(train_features, train_labels)

# Export the classifier to a file
joblib.dump(classifier, 'model.joblib')

print('Done')

Done


# Part 2: Upload the model
Next, you'll need to upload the model to your project's storage bucket in GCS.

Note: The exact file name of of the exported model you upload to GCS is important! Your model must be named  “model.joblib”, “model.pkl”, or “model.bst” with respect to the library you used to export it. This restriction ensures that the model will be safely reconstructed later by using the same technique for import as was used during export.

In [14]:
! gsutil cp ./model.joblib gs://$BUCKET_ID/model.joblib

Copying file://./model.joblib [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/6.5 MiB.                                      


# Part 3: Create a Model resource

In [16]:
! gcloud ml-engine models create $MODEL --regions us-central1

Created ml engine model [projects/true-ability-192918/models/scikit_learn_prediction].


# Part 4: Create a Version resource
This can take several minutes.

Note: If you require a feature of Scikit-learn that isn’t available in the publicly released version yet, you can specify “runtimeVersion”: “HEAD” instead, and that would get the latest version of Scikit-learn available from the github repo. Otherwise the following versions will be used:
* Scikit-learn: 0.19.0

In [25]:
! curl -X POST -H "Content-Type: application/json" \
   -d '{"name": "'$VERSION'", "deploymentUri": "gs://'$BUCKET_ID'/", "runtimeVersion": "1.4", "framework": "SCIKIT_LEARN"}' \
   -H "Authorization: Bearer `gcloud auth print-access-token`" \
    https://ml.googleapis.com/v1/projects/$PROJECT_ID/models/$MODEL/versions

{
  "name": "projects/true-ability-192918/operations/create_scikit_learn_prediction_v1-1516657639682",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.ml.v1.OperationMetadata",
    "createTime": "2018-01-22T21:47:20Z",
    "operationType": "CREATE_VERSION",
    "modelName": "projects/true-ability-192918/models/scikit_learn_prediction",
    "version": {
      "name": "projects/true-ability-192918/models/scikit_learn_prediction/versions/v1",
      "deploymentUri": "gs://true-ability-192918/",
      "createTime": "2018-01-22T21:47:19Z",
      "etag": "+0zOopN1SC8=",
      "framework": "SCIKIT_LEARN"
    }
  }
}


# Part 5: Get online predictions
To get online predictions, the data needs to be converted from a numpy array to a json array. 

In [20]:
import json

data = []
for i in range(len(test_features)):
  data.append([])
  for col in COLUMNS[:-1]: # Ignore the 'income-level' column as it is not in the feature set.
    # Convert from numpy integers to standard integers
    data[i].append(int(np.uint64(test_features[col][i]).item()))

# Write the test data to a json file
with open('data.json', 'w') as outfile:
  json.dump(data, outfile)

# Get one person that makes <=50K and one that makes >50K to test our model.
print('Show a person that makes <=50K:')
print('\tFeatures: {0} --> Label: {1}\n'.format(data[0], test_labels[0]))

with open('less_than_50K.json', 'w') as outfile:
  json.dump(data[0], outfile)

  
print('Show a person that makes >50K:')
print('\tFeatures: {0} --> Label: {1}'.format(data[2], test_labels[2]))

with open('more_than_50K.json', 'w') as outfile:
  json.dump(data[2], outfile)

Show a person that makes <=50K:
	Features: [25, 4, 226802, 1, 7, 4, 7, 3, 2, 1, 0, 0, 40, 38] --> Label: False

Show a person that makes >50K:
	Features: [28, 2, 336951, 7, 12, 2, 11, 0, 4, 1, 0, 0, 40, 38] --> Label: True


## Use Gcloud to make online predictions
Use the two people (as seen in the table) gathered in the previous step for the gcloud predictions.

| **Person** | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country || (Label) income-level|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:||:-:
| **1** | 25| 4 | 226802 | 1 | 7 | 4 | 7 | 3 | 2 | 1 | 0 | 0 | 40 | 38 || False (<=50K) |
| **2** | 28| 2 | 336951 | 7 | 12 | 2 | 11 | 0 | 4 | 1 | 0 | 0 | 40 | 38 || True (>50K) |

Creating a model version can take several minutes, check the status of your model version to see if it is available.

In [22]:
! gcloud ml-engine versions list --model $MODEL

NAME  DEPLOYMENT_URI             STATE
v1    gs://true-ability-192918/  CREATING


In [27]:
! gcloud ml-engine versions list --model $MODEL

NAME  DEPLOYMENT_URI             STATE
v1    gs://true-ability-192918/  READY


Test the model with an online prediction using the data of a person who makes <=50K.

Note: If you see an error, the model from Part 4 may not be created yet as it takes several minutes for a new model version to be created.

In [28]:
! gcloud ml-engine predict --model $MODEL --version $VERSION --json-instances less_than_50K.json

[False]


Test the model with an online prediction using the data of a person who makes >50K.

In [29]:
! gcloud ml-engine predict --model $MODEL --version $VERSION --json-instances more_than_50K.json

[True]


## Use Python to make online predictions
Test the model with the entire test set.

In [30]:
import googleapiclient.discovery
import os

PROJECT_ID = os.environ['PROJECT_ID']
VERSION = os.environ['VERSION']
MODEL = os.environ['MODEL']

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL)
name += '/versions/{}'.format(VERSION)

response = service.projects().predict(
    name=name,
    body={'instances': data}
).execute()

if 'error' in response:
  print response['error']
else:
  print response['predictions']

[2018-01-22 21:50:24,183] {discovery.py:273} INFO - URL being requested: GET https://www.googleapis.com/discovery/v1/apis/ml/v1/rest
[2018-01-22 21:50:24,650] {discovery.py:863} INFO - URL being requested: POST https://ml.googleapis.com/v1/projects/true-ability-192918/models/scikit_learn_prediction/versions/v1:predict?alt=json
[False, False, True, True, False, False, False, True, False, False, True, False, False, False, False, True, False, True, False, True, True, False, False, False, False, True, False, False, True, False, True, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, True, True, False, False, False, False, False, False, False, True, False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, F