## Confidentiality

This notebook was downloaded from GCP AI Hub and is for demonstration purposes only.

Please do not copy or distribute this notebook.

## Introduction
This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to demonstrate how to train a model and generate local predictions.


## The data
This sample trains on the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income), which is provided by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/). The data is available both from the UC Irvine dataset repository and from a public GCS bucket that Google hosts at `gs://cloud-samples-data/ml-engine/sklearn/census_data/`.

 * Training file is `adult.data`
 * Evaluation file is `adult.test`
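
As a rough illustration of how the bucket path above relates to a downloadable address, the sketch below maps a `gs://` URI to an HTTPS URL. It assumes the objects are publicly readable through the `https://storage.googleapis.com/<bucket>/<object>` form and that the bucket uses the same two file names listed above. The helper `gcs_to_public_url` is a hypothetical name introduced here, not part of the original notebook.

```python
def gcs_to_public_url(gcs_uri):
    """Map a gs:// URI to its HTTPS form (assumes the object is publicly readable)."""
    prefix = 'gs://'
    if not gcs_uri.startswith(prefix):
        raise ValueError('Expected a gs:// URI, got: {}'.format(gcs_uri))
    return 'https://storage.googleapis.com/' + gcs_uri[len(prefix):]


# Bucket directory quoted in the text above.
BUCKET_DIR = 'gs://cloud-samples-data/ml-engine/sklearn/census_data/'

# Assumption: the bucket stores the files under the same names as listed above.
train_url = gcs_to_public_url(BUCKET_DIR + 'adult.data')
test_url = gcs_to_public_url(BUCKET_DIR + 'adult.test')

print(train_url)
print(test_url)
```

The resulting URLs could be passed to `curl` in place of the UC Irvine addresses, provided the assumption about public access holds.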


### Disclaimer
This dataset is provided by a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

# Build your model

First, the model is created (provided below). This is similar to the normal process for creating a scikit-learn model, with one key difference:

1. The data is downloaded at the start of the file so that it can be accessed.

The code in this file loads the data into a pandas DataFrame that can be used by scikit-learn. Then the model is fit against the training data. Lastly, joblib (which older scikit-learn releases also bundle as `sklearn.externals.joblib`) is used to save the model to a file that can be uploaded to [ML Engine's prediction service](https://cloud.google.com/ml-engine/docs/scikit/getting-predictions#deploy_models_and_versions).

In [None]:
import pandas as pd
import json

from sklearn.ensemble import RandomForestClassifier
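try:
    # Standalone package; needed for scikit-learn 0.23 and later
    import joblib
except ImportError:
    # Bundled copy in older scikit-learn releases
    from sklearn.externals import joblib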
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer



Add code to download the data (in this case, from the publicly hosted UC Irvine copy) so that it is available when training the model.

In [None]:
# Downloading the data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output adult.data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output adult.test



# Reading in the data

In [None]:
# Defining the format of the input data including unused columns (These are the columns from the census data files)
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

Loading the training census dataset

In [None]:
with open('./adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)

# Remove the column to be predicted ('income-level') from the features
# and convert the DataFrame to a list of lists
train_features = raw_training_data.drop('income-level', axis=1).values.tolist()
# Create the training labels as a flat list of booleans (True means income >50K)
train_labels = (raw_training_data['income-level'] == ' >50K').values.tolist()

Loading the test census dataset

In [None]:
with open('./adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# Remove the column to be predicted ('income-level') from the features
# and convert the DataFrame to a list of lists
test_features = raw_testing_data.drop('income-level', axis=1).values.tolist()
# Create the test labels as a flat list of booleans (labels in adult.test end with a period)
test_labels = (raw_testing_data['income-level'] == ' >50K.').values.tolist()
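
The two cells above compare against slightly different strings (`' >50K'` for the training file and `' >50K.'` for the test file). As a minimal sketch, a single helper could normalize both spellings before comparing. `is_high_income` is a hypothetical name introduced for illustration and is not used elsewhere in this notebook.

```python
def is_high_income(raw_label):
    """Return True for '>50K' labels, tolerating padding and a trailing period."""
    # Strip surrounding whitespace first, then the trailing '.' used in adult.test.
    return raw_label.strip().rstrip('.') == '>50K'


# Label spellings as they appear in the two files.
samples = [' >50K', ' >50K.', ' <=50K', ' <=50K.']
flags = [is_high_income(label) for label in samples]
print(flags)
```

With a helper like this, the same expression could be applied to the `income-level` column of either file.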

This is the model code. Below is an example model using the census dataset.
Because the census data set has categorical features, those categorical values have to be converted to numerical ones. A list of pipelines converts each
categorical column, and `FeatureUnion` combines the results before `RandomForestClassifier` is called.

Each categorical column needs to be extracted individually and converted to a numerical value.
To do this, each categorical column gets its own pipeline that extracts one feature column via
`SelectKBest(k=1)` and then uses a `LabelBinarizer()` to convert the categorical value to a numerical one.
A scores array (created below) selects and extracts the feature column. The scores array is
created by iterating over `COLUMNS` and checking whether each column is in `CATEGORICAL_COLUMNS`.
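
To make the scores arrays concrete, the sketch below builds them in plain Python without scikit-learn. It repeats the two column tuples from this notebook so that it runs on its own. The names `feature_columns`, `one_hot_scores`, and `numerical_scores` are introduced here for illustration only.

```python
COLUMNS = (
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income-level'
)
CATEGORICAL_COLUMNS = (
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'native-country'
)

# The last column is the label, so it is not a feature.
feature_columns = COLUMNS[:-1]


def one_hot_scores(column):
    """Scores array with a single 1 at the position of the given feature column."""
    scores = [0] * len(feature_columns)
    scores[feature_columns.index(column)] = 1
    return scores


# Scores array that marks every non-categorical (numerical) feature column.
numerical_scores = [0 if col in CATEGORICAL_COLUMNS else 1 for col in feature_columns]

print(one_hot_scores('workclass'))
print(numerical_scores)
print(sum(numerical_scores))
```

The derived `numerical_scores` matches the hard-coded list the notebook passes to `SelectKBest(k=6)`, and its sum confirms that six numerical columns are selected.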

In [None]:
categorical_pipelines = []

for i, col in enumerate(COLUMNS[:-1]):
    if col in CATEGORICAL_COLUMNS:
        # Create a scores array to get the individual categorical column.
        # Example:
        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical', 
        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']
        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        #
        # Returns: [['State-gov']]
        # Build the scores array
        scores = [0] * len(COLUMNS[:-1])
        # This column is the categorical column you want to extract.
        scores[i] = 1
        skb = SelectKBest(k=1)
        skb.scores_ = scores
        # Convert the categorical column to a numerical value
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)
        # Create the pipeline to extract the categorical feature
        categorical_pipelines.append(
            ('categorical-{}'.format(i), Pipeline([
                ('SKB-{}'.format(i), skb),
                ('LBN-{}'.format(i), lbn)])))

# Create pipeline to extract the numerical features
skb = SelectKBest(k=6)
# From COLUMNS use the features that are numerical
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(('numerical', skb))

# Combine all the features using FeatureUnion
preprocess = FeatureUnion(categorical_pipelines)

# Create the classifier
classifier = RandomForestClassifier()

# Transform the features and fit them to the classifier
classifier.fit(preprocess.transform(train_features), train_labels)

# Create the overall model as a single pipeline
pipeline = Pipeline([
    ('union', preprocess),
    ('classifier', classifier)
])
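
As a small illustration of the `LabelBinarizer` step used in the cell above, the sketch below fits a binarizer on a toy column. The three values are a hand-picked sample of `workclass` categories with the leading spaces removed for readability. They are not read from the data files.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy categorical column (illustrative values only).
workclass_values = ['Private', 'State-gov', 'Self-emp-inc', 'Private']

lbn = LabelBinarizer()
encoded = lbn.fit_transform(workclass_values)

# classes_ holds the sorted category names; each encoded row has one column per class.
print(lbn.classes_.tolist())
print(encoded.tolist())
```

Each row contains a single 1 in the column of its category. Note that when a column has only two distinct values, `LabelBinarizer` emits one column rather than two.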

Export the model to a file

In [None]:
model = 'model.joblib'
joblib.dump(pipeline, model)

['model.joblib']

In [None]:
!ls -al model.joblib

-rw-r--r-- 1 root root 81903458 Sep 27 18:07 model.joblib
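
One way to check an exported file is to load it back and compare predictions. The sketch below does this with a tiny stand-in classifier and toy data, not the census pipeline, so that it runs quickly on its own. It assumes the standalone `joblib` package is installed.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the transformed census features (illustration only).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [False, False, True, True]

clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(clf, path)        # write the fitted model to disk
    restored = joblib.load(path)  # read it back into a new object

before = clf.predict(X).tolist()
after = restored.predict(X).tolist()
print(before == after)
```

The same pattern (`joblib.load('model.joblib')` followed by `predict`) should apply to the exported census pipeline, as long as the loading environment has a compatible scikit-learn version.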


## Predictions
Select one person who makes <=50K and one who makes >50K to test the model.

In [None]:
print('Show a person that makes <=50K:')
print('\tFeatures: {0} --> Label: {1}\n'.format(test_features[0], test_labels[0]))

with open('less_than_50K.json', 'w') as outfile:
    json.dump(test_features[0], outfile)

print('Show a person that makes >50K:')
print('\tFeatures: {0} --> Label: {1}'.format(test_features[3], test_labels[3]))

with open('more_than_50K.json', 'w') as outfile:
    json.dump(test_features[3], outfile)

Show a person that makes <=50K:
	Features: [25, ' Private', 226802, ' 11th', 7, ' Never-married', ' Machine-op-inspct', ' Own-child', ' Black', ' Male', 0, 0, 40, ' United-States'] --> Label: False

Show a person that makes >50K:
	Features: [44, ' Private', 160323, ' Some-college', 10, ' Married-civ-spouse', ' Machine-op-inspct', ' Husband', ' Black', ' Male', 7688, 0, 40, ' United-States'] --> Label: True
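
To confirm that a saved instance file holds what was written, it can be read back with `json.load`. The sketch below repeats the first feature row shown above and writes it to a temporary directory, an assumption made so the example does not touch the notebook's working files.

```python
import json
import os
import tempfile

# First test row, copied from the output above.
features = [25, ' Private', 226802, ' 11th', 7, ' Never-married',
            ' Machine-op-inspct', ' Own-child', ' Black', ' Male',
            0, 0, 40, ' United-States']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'less_than_50K.json')
    with open(path, 'w') as outfile:
        json.dump(features, outfile)
    with open(path) as infile:
        restored = json.load(infile)

# The JSON round trip preserves both the integers and the padded strings.
print(restored == features)
print(len(restored))
```

The leading spaces in the string values survive the round trip, which matters because the model was fit on values in exactly that form.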


## Use Python to make local predictions
Test the model with the entire test set and print out some of the results.

In [None]:
local_results = pipeline.predict(test_features)
local = pd.Series(local_results, name='local')

In [None]:
local[:10]

0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
8    False
9    False
Name: local, dtype: bool

In [None]:
# Print the first 10 responses
for i, response in enumerate(local[:10]):
    print('Prediction: {}\tLabel: {}'.format(response, test_labels[i]))

Prediction: False	Label: False
Prediction: False	Label: False
Prediction: True	Label: True
Prediction: True	Label: True
Prediction: False	Label: False
Prediction: False	Label: False
Prediction: False	Label: False
Prediction: True	Label: True
Prediction: False	Label: False
Prediction: False	Label: False
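
The row-by-row comparison above can be condensed into a single accuracy figure. The sketch below copies the ten predictions and labels printed above and computes the fraction that match. It says nothing about accuracy on the full test set, and `accuracy` is a helper name introduced here.

```python
# First ten predictions and labels, copied from the output above.
predictions = [False, False, True, True, False, False, False, True, False, False]
labels = [False, False, True, True, False, False, False, True, False, False]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the label."""
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must have the same length')
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


print('Accuracy on the first 10 rows: {:.2f}'.format(accuracy(predictions, labels)))
```

For the whole test set, the same function could be called as `accuracy(local.tolist(), test_labels)`, assuming both hold one entry per test row.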
