# Activity 1: Train and Deploy an Income Predictor Model using Flask
You are working for a governmental agency and have been tasked with building and deploying a predictive model. Using historical census data, the model will assess whether a person, based on their personal information, is more likely to have a salary over or under 50k.

The following steps will help you complete this activity:
- Download and load the dataset
- Extract the response variable
- Split the dataset into training and test sets
- Extract the list of categories for each categorical column
- Save the list of categories and categorical column names into files
- Perform One-Hot encoding on categorical variables
- Train a RandomForest for predicting the binary outcome
- Save the trained model into a file
- Create a Flask app
- Create an API endpoint that will perform the same data transformation as for the training set and predict the outcome for a single record
- Send a request to this endpoint

The dataset was originally shared by Ronny Kohavi and Barry Becker from Silicon Graphics:
https://archive.ics.uci.edu/ml/datasets/adult

The CSV version of this dataset can be found here:
https://www.openml.org/data/get_csv/1595261/phpMawTba

1. Open a new Colab notebook

2. Import the packages pandas, joblib and pickle, RandomForestClassifier from sklearn.ensemble and train_test_split from sklearn.model_selection

In [0]:
import pandas as pd
import joblib
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

3. Assign the link to the dataset to a variable called 'file_url'



In [0]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter18/Dataset/phpMawTba.csv'

4. Load the dataset into a DataFrame using pd.read_csv()

In [0]:
df = pd.read_csv(file_url)

5. Print out the first 5 rows of this DataFrame

In [4]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


6. Extract the response variable 'class' using the method .pop() and save it into a variable called 'y'

In [0]:
y = df.pop('class')
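
As a quick illustration of what .pop() does, here is a minimal sketch on a hypothetical two-row DataFrame (the name 'toy_df' and its values are made up for this example). The method returns the column and also removes it from the DataFrame in place:

```python
import pandas as pd

# Hypothetical toy data, not taken from the census dataset
toy_df = pd.DataFrame({"age": [25, 38], "class": ["<=50K", ">50K"]})

# .pop() returns the column as a Series and drops it from toy_df
toy_y = toy_df.pop("class")

print(toy_y.tolist())
print(toy_df.columns.tolist())
```

After this call, 'toy_df' holds only the features, which is why 'df' can be passed directly as the feature set in the later steps.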

7. Create a list called 'cat_columns' containing only the columns of type 'object' using the attribute dtype and print its content

In [6]:
cat_columns = [col for col in df.columns if df[col].dtype == 'object']
cat_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
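
The same list can probably be built with the pandas method .select_dtypes(). The sketch below compares both approaches on a hypothetical DataFrame ('toy_df' and its values are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical toy data; the object dtype is set explicitly so the example
# does not depend on how a given pandas version stores strings by default
toy_df = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
    "sex": pd.Series(["Male", "Male", "Female"], dtype=object),
})

# List comprehension, as used in this activity
cat_columns_loop = [col for col in toy_df.columns if toy_df[col].dtype == "object"]

# Alternative: let pandas filter the columns by dtype
cat_columns_select = toy_df.select_dtypes(include="object").columns.tolist()

print(cat_columns_loop)
print(cat_columns_select)
```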

8. Split 'df' and 'y' into training and test sets using the train_test_split function with the parameters test_size=0.33 and random_state=8

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.33, random_state=8)
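
To see the effect of these two parameters, here is a small sketch on hypothetical data of 100 rows ('toy_X' and 'toy_y' are made-up names). It assumes that a fixed random_state gives the same split on every call, which is the reason for setting it:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 100 rows
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([i % 2 for i in range(100)])

# Two identical calls with the same random_state
first = train_test_split(toy_X, toy_y, test_size=0.33, random_state=8)
second = train_test_split(toy_X, toy_y, test_size=0.33, random_state=8)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = first

# Sizes of the training and test sets
print(len(toy_X_train), len(toy_X_test))

# Whether both calls selected the same test rows
same_split = toy_X_test.index.equals(second[1].index)
print(same_split)
```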

9. Create an empty dictionary called 'column_categories'

In [0]:
column_categories = {}

10. Iterate through 'cat_columns' and populate the dictionary with the column name and the list of categories using the .astype() method and .cat.categories attribute

In [0]:
for col in cat_columns:
  column_categories[col] = X_train[col].astype('category').cat.categories
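
For intuition, this is what the expression returns on a hypothetical column ('toy_col' and its values are made up for this example). The categories are the unique values of the column, in sorted order:

```python
import pandas as pd

# Hypothetical training column with repeated values
toy_col = pd.Series(["Private", "Local-gov", "Private", "State-gov"])

# Convert to the category type and read the list of unique categories
toy_categories = toy_col.astype("category").cat.categories

print(list(toy_categories))
```

Because the categories are extracted from X_train only, a value that appears solely in new data will not be part of this list.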

11. Save 'column_categories' and 'cat_columns' into two files called 'categories_data.pkl' and 'categorical_columns.pkl' respectively, using the pickle.dump() function

In [0]:
with open("categories_data.pkl", "wb") as f:
  pickle.dump(column_categories, f)
with open("categorical_columns.pkl", "wb") as f:
  pickle.dump(cat_columns, f)

12. Create a function called 'apply_categories' that takes a DataFrame and a dictionary as inputs. It imports CategoricalDtype from pandas.api.types, iterates through the dictionary and converts each column (key) to a categorical type with the corresponding list of categories (value) using the .astype() method and CategoricalDtype

In [0]:
def apply_categories(input_df, cat_dict):
  from pandas.api.types import CategoricalDtype

  for col, cat in cat_dict.items():
    input_df[col] = input_df[col].astype(CategoricalDtype(categories=cat))

  return input_df
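
One consequence of fixing the list of categories is worth knowing: a value that is not in the list is converted to a missing value. The sketch below shows this on hypothetical data (the category list and the value 'Self-emp' are assumptions made for illustration):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical categories learned from a training column
workclass_dtype = CategoricalDtype(categories=["Local-gov", "Private", "State-gov"])

# New data with one known value and one value never seen during training
new_values = pd.Series(["Private", "Self-emp"])
converted = new_values.astype(workclass_dtype)

# The unseen value becomes a missing value; the category list is unchanged
missing_flags = converted.isna().tolist()
print(missing_flags)
print(list(converted.cat.categories))
```

In this activity, such a record would end up with zeros in all the one-hot encoded columns of that variable, so you may want to validate incoming values before predicting.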

13. Apply this function to X_train and column_categories and save the result in a new DataFrame called 'X_train_cat'. Print the data type of its columns using the .dtypes attribute

In [12]:
X_train_cat = apply_categories(X_train, column_categories)
X_train_cat.dtypes

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status    category
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
dtype: object

14. Perform One-Hot encoding on the categorical columns using the pd.get_dummies() function and save the result into a new variable called 'X_train_final'

In [0]:
X_train_final = pd.get_dummies(X_train_cat, columns=cat_columns)
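
The reason for converting the columns to a categorical type first becomes clear when encoding a single record. Here is a minimal sketch on hypothetical data ('single_record' and the two categories are made up for this example):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical single record, as the API will receive later on
single_record = pd.DataFrame({"age": [51], "sex": ["Male"]})

# Without a fixed category list, only the observed value gets a column
plain = pd.get_dummies(single_record, columns=["sex"])

# With the category list learned from training, every category gets a column
sex_dtype = CategoricalDtype(categories=["Female", "Male"])
typed = single_record.copy()
typed["sex"] = typed["sex"].astype(sex_dtype)
fixed = pd.get_dummies(typed, columns=["sex"])

print(plain.columns.tolist())
print(fixed.columns.tolist())
```

With the fixed categories, a single record produces the same columns as the training set, which is what the trained model expects as input.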

15. Instantiate a RandomForestClassifier with random_state=8 and train it with the training sets using the .fit() method. Save the model into a file called 'model.pkl' using the joblib.dump() function

In [14]:
rf_model = RandomForestClassifier(random_state=8)
rf_model.fit(X_train_final, y_train)
joblib.dump(rf_model, "model.pkl") 



['model.pkl']
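
The list printed above is the return value of joblib.dump(), which contains the names of the files written. As a sanity check, the following sketch saves and reloads a model trained on hypothetical toy data (all names and values here are made up, and a temporary folder is used so that your 'model.pkl' is left untouched):

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two binary features, the label is their logical OR
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
toy_y = [0, 1, 1, 1] * 5

toy_model = RandomForestClassifier(n_estimators=10, random_state=8)
toy_model.fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    # joblib.dump() returns the list of file names it has written
    saved_files = joblib.dump(toy_model, model_path)
    restored_model = joblib.load(model_path)

# The reloaded model should give the same predictions as the original
same_predictions = bool(
    (toy_model.predict(toy_X) == restored_model.predict(toy_X)).all()
)
print(same_predictions)
```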

16. Import the socket, threading, requests, json and numpy packages and the classes Flask, jsonify and request from the package flask

In [0]:
import socket
import threading
import requests
import json
from flask import Flask, jsonify, request
import numpy as np

17. Create a new Flask app and save it into a variable called 'app'

In [0]:
app = Flask(__name__)

18. Load the pre-trained model from the file 'model.pkl' using joblib.load() and save it into a variable called 'trained_model'. Load the saved dictionary from 'categories_data.pkl' using pickle.load() and save it into a variable called 'var_means'. Load the list of categorical column names from 'categorical_columns.pkl' in the same way and save it into a variable called 'cat_cols'

In [0]:
trained_model = joblib.load("model.pkl")
with open("categories_data.pkl", "rb") as f:
  var_means = pickle.load(f)  # dictionary of categories per column
with open("categorical_columns.pkl", "rb") as f:
  cat_cols = pickle.load(f)

19. Create an API endpoint for the path '/api' that accepts only POST requests and calls a function called predict(). This function will read the JSON received using the method request.get_json(), transform it into a DataFrame, apply the apply_categories() function to it with 'var_means', perform one-hot encoding with pd.get_dummies(), predict the outcome with 'trained_model', convert the prediction from a numpy array to a string with np.array2string() and then to JSON with jsonify()

In [0]:
@app.route('/api', methods=['POST'])
def predict():
  data = request.get_json()
  df_test = pd.DataFrame(data, index=[0])
  df_test_clean = apply_categories(df_test, var_means)
  df_test_final = pd.get_dummies(df_test_clean, columns=cat_cols)
  prediction = trained_model.predict(df_test_final)
  str_pred = np.array2string(prediction)
  return jsonify(str_pred)
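
An endpoint like this one can also be exercised without starting a server, using the test client that comes with Flask. The sketch below is a simplified stand-in, not the endpoint of this activity: 'demo_app', 'demo_predict' and the rule on 'hours-per-week' are hypothetical and replace the trained model so that the example is self-contained:

```python
from flask import Flask, jsonify, request

demo_app = Flask(__name__)

@demo_app.route("/api", methods=["POST"])
def demo_predict():
    data = request.get_json()
    # Hypothetical rule standing in for a trained model
    label = ">50K" if data["hours-per-week"] > 45 else "<=50K"
    return jsonify(label)

# The test client sends requests to the app without opening a network port
client = demo_app.test_client()
post_response = client.post("/api", json={"hours-per-week": 40})
get_response = client.get("/api")

print(post_response.status_code, post_response.get_json())
# Only POST is allowed on this route, so a GET request is rejected
print(get_response.status_code)
```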

20. Create a new thread for running your Flask app using threading.Thread with the following parameters: target=app.run, kwargs={'host':'0.0.0.0','port':80}. Start it with the .start() method

In [19]:
flask_thread = threading.Thread(target=app.run, kwargs={'host':'0.0.0.0','port':80})
flask_thread.start()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:80/ (Press CTRL+C to quit)

21. Select the first record of X_test and convert it into JSON format using the .to_json() method

In [20]:
record = X_test.iloc[0].to_json()
record

'{"age":51,"workclass":" Private","fnlwgt":106151,"education":" 11th","education-num":7,"marital-status":" Divorced","occupation":" Transport-moving","relationship":" Own-child","race":" White","sex":" Male","capital-gain":0,"capital-loss":0,"hours-per-week":40,"native-country":" United-States"}'


22. Create a dictionary called headers with the following key-value pairs: 'content-type': 'application/json', 'Accept-Charset': 'UTF-8'. Save the IP address of the host into a new variable called 'ip_address' using the functions socket.gethostname() and socket.gethostbyname()

In [0]:
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}
ip_address = socket.gethostbyname(socket.gethostname())

23. Send an HTTP POST request to the server using the function requests.post() with the HTTP URL of the endpoint, record and headers as its parameters and print its .text attribute

In [22]:
r = requests.post(f"http://{ip_address}/api", data=record, headers=headers)
r.text

172.28.0.2 - - [06/Nov/2019 11:58:40] "POST /api HTTP/1.1" 200 -


'"[\' <=50K\']"\n'
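
The response is a JSON string that itself contains the text form of a numpy array. One possible way to recover the label on the client side is sketched below. It is an assumption rather than part of the activity, and it only holds for a single prediction, since np.array2string() separates several elements with spaces instead of commas:

```python
import ast
import json

# The text returned by the endpoint in this activity
response_text = '"[\' <=50K\']"\n'

# First decode the JSON string, then parse the list it contains
inner = json.loads(response_text)
labels = ast.literal_eval(inner)

# The raw data carries a leading space, so strip it
label = labels[0].strip()
print(label)
```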

Big kudos to you! In this activity, you trained a Machine Learning model to assess the likelihood of a person having a low or high salary and deployed it as a Web API using Flask. This model can now be accessed at any time to make predictions in real time. You saw how to save the key information required to reproduce, for new input data, the same data processing that was applied during model training. You are now ready to deploy more Machine Learning models as a service.