# Exercise 2: Adding Data Processing Steps to a Web API
In this exercise, we will save the parameters used for processing the training dataset and reuse them in the API so that it performs the same data transformation steps before returning a prediction.

Note
The dataset used for this exercise is the Breast Cancer Detection dataset shared by Dr. William H. Wolberg from the University of Wisconsin Hospitals. The attribute information can be found here - https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

The dataset can also be found in our repository here - https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter11/dataset/breast-cancer-wisconsin.data


1. Open a new Colab notebook

2. Import the pandas and joblib packages, and RandomForestClassifier from sklearn.ensemble

In [0]:
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier

3. Assign the link to the Breast Cancer dataset to a variable called 'file_url'

In [0]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter11/dataset/breast-cancer-wisconsin.data'

4. Create a list called 'col_names' with the following names: 'Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size',
'Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class'

In [0]:
col_names = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size',
'Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']

5. Load the dataset into a DataFrame using pd.read_csv() with the following parameters: header=None, names=col_names, na_values='?'

In [0]:
df = pd.read_csv(file_url, header=None, names=col_names, na_values='?')
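
To see what na_values='?' does in step 5, here is a minimal sketch on two made-up rows. The names `sample_csv` and `sample_df`, the column names and the values are illustrative only and are not part of the exercise.

```python
import io
import pandas as pd

# Two made-up rows in a comma-separated layout; '?' marks a missing value
sample_csv = io.StringIO("1,5,?\n2,3,7\n")
sample_df = pd.read_csv(sample_csv, header=None, names=['id', 'x', 'y'], na_values='?')

# The '?' is read as NaN, so the column is parsed as numeric instead of text
print(sample_df['y'].isna().sum())
print(sample_df['y'].dtype)
```

This is why a column such as 'Bare Nuclei' can be treated as numeric and imputed with its mean in the later steps.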

6. Extract the response variable 'Class' using the method .pop()

In [0]:
y = df.pop('Class')

7. Remove the column 'Sample code number' from the DataFrame using the method .drop() with axis=1 as parameter to specify we are dropping columns and not rows.

In [0]:
df.drop('Sample code number', axis=1, inplace=True)

8. Create a variable called 'training_rows' that will contain the number of rows that corresponds to 70% of the records.

In [0]:
training_rows = int(df.shape[0] * 0.7)
training_rows

489

9. Split the dataframes df and y into training and test sets using 'training_rows' as the threshold for the split.

In [0]:
X_train = df[:training_rows]
y_train = y[:training_rows]
X_test = df[training_rows:]
y_test = y[training_rows:]
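
Because the split in step 9 is positional, the rows keep their original order and the test set is simply the last 30% of the file. A minimal sketch on a made-up 15-row DataFrame (the names `toy`, `n_train`, `toy_train` and `toy_test` are illustrative) shows how the sizes work out:

```python
import pandas as pd

# Made-up data: 15 rows with a default 0..14 index
toy = pd.DataFrame({'x': range(15)})

n_train = int(toy.shape[0] * 0.7)  # int() truncates the fractional part
toy_train = toy[:n_train]          # first n_train rows
toy_test = toy[n_train:]           # all remaining rows

print(len(toy_train), len(toy_test))
```

If a file happens to be sorted, for example by class or by collection date, an unshuffled split like this can produce training and test sets with different distributions, so it is worth checking before relying on the test score.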

10. Calculate the number of missing values for each column by combining the methods .isna() with .sum()

In [0]:
X_train.isna().sum()

Clump Thickness                 0
Uniformity of Cell Size         0
Uniformity of Cell Shape        0
Marginal Adhesion               0
Single Epithelial Cell Size     0
Bare Nuclei                    15
Bland Chromatin                 0
Normal Nucleoli                 0
Mitoses                         0
dtype: int64

11. Extract the list of columns that are not of type 'object' and save the result in a variable called 'num_columns'

In [0]:
num_columns = [col for col in X_train.columns if X_train[col].dtype != 'object']
num_columns

['Clump Thickness',
 'Uniformity of Cell Size',
 'Uniformity of Cell Shape',
 'Marginal Adhesion',
 'Single Epithelial Cell Size',
 'Bare Nuclei',
 'Bland Chromatin',
 'Normal Nucleoli',
 'Mitoses']

12. Create an empty dictionary called 'column_mean'. Iterate through the list 'num_columns' and, for each column, add the column name and its average value to this dictionary. Then display its content.

In [0]:
column_mean = {}
for col in num_columns:
  column_mean[col] = X_train[col].mean()
column_mean

{'Bare Nuclei': 4.0042194092827,
 'Bland Chromatin': 3.61758691206544,
 'Clump Thickness': 4.644171779141105,
 'Marginal Adhesion': 2.9529652351738243,
 'Mitoses': 1.7198364008179958,
 'Normal Nucleoli': 3.1533742331288344,
 'Single Epithelial Cell Size': 3.462167689161554,
 'Uniformity of Cell Shape': 3.4478527607361964,
 'Uniformity of Cell Size': 3.347648261758691}

13. Import the pickle package and save 'column_mean' into a file called 'columns_mean.pkl'

In [0]:
import pickle
with open("columns_mean.pkl", "wb") as f:  # the with-block closes the file after writing
  pickle.dump(column_mean, f)
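
A quick way to confirm that a pickle file was written correctly is to load it back and compare it with the original object. The sketch below uses a made-up dictionary and a hypothetical file name, `example_means.pkl`, placed in the temporary directory so that the real 'columns_mean.pkl' is left untouched.

```python
import os
import pickle
import tempfile

example_means = {'Bare Nuclei': 4.0, 'Mitoses': 1.7}  # illustrative values only
path = os.path.join(tempfile.gettempdir(), 'example_means.pkl')

# Write the dictionary; the with-block closes the file even if dump() raises
with open(path, 'wb') as f:
    pickle.dump(example_means, f)

# Read it back and check that nothing was lost
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == example_means)
```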

14. Iterate through the list 'num_columns' and for each column, replace missing values by the relevant average contained in the 'column_mean' dictionary

In [0]:
X_train = X_train.copy()  # explicit copy so the assignments below do not act on a slice of df
for col in num_columns:
  X_train[col] = X_train[col].fillna(column_mean[col])
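
As an alternative to looping over the columns, DataFrame.fillna() also accepts a dictionary that maps column names to fill values, so a dictionary of means can be applied in one call. A minimal sketch on a toy DataFrame (the data and the names `toy`, `toy_means` and `toy_filled` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data standing in for X_train; values are illustrative only
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Per-column means, built the same way as column_mean
toy_means = {col: toy[col].mean() for col in toy.columns}

# fillna with a {column: value} dict returns a new DataFrame with the gaps filled
toy_filled = toy.fillna(value=toy_means)
print(toy_filled)
```

Because fillna() returns a new DataFrame here, the result has to be assigned to a variable; the original `toy` is left unchanged.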


15. Instantiate a RandomForestClassifier with random_state=1 and train it with the training sets using the .fit() method. Save the model into a file called 'model.pkl' using the method joblib.dump()

In [0]:
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X_train, y_train)
joblib.dump(rf_model, "model.pkl") 



['model.pkl']

16. Import the socket, threading, requests, json and numpy packages and the classes Flask, jsonify and request from the package flask

In [0]:
import socket
import threading
import requests
import json
from flask import Flask, jsonify, request
import numpy as np

17. Create a new Flask app and save it into a variable called 'app'

In [0]:
app = Flask(__name__)

18. Load the pre-trained model from the file 'model.pkl' using joblib.load() and save it into a variable called 'trained_model'. Load the saved dictionary from 'columns_mean.pkl' using pickle.load() and save it into a variable called 'var_means'

In [0]:
trained_model = joblib.load("model.pkl")
with open("columns_mean.pkl", "rb") as f:  # the with-block closes the file after reading
  var_means = pickle.load(f)

19. Create an API endpoint for the path '/api' that accepts only POST requests and calls a function called predict(). This function will read the JSON received using the method request.get_json(), transform it into a DataFrame, loop through all the items of 'var_means' and use its keys and values to replace missing values, predict the outcome with 'trained_model', convert the prediction from a numpy array to a string with array2string() and then to JSON with jsonify()

In [0]:
@app.route('/api', methods=['POST'])
def predict():
  data = request.get_json()
  df_test = pd.DataFrame(data, index=[0])
  for col, avg_value in var_means.items():
    df_test[col] = df_test[col].fillna(avg_value)  # assign back instead of filling in place on a column selection
  prediction = trained_model.predict(df_test)
  str_pred = np.array2string(prediction)
  return jsonify(str_pred)
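
Before starting a server, the endpoint logic can be checked with Flask's built-in test client. The sketch below is a self-contained stand-in rather than the exercise's app: `StubModel`, `demo_app`, `demo_means`, `demo_predict` and the single-feature payload are all hypothetical, and the stub always predicts class 2 so that no trained model file is needed.

```python
import numpy as np
import pandas as pd
from flask import Flask, jsonify, request

class StubModel:
    """Hypothetical stand-in for the trained model: always predicts class 2."""
    def predict(self, X):
        return np.full(len(X), 2)

demo_app = Flask(__name__)
demo_model = StubModel()
demo_means = {'Bare Nuclei': 4.0}  # illustrative value, not the saved mean

@demo_app.route('/api', methods=['POST'])
def demo_predict():
    data = request.get_json()
    df_test = pd.DataFrame(data, index=[0])
    # Same imputation idea as predict(): fill gaps with the stored means
    for col, avg_value in demo_means.items():
        df_test[col] = df_test[col].fillna(avg_value)
    prediction = demo_model.predict(df_test)
    return jsonify(np.array2string(prediction))

# The test client sends the request in-process, without opening a port
client = demo_app.test_client()
response = client.post('/api', json={'Bare Nuclei': None})
print(response.status_code, response.get_json())
```

Checking the route this way separates errors in the processing code from networking problems such as an unavailable port.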

20. Create a new thread for running your Flask app using the method threading.Thread with the following parameters: target=app.run, kwargs={'host':'0.0.0.0','port':80}

In [0]:
flask_thread = threading.Thread(target=app.run, kwargs={'host':'0.0.0.0','port':80})
flask_thread.start()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
  File "/usr/local/lib/python3.6/dist-packages/werkzeug/serving.py", line 1010, in run_simple
    inner()
  File "/usr/local/lib/python3.6/dist-packages/werkzeug/serving.py", line 963, in inner
    fd=fd,
  File "/usr/local/lib/python3.6/dist-packages/werkzeug/serving.py", line 806, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
  File "/usr/local/lib/python3.6/dist-packages/werkzeug/serving.py", line 699, in __init__
    HTTPServer.__init__(self, server_address, handler)
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "
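
The traceback above ends inside server_bind(), the call that binds the server socket to the requested host and port. One possible cause (an assumption, since the output is cut off) is that the port was unavailable, for example because an earlier run of this cell was still serving on it. A minimal sketch of asking the operating system for an unused port with the standard socket module; the helper name `find_free_port` is hypothetical:

```python
import socket

def find_free_port():
    """Return a TCP port number that was free at the time of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))            # port 0 lets the OS choose a free port
        return s.getsockname()[1]  # the port number the OS assigned

free_port = find_free_port()
print(free_port)
```

The returned number could then be passed as 'port' in the kwargs given to threading.Thread and added to the request URL, as in `http://{ip_address}:{free_port}/api`. There is a small window in which another process could take the port before Flask binds to it.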

21. Extract the first record of X_test that has a missing value in the column 'Bare Nuclei' and convert it into JSON format using the .to_json() method

In [0]:
record = X_test[X_test['Bare Nuclei'].isna()].iloc[0].to_json()
record

'{"Clump Thickness":1.0,"Uniformity of Cell Size":1.0,"Uniformity of Cell Shape":1.0,"Marginal Adhesion":1.0,"Single Epithelial Cell Size":1.0,"Bare Nuclei":null,"Bland Chromatin":1.0,"Normal Nucleoli":1.0,"Mitoses":1.0}'

22. Create a dictionary called headers with the following key-value pairs: 'content-type': 'application/json', 'Accept-Charset': 'UTF-8'. Extract into a new variable called 'ip_address' the IP address of the host using the methods socket.gethostname() and socket.gethostbyname()

In [0]:
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}
ip_address = socket.gethostbyname(socket.gethostname())

23. Send an HTTP POST request to the server using the method requests.post() with the HTTP URL of the endpoint, record and headers as its parameters, and display its .text attribute

In [0]:
r = requests.post(f"http://{ip_address}/api", data=record, headers=headers)
r.text

172.28.0.2 - - [06/Nov/2019 03:35:04] "POST /api HTTP/1.1" 200 -


'"[2]"\n'
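
The response body is a JSON-encoded string that itself contains NumPy's text representation of the prediction array, so it takes two steps to recover the class label. A minimal sketch using the response text shown above; the helper name `parse_prediction` and the `CLASS_LABELS` dictionary are hypothetical, and the label meanings (2 for benign, 4 for malignant) are taken from the dataset's attribute information on the UCI page.

```python
import json

CLASS_LABELS = {2: 'benign', 4: 'malignant'}  # per the UCI attribute information

def parse_prediction(response_text):
    """Turn a response body such as '"[2]"' into an integer class label."""
    array_text = json.loads(response_text)  # JSON string -> Python str '[2]'
    return int(array_text.strip('[]'))      # drop the brackets -> 2

label = parse_prediction('"[2]"\n')
print(label, CLASS_LABELS[label])
```

This sketch assumes a single prediction per request, which matches the one-record payload sent in step 23.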

Excellent! We just deployed our pre-trained machine learning model as a Web API. In a real-world project, you will have to deploy it on a separate server within your organisation and configure the networking settings so that authorised systems or services can send requests to this API.