Reading and writing files works slightly differently in a local environment than in Watson Studio Cloud. In the Cloud, the easiest way is to use project-lib (see https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/project-lib-python.html?audience=wdp if you want to learn more). When you open this notebook in Watson Studio Cloud for the first time, insert a project token by clicking the "hamburger" icon on the right-hand side. If no project token exists yet, follow the link to create one (Access tokens --> New token, with Editor role), then return to this notebook and repeat the steps to insert the project token. This adds a new cell above this one. Run that cell!

# Titanic Modeling, Evaluation and Deployment

## CRISP-DM

In [None]:
from IPython.display import Image, display  # IPython.core.display is deprecated for these imports
display(Image('https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png', width=500, unconfined=True))

## Import relevant packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

## Load data and prepare for modeling

Data preparation for modeling (including `pd.get_dummies`) was performed in the previous notebook. Here, the prepared data are loaded again and split into a training set and a test set. Models are built on the training data only and afterwards evaluated on the (previously unseen) test data.

Attention: As stated above, loading data differs between the local and cloud versions. Select the right one for the platform you are using.

In [None]:
# Local
# df_dummies = pd.read_csv('train_dummies.csv') # use full path if notebook and file in different folders! 

# Cloud: Fetch the file (`project` is defined by the project token cell inserted at the top)
my_file = project.get_file("train_dummies.csv")

# Cloud: Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)  # rewind the file object before reading
df_dummies = pd.read_csv(my_file)

In [None]:
df_dummies.head()

In [None]:
target = df_dummies['Survived'] # feature to be predicted
predictors = df_dummies.drop(['Survived'], axis = 1) # all other features are used as predictors

In [None]:
predictors.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=123) # 80-20 split into training and test data

## Create and evaluate classification models

Predicting whether a passenger on the Titanic survived is a supervised machine learning problem, more precisely a binary classification task. Commonly used algorithms include decision trees, random forests and logistic regression. Once a classification model has been built, evaluation metrics are calculated and interpreted.

### Decision Tree

In [None]:
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

In [None]:
confusion_matrix(y_test, tree.predict(X_test)) # yields count of true negatives, false positives, false negatives, true positives

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, tree.predict(X_test)).ravel() # unpack in scikit-learn's order (tn, fp, fn, tp) so the counts are not mixed up
print(tn, fp, fn, tp)
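The four counts are all that is needed to reproduce the headline numbers of the classification report for the positive class. The following is a minimal sketch; the helper name `metrics_from_counts` and the example counts are made up for illustration and are not results from this notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 (positive class) from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only
example = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
print(example)
```

In this notebook you could call `metrics_from_counts(tn, fp, fn, tp)` with the values unpacked above and compare the result with the row for class 1 in the classification report.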

In [None]:
print(classification_report(y_train, tree.predict(X_train))) # yields class-specific precision, recall and f1-score

In [None]:
print(classification_report(y_test, tree.predict(X_test)))

Performance on test data is noticeably lower than on training data. The decision tree probably overfits the training data and does not generalize well to unseen test data.
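One way to make the gap between training and test performance explicit is to compare accuracies directly. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, not the Titanic data), and the helper `train_test_gap` as well as all variable names are hypothetical. An unrestricted tree typically memorizes the training set, while a depth-limited tree often shows a smaller gap; the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

def train_test_gap(model):
    """Fit the model and return (training accuracy, test accuracy)."""
    model.fit(Xd_train, yd_train)
    return model.score(Xd_train, yd_train), model.score(Xd_test, yd_test)

# Fully grown tree versus a tree limited to depth 3
full_train, full_test = train_test_gap(DecisionTreeClassifier(random_state=0))
shallow_train, shallow_test = train_test_gap(DecisionTreeClassifier(max_depth=3, random_state=0))

print(f"full tree:    train {full_train:.3f}, test {full_test:.3f}")
print(f"max_depth=3:  train {shallow_train:.3f}, test {shallow_test:.3f}")
```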

In [None]:
list(zip(X_train.columns, tree.feature_importances_)) # lists features and their importance in predicting the target
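A list of tuples is hard to scan when there are many features. As a minimal sketch, the importances can be wrapped in a pandas Series and sorted. The data and column names below (`feature_0` to `feature_4`) are synthetic placeholders, not the Titanic columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_imp = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_imp, y_arr)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(
    demo_tree.feature_importances_, index=X_imp.columns
).sort_values(ascending=False)
print(importances)
```

In this notebook, the equivalent would be `pd.Series(tree.feature_importances_, index=X_train.columns).sort_values(ascending=False)`.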

### Random Forest

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [None]:
confusion_matrix(y_test, rf.predict(X_test))

In [None]:
print(classification_report(y_train, rf.predict(X_train)))

In [None]:
print(classification_report(y_test, rf.predict(X_test)))

As before, test performance is lower than training performance. Random forests, too, can suffer from overfitting on training data. 

In [None]:
list(zip(X_train.columns, rf.feature_importances_))

### Logistic Regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [None]:
print(confusion_matrix(y_test, logreg.predict(X_test)))

In [None]:
# nicer way to inspect confusion matrix
conf_mat = confusion_matrix(y_test, logreg.predict(X_test))
df_cm = pd.DataFrame(conf_mat, index=['0', '1'], columns=['0', '1'])
fig = plt.figure(figsize=[10,7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=14)
plt.ylabel('True label')
plt.xlabel('Predicted label')

In [None]:
print(classification_report(y_test, logreg.predict(X_test)))

In [None]:
print(classification_report(y_train, logreg.predict(X_train)))

For logistic regression, training and test performance are very similar. This probably means that the model generalizes well to new data.

### Building many models

When building and comparing lots of models, it may be useful to loop over several classifiers or over one classifier with several parameters. An idea to overcome the overfitting problem with tree-based classifiers is to limit the depth of trees and inspect evaluation metrics.

In [None]:
# vary maximum tree depth for random forest
tree_depth = [5, 10, 20]
for i in tree_depth:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(X_train, y_train)
    print('Max tree depth: ', i)
    print('Train results: ', classification_report(y_train, rf.predict(X_train)))
    print('Test results: ',classification_report(y_test, rf.predict(X_test)))
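The other option mentioned above, looping over several classifiers, could look like the following sketch. It runs on synthetic stand-in data (an assumption), and the dictionary keys, hyperparameter values and variable names are chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic data)
Xc, yc = make_classification(n_samples=400, n_features=8, random_state=1)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=1
)

# Candidate models keyed by a readable name
classifiers = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=1),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record (training accuracy, test accuracy)
results = {}
for name, clf in classifiers.items():
    clf.fit(Xc_train, yc_train)
    results[name] = (clf.score(Xc_train, yc_train), clf.score(Xc_test, yc_test))
    print(f"{name}: train accuracy {results[name][0]:.3f}, test accuracy {results[name][1]:.3f}")
```

Collecting the scores in a dictionary keeps the comparison in one place; with the notebook's data you would pass `X_train`, `y_train`, `X_test` and `y_test` instead.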

Feel free to consider additional aspects if you are familiar with machine learning: You could check for class imbalance and mitigate this by oversampling training data. You could also try more classification algorithms like SVM. 
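As a starting point for the class imbalance check, the sketch below counts the classes and oversamples the minority class with `sklearn.utils.resample`. The tiny data frame and its values are hypothetical. Oversampling should be applied to the training data only, so that the test data stay untouched.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training frame with an imbalanced target
train_df = pd.DataFrame({
    "Fare": [7.3, 71.3, 8.1, 53.1, 8.5, 8.0, 51.9, 21.1],
    "Survived": [0, 1, 0, 1, 0, 0, 0, 0],
})

# Step 1: inspect the class distribution
counts = train_df["Survived"].value_counts()
print(counts)

# Step 2: split rows into majority and minority class
majority_label = counts.idxmax()
majority = train_df[train_df["Survived"] == majority_label]
minority = train_df[train_df["Survived"] != majority_label]

# Step 3: draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=123
)

# Step 4: combine and shuffle
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=123)
print(balanced["Survived"].value_counts())
```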

# Titanic Deployment

In a final step, we *deploy* our model to make it available to other applications.

## Preparation & Recap

In [None]:
# classifier to deploy: we will reuse the logistic regression classifier 
deployment_classifier = logreg

In [None]:
print(deployment_classifier)

In [None]:
# recap: first two rows of the data set (including the target column "Survived")
df_dummies.head(2)

In [None]:
# recap: first two rows of the predictors (without the target column "Survived")
predictors.head(2)

Use the classifier to predict survival for two example passengers. Review the output.
- Which passenger's survival is predicted correctly?

In [None]:
deployment_classifier.predict(predictors.iloc[0:2])
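Besides the predicted class, scikit-learn classifiers such as logistic regression can also return class probabilities via `predict_proba`, which shows how confident the model is about each prediction. The sketch below uses a stand-in model trained on synthetic data (an assumption); in this notebook the equivalent call would be `deployment_classifier.predict_proba(predictors.iloc[0:2])`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model on synthetic data (assumption: not the Titanic model)
Xp, yp = make_classification(n_samples=200, n_features=6, random_state=2)
demo_clf = LogisticRegression(max_iter=1000).fit(Xp, yp)

# Predicted class labels for the first two rows
labels = demo_clf.predict(Xp[:2])

# One row per sample, one column per class (ordered as in demo_clf.classes_)
probabilities = demo_clf.predict_proba(Xp[:2])

print(labels)
print(np.round(probabilities, 3))
```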

## Deployment as REST API

In this notebook we will use the _Watson Machine Learning (WML)_ service to deploy the model. Please fill in the service credentials of your instance below.

Sample format:

```json
{
  "apikey": "...",
  "iam_apikey_description": "...",
  "iam_apikey_name": "...",
  "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Manager",
  "iam_serviceid_crn": "...",
  "instance_id": "...",
  "url": "https://eu-de.ml.cloud.ibm.com"
}
```

In [None]:
# fill in your credentials
wml_credentials = {
}

In [None]:
# import watson machine learning Python client library
from watson_machine_learning_client import WatsonMachineLearningAPIClient

In [None]:
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
wml_client.deployments.list()

If the previous cell returned a (potentially empty) list of deployments, the provided credentials are correct and you are all set to create your first deployment. For more information please check:
- [REST API](https://watson-ml-api.mybluemix.net)
- [Python client](https://pypi.org/project/watson-machine-learning-client)

In [None]:
metadata = {
        wml_client.repository.ModelMetaNames.NAME: 'Titanic Deployment',
        wml_client.repository.ModelMetaNames.DESCRIPTION: 'My first Titanic deployment.',
        wml_client.repository.ModelMetaNames.AUTHOR_NAME: 'Your Name'
}

In [None]:
# store the scikit-learn model in WML
model = wml_client.repository.store_model(deployment_classifier, meta_props=metadata)

In [None]:
# review artefacts in your WML instance (e.g. models, deployments)
wml_client.repository.list()

In [None]:
published_model_uid = wml_client.repository.get_model_uid(model)

In [None]:
# deploy the model as a REST API (hint: after the deployment, rerun the cell that lists the repository artefacts and check the output)
created_deployment = wml_client.deployments.create(published_model_uid, name="Titanic Deployment")

In [None]:
scoring_endpoint = wml_client.deployments.get_scoring_url(created_deployment)

In [None]:
print(scoring_endpoint)

In [None]:
# prepare payload to send
scoring_values = predictors.iloc[0:2].to_numpy().tolist()
scoring_payload = {"values": scoring_values}
print(scoring_payload)

In [None]:
# make a prediction and review the outcome
predictions = wml_client.deployments.score(scoring_endpoint, scoring_payload)
print(predictions)

- Do the results match the predictions made locally in this notebook?
- What information does the response payload include in addition to the classification?

In [None]:
wml_client.repository.list()

In [None]:
# delete the deployment and the model you created in this exercise (deployment first)
wml_client.deployments.delete("your-deployment-id")
wml_client.repository.delete("your-model-id")