# Generate an Explainable Report with the Titanic Dataset Using XAI

This notebook demonstrates how to generate an explanation report using the compiler implemented in the XAI library.


## Motivation
Once the PoC is done (and you know where your data comes from, what it looks like, and what it can predict), the ideal next step is to put your model into production and make it useful for the rest of the business.

Does this sound familiar? Before promoting your model into production, do you also need to answer the questions below?
1. _How sure are you that your model is ready for production?_
2. _How can you explain the model's performance in a business context that non-technical management can understand?_
3. _How can you compare newly trained models with existing models without doing it manually every iteration?_

In the XAI project, our vision is simply to:
1. __Speed up data validation__
2. __Simplify model engineering__
3. __Build trust__  
  
For more details, please refer to our [whitepaper](https://sap.sharepoint.com/sites/100454/ML_Apps/Shared%20Documents/Reusable%20Components/Explainability/XAI_Whitepaper.pdf?csf=1&e=phIUNN&cid=771297d7-d488-441a-8a65-dab0305c3f04)

## Steps
1. Create a model to predict survival on the Titanic, using the data provided in [titanic](https://www.kaggle.com/c/titanic/data)
2. Evaluate the model performance with an XAI report

***

### 1. Perform Model Training

In [1]:
import numpy as np
import pandas as pd

#### 1.1 Loading Data

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
data = pd.concat([train, test], sort=False)

data.to_csv('titanic.csv', index=False)
data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 1309.0 | 891.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 |
| mean | 655.0 | 0.383838 | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 |
| std | 378.020061 | 0.486592 | 0.837836 | 14.413493 | 1.041658 | 0.86556 | 51.758668 |
| min | 1.0 | 0.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 328.0 | 0.0 | 2.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 655.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 982.0 | 1.0 | 3.0 | 39.0 | 1.0 | 0.0 | 31.275 |
| max | 1309.0 | 1.0 | 3.0 | 80.0 | 8.0 | 9.0 | 512.3292 |


#### 1.2 Quick Check

In [3]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [4]:
data.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

#### 1.3 Feature Engineering

In [5]:
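# Sex: encode male as 0 and female as 1
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Embarked: fill the 2 missing values with 'S', then encode as integers
data['Embarked'] = data['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Fare: fill the single missing value with the mean fare
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())

# Age: fill the missing values with one random integer drawn from
# [mean - std, mean + std); every missing row receives the same value
age_avg = data['Age'].mean()
age_std = data['Age'].std()

data['Age'] = data['Age'].fillna(np.random.randint(int(age_avg - age_std), int(age_avg + age_std)))

# Final clean-up: drop the columns that are not used as features
delete_columns = ['Name', 'PassengerId', 'SibSp', 'Parch', 'Ticket', 'Cabin']
data = data.drop(columns=delete_columns)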
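
The `Age` step above draws a single random integer and uses it for every missing row. If one draw per missing row is preferred, a minimal sketch could look like the following. It runs on a toy DataFrame rather than the Titanic data, and the seed, the variable names, and the use of `np.random.default_rng` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 26.0, 35.0, np.nan]})

age_avg = df["Age"].mean()
age_std = df["Age"].std()
missing = df["Age"].isna()

# Integer bounds of the interval [mean - std, mean + std)
low, high = int(age_avg - age_std), int(age_avg + age_std)

# Draw one integer per missing row instead of one value shared by all rows
rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
df.loc[missing, "Age"] = rng.integers(low, high, size=int(missing.sum()))

print(df)
```

Seeding the generator keeps the imputed values stable between runs, which matters when the trained model is later compared across iterations.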

#### 1.4 Data Splitting

In [6]:
train = data[:len(train)]
test = data[len(train):]

y_train = train['Survived']
X_train = train.drop('Survived', axis = 1)
X_test = test.drop('Survived', axis = 1)

X_train.to_csv("train_data.csv", index=False)

X_train.head()

| | Pclass | Sex | Age | Fare | Embarked |
|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 7.25 | 0 |
| 1 | 1 | 1 | 38.0 | 71.2833 | 1 |
| 2 | 3 | 1 | 26.0 | 7.925 | 0 |
| 3 | 1 | 1 | 35.0 | 53.1 | 0 |
| 4 | 3 | 0 | 35.0 | 8.05 | 0 |


#### 1.5 Train a DecisionTreeClassifier Model

In [7]:
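import pickle
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Persist the trained model; the context manager closes the file so the
# pickle is fully written before it is read back
with open('model.pkl', 'wb') as pkl:
    pickle.dump(clf, pkl)
clf = None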

#### 1.6 Load the Model and Evaluate

In [8]:
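with open('model.pkl', 'rb') as model_pkl:
    model = pickle.load(model_pkl)

# Accuracy is measured on the training data itself, so it overstates
# how well the model will do on unseen passengers
accuracy = round(model.score(X_train, y_train) * 100, 2)
print("Model Accuracy: ", accuracy)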

Model Accuracy:  97.87
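
The 97.87 figure above is computed on the same rows the tree was trained on, and an unconstrained decision tree can largely memorize its training set. A common way to get a more honest estimate is to hold out part of the labeled data. The sketch below illustrates the idea on synthetic data (the columns, sizes, noise rate, and seeds are made up for illustration, and this is not the notebook's own evaluation).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (hypothetical, not the Titanic set)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 80, size=n),
})
# Label follows Sex, with roughly 20% of the rows flipped as noise
noise = rng.random(n) < 0.2
y = np.where(noise, 1 - X["Sex"], X["Sex"])

# Hold out 25% of the labeled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
train_acc = clf.score(X_fit, y_fit)  # accuracy on rows the tree has seen
val_acc = clf.score(X_val, y_val)    # accuracy on held-out rows

print("Training accuracy:  ", round(train_acc * 100, 2))
print("Validation accuracy:", round(val_acc * 100, 2))
```

The gap between the two numbers gives a rough sense of how much of the training accuracy comes from memorization rather than generalization.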


#### 1.7 Run Inference and Write the Results

In [9]:
y_pred = model.predict(X_test)

sub = pd.DataFrame(pd.read_csv("test.csv")['PassengerId'])
sub['Survived'] = list(map(int, y_pred))
sub.to_csv("submission.csv", index=False)

***

### 2. Invoke the XAI Compiler

In [10]:
import sys
sys.path.append('../../../')
from xai.compiler.base import Configuration, Controller

#### 2.1 Specify config file

In [11]:
json_config = 'basic-report.json'

#### 2.2 Initialize the compiler controller with the config

In [12]:
controller = Controller(config=Configuration(json_config))
print(controller.config)

{'name': 'Report for Titanic Dataset', 'overview': True, 'content_table': True, 'contents': [{'title': 'Feature Importance Analysis', 'desc': 'This section provides the analysis on feature', 'sections': [{'title': 'Feature Importance Ranking', 'component': {'_comment': 'refer to document section xxxx', 'class': 'FeatureImportanceRanking', 'attr': {'trained_model': 'model.pkl', 'train_data': 'train_data.csv'}}}]}, {'title': 'Data Statistics Analysis', 'desc': 'This section provides the analysis on data', 'sections': [{'title': 'Simple Data Statistic', 'component': {'_comment': 'refer to document section xxxx', 'class': 'DataStatisticsAnalysis', 'attr': {'data': 'titanic.csv'}}}]}], 'writers': [{'class': 'Pdf', 'attr': {'name': 'titanic-basic-report'}}]}
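
The notebook does not show the contents of `basic-report.json`, but the printed configuration suggests its shape: report-level options, a list of content sections that each name a component class and its attributes, and a list of writers. The sketch below rebuilds that structure from the printed output and serializes it with the standard `json` module. The exact layout of the real file is an assumption, and nothing here calls the XAI library.

```python
import json

# Structure copied from the printed controller.config; whether the real
# basic-report.json is laid out exactly like this is an assumption
config = {
    "name": "Report for Titanic Dataset",
    "overview": True,
    "content_table": True,
    "contents": [
        {
            "title": "Feature Importance Analysis",
            "desc": "This section provides the analysis on feature",
            "sections": [
                {
                    "title": "Feature Importance Ranking",
                    "component": {
                        "_comment": "refer to document section xxxx",
                        "class": "FeatureImportanceRanking",
                        "attr": {
                            "trained_model": "model.pkl",
                            "train_data": "train_data.csv",
                        },
                    },
                }
            ],
        },
        {
            "title": "Data Statistics Analysis",
            "desc": "This section provides the analysis on data",
            "sections": [
                {
                    "title": "Simple Data Statistic",
                    "component": {
                        "_comment": "refer to document section xxxx",
                        "class": "DataStatisticsAnalysis",
                        "attr": {"data": "titanic.csv"},
                    },
                }
            ],
        },
    ],
    "writers": [{"class": "Pdf", "attr": {"name": "titanic-basic-report"}}],
}

# Serialize to JSON text and parse it back to confirm it round-trips
text = json.dumps(config, indent=2)
loaded = json.loads(text)

# List the component classes the report would use, in order
components = [
    section["component"]["class"]
    for content in loaded["contents"]
    for section in content["sections"]
]
print(components)
```

The component attributes point at the artifacts written earlier in the notebook: `model.pkl`, `train_data.csv`, and `titanic.csv`.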


#### 2.3 Render the report with the compiler

In [13]:
controller.render()

[Examples]: [1, 2, 3, 4, 5]

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
[Examples]: ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450']

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
[Examples]: [nan, 'C85', nan, 'C123', nan]

  '[Examples]: %s\n' % (column, col_data.tolist()[:5]))
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column][data[column].isnull()] = 'NAN'
