# Generate Explainable Report with Titanic dataset using XAI

This notebook demonstrates how to generate an explanation report using the compiler implemented in the XAI library.


## Motivation
Once the PoC is done (and you know where your data comes from, what it looks like, and what it can predict), the ideal next step is to put your model into production and make it useful for the rest of the business.

Does this sound familiar? Do you also need to answer the questions below before promoting your model to production?
1. _How sure are you that your model is ready for production?_
2. _How can you explain the model's performance in a business context that non-technical management can understand?_
3. _How can you compare newly trained models with existing models when the comparison is done manually at every iteration?_

In the XAI project, our vision is simply to:
1. __Speed up data validation__
2. __Simplify model engineering__
3. __Build trust__  
  
For more details, please refer to our [whitepaper](https://sap.sharepoint.com/sites/100454/ML_Apps/Shared%20Documents/Reusable%20Components/Explainability/XAI_Whitepaper.pdf?csf=1&e=phIUNN&cid=771297d7-d488-441a-8a65-dab0305c3f04)

## Steps
1. Create a model to predict survival on the Titanic, using the data provided in [titanic](https://www.kaggle.com/c/titanic/data)
2. Evaluate the model performance with an XAI report

***

### 1. Performance Model Training

In [1]:
import numpy as np
import pandas as pd
import re
import warnings

#### 1.1 Loading Data

In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

data = pd.concat([train_data, test_data], sort=False)
data.to_csv('titanic.csv', index=False)

all_data = [train_data, test_data]

all_data

[     PassengerId  Survived  Pclass  \
 0              1         0       3   
 1              2         1       1   
 2              3         1       3   
 3              4         1       1   
 4              5         0       3   
 ..           ...       ...     ...   
 886          887         0       2   
 887          888         1       1   
 888          889         0       3   
 889          890         1       1   
 890          891         0       3   
 
                                                   Name     Sex   Age  SibSp  \
 0                              Braund, Mr. Owen Harris    male  22.0      1   
 1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
 2                               Heikkinen, Miss. Laina  female  26.0      0   
 3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
 4                             Allen, Mr. William Henry    male  35.0      0   
 ..                                               
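
The feature engineering that follows fills gaps in `Age`, `Fare`, and `Embarked`, so it can help to look at how many values are missing per column first. The sketch below is one possible way to do that; the helper name `missing_summary` and the sample rows are made up for illustration and are not part of the XAI library or the Kaggle data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny illustrative frame (made-up rows, not the Kaggle data)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 53.1],
    "Embarked": ["S", "C", None, "S"],
    "Pclass": [3, 1, 3, 1],
})
print(missing_summary(sample))
```

In the notebook itself, the same helper could be applied to `train_data` and `test_data` before any `fillna` call.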

#### 1.2 Feature engineering

In [3]:
#Feature 3
for data in all_data:
    data['family_size'] = data['SibSp'] + data['Parch'] + 1

#Feature 3.1
for data in all_data:
    data['is_alone'] = 0
    data.loc[data['family_size'] == 1, 'is_alone'] = 1

#Feature 4
for data in all_data:
    data['Embarked'] = data['Embarked'].fillna('S')

#Feature 5
for data in all_data:
    data['Fare'] = data['Fare'].fillna(data['Fare'].median())
train_data['category_fare'] = pd.qcut(train_data['Fare'], 4)

#Feature 6
for data in all_data:
    age_avg  = data['Age'].mean()
    age_std  = data['Age'].std()
    age_null = data['Age'].isnull().sum()

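    # Fill missing ages with random integers within one standard deviation of the mean
    random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null)
    # Use .loc instead of chained indexing to avoid assigning to a copy
    data.loc[data['Age'].isnull(), 'Age'] = random_list
    data['Age'] = data['Age'].astype(int)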

train_data['category_age'] = pd.cut(train_data['Age'], 5)

#Feature 7
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\. ', name)
    if title_search:
        return title_search.group(1)
    return ""

for data in all_data:
    data['title'] = data['Name'].apply(get_title)

for data in all_data:
    data['title'] = data['title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],'Rare')
    data['title'] = data['title'].replace('Mlle','Miss')
    data['title'] = data['title'].replace('Ms','Miss')
    data['title'] = data['title'].replace('Mme','Mrs')

#Map Data
for data in all_data:

    #Mapping Sex
    sex_map = { 'female':0 , 'male':1 }
    data['Sex'] = data['Sex'].map(sex_map).astype(int)

    #Mapping Title
    title_map = {'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'Rare':5}
    data['title'] = data['title'].map(title_map)
    data['title'] = data['title'].fillna(0)

    #Mapping Embarked
    embark_map = {'S':0, 'C':1, 'Q':2}
    data['Embarked'] = data['Embarked'].map(embark_map).astype(int)

    #Mapping Fare
    data.loc[ data['Fare'] <= 7.91, 'Fare']                            = 0
    data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
    data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare']   = 2
    data.loc[ data['Fare'] > 31, 'Fare']                               = 3
    data['Fare'] = data['Fare'].astype(int)

    #Mapping Age
    data.loc[ data['Age'] <= 16, 'Age']                       = 0
    data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
    data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
    data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
    data.loc[ data['Age'] > 64, 'Age']                        = 4


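
The fare and age mappings above assign each value to a right-closed interval through a series of `.loc` updates. A more compact alternative is `pd.cut` with the same edges. The sketch below is an illustration under that assumption, not the notebook's own method; the helper `to_bins` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Same thresholds as the manual mappings, with open-ended outer edges
AGE_EDGES = [-np.inf, 16, 32, 48, 64, np.inf]
FARE_EDGES = [-np.inf, 7.91, 14.454, 31, np.inf]


def to_bins(values: pd.Series, edges: list) -> pd.Series:
    """Map each value to the integer index of its right-closed interval."""
    labels = list(range(len(edges) - 1))
    return pd.cut(values, bins=edges, labels=labels, right=True).astype(int)


# Made-up values chosen to sit on and around the bin boundaries
ages = pd.Series([5, 16, 17, 32, 48, 49, 64, 80])
fares = pd.Series([7.25, 7.91, 8.05, 14.454, 31.0, 71.28])

print(to_bins(ages, AGE_EDGES).tolist())
print(to_bins(fares, FARE_EDGES).tolist())
```

Because the edges live in one list per feature, changing a threshold touches a single place rather than two adjacent conditions.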


#### 1.3 Feature Selection

In [4]:
# Create list of columns to drop
drop_elements = ["Name", "Ticket", "Cabin", "SibSp", "Parch", "family_size"]

# Drop columns from both data sets
train_data = train_data.drop(drop_elements, axis = 1)
train_data = train_data.drop(['PassengerId','category_fare', 'category_age'], axis = 1)
test_data  = test_data.drop(drop_elements, axis = 1)

X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]

X_train.to_csv("train_data.csv", index=False)
X_train.head()

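|   | Pclass | Sex | Age | Fare | Embarked | is_alone | title |
|---|--------|-----|-----|------|----------|----------|-------|
| 0 | 3 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 2 | 3 | 1 | 0 | 3 |
| 2 | 3 | 0 | 1 | 1 | 0 | 1 | 2 |
| 3 | 1 | 0 | 2 | 3 | 0 | 0 | 3 |
| 4 | 3 | 1 | 2 | 1 | 0 | 1 | 1 |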


#### 1.4 Train a DecisionTreeClassifier model

In [5]:
import pickle
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

# Persist the trained model; the context manager closes the file handle
with open('model.pkl', 'wb') as pkl:
    pickle.dump(decision_tree, pkl)
decision_tree = None

#### 1.5 Load the model and evaluate

In [6]:
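# Reload the persisted model; the context manager closes the file handle
with open('model.pkl', 'rb') as model_pkl:
    model = pickle.load(model_pkl)

X_test  = test_data.drop("PassengerId", axis=1).copy()
# Note: this score is computed on the training data, not on held-out data
accuracy = round(model.score(X_train, Y_train) * 100, 2)
print("Training accuracy: ", accuracy)

Training accuracy:  87.32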
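
The score above is computed on `X_train`, the same data the tree was fitted on, so it tends to overstate how the model will do on unseen passengers. A held-out split gives a more conservative estimate. The sketch below illustrates the idea on synthetic data from `make_classification`, which is a stand-in (an assumption for self-containment), not the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=7, n_informative=4, random_state=0)

# Hold out 20% of the rows, keeping the class balance in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_fit, y_fit)

train_acc = tree.score(X_fit, y_fit)  # accuracy on data the tree has seen
val_acc = tree.score(X_val, y_val)    # accuracy on held-out data

print("Training accuracy:   %.2f" % (train_acc * 100))
print("Validation accuracy: %.2f" % (val_acc * 100))
```

With the notebook's data, the same split could be applied to `X_train` and `Y_train` before fitting.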


#### 1.6 Run inference and write the results

In [7]:
Y_pred = model.predict(X_test)
result = pd.DataFrame({
    "PassengerId":test_data["PassengerId"],
    "Survived": Y_pred
})
result.to_csv('result.csv', index = False)

***

### 2. Invoke the XAI compiler

In [8]:
import os
import sys
sys.path.append('../../../')
from xai.compiler.base import Configuration, Controller

#### 2.1 Specify config file

In [9]:
json_config = 'basic-report.json'

#### 2.2 Initialize the compiler controller with the config

In [10]:
controller = Controller(config=Configuration(json_config))
print(controller.config)

{'name': 'Report for Titanic Dataset', 'overview': True, 'content_table': True, 'contents': [{'title': 'Feature Importance Analysis', 'desc': 'This section provides the analysis on feature', 'sections': [{'title': 'Feature Importance Ranking', 'component': {'_comment': 'refer to document section xxxx', 'class': 'FeatureImportanceRanking', 'attr': {'trained_model': 'model.pkl', 'train_data': 'train_data.csv'}}}]}, {'title': 'Data Statistics Analysis', 'desc': 'This section provides the analysis on data', 'sections': [{'title': 'Simple Data Statistic', 'component': {'_comment': 'refer to document section xxxx', 'class': 'DataStatisticsAnalysis', 'attr': {'data': 'titanic.csv'}}}]}], 'writers': [{'class': 'Pdf', 'attr': {'name': 'titanic-basic-report'}}]}
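
The printed configuration is easier to follow when laid out as nested JSON. The sketch below rebuilds the dictionary exactly as it appears in the output above and serializes it with indentation. Whether the actual `basic-report.json` contains further keys is not shown here, so treat this as a reading aid rather than the file's definitive contents.

```python
import json

# Dictionary copied from the printed controller.config output
config = {
    "name": "Report for Titanic Dataset",
    "overview": True,
    "content_table": True,
    "contents": [
        {
            "title": "Feature Importance Analysis",
            "desc": "This section provides the analysis on feature",
            "sections": [
                {
                    "title": "Feature Importance Ranking",
                    "component": {
                        "_comment": "refer to document section xxxx",
                        "class": "FeatureImportanceRanking",
                        "attr": {
                            "trained_model": "model.pkl",
                            "train_data": "train_data.csv",
                        },
                    },
                }
            ],
        },
        {
            "title": "Data Statistics Analysis",
            "desc": "This section provides the analysis on data",
            "sections": [
                {
                    "title": "Simple Data Statistic",
                    "component": {
                        "_comment": "refer to document section xxxx",
                        "class": "DataStatisticsAnalysis",
                        "attr": {"data": "titanic.csv"},
                    },
                }
            ],
        },
    ],
    "writers": [{"class": "Pdf", "attr": {"name": "titanic-basic-report"}}],
}

# Serialize with indentation so the section/component nesting is visible
config_text = json.dumps(config, indent=2)
print(config_text)
```

Each entry in `contents` groups one or more `sections`, and each section names a component `class` together with the files it reads: `model.pkl` and `train_data.csv` from step 1, and `titanic.csv` from the data-loading cell.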


#### 2.3 Render the report with the compiler

In [11]:
controller.render()



***

### Result

In [12]:
print("report generated : %s/titanic-basic-report.pdf" % os.getcwd())

report generated : /Users/i062308/Development/Explainable_AI/tutorials/compiler/titanic2/titanic-basic-report.pdf
