# Titanic: Applying the [KISS principle](https://en.wikipedia.org/wiki/KISS_principle) (Keep It Small and Simple)

> *KISS is a design principle noted by the U.S. Navy in 1960. The KISS principle states that most systems work best if they are kept simple rather than made complicated; therefore, simplicity should be a key goal in design, and unnecessary complexity should be avoided.*

## Explainable AI
Recently the [National Institute of Standards and Technology (NIST)](https://www.nist.gov/) published a draft paper, ["Four Principles of Explainable Artificial Intelligence"](https://nvlpubs.nist.gov/nistpubs/ir/2020/NIST.IR.8312-draft.pdf). The four principles are as follows:

* **Explanation:** *Systems deliver accompanying evidence or reason(s) for all outputs.*
* **Meaningful:** *Systems provide explanations that are understandable to individual users.*
* **Explanation Accuracy:** *The explanation correctly reflects the system’s process for generating the output.* 
* **Knowledge Limits:** *The system only operates under conditions for which it was designed or when the system reaches a sufficient confidence in its output.*

In this notebook we list a selection of simple but meaningful models, *i.e.* <font color='red'>models that you can explain to your boss whilst in the elevator.</font>

## Explainability and the GDPR
Being able to easily explain how a model works, or how a decision was made based on the model, is not a mere intellectual nicety; in fact the [EU General Data Protection Regulation (GDPR) 2016/679](https://eur-lex.europa.eu/eli/reg/2016/679), Article 15(1)(h) states:

> "*The data subject shall have the right to obtain... ...meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing*"

Also, in Article 22:

> "*The data subject shall have the right to obtain... ...human intervention on the part of the controller, to express his or her point of view and to contest the decision.*"

In order to comply with this, the data scientist must be able to clearly explain how any decision was originally arrived at. 

Non-compliance with the GDPR by a company can result in serious consequences, and it is part of the job of a data scientist to mitigate such risks for their employers. (For those interested, the website [GDPR Enforcement Tracker](https://www.enforcementtracker.com/) has a partial list of the fines that have been imposed.)

## How the scores are calculated
In order to avoid submitting each of these models to the competition for scoring I shall make use of the [ground truth file](https://www.kaggle.com/martinglatz/submission-solution), in conjunction with the [scikit-learn accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).
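
As a reminder of what is being measured: with its default settings, `accuracy_score` returns the fraction of predictions that match the true labels. A minimal sketch of that calculation, using five made-up labels (the arrays `y_true` and `y_pred` below are hypothetical and have nothing to do with the competition data):

```python
import numpy as np

# Hypothetical ground truth and predictions, for illustration only.
y_true = np.array([0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy is simply (number of matching labels) / (number of labels).
n_correct = int(np.sum(y_true == y_pred))
accuracy = n_correct / len(y_true)

print("%d of %d correct, accuracy = %.5f" % (n_correct, len(y_true), accuracy))
```

Every score quoted in this notebook is this same ratio, computed over the 418 passengers in the test set.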

## Set up the essentials

In [1]:
import pandas  as pd
import numpy   as np
from sklearn.metrics import accuracy_score

solution   = pd.read_csv('../input/submission-solution/submission_solution.csv')
test_data  = pd.read_csv('../input/titanic/test.csv')
X_test     = pd.get_dummies(test_data)

# $\approx$50%
This is the average result of the magnificently minimalist notebook ["Titanic Random Survival Prediction"](https://www.kaggle.com/tarunpaparaju/titanic-random-survival-prediction) by Kaggle Grandmaster [Tarun Paparaju](https://www.kaggle.com/tarunpaparaju):

In [2]:
predictions = np.round(np.random.random((len(test_data)))).astype(int)
print("The score is %.5f" % accuracy_score( solution['Survived'] , predictions ) )

The score is 0.48325
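
A single random draw can land a few points either side of 50%, as the run above shows. The sketch below repeats the coin flip many times to illustrate why the average settles near one half. It assumes synthetic labels (`y_true`, with roughly 38% positives) standing in for the real solution file, and a fixed seed so that the run is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, so the run is repeatable

# Hypothetical labels: 418 passengers, roughly 38% positives.
# This is an assumption standing in for the real solution file.
y_true = (rng.random(418) < 0.38).astype(int)

# Repeat the coin-flip model many times and record each accuracy.
n_trials = 2000
scores = np.empty(n_trials)
for t in range(n_trials):
    guesses = rng.integers(0, 2, size=y_true.size)   # 0 or 1, equally likely
    scores[t] = np.mean(guesses == y_true)

mean_score = scores.mean()
print("mean accuracy : %.3f" % mean_score)
print("min / max     : %.3f / %.3f" % (scores.min(), scores.max()))
```

A fair coin agrees with any fixed label half of the time, so the expected accuracy is 50% whatever the class balance; only the spread of individual runs depends on the number of passengers.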


# 62.2%
### No survivors

This model represents the most terrible scenario; there are no survivors. This model actually correctly guesses the fate of 260 of the 418 passengers, which is a stark reminder of the tragedy that was the Titanic.

In [3]:
predictions = np.zeros((418), dtype=int)
print("The score is %.5f" % accuracy_score( solution['Survived'] , predictions ) )

The score is 0.62201


This model is what is known as the [Zero Rule](https://machinelearningcatalogue.com/algorithm/alg_zero-rule.html) classifier (a.k.a. **ZeroR** or **0-R**): it ignores every feature and always predicts the majority class of the dataset. It is against this baseline (and not the random model above) that one should compare the performance of all other models based on this data. Any model that does not beat this score has something *very* wrong with it.
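
The same baseline can be obtained without hard-coding the zeros. One possible sketch uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"`, shown here on a hypothetical ten-label toy set rather than on the Titanic data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 6 of 10 did not survive, so the majority class is 0.
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_train = np.zeros((len(y_train), 1))   # placeholder features; this strategy ignores them

zero_r = DummyClassifier(strategy="most_frequent")
zero_r.fit(X_train, y_train)

# Whatever the input, the prediction is the majority class seen during fit.
X_new = np.zeros((5, 1))
zero_r_predictions = zero_r.predict(X_new)
zero_r_score = zero_r.score(X_train, y_train)   # accuracy on the toy labels

print(zero_r_predictions)
print("The score is %.5f" % zero_r_score)
```

Letting the classifier find the majority class is convenient when it is not known in advance which class dominates.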

The **no survivors** model is also useful in another respect; it provides us with an indication as to whether the data is imbalanced or not. If the data were perfectly 'balanced' we would have as many survivors as non-survivors, and the [accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) would be `50%`. For highly imbalanced datasets, however, the accuracy score can be misleading. For example, imagine a scenario in which only 42 passengers in the test data survived; the **no survivors** model would then have an accuracy score of `90%` (376 correct out of 418) before we even start modelling. Clearly in such a situation the accuracy score is no longer fit for purpose and an alternative must be found.

In such a case we can use the scikit-learn [balanced_accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html), which for a binary problem is calculated using:

$$\texttt{balanced-accuracy} = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right )$$

i.e. the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate). 

For example:

In [4]:
from sklearn.metrics import balanced_accuracy_score
print("The balanced accuracy score is %.5f" % balanced_accuracy_score( solution['Survived'] , predictions ) )

The balanced accuracy score is 0.50000


We can see that the imbalance has now been compensated for: the **no survivors** model drops to exactly `50%`. That said, in this case the `62%` to `38%` split is only mildly imbalanced, so we shall continue using the standard accuracy score metric.
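
To make the formula concrete, here is a small sketch that evaluates it by hand for the hypothetical 42-survivor scenario described earlier, and compares the result with `balanced_accuracy_score`. The labels are constructed for illustration only; they are not the competition data:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Hypothetical imbalanced scenario from the text: 42 survivors out of 418.
y_true = np.array([1] * 42 + [0] * 376)
y_pred = np.zeros_like(y_true)            # the "no survivors" model

# Confusion-matrix counts, taking "survived" (1) as the positive class.
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))

sensitivity = tp / (tp + fn)              # 0 of 42 survivors identified
specificity = tn / (tn + fp)              # 376 of 376 non-survivors identified
by_hand = 0.5 * (sensitivity + specificity)

plain_accuracy = np.mean(y_true == y_pred)
from_sklearn = balanced_accuracy_score(y_true, y_pred)

print("plain accuracy      : %.5f" % plain_accuracy)
print("balanced (by hand)  : %.5f" % by_hand)
print("balanced (sklearn)  : %.5f" % from_sklearn)
```

The plain accuracy rewards the model for the 376 easy cases, whereas the balanced version gives each class equal weight, so a model that never finds a survivor cannot exceed 50%.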

# 76.6%

### Only the women survived
This is essentially the `gender_submission.csv` file that comes free with the competition.

In [5]:
predictions = np.zeros((418), dtype=int)
# now use our model
survived_df = X_test[(X_test["Sex_female"]==1)]

for i in survived_df.index:
    predictions[i] = 1 # the 1's are now the survivors
    
print("The score is %.5f" % accuracy_score( solution['Survived'] , predictions ) )

The score is 0.76555


# 77.5%
### Only women from 1st and 2nd class survive:

In [6]:
predictions = np.zeros((418), dtype=int)
# now use our model
survived_df = X_test[((X_test["Pclass"] ==1)|(X_test["Pclass"] ==2)) & (X_test["Sex_female"]==1 )]

for i in survived_df.index:
    predictions[i] = 1 # the 1's are now the survivors
    
print("The score is %.5f" % accuracy_score( solution['Survived'] , predictions ) )

The score is 0.77512


### Only women who embarked in either Cherbourg or Southampton survive:

In [7]:
predictions = np.zeros((418), dtype=int)
# now use our model
survived_df = X_test[((X_test["Embarked_S"] ==1)|(X_test["Embarked_C"] ==1)) & (X_test["Sex_female"]==1 )]

for i in survived_df.index:
    predictions[i] = 1 # the 1's are now the survivors
    
print("The score is %.5f" % accuracy_score( solution['Survived'] , predictions ) )

The score is 0.77512
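
The three rule-based cells above share one pattern: start from all zeros, then set a `1` wherever a condition holds. As an alternative to the explicit loop, the same idea can be sketched with boolean masks. The five-passenger frame `toy` below is hypothetical, and it uses the raw `Sex` and `Embarked` columns rather than the dummy-encoded ones:

```python
import pandas as pd

# Hypothetical five-passenger frame (not the competition data).
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1],
    "Sex":      ["female", "female", "male", "male", "female"],
    "Embarked": ["S", "Q", "C", "S", "C"],
})

is_female = toy["Sex"] == "female"

# Each rule is a boolean Series; casting it to int gives the 0/1 predictions.
women_survive     = is_female.astype(int)
upper_class_women = (is_female & toy["Pclass"].isin([1, 2])).astype(int)
women_from_s_or_c = (is_female & toy["Embarked"].isin(["S", "C"])).astype(int)

print(women_survive.tolist())
print(upper_class_women.tolist())
print(women_from_s_or_c.tolist())
```

Writing each rule as a single boolean expression keeps the model readable at a glance, which is rather the point of a KISS model.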


# 78.5%
This masterpiece is from the wonderful notebook ["Three lines of code for Titanic Top 20%"](https://www.kaggle.com/vbmokin/three-lines-of-code-for-titanic-top-20) written by Kaggle Master [Vitalii Mokin](https://www.kaggle.com/vbmokin). His model is the following:

* **All the women survived, and all the men died**
* **All boys ('Master') from 1st and 2nd class survived**
* **Everybody in 3rd class that embarked at Southampton ('S') died.**

I shall make a copy of the `test_data` dataframe, both to keep the original code as it is in his notebook and to preserve `test_data` for future small and simple models:


In [8]:
test = test_data.copy()   # a real copy; plain assignment would only alias test_data

In [9]:
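test['Boy'] = (test.Name.str.split().str[1] == 'Master.').astype('int')   # 1 if the second word of Name is 'Master.'
test['Survived'] = [1 if (x == 'female') else 0 for x in test['Sex']]     # women survive, men die
test.loc[(test.Boy == 1), 'Survived'] = 1                                 # every boy is set to survive (any class)
test.loc[((test.Pclass == 3) & (test.Embarked == 'S')), 'Survived'] = 0   # 3rd class from Southampton overridden to 0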

In [10]:
predictions = test['Survived']
print("The score is %.5f" % accuracy_score( solution['Survived'] , predictions ) )

The score is 0.78469


# How to create a submission.csv
If you wish to submit any of these predictions to the competition simply use this snippet of code to output a `submission.csv` file:

In [11]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
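
Before uploading, it can be worth checking that the file has the shape the competition expects: exactly the two columns `PassengerId` and `Survived`, one row per test passenger, and only `0` or `1` as predictions. The helper `check_submission` below is a hypothetical sketch, not part of any library, and is demonstrated on a made-up three-row frame written to an in-memory buffer rather than to disk:

```python
import io
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
    if len(df) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(df)))
    if "Survived" in df.columns and not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    if "PassengerId" in df.columns and df["PassengerId"].duplicated().any():
        problems.append("PassengerId values must be unique")
    return problems

# Hypothetical three-row submission, round-tripped through CSV text in memory.
toy_output = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

good_report = check_submission(reloaded, expected_rows=3)
bad_report = check_submission(reloaded.assign(Survived=[0, 2, 0]), expected_rows=4)

print(good_report)
print(bad_report)
```

For the real file one would call something like `check_submission(pd.read_csv('submission.csv'), expected_rows=418)`.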

## Please feel free to mention more KISS models to add!