# Chapter I: White Box Techniques (~30 minutes)

Guideline:  
 
- Prerequisites: Data Preparation (Understand and load the data) 
- Decision Rules
    - Building decision rules
    - OneR 
    - ZeroR 
- Decision Trees 


### Preliminaries:  Understand the problem and load the data
This dataset contains information about the passengers of the Titanic and whether they survived. The objective of the task is to classify whether a passenger survived or not according to the features. 

The dataset is available at: Data/titanic_dataset.xls

Attribute Information (in order):
    - pclass     Passenger class 
    - name       Name
    - sex        Sex 
    - age        Age
    - sibsp      Number of Siblings/Spouses Aboard
    - parch      Number of Parents/Children Aboard
    - ticket     Ticket Number
    - fare       Passenger Fare
    - cabin      Cabin 
    - embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
    - boat       Lifeboat (if survived)

Target Variable 
    - survived     Whether the passenger survived (1) or not (0)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score
import pandas as pd
import numpy as np

def metrics(y_test, y_pred):
    """Print accuracy, recall and F1-score for binary predictions."""
    print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
    print("Recall  : {}".format(recall_score(y_test, y_pred)))
    print("F1-Score: {}".format(f1_score(y_test, y_pred)))


# Read the data from the source file
data_df = pd.read_excel("Data/titanic_dataset.xls").set_index("name")

# Split the data into train and test sets.
# The target is dropped by name so it cannot leak into the features, whatever its column position.
x_train, x_test, y_train, y_test = train_test_split(data_df.drop(columns="survived"), data_df["survived"], 
                                                    test_size=0.25, random_state=4242)

## Decision Rules
Decision rules are sets of IF-THEN rules. Several rules can be combined to make predictions. The rules can be defined manually by the user: a user with good domain knowledge can encode very good rules. 

An IF-THEN rule is an expression of the form: 
```
IF condition THEN conclusion

# Example: 
IF age=youth AND love_coffee=YES AND favourite_meetup="Python Barcelona"  
        THEN uses_python=YES
ELSE uses_python=NO

```
In this section you will build your own set of decision rules by making some hypotheses about the dataset. Later in the section, two simple algorithms that learn rules from the dataset are explained. Decision rules are commonly evaluated with two metrics: coverage (the percentage of instances to which the rule's condition applies) and accuracy (the percentage of covered instances that the rule classifies correctly).
<div>
    <center> <img src="Data/img/coverage_and_accuracy.JPG" width="250" /> </center>
</div>

Here |D| is the number of observations in the dataset, Ncovers is the number of observations covered by the rule, and Ncorrect is the number of covered observations that the rule classifies correctly.
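
As a minimal sketch of these two metrics, the cell below evaluates the rule `IF embarked = Q THEN survived = 1` on a small made-up table. The data and the name `toy` are assumptions for illustration only; they are not taken from the Titanic file.

```python
import pandas as pd

# Toy data, invented for illustration (not the Titanic dataset)
toy = pd.DataFrame({
    "embarked": ["Q", "S", "Q", "C", "S", "Q", "C", "S"],
    "survived": [1, 0, 0, 1, 0, 1, 1, 0],
})

# Rule: IF embarked = Q THEN survived = 1
covered = toy[toy["embarked"] == "Q"]               # rows where the condition applies
n_covers = len(covered)
n_correct = int((covered["survived"] == 1).sum())   # covered rows where the conclusion holds

coverage = n_covers / len(toy)    # Ncovers / |D|
accuracy = n_correct / n_covers   # Ncorrect / Ncovers

print("Coverage: {:.3f}".format(coverage))
print("Accuracy: {:.3f}".format(accuracy))
```

Note that accuracy is computed only over the covered rows, so a rule can be very accurate and still apply to a small part of the dataset.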



### TODO

A) Analyze the dataset and encode an IF-ELSE rule on the training set

B) Calculate the coverage of the rule on the training set

C) Calculate the accuracy of the rule on the training set

D) Build y_pred by applying your IF-ELSE rule to the test set (an array with the predicted binary class of each input sample):
        1 if the passenger survived 
        0 if not survived 

E) Evaluate on the test set using the metrics() function

In [None]:
x_train_dr = x_train.copy()
# Analyze the dataset and encode an IF-ELSE block on training set 
x_train_dr["index"] = range(0, x_train_dr.shape[0]) 

# IF EMBARKED = Q THEN survived = YES 
# ELSE survived = NO 
x_train_dr = x_train_dr[x_train_dr["embarked"] == "Q"]

# @TODO: Try to build your own rule

In [None]:
# Calculate the coverage of the rule 
print("Coverage on training set {}".format(x_train_dr.shape[0] / x_train.shape[0]))

In [None]:
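# Calculate the accuracy of the rule
# "index" holds row positions, so the labels are selected by position with .iloc
original_tags = y_train.iloc[x_train_dr["index"].to_numpy()]
# The rule predicts survived = 1 for every covered passenger,
# so the correct predictions are the covered passengers who actually survived
n_correct = original_tags.sum()
print("Accuracy on training set: {}".format(n_correct / x_train_dr.shape[0]))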

In [None]:
# Build y_pred using previous rule 
x_test_dr = x_test.copy()


# Evaluate the rule
metrics(y_test, y_pred)

## OneR Algorithm  
OneR is probably one of the simplest classification methods (for discrete attributes); because of its simplicity, each prediction can be explained quickly. Despite this simplicity, on many commonly used datasets it is only a few percentage points less accurate than decision trees (source: Very Simple Classification Rules Perform Well on Most Commonly Used Datasets [link](https://www.mlpack.org/papers/ds.pdf)).

OneR works as follows: 

```
For each feature in the dataset: 
     Build a frequency table 
         - 1. Count how often each value of the target appears for each value of the feature
         - 2. Encode the most frequent class of each value into a rule
         - 3. Calculate the quality (error) of the rule 

The best predictor is the feature whose rule has the smallest error
```
Read more about OneR Algorithm at the following [link](https://www.saedsayad.com/oner.htm)
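
The steps above can be sketched in plain Python. This is an illustrative implementation under simple assumptions (discrete features, error measured as the number of misclassified training rows); the function name `one_r` and the toy rows are hypothetical and not part of the workshop code.

```python
from collections import Counter, defaultdict

def one_r(rows, features, target):
    """Return (best_feature, rule, errors); rule maps each feature value to a class."""
    best = None
    for feature in features:
        # 1. Frequency table: feature value -> counts of each target class
        table = defaultdict(Counter)
        for row in rows:
            table[row[feature]][row[target]] += 1
        # 2. Rule: each value predicts its most frequent class
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        # 3. Quality: number of training rows the rule gets wrong
        errors = sum(row[target] != rule[row[feature]] for row in rows)
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

# Toy rows, invented for illustration (not the Titanic dataset)
toy_rows = [
    {"sex": "female", "embarked": "C", "survived": 1},
    {"sex": "female", "embarked": "S", "survived": 1},
    {"sex": "female", "embarked": "S", "survived": 1},
    {"sex": "male",   "embarked": "C", "survived": 0},
    {"sex": "male",   "embarked": "S", "survived": 0},
    {"sex": "male",   "embarked": "S", "survived": 0},
    {"sex": "male",   "embarked": "Q", "survived": 0},
    {"sex": "female", "embarked": "Q", "survived": 0},
    {"sex": "male",   "embarked": "S", "survived": 0},
    {"sex": "female", "embarked": "C", "survived": 1},
]

best_feature, best_rule, best_errors = one_r(toy_rows, ["sex", "embarked"], "survived")
print(best_feature, best_rule, best_errors)
```

On this toy table the `sex` rule misclassifies one row and the `embarked` rule misclassifies three, so `sex` is selected.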

In [None]:
# 1. Count how often each value of the target appears for each embarked value
# 2. The most frequent class of each value can be read from the resulting table
x_train.assign(survived=y_train.values) \
  .groupby(['embarked','survived']) \
  .size()

In [None]:
# 3. Make the rule assigning that class to this value
# Using the previous frequency we can determine the following rules: 

```
IF embarked = C THEN survived = YES
IF embarked = Q THEN survived = NO
IF embarked = S THEN survived = NO
```

In [None]:
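# Encode the previous rules
rule_codification = {"C": 1, "Q": 0, "S": 0}
# map() returns NaN for a missing or unseen port; default those passengers to 0 (not survived)
y_pred = x_test['embarked'].map(rule_codification).fillna(0).astype(int)

# Evaluate the rule
metrics(y_test, y_pred)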

### TODO 

A) Build a frequency table for the sex feature 

B) Encode the frequency table as a rule  

C) Compare the embarked predictor with the sex predictor: is it better or worse? 

D) Try ZeroR, an even simpler classification method

In [None]:
# A:  Frequency table using sex feature 


In [None]:
# B: Encode the frequency table as a rule and build y_pred on the test set 
# Hint: consider using the map function


In [None]:
# C: Evaluate the rule  
metrics(y_test, y_pred)

##### ZeroR Algorithm 
The ZeroR algorithm is even simpler. It always predicts the majority class: the classifier relies only on the target value and ignores the predictors. 
Example: imagine a dataset for email spam classification (is_spam). Looking at the target value we find 57 cases of SPAM and 20 of NO SPAM. ZeroR builds the following rule: is_spam(X) = YES; in other words, YES is returned for every predicted instance. 
Of course, the limitations of ZeroR and OneR are obvious, but these two algorithms provide a useful baseline for Machine Learning models. 
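
A minimal sketch of ZeroR on the spam example above, written with the standard library only. The helper names `zero_r_fit` and `zero_r_predict` are hypothetical, and the label list is built from the counts quoted in the text rather than from a real dataset.

```python
from collections import Counter

def zero_r_fit(labels):
    """Return the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_predict(majority_class, n_samples):
    """Predict the majority class for every sample, ignoring all features."""
    return [majority_class] * n_samples

# Labels built from the counts in the example: 57 SPAM and 20 NO SPAM
spam_labels = ["SPAM"] * 57 + ["NO SPAM"] * 20

majority = zero_r_fit(spam_labels)
spam_pred = zero_r_predict(majority, 5)

# Accuracy of ZeroR on the data it was fitted on: share of the majority class
baseline_accuracy = spam_labels.count(majority) / len(spam_labels)

print(majority, spam_pred)
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any model that cannot beat this share of the majority class has learned nothing useful from the features.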

In [None]:
# Count occurrences on the train set (y_train) and determine the majority class (survived or not). 
# Hint: use the unique function from numpy and zip 


# Evaluate the rule using y_test
metrics(y_test, np.zeros(len(y_test)))

## Decision Trees
Decision trees are among the most popular supervised Machine Learning methods. They build a classification or regression model in the form of a tree structure. 
Decision trees are simple to understand and interpret: we can easily print the decision tree or determine the decision path of a prediction.  


<div>
    <center> <img src="Data/img/decision_tree.JPG" width="500" /> </center>
</div>

This section shows how to output graphical trees (using Graphviz, a graph visualization software). The last part includes a function to print the decision path of the decision tree. decision_path is a method of DecisionTreeClassifier that returns a sparse matrix showing which nodes of the tree a prediction goes through; this information can be used to understand why a prediction was made. 
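
To make the decision path concrete, here is a small sketch on made-up data. The arrays `X_toy` and `y_toy` and the feature names are assumptions for illustration; `export_text` is used here as a text-only alternative to Graphviz, not as a replacement for the workshop's visualization.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, invented for illustration: columns are [age, is_female]
X_toy = np.array([[22, 0], [38, 1], [26, 1], [35, 0],
                  [54, 0], [4, 1], [30, 0], [28, 1]])
y_toy = np.array([0, 1, 1, 0, 0, 1, 0, 1])

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Text rendering of the fitted tree
print(export_text(toy_tree, feature_names=["age", "is_female"]))

# Sparse indicator matrix of shape (n_samples, n_nodes):
# entry (i, j) is non-zero when sample i passes through node j
node_indicator = toy_tree.decision_path(X_toy)

sample_id = 1
visited = node_indicator.indices[
    node_indicator.indptr[sample_id]:node_indicator.indptr[sample_id + 1]
]
print("Nodes visited by sample {}: {}".format(sample_id, visited.tolist()))
```

The slice between two consecutive `indptr` entries is how a row of a CSR sparse matrix is read: it lists the node ids visited by that sample, from the root to the leaf.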

### TODO

A) Preprocess the data: impute missing values and one-hot encode categorical features. Use the function preprocessing_dataframe()

B) Train a decision tree model using x_train and y_train

C) Evaluate the model 

D) Print the tree using graphviz 

E) Determine the decision path for an observation
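
One point to watch in step A: when categorical features are one-hot encoded separately on the train and test sets, a category missing from one split produces a different set of columns, and the tree cannot predict on mismatched columns. The sketch below shows one possible way to align them with `reindex`; the toy frames are invented for illustration and this alignment step is an assumption, not part of the original workshop code.

```python
import pandas as pd

# Toy frames, invented for illustration: the test split has no "Q" passengers
train_toy = pd.DataFrame({"embarked": ["C", "Q", "S", "S"]})
test_toy = pd.DataFrame({"embarked": ["S", "C"]})

train_enc = pd.get_dummies(train_toy, columns=["embarked"])
test_enc = pd.get_dummies(test_toy, columns=["embarked"])
print(list(test_enc.columns))   # embarked_Q is missing here

# Give the test frame exactly the training columns;
# categories absent from the test split become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```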


In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

def preprocessing_dataframe(data_df: pd.DataFrame, missing_values_to_convert: list, categorical_to_encode: list) -> pd.DataFrame: 
    # Work on a copy so the caller's dataframe is not modified
    data_df = data_df.copy()

    # Handle missing values: replace NaN with the mean of the column
    for feature in missing_values_to_convert: 
        imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
        data_df[feature] = imputer.fit_transform(data_df[[feature]]).ravel()

    # One-hot encoding: replace each categorical column with its indicator columns
    for feature in categorical_to_encode: 
        onehot_encoding = pd.get_dummies(data_df[feature], prefix=feature)
        data_df = pd.concat([data_df, onehot_encoding], axis=1)
        data_df.drop([feature], axis=1, inplace=True)

    return data_df

# Preprocessing: detect missing values and encode categorical features 
# Use the function preprocessing_dataframe() on x_train and x_test
x_train_dtree = preprocessing_dataframe(  ,  , )
x_test_dtree = preprocessing_dataframe( ,  , )

In [None]:
# Train a Decision Tree using Titanic Dataset (use the implementation from scikit-learn library) 
dtree = 

In [None]:
# Evaluate the model using the metrics function (build y_pred and compare it with y_test) 


metrics(y_test, y_pred)

In [None]:
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

def visualize_tree(model: object, feature_names: list) -> object:
    # Export the fitted tree in Graphviz DOT format into an in-memory buffer
    dot_data = StringIO()
    export_graphviz(model, out_file=dot_data, feature_names = feature_names,
                    filled=True, rounded=True, special_characters=True)
    # Render the DOT source as a PNG image that the notebook can display
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    return Image(graph.create_png())

# Use the function visualize_tree to print the tree structure
# Control the complexity of the tree using max_depth
visualize_tree(, )


In [None]:
def print_decision_path(dtree: object, dataset: object, sample_id: int): 
    # Adapted from https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html 
    
    # Sparse indicator matrix: entry (i, j) is non-zero when sample i goes through node j
    node_indicator = dtree.decision_path(dataset)
    feature = dtree.tree_.feature
    feature_names = dataset.columns.values
    # Leaf reached by each sample
    leaf_id = dtree.apply(dataset)

    # Ids of the nodes visited by the selected sample
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]: node_indicator.indptr[sample_id + 1]]
    threshold = dtree.tree_.threshold
    print("Decision Path for sample : {} (predicted as {}) \n".format(sample_id, dtree.predict(dataset.iloc[[sample_id]])[0]))
    for node_id in node_index:
        # Leaves hold no test, so they are skipped
        if leaf_id[sample_id] == node_id:
            continue
            
        if (dataset.iloc[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node {} : {} (= {}) {} {}".format(node_id, feature_names[feature[node_id]], 
                                                              dataset.iloc[sample_id, feature[node_id]], threshold_sign, threshold[node_id]))
        
# Use the function print_decision_path to obtain the decision path for an instance (sample_id) in x_test (dataset)
print_decision_path( , , )