# Lab3: Introduction to supervised learning
This lab will be separated into two parts:

1. First, we will code our own random classifier and evaluate it using k-fold validation on the Titanic dataset.

2. Then, we will learn to do the same thing using the [sklearn](https://scikit-learn.org/stable/) library.

In [1]:
import pandas as pd
import numpy as np

## Loading the dataset

Load the Titanic dataset (or the `pre_processed.csv` file we produced in the previous session).

In [2]:
df = pd.read_csv("../titanic.csv")

Extract the features into `X` and the target into `y` (the conventional notation in `sklearn`).

In [3]:
# I am only extracting a small subset of the variables
# because we are working on the random classifier, but you can take all the features we studied in the last lab
X = df[["Age", "Fare", "Sex"]].values
y = df["Survived"].values

## Coding our own solution

### Coding a random classifier

1. Implement the simplest possible classifier: given a numpy matrix and its ground truth, return for each row either 0 or 1 at random (use `numpy.random.binomial`). Make $p$ (the probability of being classified as 1) a parameter.

In [4]:
def random_classifier(X, y, p: float = .5):
    """Random classifier: for each row of the numpy array X, return 1 with probability p and 0 otherwise.
    
    The ground truth y is not used; it is only kept in the signature.
    
    Example:
        random_classifier(np.array([1, 2, 3]), np.array([0, 1, 1])) may return array([1, 0, 1])
    """
    return np.random.binomial(1, p, X.shape[0])
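
As a quick sanity check, the sketch below uses a hypothetical variant of the classifier, `random_classifier_seeded`, which takes a `numpy.random.default_rng` generator. The `rng` argument, the seed and the demo matrix are assumptions added here for illustration, not part of the lab's solution. It checks that the share of predicted 1s is close to `p`.

```python
import numpy as np

def random_classifier_seeded(X, p=0.5, rng=None):
    """Hypothetical variant of random_classifier that accepts a generator."""
    rng = np.random.default_rng() if rng is None else rng
    # One Bernoulli(p) draw per row of X
    return rng.binomial(1, p, X.shape[0])

X_demo = np.zeros((10_000, 3))  # the feature values are never read
pred = random_classifier_seeded(X_demo, p=0.3, rng=np.random.default_rng(0))
share_of_ones = pred.mean()
print(share_of_ones)  # close to 0.3
```

With 10,000 rows the observed share should only deviate from `p` by about a percentage point.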

2. Apply this classifier to all rows of the `X` numpy matrix and store the predictions in `y_predict`.

In [5]:
print("======== Random predictions for Survival =======")
y_predict = random_classifier(X, y)
print(f"==== {y_predict}")

==== [1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 1 0
 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 1
 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 1
 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 1 1
 1 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0
 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1
 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 1 0
 0 1 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 1
 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0
 1 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1
 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 1 1 1 1
 1 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0
 0 1 1 0 1 0 0 0 1 1

3. Create the four evaluation functions we saw during lecture 4, each taking as input the ground truth `y` and the predictions `y_predict`:
- `accuracy`
- `recall`
- `f1_score`
- `precision`

In [6]:
def accuracy(y, y_predict):
    """Compute the accuracy between y and y_predict (both numpy arrays).
    
    Example:
        accuracy(np.array([1, 1]), np.array([1, 1])) = 1
    """
    return sum(y == y_predict)/len(y)

In [7]:
def precision(y, y_predict):
    """Compute the precision between y and y_predict: the share of predicted positives that are truly positive.
    """
    return sum((y == 1) & (y_predict == 1))/sum(y_predict == 1)

In [8]:
def recall(y, y_predict):
    """Compute the recall between y and y_predict.
    """
    return sum((y == 1) & (y_predict == 1))/sum(y == 1)

In [9]:
def f1_score(y, y_predict):
    """Compute the F1 score between y and y_predict.
    """
    computed_precision = precision(y, y_predict)
    computed_recall = recall(y, y_predict)
    return 2 * (computed_precision*computed_recall)/(computed_precision + computed_recall)
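
To make the four definitions concrete, here is a small worked example on made-up vectors (the values of `y_true` and `y_pred` are illustrative only). It counts the true/false positives and negatives explicitly and derives each score from them.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Confusion counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives: 2
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives: 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives: 1
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives: 4

acc = (tp + tn) / len(y_true)        # 6/8 = 0.75
prec = tp / (tp + fp)                # 2/3
rec = tp / (tp + fn)                 # 2/3
f1 = 2 * prec * rec / (prec + rec)   # 2/3
print(acc, prec, rec, f1)
```

Note that precision is undefined when no positive is predicted (`tp + fp == 0`), a case this sketch does not handle.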

4. Apply these functions to `y` and `y_predict` and draw conclusions.

In [10]:
print("========= Accuracy")
print(f"{accuracy(y, y_predict)}")
print("========= Precision")
print(f"{precision(y, y_predict)}")
print("========= Recall")
print(f"{recall(y, y_predict)}")
print("========= F1 score")
print(f"{f1_score(y, y_predict)}")

0.5140291806958474
0.40130151843817785
0.5409356725146199
0.46077210460772106


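We can see that accuracy and recall are close to .5, while precision (and therefore the F1 score) is lower, because the precision of a random classifier is pulled towards the proportion of survivors in the dataset. These scores are the baseline to which we should compare the quality of our algorithms.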
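
To see where these values come from: a classifier that predicts 1 with probability $p$, independently of the data, has in expectation a recall of $p$, a precision close to the proportion of positives $\pi$, and an accuracy of $p\pi + (1-p)(1-\pi)$. The sketch below checks this by simulation on synthetic labels, assuming 891 passengers of which 342 survived (consistent with the scores printed above); the number of repetitions and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pos, p = 891, 342, 0.5          # assumed dataset size and number of survivors
y_true = np.zeros(n, dtype=int)
y_true[:n_pos] = 1
pi = n_pos / n                       # proportion of positives

acc, prec, rec = [], [], []
for _ in range(2000):
    y_pred = rng.binomial(1, p, n)   # random predictions, independent of y_true
    tp = np.sum((y_true == 1) & (y_pred == 1))
    acc.append(np.mean(y_true == y_pred))
    prec.append(tp / np.sum(y_pred == 1))
    rec.append(tp / n_pos)

mean_acc, mean_prec, mean_rec = np.mean(acc), np.mean(prec), np.mean(rec)
print(mean_acc, mean_prec, mean_rec)
```

The averaged accuracy and recall should land near 0.5 and the averaged precision near `pi`.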

### Separation between tests and train
We will evaluate our algorithm by "training" it on a subset of the data (`X_train`, `y_train`), predicting on the remaining data `X_test`, and comparing these predictions with the ground truth `y_test`.

1. Is there a training phase for the random classifier?

No! The random classifier ignores the data it is given, so there is nothing to train.

2. Create a function `split_train_test` that takes as input a matrix `X` and a target `y` and randomly splits them into two matrices `X_train` and `X_test` and two targets `y_train` and `y_test`. You can use the function `numpy.random.choice`.

In [11]:
from numpy.random import choice, shuffle

In [12]:
def split_train_test(X, y, p_train = .5):
    """Randomly split the numpy matrix X into two sub-matrices X_train and X_test, and the target y into two sub-targets y_train and y_test. p_train is the proportion of rows assigned to train (p_train set to .5 means that half of X will be in train and the other half in test).
    
    Example:
        split_train_test(X=np.array([[1, 2], [3, 4], [3, 3]]), y=np.array([0, 0, 1]), p_train=2/3) may return (np.array([[3, 4], [3, 3]]), np.array([[1, 2]]), np.array([0, 1]), np.array([0]))
    """
    # Select train index
    train_indexes = choice(np.arange(X.shape[0]), replace=False, size=round(p_train*X.shape[0]))
    # Get test indexes as a difference
    test_indexes = np.array(list(set(np.arange(X.shape[0])) - set(train_indexes)))
    # Index X and y accordingly
    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]
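
The property that matters in such a split is that the train and test indexes are disjoint and together cover every row. Below is a minimal sketch of an alternative way to get them, by shuffling once and cutting; the helper name `split_indexes`, its `rng` argument and the seed are hypothetical and not part of the lab's solution.

```python
import numpy as np

def split_indexes(n_rows, p_train=0.5, rng=None):
    """Hypothetical helper: return disjoint train and test row indexes."""
    rng = np.random.default_rng() if rng is None else rng
    shuffled = rng.permutation(n_rows)   # random order of 0..n_rows-1
    n_train = round(p_train * n_rows)
    return shuffled[:n_train], shuffled[n_train:]

train_idx, test_idx = split_indexes(10, p_train=0.7, rng=np.random.default_rng(0))
print(len(train_idx), len(test_idx))  # 7 3
```

The returned indexes can then be used as `X[train_idx]`, `X[test_idx]`, `y[train_idx]` and `y[test_idx]`.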

In [13]:
X_train, X_test, y_train, y_test = split_train_test(X, y)

3. Predict the values on the test dataset `X_test` and store them in `y_test_predict`.

In [15]:
y_test_predict = random_classifier(X_test, y_test)

4. Compute the accuracy, precision, recall, f1_score by comparing `y_test_predict` to `y_test`.

In [16]:
print("========= Accuracy")
print(f"{accuracy(y_test, y_test_predict)}")
print("========= Precision")
print(f"{precision(y_test, y_test_predict)}")
print("========= Recall")
print(f"{recall(y_test, y_test_predict)}")
print("========= F1 score")
print(f"{f1_score(y_test, y_test_predict)}")


0.46292134831460674
0.3760330578512397
0.5083798882681564
0.43230403800475053


5. Can you see the limitation of relying on accuracy alone? What would be the problem if we had an unbalanced dataset?

Accuracy reflects the class distribution of the data: in the case of an unbalanced dataset, always predicting the majority class would give a good score even though our classifier is a constant.
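
A minimal illustration on a synthetic target (the 95/5 split is a made-up example, not Titanic data): a classifier that always predicts the majority class reaches a high accuracy while never finding a single positive.

```python
import numpy as np

# Synthetic unbalanced target: 95 negatives and 5 positives (made-up numbers)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)   # constant classifier: always predict 0

acc = np.mean(y_true == y_pred)                                     # 95 correct out of 100
rec = np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)   # no positive is found
print(acc, rec)  # 0.95 0.0
```

This is why accuracy should be read together with recall (or a balanced metric) whenever the classes are unbalanced.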

### K-fold validation

The other, more robust approach we saw in class is k-fold validation, which consists in using *k-1* folds for training and 1 fold for testing, with each fold serving as the test fold once. We then compute an average/median of the performance metrics over all experiments.

1. Create a function `k_fold_train_test` that will first shuffle an input matrix and then divide it into k folds, with the number of folds specified as input.

In [17]:
def k_fold_train_test(X, y, nbr_folds=3):
    """Shuffle the rows of the matrix X and of the target vector y together, then return the k folds as a list [(X_1, y_1), (X_2, y_2), ..., (X_k, y_k)].
    
    Example:
        k_fold_train_test(np.array([[1, 2], [3, 4], [3, 3], [3, 5]]), y=np.array([1, 0, 0, 1]), nbr_folds=2) may return [(np.array([[1, 2], [3, 5]]), np.array([1, 1])), (np.array([[3, 3], [3, 4]]), np.array([0, 0]))]
    """
    # Get indexes and shuffle them
    indexes = np.arange(len(X))
    shuffle(indexes)
    # Split the shuffled indexes into nbr_folds nearly equal parts; unlike a rounded
    # fold size, np.array_split does not drop rows when len(X) is not a multiple of nbr_folds
    fold_indexes = np.array_split(indexes, nbr_folds)
    return [(X[fold_index], y[fold_index]) for fold_index in fold_indexes]

In [18]:
folds = k_fold_train_test(X, y, nbr_folds=5)

2. Use the k-fold algorithm to compute the average accuracy and recall over the k folds. The algorithm will:
    - Iterate over the k folds
    - Train the model on the k-1 other folds
    - Evaluate the performance on the 1 remaining fold and store it
    - Compute the average/median performance

In [20]:
NBR_FOLDS = 5
folds = k_fold_train_test(X, y, nbr_folds=NBR_FOLDS)

accuracy_scores = []
for fold in range(NBR_FOLDS):
    # Concatenate all folds except the one with index fold
    train_folds = [folds[fold_ix] for fold_ix in range(NBR_FOLDS) if fold_ix != fold]
    X_train = np.concatenate([train_fold[0] for train_fold in train_folds])
    y_train = np.concatenate([train_fold[1] for train_fold in train_folds])
    # Retrieve test
    X_test = folds[fold][0]
    y_test = folds[fold][1]
    # "train"
    print(f"======= Training classifier on {X_train.shape[0]} individuals ")
    # Predict and compute score on test fold
    y_fold_predict = random_classifier(X_test, y_test)
    accuracy_scores.append(accuracy(y_test, y_fold_predict))



In [21]:
print(f"======== Median scores {np.median(accuracy_scores)}")
print(f"======== Mean scores {np.mean(accuracy_scores)}")



3. What problem do you see with this approach?

This approach is **non-reproducible**: every run shuffles the data and draws the predictions differently, which is an issue when writing a paper because our results cannot be reproduced. We should set the random seed to a fixed value to avoid this problem (but then, this can introduce some bias as well).
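
A minimal sketch of what fixing the seed means in practice. The helper `predict_random` and the seed value 42 are hypothetical; with the legacy functions used in this lab (`numpy.random.binomial`, `numpy.random.shuffle`), calling `np.random.seed(...)` once at the top of the notebook plays the same role.

```python
import numpy as np

def predict_random(n_rows, p=0.5, seed=None):
    """Hypothetical helper: draw n_rows predictions from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.binomial(1, p, n_rows)

first_run = predict_random(20, seed=42)
second_run = predict_random(20, seed=42)
same = np.array_equal(first_run, second_run)
print(same)  # True
```

Two runs with the same seed give identical predictions, so the scores computed from them can be reproduced exactly.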

## Using sklearn
Sklearn is THE standard library for machine learning (less so for deep learning); it comes with built-in methods for training and performance evaluation, and much more.

1. Import different performance evaluation metrics by reading the documentation [here](https://scikit-learn.org/stable/modules/model_evaluation.html) (it is too long a read for a lab, but it is definitely an interesting one). Compare `balanced_accuracy` and `accuracy` to our previous implementation (see [here](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score) for more). Compute the scores on `y` for the random classifier we implemented.

Balanced accuracy avoids performance bias in the case of an unbalanced dataset. In the case of a balanced dataset, it is equal to accuracy; Titanic is only moderately unbalanced, so for our random classifier the two scores stay close.

In [22]:
from sklearn.metrics import balanced_accuracy_score, accuracy_score

In [23]:
print(f"======== Balanced accuracy {balanced_accuracy_score(y, y_predict)}")

print(f"======== Accuracy: {accuracy_score(y, y_predict)}")
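
For intuition, the scikit-learn documentation describes balanced accuracy as the average of the recall obtained on each class. The sketch below recomputes it by hand with numpy only; `balanced_accuracy_by_hand` is a hypothetical helper and the 95/5 target is a made-up example, not Titanic data.

```python
import numpy as np

def balanced_accuracy_by_hand(y_true, y_pred):
    """Average of the recall obtained on each class (hypothetical helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Recall of a class = share of its members that were predicted as that class
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)           # constant classifier: always predict 0
plain = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy_by_hand(y_true, y_pred)
print(plain, balanced)  # 0.95 0.5
```

The constant classifier keeps a flattering plain accuracy but falls back to 0.5 once each class weighs the same.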



2. Plenty of functions are available to split the dataset into train and test (see [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for the complete list). Split `X` and `y` into train and test using the function `sklearn.model_selection.train_test_split`. What is the role of the `stratify` parameter? What problem does it solve?

In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7)
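
Regarding `stratify`: passing the labels as `stratify=y` asks `train_test_split` to keep the class proportions of `y` in both subsets, which avoids a test set that under-represents one class by chance. A minimal sketch on synthetic data (the 80/20 target, the demo arrays and `random_state=0` are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% of positives (made-up numbers)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0
)
# Share of positives in the full target, the train part and the test part
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

The three printed proportions should be (almost) identical; without `stratify`, the train and test proportions would fluctuate from one split to the next.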

3. Use the function `sklearn.model_selection.KFold` to get the proper indexes and perform cross validation on the random classifier using `balanced_accuracy`.

In [42]:
from sklearn.model_selection import KFold
indexes = KFold(n_splits=NBR_FOLDS, shuffle=True)

In [58]:
balanced_accuracy_scores = []
for ix, (train_index, test_index) in enumerate(indexes.split(X)):
    print(f"For fold {ix}")
    print("===== 'Train' model")
    print(f"====== Predict on test fold")
    y_predict_fold = random_classifier(X[test_index], y[test_index])
    balanced_accuracy = balanced_accuracy_score(y[test_index], y_predict_fold)
    print(f"Balanced accuracy {balanced_accuracy}")
    balanced_accuracy_scores.append(balanced_accuracy)
    
print("===========")
print("===========")
print(f"Median accuracy: {np.median(balanced_accuracy_scores)}")

For fold 0
===== 'Train' model
Balanced accuracy 0.4936750130412102
For fold 1
===== 'Train' model
Balanced accuracy 0.5539364941278817
For fold 2
===== 'Train' model
Balanced accuracy 0.5147423352902805
For fold 3
===== 'Train' model
Balanced accuracy 0.4310160427807487
For fold 4
===== 'Train' model
Balanced accuracy 0.48049575994781474
Median accuracy: 0.4936750130412102


# Conclusion and further work
What do you think could be the use of this random classifier for the rest of our work on the Titanic dataset?

**We will use it as a baseline to compare other classifiers against!**


**Highly advised bonus** (you will be able to use it during the exam): 
Create a Python module `utils.py` with the different functions and tools we coded today. We will re-use it throughout the rest of the labs.