# Lab3: Introduction to supervised learning
This lab is split into two parts:

1. First, we will code a random classifier ourselves and evaluate it using k-fold validation on the Titanic dataset.

2. We will learn to do the same thing using the [sklearn](https://scikit-learn.org/stable/) library.

In [42]:
import pandas as pd
import numpy as np

## Loading the dataset

Load the Titanic dataset (or the `pre_processed.csv` file we created in the previous session).

In [2]:
df = pd.read_csv("../titanic.csv")

Extract the features into `X` and the target into `y` (the conventional notation in `sklearn`).
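The extraction can be sketched as follows. This is only an illustration: it assumes the CSV uses the Kaggle column names, with `Survived` as the target, and it uses a tiny hand-made DataFrame as a stand-in for `df` so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic DataFrame. The column names follow the
# Kaggle version of the dataset and are an assumption about `titanic.csv`.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

target_column = "Survived"  # assumed name of the target column

# Features: every column except the target, as a 2D numpy array.
X = df.drop(columns=[target_column]).to_numpy()
# Target: the label column, as a 1D numpy array.
y = df[target_column].to_numpy()

print(X.shape, y.shape)
```

With the real dataset, non-numerical columns would need to be dropped or encoded first, as done in the previous session.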

## Coding our own solution

### Coding a random classifier

1. Implement the simplest possible classifier: given a numpy vector, return a random value that is either 0 or 1 (use `numpy.random.binomial`). Make $p$ (the probability of being classified as 1) a parameter.

In [43]:
def random_classifier(X, p: float = .5):
    """Random classifier: given a numpy vector X, return 1 with probability p and 0 otherwise.
    
    Example:
        random_classifier(np.array([1, 2, 3])) returns either 0 or 1
    """

2. Apply this classifier on all values in the `X` numpy matrix and store it in `y_predict`.
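One possible way to write steps 1 and 2 is sketched below. The small matrix `X` is a made-up stand-in for the Titanic features, and ignoring the input entirely is our reading of "simplest possible classifier".

```python
import numpy as np


def random_classifier(X, p: float = 0.5):
    """Ignore X and return 1 with probability p, 0 otherwise."""
    # A binomial draw with a single trial is a Bernoulli draw.
    return np.random.binomial(n=1, p=p)


# Stand-in feature matrix: 5 samples, 2 features (not the real dataset).
X = np.array([[1, 2], [3, 4], [3, 3], [3, 5], [0, 1]])

# One prediction per row of X.
y_predict = np.array([random_classifier(row, p=0.5) for row in X])
print(y_predict)
```

Since the classifier never looks at its input, `np.random.binomial(n=1, p=p, size=len(X))` would produce the predictions in a single call.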

3. Create the four evaluation functions we saw during lecture 4, each taking `y` and `y_predict` as input:
- `accuracy`
- `recall`
- `f1_score`
- `precision`

In [24]:
def accuracy(y, y_predict):
    """Compute the accuracy between y and y_predict.
    
    Example:
        accuracy([1, 1], [1, 1]) = 1
    """

4. Apply these functions to `y` and `y_predict` and draw conclusions.

### Separating train and test sets
We will evaluate our algorithm by "training" it on a subset of the data (`X_train`, `y_train`), predicting on `X_test`, and comparing the predictions with the ground truth `y_test`.

1. Does the random classifier have a training phase?

2. Create a function `split_train_test` that takes as input a matrix `X` and a target `y` and randomly splits them into two matrices `X_train` and `X_test` and two targets `y_train` and `y_test`. You can use the function `numpy.random.sample`.

In [32]:
def split_train_test(X, y, p_train = .5):
    """Randomly split the numpy matrix X into two sub-matrices X_train and X_test, and the target y into two sub-targets y_train and y_test, with the ratio p_train (p_train set to .5 means that half of X will be in train and the other half in test).
    
    Example:
        split_train_test(X=np.array([[1, 2], [3, 4], [3, 3]]), y=np.array([0, 0, 1]), p_train=2/3) may return ((np.array([[3, 4], [3, 3]]), np.array([0, 1])), (np.array([[1, 2]]), np.array([0])))
    """

3. Predict the values for the test dataset `X_test` and store them in `y_test_predict`.

4. Compute the accuracy, precision, recall, f1_score by comparing `y_test_predict` to `y_test`.

5. Can you see the limitation of this simple approach? What would be the problem if we had an unbalanced dataset?
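Steps 2 and 3 above could look like the sketch below. It follows the hint and draws one uniform number per row with `numpy.random.sample`, so each row lands in the train set with probability `p_train` and the split sizes are only approximately in that ratio. The toy `X` and `y` are made up for the example.

```python
import numpy as np


def split_train_test(X, y, p_train=0.5):
    """Randomly split (X, y) into ((X_train, y_train), (X_test, y_test))."""
    X, y = np.asarray(X), np.asarray(y)
    # One uniform draw in [0, 1) per row; rows below p_train go to train.
    in_train = np.random.sample(len(X)) < p_train
    return (X[in_train], y[in_train]), (X[~in_train], y[~in_train])


# Toy data: 10 samples, 2 features (not the real dataset).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

(X_train, y_train), (X_test, y_test) = split_train_test(X, y, p_train=0.7)

# The random classifier has nothing to learn, so the train set is unused here.
y_test_predict = np.random.binomial(n=1, p=0.5, size=len(X_test))
print(len(X_train), len(X_test), y_test_predict)
```

If exact split sizes matter, shuffling the indices with `np.random.permutation` and cutting at `int(p_train * len(X))` is a common alternative.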

### K-fold validation

The other, more robust approach we saw in class is k-fold validation: the data is divided into *k* folds and, in turn, *k-1* folds are used for training and the remaining fold for testing. We then compute the average/median of the performance metrics over all experiments.

1. Create a function `k_fold_train_test` that first shuffles an input matrix and then divides it into k folds, with the number of folds specified as input.

In [44]:
def k_fold_train_test(X, y, nbr_folds=3):
    """Shuffle the matrix X and the target vector y, then return the k folds as a tuple ((X_1, y_1), (X_2, y_2), ..., (X_k, y_k)).
    
    Example:
        k_fold_train_test(np.array([[1, 2], [3, 4], [3, 3], [3, 5]]), y=np.array([1, 0, 0, 1]), nbr_folds=2) may return ((np.array([[1, 2], [3, 4]]), np.array([1, 0])), (np.array([[3, 3], [3, 5]]), np.array([0, 1])))
    """
    

2. Use the k-fold algorithm to compute the average accuracy and recall over the k folds. The algorithm will:
    - Iterate over the k folds
    - Train the model on the k-1 other folds
    - Evaluate the performance on the 1 remaining fold and store it
    - Compute the average/median performance

3. What problem do you see with this approach ?
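The k-fold procedure described above can be sketched as follows. The use of `np.random.permutation` and `np.array_split`, the inline `accuracy` helper and the toy data are choices made for this illustration, not the only way to do it.

```python
import numpy as np


def k_fold_train_test(X, y, nbr_folds=3):
    """Shuffle (X, y) together and return a tuple of k folds (X_i, y_i)."""
    X, y = np.asarray(X), np.asarray(y)
    shuffled = np.random.permutation(len(X))
    # array_split tolerates a number of rows not divisible by nbr_folds.
    fold_indices = np.array_split(shuffled, nbr_folds)
    return tuple((X[idx], y[idx]) for idx in fold_indices)


def accuracy(y, y_predict):
    return float(np.mean(np.asarray(y) == np.asarray(y_predict)))


# Toy data: 12 samples, 2 features (not the real dataset).
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1, 1] * 4)
folds = k_fold_train_test(X, y, nbr_folds=3)

scores = []
for i, (X_test, y_test) in enumerate(folds):
    # The k-1 other folds form the training set.
    X_train = np.concatenate([X_f for j, (X_f, _) in enumerate(folds) if j != i])
    y_train = np.concatenate([y_f for j, (_, y_f) in enumerate(folds) if j != i])
    # A random classifier has no training phase, so the train set is unused.
    y_test_predict = np.random.binomial(n=1, p=0.5, size=len(X_test))
    scores.append(accuracy(y_test, y_test_predict))

print(np.mean(scores), np.median(scores))
```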

## Using sklearn
Sklearn is the go-to library for machine learning (though less so for deep learning). It comes with built-in methods for training and performance evaluation, and many more.

1. Import different performance evaluation metrics by reading the documentation [here](https://scikit-learn.org/stable/modules/model_evaluation.html) (it is too long a read for a lab, but definitely an interesting one). Compare `balanced_accuracy` and `accuracy` with our previous implementation (see [here](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score) for more). Compute the scores on `y` for the random classifier we implemented.

2. Plenty of functions are available to split the dataset into train and test (see [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for the complete list). Split `X` and `y` into train and test using the function `sklearn.model_selection.train_test_split`. What is the role of the `stratify` parameter? What problem does it solve?
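As an illustration of how the two scores can differ, the sketch below uses a made-up unbalanced target and a constant "always 0" prediction next to a random one. The data is invented for the example; only the `sklearn.metrics` functions are the library's own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up unbalanced target: 8 negatives, 2 positives.
y = np.array([0] * 8 + [1] * 2)

# A constant prediction: always the majority class.
y_constant = np.zeros_like(y)
acc = accuracy_score(y, y_constant)
bal = balanced_accuracy_score(y, y_constant)
rec = recall_score(y, y_constant)
print(acc, bal, rec)

# The random classifier from the first part, applied to every sample.
y_random = np.random.binomial(n=1, p=0.5, size=len(y))
print(accuracy_score(y, y_random), balanced_accuracy_score(y, y_random))
```

Here the constant prediction is right on 8 samples out of 10, while balanced accuracy averages the recall of each class (1 for class 0, 0 for class 1).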
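A small sketch of a stratified split on made-up data. The values of `test_size` and `random_state` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up unbalanced data: 20 samples, 20% of them in class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # arbitrary choice: 5 test samples out of 20
    stratify=y,       # keep the class proportions of y in both subsets
    random_state=0,   # fixed seed so the split is reproducible
)

# Share of class 1 in each subset.
print(y_train.mean(), y_test.mean())
```

Running the same split without `stratify` and with different seeds is a quick way to see how much the class proportions can drift on a small dataset.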

3. Use the function `sklearn.model_selection.KFold` to get the proper indexes and perform cross validation on the random classifier using `balanced_accuracy`.
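A possible shape for this cross-validation loop is sketched below on made-up data. `n_splits=5`, `shuffle=True` and `random_state=0` are assumptions made for the example, and the random classifier is reduced to a single `np.random.binomial` call.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import KFold

# Made-up balanced data: 20 samples, 2 features.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_index, test_index in kf.split(X):
    # KFold yields index arrays; the data is selected with them.
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    # Random classifier: no training, one Bernoulli draw per test sample.
    y_test_predict = np.random.binomial(n=1, p=0.5, size=len(X_test))
    scores.append(balanced_accuracy_score(y_test, y_test_predict))

print(np.mean(scores), np.median(scores))
```

On a small fold that happens to contain a single class, `balanced_accuracy_score` may emit a warning; `StratifiedKFold` from the same module is one way to avoid such folds.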

# Conclusion and further work
What do you think this random classifier could be used for in the rest of our work on the Titanic dataset?

**Highly advised bonus** (you will be able to use it during the exam): 
Create a Python module `utils.py` with the different functions and tools we coded today. We will re-use it throughout the rest of the labs.