# Lab3: Introduction to supervised learning
This lab will be separated into two parts:

1. First, we will code a random classifier ourselves and evaluate it using k-fold validation on the Pokemon dataset.

2. We will learn to do the same thing using the [sklearn](https://scikit-learn.org/stable/) library.

In [4]:
import pandas as pd
import numpy as np

## Loading the dataset

Load the Pokemon dataset (or the `pre_processed.csv` one we did in the previous session).

In [5]:
df = pd.read_csv("../pokemon.csv")

Extract 2 features of your choice into an array `X` and a target array `y` (conventional notations of `sklearn`).

In [6]:
# I am only extracting a small subset of the variables
# because we are working on the random classifier, but you can use all the features we studied in the last lab
X = df[['sp_attack', 'sp_defense']].values
y = df["is_legendary"].values

## Coding our own solution

### Coding a random classifier

1. Implement the simplest possible classifier: given a numpy array `X`, return for each row a random label, either 0 or 1 (use `numpy.random.binomial`). Make $p$ (the probability of being classified as 1) a parameter.

In [57]:
def random_classifier(X, p: float = .5):
    """Random classifier: for each row of the numpy array X, return 1 with probability p and 0 otherwise.
    
    Example:
        random_classifier(np.array([[1, 2, 3]])) returns an array of shape (1,), e.g. array([1])
    """
    return np.random.binomial(1, p, X.shape[0])

In [58]:
random_classifier(np.array([np.random.normal(1, 10, 10) for _ in range(10)]))

array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

2. Apply this classifier on all values in the `X` numpy matrix and store it in `y_predict`.

In [59]:
y_predict = random_classifier(X)

In [60]:
# First ten individuals predicted
y_predict[:10]

array([0, 1, 1, 0, 1, 1, 0, 1, 1, 0])

3. Create the four evaluation functions we saw during lecture 4, each taking as input the predictions `y_pred` and the ground truth `y_true`:
- `accuracy`
- `recall`
- `f1_score`
- `precision`

In [61]:
def accuracy(y_pred, y_true):
    return np.sum(y_pred == y_true)/y_pred.shape[0]

def recall(y_pred, y_true):
    return np.sum((y_pred == 1) & (y_true==1))/(np.sum((y_pred == 1) & (y_true==1)) + np.sum((y_pred == 0) & (y_true==1)))

def precision(y_pred, y_true):
    return np.sum((y_pred == 1) & (y_true==1))/(np.sum((y_pred == 1) & (y_true==1)) + np.sum((y_pred == 1) & (y_true==0)))

def f1_score(y_pred, y_true):
    """Compute the F1 score (harmonic mean of precision and recall) of y_pred against y_true.
    """
    computed_precision = precision(y_pred, y_true)
    computed_recall = recall(y_pred, y_true)
    return 2 * (computed_precision*computed_recall)/(computed_precision + computed_recall)
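
For reference, the functions above correspond to the usual definitions in terms of true/false positives and negatives ($TP$, $FP$, $TN$, $FN$), taking class 1 as the positive class. The notation is an assumption here and may differ slightly from the one used in the lecture:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

Note that precision is undefined when no individual is predicted as 1, and recall is undefined when the ground truth contains no 1.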

4. Apply these functions to `y` and `y_predict` and draw conclusions.

In [62]:
# The metric functions expect (y_pred, y_true), in that order
print("========= Accuracy")
print(f"{accuracy(y_predict, y)}")
print("========= Precision")
print(f"{precision(y_predict, y)}")
print("========= Recall")
print(f"{recall(y_predict, y)}")
print("========= F1 score")
print(f"{f1_score(y_predict, y)}")

0.5143570536828964
0.08997429305912596
0.5
0.15250544662309368


### Separation between tests and train
We will evaluate our algorithm by "training" it on a subset of the data, `X_train` and `y_train`, then predicting on the held-out data `X_test` and comparing the predictions with the ground truth `y_test`.

1. Is there a training phase for the random classifier?

No! Its predictions depend neither on the features nor on the labels, so there is nothing to fit.

2. Create a function `split_train_test` that takes as input a matrix `X` and a target `y` and randomly splits them into two matrices `X_train` and `X_test` and two targets `y_train` and `y_test`. You can use the function `numpy.random.choice`.

In [67]:
def split_train_test(X, y, p_train = 0.7):
    # Select train index
    train_indexes = np.random.choice(np.arange(X.shape[0]), replace=False, size=round(p_train*X.shape[0]))
    # Get test indexes as a difference
    test_indexes = np.array(list(set(np.arange(X.shape[0])) - set(train_indexes)))
    # Index X and y accordingly
    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]

X_train, X_test, y_train, y_test = split_train_test(X, y)

print(X_train)

[[ 40  79]
 [ 80 120]
 [ 85  50]
 ...
 [ 70  80]
 [ 97  80]
 [ 40  40]]


3. Predict the labels of the test dataset `X_test` and store them in `y_test_predict`.

In [68]:
y_test_predict = random_classifier(X_test)

4. Compute the accuracy, precision, recall, f1_score by comparing `y_test_predict` to `y_test`.

In [72]:
# The metric functions expect (y_pred, y_true), in that order
print("========= Accuracy")
print(f"{accuracy(y_test_predict, y_test)}")
print("========= Precision")
print(f"{precision(y_test_predict, y_test)}")
print("========= Recall")
print(f"{recall(y_test_predict, y_test)}")
print("========= F1 score")
print(f"{f1_score(y_test_predict, y_test)}")

0.55
0.13793103448275862
0.6666666666666666
0.2285714285714286


In [76]:
# pd.value_counts is deprecated; use the Series method instead
pd.Series(y_test).value_counts()

pd.Series(y_test_predict).value_counts()

0    124
1    116
Name: count, dtype: int64

5. Can you see the limitation of using accuracy alone?

Accuracy reflects the class distribution of the data: on an unbalanced dataset, a classifier that always predicts the majority class gets a good score even though it is a constant.
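
To make this concrete, here is a minimal sketch on synthetic labels (an assumption for illustration, not the Pokemon data; the `*_demo` and `constant_*` names are hypothetical) where a constant classifier reaches a high accuracy while finding none of the positive individuals:

```python
import numpy as np

# Synthetic, illustrative labels (assumption): 90 negatives, 10 positives
y_true_demo = np.array([0] * 90 + [1] * 10)

# A "classifier" that ignores its input and always predicts the majority class 0
y_constant = np.zeros_like(y_true_demo)

constant_accuracy = np.mean(y_constant == y_true_demo)

# Recall = TP / (TP + FN): share of the true positives that were found
true_positives = np.sum((y_constant == 1) & (y_true_demo == 1))
false_negatives = np.sum((y_constant == 0) & (y_true_demo == 1))
constant_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {constant_accuracy:.2f}")  # 0.90
print(f"recall   = {constant_recall:.2f}")    # 0.00
```

The accuracy only restates that 90% of the labels are 0, which is why it should be read together with recall, precision or the F1 score.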

### K-fold validation

The other, more robust approach we saw in class is k-fold validation, which consists of splitting the data into *k* folds, using *k-1* folds for training and 1 fold for testing, and repeating so that each fold serves once as the test set. We then compute the average/median of the performance metrics over all k experiments.

1. Create a function `k_fold_train_test` that first shuffles an input matrix and then divides it into k folds, with the number of folds specified as input.

2. Use the k-fold algorithm to compute the average accuracy and recall over the k folds. The algorithm will:
    - Iterate over the k folds
    - Train the model on the k-1 other folds
    - Evaluate the performance on the 1 remaining fold and store it
    - Compute the average/median performance
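
One possible sketch of these steps is given below; it is not the reference solution. It assumes a generator-based `k_fold_train_test` with hypothetical `k` and `seed` parameters, redefines the random classifier so the block runs on its own, and uses synthetic stand-in data (`X_demo`, `y_demo`) instead of the Pokemon features. Only accuracy is computed; recall can be added the same way, keeping in mind that it is undefined on a fold that contains no positive individual.

```python
import numpy as np

def random_classifier(X, p=0.5):
    # Same idea as the lab's classifier: one Bernoulli(p) draw per row of X
    return np.random.binomial(1, p, X.shape[0])

def k_fold_train_test(X, y, k=5, seed=None):
    """Shuffle the row indexes, then yield (X_train, X_test, y_train, y_test) for each fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(X.shape[0])
    folds = np.array_split(shuffled, k)  # k index arrays of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in data (assumption): 100 rows, 2 features, about 10% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = rng.binomial(1, 0.1, size=100)

accuracies = []
for X_tr, X_te, y_tr, y_te in k_fold_train_test(X_demo, y_demo, k=5, seed=0):
    # No training step: the random classifier ignores X_tr and y_tr
    y_te_pred = random_classifier(X_te)
    accuracies.append(np.mean(y_te_pred == y_te))

print(f"mean accuracy over the folds: {np.mean(accuracies):.3f}")
print(f"median accuracy over the folds: {np.median(accuracies):.3f}")
```

Using `np.array_split` rather than `np.split` is a design choice: it also works when the number of rows is not a multiple of `k`.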

3. What problem do you see with this approach?

## Using sklearn
Sklearn is THE usual library for machine learning (but not so much for deep learning); it comes with built-in methods for training and performance evaluation (and many more).

1. Import different performance evaluation metrics by reading the documentation [here](https://scikit-learn.org/stable/modules/model_evaluation.html) (it's too long a read for a lab, but definitely an interesting one). Compare `balanced_accuracy` and `accuracy` to our previous implementation (see [here](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score) for more). Compute the scores on `y` for the random classifier we implemented.
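
As a starting point, the sketch below calls the `sklearn.metrics` functions on synthetic labels (an assumption; replace `y_true_demo` and `y_pred_demo`, both hypothetical names, with `y` and `y_predict`). The module is imported as `metrics` so that sklearn's `f1_score` does not shadow the one coded earlier.

```python
import numpy as np
from sklearn import metrics

# Synthetic, illustrative labels (assumption, not the Pokemon data): 90 negatives, 10 positives
y_true_demo = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
y_pred_demo = rng.binomial(1, 0.5, size=y_true_demo.shape[0])  # random predictions, p = 0.5

# sklearn metrics take (y_true, y_pred), the reverse of the order used by the lab's functions
acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
bal_acc = metrics.balanced_accuracy_score(y_true_demo, y_pred_demo)
prec = metrics.precision_score(y_true_demo, y_pred_demo, zero_division=0)
rec = metrics.recall_score(y_true_demo, y_pred_demo)
f1 = metrics.f1_score(y_true_demo, y_pred_demo)

# Balanced accuracy is the mean of the recall obtained on each class
rec_negative = metrics.recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(f"accuracy          = {acc:.3f}")
print(f"balanced accuracy = {bal_acc:.3f}")
print(f"mean class recall = {(rec + rec_negative) / 2:.3f}")
print(f"precision = {prec:.3f}, recall = {rec:.3f}, f1 = {f1:.3f}")
```

Because it averages the per-class recalls, balanced accuracy does not reward a classifier for simply matching the majority class.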

2. Plenty of functions are available to split the dataset into train and test (see [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for the complete list). Split `X` and `y` into train and test using the function `sklearn.model_selection.train_test_split`. What is the role of the `stratify` parameter? What problem does it solve?
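
A minimal sketch of a stratified split, assuming synthetic stand-in data with exactly 10% positives (`X_demo`, `y_demo` and the `_tr`/`_te` names are hypothetical; use `X` and `y` in the lab). Passing `stratify=y_demo` asks sklearn to keep the class proportions the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 2 features, exactly 10% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.array([0] * 180 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# With stratification, the share of positives is preserved in both subsets
print("positive rate, full data:", y_demo.mean())
print("positive rate, train    :", y_tr.mean())
print("positive rate, test     :", y_te.mean())
```

Without `stratify`, a purely random split of an unbalanced dataset can leave the test set with very few positive individuals, or none at all, which makes recall and precision unreliable.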

3. Use the function `sklearn.model_selection.KFold` to get the proper indexes and perform cross validation on the random classifier using `balanced_accuracy`.
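
The sketch below shows one way to combine `KFold` with `balanced_accuracy_score`. It again relies on synthetic stand-in data and a local copy of the random classifier (both assumptions, so that the block runs on its own):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import KFold

def random_classifier(X, p=0.5):
    # Same idea as the lab's classifier: one Bernoulli(p) draw per row of X
    return np.random.binomial(1, p, X.shape[0])

# Synthetic stand-in data (assumption): 200 rows, 2 features, about 10% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = rng.binomial(1, 0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X_demo):
    # kf.split yields index arrays; the random classifier needs no fitting on train_idx
    y_fold_pred = random_classifier(X_demo[test_idx])
    scores.append(balanced_accuracy_score(y_demo[test_idx], y_fold_pred))

print(f"mean balanced accuracy over the folds: {np.mean(scores):.3f}")
```

`KFold` does not look at the labels when building the folds; `sklearn.model_selection.StratifiedKFold` is the variant that preserves the class proportions in each fold.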

# Conclusion and further work
What do you think could be the use of this random classifier for the rest of our work on the Titanic dataset?


**Highly advised bonus** (you will be able to use it during the exam): 
Create a Python module `utils.py` with the different functions and tools we coded today. We will re-use it throughout the rest of the labs.