# Train a Classifier

In this notebook we train a Gradient Boosting Decision Tree (GBDT) classifier using the [LightGBM](https://lightgbm.readthedocs.io/en/latest/) implementation.

#### Index<a name="index"></a>
1. [Import Packages](#imports)
2. [Load Features](#loadFeatures)
3. [Generate Classifier](#generateClassifier)
    1. [Untrained Classifier](#createClassifier)
    2. [Train Classifier](#trainClassifier)
    3. [Save the Classifier Instance](#saveClassifier)
4. [Performance](#performance) <font color=salmon>(Optional)</font>
    1. [Classify Train Set](#classify)
    2. [Metrics](#metrics)
    3. [Confusion Matrix](#cm)

## 1. Import Packages<a name="imports"></a>

In [None]:
!pip install ../snmachine/

In [None]:
import collections
import os
import pickle
import sys
import time

In [None]:
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns

In [None]:
from snmachine import snclassifier
from utils.plasticc_pipeline import get_directories, load_dataset

In [None]:
import warnings
warnings.simplefilter('always', DeprecationWarning)

In [None]:
%config Completer.use_jedi = False  # enable autocomplete

#### Aesthetic settings

In [None]:
%matplotlib inline

sns.set(font_scale=1.3, style="ticks")

## 2. Load Features<a name="loadFeatures"></a>

First, **write** the path to the folder that contains the features and the labels of the events (`path_saved_features`). These quantities were calculated and saved in [5_feature_extraction](5_feature_extraction.ipynb).

### 2.1. Features Path<a name="pathFeatures"></a>

**<font color=Orange>A)</font>** Obtain path from folder structure.

If you created a folder structure, you can obtain the path from there. **Write** the name of the folder in `analysis_name`. 

In [None]:
is_only_roll = 0
is_updated = 1

In [None]:
analysis_name = 'aug_wfd_46k'
if is_only_roll:
    analysis_name = 'aug_wfd_roll_46k'
if is_updated:
    analysis_name = analysis_name + '_updated'
analysis_name

In [None]:
# os_name = 'baseline_v2_0_paper'
# os_name = 'noroll_v2_0_paper'
os_name = 'presto_v2_0_paper'

folder_path = '/path/folder/'

In [None]:
directories = get_directories(folder_path, analysis_name) 
path_saved_features = directories['features_directory']

**<font color=Orange>B)</font>** Directly **write** where you saved the files.

```python
folder_path = '../snmachine/example_data'
path_saved_features = folder_path
```

### 2.2. Load<a name="load"></a>

Then, load the features and labels.

In [None]:
X = pd.read_pickle(os.path.join(path_saved_features, 'features.pckl'))  # features
y = pd.read_pickle(os.path.join(path_saved_features, 'data_labels.pckl'))  # class label of each event

In [None]:
collections.Counter(y)

**<font color=Orange>A)</font>** If the dataset is not augmented, skip **<font color=Orange>B)</font>**.


**<font color=Orange>B)</font>** If the dataset is augmented, load the augmented dataset.

To avoid information leaks during the classifier optimisation, all synthetic events that the training set augmentation derived from the same original event must be placed in the same cross-validation fold.
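
The grouping constraint can be sketched with numpy alone. The snippet below is a minimal illustration, not the `snmachine` implementation: the array `original_ids` (the identifier of the original event from which each synthetic event was derived) and the helper `assign_group_folds` are hypothetical names. scikit-learn's `GroupKFold` offers the same guarantee.

```python
import numpy as np


def assign_group_folds(original_ids, number_cv_folds=5, random_seed=42):
    """Give every event a fold index; events sharing an original id share a fold."""
    rng = np.random.default_rng(random_seed)
    unique_ids = np.unique(original_ids)
    shuffled_ids = rng.permutation(unique_ids)
    # Deal the shuffled original ids into the folds in round-robin fashion.
    fold_of_id = {obj_id: i % number_cv_folds
                  for i, obj_id in enumerate(shuffled_ids)}
    return np.array([fold_of_id[obj_id] for obj_id in original_ids])


# Hypothetical example: 6 original events, each augmented into 3 synthetic events.
original_ids = np.repeat(np.arange(6), 3)
folds = assign_group_folds(original_ids, number_cv_folds=3)
```

Because folds are assigned per original event rather than per synthetic event, no validation fold ever contains a near-copy of an event used for training.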

First, **write** in `data_file_name` the name of the file where your dataset is saved.

In this notebook we use the dataset saved in [4_augment_data](4_augment_data.ipynb).

In [None]:
data_file_name = analysis_name + '.pckl'
data_file_name

Then, load the augmented dataset.

In [None]:
folder_path_data = '/folder/path/data/augmented_data'
data_path = os.path.join(folder_path_data, data_file_name)
dataset = load_dataset(data_path)

In [None]:
metadata = dataset.metadata

## 3. Generate Classifier<a name="generateClassifier"></a>

### 3.1. Untrained Classifier<a name="createClassifier"></a>

Start by creating a classifier. For that **choose**: 

- classifier type: `snmachine` contains the following classifiers
    * [LightGBM](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html?highlight=classifier) classifier - `snclassifier.LightGBMClassifier`
    * Boosted decision trees - `snclassifier.BoostDTClassifier`
    * Boosted random forests - `snclassifier.BoostRFClassifier`
    * K-nearest neighbors vote - `snclassifier.KNNClassifier`
    * Support vector machine - `snclassifier.SVMClassifier`
    * Multi-layer Perceptron classifier of a Neural Network - `snclassifier.NNClassifier`
    * Random forest - `snclassifier.RFClassifier`
    * Decision tree - `snclassifier.DTClassifier`
    * Gaussian Naive Bayes - `snclassifier.NBClassifier`
- `random_seed`: this allows reproducible results (**<font color=green>optional</font>**).
- `classifier_name`: name under which the classifier is saved (**<font color=green>optional</font>**).
- `**kwargs`: optional keywords to pass arguments into the underlying classifier; see the docstring in each classifier for more information (**<font color=green>optional</font>**).

Here we chose a LightGBM classifier.

In [None]:
classifier_name = 'full_opt'
classifier_instance = snclassifier.LightGBMClassifier(classifier_name=classifier_name, random_seed=42)

### 3.2. Train Classifier<a name="trainClassifier"></a>

We can now train and use the classifier generated above or optimise it beforehand. In general, it is important to optimise the classifier hyperparameters.

If you do not want to optimise the classifier, **run** **<font color=Orange>A)</font>**.

**<font color=Orange>A)</font>** Train unoptimised classifier.

```python
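# `classifier` is only defined in Section 4; here we fit the underlying
# classifier of the instance created above.
classifier_instance.classifier.fit(X, y)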
```

If you want to optimise the classifier, run **<font color=Orange>B)</font>**.

**<font color=Orange>B)</font>** Optimise and train classifier.

For that, **choose**:
- `param_grid`: parameter grid containing the hyperparameters names and lists of their possible settings as values. If none is provided, the code uses a default parameter grid. (**<font color=green>optional</font>**)
- `scoring`: metric used to evaluate the predictions on the validation sets.
    * `snmachine` contains the custom metrics `'auc'` and PLAsTiCC `'logloss'`. For more details about these, see `snclassifier.auc_score` and `snclassifier.logloss_score`, respectively.
    * Additionally, you can choose a different metric from the list in [Scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) or create your own (see [`sklearn.model_selection._search.GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for details).
- `number_cv_folds`: number of folds for cross-validation. By default it is 5. (**<font color=green>optional</font>**)
- `metadata`: metadata of the events with which to train the classifier. This ensures all synthetic events generated by the training set augmentation that were derived from the same original event are placed in the same cross-validation fold. (**<font color=green>optional</font>**)
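
To get a feel for the cost of a parameter grid, the sketch below expands one into its individual combinations, assuming an exhaustive grid search. The `learning_rate` values are illustrative only and are not part of this notebook's configuration.

```python
import itertools

# Illustrative grid: hyperparameter names mapped to the values to try.
param_grid = {'num_leaves': [10, 30], 'learning_rate': [0.05, 0.1, 0.2]}

# An exhaustive search evaluates every combination of the listed values.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in itertools.product(*param_grid.values())]

number_cv_folds = 5
number_fits = len(combinations) * number_cv_folds  # cross-validation fits
print(len(combinations), number_fits)  # 6 30
```

The number of fits grows multiplicatively with each hyperparameter added, so it helps to start with a small grid.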

In [None]:
param_grid = {'num_leaves': [10, 30]}

classifier_instance.optimise(X, y.astype(str), param_grid=param_grid, scoring='logloss', 
                             number_cv_folds=5, metadata=metadata)

In [None]:
ini_time = time.time()
classifier_instance.optimise(X, y.astype(str), param_grid=None, scoring='logloss', 
                             number_cv_folds=5, metadata=metadata)
print(time.time()-ini_time)

The classifier is optimised and its optimised hyperparameters are:

In [None]:
classifier_instance.classifier

In [None]:
classifier_instance.grid_search.best_params_

In [None]:
classifier_instance.classifier_name

### 3.3. Save the Classifier Instance<a name="saveClassifier"></a>

**Write** in `path_saved_classifier` the path to the folder in which to save the trained classifier instance.

In [None]:
path_saved_classifier = directories['classifications_directory']
path_saved_classifier

Save the classifier instance (which includes the grid search used to optimise the classifier).

In [None]:
classifier_instance.save_classifier(path_saved_classifier)
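
The internals of `save_classifier` are not shown in this notebook. Assuming the instance is stored with `pickle`, as the loading cell in Section 4 suggests, the round trip looks like the sketch below. The dictionary `toy_instance` is a hypothetical stand-in for a trained classifier instance.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier instance; any picklable object behaves the same.
toy_instance = {'classifier_name': 'full_opt', 'best_params': {'num_leaves': 30}}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'full_opt.pck')
    # Serialise the object to disk.
    with open(toy_path, 'wb') as output_file:
        pickle.dump(toy_instance, output_file)
    # Read it back and rebuild the object.
    with open(toy_path, 'rb') as input_file:
        restored_instance = pickle.load(input_file)
```

Unpickling a real classifier instance requires the packages that define its classes to be importable in the loading environment.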

[Go back to top.](#index)

## 4. Performance<a name="performance"></a> <font color=salmon>(Optional)</font>

Here we evaluate the classifier performance on the training set.

First, obtain the classifier.

In [None]:
from snmachine import analysis

In [None]:
classifier_name = 'full_opt.pck'
# `input_file` avoids shadowing the built-in `input`
with open(os.path.join(path_saved_classifier, classifier_name), 'rb') as input_file:
    classifier_instance = pickle.load(input_file)

In [None]:
classifier = classifier_instance.classifier

### 4.1. Classify Train Set<a name="classify"></a>

Compute the predicted class (`y_pred_train`) and the probability of belonging to each class (`y_probs_train`). Note that the predicted class is the one with the highest probability.

In [None]:
y_pred_train = classifier.predict(X)
y_probs_train = classifier.predict_proba(X)
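
The relation between the two outputs can be checked on a toy example. The class labels and probabilities below are hypothetical.

```python
import numpy as np

classes = np.array(['42', '62', '90'])  # hypothetical class labels
# One row per event, one column per class; each row sums to 1.
y_probs_toy = np.array([[0.1, 0.2, 0.7],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.5, 0.3]])

# The predicted class is the label of the column with the highest probability.
y_pred_toy = classes[np.argmax(y_probs_toy, axis=1)]
print(y_pred_toy)  # ['90' '42' '62']
```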

### 4.2. Metrics<a name="metrics"></a>

We start by computing the Area under the ROC Curve (AUC) and the PLAsTiCC logloss. For that, choose which class to consider as *positive* (the other classes will be considered *negative*). Then, **write** in `which_column` the column that corresponds to that class. Note that the class order is accessed through the classifier.

In [None]:
classifier._classes

In [None]:
which_column = 2  # we are interested in SN Ia vs others
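
As a rough illustration of what a one-vs-rest AUC measures, the sketch below computes it as the fraction of (positive, negative) event pairs in which the positive event receives the higher probability, which equals the area under the ROC curve. The helper `one_vs_rest_auc` and the toy data are hypothetical; `snclassifier.auc_score` may compute the quantity differently.

```python
import numpy as np


def one_vs_rest_auc(y_true, y_probs, classes, which_column):
    """AUC with classes[which_column] as positive and all other classes as negative."""
    is_positive = np.asarray(y_true) == classes[which_column]
    positive_scores = y_probs[is_positive, which_column]
    negative_scores = y_probs[~is_positive, which_column]
    # Compare every positive score with every negative score; ties count as half.
    differences = positive_scores[:, np.newaxis] - negative_scores[np.newaxis, :]
    return np.mean(differences > 0) + 0.5 * np.mean(differences == 0)


classes_toy = ['42', '62', '90']  # hypothetical class labels
y_true_toy = ['90', '90', '42', '62', '90', '42']
y_probs_toy = np.array([[0.10, 0.10, 0.80],
                        [0.30, 0.30, 0.40],
                        [0.50, 0.20, 0.30],
                        [0.20, 0.30, 0.50],
                        [0.05, 0.05, 0.90],
                        [0.60, 0.30, 0.10]])
auc_toy = one_vs_rest_auc(y_true_toy, y_probs_toy, classes_toy, which_column=2)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so `auc_toy` is 8/9.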

Obtain the metrics.

In [None]:
classifier.which_column = which_column
# These metrics are computed on the training set, so name and label them accordingly.
auc_train = snclassifier.auc_score(classifier=classifier, X_features=X, 
                                   y_true=y.astype(str), which_column=which_column)
logloss_train = snclassifier.logloss_score(classifier=classifier, X_features=X, 
                                           y_true=y.astype(str))
print('{:^10} {:^10} {:^10}'.format('', 'AUC', 'Logloss'))
print('{:^10} {:^10.3f} {:^10.3f}'.format('train', auc_train, logloss_train))
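
For intuition, a class-weighted log-loss in the spirit of the PLAsTiCC metric can be sketched as below: the log-loss is averaged within each class first, then combined across classes with per-class weights. This is a minimal sketch, assuming equal class weights by default; the helper `weighted_logloss` is hypothetical and `snclassifier.logloss_score` may differ in its weights and details.

```python
import numpy as np


def weighted_logloss(y_true, y_probs, classes, class_weights=None):
    """Log-loss averaged per class, so each class counts regardless of its size."""
    y_true = np.asarray(y_true)
    if class_weights is None:  # assumption: equal weight for every class
        class_weights = np.ones(len(classes))
    y_probs = np.clip(y_probs, 1e-15, 1 - 1e-15)  # avoid log(0)
    per_class_loss = np.zeros(len(classes))
    for i, label in enumerate(classes):
        is_class = y_true == label
        if is_class.any():  # classes without events contribute zero loss
            per_class_loss[i] = -np.mean(np.log(y_probs[is_class, i]))
    return np.sum(class_weights * per_class_loss) / np.sum(class_weights)


classes_toy = ['42', '62', '90']  # hypothetical class labels
y_true_toy = ['42', '62', '90', '90']
y_probs_toy = np.array([[0.50, 0.25, 0.25],
                        [0.25, 0.50, 0.25],
                        [0.25, 0.25, 0.50],
                        [0.25, 0.25, 0.50]])
logloss_toy = weighted_logloss(y_true_toy, y_probs_toy, classes_toy)
```

Every event in the toy example receives probability 0.5 for its true class, so `logloss_toy` equals ln 2.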

Check how many events we correctly classified.

In [None]:
is_pred_right = y_pred_train == y.astype(str)
np.sum(is_pred_right), np.sum(is_pred_right)/len(is_pred_right)

### 4.3. Confusion Matrix<a name="cm"></a>

Now, plot the confusion matrix.

In [None]:
analysis.dict_label_to_real_plasticc

In [None]:
title = 'Confusion matrix Accuracy'
analysis.plot_confusion_matrix(y.astype(str), y_pred_train, 
                               normalise='accuracy', title=title,
                               dict_label_to_real=analysis.dict_label_to_real_plasticc)

In [None]:
title = 'Confusion matrix Precision'
analysis.plot_confusion_matrix(y.astype(str), y_pred_train, 
                               normalise='precision', title=title,
                               dict_label_to_real=analysis.dict_label_to_real_plasticc)
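
The two plots differ only in how the counts are normalised. The sketch below assumes that `'accuracy'` divides each row by the number of events of that true class and that `'precision'` divides each column by the number of events predicted as that class; this reading of `plot_confusion_matrix` is an assumption, and the labels are hypothetical.

```python
import numpy as np

classes = ['42', '62', '90']  # hypothetical class labels
y_true_toy = np.array(['42', '42', '62', '90', '90', '90'])
y_pred_toy = np.array(['42', '62', '62', '90', '90', '42'])

# Rows are true classes, columns are predicted classes.
counts = np.zeros((len(classes), len(classes)))
for true_label, pred_label in zip(y_true_toy, y_pred_toy):
    counts[classes.index(true_label), classes.index(pred_label)] += 1

by_true_class = counts / counts.sum(axis=1, keepdims=True)       # rows sum to 1
by_predicted_class = counts / counts.sum(axis=0, keepdims=True)  # columns sum to 1
```

With row normalisation the diagonal shows the fraction of each true class that was recovered; with column normalisation it shows the purity of each predicted class.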

[Go back to top.](#index)