# 0. Introduction

The aim of this lab is to get familiar with **classification problems** using **Multinomial Logistic Regression**, **Decision Trees** and **Naive Bayes**. 

For this lab, we will be using the [iris dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset) and the [forest covertypes](https://scikit-learn.org/stable/datasets/real_world.html#forest-covertypes) dataset.

In [None]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import naive_bayes
from sklearn import linear_model
from sklearn import tree
from sklearn import datasets
from sklearn import metrics
from imblearn import over_sampling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from IPython import display

import typing
%matplotlib inline

In [None]:
iris = datasets.load_iris()
print(iris.DESCR)

We will again split the data into train and test sets. This time, we will use sklearn's `StandardScaler` to scale the attributes, fitting it on the train set only and reusing the learned statistics for the test set.

In [None]:
X = iris.data
Y = iris.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X,
    Y,
    test_size=0.2,
    random_state=42
    )
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
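
As a quick sanity check of the scaling step, the sketch below repeats the same split and confirms that the scaled training attributes have roughly zero mean and unit standard deviation, while the test attributes (transformed with the training statistics) only come close. The `_demo` names are illustrative and not part of the lab.

```python
import numpy as np
from sklearn import datasets, model_selection, preprocessing

iris_demo = datasets.load_iris()
X_tr_demo, X_te_demo, _, _ = model_selection.train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.2, random_state=42
)

scaler_demo = preprocessing.StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr_demo)  # statistics learned from train only
X_te_scaled = scaler_demo.transform(X_te_demo)      # reuse the train statistics

print(X_tr_scaled.mean(axis=0).round(3))  # per-attribute mean of the train set
print(X_tr_scaled.std(axis=0).round(3))   # per-attribute std of the train set
print(X_te_scaled.mean(axis=0).round(3))  # test means are close to, but not exactly, zero
```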

# 1. Multinomial Regression
In the previous lab, we implemented a logistic regression classifier ourselves. Here, we will instead import it from Python's scikit-learn library. As there are several parameters that can be passed, please read and understand the sklearn documentation for [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Adjust the code below to use L2 regularization and the appropriate `solver` and `multi_class` options. Note that `multi_class` is deprecated from scikit-learn 1.5 onwards, so on recent versions it may raise a warning or need to be omitted.

In [None]:
### your code here
# log_reg_classifier = linear_model.LogisticRegression(C=1e5, solver='???', multi_class='???', penalty='???')
# log_reg_classifier.fit(X_train, y_train)
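
To illustrate what the multinomial formulation computes, the sketch below fits a classifier on iris and rebuilds its class probabilities by applying a softmax to the linear scores. It assumes a recent scikit-learn release, where the default settings (L2 penalty, `lbfgs` solver) fit a multinomial model for multi-class targets. The `_demo` names are illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing

X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_demo = preprocessing.StandardScaler().fit_transform(X_demo)

# Library defaults are assumed here; the exercise asks you to set the options explicitly.
clf_demo = linear_model.LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

scores = clf_demo.decision_function(X_demo)   # linear scores, shape (n_samples, n_classes)
scores -= scores.max(axis=1, keepdims=True)   # stabilise the exponentials
softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# For a multinomial model, the softmax of the scores matches predict_proba.
print(np.allclose(softmax, clf_demo.predict_proba(X_demo)))
```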

What is the accuracy of the classifier on the test set? How does it compare to last week's one-vs-all example?

In [None]:
### your code here
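
As a reminder of what the accuracy score measures, here is a minimal sketch on made-up labels (the arrays are hypothetical, not lab data). `metrics.accuracy_score` returns the fraction of predictions that match the true labels.

```python
import numpy as np
from sklearn import metrics

y_true_demo = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred_demo = np.array([0, 1, 2, 1, 1, 0, 2, 2])

acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
manual_acc = (y_true_demo == y_pred_demo).mean()  # same quantity, computed by hand
print(acc, manual_acc)
```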

The iris dataset is relatively simple, so classifiers achieve high accuracy on it easily. For a more complex example, we will use the Forest Covertype dataset. Please read the dataset documentation for details on the attributes and the dataset in general.

In [None]:
forest = datasets.fetch_covtype()
print(forest.DESCR)

In [None]:
X = forest.data
Y = forest.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X,
    Y,
    test_size=0.2,
    random_state=42
    )
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# X_train[:2]

Train a classifier on the forest train data and use the test set to estimate the accuracy score.

In [None]:
### your code here

Build a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html) visualisation of the classifier's performance on the test set. What do you observe? Why do you think this is the case?

In [None]:
### your code here
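
A confusion matrix is easier to read once you know its layout: rows are true classes and columns are predicted classes. The sketch below uses made-up, deliberately imbalanced labels (hypothetical, not lab data) and derives the per-class recall from the matrix.

```python
import numpy as np
from sklearn import metrics

y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0])

cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
recall_per_class = cm.diagonal() / cm.sum(axis=1)  # correct predictions / true count per class

print(cm)
print(recall_per_class.round(3))
```

In recent scikit-learn releases, `metrics.ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` draws the same matrix as a plot. The older `plot_confusion_matrix` helper was removed in version 1.2.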

# 2. Decision Trees

Decision Trees are a supervised machine learning method in which the data is recursively split according to learned thresholds on the attributes. A tree is composed of two kinds of nodes: decision nodes, which test an attribute, and leaf nodes, which hold the final outcome.

Using the sklearn [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision+tree), train a classifier on the forest train set, calculate the overall accuracy and plot the confusion matrix of the classifier. How does this compare to the Multinomial Logistic Regression?

Use entropy as the splitting criterion (`criterion='entropy'`), so that splits are chosen by information gain.

In [None]:
### your code here
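
To make the entropy criterion concrete, the sketch below computes the information gain of two candidate splits by hand on a tiny made-up dataset. The helper functions `entropy` and `information_gain` are illustrative and not part of sklearn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, left_mask):
    """Entropy reduction from splitting `labels` into left_mask / ~left_mask."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

y_demo = np.array([0, 0, 0, 1, 1, 1])
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

gain_perfect = information_gain(y_demo, x_demo <= 3.0)  # separates the classes exactly
gain_weaker = information_gain(y_demo, x_demo <= 2.0)   # leaves one child impure
print(gain_perfect, gain_weaker)
```

A tree trained with the entropy criterion picks, at each decision node, the split with the highest gain, which here is the threshold at 3.0.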

# 3. Naive Bayes

As the attributes of the dataset are scaled to continuous real values, we will use a Gaussian Naive Bayes model.

Using the [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html?highlight=naive+bayes) implementation from sklearn, train a classifier on the forest train set, calculate the overall accuracy and plot the confusion matrix of the classifier. How does this compare to the previous two algorithms?

Tune the `var_smoothing` hyper-parameter of the classifier for a fairer comparison.

In [None]:
### your code here
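
One possible way to tune `var_smoothing` is a cross-validated grid search. The sketch below shows the pattern on the small iris dataset. The grid values and the choice of 5 folds are assumptions for illustration, and other search strategies would work equally well.

```python
import numpy as np
from sklearn import datasets, model_selection, naive_bayes

X_demo, y_demo = datasets.load_iris(return_X_y=True)

# Candidate values spanning several orders of magnitude (illustrative choice).
param_grid = {"var_smoothing": np.logspace(-9, 0, 10)}
search = model_selection.GridSearchCV(
    naive_bayes.GaussianNB(), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```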

# 4. Data Imbalance
You will notice that, unlike the iris dataset, the forest dataset is imbalanced: not all classes have an equal number of samples. This can bias a classifier towards predicting only the dominant classes. Some algorithms are more prone to this than others. For example, the decision tree achieves high accuracy without any adjustment, while multinomial logistic regression does well on the dominant classes but poorly on the rest.
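
To see why overall accuracy can mislead on imbalanced data, the sketch below counts the samples per class in a made-up label vector (hypothetical, not the forest data) and computes the accuracy of always predicting the majority class.

```python
import numpy as np

y_demo = np.array([0] * 90 + [1] * 8 + [2] * 2)

classes, counts = np.unique(y_demo, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({n / len(y_demo):.0%})")

# A classifier that always predicts the majority class still scores highly.
majority = classes[counts.argmax()]
baseline_acc = (y_demo == majority).mean()
print(baseline_acc)
```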

`LogisticRegression` has a `class_weight` parameter. Set it to `'balanced'` and retrain the classifier. Does the performance of the classifier improve?

In [None]:
### your code here
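
The `'balanced'` option weights each class inversely to its frequency, using `n_samples / (n_classes * np.bincount(y))` as documented by sklearn. The sketch below checks this on made-up labels. The variable names are illustrative.

```python
import numpy as np
from sklearn.utils import class_weight

y_demo = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_demo)

weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=classes, y=y_demo
)
# The same weights computed directly from the documented formula.
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rare class receives the larger weight, so its misclassifications contribute more to the training loss.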

Another solution is to implement a sampling strategy. Various methods for over-sampling and under-sampling are available. We will use the Synthetic Minority Over-sampling Technique ([SMOTE](https://doi.org/10.1613/jair.953)) as implemented in [imblearn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html).

Retrain the logistic regression classifier using the resampled data, `X_resampled` and `y_resampled` (this will take a while now), and evaluate on the original test set.

Compare the metrics to previous implementations.

In [None]:
sampler = over_sampling.SMOTE(sampling_strategy='not majority', random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
print(len(X_resampled)) # notice the difference in the number of samples
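
The core idea behind SMOTE is to create new minority samples by interpolating between an existing sample and one of its nearest minority-class neighbours. The sketch below is a simplified numpy illustration of that idea, not the imblearn implementation. The function `synthetic_sample` and the data points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def synthetic_sample(points, k, rng):
    """Interpolate between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1 : k + 1]  # index 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                         # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i])

new_points = np.array([synthetic_sample(minority, k=2, rng=rng) for _ in range(5)])
print(new_points.round(3))
```

Every synthetic point lies on a line segment between two real minority samples, which is why the resampled train set grows without simply duplicating rows.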

In [None]:
### your code here

What happens if you retrain the NB classifier with the resampled data? Why is this happening?

In [None]:
### your code here