# Lab assignment №1, part 2

This lab assignment consists of several parts. You are supposed to make some transformations, train some models, estimate the quality of the models and explain your results.

Several comments:
* Don't hesitate to ask questions, it's a good practice.
* No private/public sharing, please. The copied assignments will be graded with 0 points.
* Blocks of this lab will be graded separately.

__*This is the second part of the assignment. First and third parts are waiting for you in the same directory.*__

## Part 2. Data preprocessing, model training and evaluation.

### 1. Reading the data
Today we work with the [dataset](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29) describing different cars, which poses a multiclass ($k=4$) classification problem. The data is loaded below.

In [None]:
!pip install ucimlrepo
!pip install scikit-plot
!pip install --upgrade scikit-plot
!pip install scipy==1.7.3

In [None]:
import ucimlrepo as uci
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
dataset = uci.fetch_ucirepo(id=149)

print(dataset.metadata.name, '\n')
print(dataset.metadata.abstract, '\n')
print(dataset.metadata.additional_info.summary, '\n')

In [None]:
# Later training showed that the data contains NaNs.
# Fix this by replacing each NaN with the median of its column.
data = dataset.data.features
data = data.fillna(data.median())
data.info()

In [None]:
# Later training also showed that a spurious class labelled '204' had slipped into the targets.
# Drop it so that it does not distort the results.

target = dataset.data.targets

# Exclude the rows where the target equals '204'
mask = target['class'] != '204'
data_filtered = data[mask].reset_index(drop=True)
target_filtered = target[mask].reset_index(drop=True)

In [None]:
print(data.shape, target.shape)

X_train, X_test, y_train, y_test = train_test_split(data_filtered, target_filtered, test_size=0.35)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
print(f"Unique classes in y_train after filtering: {y_train['class'].unique()}")
print(f"Unique classes in y_test after filtering: {y_test['class'].unique()}")

To get some insights about the dataset, `pandas` might be used. The `train` part is transformed to `pd.DataFrame` below.

In [None]:
X_train_pd = pd.DataFrame(X_train)

# First 15 rows of our dataset.
X_train_pd.head(15)

Methods `describe` and `info` deliver some useful information.

In [None]:
X_train_pd.describe()

In [None]:
X_train_pd.info()

### 2. Machine Learning pipeline
Here you are supposed to perform the desired transformations. Please, explain your results briefly after each task.

#### 2.0. Data preprocessing
* Make some transformations of the dataset (if necessary). Briefly explain the transformations

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

To help the models work better, the data should be normalized (here, standardized with `StandardScaler`): this keeps the weights from growing too large and lets PCA work effectively.
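As a minimal sketch of what the scaling step does, the snippet below standardizes synthetic data by hand with NumPy. The arrays `X_train_demo` and `X_test_demo` are hypothetical stand-ins, not the vehicle data. The point it illustrates is that the mean and standard deviation are estimated on the train part only and then reused for the test part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the train/test feature matrices (assumed, not the lab data).
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))

# Statistics are estimated on the training part only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), the convention StandardScaler follows

# The same statistics are then applied to both parts.
X_train_std = (X_train_demo - mean) / std
X_test_std = (X_test_demo - mean) / std

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

The standardized train columns have zero mean and unit variance by construction; the test columns are only approximately so, because they are scaled with the train statistics.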

#### 2.1. Basic logistic regression
* Find optimal hyperparameters for logistic regression with cross-validation on the `train` data (small grid/random search is enough, no need to find the *best* parameters).

* Estimate the model quality with `f1` and `accuracy` scores.
* Plot a ROC-curve for the trained model. For the multiclass case you might use `scikitplot` library (e.g. `scikitplot.metrics.plot_roc(test_labels, predicted_proba)`).

*Note: please, use the following hyperparameters for logistic regression: `multi_class='multinomial'`, `solver='saga'`, `tol=1e-3` and `max_iter=500`.*

In [None]:
# You might use this command to install scikit-plot.
# Warning: if you are running locally, don't call pip from within Jupyter; call it from a terminal
# in the corresponding virtual environment instead.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [None]:
estimator = LogisticRegression(penalty='elasticnet',
                               multi_class='multinomial',
                               solver='saga',
                               tol=1e-3, max_iter=500,
                               random_state=42)

param_grid = {
    'l1_ratio': np.linspace(0, 1, 51),
    'C': np.logspace(-1, 2, 50)
  }

logreg_gridsearch = GridSearchCV(estimator, param_grid, scoring='f1_macro', n_jobs=-1, cv=5)

In [None]:
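# Pass the labels as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning.
logreg_gridsearch.fit(X_train_scaled, y_train['class'])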

In [None]:
print(f'Best hyperparameters: {logreg_gridsearch.best_params_}')

In [None]:
import sklearn.metrics as metr

X_test_scaled = scaler.transform(X_test)

accuracy = metr.accuracy_score(y_test, logreg_gridsearch.predict(X_test_scaled))
f1 = metr.f1_score(y_test, logreg_gridsearch.predict(X_test_scaled), average='macro')

print(f'Accuracy: {accuracy}')
print(f'F1-score: {f1}')

In [None]:
from scikitplot.metrics import plot_roc

plot_roc(y_test, logreg_gridsearch.predict_proba(X_test_scaled))

For logistic regression the result is fairly good. Judging by the ROC curves, the model handles the **bus** and **van** classes confidently, whereas its predictions for **saab** and **opel** are less accurate.
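To make the link between per-class behaviour and the reported score explicit, here is a small pure-Python sketch of macro-averaged F1 (the `average='macro'` setting used above). The helper names and the toy labels are made up for illustration; in the notebook itself `sklearn.metrics.f1_score` does this work.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example (assumed labels): bus and van are perfect, saab and opel get confused.
y_true_demo = ['bus', 'bus', 'van', 'van', 'saab', 'saab', 'opel', 'opel']
y_pred_demo = ['bus', 'bus', 'van', 'van', 'saab', 'opel', 'saab', 'opel']

print(per_class_f1(y_true_demo, y_pred_demo))
print(macro_f1(y_true_demo, y_pred_demo))
```

Because every class contributes equally to the average, two weak classes pull the macro score down no matter how well the other two are predicted.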

#### 2.2. PCA: explained variance plot
* Apply PCA to the train part of the data. Build the explained variance plot.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [None]:
pca = PCA()
pca.fit(X_train_scaled)

In [None]:
plt.figure(figsize=(8, 5))

explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
plt.plot(np.arange(1, X_train.shape[1] + 1), explained_variance_ratio, marker='o')

plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.title('Explained Variance by PCA Components')

plt.grid()

# Purely for looks, add:

# Labels for each component
for i, evr in enumerate(explained_variance_ratio):
    plt.text(i+1, evr, f'{evr:.2f}', ha='center', va='bottom')

plt.show()

#### 2.3. PCA transformation
* Select the appropriate number of components. Briefly explain your choice. Should you normalize the data?

*Use `fit` and `transform` methods to transform the `train` and `test` parts.*

In [None]:
n_components = 17
opt_pca = PCA(n_components=n_components)

X_train_components = opt_pca.fit_transform(X_train_scaled)
X_test_components = opt_pca.transform(X_test_scaled)

I chose 17 components in order to reduce the dimensionality of the data while still keeping almost 100% of the explained variance.
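One way to make this choice reproducible is to pick the smallest number of components whose cumulative explained variance reaches a target share. The sketch below assumes a hypothetical helper `components_for_threshold` and a made-up ratio vector; it is not the rule used to arrive at 17 above.

```python
import numpy as np


def components_for_threshold(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative share >= threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index with cumulative[index] >= threshold.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Clamp in case the threshold is never reached (e.g. because of rounding).
    return min(count, len(cumulative))


# Made-up explained variance ratios for four components.
ratios_demo = np.array([0.5, 0.3, 0.15, 0.05])

print(components_for_threshold(ratios_demo, 0.9))
```

In the notebook the same helper could be called as `components_for_threshold(pca.explained_variance_ratio_, 0.99)`. Alternatively, `PCA` accepts a float between 0 and 1 as `n_components` and then selects the number of components needed to explain that share of the variance.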

**Note: From this point `sklearn` [Pipeline](https://scikit-learn.org/stable/modules/compose.html) might be useful to perform transformations on the data. Refer to the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more information.**
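As a rough illustration of the note above, the following sketch chains scaling, PCA and logistic regression in one `Pipeline`. The data is synthetic (`make_classification`), and the step names, `n_components=10` and default solver are arbitrary choices for the example, not the settings required by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem with 18 features (a stand-in for the vehicle data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=18, n_informative=8, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0, stratify=y_demo
)

# Each step is fitted on the train part only; transform is reused at predict time.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('logreg', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f'Test accuracy: {test_accuracy:.3f}')
```

A pipeline like this can be passed to `GridSearchCV` directly, with parameters addressed as `'logreg__C'` or `'pca__n_components'`. The scaler and PCA are then refitted inside every cross-validation fold, which avoids leaking validation data into the preprocessing.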

#### 2.4. Logistic regression on PCA-preprocessed data.
* Find optimal hyperparameters for logistic regression with cross-validation on the transformed by PCA `train` data.

* Estimate the model quality with `f1` and `accuracy` scores.
* Plot a ROC-curve for the trained model. For the multiclass case you might use `scikitplot` library (e.g. `scikitplot.metrics.plot_roc(test_labels, predicted_proba)`).

*Note: please, use the following hyperparameters for logistic regression: `multi_class='multinomial'`, `solver='saga'` and `tol=1e-3`*

In [None]:
comp_scaler = StandardScaler()
X_train_comp_scaled = comp_scaler.fit_transform(X_train_components)

estimator = LogisticRegression(penalty='elasticnet',
                               multi_class='multinomial',
                               solver='saga',
                               tol=1e-3,
                               random_state=42,
                               max_iter=1000)

param_grid = {
    'class_weight': [None, 'balanced'],
    'l1_ratio': np.linspace(0, 1, 41),
    'warm_start': [True, False, ]
  }

logreg_gridsearch = GridSearchCV(estimator, param_grid, scoring='f1_macro', n_jobs=-1, cv=7)
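# Pass the labels as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning.
logreg_gridsearch.fit(X_train_comp_scaled, y_train['class'])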

In [None]:
print("Best hyperparameters:", logreg_gridsearch.best_params_)

In [None]:
import sklearn.metrics as metr

X_test_comp_scaled = comp_scaler.transform(X_test_components)
y_pred = logreg_gridsearch.predict(X_test_comp_scaled)

accuracy = metr.accuracy_score(y_test, y_pred)
f1 = metr.f1_score(y_test, y_pred, average='macro')

print('Accuracy:', accuracy)
print('F1 Score:', f1)

In [None]:
plot_roc(y_test, logreg_gridsearch.predict_proba(X_test_comp_scaled))

When the model is trained on the features obtained after PCA, accuracy and F1 both increase by about 0.04, and ROC AUC rises as well. The model now predicts the **van** class with almost 100% confidence!

#### 2.8. Learning curve
Your goal is to estimate how the model behaviour changes as the size of the `train` dataset increases.

* Split the training data into 10 (almost) equal parts. Then train the models from above (Logistic Regression, Decision Tree, Random Forest) with the optimal hyperparameters you have selected on 1 part, 2 parts (combined, so the train size is doubled), 3 parts and so on.

* Build a plot of `accuracy` and `f1` scores on the `test` part, varying the `train` dataset size (so the axes will be score vs. dataset size).

* Analyse the final plot. Can you draw any conclusions from it?

Tip: there's a function in sklearn to do that
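The function the tip refers to is presumably `sklearn.model_selection.learning_curve`. Below is a minimal sketch of how it can be called, on synthetic data and with an arbitrarily chosen decision tree. Note one difference from the task as stated: `learning_curve` reports cross-validated scores on held-out folds rather than scores on a fixed `test` part.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data (a stand-in for the preprocessed train part).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=18, n_informative=8, n_classes=4, random_state=0
)

# Ten training-set fractions from 10% to 100%, evaluated with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='f1_macro',
)

# One row per training size, one column per fold; average over folds for plotting.
mean_val = val_scores.mean(axis=1)
for size, score in zip(train_sizes, mean_val):
    print(size, round(score, 3))
```

The per-fold spread (`val_scores.std(axis=1)`) gives a rough idea of how much of the curve's shape is noise.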

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import learning_curve

from tqdm.auto import tqdm

In [None]:
logreg_params = {
    'multi_class' : 'multinomial',
    'solver' : 'saga',
    'penalty' : 'elasticnet',
    'max_iter' : 1000,
    'class_weight' : 'balanced',
    'l1_ratio' : 0.1
}

In [None]:
models = [LogisticRegression(**logreg_params),
          DecisionTreeClassifier(max_depth=5),
          RandomForestClassifier(n_estimators=100)]

In [None]:
split_size = X_train.shape[0] // 10

accs = [[], [], []]
f1s  = [[], [], []]
sizes = []

In [None]:
for i in tqdm(range(1, 11)):
    if i==10:
        X_subtrain = X_train_comp_scaled
        y_subtrain = y_train
    else:
        X_subtrain = X_train_comp_scaled[:i*split_size]
        y_subtrain = y_train[:i*split_size]
    sizes.append(X_subtrain.shape[0])

    for j, model in enumerate(models):
        model.fit(X_subtrain, y_subtrain)
        pred = model.predict(X_test_comp_scaled)
        accs[j].append(metr.accuracy_score(y_test, pred))
        f1s[j].append(metr.f1_score(y_test, pred, average='macro'))

In [None]:
names = ['Logistic Regression', 'Decision Tree', 'Random Forest']
fig, axs = plt.subplots(ncols=2, figsize=(15, 5))

axs[0].set_title('Accuracy')
for i in range(3):
    axs[0].plot(sizes, accs[i], label=names[i])

axs[0].legend()
axs[0].set_xlabel('Number of training observations')
axs[0].grid()

axs[1].set_title('F1')
for i in range(3):
    axs[1].plot(sizes, f1s[i], label=names[i])

axs[1].legend()
axs[1].set_xlabel('Number of training observations')
axs[1].grid()

1. The first thing that stands out: logistic regression and random forest perform well, while the decision tree clearly lags behind.

2. For logistic regression, quality grows almost monotonically as the training set grows. This suggests that for this model more data is better (at least within this range and on this dataset).

3. For random forest, the model reaches saturation: beyond a certain amount of data, quality stops improving noticeably.

4. For the decision tree the picture resembles that of random forest, but quality seems to rise at first and then drop slightly. This may simply be random fluctuation of the model; a firmer conclusion would require rerunning the experiment several times.
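The repeated-runs check mentioned in the last point could look roughly like the sketch below. It uses synthetic data and an arbitrary `max_depth=5` tree; the number of repeats and the way subsets are drawn are assumptions for illustration, not part of the experiment above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data (a stand-in for the lab's train/test parts).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=18, n_informative=8, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0, stratify=y_demo
)

n_repeats = 10
fractions = np.linspace(0.1, 1.0, 10)
scores = np.zeros((n_repeats, len(fractions)))

for seed in range(n_repeats):
    # A different random ordering per repeat gives different nested subsets.
    order = np.random.default_rng(seed).permutation(len(X_tr))
    for k, frac in enumerate(fractions):
        idx = order[: max(1, int(round(frac * len(X_tr))))]
        model = DecisionTreeClassifier(max_depth=5, random_state=seed)
        model.fit(X_tr[idx], y_tr[idx])
        scores[seed, k] = accuracy_score(y_te, model.predict(X_te))

# Mean curve and its spread across repeats.
mean_scores = scores.mean(axis=0)
std_scores = scores.std(axis=0)
print(mean_scores.round(3))
print(std_scores.round(3))
```

If a dip in the single-run curve is smaller than the spread in `std_scores` at that training size, it is most likely noise rather than a real effect.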