### Task 2.7

The larger, more substantial practice is still ahead, but as a warm-up let's work with the wine dataset you already know.

Earlier you trained only one algorithm on this data; now we will compare several.

Prepare the data for classification by splitting the wine into two classes, good and bad.
A wine counts as good if its quality value is 6 or higher.
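
As a quick illustration of this rule, the sketch below applies the threshold to the quality scores of the nine rows visible in the dataset preview in this notebook. The names `GOOD_THRESHOLD` and `is_good` are illustrative and not part of the assignment.

```python
GOOD_THRESHOLD = 6  # from the task statement: quality of 6 or more counts as good


def is_good(quality: int) -> bool:
    """Return True for a good wine, False for a bad one."""
    return quality >= GOOD_THRESHOLD


# Quality values of the nine rows shown in the dataset preview
preview_quality = [5, 5, 5, 6, 5, 5, 6, 6, 5]

labels = [is_good(q) for q in preview_quality]
good_share = sum(labels) / len(labels)

print(labels)
print(f"share of good wine in the preview rows: {good_share:.2f}")
```

Checking the share of each class before training is useful because accuracy alone can be misleading when the classes are imbalanced.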

Compare several classification methods: logistic regression, a decision tree, and bagging.
This will show you how the quality metrics change depending on the choice of algorithm.

Split the data into training and test sets in a 70/30 ratio, with random_state set to 42.
To start, train two classifiers: logistic regression (with default parameters) and a decision tree (random_state = 42, maximum depth 10).

In [104]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

from sklearn import set_config
set_config(transform_output='pandas')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

sns.set_theme('notebook') 
sns.set_palette('Set2')

plt.rcParams['figure.figsize'] = (12, 8) 

In [105]:
df = pd.read_csv('data/wineQualityReds.zip', index_col=0)
TARGET = 'quality'
df

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
2,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
3,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
4,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
5,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1595,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1596,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1597,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1598,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1599 entries, 1 to 1599
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed.acidity         1599 non-null   float64
 1   volatile.acidity      1599 non-null   float64
 2   citric.acid           1599 non-null   float64
 3   residual.sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free.sulfur.dioxide   1599 non-null   float64
 6   total.sulfur.dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 162.4 KB


In [107]:
df[TARGET] = df[TARGET] >= 6  # True for good wine (quality of 6 or more), False otherwise

In [108]:
X, y = df.drop(columns=TARGET), df[TARGET]

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [110]:
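models = {
    # max_iter is raised above the default of 100, presumably so the solver converges on unscaled features
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(max_depth=10, random_state=42),
    # Extra model beyond the two the task asks for, included for comparison
    'RandomForest': RandomForestClassifier(n_estimators=1500, max_depth=10, random_state=42),
}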

In [111]:
for name, model in models.items():
    print('Model:', name)
    
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    print('****TRAIN****')
    print(f'Accuracy: {accuracy_score(y_train, y_train_pred):.3f}')
    print(f'Precision: {precision_score(y_train, y_train_pred):.3f}')
    print(f'Recall: {recall_score(y_train, y_train_pred):.3f}')
    print(f'F1_score: {f1_score(y_train, y_train_pred):.3f}')
    print()
    
    print('****TEST****')
    print(f'Accuracy: {accuracy_score(y_test, y_test_pred):.3f}')
    print(f'Precision: {precision_score(y_test, y_test_pred):.3f}')
    print(f'Recall: {recall_score(y_test, y_test_pred):.3f}')
    print(f'F1_score: {f1_score(y_test, y_test_pred):.3f}')
    print()

Model: LogisticRegression
****TRAIN****
Accuracy: 0.757
Precision: 0.782
Recall: 0.745
F1_score: 0.763

****TEST****
Accuracy: 0.727
Precision: 0.772
Recall: 0.723
F1_score: 0.747

Model: DecisionTree
****TRAIN****
Accuracy: 0.940
Precision: 0.948
Recall: 0.937
F1_score: 0.943

****TEST****
Accuracy: 0.760
Precision: 0.764
Recall: 0.824
F1_score: 0.793

Model: RandomForest
****TRAIN****
Accuracy: 0.989
Precision: 0.991
Recall: 0.988
F1_score: 0.990

****TEST****
Accuracy: 0.804
Precision: 0.822
Recall: 0.828
F1_score: 0.825
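
All four numbers printed for each model derive from the confusion-matrix counts. Below is a minimal pure-Python sketch of those definitions on a made-up label vector. The function name `binary_metrics` and the toy data are assumptions for illustration; the notebook itself relies on `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # good wine predicted as good
    fp = sum(1 for t, p in pairs if not t and p)      # bad wine predicted as good
    fn = sum(1 for t, p in pairs if t and not p)      # good wine predicted as bad
    tn = sum(1 for t, p in pairs if not t and not p)  # bad wine predicted as bad

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for this example (not taken from the wine data)
y_true_toy = [True, True, True, True, False, False, False, False]
y_pred_toy = [True, True, True, False, True, True, False, False]

metrics = binary_metrics(y_true_toy, y_pred_toy)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either of the two is low.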



### Task 2.8

Train a model using bagging (the BaggingClassifier class with random_state=42).

Take the algorithm that showed the best quality in the previous task and set the number of models to 1500.

Compute the new F1-score value.

In [112]:
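# Note: random_state=42 is set on the base tree only, not on BaggingClassifier itself as the task asks,
# so the bootstrap samples (and the scores below) can differ slightly between runs.
model = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=10, random_state=42), n_estimators=1500)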

In [113]:
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print('****TRAIN****')
print(f'Accuracy: {accuracy_score(y_train, y_train_pred):.3f}')
print(f'Precision: {precision_score(y_train, y_train_pred):.3f}')
print(f'Recall: {recall_score(y_train, y_train_pred):.3f}')
print(f'F1_score: {f1_score(y_train, y_train_pred):.3f}')
print()
    
print('****TEST****')
print(f'Accuracy: {accuracy_score(y_test, y_test_pred):.3f}')
print(f'Precision: {precision_score(y_test, y_test_pred):.3f}')
print(f'Recall: {recall_score(y_test, y_test_pred):.3f}')
print(f'F1_score: {f1_score(y_test, y_test_pred):.3f}')
print()

****TRAIN****
Accuracy: 0.990
Precision: 0.991
Recall: 0.990
F1_score: 0.991

****TEST****
Accuracy: 0.800
Precision: 0.828
Recall: 0.809
F1_score: 0.818
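
To compare the runs, the train and test F1 values can be put side by side; the gap between them is a rough indicator of how strongly a model overfits. The numbers below are copied from the cell outputs in this notebook, while the dictionary layout and variable names are illustrative.

```python
# (train F1, test F1) copied from the cell outputs in this notebook
f1_scores = {
    "LogisticRegression": (0.763, 0.747),
    "DecisionTree": (0.943, 0.793),
    "RandomForest": (0.990, 0.825),
    "Bagging": (0.991, 0.818),
}

# Difference between train and test F1 for each model
gaps = {name: round(train - test, 3) for name, (train, test) in f1_scores.items()}

# Model with the highest F1 on the test set
best_on_test = max(f1_scores, key=lambda name: f1_scores[name][1])

for name, (train, test) in f1_scores.items():
    print(f"{name:<20} train={train:.3f}  test={test:.3f}  gap={gaps[name]:.3f}")
print("best test F1:", best_on_test)
```

In these runs bagging lifts the test F1 of a single depth-10 tree from 0.793 to 0.818, while the train/test gap stays large for all tree-based models.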



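The result is essentially a hand-assembled imitation of RandomForest: an ensemble of individual decision trees trained on bootstrap samples. Unlike RandomForestClassifier, BaggingClassifier with its default settings lets every tree consider all features at each split, so the two ensembles are similar but not identical.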
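
The idea behind bagging can be sketched in pure Python: draw bootstrap samples, fit one base model per sample, and combine predictions by majority vote. This is only a toy sketch under stated assumptions: a depth-1 threshold rule stands in for `DecisionTreeClassifier`, the data is invented, and the function names (`fit_stump`, `fit_bagging`, `predict_bagging`) are hypothetical. It does not reproduce how scikit-learn implements `BaggingClassifier`.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-feature threshold rule (a depth-1 'tree') by exhaustive search."""
    best = None  # (errors, feature index, threshold, label predicted above the threshold)
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for label_gt in (0, 1):
                preds = [label_gt if row[j] > thr else 1 - label_gt for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, thr, label_gt)
    _, j, thr, label_gt = best
    return lambda row: label_gt if row[j] > thr else 1 - label_gt


def fit_bagging(X, y, n_estimators=25, seed=42):
    """Train n_estimators stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Majority vote over the individual models."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# Invented toy data: two features per row, label 1 = good, 0 = bad
X_toy = [[9.4, 0.70], [9.8, 0.88], [9.5, 0.76], [10.0, 0.65],
         [11.2, 0.40], [11.0, 0.51], [12.0, 0.35], [11.5, 0.30]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = fit_bagging(X_toy, y_toy)
print(predict_bagging(ensemble, [12.5, 0.30]))
print(predict_bagging(ensemble, [9.0, 0.90]))
```

Because each stump sees a different resample of the rows, individual models may disagree, and the vote averages out their errors. A random forest adds one more source of randomness on top of this: a random subset of features at each split.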