In [151]:
# Note: load_boston was removed in scikit-learn 1.2, so this notebook needs an older version
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [8]:
RANDOM_STATE = 42

In [283]:
dataset = load_boston()
X = pd.DataFrame(dataset.data)
X.columns = dataset.feature_names
y = dataset.target

In [284]:
X.shape

(506, 13)

In [285]:
y.shape

(506,)

1. Split the data into training and test sets in an 80%/20% ratio.

In [286]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    random_state=RANDOM_STATE)

In [287]:
X_train.shape

(404, 13)

In [288]:
X_test.shape

(102, 13)

In [289]:
y_train.shape

(404,)

In [290]:
y_test.shape

(102,)

2. Train a standard linear regression, as well as Ridge and Lasso with default parameters, and print their R2 on the test set.

In [17]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

In [18]:
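def r2_test(model):
    # Fit on the training set and print R2 on the test set
    reg = model.fit(X_train, y_train)
    r2 = reg.score(X_test, y_test)
    print('%.7f' % r2)
    # The *CV estimators (RidgeCV, LassoCV) expose the selected regularization
    # strength as alpha_; plain estimators do not, so an empty line is printed
    try:
        coef = reg.alpha_
        print(f'coef: {coef}')
    except AttributeError:
        print('')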

In [19]:
r2_test(LinearRegression())

0.6687595



In [20]:
r2_test(Ridge())

0.6662222



In [21]:
r2_test(Lasso())

0.6671454
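
The `score` method called in `r2_test` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. As a minimal sketch (the helper name `r2_by_hand` and the toy numbers are illustrative and not taken from the notebook), the same quantity can be computed directly:

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets and predictions (hypothetical values)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
print(r2_by_hand(y_true, y_pred))
```

A model that always predicts the mean of `y_true` scores exactly 0, and a model worse than that scores below 0.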



3. For Ridge and Lasso, tune the regularization coefficient (use GridSearchCV, RidgeCV, LassoCV) in the range from $10^{-5}$ to $10^5$ (in powers of 10). Compute R2 on the test set for the best models and compare with the previous results. Describe how the result changed.

In [114]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeCV, LassoCV

param_grid = [10**i for i in range(-5,6)]
param_grid

[1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]

In [23]:
def r2_grid(model, param_grid):
    grid_search = GridSearchCV(model, {'alpha' : param_grid}, cv=10)
    grid_search.fit(X_train, y_train)
    r2 = grid_search.score(X_test, y_test)
    coef = grid_search.best_params_['alpha']
    print('%.7f' % r2)
    print (f'coef: {coef}')

In [24]:
r2_grid(Ridge(), param_grid)

0.6687595
coef: 1e-05


In [25]:
r2_grid(Lasso(), param_grid)

0.6687599
coef: 1e-05


In [26]:
r2_test(RidgeCV(alphas=param_grid))

0.6687510
coef: 0.01


In [27]:
r2_test(LassoCV(alphas=param_grid))

0.6687599
coef: 1e-05


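The result improved slightly: with a tuned coefficient, Ridge and Lasso reach R2 of about 0.6688 on the test set, compared with 0.6662 (Ridge) and 0.6671 (Lasso) with default parameters. The selected coefficients are small (1e-05 or 0.01), so both models nearly coincide with plain linear regression (0.6687595).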

4. Scale the data (use Pipeline, StandardScaler), compute R2 and compare with the previous results. Describe how the result changed.

In [101]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

In [29]:
def r2_scale(scaler, reg):
    pipe = Pipeline(steps=[('scaler', scaler), ('reg', reg)]).fit(X_train, y_train)
    r2 = pipe.score(X_test, y_test)
    print('%.7f' % r2)

In [30]:
r2_scale(StandardScaler(), Ridge())

0.6684624


In [31]:
r2_scale(MinMaxScaler(), Ridge())

0.6764100


In [32]:
r2_scale(StandardScaler(), Lasso())

0.6239429


In [33]:
r2_scale(MinMaxScaler(), Lasso())

0.2573921


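With default regularization, scaling mostly hurts Lasso: R2 drops to 0.6239 with StandardScaler() and to 0.2574 with MinMaxScaler(), versus 0.6671 without scaling. Ridge changes little: 0.6685 with StandardScaler() and 0.6764 with MinMaxScaler(), versus 0.6662 without scaling.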
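
For reference, the two scalers compared in this task apply different per-column transformations: StandardScaler subtracts the mean and divides by the population standard deviation, while MinMaxScaler maps the column to [0, 1]. A pure-Python sketch on a made-up column (the helper names are hypothetical):

```python
from statistics import mean, pstdev


def standard_scale(column):
    """z = (x - mean) / std, using the population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def min_max_scale(column):
    """x' = (x - min) / (max - min), which maps the column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


column = [2.0, 4.0, 6.0, 8.0]  # made-up feature values
z = standard_scale(column)
m = min_max_scale(column)
print(z)
print(m)
```

Because MinMaxScaler compresses every feature into a unit range, the coefficients have to grow to produce the same predictions, so a fixed L1 penalty such as Lasso's default `alpha=1.0` shrinks them relatively harder. This is one plausible reading of the 0.2574 result, not something the notebook verifies.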

5. Tune the regularization coefficient for Ridge and Lasso on the scaled data, compute R2 and compare with the previous results. Describe how the result changed.

In [48]:
def r2_scale_grid(scaler, reg, param_grid):
    pipe = Pipeline(steps=[('scaler', scaler), ('reg', reg)])
    grid_search = GridSearchCV(pipe, {'reg__alpha' : param_grid}, cv=10)
    grid_search.fit(X_train, y_train)
    r2 = grid_search.score(X_test, y_test)
    coef = grid_search.best_params_['reg__alpha']
    print('%.7f' % r2)
    print (f'coef: {coef}')

In [49]:
r2_scale_grid(StandardScaler(), Ridge(), param_grid)

0.6659678
coef: 10


In [50]:
r2_scale_grid(MinMaxScaler(), Ridge(), param_grid)

0.6764100
coef: 1


In [51]:
r2_scale_grid(StandardScaler(), Lasso(), param_grid)

0.6681816
coef: 0.01


In [52]:
r2_scale_grid(MinMaxScaler(), Lasso(), param_grid)

0.6676993
coef: 0.01


Compared with task 4: for Ridge() + StandardScaler() the result decreased slightly (from 0.6685 to 0.6660). Ridge() + MinMaxScaler() did not change (0.6764). Lasso() + StandardScaler() improved (from 0.6239 to 0.6682). Lasso() + MinMaxScaler() became much better (from 0.2574 to 0.6677).

6. Add pairwise products of the features and their squares (use PolynomialFeatures) on the scaled features, compute R2 and compare with the previous results. Describe how the result changed.

In [42]:
from sklearn.preprocessing import PolynomialFeatures

In [60]:
def r2_scale_poly(scaler, reg):
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
    X_train_poly = poly.transform(X_train_scaled)
    X_test_poly = poly.transform(X_test_scaled)
    reg = reg.fit(X_train_poly, y_train)
    r2 = reg.score(X_test_poly, y_test)
    print('%.7f' % r2)

In [61]:
r2_scale_poly(StandardScaler(), Ridge())

0.8162949


In [62]:
r2_scale_poly(MinMaxScaler(), Ridge())

0.8299337


In [63]:
r2_scale_poly(StandardScaler(), Lasso())

0.7322763


In [64]:
r2_scale_poly(MinMaxScaler(), Lasso())

0.2611263


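For Ridge regression the result improved markedly (0.8163 with StandardScaler(), 0.8299 with MinMaxScaler()), and Lasso with standard scaling also improved (0.7323). The MinMaxScaler() + Lasso() result dropped sharply, to 0.2611 from the tuned 0.6677 in task 5; it is close to the 0.2574 obtained in task 4 with the same default coefficient.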
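
To see how much PolynomialFeatures enlarges the design matrix, one can count the monomials it generates. With its default `include_bias=True`, degree `d` on `n` input features yields `C(n + d, d)` columns. A short stdlib sketch for the 13 Boston features (the function names are illustrative):

```python
from itertools import combinations_with_replacement
from math import comb


def n_poly_features(n_features, degree):
    """Number of monomials of total degree <= `degree` (bias term included)."""
    return comb(n_features + degree, degree)


def enumerate_monomials(n_features, degree):
    """List the monomials as tuples of feature indices; () is the bias term."""
    return [combo
            for d in range(degree + 1)
            for combo in combinations_with_replacement(range(n_features), d)]


for degree in (1, 2, 3):
    print(degree, n_poly_features(13, degree))
```

At degree 3 the expanded matrix has more columns than the 404 training rows, so the choice of regularization strength matters much more than it does for the raw features.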

7. Select the best model (use Pipeline, GridSearchCV), tuning the regularization type (L1, L2), the regularization coefficient, the scaling method and the polynomial degree in PolynomialFeatures. Print the final parameters and the resulting R2. Describe how R2 changed compared with the previous experiments.

In [291]:
param_grid_2 = {'poly__degree': [1, 2, 3],
                'reg__alpha': param_grid}
param_grid_2

{'poly__degree': [1, 2, 3],
 'reg__alpha': [1e-05,
  0.0001,
  0.001,
  0.01,
  0.1,
  1,
  10,
  100,
  1000,
  10000,
  100000]}

In [292]:
def r2_scale_poly_grid(scaler, reg, param_grid):
    pipe = Pipeline([('scaler', scaler),
                     ('poly', PolynomialFeatures()),
                     ('reg', reg)])
    grid_search = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    r2 = grid_search.score(X_test, y_test)
    coef = grid_search.best_params_['reg__alpha']
    degree = grid_search.best_params_['poly__degree']
    print('%.7f' % r2)
    print (f'coef: {coef}')
    print (f'degree: {degree}')

In [293]:
r2_scale_poly_grid(StandardScaler(), Ridge(), param_grid_2)

0.8180466
coef: 10
degree: 2


In [294]:
r2_scale_poly_grid(MinMaxScaler(), Ridge(), param_grid_2)

0.8588480
coef: 0.1
degree: 3


In [295]:
r2_scale_poly_grid(StandardScaler(), Lasso(), param_grid_2)

0.8138519
coef: 0.01
degree: 2


In [296]:
r2_scale_poly_grid(MinMaxScaler(), Lasso(), param_grid_2)

0.8449734
coef: 0.001
degree: 3


Ridge() with MinMaxScaler() scaling, regularization coefficient 0.1 and polynomial degree 3 is the best model (R2 = 0.8588 on the test set). Overall, the result improved for every variant.
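
The four separate searches can in principle be folded into a single GridSearchCV by treating the scaler and the regressor as hyperparameters of the pipeline. The following is a sketch under assumptions: it uses synthetic data from `make_regression` instead of the Boston data, a reduced grid and `cv=3` to keep it quick, so its output does not reproduce the numbers in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data (assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8,
                                          random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('poly', PolynomialFeatures()),
                 ('reg', Ridge())])

# Whole pipeline steps can be swapped by using the step name as a parameter
search_space = {'scaler': [StandardScaler(), MinMaxScaler()],
                'poly__degree': [1, 2],
                'reg': [Ridge(), Lasso(max_iter=10000)],
                'reg__alpha': [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, search_space, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print('%.7f' % search.score(X_te, y_te))
```

The design choice here is that one search compares all combinations under the same cross-validation folds, instead of comparing the winners of four independent searches by their test scores.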

http://archive.ics.uci.edu/ml/datasets/Adult

In [135]:
link = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/adult-all.csv'
data = pd.read_csv(link, header=None)

In [136]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


8. Split the data into features and the target variable (the column with values {<=50K, >50K}). Replace the target variable with numeric values.

In [137]:
features = data.drop(columns=14)
target = data[14]

In [138]:
features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [139]:
target.head()

0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: 14, dtype: object

In [161]:
# Map the two income labels to 0/1
target = target.map({'<=50K': 0, '>50K': 1})
target

0        0
1        0
2        0
3        0
4        0
        ..
48837    0
48838    0
48839    0
48840    0
48841    1
Name: 14, Length: 48842, dtype: int64

9. Check whether the data contain missing values. Fill them with the most frequent values (use SimpleImputer).

In [141]:
from sklearn.impute import SimpleImputer

In [142]:
data.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
dtype: int64

10. Select the columns with numerical and categorical variables.

In [143]:
# The table has no missing values. The code below fills missing values with the most frequent ones
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp.fit(data)
imp.transform(data)

array([[39, 'State-gov', 77516, ..., 40, 'United-States', '<=50K'],
       [50, 'Self-emp-not-inc', 83311, ..., 13, 'United-States', '<=50K'],
       [38, 'Private', 215646, ..., 40, 'United-States', '<=50K'],
       ...,
       [38, 'Private', 374983, ..., 50, 'United-States', '<=50K'],
       [44, 'Private', 83891, ..., 40, 'United-States', '<=50K'],
       [35, 'Self-emp-inc', 182148, ..., 60, 'United-States', '>50K']],
      dtype=object)

In [144]:
data.dtypes

0      int64
1     object
2      int64
3     object
4      int64
5     object
6     object
7     object
8     object
9     object
10     int64
11     int64
12     int64
13    object
14    object
dtype: object

In [145]:
# Numeric dtypes are numerical features; every remaining (string) column is categorical
numerical = data.select_dtypes(include='number').columns.tolist()
categorical = [name for name in data.columns if name not in numerical]
print(numerical)
print(categorical)

[0, 2, 4, 10, 11, 12]
[1, 3, 5, 6, 7, 8, 9, 13, 14]


In [162]:
data[categorical].head()

Unnamed: 0,1,3,5,6,7,8,9,13,14
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [163]:
data[numerical].head()

Unnamed: 0,0,2,4,10,11,12
0,39,77516,13,2174,0,40
1,50,83311,13,0,0,13
2,38,215646,9,0,0,40
3,53,234721,7,0,0,40
4,28,338409,13,0,0,40


In essence, every column with string data can be treated as categorical.

11. Create a pipeline for processing the columns (use OneHotEncoder, MinMaxScaler).

In [164]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

In [165]:
X = pd.get_dummies(features)
X.head()

Unnamed: 0,0,2,4,10,11,12,1_?,1_Federal-gov,1_Local-gov,1_Never-worked,...,13_Portugal,13_Puerto-Rico,13_Scotland,13_South,13_Taiwan,13_Thailand,13_Trinadad&Tobago,13_United-States,13_Vietnam,13_Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [166]:
y = target
y.head()

0    0
1    0
2    0
3    0
4    0
Name: 14, dtype: int64

In [167]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    random_state=RANDOM_STATE)

In [168]:
scaler = MinMaxScaler()
reg = LogisticRegression()
pipe = Pipeline(steps=[('scaler', scaler), ('reg', reg)]).fit(X_train, y_train)
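
The cells in this task encode the categorical columns with `pd.get_dummies` before splitting. An alternative that follows the task wording more closely is to put OneHotEncoder and MinMaxScaler inside a ColumnTransformer, so that the encoding and scaling are learned from the training data only. A minimal sketch on a tiny hand-made frame (the column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Invented toy data in the spirit of the Adult dataset
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52],
    'hours': [40, 13, 40, 40, 40, 40, 16, 45],
    'workclass': ['State-gov', 'Self-emp', 'Private', 'Private',
                  'Private', 'Private', 'Private', 'Self-emp'],
})
toy_target = [0, 0, 0, 1, 0, 1, 0, 1]

preprocess = ColumnTransformer([
    # Scale numeric columns to [0, 1]
    ('num', MinMaxScaler(), ['age', 'hours']),
    # One-hot encode categories; categories unseen during fit become all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
])

clf = Pipeline([('preprocess', preprocess),
                ('reg', LogisticRegression())])
clf.fit(toy, toy_target)

# Two scaled numeric columns plus one column per workclass category
encoded = preprocess.fit_transform(toy)
print(encoded.shape)
print(clf.predict(toy.head(2)))
```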

12. Compute the accuracy and f1_score metrics for a prediction of only the most frequent class in the target variable.

In [169]:
from sklearn.metrics import accuracy_score, f1_score

In [170]:
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)

0.8491145460128979

In [171]:
f1_score(y_test, y_pred)

0.6488804192472606
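
The two cells in this task score the fitted logistic-regression pipeline. The baseline described in the task statement, always predicting the most frequent class, can be sketched as follows. The labels are invented, and the helper names are hypothetical; in scikit-learn the same baseline is available as `DummyClassifier(strategy='most_frequent')`.

```python
from collections import Counter


def majority_baseline(y_train, y_test):
    """Predict the most frequent training label for every test sample."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * len(y_test)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_positive(y_true, y_pred, positive=1):
    """F1 for the positive class; taken as 0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Invented labels: 0 is the majority class, as '<=50K' is in the Adult data
y_train_demo = [0, 0, 0, 1, 0, 1, 0, 0]
y_test_demo = [0, 1, 0, 0]
y_base = majority_baseline(y_train_demo, y_test_demo)
print(accuracy(y_test_demo, y_base), f1_positive(y_test_demo, y_base))
```

When the majority class is 0, the baseline never predicts class 1, so its F1 for class 1 is 0 even though its accuracy equals the share of the majority class.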

13. Compute cross_val_score for the LogisticRegression, SVC and LinearSVC algorithms using the accuracy and f1_score metrics.
State whether the previous result was surpassed.

In [172]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

In [173]:
def cross_val_scoring(reg, scoring):
    scores = cross_val_score(reg, X, y, cv=5, scoring = scoring)
    print('%.7f' % scores.mean())

In [174]:
cross_val_scoring(LogisticRegression(), 'accuracy')

0.7972237


In [175]:
cross_val_scoring(LogisticRegression(), 'f1')

0.3960576


In [177]:
cross_val_scoring(SVC(), 'accuracy')

0.7980631


In [178]:
cross_val_scoring(SVC(), 'f1')

0.2780992


In [179]:
cross_val_scoring(LinearSVC(), 'accuracy')

0.7855946


In [180]:
cross_val_scoring(LinearSVC(), 'f1')

0.2214728
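
`cross_val_scoring` passes the raw `X` to each estimator, whereas the MinMaxScaler + LogisticRegression pipeline fitted earlier scaled the features first. This may be part of why the cross-validated scores fall below that pipeline's 0.8491 accuracy and 0.6489 F1. A hedged sketch of cross-validating with the scaler inside the pipeline, using synthetic data from `make_classification` (so the printed numbers are unrelated to the Adult results):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     weights=[0.75, 0.25], random_state=42)
# Blow up one column to mimic features on very different scales
X_demo[:, 0] *= 1e5

model = make_pipeline(MinMaxScaler(), LogisticRegression())

# The scaler is refit on the training part of every fold, so no test-fold
# statistics leak into preprocessing
for scoring in ('accuracy', 'f1'):
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring=scoring)
    print(scoring, '%.7f' % scores.mean())
```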


14. Note that the data contain '?' values; replace them with the most frequent values (use SimpleImputer).

In [222]:
df = data[categorical]
df = df.drop(columns=14)
# Treat '?' as the missing-value marker and fill it with the most frequent value of each column
imp = SimpleImputer(missing_values='?', strategy="most_frequent")
trans = imp.fit_transform(df)
df_trans = pd.DataFrame(trans, columns=df.columns)

In [223]:
df_trans

Unnamed: 0,1,3,5,6,7,8,9,13
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba
...,...,...,...,...,...,...,...,...
48837,Private,Bachelors,Divorced,Prof-specialty,Not-in-family,White,Female,United-States
48838,Private,HS-grad,Widowed,Prof-specialty,Other-relative,Black,Male,United-States
48839,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
48840,Private,Bachelors,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,United-States
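
What `strategy="most_frequent"` with `missing_values='?'` does to a single column can be written out by hand. A stdlib sketch on an invented column (the helper name is hypothetical, and tie-breaking between equally frequent values may differ from SimpleImputer):

```python
from collections import Counter


def fill_with_mode(values, missing='?'):
    """Replace `missing` entries with the most frequent non-missing value."""
    observed = [v for v in values if v != missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]


# Invented column with two '?' placeholders
workclass = ['Private', '?', 'State-gov', 'Private', '?', 'Local-gov']
filled = fill_with_mode(workclass)
print(filled)
```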


In [224]:
new_df = pd.concat([df_trans, data[numerical]], axis=1)

In [225]:
new_df

Unnamed: 0,1,3,5,6,7,8,9,13,0,2,4,10,11,12
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,39,77516,13,2174,0,40
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,50,83311,13,0,0,13
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,38,215646,9,0,0,40
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,53,234721,7,0,0,40
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,28,338409,13,0,0,40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,Private,Bachelors,Divorced,Prof-specialty,Not-in-family,White,Female,United-States,39,215419,13,0,0,36
48838,Private,HS-grad,Widowed,Prof-specialty,Other-relative,Black,Male,United-States,64,321403,9,0,0,40
48839,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,38,374983,13,0,0,50
48840,Private,Bachelors,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,United-States,44,83891,13,5455,0,40


In [226]:
new_df.dtypes

1     object
3     object
5     object
6     object
7     object
8     object
9     object
13    object
0      int64
2      int64
4      int64
10     int64
11     int64
12     int64
dtype: object

15. Compute cross_val_score on the new data. State whether the result improved.

In [227]:
X = pd.get_dummies(new_df)
X

Unnamed: 0,0,2,4,10,11,12,1_Federal-gov,1_Local-gov,1_Never-worked,1_Private,...,13_Portugal,13_Puerto-Rico,13_Scotland,13_South,13_Taiwan,13_Thailand,13_Trinadad&Tobago,13_United-States,13_Vietnam,13_Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,215419,13,0,0,36,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
48838,64,321403,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
48839,38,374983,13,0,0,50,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
48840,44,83891,13,5455,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [228]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    random_state=RANDOM_STATE)

In [229]:
cross_val_scoring(LogisticRegression(), 'accuracy')

0.7981450


In [230]:
cross_val_scoring(LogisticRegression(), 'f1')

0.3817505


In [231]:
cross_val_scoring(SVC(), 'accuracy')

0.7980631


In [232]:
cross_val_scoring(SVC(), 'f1')

0.2780992


In [233]:
cross_val_scoring(LinearSVC(), 'accuracy')

0.5839154


In [234]:
cross_val_scoring(LinearSVC(), 'f1')

0.2303798


16. Compute cross_val_score if the rows with '?' values are simply removed. Describe how the result changed.

In [240]:
df = data.replace('?', np.nan).dropna()
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48836,33,Private,245211,Bachelors,13,Never-married,Prof-specialty,Own-child,White,Male,0,0,40,United-States,<=50K
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [241]:
features = df.drop(columns=14)
target = df[14]
target = target.map({'<=50K': 0, '>50K': 1})

In [243]:
X = pd.get_dummies(features)
y = target

In [244]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    random_state=RANDOM_STATE)

In [245]:
cross_val_scoring(LogisticRegression(), 'accuracy')

0.7913626


In [246]:
cross_val_scoring(LogisticRegression(), 'f1')

0.3866514


In [247]:
cross_val_scoring(SVC(), 'accuracy')

0.7906550


In [248]:
cross_val_scoring(SVC(), 'f1')

0.2762619


In [249]:
cross_val_scoring(LinearSVC(), 'accuracy')

0.7776523


In [250]:
cross_val_scoring(LinearSVC(), 'f1')

0.3462704


17. Compute cross_val_score for RandomForestClassifier and GradientBoostingClassifier. Describe how the result changed and what conclusion can be drawn from this.

In [252]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [253]:
cross_val_scoring(RandomForestClassifier(), 'accuracy')

0.8485473


In [254]:
cross_val_scoring(RandomForestClassifier(), 'f1')

0.6702503


In [255]:
cross_val_scoring(GradientBoostingClassifier(), 'accuracy')

0.8629208


In [256]:
cross_val_scoring(GradientBoostingClassifier(), 'f1')

0.6869949


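RandomForestClassifier and GradientBoostingClassifier show the best results: accuracy 0.8485 and 0.8629, F1 0.6703 and 0.6870, compared with at most 0.7914 accuracy and 0.3867 F1 for LogisticRegression, SVC and LinearSVC on the same data.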

18. Select the best model by tuning the column-processing methods: feature scaling, feature encoding and missing-value imputation. Leave the algorithm parameters at their defaults. Print the final parameters and the resulting accuracy and f1_score.
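
The notebook gives no solution for this task. One possible approach, sketched here under assumptions, is to search over the preprocessing steps of a ColumnTransformer while the classifier keeps its defaults. The data below are randomly generated stand-ins with invented column names, the grid is deliberately small, and only F1 is used for selection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)

# Invented data: one numeric and one categorical column with '?' as missing
rng = np.random.RandomState(42)
n = 200
demo = pd.DataFrame({
    'age': rng.randint(18, 70, size=n),
    'workclass': rng.choice(['Private', 'State-gov', 'Self-emp', '?'], size=n),
})
demo_target = (demo['age'] + rng.normal(0, 10, size=n) > 45).astype(int)

cat_pipe = Pipeline([
    ('impute', SimpleImputer(missing_values='?', strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
pre = ColumnTransformer([('num', MinMaxScaler(), ['age']),
                         ('cat', cat_pipe, ['workclass'])])
pipe = Pipeline([('pre', pre), ('clf', LogisticRegression())])

# Only preprocessing choices are searched; the classifier keeps its defaults
search_space = {
    'pre__num': [MinMaxScaler(), StandardScaler()],
    'pre__cat__impute__strategy': ['most_frequent', 'constant'],
    'pre__cat__encode': [OneHotEncoder(handle_unknown='ignore'),
                         OrdinalEncoder()],
}
search = GridSearchCV(pipe, search_space, cv=5, scoring='f1')
search.fit(demo, demo_target)

print(search.best_params_)
print('%.7f' % search.best_score_)
```

To report accuracy as well, the same search could be repeated with `scoring='accuracy'`, or the best pipeline could be scored on a held-out test set with `accuracy_score` and `f1_score`.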