## Assignment 08
This assignment comes without detailed instructions. Your task is to build a classifier with the best possible quality (AUC-ROC).

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
import xgboost as xgb



We will work with data on US flights. The task is to predict a departure delay of more than 15 minutes (a binary classification problem).

Features:

* Month, DayofMonth, DayOfWeek: month, day of month, and day of week
* DepTime: departure time
* UniqueCarrier: carrier code
* Origin: departure airport
* Dest: destination airport
* Distance: distance between the departure and arrival airports
* dep_delayed_15min: departure delayed by 15 minutes or more (target)


In [2]:
df = pd.read_csv('flight_delays.csv')

In [3]:
df.tail()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
99995,c-5,c-4,c-3,1618,OO,SFO,RDD,199,N
99996,c-1,c-18,c-3,804,CO,EWR,DAB,884,N
99997,c-1,c-24,c-2,1901,NW,DTW,IAH,1076,N
99998,c-4,c-27,c-4,1515,MQ,DFW,GGG,140,N
99999,c-11,c-17,c-4,1800,WN,SEA,SMF,605,N


As the simplest benchmark, we take logistic regression with the two features that are easiest to use: `DepTime` and `Distance`:

In [4]:
X, y = df[['Distance', 'DepTime']].values, df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=17)

In [5]:
lr = LogisticRegression( solver='lbfgs',random_state=17)

lr.fit(X_train, y_train)
y_pred = lr.predict_proba(X_test)[:, 1]

print('AUC-ROC:',roc_auc_score(y_test, y_pred))

AUC-ROC: 0.6795697123357751


####  Grading criteria:

#### 1. Data preprocessing - 3 points

Preprocessing includes one-hot encoding of categorical features, filling in missing values (if any), and generating new features (for example, instead of Origin and Dest you can introduce a route feature: Origin-Dest).
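
The preprocessing steps listed above can be sketched on a few toy rows. This is only an illustration: the rows are made up to mimic the layout of the flight data, and the use of `str.extract` to strip the `c-` prefix is one possible choice, not necessarily the one used in the solution.

```python
import pandas as pd

# Toy rows in the same layout as the flight-delay data (values are illustrative).
toy = pd.DataFrame({
    "Month": ["c-8", "c-4", "c-8"],
    "DayOfWeek": ["c-7", "c-3", "c-1"],
    "UniqueCarrier": ["AA", "US", "AA"],
    "Origin": ["ATL", "PIT", "ATL"],
    "Dest": ["DFW", "MCO", "DFW"],
    "Distance": [732, 834, 732],
})

# Strip the "c-" prefix and convert the remaining digits to integers.
for col in ["Month", "DayOfWeek"]:
    toy[col] = toy[col].str.extract(r"(\d+)", expand=False).astype(int)

# New feature: the route as an "Origin-Dest" string.
toy["Origin-Dest"] = toy["Origin"] + "-" + toy["Dest"]

# One-hot encode the low-cardinality categorical columns.
encoded = pd.get_dummies(
    toy.drop(columns=["Origin", "Dest"]),
    columns=["UniqueCarrier", "Month", "DayOfWeek"],
)
print(encoded.columns.tolist())
```

The route column is left as a string here. It has many distinct values in the real data, so it would need its own encoding before being passed to a model.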

#### 2. Tuning the training parameters - (4 points)
Tuning means choosing the boosting hyperparameters (you are expected to use XGBoost) by cross-validation: tree depth, learning rate, number of trees, and so on.
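
A minimal sketch of such a cross-validated search, scored by AUC-ROC, might look as follows. To stay self-contained it makes two assumptions: it uses synthetic data from `make_classification`, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost. The grid values are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the flight data).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=17)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=17)

# Placeholder grid over depth, learning rate, and number of trees.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.2],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=17),  # stand-in for XGBClassifier
    param_grid,
    scoring="roc_auc",  # rank candidates by AUC-ROC rather than accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=17),
)
# Search on the training split only, so the test split stays untouched.
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```

Searching all parameters jointly costs more fits than tuning them one at a time, but it does not assume that the best value of one parameter is independent of the others.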

#### 3. Stacking, feature selection, and other techniques - (3 points)

For example, you can implement the simplest stacking scheme, blending:
* train a logistic regression and a gradient boosting model;
* build a linear mixture of the logistic regression and gradient boosting predictions of the form
$$p = w_1 \cdot p_{lr} + (1 - w_1) \cdot p_{xgb},$$
where $p_{lr}$ is the class 1 probability predicted by logistic regression and $p_{xgb}$ is the class 1 probability predicted by boosting.
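
The blending formula can be sketched as below. This is an illustrative setup rather than the graded solution. It assumes synthetic data, uses scikit-learn's `GradientBoostingClassifier` in place of XGBoost, and chooses $w_1$ on a separate validation split; the helper name `blend` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data split into fit / validation / test parts.
X, y = make_classification(n_samples=1500, n_features=10, random_state=17)
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=17)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=17)

lr = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
gb = GradientBoostingClassifier(random_state=17).fit(X_fit, y_fit)  # stand-in for XGBoost


def blend(w1, p_lr, p_gb):
    """Linear mixture p = w1 * p_lr + (1 - w1) * p_gb of class 1 probabilities."""
    return w1 * p_lr + (1 - w1) * p_gb


# Choose w1 on the validation split, which neither model was trained on.
p_lr_val = lr.predict_proba(X_val)[:, 1]
p_gb_val = gb.predict_proba(X_val)[:, 1]
weights = np.linspace(0, 1, 101)
val_auc = [roc_auc_score(y_val, blend(w, p_lr_val, p_gb_val)) for w in weights]
best_w1 = float(weights[int(np.argmax(val_auc))])

# Report the blended AUC-ROC on the untouched test split.
test_auc = roc_auc_score(
    y_test,
    blend(best_w1, lr.predict_proba(X_test)[:, 1], gb.predict_proba(X_test)[:, 1]),
)
print(best_w1, round(test_auc, 4))
```

Choosing the weight on held-out data matters because a boosted model usually looks better on its own training rows than it will on new data, which pushes the selected weight toward it.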

In [2]:
! pip install XGboost

Collecting XGboost
  Downloading https://files.pythonhosted.org/packages/b1/11/cba4be5a737c6431323b89b5ade818b3bbe1df6e8261c6c70221a767c5d9/xgboost-1.0.2-py3-none-win_amd64.whl (24.6MB)
Installing collected packages: XGboost
Successfully installed XGboost-1.0.2


In [6]:
df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [7]:
df.isnull().sum().sum()

0

In [10]:
df['dep_delayed_15min'] = (df['dep_delayed_15min'] == 'Y').astype('int')

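# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
df["Month"] = df["Month"].str.replace('[^0-9]+', '', regex=True).astype('int')
df["DayofMonth"] = df["DayofMonth"].str.replace('[^0-9]+', '', regex=True).astype('int')
df["DayOfWeek"] = df["DayOfWeek"].str.replace('[^0-9]+', '', regex=True).astype('int')

# Route feature: concatenating two Series already yields a Series
df["Origin-Dest"] = df["Origin"] + "-" + df["Dest"]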

In [12]:
from sklearn.preprocessing import LabelEncoder

enc_d = LabelEncoder()
df["Origin-Dest"] = enc_d.fit_transform(df['Origin-Dest'])

In [13]:
df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min,Origin-Dest
0,8,21,7,1934,AA,ATL,DFW,732,0,152
1,4,20,3,1548,US,PIT,MCO,834,0,3527
2,9,2,5,1422,XE,RDU,CLE,416,0,3619
3,11,25,6,1015,OO,DEN,MEM,872,0,1181
4,10,7,6,1828,WN,MDW,OMA,423,1,2681


In [15]:
data = df.drop(["Origin","Dest"],axis=1)
data = pd.get_dummies(data,columns=["UniqueCarrier", "Month", "DayOfWeek"])

In [17]:
data.shape

(100000, 46)

In [18]:
data.head()

Unnamed: 0,DayofMonth,DepTime,Distance,dep_delayed_15min,Origin-Dest,UniqueCarrier_AA,UniqueCarrier_AQ,UniqueCarrier_AS,UniqueCarrier_B6,UniqueCarrier_CO,...,Month_10,Month_11,Month_12,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,DayOfWeek_7
0,21,1934,732,0,152,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,20,1548,834,0,3527,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,2,1422,416,0,3619,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,25,1015,872,0,1181,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,7,1828,423,1,2681,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0


First, let's build a classifier with default parameters

In [190]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(['dep_delayed_15min'], axis=1), 
                                                    data['dep_delayed_15min'], test_size=0.3,
                                                    stratify=data['dep_delayed_15min'], random_state = 50)

In [191]:
clf = XGBClassifier()

clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
       importance_type='gain', interaction_constraints=None,
       learning_rate=0.300000012, max_delta_step=0, max_depth=6,
       min_child_weight=1, missing=nan, monotone_constraints=None,
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
       validate_parameters=False, verbosity=None)

In [192]:
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

0.7432902932867145


Now let's try to find the optimal training parameters

In [193]:
from sklearn.model_selection import GridSearchCV

In [194]:
clf = XGBClassifier()

In [195]:
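# Note: no `scoring` is passed, so GridSearchCV ranks candidates by the estimator's
# default score (accuracy), not AUC-ROC; scoring='roc_auc' would tune the target metric.
clf_grid = GridSearchCV(clf,
                   {'max_depth': [3,4,5,6]}, verbose=1)
# Note: the search is fit on the full data set, including the rows later used as X_test.
clf_grid.fit(data.drop('dep_delayed_15min', axis=1), data['dep_delayed_15min'])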
print(clf_grid.best_score_)
print(clf_grid.best_params_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  2.1min finished


0.8209
{'max_depth': 5}


In [31]:
clf2 = XGBClassifier(max_depth = 5)

In [196]:
clf_grid2 = GridSearchCV(clf2,
                   {'n_estimators': [70, 90, 110, 130, 150]}, verbose=1)
clf_grid2.fit(data.drop('dep_delayed_15min', axis=1), data['dep_delayed_15min'])
print(clf_grid2.best_score_)
print(clf_grid2.best_params_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  3.1min finished


0.82141
{'n_estimators': 150}


In [197]:
clf3 = XGBClassifier(max_depth = 5, n_estimators = 150)

In [198]:
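# Note: this search uses clf2 (default n_estimators), not clf3 defined above;
# the recorded output below was produced with clf2.
clf_grid3 = GridSearchCV(clf2,
                   {'learning_rate': [0.05,0.1,0.15,0.2,0.4]}, verbose=1)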
clf_grid3.fit(data.drop('dep_delayed_15min', axis=1), data['dep_delayed_15min'])
print(clf_grid3.best_score_)
print(clf_grid3.best_params_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  2.8min finished


0.82084
{'learning_rate': 0.2}


In [244]:
new_params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'n_estimators' : 150,
    'learning_rate': 0.2,
    'silent': True,  # deprecated in XGBoost in favor of 'verbosity'
    'colsample_bytree': 0.8, 
    'subsample': 0.9
}

In [245]:
clf_final = XGBClassifier(**new_params)

clf_final.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
       importance_type='gain', interaction_constraints=None,
       learning_rate=0.2, max_delta_step=0, max_depth=5,
       min_child_weight=1, missing=nan, monotone_constraints=None,
       n_estimators=150, n_jobs=0, num_parallel_tree=1,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, silent=True, subsample=0.9,
       tree_method=None, validate_parameters=False, verbosity=None)

In [246]:
print(roc_auc_score(y_test, clf_final.predict_proba(X_test)[:, 1]))

0.7442224408879201


Let's use blending

In [247]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
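# Note: predict returns hard 0/1 labels, while the blending formula is stated for
# class 1 probabilities, which would be lr.predict_proba(...)[:, 1].
lr_ypred = lr.predict(X_test_scaled)
lr_ypred_train = lr.predict(X_train_scaled)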



In [260]:
def select_weights(y_true, y_pred_1, y_pred_2):
    grid = np.linspace(0, 1, 1000)
    metric = []
    for w_0 in grid:
        w_1 = 1 - w_0
        y_a = w_0 * y_pred_1 + w_1 * y_pred_2
        metric.append([roc_auc_score(y_true, y_a), w_0, w_1])
    return metric

In [261]:
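# Note: the weights are selected on the training set, where the boosted model's fit is
# optimistic; a held-out validation split would give a less biased choice.
auc, w_0, w_1 = max(select_weights(y_train, clf_final.predict_proba(X_train)[:, 1], lr_ypred_train), key=lambda x: x[0])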

In [262]:
print(w_0,w_1)

0.9819819819819819 0.018018018018018056


In [263]:
print(roc_auc_score(y_test, clf_final.predict_proba(X_test)[:, 1] * w_0 +  lr_ypred * w_1))

0.7916662939983748


### In the end, the maximum AUC-ROC was 0.7917, obtained after blending