# Methods for Building Machine Learning Models
Assignment. For the given dataset (according to your variant), build classification or regression models (depending on the task the dataset addresses). Use methods 1 and 2 (according to your group's variant) to build the models. Evaluate the quality of the models with suitable quality metrics (at least two). Which metrics did you use, and why? What conclusions can you draw about the quality of the models? Building the models requires the necessary data preprocessing: filling in missing values, encoding categorical features, and so on.

The dataset contains records of hotel bookings. We will solve a classification task: was a booking canceled or not?

In [30]:
import opendatasets as od
od.download('https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand')

Skipping, found downloaded files in "./hotel-booking-demand" (use force=True to force download)


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
# suppress warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')


## Data preprocessing

In [2]:
hotels = pd.read_csv('hotel-booking-demand/hotel_bookings.csv')
hotels.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [3]:
hotels.shape

(119390, 32)

In [4]:
hotels.isnull().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

First we deal with missing values. We drop the agent and company columns, which contain a large number of missing values, and then drop the rows with missing values in the country and children columns. To reduce processing time, we keep only the first 200 rows of the dataset.

In [5]:
hotels = hotels.drop(['agent', 'company'], axis='columns')
hotels = hotels.dropna(axis=0, how='any')
# keep the first 200 rows by position (.loc[:200] slices by label and includes label 200)
hotels = hotels.iloc[:200]
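
The difference between label-based and positional slicing matters once rows have been dropped. The sketch below uses a small hypothetical frame (the name `toy` and its values are made up for illustration) to show that `.loc[:k]` includes the label `k` and depends on which labels survive `dropna`, whereas `.iloc[:k]` always returns the first `k` rows.

```python
import pandas as pd

# Hypothetical toy frame: index labels 0..5, missing values at labels 1 and 2
toy = pd.DataFrame({'value': [10.0, None, None, 13.0, 14.0, 15.0]})
toy = toy.dropna()              # labels 1 and 2 are removed; 0, 3, 4, 5 remain

by_label = toy.loc[:3]          # label slice, end label inclusive
by_position = toy.iloc[:3]      # positional slice, end exclusive

print(list(by_label.index))     # labels up to and including 3
print(list(by_position.index))  # the first three remaining rows
```

With the label slice the number of rows returned depends on how many labels were dropped, so a positional slice is the safer way to ask for "the first N rows".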

Now we drop the columns that are not needed for the model, either because most of their values are identical or because they carry information we do not want to use. For example, the arrival year is not needed because the data are not treated as a time series, and reservation_status records the final outcome of the booking, so keeping it would leak the target.

In [6]:
hotels = hotels.drop(['arrival_date_year', 'arrival_date_month', 'previous_cancellations',
                      'previous_bookings_not_canceled', 'days_in_waiting_list', 'reservation_status',
                      'reservation_status_date', 'assigned_room_type'], axis='columns')
# target column and the list of feature columns (every column except the target)
l_targ = ['is_canceled']
l_feat = list(hotels)
l_feat.remove('is_canceled')

In [7]:
hotels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 0 to 200
Data columns (total 22 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   hotel                        200 non-null    object 
 1   is_canceled                  200 non-null    int64  
 2   lead_time                    200 non-null    int64  
 3   arrival_date_week_number     200 non-null    int64  
 4   arrival_date_day_of_month    200 non-null    int64  
 5   stays_in_weekend_nights      200 non-null    int64  
 6   stays_in_week_nights         200 non-null    int64  
 7   adults                       200 non-null    int64  
 8   children                     200 non-null    float64
 9   babies                       200 non-null    int64  
 10  meal                         200 non-null    object 
 11  country                      200 non-null    object 
 12  market_segment               200 non-null    object 
 13  distribution_channel     

Now we encode the categorical features, i.e. the columns of dtype object, and then scale the data.

In [8]:
# columns of dtype object that have to be converted to numeric codes
cat_cols = ['hotel', 'meal', 'country', 'market_segment', 'distribution_channel',
            'reserved_room_type', 'deposit_type', 'customer_type']
oe = OrdinalEncoder()
hotels[cat_cols] = oe.fit_transform(hotels[cat_cols])

In [9]:
sc = MinMaxScaler()
hotels[l_feat] = sc.fit_transform(hotels[l_feat])
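
In the cells above, the encoder and the scaler are fitted on the whole table before the train/test split, so the test rows influence the preprocessing. One possible alternative, shown here only as a sketch on a made-up frame (the `toy` name, its values and the choice of `OneHotEncoder` instead of `OrdinalEncoder` are assumptions, not part of this notebook), is to wrap preprocessing and model in a scikit-learn `Pipeline` so that everything is fitted on the training part only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical miniature of the bookings table (all values are made up)
toy = pd.DataFrame({
    'meal': ['BB', 'HB', 'BB', 'FB', 'BB', 'HB', 'BB', 'FB'] * 5,
    'lead_time': [7, 120, 14, 300, 2, 90, 30, 250] * 5,
    'is_canceled': [0, 1, 0, 1, 0, 0, 0, 1] * 5,
})
X, y = toy[['meal', 'lead_time']], toy['is_canceled']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1)

preprocess = ColumnTransformer([
    # one-hot encoding avoids imposing an artificial order on the categories
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['meal']),
    ('num', MinMaxScaler(), ['lead_time']),
])
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])

# fit() learns the categories and the scaling range from the training rows only
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(pred.shape)
```

The same pipeline object can then be passed to cross-validation utilities, which refit the preprocessing inside every fold.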

In [10]:
hotels.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,...,market_segment,distribution_channel,is_repeated_guest,reserved_room_type,booking_changes,deposit_type,customer_type,adr,required_car_parking_spaces,total_of_special_requests
0,0.0,0,0.464043,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,...,0.5,0.5,0.0,0.166667,0.6,0.0,0.5,0.0,0.0,0.0
1,0.0,0,1.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,...,0.5,0.5,0.0,0.166667,0.8,0.0,0.5,0.0,0.0,0.0
2,0.0,0,0.009498,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,...,0.5,0.5,0.0,0.0,0.0,0.0,0.5,0.333333,0.0,0.0
3,0.0,0,0.017639,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,...,0.25,0.0,0.0,0.0,0.0,0.0,0.5,0.333333,0.0,0.0
4,0.0,0,0.018996,0.0,0.0,0.0,0.133333,0.333333,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.5,0.435556,0.0,0.333333


In [11]:
hotels['is_canceled'].value_counts()

is_canceled
0    160
1     40
Name: count, dtype: int64

The classes are imbalanced: 160 bookings (80%) were not canceled and 40 (20%) were canceled.
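
To make the imbalance concrete, the following sketch rebuilds a label vector from the counts printed above (the series `y` is a stand-in, not the notebook's own variable) and computes the class shares together with the accuracy of a trivial model that always predicts the majority class.

```python
import pandas as pd

# Stand-in labels built from the value_counts() output above: 160 zeros, 40 ones
y = pd.Series([0] * 160 + [1] * 40, name='is_canceled')

shares = y.value_counts(normalize=True)  # fraction of each class
majority_baseline = shares.max()         # accuracy of always predicting class 0

print(shares)
print('majority-class accuracy:', majority_baseline)
```

Any model whose plain accuracy is close to this baseline has not necessarily learned anything about cancellations, which is why imbalance-aware metrics are needed.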

## Splitting the data into training and test sets

The target feature is is_canceled.

In [12]:

x_train, x_test, y_train, y_test = train_test_split(hotels[l_feat], hotels[l_targ], random_state=1)
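
With imbalanced classes, a plain random split can leave the test set with a class ratio different from that of the full data. As a hedged sketch, not the split used in this notebook, the `stratify` argument of `train_test_split` keeps the ratio the same in both parts. The arrays below are dummy stand-ins with the same 160/40 class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in for the 200 rows: 160 negatives and 40 positives
y = np.array([0] * 160 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)   # a single placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1)

# the default test_size is 0.25, so the test part has 50 rows
print(len(y_test), int(y_test.sum()))
```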

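## Training and evaluating the models

We train a logistic regression model and a gradient boosting model. To account for the class imbalance when evaluating them, we use the balanced_accuracy_score and f1_score metrics. Balanced accuracy is the mean of the per-class recalls, so a model that always predicts the majority class scores 0.5. The F1 score is computed per class and averaged with weights proportional to class size (average='weighted'). The best possible value of both metrics is 1.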

In [13]:
def print_metrics(test: np.ndarray, pred: np.ndarray):
    """Print balanced accuracy and weighted F1 for true labels `test` and predictions `pred`."""
    print('balanced_accuracy: ', balanced_accuracy_score(test, pred))
    print('f1: ', f1_score(test, pred, average='weighted'))

In [18]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
pred = lr.predict(x_test)
print_metrics(y_test, pred)

balanced_accuracy:  0.4875
f1:  0.7011235955056181


In [19]:
gbc = GradientBoostingClassifier(n_estimators=5)
gbc.fit(x_train, y_train)
gbc.score(x_test, y_test)

0.8

In [20]:
pred = gbc.predict(x_test)
print_metrics(y_test, pred)

balanced_accuracy:  0.5
f1:  0.7111111111111111
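
The gradient boosting scores above coincide with those of a constant prediction under one plausible assumption. As a sketch, assuming the 50-row test set contains 40 non-canceled and 10 canceled bookings (consistent with the accuracy of 0.8, but not printed anywhere in the notebook), the code below computes both metrics by hand for a classifier that always predicts class 0. The helper names are made up for this illustration.

```python
def per_class_stats(y_true, y_pred, label):
    """Return (precision, recall, support) for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, tp + fn

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    labels = sorted(set(y_true))
    return sum(per_class_stats(y_true, y_pred, c)[1] for c in labels) / len(labels)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to the class supports."""
    total = 0.0
    for c in sorted(set(y_true)):
        p, r, support = per_class_stats(y_true, y_pred, c)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        total += f1 * support
    return total / len(y_true)

# Assumed test set: 40 non-canceled, 10 canceled; the model always answers 0
y_true = [0] * 40 + [1] * 10
y_pred = [0] * 50

print('balanced_accuracy:', balanced_accuracy(y_true, y_pred))
print('f1 (weighted):', weighted_f1(y_true, y_pred))
```

Under this assumption the hand computation gives a balanced accuracy of 0.5 and a weighted F1 of about 0.711, the same values as reported for gradient boosting.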


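## Conclusion

Gradient boosting scored slightly higher than logistic regression on both metrics (balanced accuracy 0.5 vs. 0.4875, weighted F1 0.711 vs. 0.701), but the difference is marginal. A balanced accuracy of 0.5 is what a classifier that always predicts the majority class would obtain, so neither model separates canceled from non-canceled bookings better than that baseline.

Both models are therefore of low quality. Training on more data (only the first 200 rows were used here) or revising the set of features kept in the dataset may improve them.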
