# 0. Predict customer's choice

Objective: predict whether a passenger will book a ticket, based on air-ticket search and booking data.

Note that before preprocessing the data, I went through the fields and kept only those that are more or less relevant to the task. That selection process is not included in this report.

# 1. Loading dataset

In [1]:
import pandas as pd

# Parse the two timestamp fields as datetimes and load the text fields as categories.
date_columns = ['search_datetime', 'flight_date']
dtype_dic = {'search_type': 'category', 'routes': 'category', 'provider': 'category',
             'origin_country': 'category', 'destination_country': 'category',
             'cabin_class': 'category'}
aviata = pd.read_csv('aviata.csv', parse_dates=date_columns, dtype=dtype_dic)



# 2. Preprocessing

As it turned out later, two more fields ('Unnamed: 0' and 'offers_count') are not needed, so I removed them. I also filtered out the rows that contain the value '\N' in original_amount, is_owc or is_direct. This step is needed so that these fields, which previously had a mixed type, can be cast to float and int dtypes.

In [2]:
aviata.drop(['Unnamed: 0', 'offers_count'], axis=1, inplace=True)
aviata_cln = aviata[(aviata.original_amount != '\\N') & (aviata.is_owc != '\\N') & (aviata.is_direct != '\\N')]
aviata_cln = aviata_cln.astype({'original_amount': 'float32', 'is_owc': 'int8', 'is_direct':'int8'})
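
An alternative worth considering is to let `read_csv` treat the marker as missing at load time, so the columns never acquire a mixed type. The following is only a minimal sketch on an inline toy CSV: the sample rows are invented for illustration, and it assumes the marker in the file is the literal two-character string `\N`.

```python
import io

import pandas as pd

# Toy CSV standing in for aviata.csv; the rows are invented for illustration.
raw = io.StringIO(
    "original_amount,is_owc,is_direct\n"
    "100.5,1,0\n"
    "\\N,0,1\n"
    "250.0,\\N,1\n"
    "80.0,0,0\n"
)

# na_values makes pandas parse the literal \N marker as NaN.
toy = pd.read_csv(raw, na_values=["\\N"])

# Drop rows with a missing value in any of the three fields, then downcast.
cols = ["original_amount", "is_owc", "is_direct"]
toy_cln = toy.dropna(subset=cols).astype(
    {"original_amount": "float32", "is_owc": "int8", "is_direct": "int8"}
)

print(toy_cln.dtypes)
print(len(toy_cln))
```

With this approach the integer cast still has to wait until the missing rows are dropped, because a column that contains NaN cannot be stored as `int8`.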


We have departure dates and flight search dates. To make them usable by the model, we extract numeric features from them: the year and the month.

In [16]:
aviata_cln["search_year"] = aviata_cln["search_datetime"].dt.year  
aviata_cln["search_month"] = aviata_cln["search_datetime"].dt.month

aviata_cln["flight_year"] = aviata_cln["flight_date"].dt.year  
aviata_cln["flight_month"] = aviata_cln["flight_date"].dt.month

The same applies to the fields of type category: they also need a numeric representation, so we replace each one with its integer category code.

In [23]:
aviata_cln["search_type_code"] = aviata_cln["search_type"].cat.codes
aviata_cln["routes_code"] = aviata_cln["routes"].cat.codes
aviata_cln["provider_code"] = aviata_cln["provider"].cat.codes
aviata_cln["origin_country_code"] = aviata_cln["origin_country"].cat.codes
aviata_cln["destination_country_code"] = aviata_cln["destination_country"].cat.codes
aviata_cln["cabin_class_code"] = aviata_cln["cabin_class"].cat.codes
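
To make the mapping concrete, here is a small sketch of what `.cat.codes` returns. The toy values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column; the values are invented for illustration.
s = pd.Series(["economy", "business", "economy", None], dtype="category")

# Inferred categories are sorted, so 'business' gets code 0 and 'economy' code 1.
print(dict(enumerate(s.cat.categories)))

# Each row is replaced by the position of its category; missing values become -1.
print(s.cat.codes.tolist())
```

The codes are arbitrary integer labels, so their order carries no meaning. It is also worth checking for -1 codes, which would indicate missing values in the original field.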


Now we keep only the numeric fields that the model can consume.

In [30]:
aviata_read = aviata_cln[['pass_adt', 'pass_child', 'pass_inf', 'pass_stud', 'original_amount','is_owc','is_direct', 'is_booked', 
                          'search_year', 'search_month', 'flight_year', 'flight_month', 'search_type_code', 'routes_code','provider_code', 'origin_country_code', 'destination_country_code', 'cabin_class_code']]

Next we check how balanced the data is. Rows with is_booked = 0 far outnumber rows with is_booked = 1 (roughly 18 to 1), so a model trained on this data could be biased toward predicting the majority class.

In [40]:
aviata_read.is_booked.value_counts()

0    2394275
1     135204
Name: is_booked, dtype: int64

Balancing by randomly undersampling the majority class:

In [44]:
import numpy as np

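# Keep every booked row and collect the index labels of the non-booked rows.
bkn = aviata_read[aviata_read.is_booked == 1]
nbkn = aviata_read[aviata_read.is_booked == 0].index

# Randomly pick as many non-booked rows as there are booked rows (no seed is set, so the draw differs between runs).
indx = np.random.choice(nbkn, bkn.shape[0], replace=False)
nbkn_sample = aviata_read.loc[indx]

# Combine both classes and shuffle the rows.
aviata_balanced = pd.concat([bkn, nbkn_sample], axis=0)
aviata_balanced = aviata_balanced.sample(frac=1).reset_index(drop=True)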

In [45]:
aviata_balanced.is_booked.value_counts()

1    135204
0    135204
Name: is_booked, dtype: int64
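
The same undersampling idea can be written more compactly and reproducibly with `groupby(...).sample`. This is only a sketch on a toy frame, not the code used for the results in this report: the toy data and the `random_state` value are assumptions made for illustration.

```python
import pandas as pd

# Toy imbalanced frame; the values are invented for illustration.
toy = pd.DataFrame({
    "is_booked": [0] * 8 + [1] * 2,
    "original_amount": range(10),
})

# Size of the smallest class.
n_min = toy["is_booked"].value_counts().min()

# Draw n_min rows from every class, then shuffle; random_state fixes both draws.
balanced = (
    toy.groupby("is_booked")
    .sample(n=n_min, random_state=42)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["is_booked"].value_counts().to_dict())
```

Fixing `random_state` means the same rows are selected on every run, which makes later scores comparable between experiments.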

We divide the data into X and y.

In [46]:
X_data = aviata_balanced.drop(columns='is_booked')
# Select the target as a 1-D Series; a one-column DataFrame makes scikit-learn warn about the shape of y.
y_data = aviata_balanced['is_booked']

# 3. Training and testing RandomForestClassifier

In [79]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_data = scaler.fit_transform(X_data)


In [80]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA


X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.25, random_state=42)
pca = PCA(n_components=13)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)

rfc = RandomForestClassifier(max_depth=15, random_state=0)  
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)

print('F1: ', f1_score(y_test, y_pred))  
print('Accuracy', accuracy_score(y_test, y_pred))

  


F1:  0.8318204324145897
Accuracy 0.8191177775805449
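
In the cells above the scaler is fitted on the full data set before the train/test split, so statistics of the test rows influence the transformation. One way to avoid this is to chain the steps in a `Pipeline`, which fits the scaler and PCA on the training rows only. The sketch below is not the code that produced the scores above: it runs on synthetic data from `make_classification` as a stand-in for `X_data` and `y_data`, and `n_estimators=50` is chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_data / y_data (17 features, as in aviata_read without is_booked).
X, y = make_classification(n_samples=500, n_features=17, n_informative=8, random_state=0)

# Split first, so the test rows never influence the scaler or the PCA.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=13)),
    ("rfc", RandomForestClassifier(n_estimators=50, max_depth=15, random_state=0)),
])

# fit() fits each step on the training rows; predict() only applies the fitted transforms.
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)

print("F1:", f1_score(y_te, pred))
print("Accuracy:", accuracy_score(y_te, pred))
```

A pipeline can also be passed directly to `GridSearchCV`, in which case the preprocessing is refitted inside every cross-validation fold.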


# 4. GridSearchCV - CrossValidation

In [73]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 15, 40],
    'max_features': [2, 3],
    'n_estimators': [100, 200]
}

rfc = RandomForestClassifier()
grid_search = GridSearchCV(estimator = rfc, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train, y_train)
grid_search.best_params_
model = grid_search.best_estimator_
y_pred = model.predict(X_test)

print('F1: ', f1_score(y_test, y_pred))  
print('Accuracy', accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed: 16.9min finished


F1:  0.834438353542672
Accuracy 0.8208484956066389
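
By default `GridSearchCV` ranks the candidates by the estimator's own score, which is accuracy for classifiers. Since this report also tracks F1, the search could be told to optimize it through the `scoring` parameter. The sketch below illustrates this on synthetic data with a deliberately small grid. The data, the grid values and `n_estimators=20` are assumptions chosen to keep the example fast, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the PCA-transformed training data (13 components).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Deliberately small grid: 2 x 2 = 4 candidates.
small_grid = {"max_depth": [5, 10], "max_features": [2, 3]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid=small_grid,
    scoring="f1",  # rank candidates by F1 instead of the default accuracy
    cv=3,
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated F1.
print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` holds the per-candidate scores, which helps to see whether the best combination is clearly ahead or within noise of the others.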


# 5. Logistic Regression - GridSearchCV

In [76]:
from sklearn.linear_model import LogisticRegression

param_grid = {
    'penalty': ['l1', 'l2'],
    'C': np.logspace(0, 3, 10),
}

log = LogisticRegression(solver='liblinear')  # the 'l1' penalty needs a solver that supports it, such as liblinear
grid_search = GridSearchCV(estimator = log, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train, y_train)
grid_search.best_params_
model = grid_search.best_estimator_
y_pred = model.predict(X_test)

print('F1: ', f1_score(y_test, y_pred))  
print('Accuracy', accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:   18.9s finished


F1:  0.7800263023698972
Accuracy 0.7575219668057158
