Atalov S. (TSI AUCA)

---

# Lab3. Passenger Satisfaction
- **5 points**
- **Submit:** csv file and this notebook
- **Deadline:** Mar 29, 23:59.

This lab walks through a complete, end-to-end solution to a machine learning problem.

<div>
    <img src="https://live-production.wcms.abc-cdn.net.au/ac56ffe2b5282f82358e6b396e2da2ba?impolicy=wcms_crop_resize&cropH=1915&cropW=3404&xPos=5&yPos=0&width=862&height=485" width="500"/>
</div>


---

## 0. Problem Statement

About the company:

**TSI Airlines** is the largest airline of Kyrgyzstan by size and passengers carried.

#### Problem
You need to create a model that will accurately predict passenger **satisfaction**.

In [1]:
# imports used throughout the notebook
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# read the training data
df = pd.read_csv("satisfaction_train.csv")



In [2]:
df.drop(['id'], axis=1, inplace=True)

In [3]:
df.isnull().sum()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Gender                             80000 non-null  object 
 1   Customer Type                      80000 non-null  object 
 2   Age                                80000 non-null  int64  
 3   Type of Travel                     80000 non-null  object 
 4   Class                              80000 non-null  object 
 5   Flight Distance                    80000 non-null  int64  
 6   Inflight wifi service              80000 non-null  int64  
 7   Departure/Arrival time convenient  80000 non-null  int64  
 8   Ease of Online booking             80000 non-null  int64  
 9   Gate location                      80000 non-null  int64  
 10  Food and drink                     80000 non-null  int64  
 11  Online boarding                    80000 non-null  int

## 1. Data Preprocessing

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

In [43]:
def pipeline(df, pred=False):
    """Impute missing values and one-hot encode the categorical columns.

    `pred` is the name of the binary target column, or False when the
    frame has no target (for example, the test set).
    """
    df = df.copy()
    label_encoder = LabelEncoder()

    # categorical features
    categorical_features = df.select_dtypes(include=['object']).columns
    if pred:
        # temporarily map the two target labels to 0/1 and keep the target out of the dummies
        zero = df[pred].unique()[0]
        one = df[pred].unique()[1]
        categorical_features = categorical_features.drop(pred)
        df[pred] = df[pred].replace({zero: 0, one: 1})
    for feature in categorical_features:
        if df[feature].isnull().sum() != 0:
            # predict the missing categories with a shallow random forest
            # note: this branch assumes every other column is numeric and complete
            X = df.dropna()
            y = label_encoder.fit_transform(X.pop(feature))
            forest = RandomForestClassifier(max_depth=4)
            forest.fit(X, y)
            missing_X = df[df[feature].isnull()].drop(columns=[feature])
            missing_predictions = forest.predict(missing_X)
            df.loc[df[feature].isnull(), feature] = label_encoder.inverse_transform(missing_predictions)
    df = pd.get_dummies(df, columns=categorical_features)

    # numerical features
    numerical_features = df.select_dtypes(include=['float64', 'int64']).columns
    for feature in numerical_features:
        if df[feature].isnull().sum() != 0:
            # predict the missing values with a linear regression fitted on the complete rows
            lin_reg = LinearRegression()
            X = df.dropna()
            y = X.pop(feature)
            lin_reg.fit(X, y)
            missing_predictions = lin_reg.predict(df[df[feature].isnull()].drop(columns=[feature]))
            df.loc[df[feature].isnull(), feature] = missing_predictions
    if pred:
        # restore the original target labels
        df[pred] = df[pred].replace({0: zero, 1: one})
    return df
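
The `pipeline` function above fills gaps with fitted models. As a point of comparison, here is a minimal sketch of a simpler baseline that swaps in median and most-frequent imputation through scikit-learn's `ColumnTransformer`. The toy frame, its values, and the names `toy`, `numeric_cols`, `categorical_cols`, and `preprocess` are illustrative assumptions, not part of the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one gap in a numeric and one in a categorical column
toy = pd.DataFrame({
    "Age": [31, 38, 69, 64],
    "Arrival Delay in Minutes": [14.0, 0.0, np.nan, 17.0],
    "Class": ["Business", "Eco", np.nan, "Business"],
})

numeric_cols = ["Age", "Arrival Delay in Minutes"]
categorical_cols = ["Class"]

preprocess = ColumnTransformer(
    [
        # numeric gaps are filled with the column median
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        # categorical gaps are filled with the most frequent label, then one-hot encoded
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

Because the transformer is fitted once and then reused, the same medians and categories learned on the training data can be applied to the test data with `preprocess.transform(...)`.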


In [41]:
df = pipeline(df,'satisfaction')

df.isnull().sum()

Age                                  0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
Gender_Female                        0
Gender_Male                          0
Customer Type_Loyal Customer         0
Customer Type_disloyal Customer      0
Type of Travel_Business travel       0
Type of Travel_Personal Travel       0
Class_Business           

In [6]:
X = df.copy()
y = X.pop('satisfaction')

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1,random_state=0)
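
The split above holds out 10% of the rows. If the two satisfaction classes are imbalanced, passing `stratify` keeps the class ratio the same in both parts. The sketch below shows this on synthetic data; `X_toy`, `y_toy`, and the 70/30 class ratio are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a 70/30 binary target (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 700 + [1] * 300)

# stratify=y_toy preserves the class proportions in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```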

## 2. Modeling

In [8]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

In [28]:
model = GradientBoostingClassifier(learning_rate=0.4, max_depth=5, n_estimators=150, min_samples_split=2)

In [29]:
model.fit(X_train, y_train)
print(model.score(X_train,y_train))
model.score(X_test, y_test)

0.9827638888888889


0.9605

In [22]:
forest = RandomForestClassifier(max_depth=None,n_estimators = 150, min_samples_split=2)

In [23]:
forest.fit(X_train, y_train)
print(forest.score(X_train,y_train))
forest.score(X_test, y_test)

1.0


0.96475
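
Both models score higher on the training set than on the held-out 10%, and the random forest reaches 1.0 on training data. A single split gives only one estimate of test accuracy; k-fold cross-validation gives several. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab data, and the names `X_demo`, `y_demo`, and `clf` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the passenger data (hypothetical)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Five accuracy estimates, each computed on a fold the model did not see during fitting
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

The spread of the fold scores indicates how much a difference such as 0.9605 versus 0.96475 might be due to the particular split.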

## 3. Hyperparameter Tuning (Find Best Parameters)

In [16]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

In [18]:
param_dist = {
    'n_estimators': [50, 100, 150, 200],  # candidate numbers of estimators
    'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5],  # candidate learning rates
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # candidate maximum tree depths
}
param_dist_forest = {
    'n_estimators': [50, 100, 150, 200],  # candidate numbers of trees
    'max_depth': [None, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # candidate maximum tree depths
    'min_samples_split': [2, 5, 10],  # minimum number of samples required to split an internal node
}
grid = HalvingGridSearchCV(estimator=forest, param_grid = param_dist_forest, cv = 3, n_jobs=-1)
grid.fit(X_train, y_train)

In [21]:
grid.best_params_

{'max_depth': None, 'min_samples_split': 2, 'n_estimators': 150}
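
After a search finishes, `best_score_` and `best_estimator_` are available alongside `best_params_`. The following is a small sketch of reading them and checking the refitted model on held-out data. It runs on synthetic data with a deliberately tiny grid; `small_grid`, `search`, and the data are assumptions for illustration, not the grid used above.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV, train_test_split

# Synthetic stand-in for the passenger data (hypothetical)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A tiny grid so the example runs quickly
small_grid = {"n_estimators": [20, 50], "max_depth": [None, 4]}

search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    random_state=0,  # makes the resource subsampling reproducible
)
search.fit(X_tr, y_tr)

print(search.best_params_)                       # winning parameter combination
print(search.best_score_)                        # its mean cross-validated score
print(search.best_estimator_.score(X_te, y_te))  # refitted model on held-out data
```

With the default `refit=True`, `best_estimator_` is already trained on all of the data passed to `fit`, so it can be used for prediction directly.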

## 4. Write Pipeline For Data Preparation and Prediction

## 5. Predict Test Data

### Read and Prepare test data using your pipeline

In [30]:
df_test = pd.read_csv("satisfaction_test.csv")

In [38]:
df_test

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,...,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
0,Female,Loyal Customer,31,Business travel,Business,1669,1,1,1,1,...,4,4,3,2,5,3,4,4,2,14.0
1,Male,Loyal Customer,38,Business travel,Eco,397,5,4,4,4,...,5,5,4,5,4,3,2,5,0,0.0
2,Female,Loyal Customer,69,Personal Travel,Eco,2296,3,5,4,3,...,5,3,3,4,3,4,3,3,0,0.0
3,Male,Loyal Customer,64,Business travel,Business,406,1,1,1,1,...,4,5,5,5,5,3,5,3,23,17.0
4,Male,Loyal Customer,47,Business travel,Business,2022,5,5,5,5,...,5,4,4,4,4,5,4,3,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23899,Female,Loyal Customer,69,Business travel,Business,1938,4,2,2,2,...,1,4,4,4,4,1,4,4,0,14.0
23900,Male,Loyal Customer,61,Business travel,Eco,1185,2,2,2,2,...,2,2,1,1,4,2,4,2,18,0.0
23901,Male,Loyal Customer,30,Business travel,Business,3375,4,4,4,4,...,5,5,3,1,2,2,3,5,101,97.0
23902,Female,Loyal Customer,23,Business travel,Business,628,2,3,3,3,...,2,2,4,4,4,2,3,2,30,22.0


In [37]:
df_test.drop(['id'], axis=1, inplace=True)

In [44]:
df_test = pipeline(df_test)

### Make a prediction using your best model:

In [None]:
predictions = forest.predict(df_test)
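
`predict` expects the same columns, in the same order, that the model saw during training. Since `pd.get_dummies` only creates columns for categories present in the frame it is given, the test frame can end up with fewer or differently ordered columns. One possible way to guard against this is to reindex the test columns to the training columns, as sketched below. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical raw frames: the test data lacks one category seen in training
train_raw = pd.DataFrame({"Age": [31, 38, 69], "Class": ["Business", "Eco", "Eco Plus"]})
test_raw = pd.DataFrame({"Age": [47, 23], "Class": ["Eco", "Business"]})

train_X = pd.get_dummies(train_raw, columns=["Class"])
test_X = pd.get_dummies(test_raw, columns=["Class"])

# Match the training layout: add missing dummy columns as 0 and fix the column order
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

print(list(test_X.columns))
```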

### Save predictions as `YourName.csv` and submit csv file and this notebook in ecourse

HINT: Use `df.to_csv('YourName.csv', index=False)`

In [None]:
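# save the predictions, not the training frame
# (a single 'satisfaction' column is an assumption about the expected submission format)
pd.DataFrame({'satisfaction': predictions}).to_csv('YourName.csv', index=False)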