<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #2b6777; color:#ffffff;text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Airline passenger satisfaction (Part 1)</h1></div>

<center><a><img src="https://apartmentinteriors.ru/wp-content/uploads/samolot-charter-private-jet-charter-02.jpg" border="3" width=800 height=600 class="center"></a></center>

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Task description</h1></div>

**The following information is available about the passengers of an airline:**

1. **Gender:** male or female
2. **Customer type:** regular or non-regular airline customer
3. **Age:** the actual age of the passenger
4. **Type of travel:** the purpose of the passenger's flight (personal or business travel)
5. **Class:** business, economy, economy plus
6. **Flight distance**
7. **Inflight wifi service:** satisfaction level with Wi-Fi service on board (0: not rated; 1-5)
8. **Departure/Arrival time convenient:** departure/arrival time satisfaction level (0: not rated; 1-5)
9. **Ease of Online booking:** online booking satisfaction rate (0: not rated; 1-5)
10. **Gate location:** level of satisfaction with the gate location (0: not rated; 1-5)
11. **Food and drink:** food and drink satisfaction level (0: not rated; 1-5)
12. **Online boarding:** satisfaction level with online boarding (0: not rated; 1-5)
13. **Seat comfort:** seat satisfaction level (0: not rated; 1-5)
14. **Inflight entertainment:** satisfaction with inflight entertainment (0: not rated; 1-5)
15. **On-board service:** level of satisfaction with on-board service (0: not rated; 1-5)
16. **Leg room service:** level of satisfaction with leg room service (0: not rated; 1-5)
17. **Baggage handling:** level of satisfaction with baggage handling (0: not rated; 1-5)
18. **Checkin service:** level of satisfaction with checkin service (0: not rated; 1-5)
19. **Inflight service:** level of satisfaction with inflight service (0: not rated; 1-5)
20. **Cleanliness:** level of satisfaction with cleanliness (0: not rated; 1-5)
21. **Departure delay in minutes**
22. **Arrival delay in minutes**

This data set contains a survey on <b>air passenger satisfaction</b>. The following <b>classification problem</b> is set:

It is necessary to predict which of the <b>two</b> levels of satisfaction with the airline the passenger belongs to:
<ol>
    <li><em>Satisfied</em></li>
    <li><em>Neutral or dissatisfied</em></li>
</ol>

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Installing scikit-learn-intelex</h1></div>

In [None]:
!pip install scikit-learn-intelex

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Reading data</h1></div>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import ensemble
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
data = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv")

Table dimensions:

In [None]:
data.shape

Each row corresponds to one passenger, and each column to a specific feature.<br>
Let's look at the first and last rows of the data set:

In [None]:
data.head()

In [None]:
data.tail()

Let's take a closer look at the dataset:

In [None]:
data.info()

You may notice the following:
<ol>
     <li><b>The column</b> corresponding to the <b>Arrival Delay in Minutes feature has 310 missing values</b>.</li>
     <li><b>The first two columns carry no information useful for classification</b>, so they should be dropped.</li>
     <li><b>Many columns contain categorical values</b> but are of type 'object' or 'int64'. Let's replace this type with a special one designed for storing categorical values.</li>
</ol>

In [None]:
# Drop the first two columns by position
data = data.drop(columns = data.columns[[0, 1]])

In [None]:
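categorical_indexes = [0, 1, 3, 4] + list(range(6, 20))
# Convert by column name: assigning through .iloc may keep the original dtypes in recent pandas versions
categorical_names = data.columns[categorical_indexes]
data = data.astype({name: 'category' for name in categorical_names})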

Now the dataset information looks like this:

In [None]:
data.info()

The first 22 features have been detailed above. The <b>satisfaction</b> feature is the target.
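
The dtype conversion used above can be illustrated on a tiny made-up table. This is only a sketch: the three rows and the choice of columns are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy table (hypothetical values)
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco"],
    "Seat comfort": [3, 5, 3],
    "Age": [25, 41, 33],
})

# Convert selected columns to the dedicated categorical dtype
toy = toy.astype({"Class": "category", "Seat comfort": "category"})
print(toy.dtypes)

# Only the remaining numeric column is summarized by a default describe()
print(toy.describe())
```

After the conversion, `describe()` with default parameters skips the rating column, because it is now treated as categorical rather than as an integer.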

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Data visualization and calculation of the main characteristics</h1></div>

Get summary information about quantitative features by calling the "describe" method with default parameters:

In [None]:
data.describe()

For each quantitative attribute, mean values, standard deviation, minimum and maximum values, median and quartile values are given.

Now we get information about categorical features:

In [None]:
data.describe(include = ['category'])

For each categorical feature, the total number of values, the number of unique values, the most frequently occurring element and the total number of such elements are given.

Let's look at the class distribution of the target variable:

In [None]:
plt.pie(data.satisfaction.value_counts(), labels = ["Neutral or dissatisfied", "Satisfied"], colors = sns.color_palette("YlOrBr"), autopct = '%1.1f%%')
pass

As you can see from the pie chart, <b>the sample is more or less balanced</b>.
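
The balance can also be checked numerically instead of being read off the chart. Below is a minimal sketch on a made-up target column (the 6:4 split is an assumption for illustration), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy target column (hypothetical values, not taken from the dataset)
toy_target = pd.Series(
    ["neutral or dissatisfied"] * 6 + ["satisfied"] * 4, name="satisfaction"
)

# Share of each class, largest first
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# Ratio of the smallest class to the largest; 1.0 means perfectly balanced
balance_ratio = class_share.min() / class_share.max()
print(round(balance_ratio, 3))
```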

Let's calculate the correlation matrix for the quantitative features and visualize it as a heatmap:

In [None]:
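# numeric_only restricts the computation to numeric columns (recent pandas versions no longer skip the others silently)
corr_mat = data.corr(numeric_only = True)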
corr_mat

In [None]:
sns.heatmap(corr_mat, square = True, cmap = 'Blues')
pass

You can see a strong correlation between the features <em>'Departure delay in minutes'</em> and <em>'Arrival delay in minutes'</em>. The pairs of features whose correlation coefficient exceeds 0.5 are listed below:

In [None]:
corr_mat.where(np.triu(corr_mat > 0.5, k=1)).stack().sort_values(ascending = False)
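
To see what the filter above does, here is a small sketch on synthetic data. The column names `a`, `b`, `c` and the random values are assumptions for illustration. `np.triu(..., k=1)` keeps only the entries strictly above the diagonal, so each pair appears once and the trivial self-correlations are excluded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

corr = toy.corr()

# Boolean mask: above the diagonal AND correlation greater than 0.5
mask = np.triu(corr > 0.5, k=1)

# Entries outside the mask become NaN and are dropped after stacking
strong_pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(strong_pairs)
```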

Let's build a scatterplot for these features:

In [None]:
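# Departure delay on the x-axis, arrival delay on the y-axis
plt.scatter(data['Departure Delay in Minutes'], data['Arrival Delay in Minutes'], alpha = 0.5)
plt.xlabel('Departure Delay in Minutes')
plt.ylabel('Arrival Delay in Minutes')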

You can see that the points lined up more or less along a straight line going from the lower left corner to the upper right. Thus, in some approximation <b>the dependence of the arrival time delay on the departure time delay is linear</b>.

The results obtained are quite logical and can be explained as follows. If the flight of the airline's customers was delayed by a certain amount of time at departure, then the flight will be delayed by about the same amount of time at landing (provided that the aircraft does not accelerate in flight to make up for lost time).
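
The approximately linear relationship can be quantified with a least-squares line. The sketch below uses synthetic delays, so the numbers are assumptions and not results from the dataset. A slope close to 1 means that a departure delay carries over almost entirely to arrival.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic delays in minutes (hypothetical): arrival = departure + small noise
departure_delay = rng.uniform(0, 120, size=500)
arrival_delay = departure_delay + rng.normal(scale=5, size=500)

# Least-squares straight line: arrival ~ slope * departure + intercept
slope, intercept = np.polyfit(departure_delay, arrival_delay, deg=1)

# Pearson correlation coefficient between the two series
r = np.corrcoef(departure_delay, arrival_delay)[0, 1]
print(slope, intercept, r)
```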

Consider the ratio of values for each of the categorical features:

In [None]:
categ = data.iloc[:,categorical_indexes]
fig, axes = plt.subplots(6, 3, figsize = (20, 20))
for i, col in enumerate(categ):
    column_values = data[col].value_counts()
    labels = column_values.index
    sizes = column_values.values
    axes[i//3, i%3].pie(sizes, labels = labels, colors = sns.color_palette("YlOrBr"), autopct = '%1.0f%%', startangle = 90)
    axes[i//3, i%3].axis('equal')
    axes[i//3, i%3].set_title(col)
plt.show()

Some conclusions about the considered sample:
<ul>
    <li>The number of men and women in this sample is approximately the same</li>
    <li>The vast majority of the airline's customers are repeat customers</li>
    <li>Most of our clients flew for business rather than personal reasons</li>
    <li>About half of the passengers were in business class</li>
    <li>More than 60% of passengers were satisfied with the luggage transportation service (rated 4-5 out of 5)</li>
    <li>More than 50% of passengers were comfortable sitting in their seats (rated 4-5 out of 5)</li>
</ul>

Now let's look at a few box-and-whisker plots (box plots).

In [None]:
f, ax = plt.subplots(1, 2, figsize = (15,5))
sns.boxplot(x = "Customer Type", y = "Age", palette = "YlOrBr", data = data, ax = ax[0])
sns.histplot(data, x = "Age", hue = "Customer Type", multiple = "stack", palette = "YlOrBr", edgecolor = ".3", linewidth = .5, ax = ax[1])
pass

From this box plot, we can conclude that <b>most of the airline's regular customers are between the ages of 30 and 50 (their median age is slightly over 40)</b>. Non-regular customers are younger and their age range is somewhat narrower (from 25 to 40 years old, with a median of a little under 30).
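
The quartiles that a box plot draws can also be computed directly. Here is a minimal sketch on a made-up table. The group labels `regular` and `non-regular` and the ages are assumptions for illustration and may differ from the labels used in the dataset.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Customer Type": ["regular"] * 5 + ["non-regular"] * 5,
    "Age": [30, 35, 40, 45, 50, 22, 25, 28, 35, 40],
})

# Lower quartile, median and upper quartile of Age per customer type:
# these are the edges of the box and the line inside it
quartiles = (
    toy.groupby("Customer Type")["Age"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```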

In [None]:
f, ax = plt.subplots(1, 2, figsize = (15,5))
sns.boxplot(x = "Class", y = "Age", palette = "YlOrBr", data = data, ax = ax[0])
sns.histplot(data, x = "Age", hue = "Class", multiple = "stack", palette = "YlOrBr", edgecolor = ".3", linewidth = .5, ax = ax[1])
pass

The age range of customers who travel in business class is roughly the same as the age range of regular customers in the previous box plot. Based on this observation, it can be assumed that <b>regular customers mainly buy business class</b>.

In [None]:
f, ax = plt.subplots(1, 2, figsize = (15,5))
sns.boxplot(x = "Class", y = "Flight Distance", palette = "YlOrBr", data = data, ax = ax[0])
sns.histplot(data, x = "Flight Distance", hue = "Class", multiple = "stack", palette = "YlOrBr", edgecolor = ".3", linewidth = .5, ax = ax[1])
pass

From this box plot, the following conclusion can be drawn: <b>customers who fly long distances mostly travel in business class</b>.

In [None]:
f, ax = plt.subplots(2, 2, figsize = (15,8))
sns.boxplot(x = "Inflight entertainment", y = "Flight Distance", palette = "YlOrBr", data = data, ax = ax[0, 0])
sns.histplot(data, x = "Flight Distance", hue = "Inflight entertainment", multiple = "stack", palette = "YlOrBr", edgecolor = ".3", linewidth = .5, ax = ax[0, 1])
sns.boxplot(x = "Leg room service", y = "Flight Distance", palette = "YlOrBr", data = data, ax = ax[1, 0])
sns.histplot(data, x = "Flight Distance", hue = "Leg room service", multiple = "stack", palette = "YlOrBr", edgecolor = ".3", linewidth = .5, ax = ax[1, 1])
pass

The following pattern can be seen: <b>the more distance an aircraft passenger travels (respectively, the longer they are in flight), the more they are satisfied with the entertainment in flight and the extra legroom (on average)</b>.

Now consider how the target, <em>satisfaction of air passengers</em>, depends on some of the categorical features:

In [None]:
sns.countplot(x = 'Class', hue = 'satisfaction', palette = "YlOrBr", data = data)
plt.show()

This chart is very revealing. You can see that <b>most of the passengers who flew in economy plus or economy class were dissatisfied with the flight, and those who were lucky enough to fly in business class were satisfied</b>.
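
The count plot shows absolute numbers. The share of satisfied passengers within each class can be tabulated with `pd.crosstab`. This is a sketch on a made-up table, so the category labels and rows are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Class": ["Business", "Business", "Business",
              "Eco", "Eco", "Eco",
              "Eco Plus", "Eco Plus"],
    "satisfaction": ["satisfied", "satisfied", "neutral or dissatisfied",
                     "neutral or dissatisfied", "neutral or dissatisfied", "satisfied",
                     "neutral or dissatisfied", "satisfied"],
})

# normalize="index" turns the counts in each row into shares that sum to 1
share = pd.crosstab(toy["Class"], toy["satisfaction"], normalize="index")
print(share)
```

The same call with another rating column in place of `Class` gives the shares behind the other count plots.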

In [None]:
sns.countplot(x = 'Inflight wifi service', hue = 'satisfaction', palette = "YlOrBr", data = data)
plt.show()

According to this graph, you can see that <b>almost all passengers who rated the wifi service 5 out of 5 points were satisfied with the flight</b>.

In [None]:
f, ax = plt.subplots(1, 2, figsize = (20,5))
sns.countplot(x = 'Seat comfort', hue = 'satisfaction', palette = "YlOrBr", data = data,ax = ax[0])
sns.countplot(x = 'Leg room service', hue = 'satisfaction', palette = "YlOrBr", data = data, ax = ax[1])
plt.show()


From the graphs above, we can conclude the following: <b>most passengers who rated the seat comfort and the legroom 4 or 5 out of 5 were satisfied with the flight</b>.

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Filling in missing values</h1></div>

Let's see how many missing values are in each column of the table:

In [None]:
data.isna().sum()

Fill in the missing values with <b>medians</b> in the columns corresponding to quantitative features:

In [None]:
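# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
data['Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].fillna(data['Arrival Delay in Minutes'].median())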

In [None]:
data.isna().sum()

In [None]:
data.describe()

This table shows that there are no more missing values.
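
Median imputation can be checked by hand on a small example. The values below are made up for illustration. The median is preferred to the mean here because a single very large delay shifts it much less.

```python
import numpy as np
import pandas as pd

col = "Arrival Delay in Minutes"

# Toy column with one missing value and one extreme delay (hypothetical values)
toy = pd.DataFrame({col: [0.0, 5.0, np.nan, 10.0, 200.0]})

# NaN is skipped when computing the median: median of [0, 5, 10, 200] is 7.5
median_delay = toy[col].median()

# The mean of the same values would be 53.75, dominated by the outlier
mean_delay = toy[col].mean()

toy[col] = toy[col].fillna(median_delay)
print(toy)
```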

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Handling categorical features</h1></div>

We divide the features into quantitative and categorical:

In [None]:
numerical_columns = [c for c in data.columns if data[c].dtype.name != 'category']
numerical_columns.remove('satisfaction')
categorical_columns = [c for c in data.columns if data[c].dtype.name == 'category']
data_describe = data.describe(include = ['category'])

We divide categorical features into binary and non-binary:

In [None]:
binary_columns = [c for c in categorical_columns if data_describe[c]['unique'] == 2]
nonbinary_columns = [c for c in categorical_columns if data_describe[c]['unique'] > 2]
print(binary_columns, nonbinary_columns)

Let's look at the unique values for each binary feature:

In [None]:
for col in binary_columns:
    print(col, ': ', end = '')
    for uniq in data[col].unique():
        if uniq == data[col].unique()[-1]:
            print(uniq, end = '.')
        else:
            print(uniq, end = ', ')
    print()

Let's do the binarization:

In [None]:
data[col] == uniq

In [None]:
for col in binary_columns:
    data[col] = data[col].astype('object')
    k = 0
    for uniq in data[col].unique():
        # .loc accepts a boolean mask; .at is meant for a single row/column label
        data.loc[data[col] == uniq, col] = k
        k += 1
for col in binary_columns:
    print(data[col].describe(), end = '\n\n')
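
For comparison, the same 0/1 encoding can be obtained with `pd.factorize`, which numbers the values in order of first appearance. This is an alternative sketch on a made-up column, not the method used in the cell above.

```python
import pandas as pd

# Toy binary feature (hypothetical values)
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# codes: integer label per row; uniques: the original value for each code
codes, uniques = pd.factorize(toy["Gender"])
toy["Gender"] = codes

print(list(uniques))
print(toy)
```

Keeping `uniques` makes it possible to translate the codes back to the original labels later.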

Now let's look at non-binary categorical features:

In [None]:
data[nonbinary_columns]

The following vectorization method is applicable to non-binary features:

The feature j, which takes s values, will be replaced by s features, which take the values 0 or 1, depending on what the value of the original feature j is.

This vectorization is carried out by the get_dummies method:

In [None]:
data_nonbinary = pd.get_dummies(data[nonbinary_columns])
print(data_nonbinary.columns)

In [None]:
len(data_nonbinary.columns)
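
The effect of `get_dummies` is easiest to see on a tiny example. The sketch below uses a made-up column with three values. Passing `dtype=int` is a choice made here so that the output is 0/1 rather than booleans.

```python
import pandas as pd

# Toy non-binary feature with s = 3 distinct values (hypothetical rows)
toy = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})

# One indicator column per distinct value
dummies = pd.get_dummies(toy, columns=["Class"], dtype=int)
print(dummies)

# Exactly one indicator is set in every row
print(dummies.sum(axis=1).tolist())
```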

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Normalization of quantitative features</h1></div>

We have the following quantitative characteristics:

In [None]:
data_numerical = data[numerical_columns]
data_numerical.describe()

Let's perform a <b>standardization</b> (<em>a linear transformation that brings every feature to zero mean and unit standard deviation</em>) of all quantitative features:

In [None]:
data_numerical = (data_numerical - data_numerical.mean(axis = 0))/data_numerical.std(axis = 0)

In [None]:
data_numerical.describe()
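
The transformation can be verified on a small made-up table: after standardization every column should have mean 0 and standard deviation 1. The values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy quantitative features on very different scales (hypothetical values)
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Flight Distance": [100.0, 500.0, 900.0, 1300.0],
})

# z = (x - mean) / std, computed column by column
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

Note that pandas computes the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` uses the population one (`ddof=0`), so the two give slightly different values on small samples.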

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Table formation</h1></div>

We join all the transformed columns into one table:

In [None]:
target = data['satisfaction']
data = pd.concat((data_numerical, data_nonbinary, data[binary_columns]), axis = 1)
print(data.shape)

Now it looks like this:

In [None]:
data.describe()

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Splitting the data into training and test sets</h1></div>

Get <b>X</b> and <b>y</b>:

In [None]:
X = data
y = target
N, d = X.shape
N, d

In [None]:
X.columns

In [None]:
y

Let's split the data into training and test samples in a ratio of 9:1 <b>(90% - training sample, 10% - test)</b>:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 777)

N_train, _ = X_train.shape 
N_test,  _ = X_test.shape 

N_train, N_test
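
The split above is purely random. If the class proportions should be preserved exactly in both parts, `train_test_split` accepts a `stratify` argument. This is an optional variation shown on synthetic arrays, and the 60/40 class ratio is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the 60/40 ratio in both the training and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=777, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```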

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>K-nearest neighbors (kNN) method</h1></div>

Let's train and run kNN with 10 neighbors:

In [None]:
knn = KNeighborsClassifier()

knn.set_params(n_neighbors = 10)
knn.fit(X_train, y_train)

err_train = np.mean(y_train != knn.predict(X_train))
err_test  = np.mean(y_test  != knn.predict(X_test))

print('Training sample error: ', err_train)
print('Error on the test sample: ', err_test)
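
The error computed above is the fraction of misclassified samples, that is, one minus the accuracy. Below is a small sketch that wraps it in a helper and checks this identity. The function name `error_rate` and the synthetic data from `make_classification` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def error_rate(model, X, y):
    """Fraction of misclassified samples, i.e. 1 - accuracy."""
    return float(np.mean(model.predict(X) != y))


# Synthetic binary classification problem (not the airline data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=777
)

knn_toy = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

err_tr = error_rate(knn_toy, X_tr, y_tr)
err_te = error_rate(knn_toy, X_te, y_te)
print('Training sample error: ', err_tr)
print('Error on the test sample: ', err_te)
```

The same helper works unchanged for any fitted scikit-learn classifier.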

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>SVC</h1></div>

Let's train and run the support vector machine:

In [None]:
svc = SVC(gamma = 'auto')
svc.fit(X_train, y_train)

err_train = np.mean(y_train != svc.predict(X_train))
err_test  = np.mean(y_test  != svc.predict(X_test))

print('Training sample error: ', err_train)
print('Error on the test sample: ', err_test)

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Random Forest</h1></div>

Train and run random forest:

In [None]:
rf = ensemble.RandomForestClassifier(n_estimators = 100)
rf.fit(X_train, y_train)

err_train = np.mean(y_train != rf.predict(X_train))
err_test  = np.mean(y_test  != rf.predict(X_test))

print('Training sample error: ', err_train)
print('Error on the test sample: ', err_test)

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Extremely Randomized Trees</h1></div>

Train and run extremely randomized trees:

In [None]:
ert = ensemble.ExtraTreesClassifier(n_estimators = 100).fit(X_train, y_train)

err_train = np.mean(y_train != ert.predict(X_train))
err_test  = np.mean(y_test  != ert.predict(X_test))

print('Training sample error: ', err_train)
print('Error on the test sample: ', err_test)

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>AdaBoost</h1></div>

Let's train and run the AdaBoost algorithm:

In [None]:
ada = ensemble.AdaBoostClassifier(n_estimators = 100)
ada.fit(X_train, y_train)

err_train = np.mean(y_train != ada.predict(X_train))
err_test = np.mean(y_test != ada.predict(X_test))

print('Training sample error: ', err_train)
print('Error on the test sample: ', err_test)

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>GBT</h1></div>

Train and run gradient-boosted decision trees:

In [None]:
gbt = ensemble.GradientBoostingClassifier(n_estimators = 100)
gbt.fit(X_train, y_train)

err_train = np.mean(y_train != gbt.predict(X_train))
err_test = np.mean(y_test != gbt.predict(X_test))

print('Training sample error: ', err_train)
print('Error on the test sample: ', err_test)
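
The train-and-evaluate pattern repeated in the sections above can be collected into one loop that produces a comparison table. This is a sketch on synthetic data with a reduced number of trees, so the resulting errors are not the ones obtained on the airline dataset.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (not the airline data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=777
)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=10),
    "Random Forest": ensemble.RandomForestClassifier(n_estimators=50, random_state=0),
    "GBT": ensemble.GradientBoostingClassifier(n_estimators=50, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_error": np.mean(model.predict(X_tr) != y_tr),
        "test_error": np.mean(model.predict(X_te) != y_te),
    })

# One row per model, best test error first
results = pd.DataFrame(rows).set_index("model").sort_values("test_error")
print(results)
```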

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>General conclusions</h1></div>

***
<b>Conclusions about the considered sample</b>:

> <ul>
> <li>The sample is more or less balanced <em>(56.7% vs. 43.3%)</em>.</li>
> <li>The number of men and women in this sample is approximately the same.</li>
> <li>The vast majority of the airline's customers are repeat customers.</li>
> <li>Most of our clients flew for business rather than personal reasons.</li>
> <li>About half of the passengers were in business class.</li>
> <li>More than 60% of passengers were satisfied with the luggage transportation service (rating 4-5 out of 5).</li>
> <li>More than 50% of passengers were comfortable sitting in their seats (rated 4-5 out of 5).</li>
> <li>There was a strong correlation <em>(a coefficient of about 0.96)</em> between the features 'Departure delay in minutes' and 'Arrival delay in minutes' (which is quite logical and was discussed in detail above).</li>
> <li>Most of the airline's regular customers are between the ages of 30 and 50 (with a median of a little over 40). Non-regular customers are younger and their age range is somewhat narrower (from 25 to 40 years old, with a median of slightly less than 30).</li>
> <li>Customers whose flight distance is long tend to fly in business class.</li>
> <li>The more distance an airplane passenger travels (respectively, the longer they are in flight), the more satisfied they are with in-flight entertainment and extra legroom (on average).</li>
> <li>Most of the passengers who flew in Economy Plus or Economy Class were dissatisfied with the flight, and those who were lucky enough to fly in Business Class were satisfied.</li>
> <li>Almost all passengers who rated the wifi service 5 out of 5 were satisfied with the flight.</li>
> <li>The majority of passengers who rated the comfort of the seats and the extra legroom at 4 and 5 points out of 5 were satisfied with the flight.</li>
> </ul>

***
<b>Conclusions on classification results</b>:

> <ul>
> <li>For the kNN method, the error on the training set was <em>5.2%</em>, and on the test set it was <em>6.5%</em>.</li>
> <li>For the support vector machine, the error on both the training and test sets was <em>5%</em>.</li>
> <li>For the Random Forest and Extremely Randomized Trees classifiers, no errors were observed on the training set, while the error on the test set was <em>3.9%</em>.</li>
> <li>For the AdaBoost algorithm, the error on the training set was <em>7%</em>, and on the test set it was <em>7.2%</em>.</li>
> <li>For gradient boosted decision trees, the error on the training and test samples was <em>5.5% and 5.6%</em> respectively.</li>
> <li>The Random Forest and Extremely Randomized Trees classifiers showed the best result (a test error of <em>3.9%</em>).</li>
> </ul>

***

<div style="border-width:1; border-radius: 15px; border-style: solid; border-color: rgb(10, 10, 10); background-color: #52ab98; text-align: center;font: 14pt 'Candara';font-weight:bold;"><h1>Link to the second part</h1></div>

<div style="text-align: center;font: 14pt 'Candara';font-weight:bold;"><h3><a href="https://www.kaggle.com/code/frixinglife/airline-passenger-satisfaction-part-2">Airline Passenger Satisfaction (Part 2)</a></h3></div>