<h1 style="text-align:center; color:#000000; font-weight:bold;">Airline satisfaction</h1>

- [Dataset Description](#1)
- [Import libraries](#2)
- [Reading the data](#3)
- [Data exploration](#4)
- [Data preparation & EDA](#5)
- [Conclusions](#6)
- [PreProcessing](#7)
- [Modeling](#8)
****

# Dataset Description

## [Dataset](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction)

- **id:** Passenger ID.
- **Gender:** Gender of the passenger (Female, Male).
- **Customer Type:** Customer type (Loyal Customer, disloyal Customer).
- **Age:** Age of the passenger.
- **Type of Travel:** Purpose of the flight (Personal Travel, Business Travel).
- **Class:** Travel class (Business, Eco, Eco Plus).
- **Flight Distance:** Flight distance of the journey.
- **Inflight wifi service:** Satisfaction level with the inflight wifi service (0: Not Applicable; 1-5).
- **Departure/Arrival time convenient:** Satisfaction level with the convenience of the departure/arrival time.
- **Ease of Online booking:** Satisfaction level with online booking.
- **Gate location:** Satisfaction level with the gate location.
- **Food and drink:** Satisfaction level with food and drink.
- **Online boarding:** Satisfaction level with online boarding.
- **Seat comfort:** Satisfaction level with seat comfort.
- **Inflight entertainment:** Satisfaction level with inflight entertainment.
- **On-board service:** Satisfaction level with on-board service.
- **Leg room service:** Satisfaction level with leg room.
- **Baggage handling:** Satisfaction level with baggage handling.
- **Checkin service:** Satisfaction level with the check-in service.
- **Inflight service:** Satisfaction level with inflight service.
- **Cleanliness:** Satisfaction level with cleanliness.
- **Departure Delay in Minutes:** Departure delay, in minutes.
- **Arrival Delay in Minutes:** Arrival delay, in minutes.
- **satisfaction:** Airline satisfaction level (satisfied, neutral or dissatisfied).

# Import libraries

In [None]:
# Cleaning and Visualization Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# ML Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

## Suppress warnings

In [None]:
import warnings
warnings.simplefilter("ignore")

# Reading the data

In [None]:
# Load the train and test datasets
train = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/train.csv')
test = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/test.csv')

# Concatenate the datasets vertically (stacking one on top of the other)
df = pd.concat([train, test], axis=0)

# Reset the index to ensure it is continuous
df = df.reset_index(drop=True)
pd.set_option('display.max_columns',None)
df.head(10)
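
Once the two files are stacked, nothing in `df` records which file a row came from. If that information is needed later, one option is to tag each part before concatenating. The sketch below uses two small made-up frames in place of the Kaggle files; the `source` column and the `*_demo` names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up).
train_demo = pd.DataFrame({"Age": [25, 40],
                           "satisfaction": ["satisfied", "neutral or dissatisfied"]})
test_demo = pd.DataFrame({"Age": [33], "satisfaction": ["satisfied"]})

# Tag each part with its origin, then stack; ignore_index gives a continuous index.
combined = pd.concat(
    [train_demo.assign(source="train"), test_demo.assign(source="test")],
    ignore_index=True,
)
print(combined)
```

Filtering on `combined["source"]` then recovers either part of the data.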

# Data exploration

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.nunique().sort_values(ascending=False)

In [None]:
df=df.drop(['id','Unnamed: 0'],axis=1)

In [None]:
df.describe()

In [None]:
df.describe(include = object)

# Data preparation & EDA

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
objects=df[['Customer Type','satisfaction','Type of Travel','Class']]
objects

## Gender

In [None]:
df['Gender'].value_counts()

In [None]:
plt.pie(df.Gender.value_counts(), labels = ["Female", "Male"], colors = sns.color_palette("ch:.25"), autopct = '%1.1f%%')
pass

In [None]:
fig = plt.figure(figsize=(25,15))

counter = 0

for i in objects.columns:
    if objects[i].dtype == 'object':
        sub = fig.add_subplot(3,3,counter+1)
        g = sns.countplot(x='Gender',hue=i,data=df,palette="ch:.25_r")
        counter = counter + 1

In [None]:
# Gender seems to have no effect on satisfaction, so I will drop the column

df=df.drop(['Gender'],axis=1)
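
The decision to drop `Gender` rests on a visual reading of the count plots. One way to back it with numbers is a row-normalized cross-tabulation, which gives the share of each satisfaction level within each gender. The sketch below uses a tiny made-up frame (`demo` is hypothetical); on the real data it would have to run before the column is dropped.

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
demo = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied",
                     "neutral or dissatisfied", "neutral or dissatisfied",
                     "neutral or dissatisfied"],
})

# normalize="index" turns counts into per-row (per-gender) proportions.
rates = pd.crosstab(demo["Gender"], demo["satisfaction"], normalize="index")
print(rates)

# Difference in the satisfied share between the two groups.
gap = abs(rates.loc["Female", "satisfied"] - rates.loc["Male", "satisfied"])
print(f"Gap in satisfied share: {gap:.3f}")
```

A gap close to zero supports the reading that gender carries little information about satisfaction.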

## Age

In [None]:
print(df['Age'].unique())

In [None]:
g = sns.catplot(x='Age',data=df,kind="box",palette="crest_r", height=4, aspect=2)
plt.show()

In [None]:
# `shade` is deprecated in favor of `fill`; `palette` has no effect without `hue`
g=sns.kdeplot(data = df, x = 'Age', multiple="stack",fill=True, common_norm=False,alpha=.8, linewidth=0)
plt.title('Age distribution')
plt.show()

In [None]:
fig = plt.figure(figsize=(25,15))

counter = 0

for i in objects.columns:
    if objects[i].dtype == 'object':
        sub = fig.add_subplot(3,3,counter+1)
        g = sns.kdeplot(data = df, x = "Age",hue = i,palette="ch:.25",multiple="stack",fill=True, common_norm=False,alpha=.8, linewidth=0,)
        counter = counter + 1

In [None]:
# `shade` is deprecated in favor of `fill`; a single `color` is not used when `hue` is set
sns.kdeplot(data = df, x = "Age",hue = "satisfaction" , fill = True)

## Customer Type

In [None]:
df['Customer Type'].value_counts()

In [None]:
plt.pie(df['Customer Type'].value_counts(), labels = ["Loyal", "Disloyal"],colors = sns.color_palette("ch:.25"), autopct = '%1.1f%%')
pass

In [None]:
sns.countplot(x ='Customer Type', hue = 'satisfaction',palette="ch:.25_r", data = df)
plt.show()

## Type of Travel

In [None]:
df['Type of Travel'].value_counts()

In [None]:
plt.pie(df['Type of Travel'].value_counts(), labels = ["Business", "Personal"],colors = sns.color_palette("ch:.25"), autopct = '%1.1f%%')
pass

In [None]:
sns.countplot(x ='Type of Travel', hue = 'satisfaction',palette="ch:.25_r", data = df)
plt.show()

## Class

In [None]:
df['Class'].value_counts()

In [None]:
plt.pie(df['Class'].value_counts(), labels = ["Business","Eco", "Eco Plus"],colors = sns.color_palette("ch:.25"), autopct = '%1.1f%%')
pass

In [None]:
sns.countplot(x = 'Class', hue = 'satisfaction',palette="ch:.25_r", data = df)
plt.show()

## Delay

In [None]:
sns.kdeplot(data = df, x = "Departure Delay in Minutes", multiple="stack",fill=True, common_norm=False,alpha=.8, linewidth=0)

In [None]:
# Fill missing values in the "Arrival Delay in Minutes" column with the calculated median

arrival_delay_median = df["Arrival Delay in Minutes"].median()
df["Arrival Delay in Minutes"] = df["Arrival Delay in Minutes"].fillna( value = arrival_delay_median)
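
The median is a reasonable fill value here because delay data tends to be strongly right-skewed, so a few very long delays pull the mean upward. A small worked example with made-up numbers shows the difference (the `delays` series below is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up delays with one missing value and one extreme value.
delays = pd.Series([0.0, 5.0, np.nan, 10.0, 100.0], name="Arrival Delay in Minutes")

missing_share = delays.isna().mean() * 100  # percentage of missing entries
median_delay = delays.median()              # NaN is skipped: median of 0, 5, 10, 100
mean_delay = delays.mean()                  # pulled upward by the extreme value

filled = delays.fillna(median_delay)

print(f"missing: {missing_share:.1f}%  median: {median_delay}  mean: {mean_delay}")
print(filled.tolist())
```

Here the median is 7.5 while the mean is 28.75, so the median fill stays close to a typical delay.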

In [None]:
sns.kdeplot(data = df, x = "Arrival Delay in Minutes", multiple="stack",fill=True, common_norm=False,alpha=.8, linewidth=0)

In [None]:
sns.kdeplot(data = df, x = "Flight Distance", multiple="stack",fill=True, common_norm=False,alpha=.8, linewidth=0)

In [None]:
sns.kdeplot(data = df, x = "Flight Distance",hue = "satisfaction" , palette="ch:.25",multiple="stack",fill=True, common_norm=False,alpha=.8, linewidth=0)

In [None]:
plt.figure(figsize = (10, 6))
sns.scatterplot(x = 'Departure Delay in Minutes', y = 'Arrival Delay in Minutes', data = df)
plt.show()

## Survey

In [None]:
survey = df[['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness']]

In [None]:
# Draw a count plot and optionally annotate each bar with its count and its share of all rows.

def count(df, col, hue = None, annot = True, ax = None):
    g = sns.countplot(data = df, x = col, hue = hue,palette="ch:.25_r", ax = ax)
    if annot:
        for p in g.patches:
            height = p.get_height()
            percent = height * 100 / len(df)
            # Center the label over the bar, whatever its width (bars are narrower when hue is set)
            g.annotate(f"{height:.0f}\n({percent:.2f}%)",(p.get_x() + p.get_width() / 2, height),
                      ha = "center", va = "bottom")

In [None]:
fig, ax = plt.subplots(7, 2, figsize = (20, 55))
for i, service in enumerate(survey):
    count(df, service, ax = ax[i // 2, i % 2])
plt.show()

In [None]:
fig, ax = plt.subplots(7, 2, figsize = (20, 55))
for i, service in enumerate(survey):
    count(df, service, hue = "satisfaction", ax = ax[i // 2, i % 2])
plt.show()

In [None]:
fig, ax = plt.subplots(7, 2, figsize = (20, 55))
for i, service in enumerate(survey):
    count(df, service, hue = "Class", ax = ax[i // 2, i % 2])
plt.show()

## Survey Rating

In [None]:
# This code calculates the mean value for each survey and sorts them in descending order.

survey_rate_list = []
for i in survey.columns:
    survey_rate = survey[i].mean()
    survey_rate_list.append(survey_rate)
scores_table = list(zip(list(survey.columns), survey_rate_list))
survey_sys = pd.DataFrame(scores_table,columns=['survey', 'survey_rate'])
survey_sys.sort_values('survey_rate',ascending=False,inplace=True)
survey_sys.reset_index(inplace = True, drop = True)
survey_sys
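
The loop above can also be written as a single pandas chain, since `DataFrame.mean()` already returns one value per column. The sketch below reproduces the same table layout on a small made-up frame (`survey_demo` and its values are hypothetical):

```python
import pandas as pd

# Made-up ratings for three of the survey columns.
survey_demo = pd.DataFrame({
    "Inflight wifi service": [2, 3, 1, 2],
    "Baggage handling": [4, 5, 4, 3],
    "Cleanliness": [3, 3, 4, 2],
})

survey_sys_demo = (
    survey_demo.mean()                  # one mean per column
    .sort_values(ascending=False)       # best-rated first
    .rename_axis("survey")              # name the index so it becomes a column
    .reset_index(name="survey_rate")    # Series -> two-column DataFrame
)
print(survey_sys_demo)
```

The result has the same `survey` and `survey_rate` columns as `survey_sys`, so the plotting cell that follows would work unchanged.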

In [None]:
g = sns.catplot(y='survey',x='survey_rate',data=survey_sys,kind="point",palette="ch:.25_r")
g.fig.suptitle("survey Rating ",y=1)
plt.show()

## Satisfaction

In [None]:
plt.pie(df.satisfaction.value_counts(), labels = ["Neutral or dissatisfied", "Satisfied"],colors = sns.color_palette("ch:.25"), autopct = '%1.1f%%')
pass

# Conclusions

#### **The airline faces a serious problem: 56.6% of passengers are neutral or dissatisfied. Gender does not appear to affect passenger satisfaction, whereas other factors such as age, delay, and some services do seem to affect it.**

- **Gender:** Most passengers are female, and the majority of them are neutral or dissatisfied.

- **Age:** Satisfied passengers are mostly about 40-60 years old, while neutral or dissatisfied passengers are mostly about 20-40 years old.

- **Class:** 47.9% of passengers fly Business class, and most of them are satisfied. 44.9% fly Eco and 7.2% fly Eco Plus, and most passengers in both of these classes are neutral or dissatisfied.

- **Flight Distance:** Most flights cover about 600 distance units.

- **Airline Services:** The most common rating for most services is 4. The best-rated services are inflight service and baggage handling, while inflight Wi-Fi service is the worst-rated.

# PreProcessing

## Encoding

In [None]:
#Perform one-hot encoding

categorical_features = ['Customer Type', 'Type of Travel', 'Class']

df_encoded = pd.get_dummies(df, columns=categorical_features)

df = df_encoded

In [None]:
#Perform label encoding

categorical_column = 'satisfaction'

label_encoder = LabelEncoder()

df[categorical_column] = label_encoder.fit_transform(df[categorical_column])
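
`LabelEncoder` assigns integer codes in sorted order of the class names, so it is worth checking which label ends up as 0 and which as 1 before reading the metrics later on. A minimal check on a short made-up list of labels (the `labels` list is hypothetical, using the two values described for this column):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["satisfied", "neutral or dissatisfied", "satisfied"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the original labels; position i is the label encoded as i.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(codes.tolist())
print(mapping)
```

With these two labels, "neutral or dissatisfied" sorts first and becomes 0, and "satisfied" becomes 1, which makes "satisfied" the positive class for precision and recall.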

## Rescaling

In [None]:
# Showing the distribution of the data
df.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Detecting outliers
def detect_outliers_iqr(data):
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers
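
To see what the IQR rule does, here is the same function applied to a small made-up array (the `values` array is hypothetical). With quartiles at 3 and 7 the IQR is 4, so the bounds are -3 and 13, and only the value 100 falls outside them.

```python
import numpy as np

def detect_outliers_iqr(data):
    # Flag values more than 1.5 * IQR below Q1 or above Q3.
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data < lower_bound) | (data > upper_bound)

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])
mask = detect_outliers_iqr(values)

print(values[mask])                                 # the flagged values
print(f"{mask.mean() * 100:.1f}% flagged as outliers")
```

Taking the mean of the boolean mask gives the outlier share directly, which is equivalent to dividing the count of flagged values by the number of rows.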

In [None]:
# The Percentage of outliers in numerical features
for column in df.select_dtypes(include=[np.number]):
    outliers = detect_outliers_iqr(df[column])
    percentage_outliers = (sum(outliers) / len(df)) * 100
    print(f"Percentage of outliers in {column}: {percentage_outliers}%")

 * I will use two types of rescaling, because the data contains outliers and some scaling methods are sensitive to them.

In [None]:
#Perform Min-Max Scaling

numerical_features = df.select_dtypes(include=['number']).columns.tolist()


exclude_columns = ['Class_Eco Plus', 'Customer Type_disloyal Customer', 'Customer Type_Loyal Customer', 'Checkin service' , 'Arrival Delay in Minutes' , 'Departure Delay in Minutes','Flight Distance']
numerical_features = [col for col in numerical_features if col not in exclude_columns]
scaler = MinMaxScaler()

df[numerical_features] = scaler.fit_transform(df[numerical_features])

In [None]:
#Perform Robust Scaling

numerical_features = ['Class_Eco Plus', 'Customer Type_disloyal Customer', 'Customer Type_Loyal Customer', 'Checkin service' , 'Arrival Delay in Minutes' , 'Departure Delay in Minutes','Flight Distance']

scaler = RobustScaler()

df[numerical_features] = scaler.fit_transform(df[numerical_features])
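
The reason for treating the outlier-heavy columns differently can be seen on a single made-up column (the `delays` array is hypothetical). `MinMaxScaler` maps the minimum to 0 and the maximum to 1, so one extreme value squeezes everything else toward 0. `RobustScaler` subtracts the median and divides by the IQR, so the typical values keep a usable spread.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four ordinary delays and one extreme delay, as a single-column 2D array.
delays = np.array([[0.0], [5.0], [10.0], [15.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(delays).ravel()
robust = RobustScaler().fit_transform(delays).ravel()

print("min-max:", minmax)   # ordinary values crowd near 0
print("robust: ", robust)   # ordinary values spread around 0
```

After min-max scaling the four ordinary values all lie between 0 and 0.015, while after robust scaling they lie between -1 and 0.5 and only the extreme value stands far out.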

## Features Correlation

In [None]:
# The correlation between features
corr=df.corr()
sns.heatmap(corr,cmap='RdBu', vmax=None, center=0,square=True, linewidths=.5, cbar_kws={"shrink": .9})

# Modeling

In [None]:
x = df.drop('satisfaction', axis=1)
y = df['satisfaction']

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
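
In this notebook the scalers are fitted on the full DataFrame before the split, so the test rows contribute to the scaling statistics. A common alternative, shown here only as a sketch and not as what this notebook does, is to put the scaler and the model in a scikit-learn pipeline so that the scaler is fitted on the training rows alone. The data below is synthetic (`make_classification`), and the `*_demo`, `X_tr`, and `pipe` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary classification data standing in for the airline features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() scales with statistics from X_tr only; predict() reuses those statistics on X_te.
pipe = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
demo_accuracy = accuracy_score(y_te, pipe.predict(X_te))

# The fitted scaler's center is the per-feature median of the training rows.
train_medians = np.median(X_tr, axis=0)
print(np.allclose(pipe.named_steps["robustscaler"].center_, train_medians))
print(f"accuracy on held-out rows: {demo_accuracy:.3f}")
```

The same pattern would apply to any of the classifiers used below by swapping the last pipeline step.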

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
log_Y_pred = model.predict(x_test)

In [None]:
gaussian = GaussianNB()
gaussian.fit(x_train,y_train)
gaussian_Y_pred = gaussian.predict(x_test)

In [None]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(x_train, y_train)
random_forest_Y_pred = random_forest.predict(x_test)

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(x_train,y_train)
knn_Y_pred = knn.predict(x_test)

In [None]:
# This code compares different machine learning models using various metrics and visualizes the results.

models = [model,gaussian,random_forest,knn]
model_names = ['Logistic Regression', 'Naive Bayes', 'Random Forest', 'KNN']

metrics = {
    "Accuracy": accuracy_score,
    "Precision": precision_score,
    "Recall": recall_score,
    "F1 Score": f1_score
}

results = []

for clf, name in zip(models, model_names):  # `clf` avoids overwriting the fitted `model` defined above
    model_preds = clf.predict(x_test)
    model_results = {"Model": name}
    for metric_name, metric_func in metrics.items():
        model_results[metric_name] = metric_func(y_test, model_preds)
    results.append(model_results)

results_df = pd.DataFrame(results)
print(results_df)

plt.figure(figsize=(10, 6))
plt.bar(results_df['Model'], results_df['Accuracy'], color='skyblue')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.show()

* Random Forest appears to be the best model.

In [None]:
# This code generates a confusion matrix for the Random Forest model's predictions.

conf_mat = confusion_matrix(y_test, random_forest_Y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

**`Precision:`** The precision of the Random Forest model is 0.97

**`Recall:`** The recall of the Random Forest model is 0.93

**`F1-score:`** The F1-score of the Random Forest model is 0.95

**`Accuracy:`** The overall accuracy of the Random Forest model is 0.96
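
These four numbers can all be derived from the cells of a confusion matrix. The sketch below works through the arithmetic on ten made-up labels (not the notebook's results) and checks it against the scikit-learn functions. Class 1 is treated as the positive class.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                      # 4 / 5
recall = tp / (tp + fn)                         # 4 / 6
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # 7 / 10

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")

# The hand-computed values agree with scikit-learn.
print(np.isclose(precision, precision_score(y_true, y_pred)),
      np.isclose(recall, recall_score(y_true, y_pred)),
      np.isclose(f1, f1_score(y_true, y_pred)),
      np.isclose(accuracy, accuracy_score(y_true, y_pred)))
```

Precision is the share of predicted positives that are correct, and recall is the share of actual positives that are found, so a precision above recall means the model misses some positives more often than it raises false alarms.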