<a href="https://colab.research.google.com/github/AnandSaumya/Airlines-Customer-Satisfaction/blob/main/Airlines_Customer_Satisfaction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$\huge{\textbf{Analyzing Airlines Customer Data}}$



---


$\textit{This is a self-paced project focused on analyzing airlines customer data to understand the dynamics behind the customer satisfaction and the revenue generated.}$

The dataset for this project was taken from Kaggle. It was originally provided by an airline to the participants of a tech competition. To protect sensitive data, the names have been changed.

# $\text{Table of Contents}$

1. Data Preparation
2. Exploratory Data Analysis
3. Data Preparation for Modeling Purposes
4. Model Building

# Data Preparation



---

The following steps were taken to prepare the data for analysis:

1.   The data is loaded from the author's GitHub repository.
2.   The data is checked for nulls and conflicting data types to ensure uniformity.
3.   The data is cleaned by imputing the nulls with an estimate (the departure delay plus the mean arrival delay).
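
Steps 2 and 3 can be sketched on a small, made-up frame. The values below are hypothetical and only the two delay column names come from the dataset; the fill rule (departure delay plus the mean arrival delay) mirrors the assumption used in this notebook, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names match the real dataset
toy = pd.DataFrame({
    'Departure Delay in Minutes': [0, 10, 30, 5],
    'Arrival Delay in Minutes': [2.0, np.nan, 28.0, np.nan],
})

# Step 2: count the nulls in each column
null_counts = toy.isnull().sum()

# Step 3: fill each missing arrival delay with departure delay + mean arrival delay
mean_arrival = toy['Arrival Delay in Minutes'].mean()  # NaNs are skipped by default
missing = toy['Arrival Delay in Minutes'].isnull()
toy.loc[missing, 'Arrival Delay in Minutes'] = (
    toy.loc[missing, 'Departure Delay in Minutes'] + mean_arrival
)

print(null_counts)
print(toy)
```

Restricting the right-hand side to the `missing` rows makes it explicit that only those rows are touched; pandas would also align on the index if the full column were used.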



In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import scipy as sc
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Load the dataset.
# To load a data file from GitHub: 1) replace https://github.com/ with https://raw.githubusercontent.com/
# and 2) remove /blob/ from the URL.
df = pd.read_csv('https://raw.githubusercontent.com/AnandSaumya/Airlines-Customer-Satisfaction/main/Dataset/Invistico_Airline.csv')
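
The two URL changes described in the comment above can be wrapped in a small helper. This is a minimal sketch: `to_raw_github_url` is a hypothetical name, and the `page_url` shown is an assumption obtained by reversing the rule on the raw URL used in this notebook.

```python
def to_raw_github_url(url: str) -> str:
    """Convert a github.com file URL into its raw.githubusercontent.com form."""
    # 1) Swap the host, 2) drop the first '/blob/' path segment
    raw = url.replace('https://github.com/', 'https://raw.githubusercontent.com/', 1)
    return raw.replace('/blob/', '/', 1)


# Assumed browser URL of the data file (derived, not taken from the notebook)
page_url = 'https://github.com/AnandSaumya/Airlines-Customer-Satisfaction/blob/main/Dataset/Invistico_Airline.csv'
raw_url = to_raw_github_url(page_url)
print(raw_url)
```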

In [None]:
# View a couple of rows of the dataset
df.head()

In [None]:
# View what the datatype is of the various columns in the dataset
df.info()

In [None]:
# Get the number of nulls in each column
df.isnull().sum()

In [None]:
# Mean arrival delay, rounded to one decimal place (missing values are skipped)
mean_of_arrival_delay = np.round(np.mean(df['Arrival Delay in Minutes']), 1)
mean_of_arrival_delay

In [None]:
# Impute the missing arrival delays as departure delay + mean arrival delay,
# UNDER THE ASSUMPTION that the airline wasn't able to recover the time lost to the departure delay.
df.loc[df['Arrival Delay in Minutes'].isnull(), 'Arrival Delay in Minutes'] = (df['Departure Delay in Minutes'] + mean_of_arrival_delay)

In [None]:
# Making sure no nulls are left
df.isnull().sum()

# Exploratory Data Analysis

In [None]:
# Understanding the customer satisfaction distribution

plt.figure(figsize=(8, 5))
sns.countplot(x='satisfaction', data=df, palette='coolwarm')
plt.title('Distribution of Customer Satisfaction')
plt.xlabel('Satisfaction Level')
plt.ylabel('Count')
plt.show()

The chart above gives an important insight into the satisfaction level of the customers: the count of `satisfied` customers is clearly higher than the count of `dissatisfied` customers, which helps us understand how much confidence our customers have in our services.
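
To put a number on the gap seen in the chart, the counts can be turned into shares with `value_counts(normalize=True)`. The sketch below uses hypothetical labels rather than the real data; in the notebook, the same two calls would be applied to `df['satisfaction']`.

```python
import pandas as pd

# Hypothetical labels, not the real data
toy = pd.Series(['satisfied'] * 6 + ['dissatisfied'] * 4, name='satisfaction')

counts = toy.value_counts()                # absolute counts per label
shares = toy.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```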


In [None]:
# Flight Class v/s Satisfaction
plt.figure(figsize=(8, 5))
sns.countplot(x='Class', hue='satisfaction', data=df, palette='Set2')
plt.title('Satisfaction Level Across Flight Classes')
plt.xlabel('Flight Class')
plt.ylabel('Count')
plt.show()

# Numerical values
class_satisfaction_counts = df.groupby(['Class', 'satisfaction']).size().unstack(fill_value=0)
class_satisfaction_counts

In [None]:
# Calculate the ratio of satisfied to dissatisfied customers for each class
class_satisfaction_ratios = class_satisfaction_counts['satisfied'] / class_satisfaction_counts['dissatisfied']

# Print the ratios
business_ratio=np.round(class_satisfaction_ratios.loc['Business'],2)
print("The Business class ratio of satisfied to dissatisfied customers is ",business_ratio)

eco_ratio=np.round(class_satisfaction_ratios.loc['Eco'],2)
print("The Economy class ratio of satisfied to dissatisfied customers is ",eco_ratio)

eco_plus_ratio=np.round(class_satisfaction_ratios.loc['Eco Plus'],2)
print("The Economy Plus class ratio of satisfied to dissatisfied customers is ",eco_plus_ratio)

From the above, the ratio of satisfied to dissatisfied customers is highest in Business class, whereas the ratios for Economy and Economy Plus are close to each other.
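
A ratio $r$ of satisfied to dissatisfied customers corresponds to a satisfied share of $r / (1 + r)$, which can be easier to compare across classes. The sketch below checks this on hypothetical toy data (the class labels are borrowed from the dataset, the rows are made up) and compares it with the row-normalized cross-tabulation.

```python
import pandas as pd

# Hypothetical rows, not the real data
toy = pd.DataFrame({
    'Class': ['Business'] * 4 + ['Eco'] * 4,
    'satisfaction': ['satisfied', 'satisfied', 'satisfied', 'dissatisfied',
                     'satisfied', 'satisfied', 'dissatisfied', 'dissatisfied'],
})

counts = pd.crosstab(toy['Class'], toy['satisfaction'])
ratios = counts['satisfied'] / counts['dissatisfied']  # satisfied : dissatisfied
rates = ratios / (1 + ratios)                          # share of satisfied customers

# The same shares, computed directly by normalizing each row
row_shares = pd.crosstab(toy['Class'], toy['satisfaction'], normalize='index')

print(ratios)
print(rates)
print(row_shares)
```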

If additional information were available on revenue, occupancy rates, and operating costs for each fare class, along with the corresponding figures per square meter of cabin space, we could compare the three classes more thoroughly and see which cabin would be the most profitable to expand or shrink.

In [None]:
# Aggregate the data to get satisfaction counts by Gender
satisfaction_counts = df.groupby(['Gender', 'satisfaction']).size().reset_index(name='Count')

# Create an interactive bar chart
fig = px.bar(
    satisfaction_counts,
    x='Gender',
    y='Count',
    color='satisfaction',
    barmode='group',
    title='Satisfaction Level vs Gender',
    labels={'Count': 'Number of Customers', 'Gender': 'Gender', 'satisfaction': 'Satisfaction Level'},
    hover_data={'Count': True},  # Show the count in the hover tooltip
)

# Show the interactive plot
fig.show()

# Show the satisfaction counts in tabular form
satisfaction_counts

From the chart above, we can see that, in this dataset, the satisfaction level among women is higher than among men. Overall, the dataset indicates that male customers are more often dissatisfied.
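
One way to check whether such a difference could be due to chance is a chi-square test of independence on the gender-by-satisfaction counts. This is only a sketch under stated assumptions: the counts below are hypothetical, and the test is a suggested addition rather than part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = two gender groups, columns = (dissatisfied, satisfied)
observed = np.array([
    [40, 60],
    [55, 45],
])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the counts expected if gender and satisfaction were independent
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}, dof = {dof}")
print(expected)
```

In the notebook, the `observed` table could be built with `pd.crosstab(df['Gender'], df['satisfaction'])`.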

In [None]:
# Customer Type v/s Satisfaction
plt.figure(figsize=(8, 5))
sns.countplot(x='satisfaction', hue='Customer Type', data=df, palette='husl')
plt.title('Customer Type vs Satisfaction')
plt.xlabel('Satisfaction Level')
plt.ylabel('Count')
plt.show()

# Calculate the ratio of satisfied to dissatisfied customers for each customer type
customer_type_satisfaction_counts = df.groupby(['Customer Type', 'satisfaction']).size().unstack(fill_value=0)
customer_type_satisfaction_ratios = customer_type_satisfaction_counts['satisfied'] / customer_type_satisfaction_counts['dissatisfied']

for customer_type, ratio in customer_type_satisfaction_ratios.items():
    print(f"The {customer_type} type ratio of satisfied to dissatisfied customers is {ratio:.2f}")

From the above, the ratio of satisfied to dissatisfied customers is almost twice as high for `Loyal Customers` as for the other customer type.

# Data Preparation for Modeling Purposes

In [None]:
# Convert target variable to binary (the labels used above are 'satisfied' and 'dissatisfied')
df['satisfaction'] = df['satisfaction'].map({'satisfied': 1, 'dissatisfied': 0})

# Identify categorical and numerical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Remove target column from numerical_cols if present
if 'satisfaction' in numerical_cols:
    numerical_cols.remove('satisfaction')
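
When encoding the target with `Series.map`, any label that is missing from the mapping becomes `NaN` without raising an error, so it is worth checking the result before modeling. A minimal sketch on hypothetical labels (`label_map` and the mismatched key are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical labels, not the real data
toy = pd.Series(['satisfied', 'dissatisfied', 'satisfied'], name='satisfaction')

label_map = {'satisfied': 1, 'dissatisfied': 0}
encoded = toy.map(label_map)

# Count the labels that were not covered by the mapping
n_unmapped = int(encoded.isnull().sum())

# A mapping whose keys do not match the labels silently produces NaN
mismatched = toy.map({'satisfied': 1, 'not satisfied': 0})

print(encoded.tolist(), n_unmapped)
print(mismatched.tolist())
```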

# Model Building

In [None]:
# Define preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Define models to test
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

# Split data into training and testing sets
X = df.drop('satisfaction', axis=1)
y = df['satisfaction']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train and evaluate models
results = {}
for model_name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{model_name} Accuracy: {acc:.2f}")
    print(classification_report(y_test, y_pred))
    results[model_name] = acc

# Identify the best model
best_model_name = max(results, key=results.get)
print(f"Best Model: {best_model_name} with accuracy {results[best_model_name]:.2f}")
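
Accuracy alone does not show which kind of error a model makes. The `confusion_matrix` function imported at the top of the notebook breaks the predictions down by true and predicted class. The sketch below uses synthetic data from `make_classification` and a single logistic regression pipeline, so the numbers say nothing about the airline dataset; the `*_demo` names are hypothetical and chosen to avoid clashing with the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the airline dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_tr, y_tr)
y_hat = demo_pipeline.predict(X_te)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_te, y_hat)
acc_demo = accuracy_score(y_te, y_hat)

print(cm)
print(f"Accuracy: {acc_demo:.2f}")
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total equals the accuracy.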
