# Neural Network Implementation

# Project Description

The project aims to analyze and predict passenger satisfaction with their airline travel experience using machine learning techniques. It utilizes a dataset containing various attributes related to passengers' demographics, travel preferences, flight details, and ratings for different services provided during the flight. By leveraging this dataset, the project seeks to understand the factors that influence passenger satisfaction and develop predictive models to forecast whether a passenger will be satisfied or dissatisfied based on their characteristics and flight-related factors.

The project involves several steps, including data preprocessing, exploratory data analysis (EDA), feature engineering, model selection, training, and evaluation. Machine learning algorithms such as decision trees are initially explored. Furthermore, neural networks are introduced to explore their potential for capturing complex patterns in the data that may lead to improved prediction accuracy compared to traditional models.

# Imports

In [64]:
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Clean Dataset

## Airplane

The CSV file contains detailed information about airline passengers' travel experiences and satisfaction levels. Each row in the CSV file represents a single passenger's feedback, while the columns represent different attributes and ratings associated with their travel experience. Key attributes include passenger demographics (e.g., gender, age), travel details (e.g., flight distance, type of travel), and ratings for various services provided during the flight (e.g., seat comfort, inflight entertainment). Additionally, the file includes columns for departure and arrival delays and the overall satisfaction level of passengers.

The dataset serves as the primary source of information for the project, providing valuable insights into passenger preferences, behaviors, and satisfaction levels. It is used for exploratory data analysis, feature engineering, model training, and evaluation to develop effective predictive models for understanding and predicting passenger satisfaction with airline travel.

- Reading the CSV file

In [65]:
data = pd.read_csv("csv/Airplane.csv")

- Dropping missing data

In [66]:
data = data.dropna()

- Converting non-numeric values to numeric

In [67]:
data["satisfaction"] = data["satisfaction"].map(
    {"neutral or dissatisfied": 0, "satisfied": 1}).astype(int)
data["Customer Type"] = data["Customer Type"].map(
    {"disloyal Customer": 0, "Loyal Customer": 1}).astype(int)
data["Type of Travel"] = data["Type of Travel"].map(
    {"Personal Travel": 0, "Business travel": 1}).astype(int)
data["Gender"] = data["Gender"].map({"Female": 0, "Male": 1}).astype(int)
data["Class"] = data["Class"].map(
    {"Eco": 0, "Eco Plus": 1, "Business": 2}).astype(int)
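
`Series.map` with a dictionary returns NaN for any value that is not a key, and the following `.astype(int)` then raises an error. A minimal sketch of checking for unmapped labels before casting, using a made-up toy column and a hypothetical helper `map_strict` (neither is part of the original notebook):

```python
import pandas as pd


def map_strict(series, mapping):
    """Map labels to integers, failing loudly on labels missing from `mapping`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    mapped = series.map(mapping)
    unmapped = series[mapped.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped labels in {series.name!r}: {unmapped}")
    return mapped.astype(int)


# Toy data standing in for the real "Class" column.
toy = pd.Series(["Eco", "Business", "Eco Plus", "Eco"], name="Class")
class_codes = map_strict(toy, {"Eco": 0, "Eco Plus": 1, "Business": 2})
print(class_codes.tolist())
```

Naming the offending labels in the error makes a misspelled or unexpected category easier to trace than the generic cast failure.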

- Categorizing continuous data

- -- Arrival Delay in Minutes

In [68]:
data.loc[data["Arrival Delay in Minutes"] <= 5, "Arrival Delay in Minutes"] = 0
data.loc[(data["Arrival Delay in Minutes"] > 5),
         "Arrival Delay in Minutes"] = 1

- -- Age

In [69]:
data.loc[data["Age"] <= 20, "Age"] = 0
data.loc[(data["Age"] > 20) & (data["Age"] <= 39), "Age"] = 1
data.loc[(data["Age"] > 39) & (data["Age"] <= 60), "Age"] = 2
data.loc[(data["Age"] > 60), "Age"] = 3
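
The threshold-by-threshold assignments in this section amount to binning. As a sketch of an equivalent formulation (not the notebook's own code), `pd.cut` with `labels=False` returns the integer bin index directly. The ages below are made up, and the bin edges are copied from the Age cell above:

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on both sides of every bin edge.
ages = pd.Series([7, 20, 21, 39, 40, 60, 61, 85], name="Age")

# Intervals are right-closed by default: (-inf, 20], (20, 39], (39, 60], (60, inf)
age_bins = [-np.inf, 20, 39, 60, np.inf]
age_codes = pd.cut(ages, bins=age_bins, labels=False)

print(age_codes.tolist())
```

Because the result is a new object, the original values stay available for inspection instead of being overwritten step by step.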

- -- Cleanliness

In [70]:
data.loc[data["Cleanliness"] < 3, "Cleanliness"] = 0
data.loc[data["Cleanliness"] == 3, "Cleanliness"] = 1
data.loc[(data["Cleanliness"] > 3), "Cleanliness"] = 2

- -- Flight Distance

In [71]:
data.loc[data["Flight Distance"] <= 1000, "Flight Distance"] = 0
data.loc[(data["Flight Distance"] > 1000) & (
    data["Flight Distance"] <= 2000), "Flight Distance"] = 1
data.loc[(data["Flight Distance"] > 2000) & (
    data["Flight Distance"] <= 3000), "Flight Distance"] = 2
data.loc[(data["Flight Distance"] > 3000), "Flight Distance"] = 3

- -- Departure Delay in Minutes

In [72]:
data.loc[data["Departure Delay in Minutes"]
         <= 5, "Departure Delay in Minutes"] = 0
data.loc[(data["Departure Delay in Minutes"] > 5) & (
    data["Departure Delay in Minutes"] <= 25), "Departure Delay in Minutes"] = 1
data.loc[(data["Departure Delay in Minutes"] > 25),
         "Departure Delay in Minutes"] = 2

- Selecting the last 10,000 rows as test data

In [73]:
test = data.tail(10000)

- Keeping the first 90,000 rows as the pool for training data

In [74]:
data = data.head(90000)
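
Note that `head(90000)` and `tail(10000)` are disjoint only when the cleaned frame has at least 100,000 rows. A sketch of a positional split that is disjoint for any frame length, shown on a made-up 12-row frame with a hypothetical hold-out size `n_holdout`:

```python
import pandas as pd

# Toy frame; the real notebook uses the cleaned airline data.
frame = pd.DataFrame({"value": range(12)})
n_holdout = 4  # hypothetical; the notebook holds out 10,000 rows

holdout = frame.iloc[-n_holdout:]    # last n_holdout rows
remaining = frame.iloc[:-n_holdout]  # everything before them

# The two parts never share a row and together cover the frame.
assert holdout.index.intersection(remaining.index).empty
assert len(holdout) + len(remaining) == len(frame)
```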

- Separating data for satisfied and neutral or dissatisfied

In [75]:
satisfaction_0 = data[data['satisfaction'] == 0]
satisfaction_1 = data[data['satisfaction'] == 1]
random.seed(43)

- Selecting 10,000 random samples from each group

In [76]:
satisfaction_0_random = random.sample(satisfaction_0.index.tolist(), 10000)
satisfaction_1_random = random.sample(satisfaction_1.index.tolist(), 10000)

- Combining these two data sets

In [77]:
data = pd.concat([data.loc[satisfaction_0_random],
                 data.loc[satisfaction_1_random]])
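
The balanced draw above can also be expressed in pandas alone. A sketch, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available) and using made-up toy data with a hypothetical per-class size `n_per_class`. It will not select the same rows as `random.sample` with seed 43:

```python
import pandas as pd

# Toy data: 6 rows of class 0 and 4 rows of class 1.
toy = pd.DataFrame({
    "satisfaction": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [0, 1, 2, 3, 1, 2, 0, 1, 2, 3],
})
n_per_class = 3  # hypothetical; the notebook draws 10,000 per class

# Draw the same number of rows from each class, without replacement.
balanced = toy.groupby("satisfaction").sample(n=n_per_class, random_state=43)

print(balanced["satisfaction"].value_counts().to_dict())
```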

- Dropping unnecessary columns

In [78]:
data = data.drop(["index", "id", "Gender"], axis=1)
test = test.drop(["index", "id", "Gender"], axis=1)

- Showing the clean data

In [79]:
# Data is not shown due to large size
# data

In [80]:
print(data[["Age", "satisfaction"]].groupby(["Age"], as_index=False).mean())

   Age  satisfaction
0    0      0.245791
1    1      0.456366
2    2      0.634953
3    3      0.253857


# Neural Network Implementation

#### Data Preparation:
1. **Target and Features Extraction**: The target variable `satisfaction` is extracted from the dataset `data`, while the feature variables are obtained by dropping the `satisfaction` column from `data`.

2. **Train-Test Split**: The dataset is split with `train_test_split` from `sklearn.model_selection`: 80% of the rows go to the training set (`X_train`, `y_train`) and 20% to the test set (`X_test`, `y_test`). With the 20,000-row balanced sample, that is 16,000 training rows and 4,000 test rows. Fixing `random_state` makes the split reproducible.


In [81]:
y = data['satisfaction']
x = data.drop(["satisfaction"], axis=1)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

#### Data Normalization:
3. **Normalization**: The feature data is standardized using `StandardScaler` from `sklearn.preprocessing`. The scaler is fit on the training split only and then applied unchanged to the test split. This puts all features on the same scale, which helps the neural network converge efficiently during training.

In [82]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
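
Fitting the scaler on the training split only keeps test-set statistics out of preprocessing. A sketch on made-up numbers showing that the training columns end up with mean 0 and unit variance, while the held-out rows are transformed with the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices (2 features).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
held_out = np.array([[2.5, 25.0], [5.0, 50.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)    # learns mean_ and scale_ from train
held_out_scaled = demo_scaler.transform(held_out)  # reuses the training statistics

# Same result as applying (x - mean) / std by hand with the training statistics.
manual = (held_out - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(held_out_scaled, manual))
```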

#### Model Building and Training:
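4. **MLPRegressor Model**: An `MLPRegressor` model is initialized with two hidden layers containing 100 and 50 neurons respectively (`hidden_layer_sizes=(100, 50)`). `max_iter=500` specifies the maximum number of training iterations, and `random_state=42` ensures reproducibility. Because `MLPRegressor` is a regression model, it treats the 0/1 `satisfaction` label as a numeric target and outputs continuous values rather than class labels.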

5. **Model Training**: The model is trained on the scaled training data (`X_train_scaled`, `y_train`) using the `fit` method.

model = MLPRegressor(hidden_layer_sizes=(100, 50),
                     max_iter=500, random_state=42)
model.fit(X_train_scaled, y_train)
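
Because `satisfaction` is binary, a classifier is a natural alternative to regressing on the 0/1 labels. The sketch below swaps in `MLPClassifier` (not the model used in this notebook) on synthetic data from `make_classification`. The dataset, the smaller layer sizes, and all variable names are assumptions chosen to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the airline features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
Xd_train_scaled = demo_scaler.fit_transform(Xd_train)
Xd_test_scaled = demo_scaler.transform(Xd_test)

clf = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=500, random_state=42)
clf.fit(Xd_train_scaled, yd_train)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_class1 = clf.predict_proba(Xd_test_scaled)[:, 1]
demo_accuracy = clf.score(Xd_test_scaled, yd_test)
print(f"Accuracy on synthetic data: {demo_accuracy * 100:.2f}%")
```

With a classifier, the outputs of `predict_proba` are genuine probability estimates in [0, 1], so a decision threshold applied to them has a direct interpretation.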

#### Model Evaluation:
6. **Evaluation Metrics**: After training, the model predicts the satisfaction scores (`y_pred`) for the test set (`X_test_scaled`). The Mean Squared Error (MSE) is computed between the actual satisfaction scores (`y_test`) and the predicted scores (`y_pred`) using `mean_squared_error` from `sklearn.metrics`. Lower MSE values indicate better model performance.


In [83]:
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 0.04765577012398413
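
For reference, the MSE is the mean of the squared differences between actual and predicted values. A sketch on made-up numbers that computes it by hand and checks the result against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and continuous predictions.
actual = np.array([0, 1, 1, 0])
predicted = np.array([0.1, 0.8, 0.6, 0.3])

# MSE = mean((actual - predicted)^2)
manual_mse = np.mean((actual - predicted) ** 2)
sklearn_mse = mean_squared_error(actual, predicted)

print(np.isclose(manual_mse, sklearn_mse))
```

With 0/1 targets, the MSE of about 0.048 reported above corresponds to a root-mean-square error of about 0.22.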


#### Conclusion:
This notebook demonstrates the process of building an `MLPRegressor` model for predicting passenger satisfaction from airline travel data. It covers data preparation (feature extraction, train-test split, and normalization), followed by model training and evaluation. The MSE gives a quantitative measure of how closely the model's continuous outputs match the 0/1 satisfaction labels on the test split, which helps assess its effectiveness and guide potential improvements.

#### Function `compare_y_test_and_y_pred`

1. **Purpose**: This function compares the predicted values (`y_pred`) with the actual values (`y_test`) and calculates the accuracy of a binary classification task based on a specified threshold.

2. **Parameters**:
   - `y_test`: The actual target values from the test set.
   - `y_pred`: The predicted target values from the model.
   - `threshold`: (Optional) Cut-off used to convert the model's continuous predictions into binary predictions. Defaults to 0.5.

3. **Steps**:
   - **Convert Predictions**: Each value in `y_pred` is compared against `threshold`; values greater than `threshold` become 1 and all others become 0, giving the binary class labels `y_pred_class`.
   
   - **Calculate Accuracy**: Using `accuracy_score` from `sklearn.metrics`, the accuracy of the binary classification is computed by comparing `y_test` (actual values) with `y_pred_class` (predicted values).

4. **Output**: The function prints the calculated accuracy as a percentage with two decimal places.


In [84]:
def compare_y_test_and_y_pred(y_test, y_pred, threshold=0.5):
    # Convert predicted values to binary classification
    y_pred_class = (y_pred > threshold).astype(int)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred_class)
    print(f'Accuracy: {accuracy * 100:.2f}%')

#### Example Usage

- `compare_y_test_and_y_pred(y_test, y_pred)`: This function call compares the predicted satisfaction scores (`y_pred`) with the actual satisfaction scores (`y_test`) using a default threshold of 0.5. It then prints the accuracy of the binary classification task based on these predictions.

This function is useful for evaluating models that predict binary outcomes, such as passenger satisfaction (satisfied versus neutral or dissatisfied) in this project. Adjusting the `threshold` parameter makes it possible to explore different trade-offs between sensitivity and specificity: a lower threshold labels more passengers as satisfied, and a higher threshold labels fewer.

In [85]:
compare_y_test_and_y_pred(y_test, y_pred)

Accuracy: 94.60%
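
To make the sensitivity/specificity trade-off concrete, the sketch below sweeps a few thresholds over made-up scores and reports both rates. The helper `sensitivity_specificity` and the toy arrays are hypothetical and not part of the notebook:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) after thresholding continuous scores."""
    y_class = (np.asarray(y_score) > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_class, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels and model outputs.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9])

for t in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(labels, scores, threshold=t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, raising the threshold lowers sensitivity and raises specificity, which is the trade-off described above.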
