<a href="https://colab.research.google.com/github/MonaRansing/ML-capstone-classification-Airline-Passenger-Referral-Prediction-project/blob/main/ML_Capstone_Classification_Project_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Airline Passenger Referral Prediction**



##### **Project Type**    - Classification

# **Project Summary -**

This project aims to predict whether passengers will recommend an airline to their friends using airline review data collected from 2006 to 2019. The dataset includes both multiple-choice and free-text responses. To achieve this, we will perform extensive exploratory data analysis (EDA) to understand patterns and trends in the data. We will then implement various classification algorithms, including Logistic Regression, Decision Trees, Naive Bayes, Support Vector Machines (SVM), and Random Forests. The performance of these models will be evaluated to determine the most effective method for predicting passenger recommendations. The goal is to provide insights that airlines can use to improve customer satisfaction and referral rates.

# **GitHub Link -**

https://github.com/MonaRansing/ML-capstone-classification-Airline-Passenger-Referral-Prediction-project.git

# **Problem Statement**


**Data includes airline reviews from 2006 to 2019 for popular airlines around the world with multiple choice and free text questions. Data is scraped in Spring 2019. The main objective is to predict whether passengers will refer the airline to their friends.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning algorithms and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC, LinearSVC
import time
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from scipy import stats
from scipy.stats import f_oneway
from scipy.stats import ttest_ind


# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# For statistical models and tests
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read dataset
dataset = pd.read_excel('/content/drive/MyDrive/Almabetter/Data Science/dataset/data_airline_reviews.xlsx')
airline_df = dataset.copy()

### Dataset First View

In [None]:
# Dataset First Look
airline_df.head()

In [None]:
airline_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airline_df.shape

In [None]:
airline_df.columns

### Dataset Information

In [None]:
# Dataset Info
airline_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values  = airline_df.duplicated().sum()
print("Number of duplicate values:", duplicate_values)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = airline_df.isnull().sum()
print("Missing values:\n", missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.barplot(x=airline_df.columns, y=airline_df.isnull().sum(), palette=sns.color_palette('rainbow'))
plt.title('Missing Values')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

### What did you know about your dataset?

Answer : The dataset has 131895 rows and 17 columns. It contains 70711 duplicate rows, and several columns have missing values.
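
The row, duplicate, and missing-value checks above can be collected into one summary. The sketch below is a minimal illustration on a toy DataFrame; the helper name `summarize_quality` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row/column counts, duplicate rows, and per-column missing percentages."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        # isnull().mean() is the share of missing values in each column
        "missing_pct": (df.isnull().mean() * 100).round(1).to_dict(),
    }

# Toy data (hypothetical): the second row repeats the first one
toy_df = pd.DataFrame({
    "airline": ["A", "A", "B", None],
    "overall": [7.0, 7.0, None, None],
})
summary = summarize_quality(toy_df)
print(summary)
```

Calling `summarize_quality(airline_df)` before and after `drop_duplicates` would show how much of the missing data comes from duplicated rows.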

## ***Understanding Your Variables***

In [None]:
# Dataset Columns
airline_df.columns

In [None]:
# Dataset Describe
airline_df.describe()

### Variables Description

* **airline** : Name of the airline.
* **overall** : Overall rating given to the trip, from 1 to 10.
* **author** : Author of the review.
* **review_date** : Date the customer wrote the review.
* **aircraft** : Type of aircraft.
* **traveller_type** : Type of traveller (e.g. business, leisure).
* **cabin** : Cabin class flown.
* **date_flown** : Date of the flight.
* **cabin_service** : Rated between 1-5.
* **food_bev** : Rated between 1-5.
* **entertainment** : Rated between 1-5.
* **ground_service** : Rated between 1-5.
* **value_for_money** : Rated between 1-5.
* **recommended** : Binary, target variable.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
airline_df.nunique()

## ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
duplicate_values  = airline_df.duplicated().sum()
print("Number of duplicate values:", duplicate_values)

In [None]:
# drop duplicate
airline_df.drop_duplicates(inplace=True)

In [None]:
duplicate_values  = airline_df.duplicated().sum()
print("Number of duplicate values:", duplicate_values)

In [None]:
# checking missing values
airline_df.isnull().sum()

### What all manipulations have you done and insights you found?

Answer : I dropped the duplicate rows from the dataset and then checked the remaining missing values.

## ***Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Top 10 airlines which have higher ratings?

In [None]:
# top airlines which have higher ratings
top_airlines = airline_df.groupby('airline')['overall'].mean().sort_values(ascending=False).head(10)
print(top_airlines)

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='airline', y='overall', data=airline_df, order=top_airlines.index, palette=sns.color_palette('rainbow'))
plt.title('Overall Ratings Across Airlines')
plt.xlabel('Airline')
plt.ylabel('Overall Rating')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : I used a bar plot because it makes it easy to compare the mean overall rating across airlines.

##### 2. What is/are the insight(s) found from the chart?

Answer : The chart shows the 10 airlines with the highest mean overall rating; Garuda Indonesia has the highest rating of all.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : Yes, this insight can have a positive impact on business. Airlines with lower ratings can study and adopt the practices of the airlines with higher ratings.

In [None]:
airline_df.columns

#### Chart - 2 : Which 'traveller_type' has the most reviews?

In [None]:
# find out traveller type which has the most review
traveller_type_counts = airline_df['traveller_type'].value_counts()
print(traveller_type_counts)

In [None]:
# visualize traveller type count using barplot
plt.figure(figsize=(8,4))
sns.barplot(x=traveller_type_counts.index, y=traveller_type_counts, palette=sns.color_palette('rainbow'))
plt.title('Traveller Type Count')
plt.xlabel('Traveller Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A bar plot is used because it effectively visualizes and compares the frequency or count of categorical data.

##### 2. What is/are the insight(s) found from the chart?

Answer : The bar plot shows that solo leisure travellers wrote the most reviews and business travellers wrote the fewest.

#### Chart - 3 : Which 'cabin' class has the most reviews?

In [None]:
# find out cabin class which has highest review
cabin_class_counts = airline_df['cabin'].value_counts()
print(cabin_class_counts)

In [None]:
# visualize cabin class count
plt.figure(figsize=(8,4))
sns.barplot(x=cabin_class_counts.index, y=cabin_class_counts, palette=sns.color_palette('rainbow'))
plt.title('Cabin Class Count')
plt.xlabel('Cabin Class')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A bar plot is used because it effectively visualizes and compares the frequency or count of categorical data.

##### 2. What is/are the insight(s) found from the chart?

Answer : The chart shows that economy class has the highest review count and first class has the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : The insight that the economy class receives the most reviews suggests a positive business impact, as it reflects a high level of customer engagement and interest. However, if first-class reviews consistently indicate dissatisfaction, it could lead to negative growth by deterring potential customers and impacting revenue in that segment, necessitating prompt attention to address any identified issues.

#### Chart - 4 : Top 10 'author' who has written the most reviews?


In [None]:
# count the reviews by author
author_reviews = airline_df['author'].value_counts().head(10)
print(author_reviews)

In [None]:
#visualize the review count per author using bar plot
plt.figure(figsize=(8,4))
sns.barplot(x=author_reviews.index, y=author_reviews.values, palette=sns.color_palette('rainbow'))
plt.title('Top 10 Authors by Number of Reviews')
plt.xlabel('Author')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A bar plot is used because it effectively visualizes and compares the frequency or count of categorical data.

##### 2. What is/are the insight(s) found from the chart?

Answer : The graph shows the 10 authors who have written the most reviews; Anders Pedersen has written the most.

#### Chart - 5 : What is the proportion of positive (recommended) vs. negative (not recommended) reviews?


In [None]:
# Calculate the number of recommended and not recommended reviews
recommendation_counts = airline_df['recommended'].value_counts()

# Calculate proportions
proportions = recommendation_counts / recommendation_counts.sum()

recommendation_counts, proportions

In [None]:
plt.figure(figsize=(8, 6))
plt.pie(proportions, labels=proportions.index, autopct='%1.1f%%', startangle=90, colors=['#66b3ff','#ff9999'])
plt.title('Proportion of Recommended vs. Not Recommended Reviews')
plt.legend(loc='lower left')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A pie chart is used to visually represent the proportions of different categories within a whole, making it easy to compare the relative sizes of these categories at a glance.

##### 2. What is/are the insight(s) found from the chart?

Answer : The pie chart shows that 47.7% of the reviews are positive and 52.3% are negative.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : The insights from the pie chart showing that 47.7% of reviews are positive and 52.3% are negative indicate that the majority of customer feedback is not favorable. This negative sentiment can hinder business growth by potentially deterring new customers and diminishing brand reputation. Addressing the issues highlighted in negative reviews could help improve customer satisfaction and drive positive business impact.

#### Chart - 6 : What are the top 10 most common routes in the dataset?


In [None]:
# Calculate top 10 most common routes
top_routes = airline_df['route'].value_counts().head(10)
print(top_routes)

In [None]:
# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_routes.index, y=top_routes, palette=sns.color_palette('pastel'))
plt.title('Top 10 Most Common Routes')
plt.xlabel('Route')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

Answer : I used a bar plot because it effectively compares the quantities of different categories, making it easy to see differences and trends across groups.

##### 2. What is/are the insight(s) found from the chart?

Answer : We can see that BKK to LHR is the most common route.

#### Chart - 7 : What is the distribution of 'food_bev' ratings?


In [None]:
# Calculate food_bev ratings
food_bev_ratings = airline_df['food_bev'].value_counts()
print(food_bev_ratings)

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=food_bev_ratings.index, y=food_bev_ratings, palette=sns.color_palette('pastel'))
plt.title('Distribution of Food and Beverage Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A bar plot of the rating counts gives a visual representation of how the ratings are distributed. It lets us quickly grasp the shape, center, and spread of the ratings and reveals underlying patterns and trends.

##### 2. What is/are the insight(s) found from the chart?

Answer : The bar plot shows that a rating of 1 is the most frequent, followed by 4.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : This insight points to a negative impact on business. Airlines therefore need to improve the quality and service of their food and beverages.

#### Chart - 8 : What is the most common 'ground_service' rating?


In [None]:
# calculate ground service rating
ground_service_ratings = airline_df['ground_service'].value_counts()
print(ground_service_ratings)

In [None]:
# create barplot
plt.figure(figsize=(8, 4))
sns.barplot(x=ground_service_ratings.index, y=ground_service_ratings, palette=sns.color_palette('pastel'))
plt.title('Most Common Ground Service Rating')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : I used a bar plot because it effectively compares the quantities of different categories, making it easy to see differences and trends across groups.

##### 2. What is/are the insight(s) found from the chart?

Answer : The plot shows that 1 is the most common ground service rating.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : The insight that 1 rating has the highest frequency suggests potential issues with customer satisfaction or product quality, impacting business reputation and customer retention negatively. Identifying and addressing the root causes behind these low ratings could lead to significant improvements in customer experience and ultimately drive positive business impact.

#### Chart - 9 : What is the most common 'seat_comfort' rating?


In [None]:
# Calculate seat comfort rating
seat_comfort_ratings = airline_df['seat_comfort'].value_counts()
print(seat_comfort_ratings)

In [None]:
# Creat barplot
plt.figure(figsize=(8, 4))
sns.barplot(x=seat_comfort_ratings.index, y=seat_comfort_ratings, palette=sns.color_palette('pastel'))
plt.title('Most Common Seat Comfort Rating')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : I used a bar plot because it effectively compares the quantities of different categories, making it easy to see differences and trends across groups.

##### 2. What is/are the insight(s) found from the chart?

Answer : The graph shows the distribution of seat comfort ratings, with a rating of 1 being the most common. This suggests dissatisfaction with seat comfort among a large portion of passengers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : This insight points to a negative impact on the business. The solution is to improve seat comfort: collect feedback from customers, resolve the issues they raise, and change the service accordingly.

#### Chart - 10 : What is the most common 'value_for_money' rating?


In [None]:
# Calculate value for money rating
value_for_money_ratings = airline_df['value_for_money'].value_counts()
print(value_for_money_ratings)

In [None]:
# Create barplot
plt.figure(figsize=(8, 4))
sns.barplot(x=value_for_money_ratings.index, y=value_for_money_ratings, palette=sns.color_palette('pastel'))
plt.title('Most Common Value for Money Rating')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A bar plot makes it easy to compare the count of each rating.

##### 2. What is/are the insight(s) found from the chart?

Answer : The graph shows that the most common rating is 1.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer : This insight points to a negative impact. The solution is to improve all the services that have lower ratings.

#### Chart - 11 : Comparison of all independent variable/features?

In [None]:
# comparing independent variables
airline_df.hist(bins=50, figsize=(20,15),color = 'red')
plt.show()

* In the overall feature, ratings of 1 to 2 occur most frequently. For seat comfort, a rating of 1 is the most common and a rating of 4 is the second most common.

* For cabin service, a rating of 5 is the most common and a rating of 1 is the second most common.

* For food_bev, ratings of 2, 4 and 5 occur with approximately equal frequency.

* For both entertainment and ground service, a rating of 3 is the most common and a rating of 1 is the second most common.

* For value for money, a rating of 1 is clearly the most common. This suggests that many passengers feel the airlines do not provide good service for the price.

## **Dropping Unnecessary Columns**

In [None]:
#Checking Percentage wise missing values.
def missing_values_table(df):
        # Percentage of missing values per column, computed on the DataFrame passed in
        percent_mis_val = df.isnull().sum()*100/len(df)
        missing_values_df = pd.DataFrame({'column_name': df.columns,
                                     'percent_mis_val': percent_mis_val})
        return missing_values_df.sort_values('percent_mis_val',ascending=False)

In [None]:
missing_values_table(airline_df)

In [None]:
# showing the unique aircraft names
airline_df['aircraft'].unique()

In [None]:
# checking number of unique aircraft
airline_df['aircraft'].nunique()

Drop aircraft column from the dataset

In [None]:
# dropping column
airline_df.drop('aircraft', axis=1, inplace=True)

In [None]:
airline_df = airline_df.drop(['author','review_date','route','date_flown','customer_review'],axis = 1)

In [None]:
airline_df.columns

Columns dropped from the dataset:

1. **Author**: This categorical column has high variability and is not required for prediction.
2. **Route**: This column is not needed for building the model as it is independent of the services and quality of travel.
3. **Date_flown**: This column is not needed for building the model because it is not a time series data, and there are overlapping time periods between different dates.
4. **Review_date**: Similar to `Date_flown`, this column is not needed for the model.
5. **Customer_review**: This column is closely related to the overall review feature of the dataset and is therefore redundant.

Based on the percentage of null values, we divide our data into two parts:

- **high_null**: Columns with a high percentage of null values.
- **low_null**: Columns with a low percentage of null values.
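
Such a split can also be derived from the data instead of being listed by hand. The sketch below is one possible way to do it; the helper name `split_by_null_share`, the 20% threshold, and the toy values are assumptions for illustration only.

```python
import pandas as pd

def split_by_null_share(df, columns, threshold=0.20):
    """Split columns into (high_null, low_null) by their share of missing values."""
    null_share = df[columns].isnull().mean()
    high = null_share[null_share > threshold].index.tolist()
    low = null_share[null_share <= threshold].index.tolist()
    return high, low

# Toy data (hypothetical): food_bev is mostly missing, overall is complete
toy = pd.DataFrame({
    "food_bev": [1.0, None, None, None],
    "overall": [1.0, 2.0, 3.0, 4.0],
})
high_cols, low_cols = split_by_null_share(toy, ["food_bev", "overall"])
print("high null share:", high_cols)
print("low null share:", low_cols)
```

Comparing the result on `airline_df` against the hand-written lists is a quick check that each column ends up in the group its name suggests.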

In [None]:
# split the numeric column
high_null = ['overall','seat_comfort','cabin_service','value_for_money']
low_null = ['food_bev','entertainment','ground_service']

In [None]:
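#Imputation technique using Quantile-1 value
def impute_by_q1_values(df,column):
  # 25th percentile of the non-null values (quantile skips NaN by default)
  Q1=df[column].quantile(0.25)
  # assign the result back instead of calling fillna(inplace=True) on a column selection
  df[column]=df[column].fillna(Q1)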

In [None]:
#Looping the null value column
for col in low_null:
  impute_by_q1_values(airline_df,col)

In [None]:
#Imputation technique using Median Imputation
def median_imputation(df,column):
  # assign the result back instead of calling fillna(inplace=True) on a column selection
  df[column]=df[column].fillna(df[column].median())

In [None]:
#Looping the null value column
for col in high_null:
  median_imputation(airline_df,col)

In [None]:
airline_df.head(1)

In [None]:
#Remove recommended null value row
airline_df.dropna(subset=['recommended'],inplace=True)

In [None]:
# fill missing traveller types with the most frequent value
airline_df['traveller_type'] = airline_df['traveller_type'].fillna(airline_df['traveller_type'].mode()[0])

In [None]:
# fill missing cabin classes with the most frequent value
airline_df['cabin'] = airline_df['cabin'].fillna(airline_df['cabin'].mode()[0])

In [None]:
airline_df.head(1)

In [None]:
# check null values
missing_values_table(airline_df)

In [None]:
airline_df.shape

## ***Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
missing_values_table(airline_df)

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
outlier_columns = list(set(airline_df.describe().columns) - {'recommended'})
outlier_columns

In [None]:
# Create boxplot
plt.figure(figsize=(10, 6))
for index, column in enumerate(outlier_columns):
    plt.subplot(3, 3, index+1)
    sns.boxplot(x=airline_df[column], orient='h')
    plt.title(column)
plt.tight_layout()
plt.show()

The box plots show no outliers in the dataset.
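
The visual check can be backed by a count. The sketch below applies the usual 1.5 × IQR rule, which is also what box plot whiskers are based on by default; the helper name `count_iqr_outliers` and the toy series are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Ratings confined to a 1-5 scale (toy values) produce no outliers
ratings = pd.Series([1, 2, 3, 3, 4, 5, 5])
# A single extreme value is flagged
with_extreme = pd.Series([3, 3, 3, 4, 4, 4, 40])

print(count_iqr_outliers(ratings))
print(count_iqr_outliers(with_extreme))
```

Running the helper over each name in `outlier_columns` would give a per-column count to report next to the plots.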

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
#converting targeted column
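# assign the result back instead of calling replace(inplace=True) on a column selection
airline_df['recommended'] = airline_df['recommended'].replace({'yes':1,'no':0})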

In [None]:
airline_df.head(1)

In [None]:
airline_df.columns

In [None]:
# select only numeric columns
numeric_columns = airline_df.select_dtypes(include=[np.number]).columns
numeric_columns

In [None]:
# plot heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(airline_df[numeric_columns].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

From the above heatmap we have following insights:

* The overall column is strongly correlated with the recommendation.
* Value for money is highly correlated with the recommendation (0.84), indicating that customers who perceive good value are more likely to recommend the airline.
* Cabin service is positively correlated with the recommendation.
* Seat comfort shows a notable correlation with the recommendation.
* Food and beverage is moderately correlated with the recommendation.
* Entertainment and ground service are weakly correlated with the recommendation.

# Check Multicollinearity

In [None]:
#Creating a function to compute the variance inflation factor (VIF) of each feature
def calc_vif(X):

   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)

In [None]:
calc_vif(airline_df[[i for i in airline_df.describe().columns if i not in ['recommended','value_for_money','overall']]])
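
For reference, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. The sketch below spells this out with NumPy on synthetic data. It is an illustration, not the statsmodels implementation: the helper name `vif_numpy` is hypothetical, and it always adds an intercept, whereas (as far as I understand) `variance_inflation_factor` regresses on exactly the columns it is given, so the two can give different numbers.

```python
import numpy as np

def vif_numpy(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix with an explicit intercept column
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif_numpy(np.column_stack([a, b, c]))
print(vifs)
```

The nearly duplicated pair gets a large VIF while the independent feature stays close to 1, which is the pattern to look for in the table above.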

In [None]:
#drop overall column
airline_df.drop(["overall"], axis = 1, inplace = True)

In [None]:
airline_df.drop(["airline"], axis = 1, inplace = True)

The airline column is dropped because it is not used in the modelling steps that follow.

# Define dependent and independent variables

In [None]:
# dependent and independent variables
X = airline_df.drop('recommended', axis=1)
y = airline_df['recommended']

In [None]:
X.head()

In [None]:
y.head()

# One hot encoding

In [None]:
X = pd.get_dummies(X, drop_first=True)

In [None]:
X.head()

In [None]:
print("The Percentage of No labels of Target Variable is",np.round(y.value_counts()[0]/len(y)*100))
print("The Percentage of Yes labels of Target Variable is",np.round(y.value_counts()[1]/len(y)*100))

# Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

## ***ML Model Implementation***

In this project I am going to implement the following algorithms:
1. Logistic Regression
2. Decision Tree
3. Random Forest Classifier
4. SVM
5. KNN
6. Naive Bayes Classifier


### ML Model - 1 : Logistic Regression

In [None]:
# Initiate logistic regression
log_reg = LogisticRegression()

In [None]:
log_reg.fit(X_train,y_train)

In [None]:
y_pred_log = log_reg.predict(X_test)
print(y_pred_log)

In [None]:
# logistic regression report
report_log_reg = classification_report(y_test,y_pred_log)
print(report_log_reg)

In [None]:
# confusion matrix
confusion_matrix_log = confusion_matrix(y_test,y_pred_log)
print(confusion_matrix_log)

In [None]:
# plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_log, annot=True, fmt='d', cmap='viridis')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
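
A 2×2 confusion matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes, labels sorted as 0 then 1) can be read directly into the usual metrics. The sketch below uses hypothetical counts, not the notebook's results, and the helper name `metrics_from_confusion` is an assumption.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision, recall and F1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,  # of predicted "recommended", how many were right
        "recall": recall,        # of actual "recommended", how many were found
        "f1": f1,
    }

# Hypothetical counts for illustration only
example_cm = [[50, 10], [5, 35]]
example_metrics = metrics_from_confusion(example_cm)
print(example_metrics)
```

For an airline, low recall on class 1 means likely promoters are missed, while low precision means unhappy passengers are mistaken for promoters.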

# Cross-Validation

In [None]:
logistic = LogisticRegression()

In [None]:
scores = cross_val_score(logistic, X, y, cv=5)
print(scores)

In [None]:
scores = pd.Series(scores)
print(scores.mean(), scores.std(), scores.min(), scores.max())
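
With `cv=5`, the data is cut into five folds and each fold serves once as the held-out set. The sketch below shows the plain k-fold idea with NumPy; it is a simplification, since for a classifier and an integer `cv` scikit-learn uses stratified folds that also preserve the class ratio. The helper name `k_fold_indices` and the sample count are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(23, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, so the five scores together cover the whole dataset once.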

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Here I used logistic regression, a model for binary classification. It applies the sigmoid function to a weighted combination of the independent variables and produces a probability value between 0 and 1.

The accuracy score of this model is 93%.
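
The sigmoid step can be written out in a few lines: the model computes z = bias + Σ wᵢxᵢ and returns p = 1 / (1 + e^(-z)). The sketch below is illustrative only; the weights, bias, and feature values are hypothetical and are not the coefficients fitted in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one passenger."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two rating features
weights = [0.9, 0.4]
bias = -3.5

p_high = predict_proba([5, 4], weights, bias)  # high ratings
p_low = predict_proba([1, 1], weights, bias)   # low ratings
label_high = int(p_high >= 0.5)
label_low = int(p_low >= 0.5)
print(p_high, label_high)
print(p_low, label_low)
```

The 0.5 cut-off mirrors the default decision rule of `LogisticRegression.predict` for a binary target.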

### ML Model - 2 : Decision Tree

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Initiate decision tree model
tree_model = DecisionTreeClassifier()

In [None]:
tree_model.fit(X_train,y_train)

In [None]:
y_pred_tree = tree_model.predict(X_test)
print(y_pred_tree)

In [None]:
# accuracy
accuracy_score_dt = accuracy_score(y_test,y_pred_tree)
print(accuracy_score_dt)

In [None]:
print(classification_report(y_test,y_pred_tree))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# setting the parameters and scoring metric
parameter= {'criterion':['gini','entropy'],'splitter':['best','random'],'max_depth':[1,2,3],'min_samples_split':[2,3,4],'min_samples_leaf':[1,2,3]}
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

In [None]:
# performing hyperparameter tuning using GridSearchCV

# setting an estimator and crossvalidation
grid_search = GridSearchCV(estimator=tree_model, param_grid=parameter, scoring=scoring, cv=5,refit='accuracy')

In [None]:
# fitting x and y to gridsearchcv model using an DT classifier
grid_search.fit(X_train,y_train)

In [None]:
grid_search.best_params_

In [None]:
accuracy_score_dt = grid_search.best_score_
print(accuracy_score_dt)

##### Which hyperparameter optimization technique have you used?

Answer : I used grid search with cross-validation (`GridSearchCV`).
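
Grid search tries every combination in the grid, so its cost grows multiplicatively. The sketch below counts the work implied by the decision tree grid defined above, using only the standard library; the variable names are illustrative.

```python
from itertools import product

# Same grid as the decision tree search above
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [1, 2, 3],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), "candidates,", total_fits, "cross-validation fits")
```

Because `refit` is set, `GridSearchCV` fits the best candidate once more on the full training set after these cross-validation fits.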

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer : After hyperparameter tuning, the accuracy score increased to 93%.

### ML Model - 3 : Fitting Random forest

In [None]:
# Initiate random forest model
rf_model = RandomForestClassifier()

In [None]:
rf_model.fit(X_train,y_train)

In [None]:
y_pred_rf = rf_model.predict(X_test)
print(y_pred_rf)

In [None]:
accuracy_score_random_forest = accuracy_score(y_test,y_pred_rf)
print(accuracy_score_random_forest)

In [None]:
print(classification_report(y_test,y_pred_rf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Random Forest is a classifier that fits a number of decision trees on various subsets of the given dataset and averages their predictions to improve predictive accuracy.

Accuracy score is 92%.
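
The averaging step can be illustrated without training anything. In the sketch below, three hypothetical trees each output class probabilities for two passengers; the forest averages them and picks the class with the highest mean. The numbers are made up for illustration.

```python
import numpy as np

# Shape: (n_trees, n_samples, n_classes); values are hypothetical
tree_probas = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # tree 1
    [[0.6, 0.4], [0.4, 0.6]],   # tree 2
    [[0.3, 0.7], [0.1, 0.9]],   # tree 3
])

# Average the per-tree probabilities, then take the most likely class
forest_proba = tree_probas.mean(axis=0)
forest_pred = forest_proba.argmax(axis=1)

print(forest_proba)
print(forest_pred)
```

Tree 3 disagrees with the others on the first passenger, but the average still favours class 0, which is how the ensemble smooths out individual trees.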

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
parameter= {'criterion':['gini','entropy'],'max_depth':[1,2,3],'min_samples_split':[2,3,4,],'min_samples_leaf':[1,2,3]}

In [None]:
rf_model_grid = GridSearchCV(estimator=rf_model, param_grid=parameter, scoring=scoring, cv=5,refit='accuracy',verbose=3)

In [None]:
rf_model_grid.fit(X_train,y_train)

In [None]:
rf_model_grid.best_params_

In [None]:
accuracy_score_rf = rf_model_grid.best_score_
print(accuracy_score_rf)

After hyperparameter tuning, the score is now 93%.

### ML Model - 4 : K-Nearest Neighbour

In [None]:
# Initializing KNN model
knn_model = KNeighborsClassifier()

In [None]:
knn_model.fit(X_train,y_train)

In [None]:
# predict
y_pred_knn = knn_model.predict(X_test)
print(y_pred_knn)

In [None]:
knn_model.score(X_train,y_train)

In [None]:
accuracy_score_knn = knn_model.score(X_test,y_test)
print(accuracy_score_knn)

In [None]:
# confusion matrix of knn
confusion_matrix(y_test,y_pred_knn)

In [None]:
# area under roc curve (computed here from hard class predictions, not predicted probabilities)
roc_auc_score(y_test,y_pred_knn)

Here we get 92% score.
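
The `roc_auc_score` call above receives hard class labels. ROC AUC measures how well scores rank positives above negatives, so passing predicted probabilities usually gives a more informative value. The sketch below computes AUC by pairwise comparison in plain Python to show the difference; the helper name `pairwise_auc` and the toy values are hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: probabilities rank every positive above every negative
y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.7, 0.9]
hard_labels = [int(p >= 0.5) for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_proba, auc_from_labels)
```

With a fitted scikit-learn classifier, the probability of class 1 is typically taken from `predict_proba(X_test)[:, 1]`.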

### ML Model - 5 : Support Vector Machine

In [None]:
# Initialize SVM model
svm_model = SVC(kernel='linear')

In [None]:
svm_model.fit(X_train,y_train)

In [None]:
y_pred_svm = svm_model.predict(X_test)
print(y_pred_svm)

In [None]:
accuracy_score_train_svm = svm_model.score(X_train,y_train)
print(accuracy_score_train_svm)

In [None]:
accuracy_score_test_svm = svm_model.score(X_test,y_test)
print(accuracy_score_test_svm)

In [None]:
# confusion matrix
confusion_matrix(y_test,y_pred_svm)

Here we get an accuracy score of 92%.

### ML Model - 6 : Naive Bayes Classifier

In [None]:
# Initialize Naive Bayes Classifier
nb_model = GaussianNB()

In [None]:
nb_model.fit(X_train,y_train)

In [None]:
accuracy_score_train_nb = nb_model.score(X_train,y_train)
print(accuracy_score_train_nb)

In [None]:
accuracy_score_test_nb = nb_model.score(X_test, y_test)
print(accuracy_score_test_nb)

In [None]:
y_pred_nb = nb_model.predict(X_test)
print(y_pred_nb)

In [None]:
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred_nb)*100)

Here the accuracy score is 90%.

# **Accuracy Metrics of All Models**

In [None]:
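# rf_model still holds the untuned parameters (GridSearchCV fits clones), so the
# tuned estimator is used here to match the "After Hyperparameter Tuning" label.
models = [log_reg, tree_model, rf_model_grid.best_estimator_, knn_model, svm_model, nb_model]
name = ['Logistic Regression Model',
        'Decision Tree Model After Hyperparameter Tuning',
        'Random Forest Model After Hyperparameter Tuning',
        'k_neighbor', 'support vector', 'naive bayes']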

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def accuracy_of_each_model(model, X_train, y_train, X_test, y_test):
    # Predicting on train data
    y_train_preds = model.predict(X_train)
    # Predicting on test data
    y_test_preds = model.predict(X_test)

    # Storing all training scores
    train_scores = []
    # Storing all test scores
    test_scores = []
    metrics = ['Accuracy_Score', 'Precision_Score', 'Recall_Score', 'Roc_Auc_Score']

    # Get the accuracy scores
    train_accuracy_score = accuracy_score(y_train, y_train_preds)
    test_accuracy_score = accuracy_score(y_test, y_test_preds)

    train_scores.append(train_accuracy_score)
    test_scores.append(test_accuracy_score)

    # Get the precision scores
    train_precision_score = precision_score(y_train, y_train_preds, average='weighted')
    test_precision_score = precision_score(y_test, y_test_preds, average='weighted')

    train_scores.append(train_precision_score)
    test_scores.append(test_precision_score)

    # Get the recall scores
    train_recall_score = recall_score(y_train, y_train_preds, average='weighted')
    test_recall_score = recall_score(y_test, y_test_preds, average='weighted')

    train_scores.append(train_recall_score)
    test_scores.append(test_recall_score)

    # Get the roc_auc scores
    train_roc_auc_score = roc_auc_score(y_train, y_train_preds, average='weighted', multi_class='ovr')
    test_roc_auc_score = roc_auc_score(y_test, y_test_preds, average='weighted', multi_class='ovr')

    train_scores.append(train_roc_auc_score)
    test_scores.append(test_roc_auc_score)

    return train_scores, test_scores, metrics
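
The function above computes ROC AUC from hard predictions. One possible alternative, sketched here under assumptions, is a small helper that returns a continuous score for the positive class: `predict_proba` where the model offers it, and `decision_function` otherwise (an `SVC` created without `probability=True` has no `predict_proba`). The helper name `continuous_scores` and the demo data are hypothetical and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def continuous_scores(model, X):
    """Return a ranking score for the positive class of a fitted binary classifier."""
    if hasattr(model, "predict_proba"):
        # Probability of the second class in model.classes_
        return model.predict_proba(X)[:, 1]
    # Fall back to the signed distance from the decision boundary
    return model.decision_function(X)


# Synthetic stand-in data (assumption: NOT the airline dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2)

for demo_model in (GaussianNB(), SVC(kernel='linear')):
    demo_model.fit(X_demo, y_demo)
    auc = roc_auc_score(y_demo, continuous_scores(demo_model, X_demo))
    print(type(demo_model).__name__, round(auc, 3))
```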


In [None]:
import pandas as pd

# Assuming models and name are predefined lists of equal length
for model, model_name in zip(models, name):
    # Fit the model
    model.fit(X_train, y_train)

    # Calculate train and test scores and metrics
    train_score_, test_score_, metrics_ = accuracy_of_each_model(model, X_train, y_train, X_test, y_test)

    # Print the results
    print("-*-*-"*3 + model_name + "-*-*-"*4)
    print("")
    print(pd.DataFrame(data={'Metrics': metrics_, 'Train_Score': train_score_, 'Test_Score': test_score_}))
    print("")


# **Conclusion**

Based on ML Model Implementation:

* The Logistic Regression model performs well with consistent metrics across accuracy, precision, recall, and ROC AUC. It shows strong performance on the test data.
* The Decision Tree model after hyperparameter tuning shows slightly lower performance compared to the Logistic Regression model. It has good accuracy and recall but slightly lower ROC AUC.
* The Random Forest model after hyperparameter tuning performs better than the Decision Tree model but slightly worse than the Logistic Regression model. It has strong overall metrics with good precision and recall.
* The k-NN model performs well, with metrics comparable to the Logistic Regression model. It shows strong accuracy, precision, recall, and ROC AUC scores.
* The SVM model also performs well, with metrics very close to those of the Logistic Regression and k-NN models. It has strong precision and ROC AUC scores.
* The Naive Bayes model shows the lowest performance among all the models, with lower accuracy, precision, recall, and ROC AUC scores.
* Based on the metrics, the Logistic Regression Model can be considered the best model for this particular dataset, closely followed by the k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM) models. These models exhibit strong and consistent performance across all evaluated metrics.
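
A ranking like the one above is easier to verify when all test metrics sit in a single table. The following is a minimal sketch on synthetic data with three of the model types; the dictionary `demo_models` and the resulting numbers are illustrative only and do not reproduce the project's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the airline dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=3)

# Hypothetical subset of models, keyed by display name
demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'k-NN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
}

rows = []
for label, est in demo_models.items():
    preds = est.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        'Model': label,
        'Accuracy': accuracy_score(y_te, preds),
        'Precision': precision_score(y_te, preds, average='weighted'),
        'Recall': recall_score(y_te, preds, average='weighted'),
    })

# One row per model, best test accuracy first
comparison = pd.DataFrame(rows).set_index('Model').sort_values('Accuracy', ascending=False)
print(comparison)
```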

EDA:

* Garuda Indonesia is the airline with the highest rating.
* Solo leisure travellers give the highest ratings and business travellers give the lowest.
* Economy class has the highest review count and first class has the lowest.
* Anders Pedersen is the author who has written the most reviews.
* The dataset has 47.7% positive reviews and 52.3% negative reviews.
* BKK to LHR is the most common route.
* For food_bev, the most frequent rating is 1.
* For ground_service, the most frequent rating is also 1.
* For seat comfort, the most frequent rating is also 1.
* For value for money, the most frequent rating is also 1.
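
The "most frequent rating" findings can be reproduced with a column-wise mode. The sketch below uses a tiny made-up table; the values are invented for illustration, and the column names `seat_comfort` and `value_for_money` are assumed spellings rather than names confirmed by this section.

```python
import pandas as pd

# Made-up ratings for illustration only (NOT taken from the airline dataset)
toy_reviews = pd.DataFrame({
    'food_bev':        [1, 1, 3, 5, 1, 4],
    'ground_service':  [1, 2, 1, 1, 5, 3],
    'seat_comfort':    [2, 1, 1, 4, 1, 5],
    'value_for_money': [1, 1, 2, 5, 1, 3],
})

# Most frequent rating per column (first row of the mode table)
most_common_rating = toy_reviews.mode().iloc[0]

# Fraction of reviews that gave a rating of 1, per column
share_of_ones = (toy_reviews == 1).mean()

print(most_common_rating)
print(share_of_ones)
```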


Suggestions:

* Airlines must focus on improving the services for which 1 is the most frequent rating. They should collect feedback from passengers, make changes based on that feedback, and concentrate on the weak areas.

Challenges during project:

* Checking for and removing multicollinearity.
* Handling missing values.
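
As a rough illustration of both challenges, the sketch below fills missing values with the column median and then computes variance inflation factors (VIF) as the diagonal of the inverse correlation matrix. The toy columns `a`, `b` and `c` are invented, and this is only one common approach, not necessarily the procedure used earlier in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Toy features: 'b' is nearly a copy of 'a', 'c' is independent
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})
toy.loc[::25, 'c'] = np.nan  # inject a few missing values

# Fill missing values with each column's median
toy_filled = toy.fillna(toy.median())

# VIF of each feature equals the matching diagonal entry of the
# inverse of the feature correlation matrix
corr = toy_filled.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy_filled.columns)

print(vif.round(1))
```

Features with a very large VIF (here `a` and `b`) carry almost the same information, so one of them is a candidate for removal.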

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***