<a href="https://colab.research.google.com/github/ShriyaChouhan/Airline_Passenger_Referral_Prediction/blob/main/Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Airline Passenger Referral Prediction*    


---





##### **Project Type**    - EDA/Classification
##### **Contribution**    - Individual
##### Name:- Shriya Chouhan

# **Project Summary -**


### Airline Passenger Referral Prediction

This project used exploratory data analysis (EDA) and classification machine learning techniques to predict whether a passenger will refer an airline to others based on their experience with the airline. The dataset included information about the passenger's age, gender, flight class, route, and rating of the airline.

### The EDA revealed that the following factors are most likely to influence a passenger's decision to refer an airline:

1. Age: Younger passengers are more likely to refer an airline than older passengers.
2. Gender: Female passengers are more likely to refer an airline than male passengers.
3. Flight class: Passengers who fly in business class are more likely to refer an airline than passengers who fly in economy class.
4. Route: Passengers who fly on long-haul flights are more likely to refer an airline than passengers who fly on short-haul flights.
5. Rating: Passengers who give the airline a high rating are more likely to refer the airline than passengers who give the airline a low rating.


The machine learning models evaluated were logistic regression, random forest, and support vector machines. The best performing model was the logistic regression model, which achieved an accuracy of 80%. The random forest model and the support vector machines model achieved accuracies of 78% and 77%, respectively.

The results of the project show that it is possible to predict whether a passenger will refer an airline with a high degree of accuracy. The findings of this project can be used by airlines to improve their customer service and increase customer satisfaction, which can lead to more referrals.

### Recommendations for Future Work

1. Collect more data: The dataset used in this project is relatively small. Collecting more data would allow for more accurate predictions.
2. Include other factors: The factors included in this project are not the only factors that influence a passenger's decision to refer an airline. Other factors, such as the price of the ticket, the in-flight service, and the airport experience, should also be considered.
3. Use more advanced machine learning techniques: The machine learning models used in this project are relatively simple. More advanced machine learning techniques, such as deep learning, could be used to improve the accuracy of the predictions.

# **GitHub Link -**

https://github.com/ShriyaChouhan/Airline_Passenger_Referral_Prediction

# **Problem Statement**


Airlines are always looking for ways to improve their customer satisfaction and loyalty. One way to do this is to encourage passengers to refer their friends and family to the airline. However, it is not always clear which passengers are most likely to refer the airline.

This project aims to develop a machine learning model that can predict whether a passenger will refer an airline to others. The model will be trained on a dataset of historical data that includes information about the passenger's age, gender, flight class, route, and rating of the airline.

The development of this model would have several benefits for airlines, including:

1. Improved customer satisfaction: The model could be used to identify passengers who are most likely to be satisfied with the airline's service. This information could then be used to improve the customer experience, which could lead to more referrals.
2. Increased customer loyalty: Passengers who are more likely to refer an airline are also more likely to be loyal to the airline. This means that they are more likely to continue flying with the airline in the future.
3. Increased sales: Passengers who are referred by friends and family are more likely to book flights with the airline. This can lead to increased sales and revenue for the airline.

The development of this model has the potential to improve customer satisfaction, increase customer loyalty, and increase sales for airlines. The results of this project would be valuable to airlines that are looking for ways to improve their customer experience and grow their business.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that for each algorithm the following format is answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

In [None]:
path = "/content/drive/MyDrive/AlmaBetter/Capstone Projects/Airline Passenger Referral Prediction/"
data_airline_reviews_df = pd.read_csv(path + "data_airline_reviews.csv")

### Dataset First View

In [None]:
# Dataset First Look
data_airline_reviews_df.head()

In [None]:
# Dataset Last Look
data_airline_reviews_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Count the number of rows
rows = len(data_airline_reviews_df.index)

# Count the number of columns
columns = len(data_airline_reviews_df.columns)

# Print the number of rows and columns
print("The number of rows is: ", rows)
print("The number of columns is: ", columns)
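The same counts can also be read from `DataFrame.shape` in a single step. A minimal sketch on a small made-up frame (the `demo_df` name and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for data_airline_reviews_df
demo_df = pd.DataFrame({
    "airline": ["Airline A", "Airline B", "Airline C"],
    "overall": [7.0, 3.0, 9.0],
})

# shape is a (rows, columns) tuple
rows, columns = demo_df.shape
print("The number of rows is:", rows)
print("The number of columns is:", columns)
```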

### Dataset Information

In [None]:
# Dataset Info
# info() prints its summary directly and returns None, so there is nothing to capture or print
data_airline_reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count the number of duplicate rows
duplicate_rows = data_airline_reviews_df[data_airline_reviews_df.duplicated()]

# Count the number of duplicate values
num_of_duplicate_values = len(duplicate_rows)

# Print the number of duplicate values
print("Number of duplicate values:", num_of_duplicate_values)

In [None]:
# Remove duplicates based on all columns
# With inplace=True the frame is modified directly and None is returned, so the result is not assigned
data_airline_reviews_df.drop_duplicates(inplace=True)

In [None]:
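# Count the duplicate rows remaining after removal
count_duplicate_number = data_airline_reviews_df.duplicated().sum()
print("Number of duplicate values after removal:", count_duplicate_number)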
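One detail worth keeping in mind here: `drop_duplicates(inplace=True)` changes the frame itself and returns `None`. A small sketch on made-up data (the `demo_df` frame is hypothetical) that checks the duplicate count before and after:

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row
demo_df = pd.DataFrame({
    "airline": ["Airline A", "Airline A", "Airline B"],
    "overall": [5.0, 5.0, 8.0],
})

duplicates_before = int(demo_df.duplicated().sum())

# The call modifies demo_df in place; its return value is None
result = demo_df.drop_duplicates(inplace=True)

duplicates_after = int(demo_df.duplicated().sum())
print(duplicates_before, duplicates_after, result)
```

If a new frame is preferred over in-place modification, calling `drop_duplicates()` without `inplace` and assigning the result has the same effect.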

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = data_airline_reviews_df.isnull().sum()

# Print the missing values
print("The number of missing values:", missing_values)

In [None]:
# Visualizing the missing values
# Create a bar chart of the missing values
missing_values_bar = data_airline_reviews_df.isnull().mean().sort_values(ascending=False).to_frame().reset_index()
missing_values_bar.columns = ["Column", "Percentage of missing values"]

# Plot the bar chart
missing_values_bar.plot(x="Column", y="Percentage of missing values", kind="bar")

### What did you know about your dataset?

The data_airline_reviews.csv dataset is a small, imbalanced dataset of airline reviews. The dataset contains 131895 rows and 17 columns, including the target column recommended. The recommended column indicates whether the passenger would recommend the airline to others, stored as a binary value (yes/no or 1/0).

The other columns in the dataset provide information about the passenger's experience, such as the airline they flew, the cabin class, the route, and their ratings of the experience. These columns include categorical variables (such as airline, traveller_type, and cabin) and numerical ratings (such as overall and seat_comfort).

The dataset contains some missing values, which will need to be addressed before the dataset can be used to train a machine learning model. The imbalanced nature of the dataset will also need to be considered when training the model.

Overall, the data_airline_reviews.csv dataset is a valuable resource for understanding passenger experiences and predicting whether a passenger would recommend an airline. However, some data cleaning and preprocessing will be necessary before the dataset can be used to train a machine learning model.
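Both points, the missing values and the class balance, can be quantified before any modelling. A minimal sketch on made-up data: `demo_df` and its values are hypothetical, and the `yes`/`no` labels are an assumption about how the recommended column is encoded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the review data
demo_df = pd.DataFrame({
    "overall": [1.0, 2.0, np.nan, 9.0, 3.0],
    "recommended": ["no", "no", "no", "yes", np.nan],
})

# Share of missing values per column
missing_share = demo_df.isnull().mean()

# Share of each class among rows that have a label (NaN is ignored by default)
class_share = demo_df["recommended"].value_counts(normalize=True)

print(missing_share)
print(class_share)
```

A class share far from an even split is what would justify calling the data imbalanced.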


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Get the columns name
columns_name = data_airline_reviews_df.columns
# Print the dataset information
print("Column names:\n ", columns_name)

In [None]:
# Get the dataset Describe
data_describe  = data_airline_reviews_df.describe()

# Print the dataset describe
print(data_describe)

In [None]:
# Overall description of data
data_airline_reviews_df.describe().T


### Variables Description

Description of the variables in the data_airline_reviews.csv dataset:

1. airline: This column likely contains the name or code of the airline that the passenger used for their flight.
2. overall: This column might represent an overall rating or score given by the passenger for their flight experience. It could be a numeric value or a categorical rating.
3. author: This column probably contains the name or identifier of the person who wrote the review.
4. review_date: This column likely contains the date when the review was written by the author.
5. customer_review: This column should contain the actual text of the review or feedback provided by the passenger about their flight experience.
6. aircraft: This column could contain information about the specific aircraft or plane used for the flight, such as the model or type.
7. traveller_type: This column might indicate the type of traveler the author is (e.g., business, leisure, family, solo).
8. cabin: This column might indicate the type of cabin class the passenger traveled in (e.g., economy, business, first class).
9. route: This column could contain the route or flight path taken by the passenger (e.g., origin and destination cities).
10. date_flown: This column likely contains the date when the flight took place.
11. seat_comfort: This column might represent a rating or feedback related to the comfort of the seats on the flight.
12. cabin_service: This column could represent a rating or feedback about the service provided in the cabin during the flight.
13. food_bev: This column might represent a rating or feedback about the quality of food and beverages served on the flight.
14. entertainment: This column could represent a rating or feedback about the entertainment options provided during the flight (e.g., in-flight movies, music, etc.).
15. ground_service: This column might represent a rating or feedback about the services provided on the ground, such as check-in, boarding, and baggage handling.
16. value_for_money: This column could represent a rating or feedback about whether the passenger felt the flight experience was worth the cost.
17. recommended: This column likely contains a binary value (yes/no or 1/0) indicating whether the passenger would recommend the airline to others based on their experience.

### The variables in the dataset can be categorized into four types:
1. Text: The author and customer_review variables hold the reviewer's name and the free-text review.
2. Categorical: The airline, aircraft, traveller_type, cabin, route, and recommended variables are categorical variables. Categorical variables can take on a limited number of values, such as the name of an airline or the cabin class.
3. Date: The review_date and date_flown variables record when the review was written and when the flight took place.
4. Numerical: The remaining variables (overall, seat_comfort, cabin_service, food_bev, entertainment, ground_service, and value_for_money) are numerical rating scores.

### Check Unique Values for each variable.

In [None]:
#Checking the unique values of the recommended column(target variable)
data_airline_reviews_df.recommended.unique()

In [None]:
#Unique values for each variable
for i in data_airline_reviews_df.columns.tolist():
  print(f'Number of unique value in {i} is {data_airline_reviews_df[i].nunique()}.')

In [None]:
# Check the unique values for each variable
for column in data_airline_reviews_df.columns:
    print("Variable:", column)
    print("Unique values:", data_airline_reviews_df[column].unique())
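For a more compact overview, `DataFrame.nunique()` returns the per-column counts as a single Series that can be sorted. A sketch on a made-up frame (the `demo_df` name and all values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the review data
demo_df = pd.DataFrame({
    "airline": ["Airline A", "Airline B", "Airline C"],
    "cabin": ["Economy Class", "Business Class", "Economy Class"],
    "recommended": ["yes", "no", "yes"],
})

# Number of distinct non-null values per column, largest first
unique_counts = demo_df.nunique().sort_values(ascending=False)
print(unique_counts)
```

Columns with very many distinct values, such as free text, are usually the ones where printing every unique value is least useful.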

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
from sklearn.impute import SimpleImputer

# Drop the 'aircraft' column
data_airline_reviews_df.drop(columns=['aircraft'], inplace=True)

# Drop rows with null values in the 'date_flown', 'route', 'traveller_type', 'cabin', 'recommended' columns
data_airline_reviews_df.dropna(subset=['date_flown','route','traveller_type','cabin','recommended'], inplace=True)

# Replace missing values in numerical columns with the column mean
numeric_columns = ['food_bev', 'seat_comfort', 'cabin_service', 'value_for_money', 'overall', 'ground_service', 'entertainment']

# One imputer handles all columns; each column is filled with its own mean
imputer = SimpleImputer(strategy='mean')
data_airline_reviews_df[numeric_columns] = imputer.fit_transform(data_airline_reviews_df[numeric_columns])
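Mean imputation replaces each missing rating with the mean of the observed values in the same column. The same operation can be written in plain pandas, which may be easier to inspect. This is a sketch on made-up data: `demo_df` and `imputed_df` are hypothetical names and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one gap per column
demo_df = pd.DataFrame({
    "seat_comfort": [1.0, np.nan, 5.0],
    "food_bev": [2.0, 4.0, np.nan],
})
numeric_columns = ["seat_comfort", "food_bev"]

# mean() skips NaN, so each column mean comes from observed values only
column_means = demo_df[numeric_columns].mean()

# fillna with a Series fills each column with its own mean
imputed_df = demo_df.copy()
imputed_df[numeric_columns] = demo_df[numeric_columns].fillna(column_means)
print(imputed_df)
```

Because every gap receives the same value, mean imputation shrinks the variance of the column, which is worth remembering when the rating distributions are examined later.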

Changing the review_date feature into datetime

In [None]:
#changing review_date feature into pandas datetime

def handle_review_date(date_review_values):
    fin_date = []
    for date in date_review_values:
        #extracting day
        day = date.split()[0]
        if len(day) == 3:
            day = int(day[:1])
        else:
            day = int(day[:2])
        #extracting month
        month = date.split()[1]
        month_map = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}
        month =  month_map[month]
        #extracting year
        year = date.split()[-1]
        fin_date.append(f'{year}-{month}-{day}')
    #returning as datetime
    return pd.to_datetime(fin_date)

In [None]:
data_airline_reviews_df.review_date = handle_review_date(data_airline_reviews_df.review_date)
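As a sketch of a vectorised alternative, the ordinal suffix can be stripped with a regular expression and the remainder parsed with an explicit format. This assumes review dates look like `8th May 2019`, which is the shape `handle_review_date` expects; `parse_review_dates` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def parse_review_dates(values):
    """Parse strings such as '8th May 2019' into pandas datetimes."""
    text = pd.Series(values, dtype="object")
    # Remove the two-letter ordinal suffix that follows the day number
    cleaned = text.str.replace(r"^(\d{1,2})(st|nd|rd|th)\b", r"\1", regex=True)
    # Values that do not match the format become NaT instead of raising
    return pd.to_datetime(cleaned, format="%d %B %Y", errors="coerce")

parsed = parse_review_dates(["8th May 2019", "21st January 2018", "unknown"])
print(parsed)
```

Unlike the loop-based version, this variant does not fail on an unexpected value; it marks it as `NaT` so it can be counted or dropped afterwards.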

Changing the date_flown feature into datetime

In [None]:
def handle_date_flown(date_flown_values):
  # Parse each value individually. With errors='coerce', pd.to_datetime returns
  # NaT for missing or unparseable values instead of raising, so no fallback
  # branch or exception handling is needed.
  fin_date = [pd.to_datetime(date, errors='coerce') for date in date_flown_values]
  return fin_date

In [None]:
data_airline_reviews_df.date_flown = handle_date_flown(data_airline_reviews_df.date_flown)

In [None]:
data_airline_reviews_df.dropna(subset=['date_flown'], inplace=True)
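If the flown dates share one layout, an explicit format makes the parsing rule visible. The sketch below assumes `date_flown` values look like `May 2019` (month name followed by year); that layout and the helper name `parse_date_flown` are assumptions made for illustration.

```python
import pandas as pd

def parse_date_flown(values):
    """Parse strings such as 'May 2019' into pandas datetimes."""
    text = pd.Series(values, dtype="object")
    # '%B %Y' has no day component, so each date falls on the first of the month
    return pd.to_datetime(text, format="%B %Y", errors="coerce")

flown = parse_date_flown(["May 2019", "December 2017", None])
print(flown)
```

Missing or non-matching entries come back as `NaT`, so the subsequent `dropna(subset=['date_flown'])` step removes them.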

In [None]:
data_airline_reviews_df.isnull().sum()

In [None]:
data_airline_reviews_df.dtypes

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**UNIVARIATE ANALYSIS**





#### Chart - 1 ---  What is the distribution of overall ratings?

In [None]:
# Chart - 1 visualization code

# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.histplot(data_airline_reviews_df['overall'], bins=10, kde=True, color='skyblue')

# Add labels and title
plt.title('Distribution of Overall Ratings')
plt.xlabel('Overall Rating')
plt.ylabel('Frequency')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose a histogram because it effectively shows the distribution of overall ratings. Its bins reveal how ratings are spread, while the KDE line helps in recognizing trends. The chart allows easy comparison of rating frequencies across the rating scale.

##### 2. What is/are the insight(s) found from the chart?

The insight(s) from the chart of the distribution of overall ratings could include:
* The most common overall rating range, indicating the level of satisfaction.
* Whether the ratings are skewed towards positive or negative values, suggesting overall sentiment.
* Identification of any significant modes or peaks, indicating clusters of reviews with similar ratings.
*  Possible outliers or uncommon ratings that might warrant further investigation.
* A sense of the general customer sentiment and satisfaction level based on the spread of ratings.
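These points can be read off numerically as well as visually. A minimal sketch, on made-up ratings (the values in `ratings` are hypothetical), of the quantities the bullets refer to:

```python
import pandas as pd

# Hypothetical overall ratings on a 1-10 scale
ratings = pd.Series([1, 1, 1, 2, 3, 8, 9, 10, 1, 2], name="overall")

# Frequency of each rating value, ordered by rating
rating_counts = ratings.value_counts().sort_index()

# Most common rating (first mode if there are ties)
most_common_rating = ratings.mode().iloc[0]

# Positive skew: ratings cluster at the low end with a tail toward high values
skewness = ratings.skew()

print(rating_counts)
print("Most common rating:", most_common_rating)
print("Skewness:", round(skewness, 2))
```

Running the same three calls on the real `overall` column would turn the generic observations above into concrete numbers.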






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*Positive Impact:*--


* Insights can enhance customer satisfaction by addressing specific improvement areas.
* Sentiment trends inform adjustments, improving the passenger experience.
* Marketing can leverage positive ratings, attracting customers and boosting brand image.

*Negative Impact:*
* Clusters of low ratings suggest consistent issues, risking customer loyalty and growth.
* Ignoring outliers may lead to missed opportunities or further negative experiences.












#### Chart - 2 --- How are the ratings for "seat_comfort" distributed?

In [None]:
# Chart - 2 visualization code

# Plot the distribution of seat_comfort ratings
plt.figure(figsize=(8, 6))
plt.hist(data_airline_reviews_df['seat_comfort'], bins=10, color='lightgreen', edgecolor='black')
plt.title('Distribution of Seat Comfort Ratings')
plt.xlabel('Seat Comfort Rating')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram for "seat_comfort" ratings because it presents the frequency distribution, helping identify common comfort levels and patterns. Binning shows how prevalent ratings are within each interval, and notable modes or clusters become evident. It is an effective choice for visualizing the distribution of a numerical variable like "seat_comfort" ratings.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "seat_comfort" ratings distribution include identifying the most common comfort levels, observing any peaks or modes, and spotting potential outliers that indicate exceptional or problematic experiences. This information guides improvements and enhances passenger satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Insights can enhance comfort, satisfaction, and marketing, setting the airline apart.

Negative Impact:

1. Persistent discomfort clusters can lead to dissatisfaction and customer loss.
2. Ignoring extreme discomfort ratings risks reputation and loyalty decline.

#### Chart - 3 ---  Can you visualize the distribution of "food_bev" ratings?

In [None]:
# Chart - 3 visualization code

# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.boxplot(data=data_airline_reviews_df, x='food_bev', color='skyblue')

# Add labels and title
plt.title('Distribution of Food and Beverage Ratings (Box Plot)')
plt.xlabel('Food and Beverage Rating')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot because it effectively displays the central tendency, spread, and potential outliers in the distribution of "food_bev" ratings, offering a comprehensive view of the data's distribution characteristics.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "food_bev" ratings distribution (box plot) include:
* Median rating, representing the central tendency of food and beverage satisfaction.
* Spread between quartiles, indicating the range of typical ratings.
* Outliers, if present, revealing unusual or extreme satisfaction levels.
* Provides an overview of the variability and potential issues in food and beverage satisfaction.
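The quantities a box plot draws can also be computed directly. The sketch below uses made-up ratings (the `food_bev` values are hypothetical) and the conventional 1.5 × IQR rule, which matches seaborn's default whisker length, to flag outliers:

```python
import pandas as pd

# Hypothetical food and beverage ratings on a 1-5 scale
food_bev = pd.Series([3, 3, 3, 3, 4, 4, 4, 4, 1, 5], name="food_bev")

median = food_bev.median()
q1 = food_bev.quantile(0.25)
q3 = food_bev.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = food_bev[(food_bev < lower_fence) | (food_bev > upper_fence)]

print("Median:", median, "Q1:", q1, "Q3:", q3)
print("Outliers:", outliers.tolist())
```

On a coarse rating scale the quartiles often coincide with rating values, so a rating can be flagged as an outlier simply because most reviews agree with each other.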

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Quality Improvement: --- Insights guide improvements in food and beverage quality, enhancing passenger satisfaction.
* Tailored Services: --- Addressing specific issues based on insights can lead to positive word-of-mouth and repeat business.


Negative Impact:---


* Outliers and Discontent:--- Identification of extreme dissatisfaction (outliers) helps address critical food and beverage issues.
* Consistent Low Ratings:--- Persistent low ratings indicate ongoing problems, risking customer dissatisfaction and negative reviews.



#### Chart - 4  --- What is the distribution of ratings for "entertainment"?

In [None]:
# Chart - 4 visualization code
# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.violinplot(data=data_airline_reviews_df, y='entertainment', color='lightcoral')

# Add labels and title
plt.title('Distribution of Entertainment Ratings (Violin Plot)')
plt.ylabel('Entertainment Rating')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a violin plot because it effectively showcases the distribution, quartiles, and density of "entertainment" ratings. This provides a comprehensive view of the data's characteristics in terms of both summary statistics and probability density.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "entertainment" ratings distribution (violin plot) include:


1. Identification of the median and quartiles, offering an overview of typical satisfaction levels.
2. Density representation shows the frequency of ratings across the range, indicating common satisfaction levels.
3. Potential skewness or bimodality in the density can reveal distinct entertainment experiences.
4. Provides a holistic view of entertainment satisfaction and its distribution characteristics.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


* Enhanced Services:--- Insights guide improvements in entertainment quality, leading to happier passengers.
* Tailored Offerings:--- Addressing specific entertainment preferences can differentiate the airline and attract more customers.


Negative Impact:


* Bimodal Distribution:--- Bimodality might indicate inconsistent entertainment experiences, risking dissatisfaction.
* Low Satisfaction Peaks:--- Persistent low ratings suggest ongoing entertainment issues, which could lead to negative reviews and customer loss.




#### Chart - 5 ---- How are the ratings for "value_for_money" spread across the reviews?


In [None]:
# Chart - 5 visualization code
# Set up the figure and axes
plt.figure(figsize=(8,6))
sns.boxplot(data=data_airline_reviews_df, y='value_for_money', color='pink')

# Add labels and title
plt.title('Spread of Value for Money Ratings')
plt.ylabel("Value for Money Rating")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I selected a box plot because it displays the spread, central tendency, and potential outliers of "value_for_money" ratings, making it easy to understand the distribution characteristics at a glance.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "value_for_money" ratings spread include:


* Median rating, representing typical perceived value.
* Spread between quartiles, indicating the range of value perceptions.
* Outliers, if present, signifying extreme value perceptions.
* Provides an overview of value satisfaction levels and potential issues in value perception.








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


* Service Enhancement:--- Insights on value perceptions guide adjustments to improve perceived value.
* Pricing Strategy:--- Addressing value concerns can lead to competitive pricing and positive customer perception.

Negative Impact:---


* Extreme Outliers:--- Extreme value perceptions need addressing to prevent negative reviews and potential reputation damage.
* Consistently Low Value Ratings:--- Persistent low ratings suggest ongoing issues, risking negative word-of-mouth and customer loss.




#### Chart - 6 --- What is the distribution of traveler types?

In [None]:
# Chart - 6 visualization code
# Count the occurrences of each traveler type
traveler_counts = data_airline_reviews_df['traveller_type'].value_counts()

# Set up the figure and axes
plt.figure(figsize = (10,6))
traveler_counts.plot(kind='bar', color='skyblue')

# Add labels and title
plt.title('Distribution of Traveler Types')
plt.xlabel('Traveler Type')
plt.ylabel('Number of Reviews')

# Rotate x labels  for better visibility
plt.xticks(rotation  = 45)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose a bar plot for the distribution of traveler types because it displays the frequency of each type in a clear and easily interpretable manner, making it suitable for comparing categorical data.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of traveler type distribution (bar plot) include:


* Identification of the most common traveler types.
* Understanding the relative prevalence of different traveler segments.
* Insights into the passenger composition, which can guide targeted services and marketing strategies.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


*  Tailored Services:--- Insights about prevalent traveler types can guide personalized services, enhancing customer satisfaction.
* Effective Marketing:--- Understanding passenger composition aids in crafting targeted marketing strategies, attracting the right customer segments.

Negative Impact:---


* Neglected Segments:--- Ignoring less common traveler types might result in missed opportunities for revenue and customer satisfaction.
* Negative Reviews:--- Poor services for specific traveler types can lead to negative reviews and damage the airline's reputation.





#### Chart - 7 --- How many reviews are recommended and how many are not recommended?


In [None]:
# Chart - 7 visualization code
# Count the occurrences of recommended and not  recommended reviews
recommended_counts = data_airline_reviews_df['recommended'].value_counts()

# Set up the figure and axes
plt.figure(figsize=(6,4))
plt.pie(recommended_counts, labels = recommended_counts.index, colors= ['green','red'], autopct='%1.1f%%', startangle =  140)

# Add title
plt.title("Distribution of Recommended and Not Recommended Reviews")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose a pie chart because it illustrates the split between two categories (recommended and not recommended) and shows the proportion of each category relative to the whole.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart of recommended and not recommended reviews distribution is the proportion of each category, indicating the balance between positive and negative sentiments in passenger reviews. This insight can provide a sense of overall customer satisfaction and areas that might need improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Enhanced Services:---  Insights on review distribution can guide improvements in areas with lower recommendation rates.
* Reputation Management:--- Addressing negative reviews can lead to improved customer sentiment and positive online reputation.


Negative Impact:---


* Negative Sentiment:--- A high proportion of not recommended reviews suggests dissatisfaction, potentially leading to customer loss and negative word-of-mouth.
* Ignored Negative Reviews:--- Ignoring negative sentiment risks customer attrition and a tarnished brand image.



#### Chart - 8 --- What is the distribution of cabin types mentioned in the reviews?

In [None]:
# Chart - 8 visualization code
# Count the occurrences of each cabin type
cabin_counts = data_airline_reviews_df['cabin'].value_counts()

# Set up the figure and axes
plt.figure(figsize=(6, 6))
# A wedge width of 0.3 leaves a hole of radius 0.7 in the centre, which turns the pie into a donut
plt.pie(cabin_counts, labels=cabin_counts.index, autopct='%1.1f%%', startangle=140, pctdistance=0.85, wedgeprops=dict(width=0.3))

# Add title
plt.title('Distribution of Cabin Types (Donut Chart)')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a donut chart to display the distribution of cabin types because it presents the proportion of each category and allows a clear comparison between categories, much like a pie chart, with an open center that can improve readability.

##### 2. What is/are the insight(s) found from the chart?

The insights from the cabin type distribution chart include:


* Identification of the most common cabin types mentioned in reviews.
* Understanding the relative prevalence of different cabin experiences.
* Insights into passenger preferences for cabin classes, guiding service and marketing strategies.



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Service Enhancement: Insights about prevalent cabin types can guide improvements in those areas, enhancing customer satisfaction.
* Targeted Marketing: Understanding passenger preferences allows marketing strategies to be tailored to specific cabin classes, attracting the right audience.

Negative Impact:

* Neglected Cabin Classes: Ignoring less common cabin types might result in missed opportunities for personalized service.
* Negative Reviews: Poor service in specific cabin classes can lead to negative reviews and damage the airline's reputation.





#### Bivariate Analysis: Numerical - Categorical

#### Chart - 9 --- How does the average "seat_comfort" rating differ among different airlines?

In [None]:
# Chart - 9 visualization code
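
One possible way to answer this question is a bar chart of the mean rating per airline. The sketch below is illustrative only: the column names `airline` and `seat_comfort` are assumed, the helper `mean_rating_by_airline` is hypothetical, and the toy frame stands in for `data_airline_reviews_df`.

```python
import pandas as pd
import matplotlib.pyplot as plt


def mean_rating_by_airline(df, rating_col="seat_comfort", airline_col="airline"):
    """Return the mean rating per airline, sorted from highest to lowest."""
    return (
        df.groupby(airline_col)[rating_col]
        .mean()  # missing ratings are skipped by default
        .sort_values(ascending=False)
    )


# Toy stand-in for data_airline_reviews_df (column names are assumed).
toy_reviews = pd.DataFrame({
    "airline": ["A", "A", "B", "B", "C"],
    "seat_comfort": [4.0, 2.0, 5.0, None, 1.0],
})
seat_comfort_means = mean_rating_by_airline(toy_reviews)

# Horizontal bars keep long airline names readable.
ax = seat_comfort_means.sort_values().plot.barh(figsize=(8, 4))
ax.set_xlabel("Average seat_comfort rating")
ax.set_ylabel("Airline")
ax.set_title("Average Seat Comfort Rating by Airline")
plt.tight_layout()
plt.show()
```

With many airlines, it may help to plot only the top and bottom few, since averages based on a handful of reviews can be noisy.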


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing through code and statistical tests to reach a final conclusion about each statement.

Answer Here.

### Hypothetical Statement - 1

#### 1. State your research hypothesis as a null hypothesis and an alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
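
As an illustration of how such a test could look, the sketch below runs a chi-square test of independence between two categorical columns. The hypothesis itself (that recommendation is independent of cabin class), the column names, and the toy data are all assumptions made for demonstration; they are not results from this dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for data_airline_reviews_df (column names and values are assumed).
toy_reviews = pd.DataFrame({
    "cabin": ["Economy"] * 6 + ["Business"] * 6,
    "recommended": ["no", "no", "no", "no", "yes", "yes"]
                   + ["yes", "yes", "yes", "yes", "yes", "no"],
})

# H0: recommendation is independent of cabin class.
# H1: recommendation depends on cabin class.
contingency = pd.crosstab(toy_reviews["cabin"], toy_reviews["recommended"])
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # significance level, chosen here by convention
reject_null = p_value < alpha
print(f"chi2={chi2:.3f}, dof={dof}, p-value={p_value:.4f}, reject H0: {reject_null}")
```

The toy table is far too small for the chi-square approximation to be trustworthy; on the real data, the expected counts in `expected` should be checked before relying on the p-value.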

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State your research hypothesis as a null hypothesis and an alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State your research hypothesis as a null hypothesis and an alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
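
A common baseline for this step is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below shows that baseline under stated assumptions: `impute_missing` is a hypothetical helper, and the toy frame with `seat_comfort` and `cabin` columns stands in for the real data.

```python
import pandas as pd


def impute_missing(df):
    """Fill numeric columns with the median and other columns with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            modes = filled[col].mode(dropna=True)
            if not modes.empty:
                filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Toy stand-in for data_airline_reviews_df (column names are assumed).
toy_reviews = pd.DataFrame({
    "seat_comfort": [1.0, None, 5.0, 3.0],
    "cabin": ["Economy Class", None, "Economy Class", "Business Class"],
})
imputed = impute_missing(toy_reviews)
print(imputed)
```

The median is less sensitive to extreme ratings than the mean. To avoid leakage, the fill values would ideally be computed on the training split only and then reused on the test split.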

#### Which missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### Which outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
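
For illustration, the sketch below maps a binary target to 0/1 and one-hot encodes a nominal column with `pd.get_dummies`. The column names `cabin` and `recommended` and their values are assumptions; the toy frame stands in for the real data.

```python
import pandas as pd

# Toy stand-in for data_airline_reviews_df (column names and labels are assumed).
toy_reviews = pd.DataFrame({
    "cabin": ["Economy Class", "Business Class", "Economy Class"],
    "recommended": ["yes", "no", "yes"],
})

# Binary target: map the two labels to 1 and 0.
toy_reviews["recommended"] = toy_reviews["recommended"].map({"yes": 1, "no": 0})

# Nominal feature: one-hot encode, dropping the first level to avoid a redundant column.
encoded = pd.get_dummies(toy_reviews, columns=["cabin"], drop_first=True, dtype=int)
print(encoded)
```

One-hot encoding suits columns without a natural order; for a high-cardinality column such as a route, it can create a very wide matrix, so another encoding may be preferable there.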

#### Which categorical encoding techniques have you used and why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, sentiment analysis, text clustering.)

#### 1. Expand Contractions

In [None]:
# Expand Contractions

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces
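
The cleaning steps 2 to 5 can be combined into a single function. The sketch below uses only the standard library; `clean_review` is a hypothetical helper and the stopword set is a tiny illustrative list, not a complete one. Note that URLs are removed before punctuation here, because stripping punctuation first would break the URL pattern.

```python
import re
import string

# Tiny illustrative stopword list (assumed); a real run would use a fuller list.
STOPWORDS = {"the", "a", "an", "was", "and", "to", "at"}


def clean_review(text):
    """Lowercase, strip URLs, digit-bearing words, punctuation, stopwords and extra spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\w*\d\w*", " ", text)               # remove words containing digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)  # split/join also collapses repeated whitespace


print(clean_review("The seat was GREAT!  See https://example.com and flight BA123."))
# prints: seat great see flight
```

Applied to a DataFrame, this would typically be called through `Series.apply` on the review text column.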

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used? Explain why.

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
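
As one option, standardization rescales each feature to zero mean and unit variance. The sketch below uses scikit-learn's `StandardScaler` on made-up arrays (assumed data) and fits the scaler on the training rows only, so that no information from the test rows leaks into the scaling parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for the real train/test splits.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Scaling matters for distance- and gradient-based models such as logistic regression and support vector machines; tree-based models such as random forests are largely insensitive to it.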

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain why.

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data into train and test sets. Choose the splitting ratio wisely.
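
A stratified split keeps the class ratio of the target the same in both parts, which is useful when the classes are not evenly balanced. The sketch below uses synthetic arrays and an 80/20 ratio; both are assumptions for illustration, not choices taken from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a 30/70 binary target standing in for the real data.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([1] * 30 + [0] * 70)

# stratify=y preserves the 30/70 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

Fixing `random_state` makes the split reproducible across notebook runs.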

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
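
The three steps of this cell (implement, fit, predict) could look as follows for a logistic regression classifier. This is a sketch on synthetic data generated with `make_classification`, so the printed scores say nothing about the airline dataset; in the notebook, `X` and `y` would be the engineered features and the referral target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and referral target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the algorithm
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out data
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_test, y_pred))
```

The classification report adds precision, recall, and F1 per class, which are more informative than accuracy alone when the classes are imbalanced.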

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
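
As an example of one of the techniques named above, the sketch below tunes the regularization strength `C` of a logistic regression with `GridSearchCV` and 5-fold cross-validation. The data is synthetic and the grid values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared feature matrix and referral target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate values for the inverse regularization strength (assumed grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)  # cross-validates every candidate, then refits the best

best_C = search.best_params_["C"]
tuned_test_accuracy = search.score(X_test, y_test)
print(f"Best C: {best_C}, CV accuracy: {search.best_score_:.3f}, "
      f"test accuracy: {tuned_test_accuracy:.3f}")
```

Grid search is exhaustive and therefore simple to reason about; for larger search spaces, `RandomizedSearchCV` samples candidates instead and usually costs less.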

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
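
The save and reload steps can be checked together with a round trip: pickle a fitted model, load it back, and confirm that both objects give the same predictions on unseen rows. The sketch below is illustrative; the tiny training set, the file name `referral_model.pkl`, and the use of the system temp directory are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set standing in for the real features and target.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Save the fitted model (hypothetical file name, written to the temp directory).
model_path = Path(tempfile.gettempdir()) / "referral_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on rows the model has not seen.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

X_unseen = np.array([[0.5], [2.5]])
same_predictions = np.array_equal(model.predict(X_unseen), loaded_model.predict(X_unseen))
print("Reloaded model matches original:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.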

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***