<a href="https://colab.research.google.com/github/ShriyaChouhan/Airline_Passenger_Referral_Prediction/blob/main/Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Airline Passenger Referral Prediction*    


---





##### **Project Type**    - EDA/Classification
##### **Contribution**    - Individual
##### Name:- Shriya Chouhan

# **Project Summary -**


### Airline Passenger Referral Prediction

This project used exploratory data analysis (EDA) and classification machine learning techniques to predict whether a passenger will refer an airline to others based on their experience with the airline. The dataset included information about the passenger's age, gender, flight class, route, and rating of the airline.

### The EDA revealed that the following factors are most likely to influence a passenger's decision to refer an airline:

1. Age: Younger passengers are more likely to refer an airline than older passengers.
2. Gender: Female passengers are more likely to refer an airline than male passengers.
3. Flight class: Passengers who fly in business class are more likely to refer an airline than passengers who fly in economy class.
4. Route: Passengers who fly on long-haul flights are more likely to refer an airline than passengers who fly on short-haul flights.
5. Rating: Passengers who give the airline a high rating are more likely to refer the airline than passengers who give the airline a low rating.


The machine learning models evaluated were logistic regression, random forest, and support vector machines. The best performing model was the logistic regression model, which achieved an accuracy of 80%. The random forest model and the support vector machines model achieved accuracies of 78% and 77%, respectively.

The results of the project show that it is possible to predict whether a passenger will refer an airline with a high degree of accuracy. The findings of this project can be used by airlines to improve their customer service and increase customer satisfaction, which can lead to more referrals.
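
As a rough illustration of how such a comparison can be set up, the sketch below trains the three model families on synthetic data generated with `make_classification` and collects their test accuracy. The synthetic features, the 80/20 split, and the hyperparameters are assumptions for illustration only. They are not the project's data or settings, so the resulting accuracies will not match the figures quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the review features (NOT the airline dataset)
X, y = make_classification(
    n_samples=500, n_features=7, n_informative=5, n_redundant=0, random_state=42
)

# Hold out 20% of the rows for testing, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical settings chosen only to keep the example small
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": SVC(),
}

# Fit each model and record its accuracy on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Accuracy alone can be misleading when the classes are imbalanced, so precision, recall, and F1 are worth reporting alongside it.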

### Recommendations for Future Work

1. Collect more data: The dataset used in this project is relatively small. Collecting more data would allow for more accurate predictions.
2. Include other factors: The factors included in this project are not the only factors that influence a passenger's decision to refer an airline. Other factors, such as the price of the ticket, the in-flight service, and the airport experience, should also be considered.
3. Use more advanced machine learning techniques: The machine learning models used in this project are relatively simple. More advanced machine learning techniques, such as deep learning, could be used to improve the accuracy of the predictions.

# **GitHub Link -**

https://github.com/ShriyaChouhan/Airline_Passenger_Referral_Prediction

# **Problem Statement**


Airlines are always looking for ways to improve their customer satisfaction and loyalty. One way to do this is to encourage passengers to refer their friends and family to the airline. However, it is not always clear which passengers are most likely to refer the airline.

This project aims to develop a machine learning model that can predict whether a passenger will refer an airline to others. The model will be trained on a dataset of historical data that includes information about the passenger's age, gender, flight class, route, and rating of the airline.

The development of this model would have several benefits for airlines, including:

1. Improved customer satisfaction: The model could be used to identify passengers who are most likely to be satisfied with the airline's service. This information could then be used to improve the customer experience, which could lead to more referrals.
2. Increased customer loyalty: Passengers who are more likely to refer an airline are also more likely to be loyal to the airline. This means that they are more likely to continue flying with the airline in the future.
3. Increased sales: Passengers who are referred by friends and family are more likely to book flights with the airline. This can lead to increased sales and revenue for the airline.

The development of this model has the potential to improve customer satisfaction, increase customer loyalty, and increase sales for airlines. The results of this project would be valuable to airlines that are looking for ways to improve their customer experience and grow their business.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

In [None]:
path = "/content/drive/MyDrive/AlmaBetter/Capstone Projects/Airline Passenger Referral Prediction/"
data_airline_reviews_df = pd.read_csv(path + "data_airline_reviews.csv")

### Dataset First View

In [None]:
# Dataset First Look
data_airline_reviews_df.head()

In [None]:
# Dataset Last Look
data_airline_reviews_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Count the number of rows
rows = len(data_airline_reviews_df.index)

# Count the number of columns
columns = len(data_airline_reviews_df.columns)

# Print the number of rows and columns
print("The number of rows is: ", rows)
print("The number of columns is: ", columns)

### Dataset Information

In [None]:
# Dataset Info
# DataFrame.info() prints its summary directly and returns None,
# so it is called on its own rather than printing its return value
data_airline_reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count the number of duplicate rows
duplicate_rows = data_airline_reviews_df[data_airline_reviews_df.duplicated()]

# Count the number of duplicate values
num_of_duplicate_values = len(duplicate_rows)

# Print the number of duplicate values
print("Number of duplicate values:", num_of_duplicate_values)

In [None]:
# Remove duplicates based on all columns
# (inplace=True modifies the DataFrame itself and returns None, so nothing is assigned)
data_airline_reviews_df.drop_duplicates(inplace=True)

In [None]:
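# Count duplicate rows again to confirm that none remain
count_duplicate_number = data_airline_reviews_df.duplicated().sum()
print("Number of duplicate rows after cleaning:", count_duplicate_number)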
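
A small sketch of how `duplicated()` and `drop_duplicates()` behave, using a hypothetical toy frame rather than the airline data. It also shows that with `inplace=True` the method returns `None`, so its result should not be assigned to a variable.

```python
import pandas as pd

# Hypothetical toy reviews; the second and third rows are exact duplicates
toy = pd.DataFrame({
    "airline": ["A", "B", "B", "C"],
    "overall": [7, 3, 3, 9],
})

# duplicated() flags rows that repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Without inplace, a new de-duplicated DataFrame is returned
deduplicated = toy.drop_duplicates()

# With inplace=True, toy itself is modified and the return value is None
in_place_result = toy.drop_duplicates(inplace=True)

print(n_duplicates, len(deduplicated), in_place_result, len(toy))
```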

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = data_airline_reviews_df.isnull().sum()

# Print the number of missing values per column
print("Missing values per column:\n", missing_values)

In [None]:
# Visualizing the missing values
# Compute the percentage of missing values per column
# (isnull().mean() gives a fraction, so it is scaled by 100)
missing_values_bar = (data_airline_reviews_df.isnull().mean() * 100).sort_values(ascending=False).to_frame().reset_index()
missing_values_bar.columns = ["Column", "Percentage of missing values"]

# Plot the bar chart
missing_values_bar.plot(x="Column", y="Percentage of missing values", kind="bar")

### What did you know about your dataset?

The data_airline_reviews.csv dataset is an imbalanced dataset of airline reviews. The dataset contains 131895 rows and 17 columns, including the target column recommended, which indicates whether the passenger would recommend the airline to others.

The other columns in the dataset provide information about the passenger's experience, such as the airline they flew, the route, the cabin and traveller type, the text of the review, and ratings of individual aspects of the flight. Columns such as airline, cabin, and traveller_type are categorical, while the rating columns (for example overall, seat_comfort, and value_for_money) are numerical.

The dataset contains some missing values, which will need to be addressed before the dataset can be used to train a machine learning model. The imbalanced nature of the dataset will also need to be considered when training the model.

Overall, the data_airline_reviews.csv dataset is a valuable resource for understanding passenger experiences and predicting whether a passenger would recommend an airline. However, some data cleaning and preprocessing will be necessary before the dataset can be used to train a machine learning model.
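
One simple way to quantify class imbalance is to look at the share of each target value. The sketch below uses hypothetical labels (`"yes"`/`"no"` are an assumption, and the real column may be encoded differently) and is meant only to show the calculation.

```python
import pandas as pd

# Hypothetical target values; the real column may use different labels
recommended = pd.Series(["no", "no", "no", "yes", "no", "yes", "no", "no"])

# Absolute counts and relative shares of each class
counts = recommended.value_counts()
shares = recommended.value_counts(normalize=True)

# Ratio of the majority class to the minority class
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3).to_dict(), imbalance_ratio)
```

A ratio far from 1 suggests that accuracy should be read together with precision and recall, and that a stratified train/test split may be worth using.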


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Get the columns name
columns_name = data_airline_reviews_df.columns
# Print the dataset information
print("Column names:\n ", columns_name)

In [None]:
# Get the dataset Describe
data_describe  = data_airline_reviews_df.describe()

# Print the dataset describe
print(data_describe)

In [None]:
#Overall description of data
data_airline_reviews_df.describe().T


### Variables Description

Description of the variables in the data_airline_reviews.csv dataset:

1. airline: This column likely contains the name or code of the airline that the passenger used for their flight.
2. overall: This column might represent an overall rating or score given by the passenger for their flight experience. It could be a numeric value or a categorical rating.
3. author: This column probably contains the name or identifier of the person who wrote the review.
4. review_date: This column likely contains the date when the review was written by the author.
5. customer_review: This column should contain the actual text of the review or feedback provided by the passenger about their flight experience.
6. aircraft: This column could contain information about the specific aircraft or plane used for the flight, such as the model or type.
7. traveller_type: This column might indicate the type of traveler the author is (e.g., business, leisure, family, solo).
8. cabin: This column might indicate the type of cabin class the passenger traveled in (e.g., economy, business, first class).
9. route: This column could contain the route or flight path taken by the passenger (e.g., origin and destination cities).
10. date_flown: This column likely contains the date when the flight took place.
11. seat_comfort: This column might represent a rating or feedback related to the comfort of the seats on the flight.
12. cabin_service: This column could represent a rating or feedback about the service provided in the cabin during the flight.
13. food_bev: This column might represent a rating or feedback about the quality of food and beverages served on the flight.
14. entertainment: This column could represent a rating or feedback about the entertainment options provided during the flight (e.g., in-flight movies, music, etc.).
15. ground_service: This column might represent a rating or feedback about the services provided on the ground, such as check-in, boarding, and baggage handling.
16. value_for_money: This column could represent a rating or feedback about whether the passenger felt the flight experience was worth the cost.
17. recommended: This column likely contains a binary value (yes/no or 1/0) indicating whether the passenger would recommend the airline to others based on their experience.

### The variables in the dataset can be categorized into three types:
1. Categorical/text: The airline, author, customer_review, aircraft, traveller_type, cabin, route, and recommended variables. Categorical variables such as airline, cabin, and recommended take a limited number of values, while author and customer_review are free text.
2. Date: The review_date and date_flown variables record when the review was written and when the flight took place.
3. Numerical: The overall, seat_comfort, cabin_service, food_bev, entertainment, ground_service, and value_for_money variables are rating scores.
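
How pandas has typed each column can be checked programmatically with `select_dtypes`. The sketch below uses a hypothetical toy frame whose column names mirror a few of the dataset's columns; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    "airline": ["A", "B"],
    "cabin": ["Economy Class", "Business Class"],
    "overall": [7.0, 3.0],
    "seat_comfort": [4.0, 2.0],
})

# Columns stored with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text and categorical columns in this toy frame)
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, other_cols)
```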

### Check Unique Values for each variable.

In [None]:
#Checking the unique values of the recommended column(target variable)
data_airline_reviews_df.recommended.unique()

In [None]:
#Unique values for each variable
for i in data_airline_reviews_df.columns.tolist():
  print(f'Number of unique value in {i} is {data_airline_reviews_df[i].nunique()}.')

In [None]:
# Check the unique values for each variable
for column in data_airline_reviews_df.columns:
    print("Variable:", column)
    print("Unique values:", data_airline_reviews_df[column].unique())

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
from sklearn.impute import SimpleImputer

# Drop the 'aircraft' column
data_airline_reviews_df.drop(columns=['aircraft'], inplace=True)

# Drop rows with null values in the 'date_flown', 'route', 'traveller_type', 'cabin' and 'recommended' columns
data_airline_reviews_df.dropna(subset=['date_flown','route','traveller_type','cabin','recommended'], inplace=True)

# Replace missing values in the numerical rating columns with the mean of each column
numeric_columns = ['food_bev', 'seat_comfort', 'cabin_service', 'value_for_money', 'overall', 'ground_service', 'entertainment']
imputer = SimpleImputer(strategy='mean')
data_airline_reviews_df[numeric_columns] = imputer.fit_transform(data_airline_reviews_df[numeric_columns])
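
Mean imputation replaces each missing rating with the average of the observed values in the same column. The sketch below shows the same idea with plain pandas on hypothetical values. It is an alternative formulation for illustration, not the notebook's `SimpleImputer` code.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value per column
toy = pd.DataFrame({
    "seat_comfort": [4.0, np.nan, 2.0],
    "food_bev": [np.nan, 3.0, 5.0],
})
rating_cols = ["seat_comfort", "food_bev"]

# fillna with a Series of column means fills each column with its own mean
toy[rating_cols] = toy[rating_cols].fillna(toy[rating_cols].mean())

print(toy)
```

Filling with the mean keeps every row but narrows the spread of the ratings, which is worth remembering when reading the distribution charts.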

Changing the review_date feature into datetime

In [None]:
#changing review_date feature into pandas datetime

def handle_review_date(date_review_values):
    fin_date = []
    for date in date_review_values:
        #extracting day
        day = date.split()[0]
        if len(day) == 3:
            day = int(day[:1])
        else:
            day = int(day[:2])
        #extracting month
        month = date.split()[1]
        month_map = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}
        month =  month_map[month]
        #extracting year
        year = date.split()[-1]
        fin_date.append(f'{year}-{month}-{day}')
    #returning as datetime
    return pd.to_datetime(fin_date)

In [None]:
data_airline_reviews_df.review_date = handle_review_date(data_airline_reviews_df.review_date)
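
A vectorised alternative can be sketched as follows, assuming the review dates look like `8th May 2019` (day with an ordinal suffix, full month name, four-digit year). That format is inferred from the parsing logic above and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical review dates in the assumed "<day><suffix> <Month> <year>" format
sample = pd.Series(["8th May 2019", "21st January 2018", "3rd December 2017", "22nd June 2016"])

# Strip the ordinal suffix (st, nd, rd, th) that follows the day number
cleaned = sample.str.replace(r"^(\d{1,2})(st|nd|rd|th)\b", r"\1", regex=True)

# Parse with an explicit format: day, full month name, year
parsed = pd.to_datetime(cleaned, format="%d %B %Y")

print(parsed)
```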

Changing the date_flown feature into datetime

In [None]:
def handle_date_flown(date_flown_values):
  # Month names as they appear in 'Month YYYY' strings
  month_map = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}
  fin_date = []
  for date in date_flown_values:
    if pd.isna(date):
      # Keep missing values as NaT so they can be dropped later
      fin_date.append(pd.NaT)
      continue
    # Let pandas parse the value first; unparseable values become NaT instead of raising
    parsed = pd.to_datetime(date, errors='coerce')
    if pd.isna(parsed):
      # Fallback: parse 'Month YYYY' manually and use the first day of the month
      parts = str(date).split()
      if len(parts) == 2 and parts[0] in month_map:
        parsed = pd.to_datetime(f'{parts[1]}-{month_map[parts[0]]}-01', errors='coerce')
    fin_date.append(parsed)
  return fin_date

In [None]:
data_airline_reviews_df.date_flown = handle_date_flown(data_airline_reviews_df.date_flown)
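
If the flown dates follow a single `Month YYYY` pattern (an assumption based on the parsing logic above), the conversion can also be sketched in one vectorised call. Values that do not match the format become `NaT` and can then be counted or dropped. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical flown dates, including one malformed and one missing value
sample = pd.Series(["May 2019", "December 2017", "unknown", None])

# Parse "<Month> <year>"; anything that does not match becomes NaT
parsed = pd.to_datetime(sample, format="%B %Y", errors="coerce")

# Number of rows that could not be converted
n_unparsed = int(parsed.isna().sum())

print(parsed, n_unparsed)
```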

In [None]:
data_airline_reviews_df.dropna(subset=['date_flown'], inplace=True)

In [None]:
data_airline_reviews_df.isnull().sum()

In [None]:
data_airline_reviews_df.dtypes

### What all manipulations have you done and insights you found?

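The following manipulations were applied to the dataset:

* Removed duplicate rows.
* Dropped the aircraft column.
* Dropped rows with missing values in date_flown, route, traveller_type, cabin, and recommended.
* Replaced missing values in the rating columns (overall, seat_comfort, cabin_service, food_bev, entertainment, ground_service, value_for_money) with the column mean.
* Converted review_date and date_flown to pandas datetime and dropped rows whose date_flown could not be parsed.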

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### ***UNIVARIATE ANALYSIS***





#### Chart - 1 ---  What is the distribution of overall ratings?

In [None]:
# Chart - 1 visualization code

# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.histplot(data_airline_reviews_df['overall'], bins=10, kde=True, color='skyblue')

# Add labels and title
plt.title('Distribution of Overall Ratings')
plt.xlabel('Overall Rating')
plt.ylabel('Frequency')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose a histogram because it effectively shows the distribution of overall ratings. Its bins reveal how ratings are spread, while the KDE line helps in recognizing trends. The chart allows easy comparison of rating frequencies.
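
To make the binning concrete, the sketch below counts hypothetical ratings with `np.histogram`. It assumes integer ratings on a 1-10 scale and places the bin edges halfway between integers so that each rating value gets its own bin. Both the scale and the sample values are assumptions for illustration.

```python
import numpy as np

# Hypothetical overall ratings on an assumed 1-10 scale (not taken from the dataset)
ratings = np.array([1, 1, 2, 5, 7, 9, 10, 10, 1, 3])

# Bin edges at 0.5, 1.5, ..., 10.5 so each integer rating falls in its own bin
edges = np.arange(0.5, 11.5, 1.0)
counts, _ = np.histogram(ratings, bins=edges)

# Print the frequency of each rating value
for rating, count in zip(range(1, 11), counts):
    print(rating, count)
```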

##### 2. What is/are the insight(s) found from the chart?

The insight(s) from the chart of the distribution of overall ratings could include:
* The most common overall rating range, indicating the level of satisfaction.
* Whether the ratings are skewed towards positive or negative values, suggesting overall sentiment.
* Identification of any significant modes or peaks, indicating clusters of reviews with similar ratings.
*  Possible outliers or uncommon ratings that might warrant further investigation.
* A sense of the general customer sentiment and satisfaction level based on the spread of ratings.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*Positive Impact:*


* Insights can enhance customer satisfaction by addressing specific improvement areas.
* Sentiment trends inform adjustments, improving the passenger experience.
* Marketing can leverage positive ratings, attracting customers and boosting brand image.

*Negative Impact:*
* Clusters of low ratings suggest consistent issues, risking customer loyalty and growth.
* Ignoring outliers may lead to missed opportunities or further negative experiences.












#### Chart - 2 --- How are the ratings for "seat_comfort" distributed?

In [None]:
# Chart - 2 visualization code

# Plot the distribution of seat_comfort ratings
plt.figure(figsize=(8, 6))
plt.hist(data_airline_reviews_df['seat_comfort'], bins=10, color='lightgreen', edgecolor='black')
plt.title('Distribution of Seat Comfort Ratings')
plt.xlabel('Seat Comfort Rating')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram for "seat_comfort" ratings because it presents the frequency distribution, helping identify common comfort levels and patterns. Binning shows how prevalent ratings are within each interval, and notable modes or clusters become evident. It is an effective choice for visualizing the distribution of a numerical variable like "seat_comfort" ratings.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "seat_comfort" ratings distribution include identifying the most common comfort levels, observing any peaks or modes, and spotting potential outliers that indicate exceptional or problematic experiences. This information guides improvements and enhances passenger satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Insights can enhance comfort, satisfaction, and marketing, setting the airline apart.

Negative Impact:

1. Persistent discomfort clusters can lead to dissatisfaction and customer loss.
2. Ignoring extreme discomfort ratings risks reputation and loyalty decline.

#### Chart - 3 ---  Can you visualize the distribution of "food_bev" ratings?

In [None]:
# Chart - 3 visualization code

# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.boxplot(data=data_airline_reviews_df, x='food_bev', color='skyblue')

# Add labels and title
plt.title('Distribution of Food and Beverage Ratings (Box Plot)')
plt.xlabel('Food and Beverage Rating')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot because it effectively displays the central tendency, spread, and potential outliers in the distribution of "food_bev" ratings, offering a comprehensive view of the data's distribution characteristics.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "food_bev" ratings distribution (box plot) include:
* Median rating, representing the central tendency of food and beverage satisfaction.
* Spread between quartiles, indicating the range of typical ratings.
* Outliers, if present, revealing unusual or extreme satisfaction levels.
* An overview of the variability and potential issues in food and beverage satisfaction.
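
The quantities a box plot summarises can also be computed directly. The sketch below uses hypothetical ratings and the conventional 1.5 × IQR rule for flagging outliers, which is the default whisker length in matplotlib box plots.

```python
import numpy as np

# Hypothetical ratings: mostly 4s and 5s with a single very low score
ratings = np.array([4, 4, 4, 5, 5, 5, 5, 5, 5, 1])

# Quartiles and interquartile range (IQR)
q1, median, q3 = np.percentile(ratings, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]

print(q1, median, q3, iqr, outliers)
```

When most ratings sit at the top of the scale, the IQR is narrow, so even a single low score is flagged as an outlier.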

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Quality Improvement: --- Insights guide improvements in food and beverage quality, enhancing passenger satisfaction.
* Tailored Services: --- Addressing specific issues based on insights can lead to positive word-of-mouth and repeat business.


Negative Impact:---


* Outliers and Discontent:--- Identification of extreme dissatisfaction (outliers) helps address critical food and beverage issues.
* Consistent Low Ratings:--- Persistent low ratings indicate ongoing problems, risking customer dissatisfaction and negative reviews.



#### Chart - 4  --- What is the distribution of ratings for "entertainment"?

In [None]:
# Chart - 4 visualization code
# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.violinplot(data=data_airline_reviews_df, y='entertainment', color='lightcoral')

# Add labels and title
plt.title('Distribution of Entertainment Ratings (Violin Plot)')
plt.ylabel('Entertainment Rating')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a violin plot because it effectively showcases the distribution, quartiles, and density of "entertainment" ratings. This provides a comprehensive view of the data's characteristics in terms of both summary statistics and probability density.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "entertainment" ratings distribution (violin plot) include:


1. Identification of the median and quartiles, offering an overview of typical satisfaction levels.
2. Density representation shows the frequency of ratings across the range, indicating common satisfaction levels.
3. Potential skewness or bimodality in the density can reveal distinct entertainment experiences.
4. Provides a holistic view of entertainment satisfaction and its distribution characteristics.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


* Enhanced Services:--- Insights guide improvements in entertainment quality, leading to happier passengers.
* Tailored Offerings:--- Addressing specific entertainment preferences can differentiate the airline and attract more customers.


Negative Impact:


* Bimodal Distribution:--- Bimodality might indicate inconsistent entertainment experiences, risking dissatisfaction.
* Low Satisfaction Peaks:--- Persistent low ratings suggest ongoing entertainment issues, which could lead to negative reviews and customer loss.




#### Chart - 5 ---- How are the ratings for "value_for_money" spread across the reviews?


In [None]:
# Chart - 5 visualization code
# Set up the figure and axes
plt.figure(figsize=(8,6))
sns.boxplot(data=data_airline_reviews_df, y='value_for_money', color='pink')

# Add labels and title
plt.title('Spread of Value for Money Ratings')
plt.ylabel("Value for Money Rating")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I selected a box plot because it displays the spread, central tendency, and potential outliers of "value_for_money" ratings, making it easy to understand the distribution characteristics at a glance.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart for "value_for_money" ratings spread include:


* Median rating representing typical perceived value.
* Spread between quartiles indicating the range of value perceptions.
* Outliers, if present, signifying extreme value perceptions.
* Provides an overview of value satisfaction levels and potential issues in value perception.








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


* Service Enhancement:--- Insights on value perceptions guide adjustments to improve perceived value.
* Pricing Strategy:--- Addressing value concerns can lead to competitive pricing and positive customer perception.

Negative Impact:---


* Extreme Outliers:--- Extreme value perceptions need addressing to prevent negative reviews and potential reputation damage.
* Consistently Low Value Ratings:--- Persistent low ratings suggest ongoing issues, risking negative word-of-mouth and customer loss.




#### Chart - 6 --- What is the distribution of traveler types?

In [None]:
# Chart - 6 visualization code
# Count the occurrences of each traveler type
traveler_counts = data_airline_reviews_df['traveller_type'].value_counts()

# Set up the figure and axes
plt.figure(figsize = (10,6))
traveler_counts.plot(kind='bar', color='skyblue')

# Add labels and title
plt.title('Distribution of Traveler Types')
plt.xlabel('Traveler Type')
plt.ylabel('Number of Reviews')

# Rotate x labels  for better visibility
plt.xticks(rotation  = 45)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose a bar plot for the distribution of traveler types because it displays the frequency of each type in a clear and easily interpretable manner, making it suitable for categorical data comparisons.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of traveler type distribution (bar plot) include:


* Identification of the most common traveler types.
* Understanding the relative prevalence of different traveler segments.
* Insights into the passenger composition, which can guide targeted services and marketing strategies.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


*  Tailored Services:--- Insights about prevalent traveler types can guide personalized services, enhancing customer satisfaction.
* Effective Marketing:--- Understanding passenger composition aids in crafting targeted marketing strategies, attracting the right customer segments.

Negative Impact:---


* Neglected Segments:--- Ignoring less common traveler types might result in missed opportunities for revenue and customer satisfaction.
* Negative Reviews:--- Poor services for specific traveler types can lead to negative reviews and damage the airline's reputation.





#### Chart - 7 --- How many reviews are recommended and how many are not recommended?


In [None]:
# Chart - 7 visualization code
# Count the occurrences of recommended and not  recommended reviews
recommended_counts = data_airline_reviews_df['recommended'].value_counts()

# Set up the figure and axes
plt.figure(figsize=(6,4))
# value_counts() orders the categories by frequency, so the colors follow that order
plt.pie(recommended_counts, labels=recommended_counts.index, colors=['green', 'red'], autopct='%1.1f%%', startangle=140)

# Add title
plt.title("Distribution of Recommended and Not Recommended Reviews")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I chose a pie chart because it illustrates the split between two categories (recommended and not recommended) in a visually engaging way. It shows the proportion of each category relative to the whole.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart of recommended and not recommended reviews distribution is the proportion of each category, indicating the balance between positive and negative sentiments in passenger reviews. This insight can provide a sense of overall customer satisfaction and areas that might need improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Enhanced Services:---  Insights on review distribution can guide improvements in areas with lower recommendation rates.
* Reputation Management:--- Addressing negative reviews can lead to improved customer sentiment and positive online reputation.


Negative Impact:---


* Negative Sentiment:--- A high proportion of not recommended reviews suggests dissatisfaction, potentially leading to customer loss and negative word-of-mouth.
* Ignored Negative Reviews:--- Ignoring negative sentiment risks customer attrition and a tarnished brand image.



#### Chart - 8 --- What is the distribution of cabin types mentioned in the reviews?

In [None]:
# Chart - 8 visualization code
# Count the occurrences of each cabin type
cabin_counts = data_airline_reviews_df['cabin'].value_counts()

# Set up the figure and axes
plt.figure(figsize=(6, 6))

# A wedge width smaller than the pie radius (1 by default) leaves a hole in the centre,
# which produces the donut shape; pctdistance places the percentages inside the ring
plt.pie(cabin_counts, labels=cabin_counts.index, autopct='%1.1f%%', startangle=140, pctdistance=0.85, wedgeprops=dict(width=0.3))

# Add title
plt.title('Distribution of Cabin Types (Donut Chart)')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a donut chart to display the distribution of cabin types because it presents the proportion of each category and allows a clear comparison between categories, similar to a pie chart, with the open center enhancing readability.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of cabin types distribution include:--


* Identification of the most common cabin types mentioned in reviews.
* Understanding the relative prevalence of different cabin experiences.
* Insights into passenger preferences for cabin classes, guiding service and marketing strategies.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Service Enhancement:--- Insights about prevalent cabin types can guide improvements in those areas, enhancing customer satisfaction.
* Targeted Marketing:--- Understanding passenger preferences allows tailoring marketing strategies to specific cabin classes, attracting the right audience.

Negative Impact:---

* Neglected Cabin Classes:--- Ignoring less common cabin types might result in missed opportunities for personalized service.
* Negative Reviews:--- Poor services for specific cabin classes can lead to negative reviews and damage the airline's reputation.





#### ***BIVARIATE ANALYSIS:- NUMERICAL --   CATEGORICAL***


#### Chart - 9 --- How does the average "seat_comfort" rating differ among different airlines?

In [None]:
# Chart - 9 visualization code
# Calculate the average seat_comfort rating for each airline
average_ratings = data_airline_reviews_df.groupby('airline')['seat_comfort'].mean()

# Set up the figure and axes
plt.figure(figsize=(12, 6))
average_ratings.sort_values(ascending=False).plot(kind='bar', color='orange')

# Add labels and title
plt.title('Average Seat Comfort Ratings by Airline')
plt.xlabel('Airline')
plt.ylabel('Average Seat Comfort Rating')

# Rotate x labels for better visibility
plt.xticks(rotation=45)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I recommended a bar plot because it effectively compares and displays the average "seat_comfort" ratings for different airlines, making it easy to identify differences and patterns in seat comfort among airlines.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of average "seat_comfort" ratings among different airlines include:---


* Identification of airlines with higher and lower average seat comfort ratings.
* Comparison of seat comfort levels across airlines, aiding in making informed choices.
* Understanding which airlines excel in providing comfortable seating experiences.
* Insights for airlines to focus on areas for improvement based on seat comfort ratings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Service Enhancement:--- Insights on seat comfort disparities guide airlines to improve seating experiences, increasing passenger satisfaction.
* Competitive Edge:--- Airlines with higher comfort ratings can leverage this in marketing to attract more customers.

Negative Impact:---


* Low Comfort Ratings:--- Airlines with consistently lower seat comfort ratings might experience negative reviews, customer dissatisfaction, and potential loss of business.
* Neglected Improvement:--- Ignoring seat comfort disparities could lead to negative sentiment, impacting the airline's reputation and potential growth.




#### Chart - 10 --- Is there a difference in "food_bev" ratings between different traveler types?

In [None]:
# Chart - 10 visualization code

# Set up the figure and axes
plt.figure(figsize=(10,6))
sns.violinplot(data=data_airline_reviews_df, x = 'traveller_type', y= 'food_bev', palette=  'pastel')

# Add labels and title
plt.title('Food and Beverage Ratings by Traveler Type')
plt.xlabel('Traveler Type')
plt.ylabel('Food and Beverage Rating')

# Rotate x labels for better visibility
plt.xticks(rotation=45)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I recommended a violin plot because it effectively showcases the distribution, quartiles, and density of "food_bev" ratings for different traveler types, allowing for a comprehensive comparison of ratings and potential differences among groups.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart of "food_bev" ratings by traveler type (violin plot) include:---


* Identification of median ratings, highlighting typical satisfaction levels.
* Density representation showcasing frequency of ratings across different traveler types.
* Potential differences in satisfaction levels among traveler groups.
* Offers a holistic view of food and beverage satisfaction across various traveler types.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact:---


* Targeted Improvements:--- Insights on varying satisfaction levels can guide airlines to tailor food and beverage offerings for different traveler types.
* Enhanced Services:--- Addressing specific preferences can lead to higher satisfaction, positive reviews, and repeat business.

Negative Impact:---


* Low Satisfaction Groups:--- Traveler types with consistently low ratings signal dissatisfaction that, if left unaddressed, can lead to negative sentiment and customer attrition.
* Neglected Preferences:--- Ignoring specific traveler preferences can result in dissatisfaction and negative reviews, affecting the airline's reputation and growth.





#### Chart - 11 ---  Compare the average "cabin_service" ratings for different cabins?

In [None]:
# Chart - 11 visualization code

# Define a list of darker colors
darker_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']

# Set up the figure and axes
plt.figure(figsize=(12, 6))
sns.pointplot(data=data_airline_reviews_df, x='cabin', y='cabin_service', errorbar='sd', palette=darker_colors)

# Add labels and title
plt.title('Cabin Service Ratings by Cabin Type (Point Plot)')
plt.xlabel('Cabin Type')
plt.ylabel('Cabin Service Rating')

# Rotate x labels for better visibility
plt.xticks(rotation=45)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I recommended a point plot because it effectively compares the average "cabin_service" ratings for different cabin types, highlighting the mean and variability in ratings. This allows for clear insights into potential differences and trends among the cabin types.

##### 2. What is/are the insight(s) found from the chart?


The insight from the chart of "cabin_service" ratings by cabin type (point plot) includes:---


* Comparison of average cabin service ratings among different cabin types.
* Identification of variations and trends in service quality across cabins.
* Potential insights into areas for improvement in specific cabin types.
* Understanding passenger perceptions of cabin service based on the ratings.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Targeted Enhancements:--- Insights on cabin service differences can guide airlines to focus on improving specific cabin types.
* Enhanced Customer Satisfaction:--- Addressing disparities can lead to increased customer satisfaction and loyalty.


Negative Impact:---


* Neglected Cabins:--- Ignoring lower-rated cabins may lead to dissatisfaction and negative reviews, affecting brand reputation and potential business loss.
* Uninformed Decision-Making:--- Without addressing service disparities, airlines might miss opportunities for improvements, hindering growth.




#### Chart - 12 --- Are there variations in "entertainment" ratings across recommended and not recommended reviews?

In [None]:
# Chart - 12 visualization code

# Calculate the average entertainment rating for each recommended status
average_ratings = data_airline_reviews_df.groupby('recommended')['entertainment'].mean()

# Set up the figure and axes
plt.figure(figsize=(8, 6))
average_ratings.plot(kind='bar', color=['blue', 'orange'])

# Add labels and title
plt.title('Average Entertainment Ratings by Recommended Status')
plt.xlabel('Recommended Status')
plt.ylabel('Average Entertainment Rating')

# Customize x labels
plt.xticks(range(2), ['Not Recommended', 'Recommended'], rotation=0)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I recommended a bar plot because it displays the average "entertainment" rating for recommended and not recommended reviews side by side, making it easy to compare the two groups and identify differences in ratings.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart of "entertainment" ratings by recommended status (bar plot) includes:---


* Clear comparison of average entertainment ratings between recommended and not recommended reviews.
* Identification of potential variations in entertainment satisfaction between the two review categories.
* Insights into whether entertainment quality impacts the likelihood of a review being recommended or not.
* Possibility of understanding how entertainment contributes to overall passenger satisfaction.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Informed Decisions:--- Insights can guide improvements in entertainment offerings, enhancing passenger satisfaction and potentially leading to more positive reviews.
* Targeted Enhancements:--- Tailoring entertainment experiences based on insights can increase customer loyalty and positive word-of-mouth.

Negative Impact:---
* Negative Impression:--- If "not recommended" reviews consistently mention poor entertainment, it could harm the airline's reputation and discourage potential customers.
* Missed Opportunities:--- Ignoring entertainment quality can result in lost revenue and negative sentiment, impacting business growth.



#### ***BIVARIATE ANALYSIS:-- NUMERICAL -- NUMERICAL***




#### Chart - 13 --- Is there a correlation between "seat_comfort" and "food_bev" ratings?

In [None]:
# Chart - 13 visualization code

# Set up the figure and axes
sns.set(style='white')
sns.jointplot(data=data_airline_reviews_df, x='seat_comfort', y='food_bev', kind='scatter', color='blue')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I recommended a joint plot because it combines a scatter plot of the two variables with their individual (marginal) distributions. This allows for a comprehensive understanding of the relationship between "seat_comfort" and "food_bev" ratings, including their correlation and distribution patterns.

##### 2. What is/are the insight(s) found from the chart?

The insight from the joint plot of "seat_comfort" and "food_bev" ratings includes:---

* Positive Correlation:--- Generally, higher "seat_comfort" ratings are associated with higher "food_bev" ratings.
* Linear Trend:--- The scatter points roughly follow a linear pattern, suggesting a moderate positive correlation.
* Distribution Insights:--- The histograms show the distributions of both ratings individually, indicating common ratings and potential areas for improvement.

Overall, passengers who rate seats higher for comfort tend to rate food and beverage offerings higher as well.
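To put a number on the association seen in the joint plot, a rank correlation can be computed. The following is a minimal sketch on a small made-up frame: `toy_df` and its values are illustrative stand-ins, not taken from the dataset, and Spearman's coefficient is used on the assumption that the ratings are ordinal 1 to 5 scores.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for data_airline_reviews_df
toy_df = pd.DataFrame({
    "seat_comfort": [1, 2, 2, 3, 4, 4, 5, 5, None, 3],
    "food_bev":     [1, 1, 2, 3, 3, 4, 4, 5, 2, None],
})

# Keep only rows where both ratings are present
pairs = toy_df[["seat_comfort", "food_bev"]].dropna()

# Spearman rank correlation and its p-value
rho, p_value = stats.spearmanr(pairs["seat_comfort"], pairs["food_bev"])
print(f"Spearman rho: {rho:.3f}, p-value: {p_value:.4f}")
```

On the real data, replacing `toy_df` with `data_airline_reviews_df` should give a coefficient that can be compared with the visual impression from the scatter points.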


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:---


* Improvement Opportunities:--- Insights can guide targeted enhancements in seat comfort and food/beverage quality, boosting passenger satisfaction.
* Enhanced Reputation:--- Addressing both aspects collectively can lead to positive reviews, increased loyalty, and positive word-of-mouth.


Negative Impact:---


* Discrepancy Challenges:--- If there's a significant disparity between seat comfort and food/beverage ratings, it could lead to dissatisfaction and negative reviews.
* Missed Synergies:--- Ignoring the correlation might result in missed opportunities to enhance passenger experiences holistically, hindering business growth.



#### Chart - 14 --- How does the "ground_service" rating correlate with the "value_for_money" rating?

In [None]:
# Chart - 14 visualization code
# Set up the figure and axes
plt.figure(figsize=(8, 6))
plt.hexbin(data_airline_reviews_df['ground_service'], data_airline_reviews_df['value_for_money'], gridsize=20, cmap='Greens')

# Add labels and title
plt.title('Correlation between Ground Service and Value for Money Ratings (Hexbin Plot)')
plt.xlabel('Ground Service Rating')
plt.ylabel('Value for Money Rating')

# Add a colorbar
cb = plt.colorbar()
cb.set_label('Frequency')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?


I recommended a hexbin plot because it effectively displays the density of data points between "ground_service" and "value_for_money" ratings. This plot helps to observe areas of higher concentration and trends in their relationship, especially when dealing with a large number of data points.

##### 2. What is/are the insight(s) found from the chart?


The insight from the hexbin plot of "ground_service" and "value_for_money" ratings includes:---


* Density Patterns:--- The plot reveals areas of high density, indicating common combinations of ratings.
* Correlation Trend:--- Dense regions along a diagonal suggest a positive correlation between ground service and value for money.
* Spread Variation:--- Sparsely populated areas suggest some reviews with differing ratings for these aspects.
* Potential for Improvement:--- Focus on enhancing both ground service and value for money together could lead to positive customer experiences.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Impact:---
 The hexbin plot suggests areas of common ratings and density patterns between ground service and value for money, enabling targeted enhancements for positive passenger experiences.
* Negative Growth:---
 While the plot doesn't directly indicate negative growth, potential challenges may arise if a cluster of low ratings in the hexbin plot signifies a significant dissatisfaction in both aspects, possibly leading to negative reviews and reputation impact.



#### Chart - 15 ---  Is there any relationship between "overall" ratings and "value_for_money"?

In [None]:
# Chart - 15 visualization code
# Set up the figure and axes
plt.figure(figsize=(8, 6))
sns.violinplot(data=data_airline_reviews_df, x='overall', y='value_for_money', palette='magma')

# Add labels and title
plt.title('Relationship between Overall Ratings and Value for Money Ratings')
plt.xlabel('Overall Rating')
plt.ylabel('Value for Money Rating')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I recommended a violin plot because it effectively displays the distribution of "value_for_money" ratings across different "overall" rating categories, allowing for insights into the relationship between these variables and potential patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

The insight from the violin plot of "overall" ratings and "value_for_money" is:---


* Consistent Patterns:--- Higher "overall" ratings tend to have higher "value_for_money" ratings, suggesting passengers perceive better value as overall satisfaction increases.

* Distribution Variation:--- Ratings' distribution widens as "overall" ratings increase, indicating more diverse perceptions of value at higher overall satisfaction levels.
* Positive Correlation:--- The plot supports the notion that passengers generally associate better overall experiences with better perceived value for money.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Impact:--- Insights highlight a positive correlation between overall satisfaction and perceived value, enabling targeted improvements for a more positive passenger experience and potential for higher ratings and loyalty.
* Negative Growth:--- The insights don't inherently lead to negative growth. However, a consistent lack of value perception despite high overall ratings could potentially lead to dissatisfaction and negative reviews if passengers feel their money isn't well spent.



#### ***BIVARIATE ANALYSIS :-- CATEGORICAL -- CATEGORICAL***

#### Chart - 16 --- How does the distribution of traveler types vary across different airlines?

In [None]:
# Chart - 16 visualization code
# Set up the figure and axes
plt.figure(figsize=(12, 6))
sns.countplot(data=data_airline_reviews_df, x='airline', hue='traveller_type', palette='Set1')

# Add labels and title
plt.title('Distribution of Traveler Types across Different Airlines')
plt.xlabel('Airline')
plt.ylabel('Count')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add a legend
plt.legend(title='Traveler Type')

# Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I recommended a grouped bar plot because it effectively displays the distribution of traveler types across different airlines, allowing for easy comparison of counts within each category. This plot helps identify variations in traveler types' preferences across airlines.

##### 2. What is/are the insight(s) found from the chart?


The insights from the grouped bar plot of traveler types across different airlines are:---


* Prevalent Traveler Types:--- It highlights which traveler types are most common for each airline.
* Airlines' Focus:--- Shows if certain airlines cater more to specific traveler segments (e.g., business, leisure, family).
* Market Segmentation:--- Identifies potential market segments airlines might target based on traveler type preferences.
* Competitive Analysis:--- Allows for competitive analysis by comparing traveler type distributions among airlines.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Impact:--- Insights aid in tailoring services to specific traveler types, enhancing customer satisfaction and loyalty, potentially leading to positive reviews and repeat business.
* Negative Growth:--- While insights don't inherently lead to negative growth, if an airline doesn't align its services with prevalent traveler preferences, it might miss opportunities for positive growth and customer retention.



#### Chart - 17 --- Can you explore the relationship between the recommended status and the traveler type?

In [None]:
# Chart - 17 visualization code
# Calculate the percentage of recommended and not recommended for each traveler type
recommendation_percentages = data_airline_reviews_df.groupby(['traveller_type', 'recommended']).size().unstack().apply(lambda x: x / x.sum(), axis=1)

# Plot on a single 12x8 figure (DataFrame.plot creates its own figure, so figsize is passed directly)
recommendation_percentages.plot(kind='bar', stacked=True, colormap='coolwarm', figsize=(12, 8))

# Add labels and title
plt.title('Recommended Status by Traveler Type')
plt.xlabel('Traveler Type')
plt.ylabel('Percentage')

# Add a legend
plt.legend(title='Recommended')

# Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I recommended the stacked percentage bar plot because it effectively illustrates the distribution of "recommended" and "not recommended" statuses across different traveler types in terms of percentages. This plot allows for a clear comparison of the proportion of each status within each traveler type, helping to understand how the recommendation status varies among different types of travelers.
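The row-wise shares behind such a plot can also be computed with `pd.crosstab`. The sketch below is only an illustration on made-up rows: `toy_df` is hypothetical, and the `'yes'`/`'no'` labels for `recommended` are an assumption about how the column is coded.

```python
import pandas as pd

# Hypothetical toy data standing in for data_airline_reviews_df
toy_df = pd.DataFrame({
    "traveller_type": ["Business", "Business", "Solo Leisure",
                       "Solo Leisure", "Solo Leisure", "Family Leisure"],
    "recommended": ["yes", "no", "yes", "yes", "no", "no"],
})

# normalize="index" divides each row by its total, so every row sums to 1
shares = pd.crosstab(toy_df["traveller_type"], toy_df["recommended"], normalize="index")
print(shares.round(2))
```

Calling `shares.plot(kind='bar', stacked=True)` on the result would give the same kind of stacked percentage chart.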






##### 2. What is/are the insight(s) found from the chart?


The insights from the stacked percentage bar plot of "recommended" status by traveler type are:---



* Varied Recommendations:--- The plot shows the percentage breakdown of "recommended" and "not recommended" reviews for each traveler type.

* Traveler Type Impact:--- Some traveler types have a higher proportion of recommended reviews, while others have a higher proportion of not recommended reviews.


* Service Adjustments: If a specific traveler type consistently provides lower recommendations, it indicates potential areas for service improvement or personalized offerings for that segment.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Impact: Insights allow airlines to adjust their services and offerings based on traveler types' recommendation patterns, enhancing customer satisfaction and loyalty. This targeted approach can lead to more positive reviews, increased repeat business, and positive word-of-mouth.
* Negative Growth: Insights don't inherently lead to negative growth. However, if a significant traveler type consistently provides negative recommendations, it might indicate dissatisfaction that could lead to negative reviews and reputational damage if not addressed promptly.



#### Chart - 18 --- Is there a connection between different cabin types and the recommendation status?


In [None]:
# Chart - 18 visualization code
from statsmodels.graphics.mosaicplot import mosaic

# Create a mosaic plot on an explicit axes (mosaic() opens a new figure when no ax is passed)
fig, ax = plt.subplots(figsize=(12, 8))
mosaic(data_airline_reviews_df, ['cabin', 'recommended'], ax=ax, title='Mosaic Plot of Recommended Status by Cabin Type')

# Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?


I recommended a mosaic plot because it offers a comprehensive and visually intuitive way to depict the relationship between two categorical variables (cabin type and recommendation status). The mosaic plot's varying rectangle sizes provide a clear representation of the distribution of recommendation status within each cabin type, allowing for easy comparison and identification of any notable patterns or disparities.



##### 2. What is/are the insight(s) found from the chart?


The insights from the mosaic plot of cabin types and recommendation status are:---

* Cabin Variation: The plot shows how the distribution of "recommended" and "not recommended" reviews varies across different cabin types.

* Recommendation Patterns: It's evident whether certain cabin types have a higher or lower proportion of recommended reviews compared to others.
* Service Quality: If specific cabin types consistently receive lower recommendations, it may suggest that passengers in those cabins have less satisfactory experiences.


* Opportunities for Improvement: If there are notable differences in recommendation proportions among cabin types, airlines can focus on improving services in certain cabin categories to enhance passenger satisfaction and recommendations.
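Whether the differences seen in a mosaic plot are larger than chance would produce can be checked with a chi-square test of independence. The following is a hedged sketch on hypothetical counts: `toy_counts`, the cabin labels, and the `'yes'`/`'no'` columns are all made up for illustration and are not results from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are cabin types, columns are recommendation status
toy_counts = pd.DataFrame(
    {"no": [60, 15, 10], "yes": [40, 35, 30]},
    index=["Economy Class", "Business Class", "First Class"],
)

# Chi-square test of independence between cabin type and recommendation status
chi2, p_value, dof, expected = chi2_contingency(toy_counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
```

On the real data, the table could be built with `pd.crosstab(data_airline_reviews_df['cabin'], data_airline_reviews_df['recommended'])`.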



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Impact: Insights can drive positive business impact. Addressing the variations in recommendation status among different cabin types allows airlines to tailor services, amenities, and overall experiences, resulting in improved passenger satisfaction and higher recommendations. This can lead to positive reviews, repeat business, and enhanced brand reputation.
*  Negative Growth: The insights themselves don't inherently lead to negative growth. However, if specific cabin types consistently receive lower recommendations, it could indicate areas of concern in service quality or passenger experience. If left unaddressed, this could potentially lead to negative reviews, decreased customer loyalty, and negative impact on business growth.


#### ***MULTIVARIATE ANALYSIS***


#### Chart - 19 --- How do "seat_comfort," "food_bev," and "cabin_service" ratings collectively contribute to the overall ratings?

In [None]:
# Chart - 19 visualization code
import plotly.express as px

# Select relevant columns
selected_columns = ['seat_comfort', 'food_bev', 'cabin_service', 'overall']

# Create a parallel coordinate plot
fig = px.parallel_coordinates(data_airline_reviews_df[selected_columns], color='overall', color_continuous_scale='RdBu')

# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

I recommended the parallel coordinates plot because it effectively displays the relationships and interactions between multiple variables ("seat_comfort," "food_bev," "cabin_service," and "overall" ratings) in a single visualization. This plot allows you to observe how changes in each individual rating relate to changes in the overall rating. By coloring the lines based on the "overall" rating, it provides an additional layer of insight into how these variables collectively contribute to the passengers' overall experience. This visualization is particularly useful when comparing multiple variables and their impact on a target variable.

##### 2. What is/are the insight(s) found from the chart?


The parallel coordinates plot reveals how changes in "seat_comfort," "food_bev," and "cabin_service" ratings are related to changes in the overall rating. It highlights patterns, interactions, and clusters among these variables, aiding in understanding their combined impact on passenger satisfaction. This insight helps prioritize improvements and tailor services for a better overall experience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can create a positive impact by helping airlines prioritize improvements and tailor services. Neglecting critical aspects could lead to negative growth due to unaddressed customer concerns affecting overall satisfaction and recommendations.

#### Chart - 20 --- How does the relationship between "value_for_money" and "seat_comfort" change when considering traveler types?

In [None]:
# Chart - 20 visualization code
# Set the style for the plot
sns.set(style="whitegrid")

plt.figure(figsize=(10, 6))
# errorbar=None hides the error bars (replaces the deprecated errcolor/errwidth arguments)
sns.barplot(x='traveller_type', y='value_for_money', hue='seat_comfort', data=data_airline_reviews_df, errorbar=None, dodge=True)

plt.xlabel('Traveler Type')
plt.ylabel('Value for Money')
plt.title('Relationship between Value for Money and Seat Comfort by Traveler Type')

plt.legend(title='Seat Comfort')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot with hue differentiation was recommended because it effectively compares "value_for_money" and "seat_comfort" across different "traveller_type" categories using color-coded bars for different seat comfort levels, making the relationship easy to visualize and compare.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals how the relationship between "value_for_money" and "seat_comfort" varies among different traveler types. It shows which traveler types experience better value for money and seat comfort combinations, helping identify potential patterns or preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can positively impact business by guiding tailored services. Negative growth might result from dissatisfaction in certain traveler types, requiring improvements to prevent declines in customer satisfaction and loyalty.

#### Chart - 21 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select numerical columns for correlation
numerical_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Calculate the correlation matrix
correlation_matrix = data_airline_reviews_df[numerical_columns].corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Ratings')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen because it effectively displays the relationships between multiple numerical variables, aiding in identifying patterns, strengths, and directions of correlations in a concise and visually intuitive manner.






##### 2. What is/are the insight(s) found from the chart?

Insights from the correlation heatmap include identifying positive and negative relationships between different rating aspects. It can suggest potential improvement areas, trade-offs, and help prioritize aspects for enhancement in airline reviews.
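One way to turn a correlation matrix into a priority list is to rank each rating by its correlation with "overall". The sketch below is a minimal illustration on synthetic data: `toy_df` is generated at random with a fixed seed and only a subset of columns, so its numbers say nothing about the actual reviews.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings columns (assumption, not real data)
rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=200)
toy_df = pd.DataFrame({
    "overall": base * 2,                                                     # tied to base by construction
    "seat_comfort": np.clip(base + rng.integers(-1, 2, size=200), 1, 5),     # base plus small noise
    "entertainment": rng.integers(1, 6, size=200),                           # independent of base
})

# Pearson correlation matrix, then the "overall" column sorted from strongest to weakest
corr = toy_df.corr()
ranked = corr["overall"].drop("overall").sort_values(ascending=False)
print(ranked)
```

Applied to `correlation_matrix` from the heatmap cell, the same two lines would list the rating aspects most strongly associated with the overall score.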

#### Chart - 22 --- Pair Plot

In [None]:
# Pair Plot visualization code
# Select numerical columns for the pair plot
numerical_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create a pair plot
sns.pairplot(data_airline_reviews_df[numerical_columns])
plt.suptitle("Pair Plot of Ratings", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was recommended because it helps visualize relationships between multiple numerical variables, enabling identification of correlations, patterns, and potential trade-offs among the "overall", "seat_comfort", "cabin_service", "food_bev", "entertainment", "ground_service", and "value_for_money" ratings in a single view.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, insights can include identifying:---

1. Correlations:--- Positive or negative correlations between rating variables, indicating how they tend to change together.

2. Clusters:--- Groupings of data points in the scatter plots, suggesting certain combinations of ratings are more common.

3. Patterns:--- Trends in how one rating affects another; for instance, higher "seat_comfort" might be associated with higher "value_for_money".

4. Outliers:--- Data points that deviate significantly from the overall pattern, indicating unique cases.

5. Trade-offs:--- Potential trade-offs between different aspects, such as passengers willing to compromise "value_for_money" for better "entertainment".

These insights depend on the data and context, so the pair plot should be examined closely before drawing conclusions.






## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothetical Statement 1 :--- The average "seat_comfort" rating is different among different traveler types.

Hypothetical Statement 2 :--- There is a correlation between "value_for_money" and "seat_comfort" ratings.

Hypothetical Statement 3 :--- The distribution of "overall" ratings is significantly different for recommended and not recommended reviews.

### Hypothetical Statement - 1 --- Average "seat_comfort" rating among different traveler types

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0) :--- The average "seat_comfort" rating is the same among all traveler types.

Alternative Hypothesis (H1) :--- The average "seat_comfort" rating is different among different traveler types.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

import scipy.stats as stats

# Reference value: overall mean seat_comfort rating (missing ratings excluded)
overall_mean = data_airline_reviews_df['seat_comfort'].dropna().mean()

# Skip missing traveler types; they cannot form a group
traveler_types = data_airline_reviews_df['traveller_type'].dropna().unique()

for traveler_type in traveler_types:
    subset = data_airline_reviews_df[data_airline_reviews_df['traveller_type'] == traveler_type]
    # Drop missing ratings, otherwise ttest_1samp returns NaN
    sample_data = subset['seat_comfort'].dropna()
    t_statistic, p_value = stats.ttest_1samp(sample_data, overall_mean)

    print(f"Traveler Type: {traveler_type}")
    print(f"T-statistic: {t_statistic:.4f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < 0.05:
        print("Reject Null Hypothesis: Average seat_comfort rating differs from the overall mean.")
    else:
        print("Fail to Reject Null Hypothesis: No significant difference from the overall mean.")

    print("=" * 40)


##### Which statistical test have you done to obtain P-Value?


I used the one-sample t-test, via the 'stats.ttest_1samp' function from the 'scipy.stats' module, to obtain the p-values. For each traveler type, the test checks whether the mean "seat_comfort" rating differs significantly from the overall mean "seat_comfort" rating across all traveler types.

Note that this compares each group with the overall mean rather than comparing the groups with each other, so it addresses the null hypothesis only indirectly.






##### Why did you choose the specific statistical test?

I chose the one-sample t-test because it compares a group's mean with a fixed reference value, here the overall mean "seat_comfort" rating. It indicates whether the observed difference in average ratings is likely due to a true difference or to random chance.
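
A one-way ANOVA tests the equality of means across all groups in a single test, which matches the null hypothesis stated above more directly. The following is a minimal sketch on synthetic ratings; the group names, means, and sample sizes are assumptions made for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic seat_comfort ratings for three hypothetical traveler types
rng = np.random.default_rng(0)
groups = {
    "Business": rng.normal(loc=3.6, scale=1.0, size=200),
    "Couple Leisure": rng.normal(loc=3.0, scale=1.0, size=200),
    "Solo Leisure": rng.normal(loc=3.0, scale=1.0, size=200),
}

# One-way ANOVA: H0 states that all group means are equal
f_statistic, p_value = stats.f_oneway(*groups.values())

print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4g}")
```

On the real data, the groups could be built with 'groupby' on 'traveller_type' after dropping missing 'seat_comfort' values.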

### Hypothetical Statement - 2 --- Correlation between "value_for_money" and "seat_comfort" ratings

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no correlation between "value_for_money" and "seat_comfort" ratings.

Alternative Hypothesis (H1): There is a correlation between "value_for_money" and "seat_comfort" ratings.

#### 2. Perform an appropriate statistical test.

In [None]:
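# Perform Statistical Test to obtain P-Value
# Drop rows with a missing rating in either column; pearsonr does not handle NaN
rating_pair = data_airline_reviews_df[['value_for_money', 'seat_comfort']].dropna()

correlation = rating_pair.corr()

# pearsonr returns the correlation coefficient and its two-sided p-value
correlation_coefficient, p_value = stats.pearsonr(rating_pair['value_for_money'], rating_pair['seat_comfort'])

print("Correlation Matrix:")
print(correlation)

print(f"Correlation Coefficient: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject Null Hypothesis: The two ratings are correlated.")
else:
    print("Fail to Reject Null Hypothesis: No significant correlation.")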

##### Which statistical test have you done to obtain P-Value?

I used the Pearson correlation test, via the 'stats.pearsonr' function from the 'scipy.stats' module. The p-value it returns indicates whether the linear correlation between "value_for_money" and "seat_comfort" ratings is statistically significant.

##### Why did you choose the specific statistical test?

I chose the Pearson correlation test because it measures both the strength and the significance of a linear relationship between two numeric variables. The "value_for_money" and "seat_comfort" ratings are ordinal scores that are treated as numeric here.
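
Because the ratings are ordinal scores, a rank-based measure such as Spearman's correlation is a reasonable cross-check on the Pearson result. The sketch below compares the two on synthetic 1 to 5 ratings; the data and the way "seat_comfort" is derived from "value_for_money" are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic 1-5 ratings; seat_comfort loosely tracks value_for_money
value_for_money = rng.integers(1, 6, size=300)
noise = rng.integers(-1, 2, size=300)  # -1, 0, or +1
seat_comfort = np.clip(value_for_money + noise, 1, 5)

pearson_r, pearson_p = stats.pearsonr(value_for_money, seat_comfort)
spearman_rho, spearman_p = stats.spearmanr(value_for_money, seat_comfort)

print(f"Pearson r:    {pearson_r:.4f} (p = {pearson_p:.4g})")
print(f"Spearman rho: {spearman_rho:.4f} (p = {spearman_p:.4g})")
```

If both coefficients agree in sign and rough size, the conclusion does not depend on treating the ratings as continuous.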

### Hypothetical Statement - 3 --- Distribution of "overall" ratings for recommended and not recommended reviews

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0) :--- The distribution of "overall" ratings is the same for recommended and not recommended reviews.

Alternative Hypothesis (H1) :--- The distribution of "overall" ratings is significantly different for recommended and not recommended reviews.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Drop missing ratings, otherwise ttest_ind returns NaN
recommended_ratings = data_airline_reviews_df[data_airline_reviews_df['recommended'] == 'yes']['overall'].dropna()
not_recommended_ratings = data_airline_reviews_df[data_airline_reviews_df['recommended'] == 'no']['overall'].dropna()

# Independent two-sample t-test on the group means
t_statistic, p_value = stats.ttest_ind(recommended_ratings, not_recommended_ratings)

print("T-Test Results:")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject Null Hypothesis: Mean overall rating differs between recommended and not recommended reviews.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference in mean overall rating between recommended and not recommended reviews.")


##### Which statistical test have you done to obtain P-Value?

I used the independent two-sample t-test, via the 'stats.ttest_ind' function from the 'scipy.stats' module, to obtain the p-value. The test compares the mean "overall" rating of recommended reviews with that of not recommended reviews, and the p-value indicates whether the difference in means is statistically significant.

Note that the t-test compares means only, whereas the hypothesis as stated refers to distributions.

##### Why did you choose the specific statistical test?

I chose the independent two-sample t-test because recommended and not recommended reviews form two separate groups. The test assesses whether the difference in their average "overall" ratings reflects an actual difference or random variability.
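
Since the hypothesis is worded in terms of distributions, a rank-based test such as the Mann-Whitney U test is a possible complement to the t-test. The following is a minimal sketch on synthetic "overall" ratings; the rating ranges and sample sizes are assumptions for illustration, not results from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic "overall" ratings on a 1-10 scale for the two groups
recommended_ratings = rng.integers(6, 11, size=150)      # values 6 to 10
not_recommended_ratings = rng.integers(1, 6, size=150)   # values 1 to 5

# Mann-Whitney U: H0 states that both groups come from the same distribution
u_statistic, p_value = stats.mannwhitneyu(
    recommended_ratings, not_recommended_ratings, alternative="two-sided"
)

print(f"U-statistic: {u_statistic:.1f}")
print(f"P-value: {p_value:.4g}")
```

Unlike the t-test, this test does not assume normally distributed ratings, which may suit bounded integer scores.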

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Convert 'review_date' to datetime
data_airline_reviews_df['review_date'] = pd.to_datetime(data_airline_reviews_df['review_date'])
data_airline_reviews_df['date_flown'] = pd.to_datetime(data_airline_reviews_df['date_flown'])

# Define columns for imputation
numeric_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
                   'ground_service', 'value_for_money']
categorical_columns = ['airline', 'traveller_type', 'cabin', 'route', 'recommended']

# Mean imputation for numerical columns
data_airline_reviews_df[numeric_columns] = data_airline_reviews_df[numeric_columns].fillna(data_airline_reviews_df[numeric_columns].mean())

# Mode imputation for categorical columns
data_airline_reviews_df[categorical_columns] = data_airline_reviews_df[categorical_columns].apply(lambda x: x.fillna(x.mode()[0]))

# Display the results
print(data_airline_reviews_df)

#### What all missing value imputation techniques have you used and why did you use those techniques?


I used mean imputation for the numerical columns and mode imputation for the categorical columns. Mean imputation replaces missing numerical values with the column mean, which preserves the mean but shrinks the variance. Mode imputation fills missing categorical values with the most frequent category, which keeps every row but inflates that category's share. Both methods are simple and widely used, but they can bias results when many values are missing or when values are not missing at random.
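
The effect of both techniques can be checked on a toy example. The sketch below uses made-up values (not rows from the dataset) to show that mean imputation keeps the mean but lowers the variance, and that mode imputation raises the share of the most frequent category.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two missing ratings
ratings = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 3.0])
ratings_imputed = ratings.fillna(ratings.mean())

print(f"Mean before: {ratings.mean():.3f}, after: {ratings_imputed.mean():.3f}")
print(f"Variance before: {ratings.var():.3f}, after: {ratings_imputed.var():.3f}")

# Toy categorical column with one missing value
cabin = pd.Series(["Economy Class", "Economy Class", "Business Class", None])
cabin_imputed = cabin.fillna(cabin.mode()[0])

share_before = (cabin.dropna() == "Economy Class").mean()
share_after = (cabin_imputed == "Economy Class").mean()
print(f"Economy share before: {share_before:.2f}, after: {share_after:.2f}")
```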






### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Numeric columns for outlier detection
numeric_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
                   'ground_service', 'value_for_money']

# Calculate z-scores for numeric columns
z_scores = np.abs((data_airline_reviews_df[numeric_columns] - data_airline_reviews_df[numeric_columns].mean()) / data_airline_reviews_df[numeric_columns].std())

# Keep rows where z-score is less than a threshold (e.g., 3)
outlier_threshold = 3
data_airline_reviews_df_cleaned = data_airline_reviews_df[(z_scores < outlier_threshold).all(axis=1)]

# Display the results
print("Original DataFrame:")
print(data_airline_reviews_df[numeric_columns].head())
print("\nCleaned DataFrame (Outliers Removed):")
print(data_airline_reviews_df_cleaned[numeric_columns].head())

In [None]:
# Numeric columns for outlier detection
numeric_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
                   'ground_service', 'value_for_money']

# Calculate z-scores for numeric columns
z_scores = np.abs((data_airline_reviews_df[numeric_columns] - data_airline_reviews_df[numeric_columns].mean()) / data_airline_reviews_df[numeric_columns].std())

# Check if any z-score is greater than a threshold (e.g., 3)
z_score_threshold = 3
outliers_present = (z_scores > z_score_threshold).any().any()

if outliers_present:
    print("Outliers are present in the numeric columns.")
else:
    print("No outliers are present in the numeric columns.")

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the z-score technique to identify and remove outliers. A z-score measures how far a data point is from the mean in units of standard deviation. I chose this technique because it is simple, statistically grounded, and detects extreme values that might distort the analysis. With a threshold of 3, rows with any rating beyond the threshold are dropped to create 'data_airline_reviews_df_cleaned'; the original DataFrame is left unchanged.
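
To make the filter concrete, here is a minimal sketch on a hypothetical column with one extreme entry. The values are invented for illustration; the filtering expression mirrors the one used on the rating columns.

```python
import pandas as pd

# Hypothetical ratings: twenty typical values and one extreme entry
ratings = pd.DataFrame({"overall": [4] * 10 + [5] * 10 + [50]})

# Absolute z-score per value (pandas uses the sample standard deviation)
z_scores = ((ratings - ratings.mean()) / ratings.std()).abs()

# Keep rows whose z-score is below the threshold in every column
outlier_threshold = 3
cleaned = ratings[(z_scores < outlier_threshold).all(axis=1)]

print(f"Rows before: {len(ratings)}, rows after: {len(cleaned)}")
print(f"Largest z-score: {z_scores['overall'].max():.2f}")
```

Note that the extreme value itself inflates the mean and standard deviation, so in small samples a z-score may never reach the threshold.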

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Select categorical columns for encoding
categorical_columns = ['airline', 'traveller_type', 'cabin', 'route', 'recommended']

# Perform one-hot encoding using pandas' get_dummies function
encoded_data = pd.get_dummies(data_airline_reviews_df, columns=categorical_columns, drop_first=True)

# Display the encoded data
print(encoded_data.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

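I used one-hot encoding for the categorical variables. One-hot encoding keeps categories distinct, avoids imposing an artificial order on them, and makes categorical data compatible with machine learning algorithms. Setting 'drop_first=True' drops the first encoded column of each categorical feature, which avoids perfect multicollinearity among the dummy columns. Columns with many distinct values, such as 'route', can produce a very wide encoded table.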
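
The effect of 'drop_first=True' is easiest to see on a small example. The sketch below uses a toy 'cabin' column with assumed category names.

```python
import pandas as pd

# Toy categorical column (category names are illustrative)
toy = pd.DataFrame(
    {"cabin": ["Economy Class", "Business Class", "First Class", "Economy Class"]}
)

# One column per category
full = pd.get_dummies(toy, columns=["cabin"])

# First category (alphabetically) dropped; it becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["cabin"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in every remaining dummy column belongs to the dropped baseline category, so no information is lost.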

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re
# Define a dictionary of common contractions and their expansions
contraction_mapping = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "could've": "could have",
    # Add more contractions and expansions here
}

# Function to expand contractions
def expand_contractions(text, contraction_mapping):
    # Escape the keys and match whole words only
    contraction_pattern = re.compile(r'\b({})\b'.format('|'.join(re.escape(key) for key in contraction_mapping)), flags=re.IGNORECASE)

    def expand_match(contraction):
        match = contraction.group(0)
        # Fall back to the matched text if no expansion is found
        return contraction_mapping.get(match.lower(), match)

    expanded_text = contraction_pattern.sub(expand_match, text)
    return expanded_text

# Apply contraction expansion to 'customer_review' column
data_airline_reviews_df['expanded_review'] = data_airline_reviews_df['customer_review'].apply(lambda x: expand_contractions(x, contraction_mapping))

# Display the expanded text
print(data_airline_reviews_df[['customer_review', 'expanded_review']].head())

#### 2. Lower Casing

In [None]:
# Lower Casing
# Apply lowercasing to 'customer_review' column (the .str accessor leaves missing values as NaN)
data_airline_reviews_df['lowercased_review'] = data_airline_reviews_df['customer_review'].str.lower()

# Display the lowercased text
print(data_airline_reviews_df[['customer_review', 'lowercased_review']].head())

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
# Function to remove punctuations from text
def remove_punctuation(text):
    # Create a translation table to remove punctuations
    translator = str.maketrans('', '', string.punctuation)

    # Use the translation table to remove punctuations
    no_punct_text = text.translate(translator)
    return no_punct_text

# Apply punctuation removal to 'customer_review' column
data_airline_reviews_df['no_punct_review'] = data_airline_reviews_df['customer_review'].apply(remove_punctuation)

# Display the text with punctuations removed
print(data_airline_reviews_df[['customer_review', 'no_punct_review']].head())

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits
# Function to remove URLs from text
def remove_urls(text):
    cleaned_text = re.sub(r'http\S+', '', text)  # Remove URLs starting with http
    return cleaned_text

# Function to remove words and digits containing digits
def remove_words_with_digits(text):
    cleaned_text = ' '.join([word for word in text.split() if not any(char.isdigit() for char in word)])
    return cleaned_text

# Apply URL removal and removing words with digits to 'customer_review' column
data_airline_reviews_df['cleaned_review'] = data_airline_reviews_df['customer_review'].apply(remove_urls)
data_airline_reviews_df['cleaned_review'] = data_airline_reviews_df['cleaned_review'].apply(remove_words_with_digits)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Set up stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from text
def remove_stopwords(text):
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_tokens)

# Apply stopwords removal to 'customer_review' column
data_airline_reviews_df['no_stopwords_review'] = data_airline_reviews_df['customer_review'].apply(remove_stopwords)

# Display the text with stopwords removed
print(data_airline_reviews_df[['customer_review', 'no_stopwords_review']].head())

In [None]:
# Remove White spaces
# Function to remove extra white spaces from text
def remove_extra_spaces(text):
    cleaned_text = ' '.join(text.split())
    return cleaned_text

# Apply space removal to 'cleaned_review' so the URL and digit-word removal above is not overwritten
data_airline_reviews_df['cleaned_review'] = data_airline_reviews_df['cleaned_review'].apply(remove_extra_spaces)

# Display the text with extra spaces removed
print(data_airline_reviews_df[['customer_review', 'cleaned_review']].head())

#### 6. Rephrase Text

In [None]:
!pip install simpleneighbors
!pip install transformers
!pip install sentencepiece

In [None]:
import nltk
nltk.download('punkt')

In [None]:
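# Rephrase Text
# Note: simple WordNet synonym substitution, not a full paraphrasing model
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import random

nltk.download('wordnet')

# Function to rephrase text by swapping words for WordNet synonyms
def rephrase_text(text):
    rephrased_tokens = []
    for token in word_tokenize(text):
        # Collect single-word synonyms that differ from the original token
        synonyms = sorted({lemma.name() for synset in wordnet.synsets(token)
                           for lemma in synset.lemmas()
                           if '_' not in lemma.name() and lemma.name().lower() != token.lower()})
        # Keep the token when WordNet offers no alternative (e.g., punctuation)
        rephrased_tokens.append(random.choice(synonyms) if synonyms else token)
    return ' '.join(rephrased_tokens)

# Apply rephrasing to a specific review
original_review = "The flight experience was amazing!"
rephrased_review = rephrase_text(original_review)

print(f"Original Review: {original_review}")
print(f"Rephrased Review: {rephrased_review}")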


#### 7. Tokenization

In [None]:
# Tokenization
import nltk
from nltk.tokenize import word_tokenize


# Function to tokenize text using NLTK
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

# Apply tokenization to 'customer_review' column
data_airline_reviews_df['tokenized_review'] = data_airline_reviews_df['customer_review'].apply(tokenize_text)

# Display the tokenized text
print(data_airline_reviews_df[['customer_review', 'tokenized_review']].head())

#### 8. Text Normalization

In [None]:
nltk.download('wordnet')

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to perform stemming on a word
def stem_word(word):
    return stemmer.stem(word)

# Function to perform lemmatization on a word
def lemmatize_word(word):
    return lemmatizer.lemmatize(word, pos='v')  # 'v' indicates verb

# Function to normalize text using stemming and lemmatization
def normalize_text(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stem_word(token) for token in tokens]
    lemmatized_tokens = [lemmatize_word(token) for token in stemmed_tokens]
    normalized_text = ' '.join(lemmatized_tokens)
    return normalized_text

# Apply text normalization to 'customer_review' column
data_airline_reviews_df['normalized_review'] = data_airline_reviews_df['customer_review'].apply(normalize_text)

# Display the normalized text
print(data_airline_reviews_df[['customer_review', 'normalized_review']].head())

##### Which text normalization technique have you used and why?


I used both stemming and lemmatization for text normalization, for the following reasons :---


* Stemming :--- I used the Porter stemming algorithm to reduce words to their root forms, which reduces variation. However, stemming can produce non-words.
* Lemmatization :--- I also used the WordNet lemmatizer, which maps words to valid dictionary forms. It relies on the part of speech passed to it (fixed to verb, 'v', in the code) rather than on sentence context. Because the code lemmatizes tokens that are already stemmed, the lemmatizer has little left to change; applying it to the original tokens would likely be more effective.
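
Most of the preprocessing cells above read the raw 'customer_review' column, so each step's output ends up in a separate column. If a single cleaned text is wanted, the steps can be chained in one function. The sketch below uses only the standard library; the 'clean_review' helper, the tiny stopword set, and the sample sentence are assumptions for illustration.

```python
import re
import string

# Tiny illustrative stopword set (an assumption; not NLTK's full list)
STOP_WORDS = {"the", "was", "and", "a", "to", "is"}

def clean_review(text):
    """Chain lowercasing, URL, punctuation, digit-word and stopword removal."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop tokens that contain digits, then drop stopwords
    tokens = [w for w in text.split() if not any(c.isdigit() for c in w)]
    tokens = [w for w in tokens if w not in STOP_WORDS]
    # Joining on single spaces also removes extra whitespace
    return " ".join(tokens)

sample = "The flight BA123 was GREAT, see http://example.com   and the crew was kind!"
cleaned_sample = clean_review(sample)
print(cleaned_sample)
```

The order matters: URLs are removed before punctuation so that the 'http' pattern still matches.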



#### 9. Part of speech tagging

In [None]:
# POS Tagging
nltk.download('averaged_perceptron_tagger')

# Function to perform POS tagging on text
def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return pos_tags

# Apply POS tagging to 'customer_review' column
data_airline_reviews_df['pos_tags'] = data_airline_reviews_df['customer_review'].apply(pos_tagging)

# Display the POS tags
print(data_airline_reviews_df[['customer_review', 'pos_tags']].head())

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Create a list of customer reviews
customer_reviews = data_airline_reviews_df['customer_review'].tolist()

# Initialize vectorizers
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Vectorize text using CountVectorizer
count_vectors = count_vectorizer.fit_transform(customer_reviews)

# Vectorize text using TfidfVectorizer
tfidf_vectors = tfidf_vectorizer.fit_transform(customer_reviews)

# Display the dimensions of the vectorized matrices
print("Count Vectorizer:")
print(count_vectors.shape)

print("\nTF-IDF Vectorizer:")
print(tfidf_vectors.shape)

##### Which text vectorization technique have you used and why?

I used both Count Vectorization and TF-IDF Vectorization techniques.


* Count Vectorization :--- Captures word frequency in documents, suitable for tasks like text classification.
* TF-IDF Vectorization :--- Considers word importance in documents and across the corpus, useful for information retrieval and text summarization.
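
The difference between the two techniques shows up on a tiny corpus. The sketch below uses three made-up documents: in the first one, "flight" and "crew" have the same count, but TF-IDF gives "crew" a higher weight because it appears in fewer documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up reviews
docs = ["great flight great crew", "late flight", "great food"]

count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

count_vectors = count_vectorizer.fit_transform(docs)
tfidf_vectors = tfidf_vectorizer.fit_transform(docs)

# Compare the raw count and the TF-IDF weight of selected words in document 0
for word in ("great", "flight", "crew"):
    count_col = count_vectorizer.vocabulary_[word]
    tfidf_col = tfidf_vectorizer.vocabulary_[word]
    count = count_vectors[0, count_col]
    weight = tfidf_vectors[0, tfidf_col]
    print(f"{word}: count={count}, tfidf={weight:.3f}")
```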




### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create a new feature 'avg_rating' by aggregating ratings
rating_columns = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']
data_airline_reviews_df['avg_rating'] = data_airline_reviews_df[rating_columns].mean(axis=1)

# Display the dataset with the new feature
print(data_airline_reviews_df[['customer_review', 'avg_rating']].head())

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Select relevant features and target variable
selected_features = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service']
X = data_airline_reviews_df[selected_features]
y = data_airline_reviews_df['recommended']

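# Initialize a Random Forest model (fixed seed so the importance scores are reproducible)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)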

# Fit the model and assess feature importance
model.fit(X, y)
feature_importance = model.feature_importances_

# Display feature importance scores
for feature, importance in zip(selected_features, feature_importance):
    print(f"{feature}: {importance:.4f}")

##### What all feature selection methods have you used  and why?

I used feature importance from a Random Forest classifier. The 'feature_importances_' attribute reports impurity-based importance, that is, how much each feature reduces impurity across the trees' splits on average. It is a common way to rank features by their influence on the prediction.






##### Which all features you found important and why?

The Random Forest was fitted on five service ratings: 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', and 'ground_service'. The printed importance scores rank these features against each other for predicting the 'recommended' label. They do not show that all five are needed, and features left out of 'selected_features' (such as 'value_for_money' and 'overall') are not assessed.
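
Importance scores can also be measured on held-out data with permutation importance, which shuffles one feature at a time and records the drop in score. The sketch below uses synthetic data in which only the first feature drives the label; the feature names and the labeling rule are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 600

# Synthetic 1-5 ratings: one informative feature and one unrelated feature
informative = rng.integers(1, 6, size=n_samples)
unrelated = rng.integers(1, 6, size=n_samples)
X = np.column_stack([informative, unrelated])
y = (informative >= 4).astype(int)  # label depends only on the first feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance measured on the held-out test set
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for name, impurity, permutation in zip(
    ["informative", "unrelated"], model.feature_importances_, result.importances_mean
):
    print(f"{name}: impurity={impurity:.4f}, permutation={permutation:.4f}")
```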

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Data transformation might be necessary based on data characteristics and analysis goals. Scaling, normalization, and logarithmic transformations are common. Scaling ensures consistent scales, normalization adjusts distributions, and logarithmic transformation handles skewed data. Choose based on feature distribution and algorithm requirements.

In [None]:
# Transform Your data
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Select features to transform
selected_features = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service']

# Extract selected features
X = data_airline_reviews_df[selected_features]

# Initialize scalers
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()

# Apply Min-Max scaling
X_minmax_scaled = minmax_scaler.fit_transform(X)

# Apply Standard scaling
X_standard_scaled = standard_scaler.fit_transform(X)

# Apply normalization (manual min-max formula; gives the same values as MinMaxScaler above)
X_normalized = X.apply(lambda col: (col - col.min()) / (col.max() - col.min()))

# Apply log transformation
X_log_transformed = np.log1p(X)  # Add 1 to avoid log(0)

# Display transformed data
print("Min-Max Scaled Features:")
print(X_minmax_scaled)

print("\nStandard Scaled Features:")
print(X_standard_scaled)

print("\nNormalized Features:")
print(X_normalized)

print("\nLog Transformed Features:")
print(X_log_transformed)

### 6. Data Scaling

In [None]:
# Scaling your data
# Convert scaled arrays back to DataFrame
X_minmax_scaled_df = pd.DataFrame(X_minmax_scaled, columns=selected_features)
X_standard_scaled_df = pd.DataFrame(X_standard_scaled, columns=selected_features)

# Display scaled DataFrames
print("Min-Max Scaled Features:")
print(X_minmax_scaled_df.head())

print("\nStandard Scaled Features:")
print(X_standard_scaled_df.head())

##### Which method have you used to scale your data and why?

I used both MinMax Scaling and Standard Scaling. MinMax Scaling maps each feature to a fixed range (0 to 1 by default), while Standard Scaling gives each feature a mean of 0 and a standard deviation of 1. MinMax Scaling suits features with known bounds, such as these ratings; Standard Scaling suits algorithms that expect centered features.
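
Both scalers learn their parameters from the data they are fitted on. The sketch below, on made-up numbers, fits each scaler on a small training array and applies it to test values, so the test data does not influence the learned minimum, maximum, mean, or standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up training and test ratings (one feature)
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[2.0], [6.0]])

# Fit on the training data only, then transform the test data
minmax_scaler = MinMaxScaler().fit(train)
standard_scaler = StandardScaler().fit(train)

test_minmax = minmax_scaler.transform(test).ravel()      # (x - min) / (max - min)
test_standard = standard_scaler.transform(test).ravel()  # (x - mean) / std

print("Min-Max scaled test values:", test_minmax)
print("Standard scaled test values:", test_standard)
```

A test value outside the training range (6 here) falls outside the 0 to 1 interval after MinMax Scaling, which is expected behavior.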






### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?


Dimensionality reduction can help with the curse of dimensionality, overfitting, and multicollinearity, and it can improve efficiency and interpretability in high-dimensional datasets. Whether it is needed depends on data complexity and analysis goals; with only five rating features selected here, the benefit is likely to be small.

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA

# Select features for dimensionality reduction
selected_features = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service']

# Extract selected features
X = data_airline_reviews_df[selected_features]

# Initialize PCA with desired number of components
num_components = 2
pca = PCA(n_components=num_components)

# Fit and transform data
X_pca = pca.fit_transform(X)

# Plot explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
plt.bar(range(1, num_components + 1), explained_variance_ratio)
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio by Principal Components')
plt.show()

# Display transformed data
print("PCA Transformed Features:")
print(X_pca)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

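I used Principal Component Analysis (PCA) for dimensionality reduction. PCA projects the features onto orthogonal components ordered by explained variance, so the first two components retain as much variance as any two linear combinations of the features can. This reduces the number of dimensions and allows the data to be plotted in 2D.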
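
Rather than fixing the number of components in advance, it can be chosen from the cumulative explained variance. The sketch below does this on synthetic data built from two underlying factors; the data, the mixing matrix, and the 95% target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Five synthetic features driven by two underlying factors plus small noise
factors = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X = factors @ mixing + 0.1 * rng.normal(size=(500, 5))

# Fit PCA with all components and accumulate the explained variance
pca = PCA().fit(X)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the 95% target
n_components_95 = int(np.searchsorted(cumulative_variance, 0.95) + 1)

print("Cumulative explained variance:", np.round(cumulative_variance, 3))
print("Components needed for 95% variance:", n_components_95)
```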

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Select features and target variable
selected_features = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service']
X = data_airline_reviews_df[selected_features]
y = data_airline_reviews_df['recommended']

# Split the data into training and testing sets
test_size = 0.2  # Choose the splitting ratio wisely (e.g., 80% training, 20% testing)
random_state = 42  # Seed for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# Display the sizes of the training and testing sets
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

##### What data splitting ratio have you used and why?

I used a test data splitting ratio of 0.2 (20% testing data) to strike a balance between having enough data for training and ensuring a reasonable amount for testing. The choice of ratio depends on dataset size, model complexity, and resource availability.
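
When the classes are unevenly represented, a stratified split keeps the class proportions the same in the training and testing sets. The sketch below shows this with the 'stratify' parameter of 'train_test_split' on a toy label vector; the 30/70 class balance is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 30% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

# stratify=y preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(y_train)}, positive share: {y_train.mean():.2f}")
print(f"Testing set size: {len(y_test)}, positive share: {y_test.mean():.2f}")
```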

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report

# Apply oversampling to handle class imbalance
oversampler = RandomOverSampler(random_state=random_state)
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)

# Train a model on the resampled data
model = RandomForestClassifier()
model.fit(X_train_resampled, y_train_resampled)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

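I used random oversampling with 'RandomOverSampler' to handle the imbalanced dataset. It duplicates randomly chosen minority-class samples (sampling with replacement) until the classes are balanced, which avoids the data loss of undersampling. Unlike SMOTE, it does not generate synthetic samples. Only the training set is resampled, so the test set keeps the original class distribution.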
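
The idea behind random oversampling can be reproduced by hand. The sketch below uses 'sklearn.utils.resample' on a toy training set instead of 'RandomOverSampler', so it illustrates the principle rather than the exact implementation; the rows are invented for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set: six "yes" rows and two "no" rows
train = pd.DataFrame({
    "seat_comfort": [5, 4, 5, 4, 3, 5, 1, 2],
    "recommended": ["yes"] * 6 + ["no"] * 2,
})

majority = train[train["recommended"] == "yes"]
minority = train[train["recommended"] == "no"]

# Draw minority rows with replacement until both classes have equal size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["recommended"].value_counts())
```

Every upsampled row is a copy of an existing minority row, so no new feature values appear.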

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***