# **Project Name**    - IndiGO project



##### **Project Type**    - Classification
##### **Contribution**    - Individual

# **Project Summary -**

1️⃣ Data Loading & First View
The dataset contains 131,895 rows and 17 columns.
It has several missing values in many columns.
It includes airline reviews with ratings on various aspects like seat comfort, cabin service, food, entertainment, etc.
Some columns like review_date and date_flown might need date formatting.

2️⃣ Data Information & Understanding Variables
Key columns in the dataset:

Categorical Data: airline, author, traveller_type, cabin, route, recommended
Text Data: customer_review
Numerical Ratings: overall, seat_comfort, cabin_service, food_bev, entertainment, ground_service, value_for_money
Date Columns: review_date, date_flown

3️⃣ Data Wrangling (Cleaning & Processing)
✔ Handling missing values in columns like aircraft, traveller_type, route
✔ Converting date columns into a proper datetime format
✔ Cleaning text reviews for NLP analysis
✔ Encoding categorical variables (like cabin, recommended)

4️⃣ Data Visualization & Insights
📊 Exploratory Data Analysis (EDA):
✅ Distribution of overall ratings
✅ Trends in review counts over time
✅ Average ratings per airline
✅ Correlation between different rating factors

5️⃣ Hypothesis Testing
🔹 Example Hypotheses:

Does business class get higher ratings than economy?
Is there a significant difference between top airlines in terms of customer satisfaction?
Are customers more likely to recommend a flight based on seat comfort?

6️⃣ Feature Engineering
🔹 Creating New Features:
✔ Extracting sentiment scores from customer_review
✔ Binning overall ratings into categories (e.g., Good, Average, Bad)
✔ Calculating review length as a feature
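
The feature ideas listed above can be sketched on a tiny example. This is only an illustration: the three-row frame is hypothetical, and the bin edges (1-4 Bad, 5-7 Average, 8-10 Good) are an assumption, not thresholds fixed by this project.

```python
import pandas as pd

# Hypothetical sample; the real data comes from data_airline_reviews.csv
sample = pd.DataFrame({
    "overall": [2, 6, 9],
    "customer_review": [
        "Late and rude staff",
        "Fine flight",
        "Great crew, great food",
    ],
})

# Bin the 1-10 overall rating into three labelled categories (assumed edges)
sample["rating_category"] = pd.cut(
    sample["overall"],
    bins=[0, 4, 7, 10],
    labels=["Bad", "Average", "Good"],
)

# Review length in words as a simple numeric feature
sample["review_length"] = sample["customer_review"].str.split().str.len()

print(sample[["overall", "rating_category", "review_length"]])
```

Sentiment scoring is left out of the sketch because it needs a third-party NLP library.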

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


1️⃣ Customer Satisfaction Analysis
🔹 Problem: What factors contribute the most to a passenger's overall satisfaction with an airline?
🔹 Goal: Identify key factors (seat comfort, food, cabin service, etc.) that significantly impact customer satisfaction.

2️⃣ Sentiment Analysis of Airline Reviews
🔹 Problem: Can we predict whether a review is positive or negative based on text data?
🔹 Goal: Perform sentiment analysis on customer reviews to classify them as positive, neutral, or negative.

3️⃣ Airline Recommendation Prediction
🔹 Problem: Can we predict whether a customer will recommend an airline based on their review and ratings?
🔹 Goal: Build a classification model to predict recommended (yes/no) based on rating scores and review content.

4️⃣ Comparative Airline Performance
🔹 Problem: Which airlines have the highest and lowest customer satisfaction scores?
🔹 Goal: Compare different airlines based on ratings, review sentiment, and recommendation rates.

5️⃣ Impact of Travel Class on Satisfaction
🔹 Problem: Do Business Class passengers have significantly higher satisfaction than Economy Class passengers?
🔹 Goal: Analyze the difference in ratings across different cabin classes and perform hypothesis testing.

6️⃣ Predicting Airline Review Scores
🔹 Problem: Can we predict the overall rating of a review based on customer comments?
🔹 Goal: Build a machine learning model (Regression or NLP-based model) to predict ratings from review text.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure each algorithm follows the format below.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

! pip install pymysql
import pymysql
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

import warnings
warnings.filterwarnings('ignore')


In [None]:
# Load Dataset
def mysql(query: str):
    '''
    Fetch data from the database and return the result as a DataFrame.
    Returns None (after printing the error) if the query fails.
    '''
    # Placeholder credentials: replace user, pw, host and db with real values
    engine_db = create_engine("mysql+pymysql://user:pw@host/db", pool_pre_ping=True)
    try:
        # The context manager closes the connection even if the query fails
        with engine_db.connect() as conn:
            # Reading Data
            return pd.read_sql_query(query, conn)
    except Exception as e:
        print(e)
        return None
    finally:
        engine_db.dispose()

### Dataset Loading


In [None]:
# Importing the dataset
dataset = mysql('''SELECT * FROM data_airline_reviews''')

### Dataset First View

In [None]:
import pandas as pd

# Load the dataset with a specified encoding
file_path = '/content/data_airline_reviews.csv'
df = pd.read_csv(file_path, encoding='latin1')

# Display the first few rows
print(df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, cols = df.shape

# Display the number of rows and columns
print(f"Rows: {rows}, Columns: {cols}")

### Dataset Information

In [None]:
# Dataset Info
df.info()  # info() prints its summary directly and returns None

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()

# Display the number of fully duplicated rows
print(f"Duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()

# Display the number of missing values per column
print(missing_values)

In [None]:
# Visualizing the missing values

# Checking Null Value by plotting Heatmap

# Set the size of the figure
plt.figure(figsize=(20, 10))

# Create a heatmap to visualize missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')

# Add title and labels
plt.title('Heatmap of Missing Values')
plt.xlabel('Columns')
plt.ylabel('Rows')

# Show the plot
plt.show()

### What did you know about your dataset?

Answer

Size & Structure:

The dataset contains 131,895 rows and 17 columns.
Several columns have missing values, with some having significantly fewer non-null values.

Key Columns:

Categorical, text, and date variables (all read as object dtype): airline, author, review_date, customer_review, traveller_type, cabin, route, date_flown, recommended.

Numerical Ratings: overall, seat_comfort, cabin_service, food_bev, entertainment, ground_service, value_for_money.

Missing Values:

aircraft, traveller_type, route, date_flown, and several rating columns have a high percentage of missing values.
airline and author columns have around 50% missing data.

Numerical Data Summary:

overall rating ranges from 1 to 10.
Other rating columns (e.g., seat_comfort, cabin_service) range from 1 to 5.
The mean overall rating is 5.14, indicating a mix of positive and negative reviews.

Unique Values:

There are 65,948 unique airlines, meaning the dataset likely includes duplicate or incorrectly formatted data.
The recommended column has "yes" or "no" values.
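
To put numbers on the missing-value observations above, the share of missing values per column can be computed directly. The sketch below uses a hypothetical four-row frame in which half of the rows are entirely empty; that is one possible explanation for a gap of about 50% in airline and author, and it is an assumption that has not been verified on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 3 are completely empty (assumed pattern)
sample = pd.DataFrame({
    "airline": ["A", np.nan, "B", np.nan],
    "overall": [7.0, np.nan, 3.0, np.nan],
    "aircraft": [np.nan, np.nan, "A320", np.nan],
})

# Share of missing values per column, in percent
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# Drop rows where every column is missing, then re-check the shares
non_empty = sample.dropna(how="all")
print(non_empty.isnull().mean() * 100)
```

If the real dataset shows the same pattern, dropping the fully empty rows first would give more meaningful missing-value percentages for the remaining columns.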

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns =df.columns
print(columns)

In [None]:
# Dataset Describe

# Descriptive statistics
print("\nDescriptive statistics for numerical columns:")
print(df.describe())

# Distribution and counts of selected text columns
print("\nCounts of reviews per review date:")
print(df['review_date'].value_counts())

print("\nCounts of each distinct review text:")
print(df['customer_review'].value_counts())

# Additional descriptive statistics for specific columns
print("\nAircraft statistics:")
print(df['aircraft'].describe())

print("\nValue-for-money rating statistics:")
print(df['value_for_money'].describe())

print("\nGround service rating statistics:")
print(df['ground_service'].describe())

print("\nAirline statistics:")
print(df['airline'].describe())


### Variables Description

Answer
1. airline – Name of the airline reviewed.

2. overall – Overall rating given by the customer (scale: 1 to 10).

3. author – Name of the reviewer.

4. review_date – Date when the review was posted.

5. customer_review – Textual review provided by the passenger.

6. aircraft – Type of aircraft used for the flight.

7. traveller_type – Type of traveler (Solo, Business, Family, Couple).

8. cabin – Cabin class (Economy, Premium Economy, Business, First Class).

9. route – Flight route (e.g., New York to London).

10. date_flown – The date when the flight took place.

11. seat_comfort – Rating for seat comfort (1 to 5).

12. cabin_service – Rating for in-flight service (1 to 5).

13. food_bev – Rating for food and beverages (1 to 5).

14. entertainment – Rating for in-flight entertainment (1 to 5).

15. ground_service – Rating for services at the airport (1 to 5).

16. value_for_money – Rating for ticket price worth (1 to 5).

17. recommended – Whether the passenger recommends the airline (Yes/No).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column: {column}")
    print(f"Unique Values ({len(unique_values)}): {unique_values[:10]}")  # Show first 10 unique values
    print("-" * 50)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# 1. Display basic information
print("Initial Data Info:")
print(df.info())

# 2. Handle missing values (fill with 'Unknown' for categorical, mean for numeric)
df.fillna({
    col: df[col].mean() if df[col].dtype in ['int64', 'float64'] else 'Unknown'
    for col in df.columns
}, inplace=True)

# 3. Remove duplicate rows
df.drop_duplicates(inplace=True)

# 4. Convert data types (ensure date columns are in datetime format; unparseable values become NaT)
date_columns = ['review_date', 'date_flown']
for col in date_columns:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

# 5. Rename columns (optional, standardizing names)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# 6. Handle outliers (example: remove values outside 1.5*IQR for numerical columns)
for col in df.select_dtypes(include=['int64', 'float64']).columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# 7. Standardize text formatting (convert all string columns to lowercase)
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.lower()

# 8. Display cleaned data info
print("Cleaned Data Info:")
print(df.info())

# Save the cleaned dataset
cleaned_file_path = "/content/data_airline_reviews.xlsx"
df.to_excel(cleaned_file_path, index=False)
print(f"Cleaned dataset saved at: {cleaned_file_path}")


### What all manipulations have you done and insights you found?

Answer

Removed duplicate rows to ensure unique reviews.

Handled missing values by filling categorical data with "Unknown" and numeric columns with the column mean.

Standardized column names and lowercased all string columns (airline names, traveler types, cabins, etc.).

Converted date fields to datetime format.

Removed outliers from the numeric rating columns using the 1.5 × IQR rule.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 pie

In [None]:

# Selecting the variable to plot (food and beverage rating)
dependent_variable = 'food_bev'

# Count the occurrences of each category in the dependent variable
data_counts = df[dependent_variable].value_counts()

# Plotting the Pie Chart
plt.figure(figsize=(5, 5))
plt.pie(data_counts, labels=data_counts.index, autopct='%1.1f%%', startangle=70, colors=plt.cm.Paired(range(len(data_counts))))

# Adding a title
plt.title('Distribution of food_bev Ratings')

# Display the pie chart
plt.show()


##### 1. Why did you pick the specific chart?

Answer

A pie chart shows what share of reviews falls into each food_bev rating level.

##### 2. What is/are the insight(s) found from the chart?

Answer

If low food_bev ratings take up the largest share, it highlights critical issues with the food and beverage service.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

High negativity can lead to revenue loss; addressing customer concerns is crucial.



#### Chart - 2 Box Plot

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(7,3))
sns.boxplot(data=df, x='food_bev', y='ground_service')
plt.xticks(rotation=90)
plt.title("ground_service food_bev Comparison")
plt.show()

##### 1. Why did you pick the specific chart?

Answer

Box plots reveal how ground_service ratings vary across the food_bev rating levels.

##### 2. What is/are the insight(s) found from the chart?

Answer

If reviews with low food_bev ratings also show consistently low ground_service ratings, the dissatisfaction covers more than one part of the service.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

Poorly rated airlines need to improve services to compete.

#### Chart - 3 Bar plot

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(6,5))
sns.countplot(data=df, x='value_for_money', palette='viridis')
plt.title("Distribution of Airline value_for_money")
plt.xlabel("value_for_money")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Answer

Bar plots help in understanding how frequently each rating is given.

##### 2. What is/are the insight(s) found from the chart?

Answer

If most ratings are low, it suggests customer dissatisfaction. If high, the airline is performing well.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

Negative ratings indicate service issues that need urgent action.

#### Chart - 4 Line Chart

In [None]:
# Chart - 4 visualization code
# Use the review date (not a rating column) as the time axis
df['review_date'] = pd.to_datetime(df['review_date'], errors='coerce')
df.sort_values(by='review_date', inplace=True)

plt.figure(figsize=(6,4))
sns.lineplot(data=df, x='review_date', y='cabin_service', marker='o', color='blue')
plt.title("Ratings Over Time")
plt.xlabel("review_date")
plt.ylabel("Average cabin_service")
plt.show()

##### 1. Why did you pick the specific chart?

Answer

A line chart reveals trends in ratings over time.

##### 2. What is/are the insight(s) found from the chart?

Answer

If ratings drop over time, service quality may be declining.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

Consistently declining ratings require urgent corrective measures.
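
A trend over time is easier to read when the ratings are first averaged per month. The sketch below shows one way to do that with a hypothetical four-row frame; the column names follow the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical reviews: two in January 2019, two in February 2019
sample = pd.DataFrame({
    "review_date": ["2019-01-05", "2019-01-20", "2019-02-10", "2019-02-25"],
    "overall": [8, 6, 4, 2],
})

# Parse the dates; values that cannot be parsed become NaT
sample["review_date"] = pd.to_datetime(sample["review_date"], errors="coerce")

# Group by calendar month and average the rating
sample["month"] = sample["review_date"].dt.to_period("M")
monthly_mean = sample.groupby("month")["overall"].mean()
print(monthly_mean)
```

Plotting monthly_mean instead of every single review gives one point per month, which keeps the line chart readable on a large dataset.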



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer

1️⃣ Hypothesis 1: Passengers who travel for business rate airlines higher than leisure travelers.

Null Hypothesis (H₀): There is no difference in ratings between business and leisure travelers.
Alternative Hypothesis (H₁): Business travelers give significantly higher ratings than leisure travelers.

2️⃣ Hypothesis 2: Seat comfort ratings are linearly related to cabin service ratings.
Null Hypothesis (H₀): There is no linear correlation between seat_comfort and cabin_service ratings.
Alternative Hypothesis (H₁): There is a significant linear correlation between seat_comfort and cabin_service ratings.

3️⃣ Hypothesis 3: In-flight entertainment ratings are associated with travel class.
Null Hypothesis (H₀): There is no association between in-flight entertainment ratings and travel class.
Alternative Hypothesis (H₁): There is a significant association between in-flight entertainment ratings and travel class.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer

Null Hypothesis (H₀):
There is no significant difference in cabin_service ratings between Business travelers and Leisure travelers.

Alternate Hypothesis (H₁):
Business travelers give significantly higher cabin_service ratings than Leisure travelers.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Filtering ratings based on traveller type
# Assumption: values equal to 'business' mark business travelers;
# every other known traveller type is treated as leisure
traveller = df['traveller_type'].astype(str).str.strip().str.lower()
business_ratings = df.loc[traveller == 'business', 'cabin_service']
leisure_ratings = df.loc[~traveller.isin(['business', 'unknown', 'nan']), 'cabin_service']

# Performing Independent t-test (two-sided by default)
t_stat, p_value = stats.ttest_ind(business_ratings, leisure_ratings, nan_policy='omit')

# H₁ is one-sided (business > leisure), so halve the two-sided p-value
# when the t-statistic is positive
one_sided_p = p_value / 2 if t_stat > 0 else 1 - p_value / 2

# Results
print("🔹 Hypothesis 1: Business vs Leisure cabin_service")
print(f"T-Statistic: {t_stat}, One-sided P-Value: {one_sided_p}")

# Interpretation
alpha = 0.05  # Significance level
if one_sided_p < alpha:
    print("✅ Reject Null Hypothesis: Business travelers rate cabin_service significantly higher.")
else:
    print("❌ Fail to Reject Null Hypothesis: Business travelers do not rate cabin_service significantly higher.")


##### Which statistical test have you done to obtain P-Value?

Answer

I performed an Independent Two-Sample t-test to obtain the P-Value. This test compares the means of two independent groups (here, Business travelers and Leisure travelers) to determine whether their ratings differ significantly. The ratings are treated as a continuous variable and exactly two groups are compared, so the t-test is an appropriate choice.

##### Why did you choose the specific statistical test?

Answer

I chose the Independent Two-Sample t-test because we are comparing the mean ratings of two independent groups—Business travelers and Leisure travelers. This test is ideal when:

The dependent variable (Ratings) is continuous, making it suitable for a t-test.
The two groups are independent, meaning ratings from one group do not influence the other.
We want to determine if there is a statistically significant difference between the two group means.
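
To make the test statistic less of a black box, here is a minimal sketch of the pooled-variance t statistic, which is the form `scipy.stats.ttest_ind` uses with its default `equal_var=True`. The helper name `pooled_t_statistic` and both rating lists are hypothetical and only serve as an illustration.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    # Weighted average of the two sample variances
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical ratings for the two traveller groups
business = [4, 5, 3, 4, 4]
leisure = [3, 2, 4, 3, 3]

t_stat = pooled_t_statistic(business, leisure)
print(round(t_stat, 3))
```

A positive value means the first group has the higher mean; the p-value then tells how likely a difference at least this large would be if the null hypothesis were true.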

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer

🔹 Null Hypothesis (H₀):
There is no linear correlation between seat_comfort ratings and cabin_service ratings.

🔹 Alternate Hypothesis (H₁):
There is a significant linear correlation between seat_comfort ratings and cabin_service ratings.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
import scipy.stats as stats
import numpy as np

# Ensure both rating columns are numeric
df['seat_comfort'] = pd.to_numeric(df['seat_comfort'], errors='coerce')
df['cabin_service'] = pd.to_numeric(df['cabin_service'], errors='coerce')

# Drop NaN values (if any) after conversion
df = df.dropna(subset=['seat_comfort', 'cabin_service'])

# Performing Pearson Correlation Test
corr_coeff, p_value = stats.pearsonr(df['seat_comfort'], df['cabin_service'])

# Results
print("🔹 Hypothesis 2: seat_comfort vs. cabin_service")
print(f"Pearson Correlation Coefficient: {corr_coeff}")
print(f"P-Value: {p_value}")

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("✅ Reject Null Hypothesis: seat_comfort and cabin_service are significantly correlated.")
else:
    print("❌ Fail to Reject Null Hypothesis: No significant correlation between seat_comfort and cabin_service.")


##### Which statistical test have you done to obtain P-Value?

Answer

The Pearson Correlation Test was used to obtain the P-value. This test measures the strength and direction of the linear relationship between two continuous numerical variables.

##### Why did you choose the specific statistical test?

Answer

I chose the Pearson Correlation Test because it measures the linear relationship between two numerical variables. Since the seat_comfort and cabin_service ratings are both quantitative factors, Pearson's test helps determine whether an increase or decrease in one variable is associated with a change in the other.
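
For intuition, the coefficient itself is simple to compute by hand: the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below uses two hypothetical rating lists and a helper named `pearson_r` that exists only for this illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    # Covariance term divided by the product of the spreads
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# Hypothetical ratings on the 1-5 scale
seat = [1, 2, 3, 4, 5]
service = [2, 1, 4, 3, 5]

r = pearson_r(seat, service)
print(round(r, 3))
```

Values close to +1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak one.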

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer

Null Hypothesis (H₀): There is no significant association between in-flight entertainment ratings and travel class.
Alternative Hypothesis (H₁): There is a significant association between in-flight entertainment ratings and travel class.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

import pandas as pd
import scipy.stats as stats

# Selecting two categorical variables for the test:
# the entertainment rating and the travel class (cabin)
contingency_table = pd.crosstab(df['entertainment'], df['cabin'])

# Performing the Chi-Square Test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Display results
print(f"Chi-Square Statistic: {chi2_stat}")
print(f"P-Value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(expected)

# Conclusion based on P-Value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant relationship between in-flight entertainment and travel class.")
else:
    print("Fail to reject the null hypothesis: No significant relationship found.")


##### Which statistical test have you done to obtain P-Value?

Answer

I used the Chi-Square Test of Independence, which determines whether there is a significant association between two categorical variables.
In this case, I examined the relationship between in-flight entertainment and travel class.
The test helps evaluate whether passengers in different travel classes rate in-flight entertainment differently.

##### Why did you choose the specific statistical test?

Answer

The variables are categorical – The test is ideal for examining relationships between categorical variables. In this case, in-flight entertainment preference and travel class are both categorical.

It determines association – The test helps check if there is a significant association between the two variables, which is useful for business decisions (e.g., improving services for different travel classes).

No assumption about data distribution – Unlike parametric tests, the Chi-Square test does not assume normality, making it suitable for categorical data.
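
The statistic behind the test can be reproduced by hand: the expected count of each cell is row total × column total / grand total, and the statistic sums (observed − expected)² / expected over all cells. The 2×2 table below is hypothetical and only illustrates the arithmetic.

```python
# Hypothetical observed counts:
# rows = entertainment rated low / high, columns = two cabin classes
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts if the two variables were independent
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(round(chi2, 3), dof)
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so for this table it would report a somewhat smaller statistic than the uncorrected value computed here.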


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer

Dropping Missing Values (dropna()) – Used when missing data is minimal and doesn't impact overall analysis.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


# Identify numerical columns
numerical_cols = df.select_dtypes(include=['number']).columns

# Boxplot to visualize outliers
plt.figure(figsize=(7, 6))
sns.boxplot(data=df[numerical_cols])
plt.xticks(rotation=45)
plt.title("Boxplot of Numerical Columns to Identify Outliers")
plt.show()

# Function to remove outliers using IQR (Interquartile Range)
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

# Apply outlier removal on numerical columns
for col in numerical_cols:
    df = remove_outliers_iqr(df, col)

# Display the cleaned dataset info
print("After Outlier Handling:", df.shape)


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer

Boxplot Visualization – Helps visually identify extreme values in numerical columns.
IQR Method – Removes values below Q1 - 1.5 × IQR and above Q3 + 1.5 × IQR to eliminate extreme deviations while retaining most of the data.
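
The IQR rule described above can be checked on a small example. The values below are hypothetical; the point is only to show how the bounds are derived and which values fall outside them.

```python
import pandas as pd

# Hypothetical values with one extreme entry
values = pd.Series([3, 4, 4, 5, 5, 5, 6, 6, 7, 30])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = values[(values >= lower) & (values <= upper)]
print(f"bounds: [{lower}, {upper}], kept {len(kept)} of {len(values)} values")
```

On bounded rating scales such as 1 to 5, the same rule can also flag legitimate very low or very high ratings, so it is worth checking how many rows are removed before relying on it.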

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
print("Categorical Columns:", categorical_cols)

# 1️⃣ **Label Encoding** (For ordinal categorical variables)
label_encoder = LabelEncoder()
df['cabin_service_encoded'] = label_encoder.fit_transform(df['cabin_service'])  # Example column

# 2️⃣ **One-Hot Encoding** (For nominal categorical variables)
df_one_hot = pd.get_dummies(df, columns=['entertainment'], drop_first=True)  # Example column

# Display transformed data (df holds the label-encoded column, df_one_hot the one-hot columns)
print("After Encoding:")
print(df_one_hot.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?


Answer

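Label Encoding – Converts categorical values into numerical labels (e.g., "Business" → 0, "Economy" → 1, "First Class" → 2). LabelEncoder assigns the labels in alphabetical order of the categories, so an explicit mapping is needed when the numbers have to follow a meaningful ranking. This is useful for ordinal data, where the categories have a meaningful order.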

One-Hot Encoding – Creates binary columns for each category (e.g., "Airline A" → [1,0,0], "Airline B" → [0,1,0]). This is ideal for nominal data, where categories have no intrinsic ranking.
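
The two techniques can be sketched on a small frame. The category spellings and the ranking of the cabin classes in `cabin_order` are assumptions made for this illustration; the actual values in the dataset should be checked first.

```python
import pandas as pd

# Hypothetical sample with one ordinal and one nominal column
sample = pd.DataFrame({
    "cabin": ["economy class", "business class", "first class", "economy class"],
    "traveller_type": ["solo", "business", "family", "solo"],
})

# Ordinal encoding with an explicit order (assumed ranking of cabin classes)
cabin_order = {"economy class": 0, "business class": 1, "first class": 2}
sample["cabin_encoded"] = sample["cabin"].map(cabin_order)

# One-hot encoding for a nominal column: one 0/1 column per category
encoded = pd.get_dummies(sample, columns=["traveller_type"], dtype=int)
print(encoded)
```

An explicit dictionary keeps the numeric order under control, which an alphabetical label encoding would not guarantee.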



### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
!pip install contractions

In [None]:
import pandas as pd
import contractions



# Identify the correct text column (assuming 'review' contains textual data)
text_column = 'customer_review'  # Change this if your actual text column has a different name

# Function to expand contractions
def expand_contractions(text):
    if isinstance(text, str):  # Ensure text is a string before processing
        return contractions.fix(text)
    return text

# Apply expansion to the text column
df[text_column] = df[text_column].apply(expand_contractions)

# Display processed text
print("Expanded Text Examples:")
print(df[text_column].head())


#### 2. Lower Casing

In [None]:
# Lower Casing

# Check column names to identify the text column
print("Column Names in Dataset:", df.columns)

# Identify the correct text column (e.g., 'review_text')
text_column = 'customer_review'  # Change if needed after checking actual column names

# Convert text to lowercase
df[text_column] = df[text_column].astype(str).str.lower()

# Display processed text examples
print("Lowercased Text Examples (First 5 Rows):")
print(df[text_column].head())



#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

import pandas as pd
import string

# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Function to remove punctuation
def remove_punctuation(text):
    if isinstance(text, str):  # Ensure text is a string before processing
        return text.translate(str.maketrans('', '', string.punctuation))
    return text

# Apply punctuation removal
df[text_column] = df[text_column].apply(remove_punctuation)

# Display processed text examples
print("Text After Removing Punctuation (First 5 Rows):")
print(df[text_column].head())



#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

import pandas as pd
import re


# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Function to remove URLs
def remove_urls(text):
    if isinstance(text, str):  # Ensure text is a string before processing
        return re.sub(r'http[s]?://\S+|www\.\S+', '', text)  # Removes all URLs
    return text

# Function to remove words containing digits
def remove_words_with_digits(text):
    if isinstance(text, str):
        return re.sub(r'\b\w*\d\w*\b', '', text)  # Removes words that contain digits
    return text

# Apply transformations
df[text_column] = df[text_column].apply(remove_urls)
df[text_column] = df[text_column].apply(remove_words_with_digits)

# Display processed text examples
print("Text After Removing URLs & Words with Digits (First 5 Rows):")
print(df[text_column].head())
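
One detail about the order of the steps: the URL pattern relies on "://" and ".", so it can only match if URLs are removed before the punctuation is stripped. The sketch below combines the cleaning steps into a single function in an order that keeps the pattern usable. The function name `clean_text` and the example sentence are hypothetical, and contraction expansion is left out because it needs a third-party package.

```python
import re
import string

def clean_text(text):
    """Apply the cleaning steps in an order that keeps URL patterns intact."""
    if not isinstance(text, str):
        return text
    # 1. Remove URLs first, while '://' and '.' are still present
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # 2. Lower-case
    text = text.lower()
    # 3. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 4. Remove words that contain digits
    text = re.sub(r"\b\w*\d\w*\b", "", text)
    # 5. Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

example = "Flight AB123 was GREAT!! See https://example.com/review for details."
print(clean_text(example))
```

The function can be applied to the review column with `df[text_column].apply(clean_text)`; non-string values such as NaN are passed through unchanged.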


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import pandas as pd
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already available
nltk.download('stopwords')

# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Get stopwords list
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    if isinstance(text, str):
        return " ".join([word for word in text.split() if word.lower() not in stop_words])
    return text

# Apply stopword removal
df[text_column] = df[text_column].apply(remove_stopwords)

# Display processed text examples
print("Text After Removing Stopwords (First 5 Rows):")
print(df[text_column].head())



In [None]:
# Remove White spaces


# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Function to remove extra whitespaces
def remove_extra_whitespace(text):
    if isinstance(text, str):
        return re.sub(r'\s+', ' ', text).strip()  # Replace multiple spaces with a single space
    return text

# Apply whitespace removal
df[text_column] = df[text_column].apply(remove_extra_whitespace)

# Display processed text examples
print("Text After Removing Extra Whitespaces (First 5 Rows):")
print(df[text_column].head())

#### 6. Rephrase Text

In [None]:
# Rephrase Text

import pandas as pd
from textblob import TextBlob

# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

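# Function to "rephrase" text using TextBlob.
# Note: TextBlob.correct() performs spelling correction only (it does not paraphrase
# or fix grammar) and is slow on large datasets, so it is disabled by default.
def rephrase_text(text, correct_spelling=False):
    if isinstance(text, str) and correct_spelling:
        return str(TextBlob(text).correct())
    return text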

# Apply text rephrasing
df[text_column] = df[text_column].apply(rephrase_text)

# Display processed text examples
print("Text After Rephrasing (First 5 Rows):")
print(df[text_column].head())


#### 7. Tokenization

In [None]:
nltk.download('punkt_tab')

In [None]:
# Tokenization

import pandas as pd
import nltk



from nltk.tokenize import word_tokenize, sent_tokenize

# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Function for word tokenization
def word_tokenization(text):
    if isinstance(text, str):
        return word_tokenize(text)  # Splitting text into words
    return text

# Function for sentence tokenization
def sentence_tokenization(text):
    if isinstance(text, str):
        return sent_tokenize(text)  # Splitting text into sentences
    return text

# Apply tokenization
df["word_tokens"] = df[text_column].apply(word_tokenization)
df["sentence_tokens"] = df[text_column].apply(sentence_tokenization)

# Display processed text examples
print("Word Tokenization (First 5 Rows):")
print(df["word_tokens"].head())

print("\nSentence Tokenization (First 5 Rows):")
print(df["sentence_tokens"].head())


#### 8. Text Normalization

In [None]:
# Download necessary resources
nltk.download('punkt')
nltk.download('wordnet')

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

import pandas as pd
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize



# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Initialize Stemmer and Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function for stemming
def apply_stemming(text):
    if isinstance(text, str):
        words = word_tokenize(text)  # Tokenizing words
        return " ".join([stemmer.stem(word) for word in words])
    return text

# Function for lemmatization
def apply_lemmatization(text):
    if isinstance(text, str):
        words = word_tokenize(text)  # Tokenizing words
        return " ".join([lemmatizer.lemmatize(word) for word in words])
    return text

# Apply stemming and lemmatization
df["stemmed_text"] = df[text_column].apply(apply_stemming)
df["lemmatized_text"] = df[text_column].apply(apply_lemmatization)

# Display processed text examples
print("Stemming (First 5 Rows):")
print(df["stemmed_text"].head())

print("\nLemmatization (First 5 Rows):")
print(df["lemmatized_text"].head())


##### Which text normalization technique have you used and why?

Answer

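Both techniques were applied so that their output can be compared; each result is stored in its own column (stemmed_text and lemmatized_text).

Stemming was applied using the PorterStemmer. It reduces words to a root form by stripping suffixes with fixed rules (e.g., "flying" becomes "fli"). It is fast and simple, but the result is often not a dictionary word.

Lemmatization was applied using the WordNetLemmatizer. It also reduces words to a base form, but it looks them up in the WordNet vocabulary, so the result is a real word (e.g., "flights" becomes "flight"). Because lemmatize() is called here without a part-of-speech argument, every token is treated as a noun, so verb forms such as "flying" are left unchanged; they become "fly" only when the verb tag (pos="v") is supplied.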
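As a quick check of the stemming behaviour described above, the sketch below runs NLTK's PorterStemmer on a few hand-picked words. The word list is illustrative and not taken from the dataset. Unlike the lemmatizer, the stemmer needs no corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only; not drawn from the review data
sample_words = ["running", "flies", "delayed", "comfortable"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as "fli" are not dictionary words, which is the trade-off against the slower but more readable output of lemmatization.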

#### 9. Part of speech tagging

In [None]:
# Download necessary resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


In [None]:
# POS Tagging

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag


# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Function for POS tagging
def pos_tagging(text):
    if isinstance(text, str):
        words = word_tokenize(text)  # Tokenizing words
        return pos_tag(words)  # Returns a list of (word, tag) tuples
    return text

# Apply POS tagging
df["pos_tags"] = df[text_column].apply(pos_tagging)

# Display processed text examples
print("POS Tagging (First 5 Rows):")
print(df["pos_tags"].head())


#### 10. Text Vectorization

In [None]:
# Vectorizing Text

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Identify the correct text column (e.g., 'review')
text_column = 'customer_review'  # Change if needed

# Remove missing values
df = df.dropna(subset=[text_column])

# Initialize Vectorizers
count_vectorizer = CountVectorizer(stop_words='english', max_features=500)  # Bag of Words
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=500)  # TF-IDF

# Apply Count Vectorization
count_matrix = count_vectorizer.fit_transform(df[text_column])
count_df = pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

# Apply TF-IDF Vectorization
tfidf_matrix = tfidf_vectorizer.fit_transform(df[text_column])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display processed text examples
print("Count Vectorization (First 5 Rows):")
print(count_df.head())

print("\nTF-IDF Vectorization (First 5 Rows):")
print(tfidf_df.head())


##### Which text vectorization technique have you used and why?

Answer

Count Vectorization was used to convert text into a matrix of token counts. It simply counts how many times each word appears in the document, giving a basic representation of the text. It's straightforward and works well when word frequency alone carries useful information.

TF-IDF Vectorization, on the other hand, considers not only how frequently a word appears in a document (term frequency), but also how rare or common that word is across all documents (inverse document frequency). This helps reduce the weight of common words and gives more importance to unique or meaningful terms.
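To make the difference concrete, here is a minimal sketch on a made-up three-review corpus (an assumption for illustration, not the project data). It compares the raw count and the TF-IDF weight of a word that appears in every review ("flight") with one that appears in only one ("late").

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration)
corpus = [
    "flight was late",
    "flight was comfortable",
    "flight crew was friendly",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(corpus).toarray()
tfidf = tfidf_vec.fit_transform(corpus).toarray()

vocab = tfidf_vec.vocabulary_  # maps each term to its column index

# In the first review, "flight" and "late" each occur once ...
flight_count = counts[0, count_vec.vocabulary_["flight"]]
late_count = counts[0, count_vec.vocabulary_["late"]]

# ... but TF-IDF gives the rarer word "late" a larger weight than "flight",
# which appears in every review
flight_weight = tfidf[0, vocab["flight"]]
late_weight = tfidf[0, vocab["late"]]

print("counts:", flight_count, late_count)
print("tf-idf:", round(flight_weight, 3), round(late_weight, 3))
```

Both words have the same count, yet the TF-IDF weight of "late" is higher, which is the down-weighting of common terms described above.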



### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Keep only numeric columns and drop rows with missing values
numeric_df = df.select_dtypes(include=[np.number]).dropna()

# Compute correlation matrix
correlation_matrix = numeric_df.corr()

# Plot heatmap to visualize correlation
plt.figure(figsize=(7, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Matrix")
plt.show()

# Identify highly correlated features (Threshold: 0.8)
correlated_features = set()
threshold = 0.8
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

print(f"Highly Correlated Features (Threshold > {threshold}): {correlated_features}")

# Drop correlated features
df_reduced = numeric_df.drop(columns=correlated_features)

# Standardize data before PCA
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_reduced)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)  # Reducing to 2 key components
df_pca = pca.fit_transform(df_scaled)

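# Add new PCA features to dataframe.
# numeric_df dropped rows with missing values, so align on its index
# (rows that were dropped keep NaN in the PCA columns).
df.loc[numeric_df.index, "PCA_1"] = df_pca[:, 0]
df.loc[numeric_df.index, "PCA_2"] = df_pca[:, 1]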


#### 2. Feature Selection

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import LabelEncoder

# Drop missing values
df.dropna(inplace=True)

# Identify numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=[object]).columns.tolist()

# Encode categorical columns using LabelEncoder
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Select features (excluding target)
target_col = "rating"  # Change to the correct target column in your dataset
if target_col in df.columns:
    X = df.drop(columns=[target_col])
else:
    X = df

# Convert to NumPy array to avoid dtype issues
X_numeric = X.select_dtypes(include=[np.number]).to_numpy()

# Apply Variance Threshold
var_thresh = VarianceThreshold(threshold=0.01)  # Removes features whose variance does not exceed 0.01
X_var = var_thresh.fit_transform(X_numeric)

print(f"Variance Threshold kept {X_var.shape[1]} of {X_numeric.shape[1]} features")


##### What all feature selection methods have you used  and why?

Answer

Variance Threshold removes features with very low variance (i.e., features that don’t change much across the dataset). These features are often uninformative and may not contribute meaningfully to the prediction task.
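The effect of the threshold can be seen on a tiny hand-built table. This is a sketch with made-up column names and values, using the same threshold of 0.01 as the notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data: one constant column, one informative column, one near-constant column
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "varying": [0.0, 1.0, 2.0, 3.0],
    "near_constant": [5.00, 5.01, 4.99, 5.00],
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

# get_support() returns a boolean mask over the input columns
kept_columns = toy.columns[selector.get_support()].tolist()
dropped_columns = toy.columns[~selector.get_support()].tolist()

print("Variances:", dict(zip(toy.columns, selector.variances_.round(5))))
print("Kept:", kept_columns)
print("Dropped:", dropped_columns)
```

Only the column whose values actually change survives. Note that the method looks at each feature in isolation and never at the target, so it filters out uninformative columns but does not rank the remaining ones.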

##### Which all features you found important and why?

Answer

customer_review (encoded) – The textual reviews, once vectorized, likely carry a strong signal about customer sentiment, which relates directly to the ratings.

airline – Different airlines may have different service standards, punctuality, and customer experience, which can significantly affect ratings.

overall, seat_comfort, food_bev, cabin_service – These numeric ratings each reflect a specific aspect of the travel experience.

These choices rest on domain judgment: the Variance Threshold step above only removes near-constant columns and does not rank features by importance.



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# Identify categorical and numerical columns
categorical_cols = df.select_dtypes(include=[object]).columns.tolist()
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Define transformations
# Impute missing values
cat_imputer = SimpleImputer(strategy="most_frequent")
num_imputer = SimpleImputer(strategy="mean")

# Encode categorical variables
encoder = OneHotEncoder(handle_unknown="ignore")

# Scale numerical features
scaler = StandardScaler()

# Create transformation pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", num_imputer), ("scaler", scaler)]), numerical_cols),
        ("cat", Pipeline([("imputer", cat_imputer), ("encoder", encoder)]), categorical_cols),
    ]
)

# Apply transformations
transformed_data = preprocessor.fit_transform(df)

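# Convert transformed data back to a DataFrame
# (OneHotEncoder can make the result a SciPy sparse matrix, so handle both cases)
if hasattr(transformed_data, "toarray"):
    transformed_df = pd.DataFrame.sparse.from_spmatrix(transformed_data)
else:
    transformed_df = pd.DataFrame(transformed_data)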

# Show transformed data
print(transformed_df.head())



### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Identify numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Choose a scaler (Uncomment one)
scaler = StandardScaler()  # Standardization (Mean = 0, StdDev = 1)
# scaler = MinMaxScaler()  # Normalization (Scale between 0 and 1)

# Fit and transform the numerical data
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Show transformed data
print(df.head())


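##### Which method have you used to scale your data and why?

Answer

Standardization with StandardScaler was used, which rescales each numerical column to zero mean and unit variance. Methods used later in this notebook, such as PCA and the SVM, are sensitive to feature scale, so putting all columns on a comparable scale keeps large-valued features from dominating.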
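A minimal sketch of standardization with StandardScaler, fitted on the training rows only so that no statistics from the test rows leak into preprocessing. The data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: 3 numeric features on very different scales)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[5.0, 50.0, 500.0], scale=[1.0, 10.0, 100.0], size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)

# Reuse the training statistics for the test rows (no refitting)
X_te_scaled = demo_scaler.transform(X_te)

print("Train means:", X_tr_scaled.mean(axis=0).round(3))
print("Train stds: ", X_tr_scaled.std(axis=0).round(3))
print("Test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1; the test columns are close to that but not exact, which is expected because they were scaled with the training statistics.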

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer

Yes, dimensionality reduction can be beneficial in this project, especially since we're working with text data, which often leads to high-dimensional feature spaces after vectorization (like with CountVectorizer or TF-IDF).

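High-Dimensional Text Vectors: each vocabulary term becomes its own column (up to 500 per vectorizer here), and most entries are zero.

Improved Model Performance: dropping redundant or noisy dimensions can reduce overfitting.

Faster Training & Simpler Models: fewer input columns mean less computation for every model fit.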

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Identify numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Standardize the data (PCA works best with standardized data)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[numerical_cols])

# Apply PCA - Reduce dimensions
n_components = 2  # Choose number of components
pca = PCA(n_components=n_components)
df_pca = pca.fit_transform(df_scaled)

# Convert to DataFrame
df_pca = pd.DataFrame(df_pca, columns=[f"PC{i+1}" for i in range(n_components)])

# Show explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

print(df_pca.head())  # View transformed data


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer

PCA is ideal for numerical data: Since most of the features in the dataset are numerical (after preprocessing and encoding), PCA is a great choice to reduce dimensionality while preserving as much variance (i.e., information) as possible.

Reduces redundancy and noise: PCA helps to identify underlying patterns by transforming correlated features into a smaller number of uncorrelated components. This reduces redundancy and keeps only the most informative parts of the data.

Improves model efficiency: By reducing the number of features, PCA helps improve training time and reduces the risk of overfitting, especially in high-dimensional datasets like this one with text-based vectorization.

Visualization: I reduced the data to 2 principal components to visually explore the distribution and possible clusters or patterns in the reviews.
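The points above can be illustrated on synthetic data. In this sketch (an assumption, not the project data) five correlated columns are generated from two hidden factors, and PCA is asked to keep as many components as are needed to explain 95% of the variance rather than a fixed number.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed columns driven by only 2 hidden factors plus small noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X_demo = factors @ mixing + 0.05 * rng.normal(size=(300, 5))

X_std = StandardScaler().fit_transform(X_demo)

# A float n_components asks PCA to keep enough components to reach that share of variance
pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X_std)

print("Components kept:", pca_demo.n_components_)
print("Explained variance ratio:", pca_demo.explained_variance_ratio_.round(3))
```

Passing a variance share instead of a fixed count is one way to let the data decide how many components are worth keeping; a fixed value of 2 remains convenient when the goal is a scatter plot.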

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

import pandas as pd
from sklearn.model_selection import train_test_split


# Define features (X) and target (y)
X = df.drop(columns=["cabin_service"])  # Features: every column except the target
y = df["cabin_service"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)


##### What data splitting ratio have you used and why?

Answer

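An 80:20 train/test split was used (test_size=0.2), stratified on the target so that both sets keep the same class proportions. This ratio is commonly used in machine learning projects because it strikes a good balance between having enough data to train the model effectively while still keeping sufficient data to evaluate its performance reliably.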
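A small sketch of what the stratify argument does in an 80:20 split. The target below is a made-up imbalanced array, not the cabin_service column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the original 80/20 class ratio
train_share = y_tr.mean()
test_share = y_te.mean()

print("Train size:", len(y_tr), "share of class 1:", train_share)
print("Test size: ", len(y_te), "share of class 1:", test_share)
```

Without stratification the minority class could be over- or under-represented in the test set by chance, which makes the evaluation less stable.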

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Identify columns with datetime format
datetime_cols = df.select_dtypes(include=['datetime64']).columns

# Drop DateTime columns (or convert if necessary)
df = df.drop(columns=datetime_cols)  # Drop DateTime columns

# Convert categorical target variable into numerical format
df['cabin_service'] = df['cabin_service'].astype('category').cat.codes

# Separate features (X) and target variable (y)
X = df.drop(columns=['cabin_service'])  # Drop target variable
y = df['cabin_service']  # Define target variable

# Handle categorical features using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize the Machine Learning Model (Random Forest Classifier)
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the Algorithm (Train the model)
model_rf.fit(X_train, y_train)

# Predict on the test set
y_pred = model_rf.predict(X_test)

# Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Print results
print(f"Model - 1 (Random Forest) Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", report)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Generate Confusion Matrix (rows/columns follow the classes the model was trained on)
class_labels = model_rf.classes_
conf_matrix = confusion_matrix(y_test, y_pred, labels=class_labels)

# Plot the Confusion Matrix using a heatmap
plt.figure(figsize=(6,5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Generate Classification Report
report_dict = classification_report(y_test, y_pred, output_dict=True)

# Convert Classification Report to DataFrame for better visualization
report_df = pd.DataFrame(report_dict).transpose()

# Display the classification report
print("\nClassification Report:")
print(report_df)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Drop irrelevant columns (Modify as per dataset)
df.drop(columns=['Unnamed: 0', 'review_date'], errors='ignore', inplace=True)

# Handle missing values
df.fillna("", inplace=True)

# Encode categorical columns
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['airline'] = encoder.fit_transform(df['airline'])

# Define features and target variable
X = df.drop(columns=['cabin_service'])  # All columns except the target
y = df['cabin_service']

# Split dataset into train and test sets (stratified, as in the baseline model)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define hyperparameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Apply GridSearchCV with Cross Validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# GridSearchCV refits the best parameter combination on the full training set by default
best_rf_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_rf_model.predict(X_test)

# Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))

# Cross-validation score on the training data only, so the held-out test set stays unseen
cv_scores = cross_val_score(best_rf_model, X_train, y_train, cv=5)
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", np.mean(cv_scores))


##### Which hyperparameter optimization technique have you used and why?

Answer

GridSearchCV was used.

Systematic Exploration: Grid Search exhaustively tries every combination of the specified hyperparameter values (n_estimators, max_depth, min_samples_split, and min_samples_leaf), which is 3 × 3 × 3 × 3 = 81 combinations for the grid above.

Performance-Oriented: It uses 5-fold cross-validation (cv=5): the training data is split into 5 parts, and each combination is trained on 4 parts and validated on the remaining one, rotating through all 5. This gives a more robust and reliable evaluation than a single split.
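The cost of an exhaustive search can be worked out before running it. The sketch below counts the combinations in the notebook's grid with ParameterGrid, then runs GridSearchCV with a deliberately tiny grid on synthetic data so that it finishes quickly. The small grid and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# The grid from the notebook: 3 values for each of 4 hyperparameters
full_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_candidates = len(ParameterGrid(full_grid))
print("Combinations:", n_candidates, "-> model fits with cv=5:", n_candidates * 5)

# A much smaller grid on synthetic data, so the sketch runs in seconds
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))

# best_estimator_ is already refit on all of X_demo with the best parameters
best_model = search.best_estimator_
```

When the full grid is too slow on a large dataset, RandomizedSearchCV is a common alternative that samples a fixed number of combinations instead of trying them all.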

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer

Improvement is judged by comparing the accuracy and classification report of the tuned model with those of the baseline Random Forest above. The scores are not recorded in this notebook, so the size of the gain in predicting cabin_service ratings is not quantified here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize SVM model
svm_model = SVC(kernel='linear', random_state=42)

# Train the model
svm_model.fit(X_train, y_train)

# Predict on test set
y_pred = svm_model.predict(X_test)


In [None]:
# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Generate classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Compute confusion matrix (rows/columns follow the classes the model was trained on)
class_labels = svm_model.classes_
conf_matrix = confusion_matrix(y_test, y_pred, labels=class_labels)

# Visualizing Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix for SVM Model")
plt.show()


In [None]:
# Define evaluation metrics (macro-averaged precision, recall and F1)
macro_avg = classification_report(y_test, y_pred, output_dict=True)['macro avg']
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
values = [accuracy_score(y_test, y_pred),
          macro_avg['precision'],
          macro_avg['recall'],
          macro_avg['f1-score']]

# Create a bar chart
plt.figure(figsize=(8, 3))
sns.barplot(x=metrics, y=values, palette='coolwarm')
plt.ylim(0, 1)
plt.title("Evaluation Metric Score Chart - SVM Model")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Drop irrelevant columns (Modify as per your dataset)
df.drop(columns=["date", "review"], errors='ignore', inplace=True)

# Handle missing values (if any)
df.fillna("", inplace=True)  # Replace missing text with an empty string

# Convert categorical labels into numerical if needed
df["entertainment"] = df["entertainment"].astype('category').cat.codes  # Encode the target variable as integer class codes

# Define Features (X) and Target (y)
X = df.drop(columns=["entertainment"]).select_dtypes(include=[np.number])  # Numerical features only
y = df["entertainment"]

# Split data into Training & Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features: fit the scaler on the training set only to avoid leaking test statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data Prepared Successfully!")


##### Which hyperparameter optimization technique have you used and why?

Answer

Systematic Search: It evaluates every possible combination of the provided hyperparameters.

Cross-validation: It performs K-fold cross-validation, reducing the chances of overfitting and giving a more reliable estimate of model performance.

Automated Tuning: Removes guesswork in choosing hyperparameters like C, kernel, or gamma for an SVM.

Best Model Selection: Automatically selects the best model parameters for optimal performance.
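A minimal sketch of such a search for an SVM, assuming synthetic data and example grid values (neither comes from the notebook). Scaling is placed inside a Pipeline so that each cross-validation fold is scaled with statistics from its own training part.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the review features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold is scaled using its own training part
svm_pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Example values only; the "svc__" prefix routes each parameter to the SVC step
svm_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", "auto"],
}

svm_search = GridSearchCV(svm_pipeline, svm_grid, cv=5, scoring="accuracy")
svm_search.fit(X_tr, y_tr)

test_accuracy = svm_search.score(X_te, y_te)
print("Best parameters:", svm_search.best_params_)
print("Test accuracy:", round(test_accuracy, 3))
```

On a dataset the size of the full review table, an SVM grid search can be slow, so it may be worth tuning on a sample first.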

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer

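An improvement is expected after tuning with GridSearchCV, provided the grid covers meaningfully different values of hyperparameters such as C, kernel, and gamma. The tuned scores are not recorded in this notebook, so the gain is not quantified here.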

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Drop irrelevant columns (Modify based on your dataset)
df.drop(columns=["date", "review"], errors='ignore', inplace=True)

# Handle missing values
df.fillna("", inplace=True)

# Convert categorical labels into numerical
df["value_for_money"] = df["value_for_money"].astype('category').cat.codes  # Encode the target variable as integer class codes

# Define Features (X) and Target (y)
X = df.drop(columns=["value_for_money"]).select_dtypes(include=[np.number])  # Numerical features only
y = df["value_for_money"]

# Split data into Training & Testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features: fit the scaler on the training set only to avoid leaking test statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data Prepared Successfully!")


In [None]:
# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Generate the Confusion Matrix (rows/columns follow the classes the model was trained on)
class_labels = rf_model.classes_
cm = confusion_matrix(y_test, y_pred, labels=class_labels)

# Plot the Confusion Matrix Heatmap
plt.figure(figsize=(6, 3))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix Heatmap")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)


from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier  # Example Model
from sklearn.metrics import accuracy_score

# Example Model Initialization
model = RandomForestClassifier(random_state=42)




In [None]:
# Check if the data is in the correct format
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)

# Perform 5-Fold Cross Validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

# Print Cross-Validation Scores
print("Cross-Validation Scores: ", cv_scores)
print("Mean CV Score: ", np.mean(cv_scores))


##### Which hyperparameter optimization technique have you used and why?

Answer

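No separate hyperparameter search was run for this model. Instead, 5-fold cross-validation was used to evaluate the Random Forest with its default settings. This technique splits the training data into 5 subsets (folds), trains the model on 4 folds, and tests it on the 1 remaining fold, repeating this process 5 times with a different test fold each time.

Why:

To get a more reliable estimate of the model's performance than a single train/test split gives.

To detect overfitting: a model that scores well on every fold is more likely to generalize to unseen data.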
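The fold rotation in 5-fold cross-validation can be made visible with KFold on ten placeholder rows. This is an illustrative sketch; note that cross_val_score with a classifier and an integer cv uses the stratified variant, and plain KFold is shown here only because its output is easier to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples, identified by their row index
X_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for fold_number, (train_idx, test_idx) in enumerate(kfold.split(X_demo), start=1):
    # 4 folds (8 rows) train the model, the remaining fold (2 rows) tests it
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
    test_folds.append(test_idx)

# Every row is used for testing exactly once across the 5 folds
all_test_rows = np.sort(np.concatenate(test_folds))
```

Because each row is tested exactly once, the mean of the five fold scores uses all of the training data for evaluation without ever scoring a row the model was fitted on.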



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer

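Cross-validation does not change the model, so it does not improve performance by itself. The comparison to make is between:

The original model performance on the single train/test split (without cross-validation).

The average score from 5-Fold Cross-Validation.

If the two values are close, the single-split estimate can be considered stable.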


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer

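When building a machine learning model that impacts business decisions, the choice of evaluation metrics is critical. The metrics computed in this notebook, and why they matter for business outcomes, are:

Accuracy: the share of reviews whose rating class is predicted correctly; a simple overall check.

Precision (macro-averaged): of the reviews predicted to be in a class, how many really are; it limits false alarms, such as flagging satisfied customers as dissatisfied.

Recall (macro-averaged): of the reviews that truly belong to a class, how many are found; it limits missed cases, such as dissatisfied customers who go unnoticed.

F1-score (macro-averaged): the harmonic mean of precision and recall, useful when the rating classes are imbalanced.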
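The metrics reported in this notebook (accuracy and macro-averaged precision, recall, and F1) can be computed directly from true and predicted labels. The sketch below uses six hand-made labels for three classes, small enough to verify by hand; the labels are illustrative and not model output.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hand-made labels for three classes (illustrative only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# average="macro" computes each metric per class, then takes the unweighted mean
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

Macro averaging gives every class the same weight regardless of how many reviews it contains, so a model that ignores a rare rating class is penalized even if its accuracy looks good.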

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer

SVC was chosen as the final model, with the aim of predictions that are reliable, generalize well, and keep both false positives and false negatives low, which is essential for maintaining business trust and optimizing operational outcomes.

# **Conclusion**

**Model Loading**: Utilize the saved pickle or joblib file to load the model in a production environment.

**Integration**: Incorporate the model into your application's backend, ensuring it receives input data in the expected format and returns predictions seamlessly.

**Monitoring**: Implement monitoring tools to track the model's performance over time, allowing for timely updates and maintenance.

**User Feedback Loop**: Collect user feedback to continuously refine the model, enhancing its accuracy and relevance.
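The model-loading step can be sketched with joblib. The example below trains a small stand-in SVC on synthetic data, saves it, and loads it back; the file name and the stand-in model are assumptions for illustration, since the notebook itself does not show the saving step.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Train a small stand-in model on synthetic data (assumption: the real model is a fitted SVC)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
trained_model = SVC(kernel="linear", random_state=42).fit(X_demo, y_demo)

# Save to a temporary file; the file name is hypothetical
model_path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(trained_model, model_path)

# Later, e.g. in the serving application: load and predict
loaded_model = joblib.load(model_path)
predictions_match = (loaded_model.predict(X_demo) == trained_model.predict(X_demo)).all()
print("Loaded model reproduces predictions:", predictions_match)
```

Any preprocessing fitted during training, such as the scaler, needs to be saved and loaded alongside the model so that production inputs are transformed the same way.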

**Evaluation Metrics and Business Impact**

**Accuracy**: The accuracy score reflects the model's overall correctness in predicting the rating class.

**Precision**: High precision indicates that when the model predicts a given rating class, the prediction is usually correct, reducing false alarms.

**Recall**: A strong recall score ensures that most of the reviews that truly belong to a class are captured, minimizing the risk of missing dissatisfied passengers.

**F1 Score**: Balanced F1 score demonstrates the model's ability to maintain a good trade-off between precision and recall.

**Business Implications:**

**Enhanced User Engagement**: Accurate classification of review ratings shows which service aspects passengers value, so improvements can be targeted, increasing customer satisfaction and engagement.

**Operational Efficiency**: Automating the classification process reduces the manual effort of reading reviews, leading to cost savings and a faster response to feedback.

**Scalability**: A robust model can handle large volumes of data, allowing the business to scale its operations without compromising on quality.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***