### **0.1 Project Overview and Methodology**

Source:
https://www.kaggle.com/datasets/chaudharyanshul/airline-reviews

Data Size:

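Rows and Columns: 3,701 rows, 20 columns as loaded into the DataFrame (the 20 include the `Unnamed: 0` index column carried over from the CSV)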

Word Count: 592,802 (whitespace-delimited words in the raw `ReviewBody` column, as computed in Section 4)

This notebook documents the process of cleaning and preprocessing a raw text dataset of airline reviews. The primary objective is to transform the unstructured, user-generated text into a clean, structured format suitable for downstream Natural Language Processing (NLP) tasks such as sentiment analysis.

The methodology follows a standard text preprocessing pipeline, which includes lowercasing, custom noise removal, punctuation and special character removal, stopword removal, and tokenization. The following sections outline the scope of the project and the metrics that will be used to evaluate the effectiveness of the implemented pipeline.

---
### **0.2 Project Scope: Column Selection**

The text preprocessing pipeline will be applied exclusively to the `ReviewBody` column, as it is the sole column in the dataset containing **unstructured, natural language text**. This is the data format that natural language processing techniques are designed to analyze and interpret.

The remaining columns, such as `OverallRating`, `TypeOfTraveller`, and `DateFlown`, contain **structured data**—specifically numerical, categorical, and temporal information, respectively. Applying text-based cleaning methods like stopword removal or punctuation stripping to these columns would be methodologically incorrect and would corrupt their inherent structure and meaning. As the objective of sentiment analysis is to extract opinions and attitudes from expressive language, the focus of the preprocessing is correctly constrained to the `ReviewBody` column. This targeted approach aligns with the fundamental goal of text mining, which is to derive insights from unstructured text data [1].

---
### **0.3 Evaluation Metric: Word Count Analysis**

To objectively evaluate the effectiveness of the preprocessing pipeline, a quantitative analysis of word counts will be conducted. A "before-and-after" comparison, measuring the total number of words in the raw `ReviewBody` column against the final cleaned text, serves as a crucial metric to demonstrate the extent of data reduction and noise removal. The primary contributor to this reduction is typically the stopword removal phase, which is a standard step in preparing data for sentiment analysis and other text mining tasks [2]. By quantifying this change, we provide concrete evidence of the transformation from a raw, noisy dataset to a more focused and semantically dense corpus.
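As a minimal sketch of this metric, the snippet below computes the before-and-after word counts and the percentage reduction on a tiny, hypothetical two-review corpus. The helper names `count_words` and `reduction_percentage` are illustrative assumptions; the calculation itself mirrors the whitespace-split counting used in Section 4.

```python
def count_words(texts):
    """Total number of whitespace-delimited words across an iterable of strings."""
    return sum(len(text.split()) for text in texts)


def reduction_percentage(words_before, words_after):
    """Percentage of words removed relative to the original count."""
    if words_before == 0:
        return 0.0  # avoid division by zero on an empty corpus
    return (words_before - words_after) / words_before * 100


# Hypothetical two-review corpus, before and after cleaning
raw_texts = ["The flight was late and the crew was rude.", "Thanks for nothing!"]
cleaned_texts = ["flight late crew rude", "thanks nothing"]

before = count_words(raw_texts)     # 9 + 3 = 12 words
after = count_words(cleaned_texts)  # 4 + 2 = 6 words
print(f"{before} -> {after} words ({reduction_percentage(before, after):.2f}% removed)")
```

Because the count relies on `str.split()`, a "word" here is any run of non-whitespace characters, so punctuation attached to a word does not create an extra word.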

---
**References are listed at the end of the notebook (Section 5).**


### **1. Importing Libraries and Loading the Dataset**

This initial code block sets up the notebook's environment. It performs the following key functions:
*   **Imports `pandas`:** This library is essential for data manipulation and will be used to read and manage our CSV data in a DataFrame.
*   **Imports `re`:** This is Python's regular expression library, which will be critical for finding and removing specific patterns of noise in the text.
*   **Mounts Google Drive:** This connects the Colab environment to the user's Google Drive, allowing us to access the dataset file.
*   **Loads the CSV:** It reads the `BA_AirlineReviews.csv` file into a pandas DataFrame named `df` and then displays the first few rows (`.head()`) and the dimensions (`.shape`) to confirm that the data has been loaded correctly.

**Input:** The `BA_AirlineReviews.csv` file located in a specified Google Drive path.

**Output:** A summary of the DataFrame's dimensions (number of rows and columns) and a preview table of the first 5 rows.


In [None]:
# Import necessary libraries for data handling and regular expressions
import pandas as pd
import re

# Mount Google Drive to access our dataset
from google.colab import drive
drive.mount('/content/drive')

# --- IMPORTANT: Update this path if you saved your file in a different folder! ---
file_path = '/content/drive/MyDrive/ITC508_data/BA_AirlineReviews.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Print the shape of the DataFrame (rows, columns) and display the first 5 rows
print(f"Dataset Dimensions: {df.shape}")
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset Dimensions: (3701, 20)


Unnamed: 0.1,Unnamed: 0,OverallRating,ReviewHeader,Name,Datetime,VerifiedReview,ReviewBody,TypeOfTraveller,SeatType,Route,DateFlown,SeatComfort,CabinStaffService,GroundService,ValueForMoney,Recommended,Aircraft,Food&Beverages,InflightEntertainment,Wifi&Connectivity
0,0,1.0,"""Service level far worse then Ryanair""",L Keele,19th November 2023,True,4 Hours before takeoff we received a Mail stat...,Couple Leisure,Economy Class,London to Stuttgart,November 2023,1.0,1.0,1.0,1.0,no,,,,
1,1,3.0,"""do not upgrade members based on status""",Austin Jones,19th November 2023,True,I recently had a delay on British Airways from...,Business,Economy Class,Brussels to London,November 2023,2.0,3.0,1.0,2.0,no,A320,1.0,2.0,2.0
2,2,8.0,"""Flight was smooth and quick""",M A Collie,16th November 2023,False,"Boarded on time, but it took ages to get to th...",Couple Leisure,Business Class,London Heathrow to Dublin,November 2023,3.0,3.0,4.0,3.0,yes,A320,4.0,,
3,3,1.0,"""Absolutely hopeless airline""",Nigel Dean,16th November 2023,True,"5 days before the flight, we were advised by B...",Couple Leisure,Economy Class,London to Dublin,December 2022,3.0,3.0,1.0,1.0,no,,,,
4,4,1.0,"""Customer Service is non existent""",Gaylynne Simpson,14th November 2023,False,"We traveled to Lisbon for our dream vacation, ...",Couple Leisure,Economy Class,London to Lisbon,November 2023,1.0,1.0,1.0,1.0,no,,1.0,1.0,1.0


### **2. Exploring the Raw Text Data**

Before applying any cleaning techniques, it is crucial to inspect the raw data. This cell isolates and prints the text from the `ReviewBody` column of a single row.

The purpose of this is to have a clear "before" example, allowing us to visually confirm the presence of various types of noise that we plan to remove, such as:
*   Inconsistent capitalization
*   Punctuation and special characters
*   The "new" noise we identified: alphanumeric shorthand for time (e.g., "4h").

**Input:** The pandas DataFrame `df`.

**Output:** The raw string content of a single review.
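The three noise types listed above can also be surfaced programmatically instead of by eye. The following is a rough sketch using an excerpt of the first review; the three regular expressions are illustrative assumptions for inspection only and are not the patterns used later for cleaning.

```python
import re

# Excerpt taken from the first review in the dataset
excerpt = (
    "So did the capacity of the Heathrow Airport really hit British Airways "
    "by surprise, 4h before departure? Anyhow - we took the one hour delay "
    "so what - but then we have been forced to check in our Hand luggage."
)

# Words starting with a capital letter (candidates for inconsistent capitalization)
capitalized = re.findall(r"\b[A-Z][a-z]+\b", excerpt)

# Distinct characters that are neither word characters nor whitespace
punctuation = sorted(set(re.findall(r"[^\w\s]", excerpt)))

# Digits immediately followed by letters (e.g. the "4h" shorthand)
shorthand = re.findall(r"\b\d+[a-zA-Z]+\b", excerpt)

print("Capitalized words:", capitalized)
print("Punctuation marks:", punctuation)
print("Alphanumeric shorthand:", shorthand)
```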

In [None]:
# Select and print the raw text from the 'ReviewBody' column of the first row (index 0)
raw_review_sample = df['ReviewBody'][0]

print("--- RAW REVIEW SAMPLE ---")
print(raw_review_sample)

--- RAW REVIEW SAMPLE ---
4 Hours before takeoff we received a Mail stating a cryptic message that there are disruptions to be expected as there is a limit on how many planes can leave at the same time. So did the capacity of the Heathrow Airport really hit British Airways by surprise, 4h before departure? Anyhow - we took the one hour delay so what - but then we have been forced to check in our Hand luggage. I travel only with hand luggage to avoid waiting for the ultra slow processing of the checked in luggage. Overall 2h later at home than planed, with really no reason, just due to incompetent people. Service level far worse then Ryanair and triple the price. Really never again. Thanks for nothing.


### **3. Text Preprocessing Pipeline**

We will now clean the `ReviewBody` text by applying a series of preprocessing steps. Each step addresses a specific type of noise identified during exploration. The processed text is stored in a new column, `Cleaned_Review`, so that the original review is preserved for comparison.

The pipeline will be executed in the following order:
1.  **Lowercasing:** Convert all text to lowercase to ensure uniformity.
2.  **Custom Noise Removal:** Remove domain-specific patterns, including our identified "new noise" (e.g., "4h", "2h"), using custom regular expressions.
3.  **Removing Special Characters:** Remove all characters that are not letters, numbers, or whitespace.
4.  **Stopword Removal:** Remove common English words that provide little semantic value.
5.  **Tokenization:** Split the cleaned text into a list of individual word tokens.
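The order of the steps above can be sketched as a single function applied to one string. This is a simplified illustration, not the notebook's implementation: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-picked stand-in for the NLTK list, and only the hour-based part of the duration pattern is included.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
DEMO_STOPWORDS = {"at", "than", "with", "no", "just", "to"}


def preprocess(text, stop_words):
    """Apply the five pipeline steps, in order, to a single review string."""
    text = text.lower()                                          # 1. lowercasing
    text = re.sub(r"\b\d+\s?(h|hr|hrs|hour|hours)\b", "", text)  # 2. durations (subset)
    text = re.sub(r"[^a-z0-9\s]", "", text)                      # 3. special characters
    words = [w for w in text.split() if w not in stop_words]     # 4. stopword removal
    return words                                                 # 5. list of tokens


tokens = preprocess("Overall 2h later at home than planed, with really no reason.",
                    DEMO_STOPWORDS)
print(tokens)
```

The ordering matters: lowercasing comes first so that the lowercase-only patterns match, and the duration pattern runs before punctuation is stripped so that it sees the text in its original spacing.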


### **3.1. Step 1: Lowercasing**

The first step in our cleaning pipeline is to convert all text to lowercase. This is a crucial normalization technique that prevents the same word with different capitalization (e.g., "Service" and "service") from being treated as two separate and distinct words by NLP models.

We will create a new column called `Cleaned_Review` to store the results of our preprocessing steps. This preserves the original `ReviewBody` for comparison. We use `.astype(str)` as a precaution to handle any potential non-string data in the column. Note that this converts missing values (NaN) to the literal string "nan" rather than dropping them.

**Input:** The original `ReviewBody` column.

**Output:** A new `Cleaned_Review` column in the DataFrame containing the same text but entirely in lowercase.
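To illustrate why this normalization matters, the short sketch below counts word frequencies in a made-up sentence with and without lowercasing. The sentence is a hypothetical example, not taken from the dataset.

```python
from collections import Counter

# Hypothetical sentence containing the same word in two capitalizations
words = "Service was poor but the Service desk praised their own service".split()

raw_counts = Counter(words)
lowered_counts = Counter(word.lower() for word in words)

# Without lowercasing, "Service" and "service" are counted as different words
print(raw_counts["Service"], raw_counts["service"])
# After lowercasing, all occurrences are merged into one entry
print(lowered_counts["service"])
```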

In [None]:
# Create the new 'Cleaned_Review' column by lowercasing every review
# .astype(str) avoids errors on non-string values; note that NaN becomes the string 'nan'
df['Cleaned_Review'] = df['ReviewBody'].astype(str).str.lower()

# Display the original and the newly cleaned columns for the first 5 rows to compare
print("--- DATAFRAME PREVIEW: AFTER LOWERCASING ---")
df[['ReviewBody', 'Cleaned_Review']].head()

--- DATAFRAME PREVIEW: AFTER LOWERCASING ---


Unnamed: 0,ReviewBody,Cleaned_Review
0,4 Hours before takeoff we received a Mail stat...,4 hours before takeoff we received a mail stat...
1,I recently had a delay on British Airways from...,i recently had a delay on british airways from...
2,"Boarded on time, but it took ages to get to th...","boarded on time, but it took ages to get to th..."
3,"5 days before the flight, we were advised by B...","5 days before the flight, we were advised by b..."
4,"We traveled to Lisbon for our dream vacation, ...","we traveled to lisbon for our dream vacation, ..."


To verify the transformation on our specific example, we apply the same lowercasing directly to the `raw_review_sample` variable and print the result. This gives a clear before-and-after view of the change on a single piece of text.

In [None]:
# Apply the same lowercasing to our isolated sample
cleaned_sample = raw_review_sample.lower()

print("--- RAW REVIEW SAMPLE ---")
print(raw_review_sample)
print("\n--- SAMPLE AFTER LOWERCASING ---")
print(cleaned_sample)

--- RAW REVIEW SAMPLE ---
4 Hours before takeoff we received a Mail stating a cryptic message that there are disruptions to be expected as there is a limit on how many planes can leave at the same time. So did the capacity of the Heathrow Airport really hit British Airways by surprise, 4h before departure? Anyhow - we took the one hour delay so what - but then we have been forced to check in our Hand luggage. I travel only with hand luggage to avoid waiting for the ultra slow processing of the checked in luggage. Overall 2h later at home than planed, with really no reason, just due to incompetent people. Service level far worse then Ryanair and triple the price. Really never again. Thanks for nothing.

--- SAMPLE AFTER LOWERCASING ---
4 hours before takeoff we received a mail stating a cryptic message that there are disruptions to be expected as there is a limit on how many planes can leave at the same time. so did the capacity of the heathrow airport really hit british airways by surp

### **3.2. Custom Noise Removal via Regular Expressions**

A crucial step in text preprocessing involves removing noise that is specific to the text's domain. In the context of airline reviews, this includes non-standard text patterns such as time durations (e.g., "4h", "2 hours"), currency expressions, and industry-specific codes (e.g., airport and aircraft identifiers). Standard libraries may not effectively remove this type of noise. Therefore, regular expressions (regex) are employed as a powerful tool for pattern-based text cleaning, allowing for the precise identification and removal of these unwanted character sequences. This tailored cleaning is essential for reducing the dimensionality of the data and ensuring that subsequent text mining analyses focus on semantically meaningful content [1]. This function consolidates the removal of multiple noise categories for efficiency.

**Input:** The `Cleaned_Review` column, which has already been lowercased.

**Output:** The `Cleaned_Review` column with domain-specific noise (URLs, dates, durations, prices, codes) removed.
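Before applying a pattern to the full column, it can help to check it against a few hand-written strings. The sketch below does this for the duration pattern used in this section; the test strings are hypothetical, and the expected results show both what is removed and what is deliberately or accidentally left alone.

```python
import re

# Duration pattern as used in remove_custom_noise
DURATION = re.compile(r'\b\d+\s?(h|hr|hrs|hour|hours|minute|minutes|min|mins)\b')

# Hypothetical inputs mapped to the text expected after substitution
checks = {
    "by surprise, 4h before departure": "by surprise,  before departure",
    "delayed 2 hours and 45 mins": "delayed  and ",
    "a 45-minute layover": "a 45-minute layover",  # hyphenated form is not matched
    "gate 4 here": "gate 4 here",                  # a number before an unrelated word is kept
}

for text, expected in checks.items():
    result = DURATION.sub('', text)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{status}: {text!r} -> {result!r}")
```

Removed matches leave their surrounding spaces behind, which is why some results contain double spaces. This is harmless here because later steps split on whitespace.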


In [None]:
# Define a single function that removes several categories of custom noise.
# It expects lowercased input, because the patterns below only match lowercase letters.
def remove_custom_noise(text):
    # Remove URLs (a common form of noise in user-generated content)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove dates (e.g., 24th october 2023, 15 july 2023)
    # Caveat: \w+ accepts any word in the middle position, not only month names
    text = re.sub(r'\d{1,2}(st|nd|rd|th)?\s+\w+\s+\d{4}', '', text)

    # Remove times and durations (e.g., 4h, 2 hours, 45 mins)
    text = re.sub(r'\b\d+\s?(h|hr|hrs|hour|hours|minute|minutes|min|mins)\b', '', text)

    # Remove prices and currencies (e.g., £100, $500, 75 pounds)
    text = re.sub(r'[£$€]\d+(\.\d{1,2})?', '', text)
    text = re.sub(r'\b\d+\s?(pounds|gbp|aud|euro|eur)\b', '', text)

    # Remove common airport codes and names
    # Caveat: 'man' also matches the ordinary English word "man"
    text = re.sub(r'\b(lhr|jfk|cpt|man|sjc|mxp|gatwick)\b', '', text)

    # Remove aircraft identifiers (e.g., a380, b777, 787)
    # Caveat: the second alternative removes every standalone three-digit number
    text = re.sub(r'\b([a-z]\d{2,3}|[b]?\d{3})\b', '', text)

    return text

# Apply the custom noise removal function to the 'Cleaned_Review' column
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_custom_noise)

# Display the original and the cleaned columns for comparison
print("--- DATAFRAME PREVIEW: AFTER CUSTOM NOISE REMOVAL ---")
df[['ReviewBody', 'Cleaned_Review']].head()

--- DATAFRAME PREVIEW: AFTER CUSTOM NOISE REMOVAL ---


Unnamed: 0,ReviewBody,Cleaned_Review
0,4 Hours before takeoff we received a Mail stat...,before takeoff we received a mail stating a c...
1,I recently had a delay on British Airways from...,i recently had a delay on british airways from...
2,"Boarded on time, but it took ages to get to th...","boarded on time, but it took ages to get to th..."
3,"5 days before the flight, we were advised by B...","5 days before the flight, we were advised by b..."
4,"We traveled to Lisbon for our dream vacation, ...","we traveled to lisbon for our dream vacation, ..."
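Two of the patterns in `remove_custom_noise` match more than their comments suggest, which is worth keeping in mind when interpreting the word count reduction. The sketch below demonstrates this on hypothetical sentences and then shows one possible narrower aircraft pattern. `NARROW_AIRCRAFT` is an illustrative assumption, not part of the notebook's pipeline, and it covers only Airbus A3xx and Boeing 7x7 style identifiers.

```python
import re

# Patterns as used in remove_custom_noise
AIRPORTS = re.compile(r'\b(lhr|jfk|cpt|man|sjc|mxp|gatwick)\b')
AIRCRAFT = re.compile(r'\b([a-z]\d{2,3}|[b]?\d{3})\b')

# 'man' is removed both as an airport code and as an ordinary word
airport_result = AIRPORTS.sub('', "the man at the desk in man airport")
print(repr(airport_result))

# '350' is removed together with the aircraft identifier 'a380'
aircraft_result = AIRCRAFT.sub('', "paid 350 for a seat on the a380 in row 12")
print(repr(aircraft_result))

# Hypothetical narrower alternative: only a3xx and (b)7x7 identifiers
NARROW_AIRCRAFT = re.compile(r'\b(a3\d{2}|b?7\d7)\b')
narrow_result = NARROW_AIRCRAFT.sub('', "paid 350 for a seat on the a380 in row 12")
print(repr(narrow_result))
```

The narrower pattern keeps the amount "350" but would still remove a bare "777" or "787" wherever it appears, so it reduces over-matching rather than eliminating it.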


### **3.3. Special Character and Punctuation Removal**

After removing domain-specific patterns, the next step is to remove general special characters and punctuation (e.g., `!`, `-`, `.`, `,`). This simplifies the text corpus by eliminating characters that typically carry little semantic weight in many NLP tasks. By retaining only alphanumeric characters and whitespace, we further normalize the text and prevent punctuation from being misinterpreted as part of a word during tokenization. This is a standard and foundational step in preparing text for data mining applications [1].

**Input:** The `Cleaned_Review` column.

**Output:** The `Cleaned_Review` column containing only letters, numbers, and spaces.
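As a small illustration of what this step does to individual strings, the sketch below applies the same character class to a few hypothetical inputs. The extra whitespace normalization with `" ".join(...split())` is an addition for readability here and is not part of the notebook's function.

```python
import re


def remove_special_characters(text):
    # Delete every character that is not a letter, digit, or whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)


examples = [
    "anyhow - we took the delay, so what!",
    "i don't recommend it",
    "food&beverages",
]

results = []
for text in examples:
    cleaned = remove_special_characters(text)
    normalized = " ".join(cleaned.split())  # collapse leftover runs of spaces
    results.append(normalized)
    print(repr(text), "->", repr(normalized))
```

Because characters are deleted rather than replaced with a space, words joined only by punctuation are fused ("food&beverages" becomes "foodbeverages") and contractions lose their apostrophe ("don't" becomes "dont").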

In [None]:
# Define a function to remove all characters that are not letters, numbers, or whitespace
def remove_special_characters(text):
    # The regex [^a-zA-Z0-9\s] matches any character that is NOT a letter, number, or whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Apply the function
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_special_characters)

# Display the head of the DataFrame to see the result
print("--- DATAFRAME PREVIEW: AFTER SPECIAL CHARACTER REMOVAL ---")
df[['ReviewBody', 'Cleaned_Review']].head()

--- DATAFRAME PREVIEW: AFTER SPECIAL CHARACTER REMOVAL ---


Unnamed: 0,ReviewBody,Cleaned_Review
0,4 Hours before takeoff we received a Mail stat...,before takeoff we received a mail stating a c...
1,I recently had a delay on British Airways from...,i recently had a delay on british airways from...
2,"Boarded on time, but it took ages to get to th...",boarded on time but it took ages to get to the...
3,"5 days before the flight, we were advised by B...",5 days before the flight we were advised by ba...
4,"We traveled to Lisbon for our dream vacation, ...",we traveled to lisbon for our dream vacation a...


### **3.4. Stopword Removal**

Stopwords are high-frequency words such as "the", "is", and "a" that occur commonly across all documents in a corpus. While essential for grammatical structure, they often provide little unique information for tasks like topic modeling or sentiment analysis. As noted in surveys of the field, the removal of stopwords is a critical preprocessing step that reduces the feature space and allows analytical models to focus on the more discriminative terms in the text [1]. In this step, the standard, pre-compiled list of English stopwords provided by the NLTK library is utilized. This ensures a consistent and reproducible cleaning methodology.

**Input:** The `Cleaned_Review` column.

**Output:** The `Cleaned_Review` column with common English stopwords removed.
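The filtering logic can be sketched without NLTK by using a small stand-in list. `DEMO_STOPWORDS` below is a hypothetical subset chosen for illustration; it includes one contraction written with an apostrophe to show how the earlier punctuation step can affect matching.

```python
# Hypothetical stand-in for a stopword list; stored as a set for fast membership tests
DEMO_STOPWORDS = {"i", "the", "was", "and", "a", "don't"}


def remove_stopwords(text, stop_words):
    """Keep only the words that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stop_words)


# With the apostrophe intact, the contraction is recognized and removed
with_apostrophe = remove_stopwords("i don't think the crew was rude", DEMO_STOPWORDS)
print(with_apostrophe)

# After apostrophes have been stripped, "dont" no longer matches "don't"
without_apostrophe = remove_stopwords("i dont think the crew was rude", DEMO_STOPWORDS)
print(without_apostrophe)
```

If a stopword list stores contractions with apostrophes, running punctuation removal first means forms such as "dont" may survive this step. Whether that matters depends on the downstream task.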

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords resource from NLTK
nltk.download('stopwords')

# Load the standard English stopwords directly from NLTK into a set for efficient processing
stop_words = set(stopwords.words('english'))

print(f"Using standard NLTK stopword list with {len(stop_words)} words.")

# Define the removal function
def remove_stopwords(text):
    words = text.split()
    # Filter out any word that is in our stop_words set
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_stopwords)

# Display the head of the DataFrame to see the result
print("\n--- DATAFRAME PREVIEW: AFTER STANDARD STOPWORD REMOVAL ---")
df[['ReviewBody', 'Cleaned_Review']].head()

Using standard NLTK stopword list with 198 words.

--- DATAFRAME PREVIEW: AFTER STANDARD STOPWORD REMOVAL ---


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,ReviewBody,Cleaned_Review
0,4 Hours before takeoff we received a Mail stat...,takeoff received mail stating cryptic message ...
1,I recently had a delay on British Airways from...,recently delay british airways bru due staff s...
2,"Boarded on time, but it took ages to get to th...",boarded time took ages get runway due congesti...
3,"5 days before the flight, we were advised by B...",5 days flight advised ba cancelled asked us re...
4,"We traveled to Lisbon for our dream vacation, ...",traveled lisbon dream vacation cruise portugal...


### **3.5. Tokenization**

Tokenization is the final preprocessing step in this pipeline, where the cleaned string of text is segmented into a sequence of individual words or "tokens". This process is foundational for virtually all subsequent NLP tasks, as it transforms the unstructured text into a structured format that machine learning models can process. Most NLP models, from simple bag-of-words to complex transformers, require tokenized input to function correctly, making this a necessary step in preparing data for applications like sentiment analysis [2]. We use the `word_tokenize` function from the NLTK library, which splits text into sentences with the pre-trained Punkt model and then applies a rule-based word tokenizer.

**Input:** The fully cleaned `Cleaned_Review` string.

**Output:** A new `Tokenized_Review` column, where each entry is a list of word tokens.
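For intuition about what tokenization produces at this stage, the sketch below uses two simple stand-ins instead of NLTK: a whitespace split for already-cleaned text and a regular expression for text that still contains punctuation. Both helper names are hypothetical. On text that contains only lowercase letters, digits, and spaces, a whitespace split should give the same tokens as `word_tokenize` in most cases, although NLTK additionally splits a few fused forms such as "cannot".

```python
import re


def whitespace_tokenize(text):
    """Split already-cleaned text on runs of whitespace."""
    return text.split()


def regex_tokenize(text):
    """Extract runs of lowercase letters and digits, skipping everything else."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Beginning of the first cleaned review shown in the previews above
cleaned = "takeoff received mail stating cryptic message"
clean_tokens = whitespace_tokenize(cleaned)
print(clean_tokens)

# The regex variant also copes with text that has not been cleaned yet
raw_tokens = regex_tokenize("Service level far worse, really!")
print(raw_tokens)
```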

In [None]:
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data used by word_tokenize
# Recent NLTK releases look for 'punkt_tab'; 'punkt' is kept for older releases
nltk.download('punkt')
nltk.download('punkt_tab')

# Apply the word_tokenize function to create the new column
df['Tokenized_Review'] = df['Cleaned_Review'].apply(word_tokenize)

# Display the final tokenized output alongside the original review
print("--- FINAL DATAFRAME PREVIEW: AFTER TOKENIZATION ---")
df[['ReviewBody', 'Tokenized_Review']].head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


--- FINAL DATAFRAME PREVIEW: AFTER TOKENIZATION ---


Unnamed: 0,ReviewBody,Tokenized_Review
0,4 Hours before takeoff we received a Mail stat...,"[takeoff, received, mail, stating, cryptic, me..."
1,I recently had a delay on British Airways from...,"[recently, delay, british, airways, bru, due, ..."
2,"Boarded on time, but it took ages to get to th...","[boarded, time, took, ages, get, runway, due, ..."
3,"5 days before the flight, we were advised by B...","[5, days, flight, advised, ba, cancelled, aske..."
4,"We traveled to Lisbon for our dream vacation, ...","[traveled, lisbon, dream, vacation, cruise, po..."


### **4. Generating Quantitative Results for Analysis**

To evaluate the impact of our preprocessing pipeline, we will generate quantitative metrics. This includes a direct comparison of word counts before and after cleaning and a final view of the transformed data. These metrics are essential for the "Results" section of our final report.

In [None]:
# Calculate the total number of words before cleaning
# We split each review into words and sum the lengths
words_before = df['ReviewBody'].astype(str).apply(lambda x: len(x.split())).sum()

# Calculate the total number of words after cleaning
words_after = df['Cleaned_Review'].astype(str).apply(lambda x: len(x.split())).sum()

# Calculate the percentage reduction
reduction_percentage = ((words_before - words_after) / words_before) * 100

print("--- WORD COUNT ANALYSIS ---")
print(f"Total words before cleaning: {words_before}")
print(f"Total words after cleaning: {words_after}")
print(f"Percentage of words removed: {reduction_percentage:.2f}%")


df.head(10)

--- WORD COUNT ANALYSIS ---
Total words before cleaning: 592802
Total words after cleaning: 307093
Percentage of words removed: 48.20%


Unnamed: 0.1,Unnamed: 0,OverallRating,ReviewHeader,Name,Datetime,VerifiedReview,ReviewBody,TypeOfTraveller,SeatType,Route,...,CabinStaffService,GroundService,ValueForMoney,Recommended,Aircraft,Food&Beverages,InflightEntertainment,Wifi&Connectivity,Cleaned_Review,Tokenized_Review
0,0,1.0,"""Service level far worse then Ryanair""",L Keele,19th November 2023,True,4 Hours before takeoff we received a Mail stat...,Couple Leisure,Economy Class,London to Stuttgart,...,1.0,1.0,1.0,no,,,,,takeoff received mail stating cryptic message ...,"[takeoff, received, mail, stating, cryptic, me..."
1,1,3.0,"""do not upgrade members based on status""",Austin Jones,19th November 2023,True,I recently had a delay on British Airways from...,Business,Economy Class,Brussels to London,...,3.0,1.0,2.0,no,A320,1.0,2.0,2.0,recently delay british airways bru due staff s...,"[recently, delay, british, airways, bru, due, ..."
2,2,8.0,"""Flight was smooth and quick""",M A Collie,16th November 2023,False,"Boarded on time, but it took ages to get to th...",Couple Leisure,Business Class,London Heathrow to Dublin,...,3.0,4.0,3.0,yes,A320,4.0,,,boarded time took ages get runway due congesti...,"[boarded, time, took, ages, get, runway, due, ..."
3,3,1.0,"""Absolutely hopeless airline""",Nigel Dean,16th November 2023,True,"5 days before the flight, we were advised by B...",Couple Leisure,Economy Class,London to Dublin,...,3.0,1.0,1.0,no,,,,,5 days flight advised ba cancelled asked us re...,"[5, days, flight, advised, ba, cancelled, aske..."
4,4,1.0,"""Customer Service is non existent""",Gaylynne Simpson,14th November 2023,False,"We traveled to Lisbon for our dream vacation, ...",Couple Leisure,Economy Class,London to Lisbon,...,1.0,1.0,1.0,no,,1.0,1.0,1.0,traveled lisbon dream vacation cruise portugal...,"[traveled, lisbon, dream, vacation, cruise, po..."
5,5,1.0,"""I can’t imagine a worst airline""",A Narden,12th November 2023,True,Booked a flight from Bucharest to Manchester w...,Solo Leisure,Economy Class,Bucharest to Manchester via London,...,1.0,1.0,1.0,no,A320,1.0,1.0,,booked flight bucharest manchester 45 layover ...,"[booked, flight, bucharest, manchester, 45, la..."
6,6,8.0,"""sufficient leg and arm room""",Graeme Boothman,8th November 2023,True,Booked online months ago and the only hitch wa...,Couple Leisure,Premium Economy,Manchester to Cape Town via London,...,5.0,4.0,4.0,yes,Boeing 777-300,4.0,4.0,,booked online months ago hitch replacement air...,"[booked, online, months, ago, hitch, replaceme..."
7,7,7.0,“crew were polite”,R Vines,7th November 2023,True,The flight was on time. The crew were polite. ...,Solo Leisure,Economy Class,Seville to London Gatwick,...,3.0,3.0,3.0,yes,A320,3.0,,,flight time crew polite story outward flight f...,"[flight, time, crew, polite, story, outward, f..."
8,8,2.0,"""Angry, disappointed, and unsatisfied""",Massimo Tricca,5th November 2023,False,"Angry, disappointed, and unsatisfied. My route...",Family Leisure,Economy Class,London Heatrow to Atlanta,...,5.0,3.0,5.0,yes,Boeing 777,4.0,4.0,3.0,angry disappointed unsatisfied route london at...,"[angry, disappointed, unsatisfied, route, lond..."
9,9,3.0,"""BA now stands for Basic Airways""",J Kaye,5th November 2023,True,"As an infrequent flyer, British Airways was al...",Couple Leisure,Economy Class,Gatwick to Antalya,...,3.0,3.0,1.0,no,,1.0,1.0,1.0,infrequent flyer british airways always first ...,"[infrequent, flyer, british, airways, always, ..."


### **5. References**


[1] U. S. Shanthamallu, A. Spanias, C. Tepedelenlioglu, and M. Stanley, "A brief survey of text mining: Classification, clustering and extraction techniques," in 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), New York, NY, USA, 2017, pp. 455-460. doi: 10.1109/UEMCON.2017.8249092. Available: https://arxiv.org/abs/1707.02919

[2] L. A. W. M. Gunawardhana, M. D. J. S. Goonetillake, and P. T. D. I. Dias, "A Survey on Sentiment Analysis Methods, Applications, and Challenges," in 2022 4th International Conference on Advancements in Computing (ICAC), Colombo, Sri Lanka, 2022, pp. 200-205. doi: 10.1109/ICAC57686.2022.10025000. Available: https://www.sciencedirect.com/science/article/pii/S131915782400137X