# Task 1

---

## Web scraping and analysis

We will use a package called `BeautifulSoup` to collect the data from the web.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com) you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways) you will see this data. Now we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then collect the text data from each of the individual review links.
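The review listing is paginated, so each page needs its own URL. The sketch below builds those URLs with the standard library, using the same query parameters (`sortby`, `pagesize`) that appear in the scraping cell of this notebook. The helper name `build_page_url` is hypothetical and only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def build_page_url(page, page_size=100):
    """Return the listing URL for one page of reviews, newest first.

    `build_page_url` is an illustrative helper, not part of any library.
    """
    # urlencode percent-encodes the ":" in "post_date:Desc" as "%3A"
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"

print(build_page_url(1))
```

Building the query string with `urlencode` avoids hand-writing escapes such as `%3A` and keeps the page number and page size easy to change.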

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 100
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
Scrapi

In [None]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | Flew British Airways on BA ...
1,✅ Trip Verified | BA cancelled the flight fro...
2,✅ Trip Verified | I strongly advise everyone t...
3,✅ Trip Verified | My partner and I were on the...
4,Not Verified | We had a Premium Economy retur...


In [None]:
reviews[1]

'✅ Trip Verified |  BA cancelled the flight from Tokyo to LHR. I was booked on next day flight. There was another flight on the same day. I went to the desk, but the flight was full. BA in charge offers another flight through Hong Kong which would have been 26h flight time. I declined that, and asked to stay on the next day flight. To my dismay he cancelled the next day flight without telling me he did that. I think he was annoyed that I didn’t accept the offer after he spent sometime looking for. In fact I am the one who should be annoyed for cancelling my flight. I ended up flying another airline with downgrading. Poor service, and appalling behaviour. You expect better from BA.'

In [None]:
import os

# Create the 'data' directory if it doesn't exist
os.makedirs("data", exist_ok=True)

df.to_csv("data/BA_reviews.csv")


The next step is to clean this data by removing any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
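One way to remove the verification badge is a small regular expression applied to each review before any other cleaning. This is a minimal sketch that assumes only the two prefixes visible in the sample output ("✅ Trip Verified |" and "Not Verified |"); other badge variants, if the site uses any, would need to be added. The function name `strip_verification_prefix` is hypothetical.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|" separator.
# Assumption: only these two badge texts occur at the start of a review.
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*",
    re.IGNORECASE,
)

def strip_verification_prefix(review):
    """Remove the leading verification badge from a review, if present."""
    return VERIFICATION_PREFIX.sub("", review).strip()

print(strip_verification_prefix("✅ Trip Verified |  BA cancelled the flight from Tokyo to LHR."))
print(strip_verification_prefix("Not Verified |  We had a Premium Economy return flight."))
print(strip_verification_prefix("Business LHR to BKK."))
```

With a DataFrame, the same function could be applied as `df["reviews"].apply(strip_verification_prefix)` so that the words "trip" and "verified" do not end up in the normalized text.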

# Text Normalization

In [None]:
!pip install pandas nltk



In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Load stopwords
stop_words = set(stopwords.words('english'))

# Function to normalize text
def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back to string
    normalized_text = ' '.join(tokens)
    return normalized_text

In [None]:
# Apply text normalization to the desired column (replace 'reviews' with your actual text column name)
df['normalized_text'] = df['reviews'].apply(normalize_text)

# Display the first few rows of the normalized dataset
print(df[['reviews', 'normalized_text']].head())


                                             reviews  \
0  ✅ Trip Verified |  Flew British Airways on BA ...   
1  ✅ Trip Verified |  BA cancelled the flight fro...   
2  ✅ Trip Verified | I strongly advise everyone t...   
3  ✅ Trip Verified | My partner and I were on the...   
4  Not Verified |  We had a Premium Economy retur...   

                                     normalized_text  
0  trip verified flew british airway ba london he...  
1  trip verified ba cancelled flight tokyo lhr bo...  
2  trip verified strongly advise everyone never f...  
3  trip verified partner ba return flight tampa g...  
4  verified premium economy return flight los ang...  


# Remove Stop Words

In [None]:
from nltk.corpus import stopwords

# Download the NLTK stop words list
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define a function to remove stop words from a given text
def remove_stop_words(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Apply the function to the 'normalized_text' column.
# normalize_text already removed stop words, so this pass is largely redundant.
df['cleaned_text'] = df['normalized_text'].apply(remove_stop_words)

# Optional: Inspect the cleaned dataset to ensure stop words are removed
print(df[['normalized_text', 'cleaned_text']].head())

# Save the cleaned dataset (optional; replace 'cleaned_dataset.csv' with your desired file path)
df.to_csv('cleaned_dataset.csv', index=False)

print("Cleaned dataset saved successfully.")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                     normalized_text  \
0  trip verified flew british airway ba london he...   
1  trip verified ba cancelled flight tokyo lhr bo...   
2  trip verified strongly advise everyone never f...   
3  trip verified partner ba return flight tampa g...   
4  verified premium economy return flight los ang...   

                                        cleaned_text  
0  trip verified flew british airway ba london he...  
1  trip verified ba cancelled flight tokyo lhr bo...  
2  trip verified strongly advise everyone never f...  
3  trip verified partner ba return flight tampa g...  
4  verified premium economy return flight los ang...  
Cleaned dataset saved successfully.


# Tokenization

In [None]:
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK data files
nltk.download('punkt')

# Define a function to tokenize text
def tokenize_text(text):
    return word_tokenize(text)

# Apply the tokenization function to the 'cleaned_text' column
df['tokenized_text'] = df['cleaned_text'].apply(tokenize_text)

# Display the first few rows to see the tokenized text
print(df.head())

# Save the tokenized dataset (optional; replace 'tokenized_dataset.csv' with your desired file path)
df.to_csv('tokenized_dataset.csv', index=False)

print("Tokenized dataset saved successfully.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


                                             reviews  \
0  ✅ Trip Verified |  Flew British Airways on BA ...   
1  ✅ Trip Verified |  BA cancelled the flight fro...   
2  ✅ Trip Verified | I strongly advise everyone t...   
3  ✅ Trip Verified | My partner and I were on the...   
4  Not Verified |  We had a Premium Economy retur...   

                                     normalized_text  \
0  trip verified flew british airway ba london he...   
1  trip verified ba cancelled flight tokyo lhr bo...   
2  trip verified strongly advise everyone never f...   
3  trip verified partner ba return flight tampa g...   
4  verified premium economy return flight los ang...   

                                        cleaned_text  \
0  trip verified flew british airway ba london he...   
1  trip verified ba cancelled flight tokyo lhr bo...   
2  trip verified strongly advise everyone never f...   
3  trip verified partner ba return flight tampa g...   
4  verified premium economy return flight los 

# Stemming and Lemmatization

In [None]:
!pip install nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 76.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import spacy

# Download necessary NLTK data
nltk.download('stopwords')

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Define the stemmer
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Function to perform stemming (stop words are skipped)
def stem_text(text):
    tokens = text.split()
    stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(stemmed_tokens)

# Function to perform lemmatization
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc if token.text not in stop_words]
    return ' '.join(lemmatized_tokens)

In [None]:
# Apply stemming
df['stemmed_text'] = df['cleaned_text'].apply(stem_text)


In [None]:
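# Apply lemmatization to the stemmed text.
# Note: lemmatizers generally expect unstemmed words, so stems such as "verifi" pass through unchanged.
df['lemmatized_text'] = df['stemmed_text'].apply(lemmatize_text)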

# Display the results
print(df['lemmatized_text'])

0       trip verifi fly british airway ba london heath...
1       trip verifi ba cancel flight tokyo lhr book ne...
2       trip verifi strongli advis everyon never fli b...
3       trip verifi partner ba return flight tampa gat...
4       verifi premium economi return flight lo angel ...
                              ...                        
3804    busi lhr bkk first tri back ba year fly mani a...
3805    lhr ham purser address club passeng name board...
3806    son work british airway urg fli british airway...
3807    london citynew york jfk via shannon realli nic...
3808    sinlhr ba b first class old aircraft seat priv...
Name: lemmatized_text, Length: 3809, dtype: object


# Remove Special Characters and Extra Whitespace

Eliminate special characters, emojis, URLs, and excessive whitespace to clean the text further.


In [None]:
import re

# Define a function to clean text data
def clean_text(text):
    # Remove special characters
    text = re.sub(r'[^A-Za-z0-9\s]+', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'lemmatized_text' column
df['lemmatized_text'] = df['lemmatized_text'].apply(clean_text)

# Optional: Display the cleaned column to verify
print(df['lemmatized_text'])

# Save the cleaned dataset (optional; replace 'cleaned_dataset.csv' with your desired file path)
df.to_csv('cleaned_dataset.csv', index=False)

print("Cleaned dataset saved successfully.")

0       trip verifi fly british airway ba london heath...
1       trip verifi ba cancel flight tokyo lhr book ne...
2       trip verifi strongli advis everyon never fli b...
3       trip verifi partner ba return flight tampa gat...
4       verifi premium economi return flight lo angel ...
                              ...                        
3804    busi lhr bkk first tri back ba year fly mani a...
3805    lhr ham purser address club passeng name board...
3806    son work british airway urg fli british airway...
3807    london citynew york jfk via shannon realli nic...
3808    sinlhr ba b first class old aircraft seat priv...
Name: lemmatized_text, Length: 3809, dtype: object
Cleaned dataset saved successfully.
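The cleaning step described in this section mentions URLs and emojis as well as special characters. The sketch below shows one way to handle all of them on raw text: URLs are removed first (otherwise stripping punctuation would leave fragments such as "httpsexamplecom"), and removed characters are replaced with a space so that neighbouring words are not glued together. The function name `clean_raw_text` and the URL pattern are assumptions for illustration, not a complete URL matcher.

```python
import re

# Simple URL pattern (assumption: links start with http://, https:// or www.)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Anything that is not an ASCII letter, digit or whitespace (this covers emojis too)
NON_ALNUM_PATTERN = re.compile(r"[^A-Za-z0-9\s]+")

def clean_raw_text(text):
    """Remove URLs, special characters and emojis, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ALNUM_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_raw_text("Great ✈️ flight! see https://example.com/x  now"))
print(clean_raw_text("London City-New York JFK"))
```

Replacing with a space rather than an empty string keeps "City-New" as two words instead of merging them into "CityNew".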


# Handle Negations

In [None]:
# Define a list of common negation words
negation_words = set(["not", "no", "never", "n't"])

# Function to handle negations
def handle_negations(text):
    words = text.split()
    negated_text = []
    negation = False
    for word in words:
        word = word.lower()
        # Check if the word is a negation
        if word in negation_words:
            negation = True
            negated_text.append(word)
        elif negation:
            # Prefix the word with "NOT_" if negation is active
            negated_text.append("NOT_" + word)
            negation = False
        else:
            negated_text.append(word)
    return ' '.join(negated_text)

# Apply the negation handling function to the DataFrame
df['text_cleaned'] = df['lemmatized_text'].apply(handle_negations)

# Display the cleaned text
print(df[['lemmatized_text', 'text_cleaned']])

                                        lemmatized_text  \
0     trip verifi fly british airway ba london heath...   
1     trip verifi ba cancel flight tokyo lhr book ne...   
2     trip verifi strongli advis everyon never fli b...   
3     trip verifi partner ba return flight tampa gat...   
4     verifi premium economi return flight lo angel ...   
...                                                 ...   
3804  busi lhr bkk first tri back ba year fly mani a...   
3805  lhr ham purser address club passeng name board...   
3806  son work british airway urg fli british airway...   
3807  london citynew york jfk via shannon realli nic...   
3808  sinlhr ba b first class old aircraft seat priv...   

                                           text_cleaned  
0     trip verifi fly british airway ba london heath...  
1     trip verifi ba cancel flight tokyo lhr book ne...  
2     trip verifi strongli advis everyon never NOT_f...  
3     trip verifi partner ba return flight tampa gat...  
4
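Negation marking is sensitive to the order of the preprocessing steps. NLTK's English stop-word list includes words such as "not" and "no", and punctuation removal turns "didn't" into "didnt", so most negations are already gone by the time the text reaches this stage. The sketch below marks negations on raw text instead. It uses only the standard library; the name `mark_negations` and the one-word negation scope are assumptions chosen to mirror the cell above.

```python
import re

NEGATIONS = {"not", "no", "never"}

def mark_negations(text):
    """Prefix the word that follows a negation with NOT_ (one-word scope)."""
    # Normalize typographic apostrophes so "didn’t" is treated like "didn't"
    text = text.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", text)
    marked = []
    negate_next = False
    for token in tokens:
        if token in NEGATIONS or token.endswith("n't"):
            marked.append(token)
            negate_next = True
        elif negate_next:
            marked.append("NOT_" + token)
            negate_next = False
        else:
            marked.append(token)
    return " ".join(marked)

print(mark_negations("The crew was not helpful and I didn’t enjoy it"))
```

Running this before stop-word removal keeps the negation cue attached to the word it modifies.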

# Feature Extraction

Bag of Words (BoW)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
bow_features = vectorizer.fit_transform(df['text_cleaned'])

# Convert to DataFrame for better visualization
bow_df = pd.DataFrame(bow_features.toarray(), columns=vectorizer.get_feature_names_out())

print("Bag of Words features:")
print(bow_df)

Bag of Words features:
      aa  aacx  ab  aback  abandon  abandondon  abba  abbrevi  abc  abd  ...  \
0      0     0   0      0        0           0     0        0    0    0  ...   
1      0     0   0      0        0           0     0        0    0    0  ...   
2      0     0   0      0        0           0     0        0    0    0  ...   
3      0     0   0      0        0           0     0        0    0    0  ...   
4      0     0   0      0        0           0     0        0    0    0  ...   
...   ..   ...  ..    ...      ...         ...   ...      ...  ...  ...  ...   
3804   0     0   0      0        0           0     0        0    0    0  ...   
3805   0     0   0      0        0           0     0        0    0    0  ...   
3806   0     0   0      0        0           0     0        0    0    0  ...   
3807   0     0   0      0        0           0     0        0    0    0  ...   
3808   0     0   0      0        0           0     0        0    0    0  ...   

      zombi  zon

Term Frequency-Inverse Document Frequency (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_features = tfidf_vectorizer.fit_transform(df['text_cleaned'])

# Convert to DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("TF-IDF features:")
print(tfidf_df)

TF-IDF features:
       aa  aacx   ab  aback  abandon  abandondon  abba  abbrevi  abc  abd  \
0     0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
1     0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
2     0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
3     0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
4     0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
...   ...   ...  ...    ...      ...         ...   ...      ...  ...  ...   
3804  0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
3805  0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
3806  0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
3807  0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   
3808  0.0   0.0  0.0    0.0      0.0         0.0   0.0      0.0  0.0  0.0   

      ...  zombi  zone  zoo  zrh  zrhlhr  zrich  zuletzt  
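To make the TF-IDF weights less of a black box, here is a small pure-Python sketch of the calculation. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which to my understanding matches the scikit-learn defaults. It tokenizes on whitespace, which differs slightly from `TfidfVectorizer`'s default token pattern. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(["flight late", "flight good"])
print(rows[0])
```

In the first document, "late" receives a higher weight than "flight" because "flight" appears in every document and therefore carries less information.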

# Sentiment Lexicons

In [None]:
!pip install afinn

Collecting afinn
  Downloading afinn-0.1.tar.gz (52 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.6/52.6 kB 1.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: afinn
  Building wheel for afinn (setup.py) ... done
  Created wheel for afinn: filename=afinn-0.1-py3-none-any.whl size=53430 sha256=aabd993ea876d27413ee0d067bd856bb634247d7bc06bef959b330abd93b67b9
  Stored in directory: /root/.cache/pip/wheels/b0/05/90/43f79196199a138fb486902fceca30a2d1b5228e6d2db8eb90
Successfully built afinn
Installing collected packages: afinn
Successfully installed afinn-0.1


In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Text preprocessing
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())  # Convert to lowercase
    # Remove stopwords and punctuation.
    # Note: any token containing "_" (such as the NOT_ negation markers) fails isalnum() and is dropped.
    cleaned_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    return ' '.join(cleaned_tokens)

df['text_cleaned'] = df['text_cleaned'].apply(preprocess_text)

# Apply a sentiment lexicon (here: AFINN)
from afinn import Afinn

afinn = Afinn()

def get_sentiment_score(text):
    return afinn.score(text)

df['sentiment_score'] = df['text_cleaned'].apply(get_sentiment_score)

# Save the cleaned dataset
df.to_csv('cleaned_dataset_with_sentiment.csv', index=False)

print("Cleaned dataset with sentiment scores saved successfully.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Cleaned dataset with sentiment scores saved successfully.


# Split the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

# The target is the numeric AFINN score in 'sentiment_score'.
# When stratifying, every distinct score value is treated as its own class.

# Step 1: Split the dataset into features (X) and target (y)
X = df.drop(columns=['sentiment_score'])  # Features
y = df['sentiment_score']  # Target variable

# Stratified splitting needs at least two samples per class,
# so find the score values that occur only once
class_counts = y.value_counts()
rare_classes = class_counts[class_counts == 1].index

# Remove samples whose score occurs only once
if not rare_classes.empty:
    print("Removing samples with rare classes:", rare_classes.tolist())
    df_filtered = df[~y.isin(rare_classes)]
    X = df_filtered.drop(columns=['sentiment_score'])
    y = df_filtered['sentiment_score']

# Step 2: Split the dataset into training and testing sets, maintaining the distribution of scores
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


# Step 3: Display the shape of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Removing samples with rare classes: [53.0, -26.0, 54.0, 43.0, 35.0, 34.0, 30.0, 41.0]
Training set shape: (3040, 7) (3040,)
Testing set shape: (761, 7) (761,)
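The rare classes reported above arise because each distinct AFINN score is treated as a separate class. One alternative, sketched here under the assumption that a three-way label is enough for the analysis, is to bin the scores by sign before splitting. The cut-off at zero and the function name `score_to_label` are illustrative choices, and the example scores are made up except for the two extremes taken from the output above.

```python
from collections import Counter

def score_to_label(score):
    """Map a numeric lexicon score to a coarse label (assumed cut-off at 0)."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

# Illustrative scores only; 53.0 and -26.0 are two of the rare values listed above
example_scores = [53.0, -26.0, 4.0, 0.0, -3.0, 12.0]
labels = [score_to_label(score) for score in example_scores]
print(Counter(labels))
```

With only three labels, every class has many members, so a stratified split no longer requires dropping rows with unusual scores.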


In [None]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon
nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Function to get a sentiment label
def get_sentiment_label(text):
    # Calculate sentiment scores
    scores = sid.polarity_scores(text)
    # Determine sentiment label based on compound score
    if scores['compound'] >= 0.05:
        return 'Positive'
    elif scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply sentiment analysis to each row in the dataset.
# Note: this overwrites the 'text_cleaned' column with the sentiment label.
df['text_cleaned'] = df['text_cleaned'].apply(get_sentiment_label)

# Display the results
print(df['text_cleaned'])

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


0       Positive
1       Negative
2       Negative
3       Positive
4       Negative
          ...   
3804    Positive
3805    Positive
3806    Positive
3807    Positive
3808    Positive
Name: text_cleaned, Length: 3809, dtype: object


In [None]:
# Count the number of positive, negative, and neutral sentiments
sentiment_counts = df['text_cleaned'].value_counts()

# Calculate the percentage of each sentiment
total_counts = len(df['text_cleaned'])
sentiment_percentages = (sentiment_counts / total_counts) * 100

# Display the sentiment counts and percentages
print(sentiment_counts)
print(sentiment_percentages)

text_cleaned
Positive    2560
Negative    1163
Neutral       86
Name: count, dtype: int64
text_cleaned
Positive    67.209241
Negative    30.532948
Neutral      2.257810
Name: count, dtype: float64
