# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. Now we can use Python and `BeautifulSoup` to step through the paginated review listings and collect the text of each review.
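Before scraping the live site, it can help to see the parsing step in isolation. The sketch below runs `BeautifulSoup` on a small inline HTML string; `SAMPLE_HTML` and `extract_reviews` are hypothetical names, and the markup is a simplified stand-in that only assumes each review sits in a `div` with the class `text_content`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a review listing page (not real site markup).
SAMPLE_HTML = """
<html><body>
  <article>
    <div class="text_content">✅ Trip Verified | Smooth flight and friendly crew.</div>
  </article>
  <article>
    <div class="text_content">Not Verified | Long delay at the gate.</div>
  </article>
  <div class="sidebar">Not a review</div>
</body></html>
"""


def extract_reviews(html):
    """Return the text of every <div class="text_content"> found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="text_content")]


sample_reviews = extract_reviews(SAMPLE_HTML)
for review in sample_reviews:
    print(review)
```

Only the two `text_content` blocks are returned; the sidebar `div` is ignored because its class does not match.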

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [5]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [6]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

|   | reviews |
|---|---|
| 0 | ✅ Trip Verified \| Although transferring to thi... |
| 1 | ✅ Trip Verified \| We are extremely grateful ... |
| 2 | ✅ Trip Verified \| I had an appalling experie... |
| 3 | Not Verified \| Good points, the cabin crew, t... |
| 4 | Not Verified \| It was a decent flight, reason... |


In [8]:
import os

os.makedirs("data", exist_ok=True)
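# index=False avoids writing the row index, which would come back as an "Unnamed: 0" column
df.to_csv("data/BA_reviews.csv", index=False)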


Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating through the paginated pages on the website. If you want to collect more data, try increasing the number of pages.
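If you do raise the page count, one option is to stop as soon as a page comes back empty rather than guessing the total. The sketch below is a minimal illustration under assumptions: `build_page_url` and `scrape_until_empty` are hypothetical helpers, `fetch_page` stands in for the request-and-parse step, and a small dictionary of fake data replaces the website so the example runs offline.

```python
import time


def build_page_url(base_url, page, page_size=100):
    """Build a paginated URL in the same form the scraping loop uses."""
    return f"{base_url}/page/{page}/?sortby=post_date%3ADesc&pagesize={page_size}"


def scrape_until_empty(fetch_page, max_pages=50, delay_seconds=0.0):
    """Collect reviews page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable: page number -> list of review strings.
    """
    collected = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            break  # ran past the last page that has reviews
        collected.extend(page_reviews)
        time.sleep(delay_seconds)  # optional pause between requests
    return collected


# Stand-in fetcher with three pages of fake data, so the sketch runs offline.
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
all_reviews = scrape_until_empty(lambda page: fake_site.get(page, []))
print(len(all_reviews))
print(build_page_url("https://www.airlinequality.com/airline-reviews/british-airways", 2))
```

In a real run, `fetch_page` would wrap the `requests.get` and `BeautifulSoup` steps from the loop above, and a non-zero `delay_seconds` would space out the requests.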

The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it's not relevant to what we want to investigate.
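As a small worked example of this cleaning step, the sketch below strips the verification marker only when it appears at the very start of a review, so the same words occurring later in the text are left alone. `PREFIX_PATTERN` and `strip_verification_prefix` are hypothetical names, and the sample strings are illustrative rather than taken from the dataset.

```python
import re

# Matches an optional check mark, "Trip Verified" or "Not Verified", and the "|"
# separator, but only at the very start of the review text.
PREFIX_PATTERN = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_prefix(text):
    """Remove a leading verification marker and collapse runs of whitespace."""
    without_prefix = PREFIX_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_prefix).strip()


examples = [
    "✅ Trip Verified | Although transferring to this flight was easy...",
    "Not Verified |  Good points, the cabin crew.",
    "No marker here, but I was Not Verified at the lounge | odd.",
]
for raw in examples:
    print(strip_verification_prefix(raw))
```

The third example passes through unchanged, because the pattern is anchored to the beginning of the string.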

In [10]:
import pandas as pd
import re

# Load your raw reviews dataset
df = pd.read_csv('data/BA_reviews.csv')

# Function to clean review text
def clean_review(text):
    if pd.isnull(text):
        return ""
    # Remove verification text and "|"
    text = text.replace("✅ Trip Verified", "")
    text = text.replace("Not Verified", "")
    text = text.replace("|", "")
    # Replace multiple spaces and newlines with single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Apply cleaning to each row
df['reviews'] = df['reviews'].apply(clean_review)

# Drop any rows where reviews column is now empty
df = df[df['reviews'].str.strip() != ""]

# Save cleaned data
df.to_csv('data/BA_reviews_cleaned.csv', index=False)

print("✅ Data cleaned and saved to 'data/BA_reviews_cleaned.csv'")


✅ Data cleaned and saved to 'data/BA_reviews_cleaned.csv'


In [11]:
%pip install pandas nltk wordcloud matplotlib textblob




In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from textblob import TextBlob
import nltk

# Download NLTK data if you haven't already
nltk.download('punkt')

# Load the cleaned reviews saved by the cleaning cell above
df = pd.read_csv('data/BA_reviews_cleaned.csv')

# The verification markers were already removed during cleaning,
# so only coerce to string and trim stray whitespace here
reviews_clean = df['reviews'].astype(str).str.strip()

# Sentiment Analysis function
def get_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity  # Returns a float within [-1.0, 1.0]

# Apply sentiment analysis to all reviews
df['sentiment'] = reviews_clean.apply(get_sentiment)

# Print average sentiment score
print(f"Average Sentiment Polarity: {df['sentiment'].mean():.3f}")

# Plot histogram of sentiment scores
plt.figure(figsize=(8, 5))
plt.hist(df['sentiment'], bins=20, color='skyblue', edgecolor='black')
plt.title('Sentiment Polarity Distribution')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Number of Reviews')
plt.show()

# Combine all reviews text into one string for WordCloud
text_combined = " ".join(reviews_clean)

# Generate WordCloud
wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(text_combined)

# Plot WordCloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Reviews')
plt.show()
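A polarity score is easier to summarise once it is bucketed into coarse labels. The sketch below is one possible way to do that; `label_polarity` is a hypothetical helper, the `0.05` threshold is an illustrative assumption rather than a value defined by TextBlob, and the sample scores are made up.

```python
from collections import Counter


def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    The threshold is an illustrative assumption, not a TextBlob setting.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Made-up scores standing in for the values in df['sentiment'].
sample_scores = [0.42, -0.30, 0.01, 0.10, -0.04]
label_counts = Counter(label_polarity(score) for score in sample_scores)
print(label_counts)
```

On the real data, the same function could be applied with `df['sentiment'].apply(label_polarity)` to add a label column next to the scores.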
