# Web scraping

## Scraping data from Skytrax

Open https://www.airlinequality.com/airline-reviews/air-canada to see the Air Canada reviews published on Skytrax.
We can use Python and BeautifulSoup to request each page of the review listing and extract the title, reviewer details, review text, rating, and date of every review on that page.
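Before running the scraper against the live site, it can help to try the extraction logic on a small offline example. The sketch below is only an illustration: `SAMPLE_HTML` is hand-written markup that imitates one review block (the real page may differ and can change without notice), and `text_or_default` and `parse_review` are hypothetical helper names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML imitating a single review block.
# The tag and attribute names mirror the ones used by the scraper below.
SAMPLE_HTML = """
<article itemprop="review">
  <h2 class="text_header">sample title</h2>
  <h3 class="text_sub_header">A Reviewer (Canada) <time>1st January 2025</time></h3>
  <div itemprop="reviewRating"><span itemprop="ratingValue">7</span>/<span itemprop="bestRating">10</span></div>
  <div class="text_content">Sample review text.</div>
</article>
"""


def text_or_default(tag, default=""):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_review(article):
    """Extract the fields of one review <article> into a dict."""
    rating_tag = article.find("span", itemprop="ratingValue")
    return {
        "date": text_or_default(article.find("time"), "No Date"),
        "title": text_or_default(article.find("h2", class_="text_header")),
        "content": text_or_default(article.find("div", class_="text_content")),
        "rating": text_or_default(rating_tag, None),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
parsed = [parse_review(a) for a in soup.find_all("article", {"itemprop": "review"})]
print(parsed)
```

Because `find()` returns `None` when a tag is absent, routing every lookup through one helper keeps a single malformed review from stopping the whole run.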

In [3]:
import csv
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time

In [34]:
# Base URL for Air Canada reviews
url = "https://www.airlinequality.com/airline-reviews/air-canada"

# Initialize list to store review data
data = []

In [35]:
# Pagination settings
page_num = 1
max_pages = 20
page_size = 100  # Number of reviews per page

while page_num <= max_pages:
    paginated_url = f"{url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize={page_size}"

    print(f"🔍 Scraping page {page_num}...")

    # Send HTTP request
    response = requests.get(paginated_url, headers={'User-Agent': 'Mozilla/5.0'})

    # Check if request is successful
    if response.status_code != 200:
        print(f"❌ Failed to retrieve page {page_num}")
        break
    # Parse HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all review articles
    reviews = soup.find_all("article", {"itemprop": "review"})

    for review in reviews:
        title = review.find("h2", class_="text_header").text.replace("\n", " ")
        sub_header = review.find("h3", class_="text_sub_header").text.replace("\n", " ")
        content = review.find("div", class_="text_content").text.replace("\n", " ")

        # Extract the rating from the itemprop="ratingValue" span.
        # Reset it for every review so that a missing rating does not reuse
        # the previous review's value (or raise NameError on the first one).
        rating = None
        rating_tag = review.find("div", itemprop="reviewRating")
        if rating_tag:
            rating_value = rating_tag.find("span", itemprop="ratingValue")
            if rating_value:
                rating = rating_value.text.strip()

        # Extract date
        time_tag = review.find("time")
        date = time_tag.text.strip() if time_tag else "No Date"

        # Store data
        data.append([date, title, sub_header, content, rating])

    print(f"✅ Collected {len(data)} total reviews so far...")
    # Move to next page
    page_num += 1

    # Pause to avoid getting blocked
    time.sleep(2)

🔍 Scraping page 1...
✅ Collected 100 total reviews so far...
🔍 Scraping page 2...
✅ Collected 200 total reviews so far...
🔍 Scraping page 3...
✅ Collected 300 total reviews so far...
🔍 Scraping page 4...
✅ Collected 400 total reviews so far...
🔍 Scraping page 5...
✅ Collected 500 total reviews so far...
🔍 Scraping page 6...
✅ Collected 600 total reviews so far...
🔍 Scraping page 7...
✅ Collected 700 total reviews so far...
🔍 Scraping page 8...
✅ Collected 800 total reviews so far...
🔍 Scraping page 9...
✅ Collected 900 total reviews so far...
🔍 Scraping page 10...
✅ Collected 1000 total reviews so far...
🔍 Scraping page 11...
✅ Collected 1100 total reviews so far...
🔍 Scraping page 12...
✅ Collected 1200 total reviews so far...
🔍 Scraping page 13...
✅ Collected 1300 total reviews so far...
🔍 Scraping page 14...
✅ Collected 1400 total reviews so far...
🔍 Scraping page 15...
✅ Collected 1500 total reviews so far...
🔍 Scraping page 16...
✅ Collected 1600 total reviews so far...
🔍 Scraping

In [37]:
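# Build a DataFrame from the collected rows
df = pd.DataFrame(data, columns=["Date", "Overall", "Reviewer Info", "Review", "Rating"])
# Strip the "✅ Trip Verified |" and "✅ Verified Review |" prefixes from every cell
# (a "Not Verified |" prefix is not covered by these patterns and is left in place)
df.replace(re.compile(r'\s*✅ Trip Verified \|\s*'), '', inplace=True)
df.replace(re.compile(r'\s*✅ Verified Review \|\s*'), '', inplace=True)
df.head(20)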

Unnamed: 0,Date,Overall,Reviewer Info,Review,Rating
0,23rd January 2025,"""no help or options given""",R Heale (Australia) 23rd January 2025,"Delayed hours, boarded the plane and then wait...",1
1,21st January 2025,"""always had a good experience""",T Chrones (Canada) 21st January 2025,Not Verified | Have flown with Air Canada man...,10
2,17th January 2025,“Now they are refusing to give the refund!”,Cinimol Nair (Canada) 17th January 2025,I booked a bulkhead seat online so that I coul...,1
3,16th January 2025,"""experience at check-in one of the worst""\r",M Erinaj (Canada) 16th January 2025,My husband and I had an extremely distressing ...,1
4,16th January 2025,“What I can truly call a win win”,Paul Tyrrell (Ireland) 16th January 2025,I flew with them last August. There was a prob...,10
5,22nd December 2024,"""nearly 1.5hrs to serve drinks""",E Han (Singapore) 22nd December 2024,"Star alliance gold traveller, flew on many Sta...",1
6,21st December 2024,"""everyone was rushed""",T Gayan (Canada) 21st December 2024,"I've flown air Canada many times, its better t...",2
7,19th December 2024,"""nothing short of a disaster""",Sariah DeGarmo (United States) 19th December...,Not Verified | I recently flew with Air Canad...,1
8,15th December 2024,"""business class service is outstanding""",Steve Pilibbossian (Canada) 15th December 2024,I absolutely love Air Canada. Their internatio...,10
9,15th December 2024,"""I am deeply disappointed""",E Doyle (United States) 15th December 2024,One of the flight attendants servicing the bac...,1
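
The output above still shows a `Not Verified |` prefix on some reviews, because only the two `✅ ...` prefixes are removed. A minimal sketch of one way to strip all three variants with a single pattern follows. It assumes the marker always appears at the start of the text; `strip_verification_prefix` is a hypothetical helper name and the sample strings are made up for illustration.

```python
import re

# Assumed prefix variants: the two handled in the cleaning step plus "Not Verified".
PREFIX_PATTERN = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification marker such as 'Not Verified | '."""
    return PREFIX_PATTERN.sub("", text)


# Made-up sample strings, not taken from the scraped data
samples = [
    "✅ Trip Verified | Flight was on time.",
    "Not Verified |  Have flown many times.",
    "No prefix here.",
]
cleaned = [strip_verification_prefix(s) for s in samples]
print(cleaned)
```

On the DataFrame, something like `df["Review"] = df["Review"].map(strip_verification_prefix)` would apply the same cleanup to the review text only, provided every value in that column is a string.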


In [44]:
# Save to csv
df.to_csv("air_canada_reviews.csv")

In [41]:
# Keep only the review text for sentiment analysis
sentiment_analysis_df = df.drop(["Date", "Overall", "Reviewer Info", "Rating"], axis=1)
sentiment_analysis_df

Unnamed: 0,Review
0,"Delayed hours, boarded the plane and then wait..."
1,Not Verified | Have flown with Air Canada man...
2,I booked a bulkhead seat online so that I coul...
3,My husband and I had an extremely distressing ...
4,I flew with them last August. There was a prob...
...,...
1995,I went out of my way to take the new Air Canad...
1996,Even on this very short 65 min flight from New...
1997,"As a UK expat, I've done this route (YVR -LHR)..."
1998,Toronto to Vancouver and back. Outbound I had ...


In [45]:
# Save to csv
sentiment_analysis_df.to_csv("sentiment_analysis.csv")