# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit <https://www.airlinequality.com>, you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to <https://www.airlinequality.com/airline-reviews/british-airways>, you will see this data. We can use Python and `BeautifulSoup` to step through the paginated review listing and collect the review text, star rating, date, and country from each page.

In [30]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests 

In [31]:
# Base URL
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"

# Number of pages to scrape
pages = 10  # 10 pages x 100 reviews per page = up to 1000 reviews

# Create empty lists to collect data
reviews = []
stars = []
date = []
country = []

# Loop through each page
for i in range(1, pages + 1):
    # Make a GET request to fetch the raw HTML content
    page = requests.get(f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize=100")
    
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(page.content, "html.parser")
    
    # Extract reviews
    for item in soup.find_all("div", {"class": "text_content"}):
        reviews.append(item.get_text())
    
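    # Extract stars (this matches every "rating-10" div on the page,
    # including any that sit outside an individual review)
    for item in soup.find_all("div", class_="rating-10"):
        try:
            stars.append(item.span.text.strip())
        except AttributeError:
            print(f"Error on page {i} for stars")
            stars.append(None)  # record a real missing value, not the string "None"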
            
    # Extract date
    for item in soup.find_all("time"):
        date.append(item.text.strip())
        
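    # Extract country (the text in parentheses after the reviewer's name)
    for item in soup.find_all("h3"):
        try:
            country.append(item.span.next_sibling.text.strip(" ()"))
        except AttributeError:
            print(f"Error on page {i} for country")
            country.append(None)  # record a real missing value, not the string "None"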

# Print the length of the collected lists to verify
print(f"Collected {len(reviews)} reviews")
print(f"Collected {len(stars)} star ratings")
print(f"Collected {len(date)} dates")
print(f"Collected {len(country)} countries")


Collected 1000 reviews
Collected 1010 star ratings
Collected 1000 dates
Collected 1000 countries
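
The counts above differ: 1010 star ratings against 1000 reviews, which is ten extra ratings over ten pages. One possible explanation (an assumption, since the page markup is not shown here) is that each page contains one `rating-10` element that does not belong to any review. When fields are collected in separate page-wide loops, a single extra or missing element shifts every later value in that list. A minimal sketch of an alternative is to loop over one container per review and read all fields from inside it. The HTML below, the `article` container with class `review`, and the function name `parse_reviews` are all hypothetical and would need to be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one page-level rating followed by two review containers.
# The real page structure is an assumption and should be inspected in the browser.
html = """
<div class="rating-10"><span>5</span></div>
<article class="review">
  <div class="rating-10"><span>1</span></div>
  <h3><span>A Smith</span> (United States)</h3>
  <time>19th August 2023</time>
  <div class="text_content">First review text</div>
</article>
<article class="review">
  <h3><span>B Jones</span> (United Kingdom)</h3>
  <time>9th July 2023</time>
  <div class="text_content">Second review text</div>
</article>
"""


def parse_reviews(html_text):
    """Return one dict per review container so the fields cannot drift apart."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for article in soup.find_all("article", class_="review"):
        rating = article.find("div", class_="rating-10")
        body = article.find("div", class_="text_content")
        when = article.find("time")
        header = article.find("h3")

        # The country is the text in parentheses right after the name span.
        country = None
        if header is not None and header.span is not None:
            sibling = header.span.next_sibling
            if sibling is not None:
                country = str(sibling).strip(" ()\n")

        has_rating = rating is not None and rating.span is not None
        rows.append({
            "Review": body.get_text(strip=True) if body is not None else None,
            "Stars": rating.span.get_text(strip=True) if has_rating else None,
            "Date": when.get_text(strip=True) if when is not None else None,
            "Country": country,
        })
    return rows


rows = parse_reviews(html)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`. A review with no rating gets a missing value in its own row instead of shifting the ratings of the reviews after it.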


In [32]:
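# Truncate every list to the shortest length so the columns are the same size.
# Note: this does not realign values if one list holds extra entries
# (above, 1010 star ratings were collected for 1000 reviews).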
min_length = min(len(reviews), len(stars), len(date), len(country))

# Create a DataFrame from the collected data
df = pd.DataFrame({
    'Review': reviews[:min_length],
    'Stars': stars[:min_length],
    'Date': date[:min_length],
    'Country': country[:min_length]
})
df.sample()

|  | Review | Stars | Date | Country |
|---|---|---|---|---|
| 190 | ✅ Trip Verified \| My family flew from Washing... | 1 | 19th August 2023 | United States |


In [33]:
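# index=False keeps the row index out of the file, so reloading it
# does not create an extra "Unnamed: 0" column
df.to_csv("BA_reviews.csv", index=False)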

Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating over 10 result pages of 100 reviews each. To collect more data, increase the value of `pages`.
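
When scraping more pages, it can help to set a request timeout, stop on HTTP errors, and pause between requests. The following is a minimal sketch under those assumptions, not part of the original notebook. The helper names `page_url` and `fetch_pages` and the default delay and timeout values are hypothetical choices; the URL pattern is the one used in the scraping cell above.

```python
import time

import requests


def page_url(base_url, page, page_size=100):
    """Build the URL for one page of results, newest reviews first."""
    return f"{base_url}/page/{page}/?sortby=post_date%3ADesc&pagesize={page_size}"


def fetch_pages(base_url, pages, delay_seconds=1.0, timeout_seconds=30):
    """Yield (page_number, html_bytes) for each page, pausing between requests."""
    for page in range(1, pages + 1):
        response = requests.get(page_url(base_url, page), timeout=timeout_seconds)
        # Stop on HTTP errors instead of silently parsing an error page.
        response.raise_for_status()
        yield page, response.content
        # Assumed courtesy delay so the site is not hit in a tight loop.
        time.sleep(delay_seconds)
```

Each yielded `html_bytes` value can be passed to `BeautifulSoup(html_bytes, "html.parser")` exactly as `page.content` is in the loop above.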

The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed wherever it appears, as it is not relevant to what we want to investigate.
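
One way to do this cleaning is with a regular expression that strips a leading verification label. This is a sketch, not a definitive cleaning routine: the pattern covers only the three labels visible in this notebook's sample output ("Trip Verified", "Verified Review", "Not Verified"), and the names `VERIFICATION_PREFIX` and `clean_review` are hypothetical.

```python
import re

# Matches an optional leading verification label followed by a pipe, e.g.
# "✅ Trip Verified | ", "✅ Verified Review | " or "Not Verified | ".
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def clean_review(text):
    """Strip the verification label and surrounding whitespace from one review."""
    return VERIFICATION_PREFIX.sub("", text).strip()


samples = [
    "✅ Trip Verified | My family flew",
    "Not Verified | \r\nMiami to London Heathrow",
    "CPH-LHR-CPH October 2014.",
]
cleaned = [clean_review(sample) for sample in samples]
print(cleaned)
```

To apply it to the dataset, the function can be mapped over the column, for example `df["Review"] = df["Review"].map(clean_review)`. Reviews without a label pass through unchanged apart from trimmed whitespace.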

In [29]:
df.sample(20)

|  | Review | Stars | Date | Country |
|---|---|---|---|---|
| 1618 | ✅ Verified Review \| London Heathrow to Paris ... | 4 | 17th August 2017 | United Kingdom |
| 1538 | ✅ Verified Review \| Bari to Gatwick. More of ... | 2 | 23rd October 2017 | United Kingdom |
| 3389 | CPH-LHR-CPH October 2014. Air travel just keep... | 9 | 25th November 2014 | Denmark |
| 1018 | Not Verified \| \r\nMiami to London Heathrow w... | 2 | 30th April 2019 | United Kingdom |
| 3175 | I usually have very positive experiences when ... | 7 | 6th April 2015 | United Kingdom |
| 2184 | ✅ Verified Review \| Flew London Heathrow to L... | 3 | 5th October 2016 | Portugal |
| 1417 | ✅ Trip Verified \| I had booked business class ... | 4 | 5th February 2018 | United Kingdom |
| 421 | ✅ Trip Verified \| The check-in process was smo... | 10 | 25th September 2022 | United States |
| 217 | ✅ Trip Verified \| My family and I have flown ... | 9 | 9th July 2023 | United Kingdom |
| 1921 | ✅ Verified Review \| Istanbul to London Heathr... | 5 | 23rd February 2017 | Canada |
