# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes code to get started with web scraping. We will use the `BeautifulSoup` package to collect data from the web. Once we've collected the data and saved it to a local `.csv` file, we can start our analysis.

### Scraping data from Skytrax

If we visit https://www.airlinequality.com, we can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways itself.

If we navigate to https://www.airlinequality.com/airline-reviews/british-airways, we will see this data. We can then use Python and `BeautifulSoup` to page through the review listings and collect the text of each review.

We will also collect the rating, date, and reviewer country shown with each review.

In [1]:
# import necessary libraries

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

In [2]:
reviews = []
stars = []
dates = []
countries = []

# Loop through 365 pages (10 reviews per page)
for i in range(1, 366):
    page = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/")

    soup = BeautifulSoup(page.content, "html.parser")

    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text)

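    # Each rating is read from the first <span> inside a "rating-10" div
    for item in soup.find_all("div", class_="rating-10"):
        try:
            stars.append(item.span.text)
        except AttributeError:
            # The div has no <span> child, so record a placeholder instead
            print(f"Error on page {i}")
            stars.append("None")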

    # Extract date and country information
    for item in soup.find_all("time"):
        dates.append(item.text)

    for item in soup.find_all("h3"):
        country_text = item.span.next_sibling.text.strip(" ()")
        countries.append(country_text)

Error on page 307
Error on page 308
Error on page 320
Error on page 346
Error on page 349
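
The loop above calls `requests.get` without a timeout or a status check, so a failed or blocked request would be parsed as if it were a normal review page. A minimal sketch of a more defensive fetch step follows. The names `page_url` and `fetch_pages`, the one-second delay, and the ten-second timeout are illustrative assumptions, not part of the original notebook. The function accepts any session-like object with a `get` method, such as a `requests.Session()`.

```python
import time

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def page_url(page_number):
    """Build the listing URL for one page of reviews."""
    return f"{BASE_URL}/page/{page_number}/"


def fetch_pages(session, first, last, delay_seconds=1.0, timeout=10):
    """Yield (page_number, html_bytes) for pages first..last inclusive.

    `session` is assumed to behave like requests.Session: its get() returns
    an object with raise_for_status() and a .content attribute.
    """
    for page_number in range(first, last + 1):
        response = session.get(page_url(page_number), timeout=timeout)
        # Stop on an HTTP error instead of parsing an error page
        response.raise_for_status()
        yield page_number, response.content
        # Pause between requests to avoid hammering the site
        time.sleep(delay_seconds)
```

With `requests` imported, `for i, html in fetch_pages(requests.Session(), 1, 365): soup = BeautifulSoup(html, "html.parser")` could stand in for the `requests.get` call in the loop.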


In [3]:
# Check length of the extracted columns
print(f"Length of reviews: {len(reviews)}")
print(f"Length of countries: {len(countries)}")
print(f"Length of stars: {len(stars)}")
print(f"Length of dates: {len(dates)}")

Length of reviews: 3644
Length of countries: 3644
Length of stars: 4009
Length of dates: 3644


In [4]:
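# Trim the stars list to match the length of the other lists.
# Note: slicing only drops trailing entries; it does not realign ratings
# with reviews if the extra matches are spread across pages.
stars = stars[:len(reviews)]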
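
The `stars` list is 365 entries longer than the other lists, which equals the number of pages scraped. One plausible explanation (an assumption, not verified here) is that each page carries one extra `rating-10` element outside the reviews, in which case trimming the tail would leave ratings misaligned with their reviews. A sketch of one way to keep the fields aligned is to parse each review container as a unit. The `article` selector with `itemprop="review"`, the helper names, and the sample markup are assumptions for illustration; the real page structure should be checked before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one page-level rating followed by two
# reviews, the second of which has no rating. The real page may differ.
SAMPLE_HTML = """
<div class="rating-10"><span>5</span></div>
<article itemprop="review">
  <div class="rating-10"><span>1</span></div>
  <h3><span>A Reviewer</span> (United Kingdom) <time>4th September 2023</time></h3>
  <div class="text_content">Flight was delayed.</div>
</article>
<article itemprop="review">
  <h3><span>B Reviewer</span> (Iceland) <time>4th September 2023</time></h3>
  <div class="text_content">No rating given.</div>
</article>
"""


def text_or_none(tag):
    """Return a tag's stripped text, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_reviews(html):
    """Return one dict per review so every field stays on the same row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for article in soup.find_all("article", attrs={"itemprop": "review"}):
        rating_div = article.find("div", class_="rating-10")
        header = article.find("h3")
        country = None
        if header is not None and header.span is not None:
            sibling = header.span.next_sibling
            if sibling is not None:
                # The country appears as "(Country)" right after the name
                country = str(sibling).strip(" ()") or None
        records.append({
            "Review": text_or_none(article.find("div", class_="text_content")),
            "Stars": text_or_none(rating_div.span if rating_div is not None else None),
            "Country": country,
            "Date": text_or_none(article.find("time")),
        })
    return records


records = parse_reviews(SAMPLE_HTML)
```

Because each record is built from a single container, a missing rating becomes `None` on that row rather than shifting every later rating, and the page-level rating outside the containers is never picked up.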

In [5]:
# Create a DataFrame from the collected lists. First, build a dictionary from the lists
data = {
    'Review': reviews,
    'Stars': stars,
    'countries': countries,
    'Date': dates
}

# Create a DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

                                                 Review  \
0     ✅ Trip Verified |  4/4 flights we booked this ...   
1     ✅ Trip Verified |  British Airways has a total...   
2     ✅ Trip Verified | London Heathrow to Keflavik,...   
3     ✅ Trip Verified | Mumbai to London Heathrow in...   
4     ✅ Trip Verified |  Care and support shocking. ...   
...                                                 ...   
3639  This was a bmi Regional operated flight on a R...   
3640  LHR to HAM. Purser addresses all club passenge...   
3641  My son who had worked for British Airways urge...   
3642  London City-New York JFK via Shannon on A318 b...   
3643  SIN-LHR BA12 B747-436 First Class. Old aircraf...   

                              Stars       countries                Date  
0     \n\t\t\t\t\t\t\t\t\t\t\t\t\t5         Germany  6th September 2023  
1                                 1  United Kingdom  4th September 2023  
2                                 1         Iceland  4th September 20
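
The output above shows that some reviews begin with a verification prefix such as `✅ Trip Verified |`, while older ones have no prefix. For text analysis it may help to separate that prefix from the review body. The helper below is a sketch; the function name `split_verification` is hypothetical, and it assumes that any prefix of interest contains the word "Verified" before the first `|`.

```python
import re


def split_verification(review):
    """Split a review into (status, body).

    status is the prefix text before the first "|" (letters only), or None
    when the review has no verification prefix.
    """
    head, sep, tail = review.partition("|")
    if sep and "verified" in head.lower():
        # Keep only the words, dropping the check mark and extra spaces
        status = " ".join(re.findall(r"[A-Za-z]+", head))
        return status, tail.strip()
    return None, review.strip()
```

Applied to a column, this could look like `df["Review"].map(split_verification)`, which yields one `(status, body)` tuple per row.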

In [6]:
# Display the first few rows of the DataFrame (default is the first 5 rows)
print("First few rows of the DataFrame:")
print(df.head())

First few rows of the DataFrame:
                                              Review  \
0  ✅ Trip Verified |  4/4 flights we booked this ...   
1  ✅ Trip Verified |  British Airways has a total...   
2  ✅ Trip Verified | London Heathrow to Keflavik,...   
3  ✅ Trip Verified | Mumbai to London Heathrow in...   
4  ✅ Trip Verified |  Care and support shocking. ...   

                           Stars       countries                Date  
0  \n\t\t\t\t\t\t\t\t\t\t\t\t\t5         Germany  6th September 2023  
1                              1  United Kingdom  4th September 2023  
2                              1         Iceland  4th September 2023  
3                              8         Iceland  4th September 2023  
4                              8  United Kingdom  4th September 2023  
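
The `Stars` column still holds raw strings: the first row carries leading whitespace (`\n\t...5`), and rows where the scraper hit an error hold the literal text `"None"`. A small cleaning step can turn these into numbers. The function name `clean_star` is an illustrative choice, not part of the original notebook.

```python
def clean_star(raw):
    """Convert a scraped rating string to an int, or None if it is not a number."""
    text = str(raw).strip()
    return int(text) if text.isdigit() else None
```

It could be applied with `df["Stars"] = df["Stars"].map(clean_star)` before any numeric analysis.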


In [7]:
# Display the shape of the DataFrame (number of rows, number of columns)
print("\nShape of the DataFrame:")
print(df.shape)


Shape of the DataFrame:
(3644, 4)
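
The `Date` column is also stored as text, in the form `6th September 2023`. The ordinal suffix stops `datetime.strptime` from parsing it directly, so one option is to strip the suffix first. This is a sketch: the name `parse_review_date` is hypothetical, and it assumes every date follows the day, month name, year pattern seen above, with English month names.

```python
import re
from datetime import datetime


def parse_review_date(text):
    """Parse a date such as '6th September 2023' into a datetime.date."""
    # Remove the ordinal suffix that directly follows the day number
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y").date()
```

A value that does not match the pattern raises `ValueError`, which makes unexpected formats visible instead of silently producing a wrong date.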


### Export the data into a csv format

In [8]:
# Export the DataFrame to a CSV file
df.to_csv('BA_reviews.csv', index=False)
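
Review text can contain commas, line breaks, and non-ASCII characters such as `✅`, so it can be worth confirming that the file reads back as expected. The sketch below runs a round trip on a small made-up DataFrame in a temporary directory; the helper name `roundtrip_csv`, the sample rows, and the explicit `encoding="utf-8"` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def roundtrip_csv(frame, path):
    """Write a DataFrame to CSV and read it straight back."""
    frame.to_csv(path, index=False, encoding="utf-8")
    return pd.read_csv(path, encoding="utf-8")


# Made-up rows that exercise commas, an emoji, and an embedded newline
sample = pd.DataFrame({
    "Review": ["✅ Trip Verified | Good, but late.", "Line one\nline two"],
    "Stars": ["1", "8"],
})

with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip_csv(sample, os.path.join(tmp, "sample_reviews.csv"))
```

The same check on the real data would compare `pd.read_csv('BA_reviews.csv').shape` with `df.shape`. Note that `read_csv` infers column types, so numeric-looking strings in `Stars` may come back as numbers.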