# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping Review Data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. Now we can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the text of each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

| | reviews |
|---|---|
| 0 | ✅ Trip Verified \| My daughter and I were deni... |
| 1 | ✅ Trip Verified \| Despite boarding being the u... |
| 2 | Not Verified \| Flight cancelled, no crew! 9th... |
| 3 | Not Verified \| The worst service ever, my bag... |
| 4 | ✅ Trip Verified \| 4/4 flights we booked this ... |


Congratulations! Now you have your dataset for this task! The loop above collected 1000 reviews by iterating through the paginated listing on the website (10 pages of 100 reviews each). If you want to collect more data, try increasing the value of `pages`.
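As a rough sketch of how the page URLs are formed, the helper below rebuilds the same URL pattern used in the loop above for any number of pages. The names `build_page_url` and `page_urls` are hypothetical helpers introduced here for illustration, and whether the site actually serves that many pages is an assumption you would need to check.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100, base_url=BASE_URL):
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the ":" in "post_date:Desc" as "%3A",
    # which matches the query string written by hand in the loop above.
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


def page_urls(pages, page_size=100):
    """Return the URLs for pages 1..pages, in order (hypothetical helper)."""
    return [build_page_url(i, page_size) for i in range(1, pages + 1)]


urls = page_urls(pages=20)
print(len(urls))
print(urls[0])
```

Keeping URL construction in one place means the page count and page size can be changed without touching the scraping loops.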

The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
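One possible way to do this cleaning is sketched below with a regular expression. It assumes the prefix is always an optional check mark, then "Trip Verified" or "Not Verified", then a `|` separator, as in the sample rows above; other prefix variants may exist in the full dataset. The name `strip_verification_prefix` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: the prefix looks like "✅ Trip Verified | " or "Not Verified | ",
# as seen in the first rows of the scraped data.
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification marker, if present (hypothetical helper)."""
    return VERIFICATION_PREFIX.sub("", text).strip()


samples = [
    "✅ Trip Verified | My daughter and I were deni...",
    "Not Verified | Flight cancelled, no crew! 9th...",
    "A review without any prefix",
]
cleaned = [strip_verification_prefix(s) for s in samples]
for row in cleaned:
    print(row)
```

On the DataFrame built earlier, the same function could be applied with `df["reviews"] = df["reviews"].map(strip_verification_prefix)`.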

### Scraping Route Data from Skytrax

In [4]:
route_list = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the 'Route' value in each review's ratings table
    for review in soup.find_all('article', {'itemprop': 'review'}):
        # `string=` replaces the older `text=` argument (Beautiful Soup 4.4.0+)
        route_info = review.find('table', {'class': 'review-ratings'}).find('td', string='Route')
        if route_info:
            route_value = route_info.find_next_sibling('td').text
        else:
            route_value = np.nan  # If there is no 'Route' information, we add NaN
        route_list.append(route_value)
    
    print(f"   ---> {len(route_list)} total routes")

Scraping page 1
   ---> 100 total routes
Scraping page 2
   ---> 200 total routes
Scraping page 3
   ---> 300 total routes
Scraping page 4
   ---> 400 total routes
Scraping page 5
   ---> 500 total routes
Scraping page 6
   ---> 600 total routes
Scraping page 7
   ---> 700 total routes
Scraping page 8
   ---> 800 total routes
Scraping page 9
   ---> 900 total routes
Scraping page 10
   ---> 1000 total routes


In [5]:
# Add the routes to the DataFrame as a new column
df["Rota"] = route_list
df.head()

| | reviews | Rota |
|---|---|---|
| 0 | ✅ Trip Verified \| My daughter and I were deni... | Madrid to Vancouver via London |
| 1 | ✅ Trip Verified \| Despite boarding being the u... | London to Santiago |
| 2 | Not Verified \| Flight cancelled, no crew! 9th... | London Heathrow to Faro |
| 3 | Not Verified \| The worst service ever, my bag... | Kuwait to Lisbon via London |
| 4 | ✅ Trip Verified \| 4/4 flights we booked this ... | London to Munich |


### Scraping Seat Type Data from Skytrax

In [6]:
seat_list = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the 'Seat Type' value in each review's ratings table
    for review in soup.find_all('article', {'itemprop': 'review'}):
        # `string=` replaces the older `text=` argument (Beautiful Soup 4.4.0+)
        seat_info = review.find('table', {'class': 'review-ratings'}).find('td', string='Seat Type')
        if seat_info:
            seat_value = seat_info.find_next_sibling('td').text
        else:
            seat_value = np.nan  # If there is no 'Seat Type' information, we add NaN
        seat_list.append(seat_value)
    
    print(f"   ---> {len(seat_list)} total seat type")

Scraping page 1
   ---> 100 total seat type
Scraping page 2
   ---> 200 total seat type
Scraping page 3
   ---> 300 total seat type
Scraping page 4
   ---> 400 total seat type
Scraping page 5
   ---> 500 total seat type
Scraping page 6
   ---> 600 total seat type
Scraping page 7
   ---> 700 total seat type
Scraping page 8
   ---> 800 total seat type
Scraping page 9
   ---> 900 total seat type
Scraping page 10
   ---> 1000 total seat type


In [7]:
# Add the seat types to the DataFrame as a new column
df["Seat_Type"] = seat_list
df.head()

| | reviews | Rota | Seat_Type |
|---|---|---|---|
| 0 | ✅ Trip Verified \| My daughter and I were deni... | Madrid to Vancouver via London | Business Class |
| 1 | ✅ Trip Verified \| Despite boarding being the u... | London to Santiago | Business Class |
| 2 | Not Verified \| Flight cancelled, no crew! 9th... | London Heathrow to Faro | Business Class |
| 3 | Not Verified \| The worst service ever, my bag... | Kuwait to Lisbon via London | Economy Class |
| 4 | ✅ Trip Verified \| 4/4 flights we booked this ... | London to Munich | Economy Class |


### Scraping Numerical Data from Skytrax

In [8]:
seat_comfort_list = []
cabin_staff_service_list = []
food_and_beverages_list = []
inflight_entertainment_list = []
ground_service_list = []
wifi_and_connectivity_list = []
value_for_money_list = []

# Map each rating label in the review table to the list that stores its values
rating_lists = {
    'Seat Comfort': seat_comfort_list,
    'Cabin Staff Service': cabin_staff_service_list,
    'Food & Beverages': food_and_beverages_list,
    'Inflight Entertainment': inflight_entertainment_list,
    'Ground Service': ground_service_list,
    'Wifi & Connectivity': wifi_and_connectivity_list,
    'Value For Money': value_for_money_list,
}

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # For each review, count the filled stars next to every rating label
    for review in soup.find_all('article', {'itemprop': 'review'}):
        ratings_table = review.find('table', {'class': 'review-ratings'})
        for label, values in rating_lists.items():
            # `string=` replaces the older `text=` argument (Beautiful Soup 4.4.0+)
            label_cell = ratings_table.find('td', string=label)
            if label_cell:
                filled_stars = len(label_cell.find_next_sibling('td').find_all('span', {'class': 'fill'}))
            else:
                filled_stars = np.nan  # If this rating is missing, we add NaN
            values.append(filled_stars)

    total_values = sum(len(values) for values in rating_lists.values())
    print(f"   ---> {total_values} total numerical values.")

Scraping page 1
   ---> 700 total numerical values.
Scraping page 2
   ---> 1400 total numerical values.
Scraping page 3
   ---> 2100 total numerical values.
Scraping page 4
   ---> 2800 total numerical values.
Scraping page 5
   ---> 3500 total numerical values.
Scraping page 6
   ---> 4200 total numerical values.
Scraping page 7
   ---> 4900 total numerical values.
Scraping page 8
   ---> 5600 total numerical values.
Scraping page 9
   ---> 6300 total numerical values.
Scraping page 10
   ---> 7000 total numerical values.


In [9]:
# Add each rating to the DataFrame as a new column
df["Seat_Comfort"] = seat_comfort_list
df["Cabin_Staff_Service"] = cabin_staff_service_list
df["Food_and_Beverages"] = food_and_beverages_list
df["Inflight_Entertainment"] = inflight_entertainment_list
df["Ground_Service"] = ground_service_list
df["Wifi_and_Connectivity"] = wifi_and_connectivity_list
df["Value_for_Money"] = value_for_money_list
df

| | reviews | Rota | Seat_Type | Seat_Comfort | Cabin_Staff_Service | Food_and_Beverages | Inflight_Entertainment | Ground_Service | Wifi_and_Connectivity | Value_for_Money |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ✅ Trip Verified \| My daughter and I were deni... | Madrid to Vancouver via London | Business Class | 3.0 | 3.0 | NaN | NaN | 1.0 | NaN | 1 |
| 1 | ✅ Trip Verified \| Despite boarding being the u... | London to Santiago | Business Class | 3.0 | 5.0 | 4.0 | NaN | 2.0 | NaN | 5 |
| 2 | Not Verified \| Flight cancelled, no crew! 9th... | London Heathrow to Faro | Business Class | NaN | NaN | NaN | NaN | 1.0 | NaN | 1 |
| 3 | Not Verified \| The worst service ever, my bag... | Kuwait to Lisbon via London | Economy Class | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 3 |
| 4 | ✅ Trip Verified \| 4/4 flights we booked this ... | London to Munich | Economy Class | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | ✅ Trip Verified \| London to Lyon. The flight ... | London to Lyon | Economy Class | 2.0 | 1.0 | NaN | NaN | 1.0 | NaN | 1 |
| 996 | ✅ Trip Verified \| London to Boston. I was sea... | London to Boston | Economy Class | 3.0 | 5.0 | 4.0 | 4.0 | 4.0 | 1.0 | 5 |
| 997 | ✅ Trip Verified \| Stockholm to London. Standar... | Stockholm to London | Business Class | 3.0 | 5.0 | 2.0 | NaN | 1.0 | NaN | 3 |
| 998 | ✅ Trip Verified \| Amsterdam to London arrived... | Amsterdam to London | Economy Class | 4.0 | 4.0 | NaN | NaN | 1.0 | NaN | 3 |


In [10]:
df.to_csv("BA_DataSet.csv")
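
A note on saving: by default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference using a small made-up DataFrame and in-memory buffers instead of real files; the sample rows are illustrative only and are not taken from the scraped data.

```python
import io

import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "reviews": ["Flight was on time.", "Bag arrived late."],
    "Seat_Type": ["Economy Class", "Business Class"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

If the extra column is not wanted, either pass `index=False` when saving or pass `index_col=0` to `pd.read_csv` when loading.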