# Web Scraping
In this notebook we collect data from the site https://www.airlinequality.com/.

We use `BeautifulSoup` to parse the HTML pages and extract the reviews of British Airways, then build a DataFrame with `pandas` and save it as a `.csv` file.

## Scraping the SkyTrax site

If we navigate the site https://www.airlinequality.com/, we can find the British Airways reviews at the page https://www.airlinequality.com/airline-reviews/british-airways.
Now we can use Python and `BeautifulSoup` to collect the review data.

In [91]:
# Import the necessary libraries

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Scrape Reviews

We want to know what people say after flying with British Airways, so we scrape the reviews from https://www.airlinequality.com/airline-reviews/british-airways.

We use `requests` to get the raw data from the web page, then parse the HTML document with `BeautifulSoup` to obtain the review text.

We will perform data cleaning and after that we will analyze those data to find insights.
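
As a minimal, offline sketch of the parsing step, the snippet below runs `BeautifulSoup` on a small hand-written HTML string instead of a live response. The markup in `sample_html` is invented for illustration and only mimics the `div` elements with class `text_content` that this notebook targets; the real page is more complex.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for the bytes returned by requests.get(url).content
sample_html = """
<article>
  <div class="text_content">✅ Trip Verified | Great flight.</div>
</article>
<article>
  <div class="text_content">Not Verified | Lost my luggage.</div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every <div class="text_content"> element
sample_reviews = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="text_content")
]
print(sample_reviews)
```

The same `find_all` call works on the parsed content of a real page, where each matching `div` holds one review.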

In [92]:
# Set the base URL, the number of pages, and the number of reviews per page
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
page_number = 35
page_size = 100

# Initialize empty lists to collect the scraped values
reviews = []
verified = []
# Iterate over the pages
for i in range(1, page_number + 1):
    
    #print('Scraping page: ', i)
    # URL of the i-th page to scrape
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Use requests.get() to collect the content of the page
    response = requests.get(url)

    # Pick the content of the web page and create a BeautifulSoup object to parse the content
    content = response.content
    parsed_content = BeautifulSoup(content, "html.parser")
    
    # Find every div with class "text_content" on the page and store its text as one review
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        
        reviews.append(para.get_text())

    #print(f"------> {len(reviews)} total reviews")
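
The page URL above hardcodes the percent-encoded query string. As an alternative sketch, the query can be built with the standard library's `urlencode`, which takes care of the encoding. The helper name `build_page_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

def build_page_url(base_url, page, page_size=100):
    """Hypothetical helper: build the URL of one page of reviews."""
    # urlencode percent-encodes the ":" in "post_date:Desc" as "%3A"
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
first_url = build_page_url(base_url, 1)
print(first_url)
```

Each URL can then be passed to `requests.get(url, timeout=...)`. Calling `response.raise_for_status()` before parsing makes a failed request visible instead of silently contributing zero reviews.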


In [93]:
# Create a DataFrame from the reviews list
df = pd.DataFrame()
df["reviews"] = reviews

df.head(), len(df)

(                                             reviews
 0  ✅ Trip Verified |  I was flying to Warsaw for ...
 1  ✅ Trip Verified |  Booked a BA holiday to Marr...
 2  ✅ Trip Verified | Extremely sub-par service. H...
 3  ✅ Trip Verified |  I virtually gave up on Brit...
 4  ✅ Trip Verified |  I was pleasantly surprised ...,
 3427)
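
The output shows that each review starts with a verification marker such as `✅ Trip Verified |`. One possible first cleaning step, sketched here on invented sample rows, is to split that marker from the review body. The column names `verified` and `review_text` are assumptions, and the sketch assumes that the first `|` in a row is always the marker separator, which would mis-split an unmarked review that happens to contain a `|`.

```python
import pandas as pd

# Hypothetical sample rows mimicking the scraped text (not real reviews)
sample = pd.DataFrame({"reviews": [
    "✅ Trip Verified |  I was flying to Warsaw.",
    "Not Verified | Delayed by two hours.",
    "A review without any marker.",
]})

# Split once on the first "|"; rows without a marker have no second part
parts = sample["reviews"].str.split("|", n=1, expand=True)
has_marker = parts[1].notna()

# A review counts as verified only if its marker says "Trip Verified"
sample["verified"] = has_marker & parts[0].str.contains("Trip Verified", regex=False)

# Keep the text after the marker, or the full text when there is no marker
sample["review_text"] = parts[1].where(has_marker, parts[0]).str.strip()
print(sample[["verified", "review_text"]])
```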

In [94]:
# Save the DataFrame as a CSV file.
df.to_csv(r"..\data\reviews_data.csv", index = False)

### Scraping Rating Data

With the review texts collected, we now want the ratings attached to each review. We follow the same steps as before, but this time we define a couple of functions to do the job.

In [95]:
# Create lists to store values in the table

Aircraft = []
Type_Of_Traveller =[]
Seat_Type = []
Route = []
Date_Flown = []
Seat_Comfort = []
Cabin_Staff_Service = []
Food_Beverages = []
Inflight_Entertainment = []
Ground_Service = []
Wifi_Connectivity = []
Value_For_Money = []
Recommended = []

In [96]:
# Define a function to grab the ratings data from one page

def Get_Data_Form_Page(url):
    """
    Get an HTML page, parse it and then search for the ratings tables,
    appending one value per table to each of the module-level lists.
    The lists must be initialized before calling.

    Parameters
    ----------
    url: URL of the page to search
    """
    
    content_response = requests.get(url)
    content = content_response.content
    soup = BeautifulSoup(content, "html.parser")
    tables = soup.find_all("table")
    
    # Drop the first table: it does not belong to an individual review
    # (assumed to be the page-level rating summary)
    del tables[0]

    for table in tables:
    
        # Append a NaN placeholder to every list; it is overwritten below when a value is found
        Aircraft.append(np.nan)
        Type_Of_Traveller.append(np.nan)
        Seat_Type.append(np.nan)
        Route.append(np.nan)
        Date_Flown.append(np.nan)
        Seat_Comfort.append(np.nan)
        Cabin_Staff_Service.append(np.nan)
        Food_Beverages.append(np.nan)
        Inflight_Entertainment.append(np.nan)
        Ground_Service.append(np.nan)
        Wifi_Connectivity.append(np.nan)
        Value_For_Money.append(np.nan)
        Recommended.append(np.nan)

        # Grab the values from each row of a single table
        for row in table.find_all("tr"):
            header = row.find("td", class_= "review-rating-header").text
            value = row.find("td", class_ = "review-value")

            # Fill the matching list with the value
            if header == "Aircraft":
                Aircraft[-1] = value.text
        
            elif header == "Type Of Traveller":
                Type_Of_Traveller[-1] = value.text
        
            elif header == "Seat Type":
                Seat_Type[-1] = value.text
        
            elif header == "Route":
                Route[-1] = value.text
        
            elif header == "Date Flown":
                Date_Flown[-1] = value.text
        
            elif header == "Seat Comfort":
                Seat_Comfort[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Cabin Staff Service":
                Cabin_Staff_Service[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Food & Beverages":
                Food_Beverages[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Inflight Entertainment":
                Inflight_Entertainment[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Ground Service":
                Ground_Service[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Wifi & Connectivity":
                Wifi_Connectivity[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Value For Money":
                Value_For_Money[-1] = len(row.find_all("span", class_= "star fill"))
        
            elif header == "Recommended":
                Recommended[-1] = value.text
    
    return None
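
The function above writes into module-level lists, which is why they must exist before the call. As an alternative sketch (not the approach used in this notebook), each table can be turned into a dictionary and returned to the caller. The helper `parse_rating_table` and the constant `STAR_HEADERS` are hypothetical names, and the sample markup is invented to mimic one ratings table.

```python
from bs4 import BeautifulSoup

# Headers whose value is a star rating (counted) rather than plain text
STAR_HEADERS = {
    "Seat Comfort", "Cabin Staff Service", "Food & Beverages",
    "Inflight Entertainment", "Ground Service",
    "Wifi & Connectivity", "Value For Money",
}

def parse_rating_table(table):
    """Hypothetical helper: turn one ratings table into a {header: value} dict."""
    record = {}
    for row in table.find_all("tr"):
        header_cell = row.find("td", class_="review-rating-header")
        if header_cell is None:
            continue  # skip rows without a header cell
        header = header_cell.get_text(strip=True)
        if header in STAR_HEADERS:
            # Count the filled stars, as in the function above
            record[header] = len(row.find_all("span", class_="star fill"))
        else:
            value_cell = row.find("td", class_="review-value")
            record[header] = value_cell.get_text(strip=True) if value_cell else None
    return record

# Invented markup mimicking one review's ratings table
sample_table = BeautifulSoup("""
<table>
  <tr><td class="review-rating-header">Aircraft</td><td class="review-value">A320</td></tr>
  <tr><td class="review-rating-header">Seat Comfort</td>
      <td><span class="star fill">1</span><span class="star fill">2</span><span class="star">3</span></td></tr>
  <tr><td class="review-rating-header">Recommended</td><td class="review-value">yes</td></tr>
</table>
""", "html.parser").find("table")

sample_record = parse_rating_table(sample_table)
print(sample_record)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which fills any key missing from a record with `NaN`, so no placeholder values need to be appended by hand.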

In [97]:
def Get_Data(base_url, page_size = 100, n_pages = 10):

    # Loop over the requested number of pages (n_pages, not the global page_number)
    for i in range(1, n_pages + 1):
    
        #print('Scraping page: ', i)
        # URL of the i-th page to scrape
        url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

        Get_Data_Form_Page(url)
    return None

In [98]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"

Get_Data(base_url, n_pages = 35)

In [99]:
data_dict = {"Aircraft": Aircraft,
    "Type Of Traveller":Type_Of_Traveller,
    "Seat Type": Seat_Type,
    "Route": Route,
    "Date Flown": Date_Flown,
    "Seat Comfort": Seat_Comfort,
    "Cabin Staff Service": Cabin_Staff_Service,
    "Food & Beverages": Food_Beverages,
    "Inflight Entertainment": Inflight_Entertainment,
    "Ground Service": Ground_Service,
    "Wifi & Connectivity": Wifi_Connectivity,
    "Value For Money": Value_For_Money,
    "Recommended": Recommended}


In [100]:
ratings = pd.DataFrame(data_dict)

In [101]:
len(ratings)

3427

In [102]:
ratings.head()

Unnamed: 0,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Wifi & Connectivity,Value For Money,Recommended
0,Boeing 787-9,Business,Economy Class,Denver to London,December 2022,1.0,1.0,1.0,3.0,1.0,,1,no
1,A320,Solo Leisure,Business Class,London to Marrakech,June 2022,3.0,5.0,5.0,,4.0,,3,yes
2,A380,Solo Leisure,Economy Class,San Francisco to London,November 2022,2.0,1.0,2.0,2.0,3.0,1.0,2,no
3,A320,Solo Leisure,Business Class,London to Lisbon,November 2022,3.0,4.0,4.0,,3.0,,3,yes
4,Boeing 787 / A320,Solo Leisure,Economy Class,Montreal to Edinburgh via London Heathrow,January 2022,4.0,4.0,4.0,4.0,4.0,,4,yes


In [103]:
df["Recommended"] = ratings["Recommended"]
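
The assignment above relies on pandas aligning the two frames by index, so each review is paired with its own rating only if both were scraped in the same order and have the same length (here both have 3427 rows). A minimal sketch of checking this before copying the column, using hypothetical stand-in frames `reviews_df` and `ratings_df`:

```python
import pandas as pd

# Hypothetical stand-ins for the `df` and `ratings` frames built above
reviews_df = pd.DataFrame({"reviews": ["first review", "second review"]})
ratings_df = pd.DataFrame({"Recommended": ["no", "yes"]})

# Column assignment aligns on the index, so check that both frames match first
if len(reviews_df) != len(ratings_df) or not reviews_df.index.equals(ratings_df.index):
    raise ValueError("reviews and ratings are not aligned row by row")

reviews_df["Recommended"] = ratings_df["Recommended"]
print(reviews_df)
```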

In [104]:
ratings.to_csv(r"..\data\rating_data.csv", index = False)

In [105]:
df.to_csv(r"..\data\reviews_data.csv", index = False)