# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use the `BeautifulSoup` package to collect data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com], you will see that the site hosts a large amount of review data. For this task, we are only interested in reviews of British Airways and the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.
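
The review listing is paginated, so each page has its own URL. As a minimal sketch, the helper below rebuilds the URL pattern used in the scraping cell of this notebook; the function name `build_page_url` is hypothetical and introduced only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def build_page_url(page, page_size=100):
    """Return the URL of one page of the paginated review listing."""
    # urlencode percent-encodes the colon in the sort key (":" becomes "%3A")
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"

first_page_url = build_page_url(1)
print(first_page_url)
```

Keeping the URL construction in one place makes it easier to change the page size or sort order later.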

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

ratings = []
info = {}

# Loop over each page of the paginated review listing
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page (the timeout avoids hanging on a stalled request)
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the text of every rating value found on the page
    rating_spans = soup.find_all("span", {"itemprop": "ratingValue"})
    for span in rating_spans:
        ratings.append(span.text)

    # Collect the row texts of each review's ratings table, keyed by "page,index"
    rating_tables = soup.find_all("table", {"class": "review-ratings"})
    for k, table in enumerate(rating_tables):
        info[f"{i},{k}"] = [row.text for row in table.find_all("tr")]

Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
Scraping page 7
Scraping page 8
Scraping page 9
Scraping page 10
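
The cell above collects ratings and table rows in two separate page-wide passes, so a rating and its table are only linked by their position. A possible alternative, sketched here on made-up markup, is to parse one review container at a time so each score stays attached to its own attributes. The `article` container with `itemprop="review"`, the two-cell row layout, and the helper name `parse_reviews` are assumptions for illustration; check the real page source before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE_HTML = """
<article itemprop="review">
  <span itemprop="ratingValue">8</span>
  <table class="review-ratings">
    <tr><td>Type Of Traveller</td><td>Business</td></tr>
    <tr><td>Seat Type</td><td>Economy Class</td></tr>
    <tr><td>Recommended</td><td>yes</td></tr>
  </table>
</article>
<article itemprop="review">
  <table class="review-ratings">
    <tr><td>Seat Type</td><td>Business Class</td></tr>
    <tr><td>Recommended</td><td>no</td></tr>
  </table>
</article>
"""

def parse_reviews(html):
    """Return one dict per review, combining its rating and table rows."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for review in soup.find_all("article", {"itemprop": "review"}):
        rating = review.find("span", {"itemprop": "ratingValue"})
        record = {"Rating_Over_10": rating.get_text(strip=True) if rating else None}
        table = review.find("table", {"class": "review-ratings"})
        if table is not None:
            for row in table.find_all("tr"):
                cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
                if len(cells) == 2:  # assumed layout: label cell, value cell
                    record[cells[0]] = cells[1]
        records.append(record)
    return records

reviews = parse_reviews(SAMPLE_HTML)
print(reviews)
```

With this layout, a review that lacks a score or an attribute simply gets a missing value in its own record instead of shifting the rows that follow it.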


In [3]:
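# Drop entries containing tab characters; these appear to be page-level summary values rather than individual review scores
filter_ratings = [element for element in ratings if '\t' not in element]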
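
To see what this filter does, here is a small sketch on made-up values. The assumption, not confirmed by the page itself, is that unwanted entries are padded with tab characters while individual review scores are plain digit strings.

```python
# Hypothetical scraped values; the first one mimics an entry padded with whitespace
sample_ratings = ["\n\t\t\t\t5", "8", "3", "10"]

# Same rule as in the notebook: keep only entries without a tab character
filtered = [rating for rating in sample_ratings if "\t" not in rating]

# The remaining strings can then be converted to integers for analysis
numeric = [int(rating) for rating in filtered]
print(filtered, numeric)
```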

In [4]:
filter_info = {}
keys_required = ['Type Of Traveller', 'Seat Type', 'Recommended']

for key, values in info.items():
    # filter the elements that contain the required keys
    filtered = [value for value in values if any(req in value for req in keys_required)]
    # verify that all requirements are present
    if len(filtered) == len(keys_required):
        filter_info[key] = filtered
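
The following sketch runs the same filtering rule on two made-up reviews, to show which entries survive. The row texts are hypothetical; real rows may contain different spacing.

```python
keys_required = ["Type Of Traveller", "Seat Type", "Recommended"]

# Hypothetical row texts, keyed by "page,index" as in the scraping cell
sample_info = {
    "1,0": ["Type Of Traveller Business", "Seat Type Economy Class",
            "Route London to Rome", "Recommended yes"],
    "1,1": ["Seat Type Business Class", "Recommended no"],
}

kept = {}
for key, rows in sample_info.items():
    # Keep only the rows that mention one of the required labels
    matching = [row for row in rows if any(req in row for req in keys_required)]
    # Keep the review only if the number of matching rows equals the number of labels
    if len(matching) == len(keys_required):
        kept[key] = matching

print(kept)
```

Review `"1,1"` is discarded because it has no `Type Of Traveller` row. Note that the length check counts matching rows; it does not strictly guarantee that each label appears exactly once.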

In [5]:
# Create a list to store the processed information
processed_data = []

# Process each list in the dictionary to extract the data
for key, values in filter_info.items():
    entry = {}
    for value in values:
        if 'Type Of Traveller' in value:
            entry['Type Of Traveller'] = value.split('Type Of Traveller')[1].strip()
        elif 'Seat Type' in value:
            entry['Seat Type'] = value.split('Seat Type')[1].strip()
        elif 'Recommended' in value:
            entry['Recommended'] = value.split('Recommended')[1].strip()
    processed_data.append(entry)

# Create a DataFrame from the processed data
df_1 = pd.DataFrame(processed_data)
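
The three branches above repeat the same split-and-strip step. As one possible simplification, the sketch below uses `str.partition` in a small helper; the name `extract_value` and the sample strings are hypothetical.

```python
def extract_value(row_text, label):
    """Return the text that follows `label` in `row_text`, or None if absent."""
    _, found, rest = row_text.partition(label)
    return rest.strip() if found else None

labels = ["Type Of Traveller", "Seat Type", "Recommended"]
sample_rows = ["Type Of Traveller Business", "Seat Type Economy Class", "Recommended yes"]

entry = {}
for row in sample_rows:
    for label in labels:
        value = extract_value(row, label)
        if value is not None:
            entry[label] = value

print(entry)
```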

In [6]:
df_2 = pd.DataFrame(filter_ratings, columns=['Rating_Over_10'])

In [9]:
# Join the two tables column-wise (rows are matched by index position), then drop incomplete rows
df = pd.concat([df_1, df_2], axis=1)
df.dropna(inplace=True)
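
One thing to watch for here: `pd.concat(..., axis=1)` matches rows by index, not by review. If a review was discarded during the earlier filtering step, the remaining attribute rows may pair up with the wrong ratings. The sketch below illustrates this on made-up data, assuming three reviews scored 8, 3 and 10, where the second review was dropped from the attribute table.

```python
import pandas as pd

# Attributes survive only for reviews 1 and 3 (review 2 was filtered out)
attrs = pd.DataFrame({"Seat Type": ["Economy Class", "Business Class"]})
# Ratings are still present for all three reviews
scores = pd.DataFrame({"Rating_Over_10": ["8", "3", "10"]})

combined = pd.concat([attrs, scores], axis=1)  # aligned on the default 0..n index
cleaned = combined.dropna()

print(cleaned)
```

In this example, `Business Class` (review 3) ends up next to the rating `3` (review 2), and the rating `10` is dropped. Comparing `len(df_1)` with `len(df_2)` before concatenating is a quick way to detect this situation.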

In [10]:
df.to_csv('Data_Clean_BA.csv', index=False)
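
Once the data is saved, the analysis can start from the CSV. The following is only a sketch of a possible first step on made-up rows; it writes to an in-memory buffer instead of `Data_Clean_BA.csv` so that it runs without the scraped file.

```python
import io
import pandas as pd

# Hypothetical rows with the same columns as the cleaned dataset
sample = pd.DataFrame({
    "Type Of Traveller": ["Business", "Couple Leisure", "Solo Leisure"],
    "Seat Type": ["Economy Class", "Business Class", "Economy Class"],
    "Recommended": ["yes", "no", "yes"],
    "Rating_Over_10": ["8", "3", "10"],
})

# Round-trip through CSV, as the notebook does with a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Make sure the ratings are numeric before aggregating
loaded["Rating_Over_10"] = pd.to_numeric(loaded["Rating_Over_10"], errors="coerce")

mean_by_seat = loaded.groupby("Seat Type")["Rating_Over_10"].mean()
recommend_share = (loaded["Recommended"] == "yes").mean()

print(mean_by_seat)
print(recommend_share)
```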