# -1_ Task One: review data insights

This notebook collects review data from the `SkyTrax` website (https://www.airlinequality.com) using the `BeautifulSoup` package for web scraping. The collected data are saved to a local `.csv` file for analysis.

## -1.a- Data scraping & Cleaning

For this task, we focus our analysis on reviews of British Airways as an airline.
If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data.
We use `Python` and `BeautifulSoup` to step through the paginated review listing and extract the text of each individual review.
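
The review listing is spread over numbered pages. As a minimal sketch, assuming the `/page/<n>/` path and the `sortby` and `pagesize` query parameters that the scraping loop in this notebook relies on, the page URLs can be generated as follows. `BASE_URL` and `build_page_url` are illustrative names, not part of the original code.

```python
from urllib.parse import urlencode

# Listing URL taken from this notebook.
BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(base_url, page, page_size=100):
    """Build the URL of one listing page, newest reviews first."""
    # urlencode percent-encodes the ":" in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


# Ten pages of 100 reviews each, matching pages = 10 and page_size = 100.
urls = [build_page_url(BASE_URL, page) for page in range(1, 11)]
print(urls[0])
```

Building the query string with `urlencode` avoids hand-writing the percent-encoding, which is easy to get wrong when parameters change.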

In [56]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


In [57]:
bads_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
# if you want to collect more data, try increasing the number of pages!
page_size = 100

reviews = []

# loop over the listing pages to collect up to pages * page_size (1000) reviews
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{bads_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
    print(url)
    # Collect HTML data from this page
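    # a timeout stops the loop from hanging on a stalled connection;
    # raise_for_status() stops on HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()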

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    # loop to extract the "text_content" HTML class data
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        # Clean the text: drop the "✅ Trip Verified" / "Not Verified" prefix,
        # which is not relevant to the analysis, by keeping only what follows
        # the first "|" and stripping the surrounding whitespace.
        # partition() keeps reviews that have no "|" instead of raising ValueError.
        review = para.get_text()
        _, sep, body = review.partition("|")
        reviews.append((body if sep else review).strip())

    print(f"   ---> {len(reviews)} total reviews")


Scraping page 1
https://www.airlinequality.com/airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100
   ---> 100 total reviews
Scraping page 2
https://www.airlinequality.com/airline-reviews/british-airways/page/2/?sortby=post_date%3ADesc&pagesize=100
   ---> 200 total reviews
Scraping page 3
https://www.airlinequality.com/airline-reviews/british-airways/page/3/?sortby=post_date%3ADesc&pagesize=100
   ---> 300 total reviews
Scraping page 4
https://www.airlinequality.com/airline-reviews/british-airways/page/4/?sortby=post_date%3ADesc&pagesize=100
   ---> 400 total reviews
Scraping page 5
https://www.airlinequality.com/airline-reviews/british-airways/page/5/?sortby=post_date%3ADesc&pagesize=100
   ---> 500 total reviews
Scraping page 6
https://www.airlinequality.com/airline-reviews/british-airways/page/6/?sortby=post_date%3ADesc&pagesize=100
   ---> 600 total reviews
Scraping page 7
https://www.airlinequality.com/airline-reviews/british-airways/page/7/?sortby=post_d
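
The prefix-stripping step inside the loop can be checked in isolation, without touching the network. The sketch below is one possible way to do it; `strip_verification_prefix` is a hypothetical helper name, and the sample strings are made up to resemble the scraped text rather than taken from the site.

```python
def strip_verification_prefix(text):
    """Return the review body after the first "|", or the whole text if there is none."""
    _, sep, body = text.partition("|")
    return body.strip() if sep else text.strip()


# Made-up strings shaped like the scraped text; not real review data.
samples = [
    "✅ Trip Verified |  Flight left on time.",
    "Not Verified | Seats were cramped.",
    "A review with no verification marker.",
]

cleaned = [strip_verification_prefix(s) for s in samples]
print(cleaned)
```

Using `str.partition` rather than `str.index` means a review without a `|` separator is kept as-is instead of stopping the whole scrape with a `ValueError`.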

In [58]:
# Transform data to DataFrame format
df = pd.DataFrame({"reviews": reviews})
df.head()


   reviews
0  Extremely rude ground service. We were non-rev...
1  My son and I flew to Geneva last Sunday for a ...
2  For the price paid (bought during a sale) it w...
3  Flight left on time and arrived over half an h...
4  Very Poor Business class product, BA is not ev...


In [59]:
# Store DataFrame to CSV format
df.to_csv("csv_data/BADS_reviews.csv")
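
Two details of `to_csv` are worth keeping in mind here: it does not create a missing output folder, and by default it writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back without `index_col`. The following sketch illustrates a save-and-reload round trip under those assumptions. It uses placeholder rows and a temporary directory (both hypothetical) so that it does not overwrite the real `csv_data/BADS_reviews.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder rows standing in for the scraped reviews (hypothetical data).
df = pd.DataFrame({"reviews": ["First sample review.", "Second sample review."]})

# A temporary directory keeps this sketch away from the real csv_data/ folder.
out_dir = Path(tempfile.mkdtemp()) / "csv_data"
out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
out_path = out_dir / "BADS_reviews.csv"

df.to_csv(out_path)  # the index is written as an unnamed first column

# index_col=0 restores that column as the index instead of "Unnamed: 0"
reloaded = pd.read_csv(out_path, index_col=0)
print(reloaded.columns.tolist())
```

An alternative is to pass `index=False` to `to_csv`, which omits the index column from the file altogether.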


Congratulations! Now we have our dataset for this task! 
