# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can now use `Python` and `BeautifulSoup` to collect all the links to the reviews and then collect the text data from each individual review.
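
To make the parsing step concrete before touching the live site, here is a minimal sketch that runs `BeautifulSoup` on a small inline HTML string. The sample HTML, the reviewer name, and the helper `text_or_empty` are hypothetical and only mimic the class names used by the scraper in this notebook; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML that mimics the structure the scraper expects.
SAMPLE_HTML = """
<div class="body">
  <h2 class="text_header">"very friendly cabin crew"</h2>
  <h3 class="text_sub_header">A Reviewer (United Kingdom) 18th January 2023</h3>
  <div class="text_content">A quiet flight with friendly crew.</div>
</div>
<div class="body">
  <h2 class="text_header">"a mediocre service"</h2>
  <div class="text_content">Nothing special.</div>
</div>
"""


def text_or_empty(parent, tag, css_class):
    """Return the stripped text of the first matching child, or "" if absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(" ", strip=True) if node is not None else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One [header, sub_header, content] row per <div class="body"> block.
rows = [
    [
        text_or_empty(block, "h2", "text_header"),
        text_or_empty(block, "h3", "text_sub_header"),
        text_or_empty(block, "div", "text_content"),
    ]
    for block in soup.find_all("div", class_="body")
]

for row in rows:
    print(row)
```

The second block has no `h3` element, so its sub-header comes back as an empty string instead of raising an `AttributeError`. Whether to keep or skip such incomplete rows is a design choice left to the analysis.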

In [1]:
import csv
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [2]:
# Base URL of the paginated review listing to scrape
url = "https://www.airlinequality.com/airline-reviews/british-airways"

# List that accumulates one [header, sub_header, content] row per review
data = []

# Starting page, step between pages, and number of reviews requested per page
page_num = 1
page_incr = 1
page_size = 100

# Maximum number of pages to scrape
max_pages = 20

# Paginate through the listing and scrape each page
while page_num <= max_pages:

    print(f"Scraping page {page_num}")

    # Build the URL for the current page (sorted by post date, descending)
    paginated_url = f"{url}/page/{page_num}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Send a GET request; fail fast on timeouts and HTTP errors
    response = requests.get(paginated_url, timeout=30)
    response.raise_for_status()

    # Parse the HTML response with BeautifulSoup
    parsed_content = BeautifulSoup(response.text, "html.parser")

    # Find all the elements on the page that contain the data to be scraped
    elements = parsed_content.find_all("div", class_="body")

    # Extract the header, sub-header, and review text from each element
    for element in elements:
        header = element.find("h2", class_="text_header").text.replace("\n", " ")
        sub_header = element.find("h3", class_="text_sub_header").text.replace("\n", " ")
        content = element.find("div", class_="text_content").text.replace("\n", " ")

        data.append([header, sub_header, content])

    # Move on to the next page
    page_num += page_incr

    print(f"   ---> {len(data)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
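
The paginated URL used by the scraping loop follows a fixed pattern: the base URL, a `/page/<n>/` segment, and a query string with `sortby` and `pagesize`. As a small sketch, the same URL can be built with the standard library so the query string is percent-encoded automatically. The helper name `build_page_url` is an assumption for illustration and is not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page_num, page_size=100, base_url=BASE_URL):
    """Return the listing URL for one page, sorted by post date, descending."""
    # urlencode percent-encodes the colon, producing post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page_num}/?{query}"


# URLs for the 20 pages scraped in this notebook
page_urls = [build_page_url(n) for n in range(1, 21)]
print(page_urls[0])
print(page_urls[-1])
```

Building the URL in one place avoids repeating the f-string in the loop and makes it easy to change the page size or sort order later.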


In [3]:
# Convert the scraped list into a DataFrame
df = pd.DataFrame(data)
df.columns = ["REVIEW","PERSONAL INFO","CONTENT"]

# Remove the "✅ Trip Verified |" prefix (first text-preprocessing step)
df.replace(re.compile(r'\s*✅ Trip Verified \|\s*'), '', inplace=True)
df

Unnamed: 0,REVIEW,PERSONAL INFO,CONTENT
0,"""there is a race to the bottom""",Thomas Kowalski (United States) 19th January...,Not Verified | It seems that there is a race t...
1,"""need to cancel the ticket and rebook""",Reyes Diaz (United Kingdom) 19th January 2023,Not Verified | As a Spanish born individual l...
2,"""very friendly cabin crew""",1 reviews S Zarhas (United Kingdom) 18th ...,"A rather empty and quiet flight to Tel Aviv, v..."
3,"""a good drinks and food service""",E Smyth (United Kingdom) 17th January 2023,Easy check in and staff member was polite and ...
4,"""you should let me use the lounge""",Jozef Kis (United Kingdom) 17th January 2023,Being a silver flyer and booking a flight thro...
...,...,...,...
1995,"""a mediocre service""",M Steger (Germany) 5th June 2016,A mediocre service on this airline flying from...
1996,"""good fare from their sale""",Bill Atkins (United Kingdom) 5th June 2016,MAN to LHR with personable crew who did the br...
1997,"""slowed down the process""",12 reviews T Long (United Kingdom) 4th Ju...,London to Bucharest with British Airways. The ...
1998,"""does what you would expect""",Simon Fowler (United Kingdom) 4th June 2016,Shambolic check-in at New York JFK following t...
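
The output above still shows reviews that begin with `Not Verified |`, because the cell only strips the `✅ Trip Verified |` prefix. The sketch below shows one way to strip several verification labels with a single regular expression. The list of labels and the helper name `strip_verification_prefix` are assumptions; whether `Not Verified` should be removed or kept as a feature depends on the analysis.

```python
import re

# Assumption: a review may start with one of these labels, optionally
# preceded by a check mark, and followed by a "|" separator.
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification label from a review, if present."""
    return VERIFICATION_PREFIX.sub("", text, count=1)


samples = [
    "✅ Trip Verified |  Easy check in and polite staff.",
    "Not Verified | It seems that there is a race to the bottom.",
    "✅ Verified Review | London to Bucharest with British Airways.",
    "A mediocre service on this airline.",
]

cleaned = [strip_verification_prefix(s) for s in samples]
for line in cleaned:
    print(line)
```

Applied to the notebook's data, this could be used as `df["CONTENT"] = df["CONTENT"].map(strip_verification_prefix)`.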


In [8]:
# Save the data to a CSV file (the "data" directory must already exist)
df.to_csv(r"data\BA_reviews.csv")
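
The save step above assumes a Windows-style path and an existing `data` directory. As a hedged sketch of a more portable approach, the example below uses `pathlib` to create the directory if needed and the standard `csv` module to write the rows. The helper name `save_rows` and the sample row are hypothetical; the demo writes to a temporary directory so it does not touch the notebook's files.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, header, csv_path):
    """Write rows to csv_path, creating parent directories if needed."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return csv_path


# Demo with a made-up row, written under a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = save_rows(
        [["sample header", "sample sub-header", "sample content"]],
        ["REVIEW", "PERSONAL INFO", "CONTENT"],
        Path(tmp_dir) / "data" / "BA_reviews.csv",
    )
    saved_lines = out_path.read_text(encoding="utf-8").splitlines()

print(saved_lines)
```

With pandas, the same idea is to call `Path("data").mkdir(exist_ok=True)` first and then `df.to_csv(Path("data") / "BA_reviews.csv")`.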

In [9]:
# Keep only the review text for sentiment analysis
sentiment_analysis_df = df.drop(["REVIEW","PERSONAL INFO"], axis=1)
# Also remove the "✅ Verified Review |" prefix
sentiment_analysis_df.replace(re.compile(r'\s*✅ Verified Review \|\s*'), '', inplace=True)
sentiment_analysis_df

Unnamed: 0,CONTENT
0,Not Verified | It seems that there is a race t...
1,Not Verified | As a Spanish born individual l...
2,"A rather empty and quiet flight to Tel Aviv, v..."
3,Easy check in and staff member was polite and ...
4,Being a silver flyer and booking a flight thro...
...,...
1995,A mediocre service on this airline flying from...
1996,MAN to LHR with personable crew who did the br...
1997,London to Bucharest with British Airways. The ...
1998,Shambolic check-in at New York JFK following t...


In [11]:
# Save the DataFrame to a CSV file (raw string so "\s" is not treated as an escape sequence)
sentiment_analysis_df.to_csv(r"data\sentiment_content.csv", index=False)