# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com, you will see that the site holds a lot of data. For this task, we are only interested in reviews of British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways, you will see the British Airways reviews. We can use `Python` and `BeautifulSoup` to page through this listing and collect the text of each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

# Loop over the paginated review listing, one page per iteration
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop early on a network or HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | A simple story with an unfor...
1,✅ Trip Verified | Flight was delayed due to t...
2,Not Verified | Fast and friendly check in (to...
3,✅ Trip Verified | I don't understand why Brit...
4,Not Verified | I'm sure that BA have graduall...


In [6]:
df['reviews'][0]

"✅ Trip Verified | A simple story with an unfortunate outcome that really could happen to anyone. My partner and I recently started working after studying purchased two tickets to travel from London City Airport to Frankfurt. When we purchased the tickets, I mistakenly entered my name twice (e.g. Mr John Smith and Ms John Smith). Little did we know that our 1 simple mistake would cost us over 300 pounds. Upon arriving at the airport we were told there was no way to change the name (apparently they can only change 3 letters where there has been a typo?) and I had no other option to purchase the last remaining ticket if I wanted to board the flight - the price: almost seven times (!) higher than my original ticket. Zero empathy was shown. Zero alternative was offered. Trusting BA's staff and under the pretence that there was apparently no other way we could board the flight we bought this ticket. Immediately after I purchased the ticket I contacted BA's 'Commercial Change Booking Team' a

In [7]:
df.to_csv("data/BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating through 10 pages of 100 reviews each. If you want to collect more data, try increasing the number of pages.
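As a rough guide to how many pages to request for a larger dataset, the sketch below computes the page count for a target number of reviews and builds the listing URLs with the same pattern as the scraping loop. The helper names `pages_needed` and `page_urls` are hypothetical, and the site may hold fewer reviews than you ask for, so treat the result as an upper bound.

```python
import math

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def pages_needed(target_reviews, page_size=100):
    """Number of listing pages to request to reach at least target_reviews."""
    return math.ceil(target_reviews / page_size)

def page_urls(target_reviews, page_size=100, base_url=BASE_URL):
    """Build the paginated listing URLs (same pattern as the scraping loop)."""
    last_page = pages_needed(target_reviews, page_size)
    return [
        f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
        for i in range(1, last_page + 1)
    ]

urls = page_urls(2500)
print(len(urls))
print(urls[-1])
```

Setting `pages = pages_needed(2500, page_size)` in the scraping cell would then request 25 pages instead of 10.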

The next step is to clean this data by removing unnecessary text from each row. For example, the "✅ Trip Verified" or "Not Verified" marker at the start of a review can be removed wherever it appears, as it is not relevant to what we want to investigate.
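The marker removal described above can be sketched in plain Python before applying it to the whole column. This is a minimal illustration, assuming the marker always sits before the first `|`; the function name `strip_verification` is hypothetical and not part of the notebook.

```python
def strip_verification(review):
    """Drop a leading 'Trip Verified' / 'Not Verified' marker, if there is one."""
    head, sep, tail = review.partition("|")
    # Only treat the text before the first '|' as a marker when it mentions
    # "verified"; otherwise return the review unchanged (apart from trimming).
    if sep and "verified" in head.lower():
        return tail.strip()
    return review.strip()

samples = [
    "✅ Trip Verified | Flight was delayed.",
    "Not Verified | Fast and friendly check in.",
    "London Heathrow to Dubai. No marker here.",
]
for sample in samples:
    print(strip_verification(sample))
```

Checking for the word "verified" avoids cutting a review that has no marker but happens to contain a `|` somewhere in its text.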

In [2]:
# Load the data
df = pd.read_csv("data/BA_reviews.csv", index_col=0)

### Cleaning the data

1. The "Trip Verified" / "Not Verified" marker is removed from each review.
2. Symbols, digits, and punctuation are then removed, so that the analysis works on cleaned text.

In [3]:
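# Keep the text after the first '|'; reviews without a marker are kept whole instead of becoming NaN
df['reviews'] = df['reviews'].str.split('|', n=1).str[-1]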
df

Unnamed: 0,reviews
0,A simple story with an unfortunate outcome th...
1,Flight was delayed due to the inbound flight...
2,Fast and friendly check in (total contrast t...
3,I don't understand why British Airways is cl...
4,I'm sure that BA have gradually made their e...
...,...
995,British Airways is my favorite airline. Boei...
996,Rome to Newark via London. The first sector ...
997,London Heathrow to New York JFK. The First W...
998,London Heathrow to Dubai. This was the first...


In [4]:
import re

# Define a function to clean the text
def clean(text):
    # Replace every run of non-letter characters (digits, punctuation, symbols) with a single space
    text = re.sub('[^A-Za-z]+', ' ', str(text))
    return text

# Cleaning the text in the review column
df['Cleaned Reviews'] = df['reviews'].apply(clean)
df.head()

Unnamed: 0,reviews,Cleaned Reviews
0,A simple story with an unfortunate outcome th...,A simple story with an unfortunate outcome th...
1,Flight was delayed due to the inbound flight...,Flight was delayed due to the inbound flight ...
2,Fast and friendly check in (total contrast t...,Fast and friendly check in total contrast to ...
3,I don't understand why British Airways is cl...,I don t understand why British Airways is cla...
4,I'm sure that BA have gradually made their e...,I m sure that BA have gradually made their ec...


In [None]:
# NLTK imports for further cleaning of the reviews (not used in the cells below)
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

In [6]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def get_sentiment_scores(review):
    # VADER's compound score is a normalized value between -1 (most negative) and +1 (most positive)
    sentiment_dict = analyzer.polarity_scores(review)
    return sentiment_dict['compound']

sentiment_scores = df["reviews"].iloc[:6].apply(get_sentiment_scores)
sentiment_scores

0    0.5436
1    0.2598
2    0.9841
3    0.9682
4    0.7878
5    0.2568
Name: reviews, dtype: float64

In [8]:
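from transformers import pipeline

classifier = None

def get_sentiment(review):
    # Build the pipeline on first use and reuse it, rather than reloading the model for every review
    global classifier
    if classifier is None:
        classifier = pipeline('sentiment-analysis')
    return classifier(review)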

In [9]:
sentiment_scores = df["reviews"].iloc[:6].apply(get_sentiment)
sentiment_scores

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Downloading (…)lve/main/config.json: 100%|██████████| 629/629 [00:00<00:00, 126kB/s]
Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

KeyboardInterrupt: 

In [13]:
sentiment_scores = df["Cleaned Reviews"].iloc[:6].apply(get_sentiment_scores)

In [16]:
df["Cleaned Reviews"][3]

' I don t understand why British Airways is classified as a star airline The service is really mediocre The food is untasty and insufficient for a long haul trip Some members of the cabin crew are friendly but they are not attentive enough and create a very basic experience This really is all about getting from point A to B without what it used to be an enjoyable trip making experience The inflight entertainment is fairly good but you do need to bring your own water not to get dehydrated some snacks and ideally food and perhaps smile to yourself as otherwise you are faced with just a cold personality less experience '

In [14]:
sentiment_scores

0    0.2538
1    0.1901
2    0.9836
3    0.9682
4    0.7878
5    0.0353
Name: Cleaned Reviews, dtype: float64
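
To make the compound scores above easier to read, they can be mapped to coarse labels. The sketch below uses the ±0.05 cut-offs suggested in the VADER documentation; the function name `label_sentiment` is hypothetical, and the thresholds are a convention rather than something this notebook defines.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the first six cleaned reviews, copied from the output above
scores = [0.2538, 0.1901, 0.9836, 0.9682, 0.7878, 0.0353]
labels = [label_sentiment(score) for score in scores]
print(labels)
```

Labels like these are worth spot-checking against the text: review 3, printed above, reads as largely critical yet scores 0.9682, so a lexicon-based score can disagree with a human reading.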