# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the Airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [9]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()   

Unnamed: 0,reviews
0,✅ Trip Verified | We were traveling as a fami...
1,✅ Trip Verified | Flight at 8.40am from DUB to...
2,✅ Trip Verified | Terrible. I have traveled t...
3,✅ Trip Verified | The customer service is ugl...
4,✅ Trip Verified | Most uncomfortable flight I...


In [10]:
len(reviews)

1000

In [11]:
df.to_csv("data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loops above collected 1000 reviews by iterating through the paginated pages on the website. However, if you want to collect more data, try increasing the number of pages!

 The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.

In [5]:
import requests
url ="https://www.airlinequality.com/airline-reviews/british-airways"

    # Collect HTML data from this page
response = requests.get(url)

    # Parse content
content = response.content

print(response)
print(content)

<Response [200]>
b'<!doctype html>\n\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html lang="en-GB">\n<!--<![endif]-->\n\n<head>\n    <meta charset="utf-8">\n\n    <title>British Airways Customer Reviews - SKYTRAX</title>\n\n    <!-- Google Chrome Frame for IE -->\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\n    <!-- mobile meta -->\n    <meta name="HandheldFriendly" content="True">\n    <meta name="MobileOptimized" content="320">\n    <meta name="viewport"\n        content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" />\n    <!-- icons & favicons -->\n    <link rel="apple-touch-icon" href="https:

In [8]:
reviews=[]
date = []
country = []

In [9]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests 
for i in range(1, 20):
    data = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100")
    
    soup = BeautifulSoup(data.content, "html.parser")
    for item in soup.find_all("div",class_="text_content"):
        reviews.append(item.text)
    for item in soup.find_all("time"):
        date.append(item.text)
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip("()"))


In [10]:
len(reviews)
len(date)
len(country)

1900

In [11]:
df = pd.DataFrame({"reviews":reviews,"date":date, "country": country})
df.head()
df.shape

(1900, 3)

In [12]:
import os

cwd = os.getcwd()
df.to_csv(cwd+ "/BA_Review.csv")
df['verified'] = df.reviews.str.contains("Trip Verified")
df['verified']

0       False
1        True
2        True
3        True
4        True
        ...  
1895    False
1896    False
1897    False
1898    False
1899    False
Name: verified, Length: 1900, dtype: bool

In [6]:
import nltk 
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [13]:
import pandas as pd
import matplotlib.pyplot as plt
import os

#regex
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()


reviews_data = df.reviews.str.strip("✅ Trip Verified |")

#create an empty list to collect cleaned data corpus
corpus =[]

#loop through each review, remove punctuations, small case it, join it and add it to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]',' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in set(stopwords.words("english"))]
    rev = " ".join(rev)
    corpus.append(rev)

In [14]:
df['corpus'] = corpus
df.head()

Unnamed: 0,reviews,date,country,verified,corpus
0,"Not Verified | I flew with numerous airlines, ...",16th June 2023,(Romania),False,verified flew numerous airline gotta admit bri...
1,✅ Trip Verified | We were traveling as a fami...,13th June 2023,(United States),True,traveling family people accident airport arriv...
2,✅ Trip Verified | Flight at 8.40am from DUB to...,12th June 2023,(Australia),True,flight dub lcy cancelled pm night text message...
3,✅ Trip Verified | Terrible. I have traveled t...,11th June 2023,(United Kingdom),True,ble traveled twice year via business class sig...
4,✅ Trip Verified | The customer service is ugl...,11th June 2023,(United States),True,customer service ugly tried calling two week a...


In [15]:

df.dtypes
df.date = pd.to_datetime(df.date)
df.date.head()

0   2023-06-16
1   2023-06-13
2   2023-06-12
3   2023-06-11
4   2023-06-11
Name: date, dtype: datetime64[ns]

In [16]:
df.isnull().value_counts()
df.country.isnull().value_counts()
df.drop(df[df.country.isnull() == True].index, axis=0, inplace=True)
df.shape
#resetting the index
df.reset_index(drop=True)

Unnamed: 0,reviews,date,country,verified,corpus
0,"Not Verified | I flew with numerous airlines, ...",2023-06-16,(Romania),False,verified flew numerous airline gotta admit bri...
1,✅ Trip Verified | We were traveling as a fami...,2023-06-13,(United States),True,traveling family people accident airport arriv...
2,✅ Trip Verified | Flight at 8.40am from DUB to...,2023-06-12,(Australia),True,flight dub lcy cancelled pm night text message...
3,✅ Trip Verified | Terrible. I have traveled t...,2023-06-11,(United Kingdom),True,ble traveled twice year via business class sig...
4,✅ Trip Verified | The customer service is ugl...,2023-06-11,(United States),True,customer service ugly tried calling two week a...
...,...,...,...,...,...
1895,London Heathrow to Bucharest with British Airw...,2016-11-03,(United Kingdom),False,london heathrow bucharest british airway fly d...
1896,✅ Verified Review | I have traveled with Brit...,2016-11-01,(United Kingdom),False,review traveled british airway many time past ...
1897,Flew London Heathrow to Cape Town via Johannes...,2016-11-01,(United Kingdom),False,flew london heathrow cape town via johannesbur...
1898,I was very disappointed to find that British A...,2016-11-01,(United Kingdom),False,disappointed find british airway middle road a...


In [17]:
df.to_csv(cwd + "/cleaned-BA-reviews.csv")