# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit <https://www.airlinequality.com> you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to <https://www.airlinequality.com/airline-reviews/british-airways> you will see this data. We can now use Python and `BeautifulSoup` to page through the review listings and collect the text of each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import nltk 
import gensim
import spacy

In [2]:
# scikit-learn

# Topic modelling (LDA) and dimensionality reduction via truncated SVD
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

# CountVectorizer yields raw term counts (pass binary=True for 0/1 presence);
# TfidfVectorizer yields TF-IDF weights
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import GridSearchCV

# clean printing
from pprint import pprint
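The difference between raw term counts and binary presence in `CountVectorizer` can be seen on a toy corpus. This is a minimal sketch; the two example sentences are made up for illustration and are not taken from the scraped reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only to illustrate the two modes
docs = ["good flight good crew", "bad flight"]

# Default behaviour: each cell holds how many times the term occurs
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs).toarray()

# binary=True: each cell is 1 if the term occurs at all, else 0
binary_vectorizer = CountVectorizer(binary=True)
binary = binary_vectorizer.fit_transform(docs).toarray()

# The vocabulary is sorted alphabetically: bad, crew, flight, good
vocabulary = sorted(count_vectorizer.vocabulary_)
print(vocabulary)
print(counts)
print(binary)
```

In the first document "good" appears twice, so the count matrix records 2 while the binary matrix records 1.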

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on timeouts or HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
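The paginated URLs used in the loop can also be built with the standard library, which takes care of percent-encoding the query string. This is a minimal sketch: `build_page_url` is a hypothetical helper name, and the query parameters are simply the ones that appear in the scraping cell.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100, base_url=BASE_URL):
    """Return the listing URL for one page of reviews, newest first."""
    # urlencode percent-encodes the ":" in "post_date:Desc" as "%3A"
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


urls = [build_page_url(i) for i in range(1, 4)]
for url in urls:
    print(url)
```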


In [4]:
df = pd.DataFrame({"reviews": reviews})
df.head()

| | reviews |
|---|---|
| 0 | ✅ Trip Verified \| I had a flight from Miami, F... |
| 1 | ✅ Trip Verified \| We started our day with BA ... |
| 2 | ✅ Trip Verified \| I fly British Airways weekl... |
| 3 | Not Verified \| Everything was ok until our co... |
| 4 | Not Verified \| My initial flight was cancelle... |


In [5]:
#df.to_csv("BA_reviews_LDA.csv")
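Saving the scraped reviews to a local `.csv` file and reading them back might look like the sketch below. The helper names, the sample rows, and the temporary file name are assumptions made for illustration. Passing `index=False` keeps the row index out of the file, so reloading does not produce an extra `Unnamed: 0` column.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(df, path):
    """Write the reviews to CSV without the row index."""
    df.to_csv(path, index=False, encoding="utf-8")


def load_reviews(path):
    """Read the reviews back into a DataFrame."""
    return pd.read_csv(path, encoding="utf-8")


# Hypothetical sample rows standing in for the scraped data
sample = pd.DataFrame(
    {"reviews": ["✅ Trip Verified | Great crew.", "Not Verified | Late departure."]}
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "BA_reviews_sample.csv"
    save_reviews(sample, csv_path)
    reloaded = load_reviews(csv_path)

print(reloaded)
```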

In [6]:
def remove_trip_verified(df):
    # regex=False treats "|" as a literal character, not regex alternation
    df['reviews'] = df['reviews'].str.replace('✅ Trip Verified | ', '', regex=False)
    df['reviews'] = df['reviews'].str.replace('Not Verified | ', '', regex=False)
    return df

df = remove_trip_verified(df.copy())
print(df)

                                               reviews
0    I had a flight from Miami, Florida to Dublin, ...
1     We started our day with BA in Prague. The fli...
2     I fly British Airways weekly not because I wa...
3     Everything was ok until our connecting flight...
4     My initial flight was cancelled 8 hours prior...
..                                                 ...
995   Phoenix to Accra via London. I had a great Cu...
996   Manchester to London. The bag drop process to...
997   San Diego to Hannover via London. I booked on...
998   London Heathrow to Stuttgart. Absolutely disg...
999   London to Johannesburg. Turning right to the ...

[1000 rows x 1 columns]
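Several rows in the output above still begin with a space, because the verification marker is sometimes followed by more than one space. A single anchored regular expression can remove either marker together with the surrounding whitespace. This is a sketch, not the notebook's own method: `strip_verification_prefix` and `PREFIX_PATTERN` are hypothetical names, and the sample strings are illustrative.

```python
import pandas as pd

# Matches either marker at the start of the text, the literal "|",
# and any whitespace around it
PREFIX_PATTERN = r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*"


def strip_verification_prefix(reviews: pd.Series) -> pd.Series:
    """Remove the verification marker and trim surrounding whitespace."""
    return reviews.str.replace(PREFIX_PATTERN, "", regex=True).str.strip()


# Hypothetical sample rows, including one without any marker
sample = pd.Series(
    [
        "✅ Trip Verified |  We started our day with BA in Prague.",
        "Not Verified | Everything was ok.",
        "London to Johannesburg. No prefix here.",
    ]
)

cleaned = strip_verification_prefix(sample)
print(cleaned.tolist())
```

Setting `regex=True` explicitly matters here: the `|` is escaped so it matches the literal separator, and the result does not depend on the default of the pandas version in use.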


In [7]:
df

| | reviews |
|---|---|
| 0 | I had a flight from Miami, Florida to Dublin, ... |
| 1 | We started our day with BA in Prague. The fli... |
| 2 | I fly British Airways weekly not because I wa... |
| 3 | Everything was ok until our connecting flight... |
| 4 | My initial flight was cancelled 8 hours prior... |
| ... | ... |
| 995 | Phoenix to Accra via London. I had a great Cu... |
| 996 | Manchester to London. The bag drop process to... |
| 997 | San Diego to Hannover via London. I booked on... |
| 998 | London Heathrow to Stuttgart. Absolutely disg... |


In [8]:
df.describe()

| | reviews |
|---|---|
| count | 1000 |
| unique | 1000 |
| top | I had a flight from Miami, Florida to Dublin, ... |
| freq | 1 |
