
## Web scraping and analysis

This Jupyter notebook contains code for web scraping. The `BeautifulSoup` package is used to collect data from the web. The collected data is then saved to a local `.csv` file, and the analysis is performed on that file.

### Scraping data from Skytrax

The Skytrax website, [https://www.airlinequality.com], hosts a large amount of review data. For this task, we use only the reviews of British Airways as an airline.

Link: [https://www.airlinequality.com/airline-reviews/british-airways]

Now, `Python` and `BeautifulSoup` are used to step through the paginated review listing and collect the review text from each page.
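
As a minimal illustration of the extraction step, the sketch below runs the same `find_all` lookup on a small, hand-written HTML fragment instead of a live page. The fragment and the helper name `extract_reviews` are assumptions made for this example; the real page markup may differ beyond the `text_content` class used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure of a review page.
sample_html = """
<article>
  <div class="text_content">✅ Trip Verified | Great flight.</div>
  <div class="text_content">Not Verified | Delayed by two hours.</div>
  <div class="other">Unrelated footer text</div>
</article>
"""


def extract_reviews(html):
    """Return the text of every <div class="text_content"> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_="text_content")
    # strip=True trims the whitespace around each piece of text
    return [div.get_text(strip=True) for div in divs]


sample_reviews = extract_reviews(sample_html)
print(sample_reviews)
```

Only the two `text_content` blocks are returned; the unrelated `div` is ignored.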

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

# Loop over the paginated review listing, one page per iteration
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
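
The query string in the loop above is written by hand, including the percent-encoded colon in `post_date%3ADesc`. As a sketch of an alternative, the standard library's `urlencode` can produce the same string from plain values. The helper name `build_page_url` is an assumption for this example and is not part of the original notebook.

```python
from urllib.parse import urlencode

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(base_url, page, page_size=100):
    """Build the URL of one page of the review listing."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


first_page_url = build_page_url(base_url, 1)
print(first_page_url)
```

The result matches the URL built by the f-string in the scraping loop, so the helper could replace that line without changing which pages are requested.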


In [5]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head(10)

Unnamed: 0,reviews
0,✅ Trip Verified | A last minute business trip ...
1,✅ Trip Verified | Overall I would say disapp...
2,Not Verified | LHR to Delhi in Business. Exce...
3,Not Verified | Efficient and Smooth flight fr...
4,✅ Trip Verified | Was told we can not take han...
5,Not Verified | The flight was comfortable eno...
6,✅ Trip Verified | We had a really good flying...
7,✅ Trip Verified | Waited an hour to check-in ...
8,Not Verified | Not a great experience at all...
9,✅ Trip Verified | Boarding was difficult caus...


In [6]:
df.to_csv("include/BA_reviews.csv")
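
One detail worth knowing about `to_csv`: by default it also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference using an in-memory buffer and made-up data, so it does not touch `include/BA_reviews.csv`. The variable names are assumptions for this example.

```python
import io

import pandas as pd

# Made-up data standing in for the scraped reviews
df_demo = pd.DataFrame({"reviews": ["first review", "second review"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the extra column is not wanted, passing `index=False` to the `to_csv` call above avoids it. Note also that `to_csv` does not create missing directories, so the `include` folder needs to exist before the file is written.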

The loop above collected 1,000 reviews by iterating through the paginated listing on the website (10 pages of 100 reviews each).

Next, clean the data to remove unnecessary text from each row. For example, the "✅ Trip Verified" prefix can be removed from each row where it appears, since it is not relevant to the analysis.
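
The scraped rows carry one of two prefixes: "✅ Trip Verified |" or "Not Verified |". As a minimal sketch of this cleaning step, the function below strips either prefix using only the standard `re` module. The pattern and the name `strip_verification_prefix` are assumptions for this example; the notebook's own cell removes only the "✅ Trip Verified |" prefix.

```python
import re

# Matches either verification label plus the "|" separator at the start of a review
VERIFICATION_PREFIX = re.compile(r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_prefix(text):
    """Remove a leading verification label, if present, and trim whitespace."""
    return VERIFICATION_PREFIX.sub("", text).strip()


# Made-up sample rows in the same shape as the scraped reviews
samples = [
    "✅ Trip Verified | A last minute trip.",
    "Not Verified | LHR to Delhi.",
    "No prefix here.",
]
cleaned = [strip_verification_prefix(s) for s in samples]
print(cleaned)
```

Applied to a DataFrame column, this could look like `df["REVIEW"] = df["REVIEW"].map(strip_verification_prefix)`.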

In [7]:
# Converting the list data into a DataFrame
import re
df = pd.DataFrame(reviews)
df.columns = ["REVIEW"]

# Removing unwanted text (first text preprocessing step)
df.replace(re.compile(r'✅ Trip Verified \|'),'', inplace=True)
df

Unnamed: 0,REVIEW
0,"A last minute business trip to HND, a route I..."
1,Overall I would say disappointing. Due to B...
2,Not Verified | LHR to Delhi in Business. Exce...
3,Not Verified | Efficient and Smooth flight fr...
4,Was told we can not take hand luggage onto th...
...,...
995,Bridgetown to London Gatwick. Paid for a Bus...
996,St Lucia to Gatwick on which my wife and I w...
997,Chicago to London. Cancelled flights just a ...
998,London to Bangalore. This was the worst expe...
