In [None]:
'''Web scraping and analysis
This Jupyter notebook includes some code to get you started with web scraping. We will use a package called BeautifulSoup to collect the data from the web.
Once you've collected your data and saved it to a local .csv file, you can start on your analysis.
Scraping data from Skytrax
If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews of
British Airways, that is, reviews of the airline itself.
If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use
Python and BeautifulSoup to step through the paginated review listing and collect the review text from each page.'''

In [1]:
#Importing the Necessary Libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
# Set the base URL, the number of pages to scrape, and the number of reviews requested per page
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20
page_size = 200

In [4]:
#Creating the reviews list for storing the data
reviews=[]

In [5]:
# Request each listing page in turn and collect the review text it contains
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on a timeout or an HTTP error status
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 200 total reviews
Scraping page 2
   ---> 400 total reviews
Scraping page 3
   ---> 600 total reviews
Scraping page 4
   ---> 800 total reviews
Scraping page 5
   ---> 1000 total reviews
Scraping page 6
   ---> 1200 total reviews
Scraping page 7
   ---> 1400 total reviews
Scraping page 8
   ---> 1600 total reviews
Scraping page 9
   ---> 1800 total reviews
Scraping page 10
   ---> 2000 total reviews
Scraping page 11
   ---> 2200 total reviews
Scraping page 12
   ---> 2400 total reviews
Scraping page 13
   ---> 2600 total reviews
Scraping page 14
   ---> 2800 total reviews
Scraping page 15
   ---> 3000 total reviews
Scraping page 16
   ---> 3200 total reviews
Scraping page 17
   ---> 3400 total reviews
Scraping page 18
   ---> 3600 total reviews
Scraping page 19
   ---> 3800 total reviews
Scraping page 20
   ---> 3813 total reviews
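
The loop above fetches and parses in one step, so the parsing logic cannot be checked without requesting the live site. The following is a minimal sketch of one way to separate the two, assuming the review text still sits in `div` elements with the class `text_content`, as in the cell above. The names `extract_reviews`, `collect_reviews`, and `fetch_html` are hypothetical and not part of the original notebook. Unlike the original cell, the sketch strips surrounding whitespace from each review and stops as soon as a page comes back with fewer than `page_size` reviews, which is what happened on page 20 in the run above.

```python
from bs4 import BeautifulSoup


def extract_reviews(html):
    """Return the text of every <div class="text_content"> in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_="text_content")
    return [div.get_text(strip=True) for div in divs]


def collect_reviews(fetch_html, pages, page_size):
    """Collect reviews page by page, stopping once a page is not full.

    fetch_html is a hypothetical callable: page number -> HTML string.
    """
    collected = []
    for page in range(1, pages + 1):
        batch = extract_reviews(fetch_html(page))
        collected.extend(batch)
        if len(batch) < page_size:
            break  # a short page means there is nothing left to fetch
    return collected


# Offline check of the parsing step on a small hand-written snippet
sample_html = """
<div class="text_content">Flight delayed an hour.</div>
<div class="other">Not a review.</div>
<div class="text_content"> A very full flight. </div>
"""
sample_reviews = extract_reviews(sample_html)
print(sample_reviews)
```

With `requests`, `fetch_html` could be a small function that builds the paginated URL for page `i` and returns `requests.get(url, timeout=30).text`. Keeping it as a parameter means the parsing can be exercised on saved HTML first.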


In [6]:
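# Printing all 3,813 reviews at once exceeds the notebook's output rate limit
# (see the message below); a slice such as reviews[:5] is enough for a quick look.
print(reviews)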

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
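
The message above appears because the whole list was sent to the notebook output at once. A small preview avoids the limit without changing any server configuration. The sketch below uses a hypothetical helper, `preview`, and made-up sample strings; on the real data it would be called as `preview(reviews)`.

```python
def preview(items, n=5, width=80):
    """Return the first n items, each truncated to at most width characters."""
    lines = []
    for item in items[:n]:
        text = " ".join(item.split())  # collapse newlines and repeated spaces
        if len(text) > width:
            text = text[: width - 3] + "..."
        lines.append(text)
    return lines


# Made-up sample data standing in for the scraped list
demo_reviews = ["Flight delayed an hour due to weather. " * 5, "Short review."]
for line in preview(demo_reviews, n=2, width=40):
    print(line)
```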



In [7]:
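# pd.DataFrame(reviews) stores the text in a column named 0; the assignment below
# adds the same text again as "reviews", so the frame ends up with two identical columns.
df = pd.DataFrame(reviews)
df["reviews"] = reviews
df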

Unnamed: 0,0,reviews
0,✅ Trip Verified | Flight delayed an hour due ...,✅ Trip Verified | Flight delayed an hour due ...
1,✅ Trip Verified | A very full flight made Pre...,✅ Trip Verified | A very full flight made Pre...
2,✅ Trip Verified | The worst airline I’ve ever ...,✅ Trip Verified | The worst airline I’ve ever ...
3,✅ Trip Verified | I am surprised to be able t...,✅ Trip Verified | I am surprised to be able t...
4,✅ Trip Verified | Flew British Airways on BA ...,✅ Trip Verified | Flew British Airways on BA ...
...,...,...
3808,Flew LHR - VIE return operated by bmi but BA a...,Flew LHR - VIE return operated by bmi but BA a...
3809,LHR to HAM. Purser addresses all club passenge...,LHR to HAM. Purser addresses all club passenge...
3810,My son who had worked for British Airways urge...,My son who had worked for British Airways urge...
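3811,London City-New York JFK via Shannon on A318 b...,London City-New York JFK via Shannon on A318 b...
3812,SIN-LHR BA12 B747-436 First Class. Old aircraf...,SIN-LHR BA12 B747-436 First Class. Old aircraf...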


In [8]:
df.head(5)

Unnamed: 0,0,reviews
0,✅ Trip Verified | Flight delayed an hour due ...,✅ Trip Verified | Flight delayed an hour due ...
1,✅ Trip Verified | A very full flight made Pre...,✅ Trip Verified | A very full flight made Pre...
2,✅ Trip Verified | The worst airline I’ve ever ...,✅ Trip Verified | The worst airline I’ve ever ...
3,✅ Trip Verified | I am surprised to be able t...,✅ Trip Verified | I am surprised to be able t...
4,✅ Trip Verified | Flew British Airways on BA ...,✅ Trip Verified | Flew British Airways on BA ...


In [9]:
df.tail(5)

Unnamed: 0,0,reviews
3808,Flew LHR - VIE return operated by bmi but BA a...,Flew LHR - VIE return operated by bmi but BA a...
3809,LHR to HAM. Purser addresses all club passenge...,LHR to HAM. Purser addresses all club passenge...
3810,My son who had worked for British Airways urge...,My son who had worked for British Airways urge...
3811,London City-New York JFK via Shannon on A318 b...,London City-New York JFK via Shannon on A318 b...
3812,SIN-LHR BA12 B747-436 First Class. Old aircraf...,SIN-LHR BA12 B747-436 First Class. Old aircraf...
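
The head and tail shown above differ in one respect: recent reviews start with a "✅ Trip Verified |" prefix, while the oldest ones are plain text. A possible first cleaning step is to move that prefix into its own column. This is a sketch only, assuming every status prefix ends in the word "Verified" and is followed by a `|`; other prefixes may exist on the site and are not handled. The helper `split_status` and the sample rows are hypothetical.

```python
import pandas as pd


def split_status(text):
    """Split 'status | body' into (status, body); status is None without a prefix."""
    head, sep, tail = text.partition("|")
    # Only treat the head as a status when it looks like "... Verified"
    if sep and head.strip().endswith("Verified"):
        return head.replace("✅", "").strip(), tail.strip()
    return None, text.strip()


# Made-up rows mirroring the two shapes seen in the output above
sample = pd.DataFrame({"reviews": [
    "✅ Trip Verified | Flight delayed an hour.",
    "LHR to HAM. Purser addresses all club passengers.",
]})

pairs = [split_status(text) for text in sample["reviews"]]
sample["status"] = [status for status, _ in pairs]
sample["text"] = [body for _, body in pairs]
print(sample[["status", "text"]])
```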


In [10]:
df.shape

(3813, 2)

In [11]:
df.columns

Index([0, 'reviews'], dtype='object')
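
The column index above shows two columns, `0` and `'reviews'`, holding the same text. Building the frame from a dictionary gives a single named column instead. A minimal sketch with made-up sample strings; on the real data the list would be `reviews`, and `clean_df` is a hypothetical name.

```python
import pandas as pd

# Made-up sample data standing in for the scraped list
sample_reviews = [
    "Flight delayed an hour.",
    "A very full flight.",
]

# A dict maps the column name to its values, so no unnamed column 0 is created
clean_df = pd.DataFrame({"reviews": sample_reviews})
print(clean_df.columns.tolist())  # ['reviews']
print(clean_df.shape)             # (2, 1)
```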

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3813 entries, 0 to 3812
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   0        3813 non-null   object
 1   reviews  3813 non-null   object
dtypes: object(2)
memory usage: 59.7+ KB


In [14]:
df.isnull().sum()

0          0
reviews    0
dtype: int64

In [16]:
# Save the scraped reviews to a local .csv file
df.to_csv('Reviews.csv', sep=',', index=False, encoding='utf-8')
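
After saving, it can be worth reading the file back to confirm that the text, including non-ASCII characters such as "✅" and any commas inside reviews, survives the round trip. The sketch below does this on a made-up two-row frame written to a temporary directory; the file name `reviews_sample.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up rows: one with a non-ASCII prefix, one with a comma in the text
sample_df = pd.DataFrame({"reviews": [
    "✅ Trip Verified | Flight delayed.",
    "LHR to HAM, on time.",
]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reviews_sample.csv")
    sample_df.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

# Compare the text column value by value
print(restored["reviews"].tolist() == sample_df["reviews"].tolist())
```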