# Web Scraping
This notebook scrapes British Airways customer reviews from airlinequality.com with Python libraries and saves them to a CSV file.

## Step 1: Importing Necessary Libraries
We will use the following libraries:
- `requests` to fetch webpage content
- `BeautifulSoup` to parse HTML
- `pandas` to store and process the extracted data
- `numpy` to mark missing values with `np.nan`

In [1]:
import requests
import numpy as np
from bs4 import BeautifulSoup
import pandas as pd

## Step 2: Defining the Target Website
We specify the URL of the website we want to scrape and send an HTTP request.

In [2]:
r=requests.get("https://www.airlinequality.com/airline-reviews/british-airways/page/2/?sortby=post_date%3ADesc&pagesize=1000")
r

<Response [200]>
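The query string in the URL above can also be assembled from parameters, which makes it easier to request other pages. The following is a minimal sketch; `BASE_URL` and `build_review_url` are hypothetical names introduced here, and the URL pattern is simply inferred from the single address used in this notebook.

```python
from urllib.parse import urlencode

# Assumption: other pages follow the same pattern as the URL used above.
BASE_URL = "https://www.airlinequality.com/airline-reviews"


def build_review_url(airline, page=1, pagesize=100):
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A.
    query = urlencode({"sortby": "post_date:Desc", "pagesize": pagesize})
    return f"{BASE_URL}/{airline}/page/{page}/?{query}"


url = build_review_url("british-airways", page=2, pagesize=1000)
print(url)
```

The result can be passed to `requests.get(url, timeout=30)`. Calling `raise_for_status()` on the response raises an exception for an HTTP error, so a failed page is not parsed silently.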

## Step 3: Parsing the HTML Content
Using BeautifulSoup, we parse the webpage to extract relevant information.

In [3]:
soup=BeautifulSoup(r.content,"html.parser")

## Step 4: Extracting Specific Data
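Here, we extract each required field (reviewer name, title, verification status, overall rating, nationality, review date, review text, and the per-review ratings table) into its own list.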

In [4]:
authors = soup.find_all('span',attrs={"itemprop":"name"})
name=[]
for author in authors:
    name.append(author.text)
print(name[:10])
print(len(name))

['N Trabolini', 'Hungpin Hsieh', 'Shixin Chu', 'M Simpson', 'Natalie James-Deegan', 'O Morton', 'Andrew Miao', 'A Norton', 'Nakul Borade', 'Harry Aronowicz']
1000
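The `find_all` pattern used in these cells can be tried on a small snippet without fetching the page. This is only a sketch: `SAMPLE_HTML` is hypothetical markup reduced to the two elements queried here, not the real page structure. It also uses `get_text(strip=True)`, which removes surrounding whitespace such as a trailing `\r\n`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reduced to the two elements queried in this notebook.
SAMPLE_HTML = """
<article>
  <h2 class="text_header">"flight was cancelled 3 days in a row"\r\n</h2>
  <span itemprop="name">O Morton</span>
</article>
<article>
  <h2 class="text_header">"crew were very helpful"</h2>
  <span itemprop="name">Nakul Borade</span>
</article>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list comprehension replaces the explicit append loop.
names = [tag.get_text(strip=True)
         for tag in sample_soup.find_all("span", attrs={"itemprop": "name"})]
titles = [tag.get_text(strip=True)
          for tag in sample_soup.find_all("h2", attrs={"class": "text_header"})]

print(names)
print(titles)
```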


In [5]:
ti = soup.find_all('h2',attrs={"class":"text_header"})
title=[]
for t in ti:
    title.append(str(t.text))
print(title[:10])
print(len(title))

['"feel worthless as a customer"', '"asked them to cancel my ticket"', '"Overall, the journey was great"', '"you will not be able to get any help"', '"They were beyond amazing!"', '"flight was cancelled 3 days in a row"\r\n', '"service was totally unacceptable"', '"customer service is increasingly low cost in feel"', '"crew were very helpful"', '"Crew on board very friendly and helpful"']
1000


In [6]:
ti = soup.find_all('strong')
verification=[]
for t in ti:
    verification.append(t.text)
verification=verification[:1000]
print(verification[:10])
print(len(verification))

['Trip Verified', 'Trip Verified', 'Trip Verified', 'Not Verified', 'Not Verified', 'Trip Verified', 'Trip Verified', 'Trip Verified', 'Not Verified', 'Trip Verified']
1000
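Slicing the first 1000 `<strong>` tags relies on the page layout. Each review body on this page also starts with a status prefix such as `✅ Trip Verified |` or `Not Verified |`, so the status could instead be read from the review text. The sketch below illustrates that alternative; `split_verification` is a hypothetical helper, and it assumes the first `|` always separates the status from the body.

```python
def split_verification(review_text):
    """Split 'STATUS | body' into (status, body).

    Returns (None, text) when the text has no '|' separator.
    """
    head, sep, body = review_text.partition("|")
    if not sep:
        return None, review_text.strip()
    # Drop the check-mark emoji and surrounding whitespace from the status.
    status = head.replace("✅", "").strip()
    return status, body.strip()


print(split_verification("✅ Trip Verified |  I was supposed to fly from London City"))
print(split_verification("Not Verified | I have often flown British Airways"))
```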


In [7]:
ti = soup.find_all('span',attrs={"itemprop":"ratingValue"})
rating_out_of_10=[]
for t in ti:
    rating_out_of_10.append(t.text)
# Skip the first match, which is presumably the airline's overall rating rather than a review.
rating_out_of_10=rating_out_of_10[1:]
print(rating_out_of_10[:10])
print(len(rating_out_of_10))

['1', '1', '9', '2', '10', '1', '2', '6', '5', '10']
1000


In [8]:
ti = soup.find_all('h3',attrs={"class":"text_sub_header userStatusWrapper"})
nationality=[]
for t in ti:
    nationality.append(str(t.text.split('(')[1]).split(')')[0])
print(nationality[:10])
print(len(nationality))

['United Kingdom', 'Taiwan', 'China', 'United States', 'United Kingdom', 'United States', 'Hong Kong', 'United Kingdom', 'United Kingdom', 'Canada']
1000


In [9]:
ti = soup.find_all('h3',attrs={"class":"text_sub_header userStatusWrapper"})
date_of_review=[]
for t in ti:
    date_of_review.append(str(t.text.split('(')[1]).split(')')[1])
print(date_of_review[:10])
print(len(date_of_review))

[' 28th August 2019', ' 26th August 2019', ' 25th August 2019', ' 24th August 2019', ' 24th August 2019', ' 23rd August 2019', ' 23rd August 2019', ' 22nd August 2019', ' 21st August 2019', ' 21st August 2019']
1000
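The extracted dates are strings with a leading space and an ordinal suffix (`28th`, `23rd`). If real dates are needed later, they could be converted along the following lines. This is a sketch using only the standard library; `parse_review_date` is a hypothetical helper, and it assumes English month names in the `day month year` order shown above.

```python
import re
from datetime import date, datetime

# Matches "st", "nd", "rd" or "th" only when it directly follows a digit,
# so the "st" inside "August" is left alone.
_ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_review_date(text):
    """Convert a string such as ' 28th August 2019' into a datetime.date."""
    cleaned = _ORDINAL_SUFFIX.sub("", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y").date()


parsed = [parse_review_date(s) for s in [" 28th August 2019", " 23rd August 2019"]]
print(parsed)
```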


In [10]:
ti = soup.find_all('div',attrs={"class":"text_content"})
review=[]
for t in ti:
    review.append(t.text)
print(review[:1])
print(len(review))

['✅ Trip Verified |  I was supposed to fly from London City to Amsterdam on 24/7, Business Class. Once at LCY, my early pm flight appeared as cancelled. A state of confusion broke out at the airport, with people trying to find alternatives, however I managed to be rebooked on a later flight. On a busy day, I waited for hours at the airport. The new flight was displayed as “delayed” for 45mins. the delay kept on increasing, with the flight ultimately scheduled to depart after 10pm. I kept on waiting. The flight eventually ended up cancelled as well. Result: half a day wasted at LCY. I feel that BA’s punctuality and quality of customer service have been declining significantly lately. Delays and cancellations can happen, however I have been flying BA +/- 30 times a year and the % of flights delayed has been abnormally high lately, especially to/from Amsterdam and Germany. Also, the customer service seems to have a “minimum care” proposition (provide what is strictly required by the law, 

In [11]:
ti = soup.find_all('table',attrs={"class":"review-ratings"})
table=[]
for t in ti:
    table.append(t.text)
# Skip the first table, which presumably summarises the airline rather than a single review.
table=table[1:]
print(table[:1])
print(len(table))

['\nType Of TravellerBusiness\nSeat TypeBusiness Class\nRouteLondon City to Amsterdam\nDate FlownJuly 2019\n\nSeat Comfort\n12345\n\n\nCabin Staff Service\n12345\n\n\nGround Service\n12345\n\n\nValue For Money\n12345\n\nRecommendedno ']
1000


In [12]:
air=[]
for i in table:
    if('Aircraft' in i):
        j=i.split('\n')
        for x in j:
            if("Aircraft" in x):
                air.append(x.split("Aircraft")[1])
    else:
        air.append(np.nan)
len(air)

1000

In [13]:
Type_Of_Traveller=[]
for i in table:
    if('Type Of Traveller' in i):
        j=i.split('\n')
        for x in j:
            if("Type Of Traveller" in x):
                Type_Of_Traveller.append(x.split("Type Of Traveller")[1])
    else:
        Type_Of_Traveller.append(np.nan)
len(Type_Of_Traveller)

1000

In [14]:
Seat_Type=[]
for i in table:
    if('Seat Type' in i):
        j=i.split('\n')
        for x in j:
            if("Seat Type" in x):
                Seat_Type.append(x.split("Seat Type")[1])
    else:
        Seat_Type.append(np.nan)
len(Seat_Type)

1000

In [15]:
Route=[]
for i in table:
    if('Route' in i):
        j=i.split('\n')
        for x in j:
            if("Route" in x):
                Route.append(x.split("Route")[1])
    else:
        Route.append(np.nan)
len(Route)

1000

In [16]:
Date_Flown=[]
for i in table:
    if('Date Flown' in i):
        j=i.split('\n')
        for x in j:
            if("Date Flown" in x):
                Date_Flown.append(x.split("Date Flown")[1])
    else:
        Date_Flown.append(np.nan)
len(Date_Flown)

1000
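The five cells above repeat the same loop with a different label. One way to express that logic once is sketched below. `extract_field` is a hypothetical helper, and it matches the label at the start of a line (`startswith`) instead of anywhere in the line, which is a slightly stricter test than the cells above use. `SAMPLE_TABLE` is a shortened copy of the table text printed earlier.

```python
# Shortened copy of the first table's text from the output above.
SAMPLE_TABLE = (
    "\nType Of TravellerBusiness\nSeat TypeBusiness Class\n"
    "RouteLondon City to Amsterdam\nDate FlownJuly 2019\n\n"
    "Seat Comfort\n12345\n"
)


def extract_field(table_text, label, missing=None):
    """Return the text following `label` on its line, or `missing` if absent."""
    for line in table_text.split("\n"):
        if line.startswith(label):
            return line[len(label):].strip()
    return missing


FIELDS = ["Aircraft", "Type Of Traveller", "Seat Type", "Route", "Date Flown"]
row = {label: extract_field(SAMPLE_TABLE, label) for label in FIELDS}
print(row)
```

Passing `missing=np.nan` would reproduce the missing-value marker used in the cells above. Because the function returns exactly one value per table, the resulting list always has one entry per review.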

In [17]:
f = soup.find_all('table',attrs={"class":"review-ratings"})
star=[]
for i in f:
    star.append(str(i))
star=star[1:]

In [18]:
Seat_Comfort=[]
for i in star:
    if("Seat Comfort" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Seat Comfort" in x):
                Seat_Comfort.append(x.count("star fill"))
    else:
        Seat_Comfort.append(np.nan)
len(Seat_Comfort)

1000

In [19]:
Cabin_Staff_Service=[]
for i in star:
    if("Cabin Staff Service" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Cabin Staff Service" in x):
                Cabin_Staff_Service.append(x.count("star fill"))
    else:
        Cabin_Staff_Service.append(np.nan)
len(Cabin_Staff_Service)

1000

In [20]:
Food_Beverages=[]
for i in star:
    if("Food &amp; Beverages" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Food &amp; Beverages" in x):
                Food_Beverages.append(x.count("star fill"))
    else:
        Food_Beverages.append(np.nan)
len(Food_Beverages)

1000

In [21]:
Inflight_Entertainment=[]
for i in star:
    if("Inflight Entertainment" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Inflight Entertainment" in x):
                Inflight_Entertainment.append(x.count("star fill"))
    else:
        Inflight_Entertainment.append(np.nan)
len(Inflight_Entertainment)

1000

In [22]:
Ground_Service=[]
for i in star:
    if("Ground Service" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Ground Service" in x):
                Ground_Service.append(x.count("star fill"))
    else:
        Ground_Service.append(np.nan)
len(Ground_Service)

1000

In [23]:
Wifi_Connectivity=[]
for i in star:
    if("Wifi &amp; Connectivity" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Wifi &amp; Connectivity" in x):
                Wifi_Connectivity.append(x.count("star fill"))
    else:
        Wifi_Connectivity.append(np.nan)
len(Wifi_Connectivity)

1000

In [24]:
Value_For_Money=[]
for i in star:
    if("Value For Money" in i):
        j=i.split("5</span></td>")
        for x in j:
            if("Value For Money" in x):
                Value_For_Money.append(x.count("star fill"))
    else:
        Value_For_Money.append(np.nan)
len(Value_For_Money)

1000
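The star-rating cells also share one pattern: split the table HTML at the end of each star row, find the chunk that contains the label, and count the filled stars. A compact sketch of that pattern follows. `count_stars` is a hypothetical helper, and `SAMPLE_ROWS` is made-up markup that only mimics the two features the cells rely on (the `star fill` class and rows ending in `5</span></td>`).

```python
import math

# Hypothetical markup: 3 filled stars for Seat Comfort, 1 for Ground Service.
SAMPLE_ROWS = (
    '<tr><td class="review-rating-header">Seat Comfort</td>'
    '<td class="review-rating-stars stars">'
    '<span class="star fill">1</span><span class="star fill">2</span>'
    '<span class="star fill">3</span><span class="star">4</span>'
    '<span class="star">5</span></td></tr>'
    '<tr><td class="review-rating-header">Ground Service</td>'
    '<td class="review-rating-stars stars">'
    '<span class="star fill">1</span><span class="star">2</span>'
    '<span class="star">3</span><span class="star">4</span>'
    '<span class="star">5</span></td></tr>'
)


def count_stars(table_html, label, missing=math.nan):
    """Count filled stars in the row containing `label`, or return `missing`."""
    for chunk in table_html.split("5</span></td>"):
        if label in chunk:
            return chunk.count("star fill")
    return missing


print(count_stars(SAMPLE_ROWS, "Seat Comfort"))
print(count_stars(SAMPLE_ROWS, "Ground Service"))
print(count_stars(SAMPLE_ROWS, "Food &amp; Beverages"))
```

Because the search runs on raw HTML, labels containing an ampersand have to be written with `&amp;`, as in the cells above.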

In [25]:
Recommended=[]
for i in star:
    if("Recommended" in i):
        j=i.split("recommended")
        for x in j:
            if("Recommended" in x):
                if("yes" in x):
                    Recommended.append("Yes")
                elif("no" in x):
                    Recommended.append("No")
    else:
        Recommended.append(np.nan)
len(Recommended)

1000

## Step 5: Storing Extracted Data
We store the scraped data in a pandas DataFrame, with one column per extracted list.

In [26]:
dic={
    "Passenger_Name":name,
    "Nationality":nationality,
    "date_of_review":date_of_review,
    "Title":title,
    "Review":review,
    "Verification":verification,
    "Aircraft":air,
    "Type Of Traveller":Type_Of_Traveller,
    "Seat Type":Seat_Type,
    "Route":Route,
    "Date Flown":Date_Flown,
    "Seat Comfort":Seat_Comfort,
    "Cabin Staff Service":Cabin_Staff_Service,
    "Food & Beverages":Food_Beverages,
    "Inflight Entertainment":Inflight_Entertainment,
    "Ground Service":Ground_Service,
    "Wifi & Connectivity":Wifi_Connectivity,
    "Value For Money":Value_For_Money,
    "Recommended":Recommended,
    "OverAll_Rating":rating_out_of_10
}
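pandas raises a `ValueError` when the lists in such a dictionary differ in length, which is why every cell above prints a length. A small check like the following can report which column is off before the DataFrame is built. It is a sketch only; `check_equal_lengths` is a hypothetical helper and the sample dictionaries are illustrative.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise ValueError listing each length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty dictionary has no columns, so report a length of 0.
    return next(iter(lengths.values()), 0)


good = {"Passenger_Name": ["N Trabolini", "Shixin Chu"], "OverAll_Rating": ["1", "9"]}
print(check_equal_lengths(good))

bad = {"Passenger_Name": ["N Trabolini", "Shixin Chu"], "OverAll_Rating": ["1"]}
try:
    check_equal_lengths(bad)
except ValueError as err:
    print(err)
```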

In [27]:
df=pd.DataFrame(dic)
df.head()

,Passenger_Name,Nationality,date_of_review,Title,Review,Verification,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Wifi & Connectivity,Value For Money,Recommended,OverAll_Rating
0,N Trabolini,United Kingdom,28th August 2019,"""feel worthless as a customer""",✅ Trip Verified | I was supposed to fly from ...,Trip Verified,,Business,Business Class,London City to Amsterdam,July 2019,3.0,3.0,,,1.0,,1,No,1
1,Hungpin Hsieh,Taiwan,26th August 2019,"""asked them to cancel my ticket""",✅ Trip Verified | I purchased a ticket for Du...,Trip Verified,,Solo Leisure,Economy Class,Dublin to Mauritius via London,February 2019,,,,,,,1,No,1
2,Shixin Chu,China,25th August 2019,"""Overall, the journey was great""",✅ Trip Verified | London to Shanghai. The Con...,Trip Verified,Boeing 777-200,Family Leisure,First Class,London to Shanghai,August 2019,5.0,5.0,4.0,3.0,5.0,3.0,5,Yes,9
3,M Simpson,United States,24th August 2019,"""you will not be able to get any help""",Not Verified | I have often flown British Air...,Not Verified,,Business,Business Class,Chicago to Heathrow via Barcelona,August 2019,3.0,3.0,3.0,4.0,1.0,4.0,3,No,2
4,Natalie James-Deegan,United Kingdom,24th August 2019,"""They were beyond amazing!""",Not Verified | Good morning. I would like to ...,Not Verified,,Family Leisure,Economy Class,New Orleans to London,November 2018,5.0,5.0,5.0,5.0,5.0,5.0,5,Yes,10


In [28]:
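# Keep the text after the first "|" (the verification prefix); reviews without a prefix are kept whole.
df["Review"]=df["Review"].str.split('|',n=1).str[-1].str.strip()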

In [29]:
# Strip surrounding whitespace and the enclosing double quotes from each title.
df["Title"]=df["Title"].str.strip().str.strip('"')

In [30]:
df.head()

,Passenger_Name,Nationality,date_of_review,Title,Review,Verification,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Wifi & Connectivity,Value For Money,Recommended,OverAll_Rating
0,N Trabolini,United Kingdom,28th August 2019,feel worthless as a customer,I was supposed to fly from London City to Am...,Trip Verified,,Business,Business Class,London City to Amsterdam,July 2019,3.0,3.0,,,1.0,,1,No,1
1,Hungpin Hsieh,Taiwan,26th August 2019,asked them to cancel my ticket,I purchased a ticket for Dublin to Mauritius...,Trip Verified,,Solo Leisure,Economy Class,Dublin to Mauritius via London,February 2019,,,,,,,1,No,1
2,Shixin Chu,China,25th August 2019,"Overall, the journey was great",London to Shanghai. The Concorde room in Hea...,Trip Verified,Boeing 777-200,Family Leisure,First Class,London to Shanghai,August 2019,5.0,5.0,4.0,3.0,5.0,3.0,5,Yes,9
3,M Simpson,United States,24th August 2019,you will not be able to get any help,I have often flown British Airways and have ...,Not Verified,,Business,Business Class,Chicago to Heathrow via Barcelona,August 2019,3.0,3.0,3.0,4.0,1.0,4.0,3,No,2
4,Natalie James-Deegan,United Kingdom,24th August 2019,They were beyond amazing!,Good morning. I would like to write a review...,Not Verified,,Family Leisure,Economy Class,New Orleans to London,November 2018,5.0,5.0,5.0,5.0,5.0,5.0,5,Yes,10


## Step 6: Saving the Data to a CSV File
Finally, we save the extracted data in CSV format for future use.

In [31]:
df.to_csv("british_airways_reviews.csv",index=False)
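The `index=False` argument matters when the file is read back: without it, the row index is written as an unnamed first column, which `pd.read_csv` then loads as `Unnamed: 0`. The round trip below illustrates this on a two-row sample, using an in-memory buffer instead of a file. The sample frame is illustrative and uses values from the output above.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Passenger_Name": ["N Trabolini", "Shixin Chu"],
    "OverAll_Rating": ["1", "9"],
})

# Round trip without the index column.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Round trip with the default index=True for comparison.
buffer_with_index = io.StringIO()
sample.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(with_index.columns))
```

Note that `read_csv` infers column types, so ratings stored as strings come back as integers.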