Scrape data from the web.
The first step is to scrape review data from the web. For this, use a website called <a href='https://www.airlinequality.com/lounge-reviews/british-airways/'>Skytrax</a>.

The team leader wants you to focus on reviews specifically about the airline itself. Collect as much data as you can, since more reviews will improve the output of your analysis. To get started, you can use the “Jupyter Notebook” in the Resources section below to run some Python code that collects the data.

### Getting Started

In [1]:
# importing libraries
import urllib3
import json
from bs4 import BeautifulSoup

In [2]:
http = urllib3.PoolManager()

In [3]:
url = "https://www.airlinequality.com/airline-reviews/british-airways"
response = http.request('GET', url)
print(response.status)
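# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.data, "html.parser")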

200


### Getting every page

In [4]:
url

'https://www.airlinequality.com/airline-reviews/british-airways'

In [5]:
urls = []
for i in range(357):
    pages = f'/page/{i}/'
    urls.append(url+pages)
    
urls[15]

'https://www.airlinequality.com/airline-reviews/british-airways/page/15/'
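
Because `range(357)` starts at 0, the first URL in the list above is `/page/0/`. A minimal sketch of building the same list starting from page 1, assuming the site numbers its listing pages from 1 and keeping the 357-page count used above; `build_page_urls` is a hypothetical helper name, not part of the original notebook:

```python
def build_page_urls(base_url, n_pages, first_page=1):
    """Return listing-page URLs for n_pages consecutive pages."""
    base = base_url.rstrip("/")  # avoid a double slash before /page/
    return [f"{base}/page/{page}/" for page in range(first_page, first_page + n_pages)]


base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
page_urls = build_page_urls(base_url, 357)  # 357 is the page count assumed above
print(page_urls[0])
print(len(page_urls))
```

With this numbering, `page_urls[k]` points at page `k + 1`, so the index and the page number differ by one.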

### Getting each review

In [10]:
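# Parse every review on one listing page into a list of dicts.
# `count` is the id given to the first review on the page.
def get_info(url, count):
    response = http.request('GET', url)
    if response.status != 200:
        # Keep the (records, last_id) return shape so callers can always unpack it;
        # count - 1 signals that no ids were used on this page.
        print(f"HTTP {response.status} for {url}")
        return [], count - 1
    soup = BeautifulSoup(response.data, "html.parser")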
    
    topic = soup.find_all("h2", {"text_header"})
    user = soup.find_all("h3", {"text_sub_header userStatusWrapper"})
    review = soup.find_all("div", {"text_content"})
    ratings = soup.find_all("div", "review-stats")

    users = []
    for i in range(len(topic)):
        reviews = {}
        reviews['id'] = i+count
        reviews['Topic'] = topic[i].get_text()
        
        try:
            try:
                # Header with a review-count badge, e.g. "3 reviews"
                number = user[i].span.find_all("span", "userStatusReviewCount")[0].text
                name = user[i].text
                name = name.replace(user[i].time.text, "")
                date = user[i].text.replace(name, "")
                name = name.replace('\n\n'+number+'\n\n\n\n', "")

            except Exception:
                # No badge: the header holds only the name and the date
                name = user[i].text
                name = name.replace(user[i].time.text, "")
                date = user[i].text.replace(name, "")
                name = name.replace('\n\n', "")
        except Exception:
            # Header missing or malformed: skip this review
            continue
            
        reviews['Name'] = name
        reviews['Date'] = date
        
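        try:
            # Newer reviews start with a link such as "Trip Verified" followed by " |"
            verified = review[i].find("a").get_text()
            text = review[i].get_text().replace(verified + " |", "")
        except AttributeError:
            # find("a") returned None: the review has no verification link
            verified = None
            text = review[i].get_text()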
        reviews['Verified'] = verified
        reviews['Text'] = text
        
        # getting ratings: each table row is a (label, value) pair of <td> cells
        rows = ratings[i].find_all("tr")
        for j in rows:
            cells = j.find_all("td")
            label = cells[0].get_text()
            value = cells[1].get_text()
            # Star ratings render as the text '12345'; count the filled stars instead
            if value == '12345':
                value = len(j.find_all("span", "star fill"))
            reviews[label] = value
                
        users.append(reviews)

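    # Id of the last review on the page; also valid when the page has no reviews,
    # where the loop variable i would be undefined
    return users, count + len(topic) - 1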

data, count = get_info(url, 0)
data[3]

{'id': 3,
 'Topic': '"Cheap, quick and efficient"',
 'Name': 'A Warten (Chile) ',
 'Date': '23rd May 2023',
 'Verified': 'Trip Verified',
 'Text': '✅   Online check in worked fine. Quick security check. Once onboard quick flight up to Glasgow, water and snack provided. All in all very pleased. Cheap, quick and efficient.',
 'Aircraft': 'A320',
 'Type Of Traveller': 'Solo Leisure',
 'Seat Type': 'Economy Class',
 'Route': 'London to Glasgow',
 'Date Flown': 'May 2023',
 'Seat Comfort': 5,
 'Cabin Staff Service': 5,
 'Food & Beverages': 5,
 'Ground Service': 5,
 'Value For Money': 5,
 'Recommended': 'yes'}
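
The `Date` field comes back as display text such as `'23rd May 2023'`, which does not sort chronologically. A small sketch of converting it to a `datetime.date`, assuming English month names and the day-month-year layout seen in the output above; `parse_review_date` is a hypothetical helper name:

```python
import re
from datetime import datetime


def parse_review_date(text):
    """Convert a string such as '23rd May 2023' to a datetime.date."""
    # Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    # %B expects a full month name; this assumes an English locale.
    return datetime.strptime(cleaned, "%d %B %Y").date()


published = parse_review_date("23rd May 2023")
print(published.isoformat())
```

The lookbehind keeps the substitution tied to a digit, so the "st" inside "August" is left alone.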

In [577]:
url

'https://www.airlinequality.com/airline-reviews/british-airways'

### Scraping every page 

In [578]:
data = []
count = 0
for url in urls:
    # get_info returns the last id used on the page, so the next page starts one higher
    d, count = get_info(url, count)
    data.extend(d)
    count += 1
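
A crawl over several hundred pages benefits from a pause between requests and from recording failed pages instead of aborting. The sketch below shows one way to structure that loop. `scrape_pages` and `fetch_page` are hypothetical names: `fetch_page(url, start_id)` stands in for a page parser that returns a plain list of review dicts, which is a simpler contract than the `get_info` function above, and the one-second delay is an arbitrary assumption.

```python
import time


def scrape_pages(page_urls, fetch_page, delay_seconds=1.0):
    """Collect records from every URL, pausing between requests.

    fetch_page(url, start_id) is a hypothetical callable returning a list of
    review dicts for that page, with ids starting at start_id.
    """
    records, failed = [], []
    for n, page_url in enumerate(page_urls):
        if n:  # no pause before the first request
            time.sleep(delay_seconds)
        try:
            # len(records) is the next free id, so no separate counter is needed
            page_records = fetch_page(page_url, len(records))
        except Exception as exc:
            failed.append((page_url, repr(exc)))  # remember the page for a later retry
            continue
        records.extend(page_records)
    return records, failed
```

The `failed` list can be passed back into `scrape_pages` later to retry only the pages that did not load.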

### DataFrame

In [579]:
import pandas as pd
df = pd.DataFrame(data)
df

Unnamed: 0,id,Topic,Name,Date,Verified,Text,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Inflight Entertainment,Ground Service,Wifi & Connectivity,Value For Money,Recommended,Aircraft
0,0,"""cancel your flight without notice""",B Sherry (United States),23rd May 2023,Not Verified,Top Ten REASONS to not use British Airways To...,Couple Leisure,Premium Economy,Dallas to Madrid via London,May 2023,1.0,1.0,1,1,1.0,1.0,1,no,
1,1,"""flights changed with no cost""",William Jackson (Spain),23rd May 2023,Not Verified,Easy check in on the way to Heathrow. The fl...,Couple Leisure,Economy Class,London to Valencia,March 2023,4.0,4.0,,,5.0,,4,yes,
2,2,"""Cheap, quick and efficient""",A Warten (Chile),23rd May 2023,Trip Verified,✅ Online check in worked fine. Quick securit...,Solo Leisure,Economy Class,London to Glasgow,May 2023,5.0,5.0,5,,5.0,,5,yes,A320
3,3,"""the worst major European airline""",E Michaels (United Kingdom),22nd May 2023,Trip Verified,✅ . The BA first lounge at Terminal 5 was a z...,Business,Business Class,London Heathrow to Malaga,May 2023,2.0,2.0,3,1,1.0,1.0,1,no,A320 Finnair
4,4,"""do not think the fare was worth the money""",Steve Bennett (United Kingdom),22nd May 2023,Not Verified,Paid a quick visit to Nice yesterday from Hea...,Couple Leisure,Business Class,London to Nice,May 2023,2.0,3.0,3,,4.0,1.0,1,no,A319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3558,3558,British Airways customer review,Michael Dielissen (Canada),29th August 2012,,YYZ to LHR - July 2012 - I flew overnight in p...,,Premium Economy,,,4.0,3.0,3,4,,,4,yes,
3559,3559,British Airways customer review,Nick Berry (United Kingdom),28th August 2012,,LHR to HAM. Purser addresses all club passenge...,,Business Class,,,4.0,5.0,4,,,,3,yes,
3560,3560,British Airways customer review,Avril Barclay (United Kingdom),12th October 2011,,My son who had worked for British Airways urge...,,Economy Class,,,,,,,,,4,yes,
3561,3561,British Airways customer review,C Volz (United States),11th October 2011,,London City-New York JFK via Shannon on A318 b...,,Premium Economy,,,1.0,3.0,5,,,,1,no,


In [582]:
df.to_csv('BritishAirlinesReview.csv', index=False)
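
The empty cells in the table above arise because reviews do not all carry the same fields (for example `Aircraft` and `Wifi & Connectivity` are often missing). A standard-library sketch of the same idea, collecting the union of keys before writing, may help when inspecting records without pandas. `records_to_csv` is a hypothetical helper and the two sample records are made up for illustration, not scraped data:

```python
import csv
import io


def records_to_csv(records, out):
    """Write dicts with differing keys to CSV, leaving missing fields blank."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:  # keep first-seen column order
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


# Illustrative records only
sample = [
    {"id": 0, "Seat Type": "Economy Class", "Recommended": "yes"},
    {"id": 1, "Seat Type": "Business Class", "Aircraft": "A320", "Recommended": "no"},
]
buffer = io.StringIO()
columns = records_to_csv(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of `io.StringIO`, open it with `newline=""` as the `csv` module documentation recommends.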

### Scratch

In [78]:
table = soup.find('table')
rows = table.find_all('tr')
rows[2].span

<span class="star fill">1</span>

In [75]:
rows[0].td.get_text()

'Food & Beverages'

In [7]:
topic = soup.find("h2", {"text_header"})
topic.get_text()

'"cancel your flight without notice"'

In [None]:
a.span.span.get_text()

In [44]:
print(a.span)
print(a.time)

<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<span itemprop="name">B Sherry</span></span>
<time datetime="2023-05-23" itemprop="datePublished">23rd May 2023</time>


In [45]:
a = soup.find("h3", {"text_sub_header userStatusWrapper"})
a.get_text()

'\n\nB Sherry (United States) 23rd May 2023'
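
The header text mixes name, country and date in one string, sometimes preceded by a review-count badge. As an alternative to chained `replace` calls, a regular expression can pull the three parts out in one pass. This is a sketch that assumes the `Name (Country) 23rd May 2023` layout seen in the outputs here; `split_header` is a hypothetical helper, and headers without a country in parentheses return `None`:

```python
import re

HEADER_RE = re.compile(
    r"(?P<name>[^\n()]+?)\s*"            # name: no newlines or parentheses
    r"\((?P<country>[^()]*)\)\s*"        # country in parentheses
    r"(?P<date>\d{1,2}(?:st|nd|rd|th)\s+[A-Za-z]+\s+\d{4})"  # e.g. 23rd May 2023
)


def split_header(text):
    """Return (name, country, date) from a review header, or None if no match."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("name").strip(), match.group("country"), match.group("date")


print(split_header("\n\nB Sherry (United States) 23rd May 2023"))
print(split_header("\n\n3 reviews\n\n\n\nS James (United Kingdom) 17th March 2022"))
```

Because the name group cannot cross a newline, a leading "3 reviews" badge on its own line is skipped without special handling.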

In [420]:
text = soup.find_all("div", {"text_content"})
text[2].find("a").text

'Trip Verified'

In [344]:
# script to split the ratings
r = {}

ratings = soup.find_all("div", "review-stats")
rows = ratings[1].find_all("tr")
for i in rows:
    name = i.find_all(("td", "review-value"))[0].get_text()
    value = i.find_all(("td", "review-value"))[1].get_text()
    if value == '12345':
        value = len(i.find_all("span", "star fill"))
    r[name] = value
    
print(r)

{'Type Of Traveller': 'Couple Leisure', 'Seat Type': 'Economy Class', 'Route': 'London to Valencia', 'Date Flown': 'March 2023', 'Seat Comfort': 4, 'Cabin Staff Service': 4, 'Ground Service': 5, 'Value For Money': 4, 'Recommended': 'yes'}
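
The star rows print as `'12345'` because each star is a `<span>` holding its own number; the rating is the count of spans whose class is exactly `star fill`. The sketch below isolates that step on a hand-written HTML fragment. The fragment is a simplified stand-in modelled on the outputs in this notebook, not a copy of the live page, and `parse_ratings` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one review's ratings table (assumed markup)
SAMPLE_HTML = """
<div class="review-stats"><table>
<tr><td>Seat Comfort</td><td>
<span class="star fill">1</span><span class="star fill">2</span>
<span class="star fill">3</span><span class="star fill">4</span>
<span class="star">5</span></td></tr>
<tr><td>Recommended</td><td class="review-value">yes</td></tr>
</table></div>
"""


def parse_ratings(stats_div):
    """Map each row label to a star count or to the cell's text."""
    result = {}
    for row in stats_div.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # not a label/value row
        label = cells[0].get_text(strip=True)
        if cells[1].find_all("span", class_="star"):
            # Filled stars carry the exact class string "star fill"
            result[label] = len(cells[1].find_all("span", class_="star fill"))
        else:
            result[label] = cells[1].get_text(strip=True)
    return result


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_ratings = parse_ratings(sample_soup.find("div", "review-stats"))
print(sample_ratings)
```

Checking for star spans, instead of comparing the text to `'12345'`, avoids depending on how whitespace between the spans is rendered.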


In [559]:
count

1130

In [None]:
name = user[i].text.replace(user[i].time.text, "")
reviews['user'] = name[2:]

In [565]:
u = 'https://www.airlinequality.com/airline-reviews/british-airways/page/113/'
response = http.request('GET', u)
soup = BeautifulSoup(response.data, "html.parser")

i=5
ratings = soup.find_all("div", "review-stats")
rows = ratings[i].find_all("tr")
for j in rows:
    name = j.find_all(("td", "review-value"))[0].get_text()
    value = j.find_all(("td", "review-value"))[1].get_text()
    if value == '12345':
        value = len(j.find_all("span", "star fill"))
    print(name, value)

Type Of Traveller Solo Leisure
Seat Type Economy Class
Route London to Belfast
Date Flown March 2018
Seat Comfort 5
Cabin Staff Service 5
Ground Service 4
Value For Money 3
Recommended no


In [499]:
user = soup.find_all("h3", {"text_sub_header userStatusWrapper"})
print(user[2].span.get_text())


P Gough


In [478]:
number

'3 reviews'

In [447]:
a[4]

{'id': 4,
 'topic': '"Ryanair has more finesse"',
 'user': '3 reviews\n\n\n\nS James (United Kingdom) ',
 'date': '17th March 2022',
 'verified': 'Trip Verified',
 'text': "✅  One would think that the number of crises BA incurs they would have had emergency planning down to a fine art. I last flew with BA May 2017 - yes, the weekend of the last but one IT crisis. I swore never again but ended up purchasing a ticket Aug 2018 to Bologna - BA cancelled the flight. Then covid hit so ended up with a voucher. Fast forward to Feb 26th, 2022. Received a message 5.30am 30mins from LHR that that my flight was cancelled. Arrived at the airport with everyone else just wandering and seeking answers. The staff that were there either told you to go online - a bit difficult as the BA systems were down or ring BA. Ha Ha or go home - not very helpful if you live two hours from LHR. By some miracle the app sprung into life, and I managed to get one of the last 10 seats on the evening flight. Sat in arriv