<a href="https://colab.research.google.com/github/ElenaHrytsai/BA_reviews_analytics/blob/main/British_Airways.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Scraping data from Skytrax
We will use the BeautifulSoup package to collect British Airways reviews from the Skytrax website.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

In [None]:
base_url = 'https://www.airlinequality.com/airline-reviews/british-airways'
pages = 10  # number of result pages to fetch (100 reviews per page)

reviews = []
rating = []
description = []

for i in range(1, pages + 1):
  print(f'scraping page {i}')
  # Newest reviews first, 100 per page
  url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize=100"

  response = requests.get(url, timeout=30)
  response.raise_for_status()  # stop early on an HTTP error
  parsed_content = BeautifulSoup(response.content, 'html.parser')

  # Review text, the stats block, and the rating sit in separate divs
  for para in parsed_content.find_all('div', {'class': 'text_content'}):
    reviews.append(para.get_text())
  for para in parsed_content.find_all('div', {'class': 'review-stats'}):
    description.append(para.get_text())
  for para in parsed_content.find_all('div', {'class': 'rating-10'}):
    rating.append(para.get_text())

  print(f'{len(reviews)} total reviews')


scraping page 1
100 total reviews
scraping page 2
200 total reviews
scraping page 3
300 total reviews
scraping page 4
400 total reviews
scraping page 5
500 total reviews
scraping page 6
600 total reviews
scraping page 7
700 total reviews
scraping page 8
800 total reviews
scraping page 9
900 total reviews
scraping page 10
1000 total reviews
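
The loop above fills three separate lists, which stay aligned only if every review on a page contains all three elements. A minimal sketch of a per-review alternative follows, run against a small inline HTML string rather than the live site. The `<article class="review">` wrapper, the sample markup, and the helper names `parse_reviews` and `_text_or_none` are assumptions made for illustration; only the `text_content`, `review-stats`, and `rating-10` class names come from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the <article class="review"> wrapper is an
# assumption; the inner class names are the ones used in the notebook.
SAMPLE_HTML = """
<article class="review">
  <div class="rating-10">1/10</div>
  <div class="text_content">Lost bags.</div>
  <div class="review-stats">Seat TypeEconomy Class</div>
</article>
<article class="review">
  <div class="text_content">Good flight.</div>
  <div class="review-stats">Seat TypeBusiness Class</div>
</article>
"""

def _text_or_none(parent, css_class):
    """Return the stripped text of the first matching div, or None."""
    node = parent.find('div', class_=css_class)
    return node.get_text(strip=True) if node else None

def parse_reviews(html):
    """Return one dict per review so fields cannot drift out of alignment."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for article in soup.find_all('article', class_='review'):
        records.append({
            'reviews': _text_or_none(article, 'text_content'),
            'description': _text_or_none(article, 'review-stats'),
            'rating': _text_or_none(article, 'rating-10'),
        })
    return records

records = parse_reviews(SAMPLE_HTML)
print(records)
```

A list of dicts like this can be passed straight to `pd.DataFrame(records)`, and a review with a missing element shows up as a missing value in its own row instead of shifting every later row.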


Let's build a DataFrame from the scraped lists, with one column per list.

In [None]:
df = pd.DataFrame()
df["reviews"] = reviews
df["description"] = description
df["rating"] = rating
df.head()

Unnamed: 0,reviews,description,rating
0,✅ Trip Verified | British airways lost bags ...,\n\nType Of TravellerFamily Leisure\nSeat Type...,\n1/10\n
1,✅ Trip Verified | The check in process and rew...,\n\nAircraftA320\nType Of TravellerBusiness\nS...,\n1/10\n
2,"✅ Trip Verified | We flew in November 2023, ...",\n\nType Of TravellerFamily Leisure\nSeat Type...,\n1/10\n
3,✅ Trip Verified | I left for London from Johan...,\n\nType Of TravellerFamily Leisure\nSeat Type...,\n1/10\n
4,✅ Trip Verified | After an excellent flight ...,\n\nAircraftA380\nType Of TravellerSolo Leisur...,\n5/10\n


The scraped data is messy, so next we clean it and prepare it for analysis.

In [4]:
# Strip the surrounding newlines and the '/10' suffix, leaving only the score
df['rating'] = df['rating'].replace(to_replace=[r'\n', '/10'], value='', regex=True)
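
After this step the ratings are still strings. A small sketch of turning them into numbers follows, using made-up sample values in the same shape as the scraped ones. It performs the same cleanup as the cell above, written with `str.replace`, and assumes that any entry that is not a number should become a missing value.

```python
import pandas as pd

# Illustrative values shaped like the scraped ratings; 'n/a' is a made-up bad entry.
raw = pd.Series(['\n1/10\n', '\n5/10\n', '\n10/10\n', 'n/a'])

# Remove newlines and the '/10' suffix in one pass.
cleaned = raw.str.replace(r'\n|/10', '', regex=True)

# Convert to numbers; anything that is not numeric becomes NaN.
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric.tolist())
```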

In [6]:
df.loc[1, 'description']

'\n\nAircraftA320\nType Of TravellerBusiness\nSeat TypeEconomy Class\nRouteLondon to Basel\nDate FlownJanuary 2025\n\nSeat Comfort\n12345\n\n\nCabin Staff Service\n12345\n\n\nFood & Beverages\n12345\n\n\nInflight Entertainment\n12345\n\n\nGround Service\n12345\n\n\nValue For Money\n12345\n\nRecommendedno \n'

In [7]:
# Each field is stored as '<label><value>'; the lookbehind anchors on the label
df['Aircraft'] = df['description'].str.extract(r'(?<=Aircraft)([A-Za-z0-9]+.*[-+]?\d+)')
df['Type of Traveller'] = df['description'].str.extract(r'(?<=Type Of Traveller)([A-Za-z0-9]+.*)')
df['Seat Type'] = df['description'].str.extract(r'(?<=Seat Type)([A-Za-z0-9]+.*)')
df['Route'] = df['description'].str.extract(r'(?<=Route)([A-Za-z0-9]+.*)')
df['Date Flown'] = df['description'].str.extract(r'(?<=Date Flown)([A-Za-z0-9]+.*[-+]?\d+)')
df['Recommended'] = df['description'].str.extract(r'(?<=Recommended)([A-Za-z0-9]+)')
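
The extraction above relies on each label being immediately followed by its value on the same line. As a sketch of the same idea with simpler patterns, the example below captures everything from the label to the end of the line and restricts `Recommended` to `yes` or `no`. It runs on a shortened copy of the description printed earlier plus one made-up row ending in `yesClose`; the `fields` dict and the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Shortened copy of the description printed above for row 1.
sample = ('\n\nAircraftA320\nType Of TravellerBusiness\nSeat TypeEconomy Class\n'
          'RouteLondon to Basel\nDate FlownJanuary 2025\n\nSeat Comfort\n12345\n\n'
          'Recommendedno \n')

# Second row is made up to show a missing field and the 'Close' suffix.
s = pd.Series([sample, '\n\nType Of TravellerSolo Leisure\nRecommendedyesClose\n'])

# Column name -> pattern with exactly one capture group.
fields = {
    'Aircraft': r'Aircraft([^\n]+)',
    'Type of Traveller': r'Type Of Traveller([^\n]+)',
    'Seat Type': r'Seat Type([^\n]+)',
    'Route': r'Route([^\n]+)',
    'Date Flown': r'Date Flown([^\n]+)',
    'Recommended': r'Recommended(yes|no)',
}

# expand=False returns a Series per pattern; strip() drops stray whitespace.
out = pd.DataFrame({
    col: s.str.extract(pat, expand=False).str.strip()
    for col, pat in fields.items()
})
print(out)
```

With `(yes|no)` as the capture group, a trailing `Close` is never captured, so no later cleanup of that column would be needed under this approach.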

In [16]:
df.describe()

Unnamed: 0,reviews,description,rating,Aircraft,Type of Traveller,Seat Type,Route,Date Flown,Recommended
count,1000,1000,1000,522,998,1000,995,1000,1000
unique,1000,998,10,57,4,4,677,71,4
top,✅ Trip Verified | British airways lost bags ...,\n\nAircraftA320\nType Of TravellerBusiness\nS...,1,A320,Couple Leisure,Economy Class,Vancouver to London,October 2019,no
freq,1,2,386,137,335,552,12,36,657


In [24]:
df['Recommended'].value_counts()

Unnamed: 0_level_0,count
Recommended,Unnamed: 1_level_1
no,657
yes,290
noClose,32
yesClose,21


In [25]:
# Some values carry a stray 'Close' suffix (e.g. 'noClose'); remove it
df['Recommended'] = df['Recommended'].replace(to_replace='Close', value='', regex=True)

In [23]:
df = df.drop('description', axis=1)

In [26]:
# Parse 'Month YYYY' strings and keep only the month as a Period
df['Date Flown'] = pd.to_datetime(df['Date Flown'], format='%B %Y').dt.to_period('M')
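
The conversion above can be tried on a few values in isolation. This is a minimal sketch on made-up sample strings, assuming the dates always look like `January 2025` and that month names are parsed under an English locale. Passing `errors='coerce'` is an addition for illustration: it turns missing or malformed entries into `NaT` instead of raising.

```python
import pandas as pd

# Illustrative values in the 'Month YYYY' shape seen in the descriptions.
dates = pd.Series(['January 2025', 'December 2024', None])

# Parse with an explicit format, then reduce to monthly periods.
periods = pd.to_datetime(dates, format='%B %Y', errors='coerce').dt.to_period('M')
print(periods.astype(str).tolist())
```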

In [28]:
df.head()

Unnamed: 0,reviews,rating,Aircraft,Type of Traveller,Seat Type,Route,Date Flown,Recommended
0,✅ Trip Verified | British airways lost bags ...,1,,Family Leisure,Premium Economy,Houston to cologne via London,2024-12,no
1,✅ Trip Verified | The check in process and rew...,1,A320,Business,Economy Class,London to Basel,2025-01,no
2,"✅ Trip Verified | We flew in November 2023, ...",1,,Family Leisure,Economy Class,London to Phoenix,2023-11,no
3,✅ Trip Verified | I left for London from Johan...,1,,Family Leisure,Economy Class,London to Johannesburg,2025-01,no
4,✅ Trip Verified | After an excellent flight ...,5,A380,Solo Leisure,Business Class,London to Cape Town via Johannesburg,2024-12,yes


Finally, we save the DataFrame to a CSV file on Google Drive.

In [None]:
df.to_csv("/content/drive/MyDrive/data/BA_reviews_columns.csv")
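
Because `to_csv` writes the index by default, the file gains an unnamed first column. The sketch below shows the round trip on a tiny made-up DataFrame, using an in-memory buffer in place of the Drive path; `df_small`, `naive`, and `restored` are illustrative names.

```python
import io
import pandas as pd

# Tiny stand-in for the cleaned DataFrame.
df_small = pd.DataFrame({'rating': ['1', '5'], 'Recommended': ['no', 'yes']})

buffer = io.StringIO()
df_small.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a column named 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column is used as the index again

print(list(naive.columns))
print(list(restored.columns))
```

Reading with `index_col=0` is one way to get the original columns back; writing with `index=False` is another, if the index carries no information.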