## **British Airways DS Virtual Experience Program**

### **Task 1** 
#### Title: Web Scraping to gain company insights

#### **Data Gathering**

In this section, we collect customer review data for British Airways from the airline review website [Skytrax](https://www.airlinequality.com/airline-reviews/british-airways), using the web scraping library Beautiful Soup. For each review, the code below collects the review text, the rating out of 10, the posting date, and the reviewer's country.

In [17]:
#importing important libraries to be used in this section

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests 

In [6]:
#create an empty list to collect all reviews
reviews  = []

#create an empty list to collect rating stars
stars = []

#create an empty list to collect date
date = []

#create an empty list to collect country the reviewer is from
country = []

In [7]:
# Loop over 35 result pages, requesting 100 reviews per page
for i in range(1, 36):
    page = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100")
    
    soup = bs(page.content, "html5")
    
    # review text
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text)
    
    # rating out of 10; if a rating block has no <span>, record "None" instead
    for item in soup.find_all("div", class_="rating-10"):
        try:
            stars.append(item.span.text)
        except AttributeError:
            print(f"Error on page {i}")
            stars.append("None")
            
    # date
    for item in soup.find_all("time"):
        date.append(item.text)
        
    # country
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

Error on page 29
Error on page 31
Error on page 31
Error on page 33
Error on page 33


In [12]:
#checking the lengths of the lists
len(reviews), len(stars), len(date), len(country)

(3451, 3486, 3451, 3451)

In [13]:
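# Truncate stars to the first 3451 entries so all four lists have the same length
# (note: equal lengths alone do not guarantee that each rating lines up with its review)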
stars = stars[:3451]

In [14]:
len(stars)

3451
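
The 35 surplus star ratings (3486 - 3451) equal the number of pages scraped, which suggests that each page contributes one `rating-10` block that does not belong to an individual review. If that is the case, truncating the list shifts ratings relative to their reviews. A minimal sketch of one way to keep the fields aligned is to parse each review container on its own. The `article` / `itemprop="review"` markup, the sample HTML, and the helper name `parse_reviews` are assumptions for illustration and have not been verified against the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one page-level rating followed by two reviews,
# the second of which has no rating block.
SAMPLE_HTML = """
<div class="rating-10"><span>5</span></div>
<article itemprop="review">
  <div class="rating-10"><span>1</span></div>
  <h3><span>A Reviewer</span> (United Kingdom) <time>2nd January 2023</time></h3>
  <div class="text_content">Trip Verified | Late departure.</div>
</article>
<article itemprop="review">
  <h3><span>B Reviewer</span> (Canada) <time>1st January 2023</time></h3>
  <div class="text_content">Not Verified | Fine flight.</div>
</article>
"""


def parse_reviews(html):
    """Return one dict per review so every field stays tied to its review."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article", itemprop="review"):
        rating = article.find("div", class_="rating-10")
        text = article.find("div", class_="text_content")
        posted = article.find("time")
        heading = article.find("h3")
        # The country is assumed to sit in the text node right after the name span
        sibling = heading.span.next_sibling if heading and heading.span else None
        rows.append({
            "reviews": text.get_text(strip=True) if text else None,
            "stars": rating.span.get_text(strip=True) if rating and rating.span else None,
            "date": posted.get_text(strip=True) if posted else None,
            "country": str(sibling).strip(" ()\n") if sibling else None,
        })
    return rows


rows = parse_reviews(SAMPLE_HTML)
```

With this shape, a missing rating shows up as `None` in the row it belongs to, and the list of dicts can be passed directly to `pd.DataFrame`.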

In [15]:
#creating a dataframe to store the outputs
df = pd.DataFrame({"reviews":reviews,"stars": stars, "date":date, "country": country})
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | Probably the worst business ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,2nd January 2023,United States
1,"✅ Trip Verified | Definitely not recommended, ...",1,2nd January 2023,United States
2,✅ Trip Verified | BA shuttle service across t...,2,2nd January 2023,United Kingdom
3,✅ Trip Verified | I must admit like many other...,8,1st January 2023,United Kingdom
4,Not Verified | When will BA update their Busi...,6,30th December 2022,United Kingdom


In [16]:
df.shape

(3451, 4)

#### **Data Cleaning**

In this section, we clean the data and make sure each column is in the right format for the analysis that follows.

In [18]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | Probably the worst business ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,2nd January 2023,United States
1,"✅ Trip Verified | Definitely not recommended, ...",1,2nd January 2023,United States
2,✅ Trip Verified | BA shuttle service across t...,2,2nd January 2023,United Kingdom
3,✅ Trip Verified | I must admit like many other...,8,1st January 2023,United Kingdom
4,Not Verified | When will BA update their Busi...,6,30th December 2022,United Kingdom


In [21]:
#check to see if we have any null values in our dataset
df.isnull().sum()

reviews    0
stars      0
date       0
country    0
dtype: int64

#### Adding a column to flag whether the trip was verified

In [22]:
df['verified'] = df.reviews.str.contains("Trip Verified")
df

Unnamed: 0,reviews,stars,date,country,verified
0,✅ Trip Verified | Probably the worst business ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,2nd January 2023,United States,True
1,"✅ Trip Verified | Definitely not recommended, ...",1,2nd January 2023,United States,True
2,✅ Trip Verified | BA shuttle service across t...,2,2nd January 2023,United Kingdom,True
3,✅ Trip Verified | I must admit like many other...,8,1st January 2023,United Kingdom,True
4,Not Verified | When will BA update their Busi...,6,30th December 2022,United Kingdom,False
...,...,...,...,...,...
3446,YYZ to LHR - July 2012 - I flew overnight in p...,7,29th August 2012,Canada,False
3447,LHR to HAM. Purser addresses all club passenge...,1,28th August 2012,United Kingdom,False
3448,My son who had worked for British Airways urge...,9,12th October 2011,United Kingdom,False
3449,London City-New York JFK via Shannon on A318 b...,8,11th October 2011,United States,False
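
Note that `str.contains("Trip Verified")` returns `False` both for reviews marked "Not Verified" and for older reviews that carry no verification label at all (such as the 2011 and 2012 rows above). If that distinction matters for the analysis, a three-state flag is one option. The sketch below is only an illustration; the function name `verification_status` is hypothetical.

```python
def verification_status(review):
    """Return True / False for labelled reviews, None when no label is present."""
    if "Trip Verified" in review:
        return True
    if "Not Verified" in review:
        return False
    return None  # review carries no verification label


examples = [
    "✅ Trip Verified | Probably the worst business class experience",
    "Not Verified | When will BA update their Business class cabin",
    "LHR to HAM. Purser addresses all club passengers",
]
statuses = [verification_status(text) for text in examples]
```

It could be applied to the dataframe with `df['reviews'].map(verification_status)`.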


In [23]:
#check the datatypes of the columns 
df.dtypes

reviews     object
stars       object
date        object
country     object
verified      bool
dtype: object

In [25]:
#we need to convert the date column to a datetime 
df['date'] = pd.to_datetime(df['date'])
df.head()

Unnamed: 0,reviews,stars,date,country,verified
0,✅ Trip Verified | Probably the worst business ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,2023-01-02,United States,True
1,"✅ Trip Verified | Definitely not recommended, ...",1,2023-01-02,United States,True
2,✅ Trip Verified | BA shuttle service across t...,2,2023-01-02,United Kingdom,True
3,✅ Trip Verified | I must admit like many other...,8,2023-01-01,United Kingdom,True
4,Not Verified | When will BA update their Busi...,6,2022-12-30,United Kingdom,False
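
`pd.to_datetime` infers the format of strings such as "2nd January 2023" here. Depending on the pandas version, that inference may be slow or emit warnings, so a more explicit alternative is to remove the ordinal suffix and pass a fixed format. This is a sketch on a small sample series, not the notebook's actual data.

```python
import pandas as pd

# Sample values in the same shape as the scraped date column
raw_dates = pd.Series(["2nd January 2023", "1st January 2023", "30th December 2022"])

# Remove st/nd/rd/th only when it directly follows a digit,
# so month names such as "August" are left untouched
without_suffix = raw_dates.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse with an explicit day / full month name / year format
parsed_dates = pd.to_datetime(without_suffix, format="%d %B %Y")
```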


In [26]:
df.dtypes

reviews             object
stars               object
date        datetime64[ns]
country             object
verified              bool
dtype: object

#### Cleaning the stars column to remove inconsistencies

In [28]:
df['stars'].unique()

array(['\n\t\t\t\t\t\t\t\t\t\t\t\t\t5', '1', '2', '8', '6', '4', '3', '5',
       '9', '7', '10', 'None'], dtype=object)

In [31]:
# Strip leading and trailing newline and tab characters from the ratings
df.stars = df.stars.str.strip("\n\t")

In [32]:
df.stars.unique()

array(['5', '1', '2', '8', '6', '4', '3', '9', '7', '10', 'None'],
      dtype=object)

In [33]:
df.stars.value_counts()

1       745
2       387
3       379
8       350
10      310
7       303
9       295
5       261
4       232
6       184
None      5
Name: stars, dtype: int64

Five rows have 'None' as the star rating. Since every record needs a star rating for the analysis, we remove these rows.

In [34]:
# .copy() avoids pandas' SettingWithCopyWarning on later column assignments
df = df[df['stars'] != 'None'].copy()

In [35]:
df.stars.unique()

array(['5', '1', '2', '8', '6', '4', '3', '9', '7', '10'], dtype=object)

In [36]:
df.shape

(3446, 5)

In [37]:
df.dtypes

reviews             object
stars               object
date        datetime64[ns]
country             object
verified              bool
dtype: object

In [39]:
#Converting the stars column to an integer 
df.stars = df['stars'].astype(int)

In [40]:
df.dtypes

reviews             object
stars                int32
date        datetime64[ns]
country             object
verified              bool
dtype: object
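
The strip, filter, and cast steps above can also be combined using `pd.to_numeric` with `errors="coerce"`, which turns any non-numeric entry into `NaN` instead of requiring a match on the literal string 'None'. The following is a sketch on a small sample series.

```python
import pandas as pd

# Sample values showing the three cases seen in the stars column
raw_stars = pd.Series(["\n\t\t\t5", "1", "10", "None"])

# Strip surrounding whitespace, then coerce anything non-numeric to NaN
numeric_stars = pd.to_numeric(raw_stars.str.strip(), errors="coerce")

# Drop the rows without a rating and cast the rest to integers
valid_stars = numeric_stars.dropna().astype(int)
```

On a dataframe, the same idea would be to assign the coerced column and then call `df.dropna(subset=['stars'])` before casting.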

In [41]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified
0,✅ Trip Verified | Probably the worst business ...,5,2023-01-02,United States,True
1,"✅ Trip Verified | Definitely not recommended, ...",1,2023-01-02,United States,True
2,✅ Trip Verified | BA shuttle service across t...,2,2023-01-02,United Kingdom,True
3,✅ Trip Verified | I must admit like many other...,8,2023-01-01,United Kingdom,True
4,Not Verified | When will BA update their Busi...,6,2022-12-30,United Kingdom,False


In [42]:
df.country.unique()

array(['United States', 'United Kingdom', 'Canada', 'France',
       'New Zealand', 'Czech Republic', 'Italy', 'Malaysia',
       'United Arab Emirates', 'Singapore', 'Netherlands', 'Australia',
       'Ireland', 'South Africa', 'Ghana', 'Hong Kong', 'Germany',
       'Switzerland', 'Bermuda', 'Botswana', 'Brazil', 'Panama', 'Sweden',
       'Greece', 'Nigeria', 'Russian Federation', 'Philippines', 'Spain',
       'Bulgaria', 'Poland', 'Thailand', 'Argentina', 'Mexico', 'Denmark',
       'India', 'Saint Kitts and Nevis', 'Vietnam', 'Belgium', 'Norway',
       'Jordan', 'Japan', 'Taiwan', 'China', 'Slovakia', 'Kuwait',
       'Israel', 'Qatar', 'Romania', 'South Korea', 'Saudi Arabia',
       'Hungary', 'Austria', 'Portugal', 'Cayman Islands', 'Costa Rica',
       'Egypt', 'Iceland', 'Laos', 'Turkey', 'Indonesia', 'Bahrain',
       'Dominican Republic', 'Cyprus', 'Luxembourg', 'Finland', 'Ukraine',
       '', 'Trinidad & Tobago', 'Barbados', 'Oman'], dtype=object)

In [43]:
df[df['country'] == '']

Unnamed: 0,reviews,stars,date,country,verified
2807,I travelled from London to Jo'burg and back on...,4,2015-04-08,,False
3112,St Lucia to London round trip. Full flight bot...,8,2014-10-20,,False


In [44]:
# Two reviews do not include the reviewer's country, so we drop them
# (.copy() avoids pandas' SettingWithCopyWarning on later column assignments)
df = df[df['country'] != ''].copy()

In [48]:
df.country.unique()

array(['United States', 'United Kingdom', 'Canada', 'France',
       'New Zealand', 'Czech Republic', 'Italy', 'Malaysia',
       'United Arab Emirates', 'Singapore', 'Netherlands', 'Australia',
       'Ireland', 'South Africa', 'Ghana', 'Hong Kong', 'Germany',
       'Switzerland', 'Bermuda', 'Botswana', 'Brazil', 'Panama', 'Sweden',
       'Greece', 'Nigeria', 'Russian Federation', 'Philippines', 'Spain',
       'Bulgaria', 'Poland', 'Thailand', 'Argentina', 'Mexico', 'Denmark',
       'India', 'Saint Kitts and Nevis', 'Vietnam', 'Belgium', 'Norway',
       'Jordan', 'Japan', 'Taiwan', 'China', 'Slovakia', 'Kuwait',
       'Israel', 'Qatar', 'Romania', 'South Korea', 'Saudi Arabia',
       'Hungary', 'Austria', 'Portugal', 'Cayman Islands', 'Costa Rica',
       'Egypt', 'Iceland', 'Laos', 'Turkey', 'Indonesia', 'Bahrain',
       'Dominican Republic', 'Cyprus', 'Luxembourg', 'Finland', 'Ukraine',
       'Trinidad & Tobago', 'Barbados', 'Oman'], dtype=object)

#### Cleaning Reviews

We extract the review text into a separate Series and clean it for semantic analysis.

In [52]:
!pip install nltk



In [53]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Oluwatobi.Ojo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [57]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [60]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Oluwatobi.Ojo\AppData\Roaming\nltk_data...


True

In [61]:
# Lemmatize words with the nltk library
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

lemma = WordNetLemmatizer()

# Build the stopword set once instead of once per word
stop_words = set(stopwords.words("english"))

# Note: str.strip treats its argument as a set of characters to trim from both
# ends, not as a literal prefix, so a leading "Not Verified |" is left in place
reviews_data = df.reviews.str.strip("✅ Trip Verified |")

# create an empty list to collect the cleaned corpus
corpus = []

# for each review: drop non-letters, lowercase, remove stopwords, lemmatize, rejoin
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)
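
Because `str.strip("✅ Trip Verified |")` removes any of those characters from both ends of the string, it can also eat into the review text itself, and it leaves the "Not Verified |" label in place (which is why "verified" appears at the start of some cleaned reviews). A regular expression anchored at the start of the string is one way to remove only the label. This is a sketch; the pattern and the helper name `strip_verification_prefix` are assumptions based on the labels visible in the sample rows.

```python
import re

# Optional check mark, then either label, then the "|" separator
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")


def strip_verification_prefix(review):
    """Remove a leading verification label, leaving the review text intact."""
    return PREFIX.sub("", review)


cleaned_verified = strip_verification_prefix("✅ Trip Verified | Tired crew, late departure")
cleaned_unverified = strip_verification_prefix("Not Verified |  When will BA update")
cleaned_unlabelled = strip_verification_prefix("LHR to HAM. Purser addresses all")

# For comparison: character-set stripping also removes the whole word "Tired",
# because every one of its letters appears in the strip argument
over_stripped = "✅ Trip Verified | Tired".strip("✅ Trip Verified |")
```

On the dataframe, the equivalent would be `df.reviews.str.replace(PREFIX, "", regex=True)`.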

In [64]:
df['corpus'] = corpus

In [65]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | Probably the worst business ...,5,2023-01-02,United States,True,probably worst business class experience ever ...
1,"✅ Trip Verified | Definitely not recommended, ...",1,2023-01-02,United States,True,definitely recommended especially business cla...
2,✅ Trip Verified | BA shuttle service across t...,2,2023-01-02,United Kingdom,True,ba shuttle service across uk still surprisingl...
3,✅ Trip Verified | I must admit like many other...,8,2023-01-01,United Kingdom,True,must admit like many others tend avoid ba long...
4,Not Verified | When will BA update their Busi...,6,2022-12-30,United Kingdom,False,verified ba update business class cabin across...


In [66]:
#Renaming the column name to a better name 
df.rename(columns={'corpus': 'review_raw_text'}, inplace=True)

In [67]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,review_raw_text
0,✅ Trip Verified | Probably the worst business ...,5,2023-01-02,United States,True,probably worst business class experience ever ...
1,"✅ Trip Verified | Definitely not recommended, ...",1,2023-01-02,United States,True,definitely recommended especially business cla...
2,✅ Trip Verified | BA shuttle service across t...,2,2023-01-02,United Kingdom,True,ba shuttle service across uk still surprisingl...
3,✅ Trip Verified | I must admit like many other...,8,2023-01-01,United Kingdom,True,must admit like many others tend avoid ba long...
4,Not Verified | When will BA update their Busi...,6,2022-12-30,United Kingdom,False,verified ba update business class cabin across...


The data is now clean and ready for exploratory data analysis.

In [68]:
#Exporting the cleaned data
df.to_csv("cleaned-BA-reviews.csv", index=False)