# British Airways - Customer Reviews Sentiment Analysis
*British Airways (BA) is the flag carrier airline of the United Kingdom (UK). Every day, thousands of BA flights arrive to and depart from the UK, carrying customers across the world. Whether it’s for holidays, work or any other reason, the end-to-end process of scheduling, planning, boarding, fuelling, transporting, landing, and continuously running flights on time, efficiently and with top-class customer service is a huge task with many highly important responsibilities.*


# Business Objective
* Customers who book a flight with BA will experience many interaction points with the BA brand. Understanding a customer's feelings, needs, and feedback is crucial for any business, including BA.

* Our goal is to examine how customers communicate their positive and negative experiences with the airline, and which attributes they consider when travelling with an airline.

* With this, decision-makers can understand which elements of the service most influence a positive review or improve the airline's brand image.

## Import required Libraries

In [None]:
#Import Libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from itertools import chain

## Data Preprocessing
* Web Scraping
* Storing & Integrating data
* Transform raw data into readable format
* Saving the file for further analysis

In [None]:
#Webscraping the data from the website using below base_url link
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"


#Defining the number of pages to scrape and the number of reviews per page
pages = int(input("Enter the no. pages to be scraped: "))
page_size = 100


#Creating an empty list to store the scraped data
reviews = []
rating = []
country = []
date = []
traveller = []


#Created a Loop to extract the data from website using beautiful soup:
for i in range(1, pages + 1):
    
    #print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page (fail fast on timeouts and HTTP errors)
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    for rat in parsed_content.find_all('div', {'class':'rating-10'}):
        rating.append(rat.getText())

    for ctr in parsed_content.find_all('h3'):
        country.append(ctr.span.next_sibling)

    for dt in parsed_content.find_all('time'):
        date.append(dt.get_text())

    for tvr in parsed_content.find_all('td', {"class":"review-value "}):
        traveller.append(tvr.get_text())


print(f"   ---> {len(reviews)} total reviews")
print(f"   ---> {len(rating)} total ratings")
print(f"   ---> {len(country)} total countries")
print(f"   ---> {len(date)} total date")
print(f"   ---> {len(traveller)} total travellers")

Enter the no. pages to be scraped: 35
   ---> 3454 total reviews
   ---> 3489 total ratings
   ---> 3454 total countries
   ---> 3454 total date
   ---> 13300 total travellers
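The counts above already hint at a problem: the ratings (3489) and traveller values (13300) do not match the 3454 reviews, so the lists cannot yet be combined row by row. A minimal sketch of a check that could be run right after scraping is shown below; `length_mismatches` and the toy lists are hypothetical and not part of the original notebook.

```python
def length_mismatches(expected, **columns):
    """Return {name: length} for every list whose length differs from expected."""
    return {name: len(values) for name, values in columns.items()
            if len(values) != expected}


# Toy lists standing in for the scraped data (hypothetical values).
reviews_demo = ["review a", "review b"]
rating_demo = ["5/10", "4/10", "1/10"]   # one extra entry, like a page-level rating
date_demo = ["9th January 2023", "8th January 2023"]

print(length_mismatches(len(reviews_demo), rating=rating_demo, date=date_demo))
```

An empty dictionary would mean every list lines up with the reviews.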


In [None]:
#Reviewing the date data
date[0:5]

['9th January 2023',
 '8th January 2023',
 '6th January 2023',
 '2nd January 2023',
 '2nd January 2023']
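The dates are plain strings with ordinal suffixes, so they cannot be sorted or filtered chronologically as they stand. One possible way to convert them, sketched with the standard library only, is shown below. `parse_review_date` is a hypothetical helper, and the sketch assumes English month names (the default locale behaviour of `strptime`).

```python
import re
from datetime import datetime


def parse_review_date(text):
    """Convert a string such as '9th January 2023' to a datetime.date."""
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y").date()


print(parse_review_date("9th January 2023"))   # 2023-01-09
print(parse_review_date("2nd January 2023"))   # 2023-01-02
```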

In [None]:
#Reviewing the rating data
rating[0:5]

['\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t5/10 ',
 '\n4/10\n',
 '\n5/10\n',
 '\n1/10\n',
 '\n1/10\n']

In [None]:
#Checking the unique values in rating
set(rating)

{'\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t5/10 ',
 '\n1/10\n',
 '\n10/10\n',
 '\n2/10\n',
 '\n3/10\n',
 '\n4/10\n',
 '\n5/10\n',
 '\n6/10\n',
 '\n7/10\n',
 '\n8/10\n',
 '\n9/10\n',
 '\r\n                        na\r\n                    '}

In [None]:
#Counting the error values
rating.count('\r\n                        na\r\n                    ')

5

In [None]:
#Replacing the error values with NaN
for i in range(len(rating)):
    if rating[i]=='\r\n                        na\r\n                    ':
        rating[i]='NaN'

#Checking the unique values
b = set(rating)
b = list(b)
print(b)

#Checking the count
print('Total Count: ', len(rating))

['\n4/10\n', '\n9/10\n', '\n6/10\n', '\n8/10\n', '\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t5/10 ', '\n2/10\n', '\n10/10\n', 'NaN', '\n7/10\n', '\n1/10\n', '\n5/10\n', '\n3/10\n']
Total Count:  3489


In [None]:
#Counting the airline's overall rating entry, which is scraped once per page and is not tied to any single review
rating.count('\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t5/10 ')

35

In [None]:
#Removing the overall rating entries so that only per-review ratings remain
rating = [x for x in rating if x!='\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t5/10 ']

#Checking the unique values
b = set(rating)
b = list(b)
print(b)

#Checking the count
print('Total Count: ', len(rating))

['\n4/10\n', '\n9/10\n', '\n6/10\n', '\n8/10\n', '\n2/10\n', '\n10/10\n', 'NaN', '\n7/10\n', '\n1/10\n', '\n5/10\n', '\n3/10\n']
Total Count:  3454


In [None]:
#Extracting only the ratings number from rating data 
rat_1 = [item.split('/', 1)[0] for item in rating]
rat_2 = [item.split() for item in rat_1]
fin_ratings = list(chain.from_iterable(rat_2))

#Reviewing the cleaned rating data
print('Count of total ratings:', len(fin_ratings),'\n')

#Checking the unique values in cleaned data
b = set(fin_ratings)
b = list(b)
print(b)

Count of total ratings: 3454 

['1', '8', '6', '7', 'NaN', '5', '3', '2', '10', '9', '4']
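The replace, filter and split steps above could also be collapsed into a single pass. The sketch below is one way to do that, not the notebook's own method. It assumes, based on the output shown earlier, that the page-level overall rating is the only entry containing tab characters, and it returns integers with `None` for missing ratings rather than the strings used above. `clean_ratings` is a hypothetical name.

```python
import re


def clean_ratings(raw_ratings):
    """Return one int (or None) per review rating, skipping page-level entries."""
    cleaned = []
    for raw in raw_ratings:
        if "\t" in raw:
            # Assumed to be the page-level overall rating, not a review rating.
            continue
        match = re.search(r"(\d+)\s*/\s*10", raw)
        cleaned.append(int(match.group(1)) if match else None)
    return cleaned


sample = ['\n\n\t\t\t5/10 ', '\n4/10\n', '\n10/10\n', '\r\n    na\r\n    ']
print(clean_ratings(sample))   # [4, 10, None]
```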


In [None]:
#Reviewing the country
country[0:5]

[' (United Kingdom) ',
 ' (United Kingdom) ',
 ' (Canada) ',
 ' (United States) ',
 ' (United States) ']

In [None]:
#Extracting only the country name from country data 
for i in range(len(country)):
    
    country[i] = re.sub(r"\W", " ", country[i]) #Replacing each non-alphanumeric character (here, the parentheses) with a space
    country[i] = country[i].strip()

print(f"   ---> {len(country)} total country entries")

   ---> 3454 total country entries


In [None]:
#After above operation
country[0]

'United Kingdom'
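Replacing every `\W` character also changes punctuation inside a name: a hyphenated country such as "Guinea-Bissau" would come out as "Guinea Bissau". A minimal alternative sketch that only takes the text between the parentheses is shown below. `extract_country` is a hypothetical helper, and the second and third inputs are made-up examples.

```python
import re


def extract_country(raw):
    """Return the text inside the first pair of parentheses, or None."""
    match = re.search(r"\(([^()]+)\)", raw)
    return match.group(1).strip() if match else None


print(extract_country(" (United Kingdom) "))   # United Kingdom
print(extract_country(" (Guinea-Bissau) "))    # Guinea-Bissau
print(extract_country(" "))                    # None
```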

In [None]:
#Reviewing the scraped reviews
reviews[0:3]

['✅ Trip Verified |  Flew ATL to LHR 8th Jan 2023. Was unlucky enough to be on board a 23 year old 777. Refit gave it a decent IFE screen and the seat looked decent. Although combine the IFE with the cheap and nasty earbuds, and any movie can be ruined. Headrest was great, just a pity little padding is used on the seat as my Wife and I were very uncomfortable. The leg room in general is poor, especially when passengers keep their seat reclined from start to finish. Zero room. Aircraft was tired. Rubber spacers falling out, silicone sealer falling apart in the toilets. Toilet seats old, stained. Rubber on arm rest was hanging off. No post take off drinks/snacks offered.  Meal was sent out after a couple of hours. Was poor. Chicken cubes that reminded me of dog food, mashed potatoes that were purified within an inch of their life. Stale rock hard roll, salad which was rice and carrots?! Dried crackers with no cheese. Kids meal was just as sad. Tiny leaf salad with enough dressing to refl

In [None]:
#Cleaning the reviews from reviews data 
for i in range(len(reviews)):
    
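    #Removing the "✅ Trip Verified |" or "Not Verified |" prefix; the pipe is escaped so it is matched literally rather than as regex alternation
    reviews[i] = re.sub(r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*", "", reviews[i])
    reviews[i] = reviews[i].strip()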

print(f"   ---> {len(reviews)} total reviews entries")

   ---> 3454 total reviews entries


In [None]:
#Reviewing the reviews after above operation
reviews[0:3]

['Flew ATL to LHR 8th Jan 2023. Was unlucky enough to be on board a 23 year old 777. Refit gave it a decent IFE screen and the seat looked decent. Although combine the IFE with the cheap and nasty earbuds, and any movie can be ruined. Headrest was great, just a pity little padding is used on the seat as my Wife and I were very uncomfortable. The leg room in general is poor, especially when passengers keep their seat reclined from start to finish. Zero room. Aircraft was tired. Rubber spacers falling out, silicone sealer falling apart in the toilets. Toilet seats old, stained. Rubber on arm rest was hanging off. No post take off drinks/snacks offered.  Meal was sent out after a couple of hours. Was poor. Chicken cubes that reminded me of dog food, mashed potatoes that were purified within an inch of their life. Stale rock hard roll, salad which was rice and carrots?! Dried crackers with no cheese. Kids meal was just as sad. Tiny leaf salad with enough dressing to refloat a shipwreck. Co
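The prefix removed above also records whether the trip was verified, which may be useful as a feature later on. A minimal sketch that keeps the status alongside the cleaned text is shown below. `split_verification` is a hypothetical helper, and only the two prefixes handled in the notebook are assumed.

```python
import re

# Matches an optional check mark, one of the two known statuses, and the pipe.
_PREFIX = re.compile(r"^\s*(?:✅\s*)?(Trip Verified|Not Verified)\s*\|\s*")


def split_verification(text):
    """Return (status, body); status is None when no known prefix is present."""
    match = _PREFIX.match(text)
    if match is None:
        return None, text.strip()
    return match.group(1), text[match.end():].strip()


print(split_verification("✅ Trip Verified |  Flew ATL to LHR 8th Jan 2023."))
# ('Trip Verified', 'Flew ATL to LHR 8th Jan 2023.')
```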

In [None]:
#Reviewing the travellers data
traveller[0:15]

['Boeing 777-200',
 'Family Leisure',
 'Economy Class',
 'Atlanta to London',
 'January 2023',
 'A380',
 'Family Leisure',
 'Economy Class',
 'London to Chicago',
 'December 2022',
 'Family Leisure',
 'Economy Class',
 'Istanbul to Vancouver via Heathrow',
 'January 2023',
 'A320, A380']

In [None]:
#Extracting only the Class type from travellers data 
r = []

for i in range(len(traveller)): 
    r.append(re.findall(r"[a-zA-Z]+\sClass|Premium\s+Economy", traveller[i]))

r = [x for x in r if x != []]
print(len(r))

3453


In [None]:
#Reviewing the list
r[0:10]

[['Economy Class'],
 ['Economy Class'],
 ['Economy Class'],
 ['Business Class'],
 ['Business Class'],
 ['Economy Class'],
 ['Economy Class'],
 ['Business Class'],
 ['Premium Economy'],
 ['Business Class']]

In [None]:
#Converting 2D list to 1D
seat = []

for j in range(len(r)):
    seat.append(r[j][0])

In [None]:
#Reviewing the list
seat[0:10]

['Economy Class',
 'Economy Class',
 'Economy Class',
 'Business Class',
 'Business Class',
 'Economy Class',
 'Economy Class',
 'Business Class',
 'Premium Economy',
 'Business Class']

In [None]:
#Checking the length
len(seat)

3453

In [None]:
#Reviewing the length of data before making the DataFrame
print(f"   ---> {len(reviews)} total reviews")
print(f"   ---> {len(fin_ratings)} total ratings")
print(f"   ---> {len(country)} total countries")
print(f"   ---> {len(date)} total date")
print(f"   ---> {len(seat)} total seats")


#Padding seat with 'NaN' when it is shorter than reviews, so that all columns have equal length
if len(reviews) > len(seat):
    seat += (len(reviews) - len(seat)) * ['NaN']

print(f"   ---> {len(seat)} total seats after imputing")

   ---> 3454 total reviews
   ---> 3454 total ratings
   ---> 3454 total countries
   ---> 3454 total date
   ---> 3454 total seats
   ---> 3454 total seats after above operation
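Because empty matches were filtered out before padding, the review without a class value is not necessarily the last one, so appending 'NaN' at the end can shift every later seat by one row. One way to avoid this is to pick the class per review and keep a placeholder where it is missing. The sketch below assumes the traveller values have been collected in one list per review, a grouping the original scraping loop does not produce. `seat_for_review` and the sample groups are hypothetical.

```python
import re

# Same pattern as in the notebook, compiled once.
CLASS_PATTERN = re.compile(r"[a-zA-Z]+\sClass|Premium\s+Economy")


def seat_for_review(values):
    """Return the first cabin-class value among one review's fields, else None."""
    for value in values:
        match = CLASS_PATTERN.search(value)
        if match:
            return match.group(0)
    return None


# Hypothetical per-review groups; the second review has no class value.
grouped = [
    ["Boeing 777-200", "Family Leisure", "Economy Class", "Atlanta to London"],
    ["Family Leisure", "London to Chicago"],
    ["A380", "Family Leisure", "Business Class", "London to Boston"],
]
seats = [seat_for_review(values) for values in grouped]
print(seats)   # ['Economy Class', None, 'Business Class']
```

With one entry per review, the seat list always has the same length and order as the reviews, and no padding step is needed.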


In [None]:
#Creating a Dataframe and storing above values 
df = pd.DataFrame({"Date": date, "Reviews":reviews, "Rating":fin_ratings, "Class":seat, "Country":country})

print("Shape of dataset", df.shape, '\n')

df.head()

Shape of dataset (3454, 5) 



| | Date | Reviews | Rating | Class | Country |
|---|---|---|---|---|---|
| 0 | 9th January 2023 | Flew ATL to LHR 8th Jan 2023. Was unlucky enou... | 4 | Economy Class | United Kingdom |
| 1 | 8th January 2023 | Great thing about British Airways A380 is the ... | 5 | Economy Class | United Kingdom |
| 2 | 6th January 2023 | The staff are friendly. The plane was cold, we... | 1 | Economy Class | Canada |
| 3 | 2nd January 2023 | Probably the worst business class experience I... | 1 | Business Class | United States |
| 4 | 2nd January 2023 | Definitely not recommended, especially for bus... | 2 | Business Class | United States |


In [None]:
#Saving dataframe into the csv file
#df.to_csv("/content/BA_reviews.csv", index=False)

**Note:**
* We have successfully collected the BA reviews data from the source website, then stored, transformed and integrated it into a DataFrame.
* In the next notebook, we will perform data cleaning, exploratory data analysis, visualisation, sentiment analysis, etc.

#### Thank you!