![Forage](https://cdn2.downdetector.com/static/uploads/logo/British_Airways_logo_1.png)

# **Project Description**

## **Overview**
This project was completed as part of the British Airways data science job simulation on Forage.

The task is to scrape and collect customer feedback from a third-party source and analyse it to uncover findings for British Airways.

The review data was collected from Skytrax, a website that publishes airline reviews.

## **Workflow**
The workflow can be divided into 7 steps:

- Create the dataframe
- Getting data from website
- Adding data into dataframe
- Iterating this process for multiple webpages
- Running the code
- Custom cleaning of dataframe
- Doing sentiment analysis to analyse the emotional state of the customer

### **Create the Dataframe**

In [1]:
# Importing all the necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
from textblob import TextBlob
import matplotlib.pyplot as plt
import numpy as np


### **Getting data from website**

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def make_df(url):
    """
    Fetches the review-rating headers from a Skytrax page and creates an empty DataFrame with them as columns.

    Parameters:
    url (str): URL of a Skytrax page that contains British Airways reviews.

    Returns:
    pd.DataFrame: An empty DataFrame with the unique header texts as columns.
    """
    # Fetch the webpage
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # Find all review-rating headers
    headers = soup.find_all('td', class_='review-rating-header')
    
    # Extract text from headers and remove duplicates
    header_texts = list(set(header.text for header in headers))
    
    # Create a DataFrame with header texts as columns
    df = pd.DataFrame(columns=header_texts)
    
    return df
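
Because `make_df` removes duplicates with `set`, the order of the resulting columns is not guaranteed to match the order of the headers on the page. The following is a minimal sketch of an order-preserving alternative. It runs against a small hypothetical HTML fragment (the markup and the name `ordered_unique_headers` are assumptions made for illustration, not taken from Skytrax) instead of the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating two reviews; the real page markup may differ.
SAMPLE_HTML = """
<article>
  <table>
    <tr><td class="review-rating-header aircraft">Aircraft</td><td class="review-value">A320</td></tr>
    <tr><td class="review-rating-header route">Route</td><td class="review-value">London to Istanbul</td></tr>
  </table>
</article>
<article>
  <table>
    <tr><td class="review-rating-header route">Route</td><td class="review-value">Geneva to London</td></tr>
    <tr><td class="review-rating-header recommended">Recommended</td><td class="review-value">no</td></tr>
  </table>
</article>
"""

def ordered_unique_headers(html):
    """Return the header texts in first-seen order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    headers = soup.find_all("td", class_="review-rating-header")
    # dict.fromkeys keeps insertion order, unlike set()
    return list(dict.fromkeys(h.get_text(strip=True) for h in headers))

column_names = ordered_unique_headers(SAMPLE_HTML)
print(column_names)
```

Since the cleaning step later calls `reindex` with an explicit column list, the final column order does not depend on this choice; the sketch only matters if the intermediate DataFrame is inspected directly.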


### **Adding data into the dataframe**

In [3]:
import requests
from bs4 import BeautifulSoup

# Initialize lists to store extracted data
a, h, r, s_c, s_s, f_b, i_e, g_s, w_c, v, t, s_t, ro, d_f, rec = ([] for _ in range(15))

def get_data(url):
    """
    Extracts the fields of every review on the page and appends them to the module-level lists.
    Fields that are missing from a review are stored as the string 'nan'.
    
    Parameters:
    url (str): URL of a Skytrax page that contains British Airways reviews.
    
    """
    # Fetch the webpage
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # Find all articles on the page
    articles = soup.find_all('article')
    
    # Extract data from each article
    for article in articles:
        # Aircraft
        aircraft = article.find('td', class_='review-rating-header aircraft')
        a.append(aircraft.parent.text if aircraft else 'nan')
        
        # Header
        header = article.find('h2', class_='text_header')
        h.append(header.text if header else 'nan')
        
        # Rating value
        rating_value = article.find('span', itemprop='ratingValue')
        r.append(rating_value.text if rating_value else 'nan')
        
        # Seat comfort
        seat_comfort = article.find('td', class_='review-rating-header seat_comfort')
        if seat_comfort:
            s_c.append(seat_comfort.parent.find_all('span', class_='star fill')[-1].text)
        else:
            s_c.append('nan')
        
        # Cabin staff service
        cabin_staff_service = article.find('td', class_='review-rating-header cabin_staff_service')
        if cabin_staff_service:
            s_s.append(cabin_staff_service.parent.find_all('span', class_='star fill')[-1].text)
        else:
            s_s.append('nan')
        
        # Food and beverages
        food_and_beverages = article.find('td', class_='review-rating-header food_and_beverages')
        if food_and_beverages:
            f_b.append(food_and_beverages.parent.find_all('span', class_='star fill')[-1].text)
        else:
            f_b.append('nan')
        
        # Inflight entertainment
        inflight_entertainment = article.find('td', class_='review-rating-header inflight_entertainment')
        if inflight_entertainment:
            i_e.append(inflight_entertainment.parent.find_all('span', class_='star fill')[-1].text)
        else:
            i_e.append('nan')
        
        # Ground service
        ground_service = article.find('td', class_='review-rating-header ground_service')
        if ground_service:
            g_s.append(ground_service.parent.find_all('span', class_='star fill')[-1].text)
        else:
            g_s.append('nan')
        
        # Wifi and connectivity
        wifi_and_connectivity = article.find('td', class_='review-rating-header wifi_and_connectivity')
        if wifi_and_connectivity:
            w_c.append(wifi_and_connectivity.parent.find_all('span', class_='star fill')[-1].text)
        else:
            w_c.append('nan')
        
        # Value for money
        value_for_money = article.find('td', class_='review-rating-header value_for_money')
        if value_for_money:
            v.append(value_for_money.parent.find_all('span', class_='star fill')[-1].text)
        else:
            v.append('nan')
        
        # Type of traveller
        type_of_traveller = article.find('td', class_='review-rating-header type_of_traveller')
        if type_of_traveller:
            t.append(type_of_traveller.parent.find('td', class_='review-value').text)
        else:
            t.append('nan')
        
        # Cabin flown
        cabin_flown = article.find('td', class_='review-rating-header cabin_flown')
        if cabin_flown:
            s_t.append(cabin_flown.parent.find('td', class_='review-value').text)
        else:
            s_t.append('nan')
        
        # Route
        route = article.find('td', class_='review-rating-header route')
        if route:
            ro.append(route.parent.find('td', class_='review-value').text)
        else:
            ro.append('nan')
        
        # Date flown
        date_flown = article.find('td', class_='review-rating-header date_flown')
        if date_flown:
            d_f.append(date_flown.parent.find('td', class_='review-value').text)
        else:
            d_f.append('nan')
        
        # Recommended
        recommended = article.find('td', class_='review-rating-header recommended')
        if recommended:
            rec.append(recommended.parent.find('td', class_='review-value').text)
        else:
            rec.append('nan')
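
The seven star-rating blocks and the five text-value blocks in `get_data` each repeat the same lookup. As a hedged sketch, the two patterns could be factored into helpers like the ones below. The helper names `star_rating` and `review_value` and the sample markup are hypothetical, introduced only for this illustration. Unlike the original, `star_rating` also returns `'nan'` when a rating row exists but contains no filled stars, a case in which indexing with `[-1]` would raise an `IndexError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for one review; the real Skytrax page may differ.
SAMPLE_ARTICLE = """
<article>
  <table>
    <tr>
      <td class="review-rating-header seat_comfort">Seat Comfort</td>
      <td class="review-rating-stars stars">
        <span class="star fill">1</span><span class="star fill">2</span><span class="star fill">3</span><span class="star">4</span><span class="star">5</span>
      </td>
    </tr>
    <tr>
      <td class="review-rating-header route">Route</td>
      <td class="review-value">Geneva to London</td>
    </tr>
  </table>
</article>
"""

def star_rating(article, field):
    """Return the text of the last filled star for `field`, or 'nan'."""
    header = article.find("td", class_=f"review-rating-header {field}")
    if header is None:
        return "nan"
    filled = header.parent.find_all("span", class_="star fill")
    return filled[-1].text if filled else "nan"

def review_value(article, field):
    """Return the text of the value cell for `field`, or 'nan'."""
    header = article.find("td", class_=f"review-rating-header {field}")
    if header is None:
        return "nan"
    value = header.parent.find("td", class_="review-value")
    return value.text if value else "nan"

article = BeautifulSoup(SAMPLE_ARTICLE, "html.parser").find("article")
seat = star_rating(article, "seat_comfort")
route = review_value(article, "route")
missing = star_rating(article, "ground_service")
print(seat, route, missing)
```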

### **Iterating this process for multiple webpages**

In [4]:
def scrap(num_pages, base_url, n_url):
    """
    Loops through the review pages on Skytrax and builds the final DataFrame.

    Parameters:
    num_pages (int): Number of pages to scrape.
    base_url (str) : Paginated URL prefix, e.g. https://www.airlinequality.com/airline-reviews/british-airways/page/
    n_url (str)    : URL of the first review page, e.g. https://www.airlinequality.com/airline-reviews/british-airways

    Returns:
    pd.DataFrame: DataFrame with one row per scraped review.
    """
    print(f"Scraping {num_pages} pages from {base_url}")

    # Looping through all the pages
    for page_count in range(1, num_pages + 1):
        url = f"{base_url}{page_count}/"
        get_data(url)
    
    # Create DataFrame
    df = make_df(n_url)
    df['Recommended'] = rec
    df['Overall Rating'] = r
    df['Comments'] = h
    df['Seat Comfort'] = s_c
    df['Wifi & Connectivity'] = w_c
    df['Value for Money'] = v
    df['Ground Service'] = g_s
    df['Inflight Entertainment'] = i_e
    df['Cabin Staff Service'] = s_s
    df['Aircraft'] = a
    df['Food & Beverages'] = f_b
    df['Type Of Traveller'] = t
    df['Date Flown'] = d_f
    df['Seat Type'] = s_t
    df['Route'] = ro
    
    return df
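
One thing to keep in mind is that the lists filled by `get_data` live at module level, so calling `scrap` twice in the same session appends a second copy of every review. A minimal sketch of an alternative that keeps the collected rows local to each call is shown below. The names `scrape_pages`, `fetch_reviews`, and `fake_fetch_reviews` are hypothetical, and the fake fetcher stands in for the real page parsing so the example runs without network access.

```python
def scrape_pages(num_pages, base_url, fetch_reviews):
    """Collect one dict per review across pages 1..num_pages.

    fetch_reviews is any callable that takes a page URL and returns
    a list of dicts (one per review on that page).
    """
    records = []  # local to this call, so reruns start from empty
    for page_count in range(1, num_pages + 1):
        url = f"{base_url}{page_count}/"
        records.extend(fetch_reviews(url))
    return records

def fake_fetch_reviews(url):
    # Stand-in for the real parsing step; returns one made-up review per page.
    return [{"Comments": f"review from {url}", "Recommended": "yes"}]

rows = scrape_pages(2, "https://example.com/page/", fake_fetch_reviews)
rerun = scrape_pages(2, "https://example.com/page/", fake_fetch_reviews)
print(len(rows), len(rerun))
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain one column per key.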

### **Running the code**

In [5]:
# Running the code for 40 pages
num_pages=40
base_url='https://www.airlinequality.com/airline-reviews/british-airways/page/'
n_url='https://www.airlinequality.com/airline-reviews/british-airways'
df=scrap(num_pages,base_url,n_url)


Scraping 40 pages from https://www.airlinequality.com/airline-reviews/british-airways/page/


### **Custom Cleaning**

In [6]:
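# Drop reviews that share the same title
df_1 = df.drop_duplicates(subset=['Comments'])
# Keep rows with at least 4 non-missing values; copy so the in-place edits below act on an independent frame
df_2 = df_1.dropna(thresh=4).copy()
# Missing fields were scraped as the string "nan"; convert them to real NaN values
df_2.replace("nan", np.nan, inplace=True)
# Rows without any missing value (kept for reference)
df_2_2 = df_2.dropna()
# Drop the 'Staff Service' column, which scrap() never fills
df_2.drop('Staff Service', axis=1, inplace=True, errors='ignore')
# Put the columns in a fixed, readable order
df_2 = df_2.reindex(columns=['Overall Rating','Comments','Aircraft','Type Of Traveller','Seat Type','Route','Date Flown','Seat Comfort','Cabin Staff Service','Food & Beverages','Inflight Entertainment','Ground Service','Wifi & Connectivity','Value for Money','Recommended'])
# Manually drop the row with index label 11 (specific to this scrape)
df_2.drop(11, axis=0, inplace=True)
df_2.reset_index(drop=True, inplace=True)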
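
After the `'nan'` strings are replaced, the rating columns still hold text, because every value was scraped as a string. The following sketch, run on a small made-up frame (the data and the names `raw` and `cleaned` are assumptions for illustration), shows one way to convert such a column to numbers with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the scraped data: all values are strings.
raw = pd.DataFrame({
    "Comments": ['"first"', '"first"', '"second"'],
    "Seat Comfort": ["4", "4", "nan"],
    "Recommended": ["yes", "yes", "no"],
})

cleaned = (
    raw.drop_duplicates(subset=["Comments"])  # same de-duplication as above
       .replace("nan", np.nan)                # placeholder strings -> real NaN
       .reset_index(drop=True)
)

# errors="coerce" turns anything non-numeric into NaN instead of raising
cleaned["Seat Comfort"] = pd.to_numeric(cleaned["Seat Comfort"], errors="coerce")
print(cleaned.dtypes)
```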


In [7]:
df_2.head(15)

| | Overall Rating | Comments | Aircraft | Type Of Traveller | Seat Type | Route | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverages | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money | Recommended |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | "no fuss, no bother experience" | AircraftA320 | Solo Leisure | Economy Class | Vancouver to Gatwick | September 2024 | 4.0 | 5.0 | 5.0 | 5.0 | 5 | 4.0 | 4 | yes |
| 1 | 1 | "Who can trust BA to travel2 | AircraftA320 | Solo Leisure | Economy Class | London to Istanbul | October 2024 | 3.0 | 3.0 | 1.0 | 1.0 | 1 | NaN | 1 | no |
| 2 | 5 | "just another poor airline" | AircraftBoeing 777 | Couple Leisure | Business Class | San Francisco to Barcelona via London | October 2024 | 3.0 | 2.0 | 1.0 | 3.0 | 3 | 4.0 | 2 | no |
| 3 | 1 | "spent two hours trying to make contact with BA" | AircraftBoeing 787-9 | Business | Business Class | Mexico City to London Heathrow | October 2024 | 1.0 | 1.0 | NaN | NaN | 1 | NaN | 1 | no |
| 4 | 1 | "using another airline for future travel" | NaN | Solo Leisure | Business Class | Paris to Boston via London | July 2024 | 1.0 | 1.0 | NaN | NaN | 1 | NaN | 2 | no |
| 5 | 1 | "oversold tickets on our flight" | NaN | Family Leisure | Economy Class | Geneva to London | September 2024 | NaN | NaN | NaN | NaN | 1 | NaN | 1 | no |
| 6 | 1 | “Appalling service” | AircraftA380 | Business | Business Class | Johannesburg to London | October 2024 | 2.0 | 1.0 | 2.0 | 2.0 | 1 | 2.0 | 1 | no |
| 7 | 6 | “BA’s petty penny pinching ” | AircraftBoeing 787-900 | Business | Business Class | London to Mexico City | October 2024 | 1.0 | 5.0 | 1.0 | 1.0 | 2 | 2.0 | 3 | yes |
| 8 | 1 | “British arrogance with no manners” | AircraftA320N | Couple Leisure | Business Class | Berlin to London Heathrow | October 2024 | 1.0 | 1.0 | 1.0 | NaN | 2 | NaN | 1 | no |
| 9 | 2 | "Terrible customer service" | NaN | Couple Leisure | Business Class | London Heathrow to Philadelphia | September 2024 | 1.0 | 1.0 | 1.0 | NaN | 1 | 1.0 | 1 | no |
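
The Aircraft values in the output above carry the header label as a prefix (for example `AircraftA320`). This is most likely because `get_data` reads `aircraft.parent.text`, which joins the text of the header cell and the value cell. A small sketch of a post-processing helper that removes the label is given below; the name `strip_aircraft_label` is hypothetical and not part of the original notebook.

```python
def strip_aircraft_label(value):
    """Remove a leading 'Aircraft' label; leave other values untouched."""
    label = "Aircraft"
    if isinstance(value, str) and value.startswith(label):
        return value[len(label):].strip()
    return value  # e.g. NaN or None for reviews without an aircraft

examples = ["AircraftA320", "AircraftBoeing 777", None]
stripped = [strip_aircraft_label(v) for v in examples]
print(stripped)
```

With pandas, the helper could be applied column-wise, for instance `df_2['Aircraft'] = df_2['Aircraft'].map(strip_aircraft_label)`.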


## **Sentiment Analysis**

In [13]:
reviews=df_2['Comments'].tolist()

In [11]:
def analyze_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity

sentiments = [analyze_sentiment(review) for review in reviews]

In [10]:
print("\nSummary of Sentiments:")
print("\nNumber of positive reviews:",len([i for i in sentiments if i>0]))
print("Number of negative reviews:",len([i for i in sentiments if i<0]))
print("Number of neutral reviews:",len([i for i in sentiments if i==0]))


Summary of Sentiments:

Number of positive reviews: 101
Number of negative reviews: 136
Number of neutral reviews: 160
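
TextBlob's polarity is a float between -1.0 and 1.0, and the summary above counts a review as neutral only when its score is exactly 0. Short review titles may contain no words that the sentiment lexicon scores, which could partly explain the large neutral group. The sketch below shows the same three-way split as a reusable function with an optional tolerance band around zero. The names `label_polarity` and `summarize` and the example scores are made up for illustration and are not results from the scraped data.

```python
from collections import Counter

def label_polarity(score, tolerance=0.0):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if score > tolerance:
        return "positive"
    if score < -tolerance:
        return "negative"
    return "neutral"

def summarize(scores, tolerance=0.0):
    """Count how many scores fall into each sentiment label."""
    return Counter(label_polarity(s, tolerance) for s in scores)

# Made-up polarity scores, not taken from the reviews.
example_scores = [0.5, -0.7, 0.0, 0.05, -0.02]

strict = summarize(example_scores)                # same rule as the cell above
loose = summarize(example_scores, tolerance=0.1)  # weak scores count as neutral
print(strict)
print(loose)
```

With `tolerance=0.0` the function reproduces the counting rule used in the summary cell; a small positive tolerance treats weakly scored titles as neutral instead.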
