# British Airways Review Project  
Source of dataset: https://www.airlinequality.com/airline-reviews/british-airways/  

### Project Context  
British Airways (BA) is the flag carrier airline of the United Kingdom (UK). Every day, thousands of BA flights arrive in and depart from the UK, carrying customers across the world. Whether it’s for holidays, work or any other reason, the end-to-end process of scheduling, planning, boarding, fuelling, transporting, landing, and continuously running flights on time, efficiently and with top-class customer service is a huge task with many highly important responsibilities.  

As a data scientist at BA, it will be your job to apply your analytical skills to influence real life multi-million-pound decisions from day one, making a tangible impact on the business as your recommendations, tools and models drive key business decisions, reduce costs and increase revenue.  

Customers who book a flight with BA will experience many interaction points with the BA brand. Understanding a customer's feelings, needs, and feedback is crucial for any business, including BA.  

### Project Requirement  
1. Scraping and collecting customer feedback and review data from a third-party source  
2. Cleaning the collected dataset  
3. Analysing this data to present any insights, through:
    - Descriptive statistics
    - Topic Modeling
    - Sentiment Analysis
4. Presenting the findings

### Project Planning
1. Scraping & collecting customer feedback data has already been done in `scrape_script.py`
2. Exploratory Data Analysis
    - Data Structure
    - Data Quality
    - Content
3. Data Cleaning
4. Feature Engineering
5. Modeling  


### Dataset Columns
`id`, `review`, `rating`, `header`, `sub_header`, `author`, `time_published`, `aircraft`, `type_of_traveller`, `seat_type`, `route`, `date_flown`, `seat_comfort`, `cabin_staff_service`, `food_&_beverages`, `ground_service`, `value_for_money`, `recommended`, `inflight_entertainment`, `wifi_&_connectivity`, `verified`, `rating_only`, `city`, `type_aircraft`

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [3]:
# Raw string so the backslashes in the Windows-style path are not read as escape sequences
df = pd.read_csv(r'..\dataset\dataframe_totpage_35.csv', sep=';')
df.head(3)

| | id | review | rating | header | sub_header | author | time_published | Aircraft | Type Of Traveller | Seat Type | Route | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverages | Ground Service | Value For Money | Recommended | Inflight Entertainment | Wifi & Connectivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | anchor817891 | ✅ Trip Verified \| Booked a BA holiday to Marr... | 9/10 | "eventually make good on their promise" | Ian Sinclair (United Kingdom) 30th November 2022 | Ian Sinclair | 30th November 2022 | A320 | Solo Leisure | Business Class | London to Marrakech | June 2022 | 3.0 | 5.0 | 5.0 | 4.0 | 3.0 | yes | NaN | NaN |
| 1 | anchor817666 | ✅ Trip Verified \| Extremely sub-par service. H... | 2/10 | "Extremely sub-par service" | S SI (United States) 28th November 2022 | S SI | 28th November 2022 | A380 | Solo Leisure | Economy Class | San Francisco to London | November 2022 | 2.0 | 1.0 | 2.0 | 3.0 | 2.0 | no | 2.0 | 1.0 |
| 2 | anchor817196 | ✅ Trip Verified \| I virtually gave up on Brit... | 7/10 | "the service was excellent" | R Vines (United Kingdom) 26th November 2022 | R Vines | 26th November 2022 | A320 | Solo Leisure | Business Class | London to Lisbon | November 2022 | 3.0 | 4.0 | 4.0 | 3.0 | 3.0 | yes | NaN | NaN |


In [4]:
df.info()

# Note:
    # shape 3426, 20
    # some features have null values

# To do:
    # convert column names to lowercase snake_case
    # 'review' -> split verified status & review
    # 'rating' -> split and convert to float
    # 'header' -> replace double quote
    # 'sub_header' -> get the location 
    # 'Aircraft' -> check for 'A'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3426 entries, 0 to 3425
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      3426 non-null   object 
 1   review                  3426 non-null   object 
 2   rating                  3422 non-null   object 
 3   header                  3426 non-null   object 
 4   sub_header              3426 non-null   object 
 5   author                  3426 non-null   object 
 6   time_published          3426 non-null   object 
 7   Aircraft                1789 non-null   object 
 8   Type Of Traveller       2656 non-null   object 
 9   Seat Type               3424 non-null   object 
 10  Route                   2652 non-null   object 
 11  Date Flown              2648 non-null   object 
 12  Seat Comfort            3328 non-null   float64
 13  Cabin Staff Service     3321 non-null   float64
 14  Food & Beverages        3095 non-null   

In [5]:
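def search_city(x):
    # Extract the location in parentheses from `sub_header`, e.g. 'United Kingdom'.
    # Despite the name, the value is the reviewer's country rather than a city.
    match = re.search(r'\(.*\)', x) if isinstance(x, str) else None
    if match is None:
        return ''
    return match.group().replace('(', '').replace(')', '')

# Verification status is the text before the first ' |' in `review`.
# Caveat: a review without a ' |' separator ends up with its full text in `verified`.
df['verified'] = df['review'].apply(lambda x: x.split(' |')[0])
df['verified'] = df['verified'].str.replace('✅ ', '')
# '9/10' -> 9.0; missing ratings stay NaN.
df['rating_only'] = df['rating'].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)
df['rating_only'] = df['rating_only'].astype('float')
# Strip the double quotes that wrap each review title.
df['header'] = df['header'].str.replace('"', '')
df['city'] = df['sub_header'].apply(search_city)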


df.columns = [col_name.lower().replace(' ','_') for col_name in df.columns]
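
The `split(' |')[0]` step above returns the whole review whenever no ` |` separator is present, which is a likely reason the categorical summary reports 1518 distinct `verified` values. The following is a minimal sketch of a stricter split, not the notebook's method. It assumes the status always precedes the first ` |`; the helper name `split_verified` and the `'Unknown'` placeholder are hypothetical.

```python
def split_verified(review):
    """Return (status, text) for one raw review string.

    Assumption: a status prefix such as '✅ Trip Verified' is separated
    from the review body by the first ' |'. Reviews without that
    separator get the hypothetical placeholder status 'Unknown'.
    """
    if not isinstance(review, str):
        return ('Unknown', '')
    head, sep, tail = review.partition(' |')
    if not sep:
        # No separator: the whole string is review text, not a status.
        return ('Unknown', review.strip())
    status = head.replace('✅', '').strip()
    return (status, tail.strip())


# Illustrative inputs modelled on the rows shown earlier.
samples = [
    '✅ Trip Verified | Extremely sub-par service.',
    'Flown 6 flights on BA recently.',
]
for s in samples:
    print(split_verified(s))
```

Applied to the frame, this could look like `df['verified'] = df['review'].map(lambda r: split_verified(r)[0])`, which keeps the status column to a handful of labels instead of mixing in full review texts.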

In [10]:
for col in df.columns:
    print(f'`{col}`, ', end='')

`id`, `review`, `rating`, `header`, `sub_header`, `author`, `time_published`, `aircraft`, `type_of_traveller`, `seat_type`, `route`, `date_flown`, `seat_comfort`, `cabin_staff_service`, `food_&_beverages`, `ground_service`, `value_for_money`, `recommended`, `inflight_entertainment`, `wifi_&_connectivity`, `verified`, `rating_only`, `city`, `type_aircraft`, 

In [6]:
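def type_aircraft_search(x):
    # Group raw aircraft strings into coarse families; non-strings (NaN) pass through.
    if not isinstance(x, str):
        return x
    has_a = 'A' in x            # any capital 'A', intended to catch Airbus codes such as A320
    has_boeing = 'Boeing' in x
    if has_a and has_boeing:
        return 'A / Boeing'
    if has_a:
        return 'A'
    if has_boeing:
        return 'Boeing'
    if 'Embraer' in x:
        return 'Embraer'
    return x  # unrecognised values are left unchanged


df['type_aircraft'] = df['aircraft'].apply(type_aircraft_search)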
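
The `'A' in x` test above matches any capital "A" in the string, not only Airbus model codes. A hedged alternative is sketched below with regular expressions. The patterns are assumptions about how aircraft are written in the reviews (an `A` followed by three digits or the word "Airbus"; the word "Boeing" or a 7x7 number), `aircraft_family` is a hypothetical name, and the output labels are kept the same as in the cell above so results stay comparable.

```python
import re

# Assumed spellings: 'A320', 'A350-1000', 'Airbus ...' for Airbus;
# 'Boeing ...' or a bare 7x7 number such as '777' for Boeing.
AIRBUS_RE = re.compile(r'\bA\d{3}|\bAirbus\b', re.IGNORECASE)
BOEING_RE = re.compile(r'\bBoeing\b|(?<!\d)7\d7(?!\d)', re.IGNORECASE)
EMBRAER_RE = re.compile(r'\bEmbraer\b', re.IGNORECASE)


def aircraft_family(value):
    """Map a raw aircraft string to a coarse family label."""
    if not isinstance(value, str):
        return value  # keep NaN / None as they are
    has_airbus = bool(AIRBUS_RE.search(value))
    has_boeing = bool(BOEING_RE.search(value))
    if has_airbus and has_boeing:
        return 'A / Boeing'
    if has_airbus:
        return 'A'
    if has_boeing:
        return 'Boeing'
    if EMBRAER_RE.search(value):
        return 'Embraer'
    return value  # unrecognised values are left unchanged


# Made-up sample strings for illustration only.
for raw in ['A320', 'Boeing 777 / A380', 'Embraer 190', 'ATR 72']:
    print(raw, '->', aircraft_family(raw))
```

With these patterns a value such as `'ATR 72'` is left as it is, whereas the substring check would label it `'A'`.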

In [7]:
df.describe(include='number').T

# Note:
    # Overall descriptive statistics of numerical features

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| seat_comfort | 3328.0 | 2.906851 | 1.36105 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| cabin_staff_service | 3321.0 | 3.281843 | 1.4861 | 1.0 | 2.0 | 4.0 | 5.0 | 5.0 |
| food_&_beverages | 3095.0 | 2.74475 | 1.439847 | 1.0 | 1.0 | 3.0 | 4.0 | 5.0 |
| ground_service | 2590.0 | 2.846332 | 1.443522 | 1.0 | 1.0 | 3.0 | 4.0 | 5.0 |
| value_for_money | 3425.0 | 2.745693 | 1.466875 | 1.0 | 1.0 | 3.0 | 4.0 | 5.0 |
| inflight_entertainment | 2404.0 | 2.659734 | 1.397885 | 1.0 | 1.0 | 3.0 | 4.0 | 5.0 |
| wifi_&_connectivity | 506.0 | 1.934783 | 1.356209 | 1.0 | 1.0 | 1.0 | 3.0 | 5.0 |
| rating_only | 3422.0 | 4.857978 | 3.164356 | 1.0 | 2.0 | 4.0 | 8.0 | 10.0 |
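
The `count` values above translate directly into missing-value shares, which helps judge how much each rating column can be relied on. The short calculation below uses only the counts from this output and the 3426 rows reported by `df.info()`; the variable names are illustrative.

```python
TOTAL_ROWS = 3426  # number of entries reported by df.info()

# Non-null counts copied from the `count` column of the summary above.
non_null = {
    'seat_comfort': 3328,
    'cabin_staff_service': 3321,
    'food_&_beverages': 3095,
    'ground_service': 2590,
    'value_for_money': 3425,
    'inflight_entertainment': 2404,
    'wifi_&_connectivity': 506,
    'rating_only': 3422,
}

# Share of rows where each column is missing.
missing_share = {col: 1 - n / TOTAL_ROWS for col, n in non_null.items()}

# Print columns from most to least incomplete.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{col:<24} {share:6.1%}')
```

`wifi_&_connectivity` is missing in roughly 85% of rows, so its low mean rests on a small subset of reviews. On the DataFrame itself, `df.isna().mean()` gives the same shares for every column.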


In [8]:
df.describe(exclude='number')

# Note:
    # Overall descriptive statistics of categorical features

| | id | review | rating | header | sub_header | author | time_published | aircraft | type_of_traveller | seat_type | route | date_flown | recommended | verified | city | type_aircraft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3426 | 3426 | 3422 | 3426 | 3426 | 3426 | 3426 | 1789 | 2656 | 3424 | 2652 | 2648 | 3426 | 3426 | 3426 | 1789 |
| unique | 3417 | 3415 | 10 | 2415 | 3309 | 2730 | 1636 | 189 | 4 | 4 | 1431 | 100 | 2 | 1518 | 70 | 25 |
| top | anchor243835 | Flown 6 flights on BA recently generally satis... | 1/10 | British Airways customer review | 26 reviews C Fordham (United States) 8th Decem... | Clive Drake | 19th January 2015 | A320 | Couple Leisure | Economy Class | London to Johannesburg | August 2015 | no | Trip Verified | United Kingdom | Boeing |
| freq | 2 | 2 | 741 | 956 | 4 | 32 | 26 | 326 | 896 | 1760 | 18 | 83 | 1984 | 948 | 2175 | 905 |
