## Libraries

In [1]:
import pandas as pd
import numpy as np

## Importing Data

The original dataset is stored in the 'data/external' directory of this repository. The relative path below is already set, so the code runs without further modification.

In [2]:
df = pd.read_csv('../data/external/airlines_reviews.csv')

In [3]:
df.head()

Unnamed: 0,Title,Name,Review Date,Airline,Verified,Reviews,Type of Traveller,Month Flown,Route,Class,Seat Comfort,Staff Service,Food & Beverages,Inflight Entertainment,Value For Money,Overall Rating,Recommended
0,Flight was amazing,Alison Soetantyo,2024-03-01,Singapore Airlines,True,Flight was amazing. The crew onboard this fl...,Solo Leisure,December 2023,Jakarta to Singapore,Business Class,4,4,4,4,4,9,yes
1,seats on this aircraft are dreadful,Robert Watson,2024-02-21,Singapore Airlines,True,Booking an emergency exit seat still meant h...,Solo Leisure,February 2024,Phuket to Singapore,Economy Class,5,3,4,4,1,3,no
2,Food was plentiful and tasty,S Han,2024-02-20,Singapore Airlines,True,Excellent performance on all fronts. I would...,Family Leisure,February 2024,Siem Reap to Singapore,Economy Class,1,5,2,1,5,10,yes
3,“how much food was available,D Laynes,2024-02-19,Singapore Airlines,True,Pretty comfortable flight considering I was f...,Solo Leisure,February 2024,Singapore to London Heathrow,Economy Class,5,5,5,5,5,10,yes
4,“service was consistently good”,A Othman,2024-02-19,Singapore Airlines,True,The service was consistently good from start ...,Family Leisure,February 2024,Singapore to Phnom Penh,Economy Class,5,5,5,5,5,10,yes


## Data Cleaning (excluding 'Title', 'Reviews' and 'Route' for now)

### Checking missing data

In [4]:
df.isnull().sum()

Title                     0
Name                      0
Review Date               0
Airline                   0
Verified                  0
Reviews                   0
Type of Traveller         0
Month Flown               0
Route                     0
Class                     0
Seat Comfort              0
Staff Service             0
Food & Beverages          0
Inflight Entertainment    0
Value For Money           0
Overall Rating            0
Recommended               0
dtype: int64
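
`isnull()` only counts NaN/None, so blank or whitespace-only strings would not appear in the totals above. The following is a minimal sketch of an extra check, run on a small made-up frame rather than the real dataset; the `sample` data and the `text_cols` list are illustrative assumptions.

```python
import pandas as pd

# Toy frame (hypothetical values) with one blank string and one true missing value
sample = pd.DataFrame({
    "Name": ["S Han", "  ", None],
    "Seat Comfort": [1, 5, 4],
})

# Standard missing-value count, as in the cell above
nulls = sample.isnull().sum()

# Text columns to inspect (assumed list; adapt to the real column names)
text_cols = ["Name"]

# Count strings that are empty once surrounding whitespace is stripped
blanks = {col: int(sample[col].str.strip().eq("").sum()) for col in text_cols}

print(nulls)
print(blanks)
```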

### Column Review

After an initial review, the following columns were found to contain only valid, expected values and require no further processing:
'Airline', 'Type of Traveller', 'Class', 'Seat Comfort', 'Staff Service', 'Food & Beverages', 'Value For Money', 'Overall Rating'.
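
The review itself is not shown in the notebook. One possible way to carry it out is to list the distinct values of each column and inspect them by eye. The sketch below uses a toy frame with made-up rows, and `summarize_columns` is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only
sample = pd.DataFrame({
    "Class": ["Business Class", "Economy Class", "Economy Class"],
    "Seat Comfort": [4, 5, 1],
})

def summarize_columns(frame, columns):
    """Map each column name to the sorted list of its distinct non-null values."""
    # sorted() assumes each column holds a single comparable type
    return {col: sorted(frame[col].dropna().unique().tolist()) for col in columns}

summary = summarize_columns(sample, ["Class", "Seat Comfort"])
for col, values in summary.items():
    print(f"{col}: {values}")
```

On the full dataset, `df[col].value_counts()` gives the same information together with frequencies, which is how the columns in the following sections are inspected.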

#### 'Verified' Column

'Verified' contains three values that are neither True nor False. Based on their content, each one is mapped to the corresponding boolean value. All values are then converted to 0 or 1 to simplify later analysis.

In [5]:
df['Verified'].value_counts()

Verified
True            6216
False           1881
*Unverified*       1
NotVerified        1
we do appreciat

In [6]:
# Map the three non-standard values to the strings 'True' or 'False'
df['Verified'] = df['Verified'].replace('NotVerified', 'False')
df['Verified'] = df['Verified'].replace('*Unverified*', 'False')
df['Verified'] = df['Verified'].replace('we do appreciate you bringing this matter to our attention. Please accept my apologies for not having met your expectations; I do hope that we can leave you and your family with a more positive impression on your future flights with us', 'True')

df['Verified'].value_counts()

Verified
True     6217
False    1883
Name: count, dtype: int64

In [7]:
# Convert the values to 0 and 1
df['Verified'] = df['Verified'].replace('False', 0)
df['Verified'] = df['Verified'].replace('True', 1)

df['Verified'].value_counts()

Verified
1    6217
0    1883
Name: count, dtype: int64
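
As an alternative to chained `replace` calls, the same encoding can be sketched with a single lookup table and `Series.map`. This is not the notebook's method, only a hedged variant. It assumes the column holds strings, and it uses a toy series; `verified_map` is a hypothetical name and omits the long free-text value for brevity.

```python
import pandas as pd

# Toy values mimicking the mixed string labels seen in 'Verified'
verified = pd.Series(
    ["True", "False", "*Unverified*", "NotVerified", "True"], name="Verified"
)

# Hypothetical lookup table: every raw label goes straight to 0 or 1
verified_map = {"True": 1, "False": 0, "*Unverified*": 0, "NotVerified": 0}

encoded = verified.map(verified_map)

# map() yields NaN for labels absent from the lookup, so surprises surface here
unmapped = verified[encoded.isna()].unique().tolist()
assert not unmapped, f"Unmapped labels: {unmapped}"

# Safe to cast once no NaN remains
encoded = encoded.astype("int64")
print(encoded.value_counts())
```

Unlike `replace`, which leaves unknown labels untouched, `map` turns them into NaN, which makes an incomplete lookup table easy to detect.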

#### 'Inflight Entertainment' column

All numerical rating columns range from 1 to 5, except 'Overall Rating', which ranges from 1 to 10. The single rating of 0 found in 'Inflight Entertainment' is converted to 1 for consistency.

In [8]:
df['Inflight Entertainment'].value_counts()

Inflight Entertainment
5    2672
4    2284
3    1539
1     863
2     741
0       1
Name: count, dtype: int64

In [9]:
# Replace 0 with 1
df['Inflight Entertainment'] = df['Inflight Entertainment'].replace(0,1)
df['Inflight Entertainment'].value_counts()

Inflight Entertainment
5    2672
4    2284
3    1539
1     864
2     741
Name: count, dtype: int64
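
The same check can be applied to several rating columns at once. The sketch below counts values below the scale minimum and raises them to 1 with `clip`. It runs on a toy frame with made-up values, and the `rating_cols` list is an assumption to adapt to the real columns.

```python
import pandas as pd

# Toy ratings (hypothetical values) with one out-of-scale 0
ratings = pd.DataFrame({
    "Seat Comfort": [4, 5, 1],
    "Inflight Entertainment": [4, 0, 5],
})

rating_cols = ["Seat Comfort", "Inflight Entertainment"]

# Number of values below the scale minimum, per column
below_min = (ratings[rating_cols] < 1).sum()
print(below_min)

# clip(lower=1) raises anything below 1 up to 1 and leaves other values unchanged
ratings[rating_cols] = ratings[rating_cols].clip(lower=1)
print(ratings)
```

For a single stray 0 this has the same effect as `replace(0, 1)`. The difference is that `clip` would also raise any negative value, which may or may not be desired.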

#### 'Recommended' column

The 'Recommended' values 'yes' and 'no' are converted to 1 and 0, respectively, for future analysis.

In [10]:
df['Recommended'].value_counts()

Recommended
yes    4287
no     3813
Name: count, dtype: int64

In [11]:
# Replace 'yes' and 'no' with 1 and 0 for easier analysis
df['Recommended'] = df['Recommended'].replace('yes',1)
df['Recommended'] = df['Recommended'].replace('no',0)

df['Recommended'].value_counts()

Recommended
1    4287
0    3813
Name: count, dtype: int64
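
For a two-valued column, a comparison is another possible way to build the flag. The sketch below is an illustrative variant on a toy series, not the notebook's code. It checks the labels first, since a bare comparison would silently turn any unexpected value into 0.

```python
import pandas as pd

# Toy labels (hypothetical values)
recommended = pd.Series(["yes", "no", "yes"], name="Recommended")

# Guard: only 'yes' and 'no' are expected in this column
assert recommended.isin(["yes", "no"]).all()

# True where the label is 'yes', then cast booleans to 1/0
recommended_flag = recommended.eq("yes").astype(int)
print(recommended_flag.tolist())
```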

## Saving clean dataset

The cleaned dataset is saved as 'cleaned_dataset.csv' in the 'data/interim' directory of this repository. Notebooks that use this dataset already contain the correct path to load it.

In [12]:
df.head()

Unnamed: 0,Title,Name,Review Date,Airline,Verified,Reviews,Type of Traveller,Month Flown,Route,Class,Seat Comfort,Staff Service,Food & Beverages,Inflight Entertainment,Value For Money,Overall Rating,Recommended
0,Flight was amazing,Alison Soetantyo,2024-03-01,Singapore Airlines,1,Flight was amazing. The crew onboard this fl...,Solo Leisure,December 2023,Jakarta to Singapore,Business Class,4,4,4,4,4,9,1
1,seats on this aircraft are dreadful,Robert Watson,2024-02-21,Singapore Airlines,1,Booking an emergency exit seat still meant h...,Solo Leisure,February 2024,Phuket to Singapore,Economy Class,5,3,4,4,1,3,0
2,Food was plentiful and tasty,S Han,2024-02-20,Singapore Airlines,1,Excellent performance on all fronts. I would...,Family Leisure,February 2024,Siem Reap to Singapore,Economy Class,1,5,2,1,5,10,1
3,“how much food was available,D Laynes,2024-02-19,Singapore Airlines,1,Pretty comfortable flight considering I was f...,Solo Leisure,February 2024,Singapore to London Heathrow,Economy Class,5,5,5,5,5,10,1
4,“service was consistently good”,A Othman,2024-02-19,Singapore Airlines,1,The service was consistently good from start ...,Family Leisure,February 2024,Singapore to Phnom Penh,Economy Class,5,5,5,5,5,10,1
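
The save step itself does not appear in the cells above. A minimal sketch of how it might look follows. It writes a toy frame to a temporary directory so that it runs anywhere. In the notebook the target would presumably be '../data/interim/cleaned_dataset.csv', which is an assumption inferred from the load path, and `index=False` is likewise an assumed choice.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy stand-in for the cleaned DataFrame (hypothetical values)
cleaned = pd.DataFrame({"Verified": [1, 0], "Recommended": [1, 0]})

# Temporary location mirroring the 'data/interim' layout (assumed structure)
out_dir = Path(tempfile.mkdtemp()) / "data" / "interim"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "cleaned_dataset.csv"

# index=False avoids writing the row index as an extra unnamed column
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserved the data
reloaded = pd.read_csv(out_path)
print(reloaded.equals(cleaned))
```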
