## Libraries

In [14]:
import pandas as pd
import numpy as np

## Importing Data

The cleaned dataset is stored in the 'data/interim' directory within this repository. The relative path below is set so that the code runs without further modification.

In [15]:
df_clean = pd.read_csv('../data/interim/cleaned_airlines.csv', index_col=0)

In [16]:
df_clean.head()

Unnamed: 0,Title,Name,Review Date,Airline,Verified,Reviews,Type of Traveller,Month Flown,Route,Class,Seat Comfort,Staff Service,Food & Beverages,Inflight Entertainment,Value For Money,Overall Rating,Recommended
0,Flight was amazing,Alison Soetantyo,2024-03-01,Singapore Airlines,1,Flight was amazing. The crew onboard this fl...,Solo Leisure,December 2023,Jakarta to Singapore,Business Class,4,4,4,4,4,9,1
1,seats on this aircraft are dreadful,Robert Watson,2024-02-21,Singapore Airlines,1,Booking an emergency exit seat still meant h...,Solo Leisure,February 2024,Phuket to Singapore,Economy Class,5,3,4,4,1,3,0
2,Food was plentiful and tasty,S Han,2024-02-20,Singapore Airlines,1,Excellent performance on all fronts. I would...,Family Leisure,February 2024,Siem Reap to Singapore,Economy Class,1,5,2,1,5,10,1
3,“how much food was available,D Laynes,2024-02-19,Singapore Airlines,1,Pretty comfortable flight considering I was f...,Solo Leisure,February 2024,Singapore to London Heathrow,Economy Class,5,5,5,5,5,10,1
4,“service was consistently good”,A Othman,2024-02-19,Singapore Airlines,1,The service was consistently good from start ...,Family Leisure,February 2024,Singapore to Phnom Penh,Economy Class,5,5,5,5,5,10,1


## Feature Engineering

Create new features out of the existing data that could add value to the analysis.

### Frequent Reviewer

Split the reviewers into four categories based on the number of verified reviews each one left. The categories are named after the tiers of common airline frequent flyer programs:

- 0 (Non-frequent or Blue Tier) - Customer left 0 or 1 verified reviews
- 1 (Silver Tier)               - Between 2 and 5 verified reviews
- 2 (Gold Tier)                 - Between 6 and 15 verified reviews
- 3 (Platinum Tier)             - Over 15 verified reviews

In [17]:
# Filter verified reviews ('Verified' is stored as 1/0)
verified_reviews = df_clean[df_clean['Verified'] == 1]

# Count occurrences of each name in the filtered DataFrame
verified_name_counts = verified_reviews['Name'].value_counts()

# Categorization function
def categorize_verified_reviews(count):
    if count <= 1:
        return 0  # Not a frequent reviewer or Blue Tier
    elif count <= 5:
        return 1  # Silver Tier
    elif count <= 15:
        return 2  # Gold Tier
    else:
        return 3  # Platinum Tier

# Create the new feature; names with no verified reviews map to NaN and get tier 0
verified_counts = df_clean['Name'].map(verified_name_counts).fillna(0)
df_clean['Frequent Reviewer'] = verified_counts.apply(categorize_verified_reviews)

# Drop the Name column
df_clean = df_clean.drop(['Name'],axis=1)

# Check distribution of the new feature
df_clean['Frequent Reviewer'].value_counts()

Frequent Reviewer
0    6298
1    1214
2     383
3     205
Name: count, dtype: int64

The distribution is heavily skewed toward non-frequent reviewers (6,298 of 8,100 rows, about 78%), which roughly resembles the mix of frequent flyer tiers one would expect on a typical flight.
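
As an alternative to the hand-written categorization function, the same tier boundaries could be expressed with `pd.cut`. This is only a sketch on a hypothetical series of review counts (the `counts` values are made up for illustration), not the method used in this notebook:

```python
import pandas as pd

# Hypothetical verified-review counts per reviewer; NaN means no verified reviews
counts = pd.Series([float('nan'), 1, 2, 5, 6, 15, 16, 40])

# Bin edges follow the tier definitions above; intervals are closed on the right,
# so (-1, 1] -> 0, (1, 5] -> 1, (5, 15] -> 2, (15, inf] -> 3
tiers = pd.cut(
    counts.fillna(0),
    bins=[-1, 1, 5, 15, float('inf')],
    labels=[0, 1, 2, 3],
).astype(int)

print(tiers.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Keeping the boundaries in a single `bins` list makes it easier to adjust the tiers later without touching branching logic.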

### Flight Month and Flight Year

Flight Month: This feature can capture seasonal trends, such as differences in travel volume or passenger mood during holiday seasons and off-peak times.

Flight Year: This feature can help identify trends over the years and account for significant events such as the COVID-19 pandemic, which had a major impact on travel patterns.


In [18]:
# Split 'Month Flown' into 'Flight Month' and 'Flight Year'
df_clean[['Flight Month', 'Flight Year']] = df_clean['Month Flown'].str.split(' ', expand=True)

# Convert to appropriate types
df_clean['Flight Year'] = df_clean['Flight Year'].astype(int)
df_clean['Flight Month'] = pd.to_datetime(df_clean['Flight Month'], format='%B').dt.month

In [21]:
# Verify correct split
df_clean[['Month Flown', 'Flight Year','Flight Month']].head()

Unnamed: 0,Month Flown,Flight Year,Flight Month
0,December 2023,2023,12
1,February 2024,2024,2
2,February 2024,2024,2
3,February 2024,2024,2
4,February 2024,2024,2


In [None]:
# Drop the original 'Month Flown' column
df_clean = df_clean.drop(['Month Flown'],axis=1)
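
The split-and-convert steps above could also be done in one parsing pass. A minimal sketch, assuming every 'Month Flown' value has the form "<English month name> <4-digit year>" as in the rows shown, and that `%B` is interpreted under an English locale; the `month_flown` series is a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample in the same format as the 'Month Flown' column
month_flown = pd.Series(['December 2023', 'February 2024'])

# Parse month name and year together; the day defaults to the 1st of the month
parsed = pd.to_datetime(month_flown, format='%B %Y')

flight_month = parsed.dt.month
flight_year = parsed.dt.year

print(flight_month.tolist())  # [12, 2]
print(flight_year.tolist())   # [2023, 2024]
```

Parsing both parts at once fails loudly on a malformed value, whereas a plain string split may silently produce an unexpected number of columns.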

### Quick Review

Check whether the review was written soon after the flight took place. This could be useful for understanding whether more immediate reviews correlate with higher or lower ratings. A 'quick review' is one that was left in the same month and year as the flight.

Limitations:
The dataset contains no information about the day of the flight. With a calendar month as the time frame, some reviews will be misclassified, which could bias the analysis.

The feature could still provide some insights, as long as this potential bias is kept in mind.


In [8]:
# Extracting the month and year of the review
df_clean['Review Date'] = pd.to_datetime(df_clean['Review Date'])
df_clean['Review Year'] = df_clean['Review Date'].dt.year
df_clean['Review Month'] = df_clean['Review Date'].dt.month

In [9]:
# Creating the new column 'Quick Review'
df_clean['Quick Review'] = (df_clean['Review Year'] == df_clean['Flight Year']) & (df_clean['Review Month'] == df_clean['Flight Month'])

# Dropping the temporary columns
df_clean = df_clean.drop(['Review Year', 'Review Month'], axis=1)

# Checking the new columns
df_clean['Quick Review'].value_counts()

Quick Review
True     5005
False    3095
Name: count, dtype: int64
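
To make the limitation described above concrete, the sketch below applies the same month-and-year rule to two hypothetical flights with known flight days. The 'Flight Date' column does not exist in the dataset and is invented purely for illustration:

```python
import pandas as pd

# Hypothetical flight days: the real dataset only records the month flown
examples = pd.DataFrame({
    'Flight Date': pd.to_datetime(['2024-01-01', '2024-01-31']),
    'Review Date': pd.to_datetime(['2024-01-31', '2024-02-01']),
})

# Actual delay between flight and review, in days
examples['Days Elapsed'] = (examples['Review Date'] - examples['Flight Date']).dt.days

# Same rule as 'Quick Review': review month and year match the flight month and year
examples['Quick Review'] = (
    (examples['Review Date'].dt.year == examples['Flight Date'].dt.year)
    & (examples['Review Date'].dt.month == examples['Flight Date'].dt.month)
)

print(examples[['Days Elapsed', 'Quick Review']])
```

The first review is written 30 days after the flight yet counts as quick, while the second is written the next day and does not, because it crosses a month boundary.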

## Reorganizing Columns

In [10]:
df_clean.columns

Index(['Title', 'Review Date', 'Airline', 'Verified', 'Reviews',
       'Type of Traveller', 'Route', 'Class', 'Seat Comfort', 'Staff Service',
       'Food & Beverages', 'Inflight Entertainment', 'Value For Money',
       'Overall Rating', 'Recommended', 'Frequent Reviewer', 'Flight Month',
       'Flight Year', 'Quick Review'],
      dtype='object')

In [11]:
columns_reordered = ['Title', 'Reviews','Frequent Reviewer', 'Verified', 'Airline', 'Class','Type of Traveller', 
                    'Route','Review Date','Flight Year', 'Flight Month', 'Quick Review', 'Seat Comfort',
                    'Staff Service', 'Food & Beverages', 'Inflight Entertainment',
                    'Value For Money', 'Overall Rating', 'Recommended']

df_clean = df_clean[columns_reordered]
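
Because the new order is typed by hand, a column can be dropped by omission or mistyped without notice. One possible safeguard, sketched here on a small hypothetical DataFrame (`df` and `new_order` are placeholder names), is to compare the two sets of names before reindexing:

```python
import pandas as pd

# Hypothetical stand-in for df_clean and columns_reordered
df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
new_order = ['c', 'a', 'b']

# Columns that would be lost, and names that do not exist in the DataFrame
missing = set(df.columns) - set(new_order)
unknown = set(new_order) - set(df.columns)
assert not missing and not unknown, f"missing={missing}, unknown={unknown}"

df = df[new_order]
print(list(df.columns))  # ['c', 'a', 'b']
```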

In [13]:
df_clean.head()

Unnamed: 0,Title,Reviews,Frequent Reviewer,Verified,Airline,Class,Type of Traveller,Route,Review Date,Flight Year,Flight Month,Quick Review,Seat Comfort,Staff Service,Food & Beverages,Inflight Entertainment,Value For Money,Overall Rating,Recommended
0,Flight was amazing,Flight was amazing. The crew onboard this fl...,0,1,Singapore Airlines,Business Class,Solo Leisure,Jakarta to Singapore,2024-03-01,2023,12,False,4,4,4,4,4,9,1
1,seats on this aircraft are dreadful,Booking an emergency exit seat still meant h...,1,1,Singapore Airlines,Economy Class,Solo Leisure,Phuket to Singapore,2024-02-21,2024,2,True,5,3,4,4,1,3,0
2,Food was plentiful and tasty,Excellent performance on all fronts. I would...,2,1,Singapore Airlines,Economy Class,Family Leisure,Siem Reap to Singapore,2024-02-20,2024,2,True,1,5,2,1,5,10,1
3,“how much food was available,Pretty comfortable flight considering I was f...,0,1,Singapore Airlines,Economy Class,Solo Leisure,Singapore to London Heathrow,2024-02-19,2024,2,True,5,5,5,5,5,10,1
4,“service was consistently good”,The service was consistently good from start ...,0,1,Singapore Airlines,Economy Class,Family Leisure,Singapore to Phnom Penh,2024-02-19,2024,2,True,5,5,5,5,5,10,1


The dataset is saved as 'feature_airlines.csv' and stored in the 'data/interim' directory within this repository. Notebooks that use this dataset already contain the correct paths to load it.
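
The save step itself is not shown in this notebook. A minimal sketch of what it might look like follows; the helper name `save_features` is hypothetical, and keeping the index in the file is an assumption made so that the result can be reloaded with `index_col=0`, as the input file was:

```python
from pathlib import Path

import pandas as pd


def save_features(df: pd.DataFrame, out_dir: str) -> Path:
    """Write df to <out_dir>/feature_airlines.csv and return the resulting path."""
    out_path = Path(out_dir) / 'feature_airlines.csv'
    # Create the target directory if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # The index is written as the first column (assumption, see lead-in)
    df.to_csv(out_path)
    return out_path
```

In this notebook the call would presumably be `save_features(df_clean, '../data/interim')`, mirroring the relative path used to load the cleaned data.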