# **Kenya Airways & Industry Airline Customer Reviews Analysis Notebook**


---

![Kenya Airways Image](https://www.kindpng.com/picc/m/337-3373993_kenya-airways-hd-png-download.png)

> ## **Introduction**   
> Kenya Airways receives reviews on TripAdvisor from both local and international travellers. Its customer service team would like to extract insights from these reviews and benchmark the airline against the top 10 airlines in the Skytrax ranking, to discover where it has a competitive edge and where it falls short.

> However, the team needs help analyzing the reviews: the volume is large, and going through them manually is time-consuming and resource-intensive. It is also hard to identify common trends and themes, because customers comment on a wide range of topics, from the quality of the food to their in-flight experience.

> In this notebook we will be using text mining and sentiment analysis to process and analyze customer reviews to help Kenya Airways overcome these challenges through Data Science & Analytics. This would allow the airline to quickly and efficiently gain insights from the data and identify common issues and trends in customer feedback. The airline could then use this information to improve its products and services and provide better support to its customers.

> ## **Dataset Source**   
> To meet the objectives of the analysis, we extracted airline customer reviews from TripAdvisor for the top 10 leading airlines in Africa by Skytrax ranking, Kenya Airways included. These datasets let us analyze reviews at the organization level (Kenya Airways) and compare them against the industry (the 9 other airlines).

The datasets are available in the GitHub repository at [this link](https://github.com/billyotieno/analytics-datasets/tree/main/Transport%20Services/Airlines/african-airlines-reviews-dataset).


In [None]:
# Check GPU connectivity
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Feb 22 20:32:09 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0    29W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check Hi RAM Allocation
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


# **Table of Contents**

>[Kenya Airways & Industry Airline Customer Reviews Analysis Notebook](#scrollTo=jNpZkgLrpWiF)

>>[Introduction](#scrollTo=jNpZkgLrpWiF)

>>[Dataset Source](#scrollTo=jNpZkgLrpWiF)

>[Table of Contents](#scrollTo=DM0RD92SwuJF)

>>[Setting up and Installing Required Libraries](#scrollTo=5yvGfaTlygLb)

>>[Sourcing Data from the GitHub Repository](#scrollTo=j4rYvvQ61Ps7)

>>[Importing Required Libraries](#scrollTo=Yuqv8THG2ZXG)

>>[Loading Data into DataFrames](#scrollTo=sII_fRkJpVcT)

>>[Initial Data Exploration](#scrollTo=L6LOVnaX0mhI)

>>>[Renaming Columns to Clear Columns](#scrollTo=RdNUoUxq9XNA)

>>>[Checking Dataset Shape](#scrollTo=aB5iN376RWDo)

>>>[Checking DataTypes](#scrollTo=8WxPPNl0RcRH)

>>>[Checking for Missing Values](#scrollTo=8oOLLEXfRfj8)

>>>[Dataset Description](#scrollTo=pYf9YG2GRjIL)

>>>[Initial Data Cleaning: Overlapped Text](#scrollTo=n6Peyes6Rowq)

>>[Data Exploration: Focused on Non-Review Columns](#scrollTo=bO-JbEUMR30e)

>>>[Total Number of Reviews by Airlines](#scrollTo=xjZz7X0gRw9q)

>>>[Flight Types or Regions Travelled by Reviewers for Each Airline](#scrollTo=kZe6gvL1SO-T)

>>>[Distribution of Ratings (1-5)](#scrollTo=dVmmWr8LSjsC)

>>>[Average Rating Across the Airlines for the Various Travel Classes](#scrollTo=aFC4ts_yTchJ)

>>>[Data Cleaning: Correcting Travel Month Column](#scrollTo=znqKrX7ZTtpb)

>>>[Exploring Review Ratings by Airline and Flight Travel Class](#scrollTo=FzY5o-D7T2cU)

>>>[Exploring Review Ratings by Airlines across Regions](#scrollTo=ET3DkYQyt0rf)

>>>[Breakdown of Airlines by Respective Travel Classes](#scrollTo=wvNVWzd8uk0Z)

>>[Data Exploration: Focused on Review Text](#scrollTo=R5Fe-nXgZ3G7)

>>>[Checking for NaNs in Extracted Review Columns](#scrollTo=3KTia-Ad8L6w)

>>>[Features distributions into Boolean, Categorical and Numerical types](#scrollTo=YvyCIsjiAseg)

>>>[Plotting the correlation matrix for the features](#scrollTo=6V63NJ4jBlXl)

>>[Data Quality Summary](#scrollTo=1ZosZNZ4mAvp)

>>[Data Preparation](#scrollTo=MqaDRPCsq64H)

>>>[1. Merging the two Datasets - Text Profiled & Non-Review Dataset](#scrollTo=zDXZ9KbArND8)

>>>[2. Removing Duplicate Rows](#scrollTo=clfUWwL0Yxqm)

>>>[3. Removing Redundant / Unrequired Columns - Select Data](#scrollTo=tYsbet3FXjwg)

>>>[4. Cleaning Travel Month & Year - Select Data](#scrollTo=FDt1YXyDc74S)

>>>[5. DataType Conversion](#scrollTo=gpXMSwBvsqE0)

>>>[6. Review Sentiment - New Column from Rating Scores](#scrollTo=0CBTTd-Da8RB)

>>[Data Preparation - Text Pre-processing for Reviews](#scrollTo=E8DIrwWyCFaj)

>>[Exploratory Data Analysis / Modelling](#scrollTo=GIFr98omeGvK)

>>>[Building a Quick Sentiment Classifier using CountVectorizer on Airline Reviews](#scrollTo=GIFr98omeGvK)

>>>[POS - Review Text Parts of Speech Analysis](#scrollTo=M9Ow5ul0ZdO9)

>>>[Review Text - Bigram Analysis](#scrollTo=dh_oAFVKgT6v)

>>>[Review Text - Trigram Analysis](#scrollTo=1Krld8DlgYCR)

>>[Sentiment Analysis Modelling](#scrollTo=Vnr-y3BToI7t)

>>>[Using CountVectorizer & LogisticRegression](#scrollTo=-_VuyN_IoYq9)

>>>[Using TfidfVectorizer & MultinomialNB Model](#scrollTo=BOtBTxWLkRhN)

>>>[Using CountVectorizer & XGBoost Model](#scrollTo=jxBS5eJ7t_ng)

>>[Topic Modelling](#scrollTo=G2zOuPa3oma5)

>>>[Displaying and Evaluating Topics](#scrollTo=1hTR__MCslGq)

>>>[Topic Modelling with BERTopic](#scrollTo=wAIelvHoJhOr)



## **Setting up and Installing Required Libraries**

In [None]:
# Installing required libraries (-q quiet installing all libraries)
! pip install -q pandas pandera numpy matplotlib seaborn textblob dask missingno wordcloud pyldavis
! pip install -q fasttext


In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

--2023-02-22 20:33:35--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 172.67.9.4, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: ‘lid.176.bin’


2023-02-22 20:33:42 (19.9 MB/s) - ‘lid.176.bin’ saved [131266198/131266198]



In [None]:
# Install pandas profiling - required for initial exploration
!pip install -q https://github.com/pandas-profiling/pandas-profiling/archive/master.zip


In [None]:
# Install NLP Profiler for text datasets
# !pip install -U -q git+https://github.com/neomatrix369/nlp_profiler@scale-when-applied-to-larger-datasets
# print("\n Installation Completed")

!pip install -U -q git+https://github.com/neomatrix369/nlp_profiler.git@master

## **Sourcing Data from the GitHub Repository**

In [None]:
from google.colab import files

# Create an airline-datasets directory on google colab to host the files
!rm -rf airline-datasets
!mkdir -p airline-datasets
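# Note: "!cd" would only change directory in a throwaway subshell, so it is omitted;
# the -P flag on each wget call below writes the files into ./airline-datasets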

# fetch all the datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/kenya_airways_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/air_mauritius.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/egypt_airways.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/ethiopian_airlines.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/fastjet_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/flysafair_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/royal_air_maroc.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/rwand_air_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/seychelles_airways.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/south_african_airways.csv" -P ./airline-datasets
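
The ten download commands above differ only in the file name, and each URL contains a literal space in `Transport Services`. As a minimal sketch (not the notebook's own method), the same raw URLs could be built in a loop with the space percent-encoded. The helper name `build_raw_url` is hypothetical; the base path and file names are copied from the commands above.

```python
from urllib.parse import quote

# Base path and directory copied from the wget commands above.
BASE_URL = "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main"
DATASET_DIR = "Transport Services/Airlines/african-airlines-reviews-dataset"

# File names taken from the download cell above.
CSV_FILES = [
    "kenya_airways_flights.csv",
    "air_mauritius.csv",
    "egypt_airways.csv",
    "ethiopian_airlines.csv",
    "fastjet_flights.csv",
    "flysafair_flights.csv",
    "royal_air_maroc.csv",
    "rwand_air_flights.csv",
    "seychelles_airways.csv",
    "south_african_airways.csv",
]


def build_raw_url(file_name: str) -> str:
    """Hypothetical helper: percent-encode the path so the space becomes %20."""
    # quote() leaves "/" untouched by default, so the path structure is preserved.
    return f"{BASE_URL}/{quote(DATASET_DIR)}/{quote(file_name)}"


urls = [build_raw_url(name) for name in CSV_FILES]
for url in urls:
    print(url)
```

Each resulting URL could then be passed to `wget` or read directly with `pd.read_csv`.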

## **Importing Required Libraries**

In [None]:
# Import required libraries
import pandas as pd
import pandera as pn
import dask
import seaborn as sns
import spacy
import re
import nltk
import string
import fasttext
import warnings
import inflect # converting numbers in text to words
import wordcloud
import missingno as msno
from pandas_profiling import ProfileReport
from nlp_profiler.core import apply_text_profiling

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import wordpunct_tokenize
import matplotlib.pyplot as plt

# Interactive topic-model visualisation
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Pandas settings
pd.options.mode.chained_assignment = None
pd.set_option('display.max_colwidth', 20)
pd.options.display.max_rows = 4000
from IPython.display import Image

%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

# NLTK Download Options
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')



In [None]:
# Visualization Fonts
!wget -O IBM_Sans.zip "https://fonts.google.com/download?family=IBM%20Plex%20Sans"
!wget -O McKinsey_Bower.zip "https://cdn.mckinsey.com/assets/fonts/web/Bower_Fonts.zip"

In [None]:
!unzip -o '*.zip'

In [None]:
!mv *.ttf /usr/share/fonts/truetype/
!mv *.otf /usr/share/fonts/truetype/

## **Loading Data into DataFrames**

In [None]:
from pathlib import Path

path = "./airline-datasets/"
files = Path(path).glob('*.csv')

In [None]:
# Read data into dataframe with a new column identifying airline dataset
dfs = list()
for f in files:
  data = pd.read_csv(f,
                     usecols=['Title','Image','Avatar_URL',
                              'crvsd','ui_header_link','default',
                              'phmbo','phmbo1','dmrsr','dmrsr2','dmrsr3',
                              'qwuub_URL','qwuub','tehyy','xcjrc',
                              'Rating'])
  data['source'] = f.stem
  dfs.append(data)

In [None]:
df = pd.concat(dfs, ignore_index=True)
df.head()
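
The loading loop above tags every row with the name of the file it came from, so the airline can be recovered after concatenation. Below is a small self-contained sketch of the same pattern using in-memory CSV text instead of files on disk. The rows and the `bubble_50`-style rating strings are made-up sample data, not values from the real dataset.

```python
import io

import pandas as pd

# Two tiny in-memory "files" standing in for the downloaded CSVs (made-up rows).
fake_files = {
    "kenya_airways_flights": "Title,Rating\nGreat crew,bubble_50\nLate departure,bubble_20\n",
    "air_mauritius": "Title,Rating\nSmooth flight,bubble_40\n",
}

frames = []
for stem, csv_text in fake_files.items():
    frame = pd.read_csv(io.StringIO(csv_text), usecols=["Title", "Rating"])
    frame["source"] = stem  # plays the same role as f.stem in the loop above
    frames.append(frame)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```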

## **Initial Data Exploration**

### **Renaming Columns to Clear Columns**

In [None]:
# Rename dataset to clear & understandable columns
column_rename = {
    'Title':'review',
    'Image':'review_image',
    'Avatar_URL':'avatar_url',
    'crvsd':'writing_month',
    'ui_header_link':'reviewer_username',
    'default':'reviewer_city',
    'phmbo':'reviewer_contribution',
    'phmbo1':'helpful_votes',
    'dmrsr':'flight_path',
    'dmrsr2':'flight_type',
    'dmrsr3':'travel_class',
    'qwuub_URL':'review_link',
    'qwuub':'review_headline',
    'tehyy':'travel_month',
    'xcjrc':'disclaimer',
    'Rating':'review_rating',
    'source':'airline'
}

df.rename(columns=column_rename, inplace=True)

In [None]:
# Proper Naming for Airlines
df['airline'] = df.airline.astype('category')
df['airline'] = df['airline'].cat.rename_categories({
  'air_mauritius':'Air Mauritius',
  'egypt_airways':'Egypt Air',
  'ethiopian_airlines':'Ethiopian Airlines',
  'fastjet_flights':'FastJet',
  'flysafair_flights':'FlySafair',
  'kenya_airways_flights':'Kenya Airways',
  'royal_air_maroc':'Royal Air Maroc',
  'rwand_air_flights':'RwandAir',
  'seychelles_airways':'Air Seychelles',
  'south_african_airways':'South African Airways',
})

In [None]:
# Check new columns
df.columns

### **Checking Dataset Shape**

In [None]:
# Checking dataframe shape
df.shape

### **Checking DataTypes**

In [None]:
# Checking datatypes
df.dtypes

### **Checking for Missing Values**

In [None]:
# Check for Missing Values
msno.matrix(df)

### **Dataset Description**

In [None]:
# Checking dataframe description
df.describe(include='all')

In [None]:
#Getting the total number of reviews in the dataset
n_reviews = df.shape[0]
# print('Number of customer reviews in the dataset: {}'.format(n_reviews))

In [None]:
review = df[df.airline == 'Kenya Airways'].review.values[12]
print(review)

### **Initial Data Cleaning: Overlapped Text**

In [None]:
# Initial dataset cleaning to support exploration
def remove_overlapped_text(df):
  df = df.copy()
  index = df[(df["travel_class"] != "Economy") & (df["travel_class"] != "Business Class") & (df["travel_class"] != "First Class")].index
  df.drop(index, inplace=True)
  df.reset_index(drop=True, inplace=True)
  return df

df = remove_overlapped_text(df)
df.shape
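
The cleaning step above keeps only rows whose `travel_class` is one of the three expected labels and drops rows where other text overlapped into that column. As a sketch under assumed sample data, the same filter can be written with `Series.isin`, which also excludes missing values. The sample rows, including the overlapped route text, are illustrative only.

```python
import pandas as pd

VALID_CLASSES = ["Economy", "Business Class", "First Class"]

# Made-up rows: one has route text in travel_class, one is missing.
sample = pd.DataFrame({
    "review": ["a", "b", "c", "d"],
    "travel_class": ["Economy", "Nairobi - London", None, "First Class"],
})

# Keep only rows with a recognised travel class, then renumber the index.
cleaned = sample[sample["travel_class"].isin(VALID_CLASSES)].reset_index(drop=True)
print(cleaned)
```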

## **Data Exploration: Focused on Non-Review Columns**

### **Total Number of Reviews by Airlines**

In [None]:
import matplotlib.font_manager as fm
viz_color = "#102747"

# path = '/usr/share/fonts/truetype/IBMPlexSans-Regular.ttf'
path = '/usr/share/fonts/truetype/Bower-Bold.otf'
fontprop = fm.FontProperties(fname=path)


sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Number of Reviews by Airline", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Showing distribution of reviews per Airline", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

ax = sns.countplot(data=df, y="airline", ax=ax, color=viz_color, order=df['airline'].value_counts().index)

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Number of Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

# ax.set(ylabel="Airlines", xlabel="Number of Reviews")

plt.show()

### **Flight Types or Regions Travelled by Reviewers for Each Airline**

In [None]:
# What are the most common flight types experienced by reviewers across the various airlines?
airline_flight_type = df.groupby(['airline', 'flight_type']).size().reset_index().pivot(columns='flight_type', index='airline', values=0)
airline_flight_type

In [None]:
airline_flight_type.columns

In [None]:
index= airline_flight_type.index
cols = airline_flight_type.columns
airline_flight_type.style.background_gradient(cmap='Blues')
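
The airline-by-flight-type table above is a contingency table of counts. As an alternative sketch, `pd.crosstab` builds the same kind of table in one call and fills absent combinations with 0. The sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up sample of the two columns used in the count table.
sample = pd.DataFrame({
    "airline": ["Kenya Airways", "Kenya Airways", "FlySafair", "Kenya Airways"],
    "flight_type": ["International", "Africa", "Domestic", "International"],
})

# Rows are airlines, columns are flight types, cells are review counts.
counts = pd.crosstab(sample["airline"], sample["flight_type"])
print(counts)
```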

### **Distribution of Ratings (1-5)**

In [None]:
# What is the distribution of ratings in the review dataset?
# Clean up the review rating: keep the last two characters and convert them to a numeric rating
df.review_rating = df.review_rating.str[-2:]
df.review_rating = df.review_rating.astype(int) / 10
df.review_rating.value_counts().plot(kind="barh")
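
The conversion above relies on the last two characters of the raw rating string encoding the rating multiplied by ten. The sketch below shows a slightly more defensive variant that extracts the trailing two digits with a regular expression and tolerates missing values. The raw format `bubble_50` is an assumption about the scraped data, not something confirmed by this notebook.

```python
import pandas as pd

# Assumed raw format: a string whose last two digits encode the rating x 10.
raw = pd.Series(["bubble_50", "bubble_10", "bubble_30", None])

# Pull out the two trailing digits; rows that do not match become NaN.
digits = raw.str.extract(r"(\d{2})$", expand=False)
rating = pd.to_numeric(digits, errors="coerce") / 10
print(rating.tolist())
```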

### **Average Rating Across the Airlines for the Various Travel Classes**

In [None]:
import copy
cmap = copy.copy(plt.cm.get_cmap("Blues"))
cmap.set_under("white")

# What is the average rating given by travellers on KQ and on the other airlines, in each travel class?
import numpy as np

plt.figure(figsize=(12, 8), dpi= 80)
airline_class_review = df.groupby(['airline', 'travel_class']).agg({'review_rating':[np.mean]}).reset_index().pivot(columns="travel_class", index="airline").droplevel(0, axis=1).droplevel(0, axis=1)
airline_class_review.style.background_gradient(cmap='Blues').applymap(lambda x: 'background-color: white' if pd.isna(x) else '')
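
The chained `groupby`, `agg`, `pivot` and `droplevel` calls above produce a table of mean ratings with airlines as rows and travel classes as columns. As a sketch on made-up data, `DataFrame.pivot_table` should give the same shape of result in one step; combinations with no reviews come out as `NaN`.

```python
import pandas as pd

# Made-up ratings for illustration.
sample = pd.DataFrame({
    "airline": ["Kenya Airways", "Kenya Airways", "Kenya Airways", "FlySafair"],
    "travel_class": ["Economy", "Economy", "Business Class", "Economy"],
    "review_rating": [4.0, 2.0, 5.0, 3.0],
})

# Mean rating per (airline, travel class) pair.
mean_rating = sample.pivot_table(
    index="airline",
    columns="travel_class",
    values="review_rating",
    aggfunc="mean",
)
print(mean_rating)
```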

### **Data Cleaning: Correcting Travel Month Column**

In [None]:
# Clean Writing Month
df.writing_month = df.writing_month.str[-8:]
df.writing_month

In [None]:
# Clean Travel Month
df.travel_month = df.travel_month.str[16:]
df.travel_month

In [None]:
df['travel_year'] = df.travel_month.str[-4:]
df.travel_year.head()

In [None]:
# Assumptions, due to extraction error, we'll convert 25** years to 2022
df.isna().sum()
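
The fixed-position slices above (`str[16:]`, `str[-4:]`) work only if every raw string has exactly the same prefix. The sketch below extracts the month and year with a regular expression instead. The raw format `Date of travel: November 2022` is an assumption (its prefix happens to be 16 characters long), and the sample values are made up.

```python
import pandas as pd

# Assumed raw format of the travel month column (illustrative values).
raw = pd.Series([
    "Date of travel: November 2022",
    "Date of travel: March 2019",
    None,
])

# Named groups become the column names of the resulting DataFrame.
parts = raw.str.extract(r"(?P<month>[A-Za-z]+)\s+(?P<year>\d{4})$")
parts["year"] = pd.to_numeric(parts["year"], errors="coerce")
print(parts)
```

Rows that do not match the pattern become `NaN` rather than a silently wrong substring, which makes extraction errors easier to spot.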

### **Exploring Review Ratings by Airline and Flight Travel Class**

In [None]:
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Heatmap of Review Rating by Airline, Travel Class", ha='left', fontproperties=fontprop, fontsize=30, x=0.125, y=1)
plt.title("Most airlines tend to have good ratings for their Business Class compared to Economy. \n",
          loc='left', alpha=0.9, fontproperties=fontprop, fontsize=15)

sns.heatmap(airline_class_review, cmap="Blues", linewidth=1, linecolor="#F4F4F4", cbar_kws = {"location":"bottom", "use_gridspec":False})

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Flight Travel Class', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

The heatmap above shows that travellers rated Business Class more highly than Economy Class.
FlySafair is an exception, since it only operates Economy Class flights.

### **Exploring Review Ratings by Airlines across Regions**

In [None]:
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Exploring Common Flight Type across Airlines", ha='left', fontproperties=fontprop, fontsize=30, x=0.125, y=1)
plt.title("Most of the travel done by airline customers was International, followed by Africa. \n",
          loc='left', alpha=0.9, fontproperties=fontprop, fontsize=15)

sns.heatmap(airline_flight_type, cmap="Blues", linewidth=1, linecolor="#F4F4F4")

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Flight Type / Regions', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

The heatmap above shows that most reviewers took international flights, followed closely by flights within Africa.

### **Breakdown of Airlines by Respective Travel Classes**

In [None]:
airline_travel_class = df.groupby(['airline', 'travel_class']).size().reset_index().pivot(columns='travel_class', index='airline', values=0)
airline_travel_class.reset_index().style.background_gradient(cmap='Blues').applymap(lambda x: 'background-color: white' if pd.isna(x) else '')

In [None]:
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Travel Classes by Airline", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Most reviewers in the dataset travelled on Economy Class", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

airline_travel_class.plot(kind="barh", stacked=True, ax=ax, color=['darkblue','lightsteelblue','darkred'])

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Number of Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

In [None]:
airline_flight_path = df.groupby(['flight_path', 'airline']).size().reset_index().pivot(columns='airline', index='flight_path', values=0)
airline_flight_path["Kenya Airways"].sort_values(ascending=False).head(15)

In [None]:
# pd.crosstab(df['flight_path'], df['airline']).plot(kind='barh', stacked=True)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Top Flight Paths by Reviews - All Airlines", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Flight paths used by customers giving reviews", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

plt.xticks(fontproperties=fontprop, fontsize=12)
plt.yticks(fontproperties=fontprop, fontsize=12)

df[["airline",'flight_path']].value_counts()[:5].plot(kind='barh', stacked=True, ax=ax, color=['darkblue'])

plt.xlabel('Number of Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Flight Paths', fontproperties=fontprop, fontsize=20)

In [None]:
df[df.airline == 'Kenya Airways'].flight_path.value_counts()[:5]

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Top Flight Paths by Reviews - KQ", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Most flights taken by KQ Reviewers were from London to Nairobi", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

plt.xticks(fontproperties=fontprop, fontsize=12)
plt.yticks(fontproperties=fontprop, fontsize=12)

df[df.airline == 'Kenya Airways'].flight_path.value_counts()[:15].plot(kind='barh', stacked=True, ax=ax, color=['darkblue'])

plt.xlabel('Number of KQ Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Flight Paths', fontproperties=fontprop, fontsize=20)

In [None]:
df.describe()

## **Data Exploration: Focused on Review Text**

In this step we drill down into the review text, extract text features, and perform an exploratory analysis of the extracted features. These features are then used downstream in the modelling stage.
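
To make "text features" concrete, here is a minimal hand-rolled sketch that computes a few counts for a single review. The key names mirror columns that appear later in the profiled data (`characters_count`, `count_words`, `sentences_count`, `punctuations_count`), but the helper `simple_text_features` and its counting rules are hypothetical and are not how `nlp_profiler` computes them.

```python
import re
import string


def simple_text_features(text: str) -> dict:
    """Hypothetical helper: a few naive counts for one review."""
    words = re.findall(r"[A-Za-z']+", text)
    # Split on runs of sentence-ending punctuation and ignore empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = [ch for ch in text if ch in string.punctuation]
    return {
        "characters_count": len(text),
        "count_words": len(words),
        "sentences_count": len(sentences),
        "punctuations_count": len(punctuation),
    }


features = simple_text_features("Great crew. Food was cold, sadly!")
print(features)
```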

In [None]:
from wordcloud import STOPWORDS

In [None]:
type(STOPWORDS)

In [None]:
# Word cloud of the most frequent words across all reviews
from wordcloud import WordCloud, STOPWORDS

STOPWORDS.update(['flight','airport','airline'])

stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    background_color = 'white',
    stopwords = stopwords,
    max_words = 400,
    max_font_size = 200,
    width=1000, height=1000,
    random_state = 42
).generate(" ".join(df["review"].astype('str')))

fig = plt.figure(figsize = (12,14))
plt.imshow(wordcloud)

plt.title("Frequently Occurring Words across all Reviews", loc='center',alpha=0.9, fontproperties=fontprop, fontsize=22)
plt.axis('off')
plt.show()

In [None]:
df.review_rating

In [None]:
# Word cloud of the most frequent words in Kenya Airways reviews
from wordcloud import WordCloud, STOPWORDS

STOPWORDS.update(['flight','airport','airline'])

stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    background_color = 'white',
    stopwords = stopwords,
    max_words = 400,
    max_font_size = 200,
    width=1000, height=1000,
    random_state = 42
).generate(" ".join(df[df.airline == 'Kenya Airways']["review"].astype('str')))

fig = plt.figure(figsize = (12,14))
plt.imshow(wordcloud)

plt.title("Frequently Occurring Words across all Kenya Airways Reviews", loc='center',alpha=0.9, fontproperties=fontprop, fontsize=22)
plt.axis('off')
plt.show()

In [None]:
# Checking on the Word Cloud for Each Review Rating
from wordcloud import WordCloud, STOPWORDS

STOPWORDS.update(['flight','airport','airline'])

stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    background_color = 'white',
    stopwords = stopwords,
    max_words = 400,
    max_font_size = 200,
    width=1000, height=1000,
    random_state = 42
).generate(" ".join(df[(df.airline == 'Kenya Airways') & (df.review_rating >= 4)]["review"].astype('str')))

fig = plt.figure(figsize = (12,14))
plt.imshow(wordcloud)

plt.title("Frequent Words - Kenya Airways Reviews (Review Rating >=4) ", loc='center',alpha=0.9, fontproperties=fontprop, fontsize=22)
plt.axis('off')
plt.show()

In [None]:
# Checking on the Word Cloud for Each Review Rating
from wordcloud import WordCloud, STOPWORDS

STOPWORDS.update(['flight','airport','airline'])

stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    background_color = 'white',
    stopwords = stopwords,
    max_words = 400,
    max_font_size = 200,
    width=1000, height=1000,
    random_state = 42
).generate(" ".join(df[(df.airline == 'Kenya Airways') & (df.review_rating <= 2)]["review"].astype('str')))

fig = plt.figure(figsize = (12,14))
plt.imshow(wordcloud)

plt.title("Frequent Words - Kenya Airways Reviews (Review Rating <=2) ", loc='center',alpha=0.9, fontproperties=fontprop, fontsize=22)
plt.axis('off')
plt.show()

In [None]:
# Checking on the Word Cloud for Each Review Rating
from wordcloud import WordCloud, STOPWORDS

STOPWORDS.update(['flight','airport','airline'])

stopwords = set(STOPWORDS)

wordcloud = WordCloud(
    background_color = 'white',
    stopwords = stopwords,
    max_words = 400,
    max_font_size = 200,
    width=1000, height=1000,
    random_state = 42
).generate(" ".join(df[(df.airline == 'Kenya Airways') & (df.review_rating == 3)]["review"].astype('str')))

fig = plt.figure(figsize = (12,14))
plt.imshow(wordcloud)

plt.title("Frequent Words - Kenya Airways Reviews (Review Rating ==3) ", loc='center',alpha=0.9, fontproperties=fontprop, fontsize=22)
plt.axis('off')
plt.show()
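
A word cloud is a visual rendering of word frequencies after stop words are removed. The sketch below shows the underlying count with the standard library only. The stop-word set and the two reviews are made up; the notebook itself uses `wordcloud.STOPWORDS` extended with `flight`, `airport` and `airline`.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set, including the three domain words added above.
STOP = {"the", "was", "and", "a", "flight", "airport", "airline"}

# Made-up reviews for illustration.
reviews = [
    "The flight was delayed and the staff was rude",
    "Friendly staff and a comfortable flight",
]

# Lowercase, tokenise, and drop stop words.
tokens = [
    word
    for review in reviews
    for word in re.findall(r"[a-z']+", review.lower())
    if word not in STOP
]

top_words = Counter(tokens).most_common(3)
print(top_words)
```

Printing the raw counts alongside a word cloud can help when two words look similar in size.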

In [None]:
# From the DataFrame we fetch the review column and perform text profiling.
text_nlp = pd.DataFrame(df, columns=['review'])
# Exploring a Sample Review
text_nlp["review"][2000]

In [None]:
# Text NLP Review for Kenya Airways Data
text_nlp_kq = pd.DataFrame(df[df.airline == 'Kenya Airways'], columns=['review'])

profile_data_kq = apply_text_profiling(
    text_nlp_kq, 'review',
    params={'spelling_check': False,
            'grammar_check': False,
            'ease_of_reading_check':False,
            'parallelisation_method': 'default'})

In [None]:
profile_data_kq.describe(include='all')

In [None]:
profile_data_kq[["sentences_count",
"characters_count",
"repeated_letters_count",
"spaces_count",
"chars_excl_spaces_count",
"repeated_spaces_count",
"whitespaces_count",
"chars_excl_whitespaces_count",
"repeated_whitespaces_count",
"count_words",
"duplicates_count",
"emoji_count",
'repeated_digits_count',
"whole_numbers_count",
"alpha_numeric_count",
"non_alpha_numeric_count",
"punctuations_count",
"repeated_punctuations_count",
"stop_words_count",
"dates_count",
"noun_phrase_count",
"english_characters_count",
"non_english_characters_count",
"syllables_count"]].aggregate('sum')

In [None]:
import seaborn as sns
sns.set_theme(style="ticks")

pairplot_df = profile_data_kq[['sentiment_polarity_summarised','stop_words_count','emoji_count','sentences_count','punctuations_count']]
sns.pairplot(pairplot_df, hue='sentiment_polarity_summarised')

In [None]:
%%script echo skipping
# We skip this step because profiling the full dataset takes a long time to run.
# Instead, the output of this step was saved to CSV once, and the notebook
# reads the profiled data directly from that CSV file.
profile_data = apply_text_profiling(
    text_nlp, 'review',
    params={'spelling_check': False,
            'grammar_check': False,
            'ease_of_reading_check':False,
            'parallelisation_method': 'default'})

# Generating a profiling report into HTML
profile_text = ProfileReport(profile_data)
profile_text.to_file("airline-review-text-profiler-form.html")

# Saving the profiled data to CSV to save on execution
profile_data.to_csv("airline-review-text-profiled-dataset.csv")

In [None]:
profile_data = pd.read_csv("https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport%20Services/Airlines/airline-review-text-profiled-dataset.csv")

In [None]:
# Dropping the Unnamed: 0 column created during file export
profile_data.drop(["Unnamed: 0"], axis=1, inplace=True)
profile_data.columns

In [None]:
# Check the datatypes of the newly created profiled dataset
profile_data.dtypes

In [None]:
profile_data.iloc[9500,0]

In [None]:
profile_data.iloc[9500,]

In [None]:
# Comparing Common Words used by Different Airlines - we'll use df for this
df.columns

In [None]:
# Summary of key text statistics
print("Number of Emojis in Corpus - ", profile_data["emoji_count"].sum())
print("Number of Punctuations in Corpus - ", profile_data["punctuations_count"].sum())
print("Number of Stop Words in Corpus - ", profile_data["stop_words_count"].sum())
print("Number of Dates in Corpus - ", profile_data["dates_count"].sum())
print("Number of Non-English Characters in Corpus - ", profile_data["non_english_characters_count"].sum())
print("Number of Repeated Whitespaces in Corpus - ", profile_data["repeated_whitespaces_count"].sum())

### **Checking for NaNs in Extracted Review Columns**

In [None]:
# Percentage of non-null values.
filling_rates = 100.*profile_data.count().sort_values(ascending=False)/profile_data.shape[0]
print(filling_rates)
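
The filling rate above is the percentage of non-null values per column. As a small sketch on made-up data, the same quantity can be computed as the mean of the boolean `notna()` mask. The column names are borrowed from the profiled dataset, but the values are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values; one column is half empty.
sample = pd.DataFrame({
    "emoji_count": [0, 1, 2, 0],
    "dates_count": [1, np.nan, np.nan, 0],
})

# The mean of a boolean mask is the share of True values, here the share of non-null cells.
fill_pct = sample.notna().mean().mul(100).sort_values(ascending=False)
print(fill_pct)
```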

In [None]:
values_filling_rates = filling_rates.values
text_filling_rates = filling_rates.index.to_list()
print(text_filling_rates)

In [None]:
plt.figure(figsize=(6,6),dpi=100)
sns.set(style="whitegrid")
ax = sns.barplot(x=values_filling_rates, y=text_filling_rates,color="Red")
ax.set(xlabel='Filling percentage (%)', ylabel='Feature')
plt.tight_layout()
plt.show()

### **Features distributions into Boolean, Categorical and Numerical types**

In [None]:
df_for_training = profile_data.copy()

In [None]:
cols_for_training = df_for_training.columns.to_list()

In [None]:
feats_bool = ['recommended',
              'has_layover']
feats_cat = ['airline',
             'traveller_type',
             'cabin','review_text', 'review',
             'pos_neu_neg_review_score']
feats_num = [feat for feat in cols_for_training if feat not in feats_bool and feat not in feats_cat]

In [None]:
print('Boolean features: \n{}\n'.format(feats_bool))
print('Categorical features: \n{}\n'.format(feats_cat))
print('Numerical features: \n{}\n'.format(feats_num))

### **Plotting the correlation matrix for the features**

In [None]:
# Let's plot a correlation matrix among the features
def plot_cmap(matrix_values, figsize_w, figsize_h, filename):
    """
    Plot a heatmap corresponding to the input values.
    """
    if figsize_w is not None and figsize_h is not None:
        plt.figure(figsize=(figsize_w,figsize_h))
    else:
        plt.figure()
    cmap = sns.diverging_palette(240, 10, sep=20, as_cmap=True)
    sns.heatmap(matrix_values, fmt=".2f", cmap=cmap, vmin=-1, vmax=1)
    plt.savefig(filename)
    plt.show()
    return cmap

corr_values = df_for_training[feats_num].dropna(axis=0,how='any').corr()
plot_cmap(matrix_values=corr_values,
          figsize_w=15,
          figsize_h=15,
          filename='./Corr.png')

Note:

1. The different review scores and subscores are positively correlated with one another.
2. The length of the review text is negatively correlated with the review scores and subscores.
3. The character count and the word count are highly similar, so one of the two features can be dropped.

In [None]:
# Based on the correlation matrix, identify the columns to be dropped
corr_matrix = profile_data.select_dtypes(include=["number", "bool"]).corr().abs()

# Select the upper triangle of the correlation matrix
# (np.bool was removed from NumPy; the built-in bool is equivalent here)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with an absolute correlation greater than 0.80
to_drop = [column for column in upper.columns if any(upper[column] > 0.80)]

to_drop

## **Data Quality Summary**

Our data exploration surfaced the following data quality issues in the review text and in the other dataset columns. The summary below informs the Data Preparation stage:

**Data Quality Issues Identified in Non-Review Text Columns**

> - *Redundant Columns*   
> - *Duplicate Rows*  
> - *Wrong Data Types*
> - *Missing Values*

**Data Quality Issues Identified in Review Text**

> - *Extra Whitespaces in Text*.
> - *Digits in Review Text*.  
> - *Emojis in Review Text*.  
> - *Punctuation*.  
> - *URLs*.  

For the review text, the data will be cleaned up as part of text pre-processing.
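As a rough illustration of the clean-up these findings call for, the sketch below handles each listed review-text issue on a single string. The function name `clean_review_sketch`, the sample text and the simplified regular expressions are assumptions made for this example; they are not the exact rules used by the pre-processing functions in this notebook.

```python
import re
import string

# Simplified patterns, assumed for illustration only
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]+")


def clean_review_sketch(text):
    """Apply one cleaning step per data quality issue listed above."""
    text = URL_PATTERN.sub(" ", text)        # URLs
    text = EMOJI_PATTERN.sub(" ", text)      # emojis
    text = re.sub(r"\d+", " ", text)         # digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\s+", " ", text)         # repeated whitespace
    return text.strip().lower()


sample = "Great   flight 😀!! Booked via https://example.com in 2019."
cleaned = clean_review_sketch(sample)
print(cleaned)
```

URLs are removed first because stripping punctuation beforehand would break them into fragments that no longer match the URL pattern.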

## **Data Preparation**

At this stage we prepare the data for modelling and further analysis.

### **1. Merging the two Datasets - Text Profiled & Non-Review Dataset**

In [None]:
# profile_data
profile_data["id"] = profile_data.index
df["id"] = df.index

In [None]:
review_df = pd.merge(df, profile_data, how="left", on="id")
review_df.shape

In [None]:
df.shape

In [None]:
profile_data.shape

In [None]:
review_df.columns

In [None]:
review_df.isna().sum()

In [None]:
review_df.describe()

In [None]:
review_df.dtypes

### **2. Removing Duplicate Rows**

In [None]:
# Number of Duplicate Rows
print(review_df.duplicated().sum())

### **3. Removing Redundant / Unrequired Columns** - Select Data

In [None]:
redundant_columns = ['spaces_count',
 'chars_excl_spaces_count',
 'whitespaces_count',
 'chars_excl_whitespaces_count',
 'count_words',
 'duplicates_count',
 'alpha_numeric_count',
 'non_alpha_numeric_count',
 'stop_words_count',
 'noun_phrase_count',
 'english_characters_count',
 'syllables_count',
 'review_y',
 'review_image',
 'avatar_url',
 'reviewer_username',
 'reviewer_city',
 'helpful_votes',
 'review_link',
 'review_headline',
 'id']

In [None]:
review_df.drop(redundant_columns, axis=1, inplace=True)
review_df.dtypes

In [None]:
review_df.shape

### **4. Cleaning Travel Month & Year** - Select Data

In [None]:
# Replace month abbreviations in writing_month with full month names
def replace_month_abrev(date_string):
    month_dict = {"Jan ": "January ",
              "Feb ": "February ",
              "Mar ": "March ",
              "Apr ": "April ",
              "May ": "May ",
              "Jun ": "June ",
              "Jul ": "July ",
              "Aug ": "August ",
              "Sep ": "September ",
              "Sept ": "September ",
              "Oct ": "October ",
              "Nov ": "November ",
              "Dec ": "December "}
    # find all month abbreviations present in the string
    abrev_found = [abrev_month for abrev_month in month_dict if abrev_month in date_string]
    # replace each abbreviation with the full month name
    for abrev in abrev_found:
        date_string = date_string.replace(abrev, month_dict[abrev])
    # return the modified string (or the original if no abbreviations were found)
    return date_string

review_df.writing_month = review_df.writing_month.apply(replace_month_abrev)

In [None]:
# Null out the travel month for rows with an implausible travel year
# (.loc is required; chained indexing would assign to a temporary copy)
review_df.loc[review_df.travel_year >= '2500', 'travel_month'] = None

In [None]:
review_df.loc[review_df.travel_year >= '2500','travel_month'] = review_df.loc[review_df.travel_year >= '2500','writing_month']
review_df.loc[review_df.travel_year >= '2500','travel_month']

In [None]:
review_df.loc[review_df.travel_year >= '2500', 'travel_year'] = review_df.travel_month.str[-4:]

In [None]:
# Where the travel month is missing, we assume the trip took place in the month the review was written
review_df["travel_month"] = review_df["travel_month"].fillna(review_df["writing_month"])
review_df["travel_year"] = review_df["travel_year"].fillna(review_df["writing_month"].str[-4:])

In [None]:
review_df[review_df.travel_month.isna()].head()

In [None]:
review_df.travel_year.value_counts()

In [None]:
# Check again for missing values - No missing values in dataset - Travel is Cleaned.
review_df.isna().sum()

In [None]:
review_df.travel_month.value_counts()

In [None]:
# Create new column travel date
review_df["travel_date"] = pd.to_datetime(review_df['travel_month'], format="%B %Y")
review_df.travel_date

### **5. DataType Conversion**


At this stage we convert the *reviewer contributions* from string to integer.

In [None]:
review_df.dtypes

In [None]:
review_df.reviewer_contribution = \
review_df.reviewer_contribution.str.replace('contributions','').str.replace('contribution','').str.strip()

In [None]:
review_df.reviewer_contribution = review_df.reviewer_contribution.astype(int)

### **6. Review Sentiment - New Column from Rating Scores**

All Ratings >= 4 are classified as "Positive Reviews - 1".    
All Ratings < 4 are classified as "Negative Reviews - 0"

In [None]:
def convert_rating_to_sentiment(rating):
  return 1 if rating >=4 else 0

review_df["review_sentiment"] = review_df.review_rating.apply(convert_rating_to_sentiment)
review_df.review_sentiment.value_counts()

In [None]:
review_df.head()

In [None]:
# Restrict the remaining analysis to Kenya Airways reviews
review_df = review_df[review_df.airline == 'Kenya Airways']

## **Data Preparation - Text Pre-processing for Reviews**

In [None]:
# Pre-processing Variables Declared
CONTRACTION_MAP = {"ain't": "is not", "aren't": "are not", "can't": "cannot",
                   "can't've": "cannot have", "'cause": "because", "could've": "could have",
                   "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not",
                   "doesn't": "does not", "don't": "do not", "hadn't": "had not",
                   "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not",
                   "he'd": "he would", "he'd've": "he would have", "he'll": "he will",
                   "he'll've": "he will have", "he's": "he is", "how'd": "how did",
                   "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                   "I'd": "I would", "I'd've": "I would have", "I'll": "I will",
                   "I'll've": "I will have", "I'm": "I am", "I've": "I have",
                   "i'd": "i would", "i'd've": "i would have", "i'll": "i will",
                   "i'll've": "i will have", "i'm": "i am", "i've": "i have",
                   "isn't": "is not", "it'd": "it would", "it'd've": "it would have",
                   "it'll": "it will", "it'll've": "it will have", "it's": "it is",
                   "let's": "let us", "ma'am": "madam", "mayn't": "may not",
                   "might've": "might have", "mightn't": "might not", "mightn't've": "might not have",
                   "must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
                   "needn't": "need not", "needn't've": "need not have", "o'clock": "of the clock",
                   "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
                   "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would",
                   "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have",
                   "she's": "she is", "should've": "should have", "shouldn't": "should not",
                   "shouldn't've": "should not have", "so've": "so have", "so's": "so as",
                   "this's": "this is",
                   "that'd": "that would", "that'd've": "that would have", "that's": "that is",
                   "there'd": "there would", "there'd've": "there would have", "there's": "there is",
                   "they'd": "they would", "they'd've": "they would have", "they'll": "they will",
                   "they'll've": "they will have", "they're": "they are", "they've": "they have",
                   "to've": "to have", "wasn't": "was not", "we'd": "we would",
                   "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
                   "we're": "we are", "we've": "we have", "weren't": "were not",
                   "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                   "what's": "what is", "what've": "what have", "when's": "when is",
                   "when've": "when have", "where'd": "where did", "where's": "where is",
                   "where've": "where have", "who'll": "who will", "who'll've": "who will have",
                   "who's": "who is", "who've": "who have", "why's": "why is",
                   "why've": "why have", "will've": "will have", "won't": "will not",
                   "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
                   "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                   "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
                   "you'd": "you would", "you'd've": "you would have", "you'll": "you will",
                   "you'll've": "you will have", "you're": "you are", "you've": "you have", "n't": "not"}

PUNCTUATIONS = [
    ',', '.', '"', ':', ')', '(', '!', '?', '|', ';', "'", '$', '&',
    '/', '[', ']', '>', '%', '=', '#', '*', '+', "\\", "*",  "~", "@", "£",
    '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',
    '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”',
    '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾',
    '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', '¼', '⊕', '▼',
    '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲',
    'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', '）', '↓', '、', '│', '（', '»',
    '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø',
    '¹', '≤', '‡', '√', '«', '»', '´', 'º', '¾', '¡', '§', '£', '₤','*']

In [None]:
!pip install pyspellchecker
!pip install textdistance

In [None]:
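from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import wordpunct_tokenize
import inflect
import string
import re
import nltk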

# Text Pre-processing Function
def expand_contractions(sentence, contraction_mapping):

    """Expand contractions within a sentence, e.g. "can't" becomes "cannot".

    Args:
        sentence (str): A sentence that may contain contractions.
        contraction_mapping (dict): A mapping of contractions to their expanded forms.

    Returns:
        str: The sentence with contractions expanded.
    """
    # Escape the keys and try longer contractions first so that
    # "can't've" is matched before its prefix "can't"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(
        re.escape(key) for key in keys)),
        flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        expanded_contraction = contraction_mapping.get(match) if contraction_mapping.get(
            match) else contraction_mapping.get(match.lower())
        if expanded_contraction is None:
            return match
        # Preserve the capitalisation of the original first character
        if match[0].isupper():
            expanded_contraction = expanded_contraction[0].upper() + expanded_contraction[1:]
        return expanded_contraction

    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    return expanded_sentence


def remove_emojis(text):
    emoj = re.compile("["
        u"\U00002700-\U000027BF"  # Dingbats
        u"\U0001F600-\U0001F64F"  # Emoticons
        u"\U00002600-\U000026FF"  # Miscellaneous Symbols
        u"\U0001F300-\U0001F5FF"  # Miscellaneous Symbols And Pictographs
        u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        u"\U0001F680-\U0001F6FF"  # Transport and Map Symbols
                      "]+", re.UNICODE)
    return re.sub(emoj, '', text)


def sep_num_words(text):
    """
    Function separates numbers from the characters that follow them, e.g. 22ABC to 22 ABC
    Args:
        text (str): Review text that may contain numbers fused to words
    Returns:
        cleaned string with a space inserted after each number
    """
    return re.sub(r"([0-9]+(\.[0-9]+)?)", r"\1 ", text).strip()


def num_to_word(text):
    """Function converts numbers in review texts into words.

    Args:
        text (str): A review text

    Returns:
        str: A review text with numbers converted to words.
    """
    p = inflect.engine()
    output_text = []
    for word in text.split():
        if word.isdigit():
          output_text.append(p.number_to_words(word))
        else:
          output_text.append(word)
    return " ".join(output_text)

def remove_numbers_in_string(text):
     mapping = str.maketrans('', '', string.digits)
     text = text.translate(mapping)
     return text

def classify_reviews_by_lang(df, review_column="review_x"):
    """Function classifies reviews by language and adds a new column to the dataframe.

    Args:
        df (DataFrame): Reviews DataFrame

    Returns:
        df: A Reviews Dataframe with a new column with classifications.
    """
    # Loading pre-trained language model to identify the review languages - ## The objective is to only focus on english review.
    pre_trained_model = "lid.176.bin"
    lang_model = fasttext.load_model(pre_trained_model)

    # Pass each review through the model's predict() function, which returns labels such as '__label__en'.
    detected_lang  = []

    for review in df[review_column]:
        labels = lang_model.predict(review)[0]
        # Strip the label prefix rather than slicing a string representation of the tuple
        detected_lang.append(labels[0].replace("__label__", ""))

    df["review_language"] = detected_lang
    return df


def drop_non_english_languages(df):
    """Function drops all the non-english languages from the DataFrame

    Args:
        df (DataFrame): DataFrame Object

    Returns:
        df: Returns a DataFrame with non-english languages dropped.
    """
    other_language_index = df[(df["review_language"] != 'en')].index
    df = df.drop(other_language_index)
    return df


def remove_punctuation(text, punctuations):
    """Function removes all punctuations from text.

    Args:
        text (str): A string text with or without punctuations.
        punctuations (List): A list of common punctuations.

    Returns:
        str: Returns a string cleaned of punctuations
    """
    for punctuation in punctuations:
        if punctuation in text:
            text = text.replace(punctuation, '')
    # Return only after every punctuation mark has been checked
    return text.strip().lower()

def remove_small_character(token_list, threshold=2):
    return [word for word in token_list if len(word) > threshold]


def remove_punctuation_list(word_token, punctuations):
    """Function removes all punctuations from a List of Tokens.

    Args:
        word_token (List): A list of Word Tokens.
        punctuations (List): A list of punctuations.

    Returns:
        List: A list of word tokens without punctuation tokens.
    """
    # Build a new list; removing items from a list while iterating over it skips elements
    return [word for word in word_token if word not in punctuations]


def remove_stop_words(text):
    """Function removes stop words from word token list.

    Args:
        text (List): A list of word tokens from a review.

    Returns:
        List: A list of word tokens without stop words.
    """
    stopwords_set = set(stopwords.words('english'))
    return [t for t in text if t not in stopwords_set]


def lemmatize_review(token_list):
    """
    Function lemmatizes each token using a WordNetLemmatizer.
    Input: word_token_list []
    Output: A list of lemmatized word tokens
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in token_list]

def remove_punctuations_in_token(token_strings):
  """ Function removes tokens made up entirely of punctuation from a token list.
  """
  return [x for x in token_strings if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]

def preprocess_airline_reviews(df, CONTRACTION_MAP, PUNCTUATIONS):
    """
    Function take Dataframe, Contraction Mapping and Punctuations and returns a cleaned DataFrame
    with Tokenized Review Column
    Args:
        df (DataFrame): A DataFrame Object
        CONTRACTION_MAP (dict): A dictionary of contractions
        PUNCTUATIONS (list): A list of punctuations to remove from text
    Returns:
        Returns a DataFrame Object
    """
    df = df.copy()
    # create a new review column for initial review column
    df["processed_review_tokens"] = df["review_x"]
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: expand_contractions(x, CONTRACTION_MAP))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: sep_num_words(x))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_numbers_in_string(x))
    # Digits were already stripped in the previous step, so num_to_word is effectively a no-op here
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: num_to_word(x))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_emojis(x))

    # Classify reviews by language
    # classify_reviews_by_lang(df, "review_x")
    # Drop non-english reviews
    # drop_non_english_languages(df)

    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_punctuation(x, PUNCTUATIONS))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: wordpunct_tokenize(x))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_punctuation_list(x, PUNCTUATIONS))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_punctuations_in_token(x))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_small_character(x, threshold=2))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: remove_stop_words(x))
    df["processed_review_tokens"] = df["processed_review_tokens"].apply(lambda x: lemmatize_review(x))

    return df

In [None]:
%%time
review_df = preprocess_airline_reviews(
    review_df,
    CONTRACTION_MAP,
    PUNCTUATIONS
)

In [None]:
review_df.head()

In [None]:
review_df.iloc[200,-1]

## **Exploratory Data Analysis / Modelling**
### Building a Quick Sentiment Classifier using CountVectorizer on Airline Reviews

In [None]:
review_df.sentiment_polarity_summarised.value_counts()

In [None]:
review_df.sentiment_subjectivity.value_counts()

In [None]:
review_df.review_sentiment.value_counts()

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
sns.boxplot(x="review_rating", y="sentiment_polarity_score", hue='travel_class',
                data=review_df, ax=ax)

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
sns.boxplot(x="review_rating", y="sentiment_subjectivity_score", hue='travel_class',
                data=review_df, ax=ax)

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
sns.boxplot(x="review_rating", y="sentiment_polarity_score", hue='flight_type',
                data=review_df, ax=ax)

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
sns.boxplot(x="review_rating", y="sentiment_subjectivity_score", hue='flight_type',
                data=review_df, ax=ax)

In [None]:
### **Text Detokenization - After pre-processing**

from nltk.tokenize.treebank import TreebankWordDetokenizer

def token_to_sentence(token):
    return TreebankWordDetokenizer().detokenize(token)

review_df['processed_review_detokenized'] = review_df.processed_review_tokens.apply(lambda x: token_to_sentence(x))

In [None]:
review_df["processed_review_detokenized"].values[:1]

In [None]:
review_df.iloc[10,:].values

In [None]:
from spellchecker import SpellChecker

spell_corrector = SpellChecker()

# spelling correction using spellchecker
def spell_correction(text):
	"""
	Return :- text which have correct spelling words
	Input :- string
	Output :- string
	"""
	# initialize empty list to save correct spell words
	correct_words = []
	# extract spelling incorrect words by using unknown function of spellchecker
	misSpelled_words = spell_corrector.unknown(text.split())

	for each_word in text.split():
		if each_word in misSpelled_words:
			right_word = spell_corrector.correction(each_word)
			correct_words.append(right_word)
		else:
			correct_words.append(each_word)

	# joining correct_words list into single string
	correct_spelling = ' '.join(word for word in correct_words if word)
	return correct_spelling

In [None]:
%%script echo skipping
review_df["processed_review_detokenized"] = review_df["processed_review_detokenized"].apply(lambda x: spell_correction(x))

### POS - Review Text Parts of Speech Analysis

In [None]:
### **Review Text - Parts of Speech Analysis**
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_tag(text):
    doc = nlp(text)
    # Collect rows first; DataFrame.append was removed in pandas 2.0 and is slow inside a loop
    rows = [{'WORD': token.text, 'POS': token.pos_} for token in doc]
    return pd.DataFrame(rows, columns=['WORD', 'POS'])


In [None]:
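# Note: to_string() also emits index labels and alignment padding, which may inflate the SPACE and NUM tag counts
df_pos = pos_tag(review_df['processed_review_detokenized'].to_string())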

In [None]:
df_pos.shape

In [None]:
df_top_pos = df_pos.groupby('POS')['POS'].count().\
    reset_index(name='count').sort_values(['count'],ascending=False).head(15)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Review Text POS Analysis - KQ", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Review Text is composed mainly of Spaces, then Nouns", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

plt.xticks(fontproperties=fontprop, fontsize=12)
plt.yticks(fontproperties=fontprop, fontsize=12)

df_top_pos.plot(kind="bar", x='POS', stacked=True, ax=ax, color=['darkblue'])

plt.xlabel('Parts of Speech Tags', fontproperties=fontprop, fontsize=20)
plt.ylabel('Frequency in Review Text', fontproperties=fontprop, fontsize=20)

In [None]:
df_nn = df_pos[df_pos['POS'] == 'NOUN'].copy()
df_nn.groupby('WORD')['WORD'].count().reset_index(name='count').\
    sort_values(['count'], ascending=False).head(15)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("POS Analysis (Nouns) - KQ", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Top ranking noun in frequency is Flight, Time", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

plt.xticks(fontproperties=fontprop, fontsize=12)
plt.yticks(fontproperties=fontprop, fontsize=12)

df_nn = df_pos[df_pos['POS'] == 'NOUN'].copy()
df_nn.groupby('WORD')['WORD'].count().reset_index(name='count').\
    sort_values(['count'], ascending=False).head(15).plot(kind='barh', x='WORD', stacked=True, ax=ax, color=['darkblue'])

plt.xlabel('Frequency of Words', fontproperties=fontprop, fontsize=20)
plt.ylabel('Nouns in Review', fontproperties=fontprop, fontsize=20)

In [None]:
# df_nn = df_pos[df_pos['POS'] == 'VERB'].copy()
# df_nn.groupby('WORD')['WORD'].count().reset_index(name='count').\
#     sort_values(['count'], ascending=False).head(15)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("POS Analysis (Adjective) - KQ", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Most frequent adjectives in review text", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

plt.xticks(fontproperties=fontprop, fontsize=12)
plt.yticks(fontproperties=fontprop, fontsize=12)

df_nn = df_pos[df_pos['POS'] == 'ADJ'].copy()
df_nn.groupby('WORD')['WORD'].count().reset_index(name='count').\
    sort_values(['count'], ascending=False).head(15).plot(kind='barh', x='WORD', stacked=True, ax=ax, color=['darkblue'])

plt.xlabel('Frequency of Words', fontproperties=fontprop, fontsize=20)
plt.ylabel('Adjectives in Review', fontproperties=fontprop, fontsize=20)

In [None]:
df_adj = df_pos[df_pos['POS'] == 'ADJ'].copy()
df_adj.groupby('WORD')['WORD'].count().reset_index(name='count').\
    sort_values(['count'], ascending=False).head(15).plot(kind='bar', x='WORD')

### Review Text - Bigram Analysis

Bigram & Trigram Analysis: We want to identify bigrams and trigrams so we can concatenate them and treat each as a single token. Bigrams are two-word phrases, e.g. 'social media', where 'social' and 'media' are more likely to co-occur than to appear separately. Likewise, trigrams are three-word phrases whose words tend to co-occur, e.g. 'Procter and Gamble'. We use the Pointwise Mutual Information (PMI) score to identify significant bigrams and trigrams to concatenate. We also keep only bigrams and trigrams matching the part-of-speech patterns (noun/adj, noun) and (noun/adj, any, noun/adj), because these are common structures for noun-type n-grams. This helps the LDA model cluster topics better.
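To make the PMI score concrete, here is a minimal sketch that computes it by hand as log2( P(x, y) / (P(x) P(y)) ). The toy token lists and the helper name `pmi` are hypothetical, and probabilities are estimated simply as counts divided by the total number of tokens; NLTK's collocation finders may differ in implementation details.

```python
import math
from collections import Counter

# Hypothetical tokenised reviews, used only for illustration
docs = [
    ["cabin", "crew", "friendly"],
    ["cabin", "crew", "helpful"],
    ["flight", "delayed", "cabin", "crew"],
    ["flight", "delayed", "again"],
]

unigrams = Counter(tok for doc in docs for tok in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
n_tokens = sum(unigrams.values())


def pmi(w1, w2):
    """PMI of an observed bigram; undefined (log of zero) for unseen pairs."""
    p_xy = bigrams[(w1, w2)] / n_tokens
    p_x = unigrams[w1] / n_tokens
    p_y = unigrams[w2] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


for pair in [("flight", "delayed"), ("cabin", "crew"), ("delayed", "cabin")]:
    print(pair, round(pmi(*pair), 3))
```

In this toy corpus 'flight delayed' scores highest because the two words never occur apart, while 'delayed cabin' scores lowest because both words mostly appear in other pairings. PMI can also overrate very rare pairs, which is one reason to apply a minimum frequency filter before scoring.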

In [None]:
from nltk.util import ngrams
from collections import Counter


def get_bigram(text):
    token = word_tokenize(text)
    bigram = list(ngrams(token, 2))
    return bigram


review_df['bigram_list'] = review_df['processed_review_detokenized'].apply(lambda x: get_bigram(x))
# review_df['bigram_list'].apply(pd.Series).stack().reset_index(drop = True)

counter = Counter(review_df['bigram_list'].apply(pd.Series).stack().reset_index(drop = True))
counter.most_common(25)

### Review Text - Trigram Analysis

In [None]:
from nltk.util import ngrams
from collections import Counter


def get_trigram(text):
    token = word_tokenize(text)
    trigram = list(ngrams(token, 3))
    return trigram


review_df['trigram_list'] = review_df['processed_review_detokenized'].apply(lambda x: get_trigram(x))
# review_df['bigram_list'].apply(pd.Series).stack().reset_index(drop = True)

counter = Counter(review_df['trigram_list'].apply(pd.Series).stack().reset_index(drop = True))
counter.most_common(25)

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_documents([word for word in review_df.loc[:,"processed_review_tokens"]])

# Filter only those that occur at least 30 times
finder.apply_freq_filter(30)
bigram_scores = finder.score_ngrams(bigram_measures.pmi)

In [None]:
bigram_scores[:20]

In [None]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_documents([word for word in review_df.loc[:,"processed_review_tokens"]])

# Filter only those that occur at least 30 times
finder.apply_freq_filter(30)
trigram_scores = finder.score_ngrams(trigram_measures.pmi)

In [None]:
trigram_scores[:10]

In [None]:
bigram_pmi = pd.DataFrame(bigram_scores)

bigram_pmi.columns = ['bigram', 'pmi']
bigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)

In [None]:
trigram_pmi = pd.DataFrame(trigram_scores)

trigram_pmi.columns = ['trigram', 'pmi']
trigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)

In [None]:
trigram_pmi

In [None]:
# Filter for bigrams with only noun-type structures
def bigram_filter(bigram):
    tag = nltk.pos_tag(bigram)
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['NN']:
        return False
    if bigram[0] in stopwords.words('english') or bigram[1] in stopwords.words('english'):
        return False
    if 'n' in bigram or 't' in bigram:
        return False
    if 'PRON' in bigram:
        return False
    return True

In [None]:
# Filter for trigrams with only noun-type structures
def trigram_filter(trigram):
    tag = nltk.pos_tag(trigram)
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['JJ','NN']:
        return False
    if trigram[0] in stopwords.words('english') or trigram[-1] in stopwords.words('english') or trigram[1] in stopwords.words('english'):
        return False
    if 'n' in trigram or 't' in trigram:
         return False
    if 'PRON' in trigram:
        return False
    # Without this the function returned None, which rejected every trigram
    return True

In [None]:
# Need to set pmi threshold to whatever makes sense - eyeball through and select threshold where n-grams stop making sense
# choose top 500 ngrams in this case ranked by PMI that have noun like structures

filtered_bigram = bigram_pmi[bigram_pmi.apply(lambda bigram: bigram_filter(bigram['bigram']) and bigram.pmi > 4, axis = 1)][:500]
# filtered_trigram = trigram_pmi[trigram_pmi.apply(lambda trigram: trigram_filter(trigram['trigram']) and trigram.pmi > 5, axis = 1)][:500]


bigrams = [' '.join(x) for x in filtered_bigram.bigram.values if len(x[0]) > 2 or len(x[1]) > 2]
# trigrams = [' '.join(x) for x in filtered_trigram.trigram.values if len(x[0]) > 2 or len(x[1]) > 2 and len(x[2]) > 2]


In [None]:
bigrams[:25]

## **Sentiment Analysis Classification**

### Using CountVectorizer & LogisticRegression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

reviews_X = review_df["processed_review_detokenized"].values

In [None]:
y = review_df.review_sentiment.values

In [None]:
import time
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

lr_st = time.time()

cv = CountVectorizer()

X = cv.fit_transform(reviews_X)
# Vocabulary size (get_feature_names() was removed in scikit-learn 1.2)
len(cv.vocabulary_)

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.30, random_state=5)

lr_model = LogisticRegression()

lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)

lr_et = time.time()

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
(lr_et - lr_st) * 1000

In [None]:
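from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement
# so the calls in this and the later cells keep working
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator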

color = 'white'
matrix = plot_confusion_matrix(lr_model, X_test, y_test, cmap=plt.cm.Blues)

matrix.ax_.set_title('Confusion Matrix', color=color)
plt.xlabel('Predicted Label', color=color)
plt.ylabel('True Label', color=color)
plt.gcf().axes[0].tick_params(colors=color)
plt.gcf().axes[1].tick_params(colors=color)
plt.show()

### Using TfidfVectorizer & MultinomialNB Model

In [None]:
import time

# X = review_df.loc[:,"review_x"]
X = review_df["processed_review_detokenized"].values
y = review_df.loc[:,"review_sentiment"]

mnb_st = time.time()

from sklearn.feature_extraction.text import TfidfVectorizer
# td = TfidfVectorizer(max_features = 4500)

td = TfidfVectorizer(max_features = 4500)

X = td.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Training Classifier and Predicting on Test Data
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

mnb_et = time.time()

In [None]:
from sklearn.metrics import accuracy_score, classification_report

accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
color = 'white'
matrix = plot_confusion_matrix(classifier, X_test, y_test, cmap=plt.cm.Blues)
matrix.ax_.set_title('Confusion Matrix', color=color)
plt.xlabel('Predicted Label', color=color)
plt.ylabel('True Label', color=color)
plt.gcf().axes[0].tick_params(colors=color)
plt.gcf().axes[1].tick_params(colors=color)
plt.show()

In [None]:
(mnb_et - mnb_st) * 1000

### Using TfidfVectorizer & XGBoost Model

In [None]:
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

import time

xgb_st = time.time()

td = TfidfVectorizer(max_features = 4500)

reviews_X = review_df["processed_review_detokenized"].values
X = td.fit_transform(reviews_X)
y = review_df.review_sentiment.values

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.30, random_state=5)

xgb_model = XGBClassifier(max_depth=6, n_estimators=1000).fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)

xgb_et = time.time()

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
# Execution Time
(xgb_et - xgb_st) * 1000

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
color = 'white'
matrix = plot_confusion_matrix(xgb_model, X_test, y_test, cmap=plt.cm.Blues)
matrix.ax_.set_title('Confusion Matrix', color=color)
plt.xlabel('Predicted Label', color=color)
plt.ylabel('True Label', color=color)
plt.gcf().axes[0].tick_params(colors=color)
plt.gcf().axes[1].tick_params(colors=color)
plt.show()

In [None]:
import pickle

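# Persist the MultinomialNB model (`classifier`); the fitted TfidfVectorizer (`td`)
# is not saved here and would be needed to score new text.
filename = 'sentiment_classification_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(classifier, model_file)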

In [None]:
!ls

## **Topic Modelling**

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents.



In [None]:
# Keep only singular common nouns (POS tag 'NN')
def noun_only(x):
    pos_comment = nltk.pos_tag(x)
    filtered = [word[0] for word in pos_comment if word[1] in ['NN']]
    # to filter both noun and verbs
    #filtered = [word[0] for word in pos_comment if word[1] in ['NN','VB', 'VBD', 'VBG', 'VBN', 'VBZ']]
    return filtered

# Concatenate n-grams
def replace_ngram(x):
    # for gram in trigrams:
    #     x = x.replace(gram, '_'.join(gram.split()))
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x

In [None]:
review_df['processed_review_detokenized'] = review_df['processed_review_detokenized'].map(lambda x: replace_ngram(x))

In [None]:
from nltk.tokenize import word_tokenize

review_df['processed_review_detokenized_tokenized'] = review_df['processed_review_detokenized'].apply(lambda x: word_tokenize(x))
review_df['processed_review_detokenized_tokenized'] = review_df['processed_review_detokenized_tokenized'].apply(lambda x: noun_only(x))

review_df['processed_review_detokenized_tokenized'].head()

In [None]:
review_df['processed_review_detokenized_tokenized'].values[:1]

In [None]:
review_df['processed_review_detokenized'] = review_df.processed_review_detokenized_tokenized.apply(lambda x: token_to_sentence(x))

In [None]:
## Topic Modelling using LDA - Latent Dirichlet Allocation
## Topic Modelling using LSA - Latent Semantic Analysis
## Topic Modelling using NMF - Non-negative Matrix Factorization
## Topic Modelling using BERTopic - Transformer Models

documents = review_df.loc[:,"processed_review_detokenized"]

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

no_features = 4000

# NMF uses tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

Checking sparsity: we report the percentage of non-zero cells in the TF-IDF document-word matrix. A low percentage means the matrix is highly sparse.
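As a rough illustration of that calculation, the sketch below computes the percentage of non-zero cells for a small toy matrix. The function name `nonzero_percentage` and the toy data are assumptions made for this example only. For a SciPy sparse matrix such as `tfidf`, the same numerator is available as `tfidf.nnz`, which avoids materializing a dense copy.

```python
def nonzero_percentage(rows):
    """Return the percentage of non-zero cells in a list-of-lists matrix."""
    total_cells = sum(len(row) for row in rows)
    if total_cells == 0:
        raise ValueError("matrix has no cells")
    nonzero_cells = sum(1 for row in rows for value in row if value != 0)
    return 100.0 * nonzero_cells / total_cells


# Toy document-word matrix: 3 documents x 4 terms, 3 non-zero cells out of 12
toy_matrix = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
]
toy_percentage = nonzero_percentage(toy_matrix)
print("Non-zero cells:", toy_percentage, "%")
```

Real document-word matrices are usually far sparser than this toy example, because each review uses only a small fraction of the vocabulary.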

In [None]:
# Checking Sparsity of the TF-IDF Vectors

# Materialize the sparse data
data_dense = tfidf.todense()

# Percentage of non-zero cells in the TF-IDF matrix
print("Non-zero cells: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

In [None]:
# LDA is a probabilistic graphical model, so it is fitted on raw term counts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words="english")
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()

Checking sparsity: we report the same percentage of non-zero cells for the raw term-count matrix.

In [None]:
# Checking Sparsity of the CountVectorizer Vectors

# Materialize the sparse data
data_dense_v = tf.todense()

# Percentage of non-zero cells in the term-count matrix
print("Non-zero cells: ", ((data_dense_v > 0).sum()/data_dense_v.size)*100, "%")

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD

no_topics = 5

# Run LSA
# SVD represent documents and terms in vectors
lsa = TruncatedSVD(
    n_components=no_topics,
    algorithm='randomized',
    n_iter=100,
    random_state=122).fit(tfidf)

# Run NMF
nmf = NMF(
    n_components=no_topics,
    random_state=1,
    alpha=.1,
    l1_ratio=.5,
    init='nndsvd').fit(tfidf)

# Run LDA
lda = LatentDirichletAllocation(
    n_components=no_topics,
    learning_method='online',
    learning_decay=0.2,
    random_state=0).fit(tf)

### **Displaying and Evaluating Topics**

We evaluate our topic models using a mixed approach, outlined below:

- Observation-based, e.g. observing the top ‘n’ words in a topic.
- Interpretation-based, e.g. ‘word intrusion’ and ‘topic intrusion’ tests, which identify the words or topics that “don’t belong” in a topic or document.
- Quantitative metrics: perplexity (held-out likelihood) and coherence calculations.
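To make the perplexity metric concrete, here is a minimal sketch of the relationship between log-likelihood and perplexity. scikit-learn documents LDA perplexity as `exp(-1 * log-likelihood per word)`. The function name `perplexity_from_log_likelihood` and the toy numbers are assumptions for illustration; they are not values from this notebook.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood per token); lower is better."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)


# Toy case: a model that assigns probability 0.25 to each of 4 observed tokens
toy_log_likelihood = 4 * math.log(0.25)
toy_perplexity = perplexity_from_log_likelihood(toy_log_likelihood, 4)
print("Toy perplexity:", toy_perplexity)
```

In the toy case the perplexity comes out at roughly 4, which can be read as the model being about as uncertain as a uniform choice among four words.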

In [None]:
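def display_topics(model, feature_names, no_top_words):
  """Print the `no_top_words` highest-weighted terms for each topic of a fitted model.

  `feature_names` must come from the vectorizer the model was trained on.
  """
  for index, component in enumerate(model.components_):
    # Pair each term with its weight in this topic, then keep the heaviest terms
    zipped = zip(feature_names, component)
    top_terms_key = sorted(
        zipped,
        key=lambda t:t[1],
        reverse=True)[:no_top_words]
    top_terms_list = [term for term, _ in top_terms_key]
    print("Topic " + str(index) + ": ", top_terms_list)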

In [None]:
display_topics(nmf, tfidf_feature_names, 15)

In [None]:
display_topics(lda, tf_feature_names, 15)

In [None]:
display_topics(lsa, tfidf_feature_names, 15)

In [None]:
# Model Evaluation
# Model Evaluation Strategies
# 1. Eye Balling Models - Top N words, Topics / Documents
# 2. Intrinsic Evaluation Metrics - Capturing model semantics, Topics interpretability
# 3. Human Judgements - What is a topic?
# 4. Extrinsic Evaluation Metrics / Evaluation at task.
print("Log Likelihood, ", lda.score(tf))
print("Perplexity,", lda.perplexity(tf)) # Optimizing for perplexity may not yield human-interpretable results.
print("LDA Model Params", lda.get_params())

In [None]:
from gensim.models import CoherenceModel
import gensim.corpora as corpora

def get_coherence_value(model, feature_names, df_column, n_top_words):
    """Return the c_v coherence of a fitted scikit-learn topic model.

    feature_names must come from the vectorizer the model was trained on,
    so that column indices in model.components_ map to the right terms.
    """
    texts = [doc.split() for doc in df_column]

    # Create the gensim dictionary from the tokenized documents
    dictionary = corpora.Dictionary(texts)

    # Get the top words for each topic from the components_ attribute
    top_words = []
    for topic in model.components_:
      top_indices = topic.argsort()[:-n_top_words - 1:-1]
      top_words.append([feature_names[i] for i in top_indices])

    coherence_model = CoherenceModel(
        topics=top_words,
        texts=texts,
        dictionary=dictionary,
        coherence='c_v')

    return coherence_model.get_coherence()

In [None]:
get_coherence_value(lda, tf_feature_names, review_df['processed_review_detokenized'], 30)

In [None]:
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

In [None]:
print("LSA Coherence", get_coherence_value(lsa, tfidf_feature_names, review_df['processed_review_detokenized'], 20))

In [None]:
get_coherence_value(nmf, tfidf_feature_names, review_df['processed_review_detokenized'], 20)

In [None]:
%%script echo skipping

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

dictionary = corpora.Dictionary(review_df.loc[:,"processed_review_tokens"])
texts = review_df.loc[:,"processed_review_tokens"]
corpus = [dictionary.doc2bow(text) for text in texts]

coherence = []
for k in range(1,9):
    print('Round: '+ str(k))
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(corpus, num_topics=k, \
               id2word = dictionary, passes=40,\
               iterations=200, chunksize = 10000, eval_every = None)

    cm = gensim.models.coherencemodel.CoherenceModel(\
         model=ldamodel, texts=texts,\
         dictionary=dictionary, coherence='c_v')

    coherence.append((k,cm.get_coherence()))

In [None]:
# coherence

In [None]:
%%script echo skipping

from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Grid searching to get the best LDA model etc.

# Define Search Param
search_params = {'n_components': [2, 3, 4, 5, 10], 'learning_decay': [.2, .5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(tf)

In [None]:
%%script echo skipping

# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(tf))

### Topic Modelling with BERTopic

In [None]:
!pip -q install bertopic

In [None]:
!pip show tqdm
!pip uninstall -y tqdm
!pip install tqdm

In [None]:
!pip -q install umap-learn
!pip -q install hdbscan
!pip -q install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=100)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=40,
                        gen_min_span_tree=True,
                        prediction_data=True)

In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Use a separate name so the imported nltk `stopwords` module is not overwritten
custom_stopwords = list(stopwords.words('english')) + ['airline', 'kenya', 'nairobi','flight','hour']

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# topic_model = BERTopic(language="multilingual")
# topics, probs = topic_model.fit_transform(review_df.loc[:,"processed_review_detokenized"])
# topic_model.update_topics(review_df.loc[:,"processed_review_detokenized"], n_gram_range=(1, 3))

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=custom_stopwords)

seed_topic_list = [
    ["time", "delay", "airport","time",  "hour"],
    ["baggage", "luggage", "bag"],
    ["seat", "comfort", "cleanliness","legroom", "entertainment", "space"],
    ["food", "meal", "snack", "drink","beverage"],
    ["staff", "service", "airline staff", "service staff", "ground_staff", "crew", "pilot", "customer_service"],
    ["airport", "flight", "connecting_flight", "experience", "board", "dreamliner"],
    ["business", "business_class", "economy"],
    ["hotel"]
    ]

model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    language='english',
    calculate_probabilities=True,
    verbose=True,
    diversity=0.2,
    seed_topic_list=seed_topic_list
)

topics, probs = model.fit_transform(review_df.processed_review_detokenized.values)

In [None]:
model.get_topic_info()

In [None]:
model.get_representative_docs()

In [None]:
# model.visualize_topics()

In [None]:
model.get_topic(-1)

In [None]:
model.get_topic(0)

In [None]:
model.get_topic(1)

In [None]:
model.get_topic(2)

In [None]:
model.get_topic(3)

In [None]:
model.visualize_barchart(topics=[-1,0,1,2,3], n_words=10, custom_labels=False, width=350, height=350)

In [None]:
model.set_topic_labels({
    -1:"General Flight Experience",
     0:"Flight Time / Service Time",
     1:"In-Flight Service",
     2:"Staff Experience / Delays",
     3:"Flight Operations (Baggage Handling)",
})

In [None]:
model.visualize_barchart(topics=[-1,0,1,2,3], n_words=10, custom_labels=True, width=350, height=350, title="Review Topic Classification - KQ")

In [None]:
custom_topic_names = ["In-Flight Service (Food, Entertainment, Meal)", "Flight Time (Service Time, Departure, Arrival)", "Staff & Crew Experience", "Luggage Handling"]

In [None]:
model.visualize_term_rank()

In [None]:
model.visualize_hierarchy()

The intertopic distance map shows how far apart the topics are: similar topics sit close together, and very different topics sit far apart.
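One simple way to think about distance between topics is cosine distance between their term-weight vectors. The sketch below is only an illustration of that idea and is not BERTopic's internal procedure; the function name `cosine_distance` and the three toy topic vectors are assumptions made up for this example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical weights over the terms ["food", "meal", "bag", "luggage"]
food_topic = [0.9, 0.8, 0.0, 0.1]
meal_topic = [0.8, 0.9, 0.1, 0.0]
baggage_topic = [0.0, 0.1, 0.9, 0.8]

print("food vs meal:   ", cosine_distance(food_topic, meal_topic))
print("food vs baggage:", cosine_distance(food_topic, baggage_topic))
```

The two food-related toy topics end up much closer to each other than either is to the baggage topic, which is the pattern the distance map makes visible.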

In [None]:
model.visualize_topics()

In [None]:
model.visualize_heatmap()

The topic prediction for a document is based on the predicted probabilities of the document belonging to each topic. The topic with the highest probability is the predicted topic. This probability represents how confident we are about finding the topic in the document.
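The rule described above can be sketched as a simple argmax over one document's topic probabilities. The function name `predict_topic` and the probability values are hypothetical, chosen for illustration. In practice the topic that BERTopic assigns may not always match this argmax, for example when the clustering step marks a document as an outlier (topic -1).

```python
def predict_topic(probabilities):
    """Return (topic index, probability) for the most likely topic."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index, probabilities[best_index]


# Hypothetical probabilities for one review across topics 0-3
doc_probabilities = [0.05, 0.62, 0.21, 0.12]
predicted_topic, confidence = predict_topic(doc_probabilities)
print("Predicted topic:", predicted_topic, "with probability", confidence)
```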

In [None]:
review_df.processed_review_detokenized.values[45]

In [None]:
model.visualize_distribution(model.probabilities_[45], min_probability=0.015)

In [None]:
embeddings = embedding_model.encode(review_df.processed_review_detokenized.values, show_progress_bar=False)
model.visualize_documents(review_df.processed_review_detokenized.values, embeddings=embeddings)

In [None]:
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = model.approximate_distribution(review_df.processed_review_detokenized.values, calculate_tokens=True)

# Visualize the token-level distributions
df = model.visualize_approximate_distribution(review_df.processed_review_detokenized.values[8], topic_token_distr[8])
df

Saving the Final Topic Model in the Colab Notebook.

In [None]:
# Save the topic model
model.save("kq_review_topic_model")
# Load the topic model
model = BERTopic.load("kq_review_topic_model")

In [None]:
review_df.head()

In [None]:
review_df["Review Topic"] = topics

In [None]:
topic_1 = []
topic_2 = []
topic_3 = []
topic_4 = []

In [None]:
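# Each row of probs holds one document's probabilities for topics 0-3, in that order
# (so topic_1 holds topic 0, and so on); the outlier topic -1 has no column.
for i in probs:
  topic_1.append(i[0])
  topic_2.append(i[1])
  topic_3.append(i[2])
  topic_4.append(i[3])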

In [None]:
review_df["topic_1_prob"] = topic_1
review_df["topic_2_prob"] = topic_2
review_df["topic_3_prob"] = topic_3
review_df["topic_4_prob"] = topic_4

In [None]:
review_df

In [None]:
review_df.to_csv('final_topical_annotated_dataset.csv')