## Contents
* [1. Sentiment Analysis on Top Mentioned Words](#1.-Sentiment-Analysis-on-Top-Mentioned-Words)
* [2. Imports](#2.-Imports)
* [3. Data Cleaning & Preparation](#3.-Data-Cleaning-&-Preparation)
* [4. Top Mentioned Words](#4.-Top-Mentioned-Words)
    * [4.1 KrisFlyer & PPS Club](#4.1-KrisFlyer-&-PPS-Club)
    * [4.2 Lounge, Catering, Amenity Kits](#4.2-Lounge,-Catering,-Amenity-Kits)
* [5. Sentiment Analysis](#5.-Sentiment-Analysis)
    * [5.1 KrisFlyer & PPS Club](#5.1-KrisFlyer-&-PPS-Club)
    * [5.2 Lounge, Catering, Amenity Kits](#5.2-Lounge,-Catering,-Amenity-Kits)
* [6. Remarks](#6.-Remarks)

---
## 1. Sentiment Analysis on Top Mentioned Words
---
The objective is to find the top 3 words for each of SIA's 2 most frequently asked about topics: (1) KrisFlyer and PPS Club, and (2) Lounge, Catering and Amenity Kits. The sentiment of the comments containing these 6 words is then analysed to help the business identify and prioritise its areas of strength and weakness.

---
## 2. Imports
---

In [1]:
import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from spacytextblob.spacytextblob import SpacyTextBlob

---
## 3. Data Cleaning & Preparation
---

- read CSV files

In [2]:
kf_df = pd.read_csv('output/kf_clean.csv')
lca_df = pd.read_csv('output/LCA_clean.csv')
print(kf_df.head())

                                           text source
0           Qualifying as EG for the first time     kf
1           Which FFP for me? Master Discussion     kf
2                              #SQMelbourneTram     kf
3  Advice sought - Changing redemption bookings     kf
4                          First Savers SYD-SIN     kf


- resolve any NA values

In [3]:
print(kf_df.isna().sum())
print(lca_df.isna().sum())

# acceptable to drop the NA rows (1 per DataFrame, see output below) out of 44k values
kf_df.dropna(inplace=True)
lca_df.dropna(inplace=True)

# reset index, and drop old index
kf_df.reset_index(drop=True, inplace=True)
lca_df.reset_index(drop=True, inplace=True)

print(kf_df.isna().sum())
print(lca_df.isna().sum())

text      1
source    0
dtype: int64
text      1
source    0
dtype: int64
text      0
source    0
dtype: int64
text      0
source    0
dtype: int64
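
The NA handling above can be reproduced on a toy frame. This is a minimal sketch, not the notebook's actual data: `toy_df` and its rows are made up for illustration, and `dropna(subset=...)` is used here (an assumption, not the original call) to make explicit that only a missing `text` should cause a row to be dropped.

```python
import pandas as pd

# Toy frame standing in for kf_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    'text': ['First Savers SYD-SIN', None, 'Which FFP for me?'],
    'source': ['kf', 'kf', 'kf'],
})

# Count missing values per column before dropping anything.
na_before = toy_df.isna().sum()

# Drop rows with a missing 'text', then rebuild a gap-free 0..n-1 index.
clean_df = toy_df.dropna(subset=['text']).reset_index(drop=True)

print(na_before)
print(clean_df)
```

Resetting the index matters later only if rows are looked up by position; without it, the dropped row would leave a gap in the index labels.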


---
## 4. Top Mentioned Words
---
- extract top words using CountVectorizer

- initialise CountVectorizer

In [5]:
cvec = CountVectorizer(lowercase=False, stop_words='english')

## 4.1 KrisFlyer & PPS Club

- extract top words

In [6]:
matrix = cvec.fit_transform(kf_df['text'])
counts = pd.DataFrame(matrix.toarray(),
                      columns=cvec.get_feature_names_out())

In [8]:
top_words_kf = counts.sum().sort_values(ascending=False)
top_words_kf = pd.DataFrame(top_words_kf, columns=['count'])
top_words_kf.head(10)

            count
miles        8899
SQ           8627
posted       7734
Originally   7367
PPS          5771
SIN          4730
flight       3692
KF           3616
just         3547
year         2822


## 4.2 Lounge, Catering, Amenity Kits

- extract top words

In [9]:
matrix = cvec.fit_transform(lca_df['text'])
counts = pd.DataFrame(matrix.toarray(),
                      columns=cvec.get_feature_names_out())

In [10]:
top_words_lca = counts.sum().sort_values(ascending=False)
top_words_lca = pd.DataFrame(top_words_lca, columns=['count'])
top_words_lca.head(10)

            count
SIN          7319
lounge       4650
SQ           4337
posted       3819
Originally   3637
The          2213
served       2031
chicken      1928
flight       1872
vegetables   1805


---
## 5. Sentiment Analysis
---

## 5.1 KrisFlyer & PPS Club

- the top 3 words, selected based on relevance, are: 'miles', 'PPS', 'KF'
- find text in kf_df that contains the top 3 words

In [11]:
miles_df = [sent for sent in kf_df['text'] if 'miles' in str(sent)]
pps_df = [sent for sent in kf_df['text'] if 'PPS' in str(sent)]
kris_df = [sent for sent in kf_df['text'] if 'KF' in str(sent)]

- convert the lists to DataFrames

In [14]:
miles_df = pd.DataFrame(miles_df, columns=['text'])
pps_df = pd.DataFrame(pps_df, columns=['text'])
kris_df = pd.DataFrame(kris_df, columns=['text'])

print(miles_df.head())
print(miles_df.shape)
print(pps_df.shape)
print(kris_df.shape)

                                                text
0            request to cancel miles (FQTV and FQTS)
1       Mix miles with cash - credit to another FFP?
2     Collecting SQ miles with KF or Alaska Airlines
3            Earning Krisflyer miles on Scoot flight
4  Change in accrual of KrisFlyer Elite miles on ...
(4338, 1)
(3323, 1)
(2498, 1)
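
The filter above uses Python's `in` operator, which matches substrings, so 'miles' would also match a word such as 'smiles'. If whole-word matching is wanted, one option is a regular expression with word boundaries. The helper `rows_mentioning` and the sample rows are hypothetical and only sketch the idea; row counts on the real data would differ from those printed above.

```python
import re
import pandas as pd

# Made-up rows, for illustration only.
toy_df = pd.DataFrame({'text': [
    'Earning Krisflyer miles on Scoot flight',
    'All smiles in the lounge',
    'KF Elite Gold renewal',
    'Is WKF a typo?',
]})

def rows_mentioning(df, word):
    """Return rows whose 'text' contains `word` as a whole, case-sensitive token."""
    pattern = rf'\b{re.escape(word)}\b'
    mask = df['text'].str.contains(pattern, regex=True, na=False)
    return df.loc[mask].reset_index(drop=True)

miles_rows = rows_mentioning(toy_df, 'miles')
kf_rows = rows_mentioning(toy_df, 'KF')
print(miles_rows)
print(kf_rows)
```

This also returns a DataFrame directly, so the separate list-to-DataFrame conversion step would not be needed.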


- conduct sentiment analysis of the top 3 words

In [15]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x16f004199a0>

In [16]:
# define function to predict sentiment on a scale of 1 (very negative) to 5 (very positive)
def sentiment_pred(comment):
    # compute the polarity once and reuse it in every comparison
    polarity = nlp(comment)._.polarity

    if polarity < -0.5:
        # very negative
        return 1
    if polarity < -0.1:
        # negative
        return 2
    if polarity < 0.1:
        # neutral
        return 3
    if polarity < 0.5:
        # positive
        return 4
    # very positive
    return 5
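
The binning logic can be checked without loading spaCy by separating it from the model call. The sketch below is an illustrative helper, not part of the original notebook: `polarity_to_class`, `CUT_POINTS` and `LABELS` are hypothetical names, and the cut points are copied from `sentiment_pred` above.

```python
from bisect import bisect_right

# Cut points copied from sentiment_pred; each one is the inclusive lower
# bound of the next class (e.g. a polarity of exactly -0.5 is class 2).
CUT_POINTS = [-0.5, -0.1, 0.1, 0.5]
LABELS = {1: 'very negative', 2: 'negative', 3: 'neutral',
          4: 'positive', 5: 'very positive'}

def polarity_to_class(polarity):
    """Map a polarity in [-1, 1] to a sentiment class from 1 to 5."""
    return bisect_right(CUT_POINTS, polarity) + 1

# Check the behaviour at and around every boundary.
boundary_inputs = (-1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0)
boundary_classes = [polarity_to_class(p) for p in boundary_inputs]
for p, c in zip(boundary_inputs, boundary_classes):
    print(p, c, LABELS[c])
```

Note that the neutral band is narrow (polarity between -0.1 and 0.1), so short titles with no sentiment-bearing words, which score 0.0, all land in class 3.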

In [17]:
miles_df['sentiment_pred'] = miles_df['text'].apply(sentiment_pred)
pps_df['sentiment_pred'] = pps_df['text'].apply(sentiment_pred)
kris_df['sentiment_pred'] = kris_df['text'].apply(sentiment_pred)

print(miles_df.head())
print(miles_df['sentiment_pred'].value_counts(normalize=True), '\n')
print(pps_df['sentiment_pred'].value_counts(normalize=True), '\n')
print(kris_df['sentiment_pred'].value_counts(normalize=True))

                                                text  sentiment_pred
0            request to cancel miles (FQTV and FQTS)               3
1       Mix miles with cash - credit to another FFP?               3
2     Collecting SQ miles with KF or Alaska Airlines               3
3            Earning Krisflyer miles on Scoot flight               3
4  Change in accrual of KrisFlyer Elite miles on ...               3
4    0.582757
3    0.343937
2    0.039650
5    0.031812
1    0.001844
Name: sentiment_pred, dtype: float64 

4    0.558231
3    0.374060
2    0.042732
5    0.024075
1    0.000903
Name: sentiment_pred, dtype: float64 

4    0.550440
3    0.376301
2    0.039231
5    0.032026
1    0.002002
Name: sentiment_pred, dtype: float64


## 5.2 Lounge, Catering, Amenity Kits

- the top 3 words, selected based on relevance, are: 'lounge', 'SQ', 'served'
- find text in lca_df that contains the top 3 words

In [18]:
lounge_df = [sent for sent in lca_df['text'] if 'lounge' in str(sent)]
sq_df = [sent for sent in lca_df['text'] if 'SQ' in str(sent)]
served_df = [sent for sent in lca_df['text'] if 'served' in str(sent)]

- convert the lists to DataFrames

In [20]:
lounge_df = pd.DataFrame(lounge_df, columns=['text'])
sq_df = pd.DataFrame(sq_df, columns=['text'])
served_df = pd.DataFrame(served_df, columns=['text'])

print(lounge_df.head())
print(lounge_df.shape)
print(sq_df.shape)
print(served_df.shape)

                                                text
0                                MUC - Which lounge?
1           Best pay-access lounge at Changi airport
2                               [ZRH] Suites lounge?
3  New dim sum corner at Krisflyer Gold lounge @ ...
4            MAN lounge access on SQ Y as non-SQ *G?
(2728, 1)
(4562, 1)
(1057, 1)


- conduct sentiment analysis of the top 3 words

In [21]:
lounge_df['sentiment_pred'] = lounge_df['text'].apply(sentiment_pred)
sq_df['sentiment_pred'] = sq_df['text'].apply(sentiment_pred)
served_df['sentiment_pred'] = served_df['text'].apply(sentiment_pred)

print(lounge_df['sentiment_pred'].value_counts(normalize=True), '\n')
print(sq_df['sentiment_pred'].value_counts(normalize=True), '\n')
print(served_df['sentiment_pred'].value_counts(normalize=True))

4    0.549487
3    0.366569
2    0.050220
5    0.031891
1    0.001833
Name: sentiment_pred, dtype: float64 

4    0.446076
3    0.423937
2    0.087023
5    0.029811
1    0.013152
Name: sentiment_pred, dtype: float64 

3    0.561022
4    0.294229
2    0.116367
1    0.017975
5    0.010407
Name: sentiment_pred, dtype: float64
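
The class shares printed above can be gathered into a single table rather than read off separate printouts. The sketch below shows one way to do that. The prediction values in `toy_preds` are made up for illustration; in the notebook they would be the `sentiment_pred` columns of the per-word DataFrames.

```python
import pandas as pd

# Hypothetical predictions standing in for the sentiment_pred columns.
toy_preds = {
    'lounge': pd.Series([4, 4, 3, 2]),
    'served': pd.Series([3, 3, 3, 4]),
}

# One column per word, one row per class 1..5; classes that never occur become 0.
summary = pd.DataFrame({
    word: preds.value_counts(normalize=True)
    for word, preds in toy_preds.items()
}).reindex(range(1, 6)).fillna(0.0)

# Most frequent class for each word.
dominant = summary.idxmax()

print(summary.round(1))
print(dominant)
```

Reporting the most frequent class alongside the shares makes it easier to see when a word is mostly neutral rather than mostly positive.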


---
## 6. Remarks
---

| Sentiment Score  | KrisFlyer<br>'miles' | KrisFlyer<br>'PPS' | KrisFlyer<br>'KF' | LCA<br>'lounge' | LCA<br>'SQ' | LCA<br>'served' |
|------------------|---------------------------|-------------------|---------------------|-------------------|-------------|------------------|
| Very negative (1) | <0.1                         | <0.1                 | <0.1                | <0.1                 | <0.1        | <0.1                |
| Negative (2)      | <0.1                         | <0.1              | <0.1                | <0.1                 | 0.1         | 0.1                |
| Neutral (3)       | 0.3                       | 0.4               | 0.4                 | 0.4               | 0.4         | 0.6              |
| Positive (4)      | 0.6                       | 0.6               | 0.6                 | 0.5               | 0.4         | 0.3              |
| Very positive (5) | <0.1                         | <0.1              | <0.1                | <0.1                 | <0.1        | <0.1             |
| Overall:         | Positive                  | Positive          | Positive            | Positive           | Positive    | Neutral, leaning positive |

- the values are the share of comments in each sentiment class; due to rounding, the shares for each word may not add up to 1

- the comments containing the top words for both KF and LCA skew positive: for every word, the combined share of classes 4 and 5 exceeds that of classes 1 and 2, although comments containing 'served' are mostly neutral. The results suggest that commenters:
    - 'miles': may be positive about the miles earned per dollar spent, and/or the deals that miles can be redeemed for
    - 'PPS': may be positive about the PPS membership experience (e.g. exclusive deals, priority services)
    - 'KF': may be positive about the KrisFlyer membership experience (e.g. exclusive deals, priority services)
    - 'lounge': may be positive about the lounge experience (e.g. cleanliness, service level)
    - 'SQ': may be positive about SQ's lounges, catering and amenity kits (e.g. cleanliness, service level)
    - 'served': may be positive about the items served (e.g. food, amenity kits)
<br>
<br>
- a separate deep-dive into the comments containing the top words could be conducted to find out why they were scored as positive (a future area of improvement)