### Citations of all resources explored
1) 
- The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. A free online book is available. (If you use the library for academic research, please cite the book.)

Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. https://www.nltk.org/book/

- If you use the VADER sentiment analysis tools, please cite:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
https://www.nltk.org/_modules/nltk/sentiment/vader.html

https://www.nltk.org/api/nltk.sentiment.vader.html

2) 
https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp

3) 
ChatGPT 3.5 for how to use NLP techniques and libraries

4) 
spaCy
https://course.spacy.io/en/chapter2 
https://spacy.io/models
https://spacy.io/usage/linguistic-features#named-entities

In [1]:
import pandas as pd

1) Overview of data
2) Remove NaNs from the reviews column (the first column is necessarily non-empty)
3) Remove everything before the first "|" character --> to get the first 1500 reviews
4) Sentiment analysis as a score column
5) Find significant phrases in each review
6) Find random rows to sample and check
7) 


In [2]:
reviews_df = pd.read_csv("data/BA_reviews.csv")

In [3]:
reviews_df.head()

Unnamed: 0.1,Unnamed: 0,reviews
0,0,✅ Trip Verified | For the price paid (bought ...
1,1,✅ Trip Verified | Flight left on time and arr...
2,2,✅ Trip Verified | Very Poor Business class pr...
3,3,Not Verified | This review is for LHR-SYD-LHR....
4,4,✅ Trip Verified | Absolutely pathetic business...


In [4]:
reviews_df.shape
reviews_df = reviews_df.dropna(subset = ['reviews'])

In [5]:
#csv file created on Dec 29 2023
reviews_df.iloc[0,1]

'✅ Trip Verified |  For the price paid (bought during a sale) it was a decent experience although the club class (business class) seats offer no more legroom than economy class (using short-haul fleet on a 4 hour flight). Fast track through security was not honoured. The lounge at Istanbul airport was over-crowded as it is also open to the public who can pay for usage, causing a long queue for entry , which was badly organised. Boarding was smooth, cabin crew were friendly but their service was hit-and-miss. Eg. Some people got a “welcome” and some didn’t; Half of the cabin was automatically offered coffee after dinner but not the other half. However, drinks were replenished generously and regularly and the meal was good (with a choice of three mains from the menu).'

In [6]:
#Let's only include verified reviews
verified_df = reviews_df[reviews_df.iloc[:,1].str.contains("Trip Verified")]

In [7]:
verified_df.head()

Unnamed: 0.1,Unnamed: 0,reviews
0,0,✅ Trip Verified | For the price paid (bought ...
1,1,✅ Trip Verified | Flight left on time and arr...
2,2,✅ Trip Verified | Very Poor Business class pr...
4,4,✅ Trip Verified | Absolutely pathetic business...
6,6,✅ Trip Verified | This was our first flight wi...


In [8]:
verified_df.loc[:,'reviews'] = verified_df.loc[:,'reviews'].apply(lambda x: x[18:])
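
The fixed slice `x[18:]` assumes every review starts with a prefix of exactly that length. Below is a minimal sketch of the alternative described in the plan (dropping everything up to the first "|"). The helper name `strip_verification_prefix` is hypothetical and not part of the original notebook.

```python
def strip_verification_prefix(text: str) -> str:
    """Return the review body that follows the first '|' separator.

    If the text contains no '|', it is returned unchanged apart from
    surrounding whitespace being trimmed.
    """
    head, sep, tail = text.partition("|")
    # partition() yields an empty separator when '|' is absent
    return tail.strip() if sep else text.strip()


example = "✅ Trip Verified |  Flight left on time."
cleaned = strip_verification_prefix(example)
```

Under these assumptions it could be applied with `verified_df['reviews'].apply(strip_verification_prefix)`, which would also trim the leading space that the fixed slice leaves behind.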

In [9]:
verified_df.head()

Unnamed: 0.1,Unnamed: 0,reviews
0,0,For the price paid (bought during a sale) it ...
1,1,Flight left on time and arrived over half an ...
2,2,"Very Poor Business class product, BA is not e..."
4,4,Absolutely pathetic business class product. BA...
6,6,This was our first flight with British Airways...


In [10]:
verified_df.shape

(1174, 2)

Need to break down each review:
- probably not necessary: ChatGPT suggests excluding stopwords (such as "and", "the", "is", etc.)
- identify product features and adjectives mentioned in each review
- identify negations (for example: "not happy" as opposed to "not" and "happy" separately)
- categorize each review as positive, neutral, or negative
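
One possible way to keep a negation attached to the word it modifies is to merge a negator with the token that follows it. This is only a rough sketch: the `NEGATORS` set and the `merge_negations` helper are illustrative assumptions, not functions from NLTK or spaCy.

```python
NEGATORS = {"not", "no", "never"}  # assumed, deliberately small list


def merge_negations(tokens):
    """Join each negator with the token after it, e.g. ['not', 'happy'] -> ['not happy']."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.lower() in NEGATORS and i + 1 < len(tokens):
            merged.append(f"{token} {tokens[i + 1]}")
            i += 2  # skip the token that was just merged
        else:
            merged.append(token)
            i += 1
    return merged


phrases = merge_negations(["I", "was", "not", "happy", "with", "the", "seat"])
```

A word list like this will miss contractions such as "n't" and negations that act over a longer span, so it is a starting point rather than a full solution.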

In [11]:
# !pip install spacy
import spacy
#https://spacy.io/models
# !python -m spacy download en_core_web_md

nlp = spacy.load("en_core_web_md")

In [12]:
# It occurred to me that the stopwords are actually significant and spacy excludes some meaningful words

# from tqdm import tqdm

# non_stopwords = []

# # https://spacy.io/api/token#section-attributes
# for i in tqdm(range(len(verified_df.iloc[:,1])), desc = "looking through reviews"):
#     doc = nlp(str(verified_df.iloc[i,1]))
#     non_stopwords.append([token.text for token in doc if (token.is_alpha and not token.is_stop)])

# verified_df['removed_stopwords'] = non_stopwords
# verified_df.loc[0,:]

In [13]:
# from tqdm import tqdm

# # https://spacy.io/api/token#section-attributes

# review_tokens = []

# for i in tqdm(range(len(verified_df.iloc[:,1])), desc = "looking through reviews"):
#     doc = nlp(str(verified_df.iloc[i,1]))
#     review_tokens.append([token.text for token in doc if token.is_alpha])

# verified_df['review_tokens'] = review_tokens
# verified_df.loc[0,:]

In [14]:
# ok so I see nltk already has tokenize which is way faster than what I had earlier
# https://www.nltk.org/api/nltk.tokenize.html
# this cell gathers the features of interest
# I think I do actually need to determine negations in words to grab key terms
# https://spacy.io/usage/linguistic-features#named-entities provides convenient usage in code
import nltk
# nltk.download('punkt')

from nltk.tokenize import word_tokenize
from tqdm import tqdm

review_tokens = []
key_phrases = []

for i in tqdm(range(len(verified_df.loc[:,'reviews'])), desc = "looking through reviews"):
    review = str(verified_df.iloc[i,1])
    review_tokens.append(word_tokenize(review, language='english', preserve_line=False))
    # collect noun chunks only; adjectives and adverbs would need a separate token-level check on token.pos_
    key_phrases.append([chunk.text for chunk in nlp(review).noun_chunks])
verified_df['review_tokens'] = review_tokens
verified_df['key_phrases'] = key_phrases

looking through reviews: 100%|██████████████| 1174/1174 [00:31<00:00, 37.79it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  verified_df['review_tokens'] = review_tokens
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  verified_df['key_phrases'] = key_phrases
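
The warning above appears because `verified_df` was created by filtering `reviews_df`, so pandas cannot tell whether a new column is meant for a copy or for the original frame. A small sketch of one common way to avoid it is to take an explicit `.copy()` after filtering. The toy frame and the `n_chars` column are invented for this example.

```python
import pandas as pd

reviews = pd.DataFrame(
    {"reviews": ["✅ Trip Verified | Good flight.", "Not Verified | Late again."]}
)

# .copy() makes the filtered frame independent of `reviews`
verified = reviews[reviews["reviews"].str.contains("Trip Verified")].copy()

# Adding a column now modifies only `verified`
verified["n_chars"] = verified["reviews"].str.len()
```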


In [15]:
verified_df.loc[0,:]

Unnamed: 0                                                       0
reviews           For the price paid (bought during a sale) it ...
review_tokens    [For, the, price, paid, (, bought, during, a, ...
key_phrases      [the price, a sale, it, a decent experience, b...
Name: 0, dtype: object

In [16]:
# this cell calculates overall sentiment
import nltk
# nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm import tqdm

sia = SentimentIntensityAnalyzer()

review_polarity = []

for i in tqdm(range(len(verified_df.loc[:,'reviews'])), desc = "determining review connotations"):
    review_polarity.append(sia.polarity_scores(str(verified_df.iloc[i,1])))

verified_df['review_polarity'] = review_polarity

determining review connotations: 100%|████| 1174/1174 [00:00<00:00, 1856.67it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  verified_df['review_polarity'] = review_polarity
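
`polarity_scores` returns a dictionary with `neg`, `neu`, `pos`, and `compound` entries. The sketch below turns the `compound` value into the positive/neutral/negative labels mentioned earlier, assuming the ±0.05 cut-offs commonly suggested for VADER. The `label_from_polarity` helper and the sample score dictionaries are made up for illustration.

```python
def label_from_polarity(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'neutral' or 'negative'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-written dictionaries in the shape returned by polarity_scores()
sample_scores = [
    {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.8},
    {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_from_polarity(s) for s in sample_scores]
```

Applied to the column built above, this might look like `verified_df['review_polarity'].apply(label_from_polarity)`. The threshold is a tunable assumption, not a fixed property of the library.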


In [17]:
# verified_df.to_csv('verified_df.csv') # for better visibility

In [18]:
# The polarity is off and key concepts aren't being captured... 

In [21]:
verified_df_keywords = verified_df[['Unnamed: 0', 'key_phrases', 'review_polarity']]
verified_df_keywords.to_csv('verified_df_keywords.csv') # outputs
# I already see an obviously negative review being considered positive --> I'll try a different package