### Citations
1) The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. A free online book is available. (If you use the library for academic research, please cite the book.)

Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. https://www.nltk.org/book/

2) https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp

3) ChatGPT 3.5, consulted for how to use NLP techniques and libraries

4) spaCy
https://course.spacy.io/en/chapter2
https://spacy.io/models

In [2]:
import pandas as pd

1) Overview of the data
2) Remove NaNs from the reviews column (the first column is necessarily non-empty)
3) Remove everything up to and including the first "|" character --> to get the first 1500 reviews
4) Sentiment analysis as a score column
5) Find significant phrases in each review
6) Pick random rows to sample and check
7) 
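
Steps 2 and 6 of the plan can be sketched on a toy DataFrame. This is only an illustration: the rows in `toy_df`, the sample size, and the `random_state` value are made up for the example and are not taken from the real CSV.

```python
import pandas as pd

# Toy stand-in for the real CSV; these rows are invented for illustration.
toy_df = pd.DataFrame({
    "reviews": [
        "✅ Trip Verified | Flight left on time.",
        None,
        "Not Verified | Seats were cramped.",
        "✅ Trip Verified | Crew were friendly.",
    ]
})

# Step 2: drop rows whose review text is missing.
toy_df = toy_df.dropna(subset=["reviews"])

# Step 6: draw a reproducible random sample to read through by hand.
spot_check = toy_df.sample(n=2, random_state=0)
print(spot_check)
```

Fixing `random_state` means the same rows come back on every run, which makes a manual check easier to repeat.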


In [3]:
reviews_df = pd.read_csv("data/BA_reviews.csv")

In [4]:
reviews_df.head()

Unnamed: 0.1,Unnamed: 0,reviews
0,0,✅ Trip Verified | For the price paid (bought ...
1,1,✅ Trip Verified | Flight left on time and arr...
2,2,✅ Trip Verified | Very Poor Business class pr...
3,3,Not Verified | This review is for LHR-SYD-LHR....
4,4,✅ Trip Verified | Absolutely pathetic business...


In [5]:
reviews_df.shape
# Drop rows with a missing review
reviews_df = reviews_df.dropna(subset=['reviews'])

In [6]:
#csv file created on Dec 29 2023
reviews_df.iloc[0,1]

'✅ Trip Verified |  For the price paid (bought during a sale) it was a decent experience although the club class (business class) seats offer no more legroom than economy class (using short-haul fleet on a 4 hour flight). Fast track through security was not honoured. The lounge at Istanbul airport was over-crowded as it is also open to the public who can pay for usage, causing a long queue for entry , which was badly organised. Boarding was smooth, cabin crew were friendly but their service was hit-and-miss. Eg. Some people got a “welcome” and some didn’t; Half of the cabin was automatically offered coffee after dinner but not the other half. However, drinks were replenished generously and regularly and the meal was good (with a choice of three mains from the menu).'

In [7]:
# Keep only verified reviews ("Not Verified" rows do not contain "Trip Verified")
verified_df = reviews_df[reviews_df['reviews'].str.contains("Trip Verified")]

In [8]:
verified_df.head()

Unnamed: 0.1,Unnamed: 0,reviews
0,0,✅ Trip Verified | For the price paid (bought ...
1,1,✅ Trip Verified | Flight left on time and arr...
2,2,✅ Trip Verified | Very Poor Business class pr...
4,4,✅ Trip Verified | Absolutely pathetic business...
6,6,✅ Trip Verified | This was our first flight wi...


In [9]:
# Strip the 18-character "✅ Trip Verified | " prefix from each review
verified_df.loc[:,'reviews'] = verified_df.loc[:,'reviews'].apply(lambda x: x[18:])

In [10]:
verified_df.head()

Unnamed: 0.1,Unnamed: 0,reviews
0,0,For the price paid (bought during a sale) it ...
1,1,Flight left on time and arrived over half an ...
2,2,"Very Poor Business class product, BA is not e..."
4,4,Absolutely pathetic business class product. BA...
6,6,This was our first flight with British Airways...
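
The fixed 18-character slice only works when every row starts with exactly the same prefix. An alternative, shown here as a sketch on made-up rows, is to split once on the first "|" and keep what follows. The `review_text` column name is an assumption for the example, not something the notebook defines.

```python
import pandas as pd

# Invented rows that mimic the two prefix variants seen in the data.
toy_df = pd.DataFrame({
    "reviews": [
        "✅ Trip Verified |  For the price paid it was decent.",
        "Not Verified | This review is for LHR-SYD-LHR.",
    ]
})

# Split once on the first "|", keep the right-hand part, and trim the spaces.
toy_df["review_text"] = (
    toy_df["reviews"].str.split("|", n=1).str[1].str.strip()
)
print(toy_df["review_text"].tolist())
```

Because the split does not depend on the prefix length, the same line would also handle "Not Verified" rows if they were kept.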


In [11]:
verified_df.shape

(1174, 2)

Each review needs to be broken down:
- probably not necessary: ChatGPT suggests excluding stopwords (such as "and", "the", "is", etc.)
- identify product features and adjectives mentioned in each review
- identify negated phrases (for example, "not happy" as a unit, as opposed to "not" and "happy" separately)
- categorize each review as positive, neutral, or negative
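
The negation and categorization points above could be sketched in plain Python as follows. Both helpers are hypothetical: `merge_negations`, its short negator list, `categorize`, and the 0.05 threshold are assumptions for illustration, and the score is assumed to come from whichever sentiment method is chosen later.

```python
def merge_negations(tokens, negators=("not", "no", "never")):
    """Join each negator with the token after it, e.g. 'not', 'happy' -> 'not_happy'."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.lower() in negators and i + 1 < len(tokens):
            merged.append(f"{token.lower()}_{tokens[i + 1].lower()}")
            i += 2  # skip the word that was just attached to the negator
        else:
            merged.append(token)
            i += 1
    return merged


def categorize(score, threshold=0.05):
    """Map a signed sentiment score to a label; the threshold is an assumed cut-off."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(merge_negations(["I", "was", "not", "happy", "with", "the", "seat"]))
# ['I', 'was', 'not_happy', 'with', 'the', 'seat']
print(categorize(0.3), categorize(0.0), categorize(-0.3))
# positive neutral negative
```

Merging would have to happen before any stopword filtering, since stopword lists commonly include "not" and would otherwise drop it.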

In [19]:
# !pip install spacy
import spacy
#https://spacy.io/models
# !python -m spacy download en_core_web_md

nlp = spacy.load("en_core_web_md")

In [26]:
from tqdm import tqdm

non_stopwords = []

# https://spacy.io/api/token#section-attributes
# Keep alphabetic tokens that are not stop words
for review in tqdm(verified_df['reviews'], desc="looking through reviews"):
    doc = nlp(str(review))
    non_stopwords.append([token.text for token in doc if (token.is_alpha and not token.is_stop)])

# verified_df was created by filtering reviews_df; take an explicit copy
# so that adding a column does not raise SettingWithCopyWarning
verified_df = verified_df.copy()
verified_df['removed_stopwords'] = non_stopwords
verified_df.loc[0,:]

looking through reviews: 100%|██████████████| 1174/1174 [00:30<00:00, 38.55it/s]


Unnamed: 0                                                           0
reviews               For the price paid (bought during a sale) it ...
removed_stopwords    [price, paid, bought, sale, decent, experience...
Name: 0, dtype: object

In [27]:
verified_df.head()

Unnamed: 0.1,Unnamed: 0,reviews,removed_stopwords
0,0,For the price paid (bought during a sale) it ...,"[price, paid, bought, sale, decent, experience..."
1,1,Flight left on time and arrived over half an ...,"[Flight, left, time, arrived, half, hour, earl..."
2,2,"Very Poor Business class product, BA is not e...","[Poor, Business, class, product, BA, close, ai..."
4,4,Absolutely pathetic business class product. BA...,"[Absolutely, pathetic, business, class, produc..."
6,6,This was our first flight with British Airways...,"[flight, British, Airways, years, usual, fault..."
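
One possible next step toward finding frequently mentioned product features is to count tokens across all reviews. This is a sketch only: `token_lists` is a hypothetical stand-in shaped like the `removed_stopwords` column, and raw frequency is an assumed proxy for "significant", not a method taken from the notebook.

```python
from collections import Counter

# Hypothetical token lists shaped like the removed_stopwords column.
token_lists = [
    ["price", "paid", "decent", "experience", "seats"],
    ["Flight", "left", "time", "seats"],
    ["Poor", "Business", "class", "seats", "flight"],
]

# Lowercase so that "Flight" and "flight" are counted together.
counts = Counter(token.lower() for tokens in token_lists for token in tokens)
top_terms = counts.most_common(2)
print(top_terms)  # [('seats', 3), ('flight', 2)]
```

With the real data, the same counter could be fed from `verified_df['removed_stopwords']`.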
