In [1]:
import pandas as pd
data = 'TripleA.csv'
df = pd.read_csv(data)
print(df.columns)
print(df["Sentiment"].value_counts())


Index(['Review', 'Sentiment'], dtype='object')
Sentiment
Negative    538
Neutral     262
Positive    200
Name: count, dtype: int64


In [2]:
df['Review'] = df['Review'].str.lower()
df.head()

Unnamed: 0,Review,Sentiment
0,ive flown with them dozens of times and never ...,Positive
1,ive flown airasia thai airasia and airasia phi...,Positive
2,i have flown with them several times and never...,Positive
3,within thailand i have good experiences with t...,Negative
4,same experience with airasia japan,Negative


In [3]:
import string
df['Review'] = df['Review'].str.translate(str.maketrans('', '', string.punctuation))

In [5]:
import nltk
from nltk.corpus import stopwords
import contractions

nltk.download('stopwords') #make sure the stopwords are downloaded.

# Build the stop-word set once instead of rebuilding it for every review.
STOP_WORDS = set(stopwords.words('english'))

def expand_and_remove_stopwords(text):
    expanded_text = contractions.fix(text)  # expand contractions
    words = expanded_text.split()
    # Lower-case each word so it matches the lower-case stop-word list.
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)

df['Review'] = df['Review'].apply(expand_and_remove_stopwords)
df.head()

[nltk_data] Downloading package stopwords to C:\Users\Athin
[nltk_data]     Suresh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Review,Sentiment
0,flown dozens times never major issues,Positive
1,flown airasia thai airasia airasia philippines...,Positive
2,flown several times never problems weigh bag t...,Positive
3,within thailand good experiences flew air asia...,Negative
4,experience airasia japan,Negative


In [39]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Use the cleaned review text as the input documents
X = df['Review']

# Convert text to a document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english', max_features=5000)
X_dtm = vectorizer.fit_transform(X)

In [40]:
# Apply LDA for topic modeling
lda3 = LatentDirichletAllocation(n_components=3, random_state=42) 
lda10 = LatentDirichletAllocation(n_components=10, random_state=42)  # 10 topics
lda5 = LatentDirichletAllocation(n_components=5, random_state=42)
lda3.fit(X_dtm)
lda10.fit(X_dtm)
lda5.fit(X_dtm)

# Get feature names (words)
words = np.array(vectorizer.get_feature_names_out())

In [41]:
# Display top words in each topic
def display_topics(model, feature_names, num_words=10):
    # Rank words by weight within each topic of the model passed in,
    # so the function also works for lda5 and lda10.
    sorting = np.argsort(model.components_, axis=1)[:, ::-1]
    for topic_idx, order in enumerate(sorting):
        top_words = feature_names[order[:num_words]]  # Extract top words
        print(f"\n🔹 Topic {topic_idx+1}:")
        print(", ".join(top_words))

display_topics(lda3, words, num_words=10)


🔹 Topic 1:
air, asia, airasia, airlines, service, airline, flight, budget, fly, pay

🔹 Topic 2:
flight, flights, aa, refund, airasia, time, delayed, hours, got, money

🔹 Topic 3:
flight, airasia, aa, mas, flew, like, company, flights, say, good


In [42]:
%pip install pyLDAvis




In [43]:
import pyLDAvis
import pyLDAvis.lda_model  # Visualization module for scikit-learn LDA models.
pyLDAvis.enable_notebook()  # Enables inline visualization in Jupyter Notebook.


In [44]:
pyLDAvis.lda_model.prepare(lda3, X_dtm, vectorizer)

In [17]:
pyLDAvis.lda_model.prepare(lda5, X_dtm, vectorizer)


In [18]:
pyLDAvis.lda_model.prepare(lda10, X_dtm, vectorizer)


In [20]:
print(df.columns)  # Check the actual column names
print(df.head())

Index(['Review', 'Sentiment', 'ReviewTokens'], dtype='object')
                                              Review Sentiment  \
0              flown dozens times never major issues  Positive   
1  flown airasia thai airasia airasia philippines...  Positive   
2  flown several times never problems weigh bag t...  Positive   
3  within thailand good experiences flew air asia...  Negative   
4                           experience airasia japan  Negative   

                                        ReviewTokens  
0                 flown dozen time never major issue  
1  flown airasia thai airasia airasia philippine ...  
2  flown several time never problem weigh bag tha...  
3  within thailand good experience flew air asia ...  
4                           experience airasia japan  


In [27]:
# If importing gensim raises the ImportError shown below, the installed gensim
# still imports triu from scipy.linalg, which SciPy 1.13 removed; upgrading
# gensim or pinning scipy below 1.13 typically resolves it.
import gensim
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models  # renamed from pyLDAvis.gensim in newer pyLDAvis releases

def to_tokens(reviews):
    # ReviewTokens holds space-separated lemmas, so split each review into a token list.
    # str.split replaces eval, which is unsafe and fails on plain text.
    return [text.split() if isinstance(text, str) else text for text in reviews]

positive_reviews = to_tokens(df[df['Sentiment'] == "Positive"]['ReviewTokens'])
negative_reviews = to_tokens(df[df['Sentiment'] == "Negative"]['ReviewTokens'])

def train_lda(reviews, num_topics):
    dictionary = corpora.Dictionary(reviews)
    corpus = [dictionary.doc2bow(text) for text in reviews]
    lda_model = gensim.models.LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary, passes=20, workers=2)
    vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
    return vis

lda_normal = train_lda(to_tokens(df['ReviewTokens']), 5)
pyLDAvis.display(lda_normal)

ImportError: cannot import name 'triu' from 'scipy.linalg' (C:\Users\Athin Suresh\anaconda3\Lib\site-packages\scipy\linalg\__init__.py)

## A+B

Comments were collected from a given Reddit post URL using asyncpraw. Comments unrelated to AirAsia were filtered out by keyword, yielding a list of relevant AirAsia reviews along with the total number of comments processed. The VADER lexicon was then used to classify each comment into one of three sentiments: "Negative", "Positive" or "Neutral". The curated comments and their sentiments were stored in a CSV file for further processing. Out of 2264 total comments across all the posts, 1001 were determined to be relevant to our study.
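
The filtering and labelling steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the keyword list and the helper names `is_relevant` and `label_from_compound` are hypothetical, and the ±0.05 cutoffs on the VADER compound score are the conventionally suggested ones, which this write-up does not confirm were applied. The compound score would normally come from VADER's `polarity_scores`; here it is passed in as a plain number so the sketch runs without the lexicon.

```python
# Hypothetical keyword list; the actual filter terms are not given in the write-up.
KEYWORDS = ("airasia", "air asia")

def is_relevant(comment, keywords=KEYWORDS):
    """Return True if the comment mentions any keyword (case-insensitive)."""
    text = comment.lower()
    return any(keyword in text for keyword in keywords)

def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to one of three sentiment labels.

    The 0.05 threshold is an assumed default, not a value taken from the study.
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Two made-up comments, used only to exercise the filter.
comments = [
    "Air Asia delayed my flight by five hours",
    "Anyone know a good hostel in Bangkok?",
]
relevant = [c for c in comments if is_relevant(c)]
print(len(relevant), "of", len(comments), "comments kept")
print(label_from_compound(-0.6), label_from_compound(0.0), label_from_compound(0.4))
```

Keeping the filter and the labelling rule as separate small functions makes it easier to adjust either one during the manual validation pass.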

We took a combined approach to data labelling. After the VADER pass, the group split the dataset into smaller subsections and manually corrected any sentiment that did not match the review text. This let VADER handle most of the workload before the labels were validated by hand. The process combines transfer learning with internal labelling [1]. The resulting dataset is class-imbalanced: of the 1000 remaining comments, 538 were labelled negative, 262 neutral and 200 positive.

We were not able to find a more efficient way of extracting comments. A subreddit dedicated to AirAsia does exist, but it contained very few posts (around 5).

[1] https://www.phdata.io/blog/techniques-for-labeling-data-in-machine-learning/

In [46]:
print(df.columns)
print(df["Sentiment"].value_counts())


Index(['Review', 'Sentiment', 'ReviewTokens'], dtype='object')
Sentiment
Negative    538
Neutral     262
Positive    200
Name: count, dtype: int64


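From this we see that our dataset is highly imbalanced: negative reviews account for 538 of the 1000 comments (53.8%), compared with 262 neutral (26.2%) and 200 positive (20.0%).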
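
To put numbers on the imbalance, the short sketch below computes each class's share, the ratio of the largest to the smallest class, and a set of inverse-frequency class weights. The counts are copied from the output above. The weighting formula, total / (number of classes × class count), is the one scikit-learn uses for `class_weight="balanced"`. The notebook does not apply any reweighting, so that part is only an illustration of one possible next step.

```python
# Label counts copied from the value_counts() output above.
counts = {"Negative": 538, "Neutral": 262, "Positive": 200}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Ratio of the largest class to the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# "Balanced" weights: total / (number of classes * class count),
# so rarer classes receive proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:<9} share={shares[label]:.1%} weight={weights[label]:.2f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```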

## D - Topic Modeling

In [47]:
pyLDAvis.lda_model.prepare(lda3, X_dtm, vectorizer)

Since LDA is an unsupervised learning method, it tries to find the topics that best explain the text data. If 3 topics produce the most distinct clusters, the reviews naturally group into three key themes.

This is reflected in our number of labels, which is also 3 (Positive, Negative and Neutral). Too few topics can force distinct themes to overlap, while too many can split a single theme unnecessarily. If 3 topics are optimal, then the reviews likely fall into three well-defined themes without excessive overlap.
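
One rough way to quantify how much topics overlap is the Jaccard similarity between their top-word lists. The sketch below applies it to the top-10 words printed for `lda3` earlier in the notebook. This is a simple heuristic introduced here for illustration. It is not the inter-topic distance that pyLDAvis plots, and the notebook itself did not use it to choose the number of topics.

```python
from itertools import combinations

# Top-10 words per topic, copied from the display_topics(lda3, ...) output.
top_words = {
    "Topic 1": "air, asia, airasia, airlines, service, airline, flight, budget, fly, pay",
    "Topic 2": "flight, flights, aa, refund, airasia, time, delayed, hours, got, money",
    "Topic 3": "flight, airasia, aa, mas, flew, like, company, flights, say, good",
}
topic_sets = {name: set(words.split(", ")) for name, words in top_words.items()}

def jaccard(a, b):
    """Jaccard similarity: number of shared words divided by all distinct words."""
    return len(a & b) / len(a | b)

# Similarity for every pair of topics (0 = no shared words, 1 = identical lists).
overlaps = {
    (first, second): jaccard(topic_sets[first], topic_sets[second])
    for first, second in combinations(topic_sets, 2)
}
for (first, second), score in overlaps.items():
    print(f"{first} vs {second}: {score:.2f}")
```

Repeating the same calculation for `lda5` and `lda10` would give a comparable overlap figure for each candidate number of topics.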

Cluster 3 (Neutral)
Frequent words = (flight, good, airasia, website)
Cluster 2 (Negative)
Frequent words = (refund, delayed, cancelled, rescheduled)
Cluster 1 (Positive)
Frequent words = (budget, better, low, cheap)
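
The cluster-to-sentiment mapping above could be checked by assigning each review its dominant topic and cross-tabulating that against the sentiment label. The sketch below shows the idea on a few made-up rows: `doc_topic` stands in for the output of `lda3.transform(X_dtm)`, and both it and the `sentiments` list are toy values, not results from the dataset. The helper name `dominant_topic` is likewise hypothetical.

```python
from collections import Counter

# Made-up document-topic weights (one row per review, one column per topic),
# standing in for the output of lda3.transform(X_dtm).
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.20, 0.70, 0.10],
    [0.25, 0.15, 0.60],
]
# Made-up sentiment labels for the same four reviews.
sentiments = ["Positive", "Negative", "Negative", "Neutral"]

def dominant_topic(weights):
    """Return the 1-based index of the topic with the largest weight."""
    return max(range(len(weights)), key=weights.__getitem__) + 1

# Count how often each (dominant topic, sentiment) pair occurs.
crosstab = Counter(
    (dominant_topic(weights), sentiment)
    for weights, sentiment in zip(doc_topic, sentiments)
)
for (topic, sentiment), n in sorted(crosstab.items()):
    print(f"Topic {topic} / {sentiment}: {n}")
```

If a topic really corresponds to one sentiment, most of its reviews should fall in a single row of this table. A spread across sentiments would suggest the topic captures a theme rather than a polarity.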