In this notebook, we first load the app reviews into a pandas DataFrame. We then define a preprocess function that tokenizes each review with gensim's simple_preprocess (which lowercases the text and splits it into word tokens) and drops stop words and tokens of three characters or fewer. Next, we apply this function to the content column to tokenize every review. After that, we build a dictionary and a bag-of-words document-term matrix using gensim's corpora module. Finally, we fit an LDA model on the document-term matrix to identify topics in the reviews, and print each of the 10 topics together with its 10 highest-weighted words. You can adjust the number of topics by changing the num_topics parameter of LdaModel.

In [25]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models

In [26]:
# Load the data into a pandas DataFrame
data = pd.read_csv('../../Warehouse/Reviews/app_reviews_merged.csv')

In [27]:
# Preprocessing: tokenize, then drop stop words and tokens of 3 characters or fewer
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(token)
    return result
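
To make the effect of the two filters concrete, here is a minimal standard-library sketch. It does not call gensim: `toy_tokenize` is a hypothetical stand-in that only approximates `simple_preprocess`, and `TOY_STOPWORDS` is a small made-up subset, not gensim's `STOPWORDS`. The example sentence is invented for illustration.

```python
import re

# Hypothetical, tiny stop-word set for illustration; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"this", "with", "very", "that", "have"}

def toy_tokenize(text):
    # Rough stand-in for simple_preprocess: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def toy_preprocess(text):
    # Same rule as preprocess: drop stop words and tokens of 3 characters or fewer.
    return [t for t in toy_tokenize(text)
            if t not in TOY_STOPWORDS and len(t) > 3]

tokens = toy_preprocess("This app is very easy to use with UPI and my debit card!")
print(tokens)  # ['easy', 'debit', 'card']
```

Note that short domain terms such as "app" and "upi" are removed by the length threshold, not by the stop-word list. If such terms matter for the analysis, lowering the threshold in `len(token) > 3` may be worth trying.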

In [28]:
# Tokenization
tokenized_data = data['content'].apply(preprocess)

In [29]:
# Vectorization
dictionary = corpora.Dictionary(tokenized_data)
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_data]
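
As a rough illustration of what the bag-of-words step produces, the sketch below rebuilds the idea with the standard library only. `build_vocab` and `to_bow` are hypothetical helpers, not gensim functions, and the ids they assign need not match the ids that `corpora.Dictionary` would assign; only the shape of the result, a list of `(token_id, count)` pairs per document, is meant to carry over.

```python
from collections import Counter

def build_vocab(docs):
    # Assign each distinct token an integer id in order of first appearance.
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    # Count the tokens that are in the vocabulary and return
    # (token_id, count) pairs sorted by token id.
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

toy_docs = [["easy", "payments", "easy"], ["card", "payments"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)  # {'easy': 0, 'payments': 1, 'card': 2}
print(toy_bow)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document keeps only word counts, so word order is discarded before the topic model sees the data.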

In [30]:
# Topic Modeling
lda_model = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=10, id2word=dictionary, passes=50)

In [None]:
# Save the trained model to disk
lda_model.save('lda_model')

In [None]:
# Print the topics
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.179*"amazing" + 0.043*"help" + 0.034*"safe" + 0.031*"service" + 0.030*"work" + 0.026*"features" + 0.021*"famcard" + 0.021*"account" + 0.020*"years" + 0.019*"helps"
Topic: 1 
Words: 0.159*"easy" + 0.078*"payments" + 0.065*"fampay" + 0.058*"thanks" + 0.044*"loved" + 0.033*"making" + 0.030*"teenager" + 0.029*"team" + 0.021*"makes" + 0.017*"highly"
Topic: 2 
Words: 0.267*"awesome" + 0.107*"transactions" + 0.062*"fast" + 0.042*"mast" + 0.040*"convenient" + 0.036*"people" + 0.023*"definitely" + 0.023*"smooth" + 0.021*"bahut" + 0.018*"financial"
Topic: 3 
Words: 0.138*"love" + 0.108*"super" + 0.049*"fampay" + 0.040*"problem" + 0.026*"phone" + 0.025*"want" + 0.023*"time" + 0.021*"bhim" + 0.020*"guys" + 0.017*"better"
Topic: 4 
Words: 0.153*"money" + 0.063*"useful" + 0.053*"teenagers" + 0.051*"transfer" + 0.035*"thank" + 0.026*"free" + 0.025*"children" + 0.023*"wonderful" + 0.022*"like" + 0.021*"easily"
Topic: 5 
Words: 0.196*"card" + 0.086*"application" + 0.085*"debit" + 0.0

In [None]:
topic_labels = {
    0: "Problem & Issues",
    1: "Usability & Experience",
    2: "Installation & Time",
    3: "Cards & Data",
    4: "Functionality & Payments",
    5: "Features & Design",
    6: "Security & Privacy",
    7: "Support & Service",
    8: "Updates & Performance",
    9: "Notifications & Ads",
    10: "Other"  # never used: with num_topics=10 the topic ids are 0-9
}

# Print the topics with meaningful labels
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {topic_labels.get(idx, idx)}: \nWords: {topic}")


Topic Problem & Issues: 
Words: 0.179*"amazing" + 0.043*"help" + 0.034*"safe" + 0.031*"service" + 0.030*"work" + 0.026*"features" + 0.021*"famcard" + 0.021*"account" + 0.020*"years" + 0.019*"helps"
Topic Usability & Experience: 
Words: 0.159*"easy" + 0.078*"payments" + 0.065*"fampay" + 0.058*"thanks" + 0.044*"loved" + 0.033*"making" + 0.030*"teenager" + 0.029*"team" + 0.021*"makes" + 0.017*"highly"
Topic Installation & Time: 
Words: 0.267*"awesome" + 0.107*"transactions" + 0.062*"fast" + 0.042*"mast" + 0.040*"convenient" + 0.036*"people" + 0.023*"definitely" + 0.023*"smooth" + 0.021*"bahut" + 0.018*"financial"
Topic Cards & Data: 
Words: 0.138*"love" + 0.108*"super" + 0.049*"fampay" + 0.040*"problem" + 0.026*"phone" + 0.025*"want" + 0.023*"time" + 0.021*"bhim" + 0.020*"guys" + 0.017*"better"
Topic Functionality & Payments: 
Words: 0.153*"money" + 0.063*"useful" + 0.053*"teenagers" + 0.051*"transfer" + 0.035*"thank" + 0.026*"free" + 0.025*"children" + 0.023*"wonderful" + 0.022*"like" + 
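
Because the labels are assigned by hand, it can help to check that the label ids line up with the topic ids before printing. The following is a minimal sketch under the assumption that a model with `num_topics` topics uses ids `0` to `num_topics - 1`; `check_labels` and `example_labels` are hypothetical names with placeholder labels, not part of the notebook above.

```python
def check_labels(labels, num_topics):
    # Topic ids are assumed to run from 0 to num_topics - 1.
    expected = set(range(num_topics))
    missing = sorted(expected - set(labels))  # topics without a label
    unused = sorted(set(labels) - expected)   # labels that match no topic
    return missing, unused

# Placeholder labels with ids 0..10, one more than a 10-topic model has.
example_labels = {i: f"Label {i}" for i in range(11)}
missing, unused = check_labels(example_labels, num_topics=10)
print("missing:", missing, "unused:", unused)  # missing: [] unused: [10]
```

A check like this only confirms that the ids match. Whether a label actually describes a topic still has to be judged by reading that topic's top words.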

In [None]:
# Visualize the topics
lda_display = pyLDAvis.gensim_models.prepare(lda_model, doc_term_matrix, dictionary)
pyLDAvis.display(lda_display)



In [None]:
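# Build a one-sentence summary per topic from its five highest-weighted words
summary = ""
for idx, topic in lda_model.print_topics(-1):
    topic_name = topic_labels.get(idx, idx)
    # Each term looks like 0.179*"amazing"; keep only the word part
    top_words = [word_prob.split('*')[1].strip('"') for word_prob in topic.split(' + ')[:5]]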
    summary += f"Topic {topic_name} mainly discusses {', '.join(top_words[:-1])}, and {top_words[-1]}.\n\n"

print(summary)

Topic Problem & Issues mainly discusses amazing, help, safe, service, and work.

Topic Usability & Experience mainly discusses easy, payments, fampay, thanks, and loved.

Topic Installation & Time mainly discusses awesome, transactions, fast, mast, and convenient.

Topic Cards & Data mainly discusses love, super, fampay, problem, and phone.

Topic Functionality & Payments mainly discusses money, useful, teenagers, transfer, and thank.

Topic Features & Design mainly discusses card, application, debit, teens, and bank.

Topic Security & Privacy mainly discusses good, helpful, teenagers, digital, and like.

Topic Support & Service mainly discusses great, fampay, experience, happy, and kids.

Topic Updates & Performance mainly discusses nice, excellent, superb, banking, and install.

Topic Notifications & Ads mainly discusses best, teenagers, payment, online, and teens.
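
The summary above relies on splitting the printed topic strings. As a sketch of a slightly more defensive variant, the code below parses the same string format with a regular expression and keeps the weights as floats. `parse_topic` and `summarize` are hypothetical helpers, and the sample string is copied from the start of the Topic 0 output shown earlier.

```python
import re

# Matches terms such as 0.179*"amazing" in the printed topic format.
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_str):
    # Return (word, weight) pairs in the order they appear in the string.
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_str)]

def summarize(label, topic_str, n=5):
    # Join the first n words into a sentence, handling short lists gracefully.
    words = [w for w, _ in parse_topic(topic_str)[:n]]
    if len(words) > 1:
        listed = ", ".join(words[:-1]) + ", and " + words[-1]
    else:
        listed = "".join(words)
    return f"Topic {label} mainly discusses {listed}."

sample = '0.179*"amazing" + 0.043*"help" + 0.034*"safe" + 0.031*"service" + 0.030*"work"'
pairs = parse_topic(sample)
print(pairs[:2])             # [('amazing', 0.179), ('help', 0.043)]
print(summarize(0, sample))  # Topic 0 mainly discusses amazing, help, safe, service, and work.
```

If string parsing feels fragile, gensim's `show_topic` method returns `(word, probability)` pairs directly and may be a simpler starting point.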


