This notebook fits LDA topic models to app reviews. We first load the data into a pandas DataFrame and define a preprocess function that tokenizes each review with the simple_preprocess function from gensim, then drops stop words and tokens of three or fewer characters. Next, for each of the five review files (app_reviews_1.csv through app_reviews_5.csv, which the later cells treat as one file per review score), we create a dictionary and a document-term matrix (a bag-of-words corpus) using the corpora module from gensim, train an LDA model with 10 topics on it, and save the model to a pickle file. Finally, we load the model for a chosen review score, print its topics with their top 10 words, visualize them with pyLDAvis, and print a one-line summary of each topic. You can adjust the number of topics by changing the num_topics parameter of LdaModel.

In [43]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models
import pickle

In [44]:
# Load the data into a pandas DataFrame
data = pd.read_csv('../../Warehouse/Reviews/app_reviews_merged.csv')

In [45]:
# Preprocessing
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(token)
    return result

In [48]:
for i in range(1, 6):
    # One CSV file per review score (1 to 5)
    data = pd.read_csv(f'../../Warehouse/Reviews/app_reviews_{i}.csv')

    print('Iteration: ', i)

    # Tokenization
    tokenized_data = data['content'].apply(preprocess)

    # Vectorization: map tokens to ids, then each review to a bag-of-words vector
    dictionary = corpora.Dictionary(tokenized_data)
    doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_data]

    # Topic Modeling
    lda_model = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=10, id2word=dictionary, passes=50)

    # Saving the model in a pickle file
    model_name = f'lda_model_net_data_10_topics_{i}'
    with open(model_name + '.pkl', 'wb') as file:
        pickle.dump(lda_model, file)

Iteration:  1
Iteration:  2
Iteration:  3
Iteration:  4
Iteration:  5
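
To make the vectorization step concrete, the sketch below mimics in plain Python what the dictionary and doc2bow calls produce: each distinct token gets an integer id, and each document becomes a list of (token id, count) pairs. It is an illustration only, not gensim's implementation, and gensim may assign ids in a different order. The helper names build_vocab and to_bow and the sample tokens are made up for this example.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Hypothetical, already-preprocessed reviews
docs = [["payment", "easy", "payment"], ["card", "easy"]]
vocab = build_vocab(docs)
bows = [to_bow(tokens, vocab) for tokens in docs]
print(vocab)
print(bows)
```

The list of per-document pairs plays the role of doc_term_matrix in the training loop: LDA only sees which token ids occur in each review and how often, not their order.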


In [53]:
review_score = int(input('Enter the review score: '))

# Load the model from disk (the with block closes the file afterwards)
with open('lda_model_net_data_10_topics_' + str(review_score) + '.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
print("Working with the model for review score: ", review_score)
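
The cell above passes whatever the user types straight into int() and then into the file name, so a score outside 1 to 5 only fails later with a missing-file error. A minimal sketch of an up-front check follows, assuming one saved model per score from 1 to 5 as produced by the training loop. The names VALID_SCORES, parse_review_score, and model_path are hypothetical helpers, not part of the notebook.

```python
VALID_SCORES = range(1, 6)  # assumption: one saved model per score 1..5


def parse_review_score(raw):
    """Convert user input to an int score, raising ValueError if it is not 1..5."""
    score = int(raw.strip())  # int() itself raises ValueError on non-numeric input
    if score not in VALID_SCORES:
        raise ValueError(f"review score must be between 1 and 5, got {score}")
    return score


def model_path(score):
    """Build the pickle file name used when the models were saved."""
    return f"lda_model_net_data_10_topics_{score}.pkl"


print(model_path(parse_review_score(" 4 ")))
```

In the notebook this could wrap the input() call, so an invalid score is rejected before any file is opened.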

In [54]:
topic_labels = {
    0: "Problem & Issues",
    1: "Usability & Experience",
    2: "Installation & Time",
    3: "Cards & Data",
    4: "Functionality & Payments",
    5: "Features & Design",
    6: "Security & Privacy",
    7: "Support & Service",
    8: "Updates & Performance",
    9: "Notifications & Ads",
    10: "Other"
}

# Print the topics with meaningful labels
for idx, topic in loaded_model.print_topics(-1):
    print(f"Topic {topic_labels.get(idx, idx)}: \nWords: {topic}")


Topic Problem & Issues: 
Words: 0.065*"recommend" + 0.059*"digital" + 0.049*"money" + 0.044*"safe" + 0.041*"interface" + 0.032*"simple" + 0.027*"friends" + 0.026*"friendly" + 0.026*"available" + 0.024*"cash"
Topic Usability & Experience: 
Words: 0.161*"card" + 0.070*"debit" + 0.054*"payments" + 0.038*"teenager" + 0.036*"helpful" + 0.033*"loved" + 0.027*"thanks" + 0.025*"making" + 0.021*"parents" + 0.020*"cool"
Topic Installation & Time: 
Words: 0.257*"teenagers" + 0.139*"money" + 0.056*"transfer" + 0.043*"payment" + 0.036*"bank" + 0.029*"account" + 0.024*"apps" + 0.023*"easily" + 0.022*"paytm" + 0.015*"send"
Topic Cards & Data: 
Words: 0.664*"good" + 0.042*"happy" + 0.035*"wonderful" + 0.022*"fantastic" + 0.020*"application" + 0.013*"support" + 0.012*"customer" + 0.010*"children" + 0.010*"experience" + 0.007*"impressed"
Topic Functionality & Payments: 
Words: 0.148*"easy" + 0.136*"payment" + 0.056*"online" + 0.039*"payments" + 0.028*"fast" + 0.027*"working" + 0.026*"transaction" + 0.02

In [55]:
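# Visualize the topics of the loaded model; rebuild the corpus for the same review score so it matches
data = pd.read_csv('../../Warehouse/Reviews/app_reviews_' + str(review_score) + '.csv')
corpus = [loaded_model.id2word.doc2bow(tokens) for tokens in data['content'].apply(preprocess)]
lda_display = pyLDAvis.gensim_models.prepare(loaded_model, corpus, loaded_model.id2word)
pyLDAvis.display(lda_display)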


In [56]:
summary = ""
for idx, topic in loaded_model.print_topics(-1):
    topic_name = topic_labels.get(idx, idx)
    top_words = [word_prob.split('*')[1].strip('"') for word_prob in topic.split(' + ')[:5]]
    summary += f"Topic {topic_name} mainly discusses {', '.join(top_words[:-1])}, and {top_words[-1]}.\n\n"

print(summary)

Topic Problem & Issues mainly discusses recommend, digital, money, safe, and interface.

Topic Usability & Experience mainly discusses card, debit, payments, teenager, and helpful.

Topic Installation & Time mainly discusses teenagers, money, transfer, payment, and bank.

Topic Cards & Data mainly discusses good, happy, wonderful, fantastic, and application.

Topic Functionality & Payments mainly discusses easy, payment, online, payments, and fast.

Topic Features & Design mainly discusses nice, mast, application, lovely, and appp.

Topic Security & Privacy mainly discusses best, teens, excellent, teen, and kids.

Topic Support & Service mainly discusses great, amazing, awesome, like, and useful.

Topic Updates & Performance mainly discusses love, fampay, transactions, thank, and download.

Topic Notifications & Ads mainly discusses fampay, super, problem, team, and help.
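
The summary cell recovers the words by splitting the formatted topic string on ' + ' and '*'. As a slightly more defensive alternative, the sketch below parses the same weight*"word" format with a regular expression and keeps the weights as floats. The names TERM and parse_topic are made up for this example, and the sample string is taken from the first topic printed earlier.

```python
import re

# Matches one term such as 0.065*"recommend"
TERM = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')


def parse_topic(topic_str, top_n=5):
    """Return up to top_n (word, weight) pairs from a formatted topic string."""
    pairs = [(word, float(weight)) for weight, word in TERM.findall(topic_str)]
    return pairs[:top_n]


sample = (
    '0.065*"recommend" + 0.059*"digital" + 0.049*"money" + '
    '0.044*"safe" + 0.041*"interface" + 0.032*"simple"'
)
words = [word for word, _ in parse_topic(sample)]
print(f"mainly discusses {', '.join(words[:-1])}, and {words[-1]}.")
```

gensim's show_topic method also returns (word, probability) pairs directly, which avoids string parsing altogether when the model object is at hand.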


