In this code, we first load the data into a pandas DataFrame. We then define a preprocessing function that tokenizes each review with gensim's simple_preprocess and drops stop words and tokens of three characters or fewer. Next, for each of the five review files, we build a dictionary and a document-term matrix (a bag-of-words corpus) with gensim's corpora module, fit an LDA model on that matrix to identify the topics in the reviews, and save the model to a pickle file. Finally, we load one of the saved models and print its 10 topics with their highest-weighted words. You can adjust the number of topics by changing the num_topics parameter of LdaModel.
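
To make the dictionary and document-term matrix step more concrete, here is a minimal plain-Python sketch of a bag-of-words encoding. It only illustrates the idea and is not gensim's implementation: the helper names build_vocabulary and to_bag_of_words and the sample documents are hypothetical, and the ids gensim assigns to tokens may differ.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bag_of_words(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Hypothetical, already-tokenized reviews used only for illustration
docs = [["payment", "failed", "payment"], ["card", "payment"]]
vocab = build_vocabulary(docs)
matrix = [to_bag_of_words(tokens, vocab) for tokens in docs]
print(vocab)   # {'payment': 0, 'failed': 1, 'card': 2}
print(matrix)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Each document becomes a sparse list of (token id, count) pairs, which is the general shape of input that the LDA step consumes.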

In [43]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models
import pickle

In [44]:
# Load the data into a pandas DataFrame
data = pd.read_csv('../../Warehouse/Reviews/app_reviews_merged.csv')

In [45]:
# Preprocessing
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(token)
    return result
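
As a rough illustration of what this filter keeps and drops, the sketch below mirrors the same two conditions (not a stop word, longer than three characters) in plain Python. It is a stand-in under stated assumptions: DEMO_STOPWORDS is a hypothetical, tiny stop-word set, and the regex tokenizer only approximates simple_preprocess, whose exact tokenization rules differ.

```python
import re

# Hypothetical, tiny stop-word list; gensim's STOPWORDS is far larger
DEMO_STOPWORDS = {"this", "that", "with", "have", "very"}

def demo_preprocess(text, min_len=4):
    """Lowercase, split on alphabetic runs, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in DEMO_STOPWORDS and len(t) >= min_len]

print(demo_preprocess("This app is VERY slow, payment failed with error 404!"))
# ['slow', 'payment', 'failed', 'error']
```

Note that the length filter also removes short but meaningful words such as "app" or "bad", which may matter for review data.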

In [48]:
for i in range(1,6):

    data = pd.read_csv('../../Warehouse/Reviews/app_reviews_'+str(i)+'.csv')

    print('Iteration: ', i)

    # Tokenization
    tokenized_data = data['content'].apply(preprocess)

    # Vectorization
    dictionary = corpora.Dictionary(tokenized_data)
    doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_data]

    # Topic Modeling
    lda_model = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=10, id2word=dictionary, passes=50)
    
    final_model = lda_model
    model_name = 'lda_model_net_data_10_topics_' + str(i)

    # Saving the model in a pickle file
    with open(model_name + '.pkl', 'wb') as file:
        pickle.dump(final_model, file)

Iteration:  1
Iteration:  2
Iteration:  3
Iteration:  4
Iteration:  5


In [49]:
# Load your pickle file and use it

review_score = int(input('Enter the review score: '))

# Load the model from disk; the with block closes the file afterwards
with open('lda_model_net_data_10_topics_'+str(review_score)+'.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
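
Because the file name is built from user input, it can help to check the score before opening anything. The following is a small sketch, assuming the models were saved for scores 1 to 5 as in the training loop; the helper model_path_for_score and the VALID_SCORES constant are hypothetical names introduced for illustration.

```python
from pathlib import Path

VALID_SCORES = range(1, 6)  # mirrors range(1, 6) in the training loop

def model_path_for_score(score, directory="."):
    """Return the pickle path for a review score, or raise ValueError."""
    if score not in VALID_SCORES:
        raise ValueError(f"review score must be between 1 and 5, got {score}")
    return Path(directory) / f"lda_model_net_data_10_topics_{score}.pkl"

print(model_path_for_score(3).name)  # lda_model_net_data_10_topics_3.pkl
```

The returned path can then be passed to open(..., 'rb') in place of the concatenated string.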


In [51]:
topic_labels = {
    0: "Problem & Issues",
    1: "Usability & Experience",
    2: "Installation & Time",
    3: "Cards & Data",
    4: "Functionality & Payments",
    5: "Features & Design",
    6: "Security & Privacy",
    7: "Support & Service",
    8: "Updates & Performance",
    9: "Notifications & Ads",
    10: "Other"
}

# Print the topics with meaningful labels
for idx, topic in loaded_model.print_topics(-1):
    print(f"Topic {topic_labels.get(idx, idx)}: \nWords: {topic}")


Topic Problem & Issues: 
Words: 0.140*"payment" + 0.092*"time" + 0.054*"worst" + 0.052*"waste" + 0.039*"transaction" + 0.036*"download" + 0.035*"dont" + 0.032*"fraud" + 0.026*"failed" + 0.021*"processing"
Topic Usability & Experience: 
Words: 0.189*"worst" + 0.084*"experience" + 0.064*"server" + 0.057*"application" + 0.040*"wrost" + 0.034*"time" + 0.029*"opening" + 0.027*"seen" + 0.022*"bekar" + 0.019*"error"
Topic Installation & Time: 
Words: 0.201*"money" + 0.077*"account" + 0.056*"fampay" + 0.027*"send" + 0.017*"added" + 0.017*"bank" + 0.016*"days" + 0.016*"stuck" + 0.015*"want" + 0.015*"transfer"
Topic Cards & Data: 
Words: 0.170*"card" + 0.045*"available" + 0.045*"aadhar" + 0.041*"details" + 0.024*"fampay" + 0.024*"forcing" + 0.024*"data" + 0.021*"going" + 0.019*"showing" + 0.019*"famcard"
Topic Functionality & Payments: 
Words: 0.024*"fampay" + 0.024*"like" + 0.022*"getting" + 0.019*"registration" + 0.018*"months" + 0.015*"shows" + 0.015*"paused" + 0.015*"saying" + 0.015*"month" 

In [52]:
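# Visualize the topics
# Note: lda_model, doc_term_matrix and dictionary are left over from the last
# loop iteration (i = 5); they are not the model stored in loaded_model
lda_display = pyLDAvis.gensim_models.prepare(lda_model, doc_term_matrix, dictionary)
pyLDAvis.display(lda_display)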



In [None]:
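# Summarize each topic by its five highest-weighted words.
# lda_model is the model from the last loop iteration (i = 5), so the names in
# topic_labels may not match the topics printed here.
summary = ""
for idx, topic in lda_model.print_topics(-1):
    topic_name = topic_labels.get(idx, idx)
    # Each term looks like 0.140*"payment"; keep only the word
    top_words = [word_prob.split('*')[1].strip('"') for word_prob in topic.split(' + ')[:5]]
    summary += f"Topic {topic_name} mainly discusses {', '.join(top_words[:-1])}, and {top_words[-1]}.\n\n"

print(summary)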

Topic Problem & Issues mainly discusses good, nice, teenagers, amazing, and application.

Topic Usability & Experience mainly discusses money, great, transfer, bank, and account.

Topic Installation & Time mainly discusses payment, payments, online, recommend, and making.

Topic Cards & Data mainly discusses teenager, apps, working, transaction, and work.

Topic Functionality & Payments mainly discusses easy, experience, transactions, makes, and cashless.

Topic Features & Design mainly discusses thanks, loved, wonderful, service, and features.

Topic Security & Privacy mainly discusses fampay, love, awesome, teenagers, and useful.

Topic Support & Service mainly discusses like, children, download, mast, and earn.

Topic Updates & Performance mainly discusses best, super, happy, kids, and safe.

Topic Notifications & Ads mainly discusses card, teens, debit, fampay, and problem.
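
The summary cell extracts words by splitting the topic string on ' + ' and '*'. A slightly more defensive variant, sketched below with a regular expression, also recovers the weights. The names TERM_PATTERN and parse_topic are hypothetical, and the sample string is copied from the print_topics output shown earlier.

```python
import re

# Matches one weight*"word" term in a print_topics-style string
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')

def parse_topic(topic_string, top_n=5):
    """Return up to top_n (word, weight) pairs in the order they appear."""
    pairs = TERM_PATTERN.findall(topic_string)
    return [(word, float(weight)) for weight, word in pairs[:top_n]]

# Topic string copied from the print_topics output shown earlier
sample = '0.140*"payment" + 0.092*"time" + 0.054*"worst" + 0.052*"waste" + 0.039*"transaction"'
print(parse_topic(sample, top_n=3))
# [('payment', 0.14), ('time', 0.092), ('worst', 0.054)]
```

gensim's LdaModel also provides show_topic, which returns (word, probability) pairs directly and avoids string parsing altogether.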


