In this code, we first load the review data into a pandas DataFrame. We then define a preprocessing function that tokenizes each review with gensim's simple_preprocess and drops stop words and tokens of three or fewer characters. From the tokenized reviews, we build a dictionary and a bag-of-words corpus (the document-term matrix) using gensim's corpora module. Finally, we train an LDA model on that corpus, once per review file, to identify the topics in the reviews, and print every topic with its highest-weighted words. The number of topics (10 here) is set by the num_topics parameter of LdaModel.
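
To make the dictionary and document-term matrix steps more concrete, here is a minimal sketch of the underlying idea using only the standard library. It is not gensim's implementation: the helper names build_vocab and to_bow are hypothetical, and the ids gensim assigns may differ. In the notebook, corpora.Dictionary and doc2bow play these roles. Each document ends up as a list of (token_id, count) pairs.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Two toy "reviews", already tokenized (made-up data for illustration)
docs = [["login", "problem", "login"], ["payment", "problem"]]
vocab = build_vocab(docs)
bow_corpus = [to_bow(d, vocab) for d in docs]

print(vocab)
print(bow_corpus)
```

The LDA step only ever sees these id and count pairs, which is why word order within a review has no effect on the topics.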

In [16]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models
import pickle

In [17]:
# Load the merged reviews into a pandas DataFrame
# (note: the training loop further down reassigns `data` from the per-file CSVs)
data = pd.read_csv('../../Warehouse/Reviews/app_reviews_merged.csv')

In [18]:
# Preprocessing
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(token)
    return result
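
The effect of the two filters in preprocess can be illustrated with a simplified stand-in that needs no third-party packages. This is a sketch under stated assumptions: preprocess_sketch and SMALL_STOPWORDS are hypothetical names, the regex tokenizer only approximates simple_preprocess, and the stop-word set is a tiny sample rather than gensim's STOPWORDS.

```python
import re

# Hypothetical, tiny stop-word list for illustration; gensim's STOPWORDS is much larger.
SMALL_STOPWORDS = {"the", "this", "is", "and", "very", "with", "after"}

def preprocess_sketch(text):
    """Lowercase, keep alphabetic tokens, drop stop words and tokens of 3 or fewer characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in SMALL_STOPWORDS and len(t) > 3]

tokens = preprocess_sketch("The app is very slow and login fails after the update!")
print(tokens)
```

Note that the length filter removes "app" even though it is not a stop word. The same filter in preprocess will drop other short but possibly meaningful review words, so the threshold is worth revisiting if such words matter.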

In [19]:
for i in range(1, 6):
    # Each iteration trains a separate model on one of the five review files
    data = pd.read_csv(f'../../Warehouse/Reviews/app_reviews_{i}.csv')

    print('Iteration: ', i)

    # Tokenization
    tokenized_data = data['content'].apply(preprocess)

    # Vectorization: map tokens to integer ids, then build a bag-of-words corpus
    dictionary = corpora.Dictionary(tokenized_data)
    doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_data]

    # Topic modeling: 10 topics, 50 passes over the corpus
    lda_model = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=10, id2word=dictionary, passes=50)

    # Save the model to a pickle file, one per review file
    model_name = f'lda_model_net_data_10_topics_{i}'
    with open(model_name + '.pkl', 'wb') as file:
        pickle.dump(lda_model, file)

Iteration:  1
Iteration:  2
Iteration:  3
Iteration:  4
Iteration:  5


In [None]:
# Load the model trained on review file 1 from disk
with open('lda_model_net_data_10_topics_1.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

In [20]:
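# Hand-assigned labels keyed by topic index. LDA topic indices are arbitrary and
# can change between training runs, so re-check the labels against the printed words after retraining.
topic_labels = {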
    0: "Problem & Issues",
    1: "Usability & Experience",
    2: "Installation & Time",
    3: "Cards & Data",
    4: "Functionality & Payments",
    5: "Features & Design",
    6: "Security & Privacy",
    7: "Support & Service",
    8: "Updates & Performance",
    9: "Notifications & Ads",
    10: "Other"
}

# Print the topics with meaningful labels
for idx, topic in loaded_model.print_topics(-1):
    print(f"Topic {topic_labels.get(idx, idx)}: \nWords: {topic}")

Topic Problem & Issues: 
Words: 0.161*"good" + 0.070*"application" + 0.057*"fake" + 0.031*"time" + 0.029*"wrost" + 0.025*"long" + 0.024*"location" + 0.021*"device" + 0.021*"waiting" + 0.018*"register"
Topic Usability & Experience: 
Words: 0.077*"server" + 0.041*"scan" + 0.033*"issues" + 0.032*"nahi" + 0.025*"mere" + 0.022*"bekar" + 0.022*"free" + 0.020*"paise" + 0.014*"sign" + 0.013*"technical"
Topic Installation & Time: 
Words: 0.201*"worst" + 0.078*"download" + 0.057*"work" + 0.037*"dont" + 0.036*"experience" + 0.033*"install" + 0.028*"time" + 0.028*"registration" + 0.023*"seen" + 0.019*"slow"
Topic Cards & Data: 
Words: 0.179*"problem" + 0.064*"issue" + 0.064*"solve" + 0.034*"login" + 0.025*"fampay" + 0.023*"problems" + 0.022*"soon" + 0.019*"facing" + 0.019*"able" + 0.018*"possible"
Topic Functionality & Payments: 
Words: 0.088*"working" + 0.065*"fampay" + 0.037*"account" + 0.026*"open" + 0.021*"help" + 0.019*"days" + 0.019*"showing" + 0.018*"months" + 0.016*"error" + 0.014*"shows"


In [21]:
summary = ""
for idx, topic in loaded_model.print_topics(-1):
    topic_name = topic_labels.get(idx, idx)
    top_words = [word_prob.split('*')[1].strip('"') for word_prob in topic.split(' + ')[:5]]
    summary += f"Topic {topic_name} mainly discusses {', '.join(top_words[:-1])}, and {top_words[-1]}.\n\n"

print(summary)

Topic Problem & Issues mainly discusses good, application, fake, time, and wrost.

Topic Usability & Experience mainly discusses server, scan, issues, nahi, and mere.

Topic Installation & Time mainly discusses worst, download, work, dont, and experience.

Topic Cards & Data mainly discusses problem, issue, solve, login, and fampay.

Topic Functionality & Payments mainly discusses working, fampay, account, open, and help.

Topic Features & Design mainly discusses payment, time, transaction, useless, and payments.

Topic Security & Privacy mainly discusses service, customer, available, care, and support.

Topic Support & Service mainly discusses money, account, fampay, send, and rupees.

Topic Updates & Performance mainly discusses card, nice, details, fampay, and famcard.

Topic Notifications & Ads mainly discusses waste, option, number, star, and parents.
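
The summary cell extracts words by splitting the print_topics string on ' + ' and '*'. One possible alternative, sketched below, is a regular expression that returns both the word and its weight. The names TERM_PATTERN and parse_topic are hypothetical, and the example string is copied from the first topic printed earlier.

```python
import re

# Matches one term of the form 0.161*"good"
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')

def parse_topic(topic_string, top_n=5):
    """Return the first top_n (word, weight) pairs from a print_topics-style string."""
    pairs = [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]
    return pairs[:top_n]

example = '0.161*"good" + 0.070*"application" + 0.057*"fake" + 0.031*"time" + 0.029*"wrost"'
top_terms = parse_topic(example, top_n=3)
print(top_terms)
```

Keeping the weights makes it easier to see when a topic is dominated by a single word. If the model object is available, its show_topic method returns (word, probability) pairs directly, which avoids string parsing altogether.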


