# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# 03: Topic Modelling

In this notebook, we will perform the following actions:
1. Topic Modelling using LDA and HDP

## Import Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import gensim

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import HdpModel
from gensim.corpora.dictionary import Dictionary
from wordcloud import WordCloud

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



## Import Data

In [2]:
# Import the data for modelling
journals = pd.read_csv('../data/journals_processed.csv')

## Final Data Preprocessing using TF-IDF

In this section, we perform the final preprocessing step: vectorizing the text with TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF weights each term by how often it appears in a document, scaled down by how many documents in the corpus contain it. This penalizes common words that appear across nearly every document, which are not informative for telling topics apart.
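
To make the penalty concrete, here is a minimal sketch that computes the inverse document frequency by hand on a toy corpus. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`. The three toy documents and the helper name `smoothed_idf` are made up for illustration and are not taken from the journal data.

```python
import math

# Toy, made-up documents (already tokenized and joined with spaces).
toy_docs = [
    "design sysml simulation",
    "design requirement tool",
    "design architecture swarm",
]

def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the
    number of documents that contain the term."""
    n = len(docs)
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

idf_common = smoothed_idf("design", toy_docs)  # appears in every document
idf_rare = smoothed_idf("swarm", toy_docs)     # appears in one document

print(f"design: {idf_common:.3f}")
print(f"swarm:  {idf_rare:.3f}")
```

The term found in every document receives the minimum weight of 1.0, while the term found in a single document is weighted about 1.69, so rarer terms contribute more to each document vector.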

In [3]:
# Instantiate a TF-IDF Vectorizer
tvec_journals = TfidfVectorizer(lowercase=False, ngram_range=(1,3))

# Fit and transform the text data to prepare for topic modelling
journals_corpus = tvec_journals.fit_transform(journals['tokens'])

## Topic Modelling using Latent Dirichlet Allocation (LDA)

Here, we will perform topic modelling using LDA. 

In [4]:
# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=7,
                                     random_state=42)

# Fit the model
lda_model.fit(journals_corpus)

In [5]:
# Extract the top words for each topic
feature_names = tvec_journals.get_feature_names_out()
n_top_words = 10
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" %topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words -1:-1]]))
    print()
    
# Extract the topic distribution for each journal
journal_topic_dist = lda_model.transform(journals_corpus)

# Create a dataframe to store the journal topics probability distribution
df_journal_topic_dist = pd.DataFrame(journal_topic_dist, columns=[f'topic_{i}' for i in range(lda_model.n_components)])

# Add in a column with the topic generated 
df_journal_topic_dist['topic_generated'] = journal_topic_dist.argmax(axis=1)

# Add in the title of the journal
df_journal_topic_dist['title'] = journals['title']

# Add in the publication year of each journal
df_journal_topic_dist['year'] = journals['year'] 

Topic #0:
requirement design development approach method process modeling architecture sysml tool

Topic #1:
design product development process approach architecture analysis modeling requirement paper

Topic #2:
design analysis approach development requirement architecture sysml tool product modeling

Topic #3:
approach design sysml requirement modeling analysis development process paper data

Topic #4:
design process modeling approach simulation sysml tool paper requirement language

Topic #5:
development product process design approach method paper support concept complexity

Topic #6:
design approach sysml development modeling tool architecture process language simulation



In [6]:
df_journal_topic_dist.head()

| | topic_0 | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_generated | title | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009956 | 0.009963 | 0.009953 | 0.009952 | 0.940248 | 0.009955 | 0.009973 | 4 | Model-based Design Process for the Early Phase... | 2017 |
| 1 | 0.009406 | 0.943557 | 0.009408 | 0.009406 | 0.009407 | 0.009406 | 0.009409 | 1 | Model Based Systems Engineering using VHDL-AMS | 2013 |
| 2 | 0.009466 | 0.009468 | 0.009464 | 0.009466 | 0.94319 | 0.009469 | 0.009477 | 4 | Code Generation Approach Supporting Complex Sy... | 2022 |
| 3 | 0.008307 | 0.008312 | 0.950154 | 0.008303 | 0.008308 | 0.008305 | 0.008311 | 2 | Model based systems engineering as enabler for... | 2021 |
| 4 | 0.009831 | 0.009839 | 0.00984 | 0.00983 | 0.009834 | 0.940982 | 0.009844 | 5 | Electric Drive Vehicle Development and Evaluat... | 2014 |


The LDA model generates as many topics as we specify (seven here). However, the topics share most of their top keywords: `design` and `approach`, for example, appear in all seven topics. This makes it difficult to assign a distinct label to each topic. Furthermore, we do not know the appropriate number of topics in advance and are choosing it by trial and error. As such, we will explore other models and compare their results.
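
One way to put a number on this overlap is the Jaccard similarity between the top-10 keyword sets of each pair of topics. The sketch below is only an illustration: it copies the keywords printed above into a dictionary, and the `jaccard` helper is a name introduced here, not part of the original analysis.

```python
from itertools import combinations

# Top-10 keywords per topic, copied from the LDA output above.
top_words = {
    0: "requirement design development approach method process modeling architecture sysml tool",
    1: "design product development process approach architecture analysis modeling requirement paper",
    2: "design analysis approach development requirement architecture sysml tool product modeling",
    3: "approach design sysml requirement modeling analysis development process paper data",
    4: "design process modeling approach simulation sysml tool paper requirement language",
    5: "development product process design approach method paper support concept complexity",
    6: "design approach sysml development modeling tool architecture process language simulation",
}
topic_sets = {k: set(v.split()) for k, v in top_words.items()}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

# Pairwise overlap between every pair of topics.
overlaps = {
    (i, j): jaccard(topic_sets[i], topic_sets[j])
    for i, j in combinations(sorted(topic_sets), 2)
}

# Print the pairs from most to least similar.
for (i, j), score in sorted(overlaps.items(), key=lambda kv: -kv[1]):
    print(f"topic {i} vs topic {j}: {score:.2f}")

# Keywords shared by every topic carry no discriminating information.
shared_by_all = set.intersection(*topic_sets.values())
print("In all topics:", sorted(shared_by_all))
```

A score of 1 means two topics share all of their top keywords, and 0 means they share none. Topics 0 and 6, for instance, have 8 of their 12 distinct keywords in common, a score of about 0.67.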

## Topic Modelling using Hierarchical Dirichlet Process (HDP)

In [7]:
# Convert the vectorized data into a Gensim corpus
corpus = gensim.matutils.Sparse2Corpus(journals_corpus, documents_columns=False)

# Create a Gensim dictionary from the corpus, reusing the vectorizer's id-to-term mapping
id2word = Dictionary.from_corpus(corpus, id2word={idx: word for word, idx in tvec_journals.vocabulary_.items()})

# Train the HDP model
hdp_model = HdpModel(corpus=corpus, id2word=id2word)

# Print the topics
topics = hdp_model.show_topics(num_topics=5,formatted=False)
for topic in topics:
    print(topic)

(0, [('pattern', 0.00010714731443079997), ('design', 0.00010692141383283276), ('simulation', 9.660619040932919e-05), ('modeling', 8.714699192192052e-05), ('need', 8.495269076675641e-05), ('language', 8.441854694319732e-05), ('sysml', 7.883500247768128e-05), ('problem', 7.538762815897994e-05), ('understand adoption strategy', 7.511145007649122e-05), ('development', 7.425534151876296e-05), ('study', 7.371653813166287e-05), ('method', 7.318744040560228e-05), ('uncertainty', 6.882367345619179e-05), ('framework', 6.861016280297731e-05), ('view', 6.85830905556617e-05), ('improved', 6.747256570438712e-05), ('test', 6.705802245252899e-05), ('negating factor wide', 6.64094209174884e-05), ('analysis', 6.625240617101234e-05), ('research', 6.363334705115212e-05)])
(1, [('swarm', 0.00013827511156118787), ('requirement', 9.150930216860144e-05), ('tree', 9.0365756482248e-05), ('account realistic adaptation', 7.903873780993692e-05), ('category', 7.373299645704282e-05), ('hierarchy recommended pre', 7.

The HDP model is not very helpful here. It returns 150 topics, which would leave each topic with only about 5-6 articles on average. Note that 150 is also the default top-level truncation (`T=150`) of Gensim's `HdpModel`, so this figure is an upper bound imposed by the model settings rather than an estimate of how many topics the corpus actually contains. Furthermore, even the top keywords of each topic have a probability of only about 0.01% (roughly 1e-4), so no keyword stands out for any topic. This does not allow us to gain much insight. Hence, we will not be using the HDP model.
*The total number of topics can be checked by raising the `num_topics` argument of `show_topics` above to 200: only 150 topics are returned.*
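
A complementary check is to count how many topics are the dominant topic of at least one article, rather than how many topics the model lists. The sketch below assumes per-document topic weights given as lists of `(topic_id, weight)` pairs, which is the shape Gensim topic models return when indexed with a bag-of-words document. The `doc_topics` values and the `dominant_topic_counts` helper are hypothetical and are not results from this notebook.

```python
from collections import Counter

def dominant_topic_counts(doc_topics):
    """Count, for each topic, the documents in which it has the largest weight.

    doc_topics: one list of (topic_id, weight) pairs per document.
    Documents with no topic assignments are skipped.
    """
    counts = Counter()
    for pairs in doc_topics:
        if not pairs:
            continue
        topic_id, _ = max(pairs, key=lambda pair: pair[1])
        counts[topic_id] += 1
    return counts

# Hypothetical weights for four documents (not output from this notebook).
doc_topics = [
    [(0, 0.70), (3, 0.25)],
    [(0, 0.55), (1, 0.40)],
    [(3, 0.90)],
    [],
]

counts = dominant_topic_counts(doc_topics)
print("Topics in use:", len(counts))
print("Documents per topic:", dict(counts))
```

With the real model, `doc_topics` could presumably be built as `[hdp_model[doc] for doc in corpus]`, assuming the `hdp_model` and `corpus` objects from the cell above. A count well below 150 would suggest that most of the listed topics are effectively unused.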

## Next Step

In this notebook, we found that neither the LDA nor the HDP model produces topics distinct enough to label. In the next notebook, we will apply a more advanced topic model, namely BERTopic.