# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# 03: Modelling and Evaluation

In this notebook, we will perform the following actions:
1. Topic Modelling
2. Evaluation

## Import Libraries

In [33]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import gensim

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from bertopic import BERTopic
from gensim.models import HdpModel
from gensim.models.ldamodel import LdaModel
from gensim.matutils import dense2vec
from gensim.corpora.dictionary import Dictionary
from wordcloud import WordCloud

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Import Data

In [114]:
# Import the data for modelling
journals = pd.read_csv('../data/journals_processed.csv')

## Final Data Preprocessing using TF-IDF

In this section, we perform the final preprocessing step using TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF weights each term by how often it appears in a document, scaled down by how many documents in the corpus contain it. This penalizes common words that appear in nearly every document and are therefore uninformative.
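To make the penalty concrete, here is a minimal sketch of the inverse document frequency part, computed by hand on a toy corpus. The three documents and the helper `smoothed_idf` are hypothetical and not part of the journals dataset. The formula used, `ln((1 + n) / (1 + df)) + 1`, is the smoothed IDF that scikit-learn applies by default; the vectorizer additionally multiplies by term frequency and normalizes each row, which this sketch leaves out.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the journals dataset)
docs = [
    "model based safety analysis",
    "model based digital twin",
    "model based ontology design",
]

def smoothed_idf(docs):
    """Return {term: idf} using idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

idf = smoothed_idf(docs)

# Print terms from most common (lowest IDF) to rarest (highest IDF)
for term in sorted(idf, key=idf.get):
    print(f"{term:10s} {idf[term]:.3f}")
```

Note that a term present in every document still receives an IDF of 1 rather than 0, so it is down-weighted relative to rarer terms but not removed from the vocabulary.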

In [115]:
# Instantiate a TF-IDF Vectorizer
tvec_journals = TfidfVectorizer(lowercase=False, ngram_range=(1,3))

# Fit and transform the text data to prepare for topic modelling
journals_corpus = tvec_journals.fit_transform(journals['tokens'])

## Topic Modelling using Latent Dirichlet Allocation (LDA) - sklearn implementation

Here, we perform topic modelling using scikit-learn's LDA implementation with seven topics.

In [120]:
# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=7,
                                     random_state=42)

# Fit the model
lda_model.fit(journals_corpus)

In [121]:
# Extract the top words for each topic
feature_names = tvec_journals.get_feature_names_out()
n_top_words = 10
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" %topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words -1:-1]]))
    print()
    
# Extract the topic distribution for each journal
journal_topic_dist = lda_model.transform(journals_corpus)

# Create a dataframe to store the journal topics probability distribution
df_journal_topic_dist = pd.DataFrame(journal_topic_dist, columns=[f'topic_{i}' for i in range(lda_model.n_components)])

# Add in a column with the topic generated 
df_journal_topic_dist['topic_generated'] = journal_topic_dist.argmax(axis=1)

# Add in the title of the journal
df_journal_topic_dist['title'] = journals['title']

# Add in the publication year of each journal
df_journal_topic_dist['year'] = journals['year'] 

Topic #0:
based model model based problem data domain time support engineer activity

Topic #1:
based model model based domain safety support research verification production engineer

Topic #2:
based model production model based level safety management service industry practice

Topic #3:
based model methodology model based element ontology case study work safety

Topic #4:
based model model based knowledge safety study application environment integration data

Topic #5:
based model model based digital mission methodology data support domain twin

Topic #6:
based model model based digital domain test methodology software support data



In [113]:
df_journal_topic_dist.head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_generated,title,year
0,0.015653,0.015654,0.015628,0.937436,0.01563,3,Model-based Design Process for the Early Phase...,2017
1,0.941587,0.014594,0.01461,0.014613,0.014596,0,Model Based Systems Engineering using VHDL-AMS,2013
2,0.014266,0.943007,0.014239,0.014257,0.014232,1,Code Generation Approach Supporting Complex Sy...,2022
3,0.012464,0.012471,0.012483,0.950109,0.012473,3,Model based systems engineering as enabler for...,2021
4,0.014777,0.014836,0.940789,0.014806,0.014792,2,Electric Drive Vehicle Development and Evaluat...,2014


## Topic Modelling using Hierarchical Dirichlet Process (HDP)

In [122]:
# Convert the vectorized data into a Gensim corpus
corpus = gensim.matutils.Sparse2Corpus(journals_corpus, documents_columns=False)

# Create a dictionary from the corpus
id2word = Dictionary.from_corpus(corpus, id2word=dict((id, word) for word, id in tvec_journals.vocabulary_.items()))

# Train the HDP model
hdp_model = HdpModel(corpus=corpus, id2word=id2word)

# Print the topics
topics = hdp_model.show_topics(num_topics=5,formatted=False)
for topic in topics:
    print(topic)

(0, [('mission', 0.00012394881414371035), ('based', 0.00010720703351408015), ('model', 9.080159443503525e-05), ('model based', 8.99694664209689e-05), ('different', 8.98586582074072e-05), ('viability entire state', 8.895002583330703e-05), ('reliant aircraft', 8.369040619098087e-05), ('behavior', 7.50927724607063e-05), ('safety', 7.468799497130322e-05), ('view', 7.429296975241166e-05), ('discipline', 7.38150178994274e-05), ('activity', 7.035947791248999e-05), ('dod transition', 6.976440438095318e-05), ('optimization across', 6.829500922140032e-05), ('specific', 6.728391887143601e-05), ('optimized sustainment cost', 6.626937108474876e-05), ('modern internet', 6.56758495468444e-05), ('discipline specific', 6.552022572641444e-05), ('efficient', 6.357570443161601e-05), ('of', 6.308362632630983e-05)])
(1, [('adoption', 8.634506519732364e-05), ('time embedded', 8.616861554887796e-05), ('across domain though', 7.381312030451386e-05), ('based', 7.13427069276732e-05), ('spanning across biological

The HDP model is not very helpful here. It outputs a total of 150 topics, which matches gensim's default top-level truncation (`T=150`), so each topic covers only about 5-6 articles on average. Furthermore, each keyword has a probability of roughly 0.01% or less of being associated with its topic, which gives us little insight. Hence, we will not use the HDP model.
*The total number of topics generated can be seen by raising the `num_topics` argument of `show_topics` above to 200. You will observe that only 150 topics are returned.*
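The articles-per-topic figure above is an average. A rough way to check how articles actually spread across topics is to count each document's most probable topic. The sketch below does this on plain Python data; the helper `count_dominant_topics` and the toy input are hypothetical, and the input format (one list of `(topic_id, probability)` pairs per document) is an assumption about what the model returns for each document.

```python
from collections import Counter

def count_dominant_topics(doc_topics):
    """Count how many documents have each topic as their most probable one.

    doc_topics: one list per document of (topic_id, probability) pairs.
    Documents with an empty list are skipped.
    """
    counts = Counter()
    for pairs in doc_topics:
        if pairs:
            top_id, _ = max(pairs, key=lambda pair: pair[1])
            counts[top_id] += 1
    return counts

# Hypothetical toy input: four documents, one with no topic assigned
example = [
    [(0, 0.7), (3, 0.3)],
    [(3, 0.9), (1, 0.1)],
    [(3, 0.6), (0, 0.4)],
    [],
]

counts = count_dominant_topics(example)
print(counts.most_common())

# Average number of documents per topic that is dominant at least once
print(sum(counts.values()) / len(counts))
```

With the objects from the cell above, something like `count_dominant_topics([hdp_model[bow] for bow in corpus])` should give the real distribution, assuming that indexing the HDP model with a bag-of-words document returns pairs in this format.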

## Topic Modelling using BERTopic

In [None]:
%%time
# Instantiate a BERTopic model
bertopic_model = BERTopic()

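# Fit and transform the model to the corpus (passed as a plain list of strings)
topics, _ = bertopic_model.fit_transform(journals['tokens'].tolist())

# Print the top words for each topic (topic -1, if present, holds outliers)
for topic_id in range(max(topics) + 1):
    # get_topic returns (word, score) pairs; keep only the words
    words = [word for word, _ in bertopic_model.get_topic(topic_id)]
    print(f"Topic {topic_id}: {' | '.join(words)}")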

## Topic Modelling using Latent Dirichlet Allocation (LDA) - gensim implementation

In [123]:
# Convert corpus to Gensim format
corpus = gensim.matutils.Sparse2Corpus(journals_corpus, documents_columns=False)

# Create Gensim dictionary
id2word = Dictionary.from_corpus(corpus, id2word=dict((id, word) for word, id in tvec_journals.vocabulary_.items()))

# Build LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=5,
                     random_state=42,
                     passes=10)

# Print keywords for each topic
for idx, topic in lda_model.show_topics(num_topics=5, num_words=15, formatted=False):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

# Create dataframe with topic probabilities for each document
# (minimum_probability=0 keeps every topic, so no column ends up with NaN)
topic_probs = [lda_model.get_document_topics(bow, minimum_probability=0) for bow in corpus]
df_topic_probs = pd.DataFrame([{f'topic_{tp[0]}': tp[1] for tp in probs} for probs in topic_probs])

# Start from the topic probabilities so the result holds only topic columns plus title and year
df_journals_gensim = df_topic_probs.copy()

# Add column for topic with highest probability
df_journals_gensim['topic_generated'] = df_topic_probs.idxmax(axis=1)

# Add in the title of the journal
df_journals_gensim['title'] = journals['title']

# Add in the publication year of each journal
df_journals_gensim['year'] = journals['year']

Topic: 0 
Words: ['based', 'model', 'digital', 'model based', 'mission', 'study', 'research', 'methodology', 'concept', 'data', 'modelling', 'case', 'support', 'different', 'practice']
Topic: 1 
Words: ['based', 'model', 'model based', 'safety', 'domain', 'verification', 'support', 'new', 'component', 'digital', 'management', 'lifecycle', 'level', 'methodology', 'cost']
Topic: 2 
Words: ['based', 'model', 'data', 'model based', 'effort', 'service', 'document', 'present', 'new', 'stakeholder', 'concept', 'study', 'element', 'level', 'need']
Topic: 3 
Words: ['based', 'model', 'model based', 'safety', 'domain', 'case', 'integration', 'methodology', 'management', 'different', 'data', 'support', 'ontology', 'application', 'view']
Topic: 4 
Words: ['based', 'model', 'safety', 'model based', 'industry', 'domain', 'application', 'support', 'methodology', 'data', 'change', 'concept', 'challenge', 'need', 'different']


In [83]:
df_journals_gensim.shape

(850, 8)

In [84]:
df_journals_gensim.head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_generated,title,year
0,0.017888,0.017893,0.928312,0.017919,0.017988,topic_2,Model-based Design Process for the Early Phase...,2017
1,0.015413,0.015403,0.0154,0.01543,0.938354,topic_4,Model Based Systems Engineering using VHDL-AMS,2013
2,0.01642,0.016486,0.0163,0.934438,0.016355,topic_3,Code Generation Approach Supporting Complex Sy...,2022
3,0.014561,0.014587,0.014538,0.014566,0.941748,topic_4,Model based systems engineering as enabler for...,2021
4,0.016745,0.01676,0.0167,0.016712,0.933082,topic_4,Electric Drive Vehicle Development and Evaluat...,2014


# Outstanding Work
* Can't seem to remove the stop words "model" and "based" - **NEED TAs HELP**
* BERTopic (kills my kernel; need a cloud environment, I think)
* Fine tune the stop words to make the keywords in each topic distinct
* Evaluate the key words and create a proper topic label using domain knowledge and reading the top few articles for each topic
* Topic evaluation with visualizations, distributions and trending
* Recommendations for the organization
* Create dashboard
* Touch up on the EDA for presentation slides
* Create slides and prepare for presentation
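
For the first item in the list above, one option is to strip the unwanted words from the token strings before they reach the vectorizer. This is a minimal sketch, assuming the `tokens` column holds space-separated strings; the helper `drop_stop_words` and the set `domain_stop_words` are hypothetical names introduced for illustration.

```python
def drop_stop_words(text, stop_words):
    """Remove whole-word stop words from a space-separated token string."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

# Hypothetical domain stop words; extend as needed
domain_stop_words = {"model", "based"}

sample = "model based safety analysis of model driven design"
print(drop_stop_words(sample, domain_stop_words))
```

In the notebook this could be applied with `journals['tokens'].apply(drop_stop_words, stop_words=domain_stop_words)` before calling `fit_transform`. The match is case-sensitive, which is consistent with the vectorizer being created with `lowercase=False`, so capitalized variants would need to be listed separately. `TfidfVectorizer` also accepts a list through its `stop_words` parameter, which may be the simpler route if it behaves as expected on this data.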