# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# 03: Modelling and Evaluation

In this notebook, we will perform the following actions:
1. Topic Modelling
2. Evaluation

## Import Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import gensim

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
#from bertopic import BERTopic
from gensim.models import HdpModel
from gensim.models.ldamodel import LdaModel
from gensim.matutils import dense2vec
from gensim.corpora.dictionary import Dictionary
from wordcloud import WordCloud

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



## Import Data

In [2]:
# Import the data for modelling
journals = pd.read_csv('../data/journals_processed.csv')

## Final Data Preprocessing using TF-IDF

In this section, we perform the final preprocessing step by vectorising the tokens with TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF weights each term by how often it appears in a document, scaled down by how many documents in the whole corpus contain it. This penalizes common words that appear across almost every document and are therefore not informative.
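
As a rough illustration of that weighting, the sketch below computes TF-IDF by hand on a made-up three-document corpus (the documents are hypothetical, not taken from the journals data). It uses the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, and leaves out the L2 row normalisation that `TfidfVectorizer` also applies by default, so the numbers are indicative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): "system" and "model" occur in every document
docs = [
    "system model safety analysis",
    "system model mission design",
    "system model digital twin",
]

def tfidf(docs):
    """Return one {term: weight} dict per document (unnormalised TF-IDF)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return weights

weights = tfidf(docs)

# Print the terms of the first document from lowest to highest weight
for term, weight in sorted(weights[0].items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms shared by every document (`system`, `model`) receive the minimum weight, while terms unique to one document (`safety`, `analysis`) are weighted higher.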

In [3]:
# Instantiate a TF-IDF Vectorizer
tvec_journals = TfidfVectorizer(lowercase=False, ngram_range=(1,3))

# Fit and transform the text data to prepare for topic modelling
journals_corpus = tvec_journals.fit_transform(journals['tokens'])

## Topic Modelling using Latent Dirichlet Allocation (LDA) - sklearn implementation

Here, we will perform topic modelling using LDA, fitting seven topics (`n_components=7`) to the TF-IDF matrix.

In [4]:
# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=7,
                                     random_state=42)

# Fit the model
lda_model.fit(journals_corpus)

In [5]:
# Extract the top words for each topic
feature_names = tvec_journals.get_feature_names_out()
n_top_words = 10
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" %topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words -1:-1]]))
    print()
    
# Extract the topic distribution for each journal
journal_topic_dist = lda_model.transform(journals_corpus)

# Create a dataframe to store the journal topics probability distribution
df_journal_topic_dist = pd.DataFrame(journal_topic_dist, columns=['topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6'])

# Add in a column with the topic generated 
df_journal_topic_dist['topic_generated'] = journal_topic_dist.argmax(axis=1)

# Add in the title of the journal
df_journal_topic_dist['title'] = journals['title']

# Add in the publication year of each journal
df_journal_topic_dist['year'] = journals['year'] 

Topic #0:
data digital new safety support study methodology level case need

Topic #1:
domain concept work study mission space cost research application integration

Topic #2:
mission methodology validation new domain verification level based activity data

Topic #3:
safety industry different methodology level case engineer need opm chapter

Topic #4:
safety concept transformation research work document change data integration methodology

Topic #5:
domain challenge application new slim support present phase integration problem

Topic #6:
support data integrated safety software challenge document capability different digital
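
Several keywords recur across the seven topics printed above. One way to quantify this is sketched below: it copies the top-10 keyword lists from that output, counts how many topics each keyword appears in, and ranks topic pairs by Jaccard overlap (shared keywords divided by the union of keywords). The choice of Jaccard overlap is an assumption made for illustration, not a measure used elsewhere in this notebook.

```python
from collections import Counter
from itertools import combinations

# Top-10 keywords per topic, copied from the sklearn LDA output above
topic_keywords = {
    0: "data digital new safety support study methodology level case need",
    1: "domain concept work study mission space cost research application integration",
    2: "mission methodology validation new domain verification level based activity data",
    3: "safety industry different methodology level case engineer need opm chapter",
    4: "safety concept transformation research work document change data integration methodology",
    5: "domain challenge application new slim support present phase integration problem",
    6: "support data integrated safety software challenge document capability different digital",
}
topic_sets = {idx: set(words.split()) for idx, words in topic_keywords.items()}

def jaccard(a, b):
    """Share of keywords two topics have in common (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# Number of topics in which each keyword appears
keyword_counts = Counter(word for words in topic_sets.values() for word in words)
print("Keywords shared by 3 or more topics:")
for word, count in keyword_counts.most_common():
    if count >= 3:
        print(f"  {word}: {count} topics")

# Topic pairs ranked from most to least overlapping
overlaps = sorted(
    ((jaccard(topic_sets[i], topic_sets[j]), i, j)
     for i, j in combinations(sorted(topic_sets), 2)),
    reverse=True,
)
print("Most similar topic pairs:")
for score, i, j in overlaps[:5]:
    print(f"  topic {i} vs topic {j}: {score:.2f}")
```

Keywords that appear in many topics are candidates for the stop-word list, since they do little to distinguish one topic from another.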



In [8]:
df_journal_topic_dist.head()

| | topic_0 | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_generated | title | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.011376 | 0.011372 | 0.931773 | 0.011371 | 0.011373 | 0.011367 | 0.011368 | 2 | Model-based Design Process for the Early Phase... | 2017 |
| 1 | 0.010357 | 0.010352 | 0.937874 | 0.010357 | 0.010355 | 0.010352 | 0.010353 | 2 | Model Based Systems Engineering using VHDL-AMS | 2013 |
| 2 | 0.010191 | 0.010183 | 0.010192 | 0.010179 | 0.938895 | 0.01018 | 0.010179 | 4 | Code Generation Approach Supporting Complex Sy... | 2022 |
| 3 | 0.008825 | 0.008819 | 0.008814 | 0.008818 | 0.947089 | 0.008816 | 0.008818 | 4 | Model based systems engineering as enabler for... | 2021 |
| 4 | 0.010582 | 0.010585 | 0.936519 | 0.010576 | 0.010584 | 0.010577 | 0.010577 | 2 | Electric Drive Vehicle Development and Evaluat... | 2014 |


## Topic Modelling using Hierarchical Dirichlet Process (HDP)

In [9]:
# Convert the vectorized data into a Gensim corpus
corpus = gensim.matutils.Sparse2Corpus(journals_corpus, documents_columns=False)

# Create a dictionary from the corpus
id2word = Dictionary.from_corpus(corpus, id2word=dict((id, word) for word, id in tvec_journals.vocabulary_.items()))

# Train the HDP model
hdp_model = HdpModel(corpus=corpus, id2word=id2word)

# Print the topics
topics = hdp_model.show_topics(num_topics=5,formatted=False)
for topic in topics:
    print(topic)

(0, [('exmc', 9.693473614367421e-05), ('spacecraft mission ever', 9.546788112991308e-05), ('medical', 9.456319869202603e-05), ('field predominantly focus', 8.393760888049093e-05), ('air sea', 7.411506315889563e-05), ('increasingly consequently', 7.36843600943108e-05), ('simulator enables', 7.061182728569595e-05), ('substitute substantial', 6.963797047861733e-05), ('evaluated scope innovation', 6.688301040198796e-05), ('still difficult task', 6.530436244694119e-05), ('optimization', 6.527950878551754e-05), ('element report review', 6.502066619295586e-05), ('integrating three', 6.473781853249864e-05), ('component subsystem could', 6.444382097774067e-05), ('better understanding behavior', 6.39179245668562e-05), ('challenging implemented', 6.26660399442616e-05), ('modeled posse always', 6.13289641878113e-05), ('potential major', 6.0728657229594897e-05), ('connecting data', 5.941921818044092e-05), ('relation main concept', 5.870820859662418e-05)])
(1, [('rts', 9.112087889380139e-05), ('comp

The HDP model does not seem to be very helpful, as it outputs a total of 150 topics, meaning that each topic is assigned only about 5-6 articles on average. Furthermore, each keyword has a probability of 0.1% or less of being associated with its topic (about 0.01% for the top keywords shown above). This does not allow us to gain much insight. Hence, we will not be using the HDP model.
*The total number of topics generated can be seen by setting the `num_topics` argument of `show_topics` above to 200: only 150 topics are returned, which matches the default top-level truncation (`T=150`) of gensim's `HdpModel`.*
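
A complementary check is to count how many topics are actually the strongest topic for at least one article, rather than relying on the number of topics the model lists. The sketch below is a minimal, library-free version of that idea. It assumes the per-document topics are available as lists of `(topic_id, probability)` pairs, which is the shape gensim returns for `hdp_model[bow]`; the helper name `dominant_topic_counts` and the example values are hypothetical.

```python
from collections import Counter

def dominant_topic_counts(doc_topics):
    """Count how many documents have each topic as their highest-weight topic."""
    counts = Counter()
    for topics in doc_topics:
        if topics:  # a document may come back with no topics at all
            best_topic, _ = max(topics, key=lambda pair: pair[1])
            counts[best_topic] += 1
    return counts

# Hypothetical per-document output, shaped like
# [hdp_model[bow] for bow in corpus]
example_doc_topics = [
    [(0, 0.80), (3, 0.15)],
    [(3, 0.60), (0, 0.30)],
    [(0, 0.55), (7, 0.40)],
    [],
]

counts = dominant_topic_counts(example_doc_topics)
print(f"{len(counts)} topics are dominant for at least one document")
for topic_id, n_docs in counts.most_common():
    print(f"  topic {topic_id}: {n_docs} documents")
```

Applied to the real corpus, the number of keys in `counts` indicates how many of the listed topics carry any documents.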

## Topic Modelling using BERTopic

In [None]:
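%%time
# Import BERTopic here, as its import is commented out in the Import Libraries cell
from bertopic import BERTopic

# Instantiate a BERTopic model
bertopic_model = BERTopic()

# Fit and transform the model to the corpus (BERTopic expects a list of strings)
topics, _ = bertopic_model.fit_transform(journals['tokens'].tolist())

# Print the top words for each topic (topic -1 holds the outliers and is skipped)
for topic_id in sorted(set(topics) - {-1}):
    # get_topic returns (word, score) pairs, so keep only the words
    words = [word for word, _ in bertopic_model.get_topic(topic_id)]
    print(f"Topic {topic_id}: {' | '.join(words)}")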

## Topic Modelling using Latent Dirichlet Allocation (LDA) - gensim implementation

In [None]:
# Convert corpus to Gensim format
corpus = gensim.matutils.Sparse2Corpus(journals_corpus, documents_columns=False)

# Create Gensim dictionary
id2word = Dictionary.from_corpus(corpus, id2word=dict((id, word) for word, id in tvec_journals.vocabulary_.items()))

# Build LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=5,
                     random_state=42,
                     passes=10)

# Print keywords for each topic
for idx, topic in lda_model.show_topics(num_topics=5, num_words=15, formatted=False):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

# Create dataframe with topic probabilities for each document
# (minimum_probability=0 keeps low-probability topics, avoiding NaN columns)
topic_probs = [lda_model.get_document_topics(doc, minimum_probability=0) for doc in corpus]
df_topic_probs = pd.DataFrame([{f'topic_{topic_id}': prob for topic_id, prob in probs} for probs in topic_probs])

# Combine topic probabilities dataframe with original data
# (`journals` replaces the undefined name `data`; assumed to be the intended dataframe)
df_journals_gensim = pd.concat([journals, df_topic_probs], axis=1)

# Add column for topic with highest probability
df_journals_gensim['topic_generated'] = df_topic_probs.idxmax(axis=1)

# Add in the title of the journal
df_journals_gensim['title'] = journals['title']

# Add in the publication year of each journal
df_journals_gensim['year'] = journals['year']

# Drop the sentences column
df_journals_gensim.drop(columns='sentences', inplace=True)

In [None]:
df_journals_gensim.shape

In [None]:
df_journals_gensim.head()

# Outstanding Work
* Do EDA on the year distribution of the articles
* Fine-tune the stop words to make the keywords in each topic distinct
* Evaluate the keywords and create a proper topic label using domain knowledge and reading the top few articles for each topic
* Topic evaluation with visualizations, distributions and trending
* Recommendations for the organization
* Classification models for future articles
* Create dashboard
* Touch up on the EDA for presentation slides
* Create slides and prepare for presentation