# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# 03: Modelling and Evaluation

In this notebook, we will perform the following actions:
1. Topic Modelling
2. Evaluation

## Import Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from bertopic import BERTopic
#from gensim.models.ldamodel import LdaModel
#from gensim.corpora.dictionary import Dictionary
from wordcloud import WordCloud

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



## Import Data

In [2]:
# Import the data for modelling
journals = pd.read_csv('../data/journals_processed.csv')

## Final Data Preprocessing using TF-IDF

In this section, we will perform the final data preprocessing step using TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF weights each term by how often it appears in a document, scaled down by how many documents in the corpus contain it. This penalizes common words that appear in nearly every document and are therefore not informative.
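
As a rough illustration of this weighting, the sketch below computes the inverse document frequency by hand for a toy corpus. The documents and the helper name `smoothed_idf` are made up for illustration. The formula, ln((1 + n) / (1 + df)) + 1, is the smoothed IDF that scikit-learn's `TfidfVectorizer` applies with its default `smooth_idf=True`, before each row is normalised.

```python
import math

# Toy corpus (hypothetical, not the journal abstracts)
docs = [
    "model based system engineering",
    "model simulation safety",
    "model requirement verification",
]

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = sum(term in doc.split() for doc in documents)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# "model" appears in every document, so it receives the lowest possible weight
idf_model = smoothed_idf("model", docs)
# "safety" appears in one document only, so it is weighted more heavily
idf_safety = smoothed_idf("safety", docs)

print(f"idf(model)  = {idf_model:.3f}")
print(f"idf(safety) = {idf_safety:.3f}")
```

A term found in every document ends up with an IDF of exactly 1, while rarer terms score higher, which is the penalty on common words described above.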

In [3]:
# Instantiate a TF-IDF Vectorizer
tvec_journals = TfidfVectorizer(lowercase=False, ngram_range=(1,3))

# Fit and transform the text data to prepare for topic modelling
journals_corpus = tvec_journals.fit_transform(journals['tokens'])

## Topic Modelling using Latent Dirichlet Allocation (LDA) - sklearn

Here, we will perform topic modelling using scikit-learn's LDA implementation, fitted on the TF-IDF matrix with five topics.

In [4]:
# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=5,
                                     random_state=42)

# Fit the model
lda_model.fit(journals_corpus)

In [5]:
# Extract the top words for each topic
feature_names = tvec_journals.get_feature_names_out()
n_top_words = 15
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" %topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words -1:-1]]))
    print()
    
# Extract the topic distribution for each journal
journal_topic_dist = lda_model.transform(journals_corpus)

# Create a dataframe to store the journal topics probability distribution
df_journal_topic_dist = pd.DataFrame(journal_topic_dist, columns=['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4'])

# Add in a column with the topic generated 
df_journal_topic_dist['topic_generated'] = journal_topic_dist.argmax(axis=1)

# Add in the title of the journal
df_journal_topic_dist['title'] = journals['title']

# Add in the publication year of each journal
df_journal_topic_dist['year'] = journals['year'] 

Topic #0:
based requirement model tool model based analysis method industry support complexity simulation research lifecycle concept language

Topic #1:
requirement based tool model analysis model based methodology framework method domain complex language human simulation study

Topic #2:
language based model tool model based method new analysis project information framework requirement support study case

Topic #3:
requirement based simulation model method model based tool language analysis data safety domain framework verification environment

Topic #4:
based simulation digital requirement model analysis model based tool safety information method language project data twin
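
The top terms printed above overlap heavily: "based", "requirement" and "tool", for instance, appear in all five topics. One possible way to put a number on that overlap is the Jaccard similarity between the top-term sets of each pair of topics. The sketch below is an assumption about how this could be measured, not the evaluation method used in this notebook, and the term sets in it are hypothetical rather than the actual model output.

```python
from itertools import combinations

# Hypothetical top-term sets for three topics (not the actual model output)
top_terms = {
    0: {"requirement", "tool", "industry", "lifecycle"},
    1: {"requirement", "tool", "framework", "human"},
    2: {"digital", "twin", "safety", "data"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 = identical)."""
    return len(a & b) / len(a | b)

# Compare every pair of topics once
overlap = {
    (i, j): jaccard(top_terms[i], top_terms[j])
    for i, j in combinations(sorted(top_terms), 2)
}

for (i, j), score in overlap.items():
    print(f"Topic {i} vs Topic {j}: {score:.2f}")
```

Scores close to 1 would suggest that two topics are hard to tell apart, while scores close to 0 indicate distinct vocabularies.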



In [6]:
df_journal_topic_dist.head()

| |Topic 0|Topic 1|Topic 2|Topic 3|Topic 4|topic_generated|title|year|
|-|-------|-------|-------|-------|-------|---------------|-----|----|
|0|0.015411|0.015428|0.938266|0.015417|0.015478|2|Model-based Design Process for the Early Phase...|2017|
|1|0.014299|0.014271|0.014305|0.94257|0.014555|3|Model Based Systems Engineering using VHDL-AMS|2013|
|2|0.013795|0.013762|0.013797|0.94483|0.013816|3|Code Generation Approach Supporting Complex Sy...|2022|
|3|0.012287|0.012286|0.012273|0.01229|0.950865|4|Model based systems engineering as enabler for...|2021|
|4|0.014573|0.014557|0.01459|0.014566|0.941714|4|Electric Drive Vehicle Development and Evaluat...|2014|
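
With a dominant topic and a publication year for every paper, topic counts can be tallied per year to look for shifts over time. The following is a minimal sketch using only the five rows shown above; the variable names are illustrative, and on the full dataframe the same tally could be obtained with `pd.crosstab(df_journal_topic_dist['year'], df_journal_topic_dist['topic_generated'])`.

```python
from collections import Counter

# (year, topic_generated) pairs copied from the five rows shown above
records = [(2017, 2), (2013, 3), (2022, 3), (2021, 4), (2014, 4)]

# Count how many papers fall under each (year, topic) combination
counts = Counter(records)

# Count papers per topic regardless of year
topic_totals = Counter(topic for _, topic in records)

for (year, topic), n in sorted(counts.items()):
    print(f"{year}: Topic {topic} -> {n} paper(s)")
```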


## Topic Modelling using BERTopic

In [None]:
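# Instantiate a BERTopic model
bertopic_model = BERTopic()

# Fit the model and assign a topic to each document (passed as a list of strings)
topics, _ = bertopic_model.fit_transform(journals['tokens'].tolist())

# Print the top words for each topic
# get_topic returns (word, score) pairs; topic -1 holds the outliers and is skipped
for topic_id in sorted(set(topics)):
    if topic_id == -1:
        continue
    words = [word for word, _ in bertopic_model.get_topic(topic_id)]
    print(f"Topic {topic_id}: {' | '.join(words)}")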

## Topic Modelling using Latent Dirichlet Allocation (LDA) - gensim

In [None]:
# Instantiate a TF-IDF Vectorizer
tvec_2 = TfidfVectorizer(lowercase=False, ngram_range=(1,3))

# Fit and transform the text data to prepare for topic modelling
tfidf_matrix = tvec_2.fit_transform(journals['tokens'])

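# Import the gensim classes used below (they are commented out in the import cell)
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# Split each preprocessed abstract back into a list of tokens
tokenized_abstracts = [abstract.split() for abstract in journals['tokens']]

# Create a gensim dictionary
dictionary = Dictionary(tokenized_abstracts)

# Create the gensim corpus
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_abstracts]

# Create the gensim LDA model
lda_model_2 = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)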

# Print the topics
topics = lda_model_2.print_topics(num_words=15)
for topic in topics:
    print(topic)

Note: work on this section stopped here, while trying to get gensim's LDA model to work.

In [None]:
lda_model_2.show_topics(formatted=False)

## Import Data

In [None]:
# Import journals data
journals = pd.read_csv('../data/journals.csv')

In [None]:
# Take a look at the dataframe
journals.head()

In [None]:
# Check the shape of the data
journals.shape

## Data Dictionary

In [None]:
# Check the columns in the dataframe
journals.columns

Columns in the dataframe

|Column Name | Use of Column|
|------------|--------------|
|title| Title of the academic journal. Through topic modelling, each title will be assigned to a topic for quick search later on|
|abstract| Abstract of each academic journal. This data will be preprocessed and used as the dataset for the unsupervised learning to identify topics|
|year| Year that the academic journal was published. This will be used to identify shifts in trends between the topics over the years|

## Data Preprocessing

In this section, we will process the text data in the abstract column by cleaning, tokenizing and lemmatizing it, and then removing stop words. Each step is described below.
* Cleaning the text to remove special characters
* Tokenizing (converts sentences into individual words, and by using ngrams, we can also form tokens with multiple words to give better context)
* Lemmatization (converts different words with the same meaning/intent into the same word)
* Stop word removal (stop words are filler words that do not provide any context and just assist with sentence structure)
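
To make the n-gram idea in the tokenizing step concrete, the sketch below builds unigrams, bigrams and trigrams from a short token list. The helper `make_ngrams` and the token list are hypothetical; in this project the n-grams are produced by the vectorizers through their `ngram_range` parameter.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list, as it might look after cleaning and lemmatization
tokens = ["model", "based", "system", "engineering"]

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # ['model based', 'based system', 'system engineering']
print(trigrams)  # ['model based system', 'based system engineering']
```

A bigram such as "model based" keeps context that the separate tokens "model" and "based" lose.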

### Definition of Stopwords

We will assign the stopwords from NLTK to a list called stop_words. This is so that the list can be further expanded later on when looking at the word frequency.

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

### Function Definition for Preprocessing

The function below preprocesses the text data by performing the steps listed above.

In [None]:
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocess_text(text):
    
    # Remove 's
    text = re.sub(r"'s", '', text)
    
    # Remove n't (example don't)
    text = re.sub(r"n't", '', text)
    
    # Remove 'm (example I'm)
    text = re.sub(r"'m", '', text)
    
    # Remove 'd (example I'd)
    text = re.sub(r"'d", '', text)
    
    # Remove 're (example They're)
    text = re.sub(r"'re", '', text)
    
    # Expand 've to " have" (example They've)
    text = re.sub(r"'ve", " have", text)
    
    # Remove 'll (example We'll)
    text = re.sub(r"'ll", '', text)
    
    # Remove URL links
    text = re.sub(r'http\S+', '', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Change all text to lower case
    text = text.lower()
    
    # Remove the standalone word "abstract", which leads some of the abstracts
    # (the word boundaries keep words such as "abstraction" intact)
    text = re.sub(r"\babstract\b", '', text)
    
    # Tokenize the text
    text = word_tokenize(text)
    
    # Lemmatize the text
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(i) for i in text]
    
    # Remove stop words
    text = [token for token in text if token not in stop_words]
    
    return text

### Preprocess the Text Data

Here, we will apply the preprocess_text function to clean and tokenize our text data.

In [None]:
%%time
# Place the preprocessed data in a new column called tokens
journals['tokens'] = journals['abstract'].apply(preprocess_text)

In [None]:
# Check the tokens
journals['tokens'].head()

### Vectorize the words for EDA

We will use CountVectorizer to vectorize our words, to enable EDA.
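
As a rough picture of what count vectorization produces, the sketch below counts terms per document with the standard library and then sums them into corpus-wide frequencies. The documents are hypothetical, and `collections.Counter` is only a stand-in for the sparse document-term matrix that `CountVectorizer` returns.

```python
from collections import Counter

# Hypothetical preprocessed abstracts (tokens joined by spaces)
docs = ["model based design", "model simulation", "design model tool"]

# One Counter per document plays the role of a row in the document-term matrix
doc_counts = [Counter(doc.split()) for doc in docs]

# Summing the rows gives the corpus-wide term frequencies used in the EDA
corpus_counts = sum(doc_counts, Counter())

print(corpus_counts.most_common(2))
```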

In [None]:
# Join the tokenized words so that we can vectorize them
journals['tokens'] = [" ".join(post) for post in journals['tokens']]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Instantiate a CountVectorizer with ngrams 1 for word frequency analysis
cvec_journals_1 = CountVectorizer(lowercase=False, ngram_range=(1,1))

In [None]:
# Fit the Count Vectorizer, transform the data and export them into a dataframe

# Unigrams
cvec_journals_1.fit(journals['tokens'])
journals_unigrams = cvec_journals_1.transform(journals['tokens'])
journals_unigrams = pd.DataFrame(journals_unigrams.todense(), 
                                 columns=cvec_journals_1.get_feature_names_out())

## Exploratory Data Analysis (EDA)

In this section, we'll conduct EDA to look at the distribution of document lengths as well as the most frequently occurring words. Based on the EDA, we will then clean the text further.

#### Document Length Distribution

In [None]:
# Initialize a dataframe to store the additional EDA data
eda_journals = pd.DataFrame()

In [None]:
# Calculate the word count for the abstracts
eda_journals['word_count'] = journals['tokens'].map(lambda x: len(x.split()))

In [None]:
# Create a histogram plot to view the distribution of the word count in the abstracts
plt.figure(figsize=(8, 6))
sns.histplot(eda_journals['word_count'], kde=False, bins=20, label="Word Count", 
             color="blue", alpha = 0.7)

plt.title("Distribution of Word Count in the Abstracts")
plt.xlabel("Word Count in the Abstracts")
plt.ylabel("Frequency of Count")
plt.legend();

#### Word Frequency Analysis

In [None]:
# Plot the 100 most frequently occurring unigrams in the abstracts
plt.figure(figsize=(16,25))
top_100_words = journals_unigrams.sum().sort_values(ascending=False).head(100)
top_100_words.sort_values(ascending=True).plot(kind='barh');
plt.title('100 Most Frequently Occurring Words in the Abstracts')
plt.ylabel('Words')
plt.xlabel('Number of Occurrences');

From the above distribution, we can identify several words to add to our stop words: *using, used, ha, use, also, within, however, well, wa, example*. The tokens *ha* and *wa* are what the lemmatizer leaves of "has" and "was".

#### Remove the Additional Stopwords Identified Above

In [None]:
# Create list of high frequency words that are identified as stopwords
additional_stop_words = ['using', 'used', 'ha', 'use', 'also', 'within', 
                         'however', 'well', 'wa', 'example']

# Add the additional stop words to the original stop word list
stop_words.extend(additional_stop_words)

In [None]:
%%time
# Preprocess the data again
journals['tokens'] = journals['abstract'].apply(preprocess_text)

# Join the tokenized words so that we can vectorize them
journals['tokens'] = [" ".join(post) for post in journals['tokens']]

# Instantiate a CountVectorizer with ngrams 1 for word frequency analysis
cvec_journals_1 = CountVectorizer(lowercase=False, ngram_range=(1,1))

# Instantiate a CountVectorizer with ngrams 2 for bigram analysis
cvec_journals_2 = CountVectorizer(lowercase=False, ngram_range=(2,2))

# Instantiate a CountVectorizer with ngrams 3 for trigram analysis
cvec_journals_3 = CountVectorizer(lowercase=False, ngram_range=(3,3))


# Fit the three vectorizers, transform the data and export them into a dataframe

# Unigrams
cvec_journals_1.fit(journals['tokens'])
journals_unigrams = cvec_journals_1.transform(journals['tokens'])
journals_unigrams = pd.DataFrame(journals_unigrams.todense(), 
                                 columns=cvec_journals_1.get_feature_names_out())

# Bigrams
cvec_journals_2.fit(journals['tokens'])
journals_bigrams = cvec_journals_2.transform(journals['tokens'])
journals_bigrams = pd.DataFrame(journals_bigrams.todense(), 
                                 columns=cvec_journals_2.get_feature_names_out())

# Trigrams
cvec_journals_3.fit(journals['tokens'])
journals_trigrams = cvec_journals_3.transform(journals['tokens'])
journals_trigrams = pd.DataFrame(journals_trigrams.todense(), 
                                 columns=cvec_journals_3.get_feature_names_out())

#### Word Frequency Analysis

In [None]:
# Plot the 25 most frequently occurring unigrams in the abstracts
plt.figure(figsize=(16,12))
top_25_unigrams = journals_unigrams.sum().sort_values(ascending=False).head(25)
top_25_unigrams.sort_values(ascending=True).plot(kind='barh');
plt.title('25 Most Frequently Occurring Unigrams in the Abstracts')
plt.ylabel('Words')
plt.xlabel('Number of Occurrences');

In [None]:
# Plot the 25 most frequently occurring bigrams in the abstracts
plt.figure(figsize=(16,12))
top_25_bigrams = journals_bigrams.sum().sort_values(ascending=False).head(25)
top_25_bigrams.sort_values(ascending=True).plot(kind='barh');
plt.title('25 Most Frequently Occurring Bigrams in the Abstracts')
plt.ylabel('Words')
plt.xlabel('Number of Occurrences');

In [None]:
# Plot the 25 most frequently occurring trigrams in the abstracts
plt.figure(figsize=(16,12))
top_25_trigrams = journals_trigrams.sum().sort_values(ascending=False).head(25)
top_25_trigrams.sort_values(ascending=True).plot(kind='barh');
plt.title('25 Most Frequently Occurring Trigrams in the Abstracts')
plt.ylabel('Words')
plt.xlabel('Number of Occurrences');

#### Word Clouds of the Unigrams, Bigrams and Trigrams

In [None]:
# Count the frequencies of the words in the Unigrams, Bigrams and Trigrams
unigrams_count = journals_unigrams.sum().sort_values(ascending=False)
bigrams_count = journals_bigrams.sum().sort_values(ascending=False)
trigrams_count = journals_trigrams.sum().sort_values(ascending=False)

In [None]:
# Create a word cloud for the unigrams
wordcloud_unigrams = WordCloud(max_words=100, width=1000, height=1000, 
                             background_color='white').generate_from_frequencies(unigrams_count)

plt.figure(figsize=(16,9))
plt.imshow(wordcloud_unigrams)
plt.axis('off')
plt.title('Unigrams', fontsize=20);

In [None]:
# Create a word cloud for the bigrams
wordcloud_bigrams = WordCloud(max_words=100, width=1000, height=1000, 
                             background_color='white').generate_from_frequencies(bigrams_count)

plt.figure(figsize=(16,9))
plt.imshow(wordcloud_bigrams)
plt.axis('off')
plt.title('Bigrams', fontsize=20);

In [None]:
# Create a word cloud for the trigrams
wordcloud_trigrams = WordCloud(max_words=100, width=1000, height=1000, 
                             background_color='white').generate_from_frequencies(trigrams_count)

plt.figure(figsize=(16,9))
plt.imshow(wordcloud_trigrams)
plt.axis('off')
plt.title('Trigrams', fontsize=20);

## Final Data Preprocessing using TF-IDF

In this section, we will perform the final data preprocessing step using TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF weights each term by how often it appears in a document, scaled down by how many documents in the corpus contain it. This penalizes common words that appear in nearly every document and are therefore not informative.

In [None]:
%%time
# Preprocess the data again
journals['tokens'] = journals['abstract'].apply(preprocess_text)

# Join the tokenized words so that we can vectorize them
journals['tokens'] = [" ".join(post) for post in journals['tokens']]

# Instantiate a TF-IDF Vectorizer
tvec_journals = TfidfVectorizer(lowercase=False, ngram_range=(1,3))

# Fit and transform the text data to prepare for topic modelling
journals_corpus = tvec_journals.fit_transform(journals['tokens'])

## Export the Data

Now that we have completed the data processing, let's export the data so that it can be loaded in the next notebook for topic modelling.

In [None]:
# Use pickle to export the data (the context manager closes the file afterwards)
with open('../data/journals_corpus.pkl', 'wb') as f:
    pickle.dump(journals_corpus, f)
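
For reference, reading the file back in the next notebook could look roughly like the sketch below. It writes and reloads a small stand-in object in a temporary directory, so it runs on its own; `corpus_stub` and the temporary path are assumptions for illustration, and the real notebook would load `'../data/journals_corpus.pkl'` instead.

```python
import os
import pickle
import tempfile

# Stand-in for the sparse TF-IDF matrix (hypothetical data)
corpus_stub = {"shape": (3, 5), "values": [0.1, 0.5, 0.9]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "journals_corpus.pkl")

    # Write the object; the context manager closes the file
    with open(path, "wb") as f:
        pickle.dump(corpus_stub, f)

    # Read it back, as the next notebook would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == corpus_stub)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.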