# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# Problem Statement

Due to the increasing complexity of engineering projects, there is a growing trend to adopt Model-Based Systems Engineering (MBSE) over traditional Document-Based Systems Engineering (DBSE). However, Systems Engineering is a large field and the resources available to work on adopting MBSE are limited, so our organization needs to prioritize which aspects of MBSE to implement. Because MBSE is still an emerging field, we do not yet know the best approach to take, or which aspects of MBSE are available and applicable to our context. <br>

By applying topic modeling to the abstracts of academic research articles and conference papers on MBSE, our organization can be better equipped to implement MBSE where it is most impactful. <br>

In particular, we will use topic modeling to try to address the following three points:
1. Identifying organizational enterprise goals
2. Deciding where to focus R&D efforts
3. Identifying the latest trends in the field

# Data Collected

Our data was collected from the following three sources: <br>
1. Institute of Electrical & Electronics Engineers (IEEE) - 384 articles
2. International Council on Systems Engineering (INCOSE) - 257 articles
3. Science Direct - 210 articles

# 01: Data Cleaning

In the remainder of this notebook, we will import and clean the data. As the data was collected in both CSV and BibTeX formats, we will use the bibtexparser library to import and process the BibTeX files.

## Import Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import bibtexparser

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Import and Clean Data from Science Direct

In [2]:
# List of .bib files to import
bib_files = ['../data/ScienceDirect_articles_1_100.bib', 
             '../data/ScienceDirect_articles_101_200.bib', 
             '../data/ScienceDirect_articles_201_210.bib']

# Initialize an empty list to store the bib_database objects
bib_databases = []

# Load each .bib file into a bib_database object
for bib_file in bib_files:
    with open(bib_file) as bibtex_file:
        bib_database = bibtexparser.load(bibtex_file)
        bib_databases.append(bib_database)

# Extract the entries from each bib_database object and combine them into a single list
entries = []
for bib_database in bib_databases:
    entries.extend(bib_database.entries)

# Convert the list of entries to a pandas dataframe
sd_articles = pd.DataFrame(entries)

# Display the resulting dataframe
sd_articles.head(2)

Unnamed: 0,abstract,keywords,author,url,doi,issn,note,year,pages,volume,journal,title,ENTRYTYPE,ID,number,isbn,series,address,publisher,booktitle,editor,edition
0,This paper presents an approach for a model-ba...,"Manufacturing, System, Planning, Design, Metho...",Chantal Steimer and Jan Fischer and Jan C. Aurich,https://www.sciencedirect.com/science/article/...,https://doi.org/10.1016/j.procir.2017.01.036,2212-8271,Complex Systems Engineering and Development Pr...,2017,163-168,60,Procedia CIRP,Model-based Design Process for the Early Phase...,article,STEIMER2017163,,,,,,,,
1,The purpose of this paper is to contribute to ...,"Property-based requirement, MBSE, Specificatio...",Patrice Micouin,https://www.sciencedirect.com/science/article/...,https://doi.org/10.1016/j.procs.2013.01.014,1877-0509,2013 Conference on Systems Engineering Research,2013,128-137,16,Procedia Computer Science,Model Based Systems Engineering using VHDL-AMS,article,MICOUIN2013128,,,,,,,,


In [3]:
# Check the shape of the dataframe to confirm it has been imported correctly
sd_articles.shape

(210, 22)

In [4]:
# Check which columns are in the dataframe
sd_articles.columns

Index(['abstract', 'keywords', 'author', 'url', 'doi', 'issn', 'note', 'year',
       'pages', 'volume', 'journal', 'title', 'ENTRYTYPE', 'ID', 'number',
       'isbn', 'series', 'address', 'publisher', 'booktitle', 'editor',
       'edition'],
      dtype='object')

In [5]:
# Keep only the columns we need (copy so later edits do not touch the original frame)
sd_articles = sd_articles[['title', 'abstract', 'year']].copy()

In [6]:
# Check that the correct columns have been dropped
sd_articles.head(2)

| | title | abstract | year |
|---|---|---|---|
| 0 | Model-based Design Process for the Early Phase... | This paper presents an approach for a model-ba... | 2017 |
| 1 | Model Based Systems Engineering using VHDL-AMS | The purpose of this paper is to contribute to ... | 2013 |


In [7]:
# Check the dataframe types
sd_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     210 non-null    object
 1   abstract  209 non-null    object
 2   year      210 non-null    object
dtypes: object(3)
memory usage: 5.0+ KB


In [8]:
# Check for null values
sd_articles.isnull().sum()

title       0
abstract    1
year        0
dtype: int64

Since there is only one article that is missing an abstract, we will drop it.

In [9]:
# Drop the row with the missing abstract
sd_articles = sd_articles.dropna()

## Import and Clean Data from INCOSE

In [10]:
# List of .bib files to import
bib_files = ['../data/incose_articles_1_20.bib', 
             '../data/incose_articles_21_40.bib',
             '../data/incose_articles_41_60.bib',
             '../data/incose_articles_61_80.bib',
             '../data/incose_articles_81_100.bib',
             '../data/incose_articles_101_120.bib',
             '../data/incose_articles_121_140.bib',
             '../data/incose_articles_141_160.bib',
             '../data/incose_articles_161_180.bib',
             '../data/incose_articles_181_200.bib',
             '../data/incose_articles_201_220.bib',
             '../data/incose_articles_221_240.bib',
             '../data/incose_articles_241_257.bib',]

# Initialize an empty list to store the bib_database objects
bib_databases = []

# Load each .bib file into a bib_database object
for bib_file in bib_files:
    with open(bib_file) as bibtex_file:
        bib_database = bibtexparser.load(bibtex_file)
        bib_databases.append(bib_database)

# Extract the entries from each bib_database object and combine them into a single list
entries = []
for bib_database in bib_databases:
    entries.extend(bib_database.entries)

# Convert the list of entries to a pandas dataframe
incose_articles = pd.DataFrame(entries)

# Display the resulting dataframe
incose_articles.head(2)

Unnamed: 0,year,abstract,eprint,url,doi,keywords,pages,number,volume,journal,title,author,ENTRYTYPE,ID
0,2023,Abstract Although Model-Based Systems Engineer...,https://incose.onlinelibrary.wiley.com/doi/pdf...,https://incose.onlinelibrary.wiley.com/doi/abs...,https://doi.org/10.1002/sys.21644,"MBSE, model-based systems engineering, systems...",104-129,1,26,Systems Engineering,Model-based systems engineering: Evaluating pe...,"Campo, Kelly X. and Teper, Thomas and Eaton, C...",article,https://doi.org/10.1002/sys.21644
1,2019,Abstract This paper aims to examine and docume...,https://incose.onlinelibrary.wiley.com/doi/pdf...,https://incose.onlinelibrary.wiley.com/doi/abs...,https://doi.org/10.1002/sys.21466,"analytical hierarchy process (AHP), model-base...",134-145,2,22,Systems Engineering,State-of-practice survey of model-based system...,"Huldt, T. and Stenius, I.",article,https://doi.org/10.1002/sys.21466


In [11]:
# Check the shape of the dataframe to confirm it has been imported correctly
incose_articles.shape

(257, 14)

In [12]:
# Check which columns are in the dataframe
incose_articles.columns

Index(['year', 'abstract', 'eprint', 'url', 'doi', 'keywords', 'pages',
       'number', 'volume', 'journal', 'title', 'author', 'ENTRYTYPE', 'ID'],
      dtype='object')

In [13]:
# Keep only the columns we need (copy so later edits do not touch the original frame)
incose_articles = incose_articles[['title', 'abstract', 'year']].copy()

In [14]:
# Check that the correct columns have been dropped
incose_articles.head(2)

| | title | abstract | year |
|---|---|---|---|
| 0 | Model-based systems engineering: Evaluating pe... | Abstract Although Model-Based Systems Engineer... | 2023 |
| 1 | State-of-practice survey of model-based system... | Abstract This paper aims to examine and docume... | 2019 |


In [15]:
# Check the dataframe types
incose_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257 entries, 0 to 256
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     257 non-null    object
 1   abstract  257 non-null    object
 2   year      254 non-null    object
dtypes: object(3)
memory usage: 6.1+ KB


In [16]:
# Check for null values
incose_articles.isnull().sum()

title       0
abstract    0
year        3
dtype: int64

In [17]:
# Three rows have a missing year. Impute them with the median year,
# truncated to a whole year and stored as a string like the other rows
median_year = int(np.median(incose_articles['year'].dropna().astype(int)))
incose_articles['year'] = incose_articles['year'].fillna(str(median_year))
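
To make the effect of the median imputation concrete, here is a minimal sketch on a toy frame. The data and the name `toy` are hypothetical; the only thing carried over from the notebook is that years read from BibTeX arrive as strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the INCOSE frame: string years, one of them missing
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "year": ["2015", None, "2019", "2021"],
})

# Median of the known years, truncated to a whole year
known_years = toy["year"].dropna().astype(int)
median_year = int(np.median(known_years))

# Fill the gap using the same string type as the other rows
toy["year"] = toy["year"].fillna(str(median_year))

print(toy["year"].tolist())  # ['2015', '2019', '2019', '2021']
```

Keeping the filled value as a string leaves the column homogeneous until the years are converted in a single step later on.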

## Import and Clean Data from IEEE

In [18]:
# Import as a pandas dataframe
ieee_articles = pd.read_csv('../data/ieee_articles_1_384.csv')

# Display the resulting dataframe
ieee_articles.head(2)

Unnamed: 0,Document Title,Authors,Author Affiliations,Publication Title,Date Added To Xplore,Publication Year,Volume,Issue,Start Page,End Page,Abstract,ISSN,ISBNs,DOI,Funding Information,PDF Link,Author Keywords,IEEE Terms,INSPEC Controlled Terms,INSPEC Non-Controlled Terms,Mesh_Terms,Article Citation Count,Patent Citation Count,Reference Count,License,Online Date,Issue Date,Meeting Date,Publisher,Document Identifier
0,A generic conceptual model and actual systems ...,Donghun Yoon,"Keio University, Japan",2009 International Conference on Model-Based S...,29 May 2009,2009,,,69,74,"In this paper, a generic conceptual model and ...",,978-1-4244-2967-7,10.1109/MBSE.2009.5031722,,https://ieeexplore.ieee.org/stamp/stamp.jsp?ar...,,Systems engineering and theory;Information man...,information management;smart cards;systems eng...,generic conceptual model;IC-card system;model-...,,,,6.0,IEEE,29 May 2009,,,IEEE,IEEE Conferences
1,Model-Based Systems Engineering Implementation...,W. K. Vaneman; R. Carlson,"Systems Engineering Department, Naval Postgrad...",2019 IEEE International Systems Conference (Sy...,16 Sep 2019,2019,,,1,6,As organizations strive to implement Model-Bas...,2472-9647,978-1-5386-8396-5,10.1109/SYSCON.2019.8836888,,https://ieeexplore.ieee.org/stamp/stamp.jsp?ar...,Model-Based Systems Engineering (MBSE),Analytical models;Organizations;Data models;To...,business data processing;organisational aspect...,MBSE environment;model-based languages;organiz...,,2.0,,10.0,USGov,16 Sep 2019,,,IEEE,IEEE Conferences


In [19]:
# Check the shape of the dataframe to confirm it has been imported correctly
ieee_articles.shape

(384, 30)

In [20]:
# Check which columns are in the dataframe
ieee_articles.columns

Index(['Document Title', 'Authors', 'Author Affiliations', 'Publication Title',
       'Date Added To Xplore', 'Publication Year', 'Volume', 'Issue',
       'Start Page', 'End Page', 'Abstract', 'ISSN', 'ISBNs', 'DOI',
       'Funding Information', 'PDF Link', 'Author Keywords', 'IEEE Terms',
       'INSPEC Controlled Terms', 'INSPEC Non-Controlled Terms', 'Mesh_Terms',
       'Article Citation Count', 'Patent Citation Count', 'Reference Count',
       'License', 'Online Date', 'Issue Date', 'Meeting Date', 'Publisher',
       'Document Identifier'],
      dtype='object')

In [21]:
# Drop the columns we don't need
ieee_articles = ieee_articles[['Document Title', 'Abstract', 'Publication Year']]

In [22]:
# Rename the columns
ieee_articles = ieee_articles.rename(columns = {'Document Title' : 'title', 
                                                'Abstract' : 'abstract',
                                                'Publication Year' : 'year'})

In [23]:
# Check the dataframe types
ieee_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     384 non-null    object
 1   abstract  384 non-null    object
 2   year      384 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 9.1+ KB


In [24]:
# Check for null values
ieee_articles.isnull().sum()

title       0
abstract    0
year        0
dtype: int64

## Merge the 3 Dataframes Together

As our data comes from three different sources with different formats, we imported them separately. We will now merge them into one dataframe.

In [25]:
# Check the shape of each dataframe
print(f"Science Direct articles: {sd_articles.shape}")
print(f"Incose articles: {incose_articles.shape}")
print(f"IEEE articles: {ieee_articles.shape}")

Science Direct articles: (209, 3)
Incose articles: (257, 3)
IEEE articles: (384, 3)


In [26]:
# Count how many articles we should have
sd_articles.shape[0] + incose_articles.shape[0] + ieee_articles.shape[0]

850

In [27]:
# Concatenate the three dataframes into one
journals = pd.concat([sd_articles, incose_articles, ieee_articles], ignore_index=True)

In [28]:
# Check the shape of the dataframe
journals.shape

(850, 3)
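
The manual row-count comparison above can also be written as an assertion. The sketch below uses three tiny hypothetical frames (`sd`, `incose`, `ieee` are illustrative names, not the notebook's data) and also shows what `ignore_index=True` changes.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned dataframes
sd = pd.DataFrame({"title": ["s1", "s2"], "abstract": ["a", "b"], "year": ["2017", "2013"]})
incose = pd.DataFrame({"title": ["i1"], "abstract": ["c"], "year": ["2023"]})
ieee = pd.DataFrame({"title": ["e1", "e2"], "abstract": ["d", "e"], "year": [2009, 2019]})
frames = [sd, incose, ieee]

# Without ignore_index, each frame keeps its own 0..n-1 labels, so labels repeat
stacked = pd.concat(frames)
print(stacked.index.is_unique)   # False

# With ignore_index=True the result gets a fresh 0..N-1 index
merged = pd.concat(frames, ignore_index=True)
print(merged.index.is_unique)    # True

# Fail loudly if any rows were lost or gained along the way
expected_rows = sum(len(frame) for frame in frames)
assert merged.shape == (expected_rows, 3)
```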

In [29]:
# Check the content of the dataframe
journals.head(2)

| | title | abstract | year |
|---|---|---|---|
| 0 | Model-based Design Process for the Early Phase... | This paper presents an approach for a model-ba... | 2017 |
| 1 | Model Based Systems Engineering using VHDL-AMS | The purpose of this paper is to contribute to ... | 2013 |


In [30]:
# Check the dataframe types
journals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     850 non-null    object
 1   abstract  850 non-null    object
 2   year      850 non-null    object
dtypes: object(3)
memory usage: 20.0+ KB


## Cleaning the Data

Now that we have successfully imported all the data into one dataframe, let's continue cleaning it.

In [31]:
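# Convert the year column to integers. It mixes strings (BibTeX) and integers (IEEE CSV),
# and pd.to_datetime would read bare integers as nanosecond timestamps, not years
journals['year'] = pd.to_numeric(journals['year']).astype(int)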
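
After concatenation, `year` holds strings from the BibTeX sources and integers from the IEEE CSV, which is why `info()` reports it as `object`. The sketch below, on hypothetical values, shows how such a column behaves under a numeric conversion and why bare integers need care with `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical mixed-type column: strings from BibTeX, integers from a CSV
years = pd.Series(["2017", "2013", 2009, 2019], dtype=object)

# A numeric conversion treats both representations as plain years
as_int = pd.to_numeric(years).astype(int)
print(as_int.tolist())  # [2017, 2013, 2009, 2019]

# For comparison: bare integers given to pd.to_datetime are read as
# nanoseconds since 1970-01-01, not as calendar years
from_ints = pd.to_datetime(pd.Series([2009, 2019])).dt.year
print(from_ints.tolist())  # [1970, 1970]
```

A quick look at `journals['year'].min()` after the conversion is a cheap way to confirm that no row has collapsed to 1970.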

As our data comes from three different sources, we also need to check for duplicates.

In [32]:
# Count rows that are identical in all three columns
journals.duplicated().sum()

0

No duplicates were found.
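
Note that `duplicated()` with no arguments only flags rows that match exactly in every column, so the same paper indexed by two sources with slightly different capitalisation or spacing would pass unnoticed. A minimal sketch of a looser check on a normalised title follows; the toy data and the normalisation rules are assumptions for illustration, not a step the notebook performed.

```python
import pandas as pd

# Hypothetical rows: the first two are the same paper with cosmetic differences
toy = pd.DataFrame({
    "title": ["Model-Based Design", "model-based  design ", "Another Paper"],
    "abstract": ["Abstract A", "Abstract A.", "Abstract B"],
    "year": [2017, 2017, 2020],
})

# Exact row comparison finds nothing
exact = int(toy.duplicated().sum())

# Compare on a normalised title instead: lower-cased, whitespace collapsed and trimmed
norm_title = (
    toy["title"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
near = int(norm_title.duplicated().sum())

print(exact, near)  # 0 1
```

Any rows flagged this way would still need a manual look before dropping, since distinct papers can share a title.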

## Export the Cleaned Data

In [33]:
# Export the cleaned data into a csv file
journals.to_csv('../data/journals.csv', index=False)