# Capstone Project: Topic Modelling of Academic Journals (Model-Based Systems Engineering)

# 02: Preprocessing and EDA

In this notebook, we will perform the following actions:
1. Data preprocessing
2. Exploratory Data Analysis (EDA)

## Import Libraries

In [85]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.stem import WordNetLemmatizer
#from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
import re

# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Import Data

In [86]:
# Import journals data
journals = pd.read_csv('../data/journals.csv')

In [87]:
# Take a look at the dataframe
journals.head()

Unnamed: 0,title,abstract,year
0,Model-based Design Process for the Early Phase...,This paper presents an approach for a model-ba...,2017
1,Model Based Systems Engineering using VHDL-AMS,The purpose of this paper is to contribute to ...,2013
2,Code Generation Approach Supporting Complex Sy...,Code generation is an effective way to drive t...,2022
3,Model based systems engineering as enabler for...,"Product complexity is steadily increasing, cus...",2021
4,Electric Drive Vehicle Development and Evaluat...,To reduce development time and introduce techn...,2014


In [88]:
# Check the shape of the data
journals.shape

(850, 3)

## Data Dictionary

In [89]:
# Check the columns in the dataframe
journals.columns

Index(['title', 'abstract', 'year'], dtype='object')

Columns in the dataframe

|Column Name | Use of Column|
|------------|--------------|
|title| Title of the academic journal. Through topic modelling, each title will be assigned to a topic for quick search later on|
|abstract| Abstract of each academic journal. This data will be preprocessed and used as the dataset for the unsupervised learning to identify topics|
|year| Year that the academic journal was published. This will be used to identify shifts in trends between the topics over the years|

## Data Preprocessing

In this section, we will process the text data in the abstract column by cleaning the text, tokenizing and lemmatizing them. A description in more detail is provided below.
* Cleaning the text to remove special characters
* Tokenizing (converts sentences into individual words, and by using ngrams, we can also form tokens with multiple words to give better context)
* Lemmatization (converts different words with the same meaning/intent into the same word)
* Stop word removal (stop words are filler words that do not provide any context and just assist with sentence structure)

### Function Defintion for Preprocessing

The below function will be used to preprocess the text data by perform the functions listed above. 

In [90]:
def preprocess_text(text):
    
    # Remove 's
    text = re.sub(r"'s", '', text)
    
    # Remove n't (example don't)
    text = re.sub(r"n't", '', text)
    
    # Remove 'm (example I'm)
    text = re.sub(r"'m", '', text)
    
    # Remove 'd (e.g. I'd)
    text = re.sub(r"'d", '', text)
    
    # Remove 're (example They're)
    text = re.sub(r"'re", '', text)
    
    # Remove 've (example They've)
    text = re.sub(r"'ve", " have", text)
    
    # Remove 'll (example We'll)
    text = re.sub(r"'ll", '', text)
    
    # Remove URL links
    text = re.sub(r'http\S+', '', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Change all text to lower case
    text = text.lower()
    
    # Remove the word abstract as it was included as the first word in one of the dataset
    text = re.sub(r"abstract", '', text)
    
    # Tokenize the text
    text = word_tokenize(text)
    
    # Lemmatize the text
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(i) for i in text]
    
    # Remove stop words
    text = [token for token in text if token not in stopwords.words('english')]
    
    return text

### Preprocess the Text Data

Here, we will apply the preprocess_text function to clean and tokenize our text data

In [91]:
%%time
# Place the proprocessed data as a new column called tokens
journals['tokens'] = journals['abstract'].apply(preprocess_text)

CPU times: user 19.1 s, sys: 9.16 s, total: 28.2 s
Wall time: 33.5 s


In [92]:
# Check the tokens
journals['tokens'].head()

0    [paper, present, approach, model-based, planni...
1    [purpose, paper, contribute, definition, model...
2    [code, generation, effective, way, drive, comp...
3    [product, complexity, steadily, increasing, ,,...
4    [reduce, development, time, introduce, technol...
Name: tokens, dtype: object

### Vectorize the words for EDA

We will use CountVectorizer to vectorize our words, to enable EDA.

In [93]:
# Instantiate a CountVectorizer with ngrams 1 for word frequency analysis
cvec_journals_1 = CountVectorizer(lowercase=False, ngram_range=(1,1))

# Instantiate a CountVectorizer with ngrams 2 for bigram analysis
cvec_journals_2 = CountVectorizer(lowercase=False, ngram_range=(2,2))

# Instantiate a CountVectorizer with ngrams 3 for trigram analysis
cvec_journals_3 = CountVectorizer(lowercase=False, ngram_range=(3,3))

In [94]:
# Join the tokenized words so that we can vectorize them
journals['tokens'] = [" ".join(post) for post in journals['tokens']]

In [95]:
# Fit the three vectorizers, transform the data and export them into a dataframe

# Unigrams
cvec_journals_1.fit(journals['tokens'])
journals_unigrams = cvec_journals_1.transform(journals['tokens'])
journals_unigrams = pd.DataFrame(journals_unigrams.todense(), 
                                 columns=cvec_journals_1.get_feature_names_out())

# Bigrams
cvec_journals_2.fit(journals['tokens'])
journals_bigrams = cvec_journals_2.transform(journals['tokens'])
journals_bigrams = pd.DataFrame(journals_bigrams.todense(), 
                                 columns=cvec_journals_2.get_feature_names_out())

# Trigrams
cvec_journals_3.fit(journals['tokens'])
journals_trigrams = cvec_journals_3.transform(journals['tokens'])
journals_trigrams = pd.DataFrame(journals_trigrams.todense(), 
                                 columns=cvec_journals_3.get_feature_names_out())

In [64]:
test['rows'] = [1, 2, 3, 4, 5]

In [65]:
test['articles'] = ["i shall be having's a good times", 
                   "testings's they've, hadn't they're, and they'll i'd and shouldn't", 
                    "they've",
                   "abstract",
                   "123"]

In [66]:
test

Unnamed: 0,rows,articles,cleaned
0,1,i shall be having's a good times,"[shall, good, time]"
1,2,"testings's they've, hadn't they're, and they'l...","[testing, ,, ,]"
2,3,they've,[]
3,4,abstract,[]
4,5,123,[]


In [None]:
test['cleaned'] = test['articles'].apply(preprocess_text)

In [None]:
test

Unnamed: 0,rows,articles,cleaned
0,1,i shall be having's a good times,"[shall, good, time]"
1,2,"testings's they've, hadn't they're, and they'l...","[testing, ,, ,]"
2,3,they've,[]
3,4,i'm,[]
4,5,123,[]
