# Experimenting with Topic Modeling using Word Embeddings

The data set being used contains research paper titles and abstracts as well as labels as either Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance, or some combination of those labels.  The approach that I am taking is to convert the text to a vector using word embeddings trained on this data set, then I will train a classifier for each of the labels, separately.  At the end I am going to create a function that when text is inputed will return the likely topic(s) of the title or abstract.

This function will evaluate the inputed text on each of the classifiers separately, then return an array with the results of each one in the same order that they appear in the columns in the training dataset.

In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import ast 

# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))


import sys
import pandas as pd
import matplotlib.pyplot as plt
import random


from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from gensim.models import Word2Vec
import gensim.utils
import warnings 
warnings.filterwarnings('ignore')

import tensorflow as tf

%matplotlib notebook
print('You\'re running python %s' % sys.version.split(' ')[0])

You're running python 3.10.5


#### Load the training data:

In [60]:
# train = pd.read_csv('/kaggle/input/topic-modeling-for-research-articles/train.csv',keep_default_na=False)
train = pd.read_csv("nasa_data.csv")

In [65]:
train.isna().sum()

Unnamed: 0                 0
subjectCategories          0
keywords                 936
otherReportNumbers      1326
title                      0
distribution               0
submittedDate              0
authorAffiliations         0
stiTypeDetails             0
technicalReviewType        0
modified                   0
id                         0
sourceIdentifiers          0
created                    0
onlyAbstract               0
sensitiveInformation       0
abstract                 234
dtype: int64

In [64]:
for i,d in train.groupby('subjectCategories'):
    print(i, d['abstract'].isna().sum())

['Aeronautics (General)'] 78
['Astronomy'] 0
['Astrophysics'] 0
['Chemistry and Materials (General)'] 78
['Cybernetics, Artificial Intelligence and Robotics'] 78
['Electronics and Electrical Engineering'] 0
['Engineering (General)'] 0
['Geosciences (General)', 'Lunar and Planetary Science and Exploration'] 0
['Instrumentation and Photography'] 0
['Life Sciences (General)'] 0
['Lunar and Planetary Science and Exploration', 'Spacecraft Design, Testing and Performance'] 0
['Lunar and Planetary Science and Exploration'] 0
['Man/System Technology and Life Support'] 0
['Mechanical Engineering'] 0
['Meteorology and Climatology'] 0
['Physics of Elementary Particles and Fields'] 0
[] 0


In [66]:
train['subjectCategories'].unique()

dataframe = []
for i in range(len(train)):
    # print(train.loc[i])
    categories = ast.literal_eval(train.loc[i, "subjectCategories"])
    d = train.loc[i].copy()
    for category in categories:
        d['subject_category'] = category
        dataframe.append(d.copy())

In [67]:
data = pd.DataFrame(dataframe)
data.shape,train.shape

((1716, 18), (1560, 17))

In [69]:
data['abstract'].isna().sum()

234

In [70]:
train_x = data[['subject_category', 'abstract', 'title']]
train_x.columns = ['subject_category', 'ABSTRACT', 'TITLE']
train_x.reset_index(inplace=True)
train_x.drop(columns='index',inplace=True)
train_x

Unnamed: 0,subject_category,ABSTRACT,TITLE
0,Mechanical Engineering,The Aerospace Mechanisms Symposium (AMS) provi...,46th Aerospace Mechanisms Symposium Proceedings
1,Life Sciences (General),A series of virtual workshops was held during ...,The Integration of Life Sciences in Space: Ast...
2,Lunar and Planetary Science and Exploration,The purpose of the Human Landing System (HLS) ...,Lunar Thermal Analysis Guidebook (L-TAG)
3,"Spacecraft Design, Testing and Performance",The purpose of the Human Landing System (HLS) ...,Lunar Thermal Analysis Guidebook (L-TAG)
4,Man/System Technology and Life Support,The processing of trash and waste is a welcome...,Technical Risks Associated with Heat Melt Comp...
...,...,...,...
1711,Engineering (General),BACKGROUND: This study was conducted with the ...,ANALYSIS OF EXERCISE LOADS TO INFORM VIBRATION...
1712,Aeronautics (General),,Unmanned Aircraft Systems (UAS) Integration in...
1713,Instrumentation and Photography,The James Webb Space Telescope (JWST) is a lar...,Preparing the JWST Observatory for Science Obs...
1714,Astronomy,Surface morphologies and space weathering feat...,Surface Morphologies and Space Weathering Feat...


In [76]:
ohe =  pd.get_dummies(train_x['subject_category'])
train = train_x.join(ohe).drop(columns=['subject_category'])
# train['ABSTRACT'].fillna('',inplace=True)
train.dropna(inplace=True)
# train
train.head(2)

Unnamed: 0,ABSTRACT,TITLE,Aeronautics (General),Astronomy,Astrophysics,Chemistry and Materials (General),"Cybernetics, Artificial Intelligence and Robotics",Electronics and Electrical Engineering,Engineering (General),Geosciences (General),Instrumentation and Photography,Life Sciences (General),Lunar and Planetary Science and Exploration,Man/System Technology and Life Support,Mechanical Engineering,Meteorology and Climatology,Physics of Elementary Particles and Fields,"Spacecraft Design, Testing and Performance"
0,The Aerospace Mechanisms Symposium (AMS) provi...,46th Aerospace Mechanisms Symposium Proceedings,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,A series of virtual workshops was held during ...,The Integration of Life Sciences in Space: Ast...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


## Take a look at the training data:

#### Create array with labels for later:

In [77]:
labels = np.array(list(train.columns[2:]),dtype='str')
labels

array(['Aeronautics (General)', 'Astronomy', 'Astrophysics',
       'Chemistry and Materials (General)',
       'Cybernetics, Artificial Intelligence and Robotics',
       'Electronics and Electrical Engineering', 'Engineering (General)',
       'Geosciences (General)', 'Instrumentation and Photography',
       'Life Sciences (General)',
       'Lunar and Planetary Science and Exploration',
       'Man/System Technology and Life Support', 'Mechanical Engineering',
       'Meteorology and Climatology',
       'Physics of Elementary Particles and Fields',
       'Spacecraft Design, Testing and Performance'], dtype='<U49')

#### Break up dataset into lists that can be used for training and testing sets:

In [78]:
# this row number has a title that has no strings after simple_preprocess, so I removed it
# issueRowNumber = 8270

# X's
inputAbstracts = train['ABSTRACT'].tolist()
inputTitles = train['TITLE'].tolist()
# inputAbstracts.pop(issueRowNumber)
# inputTitles.pop(issueRowNumber)

# y's or labels
labelColumns = [None]*len(labels)
for i in range(len(labels)):
    col = train[labels[i]].tolist()
    labelColumns[i] = col

#### Tokenize titles and abstracts:

In [79]:
#tokenize titles:
inputTitleTokens = []
for title in inputTitles:
    tokens = gensim.utils.simple_preprocess(title)
    inputTitleTokens.append(tokens)
    
#tokenize abstracts:   
inputAbstractTokens = []
for abstract in inputAbstracts:
    tokens = gensim.utils.simple_preprocess(abstract)
    inputAbstractTokens.append(tokens)

#### Create Word Embeddings for article titles using Word2Vec

In [80]:
W2V_model_title = Word2Vec(inputTitleTokens, min_count=1, vector_size=100,workers=3, window=5, sg=1)
W2V_model_abstract = Word2Vec(inputAbstractTokens, min_count=1,vector_size=100,workers=3, window=5, sg=1)

#### Vectorize article titles using Word Embeddings:

In [81]:
vectorizedTitles = [None]*len(inputTitleTokens)
for i in range(len(inputTitleTokens)):
    title=[]
    for word in inputTitleTokens[i]:
        try:
            title.append(W2V_model_title.wv[word])
        except:
            'do nothing'
    title_avg = np.mean(np.array(title, dtype='f'),axis=0)
    vectorizedTitles[i]=title_avg

vectorizedAbstracts = [None]*len(inputAbstractTokens)
for i in range(len(inputAbstractTokens)):
    abstract=[]
    for word in inputAbstractTokens[i]:
        try:
            abstract.append(W2V_model_abstract.wv[word])
        except:
            'do nothing'
    abstract_avg = np.mean(np.array(abstract, dtype='f'),axis=0)
    vectorizedAbstracts[i]=abstract_avg

#### Split up testing and training sets:

In [82]:
test_size = len(inputTitles)//5
train_size = len(inputTitles)-test_size
print('Testing set size: '+str(test_size),'|','Training set size: '+str(train_size),'|','Total size: '+str(test_size+train_size))

Testing set size: 296 | Training set size: 1186 | Total size: 1482


In [83]:
len(vectorizedAbstracts)
# len(vectorizedTitles)

1482

In [84]:
#create the X test and training matricies for the article titles
temp = np.array(vectorizedTitles)
X_title_test,X_title_train = temp[train_size:],temp[:train_size]
#create the X test and training matricies for the article abstracts
temp = np.array(vectorizedAbstracts)
X_abstract_test,X_abstract_train = temp[train_size:],temp[:train_size]

#create the Y test and training arrays for the article labels (list of "np.array columns")
Y_train,Y_test = [None]*len(labelColumns),[None]*len(labelColumns)
for colNumber in range(len(labelColumns)):
    temp = np.array(labelColumns[colNumber])
    Y_test[colNumber],Y_train[colNumber]  = temp[train_size:],temp[:train_size]

#### Create random forest classifiers for each label:

In [85]:
print('TITLES:')
title_classifiers = [None]*len(Y_train)
for colNumber in range(len(Y_train)):
    temp = RandomForestClassifier(max_depth=6,n_estimators=10)
    temp.fit(X_title_train, Y_train[colNumber])
    title_classifiers[colNumber] = temp
    print(colNumber,labels[colNumber])
    print('Training accuracy:',np.sum(temp.predict(X_title_train)==Y_train[colNumber])/len(X_title_train))
    print('Testing accuracy:',np.sum(temp.predict(X_title_test)==Y_test[colNumber])/len(X_title_test))
    print()

TITLES:
0 Aeronautics (General)
Training accuracy: 1.0
Testing accuracy: 1.0

1 Astronomy
Training accuracy: 1.0
Testing accuracy: 1.0

2 Astrophysics
Training accuracy: 1.0
Testing accuracy: 1.0

3 Chemistry and Materials (General)
Training accuracy: 1.0
Testing accuracy: 1.0

4 Cybernetics, Artificial Intelligence and Robotics
Training accuracy: 1.0
Testing accuracy: 1.0

5 Electronics and Electrical Engineering
Training accuracy: 1.0
Testing accuracy: 1.0

6 Engineering (General)
Training accuracy: 1.0
Testing accuracy: 1.0

7 Geosciences (General)
Training accuracy: 0.9477234401349073
Testing accuracy: 0.9459459459459459

8 Instrumentation and Photography
Training accuracy: 1.0
Testing accuracy: 1.0

9 Life Sciences (General)
Training accuracy: 1.0
Testing accuracy: 1.0

10 Lunar and Planetary Science and Exploration
Training accuracy: 0.842327150084317
Testing accuracy: 0.8412162162162162

11 Man/System Technology and Life Support
Training accuracy: 1.0
Testing accuracy: 1.0

12 M

In [86]:

print('ABSTRACTS:')
abstract_classifiers = [None]*len(Y_train)
for colNumber in range(len(Y_train)):
    try :
        temp = RandomForestClassifier(max_depth=6,n_estimators=10)
        temp.fit(X_abstract_train, Y_train[colNumber])
        abstract_classifiers[colNumber] = temp
        print(colNumber,labels[colNumber])
        print('Training accuracy:',np.sum(temp.predict(X_title_train)==Y_train[colNumber])/len(X_title_train))
        print('Testing accuracy:',np.sum(temp.predict(X_title_test)==Y_test[colNumber])/len(X_title_test))
    except :
        print("")
        break

ABSTRACTS:
0 Aeronautics (General)
Training accuracy: 1.0
Testing accuracy: 1.0
1 Astronomy
Training accuracy: 0.8954468802698144
Testing accuracy: 0.8918918918918919
2 Astrophysics
Training accuracy: 0.9477234401349073
Testing accuracy: 0.9459459459459459
3 Chemistry and Materials (General)
Training accuracy: 1.0
Testing accuracy: 1.0
4 Cybernetics, Artificial Intelligence and Robotics
Training accuracy: 1.0
Testing accuracy: 1.0
5 Electronics and Electrical Engineering
Training accuracy: 0.9468802698145026
Testing accuracy: 0.9493243243243243
6 Engineering (General)
Training accuracy: 0.8954468802698144
Testing accuracy: 0.8918918918918919
7 Geosciences (General)
Training accuracy: 0.9477234401349073
Testing accuracy: 0.9459459459459459
8 Instrumentation and Photography
Training accuracy: 0.8946037099494097
Testing accuracy: 0.8952702702702703
9 Life Sciences (General)
Training accuracy: 0.9468802698145026
Testing accuracy: 0.9493243243243243
10 Lunar and Planetary Science and Explor

#### Create classifier function that evaluates input text on all five labels:
one for the title, and one for the abstract.

In [106]:
def title_classifier(title):
    global title_classifiers
    tokenTitle = gensim.utils.simple_preprocess(title)
    vecTitle = []
    for word in tokenTitle:
        try:
            vecTitle.append(W2V_model_title.wv[word])
        except:
            'do nothing'
    vecTitle = np.mean(np.array(vecTitle, dtype='f'),axis=0)
    preds = [None]*len(title_classifiers)
    for index in range(len(title_classifiers)):
        preds[index] = int(title_classifiers[index].predict(vecTitle.reshape(1, -1))[0])
    return np.array(preds)

def abstact_classifier(abstract):
    global abstract_classifiers
    tokenAbstract = gensim.utils.simple_preprocess(abstract)
    vecAbstract = []
    for word in tokenAbstract:
        try:
            vecAbstract.append(W2V_model_title.wv[word])
        except:
            'do nothing'
    vecAbstract = np.mean(np.array(vecAbstract, dtype='f'),axis=0)
    preds = [None]*len(abstract_classifiers)
    for index in range(len(abstract_classifiers)):
        preds[index] = int(abstract_classifiers[index].predict(vecAbstract.reshape(1, -1))[0])
    return np.array(preds)

#### Try out classifier on some made up article name inputs:


In [102]:
articleName = "Symposium Proceedings"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicited Label(s):',labels[preds==1])

Output vector: [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] | Predicited Label(s): ['Mechanical Engineering']


In [101]:
articleName = "life is possible on diffrent planet"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicited Label(s):',labels[preds==1])

Output vector: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] | Predicited Label(s): []


In [110]:
articleName = "possiblities of life in space"
preds = abstact_classifier(articleName)
print('Output vector:',preds,'|','Predicited Label(s):',labels[preds==1])

Output vector: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] | Predicited Label(s): []


In [15]:
articleName = "New prime number discovered"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicited Label(s):',labels[preds==1])

Output vector: [0 0 1 0 0 0] | Predicited Label(s): ['Mathematics']


In [16]:
articleName = "New Data distribution used to speed up training"
preds = title_classifier(articleName)
print('Output vector:',preds,'|','Predicited Label(s):',labels[preds==1])

Output vector: [1 0 0 0 0 0] | Predicited Label(s): ['Computer Science']


## Summary:

While this ensemble classifier does not work perfectly, it does a fairly descent job of classifying the papers correctly. With the way thes models have been trained, it seems that the title gives more of an indication of the field of the paper, rather than the abstract.  Having thought about this a bit, I think maybe the abstract classifiers would be more accurate if the window of the word embeddings for them was larger, since the abstracts have more words, and so maybe a context longer than 5 words.  However, overall the accuracy of 80%+ test and training accuracy for the the titles is a good indication that this classification can be done well.  I am sure it is possible to improve the accuracy a bit.  I believe a better method of combining the word embeddings, instead of a simple average as I have done here, might achieve greater accuracy.

In [None]:
tokenAbstract = gensim.utils.simple_preprocess(abstract)
vecAbstract = []
for word in tokenAbstract:
    try:
        vecAbstract.append(W2V_model_title.wv[word])
    except:
        'do nothing'
vecAbstract = np.mean(np.array(vecAbstract, dtype='f'),axis=0)
preds = [None]*len(abstract_classifiers)
for index in range(len(abstract_classifiers)):
    preds[index] = int(abstract_classifiers[index].predict(vecTitle.reshape(1, -1))[0])
