<h1>LDA Topic Modeling of CORD19 Data</h1>
<h3>Morgan VandenBerg</h3>

This notebook provides utilities to generate and explore topic models on the CORD19 dataset, found on Kaggle at:
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/

Here is a link to the `topic_models.csv` file that can be read in to avoid processing the first section of this notebook: https://smu.box.com/s/3t52kyb3jeeftd8q4w3zd9bl59ba6x9b

The below code is a generic topic model builder class adapted from SMU's CS7391 Special Topics (Natural Language Processing) second homework project. Note that the `NUM_CORES` parameter should be left at 1 for the CORD19 data if processing individual paragraphs, as the multithreading overhead is too high to increase efficiency on these small documents.

In [1]:
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem.porter import PorterStemmer

# Allows me to speed up tagging without using pos_tag_sents
# as per https://stackoverflow.com/questions/33676526/pos-tagger-is-incredibly-slow
from nltk.tag.perceptron import PerceptronTagger

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from nltk.corpus import wordnet

import nltk

import sys
import os
import re
import math

import gensim
import gensim.corpora as corpora

NUM_CORES = 1

class LDAModelBuilder:
    _stemLemmaTool = None
    _stemDictionary = {}
    _tagger = PerceptronTagger()

    def __init__(self, numTopics, vectorModel, alpha, useToken, usePOS, useStemLemma, stopwordsFile, outputFile, verbose=True):
        self.verbose = verbose
        self._numTopics = numTopics
        self._vectorModel = vectorModel
        self._alpha = alpha
        self._useToken = useToken
        self._usePOS = usePOS
        self._useStemLemma = useStemLemma
        self.__initStopWords(stopwordsFile)
        self._outputFile = outputFile

    def __initStopWords(self, stopwordsFile):
        if stopwordsFile is None:
            if self.verbose:
                print("\tSelected 'none' for stopwords.")
            self._stopwords = set()
        elif stopwordsFile == 'nltk':
            if self.verbose:
                print("\tSelected NLTK stopwords.")
            self._stopwords = set(nltk_stopwords.words('english'))
        else:
            if self.verbose:
                print("\tReading custom stopwords from file.")
            # Use a context manager so the stopwords file is closed after reading
            with open(stopwordsFile) as stopFile:
                self._stopwords = set(line.strip().lower() for line in stopFile)

    def getStopwordSet(self):
        return self._stopwords

    def __stemOrLemmatizeDocument(self, document):
        # Stem or lemmatize document given program config
        if self._useStemLemma == 'N':
            return document
        elif self._useStemLemma == 'B':
            return self.__stemDocument(document)
        elif self._useStemLemma == 'L':
            return self.__lemmatizeDoc(document)
        else:
            # Fail fast: returning None here would only surface later as a confusing TypeError
            raise ValueError("Unsupported stem/lemmatization setting given in config file.")

    def __stemDocument(self, documentTokens):
        # Stem documents using PorterStemmer
        toStem = []
        # process document to remove parts of speech as specified, since lemmatization function will do this automatically
        partsSpeech = self._tagger.tag(documentTokens)
        for word, tag in partsSpeech:
            wntag = self.__getWordnetTag(tag)
            if self.__keepPartOfSpeech(wntag):
                toStem.append(word)

        # Pass items through stemmer, memoizing / referencing dictionary for performance
        toReturn = []
        if self._stemLemmaTool is None:
            self._stemLemmaTool = PorterStemmer()
        for word in toStem:
            if word not in self._stemDictionary:
                self._stemDictionary[word] = self._stemLemmaTool.stem(word)
            toReturn.append(self._stemDictionary[word])

        return toReturn

    def __lemmatizeDoc(self, documentTokens):
        # Lemmatize the document tokens using NLTK pos_tag
        toReturn = []
        if self._stemLemmaTool is None:
            self._stemLemmaTool = WordNetLemmatizer()
        partsSpeech = self._tagger.tag(documentTokens)

        for word, tag in partsSpeech:
            wntag = self.__getWordnetTag(tag)
            lemma = None
            if self.__keepPartOfSpeech(wntag):
                if wntag is None:
                    lemma = self._stemLemmaTool.lemmatize(word)
                else:
                    lemma = self._stemLemmaTool.lemmatize(word, pos=wntag)
                toReturn.append(lemma)
        return toReturn

    def __keepPartOfSpeech(self, pos):
        # Determine if the word should be kept given its part of speech and program config
        if self._usePOS == 'A':
            return True
        elif self._usePOS == 'F':
            return (pos == wordnet.NOUN or pos == wordnet.VERB or pos == wordnet.ADJ or pos == wordnet.ADV)
        elif self._usePOS == 'N':
            return (pos == wordnet.NOUN or pos == wordnet.ADJ)
        elif self._usePOS == 'n':
            return (pos == wordnet.NOUN)
        else:
            return False

    def __getWordnetTag(self, tag):
        # Convert to WordNet tags (from Penn)
        # Source for this method: https://stackoverflow.com/a/15590384
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    def __isStopWord(self, word):
        if word.lower() in self._stopwords:
            return True
        else:
            return False

    def __processWordToKeep(self, word):
        if self._useToken == 'A':
            # Keep all words except single character non-alphanumeric characters
            if len(word) == 1 and re.search(r'\W', word):
                # Case for single-character nonalphanumeric
                return None
            else:
                return word

        elif self._useToken == 'a':
            # Keep all words except single character non-alphanumeric characters,
            # remove symbols if token is a mixture of alphanumeric and symbols
            if len(word) == 1 and re.search(r'\W', word):
                # Case for single-character nonalphanumeric
                return None
            else:
                return re.sub(r'\W', '', word)

        elif self._useToken == 'N':
            # Keep only alphanumeric tokens
            if re.search(r'\W', word):
                # Case for non-alphanumeric
                return None
            else:
                # Valid case
                return word

        elif self._useToken == 'n':
            # Keep only alphanumeric tokens, removing tokens that are only numbers
            if re.search(r'\W', word):
                # Case for non-alphanumeric
                return None
            if not re.search(r'[a-zA-Z]', word):
                # Case for only numbers
                return None
            else:
                # Valid case
                return word

    def preProcessDocument(self, doc):
        return self.__preProcessDocument(word_tokenize(doc))

    def getBagOfWords(self, tokens):
        return self.id2word_.doc2bow(tokens)

    def __preProcessDocument(self, tokens):
        # Perform stemming or lemmatization
        firstPass = []
        for word in tokens:
            if len(word) < 3:
                continue
            word=word.lower()
            keepWord = self.__processWordToKeep(word)
            if keepWord is not None and not self.__isStopWord(keepWord):
                firstPass.append(keepWord)
        firstPass = self.__stemOrLemmatizeDocument(firstPass)
        # Now we remove stop words again
        toReturn = []
        for word in firstPass:
            if not self.__isStopWord(word):
                toReturn.append(word)

        return toReturn

    def __buildGensimCorpus(self, documents):
        if self.verbose:
            print("\tBuilding GenSim corpus...")
        # Build a GenSim corpus given documents
        processedDocuments = []
        for doc in documents:
            processedDocuments.append(self.preProcessDocument(doc))

        wordIDs = corpora.Dictionary(processedDocuments)
        corpus = [wordIDs.doc2bow(text) for text in processedDocuments]

        if self._vectorModel == 'B':
            # Use binary model
            for document in corpus:
                document[:] = [(id, 1 if freq > 0 else 0)
                               for (id, freq) in document]

        elif self._vectorModel == 'T':
            # Use TFIDF model
            tfidf = gensim.models.TfidfModel(corpus)
            corpus = tfidf[corpus]

        # Else 't' use TF model (no adjustment)
        elif not self._vectorModel == 't':
            print("Unsupported vector model passed in config file.")

        if self.verbose:
            print("\tBuilt GenSim corpus.")
        return (corpus, wordIDs)

    def __buildLDAModel(self, corpus, id2word):
        return gensim.models.ldamulticore.LdaMulticore(
            corpus=corpus,
            id2word=id2word,
            num_topics=self._numTopics,
            alpha=self._alpha,
            workers=NUM_CORES
        )

    def trainLDA(self, documents):
        trainSuccess = False
        if self.verbose:
            print("\tBuilding GenSim LDA topic model...")
        # Build / train LDA model via GenSim
        corpus, wordIDs = self.__buildGensimCorpus(documents)
        if len(wordIDs) > 0:
            self.LDAmodel_ = self.__buildLDAModel(corpus, wordIDs)
            self.id2word_ = wordIDs
            self._corpus_ = corpus
            trainSuccess = True
            if self.verbose:
                print("\tBuilt GenSim LDA topic model.")
        return trainSuccess
    
    def saveModel_(self):
        if self.verbose:
            print("\tSaving LDA topic model...")
        # Output the model, note I use SKLearn convention with pre/post-underscore to denote pre/post train functions
        self.LDAmodel_.save(self._outputFile + '.model')
        if self.verbose:
            print("\tSaved LDA topic model.")

    def loadModel(self, fromFile):
        self.LDAmodel_ = gensim.models.ldamulticore.LdaMulticore.load(fromFile)
            
    def getTopic(self, topicID, n=10):
        topic = self.LDAmodel_.get_topic_terms(topicid=topicID, topn=n)
        # Transform word IDs back to the original word
        topic[:] = [(self.id2word_[id], prob) for (id, prob) in topic]
        return topic

    def saveTopics_(self, n=10):
        if self.verbose:
            print("\tSaving LDA topics...")
        for topicID in range(0, self._numTopics):
            topic = self.getTopic(topicID=topicID, n=n)
            # Write topic output
            with open(self._outputFile + '_' + str(topicID) + '.topic', 'w') as writer:
                for (word, prob) in topic:
                    writer.write(word)
                    writer.write(' ')
                    writer.write(str(prob))
                    writer.write('\n')
        if self.verbose:
            print("\tSaved LDA topics.")

    def generateAndSaveDocTopics_(self, fileNames):
        if self.verbose:
            print("\tGenerating and saving document topics...")
        # Pass file names in as vector corresponding to original corpus documents
        with open(self._outputFile + '.dt', 'w') as writer:
            # Iterate over documents / filenames simultaneously
            for document, fileName in zip(self._corpus_, fileNames):
                writer.write(fileName)
                writer.write(' ')
                # Get topics for document
                docTopics = self.LDAmodel_.get_document_topics(
                    document, minimum_probability=0)
                # Write document topics to file
                for (topicID, prob) in docTopics:
                    writer.write(str(prob))
                    writer.write(' ')
                writer.write('\n')
        if self.verbose:
            print("\tGenerated and saved document topics.")


def getJaccard(s, t):
    # Given two sets of words
    # Calculate the Jaccard coefficient | S ⋂ T | / | S ⋃ T |
    numer = len(s.intersection(t))
    denom = len(s.union(t))
    return numer / denom if denom > 0 else 0


def getTopicSim(t1, t2):
    # Given two topics, get their similarity as Jaccard of T1(k) and T2(k)
    # Format note: t1, t2 should be iterables of (word, word_prob) tuples
    # Build each word set independently so topics of unequal length are not truncated by zip
    t1_words = set(word for (word, _) in t1)
    t2_words = set(word for (word, _) in t2)
    return getJaccard(t1_words, t2_words)


def getTopicSetSim(tprime, uprime):
    # Greedily match each topic in T' to its most similar topic in U'.
    # Returns (coverage, simSum); coverage is None when every topic in T'
    # selected a distinct topic in U' (a perfect match).
    selectedUvals = set()  # Indices of U' topics chosen as a best match
    simSum = 0
    for t in tprime:
        bestSim = None
        bestTopicIndex = None
        for index, u in enumerate(uprime):
            # Get similarity for current topic
            sim = getTopicSim(t, u)
            # Strict comparison keeps the earliest topic on ties
            if bestSim is None or sim > bestSim:
                bestSim = sim
                bestTopicIndex = index
        if bestSim is None:
            # U' is empty, so there is nothing to match against
            continue
        selectedUvals.add(bestTopicIndex)
        simSum += bestSim
    if len(selectedUvals) == len(tprime):
        # Perfect match was found
        return (None, simSum)
    else:
        # Did not find a perfect match
        return (
            # First term: number of selected topics / number of topics in T
            len(selectedUvals) / len(tprime),
            simSum
        )


def getWordnetTag(tag):
    # Convert to WordNet tags (from Penn)
    # Source for this method: https://stackoverflow.com/a/15590384
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
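
As a small worked example of the Jaccard-based matching defined above, the sketch below re-implements the idea in a self-contained form. The helper names (`jaccard`, `topic_words`, `best_match_sum`) and the toy topics are hypothetical and exist only for illustration; they are not part of the notebook's pipeline.

```python
# Self-contained sketch of Jaccard topic similarity and greedy best-match scoring.
# The toy topics below are made up for illustration.

def jaccard(s, t):
    # |S ∩ T| / |S ∪ T|, defined as 0 when both sets are empty
    union = s | t
    return len(s & t) / len(union) if union else 0

def topic_words(topic):
    # A topic is a list of (word, probability) tuples; keep only the words
    return {word for word, _ in topic}

def best_match_sum(t_topics, u_topics):
    # For each topic in T, pick the most similar topic in U (assumes U is non-empty).
    # Returns the set of indices chosen in U and the summed similarity.
    chosen = set()
    total = 0.0
    for t in t_topics:
        sims = [jaccard(topic_words(t), topic_words(u)) for u in u_topics]
        best = max(range(len(sims)), key=sims.__getitem__)
        chosen.add(best)
        total += sims[best]
    return chosen, total

T = [[("virus", 0.5), ("genome", 0.3)], [("farm", 0.6), ("livestock", 0.2)]]
U = [[("livestock", 0.4), ("farm", 0.4)], [("virus", 0.7), ("protein", 0.1)]]

chosen, total = best_match_sum(T, U)
print(chosen, total)
```

Here the first topic in `T` shares one of three distinct words with the second topic in `U` (similarity 1/3), while the second topic in `T` matches the first topic in `U` exactly (similarity 1). Both topics in `U` are selected, which corresponds to the "perfect match" case.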

In [2]:
cord19Path = './../scratch/CORD19Data/'
import pandas as pd
import numpy as np
import json
import os
from pprint import pprint
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

In [4]:
metadata_path = cord19Path + 'metadata.csv'
metadata = pd.read_csv(metadata_path)
metadata.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263.0,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001.0,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
2,le0ogx1s,,PMC,A new recruit for the army of the men of death,10.1186/gb-2003-4-7-113,PMC193621,12844350.0,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,"Petsko, Gregory A",Genome Biol,,,False,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3,fy4w7xz8,0104f6ceccf92ae8567a0102f89cbb976969a774,PMC,Association of HLA class I with severe acute r...,10.1186/1471-2350-4-9,PMC212558,12969506.0,no-cc,BACKGROUND: The human leukocyte antigen (HLA) ...,2003-09-12,"Lin, Marie; Tseng, Hsiang-Kuang; Trejaut, Jean...",BMC Med Genet,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
4,0qaoam29,5b68a553a7cbbea13472721cd1ad617d42b40c26,PMC,A double epidemic model for the SARS propagation,10.1186/1471-2334-3-19,PMC222908,12964944.0,no-cc,BACKGROUND: An epidemic of a Severe Acute Resp...,2003-09-10,"Ng, Tuen Wai; Turinici, Gabriel; Danchin, Antoine",BMC Infect Dis,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...


The above code reads in the CORD19 metadata.

The below code will read in the full dataset. Note that this takes a significant amount of time and consumes a large amount of memory. Moreover, it is NOT necessary for running the other items in this notebook; it is only needed to generate the original topics, which takes ~30 hours.

In [None]:
# Original data loading code adapted from 
# https://github.com/elilaird/COVID-19-Open-Research-Dataset-Challenge/blob/master/CORD-19-Topic-Modeling.ipynb

'''
    @Desc    : Reads in json article and converts into Pandas Dataframe
    @Params  : filepath (str)
    @Returns : Pandas Dataframe 
'''
def JsonToDataFrame(filepath):

    # Read json into dict
    with open(filepath) as json_data:
        data = json.load(json_data)

    paper_id = data['paper_id']

    # First row: the abstract, with its paragraphs joined into one text block
    rows = [{
        'paper_id' : paper_id,
        'section'  : 'abstract',
        'text'     : '\n'.join([section['text'] for section in data['abstract']])
    }]

    # One row per body paragraph. Rows are collected in a list because
    # DataFrame.append is slow in a loop and was removed in pandas 2.0.
    for section in data['body_text']:
        rows.append({
            'paper_id' : paper_id,
            'section'  : section['section'],
            'text'     : section['text']
        })

    return pd.DataFrame(rows, columns=['paper_id', 'section', 'text'])
    
        
biorxiv_medrxiv    = cord19Path + 'biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/'
comm_use_subset    = cord19Path + 'comm_use_subset/comm_use_subset/pdf_json/'
noncomm_use_subset = cord19Path + 'noncomm_use_subset/noncomm_use_subset/pdf_json/'

biorxiv_medrxiv_files       = [biorxiv_medrxiv + pos_json for pos_json in os.listdir(biorxiv_medrxiv) if pos_json.endswith('.json')]
comm_use_subset_files       = [comm_use_subset + pos_json for pos_json in os.listdir(comm_use_subset) if pos_json.endswith('.json')]
noncomm_use_subset_files    = [noncomm_use_subset + pos_json for pos_json in os.listdir(noncomm_use_subset) if pos_json.endswith('.json')]
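
To make the row layout produced by `JsonToDataFrame` concrete, here is a minimal sketch using a made-up article dict with the same keys the loader reads (`paper_id`, `abstract`, `body_text`). The sample content and the `article_to_rows` helper are hypothetical; real CORD19 files contain many more fields.

```python
def article_to_rows(data):
    # One row for the abstract (all abstract paragraphs joined by newlines),
    # followed by one row per body paragraph
    rows = [{
        "paper_id": data["paper_id"],
        "section": "abstract",
        "text": "\n".join(part["text"] for part in data["abstract"]),
    }]
    for part in data["body_text"]:
        rows.append({
            "paper_id": data["paper_id"],
            "section": part["section"],
            "text": part["text"],
        })
    return rows

# Made-up article for illustration only
sample = {
    "paper_id": "abc123",
    "abstract": [{"text": "First."}, {"text": "Second."}],
    "body_text": [
        {"section": "Introduction", "text": "Opening paragraph."},
        {"section": "Methods", "text": "Procedure paragraph."},
    ],
}

rows = article_to_rows(sample)
for row in rows:
    print(row["paper_id"], "|", row["section"], "|", repr(row["text"]))
```

Each body paragraph becomes its own row, which is why the corpus ends up with far more rows than papers.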


In [None]:
%%time

# Corpus loading code from https://github.com/elilaird/COVID-19-Open-Research-Dataset-Challenge/blob/clay_bert/CORD-19_BERT_clustering.ipynb
# adapted with multiprocessing by Clay Harper

import concurrent.futures  # importing only `concurrent` does not load the futures submodule

count = 0  # number of files that failed to parse

def to_df(file):
    # Return None on failure. Failures are counted in the parent process,
    # because a global incremented inside a worker process is not visible here.
    try:
        return JsonToDataFrame(file)
    except Exception:
        return None

def load_frames(files):
    # Parse the files in parallel, then concatenate the successful results once
    global count
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(to_df, files))
    frames = [df for df in results if df is not None]
    count += len(results) - len(frames)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

biomed_df      = load_frames(biorxiv_medrxiv_files)
comm_use_df    = load_frames(comm_use_subset_files)
noncomm_use_df = load_frames(noncomm_use_subset_files)

print('Count of files with issues: {}'.format(count))
full_corpus = pd.concat([biomed_df, comm_use_df, noncomm_use_df])

In [None]:
import re
import string

punct_table = str.maketrans('', '', string.punctuation)

#remove punctuation
full_corpus['text'] = full_corpus['text'].map(lambda x: x.translate(punct_table))

#convert to lowercase
full_corpus['text'] = full_corpus['text'].map(lambda x: x.lower())

full_corpus.head()

In [None]:
len(full_corpus)  # number of documents is almost 500k

<h1>Topic Models Exploration</h1>

I begin by creating a topic model for the 'prompt': the text that we want to know more about. Note that using a moderate-sized document here will help with the stability of the model. One should not run topic models for individual sentences in this manner.

In [114]:
# Seek to find groups of topics that match this prompt

promptText = '''
What do we know about virus genetics, origin, and evolution? What do we know about the virus origin and management measures at the human-animal interface?
Specifically, we want to know what the literature reports about:
Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.
Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.
Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
Experimental infections to test host range for this pathogen.
Animal host(s) and any evidence of continued spill-over to humans
Socioeconomic and behavioral risk factors for this spill-over
Sustainable risk reduction strategies
'''

promptDocs = [promptText]

baseModelOutputPath = '../scratch/CORD19_Topic_Models/'

ldaParams = dict(
    numTopics = 15,           # 15 topics
    vectorModel = 't',        # raw term-frequency vectors
    alpha = 5,                # large symmetric prior on document-topic mixtures
    useToken = 'n',           # strictest token filtering 
    usePOS = 'N',             # keep nouns and adjectives
    useStemLemma = 'L',       # lemmatization
    stopwordsFile = 'nltk',   # NLTK English stopwords
    outputFile = baseModelOutputPath + 'prompt_base_model',
    verbose=False
)

topicsToMatch = LDAModelBuilder(
    **ldaParams
)

topicsToMatch.trainLDA(promptDocs)
topicsToMatch.saveTopics_()

k = 15 # number of words per topic
promptTopicSet = []
for i in range(0, ldaParams['numTopics']):
    topic = topicsToMatch.getTopic(topicID=i, n=k)
    promptTopicSet.append(topic)
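
To make the `useToken = 'n'` setting above concrete, the sketch below reproduces that filtering rule on a few made-up tokens. The function name `keep_token_strict` and the sample tokens are hypothetical; the regular expressions are the same two checks used by the builder class for this setting.

```python
import re

def keep_token_strict(word):
    # Mirrors the 'n' setting: drop tokens containing any non-word character,
    # then drop tokens with no ASCII letter (for example, pure numbers)
    if re.search(r'\W', word):
        return None
    if not re.search(r'[a-zA-Z]', word):
        return None
    return word

tokens = ["genome", "covid19", "2019", "sars-cov-2", "e.g."]
kept = [t for t in tokens if keep_token_strict(t) is not None]
print(kept)  # ['genome', 'covid19']
```

Note that hyphenated names such as `sars-cov-2` are rejected by this rule unless punctuation has already been stripped from the text, as is done for the full corpus earlier in the notebook.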

<h1>Generating CORD19 Topic Models</h1>

The two below cells are used to build the full dataset of topic models. Please note that these cells take an incredibly long time to run--most users should instead skip this section and use the `topic_models.csv` file that I make openly available through SMU's Box service.

The first cell is a single-threaded approach that generates models sequentially; the second cell runs a specified number of processes and splits the work between them.

NOTE: the workflow here requires that the generated topics are written to disk. There will be close to two million total files here; consider this before running the below code. I performed all work with these on SMU's supercomputer, ManeFrame II (M2). Once the models are generated, you MUST use the `TopicModelCSVGenerator.java` or `TopicModelCSVGeneratorThreaded.java` programs to convert the individual model files into a single CSV.

In [None]:
from tqdm import tqdm

# Now get topics for each document
ldaModels = dict()

numEmpty=0

minDocLength = 10

for i in range(len(full_corpus)):
    docText = [full_corpus['text'].iloc[i]]
    ldaParams['outputFile'] = baseModelOutputPath + '_model_doc_' + str(i)
    ldaModels[i] = LDAModelBuilder(**ldaParams)
    if ldaModels[i].trainLDA(docText):
        # Save if train successful
        ldaModels[i].saveTopics_()
    else:
        # Training fails when preprocessing leaves no tokens in the document
        numEmpty += 1
    if i % 100 == 0:
        print(i)
        
print("Processed", len(full_corpus), "documents.")
print("Found", numEmpty,"empty.")

In [None]:
# multiprocessing for above code
from multiprocessing import Pool, Lock, Process
from multiprocessing.sharedctypes import Array


ldaModels = dict()

numEmpty=0

minDocLength = 10


def processLDA(lowerIndex, upperIndex, documents):
    for (doc, i) in zip(documents, range(lowerIndex, upperIndex)):
        docText = [doc]
        ldaParams['outputFile'] = baseModelOutputPath + '_model_doc_' + str(i)
        model = LDAModelBuilder(**ldaParams)
        if model.trainLDA(docText):
            model.saveTopics_()

            
            
if __name__ == '__main__':

    numCores = 36
    totalDocs = len(full_corpus)

    processes = []
    splits = np.array_split(full_corpus, numCores)
    
    lower = 0
    
    for i in range(numCores):
        # spawn process with docs
        docs = splits[i].text.tolist()
        p = Process(target=processLDA, args=(lower, lower+len(docs), docs))
        p.start()
        print("Started process", i)
        processes.append(p)

        lower += len(docs)

        
        
    for p in processes:
        p.join()
        print("Joined process.")
        
        

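    # The loop variable i counts processes, not documents, and numEmpty is never
    # updated by the worker processes, so only the document total is reported
    print("Processed", totalDocs, "documents.")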



This cell provides a Pythonic way to read in the models. However, on most systems (including SMU's supercomputer), it is too inefficient to run. One should instead use the provided Java code if the individual topics must be read in.

In [None]:
# Read in calculated topic models
from tqdm import tqdm
import os.path
from os import path

import numpy

ldaDocumentTopics = numpy.empty(500000, dtype=object)
print("alloc array")

numFailed = 0

numTopics = 15

for i in range(500000):
    readFile = baseModelOutputPath + '_model_doc_' + str(i)
    
    ldaDocumentTopics[i] = list()
    
    for t in range(numTopics):
        topics = []
        #if path.exists(readFile + '_' + str(t) + '.topic'):
        try:
            with open(readFile + '_' + str(t) + '.topic', 'r') as topicFile:
                for topicLine in topicFile:
                    items = topicLine.split()
                    topicString = items[0]
                    topicWeight = float(items[1])
                    topics.append((topicString, topicWeight))
        except OSError:
            pass
        ldaDocumentTopics[i].append(topics)      
    
    if i % 100 == 0:
        print("read in ", i, "models.")
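
Each `.topic` file written by `saveTopics_` holds one `word weight` pair per line. As a rough illustration of how those files map onto the `doc_id`, `topic_id`, `word`, `weight` rows of the combined CSV, the sketch below parses an in-memory stand-in for one file. The helper name `topic_lines_to_rows` and the sample values are hypothetical, and this is not the Java converter mentioned above.

```python
import io

def topic_lines_to_rows(lines, doc_id, topic_id):
    # Turn "word weight" lines into (doc_id, topic_id, word, weight) tuples
    rows = []
    for line in lines:
        items = line.split()
        if len(items) != 2:
            continue  # skip blank or malformed lines
        rows.append((doc_id, topic_id, items[0], float(items[1])))
    return rows

# Stand-in for the contents of one per-topic file
fake_file = io.StringIO("expert 0.048942\ndynamic 0.046465\n\n")
rows = topic_lines_to_rows(fake_file, doc_id=0, topic_id=0)
print(rows)
```

The document and topic IDs are not stored inside the file; in the notebook's naming scheme they come from the file name, so a converter has to carry them along as shown by the two extra arguments.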

<h1>Exploring Topic Models</h1>

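This is the section that most users should skip to! Be sure to download the `topic_models.csv` file and place it in the appropriate directory before using the below code. Note that the cell below reads the file under the name `topic_models_mthread.csv`, so adjust the path to match your copy.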

Each line of that CSV holds one word and its weight from a single topic, and each topic is one of the 15 generated for an individual document. The `doc_id` field is ordered identically to the `full_corpus` object read in above.

In [104]:
# Read saved topic models from disk
import pandas as pd
models = pd.read_csv('topic_models_mthread.csv')
models.head()

Unnamed: 0,doc_id,topic_id,word,weight
0,0,0,expert,0.048942
1,0,0,dynamic,0.046465
2,0,0,result,0.045898
3,0,0,license,0.044178
4,0,0,current,0.043589


In [12]:
models.describe()

Unnamed: 0,doc_id,topic_id,weight
count,66968550.0,66968550.0,66968550.0
mean,224879.1,7.0,0.04915377
std,129717.4,4.320494,0.04923616
min,0.0,0.0,0.0
25%,112618.0,3.0,0.02631349
50%,224823.0,7.0,0.03767874
75%,337108.0,11.0,0.05726257
max,449883.0,14.0,1.0
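
The summary above allows a quick sanity check. Assuming every document with a model contributes exactly 10 words for each of 15 topics (150 rows, the same assumption the row-indexing code later in this notebook makes), the row count implies how many documents have models and how many document IDs are absent. The variable names below are hypothetical; the numbers are copied from the summary.

```python
ROWS_PER_DOC = 10 * 15    # assumed: 10 words for each of 15 topics
total_rows = 66968550     # 'count' from the summary above
max_doc_id = 449883       # 'max' of doc_id from the summary above

docs_with_models, remainder = divmod(total_rows, ROWS_PER_DOC)
missing_ids = (max_doc_id + 1) - docs_with_models

print(docs_with_models, remainder, missing_ids)  # 446457 0 3427
```

Under that assumption the rows divide evenly into 446,457 documents, while IDs run from 0 to 449,883, so roughly 3,400 document IDs have no rows. Any array indexed by `doc_id` therefore needs `max_doc_id + 1` slots rather than one slot per unique ID.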


<h2>Topic Similarity Matching</h2>

I begin by finding paragraphs which match the prompt topics. The first approach below uses a simple Jaccard-coefficient based similarity measure. I will also explore techniques based on word-embeddings in later cells.

In [131]:
import numpy as np

simScores = np.zeros((len(pd.unique(models.doc_id)), 1))

# getTopicsForDocument will deterministically give the topics for a certain doc. But it is very, very slow. (10its/second)
def getTopicsForDocument(docID):
    topics = []
    docTopics = models[ models.doc_id == docID ]
    
    for topicID in range(0, ldaParams['numTopics']):
        topicItems = []
        docTopic = docTopics[ docTopics.topic_id == topicID ]
        for (topic, weight) in zip(docTopic.word.tolist(), docTopic.weight.tolist()):
            topicItems.append( (topic, weight) )
        topics.append(topicItems)
    
    return topics

# I can instead get topics very quickly if I index by rows
def getTopicsFromRows(lower, upper, expectedID = None):
    docTopics = models.iloc[lower:upper, :]
    if expectedID is not None:
        expectedLen = len(docTopics)
        docTopics = docTopics[ docTopics.doc_id == expectedID ]
        if len(docTopics) != expectedLen:
            print("WARNING: at expected doc ID",expectedID,
                  ", number of items not matching expected ID:",len(docTopics)-expectedLen)
            
    topics = []
    for topicID in range(0, ldaParams['numTopics']):
        topicItems = []
        docTopic = docTopics[ docTopics.topic_id == topicID ]
        for (topic, weight) in zip(docTopic.word.tolist(), docTopic.weight.tolist()):
            topicItems.append( (topic, weight) )
        topics.append(topicItems)
    return topics

In [105]:
# NOTE: you only need to run this cell if using a word-embedding similarity metric
# to run this cell, you will need to download the GoogleNewsVectors-negative300 word embeddings

# Simple, order-dependent Word-Embedding based topic set similarity metric based on that proposed in 
# Wang, Xi. (2019). Evaluating Similarity Metrics for Latent Twitter Topics. 

# Aggregate topic-vector word embedding similarity metric also provided

import gensim

print("Loading Word2Vec model (this may take some time)...")
word2vec = gensim.models.KeyedVectors.load_word2vec_format('../scratch/GoogleNewsVectors.gz', binary=True)  
print("Word2Vec model loaded.")


Loading Word2Vec model (this may take some time)...
Word2Vec model loaded.


In [106]:
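def getWordVecOrZero(word):
    # Get the vector for the word as a NP array, or return zero if not contained
    # Testing membership on the KeyedVectors object avoids the .vocab attribute,
    # which newer GenSim releases removed
    if word in word2vec:
        return word2vec[word]
    else:
        return np.zeros(word2vec.vector_size, dtype=np.float32)

def getWordCosineSim(word1, word2):
    # Get cosine similarity for word vectors, returning zero if either is not contained in the embedding vocab
    wordVec1 = getWordVecOrZero(word1)
    wordVec2 = getWordVecOrZero(word2)
    normProduct = np.linalg.norm(wordVec1) * np.linalg.norm(wordVec2)
    if normProduct == 0:
        return 0.0
    # Divide the dot product by the norms; the raw vectors are not unit length
    return np.dot(wordVec1, wordVec2) / normProduct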
    
def getEmbeddingBasedSimScore(topicSet1, topicSet2):
    # Get topic set similarity by pair-wise word embeddings
    topicSetSim = 0.0
    total = 0.0
    for (topic1, topic2) in zip(topicSet1, topicSet2):
        for ((word1, _) , (word2, _)) in zip(topic1, topic2):
            topicSetSim += getWordCosineSim(word1, word2)
            total += 1.0
    # Guard against empty topic sets (e.g. documents with no generated model)
    return topicSetSim / total if total > 0 else 0.0

def getAggregateEmbeddingBasedSimScore(topicSet1, topicSet2):
    # Aggregate each topic into a single vector (sum of words), normalize it, and then 
    # get that vector's cosine similarity to the opposing topic; average over topics for a final set sim score
    topicSetSim = 0.0
    total = 0.0
    for (topic1, topic2) in zip(topicSet1, topicSet2):
        topicVec1 = np.zeros(word2vec.vector_size, dtype=np.float32)
        topicVec2 = np.zeros(word2vec.vector_size, dtype=np.float32)
        for ((word1, _) , (word2, _)) in zip(topic1, topic2):
            topicVec1 += getWordVecOrZero(word1)
            topicVec2 += getWordVecOrZero(word2)
        norm1 = np.linalg.norm(topicVec1)
        norm2 = np.linalg.norm(topicVec2)
        if norm1 > 0 and norm2 > 0:
            # Scale each topic vector to unit length before taking the dot product
            topicSetSim += np.dot(topicVec1 / norm1, topicVec2 / norm2)
        total += 1.0
    return topicSetSim / total if total > 0 else 0.0
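
For reference, cosine similarity divides the dot product of two vectors by the product of their lengths, so vectors pointing the same way score 1 regardless of magnitude, and a plain dot product is only a cosine when both vectors are already unit length. A minimal pure-Python sketch with made-up 3-dimensional vectors follows; the `cosine` helper is hypothetical and independent of the Word2Vec model.

```python
import math

def cosine(a, b):
    # dot(a, b) / (|a| * |b|), defined as 0.0 when either vector has zero length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 2.0]
same_direction = cosine(v, [2.0, 4.0, 4.0])          # scaled copy of v
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
with_zero = cosine(v, [0.0, 0.0, 0.0])                # stand-in for an out-of-vocabulary word

print(same_direction, orthogonal, with_zero)
```

The zero-vector case matters here because out-of-vocabulary words are represented as all-zero vectors, and dividing by a zero norm would otherwise produce NaN.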

In [None]:
# Calculate topic set sims--note that this cell takes about 13 hours to run on Jaccard coefficients, 18 hours on WE
# I provide a np.save() and np.load() for the sim scores array so that the notebook can be closed / reloaded with that data.

def getJaccardTopicSetSim(topicSet1, topicSet2):
    (imperfCoef, simScore) = getTopicSetSim(topicSet1, topicSet2)
    # Treat a perfect match (None) as a coefficient of 1
    imperfCoef = 1 if imperfCoef is None else imperfCoef
    return imperfCoef * simScore

def getWordEmbeddingTopicSetSim(topicSet1, topicSet2):
    # Uses the word-by-word approach proposed in Wang, Xi. (2019). Evaluating Similarity Metrics for Latent Twitter Topics. 
    return getEmbeddingBasedSimScore(topicSet1, topicSet2)

def getTopicEmbeddingTopicSetSim(topicSet1, topicSet2):
    # Uses aggregation of the topic into one normalized vector, then compares the vectors for each topic
    return getAggregateEmbeddingBasedSimScore(topicSet1, topicSet2)
    

# CHANGE THIS FUNCTION to adjust between word-embedding topic similarity and Jaccard similarity
def calcTopicSetSim(topicSet1, topicSet2):
#     return getJaccardTopicSetSim(topicSet1, topicSet2)
    return getAggregateEmbeddingBasedSimScore(topicSet1, topicSet2)
#     return getWordEmbeddingTopicSetSim(topicSet1, topicSet2)


<h3>Data Conversions for Performance</h3>

One issue with the original `getTopicsForDocument(docID)` function is that it is extremely slow. Pandas just isn't able to perform large selections by field-matching the `doc_id` column on this large of a dataset.

It turns out that a Python dictionary has the same problem--it doesn't rehash very efficiently at this size, so inserting stuff there gives performance that degrades so quickly that the below loop would never finish.

So, I'll instead use a NumPy array with object type to hold all of the topic lists, leaving empty slots for any document IDs that have no model. This conversion still takes some time--about 50it/s (a couple of hours total)--but it makes calculating similarities far faster later on.

With that said: the below cell is optional. But if you want to use this code on a large dataset as intended, it will be of huge benefit to run this first.

In [None]:
%%time
#################################    OPTIONAL CELL     ################################
# Run this cell to increase performance significantly on large datasets. See above note.
#######################################################################################

from tqdm import tqdm

# Use a Numpy array to store these lists... it's far faster than dictionary for numeric keys
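# Size by max ID + 1 so the largest doc ID is a valid index; the np.object alias was removed in newer NumPy
documentTopics = np.empty(max(models.doc_id) + 1, dtype=object)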

documentIDs = pd.unique(models.doc_id)

ITEMS_PER_DOC = 10 * 15 # 10 words for each of 15 topics

counter = 0
for docID in documentIDs:
    # Get topics by unsafe row indexing, but use verification of expected doc ID (slightly slower)
    documentTopics[docID] = getTopicsFromRows(counter * ITEMS_PER_DOC, (counter + 1) * ITEMS_PER_DOC, docID)
    counter += 1
        
# Now redefine the getTopicsForDocument function for future use

def getTopicsForDocument(docID):
    return documentTopics[docID]
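
The row-offset lookup used above relies on every document occupying a fixed-size block of rows, in the same order as the unique document IDs. A minimal sketch of that idea with a toy block size and made-up rows is shown below; all names here are hypothetical.

```python
ITEMS = 2  # toy block size; the notebook uses 10 * 15 = 150

# (doc_id, word) rows sorted by doc_id, with IDs 1 and 2 absent
rows = [(0, "a"), (0, "b"), (3, "c"), (3, "d"), (4, "e"), (4, "f")]

# Unique doc IDs in order of first appearance
doc_ids = []
for doc_id, _ in rows:
    if not doc_ids or doc_ids[-1] != doc_id:
        doc_ids.append(doc_id)

# One slot per possible ID; absent IDs stay None
lookup = [None] * (max(doc_ids) + 1)
for position, doc_id in enumerate(doc_ids):
    block = rows[position * ITEMS:(position + 1) * ITEMS]
    # Verify the block really belongs to the expected document
    if any(d != doc_id for d, _ in block):
        raise ValueError("row block does not match doc ID {}".format(doc_id))
    lookup[doc_id] = [word for _, word in block]

print(lookup)  # [['a', 'b'], None, None, ['c', 'd'], ['e', 'f']]
```

The block position comes from the running counter, not from the document ID itself, which is what keeps the slices aligned when some IDs are missing. Code that later reads from the lookup has to handle the `None` slots.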

In [None]:
%%time
from tqdm import tqdm
# Single-threaded similarity score generator

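maxID = max(models['doc_id'])

# Doc IDs are not contiguous, so size the score array by the largest ID
# rather than by the number of unique IDs
simScores = np.zeros((maxID + 1, 1))

# range(maxID + 1) includes the final document; tqdm reports progress
for i in tqdm(range(maxID + 1)):
    docTopics = getTopicsForDocument(i)
    if docTopics is None:
        # No topic model was generated for this document ID
        continue
    simScores[i] = calcTopicSetSim(docTopics, promptTopicSet)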

np.save('simScores_array', simScores)

In [39]:
simScores = np.load('simScores_array.npy')

In [None]:
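# Sort by highest sim scores first
# (assumes an 'aggSimScore' column has already been added to full_corpus from simScores)
full_corpus.sort_values('aggSimScore', ascending=False)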