<h1>Poetry Topic Modelling with Latent Dirichlet Allocation and Latent Semantic Analysis</h1>

This code was utilised as part of my final year undergraduate project and dissertation, whose abstract is as follows:


<b>Topic models, which detect latent themes in a corpus of documents to group co-occurring keywords together in thematically comprehensible ways, were generated using the Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) algorithms with three datasets of poetry from different time periods. A close reading of the results as well as a study to measure interpretability were used to measure which algorithm was the most successful at uncovering specific themes in each dataset established using relevant literary studies. Comparison between the two algorithms’ performances served to indicate which method was the most successful in modelling this highly figurative language. Our findings indicated that LDA generated the most thematically comprehensible topics, owing to improved performance in identifying context and polysemy in the vocabulary used throughout the corpora, as well as having more parameters available to tune and optimise performance.</b>

Steps taken to build our models in this notebook:
1. Mine a large CSV file of poems from Kaggle to create smaller CSV files for each of the poetic movements we are exploring (Romantic, Metaphysical, Harlem Renaissance)

2. Clean and preprocess the data, creating bigrams, dictionaries and document-term matrices ready to be passed into the Gensim model functions

3. Evaluate and validate the generated topics.

Step 1:

Necessary imports for mining the Kaggle CSV:

In [200]:
import pandas as pd 
import os 

We begin by importing the original Kaggle dataset into a Pandas dataframe and making empty dataframes for each poetic movement. 

In [201]:
poetrydata = pd.read_csv('kaggle_poem_dataset.csv') #This is a csv containing many PoetryFoundation poems

metaphysical = pd.DataFrame()
romantic = pd.DataFrame()
harlem = pd.DataFrame()

Our function <b>addPoems</b> takes the main CSV and the name of a poet as input, and adds any poems from that poet into the dataframe we're building.

In [202]:
def addPoems(poetrydata, poetname):
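    #Keep every row (poem) whose Author field contains the poet name we specify
    #regex=False matches the name as plain text (so the dots in 'W. E. B. Du Bois' are literal); na=False skips rows with no author
    newPoems = poetrydata[poetrydata['Author'].str.contains(poetname, regex=False, na=False)]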
    return newPoems
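
Because the lookup is a substring match on the Author column, a short name can match more than one poet. The sketch below shows this on a toy DataFrame; the rows are invented for illustration, and `add_poems_literal` is a hypothetical variant of <b>addPoems</b> that matches the name as plain text and ignores missing authors.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; these rows are invented for illustration
toy = pd.DataFrame({
    'Author': ['Samuel Taylor Coleridge', 'Hartley Coleridge', 'John Keats', None],
    'Content': ['poem a', 'poem b', 'poem c', 'poem d'],
})

def add_poems_literal(poetrydata, poetname):
    # regex=False treats the name as plain text; na=False skips rows with no author
    mask = poetrydata['Author'].str.contains(poetname, regex=False, na=False)
    return poetrydata[mask]

coleridge = add_poems_literal(toy, 'Coleridge')
keats = add_poems_literal(toy, 'Keats')
print(coleridge['Author'].tolist())
print(keats['Author'].tolist())
```

Here the surname alone picks up both Coleridges, which is worth keeping in mind when a list contains both a surname and a full name.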

Next, we create a list of poets for each movement. The Romantic and Harlem Renaissance poets were taken from the movement listings on poetryfoundation.org. Identifying the Metaphysical poets was a harder task and drew on several sources detailed further in the report.

In [203]:
metaphysicalPoets = ['John Donne', 'Abraham Cowley', 'Andrew Marvell', 'Richard Crashaw','George Herbert','John Cleveland', 'Henry Vaughan']
romanticPoets = ['George Gordon','William Blake','Shelley','Felicia Dorothea Hemans','William Wordsworth','Coleridge','Keats','John Clare','Beddoes','William Lisle Bowles','Robert Burns','Barbauld','Heinrich Heine','Friedrich Hölderlin','Charles Lamb','Thomas Moore','Giacamo Leopardi','Christian Milne','Walter Scott','Robert Southey','Mary Lamb','Elizabeth Moody','Anna Seward','Elizabeth Bentley','Helen Leigh','George Crabbe','Joanna Baillie','Letitia Elizabeth Landon','Helen Maria Williams','Matilda Bethem','Mary Robinson','Walter Savage Landor','Leigh Hunt','Charlotte Smith','John Clare','Thomas Hood','Elizabeth Hands','Dorothy Wordsworth','Charlotte Richardson','Jane Taylor','Hartley Coleridge']
harlemPoets = ['Langston Hughes','Paul Dunbar','Claude McKay','Melvin B. Tolson','James Weldon Johnson','Fenton Johnson','Countee Cullen','Anne Spencer','William Warning Cuney','Margaret Walker','Jean Toomer','Georgia Douglas Johnson','W. E. B. Du Bois','Arna Bontemps','Leslie Pickeny Hill','Sterling A. Brown','Alice Dunbar-Nelson','Jessie Redmon Fauset']

We use for loops and our <b>addPoems</b> function to build our DataFrames, then verify their sizes.

In [204]:
#DataFrame.append was removed in pandas 2.0, so pd.concat is used to add each poet's poems
#Metaphysical
for poet in metaphysicalPoets:
    newPoems = addPoems(poetrydata, poet) 
    metaphysical = pd.concat([metaphysical, newPoems], ignore_index=True)

#Romantic
for poet in romanticPoets:
    newPoems = addPoems(poetrydata, poet) 
    romantic = pd.concat([romantic, newPoems], ignore_index=True)

#Harlem Renaissance
for poet in harlemPoets:
    newPoems = addPoems(poetrydata, poet) 
    harlem = pd.concat([harlem, newPoems], ignore_index=True)
    
    
print('Metaphysical Poems: \n', metaphysical.shape[0]) #shape[0] = row count = number of poems
print('Harlem Renaissance Poems: \n', harlem.shape[0])
print('Romantic Poems: \n', romantic.shape[0])

Metaphysical Poems: 
 126
Harlem Renaissance Poems: 
 82
Romantic Poems: 
 392
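
One caveat on these counts: the Romantic list names 'John Clare' twice, and 'Coleridge' also matches 'Hartley Coleridge', so some poems may be added more than once. A minimal sketch of one way to guard against this, assuming duplicate rows are identical across all columns, is to collect the matches and drop repeated rows. The toy data and the `poets` list below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; these rows are invented for illustration
toy = pd.DataFrame({
    'Author': ['John Clare', 'Hartley Coleridge', 'William Blake'],
    'Content': ['poem 1', 'poem 2', 'poem 3'],
})

# A poet list with a repeated name and an overlapping surname
poets = ['John Clare', 'Coleridge', 'John Clare', 'Hartley Coleridge']

matches = [toy[toy['Author'].str.contains(p, regex=False, na=False)] for p in poets]
combined = pd.concat(matches, ignore_index=True)

# Drop rows that are exact repeats, then renumber the index
deduped = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), len(deduped))
```

Comparing the row count before and after `drop_duplicates` gives a quick check on how many poems were matched by more than one list entry.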


Our function <b>createCSV</b> takes the movement DataFrame and the name of the movement as parameters and is used to produce a CSV file for each movement's poems.

In [205]:
def createCSV(movementDF, movementName):
    content = pd.DataFrame(movementDF, columns=['Content'])#We only want the Content column (the poems)
    content = content.replace('\n',' ', regex=True) #Remove line breaks for formatting
    content.to_csv(movementName + '.csv', index=False, header=True) #Write e.g. romantic.csv without the index column

createCSV(romantic, 'romantic')
createCSV(metaphysical, 'metaphysical')
createCSV(harlem,'harlem')

With our datasets prepared, we will create the LDA and LSA models.

Step 2:
First we'll import the necessary libraries, making it clear which modules we'll be using:

In [206]:
import numpy as np
import matplotlib 
import sys
import gensim
from gensim import corpora, models, utils
from gensim import similarities
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import nltk
from pprint import pprint
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import brown
import string
import re #regex
from nltk.tokenize import RegexpTokenizer
import pyLDAvis
import pyLDAvis.gensim
from gensim.models import LsiModel
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

We'll specify the movement we wish to explore.

In [207]:
poemSet = input('Please specify the poetry movement (romantic, metaphysical, harlem)\n')
poems = pd.read_csv(poemSet+'.csv')

Please specify the poetry movement (romantic, metaphysical, harlem)
romantic


The next step will preprocess the data ready for bigram models and a document-term matrix to be built from the corpus.

In [208]:
def sent_to_words(poems):
    for poem in poems:
        yield(gensim.utils.simple_preprocess(str(poem))) #For formatting 


#Remove punctuation
poems["poems_processed"] = poems['Content'].str.replace(r'[^\w\s]', '', regex=True) #regex=True is required in recent pandas versions

#Make all lowercase
poems["poems_processed"] = poems['poems_processed'].str.lower()

#Remove stopwords
stop = stopwords.words('english')
#Many of our poems contain some antiquated language not accounted for in NLTK's stopwords collection, so we need 
#to add them. A few other words have been added which consistently made their way into almost every topic and needed
#to be processed out (these words are not very substantive anyway)
stop.extend(['from', 'like', 'thou', 'may', 'much','let','ye','said','tis','thy','whose','thee','yet','shall','one', 'see','every','amp','even','juan','yarrow','upon','though','oh'])
poems['poems_processed'] = poems['poems_processed'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

#Put column of poems into new variable
poemtemp = poems['poems_processed']

#Conversion of the processed column into its own dataframe then a list to keep formatting
datadf = poemtemp.to_frame() 
data = datadf['poems_processed'].values.tolist()

data_words = list(sent_to_words(data))
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.

bigram_mod = gensim.models.phrases.Phraser(bigram)
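
To give a feel for what the `threshold` argument controls, here is a simplified pure-Python sketch of the default phrase score, (count(a, b) - min_count) * V / (count(a) * count(b)). It is an illustration only: `bigram_scores` is a hypothetical helper, V here counts distinct words only (Gensim's internal vocabulary size may differ), and the toy sentences are invented.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs: (count(a,b) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # assumption: distinct words only
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

sentences = [
    ['dark', 'night', 'sweet', 'rose'],
    ['dark', 'night', 'sweet', 'song'],
    ['sweet', 'dream', 'dark', 'night'],
]

scores = bigram_scores(sentences, min_count=1)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A pair only scores highly when it occurs together more often than its individual word counts would suggest, so a high threshold such as 100 keeps only the strongest collocations.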


The function we'll use to create bigrams from our corpus:

In [209]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

We now form the bigrams, build a dictionary from them, and convert the corpus into a document-term matrix to pass into the model function.

In [210]:
# Form Bigrams
data_words_bigrams = make_bigrams(data_words)

# Create Dictionary
id2word = corpora.Dictionary(data_words_bigrams)

# Create Corpus
texts = data_words_bigrams

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
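
For readers unfamiliar with the bag-of-words format, the following pure-Python sketch mimics the shape of what the dictionary and `doc2bow` produce: each token gets an integer id, and each document becomes a list of (id, count) pairs. `build_vocab` and `to_bow` are hypothetical helpers, and the ids are assigned in order of first appearance here, which may not match the ids Gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    # Map each distinct token to an integer id, in order of first appearance
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count how often each known token id occurs in the document
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [['love', 'light', 'love'], ['death', 'light']]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]

print(vocab)  # {'love': 0, 'light': 1, 'death': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded at this point, which is why the bigrams have to be formed before the corpus is converted.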

Building the LDA model using MALLET

In [211]:
mallet_path = '/Users/admin/mallet-2.0.8/bin/mallet' #path to mallet

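#Note: gensim.models.wrappers was removed in Gensim 4.0, so this cell needs an earlier Gensim release
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=13, id2word=id2word,
                                                random_seed=100, iterations=500) #Increase iterations for improvement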
# Show Topics
pprint(ldamallet.show_topics(num_topics=-1, num_words=20))

[(0,
  '0.024*"love" + 0.015*"light" + 0.014*"sleep" + 0.013*"death" + '
  '0.013*"flowers" + 0.012*"tears" + 0.011*"cold" + 0.011*"smile" + '
  '0.011*"voice" + 0.010*"thine" + 0.009*"pale" + 0.009*"wings" + '
  '0.009*"bright" + 0.009*"till" + 0.009*"woe" + 0.009*"dead" + 0.009*"weep" + '
  '0.009*"golden" + 0.009*"wild" + 0.008*"cloud"'),
 (1,
  '0.018*"green" + 0.016*"hear" + 0.012*"sun" + 0.011*"spring" + 0.011*"make" '
  '+ 0.010*"die" + 0.009*"water" + 0.009*"birds" + 0.009*"fancy" + '
  '0.009*"home" + 0.009*"bird" + 0.008*"mine" + 0.008*"high" + 0.008*"quiet" + '
  '0.008*"happy" + 0.008*"bring" + 0.008*"round" + 0.007*"oer" + 0.007*"clear" '
  '+ 0.007*"lake"'),
 (2,
  '0.017*"eyes" + 0.015*"sweet" + 0.012*"lady" + 0.011*"lay" + 0.011*"side" + '
  '0.010*"heard" + 0.010*"made" + 0.010*"heart" + 0.010*"stood" + 0.010*"fair" '
  '+ 0.009*"white" + 0.009*"face" + 0.009*"hath" + 0.009*"bright" + '
  '0.008*"arms" + 0.007*"child" + 0.007*"eye" + 0.007*"full" + 0.007*"rose" + '
  '
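
Each topic is printed as a single string of weighted terms. If the weights are needed for further analysis, a small regular expression can split such a string into (word, weight) pairs. This is a sketch that assumes the `weight*"word"` layout shown in the output above; `parse_topic` is a hypothetical helper and the sample strings are shortened copies of printed topics.

```python
import re

# Matches one weighted term such as 0.024*"love" or -0.348*"lady"
TERM = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    # Return the terms in their printed order as (word, weight) pairs
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

lda_sample = '0.024*"love" + 0.015*"light" + 0.014*"sleep"'
lsa_sample = '-0.348*"lady" + -0.310*"christabel" + 0.113*"dark"'

print(parse_topic(lda_sample))  # [('love', 0.024), ('light', 0.015), ('sleep', 0.014)]
print(parse_topic(lsa_sample))  # [('lady', -0.348), ('christabel', -0.31), ('dark', 0.113)]
```

The optional minus sign in the pattern means the same helper also handles signed weights.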

Next we'll compute the coherence score as a quantitative performance metric. As detailed in the report, this score did not tell us much in this project and appeared essentially arbitrary, but it is included for demonstration purposes.

In [212]:
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_words_bigrams, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)


Coherence Score:  0.3601080630415803
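
To make the idea of a coherence score more concrete, here is a pure-Python sketch of a simpler measure, a UMass-style score based on document co-occurrence. This is not the 'c_v' measure used in the cell above; it is swapped in only because it is short enough to show in full. `umass_coherence` is a hypothetical helper and the toy documents are invented.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average UMass-style score over all ranked pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every one of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    # combinations keeps list order, so `higher` always ranks above `lower`
    for higher, lower in combinations(top_words, 2):
        if doc_freq(higher) == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(higher, lower) + 1) / doc_freq(higher)))
    return sum(scores) / len(scores) if scores else float('nan')

documents = [
    ['love', 'death', 'night'],
    ['love', 'night'],
    ['sun', 'spring'],
    ['sun', 'spring', 'bird'],
]

coherent = umass_coherence(['love', 'night'], documents)
mixed = umass_coherence(['love', 'spring'], documents)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words that never meet, which is the intuition behind coherence measures in general. It also hints at why such scores can be unreliable on short, figurative texts where related words rarely co-occur.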


Next we'll perform LSA, by first taking as input the name of the poetic movement to be explored (we will assume for the purposes of this notebook's usability that it will be the same as the one specified previously for LDA) and setting up some other variables. LSA won't work as intuitively using DataFrames so we'll prepare the poems in a list instead.

In [213]:
def load_data(file_name):
    
    documents_list = []
    titles=[]
    poems = pd.read_csv(file_name+'.csv')
    poemnum = poems.shape[0]
    print("Poemnum = ",poemnum)

    
    #Note: range(poemnum-1) stops at index poemnum-2, so the file numbered poemnum-1 is not read
    for i in range(poemnum-1):
        #Each poem is expected in its own text file, e.g. romanticTxt/Poem0.txt
        with open(file_name+"Txt"+"/Poem"+str(i)+".txt", "r") as file:
            for line in file:
                documents_list.append(line.strip())

    print("Total Number of Documents:",len(documents_list))
    #titles.append( text[0:min(len(text),100)] )
    return documents_list

# LSA Model
number_of_topics=13
words=20
#poemSetLSA = poemSet as defined earlier

We'll preprocess this new list in the same way, with the same stopwords in order to convert into a new document-term matrix.

In [214]:
def preprocess_data(doc_set):
    """
    Preprocess text (tokenization and removing stopwords)
    """
    
    tokenizer = RegexpTokenizer(r'\w+')

    texts = []
    stop = stopwords.words('english')
    stop.extend(['from', 'like', 'thou', 'may', 'much','let','ye','said','yarrow','tis','thy','whose','thee','yet','shall','one', 'see','every','amp','even','juan','upon','though','oh'])


    
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in stop]
        texts.append(stopped_tokens)



    return texts

def prepare_corpus(doc_clean):
    """
    Conversion into a document-term matrix for feeding into LsiModel for SVD reduction
    """
    # Creating the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary,doc_term_matrix

document_list =load_data(poemSet)
clean_text=preprocess_data(document_list)


Poemnum =  392
Total Number of Documents: 391
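
Since LSA rests on a truncated singular value decomposition (SVD) of the term-document matrix, a small NumPy sketch may help before the model is built. The matrix below is invented: rows are terms, columns are documents. This illustrates the decomposition itself and is not how Gensim's LsiModel is implemented internally.

```python
import numpy as np

# Rows are terms, columns are documents; the counts are invented for illustration
terms = ['love', 'heart', 'dark', 'death']
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 4, 2],
    [0, 0, 2, 4],
], dtype=float)

# Full SVD, then keep only the k strongest components ("topics")
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for topic in range(k):
    # Rank terms by absolute weight, since the sign of a component is arbitrary
    ranked = sorted(zip(terms, U[:, topic]), key=lambda tw: -abs(tw[1]))
    print(topic, [(term, round(float(weight), 3)) for term, weight in ranked])

print('singular values:', np.round(s, 3))
print('reconstruction error:', round(float(np.linalg.norm(X - X_k)), 3))
```

Each component can be flipped in sign without changing the decomposition, so negative weights in an LSA topic indicate terms that pull in opposite directions within that component rather than "negative" occurrences.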


And finally we'll build our LSA model and calculate its coherence score with CoherenceModel.

In [215]:
def create_gensim_lsa_model(doc_clean,number_of_topics,words):
    '''
    Use SVD on the document term matrix and output the LSA model topics
    as well as coherence calculated with CoherenceModel
    '''
    dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    # generate LSA model
    print('Document Term Matrix:')
    lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary, power_iters=50, onepass=False)  # train model
    pprint(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
    coherencemodel = CoherenceModel(model=lsamodel, texts=doc_clean, dictionary=dictionary, coherence='c_v')
    coherence_score = coherencemodel.get_coherence()
    print('\nCoherence Score: ', coherence_score)




    
    return lsamodel

model=create_gensim_lsa_model(clean_text,number_of_topics,words)

Document Term Matrix:
[(0,
  '0.151*"still" + 0.145*"eyes" + 0.145*"light" + 0.141*"day" + 0.138*"love" + '
  '0.135*"heart" + 0.120*"night" + 0.114*"oer" + 0.108*"sweet" + 0.100*"came" '
  '+ 0.098*"would" + 0.098*"life" + 0.096*"old" + 0.096*"made" + 0.093*"earth" '
  '+ 0.093*"dark" + 0.090*"death" + 0.090*"bright" + 0.090*"world" + '
  '0.087*"thus"'),
 (1,
  '-0.348*"lady" + -0.310*"christabel" + -0.197*"geraldine" + -0.169*"leoline" '
  '+ -0.164*"sir" + -0.124*"maid" + 0.113*"dark" + -0.109*"sweet" + '
  '-0.105*"well" + -0.099*"hath" + -0.096*"ladys" + -0.088*"saw" + '
  '0.087*"earth" + -0.080*"look" + -0.080*"eyes" + 0.079*"death" + '
  '-0.076*"child" + -0.076*"say" + -0.076*"tell" + 0.075*"ever"'),
 (2,
  '0.173*"dark" + 0.172*"eyes" + -0.161*"time" + -0.153*"seen" + '
  '0.126*"christabel" + -0.125*"man" + -0.115*"could" + 0.111*"lady" + '
  '0.096*"bright" + 0.094*"sleep" + -0.090*"lie" + 0.088*"sweet" + '
  '-0.086*"know" + 0.080*"geraldine" + 0.080*"fled" + -0.078*"long