INTRODUCTION

The goal of this analysis is to demonstrate topic modeling in Python on the New Yorker Caption Contest database. By performing topic modeling on multiple contests, we can see which topics are common across them and, in the future, perhaps use this information to infer what makes a caption funny. To perform topic modeling, I will use an algorithm called Latent Dirichlet Allocation (LDA) to extract topic vectors and visualize common topics. The data was collected from https://nextml.github.io/caption-contest-data/ and is stored in a SQL database.

I am using the gensim package for preprocessing and topic modeling. Gensim is an open source Python library for representing documents as semantic vectors, as efficiently and painlessly as possible. It is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms. It is designed with speed and memory efficiency in mind, and it is easy to use and understand.

The first step of this analysis is to pull our data from the SQL database, which the code blocks below do. I request a connection to the database using a Python package called mysql.connector, which gives Python programs access to MySQL databases. The database I am pulling from is called new_york_cartoon.

In [None]:
# libraries for topic modeling
import pandas as pd
import numpy as np
import gensim
from gensim import models
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import re
from collections import defaultdict 
from numpy import dot
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
from pyLDAvis import save_html
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [None]:
# connecting to SQL database
import mysql.connector
from mysql.connector import Error
pd.set_option('display.max_colwidth', None)

try:
    connection = mysql.connector.connect(host='dbnewyorkcartoon.cgyqzvdc98df.us-east-2.rds.amazonaws.com',
                                         database='new_york_cartoon',
                                         user='dbuser',
                                         password='Sql123456')
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You are connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

To get our data from SQL, we have to specify which contests we want captions from. In this case, we want data from all contests. We can do this with a SQL SELECT query on the result table, which returns the captions and their rankings from every contest, and then load the rows into a Pandas dataframe for ease of use.

In [None]:
# pulling down data from SQL database via search
sql_select_Query = "select caption,ranking from result;"
cursor.execute(sql_select_Query)

# show attributes names of target data
num_attr = len(cursor.description)
attr_names = [i[0] for i in cursor.description]
print(attr_names)

# get all records
records = cursor.fetchall()
print("Total number of rows in table: ", cursor.rowcount)
df = pd.DataFrame(records, columns=attr_names)
df
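
The query pattern above can be tried without access to the MySQL server. The following is a minimal sketch that swaps in Python's built-in sqlite3 module and an in-memory table. Only the result table name and the caption and ranking columns come from the query above; the rows are made-up placeholders, not contest data.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: sqlite3 instead of MySQL)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE result (caption TEXT, ranking INTEGER)")

# Placeholder rows for illustration only
cur.executemany(
    "INSERT INTO result (caption, ranking) VALUES (?, ?)",
    [("First placeholder caption.", 1), ("Second placeholder caption!", 2)],
)

# Same SELECT as in the cell above
cur.execute("SELECT caption, ranking FROM result")

# cursor.description holds one tuple per column; its first element is the column name
attr_names = [col[0] for col in cur.description]
records = cur.fetchall()
conn.close()

print(attr_names)
print(records)
```

The column names read from cursor.description are what the notebook passes to pd.DataFrame as the columns argument.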

Next, we have to preprocess our text. Preprocessing text before any form of analysis is very important because it removes noise, such as punctuation that carries no meaning. It also lets us homogenize the words by lowercasing them; otherwise the same word written in different cases would be treated as different tokens, which can change our results. Pandas stores the caption column with the generic object dtype, so I explicitly convert the values to strings before preprocessing. I create a new column called "caption_processed" so that the original and processed text can be compared side by side. I use the re library to remove the punctuation listed in the brackets and to replace hyphens with a space, and I lowercase all words using the lower function.

In [None]:
# Remove punctuation, lowercase, and store the result in a new column "caption_processed"
df['caption'] = df['caption'].astype(str)
df['caption_processed'] = df['caption'].map(lambda x: re.sub(r'[,\.\!\?\"\']', '', x).lower())
# Replace hyphens with a space (the text is already lowercase at this point)
df['caption_processed'] = df['caption_processed'].map(lambda x: re.sub(r'-', ' ', x))

# Print out the first rows of captions
df.head()
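
To see what the two substitutions do to a single caption, here is a small standalone sketch. The helper name clean_caption and the sample sentence are made up for illustration; the patterns follow the cell above.

```python
import re

def clean_caption(text):
    """Strip common punctuation, turn hyphens into spaces, and lowercase."""
    no_punct = re.sub(r"[,.!?\"']", "", text)
    no_hyphen = re.sub(r"-", " ", no_punct)
    return no_hyphen.lower()

# Made-up sample caption
sample = "Well, it's a long-term plan!"
print(clean_caption(sample))  # well its a long term plan
```

Note that the apostrophe is removed rather than replaced, so "it's" becomes "its".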

In the next few code blocks, I preprocess the text further. First, I turn the values in the caption_processed column into a list for tokenization. I use the simple_preprocess function from gensim, which tokenizes text, and apply it to each caption in a for loop. The result is a list of tokenized captions.

In [None]:
# tokenizing and cleaning up text
data = df.caption_processed.values.tolist()

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  

data_words = list(sent_to_words(data))
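
For intuition, the sketch below is a rough standard-library approximation of the tokenization step, not gensim's actual implementation. It assumes that simple_preprocess lowercases the text, keeps runs of non-digit word characters, and drops tokens shorter than 2 or longer than 15 characters; it does not reproduce the accent stripping enabled by deacc=True. The name rough_tokenize is hypothetical.

```python
import re

# Runs of word characters that contain no digits (assumed to mirror gensim's alphabetic tokenizer)
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or very long tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

# Made-up sample caption
print(rough_tokenize("i told you 2 dogs were a bad idea"))
```

One-letter words such as "i" and "a" and the digit "2" are dropped, which is worth keeping in mind when reading the tokenized captions.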

Next, I created two objects that capture bigrams and trigrams. Bigrams are pairs of words that appear consecutively, and trigrams are sequences of three words that appear consecutively. There might be some bigrams and trigrams in our data, and I want to capture them so I don't miss any patterns. I set min_count to 5, so word pairs seen fewer than five times are ignored, and threshold to 100, a fairly strict score cutoff, so that only pairs that co-occur much more often than chance are merged into phrases and ordinary neighboring words are not merged by accident.

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold = fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
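
To make the role of min_count and threshold more concrete, the sketch below computes a collocation score in the style of Mikolov et al. (2013), which, as far as I understand, is the form of gensim's default scorer. The toy captions are made up, and using the number of distinct unigrams as the vocabulary size is a simplifying assumption.

```python
from collections import Counter

def phrase_score(worda_count, wordb_count, bigram_count, vocab_size, min_count):
    """Collocation score: higher means the pair co-occurs more than its word counts suggest."""
    return (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size

# Toy tokenized captions (made up for illustration)
docs = [["new", "york", "is", "big"]] * 6 + [["new", "car"], ["york", "pudding"]]

unigrams = Counter(tok for doc in docs for tok in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

score = phrase_score(
    worda_count=unigrams["new"],
    wordb_count=unigrams["york"],
    bigram_count=bigrams[("new", "york")],
    vocab_size=len(unigrams),
    min_count=5,
)
print(round(score, 3))
```

A pair is merged into a phrase only when its score exceeds threshold, so on a tiny corpus like this one a threshold of 100 would merge nothing. Subtracting min_count in the numerator also means that pairs seen only a handful of times score at or below zero.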

Stopwords are a set of commonly used words in a language, in this case English. Removing stopwords is very important in text processing because it removes noise from the data and leaves the words that carry more semantic meaning. A common source of stopwords is NLTK's list, but I have opted to use spaCy's list instead. spaCy's list of stopwords is larger, thus potentially removing more noise from the data and giving a cleaner look at the most important words. I load the stopwords from spaCy's English model "en_core_web_sm", a small English pipeline trained on written web text that includes vocabulary, syntax and entities. I am using the small model because it loads and runs faster.

In [None]:
# loading stopwords from Spacy
en = spacy.load('en_core_web_sm')
stop_words = en.Defaults.stop_words

I created functions for removing stopwords, creating bigram and trigram phrases, and lemmatizing words, and then applied them to my list of tokenized captions. Lemmatization is a preprocessing method that reduces words to their root form. Lemmatizing before building the model merges inflected forms of the same word, which shrinks the vocabulary and makes later processing faster. The function called "lemmatization" keeps nouns, adjectives, verbs, adverbs, and proper nouns, as identified by spaCy's part-of-speech tagger. In this function, I lemmatize each word whose part of speech is in "allowed_postags" and append the results to the list "texts_out", which ends up containing the lemmatized words for every caption.

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PROPN']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv, propn
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PROPN'])
print(data_lemmatized[0])

Before building our LDA model, we have to create a dictionary called id2word, which maps each unique word to an integer id. With it, each caption is converted into a bag of words: a list of (word id, count) pairs. This is a useful tool for seeing which words are most frequent throughout the text and whether any patterns of words appear.

In [None]:
# corpus
id2word = corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[0])
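
The output of doc2bow can be hard to read at first, so here is a plain-Python sketch of the same idea. It is not gensim's implementation: the helper names are hypothetical, the toy captions are made up, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in texts:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy lemmatized captions (made up for illustration)
toy_texts = [["dog", "bark", "dog"], ["cat", "bark"]]
token2id = build_vocab(toy_texts)
toy_corpus = [to_bow(doc, token2id) for doc in toy_texts]

print(token2id)    # {'dog': 0, 'bark': 1, 'cat': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each pair reads as "word id, number of times it appears in this caption", so (0, 2) means "dog" occurs twice in the first caption.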

Now that the text has been preprocessed and a corpus has been made, I can build an LDA model. The model has some necessary parameters, which include the id2word dictionary and the corpus. You may notice that num_topics, the number of topics in the model, is set to 20. This number is arbitrary: the model returns however many topics are requested, whether or not that number suits the data. This problem is addressed in the next steps. Other parameters include random_state, a seed for reproducibility; chunksize, the number of captions in each training batch; and passes, the number of times the model passes over the whole corpus during training. I am using the multicore version of Gensim's LDA model because it trains in parallel, which speeds up computation.

In [None]:
# creating lda model
lda_model = models.ldamulticore.LdaMulticore(corpus=corpus,
                                    id2word=id2word,
                                    num_topics=20,
                                    random_state=100,
                                    chunksize=100,
                                    passes=10,
                                    workers=6, 
                                    per_word_topics=True)

How do you know if a topic model is good? I can assess my topic model through its perplexity and coherence score. Perplexity measures how well the model predicts the words in the captions, and the coherence score measures how clear and interpretable the topics are. I want a low perplexity and a high coherence score, which together indicate a good model. To adjust these scores, I can tune the chunksize and passes parameters.

In [None]:
# Compute Perplexity
# log_perplexity returns a per-word likelihood bound; perplexity = 2^(-bound), so a bound closer to zero is better
print('\nLog perplexity (per-word bound): ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v') # a measure of how interpretable the topics are. higher is better
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
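
One detail is easy to miss when reading the first number. According to gensim's documentation, log_perplexity returns a per-word likelihood bound rather than the perplexity itself, and the perplexity is 2 raised to the negative of that bound. The sketch below shows the conversion; the bound values are hypothetical and not taken from this model.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)

# Hypothetical bounds for two models; the one closer to zero has the lower perplexity
for bound in (-8.0, -7.0):
    print(bound, perplexity_from_bound(bound))
```

So when two models are compared on the value printed above, the less negative number corresponds to the lower, better perplexity.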

I have a problem when creating a model: I do not know the number of topics that is most meaningful and interpretable. To address this, I created a function that builds multiple models, each with a different number of topics, starting at 2 and stopping before 50. For each model, I also measured its coherence score to see which model is the best one to use.

In [None]:
import random
random.seed(50)

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = models.ldamulticore.LdaMulticore(corpus=corpus,
                                    id2word=dictionary,
                                    num_topics=num_topics,
                                    random_state=100,
                                    chunksize=100,
                                    passes=10,
                                    workers=6, 
                                    per_word_topics=True)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    # Find the model with the highest coherence value
    optimal_model_index = coherence_values.index(max(coherence_values))
    optimal_model = model_list[optimal_model_index]

    return model_list, coherence_values, optimal_model

model_list, coherence_values, optimal_model = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=50, step=6)

for num_topics, cv in zip(range(2, 50, 6), coherence_values):
    print("Num Topics =", num_topics, " has Coherence Value of", round(cv, 3))

print("Optimal Model:", optimal_model)
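
It helps to check exactly which topic counts the search visits and how the winner is picked. The sketch below reproduces that logic in isolation; the helper names are hypothetical and the coherence scores are made up, so the result says nothing about the real models.

```python
def topic_grid(start, limit, step):
    """Topic counts that range(start, limit, step) will try (limit itself is excluded)."""
    return list(range(start, limit, step))

def best_num_topics(grid, scores):
    """Return the topic count whose coherence score is highest."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return grid[best_index]

grid = topic_grid(start=2, limit=50, step=6)
print(grid)  # [2, 8, 14, 20, 26, 32, 38, 44]

# Made-up coherence scores, one per entry in the grid
fake_scores = [0.31, 0.35, 0.42, 0.40, 0.38, 0.37, 0.36, 0.33]
print(best_num_topics(grid, fake_scores))  # 14
```

With start=2, limit=50 and step=6, eight models are trained and the largest one has 44 topics. A finer step around the best value could be tried afterwards if training time allows.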

I can also compare the models visually by plotting the coherence score against the number of topics.

In [None]:
limit = 50
start = 2
step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # pass a list so the label is not split into characters
plt.show()

Now that I have found the best number of topics for my model, I can visualize the model using the pyLDAvis library, which is built for visualizing LDA models. The "R=20" parameter sets how many terms are shown for each topic bubble. In this case, I want to show the top 20 words for each topic. I exported the visualization as an HTML file to share with other people on my team.

In [None]:
# Generate the visualization
lda_display = gensimvis.prepare(optimal_model, corpus, id2word, mds="mmds", R=20, sort_topics=False)

# Output filename for the HTML export
filename = "lda_vis.html"

# Save the HTML visualization
pyLDAvis.save_html(lda_display, filename)

With this model, I can find a lot of interesting information about the topics in the captions. One practical application of topic modeling is determining what topic a given document is about. To find that, I look for the topic that has the highest percentage contribution in each caption.

In [None]:
# Finding the dominant topic in each individual caption
def format_topics_sentences(ldamodel, corpus, texts):
    data = []

    for i, doc in enumerate(corpus):
        topics = ldamodel.get_document_topics(doc)
        topics = sorted(topics, key=lambda x: x[1], reverse=True)
        
        # Default values in case no topic passes the minimum-probability cutoff
        dominant_topic = -1
        max_topic_contribution = 0.0

        # After sorting, the first entry is the dominant topic
        if topics:
            dominant_topic, max_topic_contribution = topics[0]

        # Get the keywords for the dominant topic
        wp = ldamodel.show_topic(dominant_topic)
        topic_keywords = ", ".join([word for word, prop in wp])

        # Append the data as a list
        data.append([int(dominant_topic), round(max_topic_contribution, 4), topic_keywords, texts[i]])

    sent_topics_df = pd.DataFrame(data, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords', 'Text'])

    return sent_topics_df

# Example usage:
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

df_dominant_topic.to_csv('Dominant_Topic_in_each_Caption.csv', index=False)

# Show
df_dominant_topic.head(10)
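
The core of the function above is picking the largest entry from a list of (topic, probability) pairs. Stripped of the dataframe bookkeeping, that step can be sketched as follows; the helper name and the example distribution are made up.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the largest probability, or None if empty."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])

# Made-up topic distribution for one caption
example = [(0, 0.05), (3, 0.72), (7, 0.23)]
print(dominant_topic(example))  # (3, 0.72)
```

Using max avoids sorting the whole list when only the top entry is needed, and returning None for an empty list makes the no-topic case explicit.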

Sometimes the topic keywords alone are not enough to make sense of what a topic is about. To help with understanding a topic, I can find the caption to which that topic contributes the most and infer the topic by reading that caption.

In [None]:
# Most representative caption of each topic
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

# For each topic, keep the caption with the highest percentage contribution
sent_topics_sorted = pd.concat(
    [grp.sort_values('Perc_Contribution', ascending=False).head(1) for _, grp in sent_topics_outdf_grpd],
    axis=0)

# Reset Index    
sent_topics_sorted.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorted.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

sent_topics_sorted.to_csv('Most_Representative_Caption_of_Each_Topic.csv', index=False)

# Show
sent_topics_sorted.head()

I also want to understand the volume and distribution of topics in order to judge how widely each one was discussed.

In [None]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

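# Topic Number and Keywords, indexed by topic number so the concat below aligns on topic
topic_num_keywords = sent_topics_sorted[['Topic_Num', 'Keywords']].set_index('Topic_Num', drop=False)

# Concatenate column-wise (rows are matched on topic number)
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)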

df_dominant_topics.reset_index(drop=True, inplace=True)

# Change Column names
df_dominant_topics.columns = ['Topic_Num', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

df_dominant_topics.to_csv('Topic_Distributions.csv', index=False)

# Show
df_dominant_topics.head()
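
The counting and percentage steps can also be checked by hand on a tiny example. The sketch below uses only the standard library and keeps each topic's count and share together in one record; the helper name and the list of dominant topics are made up.

```python
from collections import Counter

def topic_distribution(dominant_topics):
    """Map each topic id to (number of captions, share of all captions)."""
    counts = Counter(dominant_topics)
    total = sum(counts.values())
    return {topic: (n, round(n / total, 4)) for topic, n in sorted(counts.items())}

# Made-up dominant topics for eight captions
toy_dominant = [0, 2, 2, 1, 2, 0, 2, 1]
print(topic_distribution(toy_dominant))  # {0: (2, 0.25), 1: (2, 0.25), 2: (4, 0.5)}
```

Keying everything by topic id means a topic's count can never be paired with another topic's keywords, which is the main thing to watch for when joining these columns.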

I am most interested in finding the probability of each topic showing up in each caption. I can do so with the get_document_topics function, which returns the topic probabilities for a caption. I added the "ranking" column from our original dataframe for more information. Additionally, I'm interested in the top 20 words for each topic, which I get from the show_topics function. Lastly, I compiled all of this data into a CSV for others to use.

In [None]:
# finding topic probabilities in each caption
def corpus_to_lda_features(lda_model, corpus, num_words=20):
    topic_probabilities_list = []

    # Get the top words for each topic with probabilities
    topic_terms = lda_model.show_topics(num_topics=lda_model.num_topics, num_words=num_words, formatted=False)

    # Extracting top words and probabilities for each topic
    top_words_per_topic = {topic[0]: [(word[0], word[1]) for word in topic[1]] for topic in topic_terms}

    for doc in corpus:
        topic_probabilities = lda_model.get_document_topics(doc, minimum_probability=0)
        topic_probabilities = np.array(topic_probabilities)
        topic_probabilities_list.append(topic_probabilities[:, 1])

    # Create a pandas DataFrame with a column for each topic
    df = pd.DataFrame(topic_probabilities_list)

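    # Add columns for the top words of each topic and their probabilities.
    # Each column holds one word per row, so only the first num_words rows are filled; the rest are NaN.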
    for topic, top_words_probs in top_words_per_topic.items():
        top_words, word_probs = zip(*top_words_probs)
        df[f"Topic_{topic}_top_words"] = pd.Series(top_words)
        df[f"Topic_{topic}_word_probs"] = pd.Series(word_probs)

    return df
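
The function above relies on every caption returning a probability for every topic. As far as I can tell, gensim applies a very small probability floor even when minimum_probability=0, so a topic could in principle be missing from a caption's list. A defensive way to build fixed-length rows is sketched below; the helper name to_dense is hypothetical and the input is made up.

```python
def to_dense(topic_probs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        dense[topic_id] = float(prob)
    return dense

# Made-up sparse output for a model with five topics
print(to_dense([(1, 0.6), (4, 0.4)], num_topics=5))  # [0.0, 0.6, 0.0, 0.0, 0.4]
```

Placing each probability by its topic id, rather than by its position in the list, keeps every column of the resulting dataframe tied to the same topic.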

In [None]:
topic_probabilities_df = corpus_to_lda_features(optimal_model, corpus, num_words = 20)
topic_probabilities_df = topic_probabilities_df.assign(caption_text=df.caption.values)
topic_probabilities_df = topic_probabilities_df.assign(ranking=df.ranking)

column_name = 'caption_text'
first_column = topic_probabilities_df.pop(column_name)
second_column = topic_probabilities_df.pop('ranking')
topic_probabilities_df.insert(0, column_name, first_column)
topic_probabilities_df.insert(1, 'ranking', second_column)
topic_probabilities_df.head()

In [None]:
topic_probabilities_df.to_csv('document_topics.csv', index=False)