INTRODUCTION

The goal of this analysis is to show an alternative way to create caption embeddings (vectors) without relying on a pretrained Sentence-BERT model. I will use the terms embeddings and vectors interchangeably. By computing embeddings for captions across multiple contests, we can later use them to investigate what makes a caption funny. To create the vectors, I will use an algorithm called Doc2Vec, which learns one vector for each individual caption. The data comes from our SQL database, which was populated from https://nextml.github.io/caption-contest-data/. In this notebook I use data from only 5 contests, since this version is for demonstration purposes.

The first step of this analysis is to load the relevant libraries and pull the data from the SQL database, which the next two blocks do. I am using the gensim package for preprocessing and for training the Doc2Vec model. gensim is an open source Python library for representing documents as semantic vectors, and it is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms. I chose it because it is efficient and easy to use and understand. Next, I request a connection to the SQL database using a Python package called mysql.connector, which gives Python programs access to MySQL databases. The database I am pulling information from is called new_york_cartoon.

In [None]:
# libraries for Doc2Vec model
import pandas as pd
import numpy as np
import gensim
from gensim import models
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import re
from collections import defaultdict 
from numpy import dot
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [None]:
# connecting to SQL database
import mysql.connector
from mysql.connector import Error
pd.set_option('display.max_colwidth', None)

try:
    connection = mysql.connector.connect(host='dbnewyorkcartoon.cgyqzvdc98df.us-east-2.rds.amazonaws.com',
                                         database='new_york_cartoon',
                                         user='dbuser',
                                         password='Sql123456')
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("Successfully connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

To get our data from SQL, we have to specify which contest numbers we want captions from. In this case, we want the last 5 contests (numbers 859 through 863). We do this with a SELECT query on the result table that filters contest_num with an IN clause, and then load the returned rows into a Pandas dataframe for ease of use.

In [None]:
# pulling down data from SQL database via search
sql_select_Query = "select caption,ranking from result where contest_num in (863, 862, 861, 860, 859);"  # you can change query in this line for selecting your target data
cursor.execute(sql_select_Query)

# show attributes names of target data
num_attr = len(cursor.description)
attr_names = [i[0] for i in cursor.description]
print(attr_names)

# get all records
records = cursor.fetchall()
print("Total number of rows in table: ", cursor.rowcount)
df = pd.DataFrame(records, columns=attr_names)
df
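
If the list of contests changes often, it may be easier to build the IN clause from a Python list and pass the numbers as query parameters instead of editing the SQL string by hand. The block below is only a sketch of that idea: it uses an in-memory sqlite3 table with made-up rows as a stand-in for the real result table, and fetch_captions is a hypothetical helper, not part of the notebook. sqlite3 uses ? as its placeholder, while mysql.connector uses %s, so the placeholder is left as an argument.

```python
import sqlite3

# In-memory stand-in for the `result` table; these rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (contest_num INTEGER, caption TEXT, ranking INTEGER)")
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?)",
    [
        (863, "first sample caption", 1),
        (862, "second sample caption", 2),
        (700, "caption from another contest", 1),
    ],
)

def fetch_captions(conn, contest_nums, placeholder="?"):
    """Hypothetical helper: return (caption, ranking) rows for the given contests."""
    # one placeholder per contest number, e.g. "?, ?, ?"
    marks = ", ".join([placeholder] * len(contest_nums))
    query = (
        "SELECT caption, ranking FROM result "
        f"WHERE contest_num IN ({marks}) ORDER BY contest_num DESC"
    )
    return conn.execute(query, list(contest_nums)).fetchall()

rows = fetch_captions(conn, [863, 862, 861, 860, 859])
print(rows)
```

With the real database, the same query string (built with placeholder="%s") would be passed to cursor.execute together with the list of contest numbers.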

The second step of this analysis is to get rid of any columns that are not useful to us for text extraction. As we can see in our dataframe (df), we have two columns called caption and ranking; the former contains the text that we want to analyze. The second column, ranking, is numeric and is not used when building the embeddings, since Doc2Vec only looks at the caption text. Therefore, I will drop the "ranking" column.

In [None]:
# remove unnecessary columns; columns=[...] already targets columns, so axis is not needed
df = df.drop(columns=['ranking'])

df.head()

Next, we have to preprocess our text. Preprocessing before any form of analysis is very important because it removes noise, such as punctuation that carries no meaning. It also lets us normalize the words by lowercasing them; otherwise "Dog" and "dog" would be treated as different words, which would change our embeddings. Pandas reports the caption column with the object dtype, which can hold non-string values, so I convert every value to a string before preprocessing. Here, I am creating a new column called "caption_processed" so that we can see how the text changes once preprocessing is done. I use the re library to delete the punctuation characters listed in the brackets and to replace hyphens with a blank space, and I lowercase all words with the lower function.

In [None]:
# Remove punctuation, lowercase, and store the result in a new column "caption_processed"
df['caption'] = df['caption'].astype(str)
df['caption_processed'] = df['caption'].map(lambda x: re.sub(r'[,\.\!\?\'\"]', '', x).lower())
# replace hyphens with a space (the text is already lowercase at this point)
df['caption_processed'] = df['caption_processed'].map(lambda x: re.sub(r'-', ' ', x))

# Print out the first rows of captions
df.head()
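
To make the effect of these two substitutions concrete, here is a small sketch that applies the same steps to a single string. The sample caption is made up, and clean_caption is a hypothetical helper name used only for this illustration.

```python
import re

def clean_caption(text):
    """Delete common punctuation, turn hyphens into spaces, and lowercase."""
    text = re.sub(r"[,.!?'\"]", "", text)  # same characters as in the notebook's pattern
    text = text.replace("-", " ")           # hyphenated words become two words
    return text.lower()

sample = "Well-done, Bob! Isn't it?"  # made-up caption
cleaned = clean_caption(sample)
print(cleaned)
```

Note that deleting the apostrophe turns "Isn't" into "isnt", which is worth keeping in mind when stopwords are removed later.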

In the next few code blocks, I preprocess the text further. First, I turn the values in the caption_processed column into a list for tokenization. I then loop over that list and pass each caption through gensim's simple_preprocess function, which tokenizes the text; with deacc=True it also strips accent marks. The result is a list of token lists, one per caption.

In [None]:
# tokenizing and clean up text
data = df.caption_processed.values.tolist()

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  

data_words = list(sent_to_words(data))
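
For readers who want a feel for what the tokenizer does without installing gensim, the block below is a rough standard-library approximation. It is not gensim's implementation: approx_simple_preprocess is a hypothetical name, and details such as how underscores are handled may differ. It lowercases the text, strips accents, keeps runs of letters, and drops tokens shorter than 2 or longer than 15 characters, which I assume here to match simple_preprocess's default length limits.

```python
import re
import unicodedata

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for simple_preprocess(text, deacc=True); an approximation only."""
    # strip accents: decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # keep runs of letters only (no digits, punctuation, or underscores)
    tokens = re.findall(r"[^\W\d_]+", no_accents)
    # drop very short and very long tokens
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

tokens = approx_simple_preprocess("I asked for a café table for 2")
print(tokens)
```

Single-letter words such as "I" and "a" disappear at this stage because of the minimum token length.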

Next, I created two models that capture bigrams and trigrams. Bigrams are pairs of words that frequently appear next to each other, and trigrams are sequences of three such words. There might be some of these phrases in our data, and I want to capture them so I don't miss any patterns. I set min_count to 5 and threshold to 100: min_count ignores words and pairs seen fewer than 5 times, and a high threshold means that only strongly associated word pairs are merged, so ordinary word sequences do not become bigrams/trigrams by accident. Note that the trigram model is built here, but only the bigram model is applied later in this notebook.

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold = fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
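
To give some intuition for the threshold, the sketch below computes a collocation score of the form (count_ab - min_count) / (count_a * count_b) * vocab_size, which is, as I understand it, the form of gensim's default scorer. The counts and the vocabulary size are made-up numbers, and phrase_score is a hypothetical helper, so treat this as an illustration of the idea and not as gensim's exact code.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Collocation score: high when the pair occurs far more often than its parts suggest."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Two rare words that almost always appear together (made-up counts).
score_tight = phrase_score(count_a=20, count_b=20, count_ab=15, vocab_size=5000, min_count=5)

# Two frequent words that appear together only occasionally (made-up counts).
score_loose = phrase_score(count_a=200, count_b=300, count_ab=15, vocab_size=5000, min_count=5)

threshold = 100
print(score_tight > threshold, score_loose > threshold)
```

Under these assumed counts only the first pair would be merged into a bigram; raising the threshold makes the test stricter, which is why a higher threshold produces fewer phrases.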

Stopwords are a set of very common words in a language, in this case English. Removing stopwords is an important step in text processing because it removes noise from the data and leaves the words that carry more of the meaning. A common source of stopwords is NLTK's list, but I have opted to use spaCy's list instead. spaCy's list of stopwords is larger, so it potentially removes more noise from the data and gives a cleaner look at the most important words. I load the stopwords from spaCy's English model since our text is in English.

In [None]:
# loading stopwords from Spacy
en = spacy.load('en_core_web_sm')
stop_words = en.Defaults.stop_words

I created functions for removing stopwords and for forming bigram and trigram phrases. I then applied the stopword and bigram functions to my list of tokenized captions.

In [None]:
# Define functions for stopwords, bigrams, trigrams
def remove_stopwords(texts):
    # each doc is already a list of tokens, so filter it directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
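
As a quick illustration of what stopword removal does to tokenized captions, here is a minimal sketch. The stop list is a tiny hand-picked set for demonstration only (spaCy's English list is far larger), and the token lists and the drop_stop_words name are made up.

```python
# Tiny hand-picked stop list for illustration only; not spaCy's actual list.
demo_stop_words = {"i", "the", "a", "to", "is", "it", "for", "my"}

def drop_stop_words(token_lists, stop_words):
    """Keep only the tokens that are not in the stop list, document by document."""
    return [[tok for tok in tokens if tok not in stop_words] for tokens in token_lists]

demo_docs = [
    ["i", "asked", "for", "the", "check"],
    ["it", "is", "my", "lawyer"],
]
filtered = drop_stop_words(demo_docs, demo_stop_words)
print(filtered)
```

Short captions can lose most of their words this way, so some documents may end up with very few tokens, or none at all.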

In [None]:
documents = data_words_bigrams

The final step before creating the Doc2Vec model is to give each individual caption a unique id. This way the model treats each caption as a separate document and learns an embedding for each one. I did this through the tagged_document function, which wraps each caption in a TaggedDocument whose tag, or "id", is the caption's position in the list.

In [None]:
# creating document ids
def tagged_document(documents):
    for i, words in enumerate(documents):
        yield gensim.models.doc2vec.TaggedDocument(words, [i])

tagged_documents = list(tagged_document(documents))

# Print the first TaggedDocument
print(tagged_documents[0])

This block prints the tokenized words of every tagged document, which produces a lot of output. If you wish to see what the documents look like, please uncomment this code block.

In [None]:
# for i, tagged_doc in enumerate(tagged_documents):
    # words = tagged_doc.words
    # print(f"Words in Document {i}: {words}")

We can now create our Doc2Vec model. Some parameters to take note of are window, which is the maximum distance between the target word and the words around it that are used as context, and vector_size, which is the length of our embeddings. Another parameter to understand is dm: setting dm=1 makes the model use Distributed Memory (DM), which learns each caption's embedding together with the context words inside the caption. The min_count parameter drops words that appear fewer than 5 times in the corpus. I set epochs to 15 so that the model makes 15 passes over the captions during training.

In [None]:
# Doc2Vec model
model = gensim.models.doc2vec.Doc2Vec(vector_size = 200, 
                                      window = 10,
                                      min_count = 5, 
                                      dm = 1,
                                      dbow_words = 0,
                                      epochs = 15,
                                      workers = 6)

Afterwards, we pass our tagged documents to the model to build its vocabulary. We then train the model on the same tagged documents for the number of epochs set above.

In [None]:
# building the model vocabulary
model.build_vocab(tagged_documents)

In [None]:
# training the model
model.train(tagged_documents, total_examples=model.corpus_count, epochs=model.epochs)

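Finally, once we have our vectors, we can save them to a NumPy file for future use. Because the token lists are stored as Python objects, the file has to be loaded later with np.load(..., allow_pickle=True).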

In [None]:
# saving the caption ids, words, and vectors to a NumPy .npz file
def get_document_vectors_with_ids(model, tagged_docs):
    document_vectors = []
    for i, tagged_doc in enumerate(tagged_docs):
        # look up each vector by the document's own tag instead of iterating over model.dv
        vector = model.dv[tagged_doc.tags[0]]
        document_vectors.append((f"Document {i + 1}", tagged_doc.words, vector))
    return document_vectors

document_vectors_with_ids = get_document_vectors_with_ids(model, tagged_documents)

dtype = [('doc_id', 'U20'), ('words', object), ('doc_vector', np.float32, (model.vector_size,))]
data = np.array(document_vectors_with_ids, dtype=dtype)

# Save the data as an NPZ file
np.savez("caption_vectors.npz", data=data)

print("Document IDs, vectors, and words saved to 'caption_vectors.npz'.")
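
To show how the saved file might be used later, the sketch below writes a tiny array with the same three fields to a temporary file, loads it back, and compares two vectors with cosine similarity. The 3-dimensional vectors and words are made up, and cosine_similarity is a hypothetical helper; the real file would hold vectors of length model.vector_size.

```python
import os
import tempfile

import numpy as np

# Same field layout as the notebook's array, but with made-up 3-dimensional vectors.
dtype = [("doc_id", "U20"), ("words", object), ("doc_vector", np.float32, (3,))]
toy = np.zeros(2, dtype=dtype)
toy["doc_id"] = ["Document 1", "Document 2"]
toy["doc_vector"] = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
toy["words"][0] = ["lawyer", "check"]
toy["words"][1] = ["lawyer", "bill"]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectors.npz")
    np.savez(path, data=toy)
    # the "words" field holds Python objects, so loading needs allow_pickle=True
    with np.load(path, allow_pickle=True) as archive:
        loaded = archive["data"]

similarity = cosine_similarity(loaded["doc_vector"][0], loaded["doc_vector"][1])
print(loaded["doc_id"][0], loaded["doc_id"][1], similarity)
```

Only load pickled files that come from a source you trust, since unpickling can run arbitrary code.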