INTRODUCTION 

The goal of this analysis is to demonstrate a method for creating caption embeddings, or vectors; I use the two terms interchangeably. By computing embeddings for captions across multiple contests, we can use them to investigate what makes a caption funny. To create the vectors, I use a pretrained Sentence-BERT model, which generates one vector per caption. The pretrained models can be found here: https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/. A Sentence-BERT model is a BERT-style neural network that has been fine-tuned on a large corpus of sentence pairs so that sentences with similar meaning receive similar vectors. The data comes from our SQL database, which was populated from https://nextml.github.io/caption-contest-data/. This notebook uses data from only 5 contests because it is intended as a demonstration.

The first step of this analysis is to load the relevant libraries and connect to the SQL database, which the next two code blocks do. The second block requests a connection using mysql.connector, a Python package that gives Python programs access to MySQL databases. The database I am pulling data from is called new_york_cartoon.

In [None]:
# libraries for caption embeddings
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer

In [None]:
# connecting to SQL database
import mysql.connector
from mysql.connector import Error
pd.set_option('display.max_colwidth', None)

try:
    connection = mysql.connector.connect(host='dbnewyorkcartoon.cgyqzvdc98df.us-east-2.rds.amazonaws.com',
                                         database='new_york_cartoon',
                                         user='dbuser',
                                         password='Sql123456')
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)
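
The connection cell above writes the credentials directly into the notebook. As a minimal sketch of one alternative, the helper below reads them from environment variables instead. The function name `load_db_config` and the `NYC_DB_*` variable names are hypothetical and not part of the original project; only the keyword names (`host`, `database`, `user`, `password`) come from the `mysql.connector.connect` call used above.

```python
import os


def load_db_config(env=os.environ):
    """Build keyword arguments for mysql.connector.connect() from environment variables.

    The NYC_DB_* names are hypothetical; use whatever names suit your setup.
    A missing required variable raises KeyError, so misconfiguration fails early.
    """
    return {
        "host": env["NYC_DB_HOST"],
        # Fall back to the database name used in this notebook.
        "database": env.get("NYC_DB_NAME", "new_york_cartoon"),
        "user": env["NYC_DB_USER"],
        "password": env["NYC_DB_PASSWORD"],
    }


# Usage (the variables must be set before the notebook kernel starts):
# connection = mysql.connector.connect(**load_db_config())
```

This keeps the password out of the notebook file itself, at the cost of one extra setup step for anyone running it.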

To get our data from SQL, we specify the contest numbers we want captions from. In this case, we want data from the last 5 contests (numbers 859 through 863). A SELECT query on the result table with an IN filter retrieves those rows, which we then load into a pandas DataFrame for ease of use.

In [None]:
# pulling down data from SQL database via search
sql_select_Query = "select caption,ranking from result where contest_num in (863, 862, 861, 860, 859);"  # you can change query in this line for selecting your target data
cursor.execute(sql_select_Query)

# show attributes names of target data
num_attr = len(cursor.description)
attr_names = [i[0] for i in cursor.description]
print(attr_names)

# get all records
records = cursor.fetchall()
print("Total number of rows in table: ", cursor.rowcount)
df = pd.DataFrame(records, columns=attr_names)
df
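
If the list of contests changes often, editing the query string by hand becomes error-prone. The sketch below builds the same query from a Python list using `%s` placeholders, which is the parameter style mysql.connector accepts in `cursor.execute(query, params)`. The helper name `build_caption_query` is hypothetical; the table and column names are the ones used in the cell above.

```python
def build_caption_query(contest_nums):
    """Return (query, params) selecting captions for the given contest numbers."""
    if not contest_nums:
        raise ValueError("contest_nums must not be empty")
    # One %s placeholder per contest number; the driver fills in the values.
    placeholders = ", ".join(["%s"] * len(contest_nums))
    query = (
        "select caption, ranking from result "
        f"where contest_num in ({placeholders});"
    )
    return query, tuple(contest_nums)


query, params = build_caption_query([863, 862, 861, 860, 859])
print(query)

# With the open cursor from the connection cell:
# cursor.execute(query, params)
```

Passing the numbers as parameters rather than formatting them into the string also avoids quoting mistakes if the values ever come from user input.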

The second step of this analysis is to remove any columns that are not useful for text extraction. Our dataframe (df) has two columns, caption and ranking. The caption column contains the text we want to embed. The ranking column holds numeric rankings, which are not an input to the embedding model, so I will drop it.

In [None]:
# remove unnecessary columns (columns= already targets columns, so axis is not needed)
df = df.drop(columns=['ranking'])

df.head()

Unlike with other embedding models, we do not need to do any text preprocessing here, because a BERT model uses the entire sentence, including stop words and punctuation, to create embeddings. This is because it accounts for the context of each word in the sentence. We simply need to create a list of the caption texts.

In [None]:
sentences = df['caption'].tolist()

The model I chose is "all-MiniLM-L12-v2". It does not give the best accuracy among the pretrained sentence-embedding models, but it runs faster, which saves a lot of time. I am willing to trade a bit of accuracy for faster processing.

In [None]:
model = SentenceTransformer('all-MiniLM-L12-v2')

In [None]:
caption_embeddings = model.encode(sentences)
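
One common way to compare sentence embeddings is cosine similarity, which measures the angle between two vectors. The sketch below computes all pairwise similarities with NumPy. It uses small stand-in vectors so it runs without the model; in this notebook you would pass `caption_embeddings` instead. The helper name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity matrix for a set of row vectors."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Scale each row to unit length; the clip guards against division by zero.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are cosine similarities.
    return unit @ unit.T


# Stand-in vectors for illustration only, not real caption embeddings.
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
similarities = cosine_similarity_matrix(toy_embeddings)
print(similarities.round(3))
```

Entry `[i, j]` of the result is the similarity between caption `i` and caption `j`. Note that the full matrix grows quadratically with the number of captions, so for large contests it may be better to compare against a subset.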

If you want to see the raw embeddings as a NumPy array, uncomment this code block.

In [None]:
# for sentence, embedding in zip(sentences, caption_embeddings):
#     print("Sentence:", sentence)
#     print("Embedding:", embedding)
#     print("")

I save the embeddings to a NumPy .npz file for future use, such as storing them in our SQL database.

In [None]:
# Create a dictionary to store the embeddings with sentences as keys
data_dict = {sentence: embedding for sentence, embedding in zip(sentences, caption_embeddings)}

# Save the dictionary to a compressed numpy file
np.savez_compressed('caption_embeddings.npz', **data_dict)
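
A dictionary keyed by caption text keeps only one entry per distinct caption, so repeated captions collapse into one. As a sketch of an alternative layout, the block below stores the captions and the vectors as two aligned arrays and then loads them back. It uses small stand-in data and a temporary file so it runs on its own; the array names `captions` and `embeddings` and the demo file name are assumptions, not part of the original notebook.

```python
import os
import tempfile

import numpy as np

# Stand-in data for illustration; in the notebook use `sentences` and `caption_embeddings`.
toy_sentences = ["first caption", "second caption", "first caption"]
toy_embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

# Write to a temporary directory so the demo does not overwrite real output.
path = os.path.join(tempfile.mkdtemp(), "caption_embeddings_demo.npz")

# Store captions and vectors as two aligned arrays so duplicates are preserved.
np.savez_compressed(path, captions=np.array(toy_sentences), embeddings=toy_embeddings)

# Load the arrays back; row i of `embeddings` belongs to captions[i].
with np.load(path) as archive:
    loaded_captions = archive["captions"].tolist()
    loaded_embeddings = archive["embeddings"]

print(len(loaded_captions), loaded_embeddings.shape)
```

Keeping the vectors in a single 2-D array also makes it straightforward to iterate over rows when writing them to a database table later.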