There are several ways to create sentence vectors for the captions in the New Yorker Caption Contest. One is to use a pretrained Sentence-BERT model. The available models are listed at https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/.

In [None]:
# libraries for caption embeddings
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer

The first step toward our caption embeddings is to locate our data. We use a MySQL database called "new_york_cartoon" and connect to it via the mysql.connector library.

In [None]:
# connecting to SQL database
import mysql.connector
from mysql.connector import Error
pd.set_option('display.max_colwidth', None)

try:
    connection = mysql.connector.connect(host='dbnewyorkcartoon.cgyqzvdc98df.us-east-2.rds.amazonaws.com',
                                         database='new_york_cartoon',
                                         user='dbuser',
                                         password='Sql123456')
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You are connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)
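
The connection cell above keeps the host, user, and password inline. As a minimal sketch, assuming the credentials are instead exported as environment variables, the keyword arguments for `mysql.connector.connect` could be assembled as below. The helper name `load_db_config` and the `CARTOON_DB_*` variable names are hypothetical and not part of the original notebook.

```python
import os


def load_db_config(env=os.environ):
    """Build keyword arguments for mysql.connector.connect() from environment variables.

    The CARTOON_DB_* names are hypothetical; use whatever names your setup defines.
    """
    return {
        "host": env["CARTOON_DB_HOST"],
        # Fall back to the database name used in this notebook.
        "database": env.get("CARTOON_DB_NAME", "new_york_cartoon"),
        "user": env["CARTOON_DB_USER"],
        "password": env["CARTOON_DB_PASSWORD"],
    }


# Hypothetical usage: connection = mysql.connector.connect(**load_db_config())
```

A missing variable raises a `KeyError` right away, which is usually easier to diagnose than a failed login.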

Now that we are connected to the database, we need to pull down our data. We want every caption in the database, so we query the "result" table, which holds them. To make the data easier to read, we load it into a pandas dataframe with two columns, "caption" and "ranking".

In [None]:
# pulling down data from SQL database via a select query
sql_select_Query = "select caption,ranking from result;"  # change the query on this line to select different data
cursor.execute(sql_select_Query)

# show attributes names of target data
num_attr = len(cursor.description)
attr_names = [i[0] for i in cursor.description]
print(attr_names)

# get all records
records = cursor.fetchall()
print("Total number of rows in table: ", cursor.rowcount)
df = pd.DataFrame(records, columns=attr_names)
df

Now that the data is in a pandas dataframe, we can drop the ranking column: it holds only numeric rankings, which are not needed for computing caption embeddings.

In [None]:
# remove unnecessary columns; columns=[...] already targets columns, so axis=1 is not needed
df = df.drop(columns=['ranking'])

df.head()

The model that we are going to use takes the text data as a list of strings.

In [None]:
sentences = df['caption'].tolist()
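
The `re` module is imported at the top but is not used in the cells shown here. If the captions need light cleanup before encoding, a minimal sketch might look like the following. The helper name `clean_captions` and the cleanup rules (collapse whitespace, drop empty or missing captions) are assumptions, not steps from the original notebook.

```python
import re


def clean_captions(captions):
    """Collapse runs of whitespace, strip the ends, and drop empty or missing captions."""
    cleaned = []
    for caption in captions:
        if caption is None:
            continue  # skip NULL values coming back from the database
        text = re.sub(r"\s+", " ", str(caption)).strip()
        if text:
            cleaned.append(text)
    return cleaned


# Toy input (hypothetical values) to show the effect of the rules.
example = clean_captions(["  Hello,\n world! ", None, "   ", "Fine as is."])
```

Because this drops entries, the cleaned list may be shorter than the dataframe, so the embeddings would line up with the cleaned list rather than with the rows of `df`.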

The model I chose is "all-MiniLM-L12-v2". It does not give the best accuracy among the sentence embedding models, but it runs faster, which saves me a lot of time. I am willing to trade a bit of accuracy for faster processing.

In [None]:
model = SentenceTransformer('all-MiniLM-L12-v2')

In [None]:
caption_embeddings = model.encode(sentences)

If you want to see the raw embeddings in a numpy array, uncomment this code block.

In [None]:
# for sentence, embedding in zip(sentences, caption_embeddings):
    # print("Sentence:", sentence)
    # print("Embedding:", embedding)
    # print("")

I am saving the embeddings in a compressed numpy file for future use, such as storing them in our SQL database.

In [None]:
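# Create a dictionary to store the embeddings with sentences as keys
# (captions that appear more than once collapse into a single entry)
data_dict = {sentence: embedding for sentence, embedding in zip(sentences, caption_embeddings)}

# Save the dictionary to a compressed .npz file (np.savez alone does not compress)
np.savez_compressed('caption_embeddings.npz', **data_dict)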
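
Keying the archive by caption text has side effects: identical captions collapse into one entry, and a caption that happens to match the name of the function's first parameter (`file`) would clash with it. As one possible alternative, sketched here with toy data, the captions and vectors can be stored as two aligned arrays. The helper names `save_embeddings` and `load_embeddings`, the array names inside the archive, and the demo file name are all hypothetical.

```python
import os
import tempfile

import numpy as np


def save_embeddings(path, sentences, embeddings):
    """Store captions and vectors as two aligned arrays in one compressed .npz file."""
    np.savez_compressed(
        path,
        sentences=np.array(sentences, dtype=str),
        embeddings=np.asarray(embeddings),
    )


def load_embeddings(path):
    """Return (list of captions, 2-D array of vectors) in their original order."""
    with np.load(path) as data:
        return data["sentences"].tolist(), data["embeddings"]


# Toy stand-ins for `sentences` and `caption_embeddings` (hypothetical values).
toy_sentences = ["file", "Same joke.", "Same joke."]
toy_embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

demo_path = os.path.join(tempfile.mkdtemp(), "caption_embeddings_demo.npz")
save_embeddings(demo_path, toy_sentences, toy_embeddings)
loaded_sentences, loaded_embeddings = load_embeddings(demo_path)
```

Row `i` of the embeddings array belongs to caption `i`, so duplicates and unusual caption text are preserved as they are.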