This notebook explores using a pretrained BERT model (with its tokenizer) and a simple similarity-score search to sort each puzzle's 16 words into four groups of four. The approach is "zero-shot" in the sense that the BERT model receives no supervision on the puzzle data.

# Imports #

In [1]:
import numpy as np
import pandas as pd 
import torch
from transformers import AutoTokenizer, AutoModel
from torch.nn.functional import normalize
from itertools import combinations


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# Preprocessing #
Load Connections dataset

In [3]:
df_test = pd.read_csv("/kaggle/input/the-new-york-times-connections/Connections_Data.csv")
df_test.columns
df_test = df_test.drop(['Starting Row', 'Starting Column'], axis=1)  # Unnecessary info
df_test.head()

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
0,1,2023-06-12,SNOW,WET WEATHER,0
1,1,2023-06-12,LEVEL,PALINDROMES,3
2,1,2023-06-12,SHIFT,KEYBOARD KEYS,2
3,1,2023-06-12,KAYAK,PALINDROMES,3
4,1,2023-06-12,HEAT,NBA TEAMS,1


## Imputation ##

In [4]:
df_test.isna().sum()

Game ID        0
Puzzle Date    0
Word           2
Group Name     0
Group Level    0
dtype: int64

There seem to be two missing entries in the data. I checked the actual puzzles from those days to work out what the missing entries should be and entered them manually. It seems that "NA" from puzzle #62 was mistakenly read as NaN (pandas treats the string "NA" as a missing value by default).

In [5]:
df_test[df_test.Word.isna()]

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
930,59,2023-08-09,,UNSPECIFIED QUANTITIES,0
978,62,2023-08-12,,PERIODIC TABLE SYMBOLS,3


In [6]:
df_test[(df_test['Game ID'] == 59) & 
    (df_test['Group Name'] == 'UNSPECIFIED QUANTITIES')]

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
930,59,2023-08-09,,UNSPECIFIED QUANTITIES,0
932,59,2023-08-09,FEW,UNSPECIFIED QUANTITIES,0
937,59,2023-08-09,HANDFUL,UNSPECIFIED QUANTITIES,0
943,59,2023-08-09,SEVERAL,UNSPECIFIED QUANTITIES,0


In [7]:
df_test[(df_test['Game ID'] == 62) & 
    (df_test['Group Name'] == 'PERIODIC TABLE SYMBOLS')]

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
978,62,2023-08-12,,PERIODIC TABLE SYMBOLS,3
981,62,2023-08-12,NI,PERIODIC TABLE SYMBOLS,3
983,62,2023-08-12,HE,PERIODIC TABLE SYMBOLS,3
987,62,2023-08-12,FE,PERIODIC TABLE SYMBOLS,3


In [8]:
def load_data():
    df = pd.read_csv("/kaggle/input/the-new-york-times-connections/Connections_Data.csv")
    df = df[['Game ID', 'Word', 'Group Name']]
    df.at[930, "Word"] = "SOME"
    df.at[978, "Word"] = "NA"
    return df

In [9]:
df_test = load_data()
df_test.isna().sum()

Game ID       0
Word          0
Group Name    0
dtype: int64
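
The NaN in puzzle #62 can be reproduced on a tiny in-memory CSV. This is only a sketch: the CSV text below is made up for illustration and assumes the raw file stores the literal string `NA`. It compares the default `pd.read_csv` behaviour with `keep_default_na=False`, which turns off the built-in list of missing-value strings.

```python
import io
import pandas as pd

# Illustrative two-row CSV (not the real file); "NA" here is the sodium symbol
csv_text = (
    "Game ID,Word,Group Name\n"
    "62,NA,PERIODIC TABLE SYMBOLS\n"
    "62,FE,PERIODIC TABLE SYMBOLS\n"
)

# Default parsing: "NA" is on pandas' default missing-value list
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: no string is converted to NaN unless listed in na_values
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_missing_default = int(default_df["Word"].isna().sum())
n_missing_literal = int(literal_df["Word"].isna().sum())
print(n_missing_default, n_missing_literal)
```

Note that with `keep_default_na=False` a genuinely empty cell in a text column is read as an empty string rather than NaN, so a check like `isna()` would no longer flag it. Patching the two rows by index, as `load_data` does, avoids that trade-off.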

# Encoding #

In this section, the words are encoded using Google's BERT tokenizer and model (`bert-base-uncased`). Each word's embedding is the final hidden state of its `[CLS]` token. I also added GPU acceleration support.

In [10]:
tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')
model = AutoModel.from_pretrained('google-bert/bert-base-uncased')
model = model.to(device)

#Embeds a list of strings 
def embed_words(words, model):
    inputs = tokenizer(words, padding=True, truncation=True, return_tensors="pt")
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    #Don't compute gradients for embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]



In [11]:
words = df_test['Word'].tolist()
embeddings = embed_words(words, model)
df_test["Embedding"] = list(embeddings.cpu())
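
The cell above embeds every word in one forward pass. If that ever runs out of GPU memory, the words can be embedded in chunks instead. The helper below is a minimal sketch: `embed_in_batches`, `fake_embed`, and the `batch_size` default are hypothetical names and values, not part of the original notebook, and `fake_embed` only stands in for `embed_words` so the example runs without downloading BERT.

```python
def embed_in_batches(words, embed_fn, batch_size=256):
    """Call embed_fn on consecutive chunks of words and collect one row per word."""
    rows = []
    for start in range(0, len(words), batch_size):
        # embed_fn must return an iterable with one embedding per input word
        rows.extend(embed_fn(words[start:start + batch_size]))
    return rows


# Toy stand-in for embed_words: a 1-dimensional "embedding" holding the word length
def fake_embed(batch):
    return [[float(len(word))] for word in batch]


demo = embed_in_batches(["SNOW", "LEVEL", "SHIFT", "KAYAK", "HEAT"], fake_embed, batch_size=2)
print(demo)
```

With the notebook's own function, something like `embed_in_batches(words, lambda batch: embed_words(batch, model).cpu())` should give the same kind of list of per-word tensors that is stored in the `Embedding` column.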


In [12]:
df_test.head()

Unnamed: 0,Game ID,Word,Group Name,Embedding
0,1,SNOW,WET WEATHER,"[tensor(-0.2204), tensor(0.2819), tensor(-0.13..."
1,1,LEVEL,PALINDROMES,"[tensor(-0.2300), tensor(0.1055), tensor(0.123..."
2,1,SHIFT,KEYBOARD KEYS,"[tensor(-0.1224), tensor(0.0911), tensor(0.119..."
3,1,KAYAK,PALINDROMES,"[tensor(-0.5861), tensor(0.0035), tensor(-0.31..."
4,1,HEAT,NBA TEAMS,"[tensor(-0.4224), tensor(0.1301), tensor(-0.51..."


# Predictions #

I initially tried to use KMeans to select groups but realized there's no easy way to ensure groups of four: creating four clusters doesn't guarantee that each contains exactly four data points. <br/>
Instead, I created a function that computes the mean pairwise cosine similarity for every possible combination of four words and chooses the best one. The search then repeats on the remaining 12 and then 8 words, and the last 4 words form the final group.
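
To get a feel for the cost of this greedy search, the number of candidate groups per round can be counted directly. This is a small sketch using only the standard library; `pool_sizes` is a name introduced here for the number of words still available in each round.

```python
from math import comb

GROUP_SIZE = 4
pool_sizes = [16, 12, 8]  # words left at the start of each search round

# Number of 4-word combinations to score in each round
candidates_per_round = [comb(n, GROUP_SIZE) for n in pool_sizes]
total_candidates = sum(candidates_per_round)
print(candidates_per_round, total_candidates)  # [1820, 495, 70] 2385
```

A few thousand similarity computations per puzzle is cheap, so an exhaustive search within each round is practical here.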

In [13]:
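N_GROUPS = 4    # Number of Connections groups per puzzle
GROUP_SIZE = 4  # Number of words in each Connections group

# Takes an np.ndarray of shape (n_words, hidden_dim) as input and
# converts it to a torch tensor to compute the mean pairwise cosine similarity
def get_similarity(embeddings):
    embeddings = torch.tensor(embeddings)
    X = normalize(embeddings, dim=1)
    sims = X @ X.T
    # Mask out the diagonal (self-similarity scores)
    n = sims.size(0)
    mask = ~torch.eye(n, dtype=torch.bool)
    sims = sims[mask]
    return sims.mean()

# Sketch of the greedy search described above: pick the most similar
# group of four, remove it from the pool, and repeat on what is left.
# Returns N_GROUPS lists of row positions into df (one puzzle's 16 rows).
def predict(df):
    embeddings = np.stack(df["Embedding"].to_numpy())
    remaining = list(range(len(embeddings)))
    groups = []
    for _ in range(N_GROUPS - 1):
        idx_combos = combinations(remaining, GROUP_SIZE)
        best = max(idx_combos,
                   key=lambda combo: get_similarity(embeddings[list(combo)]).item())
        groups.append(list(best))
        remaining = [i for i in remaining if i not in best]
    groups.append(remaining)  # The last four words form the final group
    return groups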

In [14]:
#df_example = df_test[df_test['Game ID'] == 1]
get_similarity(np.stack(df_test[df_test['Game ID'] == 1]['Embedding'].to_numpy()))

tensor(0.8889)
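
A predicted grouping can be checked against the `Group Name` column by counting how many predicted groups contain words from a single true group. The function below is one possible scoring sketch, not a metric used in the original notebook: `score_groups` is a hypothetical name, and the toy data uses pairs of words from puzzle 1 (rather than full groups of four) only to keep the example short.

```python
def score_groups(predicted, truth):
    """Count predicted groups whose words all share one true group name.

    predicted: list of lists of words
    truth: dict mapping each word to its true group name
    """
    return sum(1 for group in predicted if len({truth[w] for w in group}) == 1)


# Words and group names taken from the first rows of puzzle 1 shown earlier
truth = {
    "SNOW": "WET WEATHER",
    "LEVEL": "PALINDROMES",
    "KAYAK": "PALINDROMES",
    "HEAT": "NBA TEAMS",
}

# Toy prediction: the first pair is correct, the second mixes two groups
predicted = [["LEVEL", "KAYAK"], ["SNOW", "HEAT"]]
n_correct = score_groups(predicted, truth)
print(n_correct)  # 1
```

For a full puzzle, `truth` could be built with `dict(zip(df["Word"], df["Group Name"]))` on that puzzle's rows.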