This notebook explores using a pretrained BERT model and a simple greedy search to sort each puzzle's 16 words into four groups of four. The search finds the group of four words with the highest average pairwise cosine similarity and sets it aside. It then repeats this on the remaining 12 words, and so on until four groups have been formed. The approach is "zero-shot" in the sense that the BERT model receives no supervision on the puzzles.

# Imports #

In [1]:
import numpy as np
import pandas as pd 
import torch
from transformers import AutoTokenizer, AutoModel
from torch.nn.functional import normalize
from itertools import combinations


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# Preprocessing #
Load Connections dataset

In [3]:
df_test = pd.read_csv("/kaggle/input/the-new-york-times-connections/Connections_Data.csv")
df_test.columns
df_test = df_test.drop(['Starting Row', 'Starting Column'], axis=1) #Unnecessary info
df_test.head()

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
0,1,2023-06-12,SNOW,WET WEATHER,0
1,1,2023-06-12,LEVEL,PALINDROMES,3
2,1,2023-06-12,SHIFT,KEYBOARD KEYS,2
3,1,2023-06-12,KAYAK,PALINDROMES,3
4,1,2023-06-12,HEAT,NBA TEAMS,1


## Imputation ##

In [4]:
df_test.isna().sum()

Game ID        0
Puzzle Date    0
Word           2
Group Name     0
Group Level    0
dtype: int64

There seem to be two missing entries in the data. I checked the actual puzzles from those days to work out what the missing words should be and entered them manually. It looks like "NA" from puzzle #62 was mistakenly labeled as NaN.

In [5]:
df_test[df_test.Word.isna()]

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
930,59,2023-08-09,,UNSPECIFIED QUANTITIES,0
978,62,2023-08-12,,PERIODIC TABLE SYMBOLS,3


In [6]:
df_test[(df_test['Game ID'] == 59) & 
    (df_test['Group Name'] == 'UNSPECIFIED QUANTITIES')]

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
930,59,2023-08-09,,UNSPECIFIED QUANTITIES,0
932,59,2023-08-09,FEW,UNSPECIFIED QUANTITIES,0
937,59,2023-08-09,HANDFUL,UNSPECIFIED QUANTITIES,0
943,59,2023-08-09,SEVERAL,UNSPECIFIED QUANTITIES,0


In [7]:
df_test[(df_test['Game ID'] == 62) & 
    (df_test['Group Name'] == 'PERIODIC TABLE SYMBOLS')]

Unnamed: 0,Game ID,Puzzle Date,Word,Group Name,Group Level
978,62,2023-08-12,,PERIODIC TABLE SYMBOLS,3
981,62,2023-08-12,NI,PERIODIC TABLE SYMBOLS,3
983,62,2023-08-12,HE,PERIODIC TABLE SYMBOLS,3
987,62,2023-08-12,FE,PERIODIC TABLE SYMBOLS,3


In [8]:
def load_data():
    df = pd.read_csv("/kaggle/input/the-new-york-times-connections/Connections_Data.csv")
    df = df[['Game ID', 'Word', 'Group Name']]
    df.at[930, "Word"] = "SOME"
    df.at[978, "Word"] = "NA"
    return df

In [9]:
df_test = load_data()
df_test.isna().sum()

Game ID       0
Word          0
Group Name    0
dtype: int64
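
The `NA` entry was most likely lost because pandas treats strings such as `NA` as missing by default when reading a CSV. Here is a minimal sketch of avoiding that, using a tiny made-up CSV in place of the real file (the rows below are illustrative assumptions, not the actual data). Passing `keep_default_na=False` to `pd.read_csv` keeps `NA` as a literal string. Truly empty cells then come back as empty strings, so this alone would not recover a word that is absent from the file.

```python
import io
import pandas as pd

# Toy stand-in for Connections_Data.csv (illustrative rows, not the real file)
csv_text = (
    "Game ID,Word,Group Name\n"
    "62,NA,PERIODIC TABLE SYMBOLS\n"
    "62,NI,PERIODIC TABLE SYMBOLS\n"
)

# Default parsing: the string "NA" is interpreted as a missing value
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the string "NA" is kept as-is
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_missing_default = int(default_df["Word"].isna().sum())
n_missing_literal = int(literal_df["Word"].isna().sum())
print(n_missing_default, n_missing_literal)
```

Patching by row index, as `load_data` does, is still needed for any word that is genuinely blank in the source file.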

# Encoding #

In this section, the words are encoded using Google's BERT tokenizer and model (`bert-base-uncased`). Each word is represented by the final-layer hidden state of its [CLS] token. I also added GPU acceleration support.

In [10]:
tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')
model = AutoModel.from_pretrained('google-bert/bert-base-uncased')
model = model.to(device)

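#Embeds a list of strings; returns one vector per string
def embed_words(words, model):
    #bert-base-uncased lowercases its input, so the all-caps puzzle words can be passed as-is
    inputs = tokenizer(words, padding=True, truncation=True, return_tensors="pt")
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    #Don't compute gradients for embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    #Use the final-layer hidden state of the [CLS] token (position 0) as the embedding
    return outputs.last_hidden_state[:, 0, :]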



In [11]:
words = df_test['Word'].tolist()
embeddings = embed_words(words, model)
df_test["Embedding"] = list(embeddings.cpu())


In [12]:
df_test.head()

Unnamed: 0,Game ID,Word,Group Name,Embedding
0,1,SNOW,WET WEATHER,"[tensor(-0.2204), tensor(0.2819), tensor(-0.13..."
1,1,LEVEL,PALINDROMES,"[tensor(-0.2300), tensor(0.1055), tensor(0.123..."
2,1,SHIFT,KEYBOARD KEYS,"[tensor(-0.1224), tensor(0.0911), tensor(0.119..."
3,1,KAYAK,PALINDROMES,"[tensor(-0.5861), tensor(0.0035), tensor(-0.31..."
4,1,HEAT,NBA TEAMS,"[tensor(-0.4224), tensor(0.1301), tensor(-0.51..."


# Predictions #

I initially tried using KMeans to select groups, but there's no easy way to force groups of four: asking for four clusters doesn't guarantee that each one contains exactly four points. <br/>
Instead, I wrote a function that computes the average pairwise cosine similarity for every possible combination of 4 words and picks the highest-scoring one. It then repeats this on the remaining 12 and then 8 words, and the last 4 words form the final group.
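
To see why brute force is affordable here, the number of candidate groups scored per puzzle can be counted directly. This is a small sketch, not part of the original pipeline; `N_WORDS` is a name introduced for illustration.

```python
from math import comb

N_WORDS = 16     # words per puzzle (illustrative name, not from the notebook)
GROUP_SIZE = 4

# Candidate groups scored at each greedy step: C(16,4), C(12,4), C(8,4), C(4,4)
candidates_per_step = [comb(n, GROUP_SIZE) for n in range(N_WORDS, 0, -GROUP_SIZE)]
total_candidates = sum(candidates_per_step)
print(candidates_per_step, total_candidates)  # [1820, 495, 70, 1] 2386
```

A couple of thousand small similarity computations per puzzle is cheap, so no pruning is needed for this search.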

In [13]:
N_GROUPS = 4 # Number of connections groups
GROUP_SIZE = 4 #Number of words in each connection group

#Takes an np.ndarray of shape (n, dim) as input
#Returns the mean pairwise cosine similarity, excluding each row's similarity with itself
def get_similarity(embeddings):
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    
    n = sims.shape[0]
    mask = ~np.eye(n, dtype=bool)
    return sims[mask].mean()

#Takes the numpy embeddings for a single puzzle as input
#Returns (preds, sims): preds is a list of tuples holding the 4 indices of each group,
#and sims holds the mean pairwise cosine similarity of each chosen group
def greedy_predict_puzzle(embeddings):
    preds = []
    sims = []
    
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    remaining_indices = list(range(len(X)))
    
    for i in range(N_GROUPS):
        highest_sim = -float("inf")
        best_combo = None
        #Iterate through all possible combinations left
        for idx_combo in combinations(remaining_indices, GROUP_SIZE):
            group = X[list(idx_combo)]
            
            sim_matrix = group @ group.T
            mask = ~np.eye(sim_matrix.shape[0], dtype=bool)
            candidate_sim = sim_matrix[mask].mean()

            if candidate_sim > highest_sim:
                highest_sim = candidate_sim
                best_combo = idx_combo

        preds.append(best_combo)
        sims.append(highest_sim)
        remaining_indices = [i for i in remaining_indices if i not in best_combo]
    
    return preds, sims


#Gets the stacked embeddings for one puzzle from the dataframe, given its game ID
def get_embeddings(df, game_id):
    return np.stack(df[df['Game ID'] == game_id]['Embedding'].to_numpy())
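
One possible variation on the search above, sketched under my own assumptions rather than taken from the notebook: compute the full cosine-similarity matrix once and index into it for each candidate, instead of multiplying a fresh 4x4 matrix per combination. The function name `greedy_groups` and the synthetic data are hypothetical. For a fixed group size, ranking by the off-diagonal sum is the same as ranking by the off-diagonal mean.

```python
import numpy as np
from itertools import combinations

def greedy_groups(embeddings, n_groups=4, group_size=4):
    """Greedy grouping that reuses one precomputed cosine-similarity matrix."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T  # all pairwise cosine similarities, computed once

    def off_diagonal_sum(combo):
        sub = sims[np.ix_(combo, combo)]
        return sub.sum() - np.trace(sub)  # drop self-similarities

    remaining = list(range(len(X)))
    groups = []
    for _ in range(n_groups):
        # max() keeps the first best candidate, matching a strict ">" comparison
        best = max(combinations(remaining, group_size), key=off_diagonal_sum)
        groups.append(best)
        remaining = [i for i in remaining if i not in best]
    return groups

# Synthetic check: 16 points drawn tightly around four orthogonal directions
rng = np.random.default_rng(0)
centers = np.eye(4, 8)                # four orthogonal unit vectors in 8-D
labels = np.repeat(np.arange(4), 4)   # true group of each toy "word"
toy = centers[labels] + 0.01 * rng.normal(size=(16, 8))
groups = greedy_groups(toy)
print(groups)
```

On cleanly separated synthetic clusters like these, the greedy search recovers the four groups. Real puzzle words are far less separable.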

In [14]:
TEST_ID = 42
#df_test[df_test['Game ID'] == TEST_ID]

# Evaluation #

For every puzzle, we'll represent a grouping as a list of tuples, where each tuple holds the indices (0 to 15, by position within the puzzle) of the four words in one group. An example representation would be [(1,4,15,14), (5,7,11,12), (2,6,8,10), (0,3,9,13)]

In [15]:
preds, sims = greedy_predict_puzzle(get_embeddings(df_test, TEST_ID))
print(sims)
df_test[df_test['Game ID'] == TEST_ID].iloc[list(preds[0])]

[np.float32(0.9735923), np.float32(0.9520364), np.float32(0.94216925), np.float32(0.7442939)]


Unnamed: 0,Game ID,Word,Group Name,Embedding
656,42,DASH,RUN QUICKLY,"[tensor(-0.2204), tensor(0.2636), tensor(0.028..."
663,42,BOLT,RUN QUICKLY,"[tensor(-0.1479), tensor(0.1453), tensor(0.056..."
664,42,FAT,___ CAT,"[tensor(-0.1662), tensor(0.1528), tensor(0.061..."
666,42,COLON,PUNCTUATION MARKS,"[tensor(-0.4412), tensor(0.0973), tensor(-0.14..."


In [16]:
def generate_predictions(df, predict_fn):
    all_preds = {}

    for game_id, df_game in df.groupby("Game ID"):
        embeddings = get_embeddings(df, game_id)
        preds, _ = predict_fn(embeddings)

        all_preds[game_id] = preds

    return all_preds


In [17]:
#Scores a single puzzle: the fraction of predicted groups that exactly match a true group
def group_accuracy(y, preds):
    #Convert to sets so the order of indices within a tuple doesn't matter
    y = [set(group) for group in y]
    preds = [set(group) for group in preds]
    
    n_correct = sum(pred in y for pred in preds)
    return n_correct / N_GROUPS

#Takes the dataframe and a dictionary {game_id : list of predicted group tuples}
#Returns the mean per-puzzle score; by default, the fraction of groups predicted exactly right
def accuracy_score(df, all_preds, scoring_fn=group_accuracy):
    accs = []
    
    for game_id, df_game in df.groupby('Game ID'):
        df_game = df_game.reset_index(drop=True)
        y_test = [
            tuple(group.index)
            for _, group in df_game.groupby("Group Name")
        ]

        preds = all_preds[game_id]
        
        accs.append(scoring_fn(y_test, preds))

    return np.mean(accs)

        

In [18]:
preds = generate_predictions(df_test, greedy_predict_puzzle)
acc = accuracy_score(df_test, preds)
print("Full Test Set Accuracy:", acc)

Full Test Set Accuracy: 0.011475409836065573


We get a group-level accuracy of about 1.1% for this model, which is not great. Random guessing would be expected to get about 0.22% of groups exactly right (4 true groups out of C(16,4) = 1820 possible four-word groups), so the model does perform better than chance.
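
The random-guessing baseline can be checked with a short calculation. This sketch assumes a uniformly random split of the 16 words into four groups. Each predicted group is then a uniformly random four-word subset, so by linearity of expectation the expected group accuracy equals the chance that one such subset matches a true group.

```python
from math import comb

N_GROUPS = 4
GROUP_SIZE = 4
n_words = N_GROUPS * GROUP_SIZE

# Probability that a uniformly random 4-word group equals one of the 4 true groups
p_group_correct = N_GROUPS / comb(n_words, GROUP_SIZE)
print(f"{100 * p_group_correct:.2f}%")  # 0.22%
```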

In [19]:
from itertools import permutations

def accuracy_min_swaps(pred_groups, gold_groups):

    pred_sets = [set(g) for g in pred_groups]
    gold_sets = [set(g) for g in gold_groups]

    total_items = N_GROUPS * GROUP_SIZE
    best_misplaced = total_items

    for perm in permutations(range(N_GROUPS)):
        misplaced = 0
        for i in range(N_GROUPS):
            j = perm[i]
            overlap = len(pred_sets[i] & gold_sets[j])
            misplaced += GROUP_SIZE - overlap

        best_misplaced = min(best_misplaced, misplaced)

    #Each swap fixes 2 misplaced elements
    return (best_misplaced + 1) // 2

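Using a different metric: the minimum number of swaps needed to turn the predicted groups into the correct ones (from Kyle). It is estimated as half the number of misplaced words under the best matching between predicted and true groups. Lower is better, and 0 means the puzzle was solved perfectly.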

In [20]:
preds = generate_predictions(df_test, greedy_predict_puzzle)
acc = accuracy_score(df_test, preds, accuracy_min_swaps)
print("Average Min Swap:", acc)

Average Min Swap: 4.177049180327868
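
To get a feel for the swap metric, here is a compact restatement of the same logic applied to a few hand-made groupings. The name `min_swaps` and the toy groupings are introduced only for this illustration.

```python
from itertools import permutations

def min_swaps(pred_groups, gold_groups, group_size=4):
    """Half the misplaced words (rounded up) under the best group matching."""
    pred_sets = [set(g) for g in pred_groups]
    gold_sets = [set(g) for g in gold_groups]
    n_groups = len(gold_sets)
    best_misplaced = min(
        sum(group_size - len(pred_sets[i] & gold_sets[perm[i]]) for i in range(n_groups))
        for perm in permutations(range(n_groups))
    )
    return (best_misplaced + 1) // 2

gold = [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15)]

# Words 3 and 4 sit in each other's groups: one swap fixes both
one_swap = [(0, 1, 2, 4), (3, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15)]

# Three words rotated across three groups: two swaps are needed
three_cycle = [(0, 1, 2, 4), (5, 6, 7, 8), (9, 10, 11, 3), (12, 13, 14, 15)]

print(min_swaps(gold, gold), min_swaps(one_swap, gold), min_swaps(three_cycle, gold))  # 0 1 2
```

Against these reference points, an average of about 4.18 swaps per puzzle means a typical prediction is still several exchanges away from the correct grouping.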
