## Brainstorm/Outline
- Goal is to fine-tune a base model LLM to predict crossword answers using Hint **and** Answer Length.
    - Also would like to try building a NN from scratch, but compute resources are an issue.  
- One Future Idea:
    - Utilize hint classification: Separate Hints by type.
        - e.g., Use a masked language model for fill-in-the-blank hints.
        - e.g., Train a separate model for understanding puns, anagrams, wordplay, and cryptic clues.
- What type of NLP task is this?
    - This is Text2Text Generation. We have an input --> output format. We want the model to generate text based on the input.
- What base model are we using for fine-tuning? **Google's t5-base**
    -  Full name + Creator: T5 (Text-to-Text Transfer Transformer). Made by Google.
    -  Key Characteristics
        - Treats every NLP task as a text-to-text problem. Both the input and output are texts, regardless of task.
            - Text-to-Text Transfer: A model that converts one piece of text into another for any NLP task.
        - Based on Transfer Learning.
            - Transfer Learning: Allows a model to leverage knowledge learned from one task/domain to improve performance on another.
            - Analogy:
                - Imagine a person who already speaks Spanish learning Italian. Since the languages are similar, they can learn faster compared to someone starting from zero.
                -  Similarly, T5 has already learned about words, sentence structure, and general knowledge, so it adapts to crosswords faster than a randomly initialized model.
        - Architecture: Sequence-to-Sequence (Encoder-Decoder Transformer).
            - A model where an input sequence is mapped to an output sequence. Useful when inputs and outputs have variable lengths.
            - Encoder: Reads input (clue), converts it to some meaningful numerical representation.
            - Decoder: Uses that information to generate the correct answer one token at a time.
        - Pre-trained on a massive dataset (C4, the Colossal Clean Crawled Corpus), meaning it already has built-in knowledge: it knows general facts and common words, and it understands word relationships, synonyms, meaning, grammar, sentence structure, and phrasing.
    - How was it trained?
        - Instead of predicting single masked tokens (like BERT), T5 masks entire spans/chunks of words and asks the model to reconstruct them.
    -  Why is it useful for answer prediction?
        - Understands clues by encoding them into a numerical representation.
        - Its text-to-text approach ensures we aren't just choosing/classifying from a predefined answer list - the model actually learns to generate the correct words based on patterns it has seen.
        - Pretraining on large data helps it understand word relationships, trivia and definitions.
    - Drawbacks/Cons:
        - Requires more compute than basic shallow learning classification models.
        - Generative aspect means it may hallucinate/make up stuff.
        - Does not have up-to-date knowledge. The C4 dataset was built from a 2019 web crawl.
        - Does not store facts directly or have explicit world knowledge like a database.
        - Does not inherently understand wordplay or anagrams.
- General approach
    - Fine-tune the model on crossword answers.
    - Generate multiple answers using **Beam Search** to produce several high-scoring candidate answers. Could also use top-k sampling for more diversity/randomness.
        - Necessary so we can match answer length.
        - How beam search works: Instead of greedily picking the single best token at each step, beam search keeps track of multiple partial outputs and ranks them, keeping the top `beam_size` candidates at each step.
    - Filter answer candidates by length to ensure the predicted output actually matches the crossword answer length.
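
The beam search and length-filtering steps in the outline can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Hugging Face implementation: the `TOY_MODEL` table, its probabilities, and the helper names `step_fn` and `beam_search` are hypothetical and only show why keeping several beams can recover an answer that greedy decoding misses.

```python
import math

EOS = "<eos>"

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"LY": 0.6, "PAR": 0.4},
    ("LY",): {"ON": 0.55, "RE": 0.45},
    ("PAR",): {"IS": 0.9, "K": 0.1},
}

def step_fn(prefix):
    """Return next-token probabilities; unknown prefixes simply end the sequence."""
    return TOY_MODEL.get(prefix, {EOS: 1.0})

def beam_search(step_fn, beam_size=3, max_steps=5):
    """Return finished (word, log_prob) pairs, best first."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_steps):
        # Expand every live beam by every possible next token.
        candidates = [
            (prefix + (token,), score + math.log(prob))
            for prefix, score in beams
            for token, prob in step_fn(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:  # keep only the top beam_size
            if seq[-1] == EOS:
                finished.append(("".join(seq[:-1]), score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

words = [w for w, _ in beam_search(step_fn, beam_size=3)]
greedy = [w for w, _ in beam_search(step_fn, beam_size=1)]
fits = [w for w in words if len(w) == 5]  # filter candidates by answer length
print(words, greedy, fits)
```

With `beam_size=1` the search commits to `LY` because it is the most likely first token, and ends at `LYON`. With three beams, `PAR` survives the first step and `PARIS` ends up with the highest total probability, which is also the only candidate that passes the length-5 filter.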

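The span-masking pretraining objective mentioned in the outline can be illustrated with a small sketch. The helper `corrupt_spans` and the hand-picked span positions are hypothetical (real pretraining samples spans at random and works on subword tokens); the `<extra_id_N>` sentinel naming follows the T5 tokenizer's vocabulary.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target) strings."""
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # the span itself is hidden
        target.append(sentinel)
        target.extend(tokens[start:end])        # the model must produce the hidden text
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # final sentinel closes the target
    return " ".join(corrupted), " ".join(target)

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(corrupted)
print(target)
```

The encoder sees the corrupted sentence and the decoder is trained to emit the target, so the model learns to fill in missing multi-word chunks from context, which is loosely similar to filling in a crossword answer from a clue.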

## Prototype: Start with training only on 2021 crosswords.

### Data Preprocessing and cleaning

In [None]:
#imports
import pandas as pd
import numpy as np
import re

In [None]:
#Load 2021 data
df = pd.read_csv('deep_learning_nytcrosswords2021.csv')

#Rename columns for clarity
df.rename(columns={
    "Word": "Answer",
    "Clue": "Hint",
    "Character Count": "Answer_Length"
}, inplace=True)

#Reorder columns for clarity 
df = df[['Date', 'Hint', 'Answer', 'Answer_Length']]

In [None]:
#Take a look. 
df.head(5)

In [None]:
#Minimal preprocessing required. The T5 tokenizer handles most normalization itself.
def clean_text(text):
    """Minimal cleaning for T5: normalizes quotes, removes special symbols."""
    text = text.strip()  # Remove leading and trailing spaces
    text = re.sub(r'[“”]', '"', text)  # Normalize curly double quotes
    text = re.sub(r"[‘’]", "'", text)  # Normalize curly single quotes/apostrophes
    text = re.sub(r'[•◇➤]', '', text)  # Remove special symbols
    return text
    
def add_length_to_clue(df):
    """
    Appends the answer length to the clue in parentheses.
    Assumes the dataframe has 'Hint' and 'Answer_Length' columns.
    """
    df = df.copy()  # Avoid modifying the original dataframe
    df["Formatted Hint"] = df.apply(lambda row: f"{row['Hint']} ({row['Answer_Length']})", axis=1)
    return df

df["Hint"] = df["Hint"].apply(clean_text)  # Clean the clue text that is fed to the model
df["Answer"] = df["Answer"].apply(clean_text)
df = add_length_to_clue(df)

In [None]:
#Small experimental preprocessing step: Add a column that classifies Hint as fill in the blank or not.
def classify_clue_type(hint):
    """Returns 1 if clue is fill-in-the-blank, else 0."""
    return 1 if "_" in hint else 0

# Add binary classification column
df["Fill-in-the-Blank"] = df["Hint"].apply(classify_clue_type)

#Reorder cols for clarity 
df = df[["Date", "Hint", "Formatted Hint", "Answer", "Answer_Length", "Fill-in-the-Blank"]]

In [None]:
df.head(5)

In [None]:
df.to_csv('feature3_cleaned2021_data.csv')

## Model Training

### Training Process Explained
- Input: Formatted Clue
- Output: Answer
- Imports
    - Tokenizer: Must use T5Tokenizer to match T5 model.
    - ConditionalGeneration (T5ForConditionalGeneration): Instead of generating free text from scratch, T5 generates output conditioned on the given input.
    - Trainer: Handles batching, gradient updates, etc.
    - TrainingArguments: Specifies training hyperparameters (batch size, epochs, evaluation strategy).
    - DataCollatorForSeq2Seq: Ensures batch sequences are properly padded for sequence-to-sequence learning.
- Convert to Hugging Face Dataset
    - What is it?  A structured dataset format used by the Hugging Face `datasets` library, optimized for efficient tokenization and training.  
    - Why?  It allows for easy preprocessing, batching, and integration with Hugging Face’s `Trainer` API, making training faster and more memory-efficient.  
    - Hugging Face `.map()` function: applies a transformation to every example in a dataset and returns a new dataset. It is efficient because it supports batched processing and multiprocessing.
- Preprocessing
    - Add prefix --> Crossword clue: {clue}. T5 is designed for task-based learning. Prefixes help it distinguish the task. (The prototype below does not add the prefix yet.)
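
What the data collator does to a batch can be sketched without any libraries. The token ids below are made up and `pad_batch` is a hypothetical helper; the one real detail is that label padding uses -100, which is the default label padding value in `DataCollatorForSeq2Seq` and the default ignore index of PyTorch's cross-entropy loss.

```python
PAD_ID = 0           # T5's pad token id
IGNORE_INDEX = -100  # label positions with this value are skipped by the loss

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Made-up token ids for two (clue, answer) pairs; 1 is T5's end-of-sequence id.
input_ids = [[9172, 8130, 3, 1], [363, 1]]
labels = [[5097, 1], [71, 272, 205, 1]]

padded_inputs = pad_batch(input_ids, PAD_ID)
attention_mask = pad_batch([[1] * len(seq) for seq in input_ids], 0)
padded_labels = pad_batch(labels, IGNORE_INDEX)

print(padded_inputs)
print(attention_mask)
print(padded_labels)
```

Padding inputs with the pad id and labels with -100 matters for the loss: if label padding kept the pad id instead, the model would also be scored on predicting padding, which can make the reported loss look better than the actual answer quality.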

In [1]:
#Load in cleaned data and imports
import pandas as pd
import re
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq
from datasets import Dataset

df = pd.read_csv('feature3_cleaned2021_data.csv')



In [2]:
import torch
print("PyTorch detects GPUs:", torch.cuda.device_count())


PyTorch detects GPUs: 1


In [3]:
#Set up GPU training
#Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [4]:
#Initialize Tokenizer and load in model. 
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
print("Model and tokenizer loaded successfully!")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Model and tokenizer loaded successfully!


In [5]:
# Preprocess + tokenize data. Convert to HuggingFace Dataset.
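def preprocess_data(examples):
    """Tokenizes clues and answers for T5 training."""
    model_inputs = tokenizer(examples["Formatted Hint"], truncation=True, max_length=128, padding="max_length")
    labels = tokenizer(examples["Answer"], truncation=True, max_length=32, padding="max_length").input_ids
    # Replace pad token ids in the labels with -100 so the loss ignores padding positions
    labels = [[tok if tok != tokenizer.pad_token_id else -100 for tok in seq] for seq in labels]
    model_inputs["labels"] = labels
    return model_inputs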
    
#Convert to Hugging Face Dataset 
dataset = Dataset.from_pandas(df)
dataset = dataset.map(preprocess_data, batched=True)

#View dataset
print(dataset)

# Show first few rows
print(dataset[:3])  # Retrieves first 3 entries

Map: 100%|██████████████████████| 23420/23420 [00:02<00:00, 10928.13 examples/s]

Dataset({
    features: ['Unnamed: 0', 'Date', 'Hint', 'Formatted Hint', 'Answer', 'Answer_Length', 'Fill-in-the-Blank', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 23420
})
{'Unnamed: 0': [0, 1, 2], 'Date': ['2021-10-25', '2021-01-27', '2021-08-12'], 'Hint': ['Eyelid affliction', '"I only got a seventh-grade education, but I have a doctorate in ___": James Brown', 'Warmer in the winter'], 'Formatted Hint': ['Eyelid affliction (4)', '"I only got a seventh-grade education, but I have a doctorate in ___": James Brown (4)', 'Warmer in the winter (5)'], 'Answer': ['STYE', 'FUNK', 'COCOA'], 'Answer_Length': [4, 4, 5], 'Fill-in-the-Blank': [0, 1, 0], 'input_ids': [[9172, 8130, 3, 4127, 2176, 1575, 3, 10820, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0




In [6]:
#Split into training and testing sets
dataset = dataset.train_test_split(test_size=0.1)
train_dataset = dataset["train"]
test_dataset = dataset["test"]
print(f"Training dataset size: {len(train_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")

Training dataset size: 21078
Test dataset size: 2342


In [7]:
import os
print("Current working directory:", os.getcwd())


Current working directory: /projectnb/ds340/students/seansal2/CrosswordHelper


In [8]:
#TRAIN THE MODEL
# Training arguments
output_path = "/projectnb/ds340/students/seansal2/CrosswordHelper/t5_crossword_model"

training_args = TrainingArguments(
    output_dir=output_path,  # Save model here
    logging_dir=f"{output_path}/logs",  # Ensure logs persist
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=2,
    logging_steps=500,
    report_to="none",
    fp16=True,  # Enables mixed precision for efficiency
)

# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save model & tokenizer to the correct directory
model.save_pretrained(output_path)
tokenizer.save_pretrained(output_path)

print(f"Training complete. Model saved to: {output_path}")

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,0.3727,0.350984
2,0.3524,0.336866
3,0.3302,0.32957
4,0.313,0.325087
5,0.3096,0.323777


Training complete. Model saved to: /projectnb/ds340/students/seansal2/CrosswordHelper/t5_crossword_model


### Model Evaluation

- Check Model Predictions
- Evaluate model using metrics other than the training loss (e.g., top-5 accuracy)
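
As a rough sketch of what such evaluation metrics could look like, the block below computes top-k accuracy and mean reciprocal rank over ranked candidate lists. The function names and the toy predictions are hypothetical and are not results from the trained model.

```python
def normalize(text):
    """Compare answers case-insensitively and without surrounding spaces."""
    return text.strip().upper()

def top_k_accuracy(predictions, answers, k=5):
    """Fraction of clues whose true answer appears in the first k candidates."""
    hits = 0
    for candidates, answer in zip(predictions, answers):
        top = [normalize(c) for c in candidates[:k]]
        hits += normalize(answer) in top
    return hits / len(answers)

def mean_reciprocal_rank(predictions, answers):
    """Average of 1/rank of the true answer (0 when it is missing)."""
    total = 0.0
    for candidates, answer in zip(predictions, answers):
        ranked = [normalize(c) for c in candidates]
        if normalize(answer) in ranked:
            total += 1.0 / (ranked.index(normalize(answer)) + 1)
    return total / len(answers)

# Toy ranked candidates for three clues (hypothetical, best candidate first).
predictions = [["LYON", "PARIS"], ["DORA"], ["SATURN", "URANUS"]]
answers = ["PARIS", "DORA", "JUPITER"]

top1 = top_k_accuracy(predictions, answers, k=1)
top2 = top_k_accuracy(predictions, answers, k=2)
mrr = mean_reciprocal_rank(predictions, answers)
print(top1, top2, mrr)
```

Top-k accuracy only asks whether the answer is somewhere in the list, while mean reciprocal rank also rewards placing it near the top, which matters if a solver only looks at the first suggestion.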

In [9]:
#Load in trained model
output_path = "/projectnb/ds340/students/seansal2/CrosswordHelper/t5_crossword_model"
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(output_path)
model = T5ForConditionalGeneration.from_pretrained(output_path)


In [12]:
#Create one case generate_answer function using beam search 
    #One concern: should we filter to fixed length at generation or after?
def generate_answer(clue, model, tokenizer, max_length=32, num_beams=7, top_k = 5):
    """
    Generates an answer for a given crossword clue using the trained model.
    """
    model.eval()  # Set model to evaluation mode
    # Ensure everything runs on the same device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Move model to the correct device

    
    input_text = clue
    #If we add a Prefix, use this ...
    #input_text = f"Crossword clue: {clue}"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)  # Tokenize input and move it to the model's device

    #Generate text using hugging face generate text
        #Our decoding method is beam search: instead of just greedily returning highest word probability, keep num_beams most likely choices 
    with torch.no_grad(): #Disable gradient calculation for efficiency - don't need it for inference/prediction/generation
        #beam search, return k best answers:
            #Could also try top-k sampling for more diverse answer
            #Lower temperature = more structured predictions
            #top-p/nucleus-sampling - choose words from top X% probability mass dynamically. Less random than top-k, more than beam search
        outputs = model.generate(
            input_ids, 
            max_length=max_length, 
            num_beams=num_beams,  # More beams = better search
            num_return_sequences=top_k,  # Number of beams returned (must be <= num_beams); not top-k sampling
            early_stopping=True
        )

    #Lastly convert token ids to readable text
    predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return predictions
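
The comments in `generate_answer` mention top-k and top-p (nucleus) sampling as alternatives to beam search. A minimal library-free sketch of the two filtering rules is below; the probability table and helper names are hypothetical. In Hugging Face `generate`, the corresponding options are `do_sample=True` together with `top_k`, `top_p`, and `temperature`.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return {tok: q / total for tok, q in kept}

# Hypothetical next-token distribution for one decoding step.
probs = {"PARIS": 0.5, "LYON": 0.3, "NICE": 0.15, "LILLE": 0.05}

k_filtered = top_k_filter(probs, k=2)
p_filtered = top_p_filter(probs, p=0.9)

# Sampling then draws from the filtered, renormalized distribution.
sample = random.choices(list(p_filtered), weights=list(p_filtered.values()))[0]
print(k_filtered, p_filtered, sample)
```

Top-k always keeps a fixed number of tokens, while top-p adapts the cutoff to how confident the distribution is; both introduce randomness, so repeated calls can return different candidates, unlike beam search.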

In [13]:
#Some test cases
clues = [
    "Capital of France (5)",  
    "___ the Explorer (4)",  
    "Largest planet in the solar system (7)"
]

for clue in clues:
    answer = generate_answer(clue, model, tokenizer)
    print(f"Clue: {clue}")
    print(f"Predicted Answer: {answer}\n")


Clue: Capital of France (5)
Predicted Answer: ['LYON', 'FRANCE', 'CAMBOY', 'ANGELES', 'LESTIN']

Clue: ___ the Explorer (4)
Predicted Answer: ['ENTR', 'EYES', 'IERO', 'NETWORK', 'TERR']

Clue: Largest planet in the solar system (7)
Predicted Answer: ['AURORA', 'GREENPOINT', 'GREENPOOL', 'GREENHOUSE', 'GREENPOLE']
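
The outline proposes filtering candidates by answer length, which the prototype does not do yet. One possible post-generation filter is sketched below, applied to the candidate lists printed above. The helper name `filter_by_length` is hypothetical, and it assumes the length appears in parentheses at the end of the clue, as produced by `add_length_to_clue`.

```python
import re

def filter_by_length(clue, candidates):
    """Keep only candidates whose letter count matches the '(n)' suffix of the clue."""
    match = re.search(r"\((\d+)\)\s*$", clue)
    if match is None:
        return list(candidates)  # no length in the clue, so nothing to filter on
    target = int(match.group(1))
    return [c for c in candidates if len(c.replace(" ", "")) == target]

dora = filter_by_length(
    "___ the Explorer (4)", ["ENTR", "EYES", "IERO", "NETWORK", "TERR"]
)
paris = filter_by_length(
    "Capital of France (5)", ["LYON", "FRANCE", "CAMBOY", "ANGELES", "LESTIN"]
)
print(dora)
print(paris)
```

For the first clue the filter drops only `NETWORK`; for the second it drops every candidate, which suggests that filtering after generation may need more beams (or a length constraint during generation) to leave any usable answers.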



In [18]:
#Raw Accuracy Evaluation
num_samples = 1000  # Adjust based on test set size

correct = 0
total = 0

for example in test_dataset.select(range(num_samples)):
    clue = example["Formatted Hint"]  # Ensure correct column name
    true_answer = example["Answer"].strip().upper()  # Normalize answer case
    
    predicted_answers = generate_answer(clue, model, tokenizer)  # Returns a list
    
    # Check if the correct answer is in the list of predicted answers
    if true_answer in [ans.strip().upper() for ans in predicted_answers]:
        correct += 1
    
    total += 1

accuracy = correct / total
print(f"Model Top-5 Accuracy on {num_samples} test samples: {accuracy:.2%}")

Model Top-5 Accuracy on 1000 test samples: 5.90%


In [19]:
trainer.state.log_history[-5:]  # Last 5 logs

[{'loss': 0.3133,
  'grad_norm': 0.6222014427185059,
  'learning_rate': 4.50853889943074e-06,
  'epoch': 4.554079696394687,
  'step': 12000},
 {'loss': 0.304,
  'grad_norm': 0.5957279205322266,
  'learning_rate': 2.618595825426945e-06,
  'epoch': 4.743833017077799,
  'step': 12500},
 {'loss': 0.3096,
  'grad_norm': 0.6870908141136169,
  'learning_rate': 7.324478178368121e-07,
  'epoch': 4.933586337760911,
  'step': 13000},
 {'eval_loss': 0.32377704977989197,
  'eval_runtime': 3.721,
  'eval_samples_per_second': 629.405,
  'eval_steps_per_second': 78.743,
  'epoch': 5.0,
  'step': 13175},
 {'train_runtime': 1044.8926,
  'train_samples_per_second': 100.862,
  'train_steps_per_second': 12.609,
  'total_flos': 1.60445180215296e+16,
  'train_loss': 0.36967294074100154,
  'epoch': 5.0,
  'step': 13175}]
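
The step counts in this log can be cross-checked against the dataset sizes and batch size reported earlier. The arithmetic below assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept, which matches the `Trainer` defaults and the one GPU detected above.

```python
import math

train_examples = 21078   # "Training dataset size" printed above
test_examples = 2342     # "Test dataset size" printed above
batch_size = 8           # per_device_train_batch_size / per_device_eval_batch_size
epochs = 5               # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(test_examples / batch_size)
epoch_at_step_12000 = round(12000 / steps_per_epoch, 3)

print(steps_per_epoch, total_steps, eval_steps, epoch_at_step_12000)
```

This gives 2635 steps per epoch and 13175 steps in total, matching the final `'step': 13175` entry, and step 12000 falls at epoch 4.554 as logged.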