## Brainstorm/Outline
- Goal is to fine-tune a base model LLM to predict crossword answers using Hint **and** Answer Length.
    - Also would like to try building a NN from scratch, but compute resources are an issue.  
- One Future Idea:
    - Utilize hint classification: Separate Hints by type.
        - e.g., use a masked language model for fill-in-the-blank hints.
        - e.g., train a separate model for understanding puns, anagrams, wordplay, and cryptic clues.
- What type of NLP task is this?
    - This is Text2Text Generation. We have an input --> output format. We want the model to generate text based on the input.
- What base model are we using for fine-tuning? **Google's t5-base**
    -  Full name + Creator: T5 (Text-to-Text Transfer Transformer). Made by Google.
    -  Key Characteristics
        - Treats every NLP task as a text-to-text problem. Both the input and output are texts, regardless of task.
            - Text-to-Text Transfer: A model that converts one piece of text into another for any NLP task.
        - Based on Transfer Learning.
            - Transfer Learning: Allows a model to leverage knowledge learned from one task/domain to improve performance on another.
            - Analogy:
                - Imagine a person who already speaks Spanish learning Italian. Since the languages are similar, they can learn faster compared to someone starting from zero.
                -  Similarly, T5 has already learned about words, sentence structure, and general knowledge, so it adapts to crosswords faster than a randomly initialized model.
        - Architecture: Sequence-to-Sequence (Encoder-Decoder Transformer).
            - A model where an input sequence is mapped to an output sequence. Useful when inputs and outputs have variable lengths.
            - Encoder: Reads input (clue), converts it to some meaningful numerical representation.
            - Decoder: Uses that information to generate the correct answer one token at a time.
        - Pre-trained on a massive dataset (C4, the Colossal Clean Crawled Corpus), meaning it already has built-in knowledge. Knows general facts and common words. Understands word relationships, synonyms, and meaning. Understands grammar, sentence structure, and phrasing.
    - How was it trained?
        - Instead of predicting single masked tokens (like BERT), T5 masks entire spans/chunks of words and asks the model to reconstruct them.
    -  Why is it useful for answer prediction?
        - Understands clues by encoding them into a numerical representation.
        - Its text-to-text approach ensures we aren't just choosing/classifying from a predefined set of answers. The model actually learns to generate the correct words based on patterns it has seen.
        - Pretraining on large data helps it understand word relationships, trivia and definitions.
    - Drawbacks/Cons:
        - Requires more compute than basic shallow learning classification models.
        - Generative aspect means it may hallucinate/make up answers.
        - Does not have up-to-date knowledge. The C4 dataset was built from a web crawl dated April 2019.
        - Does not store facts directly or have explicit world knowledge like a database.
        - Does not inherently understand wordplay or anagrams.
- General approach
    - Fine-tune the model on crossword answers.
    - Generate multiple answers using **Beam Search** to produce several high-quality alternative answers. Could also use top-k sampling for more diversity/randomness.
        - Necessary so we can match answer length.
        - How beam search works: Instead of greedily picking the single best token at every step, Beam Search keeps track of multiple possible outputs and ranks them. It keeps the top beam_size candidates at each step.
    - Filter answer candidates by length to ensure the predicted output actually matches the crossword answer length.
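
To make the beam search description above concrete, here is a minimal pure-Python sketch. The names `TOY_MODEL`, `toy_next_token_probs`, and `beam_search` are hypothetical and the probabilities are made up for illustration. It is a toy, not how T5 or Hugging Face implements decoding.

```python
import math

# Hypothetical toy "model": maps a prefix of tokens to next-token probabilities.
TOY_MODEL = {
    (): {"C": 0.6, "S": 0.4},
    ("C",): {"HEF": 0.5, "OOK": 0.5},
    ("S",): {"TYE": 0.9, "OUP": 0.1},
}
EOS = "</s>"


def toy_next_token_probs(prefix):
    """Return next-token probabilities; unknown prefixes simply end the sequence."""
    return TOY_MODEL.get(prefix, {EOS: 1.0})


def beam_search(next_token_probs, beam_size, max_len):
    """Keep the `beam_size` best partial sequences at each step, ranked by log-probability."""
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, prob in next_token_probs(tokens).items():
                extended = (tokens + (tok,), score + math.log(prob))
                # Sequences that emit EOS are complete; the rest stay in play.
                (finished if tok == EOS else candidates).append(extended)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    ranked = sorted(finished, key=lambda c: c[1], reverse=True)
    return ["".join(tokens[:-1]) for tokens, _ in ranked]  # drop the EOS token


print(beam_search(toy_next_token_probs, beam_size=1, max_len=4))  # greedy decoding
print(beam_search(toy_next_token_probs, beam_size=2, max_len=4))
```

With `beam_size=1` the search commits to "C" (the most likely first token) and can never reach "STYE", even though "STYE" has the highest overall probability in this toy table. With `beam_size=2` both prefixes survive the first step, so the better full sequence is found.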

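The span-masking pre-training objective mentioned in the outline can be sketched as follows. T5's vocabulary includes sentinel tokens named `<extra_id_0>`, `<extra_id_1>`, and so on. The helper `span_corrupt` is a hypothetical illustration, and the spans are chosen by hand here, whereas real pre-training samples them randomly.

```python
def span_corrupt(words, spans):
    """Build a (corrupted input, target) pair in the T5 sentinel format.

    `spans` is a sorted list of non-overlapping (start, end) word indices,
    with `end` exclusive. Each masked span is replaced by one sentinel token.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(words[cursor:start])  # keep the unmasked words
        inputs.append(sentinel)             # one sentinel stands in for the whole span
        targets.append(sentinel)
        targets.extend(words[start:end])    # the target spells out the masked span
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


words = "Thank you for inviting me to your party last week.".split()
corrupted, target = span_corrupt(words, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week.
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```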

## Prototype: Start with training only on 2021 crosswords.

### Data Preprocessing and cleaning

In [30]:
#imports
import pandas as pd
import numpy as np
import re

In [38]:
#Load 2021 data
df = pd.read_csv('deep_learning_nytcrosswords2021.csv')

#Rename columns for clarity
df.rename(columns={
    "Word": "Answer",
    "Clue": "Hint",
    "Character Count": "Answer_Length"
}, inplace=True)

#Reorder columns for clarity 
df = df[['Date', 'Hint', 'Answer', 'Answer_Length']]

In [39]:
#Take a look. 
df.head(5)

Unnamed: 0,Date,Hint,Answer,Answer_Length
0,2021-10-25,Eyelid affliction,STYE,4
1,2021-01-27,"""I only got a seventh-grade education, but I h...",FUNK,4
2,2021-08-12,Warmer in the winter,COCOA,5
3,2021-10-26,___ Boyardee,CHEF,4
4,2021-08-08,More like a dive bar or certain bread,SEEDIER,7


In [40]:
#Minimal preprocessing required. t5 tokenizer is pretty advanced.
def clean_text(text):
    """Minimal cleaning for T5: normalizes quotes, removes special symbols."""
    text = text.strip()  # Remove leading and trailing spaces
    text = re.sub(r'[“”]', '"', text)  # Normalize curly double quotes
    text = re.sub(r'[‘’]', "'", text)  # Normalize curly single quotes/apostrophes
    text = re.sub(r'[•◇➤]', '', text)  # Remove special symbols
    return text
    
def add_length_to_clue(df):
    """
    Appends the answer length to the clue in parentheses.
    Assumes the dataframe has 'Hint' and 'Answer_Length' columns.
    """
    df = df.copy()  # Avoid modifying the original dataframe
    df["Formatted Hint"] = df.apply(lambda row: f"{row['Hint']} ({row['Answer_Length']})", axis=1)
    return df

df["Answer"] = df["Answer"].apply(clean_text)
df = add_length_to_clue(df)

In [41]:
#Small experimental preprocessing step: Add a column that classifies Hint as fill in the blank or not.
def classify_clue_type(hint):
    """Returns 1 if clue is fill-in-the-blank, else 0."""
    return 1 if "_" in hint else 0

# Add binary classification column
df["Fill-in-the-Blank"] = df["Hint"].apply(classify_clue_type)

#Reorder cols for clarity 
df = df[["Date", "Hint", "Formatted Hint", "Answer", "Answer_Length", "Fill-in-the-Blank"]]

In [42]:
df.head(5)

Unnamed: 0,Date,Hint,Formatted Hint,Answer,Answer_Length,Fill-in-the-Blank
0,2021-10-25,Eyelid affliction,Eyelid affliction (4),STYE,4,0
1,2021-01-27,"""I only got a seventh-grade education, but I h...","""I only got a seventh-grade education, but I h...",FUNK,4,1
2,2021-08-12,Warmer in the winter,Warmer in the winter (5),COCOA,5,0
3,2021-10-26,___ Boyardee,___ Boyardee (4),CHEF,4,1
4,2021-08-08,More like a dive bar or certain bread,More like a dive bar or certain bread (7),SEEDIER,7,0
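
The binary fill-in-the-blank flag above could be extended toward the hint-classification idea from the outline. The sketch below is only a rough heuristic: the category names and the function `classify_hint` are hypothetical, and it assumes the common crossword convention that a clue ending in a question mark signals wordplay.

```python
import re


def classify_hint(hint: str) -> str:
    """Assign a coarse, rule-based type to a crossword hint (illustrative only)."""
    if re.search(r"_{2,}", hint):
        return "fill_in_the_blank"   # e.g. "___ Boyardee"
    if hint.rstrip().endswith("?"):
        return "possible_wordplay"   # assumption: a trailing "?" hints at a pun
    return "standard"


for example in ["___ Boyardee", "Eyelid affliction"]:
    print(example, "->", classify_hint(example))
```

Rules like these would only be a starting point. Anagrams and cryptic clues generally cannot be detected from punctuation alone.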


In [43]:
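# Note: the index is written too, so it reappears as an 'Unnamed: 0' column on reload.
# Pass index=False to to_csv to avoid that.
df.to_csv('feature3_cleaned2021_data.csv')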

## Model Training

### Training Process Explained
- Input: Formatted Hint
- Output: Answer
- Imports
    - Tokenizer: Must use T5Tokenizer to match T5 model.
    - T5ForConditionalGeneration: Instead of generating free text from scratch, T5 generates output conditioned on the given input.
    - Trainer: Handles batching, gradient updates, etc.
    - TrainingArguments: Specifies training hyperparameters (batch size, epochs, evaluation strategy).
    - DataCollatorForSeq2Seq: Ensures batch sequences are properly padded for sequence-to-sequence learning.
- Convert to Hugging Face Dataset
    - What is it?  A structured dataset format used by the Hugging Face `datasets` library, optimized for efficient tokenization and training.  
    - Why?  It allows for easy preprocessing, batching, and integration with Hugging Face’s `Trainer` API, making training faster and more memory-efficient.  
    - Hugging Face .map() function: applies a transformation to every example in a dataset and returns a new dataset. It is efficient because it supports batch processing, multiprocessing, and caching of results.
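
To illustrate what "properly padded" means for sequence-to-sequence batches, here is a simplified pure-Python sketch. It is not the library's implementation. `pad_batch` is a hypothetical helper and the token ids are made up, but the two padding values reflect real defaults: T5's pad token id is 0, and `DataCollatorForSeq2Seq` pads labels with -100 so the loss ignores those positions.

```python
def pad_batch(examples, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example to the longest input and longest label in the batch."""
    max_input_len = max(len(ex["input_ids"]) for ex in examples)
    max_label_len = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_tokens = len(ex["input_ids"])
        input_pad = max_input_len - n_tokens
        label_pad = max_label_len - len(ex["labels"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * input_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_tokens + [0] * input_pad)
        # -100 marks label positions that should not contribute to the loss.
        batch["labels"].append(ex["labels"] + [label_pad_token_id] * label_pad)
    return batch


batch = pad_batch([
    {"input_ids": [5, 6, 1], "labels": [7, 1]},
    {"input_ids": [8, 1], "labels": [9, 10, 1]},
])
print(batch)
```

Padding to the longest sequence in each batch (rather than a fixed maximum length) keeps batches small when most clues and answers are short.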

In [1]:
#Load in cleaned data and imports
import pandas as pd
import re
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq
from datasets import Dataset
from tqdm import tqdm
import time

df = pd.read_csv('feature3_cleaned2021_data.csv')

In [2]:
#Setup for GPU training
# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


In [3]:
# Initialize Tokenizer and load in model. 
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
print("Model and tokenizer loaded successfully!")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Model and tokenizer loaded successfully!


In [4]:
# Preprocess + tokenize data. Convert to HuggingFace Dataset.
def preprocess_data(examples):
    """Tokenizes clues and answers for T5 training."""
    model_inputs = tokenizer(examples["Formatted Hint"], truncation=True, max_length=128, padding="max_length")
    labels = tokenizer(examples["Answer"], truncation=True, max_length=32, padding="max_length").input_ids
    # Replace pad token ids with -100 so the loss ignores padded label positions
    labels = [[tok if tok != tokenizer.pad_token_id else -100 for tok in seq] for seq in labels]
    model_inputs["labels"] = labels
    return model_inputs
    
#Convert to Hugging Face Dataset 
dataset = Dataset.from_pandas(df)
dataset = dataset.map(preprocess_data, batched=True)

#View dataset
print(dataset)

# Show first few rows
print(dataset[:3])  # Retrieves first 3 entries


Dataset({
    features: ['Unnamed: 0', 'Date', 'Hint', 'Formatted Hint', 'Answer', 'Answer_Length', 'Fill-in-the-Blank', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 23420
})
{'Unnamed: 0': [0, 1, 2], 'Date': ['2021-10-25', '2021-01-27', '2021-08-12'], 'Hint': ['Eyelid affliction', '"I only got a seventh-grade education, but I have a doctorate in ___": James Brown', 'Warmer in the winter'], 'Formatted Hint': ['Eyelid affliction (4)', '"I only got a seventh-grade education, but I have a doctorate in ___": James Brown (4)', 'Warmer in the winter (5)'], 'Answer': ['STYE', 'FUNK', 'COCOA'], 'Answer_Length': [4, 4, 5], 'Fill-in-the-Blank': [0, 1, 0], 'input_ids': [[9172, 8130, 3, 4127, 2176, 1575, 3, 10820, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [5]:
#Split into training and testing sets
dataset = dataset.train_test_split(test_size=0.1, seed=42)  # Fixed seed so the split is reproducible
train_dataset = dataset["train"]
test_dataset = dataset["test"]

In [6]:
#TRAIN THE MODEL
# Training arguments
training_args = TrainingArguments(
    output_dir="./t5_crossword_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=500,
    report_to="none",
)

# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save final model
model.save_pretrained("./t5_crossword_model")
tokenizer.save_pretrained("./t5_crossword_model")
print("Training complete. Model saved!")



Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
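
Once a fine-tuned checkpoint exists, the "General approach" from the outline (beam search, then a length filter) could look roughly like the sketch below. The function names `generate_candidates`, `normalize`, and `filter_by_length` are hypothetical, as are the default values for `num_beams`, `num_candidates`, and `max_new_tokens`. The `model` and `tokenizer` arguments are assumed to be the T5 objects loaded earlier in this notebook.

```python
def generate_candidates(model, tokenizer, formatted_hint, num_beams=10, num_candidates=10):
    """Return several beam-search outputs for one formatted hint, e.g. 'Eyelid affliction (4)'."""
    inputs = tokenizer(formatted_hint, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=num_candidates,  # must not exceed num_beams
        max_new_tokens=16,                    # assumed cap; answers are short
        early_stopping=True,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


def normalize(candidate):
    """Uppercase and drop spaces/punctuation so length is counted in grid squares."""
    return "".join(ch for ch in candidate.upper() if ch.isalnum())


def filter_by_length(candidates, answer_length):
    """Keep unique candidates whose normalized length matches, preserving input order."""
    seen, kept = set(), []
    for cand in candidates:
        norm = normalize(cand)
        if len(norm) == answer_length and norm not in seen:
            seen.add(norm)
            kept.append(norm)
    return kept


# Usage sketch (requires the loaded model and tokenizer):
# candidates = generate_candidates(model, tokenizer, "Eyelid affliction (4)")
# answers = filter_by_length(candidates, 4)
print(filter_by_length(["stye", "Sty", "STYE", "c h e f", "cocoa"], 4))
```

Normalizing before measuring length is a design choice: generated text may contain spaces or lowercase letters, while the answers in this dataset are uppercase with no spaces.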