### Project Introduction
- **Goal**: Build a crossword helper that gives the user additional hints and clues to make solving crosswords more enjoyable. For now, the helper assumes access to the correct answers and takes a clue/answer pair as input; the end product would not require the answers. A separate computer vision component is being developed so that a user can photograph an entire crossword and request help where needed. Hint generation is heading in a few directions of varying complexity, including but not limited to:
  1. Provide synonyms/related words/antonyms to the answer --> use embeddings/thesaurus
  2. Provide answer classification so the user knows what *kind* of word they should be thinking of --> classification problem, existing models probably cover this
  3. Provide clue classification so the user knows what *kind* of hint they are looking at --> classification problem
  4. Provide new additional hints so the user can look at an answer from a different perspective --> train a transformer?
- **Data**: NYT crossword clue/answer data from 1993-2021.

### Initial Data Inspection, Basic Cleaning

In [None]:
import pandas as pd
import numpy as np
import chardet

with open('nytcrosswords.csv', 'rb') as file:
    result = chardet.detect(file.read())
    print(result['encoding'])  # Displays the detected encoding

df = pd.read_csv('nytcrosswords.csv', encoding=result['encoding'])

In [None]:
### Minimal Cleaning for Deep Learning 
#drop any null rows
df.dropna(inplace=True)

#simple cleaning - get rid of excess whitespace, let BERT handle the rest!
df['Clue'] = df['Clue'].str.strip()
df['Word'] = df['Word'].str.strip()
df['Date'] = pd.to_datetime(df['Date'], format = '%m/%d/%Y')

#Add character count column that shows the length of each answer
df["Character Count"] = df["Word"].apply(len)

#filter to 2021 for smaller dataset
df = df[df['Date'].dt.year == 2021]

#shuffle so row order carries no date ordering; reset_index(drop=True) also gives a clean 0..n-1 index
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

#inspect the cleaned frame, including the new "Character Count" column
df.info()
df.to_csv('deep_learning_nytcrosswords2021.csv', index = False)

In [None]:
df.head(1)

### Advanced Data Loading - Batch Processing!

In [None]:
### Smaller Data Solution - Pandas and Pytorch
import pandas as pd
import chardet
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

# Detect file encoding
with open('nytcrosswords.csv', 'rb') as file:
    result = chardet.detect(file.read())
    encoding = result['encoding']

# Define batch size and chunk size for efficient loading
batch_size = 16
chunk_size = 10000  # Adjust based on memory and performance

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Define PyTorch Dataset class
class CrosswordDataset(Dataset):
    def __init__(self, df):
        self.clues = df["Clue"].tolist()
        self.answers = df["Word"].tolist()

    def __len__(self):
        return len(self.clues)

    def __getitem__(self, idx):
        clue = self.clues[idx]
        answer = self.answers[idx]

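        # Tokenize clue; max_length=32 is an assumed cap for short clues
        # (without it, padding="max_length" pads every clue to the model maximum of 512 tokens)
        encoded = tokenizer(clue, padding="max_length", truncation=True, max_length=32, return_tensors="pt")

        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "label": answer
        }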

# Load CSV in chunks and process data in batches
chunks = pd.read_csv('nytcrosswords.csv', encoding=encoding, chunksize=chunk_size)

for chunk in chunks:
    # Clean and filter data
    chunk.dropna(inplace=True)
    chunk['Clue'] = chunk['Clue'].str.strip()
    chunk['Word'] = chunk['Word'].str.strip()
    chunk['Date'] = pd.to_datetime(chunk['Date'], format='%m/%d/%Y', errors='coerce')
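    chunk = chunk[chunk['Date'].dt.year == 2021]

    # Skip chunks with no 2021 rows; DataLoader(shuffle=True) raises on an empty dataset
    if chunk.empty:
        continue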

    # Convert to PyTorch dataset and DataLoader
    dataset = CrosswordDataset(chunk)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Training loop (simplified example)
    for batch in dataloader:
        print(batch["input_ids"].shape)  # Check batch shape
        break  # Remove in final implementation


In [None]:
### Big Data Solution - Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, regexp_replace

# Initialize Spark
spark = SparkSession.builder.appName("CrosswordProcessing").getOrCreate()

# Load large crossword dataset
df_spark = spark.read.csv("crossword_dataset.csv", header=True, inferSchema=True)

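# Preprocess: Clean and normalize text in parallel
df_spark = df_spark.withColumn("Clue", lower(col("Clue")))
# Raw string so \w and \s reach the regex engine without Python escape warnings
df_spark = df_spark.withColumn("Clue", regexp_replace(col("Clue"), r"[^\w\s]", ""))

# Convert Spark DataFrame to Pandas if needed
# (toPandas collects every row onto the driver, so only do this once the data fits in memory)
df_pandas = df_spark.toPandas()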


In [None]:
### Real Time Crossword Solving: Kafka + Spark Streaming 
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
producer.produce('crossword-clues', key="clue", value="Capital of France")
producer.flush()


In [None]:
from confluent_kafka import Consumer

consumer = Consumer({'bootstrap.servers': 'localhost:9092', 'group.id': 'clue_solver', 'auto.offset.reset': 'earliest'})
consumer.subscribe(['crossword-clues'])

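# NOTE: `tokenizer`, `model`, and `label_encoder` are assumed to be a fine-tuned classifier
# and its fitted label encoder defined elsewhere; this cell does not create them.
try:
    while True:
        msg = consumer.poll(1.0)  # Wait up to 1 second for a new crossword clue
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        clue = msg.value().decode("utf-8")

        # Predict an answer label for the clue
        tokens = tokenizer(clue, return_tensors="pt")
        with torch.no_grad():
            output = model(**tokens)

        predicted_label = torch.argmax(output.logits, dim=1).item()
        predicted_answer = label_encoder.inverse_transform([predicted_label])

        print(f"Clue: {clue} | Predicted Answer: {predicted_answer[0]}")
finally:
    consumer.close()  # Leave the consumer group cleanly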


### Exploring the Full Dataset 
- Questions
    - How often are answers reused? If answers are reused frequently, then we can reuse their clues as hints!
    - Identify trends over the last few decades in NYT crosswords
        - Can use my clue classification algorithm to break down every crossword
    - **Can I make my own difficulty rating?** 
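
If answers do recur, the clues from their earlier appearances could double as alternative hints. The following is a minimal sketch of that idea, assuming clue/answer pairs are available as plain tuples. The sample rows and the helper names `build_clue_index` and `alternative_clues` are hypothetical and not part of this notebook.

```python
from collections import defaultdict

def build_clue_index(pairs):
    """Map each normalized answer to the distinct clues seen for it, in first-seen order."""
    index = defaultdict(list)
    for clue, answer in pairs:
        key = answer.strip().lower()
        if clue not in index[key]:
            index[key].append(clue)
    return index

def alternative_clues(index, answer, current_clue):
    """Return previously seen clues for `answer`, excluding the one the user already has."""
    return [c for c in index.get(answer.strip().lower(), []) if c != current_clue]

# Hypothetical sample rows, for illustration only
sample_pairs = [
    ("Pie nut", "PECAN"),
    ("Praline ingredient", "PECAN"),
    ("Home of many a Sherpa", "NEPAL"),
    ("Pie nut", "PECAN"),
]
clue_index = build_clue_index(sample_pairs)
hints = alternative_clues(clue_index, "pecan", "Pie nut")
print(hints)
```

An answer that has only ever appeared with one clue returns an empty list, which is the case where a generated hint would be needed instead.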

In [None]:
#Basic Loading and Cleaning Again
import pandas as pd
import chardet

with open('nytcrosswords.csv', 'rb') as file:
    result = chardet.detect(file.read())
    print(result['encoding'])  # Displays the detected encoding

df = pd.read_csv('nytcrosswords.csv', encoding=result['encoding'])

#drop any null rows
df.dropna(inplace=True)

#convert date col to datetimetype 
df['Date'] = pd.to_datetime(df['Date'], format = '%m/%d/%Y')

#Normalize clues and answers to account for any discrepancies 
import re
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.strip()  # Remove leading/trailing spaces
    return text

df['Clue'] = df['Clue'].apply(clean_text)
df['Word'] = df['Word'].apply(clean_text)

df.info()

In [None]:
#Group by, look for duplicates
#Preview only the Word and Clue columns (positions 1 and 2); Date is not needed here
df.iloc[:, 1:3].info()
#Count how often each answer (the 'Word' column) appears, sort, and print
df1 = df.groupby('Word').size().reset_index(name='Count')
df1 = df1.sort_values(by='Count', ascending=False)
print(df1.head())

#Do same groupby for clues
df2 = df.groupby('Clue').size().reset_index(name='Count')
df2 = df2.sort_values(by = 'Count', ascending = False)
print(df2.head())

# Group by 'Clue' and count unique answers
clue_group = df.groupby('Clue')['Word'].nunique().reset_index()
clue_group.rename(columns={'Word': 'Unique_Answers'}, inplace=True)
# Sort by number of unique answers
clue_group = clue_group.sort_values(by='Unique_Answers', ascending=False)
# View top results
print(clue_group.head())


In [None]:
#Naive way to do it 
import numpy as np
import matplotlib.pyplot as plt
clues = df['Clue']
total_clues = len(clues)
uniq_clues = len(set(clues))
diff_clues = total_clues - uniq_clues
prop_clues = np.round( (1 -  (uniq_clues/total_clues)) * 100, 1)
print(f"There are {diff_clues} duplicate clues which is {prop_clues}% of the total. There are {total_clues} total clues and {uniq_clues} unique clues.")

answers = df['Word']
total_answers = len(answers)
uniq_answers = len(set(answers))
diff_answers = total_answers - uniq_answers
prop_answers = np.round((1 - (uniq_answers/total_answers)) * 100, 1)
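print(f"There are {diff_answers} duplicate answers which is {prop_answers}% of the total. There are {total_answers} total answers and {uniq_answers} unique answers.")

#plt.plot on two scalars draws nothing visible, so compare the two counts with a bar chart
plt.bar(['Unique clues', 'Unique answers'], [uniq_clues, uniq_answers])
plt.ylabel('Count')
plt.show()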

### General Approach Thoughts

- Provide various hints
    - Synonym of answer
    - Antonym of answer
    - Help give context to the clue - sentiment analysis, text classification 
    - Help give context to the answer
        - What kind of word etc.
    - Answer used in sentence
    - Varying level of hints
- How can I incorporate NLP?
- Problem: some answers are multiple words/phrase/proper noun/name
- Later quality of life stuff
    - Autochecker
    - Full puzzle checker
    - Single word checker 
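
The "varying level of hints" idea can be sketched without any model at all, since the answer is assumed to be known. The snippet below is only an illustration under that assumption; the function name `tiered_hints` and the three hint levels are hypothetical choices, not something the project has settled on.

```python
def tiered_hints(answer):
    """Return hints ordered from least to most revealing for a known answer."""
    word = answer.strip().upper()
    if not word:
        return []
    letter_count = sum(1 for ch in word if ch.isalpha())
    hints = [f"The answer has {letter_count} letters."]
    hints.append(f"It starts with '{word[0]}'.")
    # Reveal every other character and mask the remaining letters
    pattern = "".join(
        ch if i % 2 == 0 or not ch.isalpha() else "_"
        for i, ch in enumerate(word)
    )
    hints.append(f"Pattern: {pattern}")
    return hints

for hint in tiered_hints("pecan"):
    print(hint)
```

Semantic hints (synonyms, answer category, the answer used in a sentence) could slot in between these levels once the corresponding models exist.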

### Classification of Clue Types 
- Goal: classify clue types as definition, wordplay, anagram, name/etc.
- Necessary steps:
    - Create labels for different clue types
    - Train some classification program using labeled data
        - Options: Naive Bayes from DS122, fine tune BERT 
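
Before training anything, a first pass at labels could come from simple surface rules, which can later be corrected by hand and used as training data. The sketch below is one assumed way to do that. The rule set and label names are hypothetical and rely on common crossword conventions (a trailing question mark often signals wordplay, `___` marks a fill-in-the-blank), so they would need checking against the real data.

```python
import re

# Hypothetical heuristics; the rules and label names are illustrative only
RULES = [
    ("fill-in-the-blank", re.compile(r"_{2,}")),
    ("abbreviation", re.compile(r"\b(abbr|for short|briefly)\b", re.IGNORECASE)),
    ("wordplay", re.compile(r"\?\s*$")),
    ("quote", re.compile(r'^".*"$')),
]

def weak_label(clue, default="definition"):
    """Return the first matching rule label for a clue, or `default` if none match."""
    text = clue.strip()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return default

for example in ["___ Lanka", "Org. for doctors: Abbr.", "Flower holder?", "Pie nut"]:
    print(example, "->", weak_label(example))
```

Rules are checked in order, so a clue that matches several patterns gets the label of the first rule that fires.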

### Other Avenues of Exploration for similar words
- WordNet --> directly pull synonyms, antonyms, related words
- Thesaurus APIs --> fetch related words dynamically
- Context-Aware Model --> fine-tune a pre-trained model like BERT to predict answers or generate hints based on clue embeddings
    - Use hyperparameter tuning
- Real-Time Suggestions --> leverage APIs to fetch synonyms/related terms in real time. Probably useful if we haven't seen the answer yet
    - Use GPT APIs for generating context-aware hints
- Evaluation
    - Try on clues not in the dataset    

### Hint Help Feature 1: Answer Classification and Similar Answer Generation

#### Goal/Explanation

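- **Goal**: Classify each answer into a category and generate similar or related answers that can be offered as hints.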
- **Methods**: Answer Classification Model #1 --> Using GLiNER
    - **Why GLiNER** - Generalist and Lightweight NER --> designed to recognize entities beyond the typical predefined categories. Covers a wide array of entities and allows flexible, customizable labels!
    - CAN ALSO BE USED FOR HINT CLASSIFICATION!!!
    - **NEW IDEA** - Further **improve NER** by using GLiNER on the hints too! Combine both inputs so we can be more confident about what kind of answer it is!
    - This way we can also maybe reverse engineer the answer/generate hints, and maybe come up with a new model trained on that output
    - Other models tried: spaCy, RoBERTa, hybrid spaCy + RoBERTa --> too many None categorizations, which aren't really helpful (due to limited entity choice)
    - https://github.com/urchade/GLiNER/blob/main/README.md
- **Challenges**: How to handle answers that are multiple words combined, made-up words, names, pronouns, or acronyms
    - Lots of possible edge cases for crossword answers:
        - Multi-word answers --> lemmatize each word separately, rejoin them
        - Made-up words/slang --> use original word if not recognized
        - Proper nouns --> detect named entities, keep same
        - Acronyms --> try to identify ...
        - Foreign words --> keep unchanged
        - Hyphenated words --> keep if word exists
        - Contractions --> expand/keep original
        - Numbers in words --> keep 
- **Areas of improvement**
    - Find more explicit ways to handle the edge cases, ie. use a super long list of common acronyms/slang etc.
- **NEW IDEAS**:
    - One-shot/few-shot learning/prompting --> have the model do some task or answer a question it has never seen
    - Use knowledge graphs/RAG/other things
    - Other learning approaches to consider
        - Reinforcement learning -->
        - Self-supervised
        - Semi-supervised
    - FOR HINTS
        - Look for fill in the blanks with ___ and then use a mask model!
- huggingface pipeline function + quick tour: https://huggingface.co/docs/transformers/en/quicktour#trainer---a-pytorch-optimized-training-loop
    - NER - persons/organizations/locations in a sentence
        - classify each word in a sentence:
        - can you customize labels?
    - Maybe token classification instead???
        - NER/POS  
    - sentiment analysis
    - zero-shot classification - tries to label the text using whatever labels you provide
        - Can use for hints/clues/predictions
        - Find the best models
    - text generation --> finishes a prompt using predicted words. Set a max length. Get as many return sequences as you want.
    - Fill mask - predicts which word goes in the blank (mask) and returns score/token/token_str. Get the top k answers
    - question-answering
        - give question
        - give context!
- **Current Areas of Improvement**: The POS tagging is unreliable (for example, title-cased answers such as "Pecan" are tagged PROPN). The word embedding section could also use knowledge graphs to get relationships for some answers/clues.  
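
The fill-in-the-blank idea from the notes above needs a small preprocessing step before any mask model can be used: find clues containing `___` and swap the blank for the model's mask token. The sketch below covers only that step. The helper name `to_masked_prompt` is hypothetical, and `[MASK]` is assumed because it is the mask token of BERT-style tokenizers such as `bert-base-uncased`; other models use different tokens.

```python
import re

BLANK = re.compile(r"_{2,}")  # two or more underscores mark a blank

def to_masked_prompt(clue, mask_token="[MASK]"):
    """Return the clue with its first blank replaced by mask_token, or None if there is no blank."""
    if not BLANK.search(clue):
        return None
    return BLANK.sub(mask_token, clue, count=1)

for clue in ["___ Lanka", "Pie nut", "Just ___ water"]:
    print(repr(clue), "->", to_masked_prompt(clue))
```

The resulting string could then be passed to a fill-mask model. A single mask token stands for one wordpiece, so multi-word or long answers would need a different strategy.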

##### Zero Shot Models
- BART: BART (Bidirectional and Auto-Regressive Transformers) is a denoising autoencoder pre-trained on a large corpus. It is beneficial for generating textual data and has shown promising results in zero-shot classification tasks.
- T5: T5 (Text-to-Text Transfer Transformer) is a transformer model that frames almost all NLP tasks as text-to-text problems. It can be adapted for zero-shot learning by providing the task description as input alongside the text to classify.
- GPT-3: GPT-3 (Generative Pre-trained Transformer 3) is one of the most significant language models available and has impressive zero-shot capabilities. Although GPT-3 might not be directly accessible due to its size, smaller versions and similar models are available.
- RoBERTa: RoBERTa (A Robustly Optimized BERT Pre-training Approach) is a variant of BERT that modifies the training process to improve its performance. It is widely used for various NLP tasks, including zero-shot classification.
- BERT: BERT (Bidirectional Encoder Representations from Transformers) is one of the pioneering language models for NLP. While not explicitly designed for zero-shot learning, it can still perform reasonably well in zero-shot classification tasks.
- ALBERT: ALBERT (A Lite BERT) is a lightweight version of BERT that reduces the model’s size and training time while maintaining performance. It can be a good choice for zero-shot classification in resource-constrained environments.

##### Zero Shot Alternatives
Several alternative approaches to zero-shot learning exist for classification tasks. These methods vary in their complexity, data requirements, and performance. Some common alternatives include:

- Supervised Learning: A model is trained on a labelled dataset with examples for each class it needs to classify. This is the traditional approach to classification and is highly effective when a sufficient amount of labelled training data is available for all classes.
- Few-Shot Learning: Few-shot learning lies between zero-shot and fully supervised learning. It aims to classify data with only a few examples for each class. This approach is advantageous when labelled data is scarce for certain classes but available for others.
- Semi-Supervised Learning: Semi-supervised learning combines labelled and unlabelled data during training. It can leverage labelled examples for some classes and unlabelled data to improve classification performance.
- Transfer Learning: Transfer learning involves pre-training a model on a large dataset and then fine-tuning it on a smaller labelled dataset specific to the target task. This approach can be practical when the pre-trained model captures relevant features useful for the classification task.
- Multi-Task Learning: In multi-task learning, a single model is trained to perform multiple related tasks simultaneously. By leveraging knowledge from other related tasks, it can help improve classification performance.
- Active Learning: Active learning is an iterative approach where the model actively selects the most informative instances for labelling. This reduces the need for large amounts of labelled data and can improve classification performance with a smaller labelled dataset.
- Ensemble Methods: Ensemble methods combine predictions from multiple models to obtain more accurate and robust classifications. They can be used to improve classification performance when individual models might struggle to handle specific classes.
- Domain Adaptation: Domain adaptation aims to transfer knowledge from a source domain with labelled data to a target domain that has different characteristics and lacks labelled data. It can be helpful when the target domain has a different distribution from the source domain.
- Meta-Learning: Meta-learning, also known as “learning to learn,” trains a model to learn how to adapt quickly to new tasks with limited data. It can help handle new classes with only a few examples.

**GLiNER Notes**
- Word-level models work better for finding multi-word entities and for highlighting sentences or paragraphs. They require additional output postprocessing, which is described in the corresponding model card.
- GLiNER NuNerZero: numind/NuNER_Zero (MIT) - reported as +3% more powerful than GLiNER Large v2.1, better suited to detecting multi-word entities
- GLiNER NuNerZero 4k context: numind/NuNER_Zero-4k (MIT) - 4k-long-context NuNerZero
- Domain-specific models
    - Personally Identifiable Information: urchade/gliner_multi_pii-v1 (Apache 2.0). This model is capable of recognizing various types of personally identifiable information (PII), including but not limited to these entity types: person, organization, phone number, address, passport number, email, credit card number, social security number, health insurance id number, date of birth, mobile phone number, bank account number, medication, cpf, driver's license number, tax identification number, medical condition, identity card number, national id number, ip address, email address, iban, credit card expiration date, username, health insurance number, registration number, student id number, insurance number, flight number, landline phone number, blood type, cvv, reservation number, digital signature, social media handle, license plate number, cnpj, postal code, passport_number, serial number, vehicle registration number, credit card brand, fax number, visa number, insurance company, identity document number, transaction number, national health insurance number, cvc, birth certificate number, train ticket number, passport expiration date, and social_security_number.

#### Data/Model Loading

In [23]:
#Load in cleaned data
import pandas as pd
import numpy as np
df = pd.read_csv('deep_learning_nytcrosswords2021.csv')
df.info()

#for now just pick small subset of data, since this section doesn't really require training 
df = df.sample(n=100)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23420 entries, 0 to 23419
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Date             23420 non-null  object
 1   Word             23420 non-null  object
 2   Clue             23420 non-null  object
 3   Character Count  23420 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 732.0+ KB


#### Data Preprocessing and Classification 


**Explanation of the Code:**
- Purpose: This section processes crossword answers to clean, categorize, and enhance them with NLP techniques. The goal is to prepare the data for generating related words and analyzing patterns in crossword clues.
  
- Key Steps:
  1. Load & Preprocess Crossword Data  
     - Loads a dataset of crossword clues and answers.  
     - Uses a small subset (`n=100`) for quick processing.  
     - Converts answers to lowercase for consistency.

  2. Load NLP Models & Tools  
     - Uses `spaCy` for text processing and part-of-speech (POS) tagging.  
     - `PunctuationModel` restores punctuation (capitalization is added later by the title-casing in `restore_spacing`).  
     - `wordsegment` helps correct improperly formatted multi-word answers.  
     - `GLiNER` classifies answers into categories (e.g., Person, Place, Food, Science).  

  3. Define Helper Functions  
     - `restore_spacing(word)` → Fixes spacing for multi-word answers.  
     - `detect_multi_word(word)` → Identifies if an answer has multiple words.  
     - `classify_pos(word)` → Tags the part-of-speech using `spaCy`.  
     - `classify_with_gliner(answer, clue)` → Uses GLiNER to assign semantic categories.  
     - `lemmatize_word(word)` → Converts words to their base form (e.g., "running" → "run").  

  4. Process the Crossword Dataset  
     - Applies all the above functions to clean and enrich the dataset.  
     - Restores punctuation, fixes multi-word formatting, performs POS tagging, and classifies words.  

  5. Display Processed Data  
     - Shows the first few rows of the cleaned and categorized crossword data.  

This structured approach ensures that crossword answers are in a useful format for further analysis, including word similarity and hint generation.



In [25]:
import pandas as pd
import spacy
import warnings
from transformers import pipeline
from wordsegment import load, segment
from deepmultilingualpunctuation import PunctuationModel
from gliner import GLiNER
from gliner.multitask import GLiNERClassifier

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Define category labels for GLiNER classification
LABELS = [
    "Person", "Place", "Thing", "Animal", "Food", "Science", "Art", "Sport",
    "History", "Literature", "Music", "Brand", "Abbreviation", "Acronym",
    "Foreign", "Wordplay (Pun/Anagram/Homophone)", "Mythology", "Religion", "Vehicle", "Clothing",
    "Instrument", "Plant", "Event", "Concept", "Miscellaneous",
    "Slang", "Geography", "Object", "Technology", "Expression"
]

# Load NLP models once (global initialization for efficiency)
print("Loading models...")
nlp = spacy.load("en_core_web_trf", disable=["parser"])  # Transformer-based NLP model
punctuation_model = PunctuationModel()  # Restores punctuation (it does not change letter case)
load()  # Load word segmentation model

# Load GLiNER multitask model for classification
model_id = "knowledgator/gliner-multitask-v1.0"
gliner_model = GLiNER.from_pretrained(model_id)
classifier = GLiNERClassifier(model=gliner_model)
print("Models loaded successfully!")

# Helper functions
def restore_spacing(word):
    """Fixes spacing for improperly formatted words."""
    return " ".join(segment(word.lower())).title()

def detect_multi_word(word):
    """Detects if an answer consists of multiple words."""
    return "MULTI-WORD" if len(segment(word.lower())) > 1 and not word.islower() else "SINGLE-WORD"

def classify_pos(word):
    """Tags part-of-speech (POS) using spaCy."""
    doc = nlp(word)
    if len(doc) == 0:
        return "UNKNOWN"
    if len(doc) == 1:
        return doc[0].pos_
    # For multi-token answers, keep only content-word tags
    content_tags = [token.pos_ for token in doc if token.pos_ in ["VERB", "NOUN", "ADJ", "ADV"]]
    return " ".join(content_tags)

def classify_with_gliner(answer, clue, top_n=3):
    """Classifies a clue-answer pair using GLiNER and returns the top N predicted labels."""
    formatted_text = f"Clue: {clue}. Answer: {answer}"
    predictions = classifier(formatted_text, classes=LABELS, multi_label=True)
    predictions = predictions[0] if isinstance(predictions, list) and len(predictions) > 0 and isinstance(predictions[0], list) else predictions
    sorted_labels = sorted(predictions, key=lambda x: x["score"], reverse=True)[:top_n]
    return [f"{label['label']} ({label['score']:.2f})" for label in sorted_labels] if sorted_labels else ["Other"]

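def lemmatize_word(word):
    """Lemmatizes the first token of the input to its root form."""
    doc = nlp(word)  # Run the pipeline once instead of twice
    # Only the first token is lemmatized, so "More So" becomes "more"
    return doc[0].lemma_ if len(doc) > 0 else word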

def process_crossword_data(csv_path, sample_size=100):
    """
    Loads, cleans, and processes a crossword dataset.
    - Restores punctuation
    - Fixes improperly formatted multi-word entities
    - Performs named entity recognition (NER) with GLiNER
    - Classifies multi-word terms
    - Performs POS tagging
    - Lemmatizes words to their root form
    """
    print(f"Loading dataset from {csv_path}...")
    df = pd.read_csv(csv_path)
    
    if sample_size:
        df = df.sample(n=sample_size)  # Use a smaller subset for faster processing
    
    df['Word'] = df['Word'].str.lower()  # Normalize to lowercase

    print("Processing crossword data...")
    df["Fixed Word"] = df["Word"].apply(lambda x: punctuation_model.restore_punctuation(x))
    df["Spaced Word"] = df["Fixed Word"].apply(restore_spacing)
    df["GLiNER Labels"] = df.apply(lambda row: classify_with_gliner(row["Spaced Word"], row["Clue"]), axis=1)
    df["Multi Word"] = df["Spaced Word"].apply(detect_multi_word)
    df["POS Tag"] = df["Spaced Word"].apply(classify_pos)
    df["Lemmatized Word"] = df["Spaced Word"].apply(lemmatize_word)

    print("Processing complete!")
    return df

Loading models...


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Models loaded successfully!


In [26]:
#Process the df:
df = process_crossword_data("deep_learning_nytcrosswords2021.csv", sample_size=100)
df.head(10)

Loading dataset from deep_learning_nytcrosswords2021.csv...
Processing crossword data...


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Processing complete!


Unnamed: 0,Date,Word,Clue,Character Count,Fixed Word,Spaced Word,GLiNER Labels,Multi Word,POS Tag,Lemmatized Word
4058,2021-06-08,tara,Figure skater Lipinski,4,tara.,Tara,"[Sport (0.98), Person (0.96)]",SINGLE-WORD,PROPN,Tara
20951,2021-02-05,nepal,Home of many a Sherpa,5,nepal.,Nepal,[other (1.00)],SINGLE-WORD,PROPN,Nepal
17684,2021-06-30,moreso,On a larger scale,6,moreso.,More So,"[History (0.68), Music (0.67), Literature (0.65)]",MULTI-WORD,ADV ADV,more
18587,2021-03-30,pecan,Pie nut,5,pecan.,Pecan,"[Food (0.98), Science (0.58)]",SINGLE-WORD,PROPN,Pecan
6161,2021-01-05,highc,An alto probably can't hit it,5,highc.,High C,[Music (0.97)],MULTI-WORD,NOUN,High
15148,2021-01-14,ents,Docs treating vertigo,4,ents.,Ents,"[Music (0.64), Science (0.64), History (0.62)]",SINGLE-WORD,NOUN,ent
6638,2021-05-29,itdepends,"""Not always""",9,itdepends.,It Depends,"[Literature (0.65), History (0.63), Music (0.63)]",MULTI-WORD,VERB,it
20151,2021-05-03,palau,Island nation in the western Pacific,5,palau.,Palau,[History (0.82)],SINGLE-WORD,PROPN,Palau
6176,2021-06-13,plus,Not only that,4,plus.,Plus,"[History (0.68), Music (0.65), Literature (0.59)]",SINGLE-WORD,CCONJ,plus
18488,2021-03-02,solo,"""Star Wars"" pilot who, despite his name, flies...",4,solo.,Solo,[Literature (0.60)],SINGLE-WORD,PROPN,Solo


#### Feature 1: Adding Similar/Related Words

#### Old Testing

##### Gliner Testing

In [26]:
#GLINER experiment with multiple categories
labels = [
    # Core categories
    "Person (Historical/Literary/Fictional)", 
    "Place (Geographic/Constructed)", 
    "Animal (Real/Mythical)",
    "Food (Dish/Ingredient)",
    "Science (Biology/Chemistry/Physics)", 
    "Art (Visual/Performing)", 
    "Sport (Game/Athlete/Equipment)",
    "Literature (Book/Author/Character)",
    "Music (Genre/Instrument/Song)",
    "Brand (Company/Product)",
    "Vehicle (Type/Brand)",
    "Plant (Flower/Tree)",
    "Event (Historical/Cultural)",
    
    # Crossword-specific helpers
    "Abbreviation (Common/Initialism)",
    "Foreign Word (Language-Specific)",  # e.g., French, Latin
    "Wordplay (Pun/Anagram/Homophone)", 
    "Mythology (Deity/Creature)",
    "Religion (Practice/Figure)",
    "Concept (Abstract/Idea)",
    "Object (Everyday/Tool)",
    "Clothing (Type/Brand)",
    "Slang/Colloquialism",
    "Acronym (Pronounceable)",  # e.g., NASA vs. FBI
    "Geography (Landform/Region)",
    "Time (Unit/Historical Era)",
    
    # Fallback
    "Miscellaneous"
]

label_groups = [
    # Group 1: Core entities
    ["Person (Historical/Literary/Fictional)", "Place (Geographic/Constructed)", "Animal (Real/Mythical)"],
    
    # Group 2: Culture & Activities
    ["Art (Visual/Performing)", "Sport (Game/Athlete/Equipment)", "Music (Genre/Instrument/Song)"],
    
    # Group 3: Abstract/Crossword-Specific
    ["Wordplay (Pun/Anagram/Homophone)", "Abbreviation (Common/Initialism)", "Foreign Word (Language-Specific)"],
    
    # Group 4: Objects & Brands
    ["Brand (Company/Product)", "Vehicle (Type/Brand)", "Clothing (Type/Brand)"],
    
    # Group 5: Science & Nature
    ["Science (Biology/Chemistry/Physics)", "Plant (Flower/Tree)", "Geography (Landform/Region)"]
]

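def classify_crossword_answer(text, label_groups, top_n=3, threshold=0.1):
    """Runs GLiNER once per label group and returns the top_n labels by score."""
    # NOTE: `model` is assumed to be a GLiNER instance loaded in an earlier cell
    all_entities = []
    for group in label_groups:
        entities = model.predict_entities(text, labels=group, threshold=threshold)
        all_entities.extend(entities)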
    
    # Deduplicate and sort
    seen = set()
    unique_entities = []
    for ent in sorted(all_entities, key=lambda x: x["score"], reverse=True):
        key = (ent["text"], ent["label"])
        if key not in seen:
            seen.add(key)
            unique_entities.append(ent)
    
    # Format for crossword hints
    formatted = []
    for ent in unique_entities[:top_n]:
        label = ent["label"].split(" (")[0]  # Simplify for output (e.g., "Person" instead of "Person (Historical/Literary/Fictional)")
        formatted.append(f"{label} ({ent['score']:.2f})")
    
    return formatted if formatted else ["Miscellaneous"]

clue = "King in chess"
answer = "Rook"

# Add context to help GLiNER resolve ambiguity
context = f"Clue: '{clue}' (Answer: '{answer}')"
result = classify_crossword_answer(context, label_groups)

print(f"Labels for '{answer}': {result}")

Labels for 'Rook': ['Foreign Word (0.76)', 'Sport (0.67)', 'Person (0.66)']


##### Test Approach 1: Word Embeddings (word2vec, Wordnet)

In [9]:
#Use the word2vec model (pretrained Google News vectors that map words to embeddings) to identify the most similar words
import gensim.downloader as api

# Load the pretrained model
wv = api.load('word2vec-google-news-300')
print('model loaded')

#Also use WordNet for more structured, tighter synonym sets
import nltk
nltk.download('wordnet')

print('Models loaded!')

model loaded
Models loaded!


[nltk_data] Downloading package wordnet to C:\Users\Sean
[nltk_data]     Salvador\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [10]:
#Let's try word embeddings, with some filtering to make sure generated words are at least under the same category
#first try wordnet then word2vec for more complexity
import gensim.downloader as api
import spacy
import numpy as np
from nltk.corpus import wordnet as wn
from gliner import GLiNER

# Define labels
labels = [
    "Person", "Place", "Thing", "Animal", "Food", "Science", "Art", "Sport",
    "History", "Literature", "Music", "Brand", "Abbreviation", "Foreign",
    "Wordplay", "Mythology", "Religion", "Vehicle", "Clothing", "Instrument",
    "Plant", "Event", "Concept", "Miscellaneous"
]

def lemmatize_word(word):
    """Lemmatizes a word using spaCy."""
    doc = nlp(word)
    return doc[0].lemma_ if doc else word

def get_gliner_labels(gliner_labels):
    """Extracts clean category labels from GLiNER output (removes confidence scores)."""
    return {label.split(" (")[0] for label in gliner_labels}  # Remove score (0.XX) part

def predict_gliner_labels(word):
    """Runs GLiNER to predict entity labels for a word."""
    # NOTE: `model_NER` is assumed to be a GLiNER instance loaded in an earlier cell
    gliner_prediction = model_NER.predict_entities(word, labels, threshold=0.2)
    return {entity["label"] for entity in gliner_prediction} if gliner_prediction else set()

def is_valid_synonym(word, answer_lemmas, seen_lemmas, answer_labels, word_labels):
    """
    Determines if a synonym is valid based on:
    - **Matching GLiNER labels** (must share at least one).
    - **No duplicate lemmatized words** (answer or previous synonyms).
    - **Ensuring variety in generated words**.
    """
    word_lemma = lemmatize_word(word)

    # ✅ Must match at least one GLiNER label
    if not answer_labels & word_labels:
        return False  # No category overlap

    # ✅ Ensure uniqueness by checking lemmas
    if word_lemma in answer_lemmas or word_lemma in seen_lemmas:
        return False

    return True

def get_wordnet_synonyms(word):
    """Fetches synonyms from WordNet."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))  # Replace underscores with spaces
    return list(synonyms)

def get_word2vec_synonyms(answer, gliner_labels, top_n=5):
    """
    Generates synonyms using Word2Vec while ensuring:
    - Labels match GLiNER categories.
    - No duplicate lemmatized words.
    - Multi-word handling via embedding averaging.
    """
    answer = answer.lower()
    answer_lemmas = {lemmatize_word(word) for word in answer.split()}  # Root forms of answer words
    answer_labels = get_gliner_labels(gliner_labels)  # Get clean GLiNER labels
    words = answer.split()

    # ✅ Try getting a direct match for the full phrase
    if answer in wv:
        similar_words = wv.most_similar(answer, topn=top_n * 5)  # Fetch extra words to filter
    else:
        # ✅ If phrase is missing, average embeddings of individual words
        valid_vectors = [wv[word] for word in words if word in wv]
        if not valid_vectors:
            return []  # No valid embeddings found

        avg_vector = np.mean(valid_vectors, axis=0)  # Compute mean embedding
        similar_words = wv.similar_by_vector(avg_vector, topn=top_n * 5)

    similar_words = [w[0] for w in similar_words]  # Extract words only

    # ✅ Filtering: Keep only words that match at least **one** category and aren't duplicates
    cleaned_words = set()
    seen_lemmas = set()  # Track lemmas to avoid repetition

    for word in similar_words:
        word_labels = predict_gliner_labels(word)  # Classify Word2Vec word with GLiNER

        if is_valid_synonym(word, answer_lemmas, seen_lemmas, answer_labels, word_labels):
            lemma_word = lemmatize_word(word)
            seen_lemmas.add(lemma_word)  # Prevent duplicates
            cleaned_words.add(word)  # Keep original word

        if len(cleaned_words) >= top_n:
            break  # ✅ Stop once we have enough valid words

    return list(cleaned_words)

def get_combined_synonyms(answer, gliner_labels, top_n=5):
    """
    Combines **WordNet and Word2Vec** for better synonym generation.
    - WordNet first (higher-quality synonyms).
    - Word2Vec fills in gaps.
    """
    # ✅ Step 1: Get WordNet synonyms
    wordnet_synonyms = get_wordnet_synonyms(answer)

    # ✅ Step 2: Get Word2Vec synonyms (only if WordNet gave too few results)
    if len(wordnet_synonyms) < top_n:
        w2v_synonyms = get_word2vec_synonyms(answer, gliner_labels, top_n=top_n - len(wordnet_synonyms))
    else:
        w2v_synonyms = []

    # ✅ Combine results, ensuring uniqueness
    combined_synonyms = list(set(wordnet_synonyms + w2v_synonyms))

    return combined_synonyms[:top_n]  # Limit results

# ✅ **Example Test Cases**
test_data = [
    ("Black Sea", ["Foreign Word (0.90)", "Geography (0.86)", "Place (0.82)"]),
    ("Einstein", ["Person (0.98)", "Science (0.95)"]),
    ("Shakespeare", ["Person (0.96)", "Literature (0.92)"]),
    ("Nike", ["Brand (0.99)", "Sport (0.91)"]),
    ("Amazon", ["Brand (0.92)", "Place (0.85)"]),
]

print("\n🔹 **Testing WordNet + Word2Vec Synonyms**")
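# Use a distinct loop variable so the global `labels` list passed to GLiNER is not overwritten
for answer, answer_gliner_labels in test_data:
    print(f"Synonyms for '{answer}' → {get_combined_synonyms(answer, answer_gliner_labels)}")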




🔹 **Testing WordNet + Word2Vec Synonyms**


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Synonyms for 'Black Sea' → []
Synonyms for 'Einstein' → ['mastermind', 'Albert Einstein', 'genius', 'brain', 'Einstein']
Synonyms for 'Shakespeare' → ['Shakspere', 'Bard of Avon', 'William Shakspere', 'Shakespeare', 'William Shakespeare']
Synonyms for 'Nike' → ['Nike']
Synonyms for 'Amazon' → ['virago', 'Amazon River', 'amazon', 'Amazon']
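
The category filter in `is_valid_synonym` only works if both label sets are in the same form. A scored string such as `'Person (0.98)'` never equals the bare label `'Person'`, so both sides need to be normalized before they are intersected. Below is a minimal, dependency-free sketch of that check; `clean_labels` and `shares_category` are hypothetical helper names used for illustration, not functions from the cell above.

```python
def clean_labels(scored_labels):
    """Strip a trailing ' (0.XX)' score, leaving only the category name."""
    return {label.split(" (")[0] for label in scored_labels}


def shares_category(answer_labels, candidate_labels):
    """True if the two label collections overlap once scores are removed."""
    return bool(clean_labels(answer_labels) & clean_labels(candidate_labels))


answer_labels = ["Person (0.98)", "Science (0.95)"]

# Raw strings do not overlap, normalized labels do.
print(set(answer_labels) & {"Person"})
print(shares_category(answer_labels, ["Person"]))
print(shares_category(["Brand (0.99)"], ["Place"]))
```

Normalizing inside the comparison, rather than relying on the caller, keeps the filter from silently rejecting every candidate when one side still carries scores.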


##### Test Approach 2: Knowledge Graphs + ConceptNet for better relationships

In [27]:
#VERY VERY GOOD
#Use all relevant relationships; mix of the previous two approaches, with related words + output diversity
import requests
import difflib

def get_conceptnet_synonyms_and_related(word, top_n=10, weight_threshold=0.5, similarity_threshold=0.8):
    """Fetch synonyms and related words from ConceptNet while ensuring diversity and relevance."""
    
    word = word.lower().replace(" ", "_")  # Format for ConceptNet API
    base_url = "http://api.conceptnet.io"

    # ✅ **Expanded synonym retrieval (directly interchangeable words)**
    synonym_rels = ["/r/IsA", "/r/Synonym", "/r/SimilarTo"]
    synonym_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in synonym_rels]

    synonyms = set()
    for url in synonym_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            synonyms.add(edge['end']['label'])

    # ✅ **Expanded related term retrieval (broader conceptual connections)**
    related_rels = ["/r/PartOf", "/r/HasA", "/r/UsedFor", "/r/DerivedFrom", "/r/RelatedTo"]
    related_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in related_rels]

    related_terms = set()
    for url in related_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            related_terms.add(edge['end']['label'])

    # ✅ **Weight-based filtering for related terms**
    related_url = f"{base_url}/related/c/en/{word}?filter=/c/en"
    related_response = requests.get(related_url).json()

    weighted_related_terms = sorted(
        [(edge["@id"].split("/")[-1].replace("_", " "), edge["weight"]) 
         for edge in related_response.get("related", []) if edge["weight"] > weight_threshold],
        key=lambda x: x[1], reverse=True  # Sort by weight (highest first)
    )

    # Merge weighted related terms
    for term, _ in weighted_related_terms[:top_n]:
        related_terms.add(term)

    # ✅ **Filter out near-duplicates and self-referential terms**
    def is_too_similar(word, seen_words):
        """Check if a word is too similar to an already included word using similarity ratio."""
        return any(difflib.SequenceMatcher(None, word, seen).ratio() > similarity_threshold for seen in seen_words)

    def is_containing_original(word, original):
        """Check if the word contains the original answer or is an exact match."""
        return word.lower() == original.lower() or original.lower() in word.lower()

    # Filter synonyms & related terms for uniqueness and no self-reference
    filtered_synonyms = []
    seen_words = set()
    for syn in synonyms:
        if not is_too_similar(syn, seen_words) and not is_containing_original(syn, word):
            filtered_synonyms.append(syn)
            seen_words.add(syn)

    filtered_related = []
    seen_words = set()
    for rel in related_terms:
        if not is_too_similar(rel, seen_words) and not is_containing_original(rel, word):
            filtered_related.append(rel)
            seen_words.add(rel)

    return filtered_synonyms[:top_n], filtered_related[:top_n]

# ✅ **Test Cases**
test_words = ["Shakespeare", "Einstein", "Nike", "Black Sea", "Amazon", "Physics"]

print("\n🔹 **Testing Expanded ConceptNet Synonyms & Related Terms**")
for word in test_words:
    synonyms, related_terms = get_conceptnet_synonyms_and_related(word)
    print(f"🔹 **ConceptNet results for '{word}':**")
    print(f"   - Synonyms: {synonyms}")
    print(f"   - Related Terms: {related_terms}\n")



🔹 **Testing Expanded ConceptNet Synonyms & Related Terms**
🔹 **ConceptNet results for 'Shakespeare':**
   - Synonyms: ['a great dramatist']
   - Related Terms: ['seventeenth', 'centuries', 'christopher marlowe', 'harold pinter', 'poet', 'shakespearian', 'macbeth', 'english', 'playwright', 'sonnets']

🔹 **ConceptNet results for 'Einstein':**
   - Synonyms: ['a physicist', 'genius', 'a very intelligent man']
   - Related Terms: ['photon', 'stephen hawking', 'relativity', 'mole', 'genius', 'e mc', 'theoretical physicist', 'paul dirac', 'frequency', 'isaac newton']

🔹 **ConceptNet results for 'Nike':**
   - Synonyms: ['sneaks', 'information appliance']
   - Related Terms: ['victory', 'victoria', 'asteroid', 'sneakers', 'tennis shoes', 'jordans', 'reebok', 'athena', 'triumph', 'adidas']

🔹 **ConceptNet results for 'Black Sea':**
   - Synonyms: ['black sea', 'sea']
   - Related Terms: ['southeastern europe', 'caucasus', 'novorossiysk', 'euxinian', 'euxine', 'white sea', 'turkey', 'inland se
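
The near-duplicate filter above relies on `difflib.SequenceMatcher(...).ratio()`. The sketch below isolates that step so the effect of `similarity_threshold` is easier to see. It takes a list rather than a set, which is an assumption made here so the result does not depend on set iteration order; `dedupe_similar` is a hypothetical name.

```python
import difflib


def dedupe_similar(words, threshold=0.8):
    """Keep a word only if it is not too close to any word already kept."""
    kept = []
    for word in words:
        too_close = any(
            difflib.SequenceMatcher(None, word, seen).ratio() > threshold
            for seen in kept
        )
        if not too_close:
            kept.append(word)
    return kept


candidates = ["sneakers", "sneaker", "adidas"]
print(dedupe_similar(candidates, threshold=0.8))
```

Because the cell above iterates over a `set` before slicing with `[:top_n]`, which of two near-duplicates survives (and which terms make the cut) can vary between runs. Sorting the candidates first would make the output reproducible.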

In [8]:
#VERY GOOD, GOOD FILTERING
#Combine with wordnet
import requests
import difflib
from nltk.corpus import wordnet as wn

def get_wordnet_synonyms(word):
    """Fetch synonyms from WordNet."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
    return list(synonyms)

def get_conceptnet_synonyms_and_related(word, top_n=10, weight_threshold=0.5, similarity_threshold=0.6):
    """Fetch synonyms and related words from ConceptNet and WordNet, ensuring diversity and relevance."""
    
    word = word.lower().replace(" ", "_")  # Format for ConceptNet API
    base_url = "http://api.conceptnet.io"

    # ✅ **Expanded synonym retrieval**
    synonym_rels = ["/r/IsA", "/r/Synonym", "/r/SimilarTo"]
    synonym_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in synonym_rels]

    synonyms = set()
    for url in synonym_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            synonyms.add(edge['end']['label'])

    # ✅ **Integrate WordNet synonyms**
    wordnet_synonyms = get_wordnet_synonyms(word)
    synonyms.update(wordnet_synonyms)

    # ✅ **Expanded related term retrieval**
    related_rels = ["/r/PartOf", "/r/HasA", "/r/UsedFor", "/r/DerivedFrom", "/r/RelatedTo"]
    related_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in related_rels]

    related_terms = set()
    for url in related_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            related_terms.add(edge['end']['label'])

    # ✅ **Weight-based filtering for related terms**
    related_url = f"{base_url}/related/c/en/{word}?filter=/c/en"
    related_response = requests.get(related_url).json()

    weighted_related_terms = sorted(
        [(edge["@id"].split("/")[-1].replace("_", " "), edge["weight"]) 
         for edge in related_response.get("related", []) if edge["weight"] > weight_threshold],
        key=lambda x: x[1], reverse=True
    )

    for term, _ in weighted_related_terms[:top_n]:
        related_terms.add(term)

    # ✅ **Filter out near-duplicates and self-referential terms**
    def is_too_similar(word, seen_words):
        return any(difflib.SequenceMatcher(None, word, seen).ratio() > similarity_threshold for seen in seen_words)

    def is_containing_original(word, original):
        """Check if the word is an exact match (ignoring case) or contains the original term."""
        word_lower = word.lower().replace("_", " ")
        original_lower = original.lower().replace("_", " ")
        
        return word_lower == original_lower or original_lower in word_lower 



    # Filter synonyms & related terms for uniqueness and no self-reference
    filtered_synonyms = []
    seen_words = set()
    for syn in synonyms:
        if not is_too_similar(syn, seen_words) and not is_containing_original(syn, word):
            filtered_synonyms.append(syn)
            seen_words.add(syn)

    filtered_related = []
    seen_words = set()
    for rel in related_terms:
        if not is_too_similar(rel, seen_words) and not is_containing_original(rel, word):
            filtered_related.append(rel)
            seen_words.add(rel)

    return filtered_synonyms[:top_n], filtered_related[:top_n]

# ✅ **Test Cases**
test_words = ["Shakespeare", "Einstein", "Nike", "Black Sea", "Amazon", "Physics"]

print("\n🔹 **Testing Combined ConceptNet & WordNet Synonyms & Related Terms**")
for word in test_words:
    synonyms, related_terms = get_conceptnet_synonyms_and_related(word)
    print(f"🔹 **ConceptNet & WordNet results for '{word}':**")
    print(f"   - Synonyms: {synonyms}")
    print(f"   - Related Terms: {related_terms}\n")



🔹 **Testing Combined ConceptNet & WordNet Synonyms & Related Terms**
🔹 **ConceptNet & WordNet results for 'Shakespeare':**
   - Synonyms: ['Bard of Avon', 'a great dramatist', 'William Shakspere']
   - Related Terms: ['shakespearian', 'alfred lord tennyson', 'seventeenth', 'harold pinter', 'english', 'christopher marlowe', 'macbeth', 'poet', 'dramatist', 'centuries']

🔹 **ConceptNet & WordNet results for 'Einstein':**
   - Synonyms: ['a very intelligent man', 'a physicist', 'brain', 'mastermind', 'genius']
   - Related Terms: ['mole', 'e mc', 'smart', 'relativity', 'paul dirac', 'theoretical physicist', 'frequency', 'stephen hawking', 'isaac newton', 'photon']

🔹 **ConceptNet & WordNet results for 'Nike':**
   - Synonyms: ['information appliance', 'sneaks']
   - Related Terms: ['goddess', 'triumph', 'adidas', 'athena', 'victory', 'reebok', 'sneakers', 'asteroid', 'tennis shoes', 'jordans']

🔹 **ConceptNet & WordNet results for 'Black Sea':**
   - Synonyms: ['sea', 'Euxine Sea']
   - R
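
The weight-based step can be tried offline with a mocked payload. The dictionary below is made up and only mirrors the fields the cell above reads (`"related"`, `"@id"`, `"weight"`); it is not a recorded ConceptNet response, and `top_related` is a hypothetical helper.

```python
# Hypothetical payload shaped like the fields read from the /related endpoint above.
mock_related_response = {
    "related": [
        {"@id": "/c/en/white_sea", "weight": 0.71},
        {"@id": "/c/en/euxine", "weight": 0.93},
        {"@id": "/c/en/boat", "weight": 0.32},
    ]
}


def top_related(payload, weight_threshold=0.5, top_n=10):
    """Return term names above the weight threshold, strongest first."""
    scored = [
        (edge["@id"].split("/")[-1].replace("_", " "), edge["weight"])
        for edge in payload.get("related", [])
        if edge["weight"] > weight_threshold
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:top_n]]


print(top_related(mock_related_response))
```

Separating the parsing from the HTTP call like this also makes it possible to test the filtering logic without network access.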

#### Final Embedding Approach: Combine ConceptNet + WordNet, use Word2Vec as fallback

In [13]:
import requests
import difflib
import gensim.downloader as api
from nltk.corpus import wordnet as wn


# ✅ **Get WordNet Synonyms**
def get_wordnet_synonyms(word):
    """Fetch synonyms from WordNet."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
    return list(synonyms)

# ✅ **Get Word2Vec Similar Words (Only as Backup)**
def get_word2vec_similar_words(word, top_n=5):
    """Fetch similar words from Word2Vec if the word exists in vocabulary."""
    word = word.lower()
    if word in wv:
        return [w[0] for w in wv.most_similar(word, topn=top_n)]
    return []

# ✅ **Get ConceptNet Synonyms & Related Words**
def get_conceptnet_synonyms_and_related(word, min_synonyms=5, min_related=10, weight_threshold=0.5, similarity_threshold=0.8):
    """Fetch synonyms and related words from ConceptNet and WordNet, ensuring diversity and relevance."""
    
    word = word.lower().replace(" ", "_")  # Format for ConceptNet API
    base_url = "http://api.conceptnet.io"

    # ✅ **Expanded synonym retrieval**
    synonym_rels = ["/r/IsA", "/r/Synonym", "/r/SimilarTo"]
    synonym_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in synonym_rels]

    synonyms = set()
    for url in synonym_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            synonyms.add(edge['end']['label'])

    # ✅ **Include WordNet synonyms**
    wordnet_synonyms = get_wordnet_synonyms(word)
    synonyms.update(wordnet_synonyms)

    # ✅ **Expanded related term retrieval**
    related_rels = ["/r/PartOf", "/r/HasA", "/r/UsedFor", "/r/DerivedFrom", "/r/RelatedTo"]
    related_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in related_rels]

    related_terms = set()
    for url in related_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            related_terms.add(edge['end']['label'])

    # ✅ **Weight-based filtering for related terms**
    related_url = f"{base_url}/related/c/en/{word}?filter=/c/en"
    related_response = requests.get(related_url).json()

    weighted_related_terms = sorted(
        [(edge["@id"].split("/")[-1].replace("_", " "), edge["weight"]) 
         for edge in related_response.get("related", []) if edge["weight"] > weight_threshold],
        key=lambda x: x[1], reverse=True
    )

    for term, _ in weighted_related_terms[:min_related]:
        related_terms.add(term)

    # ✅ **Filter out near-duplicates and self-referential terms**
    def is_too_similar(word, seen_words):
        """Check if a word is too similar to an already included word using similarity ratio."""
        return any(difflib.SequenceMatcher(None, word, seen).ratio() > similarity_threshold for seen in seen_words)

    def is_containing_original(word, original):
        """Check if the word is an exact match (ignoring case) or contains the original term."""
        word_lower = word.lower().replace("_", " ")
        original_lower = original.lower().replace("_", " ")
        return word_lower == original_lower or original_lower in word_lower

    # ✅ Filter synonyms & related terms for uniqueness and no self-reference
    filtered_synonyms = []
    seen_words = set()
    for syn in synonyms:
        if not is_too_similar(syn, seen_words) and not is_containing_original(syn, word):
            filtered_synonyms.append(syn)
            seen_words.add(syn)

    filtered_related = []
    seen_words = set()
    for rel in related_terms:
        if not is_too_similar(rel, seen_words) and not is_containing_original(rel, word):
            filtered_related.append(rel)
            seen_words.add(rel)

    # ✅ **Use Word2Vec Backup if Needed**
    if len(filtered_synonyms) < min_synonyms:
        # `seen_words` was reset for the related-term pass, so compare against the synonyms kept so far
        seen_synonyms = set(filtered_synonyms)
        word2vec_synonyms = get_word2vec_similar_words(word, top_n=min_synonyms - len(filtered_synonyms))
        for w2v in word2vec_synonyms:
            if not is_too_similar(w2v, seen_synonyms) and not is_containing_original(w2v, word):
                filtered_synonyms.append(w2v)
                seen_synonyms.add(w2v)

    if len(filtered_related) < min_related:
        word2vec_related = get_word2vec_similar_words(word, top_n=min_related - len(filtered_related))
        for w2v in word2vec_related:
            if not is_too_similar(w2v, seen_words) and not is_containing_original(w2v, word):
                filtered_related.append(w2v)
                seen_words.add(w2v)

    return filtered_synonyms[:min_synonyms], filtered_related[:min_related]

# ✅ **Test Cases**
test_words = ["Shakespeare", "Einstein", "Nike", "Black Sea", "Amazon", "Physics"]

print("\n🔹 **Testing Final ConceptNet + WordNet + Word2Vec Backup**")
for word in test_words:
    synonyms, related_terms = get_conceptnet_synonyms_and_related(word)
    print(f"🔹 **Results for '{word}':**")
    print(f"   - Synonyms: {synonyms}")
    print(f"   - Related Terms: {related_terms}\n")



🔹 **Testing Final ConceptNet + WordNet + Word2Vec Backup**
🔹 **Results for 'Shakespeare':**
   - Synonyms: ['Bard of Avon', 'a great dramatist', 'William Shakspere', 'Shakspere', 'www.angelfire.com']
   - Related Terms: ['shakespearian', 'alfred lord tennyson', 'seventeenth', 'harold pinter', 'english', 'sixteenth', 'christopher marlowe', 'macbeth', 'poet', 'dramatist']

🔹 **Results for 'Einstein':**
   - Synonyms: ['a very intelligent man', 'a physicist', 'brain', 'mastermind', 'brainiac']
   - Related Terms: ['mole', 'e mc', 'smart', 'relativity', 'paul dirac', 'theoretical physicist', 'frequency', 'stephen hawking', 'isaac newton', 'photon']

🔹 **Results for 'Nike':**
   - Synonyms: ['information appliance', 'sneaks', 'schuhe']
   - Related Terms: ['goddess', 'triumph', 'adidas', 'athena', 'victory', 'reebok', 'sneakers', 'asteroid', 'tennis shoes', 'jordans']

🔹 **Results for 'Black Sea':**
   - Synonyms: ['sea', 'Euxine Sea']
   - Related Terms: ['euxinian', 'eastern europe', 'tr
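
The fallback step can be read as "top up a list from a backup source until a minimum is reached". Here is a small sketch of that pattern on plain lists; `top_up` and its arguments are hypothetical names, and the backup list stands in for whatever Word2Vec would return.

```python
def top_up(primary, fallback, minimum, original):
    """Extend `primary` with `fallback` items until `minimum` entries exist."""
    result = list(primary)
    seen = {item.lower() for item in result}
    for candidate in fallback:
        if len(result) >= minimum:
            break
        key = candidate.lower()
        # Skip repeats and anything that contains the answer itself.
        if key in seen or original.lower() in key:
            continue
        result.append(candidate)
        seen.add(key)
    return result


backup = ["Nike Inc", "adidas", "Sneaks", "reebok"]
print(top_up(["sneaks"], backup, minimum=3, original="nike"))
```

The cell above asks Word2Vec for exactly as many words as are missing, so any candidate rejected by the filters leaves the list short. Requesting more candidates than needed and stopping once the minimum is met, as in this sketch, is one way around that.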

In [17]:
#This time, fall back to Word2Vec if not enough words are found. More filtering (URLs, non-English words)
import requests
import difflib
import langdetect
import gensim.downloader as api
from nltk.corpus import wordnet as wn


def get_wordnet_synonyms(word):
    """Fetch synonyms from WordNet."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
    return list(synonyms)

def get_word2vec_synonyms(word, top_n=5):
    """Fetch similar words from Word2Vec if the word exists in vocabulary."""
    word = word.lower()
    if word in wv:
        return [w[0] for w in wv.most_similar(word, topn=top_n)]
    return []

def filter_results(words, original_word):
    """Remove non-English words, URLs, underscores, and overly similar words."""
    filtered = set()
    for word in words:
        word_clean = word.replace('_', ' ')
        try:
            if "www" in word or ".com" in word or ".net" in word:
                continue  # Remove URLs
            if langdetect.detect(word_clean) != "en":
                continue  # Remove non-English words
            if word_clean.lower() == original_word.lower() or original_word.lower() in word_clean.lower():
                continue  # Avoid self-referential words
            filtered.add(word_clean)
        except Exception:
            continue  # Skip words langdetect cannot process (e.g. no detectable features)
    return list(filtered)

def get_conceptnet_synonyms_and_related(word, min_synonyms=5, min_related=10, weight_threshold=0.5, similarity_threshold=0.8):
    """Fetch synonyms and related words from ConceptNet, WordNet, and Word2Vec."""
    word = word.lower().replace(" ", "_")
    base_url = "http://api.conceptnet.io"
    
    # ✅ **Fetch synonyms**
    synonym_rels = ["/r/IsA", "/r/Synonym", "/r/SimilarTo"]
    synonym_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in synonym_rels]
    synonyms = set()
    for url in synonym_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            synonyms.add(edge['end']['label'])
    synonyms.update(get_wordnet_synonyms(word))
    
    # ✅ **Fetch related terms**
    related_rels = ["/r/PartOf", "/r/HasA", "/r/UsedFor", "/r/DerivedFrom", "/r/RelatedTo"]
    related_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in related_rels]
    related_terms = set()
    for url in related_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            related_terms.add(edge['end']['label'])
    
    # ✅ **Weight-based filtering for related terms**
    related_url = f"{base_url}/related/c/en/{word}?filter=/c/en"
    related_response = requests.get(related_url).json()
    weighted_related_terms = sorted(
        [(edge["@id"].split("/")[-1].replace("_", " "), edge["weight"]) 
         for edge in related_response.get("related", []) if edge["weight"] > weight_threshold],
        key=lambda x: x[1], reverse=True
    )
    for term, _ in weighted_related_terms[:min_related]:
        related_terms.add(term)
    
    # ✅ **Apply filtering**
    filtered_synonyms = filter_results(synonyms, word)
    filtered_related = filter_results(related_terms, word)
    
    # ✅ **Ensure minimum outputs, fallback to Word2Vec if needed**
    if len(filtered_synonyms) < min_synonyms:
        word2vec_synonyms = get_word2vec_synonyms(word, top_n=min_synonyms - len(filtered_synonyms))
        filtered_synonyms.extend(filter_results(word2vec_synonyms, word))
    
    if len(filtered_related) < min_related:
        word2vec_related = get_word2vec_synonyms(word, top_n=min_related - len(filtered_related))
        filtered_related.extend(filter_results(word2vec_related, word))
    
    return filtered_synonyms[:min_synonyms], filtered_related[:min_related]

# ✅ **Test Cases**
test_words = [
    "Shakespeare", "Einstein", "Nike", "Black Sea", "Amazon", "Physics",
    "Isaac Newton", "Neural Networks", "Cryptography", "Pi",
    "Tesla", "Nintendo", "McDonald's", "Chair", "Backpack", "Airplane",
    "Relativity", "Artificial Intelligence", "Elon Musk", "Nikola Tesla"
]

print("\n🔹 **Testing Final ConceptNet + WordNet + Word2Vec Backup**")
for word in test_words:
    synonyms, related_terms = get_conceptnet_synonyms_and_related(word)
    print(f"🔹 **Results for '{word}':**")
    print(f"   - Synonyms: {synonyms}")
    print(f"   - Related Terms: {related_terms}\n")


🔹 **Testing Final ConceptNet + WordNet + Word2Vec Backup**
🔹 **Results for 'Shakespeare':**
   - Synonyms: ['William Shakspere', 'Bard of Avon', 'home.htm']
   - Related Terms: ['shakespearian', 'sixteenth', 'christopher marlowe', 'harold pinter', 'playwright', 'home.htm']

🔹 **Results for 'Einstein':**
   - Synonyms: ['a physicist']
   - Related Terms: ['theoretical physicist', 'relativity', 'theory of relativity', 'photon', 'armstrong']

🔹 **Results for 'Nike':**
   - Synonyms: ['information appliance']
   - Related Terms: ['victory', 'tennis shoes', 'athletic footwear', 'athena', 'christian louboutin']

🔹 **Results for 'Black Sea':**
   - Synonyms: ['black sea', 'Black Sea']
   - Related Terms: ['yellow sea', 'white sea', 'inland sea', 'southeastern europe', 'red sea']

🔹 **Results for 'Amazon':**
   - Synonyms: ['fictional female person', 'mythical being']
   - Related Terms: ['overwhelm', 'warrior', 'south america']

🔹 **Results for 'Physics':**
   - Synonyms: ['cathartic', 'phys
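
Hard-coded substrings such as `"www"` and `".com"` let file-like junk (for example `home.htm` in the output above) slip through. A pattern-based check is one alternative. The regular expression below is a heuristic chosen for this sketch, not an exhaustive URL matcher, and `looks_like_url` and `drop_url_like` are hypothetical names.

```python
import re

# Heuristic: a leading "www.", or a dot followed by a 2-4 letter suffix
# that ends at a word boundary (".com", ".net", ".htm", ...).
URL_LIKE = re.compile(r"(^|\s)www\.|\.[a-z]{2,4}\b", re.IGNORECASE)


def looks_like_url(text):
    """True if the text resembles a URL or file name rather than a word."""
    return bool(URL_LIKE.search(text))


def drop_url_like(words):
    """Remove URL-like entries while keeping the original order."""
    return [w for w in words if not looks_like_url(w)]


print(drop_url_like(["Bard of Avon", "home.htm", "www.angelfire.com", "St. Louis"]))
```

The pattern will still misfire on unusual strings (a missing space after a period, for instance), so it is best treated as a first pass ahead of the language check.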

In [21]:
#final version
import requests
import difflib
import langdetect
from nltk.corpus import wordnet as wn
import gensim.downloader as api


def get_wordnet_synonyms(word):
    """Fetch synonyms from WordNet."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
    return list(synonyms)


def get_word2vec_similar_words(word, top_n=5):
    """Fetch similar words from Word2Vec if the word exists in vocabulary."""
    word = word.lower()
    if word in wv:
        return [w[0] for w in wv.most_similar(word, topn=top_n)]
    return []


def is_valid_word(word):
    """Filter out non-English words, URLs, and junk values."""
    if any(substr in word for substr in ["www.", ".com", ".net", "home.htm"]):
        return False
    try:
        if langdetect.detect(word) != "en":
            return False
    except Exception:  # langdetect raises when the text has no detectable features
        return False
    return True


def get_conceptnet_synonyms_and_related(word, min_synonyms=5, min_related=10, weight_threshold=0.5, similarity_threshold=0.6):
    """Fetch synonyms and related words from ConceptNet, WordNet, and Word2Vec."""
    word = word.lower().replace(" ", "_")  # Format for ConceptNet API
    base_url = "http://api.conceptnet.io"

    # ✅ **Expanded synonym retrieval**
    synonym_rels = ["/r/IsA", "/r/Synonym", "/r/SimilarTo"]
    synonym_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in synonym_rels]

    synonyms = set()
    for url in synonym_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            synonyms.add(edge['end']['label'])

    # ✅ **Integrate WordNet synonyms**
    wordnet_synonyms = get_wordnet_synonyms(word)
    synonyms.update(wordnet_synonyms)
    
    # ✅ **Expanded related term retrieval**
    related_rels = ["/r/PartOf", "/r/HasA", "/r/UsedFor", "/r/DerivedFrom", "/r/RelatedTo"]
    related_urls = [f"{base_url}/query?rel={rel}&start=/c/en/{word}&end=/c/en" for rel in related_rels]

    related_terms = set()
    for url in related_urls:
        response = requests.get(url).json()
        for edge in response.get('edges', []):
            related_terms.add(edge['end']['label'])

    # ✅ **Weight-based filtering for related terms**
    related_url = f"{base_url}/related/c/en/{word}?filter=/c/en"
    related_response = requests.get(related_url).json()

    weighted_related_terms = sorted(
        [(edge["@id"].split("/")[-1].replace("_", " "), edge["weight"]) 
         for edge in related_response.get("related", []) if edge["weight"] > weight_threshold],
        key=lambda x: x[1], reverse=True
    )

    for term, _ in weighted_related_terms[:min_related]:
        related_terms.add(term)

    # ✅ **Filter out near-duplicates, non-English words, and junk**
    def is_too_similar(word, seen_words):
        return any(difflib.SequenceMatcher(None, word, seen).ratio() > similarity_threshold for seen in seen_words)

    def is_containing_original(word, original):
        word_lower = word.lower().replace("_", " ")
        original_lower = original.lower().replace("_", " ")
        return word_lower == original_lower or original_lower in word_lower

    # Filter synonyms & related terms for uniqueness and no self-reference
    filtered_synonyms = []
    seen_words = set()
    for syn in synonyms:
        if not is_too_similar(syn, seen_words) and not is_containing_original(syn, word) and is_valid_word(syn):
            filtered_synonyms.append(syn)
            seen_words.add(syn)

    filtered_related = []
    seen_words = set()
    for rel in related_terms:
        if not is_too_similar(rel, seen_words) and not is_containing_original(rel, word) and is_valid_word(rel):
            filtered_related.append(rel)
            seen_words.add(rel)

    # ✅ **Ensure Minimum Synonyms & Related Terms**
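    if len(filtered_synonyms) < min_synonyms:
        # `seen_words` was reset for the related-term pass, so compare against the synonyms kept so far
        seen_synonyms = set(filtered_synonyms)
        word2vec_synonyms = get_word2vec_similar_words(word, top_n=min_synonyms - len(filtered_synonyms))
        for w2v in word2vec_synonyms:
            if not is_too_similar(w2v, seen_synonyms) and not is_containing_original(w2v, word) and is_valid_word(w2v):
                filtered_synonyms.append(w2v)
                seen_synonyms.add(w2v)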

    return filtered_synonyms[:min_synonyms], filtered_related[:min_related]

# ✅ **Test Cases**
test_words = [
    "Shakespeare", "Einstein", "Nike", "Black Sea", "Amazon", "Physics", "Neural Networks",
    "Cryptography", "Pi", "Tesla", "Nintendo", "McDonald's", "Backpack", "Airplane", "Relativity",
    "Artificial Intelligence", "Elon Musk", "Nikola Tesla", "Olympics", "Wimbledon", "Adidas",
    "Leonardo da Vinci", "Cleopatra", "Hercules", "Quantum Computing", "CRISPR", "Lord of the Rings",
    "Marvel", "Cyberpunk", "Bicycle", "Cooking", "Programming"
]

print("\n🔹 **Testing Final ConceptNet + WordNet + Word2Vec Backup**")
for word in test_words:
    synonyms, related_terms = get_conceptnet_synonyms_and_related(word)
    print(f"🔹 **Results for '{word}':**")
    print(f"   - Synonyms: {synonyms}")
    print(f"   - Related Terms: {related_terms}\n")



🔹 **Testing Final ConceptNet + WordNet + Word2Vec Backup**
🔹 **Results for 'Shakespeare':**
   - Synonyms: ['Bard of Avon', 'William Shakspere']
   - Related Terms: ['shakespearian', 'harold pinter', 'sixteenth', 'christopher marlowe', 'playwright']

🔹 **Results for 'Einstein':**
   - Synonyms: ['a physicist', 'armstrong']
   - Related Terms: ['relativity', 'theoretical physicist', 'photon']

🔹 **Results for 'Nike':**
   - Synonyms: ['information appliance']
   - Related Terms: ['athena', 'victory', 'athletic footwear']

🔹 **Results for 'Black Sea':**
   - Synonyms: []
   - Related Terms: ['yellow sea', 'white sea', 'inland sea', 'southeastern europe']

🔹 **Results for 'Amazon':**
   - Synonyms: ['mythical being', 'fictional female person']
   - Related Terms: ['warrior', 'south america', 'overwhelm']

🔹 **Results for 'Physics':**
   - Synonyms: ['cathartic', 'physical science']
   - Related Terms: ['field', 'interaction', 'physicist', 'mathematics', 'math', 'quantum theory', 'chemist

### Hint Help 2: Classification of Clue Types

In [None]:
#Classification of Clue Types 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Create labels for clue types (e.g., 0 = definition, 1 = anagram)
df['ClueType'] = ...  # Add this column based on manual labeling

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df['Clue'], df['ClueType'], test_size=0.2)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

# Predict clue types
y_pred = classifier.predict(X_test_tfidf)
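
The `ClueType` column above is still a placeholder that depends on manual labeling. One possible way to bootstrap rough labels before hand-checking them is to apply a few surface heuristics drawn from common crossword conventions (a run of underscores for a blank, "Abbr." for abbreviations, a trailing "?" for wordplay). The sketch below is only an illustration: the label names and the `guess_clue_type` function are hypothetical and not part of the original notebook, and the sample clues are invented.

```python
import re

# Hypothetical label names; the notebook does not define a label scheme.
FILL_IN_BLANK = "fill_in_blank"
ABBREVIATION = "abbreviation"
CROSS_REFERENCE = "cross_reference"
WORDPLAY = "wordplay"
DEFINITION = "definition"

def guess_clue_type(clue: str) -> str:
    """Assign a rough clue type from surface patterns (a heuristic, not ground truth)."""
    text = clue.strip()
    if re.search(r"_{2,}", text):  # two or more underscores mark a blank
        return FILL_IN_BLANK
    if re.search(r"\b(abbr|inits)\.", text, flags=re.IGNORECASE):
        return ABBREVIATION
    if re.search(r"\b\d+-(Across|Down)\b", text):  # refers to another entry
        return CROSS_REFERENCE
    if text.endswith("?"):  # a trailing question mark usually signals a pun
        return WORDPLAY
    return DEFINITION  # fallback when no pattern matches

# Invented sample clues, for illustration only.
for sample in ["___ and outs", "Doctors' org.: Abbr.", "See 17-Across",
               "Flower holder?", "Capital of Norway"]:
    print(f"{sample!r} -> {guess_clue_type(sample)}")
```

In the notebook this could be applied with `df['Clue'].apply(guess_clue_type)`, with the resulting weak labels reviewed by hand before training the classifier.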


In [None]:
#Recall word vectors pointing in same direction are most similar
#wv.most_similar('____')

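#Helper function: cosine similarity between two word vectors.
#Values near 1 mean the vectors point the same way (closely related words),
#values near 0 mean unrelated. Antonyms often score high as well, because they
#appear in similar contexts, so a low score does not imply "opposite".
def find_cosine(vec1, vec2):
    # Scale both vectors to unit length
    unit_vec1 = vec1 / np.linalg.norm(vec1)
    unit_vec2 = vec2 / np.linalg.norm(vec2)
    # The dot product of unit vectors is the cosine of the angle between them
    return np.dot(unit_vec1, unit_vec2)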

#Getting sentence level vectors
    #Naive approach - avg meaning vector 
    #more advanced - neural network with embedding 
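
The naive sentence-level approach noted above (averaging the word vectors) can be sketched as follows. This is a minimal illustration in plain Python so that it runs without the notebook's embedding model: the `sentence_vector` helper and the 3-dimensional toy vectors are invented for the example, and a real run would look vectors up in the trained model instead.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_vector(text, embeddings):
    """Average the vectors of the words found in `embeddings`; None if none are found."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "large": [0.9, 0.1, 0.0],
    "big": [0.8, 0.2, 0.0],
    "cat": [0.0, 0.1, 0.9],
    "feline": [0.1, 0.0, 0.8],
}

clue_a = sentence_vector("big cat", toy_embeddings)
clue_b = sentence_vector("large feline", toy_embeddings)
clue_c = sentence_vector("large", toy_embeddings)

print("big cat vs large feline:", cosine(clue_a, clue_b))
print("big cat vs large:", cosine(clue_a, clue_c))
```

Averaging ignores word order, so "dog bites man" and "man bites dog" get the same vector. That limitation is one reason to consider the learned-embedding approach mentioned in the comment above.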

### Hint Help 3: Fine tune transformer (BERT) to give hints

In [None]:
#import data
import pandas as pd
import numpy as np 
df = pd.read_csv('deep_learning_nytcrosswords2021.csv')

In [None]:
df.head(5)

In [None]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())  # True only with a CUDA (NVIDIA) or ROCm (AMD) build and a supported GPU
print(torch.backends.mps.is_available())  # Check if Metal is available (Mac users)
print(torch.cuda.device_count())  # Number of visible GPUs; 0 when none is available


In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("BERT is ready to use!")


In [None]:
# Count unique answers dynamically
num_unique_answers = df["Word"].nunique()
print(f"Number of unique answers: {num_unique_answers}")


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

#using uncased model for speed and performance
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_unique_answers)

In [None]:
# Tokenize the clues
tokens = tokenizer(df["Clue"].tolist(), padding=True, truncation=True, return_tensors="pt")

# Convert answers to numerical labels (one integer per unique answer, matching num_unique_answers)
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df["Word"])  # Converts text answers to numbers
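
To make the label-encoding step concrete, here is a plain-Python sketch of the idea behind `LabelEncoder`: each distinct answer gets an integer id, assigned in sorted order, and the mapping can be inverted to recover the answer text. The `fit_label_mapping` helper and the sample answers are invented for illustration and are not part of the notebook.

```python
def fit_label_mapping(answers):
    """Map each distinct answer to an integer id, assigned in sorted order."""
    classes = sorted(set(answers))
    index_of = {answer: i for i, answer in enumerate(classes)}
    return classes, index_of

# Invented sample answers, for illustration only.
answers = ["OREO", "AREA", "ERA", "OREO", "AREA"]

classes, index_of = fit_label_mapping(answers)
encoded = [index_of[a] for a in answers]   # text -> integer ids
decoded = [classes[i] for i in encoded]    # integer ids -> text

print("classes:", classes)
print("encoded:", encoded)
print("decoded:", decoded)
```

The number of classes is what the model's `num_labels` has to match. In the notebook, `label_encoder.inverse_transform` plays the role of the `decoded` step when turning predicted ids back into answers.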


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class CrosswordDataset(Dataset):
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.tokens.items()}
        item["labels"] = self.labels[idx]
        return item

# Create dataset and DataLoader
dataset = CrosswordDataset(tokens, labels)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

In [None]:
#Check if it worked so far:
# Get the first item from the dataset
first_sample = dataset[0]  # This should return a dictionary

# Print the keys in the sample
print(first_sample.keys())

# Print the actual contents of the sample
print("Input IDs:", first_sample["input_ids"])
print("Attention Mask:", first_sample["attention_mask"])
print("Label:", first_sample["labels"])

print("Decoded Clue:", tokenizer.decode(first_sample["input_ids"]))


# 📌 Fine-Tuning BERT for Crossword Solving

## **1️⃣ Conceptual Overview**
Fine-tuning BERT means **adapting a pre-trained language model** to specialize in **solving crossword clues**. Instead of training BERT from scratch, we **add a classification layer on top of it** and continue training, so that the model learns to map **crossword clues to correct answers**.

🔹 **What we’re doing:**  
- Giving BERT **crossword clues** as input.  
- Training it to **predict the correct answer** (classification task).  
- Using **supervised learning** (training with labeled crossword data).  
- Adjusting BERT’s weights so it learns **patterns in crossword clues** over multiple epochs.

---

## **2️⃣ Technical Breakdown**
### **1️⃣ Loading Pre-trained BERT Model**
- We use `bert-base-uncased`, a pre-trained **Transformer model** that already understands English.  
- `BertForSequenceClassification` adds a **classification layer** on top of BERT with one output per unique crossword answer (`num_labels=num_unique_answers`).

### **2️⃣ Tokenizing Data**
- Convert crossword clues into **tokenized input** that BERT can understand.  
- Convert answers into **numerical labels** using `LabelEncoder()`.

### **3️⃣ Training Process (Fine-Tuning)**
The fine-tuning process consists of:
1. **Forward Pass:** BERT takes a **tokenized crossword clue** and predicts an answer.  
2. **Loss Calculation:** Compare BERT’s predicted answer to the correct answer using **CrossEntropyLoss**.  
3. **Backpropagation:** Compute gradients to understand **how much each weight contributed to the error**.  
4. **Optimizer Update:** Adjust BERT’s weights using the **Adam optimizer**, a common optimization algorithm in deep learning, to improve predictions.  
5. **Repeat for Multiple Epochs:** The model gradually gets better at predicting correct answers.  
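
The loss-calculation step can be illustrated with a small worked example. The sketch below computes softmax followed by cross-entropy by hand for a single clue, using invented scores over four candidate answers; it is only meant to show what the loss measures, not how PyTorch implements it.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(softmax(logits)[target_index])

# Invented scores over four candidate answers; the correct answer is index 2.
target = 2
examples = {
    "uninformed": [0.0, 0.0, 0.0, 0.0],
    "confident and right": [0.0, 0.0, 5.0, 0.0],
    "confident and wrong": [5.0, 0.0, 0.0, 0.0],
}

for name, logits in examples.items():
    print(f"{name}: loss = {cross_entropy(logits, target):.4f}")
```

A model that spreads probability evenly over N answers has a loss of ln N, which gives a rough baseline for reading the loss early in training. The loss falls toward 0 as the model puts more probability on the correct answer.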

---

## **3️⃣ Key Code Components**
```python
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)  # Adjust BERT’s weights
loss_fn = torch.nn.CrossEntropyLoss()  # Measure how far off the predictions are

for epoch in range(epochs):
    for batch in dataloader:
        optimizer.zero_grad()  # Reset gradients
        # Split the batch into model inputs and target labels
        inputs = {key: val for key, val in batch.items() if key != "labels"}
        labels = batch["labels"]
        outputs = model(**inputs)  # Forward pass: Predict crossword answer
        loss = loss_fn(outputs.logits, labels)  # Calculate loss
        loss.backward()  # Compute gradients
        optimizer.step()  # Update model weights
```

In [None]:
from torch.optim import Adam  # The transformers AdamW import was unused, so it is dropped

# Move model to GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer and loss function
optimizer = Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Training loop
epochs = 3  # Adjust as needed
model.train()  # from_pretrained returns the model in eval mode; enable dropout for training
for epoch in range(epochs):
    total_loss = 0

    for batch in dataloader:
        optimizer.zero_grad()

        # Move batch to GPU if available
        inputs = {key: val.to(device) for key, val in batch.items() if key != "labels"}
        labels = batch["labels"].to(device)

        # Forward pass
        outputs = model(**inputs)
        loss = loss_fn(outputs.logits, labels)

        # Backpropagation
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

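    # Report the mean loss per batch so the value is comparable across dataset sizes
    print(f"Epoch {epoch+1}: Average loss = {total_loss / len(dataloader):.4f}")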
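
Once the model is trained, its output for a clue is one score (logit) per answer class, and the helper needs to turn those scores back into candidate answers. The sketch below shows one possible way to do that in plain Python. The `top_k_answers` helper, its optional `length` filter, and the sample scores are all hypothetical; in the notebook the scores would come from `outputs.logits` and the class names from `label_encoder.classes_`.

```python
def top_k_answers(logits, classes, k=3, length=None):
    """Return the k highest-scoring (answer, score) pairs, best first.

    If `length` is given, only answers with that many characters are kept,
    mirroring the known number of squares in the grid.
    """
    candidates = list(zip(classes, logits))
    if length is not None:
        candidates = [(answer, score) for answer, score in candidates
                      if len(answer) == length]
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Invented class names and scores, for illustration only.
classes = ["AREA", "ERA", "OREO", "ALOE"]
logits = [1.2, -0.3, 3.4, 0.8]

print(top_k_answers(logits, classes, k=2))
print(top_k_answers(logits, classes, k=2, length=3))
```

Filtering by length is a cheap way to discard impossible candidates, since the number of squares for an entry is always known from the grid.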

In [None]:
df.info()

### Crossword Inputs Section
- Option 1: manually type in the clue and the answer --> not ideal, since the user has to see the answer, but simple to implement.
- Option 2: manually type in just the clue and the number of spaces --> a more realistic scenario, but the helper has to come up with the answer.
- Option 3: use computer vision to scan the crossword clues and the crossword answers.
    - Easiest/best/fastest solution, but requires user to have an answer key and also to look at it.
- Option 4: use computer vision to scan empty crossword with hints. Helper has to come up with the answers on its own.
    - Probably the most practical for a normal person doing a newspaper crossword without access to answers.
    - So this is the ideal end goal.
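
As a starting point for Option 1, a manually typed clue/answer pair could be normalized the same way the dataset is cleaned earlier in the notebook (stripping whitespace and recording the answer length). This is a sketch under assumptions: the `make_clue_answer_pair` helper is hypothetical, and upper-casing the answer and removing internal spaces are assumptions about how answers are stored, not something the notebook specifies.

```python
def make_clue_answer_pair(clue: str, answer: str) -> dict:
    """Normalize a typed clue/answer pair into a record that mirrors the DataFrame columns."""
    clean_clue = clue.strip()
    # Assumption: answers are stored upper-case with no internal spaces.
    clean_answer = "".join(answer.split()).upper()
    if not clean_clue or not clean_answer:
        raise ValueError("Both a clue and an answer are required.")
    return {
        "Clue": clean_clue,
        "Word": clean_answer,
        "Character Count": len(clean_answer),
    }

# Invented example input, for illustration only.
pair = make_clue_answer_pair("  Capital of Norway ", " oslo ")
print(pair)
```

Keeping the keys identical to the DataFrame columns means the same hint functions can accept either a dataset row or a typed pair.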


In [None]:
#Option 1:
