### Importing packages and loading the required data

In [None]:
!pip install --upgrade openai
!pip install tiktoken

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
pd.set_option('display.max_colwidth', 400)
import tiktoken
import os
from google.colab import userdata, drive

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# client for OpenAI API
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))



In [None]:
# Import data
!wget -q -O nytcrosswords.csv 'https://www.dropbox.com/scl/fi/frj3j6vyrg36cjb4rvdtm/nytcrosswords.csv?rlkey=0wsqemquskwy6fta48mjk46f2&dl=0'

### Importing data and pre-processing

In [None]:
# Import and clean data

# latin1 (an alias of ISO-8859-1) maps every byte to a character, so this read
# cannot raise UnicodeDecodeError and no fallback encodings are needed.
data = pd.read_csv('nytcrosswords.csv', encoding='latin1')

data = data.astype("string")
data['word_length'] = data['Word'].str.len()
data = data.dropna()
data = data.drop('Date', axis=1)

# Keep words of length 3-8 that appear more than once, then drop repeated (Word, Clue) pairs
data = data[(data['word_length'] >= 3) & (data['word_length'] <= 8)]
data = data[data.duplicated('Word', keep=False)]
data = data.drop_duplicates(subset=['Word','Clue'])
data.to_csv('preprocessed.csv', index=False)
data = data.reset_index(drop=True)
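
To make the filtering steps concrete, here is a minimal pure-Python sketch on a handful of rows. The helper name `filter_rows` and the toy rows are illustrative assumptions, not part of the notebook or the dataset. It mirrors the pandas logic above: keep words of length 3-8, keep only words that occur more than once, then drop repeated (word, clue) pairs.

```python
from collections import Counter

def filter_rows(rows, min_len=3, max_len=8):
    """Mimic the pandas filtering above on a list of (word, clue) tuples."""
    # Step 1: keep only words whose length is within [min_len, max_len].
    sized = [(w, c) for w, c in rows if min_len <= len(w) <= max_len]
    # Step 2: like duplicated('Word', keep=False), keep words seen more than once.
    counts = Counter(w for w, _ in sized)
    repeated = [(w, c) for w, c in sized if counts[w] > 1]
    # Step 3: like drop_duplicates(subset=['Word', 'Clue']), keep the first of each pair.
    seen, result = set(), []
    for pair in repeated:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result

toy_rows = [
    ("ECO", "Kind to Mother Nature"),
    ("ECO", "Conscious beginning?"),
    ("ECO", "Kind to Mother Nature"),   # exact repeat, dropped in step 3
    ("RENO", "Nevada slots city"),      # appears only once, dropped in step 2
    ("AN", "Indefinite article"),       # too short, dropped in step 1
]
print(filter_rows(toy_rows))
```

Note that step 2 discards every word that has only a single clue in the data, which is a stronger filter than plain de-duplication.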

In [None]:
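# .copy() gives an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
subset = data[:1000].copy()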
subset

Unnamed: 0,Word,Clue,word_length
1000,PRIDES,Lion packs,6
1001,EUREKA,Shout accompanying a brilliant realization,6
1002,APEMEN,Prehistoric human relations?,6
1003,RENO,Nevada slots city,4
1004,TEACUP,Super-miniature dog breed size,6
...,...,...,...
1095,TESSIE,"Santiago of ""Scandal""",6
1096,ROAN,Horse of a different color,4
1097,MOURN,"Sit shiva, e.g.",5
1098,STAG,Male deer,4


### Including rules as part of the prompt

In [None]:
crossword_rules = """
The puzzle follows a number of conventions:
- Any time a clue contains the tag "Abbr." or an abbreviation more significant than "e.g.", the answer will be an abbreviation (EXAMPLE: [M.D. org. (3 letters)] for AMA).
- Any time a clue ends in a question mark, the answer is a play on words (e.g., [Fitness center? (4 letters)] for CORE).
- French-, Spanish-, or Latin-language answers, and more rarely answers from other languages are indicated either by a tag in the clue giving the answer language (EXAMPLE: [Summer: Fr. (3 letters)] for ETE) or by the use in the clue of a word from that language, often a personal or place name (EXAMPLE: [Friends of Pierre (4 letters)] for AMIS) or (EXAMPLE: [The ocean, e.g., in Orleans (3 letters)] for EAU).
- Clues and answers must always match in part of speech, tense, number, and degree. Thus a plural clue always indicates a plural answer (and the same for singular), a clue in the past tense will always be matched by an answer in the same tense, and a clue containing a comparative or superlative will always be matched by an answer in the same degree.
- The answer word (or any of the answer words, if it consists of multiple words) will not appear in the clue itself. Unlike in some easier puzzles in other outlets, the number of words in the answer is not given in the clue, so a one-word clue can have a multiple-word answer.
- Words that might appear elsewhere in the newspaper, such as well-known brand names, pop culture figures, or current phrases of the moment, are fair game.
- Spoken phrases are always indicated by enclosure in quotation marks (EXAMPLE: ["Get out of here!" (8 letters)] for LEAVENOW).
- When the answer can only be substituted for the clue when preceding a specific other word, this other word is indicated in parentheses. For example, [Think (over)] can be MULL, since "mull" only means "think" when preceding the word "over" (i.e., "think over" and "mull over" are synonymous, but "think" and "mull" are not necessarily synonymous otherwise).
- When the answer needs an additional word in order to fit the clue, this other word is indicated with the use of "with". For example, [Become understood, with "in"] can be SINK, since "Sink in" (but not "Sink" alone) means "to become understood."
"""

In [None]:
query = f"""Use the below article on the styles and conventions of New York Times crosswords to find the best answer of the given word length to the subsequent crossword clue \


Article:
```
{crossword_rules}
```
Clue: Fitness center? (4 letters)"""

print(query)

Use the below article on the styles and conventions of New York Times crosswords to find the best answer of the given word length to the subsequent crossword clue 

Article:
```

The puzzle follows a number of conventions:
- Any time a clue contains the tag "Abbr." or an abbreviation more significant than "e.g.", the answer will be an abbreviation (EXAMPLE: [M.D. org. (3 letters)] for AMA).
- Any time a clue ends in a question mark, the answer is a play on words (e.g., [Fitness center? (4 letters)] for CORE).
- French-, Spanish-, or Latin-language answers, and more rarely answers from other languages are indicated either by a tag in the clue giving the answer language (EXAMPLE: [Summer: Fr. (3 letters)] for ETE) or by the use in the clue of a word from that language, often a personal or place name (EXAMPLE: [Friends of Pierre (4 letters)] for AMIS) or (EXAMPLE: [The ocean, e.g., in Orleans (3 letters)] for EAU).
- Clues and answers must always match in part of speech, tense, number, an

In [None]:
# TEST
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You provide answers to new york times crossword clues. Only provide the answer'},
        {'role': 'user', 'content': query}
    ],
    model=GPT_MODEL,
    temperature=0.5
)

print(response.choices[0].message.content)

CORE


### Encoding the available dataset based on clues

In [None]:
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np




In [None]:

# Loading a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')



In [None]:

df = data.copy()[1000:]
df

Unnamed: 0,Word,Clue,word_length
0,PAT,"Action done while saying ""Good dog""",3
1,RASCALS,Mischief-makers,7
2,PEN,It might click for a writer,3
3,SEP,Fall mo.,3
4,ECO,Kind to Mother Nature,3
...,...,...,...
532938,NIOBE,Tantalus's daughter,5
532939,IRAQI,Kirkuk native,5
532940,ARS,"""___ magna"" (anagrams, appropriately)",3
532941,ACE,King's superior,3


In [None]:
df_subset = df.sample(frac=0.05).reset_index(drop=True)

def combine(clue, word_length, answer):
    # Element-wise string concatenation over the three pandas Series
    return "CLUE:" + clue + "\nLENGTH:" + word_length + "\nANSWER:" + answer

df_subset["text"]=combine(df_subset['Clue'],df_subset['word_length'].astype(str),df_subset['Word'])
df_subset

Unnamed: 0,Word,Clue,word_length,text
0,DESK,Newspaper post,4,CLUE:Newspaper post LENGTH:4 ANSWER:DESK
1,BELLI,Ruby defender,5,CLUE:Ruby defender LENGTH:5 ANSWER:BELLI
2,AMOK,Running ___,4,CLUE:Running ___ LENGTH:4 ANSWER:AMOK
3,ECO,Conscious beginning?,3,CLUE:Conscious beginning? LENGTH:3 ANSWER:ECO
4,PASS,Bridge comment,4,CLUE:Bridge comment LENGTH:4 ANSWER:PASS
...,...,...,...,...
26642,TMC,HBO alternative,3,CLUE:HBO alternative LENGTH:3 ANSWER:TMC
26643,ASHES,Hibachi residue,5,CLUE:Hibachi residue LENGTH:5 ANSWER:ASHES
26644,PSA,"Anti-bullying spot, for short",3,"CLUE:Anti-bullying spot, for short LENGTH:3 ANSWER:PSA"
26645,ZIP,Elan,3,CLUE:Elan LENGTH:3 ANSWER:ZIP


In [None]:
def encode_text(text, model):
    # Encode the text to an embedding
    embedding = model.encode(text)
    return embedding

df_subset['encoded'] = df_subset['text'].apply(encode_text , model=model)

In [None]:
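# Note: to_csv stores each array in 'encoded' as its string representation;
# df_subset.to_pickle(...) preserves the arrays if they need to be reloaded.
df_subset.to_csv("df_subset.csv")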

In [None]:
from IPython.display import display  # used below to show the retrieved strings
from scipy import spatial

# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:

    """Returns a list of strings and relatednesses, sorted from most related to least."""

    query_embedding = encode_text(query,model)

    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["encoded"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
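
The ranking above rests on cosine similarity (`1 - spatial.distance.cosine`). As a minimal sketch of the same idea without SciPy or a real embedding model, the following uses hand-written 2-D vectors; the names `cosine_similarity`, `rank_by_similarity` and `toy_items` are illustrative assumptions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def rank_by_similarity(query_vec, items, top_n=2):
    """Return the top_n (text, score) pairs, most similar first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy "embeddings": A points along x, B along y, C along the diagonal.
toy_items = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [1.0, 1.0])]
ranked = rank_by_similarity([1.0, 0.1], toy_items)
for text, score in ranked:
    print(text, round(score, 3))
```

Because cosine similarity ignores vector length, only the direction of an embedding matters for the ranking.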

In [None]:
strings, relatednesses = strings_ranked_by_relatedness("CLUE: Fitness center? \n LENGTH:4", df_subset, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.746


'CLUE:Fitness guru\nLENGTH:7\nANSWER:TRAINER'

relatedness=0.727


'CLUE:Exercise venue, for short\nLENGTH:4\nANSWER:YMCA'

relatedness=0.701


'CLUE:Popular fitness class\nLENGTH:4\nANSWER:YOGA'

relatedness=0.700


'CLUE:Y feature\nLENGTH:3\nANSWER:GYM'

relatedness=0.666


'CLUE:Bit of gym attire\nLENGTH:3\nANSWER:TEE'

In [None]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

In [None]:
intro = f""" Pretend you are an expert crossword solver. Use the below rules and examples to answer the subsequent question.

Article:
    ```
    {crossword_rules}
    ```
"""

In [None]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = intro
    question = query + "\nAnswer:"
    message = introduction
    for string in strings:

        next_article = f'\n\nEXAMPLE:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question
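
The greedy packing in `query_message` can be seen in isolation with a stand-in token counter. This is a sketch under stated assumptions: `count_tokens` simply counts whitespace-separated words rather than calling tiktoken, and `pack_examples` is a hypothetical name for the same loop.

```python
def count_tokens(text):
    """Stand-in tokenizer (an assumption): counts whitespace-separated words."""
    return len(text.split())

def pack_examples(intro, examples, question, token_budget, counter=count_tokens):
    """Append examples in ranked order until the next one would exceed the budget."""
    message = intro
    for example in examples:
        candidate = f"\n\nEXAMPLE:\n{example}"
        if counter(message + candidate + question) > token_budget:
            break  # stop at the first example that does not fit
        message += candidate
    return message + question

prompt = pack_examples(
    intro="Rules go here.",
    examples=[
        "CLUE:a LENGTH:1 ANSWER:b",
        "CLUE:c LENGTH:1 ANSWER:d",
        "CLUE:e LENGTH:1 ANSWER:f",
    ],
    question="\nCLUE:g LENGTH:1\nAnswer:",
    token_budget=12,
)
print(prompt)
```

One consequence of this design is that the introduction and question are always included: if they already exceed the budget on their own, the prompt is sent with no examples at all.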

In [None]:
query = query_message("CLUE:Community gym org.\nLENGTH:4", df_subset, GPT_MODEL, 1000)

print(query)

 Pretend you are an expert crossword solver. Use the below rules and examples to answer the subsequent question.

Article:
    ```
    
The puzzle follows a number of conventions:
- Any time a clue contains the tag "Abbr." or an abbreviation more significant than "e.g.", the answer will be an abbreviation (EXAMPLE: [M.D. org. (3 letters)] for AMA).
- Any time a clue ends in a question mark, the answer is a play on words (e.g., [Fitness center? (4 letters)] for CORE).
- French-, Spanish-, or Latin-language answers, and more rarely answers from other languages are indicated either by a tag in the clue giving the answer language (EXAMPLE: [Summer: Fr. (3 letters)] for ETE) or by the use in the clue of a word from that language, often a personal or place name (EXAMPLE: [Friends of Pierre (4 letters)] for AMIS) or (EXAMPLE: [The ocean, e.g., in Orleans (3 letters)] for EAU).
- Clues and answers must always match in part of speech, tense, number, and degree. Thus a plural clue always indicat

In [None]:
answers = []

for i, row in subset.iterrows():
    clue = row['Clue']
    word_length = row['word_length']
    question = "CLUE: " + clue + "\nLENGTH:" + str(word_length)

    query = query_message(question, df_subset, GPT_MODEL, 1000)


    messages = [
        {'role': 'system', 'content': 'You provide answers to new york times crossword clues. You take your time but only provide the answer'},
        {'role': 'user', 'content': query},
    ]

    # Create a completion using the specified messages and model
    response = client.chat.completions.create(
        messages=messages,
        model=GPT_MODEL,
        temperature=0.5

    )


    # Extract the answer from the response
    answer = response.choices[0].message.content.strip()
    answers.append(answer)

    if i % 10 == 0:
        print(f"Processed {i} rows")


subset['answers'] = answers
subset['answers']

Processed 1000 rows
Processed 1010 rows
Processed 1020 rows
Processed 1030 rows
Processed 1040 rows
Processed 1050 rows
Processed 1060 rows
Processed 1070 rows
Processed 1080 rows
Processed 1090 rows



1000     PRIDES
1001     EUREKA
1002    KINSHIP
1003       RENO
1004     TEACUP
         ...   
1095     FIGURE
1096       MARE
1097      MOURN
1098       HART
1099       FULL
Name: answers, Length: 100, dtype: object

In [None]:

subset['answers'] = subset['answers'].str.split().str[-1]
subset['answers'] = subset['answers'].str.replace(r'[^\w\s]', '', regex=True)
subset['answers'] = subset['answers'].str.upper()
subset


Unnamed: 0,Word,Clue,word_length,answers
1000,PRIDES,Lion packs,6,PRIDES
1001,EUREKA,Shout accompanying a brilliant realization,6,EUREKA
1002,APEMEN,Prehistoric human relations?,6,KINSHIP
1003,RENO,Nevada slots city,4,RENO
1004,TEACUP,Super-miniature dog breed size,6,TEACUP
...,...,...,...,...
1095,TESSIE,"Santiago of ""Scandal""",6,FIGURE
1096,ROAN,Horse of a different color,4,MARE
1097,MOURN,"Sit shiva, e.g.",5,MOURN
1098,STAG,Male deer,4,HART


In [None]:

correct = 0
for i, row in subset.iterrows():
  if row['Word'] == row['answers']:
    correct += 1

print(correct/len(subset))

0.45
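
Exact-match accuracy hides why an answer was wrong. As a rough sketch (the helper names `normalise` and `score` are assumptions, and the cleaning rule is simpler than the notebook's), the following also counts how often the prediction has the wrong length, using a few rows shown above plus a lower-case, punctuated variant to exercise the cleaning step.

```python
import re

def normalise(answer):
    """Upper-case the answer and strip everything except letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", answer).upper()

def score(rows):
    """rows: list of (expected_word, predicted_answer) pairs."""
    total = exact = wrong_length = 0
    for expected, predicted in rows:
        total += 1
        cleaned = normalise(predicted)
        if cleaned == expected:
            exact += 1
        elif len(cleaned) != len(expected):
            wrong_length += 1  # wrong answer that also ignores the length hint
    return {"accuracy": exact / total, "wrong_length_rate": wrong_length / total}

toy = [
    ("PRIDES", "PRIDES"),
    ("APEMEN", "KINSHIP"),   # wrong and one letter too long
    ("ROAN", "Mare."),       # wrong, but the right length after cleaning
    ("STAG", "HART"),
]
print(score(toy))
```

A high wrong-length rate would indicate that the model is not respecting the `LENGTH` field, which is a different failure from choosing a plausible synonym of the right length.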


### Attempt to Hard-encode the dataset using known styles and conventions

In [None]:
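def encode_detailed_features(clue):
    # Binary flags for the clue conventions listed in crossword_rules.
    # These are rough heuristics; several match less (or more) than their name suggests.
    features = {
        # "Abbr." tag, or any run of two or more capital letters
        'abbreviation': 1 if "Abbr." in clue or re.search(r'\b[A-Z]{2,}\b', clue) else 0,
        # clue ends with a question mark
        'play_on_words': 1 if clue.endswith('?') else 0,
        # language tag, or one of three hard-coded words
        'foreign_language': 1 if any(tag in clue for tag in ["Fr.", "Sp.", "Lat."]) or re.search(r'\b(Ete|Amis|Eau)\b', clue, re.IGNORECASE) else 0,
        # whole clue wrapped in double quotes
        'spoken_phrase': 1 if clue.startswith('"') and clue.endswith('"') else 0,
        # literal "(s)" immediately followed by a word character
        'plural': 1 if re.search(r'\(s\)\b', clue) else 0,
        # standalone tokens "ed", "was" or "were" only, not "-ed" endings
        'past_tense': 1 if re.search(r'\bed\b|\bwas\b|\bwere\b', clue) else 0,
        # standalone tokens "er" or "est" only, not "-er"/"-est" endings
        'comparative_superlative': 1 if re.search(r'\ber\b|\best\b', clue) else 0,
        # clue ends with a single parenthesised word, e.g. "Think (over)"
        'specific_word_dependency': 1 if re.search(r'\(\w+\)$', clue) else 0,
        # any occurrence of the substring "with", including inside other words
        'requires_additional_word': 1 if "with" in clue else 0,
    }
    features_vector = np.array(list(features.values()))
    return features_vector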

df_subset['features'] = df_subset['Clue'].apply(encode_detailed_features)

In [None]:
df_subset['combined_embedding'] = df_subset.apply(lambda row: np.concatenate((row['features'], row['encoded'])), axis=1)
df_subset.to_csv('df_subset2.csv', index=False)

In [None]:
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:

    """Returns a list of strings and relatednesses, sorted from most related to least."""

    query_embedding = encode_text(query,model)
    query_features = encode_detailed_features(query)
    print(query_features)
    query_combined_embedding = np.concatenate((query_features, query_embedding))
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_combined_embedding, row["combined_embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
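
Concatenating raw 0/1 flags with an embedding lets a single flag weigh heavily in the cosine score. The sketch below illustrates this with toy vectors, assuming the text embeddings are roughly unit length; `combine` and its `feature_weight` parameter are hypothetical and not part of the notebook.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def combine(features, embedding, feature_weight=1.0):
    """Scale the hand-coded flags, then prepend them to the embedding."""
    return [feature_weight * f for f in features] + list(embedding)

flag = [1]                               # both clues share one convention flag
emb_a, emb_b = [1.0, 0.0], [0.0, 1.0]    # unrelated (orthogonal) toy embeddings

unweighted = cosine(combine(flag, emb_a), combine(flag, emb_b))
weighted = cosine(combine(flag, emb_a, 0.25), combine(flag, emb_b, 0.25))
print(round(unweighted, 3), round(weighted, 3))
```

In this toy case two semantically unrelated clues score 0.5 purely because they share a flag; scaling the flags down reduces that effect, and the right weight would have to be tuned empirically.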

In [None]:
strings, relatednesses = strings_ranked_by_relatedness("Fitness Center?", df_subset, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

[0 1 0 0 0 0 0 0 0]
relatedness=0.644


'CLUE:Track team?\nLENGTH:6\nANSWER:TRAINS'

relatedness=0.636


'CLUE:Juice providers?\nLENGTH:7\nANSWER:OUTLETS'

relatedness=0.633


'CLUE:Waist removal regimens?\nLENGTH:5\nANSWER:DIETS'

relatedness=0.626


'CLUE:Ace place?\nLENGTH:6\nANSWER:SLEEVE'

relatedness=0.623


'CLUE:Things used during crunch time?\nLENGTH:3\nANSWER:ABS'

In [None]:
df_subset['features'].sum()

array([761, 905,  17, 646,   0,  93,   7, 167, 778])
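
The counts above show that some flags almost never fire (0 for `plural`, 7 for `comparative_superlative`), which is consistent with patterns such as `\ber\b` matching only a standalone token. A hedged sketch of suffix-aware alternatives follows; the pattern names and `suffix_features` are assumptions, and these regexes are still crude heuristics rather than real part-of-speech detection.

```python
import re

# Words ending in "-ed" (at least two letters before the suffix), or "was"/"were".
PAST_TENSE = re.compile(r"\b\w{2,}ed\b|\b(?:was|were)\b", re.IGNORECASE)
# Words ending in "-er"/"-est" (at least three letters before), or "more"/"most".
COMPARATIVE = re.compile(r"\b\w{3,}(?:er|est)\b|\b(?:more|most)\b", re.IGNORECASE)
# The ', with "in"' convention: a comma, the word "with", then a quoted word.
WITH_WORD = re.compile(r',\s*with\s+"[^"]+"')

def suffix_features(clue):
    """Return three heuristic 0/1 flags for a clue string."""
    return {
        "past_tense": int(bool(PAST_TENSE.search(clue))),
        "comparative_superlative": int(bool(COMPARATIVE.search(clue))),
        "requires_additional_word": int(bool(WITH_WORD.search(clue))),
    }

for clue in ['Become understood, with "in"', "Jumped over", "Tallest mountain",
             "Kind to Mother Nature"]:
    print(clue, suffix_features(clue))
```

The last clue is a deliberate false positive: "Mother" ends in "-er" and is flagged as a comparative, so suffix matching trades missed cases for spurious ones.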

In [None]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = crossword_rules
    question = query + "\nAnswer:"
    message = introduction
    for string in strings:
        # wrap each retrieved clue/answer pair in an EXAMPLE block

        next_article = f'\n\nEXAMPLE:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

In [None]:
answers = []

for i, row in subset.iterrows():
    clue = row['Clue']

    query = query_message(clue, df_subset, GPT_MODEL, 500)


    messages = [
        {'role': 'system', 'content': 'You provide answers to new york times crossword clues. You take your time but only provide the answer'},
        {'role': 'user', 'content': query},
    ]

    # Create a completion using the specified messages and model
    response = client.chat.completions.create(
        messages=messages,
        model=GPT_MODEL,
        temperature=0.5

    )


    # Extract the answer from the response
    answer = response.choices[0].message.content.strip()
    answers.append(answer)

    if i % 10 == 0:
        print(f"Processed {i} rows")


subset['answers'] = answers
subset['answers']

In [None]:

subset['answers'] = subset['answers'].str.split().str[-1]
subset['answers'] = subset['answers'].str.replace(r'[^\w\s]', '', regex=True)
subset['answers'] = subset['answers'].str.upper()
subset



Unnamed: 0,Word,Clue,word_length,answers
1000,PRIDES,Lion packs,6,PRIDES
1001,EUREKA,Shout accompanying a brilliant realization,6,EUREKA
1002,APEMEN,Prehistoric human relations?,6,STONEAGELOVE
1003,RENO,Nevada slots city,4,RENO
1004,TEACUP,Super-miniature dog breed size,6,TOY
...,...,...,...,...
1095,TESSIE,"Santiago of ""Scandal""",6,VERGARA
1096,ROAN,Horse of a different color,4,ZEBRA
1097,MOURN,"Sit shiva, e.g.",5,MOURN
1098,STAG,Male deer,4,BUCK


In [None]:
correct = 0
for i, row in subset.iterrows():
  if row['Word'] == row['answers']:
    correct += 1

print(correct/len(subset))

0.43


This variant scores 0.43, slightly below the 0.45 obtained with plain embedding retrieval. That points to two needs: a more sophisticated way of finding relevant examples that do not confuse the model, and a way to integrate any known letters so the model can pick the right answer when there are multiple candidates.
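
As a minimal sketch of how known letters could narrow the candidates, the following filters a list of answers against a pattern in which `?` marks an unknown square. The pattern syntax and the names `matches_pattern` and `filter_candidates` are assumptions for illustration; the notebook does not implement this step.

```python
def matches_pattern(candidate, pattern):
    """Return True if candidate fits a pattern such as 'C?RE' ('?' = unknown letter)."""
    if len(candidate) != len(pattern):
        return False
    return all(p == "?" or p == c
               for p, c in zip(pattern.upper(), candidate.upper()))

def filter_candidates(candidates, pattern):
    """Keep only the candidates consistent with the known letters."""
    return [c for c in candidates if matches_pattern(c, pattern)]

candidates = ["CORE", "YMCA", "YOGA", "GYM", "CARE"]
print(filter_candidates(candidates, "C?RE"))
print(filter_candidates(candidates, "?O??"))
```

Such a filter could be applied to several sampled model answers, or the pattern could be added to the prompt alongside the `LENGTH` field.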