# Project Part 2

FYI: The code does not seem to run in Jupyter Notebook, although no errors are shown. To view the outputs, open the notebook in Colab or Kaggle.

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://colab.research.google.com/github/PGLavergne/NYTCrosswordPredicter/blob/main/Part%20II/Project_Part_II.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PGLavergne/NYTCrosswordPredicter/blob/main/Part%20II/Project_Part_II.ipynb)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/PGLavergne/NYTCrosswordPredicter/main/nytcrosswords.csv'
dataSet = pd.read_csv(url, encoding='latin-1')
dataSet_partial = pd.read_csv(url, encoding='latin-1', nrows=10000)
dataSet_partial

The code below creates histograms that show the distribution of clue lengths in words and the distribution of answer lengths in characters. As you can see, most clues are shorter than 5 words and the answer is usually shorter than 5 characters.

In [None]:
# Compute clue lengths in words and answer lengths in characters.
dataSet_partial["Clue_Length_Words"] = dataSet_partial['Clue'].apply(lambda x: len(x.split()))
dataSet_partial['Word_Length_Chars'] = dataSet_partial['Word'].apply(len)

plt.figure(figsize=(12, 5))

plt.subplot(1,2,1)
plt.hist(dataSet_partial['Clue_Length_Words'], bins=20, color='green')
plt.title('Hist of Clue Length (Words)')
plt.xlabel('Num of Words')
plt.ylabel('Frequency')

plt.subplot(1,2,2)
plt.hist(dataSet_partial['Word_Length_Chars'], bins=20, color='skyblue')
plt.title('Hist of Answer Length (char)')
plt.xlabel("Num of Chars")
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
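
The "most clues are under 5 words" reading can also be checked numerically instead of by eye. Below is a minimal sketch using only the standard library and a handful of made-up clue/answer pairs; the `pairs` list and the `share_below` helper are hypothetical and not part of the notebook. On the real data, the same fraction would come from the `Clue_Length_Words` and `Word_Length_Chars` columns.

```python
# Hypothetical sample pairs, for illustration only (not rows from the dataset).
pairs = [
    ("Feline pet", "CAT"),
    ("Opposite of yes", "NO"),
    ("Large body of salt water", "OCEAN"),
    ("Red fruit", "APPLE"),
    ("Not on", "OFF"),
]


def share_below(values, threshold):
    """Return the fraction of values strictly below threshold."""
    values = list(values)
    return sum(v < threshold for v in values) / len(values)


clue_lengths = [len(clue.split()) for clue, _ in pairs]  # words per clue
answer_lengths = [len(answer) for _, answer in pairs]    # characters per answer

clue_share = share_below(clue_lengths, 5)
answer_share = share_below(answer_lengths, 5)
print(f"Clues under 5 words: {clue_share:.0%}")
print(f"Answers under 5 chars: {answer_share:.0%}")
```

A share well above 50% on the real columns would back up the claim; the numbers printed here only describe the toy sample.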

The code below draws a scatter plot where the x-axis is the clue length in words and the y-axis is the answer length in characters. Each point corresponds to one clue/answer pair from the dataset.

In [None]:
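# The length columns were only added to dataSet_partial, so compute them for the full dataset here.
dataSet = dataSet.dropna(subset=['Clue', 'Word']).copy()
dataSet['Clue_Length_Words'] = dataSet['Clue'].apply(lambda x: len(x.split()))
dataSet['Word_Length_Chars'] = dataSet['Word'].apply(len)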

plt.figure(figsize=(14.5,6))
plt.scatter(dataSet['Clue_Length_Words'], dataSet['Word_Length_Chars'], color='purple', alpha=0.2)
plt.title('Clue Length v. Word Length')
plt.xlabel('Clue Length (Words)')
plt.ylabel('Word Length (Chars)')
plt.grid(True)
plt.show()

The code below trains a model that predicts a crossword answer from the text of its clue and the length of the answer.

It uses the partial dataset loaded earlier (`dataSet_partial`, the first 10,000 rows) and determines the length of each word (crossword answer).

TF-IDF (Term Frequency - Inverse Document Frequency):

TF computes how often each word appears in each clue. It gives higher weights to words that occur more frequently within a clue. 

IDF measures how rare each word is across the entire set of clues. Words that appear in many clues are given lower IDF values, and words that appear in only a few clues are given higher ones.

For each word in each clue, TF-IDF multiplies the two. The result is a numerical value that is high when the word occurs often in that particular clue but rarely across all the other clues.
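
To make the weighting concrete, here is a small hand-rolled sketch of TF-IDF on three made-up clues. It is an illustration under assumptions, not the notebook's code: the `clues` list and the `idf` and `tfidf` helpers are hypothetical, tokenization is a plain `split()`, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization is, to my understanding, what scikit-learn documents as its default behavior.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of clues, for illustration only.
clues = [
    "capital of france",
    "capital of italy",
    "feline pet",
]
docs = [clue.split() for clue in clues]
n_docs = len(docs)

# Document frequency: number of clues that contain each word at least once.
doc_freq = Counter(word for doc in docs for word in set(doc))


def idf(word):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized clue."""
    counts = Counter(doc)  # TF: raw count of each word within the clue
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the first clue, "france" appears in only one clue while "capital" and "of" appear in two, so "france" ends up with the largest weight even though every word occurs exactly once in the clue.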

# Warning: this code took 14 min to run.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = dataSet_partial['Clue']
y = dataSet_partial['Word']

tfidf_vectorizer = TfidfVectorizer()
X_encoded = tfidf_vectorizer.fit_transform(X).toarray()

# Calculate character size based on the length of words
character_size = dataSet_partial['Word'].apply(len)

# Concatenate encoded clues and character size into a DataFrame
X_processed = pd.DataFrame(X_encoded, columns=tfidf_vectorizer.get_feature_names_out())
X_processed['Character_Size'] = character_size.values

# Train a logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_processed, y)

# User Input
user_input_clue = input("Enter the clue: ")
user_input_size = int(input("Enter the character size of the answer: "))

# Preprocess user input similarly to the training data (tfidf encoding)
user_input_encoded = tfidf_vectorizer.transform([user_input_clue]).toarray()

# This creates a dataframe so the model can make a prediction
user_input_df = pd.DataFrame(user_input_encoded, columns=tfidf_vectorizer.get_feature_names_out())
user_input_df['Character_Size'] = user_input_size

# This code makes the prediction
predicted_word = model.predict(user_input_df)

print(f"Predicted word based on the clue '{user_input_clue}' and character size {user_input_size}: {predicted_word[0]}")

Fantastic! The model correctly predicted the answer for the clue that I provided. Granted, it was a test run and the clue was already present in the data that I fed into the model, so this does not yet show that the model generalizes. Next time, I will input a clue that's not within the partial dataset that I loaded.
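
As a rough sketch of that next step, the snippet below holds out part of a toy set of clue/answer pairs and checks how many held-out answers also occur in the training part. Everything in it is an assumption for illustration: the `pairs` list and the `holdout_split` helper are hypothetical, and in the notebook itself the already imported `train_test_split` would play this role. The check matters because a classifier trained on answer words as labels can only output words it has seen during training.

```python
import random

# Hypothetical clue/answer pairs, for illustration only.
pairs = [
    ("Feline pet", "CAT"),
    ("Purring animal", "CAT"),
    ("Opposite of yes", "NO"),
    ("Refusal", "NO"),
    ("Red fruit", "APPLE"),
    ("Pie fruit", "APPLE"),
    ("Large body of water", "OCEAN"),
    ("Canine pet", "DOG"),
]


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


train, test = holdout_split(pairs)
train_answers = {answer for _, answer in train}

# Only answers seen in training can ever be predicted, so this share is
# an upper bound on the accuracy reachable on the held-out clues.
reachable = sum(answer in train_answers for _, answer in test) / len(test)
print(f"{len(train)} train rows, {len(test)} test rows")
print(f"Share of test answers seen in training: {reachable:.0%}")
```

If that share is low on the real data, low held-out accuracy would be expected regardless of how well the clue features work.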