# NYT Spelling Bee


In [27]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

### Research Question
Which features make a New York Times Spelling Bee puzzle have a bingo, meaning that at least one word begins with each of the puzzle's letters?

### Hypothesis
If a puzzle contains certain letters and has a certain number of points and words, then it will have a bingo.

### Dataset
#### Collection
Collected by Malin Morris from the daily New York Times Games Spelling Bee puzzle, using the hint line from the hints page. The dataset contains 612 puzzles and 40 features.
#### Variables
- Center Letter: the letter at the center of the puzzle, which must be included in every word (most of the 26 letters appear at least once)
- Letters 1-6: the other letters in the puzzle, usually listed alphabetically (each of the 26 letters appears at least once)
- Points: the number of points in the puzzle (4-letter words are worth 1 point, words of 5+ letters are worth 1 point per letter, and pangrams are worth an additional 7)
- Words: the number of words in the puzzle solution
- Pangrams: the number of words that use every letter at least once (every puzzle has at least one)
- Perfect Pangrams: the number of words that use every letter exactly once
- Bingo: 1 if every letter begins at least one word, otherwise 0
- Date: the date of the puzzle (8/1/23-4/3/25)
- Non-perfect Pangrams: Pangrams - Perfect Pangrams
- Number of Vowels: the number of vowels (A, E, I, O, U, Y) in the puzzle
- ING, OUGH, TION, ED, UN, ABLE, IGHT, LY: 1 if the prefix or suffix exists in the puzzle, 0 otherwise
- NONE: whether none of the prefixes or suffixes above exist in the puzzle
- PPW: points per word (Points / Words) as a decimal number
- J, Q, V, W, X, Y, Z Bingo: whether the puzzle has a bingo when the given uncommon letter is in the puzzle
- C and K: whether C and K exist together in the puzzle
- C, K: helper columns used to calculate whether C and K are (separately) in the puzzle
- Letters as a Word: the puzzle's letters concatenated into one string
- Word: the letters of Letters as a Word in alphabetical order, used to look for repeat pangrams
- Repeat Pangram: 1 if the pangram has appeared before, 0 if not (that is, 1 if the entry in the Word column appears more than once)
- Vowels as a Word: the vowels appearing in the puzzle, in alphabetical order
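The Points, Pangrams, Bingo, and PPW definitions can be checked by hand on a small example. The sketch below is illustrative only: the function names `word_points` and `puzzle_summary` and the short word list are hypothetical, not part of the dataset or the official solution to any puzzle, and the code assumes every word passed in is already a valid answer.

```python
def word_points(word, letters):
    """Score one word: 4-letter words earn 1 point, longer words 1 point per letter, pangrams +7."""
    points = 1 if len(word) == 4 else len(word)
    if set(word) == set(letters):  # pangram: uses every puzzle letter at least once
        points += 7
    return points


def puzzle_summary(words, letters):
    """Compute a few of the derived columns for one puzzle from its word list."""
    letter_set = set(letters)
    pangrams = [w for w in words if set(w) == letter_set]
    # A perfect pangram uses every letter exactly once, so its length equals the letter count
    perfect = [w for w in pangrams if len(w) == len(letter_set)]
    points = sum(word_points(w, letters) for w in words)
    first_letters = {w[0] for w in words}
    return {
        "Points": points,
        "Words": len(words),
        "Pangrams": len(pangrams),
        "Perfect Pangrams": len(perfect),
        "Bingo": int(first_letters == letter_set),
        "PPW": points / len(words),
    }


# Hypothetical, abbreviated word list for the letters T (center), A, C, I, L, N, O
toy_letters = "TACILNO"
toy_words = ["ACTION", "ALTO", "COTTON", "INTO", "LOCATION", "NOTION", "OCTAL", "TOIL"]

summary = puzzle_summary(toy_words, toy_letters)
print(summary)
```

Here LOCATION is a pangram but not a perfect one (it uses O twice), so it scores 8 + 7 = 15 points, and the list has a bingo because each of the seven letters starts at least one word.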
#### Notes
Editor Sam Ezersky is noted as saying there will never be an S in a puzzle, but the 2,500th puzzle, on 3/12/25, contained an S for the very first time.

In [36]:
data = pd.read_excel("Spelling Bee Midterm 1.xlsx")
data = data.drop('Notes', axis = 1)

letter_cols = ['Center Letter', 'Letter 1', 'Letter 2', 'Letter 3', 'Letter 4', 'Letter 5', 'Letter 6']

# drop puzzle with S
data = data[~data[letter_cols].isin(['S']).any(axis=1)]

# drop columns 10, 13, 15-39
cols_to_drop = list(data.columns[[10, 13]]) + list(data.columns[15:40])
data = data.drop(columns=cols_to_drop)

# target: the Bingo column, selected by name because positional indices shift after the drops above
target = 'Bingo'

# standardize
scaler = StandardScaler()
data[['Words', 'Points']] = scaler.fit_transform(data[['Words', 'Points']])

# one hot encode the center letter
center_letter_encoded = pd.get_dummies(data['Center Letter'], prefix='center')

# other letters combined and one hot encoded
data['Other Letters'] = data[['Letter 1', 'Letter 2', 'Letter 3', 'Letter 4', 'Letter 5', 'Letter 6']].agg(''.join, axis=1)
data['Other Letters'] = data['Other Letters'].apply(lambda x: ''.join(sorted(set(x))))
other_letters_encoded = data['Other Letters'].str.get_dummies()

# 'Center Letter' is already in letter_cols, so it does not need to be listed again
data = pd.concat([data.drop(columns=letter_cols + ['Other Letters']), center_letter_encoded, other_letters_encoded], axis=1)

data.head()

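| | Points | Words | Pangrams | Bingo | Date | Number of Vowels | center_A | center_B | center_C | center_D | ... | HMNOTW | HORTWY | IKLNOT | ILNORT | ILNOTZ | ILOTVY | IMNOPT | IMPTUY | IMRTUY | LMNOTU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.104198 | 0.892639 | 2 | 1 | 2023-08-01 | 3 | False | False | False | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.60496 | 0.641301 | 1 | 1 | 2023-08-02 | 2 | False | False | False | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | -0.518326 | -0.782944 | 1 | 0 | 2023-08-03 | 2 | False | False | False | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | -0.365781 | -0.950502 | 2 | 0 | 2023-08-04 | 3 | False | True | False | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | -0.781813 | -0.866723 | 1 | 0 | 2023-08-05 | 3 | False | False | False | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |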


In [35]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 611 entries, 0 to 611
Columns: 617 entries, Points to LMNOTU
dtypes: bool(24), datetime64[ns](1), float64(2), int64(590)
memory usage: 2.8 MB


### Data Preprocessing
- drop features 10, 13, and 15-39
- target is feature 11 (Bingo)
- standardize Words and Points
- one-hot encode the center letter
- combine all other letters and one-hot encode the combination
- drop the puzzle with an S (it's an outlier)

### Data Analysis and Visualization
- charts for each variable
- bar plot for counts of each letter
- perform t-test on each letter against the target variable
- sort results in order
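One way the letter counts and per-letter t-tests listed above might look is sketched below. Everything here is an assumption made for illustration: the data is randomly generated rather than read from the spreadsheet, the column name `letters` is hypothetical, and Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`) is one possible reading of "t-test on each letter against the target variable". Because Bingo is a 0/1 variable, the t-test is only an approximation.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the real puzzles (no S, as in the cleaned data)
rng = np.random.default_rng(0)
alphabet = list("ABCDEFGHIKLMNOPRTUY")
rows = []
for _ in range(200):
    letters = rng.choice(alphabet, size=7, replace=False)
    rows.append({"letters": "".join(sorted(letters)), "Bingo": int(rng.random() < 0.4)})
toy = pd.DataFrame(rows)

# One boolean indicator column per letter: does the puzzle contain it?
indicators = pd.DataFrame({letter: toy["letters"].str.contains(letter) for letter in alphabet})

# Number of puzzles containing each letter (the input for a bar plot)
letter_counts = indicators.sum().sort_values(ascending=False)

# Compare Bingo between puzzles with and without each letter
results = []
for letter in alphabet:
    with_letter = toy.loc[indicators[letter], "Bingo"]
    without_letter = toy.loc[~indicators[letter], "Bingo"]
    t_stat, p_value = stats.ttest_ind(with_letter, without_letter, equal_var=False)
    results.append({"letter": letter, "t": t_stat, "p": p_value})

# Sort so the letters with the strongest evidence of a difference come first
ttest_results = pd.DataFrame(results).sort_values("p").reset_index(drop=True)
print(ttest_results.head())
```

With matplotlib installed, `letter_counts.plot.bar()` would draw the bar plot of counts. Running this many tests at once makes some small p-values likely by chance, so a multiple-comparison correction may be worth considering.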

### Data Modeling and Prediction
- train-test split
- create and run model on the training set, then evaluate it on the test set
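The modeling steps above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: the data is synthetic, logistic regression is an assumed model choice (the plan above does not name one), and the relationship between Words and Bingo is made up. It reuses the `Pipeline` and `ColumnTransformer` classes imported at the top of the notebook so that the scaler and encoder are fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the puzzle data (values are made up)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Center Letter": rng.choice(list("ACEILNOT"), size=n),
    "Points": rng.integers(50, 400, size=n),
    "Words": rng.integers(20, 80, size=n),
})
# Made-up rule: puzzles with more words are more likely to have a bingo
prob = 1 / (1 + np.exp(-(toy["Words"] - 50) / 10))
toy["Bingo"] = (rng.random(n) < prob).astype(int)

X = toy.drop(columns="Bingo")
y = toy["Bingo"]

# Split first so that preprocessing statistics come from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["Points", "Words"]),
    ("center", OneHotEncoder(handle_unknown="ignore"), ["Center Letter"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),  # assumed model choice
])

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline means the same fitted transformations are applied to the test set, which avoids leaking test-set statistics into training.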

### Results Analysis
- write up the results
- make presentation