# **Spelling Bee (version with S)**

## [Riddler Classic](https://fivethirtyeight.com/features/can-you-solve-the-vexing-vexillology/), Jan 3, 2020

### solution by [Laurent Lessard](https://laurentlessard.com)

The New York Times recently launched some new word puzzles, one of which is [Spelling Bee](https://www.nytimes.com/puzzles/spelling-bee). In this game, seven letters are arranged in a honeycomb lattice, with one letter in the center. Here’s the lattice from December 24, 2019:

<img src="https://fivethirtyeight.com/wp-content/uploads/2020/01/Screen-Shot-2019-12-24-at-5.46.55-PM.png?w=1136" width="250">

The goal is to identify as many words as possible that meet the following criteria:
- The word must be at least four letters long.
- The word must include the central letter.
- The word cannot include any letter beyond the seven given letters.

Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.
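
As a quick check of the scoring rule, here is a minimal sketch. The function name `score_word` and its `board` argument are illustrative choices, and the board is assumed to consist of the seven distinct letters of MEGAPLEX. The function assumes the word has already been checked for validity (length, center letter, allowed letters).

```python
def score_word(word, board):
    """Score one word against a seven-letter board (validity is assumed)."""
    # four-letter words are worth 1 point, longer words are worth their length
    points = 1 if len(word) == 4 else len(word)
    # a pangram uses every letter on the board and earns 7 bonus points
    if set(word) == set(board):
        points += 7
    return points

board = "megaplx"  # assumed board: the seven distinct letters of MEGAPLEX
for word in ["game", "amalgam", "megaplex"]:
    print(word.upper(), score_word(word, board))
# GAME 1
# AMALGAM 7
# MEGAPLEX 15
```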

Which seven-letter honeycomb results in the highest possible game score? To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.

For consistency, please use [this word list](https://norvig.com/ngrams/enable1.txt) to check your game score.

---


# My solution

I used a brute-force approach: compute the score of every possible game board, and then pick the best one. That being said, not all brute-force approaches are created equal! The **slow way** to solve this problem is to enumerate all sets of 7 distinct letters with each possible central tile. This leads to 4,604,600 boards to test (3,364,900 if we don't include S). A **better approach** is to start by listing all pangrams, i.e. all words in the word list that contain exactly 7 distinct letters. There are 37,876 of them (14,741 if we don't include S). Since a board must contain a pangram in order to be eligible, we can extract all possible boards from the list of pangrams! This leads to a set of 109,193 boards to test (55,902 if we don't include S), which is a much smaller list.
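
The brute-force counts can be reproduced with `math.comb`, and the pangram-based enumeration can be sketched on a tiny word list. This is only an illustration: `boards_from_pangrams` and `toy_words` are hypothetical names, and the four-word list stands in for the real dictionary.

```python
from math import comb

# slow way: choose 7 of the 26 letters, then pick one of the 7 as the center
print(7 * comb(26, 7))  # 4604600
print(7 * comb(25, 7))  # 3364900 (S excluded)

def boards_from_pangrams(words):
    """Return every board (center letter first) that admits at least one pangram."""
    # each word with exactly 7 distinct letters defines one set of seven letters
    letter_sets = {"".join(sorted(set(w))) for w in words if len(set(w)) == 7}
    # each set of seven letters gives 7 boards, one per choice of center letter
    return sorted(s[i] + s[:i] + s[i+1:] for s in letter_sets for i in range(7))

toy_words = ["megaplex", "game", "princox", "amalgam"]  # hypothetical mini word list
boards = boards_from_pangrams(toy_words)
print(len(boards))  # 14 (two sets of seven letters, seven centers each)
```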

To find the score for a particular board, we need to look at all words in the dictionary, see which ones we can make using the letters in our board, and add up all the scores for those words. We can use several strategies to save time in this step:
- Make the dictionary smaller. We can eliminate all words that have fewer than 4 letters, or that contain more than 7 different letters.
- Pre-compute the scores of all words in the dictionary and store them in a lookup table.
- Create separate dictionaries for each "center tile", e.g. dictionary of all words that contain 'a', all words that contain 'b', etc.
- Use loops that exit early whenever possible; no need to search the rest of the list if we already found what we're looking for.
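
The strategies above can be sketched on a toy word list as follows. This is a simplified illustration rather than the notebook's exact code: the names `word_points`, `build_tables`, `score_board_sketch` and `toy_words` are made up for this example, and words that share the same set of letters are merged into a single entry.

```python
from collections import defaultdict

def word_points(word):
    """1 point for four letters, otherwise the length, plus 7 for a pangram."""
    bonus = 7 if len(set(word)) == 7 else 0
    return (1 if len(word) == 4 else len(word)) + bonus

def build_tables(words):
    """Total the points of all words sharing a letter set, indexed by letter."""
    totals = defaultdict(int)
    for w in words:
        # smaller dictionary: drop short words and words with too many letters
        if len(w) >= 4 and len(set(w)) <= 7:
            totals[frozenset(w)] += word_points(w)  # pre-computed scores
    by_letter = defaultdict(dict)
    for letter_set, pts in totals.items():
        for lett in letter_set:  # one lookup table per possible center letter
            by_letter[lett][letter_set] = pts
    return by_letter

def score_board_sketch(board, by_letter):
    """board[0] is the center letter; the rest are the other six letters."""
    allowed = set(board)
    return sum(pts for ls, pts in by_letter[board[0]].items() if ls <= allowed)

toy_words = ["game", "gamma", "amalgam", "megaplex", "peel", "axe"]
tables = build_tables(toy_words)
print(score_board_sketch("gaelmpx", tables))  # 28 = 1 + 5 + 7 + 15
```

Only the table for the center letter is scanned, so words that cannot contain the center letter are never examined.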

After all was said and done, it took about 26 min to score all the boards (6 min if we don't include S).

- The winning board (using S) was **EAINRST** (with E in the center), for a total score of 8681.<br/>
This board yields a total of 1179 words, 86 of which are pangrams.<br/>
The highest-scoring words are ENTERTAINERS, INTERSTRAINS, STRAITNESSES (19 points each).

- The winning board (not using S) was **RAEGINT** (with R in the center), for a total score of 3898.<br/>
This board yields a total of 537 words, 50 of which are pangrams.<br/>
The highest-scoring words are REAGGREGATING and REINTEGRATING (20 points each).

- The worst possible board was **XCINOPR** (with X in the center), for a total score of 14.<br/>
There is only one valid word you can make with this board (must be 4+ letters long, must include X): PRINCOX.<br/>
So a single word, and it's a pangram!

Here is the code I wrote to solve the problem:

---

### Perform pre-computations
Filter the dictionary to eliminate words that can't occur, and discard all boards that don't contain a pangram.

In [235]:
def get_wordscore(word):
    wordlen = len(word)
    if wordlen == 4:
        return 1
    elif len(set(word)) == 7:  # if it's a pangram, add 7 more points
        return wordlen + 7
    else:
        return wordlen

In [268]:
# read full word list
with open('enable1.txt', 'r') as f:
    words = f.read().splitlines()
    
# all letters in the alphabet
letters = 'abcdefghijklmnopqrstuvwxyz'

remlett = 's'

letters = ''.join(set(letters).difference(set(remlett)))
for lett in remlett:
    words = [word for word in words if lett not in word]
    
# keep only the words that have length at least 4
words = [word for word in words if len(word) >= 4]

# keep only the words that have at most 7 distinct letters
words = [word for word in words if len(set(word)) <= 7]

In [269]:
# dictionary of scores for all words
word_scores = dict()
wordtuples = [tuple(sorted(list(set(word)))) for word in words]

# initialize
for wtup in wordtuples:
    word_scores[wtup] = 0
    
# add up all the scores
for word in words:
    word_scores[ tuple(sorted(list(set(word)))) ] += get_wordscore(word)

# make separate dictionaries for each letter that the word could contain
words_by_letter = dict()
for lett in letters:
    words_by_letter[lett] = { wtup:set(wtup) for wtup in wordtuples if lett in wtup }

In [270]:
# set of pangram words
pangrams = [word for word in words if len(set(word)) == 7]

pangramseven = [ ''.join(sorted(list(set(pan)))) for pan in pangrams]


In [239]:
%%time
board_scores = dict()

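for ps in pangramseven:
    for i,lett in enumerate(ps):
        board = lett + ps[:i] + ps[i+1:]
        # total score of every letter set that contains the center letter and fits on this board
        board_scores[board] = sum( word_scores[wtup]
                                   for wtup,wset in words_by_letter[lett].items()
                                   if wset.issubset(ps) )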

Wall time: 31.8 s


In [227]:
len(board_scores)

1918

In [200]:
wordsets = [set(word) for word in words]

# example for a single center letter: assemble all words containing the letter 'a'
wordlist = [word for word in words if 'a' in word]

wordsets_a = [set(word) for word in wordlist]

# set of pangram words
pangrams = [word for word in words if len(set(word)) == 7]

# pangrams containing the letter a
pangramlist = [pangram for pangram in pangrams if 'a' in pangram]

pangramseven = sorted(list(set( [ ''.join(sorted(set(pangram))) for pangram in pangramlist ] )))

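from collections import defaultdict

# tally for each set of seven letters (missing keys start at 0)
d = defaultdict(int)

for p in pangramseven:
    for w in wordsets_a:
        if w.issubset(p):
            # ASSUMPTION: the right-hand side of this statement is missing in the source;
            # counting the matching words is one plausible completion
            d[p] += 1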



print("there are", len(words), "admissible words in the dictionary")
print("there are", len(pangrams), "pangrams in the dictionary")

# all the strings of seven distinct letters that can produce a pangram
# (start from list of pangrams and extract all sets of seven letters)
valid_seven_letters = sorted(list(set( [ ''.join(sorted(set(list(p)))) for p in pangrams ] )))

# list all admissible boards (boards that contain at least one pangram)
boards = [ s[i]+s[:i]+s[i+1:] for i in range(7) for s in valid_seven_letters ]

print("there are", len(boards), "different valid boards (contain at least one pangram)")

there are 10495 admissible words in the dictionary
there are 1929 pangrams in the dictionary
there are 9121 different valid boards (contain at least one pangram)


In [194]:
len(words)

10495

### Method for scoring individual words

In [195]:
# compute the score of a given word
def word_score(word):
    wordlen = len(word)
    if wordlen == 4:
        return 1
    elif len(set(word)) == 7:  # if it's a pangram, add 7 more points
        return wordlen + 7
    else:
        return wordlen

# helper lookup table of all admissible words and their scores
# precomputing the scores of all words saves time later!
wscorelook = {word:word_score(word) for word in words}

In [196]:
import pandas as pd

In [197]:
wordsets = [tuple(sorted(list(set(word)))) for word in words]
df = pd.DataFrame( data=words, columns=['word'] )
df['score'] = df.word.apply(word_score)
df['wordtup'] = df.word.apply(lambda x: tuple(sorted(list(set(x)))))
df2 = df.pivot_table(values='score', index='wordtup', aggfunc='sum')
df2['wordset'] = df2.index
df2['wordset'] = df2.wordset.apply(set)

In [198]:
%%time

boardscores = dict()
for board in boards:
    boardscores[board] = 0

for wtup in df2.index:
    wset = df2.wordset[wtup]
    wsco = df2.score[wtup]
    for board in boards:
        if wset.issubset(board):
            boardscores[board] += wsco

Wall time: 40.5 s


In [9]:
from scipy.sparse import dok_matrix

In [59]:
P = [[ lett in word for lett in letters ] for word in boards]
Q = [[ lett in word for lett in letters ] for word in words]

In [89]:
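def iscontainedin(p,q):
    # ASSUMED completion (source breaks off at "if i >"): no flag of p may exceed q's
    for i,j in zip(p,q):
        if i > j:
            return False
    return True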

True

In [None]:
# to check if p is contained in q,
# test that every letter flagged in p is also flagged in q

In [78]:
%%time
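# R[w][b] is True when word w uses only letters that appear on board b
R = [[ all(qi <= pi for qi,pi in zip(q,p)) for p in P ] for q in Q ]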

Wall time: 1.96 s


In [77]:
F = [[i+j for i in range(3)] for j in range(4)]

In [80]:
len(R)

3488

### Method for scoring boards

In [274]:
len(words)

44585

In [None]:
# 1.87s vs 5.31s vs 6.19s

In [279]:
wdic = dict()
for word in words:
    wdic[word] = set(word)
    
wsets = list(wdic.values())

In [280]:
%%time
q = [ [wset.issubset(board) for wset in wsets] for board in boards[:100] ]

Wall time: 2.76 s


In [265]:
def islegal3(word,board):
    return set(word).issubset(board)

In [260]:
def islegal2(word,board):
    return all(lett in board for lett in word)

In [276]:
def islegal4(word,board):
    return word.issubset(board)

In [262]:
# create a separate list of words for each center letter
# this also helps to save time
wlistc = {}
for center_letter in letters:
    wlistc[center_letter] = [word for word in wordlist if center_letter in word]

# helper function: is a given word makeable from a given board?
# NOTE: this implementation is about 3x faster than using set(word).issubset(board)
def islegal(word,board):
    for lett in word:
        if lett not in board:
            return False
    return True

# compute the score of a given board
def score_board(board):
    return sum( [wscorelook[word] for word in wlistc[board[0]] if islegal(word,board)] )
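
Another way to test whether a word fits on a board is to encode letter sets as bitmasks. This is an alternative technique that does not appear in the notebook; `letter_mask` and `islegal_mask` are hypothetical helper names, and no timing claim is made here relative to `islegal`.

```python
def letter_mask(text):
    """Encode the distinct lowercase letters of text as a 26-bit integer."""
    mask = 0
    for ch in text:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

def islegal_mask(word_mask, board_mask):
    """True when the word uses no letter outside the board."""
    return (word_mask & ~board_mask) == 0

board_mask = letter_mask("raegint")
print(islegal_mask(letter_mask("entertain"), board_mask))  # True
print(islegal_mask(letter_mask("gamer"), board_mask))      # False (M is not on the board)
```

The masks for all words can be computed once up front, so each board test reduces to a single integer operation.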

In [244]:
%%time
import numpy as np
A = dok_matrix( (len(words),len(boards)), dtype=bool )
for (i,word) in enumerate(words):
    for (j,board) in enumerate(boards):
        if islegal(word,board):
            A[i,j] = True


KeyboardInterrupt: 

In [19]:
%%time
import numpy as np
A = dok_matrix( (len(words),len(boards)), dtype=bool )
for (j,board) in enumerate(boards):
    for (i,word) in enumerate(words):
        if islegal(word,board):
            A[i,j] = True

Wall time: 8.34 s


### Big computation: score all the boards!

In [None]:
%%time
board_scores = [ (board,score_board(board)) for board in boards ]

In [19]:
# sort all the boards based on their score
board_scores.sort(key = lambda x: x[1])

print("The best board (that may contain S) is:", board_scores[-1][0], "with score:", board_scores[-1][1])

for (board,score) in board_scores[::-1]:
    if 's' not in board:
        print("The best board (that does not contain S) is:", board, "with score:", score)
        break

print("The worst board is:", board_scores[0][0], "with score:", board_scores[0][1])

The best board (that may contain S) is: eainrst with score: 8681
The best board (that does not contain S) is: raegint with score: 3898
The worst board is: xcinopr with score: 14


In [36]:
# get stats for a given board
def get_stats(board):
    # extract words from this winner
    wrds = [ (word,wscorelook[word]) for word in wlistc[board[0]] if islegal(word,board) ]
    wrds.sort(key=lambda x: x[1], reverse=True)
    pcount = len( [word for word,score in wrds if len(set(word))==7] )
    print("The board", board, "can make", len(wrds), "valid words, including", pcount, "pangrams.")
    print("Total score is", sum([score for word,score in wrds]), "and the top 5 words are:")
    print(wrds[:5])

In [37]:
get_stats('eainrst')

The board eainrst can make 1179 valid words, including 86 pangrams.
Total score is 8681 and the top 5 words are:
[('entertainers', 19), ('interstrains', 19), ('straitnesses', 19), ('intenerates', 18), ('interstates', 18)]


In [38]:
get_stats('raegint')

The board raegint can make 537 valid words, including 50 pangrams.
Total score is 3898 and the top 5 words are:
[('reaggregating', 20), ('reintegrating', 20), ('entertaining', 19), ('intenerating', 19), ('regenerating', 19)]


In [39]:
get_stats('xcinopr')

The board xcinopr can make 1 valid words, including 1 pangrams.
Total score is 14 and the top 5 words are:
[('princox', 14)]


# Hector's solution

In [1]:
# read full word list
with open('enable1.txt', 'r') as f:
    words = f.read().splitlines()

letters = 'abcdefghijklmnopqrstuvwxyz'

# letters to be removed --- remove them from the alphabet and from the wordlist
remletts = 's'
letters = ''.join(set(letters).difference(set(remletts)))
for lett in remletts:
    words = [word for word in words if lett not in word]
    
# keep only the words that have length at least 4 and have at most 7 distinct letters
words = [ word for word in words if len(word) >= 4 and len(set(word)) <= 7 ]

In [2]:
def score_word(word):
    wordlen = len(word)
    if wordlen == 4:
        return 1
    if len(set(word)) == 7:  # if it's a pangram, add 7 more points
        return wordlen + 7
    return wordlen

# pre-compute score of every word
word_scores = { word: score_word(word) for word in words }

In [3]:
# set of seven-letter tuples that can produce pangrams
pansets = set( [ tuple(sorted(list(set(word)))) for word in words if len(set(word)) == 7 ] )

# for each set of seven letters, store: letterset and wordlist
letterClusters = [ {'letterset': set(pan), 'wordlist': [] }  for pan in pansets ]

In [4]:
%%time
# compute the word list for each set of seven letters
# iterate through WORDS FIRST then through sets of seven letters

for word in words:
    wordletters = set(word)   # separate name, so the alphabet string 'letters' is not overwritten
    for cluster in letterClusters:
        if wordletters.issubset(cluster['letterset']):
            cluster['wordlist'].append(word)

Wall time: 2min 15s


In [5]:
# compute the score of a cluster using a particular center letter
def score_cluster(cluster,centerLetter):
    return sum( [word_scores[word] for word in cluster['wordlist'] if centerLetter in word] )


In [6]:
# compute score of all clusters and sort the list
scores = []
for cluster in letterClusters:
    s = ''.join(sorted(cluster['letterset']))
    for i in range(7):
        scores.append( (s[i]+s[:i]+s[i+1:], score_cluster(cluster,s[i])) )

scores.sort(key = lambda x: x[1], reverse=True)

print("the 5 best scores are:")
scores[:5]

the 5 best scores are:


[('raegint', 3898),
 ('naegirt', 3782),
 ('eaginrt', 3769),
 ('eadinrt', 3672),
 ('taeginr', 3421)]

## Actually Hector's solution

In [7]:
# Return sorted tuple of distinct letters in the word
def lettersIn(word):
    letters = list(set(word))
    letters.sort()
    return tuple(letters)

wordsList = words

# Find all words with exactly 7 distinct letters; these can be pangrams, and every
# bee uses such a set of letters.
letterClusters = []
used = set()   # letter tuples already seen (a set gives fast membership tests)
for word in wordsList:
    letters = lettersIn(word)
    if len(letters) == 7 and letters not in used:
        letterClusters.append( { 'letterTuple': letters, 'letterSet': set(letters), 'wordList': [] } )
        used.add(letters)

In [8]:
%%time
# This takes the bulk of run time. Looping over words first saves a bunch of 'set' calls.
for word in wordsList:
    letters = set(word)
    for cluster in letterClusters:
        if letters.issubset(cluster['letterSet']):
            cluster['wordList'].append(word)

Wall time: 1min 51s


In [9]:
# Score a bee: 4-letter words are worth 1 point, longer words are worth their length,
# and pangrams earn 7 bonus points
def score(cluster,centerLetter):
    s = 0
    for word in cluster['wordList']:
        if centerLetter in word:
            if len(word) == 4:
                s += 1
            else:
                s += len(word)
            if len(set(word)) == 7:
                s += 7
    return s

# Score all bees, i.e., all pangram 7-tuples and all choices of center letter
highestSoFar = {'score':0}
for cluster in letterClusters:
    for centerLetter in cluster['letterTuple']:
        s = score(cluster,centerLetter) 
        if s > highestSoFar['score']:
            highestSoFar = {'score': s, 'letterTuple': cluster['letterTuple'], 'centerLetter': centerLetter, 'wordList': cluster['wordList']}

In [10]:
# Report results
print('Maximum score is', highestSoFar['score'], 'for the bee [(letters), center letter]:')
print(highestSoFar['letterTuple'], ',', highestSoFar['centerLetter'])
# use a separate name so the global word list 'words' is not overwritten
beeWords = [ word for word in highestSoFar['wordList'] if highestSoFar['centerLetter'] in word ]
print('This bee has', len(beeWords), 'words:')
for i,word in enumerate(beeWords):
    # pangrams are printed in upper case
    if len(set(word)) == 7:
        print(word.upper(), "")
    else:
        print(word, "")
    # blank line after every seventh word
    if (i+1) % 7 == 0:
        print("")
print("")

Maximum score is 3898 for the bee [(letters), center letter]:
('a', 'e', 'g', 'i', 'n', 'r', 't') , r
This bee has 537 words:
aerate 
AERATING 
aerie 
aerier 
agar 
ager 
agger 

aggregate 
AGGREGATING 
aginner 
agrarian 
agree 
agreeing 
agria 

aigret 
aigrette 
airer 
airier 
airing 
airn 
airt 

airting 
anear 
anearing 
anergia 
angaria 
anger 
angering 

angrier 
anteater 
antiair 
antiar 
antiarin 
antra 
antre 

area 
areae 
arena 
arenite 
arete 
argent 
ARGENTINE 

ARGENTITE 
arginine 
aria 
arietta 
ariette 
arraign 
arraigning 

arrange 
arranger 
arranging 
arrant 
arrear 
arrearage 
artier 

atria 
attainer 
attar 
attire 
attiring 
attrite 
eager 

eagerer 
eagre 
earing 
earn 
earner 
earning 
earring 

eater 
eerie 
eerier 
eger 
eggar 
egger 
egret 

engager 
engineer 
engineering 
engirt 
engrain 
engraining 
enrage 

enraging 
enter 
entera 
enterer 
entering 
entertain 
entertainer 

ENTERTAINING 
entire 
entrain 
entrainer 
ENTRAINING 
entrant 
entreat 

ENTREATIN