# **Spelling Bee (version with S)**

## [Riddler Classic](https://fivethirtyeight.com/features/can-you-solve-the-vexing-vexillology/), Jan 3, 2020

### solution by [Laurent Lessard](https://laurentlessard.com)

The New York Times recently launched some new word puzzles, one of which is [Spelling Bee](https://www.nytimes.com/puzzles/spelling-bee). In this game, seven letters are arranged in a honeycomb lattice, with one letter in the center. Here’s the lattice from December 24, 2019:

<img src="https://fivethirtyeight.com/wp-content/uploads/2020/01/Screen-Shot-2019-12-24-at-5.46.55-PM.png?w=1136" width="250">

The goal is to identify as many words as possible that meet the following criteria:
- The word must be at least four letters long.
- The word must include the central letter.
- The word cannot include any letter beyond the seven given letters.

Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.

Which seven-letter honeycomb results in the highest possible game score? To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.

For consistency, please use [this word list](https://norvig.com/ngrams/enable1.txt) to check your game score.

---


# My solution

I used a brute-force approach: compute the score of every possible game board, then pick the best one. That said, not all brute-force approaches are created equal! The **slow way** to solve this problem is to enumerate all sets of 7 distinct letters with each possible central tile. This leads to 4,604,600 boards to test (3,364,900 if we don't include S). A **better approach** is to start by listing all pangrams, i.e., all words in the dictionary that contain exactly 7 distinct letters. There are 37,876 of them (14,741 if we don't include S). Since a board must contain a pangram in order to be eligible, we can extract all possible boards from the list of pangrams! This leads to a set of 109,193 boards to test (55,902 if we don't include S), which is a much smaller list.
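
As a quick sanity check on the counts for the slow way: a board is a choice of 7 letters out of 26 (or out of 25 without S), times 7 choices for the central tile. The variable names below are only for illustration.

```python
from math import comb

# boards = (ways to choose 7 distinct letters) x (7 choices of central tile)
boards_with_s = comb(26, 7) * 7
boards_without_s = comb(25, 7) * 7  # same count with S removed from the alphabet

print(boards_with_s)     # 4604600
print(boards_without_s)  # 3364900
```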

To find the score for a particular board, we need to look at all words in the dictionary, see which ones we can make using the letters in our board, and add up all the scores for those words. We can use several strategies to save time in this step:
- Make the dictionary smaller. We can eliminate all words that have fewer than 4 letters, or that contain more than 7 different letters.
- Pre-compute the scores of all words in the dictionary and store them in a lookup table.
- Create separate dictionaries for each "center tile", e.g. dictionary of all words that contain 'a', all words that contain 'b', etc.
- Use loops that exit early whenever possible; no need to search the rest of the list if we already found what we're looking for.
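
Here is a minimal sketch of these four strategies on a tiny, made-up word list (not `enable1.txt`). The names `toy_words`, `toy_score`, `toy_scores`, `by_center`, and `toy_board_score` are hypothetical and exist only for this illustration. As in the rest of this post, a board is written as a 7-letter string whose first letter is the center tile.

```python
# Made-up mini word list, for illustration only (not enable1.txt)
toy_words = ["gag", "game", "exam", "apple", "maple", "amalgam", "megaplex"]

# 1. shrink the dictionary: at least 4 letters, at most 7 distinct letters
toy_words = [w for w in toy_words if len(w) >= 4 and len(set(w)) <= 7]

# 2. precompute the score of every word once and store it in a lookup table
def toy_score(w):
    if len(w) == 4:
        return 1
    return len(w) + (7 if len(set(w)) == 7 else 0)  # +7 bonus for a pangram

toy_scores = {w: toy_score(w) for w in toy_words}

# 3. build one word list per possible center letter
by_center = {}
for w in toy_words:
    for c in set(w):
        by_center.setdefault(c, []).append(w)

# 4. all() stops at the first letter of the word that is not on the board
def toy_board_score(board):
    center = board[0]
    return sum(toy_scores[w] for w in by_center.get(center, [])
               if all(ch in board for ch in w))

print(toy_board_score("gaelmpx"))  # 23 = game (1) + amalgam (7) + megaplex (15)
print(toy_board_score("agelmpx"))  # 34: also counts exam (1), apple (5), maple (5)
```

The same seven letters give different scores depending on the center tile, which is why each letter set has to be scored seven times.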

After all was said and done, it took about 26 min to score all the boards (6 min if we don't include S).

- The winning board (using S) was **EAINRST** (with E in the center), for a total score of 8681.<br/>
This board yields a total of 1179 words, 86 of which are pangrams.<br/>
The highest-scoring words are ENTERTAINERS, INTERSTRAINS, STRAITNESSES (19 points each).

- The winning board (not using S) was **RAEGINT** (with R in the center), for a total score of 3898.<br/>
This board yields a total of 537 words, 50 of which are pangrams.<br/>
The highest-scoring words are REAGGREGATING and REINTEGRATING (20 points each).

- The worst possible board was **XCINOPR** (with X in the center), for a total score of 14.<br/>
There is only one valid word you can make with this board (must be 4+ letters long, must include X): PRINCOX.<br/>
So a single word, and it's a pangram!

Here is the code I wrote to solve the problem:

---

### Perform pre-computations
Filter the dictionary to eliminate words that can't occur, and build the list of boards that contain at least one pangram.

In [1]:
# read full word list
with open('enable1.txt', 'r') as f:
    wordlist = f.read().splitlines()
    
# all letters in the alphabet
letters = 'abcdefghijklmnopqrstuvwxyz'

# keep only the words that have length at least 4
wordlist = [word for word in wordlist if len(word) >= 4]

# keep only the words that have at most 7 distinct letters
wordlist = [word for word in wordlist if len(set(word)) <= 7]

# list of pangram words (words with exactly 7 distinct letters)
panlist = [word for word in wordlist if len(set(word)) == 7]

print("there are", len(wordlist), "admissible words in the dictionary")
print("there are", len(panlist), "pangrams in the dictionary")

# all the strings of seven distinct letters that can produce a pangram
# (start from list of pangrams and extract all sets of seven letters)
valid_seven_letters = sorted(set(''.join(sorted(set(p))) for p in panlist))

# list all admissible boards (boards that contain at least one pangram)
boards = [ s[i]+s[:i]+s[i+1:] for i in range(7) for s in valid_seven_letters ]

print("there are", len(boards), "different valid boards (contain at least one pangram)")

there are 98141 admissible words in the dictionary
there are 37876 pangrams in the dictionary
there are 109193 different valid boards (contain at least one pangram)
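
To make the board construction concrete, here is what it does for a single pangram. I use PRINCOX, the one word on the lowest-scoring board reported above; the names `s` and `example_boards` are only for this illustration.

```python
pangram = "princox"

# canonical key: the pangram's distinct letters in alphabetical order
s = "".join(sorted(set(pangram)))

# one board per choice of center letter; the center letter goes first
example_boards = [s[i] + s[:i] + s[i+1:] for i in range(7)]

print(s)               # cinoprx
print(example_boards)  # the last entry is 'xcinopr'
```

Every pangram with the same set of letters maps to the same key, so taking the set of keys removes duplicates before the 7 center choices are expanded.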


### Method for scoring individual words

In [17]:
# compute the score of a given word
def word_score(word):
    wordlen = len(word)
    if wordlen == 4:
        return 1
    elif len(set(word)) == 7:  # if it's a pangram, add 7 more points
        return wordlen + 7
    else:
        return wordlen

# helper lookup table of all admissible words and their scores
# precomputing the scores of all words saves time later!
wscorelook = {word:word_score(word) for word in wordlist}

### Method for scoring boards

In [18]:
# create a separate list of words for each center letter
# this also helps to save time
wlistc = {}
for center_letter in letters:
    wlistc[center_letter] = [word for word in wordlist if center_letter in word]

# helper function: is a given word makeable from a given board?
# NOTE: this implementation is about 3x faster than using set(word).issubset(board)
def islegal(word,board):
    for lett in word:
        if lett not in board:
            return False
    return True

# compute the score of a given board
def score_board(board):
    return sum( [wscorelook[word] for word in wlistc[board[0]] if islegal(word,board)] )
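
If you want to check the speed claim in the comment above on your own machine, a rough sketch along these lines should work. The function names `islegal_loop` and `islegal_set` and the sample words are chosen just for this test, and the measured ratio will depend on your Python version and on the words used, so I make no claim about the exact numbers it prints.

```python
import timeit

# early-exit loop, same logic as islegal above
def islegal_loop(word, board):
    for lett in word:
        if lett not in board:
            return False
    return True

# set-based alternative mentioned in the comment
def islegal_set(word, board):
    return set(word).issubset(board)

board = "raegint"
samples = ["reintegrating", "entertaining", "princox", "grate", "zebra"]

# the two versions must agree before comparing their speed
for w in samples:
    assert islegal_loop(w, board) == islegal_set(w, board)

t_loop = timeit.timeit(lambda: [islegal_loop(w, board) for w in samples], number=20000)
t_set = timeit.timeit(lambda: [islegal_set(w, board) for w in samples], number=20000)
print(f"loop: {t_loop:.3f}s   set: {t_set:.3f}s")
```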

### Big computation: score all the boards!

In [None]:
%%time
board_scores = [ (board,score_board(board)) for board in boards ]

In [19]:
# sort all the boards based on their score
board_scores.sort(key = lambda x: x[1])

print("The best board (that may contain S) is:", board_scores[-1][0], "with score:", board_scores[-1][1])

for (board,score) in board_scores[::-1]:
    if 's' not in board:
        print("The best board (that does not contain S) is:", board, "with score:", score)
        break

print("The worst board is:", board_scores[0][0], "with score:", board_scores[0][1])

The best board (that may contain S) is: eainrst with score: 8681
The best board (that does not contain S) is: raegint with score: 3898
The worst board is: xcinopr with score: 14


In [36]:
# get stats for a given board
def get_stats(board):
    # extract the valid words for this board, along with their scores
    wrds = [ (word,wscorelook[word]) for word in wlistc[board[0]] if islegal(word,board) ]
    wrds.sort(key=lambda x: x[1], reverse=True)
    pcount = len( [word for word,score in wrds if len(set(word))==7] )
    print("The board", board, "can make", len(wrds), "valid words, including", pcount, "pangrams.")
    print("Total score is", sum([score for word,score in wrds]), "and the top 5 words are:")
    print(wrds[:5])

In [37]:
get_stats('eainrst')

The board eainrst can make 1179 valid words, including 86 pangrams.
Total score is 8681 and the top 5 words are:
[('entertainers', 19), ('interstrains', 19), ('straitnesses', 19), ('intenerates', 18), ('interstates', 18)]


In [38]:
get_stats('raegint')

The board raegint can make 537 valid words, including 50 pangrams.
Total score is 3898 and the top 5 words are:
[('reaggregating', 20), ('reintegrating', 20), ('entertaining', 19), ('intenerating', 19), ('regenerating', 19)]


In [39]:
get_stats('xcinopr')

The board xcinopr can make 1 valid words, including 1 pangrams.
Total score is 14 and the top 5 words are:
[('princox', 14)]
