# Wordle

https://www.powerlanguage.co.uk/wordle/

We've all played Wordle by now! And we've all agonised (just me?) over the optimal choice of first word. We could choose the first word that pops into our heads, a word with very common letters, a word with lots of vowels... 

In this notebook I want to find the optimal first word choice for a few different strategies.


First of all we'll need some words. I'm using the list of all playable words in the Wordle game, obtained from the JavaScript file that runs the game online. (Thanks to this blog post for the tip: https://bert.org/2021/11/24/the-best-starting-word-in-wordle/)

In [1]:
import pandas as pd
import numpy as np

In [2]:
five_letter_words_pd = pd.read_csv("wordle_words.csv")
FIVE_LETTER_WORDS = five_letter_words_pd["word"].tolist()
num_five_letter_words = len(FIVE_LETTER_WORDS)

In [3]:
print('# 5 letter words:', num_five_letter_words)
five_letter_words_pd

# 5 letter words: 12972


Unnamed: 0,word
0,bubus
1,civic
2,masus
3,tikis
4,dolly
...,...
12967,myope
12968,sugar
12969,yurts
12970,wiper


## Letter occurrences in 5 letter words

The first strategy is to choose a first word made up of the most common letters. The hypothesis is that this increases the chance of collisions with the answer, and so increases the number of orange and green squares we expect from our first guess.

Inspired by this blog https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html we want the most common letters in the set of all 5 letter words, i.e. not the most common letters in all English *text*, where the letter *t* is very common due to the prevalence of the word *"the"*.

Let's work out the frequency of each letter in our 5 letter word set. 

In [4]:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
# concatenate all 5 letter words together so we can count occurrences of each letter.
concatenated_words = ''.join(FIVE_LETTER_WORDS)
letter_occurences = [concatenated_words.count(letter) for letter in alphabet]

letter_occurences_pd = pd.DataFrame({'letter': alphabet, 'count': letter_occurences})

In [5]:
letter_occurences_pd.sort_values(by="count", ascending=False, ignore_index=True)

Unnamed: 0,letter,count
0,s,6665
1,e,6662
2,a,5990
3,o,4438
4,r,4158
5,i,3759
6,l,3371
7,t,3295
8,n,2952
9,u,2511


The 5 most common letters in this set of 5 letter English words are *s, e, a, o, r*. Using the first anagram solver I found on the web, we find the optimal word under this strategy: **AROSE**.
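The anagram lookup can also be done directly against the word list instead of with an external solver. The following is a minimal sketch, not the method used above: it counts letters with `collections.Counter`, takes the five most common, and keeps the words whose letters are exactly that set. The helper name `words_from_top_letters` and the tiny `sample_words` list are hypothetical; in this notebook you would pass `FIVE_LETTER_WORDS` instead.

```python
from collections import Counter


def words_from_top_letters(words, n_letters=5):
    """Return the words that are anagrams of the n most common letters in `words`.

    Hypothetical helper, for illustration only.
    """
    # count every letter across all words (same idea as concatenating and counting)
    letter_counts = Counter("".join(words))
    # the n most frequent letters, sorted alphabetically so they can be compared
    top_letters = sorted(letter for letter, _ in letter_counts.most_common(n_letters))
    # a word qualifies if its sorted letters equal the sorted top letters
    return [word for word in words if sorted(word) == top_letters]


# toy word list, NOT the real Wordle list
sample_words = ["arose", "soare", "roses", "eases", "oases", "arson"]
print(words_from_top_letters(sample_words))
```

Note that ties between equally common letters are broken by first appearance in `Counter.most_common`, so on a real list it is worth checking the counts around the cut-off before trusting the result.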

## Expected number of green squares

The next strategy will be to maximise the number of green squares we get with our first guess. After all, the aim of the game is to get all 5 green squares.

For a given guess word we can loop over all possible answer words and calculate how many green squares we would have got in each case. We'd like to choose the guess word with the highest average number of green squares (when we average over all possible answer words).

In [6]:
def expected_number_of_greens(guess_word):
    """The expected number of green squares for a given word - if the unknown word is chosen uniformly from all five letter words."""

    # number of green squares against every possible answer
    overlaps = []

    # loop over all 5 letter words
    for answer_word in FIVE_LETTER_WORDS:

        count = 0
        # for each position check if guess_word matches word - i.e. if there would be a green square
        for letter_position in range(5):
            if guess_word[letter_position] == answer_word[letter_position]:
                count = count + 1

        # append total number of green squares to overlaps list
        overlaps.append(count)

    # return average overlap size over all 5 letter words. 
    mean_green_count = np.mean(overlaps)

    return mean_green_count



In [7]:
%%time
# calculate for every possible guess word and sort by expected number of green squares
five_letter_words_pd["expected_number_of_green_squares"] = five_letter_words_pd["word"].apply(expected_number_of_greens)
five_letter_words_pd.sort_values(by="expected_number_of_green_squares", ascending=False, ignore_index=True)

CPU times: total: 2min 21s
Wall time: 2min 21s


Unnamed: 0,word,expected_number_of_green_squares
0,sores,0.859081
1,sanes,0.853916
2,sales,0.844974
3,sones,0.841042
4,soles,0.832100
...,...,...
12967,oxbow,0.109158
12968,imshi,0.108002
12969,ewhow,0.103916
12970,ethyl,0.096053


Under this strategy the optimal word choice is **SORES**, where we could expect about 0.86 green squares on average, i.e. nearly one green square per first guess.
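The loop above compares every guess with every answer, which is why it takes minutes. The same average can be computed from per-position letter counts: the expected number of greens for a guess is the sum, over the five positions, of the fraction of words that have the guess's letter at that position. Here is a sketch under that observation; `position_letter_counts` and `expected_number_of_greens_fast` are hypothetical names, and the word list is a toy stand-in for `FIVE_LETTER_WORDS`.

```python
from collections import Counter


def position_letter_counts(words):
    """For each of the 5 positions, count how often each letter appears there."""
    return [Counter(word[pos] for word in words) for pos in range(5)]


def expected_number_of_greens_fast(guess_word, counts, num_words):
    """Expected greens when the answer is drawn uniformly from the word list."""
    # each position contributes the number of answers sharing that letter there
    matches = sum(counts[pos][guess_word[pos]] for pos in range(5))
    return matches / num_words


# toy word list, NOT the real Wordle list
sample_words = ["sores", "sales", "arose", "ethyl"]
counts = position_letter_counts(sample_words)

for word in sample_words:
    print(word, expected_number_of_greens_fast(word, counts, len(sample_words)))
```

The counts are built once, so scoring each guess costs five dictionary lookups rather than a pass over the whole word list.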

## Expected number of words ruled out

Actually, maybe we want to choose a word that will rule out the most words. Forgetting about green and orange squares for a second, the aim of the game is to use the information we have to whittle down the five-letter words until there are few enough to guess, or there is only one possible word left.

Both green AND orange squares help us to rule out words so we should use all the information available to us. 

In [8]:
def guess_and_answer_rules_out_third_word(guess_word, answer_word, third_word):
    """Check whether third_word is ruled out by the green and orange squares we get for guess_word when the answer is answer_word."""
    # check if each position in guess_word is green or orange
    for letter_pos in range(5):
        letter = guess_word[letter_pos]
        if letter == answer_word[letter_pos]:
            # this square is green so check if this rules out third word
            # if third_word doesnt have correct letter at that position it is ruled out
            if third_word[letter_pos] != letter:
                return True
        elif letter in answer_word:
            # this square is orange so check if this rules out third word
            # if third_word doesnt contain this letter then it is ruled out
            if letter not in third_word:
                return True
            # if third_word does have this letter but at that same position then it is ruled out 
            # orange squares indicate that letter must be at a different position
            if third_word[letter_pos] == letter:
                return True
    
    # if we get this far then the third_word is still a valid answer
    return False

In [9]:
# e.g. if 
guess_word = "whisk"
# and
answer_word = "water"
# then we can check if
third_word = "trees"
# is ruled out by the green and orange squares we would receive after making our guess

guess_and_answer_rules_out_third_word(guess_word, answer_word, third_word)

True

In [10]:
def average_number_of_words_ruled_out(guess_word):
    """The expected number of words you will rule out - if the unknown word is chosen uniformly from all five letter words."""

    # number of words ruled out for each possible answer
    numbers_ruled_out = []

    # loop over all possible answers
    for answer_word in FIVE_LETTER_WORDS:

        count = 0
        for third_word in FIVE_LETTER_WORDS:
            if guess_and_answer_rules_out_third_word(guess_word, answer_word, third_word):
                count = count + 1

        # append the number of words ruled out for this answer
        numbers_ruled_out.append(count)

    # return the average number of words ruled out over all possible answers
    return np.mean(numbers_ruled_out)



In [11]:
arose_score = average_number_of_words_ruled_out('arose')
print('AROSE rules out', arose_score, 'words.')
print('This would leave', num_five_letter_words - arose_score, 'possible words remaining.')

AROSE rules out 10934.715695343817 words.
This would leave 2037.284304656183 possible words remaining.


In [12]:
sores_score = average_number_of_words_ruled_out('sores')
print('SORES rules out', sores_score, 'words.')
print('This would leave', num_five_letter_words - sores_score, 'possible words remaining.')

SORES rules out 10315.140302189331 words.
This would leave 2656.859697810669 possible words remaining.


By this reckoning AROSE is actually a better choice than SORES. The next step would be to calculate this score for all 13K possible first words; however, a quick back-of-the-envelope calculation suggests this would take 20 full days on my computer...
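That kind of estimate is just the time for one guess word multiplied by the number of guess words. A sketch of the arithmetic follows; `estimate_total_days` is a hypothetical helper, and the 133 seconds per word is not a timing measured in this notebook but an assumed value, roughly what a 20-day total would imply.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_total_days(seconds_per_word, num_words):
    """Extrapolate the run time for scoring every guess word, in days."""
    return seconds_per_word * num_words / SECONDS_PER_DAY


# 12972 is the size of the word list; 133 s per word is an ASSUMED timing
print(round(estimate_total_days(133, 12972), 1))
```

In the notebook, the per-word time could be measured by wrapping a single call to `average_number_of_words_ruled_out` between two `time.perf_counter()` readings.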

Go for **AROSE** :) 