## Wordle Project

The goal of this project will be to analyze five-letter words that are commonly used in the English language, in order to inform decision making for the popular online game Wordle. Wordle requires players to guess a five-letter word that is determined at random. Guessed letters in the correct position are filled in with a green color, while letters that are somewhere in the word but in the wrong position are colored yellow. Letters that are not in the word at all are shown in gray, and can be eliminated by the guesser. The player has six attempts to guess the word before losing.
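The coloring rules described above can be sketched in a few lines of Python. This is only an illustration, not the game's actual implementation: the function name `score_guess`, the `G`/`Y`/`-` encoding, and the treatment of repeated letters (greens are assigned first, then yellows are limited by the number of unmatched copies left in the answer) are assumptions made for this example.

```python
from collections import Counter


def score_guess(guess, answer):
    """Return a 5-character pattern: G = green, Y = yellow, - = gray.

    Hypothetical helper for illustration; assumes both words have 5 letters.
    """
    result = ['-'] * 5
    remaining = Counter()  # answer letters not matched by a green

    # First pass: mark exact matches and collect the unmatched answer letters
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = 'G'
        else:
            remaining[a] += 1

    # Second pass: mark yellows while unmatched copies of the letter remain
    for i, g in enumerate(guess):
        if result[i] == '-' and remaining[g] > 0:
            result[i] = 'Y'
            remaining[g] -= 1

    return ''.join(result)


print(score_guess('stare', 'other'))  # -G-YY
print(score_guess('stare', 'which'))  # -----
```

A pattern of `GGGGG` means the guess is correct, while `-----` means every guessed letter can be eliminated.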

While it is usually possible, via an intuitive approach, to succeed at Wordle most of the time, here we are interested in a more scientific analysis. Which letters are the most common, and in which positions? What would the most economical first guess be? (The first guess is a significant inflection point, making this question important.) Which five-letter words are the most commonly used?

This project utilizes <a href="https://www.kaggle.com/datasets/rtatman/english-word-frequency">data from Kaggle</a>, which is described as "counts of the 333,333 most commonly used single words on the English language web, as derived from the Google Web Trillion Word Corpus."

In [1]:
# Complete necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
# Read in file
words = pd.read_csv('unigram_freq.csv')

In [3]:
# Take a look at the first few words
words.head()

Unnamed: 0,word,count
0,the,23135851162
1,of,13151942776
2,and,12997637966
3,to,12136980858
4,a,9081174698


In [4]:
# Confirm the number of words listed is correct
words.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333333 entries, 0 to 333332
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   word    333331 non-null  object
 1   count   333333 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 5.1+ MB


In [5]:
# Add a length column to words (stored as a float because two entries in the word column are missing)
words['length'] = words.word.str.len()
# Filter by this column to create a new DataFrame
wwords = words[words.length == 5].reset_index(drop = True)
wwords.head()

Unnamed: 0,word,count,length
0,about,1226734006,5.0
1,other,978481319,5.0
2,which,810514085,5.0
3,their,782849411,5.0
4,there,701170205,5.0


In [6]:
wwords.shape

(39933, 3)

In [7]:
# This list still has nearly 40,000 words!
# Wordle only uses 2,309 words.
# There are some really obscure words in here, as we can see...
wwords.iloc[20000:20005]

Unnamed: 0,word,count,length
20000,selpa,40536,5.0
20001,issam,40527,5.0
20002,taner,40524,5.0
20003,heier,40522,5.0
20004,dryed,40522,5.0


In [8]:
# Let's only use the first 2,500, sorted by usage
# But are we sure count is sorted?
wwords['count'].is_monotonic_decreasing

True

In [9]:
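# Keep the first 2,500 rows and only the word column
# .copy() makes this an independent DataFrame, so adding columns later does not raise a SettingWithCopyWarning
wwords = wwords.iloc[0:2500, 0:1].copy()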
wwords.shape

(2500, 1)

In [10]:
# Double check we have the correct columns
wwords.head()

Unnamed: 0,word
0,about
1,other
2,which
3,their
4,there


In [11]:
# Now we're going to want to break each letter out by position so we can analyze their frequencies
wwords['l1'] = wwords.word.str.slice(0, 1)
wwords['l2'] = wwords.word.str.slice(1, 2)
wwords['l3'] = wwords.word.str.slice(2, 3)
wwords['l4'] = wwords.word.str.slice(3, 4)
wwords['l5'] = wwords.word.str.slice(4, 5)
wwords.head()

Unnamed: 0,word,l1,l2,l3,l4,l5
0,about,a,b,o,u,t
1,other,o,t,h,e,r
2,which,w,h,i,c,h
3,their,t,h,e,i,r
4,there,t,h,e,r,e


In [12]:
# First letter frequencies - top 5
pd.DataFrame(wwords['l1'].value_counts())[0:5]

Unnamed: 0,l1
s,310
c,210
b,184
t,171
p,152


In [13]:
# Second letter frequencies - top 5
pd.DataFrame(wwords['l2'].value_counts())[0:5]

Unnamed: 0,l2
a,419
o,373
e,315
i,281
r,227


In [14]:
# Third letter frequencies - top 5
pd.DataFrame(wwords['l3'].value_counts())[0:5]

Unnamed: 0,l3
a,299
i,242
r,218
n,218
o,205


In [15]:
# Fourth letter frequencies - top 5
pd.DataFrame(wwords['l4'].value_counts())[0:5]

Unnamed: 0,l4
e,428
n,201
t,188
i,175
a,172


In [16]:
# Fifth letter frequencies - top 5
# The slice starts at 1 to skip "s", the most common final letter, since Wordle no longer includes plurals ending in "s"
pd.DataFrame(wwords['l5'].value_counts())[1:6]

Unnamed: 0,l5
e,350
y,210
n,176
t,170
r,156


In [18]:
# Create an integrated DataFrame
total_counts = pd.DataFrame({'l1': wwords['l1'].value_counts(),
             'l2': wwords['l2'].value_counts(),
             'l3': wwords['l3'].value_counts(),
             'l4': wwords['l4'].value_counts(),
             'l5': wwords['l5'].value_counts(),
             })
# Add and sort by a column containing the total count for all letters
total_counts['all'] = total_counts.sum(axis = 1)
total_counts = total_counts.sort_values('all', ascending = False)
total_counts

Unnamed: 0,l1,l2,l3,l4,l5,all
e,74,315.0,178.0,428,350.0,1345.0
a,143,419.0,299.0,172,122.0,1155.0
s,310,33.0,99.0,126,575.0,1143.0
r,121,227.0,218.0,160,156.0,882.0
o,40,373.0,205.0,138,87.0,843.0
i,42,281.0,242.0,175,34.0,774.0
t,171,69.0,147.0,188,170.0,745.0
l,122,174.0,162.0,158,107.0,723.0
n,54,60.0,218.0,201,176.0,709.0
c,210,34.0,78.0,129,27.0,478.0


There is some mostly irrelevant but fascinating trivia available to us here. Out of the top 2,500 five-letter words in the English language, the following patterns emerge:

<ul>
    <li>Only 4 words start with the letter "x", and only 7 with the letter "z".</li>
    <li>No words have "j" in their second or final letter.</li>
    <li>No words have "q" in their third or final letter.</li>
    <li>Every other letter occurs at least once at each place in every word.</li>
    <li>Only 1 word ends in the letter "v".</li>
    <li>Only 12 words begin with "y", and only 13 have it in their fourth letter, but 210 words end with it!</li>
    <li>Notwithstanding the words that end with "y", "u" and "y" are more rare than we might expect, with 8 and 10 consonants ahead of them respectively.</li>
    <li>"S" is the most common starting letter, but then it experiences a nearly 90% drop-off moving to letter two.</li>
    <li>"E" is by far the most common letter. There is a clear top tier of letters, which encompasses "e", "a", "s", "r", "o", "i", "t", "l", and "n".</li>
</ul>
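Two of the claims above can be checked with simple arithmetic on the integrated table. The sketch below copies the relevant counts by hand from that table (only the ten letters shown there are included), and the variable names are chosen for this example.

```python
# Counts for "s" in the first and second positions, copied from the table above
s_first, s_second = 310, 33

# Relative drop-off moving from position one to position two
drop = (s_first - s_second) / s_first
print(f"{drop:.1%}")  # 89.4%

# Total appearances for the ten letters shown in the table
totals = {'e': 1345, 'a': 1155, 's': 1143, 'r': 882, 'o': 843,
          'i': 774, 't': 745, 'l': 723, 'n': 709, 'c': 478}

# Gap between the last "top tier" letter ("n") and the next letter shown ("c")
tier_gap = totals['n'] - totals['c']
print(tier_gap)  # 231
```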

Let's use our findings to determine the perfect Wordle guess.

For the first letter, we have a clear winner: the letter "s". This consonant appears at the first position in 100 more words than its runner-up, "c". It is also the third most common letter, and the most common consonant, overall.

For the second letter, "a" is the most common option, and choosing a vowel here also makes sense. We could choose "e", but once "s" is set aside (Wordle no longer includes plurals ending in "s"), "e" is the most common letter at the end of words, so we want to save it for the fifth position. "A" makes more sense than "o" as it is more common both in the second position and overall.

As was mentioned, the fifth letter should clearly be "e". Apart from the excluded "s", "e" is the most common letter in that position, as well as the most common letter overall in this dataset.

With these clear winners in mind, which words begin with "sa", have two common letters in the third and fourth positions, and end in "e"? Let's look through the list and see.

In [20]:
# Find words matching the "sa__e" pattern
wwords[wwords.word.str.startswith('sa') &\
       wwords.word.str.endswith('e')]

Unnamed: 0,word,l1,l2,l3,l4,l5
851,sauce,s,a,u,c,e


Hmmm...the only available word is "sauce". While it's tempting to knock out three vowels, "u" is a rare one indeed, appearing only half as often as some consonants. "C" is a decent letter, but firmly in the second tier.

"A" is the most common letter in the second position, but it is also the most common in the third position, and by a higher margin over its runner-up. If we move "a" to the third position, and "e" to the end, we get far more options.

In [21]:
# Words that follow the pattern 's_a_e'
wwords[(wwords.l1 == 's') & (wwords.l3 == 'a') & (wwords.l5 == 'e')]

Unnamed: 0,word,l1,l2,l3,l4,l5
10,state,s,t,a,t,e
131,space,s,p,a,c,e
135,share,s,h,a,r,e
278,stage,s,t,a,g,e
302,scale,s,c,a,l,e
479,shape,s,h,a,p,e
860,slave,s,l,a,v,e
896,spare,s,p,a,r,e
1089,snake,s,n,a,k,e
1136,shade,s,h,a,d,e


Setting aside "s", which we have already used, the letter "r" is the most common consonant, holding a notable lead (about 140 appearances) over the next one, "t". Let's also mandate the inclusion of "r".

In [22]:
# Words that follow the pattern 's_a_e' and contain the letter "r"
wwords[(wwords.l1 == 's') & (wwords.l3 == 'a') &\
       (wwords.l5 == 'e') & (wwords.word.str.contains('r'))]

Unnamed: 0,word,l1,l2,l3,l4,l5
135,share,s,h,a,r,e
896,spare,s,p,a,r,e
1853,scare,s,c,a,r,e
2336,stare,s,t,a,r,e


All these have identical structures besides the different second letters.

Which is the most popular: "h", "p", "c", or "t"? Overall it's "t", but "h" is indeed more common in the second position. The odds of getting both "s" and "h" correct are still relatively low, and "h" occurs quite rarely in other positions, so it makes sense to guess the more common consonant and, if we're wrong, at least potentially open up a yellow square somewhere else.

So, we have arrived at our optimal first guess:

## STARE

Out of the seven most common characters, five, and all of the top four, are represented in this guess.
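The coverage claim can be verified directly from the integrated table. In this sketch the totals are copied by hand from that table, and `coverage` and `total_score` are hypothetical helper names. Because only the ten letters shown in the table are available, words containing other letters (such as "share" or "spare") cannot be scored this way.

```python
# Total appearances for the ten letters shown in the integrated table
totals = {'e': 1345, 'a': 1155, 's': 1143, 'r': 882, 'o': 843,
          'i': 774, 't': 745, 'l': 723, 'n': 709, 'c': 478}

# Letters ordered from most to least common
ranked = sorted(totals, key=totals.get, reverse=True)


def coverage(word, top_n):
    """Number of distinct letters in word that are among the top_n letters."""
    return len(set(word) & set(ranked[:top_n]))


def total_score(word):
    """Sum of overall counts for the distinct letters in word."""
    return sum(totals[ch] for ch in set(word))


print(coverage('stare', 7))   # 5
print(coverage('stare', 4))   # 4
print(total_score('stare'))   # 5270
print(total_score('scare'))   # 5003
```

Summing overall counts is only one possible scoring rule; it ignores position, which the discussion above weighs separately.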

Once we have made our guess, "stare", there is a very large number of possible outcomes, unfortunately far too many to game out here. Each letter can come back in one of three colors, giving up to 3^5 possible color patterns.

In [23]:
print(3 ** 5)

243
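The 243 patterns can also be enumerated explicitly. The sketch below uses an assumed encoding (`G` for green, `Y` for yellow, `-` for gray). It also flags the patterns with four greens and one yellow, which cannot actually occur: if four letters are in the right place, the fifth has no other position left to belong to. So 243 is an upper bound on what the game can show.

```python
from itertools import product

# Every combination of three colors across five positions
patterns = [''.join(p) for p in product('GY-', repeat=5)]
print(len(patterns))  # 243

# Patterns with exactly four greens and one yellow can never be produced
impossible = [p for p in patterns if p.count('G') == 4 and p.count('Y') == 1]
print(len(impossible))  # 5
```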


We will cover two of these possibilities.

1) Our answer, "stare", is correct, and we win in one! Yay!

2) Every letter is gray, meaning none of the letters in "stare" are in the day's word.

What will we do in case #2?

In [24]:
# Create a new DataFrame that only contains words without any of the "stare" letters in them
# The regular expression [stare] matches any one of those five letters
wwords_ns = wwords[~wwords.word.str.contains('[stare]')].reset_index(drop = True)
print(wwords_ns.shape)
wwords_ns.head()

(154, 6)


Unnamed: 0,word,l1,l2,l3,l4,l5
0,which,w,h,i,c,h
1,would,w,o,u,l,d
2,click,c,l,i,c,k
3,could,c,o,u,l,d
4,found,f,o,u,n,d


In [25]:
# First letter frequencies - top 5
pd.DataFrame(wwords_ns['l1'].value_counts())[0:5]

Unnamed: 0,l1
c,27
b,20
f,15
l,14
p,12


In [26]:
# Second letter frequencies - top 5
pd.DataFrame(wwords_ns['l2'].value_counts())[0:5]

Unnamed: 0,l2
o,44
u,32
i,29
l,24
h,10


In [27]:
# Third letter frequencies - top 5
pd.DataFrame(wwords_ns['l3'].value_counts())[0:5]

Unnamed: 0,l3
i,27
n,25
u,20
o,19
l,17


In [28]:
# Fourth letter frequencies - top 5
pd.DataFrame(wwords_ns['l4'].value_counts())[0:5]

Unnamed: 0,l4
c,23
n,23
l,18
o,18
i,14


In [29]:
# Fifth letter frequencies - top 5
# No words ending in "s" remain in this subset, so the slice starts at 0 and nothing is skipped
pd.DataFrame(wwords_ns['l5'].value_counts())[0:5]

Unnamed: 0,l5
y,38
d,22
n,17
k,13
h,12


In [30]:
# Create an integrated DataFrame
total_counts_ns = pd.DataFrame({'l1': wwords_ns['l1'].value_counts(),
             'l2': wwords_ns['l2'].value_counts(),
             'l3': wwords_ns['l3'].value_counts(),
             'l4': wwords_ns['l4'].value_counts(),
             'l5': wwords_ns['l5'].value_counts(),
             })
# Add and sort by a column containing the total count for all letters
total_counts_ns['all'] = total_counts_ns.sum(axis = 1)
total_counts_ns = total_counts_ns.sort_values('all', ascending = False)
total_counts_ns

Unnamed: 0,l1,l2,l3,l4,l5,all
o,2.0,44.0,19.0,18.0,11.0,94.0
i,5.0,29.0,27.0,14.0,7.0,82.0
l,14.0,24.0,17.0,18.0,5.0,78.0
n,4.0,6.0,25.0,23.0,17.0,75.0
c,27.0,1.0,4.0,23.0,6.0,61.0
u,1.0,32.0,20.0,3.0,1.0,57.0
y,2.0,5.0,1.0,3.0,38.0,49.0
d,9.0,,4.0,11.0,22.0,46.0
b,20.0,1.0,7.0,9.0,2.0,39.0
h,5.0,10.0,,,12.0,27.0


Let's use our findings to determine the ideal non-stare Wordle guess.

For the first letter, we have a clear winner: the letter "c". It is most common in the first position, and also the third-most common overall consonant.

For the second through fourth letters, the letters "o", "u", "i", "n", and "l" are the most common. Along with "c", these make up the top six most common letters in this dataset.

The fifth letter is most commonly "y" (38 words), followed by "d" (22). However, since we already have plenty of strong letters competing for the other positions, these might have to be dropped.

There are three vowels in the top six, so we will have to choose among them, unless some word contains all three. It seems unlikely...

In [31]:
wwords_ns[wwords_ns.word.str.contains('i')\
          & wwords_ns.word.str.contains('o')\
          & wwords_ns.word.str.contains('u')]

Unnamed: 0,word,l1,l2,l3,l4,l5
14,union,u,n,i,o,n


Actually, there is one, but it wastes a valuable letter by guessing "n" in two places.

It looks like we will have to drop "u", as it is the least common vowel here, with only 57 appearances to "i"'s 82 and "o"'s field-leading 94. Let's see if we can get "i", "o", "n", "l", and "c" together.

In [32]:
wwords_ns[wwords_ns.word.str.contains('i')\
          & wwords_ns.word.str.contains('o')\
          & wwords_ns.word.str.contains('n')\
         & wwords_ns.word.str.contains('c')\
         & wwords_ns.word.str.contains('l')]

Unnamed: 0,word,l1,l2,l3,l4,l5
47,colin,c,o,l,i,n


Unfortunately, Colin is only a name. We are going to have to broaden the field a bit.

In [33]:
# Get words that contain i and o, as well as any of the common consonants n, c, and l
wwords_ns[wwords_ns.word.str.contains('i')\
         & wwords_ns.word.str.contains('o')
         & (wwords_ns.word.str.contains('n') |
            wwords_ns.word.str.contains('c') |
            wwords_ns.word.str.contains('l'))]

Unnamed: 0,word,l1,l2,l3,l4,l5
5,going,g,o,i,n,g
6,login,l,o,g,i,n
12,doing,d,o,i,n,g
14,union,u,n,i,o,n
21,logic,l,o,g,i,c
25,dildo,d,i,l,d,o
27,comic,c,o,m,i,c
31,nikon,n,i,k,o,n
43,bingo,b,i,n,g,o
47,colin,c,o,l,i,n


Many of these options have double characters, have throwaway characters like "x", or are not real words. The ones that we can consider are "bingo", "logic", "doing", and "login". Only "logic" and "login" have two of the three consonants we want. "N" is also rather popular in the final position, behind only "y" and "d", so ding, ding, ding...we have a winner!

## LOGIN
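The selection can be retraced in a short sketch. The candidate list is the ten words shown in the last output, the final-position counts come from the integrated table for the non-"stare" words, and the set of names to exclude is a manual judgment call rather than something derived from the data.

```python
# Candidates shown in the last output above
candidates = ['going', 'login', 'doing', 'union', 'logic',
              'dildo', 'comic', 'nikon', 'bingo', 'colin']

wanted_consonants = set('ncl')
names = {'colin'}  # excluded by hand, as discussed above

# Drop words that repeat a letter, since a repeat wastes a guess slot
no_repeats = [w for w in candidates if len(set(w)) == 5]

# Keep words with at least two of the three wanted consonants
strong = [w for w in no_repeats
          if len(wanted_consonants & set(w)) >= 2 and w not in names]
print(strong)  # ['login', 'logic']

# Break the tie using how often each final letter ends a word in this subset
final_counts = {'n': 17, 'c': 6}
best = max(strong, key=lambda w: final_counts[w[-1]])
print(best)  # login
```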

This brief analysis only deals with a few of the many possibilities that we can encounter while playing Wordle. That is the brilliance of the game: it requires massive ingenuity to play and cannot be reduced to a brute-force calculation. Still, knowing the frequency of the letters and a good few initial guesses could go a long way toward improving our play.