Wordle is a popular game where you have six guesses to figure out a hidden five-letter word. After each guess, a letter that appears in the word but in the wrong position is highlighted in yellow, and a letter that is in the correct position is highlighted in green. The opening word therefore carries a lot of weight, and you want it to be as useful as possible. The first step is to avoid words that repeat a letter: "eerie" would be a bad word to start with, since three of its five letters are "e". Another consideration is whether the word uses common or uncommon letters. Both have their strengths and weaknesses.
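
The colouring rules can be sketched as a small function. This is only an illustration: the function name `feedback` and the "gray" label for letters that are not in the word are assumptions, and the handling of repeated letters follows the usual Wordle convention, which the description above does not spell out.

```python
from collections import Counter

def feedback(guess, answer):
    """Return one colour per guessed letter: green, yellow or gray."""
    result = ["gray"] * len(guess)
    remaining = Counter()  # answer letters not matched by a green

    # First pass: mark exact matches and collect the unmatched answer letters
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "green"
        else:
            remaining[a] += 1

    # Second pass: a misplaced letter is yellow only while copies remain
    for i, g in enumerate(guess):
        if result[i] != "green" and remaining[g] > 0:
            result[i] = "yellow"
            remaining[g] -= 1

    return result

print(feedback("eerie", "arose"))
```

With "eerie" as the guess, three of the five positions are spent on the same letter, which is why words with repeated letters make weak openers.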

In [1]:
import pandas as pd

The dataset comes from GitHub, where someone else had already used it.  
https://github.com/steve-kasica/wordle-words

In [2]:
df = pd.read_csv('wordle.csv')

In [3]:
df

Unnamed: 0,word,occurrence,day
0,aahed,1.850950e-09,
1,aalii,6.224471e-10,
2,aargh,2.158188e-10,
3,aarti,7.668332e-10,
4,abaca,6.320646e-08,
...,...,...,...
12967,zuzim,4.259832e-09,
12968,zygal,7.210628e-10,
12969,zygon,1.448293e-09,
12970,zymes,1.359240e-08,


For the task at hand, the "occurrence" and "day" columns aren't needed and are thus dropped.

In [4]:
df.drop(['occurrence', 'day'], axis=1, inplace=True)

In [5]:
df

Unnamed: 0,word
0,aahed
1,aalii
2,aargh
3,aarti
4,abaca
...,...
12967,zuzim
12968,zygal
12969,zygon
12970,zymes


Let's make sure all the words are in lower case.

In [6]:
df['word'] = df['word'].str.lower()

## Common letters

To find the most common letters, split each word of the dataset into individual letters.

In [7]:
total_letters = df['word'].apply(list)
print(total_letters)

0        [a, a, h, e, d]
1        [a, a, l, i, i]
2        [a, a, r, g, h]
3        [a, a, r, t, i]
4        [a, b, a, c, a]
              ...       
12967    [z, u, z, i, m]
12968    [z, y, g, a, l]
12969    [z, y, g, o, n]
12970    [z, y, m, e, s]
12971    [z, y, m, i, c]
Name: word, Length: 12972, dtype: object


Explode turns each list into one row per letter, giving a single long Series, then value_counts counts all the individual letters.

In [8]:
total_count = total_letters.explode().value_counts()
print(total_count)

word
s    6665
e    6662
a    5990
o    4438
r    4158
i    3759
l    3371
t    3295
n    2952
u    2511
d    2453
y    2074
c    2028
p    2019
m    1976
h    1760
g    1644
b    1627
k    1505
f    1115
w    1039
v     694
z     434
j     291
x     288
q     112
Name: count, dtype: int64
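
To see what explode and value_counts do on a small scale, here is a minimal sketch with two hand-picked words. The Series `words` and the other variable names are made up for the illustration and are not part of the dataset.

```python
import pandas as pd

# Two sample words instead of the full dataset
words = pd.Series(["clock", "eerie"])

letters = words.apply(list)        # one list of letters per word
long_form = letters.explode()      # one row per letter, 10 rows in total
counts = long_form.value_counts()  # how often each letter occurs

print(long_form.tolist())
print(counts["e"], counts["c"])
```

Both c's of "clock" and all three e's of "eerie" are counted here, which is the behaviour of the totals above.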


The output shows the raw count of each letter across all words. It gives an idea of which letters are popular, but let's convert it to percentages to get a better idea.

In [9]:
letter_pct = total_count / total_count.sum() * 100
print(letter_pct)

word
s    10.275979
e    10.271354
a     9.235276
o     6.842430
r     6.410731
i     5.795560
l     5.197348
t     5.080173
n     4.551341
u     3.871415
d     3.781992
y     3.197656
c     3.126735
p     3.112858
m     3.046562
h     2.713537
g     2.534690
b     2.508480
k     2.320382
f     1.719087
w     1.601912
v     1.069997
z     0.669134
j     0.448659
x     0.444033
q     0.172680
Name: count, dtype: float64


These are *total* letter counts, so a word like "clock" adds 2 c's to the total. The numbers are still useful and should be taken into consideration when making the final decision, but they don't tell the whole story. Judged by these results alone, a word like "seals" would look promising: its letters add up to roughly 45%, but only because the "s" is counted twice. For the first word it's best to avoid duplicate letters, so that every position tests a different letter and the chance of hitting a correct one is higher.
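
The "seals" figure can be reproduced with a few lines. This is a rough sketch: the percentages are copied by hand from the output above, the helper name `summed_share` is made up, and the sum is a share of letter occurrences rather than a true probability.

```python
# Percentages copied from the letter_pct output (only the letters needed here)
letter_pct_sample = {"s": 10.275979, "e": 10.271354, "a": 9.235276, "l": 5.197348}

def summed_share(word, pct, unique_only=False):
    """Add up the letter percentages of a word, optionally counting each letter once."""
    letters = set(word) if unique_only else word
    return sum(pct[ch] for ch in letters)

print(round(summed_share("seals", letter_pct_sample), 2))        # "s" counted twice
print(round(summed_share("seals", letter_pct_sample, True), 2))  # "s" counted once
```

Counting the "s" only once drops the sum from about 45 to about 35, which shows how much of the apparent strength of "seals" comes from the repeated letter.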

The next step is to remove the duplicate letters within each word, so that "clock" only counts as 1 c. This leaves the *unique* letters of each word, and counting them tells how many words contain each letter.

In [10]:
unique_letters = df['word'].apply(set)
print(unique_letters)

0           {d, h, a, e}
1              {i, a, l}
2           {r, g, h, a}
3           {r, i, a, t}
4              {b, c, a}
              ...       
12967       {i, u, z, m}
12968    {z, l, g, y, a}
12969    {z, o, n, g, y}
12970    {z, m, e, s, y}
12971    {z, m, c, i, y}
Name: word, Length: 12972, dtype: object


The first word "aahed" becomes the set {a, d, e, h}, since the second "a" isn't counted. Sets are unordered, which is why the letters print in an arbitrary order.
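
The difference between total and unique counting can be checked on a few words with the standard library alone. This is a small sketch with a hand-picked word list, not the notebook's own method.

```python
from collections import Counter

sample_words = ["clock", "eerie", "aahed"]

# Every occurrence counts: "clock" contributes 2 c's
total = Counter(ch for word in sample_words for ch in word)

# Each letter counts once per word: "clock" contributes 1 c
unique = Counter(ch for word in sample_words for ch in set(word))

print(total["c"], unique["c"])
print(total["e"], unique["e"])
```

The unique count for a letter is the number of words that contain it, which is what the next cell computes for the whole dataset.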

In [11]:
frequency = unique_letters.explode().value_counts()
print(frequency)

word
s    5936
e    5705
a    5330
o    3911
r    3909
i    3589
l    3114
t    3033
n    2787
u    2436
d    2298
y    2031
c    1920
p    1885
m    1868
h    1708
g    1543
b    1519
k    1444
w    1028
f     990
v     674
z     391
j     289
x     287
q     111
Name: count, dtype: int64


Two letters switched places compared to the totals. With the totals the order was "k, f, w", but with only unique letters it became "k, w, f", meaning that "w" and "f" swapped.  
This makes sense: of the 990 five-letter words that contain an "f", 117 contain a double "f".  
The "w", by contrast, appears in 1028 words, but only 11 of them contain a double "w".
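
Counts like these can be obtained with a short helper. The sketch below uses a made-up sample list and a hypothetical function name `count_repeats`; in the notebook it could be called with `df['word']` instead of the sample.

```python
def count_repeats(words, letter):
    """Return (words containing the letter, words containing it more than once)."""
    containing = [w for w in words if letter in w]
    repeated = [w for w in containing if w.count(letter) > 1]
    return len(containing), len(repeated)

# Hand-picked sample, not taken from the dataset output
sample = ["fluff", "staff", "flame", "wrong", "swows", "water"]

print(count_repeats(sample, "f"))
print(count_repeats(sample, "w"))
```

Note that this counts any word with more than one copy of the letter, so a word like "fluff" with three f's is counted once.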

In [12]:
frequency_pct = frequency / frequency.sum() * 100
print(frequency_pct)

word
s    9.937056
e    9.550355
a    8.922593
o    6.547141
r    6.543793
i    6.008102
l    5.212937
t    5.077340
n    4.665528
u    4.077943
d    3.846926
y    3.399960
c    3.214142
p    3.155551
m    3.127093
h    2.859247
g    2.583032
b    2.542855
k    2.417303
w    1.720905
f    1.657292
v    1.128298
z    0.654547
j    0.483795
x    0.480447
q    0.185818
Name: count, dtype: float64
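
These percentages are each letter's share of all unique-letter occurrences, so they add up to 100. A different reading, sketched below as an optional alternative, divides by the number of words instead and gives the share of words that contain each letter. The counts are copied by hand from the frequency output and 12972 is the length printed earlier; the variable names are made up.

```python
# Counts copied from the frequency output (top three letters only)
frequency_sample = {"s": 5936, "e": 5705, "a": 5330}
n_words = 12972  # number of words in the dataset

# Share of words containing each letter, in percent
share_of_words = {ch: round(count / n_words * 100, 2)
                  for ch, count in frequency_sample.items()}

print(share_of_words)
```

Read this way, a bit under half of all words in the list contain an "s". The ranking of the letters is the same under both readings, so the scoring that follows is unaffected.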


The next step is to add up the unique-letter scores of each word to see which words make good openers. The first option only tracks the top score, printing each new best word as it is found.

In [13]:
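best_word = None
best_score = 0

for word in df['word']:
    # Count each letter once per word, so repeated letters add nothing
    letters = set(word)
    score = sum(frequency[letter] for letter in letters)
    # Strict '>' keeps the first word found when several share a score
    if score > best_score:
        best_score = score
        best_word = word
        # Print every new best as the scan proceeds
        print(best_word, best_score)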


aahed 15041
aarti 15861
abase 18490
abers 22399
acers 22800
aeons 23669
aeros 24791


The second option scores all the words and sorts them, which also shows the words that tie.

In [14]:
scores = []

for word in df['word']:
    letters = set(word)
    score = sum(frequency[letter] for letter in letters)
    scores.append((word, score))

# Highest score first; the sort is stable, so tied words keep their dataset order
scores = sorted(scores, key=lambda x: x[1], reverse=True)

print(scores[:10])

[('aeros', 24791), ('arose', 24791), ('soare', 24791), ('aesir', 24469), ('arise', 24469), ('raise', 24469), ('reais', 24469), ('serai', 24469), ('aloes', 23996), ('arles', 23994)]
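
Many of the top entries are anagrams of each other, because the score depends only on the set of letters and not on their order. A short sketch that groups tied words by their letters makes this visible. The list below is copied from the first six entries of the output above, and the variable names are made up.

```python
from collections import defaultdict

# First six entries of the sorted output above
top_scores = [("aeros", 24791), ("arose", 24791), ("soare", 24791),
              ("aesir", 24469), ("arise", 24469), ("raise", 24469)]

groups = defaultdict(list)
for word, score in top_scores:
    # Sorted unique letters act as a key shared by all anagrams
    key = "".join(sorted(set(word)))
    groups[(key, score)].append(word)

for (key, score), words in groups.items():
    print(key, score, words)
```

Choosing between words in the same group needs something beyond letter frequency, for example letter position.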


I'm not an expert on the Wordle metagame, but I haven't seen any of these words discussed as the best opener. The next step is to add some heuristics to improve the scoring.

## Heuristics