# Wordle Solver

The popular game Wordle has a lot to do with statistics. This program helps to solve the German version (for example https://wordle.at/). Since the probability calculation does not depend on the language, it can be applied to English in the same way.

### Importing the necessary libraries

In [1]:
import csv
import pandas as pd
import numpy as np
import string
import os.path

### Read in the words

The first function reads all words from the file 'words.csv' (or from another file passed as the argument) and preprocesses them. **This only needs to be done once.** The result is saved as a new CSV file, 'list_final_cor.csv', which is used from then on.

In [2]:
def read_words(file_name = 'words.csv'):
    df = pd.read_csv(file_name)

    # The first loop puts the five letters into the corresponding columns let0, ..., let4
    for i in range(0,5):
        col_name = 'let'+str(i)
        df[col_name] = df['word'].str[i]

    # The second loop goes through the alphabet, counts how often each letter appears in a word
    # and stores the counts in the columns numA, ..., numZ
    for m in string.ascii_uppercase:
        col_name = 'num' + m
        df[col_name] = df['word'].str.count(m)
        
    df.to_csv('list_final_cor.csv', index=False)

    return df
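
To make the preprocessing concrete, here is a minimal sketch that applies the same two loops to a small in-memory word list instead of 'words.csv'. The two words and the upper-casing step are assumptions for illustration; the original function expects the words to be upper case already, because only the letters A to Z are counted.

```python
import string
import pandas as pd

# Illustrative in-memory word list; the real data comes from 'words.csv'.
df = pd.DataFrame({"word": ["BIBEL", "hebel"]})

# Assumption: normalize to upper case so that counting 'A'..'Z' finds every letter.
df["word"] = df["word"].str.upper()

# One column per position: let0, ..., let4
for i in range(5):
    df["let" + str(i)] = df["word"].str[i]

# One column per letter with the number of occurrences: numA, ..., numZ
for m in string.ascii_uppercase:
    df["num" + m] = df["word"].str.count(m)

print(df[["word", "let0", "let4", "numB", "numE"]])
```

Letters outside A to Z, such as umlauts, get no `num` column with this approach, so a word list containing them would need an extended alphabet.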

### Calculation of the probabilities

The following function is the heart of the program: it calculates the probability of each word. The calculation is based on letter frequencies, so that the suggested word yields as many green or orange letters as early as possible.

The formula for the probability of a word is
\begin{equation*}
P = \prod_{j} \frac{N_j}{M}
\end{equation*}
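where the product runs over the distinct letters $j$ of the word, $N_j$ (from $N_A$ to $N_Z$) is the number of words in the list that contain letter $j$, and $M$ is the total number of words. In other words, the relative frequencies of the word's letters are multiplied. If a letter occurs more than once in the word, $N_j$ counts only the words that contain it at least that often. The algorithm is naive (analogous to Naive Bayes): it treats the letters as independent and considers only their frequency, not their position or order.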
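
As a worked example of the formula, the following sketch computes $P$ for a three-word list in plain Python. It assumes that $N_j$ counts the words containing letter $j$ at least as often as the scored word does, which is how repeated letters are treated in this notebook. The helper name `word_probability` and the mini word list are illustrative only.

```python
from collections import Counter

def word_probability(word, words):
    """Product over the distinct letters j of `word` of N_j / M.

    N_j is the number of words in `words` that contain letter j at least
    as often as `word` does; M is the number of words.
    """
    m = len(words)
    p = 1.0
    for letter, k in Counter(word).items():
        n_j = sum(1 for w in words if w.count(letter) >= k)
        p *= n_j / m
    return p

words = ["BIBEL", "FIBEL", "HEBEL"]  # illustrative mini word list
probs = {w: word_probability(w, words) for w in words}
for w, p in probs.items():
    print(w, round(p, 6))
# BIBEL 0.222222
# FIBEL 0.222222
# HEBEL 0.111111
```

BIBEL, for instance, gets $\frac{1}{3}$ for its double B (only one word has two), $\frac{2}{3}$ for the I, and $1$ each for E and L, which gives $\frac{2}{9}$.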

In [3]:
# The function takes the preprocessed DataFrame
def propRow(df):
    
    # This DataFrame holds the frequency with which each letter occurs in a word (row index = number of occurrences)
    df_p = pd.DataFrame(index=[0, 1, 2, 3])

    # Loop through the alphabet
    for m in string.ascii_uppercase:
        col_name = 'num' + m
        
        # Share of words that contain the letter exactly k times (normalized with M)
        shares = df[col_name].value_counts(normalize=True)

        # If a letter is twice in a word it is automatically also once in it and so on,
        # so rows 1 to 3 hold the share of words with at least k occurrences.
        # Row 0 keeps the share of words without the letter; counts that never occur give 0.
        newCol = [shares.get(0, 0)] + [shares[shares.index >= k].sum() for k in range(1, 4)]

        df_p[m] = newCol

    df_p.index.names = ['NumOfLet']
    
    # Add the 'prop' column to the input DataFrame df
    df['prop'] = [1]*df.shape[0]
    
    # Look up the probability of each letter in the word and multiply them
    for m in string.ascii_uppercase:
        col_name = 'num' + m
        p = df_p[m].tolist()
        # A letter that does not occur contributes the factor 1;
        # three or more occurrences use the last row of df_p
        df['prop'] = df['prop'] * np.where(df[col_name] == 0, 1,
                                           np.where(df[col_name] == 1, p[1],
                                                    np.where(df[col_name] == 2, p[2], p[3])))
    
    return df

### Deleting words

In [4]:
# Since the list of words contains every German word with 5 letters, a suggested word may not be accepted by Wordle.
# In that case it should be deleted; the updated DataFrame is saved as a CSV again.
def del_word(d_word):
    df2 = pd.read_csv('list_final_cor.csv')
    
    df2.drop(df2[df2.word == d_word].index, inplace=True)
    df2 = df2.reset_index(drop=True)
    df2.to_csv('list_final_cor.csv', index=False)
    
    return df2

### 1. Read the data

In [5]:
# Read in the data. If the data has not been preprocessed yet, read_words is called first.
if os.path.isfile('list_final_cor.csv'):
    df = propRow(pd.read_csv('list_final_cor.csv'))
else:
    df = propRow(read_words())

### 2. Choose Starter Word

In [6]:
# Choose one of the words with the highest probabilities
df.sort_values('prop', ascending=False).head(10)

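| | word | let0 | let1 | let2 | let3 | let4 | numA | numB | numC | numD | ... | numR | numS | numT | numU | numV | numW | numX | numY | numZ | prop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | ARTEN | A | R | T | E | N | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006832 |
| 1056 | RATEN | R | A | T | E | N | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006832 |
| 48 | ALERT | A | L | E | R | T | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006049 |
| 59 | ALTER | A | L | T | E | R | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006049 |
| 1291 | TALER | T | A | L | E | R | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006049 |
| 1054 | RASEN | R | A | S | E | N | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.005971 |
| 95 | ARSEN | A | R | S | E | N | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.005971 |
| 103 | ASTER | A | S | T | E | R | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00591 |
| 58 | ALTEN | A | L | T | E | N | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.005488 |
| 1160 | SENAT | S | E | N | A | T | 1 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.005361 |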


### 3. Enter the information and recalculate the probabilities

After the first word has been entered, we get new information: which letters are green, gray, and orange. This information is entered below, and the probabilities are then recalculated to find the best next word.

In [7]:
# The first list contains each green letter once, unless the letter occurs twice in the word.
# Example: list_g1 = ['A','E']
list_g1 = ['E','L']
# The second list also records the position at which each letter is green (counting from 0).
# Example: list_g2 = [['A',0],['E',3]]
list_g2 = [['L',4],['E',3]]

# The orange letters are handled the same way: the first list contains each letter once (unless it occurs twice).
# Example: list_o1 = ['N']
list_o1 = []
# The second list records every position at which a letter was orange, so a letter may appear more than once.
# Example: list_o2 = [['N',1],['N',2]]
list_o2 = []

# Finally, all gray letters go into one list.
# Example: list_non = ['R','T']
list_non = ['A','R','T','N','P','U','D','V','O','G']
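
Filling in the five lists by hand is error-prone, so a small helper can derive them from a guess and its colour pattern. The sketch below is not part of the original notebook: the function name `feedback_to_lists` and the pattern encoding (`g` = green, `o` = orange, `x` = gray) are assumptions made for illustration. It covers a single guess only; merging the lists of several guesses is left to the caller.

```python
def feedback_to_lists(guess, pattern):
    """Translate one guess and its colour pattern into the five lists.

    `pattern` has one character per position:
    'g' = green, 'o' = orange, 'x' = gray (hypothetical encoding).
    """
    list_g1, list_g2, list_o1, list_o2, list_non = [], [], [], [], []
    for pos, (letter, colour) in enumerate(zip(guess, pattern)):
        if colour == "g":
            list_g1.append(letter)
            list_g2.append([letter, pos])
        elif colour == "o":
            list_o1.append(letter)
            list_o2.append([letter, pos])
        elif letter not in list_non:
            # Each gray letter is listed only once
            list_non.append(letter)
    return list_g1, list_g2, list_o1, list_o2, list_non

# Guess ARTEN with only the E (position 3) shown in green
print(feedback_to_lists("ARTEN", "xxxgx"))
# (['E'], [['E', 3]], [], [], ['A', 'R', 'T', 'N'])
```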

With the lists prepared, the new DataFrame can be created by applying this information and recalculating the probabilities:

In [8]:
#Handling of green
for i in list_g2:
    col_name = 'let' + str(i[1])
    # Keep only words that have the green letter at the right position
    df = df[df[col_name] == i[0]]
    
#Handling of orange
for i in list_o2:
    col_name = 'num' + i[0]
    # The orange letters need to be in the word
    df = df[df[col_name] > list_o1.count(i[0])-1]
    # But they cannot be at the position where they were orange
    col_name2 = 'let' + str(i[1])
    df = df[df[col_name2] != i[0]]
    
#Handling of non:
for i in list_non:
    col_name = 'num' + i
    # The gray letters cannot be in the word. Careful: if a letter is guessed twice but occurs only once,
    # its second occurrence turns gray, so check whether the letter is already in the green or orange list.
    if i in list_g1 or i in list_o1:
        df = df[df[col_name] <= max([list_g1.count(i),list_o1.count(i)])]
    else:
        df = df[df[col_name] == 0]

# Finally, recalculate the probabilities to get next word
propRow(df).sort_values('prop',ascending=False).head(10)

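| | word | let0 | let1 | let2 | let3 | let4 | numA | numB | numC | numD | ... | numR | numS | numT | numU | numV | numW | numX | numY | numZ | prop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150 | BIBEL | B | I | B | E | L | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.222222 |
| 391 | FIBEL | F | I | B | E | L | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.222222 |
| 534 | HEBEL | H | E | B | E | L | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.111111 |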
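
The rule-based filtering above can be cross-checked with a different technique: simulate the colours Wordle would show for the guess against every candidate and keep only the candidates that reproduce the observed pattern. The sketch below is an alternative, not the method used in this notebook. It assumes the standard Wordle colouring rules for repeated letters, and the names `feedback` and `consistent` as well as the pattern encoding (`g`, `o`, `x`) are hypothetical.

```python
from collections import Counter

def feedback(guess, solution):
    """Return the colour pattern for `guess` if `solution` were the answer.

    'g' = green, 'o' = orange, 'x' = gray (hypothetical encoding).
    """
    pattern = ["x"] * len(guess)
    # Solution letters that are not matched exactly remain available for orange marks
    remaining = Counter(s for g, s in zip(guess, solution) if g != s)
    for i, (g, s) in enumerate(zip(guess, solution)):
        if g == s:
            pattern[i] = "g"
        elif remaining[g] > 0:
            pattern[i] = "o"
            remaining[g] -= 1
    return "".join(pattern)

def consistent(words, guess, observed):
    """Keep the words that would have produced `observed` for `guess`."""
    return [w for w in words if feedback(guess, w) == observed]

candidates = ["BIBEL", "FIBEL", "HEBEL", "ARTEN"]  # illustrative list
print(feedback("HEBEL", "BIBEL"))
print(consistent(candidates, "ARTEN", "xxxgx"))
```

Because the pattern is computed per candidate, repeated letters need no special cases, at the cost of one pass over the word list per guess.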


### Deleting a word

In [19]:
# If a word from the list is not accepted by Wordle, it can be deleted with the following line:
#del_word('LAUER')