Thank you for taking a brief look at this ipynb document.

The task is to develop an algorithm that finds the best initial guess for a Wordle game; the details are described below. The project contains the following files and directories:

- Dataset, a folder that stores the prepared `wordList` text files and the corresponding `srcdata` csv files used at run time.
- Images, a folder that stores some screenshots.
- `wordle_process.py`, which contains the core of the Wordle solver.
- `functions.py`, which contains the helper functions used in this project.
- `wordle_run.ipynb`, the notebook that runs the solver.

## How to use?

The workflow has three steps:
1. Prepare the dataset.
2. Write a function that takes two inputs: `n`, the word length, and `ipt_word`, the word to guess.
3. Provide a solver that aims to find the word within `n + 2` attempts by choosing the most probable guess on each turn.

To set up a dataset we could use `from english_words import english_words_lower_alpha_set`, which provides a list of 9324 words in total. For a better win rate on online Wordle games such as https://wordly.org/, I suggest using `wordList.txt`, which contains a larger set of 5-letter words. We use `prepareEnglishWordsDS()` to prepare these txt files and `data_conversion()` to convert them into a csv dataset whose columns hold the count of each letter of the alphabet and the letter at each position of the word.

In [1]:
import pandas as pd
df5 = pd.read_csv("Dataset/srcdata5.csv")
df5

Unnamed: 0.1,Unnamed: 0,word,a,b,c,d,e,f,g,h,...,v,w,x,y,z,1,2,3,4,5
0,0,staff,1,0,0,0,0,2,0,0,...,0,0,0,0,0,s,t,a,f,f
1,1,royal,1,0,0,0,0,0,0,0,...,0,0,0,1,0,r,o,y,a,l
2,2,welch,0,0,1,0,1,0,0,1,...,0,1,0,0,0,w,e,l,c,h
3,3,circa,1,0,2,0,0,0,0,0,...,0,0,0,0,0,c,i,r,c,a
4,4,wince,0,0,1,0,1,0,0,0,...,0,1,0,0,0,w,i,n,c,e
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3207,3207,voice,0,0,1,0,1,0,0,0,...,1,0,0,0,0,v,o,i,c,e
3208,3208,snark,1,0,0,0,0,0,0,0,...,0,0,0,0,0,s,n,a,r,k
3209,3209,gecko,0,0,1,0,1,0,1,0,...,0,0,0,0,0,g,e,c,k,o
3210,3210,gemma,1,0,0,0,1,0,1,0,...,0,0,0,0,0,g,e,m,m,a
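
To make the column layout above concrete, here is a minimal sketch of how one word could be turned into such a row. The helper name `word_to_row` is hypothetical and is not taken from `functions.py`; the actual `data_conversion()` may work differently.

```python
import string
from collections import Counter


def word_to_row(word):
    """Build one dataset row (hypothetical helper, not the project's code)."""
    word = word.lower()
    counts = Counter(word)
    row = {"word": word}
    # One column per letter a..z: how many times the letter occurs in the word.
    for letter in string.ascii_lowercase:
        row[letter] = counts[letter]
    # One column per 1-based position: the letter found at that position.
    for pos, letter in enumerate(word, start=1):
        row[str(pos)] = letter
    return row


row = word_to_row("staff")
print(row["a"], row["f"], row["1"], row["5"])  # prints: 1 2 s f
```

A list of such dictionaries can then be written to csv, for example with `csv.DictWriter` or `pandas.DataFrame`.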


Next we write the function itself, `runfn`, which can be found in `functions.py`. Here we take `apple` as an example; it is a word of length 5, so we set `n = 5`. Note that the word's length must equal `n`, otherwise an assertion error is raised. Since finding the most appropriate word costs `O(N)` time in the number of candidate words `N`, a larger dataset increases the running time proportionally.

In [3]:
from functions import *
runfn(n = 5, ipt_word = "apple")

Current turn: 1, guess is: saute, result is 01002, remaining possible words list size: 3212
green: [(5, 'e')], yellow: [(2, 'a')], grey: [(1, 's'), (3, 'u'), (4, 't')].
Current turn: 2, guess is: crane, result is 00102, remaining possible words list size: 77
green: [(5, 'e')], yellow: [(3, 'a')], grey: [(1, 'c'), (2, 'r'), (4, 'n')].
Current turn: 3, guess is: alike, result is 21002, remaining possible words list size: 15
green: [(1, 'a'), (5, 'e')], yellow: [(2, 'l')], grey: [(3, 'i'), (4, 'k')].
Current turn: 4, guess is: adele, result is 20022, remaining possible words list size: 5
green: [(1, 'a'), (4, 'l'), (5, 'e')], yellow: [], grey: [(2, 'd'), (3, 'e')].
Current turn: 5, guess is: amble, result is 20022, remaining possible words list size: 3
green: [(1, 'a'), (4, 'l'), (5, 'e')], yellow: [], grey: [(2, 'm'), (3, 'b')].
Current turn: 6, guess is: apple, result is 22222, remaining possible words list size: 1
Congratulations!, you reached the answer apple in 6 turns.
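
The result strings in the log above use 2 for green, 1 for yellow and 0 for grey. The following is a minimal sketch of how such a string could be computed, assuming the usual Wordle handling of repeated letters. The function name `feedback` is hypothetical; the project's own implementation lives in `wordle_GameRunner` and may differ.

```python
from collections import Counter


def feedback(guess, answer):
    """Return the result string: 2 = green, 1 = yellow, 0 = grey."""
    assert len(guess) == len(answer), "guess and answer must have the same length"
    result = ["0"] * len(guess)
    # Answer letters that were not matched exactly remain available for yellows.
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    # First pass: mark exact matches as green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "2"
    # Second pass: mark yellows, consuming each available answer letter once.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and unmatched[g] > 0:
            result[i] = "1"
            unmatched[g] -= 1
    return "".join(result)


for guess in ["saute", "crane", "alike", "adele", "amble", "apple"]:
    print(guess, feedback(guess, "apple"))
```

For the six guesses in the log this reproduces `01002`, `00102`, `21002`, `20022`, `20022` and `22222`.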


## Logic

### Sorting by probability
The guess is chosen by probability. We first count the frequency of each letter at each position, which gives the frequency map shown below. With this map we can give each word a score:
$$score_{word} = \sum_{i=1}^{n} freq_{word[i]}(i)$$
where $freq_{c}(i)$ is the number of words with letter $c$ at position $i$ and $n$ is the word length. Treating the positions as independent, the probability of a word can be expressed as:
$$prob_{word} = \prod_{i=1}^{n} \frac{freq_{word[i]}(i)}{N}$$
where $N$ is the number of candidate words. Equivalently, in logarithmic form:
$$prob_{word} = \exp\left(\sum_{i=1}^{n} \ln\frac{freq_{word[i]}(i)}{N}\right)$$
Hence we use the score as a simple proxy for ranking the candidates by probability. Note that the sum of raw frequencies does not always give exactly the same order as the sum of log-frequencies.
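
As an illustration of the formulas above, here is a minimal sketch on a toy word list. The function names and the five-word list are assumptions made for this example and are not taken from `wordle_process.py`.

```python
import math
from collections import Counter


def positional_freq(words):
    """freq[i][c] = number of words that have letter c at position i (0-based)."""
    n = len(words[0])
    return [Counter(w[i] for w in words) for i in range(n)]


def score(word, freq):
    """Sum of the positional frequencies of the word's letters."""
    return sum(freq[i][ch] for i, ch in enumerate(word))


def log_prob(word, freq, total):
    """Sum of log(freq / N); only defined for letters seen at each position."""
    return sum(math.log(freq[i][ch] / total) for i, ch in enumerate(word))


words = ["saute", "sauce", "slate", "crane", "apple"]  # toy list, not the dataset
freq = positional_freq(words)
ranked = sorted(words, key=lambda w: score(w, freq), reverse=True)
print(ranked[0], score(ranked[0], freq))  # prints: saute 14
```

On this toy list both `score` and `log_prob` put `saute` first, but on larger lists the two orderings can diverge, so the score is best read as a cheap approximation.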

In [40]:
from functions import *
from wordle_process import *

myS = wordle_Strategy(size = 5, txt_fname = "Dataset/wordList5.txt", csv_fname = "Dataset/srcdata5.csv")
myS.setup()
myS.freq_df

Unnamed: 0,1,2,3,4,5
a,218,494,395,266,183
b,261,21,86,40,15
c,271,51,82,205,46
d,159,29,106,101,121
e,113,372,240,379,577
f,149,7,31,52,35
g,152,15,101,104,38
h,115,184,19,45,178
i,50,274,318,260,44
j,47,2,6,3,0


This gives the ranking below. In this case we choose `saute` as the guess, since it has the highest frequency score.

In [41]:
myS.guess()
myS.answers_range

Unnamed: 0.1,Unnamed: 0,word,a,b,c,d,e,f,g,h,...,w,x,y,z,1,2,3,4,5,freq_index
0,627,saute,1,0,0,0,1,0,0,0,...,0,0,0,0,s,a,u,t,e,1931
1,1859,sauce,1,0,1,0,1,0,0,0,...,0,0,0,0,s,a,u,c,e,1929
2,1594,salle,1,0,0,0,1,0,0,0,...,0,0,0,0,s,a,l,l,e,1919
3,1579,caine,1,0,1,0,1,0,0,0,...,0,0,0,0,c,a,i,n,e,1905
4,1557,slate,1,0,0,0,1,0,0,0,...,0,0,0,0,s,l,a,t,e,1892
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3207,2952,infix,0,0,0,0,0,1,0,0,...,0,1,0,0,i,n,f,i,x,464
3208,806,kudzu,0,0,0,1,0,0,0,0,...,0,0,0,1,k,u,d,z,u,458
3209,1824,ethyl,0,0,0,0,1,0,0,1,...,0,0,1,0,e,t,h,y,l,455
3210,2958,nymph,0,0,0,0,0,0,0,1,...,0,0,1,0,n,y,m,p,h,445


Using the rules of Wordle, implemented in `class wordle_GameRunner():`, we filter `answers_range` by yellow, green and grey in that order, then rebuild the frequency map over the filtered `answers_range`. Repeating this narrows the candidates down to the answer.

In [42]:
curr_guess = myS.guess()
myGame = wordle_GameRunner()
myGame.setup_runner(guess_word = "apple", ipt_size = 5)

while True:
    try:
        # Get the result code for the current guess.
        numSeries = myGame.get_numSeries(curr_guess)
        # print(f"Current turn: {myGame.iteration}, guess is: {curr_guess}, result is {numSeries}, remaining possible words list size: {myS.answers_range.shape[0]}")
        # Stop when only one candidate is left or the guess is correct.
        if myS.answers_range.shape[0] == 1 or curr_guess == myGame.guess_word:
            print(f"Congratulations!, you reached the answer {curr_guess} in {myGame.iteration} turns.")
            break
        # Filter the candidates with the result, then pick the next guess.
        myS.run(curr_guess, numSeries)
        curr_guess = myS.guess()

    except Exception as err:
        print(f"Error of turn {myGame.iteration}: guess is {curr_guess}.\nException: {err}.")
        break

Current turn: 1, guess is: saute, result is 01002, remaining possible words list size: 3212
green: [(5, 'e')], yellow: [(2, 'a')], grey: [(1, 's'), (3, 'u'), (4, 't')].
Current turn: 1, guess is: crane, result is 00102, remaining possible words list size: 77
green: [(5, 'e')], yellow: [(3, 'a')], grey: [(1, 'c'), (2, 'r'), (4, 'n')].
Current turn: 1, guess is: alike, result is 21002, remaining possible words list size: 15
green: [(1, 'a'), (5, 'e')], yellow: [(2, 'l')], grey: [(3, 'i'), (4, 'k')].
Current turn: 1, guess is: adele, result is 20022, remaining possible words list size: 5
green: [(1, 'a'), (4, 'l'), (5, 'e')], yellow: [], grey: [(2, 'd'), (3, 'e')].
Current turn: 1, guess is: amble, result is 20022, remaining possible words list size: 3
green: [(1, 'a'), (4, 'l'), (5, 'e')], yellow: [], grey: [(2, 'm'), (3, 'b')].
Current turn: 1, guess is: apple, result is 22222, remaining possible words list size: 1
Congratulations!, you reached the answer apple in 1 turns.


On speed: the current implementation prints a lot and catches exceptions on every turn, and, to keep development flexible, I didn't adopt `np.vectorize` to vectorize the data processing, so it runs slowly.