## Problem Statement

The number of reported results varies daily. Develop a model to explain this variation and use your model to create a prediction interval for the number of reported results on March 1, 2023. Do any attributes of the word affect the percentage of scores reported that were played in Hard Mode? If so, how? If not, why not?

In [84]:
import numpy as np
import pandas as pd
import scipy.stats as stats

First, read the dates and the corresponding data.

In [85]:
df = pd.read_excel("./dataset/wordle_data.xlsx", index_col=0, usecols="A,C,D,E,F:L")
df["h/t"] = df["hard_mode_num"] / df["result_num"]
df = df.sort_index()

Next, min-max normalize the fraction of reported scores that were played in Hard Mode (`h/t`) to $[0,1]$.

In [86]:
orimin, orimax = df["h/t"].min(), df["h/t"].max()
df["h/t"] = (df["h/t"] - orimin) / (orimax - orimin)
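To make the rescaling concrete, here is a minimal standard-library sketch of min-max normalization and its inverse. The helper names `min_max_normalize` and `denormalize` and the sample fractions are illustrative assumptions, not values from the dataset.

```python
def min_max_normalize(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def denormalize(scaled, lo, hi):
    """Map normalized values back to the original scale."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up Hard Mode fractions, for illustration only.
sample = [0.02, 0.05, 0.11]
scaled, lo, hi = min_max_normalize(sample)
restored = denormalize(scaled, lo, hi)
print(scaled)
print(restored)
```

Keeping `lo` and `hi` (the notebook's `orimin` and `orimax`) is what would allow a value on the normalized scale to be mapped back to an actual Hard Mode fraction.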

### Repeatedness of characters (ROC)

We define the ROC of a word as the Shannon entropy of its character distribution, where $p(x)$ is the relative frequency of character $x$ in the word and $\log$ is the natural logarithm:

$$ H(X):=-\sum _{x\in\mathcal{X}}p(x)\log p(x)=\mathbb{E}[-\log p(X)]. $$

In [87]:
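def charCount(s):
    # Map each character of s to its number of occurrences.
    dic = {}
    for ch in s:
        dic[ch] = dic.get(ch, 0) + 1
    return dic

def calc_roc(s):
    # stats.entropy normalizes the counts to probabilities and uses the natural log.
    counts = list(charCount(s).values())
    return stats.entropy(counts)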

ROC increases as letters repeat less often within a word; a word with five distinct letters attains the maximum, $\log 5 \approx 1.609$.

In [88]:
print("ROC of mummy:", calc_roc("mummy"))
print("ROC of apple:", calc_roc("apple"))
print("ROC of audio:", calc_roc("audio"))

ROC of mummy: 0.9502705392332347
ROC of apple: 1.3321790402101223
ROC of audio: 1.6094379124341005
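As a cross-check of the definition, the same entropy can be computed from the formula with the standard library alone. This is a sketch, not the notebook's implementation; `entropy_of_word` is a hypothetical helper name, and its results should agree with the values printed above up to floating-point rounding.

```python
import math
from collections import Counter


def entropy_of_word(word):
    """Shannon entropy (natural log) of the character distribution of word."""
    counts = Counter(word)
    n = len(word)
    # p(x) = count(x) / n, and H = -sum p(x) * log p(x)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


for w in ("mummy", "apple", "audio"):
    print("ROC of {}: {}".format(w, entropy_of_word(w)))
```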


### Percentage of Vowels (POV)

We define the POV of a word as the fraction of its characters that are vowels (a, e, i, o, u).

In [89]:
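def calc_pov(s):
    # Count the vowels (a, e, i, o, u; "y" is not counted) ...
    n_vowels = sum(1 for c in s if c in "aeiou")
    # ... and divide by the word length (5 for a valid Wordle word).
    return n_vowels / len(s)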

POV increases as the share of vowels in a word increases.

In [90]:
print("POV of mummy:", calc_pov("mummy"))
print("POV of apple:", calc_pov("apple"))
print("POV of audio:", calc_pov("audio"))

POV of mummy: 0.2
POV of apple: 0.4
POV of audio: 0.8


### Expectation of yellow hit (EYH) and expectation of green hit (EGH)

*Any formal mathematical definition? @Linyi*

In [91]:
totWords = 2309
charRank = {'a': 906, 'b': 266, 'c': 446, 'd': 370, 'e': 1053, 'f': 206, 'g': 299, 'h': 377, 'i': 646, \
            'j': 27, 'k': 202, 'l': 645, 'm': 298, 'n': 548, 'o': 672, 'p': 345, 'q': 29, 'r': 835, \
            's': 617, 't': 667, 'u': 456, 'v': 148, 'w': 193, 'x': 37, 'y': 416, 'z': 35}
charRankByPos = [{'a': 140, 'b': 173, 'c': 198, 'd': 111, 'e': 72, 'f': 135, 'g': 115, 'h': 69, \
                  'i': 34, 'j': 20, 'k': 20, 'l': 87, 'm': 107, 'n': 37, 'o': 41, 'p': 141, 'q': 23, \
                  'r': 105, 's': 365, 't': 149, 'u': 33, 'v': 43, 'w': 82, 'y': 6, 'z': 3}, \
                 {'a': 304, 'b': 16, 'c': 40, 'd': 20, 'e': 241, 'f': 8, 'g': 11, 'h': 144, 'i': 201, 'j': 2, \
                  'k': 10, 'l': 200, 'm': 38, 'n': 87, 'o': 279, 'p': 61, 'q': 5, 'r': 267, 's': 16, 't': 77, \
                  'u': 185, 'v': 15, 'w': 44, 'x': 14, 'y': 22, 'z': 2}, {'a': 306, 'b': 56, 'c': 56, 'd': 75, \
                  'e': 177, 'f': 25, 'g': 67, 'h': 9, 'i': 266, 'j': 3, 'k': 12, 'l': 112, 'm': 61, 'n': 137, 'o': 243, \
                  'p': 57, 'q': 1, 'r': 163, 's': 80, 't': 111, 'u': 165, 'v': 49, 'w': 26, 'x': 12, 'y': 29, 'z': 11}, \
                 {'a': 162, 'b': 24, 'c': 150, 'd': 69, 'e': 318, 'f': 35, 'g': 76, 'h': 28, 'i': 158, 'j': 2, \
                  'k': 55, 'l': 162, 'm': 68, 'n': 182, 'o': 132, 'p': 50, 'r': 150, 's': 171, 't': 139, 'u': 82, \
                  'v': 45, 'w': 25, 'x': 3, 'y': 3, 'z': 20}, {'a': 63, 'b': 11, 'c': 31, 'd': 118, 'e': 422, 'f': 26, \
                  'g': 41, 'h': 137, 'i': 11, 'k': 113, 'l': 155, 'm': 42, 'n': 130, 'o': 58, 'p': 56, 'r': 212, 's': 36, \
                  't': 253, 'u': 1, 'w': 17, 'x': 8, 'y': 364, 'z': 4}]

In [92]:
def calc_rank():
    # Rebuild charRank and charRankByPos from the word bank (one word per line).
    with open("./dataset/wordle_wordbank.txt") as f:
        lines = f.readlines()
    print(len(lines))
    lines = [s.strip().lower() for s in lines]
    dic = {}                       # letter -> number of words containing it
    dic2 = [{} for _ in range(5)]  # per position: letter -> number of words
    for s in lines:
        for c in set(s):
            dic[c] = dic.get(c, 0) + 1
        for i, c in enumerate(s):
            dic2[i][c] = dic2[i].get(c, 0) + 1
    return dict(sorted(dic.items())), [dict(sorted(d.items())) for d in dic2]

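# EYH: sum, over the distinct letters of s, of the share of word-bank words containing that letter
def calc_EYH(s):
    eyh = 0
    for c in set(s):
        eyh += charRank.get(c, 0) / totWords
    return eyh

# EGH: sum, over the positions of s, of the share of word-bank words with the same letter at that position
def calc_EGH(s):
    egh = 0
    for i, c in enumerate(s):
        # A letter never seen at position i has no entry in the table, so count it as 0.
        egh += charRankByPos[i].get(c, 0) / totWords
    return egh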
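Read off the code above, one way to state the two quantities formally is sketched below. It assumes, as `calc_rank` suggests, that `charRank[c]` counts the word-bank words containing letter $c$ and that `charRankByPos[i][c]` counts those with $c$ at position $i$. The symbols $\mathcal{W}$ (the word bank), $N=|\mathcal{W}|=2309$, and $W$ (a word drawn uniformly from $\mathcal{W}$) are notation introduced here, not taken from the notebook.

$$ \mathrm{EYH}(s):=\sum_{c\in\operatorname{set}(s)}\frac{|\{w\in\mathcal{W}:c\in w\}|}{N}=\mathbb{E}\big[\,|\{c\in\operatorname{set}(s):c\in W\}|\,\big], $$

$$ \mathrm{EGH}(s):=\sum_{i=1}^{5}\frac{|\{w\in\mathcal{W}:w_i=s_i\}|}{N}=\mathbb{E}\big[\,|\{i:W_i=s_i\}|\,\big]. $$

Both right-hand equalities follow from linearity of expectation. Under this reading, EYH is the expected number of distinct letters of $s$ that occur anywhere in a random bank word, so it counts letters that would show as either yellow or green, while EGH is the expected number of exact positional matches.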

*Any further explanation of EYH and EGH? For instance, how do they correlate with how easy a word is to guess? @Linyi*

In [93]:
print("mummy: EYH = {}, EGH = {}".format(calc_EYH("mummy"), calc_EGH("mummy")))
print("apple: EYH = {}, EGH = {}".format(calc_EYH("apple"), calc_EGH("apple")))
print("audio: EYH = {}, EGH = {}".format(calc_EYH("audio"), calc_EGH("audio")))

mummy: EYH = 0.5067128627111304, EGH = 0.33997401472498917
apple: EYH = 1.277176266782157, EGH = 0.364660025985275
audio: EYH = 1.320918146383716, EGH = 0.2667821567778259


### Check the correlation

In [95]:
def get_result(df, func_name, row_name):
    valAttrib = [func_name(word.strip()) for word in df["word"]]
    res = np.asarray(df[row_name])
    pearson = np.corrcoef(valAttrib, res)
    # plt.scatter(valAttrib, res)
    # plt.show()
    return pearson[0][1]

cols = ["trial_" + str(i) for i in range(1, 7)] + ["trial_x"] + ["h/t"]
attribs = [calc_roc, calc_pov, calc_EYH, calc_EGH]
dic = {}
for attrib in attribs: 
    l = []
    for colname in cols:
        l.append(get_result(df, attrib, colname))
    dic[attrib.__name__] = l
res = pd.DataFrame(dic, index = cols)
res

| | calc_roc | calc_pov | calc_EYH | calc_EGH |
|---|---|---|---|---|
| trial_1 | 0.222949 | 0.085566 | 0.338521 | 0.280004 |
| trial_2 | 0.362358 | 0.125828 | 0.580781 | 0.335103 |
| trial_3 | 0.432937 | 0.006557 | 0.501621 | 0.172001 |
| trial_4 | 0.098531 | -0.104577 | -0.116221 | -0.311545 |
| trial_5 | -0.419158 | 0.03467 | -0.521995 | -0.259876 |
| trial_6 | -0.361306 | -0.005921 | -0.368545 | -0.01169 |
| trial_x | -0.202787 | -0.046894 | -0.12372 | 0.099137 |
| h/t | -0.081527 | 0.079484 | -0.039814 | -0.119932 |
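Each entry in the table above is a Pearson correlation coefficient, taken from the off-diagonal element of `np.corrcoef`. For reference, a minimal standard-library sketch of that coefficient follows; `pearson` is a hypothetical helper name and the sample lists are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: a partly monotone relationship.
r = pearson([1, 2, 3, 4], [1, 3, 2, 4])
print(r)
```

The coefficient lies in $[-1, 1]$ and measures only linear association, so a value near zero does not by itself rule out a nonlinear dependence.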


*Any possibility of visualizing these data? @Linyi*