# Dataset Preparation

In [14]:
import pandas as pd

Read the CSV file.

In [15]:
df = pd.read_csv("unigram_freq.csv")
# drop rows that contain missing (NaN) values
df.dropna(inplace=True)
df

Unnamed: 0,word,count
0,the,23135851162
1,of,13151942776
2,and,12997637966
3,to,12136980858
4,a,9081174698
...,...,...
333328,gooek,12711
333329,gooddg,12711
333330,gooblle,12711
333331,gollgo,12711
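
One thing to keep in mind about the `dropna` step: by default `pd.read_csv` parses strings such as "nan" and "null" as missing values, so in a word list the rows it drops may be real words rather than corrupt data. The sketch below illustrates this on a tiny in-memory stand-in for the file (the three rows and the names `csv_text`, `default_df` and `literal_df` are assumptions made for illustration, not part of the dataset).

```python
import io
import pandas as pd

# Tiny stand-in for unigram_freq.csv (hypothetical rows, same two columns).
csv_text = "word,count\nthe,100\nnull,50\nnan,40\n"

# Default parsing: the strings "null" and "nan" are read as missing values.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps them as ordinary strings instead.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["word"].isna().sum())   # number of words parsed as missing
print(literal_df["word"].tolist())       # all three words survive as text
```

Whether those words matter depends on the project; for a guessing game, losing a couple of entries is probably acceptable.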


The dataset needs refining to exclude rarely used words and strings that are not words at all.  
We can use the count column for that.

In [16]:
df.describe()

Unnamed: 0,count
count,333331.0
mean,1764283.0
std,66300050.0
min,12711.0
25%,21224.0
50%,41519.0
75%,136572.0
max,23135850000.0


The gap between the most-used and least-used words is enormous: counts range from 12,711 to about 23 billion.  
To get a clearer view, let's create a logCount column holding the rounded base-10 logarithm of count.

In [17]:
from math import log10

In [18]:
df["logCount"] = df["count"].apply(lambda x:round(log10(x)))
df

Unnamed: 0,word,count,logCount
0,the,23135851162,10
1,of,13151942776,10
2,and,12997637966,10
3,to,12136980858,10
4,a,9081174698,10
...,...,...,...
333328,gooek,12711,4
333329,gooddg,12711,4
333330,gooblle,12711,4
333331,gollgo,12711,4


Now the gap is much easier to see: logCount ranges from 4 to 10.  
Browsing the dataset turns up many entries that are rarely used or are not valid English words at all (for example gooek and gooddg).  
So we need to filter this dataframe.  
For our project, and also to reduce the dataset size, we remove all words with logCount < 6.

In [19]:
df = df[df.logCount>5]
df

Unnamed: 0,word,count,logCount
0,the,23135851162,10
1,of,13151942776,10
2,and,12997637966,10
3,to,12136980858,10
4,a,9081174698,10
...,...,...,...
51533,shannen,316269,6
51534,threadless,316266,6
51535,capoeira,316262,6
51536,accomplice,316255,6
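
Because logCount is a rounded logarithm, keeping logCount > 5 is the same as keeping counts above roughly 10^5.5 (about 316,228), which fits the smallest count left in the output above. The sketch below checks that equivalence on three rows taken from the tables shown earlier; the names `sample`, `threshold`, `by_log` and `by_count` are assumptions introduced for illustration.

```python
from math import log10
import pandas as pd

# Three rows copied from the outputs above.
sample = pd.DataFrame({
    "word": ["the", "accomplice", "gooek"],
    "count": [23135851162, 316255, 12711],
})
sample["logCount"] = sample["count"].apply(lambda x: round(log10(x)))

# round(log10(x)) reaches 6 once log10(x) passes 5.5.
threshold = 10 ** 5.5

by_log = sample[sample.logCount > 5]
by_count = sample[sample["count"] > threshold]

print(by_log["word"].tolist())
print(by_count["word"].tolist())
```

Filtering on the raw count with an explicit threshold makes the cutoff easier to tune than a rounded exponent.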


In [20]:
df = df.reset_index(drop=True)

Add a length column holding the number of letters in each word.

In [21]:
df["length"] = df["word"].str.len()

Remove the count column; we no longer need it.

In [22]:
df.drop(columns=["count"], inplace=True)

In [23]:
df.length.describe()

count    51536.000000
mean         6.961464
std          2.645922
min          1.000000
25%          5.000000
50%          7.000000
75%          9.000000
max         26.000000
Name: length, dtype: float64

We do not need words shorter than 4 letters or longer than 15 letters, since they are not playable in Guess My Word.

In [24]:
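df = df[(df.length>3) & (df.length<=15)]
# reset_index returns a new frame; the result is not assigned here,
# so df keeps its original index labels, as the output below shows
df.reset_index().drop("index",axis=1)
df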

Unnamed: 0,word,logCount,length
9,that,10,4
11,this,10,4
12,with,10,4
20,from,9,4
23,your,9,4
...,...,...,...
51531,shannen,6,7
51532,threadless,6,10
51533,capoeira,6,8
51534,accomplice,6,10
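
If a clean 0-based index is wanted after the length filter, the filter and the renumbering can be chained and assigned in one step. This is a minimal sketch on a made-up five-word frame (`sample` and `playable` are hypothetical names); it uses `Series.between`, which includes both bounds by default.

```python
import pandas as pd

# Made-up words of length 1, 3, 4, 10 and 21.
sample = pd.DataFrame(
    {"word": ["a", "the", "that", "accomplice", "incomprehensibilities"]}
)
sample["length"] = sample["word"].str.len()

# Keep lengths 4..15 (inclusive) and renumber the rows from 0.
playable = sample[sample["length"].between(4, 15)].reset_index(drop=True)

print(playable)
```

The index does not affect the later steps, so this is purely a readability choice.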


Transform the data into a list of lists, where the element at index i holds all the words of length i.

In [25]:
longest = df.length.max()
array = [[] for _ in range(longest+1)]
for index, row in df.iterrows():
    array[row.length].append(row.word)
array

[[],
 [],
 [],
 [],
 ['that',
  'this',
  'with',
  'from',
  'your',
  'have',
  'more',
  'will',
  'home',
  'page',
  'free',
  'time',
  'they',
  'site',
  'what',
  'news',
  'only',
  'when',
  'here',
  'also',
  'help',
  'view',
  'been',
  'were',
  'some',
  'like',
  'than',
  'find',
  'date',
  'back',
  'list',
  'name',
  'just',
  'over',
  'year',
  'into',
  'next',
  'used',
  'work',
  'last',
  'most',
  'data',
  'make',
  'them',
  'post',
  'city',
  'such',
  'best',
  'then',
  'good',
  'well',
  'info',
  'high',
  'each',
  'very',
  'book',
  'read',
  'need',
  'many',
  'user',
  'said',
  'does',
  'mail',
  'full',
  'life',
  'know',
  'days',
  'part',
  'real',
  'item',
  'ebay',
  'must',
  'made',
  'line',
  'send',
  'type',
  'take',
  'area',
  'want',
  'long',
  'code',
  'show',
  'even',
  'much',
  'sign',
  'file',
  'link',
  'open',
  'case',
  'same',
  'both',
  'game',
  'care',
  'down',
  'size',
  'shop',
  'text',
  'rate',
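
The same grouping can also be written without an explicit loop. The sketch below is an alternative, not the method used above: it groups a small made-up frame by length with `groupby` and collects each group into a list, giving a dictionary keyed by word length instead of a list indexed by it (`sample` and `words_by_length` are hypothetical names).

```python
import pandas as pd

# Small made-up frame with words of length 4 and 5.
sample = pd.DataFrame({"word": ["that", "this", "world", "home", "games"]})
sample["length"] = sample["word"].str.len()

# Map each length to the list of its words; row order is kept within a group.
words_by_length = sample.groupby("length")["word"].apply(list).to_dict()

print(words_by_length)
```

A dictionary avoids the empty placeholder lists for lengths that have no words.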


Now let's save each sublist to its own text file.

In [26]:
import os
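# exist_ok=True lets this cell be re-run when the folder already exists
os.makedirs("Dataset", exist_ok=True)

for wordSet in array:
    # skip the lengths that have no words
    if not wordSet: continue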
    with open(f'Dataset/word_dataset_length_{len(wordSet[0])}_size_{len(wordSet)}.txt', 'w') as file:
        for word in wordSet:
            file.write(word + '\n')
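
To check that the files come out as expected, the saving step can be wrapped in a function and tried on a temporary folder. The sketch below follows the same file naming scheme as the cell above; the function name `save_word_lists`, its parameters and the sample word lists are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

def save_word_lists(groups, out_dir):
    """Write each non-empty word list to its own file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for words in groups:
        if not words:
            continue  # no words of this length
        name = f"word_dataset_length_{len(words[0])}_size_{len(words)}.txt"
        target = out_path / name
        # One word per line, with a trailing newline like the cell above.
        target.write_text("\n".join(words) + "\n", encoding="utf-8")
        written.append(target)
    return written

# Try it on a throwaway folder with made-up lists.
with tempfile.TemporaryDirectory() as tmp:
    files = save_word_lists([[], ["that", "this"], ["world"]], tmp)
    names = [f.name for f in files]
    round_trip = files[0].read_text(encoding="utf-8").split()

print(names)
print(round_trip)
```

Reading a file back with `split()` recovers the original list, which is also how the game could load a word set later.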