# Random Word Generator
This notebook generates random words. It trains on a list of words or sentences that you supply.

## Selecting your text file

To pick the word list file, you can either open a file selection screen using tkinter, or write the relative path to your text file into the TXT variable below.

### Using tkinter

To select your text file with a file selection screen, import and initialize the tkinter module using the script below. tkinter is part of the Python standard library and is not installed through pip. If the import fails, your Python installation was most likely built without Tk support. On Debian or Ubuntu, for example, run
```console
sudo apt-get install python3-tk
```
in your console to add it. The Windows and macOS installers from python.org include tkinter by default.

In [1]:
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()

''

Now execute the code below to open the file selection screen whenever you want to change the wordlist file. If it does not seem to open, try looking for it in the alt+tab view, or minimize all windows until you can see it.

In [None]:
TXT = filedialog.askopenfilename(initialdir=".", filetypes=[("Text Files", "*.txt")])

### Defining the file manually
Alternatively, you can enter the relative path to your wordlist directly into the TXT variable, as shown below. The commented-out lines are alternative files I use for testing. They are not included in the GitHub repository, as I recommend choosing your own. Change the string to the relative path of your file and execute the code snippet to save the variable.

In [2]:
# TXT = "1-1000.txt"
# TXT = "1000häufigste.txt"
# TXT = "google-10000-english.txt"
TXT = "wortliste.txt"
# TXT = "testing.txt"

## Training with the wordlist
### Reading the wordlist
Use the following code to initialize the wordlist or to tell the code to read it again.

In [3]:
with open(TXT, 'r', encoding='utf-8') as f:
    txt = f.read()
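
A file that ends with a newline or contains blank lines produces empty groups when it is later split on `\n`. The following is a minimal sketch, assuming you want to drop those before training; `clean_groups` is a hypothetical helper and not part of this notebook.

```python
def clean_groups(text):
    """Split text into lines, strip surrounding whitespace, drop empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Toy input with a blank line, padded whitespace, and a trailing newline.
sample = "apple\n\n  banana \ncherry\n"
print(clean_groups(sample))  # ['apple', 'banana', 'cherry']
```

If you use something like this, you would pass its result to the encoder in place of `txt.split("\n")`.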

### Generate the prediction cache
#### Dynamic Token Prediction Encoder
This version of the encoder counts every occurrence of every possible substring length (within a group). Words generated from this cache will mostly consist of large parts of preexisting words, or of entire preexisting words, which you might consider a lower-quality result. **WARNING:** If your file is large, this code will generate even larger cache files. For example, 1000 English words (separated by newlines) compile to at most 33000 lines with the dynamic encoder, which is about half a megabyte.
In the case of 10000 words, almost 550k lines will be written (10 MB).
A full wordlist of almost 300k words will result in files of around 50 million lines (1 GB).

I recommend skipping this and using the n-length to one character encoder [here](#one-character-prediction-encoder-with-n-character-context).

In [None]:
# dynamic token prediction encoder

groups = txt.split('\n')

count = {}
for group in groups:
    for first_token_start in range(0, len(group) + 1):
        for first_token_end in range(first_token_start, len(group) + 1):
            next_token = None
            first_token = "" if first_token_end == 0 else group[first_token_start:first_token_end]
            for next_token_size in range(1, len(group) + 2):
                if next_token_size == len(group) + 1 and first_token_end == 0:
                    continue
                new_next_token = group[first_token_end:first_token_end + next_token_size]
                if next_token is not None and new_next_token == next_token:
                    continue
                next_token = "" if next_token_size == len(group) + 1 else new_next_token

                if first_token not in count:
                    count[first_token] = {}

                if next_token not in count[first_token]:
                    count[first_token][next_token] = 0

                count[first_token][next_token] += 1

import json

# Write the cache to count.json. Opening in "w" mode creates the file if it is
# missing, so no directory of the same name must be created beforehand.
with open('count.json', 'w', encoding='utf-8') as f:
    json.dump(count, f, indent=4)
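
Since the cache can grow very large, it may help to check its size before writing it to disk. The sketch below is only an illustration: `cache_stats` is a hypothetical helper, and it assumes the cache is a nested dict of the same shape as `count` and is serialized with `indent=4` as above.

```python
import json


def cache_stats(count):
    """Return (contexts, pairs, lines, size_bytes) for a nested count dict."""
    serialized = json.dumps(count, indent=4)
    contexts = len(count)  # number of distinct first tokens
    pairs = sum(len(followups) for followups in count.values())
    lines = serialized.count("\n") + 1  # lines the JSON file would have
    size_bytes = len(serialized.encode("utf-8"))
    return contexts, pairs, lines, size_bytes


# Toy cache: two contexts, three (context, next token) pairs.
toy = {"": {"a": 2, "b": 1}, "a": {"": 1}}
contexts, pairs, lines, size_bytes = cache_stats(toy)
print(contexts, pairs, lines)  # 2 3 9
```

Note that this serializes the whole cache in memory, so for very large caches it is itself expensive.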

#### One character prediction encoder with n character context
This encoder is more disk-efficient and produces more unfamiliar results by generating only one character at a time, while using the last n characters to determine the probability of each candidate.
With this encoder, even large files with 300k words will result in only a 2 MB cache (100k lines).
Execute the code below to build the cache, unless you have already used the less efficient encoder above.
If you want, you may change the value of n, the maximum number of characters the encoder uses as context for the next token (default 3).

In [4]:
# 1 to n letter to single letter prediction encoder

n = 3

groups = txt.split("\n")

count = {}
for group in groups:
    for j in range(1, n + 1):
        previous_next_letter = None
        for i in range(j - 1, len(group) + 1):
            firstTokenStart = max(0, i-j)
            first_letter = group[firstTokenStart:i] if i > 0 else ""
            
            next_letter = "" if i == len(group) else group[i]
            
            if previous_next_letter is not None and next_letter == previous_next_letter:
                continue
            previous_next_letter = next_letter

            if first_letter not in count:
                count[first_letter] = {}

            if next_letter not in count[first_letter]:
                count[first_letter][next_letter] = 0

            count[first_letter][next_letter] += 1

import json

# Write the cache to count.json. Opening in "w" mode creates the file if it is
# missing, so no directory of the same name must be created beforehand.
with open("count.json", "w", encoding="utf-8") as f:
    json.dump(count, f, indent=4)
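
To make the shape of the cache concrete, here is a simplified sketch of the same idea on a toy word list. It is not the cell above: `count_followups` is a hypothetical function that counts each context of 1 to n characters exactly once per position, so its numbers can differ from what the notebook's encoder writes.

```python
from collections import Counter, defaultdict


def count_followups(words, n=3):
    """Count which character follows each context of up to n characters."""
    counts = defaultdict(Counter)
    for word in words:
        for i in range(len(word) + 1):
            # "" as next character marks the end of the word.
            next_char = word[i] if i < len(word) else ""
            if i == 0:
                counts[""][next_char] += 1  # empty context: word start
                continue
            for size in range(1, min(i, n) + 1):
                counts[word[i - size:i]][next_char] += 1
    return counts


toy = count_followups(["an", "at"], n=2)
print(dict(toy[""]))   # {'a': 2}
print(dict(toy["a"]))  # {'n': 1, 't': 1}
print(dict(toy["an"])) # {'': 1}
```

Each outer key is a context, and each inner key is a possible next character with the number of times it was seen.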

## Generating words
### Generation function
The generation function uses the parameter n as the target size of the context it uses for prediction. The one-character encoder directly above already sets this value, so do **not** change it here if you used that encoder. Only execute the cell below if you picked the less efficient dynamic encoder.

In [None]:
n = 1

Below is the generation function. It generates a single word with at most a given number of tokens. Each token is a string of predicted letters, so if you used the one-character encoder, each letter is a token. Execute the code below to define the function for later use.

In [5]:
import random

def generate_string(start, max_tokens, min_length=1, target_size=n):
    string = start
    lastPick = string
    for i in range(max_tokens):
        try:
            followUpDict = count[lastPick]
        except KeyError:
            possibleLastPicks = list(count.keys())
            
            target = None
            difference = 100
            for pick in possibleLastPicks:
                if string.endswith(pick):
                    if abs(target_size - len(pick)) < difference:
                        target = pick
                        difference = abs(target_size - len(pick))
                    
            if target is None:
                print(f"{string} does not end in anything expandable.")
                return string

            lookUp = target
            try:
                followUpDict = count[lookUp]
            except KeyError:
                print(f"Failed to generate followUp to {lookUp}")
                return string
        if len(string) < min_length:
            cleanKeys = [key for key in followUpDict.keys() if key != ""]
            if len(cleanKeys) == 0:
                print(f"{string} is not expandable.")
                return string
            cleanValues = [value for key, value in followUpDict.items() if key != ""]
            lastPick = random.choices(
                cleanKeys,
                cleanValues,
                k=1
            )[0]
        else:
            lastPick = random.choices(list(followUpDict.keys()), list(followUpDict.values()), k=1)[0]
        if lastPick == "":
            print(f"{string}: legit end picked")
            return string
        string += lastPick
    print(f"{string}: Maxed out characters. End might be cut off.")
    return string
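
The core step of the function above is a weighted random pick from the follow-up dictionary, with the end marker `""` excluded while the word is still too short. The sketch below isolates that step; `pick_next` and its parameters are hypothetical names used for illustration only.

```python
import random


def pick_next(followups, allow_end=True, rng=random):
    """Pick a next token weighted by its count.

    If allow_end is False, the end marker "" is not a candidate.
    Returns None when no candidate is left.
    """
    candidates = {
        token: weight
        for token, weight in followups.items()
        if allow_end or token != ""
    }
    if not candidates:
        return None
    tokens = list(candidates)
    weights = list(candidates.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


followups = {"n": 3, "t": 1, "": 2}
print(pick_next(followups, allow_end=False))  # "n" or "t", never ""
```

Passing a seeded `random.Random` instance as `rng` makes the picks reproducible, which can be handy when debugging.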

Before we can actually use the function, we will need to read the cache our encoder provided. We do this using the code below. Execute it each time you make changes to count.json or run the encoder.

In [6]:
with open("count.json", "r", encoding="utf-8") as f:
    count = json.load(f)
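
Once the cache is loaded, it can be useful to inspect what the generator would most likely pick after a given context. This is a small sketch under the assumption that the cache has the nested `{context: {next_token: count}}` shape described above; `top_followups` is a hypothetical helper.

```python
def top_followups(count, context, k=3):
    """Return up to k (next_token, probability) pairs for a context."""
    followups = count.get(context, {})
    total = sum(followups.values())
    ranked = sorted(followups.items(), key=lambda item: item[1], reverse=True)
    return [(token, weight / total) for token, weight in ranked[:k]]


# Toy cache standing in for the contents of count.json.
toy_count = {"a": {"n": 3, "t": 1}}
print(top_followups(toy_count, "a"))  # [('n', 0.75), ('t', 0.25)]
print(top_followups(toy_count, "zz"))  # []
```

An end-of-word pick shows up as the empty string `""` in the result.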

Now we are almost ready. Let's define the minimum length of our result string in characters, default 4, and the maximum length in tokens, default 9. Run the code below to apply your changes and save the variables.

In [7]:
minimum = 4
maximum = 9

### Generating
Now we are free to request new words. Just calling the generate_string function with the correct parameters is enough to return a new string. The code below will print out one generated word each time it is executed. Try it out! Debugging information will be provided alongside your actual result.

In [21]:
print(generate_string("", maximum, minimum))

Dredendun: Maxed out characters. End might be cut off.
Dredendun


If we want to mass-generate words, we can generate a list of words and save it to a file. Set wordCount to however many words you need and execute the code below to save your new words to a file. Open results.txt in your favorite text editor to view and edit the results.

In [25]:
wordCount = 25

words = [generate_string("", random.randint(minimum, maximum), minimum) for _ in range(wordCount)]

with open("results.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))

Tome: Maxed out characters. End might be cut off.
Aung: legit end picked
Figter: legit end picked
Mäman: Maxed out characters. End might be cut off.
mähr: Maxed out characters. End might be cut off.
Arige: Maxed out characters. End might be cut off.
Baun: Maxed out characters. End might be cut off.
Skerboh: Maxed out characters. End might be cut off.
fensgsge: legit end picked
unouni: Maxed out characters. End might be cut off.
keisen: Maxed out characters. End might be cut off.
Pansck: Maxed out characters. End might be cut off.
Veng: legit end picked
sckunbe: legit end picked
Dienb: Maxed out characters. End might be cut off.
Jateletem: Maxed out characters. End might be cut off.
Preit: Maxed out characters. End might be cut off.
Raun: legit end picked
Halis: Maxed out characters. End might be cut off.
urpe: Maxed out characters. End might be cut off.
Enitahri: Maxed out characters. End might be cut off.
Spos: legit end picked
Außuk: Maxed out characters. End might be cut off.
Kußis:
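
Generated words can repeat or coincide with words from the training list. If you only want new words, a post-processing step along these lines might help. It is a sketch, not part of the notebook: `novel_words` is a hypothetical helper, and comparing case-insensitively is an assumption you may not want.

```python
def novel_words(generated, known):
    """Keep generated words that are not in known, without duplicates.

    Comparison is case-insensitive; the first spelling seen is kept.
    """
    known_lower = {word.lower() for word in known}
    seen = set()
    result = []
    for word in generated:
        key = word.lower()
        if key in known_lower or key in seen:
            continue
        seen.add(key)
        result.append(word)
    return result


print(novel_words(["Veng", "Haus", "veng", "Raun"], ["Haus"]))  # ['Veng', 'Raun']
```

In the notebook, `known` could be the lines of your wordlist and `generated` the `words` list from the cell above.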