**Instead of count-based statistical models such as bigrams or trigrams, we will use a Multi-Layer Perceptron (MLP). The table needed to store counts for every letter combination grows exponentially, as 26^x, where x is the number of letters in each combination and 26 is the number of letters in the alphabet. For longer contexts this quickly demands excessive memory and computation. A neural network such as an MLP avoids storing that table explicitly, which makes it a more efficient alternative.**
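
To make the growth concrete, here is a rough sketch of how many entries a dense count table would need. It assumes a 27-symbol vocabulary (the 26 letters plus the `.` boundary token introduced later in this notebook) and counts one table axis per context character plus one for the predicted character. The helper `table_entries` is a hypothetical name used only for this illustration.

```python
# Assumption: 26 letters plus the '.' start/end token used later in this notebook
vocab_size = 27

def table_entries(context_len, vocab=vocab_size):
    """Entries in a dense count table with one axis per context character
    and one more axis for the character being predicted."""
    return vocab ** (context_len + 1)

for n in range(1, 6):
    # context=1 is a bigram table, context=2 a trigram table, and so on
    print(f"context={n}: {table_entries(n):,} entries")
```

A bigram table (context of 1) has 729 entries, while a context of 3 already needs 531,441.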

In [4]:
import torch 
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
# Read the names file, one name per line
with open('names.txt', 'r') as f:
    words = f.read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [6]:
len(words)

32033

In [7]:
# Map each character to an integer index (stoi) and back (itos); index 0 is reserved for '.'
chars = sorted(list(set(''.join(words))))
# print(chars)
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0 
itos = {i:s for s,i in stoi.items()}
print(stoi)
print(itos)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
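
As a quick sanity check on the mapping, the sketch below round-trips a word through `stoi` and `itos`. The helpers `encode` and `decode` are hypothetical names that do not appear elsewhere in this notebook, and the vocabulary is assumed to be exactly the 26 lowercase letters, matching the printed output above.

```python
import string

# Assumption: the vocabulary is the 26 lowercase letters, as in the printed mapping
chars = list(string.ascii_lowercase)
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # index 0 is the start/end token
itos = {i: s for s, i in stoi.items()}

def encode(word):
    """Convert a string into a list of integer indices."""
    return [stoi[ch] for ch in word]

def decode(indices):
    """Convert a list of integer indices back into a string."""
    return ''.join(itos[i] for i in indices)

ids = encode('emma.')
print(ids)          # [5, 13, 13, 1, 0]
print(decode(ids))  # emma.
```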


The following cell builds a dataset from the indices defined by the mapping above and stores it as tensors. It uses only the first five words. Each word starts from an empty context (`...`) and yields one (context, next character) pair per character, plus one pair for the end token, as shown in the printout:

In [None]:
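# Context length: how many previous characters the MLP sees when predicting the next one
block_size = 3

X,y = [],[] # X holds the input contexts, y holds the labels (next-character indices)
# Running total of examples: each word yields len(w) + 1 pairs, so start at 5 for the 5 end tokens
count = 5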
for w in words[:5]: 
    print(w)
    context = [0] * block_size
    count += len(w)
    for ch in w + '.': 
        ix = stoi[ch]
        # print('ch:', ch)
        # print('ix', ix)
        X.append(context)
        # print('context', context)
        y.append(ix)
        print(''.join([itos[i] for i in context]),'----->', itos[ix])
        context = context[1:] + [ix] 
        
print(count)
X = torch.tensor(X)
Y = torch.tensor(y)

emma
... -----> e
..e -----> m
.em -----> m
emm -----> a
mma -----> .
olivia
... -----> o
..o -----> l
.ol -----> i
oli -----> v
liv -----> i
ivi -----> a
via -----> .
ava
... -----> a
..a -----> v
.av -----> a
ava -----> .
isabella
... -----> i
..i -----> s
.is -----> a
isa -----> b
sab -----> e
abe -----> l
bel -----> l
ell -----> a
lla -----> .
sophia
... -----> s
..s -----> o
.so -----> p
sop -----> h
oph -----> i
phi -----> a
hia -----> .
32



---

### Understanding the Code:

The cell above constructs a dataset by mapping words into numerical indices using the character-to-index dictionary (`stoi`) defined earlier. Each word is converted into input-output pairs, which are then stored as tensors.

#### Process Breakdown:
1. **Defining Block Size**:  
   - The `block_size` (3 in this case) determines how many previous characters are used as input to predict the next character.

2. **Initializing Inputs (`X`) and Labels (`y`)**:  
   - `X` stores the input sequences (context windows of size `block_size`).
   - `y` stores the corresponding next character (label).

3. **Iterating Over Words**:  
   - The loop iterates over the first five words (`words[:5]`).
   - Each word is extended with a stopping character (`.`) to signify the end.
   - The initial context (input window) starts as `[0, 0, 0]`, which is empty because index 0 maps to `.`.

4. **Generating Input-Output Pairs**:  
   - For each character in the word (including `.` at the end), the inner loop:
     - Converts the character into its corresponding index (`ix`).
     - Stores the current context in `X` and the next character index in `y`.
     - Updates the context by shifting left and adding the new character.

5. **Example Breakdown** (for `"emma"` with `block_size=3`):
   ```
       emma
       ... -----> e   # Initial context is empty, predicting 'e'
       ..e -----> m   # Context shifts, predicting 'm'
       .em -----> m   # Context shifts, predicting 'm'
       emm -----> a   # Context shifts, predicting 'a'
       mma -----> .   # Context shifts, predicting the end character '.'
   ```

6. **Final Conversion to Tensors**:  
   - The `X` and `y` lists are converted into the PyTorch tensors `X` and `Y` for further processing.

This method effectively converts words into training data for a model, where each input sequence predicts the next character.
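
The steps above could also be wrapped in a reusable function. The sketch below is one possible way to do that, not the notebook's own code: `build_dataset` is a hypothetical name, it returns plain Python lists so that it runs without PyTorch, and the vocabulary is assumed to be the 26 lowercase letters plus `.`.

```python
import string

def build_dataset(words, stoi, block_size=3):
    """Return (X, Y): context windows and the index of the character that follows each."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size        # start from an empty context ('...')
        for ch in w + '.':                # '.' marks the end of the word
            ix = stoi[ch]
            X.append(context)             # current window is the input
            Y.append(ix)                  # next character is the label
            context = context[1:] + [ix]  # slide the window one step to the right
    return X, Y

# Assumption: same mapping as in the notebook, rebuilt here so the sketch is self-contained
stoi = {s: i + 1 for i, s in enumerate(string.ascii_lowercase)}
stoi['.'] = 0

X, Y = build_dataset(['emma', 'olivia', 'ava', 'isabella', 'sophia'], stoi)
print(len(X), len(Y))  # 32 32
print(X[:3], Y[:3])    # [[0, 0, 0], [0, 0, 5], [0, 5, 13]] [5, 13, 13]
```

The returned lists can then be passed to `torch.tensor` exactly as in the cell above.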

--- 

## The Final Dataset