# Course exercise 

During parts 2-3 of the `Building makemore` series in Andrej Karpathy's [Neural Networks: Zero to Hero](https://github.com/karpathy/nn-zero-to-hero/tree/master) course, we built an [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron).

At the end of Part 2, he suggested that we explore ways to improve the results he obtained, which included a `loss` of `2.17`. This could be achieved by adjusting the number of dimensions for the character embedding vectors, the number of neurons in the hidden layer, and other parameters.

I obtained some encouraging scores. Here is one example:

```
Train loss is : 1.805803894996643
Valid loss : 1.8799562454223633
Test loss: 1.8769153356552124

block_size: 2
pred_block_size: 2
Embedding dimensions : 6
Hidden layer: 100
NB_Epoch : 200000
No profound strategy for lr  (same lr decay): lr = 0.1 if i < 100000 else 0.01
Generator : manual_seed(2147483647)
Minibatch: 256
```

In this notebook, you'll find my contribution to the original code. I introduced `pred_block_size`, which lets each prediction cover an adjustable number of characters instead of a single character.

As a beginner in ML, I can't definitively say whether this is a brilliant idea, but I've observed that the model can quickly overfit if the dataset used contains a lot of short words. The increased `vocab` size might also impact performance.

Additionally, I incorporated two other datasets consisting of French and Indian names to compare the solutions.

In [2]:
import torch
import torch.nn.functional as F
import pandas as pd
import random
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

In [None]:
# EN-US raw data
#rawwords = open('names.csv', 'r').read()
#words_ha = rawwords.split(',')
#words = [ w.strip('\"') for w in words_ha]
#len(words)

In [None]:
# FR raw data
#df = pd.read_csv('liste_des_prenoms.csv', sep=';') # file has ; as separator instead of ,
#words = df['Prenoms'].tolist()
#len(words)

In [101]:
# INDIAN raw data
df = pd.read_csv('Indian_Names.csv')
df['Name'] = df['Name'].astype(str) # clean data: force every name to a string
words = df['Name'].tolist()
len(words)

6486

### Dataset settings
The `block_size` variable is used to create the `X` values and represents the number of consecutive characters our predictions will be based on.

`pred_block_size` is the additional variable that I have incorporated into the original exercise. This one influences the values of our `Y` tensors. As a result, the inference can now be made on a customizable number of characters.

While testing this idea, I discovered that the `loss` improved, but the generated names showed signs of overfitting: a significant share of them already existed in the dataset instead of being new words. You'll see how this issue is addressed.
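
To make the effect of `pred_block_size` concrete, here is a minimal standalone sketch of how a padded word can be cut into overlapping character tuples. The helper name `windows` and the sample word are hypothetical, and the padding only mirrors the `pred_context + w + '.'` convention used later in `build_dataset`. It is an illustration, not the notebook's exact code.

```python
def windows(word, pred_block_size):
    """Return every run of pred_block_size consecutive characters of the padded word."""
    padded = '.' * (pred_block_size - 1) + word + '.'
    return [tuple(padded[i:i + pred_block_size])
            for i in range(len(padded) - pred_block_size + 1)]

# Compare the tuples obtained for a few prediction sizes.
for size in (1, 2, 3):
    print(size, windows('ravi', size))
```

With a size of 1 this reduces to the single-character targets of the original exercise. Larger sizes turn every target into a tuple, which is why the vocabulary grows so quickly.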

In [91]:
# Choosing settings
# To choose block_size and pred_block_size, we can inspect the average word length of the dataset.
alw = sum(len(w) for w in words) / len(words) # average word length
alw
# EN -> 6.12
# FR -> 5.72
# IND -> 6.35

6.356614246068455

In [118]:
# build the dataset settings
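block_size = 3 # context length (number of previous tokens fed to the model)
pred_block_size = 3 # prediction length (number of characters covered by each target)
padd_size = (pred_block_size - 1) # padding size
pred_context = '.' * padd_size # padding string made of dots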

### Build the vocabulary of characters and mappings to/from integers

In [119]:
pbs_id = ['.'] * pred_block_size 
pbs_id_t = tuple(pbs_id)

# extra helper, needed because each target now spans pred_block_size characters
def build_vocab(words, pred_block_size):
    vocab_context = []
    
    for w in words:
        context = ['.'] * pred_block_size 
        formated_word = '.' + w + pred_context # apply contextual padding
        
        for pos,ch in enumerate(formated_word):
            context = context[1:] + [ch] # crop and append
            vocab_context.append(tuple(context))
            
    return vocab_context

vocab_build = build_vocab(words, pred_block_size) 
predChars = sorted(list(set(vocab_build)))
#monochars = sorted(list(set(''.join(words))))

#predChars
stoi = {s:i+1 for i,s in enumerate(predChars)} 
stoi.update({pbs_id_t:0}) # remap the all-dots tuple to index 0 (the index it had in predChars is left unused)
itos = {i:s for s,i in stoi.items()}

len(predChars)

3884
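
The mapping above has a small side effect worth knowing about: the all-dots tuple is already part of the sorted vocabulary, so moving it to index 0 leaves one index without an entry in `itos`. The sketch below reproduces this on toy data. The word list and the compact `build_vocab` rewrite are assumptions made for illustration only.

```python
def build_vocab(words, pred_block_size):
    """Collect every tuple seen by a sliding window of pred_block_size characters."""
    pad = '.' * (pred_block_size - 1)
    seen = set()
    for w in words:
        context = ['.'] * pred_block_size
        for ch in '.' + w + pad:
            context = context[1:] + [ch]  # crop and append
            seen.add(tuple(context))
    return sorted(seen)

toy_words = ['ab', 'ba']                 # hypothetical data
tokens = build_vocab(toy_words, 2)
end_token = ('.', '.')

stoi = {t: i + 1 for i, t in enumerate(tokens)}
old_index = stoi[end_token]              # index the all-dots tuple had before the remap
stoi[end_token] = 0                      # same remap as stoi.update({pbs_id_t: 0})
itos = {i: t for t, i in stoi.items()}

vocab_size = len(tokens) + 1
unused = [i for i in range(vocab_size) if i not in itos]
print(tokens)
print('unused indices:', unused)
```

The model still owns a logit for the unused index. It never appears as a target, so it is probably harmless in practice, but an `itos` lookup on it would raise a `KeyError` if it were ever sampled.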

In [120]:
#print(itos)
#stoi

In [121]:
# +1 because indices start at 1 and index 0 is reserved for the all-dots tuple
vocab_size = len(predChars) + 1 

In [122]:
def createContextY(idx,fWord, predBlockSize):
    contextY = []
    for j in range(predBlockSize):
        
        if idx+j <= (len(fWord)-1):
            contextY.append(fWord[idx+j])
        else:
            contextY = ['.'] * predBlockSize
    return contextY

def contextToI(context):
    # map a list of characters to the index of the matching tuple
    return stoi[tuple(context)]

def build_dataset(words):  
    X, Y = [], []
    
    for w in words: 
        context = [0] * block_size
        formated_word = pred_context + w + '.'  # apply padding
        
        for pos, ch in enumerate(formated_word):
            if pos < len(formated_word) - pred_block_size + 1:
                v = tuple(formated_word[pos:pos+pred_block_size])
                ix = stoi[v] # fail with a KeyError on an unknown tuple instead of storing a string in X
            else:
                l = [ch]
                l.extend(['.'] * padd_size)
                v = tuple(l)
                ix = stoi[v]
            
            context_pred = createContextY(pos,formated_word,pred_block_size)     
            iy = contextToI(context_pred) 
            X.append(context)
            Y.append(iy)
            context = context[1:] + [ix] # crop and append
                
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y

In [123]:
# Seed, shuffle and split the dataset: training 80%, validation 10%, test 10%
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

In [124]:
# Creating datasets  
Xtr, Ytr = build_dataset(words[:n1]) # training set, 80%
Xdev, Ydev = build_dataset(words[n1:n2]) # validation set, 10%
Xte, Yte = build_dataset(words[n2:]) # test set, 10%

torch.Size([48500, 3]) torch.Size([48500])
torch.Size([6115, 3]) torch.Size([6115])
torch.Size([6072, 3]) torch.Size([6072])
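
As a sanity check, the row counts above can be recovered by hand. Each word is padded to `len(w) + pred_block_size` characters and `build_dataset` emits one `(X, Y)` row per padded character. The sketch below redoes that arithmetic with the numbers printed in this notebook, assuming the average length shown earlier was computed on the same word list.

```python
# Numbers copied from the cell outputs of this notebook.
n_words = 6486
avg_len = 6.356614246068455
pred_block_size = 3
split_rows = [48500, 6115, 6072]    # rows of Xtr, Xdev, Xte

# Word-level split, same formula as n1 and n2.
n1 = int(0.8 * n_words)
n2 = int(0.9 * n_words)
print('words per split:', n1, n2 - n1, n_words - n2)

# One row per character of the padded word.
total_chars = round(avg_len * n_words)
expected_rows = total_chars + n_words * pred_block_size
print(total_chars, expected_rows, sum(split_rows))
```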


### I did not use batch normalization
I chose not to apply batch normalization, since I don't fully understand it yet.

In [125]:
g = torch.Generator().manual_seed(2147483647) # for reproducibility

nb_embed = 6 # the dimensionality of the character embedding vectors
n_hidden = 100 # the number of neurons in the hidden layer of the MLP

C = torch.randn((vocab_size, nb_embed), generator=g) 
W1 = torch.randn(((nb_embed*block_size), n_hidden), generator=g) * (5/3)/((nb_embed * block_size)**0.5) # scaled init: gain 5/3 (the usual value for tanh) divided by sqrt(fan_in)
b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0
parameters = [C, W1, b1, W2, b2]

print(f"Number of parameters: {sum(p.nelement() for p in parameters)}") # number of parameters in total

for p in parameters:
    p.requires_grad = True

Number of parameters: 417595
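
The parameter count can be broken down per tensor to see where the capacity goes. The sketch below only redoes the arithmetic from the shapes defined above (with `vocab_size = 3884 + 1`). The `numel` helper is a hypothetical stand-in for `Tensor.nelement()` so that the block runs without PyTorch.

```python
vocab_size = 3884 + 1
nb_embed, block_size, n_hidden = 6, 3, 100

shapes = {
    'C':  (vocab_size, nb_embed),
    'W1': (nb_embed * block_size, n_hidden),
    'b1': (n_hidden,),
    'W2': (n_hidden, vocab_size),
    'b2': (vocab_size,),
}

def numel(shape):
    """Number of elements of a tensor with the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

counts = {name: numel(s) for name, s in shapes.items()}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>2}: {n:>7} ({n / total:.1%})")
print("total:", total)
```

Almost all parameters sit in the output layer `W2`, because its width equals the size of the tuple vocabulary.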


## Training the model

In [126]:
nb_epoch = 200000 # number of minibatch steps (not full passes over the data)
progression_step = 10
batch_size = 256

for i in range(nb_epoch):
        
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y of (Xtr training set)

    # forward pass
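    emb = C[Xb] # embed the context tokens: (batch_size, block_size, nb_embed)
    h = torch.tanh(emb.view(-1, (nb_embed*block_size)) @ W1 + b1) # hidden layer: (batch_size, n_hidden)
    logits = h @ W2 + b2 # output layer: (batch_size, vocab_size)
    loss = F.cross_entropy(logits, Yb) # Yb = Ytr[ix]; each target is the index of a tuple of pred_block_size characters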

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    #lr = lrs[i] ## <<<< Use this to find the sweet spot for the learning rate
    
    # Apply learning rate decay once the loss plateaus (here: after 100000 steps)
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad

    # track stats <<<< Utility vars for the learning rate finder
    #lri.append(lr)
    #lri.append(lre[i])
    #lossi.append(loss.item())
    #stepi.append(i)
    #lossi.append(loss.log10().item())
    
    # printing progression in %
    if (i*100)/nb_epoch == progression_step:
        print(f"{progression_step=} %  {loss.item():.4f}")
        progression_step += 10

print(loss.item()) 

progression_step=10 %  2.7317
progression_step=20 %  2.2137
progression_step=30 %  1.7082
progression_step=40 %  1.4694
progression_step=50 %  1.5310
progression_step=60 %  1.3851
progression_step=70 %  1.2280
progression_step=80 %  1.3655
progression_step=90 %  1.3464
1.1867951154708862
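
Since `nb_epoch` counts minibatch steps, it can help to translate it into equivalent passes over the training set. The sketch below uses the numbers from this run. Minibatches are sampled with replacement, so a "pass" here is only an equivalent sample count, and `lr_at` is a hypothetical helper that mirrors the `0.1 if i < 100000 else 0.01` rule.

```python
steps = 200_000
batch_size = 256
n_train = 48_500    # rows in Xtr

samples_drawn = steps * batch_size
passes = samples_drawn / n_train
print(f"{samples_drawn:,} samples drawn, about {passes:.0f} passes over the training set")

def lr_at(step, switch=100_000, high=0.1, low=0.01):
    """Step decay: high learning rate before `switch`, low learning rate from `switch` on."""
    return high if step < switch else low

print(lr_at(0), lr_at(99_999), lr_at(100_000))
```

Seeing each training row on the order of a thousand times is one plausible reason for the gap between training and validation loss.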


### Inspect loss values for the 3 splits (training, validation, test)

In [141]:
# Training set 

emb = C[Xtr]
h = torch.tanh(emb.view(-1, (nb_embed*block_size)) @ W1 + b1) 
logits = h @ W2 + b2 
loss = F.cross_entropy(logits, Ytr)
print(f'Training loss: {loss.item()= }')

Training loss: loss.item()= 1.2724426984786987


In [142]:
# Validation set

emb = C[Xdev] # Andrej uses the term `dev` to refer to the validation set
h = torch.tanh(emb.view(-1, (nb_embed*block_size)) @ W1 + b1) 
logits = h @ W2 + b2 
loss = F.cross_entropy(logits, Ydev)
print(f'{loss.item()= }')

loss.item()= 2.9081592559814453


In [143]:
# Test set

emb = C[Xte] # verify with the test set
h = torch.tanh(emb.view(-1, (nb_embed*block_size)) @ W1 + b1) 
logits = h @ W2 + b2 
loss = F.cross_entropy(logits, Yte)
print(f'{loss.item()= }')

loss.item()= 2.8859503269195557
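
To put these three numbers in context, the sketch below compares them with the loss of a uniform guess over the tuple vocabulary, `ln(vocab_size)`, and converts each loss to a perplexity. The loss values are rounded copies of the outputs above. Because the targets are tuples rather than single characters, I would treat any comparison with the single-character `2.17` from the course with caution.

```python
import math

vocab_size = 3885
losses = {'train': 1.2724, 'valid': 2.9082, 'test': 2.8860}  # rounded from the cells above

# Cross-entropy of a model that assigns the same probability to every token.
uniform_loss = math.log(vocab_size)
print(f"uniform baseline: {uniform_loss:.4f}")

for name, loss in losses.items():
    print(f"{name}: loss={loss:.4f} perplexity={math.exp(loss):.1f}")

gap = losses['valid'] - losses['train']
print(f"valid - train gap: {gap:.4f}")
```

A validation loss far above the training loss, as here, is the usual signature of overfitting.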


## Generate new words

In [144]:
context = [0] * block_size
#C[torch.tensor([context])].shape 

In [145]:
g = torch.Generator().manual_seed(2147483647 + 10)
final_render = []

gen_quantity = 120

for _ in range(gen_quantity):
    out = []
    context = [0] * block_size # initialize with all ...
    
    while True:
        emb = C[torch.tensor([context])] # (1,block_size,d)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        #print(ix)
        context = context[1:] + [ix]
        out.append((ix))
        if ix == 0:
            break
    #print(out)
    final_render.append(''.join(itos[i][0]  for i in out)) # keep only the first character of each sampled tuple

final_render = sorted(final_render)
clean_output = []

for n in final_render:
    new_word = n.strip('.')
    clean_output.append(new_word)
    #print(new_word)

#print(clean_output[:100])
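
One thing worth checking in the decoding step: tracing the targets that `createContextY` builds for `pred_block_size = 3` suggests that the last tuple before the stop token carries the final letter in its second slot, so first-character decoding may drop that letter. This is a reading of the code rather than a measured result, although it would be consistent with outputs such as `narendr` or `rames`. The sketch below illustrates the idea on a hypothetical sample, and both helper names are made up for the example.

```python
def decode_first_chars(tokens):
    """Decoding used in the cell above: first character of every sampled tuple."""
    return ''.join(t[0] for t in tokens).strip('.')

def decode_with_tail(tokens, stop):
    """Variant that also keeps the remaining characters of the last tuple before the stop token."""
    body = [t for t in tokens if t != stop]
    if not body:
        return ''
    return (''.join(t[0] for t in body) + ''.join(body[-1][1:])).strip('.')

stop = ('.', '.', '.')
# Hypothetical sample shaped like the training targets for the word "ram".
sampled = [('.', '.', 'r'), ('.', 'r', 'a'), ('r', 'a', 'm'), ('a', 'm', '.'), stop]

print(decode_first_chars(sampled))
print(decode_with_tail(sampled, stop))
```

The variant assumes that consecutive sampled tuples overlap consistently, which the model does not guarantee, so it is only a starting point.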

### Verify overfit / originality

In [146]:
# Count how many generated words already exist in the dataset
list_existing = list(set(x for x in clean_output if x in words))
amount = len(list_existing) 
print(f"Quantity of NOT new words: {amount}")
percentage = (amount/gen_quantity) *100
print(f"{percentage= } %")

Quantity of NOT new words: 9
percentage= 7.5 %
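
The same originality check can be wrapped in a small function that uses a `set` for the lookups, which avoids scanning the whole word list for every sample. The function name and the toy data below are hypothetical. This is a sketch of the idea, not a replacement for the cell above.

```python
def novelty_report(generated, known_words):
    """Split generated samples into unique existing words and unique new words."""
    known = set(known_words)
    existing = sorted({w for w in generated if w in known})
    new = sorted({w for w in generated if w not in known})
    return existing, new

# Hypothetical toy data, not the notebook's real output.
known_words = ['ram', 'sita', 'ravi']
generated = ['ram', 'rami', 'sit', 'ram', 'ravi']

existing, new = novelty_report(generated, known_words)
print(existing, new)
print(f"{len(existing) / len(generated):.1%} of samples are unique existing words")
```

Note that this ratio, like the one above, divides a count of unique words by the total number of samples, duplicates included, so it can slightly understate how often the model reproduces a known name.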


### Show Result
Only the completely new words

In [147]:
generated_without_existing = sorted(set(x for x in clean_output if x not in list_existing))
print(f"Quantity of new words generated: {len(generated_without_existing)} \n")
for n in generated_without_existing:
    print(n)

Quantity of new words generated: 110 

aasi
ai
ajmkash
alsma
amarda
amrch
amuddi
ane
armjshijan
babnoenibhu
ban
bhuwmohpsaraha
biav
bin
binwapovla
budpshb
ch
cmparruddi
daho
dant
dhans
dhanvi
dhralta
dolroo
fark
fau
fi
firana
ganabl
gyipen
i
iewbitbanchanmee
inaa
inayan
jaa
jayan
jdilslahee
jeu
jtus
kaenun
kaktik
kal
kalpa
kamdl
kamruddi
kann
kih
kishog
kraiuna
ktim
laksarhmdev
mandh
mea
min
misn
mnripmra
munku
munn
nam
narendr
nirm
nschandr
nu
parsanjee
pivnaa
pt
pyaswanak
raa
radh
rafa
raghura
rajm
rajuddi
ramdhy
rames
ramr
rpjuhlnaea
rshp
sahtanidh
sakaa
satn
sattar
shabash
shadnit
shayiara
slafal
sonpnu
sp
sudra
sushris
til
tisa
tnagi
twnt
uu
vaeeknuzami
vaka
vaslee
veree
vidi
vijk
vikhariy
vioo
vish
vs
yashish
yasrrit
ynn
zeb
zitudi
