This Jupyter Notebook file implements a simple next-word prediction model using a recurrent neural network (LSTM) with Keras and TensorFlow. Here is a step-by-step explanation for each cell:

In [1]:
import numpy as np
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding, InputLayer
import string

In [None]:
# df = pd.read_csv("train.csv")
# len(df)

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/CostiCTI/CourseML/refs/heads/main/Part2-Models/train.csv")

In [10]:
df.head()

Unnamed: 0,index,text,label
0,0,acest document mi-a deschis cu adevarat ochii ...,1
1,1,tine mancarea rece. ce altceva ii mai trebuie?...,1
2,2,excelent\nrecomand!,1
3,3,"ca un rocker imbatranit, acest film mentioneaz...",1
4,4,"ei bine, a facut o groaza veche si foarte intu...",1


The 'text' column from the DataFrame df is extracted and converted into a Python list called data. This data list will be the text source for training the next word prediction model.

In [11]:
data = list(df['text'])

In [12]:
data[0]

'acest document mi-a deschis cu adevarat ochii la ceea ce oamenii din afara statelor unite s-au gandit la atacurile din 11 septembrie. acest film a fost construit in mod expert si prezinta acest dezastru ca fiind mai mult decat un atac asupra pamantului american. urmarile acestui dezastru sunt previzionate din multe tari si perspective diferite. cred ca acest film ar trebui sa fie mai bine distribuit pentru acest punct. de asemenea, el ajuta in procesul de vindecare sa vada in cele din urma altceva decat stirile despre atacurile teroriste. si unele dintre piese sunt de fapt amuzante, dar nu abuziv asa. acest film a fost extrem de recomandat pentru mine, si am trecut pe acelasi sentiment.'

This cell defines a function called generate_subsentences that takes a list of sentences as input.
For each sentence, it:
- Uses a translation table (built once, before the loop) to replace every punctuation character with a space.
- Splits the cleaned sentence into words.
- Generates every contiguous sub-sentence (n-gram) of 2, 3, and 4 words and adds it to the result list.
- Adds one extra entry, "<start>" followed by the first word of the original (uncleaned) sentence, to help predict the first word of a new sequence.
- Skips any sentence that raises an error during processing (the try-except block), for example if an entry is not a string.
The function returns the list of sub-sentences, which are later turned into input-output pairs for the model.

In [5]:
def generate_subsentences(sentences: list[str]) -> list[str]:
    result = []
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    for sentence in sentences:
        try:
            clean_sentence = sentence.translate(translator)
            words = clean_sentence.split()
            for length in range(2, 5):
                for start in range(len(words) - length + 1):
                    subsentence = ' '.join(words[start:start + length])
                    result.append(subsentence)
            result.append("<start>" + " " + sentence.split(" ")[0])
        except Exception:
            # Skip entries that cannot be processed (e.g. non-string values)
            pass
    return result

#sentences = ["the cat sat on the table", "i like it"]
#print(generate_subsentences(sentences))
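
The windowing logic can be checked on a toy sentence. The sketch below is only an illustration: the helper name sliding_windows is hypothetical and mirrors just the two inner loops of generate_subsentences (no punctuation cleaning and no "<start>" entry).

```python
def sliding_windows(words, min_len=2, max_len=4):
    """Return every contiguous run of min_len..max_len words, shortest first."""
    windows = []
    for length in range(min_len, max_len + 1):
        # A sentence with n words has n - length + 1 windows of this length
        for start in range(len(words) - length + 1):
            windows.append(" ".join(words[start:start + length]))
    return windows


short = sliding_windows("i like it".split())
longer = sliding_windows("the cat sat on the table".split())

print(short)        # ['i like', 'like it', 'i like it']
print(len(longer))  # 12 = 5 two-word + 4 three-word + 3 four-word windows
```

A six-word sentence already yields 12 sub-sentences, which helps explain why the full corpus produces millions of entries.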

In [13]:
props = generate_subsentences(data)
print (len(props))

4090603


In [14]:
import random

random.shuffle(props)

In [16]:
props[1]

'cu actori de'

In [17]:
props = props[:500000]
print (len(props))

500000


This cell sets the constant NO_WORDS to 2000, the maximum vocabulary size the Tokenizer will use. Less frequent words that fall outside this limit are treated as out-of-vocabulary (OOV).

In [18]:
NO_WORDS = 2000

In [19]:
tokenizer = Tokenizer(num_words=NO_WORDS, oov_token='unktoken')
tokenizer.fit_on_texts(props)

In [20]:
len(tokenizer.index_word)

43775

In [21]:
tokenizer.index_word[1]

'unktoken'

In [24]:
# tokenizer.index_word

This cell builds a list called oftenit containing the words whose index is less than or equal to NO_WORDS, then prints its length, which equals NO_WORDS. The list approximates the vocabulary the model will use. Two caveats apply: it includes the OOV token itself (index 1), and the Keras Tokenizer keeps only indices strictly below num_words when converting text, so the word at index NO_WORDS is also mapped to the OOV index.

In [25]:
oftenit = []
for k, v in tokenizer.index_word.items():
    if k <= NO_WORDS:
        oftenit.append(v)
print (len(oftenit))

2000


In [26]:
len(oftenit)

2000

In [None]:
#tokenizer.index_word

In [27]:
props[10]

'bolnave ei'

tokenizer.texts_to_sequences(props) converts each sub-sentence in props into a sequence of integers, where each number represents the index of the corresponding word in the tokenizer's vocabulary. Words that are not in the top NO_WORDS will be mapped to the oov_token index (which is 1).

In [28]:
sequences = tokenizer.texts_to_sequences(props)

In [29]:
sequences[10]

[1, 68]

In [30]:
sequences[124]

[5, 20, 419]
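
The effect of the vocabulary cap can be imitated in plain Python. This is a simplified sketch, not the Keras implementation: the helpers build_word_index and encode are hypothetical, they split on whitespace only, and they assume the same conventions as the Tokenizer (index 1 for the OOV token, more frequent words get lower indices, indices at or above num_words become OOV).

```python
from collections import Counter

OOV = "unktoken"  # same placeholder string the notebook passes as oov_token


def build_word_index(texts):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {OOV: 1}
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(text, word_index, num_words):
    """Map words to indices; unseen or too-rare words become the OOV index."""
    encoded = []
    for word in text.lower().split():
        idx = word_index.get(word)
        keep = idx is not None and idx < num_words
        encoded.append(idx if keep else word_index[OOV])
    return encoded


toy_texts = ["acest film", "acest film bun", "acest"]
toy_index = build_word_index(toy_texts)
toy_encoded = encode("acest film bun nou", toy_index, num_words=4)

print(toy_index)    # {'unktoken': 1, 'acest': 2, 'film': 3, 'bun': 4}
print(toy_encoded)  # [2, 3, 1, 1]
```

The full index still holds every word seen during fitting, and the cap is applied only at conversion time. This is why len(tokenizer.index_word) reports 43775 even though NO_WORDS is 2000.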

This cell filters the sequences list.
- Sequences longer than 4 tokens, or whose last token is the oov_token (index 1), are discarded.
- All other sequences (at most 4 tokens, not ending in oov_token) are added to the new xsequences list.
- The length limit exists because the model uses an input window of at most 3 words to predict the next one, so a sequence can hold at most 3 context words plus 1 target. Sequences ending in oov_token are dropped because an unknown word is not a useful prediction target.

In [32]:
xsequences = []
for seq in sequences:
    if len(seq) > 4 or seq[-1] == 1:
        pass
    else:
        xsequences.append(seq)
print (len(xsequences))

410223
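
The filtering rule can be illustrated on a few hand-made sequences. The values below are toy data, and the extra guard against empty sequences is an assumption added for safety (the notebook's loop would raise an IndexError on an empty list).

```python
OOV_INDEX = 1
MAX_LEN = 4  # up to 3 context tokens + 1 target token

toy_sequences = [
    [5, 20, 419],      # kept: short enough, known target
    [1, 68],           # kept: an OOV token in the context is allowed
    [7, 3, 1],         # dropped: the target (last token) is OOV
    [2, 9, 4, 11, 6],  # dropped: longer than MAX_LEN
]

kept = [
    seq for seq in toy_sequences
    if seq and len(seq) <= MAX_LEN and seq[-1] != OOV_INDEX
]

print(kept)  # [[5, 20, 419], [1, 68]]
```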


- pad_sequences from tensorflow.keras.preprocessing.sequence: a utility that pads sequences to a uniform length (here 4, since longer sequences were filtered out).
- tensorflow as tf: imports TensorFlow.
- padded = pad_sequences(xsequences, padding='pre'): pads the sequences in xsequences with zeros at the front so that they all have the same length. Because no maxlen is given, the target length is that of the longest sequence in xsequences, which is 4.

In [33]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf

padded = pad_sequences(xsequences, padding='pre')

In [34]:
padded[124]

array([  0,   0, 655,   5], dtype=int32)

In [35]:
print (padded[32])
print (padded[100])
print (padded[124])

[  0   0 196  13]
[ 3  1  1 28]
[  0   0 655   5]


In [36]:
len(padded)

410223
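
Pre-padding itself is simple enough to sketch without Keras. The helper pad_pre below is hypothetical and covers only the case used here (pad with zeros at the front, up to the longest sequence, no truncation).

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    target_len = max(len(seq) for seq in sequences)
    return [[value] * (target_len - len(seq)) + list(seq) for seq in sequences]


toy = [[655, 5], [3, 1, 1, 28], [196, 13]]
toy_padded = pad_pre(toy)

for row in toy_padded:
    print(row)
# [0, 0, 655, 5]
# [3, 1, 1, 28]
# [0, 0, 196, 13]
```

Padding at the front keeps the target word in the last column of every row, which the next step relies on.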

- X, y = padded[:,:-1], padded[:,-1]: Splits the padded array into inputs (X) and labels (y).
  - X: All columns of padded except the last. These are the context words used for prediction, so each row has length 3.
  - y: Only the last column of padded. This is the target word (the next word to be predicted).
- y = to_categorical(y, num_classes=NO_WORDS + 1): Converts the label vector y to one-hot encoding. num_classes=NO_WORDS + 1 makes the one-hot vectors cover every index from 0 to NO_WORDS: index 0 is the padding value, index 1 is oov_token, and the higher indices are vocabulary words.

In [37]:
X, y = padded[:,:-1], padded[:,-1]
y = to_categorical(y, num_classes=NO_WORDS + 1)
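
The split and the one-hot step can be traced on two toy rows. This is a plain-Python sketch with hypothetical names (one_hot, NUM_CLASSES, toy_rows); the notebook itself uses NumPy slicing and to_categorical.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


NUM_CLASSES = 6  # toy stand-in for NO_WORDS + 1
toy_rows = [[0, 0, 4, 2], [3, 1, 1, 5]]  # already padded to length 4

X_toy = [row[:-1] for row in toy_rows]                      # 3 context tokens
y_toy = [one_hot(row[-1], NUM_CLASSES) for row in toy_rows]  # 1 target token

print(X_toy)     # [[0, 0, 4], [3, 1, 1]]
print(y_toy[0])  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(y_toy[1])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```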

In [38]:
# props was shuffled earlier, so a plain slice is used instead of train_test_split
X_train = X[:350000]
X_test = X[350000:]
y_train = y[:350000]
y_test = y[350000:]
print (X_train.shape)
print (y_train.shape)
print (X_test.shape)
print (y_test.shape)

(350000, 3)
(350000, 2001)
(60223, 3)
(60223, 2001)


This function defines a custom Keras metric called top_3_accuracy.
- It checks whether the true label (y_true) is among the model's 3 highest-probability predictions (y_pred).
  - tf.cast(tf.argmax(y_true, axis=-1), tf.int32): Converts the one-hot labels (y_true) to integer labels.
  - tf.math.top_k(y_pred, k=3).indices: Extracts the indices (that is, the word IDs) of the 3 most probable predictions.
  - tf.reduce_any(tf.equal(tf.expand_dims(y_true, -1), top_3), axis=-1): Checks whether the true label appears among these top 3 predictions.
  - tf.reduce_mean(tf.cast(matches, tf.float32)): Computes the fraction of matches, that is, the top-3 accuracy.

In [39]:
def top_3_accuracy(y_true, y_pred):
    y_true = tf.cast(tf.argmax(y_true, axis=-1), tf.int32)  # Convert one-hot to integer labels
    top_3 = tf.math.top_k(y_pred, k=3).indices
    matches = tf.reduce_any(tf.equal(tf.expand_dims(y_true, -1), top_3), axis=-1)
    return tf.reduce_mean(tf.cast(matches, tf.float32))
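
The same idea can be verified by hand on a tiny example. The sketch below is a plain-Python analogue with toy probabilities, not the TensorFlow metric itself; the function name top_k_accuracy and the data are made up for illustration.

```python
def top_k_accuracy(true_labels, prob_rows, k=3):
    """Fraction of rows whose true label is among the k most probable classes."""
    hits = 0
    for label, probs in zip(true_labels, prob_rows):
        # Class indices ordered from most to least probable
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


toy_probs = [
    [0.05, 0.10, 0.50, 0.30, 0.05],  # top 3 classes: 2, 3, 1
    [0.40, 0.30, 0.15, 0.10, 0.05],  # top 3 classes: 0, 1, 2
    [0.05, 0.05, 0.10, 0.20, 0.60],  # top 3 classes: 4, 3, 2
    [0.30, 0.25, 0.20, 0.15, 0.10],  # top 3 classes: 0, 1, 2
]
toy_labels = [3, 4, 4, 1]

top1 = top_k_accuracy(toy_labels, toy_probs, k=1)
top3 = top_k_accuracy(toy_labels, toy_probs, k=3)

print(top1)  # 0.25
print(top3)  # 0.75
```

Top-3 accuracy can never be lower than top-1 accuracy, since every top-1 hit is also a top-3 hit.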

In [40]:
model = Sequential()
model.add(InputLayer(input_shape=(3, ), dtype=np.int32))
model.add(Embedding(NO_WORDS + 1, 8))  # input length (3) comes from the InputLayer
model.add(LSTM(8))
model.add(Dense(NO_WORDS + 1, activation='softmax'))
model.summary()  # prints the table itself; wrapping it in print() only adds "None"
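
The size of this model can be worked out by hand. The arithmetic below is a sketch that assumes the standard Keras parameterization with biases enabled (an LSTM has 4 gates, each with input weights, recurrent weights, and a bias); the constant names are chosen for illustration.

```python
VOCAB = 2000 + 1  # NO_WORDS + 1, the number of embedding rows and output classes
EMBED_DIM = 8     # second argument of Embedding
LSTM_UNITS = 8    # argument of LSTM

# One EMBED_DIM-sized vector per vocabulary index
embedding_params = VOCAB * EMBED_DIM

# 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)

# Weight matrix plus one bias per output class
dense_params = LSTM_UNITS * VOCAB + VOCAB

total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params)  # 16008 544 18009
print(total_params)                                 # 34561
```

Almost all parameters sit in the embedding and output layers; the recurrent layer itself is very small.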


In [41]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', top_3_accuracy])

In [42]:
model.fit(X_train, y_train, batch_size=16, epochs=4, validation_data=(X_test, y_test))

Epoch 1/4
21875/21875 ━━━━━━━━━━━━━━━━━━━━ 82s 4ms/step - accuracy: 0.0550 - loss: 5.8858 - top_3_accuracy: 0.1231 - val_accuracy: 0.0801 - val_loss: 5.4755 - val_top_3_accuracy: 0.1829
Epoch 2/4
21875/21875 ━━━━━━━━━━━━━━━━━━━━ 76s 3ms/step - accuracy: 0.0980 - loss: 5.3719 - top_3_accuracy: 0.1973 - val_accuracy: 0.1301 - val_loss: 5.1399 - val_top_3_accuracy: 0.2333
Epoch 3/4
21875/21875 ━━━━━━━━━━━━━━━━━━━━ 77s 4ms/step - accuracy: 0.1372 - loss: 5.0779 - top_3_accuracy: 0.2429 - val_accuracy: 0.1522 - val_loss: 4.9931 - val_top_3_accuracy: 0.2545
Epoch 4/4
21875/21875 ━━━━━━━━━━━━━━━━━━━━ 83s 4ms/step - accuracy: 0.1532 - loss: 4.9450 - top_3_accuracy: 0.2611 - val_accuracy: 0.1573 - val_loss: 4.9195 - val_top_3_accuracy: 0.2635


<keras.src.callbacks.history.History at 0x7e11be648560>
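
The step counters in the progress bars follow from the dataset sizes and batch sizes. The check below is a small arithmetic sketch; it assumes that model.predict falls back to the Keras default batch size of 32 when none is passed, as in the next cell.

```python
import math

train_rows, test_rows = 350_000, 60_223
fit_batch = 16      # batch_size passed to model.fit
predict_batch = 32  # assumed Keras default for model.predict

steps_per_epoch = math.ceil(train_rows / fit_batch)
predict_steps = math.ceil(test_rows / predict_batch)

print(steps_per_epoch)  # 21875
print(predict_steps)    # 1882
```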

In [43]:
preds = model.predict(X_test)

1882/1882 ━━━━━━━━━━━━━━━━━━━━ 3s 1ms/step


This cell processes the predictions:
- reverse_word_map: An inverted dictionary that maps numeric indices back to words (built from tokenizer.word_index).
- results, wordsr: Two empty lists that store the results.
- The loop iterates over each prediction (pred) in preds (an array of probabilities over all vocabulary words for a single input):
  - ar = pred.argsort()[-3:][::-1]: Gets the indices of the 3 most probable words. argsort() returns the indices that would sort the array, [-3:] takes the last 3 (the highest probabilities), and [::-1] reverses them into descending order of probability.
  - results.append([ar[0], ar[1], ar[2]]): Appends the indices of the top 3 predicted words to the results list.
  - wordsr.append([reverse_word_map[ar[0]], reverse_word_map[ar[1]], reverse_word_map[ar[2]]]): Converts the indices back to words using reverse_word_map and appends the predicted words to the wordsr list.

In [44]:
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
results = []
wordsr = []
for pred in preds:
    ar = pred.argsort()[-3:][::-1]
    results.append([ar[0], ar[1], ar[2]])
    wordsr.append([reverse_word_map[ar[0]], reverse_word_map[ar[1]], reverse_word_map[ar[2]]])
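
The argsort()[-3:][::-1] idiom can be followed step by step on a short list. The sketch below uses plain Python instead of NumPy, with a made-up probability vector and a toy word map; the "<pad>" fallback for index 0 is an assumption added here, since index 0 has no entry in the word map.

```python
def top3_indices(probs):
    """Indices of the 3 largest values, most probable first."""
    ascending = sorted(range(len(probs)), key=lambda i: probs[i])  # like argsort()
    return ascending[-3:][::-1]  # last 3 are the largest; reverse to descending


toy_reverse_map = {1: "unktoken", 2: "de", 3: "sa", 4: "o", 5: "film"}
toy_pred = [0.0, 0.01, 0.40, 0.25, 0.20, 0.14]  # position 0 is the padding index

toy_top = top3_indices(toy_pred)
toy_words = [toy_reverse_map.get(i, "<pad>") for i in toy_top]

print(toy_top)    # [2, 3, 4]
print(toy_words)  # ['de', 'sa', 'o']
```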

The true labels (y_test) are one-hot encoded. This cell converts them back to integer indices (that is, the ID of the actual word) using np.argmax.

In [45]:
testy = [np.argmax(x) for x in y_test]

In [46]:
len(testy)

60223

In [47]:
acc1 = 0
acc2 = 0
acc3 = 0
for i in range(len(results)):
    if results[i][0] == testy[i]:
        acc1 += 1
    if testy[i] in results[i][:2]:
        acc2 += 1
    if testy[i] in results[i][:3]:
        acc3 += 1
print ('R1:', acc1 / len(testy))
print ('R2:', acc2 / len(testy))
print ('R3:', acc3 / len(testy))

R1: 0.15733191637746377
R2: 0.22111153545987414
R3: 0.26347076698271427


In [55]:
aux = "salut ce faci"
example = "imi place"

In [56]:
example_seq = tokenizer.texts_to_sequences([aux, example])
print (example_seq)

[[1, 25, 633], [102, 138]]


In [57]:
example_padded = pad_sequences(example_seq, padding='pre')
print (example_padded)

[[  1  25 633]
 [  0 102 138]]


In [58]:
pred = model.predict(example_padded)[1]

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step


In [59]:
pred

array([8.3212276e-10, 8.2865576e-10, 5.8938868e-02, ..., 1.8583516e-05,
       3.2859039e-05, 8.2878376e-10], dtype=float32)

In [60]:
ar = pred.argsort()[-3:][::-1]
res = [ar[0], ar[1], ar[2]]
words_pred = [reverse_word_map[ar[0]], reverse_word_map[ar[1]], reverse_word_map[ar[2]]]

In [61]:
res

[np.int64(2), np.int64(7), np.int64(10)]

In [62]:
words_pred

['de', 'sa', 'o']
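
In the example above, the three-word sentence aux appears to serve only to make pad_sequences pad "imi place" to the 3 tokens the model expects. A more direct route is to fix the context length explicitly. The helper below is a hypothetical plain-Python sketch of that step; with Keras, passing maxlen=3 to pad_sequences should have the same effect.

```python
CONTEXT_LEN = 3  # the model's input window


def prepare_context(token_ids, context_len=CONTEXT_LEN, pad_value=0):
    """Keep the last `context_len` tokens and left-pad shorter inputs."""
    recent = list(token_ids)[-context_len:]
    return [pad_value] * (context_len - len(recent)) + recent


ctx_short = prepare_context([102, 138])        # "imi place"
ctx_exact = prepare_context([1, 25, 633])      # "salut ce faci"
ctx_long = prepare_context([9, 1, 25, 633])    # longer input, oldest token dropped

print(ctx_short)  # [0, 102, 138]
print(ctx_exact)  # [1, 25, 633]
print(ctx_long)   # [1, 25, 633]
```

Truncating from the front keeps the most recent words, which are the ones that matter for predicting the next word.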