
This notebook preprocesses text data in three steps: a Tokenizer converts sentences into sequences of integer word indices, pad_sequences brings those sequences to a fixed length, and an Embedding layer maps each index to a dense vector. The Embedding layer is randomly initialized here; its vectors only come to capture semantic relationships between words once the layer has been trained as part of a model. The resulting tensors are a standard input format for neural networks in natural language processing tasks.

<font color = "DeepSkyBlue">**Tokenizer**

The code performs the following steps:

- It takes a list of sentences.

- It builds a vocabulary from the unique words in those sentences, lowercasing the text and stripping punctuation first.

- It assigns a unique integer index to each word, starting at 1 and ordered by descending frequency; index 0 is left free for padding.

- It converts the sentences into sequences of integers, where each integer is the index of a word.

- It displays the word index dictionary, which shows the word-to-index mapping.

This is a fundamental step in many natural language processing (NLP) tasks, as machine learning models typically work with numerical data.

In [1]:
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

metin = [
    'This is an example sentence.',
    'There is another sentence.',
    'A different example text.'
]
# Create a tokenizer and fit it on the text data (this builds the vocabulary).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(metin)

# Convert each sentence into a list of word indices.
sequences = tokenizer.texts_to_sequences(metin)

word_index = tokenizer.word_index
print("Word index dictionary: ", word_index)

Word index dictionary:  {'is': 1, 'example': 2, 'sentence': 3, 'this': 4, 'an': 5, 'there': 6, 'another': 7, 'a': 8, 'different': 9, 'text': 10}
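The ordering in the dictionary above follows word frequency: the most frequent words get the lowest indices, and ties keep their order of first appearance. The following is a minimal, framework-free sketch of that behavior, not Keras's actual implementation; the helper name `build_word_index` and the simplified regex-based cleaning are assumptions made for illustration.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: lowercase, drop punctuation, rank words by frequency."""
    counts = Counter()
    for text in texts:
        # Simplified cleaning (assumption): keep runs of lowercase letters only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # most_common() sorts by count, descending; ties keep first-seen order.
    # Indices start at 1 so that 0 stays free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

metin = [
    'This is an example sentence.',
    'There is another sentence.',
    'A different example text.',
]

word_index = build_word_index(metin)
sequences = [
    [word_index[w] for w in re.findall(r"[a-z]+", text.lower())]
    for text in metin
]
print(word_index)
print(sequences)
```

For these three sentences the sketch yields the same mapping as the Tokenizer output above; on text with apostrophes, digits, or non-ASCII characters the simplified cleaning would behave differently.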


This loop iterates through each element in the sequences list. Each element (seq) represents a sequence of integers, where each integer corresponds to a word index.

In [2]:
for seq in sequences:
  print(seq)

[4, 1, 5, 2, 3]
[6, 1, 7, 3]
[8, 9, 2, 10]
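To check that the integer sequences still carry the original sentences, they can be mapped back to words by inverting the word index. This sketch copies the dictionary and sequences printed above so that it runs on its own; the name `index_to_word` is an illustrative choice.

```python
# Values copied from the notebook output above.
word_index = {'is': 1, 'example': 2, 'sentence': 3, 'this': 4, 'an': 5,
              'there': 6, 'another': 7, 'a': 8, 'different': 9, 'text': 10}
sequences = [[4, 1, 5, 2, 3], [6, 1, 7, 3], [8, 9, 2, 10]]

# Invert the mapping: index -> word.
index_to_word = {index: word for word, index in word_index.items()}

# Rebuild each sentence from its indices (lowercased, punctuation already removed).
decoded = [" ".join(index_to_word[i] for i in seq) for seq in sequences]
for sentence in decoded:
    print(sentence)
```

With the fitted tokenizer at hand, its `index_word` attribute provides the same inverse mapping.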


This code snippet demonstrates how to pad sequences to a fixed length using Keras's pad_sequences function.

<font color = "DeepSkyBlue">**max_sequence_length = 5:** <font color = "White">This line sets the maximum length for all sequences to 5. If a sequence is shorter than 5, it will be padded; if it's longer, it will be truncated.

<font color = "DeepSkyBlue">**pad_sequences(sequences, maxlen=max_sequence_length):**<font color = "White">

- This function takes the sequences (the list of integer sequences generated by the Tokenizer) and pads them to the specified max_sequence_length. 

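- By default, zeros are added at the beginning of a sequence (pre-padding), and sequences that are too long are also cut from the beginning; the padding and truncating arguments change this.

- The result is a 2D NumPy array of shape (number of sequences, max_sequence_length), which is assigned to the padded_sequences variable.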

<font color = "DeepSkyBlue">**for seq in padded_sequences:** <font color = "White">This loop iterates over the rows of the padded_sequences array, one padded sequence at a time.

<font color = "DeepSkyBlue">**print(seq):** <font color = "White">This line prints each padded sequence to the console.

In [3]:
# Fixed length
max_sequence_length = 5
padded_sequences = pad_sequences(sequences, maxlen = max_sequence_length)

for seq in padded_sequences:
  print(seq)

[4 1 5 2 3]
[0 6 1 7 3]
[ 0  8  9  2 10]
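The default behavior shown above (zeros added at the front, overly long sequences cut from the front) can be written out in a few lines of plain Python. This is a sketch for illustration, not the Keras implementation; the helper name `pad_pre` is hypothetical, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper mimicking pre-padding and pre-truncation."""
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` items (truncate from the front).
        kept = seq[-maxlen:]
        # Fill the remaining positions at the front with the padding value.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

sequences = [[4, 1, 5, 2, 3], [6, 1, 7, 3], [8, 9, 2, 10]]
padded = pad_pre(sequences, maxlen=5)
for row in padded:
    print(row)

# A sequence longer than maxlen loses its first items.
truncated = pad_pre([[1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(truncated)
```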


<font color = "DeepSkyBlue">**Embedding**

<font color = "DeepSkyBlue">**Calculate the vocabulary size**

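The vocabulary size is the number of unique words in the text data plus one. Tokenizer indices start at 1 and index 0 is reserved for padding, so the Embedding layer needs one more row than there are words.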

In [4]:
from tensorflow.keras.layers import Embedding
import numpy as np

size = len(tokenizer.word_index) + 1


<font color = "DeepSkyBlue">**Define the embedding dimension**

The embedding dimension is the size of the dense vector representation for each word.

A higher dimension gives the model more capacity to encode relationships between words, but it also increases the number of parameters (vocabulary size × embedding dimension) and the computational cost.

In [5]:
dimension = 100
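As a quick sanity check on these two numbers, the sketch below recomputes the vocabulary size from the word index printed earlier and derives how many weights an embedding table of that size holds. The variable names `vocab_size` and `n_params` are illustrative and do not appear in the notebook.

```python
# Word index copied from the Tokenizer output earlier in the notebook.
word_index = {'is': 1, 'example': 2, 'sentence': 3, 'this': 4, 'an': 5,
              'there': 6, 'another': 7, 'a': 8, 'different': 9, 'text': 10}

# One row per word, plus row 0 for the padding index.
vocab_size = len(word_index) + 1
dimension = 100

# Every index that can occur (0..10) must be a valid row of the table.
largest_index = max(word_index.values())
print(largest_index < vocab_size)

# An embedding table is a vocab_size x dimension matrix of weights.
n_params = vocab_size * dimension
print(vocab_size, n_params)
```

Without the extra row, the largest word index (10) would fall outside the table.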


<font color = "DeepSkyBlue">**Create the Embedding layer**

The Embedding layer maps integer indices to dense vectors.

<font color = "DeepSkyBlue">**input_dim:** <font color = "White">The size of the vocabulary, that is, the largest integer index plus one (here the number of unique words plus one for the padding index 0).

<font color = "DeepSkyBlue">**output_dim:** <font color = "White">The dimension of the dense embedding.

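<font color = "DeepSkyBlue">**input_length:** <font color = "White">The length of the input sequences (padded sequences). Recent Keras releases (Keras 3) treat this argument as deprecated because the layer infers the length from its input, so it can be omitted there.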

In [6]:
embedding_layer = Embedding(input_dim = size, output_dim = dimension, input_length = max_sequence_length)




<font color = "DeepSkyBlue">**Apply the Embedding layer to the padded sequences**

Converts the padded sequences from integer indices to dense vector representations.

<font color = "DeepSkyBlue">**padded_sequences:** <font color = "White">The sequences of integer indices, padded to a fixed length.

<font color = "DeepSkyBlue">**np.array(...):** <font color = "White">Ensures the input is a NumPy array. pad_sequences already returns one, so the call is redundant here; it is needed only when the sequences are held in a plain Python list.

In [7]:
embedded_sequences = embedding_layer(np.array(padded_sequences))


<font color = "DeepSkyBlue">**Print the embedding result**

Displays the embedded sequences, which are now represented as dense vectors.

<font color = "DeepSkyBlue">**The output is a 3D tensor:** <font color = "White">(number of sequences, sequence length, embedding dimension), which is (3, 5, 100) for this example.
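Conceptually, an embedding layer is a lookup table: each integer index selects one row of a weight matrix. The sketch below imitates this with nested Python lists instead of Keras. The small dimension of 4, the fixed seed, and the uniform range of ±0.05 (chosen to resemble untrained weights) are assumptions for readability, so the numbers will not match the notebook's output.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab_size, dimension = 11, 4  # dimension kept small for readability

# One row of weights per index; random values stand in for untrained weights.
table = [[random.uniform(-0.05, 0.05) for _ in range(dimension)]
         for _ in range(vocab_size)]

# Padded sequences copied from the notebook output.
padded_sequences = [[4, 1, 5, 2, 3], [0, 6, 1, 7, 3], [0, 8, 9, 2, 10]]

# The "embedding" step: replace every index by its row of the table.
embedded = [[table[i] for i in seq] for seq in padded_sequences]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)

# The same index always yields the same vector, wherever it occurs.
print(embedded[0][1] == embedded[1][2])  # index 1 ('is') in sentences 1 and 2
print(embedded[1][0] == embedded[2][0])  # padding index 0
```

This is also why some rows repeat in the printed tensor below: a word that occurs in several sentences, as well as the padding index, maps to the same vector each time.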

In [8]:
print("Embedding result: ")
print(embedded_sequences)

Embedding result: 
tf.Tensor(
[[[-0.00440811  0.0367333  -0.04424722 ...  0.02052054 -0.0331463
   -0.00938786]
  [-0.03940564 -0.03481667 -0.0252624  ...  0.04283511 -0.0337648
   -0.01106616]
  [-0.02147403 -0.03271443  0.01033436 ... -0.01358676  0.00922159
    0.00468035]
  [ 0.03005802  0.00400939 -0.00136522 ...  0.03290996 -0.01794069
   -0.01546265]
  [-0.01146143  0.03761804 -0.02635258 ...  0.02882895  0.0406894
    0.03326086]]

 [[-0.03371916  0.00113312 -0.00088071 ... -0.02993964  0.01448205
   -0.02482233]
  [-0.02695836  0.01384885  0.02952087 ...  0.01972664  0.03502656
    0.01162883]
  [-0.03940564 -0.03481667 -0.0252624  ...  0.04283511 -0.0337648
   -0.01106616]
  [-0.02588189 -0.02417557 -0.00342847 ... -0.01197594  0.04132814
   -0.02149092]
  [-0.01146143  0.03761804 -0.02635258 ...  0.02882895  0.0406894
    0.03326086]]

 [[-0.03371916  0.00113312 -0.00088071 ... -0.02993964  0.01448205
   -0.02482233]
  [ 0.01072764  0.04473083 -0.01721001 ... -0.0154246  -0.