# 06. Embeddings and Data Preparation

## The Problem with One-Hot
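In basic NLP, words are often represented as one-hot vectors: each word gets a vector as long as the vocabulary, with a single 1 at that word's index and 0s everywhere else.
- **Issues**: High dimensionality (one dimension per vocabulary word), sparsity, and no notion of semantic similarity (the distance between 'king' and 'queen' is the same as the distance between 'king' and 'apple').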
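
The "same distance" issue can be checked directly. The following is a minimal sketch with a hypothetical three-word vocabulary (the `one_hot` and `euclidean` helpers are illustrative, not part of any library):

```python
import math

# Toy vocabulary, chosen only for illustration
vocab = ["king", "queen", "apple"]

def one_hot(word, vocab):
    """Return a vector with a single 1.0 at the word's position in the vocabulary."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

king, queen, apple = (one_hot(w, vocab) for w in vocab)

d_king_queen = euclidean(king, queen)
d_king_apple = euclidean(king, apple)

# Any two distinct one-hot vectors differ in exactly two positions,
# so every pair of words is sqrt(2) apart.
print(d_king_queen, d_king_apple)
```

Because every pair of distinct words ends up equally far apart, a model cannot read any similarity information from the representation itself.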

## Word Embeddings
Embeddings are dense, low-dimensional vector representations in which similar words have similar vectors.
- **Learned from data**: We can learn them as part of our model using an Embedding layer.
- **Pre-trained**: We can reuse vectors trained on large corpora, such as Word2Vec or GloVe.
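
To illustrate what "similar words have similar vectors" means, here is a small sketch using cosine similarity. The vectors in `toy_embeddings` are hand-picked for illustration; they are an assumption and do not come from Word2Vec, GloVe, or any trained model.

```python
import math

# Hand-picked 3-dimensional vectors for illustration only;
# real embeddings are learned and usually have 50 to 300+ dimensions.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print("king vs queen:", round(sim_king_queen, 3))
print("king vs apple:", round(sim_king_apple, 3))
```

With these toy values, 'king' and 'queen' point in nearly the same direction while 'king' and 'apple' do not, which is the kind of structure one-hot vectors cannot express.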

## Data Preparation Pipeline
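1. **Tokenization**: Converting text to sequences of integers, one integer per word.
2. **Padding**: Making all sequences the same length by filling with 0s. Index 0 is reserved for padding, so word indices start at 1.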

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample Data
sentences = [
    'I love deep learning',
    'RNNs are capable of processing sequential data',
    'LSTMs are great!'
]

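# 1. Tokenization: num_words caps the vocabulary to the most frequent words;
# words not seen during fitting are mapped to the '<OOV>' token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')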
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print("Word Index:", word_index)
print("Sequences:", sequences)

# 2. Padding: pad (and, if needed, truncate) at the end so every sequence has length 10
padded = pad_sequences(sequences, padding='post', maxlen=10, truncating='post')
print("\nPadded Sequences:\n", padded)



Word Index: {'<OOV>': 1, 'are': 2, 'i': 3, 'love': 4, 'deep': 5, 'learning': 6, 'rnns': 7, 'capable': 8, 'of': 9, 'processing': 10, 'sequential': 11, 'data': 12, 'lstms': 13, 'great': 14}
Sequences: [[3, 4, 5, 6], [7, 2, 8, 9, 10, 11, 12], [13, 2, 14]]

Padded Sequences:
 [[ 3  4  5  6  0  0  0  0  0  0]
 [ 7  2  8  9 10 11 12  0  0  0]
 [13  2 14  0  0  0  0  0  0  0]]
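
To make the two steps concrete without the Keras helpers, here is a rough pure-Python sketch of the same pipeline. It is an approximation under stated assumptions, not the Keras implementation: the helper names (`simple_tokenize`, `build_word_index`, `encode`, `pad_post`) are hypothetical, and the punctuation handling is simplified.

```python
import string
from collections import Counter

def simple_tokenize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_word_index(texts, oov_token="<OOV>"):
    """Assign index 1 to the OOV token, then rank words by frequency.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index."""
    return [word_index.get(w, word_index[oov_token]) for w in simple_tokenize(text)]

def pad_post(seq, maxlen, value=0):
    """Pad at the end with `value`, then truncate at the end to `maxlen`."""
    return (seq + [value] * maxlen)[:maxlen]

sentences = [
    'I love deep learning',
    'RNNs are capable of processing sequential data',
    'LSTMs are great!'
]

word_index = build_word_index(sentences)
encoded = [encode(s, word_index) for s in sentences]
padded_rows = [pad_post(seq, maxlen=10) for seq in encoded]

# A word that was never seen during fitting maps to the OOV index (1)
unseen = encode('I love transformers', word_index)

print(encoded)
print(padded_rows)
print(unseen)
```

For these three sentences the sketch yields the same integer sequences as the output above, and the last line shows why the `<OOV>` token matters: 'transformers' is not in the vocabulary, so it is encoded as 1 instead of being silently dropped.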


## Embedding Layer
The Embedding layer is essentially a lookup table that maps integer indices to dense vectors.

In [2]:
from tensorflow.keras.layers import Embedding
import numpy as np

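# Vocab size = 100, Embedding Dim = 5
# The sequence length is inferred from the input; the `input_length`
# argument is deprecated in recent Keras versions, so it is omitted here.
embedding_layer = Embedding(input_dim=100, output_dim=5)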

# Simulate input (batch of 1 sequence)
input_seq = np.array([padded[0]])
output = embedding_layer(input_seq)

print("Input:", input_seq)
print("Embeddings Output shape:", output.shape) # (1, 10, 5)

Input: [[3 4 5 6 0 0 0 0 0 0]]
Embeddings Output shape: (1, 10, 5)
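
The lookup-table view of the Embedding layer can be checked with plain NumPy. This is a minimal sketch under assumptions: `embedding_matrix` is a randomly initialized stand-in for the layer's weights, and the sizes are illustrative. It shows that indexing rows of the matrix gives the same result as multiplying one-hot vectors by it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 15 possible token ids, 5-dimensional embeddings
vocab_size, embedding_dim = 15, 5

# Stand-in for the trainable weights of an embedding layer:
# one row per token id
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch containing one padded sequence of token ids
token_ids = np.array([[3, 4, 5, 6, 0, 0]])

# Lookup: each id selects the corresponding row of the matrix
looked_up = embedding_matrix[token_ids]        # shape (1, 6, 5)

# Equivalent but wasteful: one-hot encode, then matrix-multiply
one_hot = np.eye(vocab_size)[token_ids]        # shape (1, 6, 15)
via_matmul = one_hot @ embedding_matrix        # shape (1, 6, 5)

print("Lookup shape:", looked_up.shape)
print("Same result as one-hot matmul:", np.allclose(looked_up, via_matmul))
```

Note that the padding id 0 also receives a vector (row 0 of the matrix); the lookup treats it like any other index.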


