---
#### **1. Text Preprocessing**
- **Sentences (`sent`)**: These are sample text sentences used as input.
- **Vocabulary Size (`voc_size`)**: This defines the range of integer values used for encoding words. In real applications, it is usually set from the number of distinct words in the dataset rather than chosen arbitrarily.
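- **One-Hot Encoding**: Despite its name, Keras's `one_hot()` does not build one-hot vectors. It hashes each word to an integer in the range `[1, voc_size-1]` (index `0` is reserved), so the integers are not guaranteed to be unique and two different words can share an index. These integer sequences are the input that the Embedding layer expects.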
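
The hashing idea behind this encoding can be sketched in plain Python. This is a simplified stand-in, not Keras's implementation: the helper name `hash_encode` is hypothetical, and using MD5 as the hash is an assumption made so the result is stable between runs. The integers it produces will not match the ones printed later in this notebook.

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1]; 0 stays free for padding."""
    indices = []
    for word in text.lower().split():
        # A stable hash of the word, reduced to the range 1..n-1
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

voc_size = 500
print(hash_encode("the glass of milk", voc_size))
print(hash_encode("the glass of juice", voc_size))
```

The two printed lists share their first three integers because the same word always hashes to the same index. Nothing prevents two different words from landing on the same index, which is why a generous `voc_size` helps.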

#### **2. Sequence Padding**
- **Why Padding?** Sentences have varying lengths, but deep learning models require inputs of uniform shape. Padding ensures all sequences are the same length by adding zeros (or truncating) to reach a fixed `sent_length`.
- **Pre-padding**: Zeros are added at the beginning of sequences (`padding='pre'`).
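
What pre-padding does to a list of integer sequences can be sketched as follows. The helper `pad_pre` is a hypothetical name written for illustration. It assumes truncation also happens at the front of a sequence that is too long, which mirrors the default behaviour of `pad_sequences`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly maxlen items."""
    if maxlen < 1:
        raise ValueError("maxlen must be at least 1")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[313, 201, 92, 134], [239, 84, 360, 140, 324]], maxlen=8))
# [[0, 0, 0, 0, 313, 201, 92, 134], [0, 0, 0, 239, 84, 360, 140, 324]]
```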

#### **3. Word Embeddings**
- **Embedding Layer**:
  - Maps each word (integer) to a dense vector of fixed size (`output_dim` or `dim`).
  - The embedding layer is initialized randomly and learns meaningful representations during training.
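  - The sequence length is declared by the `Input` layer (`shape=(sent_length,)`), and padding guarantees every sequence has that length, so the Embedding layer's `input_length` argument is unnecessary.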
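
Conceptually, an embedding layer is a lookup table with one row per integer index. The sketch below imitates that with plain Python lists. The names `table` and `embed` are hypothetical, and the small uniform initialization range is an assumption chosen to resemble the magnitudes in the printed predictions. No training happens here.

```python
import random

voc_size, dim = 500, 10
rng = random.Random(0)  # fixed seed so the table is reproducible

# One randomly initialized row of `dim` floats per possible word index
table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(voc_size)]

def embed(sequence):
    """Replace each integer in the sequence with its row from the table."""
    return [table[index] for index in sequence]

vectors = embed([0, 0, 0, 0, 313, 201, 92, 134])
print(len(vectors), len(vectors[0]))  # 8 10
print(vectors[0] == vectors[3])       # True: every padding 0 maps to the same row
```

This also explains why the leading rows of a padded sentence's embedding are identical: each padding zero looks up the same table row.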

#### **4. Model Compilation**
- **Optimizer (`adam`)**: A popular optimization algorithm that adapts learning rates during training.
- **Loss Function (`mse`)**: Placeholder for demonstration. For text tasks, a categorical or sparse cross-entropy loss is typically used.

#### **5. Model Summary**
- Displays the structure of the model, including layers, output shapes, and trainable parameters.
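
For this model, the numbers in the summary can be checked by hand. An embedding layer stores one vector of size `dim` per vocabulary index, so the parameter count is a single multiplication. The variable names below follow the notebook, and the values are the ones it uses.

```python
voc_size = 500   # number of rows in the embedding table
dim = 10         # length of each embedding vector
sent_length = 8  # padded sentence length

embedding_params = voc_size * dim
output_shape = (None, sent_length, dim)  # None stands for the batch size

print(embedding_params)  # 5000
print(output_shape)      # (None, 8, 10)
```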

---

### Key Concepts
1. **Word Embeddings**:
   - Dense vector representations of words that capture semantic relationships.
   - Words with similar meanings or contexts have closer embeddings.

2. **One-Hot Encoding vs. Word Embeddings**:
   - One-hot encoding creates sparse, high-dimensional vectors where only one element is `1`.
   - Embeddings create dense, low-dimensional vectors that are learned during training.

3. **Padding**:
   - Uniform sequence lengths are critical for batch processing in deep learning.

4. **Why Use an Embedding Layer?**
   - Avoids manual feature engineering by learning data-specific embeddings.
   - Reduces input dimensionality and improves model efficiency.
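
The link between the two representations can be shown with a toy example. Multiplying a true one-hot vector by an embedding matrix selects a single row, which is what an index lookup does directly. The tiny matrix and the helper names `one_hot_vector` and `vector_times_matrix` are made up for illustration.

```python
vocab_size, dim = 6, 3

# Toy embedding matrix: row r holds [3r, 3r + 1, 3r + 2]
embedding_matrix = [[float(r * dim + c) for c in range(dim)] for r in range(vocab_size)]

def one_hot_vector(index, size):
    """Sparse vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Row vector times matrix, written out with plain loops."""
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(len(matrix[0]))
    ]

index = 4
via_one_hot = vector_times_matrix(one_hot_vector(index, vocab_size), embedding_matrix)
via_lookup = embedding_matrix[index]

print(via_one_hot)  # [12.0, 13.0, 14.0]
print(via_lookup)   # [12.0, 13.0, 14.0]
```

The lookup gives the same result without building a vector of length `vocab_size` for every word, which is where the efficiency gain comes from.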

This code serves as a foundation for working with text data, enabling you to preprocess input, pad sequences, and create embeddings using Keras.

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
# Sample sentences used as the input dataset
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]


In [3]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [17]:
# Vocabulary size for one-hot encoding (arbitrarily chosen; typically depends on dataset)
voc_size=500

In [18]:
# Encode each sentence as a list of integers with Keras's one_hot()
# Each word is hashed to an integer between 1 and voc_size - 1 (0 is reserved)
onehot_repr = [one_hot(words, voc_size) for words in sent]
print(onehot_repr)

[[313, 201, 92, 134], [313, 201, 92, 244], [313, 316, 92, 261], [239, 84, 360, 140, 324], [239, 84, 360, 140, 425], [381, 313, 201, 92, 256], [6, 330, 109, 140]]
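
Because the encoding is hash-based, it is worth checking the output for collisions. The sketch below pairs each word with its integer and reports any integer shared by different words. The encoded lists are copied from the output above. A fresh run may assign different integers.

```python
from collections import defaultdict

sentences = [
    "the glass of milk",
    "the glass of juice",
    "the cup of tea",
    "I am a good boy",
    "I am a good developer",
    "understand the meaning of words",
    "your videos are good",
]
# Integers copied from the printed onehot_repr above
encoded = [
    [313, 201, 92, 134], [313, 201, 92, 244], [313, 316, 92, 261],
    [239, 84, 360, 140, 324], [239, 84, 360, 140, 425],
    [381, 313, 201, 92, 256], [6, 330, 109, 140],
]

# Collect the distinct words that were mapped to each integer
words_by_index = defaultdict(set)
for sentence, indices in zip(sentences, encoded):
    for word, index in zip(sentence.lower().split(), indices):
        words_by_index[index].add(word)

collisions = {i: sorted(w) for i, w in words_by_index.items() if len(w) > 1}
print(collisions)  # {201: ['glass', 'meaning']}
```

In this run, `glass` and `meaning` share index 201, so the Embedding layer would give them the same vector.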


In [19]:
#Word Embedding Representation
from tensorflow.keras.layers import Embedding, Input
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

In [20]:
import numpy as np

In [21]:
#Pre-padding
# Set a fixed sequence length for all sentences (padding or truncating as necessary)
sent_length=8
# Pad the one-hot encoded sentences with zeros (pre-padding) to ensure uniform length
# This is necessary for input into the Embedding layer, as it expects uniform input dimensions
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[  0   0   0   0 313 201  92 134]
 [  0   0   0   0 313 201  92 244]
 [  0   0   0   0 313 316  92 261]
 [  0   0   0 239  84 360 140 324]
 [  0   0   0 239  84 360 140 425]
 [  0   0   0 381 313 201  92 256]
 [  0   0   0   0   6 330 109 140]]


In [22]:
# Define the dimensionality of word embeddings
dim = 10  # Embedding (feature) dimension; for a larger dataset, a value around 300 is common

In [23]:
# Define a Sequential model: an Input layer followed by an Embedding layer
# - input_dim: Vocabulary size (voc_size), the number of unique integers in the input data
# - output_dim: Embedding dimension (dim), the size of the dense vector representing each word
# - The Embedding layer learns the embeddings during training

model = Sequential([
    Input(shape=(sent_length,)),  # Specify input shape
    Embedding(input_dim=voc_size, output_dim=dim)
])
# Compile the model with the Adam optimizer and Mean Squared Error (MSE) loss
# - The loss function and optimizer settings here are placeholders, as this model is not yet trained
model.compile(optimizer='adam', loss='mse')
# Display a summary of the model architecture
model.summary()


In [24]:
embedded_docs[0]

array([  0,   0,   0,   0, 313, 201,  92, 134], dtype=int32)

In [26]:
prediction = model.predict(embedded_docs[0].reshape(1, -1))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 232ms/step


In [27]:
print(prediction)

[[[ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [-0.04429581 -0.03727648  0.04626607 -0.00178454  0.00846018
    0.02377937  0.02640684  0.01891888 -0.00699757 -0.0066157 ]
  [-0.04591066 -0.03623826 -0.02711327 -0.02671381 -0.04603697
    0.04737509  0.03400474 -0.0309536   0.03483728  0.00633885]
  [-0.04657155  0.03075459 -0.00354611  0.03127669 -0.04636631
    0.0389827  -0.04626621 -0.02154769  0.0328099  -0.02025727]
  [ 0.00858914  0.00389517 -0.04662341  0.02347447  0.03368037
   -0.04104428 -0.00757492  0.02280011 -0.009269

In [28]:
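# Predict on the whole batch: embedded_docs already has shape (num_sentences, sent_length),
# so no reshape is needed (flattening it would merge all sentences into one long sequence)
print(model.predict(embedded_docs))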

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 104ms/step
[[[ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [ 0.02551533 -0.01755876  0.00032029 -0.04363328 -0.02342428
   -0.0191457  -0.01017138 -0.00359224 -0.00796907 -0.01994984]
  [-0.04429581 -0.03727648  0.04626607 -0.00178454  0.00846018
    0.02377937  0.02640684  0.01891888 -0.00699757 -0.0066157 ]
  [-0.04591066 -0.03623826 -0.02711327 -0.02671381 -0.04603697
    0.04737509  0.03400474 -0.0309536   0.03483728  0.00633885]
  [-0.04657155  0.03075459 -0.00354611  0.03127669 -0.04636631
    0.0389827  -0.04626621 -0.02154769  0.0328099  -0.02025727]
  [ 0.00858914  0.00389517 -0.04662341