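The `one_hot` function in TensorFlow Keras does not build one-hot vectors itself. It splits a sentence into words and hashes each word to an integer index between 1 and `voc_size - 1`, so every word lands in a slot of a fixed-size vocabulary. Because this is hashing, two different words can occasionally collide on the same index.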

text -> vectors

In [1]:
from tensorflow.keras.preprocessing.text import one_hot
### sentences
sent=['the glass of milk',
      'the glass of juice',
      'the cup of tea',
      'I am a good boy',
      'I am a good developer',
      'understand the meaning of words',
      'your videos are good']
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

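Each number stands for a one-hot vector. For example, if `glass` (in sentence 1 or any other sentence) maps to index 6775, its one-hot vector has a 1 at position 6775 and 0 in the remaining 9,999 positions. The exact index can differ between runs, because `one_hot` uses Python's built-in `hash` by default.

`one_hot(words, voc_size)` expects a single string (a sentence) as input.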

In [2]:
## Define the vocabulary size
voc_size=10000
### One Hot Representation
one_hot_repr=[one_hot(words,voc_size) for words in sent]
one_hot_repr

[[5157, 482, 6065, 4753],
 [5157, 482, 6065, 5417],
 [5157, 8128, 6065, 347],
 [8649, 35, 2586, 729, 3217],
 [8649, 35, 2586, 729, 6503],
 [2925, 5157, 9429, 6065, 8874],
 [6068, 4617, 3553, 729]]
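In the output above, repeated words share an index (for example, the first three entries of the first two sentences match). The sketch below imitates that behaviour in plain Python. It is an approximation under stated assumptions, not the Keras implementation: the helper names `tokenize` and `hash_encode` are made up for illustration, tokenization is reduced to lowercasing and splitting on whitespace, and MD5 is used so that results are stable across runs.

```python
import hashlib

def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real tokenizers)."""
    return text.lower().split()

def hash_encode(text, vocab_size):
    """Map each word to an integer in [1, vocab_size - 1] using a stable hash."""
    indices = []
    for word in tokenize(text):
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # The +1 keeps index 0 free, so it can be used for padding later.
        indices.append(digest % (vocab_size - 1) + 1)
    return indices

sentences = ["the glass of milk", "the glass of juice"]
encoded = [hash_encode(s, 10000) for s in sentences]
print(encoded)
```

The same word always gets the same index, but nothing prevents two different words from sharing one, which is the trade-off of hashing instead of building an explicit vocabulary.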

`pad_sequences` makes all lists the same length: shorter ones are padded with zeros and longer ones are truncated to `maxlen`.

In [3]:
## Word embedding representation
from tensorflow.keras.layers import Embedding
# Older import path: from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.models import Sequential  # Sequential builds a plain layer-by-layer stack
import numpy as np
sent_length=8  # every sentence is padded (or truncated) to this length
# padding='pre' adds zeros at the front; padding='post' would add them at the end
embedded_docs=pad_sequences(one_hot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0 5157  482 6065 4753]
 [   0    0    0    0 5157  482 6065 5417]
 [   0    0    0    0 5157 8128 6065  347]
 [   0    0    0 8649   35 2586  729 3217]
 [   0    0    0 8649   35 2586  729 6503]
 [   0    0    0 2925 5157 9429 6065 8874]
 [   0    0    0    0 6068 4617 3553  729]]
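To make the padding rule concrete, here is a minimal pure-Python sketch of "pre" padding with truncation from the front. The helper `pad_pre` is a hypothetical name used only for illustration and covers just this one mode; it assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`; if it is too long, keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                       # drop extra items from the front
    return [value] * (maxlen - len(seq)) + seq      # fill the front with the pad value

short = pad_pre([5157, 482, 6065, 4753], 8)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 8)
print(short)   # [0, 0, 0, 0, 5157, 482, 6065, 4753]
print(long_)   # [3, 4, 5, 6, 7, 8, 9, 10]
```

Padding at the front keeps the real words next to the end of the sequence, which is a common choice when the sequence is later read left to right.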


### Purpose of the Embedding layer: it converts word indices (integers) into dense vector representations before they are fed into the rest of the model.
If a word was previously a 10,000-dim one-hot vector, after embedding it becomes a 10-dim dense vector.

These 10 values are randomly initialised at first and are then updated by backpropagation during training.

After training, these 10-dim dense vectors are learned representations that capture semantic meaning, i.e., similar words get similar vectors (e.g., "king" and "queen").

The model uses **word indices** (numbers) to fetch **embedding vectors** from an embedding matrix, instead of using large, sparse **one-hot vectors**. This is more efficient and helps the model learn better word meanings.
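A small NumPy sketch of why an index lookup can replace a one-hot vector: multiplying a one-hot vector by the embedding matrix selects exactly one row, so indexing that row directly gives the same result with far less work. The matrix below is random stand-in data, not the weights of the Keras model, and the names `embedding_matrix` and `word_index` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
voc_size, dim = 10000, 10

# Stand-in for a randomly initialised embedding matrix: one row per word index.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

word_index = 482  # any valid index works; this one is just an example

# Route 1: build the sparse 10,000-dim one-hot vector and multiply.
one_hot_vec = np.zeros(voc_size)
one_hot_vec[word_index] = 1.0
via_matmul = one_hot_vec @ embedding_matrix

# Route 2: fetch the row directly by index.
via_lookup = embedding_matrix[word_index]

print(via_matmul.shape, via_lookup.shape)
print(np.allclose(via_matmul, via_lookup))
```

Both routes return the same 10-dim vector; the lookup simply skips the multiplication by 9,999 zeros.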

In [4]:
## Feature representation
from tensorflow.keras import Input
dim=10                                  # each word index is mapped to a 10-dimensional vector
model=Sequential()                      # initializes a sequential model
model.add(Input(shape=(sent_length,)))  # expected sentence length (words per input)
model.add(Embedding(voc_size,dim))      # lookup table of shape (voc_size, dim)
model.compile('adam','mse')             # Adam optimizer, mean squared error (MSE) loss
model.summary()
# `input_length` is deprecated in Keras 3 (TensorFlow 2.16+), so the length is declared with Input instead.



In [5]:
model.predict(embedded_docs) # 7 sentences -> 8 words  per sentence -> 10 dimensions per word

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 207ms/step


array([[[ 0.00147997,  0.02197584,  0.02876599, -0.01417321,
         -0.04609977, -0.01785077,  0.01445006, -0.02481191,
         -0.03570934, -0.00948672],
        [ 0.00147997,  0.02197584,  0.02876599, -0.01417321,
         -0.04609977, -0.01785077,  0.01445006, -0.02481191,
         -0.03570934, -0.00948672],
        [ 0.00147997,  0.02197584,  0.02876599, -0.01417321,
         -0.04609977, -0.01785077,  0.01445006, -0.02481191,
         -0.03570934, -0.00948672],
        [ 0.00147997,  0.02197584,  0.02876599, -0.01417321,
         -0.04609977, -0.01785077,  0.01445006, -0.02481191,
         -0.03570934, -0.00948672],
        [ 0.00127659, -0.01866531, -0.04405781,  0.00346626,
          0.03091947,  0.00297161, -0.03637736, -0.01783059,
          0.03567885,  0.02222336],
        [ 0.03851828,  0.04706664,  0.00856987, -0.02980471,
         -0.02187045,  0.01102706, -0.04920062,  0.04284799,
         -0.02777903,  0.04258044],
        [ 0.01992457, -0.0197791 , -0.01805289, -0.0
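
The first four rows of the output above are identical because the first four positions of the first sentence are all padding (index 0), and every occurrence of an index fetches the same row. The NumPy sketch below reproduces the shape logic with random stand-in weights; it is not the Keras model, and the values will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(42)
voc_size, dim, sent_length = 10000, 10, 8

# Random stand-in for the embedding weights.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

# Two padded sentences taken from the notebook output.
docs = np.array([
    [0, 0, 0, 0, 5157, 482, 6065, 4753],
    [0, 0, 0, 8649, 35, 2586, 729, 3217],
])

# Indexing with a 2D integer array returns one row per index.
output = embedding_matrix[docs]
print(output.shape)                               # (sentences, words per sentence, dim)
print(np.array_equal(output[0, 0], output[0, 3])) # padded positions share the same vector
```

With all 7 sentences the result has shape `(7, 8, 10)`, matching the comment on the `model.predict` call.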

In [6]:
embedded_docs[0] #first sentence

array([   0,    0,    0,    0, 5157,  482, 6065, 4753], dtype=int32)

The commented-out call `model.predict(embedded_docs[0])` gives an error because `embedded_docs[0]` is a **1D array**, but the model expects a **2D array** (even for a single input). Keras treats the first axis as the batch axis, so it expects the shape to be `(1, input_size)`, where `input_size` is `sent_length` (8 here).

The call `model.predict(np.array([embedded_docs[0]]))` works because it **wraps `embedded_docs[0]` in an additional array**, turning it into a 2D array with shape `(1, input_size)`, which is what the model expects.

In short:
- **1D array** (e.g., shape `(input_size,)`) → **error**
- **2D array** (e.g., shape `(1, input_size)`) → **works**
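
The shape difference can be checked with NumPy alone, without calling the model. The sketch below shows a few equivalent ways to add the batch axis to a single sentence; the variable names are illustrative.

```python
import numpy as np

# One padded sentence, as a 1D array.
sentence = np.array([0, 0, 0, 0, 5157, 482, 6065, 4753], dtype=np.int32)
print(sentence.shape)          # (8,)

# Equivalent ways to turn it into a batch containing one sentence.
batch_a = np.array([sentence])               # wrap in a list, as in the notebook
batch_b = sentence[np.newaxis, :]            # insert a new leading axis
batch_c = np.expand_dims(sentence, axis=0)   # same thing, spelled out
batch_d = sentence.reshape(1, -1)            # reshape to (1, 8)

print(batch_a.shape)           # (1, 8)
```

Any of these four forms can be passed to `model.predict` for a single sentence.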


In [14]:
# model.predict(embedded_docs[0])
model.predict(np.array([embedded_docs[0]]))



'2.19.0'