**Word Embeddings**

Word embeddings are a type of word representation that allows words with similar meaning to
have a similar representation. They are a distributed representation for text that is perhaps one
of the key breakthroughs for the impressive performance of deep learning methods on challenging
natural language processing problems.

This section explains what the word embedding approach for representing text is and how it
differs from other feature extraction methods.

There are three main algorithms for learning a word embedding from text data:

1) Continuous Bag-of-Words (CBOW) model.

2) Continuous Skip-Gram model.

3) Matrix factorization techniques.

You can either train a new embedding or use a pre-trained embedding on your natural
language processing task.

The CBOW model learns the embedding by predicting the current word based on its context.
The continuous skip-gram model learns by predicting the surrounding words given a current
word.
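The difference between the two objectives is easiest to see in the training examples each one would generate. The following is a minimal sketch in plain Python, not the Word2Vec implementation itself; the helper names (`context_windows`, `cbow_pairs`, `skipgram_pairs`) and the window size are assumptions made for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for each position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_pairs(tokens, window=2):
    # CBOW: the context words are the input, the current word is the target
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_pairs(tokens, window=2):
    # Skip-gram: the current word is the input, each surrounding word is a target
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]

tokens = "the glass of milk".split()
cbow = cbow_pairs(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)
print(cbow)      # e.g. (['the', 'of'], 'glass') is one CBOW example
print(skipgram)  # e.g. ('glass', 'the') and ('glass', 'of') are skip-gram examples
```

CBOW produces one example per position, while skip-gram produces one example per (word, neighbour) pair, so skip-gram sees more, smaller training examples from the same text.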

**What Are Word Embeddings?**

A word embedding is a learned representation for text where words that have the same meaning
have a similar representation.

**Word Embedding Models**

**Word2Vec**

Word2Vec is a statistical method for efficiently learning a standalone word embedding from a
text corpus. It was developed by Tomas Mikolov, et al. at Google in 2013 to make the
neural-network-based training of the embedding more efficient, and it has since become
the de facto standard for developing pre-trained word embeddings.

We find that these representations are surprisingly good at capturing syntactic and
semantic regularities in language, and that each relationship is characterized by a
relation-specific vector offset. This allows vector-oriented reasoning based on the
offsets between words. For example, the male/female relationship is automatically
learned, and with the induced vector representations, King - Man + Woman results
in a vector very close to Queen.
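The King - Man + Woman arithmetic can be tried out on toy data. The sketch below uses hand-picked 2-D vectors (one axis loosely standing for "royalty", the other for "gender"); they are hypothetical values chosen for illustration, not learned embeddings, and the `analogy` helper is likewise an assumption rather than part of any library.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# Hand-picked for illustration; real embeddings are learned and much larger.
vectors = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

With real pre-trained vectors the nearest neighbour is found the same way, only over a vocabulary of many thousands of words.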

**GloVe**

The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the
Word2Vec method for efficiently learning word vectors, developed by Pennington, et al. at
Stanford. Classical vector space model representations of words were developed using matrix
factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using
global text statistics but are not as good as the learned methods like Word2Vec at capturing
meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen
example above).

GloVe is an approach to marry the global statistics of matrix factorization techniques
like LSA with the local context-based learning in Word2Vec. Rather than using a window to
define local context, GloVe constructs an explicit word-context or word co-occurrence matrix
using statistics across the whole text corpus. The result is a learning model that may produce
generally better word embeddings.
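To make the idea of a word co-occurrence matrix concrete, here is a minimal sketch that accumulates pair counts over a whole (tiny) corpus. It only covers the counting step; GloVe's weighting and training objective are left out, and the function name and the symmetric `window` parameter are assumptions made for this illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbour) pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the glass of milk", "the glass of juice", "the cup of tea"]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("glass", "of")])  # 2: the pair appears in the first two sentences
print(counts[("of", "cup")])    # 1
print(counts[("of", "the")])    # 0: never adjacent in this corpus
```

The counts are stored sparsely in a `Counter` because most word pairs never co-occur; a dense matrix would grow with the square of the vocabulary size.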

**Reuse an Embedding**

It is common for researchers to make pre-trained word embeddings available for free, often under
a permissive license so that you can use them on your own academic or commercial projects. For
example, both Word2Vec and GloVe word embeddings are available for free download. These
can be used on your project instead of training your own embeddings from scratch. You have
two main options when it comes to using pre-trained embeddings:

1) Static, where the embedding is kept static and is used as a component of your model.
This is a suitable approach if the embedding is a good fit for your problem and gives good
results.

2) Updated, where the pre-trained embedding is used to seed the model, but the embedding
is updated jointly during the training of the model. This may be a good option if you are
looking to get the most out of the model and embedding on your task.
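The two options differ only in whether gradient updates are allowed to reach the embedding table. The following framework-free sketch illustrates this with NumPy; the random `pretrained` matrix stands in for downloaded vectors, and `sgd_step` and the gradient values are hypothetical. In Keras the same choice is commonly expressed through a layer's `trainable` setting.

```python
import numpy as np

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(5, 3))   # stand-in for downloaded vectors (5 words, 3 dims)

def sgd_step(embedding, word_ids, grad, lr=0.1, trainable=True):
    """Apply one gradient step to the rows for `word_ids`; do nothing if the table is frozen."""
    if not trainable:
        return embedding                # static: pre-trained vectors stay untouched
    updated = embedding.copy()
    # subtract.at accumulates correctly even if a word id appears more than once
    np.subtract.at(updated, word_ids, lr * grad)
    return updated

word_ids = np.array([1, 3])             # words that appeared in the current batch
grad = np.ones((2, 3))                  # hypothetical gradient from a downstream loss

static = sgd_step(pretrained, word_ids, grad, trainable=False)
updated = sgd_step(pretrained, word_ids, grad, trainable=True)

print(np.array_equal(static, pretrained))         # True: nothing moved
print(np.allclose(updated[1], pretrained[1] - 0.1))  # True: row 1 was fine-tuned
```

Only the rows for words seen in the batch change in the updated case; all other rows keep their pre-trained values.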

Example of Learning an Embedding.ipynb

In [1]:
from numpy import array
# tensorflow.keras imports, consistent with the cells below
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [2]:
sent=['the glass of milk',
      'the glass of juice',
      'the cup of tea',
      'I am a good boy',
      'I am a good developer',
      'understand the meaning of words',
      'your videos are good',]

In [11]:
### Vocabulary size
voc_size=50

In [12]:
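# despite its name, one_hot hashes each word to an integer index; it does not build one-hot vectors
onehot_repr=[one_hot(words,voc_size) for words in sent]
print(onehot_repr)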

[[35, 16, 3, 23], [35, 16, 3, 42], [35, 19, 3, 9], [42, 2, 19, 32, 17], [42, 2, 19, 32, 18], [21, 35, 34, 3, 49], [49, 14, 19, 32]]


In [13]:
onehot_repr

[[35, 16, 3, 23],
 [35, 16, 3, 42],
 [35, 19, 3, 9],
 [42, 2, 19, 32, 17],
 [42, 2, 19, 32, 18],
 [21, 35, 34, 3, 49],
 [49, 14, 19, 32]]
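The integers above come from hashing, which is why unrelated words can share an index (for example, both 'juice' and 'I' map to 42 in the output above). The sketch below shows the general idea with a stable MD5 hash. The exact hash function and index range used by Keras may differ, so treat `hashed_index` and `encode` as hypothetical helpers for illustration, not as the library's implementation.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] with a stable hash; 0 is left free for padding."""
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (n - 1) + 1

def encode(text, n):
    """Lowercase, split on whitespace, and hash every word to an index."""
    return [hashed_index(w, n) for w in text.lower().split()]

encoded = [encode(s, 50) for s in ["the glass of milk", "the glass of juice"]]
print(encoded)
```

The same word always receives the same index, but distinct words may collide; a larger vocabulary size lowers the chance of collisions at the cost of a larger embedding table.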

In [14]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np


In [16]:
sent_length=5
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[ 0 35 16  3 23]
 [ 0 35 16  3 42]
 [ 0 35 19  3  9]
 [42  2 19 32 17]
 [42  2 19 32 18]
 [21 35 34  3 49]
 [ 0 49 14 19 32]]


In [17]:
sent_length=5
embedded_docs=pad_sequences(onehot_repr,padding='post',maxlen=sent_length)
print(embedded_docs)

[[35 16  3 23  0]
 [35 16  3 42  0]
 [35 19  3  9  0]
 [42  2 19 32 17]
 [42  2 19 32 18]
 [21 35 34  3 49]
 [49 14 19 32  0]]
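The difference between `padding='pre'` and `padding='post'` shown in the two outputs above can be reproduced with a few lines of plain Python. This is a simplified sketch, not the Keras function: the helper name `pad` is an assumption, and truncation of over-long sequences is deliberately left out.

```python
def pad(seq, maxlen, padding="pre", value=0):
    """Pad `seq` with `value` up to `maxlen`, either in front ('pre') or at the end ('post')."""
    if len(seq) > maxlen:
        raise ValueError("sequence longer than maxlen; truncation is out of scope for this sketch")
    fill = [value] * (maxlen - len(seq))
    return fill + list(seq) if padding == "pre" else list(seq) + fill

pre = pad([35, 16, 3, 23], 5, padding="pre")
post = pad([35, 16, 3, 23], 5, padding="post")
print(pre)   # [0, 35, 16, 3, 23]
print(post)  # [35, 16, 3, 23, 0]
```

Either way, every sentence ends up with the same length, which is what lets the batch be stored as a single rectangular array.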


input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.

output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.

input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
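Conceptually, an embedding layer with these three arguments is a lookup table: a matrix with `input_dim` rows and `output_dim` columns, indexed by the integers in each input sequence. The NumPy sketch below mimics that lookup with random values (a trained layer would have learned them); the sizes are chosen to match `voc_size=50` and `sent_length=5` used in this notebook, with 30 output dimensions.

```python
import numpy as np

input_dim, output_dim, input_length = 50, 30, 5

rng = np.random.default_rng(0)
# One row per vocabulary index; random here, learned in a real model
embedding_matrix = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# (batch, input_length) array of integer word indices
batch = np.array([[35, 16, 3, 23, 0],
                  [35, 16, 3, 42, 0]])

embedded = embedding_matrix[batch]   # row lookup via integer indexing
print(embedded.shape)                # (2, 5, 30)
print(embedding_matrix.size)         # 1500 weights = input_dim * output_dim
```

Both sentences start with index 35, so `embedded[0, 0]` and `embedded[1, 0]` are the same vector: the layer has one vector per index, regardless of where the index appears.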

In [23]:
model=Sequential()
model.add(Embedding(voc_size,30,input_length=sent_length))
model.compile('adam','mse')

In [24]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 5, 30)             1500      
Total params: 1,500
Trainable params: 1,500
Non-trainable params: 0
_________________________________________________________________


In [25]:
print(model.predict(embedded_docs))

[[[-0.04107963  0.02798226  0.00504835 ... -0.01300856 -0.03860293
   -0.02766579]
  [-0.00481569 -0.01037474  0.04120917 ...  0.00263121 -0.02568066
    0.01011343]
  [ 0.00198215 -0.02069547  0.02648307 ... -0.03252503  0.02282203
   -0.04572082]
  [ 0.03792934 -0.00712482  0.02544123 ...  0.04123605 -0.01814964
   -0.00654888]
  [-0.02668467 -0.01835966 -0.03522006 ...  0.04140535  0.0189225
   -0.04600055]]

 [[-0.04107963  0.02798226  0.00504835 ... -0.01300856 -0.03860293
   -0.02766579]
  [-0.00481569 -0.01037474  0.04120917 ...  0.00263121 -0.02568066
    0.01011343]
  [ 0.00198215 -0.02069547  0.02648307 ... -0.03252503  0.02282203
   -0.04572082]
  [ 0.01316294  0.00065475  0.03676306 ... -0.04894551 -0.00638874
    0.04131993]
  [-0.02668467 -0.01835966 -0.03522006 ...  0.04140535  0.0189225
   -0.04600055]]

 [[-0.04107963  0.02798226  0.00504835 ... -0.01300856 -0.03860293
   -0.02766579]
  [ 0.0131233   0.01387981 -0.02150233 ...  0.01600361  0.0482084
   -0.0252017 ]
  [

In [26]:
k=model.predict(embedded_docs)

In [27]:
k.shape

(7, 5, 30)

In [28]:
embedded_docs[0]

array([35, 16,  3, 23,  0], dtype=int32)

In [None]:
k[0]

array([[ 0.03437621, -0.04484242,  0.01587452, -0.0010276 ,  0.01732356,
         0.00340376,  0.0001472 , -0.01428288,  0.00482794, -0.02990921,
         0.04243921,  0.00995375, -0.02060571,  0.02119625,  0.00522514,
         0.01108086, -0.0222962 ,  0.03276416, -0.0435637 , -0.02021109],
       [ 0.0076919 ,  0.01903603, -0.00264392,  0.03987211, -0.02075864,
         0.01787741, -0.02374915, -0.00939726,  0.01387617, -0.04343615,
         0.00634513, -0.02770804, -0.04302816,  0.02323948, -0.03237991,
         0.04366989,  0.01167832, -0.01801644,  0.03282542, -0.02173652],
       [ 0.04840187, -0.00031609,  0.00032292, -0.00491963, -0.00975794,
         0.00951775,  0.04937495,  0.04188155, -0.00395557,  0.03028804,
         0.01697947, -0.00046738,  0.02995516,  0.04132083,  0.00053941,
         0.01912549, -0.02020488,  0.02088561, -0.01641457,  0.01974192],
       [ 0.04154417,  0.04976075, -0.02040995,  0.02098817, -0.01643115,
        -0.02558912, -0.04480329,  0.01053611,  

In [29]:
k[0][0]

array([-0.04107963,  0.02798226,  0.00504835,  0.03728746, -0.03153685,
        0.03970054, -0.04818378, -0.020865  , -0.01016577,  0.03835512,
        0.01578159,  0.03428097,  0.01599525,  0.01716676,  0.01395203,
       -0.02697303, -0.03869788,  0.03522957, -0.02241256,  0.03318893,
        0.00314027,  0.02009565,  0.02629269, -0.03216336,  0.01254712,
       -0.01182114,  0.03007689, -0.01300856, -0.03860293, -0.02766579],
      dtype=float32)

In [None]:
embedded_docs[1]

array([35, 16,  3, 42,  0], dtype=int32)

In [30]:
k[1][0]

array([-0.04107963,  0.02798226,  0.00504835,  0.03728746, -0.03153685,
        0.03970054, -0.04818378, -0.020865  , -0.01016577,  0.03835512,
        0.01578159,  0.03428097,  0.01599525,  0.01716676,  0.01395203,
       -0.02697303, -0.03869788,  0.03522957, -0.02241256,  0.03318893,
        0.00314027,  0.02009565,  0.02629269, -0.03216336,  0.01254712,
       -0.01182114,  0.03007689, -0.01300856, -0.03860293, -0.02766579],
      dtype=float32)

In [38]:
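# position 4 of both padded sentences is the padding value 0, so a and b are the same embedding row
a=k[0][4]
b=k[1][4]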

In [39]:
c=a.dot(b)

In [40]:
a1=np.linalg.norm(a)
b1=np.linalg.norm(b)

In [41]:
# cosine similarity of a and b (the cosine of the angle, not the angle itself)
ang=c/(a1*b1)

In [42]:
ang

1.0000001
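The cells above compute the cosine similarity between two embedding vectors step by step. A result slightly above 1, as in the output above, is most likely a floating-point rounding artifact, since the true value cannot exceed 1. The same calculation can be wrapped in a small helper; the function name and the test vectors below are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
same = cosine_similarity(u, 2 * u)   # scaling a vector does not change its direction
orth = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Clip before arccos so that rounding just above 1 does not produce NaN
angle_deg = float(np.degrees(np.arccos(np.clip(same, -1.0, 1.0))))

print(same, orth, angle_deg)
```

Clipping to the range [-1, 1] is only needed when the actual angle is wanted; for ranking words by similarity the raw cosine value is enough.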

Example of Learning an Embedding

In [43]:

# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])

In [44]:
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[11, 15], [32, 21], [23, 26], [20, 21], [38], [30], [14, 26], [17, 32], [14, 21], [28, 32, 15, 48]]


In [45]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[11 15  0  0]
 [32 21  0  0]
 [23 26  0  0]
 [20 21  0  0]
 [38  0  0  0]
 [30  0  0  0]
 [14 26  0  0]
 [17 32  0  0]
 [14 21  0  0]
 [28 32 15 48]]


In [46]:
from numpy import array
# tensorflow.keras imports, consistent with the earlier cells
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [47]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
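The parameter counts in the summary above follow directly from the layer sizes. A short sketch of the arithmetic, using the same sizes as the model (vocabulary of 50, 8-dimensional vectors, sequences of length 4, one sigmoid output):

```python
vocab_size, embedding_dim, max_length = 50, 8, 4

embedding_params = vocab_size * embedding_dim    # one 8-d vector per vocabulary index
flattened_units = max_length * embedding_dim     # Flatten turns (4, 8) into 32 features
dense_params = flattened_units * 1 + 1           # 32 weights plus 1 bias for the single output
total_params = embedding_params + dense_params

print(embedding_params, flattened_units, dense_params, total_params)
# 400 32 33 433
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the size of a model like this one.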


In [48]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=1)


Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fa0cdd66a90>

In [49]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 69.999999


In [50]:
padded_docs.shape

(10, 4)

In [51]:
a=padded_docs[0]

In [52]:
a

array([11, 15,  0,  0], dtype=int32)

In [53]:
a.shape

(4,)

In [54]:
a=a.reshape(1,4)

In [55]:
a.shape

(1, 4)

In [57]:
k=model.predict(a)

In [58]:
k=np.where(k>.5,1,0)

In [59]:
k

array([[1]])

In [60]:
sent1="great Work"

In [66]:
onehot_repr=[one_hot(sent1,vocab_size)]
print(onehot_repr)

[[23, 21]]


In [67]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(onehot_repr, maxlen=max_length, padding='post')
print(padded_docs)

[[23 21  0  0]]


In [68]:
padded_docs.shape

(1, 4)

In [69]:
# predict on the newly encoded and padded sentence
k=model.predict(padded_docs)

In [70]:
k=np.where(k>.5,1,0)

In [71]:
k

array([[1]])

In [72]:
sent2="great work but do again"