# Word Embedding

* Word embedding is a word-representation technique that lets machine learning algorithms treat words with similar meanings as similar. Technically, it maps words to vectors of real numbers using a neural network, a probabilistic model, or dimensionality reduction on a word co-occurrence matrix.

* Word embeddings also help mitigate the curse of dimensionality, a recurring problem in artificial intelligence. Without word embeddings, the unique identifiers that represent words produce scattered data: isolated points in a vast, sparse space. With word embeddings, the space has far fewer dimensions and carries much richer semantic information. Such numerical features make it easier for a computer to perform mathematical operations such as matrix factorization and dot products, which both shallow and deep learning techniques rely on. Many techniques are available for this transformation. This article covers:
  1. Bag-of-Words: Grammar and word order are discarded; only the frequency of each word is kept.
  
  2. TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) is another way to represent a document based on its words. Instead of raw frequency alone, each word is weighted by its TF-IDF importance. The weight increases in proportion to the number of occurrences of the word in the document and decreases as the word becomes more frequent across the corpus.
* By definition, a TF-IDF weight is the product of two terms: the first is the normalized **Term Frequency** (TF), the number of times a word appears in a document divided by the total number of words in that document; the second is the **Inverse Document Frequency** (IDF), the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.
  
  3. Word2Vec: One of the most efficient techniques for representing words is Word2Vec, a computationally efficient predictive model for learning word embeddings from raw text. It places words in a multi-dimensional vector space where similar words tend to be close to each other. The words surrounding a given word provide its context.
* Word2Vec can rely on either one of two model architectures in order to produce a distributed representation of input words: **Continuous Bag-of-Words (CBoW) or Continuous Skip-Gram** as shown in the figure below. Vector representation extracts semantic relationships based on the co-occurrence of words in the dataset.

* The CBoW and skip-gram models are trained as a binary classification task that discriminates between the real target word and other words in the same context. How accurately the model predicts a word depends on how often it sees that word in the same context throughout the dataset. As training exposes the model to more word-context co-occurrences, the hidden representation is updated, which improves later predictions and yields a better representation of words and contexts in the vector space. Skip-gram is slower than CBoW but tends to represent infrequent words more accurately.
  4. Doc2Vec and Doc2VecC
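
The Bag-of-Words and TF-IDF definitions in the list above can be sketched in a few lines of plain Python. The toy corpus and the function names are assumptions made for illustration. The IDF here is the unsmoothed form described in the text, whereas library implementations often add smoothing.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only): each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def bag_of_words(doc):
    """Word order is discarded; only per-word counts remain."""
    return Counter(doc)

def tf(term, doc):
    """Occurrences of `term` in `doc`, divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(number of documents / number of documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(bag_of_words(docs[0]))
# "cat" is rare in the corpus, "the" is common, so "cat" gets the higher weight
# even though "the" occurs twice in the document.
print(round(tf_idf("cat", docs[0], docs), 4))  # 0.1831
print(round(tf_idf("the", docs[0], docs), 4))  # 0.1352
```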

In [13]:
from gensim.models import word2vec

In [2]:
!pip install gensim

Collecting gensim

  Downloading gensim-3.8.3-cp36-cp36m-win_amd64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-4.2.0.tar.gz (119 kB)
Collecting Cython==0.29.14
  Downloading Cython-0.29.14-cp36-cp36m-win_amd64.whl (1.7 MB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py): started
  Building wheel for smart-open (setup.py): finished with status 'done'
  Created wheel for smart-open: filename=smart_open-4.2.0-py3-none-any.whl size=109637 sha256=8074f14648a1ce95b7cd7427acaef0e314bf4c36d7d212280f05cb3ff42d5c78
  Stored in directory: c:\users\user\appdata\local\pip\cache\wheels\05\12\87\d479d6a8f92130cd8b27e331cc433bb28dda9c20e57f0b1ab2
Successfully built smart-open
Installing collected packages: smart-open, Cython, gensim
Successfully installed Cython-0.29.14 gensim-3.8.3 smart-open-4.2.0


# Training your own Word2Vec model

In [1]:
tokenized_sentences = [['Hello', 'This', 'is', 'python','training','by','Tony'],
                      ['Hello', 'This', 'is', 'Java','training','by','Tony'],
                      ['Hello', 'This', 'is', 'Data Science','training','by','Tony'],
                      ['Hello', 'This', 'is', 'Progamming','training','by','Tony']]

In [2]:
tokenized_sentences


[['Hello', 'This', 'is', 'python', 'training', 'by', 'Tony'],
 ['Hello', 'This', 'is', 'Java', 'training', 'by', 'Tony'],
 ['Hello', 'This', 'is', 'Data Science', 'training', 'by', 'Tony'],
 ['Hello', 'This', 'is', 'Progamming', 'training', 'by', 'Tony']]
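
Word2Vec learns from words paired with their surrounding context. As a rough sketch of what skip-gram sees, the function below enumerates (target, context) pairs for the first sentence. The function name and the fixed window of 2 are assumptions for illustration; real implementations differ in details such as window sampling and the default window size.

```python
def skipgram_pairs(sentence, window=2):
    """Return (target, context) pairs for every word within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

sentence = ['Hello', 'This', 'is', 'python', 'training', 'by', 'Tony']
pairs = skipgram_pairs(sentence, window=2)
print(len(pairs))   # 22
print(pairs[:4])    # [('Hello', 'This'), ('Hello', 'is'), ('This', 'Hello'), ('This', 'is')]
```

CBoW uses the same windows but groups all context words of a position into one training example that predicts the target.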

In [22]:
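# train a Word2Vec model on the tokenized sentences
from gensim.models import Word2Vec
import warnings
# warnings.filterwarnings('ignore')  # uncomment to silence library warnings
# min_count=1 keeps every word, even those that appear only once
mymodel = Word2Vec(tokenized_sentences, min_count=1)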

In [23]:
print(mymodel)

Word2Vec(vocab=10, size=100, alpha=0.025)


In [25]:
# summarize vocabulary
words = list(mymodel.wv.vocab)

In [27]:
words

['Hello',
 'This',
 'is',
 'python',
 'training',
 'by',
 'Tony',
 'Java',
 'Data Science',
 'Progamming']

In [32]:
# access the vector for a single word (through the model's KeyedVectors, `wv`)
print(mymodel.wv['Hello'])

[-4.8681172e-03  1.1729884e-03  2.8752410e-03 -2.6626838e-03
 -3.7820984e-03  5.8760698e-04 -2.5913378e-03  2.3307244e-04
 -2.3304853e-03 -3.4849791e-04  7.2109944e-04  4.6286588e-03
 -9.4307517e-04 -1.7961187e-03  4.6753464e-03 -3.9190808e-03
  3.8409515e-03  8.9049798e-05  4.3157525e-03 -2.5745116e-03
 -3.9713527e-03 -4.8930417e-03 -2.3911954e-03  3.5810263e-03
  1.8859054e-03  2.9160185e-03  1.1615632e-03  4.7634281e-03
  1.2557892e-03 -2.7929787e-03  8.7547407e-04 -3.2141067e-03
 -1.3748390e-04  1.9690676e-03 -1.9810899e-04  1.6533141e-03
  4.9852738e-03 -3.0008946e-03  2.3838012e-03 -1.6623344e-03
  4.8859636e-03 -4.4806344e-03  1.3658563e-03 -3.1863546e-04
  3.8947506e-04 -2.1078769e-04  3.6475965e-04 -4.7123926e-03
  2.9228902e-03 -2.3194427e-04 -2.1781977e-03 -4.4255052e-03
  3.5965063e-03  2.6320957e-03  8.5198320e-04 -1.8269044e-03
  2.8597277e-03  2.0904257e-03 -1.9068263e-03 -1.5578708e-03
  3.9975457e-03 -1.7435518e-03  2.0039666e-03 -3.6953052e-03
  1.5547335e-03  4.76269

  


In [36]:
# find the words most similar to 'Hello'
mymodel.wv.most_similar('Hello')

  


[('Tony', 0.2056553065776825),
 ('python', 0.07027705013751984),
 ('This', 0.02738291770219803),
 ('is', 0.023710981011390686),
 ('Progamming', -0.043045032769441605),
 ('training', -0.08911749720573425),
 ('Data Science', -0.10308177769184113),
 ('by', -0.12382837384939194),
 ('Java', -0.2511283755302429)]
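
The scores in this list are cosine similarities between word vectors. A minimal sketch of that ranking in plain Python follows; the three-dimensional vectors and the helper names are hypothetical and only illustrate the computation. With a corpus as small as the one above, the learned vectors likely stay close to their random initialization, which would explain why the scores are near zero.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, not taken from the trained model.
vectors = {
    "king": [0.9, 0.1, 0.4],
    "queen": [0.8, 0.2, 0.5],
    "apple": [-0.3, 0.9, 0.0],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("king", vectors))
```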

In [39]:
# try other words not in word2vec model

#mymodel.most_similar('Good')

# KeyError: "word 'Good' not in vocabulary"


This is a major limitation of training your own model: words that never appear in the training corpus have no vector.
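
One common way to avoid the `KeyError` is to test membership before looking a word up. The sketch below uses a plain dict as a stand-in for the trained vectors; the `lookup` helper and the values are hypothetical. With gensim, the analogous check would be `word in mymodel.wv`, which is an assumption about the installed version.

```python
def lookup(vectors, word, default=None):
    """Return the vector for `word`, or `default` when the word was never seen in training."""
    if word in vectors:
        return vectors[word]
    return default

# Stand-in for a trained word -> vector table (hypothetical values).
toy_vectors = {"Hello": [0.1, -0.2], "Tony": [0.3, 0.0]}

print(lookup(toy_vectors, "Hello"))  # [0.1, -0.2]
print(lookup(toy_vectors, "Good"))   # None
```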

# Create an embedding model using the Keras Embedding layer

In [51]:
import tensorflow as tf
print(tf.version.VERSION)

2.0.0-beta1


In [64]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [66]:
# word embedding from scratch
sent = ['Hello, how are you', 'how are you',
       'how are you doing', 'I am doing great',
       'I am doing good', 'I am good']

In [67]:
# defining class labels

sent_labels = array([1,1,1,0,0,0])

In [68]:
# integer encoding of the documents
# (one_hot hashes each word to an index below my_vocab_size, so two words can share an index)
my_vocab_size = 30
encoded_sent = [one_hot(i, my_vocab_size) for i in sent]
print(encoded_sent)

[[20, 17, 28, 10], [17, 28, 10], [17, 28, 10, 20], [29, 16, 20, 7], [29, 16, 20, 4], [29, 16, 4]]
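
Despite its name, `one_hot` returns hashed integer indices rather than one-hot vectors, and two different words can land on the same index: in the output above, 'Hello' and 'doing' both map to 20. The sketch below imitates the idea with a stable MD5-based hash. The tokenizer and helper names are simplifications assumed for illustration, so the indices it prints will not match the notebook's.

```python
import hashlib
import string

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace (a simplification)."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def hashed_indices(text, n):
    """Map each word to an integer in [1, n - 1] using a stable MD5-based hash."""
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in tokenize(text)]

sent = ['Hello, how are you', 'how are you',
        'how are you doing', 'I am doing great',
        'I am doing good', 'I am good']
my_vocab_size = 30

encoded = [hashed_indices(s, my_vocab_size) for s in sent]
print(encoded)

# Fewer distinct indices than distinct words would mean at least one collision.
vocab = sorted({w for s in sent for w in tokenize(s)})
index_of = {w: hashed_indices(w, my_vocab_size)[0] for w in vocab}
print(len(vocab), "distinct words mapped to", len(set(index_of.values())), "distinct indices")
```

Choosing a hashing space noticeably larger than the real vocabulary lowers the chance of collisions but does not remove it.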


In [73]:
# padding documents to a max length = 5

length = 5
padded_sent = pad_sequences(encoded_sent, maxlen=length, padding ='pre')
print(padded_sent)

[[ 0 20 17 28 10]
 [ 0  0 17 28 10]
 [ 0 17 28 10 20]
 [ 0 29 16 20  7]
 [ 0 29 16 20  4]
 [ 0  0 29 16  4]]
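
What `padding='pre'` does can be reproduced in plain Python: zeros are added on the left until every sequence has the same length. The `pad_pre` helper is a hypothetical sketch, not the Keras implementation; it also keeps only the last `maxlen` items of a sequence that is too long.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` to length `maxlen`; keep the last `maxlen` items if longer."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

encoded_sent = [[20, 17, 28, 10], [17, 28, 10], [17, 28, 10, 20],
                [29, 16, 20, 7], [29, 16, 20, 4], [29, 16, 4]]

for row in pad_pre(encoded_sent, maxlen=5):
    print(row)
```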


# Defining a NN model

In [78]:
mymodel2 = Sequential()
mymodel2.add(Embedding(my_vocab_size, 8, input_length=length))
mymodel2.add(Flatten())
mymodel2.add(Dense(1, activation='sigmoid'))
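
To make the shapes concrete: the Embedding layer is a lookup table with one 8-dimensional row per index, Flatten concatenates the five looked-up rows into a 40-element vector, and the Dense layer maps that vector to one output. The plain-Python sketch below mirrors only the shapes and parameter counts; the random table values and helper names are assumptions, not the trained weights.

```python
import random

vocab_size, embed_dim, seq_len = 30, 8, 5

random.seed(0)
# One row of `embed_dim` floats per possible integer index (random stand-in values).
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer index with its row from the table."""
    return [embedding_table[idx] for idx in sequence]

def flatten(matrix):
    """Concatenate the rows into a single flat list."""
    return [x for row in matrix for x in row]

padded = [0, 20, 17, 28, 10]          # one padded sentence
embedded = embed(padded)
flat = flatten(embedded)
print(len(embedded), len(embedded[0]), len(flat))  # 5 8 40

embedding_params = vocab_size * embed_dim          # 240 table entries
dense_params = seq_len * embed_dim + 1             # 40 weights + 1 bias
print(embedding_params + dense_params)             # 281
```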

In [80]:
#Compiling the model
mymodel2.compile(optimizer='adam', loss ='binary_crossentropy', metrics =['accuracy'])
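
The `binary_crossentropy` loss penalizes confident wrong predictions heavily and approaches zero as predicted probabilities approach the true labels. A minimal sketch of the averaged formula follows; the function is a hypothetical re-implementation and the predicted probabilities are made up for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 1, 1, 0, 0, 0]
uninformed = [0.5] * 6                       # a model that always answers 0.5
confident = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]   # hypothetical well-fitted predictions

print(round(binary_crossentropy(labels, uninformed), 4))  # 0.6931
print(round(binary_crossentropy(labels, confident), 4))   # 0.1054
```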

In [88]:
# fitting the model
mymodel2.fit(padded_sent, sent_labels, epochs=30)

mymodelloss, mymodelaccuracy = mymodel2.evaluate(padded_sent, sent_labels, verbose=0 )
print('Accuracy: %f' % (mymodelaccuracy*100))

Train on 6 samples
Epoch 1/30
...
Epoch 30/30
Accuracy: 100.000000


# Prediction part


In [89]:
mysent_to_predict = ['how are  you Tony', 'I am super good']

In [91]:
# integer encode the documents
vocab_size = 30
encoded =  [one_hot(d, vocab_size) for d in mysent_to_predict]
print(encoded)

[[17, 28, 10, 17], [29, 16, 1, 4]]


In [92]:
# pad documents to a max length of 5 words
max_length = 5
mypadded = pad_sequences(encoded, maxlen=max_length, padding='pre')
print(mypadded)

[[ 0 17 28 10 17]
 [ 0 29 16  1  4]]


In [95]:
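# threshold the sigmoid output at 0.5 to get class labels
# (predict_classes, used originally, was removed from later TensorFlow releases)
(mymodel2.predict(mypadded) > 0.5).astype('int32')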

array([[1],
       [0]])