#### Word embedding is a mapping of words to vectors of real numbers, learned with a neural network, a probabilistic model, or dimensionality reduction on a word co-occurrence matrix.
### word2vec captures semantic relationships (words with related meanings are placed close together) and syntactic relationships (patterns in how words are ordered and used). With word2vec, one can find similar words, pick out the word that does not fit a group, and more. Another important feature of word2vec is that it converts a high-dimensional representation of text into lower-dimensional vectors.
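
To make the "word co-occurrence matrix" mentioned above concrete, here is a minimal pure-Python sketch that counts how often two words appear within a small window of each other. The function name `cooccurrence_counts`, the toy corpus, and the window size are all assumptions made for illustration. A count-based embedding would then apply a dimensionality reduction (for example a truncated SVD) to the rows of this matrix, a step that is not shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbor) pair occurs within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus, already tokenized and lowercased.
toy_corpus = [
    ["words", "are", "vectors"],
    ["vectors", "are", "numbers"],
]
counts = cooccurrence_counts(toy_corpus, window=1)
vocab = sorted({w for s in toy_corpus for w in s})

# Dense matrix: row i holds the neighbor counts of vocab[i].
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]
for word, row in zip(vocab, matrix):
    print(word, row)
```

Each row is already a (sparse, high-dimensional) vector for its word. Reducing the number of columns is what turns it into a compact embedding.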

## Importing modules

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize 
import warnings 
  
warnings.filterwarnings(action = 'ignore') 
  
import gensim 
from gensim.models import Word2Vec 

## Getting text

In [4]:
sample = """ Word2vec represents words in vector space representation. Words are represented in the form of vectors and placement is done in such a way that similar meaning words appear together and dissimilar words are located far away. This is also termed as a semantic relationship. Neural networks do not understand text instead they understand only numbers. Word Embedding provides a way to convert text to a numeric vector.

Word2vec reconstructs the linguistic context of words. Before going further let us understand, what is linguistic context? In general life scenario when we speak or write to communicate, other people try to figure out what is objective of the sentence. For example, "What is the temperature of India", here the context is the user wants to know "temperature of India" which is context. In short, the main objective of a sentence is context. Word or sentence surrounding spoken or written language (disclosure) helps in determining the meaning of context. Word2vec learns vector representation of words through the contexts."""


In [5]:
# Replace newline characters with spaces
f = sample.replace("\n", " ") 

In [6]:
f

' Word2vec represents words in vector space representation. Words are represented in the form of vectors and placement is done in such a way that similar meaning words appear together and dissimilar words are located far away. This is also termed as a semantic relationship. Neural networks do not understand text instead they understand only numbers. Word Embedding provides a way to convert text to a numeric vector.  Word2vec reconstructs the linguistic context of words. Before going further let us understand, what is linguistic context? In general life scenario when we speak or write to communicate, other people try to figure out what is objective of the sentence. For example, "What is the temperature of India", here the context is the user wants to know "temperature of India" which is context. In short, the main objective of a sentence is context. Word or sentence surrounding spoken or written language (disclosure) helps in determining the meaning of context. Word2vec learns vector re

## Tokenize

In [7]:
data = [] 
  
# iterate through each sentence in the file 
for i in sent_tokenize(f): 
    temp = [] 
      
    # tokenize the sentence into words 
    for j in word_tokenize(i): 
        temp.append(j.lower())   #converting to lowercase
  
    data.append(temp) 

In [9]:
print(data)
# Prints the tokens of every sentence. The text has 12 sentences, so the outer list contains 12 inner lists.

[['word2vec', 'represents', 'words', 'in', 'vector', 'space', 'representation', '.'], ['words', 'are', 'represented', 'in', 'the', 'form', 'of', 'vectors', 'and', 'placement', 'is', 'done', 'in', 'such', 'a', 'way', 'that', 'similar', 'meaning', 'words', 'appear', 'together', 'and', 'dissimilar', 'words', 'are', 'located', 'far', 'away', '.'], ['this', 'is', 'also', 'termed', 'as', 'a', 'semantic', 'relationship', '.'], ['neural', 'networks', 'do', 'not', 'understand', 'text', 'instead', 'they', 'understand', 'only', 'numbers', '.'], ['word', 'embedding', 'provides', 'a', 'way', 'to', 'convert', 'text', 'to', 'a', 'numeric', 'vector', '.'], ['word2vec', 'reconstructs', 'the', 'linguistic', 'context', 'of', 'words', '.'], ['before', 'going', 'further', 'let', 'us', 'understand', ',', 'what', 'is', 'linguistic', 'context', '?'], ['in', 'general', 'life', 'scenario', 'when', 'we', 'speak', 'or', 'write', 'to', 'communicate', ',', 'other', 'people', 'try', 'to', 'figure', 'out', 'what', 'i

### Architectures of word2vec:
### 1) CBOW     2) Skip-gram

#### Skip-gram and CBOW both turn unlabeled text into a supervised prediction task for model training.
#### In CBOW, the current word is predicted from a window of surrounding context words.
#### Skip-gram does the opposite of CBOW: it predicts the surrounding context words from the current word.
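
The difference between the two architectures is easiest to see in the training examples they produce. The sketch below is a simplified illustration, not gensim's implementation: the helper names `cbow_pairs` and `skipgram_pairs` are hypothetical, and refinements such as subsampling and randomly shrunken windows are left out.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

def skipgram_pairs(tokens, window=2):
    """Yield (input_word, context_word): skip-gram predicts each context word from the input."""
    for context, target in cbow_pairs(tokens, window):
        for context_word in context:
            yield target, context_word

sentence = ["word2vec", "learns", "vector", "representations"]
cbow = list(cbow_pairs(sentence, window=1))
skip = list(skipgram_pairs(sentence, window=1))

print(cbow[1])   # context on both sides of 'learns', with 'learns' as the label
print(skip[:3])  # one (input, context) pair per neighboring word
```

CBOW yields one example per position with the whole context as input, while skip-gram yields one example per (word, neighbor) pair, so the same sentence produces more skip-gram examples.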

In [8]:
# Create CBOW model (Continuous Bag of Words); CBOW is gensim's default (sg = 0)
model1 = gensim.models.Word2Vec(data, min_count = 1,
                              size = 100, window = 5)
# data: the tokenized sentences (a list of token lists)
# min_count: words with a total frequency lower than this are ignored
# size: dimensionality of the word vectors (renamed to vector_size in gensim 4.0 and later)
# window: maximum distance between the current and the predicted word within a sentence
# Changing window changes the training contexts, so the cosine similarities change as well
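
To show what the `min_count` threshold does, here is a small pure-Python sketch of frequency-based vocabulary filtering. The helper `build_vocab` and the toy sentences are assumptions for illustration; gensim builds its vocabulary internally and this is not its code.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Keep only the words whose total frequency is at least `min_count`."""
    freq = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in freq.items() if n >= min_count}

# Hypothetical toy corpus, already tokenized and lowercased.
toy = [["words", "are", "vectors"], ["vectors", "are", "numbers"]]

vocab_all = build_vocab(toy, min_count=1)       # every word is kept
vocab_frequent = build_vocab(toy, min_count=2)  # words seen only once are dropped

print(sorted(vocab_all))
print(sorted(vocab_frequent))
```

With a corpus as small as the sample text in this notebook, `min_count = 1` is needed so that words occurring only once still receive a vector.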

### Cosine similarity between two words

In [12]:
# Print results (word vectors live on model1.wv; calling similarity on the model itself is deprecated)
print("Cosine similarity between 'word2vec' " +
               "and 'vector' - CBOW : ",
    model1.wv.similarity('word2vec', 'vector'))

Cosine similarity between 'word2vec' and 'vector' - CBOW :  0.22827439
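
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The following is a minimal sketch with made-up two-dimensional vectors; the function name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # right angle
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])       # pointing apart

print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions. A score such as the 0.228 above indicates only weak similarity, which is unsurprising for a model trained on two short paragraphs.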


In [14]:
print(model1.wv['word'])

[-2.2737612e-03 -3.3196187e-04  1.8430880e-03 -1.1446645e-03
 -1.7483812e-03  2.1176853e-03  3.1230367e-05 -1.8511017e-04
 -1.5797671e-03  4.4432711e-03  6.4991799e-04  3.7352688e-04
 -2.6550293e-03 -1.9726602e-03 -3.8098360e-03 -5.5754028e-04
  2.5970479e-03  4.4397218e-04  1.8673867e-03  1.4980830e-03
 -5.5223657e-04  2.1636818e-04  2.1586348e-03  3.5639561e-03
 -1.4623078e-03 -4.7500846e-03  3.8055563e-03  2.0230834e-03
  1.5285627e-03 -4.1268342e-03 -2.2883867e-03  1.8498213e-03
 -6.5290520e-04 -2.1653136e-03 -1.6062878e-03  2.0962527e-03
 -3.2103178e-03  3.8158756e-03  4.3343212e-03 -8.8505720e-04
  4.3593780e-03 -2.8637296e-03 -2.2938501e-04  5.5735005e-04
 -4.6465552e-04  3.2901354e-03  4.5688311e-03  4.0289501e-04
 -3.8200894e-03 -3.6102720e-03  4.2836759e-03 -2.3915959e-03
  8.1738521e-04  1.9635456e-03  3.6609645e-03 -3.4428877e-03
 -4.9864231e-03  3.0929032e-03  4.5460057e-03 -2.8903047e-03
  1.7890721e-03 -3.7593528e-04 -1.6116655e-04 -1.5436831e-03
 -1.1730894e-03 -3.40949

### Most similar words

In [15]:
similar_words = model1.wv.most_similar('word')
print(similar_words)

[('written', 0.2410072684288025), ('this', 0.24062633514404297), ('represented', 0.2244727909564972), ('instead', 0.17960643768310547), ('such', 0.16462530195713043), (',', 0.16017262637615204), ('convert', 0.15883475542068481), ('try', 0.15453214943408966), ('vector', 0.15447998046875), ('we', 0.1442028284072876)]


### Dissimilar words

In [16]:
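# doesnt_match returns the single word that fits least with the others.
# 'Words' (capitalized) is not in the lowercased vocabulary, so gensim ignores it.
odd_word = model1.wv.doesnt_match('Words are represented in the form of vectors'.split())
print(odd_word)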

are
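
One common way to pick the word that does not fit is to average the normalized vectors of all candidates and return the word least similar to that average. The sketch below illustrates this general idea in pure Python. The toy vectors and the helper names `unit` and `odd_one_out` are assumptions, and gensim's own implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the key whose unit vector is least similar to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    score = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(score, key=score.get)

# Hypothetical 2-d vectors: three point roughly the same way, one does not.
toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "kitten": [1.0, 0.0],
    "car": [0.0, 1.0],
}
result = odd_one_out(toy_vectors)
print(result)
```

With vectors trained on only twelve sentences, as in this notebook, the word returned by the model is close to arbitrary. The method becomes meaningful with a larger corpus.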
