### Word Embedding Techniques
Word embeddings are numerical representations of text.
A word embedding scheme generally maps each word in a dictionary to a vector. Let us break this down using the following example sentence.

"Word Embeddings are Words converted into numbers"

A dictionary is the list of all unique words in the sentence. Treating "Word" and "Words" as the same token, the dictionary may look like: ['Word','Embeddings','are','Converted','into','numbers']

A vector representation of a word may be a one-hot encoded vector, with 1 at the position of the word in the dictionary and 0 everywhere else. According to the dictionary above, "numbers" is represented as [0,0,0,0,0,1] and "converted" as [0,0,0,1,0,0].
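The one-hot scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the six-word dictionary implied by the vectors above and case-insensitive matching; the helper name `one_hot` is hypothetical.

```python
# Dictionary for the example sentence; the order defines the vector positions
dictionary = ["Word", "Embeddings", "are", "Converted", "into", "numbers"]

def one_hot(word, dictionary):
    """Return a list with 1 at the word's position and 0 elsewhere (case-insensitive)."""
    lowered = [w.lower() for w in dictionary]
    index = lowered.index(word.lower())  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(dictionary))]

print(one_hot("numbers", dictionary))    # [0, 0, 0, 0, 0, 1]
print(one_hot("converted", dictionary))  # [0, 0, 0, 1, 0, 0]
```

Each vector is as long as the dictionary, so one-hot vectors grow with the vocabulary size.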

Types of Word Embeddings

1. Frequency based Embedding
2. Prediction based Embedding

Frequency based Embedding

Count Vector

TF-IDF Vector

#### Count Vectorizer

Consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens extracted from C. The N tokens form the dictionary, and the count vector matrix M has size D × N. Row i of M contains the frequency of each token in document di.

Let us understand this using a simple example.

D1: He is lazy boy. She is also lazy.

D2: Neeta is lazy person.

The dictionary is the list of unique tokens (words) in the corpus:
['He','is','She','lazy','boy','also','Neeta','person']

Here, D = 2 and N = 8.
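The count matrix M of size 2 × 8 is:

| | He | is | She | lazy | boy | also | Neeta | person |
|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 2 | 1 | 2 | 1 | 1 | 0 | 0 |
| D2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |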
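The matrix above can be reproduced by hand with the standard library. The following is a rough sketch, assuming a deliberately simple letters-only tokenizer and the dictionary order used in the text; the `tokenize` helper is hypothetical and not part of any library.

```python
import re
from collections import Counter

documents = ["He is lazy boy. She is also lazy.", "Neeta is lazy person."]
dictionary = ["He", "is", "She", "lazy", "boy", "also", "Neeta", "person"]

def tokenize(text):
    """Split text into runs of letters; punctuation is dropped."""
    return re.findall(r"[A-Za-z]+", text)

# One row per document, one column per dictionary token
count_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[token] for token in dictionary])

for row in count_matrix:
    print(row)
# [1, 2, 1, 2, 1, 1, 0, 0]
# [0, 1, 0, 1, 0, 0, 1, 1]
```

The column order here follows the hand-written dictionary, so it can differ from the order a library chooses for the same corpus.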


In [1]:
# Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

document = ["He is lazy boy. She is also lazy.",
            "Neeta is lazy person."]

# Create a vectorizer object
vectorizer = CountVectorizer()

# Learn the vocabulary from the documents
vectorizer.fit(document)

# Print the unique tokens; CountVectorizer lowercases them and
# orders the matrix columns alphabetically
print("Vocabulary: ", sorted(vectorizer.vocabulary_))

# Encode the Document
vector = vectorizer.transform(document)

# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vector.toarray())


Vocabulary:  ['also', 'boy', 'he', 'is', 'lazy', 'neeta', 'person', 'she']
Encoded Document is:
[[1 1 1 2 2 0 0 1]
 [0 0 0 1 1 1 1 0]]


#### TF-IDF Vectorizer
TF-IDF differs from count vectorization in that it takes into account not just how often a word occurs in a single document but also how it is distributed across the entire corpus.
TF-IDF penalises words that are common across documents (such as 'the', 'a', 'is') by assigning them lower weights, while giving more weight to words that are significant in a particular document.



In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
d1="He is lazy boy. She is also lazy."

d2="Neeta is lazy person."

doc_corpus=[d1,d2]

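# Create the vectorizer; stop_words='english' drops common English words
# (here 'he', 'is', 'she', 'also'), so only four terms remain
tfidf = TfidfVectorizer(stop_words='english')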


# Get TF-IDF values
tfidf_matrix = tfidf.fit_transform(doc_corpus)


# get idf values
print('\nidf values:')
dic = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
print(dic)


# Create a dataframe for the TF-IDF values
df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df.index = ["Document 1", "Document 2"]

# Print the TF-IDF table
print("TF-IDF values:")
print(df)


idf values:
{'boy': 1.4054651081081644, 'lazy': 1.0, 'neeta': 1.4054651081081644, 'person': 1.4054651081081644}
TF-IDF values:
                 boy      lazy     neeta    person
Document 1  0.574962  0.818180  0.000000  0.000000
Document 2  0.000000  0.449436  0.631667  0.631667
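To see where these numbers come from, the calculation can be redone by hand. This sketch assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit length. The term counts are typed in manually from the corpus after stop-word removal.

```python
import math

# Term counts per document after stop-word removal (entered by hand)
term_counts = [
    {"boy": 1, "lazy": 2},
    {"neeta": 1, "lazy": 1, "person": 1},
]
vocabulary = ["boy", "lazy", "neeta", "person"]
n_docs = len(term_counts)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(1 for counts in term_counts if term in counts)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF = raw count * idf, then divide each row by its L2 norm
tfidf_rows = []
for counts in term_counts:
    raw = [counts.get(term, 0) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    tfidf_rows.append([v / norm for v in raw])

print(idf)
for row in tfidf_rows:
    print([round(v, 6) for v in row])
```

"lazy" appears in both documents, so its idf is the minimum value of 1.0, while terms that appear in only one document get about 1.405.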


Prediction based Embedding

CBOW

Skip-gram

In [3]:
!pip install nltk
!pip install gensim



In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### CBOW
The CBOW (Continuous Bag of Words) model predicts the current word from the context words within a specified window. The input layer holds the context words and the output layer holds the current word. The size of the hidden layer is the number of dimensions in which we want to represent each word.
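To make the input/output relationship concrete, the sketch below builds the (context, target) training pairs a CBOW-style model would see. It only illustrates the windowing, not the network or its training; the function name `cbow_pairs` and the toy sentence are assumptions made for this example.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

tokens = "he is lazy boy".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
# ['is'] -> he
# ['he', 'lazy'] -> is
# ['is', 'boy'] -> lazy
# ['lazy'] -> boy
```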




#### Skip-gram

Skip-gram uses the same topology as CBOW but flips the architecture on its head.
The aim of skip-gram is to predict the context words given the current word.
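The reversed direction can be illustrated the same way: each training pair has the current word as input and one context word as the prediction target. This is a minimal sketch of the pair generation only; the function name `skipgram_pairs` and the toy sentence are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center_word, context_word) pairs for a skip-gram-style model."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "he is lazy boy".split()
print(skipgram_pairs(tokens, window=1))
# [('he', 'is'), ('is', 'he'), ('is', 'lazy'), ('lazy', 'is'), ('lazy', 'boy'), ('boy', 'lazy')]
```

Compared with CBOW, every context word becomes its own training example, so the same sentence yields more pairs.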