Word2vec

References:

  1. https://medium.com/deep-math-machine-learning-ai/chapter-9-1-nlp-word-vectors-d51bff9628c1
  2. https://medium.com/deep-math-machine-learning-ai/chapter-9-2-nlp-code-for-word2vec-neural-network-tensorflow-544db99f5334

1. Corpus Vectorisation (preprocessing)

We can use count vectorisation, but we might get high-count vectors for one document and low-count vectors for others. Instead we will favour TF-IDF (Term Frequency - Inverse Document Frequency).

TF-IDF is a weighting factor used to extract the important features from the documents (corpus).

It tells us how important a word is to a document in a corpus. The importance of a word increases proportionally to the number of times the word appears in the individual document; this is called Term Frequency (TF).

Ex: document 1:

“Mady loves programming. He programs all day, he will be a world class programmer one day”

If we apply tokenization, stemming and stop-word removal (discussed in the previous story) to this document, we get features with high counts such as program (3), day (2), love (1), etc.

TF = (number of times the word appears in the document) / (total number of words in the document)

Here program is the most frequent term in the document, so program is a good feature if we consider TF alone.
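
To make the TF formula concrete, here is a quick sketch of the computation for document 1 (the exact token list depends on the tokenizer, stemmer and stop-word list, so the tokens below are assumed):

```python
from collections import Counter

# Assumed cleaned tokens for document 1 after tokenization, stemming
# and stop-word removal (your preprocessing may differ slightly).
tokens = ["mady", "love", "program", "program", "day",
          "world", "class", "program", "day"]

counts = Counter(tokens)                        # program: 3, day: 2, love: 1, ...
tf = {w: c / len(tokens) for w, c in counts.items()}
print(tf["program"])                            # 3 / 9 ≈ 0.33
```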

However, if many other documents in our corpus also contain the word “program” many times, then it is a frequent word across the whole corpus, it does not carry much discriminating meaning, and it is probably not an important feature.

To adjust for this we use IDF.

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

IDF = log(total number of documents / number of documents containing the term t)

So TF-IDF = TF * IDF.
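
As a sketch of how this weighting is computed in practice, here is a minimal example using scikit-learn's TfidfVectorizer (the toy corpus is made up for illustration; note that scikit-learn uses a smoothed variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three made-up documents.
corpus = [
    "Mady loves programming. He programs all day.",
    "He will be a world class programmer one day.",
    "Mady goes crazy about machine learning.",
]

# TfidfVectorizer combines count vectorisation with TF-IDF weighting.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)        # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # TF-IDF weight of each word in each document
```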

Problems:

  1. TF-IDF and count vectorisation do not maintain the order of the words or the semantic relationships between them.

  2. Instead we need to build a Word2vec model, which converts this high-dimensional vector (10,000-sized) into a low-dimensional vector (let's say 200-sized).

2. Word2vec

Word2vec takes care of 2 things:

  1. It converts this high-dimensional vector (10,000-sized) into a low-dimensional vector (let's say 200-sized)
  2. It maintains the word context (meaning)

The word context / meaning can be captured using 2 simple algorithms:

  1. Continuous Bag-of-Words model (CBOW)

It takes the surrounding (context) words as input and tries to predict the target word.

Ex: Text = “Mady goes crazy about machine learning” and the window size is 3

-> [[“Mady”, “crazy”], “goes”] → “Mady” and “crazy” are the context words and “goes” is the target word

  2. Skip-Gram model

It takes one word as input and tries to predict the surrounding (neighboring) words.

[“goes”, “Mady”], [“goes”, “crazy”] → “goes” is the input word and “Mady” and “crazy” are the surrounding words (output probabilities)
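
As a rough illustration of how these (input, output) training pairs are generated for the skip-gram model, here is a minimal sketch (real implementations add subsampling, negative sampling, etc.; the helper function below is hypothetical):

```python
def skipgram_pairs(tokens, window=1):
    """Generate (input_word, context_word) pairs for the skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look `window` words to the left and right of the center word
        # (window=1 on each side corresponds to the 3-word window above).
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "Mady goes crazy about machine learning".split()
print(skipgram_pairs(tokens, window=1))
# [('Mady', 'goes'), ('goes', 'Mady'), ('goes', 'crazy'), ('crazy', 'goes'), ...]
```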

What is Word2vec in short?

→ It is a neural network trained over all the words in our dictionary to learn the weights (the word vectors).

→ It gives us a word embedding for every word in the dictionary.
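
In practice a library such as Gensim handles this training loop for us; below is a minimal sketch using Gensim's Word2Vec (rather than the TensorFlow implementation from reference 2), with a toy corpus and illustrative parameters:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["mady", "loves", "programming"],
    ["he", "programs", "all", "day"],
    ["mady", "goes", "crazy", "about", "machine", "learning"],
]

# vector_size: dimensionality of the embeddings (the "200-sized" vector above).
# window: number of context words on each side of the target word.
# sg=1 selects the skip-gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=200, window=2, min_count=1, sg=1)

print(model.wv["mady"].shape)         # (200,) -- the embedding for "mady"
print(model.wv.most_similar("mady"))  # nearest words by cosine similarity
```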