<h3>Word Vectors Overview Using the Gensim Library</h3>

The pretrained models and datasets available through gensim's downloader are listed on this page: https://github.com/RaRe-Technologies/gensim-data

Word2Vec is a technique that creates vector representations of words, capturing their meaning based on the surrounding words in a large text corpus. These vectors, called word embeddings, allow for measuring semantic and syntactic similarity between words. 

Advantages:

* Semantic Relationships: Word2Vec captures semantic relationships between words, placing related words such as "happy" and "joy" close together in the vector space.
* Efficiency: The shallow network trains quickly on large corpora, and the resulting dense vectors are far more compact than one-hot or count-based representations.
* Applications: Used in various NLP tasks, including text classification, sentiment analysis, and machine translation.

Types of Word2Vec:

* Continuous Bag of Words (CBOW):

Predicts a target (center) word from its surrounding context words: the context words are the input and the center word is the output.

The order of the context words is ignored, hence the "bag of words" in the name.

![CBOW.png](attachment:CBOW.png)
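
To make the CBOW setup concrete, here is a minimal sketch of how (context, target) training examples could be built from a tokenized sentence. The function name `cbow_pairs` and the `window` parameter are assumptions made for illustration; this is not gensim's internal code.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a tokenized sentence.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each pair feeds the whole context to the model as one input. Because the context vectors are combined into a single vector, the position of each context word is lost, which is the "bag" in the name.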

* Skip-gram:

Predicts the surrounding context words from a target word: the target word is the input, and the model predicts the words within a fixed range (the window) before and after it.

The TensorFlow word2vec tutorial describes the context as a few words before and after the current (middle) word.

![skip-gram.png](attachment:skip-gram.png)
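
Skip-gram reverses the direction: each target word is paired with every word in its window, one pair at a time. The sketch below generates those (target, context) pairs; `skipgram_pairs` and `window` are hypothetical names chosen for this example, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for a tokenized sentence.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

Compared with CBOW, the same sentence yields more training examples, each one predicting a single context word.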

In [1]:
import gensim.downloader as api
# This is a large model (~1.6 GB): the first call downloads it, and loading takes some time

wv = api.load('word2vec-google-news-300')



In [2]:
wv.similarity(w1="great", w2="good")

0.729151
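
The value returned by `similarity` is the cosine similarity of the two word vectors. As a rough illustration of what that means, the sketch below computes cosine similarity in plain Python on small made-up vectors (the real vectors have 300 dimensions); `cosine_similarity` is a name chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-D vectors, not taken from the model.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and vector length does not affect the result.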

In [3]:
wv.most_similar("good")

[('great', 0.7291508913040161),
 ('bad', 0.7190051078796387),
 ('terrific', 0.6889115571975708),
 ('decent', 0.6837348937988281),
 ('nice', 0.6836092472076416),
 ('excellent', 0.644292950630188),
 ('fantastic', 0.6407778263092041),
 ('better', 0.6120728850364685),
 ('solid', 0.5806035399436951),
 ('lousy', 0.576420247554779)]

In [81]:
wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581)]
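
The analogy query can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word by cosine similarity. The sketch below shows that idea on invented 2-D vectors. The vectors, the `analogy` helper, and the skipped normalization step are all simplifying assumptions for illustration; this is not gensim's implementation.

```python
import math

# Toy vectors invented for this example: [royalty, gender].
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy(["king", "woman"], ["man"], vectors))
```

With these toy vectors, king - man + woman lands exactly on the vector assigned to "queen". In the real model the match is only approximate, which is why the top score above is about 0.71 rather than 1.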

In [117]:
wv.doesnt_match(["facebook", "cat", "google", "microsoft"])

'cat'

In [118]:
wv.doesnt_match(["dog", "cat", "google", "mouse"])

'google'
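
One way to picture how an odd word out can be found: normalize each word vector to unit length, average them, and pick the word whose vector is least similar to that average. The sketch below follows this idea with invented 2-D vectors. The helper name `odd_one_out` and the vectors are assumptions for illustration, and the code is a simplified sketch rather than gensim's own.

```python
import math

# Toy vectors invented for this example: [animal-ness, tech-ness].
vectors = {
    "dog": [1.0, 0.1],
    "cat": [0.9, 0.2],
    "mouse": [0.8, 0.3],
    "google": [0.1, 1.0],
}


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[d] for u in units.values()) / len(units) for d in range(dim)]
    # Dot product with the mean measures closeness to the group's center.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


print(odd_one_out(["dog", "cat", "google", "mouse"], vectors))
```

Three of the four toy vectors point in roughly the same direction, so the mean sits near them and "google" ends up furthest from it.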