# Average Word2Vec: In-Depth Intuition
## Introduction to Average Word2Vec

This section introduces the concept of average Word2Vec. The topic matters because Word2Vec on its own produces one vector per word, which is not directly usable by a classifier that expects one fixed-length feature vector per document. Average Word2Vec closes that gap.

Let's consider a simple example to understand the theoretical intuition behind average Word2Vec. We have text data consisting of documents such as:

"The food is good"

Each such sentence is one document, and each document comes with an output label (for example, three documents might have the labels 1, 0, and 1).

Using Word2Vec, each word in these documents is converted into a vector. For instance, if we use a Google pre-trained Word2Vec model, each word is represented as a 300-dimensional vector. For example, the word "the" is converted into a 300-dimensional vector, and similarly, "food", "is", and "good" each get their own 300-dimensional vectors.

This means that for the sentence "The food is good", we obtain a separate 300-dimensional vector for each word:

"the" → 300-dimensional vector
"food" → 300-dimensional vector
"is" → 300-dimensional vector
"good" → 300-dimensional vector

This process applies to every word in the sentence.

While this word-level vectorization is useful, it presents a challenge: for an entire sentence or document, we want a single vector representation rather than one vector per word.

For example, for the sentence "The food is good", we want one 300-dimensional vector representing the whole sentence, not four separate vectors.

To address this, we take all the word vectors in the sentence and compute their average. This average vector then represents the entire sentence or document. Thus, instead of multiple vectors, we have one 300-dimensional vector that summarizes the semantic content of the sentence.
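
Written as a formula, the averaging step looks like the following. The symbols are notation introduced here for illustration: $n$ is the number of words in the sentence and $\mathbf{v}_{w_i}$ is the Word2Vec vector of the $i$-th word.

$$
\mathbf{v}_{\text{sentence}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i}
$$

For "The food is good", $n = 4$, so:

$$
\mathbf{v}_{\text{sentence}} = \frac{\mathbf{v}_{\text{the}} + \mathbf{v}_{\text{food}} + \mathbf{v}_{\text{is}} + \mathbf{v}_{\text{good}}}{4}
$$

The average is taken element by element, so the result has the same 300 dimensions as each word vector.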

This approach is called average Word2Vec. Averaging keeps much of the semantic information carried by the individual words while collapsing a variable number of word vectors into one fixed-length vector. Note that the dimensionality itself stays at 300, and that word order is lost in the average. The resulting vector can then be used as input to machine learning models for tasks such as text classification.
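
As a minimal sketch of the averaging step, the snippet below uses made-up 4-dimensional vectors in place of real 300-dimensional Word2Vec vectors. The `toy_vectors` dictionary and its numbers are assumptions for illustration only; they do not come from any trained model.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for real 300-dimensional Word2Vec
# vectors. The numbers are made up purely for illustration.
toy_vectors = {
    "the":  np.array([0.1, 0.0, 0.2, 0.1]),
    "food": np.array([0.7, 0.3, 0.0, 0.5]),
    "is":   np.array([0.0, 0.1, 0.1, 0.0]),
    "good": np.array([0.4, 0.8, 0.1, 0.2]),
}

sentence = "The food is good"
tokens = sentence.lower().split()

# Stack the word vectors into a (num_words, dim) matrix,
# then average over the word axis.
word_matrix = np.stack([toy_vectors[t] for t in tokens])
sentence_vector = word_matrix.mean(axis=0)

print(word_matrix.shape)      # one row per word
print(sentence_vector.shape)  # a single vector for the whole sentence
print(sentence_vector)
```

With a real pre-trained model, the same two lines (stack, then `mean(axis=0)`) would turn four 300-dimensional vectors into one 300-dimensional vector.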

### Summary

In summary, average Word2Vec involves:

Converting each word in a sentence to a 300-dimensional vector using a pre-trained model.
Averaging all these word vectors to obtain a single 300-dimensional vector representing the entire sentence.
Using this averaged vector as the feature input for classification or other NLP tasks.

In upcoming tutorials, we will explore practical implementations of average Word2Vec using libraries such as Gensim. We will first use the pre-trained Google Word2Vec model and then train Word2Vec models from scratch on our own datasets, in both cases with Gensim.

To conclude, average Word2Vec is essential because it provides a fixed-length vector representation for entire sentences or documents by averaging word vectors, thereby maintaining semantic information and enabling effective text classification.
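
To illustrate the fixed-length property, here is a small sketch that turns documents of different lengths into rows of one feature matrix. The helper name `document_vector`, the toy vectors, and the extra example documents are hypothetical. Skipping unknown words and returning a zero vector when no word is known are just one possible convention, assumed here for simplicity.

```python
import numpy as np

def document_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that are present in `vectors`.

    Tokens missing from the vocabulary are skipped. If no token is found,
    a zero vector is returned (an assumed convention, not the only option).
    """
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Made-up 3-dimensional vectors standing in for real embeddings.
toy_vectors = {
    "the":   np.array([0.0, 0.0, 0.3]),
    "food":  np.array([0.6, 0.0, 0.3]),
    "is":    np.array([0.0, 0.0, 0.3]),
    "good":  np.array([0.2, 0.8, 0.3]),
    "great": np.array([0.3, 0.9, 0.0]),
}

# Hypothetical documents of different lengths, one with no known words.
documents = ["The food is good", "great food", "unknownword"]

# One row per document, regardless of how many words each document has.
X = np.stack([
    document_vector(doc.lower().split(), toy_vectors, dim=3)
    for doc in documents
])
print(X.shape)
print(X)
```

A matrix like `X`, paired with the output labels, is the kind of input a text classifier would be trained on. Since `document_vector` only relies on `in` and `[]` lookups, a Gensim `KeyedVectors` object could plausibly be passed in place of the dictionary, with `dim=300` for the Google News model.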

### Key Takeaways

Average Word2Vec converts each word in a sentence into a 300-dimensional vector using pre-trained models.

To represent an entire sentence or document, the vectors of all words are averaged to produce a single 300-dimensional vector.

This averaged vector serves as the input feature for text classification models.

Libraries like Gensim facilitate working with pre-trained Word2Vec models and training new ones from scratch.

In [1]:
!pip install gensim



In [2]:

!pip install gensim numpy scipy



In [3]:
import gensim

In [4]:
from gensim.models import Word2Vec, KeyedVectors

In [13]:
import gensim.downloader as api

info = api.info()

available_models = list(info['models'].keys())
print("avail models:")
for model_name in available_models:
    print(model_name)

avail models:
fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis


In [5]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']

In [7]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [8]:
vec_king.shape

(300,)

In [9]:
wv['cricket']

array([-3.67187500e-01, -1.21582031e-01,  2.85156250e-01,  8.15429688e-02,
        3.19824219e-02, -3.19824219e-02,  1.34765625e-01, -2.73437500e-01,
        9.46044922e-03, -1.07421875e-01,  2.48046875e-01, -6.05468750e-01,
        5.02929688e-02,  2.98828125e-01,  9.57031250e-02,  1.39648438e-01,
       -5.41992188e-02,  2.91015625e-01,  2.85156250e-01,  1.51367188e-01,
       -2.89062500e-01, -3.46679688e-02,  1.81884766e-02, -3.92578125e-01,
        2.46093750e-01,  2.51953125e-01, -9.86328125e-02,  3.22265625e-01,
        4.49218750e-01, -1.36718750e-01, -2.34375000e-01,  4.12597656e-02,
       -2.15820312e-01,  1.69921875e-01,  2.56347656e-02,  1.50146484e-02,
       -3.75976562e-02,  6.95800781e-03,  4.00390625e-01,  2.09960938e-01,
        1.17675781e-01, -4.19921875e-02,  2.34375000e-01,  2.03125000e-01,
       -1.86523438e-01, -2.46093750e-01,  3.12500000e-01, -2.59765625e-01,
       -1.06933594e-01,  1.04003906e-01, -1.79687500e-01,  5.71289062e-02,
       -7.41577148e-03, -

In [10]:
wv.most_similar('cricket')

[('cricketing', 0.8372225761413574),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068488240242004),
 ('Twenty##', 0.7624265551567078),
 ('Cricket', 0.75413978099823),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316356897354126),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.6987985968589783)]

In [11]:
wv.most_similar('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049172401428),
 ('satisfied', 0.6437949538230896),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665286064148)]

In [14]:
wv.similarity("hockey","sports")

0.53541523
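
The score above is a cosine similarity between the two word vectors. As a rough sketch of what that computation looks like, the snippet below implements cosine similarity with NumPy on small made-up vectors. The function name `cosine_similarity` is defined here for illustration and is not part of Gensim.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Small made-up vectors chosen so the angles are easy to reason about.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

same_direction = cosine_similarity(a, a)      # angle of 0 degrees
forty_five_deg = cosine_similarity(a, b)      # angle of 45 degrees
opposite_direction = cosine_similarity(a, -a) # angle of 180 degrees

print(same_direction, forty_five_deg, opposite_direction)
```

Applied to the loaded model, `cosine_similarity(wv["hockey"], wv["sports"])` should closely match the value returned by `wv.similarity("hockey", "sports")`.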

In [15]:
vec = wv['king'] - wv['man'] + wv['woman']

In [16]:
vec

array([ 4.29687500e-02, -1.78222656e-01, -1.29089355e-01,  1.15234375e-01,
        2.68554688e-03, -1.02294922e-01,  1.95800781e-01, -1.79504395e-01,
        1.95312500e-02,  4.09919739e-01, -3.68164062e-01, -3.96484375e-01,
       -1.56738281e-01,  1.46484375e-03, -9.30175781e-02, -1.16455078e-01,
       -5.51757812e-02, -1.07574463e-01,  7.91015625e-02,  1.98974609e-01,
        2.38525391e-01,  6.34002686e-02, -2.17285156e-02,  0.00000000e+00,
        4.72412109e-02, -2.17773438e-01, -3.44726562e-01,  6.37207031e-02,
        3.16406250e-01, -1.97631836e-01,  8.59375000e-02, -8.11767578e-02,
       -3.71093750e-02,  3.15551758e-01, -3.41796875e-01, -4.68750000e-02,
        9.76562500e-02,  8.39843750e-02, -9.71679688e-02,  5.17578125e-02,
       -5.00488281e-02, -2.20947266e-01,  2.29492188e-01,  1.26403809e-01,
        2.49023438e-01,  2.09960938e-02, -1.09863281e-01,  5.81054688e-02,
       -3.35693359e-02,  1.29577637e-01,  2.41699219e-02,  3.48129272e-02,
       -2.60009766e-01,  

In [17]:
wv.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
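
In the output above, 'king' ranks first and 'queen' second. That is expected when a raw vector is passed in: the result vector is still close to 'king', and Gensim has no input words to leave out of the ranking. The sketch below shows the idea of ranking a vocabulary by cosine similarity, with and without excluding the input words. The function `rank_by_cosine` and the 3-dimensional toy vectors are made up for illustration; in this toy data 'queen' happens to rank first either way.

```python
import numpy as np

def rank_by_cosine(query, vectors, exclude=()):
    """Return (word, similarity) pairs sorted from most to least similar."""
    scores = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((word, float(sim)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional vectors, loosely meaning [royalty, male, female].
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Rank every word, then rank again with the three input words left out.
all_ranked = rank_by_cosine(query, toy_vectors)
filtered = rank_by_cosine(query, toy_vectors, exclude={"king", "man", "woman"})

print(all_ranked)
print(filtered)
```

Gensim's `most_similar` also accepts words directly through its `positive` and `negative` arguments, for example `wv.most_similar(positive=['king', 'woman'], negative=['man'])`. In that form the input words are left out of the results.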

This notebook showed how to use Gensim to load Google's pre-trained Word2Vec model, retrieve word vectors, find similar words, compute similarity scores, and perform vector arithmetic for analogies.

This pre-trained model can be used to solve many natural language processing problems without training from scratch.

Next, we will explore training Word2Vec models from scratch using Gensim, which involves a different process and dataset preparation. The current demonstration was run in Google Colab because the pre-trained model is large.

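### Key Takeaways

Demonstrated how to load and use Google's pre-trained Word2Vec model with Gensim.
Explained how to retrieve word vectors and their dimensions.
Showed how to find the most similar words using cosine similarity.
Illustrated vector arithmetic for analogies such as "king - man + woman ≈ queen". In the output above, 'king' itself ranks first and 'queen' second, because the query was passed as a raw vector and the input words were not excluded.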

In [None]:
## References: https://stackoverflow.com/questions/46433778/import-googlenews-vectors-negative300-bin
