# Word Embedding

## Word Embeddings

Word embeddings are a core concept in Natural Language Processing (NLP): they represent words as numerical vectors in a continuous vector space. The vectors are learned from the contexts in which words appear in a corpus, so words that are used in similar ways end up with similar vectors. Traditional one-hot encoding, by contrast, gives every word its own dimension, which makes all words equally distant and unrelated. Embeddings instead reflect the relationships and similarities between words.
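
To make the contrast with one-hot encoding concrete, here is a minimal sketch that compares cosine similarity under both representations. The three-word vocabulary and the dense vectors are hypothetical values made up for illustration; they do not come from a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot encoding: one dimension per word, with a single 1 at the word's index.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors (illustrative values only, not learned from data).
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# Distinct one-hot vectors are orthogonal, so their similarity is always zero.
one_hot_sim = cosine(one_hot["cat"], one_hot["dog"])

# Dense vectors can place "cat" nearer to "dog" than to "car".
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])

print(one_hot_sim, dense_cat_dog, dense_cat_car)
```

With one-hot vectors every pair of distinct words scores zero, whereas the dense vectors can express that "cat" is more like "dog" than like "car".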

Word embeddings have changed how machines process language by turning semantic similarity into something that can be computed. One of the most notable techniques for generating them is Word2Vec, introduced by Tomas Mikolov and his team at Google in 2013. Word2Vec trains a neural network on a large text corpus using one of two models, Skip-Gram or Continuous Bag of Words (CBOW). Both models learn by predicting words from the words around them, and the dense vectors that result from this training encode semantic relationships.

Word embeddings are used in many NLP tasks, including sentiment analysis, machine translation, and information retrieval. They also support vector arithmetic on words, which reveals regularities such as the famous example where "king" minus "man" plus "woman" results in a vector close to "queen." This shows that embeddings capture not only which words are similar but also consistent relationships between words.
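
The analogy arithmetic can be sketched in a few lines of plain Python. The two-dimensional vectors below are hypothetical, hand-picked so that one axis loosely tracks "royalty" and the other "gender"; real embeddings have hundreds of dimensions with no such clean interpretation. The helper name `analogy` is also an assumption made for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors (axis 0 ~ royalty, axis 1 ~ gender); not from a trained model.
vectors = {
    "king":  [0.95, 0.90],
    "queen": [0.90, -0.85],
    "man":   [0.10, 0.95],
    "woman": [0.15, -0.90],
    "apple": [-0.80, 0.05],
}

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates, since the
    nearest neighbour of the result is often one of the inputs.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

The result is the nearest neighbour of the computed vector rather than an exact match, which is why the relationship is described as "close to" rather than "equal to".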

## Word2Vec

Word2Vec maps words to continuous vector representations that capture their meanings and relationships. Developed by researchers at Google in 2013, it trains a shallow neural network on a large text corpus, and the weights learned during training serve as the word vectors.

The primary goal of Word2Vec is to represent each word as a point in a high-dimensional space, where the distance between points reflects the semantic relationship between the words. Words that frequently appear in similar contexts are positioned closer together, which makes it possible to find related words and to solve analogies (e.g., "king" - "man" + "woman" ≈ "queen").

Word2Vec operates through two main architectures: the Continuous Bag-of-Words (CBOW) model, which predicts a target word from its surrounding context, and the Skip-Gram model, which does the opposite by predicting context words given a target word. CBOW is generally faster to train, while Skip-Gram tends to represent rare words better.

Word2Vec has improved many applications in natural language processing, including search engines, sentiment analysis, and recommendation systems, by giving them a compact numerical representation of word meaning to work with.
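
The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below is a simplified illustration with a fixed context window; the function names are hypothetical, and practical implementations add refinements (such as subsampling of frequent words and negative sampling) that are omitted here.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words, target word) pairs: the target is predicted from its context."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(target word, context word) pairs: each context word is predicted from the target."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]

tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position, with the whole context as input, while Skip-Gram produces one example per (target, context word) pair, so the same sentence yields more Skip-Gram examples.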

In [3]:
%pip install gensim



Gensim is an open-source Python library specifically designed for natural language processing (NLP) tasks, particularly focusing on unsupervised topic modeling and document similarity analysis. Developed by Radim Řehůřek and first released in 2009, Gensim has become a vital tool for researchers and practitioners in the field of text mining and semantic analysis.

Key Features:

1: Unsupervised Learning: Gensim employs unsupervised machine learning algorithms to analyze large volumes of text data without the need for labeled training sets. This allows it to automatically identify patterns and themes within the data.

2: Scalability: One of Gensim's standout features is its ability to handle large text corpora efficiently. It utilizes data streaming techniques, meaning that it can process datasets that exceed the available RAM, making it suitable for web-scale applications.

3: Versatile Algorithms: The library includes implementations of popular algorithms such as Word2Vec, Doc2Vec, Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA). These algorithms are essential for tasks like document clustering, topic extraction, and semantic similarity measurement.

4: Cross-Platform Compatibility: Gensim is compatible with various operating systems including Windows, macOS, and Linux, making it accessible to a wide range of users.
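
The streaming idea from the scalability point can be sketched as a restartable iterable that reads one sentence per line from disk, so the full corpus never has to sit in memory. The class name `LineCorpus`, the file name, and the whitespace tokenization are assumptions made for this illustration, not part of Gensim.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable that yields one tokenized sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()  # naive whitespace tokenization
                if tokens:                     # skip blank lines
                    yield tokens

# Small demonstration with a temporary file standing in for a large corpus.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("The king rules the land\n\nThe queen rules the sea\n")

    corpus = LineCorpus(path)
    first_pass = list(corpus)
    second_pass = list(corpus)  # a class with __iter__ can be iterated more than once

print(first_pass)

# An iterable like this could be passed to Gensim as the `sentences` argument, e.g.:
# model = Word2Vec(sentences=LineCorpus("corpus.txt"), vector_size=100, window=5, min_count=1)
```

A class is used instead of a generator function because training makes several passes over the corpus, and a plain generator would be exhausted after the first pass.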


In [4]:
import gensim

In [6]:
from gensim.models import Word2Vec, KeyedVectors

In [7]:
### Reference: https://stackoverflow.com/questions/46433778/import-googlenews-vectors-negative300-bin

In [8]:
import gensim.downloader as api
word_vector = api.load('word2vec-google-news-300')



In [12]:
vec_king = word_vector['king']
print(vec_king)

[ 1.25976562e-01  2.97851562e-02  8.60595703e-03  1.39648438e-01
 -2.56347656e-02 -3.61328125e-02  1.11816406e-01 -1.98242188e-01
  5.12695312e-02  3.63281250e-01 -2.42187500e-01 -3.02734375e-01
 -1.77734375e-01 -2.49023438e-02 -1.67968750e-01 -1.69921875e-01
  3.46679688e-02  5.21850586e-03  4.63867188e-02  1.28906250e-01
  1.36718750e-01  1.12792969e-01  5.95703125e-02  1.36718750e-01
  1.01074219e-01 -1.76757812e-01 -2.51953125e-01  5.98144531e-02
  3.41796875e-01 -3.11279297e-02  1.04492188e-01  6.17675781e-02
  1.24511719e-01  4.00390625e-01 -3.22265625e-01  8.39843750e-02
  3.90625000e-02  5.85937500e-03  7.03125000e-02  1.72851562e-01
  1.38671875e-01 -2.31445312e-01  2.83203125e-01  1.42578125e-01
  3.41796875e-01 -2.39257812e-02 -1.09863281e-01  3.32031250e-02
 -5.46875000e-02  1.53198242e-02 -1.62109375e-01  1.58203125e-01
 -2.59765625e-01  2.01416016e-02 -1.63085938e-01  1.35803223e-03
 -1.44531250e-01 -5.68847656e-02  4.29687500e-02 -2.46582031e-02
  1.85546875e-01  4.47265