<a href="https://colab.research.google.com/github/Akshaay23/NLP_Learning/blob/main/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Word embeddings** are techniques that convert words or phrases into numerical vectors, ideally preserving their semantic meaning. The techniques fall into two broad categories: static embeddings and contextual embeddings.



#**1. Static Word Embeddings**

These embeddings assign a fixed vector representation to each word, regardless of context.
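
A minimal sketch of what "fixed vector regardless of context" means in practice: a static embedding behaves like a lookup table. The table `static_table` and its 3-dimensional values are made up for illustration and are not taken from any trained model.

```python
# Toy lookup table; the 3-dimensional vectors are made up for illustration.
static_table = {
    "bank": [0.2, -0.7, 0.5],
    "river": [0.9, 0.1, -0.3],
    "money": [-0.4, 0.8, 0.6],
}

def embed(sentence, table):
    """Return the stored vector for every known word in the sentence."""
    return {w: table[w] for w in sentence.lower().split() if w in table}

river_sentence = embed("The bank is near the river", static_table)
money_sentence = embed("He deposited money in the bank", static_table)

# The surrounding words differ, but the vector for "bank" does not.
same_vector = river_sentence["bank"] == money_sentence["bank"]
print(same_vector)
```

Both sentences retrieve the identical vector for "bank", which is the limitation that contextual embeddings address.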

##**One-Hot Encoding**
- Each word is represented as a sparse binary vector.
- High-dimensional and inefficient.
- Does not capture semantic meaning.

##**TF-IDF** (Term Frequency - Inverse Document Frequency)
- Weights words based on their frequency in a document relative to the entire corpus.
- Produces one vector per document rather than one per word.
- Does not capture semantic relationships between words.

##**Word2Vec**

Learns word vectors with a shallow neural network, using one of two training objectives:
- **CBOW** (Continuous Bag of Words): Predicts a target word from its surrounding words.
- **Skip-Gram**: Predicts the surrounding words given a target word.

Trained on large corpora, both objectives place words that occur in similar contexts near each other in vector space.
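
The difference between the two objectives shows up in the training pairs they produce. The sketch below only builds those pairs; the helper names `skipgram_pairs` and `cbow_pairs` are hypothetical, and the sketch leaves out details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: the target predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """Return (context_words, target) pairs: the neighbours predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["i", "love", "deep", "learning"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-Gram yields one pair per (target, neighbour) combination, while CBOW yields one pair per position with all neighbours grouped as the input.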

Example: vector("king") - vector("man") + vector("woman") ≈ vector("queen")
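
The analogy is plain vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made 3-dimensional vectors (`toy_vectors`, an assumption chosen so the arithmetic works out exactly); real Word2Vec vectors are learned and the match is only approximate.

```python
import math

# Hand-made 3-dimensional vectors (royalty, male, female); not learned from data.
toy_vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude the inputs
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates mirrors common practice, since the query vector often stays closest to one of its own inputs.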

##**GloVe** (Global Vectors for Word Representation)
- Learns vectors from word co-occurrence counts aggregated over the entire corpus.
- Combines local context-window information with global corpus statistics.

Example: "apple" and "fruit" lie closer together in vector space than an unrelated pair of words would.
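
The co-occurrence statistics GloVe starts from can be sketched as a simple windowed count. This is only the counting step under an assumed window size; the function name `cooccurrence_counts` is hypothetical, and fitting vectors to these counts is not shown.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = [
    ["i", "love", "nlp"],
    ["i", "love", "deep", "learning"],
]
counts = cooccurrence_counts(sentences, window=1)
print(counts[("i", "love")])     # "i" and "love" are adjacent in both sentences
print(counts[("love", "deep")])  # adjacent in the second sentence only
```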

##**FastText**
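- An extension of Word2Vec that represents each word as a bag of character n-grams.
- Captures subword information, making it useful for morphologically rich languages and for words not seen during training.

Example: with n = 3, "running" is wrapped in boundary markers as "<running>" and decomposed into "<ru", "run", "unn", "nni", "nin", "ing", "ng>".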
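
A minimal sketch of the n-gram extraction step, assuming the boundary markers `<` and `>` and an n-gram range of 3 to 6 (the defaults in Gensim's `FastText`). The helper `char_ngrams` is hypothetical; hashing the n-grams into buckets and summing their vectors is not shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("running", min_n=3, max_n=3)
print(trigrams)
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
```

The boundary markers let the model tell a prefix such as "<ru" apart from the same characters in the middle of a word.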

#**2. Contextual Word Embeddings**

Unlike static embeddings, these generate different representations for the same word depending on context.

##**ELMo** (Embeddings from Language Models)
- Uses bidirectional LSTMs to generate dynamic word embeddings.

Example:
- "The bank is near the river" → "bank" (riverbank)
- "He deposited money in the bank" → "bank" (financial institution)

##**BERT** (Bidirectional Encoder Representations from Transformers)
- A transformer-based model that generates context-aware word embeddings.
- Pre-trained with masked language modeling (MLM) and next sentence prediction (NSP).
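
The MLM objective can be sketched as a corruption step applied to the input tokens. The sketch below follows the commonly described 80/10/10 rule for selected tokens; the function `mask_tokens` is hypothetical, and `mask_prob=0.5` is used only so that a six-token example shows some masking (BERT selects about 15% of tokens).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the bank is near the river".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

Because the model sees the words on both sides of each masked position, the representations it learns are conditioned on the full sentence.
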
Example:
- "Apple is a fruit" → "Apple" refers to the fruit.
- "I bought an Apple laptop" → "Apple" refers to the company.

##**Transformer-based Models** (GPT, T5, RoBERTa, etc.)
- Variants of the transformer architecture that also produce contextual embeddings.
- GPT: Uses autoregressive (left-to-right) language modeling.
- T5: Treats every NLP task as a text-to-text problem.
- RoBERTa: A BERT variant trained longer on more data, with dynamic masking and without the NSP objective.
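
The left-to-right behaviour of GPT-style models versus the bidirectional behaviour of BERT-style models comes down to which positions each token may attend to. A minimal sketch of the two attention masks, with a hypothetical helper `attention_mask`:

```python
def attention_mask(num_tokens, causal):
    """Return a matrix where mask[i][j] == 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

causal_mask = attention_mask(4, causal=True)          # GPT-style: past and self only
bidirectional_mask = attention_mask(4, causal=False)  # BERT-style: every position

for row in causal_mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

With the causal mask, the embedding of a token depends only on the tokens to its left, so the same word can receive different vectors depending on what precedes it but not on what follows.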

##**Sentence Transformers** (SBERT, USE - Universal Sentence Encoder)
- Generate sentence-level embeddings instead of word-level ones.
- Used in tasks such as semantic search and similarity detection.
- SBERT (Sentence-BERT): Fine-tunes BERT so that sentence embeddings can be compared directly, for example with cosine similarity.
- USE (Universal Sentence Encoder): A general-purpose sentence encoder; multilingual variants are also available.
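
One common way to turn token-level vectors into a single sentence vector is mean pooling over the non-padding tokens, followed by cosine similarity for comparison. The sketch below assumes that pooling strategy and uses made-up 2-dimensional token vectors; it is not the internals of any specific model.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up token vectors; the last row of each sentence is padding (mask == 0).
sent_a = mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
sent_b = mean_pool([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]], [1, 1, 0])
similarity = cosine(sent_a, sent_b)
print(sent_a, sent_b, similarity)
```

The padding rows do not affect the result, and the two sentence vectors point in the same direction, so their cosine similarity is (up to floating-point error) 1.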

##One-Hot Encoding & TF-IDF (Sklearn)

In [2]:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample corpus
corpus = ["cat", "dog", "fish"]

# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)
one_hot = encoder.fit_transform(np.array(corpus).reshape(-1, 1))

print("One-Hot Encoding:\n", one_hot)


One-Hot Encoding:
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
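
The output above also shows why one-hot vectors carry no semantic information: every pair of distinct words has a dot product of zero. A small check, with the vectors copied by hand from that output:

```python
# One-hot vectors copied from the encoder output above.
one_hot = {
    "cat":  [1.0, 0.0, 0.0],
    "dog":  [0.0, 1.0, 0.0],
    "fish": [0.0, 0.0, 1.0],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

cat_dog = dot(one_hot["cat"], one_hot["dog"])    # different words
cat_cat = dot(one_hot["cat"], one_hot["cat"])    # same word
print(cat_dog, cat_cat)
# 0.0 1.0
```

"cat" is exactly as dissimilar to "dog" as it is to "fish", so no notion of relatedness can be read from the vectors.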


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
corpus = ["I love NLP", "NLP is amazing", "I love deep learning"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print("TF-IDF Features:\n", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())


TF-IDF Features:
 ['amazing' 'deep' 'is' 'learning' 'love' 'nlp']
TF-IDF Matrix:
 [[0.         0.         0.         0.         0.70710678 0.70710678]
 [0.62276601 0.         0.62276601 0.         0.         0.4736296 ]
 [0.         0.62276601 0.         0.62276601 0.4736296  0.        ]]
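
The second row of the matrix ("NLP is amazing") can be reproduced by hand. The sketch assumes `TfidfVectorizer`'s default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. Note that the default tokenizer drops single-character tokens, which is why "I" is absent from the feature list.

```python
import math

n_docs = 3
# Document frequencies read off the corpus above.
df = {"amazing": 1, "is": 1, "nlp": 2}

# Smoothed IDF, as used when smooth_idf=True (the default).
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

# Each term occurs once in "NLP is amazing", so the raw tf-idf equals the idf.
norm = math.sqrt(sum(v * v for v in idf.values()))
row = {t: v / norm for t, v in idf.items()}

print({t: round(v, 4) for t, v in row.items()})
# {'amazing': 0.6228, 'is': 0.6228, 'nlp': 0.4736}
```

"nlp" gets a lower weight than "amazing" or "is" because it appears in two of the three documents.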


##Word2Vec (Gensim)

In [5]:
from gensim.models import Word2Vec

# Sample sentences
sentences = [["I", "love", "NLP"], ["Word2Vec", "is", "powerful"], ["I", "love", "deep", "learning"]]

# Train Word2Vec
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=4)

# Get word embedding
print("Vector for 'NLP':\n", model.wv["NLP"])


Vector for 'NLP':
 [ 0.05455794  0.08345953 -0.01453741 -0.09208143  0.04370552  0.00571785
  0.07441908 -0.00813283 -0.02638414 -0.08753009]


##FastText (Gensim)

In [6]:
from gensim.models import FastText

# Train FastText
fasttext_model = FastText(sentences, vector_size=10, window=2, min_count=1, workers=4)

# Get word embedding
print("FastText vector for 'NLP':\n", fasttext_model.wv["NLP"])


FastText vector for 'NLP':
 [-0.00134448  0.01835183 -0.02167502  0.03242417  0.01605503 -0.0045414
  0.01362134  0.01509863  0.0002289   0.00653573]


##GloVe (Spacy Pre-trained Vectors)

In [None]:
!pip install spacy
!python -m spacy download en_core_web_md
import spacy

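# Load the pretrained 300-dimensional vectors bundled with spaCy's medium English model.
# Older releases of en_core_web_md shipped GloVe vectors; newer releases may use a
# different source, so check the model documentation for the installed version.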
nlp = spacy.load("en_core_web_md")



In [9]:
# Get word vector
vector = nlp("NLP").vector
print("GloVe vector for 'NLP':\n", vector)

GloVe vector for 'NLP':
 [-1.2606    0.065898  6.0885   -0.22722   0.83154   0.41309   3.1979
 -0.046191 -1.2829   -1.3479    1.7709    3.668    -2.0622    2.7155
 -1.0578   -2.5758    2.4921    1.6091   -1.0377    3.0679   -1.4015
  3.7073    1.9131   -0.57248  -2.6436    0.63337  -0.29285  -3.4357
 -2.1266    1.7317   -5.3598    1.3803   -0.54765   0.35455   2.7631
 -1.977     0.44758  -1.4725    2.8591   -2.1695    2.3519   -1.3073
 -2.5832   -1.1488   -6.6438   -0.93801   0.56867   0.87114  -0.96782
 -5.2648    0.94436   2.2771    1.1189   -0.34377  -2.5144    2.9963
 -2.5062    2.1578   -0.67746  -1.0898    1.6241    3.6518   -3.1079
  4.7306   -0.66454   2.7364    0.13306  -3.4212    1.3897    2.3435
 -5.4255    1.9155   -1.7938   -0.3813    1.5523    0.10848  -2.3448
 -1.336     2.8275   -1.1881   -2.0658   -1.704    -0.72433   1.1114
 -0.59757  -5.9866    2.3778   -0.16238  -2.3423   -1.7955   -0.77142
  0.068012  0.68761   0.67404  -4.4701    2.4112   -0.2604   -1.0389
  2.179

##BERT Embeddings (Hugging Face)

In [10]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode sentence
sentence = "NLP is amazing"
inputs = tokenizer(sentence, return_tensors="pt")

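# Get embeddings (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # one 768-dimensional vector per token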






In [11]:
print("BERT Embeddings Shape:", embeddings.shape)  # (1, num_tokens, hidden_dim)

BERT Embeddings Shape: torch.Size([1, 6, 768])


##Sentence Embeddings (Sentence-BERT)

In [13]:
from sentence_transformers import SentenceTransformer

# Load SBERT
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Get sentence embedding
sentence_embedding = sbert_model.encode("NLP is amazing")

print("SBERT Sentence Embedding Shape:", sentence_embedding.shape)


SBERT Sentence Embedding Shape: (384,)
