# Word and Contextual Embeddings in Natural Language Processing (NLP)

#### After preprocessing and linguistic analysis, the next crucial step in Natural Language Processing is text representation. Machines cannot directly understand words or sentences; therefore, text must be converted into numerical vectors that capture semantic meaning.

#### This notebook covers:
* Static word embeddings (Word2Vec, GloVe)

* Contextual word embeddings (BERT, RoBERTa)

In [None]:
# Install the required libraries
!pip install nltk spacy gensim transformers torch sentencepiece
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 57.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


The cell above installs the NLP libraries used in this notebook:

* spaCy – tokenization and linguistic features

* Gensim – Word2Vec training and pre-trained GloVe vectors

* Transformers – BERT and RoBERTa models

* Torch – deep learning backend

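#### spaCy exposes a numerical vector for every token through `token.vector`. Note that the small model `en_core_web_sm` does not ship with pre-trained static word vectors; its `token.vector` falls back to a 96-dimensional context-sensitive tensor produced by the pipeline. For pre-trained word vectors, load a larger model such as `en_core_web_md` or `en_core_web_lg`.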

##### Key Observation:

* Words with similar meanings have similar vector representations

* The shape of the vector indicates the dimensionality of the embedding

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load spaCy's small English pipeline
text = "Natural Language Processing is fascinating"
doc = nlp(text)
for token in doc:
    # Print each token with the shape (dimensionality) of its vector
    print(token.text, token.vector.shape)

Natural (96,)
Language (96,)
Processing (96,)
is (96,)
fascinating (96,)


#### Word2Vec Embeddings (Gensim)

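##### Word2Vec is a shallow neural network technique that learns word embeddings from local context windows in a corpus: it is trained either to predict a word from its neighbours (CBOW) or to predict the neighbours from the word (skip-gram).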

##### It works on the idea that:

* Words appearing in similar contexts tend to have similar meanings.
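
To make "context" concrete, here is a minimal sketch of how skip-gram style training pairs can be drawn from a sliding window. The helper name `skipgram_pairs` and the `window` value are assumptions for illustration only; this is not Gensim's implementation, which also applies techniques such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["machine", "learning", "is", "fun"], window=1)
print(pairs)
```

Words that keep showing up with the same context words receive similar training signals, which is why their vectors tend to end up close together.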


In [None]:
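# Word2Vec using Gensim
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["machine", "learning", "is", "fun"],
    ["natural", "language", "processing"],
    ["deep", "learning", "models"],
]

# vector_size: embedding dimensionality; window: context size;
# min_count=1 keeps every word, even those seen only once
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
vector = model.wv["learning"]
print("Vector for 'learning':", vector)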

Vector for 'learning': [-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]


#### Similarity using Word2Vec

##### Word similarity is computed using cosine similarity between vectors. The score ranges from -1 to 1; a higher score indicates that the words are more closely related in the embedding space.

* Note: Training Word2Vec on a corpus this small is only for demonstration, and the resulting similarity scores are not meaningful. In practice, large corpora are required.
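
As a reference for what the similarity score measures, the sketch below computes cosine similarity by hand in plain Python. The function name `cosine_similarity` and the toy vectors are chosen here for illustration; they are not taken from Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([3.0, 4.0], [3.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])  # opposite direction
print(same, orthogonal, opposite)
```

Because the vectors are divided by their lengths, only their direction matters, not their magnitude.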

In [None]:
# Word similarity using Word2Vec

similarity = model.wv.similarity("learning", "natural")
print("similarity:", similarity)

similarity: 0.012442171


#### GloVe Embeddings (Pre-trained)


##### GloVe (Global Vectors) is a word embedding technique that learns representations using global word co-occurrence statistics.

##### Unlike Word2Vec:

* GloVe embeddings are often used in their pre-trained form

* They combine local context windows (used to count co-occurrences) with global, corpus-wide statistics
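
The "global co-occurrence statistics" can be pictured as a table of how often each word pair appears within a window, summed over the whole corpus. The sketch below builds such counts in plain Python; `cooccurrence_counts` and the toy corpus are assumptions for illustration. GloVe itself additionally down-weights distant pairs and then fits vectors to the logarithm of these counts, which is not shown here.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbour) pair occurs within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [
    ["deep", "learning", "models"],
    ["machine", "learning", "models"],
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("learning", "models")])  # 2: the pair is adjacent in both sentences
print(counts[("deep", "models")])      # 0: never within one position of each other
```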

In [None]:
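# GloVe embeddings (pre-trained)

import gensim.downloader as api

# Downloads the 50-dimensional GloVe vectors on first use
glove = api.load("glove-wiki-gigaword-50")
print(glove["computer"])

# Cosine similarity between two related words
print(glove.similarity("computer", "laptop"))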

[ 0.079084 -0.81504   1.7901    0.91653   0.10797  -0.55628  -0.84427
 -1.4951    0.13418   0.63627   0.35146   0.25813  -0.55029   0.51056
  0.37409   0.12092  -1.6166    0.83653   0.14202  -0.52348   0.73453
  0.12207  -0.49079   0.32533   0.45306  -1.585    -0.63848  -1.0053
  0.10454  -0.42984   3.181    -0.62187   0.16819  -1.0139    0.064058
  0.57844  -0.4556    0.73783   0.37203  -0.57722   0.66441   0.055129
  0.037891  1.3275    0.30991   0.50697   1.2357    0.1274   -0.11434
  0.20709 ]
0.77411586


#### Contextual Embeddings using BERT


##### Static embeddings assign the same vector to a word regardless of context. However, many words have multiple meanings.

Example:

* bank (financial institution)

* bank (river bank)

##### BERT solves this problem by generating different vectors for the same word based on context.

In [None]:
# Contextual embeddings using BERT
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "The Bank is near the river"
inputs = tokenizer(sentence, return_tensors="pt")

outputs = model(**inputs)
embeddings = outputs.last_hidden_state

# Shape is (batch size, number of tokens including [CLS] and [SEP], hidden size)
print(embeddings.shape)

torch.Size([1, 8, 768])


##### The embeddings generated for the word bank differ depending on the sentence, demonstrating contextual understanding.

In [None]:
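sent1 = "I Deposited money in the bank"
sent2 = "The river bank is Beautiful"

inputs1 = tokenizer(sent1, return_tensors="pt")
inputs2 = tokenizer(sent2, return_tensors="pt")

emb1 = model(**inputs1).last_hidden_state
emb2 = model(**inputs2).last_hidden_state

# Caution: position 0 is [CLS], so index 5 is not the "bank" token in either
# sentence (in sent2 it is "beautiful"). To compare the two senses, look up
# the position of "bank" in each tokenized sentence and index with that.
print(emb1[0][5][:5])
print(emb2[0][5][:5])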

tensor([ 0.2485, -0.2934,  0.1040,  0.4932,  0.4474], grad_fn=<SliceBackward0>)
tensor([-0.2188, -0.6338, -0.3455,  0.3908,  0.5776], grad_fn=<SliceBackward0>)
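
The cell above slices a fixed position (5) in both sequences, which does not line up with the word bank once the `[CLS]` token is counted. A minimal sketch of a safer approach is to find the token's position first. The helpers `token_position` and `word_vector` are hypothetical names introduced here, and the token list is the tokenization one would expect from `bert-base-uncased` for the second sentence, stated as an assumption rather than a recorded output.

```python
def token_position(tokens, word):
    """Return the index of the first token equal to `word` (raises ValueError if absent)."""
    return tokens.index(word)


def word_vector(sentence, word, tokenizer, model):
    """Return the last-layer vector of `word` in `sentence`.

    Expects a Hugging Face tokenizer and model such as the BERT objects
    created earlier; `word` must survive tokenization as a single token.
    """
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return model(**inputs).last_hidden_state[0, token_position(tokens, word)]


# Assumed tokenization of "The river bank is Beautiful" (lowercased, with special tokens)
tokens2 = ["[CLS]", "the", "river", "bank", "is", "beautiful", "[SEP]"]
print(token_position(tokens2, "bank"))  # 3
```

With the BERT `tokenizer` and `model` from the earlier cell, `word_vector(sent1, "bank", tokenizer, model)` and `word_vector(sent2, "bank", tokenizer, model)` would then return the two vectors that the comparison is meant to contrast.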


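#### RoBERTa is an optimized version of BERT: it keeps the same architecture but changes the pre-training recipe, which gives better performance on many NLP tasks.

##### Key Advantages:

* Trained on larger datasets (about 160 GB of text, versus roughly 16 GB for BERT)

* Trained longer, with larger batches and dynamic masking, and without the next-sentence-prediction objective

* Stronger contextual representations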


In [None]:
from transformers import RobertaTokenizer, RobertaModel

# Note: this rebinds `tokenizer` and `model`, replacing the BERT objects above
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

inputs = tokenizer("Word representaton are powerful ", return_tensors="pt")

outputs = model(**inputs)

# Shape is (batch size, number of tokens, hidden size)
print(outputs.last_hidden_state.shape)


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


torch.Size([1, 8, 768])
