Write a summary of how the Word2Vec and GloVe processes take place in LLMs.
Use this website:

https://medium.com/bright-ml/glove-embedding-for-sentence-dc49936d24a7




The article discusses GloVe embeddings for sentences. It explains that GloVe is a log-bilinear model that derives word meaning from how often words co-occur in a corpus. Word2Vec and GloVe are two prominent word-embedding techniques used in language model architectures for natural language processing (NLP).

Word2Vec is an embedding technique that represents words as dense, continuous vectors, typically with a few hundred dimensions. It has two training architectures: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts a target word from its surrounding context words, while Skip-gram predicts the context words given a target word. Both use a shallow neural network trained on a large text corpus to learn the vector representations. Once trained, Word2Vec assigns each vocabulary word a single fixed vector, and words that appear in similar contexts end up with similar vectors. However, it may struggle with rare words, and it ignores word order within the context window.
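To make the difference between CBOW and Skip-gram concrete, here is a minimal sketch of how training examples could be drawn from one sentence. The function name `training_pairs` and the fixed window are illustrative assumptions; real implementations such as gensim also apply subsampling and vary the window size during training.

```python
def training_pairs(tokens, window=2):
    """Return (skipgram_pairs, cbow_pairs) for one tokenized sentence.

    Hypothetical helper for illustration only, not part of gensim.
    """
    skipgram, cbow = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Skip-gram: one (target, context word) example per neighbour
        skipgram.extend((target, c) for c in context)
        # CBOW: one (context words, target) example per position
        cbow.append((context, target))
    return skipgram, cbow


tokens = ["i", "love", "natural", "language", "processing"]
skipgram, cbow = training_pairs(tokens, window=1)
print(skipgram[:3])  # [('i', 'love'), ('love', 'i'), ('love', 'natural')]
print(cbow[1])       # (['i', 'natural'], 'love')
```

Skip-gram produces many small examples per sentence, while CBOW produces one example per position with the whole context as input.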

GloVe (Global Vectors for Word Representation), on the other hand, uses global word-word co-occurrence statistics to embed words based on their collective context in a corpus. It builds a word-word co-occurrence matrix that counts how often words appear together within a context window, then fits word vectors whose dot products approximate the logarithm of those counts, which amounts to a weighted factorization of the matrix. GloVe captures global corpus statistics efficiently and represents both semantic and syntactic relationships well. However, it needs more memory to store the co-occurrence matrix and may be less effective on very small corpora.
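The two GloVe steps described above, counting co-occurrences and fitting vectors to the log counts, can be sketched as follows. This is a rough illustration and not the article's code. The function names are hypothetical, and the 1/distance weighting and the defaults `x_max=100`, `alpha=0.75` follow the GloVe paper.

```python
from collections import defaultdict
import math


def cooccurrence(sentences, window=2):
    """Symmetric co-occurrence counts, each pair weighted by 1/distance."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of the weighted least-squares objective:
    f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2
    """
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2


corpus = [["i", "love", "natural", "language", "processing"]]
X = cooccurrence(corpus, window=2)
print(X[("love", "natural")], X[("i", "natural")])  # 1.0 0.5
```

Training sums `pair_loss` over all non-zero entries of `X` and minimizes it with gradient descent. The full matrix is what drives GloVe's memory cost.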

These embedding techniques play crucial roles in NLP tasks, offering different advantages and being suitable for diverse datasets. FastText, another advanced embedding technique, extends Word2Vec by incorporating subword information, making it highly effective for morphologically rich languages and handling out-of-vocabulary words. When choosing an embedding model, factors such as semantic relationships, dataset size, and language morphology need to be considered to ensure optimal performance in NLP applications.
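The subword idea behind FastText can be illustrated with a short sketch that splits a word into character n-grams. The helper `char_ngrams` is hypothetical. The `<` and `>` boundary markers and the 3 to 6 n-gram range mirror fastText's defaults, but the real library also keeps the whole word as its own unit and hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's vector is built from the vectors of its n-grams, so an out-of-vocabulary word still gets a representation as long as some of its n-grams were seen during training.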



**Change the input data and run the same analysis using the code provided in the link below:**

https://medium.com/bright-ml/glove-embedding-for-sentence-dc49936d24a7

In [3]:
#Importing libraries for word2vec assignment
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Toy dataset
sentences = ["I love natural language processing.",
             "Word embeddings are powerful."]

In [8]:
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
print(tokenized_sentences)

[['i', 'love', 'natural', 'language', 'processing', '.'], ['word', 'embeddings', 'are', 'powerful', '.']]


In [6]:
# Train Word2Vec model (100-dimensional vectors, context window of 5, keep every word)
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

In [11]:
# Access embeddings and print the vector for the token 'i'
word_embeddings = model.wv
print(word_embeddings['i'])

[-0.00515624 -0.00666834 -0.00777684  0.00831073 -0.00198234 -0.00685496
 -0.00415439  0.00514413 -0.00286914 -0.00374966  0.00162143 -0.00277629
 -0.00158436  0.00107449 -0.00297794  0.00851928  0.00391094 -0.00995886
  0.0062596  -0.00675425  0.00076943  0.00440423 -0.00510337 -0.00211067
  0.00809548 -0.00424379 -0.00763626  0.00925791 -0.0021555  -0.00471943
  0.0085708   0.00428334  0.00432484  0.00928451 -0.00845308  0.00525532
  0.00203935  0.00418828  0.0016979   0.00446413  0.00448629  0.00610452
 -0.0032021  -0.00457573 -0.00042652  0.00253373 -0.00326317  0.00605772
  0.00415413  0.00776459  0.00256927  0.00811668 -0.00138721  0.00807793
  0.00371702 -0.00804732 -0.00393361 -0.00247188  0.00489304 -0.00087216
 -0.00283091  0.00783371  0.0093229  -0.00161493 -0.00515925 -0.00470176
 -0.00484605 -0.00960283  0.00137202 -0.00422492  0.00252671  0.00561448
 -0.00406591 -0.00959658  0.0015467  -0.00670012  0.00249517 -0.00378063
  0.00707842  0.00064022  0.00356094 -0.00273913 -0

In [17]:
from gensim.models import FastText
from nltk.tokenize import word_tokenize

In [18]:
# Toy dataset
sentences = ["FastText embeddings handle subword information.",
             "It is effective for various languages."]

In [19]:
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

In [20]:
# Train FastText model
model = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

In [21]:
# Access embeddings
word_embeddings = model.wv
print(word_embeddings['subword'])

[ 1.2919701e-03 -1.0602611e-04 -1.1323356e-03  1.6584302e-03
 -5.7117449e-04 -2.9840841e-04 -4.3193492e-04 -3.1250282e-04
 -1.9898350e-04  9.4852143e-04  1.6994212e-03 -1.8581563e-04
 -8.1228669e-04 -1.5968895e-03 -1.3839703e-03 -6.3576088e-05
 -5.7436171e-04 -1.1720147e-03  7.4763177e-04 -2.1684753e-05
  4.5981101e-04 -1.7291495e-03 -4.4365969e-04  4.0478929e-04
  1.2072949e-04  6.4071972e-04 -1.0785459e-03  1.3050955e-03
  3.5044085e-04 -1.8284899e-04 -1.5951110e-04 -1.0465594e-03
 -1.5170674e-04 -5.7858619e-04 -1.8307484e-03  1.0248278e-03
  6.9344341e-04  1.6159177e-03 -8.4400486e-04  8.9535897e-04
 -1.3508157e-04  2.3538095e-03 -3.7109022e-04 -2.8064058e-04
  2.6269807e-04 -2.8326022e-04 -7.7332847e-04  1.8949938e-03
  2.1798143e-03 -4.4569728e-04 -6.4175081e-04  1.4240020e-04
  2.5182988e-03 -1.5666584e-03  1.3954224e-04 -6.9046958e-04
  5.8793183e-04 -1.4282564e-03  2.1278318e-04 -2.2993281e-03
 -4.3249400e-03 -1.6397990e-03  1.3989839e-03 -1.3229308e-03
  2.0258871e-03 -2.96638

In [22]:
from gensim.models import Word2Vec

In [50]:
#Example sentence
sentences = [['this', 'is', 'that', 'a', 'ball', 'you', 'me', 'myself']]

In [51]:
# Train a Skip-gram model (sg=1) with a context window of 2
model = Word2Vec(sentences, min_count=1, window=2, sg=1)

In [52]:
word_vectors = model.wv

In [53]:
# Cosine similarity between the vectors for 'this' and 'that'
similarity = word_vectors.similarity('this', 'that')
print(f"Similarity between 'this' and 'that': {similarity}")

Similarity between 'this' and 'that': -0.05774581432342529


In [59]:
# Rank the other vocabulary words by cosine similarity to 'this'
most_similar = word_vectors.most_similar('this')
print(f"Most similar words to 'this': {most_similar}")


Most similar words to 'this': [('myself', 0.09291722625494003), ('is', 0.00484249135479331), ('you', -0.0027540253940969706), ('ball', -0.013679751195013523), ('a', -0.028491031378507614), ('that', -0.05774581804871559), ('me', -0.11555545777082443)]
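
The `similarity` and `most_similar` calls above report cosine similarity between word vectors. With a single eight-word training sentence, the vectors likely stay close to their random initialization, which would explain why every score sits near zero. A minimal sketch of the measure itself, using a hypothetical `cosine_similarity` helper in plain Python:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0])) # opposite vectors
```

Scores range from -1 (opposite direction) through 0 (unrelated) to 1 (same direction), so meaningful neighbours only emerge once the model has been trained on a much larger corpus.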
