### Week 13 Word Embeddings Tutorial
A tutorial on learning and extracting word embeddings from text.<br>

### Word2Vec embeddings using the Gensim library.<br> 
Word2Vec is a popular technique for learning word embeddings: dense vector representations of words that capture semantic relationships based on the contexts in which the words appear.<br>
As discussed, Word2Vec has two variants, skip-gram and CBOW (continuous bag of words). Skip-gram is trained to predict the context words given a target word, whereas CBOW is trained to predict the target word given its context.


- Requirements: download the `punkt` tokenizer data from NLTK and install the Gensim library (`pip install gensim`).

In [43]:
# importing needed libraries
import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

In [44]:

# Sample Sentence
text = """
Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. NLP techniques aim to enable computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant.
"""
tokenized_words = word_tokenize(text.lower())

In [34]:

# Create a CBOW model (sg is left at its default of 0, which selects CBOW)
model_word2Vec = Word2Vec(sentences=[tokenized_words], vector_size=100, window=5,  min_count=1, workers=4)


In [1]:

# Get word vector for a specific word


In [2]:

# Find similar words


SKIPGRAM

In [4]:

# Create a model

# Get word vector for a specific word

# Find similar words


In [5]:
# How to save/load skipgram and CBOW models?

How to use word2Vec for SkipGrams and CBOW? Explore whether they will give different results for similar words

### Clustering Words Based on Co-occurrence Patterns --> Brown Clustering

In [51]:
from nltk.corpus import brown
import numpy as np
# download brown corpus if not downloaded before
# nltk.download('brown')

In [68]:
# retrieve sentences from brown corpus
corpus = brown.sents()[:1]
# transform all sentences to lower case
corpus = [[word.lower() for word in sent] for sent in corpus]

# Create a set of unique words in the corpus --> Vocab
vocab = set(word for sent in corpus for word in sent)

In [7]:
# To find co-occurrence patterns, build a co-occurrence matrix that records, for each word, how often it appears next to every other word
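
A minimal sketch of such a co-occurrence matrix. It uses a tiny hand-written corpus (`toy_corpus`) rather than the Brown corpus so that it runs without any download, and it counts only immediately adjacent words; the window size and all names here are assumptions for illustration.

```python
import numpy as np

# Toy corpus (an assumption); in the tutorial this would be the lower-cased Brown sentences.
toy_corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]
window_size = 1  # hypothetical choice: count only immediate neighbours

vocab = sorted({word for sent in toy_corpus for word in sent})
word_to_id = {word: i for i, word in enumerate(vocab)}

# cooc[i, j] = number of times word j appears within the window around word i.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in toy_corpus:
    for i, word in enumerate(sent):
        start = max(0, i - window_size)
        end = min(len(sent), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                cooc[word_to_id[word], word_to_id[sent[j]]] += 1

print(vocab)
print(cooc)
```

Because every neighbouring pair is counted from both sides, the matrix is symmetric.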


In [70]:
# To perform Brown clustering, first set the target number of clusters

# Assign each word to its own cluster

# Perform Brown clustering by recursively merging clusters

    # Find the pair of clusters with the highest co-occurrence count
 
    # Merge the clusters by assigning the same cluster ID to both clusters

    # Update the co-occurrence matrix by merging the counts of the two clusters

# Final cluster assignments
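
The skeleton above can be sketched as follows. This follows the comments literally and greedily merges the pair of clusters with the highest co-occurrence count; it is a simplification, since Brown clustering proper picks the merge that loses the least average mutual information. The toy corpus, `num_clusters` and all variable names are assumptions for illustration.

```python
import numpy as np

# Toy corpus (an assumption); adjacent words are counted as co-occurring.
toy_corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]
vocab = sorted({word for sent in toy_corpus for word in sent})
word_to_id = {word: i for i, word in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)))
for sent in toy_corpus:
    for left, right in zip(sent, sent[1:]):
        cooc[word_to_id[left], word_to_id[right]] += 1
        cooc[word_to_id[right], word_to_id[left]] += 1

num_clusters = 3  # hypothetical target number of clusters

# Assign each word to its own cluster.
clusters = {i: [word] for word, i in word_to_id.items()}

# Merge clusters until only num_clusters remain.
while len(clusters) > num_clusters:
    ids = sorted(clusters)
    # Find the pair of clusters with the highest co-occurrence count.
    keep, drop = max(
        ((a, b) for pos, a in enumerate(ids) for b in ids[pos + 1:]),
        key=lambda pair: cooc[pair[0], pair[1]],
    )
    # Merge the clusters by moving the words of `drop` into `keep`.
    clusters[keep].extend(clusters.pop(drop))
    # Update the matrix: fold the counts of `drop` into `keep`, then clear `drop`.
    cooc[keep, :] += cooc[drop, :]
    cooc[:, keep] += cooc[:, drop]
    cooc[keep, keep] = 0
    cooc[drop, :] = 0
    cooc[:, drop] = 0

# Final cluster assignments: word -> cluster id.
assignments = {word: cid for cid, words in clusters.items() for word in words}
print(assignments)
```

Merging on raw counts tends to grow one large cluster around frequent words such as "the", which is one reason the mutual-information criterion is used in practice.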


**NOTE: DON'T RUN THIS ON A LARGE DATASET**<br>
Visualizing Brown clusters: <br>
Install SciPy if you have not used it before: `pip install scipy`


In [8]:
# This code visualizes the clustering as a dendrogram. Try it with a small number of words,
# because building the linkage matrix is computationally expensive.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Create a linkage matrix for hierarchical clustering

# Plot the dendrogram
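
A minimal sketch of the visualization, assuming each word is represented by its row of co-occurrence counts. Note that this runs SciPy's standard agglomerative clustering (Ward linkage on Euclidean distances) over those rows, so it shows a hierarchy in the same spirit as Brown clustering rather than the Brown merge order itself. The toy corpus and variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy corpus (an assumption); adjacent words are counted as co-occurring.
toy_corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]
vocab = sorted({word for sent in toy_corpus for word in sent})
word_to_id = {word: i for i, word in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)))
for sent in toy_corpus:
    for left, right in zip(sent, sent[1:]):
        cooc[word_to_id[left], word_to_id[right]] += 1
        cooc[word_to_id[right], word_to_id[left]] += 1

# Create a linkage matrix for hierarchical clustering (one row per merge).
linkage_matrix = linkage(cooc, method="ward")

# Plot the dendrogram with the words as leaf labels.
plt.figure(figsize=(6, 4))
dendrogram(linkage_matrix, labels=vocab)
plt.title("Hierarchical clustering of co-occurrence rows")
plt.ylabel("Merge distance")
plt.show()
```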


### GloVe: 
Global Vectors for Word Representation is an unsupervised learning algorithm used to learn word embeddings from large amounts of text data. Word embeddings are dense vector representations of words that capture semantic relationships between words based on their co-occurrence statistics. 

- Steps:
  1. Preprocess the text data.
  2. Create the dictionary (the vocabulary of words to look up).
  3. Traverse the GloVe file of the chosen dimension and check each word against the dictionary. If a match occurs, copy the corresponding GloVe vector into `embedding_matrix` at that word's index.


In [None]:
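# Download the GloVe pretrained embeddings from: http://nlp.stanford.edu/data/glove.6B.zip
# (numpy is imported as np in an earlier cell)

def embedding_for_vocab(filepath, word_index,
                        embedding_dim):
    # One extra row is allocated on top of len(word_index) for a reserved index.
    # Note: word_index is a plain list here, so list.index() below returns
    # 0-based positions and the extra (last) row simply stays zero.
    vocab_size = len(word_index) + 1
    embedding_matrix_vocab = np.zeros((vocab_size,
                                       embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            # Each line holds a word followed by its vector components
            word, *vector = line.split()
            if word in word_index:
                # Position of the first occurrence of the word in the token list
                idx = word_index.index(word)
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab


# matrix for vocab: tokenized_words
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    './glove.6B/glove.6B.50d.txt', tokenized_words,
    embedding_dim)

# Row 1 holds the vector of tokenized_words[1], i.e. the second token
print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])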

Dense vector for first word is =>  [-5.79900026e-01 -1.10100001e-01 -1.15569997e+00 -2.99059995e-03
 -2.06129998e-01  4.52890009e-01 -1.66710004e-01 -1.03820002e+00
 -9.92410004e-01  3.98840010e-01  5.92299998e-01  2.29900002e-01
  1.52129996e+00 -1.77640006e-01 -2.97259986e-01 -3.92349988e-01
 -7.84709990e-01  1.55939996e-01  6.90769970e-01  5.95369995e-01
 -4.43399996e-01  5.35139978e-01  3.28530014e-01  1.24370003e+00
  1.29719996e+00 -1.38779998e+00 -1.09249997e+00 -4.09249991e-01
 -5.69710016e-01 -3.46560001e-01  3.71630001e+00 -1.04890001e+00
 -4.67079997e-01 -4.47389990e-01  6.22999994e-03  1.96490008e-02
 -4.01609987e-01 -6.29130006e-01 -8.25060010e-01  4.55909997e-01
  8.26259971e-01  5.70909977e-01  2.11989999e-01  4.68650013e-01
 -6.00269973e-01  2.99199998e-01  6.79440022e-01  1.42379999e+00
 -3.21520008e-02 -1.26029998e-01]
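
Because `tokenized_words` is a list with repeated tokens, rows for repeated positions stay zero and each lookup scans the whole list. A variant that builds an explicit word-to-index dictionary (unique words, numbered from 1 so that row 0 stays reserved) could look like the sketch below. To run without the GloVe download it reads a tiny file written on the spot in GloVe's text layout; the three-dimensional vectors in it are made up, and `build_embedding_matrix`, `word_to_index` and `toy_glove_path` are hypothetical names.

```python
import os
import tempfile

import numpy as np


def build_embedding_matrix(filepath, word_to_index, embedding_dim):
    """Copy the vector of every known word into its row; other rows stay zero."""
    matrix = np.zeros((len(word_to_index) + 1, embedding_dim))  # row 0 is reserved
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            idx = word_to_index.get(word)  # constant-time dictionary lookup
            if idx is not None:
                matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]
    return matrix


tokens = ["natural", "language", "processing", "natural", "language"]
# Unique words in order of first appearance, numbered from 1.
word_to_index = {word: i for i, word in enumerate(dict.fromkeys(tokens), start=1)}

# Made-up vectors in the GloVe text layout: "<word> <v1> <v2> ...".
toy_glove_path = os.path.join(tempfile.mkdtemp(), "toy_glove.3d.txt")
with open(toy_glove_path, "w", encoding="utf8") as f:
    f.write("natural 0.1 0.2 0.3\nlanguage 0.4 0.5 0.6\nunused 0.7 0.8 0.9\n")

embedding_matrix = build_embedding_matrix(toy_glove_path, word_to_index, 3)
print(word_to_index)
print(embedding_matrix)
```

Words missing from the file, such as "processing" in this toy example, keep an all-zero row.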


### Dimensionality Reduction --> SVD (LSA)
Latent Semantic Analysis (LSA) is a technique used in natural language processing to uncover the latent structure in a corpus of text documents by applying Singular Value Decomposition (SVD) to a term-document matrix. It allows us to reduce the dimensionality of the document-term space, thereby capturing the underlying semantic relationships between words and documents.

Let's create a text corpus:

In [46]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barked at the fox.",
    "The fox ran away quickly.",
    "The dog is lazy.",
    "The fox is cunning.",
]


- Create Document-Term Matrix: We use the CountVectorizer from scikit-learn to convert the text documents into a document-term matrix. Each row in the matrix corresponds to a document, and each column represents a word's frequency in that document.

- Apply LSA (SVD): We use the TruncatedSVD class from scikit-learn to perform Latent Semantic Analysis. We specify the number of components (dimensions) we want to reduce the feature space to (in this case, we use n_components=2 for simplicity).

- Normalize Data: To ensure that each row in the transformed matrix has unit norm, we use the Normalizer from scikit-learn.



In [47]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

In [48]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X)

  (0, 14)	2
  (0, 11)	1
  (0, 3)	1
  (0, 6)	1
  (0, 8)	1
  (0, 10)	1
  (0, 9)	1
  (0, 5)	1
  (1, 14)	2
  (1, 6)	1
  (1, 5)	1
  (1, 2)	1
  (1, 0)	1
  (2, 14)	1
  (2, 6)	1
  (2, 13)	1
  (2, 1)	1
  (2, 12)	1
  (3, 14)	1
  (3, 9)	1
  (3, 5)	1
  (3, 7)	1
  (4, 14)	1
  (4, 6)	1
  (4, 7)	1
  (4, 4)	1


In [49]:
# Apply SVD (Latent Semantic Analysis)
n_components = 2  # Number of components after reducing dimensions
lsa = TruncatedSVD(n_components)
X_lsa = lsa.fit_transform(X)
X_lsa

array([[ 2.9971702 , -1.12180185],
       [ 2.43176445,  0.54765212],
       [ 1.32104798,  1.45967715],
       [ 1.4173243 , -0.50151648],
       [ 1.32627769,  0.61297725]])

In [50]:
# Normalize the transformed data
lsa_pipeline = make_pipeline(lsa, Normalizer(copy=False))
X_lsa_normalized = lsa_pipeline.fit_transform(X)
print("\nLSA Reduced Dimensionality:")
print(X_lsa_normalized)


LSA Reduced Dimensionality:
[[ 0.93654853 -0.35053794]
 [ 0.97556634  0.21970507]
 [ 0.67102161  0.7414378 ]
 [ 0.94272191 -0.33357967]
 [ 0.90773814  0.4195372 ]]
