<a href="https://colab.research.google.com/github/TAMIDSpiyalong/Gen-AI/blob/main/Lecture_2_Text_Tokenization_and_Vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Tokenization and Vectorization

Text tokenization is a crucial step in NLP that splits a piece of text into smaller units called “tokens.” These tokens serve as the building blocks for text vectorization, which converts text into numerical representations (vectors) that machine learning models can work with. A good place to look up popular tokenizers is https://tiktokenizer.vercel.app/. Note that tokenization is not standardized: different tokenizers can split the same text into different tokens.
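
As a rough illustration of why tokenization is not standardized, the sketch below splits the same sentence with three simple schemes. The sentence and the three schemes are chosen here for illustration only; production tokenizers (such as the ones shown on the site above) typically use learned subword vocabularies instead.

```python
import re

text = "Tokenizers don't always agree!"

# Scheme 1: split on whitespace only (punctuation stays attached to words)
whitespace_tokens = text.split()

# Scheme 2: use a regular expression to separate words from punctuation
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

# Scheme 3: treat every single character as a token
char_tokens = list(text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
print(len(char_tokens), "character tokens")
```

The same input yields a different number of tokens under each scheme, which is why a model must always be paired with the tokenizer it was trained with.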

# Using Pretrained, Third-Party Vectorizers
There are a variety of pretrained, static word vector packages available. In this section, we'll use the **Google News** vectors, a collection of three million 300-dimensional vectors for words and phrases, trained on about 100 billion words from a Google News corpus (released in 2013).

In [None]:
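import gensim.downloader as api

# Download (on first use) and load the pretrained Google News vectors.
# The download is large (roughly 1.6 GB), so this cell can take several minutes.
wv = api.load('word2vec-google-news-300')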


In [None]:
# Print the word vector (embedding) for a specific word
print(wv['pizza'])  # 300-dimensional vector for the word 'pizza'

Visualizing word vectors is straightforward and can offer insight into the contexts the training algorithm picked up. Because these word vectors have 300 dimensions, we need to reduce them to two dimensions to plot them on a regular graph. This can be done with Principal Component Analysis (PCA). The TensorFlow Embedding Projector (https://projector.tensorflow.org/) also offers a good interactive 3D visualization.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Select some words to visualize
words = ['king', 'queen', 'man', 'woman', 'apple', 'banana', 'car', 'train', 'computer']

# Get the word vectors for these words
word_vectors = [wv[word] for word in words]

# Perform PCA to reduce dimensionality to 2D
pca = PCA(n_components=2)
word_vectors_2d = pca.fit_transform(word_vectors)

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], marker='o')

# Annotate each point with the corresponding word
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]))

plt.show()
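
To make the dimensionality reduction less of a black box, here is a minimal sketch of what PCA does, written with plain NumPy: center the data, take a singular value decomposition, and project onto the top two directions. The random matrix `X` is a made-up stand-in for nine 300-dimensional word vectors, so that the sketch runs without downloading the model; it is not the exact scikit-learn implementation, and results can differ from `PCA` by the sign of each axis.

```python
import numpy as np

# Hypothetical stand-in data: 9 "words", each a 300-dimensional vector
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 300))

# Step 1: center each dimension at zero
X_centered = X - X.mean(axis=0)

# Step 2: SVD; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: project onto the first two principal directions
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance kept by the two components
explained = (S[:2] ** 2) / np.sum(S ** 2)

print(X_2d.shape)
print(explained)
```

The explained-variance fractions indicate how much information the 2D plot retains; with real word vectors, low values mean that distances on the plot should be read with caution.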


Beyond 3D, word embeddings are difficult to visualize, but we can still measure how close two words are by computing the cosine similarity of their vectors. There are many ways to calculate cosine similarity; here we use the scikit-learn function `cosine_similarity`.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# List of words
words = ['texas', 'pizza', 'train', 'doctor', 'nurse', 'lion', 'tiger', 'airplane', 'helicopter', 'grape', 'mango', 'laptop', 'smartphone']

# Get the word vectors for these words
word_vectors = np.array([wv[word] for word in words])

# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(word_vectors)

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, xticklabels=words, yticklabels=words, annot=True, cmap='coolwarm', linewidths=0.5)

# Add a title
plt.title('Cosine Similarity Heatmap')
plt.show()
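
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand in pure Python; the helper name `cosine_sim` and the small three-dimensional vectors are invented for illustration and are not real word embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]    # same direction as v1, twice as long
v3 = [0.0, 0.0, 3.0]    # perpendicular to v1
v4 = [-1.0, -2.0, 0.0]  # opposite direction to v1

print(cosine_sim(v1, v2))  # close to 1: same direction
print(cosine_sim(v1, v3))  # 0: unrelated directions
print(cosine_sim(v1, v4))  # close to -1: opposite directions
```

Because the measure depends only on direction and not on vector length, the values always fall between -1 and 1, which is the range shown on the heatmap's color scale.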

