# Natural Language Processing: Text Embeddings and Semantic Analysis

This notebook demonstrates fundamental NLP techniques for representing and analyzing text data using modern embedding approaches.

## What's in this Notebook?

### 1. Semantic Search with GloVe Embeddings
In the first exercise, we build a simple semantic search engine using GloVe word embeddings. This demonstrates:
- How to represent documents as vectors by averaging word embeddings
- Computing semantic similarity between texts using cosine similarity
- Implementing a basic retrieval system that understands meaning, not just keywords

### 2. Tweet Visualization with Transformer Embeddings
In the second exercise, we visualize tweet embeddings in 2D space using:
- A pre-trained sentence-transformer model (`all-mpnet-base-v2`) to generate 768-dimensional text embeddings
- Dimensionality reduction with PCA to visualize high-dimensional data
- Techniques for exploring patterns and clusters in text data

## Key Concepts

**Word/Sentence Embeddings**: Numerical representations of text that capture semantic meaning in a vector space. Similar texts have similar vectors.
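
As a minimal sketch of how word vectors can be combined into a sentence vector, the snippet below averages them. The three-dimensional `toy_vectors` table and the `embed` helper are invented for illustration; the GloVe file used later in this notebook (`glove.6B.100d.txt`) stores 100-dimensional vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def embed(text, vectors, dim=3):
    """Average the vectors of the in-vocabulary words; zeros if none match."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

pets = embed("cat dog", toy_vectors)
print(pets)  # element-wise mean of the "cat" and "dog" vectors
```

Averaging discards word order, so "dog bites man" and "man bites dog" receive the same vector under this scheme.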

**Semantic Search**: Finding documents based on meaning rather than exact keyword matching.
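
Once queries and documents live in the same vector space, retrieval reduces to ranking documents by their similarity to the query vector. The sketch below shows one way to do that ranking with matrix operations; `top_k` and the toy 2-D vectors are assumptions made for this example, not part of the exercises.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows most similar to query_vec."""
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows with zero norm get a score of 0 instead of NaN.
    scores = np.divide(doc_matrix @ query_vec, denom,
                       out=np.zeros(len(doc_matrix)), where=denom > 0)
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D "document" vectors, invented for illustration.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
results = top_k(np.array([1.0, 0.1]), docs, k=2)
print(results)
```

Here document 0 points in almost the same direction as the query, so it ranks first, followed by document 2.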

**Dimensionality Reduction**: Techniques to reduce high-dimensional data (like embeddings) to lower dimensions while preserving important relationships.
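
To make the idea concrete, here is a small sketch of PCA computed directly from a singular value decomposition. It is meant as an illustration of the technique, not as a replacement for `sklearn.decomposition.PCA`, which the second exercise uses; the helper name `pca_2d` and the random input data are assumptions.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:2].T
    explained_ratio = (S**2 / np.sum(S**2))[:2]
    return projected, explained_ratio

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # random stand-in for 50 ten-dimensional embeddings
points, ratio = pca_2d(X)
print(points.shape, ratio)
```

The signs of the projected coordinates may differ from another PCA implementation, since each principal direction is only defined up to a sign flip.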

**Cosine Similarity**: A measure of similarity between two non-zero vectors, commonly used to compare document embeddings.
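
For two vectors a and b, the score is the dot product divided by the product of their lengths: a·b / (‖a‖ ‖b‖). It is undefined when either vector is all zeros, which can happen in the first exercise when a document contains no in-vocabulary words. The sketch below handles that case by returning 0.0, which is a convention chosen for this example rather than part of the definition.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return 0.0  # assumed fallback: the measure is undefined for zero vectors
    return float(np.dot(a, b) / norm_product)

same = cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))        # same direction
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))  # right angle
degenerate = cosine_similarity(np.array([1.0, 0.0]), np.zeros(2))           # zero vector
print(same, orthogonal, degenerate)
```

Because the score depends only on direction, scaling a vector by a positive constant leaves it unchanged.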

These exercises demonstrate how modern NLP techniques turn unstructured text into numerical representations that can be compared, searched, and visualized.


## GloVe

### Semantic Search with GloVe Embeddings
In the first exercise, we build a simple semantic search engine using GloVe word embeddings. This demonstrates:
- How to represent documents as vectors by averaging word embeddings
- Computing semantic similarity between texts using cosine similarity
- Implementing a basic retrieval system that understands meaning, not just keywords

### Problem

In [None]:
!pip install gensim

In [None]:
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors

# Step 1: Load the GloVe embeddings
def load_glove_embeddings():
    """
    Load pre-trained GloVe word embeddings.
    """
    # Path to the pre-trained 100-dimensional GloVe vectors (plain text, no header line)
    glove_path = 'glove.6B.100d.txt'
    print("Loading GloVe embeddings...")
    word_vectors = KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)
    print(f"Loaded {len(word_vectors.key_to_index)} word vectors!")
    return word_vectors

# Step 2: Create document vectors
def create_document_vectors(documents, word_vectors):
    """
    Convert each document into a vector by averaging the word vectors.

    Args:
        documents: List of text documents
        word_vectors: Loaded word embeddings

    Returns:
        List of document vectors
    """
    doc_vectors = []

    # TODO: For each document, create a vector by averaging the vectors of its words
    for doc in documents:
        # TODO: Extract words that exist in the word_vectors vocabulary
        words = [w.lower() for w in doc.split() if w.lower() in ???]

        # If words exist, calculate the average vector; otherwise, use zeros
        if words:
            # TODO: Calculate the mean of word vectors
            doc_vector = np.???([word_vectors[w] for w in words], axis=0)
        else:
            doc_vector = np.zeros(word_vectors.vector_size)

        doc_vectors.append(doc_vector)

    return doc_vectors

# Step 3: Calculate similarity between query and documents
def calculate_similarities(query_vector, doc_vectors):
    """
    Calculate cosine similarity between query vector and all document vectors.

    Args:
        query_vector: Vector representation of the query
        doc_vectors: List of document vectors

    Returns:
        List of (document_index, similarity_score) tuples
    """
    similarities = []

    # TODO: Calculate cosine similarity between query vector and each document vector
    for i, doc_vector in enumerate(doc_vectors):
        # TODO: Implement cosine similarity formula
        similarity = np.dot(???, doc_vector) / (np.linalg.norm(query_vector) * np.linalg.norm(doc_vector))
        similarities.append((i, similarity))

    return similarities

# Step 4: Create the search engine
def semantic_search():
    """
    Main function to run the semantic search engine.
    """
    # Load word embeddings
    word_vectors = load_glove_embeddings()

    # Load documents
    documents = pd.read_csv('facts_collection.csv')['facts'].tolist()

    # Create document vectors
    doc_vectors = create_document_vectors(documents, word_vectors)

    print("Semantic Search Engine")
    print("=====================")

    # Search loop
    while True:
        query = input("\nEnter search query (or 'exit'): ")
        if query.lower() == 'exit':
            break

        # TODO: Convert query to vector using the same method as documents
        query_words = [w.lower() for w in query.split() if w.lower() in ???]

        if not query_words:
            print("No query words found in vocabulary")
            continue

        query_vector = np.mean([word_vectors[w] for w in ???], axis=0)

        # Calculate similarities
        similarities = calculate_similarities(query_vector, ???)

        # Sort by similarity (highest first)
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Display the top 5 results
        print("\nSearch results:")
        for i, similarity in similarities[:5]:
            print(f"{similarity:.4f}: {documents[i]}")

semantic_search()

### Solution

In [90]:
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors

# Step 1: Load the GloVe embeddings
def load_glove_embeddings():
    """
    Load pre-trained GloVe word embeddings.
    """
    # Path to the pre-trained 100-dimensional GloVe vectors (plain text, no header line)
    glove_path = 'glove.6B.100d.txt'
    print("Loading GloVe embeddings...")
    word_vectors = KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)
    print(f"Loaded {len(word_vectors.key_to_index)} word vectors!")
    return word_vectors

# Step 2: Create document vectors
def create_document_vectors(documents, word_vectors):
    """
    Convert each document into a vector by averaging the word vectors.

    Args:
        documents: List of text documents
        word_vectors: Loaded word embeddings

    Returns:
        List of document vectors
    """
    doc_vectors = []

    # TODO: For each document, create a vector by averaging the vectors of its words
    for doc in documents:
        # TODO: Extract words that exist in the word_vectors vocabulary
        words = [w.lower() for w in doc.split() if w.lower() in word_vectors]

        # TODO: If words exist, calculate the average vector; otherwise, use zeros
        if words:
            # TODO: Calculate the mean of word vectors
            doc_vector = np.mean([word_vectors[w] for w in words], axis=0)
        else:
            doc_vector = np.zeros(word_vectors.vector_size)

        doc_vectors.append(doc_vector)

    return doc_vectors

# Step 3: Calculate similarity between query and documents
def calculate_similarities(query_vector, doc_vectors):
    """
    Calculate cosine similarity between query vector and all document vectors.

    Args:
        query_vector: Vector representation of the query
        doc_vectors: List of document vectors

    Returns:
        List of (document_index, similarity_score) tuples
    """
    similarities = []

    # TODO: Calculate cosine similarity between query vector and each document vector
    for i, doc_vector in enumerate(doc_vectors):
        # TODO: Implement cosine similarity formula
        similarity = np.dot(query_vector, doc_vector) / (np.linalg.norm(query_vector) * np.linalg.norm(doc_vector))
        similarities.append((i, similarity))

    return similarities

# Step 4: Create the search engine
def semantic_search():
    """
    Main function to run the semantic search engine.
    """
    # Load word embeddings
    word_vectors = load_glove_embeddings()

    # Load documents
    documents = pd.read_csv('facts_collection.csv')['facts'].tolist()

    # Create document vectors
    doc_vectors = create_document_vectors(documents, word_vectors)

    print("Semantic Search Engine")
    print("=====================")

    # Search loop
    while True:
        query = input("\nEnter search query (or 'exit'): ")
        if query.lower() == 'exit':
            break

        # TODO: Convert query to vector using the same method as documents
        query_words = [w.lower() for w in query.split() if w.lower() in word_vectors]

        if not query_words:
            print("No query words found in vocabulary")
            continue

        query_vector = np.mean([word_vectors[w] for w in query_words], axis=0)

        # Calculate similarities
        similarities = calculate_similarities(query_vector, doc_vectors)

        # Sort by similarity (highest first)
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Display the top 5 results
        print("\nSearch results:")
        for i, similarity in similarities[:5]:
            print(f"{similarity:.4f}: {documents[i]}")

In [91]:
semantic_search()

Loading GloVe embeddings...
Loaded 400000 word vectors!
Semantic Search Engine

Enter search query (or 'exit'): planes

Search results:
0.5592: Airplanes are the fastest way to travel long distances.
0.5179: Penguins cannot fly but are excellent swimmers.
0.5173: Cars need regular maintenance to run efficiently.
0.5007: Alaska has more coastline than all of the other 49 U.S. states combined.
0.4955: Lightning strikes the Earth about 8.6 million times per day.

Enter search query (or 'exit'): moon

Search results:
0.5426: Venus rotates in the opposite direction compared to most planets in our solar system.
0.5417: A day on Mercury lasts about 176 Earth days.
0.5245: Mt. Thor on Baffin Island, Canada, has the world's greatest vertical drop at 4,101 feet (1,250 meters).
0.5193: Mount Everest is the highest mountain on Earth.
0.5160: The first feature-length animated movie was Disney's Snow White and the Seven Dwarfs.

Enter search query (or 'exit'): water

Search results:
0.7541: Water co

KeyboardInterrupt: Interrupted by user
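
One limitation of splitting on whitespace with `doc.split()` is that punctuation stays attached to words, so a token such as `distances.` is unlikely to be found in the GloVe vocabulary and is silently dropped from the average. A possible refinement, sketched below with a hypothetical `tokenize` helper that is not part of the original notebook, extracts lowercase runs of letters and digits with a regular expression.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Airplanes are the fastest way to travel long distances.")
print(tokens)
```

Using `tokenize(doc)` in place of `doc.split()` for both documents and queries would keep the two representations consistent. Whether it improves the rankings on this dataset would need to be checked by rerunning the search.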

## Sentence Transformer Embeddings

### Tweet Visualization with Transformer Embeddings
In the second exercise, we visualize tweet embeddings in 2D space using:
- A pre-trained sentence-transformer model (`all-mpnet-base-v2`) to generate 768-dimensional text embeddings
- Dimensionality reduction with PCA to visualize high-dimensional data
- Techniques for exploring patterns and clusters in text data

### Problem

In [92]:
import pandas as pd
import numpy as np
import plotly.express as px  # provides px.scatter, used for the plot in Step 4
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Step 1: Load the tweet data
# TODO: Load the tweets dataset into a pandas DataFrame
df = pd.???('tweets.csv')

# Print basic information about the dataset
print(f"Dataset contains {len(df)} tweets")
print(df.head())

# Step 2: Generate embeddings using a pre-trained transformer model
# TODO: Initialize the SentenceTransformer model
model = ???('all-mpnet-base-v2')

# TODO: Convert the tweets into embeddings
print("Generating embeddings for tweets...")
embeddings = model.???(df['text'].tolist())

# Print information about the embeddings
print(f"Embedding shape: {embeddings.shape}")
print(f"Each tweet is represented by a {embeddings.shape[1]}-dimensional vector")

# Step 3: Apply dimensionality reduction
# TODO: Initialize PCA to reduce dimensions to 2
pca = PCA(???=2)

# TODO: Apply PCA to the embeddings
embeddings_2d = pca.fit_transform(???)

# Print information about the reduced embeddings
print(f"Reduced embedding shape: {embeddings_2d.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca.explained_variance_ratio_):.2%}")

# Step 4: Create the scatter plot
fig = px.scatter(
    x=embeddings_2d[:, 0],
    y=embeddings_2d[:, 1],
    color=df['sentiment'],
    labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
    title='2D PCA of Text Embeddings Colored by Sentiment',
    hover_data={'text': df['text']}  # Show the tweet text on hover
)

# Show the plot
fig.show()


Dataset contains 100 tweets
                                         text  sentiment
0      Everyone's success makes me feel worse          0
1               Nothing ever works out for me          0
2           Lost my keys again. So frustrated          0
3            Don't talk to me. Bad mood today          0
4  Kindness is free. Sprinkle that everywhere          1
Generating embeddings for tweets...
Embedding shape: (100, 768)
Each tweet is represented by a 768-dimensional vector
Reduced embedding shape: (100, 2)
Explained variance ratio: [0.07924224 0.05318008]
Total explained variance: 13.24%
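
The two components in the output above retain only about 13% of the variance, so the 2D plot is a coarse view of the embedding space and apparent clusters should be read with caution. One way to see how many components a given share of variance would require is to look at the cumulative explained variance. The sketch below uses random data as a stand-in for the real embeddings, and `components_for` is a hypothetical helper introduced for this example.

```python
import numpy as np

def components_for(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # min() guards against floating-point round-off when target is close to 1.0
    return min(int(np.searchsorted(cumulative, target) + 1), len(cumulative))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # stand-in for an (n_tweets, n_dims) embedding matrix
k = components_for(X, target=0.90)
print(k)
```

With the real data, passing the `embeddings` array from Step 2 instead of `X` would give the corresponding count for the tweet embeddings.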


### Solution

In [71]:
import pandas as pd
import numpy as np
import plotly.express as px  # provides px.scatter, used for the plot in Step 4
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Step 1: Load the tweet data
# TODO: Load the tweets dataset into a pandas DataFrame
df = pd.read_csv('tweets.csv')

# Print basic information about the dataset
print(f"Dataset contains {len(df)} tweets")
print(df.head())

# Step 2: Generate embeddings using a pre-trained transformer model
# TODO: Initialize the SentenceTransformer model
model = SentenceTransformer('all-mpnet-base-v2')

# TODO: Convert the tweets into embeddings
print("Generating embeddings for tweets...")
embeddings = model.encode(df['text'].tolist())

# Print information about the embeddings
print(f"Embedding shape: {embeddings.shape}")
print(f"Each tweet is represented by a {embeddings.shape[1]}-dimensional vector")

# Step 3: Apply dimensionality reduction
# TODO: Initialize PCA to reduce dimensions to 2
pca = PCA(n_components=2)

# TODO: Apply PCA to the embeddings
embeddings_2d = pca.fit_transform(embeddings)

# Print information about the reduced embeddings
print(f"Reduced embedding shape: {embeddings_2d.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca.explained_variance_ratio_):.2%}")

# Step 4: Create the scatter plot
fig = px.scatter(
    x=embeddings_2d[:, 0],
    y=embeddings_2d[:, 1],
    color=df['sentiment'],
    labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
    title='2D PCA of Text Embeddings Colored by Sentiment',
    hover_data={'text': df['text']}  # Show the tweet text on hover
)

# Show the plot
fig.show()
