# Day 2: Word Embeddings for Financial Text Analysis
## Word2Vec, GloVe, and Financial NLP Applications

---

## Learning Objectives
- Understand the theory behind word embeddings and distributed representations
- Implement Word2Vec (Skip-gram and CBOW) for financial text
- Work with pre-trained GloVe embeddings
- Build custom financial word embeddings
- Apply embeddings to sentiment analysis and document similarity

## Why Word Embeddings in Finance?
- Capture semantic relationships in financial text (e.g., "bullish" ≈ "optimistic")
- Enable numerical representation of news, earnings calls, SEC filings
- Foundation for advanced NLP models in trading systems
- Measure semantic similarity between financial documents

## 1. Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Word embedding libraries
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.phrases import Phrases, Phraser
import gensim.downloader as api

# Dimensionality reduction for visualization
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("Setup complete!")

## 2. Theory: From One-Hot to Distributed Representations

### The Problem with One-Hot Encoding
- **Sparse**: Vector size = vocabulary size (can be 100K+ words)
- **No semantics**: cos("bullish", "bearish") = cos("bullish", "apple") = 0
- **No generalization**: Model can't leverage word relationships

### Word Embeddings Solution
- **Dense vectors**: Typically 50-300 dimensions
- **Semantic meaning**: Similar words have similar vectors
- **Learned from context**: "You shall know a word by the company it keeps" (Firth, 1957)

In [None]:
# Demonstrate One-Hot vs Embedding representation

# Simple vocabulary
vocab = ['stock', 'bond', 'equity', 'bullish', 'bearish', 'market']

# One-hot encoding
print("ONE-HOT ENCODING")
print("="*50)
for i, word in enumerate(vocab):
    one_hot = np.zeros(len(vocab))
    one_hot[i] = 1
    print(f"{word:10} -> {one_hot}")

print("\n" + "="*50)
print("Problems:")
print(f"- Vector dimension: {len(vocab)} (grows with vocabulary)")
print("- Cosine similarity between any two distinct words: 0")
print(f"- No semantic information captured")

In [None]:
# Hypothetical embedding (what we want to learn)
print("\nWORD EMBEDDINGS (Hypothetical)")
print("="*50)

# Simulated 4-dimensional embeddings
embeddings = {
    'stock':   np.array([0.8, 0.2, 0.6, 0.1]),
    'equity':  np.array([0.7, 0.3, 0.6, 0.2]),  # Similar to stock
    'bond':    np.array([0.3, 0.8, 0.4, 0.1]),
    'bullish': np.array([0.5, 0.4, 0.9, 0.8]),
    'bearish': np.array([0.5, 0.4, 0.1, 0.2]),  # Opposite sentiment
    'market':  np.array([0.6, 0.5, 0.5, 0.5]),
}

for word, vec in embeddings.items():
    print(f"{word:10} -> {vec}")

# Calculate similarities
print("\nCosine Similarities:")
print(f"stock-equity: {cosine_similarity([embeddings['stock']], [embeddings['equity']])[0,0]:.3f}")
print(f"stock-bond:   {cosine_similarity([embeddings['stock']], [embeddings['bond']])[0,0]:.3f}")
print(f"bullish-bearish: {cosine_similarity([embeddings['bullish']], [embeddings['bearish']])[0,0]:.3f}")

## 3. Word2Vec: Architecture and Training

### Two Architectures:

**1. Skip-gram**: Predict context words from center word
- Input: "earnings"
- Output: ["quarterly", "report", "beat", "expectations"]
- Better for rare words, smaller datasets

**2. CBOW (Continuous Bag of Words)**: Predict center word from context
- Input: ["quarterly", "report", "beat", "expectations"]
- Output: "earnings"
- Faster training, better for frequent words

### Key Hyperparameters:
- `vector_size`: Embedding dimension (100-300 typical)
- `window`: Context window size (5-10 for finance)
- `min_count`: Minimum word frequency threshold
- `sg`: 0=CBOW, 1=Skip-gram
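
To make the two architectures and the `window` parameter concrete, the following minimal sketch enumerates the training examples each one would see for a single tokenized sentence. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical and exist only for illustration. Real Word2Vec implementations add further steps (for example negative sampling and subsampling of frequent words) that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: one training example per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: one training example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["quarterly", "earnings", "beat", "expectations"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print("Skip-gram pairs:")
for center, context in sg:
    print(f"  {center} -> {context}")

print("CBOW examples:")
for context, center in cb:
    print(f"  {context} -> {center}")
```

Skip-gram produces one example per (center, context) pair, so a rare center word still contributes several updates. CBOW produces one example per position and combines the context words, which is part of why it trains faster.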

In [None]:
# Create sample financial corpus
financial_corpus = [
    "The stock market rallied on strong earnings reports from major tech companies.",
    "Investors turned bullish after the Federal Reserve signaled interest rate cuts.",
    "Bond yields fell as traders anticipated looser monetary policy.",
    "The equity market experienced high volatility amid trade war concerns.",
    "Hedge funds increased their bullish positions in technology stocks.",
    "The bearish sentiment drove stock prices lower across all sectors.",
    "Market analysts expect the bull market to continue through next quarter.",
    "Corporate earnings exceeded analyst expectations driving stock prices higher.",
    "The Federal Reserve maintained its hawkish stance on inflation.",
    "Treasury yields spiked following stronger than expected jobs data.",
    "Investors rotated from growth stocks to value stocks.",
    "The market correction wiped out gains from the previous quarter.",
    "Options traders bet on increased volatility ahead of earnings season.",
    "The company announced a stock buyback program worth billions.",
    "Dividend stocks outperformed during the market downturn.",
    "Credit spreads widened signaling increased default risk.",
    "The IPO market remained strong with several high profile listings.",
    "Portfolio managers reduced equity exposure and increased cash holdings.",
    "Market makers reported record trading volumes during the selloff.",
    "The central bank intervention stabilized currency markets.",
    "Quantitative trading firms capitalized on market inefficiencies.",
    "Risk parity strategies underperformed during the volatility spike.",
    "Momentum stocks led the market rally while value lagged.",
    "The yield curve inverted raising recession concerns.",
    "Algorithmic trading accounted for majority of daily volume.",
    "Institutional investors accumulated positions in defensive sectors.",
    "The company beat earnings estimates and raised forward guidance.",
    "Short sellers covered positions as the stock price surged.",
    "Market sentiment turned negative on geopolitical tensions.",
    "The tech sector led the market higher on strong revenue growth."
]

print(f"Corpus size: {len(financial_corpus)} sentences")
print(f"\nSample sentences:")
for sent in financial_corpus[:3]:
    print(f"  - {sent}")

In [None]:
# Text preprocessing for financial text

def preprocess_financial_text(text, remove_stopwords=False):
    """
    Preprocess financial text for word embedding training.
    
    For word embeddings, we often keep stopwords because:
    - Context matters for learning word relationships
    - Skip-gram/CBOW use surrounding words
    """
    # Lowercase
    text = text.lower()
    
    # Keep only lowercase letters and whitespace. This also strips digits,
    # punctuation, and symbols such as $, %, and & (so "S&P 500" becomes "sp").
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Optionally remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stop_words]
    
    return tokens

# Preprocess corpus
processed_corpus = [preprocess_financial_text(sent) for sent in financial_corpus]

print("Original:", financial_corpus[0])
print("Processed:", processed_corpus[0])

In [None]:
# Detect bigrams (two-word phrases) - important for finance
# Examples: "interest_rate", "federal_reserve", "stock_market"

# Train phrase detector
phrases = Phrases(processed_corpus, min_count=2, threshold=5)
bigram = Phraser(phrases)

# Apply bigrams to corpus
corpus_with_bigrams = [bigram[sent] for sent in processed_corpus]

print("Without bigrams:", processed_corpus[0])
print("With bigrams:", corpus_with_bigrams[0])

# Check detected phrases
print("\nDetected phrases:")
for phrase, score in phrases.find_phrases(processed_corpus).items():
    print(f"  {phrase}: {score:.2f}")

In [None]:
# Train Word2Vec model - Skip-gram

w2v_skipgram = Word2Vec(
    sentences=corpus_with_bigrams,
    vector_size=100,      # Embedding dimension
    window=5,             # Context window
    min_count=1,          # Minimum word frequency (low for small corpus)
    workers=4,            # Parallel training threads
    sg=1,                 # Skip-gram (1) vs CBOW (0)
    epochs=100,           # Training epochs (more for small corpus)
    seed=42               # Fixed seed; fully deterministic runs also need workers=1
)

print("Skip-gram Model Summary:")
print(f"  Vocabulary size: {len(w2v_skipgram.wv)}")
print(f"  Embedding dimension: {w2v_skipgram.wv.vector_size}")
print(f"  Training words: {w2v_skipgram.corpus_total_words}")

In [None]:
# Train Word2Vec model - CBOW

w2v_cbow = Word2Vec(
    sentences=corpus_with_bigrams,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    sg=0,                 # CBOW
    epochs=100,
    seed=42
)

print("CBOW Model Summary:")
print(f"  Vocabulary size: {len(w2v_cbow.wv)}")
print(f"  Embedding dimension: {w2v_cbow.wv.vector_size}")

In [None]:
# Explore learned embeddings

def explore_embeddings(model, word):
    """Explore word embedding properties."""
    if word not in model.wv:
        print(f"'{word}' not in vocabulary")
        return
    
    print(f"Word: '{word}'")
    print(f"Vector shape: {model.wv[word].shape}")
    print(f"Vector (first 10 dims): {model.wv[word][:10].round(3)}")
    print(f"\nMost similar words:")
    for similar_word, similarity in model.wv.most_similar(word, topn=5):
        print(f"  {similar_word}: {similarity:.3f}")

# Explore financial terms
print("="*60)
explore_embeddings(w2v_skipgram, 'market')
print("\n" + "="*60)
explore_embeddings(w2v_skipgram, 'bullish')
print("\n" + "="*60)
explore_embeddings(w2v_skipgram, 'stock')

In [None]:
# Word analogies: A is to B as C is to ?
# Classic example: king - man + woman = queen
# Financial: bullish - positive + negative = bearish?

def word_analogy(model, word_a, word_b, word_c):
    """
    Find word D such that: A is to B as C is to D
    Computed as: D = B - A + C
    """
    try:
        result = model.wv.most_similar(
            positive=[word_b, word_c],
            negative=[word_a],
            topn=3
        )
        print(f"{word_a} : {word_b} :: {word_c} : ?")
        for word, score in result:
            print(f"  -> {word} ({score:.3f})")
    except KeyError as e:
        print(f"Word not in vocabulary: {e}")

# Note: With small corpus, analogies may not work well
# This demonstrates the concept
print("Word Analogies (limited by small corpus):")
print("="*50)
word_analogy(w2v_skipgram, 'stock', 'stocks', 'market')

## 4. Visualizing Word Embeddings

In [None]:
def visualize_embeddings(model, words=None, method='tsne', perplexity=5):
    """
    Visualize word embeddings in 2D using t-SNE or PCA.
    """
    if words is None:
        words = list(model.wv.key_to_index.keys())[:50]
    
    # Filter words that exist in vocabulary
    words = [w for w in words if w in model.wv]
    
    # Get vectors
    vectors = np.array([model.wv[w] for w in words])
    
    # Dimensionality reduction
    if method == 'tsne':
        reducer = TSNE(n_components=2, random_state=42, perplexity=min(perplexity, len(words)-1))
    else:
        reducer = PCA(n_components=2, random_state=42)
    
    coords = reducer.fit_transform(vectors)
    
    # Plot
    fig, ax = plt.subplots(figsize=(14, 10))
    ax.scatter(coords[:, 0], coords[:, 1], alpha=0.7, s=100)
    
    for i, word in enumerate(words):
        ax.annotate(word, (coords[i, 0], coords[i, 1]), 
                   fontsize=10, alpha=0.8)
    
    ax.set_title(f'Word Embeddings Visualization ({method.upper()})', fontsize=14)
    ax.set_xlabel('Dimension 1')
    ax.set_ylabel('Dimension 2')
    plt.tight_layout()
    plt.show()

# Visualize all words
visualize_embeddings(w2v_skipgram, method='pca')

In [None]:
# Visualize specific financial word groups
financial_words = [
    'stock', 'stocks', 'equity', 'market', 'bond', 'bonds',
    'bullish', 'bearish', 'rally', 'selloff',
    'earnings', 'revenue', 'growth', 'profit',
    'volatility', 'risk', 'trading', 'investors'
]

# Filter to words in vocabulary
available_words = [w for w in financial_words if w in w2v_skipgram.wv]
print(f"Available words: {available_words}")

if len(available_words) >= 5:
    visualize_embeddings(w2v_skipgram, words=available_words, method='pca')

## 5. GloVe: Global Vectors for Word Representation

### Key Differences from Word2Vec:
- **Global statistics**: Uses word co-occurrence matrix from entire corpus
- **Objective**: Weighted least squares on log co-occurrence counts
- **Formula**: $w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij})$
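
As a rough illustration of the objective above, the sketch below builds a distance-weighted co-occurrence matrix for a two-sentence toy corpus and evaluates the weighted least-squares loss for randomly initialized vectors. The function names, the toy corpus, and the random initialization are assumptions made for this example. The weighting function, with `x_max=100` and `alpha=0.75`, follows the values commonly cited for GloVe. No training loop is included.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Build a symmetric co-occurrence matrix; a pair d tokens apart adds 1/d."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    X[index[w], index[s[j]]] += 1.0 / abs(i - j)
    return vocab, X


def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over non-zero X_ij."""
    i, j = np.nonzero(X)  # zero counts are skipped, so log(0) never occurs
    weights = np.minimum(1.0, (X[i, j] / x_max) ** alpha)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return float(np.sum(weights * (pred - np.log(X[i, j])) ** 2))


sentences = [["stock", "market", "rallied"], ["bond", "market", "fell"]]
vocab, X = cooccurrence_matrix(sentences, window=2)

rng = np.random.default_rng(0)
dim = 4  # tiny embedding size, for illustration only
W = rng.normal(scale=0.1, size=(len(vocab), dim))
W_tilde = rng.normal(scale=0.1, size=(len(vocab), dim))
b = np.zeros(len(vocab))
b_tilde = np.zeros(len(vocab))

loss = glove_loss(W, W_tilde, b, b_tilde, X)
print("Vocabulary:", vocab)
print("Loss at random initialization:", round(loss, 4))
```

Training GloVe amounts to minimizing this loss with respect to `W`, `W_tilde`, `b`, and `b_tilde`. The weighting term keeps very frequent pairs from dominating and down-weights rare, noisy ones.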

### Advantages:
- Better captures global corpus statistics
- Often performs better on word analogy tasks
- Pre-trained on massive corpora (Wikipedia, Common Crawl)

In [None]:
# Load pre-trained GloVe embeddings
# Available models: glove-wiki-gigaword-50, glove-wiki-gigaword-100, 
#                   glove-wiki-gigaword-200, glove-wiki-gigaword-300

print("Loading pre-trained GloVe embeddings (this may take a moment)...")
try:
    glove = api.load('glove-wiki-gigaword-100')  # 100-dimensional GloVe
    print(f"\nGloVe Model loaded successfully!")
    print(f"  Vocabulary size: {len(glove)}")
    print(f"  Embedding dimension: {glove.vector_size}")
except Exception as e:
    print(f"Error loading GloVe: {e}")
    print("Continuing with Word2Vec models only.")
    glove = None

In [None]:
# Explore GloVe embeddings for financial terms
if glove is not None:
    financial_terms = ['stock', 'bond', 'equity', 'market', 'bullish', 
                       'bearish', 'rally', 'crash', 'inflation', 'recession']
    
    print("Financial Term Similarities (GloVe):")
    print("="*60)
    
    for term in financial_terms:
        if term in glove:
            print(f"\n'{term}' most similar to:")
            for word, sim in glove.most_similar(term, topn=5):
                print(f"    {word}: {sim:.3f}")

In [None]:
# Word analogies with GloVe - much better with large pre-trained model
if glove is not None:
    print("Word Analogies with GloVe:")
    print("="*60)
    
    analogies = [
        ('man', 'woman', 'king'),           # Classic: king - man + woman = queen
        ('stock', 'stocks', 'bond'),        # Singular to plural
        ('buy', 'sell', 'long'),            # Trading opposites: long/short
        ('profit', 'loss', 'gain'),         # Financial opposites
        ('company', 'ceo', 'country'),      # Leadership analogy
    ]
    
    for a, b, c in analogies:
        try:
            result = glove.most_similar(positive=[b, c], negative=[a], topn=3)
            print(f"\n{a} : {b} :: {c} : ?")
            for word, score in result:
                print(f"    -> {word} ({score:.3f})")
        except KeyError as e:
            print(f"Word not found: {e}")

## 6. Financial Sentiment Lexicon with Embeddings

In [None]:
# Build financial sentiment lexicon using embeddings

# Seed words for positive and negative sentiment
positive_seeds = ['bullish', 'growth', 'profit', 'gain', 'surge', 'rally', 
                  'outperform', 'upgrade', 'strong', 'beat']
negative_seeds = ['bearish', 'decline', 'loss', 'drop', 'crash', 'selloff',
                  'underperform', 'downgrade', 'weak', 'miss']

def expand_sentiment_lexicon(model, seed_words, topn=10):
    """
    Expand sentiment lexicon using word embeddings.
    Find words similar to seed words.
    """
    # Filter seeds that exist in vocabulary
    valid_seeds = [w for w in seed_words if w in model]
    
    if not valid_seeds:
        return []
    
    # Find similar words
    expanded = set(valid_seeds)
    for seed in valid_seeds:
        similar = model.most_similar(seed, topn=topn)
        for word, score in similar:
            if score > 0.5:  # Similarity threshold
                expanded.add(word)
    
    return list(expanded)

if glove is not None:
    # Expand lexicons
    expanded_positive = expand_sentiment_lexicon(glove, positive_seeds)
    expanded_negative = expand_sentiment_lexicon(glove, negative_seeds)
    
    print("Expanded Positive Sentiment Lexicon:")
    print(f"  Original: {len(positive_seeds)} words")
    print(f"  Expanded: {len(expanded_positive)} words")
    print(f"  Sample: {list(expanded_positive)[:15]}")
    
    print("\nExpanded Negative Sentiment Lexicon:")
    print(f"  Original: {len(negative_seeds)} words")
    print(f"  Expanded: {len(expanded_negative)} words")
    print(f"  Sample: {list(expanded_negative)[:15]}")

In [None]:
# Create sentiment scoring function using embeddings

def embedding_sentiment_score(text, model, pos_seeds, neg_seeds):
    """
    Calculate sentiment score using word embedding similarity.
    
    Approach: Compare text words to positive/negative seed centroids.
    """
    # Preprocess
    tokens = preprocess_financial_text(text)
    tokens = [t for t in tokens if t in model]
    
    if not tokens:
        return 0.0
    
    # Calculate centroids
    valid_pos = [w for w in pos_seeds if w in model]
    valid_neg = [w for w in neg_seeds if w in model]
    
    if not valid_pos or not valid_neg:
        return 0.0
    
    pos_centroid = np.mean([model[w] for w in valid_pos], axis=0)
    neg_centroid = np.mean([model[w] for w in valid_neg], axis=0)
    
    # Calculate text embedding (average of word vectors)
    text_embedding = np.mean([model[t] for t in tokens], axis=0)
    
    # Calculate similarities
    pos_sim = cosine_similarity([text_embedding], [pos_centroid])[0, 0]
    neg_sim = cosine_similarity([text_embedding], [neg_centroid])[0, 0]
    
    # Sentiment score: difference between positive and negative similarity
    return pos_sim - neg_sim

# Test on sample headlines
if glove is not None:
    test_headlines = [
        "Stock market surges on strong earnings beat",
        "Markets crash amid recession fears",
        "Company reports steady growth in quarterly revenue",
        "Massive selloff wipes out gains",
        "Investors optimistic about economic outlook",
        "Concerns grow over rising inflation"
    ]
    
    print("Embedding-based Sentiment Scores:")
    print("="*60)
    for headline in test_headlines:
        score = embedding_sentiment_score(headline, glove, positive_seeds, negative_seeds)
        sentiment = "POSITIVE" if score > 0.02 else "NEGATIVE" if score < -0.02 else "NEUTRAL"
        print(f"{score:+.4f} [{sentiment:8}] {headline}")

## 7. Document Embeddings for Financial Texts

In [None]:
# Document embedding methods

def document_embedding_average(text, model):
    """
    Simple average of word vectors.
    Fast but loses word order and importance.
    """
    tokens = preprocess_financial_text(text)
    vectors = [model[t] for t in tokens if t in model]
    
    if not vectors:
        return np.zeros(model.vector_size)
    
    return np.mean(vectors, axis=0)


def document_embedding_tfidf_weighted(text, model, idf_weights=None):
    """
    TF-IDF weighted average of word vectors.
    Emphasizes important/rare words.
    """
    tokens = preprocess_financial_text(text)
    
    if idf_weights is None:
        # No IDF available: fall back to uniform weights (a plain average)
        weights = {t: 1.0 for t in tokens}
    else:
        weights = {t: idf_weights.get(t, 1.0) for t in tokens}
    
    weighted_vectors = []
    total_weight = 0
    
    for token in tokens:
        if token in model:
            weight = weights[token]
            weighted_vectors.append(model[token] * weight)
            total_weight += weight
    
    if not weighted_vectors or total_weight == 0:
        return np.zeros(model.vector_size)
    
    return np.sum(weighted_vectors, axis=0) / total_weight


# Test document embeddings
if glove is not None:
    doc1 = "The stock market rallied strongly on positive earnings surprises."
    doc2 = "Equity markets surged following better than expected corporate profits."
    doc3 = "Bond yields fell as traders sought safe haven assets."
    
    emb1 = document_embedding_average(doc1, glove)
    emb2 = document_embedding_average(doc2, glove)
    emb3 = document_embedding_average(doc3, glove)
    
    print("Document Similarity (Cosine):")
    print(f"  Doc1 vs Doc2: {cosine_similarity([emb1], [emb2])[0,0]:.4f} (similar topic)")
    print(f"  Doc1 vs Doc3: {cosine_similarity([emb1], [emb3])[0,0]:.4f} (different topic)")
    print(f"  Doc2 vs Doc3: {cosine_similarity([emb2], [emb3])[0,0]:.4f} (different topic)")
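
The `idf_weights` argument of `document_embedding_tfidf_weighted` is never computed in this notebook. One possible way to build it is sketched below using only the standard library. The helper name `compute_idf_weights` and the toy documents are assumptions for illustration. The smoothed formula `log((1 + N) / (1 + df)) + 1` is one common IDF variant, not the only reasonable choice.

```python
import math
from collections import Counter


def compute_idf_weights(tokenized_docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1, so every weight is >= 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1.0
        for token, df in doc_freq.items()
    }


# Toy pre-tokenized documents; in the notebook these would come from
# preprocess_financial_text applied to each document.
toy_docs = [
    ["the", "stock", "market", "rallied"],
    ["the", "bond", "market", "fell"],
    ["the", "yield", "curve", "inverted"],
]

idf = compute_idf_weights(toy_docs)
for token in ["the", "market", "stock"]:
    print(f"{token:8} idf = {idf[token]:.3f}")
```

The resulting dictionary can be passed as `idf_weights=idf`, so that tokens appearing in every document (such as "the") contribute less to the document vector than rarer, more informative terms.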

In [None]:
# Financial document clustering using embeddings

if glove is not None:
    sample_docs = [
        # Earnings related
        "Company beat earnings expectations and raised guidance.",
        "Quarterly profits exceeded analyst estimates significantly.",
        "Revenue growth accelerated in the latest quarter.",
        
        # Market movement related
        "Stock market indices hit record highs today.",
        "Markets rallied on optimism about trade deal.",
        "Equity indices surged in afternoon trading.",
        
        # Economic policy related
        "Federal Reserve signaled potential rate cuts ahead.",
        "Central bank maintains accommodative monetary policy.",
        "Interest rate decision expected next week.",
        
        # Risk related
        "Market volatility spiked on geopolitical tensions.",
        "Investors flee to safe haven assets amid uncertainty.",
        "Risk appetite declined sharply this week."
    ]
    
    # Create document embeddings
    doc_embeddings = np.array([document_embedding_average(doc, glove) for doc in sample_docs])
    
    # Compute similarity matrix
    similarity_matrix = cosine_similarity(doc_embeddings)
    
    # Visualize
    plt.figure(figsize=(12, 10))
    sns.heatmap(similarity_matrix, 
                xticklabels=[f"Doc{i+1}" for i in range(len(sample_docs))],
                yticklabels=[f"Doc{i+1}" for i in range(len(sample_docs))],
                annot=True, fmt='.2f', cmap='RdYlGn', center=0.5)
    plt.title('Document Similarity Matrix\n(Docs 1-3: Earnings, 4-6: Market, 7-9: Policy, 10-12: Risk)')
    plt.tight_layout()
    plt.show()

## 8. Training Domain-Specific Financial Embeddings

In [None]:
# Simulate larger financial corpus for better embeddings
# In practice, you would use SEC filings, news articles, earnings calls

extended_financial_corpus = financial_corpus + [
    # More market movement text
    "The S&P 500 index rose two percent on strong economic data.",
    "NASDAQ composite fell sharply amid tech selloff.",
    "Dow Jones industrial average reached new highs.",
    "Small cap stocks outperformed large caps this quarter.",
    "Emerging markets rallied on dollar weakness.",
    
    # Earnings and company news
    "The company reported record quarterly earnings per share.",
    "Management raised full year revenue guidance.",
    "Cost cutting measures improved profit margins.",
    "New product launches drove top line growth.",
    "Restructuring charges impacted bottom line results.",
    
    # Macroeconomic
    "Employment data showed stronger than expected job growth.",
    "Inflation remained elevated above central bank target.",
    "Consumer spending increased despite rising prices.",
    "Manufacturing sector contracted for third straight month.",
    "Housing market showed signs of cooling.",
    
    # Trading and positioning
    "Institutional investors increased their equity allocations.",
    "Hedge funds reduced net long exposure to technology.",
    "Options activity suggested elevated uncertainty.",
    "Short interest declined from recent peaks.",
    "Trading volumes surged on expiration day.",
    
    # Fixed income
    "Investment grade credit spreads tightened.",
    "High yield bond issuance reached record levels.",
    "Duration risk increased in bond portfolios.",
    "Floating rate loans gained investor interest.",
    "Sovereign debt concerns resurfaced in Europe."
]

print(f"Extended corpus size: {len(extended_financial_corpus)} documents")

In [None]:
# Preprocess extended corpus
processed_extended = [preprocess_financial_text(sent) for sent in extended_financial_corpus]

# Detect bigrams
phrases_extended = Phrases(processed_extended, min_count=2, threshold=3)
bigram_extended = Phraser(phrases_extended)
corpus_extended_bigrams = [bigram_extended[sent] for sent in processed_extended]

# Train improved model
w2v_financial = Word2Vec(
    sentences=corpus_extended_bigrams,
    vector_size=100,
    window=7,             # Larger window for financial context
    min_count=1,
    workers=4,
    sg=1,                 # Skip-gram
    epochs=200,           # More epochs for small corpus
    negative=10,          # More negative samples
    seed=42
)

print(f"Financial Word2Vec Model:")
print(f"  Vocabulary size: {len(w2v_financial.wv)}")
print(f"  Vector dimension: {w2v_financial.wv.vector_size}")

In [None]:
# Compare custom financial model with GloVe

def compare_embeddings(word, custom_model, pretrained_model):
    """Compare word similarities between custom and pretrained models."""
    print(f"\nSimilar words to '{word}':")
    print("-" * 50)
    
    print("Custom Financial Model:")
    if word in custom_model.wv:
        for w, s in custom_model.wv.most_similar(word, topn=5):
            print(f"    {w}: {s:.3f}")
    else:
        print("    Word not in vocabulary")
    
    print("\nPre-trained GloVe:")
    if pretrained_model is not None and word in pretrained_model:
        for w, s in pretrained_model.most_similar(word, topn=5):
            print(f"    {w}: {s:.3f}")
    else:
        print("    Word not in vocabulary or model not loaded")

# Compare for financial terms
print("="*60)
print("EMBEDDING COMPARISON: Custom vs Pre-trained")
print("="*60)

for term in ['market', 'earnings', 'volatility']:
    compare_embeddings(term, w2v_financial, glove)

## 9. Practical Applications

In [None]:
# Application 1: Financial News Similarity Search

class FinancialNewsSimilarity:
    """
    Find similar financial news using word embeddings.
    Useful for: news clustering, duplicate detection, related article recommendation.
    """
    
    def __init__(self, embedding_model):
        self.model = embedding_model
        self.documents = []
        self.embeddings = []
    
    def add_documents(self, documents):
        """Add documents to the search index."""
        self.documents = documents
        self.embeddings = [
            document_embedding_average(doc, self.model) 
            for doc in documents
        ]
        self.embeddings = np.array(self.embeddings)
    
    def find_similar(self, query, topn=5):
        """Find documents most similar to query."""
        query_embedding = document_embedding_average(query, self.model)
        
        # Calculate similarities
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        
        # Get top matches
        top_indices = np.argsort(similarities)[-topn:][::-1]
        
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'similarity': similarities[idx]
            })
        
        return results

# Demo
if glove is not None:
    news_search = FinancialNewsSimilarity(glove)
    news_search.add_documents(extended_financial_corpus)
    
    query = "Stock prices increased after positive earnings announcement"
    print(f"Query: {query}")
    print("\nMost Similar Documents:")
    print("="*60)
    
    for result in news_search.find_similar(query, topn=5):
        print(f"  [{result['similarity']:.3f}] {result['document']}")

In [None]:
# Application 2: Financial Concept Clustering

from sklearn.cluster import KMeans

def cluster_financial_terms(model, terms, n_clusters=4):
    """
    Cluster financial terms using their embeddings.
    """
    # Filter terms in vocabulary
    valid_terms = [t for t in terms if t in model]
    
    if len(valid_terms) < n_clusters:
        print("Not enough valid terms for clustering")
        return None
    
    # Get embeddings
    embeddings = np.array([model[t] for t in valid_terms])
    
    # Cluster
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(embeddings)
    
    # Group terms by cluster
    cluster_groups = {i: [] for i in range(n_clusters)}
    for term, cluster in zip(valid_terms, clusters):
        cluster_groups[cluster].append(term)
    
    return cluster_groups

# Financial terms to cluster
if glove is not None:
    financial_terms = [
        # Asset classes
        'stock', 'bond', 'equity', 'commodity', 'currency', 'option',
        # Sentiment
        'bullish', 'bearish', 'optimistic', 'pessimistic',
        # Market actions
        'buy', 'sell', 'hold', 'trade', 'invest',
        # Performance
        'profit', 'loss', 'gain', 'return', 'yield',
        # Market state
        'rally', 'crash', 'correction', 'volatility', 'stability',
        # Economic
        'inflation', 'recession', 'growth', 'employment', 'gdp'
    ]
    
    clusters = cluster_financial_terms(glove, financial_terms, n_clusters=5)
    
    if clusters:
        print("Financial Term Clusters:")
        print("="*60)
        for cluster_id, terms in clusters.items():
            print(f"\nCluster {cluster_id + 1}:")
            print(f"  {', '.join(terms)}")

In [None]:
# Application 3: Out-of-Vocabulary (OOV) Handling

def handle_oov(word, model, method='similar'):
    """
    Handle out-of-vocabulary words.
    
    Methods:
    - 'zero': Return zero vector
    - 'similar': Substitute the in-vocabulary word with the highest character overlap
    
    Any other method value falls back to the zero vector; subword (character
    n-gram) handling is not implemented here.
    """
    if word in model:
        return model[word], word
    
    if method == 'zero':
        return np.zeros(model.vector_size), '<UNK>'
    
    elif method == 'similar':
        # Pick the vocabulary word with the highest character-set overlap
        # (a crude spelling heuristic, not a true edit distance)
        best_match = None
        best_score = 0
        
        for vocab_word in list(model.key_to_index.keys())[:10000]:  # Check first 10k words
            # Simple character overlap score
            overlap = len(set(word) & set(vocab_word)) / max(len(word), len(vocab_word))
            if overlap > best_score:
                best_score = overlap
                best_match = vocab_word
        
        if best_match:
            return model[best_match], f"{best_match} (substitute for {word})"
        return np.zeros(model.vector_size), '<UNK>'
    
    return np.zeros(model.vector_size), '<UNK>'

# Test OOV handling
if glove is not None:
    oov_words = ['cryptocurrency', 'fintech', 'defi', 'nft']
    
    print("Out-of-Vocabulary Handling:")
    print("="*60)
    for word in oov_words:
        in_vocab = word in glove
        if not in_vocab:
            _, substitute = handle_oov(word, glove, method='similar')
            print(f"  '{word}': OOV -> {substitute}")
        else:
            print(f"  '{word}': In vocabulary")
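
The character-set overlap used above ignores character order, so it can match unrelated words. A slightly stronger heuristic, sketched below, compares FastText-style character n-grams with boundary markers. The function names and the small vocabulary are hypothetical. This is still a spelling-based substitute: unlike FastText, it does not learn a vector for each n-gram, and a close spelling does not guarantee a close meaning.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


def closest_by_ngrams(word, vocabulary):
    """Return (best_word, jaccard_score) for the candidate sharing the most n-grams."""
    target = char_ngrams(word)
    best_word, best_score = None, 0.0
    for candidate in vocabulary:
        grams = char_ngrams(candidate)
        union = target | grams
        score = len(target & grams) / len(union) if union else 0.0
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word, best_score


# Hypothetical in-vocabulary words; with a real model this could be
# an iterable over model.key_to_index.
vocabulary = ["currency", "technology", "finance", "crypto", "stock"]
match, score = closest_by_ngrams("cryptocurrency", vocabulary)
print(f"cryptocurrency -> {match} (Jaccard {score:.3f})")
```

Once a substitute is found, its vector can stand in for the OOV word, as `handle_oov` does with `model[best_match]`.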

## 10. Model Persistence and Best Practices

In [None]:
# Save and load Word2Vec models
import os

# Create models directory
models_dir = 'models'
os.makedirs(models_dir, exist_ok=True)

# Save full model (can continue training)
model_path = os.path.join(models_dir, 'financial_w2v.model')
w2v_financial.save(model_path)
print(f"Full model saved to: {model_path}")

# Save only word vectors (smaller, faster loading)
vectors_path = os.path.join(models_dir, 'financial_w2v.vectors')
w2v_financial.wv.save(vectors_path)
print(f"Vectors saved to: {vectors_path}")

# Load model
loaded_model = Word2Vec.load(model_path)
print(f"\nLoaded model vocabulary size: {len(loaded_model.wv)}")

# Load only vectors (lightweight)
loaded_vectors = KeyedVectors.load(vectors_path)
print(f"Loaded vectors vocabulary size: {len(loaded_vectors)}")
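
Once model files are on disk, it can help to record exactly which bytes were saved, so that a later run can confirm it is loading the intended version. The sketch below hashes every file in a directory (Gensim may write extra `.npy` files next to a large model, so hashing the whole directory is safer than hashing one path). The helper names and the manifest layout are assumptions, and the demo uses a throwaway stand-in file rather than a real model.

```python
import hashlib
import json
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(directory):
    """Map each file name in `directory` to its size in bytes and SHA-256 digest."""
    manifest = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            manifest[name] = {
                "bytes": os.path.getsize(path),
                "sha256": file_sha256(path),
            }
    return manifest

# Demonstration on a temporary directory with a stand-in file (not a real model)
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "demo.vectors"), "wb") as handle:
        handle.write(b"abc")
    demo_manifest = build_manifest(demo_dir)

print(json.dumps(demo_manifest, indent=2))
```

Calling `build_manifest(models_dir)` after the save step above would cover the real files. If the manifest is written as JSON, storing it outside `models_dir` keeps it from being hashed on the next run.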

In [None]:
# Best practices summary

best_practices = """
WORD EMBEDDINGS BEST PRACTICES FOR FINANCE
============================================

1. DATA PREPARATION
   - Use domain-specific text (SEC filings, earnings calls, news)
   - Preserve financial bigrams (interest_rate, earnings_per_share)
   - Consider keeping stopwords for embedding training
   - Clean but don't over-process financial terminology

2. MODEL SELECTION
   - Word2Vec Skip-gram: Better for rare financial terms
   - Word2Vec CBOW: Faster, better for common terms
   - GloVe pre-trained: Good baseline, transfer learning
   - FastText: Better OOV handling with subword information

3. HYPERPARAMETERS
   - vector_size: 100-300 (100 often sufficient for finance)
   - window: 5-10 (larger windows capture topical context, smaller ones syntax)
   - min_count: 5-10 for large corpus, 1-2 for small
   - epochs: More for smaller corpora

4. EVALUATION
   - Word similarity tasks (financial synonyms)
   - Analogy tasks (stock:stocks::bond:?)
   - Downstream task performance (sentiment, classification)
   - Qualitative inspection of nearest neighbors

5. DEPLOYMENT
   - Save KeyedVectors for inference (smaller, faster)
   - Consider quantization for production
   - Handle OOV words gracefully
   - Version control your models
"""

print(best_practices)
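
The deployment list mentions quantization without showing what it involves. One simple scheme, sketched here under stated assumptions, is symmetric per-vector int8 quantization: each vector is stored as 8-bit integers plus one float32 scale. The function names are hypothetical, random vectors stand in for real embeddings, and other schemes (such as product quantization) may suit a given system better.

```python
import numpy as np

def quantize_int8(vectors):
    """Symmetric per-row int8 quantization; each row keeps its own float32 scale."""
    vectors = np.asarray(vectors, dtype=np.float32)
    max_abs = np.abs(vectors).max(axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    codes = np.round(vectors / scales).astype(np.int8)
    return codes, scales

def dequantize_int8(codes, scales):
    """Recover approximate float32 vectors from int8 codes and row scales."""
    return codes.astype(np.float32) * scales

# Random stand-in for a 1000-word, 100-dimensional embedding matrix
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(1000, 100)).astype(np.float32)

codes, scales = quantize_int8(demo_vectors)
restored = dequantize_int8(codes, scales)

max_error = float(np.abs(demo_vectors - restored).max())
size_ratio = demo_vectors.nbytes / (codes.nbytes + scales.nbytes)
print(f"max abs error: {max_error:.4f}, size ratio: {size_ratio:.2f}x")
```

The rounding error per component is bounded by half of that row's scale. Whether this loss is acceptable should be checked on the downstream task, not only on reconstruction error.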

## 11. Interview Questions

In [None]:
interview_questions = """
WORD EMBEDDINGS INTERVIEW QUESTIONS
====================================

CONCEPTUAL:

Q1: Explain the difference between Word2Vec Skip-gram and CBOW.
A: Skip-gram predicts context words given center word (better for rare words).
   CBOW predicts center word given context (faster, better for frequent words).

Q2: Why do word embeddings capture semantic meaning?
A: Words in similar contexts have similar embeddings. The distributional 
   hypothesis: "You shall know a word by the company it keeps." Training
   optimizes vectors so semantically related words are geometrically close.

Q3: What's the advantage of GloVe over Word2Vec?
A: GloVe fits vectors to global co-occurrence counts aggregated over the
   entire corpus, while Word2Vec learns from local context windows one
   at a time. The GloVe paper reports stronger analogy results, but in
   practice the gap depends heavily on hyperparameters and corpus.

Q4: How would you handle out-of-vocabulary words in production?
A: Options include: (1) Return zero/random vector, (2) Use subword 
   embeddings (FastText), (3) Find similar in-vocabulary word, 
   (4) Use character-level models, (5) Hashing trick for unknown words.

PRACTICAL:

Q5: How would you build a financial sentiment analyzer using embeddings?
A: Create positive/negative seed word centroids, compute document embedding
   as weighted average of word vectors, measure cosine similarity to 
   each centroid. The closer centroid determines sentiment.

Q6: What's a good embedding dimension for financial NLP?
A: 100-300 dimensions typically work well. 100d often sufficient for 
   finance-specific tasks. Larger dimensions may overfit on small corpora.
   Validate using downstream task performance.

Q7: How do you evaluate embedding quality?
A: (1) Intrinsic: word similarity correlation, analogy accuracy
   (2) Extrinsic: downstream task performance (sentiment, classification)
   (3) Qualitative: inspect nearest neighbors for known relationships

Q8: When would you train custom embeddings vs. use pre-trained ones?
A: Custom: Large domain corpus, domain-specific vocabulary, 
   specialized semantics (terms whose financial meaning differs from
   general usage).
   Pre-trained: Limited data, general concepts, quick prototyping.
   Hybrid: Fine-tune pre-trained on domain data.
"""

print(interview_questions)
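
The answer to Q5 describes a concrete procedure, which can be sketched as follows. This is a minimal illustration, not a production classifier: the function names are hypothetical, the 2-D vectors are hand-made so the geometry is easy to follow, and ties are arbitrarily resolved as "negative".

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def centroid(words, vectors):
    """Mean vector of the seed words found in `vectors` (assumes at least one is)."""
    found = [vectors[w] for w in words if w in vectors]
    return np.mean(found, axis=0)

def document_vector(tokens, vectors, weights=None):
    """Weighted average of token vectors; unknown tokens are skipped."""
    weights = weights or {}
    known = [t for t in tokens if t in vectors]
    if not known:
        return np.zeros(len(next(iter(vectors.values()))))
    w = np.array([weights.get(t, 1.0) for t in known])
    stacked = np.stack([vectors[t] for t in known])
    return w @ stacked / w.sum()

def classify(tokens, vectors, pos_seeds, neg_seeds, weights=None):
    """Label a document by whichever seed centroid is closer in cosine terms."""
    doc = document_vector(tokens, vectors, weights)
    pos_score = cosine(doc, centroid(pos_seeds, vectors))
    neg_score = cosine(doc, centroid(neg_seeds, vectors))
    label = "positive" if pos_score > neg_score else "negative"
    return label, pos_score, neg_score

# Hand-made 2-D vectors, for illustration only
toy = {
    "growth": np.array([1.0, 0.1]),
    "profit": np.array([0.9, 0.2]),
    "loss": np.array([0.1, 1.0]),
    "decline": np.array([0.2, 0.9]),
    "revenue": np.array([0.6, 0.4]),
    "the": np.array([0.5, 0.5]),
}
pos_seeds = ["growth", "profit"]
neg_seeds = ["loss", "decline"]

label_up, _, _ = classify(["revenue", "growth", "the"], toy, pos_seeds, neg_seeds)
label_down, _, _ = classify(["loss", "decline", "the"], toy, pos_seeds, neg_seeds)
print(label_up, label_down)
```

The optional `weights` argument is where TF-IDF or similar term weights would go. In practice a margin between the two scores could be used to emit a "neutral" label.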

## 12. Practice Exercises

In [None]:
exercises = """
PRACTICE EXERCISES
==================

Exercise 1: Custom Financial Embeddings
---------------------------------------
Collect 1000+ sentences from financial news sources.
Train Word2Vec with different hyperparameters.
Compare: vector_size (50, 100, 200), window (3, 5, 10), sg (0, 1).
Evaluate using financial word similarity and analogy tasks.

Exercise 2: Sector Classification
---------------------------------
Create document embeddings for company descriptions.
Use k-means or hierarchical clustering to group by sector.
Evaluate cluster quality against known sector labels.

Exercise 3: News Similarity Engine
-----------------------------------
Build a news article similarity search system.
Index 100+ financial news articles with embeddings.
Implement efficient similarity search (approximate nearest neighbors).
Test with various query types.

Exercise 4: Embedding Visualization Dashboard
----------------------------------------------
Create interactive visualization of financial term embeddings.
Allow filtering by category (assets, sentiment, actions).
Show nearest neighbors on hover.
Use Plotly or Bokeh for interactivity.

Exercise 5: Transfer Learning Comparison
-----------------------------------------
Compare sentiment classification accuracy using:
(a) Random initialization
(b) GloVe embeddings
(c) Financial-specific embeddings
(d) Fine-tuned embeddings
Document the performance differences.
"""

print(exercises)
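
For Exercise 3, an exact brute-force search is a useful baseline before reaching for an approximate nearest-neighbor library, since it gives the ground truth against which approximate results can be checked. The sketch below assumes document embeddings are already available as rows of a matrix; the function names and the tiny 3-D "article" vectors are made up for illustration.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    matrix = np.asarray(matrix, dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms > 0, norms, 1.0)

def top_k_similar(query, index_matrix, k=3):
    """Exact cosine search over a row-normalized matrix.

    Returns a list of (row index, similarity) pairs, best match first.
    """
    query = np.asarray(query, dtype=np.float64)
    norm = np.linalg.norm(query)
    if norm > 0:
        query = query / norm
    # With unit-length rows and query, the dot product is the cosine similarity
    scores = index_matrix @ query
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]

# Made-up 3-D "article embeddings", for illustration only
article_vectors = normalize_rows([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
results = top_k_similar([1.0, 0.0, 0.0], article_vectors, k=2)
print(results)
```

For a few hundred articles this exact search is already fast. Its output can serve as the reference when measuring the recall of an approximate index.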

## Summary

### Key Concepts Covered:
1. **Word Embeddings Theory**: From one-hot to distributed representations
2. **Word2Vec**: Skip-gram and CBOW architectures, training process
3. **GloVe**: Global vectors using co-occurrence statistics
4. **Financial Applications**: Sentiment analysis, document similarity, clustering
5. **Best Practices**: Preprocessing, hyperparameters, evaluation, deployment

### Next Steps:
- Day 3: Sentiment Analysis with Transformers (BERT, FinBERT)
- Day 4: Named Entity Recognition for Financial Text
- Day 5: News-based Trading Signals

### Resources:
- [Word2Vec Paper](https://arxiv.org/abs/1301.3781)
- [GloVe Paper](https://nlp.stanford.edu/pubs/glove.pdf)
- [Gensim Documentation](https://radimrehurek.com/gensim/)
- [Financial Word Embeddings Research](https://arxiv.org/abs/2006.08997)

In [None]:
# Cleanup
import shutil

# Remove models directory if desired
# shutil.rmtree('models', ignore_errors=True)

print("Day 2 Complete: Word Embeddings for Financial Text Analysis")
print("="*60)
print("Key takeaways:")
print("  1. Word embeddings capture semantic relationships")
print("  2. Word2Vec learns from local context windows")
print("  3. GloVe uses global co-occurrence statistics")
print("  4. Domain-specific training improves financial NLP")
print("  5. Document embeddings enable similarity search")