# Day 01: NLP Basics for Financial Documents

## Week 19: NLP & Alternative Data

**Learning Objectives:**
- Master text preprocessing techniques for financial documents
- Understand tokenization strategies for financial text
- Implement TF-IDF for document analysis and similarity
- Apply NLP techniques to earnings calls, news, and SEC filings

**Interview Topics Covered:**
- Text preprocessing pipelines
- Bag-of-words vs TF-IDF representations
- Document similarity and retrieval
- Domain-specific NLP challenges in finance

---
## 1. Environment Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from collections import Counter

# NLP Libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# Scikit-learn NLP tools
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt_tab', quiet=True)

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("Environment ready for NLP analysis!")

---
## 2. Financial Text Data

Let's create sample financial documents representing different types of financial text:
- Earnings call transcripts
- Financial news articles
- SEC filing excerpts
- Analyst reports

In [None]:
# Sample financial documents
financial_documents = {
    'earnings_call_1': """
    Good morning, and welcome to TechCorp's Q3 2025 earnings call. Our revenue 
    increased 15% year-over-year to $2.3 billion, exceeding analyst expectations. 
    EBITDA margins expanded by 200 basis points to 28.5%. We're raising our 
    full-year guidance due to strong demand in cloud services. Free cash flow 
    was $450 million, and we repurchased $200 million in shares during the quarter.
    """,
    
    'earnings_call_2': """
    Thank you for joining GlobalBank's quarterly earnings presentation. Net interest 
    income declined 8% as the yield curve remained inverted. Provisions for credit 
    losses increased to $1.2 billion amid rising defaults in commercial real estate. 
    Our CET1 ratio stands at 12.5%, well above regulatory requirements. We're 
    implementing cost-cutting measures to improve efficiency ratios.
    """,
    
    'news_positive': """
    Apple Inc. shares surged 5% in after-hours trading following better-than-expected 
    iPhone sales. The tech giant reported record services revenue, beating Wall Street 
    estimates by a wide margin. Analysts upgraded their price targets, citing strong 
    momentum in emerging markets and the successful launch of new AI features.
    """,
    
    'news_negative': """
    Oil prices plummeted 8% today as OPEC+ failed to reach agreement on production 
    cuts. Energy stocks faced significant selling pressure, with Exxon and Chevron 
    down over 4%. Traders fear oversupply concerns could push crude below $60 per 
    barrel. The bearish sentiment extended to related sectors including oilfield services.
    """,
    
    'sec_filing': """
    RISK FACTORS: Our business is subject to various risks including market volatility, 
    regulatory changes, and cybersecurity threats. We face intense competition in our 
    primary markets. Currency fluctuations may adversely affect our international 
    operations. The company has significant debt obligations totaling $5.2 billion 
    maturing over the next five years.
    """,
    
    'analyst_report': """
    We initiate coverage of MegaRetail Corp with an OVERWEIGHT rating and $150 price 
    target, implying 25% upside. Key catalysts include margin expansion from supply 
    chain optimization and accelerating e-commerce growth. Valuation is attractive at 
    15x forward P/E versus 20x for peers. Main risks include consumer spending weakness 
    and competitive pressures from discount retailers.
    """
}

# Create DataFrame
docs_df = pd.DataFrame([
    {'doc_id': k, 'text': v.strip(), 'category': k.split('_')[0]} 
    for k, v in financial_documents.items()
])

print(f"Created {len(docs_df)} financial documents")
docs_df[['doc_id', 'category']].head()

---
## 3. Text Preprocessing Pipeline

### 3.1 Basic Text Cleaning

Financial text often contains special characters, numbers, and formatting that need handling.

In [None]:
class FinancialTextPreprocessor:
    """
    Text preprocessing pipeline optimized for financial documents.
    
    Key considerations for financial text:
    - Preserve financial numbers (percentages, currency)
    - Handle financial acronyms (EBITDA, P/E, CET1)
    - Keep sentiment-bearing words
    """
    
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        
        # Standard English stopwords
        self.stop_words = set(stopwords.words('english'))
        
        # Financial stopwords to add
        self.financial_stopwords = {
            'company', 'corporation', 'inc', 'corp', 'llc', 'ltd',
            'quarter', 'year', 'fiscal', 'annual', 'period'
        }
        
        # Words to preserve (financially meaningful)
        self.preserve_words = {
            'up', 'down', 'above', 'below', 'high', 'low',
            'increase', 'decrease', 'rise', 'fall', 'gain', 'loss',
            'bullish', 'bearish', 'overweight', 'underweight'
        }
        
        # Remove preserved words from stopwords
        self.stop_words = self.stop_words - self.preserve_words
        self.stop_words = self.stop_words | self.financial_stopwords
    
    def clean_text(self, text):
        """Basic text cleaning."""
        # Convert to lowercase
        text = text.lower()
        
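        # Replace currency amounts with tokens (CURRENCY, CURRENCY_BILLION, ...).
        # A function is used so that a missing unit does not leave a dangling "CURRENCY_".
        text = re.sub(
            r'\$([\d,.]+)\s*(billion|million|thousand)?',
            lambda m: f" CURRENCY_{m.group(2).upper()} " if m.group(2) else " CURRENCY ",
            text
        )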
        
        # Replace percentages with tokens
        text = re.sub(r'([\d.]+)\s*%', r'PERCENT ', text)
        
        # Replace basis points
        text = re.sub(r'(\d+)\s*basis\s*points?', r'BASIS_POINTS ', text)
        text = re.sub(r'(\d+)\s*bps', r'BASIS_POINTS ', text)
        
        # Remove remaining numbers (optional - may want to keep some)
        text = re.sub(r'\b\d+\.?\d*\b', '', text)
        
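        # Remove special characters but keep spaces; digits inside tokens such as
        # "cet1" or "15x" are kept (standalone numbers were already removed above)
        text = re.sub(r'[^a-zA-Z0-9\s_]', ' ', text)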
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
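    def remove_stopwords(self, tokens):
        """Remove stopwords and very short tokens while preserving financial terms."""
        # Short preserved words such as 'up' must survive the length filter
        return [
            t for t in tokens
            if t not in self.stop_words
            and (len(t) > 2 or t in self.preserve_words)
        ]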
    
    def stem_tokens(self, tokens):
        """Apply Porter stemming."""
        return [self.stemmer.stem(t) for t in tokens]
    
    def lemmatize_tokens(self, tokens):
        """Apply WordNet lemmatization (default noun POS, so verb forms stay unchanged)."""
        return [self.lemmatizer.lemmatize(t) for t in tokens]
    
    def preprocess(self, text, use_lemmatization=True, remove_stops=True):
        """Full preprocessing pipeline."""
        # Clean text
        cleaned = self.clean_text(text)
        
        # Tokenize
        tokens = word_tokenize(cleaned)
        
        # Remove stopwords
        if remove_stops:
            tokens = self.remove_stopwords(tokens)
        
        # Lemmatize or stem
        if use_lemmatization:
            tokens = self.lemmatize_tokens(tokens)
        else:
            tokens = self.stem_tokens(tokens)
        
        return tokens


# Initialize preprocessor
preprocessor = FinancialTextPreprocessor()

# Example preprocessing
sample_text = financial_documents['earnings_call_1']
print("Original text:")
print(sample_text[:200], "...")
print("\nCleaned text:")
print(preprocessor.clean_text(sample_text)[:200], "...")
print("\nTokens:")
print(preprocessor.preprocess(sample_text)[:15])
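
The placeholder idea used in `clean_text` can be checked in isolation. The sketch below is a simplified, standard-library-only variant: the helper name `normalize_amounts`, the lowercase placeholder spelling, and the stricter number pattern are illustrative assumptions rather than part of the class above. The number pattern is written so that a trailing period or comma stays outside the matched amount.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
_NUM = r'\d+(?:,\d{3})*(?:\.\d+)?'

def normalize_amounts(text):
    """Map amounts to placeholder tokens (hypothetical helper for illustration)."""
    text = text.lower()
    # "$2.3 billion" -> "currency_billion", "$60" -> "currency"
    text = re.sub(
        rf'\$\s*{_NUM}(?:\s*(billion|million|thousand)\b)?',
        lambda m: 'currency_' + m.group(1) if m.group(1) else 'currency',
        text,
    )
    # "200 basis points" / "50 bps" -> "basis_points"
    text = re.sub(rf'{_NUM}\s*(?:basis\s+points?|bps)\b', 'basis_points', text)
    # "15%" -> "percent"
    text = re.sub(rf'{_NUM}\s*%', 'percent', text)
    return text

print(normalize_amounts("Revenue rose 15% to $2.3 billion. Crude fell below $60."))
```

Collapsing every amount into a handful of tokens keeps the vocabulary small, at the cost of discarding magnitude; whether that trade-off is acceptable depends on the downstream task.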

### 3.2 Stemming vs Lemmatization

**Interview Question:** What's the difference between stemming and lemmatization? When would you prefer one over the other in financial NLP?

In [None]:
# Compare stemming vs lemmatization on financial terms
financial_words = [
    'trading', 'traded', 'trades', 'trader',
    'increasing', 'increased', 'increases',
    'volatility', 'volatile',
    'earnings', 'earned', 'earning',
    'better', 'best', 'good',
    'running', 'ran', 'runs'
]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

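comparison = pd.DataFrame({
    'Original': financial_words,
    'Stemmed': [stemmer.stem(w) for w in financial_words],
    # lemmatize() defaults to pos='n' (noun), so verb forms pass through unchanged
    'Lemma (noun)': [lemmatizer.lemmatize(w) for w in financial_words],
    # Passing pos='v' reduces inflected verbs to their base form
    'Lemma (verb)': [lemmatizer.lemmatize(w, pos='v') for w in financial_words]
})

print("Stemming vs Lemmatization Comparison:")
print(comparison.to_string(index=False))

print("\n" + "="*60)
print("KEY INSIGHTS:")
print("- Stemming is faster but can create non-words (e.g., 'increas')")
print("- Lemmatization returns dictionary words, better for interpretation")
print("- WordNet lemmatization depends on the POS tag: 'increased' only becomes 'increase' with pos='v'")
print("- For financial text, lemmatization is often preferred for readability")
print("- Stemming may be better for pure ML tasks where interpretability is less critical")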

---
## 4. Tokenization Strategies

### 4.1 Word-Level Tokenization

In [None]:
def analyze_tokenization(text, doc_name):
    """
    Analyze different tokenization approaches.
    """
    # Word tokenization
    word_tokens = word_tokenize(text.lower())
    
    # Sentence tokenization
    sentences = sent_tokenize(text)
    
    # Simple whitespace split
    simple_tokens = text.lower().split()
    
    print(f"\n{'='*60}")
    print(f"Document: {doc_name}")
    print(f"{'='*60}")
    print(f"\nSentences ({len(sentences)}):")
    for i, sent in enumerate(sentences[:3]):
        print(f"  {i+1}. {sent[:80]}..." if len(sent) > 80 else f"  {i+1}. {sent}")
    
    print(f"\nWord tokens (NLTK): {len(word_tokens)} tokens")
    print(f"Simple split: {len(simple_tokens)} tokens")
    
    # Show difference
    nltk_set = set(word_tokens)
    simple_set = set(simple_tokens)
    
    print(f"\nTokens unique to NLTK: {list(nltk_set - simple_set)[:10]}")
    
    return word_tokens

# Analyze tokenization for different document types
for doc_id in ['earnings_call_1', 'sec_filing']:
    analyze_tokenization(financial_documents[doc_id], doc_id)
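
To see why the two token counts above differ, it helps to look at one sentence directly. The sketch below compares whitespace splitting with a small regex tokenizer; the pattern `TOKEN_RE` is an illustrative assumption and does not reproduce NLTK's rules (NLTK, for instance, splits contractions such as "we're" into two tokens).

```python
import re

sentence = "EBITDA margins expanded to 28.5%, and we're raising guidance."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sentence.lower().split()

# Illustrative pattern: numbers with optional decimals and percent sign,
# words with an optional apostrophe part, or any single punctuation symbol.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|[a-z]+(?:'[a-z]+)?|[^\w\s]")
regex_tokens = TOKEN_RE.findall(sentence.lower())

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "28.5%," and "guidance." become vocabulary entries distinct from "28.5%" and "guidance", which fragments term counts later in the pipeline.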

### 4.2 N-gram Tokenization

N-grams capture multi-word financial phrases like "interest rate", "market volatility", "price target".

In [None]:
def extract_ngrams(text, n=2):
    """
    Extract n-grams from text.
    """
    # Preprocess
    tokens = preprocessor.preprocess(text, use_lemmatization=True, remove_stops=False)
    
    # Generate n-grams
    n_grams = list(ngrams(tokens, n))
    
    return [' '.join(gram) for gram in n_grams]


# Extract n-grams document by document so that no phrase spans a document boundary
def corpus_ngrams(n):
    return [
        gram
        for text in financial_documents.values()
        for gram in extract_ngrams(text, n=n)
    ]

# Bigrams
bigrams = corpus_ngrams(2)
bigram_freq = Counter(bigrams)

print("Top 15 Bigrams in Financial Documents:")
print("-" * 40)
for phrase, count in bigram_freq.most_common(15):
    print(f"{phrase:30} | {count}")

# Trigrams
print("\nTop 10 Trigrams:")
print("-" * 40)
trigrams = corpus_ngrams(3)
trigram_freq = Counter(trigrams)
for phrase, count in trigram_freq.most_common(10):
    print(f"{phrase:40} | {count}")
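
An n-gram is simply a sliding window of `n` consecutive tokens. As a minimal sketch of the mechanics, the helper below (the name `sliding_ngrams` and the sample sentence are made up for illustration) builds the windows with `zip` over staggered slices of the token list.

```python
from collections import Counter

def sliding_ngrams(tokens, n=2):
    """Return space-joined n-grams from a token list (illustrative helper)."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = "net interest income declined as net interest margin narrowed".split()
bigram_counts = Counter(sliding_ngrams(tokens, 2))

print(bigram_counts.most_common(3))
print(sliding_ngrams(tokens, 3)[:2])
```

A list of `k` tokens yields `k - n + 1` n-grams, and none at all when `k < n`.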

### 4.3 Financial Domain-Specific Tokenization

In [None]:
class FinancialTokenizer:
    """
    Custom tokenizer for financial text that handles:
    - Financial acronyms (EBITDA, P/E, ROE)
    - Ticker symbols ($AAPL, $TSLA)
    - Currency amounts
    - Percentages and basis points
    """
    
    def __init__(self):
        # Common financial acronyms to preserve
        self.financial_acronyms = {
            'ebitda', 'ebit', 'eps', 'pe', 'pb', 'roe', 'roa', 'roi',
            'cet1', 'rwa', 'nii', 'nim', 'npv', 'irr', 'wacc', 'capm',
            'yoy', 'qoq', 'mom', 'ytd', 'mtd', 'cagr', 'fcf', 'dcf',
            'ipo', 'sec', 'fed', 'fomc', 'gdp', 'cpi', 'pmi'
        }
        
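        # Patterns for financial entities. Groups are non-capturing so that
        # re.findall returns the full match, and the unit must end at a word
        # boundary so that "$2.50, beating" is not read as "$2.50 B".
        self.patterns = {
            'ticker': r'\$[A-Z]{1,5}\b',
            'currency': r'\$\d[\d,]*(?:\.\d+)?(?:\s*(?:billion|million|thousand|B|M|K)\b)?',
            'percentage': r'\d+(?:\.\d+)?\s*%',
            'basis_points': r'\d+\s*(?:basis\s*points?|bps)\b',
            'ratio': r'\b\d+\.?\d*x\b',
            'date': r'Q[1-4]\s*\'?\d{2,4}|FY\s*\'?\d{2,4}'
        }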
    
    def extract_financial_entities(self, text):
        """
        Extract financial entities from text.
        """
        entities = {}
        
        for entity_type, pattern in self.patterns.items():
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                entities[entity_type] = matches
        
        return entities
    
    def tokenize(self, text):
        """
        Tokenize with financial entity preservation.
        """
        # Extract entities first
        entities = self.extract_financial_entities(text)
        
        # Replace entities with placeholders such as __TICKER__ or __CURRENCY__
        modified_text = text
        
        for entity_type, pattern in self.patterns.items():
            placeholder = f'__{entity_type.upper()}__'
            modified_text = re.sub(pattern, placeholder, modified_text, flags=re.IGNORECASE)
        
        # Standard tokenization
        tokens = word_tokenize(modified_text.lower())
        
        return tokens, entities


# Test financial tokenizer
fin_tokenizer = FinancialTokenizer()

test_text = """
$AAPL reported Q3'25 earnings with EPS of $2.50, beating estimates by 15%. 
Revenue was $95.2 billion, up 12% YoY. EBITDA margin expanded 200 basis points.
The stock trades at 25x forward P/E ratio.
"""

tokens, entities = fin_tokenizer.tokenize(test_text)

print("Extracted Financial Entities:")
for entity_type, values in entities.items():
    print(f"  {entity_type}: {values}")

print(f"\nTokens (sample): {tokens[:20]}")
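
One detail is worth keeping in mind when writing entity patterns like those in `self.patterns`: when a pattern contains a capturing group, `re.findall` returns the text of the group rather than the whole match. The short sketch below illustrates this with two simplified patterns (both are illustrative and not taken from the class).

```python
import re

text = "Revenue was $95.2 billion, up 12% YoY."

# With a capturing group, findall returns only the group's text.
capturing = r'\$[\d.]+\s*(billion|million)?'
# With a non-capturing group (?:...), findall returns the whole match.
non_capturing = r'\$[\d.]+\s*(?:billion|million)?'

print(re.findall(capturing, text))       # ['billion']
print(re.findall(non_capturing, text))   # ['$95.2 billion']

# finditer + group(0) also yields the whole match regardless of grouping.
print([m.group(0) for m in re.finditer(capturing, text)])
```

Either non-capturing groups or `finditer` with `group(0)` keeps the extracted entity intact.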

---
## 5. Bag-of-Words (BoW) Representation

### 5.1 Count Vectorization

In [None]:
# Preprocess all documents
processed_docs = []
for text in docs_df['text']:
    tokens = preprocessor.preprocess(text)
    processed_docs.append(' '.join(tokens))

docs_df['processed_text'] = processed_docs

# Create Count Vectorizer
count_vectorizer = CountVectorizer(
    max_features=100,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=1,            # Minimum document frequency
    max_df=0.9           # Maximum document frequency (remove very common terms)
)

# Fit and transform
bow_matrix = count_vectorizer.fit_transform(docs_df['processed_text'])

# Create DataFrame for visualization
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=count_vectorizer.get_feature_names_out(),
    index=docs_df['doc_id']
)

print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Document-term matrix shape: {bow_matrix.shape}")
print("\nSample of BoW representation:")
bow_df.iloc[:, :10]

In [None]:
# Visualize term frequencies across documents
fig, ax = plt.subplots(figsize=(14, 6))

# Get top terms by total frequency
term_freq = bow_df.sum().sort_values(ascending=False).head(20)

colors = plt.cm.Blues(np.linspace(0.4, 0.9, len(term_freq)))
bars = ax.bar(range(len(term_freq)), term_freq.values, color=colors)
ax.set_xticks(range(len(term_freq)))
ax.set_xticklabels(term_freq.index, rotation=45, ha='right')
ax.set_ylabel('Total Count')
ax.set_title('Top 20 Terms by Frequency (Bag-of-Words)')

plt.tight_layout()
plt.show()

---
## 6. TF-IDF (Term Frequency-Inverse Document Frequency)

### 6.1 TF-IDF Theory

**Interview Question:** Explain TF-IDF and why it's better than raw term counts for document analysis.

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

Where:
- $\text{TF}(t, d)$ = Term frequency of term $t$ in document $d$
- $\text{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$
- $N$ = Total number of documents

**Key Insight:** TF-IDF downweights common terms and highlights distinctive terms.
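
The IDF formula above is the textbook definition. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default (`smooth_idf=True`): $\ln\frac{1 + N}{1 + df} + 1$, so a term that appears in every document keeps a weight of 1 instead of dropping to 0. The sketch below compares the two for a six-document corpus; the function names are illustrative.

```python
import math

def idf_textbook(n_docs, doc_freq):
    """IDF as defined above: log(N / df)."""
    return math.log(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    """Smoothed variant: ln((1 + N) / (1 + df)) + 1 (scikit-learn's default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

N = 6  # corpus size used in this notebook
for df in (1, 3, 6):
    print(f"df={df}: textbook={idf_textbook(N, df):.3f}  smoothed={idf_smoothed(N, df):.3f}")
```

Both versions decrease as document frequency rises, so the ranking of terms by distinctiveness is the same; only the absolute values differ.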

In [None]:
# Create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=100,
    ngram_range=(1, 2),
    min_df=1,
    max_df=0.9,
    sublinear_tf=True,  # Apply sublinear TF scaling (1 + log(tf))
    norm='l2'           # L2 normalization
)

# Fit and transform
tfidf_matrix = tfidf_vectorizer.fit_transform(docs_df['processed_text'])

# Create DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=docs_df['doc_id']
)

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print("\nTF-IDF scores (sample):")
tfidf_df.iloc[:, :8].round(3)
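
The `sublinear_tf=True` and `norm='l2'` options used above can be illustrated on a single toy vector. This is a sketch with made-up counts and hypothetical helper names; in the vectorizer, the damping is applied to the term counts before they are multiplied by IDF, and the L2 normalization is applied to each document row at the end.

```python
import math

def sublinear_tf(count):
    """1 + ln(tf) for tf > 0, else 0: damps the effect of repeated terms."""
    return 1 + math.log(count) if count > 0 else 0.0

def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

raw_counts = [1, 2, 10, 0]
damped = [sublinear_tf(c) for c in raw_counts]

print([round(w, 3) for w in damped])
print([round(w, 3) for w in l2_normalize(damped)])
```

A term that occurs ten times ends up weighted a little over three times as much as a term that occurs once, rather than ten times as much.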

In [None]:
def get_top_tfidf_terms(doc_index, n_terms=10):
    """
    Get top TF-IDF terms for a document.
    """
    doc_name = docs_df['doc_id'].iloc[doc_index]
    scores = tfidf_df.iloc[doc_index].sort_values(ascending=False)
    
    return doc_name, scores.head(n_terms)


# Analyze top terms for each document
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i in range(len(docs_df)):
    doc_name, top_terms = get_top_tfidf_terms(i, 8)
    
    ax = axes[i]
    colors = plt.cm.RdYlGn(np.linspace(0.3, 0.8, len(top_terms)))
    ax.barh(range(len(top_terms)), top_terms.values, color=colors)
    ax.set_yticks(range(len(top_terms)))
    ax.set_yticklabels(top_terms.index)
    ax.set_xlabel('TF-IDF Score')
    ax.set_title(f'{doc_name}', fontsize=10)
    ax.invert_yaxis()

plt.suptitle('Top TF-IDF Terms by Document', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 6.2 Comparing BoW vs TF-IDF

**Interview Question:** When would you use BoW vs TF-IDF for financial text analysis?

In [None]:
# Compare rankings between BoW and TF-IDF
print("Comparison: BoW vs TF-IDF Term Rankings")
print("="*60)

# Overall top terms
bow_top = bow_df.sum().sort_values(ascending=False).head(10)
tfidf_top = tfidf_df.sum().sort_values(ascending=False).head(10)

comparison_df = pd.DataFrame({
    'BoW_Term': bow_top.index,
    'BoW_Score': bow_top.values,
    'TF-IDF_Term': tfidf_top.index,
    'TF-IDF_Score': tfidf_top.values.round(3)
})

print(comparison_df.to_string(index=False))

print("\n" + "="*60)
print("KEY DIFFERENCES:")
print("-" * 60)
print("BoW: Raw counts favor frequent terms across all documents")
print("TF-IDF: Highlights distinctive terms for each document")
print("\nUse BoW when: Frequency matters (e.g., keyword detection)")
print("Use TF-IDF when: Finding distinctive content (e.g., document retrieval)")

---
## 7. Document Similarity with TF-IDF

### 7.1 Cosine Similarity

In [None]:
# Calculate document similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

similarity_df = pd.DataFrame(
    similarity_matrix,
    index=docs_df['doc_id'],
    columns=docs_df['doc_id']
)

print("Document Similarity Matrix (Cosine Similarity):")
print(similarity_df.round(3))

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
    similarity_df, 
    annot=True, 
    fmt='.2f', 
    cmap='RdYlGn',
    center=0.5,
    ax=ax
)
ax.set_title('Document Similarity Matrix (TF-IDF + Cosine Similarity)', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Find most similar document pairs
print("Most Similar Document Pairs:")
print("="*60)

pairs = []
for i in range(len(similarity_df)):
    for j in range(i+1, len(similarity_df)):
        pairs.append({
            'doc1': similarity_df.index[i],
            'doc2': similarity_df.columns[j],
            'similarity': similarity_df.iloc[i, j]
        })

pairs_df = pd.DataFrame(pairs).sort_values('similarity', ascending=False)
print(pairs_df.head(10).to_string(index=False))

print("\nInterpretation (check each point against the scores above):")
print("- Earnings calls tend to share corporate language")
print("- News articles have distinct vocabulary based on topic and sentiment")
print("- SEC filings use different legal/regulatory language")
print("- With only six short documents, similarity scores are noisy")

### 7.2 Query-Based Document Retrieval

In [None]:
def find_similar_documents(query, vectorizer, doc_matrix, doc_names, top_n=3):
    """
    Find documents most similar to a query using TF-IDF.
    
    This is commonly used in:
    - Document search systems
    - News article retrieval
    - SEC filing analysis
    """
    # Preprocess query
    query_processed = ' '.join(preprocessor.preprocess(query))
    
    # Transform query to TF-IDF vector
    query_vector = vectorizer.transform([query_processed])
    
    # Calculate similarity
    similarities = cosine_similarity(query_vector, doc_matrix)[0]
    
    # Rank documents
    results = pd.DataFrame({
        'document': doc_names,
        'similarity': similarities
    }).sort_values('similarity', ascending=False)
    
    return results.head(top_n)


# Test queries
test_queries = [
    "company earnings revenue growth profit",
    "stock price target analyst recommendation",
    "risk factors debt regulatory concerns",
    "oil energy prices market decline"
]

print("Document Retrieval Results:")
print("="*60)

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 40)
    results = find_similar_documents(
        query, 
        tfidf_vectorizer, 
        tfidf_matrix, 
        docs_df['doc_id'].values
    )
    for _, row in results.iterrows():
        print(f"  {row['document']:20} | Similarity: {row['similarity']:.3f}")

---
## 8. Advanced TF-IDF Applications

### 8.1 Extracting Key Financial Topics (LSA)

In [None]:
# Latent Semantic Analysis (LSA) using SVD on TF-IDF
n_topics = 3

svd = TruncatedSVD(n_components=n_topics, random_state=42)
doc_topics = svd.fit_transform(tfidf_matrix)

# Get top terms for each topic
terms = tfidf_vectorizer.get_feature_names_out()

print("Latent Topics from Financial Documents (LSA):")
print("="*60)

for topic_idx, topic in enumerate(svd.components_):
    top_term_indices = topic.argsort()[-8:][::-1]
    top_terms = [terms[i] for i in top_term_indices]
    
    print(f"\nTopic {topic_idx + 1}:")
    print(f"  Terms: {', '.join(top_terms)}")
    
    # Documents most associated with this topic
    top_doc_idx = doc_topics[:, topic_idx].argsort()[-2:][::-1]
    top_docs = [docs_df['doc_id'].iloc[i] for i in top_doc_idx]
    print(f"  Top docs: {', '.join(top_docs)}")

# Visualize document-topic matrix
topic_df = pd.DataFrame(
    doc_topics,
    columns=[f'Topic_{i+1}' for i in range(n_topics)],
    index=docs_df['doc_id']
)

print("\nDocument-Topic Matrix:")
print(topic_df.round(3))

### 8.2 Financial Sentiment Lexicon Application

In [None]:
# Simple financial sentiment lexicon (Loughran-McDonald style)
financial_sentiment = {
    'positive': [
        'growth', 'increase', 'gain', 'profit', 'surge', 'strong', 'beat',
        'exceed', 'upgrade', 'bullish', 'opportunity', 'momentum', 'record',
        'optimistic', 'outperform', 'attractive', 'upside', 'expansion'
    ],
    'negative': [
        'loss', 'decline', 'risk', 'weak', 'fall', 'plummet', 'bearish',
        'concern', 'fear', 'default', 'downgrade', 'pressure', 'threat',
        'adverse', 'underperform', 'volatile', 'recession', 'uncertainty'
    ],
    'uncertainty': [
        'may', 'could', 'might', 'possible', 'uncertain', 'unclear',
        'depending', 'subject', 'approximate', 'estimate', 'expect'
    ],
    'litigious': [
        'litigation', 'lawsuit', 'regulatory', 'compliance', 'investigation',
        'penalty', 'settlement', 'violation', 'enforcement'
    ]
}

def calculate_sentiment_scores(text):
    """
    Calculate sentiment scores based on financial lexicon.

    Each score is the share of tokens that appear in the category's word list.
    """
    tokens = preprocessor.preprocess(text, remove_stops=False)
    total_tokens = len(tokens)
    
    scores = {}
    for sentiment, words in financial_sentiment.items():
        lexicon = set(words)  # set membership checks are O(1)
        count = sum(1 for t in tokens if t in lexicon)
        scores[sentiment] = count / total_tokens if total_tokens > 0 else 0
    
    return scores


# Calculate sentiment for all documents
sentiment_results = []
for idx, row in docs_df.iterrows():
    scores = calculate_sentiment_scores(row['text'])
    scores['doc_id'] = row['doc_id']
    sentiment_results.append(scores)

sentiment_df = pd.DataFrame(sentiment_results).set_index('doc_id')

print("Financial Sentiment Scores:")
print(sentiment_df.round(4))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
sentiment_df.plot(kind='bar', ax=ax, colormap='RdYlGn')
ax.set_ylabel('Sentiment Score (word frequency)')
ax.set_title('Financial Sentiment Analysis by Document')
ax.legend(title='Sentiment', bbox_to_anchor=(1.02, 1))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
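
The per-category frequencies above can be collapsed into a single number. One common convention is a net tone score, (pos - neg) / (pos + neg); it is chosen here purely for illustration and is not prescribed by the lexicon above. The helper name `net_tone` and the tiny word sets are hypothetical.

```python
def net_tone(tokens, positive, negative):
    """Return (pos - neg) / (pos + neg), or 0.0 when no lexicon word is present."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

# Short word sets for illustration only.
positive_words = {'growth', 'strong', 'record', 'upside'}
negative_words = {'loss', 'decline', 'risk', 'weak'}

tokens = "record revenue and strong growth despite one loss".split()
print(net_tone(tokens, positive_words, negative_words))
```

The score lies in [-1, 1]. Because it ignores negation ("no growth") and word order, it is best treated as a rough first-pass signal.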

---
## 9. Interview Practice Questions

### Q1: Text Preprocessing Decisions

In [None]:
print("""
INTERVIEW QUESTION 1: Text Preprocessing
=========================================

Q: You're building an NLP model to analyze earnings call transcripts.
   What preprocessing steps would you take and why?

ANSWER:
-------
1. CASE NORMALIZATION: Convert to lowercase for consistency,
   BUT preserve acronyms like EBITDA, EPS (common in finance)

2. NUMBER HANDLING: 
   - Replace specific numbers with tokens (e.g., $2.3B → CURRENCY_BILLION)
   - Preserves semantic meaning without vocabulary explosion
   - Percentages → PERCENT token

3. STOPWORDS:
   - Remove standard stopwords
   - BUT keep sentiment words (up, down, above, below)
   - Add domain stopwords (quarter, fiscal, company)

4. LEMMATIZATION over STEMMING:
   - Preserves real words for interpretability
   - Important for financial reports where terms matter

5. N-GRAMS:
   - Include bigrams to capture phrases like "interest rate"
   - Financial language is phrase-heavy

6. ENTITY RECOGNITION:
   - Identify and normalize ticker symbols, dates, currencies
   - May need custom NER for financial entities
""")

### Q2: TF-IDF Intuition

In [None]:
print("""
INTERVIEW QUESTION 2: TF-IDF
============================

Q: Explain TF-IDF and when you'd use it vs. raw word counts.

ANSWER:
-------
TF-IDF = Term Frequency × Inverse Document Frequency

TF (Term Frequency):
  - How often a term appears in a document
  - Higher TF = term is important to this document

IDF (Inverse Document Frequency):
  - log(N / df), where df = docs containing the term
  - Downweights common terms ("the", "company")
  - Upweights rare, distinctive terms

WHY USE TF-IDF?
  - Raw counts favor common words
  - TF-IDF highlights what makes documents UNIQUE
  - Better for: document similarity, search, classification

WHEN TO USE RAW COUNTS?
  - When frequency itself is meaningful (keyword detection)
  - As input to some models (Naive Bayes)
  - Topic modeling (LDA prefers counts)

PRACTICAL EXAMPLE:
  In SEC filings, "risk" appears in every document.
  TF-IDF will downweight "risk" and highlight
  specific risks like "cybersecurity" or "litigation".
""")

### Q3: Cosine Similarity

In [None]:
print("""
INTERVIEW QUESTION 3: Document Similarity
=========================================

Q: Why use cosine similarity instead of Euclidean distance
   for comparing documents?

ANSWER:
-------
COSINE SIMILARITY:
  cos(θ) = (A · B) / (||A|| × ||B||)
  
  - Measures angle between vectors (direction)
  - Normalized: ignores document length
  - Range: [-1, 1] (or [0, 1] for TF-IDF since no negatives)

EUCLIDEAN DISTANCE:
  d = sqrt(Σ(a_i - b_i)²)
  
  - Measures absolute distance
  - Affected by document length

WHY COSINE FOR DOCUMENTS?
  1. Document length varies wildly
     - 10-K filing: 50,000 words
     - News headline: 10 words
     - On raw counts, Euclidean distance would always call them far apart
  
  2. We care about topic similarity, not length
     - Two articles about same topic should be similar
     - Even if one is 3x longer
  
  3. Sparse high-dimensional vectors
     - TF-IDF vectors are very sparse
     - Cosine handles this well
""")

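# Demonstrate with example
from sklearn.metrics.pairwise import euclidean_distances

# Compare cosine similarity with Euclidean distance.
# Note: tfidf_matrix rows are L2-normalized (norm='l2'), so here the two
# measures rank document pairs identically; the length effect described
# above shows up on unnormalized vectors such as raw counts.
print("\nComparison: Cosine Similarity vs Euclidean Distance")
print("="*50)
cosine_sim = cosine_similarity(tfidf_matrix)
euclidean_dist = euclidean_distances(tfidf_matrix)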

print("\nCosine Similarity (first 3 docs):")
print(cosine_sim[:3, :3].round(3))
print("\nEuclidean Distance (first 3 docs):")
print(euclidean_dist[:3, :3].round(3))
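
The length argument in the answer above is easiest to see on unnormalized vectors. The sketch below uses toy raw-count vectors (made up for illustration) to show Euclidean distance penalizing a longer document with the same term mix, and then checks the identity $d^2 = 2(1 - \cos\theta)$ that holds once vectors are scaled to unit length.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

short_doc = [1, 2, 0]    # toy raw term counts
long_doc = [3, 6, 0]     # same term mix, three times longer
other_doc = [0, 1, 2]    # different term mix, same length as short_doc

# Raw counts: Euclidean distance puts short_doc closer to other_doc than to long_doc.
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
# Cosine ignores length: short_doc and long_doc point in the same direction.
print(cosine(short_doc, long_doc), cosine(short_doc, other_doc))

# After L2 normalization the two measures carry the same information.
a, b = unit(short_doc), unit(other_doc)
print(euclidean(a, b) ** 2, 2 * (1 - cosine(a, b)))
```

This is why the choice between the two measures matters mainly for raw counts or other unnormalized representations.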

---
## 10. Practical Exercise: News Sentiment Analyzer

In [None]:
class FinancialNewsAnalyzer:
    """
    Complete NLP pipeline for financial news analysis.
    
    Features:
    - Text preprocessing
    - TF-IDF vectorization
    - Document similarity
    - Sentiment scoring
    """
    
    def __init__(self):
        self.preprocessor = FinancialTextPreprocessor()
        self.vectorizer = TfidfVectorizer(
            max_features=500,
            ngram_range=(1, 2),
            min_df=1,
            sublinear_tf=True
        )
        self.documents = []
        self.tfidf_matrix = None
        self.is_fitted = False
    
    def fit(self, documents):
        """
        Fit the analyzer on a corpus of documents.
        """
        self.documents = documents
        
        # Preprocess
        processed = [
            ' '.join(self.preprocessor.preprocess(doc)) 
            for doc in documents
        ]
        
        # Fit TF-IDF
        self.tfidf_matrix = self.vectorizer.fit_transform(processed)
        self.is_fitted = True
        
        return self
    
    def find_similar(self, query, top_n=3):
        """
        Find documents similar to a query.
        """
        if not self.is_fitted:
            raise ValueError("Analyzer not fitted. Call fit() first.")
        
        # Process query
        query_processed = ' '.join(self.preprocessor.preprocess(query))
        query_vector = self.vectorizer.transform([query_processed])
        
        # Calculate similarity
        similarities = cosine_similarity(query_vector, self.tfidf_matrix)[0]
        
        # Get top results
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx][:100] + '...',
                'similarity': similarities[idx]
            })
        
        return results
    
    def get_key_terms(self, doc_index, n_terms=10):
        """
        Get most important terms for a document.
        """
        if not self.is_fitted:
            raise ValueError("Analyzer not fitted. Call fit() first.")
        
        terms = self.vectorizer.get_feature_names_out()
        scores = self.tfidf_matrix[doc_index].toarray()[0]
        
        top_indices = scores.argsort()[-n_terms:][::-1]
        
        return [(terms[i], scores[i]) for i in top_indices]


# Test the analyzer
analyzer = FinancialNewsAnalyzer()
analyzer.fit(list(financial_documents.values()))

print("Financial News Analyzer Demo")
print("="*60)

# Test query
query = "stock market decline bearish sentiment"
print(f"\nQuery: '{query}'")
print("\nMost Similar Documents:")

for i, result in enumerate(analyzer.find_similar(query)):
    print(f"\n{i+1}. Similarity: {result['similarity']:.3f}")
    print(f"   {result['document']}")

---
## 11. Summary & Key Takeaways

### What We Covered Today:

1. **Text Preprocessing for Finance**
   - Domain-specific considerations (numbers, acronyms, sentiment words)
   - Stemming vs lemmatization trade-offs
   - Custom tokenization for financial entities

2. **Tokenization Strategies**
   - Word-level tokenization
   - N-grams for phrase capture
   - Financial entity extraction

3. **Document Representations**
   - Bag-of-Words (BoW)
   - TF-IDF and its advantages
   - When to use each approach

4. **Applications**
   - Document similarity with cosine similarity
   - Query-based document retrieval
   - Topic extraction with LSA
   - Sentiment analysis with financial lexicons

### Interview Prep Checklist:
- [ ] Explain TF-IDF formula and intuition
- [ ] Compare stemming vs lemmatization
- [ ] Why cosine similarity for documents?
- [ ] Design a text preprocessing pipeline for financial data
- [ ] Explain challenges of NLP in finance

In [None]:
print("""
🎯 NEXT STEPS:
==============

Day 02: Word Embeddings (Word2Vec, GloVe)
Day 03: Sentiment Analysis with ML Models
Day 04: Named Entity Recognition for Finance
Day 05: Topic Modeling (LDA)
Day 06: Alternative Data Sources
Day 07: Interview Review & Practice

📚 RECOMMENDED READING:
- "Textual Analysis in Finance" - Loughran & McDonald
- NLTK Book (free online)
- Scikit-learn text feature extraction docs
""")