# NLP for Trading - Chapters 13-15

This notebook covers Natural Language Processing techniques for algorithmic trading,
spanning three chapters from the Puffin tutorial:

- **Chapter 13**: Sentiment analysis -- NLP pipeline, bag-of-words, TF-IDF, and news classification
- **Chapter 14**: Topic modeling -- LSI, LDA, and earnings call topic analysis
- **Chapter 15**: Word embeddings -- Word2Vec, Doc2Vec, and transformer-based embeddings

We use the `puffin.nlp` module, which provides production-ready implementations
of each technique with financial-domain defaults (Loughran-McDonald lexicon, etc.).

## 1. NLP Pipeline

The `NLPPipeline` tokenizes text and extracts lemmas, named entities, and sentences.
It uses spaCy when available and falls back to a regex-based tokenizer.
The returned `ProcessedDoc` dataclass carries all extracted features.
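
To make the fallback idea concrete, here is a minimal sketch of what a regex-based tokenizer and sentence splitter could look like. It is not Puffin's actual implementation: the patterns and the `simple_tokenize` / `simple_sentences` names are assumptions chosen for illustration.

```python
import re

# Hypothetical stand-in for a regex fallback; not Puffin's actual code.
# Numbers first (so "$81.4B" and "3.2%" stay whole), then alphanumeric words.
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?[%BMK]?|[A-Za-z][A-Za-z0-9]*(?:'[a-z]+)?")
# Sentence boundary: ., ! or ? followed by whitespace and a capital letter.
SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def simple_tokenize(text: str) -> list[str]:
    """Return lower-cased word and number tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def simple_sentences(text: str) -> list[str]:
    """Split text into sentences with a purely lexical rule."""
    return [s.strip() for s in SENT_RE.split(text) if s.strip()]


sample = "Apple Inc. reported Q3 revenue of $81.4B. The stock rallied 3.2% after hours."
print(simple_tokenize(sample))
print(simple_sentences(sample))
```

A rule this simple would split incorrectly after an abbreviation that is followed by a capitalised word (for example "Inc. Apple"), which is one reason a statistical pipeline such as spaCy is preferred when it is installed.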

In [None]:
from puffin.nlp import NLPPipeline, ProcessedDoc

pipeline = NLPPipeline()

text = (
    "Apple Inc. reported Q3 revenue of $81.4B, beating estimates by 4.5%. "
    "CEO Tim Cook highlighted strong iPhone demand in emerging markets. "
    "The stock rallied 3.2% in after-hours trading."
)

doc: ProcessedDoc = pipeline.process(text)

print("Tokens (first 15):", doc.tokens[:15])
print("Sentences:", doc.sentences)
print("Entities:", doc.entities)

# Extract financial entities specifically
fin_entities = pipeline.extract_entities(text)
print("\nFinancial entities:", fin_entities)

# Extract financial terms
fin_terms = pipeline.extract_financial_terms(text)
print("Financial terms:", fin_terms)

## 2. Bag-of-Words & TF-IDF

Two fundamental text representations for quantitative analysis:

- **Bag-of-Words** (`build_bow`): raw term counts -- useful for topic models like LDA
- **TF-IDF** (`build_tfidf`): term frequency-inverse document frequency -- down-weights
  terms that are common across the corpus and highlights distinctive vocabulary

The `DocumentTermMatrix` class wraps both methods and provides helper functions
for inspecting top terms, document lengths, and vocabulary size.
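
The difference between the two representations is easy to see on a tiny corpus. The sketch below computes both by hand in plain Python. It assumes whitespace tokenisation and the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default; whether `build_tfidf` uses exactly this variant is an assumption.

```python
import math
from collections import Counter

docs = [
    "fed raises rates amid inflation concerns",
    "tech stocks rally as earnings beat expectations",
    "inflation data raises rate hike fears",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Bag-of-words: one row of raw term counts per document.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[t] for t in vocab])

# Smoothed IDF: idf(t) = ln((1 + N) / (1 + df(t))) + 1
n_docs = len(tokenized)
doc_freq = Counter(t for doc in tokenized for t in set(doc))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(counts):
    """Weight raw counts by IDF, then L2-normalise the row."""
    weighted = [c * idf[t] for c, t in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
    return [w / norm for w in weighted]


tfidf = [tfidf_row(row) for row in bow]

# In document 0, "fed" (one document) outweighs "inflation" (two documents).
for term in ("fed", "inflation"):
    print(term, round(tfidf[0][vocab.index(term)], 3))
```

Both terms occur once in the first headline, so their raw counts are identical. Only the IDF weighting separates them.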

In [None]:
from puffin.nlp import build_bow, build_tfidf, DocumentTermMatrix

# Sample financial headlines
headlines = [
    "Fed raises interest rates by 25 basis points amid inflation concerns",
    "Tech stocks rally as earnings beat expectations across the sector",
    "Oil prices surge after OPEC announces production cuts",
    "Bank earnings disappoint as loan losses increase sharply",
    "Inflation data comes in hotter than expected raising rate hike fears",
    "Strong jobs report boosts market confidence in economic recovery",
    "Treasury yields rise as Fed signals more rate increases ahead",
    "Tech earnings season kicks off with strong revenue growth",
    "Housing market slows as mortgage rates hit multi-year highs",
    "Consumer spending remains robust despite inflation pressures",
]

# --- build_bow: raw counts, min_df=1 for small corpus ---
bow_matrix, bow_features = build_bow(headlines, max_features=200, min_df=1)
print(f"BoW shape: {bow_matrix.shape}  (docs x features)")
print(f"Sample features: {bow_features[:10]}")

# --- build_tfidf: weighted, unigrams + bigrams ---
tfidf_matrix, tfidf_features = build_tfidf(
    headlines, max_features=200, ngram_range=(1, 2), min_df=1
)
print(f"\nTF-IDF shape: {tfidf_matrix.shape}")
print(f"Sample bigrams: {[f for f in tfidf_features if ' ' in f][:8]}")

# --- DocumentTermMatrix wrapper ---
dtm = DocumentTermMatrix(method="tfidf", max_features=200, min_df=1)
dtm.fit_transform(headlines)
print(f"\n{dtm}")
print(f"Vocabulary size: {dtm.get_vocabulary_size()}")
print(f"Top terms (corpus-wide): {dtm.get_top_terms(n=8)}")
print(f"Top terms (doc 0): {dtm.get_top_terms(n=5, doc_idx=0)}")

## 3. Sentiment Analysis

Puffin provides three complementary sentiment tools:

| Class | Approach | Best for |
|---|---|---|
| `RuleSentiment` | Loughran-McDonald word counts | Quick baseline, 10-K/10-Q filings |
| `LexiconSentiment` | Weighted lexicon (customisable) | Domain-tuned scoring |
| `NewsClassifier` | TF-IDF + Naive Bayes | Labelled news classification |
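
The two lexicon-based rows of the table share one core idea: count (or weight) positive and negative words and normalise the difference. A minimal sketch of that idea follows. The word lists are a tiny invented sample rather than the real Loughran-McDonald lexicon, and the `lexicon_score` helper, its `threshold`, and the label names are assumptions, not Puffin's API.

```python
import re

# Tiny illustrative word lists with weights; NOT the Loughran-McDonald lexicon.
POSITIVE = {"strong": 1.0, "record": 1.0, "exceeded": 1.0, "outperform": 2.0}
NEGATIVE = {"declining": 1.0, "losses": 1.0, "warned": 1.0, "downgrade": 2.0}


def lexicon_score(text, threshold=0.1):
    """Score in [-1, 1]: (pos - neg) / (pos + neg), or 0.0 if no lexicon hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(POSITIVE.get(t, 0.0) for t in tokens)
    neg = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    total = pos + neg
    score = (pos - neg) / total if total else 0.0
    if score > threshold:
        label = "positive"
    elif score < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return {"score": score, "label": label, "positive": pos, "negative": neg}


print(lexicon_score("Revenue growth exceeded expectations with strong margins"))
print(lexicon_score("Losses widened and management warned of deterioration"))
```

With all weights set to 1.0 this reduces to plain word counting. Non-unit weights give the customisable behaviour described for the weighted lexicon.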

In [None]:
from puffin.nlp import RuleSentiment, LexiconSentiment, NewsClassifier

# ----- RuleSentiment (Loughran-McDonald) -----
rule = RuleSentiment()

bullish_text = (
    "Revenue growth exceeded expectations with strong margins. "
    "The company achieved record profits and raised guidance."
)
bearish_text = (
    "The company reported declining revenue amid challenging conditions. "
    "Losses widened and management warned of further deterioration."
)

print("=== RuleSentiment ===")
for label, txt in [("Bullish", bullish_text), ("Bearish", bearish_text)]:
    result = rule.analyze(txt)
    print(f"{label}: score={result['score']:.3f}  label={result['label']}  "
          f"pos={result['positive_count']}  neg={result['negative_count']}")

# ----- LexiconSentiment (weighted, customisable) -----
lexicon = LexiconSentiment()
lexicon.add_positive_word("outperform", weight=2.0)
lexicon.add_negative_word("downgrade", weight=2.0)

print("\n=== LexiconSentiment ===")
custom_text = "Analysts outperform expectations but downgrade the outlook."
result = lexicon.analyze(custom_text)
print(f"Score: {result['score']:.3f}  Label: {result['label']}")
print(f"Positive words: {result['positive_words']}")
print(f"Negative words: {result['negative_words']}")

# ----- NewsClassifier (supervised) -----
train_texts = [
    "Stock surges on strong earnings beat",
    "Company reports record revenue and profit growth",
    "Shares rally after positive guidance update",
    "Dividend increase signals management confidence",
    "Analyst upgrades stock to outperform",
    "Stock plunges after earnings miss",
    "Revenue declines amid weak consumer demand",
    "Company warns of significant losses ahead",
    "Shares drop on disappointing quarterly results",
    "Analyst downgrades stock citing valuation concerns",
    "Market closes flat in light trading volume",
    "Stock trades sideways as investors await data",
]
train_labels = [
    "bullish", "bullish", "bullish", "bullish", "bullish",
    "bearish", "bearish", "bearish", "bearish", "bearish",
    "neutral", "neutral",
]

clf = NewsClassifier(max_features=500, ngram_range=(1, 2), alpha=1.0)
clf.fit(train_texts, train_labels)

test_texts = [
    "Earnings smash estimates with record profit margins",
    "Company issues profit warning as losses mount",
    "Trading volume remains subdued ahead of Fed decision",
]

predictions = clf.predict(test_texts)
probas = clf.predict_proba(test_texts)

print("\n=== NewsClassifier ===")
for txt, pred, prob in zip(test_texts, predictions, probas):
    print(f"  {txt}")
    print(f"    -> {pred}  (probs: {dict(zip(clf.classes_, prob.round(3)))})")
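
For intuition about the supervised step, here is a minimal multinomial Naive Bayes in plain Python. It is a simplification: it works on raw word counts rather than TF-IDF weights, and `fit_nb` / `predict_proba_nb` are hypothetical helper names, not part of `puffin.nlp`. The `alpha` argument is additive (Laplace) smoothing, which is presumably what the classifier's `alpha` parameter controls as well.

```python
import math
from collections import Counter, defaultdict


def fit_nb(texts, labels, alpha=1.0):
    """Multinomial Naive Bayes on raw word counts with additive smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n / len(labels)),
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
            },
        }
    return model


def predict_proba_nb(model, text):
    """Posterior over classes; words unseen in training are ignored."""
    words = text.lower().split()
    scores = {
        label: p["log_prior"] + sum(p["log_lik"][w] for w in words if w in p["log_lik"])
        for label, p in model.items()
    }
    peak = max(scores.values())  # subtract the max for numerical stability
    exp = {k: math.exp(v - peak) for k, v in scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


train = [
    "stock surges on strong earnings beat",
    "shares rally after positive guidance",
    "stock plunges after earnings miss",
    "company warns of significant losses",
]
labels = ["bullish", "bullish", "bearish", "bearish"]
model = fit_nb(train, labels)
print(predict_proba_nb(model, "strong earnings rally"))
```

Smoothing matters here because "rally" never appears in a bearish training headline. Without it the bearish likelihood would be exactly zero.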

## 4. Topic Modeling

Topic models discover latent themes in a collection of documents:

- **LSI** (Latent Semantic Indexing): applies truncated SVD to the TF-IDF matrix. Fast, linear,
  and useful for dimensionality reduction before downstream tasks.
- **LDA** (Latent Dirichlet Allocation): probabilistic generative model where each
  document is a mixture of topics and each topic is a mixture of words.
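
To illustrate the SVD idea behind LSI without any dependencies, the sketch below extracts only the first topic of a toy document-term matrix using power iteration. The counts are invented, raw counts stand in for TF-IDF weights, and `top_topic` is a hypothetical helper; a real implementation would use a library SVD routine and return several components.

```python
import math

# Rows are documents, columns are terms (toy counts, invented for illustration).
terms = ["rates", "inflation", "fed", "cloud", "revenue"]
A = [
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
    [1, 1, 1, 0, 0],
]


def top_topic(matrix, n_iter=100):
    """Leading right singular vector of `matrix` via power iteration on A^T A."""
    n_terms = len(matrix[0])
    v = [1.0 / (j + 1) for j in range(n_terms)]  # arbitrary non-zero start
    for _ in range(n_iter):
        # A v: projection of each document onto the current direction
        proj = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # A^T (A v): map back to term space
        v = [sum(row[j] * p for row, p in zip(matrix, proj)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    doc_weights = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return v, doc_weights


topic, doc_weights = top_topic(A)
print({t: round(w, 3) for t, w in zip(terms, topic)})
print([round(w, 3) for w in doc_weights])
```

The dominant component loads on the macro terms and assigns essentially zero weight to "cloud" and "revenue", so the two tech documents receive near-zero topic weights.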

`find_optimal_topics` sweeps over a range of topic counts and returns the number
with the highest coherence (LDA) or explained variance (LSI).
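
The sweep itself is a simple pattern: score every candidate topic count and keep the best one. The sketch below shows that pattern with a generic scoring callback. `sweep_topic_counts` is a hypothetical helper, not the signature of `find_optimal_topics`, and the coherence values are invented placeholders for a real model evaluation.

```python
from typing import Callable


def sweep_topic_counts(
    score_fn: Callable[[int], float], min_topics: int, max_topics: int
):
    """Score every k in [min_topics, max_topics]; return (best_k, scores)."""
    scores = {k: score_fn(k) for k in range(min_topics, max_topics + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Invented coherence values; in practice score_fn would fit and evaluate a model.
fake_coherence = {2: 0.41, 3: 0.47, 4: 0.44, 5: 0.39}
best_k, scores = sweep_topic_counts(fake_coherence.__getitem__, 2, 5)
print(best_k, scores)
```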

In [None]:
from puffin.nlp import LSIModel, LDAModel, find_optimal_topics

# Financial news corpus -- enough docs to extract meaningful topics
corpus = [
    "Federal Reserve raises interest rates to combat persistent inflation pressures",
    "Central bank signals further monetary tightening as inflation remains elevated",
    "Treasury yields climb after hawkish Fed commentary on rate path",
    "Bond market sells off as investors price in additional rate hikes",
    "Apple reports record quarterly revenue driven by strong iPhone sales",
    "Microsoft cloud revenue grows 30 percent year over year beating estimates",
    "Google advertising revenue disappoints sending tech stocks lower",
    "Amazon web services growth accelerates boosting overall company profits",
    "Oil prices surge after OPEC announces surprise production cuts",
    "Energy stocks rally as crude oil breaks above resistance levels",
    "Natural gas prices decline on warmer weather forecast and high inventory",
    "Renewable energy investments accelerate as costs continue declining",
    "US jobs report shows strong hiring with unemployment near historic lows",
    "Wage growth accelerates adding to inflation concerns for the Fed",
    "Consumer spending resilient despite rising prices and higher rates",
]

# --- LSI Model ---
lsi = LSIModel()
lsi.fit(corpus, n_topics=3)
topics = lsi.get_topics(n_words=5)

print("=== LSI Topics ===")
for topic_id, words in topics:
    word_str = ", ".join(f"{w} ({v:.3f})" for w, v in words)
    print(f"  Topic {topic_id}: {word_str}")

print(f"\nExplained variance: {lsi.explained_variance_ratio()}")

# Transform a new document
new_doc = ["Inflation data surprises to the upside pushing bond yields higher"]
weights = lsi.transform(new_doc)
print(f"New doc topic weights: {weights[0].round(3)}")

# --- LDA Model (sklearn fallback for portability) ---
lda = LDAModel(use_gensim=False)
lda.fit(corpus, n_topics=3)
lda_topics = lda.get_topics(n_words=5)

print("\n=== LDA Topics ===")
for topic_id, words in lda_topics:
    word_str = ", ".join(f"{w} ({v:.2f})" for w, v in words)
    print(f"  Topic {topic_id}: {word_str}")

## 5. Earnings Topic Analysis

The `EarningsTopicAnalyzer` wraps topic modeling with earnings-call-specific
helpers: dominant-topic detection, temporal trend analysis, and per-topic
sentiment mapping.
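
The three helpers can be understood as simple post-processing of a document-topic weight matrix. The following sketch shows one plausible way to compute them. All weights and sentiment scores are invented, and the logic is an assumption about how such an analyzer might work rather than a description of `EarningsTopicAnalyzer` internals.

```python
from collections import defaultdict

# Hypothetical document-topic weights (rows = calls, columns = topics) and
# per-call sentiment scores; every number here is invented for illustration.
dates = ["2024-01-15", "2024-04-15", "2024-07-15", "2024-10-15"]
doc_topic = [
    [0.70, 0.10, 0.20],
    [0.20, 0.65, 0.15],
    [0.10, 0.15, 0.75],
    [0.15, 0.70, 0.15],
]
sentiment = [0.6, 0.4, -0.3, 0.5]
n_topics = len(doc_topic[0])

# Dominant topic: index of the largest absolute weight in each row
# (absolute value because LSI weights can be negative).
dominant = [max(range(n_topics), key=lambda k: abs(row[k])) for row in doc_topic]

# Temporal trend: one date-indexed weight series per topic.
trend = {k: {d: row[k] for d, row in zip(dates, doc_topic)} for k in range(n_topics)}

# Per-topic sentiment: average the sentiment of calls dominated by each topic.
by_topic = defaultdict(list)
for topic_id, score in zip(dominant, sentiment):
    by_topic[topic_id].append(score)
topic_sentiment = {
    t: {"avg_sentiment": sum(s) / len(s), "document_count": len(s)}
    for t, s in sorted(by_topic.items())
}

print(dominant)
print(trend[1])
print(topic_sentiment)
```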

In [None]:
from puffin.nlp import EarningsTopicAnalyzer

# Simulated earnings-call excerpts for a tech company over several quarters
transcripts = [
    "We delivered strong revenue growth driven by cloud services and subscription "
    "recurring revenue. Operating margins improved as we scaled our platform.",

    "Cloud adoption continued to accelerate across enterprise customers. "
    "We invested heavily in AI infrastructure to support growing demand.",

    "Advertising revenue softened due to macro headwinds. We are focused on "
    "cost discipline and efficiency improvements across the organization.",

    "Our AI products are seeing significant traction with enterprise clients. "
    "Revenue from AI services grew over 50 percent quarter over quarter.",

    "We continue to invest in AI infrastructure while maintaining cost discipline. "
    "Cloud revenue reached a new record driven by AI workload migration.",

    "Capital expenditure increased substantially to support AI data center expansion. "
    "We expect this investment to drive long term revenue growth and margins.",

    "Subscription revenue growth remains robust with strong net retention rates. "
    "We are seeing healthy demand across all customer segments and geographies.",

    "Operating expenses declined as restructuring efforts take effect. "
    "Free cash flow improved significantly enabling increased shareholder returns.",
]

dates = [
    "2024-01-15", "2024-04-15", "2024-07-15", "2024-10-15",
    "2025-01-15", "2025-04-15", "2025-07-15", "2025-10-15",
]

analyzer = EarningsTopicAnalyzer(n_topics=3, model_type="lsi")
results = analyzer.analyze(transcripts, dates=dates)

print("=== Earnings Topics ===")
for topic_id, words in results["topics"]:
    top_words = ", ".join(w for w, _ in words[:6])
    print(f"  Topic {topic_id}: {top_words}")

print(f"\nDominant topics per quarter: {results['dominant_topics']}")

# Topic sentiment mapping
sentiment_map = analyzer.topic_sentiment(transcripts)
print("\nTopic sentiment:")
for tid, info in sentiment_map.items():
    print(f"  Topic {tid}: avg_sentiment={info['avg_sentiment']:.3f}, "
          f"docs={info['document_count']}")

## 6. Word Embeddings

Dense vector representations capture semantic relationships between words:

- **Word2Vec** (skip-gram / CBOW): learns embeddings by predicting context words.
  Financial analogies like *bull - positive + negative = bear* can emerge when the training corpus is large enough.
- **Doc2Vec** (PV-DM / PV-DBOW): extends Word2Vec to learn a single vector per
  document -- useful for comparing entire earnings transcripts or SEC filings.
- **GloVe**: factorises the global word co-occurrence matrix. Puffin can load
  pretrained GloVe vectors via `GloVeLoader`.

All three classes live in `puffin.nlp.embeddings` and require `gensim`.
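
The analogy arithmetic and the idea of a document-level vector can both be shown with a few hand-made vectors. The sketch below uses invented 3-dimensional vectors, so the result is built in by construction rather than learned. The `analogy` and `document_vector` helpers are hypothetical, and averaging word vectors is just one common way to build a document vector.

```python
import math

# Hand-made 3-d vectors for illustration only; real embeddings are learned.
vectors = {
    "bull": [0.9, 0.8, 0.1],
    "bear": [0.9, -0.8, 0.1],
    "positive": [0.1, 0.9, 0.0],
    "negative": [0.1, -0.9, 0.0],
    "revenue": [0.2, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative):
    """Word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def document_vector(tokens):
    """Mean of the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * len(next(iter(vectors.values())))
    return [sum(col) / len(known) for col in zip(*known)]


print(analogy(positive=["bull", "negative"], negative=["positive"]))  # bear
print(document_vector(["strong", "revenue", "growth"]))
```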

In [None]:
# Word2Vec and Doc2Vec require gensim -- demonstrate the API pattern
# even if gensim is not installed in the current environment.

try:
    from puffin.nlp.embeddings import Word2VecTrainer, Doc2VecTrainer

    # Tokenised financial sentences (list of lists)
    tokenized_docs = [
        ["revenue", "growth", "exceeded", "expectations", "strong", "margins"],
        ["earnings", "beat", "estimates", "driven", "by", "cloud", "revenue"],
        ["stock", "price", "declined", "after", "missing", "revenue", "targets"],
        ["interest", "rates", "rose", "as", "inflation", "remained", "elevated"],
        ["market", "volatility", "increased", "amid", "geopolitical", "tensions"],
        ["dividend", "yield", "improved", "after", "strong", "cash", "flow"],
        ["cloud", "revenue", "growth", "accelerated", "across", "enterprise"],
        ["operating", "margins", "expanded", "due", "to", "cost", "discipline"],
    ]

    # --- Word2Vec ---
    w2v = Word2VecTrainer()
    w2v.train(tokenized_docs, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

    print("=== Word2Vec ===")
    vec = w2v.word_vector("revenue")
    print(f"'revenue' vector shape: {vec.shape}")
    similar = w2v.similar_words("revenue", topn=5)
    print(f"Similar to 'revenue': {similar}")

    doc_vec = w2v.document_vector(["strong", "revenue", "growth"])
    print(f"Document vector shape: {doc_vec.shape}")

    # --- Doc2Vec ---
    d2v = Doc2VecTrainer()
    d2v.train(tokenized_docs, vector_size=50, window=3, min_count=1, dm=1, epochs=50)

    print("\n=== Doc2Vec ===")
    inferred = d2v.infer_vector(["revenue", "growth", "strong"])
    print(f"Inferred vector shape: {inferred.shape}")

    train_vec = d2v.document_vector("0")
    print(f"Training doc 0 vector shape: {train_vec.shape}")

except ImportError:
    print("gensim not installed -- skipping Word2Vec / Doc2Vec demo.")
    print("Install with: pip install gensim")

## 7. Putting It All Together

A typical NLP-for-trading workflow chains the components above:

1. **Preprocess** raw text with `NLPPipeline`
2. **Vectorise** with `DocumentTermMatrix` or word embeddings
3. **Score sentiment** with `RuleSentiment` / `LexiconSentiment`
4. **Discover themes** with `LSIModel` / `LDAModel`
5. **Classify** incoming news with `NewsClassifier`
6. Feed signals into the Puffin backtesting engine
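
The last step depends on how sentiment is turned into a tradable signal. One simple possibility is sketched below: average the scores per ticker and threshold them into long, short, or flat. The `sentiment_to_signals` helper, its thresholds, and the sample scores are all assumptions for illustration; the sketch does not use or describe the Puffin backtesting API.

```python
from collections import defaultdict


def sentiment_to_signals(records, long_above=0.2, short_below=-0.2):
    """Average sentiment per ticker, then map to +1 (long), -1 (short), 0 (flat)."""
    by_ticker = defaultdict(list)
    for rec in records:
        by_ticker[rec["ticker"]].append(rec["sentiment_score"])
    signals = {}
    for ticker, scores in by_ticker.items():
        avg = sum(scores) / len(scores)
        if avg > long_above:
            signals[ticker] = 1
        elif avg < short_below:
            signals[ticker] = -1
        else:
            signals[ticker] = 0
    return signals


# Invented scores standing in for the output of a sentiment scorer.
sample_records = [
    {"ticker": "AAPL", "sentiment_score": 0.5},
    {"ticker": "AAPL", "sentiment_score": 0.1},
    {"ticker": "TSLA", "sentiment_score": -0.6},
    {"ticker": "JPM", "sentiment_score": 0.05},
]
print(sentiment_to_signals(sample_records))
```

The neutral band between the two thresholds keeps weakly scored headlines from generating trades.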

In [None]:
import pandas as pd
from puffin.nlp import NLPPipeline, RuleSentiment

# Simulated incoming news feed
news_feed = [
    {"ticker": "AAPL", "headline": "Apple reports record services revenue, raises buyback program"},
    {"ticker": "TSLA", "headline": "Tesla deliveries miss estimates as competition intensifies in China"},
    {"ticker": "MSFT", "headline": "Microsoft Azure growth accelerates on strong AI adoption"},
    {"ticker": "JPM",  "headline": "JPMorgan warns of rising loan losses amid economic uncertainty"},
    {"ticker": "NVDA", "headline": "Nvidia data center revenue surges on unprecedented AI chip demand"},
]

pipeline = NLPPipeline()
sentiment = RuleSentiment()

records = []
for item in news_feed:
    doc = pipeline.process(item["headline"])
    score = sentiment.score(item["headline"])
    analysis = sentiment.analyze(item["headline"])
    records.append({
        "ticker": item["ticker"],
        "headline": item["headline"],
        "sentiment_score": round(score, 3),
        "label": analysis["label"],
        "n_tokens": len(doc.tokens),
    })

df = pd.DataFrame(records)
print(df.to_string(index=False))
print(f"\nAverage sentiment: {df['sentiment_score'].mean():.3f}")

## Exercises

1. **Custom lexicon**: Add domain-specific terms to `LexiconSentiment` (e.g., "tapering" as negative, "dovish" as positive) and re-score the Fed-related headlines.
2. **Topic sweep**: Use `find_optimal_topics` on the earnings transcripts with `min_topics=2, max_topics=6`. Plot the coherence/variance curve.
3. **Classifier evaluation**: Split the training headlines into train/test sets, fit `NewsClassifier`, and call `evaluate()` to inspect precision, recall, and F1.
4. **Embedding analogies**: With a larger corpus, try `Word2VecTrainer.analogy(positive=['bullish', 'decline'], negative=['bearish'])` and inspect the result.
5. **End-to-end signal**: Combine sentiment scores with topic weights into a single alpha signal and feed it into the Puffin backtester.