# Part V: Advanced Topics and Modern Analysis

## Chapter 19: Sentiment Analysis and Big Data

**Chapter Objective:** Traditional analysis relies on financial statements and price charts, but the modern investor has access to a vast array of new data sources. Sentiment analysis—the measurement of public mood and opinion—can provide an edge by capturing shifts in investor psychology before they are fully reflected in prices. This chapter explores how to gather and analyze sentiment from news articles, social media, insider transactions, and analyst ratings. We also introduce the world of alternative data (big data) and discuss how to integrate these signals into a disciplined investment process. By the end, you will be able to build your own sentiment indicators and critically evaluate the growing role of data science in investing.

---

### 19.1 What is Sentiment Analysis?

Sentiment analysis, also known as opinion mining, uses natural language processing (NLP) to determine the emotional tone behind a body of text. In finance, it is applied to news headlines, social media posts, earnings call transcripts, and other textual data to gauge whether the market's mood is bullish, bearish, or neutral.

**Why Sentiment Matters**

- **Behavioral Finance Link:** Markets are driven by human emotions—fear, greed, euphoria, panic. Sentiment analysis attempts to quantify these emotions.
- **Leading Indicator:** Shifts in sentiment can precede price movements. For example, a wave of negative news articles may foreshadow a sell‑off, although the relationship is far from reliable.
- **Contrarian Signals:** Extreme sentiment (widespread bullishness or bearishness) can signal market tops or bottoms.

**Types of Sentiment Data**

- **News Sentiment:** Tone of articles about a company, industry, or the economy.
- **Social Media Sentiment:** Tweets, Reddit posts (e.g., WallStreetBets), stock forums.
- **Earnings Call Sentiment:** Analysis of management's language during conference calls.
- **Insider Transactions:** Buying/selling by executives (a form of "revealed sentiment").
- **Analyst Ratings:** Recommendations and price targets from sell‑side analysts.

**Challenges**

- **Noise:** Much of social media is irrelevant or misleading.
- **Sarcasm and Context:** NLP models struggle with irony and complex language.
- **Timeliness:** Sentiment must be captured and acted upon quickly.
- **Overfitting:** Correlations between sentiment and returns may be spurious.

---

### 19.2 Analyzing News and Social Media Sentiment

#### Natural Language Processing Basics

Sentiment analysis typically involves:
1.  **Text Preprocessing:** Removing punctuation, stop words, stemming/lemmatization.
2.  **Feature Extraction:** Converting text to numerical vectors (bag‑of‑words, TF‑IDF, word embeddings).
3.  **Classification:** Assigning a sentiment score (positive, negative, neutral) using lexicon‑based methods (e.g., VADER, Loughran‑McDonald financial word lists) or machine learning models.

For financial applications, pre‑trained models like FinBERT (a BERT model fine‑tuned on financial text) are available.
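The three steps above can be illustrated with a toy lexicon-based scorer. This is a minimal sketch, not VADER or the Loughran‑McDonald lists: the word sets `POSITIVE`, `NEGATIVE`, and `STOP_WORDS` and the function names are made up for illustration, and the "feature extraction" step is reduced to counting lexicon hits.

```python
import re

# Hypothetical toy word lists for illustration only; real work would use
# a published lexicon such as Loughran-McDonald.
POSITIVE = {"beat", "growth", "profit", "record", "strong", "upgrade"}
NEGATIVE = {"miss", "loss", "lawsuit", "weak", "downgrade", "recall"}
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "to", "for"}

def preprocess(text):
    """Step 1: lowercase, strip punctuation and digits, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def lexicon_score(text):
    """Steps 2-3: count lexicon hits and map them to a score in [-1, 1]."""
    tokens = preprocess(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words: treat as neutral
    return (pos - neg) / (pos + neg)

print(lexicon_score("Apple posts record profit on strong iPhone growth"))  # 1.0
print(lexicon_score("Company faces lawsuit after product recall"))         # -1.0
print(lexicon_score("Shares were flat"))                                   # 0.0
```

A scorer this simple ignores negation ("not strong") and intensity, which is part of what rule-based tools like VADER and model-based tools like FinBERT add.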

#### Using News APIs

Several APIs provide access to news articles:
- **NewsAPI** (newsapi.org): Free tier with headlines from thousands of sources.
- **GDELT Project** (gdeltproject.org): Massive database of global news, updated every 15 minutes.
- **Bloomberg, Reuters, etc.:** Professional terminals (expensive).

#### Example: Analyzing Headlines with VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon‑ and rule‑based sentiment tool tuned for social media text. It also works reasonably well on news headlines, although its word list is not finance‑specific.

```python
import requests
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize VADER once and reuse it
analyzer = SentimentIntensityAnalyzer()

def get_news_sentiment(company_name, api_key, from_date, to_date):
    """
    Fetch news headlines for a company and compute average sentiment.
    Uses NewsAPI (an API key is required, including on the free tier).
    """
    # Pass the query as a dict so requests URL-encodes it
    # (company names may contain spaces or '&').
    params = {
        'q': company_name,
        'from': from_date,
        'to': to_date,
        'language': 'en',
        'sortBy': 'relevancy',
        'apiKey': api_key,
    }
    response = requests.get('https://newsapi.org/v2/everything',
                            params=params, timeout=10)
    if response.status_code != 200:
        print(f"Error fetching news (HTTP {response.status_code})")
        return None

    articles = response.json().get('articles', [])

    sentiments = []
    for article in articles:
        title = article.get('title')  # may be missing or None
        if title:
            vs = analyzer.polarity_scores(title)
            sentiments.append({
                'title': title,
                'compound': vs['compound'],  # overall score in [-1, 1]
                'positive': vs['pos'],
                'negative': vs['neg'],
                'neutral': vs['neu']
            })

    df = pd.DataFrame(sentiments)
    if df.empty:
        print("No articles found")
        return None

    avg_sentiment = df['compound'].mean()
    print(f"Average sentiment for {company_name}: {avg_sentiment:.3f}")
    print(f"Articles analyzed: {len(df)}")
    return df

# Example usage (replace with your NewsAPI key)
# df_news = get_news_sentiment('Apple', 'your_api_key', '2024-01-01', '2024-01-31')
```

#### Social Media Sentiment

Twitter (now X) and Reddit are popular sources. The `tweepy` library wraps the Twitter/X API, which requires a developer account and offers only limited free access, so check the current terms before building on it. Reddit data can be obtained via `praw`.

**Example: Reddit Sentiment (WallStreetBets)**

```python
import pandas as pd
import praw
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Reddit API credentials (create a Reddit app to get these)
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_SECRET',
    user_agent='sentiment_analysis/0.1'
)

subreddit = reddit.subreddit('wallstreetbets')
analyzer = SentimentIntensityAnalyzer()

sentiments = []
for submission in subreddit.hot(limit=50):
    title = submission.title
    if title:
        vs = analyzer.polarity_scores(title)
        sentiments.append({
            'title': title,
            'score': submission.score,
            'compound': vs['compound']
        })

df_reddit = pd.DataFrame(sentiments)
print("Reddit sentiment (WSB hot posts):")
print(df_reddit[['title', 'compound']].head())
```

#### Earnings Call Sentiment

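Earnings call transcripts are available from services like Seeking Alpha, and in some cases companies attach them to SEC filings. The Loughran‑McDonald dictionary was built specifically for financial documents. Transformer models such as FinBERT can capture context that word lists miss, although how much accuracy they add depends on the text and the task.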

```python
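# FinBERT sentiment (requires the transformers and torch libraries)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
model.eval()  # inference mode

def finbert_sentiment(text):
    """Return a dict mapping each sentiment label to its probability."""
    # Inputs longer than 512 tokens are truncated, so long transcripts
    # should be split into chunks before scoring.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():  # no gradients needed for inference
        logits = model(**inputs).logits
    probs = torch.nn.functional.softmax(logits, dim=-1)[0]
    # Read the label order from the model config rather than hard-coding it;
    # the index-to-label mapping differs between FinBERT checkpoints.
    return {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}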
```
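Because the FinBERT call above truncates its input at 512 tokens, a full earnings call transcript would be cut off after the first few paragraphs. One common workaround is to score the text in chunks and average the results. The sketch below assumes a hypothetical `score_fn` that maps a string to a single number (for example, the positive probability minus the negative probability), and it uses `max_words=300` as a rough stand-in for the token limit, since words and tokens are not the same thing.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def average_chunk_scores(text, score_fn, max_words=300):
    """Score each chunk with score_fn and return the length-weighted mean."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return None  # empty input
    weights = [len(c.split()) for c in chunks]
    scores = [score_fn(c) for c in chunks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Demonstration with a dummy scorer (a stand-in for a real model)
dummy = lambda chunk: 1.0 if "good" in chunk else -1.0
transcript = "good " * 300 + "bad " * 100
print(average_chunk_scores(transcript, dummy))  # 0.5
```

Weighting by chunk length keeps a short final chunk from counting as much as a full one. Splitting on sentence or speaker boundaries would be a natural refinement.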

---

### 19.3 Tracking Insider Trading Activity

Insiders (officers, directors, and beneficial owners of more than 10% of a company's stock) are required to report their transactions to the SEC on Form 4, generally within two business days. These filings are publicly available and can provide valuable signals.

#### Why Insider Trading Matters

- **Alignment:** Insiders buying their own stock signal confidence.
- **Information Advantage:** Insiders know the company better than outsiders.
- **Predictive Power:** Studies show that insider buying (especially open market purchases) tends to precede positive returns, while selling can be less informative (sales may be for diversification).

#### Data Sources

- **SEC EDGAR:** Raw Form 4 filings (difficult to parse).
- **Yahoo Finance:** Summary insider transaction data, accessible through the `yfinance` library as `stock.insider_transactions`.
- **Commercial APIs:** Quiver Quantitative, Alpha Vantage, Intrinio offer cleaner data.

#### Analyzing Insider Transactions

Key points:
- **Open Market Purchases:** The most bullish signal; insider buys with own money.
- **Option Exercises:** Often not a signal (may be for tax reasons or to sell).
- **Cluster Buying:** Multiple insiders buying around the same time is a strong signal.
- **Magnitude:** Large purchases relative to insider's salary are significant.
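The cluster-buying idea above can be made concrete by counting distinct insiders, not transactions, inside a time window. This is a minimal sketch on made-up data: the record layout (`insider`, `type`, `date`, `shares`), the 90-day window, and the three-insider threshold are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

def detect_cluster_buying(transactions, as_of, window_days=90, min_insiders=3):
    """Return (sorted distinct buyers in the window, whether they form a cluster)."""
    start = as_of - timedelta(days=window_days)
    buyers = {
        t["insider"]
        for t in transactions
        # Only open-market purchases count; option exercises are ignored
        if t["type"] == "purchase" and start <= t["date"] <= as_of
    }
    return sorted(buyers), len(buyers) >= min_insiders

# Made-up sample records for illustration
sample = [
    {"insider": "CEO", "type": "purchase", "date": date(2024, 3, 1), "shares": 5000},
    {"insider": "CFO", "type": "purchase", "date": date(2024, 3, 10), "shares": 2000},
    {"insider": "CEO", "type": "purchase", "date": date(2024, 3, 12), "shares": 1000},
    {"insider": "Director A", "type": "option_exercise", "date": date(2024, 3, 15), "shares": 8000},
    {"insider": "Director B", "type": "purchase", "date": date(2024, 3, 20), "shares": 1500},
    {"insider": "Director C", "type": "purchase", "date": date(2023, 6, 1), "shares": 900},
]

buyers, is_cluster = detect_cluster_buying(sample, as_of=date(2024, 3, 31))
print(buyers, is_cluster)  # ['CEO', 'CFO', 'Director B'] True
```

The CEO's two purchases count once, the option exercise is excluded, and the 2023 purchase falls outside the window.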

**Python Example: Fetching Insider Transactions with yfinance**

```python
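import yfinance as yf
import pandas as pd

def analyze_insider_activity(ticker, min_shares=1000):
    stock = yf.Ticker(ticker)
    transactions = stock.insider_transactions

    if transactions is None or transactions.empty:
        print("No insider transaction data available")
        return None

    # NOTE: the column names used here ('Transaction', 'Text', 'Shares',
    # 'Value', 'Start Date', 'Insider') are assumptions; they vary across
    # yfinance versions, so inspect transactions.columns first.
    tx = transactions.copy()
    tx['Start Date'] = pd.to_datetime(tx['Start Date'], errors='coerce')

    # The description may sit in 'Transaction' or 'Text' depending on version
    blank = pd.Series('', index=tx.index)
    desc = (tx.get('Transaction', blank).fillna('').astype(str) + ' ' +
            tx.get('Text', blank).fillna('').astype(str))

    # Filter for purchases of at least min_shares
    purchases = tx[desc.str.contains('Purchase', case=False) &
                   (tx['Shares'] >= min_shares)].copy()

    if purchases.empty:
        print("No significant purchases found")
        return None

    # Use the reported value if present; otherwise derive it from the price
    if 'Value' not in purchases.columns and 'Price' in purchases.columns:
        purchases['Value'] = purchases['Shares'] * purchases['Price']
    purchases = purchases.sort_values('Start Date', ascending=False)

    print(f"=== Insider Purchases for {ticker} ===")
    wanted = ['Start Date', 'Insider', 'Position', 'Title', 'Shares', 'Price', 'Value']
    print(purchases[[c for c in wanted if c in purchases.columns]].to_string(index=False))

    # Cluster check: count distinct insiders (not transactions) in the last 3 months
    recent = purchases[purchases['Start Date'] > pd.Timestamp.now() - pd.DateOffset(months=3)]
    n_buyers = recent['Insider'].nunique()
    if n_buyers >= 3:
        print("⚠️  Cluster of recent insider purchases detected!")
    else:
        print(f"Distinct recent buyers: {n_buyers}")

    return purchases

# Example
insider_data = analyze_insider_activity('AAPL')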
```

#### Creating an Insider Sentiment Indicator

You can build a simple score based on net insider activity (buys minus sells) over a rolling window.

```python
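def insider_sentiment_score(transactions, lookback_days=90, shares_outstanding=None):
    """
    Net insider sentiment: shares purchased minus shares sold over the
    lookback window. If shares_outstanding is given, the result is a
    fraction of shares outstanding; otherwise it is the raw net share count.
    """
    if transactions is None or transactions.empty:
        return 0

    cutoff = pd.Timestamp.now() - pd.DateOffset(days=lookback_days)
    # Parse dates defensively in case they arrive as strings
    dates = pd.to_datetime(transactions['Start Date'], errors='coerce')
    recent = transactions[dates > cutoff]

    # Column names are assumptions; check them against your data source
    buys = recent[recent['Transaction'].str.contains('Purchase', na=False)]['Shares'].sum()
    sells = recent[recent['Transaction'].str.contains('Sale', na=False)]['Shares'].sum()

    net = buys - sells
    if shares_outstanding:
        # Normalizing makes scores comparable across companies of different size
        return net / shares_outstanding
    return net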
```

---

### 19.4 Analyzing Analyst Ratings and Their Effectiveness

Sell‑side analysts at investment banks and research firms issue ratings (buy, hold, sell) and price targets. While their independence can be questioned (due to investment banking conflicts), consensus ratings and changes still move markets.

#### Understanding Analyst Data

- **Rating Scale:** Typically 1‑5 (strong buy to strong sell) or similar.
- **Price Target:** Analyst's estimate of fair value over the next 12 months.
- **Upgrades/Downgrades:** Changes in rating can cause immediate price moves.
- **Consensus:** Average rating and price target from all analysts covering a stock.

#### Data Sources

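- **Yahoo Finance:** Provides summary analyst data through the `yfinance` library. Attribute names depend on the library version: `stock.recommendations` and `stock.analyst_price_target` in older releases, with `stock.upgrades_downgrades` and `stock.analyst_price_targets` in more recent ones.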
- **Bloomberg, FactSet:** Professional terminals.
- **Free APIs:** Some limited data from financial APIs (Alpha Vantage, Financial Modeling Prep).

#### Python Example: Fetching Analyst Recommendations

```python
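import yfinance as yf

def get_analyst_ratings(ticker):
    stock = yf.Ticker(ticker)

    # NOTE: attribute names depend on the yfinance version. Recent releases
    # expose rating changes as `upgrades_downgrades` and price targets as
    # `analyst_price_targets`; older ones used `recommendations` and
    # `analyst_price_target`. Check which ones your version provides.
    recs = getattr(stock, 'upgrades_downgrades', None)
    if recs is None or recs.empty:
        recs = stock.recommendations
    if recs is not None and not recs.empty:
        print(f"=== Analyst Recommendations for {ticker} ===")
        print(recs.sort_index().tail(10))  # last 10 rows (most recent if date-indexed)
    else:
        print("No recommendation data")

    target = getattr(stock, 'analyst_price_targets', None)
    if isinstance(target, dict) and target:
        print("\nCurrent Price Target:")
        print(f"  Mean: ${target.get('mean', 'N/A')}")
        print(f"  High: ${target.get('high', 'N/A')}")
        print(f"  Low:  ${target.get('low', 'N/A')}")
    return recs, target

get_analyst_ratings('AAPL')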
```

#### Analyzing Analyst Sentiment

You can create a sentiment score based on:
- **Net Upgrades/Downgrades:** Number of upgrades minus downgrades over a period.
- **Price Target Revisions:** Percentage change in mean price target.
- **Dispersion:** High dispersion (disagreement) may indicate uncertainty.
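The price target revision and dispersion measures listed above reduce to simple arithmetic. The sketch below uses made-up price targets; measuring dispersion as the coefficient of variation (standard deviation divided by the mean) is one reasonable choice among several, not a standard the chapter prescribes.

```python
from statistics import mean, pstdev

def target_revision_pct(old_targets, new_targets):
    """Percentage change in the mean price target between two snapshots."""
    old_mean, new_mean = mean(old_targets), mean(new_targets)
    return (new_mean - old_mean) / old_mean * 100

def target_dispersion(targets):
    """Coefficient of variation: std dev of the targets divided by their mean."""
    return pstdev(targets) / mean(targets)

# Made-up analyst targets three months ago and today
old = [180, 190, 200, 210]   # mean 195
new = [190, 200, 210, 220]   # mean 205

print(round(target_revision_pct(old, new), 2))  # 5.13
print(round(target_dispersion(new), 4))
```

Dividing by the mean makes dispersion comparable between a $20 stock and a $2,000 stock, which a raw standard deviation would not be.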

**Example: Net Rating Score**

```python
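def analyst_sentiment_score(ticker, lookback_days=90):
    stock = yf.Ticker(ticker)
    # Assumes a yfinance version where `upgrades_downgrades` is indexed by
    # date and has 'ToGrade' and 'Action' columns; adjust if yours differs.
    recs = getattr(stock, 'upgrades_downgrades', None)
    if recs is None or recs.empty:
        return 0

    cutoff = pd.Timestamp.now() - pd.DateOffset(days=lookback_days)
    recent = recs[recs.index > cutoff]

    if 'Action' in recent.columns:
        # Assumed values: 'up' and 'down' mark actual rating changes
        upgrade_count = (recent['Action'] == 'up').sum()
        downgrade_count = (recent['Action'] == 'down').sum()
    else:
        # Fallback proxy: count bullish vs. bearish ratings issued,
        # which is not the same thing as upgrades vs. downgrades
        upgrade_count = recent['ToGrade'].str.contains('Buy|Outperform', na=False).sum()
        downgrade_count = recent['ToGrade'].str.contains('Sell|Underperform', na=False).sum()

    return int(upgrade_count - downgrade_count)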
```

---

### 19.5 Alternative Data and Big Data in Investing

Alternative data refers to non‑traditional data sources used to gain an investment edge. Hedge funds and quantitative funds have been using alternative data for years; now it's becoming more accessible to retail investors.

#### Examples of Alternative Data

- **Satellite Imagery:** Count cars in retail parking lots to estimate foot traffic and sales (e.g., for Walmart or McDonald's).
- **Credit Card Transactions:** Aggregated and anonymized data to track consumer spending (e.g., for specific retailers).
- **Web Traffic:** Unique visitors to e‑commerce sites (e.g., SimilarWeb; Amazon's Alexa traffic rankings were retired in 2022).
- **Job Postings:** Number and type of job openings to gauge hiring trends (Indeed, LinkedIn).
- **App Downloads:** Sensor Tower or data.ai (formerly App Annie) data for app‑based companies.
- **Supply Chain Data:** Shipping container tracking, port data (for industrials, retailers).
- **Social Media Sentiment:** Beyond simple sentiment, tracking brand mentions, hashtags.

#### How Hedge Funds Use Alternative Data

- **Earnings Prediction:** Using credit card data to estimate same‑store sales before earnings release.
- **Macro Indicators:** Using satellite images of oil tankers to predict oil inventories.
- **Event Detection:** Monitoring social media for early signs of product issues.

#### Challenges

- **Cost:** Many datasets are expensive (tens of thousands of dollars per year).
- **Data Cleaning:** Raw data is messy; requires significant processing.
- **Legality:** Must ensure data is obtained legally and ethically. In particular, it must not contain material non‑public information, which would raise insider trading concerns.
- **Correlation vs. Causation:** Easy to find spurious patterns.

#### Getting Started with Alternative Data (Free/Cheap Sources)

- **Google Trends:** Search interest for a company or product.
- **Wikipedia Page Views:** Traffic to a company's Wikipedia page.
- **Reddit/Twitter APIs:** Free or low‑cost tiers exist but are rate‑limited, and access terms change often.
- **SEC EDGAR:** Download all filings and analyze text (e.g., risk factor changes).

**Example: Google Trends for a Company**

```python
from pytrends.request import TrendReq  # unofficial Google Trends client
import pandas as pd

def google_trends_sentiment(keyword, timeframe='today 3-m'):
    """Return search interest over time (0-100 scale) for a keyword."""
    pytrends = TrendReq(hl='en-US', tz=360)
    pytrends.build_payload([keyword], cat=0, timeframe=timeframe, geo='', gprop='')
    data = pytrends.interest_over_time()
    if data.empty:
        return None
    # 'isPartial' flags incomplete periods; drop it if present
    data = data.drop(columns=['isPartial'], errors='ignore')
    data.plot(title=f'Google Trends for "{keyword}"')  # requires matplotlib
    return data

# Example: Search interest for "Nike"
trends = google_trends_sentiment('Nike')
```
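Raw search interest is a level, not a signal; what usually matters is whether interest is rising or falling. A minimal sketch of one way to measure that compares the average of the most recent observations with the average of the period just before. The function name and the four-observation window are assumptions for illustration.

```python
def interest_change(values, window=4):
    """Percent change of the mean of the last `window` points vs. the prior `window`."""
    if len(values) < 2 * window:
        return None  # not enough history
    recent = sum(values[-window:]) / window
    prior = sum(values[-2 * window:-window]) / window
    if prior == 0:
        return None  # avoid dividing by zero when there was no interest
    return (recent - prior) / prior * 100

# Made-up weekly search interest on the 0-100 Google Trends scale
weekly = [50, 50, 50, 50, 60, 60, 60, 60]
print(interest_change(weekly))  # 20.0
```

With the DataFrame returned by the function above, the input could be a single column converted to a list, for example `trends['Nike'].tolist()`.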

---

### 19.6 Integrating Sentiment into Investment Decisions

Sentiment indicators are most powerful when combined with traditional analysis. A standalone sentiment spike is often just noise; sentiment that confirms a fundamental or technical signal is more valuable.

#### Building a Composite Sentiment Indicator

Combine multiple sentiment sources into a single score:

| Source | Metric | Weight |
|--------|--------|--------|
| News Sentiment | Average compound score (last week) | 20% |
| Insider Trading | Net purchases (last 3 months) | 30% |
| Analyst Ratings | Net upgrades (last 3 months) | 20% |
| Social Media | Reddit/Twitter sentiment (last week) | 15% |
| Google Trends | Relative search interest (change) | 15% |

**Python Example: Composite Score**

```python
def composite_sentiment(ticker, news_score=0):
    """
    Simple 0-3 composite: one point each for insider buying, net analyst
    upgrades, and positive news. news_score (0 or 1) is supplied by the
    caller, e.g. 1 if the average VADER compound score is positive.
    """
    score = 0
    details = {}

    # Insider: +1 if any purchase of at least min_shares is on record
    insiders = analyze_insider_activity(ticker, min_shares=1000)
    if insiders is not None and len(insiders) > 0:
        score += 1
        details['Insider'] = 1
    else:
        details['Insider'] = 0

    # Analyst: +1 if upgrades outnumber downgrades by more than 2
    analyst_score = analyst_sentiment_score(ticker)
    if analyst_score > 2:
        score += 1
        details['Analyst'] = 1
    else:
        details['Analyst'] = 0

    # News: passed in by the caller rather than assumed positive
    score += news_score
    details['News'] = news_score

    print(f"Composite Sentiment Score for {ticker}: {score}/3")
    print("Component scores:", details)
    return score
```
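The weights in the table above can also be applied directly. The sketch below assumes each component has already been rescaled to the range -1 to 1 (how to do that rescaling is a separate design decision), and it renormalizes the weights when a source is unavailable. The dictionary keys and function name are hypothetical.

```python
# Weights taken from the table above
WEIGHTS = {"news": 0.20, "insider": 0.30, "analyst": 0.20,
           "social": 0.15, "trends": 0.15}

def weighted_composite(components, weights=WEIGHTS):
    """
    Weighted average of component scores, each assumed to lie in [-1, 1].
    Components that are missing or None are skipped and the remaining
    weights are renormalized so the result stays in [-1, 1].
    """
    available = {k: v for k, v in components.items()
                 if k in weights and v is not None}
    total_weight = sum(weights[k] for k in available)
    if total_weight == 0:
        return None  # nothing to combine
    return sum(weights[k] * v for k, v in available.items()) / total_weight

# Made-up component scores; Google Trends data is unavailable here
scores = {"news": 0.4, "insider": 1.0, "analyst": 0.0,
          "social": -0.2, "trends": None}
print(round(weighted_composite(scores), 3))  # 0.412
```

Renormalizing keeps the score comparable across stocks with different data coverage, at the cost of letting fewer sources carry more weight each.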

#### Sentiment as a Contrarian Indicator

When sentiment reaches extreme levels, it can signal a reversal. For example, if every news article is bullish and insiders are selling, be cautious. A simple rule:

- **Bullish Extreme:** When sentiment score > 80th percentile historically → potential sell signal.
- **Bearish Extreme:** When sentiment score < 20th percentile → potential buy signal.
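The percentile rule above can be sketched as follows. The function names and the synthetic history are illustrative assumptions. In real use, `history` should contain only observations that were available before the current reading, otherwise the rule quietly looks into the future.

```python
def percentile_rank(history, value):
    """Percentage of historical observations less than or equal to value."""
    if not history:
        raise ValueError("history must not be empty")
    return 100 * sum(h <= value for h in history) / len(history)

def contrarian_signal(history, current, upper=80, lower=20):
    """Map the current score's historical percentile to a contrarian signal."""
    rank = percentile_rank(history, current)
    if rank > upper:
        return "potential sell"   # bullish extreme
    if rank < lower:
        return "potential buy"    # bearish extreme
    return "neutral"

# Synthetic history: 100 sentiment scores from -0.50 to 0.49
history = [i / 100 for i in range(-50, 50)]

print(contrarian_signal(history, 0.45))   # potential sell
print(contrarian_signal(history, -0.45))  # potential buy
print(contrarian_signal(history, 0.0))    # neutral
```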

#### Backtesting Sentiment Strategies

Before using any sentiment signal, backtest it. For example, buy when insider buying is heavy and hold for 3 months; compare to benchmark.

```python
def backtest_insider_signal(ticker, start_date, end_date):
    # This is a complex function; outline:
    # 1. Fetch historical insider transaction data
    # 2. For each month, compute net insider buying
    # 3. If net > threshold, simulate buying at next open
    # 4. Hold for fixed period, then sell
    # 5. Compare to buy-and-hold
    pass
```
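The outline above can be turned into a small, self-contained sketch. It is deliberately simplified and rests on several assumptions: it works on plain lists of made-up daily prices and signal values instead of fetched data, it enters at the next bar's price (standing in for the next open), it never holds overlapping positions, and it ignores transaction costs and dividends.

```python
def backtest_signal(prices, signal, threshold, hold_days):
    """
    prices: list of daily prices.
    signal: list of the same length; signal[i] is known at the end of day i.
    Enters on day i+1 when signal[i] > threshold, exits hold_days later.
    Returns (list of per-trade returns, buy-and-hold return).
    """
    trades = []
    i = 0
    while i < len(prices) - 1:
        entry = i + 1
        exit_ = entry + hold_days
        if signal[i] > threshold and exit_ < len(prices):
            trades.append(prices[exit_] / prices[entry] - 1)
            i = exit_          # wait until the position is closed
        else:
            i += 1             # no signal, or not enough data left to exit
    buy_and_hold = prices[-1] / prices[0] - 1
    return trades, buy_and_hold

# Made-up data for illustration
prices = [100, 102, 104, 103, 105, 110, 108, 112]
signal = [0, 5, 0, 0, 0, 6, 0, 0]   # e.g. net insider buying per day

trades, benchmark = backtest_signal(prices, signal, threshold=3, hold_days=2)
print(len(trades), [round(t, 4) for t in trades], round(benchmark, 4))
```

Here the first signal produces one trade (enter at 104, exit at 105), while the second is skipped because the holding period would run past the end of the data. A single trade says nothing statistically; a meaningful test needs many signals across many stocks and periods.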

---

### Chapter Summary

- **Sentiment analysis** quantifies market mood from news, social media, and other text. It can provide leading indicators and contrarian signals.
- **News and social media sentiment** can be analyzed using tools like VADER, FinBERT, and APIs (NewsAPI, Reddit, Twitter).
- **Insider transactions** (especially open market purchases) are a powerful signal. Track cluster buying and large purchases.
- **Analyst ratings** offer consensus views and changes can move prices, but be aware of conflicts.
- **Alternative data** (satellite imagery, credit card data, web traffic) is increasingly used by professionals; retail investors can start with free sources like Google Trends.
- **Integrate sentiment** with fundamental and technical analysis. A composite score can help filter signals.
- **Backtest any sentiment strategy** to ensure it adds value beyond random chance.

**Exercises:**

1.  **Conceptual:** Why might insider selling be less informative than insider buying? What are some legitimate reasons insiders sell?
2.  **Practical:** Choose a stock and use NewsAPI (free tier) to collect headlines from the past week. Compute average sentiment using VADER. Compare to the stock's price movement over the same period.
3.  **Research:** Look up a study on the predictive power of insider trading. What were the key findings? How large was the average excess return?
4.  **Coding:** Build a function that fetches analyst recommendations from yfinance for a list of stocks, computes a net upgrade score, and ranks them. Test whether the top‑ranked stocks outperformed the bottom‑ranked over the next month (requires historical data and backtesting).

---

**Looking Ahead to Chapter 20: Portfolio Management and Risk**

With a deep toolkit for analyzing individual stocks and special situations, the final piece is assembling them into a cohesive portfolio. Chapter 20 covers portfolio management—diversification, asset allocation, position sizing, and risk management. You will learn how to apply Modern Portfolio Theory, implement stop‑losses and hedging, and evaluate your portfolio's performance against benchmarks. This chapter brings together all the concepts from the book into a practical framework for long‑term investment success.