# News Headlines Topic Modeling Analysis

This notebook focuses on analyzing news headlines using Natural Language Processing (NLP) to:
1. Identify common keywords and phrases
2. Extract significant events and topics
3. Track how topics evolve over time

We'll use Latent Dirichlet Allocation (LDA) alongside keyword-based event detection to understand what the headlines discuss.

In [1]:
import os
import sys

import numpy as np
import pandas as pd

# Make the parent directory importable so that `src` resolves from this notebook's directory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from src.topic_modeling import (
    make_vectorizer,
    build_dtm,
    run_lda,
    top_keywords_per_topic,
    get_document_topics
)

## Data Loading

We'll use the cleaned news headlines dataset that we prepared in our EDA notebook.

In [14]:
cleaned_filtered_data = pd.read_csv("../Data/cleaned/cleaned_filtered_news.csv")
print(f"Loaded {len(cleaned_filtered_data)} headlines")

# Display sample headlines
print("\nSample Headlines:")
print(cleaned_filtered_data['headline'].sample(5).to_list())

Loaded 7261 headlines

Sample Headlines:
["IVP, Sapphire Ventures Lead $100M Series E In CircleCI's Leading Software Solution", "Benzinga's Top Upgrades, Downgrades For November 16, 2018", "Baird Reiterates Outperform On Tesla As Firm Notes 'Headline cash balance ($5B) and cash generation ($614M) numbers were extremely favorable, in our view, and should support future growth investments'", 'AMD, Texas Instruments See Short Interest Surge', "Trading 212 Takes On Disciplined Investing With 'Next Generation Product'"]


## Event Detection Framework

Before applying topic modeling, we scan the headlines for specific event types using keyword phrases.
This flags concrete instances of significant events such as "FDA approval" or "price target changes".
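
Plain substring checks are simple, but they can over-match: a short phrase such as 'ceo' also occurs inside longer words. A minimal sketch of a word-boundary variant follows. The helper name `find_events`, the reduced `demo_events` dictionary, and the third test headline are hypothetical and are not part of this notebook's pipeline.

```python
import re

# Hypothetical, reduced event dictionary used only for this illustration.
demo_events = {
    "leadership": ["ceo", "board member"],
    "price_targets": ["price target", "upgrades", "downgrades"],
}

# Pre-compile one pattern per event; \b anchors each phrase at word boundaries.
demo_patterns = {
    event: re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        flags=re.IGNORECASE,
    )
    for event, phrases in demo_events.items()
}


def find_events(headline, patterns=demo_patterns):
    """Return the event names whose phrases occur as whole words in the headline."""
    return [event for event, pattern in patterns.items() if pattern.search(headline)]


print(find_events("Did Alphabet's CEO Shuffle Come As A Surprise?"))            # ['leadership']
print(find_events("UBS Maintains Buy on NVIDIA, Raises Price Target to $330"))  # ['price_targets']
# Made-up headline: 'ceo' appears inside 'Cretaceous', so a substring check would match it.
print(find_events("Cretaceous fossil exhibit opens"))                           # []
```

Whether the stricter matching is worth it depends on how many false positives the substring version produces on the real headlines.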

In [15]:
# Define specific events to track
key_events = {
    'price_targets': ['price target', 'raises target', 'lowers target', 'upgrades', 'downgrades'],
    'fda_related': ['fda approval', 'fda clears', 'clinical trial', 'drug approval'],
    'earnings': ['beats earnings', 'misses earnings', 'earnings preview', 'quarterly results'],
    'leadership': ['ceo', 'executive changes', 'board member', 'management'],
    'product': ['launches', 'announces', 'unveils', 'releases']
}

# Analyze headlines for specific events
print("Analyzing Headlines for Specific Events:")
event_counts = {event: 0 for event in key_events}
headline_events = []

for headline in cleaned_filtered_data['headline']:
    headline_lower = headline.lower()
    found_events = []
    
    for event, phrases in key_events.items():
        if any(phrase in headline_lower for phrase in phrases):
            event_counts[event] += 1
            found_events.append(event)
    
    headline_events.append(found_events)

# Add events to dataframe
cleaned_filtered_data['specific_events'] = headline_events

# Show event statistics
print("\nFrequency of Specific Events:")
for event, count in event_counts.items():
    print(f"{event}: {count} headlines")

# Show example headlines for each event type
print("\nExample Headlines by Event Type:")
for event in key_events:
    event_headlines = cleaned_filtered_data[
        cleaned_filtered_data['specific_events'].apply(lambda x: event in x)
    ]['headline'].sample(min(3, event_counts[event]))
    
    print(f"\n{event.replace('_', ' ').title()}:")
    for headline in event_headlines:
        print(f"- {headline}")

Analyzing Headlines for Specific Events:

Frequency of Specific Events:
price_targets: 775 headlines
fda_related: 0 headlines
earnings: 26 headlines
leadership: 129 headlines
product: 172 headlines

Example Headlines by Event Type:

Price Targets:
- UBS Maintains Buy on NVIDIA, Raises Price Target to $330
- Credit Suisse Maintains Outperform on Amazon.com, Lowers Price Target to $2760
- Jefferies Maintains Buy on Tesla, Raises Price Target to $400

Fda Related:

Earnings:
- Nvidia Q1 Earnings Preview: Data Center, Gaming Inventory In Focus Amid Fundamental Uncertainties
- AMD Earnings Preview: A Look At What Might Be Expected For The Chipmakers' Q2 Results
- Nvidia Q3 Earnings Preview: Analysts Forecast Data Center, Gaming, Automotive Performance

Leadership:
- Did Alphabet's CEO Shuffle Come As A Surprise?
- UPDATE: Nomura On Alphabet Notes 'and management noted the first two months of the quarter were strong followed by a significant slowdown in advertising revenue in March (consiste

## Topic Modeling Analysis

Now we'll use LDA to discover broader themes and patterns in our headlines.
This will help us understand the general topics being discussed, beyond specific events.

In [16]:
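# Step 1: Convert headlines to numerical format
# Note: LDA is formulated over word counts. TF-IDF weights are used here;
# if make_vectorizer offers a count-based method, that is the more conventional input.
print("Converting headlines to numerical format...")
vectorizer = make_vectorizer(method='tfidf', max_features=5000)
dtm, feature_names = build_dtm(cleaned_filtered_data['headline'].tolist(), vectorizer)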

# Step 2: Run LDA to discover topics
print("\nDiscovering topics...")
n_topics = 15  # Number of topics to discover
lda_model = run_lda(dtm, num_topics=n_topics)

# Step 3: Get the main words for each topic
topics = top_keywords_per_topic(lda_model, feature_names, n_top=15)

# Print discovered topics
print("\nDiscovered Topics:")
for idx, topic_words in enumerate(topics):
    print(f"\nTopic {idx + 1}:")
    print(f"Keywords: {', '.join(topic_words)}")

Converting headlines to numerical format...

Discovering topics...

Discovered Topics:

Topic 1:
Keywords: tesla, biggest, changes, musk, price, elon, electrek, target, model, 10, says, china, electric, production, friday

Topic 2:
Keywords: trade, tesla, china, devices, google, says, advanced, purchases, etf, huge, micro, amazon, car, new, share

Topic 3:
Keywords: nvidia, citron, tesla, goldman, deutsche, services, graphics, says, amd, left, bank, cramer, sachs, communication, new

Topic 4:
Keywords: upgrades, downgrades, nvidia, benzinga, morgan, stanley, buy, pt, bank, hold, initiates, coverage, america, 00, corporation

Topic 5:
Keywords: tesla, street, model, google, nvidia, earnings, investor, year, wall, deliveries, movement, semiconductor, coronavirus, index, starts

Topic 6:
Keywords: media, social, google, cloud, dow, nvidia, crude, etf, trump, gaming, video, executive, games, house, afternoon

Topic 7:
Keywords: target, price, maintains, raises, nvidia, buy, outperform, low

## Topic Assignment Analysis

Let's examine how these topics appear in our headlines and how they relate to our specific events.
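
The code below treats each headline's `topics` entry as a list of 0-based topic indices. One common way to produce such lists is to threshold the per-document topic probabilities; the sketch below shows that idea with NumPy. The function name `assign_topics`, the `threshold` value, and the demo probabilities are assumptions for illustration, and `get_document_topics` may work differently.

```python
import numpy as np


def assign_topics(doc_topic_probs, threshold=0.3):
    """Return, per document, the 0-based topic indices whose probability meets the threshold.

    Falls back to the single most probable topic when no topic reaches the threshold.
    """
    doc_topic_probs = np.asarray(doc_topic_probs)
    assignments = []
    for row in doc_topic_probs:
        selected = np.flatnonzero(row >= threshold).tolist()
        if not selected:
            selected = [int(row.argmax())]
        assignments.append(selected)
    return assignments


# Made-up probabilities for three documents over four topics.
demo_probs = [
    [0.70, 0.10, 0.10, 0.10],  # one dominant topic
    [0.40, 0.40, 0.10, 0.10],  # two topics above the threshold
    [0.25, 0.25, 0.24, 0.26],  # nothing above the threshold, so the argmax is used
]
print(assign_topics(demo_probs))  # [[0], [0, 1], [3]]
```

With a scikit-learn LDA model, a matrix of this shape would typically come from `model.transform(dtm)`; that the project's model is a scikit-learn one is an assumption.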

In [17]:
# Assign topics to headlines
doc_topics = get_document_topics(lda_model, dtm)
cleaned_filtered_data['topics'] = doc_topics

# Show example headlines with their topics and events
print("Example Headlines with Topics and Events:")
for _, row in cleaned_filtered_data.sample(5).iterrows():
    print(f"\nHeadline: {row['headline']}")
    print(f"Assigned Topics: {[i+1 for i in row['topics']]}")
    print("Topic Keywords:")
    for topic_idx in row['topics']:
        print(f"- Topic {topic_idx + 1}: {', '.join(topics[topic_idx][:5])}")
    if row['specific_events']:
        print(f"Specific Events Found: {row['specific_events']}")

# Show topic assignment statistics
# value_counts() maps "number of topics per headline" (index) to "number of headlines" (value)
topic_counts = cleaned_filtered_data['topics'].apply(len).value_counts()
print("\nTopic Assignment Statistics:")
print(f"Headlines with single topic: {topic_counts.get(1, 0)}")
print(f"Headlines with multiple topics: {topic_counts[topic_counts.index > 1].sum()}")

Example Headlines with Topics and Events:

Headline: Shares of several semiconductor companies are trading higher following a strong Q2 earnings report from Micron.
Assigned Topics: [11]
Topic Keywords:
- Topic 11: shares, trading, higher, stock, lower

Headline: Mellanox Shares Volatile Over Last Few Mins. On Volume Spike; Hearing Capitol Forum Says Deal Approval 'In Trouble,' Recent Talks Failed To Address Objections
Assigned Topics: [1]
Topic Keywords:
- Topic 1: tesla, biggest, changes, musk, price

Headline: Today's Pickup: Idelic Integrates Samsara ELD, Camera Data Into Platform
Assigned Topics: [9]
Topic Keywords:
- Topic 9: nvidia, driving, ai, self, report

Headline: Shares of several semiconductor companies are trading higher. Strength potentially related to earnings from notable names in the space this week as well as overall market strength amid a rebound in oil.
Assigned Topics: [11]
Topic Keywords:
- Topic 11: shares, trading, higher, stock, lower

Headline: KrebsOnSecuri

## Temporal Analysis

Finally, let's analyze how our topics and events evolve over time.
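
Listing the set of topics seen in a year shows which topics occurred but not how often. A minimal sketch of a count-based view follows, using `explode` and `groupby` on a made-up frame that mirrors the `stock`, `year`, and `topics` columns. The `demo` data and the `topic_number` column are hypothetical.

```python
import pandas as pd

# Made-up rows mirroring the columns used in this notebook.
demo = pd.DataFrame({
    "stock": ["NVDA", "NVDA", "NVDA", "TSLA"],
    "year": [2019, 2019, 2020, 2020],
    "topics": [[3], [3, 8], [8], [0]],
})

# One row per (headline, topic) pair; headlines without any topic are dropped.
exploded = demo.explode("topics").dropna(subset=["topics"])
# Convert to 1-based topic numbers, matching the printed output.
exploded["topic_number"] = exploded["topics"].astype(int) + 1

# Count headlines per stock, year and topic, with topics as columns.
yearly_counts = (
    exploded.groupby(["stock", "year", "topic_number"])
    .size()
    .unstack("topic_number", fill_value=0)
)
print(yearly_counts)
```

A table like this makes it easier to see a topic growing or fading over the years than a list of topic numbers does.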

In [18]:
# Add year information
cleaned_filtered_data['year'] = pd.to_datetime(cleaned_filtered_data['date']).dt.year

# Analyze topics and events over time
print("Topic and Event Evolution by Company:")
for company in cleaned_filtered_data['stock'].unique():
    company_data = cleaned_filtered_data[cleaned_filtered_data['stock'] == company]
    print(f"\n{company}:")
    
    # Group by year
    yearly_data = company_data.groupby('year')
    
    for year, year_data in yearly_data:
        print(f"\nYear {year}:")
        # Show topics (1-based and sorted; the loop variable avoids reusing the name `topics`)
        year_topics = {topic for headline_topics in year_data['topics'] for topic in headline_topics}
        print("Topics:", sorted(i + 1 for i in year_topics))
        
        # Show specific events
        year_events = set([event for events in year_data['specific_events'] for event in events])
        if year_events:
            print("Specific Events:", list(year_events))

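# Save results (create the output directory first if it does not exist yet)
os.makedirs("../Data/cleaned/analyzed", exist_ok=True)
cleaned_filtered_data.to_csv("../Data/cleaned/analyzed/tech_news_with_topics_and_events.csv", index=False)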

Topic and Event Evolution by Company:

MSF:

Year 2010:
Topics: [10]

Year 2016:
Topics: [5]

NVDA:

Year 2011:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Events: ['earnings', 'leadership', 'product', 'price_targets']

Year 2012:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Events: ['leadership', 'product', 'price_targets']

Year 2013:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Events: ['earnings', 'leadership', 'product', 'price_targets']

Year 2014:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Events: ['product', 'price_targets']

Year 2015:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Events: ['leadership', 'product', 'price_targets']

Year 2016:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Events: ['earnings', 'leadership', 'product', 'price_targets']

Year 2017:
Topics: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Specific Event

In [19]:
cleaned_filtered_data.columns

Index(['Unnamed: 0', 'headline', 'url', 'publisher', 'date', 'stock',
       'specific_events', 'topics', 'year'],
      dtype='object')