# Stock Sentiment Analysis - Reddit vs the News
## 1. Data Collection

#### Created By: Ben Chamblee - https://github.com/Bench-amblee

### Contents
- [Introduction](#introduction)
- [Imports](#imports)
- [Define Data and Output](#define-data-and-output)
- [Reddit Data Collection](#reddit-data-collection)
- [News API Data Collection](#news-api-data-collection)
- [Combined Data Collection](#combined-data-collection)
- [Conclusion and Next Steps](#conclusion-and-next-steps)


### Introduction

This notebook documents the data collection process for a stock sentiment analysis machine learning project. It covers the tools and methods used to gather the raw data for later analysis and modeling.

The primary objectives of this notebook are to:

- Collect source data from Reddit and news outlets through their APIs
- Gather posts and articles about the 'Magnificent 7' stocks (Google, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla)
- Document data sources and retrieval methods
- Demonstrate the ETL (Extract, Transform, Load) pipeline
- Outline quality assurance checks implemented during collection
- Provide a clear path for reproducing the data collection process

### Imports

In [2]:
import praw # Reddit API wrapper for Python
import re # Regular expressions library for text pattern matching and manipulation
import requests # HTTP library for making requests to websites and APIs
import pandas as pd # Data manipulation library providing DataFrame structures
from datetime import datetime, timedelta # For working with dates and times
import time # Provides various time-related functions
import os # For interacting with the operating system (file paths, environment variables)
from dotenv import load_dotenv # For loading environment variables from a .env file (useful for API keys/credentials)

# Load environment variables from .env file
load_dotenv()


False
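
load_dotenv() returned `False` above, which generally means no variables were loaded from a `.env` file. A minimal sketch of a pre-flight check is shown here, assuming the variable names that the later `os.getenv` calls in this notebook use. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, not part of the original pipeline.

```python
import os

# Names taken from the os.getenv calls used later in this notebook.
REQUIRED_VARS = [
    "reddit-CLIENT_ID",
    "reddit-CLIENT_SECRET",
    "reddit-USER_AGENT",
    "NEWSAPI_KEY",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running a check like this before the collection functions makes a missing credential visible immediately instead of surfacing later as an authentication error.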

### Define Data and Output

Since we are searching for stocks by company name and by ticker, we'll use a dictionary to store both values. We will also create a smaller dictionary for testing purposes and define our output directory for data to be stored.

clean_text keeps the stored text compact. Many pages (news pages especially) return text padded with runs of blank space, so this function collapses each run of whitespace into a single space and trims the ends, leaving only the useful text.

In [94]:
# Define the Magnificent 7 stocks
MAGNIFICENT_7 = {
    'GOOGL': 'Alphabet',
    'AMZN': 'Amazon',
    'AAPL': 'Apple',
    'META': 'Meta Platforms',
    'MSFT': 'Microsoft',
    'NVDA': 'Nvidia',
    'TSLA': 'Tesla'
}

test_dict = {'GOOGL': 'Alphabet'}  # Test with one stock for quick results

# Create output directory if it doesn't exist
output_dir = "magnificent7_data"
os.makedirs(output_dir, exist_ok=True)

# Helper function to clean text
def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends"""
    if not text:
        return ""
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
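
As a quick illustration of what the helper does, the sketch below repeats the same logic in a self-contained form and applies it to a made-up sample string. The sample text is invented for the example.

```python
import re

def clean_text(text):
    """Same logic as the helper above: collapse whitespace runs and trim."""
    if not text:
        return ""
    return re.sub(r'\s+', ' ', text).strip()

# Invented sample with tabs, newlines, and repeated spaces
sample = "  Alphabet   beats\n\n estimates\t this quarter  "
print(clean_text(sample))       # Alphabet beats estimates this quarter
print(repr(clean_text(None)))   # ''
```

Because the function returns an empty string for `None`, it can be applied directly to optional API fields without a separate null check.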


### Reddit Data Collection

collect_reddit_data gathers posts about specific stocks from multiple financial subreddits. It:

1. Authenticates with the Reddit API using credentials stored in .env variables (client ID, client secret, user agent)
2. Searches across four financial subreddits - r/wallstreetbets, r/stocks, r/investing, and r/StockMarket
3. Runs a separate search for each stock's ticker and company name
4. Collects post details including title, body, author, and timestamp
5. Retrieves up to 5 top-level comments for each post
6. Cleans the text content
7. Adds metadata such as stock symbol, search term, and collection time
8. Compiles everything into a pandas DataFrame, removing duplicate posts
9. Includes error handling and rate limiting (2-second pause between searches)

This function returns a dataset of Reddit posts about the specified stocks.

In [84]:
def collect_reddit_data(stocks_dict, post_limit=30):
    """Collect Reddit data for multiple stocks"""

    reddit = praw.Reddit(
        client_id=os.getenv("reddit-CLIENT_ID"),
        client_secret=os.getenv("reddit-CLIENT_SECRET"),
        user_agent=os.getenv("reddit-USER_AGENT"),
    )
    
    all_posts_data = []
    
    # Subreddits to search in
    subreddits = ["wallstreetbets", "stocks", "investing", "StockMarket"]
    subreddit_obj = reddit.subreddit("+".join(subreddits))
    
    # Collect data for each stock
    for symbol, company_name in stocks_dict.items():
        print(f"Collecting Reddit data for {company_name} ({symbol})...")
        
        # Search terms - include both ticker and company name
        search_terms = [symbol, company_name]
        
        for search_term in search_terms:
            print(f"  - Searching for '{search_term}'")
            
            try:
                # Search for posts containing the term
                posts = subreddit_obj.search(search_term, limit=post_limit, sort="relevance")
                
                post_count = 0
                for post in posts:
                    try:
                        # Get the subreddit name
                        sub_name = post.subreddit.display_name
                        
                        # Get top-level comments (limited to 5 for simplicity)
                        try:
                            post.comments.replace_more(limit=0)
                            comments = []
                            for comment in list(post.comments)[:5]:
                                cleaned_comment = clean_text(comment.body)
                                if cleaned_comment:  # Only add non-empty comments
                                    comments.append(cleaned_comment)
                        except Exception as e:
                            print(f"    Error fetching comments: {e}")
                            comments = []
                        
                        # Convert the UTC epoch timestamp to a datetime (fromtimestamp returns local time)
                        created_time = datetime.fromtimestamp(post.created_utc)
                        
                        # Clean and prepare the post body
                        body = clean_text(post.selftext)
                        title = clean_text(post.title)
                        
                        # Store data with stock information
                        post_data = {
                            "platform": "Reddit",
                            "stock_symbol": symbol,
                            "stock_name": company_name,
                            "post_id": post.id,
                            "title": title,
                            "body": body,
                            "author": str(post.author),
                            "score": post.score,
                            "created_at": created_time,
                            "num_comments": post.num_comments,
                            "comments": str(comments),  # Convert list to string for CSV
                            "subreddit": sub_name,
                            "url": post.url,
                            "search_term": search_term,
                            "collection_time": datetime.now()
                        }
                        all_posts_data.append(post_data)
                        post_count += 1
                    except Exception as e:
                        print(f"    Error processing post: {e}")
                        continue
                
                print(f"    Found {post_count} posts for '{search_term}'")
                
                # Sleep to respect rate limits
                time.sleep(2)
                
            except Exception as e:
                print(f"Error during Reddit search for {search_term}: {e}")
                continue
    
    # Create DataFrame
    df = pd.DataFrame(all_posts_data)
    
    # Remove any duplicates based on post_id
    if not df.empty:
        df = df.drop_duplicates(subset=['post_id'])
    
    print(f"Total Reddit posts collected: {len(df)}")
    return df
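
One detail worth noting in the function above: `post.created_utc` is a Unix epoch timestamp, and `datetime.fromtimestamp` without a `tz` argument returns a naive datetime in the machine's local time zone. If UTC values are preferred (for example, to line up with the UTC `publishedAt` strings from the news data), a timezone-aware conversion could look like the sketch below. `epoch_to_utc` is a hypothetical helper, not part of the original function.

```python
from datetime import datetime, timezone

def epoch_to_utc(epoch_seconds):
    """Convert a Unix epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

created = epoch_to_utc(0)
print(created.isoformat())  # 1970-01-01T00:00:00+00:00
```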

Let's test our function and take a look at the Reddit data to make sure it all looks good.

In [85]:
# Reddit data test
reddit_df = collect_reddit_data(test_dict)
reddit_df.head()

Collecting Reddit data for Alphabet (GOOGL)...
  - Searching for 'GOOGL'
    Found 30 posts for 'GOOGL'
  - Searching for 'Alphabet'
    Found 30 posts for 'Alphabet'
Total Reddit posts collected: 59


Unnamed: 0,platform,stock_symbol,stock_name,post_id,title,body,author,score,created_at,num_comments,comments,subreddit,url,search_term,collection_time
0,Reddit,GOOGL,Alphabet,1j739da,"GOOGL is the most bullish ""safe"" stock for lon...",My arguments are the following: \- Alphabet ha...,SwissCowOnMoon,657,2025-03-09 04:28:35,381,['If they get broken up (whether by force or c...,stocks,https://www.reddit.com/r/stocks/comments/1j739...,GOOGL,2025-03-20 10:00:19.176094
1,Reddit,GOOGL,Alphabet,1izuaqk,Is Google ($GOOGL) a great long-term buy after...,Google (GOOGL) has been dropping quite a bit r...,biznisgod,329,2025-02-27 18:56:30,222,['The drop is broad based across all the mega-...,investing,https://www.reddit.com/r/investing/comments/1i...,GOOGL,2025-03-20 10:00:20.468976
2,Reddit,GOOGL,Alphabet,1je3bod,$GOOG & $GOOGL buy WIZ start up $32 Billion,$GOOG & $GOOGL buy WIZ start up $32 Billion [h...,Othe-un-dots,1246,2025-03-18 08:03:22,223,['**User Report**| | | | :--|:--|:--|:-- **Tot...,wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,GOOGL,2025-03-20 10:00:21.530009
3,Reddit,GOOGL,Alphabet,1j3nziz,Waiting to be stimulated GOOGL,Let's wait for the next move What's going to h...,apslumas,55,2025-03-04 17:47:50,27,['**User Report**| | | | :--|:--|:--|:-- **Tot...,wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,GOOGL,2025-03-20 10:00:21.780409
4,Reddit,GOOGL,Alphabet,1gwss9g,Why is nobody talking about GOOGL?,Yesterday I was thinking about GOOGL as a safe...,britax12,84,2024-11-21 17:52:06,177,['**User Report**| | | | :--|:--|:--|:-- **Tot...,wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,GOOGL,2025-03-20 10:00:22.643566
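
The comments column holds each post's comment list converted to a string so that it survives the CSV round trip. A minimal sketch of turning such a cell back into a list is shown here, assuming the cell still holds the full, untruncated string produced by `str(comments)`. `parse_comments` is a hypothetical helper.

```python
import ast

def parse_comments(cell):
    """Turn the stringified comment list back into a list of strings."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Malformed cell: treat it as having no comments
        return []
    return value if isinstance(value, list) else []

# Invented comments, stored the same way the collection function stores them
stored = str(["Great quarter", "It's overvalued"])
print(parse_comments(stored))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that originated from user posts.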


### News API Data Collection

collect_newsapi_articles gathers news articles about specific stocks using the NewsAPI service. It:

1. Authenticates with NewsAPI using .env variables (API Key)
2. Searches for articles published in the last year
3. Constructs specific searches for each stock like "AAPL stock" and "Apple earnings"
4. Makes an HTTP request to the NewsAPI endpoint
5. Processes the JSON response to extract data
6. Cleans and normalizes text content (title, description, content)
7. Combines description and content for easier analysis
8. Captures metadata including source, author, URL, and publication date
9. Implements error handling and rate limiting (1-second pause between requests)
10. Returns a pandas DataFrame of all collected articles with duplicates removed

The function provides a structured dataset of news coverage for each specified stock.
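
The date range and request parameters described in the steps above can be sketched as a small pure function, which makes them easy to inspect without calling the API. `build_newsapi_params` is a hypothetical helper; the parameter names mirror the ones the collection function sends.

```python
from datetime import datetime, timedelta

def build_newsapi_params(query, now, days_back=365, page_size=30):
    """Build the query parameters for one request (hypothetical helper)."""
    start = now - timedelta(days=days_back)
    return {
        "q": query,
        "from": start.strftime("%Y-%m-%d"),
        "to": now.strftime("%Y-%m-%d"),
        "language": "en",
        "sortBy": "relevancy",
        "pageSize": min(page_size, 100),  # cap at 100 results per request
    }

params = build_newsapi_params("AAPL stock", datetime(2025, 3, 20))
print(params["from"], params["to"])  # 2024-03-20 2025-03-20
```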

In [None]:
def collect_newsapi_articles(stocks_dict, api_key=None, max_articles=30):
    """Collect news articles from NewsAPI for multiple stocks"""
    
    all_news_data = []
    
    # Get API key from parameter or environment variable
    if api_key is None:
        api_key = os.getenv("NEWSAPI_KEY")
    
    # Check if API key exists
    if not api_key:
        print("Error: NewsAPI key not found.")
        print("Please provide an API key or set the NEWSAPI_KEY environment variable.")
        return pd.DataFrame()
    
    # Calculate date range (365 days ago to today)
    end_date = datetime.now()
    start_date = end_date - timedelta(days=365)
    
    # Format dates for API
    from_date = start_date.strftime("%Y-%m-%d")
    to_date = end_date.strftime("%Y-%m-%d")
    
    # Base URL for NewsAPI
    base_url = "https://newsapi.org/v2/everything"
    
    # Headers for API request
    headers = {
        "X-Api-Key": api_key
    }
    
    # Collect news for each stock
    for symbol, company_name in stocks_dict.items():
        print(f"Collecting NewsAPI articles for {company_name} ({symbol})...")
        
        # Create more specific search queries
        queries = [
            f"{symbol} stock",
            f"{company_name} stock",
            f"{company_name} earnings",
            f"{company_name} financial"
        ]
        
        for query in queries:
            print(f"  - Searching for '{query}'")
            
            # Parameters for API request
            params = {
                "q": query,
                "from": from_date,
                "to": to_date,
                "language": "en",
                "sortBy": "relevancy",
                "pageSize": min(max_articles, 100)  # API limit is 100 per request
            }
            
            try:
                # Make API request
                response = requests.get(base_url, params=params, headers=headers, timeout=30)
                
                # Print status for debugging
                print(f"    Status code: {response.status_code}")
                
                if response.status_code == 200:
                    data = response.json()
                    
                    # Extract articles
                    articles = data.get('articles', [])
                    print(f"    Found {len(articles)} articles")
                    
                    # Check if there are any articles
                    if not articles:
                        print("    No articles found")
                        continue
                    
                    # Process each article
                    article_count = 0
                    for article in articles:
                        try:
                            # Parse published date
                            published_at = article.get('publishedAt', '')
                            
                            # Clean text fields
                            title = clean_text(article.get('title', ''))
                            description = clean_text(article.get('description', ''))
                            content = clean_text(article.get('content', ''))
                            
                            # Combine description and content for better sentiment analysis
                            full_text = description
                            if content and content != description:
                                if full_text:
                                    full_text += " " + content
                                else:
                                    full_text = content
                            
                            # Store article data
                            article_data = {
                                "platform": "NewsAPI",
                                "stock_symbol": symbol,
                                "stock_name": company_name,
                                "title": title if title else "No title",
                                "full_text": full_text,
                                "source": article.get('source', {}).get('name', 'Unknown'),
                                "author": article.get('author', 'Unknown'),
                                "url": article.get('url', ''),
                                "published_at": published_at,
                                "search_query": query,
                                "collection_time": datetime.now()
                            }
                            
                            # Only add if we have a title and URL
                            if article_data["title"] != "No title" and article_data["url"]:
                                all_news_data.append(article_data)
                                article_count += 1
                            
                        except Exception as e:
                            print(f"    Error processing article: {e}")
                            continue
                    
                    print(f"    Successfully processed {article_count} articles")
                    
                else:
                    print(f"    Error: API returned status code {response.status_code}")
                    if response.status_code == 401:
                        print("    Invalid API key or authentication error")
                    elif response.status_code == 429:
                        print("    Rate limit exceeded")
                    response_text = response.text[:200] + "..." if len(response.text) > 200 else response.text
                    print(f"    Response: {response_text}")
                    
            except Exception as e:
                print(f"Error collecting NewsAPI data for {query}: {e}")
                continue
            
            # Sleep to respect rate limits
            time.sleep(1)
    
    # Create DataFrame from the collected data
    df = pd.DataFrame(all_news_data)
    
    # Remove duplicates based on URL
    if not df.empty:
        df = df.drop_duplicates(subset=['url'])
    
    print(f"Total NewsAPI articles collected: {len(df)}")
    return df
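
A note on the defaults used above: `dict.get(key, default)` only falls back to the default when the key is absent. If a JSON field is present with an explicit `null`, it arrives in Python as `None` and the default is not applied. The sketch below shows the difference with an invented article dictionary; `get_or_default` is a hypothetical helper.

```python
def get_or_default(mapping, key, default):
    """Return mapping[key], falling back to default when the key is missing, None, or empty."""
    value = mapping.get(key)
    return value if value else default

# Invented article payload with an explicit null author
article = {"author": None, "source": {"name": "Forbes"}}
print(article.get("author", "Unknown"))               # None
print(get_or_default(article, "author", "Unknown"))   # Unknown
```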


Let's test the function and take a look at a sample to make sure it looks alright.

In [89]:
# news api data test
news_df = collect_newsapi_articles(test_dict)
news_df.head()


Collecting NewsAPI articles for Alphabet (GOOGL)...
  - Searching for 'GOOGL stock'
    Status code: 200
    Found 30 articles
    Successfully processed 30 articles
  - Searching for 'Alphabet stock'
    Status code: 200
    Found 29 articles
    Successfully processed 29 articles
  - Searching for 'Alphabet earnings'
    Status code: 200
    Found 30 articles
    Successfully processed 30 articles
  - Searching for 'Alphabet financial'
    Status code: 200
    Found 29 articles
    Successfully processed 29 articles
Total NewsAPI articles collected: 82


Unnamed: 0,platform,stock_symbol,stock_name,title,full_text,source,author,url,published_at,search_query,collection_time
0,NewsAPI,GOOGL,Alphabet,D-WAVE QUANTUM Stock Rises 62% Post Q4 Results...,QBTS stock benefits from an expanding clientel...,Yahoo Entertainment,Nilanshi Mukherjee,https://finance.yahoo.com/news/d-wave-quantum-...,2025-03-18T17:54:00Z,GOOGL stock,2025-03-20 10:04:08.526688
1,NewsAPI,GOOGL,Alphabet,"Nvidia, Google, Tesla, BYD, Tencent Music, Xpe...",Stock futures edged down ahead of a slew of ec...,Quartz India,Josh Fellman,https://qz.com/nvidia-google-byd-tesla-xpeng-t...,2025-03-18T12:18:00Z,GOOGL stock,2025-03-20 10:04:08.526688
2,NewsAPI,GOOGL,Alphabet,Is Alphabet Inc. (GOOGL) the Most Profitable L...,"If you click 'Accept all', we and our partners...",Yahoo Entertainment,,https://consent.yahoo.com/v2/collectConsent?se...,2025-03-13T23:54:18Z,GOOGL stock,2025-03-20 10:04:08.527798
3,NewsAPI,GOOGL,Alphabet,Is Alphabet Inc. (GOOGL) the Top Stock to Buy ...,"If you click 'Accept all', we and our partners...",Yahoo Entertainment,,https://consent.yahoo.com/v2/collectConsent?se...,2025-03-18T22:20:05Z,GOOGL stock,2025-03-20 10:04:08.527798
4,NewsAPI,GOOGL,Alphabet,"Facts About The Stock Corrections, Tariffs, An...",Stocks entered into a correction with a declin...,Forbes,"Bill Stone, Contributor, \n Bill Stone, Contri...",https://www.forbes.com/sites/bill_stone/2025/0...,2025-03-16T11:00:00Z,GOOGL stock,2025-03-20 10:04:08.527798
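
Rows 2 and 3 in the sample point at consent.yahoo.com, and their full_text is cookie-consent wording rather than article content. One possible quality check, sketched here under the assumption that filtering by URL host is acceptable, is to drop such rows before analysis. `CONSENT_HOSTS` and `is_consent_page` are hypothetical names, and the URLs in the example are shortened illustrations.

```python
from urllib.parse import urlparse

# Hosts that serve consent interstitials rather than article text (assumed list)
CONSENT_HOSTS = {"consent.yahoo.com"}

def is_consent_page(url):
    """Return True when the URL points at a known consent interstitial host."""
    return urlparse(url).hostname in CONSENT_HOSTS

urls = [
    "https://consent.yahoo.com/v2/collectConsent",
    "https://finance.yahoo.com/news/",
]
kept = [u for u in urls if not is_consent_page(u)]
print(kept)  # ['https://finance.yahoo.com/news/']
```

With a DataFrame, the same predicate could be applied as `news_df[~news_df["url"].map(is_consent_page)]`.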


### Combined Data Collection

combined_data_collection runs both of the previous collection functions for all of the Magnificent 7 stocks. A full run should take roughly 20 minutes and produces three CSV files: one with only Reddit data, one with only NewsAPI data, and one combined. The files are written to the output directory defined at the beginning of this notebook.

In [96]:
def combined_data_collection():
    """Main function to collect data from Reddit and NewsAPI for Magnificent 7 stocks"""

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    all_data_frames = []
    
    try:
        # Reddit data
        reddit_df = collect_reddit_data(MAGNIFICENT_7)
        if not reddit_df.empty:
            reddit_df.to_csv(f"{output_dir}/reddit_mag7_{timestamp}.csv", index=False, encoding='utf-8')
            all_data_frames.append(reddit_df)
        
        # NewsAPI data
        newsapi_df = collect_newsapi_articles(MAGNIFICENT_7)
        if not newsapi_df.empty:
            newsapi_df.to_csv(f"{output_dir}/newsapi_mag7_{timestamp}.csv", index=False, encoding='utf-8')
            all_data_frames.append(newsapi_df)
        
        # Combine all data into one DataFrame
        if all_data_frames:
            combined_df = pd.concat(all_data_frames, ignore_index=True)
            
            # Save combined data
            combined_file = f"{output_dir}/magnificent7_combined_{timestamp}.csv"
            combined_df.to_csv(combined_file, index=False, encoding='utf-8')
            print(f"\nCombined data saved to {combined_file}")
            print(f"Total collected items: {len(combined_df)}")
            
            # Print data distribution
            print("\nData Distribution:")
            platform_counts = combined_df['platform'].value_counts()
            for platform, count in platform_counts.items():
                print(f"  {platform}: {count} items")
                
            stock_counts = combined_df['stock_symbol'].value_counts()
            print("\nData by Stock:")
            for stock, count in stock_counts.items():
                print(f"  {stock}: {count} items")
        else:
            print("\nNo data was collected from any source.")
        
    except Exception as e:
        print(f"Error in data collection: {e}")

In [None]:
# run combined data collection function
combined_data_collection()
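
Each run writes files with a new timestamp, so a later notebook needs a way to pick the most recent combined file. Because the `%Y%m%d_%H%M%S` timestamp sorts chronologically when sorted as text, a sketch like the one below may be enough. `latest_combined_file` is a hypothetical helper that assumes the file naming used in the function above.

```python
from pathlib import Path

def latest_combined_file(directory):
    """Return the newest combined CSV in the directory, or None if there is none."""
    # Timestamps in the file names sort chronologically as plain strings
    files = sorted(Path(directory).glob("magnificent7_combined_*.csv"))
    return files[-1] if files else None
```

For example, `latest_combined_file("magnificent7_data")` would return the path of the latest combined CSV, which can then be passed to `pd.read_csv`.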

### Conclusion and Next Steps

This notebook successfully establishes a data collection pipeline for financial market analysis. We've implemented two primary data sources:

- Reddit Data: Using praw to gather social media sentiment from investment communities including r/wallstreetbets, r/stocks, r/investing, and r/StockMarket.
- News Articles: Using NewsAPI to collect recent financial news about our target stocks from reputable sources.

The collection process includes proper error handling, rate limiting, and deduplication to ensure data quality. This foundation provides us with a diverse dataset that captures both institutional perspectives (news) and retail investor sentiment (Reddit).

#### Next Steps

Data Cleaning & Exploratory Data Analysis (EDA)

- Standardize text formats and remove special characters
- Handle missing values appropriately
- Convert timestamps to a consistent format
- Extract meaningful features from raw text
- Analyze post/article frequency by stock and source
- Assess initial sentiment distribution
- Tokenize and normalize text data
- Remove stopwords and irrelevant content
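
As a rough preview of the tokenization and stopword steps listed above, the sketch below uses only the standard library. The stopword list is a tiny illustrative set and the token pattern is an assumption; the actual cleaning notebook may use a dedicated NLP library instead.

```python
import re

# Tiny illustrative stopword list; a real run would use a fuller one
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # Keep letters, digits, '$' (ticker prefixes), and apostrophes inside tokens
    tokens = re.findall(r"[a-z0-9$']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("Is GOOGL the most bullish stock to buy?"))
# ['googl', 'most', 'bullish', 'stock', 'buy']
```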

The cleaned dataset and insights from our EDA will form the foundation for our sentiment analysis models, which will help us quantify market sentiment and potentially identify correlations with stock price movements.