# Task 3: Correlation between News and Stock Movement

## Overview
This notebook analyzes the relationship between news sentiment and stock price movements. We'll:
1. Align news and stock price data
2. Perform sentiment analysis on news headlines
3. Calculate stock returns
4. Analyze correlations between sentiment and price movements

## Main Objectives
1. Understand how to align time series data
2. Perform sentiment analysis using NLP
3. Calculate and interpret correlations
4. Visualize relationships between sentiment and stock movements

In [14]:
# Environment Setup
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')

# Add project root to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Download required NLTK data
nltk.download('vader_lexicon')
nltk.download('punkt')

# Set plotting style
# The 'seaborn' style name was removed in Matplotlib 3.8; set_theme() applies seaborn's defaults
sns.set_theme()
sns.set_palette('husl')

# Print library versions for reproducibility
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"nltk version: {nltk.__version__}")

pandas version: 2.0.3
numpy version: 1.24.4
nltk version: 3.8.1


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/dinki/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/dinki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Data Loading and Initial Exploration

We'll load both datasets:
1. News data (raw_analyst_ratings.csv)
2. Stock price data (from yfinance_data directory)

Let's examine both datasets to understand their structure and ensure proper alignment.

In [15]:
# Load news data
news_data = pd.read_csv('../data/raw/raw_analyst_ratings.csv')

# Load stock data (using NVDA as example)
stock_data = pd.read_csv('../data/raw/yfinance_data/NVDA_historical_data.csv')

# Display basic information about the datasets
print("News Data Information:")
print("-" * 50)
print(f"Number of news articles: {len(news_data)}")
print(f"Date range: {news_data['date'].min()} to {news_data['date'].max()}")
print("\nColumns in news data:")
print(news_data.columns.tolist())

print("\nStock Data Information:")
print("-" * 50)
print(f"Number of trading days: {len(stock_data)}")
print(f"Date range: {stock_data['Date'].min()} to {stock_data['Date'].max()}")
print("\nColumns in stock data:")
print(stock_data.columns.tolist())

# Display sample data
print("\nSample News Data:")
display(news_data.head())

print("\nSample Stock Data:")
display(stock_data.head())

News Data Information:
--------------------------------------------------
Number of news articles: 1407328
Date range: 2009-02-14 00:00:00 to 2020-06-11 17:12:35-04:00

Columns in news data:
['Unnamed: 0', 'headline', 'url', 'publisher', 'date', 'stock']

Stock Data Information:
--------------------------------------------------
Number of trading days: 6421
Date range: 1999-01-22 to 2024-07-30

Columns in stock data:
['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Dividends', 'Stock Splits']

Sample News Data:


Unnamed: 0.1,Unnamed: 0,headline,url,publisher,date,stock
0,0,Stocks That Hit 52-Week Highs On Friday,https://www.benzinga.com/news/20/06/16190091/s...,Benzinga Insights,2020-06-05 10:30:54-04:00,A
1,1,Stocks That Hit 52-Week Highs On Wednesday,https://www.benzinga.com/news/20/06/16170189/s...,Benzinga Insights,2020-06-03 10:45:20-04:00,A
2,2,71 Biggest Movers From Friday,https://www.benzinga.com/news/20/05/16103463/7...,Lisa Levin,2020-05-26 04:30:07-04:00,A
3,3,46 Stocks Moving In Friday's Mid-Day Session,https://www.benzinga.com/news/20/05/16095921/4...,Lisa Levin,2020-05-22 12:45:06-04:00,A
4,4,B of A Securities Maintains Neutral on Agilent...,https://www.benzinga.com/news/20/05/16095304/b...,Vick Meyer,2020-05-22 11:38:59-04:00,A



Sample Stock Data:


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Dividends,Stock Splits
0,1999-01-22,0.04375,0.048828,0.038802,0.041016,0.037621,2714688000,0.0,0.0
1,1999-01-25,0.044271,0.045833,0.041016,0.045313,0.041562,510480000,0.0,0.0
2,1999-01-26,0.045833,0.046745,0.041146,0.041797,0.038337,343200000,0.0,0.0
3,1999-01-27,0.041927,0.042969,0.039583,0.041667,0.038218,244368000,0.0,0.0
4,1999-01-28,0.041667,0.041927,0.041276,0.041536,0.038098,227520000,0.0,0.0
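The `Unnamed: 0` column in the news data looks like a row index that was written into the CSV when the file was saved. A minimal sketch, assuming that is the case, of how `index_col=0` keeps the saved index from showing up as a data column. The inline CSV text is made up for illustration and only mimics the layout of the real file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV mimicking raw_analyst_ratings.csv:
# the first, unnamed column is the index that was saved with the file.
csv_text = (
    ",headline,publisher,date,stock\n"
    "0,Stocks That Hit 52-Week Highs On Friday,Benzinga Insights,2020-06-05 10:30:54-04:00,A\n"
    "1,71 Biggest Movers From Friday,Lisa Levin,2020-05-26 04:30:07-04:00,A\n"
)

# Default read: the saved index becomes a regular column named 'Unnamed: 0'
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Whether to drop the column or keep it is a judgment call; it is harmless for the analysis but adds noise to every `head()` display.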


## Data Preprocessing and Date Alignment

Before we can analyze the correlation between news sentiment and stock movements, we need to:
1. Convert dates to datetime format
2. Align news and stock data by date
3. Handle any missing values
4. Ensure proper timezone alignment
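The raw date range printed earlier mixes a timestamp without an offset (`2009-02-14 00:00:00`) with one that carries an offset (`2020-06-11 17:12:35-04:00`). Mixed formats like this can cause some rows to fail parsing and become `NaT` when `errors='coerce'` is used. One possible way to sidestep the problem, sketched below, is to keep only the calendar date. The helper name `headline_dates` is hypothetical, and the approach assumes every timestamp is an ISO-like string whose first ten characters are `YYYY-MM-DD` in the publisher's local time.

```python
import pandas as pd


def headline_dates(raw: pd.Series) -> pd.Series:
    """Return the calendar date of each timestamp string as tz-naive datetime64.

    Assumption: the first 10 characters of every string are YYYY-MM-DD.
    Values that do not match become NaT.
    """
    date_part = raw.astype(str).str[:10]
    return pd.to_datetime(date_part, format="%Y-%m-%d", errors="coerce")


raw = pd.Series(
    ["2009-02-14 00:00:00", "2020-06-11 17:12:35-04:00", "not a date"]
)
dates = headline_dates(raw)
print(dates)
```

Because the result is timezone-naive and normalized to midnight, it can be compared directly with the `Date` column of the stock data.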

In [16]:
# First, let's examine the data structure
print("Loading and examining the data...")

# Load news data
news_data = pd.read_csv('../data/raw/raw_analyst_ratings.csv')

# Load stock data
stock_data = pd.read_csv('../data/raw/yfinance_data/NVDA_historical_data.csv')

# Display basic information about the datasets
print("\nNews Data Information:")
print("-" * 50)
print("Columns in news data:")
print(news_data.columns.tolist())
print("\nFirst few rows of news data:")
display(news_data.head())

print("\nStock Data Information:")
print("-" * 50)
print("Columns in stock data:")
print(stock_data.columns.tolist())
print("\nFirst few rows of stock data:")
display(stock_data.head())

# Now let's handle the date conversion
try:
    # Convert dates to datetime
    print("\nConverting dates...")
    
    # First, let's check the date column names
    print("Date column in news data:", [col for col in news_data.columns if 'date' in col.lower()])
    print("Date column in stock data:", [col for col in stock_data.columns if 'date' in col.lower()])
    
    # Convert dates; errors='coerce' turns any value that fails to parse into NaT instead of raising
    date_col_news = [col for col in news_data.columns if 'date' in col.lower()][0]
    date_col_stock = [col for col in stock_data.columns if 'date' in col.lower()][0]
    
    news_data[date_col_news] = pd.to_datetime(news_data[date_col_news], errors='coerce')
    stock_data[date_col_stock] = pd.to_datetime(stock_data[date_col_stock], errors='coerce')
    
    # Set dates as index
    news_data.set_index(date_col_news, inplace=True)
    stock_data.set_index(date_col_stock, inplace=True)
    
    # Sort by date
    news_data.sort_index(inplace=True)
    stock_data.sort_index(inplace=True)
    
    # Calculate daily stock returns
    stock_data['Daily_Return'] = stock_data['Close'].pct_change()
    
    # Display date ranges and data points
    print("\nDate Ranges:")
    print(f"News data: {news_data.index.min()} to {news_data.index.max()}")
    print(f"Stock data: {stock_data.index.min()} to {stock_data.index.max()}")
    
    # Check for missing values
    print("\nMissing values in news data:")
    print(news_data.isnull().sum())
    print("\nMissing values in stock data:")
    print(stock_data.isnull().sum())
    
    # Display aligned data
    print("\nAligned Data Sample:")
    display(pd.DataFrame({
        'News_Count': news_data.groupby(news_data.index.date).size(),
        'Stock_Return': stock_data['Daily_Return']
    }).head())

except Exception as e:
    print(f"\nError occurred: {str(e)}")
    print("\nLet's examine the data more closely:")
    print("\nNews data info:")
    print(news_data.info())
    print("\nStock data info:")
    print(stock_data.info())

Loading and examining the data...

News Data Information:
--------------------------------------------------
Columns in news data:
['Unnamed: 0', 'headline', 'url', 'publisher', 'date', 'stock']

First few rows of news data:


Unnamed: 0.1,Unnamed: 0,headline,url,publisher,date,stock
0,0,Stocks That Hit 52-Week Highs On Friday,https://www.benzinga.com/news/20/06/16190091/s...,Benzinga Insights,2020-06-05 10:30:54-04:00,A
1,1,Stocks That Hit 52-Week Highs On Wednesday,https://www.benzinga.com/news/20/06/16170189/s...,Benzinga Insights,2020-06-03 10:45:20-04:00,A
2,2,71 Biggest Movers From Friday,https://www.benzinga.com/news/20/05/16103463/7...,Lisa Levin,2020-05-26 04:30:07-04:00,A
3,3,46 Stocks Moving In Friday's Mid-Day Session,https://www.benzinga.com/news/20/05/16095921/4...,Lisa Levin,2020-05-22 12:45:06-04:00,A
4,4,B of A Securities Maintains Neutral on Agilent...,https://www.benzinga.com/news/20/05/16095304/b...,Vick Meyer,2020-05-22 11:38:59-04:00,A



Stock Data Information:
--------------------------------------------------
Columns in stock data:
['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Dividends', 'Stock Splits']

First few rows of stock data:


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Dividends,Stock Splits
0,1999-01-22,0.04375,0.048828,0.038802,0.041016,0.037621,2714688000,0.0,0.0
1,1999-01-25,0.044271,0.045833,0.041016,0.045313,0.041562,510480000,0.0,0.0
2,1999-01-26,0.045833,0.046745,0.041146,0.041797,0.038337,343200000,0.0,0.0
3,1999-01-27,0.041927,0.042969,0.039583,0.041667,0.038218,244368000,0.0,0.0
4,1999-01-28,0.041667,0.041927,0.041276,0.041536,0.038098,227520000,0.0,0.0



Converting dates...
Date column in news data: ['date']
Date column in stock data: ['Date']

Date Ranges:
News data: 2011-04-27 21:01:48-04:00 to 2020-06-11 17:12:35-04:00
Stock data: 1999-01-22 00:00:00 to 2024-07-30 00:00:00

Missing values in news data:
Unnamed: 0    0
headline      0
url           0
publisher     0
stock         0
dtype: int64

Missing values in stock data:
Open            0
High            0
Low             0
Close           0
Adj Close       0
Volume          0
Dividends       0
Stock Splits    0
Daily_Return    1
dtype: int64

Aligned Data Sample:


Unnamed: 0,News_Count,Stock_Return
2011-04-27,1.0,0.0
2011-04-28,2.0,0.010881
2011-04-29,2.0,0.025115
2011-04-30,1.0,
2011-05-01,1.0,


## Sentiment Analysis Setup

Before we begin analyzing sentiment, we need to:
1. Initialize our sentiment analyzers (VADER and TextBlob)
2. Define our sentiment analysis functions
3. Apply sentiment analysis to our news headlines
4. Calculate basic sentiment statistics

In [None]:
# Take a smaller sample for testing
sample_size = 10000
news_data_sample = news_data.sample(n=sample_size, random_state=42)

# Initialize VADER
sia = SentimentIntensityAnalyzer()

def analyze_sentiment_vader(text):
    if not isinstance(text, str):
        return {'compound': 0, 'pos': 0, 'neg': 0, 'neu': 0}
    return sia.polarity_scores(text)

# Process the sample
chunk_size = 1000
processed_data = []

print(f"Processing {sample_size} headlines...")

for start_idx in range(0, sample_size, chunk_size):
    end_idx = min(start_idx + chunk_size, sample_size)
    chunk = news_data_sample.iloc[start_idx:end_idx].copy()
    
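    # Apply VADER and split the score dictionary into separate columns
    chunk['vader_sentiment'] = chunk['headline'].apply(analyze_sentiment_vader)
    chunk['vader_compound'] = chunk['vader_sentiment'].apply(lambda x: x['compound'])
    chunk['vader_positive'] = chunk['vader_sentiment'].apply(lambda x: x['pos'])
    chunk['vader_negative'] = chunk['vader_sentiment'].apply(lambda x: x['neg'])
    chunk['vader_neutral'] = chunk['vader_sentiment'].apply(lambda x: x['neu'])

    # TextBlob polarity in [-1, 1]; later cells read it from 'textblob_sentiment'
    chunk['textblob_sentiment'] = chunk['headline'].apply(
        lambda text: TextBlob(text).sentiment.polarity if isinstance(text, str) else 0.0
    )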
    
    processed_data.append(chunk)
    
    print(f"Progress: {(end_idx/sample_size)*100:.1f}%")

# Combine all processed chunks
news_data_processed = pd.concat(processed_data)

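# Save the processed data (create the output directory if it does not exist yet)
os.makedirs('../data/processed', exist_ok=True)
news_data_processed.to_csv('../data/processed/sentiment_analysis_sample.csv', index=True)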

print("\nProcessing complete!")
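The cell above pulls each score out of the VADER dictionary with a separate `apply` call. A possible alternative, sketched here, is to expand the dictionaries into columns in one step. `toy_scores` is a hypothetical stand-in for `sia.polarity_scores` so the example runs without the VADER lexicon; its scoring rule is invented and carries no meaning.

```python
import pandas as pd


def toy_scores(text):
    """Hypothetical stand-in for sia.polarity_scores: same keys, invented rule."""
    if not isinstance(text, str):
        return {"neg": 0.0, "neu": 0.0, "pos": 0.0, "compound": 0.0}
    pos = float("highs" in text.lower())
    neg = float("lows" in text.lower())
    return {"neg": neg, "neu": 1.0 - max(pos, neg), "pos": pos, "compound": pos - neg}


headlines = pd.Series(
    ["Stocks That Hit 52-Week Highs On Friday", "Stocks That Hit 52-Week Lows", None],
    name="headline",
)

# Build one DataFrame from the list of score dictionaries, then rename the keys
# to the column names used in the notebook
scores = pd.DataFrame(headlines.apply(toy_scores).tolist(), index=headlines.index)
scores = scores.rename(columns={
    "compound": "vader_compound",
    "pos": "vader_positive",
    "neg": "vader_negative",
    "neu": "vader_neutral",
})

result = headlines.to_frame().join(scores)
print(result)
```

With the real analyzer, `toy_scores` would be replaced by `analyze_sentiment_vader`; the expansion step stays the same.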

## Sentiment Distribution Analysis

Now that we have calculated sentiment scores, we'll:
1. Visualize the distribution of sentiment scores
2. Compare the distributions of VADER positive and negative scores
3. Identify any patterns or biases in the sentiment analysis
4. Understand the overall sentiment landscape of our news data

In [None]:
# Visualize sentiment distributions
plt.figure(figsize=(15, 5))

# VADER sentiment distribution
plt.subplot(1, 2, 1)
sns.histplot(news_data_processed['vader_compound'], bins=50, kde=True)
plt.title('Distribution of VADER Sentiment Scores')
plt.xlabel('Compound Sentiment Score')
plt.ylabel('Frequency')

# VADER positive/negative distribution
plt.subplot(1, 2, 2)
sns.histplot(news_data_processed['vader_positive'], bins=50, kde=True, label='Positive', alpha=0.5)
sns.histplot(news_data_processed['vader_negative'], bins=50, kde=True, label='Negative', alpha=0.5)
plt.title('Distribution of Positive and Negative Scores')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()

# Print some statistics
print("\nSentiment Analysis Statistics:")
print("-" * 50)
print(f"Mean compound score: {news_data_processed['vader_compound'].mean():.3f}")
print(f"Mean positive score: {news_data_processed['vader_positive'].mean():.3f}")
print(f"Mean negative score: {news_data_processed['vader_negative'].mean():.3f}")
print(f"Mean neutral score: {news_data_processed['vader_neutral'].mean():.3f}")
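Mean scores hide how many headlines fall on each side of neutral. A rough sketch of bucketing the compound score into three labels follows; the ±0.05 cut-offs are the thresholds commonly cited for VADER and are treated here as an assumption, and `label_compound` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def label_compound(compound: pd.Series, threshold: float = 0.05) -> pd.Series:
    """Map compound scores to 'positive', 'neutral', or 'negative'."""
    labels = np.select(
        [compound >= threshold, compound <= -threshold],
        ["positive", "negative"],
        default="neutral",
    )
    return pd.Series(labels, index=compound.index, name="sentiment_label")


# Made-up compound scores, including both boundary values
compound = pd.Series([0.6, 0.0, -0.3, 0.05, -0.05])
labels = label_compound(compound)
print(labels.value_counts())
```

Applied to `news_data_processed['vader_compound']`, the label counts would show how much of the sample VADER treats as neutral, which is often a large share for short headlines.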

## Daily Sentiment Trends Analysis

In this section, we'll analyze how sentiment changes over time by:
1. Calculating daily average sentiment scores
2. Visualizing sentiment trends over time
3. Identifying any patterns or seasonality in sentiment
4. Comparing sentiment trends with stock price movements

In [None]:
# Calculate daily average sentiment
print("Calculating daily sentiment trends...")

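# TextBlob polarity is needed below; compute it here if an earlier cell has not
if 'textblob_sentiment' not in news_data_processed.columns:
    news_data_processed['textblob_sentiment'] = news_data_processed['headline'].apply(
        lambda text: TextBlob(text).sentiment.polarity if isinstance(text, str) else 0.0
    )

# Group the scored headlines by calendar date and average the sentiment scores
scored = news_data_processed[news_data_processed.index.notna()]
daily_sentiment = scored.groupby(scored.index.date)[
    ['vader_compound', 'textblob_sentiment']
].mean()
daily_sentiment.index = pd.to_datetime(daily_sentiment.index)
daily_sentiment.index.name = 'date'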

# Plot daily sentiment trends
plt.figure(figsize=(15, 5))
plt.plot(daily_sentiment.index, daily_sentiment['vader_compound'], label='VADER Sentiment')
plt.plot(daily_sentiment.index, daily_sentiment['textblob_sentiment'], label='TextBlob Sentiment')
plt.title('Daily Average Sentiment Scores')
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.legend()
plt.grid(True)
plt.show()

# Calculate and display trend statistics
print("\nDaily Sentiment Trend Statistics:")
print("-" * 50)
print("VADER Sentiment:")
print(f"Overall mean: {daily_sentiment['vader_compound'].mean():.3f}")
print(f"Maximum daily average: {daily_sentiment['vader_compound'].max():.3f}")
print(f"Minimum daily average: {daily_sentiment['vader_compound'].min():.3f}")
print(f"Standard deviation: {daily_sentiment['vader_compound'].std():.3f}")

print("\nTextBlob Sentiment:")
print(f"Overall mean: {daily_sentiment['textblob_sentiment'].mean():.3f}")
print(f"Maximum daily average: {daily_sentiment['textblob_sentiment'].max():.3f}")
print(f"Minimum daily average: {daily_sentiment['textblob_sentiment'].min():.3f}")
print(f"Standard deviation: {daily_sentiment['textblob_sentiment'].std():.3f}")

# Calculate rolling averages to identify trends
daily_sentiment['vader_rolling'] = daily_sentiment['vader_compound'].rolling(window=7).mean()
daily_sentiment['textblob_rolling'] = daily_sentiment['textblob_sentiment'].rolling(window=7).mean()

# Plot rolling averages
plt.figure(figsize=(15, 5))
plt.plot(daily_sentiment.index, daily_sentiment['vader_rolling'], label='VADER 7-day Rolling Average')
plt.plot(daily_sentiment.index, daily_sentiment['textblob_rolling'], label='TextBlob 7-day Rolling Average')
plt.title('7-day Rolling Average of Sentiment Scores')
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.legend()
plt.grid(True)
plt.show()
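`rolling(window=7)` averages the last seven rows, which spans more than seven calendar days whenever some dates have no headlines. A sketch with made-up scores, assuming a calendar-based window is what is wanted, uses the offset string `'7D'` instead.

```python
import pandas as pd

# Made-up headline scores: two on June 1, one on June 2, then a gap until June 10
scores = pd.DataFrame(
    {"vader_compound": [0.2, 0.4, -0.1, 0.3]},
    index=pd.to_datetime(
        ["2020-06-01 09:00", "2020-06-01 15:30", "2020-06-02 10:00", "2020-06-10 11:00"]
    ),
)

# Daily mean: normalize() drops the time of day so headlines group by date
daily = scores.groupby(scores.index.normalize())["vader_compound"].mean()

# Calendar-based window: each value averages the days within the previous 7 days
rolling_7d = daily.rolling("7D").mean()

print(pd.DataFrame({"daily": daily, "rolling_7d": rolling_7d}))
```

In this toy series the June 10 value is not smoothed with the June 1 and 2 values, because they fall outside the seven-day window. A row-based `rolling(window=3)` would have mixed all three.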

## Correlation Analysis: Sentiment vs Stock Returns

Now that we have our sentiment scores and stock returns, we'll analyze their relationship by:
1. Calculating daily correlations
2. Visualizing the relationships
3. Testing statistical significance
4. Analyzing lag effects

In [None]:
# Calculate daily average sentiment scores
print("Calculating daily sentiment averages...")

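# TextBlob polarity is needed below; compute it here if an earlier cell has not
if 'textblob_sentiment' not in news_data_processed.columns:
    news_data_processed['textblob_sentiment'] = news_data_processed['headline'].apply(
        lambda text: TextBlob(text).sentiment.polarity if isinstance(text, str) else 0.0
    )

# Average the scored headlines per calendar date (rows with unparsed dates are skipped)
scored = news_data_processed[news_data_processed.index.notna()]
daily_sentiment = scored.groupby(scored.index.date)[
    ['vader_compound', 'textblob_sentiment']
].mean()
daily_sentiment.index = pd.to_datetime(daily_sentiment.index)
daily_sentiment.index.name = 'date'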

# Merge with stock returns
merged_data = pd.merge(
    daily_sentiment,
    stock_data['Daily_Return'],
    left_index=True,
    right_index=True,
    how='inner'
)

# Calculate correlations
correlation_vader = merged_data['vader_compound'].corr(merged_data['Daily_Return'])
correlation_textblob = merged_data['textblob_sentiment'].corr(merged_data['Daily_Return'])

print("\nCorrelation Analysis Results:")
print("-" * 50)
print(f"VADER Sentiment Correlation: {correlation_vader:.3f}")
print(f"TextBlob Sentiment Correlation: {correlation_textblob:.3f}")

# Visualize the relationships
plt.figure(figsize=(15, 5))

# VADER sentiment vs returns
plt.subplot(1, 2, 1)
plt.scatter(merged_data['vader_compound'], merged_data['Daily_Return'], alpha=0.5)
plt.title('VADER Sentiment vs Stock Returns')
plt.xlabel('VADER Sentiment Score')
plt.ylabel('Daily Return')

# TextBlob sentiment vs returns
plt.subplot(1, 2, 2)
plt.scatter(merged_data['textblob_sentiment'], merged_data['Daily_Return'], alpha=0.5)
plt.title('TextBlob Sentiment vs Stock Returns')
plt.xlabel('TextBlob Sentiment Score')
plt.ylabel('Daily Return')

plt.tight_layout()
plt.show()
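The inner merge keeps only dates that appear in both tables, so headlines published on weekends or holidays drop out. A minimal sketch with synthetic numbers illustrates this; the returns are deliberately constructed as 0.01 times the sentiment so the expected correlation is known in advance.

```python
import pandas as pd

# Synthetic daily sentiment, including a Saturday (2020-06-06)
daily_sentiment = pd.DataFrame(
    {"vader_compound": [0.10, 0.30, -0.20, 0.50, 0.00]},
    index=pd.to_datetime(
        ["2020-06-05", "2020-06-06", "2020-06-08", "2020-06-09", "2020-06-10"]
    ),
)

# Synthetic returns for trading days only, built as 0.01 * sentiment
returns = pd.Series(
    [0.001, -0.002, 0.005, 0.000],
    index=pd.to_datetime(["2020-06-05", "2020-06-08", "2020-06-09", "2020-06-10"]),
    name="Daily_Return",
)

# Inner join on the date index, then drop any rows with missing values
merged = daily_sentiment.join(returns, how="inner").dropna()

pearson = merged["vader_compound"].corr(merged["Daily_Return"])
print(len(merged), round(pearson, 3))
```

The notebook averages headlines across every ticker in the news file before comparing them with NVDA returns. Filtering on the `stock` column first may be worth considering if the goal is a ticker-specific relationship.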

## Lagged Correlation Analysis

We'll analyze if there's a delayed effect between news sentiment and stock returns by:
1. Calculating correlations with different time lags
2. Identifying optimal lag periods
3. Visualizing lag effects

In [None]:
# Calculate lagged correlations
print("Calculating lagged correlations...")

max_lag = 5  # Analyze up to 5 days of lag
lagged_correlations = pd.DataFrame(index=range(max_lag + 1))

for lag in range(max_lag + 1):
    if lag == 0:
        lagged_correlations.loc[lag, 'VADER'] = correlation_vader
        lagged_correlations.loc[lag, 'TextBlob'] = correlation_textblob
    else:
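        # shift(-lag) pairs the sentiment on row t with the return `lag` rows
        # (trading days in merged_data) later, so sentiment leads returns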
        lagged_correlations.loc[lag, 'VADER'] = merged_data['vader_compound'].corr(merged_data['Daily_Return'].shift(-lag))
        lagged_correlations.loc[lag, 'TextBlob'] = merged_data['textblob_sentiment'].corr(merged_data['Daily_Return'].shift(-lag))

print("\nLagged Correlations:")
display(lagged_correlations)

# Visualize lagged correlations
plt.figure(figsize=(10, 5))
plt.plot(lagged_correlations.index, lagged_correlations['VADER'], marker='o', label='VADER')
plt.plot(lagged_correlations.index, lagged_correlations['TextBlob'], marker='o', label='TextBlob')
plt.title('Lagged Correlations between Sentiment and Returns')
plt.xlabel('Lag (days)')
plt.ylabel('Correlation Coefficient')
plt.legend()
plt.grid(True)
plt.show()
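The direction of the shift is easy to get backwards, so a check on synthetic data can help. In the sketch below the return on each day is set equal to the previous day's sentiment, so the lag-1 correlation should come out as 1. `lagged_correlations_for` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd


def lagged_correlations_for(sentiment: pd.Series, returns: pd.Series, max_lag: int = 5) -> pd.Series:
    """Correlate sentiment at row t with returns at row t + lag, for lag = 0..max_lag."""
    values = {
        lag: sentiment.corr(returns.shift(-lag))  # Series.corr ignores NaN pairs
        for lag in range(max_lag + 1)
    }
    return pd.Series(values, name="correlation")


rng = np.random.default_rng(0)
sentiment = pd.Series(rng.normal(size=50))

# Synthetic returns: today's return equals yesterday's sentiment
returns = sentiment.shift(1)

result = lagged_correlations_for(sentiment, returns, max_lag=3)
print(result)
```

Running the same helper on `merged_data['vader_compound']` and `merged_data['Daily_Return']` should reproduce the loop above, assuming the rows are sorted by date.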

## Statistical Significance Testing

We'll test the statistical significance of our correlations using:
1. Pearson correlation test
2. P-value analysis
3. Confidence intervals

In [None]:
from scipy import stats

print("Statistical Significance Tests:")
print("-" * 50)

# VADER sentiment: pearsonr returns the correlation coefficient r and its two-sided p-value
vader_r, vader_p_value = stats.pearsonr(merged_data['vader_compound'], merged_data['Daily_Return'])
print(f"VADER Sentiment:")
print(f"Pearson r: {vader_r:.3f}")
print(f"p-value: {vader_p_value:.3f}")
print(f"Significant at 5% level: {vader_p_value < 0.05}")

# TextBlob sentiment
textblob_r, textblob_p_value = stats.pearsonr(merged_data['textblob_sentiment'], merged_data['Daily_Return'])
print(f"\nTextBlob Sentiment:")
print(f"Pearson r: {textblob_r:.3f}")
print(f"p-value: {textblob_p_value:.3f}")
print(f"Significant at 5% level: {textblob_p_value < 0.05}")

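# Calculate confidence intervals
def correlation_confidence_interval(r, n, alpha=0.05):
    """Confidence interval for a Pearson correlation via the Fisher z-transformation.

    r is the sample correlation and n the number of paired observations (n > 3).
    """
    z = np.arctanh(r)                     # Fisher z-transform of r
    se = 1/np.sqrt(n-3)                   # standard error of z
    z_score = stats.norm.ppf(1-alpha/2)   # two-sided critical value
    ci_lower = np.tanh(z - z_score*se)    # transform the bounds back to the r scale
    ci_upper = np.tanh(z + z_score*se)
    return ci_lower, ci_upper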

print("\nConfidence Intervals (95%):")
print("-" * 50)

# VADER confidence interval
vader_ci_lower, vader_ci_upper = correlation_confidence_interval(correlation_vader, len(merged_data))
print(f"VADER Sentiment: [{vader_ci_lower:.3f}, {vader_ci_upper:.3f}]")

# TextBlob confidence interval
textblob_ci_lower, textblob_ci_upper = correlation_confidence_interval(correlation_textblob, len(merged_data))
print(f"TextBlob Sentiment: [{textblob_ci_lower:.3f}, {textblob_ci_upper:.3f}]")
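For reference, the quantities used in this cell can be written out as follows. The first value returned by `stats.pearsonr` is the correlation coefficient $r$ itself; the $t$ statistic is derived from it. Here $n$ is the number of paired observations and $z_{1-\alpha/2}$ is the standard normal quantile. The interval is the usual Fisher approximation and assumes roughly bivariate normal data.

```latex
\begin{align}
z &= \operatorname{artanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \\
\mathrm{SE}_z &= \frac{1}{\sqrt{n-3}}, \\
\mathrm{CI}_{1-\alpha} &= \left[\tanh\!\left(z - z_{1-\alpha/2}\,\mathrm{SE}_z\right),\;
                               \tanh\!\left(z + z_{1-\alpha/2}\,\mathrm{SE}_z\right)\right], \\
t &= r\,\sqrt{\frac{n-2}{1-r^{2}}} \quad \text{with } n-2 \text{ degrees of freedom.}
\end{align}
```

With several thousand daily observations even a very small $r$ can reach significance at the 5% level, so the width of the interval is usually more informative than the p-value alone.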