# Financial Market Prediction using Sentiment Analysis and Machine Learning

This project implements a research framework for predicting financial market direction by combining sentiment analysis of
social media with traditional financial models.


## Research Hypotheses

1. **H1**: Sentiment from financial news headlines and social media can be used to accurately predict short-term financial market direction.
2. **H2**: Machine learning models that incorporate sentiment data will outperform traditional econometric models in predicting market trends.
3. **H3**: During financial crises, social media sentiment will be a more reliable predictor of market volatility than conventional sentiment indicators.


In [None]:
from src.data_collection.market_data import get_stock_data, get_factor_data, calculate_fama_french_factors
from src.data_collection.twitter_data import TwitterDataCollector
from src.data_collection.data_cleaning import (
    clean_tweet_csv, clean_stockerbot_export, clean_tweets_remaining,
    load_and_clean_all_datasets, combine_datasets,
)

## 1. Data Collection

### 1.1 Market Data Collection

In [None]:
# Set date range (2020-2022)
start_date = "2020-01-01"
end_date = "2022-12-31"

# Get market index data (S&P 500)
sp500_data = get_stock_data("^GSPC", start_date, end_date)

# Get factor data from Kenneth French's data library
french_data = get_factor_data(start_date, end_date)

# Calculate Fama-French factors using French data
ff_factors = calculate_fama_french_factors(
    market_data=sp500_data,
    factor_data=french_data
)
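
The internals of `calculate_fama_french_factors` are not shown in this notebook. As a rough illustration of the kind of step it may involve, the sketch below joins daily index returns with factor data and computes the excess return over the risk-free rate. The function name `add_excess_return`, the `Close` column, and the assumption that the `Mkt-RF` and `RF` columns are decimal daily returns are all hypothetical. Raw files from the French data library typically report percentages, so a division by 100 may be needed first.

```python
import pandas as pd


def add_excess_return(prices: pd.DataFrame, factors: pd.DataFrame) -> pd.DataFrame:
    """Join daily index returns with factor data and add an excess-return column.

    Assumptions (not confirmed by the project code): `prices` has a "Close"
    column, `factors` has an "RF" column in decimal form, and both frames
    are indexed by date.
    """
    # Simple daily return from closing prices; the first row is NaN.
    returns = prices["Close"].pct_change().rename("return")
    # Keep only dates present in both frames, then drop the NaN first return.
    merged = factors.join(returns, how="inner").dropna(subset=["return"])
    # Excess return = index return minus the risk-free rate.
    merged["excess_return"] = merged["return"] - merged["RF"]
    return merged


# Small synthetic example (made-up numbers, not real market data).
dates = pd.date_range("2020-01-02", periods=4, freq="B")
prices = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 103.0]}, index=dates)
factors = pd.DataFrame(
    {"Mkt-RF": [0.0, 0.019, -0.011, 0.019], "RF": [0.0001] * 4}, index=dates
)
result = add_excess_return(prices, factors)
print(result)
```

The inner join is a deliberate choice here: it drops any date that is missing from either source, so later models never see a row with a return but no factor values.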

### 1.2 Twitter Data Collection
No data was collected from the Twitter API (via tweepy) for this project because of its rate limits and access costs.
The cell below shows how it could be collected.

In [None]:
api_key = "XXXXXXXXXXXXXXXXXXX"
api_secret = "XXXXXXXXXX"
access_token = "XXXXXXXXXXXXXX"
access_token_secret = "XXXXXXXXXXXXXX"

# Initialize collector
collector = TwitterDataCollector(api_key, api_secret, access_token, access_token_secret)

# "S&P500" is not a ticker symbol, but it is what people are likely to tweet; ideally both forms would be searched.
tickers = ["S&P500"]
additional_keywords = ["earnings", "stock", "price", "market", "trading", "investor"]

# Using specific date range
print("\nCollecting tweets from a specific date range (2020-01-01 to 2022-12-31)")
tweets_df = collector.collect_financial_tweets(
    tickers=tickers,
    additional_keywords=additional_keywords,
    start_date="2020-01-01",
    end_date="2022-12-31",
    tweets_per_query=100
)
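
How `collect_financial_tweets` turns `tickers` and `additional_keywords` into a search query is not shown here. The sketch below is one plausible way to do it, assuming Twitter API v2 search syntax, where a space means AND, `OR` must be written explicitly, and `-is:retweet` excludes retweets. The helper `build_search_query` is hypothetical and not part of `TwitterDataCollector`.

```python
def build_search_query(ticker: str, keywords: list[str], exclude_retweets: bool = True) -> str:
    """Build a search query matching the ticker AND at least one keyword.

    Hypothetical helper for illustration; the real collector may build
    its queries differently.
    """
    # Quote multi-word keywords so they are matched as exact phrases.
    quoted = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    # Quote the ticker term and group the keywords into a single OR clause.
    query = f'"{ticker}" ({" OR ".join(quoted)})'
    if exclude_retweets:
        query += " -is:retweet"
    return query


query = build_search_query("S&P500", ["earnings", "stock"])
print(query)  # "S&P500" (earnings OR stock) -is:retweet
```

Building one query per ticker keeps each query short, which matters because search endpoints cap query length.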

Instead, we use pre-existing datasets and clean them below. They are not included in this repo because of their size.

In [None]:
tweet_csv_path = "../external_data/Tweet.csv"
stockerbot_path = "../external_data/stockerbot-export.csv"
tweets_remaining_path = "../external_data/tweets_remaining_09042020_16072020.csv"

# Clean individual datasets
tweet_csv_df = clean_tweet_csv(tweet_csv_path)
stockerbot_df = clean_stockerbot_export(stockerbot_path)
tweets_remaining_df = clean_tweets_remaining(tweets_remaining_path)

print(f"Tweet CSV: {len(tweet_csv_df)} rows")
print(f"Stockerbot: {len(stockerbot_df)} rows")
print(f"Tweets Remaining: {len(tweets_remaining_df)} rows")

# Alternatively, use the combined function
datasets = load_and_clean_all_datasets(
    tweets_remaining_path=tweets_remaining_path,
    tweet_csv_path=tweet_csv_path,
    stockerbot_path=stockerbot_path
)

# Combine all datasets
combined_df = combine_datasets(datasets)
print(f"Combined dataset: {len(combined_df)} rows")
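
The implementation of `combine_datasets` is not shown in this notebook. A minimal sketch of what such a step could look like follows, assuming `datasets` maps a source name to a cleaned DataFrame and that every frame shares `text` and `created_at` columns. Those column names and the helper `combine_tweet_frames` are assumptions made for illustration.

```python
import pandas as pd


def combine_tweet_frames(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack cleaned tweet frames, tag each row with its source, and de-duplicate.

    Assumes every frame has "text" and "created_at" columns (hypothetical).
    """
    # Record where each row came from before stacking.
    tagged = [df.assign(source=name) for name, df in frames.items()]
    combined = pd.concat(tagged, ignore_index=True)
    # The same tweet can appear in several datasets; keep the first copy.
    combined = combined.drop_duplicates(subset=["text", "created_at"])
    return combined.sort_values("created_at").reset_index(drop=True)


# Tiny synthetic example with one tweet duplicated across sources.
frames = {
    "tweet_csv": pd.DataFrame(
        {"text": ["a", "b"], "created_at": pd.to_datetime(["2020-01-02", "2020-01-01"])}
    ),
    "stockerbot": pd.DataFrame(
        {"text": ["b", "c"], "created_at": pd.to_datetime(["2020-01-01", "2020-01-03"])}
    ),
}
combined_example = combine_tweet_frames(frames)
print(combined_example)
```

Keeping a `source` column makes it possible to check later whether results depend on one particular dataset.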