# CryptoPulse: A Critical Re-evaluation of Sentiment-Based Financial Prediction

## Introduction

This notebook walks through the entire workflow of the CryptoPulse project. The project's primary goal is not to present a highly accurate prediction model, but to critically re-evaluate the process of using social media sentiment for financial prediction, especially under real-world limitations such as data sparsity.

We will demonstrate:
1.  **The Data Pipeline:** How data is collected and processed.
2.  **Feature Engineering:** How sentiment scores are calculated.
3.  **The Modelling Process:** Training several models, including a simple, robust baseline and a more complex model that overfits.
4.  **The Critical Analysis:** How we can identify overfitting and why high accuracy can be misleading.

## Part 1: Data Collection Pipeline

The first step is to collect data from various sources. The following code blocks are based on the scripts found in `src/`. These scripts are designed to be run automatically to continuously collect data.

In [None]:
import praw
import pandas as pd
from newspaper import Article
import yfinance as yf
from datetime import datetime, timedelta
import sqlite3

def get_reddit_data(subreddits, limit=100):
    # Fill in your own Reddit API credentials (do not commit real values)
    reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                         client_secret='YOUR_CLIENT_SECRET',
                         user_agent='YOUR_USER_AGENT')
    
    posts_data = []
    for subreddit_name in subreddits:
        subreddit = reddit.subreddit(subreddit_name)
        # Only the current "hot" listing is sampled, up to `limit` posts per subreddit
        for post in subreddit.hot(limit=limit):
            # post.subreddit is a PRAW object, so store its name as a plain string;
            # created_utc is the post's creation time as a Unix timestamp in UTC
            posts_data.append([post.subreddit.display_name, post.title, post.score, post.id,
                               post.url, post.num_comments, post.selftext, post.created_utc])
    
    return pd.DataFrame(posts_data, columns=['subreddit', 'title', 'score', 'id', 'url', 'num_comments', 'body', 'created'])

def get_news_data(urls):
    news_data = []
    for url in urls:
        try:
            article = Article(url)
            article.download()
            article.parse()
            # publish_date can be None when the page does not expose a date
            news_data.append([article.title, article.text, article.publish_date])
        except Exception as e:
            # Skip articles that fail to download or parse, but report them
            print(f'Error processing article at {url}: {e}')
            
    return pd.DataFrame(news_data, columns=['title', 'text', 'publish_date'])

def get_price_data(ticker, start_date, end_date):
    # Note: yfinance treats end_date as exclusive
    return yf.download(ticker, start=start_date, end=end_date)

print('Data collection functions are defined.')
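
The cell above imports `sqlite3` without using it. Since the collection scripts are meant to run repeatedly, one plausible way to persist posts without storing the same post twice is sketched below. The table name `reddit_posts` and the helper `save_posts` are hypothetical and are not taken from `src/`.

```python
import sqlite3

def save_posts(conn, posts):
    """Insert (id, subreddit, title, created) tuples, skipping ids already stored.

    Returns the number of rows that were actually added.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reddit_posts (
               id TEXT PRIMARY KEY,
               subreddit TEXT,
               title TEXT,
               created REAL
           )"""
    )
    before = conn.execute("SELECT COUNT(*) FROM reddit_posts").fetchone()[0]
    # INSERT OR IGNORE skips rows whose primary key already exists,
    # so overlapping collection runs do not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO reddit_posts (id, subreddit, title, created) "
        "VALUES (?, ?, ?, ?)",
        posts,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM reddit_posts").fetchone()[0]
    return after - before
```

Calling `save_posts` twice with overlapping batches adds only the posts that were not seen before, which keeps later per-day counts from being inflated by repeated runs.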

## Part 2: Data Processing and Feature Engineering

Once the data is collected, it needs to be processed and scored. The following code is based on `src/score_metrics.py` and `src/simplified_ml_dataset.py`.

In [None]:
from transformers import pipeline

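def get_sentiment_scores(text_series):
    # The pipeline is built on every call; the default checkpoint is downloaded on first use
    sentiment_pipeline = pipeline('sentiment-analysis')

    def label_text(x):
        # Non-string entries (e.g. missing bodies) fall back to 'NEUTRAL'
        if not isinstance(x, str):
            return 'NEUTRAL'
        # x[:512] caps the input at 512 characters; truncation=True also caps it
        # at the model's token limit. Only the label is kept, not the confidence score.
        return sentiment_pipeline(x[:512], truncation=True)[0]['label']

    return text_series.apply(label_text)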

def create_ml_dataset(processed_data, price_data):
    # This is a simplified representation of the dataset creation process
    # The actual implementation would involve merging, aggregation, and feature creation
    print('Creating ML dataset...')
    # ... complex data processing logic ...
    return pd.DataFrame() # Return a placeholder dataframe

print('Data processing and feature engineering functions are defined.')
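
`get_sentiment_scores` returns text labels, while a model needs numeric features. The sketch below shows one common way to turn pipeline results into a signed score and average it per day. It assumes each result is a dict with `label` and `score` keys, as the `transformers` sentiment pipeline returns; the helpers `signed_score` and `daily_mean_sentiment` are hypothetical and may differ from what `src/score_metrics.py` actually does.

```python
from collections import defaultdict
from statistics import mean

def signed_score(result):
    """Map a result such as {'label': 'NEGATIVE', 'score': 0.9} to a value in [-1, 1]."""
    label = result["label"].upper()
    if label == "POSITIVE":
        return result["score"]
    if label == "NEGATIVE":
        return -result["score"]
    return 0.0  # any other label (e.g. 'NEUTRAL') counts as no signal

def daily_mean_sentiment(records):
    """Average signed scores per calendar day.

    `records` is an iterable of (date_string, pipeline_result) pairs.
    Days without posts are absent from the output, which is how sparsity shows up.
    """
    by_day = defaultdict(list)
    for day, result in records:
        by_day[day].append(signed_score(result))
    return {day: mean(scores) for day, scores in sorted(by_day.items())}
```

Days missing from the result have to be handled explicitly before merging with price data (dropped, or filled with a neutral value), and that choice can matter as much as the model itself when data is sparse.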

## Part 3: Model Training

Now we train our models. The project compares three types of models; the cell below defines training functions for the second and third:
1. A baseline model.
2. A simple, robust Logistic Regression model.
3. A complex LightGBM model that is prone to overfitting on this dataset.
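
The list names a baseline without specifying it. A common choice for classification, assumed here rather than confirmed by the source, is to always predict the most frequent training label. scikit-learn offers this as `DummyClassifier(strategy="most_frequent")`; the dependency-free sketch below (with the hypothetical helper `majority_baseline_accuracy`) shows the same idea.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Return the most frequent training label and its accuracy on y_test."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)
```

Any trained model should be judged against this number: if most days in the sample are "up" days, a model can reach a high accuracy without learning anything from sentiment.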

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import lightgbm as lgb

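def train_simple_model(X, y):
    # Note: train_test_split shuffles rows by default, so with time-ordered data
    # the training set can contain days that come after days in the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # Only the held-out split is returned, for later evaluation
    return model, X_test, y_test

def train_complex_model(X, y):
    # Same split and seed as the simple model, so both see identical test rows
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Default hyperparameters; with few rows, boosted trees can memorise the training set
    model = lgb.LGBMClassifier()
    model.fit(X_train, y_train)
    return model, X_test, y_test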

print('Model training functions are defined.')
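
`train_test_split` shuffles rows by default. For price prediction, where rows are ordered in time, a shuffled split lets a model train on days that come after the days it is tested on. A hedged alternative is to hold out the most recent rows instead; in scikit-learn, passing `shuffle=False` to `train_test_split` gives an ordered split of this kind. The helper `chronological_split` below is a hypothetical, dependency-free sketch that assumes the rows are already sorted from oldest to newest.

```python
import math

def chronological_split(X, y, test_size=0.2):
    """Hold out the final `test_size` fraction of time-ordered rows as the test set."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    n_test = max(1, math.ceil(len(X) * test_size))
    cut = len(X) - n_test
    # Everything before `cut` is training data; everything from `cut` on is test data
    return X[:cut], X[cut:], y[:cut], y[cut:]
```

The slicing works on lists and NumPy arrays; for pandas objects, `.iloc` slicing would be the equivalent.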

## Part 4: Critical Analysis & Comparison

This is the most important part of the analysis. We will compare the models and show how the complex model, despite potentially higher accuracy, is less reliable due to overfitting.
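
One reason a higher accuracy can be unreliable on sparse data is that the test set is small, so the measured accuracy carries a wide margin of error. The sketch below computes a Wilson score interval for an accuracy figure. The helper name `wilson_interval` is hypothetical, and the default `z=1.96` corresponds to an approximate 95% interval.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion such as test accuracy."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, 14 correct predictions out of 20 is 70% accuracy, but the interval runs from roughly 48% to 85%, so it does not rule out a model that is no better than a coin flip. The same 70% measured on 2,000 test rows gives a much narrower interval.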

In [None]:
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
    print(classification_report(y_test, y_pred))

def plot_feature_importance(model, features):
    # Only models that expose feature_importances_ (e.g. LightGBM) are plotted;
    # for other models such as LogisticRegression this function draws nothing
    if hasattr(model, 'feature_importances_'):
        feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features})
        plt.figure(figsize=(20, 10))
        sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
        # The model is fit once on a single split, so no averaging over folds is involved
        plt.title('LightGBM Feature Importance')
        plt.tight_layout()
        plt.show()

print('Model analysis and comparison functions are defined.')
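
The training functions in Part 3 return only the test split, so `evaluate_model` cannot compare test accuracy with training accuracy. Overfitting typically shows up as a large gap between the two. The sketch below measures that gap for any object with a `predict` method. The helpers `accuracy` and `accuracy_gap` are hypothetical, and `MemorizingModel` is a toy stand-in for an overfit model, not part of the project.

```python
def accuracy(model, X, y):
    """Fraction of rows where model.predict matches the true label."""
    predictions = model.predict(X)
    return sum(1 for p, t in zip(predictions, y) if p == t) / len(y)

def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, gap).

    A large positive gap suggests the model has memorised the training data.
    """
    train_acc = accuracy(model, X_train, y_train)
    test_acc = accuracy(model, X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc

class MemorizingModel:
    """Toy overfit model: recalls training rows exactly, predicts 0 for unseen rows."""

    def fit(self, X, y):
        self.memory = {tuple(row): label for row, label in zip(X, y)}
        return self

    def predict(self, X):
        return [self.memory.get(tuple(row), 0) for row in X]
```

To apply this to the models above, the training functions would also need to return `X_train` and `y_train`, which they currently discard.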

## Conclusion

This notebook has outlined the full pipeline of the CryptoPulse project. More importantly, it makes the case that a high accuracy score alone does not establish a model's success. When the models are evaluated critically, a simpler, more robust model gives a more honest assessment of how much predictive power social media sentiment has when data is sparse.