# CryptoPulse: Project Workflow and Guide

## Introduction

This notebook is the central guide to the CryptoPulse project. It walks through the entire workflow, from data collection to model analysis, with a short explanation and a direct link to the source code for each step. The project investigates whether social media sentiment can help predict cryptocurrency prices, comparing a wide range of models and techniques to show both the challenges and the opportunities in this domain.

Follow the steps below to understand the project's architecture and methodology.

## Part 1: Data Collection Pipeline

The first step is to gather data from a variety of sources. Each source has its own dedicated script in either the `src/` or the `collection/` directory.

### 1.1 Reddit
- **Purpose:** Collects posts from specified cryptocurrency-related subreddits.
- **Script:** [src/reddit_scraper.py](../src/reddit_scraper.py)

### 1.2 Twitter
- **Purpose:** Scrapes tweets from influential crypto accounts. Scraping Twitter is a complex task, so this script is best treated as a proof of concept.
- **Script:** [src/twitter_scraper.py](../src/twitter_scraper.py)

### 1.3 News & RSS Feeds
- **Purpose:** Gathers articles from a large list of crypto news websites via their RSS feeds.
- **Script:** [collection/massive_rss_campaign.py](../collection/massive_rss_campaign.py)
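
As a rough illustration of what RSS collection involves, the sketch below parses a feed document into a list of article records using only the standard library. It is not taken from `massive_rss_campaign.py`: the function name `parse_rss_items`, the record fields, and the inline `SAMPLE_RSS` feed are all assumptions made for this example, and a real collector would fetch each feed over HTTP first.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed used only for this example.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Crypto News</title>
    <item>
      <title>Ethereum upgrade scheduled</title>
      <link>https://example.com/eth-upgrade</link>
      <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Market wrap</title>
      <link>https://example.com/market-wrap</link>
      <pubDate>Tue, 02 Jan 2024 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # RSS dates use the RFC 822 format; keep None if the tag is missing.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
        })
    return items


if __name__ == "__main__":
    for article in parse_rss_items(SAMPLE_RSS):
        print(article["published"], article["title"])
```

Keeping parsing separate from fetching makes it possible to test the parser against saved feed files without any network access.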

### 1.4 Price Data
- **Purpose:** Fetches historical price data for Ethereum (ETH) from Yahoo Finance.
- **Script:** [src/price_collector.py](../src/price_collector.py)

## Part 2: Feature Engineering & Dataset Creation

Once the raw data is collected, it is processed to create features for our models.

### 2.1 Sentiment Scoring
- **Purpose:** Analyzes the text from posts and articles to generate traditional sentiment scores and other custom metrics.
- **Script:** [src/score_metrics.py](../src/score_metrics.py)
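
To make "traditional sentiment score" concrete, here is a minimal lexicon-based scorer. It is a sketch under stated assumptions, not the method used in `score_metrics.py`: the word lists `POSITIVE` and `NEGATIVE` and the function `lexicon_sentiment` are hypothetical, and the score is simply the balance of positive and negative matches, ranging from -1 to 1.

```python
import re

# Hypothetical, deliberately tiny lexicons for illustration only.
POSITIVE = {"bullish", "gain", "surge", "rally"}
NEGATIVE = {"bearish", "loss", "crash", "dump"}


def lexicon_sentiment(text):
    """Score text in [-1, 1] as (positive - negative) / matched words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    matched = pos + neg
    # Text with no lexicon words is treated as neutral.
    return 0.0 if matched == 0 else (pos - neg) / matched


if __name__ == "__main__":
    print(lexicon_sentiment("ETH looks bullish, big rally ahead"))
    print(lexicon_sentiment("Another crash and dump before the rally"))
```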

### 2.2 ML Dataset Creation
- **Purpose:** Combines the scored data with price data, aggregates it by day, and creates the final dataset for training. This script focuses on a specific 6-month period to ensure a high-quality, consistent dataset.
- **Script:** [src/simplified_ml_dataset.py](../src/simplified_ml_dataset.py)
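
The combine-and-aggregate step can be sketched as follows. This is an illustration only: the function `build_daily_dataset`, the feature names, and in particular the target definition (whether the next available close is higher than the current one) are assumptions, since this guide does not state how `simplified_ml_dataset.py` defines its target.

```python
from collections import defaultdict
from statistics import mean


def build_daily_dataset(scored_posts, daily_close):
    """Aggregate per-post sentiment by day and join it with price data.

    scored_posts: iterable of (iso_date, sentiment_score) pairs.
    daily_close:  dict mapping iso_date -> closing price.
    """
    scores_by_day = defaultdict(list)
    for day, score in scored_posts:
        scores_by_day[day].append(score)

    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    days = sorted(daily_close)
    rows = []
    for today, next_day in zip(days, days[1:]):
        if today not in scores_by_day:
            continue  # no text data for this day
        rows.append({
            "date": today,
            "mean_sentiment": mean(scores_by_day[today]),
            "post_count": len(scores_by_day[today]),
            # Assumed target: 1 if the next available close is higher.
            "target_up": int(daily_close[next_day] > daily_close[today]),
        })
    return rows


if __name__ == "__main__":
    posts = [("2024-01-01", 0.5), ("2024-01-01", -0.5), ("2024-01-02", 1.0)]
    prices = {"2024-01-01": 100.0, "2024-01-02": 110.0, "2024-01-03": 105.0}
    for row in build_daily_dataset(posts, prices):
        print(row)
```

Note that the last price day never produces a row, because its next-day target is unknown. Building the target only from later prices keeps future information out of the features.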

## Part 3: Model Training and Evaluation

With the final dataset, we train and evaluate a wide range of models to provide a comprehensive analysis.

### 3.1 Baseline Model
- **Purpose:** A simple Logistic Regression model to serve as a baseline for performance comparison.
- **Script:** [src/simple_model_trainer.py](../src/simple_model_trainer.py)
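
For readers unfamiliar with logistic regression, the sketch below fits one with plain batch gradient descent on the log loss. It exists only to show the mechanics; `simple_model_trainer.py` may well rely on a library implementation instead. The function names, learning rate, and epoch count are assumptions chosen for this toy example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log loss w.r.t. the linear output is (p - y).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict(X, w, b):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)
        for xi in X
    ]


if __name__ == "__main__":
    # Toy data: one feature (e.g. mean daily sentiment), binary target.
    X = [[-2.0], [-1.0], [1.0], [2.0]]
    y = [0, 0, 1, 1]
    w, b = fit_logistic_regression(X, y)
    print(predict(X, w, b))
```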

### 3.2 Traditional Machine Learning Models
- **Purpose:** A suite of traditional, tree-based ensemble models (LightGBM, XGBoost, Random Forest) that are known for their high performance on tabular data.
- **Script:** [src/ml_model_trainer.py](../src/ml_model_trainer.py)

### 3.3 Deep Learning Model (LSTM)
- **Purpose:** A Long Short-Term Memory (LSTM) network, a type of recurrent neural network (RNN) well-suited for time-series forecasting.
- **Script:** [src/modeling/lstm_trainer.py](../src/modeling/lstm_trainer.py)
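
Recurrent models consume sequences rather than single rows, so a daily dataset is typically reshaped into overlapping windows before training. The helper below is a sketch of that reshaping, assuming each window of `window` consecutive days is paired with the target of its final day; the name `make_windows` and this pairing are illustrative assumptions, not details confirmed by `lstm_trainer.py`.

```python
def make_windows(features, targets, window):
    """Turn aligned daily rows into (sequence, label) pairs.

    features: list of per-day feature vectors, in chronological order.
    targets:  list of labels aligned with features.
    window:   number of consecutive days in each input sequence.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    sequences, labels = [], []
    for end in range(window, len(features) + 1):
        sequences.append(features[end - window:end])
        labels.append(targets[end - 1])  # label of the window's last day
    return sequences, labels


if __name__ == "__main__":
    daily_features = [[0.1], [0.2], [0.3], [0.4]]
    daily_targets = [0, 1, 0, 1]
    X, y = make_windows(daily_features, daily_targets, window=3)
    print(len(X), y)
```

The resulting nested list has the shape (samples, window, features), which is the layout most deep learning frameworks expect for recurrent layers.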

### 3.4 Advanced NLP Model (CryptoBERT)
- **Purpose:** This script uses the pre-trained CryptoBERT model to generate contextual embeddings from the daily text data. It then trains a suite of models on a combination of these embeddings, traditional sentiment features, and technical indicators.
- **Script:** [src/modeling/cryptobert_trainer.py](../src/modeling/cryptobert_trainer.py)

### 3.5 Model Comparison and Plotting
- **Purpose:** Compares the performance of all the different models and generates plots for analysis.
- **Comparison Script:** [src/model_comparison.py](../src/model_comparison.py)
- **Plotting Script:** [src/generate_plots.py](../src/generate_plots.py)
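
A comparison step of this kind boils down to scoring every model on the same held-out labels and ranking the results. The sketch below does that with accuracy as the metric. The function names, the placeholder model names, and the toy labels are all hypothetical; they are not results from this project, and `model_comparison.py` may report different or additional metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_models(y_true, predictions_by_model):
    """Return (model_name, accuracy) pairs sorted from best to worst."""
    scores = {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy labels and predictions, made up purely to exercise the functions.
    y_true = [1, 0, 1, 1]
    predictions = {
        "model_a": [1, 0, 1, 1],
        "model_b": [1, 1, 0, 1],
    }
    for name, score in rank_models(y_true, predictions):
        print(f"{name}: {score:.2f}")
```

Scoring every model against one shared set of held-out labels is what makes the ranking meaningful.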

## Conclusion

This notebook lays out the CryptoPulse workflow end to end: data collection, feature engineering, and model training and evaluation. Running a baseline, tree-based ensembles, an LSTM, and CryptoBERT-based models through the same pipeline makes it possible to compare the strengths and weaknesses of each approach and provides a foundation for future research and development in this domain.