# CryptoPulse: Project Workflow and Guide

## Introduction

This notebook is the central guide to the CryptoPulse project. It outlines the entire workflow, from data collection to model analysis, with a short explanation and a direct link to the source code for each step. The project's core goal is to critically evaluate sentiment-based financial prediction and to highlight the challenges of working with real-world data.

Follow the steps below to understand the project's architecture and methodology.

## Part 1: Data Collection Pipeline

The first step is to gather data from several sources. Each source has its own dedicated script in either the `src/` or the `collection/` directory.

### 1.1 Reddit
- **Purpose:** Collects posts from specified cryptocurrency-related subreddits.
- **Script:** [src/reddit_scraper.py](../src/reddit_scraper.py)

### 1.2 Twitter
- **Purpose:** Scrapes tweets from influential crypto accounts. Scraping Twitter is a complex task, so this script should be treated as a proof of concept rather than a production collector.
- **Script:** [src/twitter_scraper.py](../src/twitter_scraper.py)

### 1.3 News & RSS Feeds
- **Purpose:** Gathers articles from a large list of crypto news websites via their RSS feeds.
- **Script:** [collection/massive_rss_campaign.py](../collection/massive_rss_campaign.py)
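
To illustrate what gathering articles from an RSS feed involves, here is a minimal sketch that parses an RSS 2.0 document using only the standard library. The function name `parse_rss_items`, the output fields, and the sample feed are assumptions made for illustration; the actual script may use a different library and schema.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, published) from an RSS 2.0 document.

    Hypothetical helper: the field names are illustrative, not the project's schema.
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # RSS dates use the RFC 822 format; a missing date stays None
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


# Made-up feed used only to exercise the parser
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example Crypto News</title>
<item><title>ETH rallies</title><link>https://example.com/a</link>
<pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate></item>
<item><title>Market dips</title><link>https://example.com/b</link></item>
</channel></rss>"""

articles = parse_rss_items(SAMPLE_FEED)
for article in articles:
    print(article["title"], article["link"], article["published"])
```

In a real run, the XML text would come from an HTTP request to each feed URL, and items without a publication date would need a fallback before they can be aggregated by day.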

### 1.4 Price Data
- **Purpose:** Fetches historical price data for Ethereum (ETH) from Yahoo Finance.
- **Script:** [src/price_collector.py](../src/price_collector.py)

## Part 2: Feature Engineering & Dataset Creation

Once the raw data is collected, it is processed to create features for our models.

### 2.1 Sentiment Scoring
- **Purpose:** Analyzes the text from posts and articles to generate sentiment scores and other custom metrics.
- **Script:** [src/score_metrics.py](../src/score_metrics.py)
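
As a rough illustration of turning text into a sentiment score, the sketch below uses a tiny word-list approach. The word lists, the function name `lexicon_sentiment`, and the scoring formula are all assumptions for this example; the scoring method and custom metrics in `score_metrics.py` may be quite different.

```python
import re

# Hypothetical word lists, far too small for real use
POSITIVE = {"bullish", "surge", "gain", "rally", "up"}
NEGATIVE = {"bearish", "crash", "loss", "dump", "down"}


def lexicon_sentiment(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits.

    Text with no matching words scores 0.0 (neutral).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


for sample in ["ETH is bullish, big rally ahead",
               "Crash and dump incoming",
               "Rally then crash"]:
    print(sample, "->", lexicon_sentiment(sample))
```

A word-list scorer like this ignores negation, sarcasm, and domain slang, which is one of the real-world data challenges that makes sentiment-based prediction hard to evaluate.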

### 2.2 ML Dataset Creation
- **Purpose:** Combines the scored data with price data, aggregates it by day, and creates the final dataset for training.
- **Script:** [src/simplified_ml_dataset.py](../src/simplified_ml_dataset.py)
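
The combine-and-aggregate step can be sketched as follows. This is a simplified illustration, not the project's implementation: the function name `build_daily_dataset`, the feature names, and in particular the target definition (whether the next available close is higher than today's) are assumptions.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def build_daily_dataset(scored_posts, daily_close):
    """Aggregate sentiment by day and join it with price data.

    scored_posts: iterable of (date, sentiment_score) pairs
    daily_close:  dict mapping date -> closing price
    The 'target_up' label is an assumed definition: 1 if the next available
    close is higher than the current day's close, else 0.
    """
    by_day = defaultdict(list)
    for day, score in scored_posts:
        by_day[day].append(score)

    days = sorted(daily_close)
    rows = []
    # Pair each day with the following priced day; the last day has no label
    for today, tomorrow in zip(days, days[1:]):
        if today not in by_day:
            continue  # no text data collected for this day
        rows.append({
            "date": today,
            "mean_sentiment": mean(by_day[today]),
            "post_count": len(by_day[today]),
            "target_up": int(daily_close[tomorrow] > daily_close[today]),
        })
    return rows


# Made-up inputs for illustration only
posts = [
    (date(2024, 1, 1), 0.5),
    (date(2024, 1, 1), -0.5),
    (date(2024, 1, 2), 1.0),
    (date(2024, 1, 3), 0.2),
]
prices = {date(2024, 1, 1): 100.0, date(2024, 1, 2): 110.0, date(2024, 1, 3): 105.0}

dataset = build_daily_dataset(posts, prices)
for row in dataset:
    print(row)
```

Because the label looks one day ahead, the features for a given row only use information available on that day, which avoids leaking future prices into the inputs.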

## Part 3: Model Training and Evaluation

With the final dataset, we train and evaluate our models.

### 3.1 Model Training
- **Purpose:** Trains the simple (Logistic Regression) and complex (LightGBM) models.
- **Simple Model Script:** [src/simple_model_trainer.py](../src/simple_model_trainer.py)
- **Complex Model Script:** [src/ml_model_trainer.py](../src/ml_model_trainer.py)
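
To make the simple model concrete, the sketch below fits a logistic regression with plain batch gradient descent. It is a dependency-free stand-in written for this guide, not the code in `simple_model_trainer.py`, which may rely on a library implementation. The function names, hyperparameters, and toy data are all assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample drives both gradients
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Return 1 where the predicted probability is at least 0.5, else 0."""
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)
            for xi in X]


# Toy data: one feature (mean daily sentiment), label = price went up.
# These values are invented and perfectly separable, unlike real market data.
X_train = [[-0.8], [-0.5], [-0.2], [0.3], [0.6], [0.9]]
y_train = [0, 0, 0, 1, 1, 1]

w, b = train_logistic_regression(X_train, y_train)
preds = predict(X_train, w, b)
print("weights:", w, "bias:", b, "predictions:", preds)
```

The toy data is deliberately clean so the mechanics are visible; on real sentiment features the relationship to price direction is far noisier.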

### 3.2 Model Comparison and Plotting
- **Purpose:** Compares the performance of the different models and generates plots for analysis.
- **Comparison Script:** [src/model_comparison.py](../src/model_comparison.py)
- **Plotting Script:** [src/generate_plots.py](../src/generate_plots.py)
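
One way to frame a model comparison is to score every model on the same held-out labels and include a majority-class baseline for reference. The sketch below assumes accuracy as the metric and uses placeholder model names and invented predictions; it does not reproduce the metrics or results of `model_comparison.py`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare_models(y_true, predictions):
    """Rank models by accuracy alongside a majority-class baseline.

    predictions: dict mapping model name -> list of predicted labels (0 or 1)
    Returns a list of (name, accuracy) tuples, best first.
    """
    # Baseline always predicts the most common label (ties go to 1)
    majority = int(sum(y_true) * 2 >= len(y_true))
    results = {"majority_baseline": accuracy(y_true, [majority] * len(y_true))}
    for name, y_pred in predictions.items():
        results[name] = accuracy(y_true, y_pred)
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


# Invented labels and predictions, used only to exercise the functions
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 1],
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],
}

ranking = compare_models(y_true, predictions)
for name, acc in ranking:
    print(f"{name}: {acc:.3f}")
```

Including the baseline matters for a critical evaluation: a model that does not beat "always predict the majority class" has not learned anything useful from the sentiment features.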

## Conclusion

This notebook provides a clear roadmap to the CryptoPulse project. By following the links, you can explore the full source code for each part of the pipeline and understand how they fit together to support the project's final analysis and conclusions.