# Algorithm Project Proposal  
**Team C.J.J.: Joshua Meyer, Chris Wong, Jeremy Orona**  
**CPSC 322, Fall 2025**

---

## 1. Project Title
**Predicting Bitcoin Price Direction Using Discretized Treasury and Sentiment Data**

---

## 2. Dataset Description

### Source
- **Primary Dataset:** [Kaggle: Bitcoin and US Treasury with Daily Sentiment](https://www.kaggle.com/datasets/jessearzate/bitcoin-and-us-treasury-with-daily-sentiment?select=bitcoin_sentiment_12012022_11082025.csv)
- **Processed Dataset:** `bitcoin_sentiment_discretized.csv` (preprocessed version with discretized features)

### Format
- **File Type:** CSV (Comma-Separated Values)
- **Encoding:** UTF-8

### Contents
The dataset contains daily financial and market data spanning **December 1, 2022 to November 8, 2025** (the date range encoded in the source filename), with approximately **1,074 instances** and **19 predictor attributes** plus one target variable. The data has been preprocessed and discretized into categorical bins for classification.

**Key Data Components:**
- **Bitcoin Trading Metrics:** Trading volume (discretized as VeryLow, Low, Medium, High, VeryHigh)
- **US Treasury Indicators:** Multiple treasury security categories including:
  - Treasury bills, bonds, notes, and floating rate notes (FRN)
  - Treasury Inflation-Protected Securities (TIPS)
  - Government account series and special purpose vehicles
  - Federal financing bank data
  - Total interest-bearing debt, marketable, and non-marketable securities
- **Market Sentiment:** Weighted sentiment scores (discretized into categorical levels)
- **Target Variable:** Binary price direction classification (Up/Down)

### Dataset Size
- **Instances:** ~1,074 daily observations
- **Features:** 19 predictor attributes (all discretized categorical)
- **Target Classes:** 2 (binary classification: "Up" or "Down")

---

## 3. Attributes and Target

### Predictor Attributes (19 Features)
All attributes have been discretized into categorical bins for classification:

1. **Volume** (VeryLow, Low, Medium, High, VeryHigh)
2. **Federal Financing Bank** (Bank1, Bank2, Bank3, Bank4, Bank5)
3. **Foreign Series** (0, 1)
4. **Government Account Series** (VeryLow, Low, Medium, High, VeryHigh)
5. **Government Account Series Inflation Securities** (VeryLow, Low, Medium, High, VeryHigh)
6. **Special Purpose Vehicle** (VeryLow, Low, Medium, High, VeryHigh)
7. **State and Local Government Series** (VeryLow, Low, Medium, High, VeryHigh)
8. **Total Interest-Bearing Debt** (VeryLow, Low, Medium, High, VeryHigh)
9. **Total Marketable** (VeryLow, Low, Medium, High, VeryHigh)
10. **Total Non-Marketable** (VeryLow, Low, Medium, High, VeryHigh)
11. **Treasury Bills** (VeryLow, Low, Medium, High, VeryHigh)
12. **Treasury Bonds** (VeryLow, Low, Medium, High, VeryHigh)
13. **Treasury Floating Rate Notes (FRN)** (VeryLow, Low, Medium, High, VeryHigh)
14. **Treasury Inflation-Protected Securities (TIPS)** (VeryLow, Low, Medium, High, VeryHigh)
15. **Treasury Notes** (VeryLow, Low, Medium, High, VeryHigh)
16. **United States Savings Inflation Securities** (VeryLow, Low, Medium, High, VeryHigh)
17. **United States Savings Securities** (VeryLow, Low, Medium, High, VeryHigh)
18. **Weighted Sentiment** (VeryLow, Low, Medium, High, VeryHigh)

### Target (Class Information)
- **Target Variable:** `price_direction`
- **Classification Type:** Binary classification
- **Class Labels:** 
  - **"Up"**: Bitcoin closing price increased compared to the previous day
  - **"Down"**: Bitcoin closing price decreased or remained the same compared to the previous day

The target variable is derived by comparing each day's closing price with the previous day's closing price. The resulting binary classification task is to predict, from treasury indicators and market sentiment, whether Bitcoin's price moves up or down.
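
The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the preprocessing script used to build `bitcoin_sentiment_discretized.csv`; the function name `label_price_direction` and the sample prices are hypothetical.

```python
def label_price_direction(closes):
    """Label each day from the second onward as "Up" or "Down".

    "Up" means the close is strictly higher than the previous day's close;
    an unchanged or lower close is "Down", matching the class definitions.
    """
    labels = []
    for previous, current in zip(closes, closes[1:]):
        labels.append("Up" if current > previous else "Down")
    return labels


# Hypothetical closing prices, for illustration only.
example_closes = [100.0, 101.5, 101.5, 99.0, 102.0]
example_labels = label_price_direction(example_closes)
```

The first day has no previous close, so it receives no label; a series of n closing prices yields n - 1 labeled instances.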

---

## 4. Implementation / Technical Merit

### Preprocessing Pipeline
1. **Data Loading:** CSV parsing with proper encoding (UTF-8) and handling of categorical data
2. **Feature Encoding:** One-Hot Encoding using scikit-learn's `OneHotEncoder` to convert categorical features into numerical format suitable for machine learning algorithms
3. **Data Splitting:** Stratified train-test split (roughly 2/3 training, 1/3 testing; `test_size=0.33`) with `random_state=9` for reproducibility across all algorithms
4. **Handling Unknown Categories:** OneHotEncoder configured with `handle_unknown="ignore"` to gracefully handle unseen categorical values during prediction
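
To make the stratified split in step 3 concrete, here is a standard-library sketch. It is not the course's split function or scikit-learn's `train_test_split`; the name `stratified_split` and the toy labels are assumptions made for illustration. The idea is to shuffle each class separately and send the same fraction of each class to the test set.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=9):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 6 "Up" and 3 "Down".
example_labels = ["Up"] * 6 + ["Down"] * 3
train_idx, test_idx = stratified_split(example_labels)
```

With these toy labels the test set receives two "Up" rows and one "Down" row, so both subsets keep the original 2:1 class ratio.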

### Classification Algorithms
We will implement and compare three distinct classification approaches:

1. **Decision Tree Classifier (Baseline)**
   - Implementation: scikit-learn's `DecisionTreeClassifier`
   - Hyperparameters: Gini impurity criterion, max_depth=10, min_samples_split=2, min_samples_leaf=1
   - Rationale: Provides interpretable baseline with decision rules that can be visualized

2. **Random Forest Classifier (Custom Implementation)**
   - Implementation: Custom `MyRandomForestClassifier` from course materials
   - Hyperparameters: n_estimators=20, bootstrap=True, max_features determined by sqrt(n_features)
   - Rationale: Demonstrates ensemble learning with bootstrap aggregation and random feature selection, showcasing custom implementation skills

3. **K-Nearest Neighbors (KNN) Classifier**
   - Implementation: scikit-learn's `KNeighborsClassifier`
   - Hyperparameters: n_neighbors=5, weights="uniform", metric="minkowski" with p=2 (equivalent to Euclidean distance)
   - Rationale: Instance-based learning approach that captures local patterns in the discretized feature space
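
The two sources of randomness named for the random forest (bootstrap aggregation and sqrt(n_features) feature selection) can be sketched as follows. This is not the course's `MyRandomForestClassifier`; the helper names and the row count of 12 are hypothetical. Whether the feature subset is drawn once per tree or at every split depends on the implementation; this sketch draws it once per tree for brevity.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices, at least one."""
    k = max(1, math.isqrt(n_features))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(9)  # seed chosen to mirror the proposal's random_state
n_estimators = 20

# One (row sample, feature subset) plan per tree; 12 rows is a toy size.
plans = [
    (bootstrap_indices(12, rng), random_feature_subset(19, rng))
    for _ in range(n_estimators)
]
```

With 19 predictor attributes, floor(sqrt(19)) = 4, so each tree in this sketch would see four attributes.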

### Evaluation Methodology
- **Consistent Evaluation:** All three algorithms use identical train-test splits (random_state=9, test_size=0.33, stratified)
- **Performance Metrics:** Accuracy, precision, recall, F1-score, and classification reports
- **Comparison Framework:** Direct comparison of model performance on identical test data to assess relative effectiveness
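
For reference, the listed metrics can be computed directly from prediction counts. The sketch below treats "Up" as the positive class, which is an assumption for illustration; the function name `binary_metrics` and the sample predictions are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="Up"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only.
y_true = ["Up", "Up", "Down", "Down", "Up", "Down"]
y_pred = ["Up", "Down", "Down", "Up", "Up", "Down"]
metrics = binary_metrics(y_true, y_pred)
```

Because all three classifiers are scored on the same test rows, differences in these numbers reflect the models rather than the split.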

### Technical Approach
- **Pipeline Architecture:** scikit-learn Pipeline objects for preprocessing and classification integration
- **Reproducibility:** Fixed random seeds ensure consistent results across algorithm comparisons
- **Code Organization:** Modular design with separate demo scripts for each classifier, facilitating independent evaluation and comparison
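
A minimal sketch of how the pieces above could fit together for the decision tree baseline, assuming scikit-learn is installed. The toy rows, the two-attribute layout, and the step names `"encode"` and `"tree"` are hypothetical; the real demo scripts would load the rows from the processed CSV instead.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows with two discretized attributes (not the real data).
X = [
    ["Low", "High"], ["Low", "Low"], ["High", "High"], ["High", "Low"],
    ["Medium", "High"], ["Medium", "Low"], ["Low", "High"], ["High", "Low"],
    ["Medium", "High"], ["Low", "Low"], ["High", "High"], ["Medium", "Low"],
]
y = ["Up", "Down"] * 6

# Same split settings as the proposal: stratified, test_size=0.33, random_state=9.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=9, stratify=y
)

# Encoding lives inside the pipeline, so it is fit on training rows only.
model = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("tree", DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=9)),
])
model.fit(X_train, y_train)

predictions = model.predict(X_test)
# "VeryHigh" never appears in training; handle_unknown="ignore" encodes it as all zeros.
unseen = model.predict([["VeryHigh", "High"]])
```

Swapping the `"tree"` step for another estimator keeps the preprocessing and the split identical, which is what makes the comparison across classifiers fair.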

---

## 5. Anticipated Challenges

### Data Preprocessing Challenges
1. **Categorical Feature Encoding:** With 19 categorical features, one-hot encoding will create a high-dimensional feature space. This may lead to:
   - Increased computational complexity
   - Potential overfitting with limited training data
   - Sparse feature matrices requiring efficient memory management

2. **Class Distribution:** Binary classification may exhibit class imbalance if Bitcoin price movements are not evenly distributed between "Up" and "Down" classes. Mitigation strategies include:
   - Stratified sampling in train-test splits (already implemented)
   - Class weight balancing in algorithms that support it
   - Evaluation metrics that account for imbalance (precision, recall, F1-score)

3. **Temporal Dependencies:** Daily financial data contains inherent temporal structure, but we treat each day independently. This approach:
   - Ignores potential autocorrelation in price movements
   - May miss sequential patterns that could improve prediction
   - Simplifies the problem but may limit model performance

### Classification Challenges
1. **High Dimensionality:** After one-hot encoding, the feature space expands significantly, potentially leading to:
   - Curse of dimensionality effects, especially for KNN
   - Increased risk of overfitting in decision trees
   - Need for regularization or feature selection

2. **Discretization Information Loss:** Converting continuous treasury and sentiment values to categorical bins may:
   - Lose fine-grained information that could be predictive
   - Introduce discretization bias depending on bin boundaries
   - Reduce model sensitivity to subtle feature variations

3. **Model Interpretability vs. Performance Trade-off:**
   - Decision trees offer high interpretability but may underperform
   - Random forests improve performance but reduce interpretability
   - KNN provides no explicit feature importance insights
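
To put a rough number on the one-hot expansion discussed under High Dimensionality, the sketch below counts encoded columns from the level lists in Section 3. It uses the 18 attributes as enumerated there and assumes every listed level actually occurs in the data, so the result is an upper bound for those attributes; the variable names are illustrative.

```python
# Level counts taken from the attribute list in Section 3.
five_level_attributes = 16  # attributes binned as VeryLow..VeryHigh
federal_financing_bank_levels = 5  # Bank1..Bank5
foreign_series_levels = 2  # 0, 1

levels_per_attribute = (
    [5] * five_level_attributes
    + [federal_financing_bank_levels, foreign_series_levels]
)

# One-hot encoding creates one column per (attribute, level) pair.
one_hot_columns = sum(levels_per_attribute)

# Approximate training rows under a 2/3 split of ~1,074 instances.
train_rows = round(1074 * 2 / 3)
rows_per_column = train_rows / one_hot_columns
```

Under these assumptions the encoded matrix has 87 columns for about 716 training rows, or roughly eight rows per column, which is why overfitting and distance concentration in KNN are listed as risks.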

### Feature Selection Considerations
With 19 original features (expanding to many more after one-hot encoding), we will:
- **Monitor Feature Importance:** Use embedded methods (decision tree feature importance, random forest feature importance) to identify most predictive features
- **Evaluate Feature Redundancy:** Analyze correlation between discretized features to identify potential redundancies
- **Consider Dimensionality Reduction:** If needed, explore PCA or feature selection techniques, though this may reduce interpretability

---

## 6. Feature Selection Techniques

Given the discretized categorical nature of our dataset and the expansion of features through one-hot encoding, we will employ the following feature selection strategies:

### Embedded Methods (Primary Approach)
1. **Decision Tree Feature Importance:** 
   - Extract feature importance scores from trained decision trees
   - Identify features that contribute most to classification decisions
   - Visualize feature importance rankings

2. **Random Forest Feature Importance:**
   - Aggregate feature importance across multiple trees in the forest
   - More robust than single-tree importance due to ensemble averaging
   - Identify features consistently important across bootstrap samples
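
After one-hot encoding, tree-based importances are reported per encoded column rather than per original attribute. One way to rank the original attributes is to sum the importances of their columns, sketched below. The function name `importance_by_attribute` and all importance values are hypothetical.

```python
from collections import defaultdict


def importance_by_attribute(column_sources, column_importances):
    """Sum one-hot column importances back onto their source attribute."""
    totals = defaultdict(float)
    for source, importance in zip(column_sources, column_importances):
        totals[source] += importance
    # Highest total importance first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values: each entry names the attribute a column was encoded from.
column_sources = [
    "Volume", "Volume",
    "Weighted Sentiment", "Weighted Sentiment",
    "Treasury Bills",
]
column_importances = [0.10, 0.15, 0.30, 0.25, 0.20]
ranking = importance_by_attribute(column_sources, column_importances)
```

Summing per attribute keeps the ranking interpretable in terms of the 19 original predictors instead of individual bin indicators.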

### Filter Methods (Secondary Analysis)
1. **Correlation Analysis:**
   - Examine correlations between one-hot encoded features and target variable
   - Identify features with strong predictive relationships
   - Note: Correlation on binary one-hot columns is of limited use, but it can still flag strong associations with the target

2. **Class Distribution Analysis:**
   - Analyze how feature value distributions differ between "Up" and "Down" classes
   - Identify features with distinct distributions across classes
   - Use chi-square tests or similar statistical measures for categorical features
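
The chi-square test mentioned above compares observed class counts per feature level against the counts expected if the feature and the class were independent. A standard-library sketch of the Pearson statistic follows; the contingency counts are hypothetical, and the function assumes no row or column of the table sums to zero.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature level and class.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are sentiment levels (Low, High); columns are (Up, Down).
table = [[20, 30], [30, 20]]
statistic = chi_square_statistic(table)
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
```

For this toy table every expected count is 25, so the statistic is 4.0 with one degree of freedom, which exceeds the 3.841 critical value at the 0.05 level. A five-level feature against the two classes would have four degrees of freedom.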

### Dimensionality Considerations
- **Current Approach:** Use all features with one-hot encoding to preserve information
- **Future Consideration:** If model performance is poor or overfitting occurs, we may:
  - Select top-k features based on importance scores
  - Use recursive feature elimination
  - Apply PCA (though this reduces interpretability with categorical data)

### Evaluation Strategy
- Compare model performance with full feature set vs. reduced feature sets
- Balance between model complexity and performance
- Maintain interpretability where possible, especially for decision trees

---

## 7. Potential Impact of Results

### Usefulness of Results

**Financial Market Prediction:**
- Our models can help identify relationships between US Treasury indicators, market sentiment, and Bitcoin price movements
- Provides quantitative framework for testing hypotheses about macroeconomic factors influencing cryptocurrency markets
- Demonstrates practical application of machine learning to financial time series prediction

**Methodological Contributions:**
- Compares three distinct classification paradigms (rule-based, ensemble, instance-based) on identical financial data
- Evaluates effectiveness of discretization strategies for financial prediction
- Provides reproducible baseline for future research on cryptocurrency price prediction

**Educational Value:**
- Showcases implementation of custom machine learning algorithms (Random Forest) alongside standard library implementations
- Demonstrates proper evaluation methodology with consistent train-test splits
- Illustrates challenges and solutions in preprocessing categorical financial data

### Stakeholders

**Primary Stakeholders:**
1. **Cryptocurrency Investors and Traders:**
   - Retail investors seeking data-driven insights for Bitcoin trading decisions
   - Day traders looking for short-term price movement indicators
   - Long-term holders interested in understanding market dynamics

2. **Financial Institutions:**
   - Hedge funds and investment firms developing quantitative trading strategies
   - Banks and financial services companies assessing cryptocurrency market exposure
   - Asset management companies evaluating Bitcoin as an investment asset class

3. **Academic Researchers:**
   - Researchers studying relationships between traditional financial markets and cryptocurrencies
   - Economists analyzing macroeconomic factors affecting digital assets
   - Data scientists developing financial prediction models

**Secondary Stakeholders:**
4. **Regulatory Bodies:**
   - Financial regulators understanding market dynamics for policy development
   - Government agencies monitoring cryptocurrency market stability

5. **Technology Companies:**
   - Fintech companies building trading platforms and prediction tools
   - Blockchain companies analyzing market adoption patterns

### Practical Applications
- **Trading Strategy Development:** Models can inform algorithmic trading strategies
- **Risk Management:** Understanding price movement patterns aids in portfolio risk assessment
- **Market Analysis:** Provides quantitative framework for analyzing cryptocurrency market behavior
- **Educational Tool:** Demonstrates machine learning applications in finance for educational purposes

---

## 8. Citations

### Dataset
- **Kaggle Dataset:** "Bitcoin and US Treasury with Daily Sentiment" by Jesse Arzate
  - URL: https://www.kaggle.com/datasets/jessearzate/bitcoin-and-us-treasury-with-daily-sentiment
  - License: See the license terms listed on the Kaggle dataset page
  - Citation: Arzate, J. (2022-2025). Bitcoin and US Treasury with Daily Sentiment. Kaggle.

### Software Libraries and Tools
- **scikit-learn** (v1.0+): Pedregosa et al., "Scikit-learn: Machine Learning in Python," JMLR 12, pp. 2825-2830, 2011.
  - Documentation: https://scikit-learn.org/stable/
  - Used for: DecisionTreeClassifier, KNeighborsClassifier, train_test_split, OneHotEncoder, Pipeline, evaluation metrics

- **NumPy:** Harris et al., "Array programming with NumPy," Nature 585, pp. 357-362, 2020.
  - Documentation: https://numpy.org/doc/
  - Used for: Numerical array operations and data manipulation

- **Matplotlib:** Hunter, J. D., "Matplotlib: A 2D graphics environment," Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
  - Documentation: https://matplotlib.org/
  - Used for: Data visualization and decision tree plotting

- **Python Standard Library:** csv module for data loading

### Course Materials
- **MyRandomForestClassifier:** Custom implementation provided in CPSC 322 course materials
- **MyEvaluation Functions:** Stratified train-test split and evaluation functions from course materials
- **Course Textbook and Lecture Notes:** Algorithm implementations and theoretical foundations from CPSC 322 course content

### References and Documentation
- **scikit-learn Documentation:**
  - Decision Trees: https://scikit-learn.org/stable/modules/tree.html
  - K-Nearest Neighbors: https://scikit-learn.org/stable/modules/neighbors.html
  - Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
  - Model Evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html

- **Machine Learning Theory:**
  - Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
  - Cover, T., & Hart, P. (1967). "Nearest neighbor pattern classification." IEEE Transactions on Information Theory, 13(1), 21-27.
  - Quinlan, J. R. (1986). "Induction of decision trees." Machine Learning, 1(1), 81-106.

### Data Preprocessing
- Discretization methodology and one-hot encoding techniques based on standard machine learning preprocessing practices as covered in CPSC 322 course materials.

---