# Bitcoin Weekly Price Movement Classification

## Objective
The goal of this notebook is to predict whether the weekly Bitcoin price movement is **UP** or **DOWN**.
Instead of predicting exact prices (regression), we frame the task as a binary **classification problem**, which is easier to evaluate and gives a direct directional signal for trading insights.

### Problem Definition
- **Target Variable:** Weekly price movement (`Price_Change_7d`)
- **Classification Label:** 
  - `UP` if the price increased compared to the previous week
  - `DOWN` if the price decreased or remained the same
- **Features:**
  - `Market Cap`
  - `Price`
  - `Volume (24hr)`
  - `% 1h`, `% 24h`, `% 7d`
  - `Market_Dominance`
  - `Year`, `Month`
- **Evaluation Metrics:**
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - Confusion Matrix
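
For reference, the per-class metrics listed above follow the standard definitions. Take one class (say `UP`) as the positive class, and let $TP$, $FP$, $FN$, and $TN$ be the counts of true positives, false positives, false negatives, and true negatives. These symbols are introduced here for illustration and do not appear in the notebook code.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```

The confusion matrix is the 2×2 table of these four counts, so every metric above can be recomputed from it.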

### Steps in this Notebook:
1. Load preprocessed data (`processed_crypto_data.csv`)
2. Create classification target variable (`UP` / `DOWN`)
3. Select features and target
4. Split data into training and testing sets
5. Feature scaling
6. Train multiple classifiers:
   - Logistic Regression
   - Decision Tree
   - Random Forest
   - Support Vector Machine (SVM)
   - Gradient Boosting Classifier
7. Evaluate models and compare performance
8. Save the best model for future predictions


## Step 1: Load and Inspect Processed Data
We load the preprocessed CSV file from Stage 3 and inspect the dataset to confirm the features are ready for modeling.

In [1]:
import pandas as pd
import numpy as np

# Load preprocessed data
df = pd.read_csv("processed_crypto_data.csv")

# Display first few rows
df.head()

| | Date | Name | Symbol | Market Cap | Price | Circulating Supply | Volume (24hr) | % 1h | % 24h | % 7d | Market_Dominance | Price_Change_7d | Year | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-04-28 | Bitcoin | BTC | 1488567000.0 | 134.21 | 11091325.0 | 207329600.0 | 0.64 | 0.0 | 0.0 | 0.941809 | NaN | 2013 | 4 |
| 1 | 2013-04-28 | Litecoin | LTC | 74637020.0 | 4.3484 | 17164230.0 | 208620200.0 | 0.8 | 0.0 | 0.0 | 0.047222 | NaN | 2013 | 4 |
| 2 | 2013-04-28 | Peercoin | PPC | 7250187.0 | 0.3865 | 18757362.0 | 5939582.0 | 0.005 | 0.0 | 0.0 | 0.004587 | NaN | 2013 | 4 |
| 3 | 2013-04-28 | Namecoin | NMC | 5995997.0 | 1.1072 | 5415300.0 | 5939582.0 | 0.005 | 0.0 | 0.0 | 0.003794 | NaN | 2013 | 4 |
| 4 | 2013-04-28 | Terracoin | TRC | 1503099.0 | 0.6469 | 2323570.0 | 94459170.0 | 0.61 | 0.0 | 0.0 | 0.000951 | NaN | 2013 | 4 |


## Step 2: Define Target Variable
We create a new target column `Weekly_Movement` based on `Price_Change_7d`:

- `UP` if `Price_Change_7d` > 0
- `DOWN` if `Price_Change_7d` <= 0

In [2]:
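# Create classification target.
# Note: rows where Price_Change_7d is NaN (such as the 2013-04-28 rows shown
# in Step 1) fail the `> 0` comparison and are therefore labeled 'DOWN'.
df['Weekly_Movement'] = np.where(df['Price_Change_7d'] > 0, 'UP', 'DOWN')

# Drop the original Price_Change_7d column so it cannot be used as a feature
df = df.drop(columns=['Price_Change_7d'])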

df['Weekly_Movement'].value_counts()

Weekly_Movement
DOWN    3447
UP      3160
Name: count, dtype: int64
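
The rows displayed in Step 1 have no value for `Price_Change_7d`. Because `NaN > 0` evaluates to `False`, `np.where` places such rows in the `DOWN` class. The sketch below shows this on a toy frame with made-up values. It also shows one possible alternative, dropping unlabeled rows first. That alternative is an assumption for illustration and is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up)
toy = pd.DataFrame({"Price_Change_7d": [np.nan, 2.5, -1.0, 0.0]})

# Same rule as the notebook: NaN > 0 evaluates to False, so NaN becomes 'DOWN'
labels_all = np.where(toy["Price_Change_7d"] > 0, "UP", "DOWN").tolist()

# Alternative (an assumption, not the notebook's method): drop unlabeled rows first
labeled = toy.dropna(subset=["Price_Change_7d"])
labels_clean = np.where(labeled["Price_Change_7d"] > 0, "UP", "DOWN").tolist()

print(labels_all)    # ['DOWN', 'UP', 'DOWN', 'DOWN']
print(labels_clean)  # ['UP', 'DOWN', 'DOWN']
```

Whether unlabeled rows should be dropped or kept as `DOWN` depends on how the preprocessing stage defines `Price_Change_7d`, which is not shown here.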

## Step 3: Select Features and Target
We select relevant numerical features for modeling:

- `Market Cap`, `Price`, `Volume (24hr)`, `% 1h`, `% 24h`, `% 7d`, `Market_Dominance`, `Year`, `Month`

In [3]:
# Features and target
features = ['Market Cap', 'Price', 'Volume (24hr)', '% 1h', '% 24h', '% 7d', 'Market_Dominance', 'Year', 'Month']
target = 'Weekly_Movement'

X = df[features]
y = df[target]
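
The rows displayed in Step 1 include several coins (Bitcoin, Litecoin, Peercoin, and others), and the selection above does not filter by coin. If a Bitcoin-only classifier is the intent, one option is to filter on the `Symbol` column before building `X` and `y`. The following is a minimal sketch on toy rows. The second-week prices are illustrative values, not taken from the dataset.

```python
import pandas as pd

# Toy rows mirroring a few columns shown in Step 1 (later values are illustrative)
toy = pd.DataFrame({
    "Date": ["2013-04-28", "2013-04-28", "2013-05-05", "2013-05-05"],
    "Symbol": ["BTC", "LTC", "BTC", "LTC"],
    "Price": [134.21, 4.3484, 115.91, 3.59],
})

# Keep only Bitcoin rows and order them chronologically
btc = toy[toy["Symbol"] == "BTC"].sort_values("Date").reset_index(drop=True)

print(btc["Symbol"].unique().tolist())
print(len(btc))
```

Filtering reduces the number of training rows considerably, so the reported metrics would need to be re-evaluated on the filtered data.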

## Step 4: Split Data
We split the dataset into **training** (80%) and **testing** (20%) sets for evaluation.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])


Training samples: 5285
Testing samples: 1322
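
To illustrate what `stratify=y` does, the sketch below repeats the split on synthetic labels with the same class counts as the `value_counts()` output in Step 2 (3447 `DOWN`, 3160 `UP`). The feature column is a placeholder, not real data. The point is only that the class share in the test set tracks the share in the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook's value_counts()
y = np.array(["DOWN"] * 3447 + ["UP"] * 3160)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare the DOWN share in the full data with the DOWN share in the test split
down_share_all = (y == "DOWN").mean()
down_share_test = (y_test == "DOWN").mean()

print(len(y_train), len(y_test))
print(round(down_share_all, 3), round(down_share_test, 3))
```

A stratified random split ignores time order. For time-indexed data, a chronological split is a common alternative, though the notebook does not use one.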


## Step 5: Feature Scaling
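We apply **StandardScaler** to standardize the numerical features. Scale-sensitive models such as SVM and Logistic Regression benefit from this, while the tree-based models are largely unaffected. The scaler is fit on the training set only and then applied unchanged to the test set.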

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
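
As a rough illustration of what the cell above computes, the sketch below standardizes a single made-up feature column with plain NumPy. It uses the mean and population standard deviation of the training split and reuses them for the test split, which is the fit-on-train and transform-on-test pattern. The numbers are invented for the demo.

```python
import numpy as np

# Made-up numbers standing in for one feature column
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.ravel())
```

The scaled test values need not have zero mean or unit variance, because they are expressed relative to the training distribution.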

## Step 6: Train Classifiers
We train five different classifiers and evaluate their performance:

1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. Support Vector Machine (SVM)
5. Gradient Boosting Classifier

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

# Train and evaluate
results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    results[name] = {
        "Accuracy": acc,
        "Precision_UP": report['UP']['precision'],
        "Recall_UP": report['UP']['recall'],
        "F1_UP": report['UP']['f1-score'],
        "Precision_DOWN": report['DOWN']['precision'],
        "Recall_DOWN": report['DOWN']['recall'],
        "F1_DOWN": report['DOWN']['f1-score'],
        "Confusion Matrix": confusion_matrix(y_test, y_pred)
    }

results

{'Logistic Regression': {'Accuracy': 0.6883509833585476,
  'Precision_UP': 0.7791878172588832,
  'Recall_UP': 0.48575949367088606,
  'F1_UP': 0.5984405458089669,
  'Precision_DOWN': 0.6497844827586207,
  'Recall_DOWN': 0.8739130434782608,
  'F1_DOWN': 0.7453646477132262,
  'Confusion Matrix': array([[603,  87],
         [325, 307]])},
 'Decision Tree': {'Accuracy': 0.9311649016641452,
  'Precision_UP': 0.9355877616747182,
  'Recall_UP': 0.9193037974683544,
  'F1_UP': 0.9273743016759777,
  'Precision_DOWN': 0.927246790299572,
  'Recall_DOWN': 0.9420289855072463,
  'F1_DOWN': 0.9345794392523364,
  'Confusion Matrix': array([[650,  40],
         [ 51, 581]])},
 'Random Forest': {'Accuracy': 0.9576399394856279,
  'Precision_UP': 0.941717791411043,
  'Recall_UP': 0.9715189873417721,
  'F1_UP': 0.956386292834891,
  'Precision_DOWN': 0.9731343283582089,
  'Recall_DOWN': 0.9449275362318841,
  'F1_DOWN': 0.9588235294117647,
  'Confusion Matrix': array([[652,  38],
         [ 18, 614]])},
 'SVM'

In [7]:
import pandas as pd

# Convert results dictionary to DataFrame
results_df = pd.DataFrame(results).T  # Transpose to have models as rows

# Optional: reorder columns
results_df = results_df[
    ['Accuracy', 
     'Precision_UP', 'Recall_UP', 'F1_UP',
     'Precision_DOWN', 'Recall_DOWN', 'F1_DOWN',
     'Confusion Matrix']
]

# Display
results_df


| Model | Accuracy | Precision_UP | Recall_UP | F1_UP | Precision_DOWN | Recall_DOWN | F1_DOWN | Confusion Matrix |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.688351 | 0.779188 | 0.485759 | 0.598441 | 0.649784 | 0.873913 | 0.745365 | [[603, 87], [325, 307]] |
| Decision Tree | 0.931165 | 0.935588 | 0.919304 | 0.927374 | 0.927247 | 0.942029 | 0.934579 | [[650, 40], [51, 581]] |
| Random Forest | 0.95764 | 0.941718 | 0.971519 | 0.956386 | 0.973134 | 0.944928 | 0.958824 | [[652, 38], [18, 614]] |
| SVM | 0.776097 | 0.823077 | 0.677215 | 0.743056 | 0.745636 | 0.866667 | 0.801609 | [[598, 92], [204, 428]] |
| Gradient Boosting | 0.95764 | 0.939024 | 0.974684 | 0.956522 | 0.975976 | 0.942029 | 0.958702 | [[650, 40], [16, 616]] |
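
The scalar metrics in this table can be re-derived from the confusion matrix column. The sketch below does so for Logistic Regression. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted label order (`DOWN`, then `UP`). The variable names are introduced here for illustration.

```python
# Confusion matrix for Logistic Regression, copied from the results above.
# Rows are true labels, columns are predictions, in sorted label order: DOWN, UP.
cm = [[603, 87],
      [325, 307]]

(true_down_pred_down, true_down_pred_up), (true_up_pred_down, true_up_pred_up) = cm
total = sum(sum(row) for row in cm)

accuracy = (true_down_pred_down + true_up_pred_up) / total
precision_up = true_up_pred_up / (true_up_pred_up + true_down_pred_up)
recall_up = true_up_pred_up / (true_up_pred_up + true_up_pred_down)
f1_up = 2 * precision_up * recall_up / (precision_up + recall_up)

print(round(accuracy, 6), round(precision_up, 6), round(recall_up, 6), round(f1_up, 6))
# 0.688351 0.779188 0.485759 0.598441
```

These values match the Logistic Regression row, which confirms the row and column orientation of the matrices.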


## Step 7: Compare Models
We summarize model performance based on accuracy and F1-score. The best performing model will be saved for future predictions.

In [8]:
# Convert results to DataFrame for easier visualization
performance_df = pd.DataFrame(results).T[['Accuracy', 'F1_UP', 'F1_DOWN']]
performance_df.sort_values(by='Accuracy', ascending=False)

| Model | Accuracy | F1_UP | F1_DOWN |
|---|---|---|---|
| Random Forest | 0.95764 | 0.956386 | 0.958824 |
| Gradient Boosting | 0.95764 | 0.956522 | 0.958702 |
| Decision Tree | 0.931165 | 0.927374 | 0.934579 |
| SVM | 0.776097 | 0.743056 | 0.801609 |
| Logistic Regression | 0.688351 | 0.598441 | 0.745365 |


### Observations:

- **High accuracy:** Random Forest and Gradient Boosting both reach ~95.76% test accuracy (1,266 of 1,322 test rows classified correctly).
- **Balanced F1-scores:** For the three tree-based models, the UP and DOWN F1-scores are within about 0.01 of each other, so neither class is favored. Logistic Regression (0.60 vs. 0.75) and SVM (0.74 vs. 0.80) lean toward DOWN.
- **Tree-based models lead:** Decision Tree, Random Forest, and Gradient Boosting outperform SVM and Logistic Regression, which suggests the features relate to the target in a non-linear way.
- **Robustness:** Ensembles combine many trees, which typically lowers variance relative to a single Decision Tree. Here that shows up as a gain from ~93.1% to ~95.8% accuracy.
- **Why Gradient Boosting:** It matches Random Forest on accuracy, has balanced F1-scores, and exposes feature importances for interpretation.
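
The interpretability point can be made concrete through the `feature_importances_` attribute of a fitted `GradientBoostingClassifier`. The sketch below uses synthetic data in which only the first column carries signal. The feature names and data are assumptions for the demo, not the notebook's features or results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic data: only the first column determines the label (demo assumption)
feature_names = ["signal", "noise_a", "noise_b"]
X = rng.normal(size=(400, 3))
y = np.where(X[:, 0] > 0, "UP", "DOWN")

model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Impurity-based importances, normalized to sum to 1; larger values mean the
# feature contributed more to the splits across the boosted trees
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Applied to the notebook's model, the same attribute could be paired with the `features` list from Step 3 to see which inputs drive the predictions.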

## Step 8: Save the Best Model

Based on performance, **Gradient Boosting** is chosen for deployment: it matches Random Forest on accuracy (~95.76%) and exposes feature importances for interpretation.

In [9]:
import joblib

# Reuse the Gradient Boosting model already fitted on X_train_scaled in Step 6
best_model = models["Gradient Boosting"]

# Save model
joblib.dump(best_model, 'btc_weekly_movement_model.pkl')

# Save scaler for future prediction
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']
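
A saved model is only useful together with its scaler, since new data must be transformed the same way as the training data. The sketch below shows a save, load, and predict round trip. It trains a stand-in model on synthetic data and writes to a temporary directory so that it is self-contained. In practice the two `.pkl` files saved above would be loaded directly.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in training data (synthetic); the notebook would use X_train / y_train
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 9))  # 9 columns, like the feature list in Step 3
y_train = np.where(X_train[:, 0] > 0, "UP", "DOWN")

scaler = StandardScaler().fit(X_train)
model = GradientBoostingClassifier(random_state=42).fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "btc_weekly_movement_model.pkl")
    scaler_path = os.path.join(tmp, "scaler.pkl")
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    # Later, possibly in a separate session: load both artifacts
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must have the same columns, in the same order, as at training time
X_new = rng.normal(size=(3, 9))
predictions = loaded_model.predict(loaded_scaler.transform(X_new))
print(predictions)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that saved them.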


## Summary

* We successfully built a **weekly Bitcoin price movement classifier** using historical cryptocurrency data.
* The data was **preprocessed** in an earlier stage, and the features `Price`, `Volume (24hr)`, `Market Cap`, `% 1h`, `% 24h`, `% 7d`, `Market_Dominance`, `Year`, and `Month` were selected as model inputs.
* Multiple machine learning models were trained and evaluated, including **Random Forest, Decision Tree, Gradient Boosting, SVM, and Logistic Regression**.
* **Gradient Boosting** was chosen as the **best model**: it ties with Random Forest for the **highest accuracy (~95.76%)** and has balanced F1-scores for both UP and DOWN weekly movements.
* The trained model and the fitted scaler were saved as **`.pkl` files**, enabling future predictions on new data.
* This classifier gives a directional signal for weekly BTC price trends that can support **investment strategies** or **analytical decision-making**.