## Classification (Predicting Price Direction)

**Target Variable**: A newly created binary or multi-class variable based on price_change_24h_percent.

**Binary Example (Most Common)**: $Y=1$ if price_change_24h_percent $> 0$ (Price Up), and $Y=0$ otherwise (Price Down/No Change).

**Multi-Class Example**: $Y=1$ (Significant Rise, change $> X\%$), $Y=0$ (Stable, change within $\pm X\%$), $Y=-1$ (Significant Drop, change $< -X\%$), where $X$ is a threshold you choose.
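The three-class labelling can be sketched as follows. This is a minimal illustration, not the notebook's own code: the 2% threshold, the function name `label_direction`, and the sample values are assumptions chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical threshold (in percent) separating "significant" moves from noise
THRESHOLD_PCT = 2.0

def label_direction(pct_change: pd.Series, threshold: float = THRESHOLD_PCT) -> pd.Series:
    """Map a percent change to 1 (rise), -1 (drop), or 0 (stable)."""
    conditions = [pct_change > threshold, pct_change < -threshold]
    labels = np.select(conditions, [1, -1], default=0)
    return pd.Series(labels, index=pct_change.index, name="price_direction_3class")

# Illustrative values only
example = pd.Series([-5.35, 5.14, 0.4, -1.9, 2.0])
print(label_direction(example).tolist())  # [-1, 1, 0, 0, 0]
```

A change exactly equal to the threshold falls in the stable class here because both comparisons are strict. Whether the boundaries should be inclusive is a design choice to settle before training.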

_Pros_: It reduces the problem to predicting direction, which is often easier to get right than the exact size of the move and maps directly onto a trading decision (i.e., "Should I buy/sell/hold?").

_Cons_: You lose the magnitude of the price change.

This is generally the recommended starting point for market prediction. It's easier to achieve high accuracy on a binary outcome, and the results are directly interpretable for decision-making.
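Accuracy on a binary target is easiest to interpret next to the score of always predicting the most frequent class. The sketch below computes that reference value. The helper name `majority_baseline_accuracy` and the sample labels are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def majority_baseline_accuracy(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(y.value_counts(normalize=True).max())

# Illustrative labels only: 6 "up" rows out of 10
y_example = pd.Series([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
print(majority_baseline_accuracy(y_example))  # 0.6
```

A classifier that scores close to this number has learned little beyond the class balance.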

Step 1: Define Your Final Target Variable (Y)

First create your target variable.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set(style='whitegrid')

In [2]:
data = pd.read_csv('../../Dataset/crypto_sentiment_prediction_dataset.csv')
data.head()

Unnamed: 0,timestamp,cryptocurrency,current_price_usd,price_change_24h_percent,trading_volume_24h,market_cap_usd,social_sentiment_score,news_sentiment_score,news_impact_score,social_mentions_count,fear_greed_index,volatility_index,rsi_technical_indicator,prediction_confidence
0,2025-06-04 20:36:49,Algorand,0.3427,-5.35,1716266.1,1762124000.0,0.367,0.374,1.87,13,53.2,95.1,37.2,78.1
1,2025-06-04 20:48:25,Cosmos,12.042,5.14,10520739.91,209917800000.0,-0.278,-0.107,1.01,600,43.5,76.7,65.0,66.7
2,2025-06-04 21:28:54,Cosmos,11.7675,-6.12,642191.11,175536700000.0,-0.255,0.211,5.69,279,49.1,60.4,32.3,77.4
3,2025-06-04 21:57:48,Ethereum,2861.2829,-11.54,5356227.76,47864190000000.0,-0.531,-0.081,5.11,3504,37.0,100.0,63.0,81.7
4,2025-06-04 22:06:40,Solana,95.3583,5.79,735971.56,266761100000.0,0.369,0.248,1.82,3236,61.7,67.5,55.4,81.8


In [5]:
# Assuming you've loaded your data and engineered features (like in 02-feature-engineering.ipynb)
# Let's create a binary target: 1 for a price increase, 0 otherwise.

# Define the target based on the percentage change
data['price_direction'] = np.where(data['price_change_24h_percent'] > 0, 1, 0)

# Drop the original price columns that would cause data leakage
# (since the target is directly derived from them)
X = data.drop(['current_price_usd', 'price_change_24h_percent', 'price_direction', 'timestamp', 'cryptocurrency'], axis=1)
Y = data['price_direction']

Step 2: Time-Series Train/Test Split (Crucial!)
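Do NOT use a standard random split, such as scikit-learn's train_test_split with its default shuffling. Your data is a time series, and a random split causes data leakage: the model would train on rows that come later in time than the rows it is tested on, so it effectively uses the future to predict the past.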

Split Type: Use a time-based split.

Process: Sort your data by the timestamp column and split the dataset chronologically. For example, use the first 80% of time-sorted data for training and the last 20% for testing.
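The difference between the two kinds of split can be checked directly. The sketch below uses synthetic daily timestamps rather than the real dataset, and the helper `trains_on_future` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily timestamps for illustration; not the real dataset
toy = pd.DataFrame({"timestamp": pd.date_range("2025-06-04", periods=100, freq="D")})

# Chronological split: first 80% of the time-sorted rows train, last 20% test
toy = toy.sort_values("timestamp").reset_index(drop=True)
split_point = int(len(toy) * 0.8)
train_chrono, test_chrono = toy.iloc[:split_point], toy.iloc[split_point:]

# Random split of the same sizes, for comparison
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(toy))
train_random = toy.iloc[shuffled[:split_point]]
test_random = toy.iloc[shuffled[split_point:]]

def trains_on_future(train: pd.DataFrame, test: pd.DataFrame) -> bool:
    """True if any training row is later than the earliest test row."""
    return bool(train["timestamp"].max() > test["timestamp"].min())

print(trains_on_future(train_chrono, test_chrono))  # False
print(trains_on_future(train_random, test_random))  # True
```

With the chronological split, every training timestamp precedes every test timestamp. The shuffled split mixes the two periods, which is exactly the leakage described above.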

In [4]:
# Parse timestamps and sort chronologically instead of relying on the file's row order
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.sort_values('timestamp').reset_index(drop=True)
split_point = int(len(data) * 0.8)

# Features and target: drop the target, the identifiers, and the price columns
# the target is derived from (the same columns dropped in Step 1)
X = data.drop(['current_price_usd', 'price_change_24h_percent', 'price_direction', 'timestamp', 'cryptocurrency'], axis=1)
Y = data['price_direction']

# Everything before split_point is training data; everything after is test data
X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
Y_train, Y_test = Y.iloc[:split_point], Y.iloc[split_point:]

print(f"Train size: {len(X_train)} samples")
print(f"Test size: {len(X_test)} samples")