# CRYPTO SENTIMENT PREDICTION DATASET - MACHINE LEARNING MODEL TRAINING AND EVALUATION

**NOTEBOOK IS WRITTEN BY :-**
<br></br>
**AMOGH DATH KALASAPURA ARUNKUMAR : amoghdath.kalasapuraarunkumar@mail.bcu.ac.uk**


**GITHUB REPO LINK** :- [https://github.com/Amogh-007-Rin/AI-ML-Model-For-CryptoAnalysis]



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

**Code Explanation :-**

This cell imports all necessary libraries for model building and evaluation:
- **Data Manipulation**: `pandas` and `numpy` for data processing
- **Visualization**: `matplotlib` and `seaborn` for plotting
- **Model Selection & Preprocessing**: `train_test_split`, `GridSearchCV`, `StandardScaler`, `OneHotEncoder`, `ColumnTransformer`, and `Pipeline` from scikit-learn
- **Classification Models**: `LogisticRegression`, `RandomForestClassifier`, `GradientBoostingClassifier`, and `SVC` (support vector classifier)
- **Metrics**: `classification_report`, `accuracy_score`, and `confusion_matrix` for model evaluation

**Result Discussion :-**

No output is produced as these are just library imports. All required ML libraries are loaded into the Python environment for subsequent model development and evaluation.

In [3]:
# 1. Load Dataset
df = pd.read_csv('../dataset/pre-processed-crypto-dataset.csv')

**Code Explanation :-**

This cell loads the cryptocurrency sentiment prediction dataset from a CSV file located in the parent directory's `dataset` folder. The dataset is read into a pandas DataFrame named `df` for further exploration and processing.

**Result Discussion :-**

The dataset is successfully loaded into memory. The DataFrame `df` now contains all rows and columns from the CSV file, ready for feature engineering and preprocessing in the next steps.
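
A quick sanity check right after loading can confirm the row count, column types, and missing values before any feature engineering. The following is a minimal sketch on a toy DataFrame; the helper `summarize_frame` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return basic facts worth checking right after loading a dataset."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        "missing_per_column": frame.isna().sum().to_dict(),
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }


# Toy stand-in for the real CSV (values are invented for illustration only)
toy = pd.DataFrame({
    "cryptocurrency": ["Bitcoin", "Ethereum", "Bitcoin"],
    "social_sentiment_score": [0.2, None, -0.1],
    "price_change_24h_percent": [1.5, -0.7, 0.0],
})

summary = summarize_frame(toy)
print(summary["n_rows"], summary["n_cols"])
print(summary["missing_per_column"])
```

Applied to the notebook's data, the same call would be `summarize_frame(df)`.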

In [4]:
# 1. Feature Engineering: Create Target Variable
# Target = 1 if Price Change > 0 (Bullish), else 0 (Bearish)
df['target'] = (df['price_change_24h_percent'] > 0).astype(int)

# 2. Select Features and Drop Leakage/Redundant Columns
X = df.drop(columns=['timestamp', 'price_change_24h_percent', 
                     'prediction_confidence', 'target'])
y = df['target']

# 3. Define Preprocessing Pipeline
# Scale numerical features and One-Hot Encode categorical features
categorical_features = ['cryptocurrency']
numeric_features = [col for col in X.columns if col not in categorical_features]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

**Code Explanation :-**

This cell performs three key tasks:
1. **Target Variable Creation**: Creates a binary target variable where 1 = bullish (price change > 0%) and 0 = bearish (price change ≤ 0%)
2. **Feature Selection**: Drops leakage or redundant columns (`timestamp`, `price_change_24h_percent`, `prediction_confidence`) and separates features (X) from the target (y)
3. **Preprocessing Pipeline**: Creates a `ColumnTransformer` that:
   - Scales numeric features using `StandardScaler` (standardization)
   - One-Hot Encodes categorical features (`cryptocurrency`) using `OneHotEncoder`

**Result Discussion :-**

The feature engineering produces a clean dataset where:
- Target variable is binary (0 or 1), suitable for classification
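- Direct target leakage is prevented by removing `price_change_24h_percent`, from which the target is derived; other price-related columns such as `current_price_usd` and `market_cap_usd` remain as features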
- Features are separated into X and y for model training
- The preprocessing pipeline is ready to be used in ML models to ensure consistent feature transformation across training and testing
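
To make the effect of the target rule and the `ColumnTransformer` concrete, here is a minimal sketch on a toy DataFrame. The toy rows are invented for illustration; only the column names and the transformer setup mirror the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (invented values); the real notebook uses the full CSV
toy = pd.DataFrame({
    "cryptocurrency": ["Bitcoin", "Ethereum", "Bitcoin", "Cardano"],
    "social_sentiment_score": [0.2, -0.4, 0.1, 0.5],
    "price_change_24h_percent": [1.5, -0.7, 0.0, 2.1],
})

# Same rule as the notebook: strictly positive change -> 1, otherwise 0
toy["target"] = (toy["price_change_24h_percent"] > 0).astype(int)

X_toy = toy.drop(columns=["price_change_24h_percent", "target"])
categorical = ["cryptocurrency"]
numeric = [c for c in X_toy.columns if c not in categorical]

pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X_out = pre.fit_transform(X_toy)
# 1 scaled numeric column + 3 one-hot columns (one per distinct coin)
print(X_out.shape)
print(toy["target"].tolist())
```

Note that a price change of exactly 0.0 is labelled bearish (0), because the comparison is strict.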

In [24]:
# Safely compute Pearson correlations using numeric columns only
num_df = df.select_dtypes(include=[np.number])
correlations = num_df.corr()['price_change_24h_percent'].sort_values(ascending=False)

# Debug: show available correlation keys
print("Correlation keys:", list(correlations.index))

# Safely print specific sentiment correlations
for key in ['social_sentiment_score', 'fear_greed_index']:
    val = correlations.get(key)
    if val is None or pd.isna(val):
        print(f"{key}: missing or non-numeric (NaN)")
    else:
        # format float safely
        print(f"{key} Correlation: {val:.4f}")

Correlation keys: ['price_change_24h_percent', 'target', 'market_cap_usd', 'current_price_usd', 'rsi_technical_indicator', 'volatility_index', 'news_sentiment_score', 'social_sentiment_score', 'prediction_confidence', 'trading_volume_24h', 'news_impact_score', 'fear_greed_index', 'social_mentions_count']
social_sentiment_score Correlation: 0.0106
fear_greed_index Correlation: -0.0152


**Code Explanation :-**

This cell analyzes how the numeric features correlate with `price_change_24h_percent`, the continuous variable from which the binary `target` is derived:
1. Selects only numeric columns from the dataset
2. Computes Pearson correlation coefficients with `price_change_24h_percent`
3. Sorts correlations in descending order
4. Safely prints specific sentiment correlations (`social_sentiment_score`, `fear_greed_index`) while handling missing or non-numeric values

**Result Discussion :-**

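The correlation analysis measures how strongly each numeric feature moves linearly with the 24-hour price change. The output shows:
- The names of all numeric columns, listed in descending order of correlation with `price_change_24h_percent` (the coefficient values themselves are not printed, apart from the two below)
- `social_sentiment_score` has a correlation of 0.0106 and `fear_greed_index` of -0.0152, both effectively zero
- `target` appears in the list because it was added to `df` in an earlier cell; its high rank is expected, since it is derived from `price_change_24h_percent`
- Missing or NaN correlations are handled gracefully by the check before printing
- These near-zero values give no linear support for the sentiment-price relationship hypothesis in this dataset, although Pearson correlation would not capture non-linear effects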
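
Sorting in descending order places strong negative correlations at the bottom of the list, so ranking by absolute value can be easier to read. The sketch below illustrates this on synthetic data; the column names `signal`, `noise`, and `price_change` are hypothetical stand-ins, not columns from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Construct a column that depends on 'signal' but not on 'noise'
toy["price_change"] = 0.8 * toy["signal"] + rng.normal(scale=0.5, size=n)

corr = (
    toy.select_dtypes(include=[np.number])
       .corr()["price_change"]
       .drop("price_change")  # self-correlation is always 1.0
)

# Rank by magnitude so strong negative correlations are not buried
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```

On the notebook's data, the same pattern would apply to `num_df.corr()['price_change_24h_percent']`, ideally after also dropping `target`.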

In [25]:
# 5. Train-Test Split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

**Code Explanation :-**

This cell splits the dataset into training and testing subsets:
- **80/20 Split**: 80% of data for training, 20% for testing
- **Stratification**: `stratify=y` preserves the class proportions of `y` in both the train and test sets (important for imbalanced datasets)
- **Random State**: Fixed seed (42) ensures reproducibility across multiple runs
- Produces four outputs: `X_train`, `X_test`, `y_train`, `y_test`

**Result Discussion :-**

The data is now partitioned such that:
- Training set is used to fit model parameters
- Test set is held out to evaluate model generalization performance
- Stratification prevents skewed class distributions that could bias results
- Reproducibility is ensured, allowing consistent experiments across runs
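
The effect of `stratify` can be checked directly by comparing class proportions in the two subsets. This is a minimal sketch on deliberately imbalanced toy labels (invented for illustration), using the same `test_size` and `random_state` as the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 70% class 0, 30% class 1
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each subset; both should match the overall 30%
print(y_tr.mean(), y_te.mean())
```

For the notebook's split, the equivalent check is comparing `y_train.mean()` with `y_test.mean()`.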

In [26]:
# Define the four baseline models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': GradientBoostingClassifier(random_state=42),  # label only: scikit-learn gradient boosting, not the xgboost library
    'SVM': SVC(random_state=42)
}

# Training Loop using the Pipeline
for name, model in models.items():
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', model)])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.4f}")

Logistic Regression Accuracy: 0.4964
Random Forest Accuracy: 0.4964
XGBoost Accuracy: 0.4673
SVM Accuracy: 0.4576


**Code Explanation :-**

This cell creates and trains four baseline classification models:
1. **Logistic Regression**: Linear probabilistic classifier with max iterations set to 1000
2. **Random Forest**: Ensemble of decision trees
3. **XGBoost** (label only): scikit-learn's `GradientBoostingClassifier`, a sequential tree-based ensemble; the `xgboost` library itself is not used
4. **SVM**: Support Vector Machine classifier

Each model is wrapped in a `Pipeline` that automatically applies the preprocessing transformations before classification. Models are trained on `X_train` and evaluated on `X_test`.

**Result Discussion :-**

The output shows accuracy scores for each baseline model on the test set:
- Logistic Regression and Random Forest both reach 0.4964, gradient boosting reaches 0.4673, and SVM reaches 0.4576
- All four scores fall below 0.50, so no baseline beats a coin flip on this binary task
- No model family (linear, tree-based, or kernel-based) shows a clear advantage
- The scores serve as the reference point for measuring any improvement after hyperparameter tuning
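
Accuracy is easier to interpret next to a trivial reference that ignores the features entirely. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data whose labels are independent of the features; this is an assumed addition for illustration and was not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(400, 3))
y_toy = rng.integers(0, 2, size=400)  # labels carry no information from X_toy

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_tr, y_tr)
dummy_pred = dummy.predict(X_te)
dummy_acc = accuracy_score(y_te, dummy_pred)
print(f"Majority-class accuracy: {dummy_acc:.4f}")
```

In the notebook, wrapping `DummyClassifier` in the same `Pipeline` and fitting on `X_train`, `y_train` would give the score that any real model has to exceed.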

In [27]:
# 6. Training Loop & Initial Evaluation
print("--- Baseline Model Results ---")
for name, model in models.items():
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', model)])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"\nModel: {name}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

--- Baseline Model Results ---

Model: Logistic Regression
Accuracy: 0.4964

Model: Random Forest
Accuracy: 0.4964

Model: XGBoost
Accuracy: 0.4673

Model: SVM
Accuracy: 0.4576


**Code Explanation :-**

This cell reruns the baseline model training with more detailed output formatting:
- Same four models (Logistic Regression, Random Forest, XGBoost, SVM)
- Same pipeline and training approach as the previous cell
- Prints results with clearer formatting: model name followed by accuracy score

**Result Discussion :-**

The output provides a clean, formatted display of baseline accuracies:
- Lists the accuracy of all four models in one block, each under its model name
- The accuracies are identical to those of the previous cell, which is expected because the models and the train-test split all use `random_state=42`
- Establishes the baseline before hyperparameter tuning
- Since all four accuracies are below 0.50, the results alone do not single out a model for further optimization
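
The notebook imports `classification_report` and `confusion_matrix` but reports accuracy only. As a sketch of how they could complement the accuracy figures, the example below fits a logistic regression on synthetic data from `make_classification`; the data and the class names passed to `target_names` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the crypto dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall, and F1 (0 = Bearish, 1 = Bullish)
print(classification_report(y_te, y_hat, target_names=["Bearish", "Bullish"]))
```

Inside the notebook's training loop, the same two calls on `y_test` and `y_pred` would show whether a model near 50% accuracy is simply predicting one class most of the time.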

In [28]:
# Define hyperparameter grids for all models
# Note: 'classifier__' prefix is required to access model params inside the Pipeline
param_grids = {
    'Logistic Regression': {
        'classifier__C': [0.01, 0.1, 1, 10, 100],
        'classifier__solver': ['liblinear']  # Good for small datasets
    },
    'Random Forest': {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [None, 10, 20],
        'classifier__min_samples_split': [2, 5]
    },
    'XGBoost': {
        'classifier__n_estimators': [50, 100],
        'classifier__learning_rate': [0.01, 0.1, 0.2],
        'classifier__max_depth': [3, 5, 7]
    },
    'SVM': {
        'classifier__C': [0.1, 1, 10],
        'classifier__kernel': ['rbf', 'linear'],
        'classifier__gamma': ['scale', 'auto']
    }
}

print("--- Comprehensive Hyperparameter Tuning ---")
best_models = {}

# Loop through each model and perform GridSearchCV
for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])
    
    # Skip tuning if model not in our grid definition (safety check)
    if name in param_grids:
        print(f"\nTuning {name}...")
        grid = GridSearchCV(pipeline, param_grids[name], cv=3, scoring='accuracy', n_jobs=-1)
        grid.fit(X_train, y_train)
        
        best_score = grid.best_score_
        best_params = grid.best_params_
        test_acc = accuracy_score(y_test, grid.best_estimator_.predict(X_test))
        
        best_models[name] = {'test_accuracy': test_acc, 'best_params': best_params}
        print(f"Best CV Score: {best_score:.4f}")
        print(f"Test Set Accuracy: {test_acc:.4f}")
        print(f"Best Params: {best_params}")

--- Comprehensive Hyperparameter Tuning ---

Tuning Logistic Regression...
Best CV Score: 0.5067
Test Set Accuracy: 0.4939
Best Params: {'classifier__C': 0.01, 'classifier__solver': 'liblinear'}

Tuning Random Forest...
Best CV Score: 0.5055
Test Set Accuracy: 0.5400
Best Params: {'classifier__max_depth': 20, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 50}

Tuning XGBoost...
Best CV Score: 0.5133
Test Set Accuracy: 0.4673
Best Params: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 3, 'classifier__n_estimators': 100}

Tuning SVM...
Best CV Score: 0.5152
Test Set Accuracy: 0.4625
Best Params: {'classifier__C': 10, 'classifier__gamma': 'auto', 'classifier__kernel': 'rbf'}


**Code Explanation :-**

This cell performs comprehensive hyperparameter tuning for all four models:
1. **Defines Parameter Grids**: Specifies search spaces for each model's hyperparameters:
   - Logistic Regression: Regularization strength (C), solver type
   - Random Forest: Number of trees, max depth, min samples per split
   - XGBoost: Number of trees, learning rate, max depth
   - SVM: Regularization (C), kernel type, gamma parameter
2. **GridSearchCV**: Exhaustively searches all parameter combinations using 3-fold cross-validation
3. **Stores Results**: Saves best parameters and test accuracies for each model

**Result Discussion :-**

The output displays for each model:
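- **Best CV Score**: Mean accuracy across the 3 cross-validation folds of the training data, for the best parameter combination
- **Test Set Accuracy**: Accuracy of the refitted best estimator on the held-out test data
- **Best Params**: Hyperparameter values with the highest mean CV accuracy in the grid
- Best CV scores range from 0.5055 to 0.5152, only marginally above chance
- Tuning did not meaningfully improve on the baselines: Logistic Regression dropped slightly (0.4964 to 0.4939), gradient boosting was unchanged (0.4673), SVM rose slightly (0.4576 to 0.4625), and only Random Forest rose above 0.50 on the test set (0.4964 to 0.5400)
- Random Forest's test accuracy (0.5400) is well above its CV score (0.5055), which suggests the gain may reflect test-set variance rather than a genuinely better model, so it should be interpreted with caution before selecting a model for deployment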
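
Beyond `best_score_`, the fitted `GridSearchCV` object exposes `cv_results_`, which includes the fold-to-fold standard deviation for every parameter combination. The sketch below shows one way to inspect it on synthetic data with a reduced, hypothetical grid; differences between candidates that are smaller than `std_test_score` are hard to distinguish from noise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the crypto dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear", random_state=42)),
])

# 'classifier__' routes the parameter to the pipeline step named 'classifier'
grid = GridSearchCV(pipe, {"classifier__C": [0.01, 1, 100]}, cv=3, scoring="accuracy")
grid.fit(X_toy, y_toy)

columns = ["param_classifier__C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```

The same `pd.DataFrame(grid.cv_results_)` call works inside the notebook's tuning loop for each model.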