# Cryptocurrency Liquidity Prediction for Market Stability

**Life Cycle of a Machine Learning Project**

`Part-2`: this file covers the following steps:

* Data Splitting (Train_test_split)
* Model Selection
* Model Training
* Hyperparameter Tuning
* Model Testing & Validation
* Model Deployment

In [1]:
import pandas as pd                       # data manipulation
import numpy as np                        # numerical computing
import matplotlib.pyplot as plt           # data visualization
import datetime as dt                     # date and time handling
import seaborn as sns                     # statistical data visualization

import warnings                           # some libraries emit warning messages
warnings.filterwarnings("ignore")         # suppress them to keep the notebook output readable



from sklearn.metrics import classification_report, accuracy_score
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit

## 6) Data Splitting (Train_test_split)

* This project predicts cryptocurrency liquidity from time-series data, so the dataset must be split in a way that respects time order.
* In a time series, each row depends on earlier rows, so we cannot randomly shuffle and split the data as we would with an ordinary tabular dataset.
* We train on the past and test on the future to simulate real-world prediction.
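
As a minimal sketch of the idea, the snippet below splits a hypothetical toy array by position rather than at random. The array `rows` and the names `split_at`, `train_rows`, and `test_rows` are illustrative assumptions, not objects from this notebook.

```python
import numpy as np

# Hypothetical toy series: 10 time-ordered rows with 2 features each.
rows = np.arange(20).reshape(10, 2)

test_fraction = 0.2                                      # same ratio used later in this notebook
split_at = len(rows) - int(len(rows) * test_fraction)    # index of the first test row

train_rows = rows[:split_at]   # the earliest part of the timeline
test_rows = rows[split_at:]    # the most recent part of the timeline

print(train_rows.shape, test_rows.shape)
```

Every training row precedes every test row, which is what `shuffle=False` achieves in `train_test_split`.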

In [2]:
final_df = pd.read_csv('final_df.csv') # read the final csv file

### Splitting the independent variable(x) and target variable (y)

In [3]:
# 1. Define feature columns and target
features = ['1h', '24h', '7d', 'price_lag1', 'volume_lag1', 'mktcap_lag1',
            'price_2d_avg', 'volume_2d_avg', 'vol_to_mcap', 'vol_price_ratio']

x = final_df[features]
y = final_df['liquidity_level']

### Encode the target labels

### Label Encoding and One-Hot Conversion

Before feeding the labels into our machine learning and deep learning models, we need to convert them from categorical strings into a numerical format.

- `LabelEncoder` converts string labels such as 'low', 'medium', 'high' into integers. Classes are numbered in sorted (alphabetical) order, so the mapping is high=0, low=1, medium=2.
- `to_categorical` then converts these integers into one-hot encoded vectors, which the `categorical_crossentropy` loss used by the LSTM model expects.

This ensures compatibility with classification layers like `Dense(3, activation='softmax')` and the `categorical_crossentropy` loss function.
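
The encoding steps can be sketched on a few hypothetical labels. To keep the example free of TensorFlow, the one-hot step uses a NumPy identity matrix as a stand-in for `to_categorical`; the names `labels`, `encoder`, `codes`, and `one_hot` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, not rows from final_df.
labels = ["low", "high", "medium", "low"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are stored in sorted order, which determines the integer codes.
print(list(encoder.classes_))   # ['high', 'low', 'medium']
print(codes.tolist())           # [1, 0, 2, 1]

# NumPy-only stand-in for to_categorical: select rows of an identity matrix.
one_hot = np.eye(len(encoder.classes_))[codes]
print(one_hot.shape)            # (4, 3)
```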


In [4]:
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # sorted order: high=0, low=1, medium=2
y_categorical = to_categorical(y_encoded)  # for LSTM

- The `create_sequences()` function defined below prepares the data for sequence models like LSTM, which require input in the form of sequences over time.
- In time-series modeling (e.g., predicting cryptocurrency liquidity), we often use the past n steps (e.g., the past 2 days' features) to predict the next value (e.g., tomorrow's liquidity).

### Scale the features

In [5]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
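
One caveat about the cell above: fitting the scaler on all rows lets statistics from the test period influence the scaled training features. A minimal sketch of the alternative, on hypothetical toy data (`toy_x`, `split_at`, and `toy_scaler` are assumed names), fits on the training portion only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical time-ordered feature matrix (10 rows, 2 features).
toy_x = np.arange(20, dtype=float).reshape(10, 2)
split_at = 8                                   # first 8 rows train, last 2 rows test

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_x[:split_at])  # statistics from training rows only
test_scaled = toy_scaler.transform(toy_x[split_at:])       # reuse them; no refit on test rows

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
```

Whether this changes the results in this notebook has not been measured here; it is shown only as the leakage-free pattern.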

### Create sequence data for LSTM

LSTM models are designed to learn from sequential or time-dependent data. To use LSTM, we need to structure our features into **sequences**.

#### 🔧 `create_sequences()` Function
This function takes:
- `data`: Scaled input features (`x_scaled`)
- `target`: Encoded output labels (`y_encoded`)
- `time_step`: The number of previous time steps to use in each input sequence

It returns:
- `x_seq`: A 3D array shaped as `(samples, time_steps, features)`
- `y_seq`: Corresponding target values for each sequence

#### Example:
If `time_step = 1`:
- Each sample contains 1 row of features → shape: `(n_samples, 1, n_features)`
- Each target is the label of the row immediately **after** the sequence, so the model predicts the next row's liquidity level.

#### ✅ Why this is important:
- LSTM models require 3D input: `[samples, time_steps, features]`
- With `time_step=1`, each sequence holds a single row, so the reshaping mainly supplies the required 3D layout; the LSTM can only learn **patterns across time steps** once `time_step` is greater than 1.

Increase `time_step` to let the model look further back in time when predicting the next value.

In [6]:
# 3. Create sequences
def create_sequences(data, target, time_step=1):
    # data: 2D NumPy array of scaled features, shape (rows, features)
    # target: 1D array of encoded labels, aligned row by row with data
    # time_step: number of consecutive rows in each input sequence (default = 1)
    xs, ys = [], []                                     # xs: input sequences; ys: the label that follows each sequence
    for i in range(len(data) - time_step):              # i runs from 0 to len(data) - time_step - 1
        xs.append(data[i:(i+time_step)])                # rows i .. i+time_step-1 form one sequence
        ys.append(target[i + time_step])                # label of the row right after the sequence
    return np.array(xs), np.array(ys)                   # convert the lists xs and ys to NumPy arrays


# Call the function to create sequences
x_seq, y_seq = create_sequences(x_scaled, y_encoded, time_step=1) # x_scaled: scaled features (2D)
                                                                 # y_encoded: integer-encoded labels
                                                                 # time_step=1: one row per sequence


print("X_seq shape:", x_seq.shape)  # (samples, time_steps, features)
print("y_seq shape:", y_seq.shape)  # (samples,)

X_seq shape: (999, 1, 10)
y_seq shape: (999,)
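
To see how a longer window behaves, here is a small worked example with `time_step=2` on hypothetical toy data. The function body repeats the windowing logic from the cell above so the block runs on its own; `toy_data`, `toy_target`, `toy_x`, and `toy_y` are illustrative names.

```python
import numpy as np

def create_sequences(data, target, time_step=1):
    # Same windowing logic as the notebook cell, repeated so this block is self-contained.
    xs, ys = [], []
    for i in range(len(data) - time_step):
        xs.append(data[i:i + time_step])
        ys.append(target[i + time_step])
    return np.array(xs), np.array(ys)

# Hypothetical toy data: 5 rows, 2 features; label k belongs to row k.
toy_data = np.arange(10).reshape(5, 2)
toy_target = np.arange(5)

toy_x, toy_y = create_sequences(toy_data, toy_target, time_step=2)

print(toy_x.shape)        # (3, 2, 2)
print(toy_y.tolist())     # [2, 3, 4]
print(toy_x[0].tolist())  # [[0, 1], [2, 3]]
```

The first sequence holds rows 0 and 1 and is paired with the label of row 2, so `time_step` rows are lost from the start of the series.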


### Train_test_split step

In [7]:
x_train, x_test, y_train, y_test = train_test_split(
    x_seq, y_seq, test_size=0.2, random_state=42, shuffle=False  # Important: shuffle=False for time series
)

Checking the shape of the train and test data

In [8]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((799, 1, 10), (799,), (200, 1, 10), (200,))

### Flattened features for Random Forest
Random Forest expects 2D input, so each sequence in `x_train` and `x_test` is flattened into a 1D vector of length `time_steps * features`; the number of samples is unchanged.

In [9]:
x_train_flat = x_train.reshape(x_train.shape[0], -1)
x_test_flat = x_test.reshape(x_test.shape[0], -1)
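
With `time_step=1` the flattening only drops the middle axis. A brief sketch on a hypothetical batch with three time steps (`toy_seq` and `toy_flat` are assumed names) shows what the same reshape does for longer windows:

```python
import numpy as np

# Hypothetical batch: 4 samples, 3 time steps, 2 features.
toy_seq = np.arange(24).reshape(4, 3, 2)

toy_flat = toy_seq.reshape(toy_seq.shape[0], -1)  # -1 infers 3 * 2 = 6 columns

print(toy_flat.shape)        # (4, 6)
print(toy_flat[0].tolist())  # [0, 1, 2, 3, 4, 5]
```

Each row lists the features of step 1, then step 2, then step 3, so the tree model sees every time step as separate columns.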

## 7) Model Selection

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

## 8) Model Training

### Random Forest Classifier `model-1` - Base Model

We use a `RandomForestClassifier` as one of the base models in our ensemble (stacking) architecture.

```python
rf_clf = RandomForestClassifier()
rf_clf.fit(x_train_flat, y_train)
```

In [11]:
random_forest_clf = RandomForestClassifier()
random_forest_clf.fit(x_train_flat, y_train)
rf_preds = random_forest_clf.predict(x_test_flat)
rf_preds_proba = random_forest_clf.predict_proba(x_test_flat)

### LSTM Classifier for Multiclass Classification `model-2`

We define an LSTM-based neural network to classify each data point into one of three **liquidity levels**: `low`, `medium`, or `high`.

#### Convert y_train to categorical for LSTM training

We now train our LSTM model to predict the liquidity level class (`low`, `medium`, `high`) based on input features.

#### One-Hot Encoding of Target Labels
```python
y_train_categorical = to_categorical(y_train)
```

In [12]:
lstm_model = Sequential()
lstm_model.add(LSTM(64, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2])))
lstm_model.add(LSTM(32))
lstm_model.add(Dense(3, activation='softmax'))  # 3 classes

lstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


y_train_categorical = to_categorical(y_train)

lstm_model.fit(x_train, y_train_categorical, epochs=30, batch_size=32, validation_split=0.1, verbose=2)
lstm_preds_proba = lstm_model.predict(x_test)
lstm_preds = np.argmax(lstm_preds_proba, axis=1)  # predicted class labels

Epoch 1/30
23/23 - 5s - 196ms/step - accuracy: 0.5118 - loss: 1.0890 - val_accuracy: 0.5500 - val_loss: 1.0824
Epoch 2/30
23/23 - 0s - 9ms/step - accuracy: 0.5327 - loss: 1.0691 - val_accuracy: 0.5375 - val_loss: 1.0642
Epoch 3/30
23/23 - 0s - 8ms/step - accuracy: 0.5341 - loss: 1.0399 - val_accuracy: 0.5375 - val_loss: 1.0397
Epoch 4/30
23/23 - 0s - 9ms/step - accuracy: 0.5285 - loss: 1.0048 - val_accuracy: 0.5250 - val_loss: 1.0197
Epoch 5/30
23/23 - 0s - 8ms/step - accuracy: 0.5271 - loss: 0.9784 - val_accuracy: 0.5250 - val_loss: 1.0103
Epoch 6/30
23/23 - 0s - 9ms/step - accuracy: 0.5438 - loss: 0.9615 - val_accuracy: 0.5375 - val_loss: 1.0087
Epoch 7/30
23/23 - 0s - 9ms/step - accuracy: 0.5466 - loss: 0.9540 - val_accuracy: 0.5500 - val_loss: 0.9996
Epoch 8/30
23/23 - 0s - 9ms/step - accuracy: 0.5633 - loss: 0.9437 - val_accuracy: 0.5625 - val_loss: 0.9971
Epoch 9/30
23/23 - 0s - 9ms/step - accuracy: 0.5647 - loss: 0.9365 - val_accuracy: 0.5500 - val_loss: 0.9907
Epoch 10/30
23/23

### 🔄 Building the Meta-Model for Stacking Ensemble (Logistic Regression on stacked probabilities)

We now stack the predictions of the Random Forest and LSTM models to train a **meta-classifier** (Logistic Regression). This final model aims to combine the strengths of both base models.

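#### Combine Base Model Outputs

The class-probability outputs of the two base models are concatenated column-wise with `np.hstack((rf_preds_proba, lstm_preds_proba))`, so each sample's meta-features are the class probabilities from both models side by side.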

In [13]:
meta_features_test = np.hstack((rf_preds_proba, lstm_preds_proba))
meta_model = LogisticRegression(max_iter=200)
meta_model.fit(meta_features_test, y_test)  # Caveat: fitted on the test set itself, so the meta-model accuracy reported later is optimistic; ideally train on a separate validation set
meta_preds = meta_model.predict(meta_features_test)
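
The caveat in the cell above can be addressed by holding out a validation slice for the meta-model. The following is a minimal sketch under stated assumptions: the data is synthetic, a second scikit-learn classifier stands in for the LSTM, and all names (`features`, `base_a`, `base_b`, `stack_probabilities`, `meta`) are hypothetical rather than taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic 3-class data standing in for the time-ordered features.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
labels = np.digitize(features[:, 0] + 0.5 * features[:, 1], [-0.5, 0.5])  # classes 0, 1, 2

# Chronological three-way split: train -> validation -> test.
x_tr, y_tr = features[:180], labels[:180]
x_val, y_val = features[180:240], labels[180:240]
x_te, y_te = features[240:], labels[240:]

# Two base models (the second one is a stand-in for the LSTM).
base_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_tr, y_tr)
base_b = LogisticRegression(max_iter=200).fit(x_tr, y_tr)

def stack_probabilities(x):
    # Concatenate class probabilities from both base models column-wise.
    return np.hstack((base_a.predict_proba(x), base_b.predict_proba(x)))

# The meta-model learns from validation rows and never sees test labels.
meta = LogisticRegression(max_iter=200).fit(stack_probabilities(x_val), y_val)
meta_test_preds = meta.predict(stack_probabilities(x_te))

print("held-out meta accuracy:", (meta_test_preds == y_te).mean())
```

Because the test labels are used only for scoring, the printed accuracy is an honest estimate for this synthetic setup; it says nothing about the accuracy achievable on the notebook's data.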

## 9) Hyperparameter Tuning

In [14]:
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
tscv = TimeSeriesSplit(n_splits=2)

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid=rf_param_grid, cv=tscv,
                       scoring='accuracy', n_jobs=-1, verbose=1)
grid_rf.fit(x_train_flat, y_train)

best_rf = grid_rf.best_estimator_
rf_preds_proba = best_rf.predict_proba(x_test_flat)  # note: rf_preds and meta_model above still come from the untuned forest

print("Best RF Parameters:", grid_rf.best_params_)
print("Best RF CV Score:", grid_rf.best_score_)

Fitting 2 folds for each of 24 candidates, totalling 48 fits
Best RF Parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Best RF CV Score: 0.612781954887218
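
To make the cross-validation scheme concrete, this sketch prints the folds that `TimeSeriesSplit(n_splits=2)` produces for 799 rows, the number of training sequences reported earlier. The placeholder array and the names `splitter` and `fold_sizes` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 799                          # number of training sequences reported earlier
placeholder = np.zeros((n_rows, 1))   # only the row count matters for the split

splitter = TimeSeriesSplit(n_splits=2)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(splitter.split(placeholder), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train rows 0..{train_idx[-1]}, "
          f"validation rows {val_idx[0]}..{val_idx[-1]}")
```

In each fold the validation rows come strictly after the training rows, and the training window grows from one fold to the next.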


## 10) Model Testing & Validation (Evaluation)

### Final model prediction

In [15]:
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
print("LSTM Accuracy:", accuracy_score(y_test, lstm_preds))
print("Meta Model Accuracy:", accuracy_score(y_test, meta_preds))

print("\nClassification Report (Random Forest):\n", classification_report(y_test, rf_preds, target_names=le.classes_))
print("\nClassification Report (LSTM):\n", classification_report(y_test, lstm_preds, target_names=le.classes_))
print("\nClassification Report (Meta Model):\n", classification_report(y_test, meta_preds, target_names=le.classes_))

Random Forest Accuracy: 0.565
LSTM Accuracy: 0.545
Meta Model Accuracy: 0.575

Classification Report (Random Forest):
               precision    recall  f1-score   support

        high       0.58      0.60      0.59        68
         low       0.54      0.57      0.56        56
      medium       0.57      0.53      0.55        76

    accuracy                           0.56       200
   macro avg       0.56      0.57      0.56       200
weighted avg       0.57      0.56      0.56       200


Classification Report (LSTM):
               precision    recall  f1-score   support

        high       0.62      0.59      0.61        68
         low       0.49      0.59      0.54        56
      medium       0.52      0.47      0.50        76

    accuracy                           0.55       200
   macro avg       0.55      0.55      0.55       200
weighted avg       0.55      0.55      0.54       200


Classification Report (Meta Model):
               precision    recall  f1-score   sup

Save the meta-model, together with the tuned Random Forest, the scaler, the label encoder, and the LSTM, for the deployment step.

In [16]:
import joblib
import os

# Define the path one directory back and then into 'models'
base_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
models_dir = os.path.join(base_dir, 'models')

# Create the directory if it doesn't exist
os.makedirs(models_dir, exist_ok=True)

# Save Scikit-learn models
joblib.dump(grid_rf.best_estimator_, os.path.join(models_dir, "random_forest_model.pkl"))
joblib.dump(meta_model, os.path.join(models_dir, "meta_model.pkl"))
joblib.dump(scaler, os.path.join(models_dir, "scaler.pkl"))
joblib.dump(le, os.path.join(models_dir, "label_encoder.pkl"))

# Save LSTM model
lstm_model.save(os.path.join(models_dir, "lstm_model.h5"))
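
At deployment time the saved objects have to be loaded back. As a hedged sketch, the round trip below uses a temporary directory and a hypothetical stand-in encoder (`toy_encoder`) rather than the files written above, and shows how integer predictions map back to label strings.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the fitted label encoder saved above.
toy_encoder = LabelEncoder().fit(["low", "medium", "high"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_encoder.pkl")
    joblib.dump(toy_encoder, path)   # same call the notebook uses
    restored = joblib.load(path)     # what a deployment script would run

# Map integer predictions back to the original string labels.
decoded = restored.inverse_transform([0, 1, 2])
print(decoded.tolist())   # ['high', 'low', 'medium']
```

The scaler and both scikit-learn models can be restored the same way with `joblib.load`; the Keras model would instead be read with `tensorflow.keras.models.load_model`.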

