# Cryptocurrency Liquidity Prediction for Market Stability

1. Exploratory Data Analysis (EDA) Report

- The EDA Report summarizes the dataset statistics and provides basic visualizations of trends, correlations, and distributions.
  1. Introduction: State the goal of the EDA, which is to understand the structure, quality, and initial patterns in the cryptocurrency dataset (e.g., coin_gecko_2022-03-17.csv and coin_gecko_2022-03-16.csv).
  2. Dataset Overview: Show the first few rows of the data, list the features (coin, price, 24h_volume, mkt_cap, etc.), and provide data types and unique value counts.
  3. Data Quality Assessment
    - Missing Values: Identify and quantify any missing values in columns like price or 24h_volume.
    - Data Consistency: Note any inconsistencies, such as duplicate rows, mixed data types, or negative values in price, volume, or market cap.
    - Outlier Detection: Identify extreme outliers in volume or price data.
  4. Univariate Analysis
    - Visualize the distribution of key numerical variables (e.g., price, 24h_volume, mkt_cap) using histograms.
    - Analyze the distribution of the target variable (liquidity proxy, possibly a function of 24h_volume/mkt_cap).
  5. Multivariate Analysis
    - Generate a correlation matrix to find relationships between variables.
    - Plot time series trends for the top N coins (e.g., Bitcoin, Ethereum) showing price vs. date.
    - Visualize the correlation between a coin's price change (e.g., 24h change) and its 24h_volume.
  6. Conclusion: Summarize the key findings and how they will inform the subsequent Feature Engineering and Model Selection steps.
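
The data quality checks and the liquidity proxy outlined above could be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not the project's actual data. The column names follow the feature list above; the helper names `quality_summary` and `iqr_outliers`, the 1.5×IQR rule, and the `24h_volume / mkt_cap` proxy are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; real data would come from the coin_gecko CSV files.
sample = pd.DataFrame({
    "coin": ["A", "B", "C", "D", "E", "E"],
    "price": [1.0, 2.0, np.nan, 4.0, 500.0, 500.0],
    "24h_volume": [10.0, 20.0, 30.0, np.nan, 40.0, 40.0],
    "mkt_cap": [100.0, 200.0, 300.0, 400.0, 0.0, 0.0],
})

def quality_summary(df: pd.DataFrame) -> dict:
    """Count missing values per column and fully duplicated rows."""
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

summary = quality_summary(sample)
deduped = sample.drop_duplicates().reset_index(drop=True)
price_outliers = iqr_outliers(deduped["price"])

# Liquidity proxy (assumed definition): 24h volume relative to market cap.
# A zero market cap would give inf, so it is treated as missing instead.
deduped["liquidity_ratio"] = deduped["24h_volume"] / deduped["mkt_cap"].replace(0, np.nan)

print(summary)
print(deduped[["coin", "liquidity_ratio"]])
print("price outliers:", deduped.loc[price_outliers, "coin"].tolist())
```

Treating a zero market cap as missing keeps infinite ratios out of later histograms and correlation matrices; whether to drop or impute those rows is a separate decision for the cleaning step.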

2. HLD & LLD Document
 - This document should include a High-Level Design that provides an overview of the system and architecture, and a Low-Level Design that details how each component is implemented.

 - High-Level Design: The HLD focuses on the overall system structure, major components, and how they interact.
 - System Components: Define the major blocks: Data Ingestion, Data Storage, Data Processing/ML Training (Jupyter/Python scripts), ML Model, and Deployment/Prediction API.
 - Architecture Diagram: Sketch the flow of data from the source to the final prediction.
 - Technology Stack: List the main technologies: Python, Pandas, Scikit-learn/TensorFlow/PyTorch, and Flask/Streamlit.
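
As a rough illustration of how the Deployment/Prediction API block might sit on top of the ML Model block, the sketch below validates a JSON-like request and delegates to a model object. It uses only the standard library. `FEATURE_ORDER`, `Model`, `VolumeToCapModel`, and `predict_liquidity` are hypothetical names, and the stub model stands in for a trained estimator; a web framework such as Flask would wrap a function like this in a real deployment.

```python
from typing import Mapping, Protocol, Sequence

# Hypothetical feature order expected by the model; the real list would
# come from the training step.
FEATURE_ORDER = ("price", "24h_volume", "mkt_cap")

class Model(Protocol):
    """Minimal interface the prediction layer needs from the ML Model block."""
    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[float]: ...

class VolumeToCapModel:
    """Stub standing in for a trained estimator: returns 24h_volume / mkt_cap."""
    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[float]:
        return [row[1] / row[2] for row in rows]

def predict_liquidity(payload: Mapping[str, object], model: Model) -> dict:
    """Validate a request payload and return a response dictionary."""
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        return {"status": "error", "message": f"missing fields: {missing}"}
    try:
        row = [float(payload[name]) for name in FEATURE_ORDER]
    except (TypeError, ValueError):
        return {"status": "error", "message": "all fields must be numeric"}
    prediction = model.predict([row])[0]
    return {"status": "ok", "liquidity": float(prediction)}

ok = predict_liquidity(
    {"price": 2.0, "24h_volume": 50.0, "mkt_cap": 1000.0}, VolumeToCapModel()
)
bad = predict_liquidity({"price": 2.0}, VolumeToCapModel())
print(ok)
print(bad)
```

Keeping validation in a plain function, separate from any web framework, means the same logic can be unit-tested directly and reused behind either a Flask route or a Streamlit form.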

3. Pipeline Architecture and Documentation
 - This deliverable focuses on the project's data flow. It is closely related to the HLD, but it specifically diagrams and explains the automated workflow.
 - Diagram: Provide a clear diagram showing the flow.
 - Stages of the Pipeline:
    - Data Acquisition: Raw data is collected from the source.
    - Data Validation/Cleaning: Missing values are handled, and data types are corrected.
    - Feature Engineering: New features, including the target liquidity metric, are created.
    - Model Training & Evaluation: The processed data is used to train the ML model, and performance is evaluated.
    - Model Registration/Deployment: The best-performing model is saved and made available for inference.
    - Inference/Prediction: The deployed model receives new data and returns a liquidity prediction.
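
The stages above can be sketched as a chain of small functions. This is a toy illustration on synthetic data under several assumptions: the function names, the synthetic columns, the log-transformed features, and the use of `LinearRegression` scored by held-out R² are all placeholders for whatever the project actually chooses.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

FEATURES = ["log_volume", "log_mkt_cap", "24h"]  # hypothetical feature set

def acquire() -> pd.DataFrame:
    """Data Acquisition: synthetic stand-in for reading the raw CSV files."""
    rng = np.random.default_rng(0)
    n = 200
    mkt_cap = rng.uniform(1e6, 1e9, n)
    turnover = rng.uniform(0.01, 0.5, n)
    df = pd.DataFrame({
        "mkt_cap": mkt_cap,
        "24h_volume": mkt_cap * turnover,
        "24h": rng.normal(0, 5, n),
    })
    df.loc[::25, "24h"] = np.nan  # inject a few missing values
    return df

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Data Validation/Cleaning: drop duplicates, impute numeric medians."""
    df = df.drop_duplicates()
    return df.fillna(df.median(numeric_only=True))

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Feature Engineering: add the liquidity target and log features."""
    out = df.copy()
    out["liquidity_ratio"] = out["24h_volume"] / out["mkt_cap"]
    out["log_volume"] = np.log(out["24h_volume"])
    out["log_mkt_cap"] = np.log(out["mkt_cap"])
    return out

def train_and_evaluate(df: pd.DataFrame):
    """Model Training & Evaluation: fit on 80%, score R2 on the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["liquidity_ratio"], test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))

def register(model, directory: str) -> str:
    """Model Registration/Deployment: persist the model to disk."""
    path = os.path.join(directory, "model.joblib")
    joblib.dump(model, path)
    return path

def infer(path: str, new_rows: pd.DataFrame) -> np.ndarray:
    """Inference/Prediction: load the saved model and predict."""
    return joblib.load(path).predict(new_rows[FEATURES])

processed = engineer(clean(acquire()))
model, score = train_and_evaluate(processed)
with tempfile.TemporaryDirectory() as tmp:
    model_path = register(model, tmp)
    predictions = infer(model_path, processed.head(3))

print(f"held-out R2: {score:.3f}")
print("predictions:", predictions)
```

Because each stage takes and returns plain data, the stages can be tested one at a time and later moved into a scheduler or workflow tool without changing their bodies.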

4. Final Report
 - The Final Report summarizes the project's findings, model performance, and key insights.
   1. Executive Summary: A concise, non-technical summary of the project goal, methodology, and main result.
   2. Problem Statement & Objective: Reiterate the need for a model that predicts cryptocurrency liquidity to support market stability and risk management.
   3. Methodology
      - Data Preprocessing: Summarize the key steps taken.
      - Model Selection: Justify the final model chosen and identify the features that had the most impact.
   4. Results & Evaluation: Present the model's performance on the test data using the agreed-upon metrics, and discuss the robustness of the model.
   5. Key Insights & Findings
      - Liquidity Drivers: Which market factors (e.g., 24h_volume, 7d price change, new features) were the strongest predictors of liquidity?
      - Market Stability: What do the model's predictions suggest about the future stability of the cryptocurrency market?
   6. Conclusion & Future Work: Summarize the project's success and suggest next steps, such as deploying the model for real-time predictions or incorporating new data sources.
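
For the Results & Evaluation section, the sketch below shows how three commonly used regression metrics (MAE, RMSE, and R²) are computed from test-set predictions. Treating these three as the "agreed-upon metrics" is an assumption, `regression_metrics` is a hypothetical helper, and the arrays are made-up numbers rather than project results.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # root mean squared error

    # R2 = 1 - SS_res / SS_tot; undefined when y_true is constant (SS_tot = 0).
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}

# Made-up values: two of the four predictions are off by 0.5.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

MAE and RMSE are in the units of the target, so they should be read against the scale of the liquidity metric (for example, whether it was normalized), while R² is unitless.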

# Run with: python crypto_liquidity_project.py

# crypto_liquidity_project.py
# Complete pipeline for Cryptocurrency Liquidity Prediction
# Requirements: pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost, joblib

import warnings
warnings.filterwarnings('ignore')

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10,6)

# ------------------------------
# 1) Load & combine CSV files
# ------------------------------
file1 = '/mnt/data/coin_gecko_2022-03-16.csv'
file2 = '/mnt/data/coin_gecko_2022-03-17.csv'

if not (os.path.exists(file1) and os.path.exists(file2)):
    raise FileNotFoundError(f"Make sure both CSV files exist at {file1} and {file2}")

df = pd.concat([pd.read_csv(file1), pd.read_csv(file2)], ignore_index=True)
df.drop_duplicates(inplace=True)
print("Combined dataframe shape:", df.shape)

# ------------------------------
# 2) Preprocessing & Feature Engineering
# ------------------------------
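# Impute missing numeric values with each column's median
df.fillna(df.median(numeric_only=True), inplace=True)

# Engineered features
# liquidity_ratio: 24h traded volume relative to market cap (the prediction target)
df['liquidity_ratio'] = df['24h_volume'] / df['mkt_cap']
# volatility_index: spread of the 1h, 24h, and 7d price changes
df['volatility_index'] = df[['1h','24h','7d']].std(axis=1)
# price_stability_score: 1 - |24h change| / |7d change|; a zero 7d change gives NaN
df['price_stability_score'] = 1 - (df['24h'].abs() / (df['7d'].replace(0, np.nan).abs()+1e-9))

# A zero market cap produces inf (or NaN); convert inf to NaN, then set all remaining NaN to 0
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace=True)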

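# Scale the selected columns to [0, 1]. Note that the scaler is fit on the full
# dataset, including the target liquidity_ratio, before the train/test split.
scaler = MinMaxScaler()
numeric_cols = ['price','24h_volume','mkt_cap','liquidity_ratio','volatility_index','price_stability_score']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])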

print("After preprocessing shape:", df.shape)
print(df[numeric_cols].describe().T)

# Save a short CSV of processed data for quick inspection
os.makedirs('/mnt/data/processed', exist_ok=True)
df.to_csv('/mnt/data/processed/crypto_processed.csv', index=False)
print("Saved processed CSV to /mnt/data/processed/crypto_processed.csv")

# ------------------------------
# 3) Exploratory Data Analysis (plots will be saved)
# ------------------------------
os.makedirs('/mnt/data/plots', exist_ok=True)

# Distribution of price
plt.figure()
sns.histplot(df['price'], bins=40, kde=True)
plt.title('Distribution of Cryptocurrency Prices (Normalized)')
plt.xlabel('Normalized Price')
plt.savefig('/mnt/data/plots/price_distribution.png')
plt.close()

# Correlation heatmap
plt.figure(figsize=(8,6))
corr = df[['price','24h_volume','mkt_cap','liquidity_ratio','volatility_index','price_stability_score']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.savefig('/mnt/data/plots/correlation_heatmap.png')
plt.close()

# Liquidity Ratio vs Market Cap scatter
plt.figure()
sns.scatterplot(x='mkt_cap', y='liquidity_ratio', data=df, alpha=0.7)
plt.title('Liquidity Ratio vs Market Cap')
plt.savefig('/mnt/data/plots/liquidity_vs_mktcap.png')
plt.close()

# Volatility vs Stability
plt.figure()
sns.scatterplot(x='volatility_index', y='price_stability_score', data=df, alpha=0.7)
plt.title('Volatility vs Price Stability')
plt.savefig('/mnt/data/plots/volatility_vs_stability.png')
plt.close()

# Top 10 by liquidity ratio
top10 = df.sort_values('liquidity_ratio', ascending=False).head(10)
plt.figure(figsize=(8,5))
sns.barplot(x='liquidity_ratio', y='coin', data=top10)
plt.title('Top 10 Cryptocurrencies by Liquidity Ratio')
plt.savefig('/mnt/data/plots/top10_liquidity.png')
plt.close()

print("Saved EDA plots to /mnt/data/plots")

# ------------------------------
# 4) Modeling: Train & Evaluate
# ------------------------------
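# Features and target. 24h_volume and mkt_cap are kept as features even though the
# target is derived from them (liquidity_ratio = 24h_volume / mkt_cap, then scaled).
feature_cols = ['price','24h','7d','24h_volume','mkt_cap','volatility_index','price_stability_score']
X = df[feature_cols]
y = df['liquidity_ratio']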

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train/Test shapes:", X_train.shape, X_test.shape)

# 1) Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# 2) Random Forest (small grid)
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_params = {'n_estimators':[50,100], 'max_depth':[6,10]}
rf_grid = GridSearchCV(rf, rf_params, cv=3, scoring='neg_mean_squared_error', n_jobs=1)
rf_grid.fit(X_train, y_train)
rf_best = rf_grid.best_estimator_
y_pred_rf = rf_best.predict(X_test)

# 3) XGBoost (small grid)
xgb = XGBRegressor(random_state=42, n_jobs=1, verbosity=0)
xgb_params = {'n_estimators':[50,100], 'max_depth':[3,5], 'learning_rate':[0.05,0.1]}
xgb_grid = GridSearchCV(xgb, xgb_params, cv=3, scoring='neg_mean_squared_error', n_jobs=1)
xgb_grid.fit(X_train, y_train)
xgb_best = xgb_grid.best_estimator_
y_pred_xgb = xgb_best.predict(X_test)

def evaluate(y_true, y_pred):
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'R2': r2_score(y_true, y_pred)
    }

results = {
    'Linear Regression': evaluate(y_test, y_pred_lr),
    'Random Forest': evaluate(y_test, y_pred_rf),
    'XGBoost': evaluate(y_test, y_pred_xgb)
}

print("Evaluation results:")
for name, metrics in results.items():
    print(f"--- {name} ---")
    print(metrics)

# Save models
os.makedirs('/mnt/data/models', exist_ok=True)
joblib.dump(lr, '/mnt/data/models/linear_regression.joblib')
joblib.dump(rf_best, '/mnt/data/models/random_forest.joblib')
joblib.dump(xgb_best, '/mnt/data/models/xgboost.joblib')
joblib.dump(scaler, '/mnt/data/models/scaler.joblib')
print("Saved models to /mnt/data/models")

# End of script
print("Pipeline complete. Check /mnt/data for outputs (processed CSV, plots, and models).")



# Complete notebook: load data, preprocess, EDA, feature engineering, models (Linear Regression, Random Forest, XGBoost), evaluation, and save models.

import warnings
warnings.filterwarnings('ignore')

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10,6)


file1 = '/mnt/data/coin_gecko_2022-03-16.csv'
file2 = '/mnt/data/coin_gecko_2022-03-17.csv'

df = pd.concat([pd.read_csv(file1), pd.read_csv(file2)], ignore_index=True)
df.drop_duplicates(inplace=True)
print('Combined shape:', df.shape)
df.head()


df.fillna(df.median(numeric_only=True), inplace=True)
df['liquidity_ratio'] = df['24h_volume'] / df['mkt_cap']
df['volatility_index'] = df[['1h','24h','7d']].std(axis=1)
df['price_stability_score'] = 1 - (df['24h'].abs() / (df['7d'].replace(0, np.nan).abs()+1e-9))
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace=True)

scaler = MinMaxScaler()
numeric_cols = ['price','24h_volume','mkt_cap','liquidity_ratio','volatility_index','price_stability_score']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

df[numeric_cols].describe().T


# distribution
sns.histplot(df['price'], bins=40, kde=True)
plt.title('Distribution of Price (Normalized)')
plt.show()

# correlation heatmap
corr = df[['price','24h_volume','mkt_cap','liquidity_ratio','volatility_index','price_stability_score']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()

# liquidity vs mktcap
sns.scatterplot(x='mkt_cap', y='liquidity_ratio', data=df, alpha=0.7)
plt.show()

# volatility vs stability
sns.scatterplot(x='volatility_index', y='price_stability_score', data=df, alpha=0.7)
plt.show()

# top 10 liquidity
top10 = df.sort_values('liquidity_ratio', ascending=False).head(10)
sns.barplot(x='liquidity_ratio', y='coin', data=top10)
plt.show()


feature_cols = ['price','24h','7d','24h_volume','mkt_cap','volatility_index','price_stability_score']
X = df[feature_cols]
y = df['liquidity_ratio']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
lr = LinearRegression().fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Random Forest (small grid)
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_params = {'n_estimators':[50,100], 'max_depth':[6,10]}
rf_grid = GridSearchCV(rf, rf_params, cv=3, scoring='neg_mean_squared_error', n_jobs=1)
rf_grid.fit(X_train, y_train)
rf_best = rf_grid.best_estimator_
y_pred_rf = rf_best.predict(X_test)

# XGBoost (small grid)
xgb = XGBRegressor(random_state=42, n_jobs=1, verbosity=0)
xgb_params = {'n_estimators':[50,100], 'max_depth':[3,5], 'learning_rate':[0.05,0.1]}
xgb_grid = GridSearchCV(xgb, xgb_params, cv=3, scoring='neg_mean_squared_error', n_jobs=1)
xgb_grid.fit(X_train, y_train)
xgb_best = xgb_grid.best_estimator_
y_pred_xgb = xgb_best.predict(X_test)

def evaluate(y_true, y_pred):
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'R2': r2_score(y_true, y_pred),
    }

results = {
    'Linear Regression': evaluate(y_test, y_pred_lr),
    'Random Forest': evaluate(y_test, y_pred_rf),
    'XGBoost': evaluate(y_test, y_pred_xgb)
}
results


os.makedirs('/mnt/data/models', exist_ok=True)
joblib.dump(lr, '/mnt/data/models/linear_regression.joblib')
joblib.dump(rf_best, '/mnt/data/models/random_forest.joblib')
joblib.dump(xgb_best, '/mnt/data/models/xgboost.joblib')
joblib.dump(scaler, '/mnt/data/models/scaler.joblib')
print("Saved models in /mnt/data/models")

