**Cryptocurrency Volatility Prediction**

crypto-volatility/
├─ data/                         # Raw or simulated data (CSV)
├─ notebooks/
│  ├─ 01_eda.ipynb
│  ├─ 02_feature_engineering.ipynb
│  └─ 03_modeling.ipynb
├─ src/
│  ├─ data_loader.py
│  ├─ preprocess.py
│  ├─ features.py
│  ├─ models.py
│  ├─ evaluate.py
│  └─ pipeline.py
├─ app/
│  └─ streamlit_app.py
├─ reports/
│  ├─ EDA_Report.md
│  ├─ HLD.md
│  ├─ LLD.md
│  ├─ Pipeline_Architecture.md
│  └─ Final_Report.md
├─ requirements.txt
└─ README.md


In [7]:
!python crypto-volatility/src/pipeline.py

python3: can't open file '/content/crypto-volatility/src/pipeline.py': [Errno 2] No such file or directory


In [8]:
!streamlit run crypto-volatility/app/streamlit_app.py

Usage: streamlit run [OPTIONS] [TARGET] [ARGS]...
Try 'streamlit run --help' for help.

Error: Invalid value: File does not exist: crypto-volatility/app/streamlit_app.py


Core code
1) Data simulation (50+ cryptos, OHLC, volume, market cap)

In [9]:
# src/data_loader.py
import numpy as np
import pandas as pd

def simulate_crypto_data(n_symbols=50, n_days=365, seed=42):
    """Simulate daily OHLC, volume and market cap for n_symbols synthetic assets."""
    np.random.seed(seed)
    symbols = [f"CRYPTO_{i}" for i in range(1, n_symbols+1)]
    dates = pd.date_range("2023-01-01", periods=n_days)
    frames = []
    for sym in symbols:
        base = 100 + np.cumsum(np.random.randn(n_days))  # random walk
        up = np.random.rand(n_days) * 5    # distance from the candle body up to the high
        down = np.random.rand(n_days) * 5  # distance from the candle body down to the low
        open_ = base + np.random.randn(n_days)
        close = base + np.random.randn(n_days)
        # Keep every bar consistent: high >= max(open, close) and low <= min(open, close)
        high = np.maximum(open_, close) + up
        low = np.minimum(open_, close) - down
        volume = np.random.randint(5_000, 100_000, size=n_days)
        market_cap = np.abs(close) * volume
        # Build one frame per symbol instead of appending row by row
        frames.append(pd.DataFrame({
            "date": dates, "symbol": sym, "open": open_, "high": high,
            "low": low, "close": close, "volume": volume, "market_cap": market_cap,
        }))
    return pd.concat(frames, ignore_index=True)
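
A quick sanity check for simulated or uploaded data is that every bar satisfies low <= min(open, close) and max(open, close) <= high. The sketch below shows one way to flag rows that break this rule. `find_inconsistent_bars` is a hypothetical helper, not one of the project files, and it assumes the column names used above.

```python
import pandas as pd

def find_inconsistent_bars(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose high/low do not bracket both open and close."""
    body_top = df[["open", "close"]].max(axis=1)
    body_bottom = df[["open", "close"]].min(axis=1)
    bad = (df["high"] < body_top) | (df["low"] > body_bottom)
    return df[bad]

sample = pd.DataFrame({
    "open":  [100.0, 101.0, 99.0],
    "high":  [102.0, 100.5, 101.0],   # second bar: high is below its open
    "low":   [99.0, 98.0, 98.5],
    "close": [101.0, 99.0, 100.0],
})
bad_rows = find_inconsistent_bars(sample)
print(bad_rows.index.tolist())   # [1]
```

Running the same check on the output of the simulator is a cheap way to confirm that the generated candles are well formed before any features are built on them.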


2) Preprocessing (missing values, scaling)

In [11]:
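# src/preprocess.py
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame):
    """Drop missing rows, sort by symbol and date, and standardise the numeric columns."""
    df = df.copy()
    df = df.dropna()
    df["date"] = pd.to_datetime(df["date"])
    df = df.sort_values(["symbol","date"])
    scaler = StandardScaler()
    num_cols = ["open","high","low","close","volume","market_cap"]
    # The scaler is fit on every row and across all symbols, so the statistics
    # also include rows that later end up in the test set
    df[num_cols] = scaler.fit_transform(df[num_cols])
    return df, scaler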
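
Because `preprocess` fits the scaler on the full frame, statistics from later dates influence how earlier rows are scaled. One alternative, sketched here with plain pandas, is to compute the mean and standard deviation on a training window only and reuse them for the later rows. The function name `scale_with_train_stats` and its `cutoff` argument are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def scale_with_train_stats(df, cols, cutoff):
    """Z-score `cols` using mean/std computed only on rows dated before `cutoff`."""
    df = df.copy()
    train = df[df["date"] < cutoff]
    mean = train[cols].mean()
    # ddof=0 matches the population std that StandardScaler uses;
    # replace zeros so constant columns do not divide by zero
    std = train[cols].std(ddof=0).replace(0, 1.0)
    df[cols] = (df[cols] - mean) / std
    return df, mean, std

demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6),
    "close": [10.0, 12.0, 14.0, 16.0, 30.0, 40.0],
})
scaled, mean, std = scale_with_train_stats(demo, ["close"], pd.Timestamp("2023-01-05"))
print(mean["close"])   # 13.0, the mean of the first four rows only
print(scaled)
```

The last two rows are scaled with the training statistics, so they land far outside the training range instead of pulling the mean towards themselves.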


3) Feature engineering (MA, rolling volatility, Bollinger Bands, ATR, liquidity)

In [12]:
# src/features.py
import pandas as pd

def add_features(df: pd.DataFrame):
    """Add per-symbol rolling features; expects rows sorted by symbol and date."""
    df = df.copy()
    df["price_range"] = df["high"] - df["low"]
    close = df.groupby("symbol")["close"]
    df["volatility_7"] = close.transform(lambda x: x.rolling(7).std())
    df["ma_7"] = close.transform(lambda x: x.rolling(7).mean())
    df["ma_21"] = close.transform(lambda x: x.rolling(21).mean())
    # Bollinger Bands: 21-day moving average +/- 2 rolling standard deviations
    std_21 = close.transform(lambda x: x.rolling(21).std())
    df["bb_upper_21"] = df["ma_21"] + 2 * std_21
    df["bb_lower_21"] = df["ma_21"] - 2 * std_21
    # ATR (approx): rolling mean of high-low
    df["atr_14"] = df.groupby("symbol")["price_range"].transform(lambda x: x.rolling(14).mean())
    # Note: if market_cap has been standardised (as preprocess() does), it is centred
    # on zero, so this ratio can become very large or change sign
    df["liquidity_ratio"] = df["volume"] / (df["market_cap"] + 1e-6)
    # The first 20 rows of each symbol have incomplete 21-day windows and are dropped
    df = df.dropna()
    return df
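
The `atr_14` column above averages `high - low`. The conventional ATR averages the true range instead, which also accounts for gaps from the previous close: TR = max(high - low, |high - previous close|, |low - previous close|). A minimal sketch follows, assuming a plain rolling mean rather than Wilder's smoothing. `add_true_atr`, `true_range` and `atr_true` are hypothetical names.

```python
import pandas as pd

def add_true_atr(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Add a true-range column and its per-symbol rolling mean."""
    df = df.sort_values(["symbol", "date"]).copy()
    prev_close = df.groupby("symbol")["close"].shift(1)
    candidates = pd.concat(
        [
            df["high"] - df["low"],
            (df["high"] - prev_close).abs(),
            (df["low"] - prev_close).abs(),
        ],
        axis=1,
    )
    # max() skips NaN, so the first bar of each symbol falls back to high - low
    df["true_range"] = candidates.max(axis=1)
    df["atr_true"] = df.groupby("symbol")["true_range"].transform(
        lambda x: x.rolling(window).mean()
    )
    return df

demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=3),
    "symbol": ["A"] * 3,
    "high":  [10.0, 15.0, 12.0],
    "low":   [8.0, 13.0, 11.0],
    "close": [9.0, 14.0, 11.5],
})
out = add_true_atr(demo, window=2)
print(out[["true_range", "atr_true"]])
```

In the demo the second bar opens well above the previous close, so its true range is 6.0 even though `high - low` is only 2.0.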


4) Modeling (Linear Regression, Random Forest, XGBoost optional)

In [13]:
# src/models.py
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def get_models():
    return {
        "LinearRegression": LinearRegression(),
        "RandomForest": RandomForestRegressor(n_estimators=200, max_depth=None, random_state=42)
    }



5) Evaluation (RMSE, MAE, R²)

In [14]:
# src/evaluate.py
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate(y_true, y_pred):
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred))
    }
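
To make the three metrics concrete, here is a small worked example computed by hand with NumPy on made-up numbers (the values are illustrative, not project results). RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of `y_true`.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

err = y_true - y_pred                      # [-0.5, 0.0, 0.5, 0.0]
rmse = float(np.sqrt(np.mean(err ** 2)))   # sqrt(0.5 / 4) = sqrt(0.125), about 0.3536
mae = float(np.mean(np.abs(err)))          # 1.0 / 4 = 0.25
ss_res = float(np.sum(err ** 2))           # 0.5
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # 5.0
r2 = 1 - ss_res / ss_tot                   # 1 - 0.1 = 0.9

print(rmse, mae, r2)
```

RMSE is larger than MAE here because squaring weights the two 0.5 errors more heavily than the two exact predictions.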


6) Pipeline (end‑to‑end)


In [None]:
# src/pipeline.py
import os

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
# In the notebook, the functions used below are already defined by the earlier cells.
# When running this file as a script from src/, uncomment these imports:
# from data_loader import simulate_crypto_data
# from preprocess import preprocess
# from features import add_features
# from models import get_models
# from evaluate import evaluate

def run_pipeline():
    df = simulate_crypto_data(n_symbols=50, n_days=365)
    df, scaler = preprocess(df)
    df = add_features(df)

    features = ["open","high","low","close","volume","market_cap",
                "price_range","ma_7","ma_21","bb_upper_21","bb_lower_21",
                "atr_14","liquidity_ratio"]
    target = "volatility_7"

    X, y = df[features], df[target]
    # Random (shuffled) 80/20 split; rows are not separated by date
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Baseline models with their default settings
    models = get_models()
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        results[name] = evaluate(y_test, preds)

    # Hyperparameter tuning for RandomForest.
    # GridSearchCV clones the estimator and, for regressors, scores with R² by default.
    grid = GridSearchCV(models["RandomForest"],
                        param_grid={"n_estimators":[100,200], "max_depth":[5,10,None]},
                        cv=3, n_jobs=-1)
    grid.fit(X_train, y_train)
    tuned_preds = grid.best_estimator_.predict(X_test)
    results["RandomForest_Tuned"] = evaluate(y_test, tuned_preds)

    # Create the output folder first; to_csv() and open() do not create missing directories
    os.makedirs("reports", exist_ok=True)
    pd.DataFrame(results).T.to_csv("reports/model_evaluation.csv")
    with open("reports/best_params.txt","w") as f:
        f.write(str(grid.best_params_))

    print("Pipeline complete. Results saved to reports/")
    return results

if __name__ == "__main__":
    run_pipeline()
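
`train_test_split` shuffles rows by default, so test rows can precede training rows in time, and overlapping rolling windows end up on both sides of the split. A chronological hold-out is one alternative. The sketch below assumes a `date` column and uses a hypothetical helper, `chronological_split`, that is not part of the project files.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, test_size: float = 0.2):
    """Hold out the most recent `test_size` share of distinct dates as the test set."""
    dates = df["date"].sort_values().unique()
    n_test = max(1, int(round(len(dates) * test_size)))
    cutoff = dates[-n_test]            # first date that belongs to the test set
    train = df[df["date"] < cutoff]
    test = df[df["date"] >= cutoff]
    return train, test

demo = pd.DataFrame({
    "date": list(pd.date_range("2023-01-01", periods=10)) * 2,
    "symbol": ["A"] * 10 + ["B"] * 10,
    "value": range(20),
})
train, test = chronological_split(demo, test_size=0.2)
print(len(train), len(test))                       # 16 4
print(train["date"].max() < test["date"].min())    # True
```

Splitting on distinct dates rather than on row positions keeps all symbols for a given day on the same side of the cutoff.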

Streamlit app (local deployment)

In [None]:
# app/streamlit_app.py
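import sys
from pathlib import Path

import streamlit as st
import pandas as pd

# Make the modules in src/ importable when the app is launched from app/
# (assumes the project layout shown at the top: app/ and src/ are siblings)
sys.path.append(str(Path(__file__).resolve().parents[1] / "src"))

from data_loader import simulate_crypto_data
from preprocess import preprocess
from features import add_features
from models import get_models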

st.title("Cryptocurrency Volatility Prediction")

st.markdown("Upload data or use simulated dataset to predict 7-day rolling volatility.")

use_sim = st.checkbox("Use simulated dataset", value=True)
if use_sim:
    df = simulate_crypto_data(n_symbols=10, n_days=180)
else:
    file = st.file_uploader("Upload CSV with columns: date, symbol, open, high, low, close, volume, market_cap")
    if file:
        df = pd.read_csv(file)
    else:
        st.stop()

df, scaler = preprocess(df)
df = add_features(df)

features = ["open","high","low","close","volume","market_cap",
            "price_range","ma_7","ma_21","bb_upper_21","bb_lower_21",
            "atr_14","liquidity_ratio"]
target = "volatility_7"

st.subheader("Sample data")
st.dataframe(df.head())

models = get_models()
model_choice = st.selectbox("Model", list(models.keys()))
model = models[model_choice]
X, y = df[features], df[target]
# The model is fit on every available row, so the predictions shown below are in-sample
model.fit(X, y)

st.subheader("Predict volatility for latest rows")
n = st.slider("Rows to predict", 5, 50, 10)
preds = model.predict(X.tail(n))
out = df.tail(n)[["date","symbol"]].copy()
out["predicted_volatility_7"] = preds
st.dataframe(out)


EDA notebook highlights
Trends: Moving averages vs close price to visualize momentum.

Distributions: Histograms of volatility, ATR, liquidity ratio.

Correlations: Heatmap of engineered features vs target.

Insights: Liquidity ratio and ATR often correlate with short‑term volatility; MA gaps (MA7−MA21) capture trend shifts.
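
The MA gap and the feature-to-target correlations mentioned above can be computed along these lines. This is a sketch on a small random-walk series, not the notebook's actual EDA code, and `ma_gap` is an illustrative column name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.standard_normal(120)), name="close")

feat = pd.DataFrame({"close": close})
feat["ma_7"] = close.rolling(7).mean()
feat["ma_21"] = close.rolling(21).mean()
feat["ma_gap"] = feat["ma_7"] - feat["ma_21"]   # positive when the short MA is above the long MA
feat["volatility_7"] = close.rolling(7).std()
feat = feat.dropna()                            # drops the 20 rows without a full 21-day window

# Pearson correlation of each feature with the target; the full matrix from
# feat.corr() is what a correlation heatmap would display
corr_with_target = feat.corr()["volatility_7"].drop("volatility_7")
print(corr_with_target.sort_values(ascending=False))
```

On a pure random walk these correlations carry no market meaning; the point is only the mechanics of building the gap feature and ranking features against the target.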

Documentation
High‑level design (HLD)
Goal: Predict short‑term volatility (7‑day rolling std of close).

Inputs: OHLC, volume, market cap.

Processing: Scaling → feature engineering (MA, Bollinger, ATR, liquidity).

Models: Linear Regression, Random Forest (+tuning).

Outputs: Predictions, evaluation metrics, simple UI via Streamlit.

Low‑level design (LLD)
data_loader.py: Simulates random-walk price paths (OHLC) plus volume and market cap for each symbol.

preprocess.py: Cleans, sorts, scales numeric columns.

features.py: Grouped rolling features per symbol; drops NaNs post‑rolling.

models.py: Model registry for easy swapping.

pipeline.py: Train/test split, training, evaluation, grid search, artifact saving.

app/streamlit_app.py: Interactive predictions on latest rows.

Pipeline architecture
Ingest: Simulated/real CSV → DataFrame.

Preprocess: Clean, scale, sort by symbol/date.

Feature engineering: Rolling stats per symbol.

Train: Fit baseline + tuned model.

Evaluate: RMSE, MAE, R²; save results.

Serve: Streamlit app for local testing.


Final report (summary)
Objective: Forecast short‑term volatility to support risk management and trading decisions.

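Best model: Random Forest (tuned) is expected to outperform Linear Regression because it can capture non‑linear relationships; confirm this against the metrics the pipeline writes to reports/model_evaluation.csv.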

Key features: ATR, liquidity ratio, price range, MA7/MA21, Bollinger bands.

Performance: Reported via RMSE/MAE/R²; values depend on dataset realism and window sizes.

Limitations: Synthetic data lacks regime shifts and real market microstructure; for production, use real datasets and robust validation (walk‑forward, time‑series split).

Next steps: Add time‑aware CV, more indicators (RSI, MACD), LSTM/Temporal Fusion Transformer, and model monitoring.
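
The walk-forward validation suggested above can be sketched as an expanding-window splitter. This is a minimal illustration in NumPy under the assumption that rows are already sorted by time; `walk_forward_splits` and its parameters are hypothetical names. scikit-learn's `TimeSeriesSplit` offers a comparable ready-made splitter.

```python
import numpy as np

def walk_forward_splits(n_samples: int, n_splits: int, min_train: int):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Each test block starts right after the training rows, so no test row
    ever precedes a training row.
    """
    test_size = (n_samples - min_train) // n_splits
    if test_size < 1:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(n_splits):
        train_end = min_train + k * test_size
        test_end = train_end + test_size
        yield np.arange(train_end), np.arange(train_end, test_end)

splits = list(walk_forward_splits(n_samples=10, n_splits=3, min_train=4))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx.tolist())
# 4 [4, 5]
# 6 [6, 7]
# 8 [8, 9]
```

For a multi-symbol panel like the one used here, the indices would more naturally refer to distinct dates than to rows, so that all symbols for a given day fall into the same fold.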