 Problem Statement
 Cryptocurrency markets are highly volatile, and understanding and forecasting this volatility is crucial for
market participants. Volatility refers to the degree of variation in the price of a cryptocurrency over time, and
high volatility can lead to significant risks for traders and investors. Accurate volatility prediction helps in risk
management, portfolio allocation, and developing trading strategies.
 In this project, you are required to build a machine learning model to predict cryptocurrency volatility levels
based on historical market data such as OHLC (Open, High, Low, Close) prices, trading volume, and market
capitalization. The objective is to anticipate periods of heightened volatility, enabling traders and financial
institutions to manage risks and make informed decisions.
 Your final model should provide insights into market stability by forecasting volatility variations, allowing
stakeholders to proactively respond to changing market conditions.
 Dataset Information
 You will use a dataset that includes historical daily cryptocurrency price, volume, and market capitalization
data for multiple cryptocurrencies.
 Dataset:
 Cryptocurrency Historical Prices Dataset
 Data Preprocessing Requirements
 Handle missing values and ensure data consistency
 Normalize and scale numerical features
 Engineer new features related to volatility and liquidity trends
 The dataset consists of daily records for over 50 cryptocurrencies, including features such as date, symbol,
open, high, low, close, volume, and market cap.
 Project Development Steps
 Data Collection: Gather historical OHLC, volume, and market cap data from the provided dataset
 Data Preprocessing: Handle missing values, clean data, and normalize numerical features
 Exploratory Data Analysis (EDA): Analyze data patterns, trends, and correlations
 Feature Engineering: Create relevant features such as moving averages, rolling volatility, liquidity ratios (e.g.,
volume/market cap), and technical indicators (e.g., Bollinger Bands, ATR)
 Model Selection: Choose appropriate machine learning models such as time-series forecasting, regression,
or deep learning approaches
 Model Training: Train the selected model using the processed dataset
 Model Evaluation: Assess model performance using metrics such as RMSE, MAE, and R² score
 Model Optimization and Deployment
 Hyperparameter Tuning: Optimize model parameters for better accuracy
 Model Testing & Validation: Test the model on unseen data and analyze predictions
 Local Deployment: Deploy the trained model locally using Flask or Streamlit for testing
 Expected Deliverables
 1. Machine Learning Model
 A trained model that predicts cryptocurrency volatility
 Evaluation metrics showing how well the model performs
 2. Data Processing & Feature Engineering
 Cleaned and prepared dataset
 A brief explanation of new features added
 3. Exploratory Data Analysis (EDA) Report
 Summary of dataset statistics
 Basic visualizations (trends, correlations, distributions)
 4. Project Documentation
 High-Level Design (HLD) Document: Overview of system and architecture
 Low-Level Design (LLD) Document: Breakdown of how each component is implemented
 Pipeline Architecture: Explanation of data flow from preprocessing to prediction
 Final Report: A simple summary of findings, model performance, and key insights
 EDA Report
 Guidelines & Submission Requirements
 Code Documentation: Ensure all scripts are well-commented and easy to follow
 Report Structure: The report must be structured and should clearly explain the methodology followed
 Diagrams & Visuals: Use appropriate diagrams and plots to explain data processing, model selection, and
performance evaluation
 Deployment: If possible, deploy the model using a simple interface (e.g., Streamlit or Flask API) for testing
predictions
 Submission Format
 The project must be submitted as a GitHub repository or a zipped folder containing:
 Source Code
 HLD & LLD Documents
 Pipeline Architecture and Documentation
 Final Report

1. Data Collection
We'll first access the dataset from the provided link and inspect its structure.

2. Data Preprocessing
Handle missing values.

Convert data types (e.g., datetime).

Normalize numerical features.

Encode categorical variables if needed.
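
As a rough illustration of the first two steps, here is a minimal sketch on a toy frame. The column names follow the dataset, but the helper name `basic_clean` and the forward-fill policy are assumptions made for this example, not part of the project code. Scaling and encoding are left to the model pipeline.

```python
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse dates, sort, and fill price gaps per cryptocurrency."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(["crypto_name", "date"]).reset_index(drop=True)
    price_cols = ["open", "high", "low", "close"]
    # Assumed policy: forward-fill missing prices within each crypto only,
    # so values never leak from one coin into another.
    out[price_cols] = out.groupby("crypto_name")[price_cols].ffill()
    # Rows that are still incomplete (a gap at the start of a series) are dropped.
    return out.dropna(subset=price_cols).reset_index(drop=True)

# Toy data: one missing close for Bitcoin, one fully missing first day for Ethereum.
toy = pd.DataFrame({
    "date": ["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-02"],
    "crypto_name": ["Bitcoin", "Bitcoin", "Ethereum", "Ethereum"],
    "open": [101.0, 100.0, np.nan, 11.0],
    "high": [103.0, 102.0, np.nan, 12.0],
    "low": [99.0, 98.0, np.nan, 10.0],
    "close": [np.nan, 101.0, np.nan, 11.5],
})
clean = basic_clean(toy)
print(clean)
```

Whether forward-filling is appropriate depends on why values are missing; for a dataset with no gaps this step is a no-op.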

3. Feature Engineering
We’ll create:

Rolling Volatility (e.g., standard deviation of returns).

Moving averages (7-day, 30-day).

Liquidity ratio: volume / market_cap.

Technical indicators like Bollinger Bands, ATR.
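
ATR is listed here but is not computed in the scripts further down. A minimal sketch of one way to add it follows, assuming the simple moving-average variant of ATR (not Wilder's smoothing); the helper name `add_atr` and the 14-day default window are assumptions for illustration.

```python
import pandas as pd

def add_atr(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Hypothetical helper: append an `atr_<window>` column, computed per cryptocurrency."""
    out = df.sort_values(["crypto_name", "date"]).copy()
    prev_close = out.groupby("crypto_name")["close"].shift(1)
    # True range: the largest of the three distances below.
    # On the first day of each series prev_close is NaN, so max() falls back to high - low.
    true_range = pd.concat(
        [
            out["high"] - out["low"],
            (out["high"] - prev_close).abs(),
            (out["low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    # Simple rolling mean of the true range within each crypto.
    out[f"atr_{window}"] = (
        true_range.groupby(out["crypto_name"])
        .transform(lambda s: s.rolling(window).mean())
    )
    return out

toy = pd.DataFrame({
    "crypto_name": ["Bitcoin"] * 4,
    "date": pd.date_range("2021-01-01", periods=4),
    "high": [10.0, 12.0, 11.0, 15.0],
    "low": [8.0, 9.0, 10.0, 11.0],
    "close": [9.0, 11.0, 10.5, 14.0],
})
result = add_atr(toy, window=2)
print(result[["date", "atr_2"]])
```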

4. Exploratory Data Analysis (EDA)
Visualize price and volume trends.

Analyze volatility patterns.

Correlation heatmaps.

5. Model Selection & Training
Options:

XGBoost / Random Forest Regressor (for regression-based volatility prediction).

LSTM / GRU (Deep Learning) for time-series forecasting.

6. Model Evaluation
Use:

RMSE

MAE

R² Score
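
For reference, these three metrics are conventionally defined as follows. The notation is introduced here and does not appear elsewhere in the project: $y_i$ is the observed volatility, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the observed values, and $n$ the number of test rows.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

RMSE and MAE are in the same units as the target (here, a standard deviation of daily returns), while $R^2$ is unitless and can be negative when the model does worse than predicting $\bar{y}$.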

7. Deployment (Optional)
Local deployment with Streamlit.

8. Deliverables
We’ll prepare:

Code (well-commented)

EDA report (plots + insights)

HLD + LLD Docs

Pipeline Architecture

Final Report

Thanks for uploading the dataset. Let's begin by:

### 🔍 Step 1: Dataset Inspection


* Load the dataset
* Show you its first few rows
* Check for missing values
* Display basic statistics and column types
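
A minimal sketch of these inspection steps is shown below. The helper name `inspect` is hypothetical, and the two CSV rows are a made-up stand-in so the snippet runs without the real file; in practice the frame would come from `pd.read_csv` on the dataset path.

```python
import io
import pandas as pd

def inspect(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect the basic facts listed above."""
    return {
        "shape": df.shape,
        "head": df.head(),
        "missing": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "stats": df.describe(),
    }

# Two made-up rows with the same column layout as the dataset.
sample_csv = io.StringIO(
    ",open,high,low,close,volume,marketCap,timestamp,crypto_name,date\n"
    "0,1.0,2.0,0.5,1.5,0.0,100.0,2013-05-05T23:59:59.999Z,Bitcoin,2013-05-05\n"
    "1,1.5,2.5,1.0,2.0,10.0,120.0,2013-05-06T23:59:59.999Z,Bitcoin,2013-05-06\n"
)
df = pd.read_csv(sample_csv)
summary = inspect(df)
print(summary["shape"])
print(summary["missing"])
print(summary["dtypes"])
```

The unnamed first CSV column is what pandas reads in as `Unnamed: 0`.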


### ✅ Dataset Overview (72946 rows × 10 columns)

| Column        | Type    | Description                        |
| ------------- | ------- | ---------------------------------- |
| `Unnamed: 0`  | int64   | Index-like column (can be dropped) |
| `open`        | float64 | Opening price                      |
| `high`        | float64 | Highest price of the day           |
| `low`         | float64 | Lowest price of the day            |
| `close`       | float64 | Closing price                      |
| `volume`      | float64 | Trading volume                     |
| `marketCap`   | float64 | Market capitalization              |
| `timestamp`   | object  | Full timestamp                     |
| `crypto_name` | object  | Name of cryptocurrency             |
| `date`        | object  | Date (YYYY-MM-DD)                  |

* ✅ No missing values
* ✅ Clean numerical data
* 🔄 `volume` = 0 in many rows (will check later)
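
One way to follow up on the zero-volume rows is to measure how common they are for each cryptocurrency. This is a sketch on toy data; the helper name `zero_volume_share` is an assumption for illustration.

```python
import pandas as pd

def zero_volume_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of rows with volume == 0 for each cryptocurrency, highest first."""
    return (
        df["volume"].eq(0)
        .groupby(df["crypto_name"])
        .mean()
        .sort_values(ascending=False)
    )

toy = pd.DataFrame({
    "crypto_name": ["Bitcoin", "Bitcoin", "Bitcoin", "Bitcoin", "XRP", "XRP"],
    "volume": [0.0, 0.0, 5.0, 7.0, 3.0, 4.0],
})
share = zero_volume_share(toy)
print(share)
```

A high share for a coin means its `liquidity_ratio` will be zero for many rows, which may be worth flagging or filtering before training.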

---

### ⏭ Next Steps

1. **Drop `Unnamed: 0`**

2. **Convert `date` column to datetime**

3. **Create new features**:

   * Daily return
   * Rolling volatility (e.g., 7-day, 30-day)
   * Liquidity ratio: `volume / marketCap`
   * Moving averages (7D, 30D)
   * Technical indicators (e.g., Bollinger Bands)

4. **Visualize Volatility (EDA)**

5. **Train ML models to predict volatility**


✅ **Feature Engineering Completed**

Here are the **new features** added:

| Feature                               | Description                                                        |
| ------------------------------------- | ------------------------------------------------------------------ |
| `daily_return`                        | % change in closing price from the previous day                    |
| `volatility_7d`                       | 7-day rolling standard deviation of daily returns                  |
| `volatility_30d`                      | 30-day rolling standard deviation (smoother measure of volatility) |
| `liquidity_ratio`                     | Volume / MarketCap, represents trading activity relative to size   |
| `ma_7`, `ma_30`                       | 7-day and 30-day moving averages of close prices                   |
| `bollinger_upper` / `bollinger_lower` | 20-day Bollinger Band levels (used in technical analysis)          |

---

### 📊 Next Step: Exploratory Data Analysis (EDA)



* Plot price trends
* Visualize volatility trends
* Show correlation heatmap
* Compare cryptocurrencies




import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset
df = pd.read_csv("your_dataset.csv")  # replace with actual file path
df['date'] = pd.to_datetime(df['date'])

# Drop unnecessary column if present
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=["Unnamed: 0"])

# Sort and group
df = df.sort_values(by=["crypto_name", "date"])

# Feature Engineering
df['daily_return'] = df.groupby('crypto_name')['close'].pct_change()
df['volatility_30d'] = df.groupby('crypto_name')['daily_return'].rolling(window=30).std().reset_index(0, drop=True)

# Filter top cryptocurrencies
top_cryptos = ['Bitcoin', 'Ethereum', 'Litecoin', 'XRP']
df_top = df[df['crypto_name'].isin(top_cryptos)]

# Plotting
plt.style.use("ggplot")
fig, axes = plt.subplots(4, 2, figsize=(16, 20), sharex=True)

for i, crypto in enumerate(top_cryptos):
    crypto_data = df_top[df_top['crypto_name'] == crypto]

    # Plot close price
    axes[i][0].plot(crypto_data['date'], crypto_data['close'], label=f'{crypto} Price', color='blue')
    axes[i][0].set_title(f'{crypto} - Closing Price')
    axes[i][0].set_ylabel("Price (USD)")
    axes[i][0].legend()

    # Plot volatility
    axes[i][1].plot(crypto_data['date'], crypto_data['volatility_30d'], label=f'{crypto} Volatility (30d)', color='red')
    axes[i][1].set_title(f'{crypto} - 30-Day Rolling Volatility')
    axes[i][1].set_ylabel("Volatility")
    axes[i][1].legend()

plt.tight_layout()
plt.show()


#1) Utilities — src/utils.py Common helpers: metrics, saving/loading.
# src/utils.py
import joblib
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

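def regression_metrics(y_true, y_pred):
    # The `squared` argument of mean_squared_error has been removed from recent
    # scikit-learn releases, so take the square root explicitly.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {"rmse": rmse, "mae": mae, "r2": r2}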

def save_model(obj, path):
    joblib.dump(obj, path)

def load_model(path):
    return joblib.load(path)

def save_predictions(df, path):
    df.to_csv(path, index=False)

# 2) Data preprocessing & feature engineering: src/data_preprocess.py
# Loads the raw CSV, creates the features (daily_return, volatility_7d, volatility_30d,
# liquidity_ratio, ma_7, ma_30, Bollinger Bands), creates the target (next-day volatility_30d),
# drops rows with NaNs, and saves the processed CSV.
# src/data_preprocess.py
import pandas as pd
import numpy as np
import argparse
from pathlib import Path

def preprocess(input_csv, output_csv):
    df = pd.read_csv(input_csv)
    # Drop index-like column if present
    if 'Unnamed: 0' in df.columns:
        df = df.drop(columns=['Unnamed: 0'])

    # parse dates
    df['date'] = pd.to_datetime(df['date'])
    # sort
    df = df.sort_values(['crypto_name', 'date']).reset_index(drop=True)

    # FEATURES
    df['daily_return'] = df.groupby('crypto_name')['close'].pct_change()
    df['volatility_7d'] = df.groupby('crypto_name')['daily_return'].rolling(window=7).std().reset_index(0, drop=True)
    df['volatility_30d'] = df.groupby('crypto_name')['daily_return'].rolling(window=30).std().reset_index(0, drop=True)
    df['liquidity_ratio'] = df['volume'] / (df['marketCap'] + 1e-10)
    df['ma_7'] = df.groupby('crypto_name')['close'].transform(lambda x: x.rolling(window=7).mean())
    df['ma_30'] = df.groupby('crypto_name')['close'].transform(lambda x: x.rolling(window=30).mean())
    rolling_mean = df.groupby('crypto_name')['close'].transform(lambda x: x.rolling(window=20).mean())
    rolling_std = df.groupby('crypto_name')['close'].transform(lambda x: x.rolling(window=20).std())
    df['bollinger_upper'] = rolling_mean + 2 * rolling_std
    df['bollinger_lower'] = rolling_mean - 2 * rolling_std

    # Create a target: next-day volatility_30d (shift -1 per crypto). This predicts the next day's 30-day rolling vol.
    df['target_vol30_next'] = df.groupby('crypto_name')['volatility_30d'].shift(-1)

    # Drop rows with NaN in critical features or target
    keep_cols = ['open','high','low','close','volume','marketCap','crypto_name','date',
                 'daily_return','volatility_7d','volatility_30d','liquidity_ratio',
                 'ma_7','ma_30','bollinger_upper','bollinger_lower','target_vol30_next']
    df = df[keep_cols]
    df = df.dropna().reset_index(drop=True)

    Path(output_csv).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_csv, index=False)
    print(f"Saved processed data to {output_csv}. Rows: {len(df)}")
    return df

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data/raw.csv", help="raw csv path")
    parser.add_argument("--output", default="data_processed/processed.csv", help="processed output path")
    args = parser.parse_args()
    preprocess(args.input, args.output)


#3 EDA script — src/eda.py Generates and saves visualization PNGs: price+volatility, correlation heatmap, target distribution.
# src/eda.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import argparse
from pathlib import Path

def run_eda(processed_csv, out_dir="reports"):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    df = pd.read_csv(processed_csv, parse_dates=['date'])

    # Top 4 cryptos by rows
    top = df['crypto_name'].value_counts().nlargest(4).index.tolist()
    df_top = df[df['crypto_name'].isin(top)]

    # Price and vol plots
    for c in top:
        tmp = df_top[df_top['crypto_name']==c]
        fig, ax = plt.subplots(2,1, figsize=(12,8), sharex=True)
        ax[0].plot(tmp['date'], tmp['close'])
        ax[0].set_title(f'{c} - Close Price')
        ax[0].set_ylabel('Price')
        ax[1].plot(tmp['date'], tmp['volatility_30d'])
        ax[1].set_title(f'{c} - 30-day Volatility')
        ax[1].set_ylabel('Volatility')
        plt.tight_layout()
        plt.savefig(f"{out_dir}/{c}_price_vol.png")
        plt.close()

    # Correlation heatmap (numerics)
    num_cols = ['open','high','low','close','volume','marketCap','daily_return','volatility_7d','volatility_30d','liquidity_ratio','ma_7','ma_30']
    corr = df[num_cols].corr()
    plt.figure(figsize=(10,8))
    sns.heatmap(corr, annot=True, fmt=".2f")
    plt.title("Feature Correlation")
    plt.savefig(f"{out_dir}/correlation_heatmap.png")
    plt.close()

    # Target distribution
    plt.figure(figsize=(8,5))
    sns.histplot(df['target_vol30_next'], bins=100, kde=True)
    plt.title("Distribution of target (next-day vol30)")
    plt.savefig(f"{out_dir}/target_distribution.png")
    plt.close()
    print("EDA plots saved to", out_dir)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data_processed/processed.csv")
    parser.add_argument("--out", default="reports")
    args = parser.parse_args()
    run_eda(args.input, args.out)


# 4) Train classical models (RandomForest / XGBoost): src/train_model.py
# Trains a model pipeline (preprocessing + model), evaluates on a time-based split (80/20 by date), and saves the model and transformer.
# src/train_model.py
import argparse
import pandas as pd
from pathlib import Path
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import numpy as np
import joblib
from src.utils import regression_metrics, save_model, save_predictions

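def prepare_data(df):
    df['date'] = pd.to_datetime(df['date'])
    # Sort chronologically: TimeSeriesSplit (used in the search below) splits by row
    # order, and the processed CSV is ordered by crypto_name first, then date.
    df = df.sort_values(['date', 'crypto_name']).reset_index(drop=True)
    # split by date: training rows are those with date <= the 80th-percentile date
    cutoff = df['date'].quantile(0.8)
    train = df[df['date'] <= cutoff].copy()
    test  = df[df['date'] > cutoff].copy()
    return train, test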

def train(args):
    Path("models").mkdir(exist_ok=True)
    df = pd.read_csv(args.input, parse_dates=['date'])
    train, test = prepare_data(df)

    features = ['open','high','low','close','volume','marketCap','daily_return',
                'volatility_7d','volatility_30d','liquidity_ratio','ma_7','ma_30',
                'bollinger_upper','bollinger_lower','crypto_name']
    target = 'target_vol30_next'

    X_train = train[features]
    y_train = train[target]
    X_test = test[features]
    y_test = test[target]

    numeric_features = [c for c in features if c != 'crypto_name']
    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        # `sparse_output` replaced the old `sparse` argument in scikit-learn 1.2
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['crypto_name'])
    ])

    if args.model == 'rf':
        model = RandomForestRegressor(n_jobs=-1, random_state=42)
        param_dist = {
            "model__n_estimators": [100,200,400],
            "model__max_depth": [5,10,20, None],
            "model__min_samples_split": [2,5,10]
        }
    elif args.model == 'xgb':
        model = xgb.XGBRegressor(n_jobs=-1, random_state=42, objective='reg:squarederror')
        param_dist = {
            "model__n_estimators": [100,200,400],
            "model__max_depth": [3,6,10],
            "model__learning_rate": [0.01, 0.05, 0.1],
        }
    else:
        raise ValueError("model must be 'rf' or 'xgb'")

    pipe = Pipeline(steps=[('pre', preprocessor), ('model', model)])

    # Randomized search with TimeSeriesSplit
    tscv = TimeSeriesSplit(n_splits=3)
    search = RandomizedSearchCV(pipe, param_distributions=param_dist, n_iter=10,
                                cv=tscv, scoring='neg_root_mean_squared_error', n_jobs=-1, random_state=42, verbose=2)
    print("Starting RandomizedSearchCV...")
    search.fit(X_train, y_train)
    print("Best params:", search.best_params_)

    best = search.best_estimator_
    # evaluate
    y_pred = best.predict(X_test)
    metrics = regression_metrics(y_test, y_pred)
    print("Test metrics:", metrics)

    # Save model and also save predictions
    save_model(best, f"models/best_{args.model}.joblib")
    out_df = test[['date','crypto_name','close']].copy()
    out_df['y_true'] = y_test.values
    out_df['y_pred'] = y_pred
    save_predictions(out_df, f"models/predictions_{args.model}.csv")
    print("Saved model and predictions.")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data_processed/processed.csv")
    parser.add_argument("--model", default="rf", choices=['rf','xgb'])
    args = parser.parse_args()
    train(args)

# 5) Evaluate / Plot results (inside training above we save predictions).
You can load models/predictions_rf.csv and produce plots (time vs actual vs predicted). If you want a standalone evaluate.py, I can provide it — let me know.
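
As a starting point for such an evaluation, here is a minimal sketch that breaks the error down by cryptocurrency. It assumes the predictions file has the `crypto_name`, `y_true`, and `y_pred` columns written by the training script; the helper name `per_crypto_metrics` is hypothetical, and a small in-memory frame stands in for the CSV so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def per_crypto_metrics(preds: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: RMSE and MAE of y_pred vs y_true for each cryptocurrency."""
    err = preds["y_pred"] - preds["y_true"]
    grouped = pd.DataFrame({
        "crypto_name": preds["crypto_name"],
        "sq_err": err ** 2,
        "abs_err": err.abs(),
    }).groupby("crypto_name")
    return pd.DataFrame({
        "rmse": np.sqrt(grouped["sq_err"].mean()),
        "mae": grouped["abs_err"].mean(),
    })

# Stand-in for: pd.read_csv("models/predictions_rf.csv", parse_dates=["date"])
preds = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01", "2022-01-02"]),
    "crypto_name": ["Bitcoin", "Bitcoin", "XRP", "XRP"],
    "y_true": [0.03, 0.05, 0.10, 0.10],
    "y_pred": [0.04, 0.04, 0.10, 0.12],
})
report = per_crypto_metrics(preds)
print(report)
```

Plotting `y_true` and `y_pred` against `date` for one coin at a time, in the same way as the price and volatility plots earlier, gives the visual counterpart of this table.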

6) Hyperparameter tuning (already integrated in train_model.py)
RandomizedSearchCV was used with TimeSeriesSplit. If you want exhaustive GridSearchCV, swap to GridSearchCV but beware of runtime.
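
To make the cross-validation scheme concrete, the sketch below prints the folds that `TimeSeriesSplit(n_splits=3)` produces for twelve rows. The toy array is an assumption standing in for the training data. Because the splitter works purely on row order, the folds only respect time when the rows are sorted chronologically.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows standing in for training data that is already sorted by date.
X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on an expanding prefix and validates on the block that follows it.
folds = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in tscv.split(X)]
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```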

7) LSTM for time-series (optional) — src/train_lstm.py
This shows a simple sequence-based LSTM across all cryptos (sliding windows).

# src/train_lstm.py
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from pathlib import Path

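def create_sequences(X, y, window=30):
    # y is already shifted one day ahead (target_vol30_next), so the label for a
    # window is the target of its last row: y[i + window - 1].
    Xs, ys = [], []
    for i in range(len(X) - window + 1):
        Xs.append(X[i:(i+window)])
        ys.append(y[i+window-1])
    return np.array(Xs), np.array(ys)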

def train_lstm(processed_csv, model_out="models/lstm.h5", window=30, epochs=20, batch=64):
    df = pd.read_csv(processed_csv, parse_dates=['date'])
    # We'll build sequences per crypto and then concat
    feature_cols = ['close','volume','marketCap','daily_return','volatility_7d','volatility_30d','liquidity_ratio','ma_7','ma_30']
    X_all, y_all = [], []
    for name, g in df.groupby('crypto_name'):
        g = g.sort_values('date').reset_index(drop=True)
        if len(g) < window + 2:
            continue
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(g[feature_cols])
        y = g['target_vol30_next'].values
        Xs, ys = create_sequences(X_scaled, y, window=window)
        X_all.append(Xs)
        y_all.append(ys)
    if not X_all:
        raise RuntimeError("No crypto has enough rows for LSTM.")
    X = np.vstack(X_all)
    y = np.hstack(y_all)
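    # No shuffling: the first 80% of the stacked sequences are used for training and the rest for validation.
    # Sequences are stacked crypto by crypto, so this is a positional split, not a chronological one.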
    idx = int(0.8 * len(X))
    X_train, X_test = X[:idx], X[idx:]
    y_train, y_test = y[:idx], y[idx:]
    model = Sequential([
        LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=False),
        Dropout(0.2),
        Dense(32, activation='relu'),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    Path("models").mkdir(exist_ok=True)
    es = EarlyStopping(patience=5, restore_best_weights=True)
    model.fit(X_train, y_train, validation_data=(X_test,y_test), epochs=epochs, batch_size=batch, callbacks=[es])
    model.save(model_out)
    print("Saved LSTM model to", model_out)

if __name__ == "__main__":
    train_lstm("data_processed/processed.csv")

# 8) Inference script: src/predict.py
# Loads a saved model (models/best_rf.joblib) and makes predictions on a CSV input.

# src/predict.py
import pandas as pd
import argparse
import joblib

def predict(model_path, input_csv, output_csv):
    model = joblib.load(model_path)
    df = pd.read_csv(input_csv, parse_dates=['date'])
    features = ['open','high','low','close','volume','marketCap','daily_return',
                'volatility_7d','volatility_30d','liquidity_ratio','ma_7','ma_30',
                'bollinger_upper','bollinger_lower','crypto_name']
    X = df[features]
    preds = model.predict(X)
    df['pred_next_vol30'] = preds
    df.to_csv(output_csv, index=False)
    print("Saved predictions to", output_csv)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="models/best_rf.joblib")
    parser.add_argument("--input", default="data_processed/processed.csv")
    parser.add_argument("--output", default="models/predictions_inference.csv")
    args = parser.parse_args()
    predict(args.model, args.input, args.output)

#9. Streamlit simple app — src/streamlit_app.py A tiny UI so anyone can upload a CSV and get predictions (assumes models/best_rf.joblib exists).

# src/streamlit_app.py
import streamlit as st
import pandas as pd
import joblib
from io import StringIO

st.title("Cryptocurrency Volatility Predictor")

model = joblib.load("models/best_rf.joblib")

uploaded = st.file_uploader("Upload processed CSV (with features)", type=["csv"])
if uploaded:
    df = pd.read_csv(uploaded, parse_dates=['date'])
    features = ['open','high','low','close','volume','marketCap','daily_return',
                'volatility_7d','volatility_30d','liquidity_ratio','ma_7','ma_30',
                'bollinger_upper','bollinger_lower','crypto_name']
    X = df[features]
    preds = model.predict(X)
    df['pred_next_vol30'] = preds
    st.write("Predictions (first 20 rows):")
    st.dataframe(df[['date','crypto_name','close','pred_next_vol30']].head(20))
    csv = df.to_csv(index=False)
    st.download_button("Download predictions CSV", csv, "predictions.csv")
else:
    st.info("Upload the processed CSV (run preprocessing first).")

#10.  Simple Flask API — src/flask_api.py. Small REST API that accepts JSON with rows and returns predictions.

# src/flask_api.py
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("models/best_rf.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    df = pd.DataFrame(data)
    # expecting same feature columns as earlier
    features = ['open','high','low','close','volume','marketCap','daily_return',
                'volatility_7d','volatility_30d','liquidity_ratio','ma_7','ma_30',
                'bollinger_upper','bollinger_lower','crypto_name']
    X = df[features]
    preds = model.predict(X)
    return jsonify({"predictions": preds.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

#11. Documentation templates
HLD.md (High Level Design)
# High-Level Design (HLD)

## Purpose
Predict next-day 30-day rolling volatility for multiple cryptocurrencies.

## Components
- Data Ingestion: raw CSV (OHLC, volume, market cap)
- Preprocessing: cleaning, datetime parse, feature engineering
- Model Training: RandomForest/XGBoost or LSTM
- Serving: Streamlit UI and Flask API
- Monitoring: Save predictions/log metrics

## Data Flow
raw.csv -> data_preprocess.py -> processed.csv -> train_model.py -> models/* -> streamlit / flask (inference)

LLD.md (Low Level Design)
# Low-Level Design (LLD)

## Preprocessing
- Columns created:
  - daily_return = pct_change(close)
  - volatility_7d, volatility_30d = rolling std dev of daily_return
  - liquidity_ratio = volume / marketCap
  - ma_7, ma_30 = rolling means
  - bollinger_upper/lower = 20-day mean ± 2*std
- Target:
  - target_vol30_next = volatility_30d.shift(-1) per crypto
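
The per-crypto shift can be illustrated on a toy frame (the values are made up for this sketch). Each row receives the following day's `volatility_30d` from the same coin, and the last row of every coin has no target, which is why it is dropped later.

```python
import pandas as pd

toy = pd.DataFrame({
    "crypto_name": ["Bitcoin", "Bitcoin", "Bitcoin", "XRP", "XRP"],
    "volatility_30d": [0.01, 0.02, 0.03, 0.10, 0.20],
})
# shift(-1) inside each group: the value never crosses from one coin to the next.
toy["target_vol30_next"] = toy.groupby("crypto_name")["volatility_30d"].shift(-1)
print(toy)
```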

## Model Training
- Pipeline:
  - ColumnTransformer (StandardScaler for numeric, OneHotEncoder for crypto_name)
  - Model: RandomForestRegressor or XGBRegressor
- Time-based train/test split:
  - cutoff = 80th percentile date across dataset

## Deployment
- Save trained Pipeline via joblib
- Streamlit for UI, Flask for API


#12. README.md (quick run steps)

# Crypto Volatility Prediction

1. Install dependencies:
   `pip install -r requirements.txt`

2. Place raw CSV as `data/raw.csv`

3. Preprocess:
   `python src/data_preprocess.py --input data/raw.csv --output data_processed/processed.csv`

4. EDA:
   `python src/eda.py --input data_processed/processed.csv --out reports`

5. Train model (RandomForest example). Run it as a module from the project root so that `from src.utils import ...` resolves:
   `python -m src.train_model --input data_processed/processed.csv --model rf`

6. Predict:
   `python src/predict.py --model models/best_rf.joblib --input data_processed/processed.csv --output models/predictions_inference.csv`

7. Run Streamlit:
   `streamlit run src/streamlit_app.py`

8. Run Flask API:
   `python src/flask_api.py`
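
Once the API is running, a request can be built with the standard library alone. The sketch below assumes the service listens on `http://localhost:5000/predict` (the port used in `flask_api.py`); the helper name `build_predict_request` and the zero placeholder feature values are assumptions for illustration.

```python
import json
import urllib.request

FEATURES = ['open', 'high', 'low', 'close', 'volume', 'marketCap', 'daily_return',
            'volatility_7d', 'volatility_30d', 'liquidity_ratio', 'ma_7', 'ma_30',
            'bollinger_upper', 'bollinger_lower', 'crypto_name']

def build_predict_request(rows, url="http://localhost:5000/predict"):
    """Hypothetical helper: wrap feature rows in a JSON POST request for /predict."""
    missing = sorted({f for row in rows for f in FEATURES if f not in row})
    if missing:
        raise ValueError(f"missing feature(s): {missing}")
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

# Placeholder values only; real rows come from the processed CSV.
row = {name: 0.0 for name in FEATURES if name != "crypto_name"}
row["crypto_name"] = "Bitcoin"
req = build_predict_request([row])

# With the API running locally, send the request like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The body is a JSON list of row objects, which matches how the endpoint turns the payload into a DataFrame.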
