# convert_sample_to_cryptoMamba_format.ipynb

This notebook preprocesses a sample Bitcoin dataset (sourced from Kaggle) into the structure used by the original CryptoMamba project. It parses the `Date` column, adds a UNIX `Timestamp` column (in seconds), sorts the rows chronologically, and splits them by time into training, validation, and test sets covering 4 years, 1 year, and 1 year, measured from the earliest date in the file. Rows dated six or more years after that first date are not written to any split. The resulting CSV files are intended to be compatible with CryptoMamba’s data pipeline and the `mode_1.yaml` configuration.

**Output files:**
- `train_bitcoin_sample.csv`
- `val_bitcoin_sample.csv`
- `test_bitcoin_sample.csv`

**Purpose:** Convert raw sample data into the expected CryptoMamba format for model training and evaluation.
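
As a small illustration of the timestamp conversion described above, the sketch below turns two date strings into UNIX seconds. The two sample rows and their `YYYY-MM-DD HH:MM:SS` format are assumptions made for the example, not values read from the dataset.

```python
import pandas as pd

# Hypothetical two-row sample; the date format is an assumption for illustration.
sample = pd.DataFrame({"Date": ["2013-04-29 23:59:59", "2013-04-30 23:59:59"]})
sample["Date"] = pd.to_datetime(sample["Date"])

# Whole seconds elapsed since the UNIX epoch (dates are treated as UTC).
epoch = pd.Timestamp("1970-01-01")
sample["Timestamp"] = (sample["Date"] - epoch) // pd.Timedelta(seconds=1)

print(sample)  # first row's Timestamp is 1367279999; the second is 86400 larger
```

Subtracting the epoch and floor-dividing by one second avoids relying on the internal storage unit of the datetime column.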

In [None]:
import pandas as pd

# Step 1: Load the raw Kaggle Bitcoin dataset
df = pd.read_csv('coin_Bitcoin.csv')

# Step 2: Select only the required columns
df = df[['Open', 'High', 'Low', 'Close', 'Volume', 'Date']].copy()

# Step 3: Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'])

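# Step 4: Add UNIX timestamp column (in seconds).
# Series.view is deprecated in recent pandas; subtracting the epoch works for any datetime resolution.
df['Timestamp'] = (df['Date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)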

# Step 5: Sort chronologically for proper splitting
df = df.sort_values('Date').reset_index(drop=True)

# Step 6: Define 4y/1y/1y split boundaries based on date
start_date = df['Date'].min()
train_end = start_date + pd.DateOffset(years=4)
val_end = train_end + pd.DateOffset(years=1)
test_end = val_end + pd.DateOffset(years=1)

# Step 7: Split into train, val, and test sets
df_train = df[df['Date'] < train_end].copy()
df_val   = df[(df['Date'] >= train_end) & (df['Date'] < val_end)].copy()
df_test  = df[(df['Date'] >= val_end) & (df['Date'] < test_end)].copy()

# Step 8: Drop 'Date' column (model uses only numeric inputs)
for split in [df_train, df_val, df_test]:
    split.drop(columns='Date', inplace=True)

# Step 9: Save train, val, and test datasets
df_train.to_csv('train_bitcoin_sample.csv')
df_val.to_csv('val_bitcoin_sample.csv')
df_test.to_csv('test_bitcoin_sample.csv')
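
After the split, it can be worth checking that the three sets contain the expected columns, are sorted, and do not overlap in time. The following is a minimal sketch of such a check, run here on synthetic daily data. The helper name `check_splits` and the constant `EXPECTED_COLUMNS` are hypothetical and are not part of CryptoMamba.

```python
import pandas as pd

# Columns written by the notebook above (hypothetical constant name).
EXPECTED_COLUMNS = ["Open", "High", "Low", "Close", "Volume", "Timestamp"]

def check_splits(train, val, test):
    """Hypothetical helper: raise AssertionError if the splits look inconsistent."""
    for name, part in [("train", train), ("val", val), ("test", test)]:
        missing = [c for c in EXPECTED_COLUMNS if c not in part.columns]
        assert not missing, f"{name}: missing columns {missing}"
        assert len(part) > 0, f"{name}: split is empty"
        assert part["Timestamp"].is_monotonic_increasing, f"{name}: not sorted by time"
    # Each split must end strictly before the next one begins.
    assert train["Timestamp"].max() < val["Timestamp"].min(), "train overlaps val"
    assert val["Timestamp"].max() < test["Timestamp"].min(), "val overlaps test"
    return True

# Synthetic daily data (constant prices) spanning a little over six years.
dates = pd.date_range("2014-01-01", periods=6 * 366, freq="D")
demo = pd.DataFrame({"Date": dates})
for col in ["Open", "High", "Low", "Close", "Volume"]:
    demo[col] = 1.0
demo["Timestamp"] = (demo["Date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Same half-open 4y/1y/1y boundaries as in the notebook.
train_end = dates[0] + pd.DateOffset(years=4)
val_end = train_end + pd.DateOffset(years=1)
test_end = val_end + pd.DateOffset(years=1)
train = demo[demo["Date"] < train_end]
val = demo[(demo["Date"] >= train_end) & (demo["Date"] < val_end)]
test = demo[(demo["Date"] >= val_end) & (demo["Date"] < test_end)]

check_splits(train, val, test)
print(len(train), len(val), len(test))
```

To apply the same check to the real output, the three CSV files could be loaded with `pd.read_csv` and passed to `check_splits` in the same order.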
