# convert_sample_to_cryptoMamba_format.ipynb

This notebook preprocesses a sample Bitcoin dataset (sourced from Kaggle) and converts it into the format required by the original CryptoMamba project. The preprocessing converts dates to UNIX timestamps, sorts the rows chronologically, and splits them by date into training, validation, and test sets.

Two versions of the dataset are prepared:

## Version 1 – Original split (used for baseline experiments)

### Dataset Info:
- Train interval: `2013-04-29` to `2017-04-28`
- Validation interval: `2017-04-29` to `2018-04-28`
- Test interval: `2018-04-29` to `2019-04-28`

### Output files:
- train_bitcoin_v1.csv
- val_bitcoin_v1.csv
- test_bitcoin_v1.csv
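
The Version 1 intervals follow from a 4-year / 1-year / 1-year split counted from the first date in the dataset. The sketch below reproduces that arithmetic with `pd.DateOffset`. It assumes the first row is dated `2013-04-29` (taken from the train interval listed above), and the `intervals` dictionary is a hypothetical name used only for illustration. Each split ends the day before the next one begins, so the splits do not share a date.

```python
import pandas as pd

# Assumption: the first row of the dataset is dated 2013-04-29,
# as implied by the train interval listed above.
start_date = pd.Timestamp("2013-04-29")

# Exclusive upper bounds for each split (4 years / 1 year / 1 year).
train_end = start_date + pd.DateOffset(years=4)
val_end = train_end + pd.DateOffset(years=1)
test_end = val_end + pd.DateOffset(years=1)

# Convert the exclusive bounds into inclusive first/last dates.
one_day = pd.Timedelta(days=1)
intervals = {
    "train": (start_date, train_end - one_day),
    "val": (train_end, val_end - one_day),
    "test": (val_end, test_end - one_day),
}

for name, (first, last) in intervals.items():
    print(f"{name}: {first.date()} to {last.date()}")
```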

## Purpose: 
Convert raw sample data into CryptoMamba’s expected format to enable consistent and reproducible model training and evaluation across different dataset versions.

In [1]:
import pandas as pd

# Step 1: Load the raw Kaggle Bitcoin dataset
df = pd.read_csv('coin_Bitcoin.csv')

# Step 2: Select only the required columns
df = df[['Open', 'High', 'Low', 'Close', 'Volume', 'Date']].copy()

# Step 3: Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'])

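# Step 4: Add UNIX timestamp column (in seconds)
# Subtracting the epoch avoids the deprecated Series.view and does not depend on the datetime resolution
df['Timestamp'] = (df['Date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)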

# Step 5: Sort chronologically for proper splitting
df = df.sort_values('Date').reset_index(drop=True)

# Step 6: Define 4y/1y/1y split boundaries based on date
start_date = df['Date'].min()
train_end = start_date + pd.DateOffset(years=4)
val_end = train_end + pd.DateOffset(years=1)
test_end = val_end + pd.DateOffset(years=1)

# Step 7: Split into train, val, and test sets
df_train = df[df['Date'] < train_end].copy()
df_val   = df[(df['Date'] >= train_end) & (df['Date'] < val_end)].copy()
df_test  = df[(df['Date'] >= val_end) & (df['Date'] < test_end)].copy()

# Step 8: Drop 'Date' column (model uses only numeric inputs)
for split in [df_train, df_val, df_test]:
    split.drop(columns='Date', inplace=True)

# Step 9: Save train, val, and test datasets
df_train.to_csv('train_bitcoin_v1.csv')
df_val.to_csv('val_bitcoin_v1.csv')
df_test.to_csv('test_bitcoin_v1.csv')
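
The epoch-seconds conversion in Step 4 can be checked on a couple of hand-picked dates. The sketch below is a minimal illustration, not part of the original pipeline: it computes whole seconds since `1970-01-01` by subtraction, which does not depend on the Series' internal datetime resolution, and then converts back to confirm the round trip. The sample dates are made up for the example and are treated as timezone-naive.

```python
import pandas as pd

# Two illustrative, timezone-naive datetimes (not read from the dataset).
dates = pd.Series(pd.to_datetime(["2013-04-29 23:59:59", "2017-01-01 00:00:00"]))

# Whole seconds elapsed since the UNIX epoch.
epoch = pd.Timestamp("1970-01-01")
timestamps = (dates - epoch) // pd.Timedelta(seconds=1)

# Round trip: interpreting the integers as seconds should give the dates back.
restored = pd.to_datetime(timestamps, unit="s")
round_trip_ok = bool((restored == dates).all())

print(timestamps.tolist())
print("round trip ok:", round_trip_ok)
```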




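## Version 2 – Fixed range with 30-day validation and test windows

### Dataset Info:
- Train interval: `2017-01-01` to `2019-02-27`
- Validation interval: `2019-02-28` to `2019-03-29`
- Test interval: `2019-03-30` to `2019-04-28`
- The validation and test files each also include the 14 days before their interval as lookback.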
### Output files:
- train_bitcoin_v2.csv
- val_bitcoin_v2.csv
- test_bitcoin_v2.csv
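
The Version 2 boundaries are derived by counting backwards from the end date: a 30-day test window, a 30-day validation window directly before it, and a 14-day lookback before each. The sketch below works through that arithmetic. `split_windows` and its parameter names are hypothetical helpers introduced here for illustration; they are not part of CryptoMamba or of this notebook's code.

```python
import pandas as pd

def split_windows(end_date, eval_days=30, lookback_days=14):
    """Return inclusive window dates counted backwards from end_date.

    Hypothetical helper for illustration only.
    """
    one_day = pd.Timedelta(days=1)
    test_end = pd.Timestamp(end_date)
    # An inclusive window of N days spans N - 1 day-steps.
    test_start = test_end - pd.Timedelta(days=eval_days - 1)
    val_end = test_start - one_day
    val_start = val_end - pd.Timedelta(days=eval_days - 1)
    return {
        "train_end": val_start - one_day,
        "val": (val_start, val_end),
        "val_lookback_start": val_start - pd.Timedelta(days=lookback_days),
        "test": (test_start, test_end),
        "test_lookback_start": test_start - pd.Timedelta(days=lookback_days),
    }

windows = split_windows("2019-04-28")
for key, value in windows.items():
    print(key, value)
```

Keeping the window lengths as parameters makes it straightforward to try other evaluation or lookback lengths without editing each boundary by hand.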

In [2]:
# Step 1: Load the raw Kaggle Bitcoin dataset
df = pd.read_csv('coin_Bitcoin.csv')

# Step 2: Select only the required columns
df = df[['Open', 'High', 'Low', 'Close', 'Volume', 'Date']].copy()

# Step 3: Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'])

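# Step 4: Add UNIX timestamp column (in seconds)
# Subtracting the epoch avoids the deprecated Series.view and does not depend on the datetime resolution
df['Timestamp'] = (df['Date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)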

# Step 5: Sort chronologically for proper splitting
df = df.sort_values('Date').reset_index(drop=True)

# Step 6: Override the dataset's actual date range
start_date = pd.Timestamp('2017-01-01')
end_date = pd.Timestamp('2019-04-28')
print(f"Specified date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

# Step 7: Filter the dataset to the specified date range
df = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

# Step 8: Define fixed dates for the splits
# Testing period: last 30 days of the specified range
test_end = end_date
test_start = test_end - pd.Timedelta(days=29)  # 30 days inclusive
test_lookback_start = test_start - pd.Timedelta(days=14)  # 14-day lookback

# Validation period: 30 days before testing
val_end = test_start - pd.Timedelta(days=1)
val_start = val_end - pd.Timedelta(days=29)  # 30 days inclusive
val_lookback_start = val_start - pd.Timedelta(days=14)  # 14-day lookback

# Training period: from start_date to before validation start date
train_start = start_date
train_end = val_start - pd.Timedelta(days=1)

# Step 9: Split into train, val, and test sets
df_train = df[(df['Date'] >= train_start) & (df['Date'] <= train_end)].copy()
df_val = df[(df['Date'] >= val_lookback_start) & (df['Date'] <= val_end)].copy()
df_test = df[(df['Date'] >= test_lookback_start) & (df['Date'] <= test_end)].copy()

# Step 10: Drop 'Date' column (model uses only numeric inputs)
for split in [df_train, df_val, df_test]:
    split.drop(columns='Date', inplace=True)

# Step 11: Save train, val, and test datasets
df_train.to_csv('train_bitcoin_v2.csv', index=True)
df_val.to_csv('val_bitcoin_v2.csv', index=True)
df_test.to_csv('test_bitcoin_v2.csv', index=True)

# Print the date ranges and sizes to verify
print(f"Training period: {train_start.strftime('%Y-%m-%d')} to {train_end.strftime('%Y-%m-%d')} ({df_train.shape[0]} rows)")
print(f"Validation period (with lookback): {val_lookback_start.strftime('%Y-%m-%d')} to {val_end.strftime('%Y-%m-%d')} ({df_val.shape[0]} rows)")
print(f"   - Lookback: {val_lookback_start.strftime('%Y-%m-%d')} to {(val_start - pd.Timedelta(days=1)).strftime('%Y-%m-%d')}")
print(f"   - Actual validation: {val_start.strftime('%Y-%m-%d')} to {val_end.strftime('%Y-%m-%d')}")
print(f"Testing period (with lookback): {test_lookback_start.strftime('%Y-%m-%d')} to {test_end.strftime('%Y-%m-%d')} ({df_test.shape[0]} rows)")
print(f"   - Lookback: {test_lookback_start.strftime('%Y-%m-%d')} to {(test_start - pd.Timedelta(days=1)).strftime('%Y-%m-%d')}")
print(f"   - Actual testing: {test_start.strftime('%Y-%m-%d')} to {test_end.strftime('%Y-%m-%d')}")

Specified date range: 2017-01-01 to 2019-04-28
Training period: 2017-01-01 to 2019-02-27 (787 rows)
Validation period (with lookback): 2019-02-14 to 2019-03-29 (43 rows)
   - Lookback: 2019-02-14 to 2019-02-27
   - Actual validation: 2019-02-28 to 2019-03-29
Testing period (with lookback): 2019-03-16 to 2019-04-28 (43 rows)
   - Lookback: 2019-03-16 to 2019-03-29
   - Actual testing: 2019-03-30 to 2019-04-28
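
The row counts in this output are each one short of the calendar length of their window: 787 rows for a 788-day training interval, and 43 rows where 14 + 30 = 44 days are expected. One possible explanation, which is an assumption and not confirmed by the notebook, is that the raw `Date` values carry a time of day. In that case an inclusive comparison such as `df['Date'] <= train_end` against a midnight boundary would exclude the row for the boundary day itself. The sketch below demonstrates the effect on synthetic dates stamped at 23:59:59 and shows that comparing normalized dates keeps the boundary day.

```python
import pandas as pd

# Synthetic daily rows stamped at 23:59:59 (an assumption about the raw data).
dates = pd.Series(pd.date_range("2019-02-25 23:59:59", periods=5, freq="D"))
boundary = pd.Timestamp("2019-02-27")  # midnight at the start of the day

# Comparing raw datetimes: the 2019-02-27 23:59:59 row falls after the boundary.
raw_count = int((dates <= boundary).sum())

# Comparing calendar dates: normalize() resets the time of day to midnight.
normalized_count = int((dates.dt.normalize() <= boundary).sum())

print(raw_count, normalized_count)
```

If the raw data does behave this way, comparing against `df['Date'].dt.normalize()` in the split filters would be one way to make the inclusive bounds match the documented intervals; checking `df['Date'].dt.time.unique()` on the loaded data would confirm or rule out the assumption.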


