# Data Cleaning and Preprocessing

## Objective
This notebook cleans and preprocesses the raw Bitcoin (BTC-USD) price data.
It removes the extra header rows left in the index by multi-level columns,
converts the index to datetime, coerces the values to numeric types, and saves
a clean time series of closing price and volume for exploratory data analysis
and forecasting.
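
To make the multi-level column issue concrete, here is a minimal sketch on a tiny in-memory sample. It assumes the raw file was written from a DataFrame with two-level columns, so that a `Ticker` row and a `Date` row follow the first header line. That layout is inferred from the rows the cleaning cell filters out, and the numbers are made up for illustration, not real prices.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the assumed raw layout (illustrative values only).
raw_text = """Price,Close,High,Low,Open,Volume
Ticker,BTC-USD,BTC-USD,BTC-USD,BTC-USD,BTC-USD
Date,,,,,
2019-01-01,100.0,102.0,99.0,99.5,1000
2019-01-02,101.5,103.0,100.0,100.0,1200
"""

# Reading with a single header row leaves the extra header rows in the index.
flat = pd.read_csv(io.StringIO(raw_text), index_col=0)
print(flat.index.tolist())

# Same steps as the cleaning cell, applied to the sample.
cleaned = flat[~flat.index.isin(["Date", "Ticker"])].copy()
cleaned.index = pd.to_datetime(cleaned.index, errors="coerce")
cleaned = cleaned[~cleaned.index.isna()]
cleaned = cleaned.apply(pd.to_numeric, errors="coerce")
cleaned.index.name = "Date"  # optional: give the index a descriptive name
print(cleaned.dtypes)
```

Because the `Ticker` row holds strings, every column is read as `object` at first, which is why the numeric conversion step is needed after the header rows are dropped.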


In [1]:
import pandas as pd

# Load raw dataset without parsing dates: the index still contains header rows
df = pd.read_csv("../data/raw/btc_usd_raw.csv", index_col=0)

# Drop the leftover header rows ("Date", "Ticker") from the index
df = df[~df.index.isin(["Date", "Ticker"])]

# Convert index to datetime (let pandas infer format)
df.index = pd.to_datetime(df.index, errors="coerce")

# Drop rows where datetime conversion failed
df = df[~df.index.isna()]

# Convert all columns to numeric
df = df.apply(pd.to_numeric, errors="coerce")

# Keep only required columns
df = df[['Close', 'Volume']]

# Final verification
df.info()  # prints the summary itself and returns None, so no print() is needed

# Save cleaned dataset
df.to_csv("../data/processed/btc_usd_cleaned.csv")

print("Cleaned data saved successfully.")


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2567 entries, 2019-01-01 to 2026-01-10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   2567 non-null   float64
 1   Volume  2567 non-null   int64  
dtypes: float64(1), int64(1)
memory usage: 60.2 KB
Cleaned data saved successfully.
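
The summary above shows 2567 entries, which equals the number of calendar days from 2019-01-01 to 2026-01-10 inclusive. That suggests, but does not prove, that the daily series has no gaps. A minimal sketch of an explicit check follows. The helper name `check_daily_series` is hypothetical, and the sample frame is synthetic.

```python
import pandas as pd

def check_daily_series(frame: pd.DataFrame) -> dict:
    """Summarise basic integrity checks for a frame indexed by daily timestamps."""
    idx = frame.index
    # Every calendar day between the first and last timestamp.
    expected = pd.date_range(idx.min(), idx.max(), freq="D")
    return {
        "sorted": bool(idx.is_monotonic_increasing),
        "duplicate_dates": int(idx.duplicated().sum()),
        "missing_dates": int(len(expected.difference(idx))),
        "missing_values": int(frame.isna().sum().sum()),
    }

# Synthetic frame with one missing day (2019-01-03), illustrative values only.
sample = pd.DataFrame(
    {"Close": [100.0, 101.5, 99.8], "Volume": [10, 12, 11]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04"]),
)
report = check_daily_series(sample)
print(report)
```

To run the same check on the saved file, the cleaned CSV can be reloaded with `pd.read_csv("../data/processed/btc_usd_cleaned.csv", index_col=0, parse_dates=True)` and passed to the helper.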
