# Crypto Currency Future Price Forecast ETL

## Objectives

Purpose of this ETL:
- Prepare a clean, consistent dataset for multi-coin analysis and forecasting. The top coins I chose are BTC, DOGE, ETH, HBAR, QNT, SOL, XDC, XLM and XRP.

What I will do:
1) Load each coin CSV from `DataSet/Raw/`
2) Look at the head, shape and info of each file to understand what it contains
3) Standardise column names (Date, Open, High, Low, Close; the raw files usually have no Volume column) and fix dtypes where needed
4) Remove duplicate (Symbol, Date) rows and handle missing values safely
5) Save the cleaned, combined CSV to `DataSet/Cleaned/`


## Inputs
- Raw Data: `DataSet/Raw/BTC.csv`
- Raw Data: `DataSet/Raw/DOGE.csv`
- Raw Data: `DataSet/Raw/ETH.csv`
- Raw Data: `DataSet/Raw/HBAR.csv`
- Raw Data: `DataSet/Raw/QNT.csv`
- Raw Data: `DataSet/Raw/SOL.csv`
- Raw Data: `DataSet/Raw/XDC.csv`
- Raw Data: `DataSet/Raw/XLM.csv`
- Raw Data: `DataSet/Raw/XRP.csv`

## Outputs
- Cleaned Data: `DataSet/Cleaned/crypto_clean.csv`



## Additional Comments
- This section was assisted by AI (ChatGPT-4) to help write a robust CSV loader function as I was encountering load errors due to inconsistent CSV formats from different sources. I provided the AI with examples of the different CSV formats and it generated a function that could handle these variations. I then reviewed and tested the function to ensure it worked correctly with my data.
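
The loader described above is not reproduced in this notebook. As a rough illustration only, a tolerant loader could look something like the sketch below. The names `load_price_csv`, `ALIASES` and `NEEDED` are hypothetical, and the alias list is made up for the example; it is not the set of formats actually found in the raw files.

```python
import io
import pandas as pd

# Hypothetical header aliases, for illustration only.
ALIASES = {"timestamp": "date", "time": "date"}
NEEDED = ["date", "open", "high", "low", "close"]

def load_price_csv(source):
    """Read a price CSV whose delimiter and header spelling may vary."""
    # sep=None with the python engine asks pandas to sniff the delimiter.
    df = pd.read_csv(source, sep=None, engine="python")
    # Normalise header case and whitespace, then map known aliases.
    df.columns = [str(c).strip().lower() for c in df.columns]
    df = df.rename(columns=ALIASES)
    missing = [c for c in NEEDED if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[NEEDED]

# A semicolon-separated sample with differently spelled headers.
sample = io.StringIO(
    "Ticker;Timestamp;Open;High;Low;Close\n"
    "BTC;2010-07-17;0.04951;0.04951;0.04951;0.04951\n"
)
demo = load_price_csv(sample)
print(demo.columns.tolist())
```

Raising a `ValueError` with the list of missing columns makes a format problem visible at load time instead of later in the cleaning steps.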





---

# Change working directory

The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from the notebook folder to its parent folder (the project root).
* os.getcwd() returns the current directory

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Nine\\OneDrive\\Documents\\VS Code Projects\\Crypto-Currency-Future-Price-Forecast\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Nine\\OneDrive\\Documents\\VS Code Projects\\Crypto-Currency-Future-Price-Forecast'
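
One thing to watch: re-running the `os.chdir` cell moves up one more level each time. A minimal sketch of a guarded version follows, assuming the notebooks always live in a folder named `jupyter_notebooks` (as in the path printed earlier). `project_root` is a hypothetical helper, not part of the original notebook.

```python
import os
from pathlib import Path

def project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_dirname else cwd

# Pure path logic, shown on made-up paths:
print(project_root("/home/user/project/jupyter_notebooks"))
print(project_root("/home/user/project"))

# In the notebook this line would replace the unconditional os.chdir call:
# os.chdir(project_root(os.getcwd()))
```

Because the function returns the directory unchanged when it is already the project root, running the cell twice has no further effect.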

# Section 1

Import libraries

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

---

# Section 2

Load the raw data and start the ETL process.

---

In [15]:
RAW_DIR   = Path("DataSet/Raw")
CLEAN_DIR = Path("DataSet/Cleaned")
SYMBOLS   = ["BTC","DOGE","ETH","HBAR","QNT","SOL","XDC","XLM","XRP"]

print("Raw:", RAW_DIR.resolve())
print("Clean:", CLEAN_DIR.resolve())
print("Raw CSVs found:", sorted(p.name for p in RAW_DIR.glob("*.csv")))

Raw: C:\Users\Nine\OneDrive\Documents\VS Code Projects\Crypto-Currency-Future-Price-Forecast\DataSet\Raw
Clean: C:\Users\Nine\OneDrive\Documents\VS Code Projects\Crypto-Currency-Future-Price-Forecast\DataSet\Cleaned
Raw CSVs found: ['BTC.csv', 'DOGE.csv', 'ETH.csv', 'HBAR.csv', 'QNT.csv', 'SOL.csv', 'XDC.csv', 'XLM.csv', 'XRP.csv']


---

In [16]:
# Load each CSV into a DataFrame, store in list

frames = []

for sym in SYMBOLS:
    path = RAW_DIR / f"{sym}.csv"
    assert path.exists(), f"Missing file: {path.name}"

    df = pd.read_csv(path)
    df.columns = [c.strip().lower() for c in df.columns]   # normalise headers

    # the raw files usually contain: ticker, date, open, high, low, close (no volume)
    needed = ["date","open","high","low","close"]
    assert all(c in df.columns for c in needed), f"{path.name} is missing one of {needed}"

    df = df[needed].copy()
    df = df.rename(columns={
        "date":"Date", "open":"Open", "high":"High", "low":"Low", "close":"Close"
    })

    # add Symbol from filename
    df.insert(0, "Symbol", sym)

    frames.append(df)

print("Loaded tables:", len(frames))
frames[0].head()


Loaded tables: 9


| | Symbol | Date | Open | High | Low | Close |
|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 |
| 3 | BTC | 2010-07-20 | 0.0808 | 0.08181 | 0.07426 | 0.07474 |
| 4 | BTC | 2010-07-21 | 0.07474 | 0.07921 | 0.06634 | 0.07921 |


In [17]:
# Combine files and clean data

combined = pd.concat(frames, ignore_index=True)

# Types
combined["Date"] = pd.to_datetime(combined["Date"], errors="coerce")
for c in ["Open","High","Low","Close"]:
    combined[c] = pd.to_numeric(combined[c], errors="coerce")

# Drop rows without a Date or with all OHLC missing
combined = combined.dropna(subset=["Date"])
all_na = combined[["Open","High","Low","Close"]].isna().all(axis=1)
combined = combined.loc[~all_na].copy()

# Remove impossible prices (<= 0)
bad = (combined[["Open","High","Low","Close"]] <= 0).any(axis=1)
combined = combined.loc[~bad].copy()

# De-duplicate by (Symbol, Date) and sort
before = len(combined)
combined = (combined
            .sort_values(["Symbol","Date"])
            .drop_duplicates(["Symbol","Date"])
            .reset_index(drop=True))
after = len(combined)

print("Combined shape:", combined.shape, "| Duplicates removed:", before - after)
combined.head()


Combined shape: (27898, 6) | Duplicates removed: 0


| | Symbol | Date | Open | High | Low | Close |
|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 |
| 3 | BTC | 2010-07-20 | 0.0808 | 0.08181 | 0.07426 | 0.07474 |
| 4 | BTC | 2010-07-21 | 0.07474 | 0.07921 | 0.06634 | 0.07921 |
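
To see what each cleaning rule removes, the same steps can be run on a tiny made-up frame. This is only a sketch: `clean_prices` is a hypothetical wrapper around the steps of the cleaning cell, and the `AAA` rows are invented test data, not real prices.

```python
import pandas as pd

PRICE_COLS = ["Open", "High", "Low", "Close"]

def clean_prices(df):
    """Apply the notebook's cleaning rules to any Symbol/Date/OHLC frame."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"], errors="coerce")   # bad dates become NaT
    for c in PRICE_COLS:
        out[c] = pd.to_numeric(out[c], errors="coerce")          # bad numbers become NaN
    out = out.dropna(subset=["Date"])                            # drop rows without a date
    out = out.loc[~out[PRICE_COLS].isna().all(axis=1)]           # drop rows with no prices at all
    out = out.loc[~(out[PRICE_COLS] <= 0).any(axis=1)]           # drop non-positive prices
    return (out.sort_values(["Symbol", "Date"])
               .drop_duplicates(["Symbol", "Date"])
               .reset_index(drop=True))

# Invented rows: a duplicate, an unparseable date, a zero price, numbers stored as text.
toy = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "AAA", "AAA"],
    "Date":   ["2024-01-01", "2024-01-01", "not a date", "2024-01-02", "2024-01-03"],
    "Open":   [1.0, 1.0, 2.0, 0.0, "3.5"],
    "High":   [1.2, 1.2, 2.0, 1.0, "4.0"],
    "Low":    [0.9, 0.9, 2.0, 0.0, "3.0"],
    "Close":  [1.1, 1.1, 2.0, 0.5, "3.8"],
})
cleaned = clean_prices(toy)
print(cleaned)
```

Of the five input rows, two should survive: the first 2024-01-01 row and the 2024-01-03 row, whose text prices are converted to numbers.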


In [18]:
# Basic quality check

print("Coins:", sorted(combined["Symbol"].unique()))
print("Rows per coin (and date range):")
display(
    combined.groupby("Symbol")
            .agg(first=("Date","min"), last=("Date","max"), rows=("Date","count"))
            .reset_index()
)

print("Any remaining NaN counts (top 6 cols):")
combined[["Symbol","Date","Open","High","Low","Close"]].isna().sum()



Coins: ['BTC', 'DOGE', 'ETH', 'HBAR', 'QNT', 'SOL', 'XDC', 'XLM', 'XRP']
Rows per coin (and date range):


| | Symbol | first | last | rows |
|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 2025-10-14 | 5521 |
| 1 | DOGE | 2016-07-01 | 2025-10-14 | 3345 |
| 2 | ETH | 2015-08-07 | 2025-10-14 | 3674 |
| 3 | HBAR | 2019-09-20 | 2025-10-14 | 2169 |
| 4 | QNT | 2019-02-05 | 2025-10-14 | 2396 |
| 5 | SOL | 2020-04-10 | 2025-10-14 | 1966 |
| 6 | XDC | 2020-04-02 | 2025-10-14 | 1813 |
| 7 | XLM | 2017-01-17 | 2025-10-14 | 3145 |
| 8 | XRP | 2015-01-21 | 2025-10-14 | 3869 |


Any remaining NaN counts (top 6 cols):


Symbol    0
Date      0
Open      0
High      0
Low       0
Close     0
dtype: int64
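
The check above reports row counts and date ranges, but not whether any days are missing inside a range, which matters for rolling-window features. A small sketch of such a gap check follows, assuming the data is meant to have one row per calendar day. `missing_days` is a hypothetical helper and the dates are made up.

```python
import pandas as pd

def missing_days(dates):
    """Count calendar days absent between the first and last date."""
    dates = pd.to_datetime(pd.Series(dates)).dt.normalize().drop_duplicates()
    full = pd.date_range(dates.min(), dates.max(), freq="D")
    return len(full) - len(dates)

toy_dates = ["2024-01-01", "2024-01-02", "2024-01-05"]
print(missing_days(toy_dates))  # 2 (Jan 3 and Jan 4 are absent)
```

On the combined table this could be applied per coin with something like `combined.groupby("Symbol")["Date"].apply(missing_days)`.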

In [19]:
CLEAN_DIR.mkdir(parents=True, exist_ok=True)   # make sure the output folder exists
out_path = CLEAN_DIR / "crypto_clean.csv"
combined.to_csv(out_path, index=False)

# read back preview (prevents blank-file mistakes)
check = pd.read_csv(out_path, nrows=5, parse_dates=["Date"])
print("Saved:", out_path, "| rows:", len(combined))
check



Saved: DataSet\Cleaned\crypto_clean.csv | rows: 27898


| | Symbol | Date | Open | High | Low | Close |
|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 |
| 3 | BTC | 2010-07-20 | 0.0808 | 0.08181 | 0.07426 | 0.07474 |
| 4 | BTC | 2010-07-21 | 0.07474 | 0.07921 | 0.06634 | 0.07921 |


In [20]:
# helper functions kept tiny
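def rsi(s, period=14):
    # Relative Strength Index using simple rolling means of gains and losses
    # (Wilder's original RSI uses a smoothed average instead)
    delta = s.diff()
    up  = delta.clip(lower=0)        # gains
    down = -delta.clip(upper=0)      # losses, as positive numbers
    ma_up   = up.rolling(period, min_periods=period).mean()
    ma_down = down.rolling(period, min_periods=period).mean()
    rs = ma_up / (ma_down + 1e-9)    # small constant avoids division by zero
    return 100 - (100 / (1 + rs))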

def macd(s, fast=12, slow=26, signal=9):
    ema_fast = s.ewm(span=fast, adjust=False).mean()
    ema_slow = s.ewm(span=slow, adjust=False).mean()
    line = ema_fast - ema_slow
    sig  = line.ewm(span=signal, adjust=False).mean()
    hist = line - sig
    return line, sig, hist

def bollinger(s, window=20, n_std=2):
    ma = s.rolling(window, min_periods=window).mean()
    sd = s.rolling(window, min_periods=window).std()
    upper = ma + n_std*sd
    lower = ma - n_std*sd
    return ma, upper, lower

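def atr(high, low, close, period=14):
    # Average True Range: rolling mean of the true range, where the true range is
    # the largest of high-low, |high - previous close| and |low - previous close|
    prev_close = close.shift(1)
    tr = pd.concat([(high-low), (high-prev_close).abs(), (low-prev_close).abs()], axis=1).max(axis=1)
    return tr.rolling(period, min_periods=period).mean()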

# reload the clean file we just saved
df = pd.read_csv(CLEAN_DIR / "crypto_clean.csv", parse_dates=["Date"]).sort_values(["Symbol","Date"])

# build features coin by coin (no fancy groupby)
featured = []
for sym in df["Symbol"].unique():
    g = df[df["Symbol"]==sym].copy()
    s = g["Close"]

    g["return_1d"] = s.pct_change()
    g["ma_7"]      = s.rolling(7,  min_periods=3).mean()
    g["ma_30"]     = s.rolling(30, min_periods=10).mean()

    g["rsi_14"]    = rsi(s, 14)

    line, sig, hist = macd(s)
    g["macd_line"]   = line
    g["macd_signal"] = sig
    g["macd_hist"]   = hist

    ma, up, lo = bollinger(s, 20, 2)
    g["bb_ma20"]  = ma
    g["bb_upper"] = up
    g["bb_lower"] = lo
    g["bb_width"] = (up - lo) / ma

    g["atr_14"] = atr(g["High"], g["Low"], s, 14)

    featured.append(g)

df_feat = pd.concat(featured, ignore_index=True).sort_values(["Symbol","Date"]).reset_index(drop=True)
print("With features:", df_feat.shape)
df_feat.head()


With features: (27898, 18)


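| | Symbol | Date | Open | High | Low | Close | return_1d | ma_7 | ma_30 | rsi_14 | macd_line | macd_signal | macd_hist | bb_ma20 | bb_upper | bb_lower | bb_width | atr_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 | 0.733791 | NaN | NaN | NaN | 0.002898 | 0.00058 | 0.002318 | NaN | NaN | NaN | NaN | NaN |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 | -0.058714 | 0.07205 | NaN | NaN | 0.004734 | 0.00141 | 0.003323 | NaN | NaN | NaN | NaN | NaN |
| 3 | BTC | 2010-07-20 | 0.0808 | 0.08181 | 0.07426 | 0.07474 | -0.075 | 0.072722 | NaN | NaN | 0.005634 | 0.002255 | 0.003379 | NaN | NaN | NaN | NaN | NaN |
| 4 | BTC | 2010-07-21 | 0.07474 | 0.07921 | 0.06634 | 0.07921 | 0.059807 | 0.07402 | NaN | NaN | 0.006632 | 0.003131 | 0.003502 | NaN | NaN | NaN | NaN | NaN |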


In [21]:
dups = df_feat.duplicated(subset=["Symbol","Date"]).sum()
print("Duplicate (Symbol,Date) rows:", dups)

display(
    df_feat.groupby("Symbol")
           .agg(first=("Date","min"), last=("Date","max"), rows=("Date","count"))
           .reset_index()
)

df_feat[["return_1d","ma_7","ma_30","rsi_14","macd_line","macd_signal","bb_ma20","atr_14"]].isna().sum()


Duplicate (Symbol,Date) rows: 0


| | Symbol | first | last | rows |
|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 2025-10-14 | 5521 |
| 1 | DOGE | 2016-07-01 | 2025-10-14 | 3345 |
| 2 | ETH | 2015-08-07 | 2025-10-14 | 3674 |
| 3 | HBAR | 2019-09-20 | 2025-10-14 | 2169 |
| 4 | QNT | 2019-02-05 | 2025-10-14 | 2396 |
| 5 | SOL | 2020-04-10 | 2025-10-14 | 1966 |
| 6 | XDC | 2020-04-02 | 2025-10-14 | 1813 |
| 7 | XLM | 2017-01-17 | 2025-10-14 | 3145 |
| 8 | XRP | 2015-01-21 | 2025-10-14 | 3869 |


return_1d        9
ma_7            18
ma_30           81
rsi_14         126
macd_line        0
macd_signal      0
bb_ma20        171
atr_14         117
dtype: int64
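
These NaN counts look consistent with warm-up periods rather than data problems: each count is a multiple of 9 (one warm-up per coin), for example 126 = 9 × 14 for `rsi_14` and 171 = 9 × 19 for `bb_ma20`. The sketch below reproduces the per-series warm-up lengths on a synthetic series using the same window settings as the feature cell. The series itself and the 1% high/low band are made-up stand-ins for one coin's prices.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive series standing in for one coin's Close (made-up data).
s = pd.Series(np.linspace(1.0, 2.0, 60))
high, low = s * 1.01, s * 0.99          # invented High/Low around the close
prev = s.shift(1)
tr = pd.concat([high - low, (high - prev).abs(), (low - prev).abs()], axis=1).max(axis=1)

# Number of leading NaN values each feature produces for a single series.
warmup = {
    "return_1d": s.pct_change().isna().sum(),
    "ma_7":      s.rolling(7, min_periods=3).mean().isna().sum(),
    "ma_30":     s.rolling(30, min_periods=10).mean().isna().sum(),
    "rsi_14":    s.diff().clip(lower=0).rolling(14, min_periods=14).mean().isna().sum(),
    "bb_ma20":   s.rolling(20, min_periods=20).mean().isna().sum(),
    "atr_14":    tr.rolling(14, min_periods=14).mean().isna().sum(),
}
warmup = {k: int(v) for k, v in warmup.items()}
print(warmup)
print({k: v * 9 for k, v in warmup.items()})   # scaled to 9 coins
```

The MACD columns have no NaN because `ewm(..., adjust=False)` produces a value from the first row onward.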

In [23]:
# note: this overwrites the cleaned file saved earlier with the feature-enriched version
final_path = CLEAN_DIR / "crypto_clean.csv"
df_feat.to_csv(final_path, index=False)

check2 = pd.read_csv(final_path, nrows=30, parse_dates=["Date"])
print("Saved final:", final_path, "| rows:", len(df_feat))
check2


Saved final: DataSet\Cleaned\crypto_clean.csv | rows: 27898


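| | Symbol | Date | Open | High | Low | Close | return_1d | ma_7 | ma_30 | rsi_14 | macd_line | macd_signal | macd_hist | bb_ma20 | bb_upper | bb_lower | bb_width | atr_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 | 0.733791 | NaN | NaN | NaN | 0.002898 | 0.00058 | 0.002318 | NaN | NaN | NaN | NaN | NaN |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 | -0.058714 | 0.07205 | NaN | NaN | 0.004734 | 0.00141 | 0.003323 | NaN | NaN | NaN | NaN | NaN |
| 3 | BTC | 2010-07-20 | 0.0808 | 0.08181 | 0.07426 | 0.07474 | -0.075 | 0.072722 | NaN | NaN | 0.005634 | 0.002255 | 0.003379 | NaN | NaN | NaN | NaN | NaN |
| 4 | BTC | 2010-07-21 | 0.07474 | 0.07921 | 0.06634 | 0.07921 | 0.059807 | 0.07402 | NaN | NaN | 0.006632 | 0.003131 | 0.003502 | NaN | NaN | NaN | NaN | NaN |
| 5 | BTC | 2010-07-22 | 0.07921 | 0.08181 | 0.0505 | 0.0505 | -0.362454 | 0.0701 | NaN | NaN | 0.005049 | 0.003514 | 0.001534 | NaN | NaN | NaN | NaN | NaN |
| 6 | BTC | 2010-07-23 | 0.0505 | 0.06767 | 0.0505 | 0.06262 | 0.24 | 0.069031 | NaN | NaN | 0.004717 | 0.003755 | 0.000962 | NaN | NaN | NaN | NaN | NaN |
| 7 | BTC | 2010-07-24 | 0.06262 | 0.06262 | 0.05049 | 0.05454 | -0.129032 | 0.06975 | NaN | NaN | 0.003759 | 0.003756 | 3e-06 | NaN | NaN | NaN | NaN | NaN |
| 8 | BTC | 2010-07-25 | 0.05454 | 0.05941 | 0.0505 | 0.0505 | -0.074074 | 0.064701 | NaN | NaN | 0.002643 | 0.003533 | -0.00089 | NaN | NaN | NaN | NaN | NaN |
| 9 | BTC | 2010-07-26 | 0.0505 | 0.056 | 0.05 | 0.056 | 0.108911 | 0.061159 | 0.064426 | NaN | 0.002177 | 0.003262 | -0.001084 | NaN | NaN | NaN | NaN | NaN |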
