# **Crypto Currency Future Price Forecast ETL**

## Objectives

Purpose of this ETL:
- prepare a clean, consistent dataset for multi-coin analysis and forecasting (top coins I chose: BTC, DOGE, ETH, HBAR, QNT, SOL, XDC, XLM, XRP).

What I will do:
1) Load each coin CSV from `DataSet/Raw/`
2) Look at the head/shape/info so I understand what's inside
3) Standardise column names (Symbol, Date, Open, High, Low, Close; the raw files have no volume column) and fix dtypes if needed
4) Remove duplicate (Symbol, Date) rows and handle missing values safely
5) Save the combined, cleaned CSV to `DataSet/Cleaned/`


## Inputs
- Raw Data: `DataSet/Raw/BTC.csv`
- Raw Data: `DataSet/Raw/DOGE.csv`
- Raw Data: `DataSet/Raw/ETH.csv`
- Raw Data: `DataSet/Raw/HBAR.csv`
- Raw Data: `DataSet/Raw/QNT.csv`
- Raw Data: `DataSet/Raw/SOL.csv`
- Raw Data: `DataSet/Raw/XDC.csv`
- Raw Data: `DataSet/Raw/XLM.csv`
- Raw Data: `DataSet/Raw/XRP.csv`

## Outputs
- Cleaned Data: `DataSet/Cleaned/crypto_clean.csv`



## Additional Comments
- This section was assisted by AI (ChatGPT-4) to help write a robust CSV loader function as I was encountering load errors due to inconsistent CSV formats from different sources. I provided the AI with examples of the different CSV formats and it generated a function that could handle these variations. I then reviewed and tested the function to ensure it worked correctly with my data.





---

# Change working directory

* The notebooks are stored in a subfolder of the project, so the working directory must be changed before the relative data paths will resolve.

We need to change the working directory from the notebook folder to its parent (the project root).
* We access the current directory with `os.getcwd()`

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Nine\\OneDrive\\Documents\\VS Code Projects\\Crypto-Currency-Future-Price-Forecast\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Nine\\OneDrive\\Documents\\VS Code Projects\\Crypto-Currency-Future-Price-Forecast'
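One thing to watch with the cells above: running the `os.chdir` cell a second time moves up another level, out of the project. A minimal sketch of a guarded version follows. The helper name `move_to_project_root` is my own, and it assumes the notebooks folder is called `jupyter_notebooks`, as in the path printed above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move up one level only while we are still inside the notebooks folder.

    `notebook_folder` is an assumption taken from the path printed above.
    Returns the working directory after the (possible) change.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        os.chdir(cwd.parent)
    return Path.cwd()
```

Because the function checks the folder name first, the cell can be re-run without changing the directory again.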

# Section 1

Import libraries

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

---

# Section 2

Load the raw data and start ETL process.

---

In [7]:

RAW_DIR = Path("DataSet/Raw")       # use your folder
CLEAN_DIR = Path("DataSet/Cleaned") # use your folder (already exists)
SYMBOLS = ["BTC","DOGE","ETH","HBAR","QNT","SOL","XDC","XLM","XRP"]

print("Raw folder:", RAW_DIR.resolve())
print("Clean folder:", CLEAN_DIR.resolve())
print("Files found:", [p.name for p in RAW_DIR.glob("*.csv")])


Raw folder: C:\Users\Nine\OneDrive\Documents\VS Code Projects\Crypto-Currency-Future-Price-Forecast\DataSet\Raw
Clean folder: C:\Users\Nine\OneDrive\Documents\VS Code Projects\Crypto-Currency-Future-Price-Forecast\DataSet\Cleaned
Files found: ['BTC.csv', 'DOGE.csv', 'ETH.csv', 'HBAR.csv', 'QNT.csv', 'SOL.csv', 'XDC.csv', 'XLM.csv', 'XRP.csv']


In [None]:
# Load each coin's data, clean and standardise it, then combine into one big DataFrame

frames = []

for sym in SYMBOLS:
    path = RAW_DIR / f"{sym}.csv"
    assert path.exists(), f"Missing file: {path.name}"

    # Read and normalise column names
    df = pd.read_csv(path)
    df.columns = [c.strip().lower() for c in df.columns]   # e.g., 'date','open','high','low','close','ticker'
    
    # Keep only what I need, then rename to a standard schema
    # The raw files have: ticker, date, open, high, low, close (no volume)
    keep = ["date","open","high","low","close"]
    assert all(c in df.columns for c in keep), f"{path.name} is missing required columns."
    
    df = df[keep].copy()
    df = df.rename(columns={
        "date":  "Date",
        "open":  "Open",
        "high":  "High",
        "low":   "Low",
        "close": "Close"
    })
    
    # Add the Symbol column from the filename (clear and reliable)
    df.insert(0, "Symbol", sym)

    frames.append(df)

len(frames)



9
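The loop above can also be wrapped in a small function so that each file goes through the same steps and can be tested on its own. This is only a sketch: `load_coin` and `REQUIRED` are names I made up, and it raises `ValueError` instead of using `assert`, since assertions are skipped when Python runs with `-O`.

```python
import io

import pandas as pd

# Lower-case column names every raw file is expected to contain (assumption
# based on the files described above).
REQUIRED = ["date", "open", "high", "low", "close"]


def load_coin(source, symbol):
    """Read one coin CSV and return it in the standard schema.

    `source` can be a path or any file-like object accepted by pd.read_csv.
    """
    df = pd.read_csv(source)
    # Normalise headers such as ' Date ' or 'OPEN' to 'date', 'open'
    df.columns = [str(c).strip().lower() for c in df.columns]

    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"{symbol}: missing required columns {missing}")

    # Keep only the required columns and capitalise them: 'date' -> 'Date'
    df = df[REQUIRED].rename(columns=str.capitalize)
    df.insert(0, "Symbol", symbol)
    return df


# Small in-memory example with messy headers
sample = io.StringIO(
    "ticker, Date ,OPEN,High,low,Close\n"
    "BTC,2010-07-17,0.04951,0.04951,0.04951,0.04951\n"
)
demo = load_coin(sample, "BTC")
```

With this helper the loop would reduce to something like `frames = [load_coin(RAW_DIR / f"{s}.csv", s) for s in SYMBOLS]`.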

In [None]:
# Combine all coins into one table

raw_all = pd.concat(frames, ignore_index=True)

print("Combined shape:", raw_all.shape)
raw_all.head(3)


Combined shape: (27898, 6)


| | Symbol | Date | Open | High | Low | Close |
|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 |


In [10]:
# Basic Cleaning

# Types
raw_all["Date"]  = pd.to_datetime(raw_all["Date"], errors="coerce")
for col in ["Open","High","Low","Close"]:
    raw_all[col] = pd.to_numeric(raw_all[col], errors="coerce")

# Drop rows with no date or all prices missing
raw_all = raw_all.dropna(subset=["Date"])
all_na_prices = raw_all[["Open","High","Low","Close"]].isna().all(axis=1)
raw_all = raw_all.loc[~all_na_prices].copy()

# Remove impossible prices (<= 0)
bad = (raw_all[["Open","High","Low","Close"]] <= 0).any(axis=1)
raw_all = raw_all.loc[~bad].copy()

# Remove rows that repeat the same (Symbol, Date) key, keeping the first occurrence
before = len(raw_all)
raw_all = raw_all.sort_values(["Symbol","Date"]).drop_duplicates(["Symbol","Date"]).reset_index(drop=True)
after = len(raw_all)

print(f"Duplicates removed: {before - after}")
print("Cleaned shape:", raw_all.shape)
raw_all.head(3)




Duplicates removed: 0
Cleaned shape: (27898, 6)


| | Symbol | Date | Open | High | Low | Close |
|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 |
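The cleaning cell removes non-positive prices but does not check that the four prices agree with each other. As a possible extra check (not part of the original cleaning), the sketch below flags rows where High is below another price or Low is above one. `flag_inconsistent_ohlc` is a hypothetical helper name.

```python
import pandas as pd


def flag_inconsistent_ohlc(df):
    """Return a boolean Series, True where High/Low do not bracket the other prices."""
    # High should be at least as large as Open, Close and Low
    high_too_low = df["High"] < df[["Open", "Close", "Low"]].max(axis=1)
    # Low should be no larger than Open, Close and High
    low_too_high = df["Low"] > df[["Open", "Close", "High"]].min(axis=1)
    return high_too_low | low_too_high


# Made-up rows: the first is consistent, the second has High below Close
demo = pd.DataFrame({
    "Open":  [1.0, 2.0],
    "High":  [1.5, 1.8],
    "Low":   [0.9, 1.7],
    "Close": [1.2, 2.1],
})
flags = flag_inconsistent_ohlc(demo)
```

On the real data this could be used as `raw_all[flag_inconsistent_ohlc(raw_all)]` to inspect suspicious rows before deciding whether to drop them.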


In [11]:
# Quick Validation


print("Coins included:", sorted(raw_all["Symbol"].unique()))
print("Date range by coin:")
display(
    raw_all.groupby("Symbol")
    .agg(first_date=("Date","min"), last_date=("Date","max"), rows=("Date","count"))
    .reset_index()
)

# Check for any remaining missing values
raw_all.isna().sum()


Coins included: ['BTC', 'DOGE', 'ETH', 'HBAR', 'QNT', 'SOL', 'XDC', 'XLM', 'XRP']
Date range by coin:


| | Symbol | first_date | last_date | rows |
|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 2025-10-14 | 5521 |
| 1 | DOGE | 2016-07-01 | 2025-10-14 | 3345 |
| 2 | ETH | 2015-08-07 | 2025-10-14 | 3674 |
| 3 | HBAR | 2019-09-20 | 2025-10-14 | 2169 |
| 4 | QNT | 2019-02-05 | 2025-10-14 | 2396 |
| 5 | SOL | 2020-04-10 | 2025-10-14 | 1966 |
| 6 | XDC | 2020-04-02 | 2025-10-14 | 1813 |
| 7 | XLM | 2017-01-17 | 2025-10-14 | 3145 |
| 8 | XRP | 2015-01-21 | 2025-10-14 | 3869 |


Symbol    0
Date      0
Open      0
High      0
Low       0
Close     0
dtype: int64
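The date range and row count alone do not show whether a coin has a row for every calendar day. A rough sketch for counting missing days per coin is below. It assumes `Date` is already a datetime column with one row per day at most, and `missing_days_per_symbol` is a name I chose for illustration.

```python
import pandas as pd


def missing_days_per_symbol(df):
    """Count calendar days between each coin's first and last date that have no row."""
    def count_gaps(dates):
        # Every calendar day from the coin's first to last observation
        full = pd.date_range(dates.min(), dates.max(), freq="D")
        return len(full.difference(pd.DatetimeIndex(dates.unique())))

    return df.groupby("Symbol")["Date"].apply(count_gaps)


# Made-up example: coin A skips 3 and 4 January, coin B has no gaps
demo = pd.DataFrame({
    "Symbol": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-01", "2024-01-02"]
    ),
})
gaps = missing_days_per_symbol(demo)
```

Calling `missing_days_per_symbol(raw_all)` would give one number per coin, which helps decide later whether the forecasting step needs reindexing or filling.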

In [None]:
# Save the cleaned combined data to DataSet/Cleaned/crypto_clean.csv

out_path = CLEAN_DIR / "crypto_clean.csv"
raw_all.to_csv(out_path, index=False)

# Read back a preview to prove it saved correctly
check = pd.read_csv(out_path, nrows=5, parse_dates=["Date"])
print("Saved:", out_path, "| rows:", len(raw_all))
check


Saved: DataSet\Cleaned\crypto_clean.csv | rows: 27898


| | Symbol | Date | Open | High | Low | Close |
|---|---|---|---|---|---|---|
| 0 | BTC | 2010-07-17 | 0.04951 | 0.04951 | 0.04951 | 0.04951 |
| 1 | BTC | 2010-07-18 | 0.04951 | 0.08585 | 0.04951 | 0.08584 |
| 2 | BTC | 2010-07-19 | 0.08584 | 0.09307 | 0.07723 | 0.0808 |
| 3 | BTC | 2010-07-20 | 0.0808 | 0.08181 | 0.07426 | 0.07474 |
| 4 | BTC | 2010-07-21 | 0.07474 | 0.07921 | 0.06634 | 0.07921 |
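The preview above only reads back five rows. If a fuller check is wanted, one option is to reload the whole file and compare it with the frame in memory. The sketch below does this with `pd.testing.assert_frame_equal`; `verify_round_trip` is a hypothetical helper, and dtype checking is relaxed on the assumption that only the values matter here.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_round_trip(df, path):
    """Save df to CSV, reload it, and raise AssertionError if the values differ."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path, parse_dates=["Date"])
    pd.testing.assert_frame_equal(
        df.reset_index(drop=True), reloaded, check_dtype=False
    )
    return len(reloaded)


# Made-up two-row frame written to a temporary folder
demo = pd.DataFrame({
    "Symbol": ["BTC", "BTC"],
    "Date": pd.to_datetime(["2010-07-17", "2010-07-18"]),
    "Close": [0.04951, 0.08584],
})
with tempfile.TemporaryDirectory() as tmp:
    rows_written = verify_round_trip(demo, Path(tmp) / "demo.csv")
```

For the real data the call would be `verify_round_trip(raw_all, out_path)`.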


---