# **Data Collection & Cleaning Notebook**

## Objectives

* This notebook fetches two Kaggle datasets and a Bitcoin price dataset from CoinCodex.  Permission was received from Leo Daris, a content manager at CoinCodex.
* The datasets are merged into one dataset, which is then inspected and cleaned.

## Inputs

* Kaggle JSON file - the authentication token.
* CoinCodex .csv download - Bitcoin prices - https://coincodex.com/crypto/bitcoin/
* Kaggle NASDAQ dataset by Sai Karthik
* Kaggle US Economic Vital Signs: 25 Years Of Macro Data dataset by Eswaran Muthu
* Yahoo Finance download - Tickers used are "^IXIC", "GC=F" & "BZ=F" from 19/07/2010 to 31/07/2025
* FRED API (https://fred.stlouisfed.org/) - Data downloaded was from 01/01/2010 for the following economic indicators:
  
  * 'CPIAUCSL' - Inflation
  * 'DGS10' - 10yr Treasury Yield
  * 'FEDFUNDS' - Fed Funds Interest Rate
  * 'M2SL' - Money Supply
  * 'VIXCLS' - CBOE Volatility Index
  * 'UMCSENT' - Consumer Sentiment
  * 'GDPC1' - Real GDP
  * 'UNRATE' - Unemployment Rate
  * 'RSAFS' - Retail Sales
  * 'GFDEGDQ188S' - Debt to GDP

## Outputs

* Generate Dataset: outputs/datasets/collection/

## Additional Comments

* Having started this project with the Kaggle datasets listed above, I decided to build my own dataset with identical or similar columns, plus some additional economic factors.  

* The data was downloaded from Yahoo Finance (via the yfinance library) and from FRED's API.  I have since removed my API key, as the datasets were all merged into bitcoin_yahoo_fred_combined.csv.

* bitcoin_yahoo_fred_combined.csv is the dataset I will use, as it includes data since the commencement of Bitcoin market prices and is much larger than my original dataset.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Project5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Project5'
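
As a small illustration of the step above, the sketch below applies os.path.dirname() to a path string without changing the working directory. The variable names notebook_dir and project_root are hypothetical; the path is the one shown in the output of the first cell.

```python
import os

# Path copied from the notebook output above; nothing on disk is touched.
notebook_dir = "/workspaces/Project5/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder.
project_root = os.path.dirname(notebook_dir)
print(project_root)
```

Passing that result to os.chdir() is what makes the project root the working directory for the rest of the notebook.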

# Fetch Kaggle Datasets

Install Kaggle package to fetch data

In [4]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=4ea163f8e695b64978b2a6086d00c65789d10af926adcf99301a496b655fad0d
  Stored in directory: /home/cistudent/.cache/pip/wheels/f5/69/4d/d701fc604b9fb09be59718b4056fd5556a22588ce1f25dd090
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggl

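Set the Kaggle configuration directory so that the authentication token (kaggle.json) is recognised, and restrict the file's permissions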

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Download Kaggle Datasets

In [None]:
dataset_1 = "sai14karthik/nasdq-dataset"
dataset_2 = "eswaranmuthu/u-s-economic-vital-signs-25-years-of-macro-data"
DestinationFolder = "inputs/datasets/raw"

!kaggle datasets download -d {dataset_1} -p {DestinationFolder}
!kaggle datasets download -d {dataset_2} -p {DestinationFolder}

Downloading nasdq-dataset.zip to inputs/datasets/raw
100%|█████████████████████████████████████████| 126k/126k [00:00<00:00, 514kB/s]
Downloading u-s-economic-vital-signs-25-years-of-macro-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/8.65k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 8.65k/8.65k [00:00<00:00, 24.8MB/s]


Unzip the files

In [11]:
import os
import zipfile

for file in os.listdir(DestinationFolder):
    if file.endswith(".zip"):
        zip_path = os.path.join(DestinationFolder, file)
        print(f"Unzipping: {zip_path}")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(DestinationFolder)
        os.remove(zip_path)


Unzipping: inputs/datasets/raw/nasdq-dataset.zip
Unzipping: inputs/datasets/raw/u-s-economic-vital-signs-25-years-of-macro-data.zip


---

# Data Cleaning And Merging

Load and inspect the data. Having reviewed the original Kaggle datasets, we will not use them going forward as they were not extensive enough.

Instead, we download the required data from Yahoo Finance & FRED and combine it with the Bitcoin price data to create one dataset.

In [4]:
import pandas as pd
df_bitcoin = pd.read_csv("inputs/datasets/raw/bitcoin_2010-07-17_2025-07-31.csv")
df_bitcoin.head()

Unnamed: 0,Start,End,Open,High,Low,Close,Volume,Market Cap
0,2025-07-31,2025-08-01,117823.0,118867.0,115606.0,115606.0,64625460000.0,2347673000000.0
1,2025-07-30,2025-07-31,117796.0,118699.0,116027.0,117800.0,57484230000.0,2343903000000.0
2,2025-07-29,2025-07-30,118100.0,119095.0,117084.0,117877.0,60250390000.0,2351844000000.0
3,2025-07-28,2025-07-29,119370.0,119759.0,117435.0,117883.0,53716600000.0,2360848000000.0
4,2025-07-27,2025-07-28,117944.0,119767.0,117842.0,119429.0,34950470000.0,2357912000000.0


We are only interested in the date and closing price from the Bitcoin dataset, so we keep the 'End' and 'Close' columns, rename them to 'Date' and 'Bitcoin_Close', and convert the date to a 'YYYY-MM-DD' string using datetime.

In [5]:
df_bitcoin = df_bitcoin[['End', 'Close']].copy()
df_bitcoin.rename(columns={'End': 'Date', 'Close': 'Bitcoin_Close'}, inplace=True)

df_bitcoin['Date'] = pd.to_datetime(df_bitcoin['Date']).dt.strftime('%Y-%m-%d')

In [6]:
df_bitcoin.head()

Unnamed: 0,Date,Bitcoin_Close
0,2025-08-01,115606.0
1,2025-07-31,117800.0
2,2025-07-30,117877.0
3,2025-07-29,117883.0
4,2025-07-28,119429.0
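
The notebook keeps 'Date' as a 'YYYY-MM-DD' string and later sorts and merges on it. The sketch below, using made-up dates rather than the real dataset, checks that sorting these strings gives the same order as sorting true datetime values, which is what makes the string format safe for the later sort_values calls.

```python
import pandas as pd

# Toy dates, deliberately out of order (not taken from the real dataset).
toy = pd.DataFrame({"Date": ["2025-08-01", "2010-07-19", "2024-12-31"]})

# Sorting the 'YYYY-MM-DD' strings directly...
sorted_as_text = toy.sort_values(by="Date")["Date"].tolist()

# ...gives the same order as sorting real datetime values.
sorted_as_dates = (
    pd.to_datetime(toy["Date"]).sort_values().dt.strftime("%Y-%m-%d").tolist()
)

print(sorted_as_text == sorted_as_dates)
```

This only holds because the year comes first and every part is zero-padded; a format such as 'DD/MM/YYYY' would not sort chronologically as text.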


Next, we download the historic prices of the Nasdaq Index, Gold and Brent crude oil from Yahoo Finance via yfinance in Python.  

All dates are from 19/07/2010 to 31/07/2025 to coincide with the Bitcoin dates.

This data is then merged and sorted by date.

In [7]:
import yfinance as yf
import pandas as pd

nasdaq = yf.Ticker("^IXIC")
df_nasdaq_history = nasdaq.history(start="2010-07-19", end="2025-07-31")
df_nasdaq_history = df_nasdaq_history.reset_index()
df_nasdaq_history['Date'] = df_nasdaq_history['Date'].dt.strftime('%Y-%m-%d')
df_nasdaq_history.rename(columns={'Close': 'Nasdaq_Close'}, inplace=True)
df_nasdaq_index = df_nasdaq_history[['Date', 'Nasdaq_Close']]

gold = yf.Ticker("GC=F")
df_gold_history = gold.history(start="2010-07-19", end="2025-07-31")
df_gold_history = df_gold_history.reset_index()
df_gold_history['Date'] = df_gold_history['Date'].dt.strftime('%Y-%m-%d')
df_gold_history.rename(columns={'Close': 'Gold_Close'}, inplace=True)
df_gold_index = df_gold_history[['Date', 'Gold_Close']]

brent = yf.Ticker("BZ=F")
df_brent_history = brent.history(start="2010-07-19", end="2025-07-31")
df_brent_history = df_brent_history.reset_index()
df_brent_history['Date'] = df_brent_history['Date'].dt.strftime('%Y-%m-%d')
df_brent_history.rename(columns={'Close': 'Brent_Close'}, inplace=True)
df_brent_index = df_brent_history[['Date', 'Brent_Close']]

# Merge Nasdaq and Gold
df_nasdaq_commodities = pd.merge(df_nasdaq_index, df_gold_index, on='Date', how='outer')

# Merge the result with Brent
df_nasdaq_commodities = pd.merge(df_nasdaq_commodities, df_brent_index, on='Date', how='outer')

# Sort by Date
df_nasdaq_commodities = df_nasdaq_commodities.sort_values(by='Date').reset_index(drop=True)

df_nasdaq_commodities.head()



Unnamed: 0,Date,Nasdaq_Close,Gold_Close,Brent_Close
0,2010-07-19,2198.22998,1181.699951,75.620003
1,2010-07-20,2222.48999,1191.5,76.220001
2,2010-07-21,2187.330078,1191.599976,75.370003
3,2010-07-22,2245.889893,1195.5,77.82
4,2010-07-23,2269.469971,1187.699951,77.449997
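
The same four formatting steps are repeated for each ticker in the cell above. A minimal sketch of how they could be factored into one helper is shown below. The function name format_close is hypothetical, and the input is a synthetic frame that only imitates the shape used above (a datetime index named 'Date' and a 'Close' column), so no download is needed to run it.

```python
import pandas as pd


def format_close(history: pd.DataFrame, label: str) -> pd.DataFrame:
    """Reduce a price-history frame to a string 'Date' column and one renamed close column."""
    out = history.reset_index()
    out["Date"] = out["Date"].dt.strftime("%Y-%m-%d")
    out = out.rename(columns={"Close": label})
    return out[["Date", label]]


# Synthetic stand-in for a downloaded price history (toy values only).
toy_history = pd.DataFrame(
    {"Open": [1.0, 2.0], "Close": [1.5, 2.5]},
    index=pd.DatetimeIndex(["2010-07-19", "2010-07-20"], name="Date"),
)

toy_gold = format_close(toy_history, "Gold_Close")
print(toy_gold)
```

With a helper like this, each ticker would need one call rather than four repeated lines, and any change to the date format would be made in a single place.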


We check for any missing data.

In [8]:
print(df_nasdaq_commodities.isnull().sum())

Date             0
Nasdaq_Close     3
Gold_Close       4
Brent_Close     32
dtype: int64


We forward fill any blanks with the last data entry.

In [9]:
df_nasdaq_commodities = df_nasdaq_commodities.ffill()

We can see that there are no blanks after forward filling.

In [10]:
print(df_nasdaq_commodities.isnull().sum())

Date            0
Nasdaq_Close    0
Gold_Close      0
Brent_Close     0
dtype: int64
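
To make the effect of the outer merge and forward fill concrete, here is a small sketch on two toy price series (the column names A_Close and B_Close and all values are invented). The second series has no row for one date, much as a market holiday would leave a gap in one index but not another.

```python
import pandas as pd

# Toy prices; the second series has no row for 2010-07-20.
a = pd.DataFrame(
    {"Date": ["2010-07-19", "2010-07-20", "2010-07-21"], "A_Close": [10.0, 11.0, 12.0]}
)
b = pd.DataFrame({"Date": ["2010-07-19", "2010-07-21"], "B_Close": [5.0, 6.0]})

# The outer merge keeps every date, leaving NaN where a series has no observation.
merged = pd.merge(a, b, on="Date", how="outer")
merged = merged.sort_values(by="Date").reset_index(drop=True)
gaps_before = int(merged["B_Close"].isnull().sum())

# ffill() copies the last known value forward into each gap.
filled = merged.ffill()
print(gaps_before)
print(filled)
```

The gap on 2010-07-20 is filled with the value from 2010-07-19, which is the same behaviour applied to the Nasdaq, Gold and Brent columns above.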


Next, we download a number of economic indicators from FRED via an API (https://fred.stlouisfed.org/).

We loop through the series, saving each one as its own DataFrame in the initially empty macro_dataframes dictionary.

In [2]:
from fredapi import Fred
import pandas as pd
from dotenv import load_dotenv
import os

load_dotenv()
fred_api_key = os.getenv("FRED_API_KEY")
fred = Fred(api_key=fred_api_key)

series_dict = {
    'CPIAUCSL': 'CPI',
    'DGS10': 'Ten_Year_Yield',
    'FEDFUNDS': 'Fed_Funds_Rate',
    'M2SL': 'M2_Money_Supply',
    'VIXCLS': 'VIX',
    'UMCSENT': 'Consumer_Sentiment',
    'GDPC1': 'Real_GDP',
    'UNRATE': 'Unemployment_Rate',
    'RSAFS': 'Retail_Sales',
    'GFDEGDQ188S': 'Debt_to_GDP'
}

macro_dataframes = {}

# Loop through each series and format
for series_id, label in series_dict.items():
    data = fred.get_series(series_id, observation_start='2010-01-01')
    df = data.reset_index()
    df.columns = ['Date', label]
    df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
    macro_dataframes[label] = df
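
The loop above needs a FRED API key to run. The sketch below reproduces the same per-series formatting offline, with a hypothetical toy_source dictionary of synthetic pandas Series standing in for the API call. The values are invented, and only two of the ten series are included.

```python
import pandas as pd

# Synthetic stand-ins for the downloaded series: a Series of values indexed by date.
toy_source = {
    "CPIAUCSL": pd.Series([100.0, 101.0], index=pd.to_datetime(["2010-01-01", "2010-02-01"])),
    "DGS10": pd.Series([3.0, 3.1], index=pd.to_datetime(["2010-01-04", "2010-01-05"])),
}
toy_labels = {"CPIAUCSL": "CPI", "DGS10": "Ten_Year_Yield"}

toy_frames = {}
for series_id, label in toy_labels.items():
    data = toy_source[series_id]  # stands in for the API download of this series
    df = data.reset_index()
    df.columns = ["Date", label]
    df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m-%d")
    toy_frames[label] = df

print(toy_frames["CPI"])
```

Each entry in the dictionary is a two-column frame keyed by its readable label, which is the shape the merge step expects.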

We then merge the individual downloads into one DataFrame on the date column and sort the result by date.

In [3]:
from functools import reduce
import pandas as pd

dfs = list(macro_dataframes.values())

df_macro_merged = reduce(lambda left, right: pd.merge(left, right, on='Date', how='outer'), dfs)

df_macro_merged = df_macro_merged.sort_values(by='Date').reset_index(drop=True)


In [4]:
df_macro_merged.head()

Unnamed: 0,Date,CPI,Ten_Year_Yield,Fed_Funds_Rate,M2_Money_Supply,VIX,Consumer_Sentiment,Real_GDP,Unemployment_Rate,Retail_Sales,Debt_to_GDP
0,2010-01-01,217.488,,0.11,8478.0,,74.4,16582.71,9.8,339093.0,86.51175
1,2010-01-04,,3.85,,,20.04,,,,,
2,2010-01-05,,3.77,,,19.35,,,,,
3,2010-01-06,,3.85,,,19.16,,,,,
4,2010-01-07,,3.85,,,19.06,,,,,
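
The pattern of blanks in the table above follows directly from the outer merge. A minimal sketch with three toy frames (hypothetical column names Daily, Monthly and Quarterly, invented values) shows how reduce chains the merges and where the missing entries appear.

```python
from functools import reduce

import pandas as pd

# Toy frames at different frequencies (values are illustrative only).
daily = pd.DataFrame({"Date": ["2010-01-04", "2010-01-05"], "Daily": [3.85, 3.77]})
monthly = pd.DataFrame({"Date": ["2010-01-01"], "Monthly": [0.11]})
quarterly = pd.DataFrame({"Date": ["2010-01-01"], "Quarterly": [86.5]})

frames = [daily, monthly, quarterly]

# reduce() merges the first two frames, then merges that result with the third.
merged = reduce(lambda left, right: pd.merge(left, right, on="Date", how="outer"), frames)
merged = merged.sort_values(by="Date").reset_index(drop=True)

null_counts = {col: int(n) for col, n in merged.isnull().sum().items()}
print(merged)
print(null_counts)
```

The daily column is blank on the date that only the slower series report, and the slower series are blank on every other date.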


We check for missing data and see a large number of missing entries.  This is because the series are published at different frequencies (daily, monthly and quarterly), so after the outer merge most dates have a value for only some of the columns.

In [15]:
print(df_macro_merged.isnull().sum())

Date                     0
CPI                   3946
Ten_Year_Yield         221
Fed_Funds_Rate        3946
M2_Money_Supply       3947
VIX                    175
Consumer_Sentiment    3947
Real_GDP              4071
Unemployment_Rate     3946
Retail_Sales          3946
Debt_to_GDP           4072
dtype: int64


We forward fill again, so that each blank takes the most recent published value

In [16]:
df_macro_merged = df_macro_merged.sort_values(by='Date').reset_index(drop=True)
df_macro_merged = df_macro_merged.ffill()

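We then backfill, so that any values still missing at the start of the dataset (index 0) are taken from the next available entry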

In [17]:
df_macro_merged = df_macro_merged.bfill()

In [18]:
df_macro_merged.head()

Unnamed: 0,Date,CPI,Ten_Year_Yield,Fed_Funds_Rate,M2_Money_Supply,VIX,Consumer_Sentiment,Real_GDP,Unemployment_Rate,Retail_Sales,Debt_to_GDP
0,2010-01-01,217.488,3.85,0.11,8478.0,20.04,74.4,16582.71,9.8,339093.0,86.51175
1,2010-01-04,217.488,3.85,0.11,8478.0,20.04,74.4,16582.71,9.8,339093.0,86.51175
2,2010-01-05,217.488,3.77,0.11,8478.0,19.35,74.4,16582.71,9.8,339093.0,86.51175
3,2010-01-06,217.488,3.85,0.11,8478.0,19.16,74.4,16582.71,9.8,339093.0,86.51175
4,2010-01-07,217.488,3.85,0.11,8478.0,19.06,74.4,16582.71,9.8,339093.0,86.51175
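
The two-step fill can be seen on a toy frame (invented values, hypothetical column names). Forward fill cannot fill a blank in the first row because there is nothing earlier to copy, so that blank is left for the backfill.

```python
import pandas as pd

nan = float("nan")

# Toy frame: the daily series starts late, the monthly series has one early value.
toy = pd.DataFrame(
    {
        "Date": ["2010-01-01", "2010-01-04", "2010-01-05"],
        "Daily": [nan, 3.85, 3.77],
        "Monthly": [0.11, nan, nan],
    }
)

# Step 1: forward fill carries the last known value down each column.
after_ffill = toy.ffill()
left_after_ffill = int(after_ffill["Daily"].isnull().sum())

# Step 2: backfill copies the next known value up into the leading blank.
after_bfill = after_ffill.bfill()
print(after_bfill)
```

The backfilled value comes from a later date, so it is best kept to the first few rows of a dataset, as it is here.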


Now there is no missing data on this dataset.

In [19]:
print(df_macro_merged.isnull().sum())

Date                  0
CPI                   0
Ten_Year_Yield        0
Fed_Funds_Rate        0
M2_Money_Supply       0
VIX                   0
Consumer_Sentiment    0
Real_GDP              0
Unemployment_Rate     0
Retail_Sales          0
Debt_to_GDP           0
dtype: int64


Finally, we are ready to merge all three datasets into one, using Bitcoin as the anchor so that every Bitcoin date is kept.

In [20]:
df_merged = pd.merge(df_bitcoin, df_nasdaq_commodities, on='Date', how='left')

df_merged = pd.merge(df_merged, df_macro_merged, on='Date', how='left')

df_merged = df_merged.sort_values(by='Date').reset_index(drop=True)

Once again, we check for missing data in the merged file and see a large number of missing entries.

This is because the Bitcoin dataset includes weekend dates, whereas weekends and holidays are absent from the other datasets.

In [21]:
print(df_merged.isnull().sum())

Date                     0
Bitcoin_Close            0
Nasdaq_Close          1709
Gold_Close            1709
Brent_Close           1709
CPI                   1517
Ten_Year_Yield        1517
Fed_Funds_Rate        1517
M2_Money_Supply       1517
VIX                   1517
Consumer_Sentiment    1517
Real_GDP              1517
Unemployment_Rate     1517
Retail_Sales          1517
Debt_to_GDP           1517
dtype: int64


We forward fill for missing weekends and holidays.

In [22]:
df_merged = df_merged.ffill()
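
A short sketch of the anchored merge on toy data shows where the weekend gaps come from and how forward fill closes them. The frame names and all values are invented; 24/07/2010 and 25/07/2010 fell on a Saturday and Sunday.

```python
import pandas as pd

# Toy Bitcoin frame: one row per calendar day, including the weekend.
toy_bitcoin = pd.DataFrame(
    {
        "Date": ["2010-07-23", "2010-07-24", "2010-07-25", "2010-07-26"],
        "Bitcoin_Close": [1.0, 2.0, 3.0, 4.0],
    }
)

# Toy market frame: trading days only (Friday and Monday).
toy_market = pd.DataFrame(
    {"Date": ["2010-07-23", "2010-07-26"], "Market_Close": [100.0, 110.0]}
)

# how='left' keeps every Bitcoin date; dates missing from the market frame become NaN.
toy_merged = pd.merge(toy_bitcoin, toy_market, on="Date", how="left")
weekend_gaps = int(toy_merged["Market_Close"].isnull().sum())

# Forward fill carries Friday's close across Saturday and Sunday.
toy_filled = toy_merged.ffill()
print(weekend_gaps)
print(toy_filled)
```

Using Bitcoin on the left side of the merge is what guarantees one row per calendar day in the final dataset.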

We backfill because the first date, 18/07/2010, was a Sunday and so has no earlier trading day to forward fill from.

In [23]:
df_merged = df_merged.bfill()

Now there is no missing data left in our fully merged dataset.

In [24]:
print(df_merged.isnull().sum())

Date                  0
Bitcoin_Close         0
Nasdaq_Close          0
Gold_Close            0
Brent_Close           0
CPI                   0
Ten_Year_Yield        0
Fed_Funds_Rate        0
M2_Money_Supply       0
VIX                   0
Consumer_Sentiment    0
Real_GDP              0
Unemployment_Rate     0
Retail_Sales          0
Debt_to_GDP           0
dtype: int64


In [25]:
df_merged.head()

Unnamed: 0,Date,Bitcoin_Close,Nasdaq_Close,Gold_Close,Brent_Close,CPI,Ten_Year_Yield,Fed_Funds_Rate,M2_Money_Supply,VIX,Consumer_Sentiment,Real_GDP,Unemployment_Rate,Retail_Sales,Debt_to_GDP
0,2010-07-18,0.05,2198.22998,1181.699951,75.620003,217.605,2.99,0.18,8639.8,25.97,67.8,16872.266,9.4,347612.0,89.56528
1,2010-07-19,0.0858,2198.22998,1181.699951,75.620003,217.605,2.99,0.18,8639.8,25.97,67.8,16872.266,9.4,347612.0,89.56528
2,2010-07-20,0.0808,2222.48999,1191.5,76.220001,217.605,2.98,0.18,8639.8,23.93,67.8,16872.266,9.4,347612.0,89.56528
3,2010-07-21,0.0747,2187.330078,1191.599976,75.370003,217.605,2.9,0.18,8639.8,25.64,67.8,16872.266,9.4,347612.0,89.56528
4,2010-07-22,0.0792,2245.889893,1195.5,77.82,217.605,2.96,0.18,8639.8,24.63,67.8,16872.266,9.4,347612.0,89.56528


In [26]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5494 entries, 0 to 5493
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                5494 non-null   object 
 1   Bitcoin_Close       5494 non-null   float64
 2   Nasdaq_Close        5494 non-null   float64
 3   Gold_Close          5494 non-null   float64
 4   Brent_Close         5494 non-null   float64
 5   CPI                 5494 non-null   float64
 6   Ten_Year_Yield      5494 non-null   float64
 7   Fed_Funds_Rate      5494 non-null   float64
 8   M2_Money_Supply     5494 non-null   float64
 9   VIX                 5494 non-null   float64
 10  Consumer_Sentiment  5494 non-null   float64
 11  Real_GDP            5494 non-null   float64
 12  Unemployment_Rate   5494 non-null   float64
 13  Retail_Sales        5494 non-null   float64
 14  Debt_to_GDP         5494 non-null   float64
dtypes: float64(14), object(1)
memory usage: 644.0+ KB


We will now save this merged dataset.

In [28]:
df_merged.to_csv("inputs/datasets/raw/bitcoin_yahoo_fred_combined.csv", index=False)

---

# Push files to Repo

* We finally save the dataset for use going forward

In [29]:
import os

# Create the output folder; report (rather than fail) if it already exists
try:
    os.makedirs(name='outputs/datasets/collection')
except FileExistsError as e:
    print(e)

df_merged.to_csv("outputs/datasets/collection/BitCoinVsMacroNasdaq_v5.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
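
The "File exists" message above is harmless; it only reports that the folder was created on an earlier run. One possible alternative, sketched below under the assumption that silently reusing an existing folder is acceptable, is to pass exist_ok=True to os.makedirs(). The example writes a toy frame to a temporary directory rather than the project folders, and reads it back to confirm that the save worked.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_merged (values are illustrative only).
toy = pd.DataFrame({"Date": ["2010-07-18", "2010-07-19"], "Toy_Close": [0.5, 1.25]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs", "datasets", "collection")

    # exist_ok=True makes the call safe to repeat: no error if the folder is already there.
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "toy.csv")
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the rows and columns survived the round trip.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```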


### We can now proceed to workbook 2 - Data Analysis and Feature Engineering