# **Data Collection & Cleaning Notebook**

## Objectives

* This notebook fetches two Kaggle datasets and a BitCoin price dataset from CoinCodex. Permission to use the CoinCodex data was received from Leo Daris, a content manager at CoinCodex.
* The datasets are merged into one dataset, which is then inspected and cleaned.
* All features are lagged, and a final version is saved after discarding the original unlagged features.

## Inputs

* Kaggle JSON file - the authentication token.
* CoinCodex .csv download - BitCoin prices - https://coincodex.com/crypto/bitcoin/
* Kaggle NASDAQ dataset by Sai Karthik
* Kaggle US Economic Vital Signs: 25 Years Of Macro Data dataset by Eswaran Muthu

## Outputs

* Generate Dataset: outputs/datasets/collection/


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Project5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Project5'
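
Running the `os.chdir` cell a second time moves up one more level. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above); the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir="jupyter_notebooks"):
    """Step up to the parent only while we are still inside the notebook folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir:
        os.chdir(cwd.parent)
    # Return the directory we ended up in, whether or not we moved
    return Path.cwd()
```

Calling the helper repeatedly leaves the working directory at the project root, because the move only happens when the current folder still has the notebook folder's name.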

# Fetch Kaggle Datasets

Install Kaggle package to fetch data

In [4]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=4ea163f8e695b64978b2a6086d00c65789d10af926adcf99301a496b655fad0d
  Stored in directory: /home/cistudent/.cache/pip/wheels/f5/69/4d/d701fc604b9fb09be59718b4056fd5556a22588ce1f25dd090
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggl

Point Kaggle at the folder that holds the token (kaggle.json) and restrict the file's permissions

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
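
Before downloading, it can help to confirm that the token file is present and well formed. The sketch below assumes the standard `kaggle.json` layout with a `username` and a `key` field; the helper name `check_kaggle_token` is hypothetical:

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return True if the file exists, parses as JSON and holds both expected fields."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return False
    # The token is expected to be a JSON object containing "username" and "key"
    return isinstance(token, dict) and {"username", "key"}.issubset(token)
```

A `False` result here points to a missing or malformed token rather than a network or dataset problem.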

Download Kaggle Datasets

In [None]:
dataset_1 = "sai14karthik/nasdq-dataset"
dataset_2 = "eswaranmuthu/u-s-economic-vital-signs-25-years-of-macro-data"
DestinationFolder = "inputs/datasets/raw"

!kaggle datasets download -d {dataset_1} -p {DestinationFolder}
!kaggle datasets download -d {dataset_2} -p {DestinationFolder}

Downloading nasdq-dataset.zip to inputs/datasets/raw
100%|█████████████████████████████████████████| 126k/126k [00:00<00:00, 514kB/s]
100%|█████████████████████████████████████████| 126k/126k [00:00<00:00, 513kB/s]
Downloading u-s-economic-vital-signs-25-years-of-macro-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/8.65k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 8.65k/8.65k [00:00<00:00, 24.8MB/s]


Unzip the files

In [11]:
import os
import zipfile

for file in os.listdir(DestinationFolder):
    if file.endswith(".zip"):
        zip_path = os.path.join(DestinationFolder, file)
        print(f"Unzipping: {zip_path}")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(DestinationFolder)
        os.remove(zip_path)


Unzipping: inputs/datasets/raw/nasdq-dataset.zip
Unzipping: inputs/datasets/raw/u-s-economic-vital-signs-25-years-of-macro-data.zip


Load and inspect the data

In [5]:
import pandas as pd
df_bitcoin = pd.read_csv("inputs/datasets/raw/bitcoin_2010-07-17_2025-07-31.csv")
df_bitcoin.head()

Unnamed: 0,Start,End,Open,High,Low,Close,Volume,Market Cap
0,2025-07-31,2025-08-01,117823.0,118867.0,115606.0,115606.0,64625460000.0,2347673000000.0
1,2025-07-30,2025-07-31,117796.0,118699.0,116027.0,117800.0,57484230000.0,2343903000000.0
2,2025-07-29,2025-07-30,118100.0,119095.0,117084.0,117877.0,60250390000.0,2351844000000.0
3,2025-07-28,2025-07-29,119370.0,119759.0,117435.0,117883.0,53716600000.0,2360848000000.0
4,2025-07-27,2025-07-28,117944.0,119767.0,117842.0,119429.0,34950470000.0,2357912000000.0


In [10]:
import pandas as pd
df_macro = pd.read_csv("inputs/datasets/raw/new_macro_data.csv")
df_macro['Date'] = pd.to_datetime(df_macro['Date'], dayfirst=True)
df_macro.head()

Unnamed: 0,Date,CPI,10Y Treasury Yield,Fed Funds Rate,M2_Money_Supply,Monthly_Inflation_Rate_%
0,2010-07-18,217.605,2.99,0.19,8595.1,0.186925
1,2010-07-19,217.605,2.99,0.19,8595.1,
2,2010-07-20,217.605,2.98,0.18,8595.1,
3,2010-07-21,217.605,2.9,0.18,8595.1,
4,2010-07-22,217.605,2.96,0.18,8595.1,


In [11]:
import pandas as pd
df_nasdaq = pd.read_csv("inputs/datasets/raw/nasdq.csv")
df_nasdaq.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,InterestRate,ExchangeRate,VIX,TEDSpread,EFFR,Gold,Oil
0,2010-01-04,6.64,6.81,6.633333,6.746667,6514500.0,0.11,1.4419,20.04,0.17,0.12,1117.699951,81.510002
1,2010-01-05,6.643333,6.773333,6.643333,6.766667,4445100.0,0.11,1.4402,19.35,0.18,0.12,1118.099976,81.769997
2,2010-01-06,6.733333,6.786667,6.72,6.763333,7340100.0,0.11,1.4404,19.16,0.19,0.12,1135.900024,83.18
3,2010-01-07,6.75,6.766667,6.63,6.673333,8498400.0,0.11,1.4314,19.06,0.2,0.1,1133.099976,82.660004
4,2010-01-08,6.676667,6.766667,6.626667,6.743333,4347600.0,0.11,1.4357,18.13,0.2,0.11,1138.199951,82.75


---

# Data Inspection & Cleaning

Convert each dataset's date column to datetime. For BitCoin, the 'End' column is used as the date

In [12]:
df_bitcoin['Date'] = pd.to_datetime(df_bitcoin['End'])
df_macro['Date'] = pd.to_datetime(df_macro['Date'])
df_nasdaq['Date'] = pd.to_datetime(df_nasdaq['Date'])

Sort date values

In [13]:
df_bitcoin.sort_values('Date', inplace=True)
df_macro.sort_values('Date', inplace=True)
df_nasdaq.sort_values('Date', inplace=True)

Rename the BitCoin 'Close' column to 'BitCoin_Close' to avoid ambiguity with the Nasdaq 'Close' column, then left-merge the macro and Nasdaq datasets onto the BitCoin dates

In [14]:
df_bitcoin.rename(columns={'Close': 'BitCoin_Close'}, inplace=True)

In [16]:
df_merged = pd.merge(df_bitcoin[['Date', 'BitCoin_Close']], df_macro, on='Date', how='left')

In [17]:
df_final = pd.merge(df_merged, df_nasdaq, on='Date', how='left')

In [18]:
df_final.head()


Unnamed: 0,Date,BitCoin_Close,CPI,10Y Treasury Yield,Fed Funds Rate,M2_Money_Supply,Monthly_Inflation_Rate_%,Open,High,Low,Close,Volume,InterestRate,ExchangeRate,VIX,TEDSpread,EFFR,Gold,Oil
0,2010-07-18,0.05,217.605,2.99,0.19,8595.1,0.186925,,,,,,,,,,,,
1,2010-07-19,0.0858,217.605,2.99,0.19,8595.1,,5.886667,5.926667,5.833333,5.903333,4311000.0,0.18,1.2963,25.97,0.35,0.19,1181.699951,76.540001
2,2010-07-20,0.0808,217.605,2.98,0.18,8595.1,,5.836667,6.06,5.806667,6.05,6911700.0,0.18,1.2905,23.93,0.35,0.18,1191.5,77.440002
3,2010-07-21,0.0747,217.605,2.9,0.18,8595.1,,6.133333,6.133333,5.92,5.926667,5768400.0,0.18,1.2818,25.64,0.35,0.18,1191.599976,76.559998
4,2010-07-22,0.0792,217.605,2.96,0.18,8595.1,,6.0,6.163333,5.983333,6.133333,5718300.0,0.18,1.2903,24.63,0.34,0.18,1195.5,79.300003
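
The empty Nasdaq values on 2010-07-18 come from the left join: BitCoin has a row for every calendar day, while the Nasdaq data only covers trading days. A small sketch with made-up prices (the frames `btc` and `stocks` are illustrative, not the notebook's data) shows the effect, using the `validate` and `indicator` options of `pd.merge` to check the join:

```python
import pandas as pd

# Toy data: BitCoin trades every day, the stock index only on weekdays
btc = pd.DataFrame({
    "Date": pd.to_datetime(["2010-07-16", "2010-07-17", "2010-07-18", "2010-07-19"]),
    "BitCoin_Close": [0.05, 0.06, 0.07, 0.08],
})
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2010-07-16", "2010-07-19"]),  # Friday and Monday
    "Close": [5.9, 6.0],
})

# validate raises if a date appears more than once on either side;
# indicator adds a _merge column showing which rows found a match
merged = pd.merge(btc, stocks, on="Date", how="left",
                  validate="one_to_one", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print("rows without index data:", n_unmatched)
```

The weekend rows keep their BitCoin price but have no stock value, which is the same pattern that produces the missing Nasdaq values in the merged dataset.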


Check for missing values:

In [20]:
print(df_final.isnull().sum())

Date                           0
BitCoin_Close                  0
CPI                            1
10Y Treasury Yield             1
Fed Funds Rate                 1
M2_Money_Supply                1
Monthly_Inflation_Rate_%    5314
Open                        1721
High                        1721
Low                         1721
Close                       1721
Volume                      1721
InterestRate                1721
ExchangeRate                1721
VIX                         1721
TEDSpread                   1721
EFFR                        1721
Gold                        1721
Oil                         1721
dtype: int64


In [21]:
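# Forward-fill the column on the merged frame itself so the values stay aligned with df_final's rows
df_final['Monthly_Inflation_Rate_%'] = df_final['Monthly_Inflation_Rate_%'].ffill()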
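
When a column is assigned from one DataFrame to another, pandas matches rows by index label, not by date. The sketch below uses made-up values and deliberately different index labels to show what happens when the labels do not line up, and how filling the frame's own column sidesteps the question:

```python
import pandas as pd

dates = pd.to_datetime(["2010-07-18", "2010-07-19", "2010-07-20"])

# Same dates and values, but the source frame carries different index labels
macro = pd.DataFrame({"Date": dates, "Rate": [0.19, None, None]}, index=[10, 11, 12])
merged = pd.DataFrame({"Date": dates, "Rate": [0.19, None, None]})

# Assigning a Series from another frame aligns on index labels, not on Date
misaligned = merged.copy()
misaligned["Rate"] = macro["Rate"].ffill()

# Forward-filling the frame's own column keeps every value on its original row
filled = merged.copy()
filled["Rate"] = filled["Rate"].ffill()

print(misaligned)
print(filled)
```

In `misaligned` no labels match, so the whole column ends up empty; in `filled` the first value is carried forward as intended.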

In [25]:
# Drop 2025-08-01, the one date that has a BitCoin price but no macro data
df_final = df_final[df_final['Date'] != pd.to_datetime('2025-08-01')]

In [28]:
df_final.drop(columns=['Open', 'Low', 'High', 'InterestRate', 'EFFR'], inplace=True)
df_final.rename(columns={
    'Close': 'Nasdaq_Close',
    'Volume': 'Nasdaq_Volume',
    }, inplace=True)

In [32]:
df_final.drop(columns=['Nasdaq_Volume', 'ExchangeRate'], inplace=True)

In [33]:
pd.concat([df_final.head(3), df_final.tail(3)])

Unnamed: 0,Date,BitCoin_Close,CPI,10Y Treasury Yield,Fed Funds Rate,M2_Money_Supply,Monthly_Inflation_Rate_%,Nasdaq_Close,VIX,TEDSpread,Gold,Oil
0,2010-07-18,0.05,217.605,2.99,0.19,8595.1,0.186925,,,,,
1,2010-07-19,0.0858,217.605,2.99,0.19,8595.1,0.186925,5.903333,25.97,0.35,1181.699951,76.540001
2,2010-07-20,0.0808,217.605,2.98,0.18,8595.1,0.186925,6.05,23.93,0.35,1191.5,77.440002
5490,2025-07-29,117883.0,322.132,4.34,4.33,22005.4,0.196579,,,,,
5491,2025-07-30,117877.0,322.132,4.38,4.33,22005.4,0.196579,,,,,
5492,2025-07-31,117800.0,322.132,4.37,4.33,22005.4,0.196579,,,,,


In [34]:
print(df_final.isnull().sum())

Date                           0
BitCoin_Close                  0
CPI                            0
10Y Treasury Yield             0
Fed Funds Rate                 0
M2_Money_Supply                0
Monthly_Inflation_Rate_%       0
Nasdaq_Close                1720
VIX                         1720
TEDSpread                   1720
Gold                        1720
Oil                         1720
dtype: int64


Use linear interpolation to fill missing values in the SOFR column from the neighbouring observations

In [15]:
df_final['SOFR'] = df_final['SOFR'].interpolate(method='linear')
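
As a small illustration of what linear interpolation does, the sketch below fills a two-value gap in a made-up series. It also shows that, with the default settings, a gap at the very start of a series is left untouched because there is no earlier value to interpolate from:

```python
import numpy as np
import pandas as pd

# A gap between two known values is filled with evenly spaced steps
rates = pd.Series([1.80, np.nan, np.nan, 1.74])
filled = rates.interpolate(method="linear")

# A missing value at the start has no left neighbour and stays missing
leading_gap = pd.Series([np.nan, 1.0, np.nan, 3.0])
leading_filled = leading_gap.interpolate(method="linear")

print(filled)
print(leading_filled)
```

If leading gaps matter for a column, they need separate handling, for example dropping those rows or back-filling them.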


In [16]:
print(df_final.isnull().sum())

Date                  0
Open                  0
High                  0
Low                   0
Close                 0
Volume                0
InterestRate          0
ExchangeRate          0
VIX                   0
TEDSpread             0
EFFR                  0
Gold                  0
Oil                   0
M2_Money_Supply       0
10Y Treasury Yield    0
Fed Funds Rate        0
CPI                   0
Inflation_Rate_%      0
SOFR                  0
BitCoin_Close         0
dtype: int64


### Drop unnecessary columns & rename others for clarity

* Drop the open, low and high prices, as we're only interested in the close prices. Also drop InterestRate and EFFR, as these duplicate Fed Funds Rate

* Rename the Close and Volume columns so it is clear they belong to the Nasdaq data

In [17]:
df_final.drop(columns=['Open', 'Low', 'High', 'InterestRate', 'EFFR'], inplace=True)
df_final.rename(columns={
    'Close': 'Nasdaq_Close',
    'Volume': 'Nasdaq_Volume',
    }, inplace=True)
df_final.head()

Unnamed: 0,Date,Nasdaq_Close,Nasdaq_Volume,ExchangeRate,VIX,TEDSpread,Gold,Oil,M2_Money_Supply,10Y Treasury Yield,Fed Funds Rate,CPI,Inflation_Rate_%,SOFR,BitCoin_Close
0,2018-04-03,28.883333,4917300.0,1.2261,21.1,0.6,1332.800049,63.509998,13993.9,2.87,1.69,250.227,2.470996,1.83,7061.622526
1,2018-04-04,28.74,3822600.0,1.2292,20.06,0.64,1335.800049,63.369999,13993.9,2.87,1.69,250.227,2.470996,1.74,7454.69179
2,2018-04-05,28.77,3174300.0,1.223,18.94,0.64,1324.300049,63.540001,13993.9,2.87,1.69,250.227,2.470996,1.75,6840.93611
3,2018-04-06,28.4,2808000.0,1.2274,21.49,0.64,1331.900024,62.060001,13993.9,2.87,1.69,250.227,2.470996,1.75,6819.726657
4,2018-04-09,28.43,1798200.0,1.232,21.77,0.61,1336.300049,63.419998,13993.9,2.87,1.69,250.227,2.470996,1.75,7000.923355


In [18]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1714 entries, 0 to 1713
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                1714 non-null   datetime64[ns]
 1   Nasdaq_Close        1714 non-null   float64       
 2   Nasdaq_Volume       1714 non-null   float64       
 3   ExchangeRate        1714 non-null   float64       
 4   VIX                 1714 non-null   float64       
 5   TEDSpread           1714 non-null   float64       
 6   Gold                1714 non-null   float64       
 7   Oil                 1714 non-null   float64       
 8   M2_Money_Supply     1714 non-null   float64       
 9   10Y Treasury Yield  1714 non-null   float64       
 10  Fed Funds Rate      1714 non-null   float64       
 11  CPI                 1714 non-null   float64       
 12  Inflation_Rate_%    1714 non-null   float64       
 13  SOFR                1714 non-null   float64     

### Lag remaining data for time series evaluation

* Here, we shift each feature back to its previous value, so that every row is described by information from earlier periods, which is what a forecasting model would have available.

In [21]:
# Lag target (daily)
df_final["BTC_Close_Lag1"] = df_final["BitCoin_Close"].shift(1)
# 3-day rolling mean; note that this window includes the current day's close
df_final["BTC_Close_RollingMean3"] = df_final["BitCoin_Close"].rolling(window=3).mean()

# Lag market indices (daily)
df_final["Nasdaq_Lag1"] = df_final["Nasdaq_Close"].shift(1)
df_final["Nasdaq_Volume_Lag1"] = df_final["Nasdaq_Volume"].shift(1)
df_final["VIX_Lag1"] = df_final["VIX"].shift(1)

# Lag currency and rates (daily)
df_final["ExchangeRate_Lag1"] = df_final["ExchangeRate"].shift(1)
df_final["SOFR_Lag1"] = df_final["SOFR"].shift(1)

# Lag commodities (daily)
df_final["Gold_Lag1"] = df_final["Gold"].shift(1)
df_final["Oil_Lag1"] = df_final["Oil"].shift(1)

# Lag macroeconomic indicators (monthly logic)
# Step 1: Create monthly snapshot
monthly_macro = df_final[[
    "Date", "CPI", "Inflation_Rate_%", "M2_Money_Supply",
    "TEDSpread", "10Y Treasury Yield", "Fed Funds Rate"
]].copy()

monthly_macro = monthly_macro.resample("M", on="Date").last()

# Step 2: Lag by one month
monthly_macro["CPI_Lag1"] = monthly_macro["CPI"].shift(1)
monthly_macro["Inflation_Lag1"] = monthly_macro["Inflation_Rate_%"].shift(1)
monthly_macro["M2_Lag1"] = monthly_macro["M2_Money_Supply"].shift(1)
monthly_macro["TEDSpread_Lag1"] = monthly_macro["TEDSpread"].shift(1)
monthly_macro["TreasuryYield_Lag1"] = monthly_macro["10Y Treasury Yield"].shift(1)
monthly_macro["FedFundsRate_Lag1"] = monthly_macro["Fed Funds Rate"].shift(1)

# Step 3: Create 'Month' column for merging
df_final["Month"] = df_final["Date"].dt.to_period("M").astype(str)
monthly_macro["Month"] = monthly_macro.index.to_period("M").astype(str)

# Step 4: Merge lagged monthly values into daily data
df_final = df_final.merge(
    monthly_macro[[
        "Month", "CPI_Lag1", "Inflation_Lag1", "M2_Lag1",
        "TEDSpread_Lag1", "TreasuryYield_Lag1", "FedFundsRate_Lag1"
    ]],
    on="Month",
    how="left"
)

# Forward-fill any remaining gaps
df_final.ffill(inplace=True)
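
A rolling window in pandas ends on the current row, so a rolling mean of the target contains the value being predicted unless the series is shifted first. The sketch below, on a made-up price series, compares a plain lag, a rolling mean, and a rolling mean computed on the shifted series; the column name `RollingMean3_Lagged` is illustrative only:

```python
import pandas as pd

close = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="Close")

features = pd.DataFrame({
    "Close": close,
    # Yesterday's close
    "Close_Lag1": close.shift(1),
    # Window ends on the current row, so it contains today's close
    "RollingMean3": close.rolling(window=3).mean(),
    # Shifting first makes the window end on yesterday's close
    "RollingMean3_Lagged": close.shift(1).rolling(window=3).mean(),
})

print(features)
```

The lagged variant has one more missing row at the start, which is the price of keeping the current day's close out of the feature.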

* Drop the Month column that was created during the lagging process as it is not needed for modelling

In [23]:
df_final.drop(columns=["Month"], inplace=True)

* Check for missing values

In [25]:
print(df_final.isnull().sum())

Date                       0
Nasdaq_Close               0
Nasdaq_Volume              0
ExchangeRate               0
VIX                        0
TEDSpread                  0
Gold                       0
Oil                        0
M2_Money_Supply            0
10Y Treasury Yield         0
Fed Funds Rate             0
CPI                        0
Inflation_Rate_%           0
SOFR                       0
BitCoin_Close              0
BTC_Close_Lag1             1
BTC_Close_RollingMean3     2
Nasdaq_Lag1                1
Nasdaq_Volume_Lag1         1
VIX_Lag1                   1
ExchangeRate_Lag1          1
SOFR_Lag1                  1
Gold_Lag1                  1
Oil_Lag1                   1
CPI_Lag1_x                20
Inflation_Lag1_x          20
M2_Lag1_x                 20
TEDSpread_Lag1_x          20
TreasuryYield_Lag1_x      20
FedFundsRate_Lag1_x       20
CPI_Lag1_y                20
Inflation_Lag1_y          20
M2_Lag1_y                 20
TEDSpread_Lag1_y          20
TreasuryYield_

* We drop the rows that contain missing values. These are the earliest rows in the dataset, for which there is no previous value to lag from.
  
* Note that the monthly lags are missing for 20 rows (every date in the first month, which has no previous month), whereas the daily lags are missing for only 1 row (2 for the 3-day rolling average)

In [26]:
df_final.dropna(subset=[
    "BTC_Close_Lag1", "BTC_Close_RollingMean3", "Nasdaq_Lag1",
    "Nasdaq_Volume_Lag1", "VIX_Lag1", "ExchangeRate_Lag1", "SOFR_Lag1",
    "Gold_Lag1", "Oil_Lag1",
    "CPI_Lag1_x", "Inflation_Lag1_x", "M2_Lag1_x",
    "TEDSpread_Lag1_x", "TreasuryYield_Lag1_x", "FedFundsRate_Lag1_x"
], inplace=True)
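
The reason every date in the first month lacks a monthly lag can be seen on a small made-up example. This sketch uses `groupby` and `map` rather than the `resample` and `merge` steps used in the notebook, purely to keep the illustration short; the values are invented:

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2018-04-03", "2018-04-30", "2018-05-01", "2018-05-31", "2018-06-01"]
    ),
    "CPI": [250.2, 250.2, 250.8, 250.8, 251.0],
})

# Month label for each daily row, e.g. "2018-04"
daily["Month"] = daily["Date"].dt.to_period("M").astype(str)

# Last value in each month, then shift so each month sees the previous month's value
monthly_last = daily.groupby("Month")["CPI"].last()
monthly_lag = monthly_last.shift(1)

# Map the lagged monthly value back onto every day of that month
daily["CPI_Lag1"] = daily["Month"].map(monthly_lag)

print(daily)
```

Both April rows have no lagged value because there is no March in the data, while every later row picks up the previous month's figure.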

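* Drop the duplicated _y columns. pandas adds the _x and _y suffixes when a merge finds the same column name on both sides, which happens here when the lagged columns already exist in df_final, for example if the lagging cell is run more than once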

In [27]:
df_final.drop(columns=[
    "CPI_Lag1_y", "Inflation_Lag1_y", "M2_Lag1_y",
    "TEDSpread_Lag1_y", "TreasuryYield_Lag1_y", "FedFundsRate_Lag1_y"
], inplace=True)
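
The sketch below reproduces the suffix behaviour on two tiny made-up frames and shows one possible guard: dropping any earlier copy of the lagged columns before merging, so the cell can be re-run safely. The frame and variable names are illustrative only:

```python
import pandas as pd

daily = pd.DataFrame({"Month": ["2018-04", "2018-05"], "Close": [28.9, 29.7]})
monthly = pd.DataFrame({"Month": ["2018-04", "2018-05"], "CPI_Lag1": [None, 250.2]})

# First merge: the lagged column arrives under its own name
once = daily.merge(monthly, on="Month", how="left")

# Second merge on the result: the name now exists on both sides, so pandas adds suffixes
twice = once.merge(monthly, on="Month", how="left")

# Guard: remove any existing copy of the incoming columns before merging again
lag_cols = [c for c in monthly.columns if c != "Month"]
rerun_safe = once.drop(columns=lag_cols, errors="ignore").merge(
    monthly, on="Month", how="left"
)

print(list(once.columns))
print(list(twice.columns))
print(list(rerun_safe.columns))
```

With the guard in place the column names stay the same no matter how often the merge is repeated, so no later renaming or dropping of suffixed columns is needed.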

* The check below confirms that no missing values remain

In [28]:
print(df_final.isnull().sum())

Date                      0
Nasdaq_Close              0
Nasdaq_Volume             0
ExchangeRate              0
VIX                       0
TEDSpread                 0
Gold                      0
Oil                       0
M2_Money_Supply           0
10Y Treasury Yield        0
Fed Funds Rate            0
CPI                       0
Inflation_Rate_%          0
SOFR                      0
BitCoin_Close             0
BTC_Close_Lag1            0
BTC_Close_RollingMean3    0
Nasdaq_Lag1               0
Nasdaq_Volume_Lag1        0
VIX_Lag1                  0
ExchangeRate_Lag1         0
SOFR_Lag1                 0
Gold_Lag1                 0
Oil_Lag1                  0
CPI_Lag1_x                0
Inflation_Lag1_x          0
M2_Lag1_x                 0
TEDSpread_Lag1_x          0
TreasuryYield_Lag1_x      0
FedFundsRate_Lag1_x       0
dtype: int64


In [29]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1694 entries, 20 to 1713
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    1694 non-null   datetime64[ns]
 1   Nasdaq_Close            1694 non-null   float64       
 2   Nasdaq_Volume           1694 non-null   float64       
 3   ExchangeRate            1694 non-null   float64       
 4   VIX                     1694 non-null   float64       
 5   TEDSpread               1694 non-null   float64       
 6   Gold                    1694 non-null   float64       
 7   Oil                     1694 non-null   float64       
 8   M2_Money_Supply         1694 non-null   float64       
 9   10Y Treasury Yield      1694 non-null   float64       
 10  Fed Funds Rate          1694 non-null   float64       
 11  CPI                     1694 non-null   float64       
 12  Inflation_Rate_%        1694 non-null   float64     

Finally, we remove the _x suffix from the column names that picked it up during the merge

In [31]:
df_final.rename(columns={
    "CPI_Lag1_x": "CPI_Lag1",
    "Inflation_Lag1_x": "Inflation_Lag1",
    "M2_Lag1_x": "M2_Lag1",
    "TEDSpread_Lag1_x": "TEDSpread_Lag1",
    "TreasuryYield_Lag1_x": "TreasuryYield_Lag1",
    "FedFundsRate_Lag1_x": "FedFundsRate_Lag1"
}, inplace=True)

In [33]:
df_final.head()

Unnamed: 0,Date,Nasdaq_Close,Nasdaq_Volume,ExchangeRate,VIX,TEDSpread,Gold,Oil,M2_Money_Supply,10Y Treasury Yield,...,ExchangeRate_Lag1,SOFR_Lag1,Gold_Lag1,Oil_Lag1,CPI_Lag1,Inflation_Lag1,M2_Lag1,TEDSpread_Lag1,TreasuryYield_Lag1,FedFundsRate_Lag1
20,2018-05-01,29.67,2063100.0,1.2,15.49,0.53,1303.800049,67.25,14049.6,2.98,...,1.2074,1.77,1316.199951,68.57,250.227,2.470996,13993.9,0.52,2.87,1.69
21,2018-05-02,29.303333,4036800.0,1.1968,15.97,0.55,1302.599976,67.93,14049.6,2.98,...,1.2,1.76,1303.800049,67.25,250.227,2.470996,13993.9,0.52,2.87,1.69
22,2018-05-03,28.67,5250000.0,1.197,15.9,0.56,1310.699951,68.43,14049.6,2.98,...,1.1968,1.75,1302.599976,67.93,250.227,2.470996,13993.9,0.52,2.87,1.69
23,2018-05-04,29.233334,3045900.0,1.1946,14.77,0.57,1312.699951,69.720001,14049.6,2.98,...,1.197,1.74,1310.699951,68.43,250.227,2.470996,13993.9,0.52,2.87,1.69
24,2018-05-07,29.553333,3077400.0,1.1927,14.75,0.57,1312.199951,70.730003,14049.6,2.98,...,1.1946,1.72,1312.699951,69.720001,250.227,2.470996,13993.9,0.52,2.87,1.69


### Data Cleaning Summary

* BitCoin's 'Close' column was renamed to 'BitCoin_Close'
* Linear interpolation was used to fill the missing data in the SOFR column
* Unnecessary columns were removed to avoid duplicated columns in the final dataset
* Nasdaq 'Close' and 'Volume' columns had 'Nasdaq' added to their names for further clarity
* All remaining features were lagged. Monthly economic features were lagged by 1 month and forward filled across the daily rows
* No missing data/NaNs remain

---

# Push files to Repo

* We now save the dataset, keeping only the lagged features for modelling and predictions. The original, unlagged features are stripped out.

In [34]:
lagged_features = [
    col for col in df_final.columns
    if "_Lag1" in col or "RollingMean" in col
]

df_final = df_final[["Date", "BitCoin_Close"] + lagged_features]

In [35]:
df_final.head()

Unnamed: 0,Date,BitCoin_Close,BTC_Close_Lag1,BTC_Close_RollingMean3,Nasdaq_Lag1,Nasdaq_Volume_Lag1,VIX_Lag1,ExchangeRate_Lag1,SOFR_Lag1,Gold_Lag1,Oil_Lag1,CPI_Lag1,Inflation_Lag1,M2_Lag1,TEDSpread_Lag1,TreasuryYield_Lag1,FedFundsRate_Lag1
20,2018-05-01,9240.335679,9413.409305,9307.424274,29.440001,2580900.0,15.93,1.2074,1.77,1316.199951,68.57,250.227,2.470996,13993.9,0.52,2.87,1.69
21,2018-05-02,9096.810022,9240.335679,9250.185002,29.67,2063100.0,15.49,1.2,1.76,1303.800049,67.25,250.227,2.470996,13993.9,0.52,2.87,1.69
22,2018-05-03,9236.704389,9096.810022,9191.283363,29.303333,4036800.0,15.97,1.1968,1.75,1302.599976,67.93,250.227,2.470996,13993.9,0.52,2.87,1.69
23,2018-05-04,9769.290293,9236.704389,9367.601568,28.67,5250000.0,15.9,1.197,1.74,1310.699951,68.43,250.227,2.470996,13993.9,0.52,2.87,1.69
24,2018-05-07,9666.853489,9769.290293,9557.616057,29.233334,3045900.0,14.77,1.1946,1.72,1312.699951,69.720001,250.227,2.470996,13993.9,0.52,2.87,1.69


* Finally, we save the dataset for later use

In [37]:
import os

# exist_ok=True means no error is raised when the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df_final.to_csv("outputs/datasets/collection/BitCoinVsMacroNasdaq_v3.csv", index=False)
