# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/BTCDaily.csv

## Additional Comments

* For the learning context of this project only, the data is hosted in a public repo.


---

# Install Python packages in the notebook

In [1]:
%pip install -r /workspace/bitcoin-forecast/requirements.txt 

Note: you may need to restart the kernel to use updated packages.


# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/bitcoin-forecast/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/bitcoin-forecast'
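
Rerunning the cells above in sequence can move the working directory up more than one level. The sketch below shows one way to make the step idempotent, assuming the notebooks live in a folder named jupyter_notebooks (as in the path printed earlier). The helper name project_root is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd only when cwd is the notebooks folder."""
    # If we are already at the project root, leave the path unchanged.
    return cwd.parent if cwd.name == notebooks_dir else cwd


# In the notebook this could be followed by os.chdir(project_root(Path.cwd())).
root = project_root(Path("/workspace/bitcoin-forecast/jupyter_notebooks"))
print(root)
```

Calling the helper on a path that is already the project root returns it unchanged, so the cell can be run repeatedly.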

# Fetch data from Kaggle

Install the Kaggle package to fetch data.

In [5]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=6b247549f5badb249232e7923c9174fcf264a8d46fba74df5177658273c8d4f7
  Stored in directory: /home/gitpod/.cache/pip/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b3

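Point the Kaggle client at the current directory, where kaggle.json is expected, and restrict the permissions on the token file.

In [7]: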
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
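
The shell command chmod 600 limits the token file to owner read and write. A rough Python equivalent is sketched below on a throwaway file rather than the real kaggle.json, assuming a POSIX system. The helper name restrict_token is hypothetical.

```python
import os
import stat
import tempfile


def restrict_token(path: str) -> int:
    """Set owner-only read/write on path and return the resulting mode bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # same bits as chmod 600
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary stand-in for kaggle.json.
with tempfile.TemporaryDirectory() as tmp:
    token = os.path.join(tmp, "kaggle.json")
    with open(token, "w") as fh:
        fh.write("{}")
    mode = restrict_token(token)
    print(oct(mode))
```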

Define the Kaggle dataset and the destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "prasoonkottarathil/btcinusd"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading btcinusd.zip to inputs/datasets/raw
 97%|████████████████████████████████████▉ | 78.0M/80.2M [00:03<00:00, 40.0MB/s]
100%|██████████████████████████████████████| 80.2M/80.2M [00:03<00:00, 26.9MB/s]


Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/btcinusd.zip
  inflating: inputs/datasets/raw/BTC-2017min.csv  
  inflating: inputs/datasets/raw/BTC-2018min.csv  
  inflating: inputs/datasets/raw/BTC-2019min.csv  
  inflating: inputs/datasets/raw/BTC-2020min.csv  
  inflating: inputs/datasets/raw/BTC-2021min.csv  
  inflating: inputs/datasets/raw/BTC-Daily.csv  
  inflating: inputs/datasets/raw/BTC-Hourly.csv  
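
The unzip-and-delete step can also be done without shell commands. The sketch below uses the standard-library zipfile module on a tiny stand-in archive built in a temporary folder, so it does not touch the downloaded data. The helper name extract_and_remove and the sample archive name are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove(zip_path: Path, dest: Path) -> list:
    """Extract zip_path into dest, delete the archive, return member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest)
    zip_path.unlink()  # remove the archive only after extraction succeeded
    return names


# Demonstrate with a small archive created on the fly.
with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp)
    zip_path = dest / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("BTC-Daily.csv", "unix,date\n")
    extracted = extract_and_remove(zip_path, dest)
    csv_exists = (dest / "BTC-Daily.csv").exists()
    zip_exists = zip_path.exists()
    print(extracted)
```

Deleting the archive after the with block means a failed extraction leaves the zip file in place for another attempt.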


---

# Load and Inspect Kaggle data

In [10]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/BTC-Daily.csv")
df.head()

Unnamed: 0,unix,date,symbol,open,high,low,close,Volume BTC,Volume USD
0,1646092800,2022-03-01 00:00:00,BTC/USD,43221.71,43626.49,43185.48,43185.48,49.006289,2116360.0
1,1646006400,2022-02-28 00:00:00,BTC/USD,37717.1,44256.08,37468.99,43178.98,3160.61807,136472300.0
2,1645920000,2022-02-27 00:00:00,BTC/USD,39146.66,39886.92,37015.74,37712.68,1701.817043,64180080.0
3,1645833600,2022-02-26 00:00:00,BTC/USD,39242.64,40330.99,38600.0,39146.66,912.724087,35730100.0
4,1645747200,2022-02-25 00:00:00,BTC/USD,38360.93,39727.97,38027.61,39231.64,2202.851827,86421490.0


DataFrame Summary

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2651 entries, 0 to 2650
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   unix        2651 non-null   int64  
 1   date        2651 non-null   object 
 2   symbol      2651 non-null   object 
 3   open        2651 non-null   float64
 4   high        2651 non-null   float64
 5   low         2651 non-null   float64
 6   close       2651 non-null   float64
 7   Volume BTC  2651 non-null   float64
 8   Volume USD  2651 non-null   float64
dtypes: float64(6), int64(1), object(2)
memory usage: 186.5+ KB


The df.info() output shows that the 'date' column has dtype 'object' instead of 'datetime64[ns]'. To fix this, we convert the 'date' column with the pd.to_datetime() function.

In [12]:
# The raw strings include a time part (e.g. 2022-03-01 00:00:00), so the format covers it
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2651 entries, 0 to 2650
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   unix        2651 non-null   int64         
 1   date        2651 non-null   datetime64[ns]
 2   symbol      2651 non-null   object        
 3   open        2651 non-null   float64       
 4   high        2651 non-null   float64       
 5   low         2651 non-null   float64       
 6   close       2651 non-null   float64       
 7   Volume BTC  2651 non-null   float64       
 8   Volume USD  2651 non-null   float64       
dtypes: datetime64[ns](1), float64(6), int64(1), object(1)
memory usage: 186.5+ KB
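
The date strings shown by df.head() include a time part (for example 2022-03-01 00:00:00), so a format string generally needs to cover it. The sketch below uses the standard library's datetime.strptime, which shares these format codes, to show how a strict parser treats the two formats. Whether pd.to_datetime accepts the shorter format may depend on the pandas version.

```python
from datetime import datetime

sample = "2022-03-01 00:00:00"  # first value of the 'date' column shown above

# A format that covers both the date and the time part parses cleanly.
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S")
print(parsed.date())

# A date-only format leaves the time part unconsumed and is rejected.
try:
    datetime.strptime(sample, "%Y-%m-%d")
    date_only_ok = True
except ValueError:
    date_only_ok = False
print(date_only_ok)
```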


Check the 'unix' column for duplicate values before dropping it, to make sure no entry is repeated. The check below finds none.

In [14]:
duplicates = df[df.duplicated(subset=['unix'])]
print(f"Number of duplicate 'unix' entries: {len(duplicates)}")

Number of duplicate 'unix' entries: 0
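
To illustrate how duplicated(subset=...) flags rows, the sketch below uses a made-up three-row frame (not the Kaggle data) in which one 'unix' value repeats.

```python
import pandas as pd

# Toy data for illustration only; the values are not from the dataset.
toy = pd.DataFrame({"unix": [100, 200, 100], "close": [1.0, 2.0, 3.0]})

# duplicated() marks every repeat after the first occurrence by default.
mask = toy.duplicated(subset=["unix"])
n_duplicates = int(mask.sum())
print(n_duplicates)
```

Only the third row is flagged, because the first occurrence of a value is not counted as a duplicate unless keep=False is passed.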


The dataset is in descending order, from the most recent to the earliest data. Below, we reverse the order and drop the 'unix' and 'symbol' columns, which are not needed here.

In [15]:
# Reverse the data order
df = df.iloc[::-1]

# Drop irrelevant columns
df = df.drop(['unix', 'symbol'], axis=1)
df.head()

Unnamed: 0,date,open,high,low,close,Volume BTC,Volume USD
2650,2014-11-28,363.59,381.34,360.57,376.28,3220878.18,8617.15
2649,2014-11-29,376.42,386.6,372.25,376.72,2746157.05,7245.19
2648,2014-11-30,376.57,381.99,373.32,373.34,1145566.61,3046.33
2647,2014-12-01,376.4,382.31,373.03,378.39,2520662.37,6660.56
2646,2014-12-02,378.39,382.86,375.23,379.25,2593576.46,6832.53
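
Reversing with iloc[::-1] relies on the file already being in strict descending order, and it keeps the old index labels (2650, 2649, ...). The sketch below shows an alternative on made-up rows: it sorts on 'date' explicitly and renumbers the index. This is one option, not what the notebook does.

```python
import pandas as pd

# Toy rows in descending date order, mimicking the layout of the raw file.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-03-01", "2022-02-28", "2022-02-27"]),
        "close": [3.0, 2.0, 1.0],
    }
)

# Sort chronologically and replace the old labels with 0..n-1.
chronological = toy.sort_values("date").reset_index(drop=True)
print(chronological)
```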


---

# Push files to Repo

In [1]:
import os

# Create the outputs/datasets/collection folder if it does not exist yet
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/BTCDaily.csv", index=False)

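NameError: name 'df' is not defined

This error likely comes from running the cell in a fresh kernel (note the execution count of 1), where df no longer exists. Rerun the cells under "Load and Inspect Kaggle data" before saving.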
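
The sketch below shows a save step that tolerates an existing folder and checks the result by reading the file back. It runs against a temporary directory and a one-row stand-in frame, so it does not touch the project outputs. The helper name save_collection is hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_collection(frame: pd.DataFrame, folder: str, filename: str) -> str:
    """Create folder if needed, write frame as CSV, return the file path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    frame.to_csv(path, index=False)
    return path


# Stand-in frame for illustration; the real notebook would pass df instead.
toy = pd.DataFrame({"date": ["2014-11-28"], "close": [376.28]})
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "outputs", "datasets", "collection")
    path = save_collection(toy, folder, "BTCDaily.csv")
    reloaded = pd.read_csv(path)
    print(reloaded.shape)
```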