In [1]:
import shutil
import kagglehub
import pandas as pd



### About Dataset

**Context**

Bitcoin is the longest-running and best-known cryptocurrency, first released as open source in 2009 by the anonymous Satoshi Nakamoto. Bitcoin serves as a decentralized medium of digital exchange, with transactions verified and recorded in a public distributed ledger (the blockchain) without the need for a trusted record-keeping authority or central intermediary. Each block contains a SHA-256 cryptographic hash of the previous block, so blocks are "chained" together and serve as an immutable record of all transactions that have ever occurred. As with any currency or commodity on the market, bitcoin trading and financial instruments soon followed public adoption of bitcoin and continue to grow. Included here is historical bitcoin market data at 1-min intervals for select bitcoin exchanges where trading takes place. Happy (data) mining!

**Content**

(See [mczielinski/kaggle-bitcoin](https://github.com/mczielinski/kaggle-bitcoin/) for the automation/scraping script.)

**btcusd_1-min_data.csv**

CSV files for select bitcoin exchanges covering January 2012 to the present (measured by UTC day), with minute-by-minute OHLC (Open, High, Low, Close) prices and volume in BTC.

If a timestamp is missing, or if there are jumps, this may be because the exchange (or its API) was down, the exchange (or its API) did not exist, or some other unforeseen technical error in data reporting or gathering. I'm not perfect, and I'm also busy! All effort has been made to deduplicate entries and verify the contents are correct and complete to the best of my ability, but obviously trust at your own risk.
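
The gaps described above can be located by differencing consecutive timestamps. The sketch below uses a small made-up frame (`toy`, with illustrative values rather than rows from the dataset) and assumes the `Timestamp` column is in Unix seconds and sorted in ascending order.

```python
import pandas as pd

# Toy Unix timestamps (seconds) at 1-minute spacing, with two minutes
# missing after the third row. Illustrative values, not dataset rows.
toy = pd.DataFrame(
    {"Timestamp": [1325412060, 1325412120, 1325412180, 1325412360, 1325412420]}
)

# Seconds elapsed since the previous row; the first row has no predecessor (NaN).
step = toy["Timestamp"].diff()

# A step larger than 60 s marks the first row after a gap.
gap_rows = step > 60
gaps = toy.loc[gap_rows].copy()

# Number of 1-minute windows missing before each flagged row.
gaps["missing_minutes"] = (step[gap_rows] / 60 - 1).astype(int)
print(gaps)
```

The same idea should carry over to the full frame by replacing `toy` with `df`, provided the rows are sorted by `Timestamp` first.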

**Acknowledgements and Inspiration**

Bitcoin charts for the data, originally. Now thank you to the Bitstamp API directly. The various exchange APIs, for making it difficult or unintuitive enough to get OHLC and volume data at 1-min intervals that I set out on this data scraping project. Satoshi Nakamoto and the novel core concept of the blockchain, as well as its first execution via the bitcoin protocol. I'd also like to thank viewers like you! Can't wait to see what code or insights you all have to share. 

[Dataset link](https://www.kaggle.com/datasets/mczielinski/bitcoin-historical-data)

**Column Descriptions**

| Column Name | Description                                                |
|-------------|------------------------------------------------------------|
| Timestamp   | Start time of the 60-second window, in Unix time (seconds) |
| Open        | Opening price of the window                                |
| High        | Highest price within the window                            |
| Low         | Lowest price within the window                             |
| Close       | Closing price of the window                                |
| Volume      | Volume of BTC transacted in the window                     |
| datetime    | Date and time of the window (UTC)                          |
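
To make the column semantics concrete, here is a minimal sketch that rolls 1-minute bars up to daily bars. The prices in `bars` are made up for illustration, and the aggregation rule (first open, maximum high, minimum low, last close, summed volume) is the usual OHLCV convention, not something the dataset author specifies.

```python
import pandas as pd

# Four illustrative 1-minute bars (made-up prices), two on each of two days.
bars = pd.DataFrame(
    {
        "Timestamp": [1325412000, 1325412060, 1325498400, 1325498460],
        "Open": [4.58, 4.60, 4.70, 4.65],
        "High": [4.60, 4.75, 4.72, 4.80],
        "Low": [4.55, 4.58, 4.60, 4.64],
        "Close": [4.60, 4.70, 4.65, 4.78],
        "Volume": [1.0, 2.5, 0.0, 3.0],
    }
)

# Interpret Timestamp as Unix seconds and use it as a UTC datetime index.
bars.index = pd.to_datetime(bars["Timestamp"], unit="s", utc=True)

# Aggregate each UTC day: first open, highest high, lowest low,
# last close, and total volume.
daily = bars.resample("1D").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(daily)
```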

In [8]:
# Download latest version
path = kagglehub.dataset_download("mczielinski/bitcoin-historical-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/mczielinski/bitcoin-historical-data?dataset_version_number=185...


100%|██████████| 112M/112M [00:11<00:00, 10.6MB/s] 

Extracting files...





Path to dataset files: /home/alex/.cache/kagglehub/datasets/mczielinski/bitcoin-historical-data/versions/185


In [None]:
full_path = f'{path}/btcusd_1-min_data.csv'

In [None]:
shutil.copy(full_path, './')

'./btcusd_1-min_data.csv'

In [2]:
df = pd.read_csv('btcusd_1-min_data.csv')
df.head()

Unnamed: 0,Timestamp,Open,High,Low,Close,Volume,datetime
0,1325412000.0,4.58,4.58,4.58,4.58,0.0,2012-01-01 10:01:00+00:00
1,1325412000.0,4.58,4.58,4.58,4.58,0.0,2012-01-01 10:02:00+00:00
2,1325412000.0,4.58,4.58,4.58,4.58,0.0,2012-01-01 10:03:00+00:00
3,1325412000.0,4.58,4.58,4.58,4.58,0.0,2012-01-01 10:04:00+00:00
4,1325412000.0,4.58,4.58,4.58,4.58,0.0,2012-01-01 10:05:00+00:00


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6944800 entries, 0 to 6944799
Data columns (total 7 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Timestamp  float64
 1   Open       float64
 2   High       float64
 3   Low        float64
 4   Close      float64
 5   Volume     float64
 6   datetime   object 
dtypes: float64(6), object(1)
memory usage: 370.9+ MB


In [5]:
# Select every row that has at least one missing value
missing_rows = df[df.isna().any(axis=1)]
print(missing_rows)

            Timestamp      Open      High       Low     Close    Volume  \
443279   1.352009e+09     10.49     10.49     10.49     10.49  0.000000   
443280   1.352009e+09     10.49     10.49     10.49     10.49  0.000000   
443281   1.352009e+09     10.49     10.49     10.49     10.49  0.000000   
443282   1.352009e+09     10.49     10.49     10.49     10.49  0.000000   
443283   1.352009e+09     10.49     10.49     10.49     10.49  0.260338   
...               ...       ...       ...       ...       ...       ...   
6944795  1.742169e+09  82554.00  82554.00  82554.00  82554.00  0.072000   
6944796  1.742169e+09  82584.00  82615.00  82584.00  82615.00  0.194670   
6944797  1.742169e+09  82555.00  82555.00  82555.00  82555.00  0.002680   
6944798  1.742170e+09  82555.00  82555.00  82555.00  82555.00  0.000000   
6944799  1.742170e+09  82569.00  82569.00  82566.00  82566.00  0.003019   

        datetime  
443279       NaN  
443280       NaN  
443281       NaN  
443282       NaN  
4432

In [10]:
columns_with_nan = df.columns[df.isna().any()].tolist()
print(columns_with_nan)

['datetime']


In [11]:
# Drop rows with a missing value (only `datetime` has any) and renumber the index from 0
df = df.dropna().reset_index(drop=True)
print(df.isna().sum())

Timestamp    0
Open         0
High         0
Low          0
Close        0
Volume       0
datetime     0
dtype: int64
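
Dropping the incomplete rows removes 162,520 of the 6,944,800 rows (6,782,280 remain). Since only `datetime` is missing and `Timestamp` is present in every row, one alternative is to rebuild the missing strings from `Timestamp` instead. This is a sketch on a made-up frame (`toy`), and it assumes the `Timestamp` values in the affected rows are trustworthy Unix seconds.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the situation: Timestamp is complete, datetime is not.
toy = pd.DataFrame(
    {
        "Timestamp": [1325412060.0, 1325412120.0, 1325412180.0],
        "datetime": ["2012-01-01 10:01:00+00:00", np.nan, np.nan],
    }
)

# Derive a UTC datetime for every row from the Unix timestamp (seconds).
rebuilt = pd.to_datetime(toy["Timestamp"], unit="s", utc=True)

# Format like the existing strings; the literal '+00:00' is valid because
# the values were parsed as UTC above.
rebuilt_str = rebuilt.dt.strftime("%Y-%m-%d %H:%M:%S+00:00")

# Fill only the missing entries and leave existing values untouched.
toy["datetime"] = toy["datetime"].fillna(rebuilt_str)
print(toy)
```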


In [12]:
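# Strip the trailing '+00:00' UTC offset, but only if every value carries it
if df['datetime'].astype(str).str.endswith('+00:00').all():
    df['datetime'] = df['datetime'].str[:-6]
else:
    print("Not every value ends with '+00:00'; column left unchanged")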

In [13]:
df['datetime']

0          2012-01-01 10:01:00
1          2012-01-01 10:02:00
2          2012-01-01 10:03:00
3          2012-01-01 10:04:00
4          2012-01-01 10:05:00
                  ...         
6782275    2025-03-14 23:56:00
6782276    2025-03-14 23:57:00
6782277    2025-03-14 23:58:00
6782278    2025-03-14 23:59:00
6782279    2025-03-15 00:00:00
Name: datetime, Length: 6782280, dtype: object
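
The column is still `object` dtype at this point, so it holds plain strings. A possible next step, sketched here on a made-up frame (`toy`) rather than the full data, is to parse it into a real datetime dtype so that the `.dt` accessors and time-based indexing become available. The explicit `format` string is an assumption based on the values printed above.

```python
import pandas as pd

# Toy strings in the same shape as the cleaned column (offset already stripped).
toy = pd.DataFrame(
    {"datetime": ["2012-01-01 10:01:00", "2012-01-01 10:02:00", "2012-01-01 10:03:00"]}
)

# Parse with an explicit format; this is stricter and usually faster than
# letting pandas infer the layout on millions of rows.
toy["datetime"] = pd.to_datetime(toy["datetime"], format="%Y-%m-%d %H:%M:%S")

print(toy["datetime"].dtype)
print(toy["datetime"].dt.minute.tolist())
```

The parsed values are timezone-naive; since the stripped offset was `+00:00`, they should be read as UTC.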