# Bitcoin analysis and prediction (2011-2024)

The dataset contains the price of Bitcoin from 2011 to 2024, along with other features. More information is available at the link below.

Data source: [Bitcoin Historical Dataset](https://www.kaggle.com/datasets/whenamancodes/bitcoin-latest-data-2011-2024)

### Next steps in order:

- EDA (exploratory data analysis)
  - [Import and inspect data](#import-and-inspect-data)
  - [Explore data and visualization](#explore-data-and-visualization)
- [Train and predict](#train-and-predict) with Prophet.
- [Conclusion](#conclusion)

In [1]:
# Installations

%pip install pandas
%pip install numpy
%pip install plotly
%pip install prophet
%pip install scikit-learn


In [2]:
# Data manipulation
import pandas as pd
import numpy as np

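# Compatibility shim: NumPy 2.0 removed the np.float_ alias, which some Prophet releases still reference
np.float_ = np.float64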

# Visualization
import plotly.express as px

# Predict
from prophet import Prophet
from prophet.plot import plot_plotly

# Metrics
from sklearn.metrics import root_mean_squared_error

# Others
import warnings
import itertools

warnings.filterwarnings("ignore")



# EDA

### Import and inspect data

In [3]:
df = pd.read_csv('./BTC Daily 2011-2024.csv')

df.head()

Unnamed: 0,Timestamp,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
0,2011-09-13 0:00:00,5.8,6.0,5.65,5.97,58.37,346.1,5.93
1,2011-09-14 0:00:00,5.58,5.72,5.52,5.53,61.15,341.85,5.59
2,2011-09-15 0:00:00,5.12,5.24,5.0,5.13,80.14,408.26,5.09
3,2011-09-16 0:00:00,4.82,4.87,4.8,4.85,39.91,193.76,4.85
4,2011-09-17 0:00:00,4.87,4.87,4.87,4.87,0.3,1.46,4.87


Since the goal is to understand BTC's behaviour and try to predict its future value, only the date and closing-price features are kept. Prophet also expects specific column names, so "Timestamp" is renamed to "ds" and "Close" to "y".

In [4]:
# Select the two columns and rename them in one step, which returns a new frame
df = df[["Close", "Timestamp"]].rename(columns={"Close": "y", "Timestamp": "ds"})
df.head()

Unnamed: 0,y,ds
0,5.97,2011-09-13 0:00:00
1,5.53,2011-09-14 0:00:00
2,5.13,2011-09-15 0:00:00
3,4.85,2011-09-16 0:00:00
4,4.87,2011-09-17 0:00:00


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4671 entries, 0 to 4670
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   y       4671 non-null   object
 1   ds      4671 non-null   object
dtypes: object(2)
memory usage: 73.1+ KB


There are no null records, but both columns are stored as object (string) type. The ds column should contain only the year, month, and day, not hours, minutes, and seconds.

In [6]:
df["ds"] = df["ds"].str.slice(0, -8)
df.head()

Unnamed: 0,y,ds
0,5.97,2011-09-13
1,5.53,2011-09-14
2,5.13,2011-09-15
3,4.85,2011-09-16
4,4.87,2011-09-17
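
As an alternative to slicing strings, the timestamps could be parsed into real datetimes, which also validates them. The following is a minimal sketch on a hypothetical three-row sample (the `sample` frame is an assumption for illustration, not the dataset itself); `pd.to_datetime` and `Series.dt.normalize` are standard pandas calls.

```python
import pandas as pd

# Hypothetical sample mimicking the Timestamp format seen in the CSV
sample = pd.DataFrame(
    {"ds": ["2011-09-13 0:00:00", "2011-09-14 0:00:00", "2011-09-15 0:00:00"]}
)

# Parse the strings into datetimes, then reset the time part to midnight
sample["ds"] = pd.to_datetime(sample["ds"]).dt.normalize()

print(sample["ds"].dtype)
print(sample["ds"].iloc[0])
```

Prophet accepts a datetime-typed ds column, so the rest of the notebook should work the same way with this approach.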


Column y must be numeric, but not all of its values can be converted, as the value counts below show.

In [7]:
df['y'].value_counts()

y
—        21
5         8
13.2      7
3.3       6
5.07      6
         ..
64233     1
63154     1
60252     1
61789     1
61396     1
Name: count, Length: 4466, dtype: int64

The value "—" cannot be converted to a number. It appears in only 21 of the 4671 rows, and according to the Prophet documentation, "The best way to handle outliers is to remove them - Prophet has no problem with missing data. If you set their values to NA in the history but leave the dates in future, then Prophet will give you a prediction for their values." These rows will therefore be removed.

In [8]:
df.drop(index=df[df['y'] == '—'].index, inplace=True)
df['y'] = df['y'].astype('float64')
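
A more general way to handle such placeholders, sketched here under the assumption that any non-numeric string should be treated as missing, is `pd.to_numeric` with `errors="coerce"`. The `sample` frame below is hypothetical and only mimics the shape of the data.

```python
import pandas as pd

# Hypothetical sample with one placeholder value, like the "—" rows in the dataset
sample = pd.DataFrame(
    {
        "ds": ["2011-09-13", "2011-09-14", "2011-09-15"],
        "y": ["5.97", "—", "5.13"],
    }
)

# Anything that is not a valid number becomes NaN
sample["y"] = pd.to_numeric(sample["y"], errors="coerce")
n_missing = int(sample["y"].isna().sum())

# Drop the rows that could not be converted
clean = sample.dropna(subset=["y"]).reset_index(drop=True)

print(n_missing, len(clean), clean["y"].dtype)
```

This catches any unparseable value, not only the one placeholder that happened to be spotted in `value_counts()`.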

### Explore data and visualization

In [9]:
df.describe()

Unnamed: 0,y
count,4650.0
mean,12959.851948
std,17838.776663
min,2.24
25%,370.1475
50%,4602.85
75%,20124.5
max,73121.0


<b>NOTE</b>: because only the closing price is used, these statistics will not capture the true minimum and maximum values BTC reached within each day.

In [10]:
px.area(df, 'ds', 'y')

# Train and predict

Prophet is used as the forecasting model. First the candidate parameters are declared, then every combination of them is generated to search for the one that yields the lowest RMSE (root mean squared error).
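
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. Below is a minimal NumPy sketch with made-up numbers; the `rmse` helper is illustrative only, since the notebook itself relies on scikit-learn's `root_mean_squared_error`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: the errors are 1, -1 and 2, so the mean squared error is (1 + 1 + 4) / 3 = 2
print(rmse([10.0, 20.0, 30.0], [9.0, 21.0, 28.0]))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones, and the result is expressed in the same units as y.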

In [11]:

model_param = {
    "daily_seasonality": [False],
    "weekly_seasonality": [False],
    "yearly_seasonality": [True],
    "seasonality_mode": ["multiplicative"],
    "growth": ["logistic"],
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}

df["cap"] = df["y"].max() + df["y"].std()

The loop below iterates over every parameter combination, trains a Prophet model for each one, and computes its RMSE. The model and forecast with the lowest RMSE are stored for plotting.
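
To get a sense of how many models the search trains, the grid can be expanded on its own. This sketch copies the parameter values declared earlier and only counts the combinations; it does not fit any model.

```python
import itertools

# Same candidate values as the notebook's model_param dictionary
model_param = {
    "daily_seasonality": [False],
    "weekly_seasonality": [False],
    "yearly_seasonality": [True],
    "seasonality_mode": ["multiplicative"],
    "growth": ["logistic"],
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}

# Cartesian product of all value lists, one dict per combination
all_params = [
    dict(zip(model_param.keys(), values))
    for values in itertools.product(*model_param.values())
]

print(len(all_params))
print(all_params[0])
```

Only the two prior-scale lists have more than one value, so the grid size is 4 x 4 and each combination costs one full Prophet fit.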

In [12]:
all_params = [
    dict(zip(model_param.keys(), v)) for v in itertools.product(*model_param.values())
]

rmse = np.inf
best_model_forecast = {}

for params in all_params:
    model = Prophet(**params).fit(df)
    future = model.make_future_dataframe(365, "D")

    future["cap"] = df["cap"].max()
    forecast = model.predict(future)

    current_rmse = root_mean_squared_error(df["y"], forecast["yhat"][: len(df["y"])])

    if rmse > current_rmse:
        rmse = current_rmse
        best_model_forecast['m'] = model
        best_model_forecast['fcst'] = forecast


In [13]:
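# Double quotes outside, single quotes inside: reusing the same quote type in an f-string needs Python 3.12+
print(f"RMSE: {root_mean_squared_error(df['y'], best_model_forecast['fcst']['yhat'][:len(df)])}")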

plot_plotly(**best_model_forecast)

RMSE: 6697.851750203023
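
The RMSE above is measured on the same rows the model was fitted on, so it describes how well the model fits history rather than how well it forecasts. One common way to estimate forecasting error is a chronological hold-out: fit on the earlier part of the series and score on the final stretch. The sketch below illustrates only the split and the scoring. It uses a synthetic random-walk series and a naive last-value forecast as a stand-in for Prophet, and the 365-day hold-out length is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily series (a random walk), NOT the Bitcoin data
n_days = 1000
data = pd.DataFrame(
    {
        "ds": pd.date_range("2020-01-01", periods=n_days, freq="D"),
        "y": 100 + rng.normal(0, 1, n_days).cumsum(),
    }
)

# Chronological split: the last 365 days are held out, never shuffled
horizon = 365
train, test = data.iloc[:-horizon], data.iloc[-horizon:]

# Stand-in forecast: repeat the last training value (a fitted Prophet model would go here)
y_pred = np.full(len(test), train["y"].iloc[-1])

holdout_rmse = float(np.sqrt(np.mean((test["y"].to_numpy() - y_pred) ** 2)))
print(len(train), len(test), holdout_rmse)
```

With Prophet, the same idea would mean fitting on `train` and predicting the dates in `test`. Prophet also provides a `cross_validation` utility in `prophet.diagnostics` that repeats this kind of split over several cutoffs.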


# Conclusion

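The best model reproduces the historical closing prices of BTC with <b>an RMSE of ~6698</b>. This error is measured on the same data the model was trained on, so the error on genuinely unseen dates may be higher.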

The price of BTC has tended to rise (bull market) over the years, except between March 2021 and July 2022, when it trended downward (bear market).

From its lowest close (2.24) to its highest close (73121) between 2011 and 2024, BTC increased in value by <b>~3264230%</b>.
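
The growth figure can be checked from the minimum and maximum closes reported by `df.describe()`. This small sketch computes both the growth factor and the percentage increase; the variable names are illustrative. The ratio max/min expressed as a percentage is about 3,264,330, while the percentage increase subtracts the starting value first and comes out 100 points lower.

```python
low_close = 2.24      # minimum close reported by df.describe()
high_close = 73121.0  # maximum close reported by df.describe()

# How many times larger the highest close is than the lowest
growth_factor = high_close / low_close

# Percentage increase relative to the lowest close
pct_increase = (high_close - low_close) / low_close * 100

print(f"factor: {growth_factor:,.1f}x")
print(f"increase: {pct_increase:,.0f}%")
```

This compares the lowest and highest closes in the dataset, not the first and last dates, so it measures trough-to-peak growth.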