<a href="https://colab.research.google.com/github/GuilhermeDumam/LinearRegression4bitcoin/blob/Master/data_quant_teste.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Linear regression algorithm built to predict the price of Bitcoin


####1-Data collection

####2-Data cleaning and transformation

####3-Feature engineering

####4-Model training and score.

##Data collection and library imports

In [210]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
import matplotlib.pyplot as plt
import seaborn as sns

In [211]:
df = pd.read_csv('https://raw.githubusercontent.com/coinmetrics/data/master/csv/btc.csv', sep = ',')

##Data cleaning and transformation, to make the data suitable for the model.

In [212]:
df.shape

(4980, 145)

In [213]:
df.head()

Unnamed: 0,time,AdrActCnt,AdrBal1in100KCnt,AdrBal1in100MCnt,AdrBal1in10BCnt,AdrBal1in10KCnt,AdrBal1in10MCnt,AdrBal1in1BCnt,AdrBal1in1KCnt,AdrBal1in1MCnt,...,TxTfrCnt,TxTfrValAdjNtv,TxTfrValAdjUSD,TxTfrValMeanNtv,TxTfrValMeanUSD,TxTfrValMedNtv,TxTfrValMedUSD,VelCur1yr,VtyDayRet180d,VtyDayRet30d
0,2009-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,,
1,2009-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,,
2,2009-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,,
3,2009-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,,
4,2009-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,,


In [214]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4980 entries, 0 to 4979
Columns: 145 entries, time to VtyDayRet30d
dtypes: float64(144), object(1)
memory usage: 5.5+ MB


In [215]:
df.dtypes == 'float64'

time                False
AdrActCnt            True
AdrBal1in100KCnt     True
AdrBal1in100MCnt     True
AdrBal1in10BCnt      True
                    ...  
TxTfrValMedNtv       True
TxTfrValMedUSD       True
VelCur1yr            True
VtyDayRet180d        True
VtyDayRet30d         True
Length: 145, dtype: bool

In [216]:
df.isna().sum()/df.shape[0]

time                0.000000
AdrActCnt           0.000201
AdrBal1in100KCnt    0.000201
AdrBal1in100MCnt    0.000201
AdrBal1in10BCnt     0.000201
                      ...   
TxTfrValMedNtv      0.052008
TxTfrValMedUSD      0.112851
VelCur1yr           0.001406
VtyDayRet180d       0.148996
VtyDayRet30d        0.118876
Length: 145, dtype: float64
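
The output above is truncated, so the columns with the most missing data are not visible. A minimal sketch of how the same ratio could be sorted to surface them, using a small made-up frame (the column names `a`, `b`, `c` are illustrative and not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the columns are hypothetical.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# isna().mean() gives the same ratio as isna().sum() / len(toy).
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

On the real frame the same expression would be applied to `df`, optionally followed by `.head(20)` to look at the worst columns only.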

In [217]:
df[df.duplicated()].count()

time                0
AdrActCnt           0
AdrBal1in100KCnt    0
AdrBal1in100MCnt    0
AdrBal1in10BCnt     0
                   ..
TxTfrValMedNtv      0
TxTfrValMedUSD      0
VelCur1yr           0
VtyDayRet180d       0
VtyDayRet30d        0
Length: 145, dtype: int64

###We drop the rows that contain null values, since they are of no use to us.

In [218]:
df.dropna(inplace=True)
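
With default arguments, `dropna()` removes a row if any of its columns is null, so a single sparse column can discard most of the history. A small sketch on made-up data comparing that default with the `subset` parameter, which limits the check to chosen columns (`sparse_metric` is a hypothetical column name):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PriceUSD": [10.0, 11.0, 12.0, 13.0],
    "sparse_metric": [np.nan, np.nan, np.nan, 1.5],  # hypothetical column
})

all_cols = toy.dropna()                       # drops any row with a null
price_only = toy.dropna(subset=["PriceUSD"])  # checks only PriceUSD

print(len(toy), len(all_cols), len(price_only))  # prints: 4 1 4
```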

In [219]:
df.select_dtypes(include=['object'])

Unnamed: 0,time
3822,2019-06-22
3823,2019-06-23
3824,2019-06-24
3825,2019-06-25
3826,2019-06-26
...,...
4974,2022-08-17
4975,2022-08-18
4976,2022-08-19
4977,2022-08-20


In [220]:
df['time'] = pd.to_datetime(df['time'])

##Feature engineering (feature selection, scaling)

In [104]:
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import SelectKBest, f_classif

In [161]:
df_features = df.drop(['PriceUSD','time'], axis = 1)

In [162]:
x1 = df_features
y1 = df['PriceUSD']

In [163]:
regr = LinearRegression()

In [164]:
regr.fit(x1,y1)

LinearRegression()

In [165]:
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size = 0.3)

In [166]:
dt1 = LinearRegression()
dt1.fit(x1_train, y1_train)

LinearRegression()

###We use a feature selector because the dataset has many columns and we need to keep only those that actually influence our target: "PriceUSD"

In [167]:
rfe = RFE(estimator=dt1, step=1)
rfe = rfe.fit(x1_train, y1_train)

In [168]:
selected_rfe_features = pd.DataFrame({'Feature':list(x1_train.columns),
                                      'Ranking':rfe.ranking_})
selected_rfe_features.sort_values(by='Ranking')

Unnamed: 0,Feature,Ranking
71,ReferenceRateETH,1
79,RevNtv,1
78,RevHashUSD,1
77,RevHashRateUSD,1
76,RevHashRateNtv,1
...,...,...
135,TxTfrValAdjUSD,69
39,CapRealUSD,70
41,DiffMean,71
40,DiffLast,72
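
In `RFE`, the features with `ranking_ == 1` are the ones kept, and when `n_features_to_select` is not given scikit-learn keeps half of the features. A sketch on synthetic data (the column names `f0` to `f5` are made up) showing how the kept columns could be read from `support_` rather than copied by hand:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 6)),
                     columns=[f"f{i}" for i in range(6)])
# Only f0 and f1 drive the target; the other columns are noise.
y_toy = 3.0 * X_toy["f0"] - 2.0 * X_toy["f1"] + rng.normal(scale=0.1, size=200)

rfe_toy = RFE(estimator=LinearRegression(), step=1)  # keeps 6 // 2 = 3 features
rfe_toy.fit(X_toy, y_toy)

# support_ is a boolean mask over the input columns.
selected = list(X_toy.columns[rfe_toy.support_])
print(selected)
print(dict(zip(X_toy.columns, rfe_toy.ranking_)))
```

With a linear estimator, `RFE` ranks features by the magnitude of their coefficients, so the ranking depends on the scale of each feature.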


In [191]:
best_features = x1[[
'ReferenceRateETH',
'SplyAdrBal1in1K',
'IssTotNtv',
'SplyAdrBalNtv1',
'SplyAdrBalNtv10',
'SplyAdrBalNtv100',
'IssContNtv',
'SplyAdrBalNtv1K',
'SplyAdrBalUSD1',
'SplyAdrBalUSD100',
'SplyAdrBalUSD100K',
'FlowInExNtv',
'SplyAdrBalUSD10K',
'FeeTotNtv',
'SplyAdrBalUSD10M',
'FeeMeanUSD',
'SplyAdrBal1in1B',
'SplyAdrBalUSD1K',
'NVTAdj',
'NVTAdjFF',
'SplyAct3yr',
'RevNtv',
'SplyAct5yr',
'SplyAct90d',
'SplyActEver',
'SplyActPct1yr',
'SplyAdrBal1in100K',
'SplyAdrBal1in100M',
'ReferenceRateEUR',
'SplyAdrBal1in10K',
'ReferenceRate',
'ROI30d',
'ROI1yr',
'SplyAdrBal1in10M',
'NVTAdjFF90',
'NVTAdj90',
'SplyCur',
'SplyAdrTop10Pct',
'AdrBalUSD100KCnt',
'AdrBalNtv100Cnt',
'AdrBalNtv100KCnt',
'AdrBalNtv10Cnt',
'AdrBalNtv10KCnt',
'AdrBalNtv1Cnt',
'AdrBalNtv1KCnt',
'VelCur1yr',
'AdrBalUSD100Cnt',
'SplyFF',
'AdrBalUSD10Cnt',
'AdrBalUSD10KCnt',
'AdrBalUSD10MCnt',
'AdrBalCnt',
'AdrBal1in1MCnt',
'AdrBalNtv0.001Cnt',
'AdrBalUSD1KCnt',
'AdrBalUSD1MCnt',
'TxTfrValMedUSD',
'BlkCnt',
'TxTfrValMeanNtv',
'TxTfrCnt',
'TxCnt',
'AdrBal1in1BCnt',
'AdrBal1in10MCnt',
'CapMVRVCur',
'CapMVRVFF',
'AdrBal1in10KCnt',
'AdrBal1in10BCnt',
'AdrBal1in100KCnt',
'SplyMiner0HopAllNtv',
'AdrBal1in1KCnt',
'AdrBalNtv0.01Cnt'
]]

###With the best features selected, we apply a scaler so that the continuous features have less disparate values, which would otherwise introduce noise and problems in our model.

In [103]:
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()

In [231]:
scaled_features = robust.fit_transform(best_features)
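
With default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range, so a few extreme values have little effect on the scale. A sketch on a made-up column with one outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 is an outlier

scaled = RobustScaler().fit_transform(col)

# Same result by hand: (x - median) / (Q3 - Q1)
q1, med, q3 = np.percentile(col, [25, 50, 75])
manual = (col - med) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Here the median is 3 and the interquartile range is 2, so the four ordinary values land in [-1, 0.5] while the outlier stays visibly far away at 48.5.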

In [232]:
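# Keep the original column names and row index so X stays aligned with y1
df1 = pd.DataFrame(scaled_features, columns=best_features.columns, index=best_features.index)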

In [233]:
X = df1
y = y1

In [224]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
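
`train_test_split` shuffles rows by default, so with daily data the test set contains days that lie between training days. As an alternative to compare against, and not what this notebook does, a chronological split can be obtained with `shuffle=False`. A sketch on made-up dated data (the `feature` column is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

dates = pd.date_range("2022-01-01", periods=10, freq="D")
X_toy = pd.DataFrame({"feature": np.arange(10.0)}, index=dates)  # hypothetical feature
y_toy = pd.Series(np.arange(10.0) * 2, index=dates)

# shuffle=False keeps row order: the last rows become the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=False
)

print(X_tr.index.max(), X_te.index.min())
```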

In [225]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()
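
The notebook imports `KFold` and `cross_val_score` but does not use them. A sketch, on synthetic data, of how they could give a score averaged over several splits, with the scaler placed inside a pipeline so that it is fitted only on each training fold (the data and coefficients below are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)

# Scaler and model are refitted together on every training fold.
pipe = make_pipeline(RobustScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# With no `scoring` argument, the pipeline's own score (R^2) is used.
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv)

print(scores.mean(), scores.std())
```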

###The model is trained and reaches a very high R² score on the test set.

In [226]:
model.score(X_test, y_test)

0.9997981839202693
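
For `LinearRegression`, `score` returns the coefficient of determination R² on the data passed in. A sketch on synthetic data showing that it matches the usual formula 1 - SS_res / SS_tot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 1.5 * X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=100)

model_toy = LinearRegression().fit(X_toy, y_toy)
pred = model_toy.predict(X_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(model_toy.score(X_toy, y_toy), r2_manual)
```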