# <center> PREDICTING THE NEXT DAY PRICE OF BITCOIN USING MACHINE LEARNING TECHNIQUES </center>
## <center> Data Collection and Data Cleaning Process </center>
### <center> 2148040, 2148041 </center>

## Installing the required Python modules

#### Quandl

The `quandl` package (installed with `!pip install Quandl`) is used to retrieve economic, financial and alternative datasets.

#### investpy  

The `investpy` package (installed with `!pip install investpy`) is used to download both recent and historical data for the financial products indexed at Investing.com.

## Importing Libraries

In [1]:
#The requests module allows you to send HTTP requests using Python.
import requests

#Beautiful Soup is a Python library for pulling data out of HTML and XML files.
from bs4 import BeautifulSoup

#Basic packages for data manipulation, visuals, analysis
import pandas as pd
import numpy as np

#For creating Progress Meters or Progress Bars.
from tqdm import tqdm

from scipy.stats.mstats import winsorize

# Visuals
import matplotlib.pyplot as plt

#Ignore warnings
import warnings
warnings.filterwarnings("ignore")

#API clients for retrieving financial data
import quandl
import investpy

## Web Scraping

This function scrapes the given link and returns a dataframe. The parameters are as follows:
1. **URL(string)**: URL of the bitinfocharts page to be scraped

2. **col_name(string)**: column name for the dataframe

3. **join_df(variable)**: dataframe with which the output dataframe will be left joined on Date

4. **join(boolean)**: if True, join with `join_df`; otherwise return the scraped dataframe on its own

5. **check_column(boolean)**: check if the column name already exists

6. **check_URL(boolean)**: check if the URL has already been processed

7. **clear_URL_array(boolean)**: if True, the set of processed URLs (`URL_array`) will be cleared

8. **show_details(boolean)**: if True, print details such as the first and last characters of the scraped text and the df head & df tail

In [2]:
URL_array  = set()
def link2df(URL,col_name,join_df,join=True,check_column=True,check_URL = True,clear_URL_array=False,show_details=False):
        
    print(f'processing {col_name}')

    #clear URL append array
    if clear_URL_array==True:
        URL_array.clear()

    #set join parameters if false
    if join == False:
        join_df = None
        check_column=False

    #process column name by making it lowercase and replacing spaces,commas, full stops
    col_name = col_name.lower().replace(',','').replace(" ", "_").replace(".", "_")

    #stop if the column name already exists in the dataframe
    if check_column==True and col_name in list(join_df.columns):
        print(f'column {col_name} already exists in dataframe, stopped here')
        return join_df

    #stop if the URL has already been processed
    elif check_URL==True and URL in URL_array:
        print(f'{URL} is already processed, stopped here')
        return join_df 

    #web scraping: download the page and cut out the chart data,
    #which sits between the first '[[' and the following '{labels'
    page = requests.get(URL)
    soup = page.content
    soup = str(soup)
    scraped_output = (soup.split('[[')[1]).split('{labels')[0][0:-2]
    if show_details == True:
        print('head')
        print(scraped_output[0:30])
        print('tail')
        print(scraped_output[-30:])

    processed_str = scraped_output.replace('new Date(','')
    processed_str = processed_str.replace(')','')
    processed_str = processed_str.replace('[','')
    processed_str = processed_str.replace(']','')
    processed_str = processed_str.replace('"','')

    processed_str_list = processed_str.split(',')
    date_list,data_list = processed_str_list[::2],processed_str_list[1::2]

    #validate column lengths
    if len(date_list)!=len(data_list):
        print(f'date & data length:{len(date_list),len(data_list),len(date_list)==len(data_list)}')

    #convert list data to a dataframe
    if join == False:
        df = pd.DataFrame()
        df['Date'] = pd.to_datetime(date_list)
        df[col_name] = data_list
        URL_array.add(URL)
        if show_details == True:
            print('*'*100)
            print('df head')
            print(df.head(1))
            print('*'*100)
            print('df tail')
            print(df.tail(1))
            print('*'*100)
            print(f'df shape{df.shape}')
            print('='*100)
            
        return df

    elif col_name not in list(join_df.columns) and join == True:
        df = pd.DataFrame()
        df['Date'] = pd.to_datetime(date_list)
        df[col_name] = data_list
        join_df = pd.merge(join_df,df,on=['Date'],how='left')
        URL_array.add(URL)
        if show_details == True:
            print('*'*100)
            print('df head')
            print(df.head(1))
            print('*'*100)
            print('df tail')
            print(df.tail(1))
            print('*'*100)
            print(f'output df shape= {df.shape},joined_df shape = {join_df.shape}')
            print('='*100)
            print(f'Number of duplicate columns in dataframe {df.columns.duplicated().sum()}')
            print('='*100)
    
        return join_df
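
The parsing step inside `link2df` can be illustrated on a small hand-written string. This is only a sketch: the `sample` text below is an assumption about what the embedded chart data looks like, inferred from the `replace` calls above rather than copied from the site, and the regular expression is a substitute for the chain of string replacements, not the notebook's own method.

```python
import re
import pandas as pd

# Hypothetical sample of the embedded chart data (assumed format, not copied from the site)
sample = ('[[new Date("2013/01/01"),31734],'
          '[new Date("2013/01/02"),39280],'
          '[new Date("2013/01/03"),null]]')

# Each entry is assumed to look like [new Date("YYYY/MM/DD"),value]
pattern = re.compile(r'\[new Date\("([^"]+)"\),([^\]]+)\]')
rows = pattern.findall(sample)

parsed = pd.DataFrame(rows, columns=['Date', 'value'])
parsed['Date'] = pd.to_datetime(parsed['Date'], format='%Y/%m/%d')
# 'null' entries become NaN instead of staying as strings
parsed['value'] = pd.to_numeric(parsed['value'], errors='coerce')
print(parsed)
```

Converting the values with `pd.to_numeric` at this point is a design choice of the sketch; `link2df` itself stores the scraped values as strings.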

## Web scraping - Data Collection

The data is scraped from <a href="https://bitinfocharts.com/comparison/bitcoin-price.html#alltime">bitinfocharts</a>, which provides the following 17 raw features. Each feature was retrieved using the custom Python function defined above.

### 01.Price

The Open-High-Low-Close (OHLC) chart features and the target variable are retrieved using the investpy API, which pulls data from investing.com. Values of the previous day are used to predict the next day's closing price.

Hence we retrieved the opening price, highest price, lowest price and closing price from 01-01-2013 to 01-04-2022 (dates in dd-mm-yyyy format).

This is treated as the historical price data of bitcoin. We dropped the unnecessary variables volume and currency.

In [3]:
#Pulling out the data
final_df = investpy.get_crypto_historical_data(crypto='bitcoin',from_date='01/01/2013',to_date='01/04/2022')
final_df = final_df.reset_index()

#removing unnecessary variables
final_df.drop(['Currency','Volume'],inplace=True,axis=1)

#Renaming the remaining columns
final_df.columns = ['Date','opening_price','highest_price','lowest_price','closing_price']
final_df

Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price
0,2013-01-01,13.5,13.6,13.2,13.3
1,2013-01-02,13.3,13.4,13.2,13.3
2,2013-01-03,13.3,13.5,13.3,13.4
3,2013-01-04,13.4,13.5,13.3,13.5
4,2013-01-05,13.5,13.6,13.3,13.4
...,...,...,...,...,...
3373,2022-03-28,46859.0,48199.0,46672.0,47105.0
3374,2022-03-29,47126.0,48127.0,47029.0,47449.0
3375,2022-03-30,47449.0,47714.0,46601.0,47075.0
3376,2022-03-31,47071.0,47624.0,45228.0,45525.0


### 02.Number of transactions in blockchain per day

Bitcoin transactions are messages digitally signed using cryptography and sent to the entire Bitcoin Network for verification. The number of daily transactions highlights the value of the Bitcoin network to securely transfer funds.

In [4]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-transactions.html',
                   'transactions in blockchain',join_df=final_df,join=True)

processing transactions in blockchain


### 03.Average block size

Blocks are files that permanently record the most recent transactional information on the Bitcoin network. A block cannot be altered or removed once it has been written. Each time a block is ‘completed’, the next block is added to the blockchain. The maximum block size is currently set at 1 megabyte.

In [5]:
final_df = link2df('https://bitinfocharts.com/comparison/size-btc.html',
                   'avg block size',join_df=final_df,join=True)

processing avg block size


### 04.Number of unique (from) addresses per day

These are distinct addresses from which payments are made every day.

In [6]:
final_df = link2df('https://bitinfocharts.com/comparison/sentbyaddress-btc.html',
                   'sent by adress',join_df=final_df,join=True)

processing sent by adress


### 05.Average mining difficulty per day

It reflects how difficult the proof-of-work calculation is relative to the difficulty value set at the beginning (1). A higher difficulty means that more computing power is needed to mine the same number of blocks, making the network more secure against attacks. The difficulty is adjusted every 2016 blocks (approximately every 2 weeks) so that the average time between blocks remains 10 minutes. It is increased if the last 2016 blocks were created in less than 2 weeks, and reduced if they took longer.

`difficulty_new = difficulty_old x (2016 blocks x 10 minutes) / (time taken to mine the last 2016 blocks)`
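
The adjustment rule can be written as a small function. This is a minimal sketch of the formula as stated in this section; the function name and the example numbers are illustrative assumptions, not values from the dataset.

```python
BLOCKS_PER_PERIOD = 2016     # blocks between difficulty adjustments
TARGET_BLOCK_MINUTES = 10    # intended average time between blocks

def retarget_difficulty(difficulty_old, actual_minutes):
    """Apply the adjustment formula; actual_minutes is the time taken to mine the last 2016 blocks."""
    expected_minutes = BLOCKS_PER_PERIOD * TARGET_BLOCK_MINUTES
    return difficulty_old * expected_minutes / actual_minutes

# Illustrative numbers only (not taken from the dataset)
faster = retarget_difficulty(1000.0, actual_minutes=2016 * 8)     # blocks arrived every 8 minutes
slower = retarget_difficulty(1000.0, actual_minutes=2016 * 12.5)  # blocks arrived every 12.5 minutes
print(faster, slower)
```

When blocks arrive faster than every 10 minutes the new difficulty is higher than the old one, and when they arrive more slowly it is lower.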

In [7]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-difficulty.html',
                   'avg mining difficulty',join_df=final_df,join=True)

processing avg mining difficulty


### 06.Average hashrate (hash/s) per day

Hash rate measures the computational power used for mining, expressed as the number of hash calculations performed per second. Machines with a high hash rate can attempt many more candidate solutions in a single second.
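
Hash rate, difficulty and block time are closely linked. The sketch below uses the common approximation that a block at difficulty D takes about D x 2^32 hash attempts on average; that approximation and the helper name `estimate_hashrate` are assumptions added here for illustration and are not part of the original notebook. The inputs are the difficulty and block time from the 2013-01-01 row of the final dataframe.

```python
def estimate_hashrate(difficulty, block_time_minutes):
    """Approximate network hash rate in hash/s from difficulty and average block time."""
    # Assumption: roughly difficulty * 2**32 hash attempts are needed per block on average
    expected_hashes_per_block = difficulty * 2**32
    return expected_hashes_per_block / (block_time_minutes * 60)

# avg_mining_difficulty and avg_block_time from the 2013-01-01 row of the dataset
estimate = estimate_hashrate(difficulty=2979637, block_time_minutes=8.889)
print(f"{estimate:.3e}")
```

The estimate should land in the same order of magnitude as the scraped `avg_hashrate` value for that day, which is a quick sanity check on the three columns.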

In [8]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-hashrate.html',
                   'avg hashrate',join_df=final_df,join=True)

processing avg hashrate


### 07.Mining Profitability USD/Day for 1 Hash/s

This is the estimated mining revenue in USD per day for each 1 Hash/s of mining power.

In [9]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-mining_profitability.html',
                   'mining profitability',join_df=final_df,join=True)

processing mining profitability


### 08.Sent coins in USD per day

It is the total value of Bitcoins sent daily.

In [10]:
final_df = link2df('https://bitinfocharts.com/comparison/sentinusd-btc.html',
                   'Sent coins in USD',join_df=final_df,join=True)

processing Sent coins in USD


### 09.Average transaction fee, USD

Each transaction can have a transaction fee determined by the sender and paid to the miners who verify the transaction. Higher fees give miners an incentive to process those transactions sooner than transactions with lower fees.

In [11]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-transactionfees.html',
                   'avg transaction fees',join_df=final_df,join=True)

processing avg transaction fees


### 10.Median transaction fee, USD

In [12]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-median_transaction_fee.html',
                   'median transaction fees',join_df=final_df,join=True)

processing median transaction fees


### 11.Average block time (minutes)

Block time is the time required to create the next block in the chain, that is, the time miners take to find a valid hash for the block. It is usually around 10 minutes, but can fluctuate depending on the hash rate of the network.

In [13]:
final_df = link2df('https://bitinfocharts.com/comparison/bitcoin-confirmationtime.html',
                   'avg block time',join_df=final_df,join=True)

processing avg block time


### 12.Avg. Transaction Value, USD

The average value of Bitcoin transactions, in USD.

In [14]:
final_df = link2df('https://bitinfocharts.com/comparison/transactionvalue-btc.html',
                   'avg transaction value',join_df=final_df,join=True)

processing avg transaction value


### 13.Median Transaction Value, USD

The median value of Bitcoin transactions, in USD.

In [15]:
final_df = link2df('https://bitinfocharts.com/comparison/mediantransactionvalue-btc.html',
                   'median transaction value',join_df=final_df,join=True)

processing median transaction value


### 14.Tweets per day  

Number of tweets per day. This feature represents the impact of social media on the bitcoin price.

In [16]:
final_df = link2df('https://bitinfocharts.com/comparison/tweets-btc.html',
                   'tweets',join_df=final_df,join=True)

processing tweets


### 15.Google Trends to "Bitcoin"

Google Trends search interest for "Bitcoin". This feature represents the impact of public search interest on the bitcoin price.

In [17]:
final_df = link2df('https://bitinfocharts.com/comparison/google_trends-btc.html',
                   'google trends',join_df=final_df,join=True)

processing google trends


### 16.Number of unique (from or to) addresses per day  - active address 
This is the number of unique addresses taking part in a transaction each day, either by sending or by receiving Bitcoins.



In [18]:
final_df = link2df('https://bitinfocharts.com/comparison/activeaddresses-btc.html',
                   'active addresses',join_df=final_df,join=True)

processing active addresses


### 17.Top 100 Richest Addresses to Total coins % 
This is the percentage of all coins held by the 100 richest addresses.

In [19]:
final_df = link2df('https://bitinfocharts.com/comparison/top100cap-btc.html',
                   'top100 to total percentage',join_df=final_df,join=True)

processing top100 to total percentage


### 18.Average Fee Percentage in Total Block Reward 
Bitcoin block rewards are new bitcoins awarded to the miner who is first to solve the proof-of-work puzzle and create a new block of verified bitcoin transactions. The reward started at 50 BTC and halves every 210,000 blocks; at the end of the period covered by this data it stood at 6.25 BTC. This feature is the share of transaction fees in the total reward that miners receive for a block.
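
The halving schedule can be sketched in a few lines. The function names are illustrative, and `fee_percentage` assumes that the feature is computed as fees divided by block reward plus fees, which is an interpretation of the heading rather than something the source site confirms.

```python
INITIAL_REWARD_BTC = 50.0
HALVING_INTERVAL = 210_000   # blocks between halvings

def block_reward(block_height):
    """Newly issued coins per block (in BTC) at a given block height, excluding fees."""
    halvings = block_height // HALVING_INTERVAL
    return INITIAL_REWARD_BTC / (2 ** halvings)

def fee_percentage(fees_btc, block_height):
    """Assumed definition: fees as a percentage of block reward plus fees."""
    reward = block_reward(block_height)
    return 100 * fees_btc / (reward + fees_btc)

print(block_reward(0))         # first block
print(block_reward(630_000))   # after three halvings
print(fee_percentage(0.25, 630_000))
```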

In [20]:
final_df = link2df('https://bitinfocharts.com/comparison/fee_to_reward-btc.html',
                   'avg fee to reward',join_df=final_df,join=True)

processing avg fee to reward


### 19.Total number of bitcoins in circulation 
This is the total number of mined bitcoins currently circulating on the network. The total supply of BTC is limited to 21 million.

In [21]:
btc_in_circulation_df = quandl.get("BCHAIN/TOTBC",authtoken='9ztFCcK4_e1xGo_gjzK7')
btc_in_circulation_df = btc_in_circulation_df.rename(columns={'Value': 'number_of_coins_in_circulation'})

### 20.Bitcoin Miners Revenue
Total value of coinbase block rewards and transaction fees paid to miners.

In [22]:
miners_revenue_df = quandl.get("BCHAIN/MIREV",authtoken='9ztFCcK4_e1xGo_gjzK7')
miners_revenue_df = miners_revenue_df.rename(columns={'Value': 'miner_revenue'})

In [23]:
final_df = pd.merge(final_df,btc_in_circulation_df,on=['Date'],how='left')
final_df = pd.merge(final_df,miners_revenue_df,on=['Date'],how='left')
final_df

Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,avg_hashrate,...,avg_block_time,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue
0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,24331539528899,...,8.889,625.432,14.518,,1.194,37846,19.536,0.627,10621175.00,5.264860e+04
1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,22804647966378,...,9.412,650.617,14.514,,1.497,43104,19.597,0.835,10621575.00,5.486525e+04
2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,23724885599725,...,8.889,542.73,19.732,,1.798,51268,19.621,0.925,10628700.00,4.811833e+04
3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,22608181137438,...,9.412,632.431,11.384,,1.841,47341,19.54,1,10632425.00,5.087274e+04
4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,22590695489434,...,10.213,697.556,13.945,,1.826,53417,19.543,0.885,10633200.00,5.139673e+04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3373,2022-03-28,46859.0,48199.0,46672.0,47105.0,293145,683575,520476,27452707696467,1.925704544402E+20,...,9.057,764524,649.352,239616,96.113,916092,15.531,1.077,18996587.50,4.754287e+07
3374,2022-03-29,47126.0,48127.0,47029.0,47449.0,286789,656320,519299,27452707696467,2.2061925827943E+20,...,9.536,737579,609.259,234380,84.155,828096,15.464,1.045,18997462.50,4.533058e+07
3375,2022-03-30,47449.0,47714.0,46601.0,47075.0,272729,727827,485753,27452707696467,1.9751284069115E+20,...,10.667,637283,590.518,200360,79.054,769223,15.466,1.308,18998493.75,4.037914e+07
3376,2022-03-31,47071.0,47624.0,45228.0,45525.0,279072,718935,516129,28315874718216,1.94309372292E+20,...,10.435,655276,662.979,164113,76.768,806205,15.459,1.889,18999187.50,4.107078e+07
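
Every feature is attached to the price data with a left join on `Date`, so all price rows are kept and dates missing from a feature become `NaN`. A minimal sketch with made-up frames (the `toy_feature` column and its values are hypothetical):

```python
import pandas as pd

prices = pd.DataFrame({
    'Date': pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']),
    'closing_price': [13.3, 13.3, 13.4],
})
# Hypothetical feature that has no value for 2013-01-02
feature = pd.DataFrame({
    'Date': pd.to_datetime(['2013-01-01', '2013-01-03']),
    'toy_feature': [10.0, 30.0],
})

# Left join: every row of `prices` survives, unmatched dates get NaN
merged = pd.merge(prices, feature, on=['Date'], how='left')
print(merged)
```

This is one way missing values can enter the joined dataframe, in addition to gaps in the source data itself.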


### Next day closing price

In [24]:
final_df['next_day_closing_price'] = final_df['closing_price'].shift(-1)
final_df

Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,avg_hashrate,...,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue,next_day_closing_price
0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,24331539528899,...,625.432,14.518,,1.194,37846,19.536,0.627,10621175.00,5.264860e+04,13.3
1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,22804647966378,...,650.617,14.514,,1.497,43104,19.597,0.835,10621575.00,5.486525e+04,13.4
2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,23724885599725,...,542.73,19.732,,1.798,51268,19.621,0.925,10628700.00,4.811833e+04,13.5
3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,22608181137438,...,632.431,11.384,,1.841,47341,19.54,1,10632425.00,5.087274e+04,13.4
4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,22590695489434,...,697.556,13.945,,1.826,53417,19.543,0.885,10633200.00,5.139673e+04,13.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3373,2022-03-28,46859.0,48199.0,46672.0,47105.0,293145,683575,520476,27452707696467,1.925704544402E+20,...,764524,649.352,239616,96.113,916092,15.531,1.077,18996587.50,4.754287e+07,47449.0
3374,2022-03-29,47126.0,48127.0,47029.0,47449.0,286789,656320,519299,27452707696467,2.2061925827943E+20,...,737579,609.259,234380,84.155,828096,15.464,1.045,18997462.50,4.533058e+07,47075.0
3375,2022-03-30,47449.0,47714.0,46601.0,47075.0,272729,727827,485753,27452707696467,1.9751284069115E+20,...,637283,590.518,200360,79.054,769223,15.466,1.308,18998493.75,4.037914e+07,45525.0
3376,2022-03-31,47071.0,47624.0,45228.0,45525.0,279072,718935,516129,28315874718216,1.94309372292E+20,...,655276,662.979,164113,76.768,806205,15.459,1.889,18999187.50,4.107078e+07,46297.0
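
The target column is built with `shift(-1)`, which moves each closing price up by one row so that every day is paired with the following day's close. A small sketch on made-up rows shows the effect, including the missing value that appears on the last row:

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']),
    'closing_price': [13.3, 13.3, 13.4],
})

# Each row receives the closing price of the row below it
toy['next_day_closing_price'] = toy['closing_price'].shift(-1)

# The last row has no following day, so its target is NaN
toy_model_rows = toy.dropna(subset=['next_day_closing_price'])
print(toy)
```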


### Saving Raw data

In [25]:
final_df.to_csv("rawdata.csv")

## Data Cleaning

### Date Check

In [26]:
final_df = final_df[(final_df['Date'] >= '2013-01-01')].reset_index(drop=True)

This filter keeps only the rows dated 2013-01-01 or later, since only that period is considered.

### Data Duplicates

In [27]:
dups = final_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0


There are no duplicate rows in the data.

### Missing Value Imputation

In [28]:
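#Treat the string 'null' as a missing value
final_df.replace(to_replace='null', value=np.nan,inplace=True)
#Drop the last row (after shift(-1) it has no next-day closing price)
final_df.drop(final_df.tail(1).index,inplace=True)
final_df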

Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,avg_hashrate,...,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue,next_day_closing_price
0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,24331539528899,...,625.432,14.518,,1.194,37846,19.536,0.627,10621175.00,5.264860e+04,13.3
1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,22804647966378,...,650.617,14.514,,1.497,43104,19.597,0.835,10621575.00,5.486525e+04,13.4
2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,23724885599725,...,542.73,19.732,,1.798,51268,19.621,0.925,10628700.00,4.811833e+04,13.5
3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,22608181137438,...,632.431,11.384,,1.841,47341,19.54,1,10632425.00,5.087274e+04,13.4
4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,22590695489434,...,697.556,13.945,,1.826,53417,19.543,0.885,10633200.00,5.139673e+04,13.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3372,2022-03-27,44542.0,46947.0,44445.0,46859.0,218071,595651,421737,27452707696467,1.9945908047524E+20,...,678394,505.231,202928,72.196,696507,15.548,0.905,18995681.25,3.761548e+07,47105.0
3373,2022-03-28,46859.0,48199.0,46672.0,47105.0,293145,683575,520476,27452707696467,1.925704544402E+20,...,764524,649.352,239616,96.113,916092,15.531,1.077,18996587.50,4.754287e+07,47449.0
3374,2022-03-29,47126.0,48127.0,47029.0,47449.0,286789,656320,519299,27452707696467,2.2061925827943E+20,...,737579,609.259,234380,84.155,828096,15.464,1.045,18997462.50,4.533058e+07,47075.0
3375,2022-03-30,47449.0,47714.0,46601.0,47075.0,272729,727827,485753,27452707696467,1.9751284069115E+20,...,637283,590.518,200360,79.054,769223,15.466,1.308,18998493.75,4.037914e+07,45525.0


In [29]:
missing_values = pd.DataFrame(final_df.isna().sum(),columns=['missing_count'])
missing_values.sort_values(by='missing_count',ascending=False)

Unnamed: 0,missing_count
tweets,519
active_addresses,22
top100_to_total_percentage,6
avg_block_time,1
Date,0
median_transaction_fees,0
miner_revenue,0
number_of_coins_in_circulation,0
avg_fee_to_reward,0
google_trends,0


### Treating missing values 

In [30]:
final_df['tweets'] = final_df['tweets'].fillna(final_df['tweets'].rolling(40, min_periods=1).mean()).bfill()
final_df['active_addresses'] = final_df['active_addresses'].fillna(final_df['active_addresses'].rolling(14, min_periods=1).mean())
final_df['google_trends'] = final_df['google_trends'].fillna(final_df['google_trends'].rolling(14, min_periods=1).mean())
final_df['top100_to_total_percentage'] = final_df['top100_to_total_percentage'].fillna(final_df['top100_to_total_percentage'].rolling(7, min_periods=1).mean())
final_df['avg_block_time'] = final_df['avg_block_time'].fillna(final_df['avg_block_time'].rolling(7, min_periods=1).mean())
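
The imputation above fills each gap with the mean of the values in a trailing window, and `tweets` is additionally back-filled because its gaps sit at the start of the series, where no earlier values exist. A minimal sketch on a made-up numeric series with a window of 3 rows:

```python
import numpy as np
import pandas as pd

toy = pd.Series([np.nan, np.nan, 1.0, 2.0, np.nan, 4.0])

# Mean of the available (non-missing) values in a trailing window of 3 rows
trailing_mean = toy.rolling(3, min_periods=1).mean()

# Fill gaps with the trailing mean; leading gaps have no earlier values,
# so they stay NaN until the back-fill copies the first known value into them
filled = toy.fillna(trailing_mean).bfill()
print(filled.tolist())
```

Back-filling uses later observations, which may matter if the earliest rows are later used for model training.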

In [31]:
missing_values = pd.DataFrame(final_df.isna().sum(),columns=['missing_count'])
missing_values.sort_values(by='missing_count',ascending=False)

Unnamed: 0,missing_count
Date,0
median_transaction_fees,0
miner_revenue,0
number_of_coins_in_circulation,0
avg_fee_to_reward,0
top100_to_total_percentage,0
active_addresses,0
google_trends,0
tweets,0
median_transaction_value,0


**The data is cleaned and ready for use**

### Viewing the final Data frame

In [32]:
final_df.head()

Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,avg_hashrate,...,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue,next_day_closing_price
0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,24331539528899,...,625.432,14.518,8193,1.194,37846,19.536,0.627,10621175.0,52648.6,13.3
1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,22804647966378,...,650.617,14.514,8193,1.497,43104,19.597,0.835,10621575.0,54865.251543,13.4
2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,23724885599725,...,542.73,19.732,8193,1.798,51268,19.621,0.925,10628700.0,48118.33062,13.5
3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,22608181137438,...,632.431,11.384,8193,1.841,47341,19.54,1.0,10632425.0,50872.74,13.4
4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,22590695489434,...,697.556,13.945,8193,1.826,53417,19.543,0.885,10633200.0,51396.725494,13.4


### Creating CSV file for clean data

In [33]:
final_df.to_csv("cleandata.csv")

In [40]:
cleandata = pd.read_csv('cleandata.csv')
cleandata.head()

Unnamed: 0.1,Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,...,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue,next_day_closing_price
0,0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,...,625.432,14.518,8193.0,1.194,37846.0,19.536,0.627,10621175.0,52648.6,13.3
1,1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,...,650.617,14.514,8193.0,1.497,43104.0,19.597,0.835,10621575.0,54865.251543,13.4
2,2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,...,542.73,19.732,8193.0,1.798,51268.0,19.621,0.925,10628700.0,48118.33062,13.5
3,3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,...,632.431,11.384,8193.0,1.841,47341.0,19.54,1.0,10632425.0,50872.74,13.4
4,4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,...,697.556,13.945,8193.0,1.826,53417.0,19.543,0.885,10633200.0,51396.725494,13.4


### Deleting unnecessary variable  

The `Unnamed: 0` column is the old row index that `to_csv` wrote to the file, so it is unnecessary and we delete it from our data.

In [38]:
del cleandata['Unnamed: 0']
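
An alternative that avoids the extra column is to write the file without the index. The sketch below uses an in-memory buffer and a made-up frame instead of the real files, purely for illustration:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Date': ['2013-01-01', '2013-01-02'],
                    'closing_price': [13.3, 13.3]})

# Default: the row index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the real columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```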

In [39]:
cleandata.head()

Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,avg_hashrate,...,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue,next_day_closing_price
0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,24331540000000.0,...,625.432,14.518,8193.0,1.194,37846.0,19.536,0.627,10621175.0,52648.6,13.3
1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,22804650000000.0,...,650.617,14.514,8193.0,1.497,43104.0,19.597,0.835,10621575.0,54865.251543,13.4
2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,23724890000000.0,...,542.73,19.732,8193.0,1.798,51268.0,19.621,0.925,10628700.0,48118.33062,13.5
3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,22608180000000.0,...,632.431,11.384,8193.0,1.841,47341.0,19.54,1.0,10632425.0,50872.74,13.4
4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,22590700000000.0,...,697.556,13.945,8193.0,1.826,53417.0,19.543,0.885,10633200.0,51396.725494,13.4


----

### The data has been scraped from the site, cleaned and saved in CSV format for use in prediction.