# <center> PREDICTING THE NEXT DAY PRICE OF BITCOIN USING MACHINE LEARNING TECHNIQUES </center>
## <center>Feature Engineering </center>
### <center> 2148040, 2148041 </center>

Feature engineering is the process of using domain knowledge about the data under study to create features that tailor the predictive machine learning models to the problem.  

Here, in the field of bitcoin prediction, we use technical indicators. Technical indicators are pattern-based signals produced by the price, volume, and/or open interest of a security or contract used by traders who follow technical analysis.


Technical analysts apply indicators to historical asset price data to predict future price movements and to judge entry and exit points for trades.  

Different indicators capture trend, momentum, volatility and volume, e.g., moving averages, rate of change, Bollinger Bands, Moving Average Convergence Divergence (MACD) and standard deviation.

We compute the following technical indicators for each feature over the past 7, 30 and 90 days. The Python library `pandas-ta` is used to calculate all of them.

## IMPORTING DATA AND LIBRARIES

Use `!pip install pandas_ta` to install the technical analysis library.

In [1]:
#Data Manipulation, analysis
import pandas as pd
import numpy as np

#Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.lines as mlines

#Statistical computation
import scipy.stats as st
from scipy.stats import describe

#To ignore warnings 
import warnings
warnings.filterwarnings("ignore")

#Using for feature eng. technical indicators calculation
import pandas_ta as ta

In [2]:
final_df = pd.read_csv('cleandata.csv')
final_df

Unnamed: 0.1,Unnamed: 0,Date,opening_price,highest_price,lowest_price,closing_price,transactions_in_blockchain,avg_block_size,sent_by_adress,avg_mining_difficulty,...,avg_transaction_value,median_transaction_value,tweets,google_trends,active_addresses,top100_to_total_percentage,avg_fee_to_reward,number_of_coins_in_circulation,miner_revenue,next_day_closing_price
0,0,2013-01-01,13.5,13.6,13.2,13.3,31734,89033,26174,2979637,...,625.432,14.518,8193.0,1.194,37846.0,19.536,0.627,10621175.00,5.264860e+04,13.3
1,1,2013-01-02,13.3,13.4,13.2,13.3,39280,114077,31809,2979637,...,650.617,14.514,8193.0,1.497,43104.0,19.597,0.835,10621575.00,5.486525e+04,13.4
2,2,2013-01-03,13.3,13.5,13.3,13.4,42147,108023,38197,2979637,...,542.730,19.732,8193.0,1.798,51268.0,19.621,0.925,10628700.00,4.811833e+04,13.5
3,3,2013-01-04,13.4,13.5,13.3,13.5,48436,141811,34990,2979637,...,632.431,11.384,8193.0,1.841,47341.0,19.540,1.000,10632425.00,5.087274e+04,13.4
4,4,2013-01-05,13.5,13.6,13.3,13.4,39455,118240,38008,2979637,...,697.556,13.945,8193.0,1.826,53417.0,19.543,0.885,10633200.00,5.139673e+04,13.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3372,3372,2022-03-27,44542.0,46947.0,44445.0,46859.0,218071,595651,421737,27452707696467,...,678394.000,505.231,202928.0,72.196,696507.0,15.548,0.905,18995681.25,3.761548e+07,47105.0
3373,3373,2022-03-28,46859.0,48199.0,46672.0,47105.0,293145,683575,520476,27452707696467,...,764524.000,649.352,239616.0,96.113,916092.0,15.531,1.077,18996587.50,4.754287e+07,47449.0
3374,3374,2022-03-29,47126.0,48127.0,47029.0,47449.0,286789,656320,519299,27452707696467,...,737579.000,609.259,234380.0,84.155,828096.0,15.464,1.045,18997462.50,4.533058e+07,47075.0
3375,3375,2022-03-30,47449.0,47714.0,46601.0,47075.0,272729,727827,485753,27452707696467,...,637283.000,590.518,200360.0,79.054,769223.0,15.466,1.308,18998493.75,4.037914e+07,45525.0


In [3]:
final_df.shape

(3377, 26)

`Unnamed: 0` is a redundant index column, so we delete it from the data.

In [4]:
del final_df['Unnamed: 0']

In [5]:
final_df.shape

(3377, 25)

### Feature smoothing function

In [6]:
def feature_smoothening(df, feature_name, smoothening_type, smoothening_range=(7, 30, 90)):
    """Add pandas-ta indicator columns for `feature_name` to `df` in place."""
    series = df[feature_name]

    # Indicators computed from a single series and a window length
    windowed_indicators = {
        'sma': ta.sma, 'wma': ta.wma, 'ema': ta.ema, 'dema': ta.dema, 'tema': ta.tema,
        'var': ta.variance, 'stdev': ta.stdev, 'rsi': ta.rsi, 'roc': ta.roc,
    }
    # Bollinger Band column prefixes returned by ta.bbands (2.0 standard deviations)
    bband_prefixes = {'bband_lower': 'BBL', 'bband_upper': 'BBU'}

    if smoothening_type in windowed_indicators:
        for j in smoothening_range:
            df[f'{smoothening_type}{j} {feature_name}'] = windowed_indicators[smoothening_type](series, j)

    elif smoothening_type in bband_prefixes:
        for j in smoothening_range:
            bband_df = ta.bbands(series, j)
            df[f'{smoothening_type}{j} {feature_name}'] = bband_df[f'{bband_prefixes[smoothening_type]}_{j}_2.0']

    elif smoothening_type == 'macd':
        # MACD uses the default 12/26/9 periods and ignores smoothening_range
        macd_df = ta.macd(series)
        df[f'{smoothening_type} hist {feature_name}'] = macd_df['MACDh_12_26_9']
        df[f'{smoothening_type} signal {feature_name}'] = macd_df['MACDs_12_26_9']
        df[f'{smoothening_type} {feature_name}'] = macd_df['MACD_12_26_9']

### Smoothing highlights the underlying patterns in the data, and different smoothing types can be used

In [7]:
feature_list = [i for i in list(final_df.columns) if i not in ['Date','next_day_closing_price']]

In [8]:
feature_smoothening(final_df,'closing_price','macd')

In [9]:
feature_list

['opening_price',
 'highest_price',
 'lowest_price',
 'closing_price',
 'transactions_in_blockchain',
 'avg_block_size',
 'sent_by_adress',
 'avg_mining_difficulty',
 'avg_hashrate',
 'mining_profitability',
 'sent_coins_in_usd',
 'avg_transaction_fees',
 'median_transaction_fees',
 'avg_block_time',
 'avg_transaction_value',
 'median_transaction_value',
 'tweets',
 'google_trends',
 'active_addresses',
 'top100_to_total_percentage',
 'avg_fee_to_reward',
 'number_of_coins_in_circulation',
 'miner_revenue']

#### We compute every indicator for each feature over the past 7, 30 and 90 days, which is the range passed to the function. The Python library `pandas-ta` is used to calculate all features.
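To make the effect of this range concrete, the sketch below previews the column names that one call adds for a single feature. It only mirrors the f-string naming used in `feature_smoothening`; `preview_column_names` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
# Hypothetical helper: mirrors the f-string naming used in feature_smoothening
def preview_column_names(feature_name, smoothening_type, smoothening_range=(7, 30, 90)):
    return [f'{smoothening_type}{j} {feature_name}' for j in smoothening_range]

names = preview_column_names('closing_price', 'sma')
print(names)  # ['sma7 closing_price', 'sma30 closing_price', 'sma90 closing_price']
```

Each indicator therefore contributes three new columns per feature, one per window length.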

## Simple Moving Average
* A simple moving average (SMA) is the unweighted mean of the feature values over the most recent n periods (here 7, 30 or 90 days).

* It helps determine whether an asset price is likely to continue its current bull or bear trend or to reverse it.
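As a rough illustration of the SMA, the plain-Python sketch below averages a sliding window over the first five closing prices shown in the data preview. The function name `simple_moving_average` and the 3-day window are assumptions chosen for readability; the notebook itself relies on `ta.sma`.

```python
def simple_moving_average(values, n):
    """Return a list whose entry i is the mean of the n values ending at i.

    The first n - 1 entries are None because the window is not yet full.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < n:
            out.append(None)
        else:
            window = values[i + 1 - n:i + 1]
            out.append(sum(window) / n)
    return out

closing = [13.3, 13.3, 13.4, 13.5, 13.4]  # first five closing prices in the data
sma3 = simple_moving_average(closing, 3)
print(sma3)
```

The leading `None` entries correspond to the NaN values that appear at the start of every rolling feature.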

In [10]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'sma')

## Weighted Moving Average
* When calculating the average, the WMA assigns a greater weight to the most recent data points and less weight to data points in the distant past.  

#### <center>$WMA = \frac{Price_{1} * n + Price_{2} * (n-1) + \dots + Price_{n} * 1}{\frac{n(n+1)}{2}}$</center>

Here $Price_{1}$ is the most recent value and $Price_{n}$ is the oldest value in the window.
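A worked instance of the WMA weighting may help. The sketch below assumes linear weights, with the most recent value weighted n and the oldest weighted 1; `weighted_moving_average` is an illustrative name, not the `pandas-ta` function.

```python
def weighted_moving_average(window):
    """WMA of one window; window[-1] is the most recent value (weight n)."""
    n = len(window)
    weights = range(1, n + 1)  # oldest value gets weight 1, newest gets weight n
    return sum(w * v for w, v in zip(weights, window)) / (n * (n + 1) / 2)

window = [13.3, 13.4, 13.5]          # rising prices, most recent last
wma3 = weighted_moving_average(window)
sma3 = sum(window) / len(window)
print(wma3, sma3)
```

Because the prices are rising, the WMA lies above the simple average of the same window, which shows the extra weight on recent values.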

In [11]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'wma')

## Exponential Moving Average
* The EMA is a moving average that places greater weight and significance on the most recent data points. It works similarly to the WMA, but its weights decay exponentially rather than linearly.

* The EMA adapts more quickly to price changes than the SMA does. For example, when a price reverses direction, the EMA will reverse direction more quickly than the SMA will, because the EMA formula gives more weight to recent prices and less weight to prices from the past.
  
#### <center>$EMA_{Today} = Value_{Today} * \frac{Smoothing}{1 + Days} + EMA_{Yesterday} * \left(1 - \frac{Smoothing}{1 + Days}\right)$</center>
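The recursion can be sketched in a few lines of plain Python. Two assumptions are made here: the smoothing constant is 2, and the series is seeded with its first value. Libraries may seed differently (for example with an SMA), so early values can differ from `ta.ema`.

```python
def exponential_moving_average(values, days, smoothing=2.0):
    """EMA following the recursion above."""
    alpha = smoothing / (1 + days)
    ema = [values[0]]  # assumption: seed with the first observation
    for v in values[1:]:
        ema.append(v * alpha + ema[-1] * (1 - alpha))
    return ema

ema3 = exponential_moving_average([10.0, 10.0, 10.0, 20.0, 20.0], days=3)
print(ema3)  # [10.0, 10.0, 10.0, 15.0, 17.5]
```

After the jump from 10 to 20, the EMA covers half of the remaining gap at each step, since the weight on the newest value is 2 / (1 + 3) = 0.5.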

In [12]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'ema')

## Double Exponential Moving Average
* DEMA responds more quickly to near-term price changes than a normal exponential moving average (EMA). 

* It helps to filter out noise

#### <center>$DEMA = 2 * EMA - EMA(EMA)$</center>

In [13]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'dema')

## Triple Exponential Moving Average
* It uses multiple EMA calculations and subtracts the lag to create a trend following indicator that reacts quickly to price changes.

* The TEMA reacts to price changes quicker than a traditional MA or EMA will. This is because some of the lag has been subtracted out in the calculation.  

#### <center>$TEMA = 3 * EMA - 3 * EMA(EMA) + EMA(EMA(EMA))$</center>
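The DEMA and TEMA formulas can be compared side by side with a small plain-Python sketch. The helper `ema` seeds with the first value and uses a smoothing constant of 2; both are assumptions, and the function names are illustrative rather than the `pandas-ta` implementations.

```python
def ema(values, days, smoothing=2.0):
    alpha = smoothing / (1 + days)
    out = [values[0]]  # assumption: seed with the first observation
    for v in values[1:]:
        out.append(v * alpha + out[-1] * (1 - alpha))
    return out

def dema(values, days):
    e1 = ema(values, days)
    e2 = ema(e1, days)                       # EMA of the EMA
    return [2 * a - b for a, b in zip(e1, e2)]

def tema(values, days):
    e1 = ema(values, days)
    e2 = ema(e1, days)
    e3 = ema(e2, days)                       # EMA of the EMA of the EMA
    return [3 * a - 3 * b + c for a, b, c in zip(e1, e2, e3)]

prices = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]  # synthetic jump at index 3
print(ema(prices, 3)[3], dema(prices, 3)[3], tema(prices, 3)[3])  # 15.0 17.5 18.75
```

On the day of the jump the TEMA is closest to the new price, followed by the DEMA and then the EMA, which matches the lag-reduction described above.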

In [14]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'tema')

## Standard Deviation
* Standard deviation is the statistical measure of market volatility, measuring how widely feature values are dispersed from the average feature values. 
* If feature values trade in a narrow trading range, the standard deviation will return a low value that indicates low volatility. 
* Conversely, if feature values swing wildly up and down, then standard deviation returns a high value that indicates high volatility.

In [15]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'stdev')

## Variance
* Variance is another statistical measure of market volatility, measuring how widely feature values are dispersed from the average feature value. 
* It is interpreted in the same way as standard deviation, because variance is the square of the standard deviation.
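The relationship between the two dispersion measures can be checked on a single window. This sketch uses Python's `statistics` module with sample statistics (n - 1 in the denominator), which is an assumption; `window_dispersion` is an illustrative helper and the values are synthetic.

```python
import statistics

def window_dispersion(window):
    """Return (standard deviation, variance) of one window of feature values."""
    # Assumption: sample statistics, i.e. n - 1 in the denominator
    return statistics.stdev(window), statistics.variance(window)

calm_sd, calm_var = window_dispersion([10.0, 10.0, 10.0])         # narrow range
volatile_sd, volatile_var = window_dispersion([5.0, 10.0, 15.0])  # wide swings
print(calm_sd, volatile_sd, volatile_var)
```

The flat window has zero dispersion, while the swinging window has a standard deviation of 5 and a variance of 25, its square.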

In [16]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'var')

## Relative Strength Index

#### <center>$RSI = 100 - \frac{100}{1 + RS}$</center>

#### <center>$RS = \frac{n_{up}}{n_{down}}$</center>

where $n_{up}$ is the average gain and $n_{down}$ is the average loss over the period.

An asset is considered oversold or undervalued when the RSI drops below 30. On the other hand, it's deemed to be overbought if the RSI goes above 70.
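The RSI formula and the 30/70 thresholds can be illustrated with a simplified plain-Python sketch. It uses simple averages of gains and losses over the last n changes, which is an assumption made for clarity; `ta.rsi` may smooth these averages differently, so its values need not match. The name `rsi_simple` and the price series are illustrative.

```python
def rsi_simple(values, n):
    """RSI over the last n price changes, using simple averages."""
    changes = [b - a for a, b in zip(values[-n - 1:-1], values[-n:])]
    avg_gain = sum(c for c in changes if c > 0) / n
    avg_loss = sum(-c for c in changes if c < 0) / n
    if avg_loss == 0:
        return 100.0  # convention used here when the window has no losses
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

rising = rsi_simple([10.0, 11.0, 12.0, 11.0, 13.0], 4)   # mostly gains
falling = rsi_simple([13.0, 12.0, 11.0, 12.0, 10.0], 4)  # mostly losses
print(rising, falling)
```

The mostly rising series yields an RSI of about 80 (above 70, overbought) and the mostly falling series about 20 (below 30, oversold).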

In [17]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'rsi')

## Rate of Change
* ROC measures the percentage change between the current feature value and the feature value a certain number of periods ago, e.g., 7, 30 or 90.

* A rising ROC above zero typically confirms an uptrend, while a falling ROC below zero indicates a downtrend.
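The percentage change described above can be written out directly. The sketch below is a plain-Python illustration with a synthetic series and a 1-period lag; `rate_of_change` is an assumed name, not the `pandas-ta` function.

```python
def rate_of_change(values, n):
    """Percentage change between each value and the value n periods earlier."""
    out = []
    for i, v in enumerate(values):
        if i < n:
            out.append(None)  # no value n periods ago yet
        else:
            past = values[i - n]
            out.append(100 * (v - past) / past)
    return out

roc1 = rate_of_change([100.0, 110.0, 99.0], 1)
print(roc1)
```

A move from 100 to 110 gives a ROC of +10, and the following drop from 110 to 99 gives -10.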






In [18]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'roc')

## Bollinger Bands
* Bollinger Bands are envelopes (Upper and Lower range levels) plotted at a standard deviation level above and below a simple moving average of the price. Because the distance of the bands is based on standard deviation, they adjust to volatility swings in the underlying price.



<center>$\text{Upper band} = SMA_{n\text{-day}} + 2(SD_{n\text{-day}})$</center>

<center>$\text{Lower band} = SMA_{n\text{-day}} - 2(SD_{n\text{-day}})$</center>


* Bollinger bands help determine whether values are high or low on a relative basis. They are used in pairs, both upper and lower bands and in conjunction with a moving average. Further, the pair of bands is not intended to be used on its own. Use the pair to confirm signals given with other indicators.
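The band formulas can be sketched for a single window in plain Python. The population standard deviation is used here as an assumption (libraries differ on this choice), and `bollinger_bands` is an illustrative helper rather than the `ta.bbands` call used in the notebook.

```python
import statistics

def bollinger_bands(window, num_std=2.0):
    """Return (lower, middle, upper) bands for one window of values."""
    mid = sum(window) / len(window)   # simple moving average of the window
    sd = statistics.pstdev(window)    # assumption: population standard deviation
    return mid - num_std * sd, mid, mid + num_std * sd

lower, mid, upper = bollinger_bands([8.0, 10.0, 12.0, 10.0])
print(lower, mid, upper)
```

The bands sit symmetrically around the moving average and widen as the values in the window become more dispersed.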


In [19]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'bband_lower')

In [20]:
for feature in feature_list:
    # feature_list already excludes 'Date' and the target, so no extra filtering is needed
    feature_smoothening(final_df,feature,'bband_upper')

## Moving Average Convergence Divergence
The MACD is a trend-following indicator that shows whether short-term price momentum is moving in the same direction as long-term price momentum; when it is not, the MACD is used to judge whether a trend change is near. The MACD consists of four components: the MACD line, the signal line, the histogram and the zero line. The first three are stored as features here.
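A minimal plain-Python sketch of the MACD line, signal line and histogram follows. It assumes the 12/26/9 periods that appear in the `MACD_12_26_9` column names, an EMA seeded with the first value, and a synthetic rising price series; the helper names are illustrative and early values will not match `ta.macd` exactly.

```python
def ema(values, days, smoothing=2.0):
    alpha = smoothing / (1 + days)
    out = [values[0]]  # assumption: seed with the first observation
    for v in values[1:]:
        out.append(v * alpha + out[-1] * (1 - alpha))
    return out

def macd(values, fast=12, slow=26, signal=9):
    fast_ema = ema(values, fast)
    slow_ema = ema(values, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]       # short-term minus long-term
    signal_line = ema(macd_line, signal)                          # smoothed MACD line
    histogram = [m - s for m, s in zip(macd_line, signal_line)]   # distance between the two
    return macd_line, signal_line, histogram

prices = [100.0 + i for i in range(60)]  # synthetic, steadily rising series
macd_line, signal_line, histogram = macd(prices)
print(macd_line[-1] > 0, histogram[-1] > 0)
```

In a steady uptrend the fast EMA stays above the slow EMA, so the MACD line is above the zero line.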





In [21]:
for feature in feature_list:
    feature_smoothening(final_df,feature,'macd')

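# Correlation heatmap over the numeric columns only ('Date' is a string column)
plt.figure(figsize=(20,20))
corrMatrix = final_df.select_dtypes(include='number').corr()
sns.heatmap(corrMatrix, annot=True,linewidths=3, linecolor='black').set(title='Heatmap for correlation')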

## Final Processing and saving file

In [None]:
# Drop the first 90 days (warm-up period of the 90-day windows), then back-fill any remaining NaNs
final_df = final_df[final_df['Date'] >= '2013-04-01'].bfill()

In [None]:
final_df.shape

In [None]:
final_df.head()

In [None]:
final_df.to_csv('final_data_after_feat_engg.csv')

In [None]:
#plot a heatmap again after generating features

## Conclusion for feature engineering

Feature engineering constructs additional variables, or features, with the aim of improving model performance and accuracy. The data initially had 3377 rows and 25 columns. After dropping the first 90 days (before 2013-04-01), which serve as the warm-up period for the 90-day windows, 3287 rows remain, and the number of columns has increased because the indicator features were added to the existing data.  