# Goal: using the technical indicators provided in the data set of winning trades, predict the value of the target variable *tipo* (type of trade: buy or sell)

## Import and data loading

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import time
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
from bokeh.models import DatetimeTickFormatter
from sklearn import model_selection, metrics, linear_model, datasets, feature_selection, tree, preprocessing
from sklearn.model_selection import train_test_split

In [None]:
df1 = pd.read_csv('data/EURUSD_15m_BID.csv', sep=",")
df2 = pd.read_csv('data/EURUSD_4h_profit.csv',sep=",")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

## DF1 - Price values for EURUSD pair

 **Variable Definitions**
 
  - Open: price of the pair at the start of the time interval
  - High: highest price of the pair within the duration of the time interval
  - Low: minimum price of the pair within the duration of the time interval
  - Close: price of the pair at the end of the time interval
  - Volume: number of trades that occurred within the duration of the time interval

In [None]:
df1=df1.set_index("Time")
df1.index.names=[None]
df1.head()

## DF2 - Winning trades for EURUSD pair

 **Variable Definitions**
 
  - **RSI**: The relative strength index (RSI) is a momentum indicator that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset.

$$RSI_\text{step one}= 100− \left[ \frac{100}{1+\frac{\text {Average gain}}{\text {Average loss}}} \right]$$

$$RSI_\text{step two}= 100 - \left[ \frac{100}{1+\frac{\text{Previous average gain}*13+ \text{Current gain} }{\text{Previous average loss}*13+ \text{Current loss} }} \right]$$
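
As a rough illustration of the two-step formula, the sketch below computes a 14-period RSI with pandas. The helper name `rsi_wilder` and its default period are assumptions made for this example; the data set ships precomputed RSI values, and how they were generated is not stated. Note that the exponential average used here is seeded with the first observation rather than with the simple average of step one, so the earliest values can differ slightly from a textbook implementation.

```python
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (hypothetical helper, not part of the data set)."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive price changes, zero otherwise
    loss = (-delta).clip(lower=0)     # magnitude of negative changes, zero otherwise
    # alpha = 1/period reproduces the recursive update of step two:
    # avg_t = (avg_{t-1} * (period - 1) + current) / period
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)
```

With no losses in the window the ratio becomes infinite and the RSI evaluates to 100, which matches the usual convention.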
  
  - **Stoch**: A stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time. The sensitivity of the oscillator to market movements is reducible by adjusting that time period or by taking a moving average of the result.

$$\%K = \left( \frac{C-L14}{H14-L14} \right) * 100 $$
  
 
  - **EMA**: An exponential moving average (EMA) is a type of moving average (MA) that places a greater weight and significance on the most recent data points. The four variables listed after the equation are the slopes of the EMA over the last 20, 50, 100, and 200 days.
   
$$EMA_{Today} =  \left( \left( Value_{Today} * \frac{Smoothing}{1+Days} \right) \right) + EMA_{Yesterday}* \left( 1- \left( \frac{Smoothing}{1+Days} \right) \right)$$
   
  - ema20Slope
  - ema50Slope
  - ema100Slope
  - ema200Slope<br><br>
  - **std**: Standard Deviation, a statistical measure of a stock's volatility.

  
  - **mom**: The momentum indicator compares the price of an instrument to its price a selected number of periods earlier. It is calculated as the ratio of the most recent closing price to the closing price n periods before, multiplied by 100. A value above 100 means the price is rising; a value below 100 represents a downward trend.
  
  
$$MOM = \left( \frac{CP}{CPn} \right) * 100$$<br>

<center>    
   <b>where:</b> <br>
- CP = most recent closing price<br>
- CPn = closing price n periods earlier<br>
- C = most recent closing price (in the %K formula)<br>
- L14 = lowest price traded of the previous 14 trading sessions<br>
- H14 = highest price traded during the same 14-day period<br>
- %K = current value of the stochastic indicator<br>
</center>

  - **BB_up_percen**:
  
  - **cci**: Commodity Channel Index (CCI) is a momentum-based oscillator used to help determine when an investment vehicle is reaching a condition of being overbought or oversold. It is also used to assess price trend direction and strength.
  
$$CCI = \frac{\text{Typical Price} - MA}{.015*\text{Mean Deviation}}$$
  
$$\text{Typical Price} = \sum_{i=1}^{P} \frac {High + Low + Close}{3}$$

$$P = \text{Number of Periods}$$

$$MA = \text{Moving Average} = \frac {\sum_{i=1}^{P} \text{Typical Price}}{P}$$
 
$$\text{Mean Deviation} = \frac{\sum_{i=1}^{P} |\text{Typical Price} - \text{MA}|}{P}$$
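
The CCI definitions above can be sketched with rolling windows in pandas. This is only an illustration: the helper name `cci` and the default period of 20 are assumptions, and the typical price is taken per bar as (High + Low + Close) / 3, which is the common convention.

```python
import numpy as np
import pandas as pd

def cci(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 20) -> pd.Series:
    """Commodity Channel Index over a rolling window (hypothetical helper)."""
    tp = (high + low + close) / 3              # per-bar typical price
    ma = tp.rolling(period).mean()             # moving average of the typical price
    # mean absolute deviation of the typical price inside each window
    md = tp.rolling(period).apply(lambda w: np.abs(w - w.mean()).mean(), raw=True)
    return (tp - ma) / (0.015 * md)
```

The first `period - 1` values are NaN because the window is not yet full, and a perfectly flat window yields NaN because the mean deviation is zero.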
  
  - **force**: The force index is a technical indicator that measures the amount of power used to move the price of an asset. The term and its formula were developed by psychologist and trader Alexander Elder and published in his 1993 book Trading for a Living. The force index uses price and volume to determine the amount of strength behind a price move.
  
$$\text{FI} \left( 1 \right) = \left( \text{CCP} - \text{PCP} \right) * \text{V}$$

$$\text{FI} \left( 13 \right) = \text{13-Period EMA of FI} \left( 1 \right)$$<br>

<center>    
   <b>where:</b> <br>
- FI = Force Index <br>
- CCP = current close price <br>
- PCP = previous close price <br>
- V = volume for the period <br>
- EMA = Exponential moving average 
</center>
  
  
  - **macd**: The MACD (moving average convergence divergence) charts the difference between two exponential moving averages (a longer-period EMA subtracted from a shorter-period EMA). The most common settings are a 26-period EMA and a 12-period EMA. The MACD is positive when the EMA(12) is above the EMA(26), which indicates that the shorter-period EMA is rising faster than the longer-period EMA, i.e. positive momentum. Conversely, it is negative when the EMA(12) is below the EMA(26), which indicates negative momentum.
  
$$MACD=EMA_{12} − EMA_{26}$$
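
A minimal sketch of the MACD line, assuming the closing prices are available as a pandas Series. The helper name `macd_line` is hypothetical. `span=N` corresponds to a smoothing factor of 2 / (N + 1), which agrees with the EMA equation given earlier when Smoothing = 2.

```python
import pandas as pd

def macd_line(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    """MACD line: fast EMA minus slow EMA (hypothetical helper)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    return ema_fast - ema_slow

# Toy data: a steadily rising price should give a positive MACD.
prices = pd.Series([float(p) for p in range(1, 61)])
print(macd_line(prices).iloc[-1] > 0)
```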

  - **bearsPower**: The Bears Power oscillator was developed by Alexander Elder. It measures the difference between the lowest price and a 13-day Exponential Moving Average (EMA), plotted as a histogram. If the Bears Power indicator is below zero, sellers were able to drive the price below the EMA. If it is above zero, buyers were able to keep the lowest price above the EMA.
  
$$\text{Bears Power} = Low - EMA_{13}$$
  
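  - **bullsPower**: The Bulls Power oscillator is the counterpart of Bears Power. It measures the difference between the highest price and a 13-day Exponential Moving Average (EMA).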
  
$$\text {Bulls Power} = High - EMA_{13}$$
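
The Bears Power and Bulls Power formulas share the same 13-period EMA, so they can be computed together. The sketch below assumes the EMA is taken over closing prices, which is the usual convention but is not stated in this notebook; the helper name `elder_power` is hypothetical.

```python
import pandas as pd

def elder_power(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 13):
    """Return (bulls_power, bears_power) as two Series (hypothetical helper)."""
    ema = close.ewm(span=period, adjust=False).mean()   # assumed: EMA of the close
    bulls = high - ema
    bears = low - ema
    return bulls, bears
```

Because both values subtract the same EMA, their difference is always the bar's High minus Low range.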

  - **WPR**: Williams %R, also known as the Williams Percent Range, is a type of momentum indicator that moves between 0 and -100 and measures overbought and oversold levels. The Williams %R may be used to find entry and exit points in the market. The indicator is very similar to the Stochastic oscillator and is used in the same way.
  
$$\text{Williams Percentage Range} = \frac{\text{Highest High} - \text{Close}}{\text{Highest High}-\text{Lowest Low}} * \left( -100 \right)$$
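
A short sketch of Williams %R, assuming a 14-period lookback (the period is not stated here) and a hypothetical helper name `williams_r`. The ratio is scaled by -100 so that the result falls between 0 and -100, the range given in the definition.

```python
import pandas as pd

def williams_r(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """Williams %R over a rolling window (hypothetical helper)."""
    highest_high = high.rolling(period).max()
    lowest_low = low.rolling(period).min()
    return (highest_high - close) / (highest_high - lowest_low) * -100
```

A close at the top of the window gives 0 and a close at the bottom gives -100.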

  - **tipo**: type of operation (0=buy, 1=sell) **this is our target variable**

In [None]:
hour = df2['hour']
df2.drop(labels=['hour'], axis=1,inplace = True)
df2.insert(0, 'hour', hour)
day = df2['dayOfWeek']
df2.drop(labels=['dayOfWeek'],axis=1,inplace=True)
df2.insert(1,'dayOfWeek',day)
df2.head(10)

In [None]:
df2.describe()

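Each indicator is stored at six lags, numbered 1 (most recent) to 6 (oldest) and spaced 4 hours apart. For example, if the order opens at hour 16, the rsi columns map to:<br><br>
rsi1 = rsi @ t-4 = rsi @ hour 12<br>
rsi2 = rsi @ t-8 = rsi @ hour 8<br>
rsi3 = rsi @ t-12 = rsi @ hour 4<br>
rsi4 = rsi @ t-16 = rsi @ hour 0<br>
rsi5 = rsi @ t-20 = rsi @ hour 20 (previous day)<br>
rsi6 = rsi @ t-24 = rsi @ hour 16 (previous day)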
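
The mapping above is plain modular arithmetic on the hour of day. The sketch below reproduces it for any opening hour; the function name `lag_hours` is hypothetical, and the calculation ignores weekends and market closures.

```python
def lag_hours(open_hour: int, n_lags: int = 6, step: int = 4):
    """Return (lag index, hours back, hour of day, days back) for each lagged indicator set."""
    rows = []
    for k in range(1, n_lags + 1):
        offset = k * step
        # divmod wraps negative hours into the previous day
        day_shift, hour = divmod(open_hour - offset, 24)
        rows.append((k, offset, hour, -day_shift))
    return rows

for row in lag_hours(16):
    print(row)
```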

## Helper Functions

**The helper functions below are used throughout the analysis of these data sets. In this section you will find:**
 - drop_col(srs, df) - takes a series *srs* of column names and deletes those columns from the dataframe *df*
 - move_cols(srs) - takes a series *srs* of column names and moves those columns to the front of df2
 - inc_day2(x) - takes a row index *x* and increments the day according to the hour and dayOfWeek column values in the dataframe
 - iso_interval(lst, num) - takes a list *lst* of indicator names and appends *num* to each one to get the names of the "timeframed" variables, plus the target *tipo*
 - create_iso(df, num) - takes the base dataframe *df* and returns a copy containing only the indicator columns for lag *num* and the target

In [None]:
def drop_col(srs,df):
    for x in srs:
        del df[x]

In [None]:
def move_cols(srs):
    for i in srs:
        #name = "my_"+i
        name = df2[i]
        df2.drop(labels=[i], axis=1,inplace = True)
        df2.insert(0, i, name)

In [None]:
def inc_day2(x):
    """Return the timestamp of row x, moved forward when the row starts a new day (hour 0)."""
    day = pd.offsets.Day()
    ts1 = df2['my_time'][x]
    time1 = df2['hour'][x]
    dow1 = df2['dayOfWeek'][x]
    
    timestampStr = ts1.strftime("%Y-%m-%d %H:%M:%S")
    print("Pre func: " + timestampStr)
    print("index1: " + str(x))
    
    days = {1:"Monday",2:"Tuesday",3:"Wednesday",4:"Thursday",5:"Friday"}

    # hour 0 marks a new day, except for the very first row
    if (time1 == 0) and (x != 0):
        if dow1 == 1:
            # Monday rows are moved forward by three days
            print("Monday")
            ts2 = ts1 + day*3
        else:
            print(days[dow1])
            ts2 = ts1 + day
    else:
        print("same day")
        ts2 = ts1
        
    timestampStr2 = ts2.strftime("%Y-%m-%d %H:%M:%S")
    print("Post func: " + timestampStr2)
    
    return ts2
        

In [None]:
def iso_interval(lst, num):
    result = []
    for i in lst:
        i = i + str(num)
        result.append(i)
    result.append("tipo")
    #print(result)
    return result

In [None]:
def create_iso(df,num):
    keep = ["rsi","stoch","ema20Slope","ema50Slope","ema100Slope","ema200Slope","std","mom","BB_up_percen","cci","force","macd","bearsPower","bullsPower","WPR","close"]
    #for combination of multiple time frames
        
    result = iso_interval(keep, num)
    df_temp = df.copy()
    column_names=list(df_temp)
    column_names_not = [i for i in column_names if i not in result]
    #print(column_names_not)
    drop_col(column_names_not,df_temp)
    return df_temp
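
For comparison, the same column selection can be written without deleting columns one at a time. This is only a sketch on a toy dataframe; `select_lag` and the toy column values are made up for the example, and it is not meant to replace `create_iso`.

```python
import pandas as pd

def select_lag(df: pd.DataFrame, num: int, bases, target: str = "tipo") -> pd.DataFrame:
    """Keep only the columns named <base><num> plus the target (hypothetical helper)."""
    wanted = [f"{b}{num}" for b in bases] + [target]
    # preserve the original column order, as create_iso does
    return df[[c for c in df.columns if c in wanted]].copy()

toy = pd.DataFrame({"rsi1": [30.0], "rsi2": [35.0], "macd1": [0.1], "macd2": [0.2], "tipo": [0]})
print(select_lag(toy, 1, ["rsi", "macd"]))
```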

In [None]:
##checks for duplicate close values across all 6 indicator sets
#df2["close_same"] = np.where(((df2["close1"]==df2["close2"])&(df2["close2"]==df2["close3"])&(df2["close3"]==df2["close4"])&(df2["close4"]==df2["close5"])),1,0)
#df2["close_same"].value_counts()

In [None]:
##we ended up using only one set of these indicators (i.e. rsi1, stoch1, ema20Slope1, etc.) but we kept the duplicate close columns just in case
#duplicate_close = ['close2','close3','close4','close5','close6']
#drop_col(duplicate_close, df2)

In [None]:
#df2.head()

In [None]:
df2['time'] =(df2["hour"].astype(str)+":00:00")

df2['my_time'] = (pd.to_datetime("2015-8-1"+ ' ' +df2['time']))

date_series = ["time", "my_time"]

move_cols(date_series)

##debug
#df2.head()

In [None]:
# for x in range(len(df2['my_time'])):
#     #print(x)
#     ts1 = df2['my_time'][x]
#     str2 = df2['dayOfWeek'][x]
#     str2_next = df2['dayOfWeek'][x+1]
#     str3 = df2['time'][x]
#     #print(str1)
#     #print(str2)
#     timestampStr = ts1.strftime("%Y-%m-%d %H:%M:%S")
#     res = inc_day2(x)
#     print("---x=" + str(x) + "--dow=" + str(str2)+"--"+"time="+ str(str3) +"--")
#     print("res: " + str(res))
#     print("-----------")
    
#     df2['my_time'][x]=res

#     i=1
#     while(i<5):
#         df2['my_time'][x+i]=res
#         i+=1
    
    

In [None]:
#df2.head(50)

In [None]:
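# my_time already holds a date plus the hour, so take only its date part before appending the time string
df2["dt"] = pd.to_datetime(df2["my_time"].dt.strftime("%Y-%m-%d") + " " + df2["time"])

In [None]:
move_cols(['dt'])
df2['dt'].head()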

In [None]:
##debug
#df2.head()

In [None]:
drop_col(['my_time', 'time','hour','dayOfWeek'],df2)

In [None]:
##this is for creating a time series, which we dont want to do at this point
#df2=df2.set_index("dt")
#df2.index.names=[None]

#debug
#df2.head()

In [None]:
df2_tminus4 = create_iso(df2,1)

In [None]:
df2_tminus4.describe()

In [None]:
##Neither StandardScaler nor MinMaxScaler is appropriate for this set
##Potentially a % change of each metric over the given time frames
#scaler = preprocessing.StandardScaler()
#scaler = preprocessing.MinMaxScaler()
#df2_tminus4[df2_tminus4.columns] = scaler.fit_transform(df2_tminus4[df2_tminus4.columns])
df2_tminus4.head(10)
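
The percent-change idea mentioned in the comment above could look roughly like the following. This is a sketch only: `lag_pct_change` is a hypothetical helper, the toy values are invented, and it assumes suffix 1 is the most recent lag and higher suffixes are older, as in the rsi1 to rsi6 mapping. Indicators that cross zero (for example macd) make a percent change unstable, so this would suit price-like columns better.

```python
import pandas as pd

def lag_pct_change(df: pd.DataFrame, base: str, n_lags: int = 6) -> pd.DataFrame:
    """Percent change between consecutive lags of one indicator (hypothetical helper)."""
    cols = [f"{base}{k}" for k in range(n_lags, 0, -1)]      # oldest -> newest
    older = df[cols[:-1]].to_numpy(dtype=float)
    newer = df[cols[1:]].to_numpy(dtype=float)
    # column <base>_pct<k> is the change from lag k+1 to lag k
    names = [f"{base}_pct{k}" for k in range(n_lags - 1, 0, -1)]
    return pd.DataFrame((newer - older) / older, index=df.index, columns=names)

toy = pd.DataFrame({"close3": [100.0], "close2": [110.0], "close1": [99.0]})
print(lag_pct_change(toy, "close", n_lags=3))
```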

In [None]:
sns.pairplot(df2_tminus4)

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df2_tminus4.corr(),annot=True,cmap="YlGnBu" )

In [None]:
df2_tminus8 = create_iso(df2,2)

In [None]:
##debug
df2_tminus8.head()

In [None]:
sns.pairplot(df2_tminus8)

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df2_tminus8.corr(),annot=True, cmap="YlGnBu")

In [None]:
df2_tminus12 = create_iso(df2,3)

In [None]:
##debug
#df2_tminus12.head()

In [None]:
sns.pairplot(df2_tminus12)

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df2_tminus12.corr(),annot=True,cmap="YlGnBu")

In [None]:
df2_tminus16 = create_iso(df2,4)

In [None]:
##debug
#df2_tminus16.head()

In [None]:
sns.pairplot(df2_tminus16)

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df2_tminus16.corr(),annot=True, cmap="YlGnBu")

In [None]:
df2_tminus20 = create_iso(df2,5)

In [None]:
df2_tminus20.head()

In [None]:
sns.pairplot(df2_tminus20)

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df2_tminus20.corr(),annot=True,cmap="YlGnBu")

In [None]:
df2_tminus24 = create_iso(df2,6)

In [None]:
df2_tminus24.head() 

In [None]:
sns.pairplot(df2_tminus24)

In [None]:
plt.subplots(figsize=(15,15))
sns.heatmap(df2_tminus24.corr(),annot=True,cmap="YlGnBu")

In [None]:
ema200_corr_avg = (-.073+-.072+-.071+-.07+-.069)/5
print(ema200_corr_avg)
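
The cell above averages correlation values that were typed in by hand. A hedged alternative is to compute the correlations directly, as sketched below on toy data. The helper name `mean_corr_with_target` is hypothetical; on the real data it would be called with something like `base="ema200Slope"` and the lag numbers of interest.

```python
import pandas as pd

def mean_corr_with_target(df: pd.DataFrame, base: str, lags, target: str = "tipo") -> float:
    """Average Pearson correlation between <base><k> and the target over the given lags."""
    corrs = [df[f"{base}{k}"].corr(df[target]) for k in lags]
    return sum(corrs) / len(corrs)

# Toy data: x1 tracks the target exactly, x2 is its mirror image.
toy = pd.DataFrame({"x1": [0, 1, 0, 1], "x2": [1, 0, 1, 0], "tipo": [0, 1, 0, 1]})
print(mean_corr_with_target(toy, "x", [1, 2]))
```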

In [None]:
train, test = train_test_split(df2_tminus4, test_size=0.2)

In [None]:
X=train[['rsi1','macd1','ema20Slope1','ema50Slope1','ema100Slope1','ema200Slope1','bearsPower1','bullsPower1','WPR1']]

In [None]:
y = train['tipo']

In [None]:
bt_model = linear_model.LogisticRegression(solver='liblinear')
bt_model.fit(X,y)

In [None]:
Xnew = test[['rsi1','macd1','ema20Slope1','ema50Slope1','ema100Slope1','ema200Slope1','bearsPower1','bullsPower1','WPR1']]

In [None]:
test_action = bt_model.predict(Xnew)
print(metrics.accuracy_score(test['tipo'],test_action))

In [None]:
print("intercept")
beta_0 = bt_model.intercept_[0]
print(beta_0)
print("coefs")
# one coefficient per input column, in the same order as the columns of X
for beta in bt_model.coef_[0]:
    print(beta)

In [None]:
bt_model.predict_proba(Xnew)[:,1]

In [None]:
bt_model.score(X,y)

## Testing different inputs

In [None]:
X2=train[['close1','std1','macd1','force1','WPR1','bullsPower1']]

In [None]:
bt_model2 = linear_model.LogisticRegression(solver='liblinear')
bt_model2.fit(X2,y)

In [None]:
Xnew2 = test[['close1','std1','macd1','force1','WPR1','bullsPower1']]

In [None]:
test_action2 = bt_model2.predict(Xnew2)
print(metrics.accuracy_score(test['tipo'],test_action2))

In [None]:
bt_model2.score(X2,y)

## Logistic Regression with all inputs

In [None]:
cols = list(df2_tminus4.columns)

In [None]:
X3=train[cols]

In [None]:
new_X3 = X3.drop(['tipo'],axis=1)

In [None]:
new_X3.head()

In [None]:
bt_model3 = linear_model.LogisticRegression(solver='liblinear')
bt_model3.fit(new_X3,y)

In [None]:
Xnew3 = test[cols].drop(['tipo'],axis=1)

In [None]:
test_action3 = bt_model3.predict(Xnew3)
print(metrics.accuracy_score(test['tipo'],test_action3))

In [None]:
bt_model3.score(new_X3,y)

## Decision Tree Classifier

In [None]:
df2_tminus4.head()

In [None]:
#removing some of the "noise" - redundant or non-correlated indicators
drop_col(['std1','BB_up_percen1','bearsPower1','bullsPower1','force1','WPR1'],df2_tminus4)

In [None]:
df2_tminus4.head()

In [None]:
actions = tree.DecisionTreeClassifier()
train_tree, test_tree = train_test_split(df2_tminus4, test_size=0.2)

In [None]:
train_tree2 = train_tree.drop(['tipo'],axis=1)

In [None]:
test_tree2 = test_tree.drop(['tipo'],axis=1)

In [None]:
actions.fit(train_tree2, train_tree['tipo'])

In [None]:
test_action = actions.predict(test_tree2)
print(metrics.accuracy_score(test_tree['tipo'],test_action))

We achieve a prediction accuracy of about **78%** on the held-out test set using the Decision Tree Classifier.

In [None]:
conf_matrix = metrics.confusion_matrix(test_tree['tipo'],test_action)
conf_matrix

Below are the results of our Decision Tree Classifier on our test set:

|                 | Predicted Sell | Predicted Buy |
|----------------:|:--------------:|:-------------:|
| **Actual Sell** |      358       |      101      |
| **Actual Buy**  |       96       |      341      |
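
The accuracy figure can be checked directly from the four counts in the table, and the same counts give precision and recall for the sell class. The variable names below are chosen for this example only.

```python
# Counts copied from the table above (rows = actual, columns = predicted).
sell_sell, sell_buy = 358, 101
buy_sell, buy_buy = 96, 341

total = sell_sell + sell_buy + buy_sell + buy_buy
accuracy = (sell_sell + buy_buy) / total
precision_sell = sell_sell / (sell_sell + buy_sell)   # of predicted sells, share that were sells
recall_sell = sell_sell / (sell_sell + sell_buy)      # of actual sells, share that were found

print(f"n={total} accuracy={accuracy:.3f} "
      f"precision(sell)={precision_sell:.3f} recall(sell)={recall_sell:.3f}")
```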
