<a href="https://colab.research.google.com/github/11223548/EmergencyUploadA2/blob/master/11223548_A2_SampleReportCode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Simplified Report Code

## Important Disclaimer
<br>As discussed in the Assignment 2 report, the dataset used there is proprietary and subject to strict prohibitions against redistribution. The code here therefore loads an anonymised sub-sample from which many identifying features have been removed. Furthermore, because the data pre-processing took multiple days on a computing cluster, it is not practical to include that section of the script here. Instead, this script loads a pre-processed, anonymised sub-sample of the merged proprietary dataset.

# Simplified Report Code

***Import of Python Libraries***

In [0]:
import pandas as pd
import numpy as np
import math
from os import listdir
from os.path import isfile, join
import random
import seaborn as sns 
import matplotlib.pyplot as plt 
from matplotlib import cm as cm
from sklearn.neural_network import MLPClassifier
import time as tm 
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import RandomOverSampler 
from imblearn.over_sampling import SMOTE
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe




***Import of Pre-Processed Data***
<br>
<br>Full_Sample represents the data for Implementation 1 discussed in the report.
<br>Full_Sample_M2 represents the data for Implementation 2 discussed in the report.

<br>By default, "Full_Sample" is the data that flows through to the subsequent sections of this script. To run the second implementation through the rest of the model instead, uncomment the final line of the data-loading cell, which assigns `Full_Sample_M2` to `Full_Sample`.
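
One practical detail when reading a CSV straight from GitHub: a `github.com/.../blob/...` address serves the HTML file-viewer page, whereas `raw.githubusercontent.com` serves the file contents themselves. The sketch below shows the conversion; the helper name `to_raw_url` is hypothetical and is not part of the report code, and it assumes the files are still hosted at the same repository path.

```python
def to_raw_url(blob_url):
    """Convert a GitHub 'blob' page URL into its raw-content equivalent."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("not a GitHub blob URL: " + blob_url)
    # user/repo/blob/branch/path  ->  user/repo/branch/path
    path = blob_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


raw_url = to_raw_url(
    "https://github.com/11223548/UTS_ML2019_Main/blob/master/_M1.csv"
)
print(raw_url)
```

The resulting address can be passed directly to `pd.read_csv`.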


In [6]:
# Fetch Implementation 1 Data from GitHub.
# The raw-content URL is used so that the CSV itself is saved, not the HTML file-viewer page.
!curl --remote-name \
     --location https://raw.githubusercontent.com/11223548/UTS_ML2019_Main/master/_M1.csv



In [7]:
# Fetch Implementation 2 Data from GitHub.
# The raw-content URL is used so that the CSV itself is saved, not the HTML file-viewer page.
!curl --remote-name \
     --location https://raw.githubusercontent.com/11223548/UTS_ML2019_Main/master/_M2.csv



In [0]:


# Raw-content URLs: the github.com/.../blob/... pages return HTML, which pd.read_csv cannot parse as CSV.
URL_1 = "https://raw.githubusercontent.com/11223548/UTS_ML2019_Main/master/_M1.csv"
URL_2 = "https://raw.githubusercontent.com/11223548/UTS_ML2019_Main/master/_M2.csv"

Full_Sample = pd.read_csv(URL_1) # Model Implementation 1 Data
Full_Sample_M2 = pd.read_csv(URL_2) # Model Implementation 2 Data

#Full_Sample=Full_Sample_M2

***Defining Some Functions used Throughout Script***
<br>
1. A progress bar for loops 
<br>*(this was more important for data pre-processing, where scripts would run for many hours; less so here)*
2. A correlation matrix function
3. An F1-Score function
4. A confusion matrix plot function
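
As a small worked example of the F1-score helper in item 3, the sketch below computes F1 from a 2x2 confusion matrix laid out with true labels on the rows and predicted labels on the columns. The function name `f1_from_confusion` and the counts are made up for illustration; the zero-denominator guard is an addition of this sketch and is not in the report code.

```python
def f1_from_confusion(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    true_pos = matrix[1][1]
    false_pos = matrix[0][1]
    false_neg = matrix[1][0]
    # F1 = 2*P*R / (P + R) simplifies to 2*TP / (2*TP + FP + FN)
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


# Illustrative counts only: 95 true negatives-class rows, 5 true positive-class rows
example = [[90, 5],
           [3, 2]]
score = f1_from_confusion(example)
print(round(score, 3))
```

Because false positives and false negatives enter the simplified formula symmetrically, transposing the matrix leaves the score unchanged, while the precision and recall values themselves swap.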

In [0]:
# Print iterations progress
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█'):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    # Print New Line on Complete
    if iteration == total: 
        print()


def correlation_matrix(df,labels):
    """Plot the correlation matrix of df as a heat map; labels are the tick labels, one per column."""
    fig = plt.figure()
    ax1 = fig.add_subplot(111)
    # 'jet' colour map resampled to 30 discrete levels
    # (plt.get_cmap also works on Matplotlib >= 3.9, where cm.get_cmap no longer exists)
    cmap = plt.get_cmap('jet', 30)
    cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title('Correlation Matrix', fontsize = 40)
    ax1.set_xticks(np.arange(len(labels)))
    ax1.set_yticks(np.arange(len(labels)))
    ax1.set_xticklabels(labels,fontsize=20, rotation=90)
    ax1.set_yticklabels(labels,fontsize=20)
    # Add colorbar, make sure to specify tick locations to match desired ticklabels
    fig.colorbar(cax, ticks=[-1,-0.75,-0.5,-.25,.0,.25,.50,.75,1])
    plt.tight_layout()
    plt.show()

    
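def F1_Score(ConfusionMatrix):
    """
    F1 score of the positive class (label 1) from a 2x2 confusion matrix,
    laid out as in sklearn: rows are true labels, columns are predicted labels.
    """
    True_Pos = ConfusionMatrix[1,1] 
    False_Pos = ConfusionMatrix[0,1] 
    #True_Neg = ConfusionMatrix[0,0]  (not needed for F1)
    False_Neg = ConfusionMatrix[1,0]
    Precision = True_Pos / (True_Pos + False_Pos)
    Recall = True_Pos / (True_Pos + False_Neg)
    F_score = 2 * (Precision * Recall) / (Precision + Recall)
    return F_score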
  
  

def plot_confusion_matrix(y_true, y_pred, 
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Fixed class names for the tick labels (index 0 = negative class, index 1 = positive class)
    classes = ["No Insider Trading", "Insider Trading"]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


np.set_printoptions(precision=2)


***Prospect Data for Likely Discriminative Features***
<br>
<br> The code below plots the conditional distributions of the neural network input variables, together with a correlation matrix. These plots are used before the analysis to give quick visual insight into which variables are likely to be effective inputs.
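
The same idea can be checked numerically. The sketch below uses a small synthetic table (not the report data) with two of the delta column names and compares class-conditional means and the correlation of each input with the label; the names `toy`, `class_means` and `label_corr` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
flag = rng.integers(0, 2, size=n)  # synthetic 0/1 label, NOT the report data

toy = pd.DataFrame({
    "Delta_Price": rng.normal(size=n) + 1.5 * flag,  # shifted upwards when flag == 1
    "Delta_Tot_Vol.": rng.normal(size=n),            # generated independently of the flag
    "IT_Flag": flag,
})

# Conditional means: a large gap between the two rows suggests a useful input
class_means = toy.groupby("IT_Flag").mean()

# Pearson correlation of each input with the label
label_corr = toy.corr()["IT_Flag"].drop("IT_Flag")

print(class_means)
print(label_corr)
```

An input whose conditional distributions barely differ between the two classes is unlikely to help the classifier on its own, which is what the pair plot shows visually.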

In [0]:
# Columns used for plotting: the three delta features plus the class label
Deltas = ["Delta_Price","Delta_Avg_Vol.","Delta_Tot_Vol."]
D_IT = Deltas + ["IT_Flag"]

# Pairwise scatter plots, coloured by whether the observation is flagged as insider trading (IT_Flag)
sns.pairplot(Full_Sample[D_IT], hue="IT_Flag")

# Correlation matrix of the same columns, using the column names as tick labels
correlation_matrix(Full_Sample[D_IT],D_IT)


***Assign Observations to Training/Cross-Validation/Testing***
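
The split is made by ticker rather than by row, so that all observations for a given ticker land in exactly one of the three sets. A minimal sketch of that positional 60/20/20 split follows; the function name `split_by_ticker` and the ticker symbols are hypothetical.

```python
def split_by_ticker(tickers, train_end=0.6, cv_end=0.8):
    """Split a list of unique tickers into train / CV / test lists by position."""
    n = len(tickers)
    cut1 = int(round(train_end * n))
    cut2 = int(round(cv_end * n))
    return tickers[:cut1], tickers[cut1:cut2], tickers[cut2:]


# Hypothetical ticker symbols, for illustration only
example_tickers = ["T%02d" % i for i in range(10)]
train, cv, test = split_by_ticker(example_tickers)

print(len(train), len(cv), len(test))
# No ticker is shared between any two sets
print(set(train) & set(cv), set(train) & set(test), set(cv) & set(test))
```

Filtering the full table with `isin` on each list then yields row-level sets with no ticker overlap.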

In [0]:
# Identify the unique tickers that have at least one prosecuted IT case.
# groupby with sort=False keeps the tickers in order of first appearance.
IT_Counts = Full_Sample.groupby("Ticker", sort=False)["IT_Flag"].sum()
UniqueTickers = IT_Counts[IT_Counts > 0].index.tolist()

# Assign data to training, cross-validation or testing samples
Imports = len(UniqueTickers)
Imp_Increment1 = int(round(0.6*Imports,0))
Imp_Increment2 = int(round(0.8*Imports,0))
Train_Tickers =  pd.DataFrame(UniqueTickers[0:Imp_Increment1].copy(),columns=["Tickers"]) # allocate 60% of imports to training data
CV_Tickers = pd.DataFrame(UniqueTickers[Imp_Increment1:Imp_Increment2].copy(),columns=["Tickers"]) # allocate 20% of imports to cross-validation data
Test_Tickers = pd.DataFrame(UniqueTickers[Imp_Increment2:].copy(),columns=["Tickers"]) # allocate the remaining 20% of imports to testing data

# Apply filters to full_sample to split into respective sets
Train_Mask = Full_Sample["Ticker"].isin(Train_Tickers["Tickers"])
FinParameters_All_Train = Full_Sample[Train_Mask].copy()

CV_Mask = Full_Sample["Ticker"].isin(CV_Tickers["Tickers"])
FinParameters_All_CV = Full_Sample[CV_Mask].copy()

Test_Mask = Full_Sample["Ticker"].isin(Test_Tickers["Tickers"])
FinParameters_All_Test = Full_Sample[Test_Mask].copy()

***OverSampling of Minority Class***
<br>Here the default has been set to random oversampling with SMOTE commented out. This can be quickly reversed by commenting out random oversampling and uncommenting SMOTE.
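
To make the effect of random oversampling concrete, the sketch below duplicates randomly chosen minority-class rows until the two classes are the same size. It illustrates the idea only and is not imblearn's implementation; the function name `random_oversample` and the toy arrays are assumptions made for this example, and it assumes a binary label. SMOTE differs in that it synthesises new minority points by interpolating between neighbouring minority samples instead of copying rows.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement, since n_extra can exceed the minority count
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Only the training set should be resampled in this way; the cross-validation and test sets keep their original class balance so that the reported metrics reflect realistic conditions.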

In [0]:
#%% Perform Partial Random Oversampling with replacement of Minority Class (training set only --> exclude cross val and test set)

MinClassPortion = 1 # 1 = balanced 50/50 between minority/majority class

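# Isolate target variable from rest of dataset
# ParamColumnHeaders (a list of numeric column names) must be defined before this cell runs;
# later cells index it with the class label at position 0 and the model inputs from position 5 onwards.
X_train_num = np.array(FinParameters_All_Train[ParamColumnHeaders[2:]]) #Note this is only for numericals atm
y_train_num = np.array(FinParameters_All_Train.loc[:, FinParameters_All_Train.columns == 'IT_Flag'])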
print('Shape of X (Train): {}'.format(X_train_num.shape))
print('Shape of y (Train): {}'.format(y_train_num.shape))
print("Before OverSampling, counts of label '1' (Training Set): {}".format(sum(y_train_num==1)))
print("Before OverSampling, counts of label '0' (Training Set): {}".format(sum(y_train_num==0)))

# Apply resampling technique to training set only:
ros_Partial = RandomOverSampler(sampling_strategy=MinClassPortion, random_state=0)
X_train_ReOVERpartial_num, y_train_ReOVERpartial_num = ros_Partial.fit_resample(X_train_num, y_train_num.ravel())

# Apply SMOTE resampling technique to training set only:
#sm_ote = SMOTE(random_state=2)
#X_train_ReOVERpartial_num, y_train_ReOVERpartial_num = sm_ote.fit_resample(X_train_num, y_train_num.ravel())

print('After OverSampling, the shape of the Training Set Input Attributes: {}'.format(X_train_ReOVERpartial_num.shape))
print('After OverSampling, the shape of the Training Set Class Attribute: {}'.format(y_train_ReOVERpartial_num.shape))

print("After OverSampling, counts of label '1' (Training Set): {}".format(sum(y_train_ReOVERpartial_num==1)))
print("After OverSampling, counts of label '0' (Training Set): {}".format(sum(y_train_ReOVERpartial_num==0)))


# Also save resampled X and Y values as dataframe
df_rand_Train1_num_ReOVERpartial = pd.DataFrame(X_train_ReOVERpartial_num,columns=ParamColumnHeaders[2:]).astype("float64")
df_rand_Train1_num_ReOVERpartial["IT_Flag"] = y_train_ReOVERpartial_num.astype("float64")


***Train Neural Network and Report Performance***
<br>
<br>This section of the script trains a neural network with various iteration-limit cut-offs. Results for a range of important metrics are output into a table. Confusion matrices are then plotted for cross-validation and test sample performance.

<br>It is possible to switch between ADAM and SGD by changing the "solver" section of the MLPClassifier.
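
As a hedged illustration of switching solvers, the sketch below fits the same small network with `solver='adam'` and `solver='sgd'` on synthetic stand-in data generated by `make_classification` (not the report dataset) and records the training-set F1 score for each. The variable names and the choice of `max_iter=50` are assumptions made for this example.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data with three informative inputs (NOT the report dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

solver_scores = {}
for solver in ("adam", "sgd"):
    clf = MLPClassifier(hidden_layer_sizes=(5, 5), solver=solver, max_iter=50,
                        random_state=1)
    with warnings.catch_warnings():
        # A low max_iter often stops before convergence; silence that warning here
        warnings.simplefilter("ignore", ConvergenceWarning)
        clf.fit(X_demo, y_demo)
    # Training-set F1 only; a held-out set would be needed to judge generalisation
    solver_scores[solver] = f1_score(y_demo, clf.predict(X_demo))

print(solver_scores)
```

Everything other than the `solver` argument is held fixed, so any difference in the scores is attributable to the optimiser and its interaction with the iteration limit.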

In [0]:

# Training set (after oversampling): columns are ParamColumnHeaders[2:] followed by IT_Flag
Train_Array = np.array(df_rand_Train1_num_ReOVERpartial.copy()).astype("float64")
X_train = Train_Array[:,3:-1] # Independent Delta Variables
y_train = Train_Array[:,-1].reshape(-1,1) # Prosecuted IT Case during window

# Cross-validation set: columns follow ParamColumnHeaders, with the label in column 0
CV_Array = np.array(FinParameters_All_CV[ParamColumnHeaders].copy()).astype("float64")
X_CV = CV_Array[:,5:] # Independent Delta Variables
y_CV = CV_Array[:,0] # Prosecuted IT Case during window

# Test set: same column layout as the cross-validation set
Test_Array = np.array(FinParameters_All_Test[ParamColumnHeaders].copy()).astype("float64")
X_Test = Test_Array[:,5:] # Independent Delta Variables
y_Test = Test_Array[:,0] # Prosecuted IT Case during window

# Create Placeholder for Results
Res_Row_Lab = ["Iteration Limit", "Run Time", "Training Set: Accuracy", "Cross-Validation Set: Accuracy", "Training Set: F1 Score", "Cross-Validation Set: F1 Score"]
NN_Results = pd.DataFrame(np.zeros((len(Res_Row_Lab),1)),columns=["Row_Labels"])
NN_Results["Row_Labels"] = Res_Row_Lab


Iteration_Limits = [1,5,20,50,100,250]#,500,1000,2000]


for Iteration_Limit in Iteration_Limits:
    # Start Loop Timer
    Loop_Start = tm.time()
    # Specify Classifier             
    clf_NN = MLPClassifier(hidden_layer_sizes=(5,5), max_iter=Iteration_Limit, alpha=1e-4,
                    solver='adam', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)
    #Train the Data
    clf_NN.fit(X_train, y_train.ravel())    
    # Save output of training data for debugging
    Train_Predictions_NN = clf_NN.predict(X_train).reshape(-1,1).astype("float64")
    print("Training IT Predictions: ",100*Train_Predictions_NN.sum()/len(Train_Predictions_NN),"%")
    # Try using the trained NN for predictions:
    CrossV_Predictions_NN = clf_NN.predict(X_CV)
    CrossV_Predictions_NN = CrossV_Predictions_NN.reshape(-1,1).astype("float64")
    Test_Predictions_NN = clf_NN.predict(X_Test).reshape(-1,1).astype("float64")
    Temp_F1_Cross = f1_score(y_CV, CrossV_Predictions_NN) 
    # Mark Loop End Time
    LoopRunTime = tm.time() - Loop_Start
    print("Loop: ",Iteration_Limit,"\n","Iteration Time: ",LoopRunTime)
    # Save Results
    # confusion_matrix expects (y_true, y_pred), giving true labels on rows as F1_Score assumes
    NN_Results[Iteration_Limit] = [Iteration_Limit, LoopRunTime,
                                   clf_NN.score(X_train, y_train.ravel()), clf_NN.score(X_CV, y_CV),
                                   F1_Score(confusion_matrix(y_train.ravel(), clf_NN.predict(X_train))),
                                   F1_Score(confusion_matrix(y_CV, clf_NN.predict(X_CV)))]

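# Report accuracy and F1 score on the test set (model trained with the final iteration limit)
print("Test Set: Accuracy: ", clf_NN.score(X_Test, y_Test))
print("Test Set: F1 Score: ", F1_Score(confusion_matrix(y_Test, clf_NN.predict(X_Test))))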


NN_ConfuMat = confusion_matrix(y_CV, CrossV_Predictions_NN)
print(NN_ConfuMat)

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_CV, CrossV_Predictions_NN,
                      title='NN CV Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_CV, CrossV_Predictions_NN, normalize=True,
                      title='NN CV Normalized confusion matrix')

plt.show()

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_Test, Test_Predictions_NN,
                      title='NN Testing Set Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_Test, Test_Predictions_NN, normalize=True,
                      title='NN Testing Set Normalized confusion matrix')

plt.show()



# Full Sample
print(len(Full_Sample["IT_Flag"]))
# % that represent insider trading periods
print(100*Full_Sample["IT_Flag"].sum()/len(Full_Sample["IT_Flag"]),"%")