<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/Supervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Crime detection with Supervised Machine Learning

### Introduction

The purpose of this Jupyter notebook is to guide you through a basic supervised machine learning pipeline. 

The data is taken from Kaggle (https://www.kaggle.com/ellipticco/elliptic-data-set) and describes blockchain transactions. Some of these are flagged as "illicit" (i.e., relating to illegal activity), some as "licit", and the majority (about 80%) as "unknown". The authors give as examples of illicit categories: "scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc."


Note that there are two types of cells in this notebook: Markdown cells (that contain text, like this one), and Code cells (that execute some code, like the next cell). 

Clicking the Play button on a code cell executes it. Lines that start with a "#" are comments and are not executed.

Your input is required whenever there is a Question (in that case: write in the Markdown cell) or whenever you find some 'xxxxx' in the code cell (in this case, some code needs to be fixed or completed).


We start by downloading the data we will be training on, which has already been split into "X" (features) and "y" (labels).

In [None]:
## Data import from GitHub
import os
if not os.path.exists('X_train_supervised.csv.zip'):
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_train_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_train_supervised.csv.zip


We will be using pandas for data handling, and scikit-learn (sklearn) for supervised machine learning algorithms. 

In [None]:
## Package import: pandas for data handling and manipulation
import pandas as pd

Next, we will load the data into a so-called DataFrame (a pandas object) and inspect it by displaying its first N rows.
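
As an aside, pandas can read a zipped CSV file directly, because `read_csv` infers the compression from the file extension by default. The following is a minimal sketch on a small made-up table; the file name `toy.csv.zip` and the column names are hypothetical and not part of the dataset.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the real feature table (hypothetical columns).
toy = pd.DataFrame({"feature_a": [0.1, 0.2, 0.3], "feature_b": [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv.zip")
    # The ".zip" extension makes pandas compress the file on write ...
    toy.to_csv(path, index=False)
    # ... and decompress it on read (compression="infer" is the default).
    loaded = pd.read_csv(path)

# head(2) returns a new DataFrame holding only the first two rows.
print(loaded.head(2))
```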

In [None]:
X_train = pd.read_csv('X_train_supervised.csv.zip')
y_train = pd.read_csv('y_train_supervised.csv.zip')['class']
# .head() returns a new DataFrame containing the first N rows (default: N=5)
# of the DataFrame it is called on
X_train.head() 

**Further documentation on this dataset:**

There are 166 features associated with each node. Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset. There is a time step associated to each node, representing a measure of the time when a transaction was broadcasted to the Bitcoin network. The time steps, running from 1 to 49, are evenly spaced with an interval of about two weeks. Each time step contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other; there are no edges connecting the different time steps.

The first 94 features represent local information about the transaction – including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features are aggregated features, obtained using transaction information one-hop backward/forward from the center node - giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.).

We only look at the node data (i.e., ignore the network topology), although many of the features are derived from the surrounding nodes and do therefore contain information regarding the network structure. 
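
If the feature columns appear in the order described above (94 local features followed by 72 aggregated ones), the two groups could be separated by position. The sketch below uses a random toy table with hypothetical column names. The real `X_train` may contain extra columns (such as an identifier) or a different ordering, so it is worth checking `X_train.columns` before relying on this.

```python
import numpy as np
import pandas as pd

N_LOCAL, N_AGGREGATED = 94, 72  # counts taken from the dataset description

# Toy stand-in for the feature table; the column names are made up.
rng = np.random.default_rng(0)
columns = [f"local_{i}" for i in range(N_LOCAL)] + [
    f"agg_{i}" for i in range(N_AGGREGATED)
]
toy_X = pd.DataFrame(
    rng.normal(size=(5, N_LOCAL + N_AGGREGATED)), columns=columns
)

# Positional slicing with .iloc: the first 94 columns versus the remaining 72.
local_features = toy_X.iloc[:, :N_LOCAL]
aggregated_features = toy_X.iloc[:, N_LOCAL:]
print(local_features.shape, aggregated_features.shape)
```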


In [None]:
print(X_train.shape, '\n')
print(y_train.value_counts(normalize=True))

We have 151k training samples, of which 78% have the label "unknown", 20% have the label "licit" and 2.5% have the label "illicit".
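
With most samples labelled "unknown", one possible way to set up a binary classification problem is to keep only the labelled samples. This is an assumption for illustration, not necessarily the approach taken later in this notebook. The sketch below works on toy data and assumes that the labels are stored as the strings "licit", "illicit" and "unknown".

```python
import pandas as pd

# Toy labels and features (hypothetical values, not the real data).
toy_y = pd.Series(["unknown"] * 7 + ["licit"] * 2 + ["illicit"], name="class")
toy_X = pd.DataFrame({"feature_a": range(10)})

# Relative class frequencies, as in the cell above.
print(toy_y.value_counts(normalize=True))

# Boolean mask selecting only the samples with a known label.
labelled = toy_y != "unknown"
X_labelled = toy_X[labelled]

# Encode the target as 1 for "illicit" and 0 for "licit".
y_binary = (toy_y[labelled] == "illicit").astype(int)
print(y_binary.value_counts())
```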

In [None]:
from tensorflow import keras
from tensorflow.keras import layers