# Graph Link Prediction Demo

In this notebook we show how to use MLBlocks and MLPrimitives to build a pipeline that predicts
whether two nodes of a graph are linked.

To do so, we combine in a single pipeline a custom primitive that uses some of the
`networkx.link_prediction` module functions to extract features, followed by an `XGBClassifier`
that makes the predictions.
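To give a feel for the kind of features a link prediction step can derive from a graph, here is a minimal pure-Python sketch on a toy graph. The graph and the helper name `pair_features` are made up for illustration. NetworkX offers equivalent scores through functions such as `jaccard_coefficient` and `preferential_attachment`; which of them the primitive actually calls is not shown in this notebook.

```python
# Toy undirected graph as an adjacency mapping (node -> set of neighbours).
# Both the graph and the helper below are illustrative assumptions.
toy_adjacency = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1},
    3: {1},
}


def pair_features(adj, u, v):
    """Return neighbourhood-based features for the candidate link (u, v)."""
    neighbours_u, neighbours_v = adj[u], adj[v]
    common = neighbours_u & neighbours_v
    union = neighbours_u | neighbours_v
    return {
        "common_neighbors": len(common),
        # Jaccard coefficient: shared neighbours over all neighbours.
        "jaccard": len(common) / len(union) if union else 0.0,
        # Preferential attachment: product of the two node degrees.
        "preferential_attachment": len(neighbours_u) * len(neighbours_v),
    }


toy_features = pair_features(toy_adjacency, 0, 3)
print(toy_features)
```

Each candidate pair of nodes becomes one row of numeric features, which is the shape of input a classifier expects.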

In [2]:
# Setup logging and imports

from utils import get_tunables, pprint, setup

setup()

from mlblocks import MLPipeline
from mlprimitives.datasets import load_umls

### Load the Dataset

First we load the UMLs Dataset from the MLPrimitives library.

In [3]:
dataset = load_umls()

In [4]:
dataset.describe()

UMLs Dataset.

    The data consists of information about a 135 Graph and the relations between
    their nodes given as a DataFrame with three columns, source, target and type,
    indicating which nodes are related and with which type of link. The target is
    a 1d numpy binary integer array indicating whether the indicated link exists
    or not.
    


In [9]:
X = dataset.data
y = dataset.target
graph = dataset.graph

The variable `X` is a table specifying edges and types of edges between different
nodes of the graph:

In [10]:
X.head()

| | source | target | type |
|---|---|---|---|
| 0 | 50 | 125 | 43 |
| 1 | 4 | 107 | 24 |
| 2 | 30 | 39 | 32 |
| 3 | 51 | 75 | 2 |
| 4 | 102 | 119 | 40 |


The `y` variable contains a 1d numpy array specifying whether the corresponding relationship from
the `X` table exists in the graph.

In [11]:
y[0:5]

array([1, 1, 1, 0, 1])

Finally, the `graph` variable contains a NetworkX Graph object with all the nodes and their relationships.

In [13]:
graph.nodes

NodeView((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134))

### Split the Dataset into Train and Test

We will use the `get_splits` method of the dataset object
to split the data into two parts:

- `X_train` and `y_train` are the data that we will use to fit our pipeline.
- `X_test` and `y_test` are the data that we will use to evaluate our pipeline's performance.
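The idea behind such a split can be sketched with the standard library alone. This is only an illustration of a shuffled holdout split; the internals of `get_splits` are not shown in this notebook, and the helper name `holdout_split`, the 25% test fraction, and the toy rows are assumptions.

```python
import random


def holdout_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices, then split rows and labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(seq, idx):
        return [seq[i] for i in idx]

    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))


# Toy (source, target) pairs and labels, made up for illustration.
toy_pairs = [(50, 125), (4, 107), (30, 39), (51, 75),
             (102, 119), (57, 108), (62, 78), (26, 46)]
toy_labels = [1, 1, 1, 0, 1, 0, 1, 0]

pairs_train, pairs_test, labels_train, labels_test = holdout_split(toy_pairs, toy_labels)
print(len(pairs_train), len(pairs_test))
```

Rows and labels are indexed with the same shuffled positions, so each pair keeps its own label on both sides of the split.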

In [16]:
X_train, X_test, y_train, y_test = dataset.get_splits(1)

In [17]:
X_train.head()

| | source | target | type |
|---|---|---|---|
| 1596 | 57 | 108 | 19 |
| 2140 | 115 | 59 | 43 |
| 842 | 57 | 130 | 29 |
| 397 | 62 | 78 | 4 |
| 65 | 26 | 46 | 26 |


### Create the Pipeline

In this case we will create a pipeline with only two primitives:

- A custom primitive from MLPrimitives that uses several functions from the `networkx.link_prediction` module to extract features from the graph.
- An eXtreme Gradient Boosting (XGBoost) classifier.
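Conceptually, a two-step pipeline of this kind chains a feature extractor with an estimator. The sketch below shows that composition in plain Python under loud assumptions: `TwoStepPipeline`, `MeanThresholdClassifier` and `common_neighbors` are hypothetical names, and the threshold rule is a deliberately trivial stand-in for XGBoost. It does not reflect how MLBlocks is implemented.

```python
class MeanThresholdClassifier:
    """Stand-in estimator (not XGBoost): thresholds a single feature
    at the midpoint between the two class means."""

    def fit(self, features, y):
        pos = [f for f, label in zip(features, y) if label == 1]
        neg = [f for f, label in zip(features, y) if label == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, features):
        return [1 if f > self.threshold else 0 for f in features]


class TwoStepPipeline:
    """Hypothetical helper: run a feature extractor, then an estimator."""

    def __init__(self, extract, estimator):
        self.extract = extract
        self.estimator = estimator

    def fit(self, X, y, **context):
        # Extra keyword arguments (here, the graph) go to the feature step.
        self.estimator.fit([self.extract(row, **context) for row in X], y)
        return self

    def predict(self, X, **context):
        return self.estimator.predict([self.extract(row, **context) for row in X])


def common_neighbors(pair, graph):
    """Number of neighbours shared by the two nodes of the pair."""
    u, v = pair
    return len(graph[u] & graph[v])


# Toy adjacency mapping and labelled pairs, made up for illustration.
toy_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}, 4: set()}
X_toy = [(0, 3), (1, 2), (0, 4), (3, 4)]
y_toy = [1, 1, 0, 0]

toy_pipeline = TwoStepPipeline(common_neighbors, MeanThresholdClassifier())
toy_pipeline.fit(X_toy, y_toy, graph=toy_graph)
toy_predictions = toy_pipeline.predict([(0, 3), (2, 4)], graph=toy_graph)
print(toy_predictions)
```

The point of the design is that the caller only sees `fit` and `predict`; the graph is passed once as context and the intermediate features never leave the pipeline.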

In [14]:
primitives = [
    "networkx.link_prediction_feature_extraction",
    "xgboost.XGBClassifier"
]

In [15]:
pipeline = MLPipeline(primitives)

### Fit the Pipeline

Once the pipeline has been created, we fit it to the training data.

In [18]:
pipeline.fit(X_train, y_train, graph=dataset.graph, node_columns=['source', 'target'])

### Make Predictions

After fitting it, we can use the pipeline to make predictions on new data.

In [19]:
predictions = pipeline.predict(X_test, graph=dataset.graph, node_columns=['source', 'target'])

### Evaluate the Performance

Finally, we can use the `score` method of the dataset object to evaluate how good
our predictions were.

In this case, the `score` method computes an accuracy score, which is simply the fraction
of values that were predicted correctly.

In [20]:
dataset.score(y_test, predictions)

0.8784838350055741
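The accuracy described above can be written out in a few lines. This is a hand-rolled illustration of the metric, not the code behind `dataset.score`; the function name `accuracy` and the example labels are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Four of the five made-up predictions match the labels.
toy_accuracy = accuracy([1, 1, 1, 0, 1], [1, 0, 1, 0, 1])
print(toy_accuracy)  # 0.8
```

Read this way, the score of roughly 0.878 above means that about 88% of the test pairs were classified correctly.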