# Graph Link Prediction Demo

In this notebook we show how to use MLBlocks and MLPrimitives to build a pipeline that predicts
whether two nodes of a graph are linked or not.

To do so, we combine two steps in a single pipeline: a custom primitive that uses some of the
`networkx.algorithms.link_prediction` module functions to extract features, followed by an
`XGBClassifier` that makes the predictions.
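To give an intuition for the kind of features the first step can produce, here is a minimal pure-Python sketch of two classic link prediction scores, the Jaccard coefficient and preferential attachment. `networkx` exposes both in its link prediction module (`jaccard_coefficient`, `preferential_attachment`), but which scores the primitive actually computes is not stated in this notebook, so the choice below is an assumption. The toy graph and the `extract_features` helper are hypothetical and exist only for illustration.

```python
# Toy undirected graph stored as an adjacency mapping: node -> set of neighbours.
# (Hypothetical data, unrelated to the UMLS graph used in this notebook.)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def jaccard_coefficient(adj, u, v):
    """Shared neighbours divided by all neighbours of u and v (0.0 if there are none)."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])


def extract_features(adj, pairs):
    """Hypothetical helper: one feature row per candidate (source, target) pair."""
    return [
        [jaccard_coefficient(adj, u, v), preferential_attachment(adj, u, v)]
        for u, v in pairs
    ]


pairs = [(0, 3), (1, 4), (0, 1)]
features = extract_features(adjacency, pairs)
for pair, row in zip(pairs, features):
    print(pair, row)
```

Each row of numeric features can then be placed alongside the original columns and passed to a classifier, which is the role `XGBClassifier` plays in the pipeline below.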

In [3]:
# Setup logging and imports

from utils import get_tunables, pprint, setup

setup()

from mlblocks import MLPipeline
from mlprimitives.datasets import load_umls

In [4]:
primitives = [
    "networkx.link_prediction_feature_extraction",
    "xgboost.XGBClassifier"
]

In [5]:
dataset = load_umls()

In [6]:
dataset.describe()

UMLs Dataset.

    The data consists of information about a 135 Graph and the relations between
    their nodes given as a DataFrame with three columns, source, target and type,
    indicating which nodes are related and with which type of link. The target is
    a 1d numpy binary integer array indicating whether the indicated link exists
    or not.
    


In [7]:
dataset.data.head()

| | source | target | type |
|---|---|---|---|
| 0 | 50 | 125 | 43 |
| 1 | 4 | 107 | 24 |
| 2 | 30 | 39 | 32 |
| 3 | 51 | 75 | 2 |
| 4 | 102 | 119 | 40 |


In [8]:
dataset.target

array([1, 1, 1, ..., 1, 0, 1])

In [9]:
dataset.graph.nodes

NodeView((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134))

In [11]:
pipeline = MLPipeline(primitives)

In [12]:
X_train, X_test, y_train, y_test = dataset.get_splits(1)

In [13]:
X_train.head()

| | source | target | type |
|---|---|---|---|
| 359 | 17 | 46 | 26 |
| 2062 | 0 | 102 | 43 |
| 2094 | 50 | 50 | 8 |
| 2406 | 7 | 69 | 48 |
| 3204 | 0 | 128 | 7 |


In [14]:
pipeline.fit(X_train, y_train, graph=dataset.graph, node_columns=['source', 'target'])

In [37]:
predictions = pipeline.predict(X_test, graph=dataset.graph, node_columns=['source', 'target'])

In [38]:
dataset.score(y_test, predictions)

0.9041248606465998
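The notebook does not say which metric `dataset.score` computes. Assuming it is plain classification accuracy (an assumption, not confirmed by anything above), the score would be the fraction of test links whose predicted label matches the true one. The sketch below uses a hypothetical `accuracy` helper and made-up labels, not the UMLS data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty set of labels")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels: 1 means the link exists, 0 means it does not.
y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 labels match
```

Under that assumption, the value above would mean that roughly 90% of the test pairs were classified correctly.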