# Anomalous Vertices Detection in the Academia.edu Graph
In this notebook I demonstrate how to use the method described in https://arxiv.org/abs/1610.07525 to detect anomalies in the Academia.edu social network.
Since we don't have ground-truth labels, we use the Academia.edu dataset with simulated anomalous vertices.

To install the package please read the [installation instructions](https://github.com/Kagandi/anomalous-vertices-detection).

In [1]:
from anomalous_vertices_detection.graph_learning_controller import GraphLearningController
from anomalous_vertices_detection.learners.sklearner import SkLearner
from anomalous_vertices_detection.learners.gllearner import GlLearner
from anomalous_vertices_detection.datasets.academia import load_data
import os

Besides the basic installation of the package with networkx as its graph analysis package, it is also possible to install and use iGraph.

First, we define which labels are considered positive and negative.

In [2]:
labels = {"neg": "Real", "pos": "Fake"}

Next, we load the Academia.edu graph.
load_data returns a graph object (academia_graph) and a config object (academia_config).
Since the Academia.edu dataset doesn't have real-world labels, load_data can simulate 10% fake vertices. Note that the call below passes simulate_fake_vertices=False.

In [3]:
academia_graph, academia_config = load_data(labels_map=labels, simulate_fake_vertices=False, limit=100)
print(len(academia_graph.vertices))

Loading graph...
Data loaded.
101
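
To make the idea of simulated anomalous vertices concrete, here is a minimal standard-library sketch. It is not the package's load_data implementation: the helper name `add_fake_vertices`, its parameters, and the rule that each fake vertex links to a random sample of existing vertices are assumptions made purely for illustration.

```python
import random


def add_fake_vertices(edges, fraction=0.1, links_per_fake=5, seed=0):
    """Return (new_edges, labels) with simulated fake vertices added.

    edges: list of (src, dst) pairs describing a directed graph.
    Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    real = sorted({v for edge in edges for v in edge}, key=str)
    labels = {v: "Real" for v in real}
    n_fake = max(1, int(len(real) * fraction))
    new_edges = list(edges)
    for i in range(n_fake):
        fake = "Fake%d" % i
        labels[fake] = "Fake"
        # Assumption: a fake vertex follows a random sample of real vertices.
        for target in rng.sample(real, min(links_per_fake, len(real))):
            new_edges.append((fake, target))
    return new_edges, labels


edges = [(i, (i + 1) % 20) for i in range(20)]  # toy ring graph, 20 vertices
new_edges, labels = add_fake_vertices(edges, fraction=0.1, links_per_fake=3)
n_fake = sum(1 for label in labels.values() if label == "Fake")
print(n_fake, len(new_edges))
```

The labels dictionary plays the role of the missing ground truth: every vertex we inserted ourselves is known to be "Fake", and everything else is treated as "Real".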


Next, we configure a learning controller. The cell below uses GlLearner (Turi) as its ML package; SkLearner (scikit-learn), imported above, is the alternative.

In [4]:
len(academia_graph.vertices)

101

In [5]:
glc = GraphLearningController(GlLearner(labels=labels), academia_config)

The results will be written to result_path.

In [6]:
output_folder = "../output/"
result_path = os.path.join(output_folder, academia_config.name + "_res.csv")

Some of the extracted features are useful for understanding the results, but they must not be used in the ML process.

In [7]:
if academia_graph.is_directed:
    meta_data_cols = ["dst", "src", "out_degree_v", "in_degree_v", "out_degree_u", "in_degree_u"]
else:
    meta_data_cols = ["dst", "src", "number_of_friends_u", "number_of_friends_v"]
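
As a rough illustration of what keeping metadata out of the ML process means, the sketch below separates the columns a learner may see from the metadata and target columns. The helper `split_feature_columns` is hypothetical and is not part of the package; the column names are taken from the training file shown later in this notebook.

```python
def split_feature_columns(columns, meta_data_cols, target="edge_label"):
    """Return the columns that may be fed to the learner, preserving order.

    Hypothetical helper: metadata and the target column are excluded.
    """
    excluded = set(meta_data_cols) | {target}
    return [c for c in columns if c not in excluded]


columns = ["src", "dst", "out_degree_v", "in_degree_v", "out_degree_u",
           "in_degree_u", "jaccards_coefficient", "total_friends", "edge_label"]
meta_data_cols = ["dst", "src", "out_degree_v", "in_degree_v",
                  "out_degree_u", "in_degree_u"]
feature_cols = split_feature_columns(columns, meta_data_cols)
print(feature_cols)
```

The metadata columns stay in the result file so that a flagged link can still be traced back to its source and destination vertices.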

Finally, we execute the classification algorithm.

In [8]:
glc.classify_by_links(academia_graph, result_path, test_size={"neg": 1000, "pos": 100},
                      train_size={"neg": 20000, "pos": 20000}, meta_data_cols=meta_data_cols)

Setting training and test sets
Existing files were loaded.


+--------------------+-------------------+-------------------------------+-------------------------------+
| out_common_friends | bi_common_friends | preferential_attachment_score | is_opposite_direction_friends |
+--------------------+-------------------+-------------------------------+-------------------------------+
|         7          |         5         |             672.0             |             False             |
|         0          |         0         |             833.0             |             False             |
|         1          |         0         |             154.0             |             False             |
|         1          |         0         |             966.0             |             False             |
|         2          |         1         |             833.0             |             False             |
|        207         |         13        |           4870295.0           |             False             |
|         0          |         0     

{'f1_score': 0.8874560648029609, 'auc': 0.9503357650000008, 'recall': 0.95315, 'precision': 0.8302338748312356, 'log_loss': 0.32273291339669813, 'roc_curve': Columns:
	threshold	float
	fpr	float
	tpr	float
	p	int
	n	int

Rows: 100001

Data:
+-----------+-----+-----+-------+-------+
| threshold | fpr | tpr |   p   |   n   |
+-----------+-----+-----+-------+-------+
|    0.0    | 1.0 | 1.0 | 20000 | 20000 |
|   1e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   2e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   3e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   4e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   5e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   6e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   7e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   8e-05   | 1.0 | 1.0 | 20000 | 20000 |
|   9e-05   | 1.0 | 1.0 | 20000 | 20000 |
+-----------+-----+-----+-------+-------+
[100001 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'confusion_matri

{'f1_score': 1.0, 'auc': 1.0000000000000033, 'recall': 1.0, 'precision': 1.0, 'log_loss': 0.19181401175792054, 'roc_curve': Columns:
	threshold	float
	fpr	float
	tpr	float
	p	int
	n	int

Rows: 100001

Data:
+-----------+-----+-----+-------+-------+
| threshold | fpr | tpr |   p   |   n   |
+-----------+-----+-----+-------+-------+
|    0.0    | 1.0 | 1.0 | 17970 | 18030 |
|   1e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   2e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   3e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   4e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   5e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   6e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   7e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   8e-05   | 1.0 | 1.0 | 17970 | 18030 |
|   9e-05   | 1.0 | 1.0 | 17970 | 18030 |
+-----------+-----+-----+-------+-------+
[100001 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'confusion_matrix': Columns:
	target_label	int
	pr

TypeError: 'RandomForestClassifier' object is not callable

In [9]:
import turicreate as tc

In [10]:
df = tc.SFrame().read_csv("/home/kagandi/.avd/temp/academia_config_train.csv")

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[float,int,int,float,str,float,int,float,int,float,float,str,str,int,float,float,float,float,float,float,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [None]:
tc.SFrame(data=[df[i] for i in range(10)]).unpack('X1', column_name_prefix="")

In [None]:
'label' not in features.column_names()

In [None]:
features = features.rename({'edge_label': 'label'})

In [None]:
import numpy as np
x = np.arange(100)
x

In [None]:

def split_kfold(data, labels=None, n_folds=10):
    x = np.arange(len(data))
    np.random.shuffle(x)
    split = np.array_split(x, n_folds)
    split_copy = np.copy(split)
    for i, item in enumerate(split):
        temp = item
        split_copy = np.delete(split_copy, i, 0)
        yield np.block(split_copy), item
        split_copy = np.insert(split_copy, i, item, axis=0)
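
For comparison, here is a standard-library sketch of the same k-fold idea that does not stack the folds into a single array, so it also works when the number of items is not a multiple of n_folds. The name `kfold_indices` and the striding scheme are my own choices for illustration, not the code used elsewhere in this notebook.

```python
import random


def kfold_indices(n_items, n_folds=10, seed=None):
    """Yield (train_idx, test_idx) index lists, one pair per fold."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    for fold in range(n_folds):
        # Every n_folds-th shuffled index forms the test fold,
        # so fold sizes differ by at most one.
        test = idx[fold::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test


folds = list(kfold_indices(103, n_folds=10, seed=1))
print([len(test) for _, test in folds])
```

Each index lands in exactly one test fold, and a fold's train and test sides never overlap.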


In [None]:
split = split_kfold(range(100))

In [None]:
r = next(split)

In [None]:
tr, ts = r

In [None]:
tr =np.hstack(tr)

In [None]:
tr

In [None]:
df['id'] = range(df.num_rows())

In [None]:
df.filter_by(tr, 'id')

In [14]:
real_data = df[df["edge_label"] == "Real"]  # "Real" is the negative label

In [15]:
real_data

out_degree_v,out_common_friends,bi_common_friends,preferential_attachment_score,is_opposite_direction_friends
56.0,7,5,672.0,False
49.0,0,0,833.0,False
14.0,1,0,154.0,False
21.0,1,0,966.0,False
49.0,2,1,833.0,False
13955.0,207,13,4870295.0,False
149.0,0,0,2086.0,False
92.0,5,0,16376.0,False
12.0,0,0,180.0,False
31.0,0,0,1054.0,False

jaccards_coefficient,dst,total_friends,number_of_transitive_friends,in_degree_v,in_degree_u,edge_label
0.114754098361,49092,61.0,5,36.0,13.0,Real
0.0,170032,66.0,0,0.0,18.0,Real
0.0416666666667,321695,24.0,0,7.0,10.0,Real
0.0151515151515,76869,66.0,0,17.0,24.0,Real
0.03125,9299,64.0,2,49.0,11.0,Real
0.0146839753139,5383,14097.0,13,937.0,249.0,Real
0.0,485942,163.0,0,151.0,18.0,Real
0.0188679245283,62184,265.0,0,21.0,124.0,Real
0.0,10296,27.0,0,4.0,27.0,Real
0.0,14664,65.0,0,0.0,27.0,Real

src,in_common_friends,out_degree_u,knn_weight1,knn_weight3,knn_weight2,knn_weight5
75586,7,12.0,0.431660229218,0.399714477619,0.441749085418,0.0439374775164
Fake66819,0,17.0,1.22941573387,0.370837090108,1.2357022604,0.229415733871
321690,1,11.0,0.655064735171,0.559710234325,0.642228525188,0.106600358178
267624,1,46.0,0.435702260396,0.413200716356,0.381567251893,0.0471404520791
48699,2,17.0,0.430096490832,0.430096490832,0.377123616633,0.0408248290464
2547,207,349.0,0.0958967106258,0.0717104081335,0.0861034058049,0.00206504051391
20381,0,14.0,0.310526444436,0.311065391963,0.339309600313,0.0186080731891
379010,5,178.0,0.302643435456,0.193137888573,0.287944225631,0.0190692517849
219987,0,15.0,0.636195832005,0.466332334617,0.6972135955,0.0845154254729
Fake57739,0,34.0,1.1889822365,0.365758931801,1.16903085095,0.188982236505

knn_weight4,knn_weight7,knn_weight6,knn_weight8
0.409803333819,0.0353996162702,0.0455960752588,0.0367359179185
0.377123616633,0.0324442842262,0.235702260396,0.0333333333333
0.546874024342,0.0778498944162,0.102062072616,0.07453559925
0.359065707854,0.0426401432711,0.0343807082086,0.0310985206786
0.377123616633,0.0408248290464,0.0333333333333,0.0333333333333
0.0619171033126,0.000535364432843,0.00174527777652,0.000452465528247
0.33984854784,0.0187317162316,0.0209426954146,0.0210818510678
0.178438678748,0.0092747779152,0.0159353697204,0.0077505408613
0.527350098113,0.0524142418361,0.111803398875,0.0693375245282
0.345807546242,0.0334076552391,0.169030850946,0.0298807152334


In [89]:
import numpy as np
from collections import defaultdict
import turicreate as tc


def split_stratified_kfold(data, label='label', n_folds=10):
    if label in data.column_names():
        labels = data[label].unique()
        labeled_data = [data[data[label] == l] for l in labels]
        fold = [split_kfold(item, n_folds) for item in labeled_data]
        for _ in range(n_folds):
            train, test = tc.SFrame(), tc.SFrame()
            for f in fold:
                x_train, x_test = next(f)
                train = train.append(x_train)
                test = test.append(x_test)
            yield train, test
    else:
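        # Without a label column, fall back to plain k-fold and yield each pair.
        for train, test in split_kfold(data, n_folds):
            yield train, test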


def split_kfold(data, n_folds=10):
    data['id'] = range(data.num_rows())
    x = np.arange(len(data))
    np.random.shuffle(x)
    split = np.array_split(x, n_folds)
    split_copy = np.copy(split)
    for i, item in enumerate(split):
        split_copy = np.delete(split_copy, i, 0)
        yield data.filter_by(np.hstack(split_copy), 'id').remove_column('id'), data.filter_by(item, 'id').remove_column(
            'id')
        split_copy = np.insert(split_copy, i, item, axis=0)


def get_classification_metrics(model, targets, predictions):
    precision = tc.evaluation.precision(targets, predictions)
    accuracy = tc.evaluation.accuracy(targets, predictions)
    recall = tc.evaluation.recall(targets, predictions)
    auc = tc.evaluation.auc(targets, predictions)
    return {"recall": recall,
            "precision": precision,
            "accuracy": accuracy,
            "auc": auc
            }


def cross_validate(datasets, model, model_parameters=None, evaluator=get_classification_metrics,  label='label'):
    if not model_parameters:
        model_parameters = {}
    cross_val_metrics = defaultdict(list)
    for train, test in datasets:
        cross_val = model(train, **model_parameters)
        prediction = cross_val.predict(test)
        metrics = evaluator(cross_val, test[label], prediction)
        for k, v in metrics.items():
            cross_val_metrics[k].append(v)
    return {k: np.mean(v) for k, v in cross_val_metrics.items()}
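
The stratification idea can also be shown without SFrames. The sketch below is an illustrative standard-library variant and not the notebook's implementation: it groups row indices by label and deals each class round-robin across the folds, so every test fold keeps roughly the class proportions of the whole dataset. The function name `stratified_kfold_indices` is hypothetical.

```python
from collections import defaultdict


def stratified_kfold_indices(labels, n_folds=5):
    """Yield (train_idx, test_idx) so each fold preserves class proportions."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    for fold in range(n_folds):
        train, test = [], []
        for indices in by_label.values():
            # Within each class, every n_folds-th row goes to the test side.
            for pos, i in enumerate(indices):
                (test if pos % n_folds == fold else train).append(i)
        yield sorted(train), sorted(test)


labels = ["Real"] * 8 + ["Fake"] * 4
splits = list(stratified_kfold_indices(labels, n_folds=4))
for train, test in splits:
    print(test, [labels[i] for i in test])
```

With 8 "Real" and 4 "Fake" rows and 4 folds, every test fold holds two "Real" rows and one "Fake" row.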

In [90]:
#url = 'https://static.turi.com/datasets/xgboost/mushroom.csv'
#sf = tc.SFrame.read_csv(url)
#sf['label'] = (sf['label'] == 'p')
params = {'target': 'label'}
folds = split_stratified_kfold(sf,'label', 5)
cross_validate(folds, tc.random_forest_classifier.create, params, label='label')

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



{'accuracy': 0.99938461519817923,
 'auc': 0.99872351239346313,
 'precision': 0.99881432257846892,
 'recall': 1.0}
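
To make the reported numbers easier to interpret, here is a small sketch of how precision, recall and accuracy follow from the confusion counts. It is a plain-Python illustration rather than the tc.evaluation code; the function name `binary_metrics` and the toy inputs are my own. AUC is left out because it is normally computed from scores rather than from hard class predictions.

```python
def binary_metrics(targets, predictions, positive=True):
    """Compute precision, recall and accuracy from hard class predictions."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "precision": tp / float(tp + fp) if tp + fp else 0.0,
        "recall": tp / float(tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / float(len(pairs)) if pairs else 0.0,
    }


# Toy data: 3 positives, 5 negatives, one miss and one false alarm.
targets = [True, True, True, False, False, False, False, False]
predictions = [True, True, False, True, False, False, False, False]
m = binary_metrics(targets, predictions)
print(m)
```

A recall of 1.0, as in the output above, means no positive example was missed in any fold; precision below 1.0 means a few negatives were still flagged as positive.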

In [100]:
[(item[0].to_numpy(), item[1].to_numpy()) for item in split_kfold(tc.SFrame(range(10)), 10)]

[(array([[1],
         [2],
         [3],
         [4],
         [5],
         [6],
         [7],
         [8],
         [9]]), array([[0]])), (array([[0],
         [1],
         [2],
         [3],
         [4],
         [5],
         [6],
         [8],
         [9]]), array([[7]])), (array([[0],
         [1],
         [2],
         [3],
         [4],
         [6],
         [7],
         [8],
         [9]]), array([[5]])), (array([[0],
         [1],
         [2],
         [4],
         [5],
         [6],
         [7],
         [8],
         [9]]), array([[3]])), (array([[0],
         [2],
         [3],
         [4],
         [5],
         [6],
         [7],
         [8],
         [9]]), array([[1]])), (array([[0],
         [1],
         [2],
         [3],
         [4],
         [5],
         [6],
         [7],
         [9]]), array([[8]])), (array([[0],
         [1],
         [2],
         [3],
         [4],
         [5],
         [6],
         [7],
         [8]]), array([[9]])), (arra

In [134]:
import numpy as np
import turicreate as tc
from collections import defaultdict


def shuffle_sframe(sf):
    sf["shuffle_col"] = tc.SArray.random_integers(sf.num_rows())
    return sf.sort("shuffle_col").remove_column("shuffle_col")


def kfold_sections(data, n_folds):
    """
    Based on scikit implementation.
    """
    Neach_section, extras = divmod(len(data), n_folds)
    section_sizes = ([0] +
                     extras * [Neach_section + 1] +
                     (n_folds - extras) * [Neach_section])
    div_points = np.array(section_sizes).cumsum()
    for i in range(n_folds):
        st = div_points[i]
        end = div_points[i + 1]
        yield st, end


def split_kfold(data, n_folds=10):
    for st, end in kfold_sections(data, n_folds):
        idx = np.zeros(len(data))
        idx[st:end] = 1
        yield data[tc.SArray(1 - idx)], data[tc.SArray(idx)]


def split_stratified_kfold(data, label='label', n_folds=10):
    if label in data.column_names():
        labels = data[label].unique()
        labeled_data = [data[data[label] == l] for l in labels]
        fold = [split_kfold(item, n_folds) for item in labeled_data]
        for _ in range(n_folds):
            train, test = tc.SFrame(), tc.SFrame()
            for f in fold:
                x_train, x_test = next(f)
                train = train.append(x_train)
                test = test.append(x_test)
            yield train, test
    else:
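        # Without a label column, fall back to plain k-fold and yield each pair.
        for train, test in split_kfold(data, n_folds):
            yield train, test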


def cross_validate(datasets, model_factory, model_parameters=None, evaluator=get_classification_metrics, label='label'):
    if not model_parameters:
        model_parameters = {}
    cross_val_metrics = defaultdict(list)
    for train, test in datasets:
        model = model_factory(train, **model_parameters)
        prediction = model.predict(test)
        metrics = evaluator(model, test[label], prediction)
        for k, v in metrics.items():
            cross_val_metrics[k].append(v)
    return {k: np.mean(v) for k, v in cross_val_metrics.items()}


In [135]:
url = 'https://static.turi.com/datasets/xgboost/mushroom.csv'
sf = tc.SFrame.read_csv(url)
sf['label'] = (sf['label'] == 'p')
params = {'target': 'label'}
sf = shuffle_sframe(sf)
folds = split_stratified_kfold(sf,'label', 5)
cross_validate(folds, tc.random_forest_classifier.create, params, label='label')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



{'accuracy': 0.99926146248881975,
 'auc': 0.99846808455182834,
 'precision': 0.99857679288725765,
 'recall': 1.0}
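
The dictionary above is an average over the folds. As a minimal sketch of that aggregation step, assuming each fold produces a dict of metric values, the code below collects the values per metric and takes their mean. The per-fold numbers are made up for illustration and are not results from this notebook.

```python
from collections import defaultdict


def average_metrics(per_fold):
    """Average each metric over a list of per-fold metric dicts."""
    collected = defaultdict(list)
    for metrics in per_fold:
        for name, value in metrics.items():
            collected[name].append(value)
    return {name: sum(vals) / float(len(vals)) for name, vals in collected.items()}


# Illustrative values only.
per_fold = [
    {"recall": 1.0, "precision": 0.5},
    {"recall": 0.5, "precision": 1.0},
]
averaged = average_metrics(per_fold)
print(averaged)
```

Keeping the per-fold lists around, rather than only the mean, also makes it possible to look at how much a metric varies from fold to fold.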