# Transforming Temporal-Dynamic Graphs Into Time-Series Data for Solving Event Detection Problems

Event detection on temporal-dynamic graphs aims to identify important events by detecting abnormal changes in the
network. Because of the widespread use of social media, many real-world problems can be modelled as temporal-dynamic graph
data. With recent progress in graph representation learning, new anomaly detection methods for static graphs have been studied. In this
work, we present a workflow for event detection on temporal-dynamic graphs using graph representation learning.
Our workflow uses generated embeddings of the temporal-dynamic graph to transform the problem into an unsupervised
time-series anomaly detection problem. Since this is a widely studied research area, transforming temporal-dynamic graph
data into multivariate time-series data opens up many possible solutions for event detection problems. We evaluated
our proposed workflow on four real-world datasets and compared the results. Our workflow shows competitive
performance compared to previous studies. This study gives a proof of concept for using graph embeddings as time-series
data in anomaly detection tasks.

# Proposed Workflow

The figure below shows the proposed workflow. The input is a temporal-dynamic graph G, which consists of static
snapshots of the graph taken at different time steps. The model then generates n-dimensional vector embeddings from the
input graph using graph representation learning. After this step, the model passes these embeddings to an unsupervised
anomaly detector. The output of the workflow is an anomaly score for each time step.

<img src="Proposed_Workflow.png">
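To make the idea of "embeddings in, anomaly scores out" concrete, here is a minimal sketch of the last stage of the workflow. It is not the detector used in this notebook: the distance-from-mean score is a deliberately simple stand-in chosen for illustration, and the embedding values are made up.

```python
import math

def anomaly_scores(embeddings):
    """Score each time step by the Euclidean distance of its embedding
    from the mean embedding (a simple stand-in for a real detector)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    return [
        math.sqrt(sum((vec[d] - mean[d]) ** 2 for d in range(dim)))
        for vec in embeddings
    ]

# Hypothetical 2-dimensional embeddings, one row per graph snapshot.
embeddings = [
    [0.10, 0.20],
    [0.12, 0.19],
    [0.11, 0.21],
    [0.90, 0.95],  # snapshot that differs sharply from the others
    [0.09, 0.20],
]

scores = anomaly_scores(embeddings)
# Index of the time step with the highest anomaly score.
most_anomalous_step = max(range(len(scores)), key=scores.__getitem__)
```

Any unsupervised multivariate time-series detector can take the place of `anomaly_scores`, which is the point of transforming the graph into a time series.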

In the following experiments, we use the proposed workflow. After pre-processing the data, the first step is
to generate graph embeddings. For this task we use the tdGraphEmbed model. In our experiments the model generates
40 random walks for each node in the graph, and each walk has length 16. The model is trained for 50 iterations
on the generated random-walk documents. In the second step we apply unsupervised time-series anomaly detectors
to the embeddings. For these algorithms we use Merlion, a machine learning library for time-series data.
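The random-walk step can be sketched as follows. This is an illustrative, pure-Python version that assumes uniform random walks over an adjacency dictionary; the function name `generate_walks` and the toy snapshot are hypothetical, and the actual tdGraphEmbed implementation may sample walks differently.

```python
import random

def generate_walks(adjacency, walks_per_node=40, walk_length=16, seed=0):
    """Generate uniform random walks from every node of one graph snapshot.

    adjacency maps each node to a list of its neighbours. A walk stops
    early if it reaches a node that has no neighbours.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy snapshot: a triangle plus one isolated node.
snapshot = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": []}
walks = generate_walks(snapshot)
```

With the defaults above, each node contributes 40 walks of at most 16 nodes. The walks of one snapshot can then be treated as the sentences of a single document, which is the form a document-embedding model such as Doc2Vec expects.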

# Importing Packages

Before importing the libraries, make sure to install the requirements listed in the GitHub repository.

In [1]:
from gensim.models.doc2vec import Doc2Vec
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from merlion.utils import TimeSeries
from evaluation_util import *

from tdGraphEmbed.tdGraphEmbed.temporal_graph import TemporalGraph
from tdGraphEmbed.tdGraphEmbed.model import TdGraphEmbed
from datasetConverter import dataset_convert
from tempfile import TemporaryFile

import datetime



# Generating Temporal-Dynamic Graph Embeddings
### Important note: training in this section takes approximately 15-24 hours. Pretrained models and embeddings are available, so you can skip this part.

In this step we read the data from files and create a model for training. To load the datasets, we use the dataset_convert() function. Available datasets are:

Tw-WorldCup - The Twitter WorldCup dataset. In experiments you can use hours as the granularity.

Tw-Terror-Security - The Twitter Terror Security dataset. In experiments you can use days as the granularity.

gameofthrones - The Reddit Game of Thrones dataset. It can be read directly, since it is provided as a pickle file.

formula - The Reddit Formula 1 dataset. It can be read directly, since it is provided as a pickle file.

Note: datasets larger than 100 MB are available at https://drive.google.com/drive/folders/1D8P9LBHXERWN_r-hiTWNU4HDe3VmVHbx?usp=sharing

In [None]:
graphs = dataset_convert(dataset="Tw-WorldCup",granularity="hours")
model = TdGraphEmbed(dataset_name="Tw-WorldCup")

In [None]:
documents = model.get_documents_from_graph(graphs)

In [None]:
model.run_doc2vec(documents)

In [None]:
graph_vectors = model.get_embeddings()
np.save("tdGraphEmbed/saved_embeddings/Tw-WorldCup.npy", graph_vectors)

The code above reads the dataset, trains the graph representation learning model, and saves the model and embeddings to files. Training takes around 15-24 hours to complete, so from here on we use the model and files that were saved earlier.

# Unsupervised Time Series Anomaly Detection

Below, we read the saved model and the dataset labels and prepare the data in the format expected by the Merlion library. The saved model file provides the time-stamps and the n-dimensional embeddings. The training process is fully unsupervised; the labels are used only for evaluation.

In [2]:
model_path = "tdGraphEmbed/trained_models/Tw-WorldCup.model"
labels_path = "Datasets/Twitter_WorldCup/Twitter_WorldCup_2014_labels.txt"

#model_path = "tdGraphEmbed/trained_models/Tw-Terror-Security.model"
#labels_path = "Datasets/Twitter_Security/Twitter_May_Aug_2014_TerrorSecurity_labels.txt"

#model_path = "tdGraphEmbed/trained_models/GoT-2017.model"
#labels_path = "Datasets/gameofthrones/gameofthrones_2017_labels.txt"

#model_path = "tdGraphEmbed/trained_models/Formula-2019.model"
#labels_path = "Datasets/formula/formula_2019_labels.txt"



model = Doc2Vec.load(model_path)
doc_vecs = model.docvecs.doctag_syn0
doc_vecs = doc_vecs[np.argsort([model.docvecs.index_to_doctag(i) for i in range(0, doc_vecs.shape[0])])]

time_stamps = list(model.docvecs.doctags.keys())
time_series_custom = pd.DataFrame(doc_vecs, index=time_stamps)

ls = readFiles(labels_path, granularity="hours")
df_metadata = pd.DataFrame(columns = ['trainval', 'anomaly'], index = time_stamps)
df_metadata = generate_metadata(df_metadata, time_stamps, ls)

Call to deprecated `doctag_syn0` (Attribute will be removed in 4.0.0, use docvecs.vectors_docs instead).


In this step we train the unsupervised time-series anomaly detection model. Different datasets need different settings; below, the settings are adjusted for the Twitter World-Cup dataset. The most important parameter is "top", which is the k in Recall@k and Precision@k and defines the number of anomalies to detect. The models we recommend are "VAE", "LSTMED", and "IsolationForest". For other datasets, please follow the comments in the code.
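As a reference for what k means here, the sketch below computes Precision@k and Recall@k using their usual textbook definitions. The helper names and the toy scores are hypothetical, and the functions in evaluation_util may differ in detail.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def precision_recall_at_k(scores, labels, k):
    """Precision@k: fraction of the k flagged steps that are true anomalies.
    Recall@k: fraction of all true anomalies found among the k flagged steps."""
    flagged = top_k_indices(scores, k)
    hits = sum(1 for i in flagged if labels[i] == 1)
    total_anomalies = sum(labels)
    precision = hits / k
    recall = hits / total_anomalies if total_anomalies else 0.0
    return precision, recall

# Toy anomaly scores and ground-truth labels (1 = anomaly), one per time step.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
labels = [0, 1, 0, 0, 1, 1]
precision, recall = precision_recall_at_k(scores, labels, k=3)
```

In this toy case the three highest scores are at steps 1, 3, and 5, of which steps 1 and 5 are labelled anomalies, so both metrics equal 2/3. When k is much smaller than the number of labelled anomalies, recall is bounded well below 1 even for a perfect ranking.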

In [9]:
from merlion.models.factory import ModelFactory
from merlion.post_process.threshold import AggregateAlarms

# @k parameter to set
top=30

train_data = TimeSeries.from_pd(time_series_custom[:])
test_labels = TimeSeries.from_pd(df_metadata["anomaly"][:])

#Available models are VAE, LSTMED, and IsolationForest
#Because of the data properties Isolation Forest is best in Twitter Security dataset.
#It is best to use VAE or LSTMED on other datasets.
model = ModelFactory.create("VAE",
                            threshold=AggregateAlarms(alm_threshold=0))

model.train(train_data)
labels = model.get_anomaly_label(train_data)
df_temp = labels.to_pd()
df_cpy = df_temp.copy()

# For datasets other than Twitter World-Cup, ascending should be True. This is due to an
# implementation issue in the Merlion library.
df_temp = get_top_anomalies(df_temp, ascending=False, top=top)

# If you are using the Isolation Forest model, use test_labels_temp = test_labels[1:]
# because Isolation Forest does not return an anomaly score for the first time-stamp.
test_labels_temp = test_labels[:]

 |████████████████████████████████████████| 100.0% Complete, Loss 0.0013
Anomaly Threshold: 
anom_score    1.995362
Name: 2014-06-14 19:00:00, dtype: float64


# Evaluation

In this part you can evaluate the model with precision and recall, and try out different delay factors. Because model training involves randomness, your results may differ slightly from the ones shown here.
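The sketch below illustrates one plausible way a delay factor can enter the metrics. It assumes that a flagged time step counts as correct when a labelled anomaly lies within `delay` steps of it, and that a labelled anomaly counts as recovered when a flagged step lies within `delay` steps of it. This symmetric definition and the function name are assumptions made for illustration; get_precision and get_recall in evaluation_util may be defined differently.

```python
def precision_recall_with_delay(flagged, labelled, delay):
    """Delay-tolerant precision and recall over integer time-step indices.

    flagged:  time steps the detector marked as anomalous
    labelled: time steps marked as anomalous in the ground truth
    delay:    maximum allowed distance (in steps) between a match
    """
    correct_flags = sum(
        1 for f in flagged if any(abs(f - a) <= delay for a in labelled)
    )
    recovered = sum(
        1 for a in labelled if any(abs(f - a) <= delay for f in flagged)
    )
    precision = correct_flags / len(flagged) if flagged else 0.0
    recall = recovered / len(labelled) if labelled else 0.0
    return precision, recall

# Toy example: three detections and four labelled anomalies.
flagged = [3, 10, 20]
labelled = [4, 5, 11, 30]
results = {d: precision_recall_with_delay(flagged, labelled, d) for d in range(3)}
```

Under this definition both metrics can only stay the same or increase as the delay grows, which matches the trend in the delay sweep further down.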

In [10]:
prec = get_precision(df_temp, test_labels_temp,delay=0)
rec = get_recall(df_temp, test_labels_temp,delay=0,top=top)
acc = get_accuracy(df_temp, test_labels_temp)
print("Top",top,"time-stamps are considered as anomalies")
print("Precision:", prec)
print("Recall:", rec)
print("Accuracy:", acc)

Top 30 time-stamps are considered as anomalies
Precision: 0.6333333333333333
Recall: 0.15966386554621848
Accuracy: 0.9380234505862647


In [11]:
for i in range(12):  
    prec = get_precision(df_temp, test_labels_temp,delay=i)
    rec = get_recall(df_temp, test_labels_temp,delay=i,top=top)
    print("Precision and Recall with delay",i)
    print("Precision:", prec)
    print("Recall:", rec)
    print("---------")
    

Precision and Recall with delay 0
Precision: 0.6333333333333333
Recall: 0.15966386554621848
---------
Precision and Recall with delay 1
Precision: 0.9333333333333333
Recall: 0.453781512605042
---------
Precision and Recall with delay 2
Precision: 0.9666666666666667
Recall: 0.5546218487394958
---------
Precision and Recall with delay 3
Precision: 0.9666666666666667
Recall: 0.6470588235294118
---------
Precision and Recall with delay 4
Precision: 0.9666666666666667
Recall: 0.7394957983193278
---------
Precision and Recall with delay 5
Precision: 0.9666666666666667
Recall: 0.7815126050420168
---------
Precision and Recall with delay 6
Precision: 0.9666666666666667
Recall: 0.7899159663865546
---------
Precision and Recall with delay 7
Precision: 1.0
Recall: 0.7899159663865546
---------
Precision and Recall with delay 8
Precision: 1.0
Recall: 0.7899159663865546
---------
Precision and Recall with delay 9
Precision: 1.0
Recall: 0.7899159663865546
---------
Precision and Recall with delay 10
