# Main objective - demystified 

We said that we wanted to **classify** small subgraphs into one of the subnetworks of each dataset. For example, given a small *directed temporal subgraph* from the email network, we want to determine which department it belongs to.

We can achieve this in 3 steps:
1. Group together the timestamped edges so that they form small subgraphs [sampling]
2. Transform these small graphs into feature vectors [embedding]
3. Train various machine learning models using the vectors generated in the previous step and classify unknown samples [modeling]


## Sampling
As with any machine learning problem, we first need to define what the samples are. In this case, we only have the network as a whole for each of the classes, so we need to break it up.  

A fairly reasonable way to partition the network is to break it into **time-windows** that each span a fixed amount of time but may contain a variable number of edges. This way, not only do we preserve the temporal motif formations, but we also get a sense of how they evolved throughout the years.  
I know this is a mouthful, but it's a pretty simple process. First, we sort the network (which is a list of timestamped edges) by timestamp, then we form time-windows that each span a 24-hour period. From here, we build 2 datasets: 
* each sample is a single window
* each sample is a concatenation of itself and its two previous windows  

The first alternative can hurt the models because it loses temporal context: the graph has been discretized, so any information in the neighboring time-frames is lost.  
The second alternative produces samples with more context, namely what was happening in the previous two time-frames. It also preserves the overall number of samples in the dataset. If we instead concatenated each triplet of time-windows without overlap, the samples would carry as much temporal context as in the second alternative, but the dataset would contain only a third as many samples, which would likely hurt the models since we're already dealing with a limited number of samples.
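
The windowing described above can be sketched on a toy edge list. This is a minimal illustration rather than the notebook's actual implementation: the edge values, the `day` column and the variable names are invented for the example, the toy data has no empty days, and an overlapping sample is simply taken to be three consecutive daily windows.

```python
import pandas as pd

DAY = 24 * 3600  # window length in seconds

# Toy timestamped edges (from, to, timestamp); values are invented for illustration.
edges = pd.DataFrame(
    [(1, 2, 10), (2, 3, 50), (1, 3, DAY + 5), (3, 1, 2 * DAY + 7),
     (2, 1, 3 * DAY + 1), (1, 2, 4 * DAY + 9), (3, 2, 5 * DAY + 3)],
    columns=['from', 'to', 'timestamp'],
)

# Sort by timestamp, then assign each edge to the 24-hour window it falls into.
edges = edges.sort_values('timestamp').reset_index(drop=True)
edges['day'] = edges['timestamp'] // DAY
windows = [group.drop(columns='day') for _, group in edges.groupby('day')]

# Dataset 1: every window is a sample on its own.
single = windows

# Dataset 2: overlapping samples, each spanning three consecutive windows.
overlapping = [pd.concat(windows[i:i + 3]) for i in range(len(windows) - 2)]

# For comparison: non-overlapping triplets of windows.
disjoint = [pd.concat(windows[i:i + 3]) for i in range(0, len(windows) - 2, 3)]

print(len(single), len(overlapping), len(disjoint))  # 6 4 2
```

With six daily windows, the overlapping variant loses only the two samples at the boundary, while the non-overlapping variant keeps a third of them.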

## Embedding
Now that we have defined what our samples are, we need to embed them - transform them into vectors. How exactly are we going to achieve this? **Motifs** come to the rescue. Each subgraph will be represented in 3 ways:
1. With its **temporal motif** distribution - a vector of 36 values, each one representing the count of one of the temporal motifs described in "Motifs in Temporal Networks" by Ashwin Paranjape, Austin R. Benson, and Jure Leskovec  
2. With its **static motif** distribution - a vector of 13 values, each one representing the count of one of the static motifs described in "Efficient Detection of Network Motifs" by Sebastian Wernicke
3. A combination of the previous two  
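
To make the size of the static vector concrete: assuming motifs of size 3 (as in the `-m:3` flag used later in this notebook), the static motifs are the connected directed patterns on three nodes, counted up to isomorphism. The brute-force sketch below is written for illustration only and is not taken from either paper. It enumerates these patterns with networkx and finds 13 of them, which matches the length of the static vector.

```python
from itertools import combinations
import networkx as nx

nodes = [0, 1, 2]
# The 6 possible directed edges between 3 nodes (no self-loops).
possible_edges = [(u, v) for u in nodes for v in nodes if u != v]

motifs = []
for k in range(1, len(possible_edges) + 1):
    for edge_set in combinations(possible_edges, k):
        G = nx.DiGraph()
        G.add_nodes_from(nodes)
        G.add_edges_from(edge_set)
        # Keep connected patterns only, one representative per isomorphism class.
        if nx.is_weakly_connected(G) and not any(nx.is_isomorphic(G, M) for M in motifs):
            motifs.append(G)

print(len(motifs))  # 13
```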

## Modeling
After embedding the subgraphs, we construct the datasets and split them into train/test sets. In this notebook we only go over the data preparation step. The model definition step will be discussed in the next notebook.

## load data

In [2]:
import pandas as pd
import numpy as np
import subprocess
import os
from datetime import datetime
import csv
import networkx as nx

In [2]:
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas versions)
df_email_dept_1 = pd.read_csv('data/formatted/email/dept_1.csv', sep=r'\s+')
df_email_dept_2 = pd.read_csv('data/formatted/email/dept_2.csv', sep=r'\s+')
df_email_dept_3 = pd.read_csv('data/formatted/email/dept_3.csv', sep=r'\s+')
df_email_dept_4 = pd.read_csv('data/formatted/email/dept_4.csv', sep=r'\s+')

In [2]:
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas versions)
df_sx_superuser = pd.read_csv('data/formatted/sx/superuser.csv', sep=r'\s+')
df_sx_askubuntu = pd.read_csv('data/formatted/sx/askubuntu.csv', sep=r'\s+')
df_sx_mathoverflow = pd.read_csv('data/formatted/sx/mathoverflow.csv', sep=r'\s+')
df_sx_stackoverflow = pd.read_csv('data/formatted/sx/stackoverflow.csv', sep=r'\s+')

In [5]:
datasets = [(df_email_dept_1, 'dept_1'), (df_email_dept_2, 'dept_2'), (df_email_dept_3, 'dept_3'), (df_email_dept_4, 'dept_4'), 
            (df_sx_superuser, 'superuser'), (df_sx_askubuntu, 'askubuntu'), (df_sx_mathoverflow, 'mathoverflow'), (df_sx_stackoverflow, 'stackoverflow')]

look_around = np.arange(2)

## generate temporal subgraphs

In [6]:
def get_directory(subnetwork):
    return 'email' if subnetwork.startswith('dept') else 'sx'

In [None]:
for ds, name in datasets:
    # frames will contain the daily (24-hour) time-frames
    print('Processing: {0}\t[{1}]'.format(name, datetime.now()))
    frames = []
    # midnight (UTC) of the first day, i.e. the start of the first time-frame
    prev_ts = datetime.utcfromtimestamp(list(ds.head(1)['timestamp'])[0])
    prev_ts = datetime(prev_ts.year, prev_ts.month, prev_ts.day, 0, 0, 0)    # CHANGE THIS LATER IF EXECUTING FOR SX NETWORK
    # rows collected for the current daily time-frame
    rows = []
    
    for row in ds.itertuples(index=False):
        curr_ts = datetime.utcfromtimestamp(row.timestamp)
        curr_ts = datetime(curr_ts.year, curr_ts.month, curr_ts.day, 0, 0, 0)

        if curr_ts != prev_ts: # went into next time window
            frames.append(pd.DataFrame(rows, columns=['from', 'to', 'timestamp']))
            # add the missing days between the previous timestamp and now as empty dataframes
            frames.extend([pd.DataFrame(columns=['from', 'to', 'timestamp'])  for i in range(int((curr_ts-prev_ts).total_seconds() // (24*3600)) - 1)])
            rows = []
            prev_ts = curr_ts
        
        # append row to current time-window 
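        rows.append(list(row))

    # store the last time-window, which the loop above never appends
    frames.append(pd.DataFrame(rows, columns=['from', 'to', 'timestamp']))
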
    for la in look_around:
        # result_frames will contain all windowed timeframes as explained above
        result_frames = []
        
        # iterate over the daily frames and construct the time-window frames
        for i in range (la, len(frames)-la):
            look_around_items = [frames[i+l].values for l in range(-la, la+1)]
            result_frames.append(pd.DataFrame(data=np.concatenate(look_around_items, axis=0)))
            
        directory = get_directory(name)
        
        # save subgraphs to disk (note: directories should exist before running this cell)
        for i, df in enumerate(result_frames):
            df.to_csv('data/subgraphs/temporal/{0}/la_{1}/{2}/{3:05d}.csv'.format(directory, la, name, i), sep=' ', index=False, header=False)
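
The cell above assumes that the output directories already exist. One way to create them up front is sketched below. `make_output_dirs` is a hypothetical helper, not part of the original notebook; it mirrors the `<root>/<dataset>/la_<k>/<subnet>` layout used when saving the subgraphs.

```python
import os

def make_output_dirs(root, dataset_subnets, look_around):
    """Create <root>/<dataset>/la_<k>/<subnet> for every combination; return the paths."""
    created = []
    for dataset, subnets in dataset_subnets.items():
        for la in look_around:
            for subnet in subnets:
                path = os.path.join(root, dataset, 'la_{0}'.format(la), subnet)
                os.makedirs(path, exist_ok=True)  # no error if it already exists
                created.append(path)
    return created
```

Calling it with `'data/subgraphs/temporal'`, a dictionary such as `{'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4']}` and `range(2)` would prepare the `la_0` and `la_1` folders for each department.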

## remove empty files

In [2]:
clean_dir = 'data/subgraphs/temporal'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            delete_from_dir = '{0}/{1}/{2}/{3}'.format(clean_dir, dataset, la, subnet)
            
            for file in os.listdir(delete_from_dir):
                file_location = '{0}/{1}'.format(delete_from_dir, file)
                
                if os.path.getsize(file_location) == 0:
                    os.remove(file_location)
                

## generate static subgraphs from the temporal ones

In [None]:
input_dir = 'data/subgraphs/temporal'
output_dir = 'data/subgraphs/static'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            read_from_dir = '{0}/{1}/{2}/{3}'.format(input_dir, dataset, la, subnet)
            write_to_dir = '{0}/{1}/{2}/{3}'.format(output_dir, dataset, la, subnet)
            
            for file in os.listdir(read_from_dir):
                input_file_location = '{0}/{1}'.format(read_from_dir, file)
                output_file_location = '{0}/{1}'.format(write_to_dir, file)
                
                # reading the temporal graph file, converting it into static DiGraph and saving it back to disk
                df = pd.read_csv(input_file_location, sep=r'\s+', header=None, names=['from', 'to', 'timestamp'])
                G = nx.from_pandas_edgelist(df, 'from', 'to', create_using=nx.DiGraph())
                df = nx.to_pandas_edgelist(G)
                df.to_csv(output_file_location, sep=' ', index=False, header=False)
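
What the conversion does to a single subgraph can be seen on a toy example. The edge values below are invented; the point is only that a `DiGraph` keeps each (from, to) pair once and discards the timestamps.

```python
import pandas as pd
import networkx as nx

# Toy temporal edge list: the edge 1 -> 2 occurs at two different timestamps.
temporal = pd.DataFrame(
    [(1, 2, 100), (1, 2, 250), (2, 3, 300)],
    columns=['from', 'to', 'timestamp'],
)

# Building a DiGraph collapses repeated edges and ignores the timestamp column.
G = nx.from_pandas_edgelist(temporal, 'from', 'to', create_using=nx.DiGraph())
static = nx.to_pandas_edgelist(G)

print(static)
```

The resulting frame has the columns `source` and `target` and two rows, one per distinct directed edge.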

## clean dir

In [None]:
# clean_dir = 'data/subgraphs_temporal_motif_distribution'

# for dataset, subnets in dataset_subnets.items():
#     for subnet in subnets:
#         print('Processing: {0}\t\t[{1}]'.format(subnet, datetime.now()))
#         for la in look_around_values:
#             delete_from_dir = '{0}/{1}/{2}/{3}'.format(clean_dir, dataset, la, subnet)
            
#             for file in os.listdir(delete_from_dir):
#                 file_location = '{0}/{1}'.format(delete_from_dir, file)
#                 os.remove(file_location)
                

## count temporal motifs

In [None]:
input_dir = 'data/subgraphs/temporal'
output_dir = 'data/motif_distribution/temporal'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            read_from_dir = '{0}/{1}/{2}/{3}'.format(input_dir, dataset, la, subnet)
            write_to_dir = '{0}/{1}/{2}/{3}'.format(output_dir, dataset, la, subnet)
            
            for input_file in os.listdir(read_from_dir):
                parts = input_file.split(r'.')
                output_file = parts[0] + '_motif_distribution.txt'
                
                input_file_location = '{0}/{1}'.format(read_from_dir, input_file)
                output_file_location = '{0}/{1}'.format(write_to_dir, output_file)
                
                subprocess.run(['temporalmotifsmain', '-i:' + input_file_location, '-delta:3600', '-o:' + output_file_location])
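
The 36-value temporal vector comes from flattening the counts that the tool writes out. The sketch below assumes that each output file holds a 6 x 6 grid of whitespace-separated integers, which is consistent with how the files are parsed later in this notebook; the counts themselves are invented.

```python
import io
import numpy as np

# Stand-in for one output file: 6 rows x 6 columns of counts (values invented).
toy_output = io.StringIO(
    "0 1 0 0 2 0\n"
    "0 0 0 0 0 0\n"
    "3 0 0 1 0 0\n"
    "0 0 0 0 0 0\n"
    "0 0 5 0 0 0\n"
    "0 0 0 0 0 4\n"
)

grid = np.loadtxt(toy_output, dtype=int)  # shape (6, 6)
vector = grid.flatten()                   # row-major order, shape (36,)

print(grid.shape, vector.shape, vector.sum())
```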

## count static motifs

In [1]:
input_dir = 'data/subgraphs/static'
output_dir = 'data/motif_distribution/static'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            read_from_dir = '{0}/{1}/{2}/{3}'.format(input_dir, dataset, la, subnet)
            write_to_dir = '{0}/{1}/{2}/{3}'.format(output_dir, dataset, la, subnet)
            
            for input_file in os.listdir(read_from_dir):
                parts = input_file.split(r'.')
                prefix = parts[0]
                
                input_file_location = '{0}/{1}'.format(read_from_dir, input_file)
                output_file_location = '{0}/{1}'.format(write_to_dir, prefix)
                
                subprocess.run(['motifs', '-i:' + input_file_location, '-m:3', '-d:F','-o:' + output_file_location])
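
For readers without the SNAP `motifs` binary, a rough stand-in is the triad census from networkx. This is a different technique from the one used in the cell above: it classifies every triple of nodes into one of 16 triad types, of which 13 are connected. The helper `static_motif_vector` is hypothetical, and whether its ordering lines up with the SNAP output is not checked here.

```python
import networkx as nx

# Triad types that are not connected (empty, single edge, single mutual edge).
DISCONNECTED = {'003', '012', '102'}

def static_motif_vector(G):
    """Return counts of the 13 connected triad types, in a fixed (sorted) order."""
    census = nx.triadic_census(G)
    return [census[name] for name in sorted(census) if name not in DISCONNECTED]

# Toy graph: a directed 3-cycle plus one extra node hanging off it.
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])
vector = static_motif_vector(G)

print(len(vector), sum(vector))
```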

## create csv files from the temporal motif distributions

In [2]:
input_dir = 'data/motif_distribution/temporal'
output_dir = 'data/csv/temporal'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            read_from_dir = '{0}/{1}/{2}/{3}'.format(input_dir, dataset, la, subnet)
            write_to_dir = '{0}/{1}/{2}'.format(output_dir, dataset, la)
            output_file_location = '{0}/{1}.csv'.format(write_to_dir, subnet)
            
            with open (output_file_location, 'w', newline='') as out_file:
                writer = csv.writer(out_file)
                
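                # os.listdir returns files in arbitrary order; sort so that rows follow the zero-padded (chronological) file names
                for input_file in sorted(os.listdir(read_from_dir)):
                    input_file_location = '{0}/{1}'.format(read_from_dir, input_file)
                    with open (input_file_location, 'r') as in_file:
                        values = []
                        for line in in_file:
                            # split on any whitespace so trailing spaces or newlines don't produce empty tokens
                            parts = line.split()
                            values.extend([int(p) for p in parts])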
                        writer.writerow(values)

## create csv files from the static motif distributions

In [8]:
input_dir = 'data/motif_distribution/static'
output_dir = 'data/csv/static'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            read_from_dir = '{0}/{1}/{2}/{3}'.format(input_dir, dataset, la, subnet)
            write_to_dir = '{0}/{1}/{2}'.format(output_dir, dataset, la)
            output_file_location = '{0}/{1}.csv'.format(write_to_dir, subnet)
            
            with open (output_file_location, 'w', newline='') as out_file:
                writer = csv.writer(out_file)
                
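                # os.listdir returns files in arbitrary order; sort so that rows follow the zero-padded (chronological) file names
                for input_file in sorted(os.listdir(read_from_dir)):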
                    input_file_location = '{0}/{1}'.format(read_from_dir, input_file)
                    df = pd.read_csv(input_file_location, sep='\t')
                    values = list(df['Count'])
                    writer.writerow(values)

## create merged csv files

In [12]:
input_dir = 'data/csv'
output_dir = 'data/csv/merged'
dataset_subnets = {'email': ['dept_1', 'dept_2', 'dept_3', 'dept_4'], 'sx': ['superuser', 'askubuntu', 'mathoverflow', 'stackoverflow']}
look_around_values = ['la_0', 'la_1']

In [None]:
for dataset, subnets in dataset_subnets.items():
    for subnet in subnets:
        print('Processing: {0}\t[{1}]'.format(subnet, datetime.now()))
        for la in look_around_values:
            df_static_loc = '{0}/static/{1}/{2}/{3}.csv'.format(input_dir, dataset, la, subnet)
            df_temporal_loc = '{0}/temporal/{1}/{2}/{3}.csv'.format(input_dir, dataset, la, subnet)
            write_to_loc = '{0}/{1}/{2}/{3}.csv'.format(output_dir, dataset, la, subnet)
            
            df_static = pd.read_csv(df_static_loc, header=None)
            df_static.columns = ['s-'+str(c) for c in df_static.columns] # rename columns so that the join doesn't have conflicts
            
            df_temporal = pd.read_csv(df_temporal_loc, header=None)
            df_temporal.columns = ['t-'+str(c) for c in df_temporal.columns]
            
            df = df_static.join(df_temporal)
            df.to_csv(write_to_loc, index=False, header=False)
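
The merge relies on `DataFrame.join`, which aligns the two tables on their row index. The toy example below uses invented values and shows why the columns are prefixed first, and why row i of both files has to describe the same subgraph.

```python
import pandas as pd

# Two toy feature tables with the same number of rows (one row per subgraph).
df_static = pd.DataFrame([[1, 0], [2, 5]])
df_temporal = pd.DataFrame([[7, 8, 9], [0, 0, 3]])

# Both tables use integer column labels 0, 1, ...; prefix them so the join has no overlap.
df_static.columns = ['s-' + str(c) for c in df_static.columns]
df_temporal.columns = ['t-' + str(c) for c in df_temporal.columns]

# join() aligns on the row index, so row i of both tables must describe the same subgraph.
merged = df_static.join(df_temporal)
print(merged)
```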

## train/test split
We will partition the dataset in an 80-20 split, such that we train on past data and test on future data.

In [15]:
input_dir = 'data/csv'
output_dir = 'data/datasets'
types = ['static', 'temporal', 'merged']
datasets = ['email', 'sx']
look_around_values = ['la_0', 'la_1']

In [None]:
for t in types:
    for dataset in datasets:
        print('Processing: {0}\t{1}\t[{2}]'.format(t, dataset, datetime.now()))
        for la in look_around_values:
            read_from_dir = '{0}/{1}/{2}/{3}'.format(input_dir, t, dataset, la)
            write_to_dir = '{0}/{1}/{2}/{3}'.format(output_dir, t, dataset, la)

            output_train_location = '{0}/train.csv'.format(write_to_dir)
            output_test_location = '{0}/test.csv'.format(write_to_dir)

            train = []
            test = []

            for input_file in os.listdir(read_from_dir):
                input_file_location = '{0}/{1}'.format(read_from_dir, input_file)
                parts = input_file.split(r'.')
                target_class = parts[0]

                df = pd.read_csv(input_file_location, header=None)
                df['class'] = target_class
                values = df.values
                split = int(0.8 * len(values))
                train.append(values[:split])
                test.append(values[split:])

            df_train = pd.DataFrame(data=np.concatenate(train, axis=0))
            df_test = pd.DataFrame(data=np.concatenate(test, axis=0))

            df_train.to_csv(output_train_location, index=False, header=False)
            df_test.to_csv(output_test_location, index=False, header=False)
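
The split itself is a plain slice and only separates past from future if the rows are in chronological order. A minimal sketch on toy data follows; `chronological_split` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def chronological_split(rows, train_fraction=0.8):
    """Split rows, assumed to be in chronological order, into (past, future)."""
    split = int(train_fraction * len(rows))
    return rows[:split], rows[split:]

# Toy data: 10 samples whose single feature equals their position in time.
samples = np.arange(10).reshape(-1, 1)
train, test = chronological_split(samples)

print(train.ravel(), test.ravel())
```

Every training sample here precedes every test sample, which is the property the 80-20 split above is meant to preserve.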