**Warning**: *Make sure to run `sampling.ipynb` before this notebook to generate the text files for analysis!*

## Load the samples generated using `sampling.ipynb`

In [9]:
import json
import pandas as pd
import os
import numpy as np

# Constants

TAACO_DIR = '../taaco/'
INPUT_DIR = TAACO_DIR + 'inputs/'
AI_INPUT_DIR = INPUT_DIR + 'generation/'
HUMAN_INPUT_DIR = INPUT_DIR + 'init/'
OUTPUT_DIR = TAACO_DIR + 'outputs/'
SAMPLE_FILE = TAACO_DIR + 'sampled_files.json'


In [10]:
ids = json.load(open(SAMPLE_FILE))['ids']
ids


[8619053,
 8719964,
 8703870,
 8751867,
 8758842,
 8713072,
 8766966,
 8711296,
 8753671,
 8720643,
 8730286,
 8722369,
 8712277,
 8611704,
 8758092,
 8607594,
 8751308,
 8761678,
 8615550,
 8713231,
 8703389,
 8768168,
 8716039,
 8724403,
 8756363,
 8756276,
 8712765,
 8712365,
 8754213,
 8762102,
 8725482,
 8727940,
 8600533,
 8618217,
 8712415,
 8607908,
 8704102,
 8724163,
 8724326,
 8603371,
 8715734,
 8724478,
 8725612,
 8759894,
 8765711,
 8700257,
 8765942,
 8615600,
 8764328,
 8725725,
 8612834,
 8605770,
 8754281,
 8761593,
 8601176,
 8711776,
 8708199,
 8603287,
 8765808,
 8718642,
 8614238,
 8609974,
 8603169,
 8616269,
 8756159,
 8761264,
 8755180,
 8613589,
 8726640,
 8612841,
 8720531,
 8758779,
 8725548,
 8756196,
 8600635,
 8708979,
 8752212,
 8618916,
 8726356,
 8604726,
 8712416,
 8761772,
 8728412,
 8708961,
 8724124,
 8757019,
 8765616,
 8709170,
 8708963,
 8715179,
 8718949,
 8613330,
 8759957,
 8720028,
 8616066,
 8759519,
 8754328,
 8727958,
 8619833,
 8766972,


In [11]:
ai_text = [ [ _id, open(os.path.join(AI_INPUT_DIR, str(_id) + '.txt')).read() ] for _id in ids ]
ai_text


[[8619053,
  'This paper presents a distributed adaptive consensus tracking control method for uncertain high-order nonlinear systems under directed graph condition. The proposed approach utilizes the backstepping technique, topology information, and Lyapunov methods to achieve consensus tracking control of multiple agents. The trajectory of the system is considered in the design of the controller to ensure that agents can converge to a common trajectory while following their own optimal paths. The uncertain parameters of the system are estimated online by adaptive control laws, which guarantee the convergence of the estimation errors. The topology of the control network is assumed to be partially known and the proposed approach is robust to changes in the communication topology. The efficacy of the proposed approach is demonstrated through simulations on a network of agents.'],
 [8719964,
  'The development of efficient activity recognition systems has become increasingly important in

In [12]:
human_text = [ [ _id, open(os.path.join(HUMAN_INPUT_DIR, str(_id) + '.txt'), encoding='utf-8').read() ] for _id in ids ]
human_text


[[8619053,
  'In this paper, we investigate the output consensus tracking problem for a class of high-order nonlinear systems subjected to unknown parameters and uncertain external disturbances. A novel backstepping based distributed adaptive control scheme is presented under the directed communication status. For the subsystems without direct access to time-varying desired trajectory, local estimators are introduced and the corresponding adaptive laws are designed in a totally distributed fashion. With the presented scheme, the assumption on linearly parameterized reference signal and the information exchange operation of subsystem inputs in the existing results are no longer needed. It is shown that all the closed-loop signals are globally uniformly bounded and desired output consensus tracking can be achieved.'],
 [8719964,
  "Human activity recognition has become an active research field over the past few years due to its wide application in various fields such as health-care, smar

## Analysis of metrics
---

Metrics analysed:
- [`nx.degree_centrality(G)`](/docs/metrics.md#degree-centrality) : Degree centrality is a measure of the importance of a node in a graph. \
    It is defined as the number of links incident upon a node (i.e., the number of ties that a node has) \
    Formula: `C_D(v) = d_v / (n - 1)`

- [`nx.eigenvector_centrality_numpy(G)`](/docs/metrics.md#eigenvector-centrality) : Eigenvector centrality is another measure of the importance of a node in a graph. \
    It is based on the idea that a node is important if it is connected to other important nodes. \
    Formula: `C_E(v) = (1 / λ) * ∑ C_E(u) ∀ (u,v) ∈ E`

- [`nx.closeness_centrality(G)`](/docs/metrics.md#closeness-centrality) : Closeness centrality is a measure of centrality in a network, which is calculated as the sum of the length of   the shortest paths between a node and all other nodes in the graph. \
    In other words, it measures how close a node is to all other nodes in the network.

- [`nx.closeness_vitality(G)`](/docs/metrics.md#closeness-vitality) : Closeness vitality is a measure of the importance of a node in a graph. \
    It is defined as the change in the sum of distances between all node pairs when excluding that node. \
    Formula: `C_V(v) = W - W(v)`

- [`nx.betweenness_centrality(G)`](/docs/metrics.md#betweenness-centrality) : Betweenness centrality is a measure of centrality in a network, which is calculated based on shortest paths between nodes. \
It measures the extent to which a node lies on the shortest path between other nodes in the network. \
Formula: `C_B​(v) = ∑ s != v != t σ_st(v)/σ_st`

- [`nx.edge_load_centrality(G)`](/docs/metrics.md#edge-load-centrality) : Edge load centrality is a concept in graph theory that measures the importance of an edge based on the number of shortest paths that pass through it. \
Formula: `C_L​(e) = ∑ s != t ∊ V σ_st(e)/σ_st`

- [`nx.all_pairs_node_connectivity(G)`](/docs/metrics.md#all-pairs-node-connectivity) : All pairs node connectivity is a concept in graph theory that measures the minimum number of nodes that must be removed to disconnect all pairs of nodes in a graph. It is also known as global node connectivity.

- [`nx.average_neighbor_degree(G)`](/docs/metrics.md#average-neighbor-degree) : Average neighbor degree is a concept in graph theory that measures the average degree of the neighborhood of each node in a graph. It is defined as the average degree of the nodes that are connected to a given node by an edge. 

- [`nx.average_degree_connectivity(G)`](/docs/metrics.md#average-degree-connectivity) : The average degree connectivity in a graph is the average nearest neighbor degree of nodes with degree k.

- [`nx.clustering(G)`](/docs/metrics.md#clustering) : Clustering in graph theory is a measure of the degree to which nodes in a graph tend to cluster together. \
    The clustering coefficient is a measure of the density of triangles in a graph. \
    The global clustering coefficient is based on triplets of nodes, while the local clustering coefficient is based on the neighborhood of a vertex. 

- [`nx.degree_assortativity_coefficient(G)`](/docs/metrics.md#degree-assortativity-coefficient) : The degree assortativity coefficient is a measure of the similarity of connections in the graph with respect to the node degree. \
It is a Pearson correlation coefficient of node degree between pairs of connected nodes, ranging from -1 to 1. \
A positive assortativity coefficient indicates that nodes tend to connect to other nodes with the same or similar degree, while a negative assortativity coefficient indicates that nodes tend to connect to nodes with different degrees.

- [`nx.degree_pearson_correlation_coefficient(G)`](/docs/metrics.md#degree-pearson-correlation-coefficient) : The degree Pearson correlation coefficient in a graph, also known as the degree assortativity coefficient, is a measure of the correlation between the degrees of all pairs of nodes connected by an edge.

- [`nx.average_clustering(G)`](/docs/metrics.md#average-clustering) : The average clustering coefficient of a graph is a measure of the degree to which nodes in a graph tend to cluster together. \
It is defined as the average of the local clustering coefficients of all nodes in the graph. \
Formula: `C_avg L = 1/|V| . ∑ C_L(u)`

- [`nx.check_planarity(G)[0]`](/docs/metrics.md#planarity) : Planarity in graphs refers to the property of a graph that can be drawn on a plane without any edges crossing each other.\
Formula: `v - e + f = 2`

For more details, you can peruse [`metrics.md`](/docs/metrics.md).


### Ready sentences for graphical analysis

Functions to convert abstract into a graph and visualise it.

In [13]:
from itertools import combinations, chain
import networkx as nx
import networkit as nk
import matplotlib.pyplot as plt
import string

def sentences_to_graph(text):
    sentences = text.replace('-', ' ').split('.')
    list_of_words = [ [ word.strip(string.punctuation).lower() for word in sentence.split(' ') if word != '' ] for sentence in sentences ]
    print(list_of_words)
    links_per_sentence = list([ list(combinations(sentence, 2)) for sentence in list_of_words ])
    links = list(chain(*links_per_sentence))
    nodes = set(chain(*list_of_words)) - set([''])

    G = nx.Graph()
    G.add_nodes_from(nodes)
    G.add_edges_from(links)
    G.remove_edges_from(nx.selfloop_edges(G))
    G_nk = nk.nxadapter.nx2nk(G)
    idmap = dict((u, id) for (id, u) in zip(G.nodes(), range(G.number_of_nodes())))
    return G, G_nk, idmap

def plot_graph(G, title='Word Network'):
    plt.figure(figsize=(20,10))
    nx.draw(G, with_labels=True, font_weight='bold')
    plt.title(title, fontsize=20)
    plt.show()


In [14]:
for _, abstract in ai_text:
    # split sentences by period and hyphen
    sentences = abstract.replace('-', ' ').split('.')
    list_of_words = [ [ word.strip(string.punctuation).lower() for word in sentence.split(' ') if word != '' ] for sentence in sentences ]
    print(list_of_words)

for _, abstract in human_text:
    # split sentences by period and hyphen
    sentences = abstract.replace('-', ' ').split('.')
    list_of_words = [ [ word.strip(string.punctuation).lower() for word in sentence.split(' ') if word != '' ] for sentence in sentences ]
    print(list_of_words)


[['this', 'paper', 'presents', 'a', 'distributed', 'adaptive', 'consensus', 'tracking', 'control', 'method', 'for', 'uncertain', 'high', 'order', 'nonlinear', 'systems', 'under', 'directed', 'graph', 'condition'], ['the', 'proposed', 'approach', 'utilizes', 'the', 'backstepping', 'technique', 'topology', 'information', 'and', 'lyapunov', 'methods', 'to', 'achieve', 'consensus', 'tracking', 'control', 'of', 'multiple', 'agents'], ['the', 'trajectory', 'of', 'the', 'system', 'is', 'considered', 'in', 'the', 'design', 'of', 'the', 'controller', 'to', 'ensure', 'that', 'agents', 'can', 'converge', 'to', 'a', 'common', 'trajectory', 'while', 'following', 'their', 'own', 'optimal', 'paths'], ['the', 'uncertain', 'parameters', 'of', 'the', 'system', 'are', 'estimated', 'online', 'by', 'adaptive', 'control', 'laws', 'which', 'guarantee', 'the', 'convergence', 'of', 'the', 'estimation', 'errors'], ['the', 'topology', 'of', 'the', 'control', 'network', 'is', 'assumed', 'to', 'be', 'partially', '

## Calculate and store all the metrics analysed above into a dataframe
---

In [15]:
def stats(arr):
    return {
        'min': np.min(arr), 
        '1qr': np.percentile(arr, 25), 
        'median': np.median(list(arr)), 
        '3qr': np.percentile(arr, 75), 
        'max': np.max(arr),
        'avg': np.mean(arr)
    }


In [16]:
nk.setNumberOfThreads(10)
nk.getMaxNumberOfThreads()


10

### Sample example for one text

In [17]:
import random
import operator

from networkit.graphtools import GraphTools

def get_density(G):
  return GraphTools.density(G)

def get_degree_centrality(G):
  deg = nk.centrality.DegreeCentrality(G)
  deg.run()
  return (deg.scores())

def get_eig_centrality(G):
  deg = nk.centrality.EigenvectorCentrality(G)
  deg.run()
  return (deg.scores())

def get_pagerank(G):
  deg = nk.centrality.PageRank(G)
  deg.run()
  return (deg.scores())

def get_btw_centrality(G):
  deg = nk.centrality.ApproxBetweenness(G)
  deg.run()
  return (deg.scores())

def get_cls_centrality(G):
  deg = nk.centrality.ApproxBetweenness(G)
  deg.run()
  return (deg.scores())

def get_size(G):
  return GraphTools.size(G)

def get_max_cliques(G):
  mc = nk.clique.MaximalCliques(G)
  mc.run()
  return mc.getCliques()

def get_isolated_nodes(degrees):
		sequence = sorted(degrees)
		i = 0
		nIsolated = 0
		while i < len(sequence) and sequence[i] == 0:
			nIsolated += 1
			i += 1
		return nIsolated

def get_connected_components(G):
  cc = nk.components.ConnectedComponents(G)
  scc = nk.components.StronglyConnectedComponents(G)
  cc.run()
  scc.run()
  return cc.numberOfComponents(), scc.numberOfComponents()    

def get_globals(G):
  nodes, edges = get_size(G)
  cc, scc = get_connected_components(G)

  return {
    'nodes': nodes,
    'edges': edges,
    'density': get_density(G),
    'isolated_nodes': get_isolated_nodes(get_degree_centrality(G)),
    'core_number': max(nx.core_number(nk.nxadapter.nk2nx(G)).items(), key = operator.itemgetter(1))[1],
    'global_cc': nk.globals.ClusteringCoefficient.exactGlobal(G),
    'approx_avg_local_cc': nk.globals.clustering(G),
    'max_cliques': len(get_max_cliques(G)),
    'connected_components': cc,
    'strongly_connected_components': scc,
    'degree_assortativity_coefficient': nx.degree_assortativity_coefficient(nk.nxadapter.nk2nx(G)),
    'degree_pearson_correlation_coefficient': nx.degree_pearson_correlation_coefficient(nk.nxadapter.nk2nx(G))
  }

def get_overview(G):
  nk.overview(G)
  plot_degree_distribution(nk.nxadapter.nk2nx(G))

def plot_degree_distribution(G):
  degree_sequence = sorted((d for n, d in G.degree()), reverse=True)
  dmax = max(degree_sequence)

  fig = plt.figure("Degree of a random graph", figsize=(8, 8))
  # Create a gridspec for adding subplots of different sizes
  axgrid = fig.add_gridspec(5, 4)

  ax0 = fig.add_subplot(axgrid[0:3, :])
  Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])
  pos = nx.spring_layout(Gcc, seed=10396953)
  nx.draw_networkx_nodes(Gcc, pos, ax=ax0, node_size=20)
  nx.draw_networkx_edges(Gcc, pos, ax=ax0, alpha=0.4)
  ax0.set_title("Connected components of G")
  ax0.set_axis_off()

  ax1 = fig.add_subplot(axgrid[3:, :2])
  ax1.plot(degree_sequence, "b-", marker="o")
  ax1.set_title("Degree Rank Plot")
  ax1.set_ylabel("Degree")
  ax1.set_xlabel("Rank")

  ax2 = fig.add_subplot(axgrid[3:, 2:])
  ax2.bar(*np.unique(degree_sequence, return_counts=True))
  ax2.set_title("Degree histogram")
  ax2.set_xlabel("Degree")
  ax2.set_ylabel("# of Nodes")

  fig.tight_layout()
  plt.show()



In [18]:
def prefaced_dict(d, prefix):
    return { prefix + str(key): value for key, value in d.items() }

_id = 8607594
get_overview(G_AI_nk)

Gcc = sorted(nx.connected_components(G_AI), key=len, reverse=True)
G0 = G_AI.subgraph(Gcc[0])

info = {
    'id': _id,
    **prefaced_dict(get_globals(G_AI_nk), ''),
    'avg_shortest_path_length': nx.average_shortest_path_length(G0),
    'avg_neighbour_degree': np.mean(list((nx.average_neighbor_degree(G_AI).values()))),
    'avg_degree_connectivity': np.mean(list((nx.average_degree_connectivity(G_AI).values()))),
    **prefaced_dict(stats(get_degree_centrality(G_AI_nk)), 'degree_centrality_'),
    **prefaced_dict(stats(get_eig_centrality(G_AI_nk)), 'eigenvector_centrality_'),
    **prefaced_dict(stats(get_pagerank(G_AI_nk)), 'pagerank_'),
    **prefaced_dict(stats(get_btw_centrality(G_AI_nk)), 'betweenness_centrality_'),
    **prefaced_dict(stats(get_cls_centrality(G_AI_nk)), 'closeness_centrality_'),
    'label': 'ai'
}
info


NameError: name 'G_AI_nk' is not defined

In [19]:
len(info)


NameError: name 'info' is not defined

### Find above metrics for all sampled abstracts

In [20]:
total_dataset = []

for _id, abstract in ai_text:
    G_AI, G_AI_nk, idmap_ai = sentences_to_graph(abstract)
    Gcc = sorted(nx.connected_components(G_AI), key=len, reverse=True)
    G0 = G_AI.subgraph(Gcc[0])
    total_dataset.append({
        'id': _id,
        **prefaced_dict(get_globals(G_AI_nk), ''),
        'avg_shortest_path_length': nx.average_shortest_path_length(G0) if nx.is_connected(G0) else 0,
        'avg_neighbour_degree': np.mean(list((nx.average_neighbor_degree(G_AI).values()))),
        'avg_degree_connectivity': np.mean(list((nx.average_degree_connectivity(G_AI).values()))),
        **prefaced_dict(stats(get_degree_centrality(G_AI_nk)), 'degree_centrality_'),
        **prefaced_dict(stats(get_eig_centrality(G_AI_nk)), 'eigenvector_centrality_'),
        **prefaced_dict(stats(get_pagerank(G_AI_nk)), 'pagerank_'),
        **prefaced_dict(stats(get_btw_centrality(G_AI_nk)), 'betweenness_centrality_'),
        **prefaced_dict(stats(get_cls_centrality(G_AI_nk)), 'closeness_centrality_'),
        'check_planarity': nx.check_planarity(G_AI)[0],
        'label': 'ai'
    })

for _id, abstract in human_text:
    G_Human, G_Human_nk, idmap_human = sentences_to_graph(abstract)
    Gcc = sorted(nx.connected_components(G_Human), key=len, reverse=True)
    G0 = G_Human.subgraph(Gcc[0])
    total_dataset.append({
        'id': _id,
        **prefaced_dict(get_globals(G_Human_nk), ''),
        'avg_shortest_path_length': nx.average_shortest_path_length(G0) if nx.is_connected(G0) else 0,
        'avg_neighbour_degree': np.mean(list((nx.average_neighbor_degree(G_Human).values()))),
        'avg_degree_connectivity': np.mean(list((nx.average_degree_connectivity(G_Human).values()))),
        **prefaced_dict(stats(get_degree_centrality(G_Human_nk)), 'degree_centrality_'),
        **prefaced_dict(stats(get_eig_centrality(G_Human_nk)), 'eigenvector_centrality_'),
        **prefaced_dict(stats(get_pagerank(G_Human_nk)), 'pagerank_'),
        **prefaced_dict(stats(get_btw_centrality(G_Human_nk)), 'betweenness_centrality_'),
        **prefaced_dict(stats(get_cls_centrality(G_Human_nk)), 'closeness_centrality_'),
        'check_planarity': nx.check_planarity(G_Human)[0],
        'label': 'human'
    })

total_df = pd.DataFrame(total_dataset)
total_df


[['this', 'paper', 'presents', 'a', 'distributed', 'adaptive', 'consensus', 'tracking', 'control', 'method', 'for', 'uncertain', 'high', 'order', 'nonlinear', 'systems', 'under', 'directed', 'graph', 'condition'], ['the', 'proposed', 'approach', 'utilizes', 'the', 'backstepping', 'technique', 'topology', 'information', 'and', 'lyapunov', 'methods', 'to', 'achieve', 'consensus', 'tracking', 'control', 'of', 'multiple', 'agents'], ['the', 'trajectory', 'of', 'the', 'system', 'is', 'considered', 'in', 'the', 'design', 'of', 'the', 'controller', 'to', 'ensure', 'that', 'agents', 'can', 'converge', 'to', 'a', 'common', 'trajectory', 'while', 'following', 'their', 'own', 'optimal', 'paths'], ['the', 'uncertain', 'parameters', 'of', 'the', 'system', 'are', 'estimated', 'online', 'by', 'adaptive', 'control', 'laws', 'which', 'guarantee', 'the', 'convergence', 'of', 'the', 'estimation', 'errors'], ['the', 'topology', 'of', 'the', 'control', 'network', 'is', 'assumed', 'to', 'be', 'partially', '

Unnamed: 0,id,nodes,edges,density,isolated_nodes,core_number,global_cc,approx_avg_local_cc,max_cliques,connected_components,...,betweenness_centrality_max,betweenness_centrality_avg,closeness_centrality_min,closeness_centrality_1qr,closeness_centrality_median,closeness_centrality_3qr,closeness_centrality_max,closeness_centrality_avg,check_planarity,label
0,8619053,78,905,0.301365,0,22,0.682941,0.894418,12,1,...,0.121950,0.008936,0.0,0.0,0.0,0.000000,0.126807,0.009004,False,ai
1,8719964,78,1014,0.337662,0,23,0.644529,0.884952,16,1,...,0.132920,0.008515,0.0,0.0,0.0,0.002998,0.131943,0.008522,False,ai
2,8703870,71,932,0.375050,0,25,0.714891,0.902158,7,1,...,0.148027,0.008803,0.0,0.0,0.0,0.001464,0.147678,0.008792,False,ai
3,8751867,71,883,0.355332,0,22,0.687805,0.889368,9,1,...,0.104146,0.009057,0.0,0.0,0.0,0.004335,0.102705,0.009083,False,ai
4,8758842,137,1964,0.210820,0,29,0.598396,0.892383,56,2,...,0.188770,0.005653,0.0,0.0,0.0,0.000000,0.186469,0.005619,False,ai
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,8723523,96,1207,0.264693,0,34,0.747049,0.947891,9,1,...,0.193186,0.008310,0.0,0.0,0.0,0.000000,0.193790,0.008264,False,human
1996,8616587,139,2153,0.224481,0,32,0.568339,0.886476,47,1,...,0.150328,0.005585,0.0,0.0,0.0,0.000000,0.148329,0.005584,False,human
1997,8704408,112,1434,0.230695,0,26,0.640858,0.908929,14,1,...,0.197323,0.007166,0.0,0.0,0.0,0.000000,0.196277,0.007177,False,human
1998,8612755,84,1723,0.494263,0,41,0.809736,0.922427,6,1,...,0.095988,0.005975,0.0,0.0,0.0,0.000000,0.098080,0.006076,False,human


### Write the resultant dataframe to `/data` directory

In [152]:
DATASET_PATH = '../data/'
total_df.to_csv(DATASET_PATH + 'topological_analysis-modified.csv')
total_df.to_json(DATASET_PATH + 'topological_analysis-modified.json')
