<h2><i>kiara</i>: Network Analysis</h2>

Welcome back! Now that we're comfortable with what <i>kiara</i> looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with <b>Network Analysis</b>.

<h2>Why Network Analysis?</h2>

Network Analysis offers a computational and quantitative means to examine and explore relational objects, with proxies to measure structural roles and concepts such as power and influence. Doing so digitally - and at scale - also allows us to ask these kinds of questions of large amounts of material or documents that were not previously manageable with qualitative or manual approaches.

>INSERT REFERENCES OR WIDER REFERENCING HERE: PROGRAMMING HISTORIAN LINKS W/ BIBLIOGRAPHY SECTION BELOW

<h3>Getting Started</h3>

Let's start by double checking that we have all the required plugins and setting up an API for us to use <i>kiara</i>. We'll do this all in one go this time, but if you're unsure, feel free to head back to the <a href="http://dharpa.org/kiara.documentation/latest/workshop/workshop/">installation notebook</a> to look over this section again.

In [18]:
import csv
import networkx as nx
import matplotlib.pyplot as plt

try:
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    import sys
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

ensure_kiara_plugins()

from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()

Great, we're all set up. We're going to import some data again like in the first notebook, but this time we're going to use a local file, using the kiara function `import.local.file`. We're using sample data again here, but you can also use this function to import your own data in the future. 

The data we're using here is a sample taken from the <b>correspSearch</b> dataset collated by the Berlin-Brandenburg Academy of Sciences and available on the 'LetterSampo' portal created by the Reassembling the Republic of Letters project team. For more on these projects see <a href="https://seco.cs.aalto.fi/projects/rrl/">here</a>.

A continually growing 'crowd-sourced' dataset, the correspSearch collection features letters spanning a broad range of German history, contributed by various individuals and larger research groups. Using quantitative network analysis on this dataset might offer insights into the most prolific writer in the dataset so far, which actor connected the most people, or who operated in closely knit writing groups. Although we can also use network analysis to explore or 'map' our datasets if we don't know much about them, in this notebook the research questions and module parameters have been built around and defined by the information we already have about the data. It's important that we acknowledge this now as a core factor in the decisions already made for this process, but we'll also return to this throughout the notebook.

Let's use the <i>kiara</i> function `import.local.file` to access our dataset, specifying the path to the csv file in our <span style="color:green">inputs</span> and saving the <span style="color:red">outputs</span> of the function as '<b>correspsearch</b>'.

In [19]:
correspsearch = kiara.run_job('import.local.file', inputs={'path':'correspSearch.csv'})
correspsearch

<h2>Creating a Network</h2>

Time to make our network from this data. Let's start by searching for the <i>kiara</i> modules that are included in the `kiara_plugin.network_analysis` package.

In [20]:
infos = kiara.retrieve_operations_info()
operations = {}
for op_id, info in infos.item_infos.items():
    if info.context.labels.get("package", None) == "kiara_plugin.network_analysis":
        operations[op_id] = info

print(operations.keys())

dict_keys(['create.network_data.from.tables', 'export.network_data.as.csv_files', 'export.network_data.as.graphml_file', 'export.network_data.as.sql_dump', 'export.network_data.as.sqlite_db'])


There are lots of options for analysis, but we want to make our network first. Let's have a look at what we need for the function `create.network_data.from.tables`, using `kiara.retrieve_operation_info` once more.

<span style="color:blue">this will make more sense/have more options when analysis modules are incorporated into the network plugin (lol)</span>

In [21]:
kiara.retrieve_operation_info('create.network_data.from.tables')

Like other network analysis tools, <i>kiara</i> first needs the data as an edge table. This means we have to transform the csv file we imported earlier into a table before we can create the network data. Let's start by using the `create.table.from.file` function that we used in the first notebook and storing the result as our <b>edges</b>, then use this to create our network data with the `create.network_data.from.tables` function that we just read about. In doing so, we define two different sets of <span style="color:green">inputs</span>, overwriting the first once we have used it to create our table.

If we want, we can also import a separate table with the nodes in, but this is optional, and for the moment let's stick with just the edge table. We'll store this again at the end in the variable <b>correspsearch</b> for us to use again in a bit.
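To make the idea of an edge table concrete, here is a minimal sketch in plain Python, independent of <i>kiara</i>. The three rows and the `Source`/`Target` column headers are made up for illustration and are not taken from the correspSearch file. The sketch only shows that each row is one edge, and that a node list can be derived from the edge rows alone when no separate node table is supplied.

```python
import csv
import io

# Hypothetical edge table: one row per letter, sender first, recipient second.
SAMPLE_CSV = """Source,Target
A,B
A,C
B,C
"""


def read_edges(text):
    """Parse CSV text into a list of (source, target) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Source"], row["Target"]) for row in reader]


def nodes_from_edges(edge_rows):
    """Collect every name that appears as a source or a target."""
    return sorted({name for edge in edge_rows for name in edge})


sample_edges = read_edges(SAMPLE_CSV)
sample_nodes = nodes_from_edges(sample_edges)

print(sample_edges)
print(sample_nodes)
```

This is not how <i>kiara</i> stores its tables internally; it is only meant to show the shape of the data the module expects.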

<span style="color:blue">also include information on the f0/f1 problem, if this is not fixed by then - we'll also need to write in dropping the first row (i.e. 'Source' and 'Target') if this is the case. Otherwise, write instructions for identifying source/target columns, alongside more information on the rest of the possible inputs (so that it makes sense for the lesmis info later)</span>

In [22]:
inputs = {
    "file": correspsearch['file']
}

outputs = kiara.run_job('create.table.from.file', inputs=inputs)

edges = outputs['table']

inputs = {
    'edges': edges,
    'source_column_name': 'f0',
    'target_column_name': 'f1'
}

correspsearch = kiara.run_job('create.network_data.from.tables', inputs=inputs)
correspsearch

Great - this has made a <i>kiara</i> network data object, and the output is showing the edge table and node table for the network. As we didn't give it a node table to start with, it has extracted the information for the nodes from the edges instead.

As we can see, some of the edges are listed more than once, where more than one letter was written from one person to another. There's obviously more to the network than just a list of edges and nodes, so let's find out some more information about our network object.
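As a rough illustration of what these repeated edges mean, the sketch below counts how often each sender and recipient pair occurs in a small edge list. The rows are invented placeholders rather than real correspSearch records, and the counting uses Python's standard library, not a <i>kiara</i> module.

```python
from collections import Counter

# Hypothetical edge rows (sender, recipient): A wrote to B three times.
letters = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "B"), ("C", "A")]

# Count how often each directed (sender, recipient) pair occurs.
letters_per_pair = Counter(letters)

for (sender, recipient), count in letters_per_pair.most_common():
    print(f"{sender} -> {recipient}: {count} letter(s)")
```

Each repeated row is a parallel edge, and the count per pair is the kind of value that can later serve as an edge weight.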

<h2>Network Data</h2>

Let's start by having a look at the information for our network using the `get_property_data` function. As we're querying the `network_data` part of our network object, we'll save this as <b>correspsearch</b> for the moment.

In [23]:
correspsearch = correspsearch['network_data']
correspsearch.get_property_data('metadata.graph_properties')

Doing this gives us the total number of nodes, but also gives us an idea of the different kinds of graph we might choose to use for this dataset - <b>Directed</b>, <b>Undirected</b>, <b>Multi-Directed</b>, and <b>Multi-Undirected</b>. Having this kind of information accessible means we can make more informed decisions about the next steps that might work with our research or digital analysis, especially those that are sometimes automated for us.
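One way to see why the four graph types differ is to count edges under each interpretation of the same rows. The sketch below is plain Python with invented rows. It assumes that the 'multi' variants keep every row as its own edge, that the simple variants merge repeats, and that the undirected variants ignore who wrote to whom. The figures <i>kiara</i> reports may be computed differently.

```python
# Hypothetical edge rows (sender, recipient), including repeats.
rows = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C")]


def edge_counts(edge_rows):
    """Count edges under four common graph interpretations."""
    # Simple directed graph: repeated (sender, recipient) pairs collapse.
    directed_pairs = set(edge_rows)
    # Simple undirected graph: direction is dropped, so A->B equals B->A.
    undirected_pairs = {frozenset(pair) for pair in edge_rows}
    return {
        "multi_directed": len(edge_rows),    # every row is its own edge
        "multi_undirected": len(edge_rows),  # same rows, direction ignored
        "directed": len(directed_pairs),
        "undirected": len(undirected_pairs),
    }


counts = edge_counts(rows)
print(counts)
```

Choosing a graph type therefore changes how many edges there are to analyse, which is why it helps to have all four reported side by side.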

Let's get some more information about the network as a whole then, using the `network_data.extract_largest_component` function. This works out how many distinct components there are in a network, and also gives us the largest component on its own. We'll have a quick look at how it works first.
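For readers new to the idea of components, the following sketch finds them with a breadth-first search in plain Python. It is an illustration under stated assumptions, not the module's implementation: the rows are invented, and edge direction is ignored when deciding whether two people are connected.

```python
from collections import defaultdict, deque

# Hypothetical edge rows: A-B-C form one group, D-E form another.
rows = [("A", "B"), ("B", "C"), ("D", "E")]


def connected_components(edge_rows):
    """Return a list of node sets, one per connected component."""
    neighbours = defaultdict(set)
    for source, target in edge_rows:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return components


components = connected_components(rows)
largest = max(components, key=len)

print(len(components), sorted(largest))
```

The largest component is simply the biggest of these node sets; restricting later measures to it avoids comparing people who have no path between them.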

In [8]:
kiara.retrieve_operation_info('network_data.extract_largest_component')

In [9]:
output = kiara.run_job('network_data.extract_largest_component', inputs={'network_data':correspsearch})
output

FailedJobException: Job failed.

'NetworkData' object has no attribute 'clone'

For now, let's save our largest component in the variable `network_data` for later - we'll use this for the rest of our experiments rather than the full network, and make sure we're tracing this using <i>kiara</i>.

Let's have a look at the information for this largest component, using our `get_property_data` function again.

In [13]:
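# The job above currently fails, so the largest component is not available yet.
# Once it runs, uncomment these two lines:
# network_data = output['largest_component']
# network_data.get_property_data('metadata.graph_properties')

# For now, carry on with the full network instead.
network_data = correspsearch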

<h2>Onboarding Data: An Alternative</h2>

So far then, we have created a network object in <i>kiara</i> by importing a csv from a local path.

But what about other formats? Let's pause quickly, and have a look at importing a <b>gml</b> file instead. 

Here we will use a different sample dataset, the <a href="http://www-personal.umich.edu/~mejn/netdata/">co-appearance network</a> of characters in Victor Hugo's novel <i>Les Miserables</i>, already in gml format.

Let's have a look at the function `onboard.gml_file` and how this will work for us.

In [10]:
kiara.retrieve_operation_info('onboard.gml_file')

We need a local file path again, and we can go ahead and save this as <b>lesmis</b>.

In [11]:
lesmis = kiara.run_job('onboard.gml_file', inputs={'path':'lesmis.gml'})
lesmis

As we can see, this module not only imports the gml file into <i>kiara</i> but automatically converts it into a <i>kiara</i> network object for us. Great!

Here we can see that the edge table has a 'value' column indicating edge weights, which has been automatically included with the gml data.

We'll leave this <i>Les Miserables</i> network for now, but it's useful to see this other option for importing data for networks. If you want to experiment with this dataset later, feel free to come back to it!

<h2>Network Analysis: Statistical Measures</h2>

Ok, let's head back to our correspSearch largest component dataset, stored in the variable <b>network_data</b>. We've already had a look at some graph-wide measures, so let's start looking at some node-specific measurements.

Let's start with degree, using `create.degree_rank_list`. This module allows us to calculate degree as both <b>undirected</b> and <b>weighted</b>. In this epistolary network, <b>undirected degree</b> counts the number of individual correspondents each person has, whereas <b>weighted degree</b> counts the total number of incoming and outgoing letters for each actor in the network. 
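The difference between the two measures can be sketched in a few lines of plain Python. The rows below are invented, and the code follows the definitions given above (distinct correspondents versus total letters sent and received) rather than reproducing how `create.degree_rank_list` is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical letters (sender, recipient); A and B exchange three in total.
letters = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]

correspondents = defaultdict(set)  # who each person is connected to
letter_totals = Counter()          # letters sent plus letters received

for sender, recipient in letters:
    correspondents[sender].add(recipient)
    correspondents[recipient].add(sender)
    letter_totals[sender] += 1
    letter_totals[recipient] += 1

# Undirected degree: number of distinct correspondents per person.
undirected_degree = {name: len(others) for name, others in correspondents.items()}
# Weighted degree: total incoming and outgoing letters per person.
weighted_degree = dict(letter_totals)

print(undirected_degree)
print(weighted_degree)
```

In this toy example B has only one correspondent but is involved in three letters, which is exactly the kind of contrast the two measures are meant to bring out.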

Let's use our `retrieve_operation_info` function to have a look at what we need to calculate these degrees.

In [12]:
kiara.retrieve_operation_info('create.degree_rank_list')

So we've already computed the largest component to use as the `network_data` input, and we want to calculate the weighted degree measures, so we'll leave the default as 'True'. Unlike the <i>Les Miserables</i> network, we don't have a pre-existing weight value for the edges, but we do know that there are parallel edges from multiple letters between correspondents, so we'll allow the module to aggregate the edges and set this as a weight.

In creating this module, assumptions have already been made that we are working with both a single node type and a single edge type network. Again, a lot of the parameters have been set based on what we already know about the dataset, but we also need to acknowledge this as an active decision that has been 'pre-made' as part of the research process.

The inputs for `create.degree_rank_list` prompt us to reflect on the decisions we are making as we go along, and to think about how our data fits into these kinds of measurements. By doing this in <i>kiara</i>, these inputs also allow us to <i>track</i> these decisions, as we will see more of later.

Let's give it a go then.

In [14]:
output = kiara.run_job('create.degree_rank_list', inputs={'network_data':network_data})
output

In [15]:
kiara.retrieve_operation_info('create.betweenness_rank_list')

In [16]:
centrality_network = output['centrality_network']

output = kiara.run_job('create.betweenness_rank_list', inputs={'network_data':centrality_network, 'weight_column_name':'value', 'weighted_betweenness':True})

output

FailedJobException: Job failed.

(sqlite3.IntegrityError) NOT NULL constraint failed: nodes.Weighted Betweenness Score
[SQL: INSERT INTO nodes (label, "Weighted Degree Score", "Degree Score", "Weighted Betweenness Score", "Betweenness Score", id) VALUES (?, ?, ?, ?, ?, ?)]
[parameters: [('Eisenstein, Gotthold', 11, 1, 0.0, 0.0, 'Eisenstein, Gotthold'), ('Regierungsrath', 1, 1, 0.0, 0.0, 'Regierungsrath'), ('Buquoy, Georg von', 3, 1, 0.0, 0.0, 'Buquoy, Georg von'), ('Poullet-Delisle, Antoine Ch. M.', 1, 1, 0.0, 0.0, 'Poullet-Delisle, Antoine Ch. M.'), ('Stimmel, Johann Gottlob', 1, 1, 0.0, 0.0, 'Stimmel, Johann Gottlob'), ('Schneider, Gerhard', 10, 1, 0.0, 0.0, 'Schneider, Gerhard'), ('Leser der Allgemeinen Literaturzeitung', 1, 1, 0.0, 0.0, 'Leser der Allgemeinen Literaturzeitung'), ('Leutsch, Ernst von', 1, 1, 0.0, 0.0, 'Leutsch, Ernst von')  ... displaying 10 of 1024 total bound parameter sets ...  ('Endter, Wolfgang', 1, 1, 0.0, 0.0, 'Endter, Wolfgang'), ('Stegmann', 1, 1, 0.0, 0.0, 'Stegmann')]]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

In [None]:
kiara.retrieve_operation_info('create.eigenvector_rank_list')

In [None]:
centrality_network = output['centrality_network']

output = kiara.run_job('create.eigenvector_rank_list', inputs={'network_data':centrality_network, 'weight_column_name':'value'})

output

In [None]:
kiara.retrieve_operation_info('compute.modularity_group')

In [None]:
network = output['centrality_network']

output = kiara.run_job('compute.modularity_group', inputs={'network_data':network})

output

In [None]:
kiara.retrieve_operation_info('create.cut_point_list')

In [None]:
centrality_network = output['centrality_network']

output = kiara.run_job('create.cut_point_list', inputs={'network_data':centrality_network})

output

In [None]:
output['centrality_network'].lineage

In [None]:
kiara.retrieve_operation_info('export.network_data.as.graphml_file')