<h2><i>kiara</i>: Network Analysis</h2>

Welcome back everyone! Now that we're comfortable with what <i>kiara</i> looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with <b>Network Analysis</b>.

<h2>Why Network Analysis?</h2>

Network Analysis offers a computational and quantitative means to examine and explore relational objects, with proxies to measure structural roles and concepts such as power and influence. Doing so digitally - and at scale - also allows us to consider these kinds of questions across amounts of material or documents that were not previously manageable with qualitative or manual approaches.

We won't get into any core network theories or their uses in the humanities here, as we're focused on the ways in which network analysis in <i>kiara</i> offers an interesting way to wrap the research process, and to think about the decisions we're making and how to trace them. If you're interested in learning more about network analysis, or how to code using <a href="https://networkx.org">NetworkX</a>, the library currently used in these <i>kiara</i> modules, check out our recommended reading at the bottom.

<h3>Getting Started</h3>
<br>Let's start by double checking that we have all the required plugins and setting up an API for us to use <i>kiara</i>. We'll do this all in one go this time, but if you're unsure, feel free to head back to the <a href="http://dharpa.org/kiara.documentation/latest/workshop/workshop/">installation notebook</a> to look over this section again.

In [1]:
import csv
import networkx as nx
import matplotlib.pyplot as plt

try:
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    import sys
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

ensure_kiara_plugins()

from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()

Great, we're all set up. We're going to import some data again like in the first notebook, but this time we're going to use a local file, using the kiara function `import.local.file`. We're using sample data again here, but you can also use this function to import your own data in the future. 

The data we're using here is a sample taken from the <b>Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic (CKCC)</b> dataset collated by the Huygens Institute in the Netherlands and available on the 'LetterSampo' portal created by the Reassembling the Republic of Letters project team. For more on these projects see <a href="https://seco.cs.aalto.fi/projects/rrl/">here</a>.

The CKCC collection features around 20,000 letters written by and to 17th century scholars in the Dutch republic. Using quantitative network analysis on this dataset might offer insights into the most prolific writer in the dataset, which actor connected the most people, or who operated in closely knit writing groups. Although we can also use network analysis to explore or 'map' our datasets if we don't know much about them, in this notebook the research questions and module parameters have been built around and defined by the information we already have about the data. It's important that we acknowledge this now as a core factor in the decisions already made for this process, but we'll also return to this throughout the notebook.

Let's use the <i>kiara</i> function `import.local.file` to access our dataset, specifying the path to the csv file in our <span style="color:green">inputs</span> and saving the <span style="color:red">outputs</span> of the function as '<b>CKCC</b>'.

In [2]:
CKCC = kiara.run_job('import.local.file', inputs={'path':'CKCC.csv'})
CKCC

kiara.model_classes plugin search message/error -> Could not load 'documentation': No module named 'kiara_plugin.documentation'
kiara.archive_type plugin search message/error -> Could not load 'documentation': No module named 'kiara_plugin.documentation'
kiara.modules plugin search message/error -> Could not load 'documentation': No module named 'kiara_plugin.documentation'

<h2>Creating a Network</h2>

Time to make our network from this data. Let's start by searching for the <i>kiara</i> modules that are included in the `kiara_plugin.network_analysis` package.

In [3]:
infos = kiara.retrieve_operations_info()
operations = {}
for op_id, info in infos.item_infos.items():
    if info.context.labels.get("package", None) == "kiara_plugin.network_analysis":
        operations[op_id] = info

print(operations.keys())

dict_keys(['create.network_data.from.tables', 'export.network_data.as.csv_files', 'export.network_data.as.graphml_file', 'export.network_data.as.sql_dump', 'export.network_data.as.sqlite_db', 'network_data.extract_largest_component'])


There are lots of options here, but we want to make our network first. Let's have a look at what the function `create.network_data.from.tables` needs, using `kiara.retrieve_operation_info` once more.

In [4]:
kiara.retrieve_operation_info('create.network_data.from.tables')

Like other network analysis tools, <i>kiara</i> first needs the data as an edge table. This means we first have to transform the csv file we imported earlier into a table before we can create the network data. Let's start by using the `create.table.from.file` function that we used in the first notebook and storing the result as our <b>edges</b>, then use this to create our network data with the `create.network_data.from.tables` function that we just read about. In doing so, we define two different sets of <span style="color:green">inputs</span>, overwriting the first variable once we have used it to create our table.

If we want, we can also import a separate table with the nodes in, but this is optional, and for the moment let's stick with just the edge table. We'll store this again at the end in the variable <b>CKCC</b> for us to use again in a bit.

<span style="color:blue">also include information on the f0/f1 problem, if this is not fixed by then - we'll also need to write in dropping the first row (i.e. 'Source' and 'Target') if this is the case. Otherwise, write instructions for identifying source/target columns, alongside more information on the rest of the possible inputs (so that it makes sense for the lesmis info later)</span>

In [5]:
inputs = {
    "file": CKCC['file']
}

outputs = kiara.run_job('create.table.from.file', inputs=inputs)

edges = outputs['table']

inputs = {
    'edges': edges,
    'source_column_name': 'f0',
    'target_column_name': 'f1'
}

CKCC = kiara.run_job('create.network_data.from.tables', inputs=inputs)
CKCC

Great - this has made a <i>kiara</i> network data object, and the output is showing the edge table and node table for the network. As we didn't give it a node table to start with, it has extracted the information for the nodes from the edges instead.

As we can see, some of the edges are listed more than once, where more than one letter was written from one person to another. There's obviously more to the network than just a list of edges and nodes, so let's find out some more information about our network object.
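To make the idea of deriving nodes from an edge table a little more concrete, here is a minimal sketch in plain Python. It is not how <i>kiara</i> does it internally; the three names and the rows are invented purely for illustration.

```python
from collections import Counter

# Hypothetical edge table: one (source, target) row per letter.
# The names are placeholders, not taken from the CKCC data.
edges = [
    ("A", "B"),
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),
]

# Node table: every distinct name that appears as a source or a target.
nodes = sorted({name for edge in edges for name in edge})

# Repeated rows are parallel edges: several letters between the same pair.
letters_per_pair = Counter(edges)

print(nodes)
print(letters_per_pair[("A", "B")])
```

Each row stays in the edge table even when it repeats, which is why the same pair of correspondents can appear more than once.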

<h2>Network Data</h2>

Let's start by having a look at the information for our network using the `get_property_data` function. As we're querying the `network_data` part of our network object, we'll save this as <b>CKCC</b> for the moment.

In [6]:
CKCC = CKCC['network_data']
CKCC.get_property_data('metadata.graph_properties')

Doing this gives us the total number of nodes, but also gives us an idea of the different kinds of graphs we might choose to use for this dataset - <b>Directed</b>, <b>Undirected</b>, <b>Multi-Directed</b>, and <b>Multi-Undirected</b>. We spotted earlier that some of the edges were listed more than once, but this function tells us that there are a total of 17,087 parallel edges - we can decide what we'll do with those in a little bit, but it's good to know that they make up quite a lot of our data. It also shows us that there's a large number of self-loops - this is unusual in epistolary collections, so this function might also flag up some errors or inconsistencies in our dataset that we can go back to at some point.

Having this kind of information accessible means we can make more informed decisions about the next steps that might work with our research or digital analysis, especially those that are sometimes automated for us.
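As a rough illustration of what these graph properties describe, the sketch below counts edges for the four graph types on a tiny invented edge list. This is only one plausible reading of the terms; the exact way <i>kiara</i> counts parallel edges and self-loops may differ in detail.

```python
# Hypothetical edge rows (source, target); the names are invented.
edges = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "C"), ("B", "C")]

# A self-loop is an edge whose source and target are the same node.
self_loops = sum(1 for source, target in edges if source == target)

# Multi-directed: every row is kept as its own edge.
multi_directed = len(edges)
# Directed: repeated (source, target) rows collapse into a single edge.
directed = len(set(edges))
# Multi-undirected: direction is ignored, but every row is still kept.
multi_undirected = len(edges)
# Undirected: direction is ignored and repeated pairs collapse.
undirected = len({frozenset(edge) for edge in edges})

# Rows that disappear when the multigraph is collapsed to a simple directed graph.
parallel_directed = multi_directed - directed

print(self_loops, directed, undirected, parallel_directed)
```

The same five rows give four directed edges but only three undirected ones, which is the kind of difference worth knowing about before choosing a graph type.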

Let's get some more information about the network as a whole, using the `network_data.extract_largest_component` function. This works out how many distinct components there are in a network, and also gives us the largest component on its own. We'll have a quick look at how it works first.

In [7]:
kiara.retrieve_operation_info('network_data.extract_largest_component')

In [8]:
output = kiara.run_job('network_data.extract_largest_component', inputs={'network_data':CKCC})
output

For now, let's save our largest component in the variable `network_data` for later - we'll use this for the rest of our experiments rather than the full network, and make sure we're tracing this using <i>kiara</i>.
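If it helps to picture what a 'component' is, the following sketch finds components with a breadth-first search, ignoring edge direction. It is a plain-Python illustration on invented data, not the implementation behind the <i>kiara</i> module, and the function name `connected_components` is our own.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return a list of node sets, one per component, ignoring edge direction."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        components.append(component)
    return components

# Hypothetical edges: A-B-C form one group, D-E another.
edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(edges)
largest = max(components, key=len)
print(len(components), sorted(largest))
```

Keeping only the largest component means dropping the nodes in every smaller group, so it is a decision worth recording.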

Let's have a look at the information for this largest component, using our `get_property_data` function again.

In [9]:
network_data = output['largest_component']

network_data.get_property_data('metadata.graph_properties')

<h2>Onboarding Data: An Alternative</h2>

So far then, we have created a network object in <i>kiara</i> by importing a csv from a local path.

But what about other formats? Let's pause quickly, and have a look at importing a <b>gml</b> file instead. 

Here we will use a different sample dataset, a <a href="http://www-personal.umich.edu/~mejn/netdata/">co-appearance network</a> of characters in Victor Hugo's novel <i>Les Miserables</i>, already in gml format. 

Let's have a look at the function `onboard.gml_file` and how this will work for us.

In [10]:
kiara.retrieve_operation_info('onboard.gml_file')

We need a local file path again, and we can go ahead and save this as <b>lesmis</b>.

In [11]:
lesmis = kiara.run_job('onboard.gml_file', inputs={'path':'lesmis.gml'})
lesmis

As we can see, this module not only imports the gml file into <i>kiara</i> but automatically converts it into a <i>kiara</i> network object for us. Great!

Here we can see that the edge table has a 'value' column indicating edge weights, which has been automatically included from the gml data.
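To show where such a 'value' column can come from, here is a small hand-written fragment in GML syntax and a deliberately naive regular-expression reader. This is a sketch only: it is not a full GML parser, it is not what `onboard.gml_file` does internally, and the labels and values are invented rather than taken from the <i>Les Miserables</i> file.

```python
import re

# A hand-written GML fragment; node labels and edge values are invented.
gml_text = """
graph [
  node [ id 0 label "A" ]
  node [ id 1 label "B" ]
  node [ id 2 label "C" ]
  edge [ source 0 target 1 value 3 ]
  edge [ source 1 target 2 value 1 ]
]
"""

# Naive patterns that only handle the simple layout used above.
node_pattern = re.compile(r'node\s*\[\s*id\s+(\d+)\s+label\s+"([^"]*)"\s*\]')
edge_pattern = re.compile(
    r"edge\s*\[\s*source\s+(\d+)\s+target\s+(\d+)\s+value\s+(\d+)\s*\]"
)

# Map numeric ids to labels, then build (source, target, value) rows.
labels = {int(node_id): label for node_id, label in node_pattern.findall(gml_text)}
edge_table = [
    (labels[int(source)], labels[int(target)], int(value))
    for source, target, value in edge_pattern.findall(gml_text)
]

print(edge_table)
```

For real files a dedicated GML reader is the safer choice; the point here is only that each edge in the file can carry its own attributes, such as a weight.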

We'll leave this <i>Les Miserables</i> network for now, but it's useful to see this other option for importing data for networks. If you want to experiment with this dataset later, feel free to come back to it!

<h2>Network Analysis: Statistical Measures</h2>

Ok, let's head back to the largest component of our CKCC dataset, stored in the variable <b>network_data</b>. We've already had a look at some graph-wide measures, so let's start looking at some node-specific measurements.

<b>Degree</b>
<br>Let's start with degree, using `create.degree_rank_list`. This module allows us to calculate degree as both <b>undirected</b> and <b>weighted</b>. In this epistolary network, <b>undirected degree</b> counts the number of individual correspondents each person has, whereas <b>weighted degree</b> counts the total number of incoming and outgoing letters for each actor in the network. 

Let's use our `retrieve_operation_info` function to have a look at what we need to calculate these degrees.

In [12]:
kiara.retrieve_operation_info('create.degree_rank_list')

So we've already computed the largest component to use as the `network_data` input, and we want to calculate the weighted degree measures, so we'll leave the default as 'True'. Unlike the <i>Les Miserables</i> network, we don't have a pre-existing weight value for the edges, but we do know that there are parallel edges from multiple letters between correspondents, so we'll allow the module to aggregate the edges and set this as a weight. 

In creating this module, assumptions have already been made that we are working with both a single node type and a single edge type network. Again, a lot of the parameters have been set based on what we already know about the dataset, but we also need to acknowledge this as an active decision that has been 'pre-made' as part of the research process.

The inputs for `create.degree_rank_list` prompt us to reflect on the decisions we are making as we go along, and to think about how our data fits into these kinds of measurements. By doing it in <i>kiara</i>, these inputs also allow us to <i>track</i> these decisions, as we will see more of later.

Let's give it a go then.

In [13]:
output = kiara.run_job('create.degree_rank_list', inputs={'network_data':network_data})
output

Great, this function gives us a table with the undirected degree and weighted degree for each member of this network, ranking them by undirected degree. 

In doing so, it's done two extra things for us. Seeing as we allowed the function to calculate parallel edges as edge weight, it's now saved the weight as an edge attribute that we can carry forward into our next measurements. It's also assigned the two degree scores as node attributes in our network, which means we can also keep these in further centrality measurements, allowing us to accumulate all the different scores rather than re-writing over them each time.
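The difference between the two degree scores, and the idea of turning parallel edges into a weight, can be sketched in a few lines of plain Python. The data is invented and the variable names are our own; this illustrates the definitions given above rather than reproducing the module's code.

```python
from collections import Counter, defaultdict

# Hypothetical letters: A and B exchange three letters, A writes once to C.
edges = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]

# Aggregate parallel edges: the weight of a pair is its number of letters.
edge_weight = Counter(frozenset(edge) for edge in edges)

correspondents = defaultdict(set)
weighted_degree = Counter()
for source, target in edges:
    # Undirected degree counts distinct correspondents...
    correspondents[source].add(target)
    correspondents[target].add(source)
    # ...while weighted degree counts every letter sent or received.
    weighted_degree[source] += 1
    weighted_degree[target] += 1

degree = {name: len(others) for name, others in correspondents.items()}

# Rank by undirected degree, breaking ties alphabetically.
ranking = sorted(degree, key=lambda name: (-degree[name], name))

print(ranking, degree, dict(weighted_degree))
```

Here B and C both have a single correspondent, but B's weighted degree is three times C's, which is exactly the distinction the two columns in the table capture.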

<b>Betweenness</b>
<br>Let's have a look at a different centrality measure now - use `retrieve_operation_info` again to see what we need to calculate betweenness for the nodes in our network.

In [14]:
kiara.retrieve_operation_info('create.betweenness_rank_list')

This module allows us to calculate both unweighted and weighted betweenness, so we'll go ahead and do both of those. Again, we can select a column that holds the edge weight if, like in our <i>Les Miserables</i> network, it already exists or has a different label. As we just used the degree module to calculate edge weight using the parallel edges, we can leave this and it will automatically select the 'weight' column we just created. 

This module also asks us to define how we want our weights to be interpreted - is the weight 'positive', indicating strong relationships, or is it 'negative', acting as a distance or time needed for these edges? Whilst this is often automated in network measures, <i>kiara</i> prompts us to think more carefully about our data and our network, and again gets us to trace the decisions we as researchers are making about our analysis.

As we're dealing with epistolary data, we'll leave this input as 'True', as the weight indicates strength. At this stage, the module is also set to calculate both unweighted and weighted betweenness using the network as a directed graph. Though this is another 'pre-made' decision for this notebook and the dataset in use, it's important to acknowledge this and be as transparent about these kinds of choices as the ones actively documented by user input.
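Why the interpretation of weights matters can be seen with a small shortest-path sketch, since betweenness is built on shortest paths. The example below uses Dijkstra's algorithm on invented letter counts and inverts the weights (1 / weight) when they are read as strength. Inverting is one common convention and is an assumption here; we have not confirmed that this is how the <i>kiara</i> module converts weights.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm on a dict {node: {neighbour: distance}}."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, distance in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + distance, neighbour, path + [neighbour]))
    return None

# Hypothetical letter counts between three correspondents.
letters = {
    "A": {"B": 10, "C": 1},
    "B": {"A": 10, "C": 10},
    "C": {"A": 1, "B": 10},
}

# Reading the counts as distances: the single A-C letter is the 'shortest' route.
as_distance = shortest_path(letters, "A", "C")

# Reading the counts as strength: invert them so that busy ties become short.
inverted = {
    node: {other: 1 / weight for other, weight in others.items()}
    for node, others in letters.items()
}
as_strength = shortest_path(inverted, "A", "C")

print(as_distance, as_strength)
```

Under the strength reading the route from A to C runs through B, so B sits 'between' them; under the distance reading it does not. The same data can therefore produce different betweenness scores depending on this one input.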

Let's give it a go then. We want to use the network we just created using the degree ranking module, so let's save that and use it in our inputs.

In [15]:
network_data = output['centrality_network']

output = kiara.run_job('create.betweenness_rank_list', inputs={'network_data':network_data})

output

Just like the degree module, it's returned a table with the two betweenness scores, ranked by unweighted, and also assigned these as node attributes that we can carry forward into more measurements. Let's look at one more centrality here in this notebook.

<b>Eigenvector</b>
<br><i>kiara</i> also holds a module to measure eigenvector centrality, so let's look again at what that needs.

In [16]:
kiara.retrieve_operation_info('create.eigenvector_rank_list')

This module is set up similarly to the betweenness measure, and again we can define the column with the weight information if we need to, and how to interpret these weights. If you have a larger dataset, you can also change the number of iterations for the measurement. For the moment we'll leave the parameters as they are, and again use our updated network with the degree and betweenness scores attached.

In [17]:
network_data = output['centrality_network']

output = kiara.run_job('create.eigenvector_rank_list', inputs={'network_data':network_data})

output

As before, we have our score table and our updated node attributes. Great!
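For readers curious about why an iterations setting exists at all, eigenvector centrality is usually computed by repeating the same update until the scores stop changing. The sketch below shows a simple unweighted, undirected version of that power iteration on invented data. The function and parameter names (`iterations`, `tolerance`) are our own illustrative choices, not the inputs of the <i>kiara</i> module.

```python
def eigenvector_centrality(neighbours, iterations=100, tolerance=1e-9):
    """Power iteration on an undirected graph given as {node: set_of_neighbours}."""
    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # New score = own score plus neighbours' scores; keeping the own score
        # stops the values oscillating on some graph shapes.
        new_scores = {
            node: scores[node] + sum(scores[other] for other in neighbours[node])
            for node in neighbours
        }
        # Rescale so that the scores do not grow without bound.
        norm = sum(value * value for value in new_scores.values()) ** 0.5
        new_scores = {node: value / norm for node, value in new_scores.items()}
        change = sum(abs(new_scores[node] - scores[node]) for node in neighbours)
        scores = new_scores
        if change < tolerance:
            break
    return scores

# Hypothetical network: A knows everyone, B and C also know each other, D only knows A.
neighbours = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
scores = eigenvector_centrality(neighbours)
print(max(scores, key=scores.get))
```

A node scores highly when its neighbours score highly, so the update has to be repeated; larger networks may need more rounds before the scores settle.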

The network analysis plugin has one final centrality measure, for closeness. See if you can work out how to check the information for this and run it on the network here, or feel free to move on to other measures.

<b>Modularity Group</b>
<br>This next module determines the modularity groups in the network, again assigning each group as a node attribute. Let's have a look at the parameters for it.

In [18]:
kiara.retrieve_operation_info('compute.modularity_group')

Here, we can set the number of communities that we want the module to divide our network up into, or we can allow the code to find this automatically.

Let's give it a go with our network once more.

In [19]:
network_data = output['centrality_network']

output = kiara.run_job('compute.modularity_group', inputs={'network_data':network_data})

output

Great - this once again gives us our updated network, and also tells us how many modularity groups the measure has found in the network.
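To give a feel for what 'modularity' measures, the sketch below scores a given grouping of nodes using Newman's modularity formula for an undirected, unweighted network. It only evaluates a partition we supply by hand; it does not search for groups, and we make no claim about which algorithm the <i>kiara</i> module uses to find them. The data is invented.

```python
def modularity(edges, communities):
    """Modularity Q of a partition for an undirected, unweighted edge list."""
    m = len(edges)
    community_of = {
        node: index for index, group in enumerate(communities) for node in group
    }
    internal = [0] * len(communities)    # edges with both ends in the group
    degree_sum = [0] * len(communities)  # total degree of the group's nodes
    for source, target in edges:
        degree_sum[community_of[source]] += 1
        degree_sum[community_of[target]] += 1
        if community_of[source] == community_of[target]:
            internal[community_of[source]] += 1
    return sum(
        internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )

# Hypothetical network: two triangles joined by a single edge (C-D).
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]

q_two_groups = modularity(edges, [{"A", "B", "C"}, {"D", "E", "F"}])
q_one_group = modularity(edges, [{"A", "B", "C", "D", "E", "F"}])
print(round(q_two_groups, 3), q_one_group)
```

Splitting the network along the single bridging edge scores higher than lumping everyone together, which is the intuition behind grouping nodes by modularity.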

Let's look at one last measure.

<b>Cut Points</b>
<br>This last function finds all the cut-points in the network: nodes that, when removed, will separate the component into two or more pieces. This function will return a list of the cut-points, and assign 'Yes' or 'No' as a node attribute.
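The definition can be checked directly on a small example: remove each node in turn and see whether the number of components goes up. The brute-force sketch below does exactly that on invented data. It is meant to illustrate the definition, not to mirror the module's implementation, and it would be slow on a network the size of CKCC.

```python
from collections import defaultdict

def count_components(neighbours, removed=None):
    """Count connected components, optionally pretending one node is gone."""
    seen = {removed} if removed is not None else set()
    count = 0
    for start in neighbours:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        # Depth-first search marks everything reachable from `start`.
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return count

def cut_points(edges):
    """Return the nodes whose removal increases the number of components."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    baseline = count_components(neighbours)
    return sorted(
        node for node in neighbours
        if count_components(neighbours, removed=node) > baseline
    )

# Hypothetical network: A-B-C in a chain, with C, D and E forming a triangle.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E")]
print(cut_points(edges))
```

In this example B and C are cut-points, while D and E are not, because the triangle stays connected to the rest if either of them is removed.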

Let's have a look one last time.

In [20]:
kiara.retrieve_operation_info('create.cut_point_list')

Nice and simple, no extra parameters: it just needs our network.

In [21]:
network_data = output['modularity_network']

output = kiara.run_job('create.cut_point_list', inputs={'network_data':network_data})

output

Having started simply with an imported CSV of letter edges, we've now got a lot of information. This is great - but what next?

<h2>Exporting the Network</h2>

<i>kiara</i> has stored all of the information we have just created, and as it's interoperable, it also allows us to export this network again. We can export all this network data as a set of CSVs or even graphml with built-in <i>kiara</i> modules like this:

In [22]:
kiara.retrieve_operation_info('export.network_data.as.graphml_file')

Let's export our final network after the cut-points measures then, and save it locally.

In [23]:
network_data = output['cut_network']

output = kiara.run_job('export.network_data.as.graphml_file', inputs={'network_data':network_data, 'name':'CKCC_kiara_network'})

Finally we can check out the lineage for our final export output. As we can see, it has stored all the decisions we have made, and the ways in which they have created 'new' datasets, right from our original import.

In [24]:
output['export_details'].lineage

<h2>Recommended Reading</h2>
<br>Want to know more about Network Analysis? Here's some helpful tutorials and reading:

* <a href="https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python"><i>Programming Historian</i> NetworkX tutorial</a>
* Ahnert, Ruth, Ahnert, Sebastian E., Coleman, Catherine Nicole and Scott B. Weingart 2020. <i>The Network Turn: The Changing Perspectives in the Humanities</i>. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108866804
* Barabási, Albert-László. <i>Linked: The New Science of Networks</i>. New York: Penguin Group, 2002.
* Borgatti, Stephen. ‘The Key Player Problem.’ In <i>Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers</i>. Edited by Ronald Breiger, Kathleen Carley and Philippa Pattison. Washington: The National Academies Press, 2003. 241-252.
* Brughmans, Tom, Anna Collar, and Fiona Coward, ed. <i>The Connected Past: Challenges to Network Studies in Archaeology and History</i>. Oxford: Oxford University Press, 2016.
* Tuominen, Jouni, Koho, Mikko, Pikkanen, Ilona, Drobac, Senka, Enqvist, Johanna, Hyvönen, Eero, La Mela, Matti, Leskinen, Petri, Paloposki, Hanna-Leena and Rantala, Heikki. Constellations of Correspondence: a Linked Data Service and Portal for Studying Large and Small Networks of Epistolary Exchange in the Grand Duchy of Finland. DHNB 2022 The 6th Digital Humanities in Nordic and Baltic Countries Conference, pp. 415-423, CEUR Workshop Proceedings, Vol. 3232, March, 2022. http://ceur-ws.org/Vol-3232/paper41.pdf