<h2><i>kiara</i>: Network Analysis</h2>

Welcome back! Now that we're comfortable with what <i>kiara</i> looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with <b>Network Analysis</b>.

<h2>Why Network Analysis?</h2>

Network Analysis offers a computational and quantitative means to examine and explore relational objects, with proxies to measure structural roles and concepts such as power and influence. Doing so digitally - and at scale - also allows us to ask these kinds of questions of large amounts of material or documents that were not previously manageable with qualitative or manual approaches.

We won't get into any core network theories or their uses in the humanities here, as we're focused on the ways in which network analysis in <i>kiara</i> offers an interesting way to wrap the research process, and think about the decisions we're making and how to trace them. If you're interested in learning more about network analysis, or how to code using <a href="https://networkx.org">NetworkX</a>, the library currently used in these <i>kiara</i> modules, check out our recommended reading at the bottom.

<h3>Getting Started</h3>
<br>Let's start by double checking that we have all the required plugins and setting up an API for us to use <i>kiara</i>. We'll do this all in one go this time, but if you're unsure, feel free to head back to the <a href="http://dharpa.org/kiara.documentation/latest/workshop/workshop/">installation notebook</a> to look over this section again.

In [1]:
import networkx as nx
import os
from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()

### Data
Next we set up the file path for the data that we are going to use in this notebook. It is located in the same directory as the two Jupyter notebooks. You can either save the full path to the csv file in the variable below, or use Python's `os.path` module to shorten this, as below.

In [2]:
notebook_path = os.path.abspath('')

csv_file_path = os.path.join(notebook_path,"CKCC.csv")


Great, we're all set up. We're going to import some data again like in the first notebook, but this time we're going to use a local file, using the kiara function `import.local.file`. We're using sample data again here, but you can also use this function to import your own data in the future. 

The data we're using here is a sample taken from the <b>Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic (CKCC)</b> dataset collated by the Huygens Institute in the Netherlands and available on the 'LetterSampo' portal created by the Reassembling the Republic of Letters project team. For more on these projects see <a href="https://seco.cs.aalto.fi/projects/rrl/">here</a>.

The CKCC collection features around 20,000 letters written by and to 17th-century scholars in the Dutch Republic. Using quantitative network analysis on this dataset might offer insights into the most prolific writer in the dataset, which actor connected the most people, or who operated in closely knit writing groups. Although we can also use network analysis to explore or 'map' our datasets if we don't know much about them, in this notebook the research questions and module parameters have been built around and defined by the information we already have about the data. It's important that we acknowledge this now as a core factor in the decisions already made for this process, but we'll also return to this throughout the notebook.

Let's use the <i>kiara</i> function `import.local.file` to access our dataset, specifying the path to the csv file in our <span style="color:green">inputs</span> and saving the <span style="color:red">outputs</span> of the function as '<b>CKCC</b>'. Here we're passing the `csv_file_path` variable we defined above. If the file is stored somewhere else, we need to specify the full file path. (If you are using a file path and not a variable, remember to surround it with quotes, for example: "/Users/some.user/Documents/my_csvfile.csv"). Alternatively, we can use the `download.file` module used in the <b>Hello Kiara</b> notebook.

Again, we'll leave the comments blank here for you to fill in yourself, but the comment here might indicate why you have chosen this dataset, or a reminder of which version you are working with if you have multiple versions of the same dataset.

In [3]:
CKCC = kiara.run_job('import.local.file', inputs={'path': csv_file_path}, comment="importing bits")

<h2>Creating a Network</h2>

Time to make our network from this data. Like other network analysis tools, <i>kiara</i> first needs the data as an edge table. This means we first have to transform the csv file we imported earlier into a table before we can create the network data. Let's start by using the `create.table.from.file` function that we used in the first notebook and storing this as our <b>edges</b>.

If we want, we can also import a separate table with the nodes in, but this is optional, and for the moment let's stick with just the edge table. We'll store this again at the end in the variable <b>CKCC</b> for us to use again in a bit.

First check the <span style="color:green">inputs</span> requirements for the `create.table.from.file` function, just to be sure:

In [4]:
kiara.retrieve_operation_info('create.table.from.file')

In [5]:
inputs = {
    "file": CKCC['file'],
    "first_row_is_header": True
}

outputs = kiara.run_job('create.table.from.file', inputs=inputs, comment="")

edges = outputs['table']

outputs

Great, now we have our edges as a <i>kiara</i> table, we can make our network graph. First though, we can preview some information about the different types of graphs we might be able to make with our data, using `preview.network_info`. This just requires us to select our edges table, and the column names for our sources and targets. Check it out:

In [6]:
inputs = {'edges': edges,
    'source_column': 'Source',
    'target_column': 'Target'}

network_info = kiara.run_job('preview.network_info', inputs=inputs, comment="")

network_info

Doing this gives us the total number of nodes, but also gives us an idea of how the different kinds of graphs we might choose - <b>Directed</b>, <b>Undirected</b>, <b>Multi-Directed</b>, and <b>Multi-Undirected</b> - might change the number of edges we have.

We have more edges in a directed than an undirected graph, suggesting there are reciprocal edges (one in each direction) between some pairs of nodes - this is pretty standard in an epistolary network if people are writing back and forth to each other. We also have more edges in a multigraph than in either of our non-multigraphs, which means we have parallel edges (i.e. duplicates in our edge table). Again, in an epistolary network this might be pretty common if someone writes more than one letter to their friend! There are no isolates (nodes without any edges) and a number of components, but it also shows us that there's a large number of self-loops - this is unusual in epistolary collections, as people are unlikely to write to themselves! As well as helping us decide what type of graph is most useful for our dataset then, this module also helps us to review our data by flagging up potential errors or inconsistencies in our dataset that we can go back to at some point.
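To make these differences concrete, here is a minimal sketch in plain Python. The edge list is invented for illustration (it is not taken from the CKCC data), and the counting rules are an assumption about what a preview like this reports rather than <i>kiara</i>'s actual implementation.

```python
# Hypothetical edge list: one (source, target) pair per letter.
edges = [
    ("Anna", "Bram"),  # Anna writes to Bram
    ("Anna", "Bram"),  # ... and writes again (a parallel edge)
    ("Bram", "Anna"),  # Bram replies (a reciprocal edge)
    ("Bram", "Cees"),
    ("Cees", "Cees"),  # self-loop: usually a data error in letter collections
]

nodes = {name for edge in edges for name in edge}

# Multi-directed: every row of the edge table counts as its own edge.
multi_directed = len(edges)

# Directed: parallel edges collapse into one, but direction is kept.
directed = len(set(edges))

# Undirected: direction is ignored, so A->B and B->A become a single edge.
undirected = len({frozenset(edge) for edge in edges})

self_loops = sum(1 for source, target in edges if source == target)

print(len(nodes), multi_directed, directed, undirected, self_loops)
```

In this toy table the multigraph has the most edges, then the directed graph, then the undirected graph, which is the same ordering the preview shows for our letters.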

Having this kind of information accessible means we can make more informed decisions about the next steps that might work with our research or digital analysis, especially those that are sometimes automated for us.

For this network, a <b>Directed Graph</b> makes the most sense. Let's check out what we need to build this with our `assemble.network_graph` module using `kiara.retrieve_operation_info` once more.

In [7]:
kiara.retrieve_operation_info('assemble.network_graph')

This seems like quite a chunky module, but it gets us to do a lot of heavy decision making up-front so that we don't have to keep inputting these decisions later on when we do some more analytical stuff. If we change our mind about the kind of graph we want to be using later on, we can come back to this step - but using the `preview.network_info` function should help us get the information we need to make an informed decision about our network!

We already decided that we want to make a Directed Network, so we can select 'directed' for graph type, and we created our edge table as 'edges' earlier, so we can pop that back in. We need to specify our Source and Target column again, and we can copy all this information from our preview module. We don't have a node table for this dataset, but if we did, now is where we would include it. 

Now come some more decisions that we haven't seen yet. One is deciding whether our network is <b>weighted</b> or not, which might mean a number of things, depending on the data you are using - the number of letters between correspondents, the distance between them, the number of years they have known each other. If all the relationships between nodes in the network are the same, we can set this as False; if not, we need to tell <i>kiara</i> where this weight information is coming from.

If weights already exist in the edges table (i.e. you've already assigned weights to the network before uploading the data into <i>kiara</i>), you can just pick the weight column and move on, or choose to do something more with them. If you have parallel edges (which the `preview.network_info` module will have told you) and you <i>don't</i> want to use a multigraph, you can choose how you want to handle weighted parallel edges. You can either: add all the weights together ('sum'); calculate the average weight for the merged edge ('mean'); find the largest value ('maximum'); or find the smallest value ('minimum'). This will then assign this value as the new weight for this edge in the network.

If you want <i>kiara</i> to calculate the weights for you, you can select 'sum' and total the number of occurrences of this edge as a weight. Note that if you haven't provided any weights already, the edges will be automatically assigned a weight of 1, so choosing 'mean', 'minimum', or 'maximum' for this will just return a value of 1 for every edge, which will count the same as an <b>unweighted</b> network.
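The four strategies can be sketched in a few lines of plain Python. This is only an illustration of the idea under our own assumptions: the rows are invented, and `merge_parallel_edges` is a hypothetical helper, not a <i>kiara</i> function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weighted edge rows: (source, target, weight).
rows = [
    ("Anna", "Bram", 2.0),
    ("Anna", "Bram", 4.0),
    ("Bram", "Anna", 1.0),
]

# The strategy names follow the options described above.
STRATEGIES = {"sum": sum, "mean": mean, "maximum": max, "minimum": min}


def merge_parallel_edges(rows, strategy):
    """Collapse rows sharing the same (source, target) into one weighted edge."""
    grouped = defaultdict(list)
    for source, target, weight in rows:
        grouped[(source, target)].append(weight)
    combine = STRATEGIES[strategy]
    return {edge: combine(weights) for edge, weights in grouped.items()}


summed = merge_parallel_edges(rows, "sum")
averaged = merge_parallel_edges(rows, "mean")

# With no weights supplied, every row counts as 1: 'sum' then counts letters,
# while 'mean', 'maximum' and 'minimum' all stay at 1.
unweighted_rows = [(source, target, 1) for source, target, _ in rows]
letter_counts = merge_parallel_edges(unweighted_rows, "sum")
flat = merge_parallel_edges(unweighted_rows, "mean")
```

Here `letter_counts` records two letters from Anna to Bram, whereas `flat` gives every edge a weight of 1, which is why only 'sum' is useful when the edge table has no weight column.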

The inputs for this module prompt us to reflect on the decisions we are making as we go along, and to think about how our data fits into these kinds of measurements. By doing this in <i>kiara</i>, these inputs also allow us to <i>track</i> these decisions, both by storing the processes and through the comments we add.

We're still working with our letter dataset, so let's get <i>kiara</i> to add all the edges together so that the weight will tell us how many letters each person wrote to each other.

In [8]:
inputs = {
    'graph_type': 'directed',
    'edges': edges,
    'source_column': 'Source',
    'target_column': 'Target',
    'is_weighted': True,
    'parallel_edge_strategy':'sum'
}

CKCC = kiara.run_job('assemble.network_graph', inputs=inputs, comment="")
CKCC

Great - this has made a <i>kiara</i> network graph object, and the output is showing the edge table and node table for the network. As we didn't give it a node table to start with, it has extracted the information for the nodes from the edges instead. As we can see, the edge table now has weights on it, calculated according to the decisions we made when we created the graph.

As we saw in our preview module, the network actually has eight separate components. We might be interested in investigating just the main component, which holds the most nodes. To do this we can extract the largest component, and return it as its own network graph. Let's have a look!

In [9]:
CKCC_largest_component = kiara.run_job('extract.largest_component', inputs={'network_graph':CKCC['network_graph']}, comment="")

CKCC_largest_component
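As a rough illustration of what extracting the largest component involves, here is a sketch in plain Python. It assumes that edge direction is ignored when grouping nodes (i.e. weakly connected components); we have not confirmed that this is how the <i>kiara</i> module treats directed graphs, and the edge list is invented.

```python
from collections import defaultdict, deque

# Hypothetical edge list with two separate groups of correspondents.
edges = [("Anna", "Bram"), ("Bram", "Cees"), ("Dirk", "Eva")]


def connected_components(edges):
    """Group nodes into components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node] - component:
                component.add(other)
                queue.append(other)
        seen |= component
        components.append(component)
    return components


components = connected_components(edges)
largest = max(components, key=len)

# Keep only the edges whose endpoints both sit in the largest component.
largest_edges = [(s, t) for s, t in edges if s in largest and t in largest]
```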

<h2>Network Analysis: Statistical Measures</h2>

Now we've made our graph, we can do some more exciting and analytical stuff with it - let's start by looking at some common measurements that assess the value or importance of the individual nodes.

<b>Degree</b>
<br>Let's start with degree, using `calculate.degree_score`. This module allows us to calculate degree as both <b>undirected</b> and <b>weighted</b>. In this epistolary network, <b>undirected degree</b> counts the number of individual correspondents each person has, whereas <b>weighted degree</b> counts the total number of incoming and outgoing letters for each actor in the network.

Let's use our `retrieve_operation_info` function to have a look at what we need to calculate these degrees.

In [10]:
kiara.retrieve_operation_info('calculate.degree_score')

Nice and simple. We just need to give it the graph we created in `assemble.network_graph`.

Let's give it a go then.

In [11]:
output = kiara.run_job('calculate.degree_score', inputs={'network_graph':CKCC['network_graph']}, comment="")
output

Great, this function gives us a table with the undirected degree and, because it automatically detects that we have a weighted network, as decided in the creation step, it also calculates weighted degree for each member of this network. It has also assigned the two degree scores as node attributes in our network, which means we can keep these in further centrality measurements, allowing us to accumulate all the different scores rather than overwriting them each time.
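To see how the two degree scores differ, here is a small sketch in plain Python that follows the definitions given above. The edges and letter counts are invented, and this is our own illustration rather than the module's code.

```python
from collections import defaultdict

# Hypothetical directed edges, with the number of letters as the weight.
weighted_edges = {
    ("Anna", "Bram"): 3,
    ("Bram", "Anna"): 2,
    ("Anna", "Cees"): 1,
}

correspondents = defaultdict(set)
weighted_degree = defaultdict(int)

for (source, target), weight in weighted_edges.items():
    # Undirected degree: who is connected to whom, whatever the direction.
    correspondents[source].add(target)
    correspondents[target].add(source)
    # Weighted degree: every letter counts for both sender and recipient.
    weighted_degree[source] += weight
    weighted_degree[target] += weight

degree = {name: len(others) for name, others in correspondents.items()}
```

Anna has two correspondents but six letters in total, so the two scores can rank the same person quite differently.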

<b>Betweenness</b>
<br>Let's have a look at a different centrality measure now - use `retrieve_operation_info` again to see what we need to calculate betweenness for the nodes in our network.

In [12]:
kiara.retrieve_operation_info('calculate.betweenness_score')

This module asks us to define how we want our weights to be interpreted - is the weight 'positive', indicating strong relationships, or is it 'negative', acting as a distance or time needed for these edges? Whilst this is often automated in network measures, <i>kiara</i> prompts us to think more carefully about our data and our network, and again gets us to trace the decisions we as researchers are making about our analysis.

As we're dealing with epistolary data, we'll leave this input as 'True', as the weight indicates strength. At this stage, the module is also set to calculate both unweighted and weighted betweenness using the network as a directed graph. Though this is another 'pre-made' decision for this notebook and the dataset in use, it's important to acknowledge this and be as transparent about these kinds of choices as about the ones actively documented by user input.
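Why does the interpretation matter? NetworkX's path-based measures read an edge weight as a distance, so a 'strength' weight has to be converted before it is used. The sketch below shows the effect on a toy directed graph. The graph is invented, and inverting the weight (1 / weight) is just one common conversion that we assume here for illustration; it is not necessarily what the <i>kiara</i> module does.

```python
import networkx as nx

G = nx.DiGraph()
# Hypothetical letter counts: A and B, and B and C, write often; A and C rarely.
G.add_weighted_edges_from([("A", "B", 10), ("B", "C", 10), ("A", "C", 1)])

# Read as a distance, the direct A -> C edge (cost 1) is the shortest path,
# so B does not sit between anyone.
as_distance = nx.betweenness_centrality(G, weight="weight", normalized=False)

# Read as a strength, we first turn each weight into a distance (assumed
# conversion: 1 / weight). The route A -> B -> C now costs 0.2 instead of 1.
for u, v, data in G.edges(data=True):
    data["distance"] = 1 / data["weight"]
as_strength = nx.betweenness_centrality(G, weight="distance", normalized=False)

print(as_distance["B"], as_strength["B"])
```

The same network gives B a betweenness of zero under one reading and makes B the broker under the other, which is why it helps to record this decision explicitly.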

Let's give it a go then. We want to use the network we just created using the degree module, so let's save that and use it in our inputs.

In [13]:
network_graph = output['centrality_network']

output = kiara.run_job('calculate.betweenness_score', inputs={'network_graph':network_graph}, comment="")

output

Just like the degree module, it's returned a table with the two betweenness scores, ranked by unweighted, and also assigned these as node attributes that we can carry forward into more measurements. Let's look at one more centrality here in this notebook.

<b>Eigenvector</b>
<br><i>kiara</i> also holds a module to measure eigenvector centrality, so let's look again at what that needs.

In [14]:
kiara.retrieve_operation_info('calculate.eigenvector_score')

This module is set up similarly to the betweenness measure, and again we can define how to interpret the weights. If you have a larger dataset, you can also change the number of iterations for the measurement. For the moment we'll leave the parameters as they are, and again use our updated network with the degree and betweenness scores attached.

In [15]:
network_graph = output['centrality_network']

output = kiara.run_job('calculate.eigenvector_score', inputs={'network_graph':network_graph}, comment="")

output

As before, we have our score table and our updated node attributes. Great!

There's one final centrality measure in the network analysis plugin for closeness. See if you can work out how to check the information for this and run it on the network here, or feel free to move on to other measures.

<b>Modularity Group</b>
<br>This next module determines the modularity groups in the network, again assigning each group as a node attribute. Let's have a look at the parameters for it.

In [16]:
kiara.retrieve_operation_info('compute.modularity_group')

Here, we can set the number of communities that we want the module to divide our network up into, or we can allow the code to find this automatically.

Let's give it a go with our network once more.

In [17]:
network_graph = output['centrality_network']

output = kiara.run_job('compute.modularity_group', inputs={'network_graph':network_graph, 'number_of_communities':10}, comment="")

output

Great - this once again gives us our updated network, and also tells us how many modularity groups the measure has found in the network.

Let's look at one last measure.

<b>Cut Points</b>
<br>This last function finds all the cut-points in the network: nodes that, when removed, separate a component into two or more pieces. This function will return a list of the cut-points, and assign 'Yes' or 'No' as a node attribute.

Let's have a look one last time.

In [18]:
kiara.retrieve_operation_info('create.cut_point_list')

Nice and simple, no extra parameters: it just needs our network. It's worth pointing out that the cut-point function in NetworkX doesn't work on directed or multidirected graphs; if you have one of these graphs, this `create.cut_point_list` function will take the graph, convert it into an undirected graph, find the cut-points, and then give you back your directed graph. It makes no difference to the metrics, but it's worth knowing! 
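The conversion described above can be sketched directly in NetworkX. This is our own minimal illustration on an invented graph, not the module's source code, and the attribute name `is_cut_point` is hypothetical.

```python
import networkx as nx

# Hypothetical directed letter network: A <-> B -> C -> D.
G = nx.DiGraph([("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")])

# NetworkX does not implement articulation points for directed graphs.
try:
    list(nx.articulation_points(G))
    directed_supported = True
except nx.NetworkXNotImplemented:
    directed_supported = False

# So we look for cut-points on an undirected copy ...
undirected_copy = G.to_undirected()
cut_points = set(nx.articulation_points(undirected_copy))

# ... and record the result on the original directed graph, which is unchanged.
labels = {node: ("Yes" if node in cut_points else "No") for node in G}
nx.set_node_attributes(G, labels, "is_cut_point")
```

Removing B or C would split this chain of correspondents, so those two nodes are labelled 'Yes', while the graph itself stays directed.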

In [19]:
network_graph = output['modularity_network']

output = kiara.run_job('create.cut_point_list', inputs={'network_graph':network_graph}, comment="")

output

Having started simply with an imported CSV of letter edges, we've now got a lot of information. This is great - but what next?

<h2>Exporting the Network</h2>

<i>kiara</i> has stored all of this information we have just created, and as it's interoperable, it also allows us to export this network again. We can export all this network data as a set of CSVs or a number of different network data objects like graphml or gexf with built in <i>kiara</i> modules like this:

In [20]:
kiara.retrieve_operation_info('export.network_graph')

From this we can work with our <i>kiara</i> object in other software for further analysis or visualisations!

Give it a go yourself.
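To give a sense of what these export formats carry with them, here is a sketch that writes a tiny graph to graphml and gexf with NetworkX and reads the graphml back in. It uses NetworkX directly rather than the <i>kiara</i> export module, and the graph, attribute values and file names are all invented.

```python
import os
import tempfile

import networkx as nx

# Hypothetical two-person network with an edge weight and a node score.
G = nx.DiGraph()
G.add_edge("Anna", "Bram", weight=3)
nx.set_node_attributes(G, {"Anna": 1, "Bram": 1}, "degree")

with tempfile.TemporaryDirectory() as folder:
    graphml_path = os.path.join(folder, "letters.graphml")
    gexf_path = os.path.join(folder, "letters.gexf")

    nx.write_graphml(G, graphml_path)
    nx.write_gexf(G, gexf_path)
    gexf_written = os.path.exists(gexf_path)

    # Reading the file back shows that direction, weights and node
    # attributes all travel with the exported graph.
    reloaded = nx.read_graphml(graphml_path)
```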

Finally we can check out the lineage for our final cut-network output. As we can see, it has stored all the decisions we have made, and the ways in which they have created 'new' datasets, right from our original import.

In [21]:
lineage = kiara.retrieve_augmented_value_lineage(output['cut_network'])
from observable_jupyter import embed
embed('@dharpa-project/kiara-data-lineage', cells=['displayViz', 'style'], inputs={'dataset':lineage})

<h2>Onboarding Data: An Alternative</h2>

So far then, we have created a network object in <i>kiara</i> by importing a csv from a local path.

But what about other formats? Let's pause quickly, and have a look at importing a <b>gml</b> file instead. 

Here we will use a different sample dataset, the <a href="http://www-personal.umich.edu/~mejn/netdata/">co-appearance network</a> of characters in Victor Hugo's novel <i>Les Miserables</i>, already in gml format.

Let's have a look at the function `import.network_graph.from.file` and how this will work for us.

In [22]:
kiara.retrieve_operation_info('import.network_graph.from.file')

For this, we just need the path for the file we want to use - no need to import it into kiara first, as this module will take care of all of this in one go!

If our nodes are labelled anything other than 'id' in our gml file, we just need to pop this under label, and if we have a weight column in the import file, we also need to name this, just as in our create network module - this will rename the original weight column (in the <i>Les Miserables</i> graph, this is 'value'), so that further metrics can use the weighted edges information. 

In [23]:
lesmis_path = os.path.join(notebook_path,"lesmis.gml")

lesmis = kiara.run_job('import.network_graph.from.file', inputs={'path': lesmis_path, 'file_type':'gml', 'weight_column':'value'}, comment="")
lesmis

As we can see, this module not only imports the gml file into <i>kiara</i> but automatically converts it into a <i>kiara</i> network object for us. Great!

Here we can see that the edge table has renamed the 'value' column to 'weight' to account for edge weights that have also been automatically included with the gml data.
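For comparison, the same idea can be sketched with NetworkX alone: parse some gml, rename the 'value' attribute to 'weight', and use it in a weighted degree. The gml text below is a made-up three-node fragment in the style of the <i>Les Miserables</i> file (the edge values are invented), and this is our own illustration rather than what the <i>kiara</i> module runs internally.

```python
import networkx as nx

# Hypothetical gml fragment: nodes carry 'id' and 'label', edges carry 'value'.
gml_text = """graph
[
  node
  [
    id 0
    label "Myriel"
  ]
  node
  [
    id 1
    label "Napoleon"
  ]
  node
  [
    id 2
    label "Valjean"
  ]
  edge
  [
    source 1
    target 0
    value 1
  ]
  edge
  [
    source 2
    target 0
    value 5
  ]
]
"""

# label="label" names the nodes by their 'label' field rather than their id.
G = nx.parse_gml(gml_text, label="label")

# Rename 'value' to 'weight' so that weighted measures can find it.
for _, _, data in G.edges(data=True):
    data["weight"] = data.pop("value")

degree = dict(G.degree())
weighted_degree = dict(G.degree(weight="weight"))
```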

This can then be included in degree calculations, as below:

In [24]:
output = kiara.run_job('calculate.degree_score', inputs={'network_graph':lesmis['network_graph']}, comment="")
output

We'll leave this <i>Les Miserables</i> network for now, but it's useful to see this other option for importing data for networks. If you want to experiment with this dataset more, feel free to have a go!

<h2>Recommended Reading</h2>
<br>Want to know more about Network Analysis? Here are some helpful tutorials and reading:

* <a href="https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python"><i>Programming Historian</i> NetworkX tutorial</a>
* Ahnert, Ruth, Ahnert, Sebastian E., Coleman, Catherine Nicole and Scott B. Weingart 2020. <i>The Network Turn: The Changing Perspectives in the Humanities</i>. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108866804
* Barabási, Albert-László. <i>Linked: The New Science of Networks</i>. New York: Penguin Group, 2002.
* Borgatti, Stephen. ‘The Key Player Problem.’ In <i>Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers</i>. Edited by Ronald Breiger, Kathleen Carley and Philippa Pattison. Washington: The National Academies Press, 2003. 241-252.
* Brughmans, Tom, Anna Collar, and Fiona Coward, ed. <i>The Connected Past: Challenges to Network Studies in Archaeology and History</i>. Oxford: Oxford University Press, 2016.
* Tuominen, Jouni, Koho, Mikko, Pikkanen, Ilona, Drobac, Senka, Enqvist, Johanna, Hyvönen, Eero, La Mela, Matti, Leskinen, Petri, Paloposki, Hanna-Leena and Rantala, Heikki. Constellations of Correspondence: a Linked Data Service and Portal for Studying Large and Small Networks of Epistolary Exchange in the Grand Duchy of Finland. DHNB 2022 The 6th Digital Humanities in Nordic and Baltic Countries Conference, pp. 415-423, CEUR Workshop Proceedings, Vol. 3232, March, 2022. http://ceur-ws.org/Vol-3232/paper41.pdf