## Command line pipelines to get nodes and edges

Apart from one variable assignment, this notebook contains no Python, only command-line commands that reshape the data. You could do the same thing in Python (for example with the pandas library) or with many other languages and tools, but the command line is simple and well suited to this task.

We will build up most of these commands piece by piece, to show what happens at each stage and to check that each command works correctly.

First we will get the CSV of the Linnean Society correspondence from the Core Course 1 repository on GitHub:

In [None]:
!wget https://raw.githubusercontent.com/jonathanblaney/CC1-Michaelmas25/refs/heads/main/week4-networks/linnean-society.csv

Before choosing a year, let's have a quick look at which years are actually present in the data, so we don't waste time on unrepresented ones. We'll use `cut` to return only column 6 of the CSV, `grep` to extract the four digits of the year, `sort` and `uniq` to remove duplicates, and `paste` to lay the years out five to a row so they are easier to read.

In [None]:
!cut -d, -f6 linnean-society.csv | grep -Eo "[0-9][0-9][0-9][0-9]" | sort | uniq | paste - - - - -
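To see what each stage contributes, here is a minimal sketch that runs the same pipeline on a few made-up rows. The column names, the rows and the date format are assumptions for illustration only, not taken from the real file. It is plain shell, so it can be run in a terminal.

```shell
# Hypothetical sample: a header plus three rows, with the date in column 6.
sample='id,sender_surname,sender_forename,recipient_surname,recipient_forename,date
17,Smith,James,Banks,Joseph,1811-03-02
18,Smith,James,Banks,Joseph,1811-07-19
19,Banks,Joseph,Smith,James,1794-01-05'

# Keep column 6, extract each run of four digits, then remove duplicates.
# The header has no digits in column 6, so grep drops it.
years=$(printf '%s\n' "$sample" | cut -d, -f6 | grep -Eo "[0-9][0-9][0-9][0-9]" | sort | uniq)

# paste with two dashes lays the years out two to a row, separated by tabs.
row=$(printf '%s\n' "$years" | paste - -)
printf '%s\n' "$row"
```

The notebook cell uses five dashes rather than two, which gives five years per row.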

Enter one year from the list above inside the double quotes:

In [None]:
year = "1811"

Now we'll use `grep` to get all the lines for this year. `grep` returns every line of a file that matches a pattern. Here the pattern is the year preceded by a comma and followed by a dash, which helps to exclude false positives such as the same four digits appearing elsewhere in a line:

In [None]:
!grep ",$year-" linnean-society.csv
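The effect of the comma and dash can be sketched on two made-up rows, one of which has `1811` as an ID rather than as a year. The rows are hypothetical. Note also that `year` is a shell variable in this sketch, whereas in the notebook IPython substitutes the Python variable `year` into the command.

```shell
# Hypothetical rows: the second has "1811" in column 1, not in the date.
sample='17,Smith,James,Banks,Joseph,1811-03-02
1811,Banks,Joseph,Smith,James,1794-01-05'

year=1811

# grep -c prints the number of matching lines.
loose=$(printf '%s\n' "$sample" | grep -c "$year")      # bare year: both rows match
strict=$(printf '%s\n' "$sample" | grep -c ",$year-")   # comma and dash: only the 1811 letter
echo "loose=$loose strict=$strict"
```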

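Now we want to cut out columns 2 and 4, which contain the surnames of the sender and the recipient. By default, the column separator for the `cut` command is a tab, so we set the delimiter to a comma instead. Note that `cut` splits on every comma, including any inside quoted fields, so this relies on the first four columns containing no embedded commas: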

In [None]:
!grep ",$year-" linnean-society.csv | cut -d, -f2,4
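A quick way to see what `-d,` changes is to run `cut` with and without it on a single row. The row below is a hypothetical example, not taken from the real file.

```shell
line='17,Smith,James,Banks,Joseph,1811-03-02'

# Without -d, cut looks for tabs; this line has none, so it is passed through whole.
no_delim=$(printf '%s\n' "$line" | cut -f2,4)

# With -d, the line is split on commas and only fields 2 and 4 are kept.
with_delim=$(printf '%s\n' "$line" | cut -d, -f2,4)

echo "$no_delim"
echo "$with_delim"
```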

To get the counts for edges we count the number of occurrences of each unique line. `uniq -c` only counts duplicates that sit next to each other, so the lines are sorted first:

In [None]:
!grep ",$year-" linnean-society.csv | cut -d, -f2,4 | sort | uniq -c
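Why `sort` has to come before `uniq -c` can be shown with three made-up sender and recipient pairs, two of them identical but not adjacent. The names are hypothetical.

```shell
pairs='Smith,Banks
Banks,Smith
Smith,Banks'

# uniq only merges adjacent duplicates, so the unsorted input keeps all three lines.
unsorted=$(printf '%s\n' "$pairs" | uniq -c | wc -l | tr -d ' ')

# After sorting, the two identical lines sit together and collapse into one.
sorted=$(printf '%s\n' "$pairs" | sort | uniq -c | wc -l | tr -d ' ')

echo "without sort: $unsorted lines; with sort: $sorted lines"
```

Note that `Smith,Banks` and `Banks,Smith` remain separate lines, so letters are counted per direction.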

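Finally we need to turn this into CSV. `uniq -c` puts the count first, separated from the rest of the line by spaces, so we use a `perl` substitution to move the count to the end, after a comma (otherwise Gephi may take it for an ID). We'll also remove the `"` marks using `tr`.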

In [None]:
!grep ",$year-" linnean-society.csv | cut -d, -f2,4 | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/'
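The reshaping step can be tried on a single line shaped like the output of `uniq -c`. This is only a sketch: the sample line is made up, and `sed -E` is swapped in for `perl` purely to keep the example free of dependencies. The pattern is the same apart from `^ *`, which also accepts a count with no leading padding.

```shell
# Hypothetical line as uniq -c might print it: padded count, a space, then the pair.
counted='      2 "Smith","Banks"'

# Delete the quote marks, then rewrite "count pair" as "pair,count".
edge=$(printf '%s\n' "$counted" | tr -d '"' | sed -E 's/^ *([0-9]+) +(.+)$/\2,\1/')
echo "$edge"
```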

If this all looks good we can write it out to a file. If you get very few (or no) matches, try a different year by resetting the `year` variable in the cell towards the top of this notebook and running the commands again. Before writing the data we'll first write the headings that Gephi (but not Flourish) requires for an edges table; the data is then appended below them.

In [None]:
!echo "source,target,count" > "$year-edges.csv"
!grep ",$year-" linnean-society.csv | cut -d, -f2,4 | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/' >> "$year-edges.csv"
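The difference between `>` and `>>` in the two commands above can be sketched with a scratch file. The file name and the rows are hypothetical, and the file lives in a temporary directory so nothing real is overwritten.

```shell
tmpdir=$(mktemp -d)                    # scratch directory for the demo
out="$tmpdir/demo-edges.csv"

echo "source,target,count" > "$out"    # > creates the file, or empties it if it exists
echo "Smith,Banks,2" >> "$out"         # >> appends below what is already there
echo "Banks,Smith,1" >> "$out"

first_line=$(head -n 1 "$out")
line_count=$(wc -l < "$out" | tr -d ' ')
cat "$out"

rm -rf "$tmpdir"                       # tidy up
```

Because the header is written with `>`, rerunning both commands for the same year starts the file afresh rather than duplicating rows.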

To get nodes with counts we follow a similar procedure, but combine the two columns into one. To stack one column under the other, the two `grep | cut` pipelines are run one after the other (separated by a semicolon) and wrapped in round brackets, so that the output of both is passed as a single stream to the commands that follow.

In [None]:
!(grep ",$year-" linnean-society.csv | cut -d, -f2; grep ",$year-" linnean-society.csv | cut -d, -f4) | sort | uniq -c
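Here is the same grouping idea on three made-up letters, as a minimal sketch. The rows are hypothetical, and `printf` stands in for `grep` on the real file.

```shell
sample='17,Smith,James,Banks,Joseph,1811-03-02
18,Banks,Joseph,Smith,James,1811-07-19
19,Smith,James,Turner,Dawson,1811-09-30'

# The round brackets group both commands, so sort receives one stream:
# every sender (column 2) followed by every recipient (column 4).
nodes=$( (printf '%s\n' "$sample" | cut -d, -f2; printf '%s\n' "$sample" | cut -d, -f4) | sort | uniq -c )
echo "$nodes"
```

Without the brackets, only the output of the second command would be piped into `sort`, and the senders would be printed uncounted.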

Again we'll move the counts to the end and insert a comma to produce valid CSV.

In [None]:
!(grep ",$year-" linnean-society.csv | cut -d, -f2; grep ",$year-" linnean-society.csv | cut -d, -f4) | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/'

If all looks good we can again write this to a file:

In [None]:
!(grep ",$year-" linnean-society.csv | cut -d, -f2; grep ",$year-" linnean-society.csv | cut -d, -f4) | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/' > "$year-nodes.csv"
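One optional sanity check, sketched here on made-up data rather than on the real files: every letter contributes one sender and one recipient, so the node counts should add up to twice the edge counts. The contents below are hypothetical, and the check assumes the count is the last comma-separated field on each line.

```shell
# Hypothetical contents of an edges file (with header) and a nodes file (without).
edges='source,target,count
Banks,Smith,1
Smith,Banks,1
Smith,Turner,1'
nodes='Banks,2
Smith,3
Turner,1'

# Sum the last field of every line, skipping the header line of the edges table.
edge_total=$(printf '%s\n' "$edges" | awk -F, 'NR > 1 { sum += $NF } END { print sum }')
node_total=$(printf '%s\n' "$nodes" | awk -F, '{ sum += $NF } END { print sum }')
echo "letters: $edge_total, node mentions: $node_total"
```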