# Notebook to the poster: 'Python and Pandas for Papyrology'
#### by Audric Wannaz

### Welcome to this Jupyter Notebook! To run a cell, press shift+enter or click the button >/ Run on the toolbar

### If you are new to this interface, you may want to just click "Kernel" > "Restart and run all" and go directly to the part 'Explore the data' (cell 12 onward)

This notebook explains in detail how the Network Analysis results displayed on the poster 'Python and Pandas for Papyrology?' were obtained. The code is not always optimal or Pythonic, but it works. Feel free to use it, optimize it, or push back your own version.

# Introduction

We want to reproduce the Network Analysis of a biblical scene presented by Czachesz 2016 (see poster for the full quote), but with different material: Greek family letters from Graeco-Roman Egypt (in the final visualisations, we focus on texts from the Arsinoite and Oxyrhynchite nomes from 99 BC to 299 AD). In short, we will repeat the following procedure for each text:
1. We lemmatize the text
2. We remove stopwords from it
3. Each word is represented by a node
4. Each node is connected to the word coming before it AND to the word coming after it
5. We merge the nodes when they represent the same (lemmatized) word

At this point, you could visualise the network (for example with Palladio https://hdlab.stanford.edu/palladio/) but instead, we will extract some network features to compare between the texts of our corpus:
A. Number of nodes
B. Degree
C. Betweenness

(you could of course extend this list with other features, e.g. Eigenvector)
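
The procedure and the three features can be sketched on a toy example. This is only a minimal illustration, assuming a made-up list of five lemmas that is already lemmatized and free of stopwords; it is not taken from the corpus, and the variable names are chosen here for illustration.

```python
import networkx as nx

# Hypothetical toy text, already lemmatized and without stopwords (not from the corpus)
lemmas = ["ἀδελφός", "χαίρω", "γράφω", "ἀδελφός", "ἀσπάζομαι"]

# Steps 3-5: every word is linked to its neighbour; identical lemmas share one node
G = nx.Graph()
G.add_edges_from(zip(lemmas, lemmas[1:]))

# Features A-C
n_nodes = G.number_of_nodes()
degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)

top_degree = max(degree, key=degree.get)
top_betweenness = max(betweenness, key=betweenness.get)
print(n_nodes, top_degree, degree[top_degree], top_betweenness)
```

Because the graph is undirected, linking each word to its successor already covers the link to its predecessor.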




To make things clearer, some steps were performed in advance. Those were:

1. Scraping of 148 texts and metadata tagged as family letters (https://papyri.info/search?STRING=(family+OR+famille+OR+Familie+OR+famiglia)&no_caps=on&no_marks=on&target=metadata&DATE_MODE=LOOSE&DOCS_PER_PAGE=15&STRING1=letter+OR+Brief+OR+lettre+OR+lettera&target1=METADATA&no_caps1=on&no_marks1=on&LANG=grc)
2. Storing the needed information in a csv

## Setup

In [1]:
# first, we import the needed modules

import pandas as pd # to visualize and manipulate our data
import csv # to load the data
import networkx as nx # to perform the actual Network Analysis (not visualise it!)
from operator import itemgetter # to get values from single nodes 
import ast # to parse the stringified lemma list back into a Python list
import seaborn as sns #to colorize our final df




## Functions

In [2]:
# skip

def do_all(a_str):
    # split the text on spaces and store one word per row in a 'lemma' column
    a_list = a_str.split(' ')
    a_df = pd.DataFrame({'lemma': a_list})
    
    return a_df

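def make_df2(a_df):
    # build the edge list: each word is linked to the word after it and the word before it
    source_list = []
    target_list = []
    for index, v in a_df.lemma.items():
        try:
            target_list.append(a_df.lemma[index+1])
            source_list.append(v)
        except (KeyError, IndexError): # last word: there is no next word
            continue
    
        try:
            target_list.append(a_df.lemma[index-1])
            source_list.append(v)
        except (KeyError, IndexError): # first word: there is no previous word
            continue

    # the loop leaves the last word before linking it to its predecessor, so we add that edge here
    target_list.append(a_df.lemma.iloc[-2])
    source_list.append(a_df.lemma.iloc[-1])
    edges = [source_list, target_list]
    df2 = pd.DataFrame(edges).transpose()
    df2 = df2.rename(columns ={0:'source', 1:'target'})
    return df2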

def listToString(s):  
    
    str1 = " "   
    return (str1.join(s))

def make_string(x):
    x = ast.literal_eval(x)
    x = (listToString(x))
    return x

#

# this last function performs the actual NA

def from_string_toNADF(a_str, choice):
    testdf = do_all(a_str)
    testdf2 = make_df2(testdf)

    testdf.to_csv('nodesdepassage.csv', index=False)
    testdf2.to_csv('edgesdepassage.csv', index=False)

    # from csv to graph part
    with open('nodesdepassage.csv', encoding='utf-8') as nodecsv:
        nodereader = csv.reader(nodecsv) # Read the csv
        # Retrieve the data (the list comprehension and the slice drop the header row)
        nodes = [n for n in nodereader][1:]

        node_names = [n[0] for n in nodes] # Get a list of only the node names

    with open('edgesdepassage.csv', encoding='utf-8') as edgecsv: # Open the file
        edgereader = csv.reader(edgecsv) # Read the csv
        edges = [tuple(e) for e in edgereader][1:] # Retrieve the data
    
    a = len(node_names) # number of words in the text (a repeated lemma is counted every time)


# NA part
    G = nx.Graph()
    G.add_nodes_from(node_names)
    G.add_edges_from(edges)
    infos = str(G) # short summary of the graph (nx.info was removed in NetworkX 3.0)
    degree_dict = dict(G.degree(G.nodes()))
    nx.set_node_attributes(G, degree_dict, 'degree')
    sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)
    b = sorted_degree[0][0]
    c = sorted_degree[0][1]
    betweenness_dict = nx.betweenness_centrality(G) # Run betweenness centrality
    #eigenvector_dict = nx.eigenvector_centrality(G) # Run eigenvector centrality

    # Assign each to an attribute in your network
    nx.set_node_attributes(G, betweenness_dict, 'betweenness')
    #nx.set_node_attributes(G, eigenvector_dict, 'eigenvector')
    sorted_betweenness = sorted(betweenness_dict.items(), key=itemgetter(1), reverse=True)
    d = sorted_betweenness[0][0]
    e = sorted_betweenness[0][1]

    #sorted_eigen = sorted(eigenvector_dict.items(), key=itemgetter(1), reverse=True)
    #e = sorted_eigen[0]
    G.clear()
    
    if choice == 'a':
        return a
    elif choice == 'b':
        return b
    elif choice == 'c':
        return c
    elif choice == 'd':
        return d
    elif choice == 'e':
        return e
    else:
        return('Wrong choice!')
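
The edge list that `make_df2` builds can also be written with `zip`. The following is only a sketch of the same idea, not the code used for the poster; the helper name `neighbour_edges` is hypothetical. It yields the same (source, target) pairs as `make_df2`, in a different order.

```python
def neighbour_edges(words):
    """Return (source, target) pairs linking each word to its next and previous word."""
    forward = list(zip(words, words[1:]))        # word -> following word
    backward = [(b, a) for a, b in forward]      # word -> preceding word
    return forward + backward

# Made-up three-word text for illustration
print(neighbour_edges(["a", "b", "c"]))
# → [('a', 'b'), ('b', 'c'), ('b', 'a'), ('c', 'b')]
```

In an undirected `nx.Graph`, the forward pairs alone would produce the same network, since `('a', 'b')` and `('b', 'a')` describe the same edge.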

## Load the data into pandas

In [3]:
# We will load a csv table 

df1 = pd.read_csv('dflem.csv')
df1['OBJECT']= df1.LEMMAS.apply(lambda x: make_string(x))

# we will remove the Greek stopwords according to the .txt file of the Classical Language Toolkit (cltk)
with open("stopwords_greek.txt", encoding='utf-8') as f:
    cltk_stopwords = f.read()
cltk_stopwords = cltk_stopwords.replace('\n', ' ')

def stop_them(a_str):
    a_list = a_str.split(' ')
    # note: cltk_stopwords is one long string, so `in` is a substring test,
    # and removing items while iterating skips the word that follows each removal
    for e in a_list:
        if e in cltk_stopwords:
            a_list.remove(e)
    
    a_str = listToString(a_list)
    return a_str

df1['OBJECT']= df1.OBJECT.apply(lambda x: stop_them(x))

# we only keep the texts that are not dated to the following centuries 
# CENTURY = 0 means the text was not dated in the scraped XML!

excluded_centuries = [-3, -2, 4, 5, 6, 7, 8]
dt = df1[~df1.CENTURY.isin(excluded_centuries)]

# we remove the texts that are not clearly tagged as coming from the two observed nomes (Arsinoites and Oxyrhynchites)
dt = dt[dt.PLACE != 'other']

# we drop the columns that are no longer needed and renumber the rows
dt = dt.drop(columns=['DATA', 'Unnamed: 0', 'Unnamed: 0.1', 'METADATA', 'TOKENS', 'LEMMAS'])
dt = dt.reset_index(drop=True)
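
The function `stop_them` above checks each word against one long string and removes items from the list while looping over it, so a word can be dropped because it is part of a longer stopword, and a stopword can survive because the loop skips it. A whole-word filter could look like the sketch below. It is an alternative, not the code behind the results shown in this notebook, and using it would change the figures in the tables that follow. The function name and the three-word stop list are hypothetical; the notebook reads the real list from `stopwords_greek.txt`.

```python
def remove_stopwords(text, stopwords):
    """Drop every word of `text` that is exactly equal to an entry of `stopwords`."""
    return " ".join(word for word in text.split(" ") if word not in stopwords)

# Hypothetical mini stop list, for illustration only
stopwords = {"καί", "ὁ", "δέ"}

print(remove_stopwords("ὁ ἀδελφός καί ὁ πατήρ", stopwords))
# → ἀδελφός πατήρ
```

With the real file, the set could be built with something like `set(cltk_stopwords.split())`, assuming the file holds one stopword per line or whitespace-separated entries.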



In [4]:
# you can already have a deeper look at our data and metadata here, but the data is still quite messy
df1

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,NAME,DATA,METADATA,PLACE,CENTURY,TOKENS,LEMMAS,OBJECT
0,0,0,/apf;63;138,ταρία τῷ ἀδελφῷ μου ἀπολλῷ χαίρειν ἐβάσταζέ μ...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",arsi,2,"['ταρία', 'τῷ', 'ἀδελφῷ', 'μου', 'ἀπολλῷ', 'χα...","['ταρία', 'ἀδελφός', 'ἐγώ', 'ἀπόλάω1', 'χαίρω'...",ταρία ἀδελφός ἀπόλάω1 χαίρω ἐβάσταζέ παραδοῦνέ...
1,1,1,/basp;49;83,αὐρήλειος πωλείον στρατειώτης λεγειῶνος β βοη...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",other,3,"['αὐρήλειος', 'πωλείον', 'στρατειώτης', 'λεγει...","['αὐρήλειος', 'πωλείον', 'στρατειώτης', 'λεγει...",αὐρήλειος πωλείον στρατειώτης λεγειῶνος βοηθέω...
2,2,2,/basp;49;135,πανίσκος ἀϊῶνι τῷ ἀδελφῶι πολλὰ χαίρειν πρὸ μ...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",other,3,"['πανίσκος', 'ἀϊῶνι', 'τῷ', 'ἀδελφῶι', 'πολλὰ'...","['πανίσκος', 'ἀϊῶνι', 'ἀδελφός', 'πολλὰ', 'χαί...",πανίσκος ἀϊῶνι ἀδελφός χαίρω πᾶς εὔχομέ ὁλόκλ...
3,3,3,/bgu;1;261,θερμουθᾶς ἀπολιναρίῳ τῷ ἀδελφῷ πλεῖστα χαίρει...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",arsi,0,"['θερμουθᾶς', 'ἀπολιναρίῳ', 'τῷ', 'ἀδελφῷ', 'π...","['θερμουθᾶς', 'ἀπολιναρίῳ', 'ἀδελφός', 'πλεῖσ...",θερμουθᾶς ἀπολιναρίῳ ἀδελφός πλεῖστος χαίρω γ...
4,4,4,/bgu;3;822,θερμουτᾶς ἀπολιναρίῳ τῶι ἀδελφῷ χαίρειν γινωσ...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",arsi,0,"['θερμουτᾶς', 'ἀπολιναρίῳ', 'τῶι', 'ἀδελφῷ', '...","['θερμουτᾶς', 'ἀπολιναρίῳ', 'ἀδελφός', 'χαίρω'...",θερμουτᾶς ἀπολιναρίῳ ἀδελφός χαίρω γινωσκιν ἐθ...
...,...,...,...,...,...,...,...,...,...,...
130,130,130,/sb;16;12980,διὰ τοῦ παρόντος ἡμετέρου γράμματος πλεῖστα ὡ...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",other,6,"['διὰ', 'τοῦ', 'παρόντος', 'ἡμετέρου', 'γράμμα...","['πάρειμι1', 'ἡμέτερος', 'γράμμα', 'πλεῖστος'...",πάρειμι1 γράμμα πλεῖστος παρὼν προσκυνέω ἀσπά...
131,131,131,/sb;16;12981,ἶσις σεραπίωνι τῶι ἀδελφῶι πολλὰ χαίρειν τὸ π...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",other,2,"['ἶσις', 'σεραπίωνι', 'τῶι', 'ἀδελφῶι', 'πολλὰ...","['ἶσις', 'σεραπίωνι', 'ἀδελφός', 'πολλὰ', 'χαί...",ἶσις σεραπίωνι ἀδελφός χαίρω προσκύνημά ποιέω ...
132,132,132,/sb;20;14132,πτολέμα βελήουτι τῇ μητρεὶ καὶ κυρί ᾳ πλεῖστα...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",other,1,"['πτολέμα', 'βελήουτι', 'τῇ', 'μητρεὶ', 'καὶ',...","['πτολέμα', 'βελήουτι', 'μητρεὶ', 'κυρί', 'ᾳ',...",πτολέμα βελήουτι μητρεὶ κυρί πλεῖστος χαίρω π...
133,133,133,/sb;22;15768,ὑπατείας τῶν δεσποτῶν ἡμῶν ἰοουινιανοῦ αἰωνίο...,"[<table class=""metadata"">\n<tbody>\n<tr>\n<th ...",other,4,"['ὑπατείας', 'τῶν', 'δεσποτῶν', 'ἡμῶν', 'ἰοουι...","['ὑπατεία', 'δεσπότης', 'ἐγώ', 'ἰοουινιανοῦ', ...",ὑπατεία δεσπότης ἰοουινιανοῦ αἰώνιος αὔγουστ...


In [5]:
# this DataFrame is ready for the Network Analysis
dt

Unnamed: 0,NAME,PLACE,CENTURY,OBJECT
0,/apf;63;138,arsi,2,ταρία ἀδελφός ἀπόλάω1 χαίρω ἐβάσταζέ παραδοῦνέ...
1,/bgu;1;261,arsi,0,θερμουθᾶς ἀπολιναρίῳ ἀδελφός πλεῖστος χαίρω γ...
2,/bgu;3;822,arsi,0,θερμουτᾶς ἀπολιναρίῳ ἀδελφός χαίρω γινωσκιν ἐθ...
3,/p.bas;2;15,arsi,1,πᾶσις ὀρσενούφι ἀδελφός χαίρω γυναῖκά ἐγλάνβα...
4,/p.corn;;49,arsi,1,διογείνης θερμουθᾶτι μητρεὶ κυρείᾳ χαίρω πᾶς ...
5,/p.fay;;124,arsi,2,θεογίτων ἀπολλώνιος χαίρω πάλειν γράφιν μαι ἔρ...
6,/p.mich;3;201,arsi,1,ἀντώνις ἀποληείῳ οὐαλε ριᾶτι ἀμφωταίροις χαίρι...
7,/p.mich;3;202,arsi,2,οὐαλερεία θερμουθᾶς δύο θερμουτεί ἀδελφή χαίρι...
8,/p.mich;3;203,arsi,2,σατορνῖλος ἀφροδοῦτι μητρεὶ πλεῖστος χαίρω πα...
9,/p.mich;3;209,arsi,2,σατορνῖλος σεμπρώνιος ἀδελφός κύριος πλεῖστος...


In [6]:
# we extract 5 pieces of information about this Network

dt['#nodes'] = dt.OBJECT.apply(lambda x: from_string_toNADF(x, 'a'))
dt['topdegree'] = dt.OBJECT.apply(lambda x: from_string_toNADF(x, 'b'))
dt['degree#'] = dt.OBJECT.apply(lambda x: from_string_toNADF(x, 'c'))
dt['topbetweenness'] = dt.OBJECT.apply(lambda x: from_string_toNADF(x, 'd'))
dt['betweenness#'] = dt.OBJECT.apply(lambda x: from_string_toNADF(x, 'e'))
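
The cell above calls `from_string_toNADF` five times per text, so each graph is rebuilt five times. A possible single-pass variant is sketched below, assuming a hypothetical helper `network_features` and a made-up two-row frame `toy` standing in for `dt`. It builds the graph in memory, without the intermediate csv files, and keeps the notebook's convention that `#nodes` is the number of words in the text.

```python
import networkx as nx
import pandas as pd

def network_features(text):
    """Hypothetical helper: build the graph once and return all five features."""
    words = text.split(' ')
    G = nx.Graph()
    G.add_nodes_from(words)
    G.add_edges_from(zip(words, words[1:]))  # link each word to the next one
    degree = dict(G.degree())
    betweenness = nx.betweenness_centrality(G)
    top_degree = max(degree, key=degree.get)
    top_betweenness = max(betweenness, key=betweenness.get)
    return {
        '#nodes': len(words),  # word count, as in from_string_toNADF
        'topdegree': top_degree,
        'degree#': degree[top_degree],
        'topbetweenness': top_betweenness,
        'betweenness#': betweenness[top_betweenness],
    }

# Toy frame standing in for dt (the texts are made up)
toy = pd.DataFrame({'OBJECT': ['α β γ α δ', 'α β']})
features = pd.DataFrame([network_features(t) for t in toy.OBJECT], index=toy.index)
toy = toy.join(features)
print(toy)
```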

In [7]:
dt

Unnamed: 0,NAME,PLACE,CENTURY,OBJECT,#nodes,topdegree,degree#,topbetweenness,betweenness#
0,/apf;63;138,arsi,2,ταρία ἀδελφός ἀπόλάω1 χαίρω ἐβάσταζέ παραδοῦνέ...,42,οὐ,4,οὐ,0.565315
1,/bgu;1;261,arsi,0,θερμουθᾶς ἀπολιναρίῳ ἀδελφός πλεῖστος χαίρω γ...,90,αὐτός,6,αὐτός,0.667058
2,/bgu;3;822,arsi,0,θερμουτᾶς ἀπολιναρίῳ ἀδελφός χαίρω γινωσκιν ἐθ...,87,εὑρίσκω,8,εὑρίσκω,0.661013
3,/p.bas;2;15,arsi,1,πᾶσις ὀρσενούφι ἀδελφός χαίρω γυναῖκά ἐγλάνβα...,25,εὐθύς,4,θιν,0.521739
4,/p.corn;;49,arsi,1,διογείνης θερμουθᾶτι μητρεὶ κυρείᾳ χαίρω πᾶς ...,29,πᾶς,4,ἀσπάζομαι,0.588933
5,/p.fay;;124,arsi,2,θεογίτων ἀπολλώνιος χαίρω πάλειν γράφιν μαι ἔρ...,61,πάλειν,6,πάλειν,0.504148
6,/p.mich;3;201,arsi,1,ἀντώνις ἀποληείῳ οὐαλε ριᾶτι ἀμφωταίροις χαίρι...,67,αὐτός,4,αὐτός,0.60973
7,/p.mich;3;202,arsi,2,οὐαλερεία θερμουθᾶς δύο θερμουτεί ἀδελφή χαίρι...,70,μέλλις,6,μέλλις,0.50806
8,/p.mich;3;203,arsi,2,σατορνῖλος ἀφροδοῦτι μητρεὶ πλεῖστος χαίρω πα...,233,ἔρχομαι,15,πᾶς,0.477891
9,/p.mich;3;209,arsi,2,σατορνῖλος σεμπρώνιος ἀδελφός κύριος πλεῖστος...,109,ἀδελφός,10,ἀδελφός,0.435603


In [8]:
del dt['OBJECT']

In [9]:
def cutone(a_str):
    return a_str[1:]

In [10]:
dt.NAME = dt.NAME.apply(lambda x: cutone(x))

In [11]:
cm = sns.light_palette("red", as_cmap=True)

s = dt.sort_values(by='degree#', ascending=False).head(20).style.background_gradient(cmap=cm)

# Explore the data

In [12]:
s

# At last we can explore the cleaned-up data and try to interpret it
# Can we see a correlation between network features and place or century? 
# Are we surprised by the words with the highest degree and betweenness?

Unnamed: 0,NAME,PLACE,CENTURY,#nodes,topdegree,degree#,topbetweenness,betweenness#
34,sb;5;7572,arsi,0,86,κὲ,22,κὲ,0.852726
8,p.mich;3;203,arsi,2,233,ἔρχομαι,15,πᾶς,0.477891
32,p.wisc;2;68,arsi,0,63,δραχμὰς,11,δραχμὰς,0.416152
19,p.oxy;14;1649,oxy,0,128,περιέχω,11,περιέχω,0.548262
9,p.mich;3;209,arsi,2,109,ἀδελφός,10,ἀδελφός,0.435603
30,p.tebt;3.1;816,arsi,2,175,πτολεμαῖος,10,πτολεμαῖος,0.329538
11,p.mich;8;476,arsi,2,169,οὐ,10,οὐ,0.42734
41,sb;14;11899,oxy,2,141,ὀφείλω,9,παρʼ,0.374064
36,sb;8;9882,arsi,2,92,δραχμὰς,9,δραχμὰς,0.403578
2,bgu;3;822,arsi,0,87,εὑρίσκω,8,εὑρίσκω,0.661013
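
One way to start on the correlation question is a correlation matrix of the numeric columns. The sketch below is only an illustration: it uses the ten rows displayed in the output of cell 7, not the full set of 45 texts, so its numbers say nothing definitive about the corpus. The names `sample` and `dated` are chosen here for illustration.

```python
import pandas as pd

# The ten rows displayed in the output of cell 7 (not the full 45-text table)
sample = pd.DataFrame({
    'CENTURY': [2, 0, 0, 1, 1, 2, 1, 2, 2, 2],
    '#nodes': [42, 90, 87, 25, 29, 61, 67, 70, 233, 109],
    'degree#': [4, 6, 8, 4, 4, 6, 4, 6, 15, 10],
    'betweenness#': [0.565315, 0.667058, 0.661013, 0.521739, 0.588933,
                     0.504148, 0.60973, 0.50806, 0.477891, 0.435603],
})

# CENTURY == 0 marks undated texts, so we leave them out before correlating with date
dated = sample[sample.CENTURY != 0]

# Pearson correlation between every pair of numeric columns
corr = dated.corr()
print(corr)
```

On the real data, the same two lines can be run on `dt` after selecting its numeric columns.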


In [13]:
dt.describe()

Unnamed: 0,CENTURY,#nodes,degree#,betweenness#
count,45.0,45.0,45.0,45.0
mean,1.666667,71.666667,6.466667,0.503178
std,1.044466,42.93547,3.539389,0.113937
min,0.0,11.0,2.0,0.3
25%,1.0,49.0,4.0,0.424077
50%,2.0,64.0,6.0,0.50806
75%,2.0,86.0,8.0,0.567005
max,3.0,233.0,22.0,0.852726


You can remove the # symbol at the start of a line and then run the cell to explore the data. Of course, you can also adapt the code to your needs and ask your own questions.

In [14]:
#dt[dt.topdegree != dt.topbetweenness]

In [15]:
#dt.sort_values(by='degree#', ascending=False).head(10)

In [16]:
#dt.sort_values(by='betweenness#', ascending=False).head(10)

In [17]:
#dt.sort_values(by='#nodes', ascending=False).head(45)

In [18]:
dt.groupby(by='CENTURY').mean(numeric_only=True) # numeric_only skips the text columns (required in pandas 2.0 and later)

CENTURY,#nodes,degree#,betweenness#
0,70.0,8.5,0.553069
1,41.0,4.0,0.508337
2,82.909091,6.590909,0.486437
3,59.666667,5.0,0.486371


In [19]:
dt.groupby(by='PLACE').mean(numeric_only=True) # numeric_only skips the text columns (required in pandas 2.0 and later)

PLACE,CENTURY,#nodes,degree#,betweenness#
arsi,1.675,68.775,6.25,0.511721
oxy,1.6,94.8,8.2,0.434834


In [20]:
dt.iloc[:1:3]

Unnamed: 0,NAME,PLACE,CENTURY,#nodes,topdegree,degree#,topbetweenness,betweenness#
0,apf;63;138,arsi,2,42,οὐ,4,οὐ,0.565315


In [21]:
# highest degree relative to the size of the text
dt['relative_degree'] = dt['degree#'] / dt['#nodes']

In [22]:
dt

Unnamed: 0,NAME,PLACE,CENTURY,#nodes,topdegree,degree#,topbetweenness,betweenness#,relative_degree
0,apf;63;138,arsi,2,42,οὐ,4,οὐ,0.565315,0.095238
1,bgu;1;261,arsi,0,90,αὐτός,6,αὐτός,0.667058,0.066667
2,bgu;3;822,arsi,0,87,εὑρίσκω,8,εὑρίσκω,0.661013,0.091954
3,p.bas;2;15,arsi,1,25,εὐθύς,4,θιν,0.521739,0.16
4,p.corn;;49,arsi,1,29,πᾶς,4,ἀσπάζομαι,0.588933,0.137931
5,p.fay;;124,arsi,2,61,πάλειν,6,πάλειν,0.504148,0.098361
6,p.mich;3;201,arsi,1,67,αὐτός,4,αὐτός,0.60973,0.059701
7,p.mich;3;202,arsi,2,70,μέλλις,6,μέλλις,0.50806,0.085714
8,p.mich;3;203,arsi,2,233,ἔρχομαι,15,πᾶς,0.477891,0.064378
9,p.mich;3;209,arsi,2,109,ἀδελφός,10,ἀδελφός,0.435603,0.091743


In [23]:
dt.sort_values(by='relative_degree', ascending=False).head(45)

Unnamed: 0,NAME,PLACE,CENTURY,#nodes,topdegree,degree#,topbetweenness,betweenness#,relative_degree
34,sb;5;7572,arsi,0,86,κὲ,22,κὲ,0.852726,0.255814
20,p.oxy;58;3919,oxy,0,28,δραχμὰς,6,διαπέμπω,0.45671,0.214286
42,sb;14;12082,arsi,3,11,πατήρ,2,ἀνάπτω,0.555556,0.181818
32,p.wisc;2;68,arsi,0,63,δραχμὰς,11,δραχμὰς,0.416152,0.174603
31,p.tebt;3.2;948,arsi,2,18,ποσειδώνιος,3,ποσειδώνιος,0.362637,0.166667
3,p.bas;2;15,arsi,1,25,εὐθύς,4,θιν,0.521739,0.16
28,p.tebt;3.1;760,arsi,3,53,δίδωμι,8,γράφω,0.486179,0.150943
44,sb;26;16578,arsi,2,40,αὐτός,6,αὐτός,0.705026,0.15
17,p.mich.mchl;;23,arsi,0,28,χαίρω,4,νεμεσίωνι,0.3,0.142857
4,p.corn;;49,arsi,1,29,πᾶς,4,ἀσπάζομαι,0.588933,0.137931


As you can tell, there are a lot of aspects of this network that could be investigated in detail. I plan to publish the most interesting parts of my exploration of this kind of network for my source material, so stay tuned!