## This Jupyter Notebook creates useful visualizations for the following statistics about Vasari's *The Lives*:

* Total number of cooccurrences accumulated over all volumes = 673 ✅
* Total number of cooccurrences per volume (1-10) ✅
* Number of NEs (persons, artworks, events, etc.) per volume ✅
* Persons and their frequency for different volumes they appear in ✅
* Number of cooccurrences per person (top 10)  ✅
* Top 10 most frequent cooccurrences ✅
* Top 10 least frequently occurring persons in the cooccurrences (overall) ✅
* Persons in index vs. persons sharing cooccurrences



In [1]:
import ast


import pandas as pd
import plotly.express as px

ModuleNotFoundError: No module named 'plotly'

#### The following code will calculate the number of cooccurrences for each volume with summarised entries for pages and paragraphs.
Using pandas, we read the CSV data into a DataFrame and take a first peek at the data structure.

In [2]:
# Pass the path directly so pandas opens and closes the file itself
df = pd.read_csv('../data/results/pmi_tables/9.csv')
df

Unnamed: 0,x,p_x,y,p_y,p_yx,pmi_yx,pages,paragraphs,volume
0,"Galassi, Galasso (Galasso Ferrarese)",0.000661,Cristofano,0.000661,1.000000,10.564149,[104],['112'],['1']
1,"Galassi, Galasso (Galasso Ferrarese)",0.000661,Simone,0.000661,1.000000,10.564149,[104],['112'],['1']
2,Cristofano,0.000661,Simone,0.000661,1.000000,10.564149,[104],['112'],['1']
3,Polycletus,0.000661,"Ghiberti, Bonaccorso",0.000661,1.000000,10.564149,[160],['181'],['1']
4,"Avanzi, Jacopo (Jacopo Davanzo)",0.000661,"Verona, Sebeto da",0.000661,1.000000,10.564149,[55],['60'],['3']
...,...,...,...,...,...,...,...,...,...
474,"Sanzio, Raffaello (Raffaello da Urbino)",0.058124,"Beccafumi, Domenico (Domenico di Pace)",0.009908,0.011364,0.197827,[236],['355'],['5']
475,"Urbino, Bramante da",0.015852,"Bandinelli, Baccio (Baccio de' Brandini)",0.037649,0.041667,0.146297,[190],['298'],['8']
476,"Sarto, Andrea del (Andrea d' Agnolo)",0.027741,Tribolo (Niccolò),0.021797,0.023810,0.127438,[4],['2'],['6']
477,"Buonarroti, Michelagnolo",0.116248,"Ghirlandajo, Ridolfo",0.010568,0.011364,0.104718,"[3, 95]","['1', '137']","['7', '7']"


Next, we create one row for each accumulated cooccurrence.
### What are accumulated cooccurrences and how do we identify them?
As you can see, some rows contain lists with several entries for pages, paragraphs and volumes. Such a pair of persons cooccurs more than once, either in the same volume or in different volumes. Each page and paragraph entry belongs to the volume entry at the same position in its list.
In row 221, for example, Ghirlandajo and Granacci share one cooccurrence in volume 5 and one in volume 8. In row 220, Pierino da Vinci and Leonardo da Vinci share two cooccurrences in volume 6, on different pages and in different paragraphs.
We can therefore insert new rows that keep the values of all columns except the volume: a row whose lists hold two entries becomes two rows for the same pair of persons, which looks as follows:
### Original:

| index | x | p_x | y | p_y | p_yx | pmi_yx | pages | paragraphs | volume |
|---|---------------------|--------|----------------------------------|--------|--------|--------|-------|-------------|------|
|220|Vinci, Pierino (Piero) da|0.001982|Vinci, Leonardo da                |0.019815|0.666667|5.072296|[44, 43]|['49', '47'] |[6, 6]|
|221|Ghirlandajo, Domenico|0.011889|Granacci, Francesco (Il Granaccio)|0.003303|0.111111|5.072296|[58, 6]|['102', '15']|[5, 8]|

### New:

| index | x | p_x | y | p_y | p_yx | pmi_yx | pages | paragraphs | volume |
|---|---------------------|--------|----------------------------------|--------|--------|--------|-------|-------------|------|
|264|Ghirlandajo, Domenico|0.011889|Granacci, Francesco (Il Granaccio)|0.003303|0.111111|5.072296|[58, 6]|['102', '15']|8  |
|263|Ghirlandajo, Domenico|0.011889|Granacci, Francesco (Il Granaccio)|0.003303|0.111111|5.072296|[58, 6]|['102', '15']|5  |
|262|Vinci, Pierino (Piero) da|0.001982|Vinci, Leonardo da                |0.019815|0.666667|5.072296|[44, 43]|['49', '47'] |6     |
|261|Vinci, Pierino (Piero) da|0.001982|Vinci, Leonardo da                |0.019815|0.666667|5.072296|[44, 43]|['49', '47'] |6     |


The number of entries in a list, independent of whether we are looking at pages, paragraphs or volumes, is equivalent to the cooccurrence frequency of two people.

In [3]:
df['volume'] = df.volume.apply(lambda s: list(ast.literal_eval(s)))
df = df.explode(column='volume',ignore_index=True)
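
The cell above splits only the 'volume' lists, so 'pages' and 'paragraphs' still hold the full lists in every new row. If each row should also carry its own page and paragraph, one possible variant is to parse and explode all three columns together. The sketch below uses a small toy frame with illustrative values (the names `toy`, `list_cols` and `toy_long` are hypothetical) and assumes pandas 1.3 or newer as well as lists of equal length within a row.

```python
import ast

import pandas as pd

# Toy rows shaped like the table above; the values are illustrative only.
toy = pd.DataFrame({
    "x": ["Vinci, Pierino (Piero) da", "Ghirlandajo, Domenico"],
    "y": ["Vinci, Leonardo da", "Granacci, Francesco (Il Granaccio)"],
    "pages": ["[44, 43]", "[58, 6]"],
    "paragraphs": ["['49', '47']", "['102', '15']"],
    "volume": ["['6', '6']", "['5', '8']"],
})

list_cols = ["pages", "paragraphs", "volume"]
for col in list_cols:
    # The CSV stores each list as a string, so parse it back into a Python list.
    toy[col] = toy[col].apply(ast.literal_eval)

# Exploding several columns at once keeps page, paragraph and volume aligned.
# Assumption: pandas >= 1.3 and equal list lengths within each row.
toy_long = toy.explode(list_cols, ignore_index=True)
print(toy_long)
```

Each of the two toy rows becomes two rows, and every new row holds exactly one page, one paragraph and one volume.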

In [4]:
df

Unnamed: 0,x,p_x,y,p_y,p_yx,pmi_yx,pages,paragraphs,volume
0,"Galassi, Galasso (Galasso Ferrarese)",0.000661,Cristofano,0.000661,1.000000,10.564149,[104],['112'],1
1,"Galassi, Galasso (Galasso Ferrarese)",0.000661,Simone,0.000661,1.000000,10.564149,[104],['112'],1
2,Cristofano,0.000661,Simone,0.000661,1.000000,10.564149,[104],['112'],1
3,Polycletus,0.000661,"Ghiberti, Bonaccorso",0.000661,1.000000,10.564149,[160],['181'],1
4,"Avanzi, Jacopo (Jacopo Davanzo)",0.000661,"Verona, Sebeto da",0.000661,1.000000,10.564149,[55],['60'],3
...,...,...,...,...,...,...,...,...,...
669,"Urbino, Bramante da",0.015852,"Bandinelli, Baccio (Baccio de' Brandini)",0.037649,0.041667,0.146297,[190],['298'],8
670,"Sarto, Andrea del (Andrea d' Agnolo)",0.027741,Tribolo (Niccolò),0.021797,0.023810,0.127438,[4],['2'],6
671,"Buonarroti, Michelagnolo",0.116248,"Ghirlandajo, Ridolfo",0.010568,0.011364,0.104718,"[3, 95]","['1', '137']",7
672,"Buonarroti, Michelagnolo",0.116248,"Ghirlandajo, Ridolfo",0.010568,0.011364,0.104718,"[3, 95]","['1', '137']",7


As we can see, the DataFrame now has 673 rows instead of 478, one row per cooccurrence. This equals the total number of cooccurrences we have in the database.
Next, we calculate the number of cooccurrences per volume by counting how often each volume value occurs.

In [6]:
df_vol = df['volume'].value_counts().to_frame().reset_index().sort_values(by=['volume'])

In [7]:
df_vol

Unnamed: 0,volume,count
7,0,41
8,1,38
6,2,52
5,3,68
4,4,75
0,5,120
3,6,86
1,7,92
2,8,87
9,9,15
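
The volume labels are strings after parsing, so sorting them works only while every label has a single digit. The sketch below shows one way to sort numerically instead. It uses made-up labels (the Series `volumes` and the label "10" are assumptions for illustration) and casts to integers before counting.

```python
import pandas as pd

# Hypothetical volume labels stored as strings, as they are after parsing and exploding.
volumes = pd.Series(["10", "2", "1", "2", "10", "10"], name="volume")

counts = (
    volumes.astype(int)          # cast so that 10 sorts after 2, not before it
    .value_counts()
    .sort_index()                # order by volume number
    .rename_axis("volume")
    .reset_index(name="count")
)
print(counts)
```

With string labels, "10" would sort between "1" and "2"; after the cast the rows appear in the order 1, 2, 10.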


Visualized in a bar plot, this looks as follows:

In [None]:
fig = px.bar(df_vol, x='volume', y='count', color = 'volume')
fig.show()

#### Persons and their frequency

In the following lines, we will calculate how frequently the different individuals appear in the volumes / overall in all volumes.
First, we need to join both columns 'x' and 'y' into one DataFrame to be able to calculate the overall frequency for all persons, independent of their position in the cooccurrence (x or y).

df1 and df2 give us an overview of the different people in the DataFrame and enable a full outer join.

In [9]:
df1 = df.groupby(['volume', 'x']).size().to_frame().reset_index().rename(columns={0: 'count'})

In [10]:
df1

Unnamed: 0,volume,x,count
0,0,Agnolo (of Siena),3
1,0,"Buffalmacco, Buonamico",3
2,0,"Cimabue, Giovanni",1
3,0,"Gaddi, Agnolo",1
4,0,"Gaddi, Taddeo",1
...,...,...,...
243,9,"Buonarroti, Michelagnolo",5
244,9,"Lastricati, Zanobi",1
245,9,"Poggini, Domenico",2
246,9,"Pontormo, Jacopo da (Jacopo Carrucci)",2


In df2, we need to rename the original column 'y' to 'x' so that we can perform the join.

In [11]:
df2 = df.groupby(['volume', 'y']).size().to_frame().reset_index().rename(columns={'y': 'x', 0: 'count'})

In [12]:
df2

Unnamed: 0,volume,x,count
0,0,Agnolo (of Siena),1
1,0,Agostino (of Siena),4
2,0,"Bologhini, Bartolommeo",1
3,0,"Capanna, Puccio",1
4,0,"Cavallini, Pietro",1
...,...,...,...
324,9,"Lancia, Pompilio",2
325,9,"Lastricati, Zanobi",1
326,9,"Lorenzi, Battista (Battista del Cavaliere)",1
327,9,"Rossi, Vincenzio de'",1


Here, we do a full outer join on the volumes and persons stored in column 'x'. Keep in mind, 'x' now contains individuals from both the original 'x' column and the former 'y' one.
Some people appear in both columns of the original DataFrame. Thus, we add the two counts in another column called 'count_sum'. This sum applies per volume.

In [13]:
merged = pd.merge(df1, df2,  how='outer', on=['volume','x'])
merged.fillna(0, inplace=True)
merged['count_sum'] = merged['count_x'] + merged['count_y']

In [14]:
merged

Unnamed: 0,volume,x,count_x,count_y,count_sum
0,0,Agnolo (of Siena),3.0,1.0,4.0
1,0,Agostino (of Siena),0.0,4.0,4.0
2,0,"Bologhini, Bartolommeo",0.0,1.0,1.0
3,0,"Buffalmacco, Buonamico",3.0,0.0,3.0
4,0,"Capanna, Puccio",0.0,1.0,1.0
...,...,...,...,...,...
443,9,"Lorenzi, Battista (Battista del Cavaliere)",0.0,1.0,1.0
444,9,"Poggini, Domenico",2.0,0.0,2.0
445,9,"Pontormo, Jacopo da (Jacopo Carrucci)",2.0,0.0,2.0
446,9,"Rossi, Vincenzio de'",1.0,1.0,2.0
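
The same per-volume totals can probably be reached without a join by stacking 'x' and 'y' into one long column first. The following is a minimal sketch on made-up pairs, not the notebook's method; the names `pairs`, `long` and `per_volume` are hypothetical. A side effect of this route is that the counts stay integers, because no missing values have to be filled.

```python
import pandas as pd

# Illustrative pairs only; names and volumes are made up for the sketch.
pairs = pd.DataFrame({
    "x": ["A", "A", "B"],
    "y": ["B", "C", "A"],
    "volume": ["0", "0", "1"],
})

# Stack 'x' and 'y' into one 'person' column, keeping the volume of each row.
long = pairs.melt(id_vars="volume", value_vars=["x", "y"], value_name="person")

# One row per (volume, person) with the number of pairs the person takes part in.
per_volume = (
    long.groupby(["volume", "person"])
    .size()
    .reset_index(name="count_sum")
)
print(per_volume)
```

In the toy data, person A takes part in two pairs in volume '0' and in one pair in volume '1', which matches what the sum of 'count_x' and 'count_y' would give after an outer join.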


The new DataFrame allows a grouping per volume, so we can calculate which people appear how often in which volume.

In [15]:
grouped_volumes = merged.groupby('volume')

In [16]:
fig = px.bar(grouped_volumes.get_group('0'), x='x', y='count_sum', color = 'count_sum')
fig.show()

When we group by person, we can also find out the individual frequencies per volume.

In [17]:
grouped_persons = merged.groupby('x')


In [18]:
grouped_persons.get_group("Buonarroti, Michelagnolo")

Unnamed: 0,volume,x,count_x,count_y,count_sum
109,3,"Buonarroti, Michelagnolo",8.0,0.0,8.0
152,4,"Buonarroti, Michelagnolo",1.0,0.0,1.0
215,5,"Buonarroti, Michelagnolo",10.0,0.0,10.0
283,6,"Buonarroti, Michelagnolo",15.0,0.0,15.0
328,7,"Buonarroti, Michelagnolo",12.0,0.0,12.0
392,8,"Buonarroti, Michelagnolo",32.0,0.0,32.0
440,9,"Buonarroti, Michelagnolo",5.0,0.0,5.0


We also want to find out how often people occur in the entire database. For this, we once again use aggregation: we group by person and then sum the 'count_sum' values.

In [23]:
merged.groupby(['x'])['count_sum'].agg('sum')

x
Agnolo (of Siena)                      4.0
Agnolo, Baccio d' (Baccio Baglioni)    5.0
Agnolo, Giuliano di Baccio d'          4.0
Agostino (of Siena)                    4.0
Albertinelli, Mariotto                 6.0
                                      ... 
Vitruvius                              2.0
Vivarino, Luigi                        2.0
Viviano, Michelagnolo di               4.0
Zevio, Aldigieri (Altichiero) da       2.0
Zucchero, Taddeo                       5.0
Name: count_sum, Length: 312, dtype: float64

The following visualization shows the top 10 most frequent persons in the database. They share the highest number of cooccurrences.

In [20]:
fig = px.bar(merged.groupby(['x'])['count_sum'].agg('sum').reset_index().sort_values(by='count_sum', ascending=False)[:10], x='x', y='count_sum', color = 'x')
fig.show()
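
The most and least frequent persons can also be picked from the summed totals with `nlargest` and `nsmallest`. The sketch below uses a toy frame in the shape of `merged`; the frame `toy_merged` and its values are assumptions for illustration, and it selects two persons instead of ten to keep the example short.

```python
import pandas as pd

# Hypothetical per-volume counts in the shape of `merged` (x, volume, count_sum).
toy_merged = pd.DataFrame({
    "x": ["A", "A", "B", "C", "C", "D"],
    "volume": ["0", "1", "0", "1", "2", "2"],
    "count_sum": [4.0, 3.0, 1.0, 2.0, 2.0, 6.0],
})

# Sum the per-volume counts to get one total per person.
totals = toy_merged.groupby("x")["count_sum"].sum()

top = totals.nlargest(2)      # persons with the highest totals
bottom = totals.nsmallest(2)  # persons with the lowest totals
print(top)
print(bottom)
```

Both results are Series indexed by person, so `top.reset_index()` gives a frame that can be passed to `px.bar` in the same way as above.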





The following table shows each cooccurrence and its frequency per volume. It can be sorted to find the most or least frequent cooccurrences, or searched for specific pairs of persons.
Sorting by frequency in descending order yields the following result:

| x | y | volume | count |
|---------------------------------------------------------------------------------|--------------------------------------------------|---|---|
|Donato (Donatello)                                                               |Brunelleschi, Filippo (Filippo di Ser Brunellesco)|1  |6  |
|Buonarroti, Michelagnolo                                                         |Bandinelli, Baccio (Baccio de' Brandini)          |6  |5  |
|Vecelli, Tiziano (Tiziano da Cadore)                                             |Castelfranco, Giorgione da                        |8  |5  |
|Ghiberti, Lorenzo (Lorenzo di Bartoluccio Ghiberti, or Lorenzo di Cione Ghiberti)|Brunelleschi, Filippo (Filippo di Ser Brunellesco)|1  |5  |
|Sarto, Andrea del (Andrea d' Agnolo)                                             |Franciabigio (Francia)                            |4  |5  |



In [42]:
dfp = df.groupby(['x', 'y', 'volume']).count().drop(columns=['p_x', 'p_y', 'p_yx', 'pmi_yx', 'paragraphs']).rename(columns={'pages': 'count'}).sort_values(by='count', ascending=False).reset_index()
dfp['cooc_id'] = dfp.index

In [43]:
dfp

Unnamed: 0,x,y,volume,count,cooc_id
0,Donato (Donatello),"Brunelleschi, Filippo (Filippo di Ser Brunelle...",1,6,0
1,"Buonarroti, Michelagnolo","Bandinelli, Baccio (Baccio de' Brandini)",6,5,1
2,"Vecelli, Tiziano (Tiziano da Cadore)","Castelfranco, Giorgione da",8,5,2
3,"Ghiberti, Lorenzo (Lorenzo di Bartoluccio Ghib...","Brunelleschi, Filippo (Filippo di Ser Brunelle...",1,5,3
4,"Sarto, Andrea del (Andrea d' Agnolo)",Franciabigio (Francia),4,5,4
...,...,...,...,...,...
519,"Filarete, Antonio","Ciuffagni, Bernardo",2,1,519
520,"Fiesole, Andrea da (Andrea Ferrucci)","Marchissi, Antonio di Giorgio",4,1,520
521,"Fiesole, Andrea da (Andrea Ferrucci)","Maini (Marini), Michele",4,1,521
522,"Fabriano, Gentile da","Pisanello, Vittore or Antonio",2,1,522


In [48]:
fig = px.bar(dfp[:80], x='cooc_id', y='count', color = 'count')
fig.show()
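
The pair counts above are derived by counting a leftover column and renaming it. A shorter route that should give the same numbers is `groupby(...).size()`, sketched here on made-up rows; the frame `rows` and its values are hypothetical. Note that, as in the notebook, the pairs (A, B) and (B, A) are treated as different groups.

```python
import pandas as pd

# Illustrative exploded rows: one row per cooccurrence of x and y in a volume.
rows = pd.DataFrame({
    "x": ["A", "A", "A", "B"],
    "y": ["B", "B", "C", "A"],
    "volume": ["1", "1", "1", "2"],
})

pair_counts = (
    rows.groupby(["x", "y", "volume"])
    .size()                                  # rows per group, no helper column needed
    .reset_index(name="count")
    .sort_values("count", ascending=False, kind="stable")
    .reset_index(drop=True)
)
# Running identifier for plotting, analogous to 'cooc_id' above.
pair_counts["cooc_id"] = pair_counts.index
print(pair_counts)
```

The stable sort keeps pairs with equal counts in their grouped order, which makes the 'cooc_id' values reproducible between runs.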