# Creating a graph from City Readers data
This notebook takes the data that I got from City Readers using the full version of the web scraping script in the `2021_s1_d2_m4d_1_scrape_city_readers` notebook and uses it to create a network graph using `Networkx`, which we'll then visualize with `Bokeh`.

## Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/gdrive')

## Import some packages
We'll use the `pandas` package for reading our .csv file, since `Networkx` plays very nicely with `pandas` dataframes.

In [1]:
import numpy as np
import pandas as pd

## Load City Readers data
This cell reads the .csv of City Readers data as a `pandas` data frame, then replaces all instances of `numpy` "Not a Number" values with empty strings.


In [2]:
# source_directory = '/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/'
source_directory = '/gdrive/MyDrive/RBS_Course/2021/rbs_digital_approaches_2021/data/'
df_nysoc_1800_loans = pd.read_csv(source_directory + 'july_19_23_borrowers_all_1800_books.csv')
df_nysoc_1800_loans = df_nysoc_1800_loans.replace(np.nan, '', regex=True)
print(df_nysoc_1800_loans.columns)
print(df_nysoc_1800_loans)


Index(['borrower_id', 'borrower_name', 'borrower_occupation', 'book_id',
       'book_author', 'book_title', 'book_subject'],
      dtype='object')
      borrower_id  ...                                       book_subject
0            1065  ...                      English fiction--18th century
1            1065  ...                                                   
2            1065  ...                                                   
3            1065  ...                              Verse satire, English
4            1065  ...                                             Novels
...           ...  ...                                                ...
3583         1169  ...                      English fiction--18th century
3584         1169  ...                            Sicily (Italy)--Fiction
3585         1169  ...                                  Vicars, Parochial
3586         1169  ...  Cervantes Saavedra, Miguel de, 1547-1616. Don ...
3587         1169  ...                

## Create node list
`Networkx` will create nodes automatically based on the edges we import later (which makes sense: if there is an edge, there are also nodes that it joins). We got some extra information about all of our nodes in the last notebook, though, and it seems easier to add that information to the `Networkx` network if we have a separate node list.

### Borrowers
The cell below creates a new `pandas` dataframe with five columns for our Borrower nodes, then copies data from selected columns in our initial data frame to columns in the new data frame. (I'm using list comprehension syntax for some of these columns to concatenate strings with the values in the source dataframe. It turns out that the same number can end up being an id for a book and for a borrower at City Readers, so I'm tacking on a prefix to the id to clarify that these are borrowers, for example.)

The .csv we imported had one row for every book borrowed by each borrower, which means that each borrower appears many times in our source .csv. We only want one node per borrower, though, so we `drop_duplicates()` to leave just unique rows.

In [3]:
#Create a new dataframe for Borrower nodes
df_borrower_nodes = pd.DataFrame(columns=['id','label', 'characteristic', 'node_type','link'])

#Copy columns from our original data frame to corresponding columns in our
#new dataframe
df_borrower_nodes['id'] = ['bor_' + str(id) for id in df_nysoc_1800_loans['borrower_id']]
df_borrower_nodes['label'] = df_nysoc_1800_loans['borrower_name']
df_borrower_nodes['characteristic'] = df_nysoc_1800_loans['borrower_occupation']
df_borrower_nodes['node_type'] = 'Borrower'
df_borrower_nodes['link'] = ['https://cityreaders.nysoclib.org/Detail/entities/' + str(id) for id in df_nysoc_1800_loans['borrower_id']]

#Remove duplicate rows to leave a list of unique Borrowers
df_borrower_nodes = df_borrower_nodes.drop_duplicates() 
print(df_borrower_nodes)

            id  ...                                               link
0     bor_1065  ...  https://cityreaders.nysoclib.org/Detail/entiti...
58    bor_1167  ...  https://cityreaders.nysoclib.org/Detail/entiti...
79     bor_830  ...  https://cityreaders.nysoclib.org/Detail/entiti...
117   bor_1055  ...  https://cityreaders.nysoclib.org/Detail/entiti...
157    bor_805  ...  https://cityreaders.nysoclib.org/Detail/entiti...
...        ...  ...                                                ...
3477   bor_635  ...  https://cityreaders.nysoclib.org/Detail/entiti...
3533  bor_2935  ...  https://cityreaders.nysoclib.org/Detail/entiti...
3547   bor_522  ...  https://cityreaders.nysoclib.org/Detail/entiti...
3554  bor_1295  ...  https://cityreaders.nysoclib.org/Detail/entiti...
3576  bor_1169  ...  https://cityreaders.nysoclib.org/Detail/entiti...

[138 rows x 5 columns]
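
To see what the prefixing and `drop_duplicates()` steps do, here is a minimal sketch on a made-up loans table. The names and ids in `toy_loans` are invented for illustration and are not City Readers data.

```python
import pandas as pd

# Hypothetical loans: one row per loan, so borrower 7 appears twice
toy_loans = pd.DataFrame({
    'borrower_id': [7, 7, 12],
    'borrower_name': ['A. Reader', 'A. Reader', 'B. Reader'],
    'book_id': [7, 30, 30],
})

# Build one row per loan, prefixing the id so it cannot collide with book 7
toy_borrowers = pd.DataFrame({
    'id': ['bor_' + str(i) for i in toy_loans['borrower_id']],
    'label': toy_loans['borrower_name'],
})

# Rows that are identical in every column collapse into one
toy_borrowers = toy_borrowers.drop_duplicates()
print(toy_borrowers)
```

The surviving rows keep their original row index, which is why the printed output above shows index values like 0, 58, and 79 rather than a run of consecutive numbers.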


### Books
Now we do the same thing for Books. (There's a bit of a shortcut involved here for concatenating authors and titles into a single "label" for each book.)

In [4]:
df_book_nodes = pd.DataFrame(columns=['id','label', 'characteristic', 'node_type','link'])
df_book_nodes['id'] = ['bk_' + str(id) for id in df_nysoc_1800_loans['book_id']]
df_book_nodes['label'] = [author + ', ' + title for author, title in zip(df_nysoc_1800_loans['book_author'], df_nysoc_1800_loans['book_title'])]
df_book_nodes['label'] = [string.lstrip(', ') for string in df_book_nodes['label']]
df_book_nodes['characteristic'] = df_nysoc_1800_loans['book_subject']
df_book_nodes['node_type'] = 'Book'
df_book_nodes['link'] = ['https://cityreaders.nysoclib.org/Detail/objects/' + str(x) for x in df_nysoc_1800_loans['book_id']]
df_book_nodes = df_book_nodes.drop_duplicates() 
print(df_book_nodes)

           id  ...                                               link
0     bk_3516  ...  https://cityreaders.nysoclib.org/Detail/object...
1     bk_3521  ...  https://cityreaders.nysoclib.org/Detail/object...
2     bk_3354  ...  https://cityreaders.nysoclib.org/Detail/object...
3     bk_2377  ...  https://cityreaders.nysoclib.org/Detail/object...
4     bk_3147  ...  https://cityreaders.nysoclib.org/Detail/object...
...       ...  ...                                                ...
3568  bk_2299  ...  https://cityreaders.nysoclib.org/Detail/object...
3571  bk_1187  ...  https://cityreaders.nysoclib.org/Detail/object...
3574  bk_3080  ...  https://cityreaders.nysoclib.org/Detail/object...
3575  bk_2493  ...  https://cityreaders.nysoclib.org/Detail/object...
3586  bk_2186  ...  https://cityreaders.nysoclib.org/Detail/object...

[834 rows x 5 columns]
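
The label shortcut is easier to follow on a couple of made-up records. This sketch assumes one book with an author and one without; the author and title strings are placeholders, not records from City Readers.

```python
# Hypothetical author/title pairs; the second book has no recorded author
authors = ['Smith, Jane', '']
titles = ['A Novel', 'The Monthly Review']

# Join author and title, then strip the leading ', ' left by an empty author
labels = [(author + ', ' + title).lstrip(', ')
          for author, title in zip(authors, titles)]
print(labels)
```

Note that `lstrip(', ')` treats its argument as a set of characters, so it removes any run of leading commas and spaces rather than the literal two-character string. That is harmless here unless a title itself begins with a comma or a space.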


### Putting our nodes together
This creates a single dataframe for *all* our nodes (both borrowers and books) by concatenating the Borrowers and Books dataframes with `pd.concat()`.

In [5]:
#Stack the Borrower and Book node dataframes; the original row indexes are kept
df_nysoc_1800_nodes = pd.concat([df_borrower_nodes, df_book_nodes])
print(df_nysoc_1800_nodes)

            id  ...                                               link
0     bor_1065  ...  https://cityreaders.nysoclib.org/Detail/entiti...
58    bor_1167  ...  https://cityreaders.nysoclib.org/Detail/entiti...
79     bor_830  ...  https://cityreaders.nysoclib.org/Detail/entiti...
117   bor_1055  ...  https://cityreaders.nysoclib.org/Detail/entiti...
157    bor_805  ...  https://cityreaders.nysoclib.org/Detail/entiti...
...        ...  ...                                                ...
3568   bk_2299  ...  https://cityreaders.nysoclib.org/Detail/object...
3571   bk_1187  ...  https://cityreaders.nysoclib.org/Detail/object...
3574   bk_3080  ...  https://cityreaders.nysoclib.org/Detail/object...
3575   bk_2493  ...  https://cityreaders.nysoclib.org/Detail/object...
3586   bk_2186  ...  https://cityreaders.nysoclib.org/Detail/object...

[972 rows x 5 columns]
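
As a quick sanity check on a combined node table, this sketch stacks two made-up node tables with `pd.concat()` and confirms that the prefixed ids stay unique. The `toy_` dataframes are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical node tables: borrower 7 and book 7 share a raw id number
toy_borrower_nodes = pd.DataFrame({'id': ['bor_7'], 'node_type': ['Borrower']})
toy_book_nodes = pd.DataFrame({'id': ['bk_7', 'bk_30'], 'node_type': ['Book', 'Book']})

toy_nodes = pd.concat([toy_borrower_nodes, toy_book_nodes])

# With prefixes, every node id is unique even though the raw numbers overlap
print(toy_nodes['id'].is_unique)

# Row indexes repeat after stacking, so select rows by 'id' rather than by index
print(toy_nodes.index.tolist())
```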


## Create edge list
This cell does much the same thing that the ones above did, but to create a two-column dataframe (`from` and `to`) with Borrower ids in the `from` column and Book ids in the `to` column.

In [6]:
df_nysoc_1800_edges = pd.DataFrame(columns=['from', 'to'])
df_nysoc_1800_edges['from'] = ['bor_' + str(bor_id) for bor_id in df_nysoc_1800_loans['borrower_id']]
df_nysoc_1800_edges['to'] = ['bk_' + str(bk_id) for bk_id in df_nysoc_1800_loans['book_id']]
print(df_nysoc_1800_edges)

          from       to
0     bor_1065  bk_3516
1     bor_1065  bk_3521
2     bor_1065  bk_3354
3     bor_1065  bk_2377
4     bor_1065  bk_3147
...        ...      ...
3583  bor_1169  bk_4397
3584  bor_1169  bk_4168
3585  bor_1169  bk_3046
3586  bor_1169  bk_2186
3587  bor_1169  bk_3371

[3588 rows x 2 columns]


## Creating a network graph with `Networkx`
We import the `networkx` package, then create a new graph object (`DG`). 

By using `DiGraph()`, I've created this as a "directed" graph, that is, one that assumes a kind of directionality to the relationship between readers and books: readers borrow books, but books do not reciprocally borrow readers.

We then add edges to this graph object:

1. We use `zip()` to pair the `from` and `to` columns from our edges dataframe.
2. We work through each row in that paired set of columns, treating the value in the `from` column as `source` and the value in the `to` column as `target`.
  * Then we use `add_edges_from()` to add a tuple of each `source`/`target` pair as an edge in our directed graph object.

In [7]:
import networkx as nx
DG = nx.DiGraph()
for source, target in zip(df_nysoc_1800_edges['from'], df_nysoc_1800_edges['to']) :
  # print((source, target))
  DG.add_edges_from([(source, target)])


Believe it or not, those three lines actually just created a network graph. `Networkx` can do some basic visualization of its network graphs using the `matplotlib` package, so let's just check to see that we do, in fact, have something. (This could take a little while: up to 30 seconds or so.)

In [None]:
from matplotlib.pyplot import figure
figure(figsize=(10, 10))
nx.draw_networkx(DG, with_labels=False)
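
The edge-loading loop earlier in this section could probably also be written with `networkx`'s `from_pandas_edgelist()` helper. This is a sketch on a made-up edge list, not the notebook's data, and it also shows one thing worth knowing about `DiGraph`: a repeated borrower/book pair ends up as a single edge.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list with a repeated loan (bor_1 borrowed bk_9 twice)
toy_edges = pd.DataFrame({
    'from': ['bor_1', 'bor_1', 'bor_2', 'bor_1'],
    'to': ['bk_9', 'bk_10', 'bk_9', 'bk_9'],
})

# Build a directed graph straight from the two columns
toy_graph = nx.from_pandas_edgelist(toy_edges, source='from', target='to',
                                    create_using=nx.DiGraph)

print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
```

Four rows produce only three edges here. The same collapsing happens with the `add_edges_from()` loop, so the number of edges in `DG` may be smaller than the number of rows in the edges dataframe if anyone borrowed the same book more than once.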

## Examining the network before visualizing it
Now, that's not much to look at, I'll grant you. We'll definitely make a more attractive visualization, but let's pause a minute before we do to see what we actually have in this network graph, because we could start learning things about this network even if we never got around to making a nice visualization of it.

### Adding more information about our nodes
Right now, our network graph only knows about the Borrower ids and Book ids that we used to construct our edges. Let's add some of that information about our nodes so that we can have a clearer sense of what we're looking at.

This next cell uses several columns from our nodes dataframe, then adds the information from those columns as attributes of our existing nodes. (`Networkx`'s `add_node()` command doesn't just create new nodes, it also updates existing ones: the information that we're bringing in from the `df_nysoc_1800_nodes['id']` column matches the node ids that `Networkx` already has, so we're updating each node with a `label`, a `node_type` (Book vs. Borrower), and a `link` back to City Readers.)

In [8]:
for id, label, node_type, link in zip(df_nysoc_1800_nodes['id'], 
                                           df_nysoc_1800_nodes['label'],
                                           df_nysoc_1800_nodes['node_type'],
                                           df_nysoc_1800_nodes['link']) :
  DG.add_node(id, id=id, label=label, node_type=node_type, link=link) 
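
Here is a small demonstration of that updating behavior on a toy graph. The node ids, label, and URL are placeholders invented for the example.

```python
import networkx as nx

toy_graph = nx.DiGraph()
# Adding an edge creates both endpoint nodes with empty attribute dictionaries
toy_graph.add_edge('bor_1', 'bk_9')

# Calling add_node() on an id that already exists merges in new attributes
toy_graph.add_node('bor_1', label='A. Reader', node_type='Borrower')
toy_graph.add_node('bor_1', link='https://example.org/bor_1')  # placeholder URL

print(toy_graph.number_of_nodes())
print(toy_graph.nodes['bor_1'])
print(toy_graph.nodes['bk_9'])
```

The node count stays at two, and `bor_1` accumulates all three attributes. `bk_9` never received any, so its attribute dictionary is empty and a lookup like `data['node_type']` would raise a `KeyError` for it. The notebook avoids that because every id in the edge list also appears in the node list.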

### And some more node information
We can also add the occupations for borrowers and the subjects for books. I haven't really had time to come to grips enough with `Bokeh` to make the use of this that I'd like to, but it should be possible to use these attributes of our nodes to filter `Bokeh`'s visualization of the network graph interactively to show, for example, only Borrowers who had the occupation of "Merchant" and the books they read. Work for another day...

In [9]:
#The occupations for borrowers got spliced together into a single string.
#This creates a regular expression to split them apart into a list
import re
divide_occupation = re.compile(r'[A-Z][^A-Z]+')

#Look at selected columns from our nodes dataframe, as above
for id, node_type, characteristic_string in zip(df_nysoc_1800_nodes['id'], 
                                         df_nysoc_1800_nodes['node_type'], 
                                         df_nysoc_1800_nodes['characteristic']) :
  
  #Create a list to use for the characteristic attribute for networkx's nodes
  characteristic = []
  
  #Behave differently, depending on whether the node_type is "Borrower" or "Book"
  if node_type == 'Borrower' :
    if characteristic_string != '' :
      #Find each occupation and append it to the list
      for match in re.findall(divide_occupation, characteristic_string) :
        characteristic.append(match)
    
  if node_type == 'Book' :
    subject = ''
    if characteristic_string != '' :
      #Add any subject to the empty string, then append the result to the list
      subject += characteristic_string
      characteristic.append(subject)
  
  #Add the characteristic list as the value for the characteristic attribute of
  #all nodes in the graph object DG
  DG.add_node(id, characteristic=characteristic)
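
To make the occupation-splitting pattern concrete, this sketch runs the same regular expression over a few made-up strings. The occupation strings are hypothetical; the real data may look different.

```python
import re

divide_occupation = re.compile(r'[A-Z][^A-Z]+')

# Hypothetical run-together occupation strings
print(divide_occupation.findall('MerchantAttorney'))
print(divide_occupation.findall('MerchantSea captain'))

# A caveat: a capital letter inside one occupation also starts a new match
print(divide_occupation.findall('Ship Owner'))
```

The pattern starts a new match at every capital letter, so it works when each occupation has exactly one capital, at its start. An occupation written with internal capitals would be split into pieces, as the last line shows.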

Let's have a look at our network: how many nodes and edges, and then the information we have about the nodes.

In [None]:
print(DG.number_of_nodes(), DG.number_of_edges())

print(DG.nodes(data=True))

### Getting the most prolific borrowers
Our graph object has information about how many connections there are between nodes, which can begin to tell us things about the network even if we never visualize it. This next cell, for instance, uses "out degree" (the number of connections coming *from* a node) to figure out how many books each borrower borrowed, and shows us the list of borrowers from most active to least.

In [None]:
#Create a list for tuples of information about borrowers and numbers of books
#they borrowed
loans_per_borrower = []

#Consider all nodes in the DG object, taking into account the data that we added
#to those nodes
for node, data in DG.nodes(data=True) :
  #If the node is a Borrower...
  if data['node_type'] == 'Borrower' :
    #Append a tuple to the list with the Borrower's out degree, id, and name
    loans_per_borrower.append((DG.out_degree(node), data['id'], data['label']))
#First sort the list, then reverse that sorting to show borrowers from
#largest out degree to smallest
for borrower in reversed(sorted(loans_per_borrower)) :
  print(borrower)
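
The same ranking idea can be tried on a toy graph. This is a sketch with invented node ids; it uses `sorted(..., reverse=True)` as one alternative to reversing a sorted list.

```python
import networkx as nx

toy_graph = nx.DiGraph()
toy_graph.add_edges_from([('bor_1', 'bk_9'), ('bor_1', 'bk_10'), ('bor_2', 'bk_9')])

# Tag each toy node by type, based on the id prefix
for node in toy_graph.nodes:
    toy_graph.nodes[node]['node_type'] = 'Borrower' if node.startswith('bor_') else 'Book'

# Collect (out degree, id) pairs for borrowers only
loans = [(toy_graph.out_degree(node), node)
         for node, data in toy_graph.nodes(data=True)
         if data['node_type'] == 'Borrower']

# Largest out degree first
ranked = sorted(loans, reverse=True)
print(ranked)
```

Because a `DiGraph` holds at most one edge per borrower/book pair, out degree here counts distinct books borrowed rather than total loans.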

### And the most important books
Similarly, we can figure out which books are most central to this network of readers by calculating their "in degree centrality" ("in degree" because this is a directed network and books have connections coming in, but not out).

**Note:** I do not consider myself at all well-versed in network theory, generally, or in `networkx` specifically. I feel sure there is a more elegant way to handle in-degree centrality in `networkx` that I haven't yet figured out.

In [None]:
#Calculate in-degree centrality for every node, keyed by node id
in_degree_centrality = nx.in_degree_centrality(DG)
#Store the centrality score as an 'indeg' attribute on each Book node
for id, node_type in zip(df_nysoc_1800_nodes['id'], 
                         df_nysoc_1800_nodes['node_type']) :
  if node_type == 'Book' :
    DG.add_node(id, indeg=in_degree_centrality[id]) 

book_centrality = []
for node, data in DG.nodes(data=True) :
  if data['node_type'] == 'Book' :
    book_centrality.append((data['indeg'], data['id'], data['label']))

for book in reversed(sorted(book_centrality)) :
  print(book)
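
For what it's worth, `networkx` defines in-degree centrality as a node's in degree divided by the number of other nodes in the graph (n - 1). The sketch below checks that on a toy graph and shows one possibly tidier way to attach the scores, using `nx.set_node_attributes()`. The graph is invented, and this variant stores `indeg` on every node, borrowers included.

```python
import networkx as nx

toy_graph = nx.DiGraph()
toy_graph.add_edges_from([('bor_1', 'bk_9'), ('bor_2', 'bk_9'), ('bor_1', 'bk_10')])

# Dictionary of node id -> in-degree centrality
centrality = nx.in_degree_centrality(toy_graph)

# Attach every score in one call as the 'indeg' node attribute
nx.set_node_attributes(toy_graph, centrality, 'indeg')

# Recompute one score by hand: in degree divided by (number of nodes - 1)
n = toy_graph.number_of_nodes()
manual = toy_graph.in_degree('bk_9') / (n - 1)
print(centrality['bk_9'], manual)
```

Since every score is divided by the same n - 1, ranking books by in-degree centrality gives the same order as ranking them by plain in degree.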

### Understanding general patterns
A lot of the books in this sample turn out to have been borrowed by only one person. Just as a practical matter, it might make sense to exclude those books from our network visualization—they wouldn't reveal any connections among borrowers. Let's have a quick look at overall trends regarding the number of people who checked out each book.

In [None]:
import statistics
#This calculates the absolute in_degree for each book: the number of edges
#coming in to it
books_in_degrees = [DG.in_degree[node] for node, data in DG.nodes(data=True) if data['node_type'] == 'Book']

#Now we'll figure out how many books were borrowed by one person, two people, 
#three people, etc.
from collections import Counter
books_per_number_of_loans = Counter(books_in_degrees)

#Let's make a quick chart
import plotly.express as px
#Convert the Counter's keys and values to lists so they can be charted
to_chart = {'times borrowed': list(books_per_number_of_loans.keys()), 'number of books': list(books_per_number_of_loans.values())}

fig = px.bar(to_chart, x='times borrowed', y='number of books')
fig.show()
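
If the chart is unavailable, the same tally can be read as plain text. This sketch uses a made-up list of in degrees to show what `Counter` produces and how to compute the share of books borrowed by only one person.

```python
from collections import Counter

# Hypothetical in-degrees for six books
toy_in_degrees = [1, 1, 1, 2, 3, 1]
books_per_count = Counter(toy_in_degrees)

# Sort by times borrowed so the rows read in order
for times_borrowed, number_of_books in sorted(books_per_count.items()):
    print(times_borrowed, number_of_books)

# Fraction of books that were borrowed by exactly one person
share_single = books_per_count[1] / len(toy_in_degrees)
```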

### Removing some nodes prior to visualization
This cell removes nodes from our graph object that fall below a certain number of connections: books that were borrowed by fewer than three people and people who borrowed fewer than five books.

In [None]:
nodes_to_remove = []
print(DG.number_of_nodes(), DG.number_of_edges())
for node, data in DG.nodes(data=True) :
  if data['node_type'] == 'Book' and DG.in_degree[node] < 3 :
    nodes_to_remove.append(node)
  if data['node_type'] == 'Borrower' and DG.out_degree[node] < 5 :
    nodes_to_remove.append(node)
print(len(nodes_to_remove))
for node_to_remove in nodes_to_remove :
  DG.remove_node(node_to_remove)
print(DG.number_of_nodes(), DG.number_of_edges())
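
One thing to be aware of with this kind of filtering: degrees are measured before anything is removed, so a single pass can leave behind nodes that now fall below the threshold. The sketch below shows this on a toy graph with lower thresholds. `prune_once` is a hypothetical helper written for this example, not part of `networkx`.

```python
import networkx as nx

def prune_once(graph, min_book_in, min_borrower_out):
    """Remove, in one pass, nodes whose degree is below the given thresholds."""
    to_remove = [
        node for node, data in graph.nodes(data=True)
        if (data['node_type'] == 'Book' and graph.in_degree(node) < min_book_in)
        or (data['node_type'] == 'Borrower' and graph.out_degree(node) < min_borrower_out)
    ]
    graph.remove_nodes_from(to_remove)
    return to_remove

toy_graph = nx.DiGraph()
toy_graph.add_edges_from([('bor_1', 'bk_a'), ('bor_2', 'bk_a'),
                          ('bor_1', 'bk_b'), ('bor_2', 'bk_c')])
for node in toy_graph.nodes:
    toy_graph.nodes[node]['node_type'] = 'Borrower' if node.startswith('bor_') else 'Book'

removed = prune_once(toy_graph, min_book_in=2, min_borrower_out=2)
print(sorted(removed))
print(dict(toy_graph.out_degree()))
```

After the pass, both toy borrowers are left with an out degree of 1, below the threshold of 2 that was just applied. Whether to repeat the pass is a judgment call: running it until nothing changes gives a stricter core of the network, but it can also shrink the graph considerably.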

## Visualizing the network graph with `Bokeh`
`Bokeh` is a very sophisticated visualization package for Python that I have made some shift to use for visualizing our network graph (with much Googling and use of StackOverflow, it must be said).

The code for this visualization doesn't lend itself to being broken up into separate cells, so I've added comments to try to explain as best I'm able what each part of the code is doing.

In [None]:
!pip install bokeh

In [11]:
#Create dictionaries for certain attributes we want to add to our nodes
node_color_attrs = {}
node_size_attrs = {}

# Set different colors for "Borrower" and "Book" nodes and scale them relative 
#to the number of outgoing and incoming edges, respectively
for node, data in DG.nodes(data=True):
    if data['node_type'] == 'Borrower' :
      node_color = 'cornflowerblue'
      node_size = 0.25 * DG.out_degree[node]
    else :
      node_color = 'cornsilk'
      node_size = DG.in_degree[node]
    
    #Add these new attribute values to the dictionaries we created above
    node_color_attrs[node] = node_color
    node_size_attrs[node] = node_size

#Attach the dictionaries of attributes we created to the nodes in the DG network
nx.set_node_attributes(DG, node_color_attrs, "node_color")
nx.set_node_attributes(DG, node_size_attrs, "node_size")

In [14]:
#Import several components of bokeh
import bokeh.io
from bokeh.io import show
#This one's necessary to show the visualization in a Jupyter notebook
bokeh.io.output_notebook()
from bokeh.plotting import figure, from_networkx
from bokeh.models import (BoxSelectTool, Circle, EdgesAndLinkedNodes, HoverTool,
                          MultiLine, NodesAndLinkedEdges, Plot, Range1d, TapTool, 
                          PanTool, WheelZoomTool, PointDrawTool)
#A color palette. I have... very little sense for color, so we'll just use theirs...
from bokeh.palettes import Spectral4

#Create a plot object with some basic settings. Note that we are including
#a toolbar and indicating that our graph should be scaled up on both axes
#to fit the available space
plot = figure(title="July 19-23 New York Society Library Readers", 
              tools="", toolbar_location='above', sizing_mode="scale_both")

#Create a tool to display information about a node when the pointer hovers over
#it. Include the label and characteristic attributes from networkx's data about
#each node
node_hover_tool = HoverTool(tooltips=[("", "@label"), ("", "@characteristic")])

#Do NOT cause anything to appear when the pointer hovers over an edge--which
#will make sense when you see how many edges we have
edge_hover_tool = HoverTool(tooltips=None)

#Add selected tools to our plot object: the node_hover_tool and edge_hover_tool
#we just created, plus tools for panning around the visualization and using
#the mouse's scroll wheel to zoom in and out 
plot.add_tools(node_hover_tool, edge_hover_tool, TapTool(), BoxSelectTool(), PanTool(), WheelZoomTool())

#Create a graph from the networkx directed graph we created
graph = from_networkx(DG, nx.spring_layout, scale=2, center=(0,0))

#Render the nodes in the graph using the node_size and node_color attributes we
#created. Change the color of the nodes when they are selected or hovered over
graph.node_renderer.glyph = Circle(size="node_size", fill_color='node_color')
graph.node_renderer.selection_glyph = Circle(size="node_size", fill_color=Spectral4[2])
graph.node_renderer.hover_glyph = Circle(size="node_size", fill_color=Spectral4[1])

#Render the edges in the graph—same idea as the rendering of nodes, above.
graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width=1)
graph.edge_renderer.selection_glyph = MultiLine(line_color=Spectral4[2], line_width=5)
graph.edge_renderer.hover_glyph = MultiLine(line_color=Spectral4[1], line_width=5)

#Set how the graph should behave when a node is selected or inspected—highlight
#the nodes and linked edges
graph.selection_policy = NodesAndLinkedEdges()
graph.inspection_policy = NodesAndLinkedEdges()

plot.add_tools(PointDrawTool(renderers = [graph.node_renderer], empty_value = 'black'))

#Add our graph to the plot object
plot.renderers.append(graph)

#Show our plot object. Whew.
show(plot)

## Next steps
As I suggested above, it would be nice to make this very large visualization filterable so that we could look at smaller subsets of the network. It's possible to add select and multiple-select lists to `Bokeh` visualizations, so that should be achievable.
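
Independently of any `Bokeh` widgets, one way to prepare such a filter is to build the smaller network on the `networkx` side and visualize that instead. The sketch below is only an illustration on a toy graph: `borrowers_with_books` is a hypothetical helper, and the node ids and occupations are invented.

```python
import networkx as nx

toy_graph = nx.DiGraph()
toy_graph.add_node('bor_1', node_type='Borrower', characteristic=['Merchant'])
toy_graph.add_node('bor_2', node_type='Borrower', characteristic=['Attorney'])
toy_graph.add_edges_from([('bor_1', 'bk_9'), ('bor_2', 'bk_9'), ('bor_2', 'bk_10')])

def borrowers_with_books(graph, occupation):
    """Return a copy of the subgraph of matching borrowers plus their books."""
    keep = set()
    for node, data in graph.nodes(data=True):
        is_borrower = data.get('node_type') == 'Borrower'
        if is_borrower and occupation in data.get('characteristic', []):
            keep.add(node)
            # successors() yields the books this borrower points to
            keep.update(graph.successors(node))
    return graph.subgraph(keep).copy()

merchants = borrowers_with_books(toy_graph, 'Merchant')
print(sorted(merchants.nodes), list(merchants.edges))
```

The result is an ordinary `DiGraph`, so it could be handed to the same plotting code in place of `DG`.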

It might also be nice to make it possible to drag nodes around to see how they're connected to other nodes. I'm not able to see that `Bokeh` has that facility, though it's such a large library that I may well just be missing it. 

If you were preparing network graphs of this sort for a web-based project, you would probably end up using different tools. The [D3](https://d3js.org) library might be one possibility to explore.

This notebook has only touched on the kinds of things that we might discover from network analysis, but should at least give us a starting point for discussion.