In [5]:
# First install takes a while to download models
!pip install --quiet -r requirements.txt

In [19]:
import os
import re

import graphistry
import networkx as nx
import pandas as pd
import requests

In [3]:
# Environment variable setup
GRAPHISTRY_USERNAME = os.getenv("GRAPHISTRY_USERNAME")
GRAPHISTRY_PASSWORD = os.getenv("GRAPHISTRY_PASSWORD")

# Part 2: Quantitative Networks: Social Network Analysis and Network Science

## Visualizing Networks with Graphistry

Throughout this part of the course we will be using `pygraphistry` and [Graphistry Hub](https://hub.graphistry.com/) [https://hub.graphistry.com/](https://hub.graphistry.com/) to visualize networks. Both are free for personal use and are powerful for visualizing networks large and small.

You can [signup](https://hub.graphistry.com/accounts/signup/) for a Graphistry account at [https://hub.graphistry.com/accounts/signup/](https://hub.graphistry.com/accounts/signup/) with your Github or Google account. Retain and use the username and password in 

In [4]:
graphistry.register(
    api=3,
    username=GRAPHISTRY_USERNAME,
    password=GRAPHISTRY_PASSWORD,
)

# Social Network Analysis

We start with an example of a social network: [High-energy physics theory citation network](https://snap.stanford.edu/data/cit-HepTh.html) from [Stanford SNAP](http://snap.stanford.edu/).

> Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.
>
> The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.
>
> The data was originally released as a part of [2003 KDD Cup](http://www.cs.cornell.edu/projects/kddcup/).

J. Leskovec, J. Kleinberg and C. Faloutsos. [Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations.](http://www.cs.cmu.edu/~jure/pubs/powergrowth-kdd05.pdf) ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.

Note: in addition to the metadata in the graph below, the time of paper submission is also available: https://snap.stanford.edu/data/cit-HepTh-dates.txt.gz

## Extract Edge List and Build Directional Graph (DiGraph)

First we download the edge list and build the structure of the network: `(paper)-cited->(paper)`

In [50]:
import gzip
import io
import networkx as nx
import tarfile


# Initialize a directed graph
G = nx.DiGraph()

# Download and load edges (citations) from `cit-HepTh.txt.gz`
response = requests.get("https://snap.stanford.edu/data/cit-HepTh.txt.gz")
gzip_content = io.BytesIO(response.content)

# Decompress the gzip content and build the edge list for our network
with gzip.GzipFile(fileobj=gzip_content) as f:
    for line in f:
        line = line.decode('utf-8')
        # Ignore lines that start with '#'
        if not line.startswith('#'):
            cited, citing = line.strip().split('\t')
            G.add_edge(citing, cited)

## Summarize the Propertis of our DiGraph

Let's check how many nodes and edges we have. This will help evaluate how we are doing when we parse the abstracts to add properties to our DiGraph.

In [51]:
print(f"Total nodes: {G.number_of_nodes():,}")
print(f"Total edges: {G.number_of_edges():,}")

Total nodes: 27,770
Total edges: 352,807


## Add Properties to Nodes in Network

Now we will use `extract_paper_info(text)` to add structured data to the nodes in our network.

### Using ChatGPT to Write NetworkX Code

We cover Chatbots at the end of this course, I just want to point out that the following dialogue generated the code below:

```
I am going to past some text representing some semi-structured data about an academic paper below:

Paper: hep-th/0002031
From: Maulik K. Parikh 
Date: Fri, 4 Feb 2000 17:04:51 GMT   (10kb)

Title: Confinement and the AdS/CFT Correspondence
Authors: D. S. Berman and Maulik K. Parikh
Comments: 12 pages, 1 figure, RevTeX
Report-no: SPIN-1999/25, UG-1999/42
Journal-ref: Phys.Lett. B483 (2000) 271-276
\\
  We study the thermodynamics of the confined and unconfined phases of
superconformal Yang-Mills in finite volume and at large N using the AdS/CFT
correspondence. We discuss the necessary conditions for a smooth phase
crossover and obtain an N-dependent curve for the phase boundary.
\\

Now I want you to write python code to extract the fields "Paper", "From", "Date", "Title", "Authors", "Comments", "Report-no", "Journal-ref" and the last field the "Abstract" text content.
```

In [52]:
def extract_paper_info(text):
    # Define patterns for each field
    patterns = {
        "Paper": r"Paper:\s*(.+?)\n",
        "From": r"From:\s*(.+?)\n",
        "Date": r"Date:\s*(.+?)\n",
        "Title": r"Title:\s*(.+?)\n",
        "Authors": r"Authors:\s*(.+?)\n",
        "Comments": r"Comments:\s*(.+?)\n",
        "Report-no": r"Report-no:\s*(.+?)\n",
        "Journal-ref": r"Journal-ref:\s*(.+?)\n",
        "Abstract": r"\\\\\n\s*(.*?)\n\\\\"
    }

    # Extract field values
    paper_info = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, re.DOTALL)
        if match:
            paper_info[field] = match.group(1).strip()

    return paper_info

In [53]:
# Download the abstracts from `cit-HepTh-abstracts.tar.gz`
abstract_response = requests.get("https://snap.stanford.edu/data/cit-HepTh-abstracts.tar.gz")

# Convert the response content to an in-memory binary stream
abstract_gzip_content = io.BytesIO(abstract_response.content)

# Decompress the gzip content
with gzip.GzipFile(fileobj=abstract_gzip_content) as f:
    with tarfile.open(fileobj=f, mode='r|') as tar:
        for member in tar:
            abstract_file = tar.extractfile(member)
            if abstract_file is not None:
                content = abstract_file.read().decode('utf-8')
                paper_info = extract_paper_info(content)
                if paper_info:
                    paper_id = paper_info.get("Paper", "").split("/")[-1]  # Get the paper ID part of the "Paper" field
                    for field, value in paper_info.items():
                        if paper_id in G:
                            G.nodes[paper_id][field] = value
                        else:
                            G.add_node(paper_id, **paper_info)

# Now `G` is a property graph representing the "High-energy physics theory citation network" dataset

In [61]:
G.nodes["1001"]

{}

In [62]:
G.edges("1001")

OutEdgeDataView([('1001', '5068'), ('1001', '7170'), ('1001', '7195'), ('1001', '11075'), ('1001', '104044'), ('1001', '105207'), ('1001', '112101'), ('1001', '209230'), ('1001', '212114'), ('1001', '212223')])

In [None]:
list(G.nodes())[0:10]