> **Exercise 1: Mixing Patterns and Assortativity.**
>
> * For each node, compute the fraction of edges that connect to a node that works in the same top field. Find the average value across all nodes.
> * Create a new graph, with the same nodes and edges, but where the association between nodes and field is shuffled. Compute the measure above for this randomized graph.
> * Repeat the point above 100 times (at least). Plot the distribution of the values obtained and compare it with the value you have found for the real graph. Is the chance to connect to a member of the same field significantly higher than it would be by chance?
> * Compute the assortativity coefficient with respect to author's field. How do you interpret the value you obtain? (__Hint__: See [this paper](https://nbviewer.org/github/suneman/socialgraphs2019/blob/master/lectures/Week5.ipynb), eq (2)). **Important**: here I do not want you to use the NetworkX implementation, but rather to implement your own version of the measure.
> * Is the graph assortative with respect to the degree? (e.g. do high-degree scientists tend to link to other high-degree scientists, and low-degree scientists to other low-degree scientists?). Provide an interpretation of your answer.
> * _Optional:_ Estimate the gender of each author from their name, using the [World Gender Name Dictionary](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MSEGSJ). Repeat the analysis above to study the assortativity of the network by gender rather than by field. What do you observe?
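
The first three points can be sketched as follows. This is one possible approach under stated assumptions, not a reference solution: `edges` is assumed to be a list of `(u, v)` pairs (for example `list(G.edges())` from a NetworkX graph), `field` is assumed to be a dict mapping each node to its top field, and all function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def mean_same_field_fraction(edges, field):
    """Average, over nodes with at least one edge, of the fraction of neighbours sharing the node's field."""
    same = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        match = field[u] == field[v]
        for node in (u, v):
            degree[node] += 1
            same[node] += match  # True counts as 1, False as 0
    return sum(same[n] / degree[n] for n in degree) / len(degree)


def shuffled_fields(field, rng):
    """Return a copy of `field` with the labels randomly permuted across the nodes."""
    nodes = list(field)
    labels = [field[n] for n in nodes]
    rng.shuffle(labels)
    return dict(zip(nodes, labels))


def null_distribution(edges, field, n_iter=100, seed=0):
    """Same-field fraction for `n_iter` random relabellings of the fixed graph."""
    rng = random.Random(seed)
    return [mean_same_field_fraction(edges, shuffled_fields(field, rng)) for _ in range(n_iter)]


# Toy example (hypothetical data): two triangles of different fields joined by one edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
field = {1: "Physics", 2: "Physics", 3: "Physics", 4: "Biology", 5: "Biology", 6: "Biology"}

observed = mean_same_field_fraction(edges, field)
null = null_distribution(edges, field, n_iter=200)
# Empirical p-value: share of shuffles that are at least as assortative as the real labelling
p_value = sum(x >= observed for x in null) / len(null)
```

Shuffling only the labels keeps both the graph structure and the number of nodes per field fixed, so any gap between `observed` and the values in `null` reflects how fields are placed on the network. The distribution can then be plotted as a histogram with the observed value marked on it.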


In [5]:
import networkx as nx
import pandas as pd
import pickle
import requests
from semanticscholar import SemanticScholar
import re

In [6]:
# Request the missing authors from the Semantic Scholar batch endpoint
import numpy as np
import requests
from time import sleep
def make_author_requests(ids: list[str], start_index: int = 0, end_index: int = 100_000, batch_size=30, save_threshold=500):
    if save_threshold < batch_size:
        raise ValueError("save_threshold must be at least batch_size")
    batch_url = "https://api.semanticscholar.org/graph/v1/author/batch"
    params = {"fields": "name,aliases,papers.s2FieldsOfStudy"}
    end_index = min(end_index, len(ids))  # do not run past the end of the id list
    responses = []
    errors = []
    requests_made = 0
    counter = start_index
    # Round the save interval up to a whole number of batches
    save_increment = int(np.ceil(save_threshold/batch_size) * batch_size)
    save_target = save_increment
    save_start_idx = 0
    while counter < end_index:
        current_ids = ids[counter: min(counter + batch_size, end_index)]
        try:
            response = requests.post(batch_url, json={"ids": current_ids}, params=params).json()
            if isinstance(response, dict):
                # A dict instead of a list means the request failed
                errors.append(response)
                if response.get("message") == "Too Many Requests":
                    sleep(300)  # back off, then retry the same batch
                    continue
            else:
                responses.append(response)
        except Exception as e:
            print(e)
        # Move on to the next batch, whether this one succeeded or not
        counter += batch_size
        requests_made += batch_size
        if requests_made == save_target:
            # Checkpoint to disk so that a crash does not lose everything
            with open(f"responses_{save_start_idx+start_index}_{save_target+start_index}.pkl", "wb") as f:
                pickle.dump(responses, f)
            responses = []
            save_start_idx = save_target
            save_target += save_increment

    # Save whatever is left after the last checkpoint (end_index is already an absolute index)
    with open(f"responses_{save_start_idx+start_index}_{end_index}.pkl", "wb") as f:
        pickle.dump(responses, f)
    return errors
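
The checkpoints written by `make_author_requests` each hold a list of batch responses. A minimal sketch for merging them back into one flat list of author records could look like the block below. The helper name `load_author_responses` is hypothetical, and the sketch assumes that every batch is a list whose entries are either author dicts or `None` for ids that could not be resolved.

```python
import glob
import os
import pickle


def load_author_responses(directory="."):
    """Merge all responses_*.pkl checkpoints in `directory` into one flat list of author records."""
    author_records = []
    for path in sorted(glob.glob(os.path.join(directory, "responses_*.pkl"))):
        with open(path, "rb") as f:
            batches = pickle.load(f)  # one list per batch request
        for batch in batches:
            # Assumption: unresolved ids show up as None inside a batch
            author_records.extend(a for a in batch if a is not None)
    return author_records
```

Because each collaborator saves files for a different index range, collecting all of them in one directory and calling `load_author_responses` once would give the combined result.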

In [7]:
with open("missing_authors.pkl", "rb") as f:
    missing_authors = pickle.load(f)

In [8]:
n_missing = len(missing_authors)
print(f"Number of missing authors: {n_missing}")
print(f"Number of requests:\n\tKris: {n_missing // 3}\n\tRasmus: {n_missing // 3}\n\tVictor: {n_missing - (n_missing // 3) * 2}")

Number of missing authors: 23026
Number of requests:
	Kris: 7675
	Rasmus: 7675
	Victor: 7676


In [None]:
# Kris
# make_author_requests(missing_authors, batch_size=30, start_index=0, end_index=7675)

# Rasmus
# make_author_requests(missing_authors, batch_size=30, start_index = 7675, end_index = 7675*2)

# Victor
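# Pass end_index explicitly: the default of 100_000 is larger than the number of missing authors
make_author_requests(missing_authors, batch_size=30, start_index=7675*2, end_index=len(missing_authors))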

In [3]:
authors = pd.read_csv('all_authors.csv', index_col=0)
papers = pd.read_csv('all_papers.csv', index_col=0)


In [5]:

# Build the weighted co-authorship edge list
# We have a mistake in the data, so each author only has their top field listed
# Each edge has the form {"id1": 123123123, "id2": 321321321, "weight": 3}
from collections import Counter
from itertools import combinations

pattern = '[a-zA-Z]+' # regex pattern matching runs of letters, which are stripped before parsing the ids

author_ids = set(authors["id"])

# (id1, id2) -> number of papers the two authors wrote together
pair_counts = Counter()
for paper in papers["authors"]:
    # The author list is stored as a string: drop the brackets, quotes and letters, then split on commas
    authorlist = re.sub(pattern, "", paper[1:-1].replace("'", "")).split(",")
    paper_authors = sorted({int(x) for x in authorlist if x.strip()})
    # Count each unordered pair once per paper, keeping pairs with at least one known author
    for a, b in combinations(paper_authors, 2):
        if a in author_ids or b in author_ids:
            pair_counts[(a, b)] += 1

# List of dicts with author id 1, author id 2, and how many times they co-authored
edgelist = [{"id1": a, "id2": b, "weight": w} for (a, b), w in pair_counts.items()]
print(f"Number of edges: {len(edgelist)}")


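
With an edge list available, the assortativity coefficient asked for in Exercise 1 could be computed without the NetworkX implementation roughly as below. The sketch assumes the usual categorical form, r = (Σ_i e_ii − Σ_i a_i b_i) / (1 − Σ_i a_i b_i), which is taken here to be the eq. (2) that the hint refers to. The function names and the toy data are hypothetical, edges are given as `(u, v)` pairs, and edge weights are ignored.

```python
from collections import defaultdict


def attribute_assortativity(edges, label):
    """r = (sum_i e_ii - sum_i a_i*b_i) / (1 - sum_i a_i*b_i) for an undirected graph.

    Undefined (division by zero) when every node carries the same label.
    """
    e = defaultdict(float)  # e[(i, j)]: number of edge ends going from label i to label j
    for u, v in edges:
        # Count each undirected edge in both directions so that e is symmetric
        e[(label[u], label[v])] += 1
        e[(label[v], label[u])] += 1
    total = sum(e.values())
    a = defaultdict(float)  # a[i]: fraction of edge ends attached to label i (equals b[i] here)
    trace = 0.0
    for (i, j), count in e.items():
        a[i] += count / total
        if i == j:
            trace += count / total
    ab = sum(x * x for x in a.values())
    return (trace - ab) / (1 - ab)


def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge.

    Undefined (division by zero) when all nodes have the same degree.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # List every edge in both directions so that the measure is symmetric
    xs = [degree[u] for u, v in edges] + [degree[v] for u, v in edges]
    ys = [degree[v] for u, v in edges] + [degree[u] for u, v in edges]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys share the same mean and variance by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var


# Toy example (hypothetical data): two triangles of different fields joined by one edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
field = {1: "Physics", 2: "Physics", 3: "Physics", 4: "Biology", 5: "Biology", 6: "Biology"}

r_field = attribute_assortativity(edges, field)
r_degree = degree_assortativity(edges)
```

For the real data, the pairs could be taken from `edgelist` as `[(d["id1"], d["id2"]) for d in edgelist]`. Values of r near 1 indicate that authors mostly link within their own field (or to authors of similar degree), values near 0 indicate mixing close to random, and negative values indicate a preference for unlike partners.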