# 02467 - Computational Social Science: Assignment 1
#### Group 15: Adam Bøttcher Haupt-Hansen, s224202 & Edvin Smajlovic, s224204 & Sophia Reiffenstein Petersen, s224222 

Everyone in the group has contributed equally to this project, as we have worked together weekly on the exercises.


In [32]:
# Link to GitLab Repository
GitRep = "https://gitlab.gbar.dtu.dk/s224222/computational_assignment1"
Github = "https://github.com/SophiaRP00/Computational_Assignment_1/tree/main"



In [2]:
import requests
from bs4 import BeautifulSoup
import difflib
import pandas as pd
import networkx as nx
import regex as re
import math
import os
from time import sleep

### Part I - WEBSCRAPING

In [3]:
url = "https://ic2s2-2023.org/program"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

In [4]:
# Get all the names into a table
names = []
for i in soup.find_all("i"):
    names.append(i.text)

names = ",".join(names)


# if "chair" in names: Remove the "chair"

names = names.replace("Chair: ", "")
names = names.split(",")
map(str.strip, names)
print(names[0:5])

['Claudia Wagner', 'Jonas L Juul', ' Jon Kleinberg', 'Chloe Ahn', ' Xinyi Wang']


In [5]:
# Remove duplicates, strip and lower case
for i in range(len(names)):
    names[i] = names[i].lower()
    names[i] = names[i].strip()
names = list(set(names))
# and sort alphabetically

names.sort()
print(names[0:5])

['aaron clauset', 'aaron j. schwartz', 'aaron schein', 'aaron smith', 'abbas haidar']


In [6]:
# Remove duplicates where the names are very similar using sequence matcher

duplicate = []
for i in range(len(names)):
    for j in range(i+1, len(names)):
        if difflib.SequenceMatcher(None, names[i], names[j]).ratio() > 0.95:
            duplicate.append(names[j].lower().strip())

print(duplicate[0:5])

['alessandro flammini', 'anne c. kroon', 'diogo pacheco', 'duncan j. watts', 'fabio carrella']


In [7]:
# Remove one of the duplicates from the list
names = [x for x in names if x not in duplicate]

In [8]:
print(len(names))

1471


We get 1471 authors after having cleaned the data

In [9]:
# Create a txt file with the names

with open("Data/names.txt", "w") as f:
    for name in names:
        f.write(name + "\n")



Explain the process you followed to web-scrape the page. Which choices did you make to accurately retreive as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices (answer in max 150 words).

#### THE PROCESS:

In order to retrieve as many names as possible, we noticed that all the authors were written in italics in the Plenary programs. Additionally, authors were written after the word "Keynote" in the overview which we also took into consideration. 

To clean the data, we lowercased, striped and removed obvious duplicates. To take human error into account, we used a sequence matcher from the difflib library where we compared each name and if they were 95% or more similar, we removed them. We checked each of them by printing the two similar names, and thereby, trying to ensure we didn't remove any that could potentially be be two seperate people. 

### PART II - READY MADE VS. CUSTOM MADE

#### QUESTION:

What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book (answer in max 150 words).

#### ANSWER: <br>
The pros of custom-made data as seen in Centola's experiments is that the experiment was designed in a controlled environment, reducing confounding variables and making it easuer to establish relationsships between people in Centola's experiment. However, there are quite a lot of cons when using custom-made data such as limited generalisability and smaller sample sizes. The largest issue is being nonreactive which can explain how certain people might perform better, since they are being watched. 

When it comes to Nicolaide's study, we see that it provides a better insight into socital interactions as it is performed on "real-life data. However, we see that Ready-made data is generally sensitive as well as dirty and lacks reproduceability. This is also the case when it comes to Nicolaide's study where the data might reflect real-life networks compared to custom-made but it comes at the cost of the privacy of the participants. 

Reference: Chapter 2.3: Salgani, Matthew J. - Bit by bit (2018)

#### QUESTION: <br>
How do you think these differences can influence the interpretation of the results in each study? (answer in max 150 words)

#### ANSWER <br>
Using custom-made data is good when it comes to reproducability, but it might lead to wrong conclusions about societal networks. Ready-made data is more likely to lead to a more realistic interpretation about real-life situation, but as it is hard to reproduce, it might in that sense be difficult to argue for generability. Additionally, there might be unobserved confounders or homophily like mentioned in the video where people are friends with similar people or holidays or weather affects our actions. These will generally affect our data analysis and might lead to false conclusions.

Therefore, it is important to mix both custom-made and ready-made data which is often seen in real life.

### PART III - GATHERING RESEARCH ARTICLES USING OPENALEX API

Since actually building the csv file of the 2024 authors is not a part of the assignment, we can simply load in the file.

In [10]:
authors = pd.read_csv("Data/authors2024.csv")

In [11]:
len(authors)

1047

In [12]:
# Created to filter what papers we actually want to look at
fields_to_ids = {'Sociology': 'https://openalex.org/C144024400',
                 'Psychology': 'https://openalex.org/C15744967',
                 'Economics': 'https://openalex.org/C162324750',
                 'Political Science': 'https://openalex.org/C17744445',
                 'Mathematics': 'https://openalex.org/C33923547}',
                 'Physics':'https://openalex.org/C121332964',
                 'Computer Science': 'https://openalex.org/C41008148'
                }


In [13]:
### Showing the basic structure of the API request
base_url = 'https://api.openalex.org/' 
resource = 'works'
filterstring = ["cited_by_count:>10",
                 "authors_count:<10", 
                 f"concepts.id:{fields_to_ids['Sociology']}|{fields_to_ids['Psychology']}|{fields_to_ids['Economics']}|{fields_to_ids['Political Science']}", 
                 f"concepts.id:{fields_to_ids['Mathematics']}|{fields_to_ids['Physics']}|{fields_to_ids['Computer Science']}"]
filterstring = ",".join(filterstring)

request = {
    "filter": filterstring,
    "cursor": '*',
}
if (os.environ.get('mail') is not None): ## get email from environment variable, so we don't expose it in the code
    request['mailto'] = os.environ.get('mail')

In [14]:
rq = requests.get(base_url + resource, params=request)
data = rq.json()

Perfect! We can get data from the api.

In [15]:
# Ensuring we keep the authors only within limits
authors = authors[(authors['works_count'] > 5) & (authors['works_count'] < 5000)]

In [16]:
author_ids = authors['id'].values

In [17]:
author_ids[0:5]

array(['https://openalex.org/A5097398930',
       'https://openalex.org/A5082554858',
       'https://openalex.org/A5067206551',
       'https://openalex.org/A5014394213',
       'https://openalex.org/A5008439962'], dtype=object)

We quickly define some helper functions for parsing the papers and handling individual requests.

In [18]:
def parse_paper(paper):
    paper_id = paper['id']
    pub_year = paper['publication_year']
    cited_by = paper['cited_by_count']
    author_ids = [author['author']['id'] for author in paper['authorships']]
    title = paper['title']
    abstract_inverted_index = paper['abstract_inverted_index']
    return [paper_id, pub_year, cited_by, author_ids, title, abstract_inverted_index]

In [19]:
def get_author_papers(index):
    query = author_ids[index: index+10]   
    author_string = "authorships.author.id:" + "|".join(query)
    cursor = '*'
    request = {
        "filter": filterstring + "," + author_string,
        "cursor": cursor,
        'per_page': 100
    }
    if (os.environ.get('mail') is not None): ## Same deal as before. I dont particularly want to expose my email in a public repo
        request['mailto'] = os.environ.get('mail')
    parsed_results = []
    while cursor:
        rq = requests.get(base_url + resource, params=request)
        if rq.status_code != 200:
            raise Exception(rq.status_code)
        try:
            meta = rq.json()['meta']
            cursor = meta['next_cursor']
        except:
            cursor = False
            break
        request['cursor'] = cursor
        sleep(0.6) ## Arguably a long sleep, but better safe than sorry
        results = rq.json()['results']
        for paper in results:
            parsed_results.append(parse_paper(paper))
    return parsed_results

In [20]:
from tqdm import tqdm
from joblib import Parallel, delayed

## This is slightly inconsistent. Seems like the API just sometimes fails
final_results = Parallel(n_jobs=8, prefer='threads')(delayed(get_author_papers)(i) for i in tqdm(range(0, len(author_ids), 10))) ## Using joblib to parallelize the requests

100%|██████████| 92/92 [00:37<00:00,  2.49it/s]


Having gathered the results, we can then simply save them.

In [21]:
iscpapers = []
for res in final_results:
    iscpapers.extend(res)


iscpapers = pd.DataFrame(iscpapers, columns=['paper_id', 'pub_year', 'cited_by', 'author_ids', 'title', 'abstract_inverted_index'])

iscpapers_2 = iscpapers[['paper_id', 'pub_year', 'cited_by', 'author_ids']]
isc_abstracts = iscpapers[['paper_id', 'title', 'abstract_inverted_index']]
iscpapers_2.to_csv('Data/iscpapers.csv', index=False)
isc_abstracts.to_csv('Data/isc_abstracts.csv', index=False)

In [22]:
len(iscpapers)

11595

In [23]:
paper_authors = iscpapers['author_ids'].explode().reset_index()

In [24]:
len(list(set(paper_authors['author_ids'].to_numpy()))) ## Unholy way to get the number of unique authors

15666

#### Question:
How many works are listed in your IC2S2 papers dataframe? How many unique researchers have co-authored these works?
#### Answer
There are 11595 papers in our dataframe. These papers have been authored by a total of 15666 authors
#### Question:
Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?
#### Answer:
We used several strategies for speed. We used filters in our both our own author list, and in our api requests to make sure that every result returned was actually relevant. Additionally we grouped 10 authors per request as most authors only have a few papers, so getting more results back from each request allowed for higher speeds. Finanally we used multithreading, to enable multiple simultaneous calls. This made the code run quite fast, in fact, it includes a sleep(0.6) because it kept running too fast for the api.
#### Question:
Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? 
#### Answer:
Filtering criteria help guide the collection of data. That said, in a relatively new field such as Computational Social Science, limiting ourselves to only papers with \>10 citations, can potentially, limit our understanding of more recent changes. As this is data from a conference in 2024, this risks eliminating new researchers, or papers published recently. Additionally the topic filter has problems, due to how OpenAlex structures its topics. This potentially results in actual computational social science getting drowned out by papers in other genres, that happed to be tagged with topics like "mathematics"

### PART IV - THE NETWORK OF COMPUTATIONAL SCIENTISTS

In [25]:
papers_combined = pd.read_csv('Data/IC2S2_combined_papers.csv')
authors_combined = pd.read_csv('Data/IC2S2_combined_authors.csv')

In [26]:
#We construct our graph and all the weighted edges between authors
graph = nx.Graph()
for paper_authors in papers_combined['author_ids']:
    authors = re.sub(r"[\[\]'\s]", "", paper_authors).split(',')
    for i,author1 in enumerate(authors):
        for author2 in authors[i:]:
            if author1 != author2:
                if graph.has_edge(author1,author2):
                    graph[author1][author2]['weight'] += 1
                else:
                    graph.add_edge(author1,author2,weight=1)
print("Total number of Nodes:" + str(graph.number_of_nodes()))
print("Total number of Edges:" + str(graph.number_of_edges()))

Total number of Nodes:17729
Total number of Edges:71979


In [27]:
#We find the attributes of the authors
citationcount = {}
publication_years = {}

for i,paper_authors in enumerate(papers_combined['author_ids']):
    authors = re.sub(r"[\[\]'\s]", "", paper_authors).split(',')
    for author in authors:
        if author not in citationcount:
            citationcount[author] = 0
        citationcount[author] += 1

        if author not in publication_years:
            publication_years[author] = papers_combined['publication_year'][i]
        else:
            if publication_years[author] > papers_combined['publication_year'][i]:
               publication_years[author] = papers_combined['publication_year'][i]

def get_author_info(author_id):
    author_info = authors_combined[authors_combined['id'] == author_id]
    display_name = author_info['display_name'].values[0]
    country_code = author_info['country_code'].values[0]
    return display_name, country_code, citationcount[author_id], publication_years[author_id]

In [28]:
#We add the node attributes to the graph      Runtime: ~20s
for author in graph.nodes():
    display_name, countrycode, citation_count, first_publication_year = get_author_info(author)
    if type(countrycode) != str and math.isnan(countrycode):
        countrycode = 'NaN'

    nx.set_node_attributes(graph, {author: countrycode}, 'country_code')
    nx.set_node_attributes(graph, {author: display_name}, 'display_name')
    nx.set_node_attributes(graph, {author: citation_count}, 'citation_count')
    nx.set_node_attributes(graph, {author: int(first_publication_year)}, 'first_publication_year')

#We print the first 5 nodes of the graph and save it as JSON
for node in list(graph.nodes(data=True))[:5]:
   print(node)

import json
with open('Data/graph.json', 'w') as f:
    json.dump(nx.node_link_data(graph), f)

('https://openalex.org/A5014647140', {'country_code': 'US', 'display_name': 'Aaron Clauset', 'citation_count': 36, 'first_publication_year': 2004})
('https://openalex.org/A5082953212', {'country_code': 'US', 'display_name': 'Cosma Rohilla Shalizi', 'citation_count': 23, 'first_publication_year': 1999})
('https://openalex.org/A5067142016', {'country_code': 'US', 'display_name': 'M. E. J. Newman', 'citation_count': 55, 'first_publication_year': 1994})
('https://openalex.org/A5008033989', {'country_code': 'US', 'display_name': 'Cristopher Moore', 'citation_count': 31, 'first_publication_year': 1996})
('https://openalex.org/A5007285525', {'country_code': 'US', 'display_name': 'Erzsébet Ravasz Regan', 'citation_count': 8, 'first_publication_year': 2000})


Part 2: Preliminary Network Analysis

In [29]:
print("Total number of authors: " + str(graph.number_of_nodes()))
print("Total number of unique collaborations: " + str(graph.number_of_edges()))
print()
print("Network Density: " + str(nx.density(graph)))
print()
print("Number of connected components: " + str(nx.number_connected_components(graph)))

Total number of authors: 17729
Total number of unique collaborations: 71979

Network Density: 0.00045802778209354516

Number of connected components: 107


#### QUESTION:<br>
Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?
If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset.
How many isolated nodes are there in your network? An isolated node is defined as a node with no connections to any other node in the network.
Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why? (answer in max 150 words)

#### ANSWER <br>
We see above that we have captured rougly 18 thousand authors with 72 thousand collaborations between them <br>
The network is sparse with a density of  around 0.000458. This makes sense as the maximum number of links is n*(n-1)/2 or 157149856 in this case, and we have nowhere near that many collaborations. <br>
We also see that the graph has 107 connected components, meaning the network is disconnected. <br>
Finally we know there are no isolated authors, as we constructed the graph (and hence all the nodes) entirely from edges <br>
<br>
These findings make sense, as most authors will not collaborate with anywhere near 18 thousand other authors, so the network density will be low. The graph not being connected also makes sense, as different fields and locations might prevent authors from co-authoring with others who are either far-away or working on a different topic

In [30]:
# Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree)
import numpy as np
degrees = [degree for node,degree in graph.degree()]
strengths = [strength for node,strength in graph.degree(weight='weight')]

print("Average degree:" + str(np.mean(degrees)))
print("Median degree:" + str(np.median(degrees)))
print("Mode degree:" + str(max(set(degrees), key=degrees.count)))
print("Minimum degree:" + str(min(degrees)))
print("Maximum degree:" + str(max(degrees)))
print()
print("Average strength:" + str(np.mean(strengths)))
print("Median strength:" + str(np.median(strengths)))
print("Mode strength:" + str(max(set(strengths), key=strengths.count)))
print("Minimum strength:" + str(min(strengths)))
print("Maximum strength:" + str(max(strengths)))

Average degree:8.11991652095437
Median degree:6.0
Mode degree:4
Minimum degree:1
Maximum degree:362

Average strength:15.034124880139883
Median strength:8.0
Mode strength:4
Minimum strength:1
Maximum strength:605


#### QUESTION: <br>
What do these metrics tell us about the network? (answer in max 150 words)

#### ANSWER <br>
These metrics tell us that most authors in our graph have "only" collaborated with 4 other authors, while some few collaborate with up to 362 different authors, dragging the average up. We also see that the mode strength and mode degree are the same, implying that most authors in our graph who've only collaborated with 4 other authors, have only collaborated with them once (most authors have low strength)

In [31]:
#We find the 5 authors with the highest degree
degree_dict = dict(graph.degree())
sorted_degree = sorted(degree_dict.items(), key=lambda x: x[1], reverse=True)
print("Top 5 authors with the highest degree:")
for i in range(5):
    display_name, country_code, citation_count, first_publication_year = get_author_info(sorted_degree[i][0])
    print(display_name + " with degree " + str(sorted_degree[i][1]) + "    OpenAlex ID: " + str(sorted_degree[i][0]))

Top 5 authors with the highest degree:
Yan Wang with degree 362    OpenAlex ID: https://openalex.org/A5100322712
Yi Yang with degree 306    OpenAlex ID: https://openalex.org/A5005421447
Simon A. Levin with degree 279    OpenAlex ID: https://openalex.org/A5077712228
Alex Pentland with degree 261    OpenAlex ID: https://openalex.org/A5007176508
Robert West with degree 255    OpenAlex ID: https://openalex.org/A5059645286


#### QUESTION: <br>
Identify the top 5 authors by degree. What role do these node play in the network?
Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? (answer in max 150 words)

#### ANSWER <br>
Looking at the top 5 authors, they are not very aligned with the themes of Computational Social Science, with the exception of Robert West. The others mainly have papers concerning math, biology, and such. This is could be due to many reasons, but we think it's likely because of these fields being bigger and the topics might also be more prone to collaboration than Social Science