# Assignment 1
> **Github repository**: [02467_Assignment1](https://github.com/JulWin24/02467_Assignment1)
>
> **Group members**:
> - Rune Harlyk (s234814)
> - Joseph Nguyen (s234826)
> - Julius Winkel (s234862)

In [268]:
import requests
from joblib import Parallel, delayed 
from bs4 import BeautifulSoup
from unidecode import unidecode
from fuzzywuzzy import fuzz
from collections import defaultdict
from time import sleep
from tqdm.notebook import tqdm
from ast import literal_eval 
from collections import Counter
from itertools import combinations
import matplotlib.pyplot as plt
import networkx as nx
import netwulf as nw
import pandas as pd
import numpy as np
import json
import os
import logging
from functools import lru_cache
from typing import Optional, List, Dict
from tqdm.notebook import tqdm

logger = logging.getLogger()
logger.setLevel(logging.WARNING)

### Common helper functions

In [150]:
def load_existing_data(data_file):
    if os.path.exists(data_file):
        return pd.read_csv(data_file).to_dict(orient="records")
    return []

## Part 1: Web-scraping

### Fetch program

In [151]:
url = "https://ic2s2-2023.org/program"

req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

### Get names

In [152]:
names = set()

def get_plenary_names(names, soup): 
    new_names = {name.strip() for nav_list in soup.find_all("ul", class_="nav_list") 
        for i in nav_list.find_all("i") 
        for name in i.get_text(strip=True).split(",")}
    print(f"Found: {len(new_names)} plenary names")
    names.update(new_names)

def get_keynotes_names(names, soup):
    new_names = {a.get_text(strip=True).replace("Keynote - ", "") 
        for a in soup.find_all("a", href=lambda x: x and x.startswith("/keynotes#"))}
    print(f"Found: {len(new_names)} keynotes names")
    names.update(new_names)
    
def get_chair_names(names, soup):
    new_names = {i.get_text(strip=True).replace("Chair: ", "") 
          for i in soup.find_all("i") if i.get_text(strip=True).startswith("Chair:")}
    print(f"Found: {len(new_names)} chair names")
    names.update(new_names)

get_plenary_names(names, soup)
get_keynotes_names(names, soup)
get_chair_names(names, soup)

print(f"Found: {len(names)} names in total" )

Found: 1475 plenary names
Found: 10 keynotes names
Found: 49 chair names
Found: 1491 names in total


### Clean names

In [153]:
def clean_name(name):
    name = unidecode(name)
    return name

def clean_names(names):
    names = {clean_name(name) for name in names}
    return names

def fuzz_names(names, threshold=90):
    names_list = sorted(names)
    name_groups = defaultdict(list)

    for name in names_list:
        first_letter = name[0] if name else ""
        name_groups[first_letter].append(name)

    merge_map = {}
    for letter, group in name_groups.items():
        for i, name in enumerate(group):
            for j in range(i + 1, len(group)):
                match_name = group[j]
                score = fuzz.ratio(name, match_name)
                if score >= threshold:
                    merge_map[match_name] = name

    merged_names = set()
    for name in names_list:
        standardized_name = merge_map.get(name, name)
        merged_names.add(standardized_name)

    return merged_names

names = clean_names(names)
print(f"After cleaning: {len(names)} names")

names = fuzz_names(names)
print(f"After fuzzing: {len(names)} names")

After cleaning: 1486 names
After fuzzing: 1460 names


### Save to a file

In [154]:
with open('author_names_2023.txt', 'w', encoding="utf8") as f:
    for name in sorted(names):
        f.write(f"{name}\n")

### Reflection

## Part 2: Ready Made vs Custom Made Data

> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book __(answer in max 150 words)__.

Damon Centola wanted to test a scenario and obtained custom-made data through the internet. As it is costume made it will be able to tell something about the hypothesis of study. The data will avoid some of the faults of big data, like being 'dirty', 'incomplete' or 'inaccessible'. At the same time the data could also be smaller and more costly. The whole scenario could also be somewhat artificial and might not be applicable in the real world. 

Sinan Aral and Christos Nicolaides study used ready-made data of 1.1 million users from a fitness app. While the data is nonreactive, there might still be some underlying confounding factor.

> 2. How do you think these differences can influence the interpretation of the results in each study? __(answer in max 150 words)__





**INSPIRATION:** 
$$
\begin{array}{|c|c|c|}
\hline
\textbf{Factor} & \textbf{Centola (Custom-Made)} & \textbf{Nicolaides (Ready-Made)} \\
\hline
\text{Control} & \text{High – controlled variables} & \text{Low – cannot manipulate variables} \\
\hline
\text{Causality} & \text{Strong – designed experiment} & \text{Weak – correlation, not causation} \\
\hline
\text{Realism} & \text{Lower – artificial setting} & \text{Higher – real-world behaviors} \\
\hline
\text{Scale} & \text{Small – limited participants} & \text{Large – millions of users} \\
\hline
\text{Cost \& Time} & \text{High – expensive and time-consuming} & \text{Low – uses existing data} \\
\hline
\text{Data Completeness} & \text{High – collects exactly what is needed} & \text{Low – missing key details} \\
\hline
\end{array}
$$


## Part 3: Gathering Research Articles using the OpenAlex API

### Loading researches 2024

In [155]:
names_file = "author_names_2024.txt"
data_file = "author_data.csv"

with open(names_file, 'r', encoding="utf8") as f:
    names = f.read().splitlines()

print(f"Loaded names: {len(names)}")
names = clean_names(names)

names = fuzz_names(names)
print(f"After fuzzing: {len(names)} names")

# TODO
# 1 - Remove (Santa Fe Institute) from names
# 2 - Remove Pensylvania State University from names

names = sorted(names)

Loaded names: 1206
After fuzzing: 1202 names


### Defining working constants

In [245]:
# URLS
WORKS_URL = "https://api.openalex.org/works"
AUTHORS_URL = "https://api.openalex.org/authors"
CONCEPTS_URL = "https://api.openalex.org/concepts"

# REQUESTS PARAMETERS
BATCH_SIZE = 25
MAX_REQUESTS_PER_SECOND = 10
NUM_CORES = 10
REQUEST_TIMEOUT = 60
MAX_RETRIES = 5

# FILTERS
social_science_fields = ['Political science', 'Economics', 'Psychology', 'Sociology']
quantitative_fields = ['Mathematics', 'Physics', 'Computer science']
min_cited_by = 10
max_authors = 10

# SELECTED FIELDS
WORKS_ATTRIBUTES = ["id", "title", "publication_year", "abstract_inverted_index", "authorships", "cited_by_count", "concepts"]
AUTHOR_ATTRIBUTES = ["id", "display_name", "works_count", "summary_stats", "affiliations", "works_api_url"]

# MAPPING
id_slice = len("https://openalex.org/")

### Helper functions to make requests

In [157]:
def make_request(url: str, mapper = None) -> Optional[Dict]:
    retries = 0
    while retries <= MAX_RETRIES:
        try:
            response = requests.get(url, timeout=REQUEST_TIMEOUT)
            
            # Handle rate limiting and server errors
            if response.status_code == 429 or response.status_code >= 500:
                wait_time = 0.5 * (2 ** retries)  # Exponential backoff
                logger.warning(f"Request throttled (status {response.status_code}), waiting {wait_time:.2f}s")
                sleep(wait_time)
                retries += 1
                continue
            
            if not response.ok:
                logger.error(f"Request failed with status {response.status_code}, {response.text}")
                return None
            
            if mapper:
                return mapper(response.json())
            return response.json()
            
        except Exception as e:
            logger.error(f"Request error: {e}")
            return None
        
        time.sleep(1 / MAX_REQUESTS_PER_SECOND) # Apply rate limiting
    
    logger.error(f"Max retries exceeded for URL: {url}")
    return None

def make_paginated_requests(url: str, mapper = None) -> List[Dict]:
    """Get all pages of results from paginated API."""
    all_results = []
    cursor = "*"
    
    while cursor:
        page_url = f"{url}&cursor={cursor}" if "?" in url else f"{url}?cursor={cursor}"
        
        response_data = make_request(page_url, mapper)
        if not response_data:
            break
        
        results = response_data.get("results", [])
        if mapper:
            mapped_results = []
            for item in results:
                try:
                    mapped_item = mapper(item)
                    if mapped_item is not None:
                        mapped_results.append(mapped_item)
                except Exception as e:
                    logger.error(f"Error in mapper function: {e}")
            all_results.extend(mapped_results)
        else:
            all_results.extend(results)

        cursor = response_data.get("meta", {}).get("next_cursor")
    
    return all_results

### Fetching of IC2S2 2024 author data

In [209]:
def map_author_result(results: Dict) -> Dict:
    return {
        "id": results.get("id")[id_slice:],
        "display_name": results.get("display_name"),
        "works_count": results.get("works_count"),
        "h_index": results.get("summary_stats")["h_index"],
        "country_code": results.get("affiliations")[0]["institution"]["country_code"],
        "works_api_url": results.get("works_api_url")
    }

def map_first_author(json: Dict) -> Dict:
    res = json.get("results")[0]
    return map_author_result(res)

def get_author_data(name):
    try:
        url = f"https://api.openalex.org/authors?filter=display_name.search:{name}"
        author = make_request(url, map_first_author)
        return author if author else name
    except Exception as ex:
        print(f"Error: {ex}")
        return name
    
# get_author_data("Ralph Hertwig")

In [215]:
existing_data = load_existing_data(data_file)
existing_names = {entry['display_name'] for entry in existing_data if 'display_name' in entry}
names_to_process = list(set(names) - existing_names)

print(f"Already have {len(existing_names)}, missing {len(names_to_process)}, total {len(names)}")

author_data = existing_data
bad_names = []

results = Parallel(n_jobs=NUM_CORES)(
    delayed(get_author_data)(name) for name in tqdm(names_to_process, desc="Fetching authors in parallel", unit="authors")
)

author_data = [res for res in results if isinstance(res, dict)]

bad_names = [res for res in results if not isinstance(res, dict)]

author_df = pd.DataFrame(existing_data + author_data)

author_df = author_df.drop_duplicates(subset='id', keep='first')

author_df.to_csv(data_file, index=False)

print(f"Got data for: {len(author_data)}, missing {len(bad_names)}")

Already have 0, missing 1202


Fetching authors in parallel:   0%|          | 0/1202 [00:00<?, ?authors/s]

Got data for: 1025, missing 177


### Load data again and filter between 5-5000 works

In [260]:
author_df = pd.read_csv('author_data.csv')
author_df = author_df.drop_duplicates(subset='id', keep='first')

print(len(author_df))
author_df = author_df[(author_df["works_count"] >= 5) & (author_df["works_count"] <= 5000)]
print(len(author_df))

1025
917


### Define filters

In [218]:
def get_concepts_url(level:int = 0) -> str:
    return f"{CONCEPTS_URL}?filter=level:{level}&per-page=200"

@lru_cache(maxsize=1)
def fetch_concept_ids(level = 0) -> str:
    concepts_url = get_concepts_url(level)
    response_concepts = requests.get(concepts_url)

    if response_concepts.ok:
        concepts = response_concepts.json()['results']
        
        social_science_ids = [i['id'][id_slice:] for i in concepts if i['display_name'] in social_science_fields]
        quantitative_ids = [i['id'][id_slice:] for i in concepts if i['display_name'] in quantitative_fields]

    return social_science_ids, quantitative_ids

social_science_ids, quantitative_ids = fetch_concept_ids()

In [247]:
def create_concept_filter(*groups: List[List[str]]) -> str:
    return ",".join((f"concepts.id:{'|'.join(group)}" for group in groups))

def create_cited_by_filter(min_cited_by):
    return f"cited_by_count:>{min_cited_by}"

def create_authors_filter(ids: List[str]) -> str:
    return f"authorships.author.id:{'|'.join(ids)}"

def create_author_id_filter(ids: List[str]) -> str:
    return f"id:{'|'.join(ids)}"

def create_author_count_filter(max_authors):
    return f"authors_count:<{max_authors}"

def create_query_filter(*filters:List[str]) -> str:
    return ",".join(filters)

social_science_ids, quantitative_ids = fetch_concept_ids()
concept_filter = create_concept_filter(social_science_ids, quantitative_ids)
cited_by_filter = create_cited_by_filter(min_cited_by)
author_count_filter = create_author_count_filter(max_authors)

In [239]:
def get_url(base_url: str, filter_str:str, select_data: List[str], per_page:int = 200) -> str:
    return (
        f"{base_url}?filter={filter_str}"  # Filter data
        f"&select={','.join(select_data)}"  # Select data
        f"&per_page={per_page}"             # Fetch max results per request
    )

def get_works_url(filter_str:str, select_data: List[str], per_page:int = 200) -> str:
    return get_url(WORKS_URL, filter_str, select_data, per_page)

def get_author_url(filter_str:str, select_data: List[str], per_page:int = 200) -> str:
    return get_url(AUTHORS_URL, filter_str, select_data, per_page)

# author_count_filter = create_authors_filter(["A5068556395"])
# query_filter = create_query_filter(concept_filter, cited_by_filter, author_count_filter) 

# test_works_url = get_works_url(query_filter, WORKS_ATTRIBUTES)
# test_works_url

### Fetching works

In [240]:
def map_work(item) -> tuple[dict, dict]:
    return {
        "id": item["id"],
        "publication_year": item.get("publication_year"),
        "cited_by_count": item.get("cited_by_count", 0),
        "author_ids": [auth["author"]["id"][id_slice:] for auth in item.get("authorships", [])]
    }, {
        "id": item["id"],
        "title": item.get("title"),
        "abstract_inverted_index": item.get("abstract_inverted_index")
    }

def fetch_work_batched(authors):
    author_filter = create_authors_filter(authors)
    query_filter = create_query_filter(concept_filter, cited_by_filter, author_count_filter, author_filter) 
    url = get_works_url(query_filter, WORKS_ATTRIBUTES)

    all_papers = []
    all_abstracts = []

    def process_work(work):
        papers, abstracts = map_work(work)
        all_papers.append(papers)
        all_abstracts.append(abstracts)
        return None

    make_paginated_requests(url, mapper=process_work)

    return all_papers, all_abstracts

In [241]:
author_ids = author_df["id"].tolist()
author_batches = [author_ids[i: i + BATCH_SIZE] for i in range(0, len(author_ids), BATCH_SIZE)]

print(f"Fetching works for {len(author_ids)} authors in {len(author_batches)} batches")

results = Parallel(n_jobs=NUM_CORES)(
    delayed(fetch_work_batched)(batch) for batch in tqdm(author_batches, desc="Fetching works in parallel", unit="batch")
)

print(f"Finished fetching {len(results)} results")

Fetching works for 917 authors in 37 batches


Fetching works in parallel:   0%|          | 0/37 [00:00<?, ?batch/s]

Finished fetching 37 results


In [242]:
all_papers = [paper for batch_papers, _ in results for paper in batch_papers]
all_abstracts = [abstract for _, batch_abstracts in results for abstract in batch_abstracts]

print(f"Got {len(all_papers)} papers and {len(all_abstracts)} abstracts")

papers_df = pd.DataFrame(all_papers)
abstracts_df = pd.DataFrame(all_abstracts)

papers_df = papers_df.drop_duplicates(subset='id', keep='first')
abstracts_df = abstracts_df.drop_duplicates(subset='id', keep='first')

papers_df.to_csv("ic2s2_papers.csv", index=False)
abstracts_df.to_csv("ic2s2_abstract.csv", index=False)

Got 11351 papers and 11351 abstracts


### Get all co authors from papers

In [243]:
papers_df = pd.read_csv("ic2s2_papers.csv", converters={'author_ids': literal_eval})
coauthor_ids = papers_df.explode('author_ids')["author_ids"].unique().tolist()
len(set(coauthor_ids))

15324

In [249]:
def map_author_result_filter(result: Dict) -> Dict:
    num_works = result.get('works_count', 0)
    if num_works < 5 or num_works > 5000:
        return None
    return map_author_result(result)

def fetch_author_batched(authors):
    author_filter = create_author_id_filter(authors)
    print(author_filter)
    query_filter = create_query_filter(author_filter) 
    url = get_author_url(query_filter, AUTHOR_ATTRIBUTES)

    return make_paginated_requests(url, mapper=map_author_result_filter)

In [250]:
author_batches = [coauthor_ids[i: i + BATCH_SIZE] for i in range(0, len(coauthor_ids), BATCH_SIZE)]

print(f"Fetching coauthor data for {len(coauthor_ids)} authors in {len(author_batches)} batches")

results = Parallel(n_jobs=NUM_CORES)(
    delayed(fetch_author_batched)(batch) for batch in tqdm(author_batches, desc="Fetching author in parallel", unit="batch")
)

print(f"Finished fetching {len(results)} results")

Fetching author data for 15324 authors in 613 batches


Fetching author in parallel:   0%|          | 0/613 [00:00<?, ?batch/s]

Finished fetching 613 results


In [258]:
coauthors = [author for batch in results for author in batch]

coauthors_df = pd.DataFrame(coauthors)
coauthors_df = coauthors_df.drop_duplicates(subset='id', keep='first')

coauthors_df.to_csv("ic2s2_coauthors.csv", index=False)

print(f"Got data for {len(coauthors_df)} coauthors")
# coauthors_df["works_count"].describe()

Got data for 14061 coauthors


## Part 4: The Network of Computational Social Scientists

### Getting final dataset with authors and coauthors

In [269]:
all_authors_df = pd.concat([author_df, coauthors_df], ignore_index=True)
all_authors_df = all_authors_df.drop_duplicates(subset='id', keep='first')
all_authors_df.to_csv("ic2s2_all_authors.csv", index=False)
all_author_ids = all_authors_df.explode('id')["id"].unique().tolist()
len(all_author_ids)

14293

In [270]:
all_author_batches_ids = [all_author_ids[i: i + BATCH_SIZE] for i in range(0, len(all_author_ids), BATCH_SIZE)]

print(f"Fetching {len(all_author_ids)} authors in {len(all_author_batches_ids)} batches")

results = Parallel(n_jobs=NUM_CORES)(
    delayed(fetch_work_batched)(batch) for batch in tqdm(all_author_batches_ids, desc="Fetching works in parallel", unit="batch")
)

## TODO - FILTER THAT ONLY AUTHORS IN I2CS2 ARE CONSIDERED

Fetching 572 batches


Fetching works in parallel:   0%|          | 0/572 [00:00<?, ?batch/s]

In [271]:
all_papers = [paper for batch_papers, _ in results for paper in batch_papers]
all_abstracts = [abstract for _, batch_abstracts in results for abstract in batch_abstracts]

# Convert to DataFrame
all_papers_df = pd.DataFrame(all_papers)
all_abstracts_df = pd.DataFrame(all_abstracts)

# Drop
all_papers_df = all_papers_df.drop_duplicates(subset='id', keep='first')
all_abstracts_df = all_abstracts_df.drop_duplicates(subset='id', keep='first')

all_papers_df.to_csv("all_papers.csv", index=False)
all_abstracts_df.to_csv("all_abstracts.csv", index=False)
len(all_papers_df)

## Filter to only use 2 degrees of separation

In [279]:
# Filter out papers with authors not in the author list
all_unique_author_ids = set(all_author_ids)
all_papers_df["author_ids"] = all_papers_df["author_ids"].apply(lambda x: [i for i in x if i in all_unique_author_ids])

all_papers_df = all_papers_df[all_papers_df["author_ids"].apply(len) >= 2]

all_papers_df.to_csv("ic2s2_coauthors_papers.csv", index=False)

len(all_papers_df)

37912

## Part 1: Network Construction

### Getting author pairs

In [291]:
edges = Counter()

for author_list in all_papers_df["author_ids"]:
    for pair in combinations(author_list, 2):
        edges[pair] += 1

edgelist = [(a, b, count) for (a, b), count in edges.items()]
len(edgelist)

63718

### Graph construction

In [292]:
def save_graph(graph_file, G):
    data = nx.readwrite.json_graph.node_link_data(G)
    with open(graph_file, "w") as f:
        json.dump(data, f)

def load_graph(graph_file):
    with open(graph_file, "r") as f:
        data = json.load(f)
    return nx.readwrite.json_graph.node_link_graph(data)

In [300]:
df_exploded = all_papers_df.explode("author_ids")

author_stats = df_exploded.groupby("author_ids").agg(
    first_publication_year=("publication_year", "min"),
    cited_by_count=("cited_by_count", "sum")
).reset_index()


df_merged = all_authors_df.merge(author_stats, left_on="id", right_on="author_ids", how="inner")
df_merged.drop(columns=["author_ids"], inplace=True)
attr_dict = df_merged[["id", "display_name", "country_code", "first_publication_year", "cited_by_count"]].set_index("id").to_dict("index")

In [301]:
graph_file = "ic2s2_coauthors_graph.json"
G = nx.Graph()
G.add_weighted_edges_from(edgelist)
nx.set_node_attributes(G, attr_dict)

In [302]:
save_graph(graph_file, G)

## Part 2: Preliminary Network Analysis

### Network Metrics:

In [303]:
# Network Stats
num_links = len(edgelist)
num_nodes = len(all_unique_author_ids)
print(f"Got {num_links} links between {num_nodes} nodes")

# Density Stats
print(f'Network density is: {nx.density(G)}')

# Number of connected components
num_isolated = len(list(nx.isolates(G)))
print("Is fully connected: ", nx.is_connected(G))
print("Number of connected components: ", nx.number_connected_components(G))
print("Number of isolated nodes: ", num_isolated)

Got 63718 links between 14293 nodes
Network density is: 0.0005797675391076178
Is fully connected:  False
Number of connected components:  89
Number of isolated nodes:  0


### Degree analysis

In [304]:
degrees = [d for _, d in G.degree()]
strengths = [s for _, s in G.degree(weight="weight")]

degree_stats = {
    "avg": np.mean(degrees),
    "median": np.median(degrees),
    "mode": Counter(degrees).most_common(1)[0][0],
    "min": np.min(degrees),
    "max": np.max(degrees)
}

strength_stats = {
    "avg": np.mean(strengths),
    "median": np.median(strengths),
    "mode": Counter(strengths).most_common(1)[0][0],
    "min": np.min(strengths),
    "max": np.max(strengths)
}

print(degree_stats)
print(strength_stats)

{'avg': 8.146313692001138, 'median': 6.0, 'mode': 4, 'min': 1, 'max': 339}
{'avg': 12.808995160831198, 'median': 7.0, 'mode': 4, 'min': 1, 'max': 536}


#### Top authors

In [305]:
def top_nodes_by_degree(G, top_n=5):
    return sorted(G.degree, key=lambda x: x[1], reverse=True)[:top_n]

top_5 = top_nodes_by_degree(G)
print(top_5)

[('A5100322712', 339), ('A5005421447', 302), ('A5077712228', 270), ('A5007176508', 239), ('A5059645286', 235)]


## Visualize

In [308]:
config = {
    "zoom": 0.6,
    "scale_node_size_by_strength": True,
    "node_size_variation": 1,
    "node_size": 30,
    "node_gravity": 0.45,
}

id_to_name = pd.Series(all_authors_df.display_name.values, index=all_authors_df.id).to_dict()

# G_named = nx.relabel_nodes(G, id_to_name)

network, config = nw.visualize(G)#, config=config)

# fig, ax = nw.draw_netwulf(network, figsize=(10,10))
plt.show()
# plt.savefig("myfigure.pdf")

ValueError: Out of range float values are not JSON compliant