Assignment 1, Group 1

Imports and loading CSV file

In [42]:
import requests
import time
import pandas as pd
import re
from google.colab import drive

drive.mount('/content/drive')

data = pd.read_csv('/content/drive/MyDrive/Computational Social Science/ic2s2_2025_schedule_v5.csv')

# Find author column
author_cols = [c for c in data.columns if "author" in c.lower()]
print("Author-like columns:", author_cols)

author_col = author_cols[0] if author_cols else None
assert author_col is not None, "No author-like column found."

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Author-like columns: ['author']


Clean and split authors

In [55]:
def split_authors(x):
    if pd.isna(x):
        return []
    s = str(x).strip()

    # normalize common separators
    s = s.replace("\n", ";")
    s = re.sub(r"\s+(and|&)\s+", ";", s, flags=re.IGNORECASE)
    s = s.replace("|", ";")

    parts = [p.strip() for p in s.split(";") if p.strip()]

    # If still a single blob, try to detect comma-separated FULL NAMES:
    # Example: "Étienne Ollion, Émilien Schultz"
    if len(parts) == 1 and ", " in parts[0]:
        chunks = [c.strip() for c in parts[0].split(",") if c.strip()]

        looks_like_full_names = sum((" " in c) for c in chunks) >= max(2, int(0.7 * len(chunks)))

        if looks_like_full_names:
            parts = chunks

    return parts

def clean_author_name(a):
    a = re.sub(r"\s+", " ", str(a)).strip()
    a = a.strip(" '\"")
    a = a.replace("'", "").replace('"', "")
    return a.strip()

all_authors = []
for cell in data[author_col]:
    all_authors.extend(split_authors(cell))

authors_list = []
seen = set()
for a in all_authors:
    a = clean_author_name(a)
    if a and a not in seen:
        seen.add(a)
        authors_list.append(a)

print("Number of unique authors:", len(authors_list))
print(authors_list[:30])

Number of unique authors: 1565
['Étienne Ollion', 'Émilien Schultz', 'Miriam Schirmer', 'Julia Mendelsohn', 'Dustin Wright', 'Dietram A. Scheufele', 'Ágnes Horvát', 'Caspar van Lissa', 'Jorge Barreras', 'Thomas Li', 'Chen Zhong', 'Cate Heine', 'Adam (Zhengzi) Zhou', 'Paolo Turrini', 'Elias Fernández Domingos', 'Kristina Gligoric', 'Cinoo Lee', 'Tijana Zrnic', 'Adel Daoud', 'Connor Jerzak', 'Mark Whiting', 'Linnea Gandhi', 'Amirhossein Nakhaei', 'Duncan Watts', 'Egor Kotov', 'Johannes Mast', 'Matthew A. Turner', 'James Holland Jones', 'Dean Eckles', 'Kathleen Carly']


OpenAlex helpers

In [102]:
api_key = "c3nPPdLKjAfWMJKxPtzPR6"
BASE = "https://api.openalex.org"

# OpenAlex helper for data retrieval
def oa_get(path, params=None, sleep=0.6, tries=3):
    url = f"{BASE}{path}"
    params = dict(params or {})
    params["api_key"] = api_key  # or remove this line and use "mailto" instead if you don't have an API key

    last_err = None

    for _ in range(tries):
        try:
            r = requests.get(url, params=params, timeout=60)
            time.sleep(sleep)
            #print("Status:", r.status_code, "URL:", r.url)

            # if non-200, print body and retry
            if r.status_code != 200:
                print("Non-200 response:", r.text[:500])
                last_err = f"HTTP {r.status_code}"
                time.sleep(1.0)
                continue

            js = r.json()

            if isinstance(js, dict) and "error" in js:
                print("OpenAlex error JSON:", js)
                last_err = js
                time.sleep(1.0)
                continue

            return js

        except Exception as e:
            print("Exception in oa_get:", e)
            last_err = e
            time.sleep(1.0)

    print("oa_get failed, last_err:", last_err)
    return None

# OpenAlex helper for reading through multiple pages
def paginate(path, params=None, per_page=200, max_pages=None):
    """Generator over all results using cursor pagination."""
    p = dict(params or {})
    p["per-page"] = per_page
    p["cursor"] = "*"

    page = 0
    while True:
        js = oa_get(path, p)
        if js is None:  # oa_get returns None once all retries have failed
            break
        results = js.get("results", [])
        for item in results:
            yield item

        next_cursor = js.get("meta", {}).get("next_cursor")
        page += 1

        if not next_cursor:
            break
        if max_pages is not None and page >= max_pages:
            break

        p["cursor"] = next_cursor

# Takes the concept name (like Sociology and Computer Science) and returns a concept ID
def get_concept_id_by_name(name):
    js = oa_get("/concepts", {
        "search": name,
        "select": "id,display_name,level,works_count",
        "per-page": 10
    })
    if js is None:  # request failed after all retries
        return None
    results = js.get("results", [])
    if not results:
        return None

    name_l = name.strip().lower()
    for c in results:
        if (c.get("display_name") or "").strip().lower() == name_l:
            return c.get("id")

    return results[0].get("id")
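
The cursor loop in paginate can be tried out offline. The following is a minimal sketch, not the notebook's own code: paginate_pages, fake_fetch and FAKE_PAGES are hypothetical names, and the fake pages only imitate the "results" and "meta.next_cursor" fields that the helper above reads. It also stops cleanly when a fetch returns None, which is what oa_get does after its retries are used up.

```python
def paginate_pages(fetch_page, max_pages=None):
    """Yield items page by page until the cursor runs out or a fetch fails.

    fetch_page(cursor) is assumed to return a dict with "results" and
    "meta" keys (like an OpenAlex list response), or None on failure.
    """
    cursor = "*"  # OpenAlex starts cursor pagination with "*"
    page = 0
    while cursor:
        js = fetch_page(cursor)
        if js is None:  # request failed, stop instead of raising
            break
        yield from js.get("results", [])
        page += 1
        if max_pages is not None and page >= max_pages:
            break
        cursor = (js.get("meta") or {}).get("next_cursor")


# Hypothetical in-memory stand-in for the API: three pages, then no cursor.
FAKE_PAGES = {
    "*": {"results": [1, 2], "meta": {"next_cursor": "c1"}},
    "c1": {"results": [3, 4], "meta": {"next_cursor": "c2"}},
    "c2": {"results": [5], "meta": {"next_cursor": None}},
}


def fake_fetch(cursor):
    """Return the fake page for a cursor, or None for an unknown cursor."""
    return FAKE_PAGES.get(cursor)


print(list(paginate_pages(fake_fetch)))
print(list(paginate_pages(fake_fetch, max_pages=1)))
```

Passing the fetch function as an argument keeps the loop testable without spending API requests.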

find_author_tophit returns the "top hit" for an author name, meaning the first result of the OpenAlex author "search" for that name. The hits are then collected in a pandas DataFrame with the columns id, display_name, works_count, h_index, country_code and query_name.

The first filter is then applied: only authors with "works_count" between 5 and 5000 are kept.

In [100]:
def find_author_tophit(author_name):
    js = oa_get("/authors", {
        "search": author_name,
        "select": "id,display_name,works_count,summary_stats,last_known_institutions",
        "per-page": 1
    })
    if js is None:
        return None

    results = js.get("results", [])
    if not results:
        return None

    r = results[0]
    insts = r.get("last_known_institutions") or []
    country_code = insts[0].get("country_code") if insts else None

    return {
        "id": r.get("id"),
        "display_name": r.get("display_name"),
        "works_count": r.get("works_count"),
        "h_index": (r.get("summary_stats") or {}).get("h_index"),
        "country_code": country_code,
        "query_name": author_name,
    }

Checking that the code gets correct output

In [103]:
name = authors_list[0]
print("Testing search for:", name)

js = oa_get("/authors", {
    "search": name,
    "select": "id,display_name,works_count,summary_stats,last_known_institutions",
    "per-page": 1
})
print("Raw js:", js)

Testing search for: Étienne Ollion
Raw js: {'meta': {'count': 7, 'db_response_time_ms': 53, 'page': 1, 'per_page': 1, 'groups_count': None, 'cost_usd': 0.001}, 'results': [{'id': 'https://openalex.org/A5022704573', 'display_name': 'Étienne Ollion', 'works_count': 131, 'summary_stats': {'2yr_mean_citedness': 2.8620689655172415, 'h_index': 17, 'i10_index': 23}, 'last_known_institutions': None}], 'group_by': []}


In [80]:
test_name = authors_list[0]
print("Testing search for:", test_name)

test = oa_get("/authors", {
    "search": test_name,
    "select": "id,display_name,works_count",
    "per-page": 1
})
print(test["results"][:1])

Testing search for: Étienne Ollion
[{'id': 'https://openalex.org/A5022704573', 'display_name': 'Étienne Ollion', 'works_count': 131}]


In [89]:
D1 = []

for nm in authors_list[:10]:
    r = find_author_tophit(nm)
    print(nm, "->", "OK" if r else "None")

Étienne Ollion -> OK
Émilien Schultz -> OK
Miriam Schirmer -> OK
Julia Mendelsohn -> OK
Dustin Wright -> OK
Dietram A. Scheufele -> OK
Ágnes Horvát -> OK
Caspar van Lissa -> OK
Jorge Barreras -> OK
Thomas Li -> OK


Creating eligible_authors, the set of authors whose "works_count" is between 5 and 5000. Because of API request limits, only the first 100 names in authors_list are looked up, so that enough requests remain for all exercises.

In [104]:
max_authors = 100  # write None to go through all authors

if max_authors is None:
    to_iterate = authors_list
else:
    to_iterate = authors_list[:max_authors]

D1 = []

for nm in to_iterate:
    res = find_author_tophit(nm)
    if res is not None:
        D1.append(res)

df_auth = pd.DataFrame(D1)
print("Matched OpenAlex authors:", len(df_auth))
display(df_auth.head(10))

eligible_df = df_auth[(df_auth["works_count"] >= 5) & (df_auth["works_count"] <= 5000)].copy()
eligible_authors = set(eligible_df["id"].dropna().tolist())

print("Eligible authors:", len(eligible_authors))
display(eligible_df.describe(include="all"))

Matched OpenAlex authors: 95


Unnamed: 0,id,display_name,works_count,h_index,country_code,query_name
0,https://openalex.org/A5022704573,Étienne Ollion,131,17,,Étienne Ollion
1,https://openalex.org/A5079294298,Émilien Schultz,114,11,FR,Émilien Schultz
2,https://openalex.org/A5036764962,Miriam Schirmer,13,3,,Miriam Schirmer
3,https://openalex.org/A5038833789,Julia Mendelsohn,23,9,US,Julia Mendelsohn
4,https://openalex.org/A5090577017,Dustin E. Wright,25,9,,Dustin Wright
5,https://openalex.org/A5064753440,Dietram A. Scheufele,327,76,US,Dietram A. Scheufele
6,https://openalex.org/A5004566032,Emőke-Ágnes Horvát,72,15,US,Ágnes Horvát
7,https://openalex.org/A5069426228,Caspar J. Van Lissa,200,36,NL,Caspar van Lissa
8,https://openalex.org/A5035315636,Jorge Varela Barreras,46,16,ES,Jorge Barreras
9,https://openalex.org/A5052819678,L. Li,3380,184,CN,Thomas Li


Eligible authors: 87


Unnamed: 0,id,display_name,works_count,h_index,country_code,query_name
count,88,88,88.0,88.0,76,88
unique,87,87,,,17,88
top,https://openalex.org/A5000679279,Duncan J. Watts,,,US,Étienne Ollion
freq,2,2,,,36,1
mean,,,202.454545,30.068182,,
std,,,404.397577,31.039387,,
min,,,5.0,1.0,,
25%,,,29.0,9.0,,
50%,,,102.0,21.5,,
75%,,,233.25,37.5,,
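
The first table above shows that the top hit is not always the intended person: the query "Thomas Li" resolved to "L. Li" with 3380 works. One way to flag such cases for manual review is to compare query_name with display_name. This is only a sketch under assumptions: normalize_name, name_similarity and the 0.6 threshold are hypothetical choices and were not part of the pipeline that produced the results in this notebook.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def name_similarity(query_name, display_name):
    """Similarity in [0, 1] between the schedule name and the OpenAlex name."""
    a = normalize_name(query_name)
    b = normalize_name(display_name)
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical cut-off; tune it by inspecting a sample of matches.
THRESHOLD = 0.6

# (query_name, display_name) pairs taken from the table above.
pairs = [
    ("Étienne Ollion", "Étienne Ollion"),
    ("Caspar van Lissa", "Caspar J. Van Lissa"),
    ("Thomas Li", "L. Li"),
]
for query, display in pairs:
    score = name_similarity(query, display)
    flag = "ok" if score >= THRESHOLD else "check manually"
    print(f"{query!r} -> {display!r}: {score:.2f} ({flag})")
```

On a DataFrame like df_auth, the same function could be applied row by row to query_name and display_name to add a similarity column.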


Filters for level-0 concept IDs

In [105]:
# Filter for level-0 concept IDs

social_names = ["Sociology", "Psychology", "Economics", "Political science"]
quant_names  = ["Mathematics", "Physics", "Computer science"]

social_ids = [get_concept_id_by_name(n) for n in social_names]
quant_ids  = [get_concept_id_by_name(n) for n in quant_names]

social_ids = [x for x in social_ids if x]
quant_ids  = [x for x in quant_ids if x]

assert len(social_ids) > 0, "Could not find any social concept IDs."
assert len(quant_ids)  > 0, "Could not find any quantitative concept IDs."

# OpenAlex OR within a single filter uses "|"
social_or = "|".join(social_ids)
quant_or  = "|".join(quant_ids)

print("Social concept OR:", social_or)
print("Quant concept OR:", quant_or)

Social concept OR: https://openalex.org/C144024400|https://openalex.org/C15744967|https://openalex.org/C162324750|https://openalex.org/C17744445
Quant concept OR: https://openalex.org/C33923547|https://openalex.org/C121332964|https://openalex.org/C41008148


Implementing the remaining filters:

* More than 10 citations
* Limit to works authored by fewer than 10 individuals
* Include only works relevant to Computational Social Science, using level-0 concepts: a social science (Sociology OR Psychology OR Economics OR Political Science) AND a quantitative discipline (Mathematics OR Physics OR Computer Science)
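
The three filters above can be sketched as one filter string plus one local check. In the filter string, commas join conditions that must all hold (AND) and "|" joins alternative values (OR). build_works_filter and passes_author_limit are hypothetical helper names used for illustration only; the concept IDs are the ones printed earlier in this notebook, shortened to their final part.

```python
def build_works_filter(author_id, social_ids, quant_ids, min_citations=10):
    """Compose an OpenAlex-style filter string for one author.

    Commas separate conditions that are AND-ed together; "|" separates
    alternative values inside one condition (OR).
    """
    conditions = [
        f"authorships.author.id:{author_id}",
        f"cited_by_count:>{min_citations}",      # more than 10 citations
        "concept.id:" + "|".join(social_ids),    # any social science concept
        "concept.id:" + "|".join(quant_ids),     # AND any quantitative concept
    ]
    return ",".join(conditions)


def passes_author_limit(work, max_authors=10):
    """True if the work has fewer than max_authors authorships."""
    return len(work.get("authorships") or []) < max_authors


social = ["C144024400", "C15744967", "C162324750", "C17744445"]
quant = ["C33923547", "C121332964", "C41008148"]
print(build_works_filter("A5022704573", social, quant))
print(passes_author_limit({"authorships": [{}] * 9}))
print(passes_author_limit({"authorships": [{}] * 10}))
```

The author-count check runs locally on each returned work, which matches how the cell below applies it after the request.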


In [107]:
def iter_works_for_author_filtered(author_id, max_pages=None):
    filter_str = (
        f"authorships.author.id:{author_id},"
        f"cited_by_count:>10,"
        f"concept.id:{social_or},"
        f"concept.id:{quant_or}"
    )
    params = {
        "filter": filter_str,
        "select": "id,publication_year,cited_by_count,authorships,title,abstract_inverted_index",
        "sort": "cited_by_count:desc",
    }
    yield from paginate("/works", params, per_page=200, max_pages=max_pages)

def extract_author_ids(work):
    auths = work.get("authorships") or []
    ids = []
    for a in auths:
        au = (a.get("author") or {}).get("id")
        if au:
            ids.append(au)
    return ids

D2_rows, D3_rows = [], []
seen_work_ids = set()

n_seen = 0
n_kept = 0

for aid in eligible_authors:
    for w in iter_works_for_author_filtered(aid, max_pages=None):
        n_seen += 1
        wid = w.get("id")
        if not wid or wid in seen_work_ids:
            continue

        # More than 10 citations filter
        c = w.get("cited_by_count")
        if c is None or c <= 10:
            continue

        # Fewer than 10 authors filter
        author_count = len(w.get("authorships") or [])
        if author_count >= 10:
            continue

        author_ids = extract_author_ids(w)

        # keep only if at least one eligible IC2S2 author is on the paper (usually true)
        if not any(x in eligible_authors for x in author_ids):
            continue

        D2_rows.append({
            "id": wid,
            "publication_year": w.get("publication_year"),
            "cited_by_count": c,
            "author_ids": author_ids
        })

        D3_rows.append({
            "id": wid,
            "title": w.get("title"),
            "abstract_inverted_index": w.get("abstract_inverted_index")
        })

        seen_work_ids.add(wid)
        n_kept += 1

print("Works returned (including duplicates across authors):", n_seen)
print("Unique works kept:", n_kept)

df2 = pd.DataFrame(D2_rows)
df3 = pd.DataFrame(D3_rows)

print("D2 shape:", df2.shape)
print("D3 shape:", df3.shape)

Works returned (including duplicates across authors): 1555
Unique works kept: 1367
D2 shape: (1367, 4)
D3 shape: (1367, 3)


Displaying the results of D2 and D3 and checking that the IDs are the same in both dataframes

In [108]:
display(df2.head(20))
display(df3.head(20))

print("Duplicate IDs in D2?", df2["id"].duplicated().any())
print("Duplicate IDs in D3?", df3["id"].duplicated().any())
print("Same ID sets?", set(df2["id"]) == set(df3["id"]))

Unnamed: 0,id,publication_year,cited_by_count,author_ids
0,https://openalex.org/W3016737084,2020,120,"[https://openalex.org/A5100782346, https://ope..."
1,https://openalex.org/W2982316861,2019,77,"[https://openalex.org/A5013465081, https://ope..."
2,https://openalex.org/W4410818492,2025,21,"[https://openalex.org/A5103118684, https://ope..."
3,https://openalex.org/W3157505695,2021,14,"[https://openalex.org/A5013465081, https://ope..."
4,https://openalex.org/W4321455981,2023,473,"[https://openalex.org/A5070497645, https://ope..."
5,https://openalex.org/W4387299971,2023,109,"[https://openalex.org/A5070497645, https://ope..."
6,https://openalex.org/W4296154596,2022,92,"[https://openalex.org/A5070497645, https://ope..."
7,https://openalex.org/W4294786836,2022,12,"[https://openalex.org/A5070497645, https://ope..."
8,https://openalex.org/W4376616205,2023,12,"[https://openalex.org/A5070497645, https://ope..."
9,https://openalex.org/W4410030884,2025,12,"[https://openalex.org/A5070497645, https://ope..."


Unnamed: 0,id,title,abstract_inverted_index
0,https://openalex.org/W3016737084,Scientific elite revisited: patterns of produc...,"{'Throughout': [0], 'history,': [1], 'a': [2, ..."
1,https://openalex.org/W2982316861,Quantifying the dynamics of failure across sci...,
2,https://openalex.org/W4410818492,The pivot penalty in research,
3,https://openalex.org/W3157505695,Science as a Public Good: Public Use and Fundi...,"{'Knowledge': [0], 'of': [1, 15, 18, 56, 70, 8..."
4,https://openalex.org/W4321455981,"Out of One, Many: Using Language Models to Sim...","{'Abstract': [0], 'We': [1, 54, 92, 104, 130, ..."
5,https://openalex.org/W4387299971,Leveraging AI for democratic discourse: Chat i...,"{'Political': [0], 'discourse': [1, 21, 61], '..."
6,https://openalex.org/W4296154596,"Out of One, Many: Using Language Models to Sim...","{'We': [0, 53, 95, 107, 133, 173], 'propose': ..."
7,https://openalex.org/W4294786836,Does Political Participation Contribute to Pol...,"{'Abstract': [0], 'Polarization': [1], 'and': ..."
8,https://openalex.org/W4376616205,Misclassification and Bias in Predictions of I...,"{'We': [0, 104], 'show': [1], 'that': [2, 21, ..."
9,https://openalex.org/W4410030884,Testing theories of political persuasion using AI,"{'Despite': [0], 'its': [1], 'importance': [2]..."


Duplicate IDs in D2? False
Duplicate IDs in D3? False
Same ID sets? True
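
D3 stores each abstract as an inverted index: a dictionary that maps every word to the list of positions where it occurs. If the plain text is needed later, it can be rebuilt by sorting the words by position. This is a minimal sketch; abstract_from_inverted_index is a hypothetical helper name and the example dictionary is a toy input, not a real abstract.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} dictionary.

    Returns None when the abstract is missing (None, NaN or empty dict).
    """
    if not isinstance(inverted_index, dict) or not inverted_index:
        return None
    positioned = []
    for word, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, word))
    positioned.sort()  # order words by their position in the text
    return " ".join(word for _, word in positioned)


# Toy input: "networks" occurs at positions 1 and 4.
toy_index = {"networks": [1, 4], "Social": [0], "shape": [2], "other": [3]}
print(abstract_from_inverted_index(toy_index))
print(abstract_from_inverted_index(None))
```

Applied to the dataframe, this could look like df3["abstract_inverted_index"].apply(abstract_from_inverted_index); rows without an abstract would then hold None.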


In [109]:
# Dataset D2/D3 summary for report

# Number of works in D2
n_works = len(df2)

# Number of unique authors across all works in D2
all_author_ids = set()
for ids in df2["author_ids"]:
    for aid in ids:
        all_author_ids.add(aid)

n_authors = len(all_author_ids)

print(f"Number of works in D2: {n_works}")
print(f"Number of unique authors in D2: {n_authors}")

Number of works in D2: 1367
Number of unique authors in D2: 2284


Answers regarding the Data Overview and Reflection questions:

* Dataset summary. How many works are listed in your Dataset D2 (IC2S2 papers) dataframe? How many unique researchers have co-authored these works?

This is answered in the cell above: D2 contains 1367 works, which were co-authored by 2284 unique authors.

* Efficiency in code and Filtering Criteria and Dataset Relevance

We capped the number of authors with "max_authors" so that we stayed within our API request limit, which was a recurring problem. The citation filter is sent to OpenAlex as part of the request, so only the "more relevant" articles (more than 10 citations) are returned, which reduces the amount of data to download and speeds up the process significantly. The "author_count" filter keeps the focus on smaller collaborations by excluding works with 10 or more authors. The "eligible authors" filter also removes a lot: it keeps only authors with between 5 and 5000 works, i.e. researchers with an established publication record.