Link to GitHub: https://github.com/Stampe04/CSS_Assignment1.git

All group members contributed equally to this assignment.

### Part 1: Ready Made vs Custom Made Data ###

***What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book (answer in max 150 words).***

In Centola’s experiment, the custom-made data has the benefit of capturing exactly what the study is looking for: the data set is neither cluttered with irrelevant information nor missing information that the analysis needs. However, the results might be biased if any of the participants were aware that they were part of an experiment.

In Nicolaides’s study, the ready-made data has the benefit of being always-on, meaning that new data is collected continuously and can be used in the analysis. However, as mentioned in the video, there may be other relevant information about the people that the data does not capture or that the researchers do not have access to.

***How do you think these differences can influence the interpretation of the results in each study? (answer in max 150 words)***

Centola's study can support strong causal conclusions if behaviour differs significantly between the two network structures, because the network structures are experimentally manipulated. However, this setup may not generalize well to real-world social systems, where networks are influenced by many additional factors.

Nicolaides' study can produce results that better reflect real-world behaviour, since it uses naturally collected, large-scale data. However, since the researchers cannot control for all variables, the results can only support correlational conclusions rather than causal ones.

Comparing the two, Centola's results can better show how and why behaviour spreads, while Nicolaides' results can better show how patterns appear in the real world.

### Part 2: Find Researchers using the OpenAlex API ###

In [1]:
import requests
import time
import pandas as pd
import re
# google.colab only exists inside Google Colab, so fall back to a local file elsewhere
try:
    from google.colab import drive
    drive.mount('/content/drive')
    csv_path = '/content/drive/MyDrive/Computational Social Science/ic2s2_2025_schedule_v5.csv'
except ImportError:
    # Assumption: outside Colab, the CSV file sits next to the notebook
    csv_path = 'ic2s2_2025_schedule_v5.csv'

data = pd.read_csv(csv_path)

# Find author column
author_cols = [c for c in data.columns if "author" in c.lower()]
print("Author-like columns:", author_cols)

author_col = author_cols[0] if author_cols else None
assert author_col is not None, "No author-like column found."

In [None]:
def split_authors(x):
    if pd.isna(x):
        return []
    s = str(x).strip()

    # normalize common separators
    s = s.replace("\n", ";")
    s = re.sub(r"\s+(and|&)\s+", ";", s, flags=re.IGNORECASE)
    s = s.replace("|", ";")

    parts = [p.strip() for p in s.split(";") if p.strip()]

    # If still a single blob, try to detect comma-separated FULL NAMES:
    # Example: "Étienne Ollion, Émilien Schultz"
    if len(parts) == 1 and ", " in parts[0]:
        chunks = [c.strip() for c in parts[0].split(",") if c.strip()]

        looks_like_full_names = sum((" " in c) for c in chunks) >= max(2, int(0.7 * len(chunks)))

        if looks_like_full_names:
            parts = chunks

    return parts

def clean_author_name(a):
    a = re.sub(r"\s+", " ", str(a)).strip()
    a = a.strip(" '\"")
    a = a.replace("'", "").replace('"', "")
    return a.strip()

all_authors = []
for cell in data[author_col]:
    all_authors.extend(split_authors(cell))

authors_list = []
seen = set()
for a in all_authors:
    a = clean_author_name(a)
    if a and a not in seen:
        seen.add(a)
        authors_list.append(a)

print("Number of unique authors:", len(authors_list))
print(authors_list[:30])

In [None]:
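import os

# Read the API key from the environment instead of hard-coding it in the notebook.
# OPENALEX_API_KEY is an assumed variable name; if it is unset, requests omits the None-valued parameter.
api_key = os.environ.get("OPENALEX_API_KEY")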
BASE = "https://api.openalex.org"

# OpenAlex helper for data retrieval
def oa_get(path, params=None, sleep=0.6, tries=3):
    url = f"{BASE}{path}"
    params = dict(params or {})
    params["api_key"] = api_key  # or remove this line and use "mailto" instead if you don't have an API key

    last_err = None

    for _ in range(tries):
        try:
            r = requests.get(url, params=params, timeout=60)
            time.sleep(sleep)
            #print("Status:", r.status_code, "URL:", r.url)

            # if non-200, print body and retry
            if r.status_code != 200:
                print("Non-200 response:", r.text[:500])
                last_err = f"HTTP {r.status_code}"
                time.sleep(1.0)
                continue

            js = r.json()

            if isinstance(js, dict) and "error" in js:
                print("OpenAlex error JSON:", js)
                last_err = js
                time.sleep(1.0)
                continue

            return js

        except Exception as e:
            print("Exception in oa_get:", e)
            last_err = e
            time.sleep(1.0)

    print("oa_get failed, last_err:", last_err)
    return None

# OpenAlex helper for reading through multiple pages
def paginate(path, params=None, per_page=200, max_pages=None):
    """Generator over all results using cursor pagination."""
    p = dict(params or {})
    p["per-page"] = per_page
    p["cursor"] = "*"

    page = 0
    while True:
        js = oa_get(path, p)
        if js is None:  # oa_get returns None once all retries have failed
            break
        results = js.get("results", [])
        for item in results:
            yield item

        next_cursor = js.get("meta", {}).get("next_cursor")
        page += 1

        if not next_cursor:
            break
        if max_pages is not None and page >= max_pages:
            break

        p["cursor"] = next_cursor

# Takes the concept name (like Sociology and Computer Science) and returns a concept ID
def get_concept_id_by_name(name):
    js = oa_get("/concepts", {
        "search": name,
        "select": "id,display_name,level,works_count",
        "per-page": 10
    })
    if js is None:  # request failed after all retries
        return None
    results = js.get("results", [])
    if not results:
        return None

    name_l = name.strip().lower()
    for c in results:
        if (c.get("display_name") or "").strip().lower() == name_l:
            return c.get("id")

    return results[0].get("id")

In [None]:
def find_author_tophit(author_name):
    js = oa_get("/authors", {
        "search": author_name,
        "select": "id,display_name,works_count,summary_stats,last_known_institutions",
        "per-page": 1
    })
    if js is None:
        return None

    results = js.get("results", [])
    if not results:
        return None

    r = results[0]
    insts = r.get("last_known_institutions") or []
    country_code = insts[0].get("country_code") if insts else None

    return {
        "id": r.get("id"),
        "display_name": r.get("display_name"),
        "works_count": r.get("works_count"),
        "h_index": (r.get("summary_stats") or {}).get("h_index"),
        "country_code": country_code,
        "query_name": author_name,
    }

In [None]:
name = authors_list[0]
print("Testing search for:", name)

js = oa_get("/authors", {
    "search": name,
    "select": "id,display_name,works_count,summary_stats,last_known_institutions",
    "per-page": 1
})
print("Raw js:", js)

In [None]:
test_name = authors_list[0]
print("Testing search for:", test_name)

test = oa_get("/authors", {
    "search": test_name,
    "select": "id,display_name,works_count",
    "per-page": 1
})
# oa_get may return None, so guard before indexing
print((test or {}).get("results", [])[:1])

In [None]:
D1 = []

for nm in authors_list[:10]:
    r = find_author_tophit(nm)
    print(nm, "->", "OK" if r else "None")

In [None]:
max_authors = 100  # write None to go through all authors

if max_authors is None:
    to_iterate = authors_list
else:
    to_iterate = authors_list[:max_authors]

D1 = []

for nm in to_iterate:
    res = find_author_tophit(nm)
    if res is not None:
        D1.append(res)

df_auth = pd.DataFrame(D1)
print("Matched OpenAlex authors:", len(df_auth))
display(df_auth.head(10))

***Which challenges did you encounter? How did you address them?***

In the beginning, we had some issues with the API itself, since we found the API documentation on their homepage very confusing to navigate. This became less of an issue as we went along and got more familiar with the documentation and the code implementation. We also had some initial issues loading the .csv file properly, which we solved by doing some simple research on handling .csv files in Python. In addition, we ran into trouble with our API keys multiple times, as we kept hitting usage limits, e.g. not having enough tokens per day. We solved this by testing our code on a limited amount of data to use our keys sparingly, and only doing the full run once we knew the code worked.

***Choose one problem you faced while collecting the data and describe your solution. Why did you choose this approach, and what impact might it have on your data?***

The biggest problem was the usage limits on our API keys: during our coding trials we tested on whole tables, which put quite a strain on our token usage. We solved this by limiting the number of calls made with our API keys, i.e. running the code on a small subset of the data until we were confident it worked, and only then running it on all rows. We chose this approach because we still needed to test our code for edge cases, and a small subset let us do that without limiting our testing to a few times a day.
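A complementary way to save quota during testing is to cache responses on disk, so that re-running a cell does not repeat calls that already succeeded. The snippet below is only a minimal sketch of that idea and was not part of our run: `cached_get`, `cache_key` and the `oa_cache` folder are hypothetical names, and `fetch` stands for any function with the same `(path, params)` signature as `oa_get`.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("oa_cache")  # hypothetical folder name


def cache_key(path, params):
    """Stable file name for one (path, params) request."""
    raw = json.dumps([path, sorted((params or {}).items())], default=str)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest() + ".json"


def cached_get(path, params=None, fetch=None, cache_dir=CACHE_DIR):
    """Return cached JSON for a request; call fetch only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / cache_key(path, params)

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    js = fetch(path, params)
    if js is not None:  # never cache a failed request
        cache_file.write_text(json.dumps(js), encoding="utf-8")
    return js
```

With the helper from Part 2, a call would look like `cached_get("/authors", {"search": name}, fetch=oa_get)`. Because the API key is added inside `oa_get`, it never ends up in the cache file names.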

Another problem we faced was obtaining the .csv file from the IC2S2 website using web scraping. Because the website's dataset was transformed whenever we scraped the page, we needed to locate the dataset and extract it before working with it in Python. This wasn't the biggest issue, as we could extract the HTML from the website and locate the dataset in it, but it proved to be a hassle to find it without downloading all dependencies (i.e. the whole website).

### Part 3: Collect Research Articles ###

In [None]:
eligible_df = df_auth[(df_auth["works_count"] >= 5) & (df_auth["works_count"] <= 5000)].copy()
eligible_authors = set(eligible_df["id"].dropna().tolist())

print("Eligible authors:", len(eligible_authors))
display(eligible_df.describe(include="all"))

In [None]:
# Filter for level-0 concept IDs

social_names = ["Sociology", "Psychology", "Economics", "Political science"]
quant_names  = ["Mathematics", "Physics", "Computer science"]

social_ids = [get_concept_id_by_name(n) for n in social_names]
quant_ids  = [get_concept_id_by_name(n) for n in quant_names]

social_ids = [x for x in social_ids if x]
quant_ids  = [x for x in quant_ids if x]

assert len(social_ids) > 0, "Could not find any social concept IDs."
assert len(quant_ids)  > 0, "Could not find any quantitative concept IDs."

# OpenAlex OR within a single filter uses "|"
social_or = "|".join(social_ids)
quant_or  = "|".join(quant_ids)

print("Social concept OR:", social_or)
print("Quant concept OR:", quant_or)

In [None]:
def iter_works_for_author_filtered(author_id, max_pages=None):
    filter_str = (
        f"authorships.author.id:{author_id},"
        f"cited_by_count:>10,"
        f"concept.id:{social_or},"
        f"concept.id:{quant_or}"
    )
    params = {
        "filter": filter_str,
        "select": "id,publication_year,cited_by_count,authorships,title,abstract_inverted_index",
        "sort": "cited_by_count:desc",
    }
    yield from paginate("/works", params, per_page=200, max_pages=max_pages)

def extract_author_ids(work):
    auths = work.get("authorships") or []
    ids = []
    for a in auths:
        au = (a.get("author") or {}).get("id")
        if au:
            ids.append(au)
    return ids

D2_rows, D3_rows = [], []
seen_work_ids = set()

n_seen = 0
n_kept = 0

for aid in eligible_authors:
    for w in iter_works_for_author_filtered(aid, max_pages=None):
        n_seen += 1
        wid = w.get("id")
        if not wid or wid in seen_work_ids:
            continue

        # More than 10 citations filter
        c = w.get("cited_by_count")
        if c is None or c <= 10:
            continue

        # Fewer than 10 authors filter
        author_count = len(w.get("authorships") or [])
        if author_count >= 10:
            continue

        author_ids = extract_author_ids(w)

        # keep only if at least one eligible IC2S2 author is on the paper (usually true)
        if not any(x in eligible_authors for x in author_ids):
            continue

        D2_rows.append({
            "id": wid,
            "publication_year": w.get("publication_year"),
            "cited_by_count": c,
            "author_ids": author_ids
        })

        D3_rows.append({
            "id": wid,
            "title": w.get("title"),
            "abstract_inverted_index": w.get("abstract_inverted_index")
        })

        seen_work_ids.add(wid)
        n_kept += 1

print("Works returned (including duplicates across authors):", n_seen)
print("Unique works kept:", n_kept)

df2 = pd.DataFrame(D2_rows)
df3 = pd.DataFrame(D3_rows)

print("D2 shape:", df2.shape)
print("D3 shape:", df3.shape)

In [None]:
display(df2.head(20))
display(df3.head(20))

print("Duplicate IDs in D2?", df2["id"].duplicated().any())
print("Duplicate IDs in D3?", df3["id"].duplicated().any())
print("Same ID sets?", set(df2["id"]) == set(df3["id"]))
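
D3 stores each abstract in the `abstract_inverted_index` field, which maps every word to the positions where it occurs rather than holding plain text. A minimal sketch of how the text could be rebuilt for later analysis is shown below; `abstract_from_inverted_index` is a hypothetical helper that we did not use in the collection step.

```python
def abstract_from_inverted_index(inv):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inv:  # missing abstract (None or empty dict)
        return ""

    # Collect (position, word) pairs, then order them by position
    positions = []
    for word, idxs in inv.items():
        for i in idxs:
            positions.append((i, word))
    positions.sort()

    return " ".join(word for _, word in positions)
```

It could be applied with `df3["abstract_inverted_index"].apply(abstract_from_inverted_index)`, assuming missing abstracts are stored as `None`.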

In [None]:
# Dataset D2/D3 summary for report

# Number of works in D2
n_works = len(df2)

# Number of unique authors across all works in D2
all_author_ids = set()
for ids in df2["author_ids"]:
    for aid in ids:
        all_author_ids.add(aid)

n_authors = len(all_author_ids)

print(f"Number of works in D2: {n_works}")
print(f"Number of unique authors in D2: {n_authors}")

***How many works are listed in your Dataset D2 (IC2S2 papers) dataframe? How many unique researchers have co-authored these works?***

This is also answered by the output of the cell above: there are 1367 works in D2, co-authored by 2284 unique researchers.

***Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?***

We used a "max_authors" limit while testing to make sure we had enough API requests, as we had to deal with that problem a lot. The citation filter is part of the request itself, so OpenAlex only returns the "more relevant" articles, which speeds up the process significantly. The "author_count" filter makes sure we only keep "smaller collaborations" and skip very large ones. The "eligible authors" filter also removes a lot, as it restricts the requests to authors who have published a substantial number of works.
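A further speed-up that we did not implement would be to request works for several authors at once, since OpenAlex lets a single filter combine values with "|" (the same OR syntax we use for the concept IDs). The sketch below only builds the filter strings; `chunked` and `batched_author_filters` are hypothetical helpers, and the batch size of 50 is an assumption that should be checked against the limit on OR values given in the OpenAlex documentation.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    items = list(items)
    for i in range(0, len(items), size):
        yield items[i:i + size]


def batched_author_filters(author_ids, batch_size=50, extra_filters=()):
    """Build one filter string per batch of author IDs, OR-ed with '|'."""
    filters = []
    for batch in chunked(sorted(author_ids), batch_size):
        parts = ["authorships.author.id:" + "|".join(batch)]
        parts.extend(extra_filters)  # e.g. "cited_by_count:>10"
        filters.append(",".join(parts))
    return filters
```

Each returned string could be passed as the "filter" parameter to `paginate("/works", ...)`, which would replace one request series per author with one per batch. The existing deduplication on work IDs would still be needed, because co-authored works can appear in more than one batch.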

***Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?***

Before applying any of the filters or specific settings, we ran a test on just one person, which alone gave us an enormous output in which it was very difficult to find specific information. This made us realize how necessary filters would be when searching for multiple people. In practice, we also realised just how powerful a tool they can be, especially when you are interested in very specific information or want to limit yourself to a specific criterion or threshold. Overall, we think they are an excellent tool.

However, we also see some potential problems when applying these filters. For example, if you only keep works or authors that have been cited at least x times, you may underrepresent authors or works that are legitimately good but have just not been used much or discovered yet. Moreover, this can overrepresent works or authors that may no longer be good or have become outdated, just because they have been cited a lot.
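One way to check for this kind of bias would be to compare, per publication year, how large a share of works survives the citation threshold. The sketch below uses made-up records purely for illustration; `share_kept_by_year` is a hypothetical helper, and running it on real data would require a pull without the citation filter, which we did not collect.

```python
from collections import defaultdict


def share_kept_by_year(works, min_citations=10):
    """Share of works per year with more than min_citations citations."""
    total = defaultdict(int)
    kept = defaultdict(int)
    for w in works:
        year = w["publication_year"]
        total[year] += 1
        if w["cited_by_count"] > min_citations:
            kept[year] += 1
    return {year: kept[year] / total[year] for year in sorted(total)}


# Made-up records for illustration only, not taken from our dataset
toy_works = [
    {"publication_year": 2015, "cited_by_count": 40},
    {"publication_year": 2015, "cited_by_count": 12},
    {"publication_year": 2024, "cited_by_count": 3},
    {"publication_year": 2024, "cited_by_count": 15},
]

print(share_kept_by_year(toy_works))
```

For the toy records this prints `{2015: 1.0, 2024: 0.5}`. A markedly lower share for recent years in real data would suggest that newer work is being filtered out disproportionately.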