## Imports

In [1]:
import pickle
import requests
import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm
import time

## Part 1: Web-scraping

## Part 2: Ready Made vs Custom Made Data

## Part 3: Gathering Research Articles using the OpenAlex API

In [2]:
# Load dataset
with open("datav2.pkl", "rb") as f:
    df = pickle.load(f)

df = df[(df['works_count'] > 5) & (df['works_count'] < 5000)]  # Filtering

#Initialize DataFrames
papers = pd.DataFrame(columns=['id', 'publication_year', 'cited_by_count', 'author_ids'])
abstracts = pd.DataFrame(columns=['id', 'title', 'abstract_inverted_index'])

# Define concept IDs
concept_ids = [
    "C144024400",  # Sociology
    "C15744967",   # Psychology
    "C162324750",  # Economics
    "C17744445",   # Political Science
    "C33923547",   # Mathematics
    "C121332964",  # Physics
    "C41008148",   # Computer Science
]

paperdata = []
abstractdata = []

def get_data(i):
    # `i` is a batch of author works URLs; the author ID is the part after "id:"
    ids = [aut.split("id:")[1] for aut in i]
    BASE_URL = (
        f"https://api.openalex.org/works?filter=author.id:({'|'.join(ids)}),cited_by_count:>10,"
        f"authors_count:<10,concept.id:({'|'.join(concept_ids[:4])}),concept.id:({'|'.join(concept_ids[4:])})"
    )

    retries = 0

    while retries < 3:
        # Start every attempt with empty lists so a retry does not duplicate rows
        papers = []
        abstracts = []
        try:
            response = requests.get(BASE_URL + "&per-page=200&cursor=*").json()

            while response.get("results"):
                for result in response["results"]:
                    papers.append({
                        "id": result.get("id"),
                        "publication_year": result.get("publication_year"),
                        "cited_by_count": result.get("cited_by_count"),
                        "author_ids": [
                            auth["author"]["id"]
                            for auth in result.get("authorships", [])
                            if "author" in auth and "id" in auth["author"]
                        ],
                    })
                    abstracts.append({
                        "id": result.get("id"),
                        "title": result.get("title"),
                        "abstract_inverted_index": result.get("abstract_inverted_index"),
                    })

                next_cursor = response.get("meta", {}).get("next_cursor")
                if not next_cursor:
                    break

                time.sleep(1) 
                response = requests.get(BASE_URL + f"&per-page=200&cursor={next_cursor}").json()

            return papers, abstracts

        except Exception as e:
            print(f"Error fetching works for author IDs {ids}: {e}")
            retries += 1
            time.sleep(1)

    return [], []

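# Parallel processing
num_batch = 5      # number of parallel workers
batch_size = 100   # authors handled per outer iteration
chunk_size = 25    # authors per API request

for start in tqdm(range(0, len(df["works_api_url"]), batch_size)):
    batch_indexes = df["works_api_url"][start:start + batch_size].tolist()
    # Split into request-sized chunks; a short final batch no longer yields empty chunks
    batches = [batch_indexes[j:j + chunk_size] for j in range(0, len(batch_indexes), chunk_size)]
    # Retrieve data in parallel
    results = Parallel(n_jobs=num_batch)(
        delayed(get_data)(batch) for batch in batches
    )

    # Collect results (`abst` avoids shadowing the built-in `abs`)
    for pap, abst in results:
        if pap and abst:
            paperdata.extend(pap)
            abstractdata.extend(abst)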

    time.sleep(2)

#Convert collected lists into DataFrames
paperdata_df = pd.DataFrame(paperdata)
abstractdata_df = pd.DataFrame(abstractdata)

# Save results
paperdata_df.to_csv("papers.csv", index=False)
abstractdata_df.to_csv("abstracts.csv", index=False)


100%|██████████| 11/11 [01:36<00:00,  8.74s/it]
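
The `abstract_inverted_index` column collected above stores each abstract as a mapping from a word to the positions where it occurs, not as plain text. A minimal sketch of how the text could be rebuilt, assuming that mapping shape; `rebuild_abstract` and the example index are hypothetical and not part of the notebook:

```python
def rebuild_abstract(inverted_index):
    """Turn a {word: [positions]} mapping back into a plain-text string."""
    if not inverted_index:
        return ""  # works without an abstract have no index
    position_to_word = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            position_to_word[pos] = word
    # Read the words back in position order
    return " ".join(position_to_word[pos] for pos in sorted(position_to_word))


# Made-up example index, for illustration only
example_index = {"networks": [1, 4], "Social": [0], "are": [2], "complex": [3]}
example_text = rebuild_abstract(example_index)
print(example_text)
```

Note that once the DataFrame has been written to CSV and read back, this column would be a string and would need to be parsed into a dictionary first.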


In [7]:
#How many works are listed? (id in this dataframe is an id of a work)
len(paperdata_df['id'])

13362

In [19]:
#How many unique researchers have co-authored these works?
paperdata_df['author_ids'].explode().nunique()

18004

> - **Dataset summary.**

The *IC2S2 papers* dataframe lists 13,362 works in total, co-authored by 18,004 unique researchers.
> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__

Requests to the API often failed or stalled, which made the run time very long. To address this, we wrapped each request in a try-except block that retries up to three times before moving on to the next batch. We also parallelized the requests with joblib's `Parallel`, so that several batches of authors are fetched at the same time, and used tqdm to monitor progress. Lastly, applying the filters directly in the API request ensured that only relevant works were returned, which reduced the amount of data to download and process. Combined, these strategies cut the execution time from close to an hour to about a minute or two.
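
The retry and batching ideas described in this answer can be sketched in isolation. This is a simplified illustration, not the notebook's code: it uses threads from the standard library in place of joblib's `Parallel`, and `with_retries`, `chunked`, and `flaky_fetch` are hypothetical names, with `flaky_fetch` standing in for a real API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fetch, arg, max_retries=3, delay=0.0):
    """Call fetch(arg); on an exception, wait and retry up to max_retries times."""
    for attempt in range(max_retries):
        try:
            return fetch(arg)
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed for {arg}: {exc}")
            time.sleep(delay)
    return []  # give up on this batch and return an empty result


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[k:k + size] for k in range(0, len(items), size)]


# Hypothetical stand-in for an API call: the first attempt per batch fails.
seen = set()

def flaky_fetch(batch):
    key = tuple(batch)
    if key not in seen:
        seen.add(key)
        raise ConnectionError("simulated timeout")
    return [f"work-for-{author}" for author in batch]


authors = [f"A{n}" for n in range(10)]
batches = chunked(authors, 4)  # three chunks: 4, 4 and 2 authors

# Fetch all chunks concurrently; pool.map keeps the input order
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda b: with_retries(flaky_fetch, b), batches))

works = [work for batch_result in results for work in batch_result]
print(len(works))
```

Because each chunk is retried independently, one failing request only delays its own worker and does not block the other chunks.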

> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

Using specific thresholds ensures that the fetched data is both relevant and of high quality, and it makes data collection more manageable because fewer works need to be processed. The threshold on an author's total number of works ensures that very prolific authors are not overrepresented, but it also excludes inactive or emerging authors. Similarly, the citation threshold highlights influential studies but may overlook newer works that have yet to gain recognition. Limiting the number of authors per work favors small collaborations and may filter out large interdisciplinary projects. Field-based filtering broadens the scope to include relevant interdisciplinary studies but may still underrepresent qualitative approaches. Overall, the filters enhance the relevance of the dataset but may also introduce bias.
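
To make the combined effect of these criteria concrete, the sketch below rebuilds the filter string in the same format as the request in `get_data`. In the OpenAlex filter syntax, comma-separated conditions are combined with AND and values joined by `|` with OR, so a work has to match at least one concept from each group. `build_filter` and its parameter names are hypothetical, and the author IDs are placeholders.

```python
def build_filter(author_ids, concept_groups, min_citations=10, max_authors=10):
    """Compose a works filter string: commas act as AND, pipes as OR."""
    parts = [
        "author.id:(" + "|".join(author_ids) + ")",
        f"cited_by_count:>{min_citations}",
        f"authors_count:<{max_authors}",
    ]
    # One condition per group: a work must match a concept from every group
    for group in concept_groups:
        parts.append("concept.id:(" + "|".join(group) + ")")
    return ",".join(parts)


social = ["C144024400", "C15744967"]   # Sociology, Psychology
quantitative = ["C33923547"]           # Mathematics

filter_string = build_filter(["A1", "A2"], [social, quantitative])
print(filter_string)
```

Loosening one threshold at a time, for example `min_citations`, and comparing the number of returned works would be one way to check how much each criterion shapes the final dataset.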

## Part 4: The Network of Computational Social Scientists