# Contribution table with estimated percentage of contribution

|            | Jonas (s234845) | Alexander (s234815) | Mikkel (s224187) |
|------------|-----------------|---------------------|------------------|
| **Part 1** | 40%             | 60%                 | 0%               |
| **Part 2** | 60%             | 40%                 | 0%               |
| **Part 3** | 80%             | 20%                 | 0%               |
| **Part 4** | 20%             | 80%                 | 0%               |


We have had only minor contact with Mikkel; before that, he did not respond to any of our invitations. We have written to him by both email and Messenger.

Link to the GitHub repository: https://github.com/Glymse/Webscrabing-API-Graph-02467-assignment1/blob/master/Assignment1.ipynb

_______

In [53]:
# Imports
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
from thefuzz import fuzz
import json
import time
import os
from joblib import Parallel, delayed
from tqdm import tqdm
import sys
import networkx as nx


Utility functions used throughout the document:

In [54]:
# Useful helper functions
# Created after the fact to clean up the messy code and to improve readability
def fetch_soup(url):
    response = requests.get(url)
    return BeautifulSoup(response.content, 'html.parser')

def extract_text(elements):
    return [el.get_text(strip=True) for el in elements]

def split_and_strip(text, separator=','):
    # also strips whitespace
    return [part.strip() for part in text.split(separator)]

def clean_names(names):
    # split by comma and flatten to a cleaned list
    return list(set([name.strip() for entry in names for name in entry.split(",")]))

def find_dupe_names(names, threshold=85):
    # note: a threshold of 85 seemed to do pretty well
    # 80 was too lenient
    names = sorted(names)
    name_matches = {}
    remaining_names = set(names)
    
    for i, name in enumerate(names):
        if name not in remaining_names:
            continue
        
        alternatives = [alt_name for alt_name in names[i+1:] if fuzz.ratio(name, alt_name) >= threshold]
        remaining_names -= set(alternatives)
        name_matches[name] = alternatives if alternatives else "" # just leave everything else empty
    
    # here we create an additional column with a list of the name matches, in case we need to use them later
    return pd.DataFrame({'Original': name_matches.keys(), 'Alternatives': name_matches.values()}).reset_index(drop=True)

## Part 1: Web-scraping

### 1.1-1.2: Get names of researchers

The task is to gather researcher names from ic2s2-2023:
- Keynote speakers
- Plenary
- Chairs
- Posters

In [55]:
LINK = "https://ic2s2-2023.org/program"

# create a list of researcher names that we will append to later
researcher_names = []

soup_main = fetch_soup(LINK)

#### Keynote names

Keynote speakers all had `keynotes#` in their `href` value, so filtering the `a` tags on that attribute makes it possible to extract them.

In [56]:
keynote_attrs = {'href': re.compile(r'keynotes#')}

keynote_html = soup_main.find_all(name="a", attrs=keynote_attrs)

# here we use a regular expression to strip the leading "Keynote -" prefix from each entry
keynote_names = [re.sub(r'^Keynote -', '', name).strip() for name in extract_text(keynote_html)]

len(keynote_names)

10

#### Plenary, Chair and Posters names

The rest of the names could all be narrowed down to the `i` tag, and with extra filtering on the `u` tag it is possible to isolate all the names easily. Note that chair members had "`Chair:`" in front of their names, so that prefix needs to be stripped from those entries.

In [57]:
author_tags = soup_main.find_all(name='i')

plenary_names = []

for tag in author_tags:
    u_names = extract_text(tag.find_all('u'))
    plain_text_names = re.split(r',\s*', tag.get_text(strip=True))
    plain_text_names = [name for name in plain_text_names if name not in u_names]
    # here we use a regular expression to strip the leading "Chair:" prefix
    plain_text_names = [re.sub(r'^Chair:\s*', '', name) for name in plain_text_names]
    plenary_names.extend(u_names + plain_text_names)

len(plenary_names)

2095

#### Program committee names

The program committee names were found at a different URL, so a separate soup had to be made for this task. The names in the program committee were found under the `b` tag.

In [58]:
LINK_comittee = "https://ic2s2-2023.org/program_committee"
soup_committee = fetch_soup(LINK_comittee)

committee_names = extract_text(soup_committee.find_all("b"))

len(committee_names)

335

#### Compilation of all names

Now it is possible to compile all the names into one list. Entries containing commas are split so that any entry with multiple names is flattened. The list is then converted into a set to remove identical names and turned back into a list.

In [59]:
researcher_names.extend(keynote_names + plenary_names + committee_names)

researcher_names = clean_names(researcher_names)

display(researcher_names[0], len(researcher_names))

'Marton Karsai'

1655

### 1.3: Filter names, remove duplicates etc.

Here fuzzy string matching (`thefuzz`) is used, which relies on the Levenshtein distance (see [Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance#Definition)) to measure the difference between two strings. Note that the number of alternatives depends on the chosen threshold. Through trial and error we found that a similarity score of 80-85 was adequate; the final threshold is 85, since 80 was too lenient.
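
For intuition, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a plain-Python illustration of that metric only. It is not how `thefuzz` computes its 0-100 score, and the function name `levenshtein` is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,        # deletion
                current[j - 1] + 1,     # insertion
                previous[j - 1] + cost  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# One of the near-duplicate pairs in this dataset differs by a single inserted "m"
print(levenshtein("Alessandro Flamini", "Alessandro Flammini"))
```

A pair like this is only one edit apart, which is why it scores well above the threshold and is treated as the same person.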

In [60]:
researcher_names_cleaned = find_dupe_names(researcher_names)
researcher_names_cleaned

Unnamed: 0,Original,Alternatives
0,Aaron Clauset,
1,Aaron J. Schwartz,
2,Aaron Schein,
3,Aaron Smith,
4,Abbas Haidar,
...,...,...
1595,Zoltan Kmetty,
1596,Zsófia Rakovics,
1597,diogo pacheco,
1598,franco scarselli,


Using this method, 55 duplicate names were removed (from 1655 names down to 1600).

In [61]:
# Example where Alternatives is not empty
researcher_names_cleaned[researcher_names_cleaned["Alternatives"] != ""].head()

Unnamed: 0,Original,Alternatives
49,Alessandro Flamini,[Alessandro Flammini]
59,Alexander Gates,[Alexander J Gates]
95,Ana Maria Jaramillo,[Ana María Jaramillo]
139,AnnaSeo Gyeong Choi,[Seo Gyeong Choi]
141,Anne C Kroon,"[Anne C. Kroon, Anne Kroon]"


In [62]:
# Final length
len(researcher_names_cleaned)

1600

### 1.6: Explanation of the process

First, the relevant HTML tags associated with researcher names were identified by inspecting the website's elements and navigating through the HTML structure. Specific tags corresponding to names (e.g., `u`, `i`) were extracted, and any prefixes or separators unrelated to the actual researcher name were removed, such as "Chair: " before chair members of a given plenary or "Keynote - " before keynote presenters. To address duplicated or misspelled names, fuzzy string matching (`thefuzz`) was applied to compute similarity scores for each name against all other unchecked names.

## Part 2: Ready Made vs Custom Made Data
Week 2, ex 1.

> **Exercise: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood they key points of my lecture and the reading. 
>
> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book __(answer in max 150 words)__.
> 2. How do you think these differences can influence the interpretation of the results in each study? __(answer in max 150 words)__

- **Centola's experiment (custom-made):** Based on controlled email invitations to join a health behavior platform.
  - **Pros:**
    - Low risk of system drift: the study design, invitations, and procedures are fixed and stable throughout the experiment.
    - Prevents incomplete data: since the platform and network were designed for the study, the data structure is complete and consistent.

  - **Cons:**
    - Population drift: users with fewer connections or less engagement might leave the site over time, affecting network structure and participation.
    - Dirty data: there may be bot accounts or one-time accounts.
    - Algorithmically confounded: legitimate emails can end up in spam filters.
    - Reactive: people knew they were in a health study, so their behavior might be different.

- **Nicolaides' study (ready-made):** Based on observational Strava activity data.
  - **Pros:**
    - Big data: access to the entirety of Strava data.
    - Always-on: constantly measures people when they run and share results.
    - Low risk of behavioral drift: users consistently log runs for their own purposes (training, competition, habit), so the core behavior of interest (running and sharing) stays stable.
    - Non-reactive: people didn’t know they were being studied, so behavior is natural.
  - **Cons:**
    - Non-representative of a greater population: Strava users are not a random sample of the general population; they tend to be more sports-oriented and possibly more competitive (comparing themselves to others).
    - Dirty: data could contain errors (GPS mistakes, fake activities).
    - Population drift: platform updates could change user activity, and weather may affect the number of people running.

**Question 2**

> Centola's experiment: Since the study and its data were designed specifically to test how behaviors spread within specific network structures, and conditions were controlled, it is easier to say "this network structure caused the behavior to spread (contagion)". However, the reactive setting (participants knew they were in a health study) makes it harder to assume these findings apply in the real world. Emails ending up in spam might add even further bias.

> Nicolaides' study: There’s no control over why people run or why they interact with others. Other unseen factors (weather, competitions, personal goals) might drive behavior changes. However, the dataset is big and non-reactive, which could give a realistic picture of the study's claim even though there are biases.

## Part 3: Gathering Research Articles using the OpenAlex API

### Get researcher names from IC2S2 2024

It is possible to pull the Google Sheets information straight into `pandas` as follows:

In [63]:
# Pull info from Google Sheets directly into pandas
def gs_link(spreadsheet_id):
    return f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv"

# Poster presentations
pp_id = "1tyug6JFNa2BVEBFNXMWiKNKwXOxnOLDR"
df_pp = pd.read_csv(gs_link(pp_id))
df_pp

# Lightning talks
lt_id = "17NqO1ofBn1SKMC6bhCAF_XsRYgHWagDj"
df_ld = pd.read_csv(gs_link(lt_id))
df_ld

# Oral panels
op_id = "1PY37V6MRvkr9D-w0liMm3X-QN_i43mv0"
df_op = pd.read_csv(gs_link(op_id))


Inspecting the Google Sheets documents, it is possible to gather the author names from the relevant columns (`Poster authors` and `Presentation authors`) in the various data frames:

In [64]:
for i in [df_pp, df_ld, df_op]:
    print(list(i.columns))

['Date', 'Poster title', 'Poster authors', 'Easel assignment']
['Date', 'Time', 'Location', 'Presentation title', 'Presentation authors']
['Session', 'Date', 'Time', 'Location', 'Session track', 'Presentation title', 'Presentation authors']


To get all names in one place we concatenate these into a single list:

In [65]:
authors_2024 = pd.concat([df_pp['Poster authors'], df_ld['Presentation authors'], df_op['Presentation authors']], axis=0).reset_index(drop=True)

authors_2024 = authors_2024.to_list()

To see what kind of pattern the data has:

In [66]:
authors_2024[0]

'Hazem Ibrahim, New York University; Talal Rahwan, New York University; Yasir Zaki, New York University Abu Dhabi'

This shows that each entry follows the pattern `author name, institution name`, with authors separated by semicolons.

Very few entries had the alternative pattern: `author name (institution name)`.

In [67]:
authors_2024_cleaned = []

for i in authors_2024:
    # Separate authors into "author, institution"
    a_i_list = re.split(r';\s*', i)
    
    # Discard institution and keep only author name
    for person in a_i_list:
        # replace "(" with "," to remove them all at once
        authors_2024_cleaned.append(person.replace("(", ",").split(",")[0])
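
The parsing step above can be checked on a single entry. The following is a minimal sketch of the same idea wrapped in a function; the name `extract_author_names` and the second, made-up entry are ours and are only there to illustrate the alternative `author name (institution name)` pattern.

```python
import re


def extract_author_names(entry: str) -> list[str]:
    """Return the author names from one 'name, institution; ...' entry."""
    names = []
    for person in re.split(r";\s*", entry):
        # Treat "(" like "," so both patterns are cut off at the institution
        name = person.replace("(", ",").split(",")[0].strip()
        if name:
            names.append(name)
    return names


example = ("Hazem Ibrahim, New York University; Talal Rahwan, New York University; "
           "Yasir Zaki, New York University Abu Dhabi")
print(extract_author_names(example))
print(extract_author_names("Jane Doe (Example University)"))  # made-up entry
```

Unlike the loop above, this version strips whitespace immediately, which in the notebook is handled later by `clean_names`.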

Get rid of identical name duplicates:

In [68]:
authors_2024_cleaned = clean_names(authors_2024_cleaned)

Use fuzzy string matching on the list of names to further remove duplicates:

In [69]:
authors_2024_cleaned_dupe = find_dupe_names(authors_2024_cleaned, threshold=85)
authors_2024_cleaned_dupe

Unnamed: 0,Original,Alternatives
0,A. Marthe Möller,
1,Aaron Clauset,
2,Aaron D Nichols,
3,Aaron Schein,
4,Abdul Basit Adeel,
...,...,...
1209,Zsófia Rakovics,
1210,Zubair Shafiq,
1211,diogo pacheco,
1212,Ákos Huszár,


The list is down to 1214 names after cleaning.

### Get author IDs and other relevant info

The first step is to retrieve the IDs, along with other relevant details, for each author using the OpenAlex API for authors. Searching by author ID is significantly more reliable than searching by name alone. To increase accuracy, alternative names are also considered when the original name is not found.

Due to time constraints, the author ID retrieval process was not optimized to the same extent as the later algorithms. Potential improvements include batching multiple author names together for bulk searches and applying parallelization to increase efficiency.

In [70]:
file_ = "./IC2S2-authors.csv"

# check if file exists
if os.path.isfile(file_):
    authors_df = pd.read_csv(file_)
else:
    BASE_URL = "https://api.openalex.org/authors"

    df = authors_2024_cleaned_dupe

    author_data = []  # List to hold dictionary records

    # Use a session for connection reuse
    with requests.Session() as session:
        for index, row in df.iterrows():
            name = row["Original"]
            alternatives = row["Alternatives"] if isinstance(row["Alternatives"], list) else []

            all_names = [name] + alternatives  # Try the Original first, then Alternatives

            for author_name in all_names:
                params = {"page": "1", "per_page": 1, "search": author_name}
                try:
                    time.sleep(0.1)  # Stay within 10 requests per second limit
                    response = session.get(BASE_URL, params=params)
                    response.raise_for_status()
                    
                    json_data = response.json()
                    result = json_data.get("results", [])

                    if result:
                        author = result[0]
                        #institutions = author.get("last_known_institutions", [])
                        #country = institutions[0].get("country_code", "N/A") if institutions else "N/A" # assuming this is where to fetch country
                        author_data.append({
                            "id": author["id"].split("/")[-1],
                            "display_name": author["display_name"],
                            #"country": country, 
                            "works_api_url": author["works_api_url"],
                            "h_index": author["summary_stats"]["h_index"],
                            "works_count": author["works_count"]
                        })
                        break  # Stop searching once a match is found

                except requests.exceptions.RequestException as e:
                    print(f"Error fetching data for {author_name}: {e}")

    authors_df = pd.DataFrame(author_data)
    authors_df.to_csv("IC2S2-authors.csv", index=False) # save as CSV file

On an M1 Mac, this takes around 5 minutes. However, as mentioned earlier, bulk searching using the `|` operator could be used in further development.

In [71]:
authors_df

Unnamed: 0,id,display_name,country,works_api_url,h_index,works_count
0,A5082130337,A. Marthe Möller,NL,https://api.openalex.org/works?filter=author.i...,6,13
1,A5014647140,Aaron Clauset,US,https://api.openalex.org/works?filter=author.i...,48,284
2,A5089395967,Aaron Nichols,US,https://api.openalex.org/works?filter=author.i...,2,10
3,A5053043999,Aaron J. Schein,US,https://api.openalex.org/works?filter=author.i...,16,19
4,A5082332656,Abdul Basit Adeel,US,https://api.openalex.org/works?filter=author.i...,4,9
...,...,...,...,...,...,...
1136,A5090107603,Zsófia Rakovics,HU,https://api.openalex.org/works?filter=author.i...,2,6
1137,A5100771200,Muhammad Shafiq,PK,https://api.openalex.org/works?filter=author.i...,49,396
1138,A5087528940,Diogo A. Gomes,SA,https://api.openalex.org/works?filter=author.i...,30,253
1139,A5054348632,Ákos Huszár,HU,https://api.openalex.org/works?filter=author.i...,5,31


Total authors found:

In [72]:
len(authors_df)

1141

### Apply filtering to get relevant authors and works

It's important to note that we create a new list/dataframe for authors here, since we also need to include all co-authors. Furthermore, each entry under `authorships` in the works API has a `countries` field, from which we can pull the country code directly for each author.

Include only IC2S2 authors with a total work count between 5 and 5,000:

In [73]:
sortWorkCount = authors_df.loc[(authors_df["works_count"] >= 5) & (authors_df["works_count"] <= 5000)]
len(sortWorkCount)

999

Create chunks (batches) of 25 author IDs to make use of the OR operator `|` in `filter=author.id:{id_1}|{id_2}|...`. This limits the number of requests needed.

In [74]:
# Step 1: Extract author IDs
author_ids = sortWorkCount["id"].to_list()

# Step 2: Split into chunks of 25
chunk_size = 25
chunks = [author_ids[i:i + chunk_size] for i in range(0, len(author_ids), chunk_size)]

# Step 3: Format each chunk into a string
formatted_chunks = ["|".join(chunk) for chunk in chunks]

# Example chunk
print(formatted_chunks[0])

A5082130337|A5014647140|A5089395967|A5053043999|A5082332656|A5064296964|A5113886631|A5045620226|A5073592405|A5068763840|A5083303782|A5054913386|A5071293344|A5074354806|A5109586591|A5031106143|A5018574266|A5040820784|A5086011172|A5102415619|A5075314395|A5026854954|A5028932589|A5082554858|A5038976962


Now we are interested in further filtering:
1. Only take articles with relevant concepts at level 0.
2. Only articles with fewer than 10 authors.
3. Only articles with more than 10 citations (done using `filter=cited_by_count:>10` in the next code segment).

Now we filter the rest:

In [75]:
def filter_relevant_works(data):
    # Define relevant Level 0 concepts
    social_science_concepts = {"Sociology", "Psychology", "Economics", "Political science"}
    quantitative_concepts = {"Mathematics", "Physics", "Computer science"}

    def is_relevant(work):
        """Checks if a work has at least one concept from each relevant category."""
        level_0_concepts = {concept["display_name"] for concept in work.get("concepts", []) if concept["level"] == 0}
        return (
            any(concept in social_science_concepts for concept in level_0_concepts) and
            any(concept in quantitative_concepts for concept in level_0_concepts)
        )

    # Filter works based on criteria
    filtered_works = []
    abstracts_dataset = []
    authors_dataset = []

    for work in data.get("results", []):
        if len(work.get("authorships", [])) < 10 and is_relevant(work):
            work_entry = {
                "id": work["id"],
                "publication_year": work["publication_year"],
                "cited_by_count": work["cited_by_count"],
                "author_ids": [authorship["author"]["id"].split("/")[-1] for authorship in work["authorships"]]
            }
            filtered_works.append(work_entry)

            abstract_entry = {
                "id": work["id"],
                "title": work["title"],
                "abstract_inverted_index": work.get("abstract_inverted_index", {})
            }
            abstracts_dataset.append(abstract_entry)

            for authorship in work.get("authorships", []):
                author = authorship["author"]
                author_entry = {
                    "id": author["id"].split("/")[-1],
                    "display_name": author["display_name"],
                    "country_code": authorship["countries"][0] if authorship.get("countries") else None
                }
                authors_dataset.append(author_entry)

    return filtered_works, abstracts_dataset, authors_dataset

The following code is not pretty, but in simple terms it does three things:

1 - Uses cursor paging on `https://api.openalex.org/works`. The first request is sent with `cursor=*` (i.e. `https://api.openalex.org/works?cursor=*`); the `meta` object in the response then contains a cursor ID for the next page, resulting in `https://api.openalex.org/works?cursor={id_nextpage_cursor}`. A while loop keeps requesting pages until the returned cursor is `None`, which marks the last page.

2 - Handles responses with a non-200 status code such as 429 (too many requests), as well as 200 responses that are empty or invalid, by retrying each request up to 5 times with exponential backoff inside a try/except block.

3 - Runs the batches in parallel using joblib with `n_jobs=9` workers (OpenAlex has a limit of 10 requests per second) to reduce the risk of 429 errors.
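
The cursor loop and the retry logic from the list above can be sketched in isolation, without network access. This is an illustration under stated assumptions rather than the code used for the assignment: `fetch_all_pages`, `fetch_page`, and the toy `fake_fetch` are hypothetical names, and the fake pages only mimic the `results` / `meta.next_cursor` shape of a response.

```python
import time


def fetch_all_pages(fetch_page, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Collect results from a cursor-paginated source.

    fetch_page(cursor) is a hypothetical callable returning a dict shaped like
    {"results": [...], "meta": {"next_cursor": str or None}} and raising
    RuntimeError on a transient failure.
    """
    results = []
    cursor = "*"  # "*" requests the first page
    while cursor:
        for attempt in range(max_attempts):
            try:
                page = fetch_page(cursor)
                break  # success: leave the retry loop
            except RuntimeError:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:
            # the for loop finished without break: every attempt failed
            raise RuntimeError(f"giving up on cursor {cursor!r}")
        results.extend(page["results"])
        cursor = page["meta"].get("next_cursor")  # None on the last page
    return results


# Toy stand-in for the API: three pages, the second fails once before succeeding
PAGES = {"*": ([1, 2], "c1"), "c1": ([3, 4], "c2"), "c2": ([5], None)}
failed_once = set()


def fake_fetch(cursor):
    if cursor == "c1" and cursor not in failed_once:
        failed_once.add(cursor)
        raise RuntimeError("simulated 429")
    items, next_cursor = PAGES[cursor]
    return {"results": items, "meta": {"next_cursor": next_cursor}}


print(fetch_all_pages(fake_fetch, sleep=lambda seconds: None))
```

Passing the page-fetching function in as an argument keeps the paging logic testable; in this sketch a permanently failing page raises an error instead of silently returning empty lists.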

In [76]:
def fetch_data(author_ids):
    Workurl = "https://api.openalex.org/works"
    cursor = "*"
    page_i = 1
    Articles = []
    Abstract = []
    Authors = []

    while cursor:
        params = {
            "filter": f"author.id:{author_ids},cited_by_count:>10",
            "per-page": 200,
            "cursor": cursor
        }

        for attempt in range(5):  # Retry up to 5 times
            try:
                response = requests.get(Workurl, params=params, timeout=3)

                if response.status_code != 200:
                    print(f"Error {response.status_code} for {author_ids}, retrying...")
                    time.sleep(2 ** attempt)  # Exponential backoff
                    continue

                if not response.text.strip():  # Check for empty response
                    print(f"Empty response for {author_ids}, retrying...")
                    time.sleep(2 ** attempt)
                    continue

                data = response.json()  # Parse JSON
                break  # Exit retry loop if successful

            except requests.exceptions.JSONDecodeError:
                # must be caught before RequestException, which is its parent class
                print(f"JSON decode error for {author_ids}, response: {response.text}, retrying...")
                time.sleep(2 ** attempt)
            except requests.exceptions.RequestException as e:
                print(f"Request error for {author_ids}: {e}, retrying...")
                time.sleep(2 ** attempt)

        else:
            print(f"Failed to fetch data for {author_ids} after multiple attempts.")
            return [], [], []

        # Call the function
        filtered_papers, filtered_abstracts, filtered_authors = filter_relevant_works(data)

        # Append results to existing lists
        Articles.extend(filtered_papers)
        Abstract.extend(filtered_abstracts)
        Authors.extend(filtered_authors)

        #print(f"Author Batch Processed, Page {page_i} parsed")
        page_i += 1
        cursor = data.get('meta', {}).get('next_cursor')

    return Articles, Abstract, Authors

# Parallel execution for each batch of authors
results = Parallel(n_jobs=9)(delayed(fetch_data)(chunk) for chunk in formatted_chunks)

# Merging results from parallel execution
Articles = []
Abstract = []
Authors = []
for articles, abstracts, authors in results:
    Articles.extend(articles)
    Abstract.extend(abstracts)
    Authors.extend(authors)

with open("articles.json", "w") as f:
    json.dump(Articles, f, indent=4)

with open("abstracts.json", "w") as f:
    json.dump(Abstract, f, indent=4)

with open("authors.json", "w") as f:
    json.dump(Authors, f, indent=4)



Error 429 for A5082130337|A5014647140|A5089395967|A5053043999|A5082332656|A5064296964|A5113886631|A5045620226|A5073592405|A5068763840|A5083303782|A5054913386|A5071293344|A5074354806|A5109586591|A5031106143|A5018574266|A5040820784|A5086011172|A5102415619|A5075314395|A5026854954|A5028932589|A5082554858|A5038976962, retrying...
Error 429 for A5034728614|A5010879920|A5069001141|A5086308811|A5019136836|A5016014168|A5070753993|A5101690189|A5114125504|A5024130598|A5067288003|A5100714698|A5093136190|A5070019023|A5100350849|A5066491651|A5009735497|A5027261073|A5091142592|A5073844842|A5076189854|A5021327158|A5014989262|A5059787882|A5057083249, retrying...


Length before filtering/removing duplicates:

In [77]:
len(Authors), len(Articles), len(Abstract)

(49501, 12915, 12915)

Then remove the duplicates:

In [78]:
from collections import Counter

# Count occurrences of article IDs
id_counts = Counter(article['id'] for article in Articles)

# Find duplicates
duplicates = {article_id: count for article_id, count in id_counts.items() if count > 1}
print(f"Total duplicate articles: {sum(duplicates.values()) - len(duplicates)}")
#print("Duplicate entries per ID:", duplicates)

# Remove duplicates while keeping the first occurrence
seen_ids = set()
unique_articles = []
unique_abstracts = []

for article, abstract in zip(Articles, Abstract):
    if article['id'] not in seen_ids:
        unique_articles.append(article)
        unique_abstracts.append(abstract)
        seen_ids.add(article['id'])

# Update Articles and Abstract with unique elements
Articles_cleaned = unique_articles
Abstract_cleaned = unique_abstracts

Total duplicate articles: 1320


In [79]:
# Count occurrences of author IDs
id_counts = Counter(author['id'] for author in Authors)

# Find duplicates
duplicates = {author_id: count for author_id, count in id_counts.items() if count > 1}
print(f"Total duplicate authors: {sum(duplicates.values()) - len(duplicates)}")
#print("Duplicate entries per ID:", duplicates)

# Remove duplicates while keeping the first occurrence
seen_ids = set()
unique_authors = []

for author in Authors:
    if author['id'] not in seen_ids:
        unique_authors.append(author)
        seen_ids.add(author['id'])

# Update Authors with unique elements
Authors_cleaned = unique_authors

Total duplicate authors: 31788


In [80]:
len(Articles_cleaned), len(Abstract_cleaned), len(Authors_cleaned)

(11595, 11595, 17713)

Finally, turn them into pandas dataframes:

In [81]:
df_articles = pd.DataFrame(Articles_cleaned)
df_abstract= pd.DataFrame(Abstract_cleaned)
df_authors = pd.DataFrame(Authors_cleaned)

In [82]:
df_articles.head()

Unnamed: 0,id,publication_year,cited_by_count,author_ids
0,https://openalex.org/W3103362336,2009,7042,"[A5014647140, A5082953212, A5067142016]"
1,https://openalex.org/W2047940964,2004,6955,"[A5014647140, A5067142016, A5008033989]"
2,https://openalex.org/W2018045523,2002,4174,"[A5007285525, A5067021466, A5029755266, A50883..."
3,https://openalex.org/W2119298903,2012,3872,"[A5054913386, A5065660380, A5065503150]"
4,https://openalex.org/W1987228002,2010,3047,"[A5100744117, A5080830598, A5022334515, A50389..."


In [83]:
df_authors

Unnamed: 0,id,display_name,country_code
0,A5014647140,Aaron Clauset,
1,A5082953212,Cosma Rohilla Shalizi,
2,A5067142016,M. E. J. Newman,
3,A5008033989,Cristopher Moore,US
4,A5007285525,Erzsébet Ravasz Regan,US
...,...,...,...
17708,A5083702049,Feng Wang,CN
17709,A5100452647,Han Wang,CN
17710,A5004273745,Jinan Luo,CN
17711,A5109934253,Rongzu Hu,CN


In [84]:
df_abstract.head()

Unnamed: 0,id,title,abstract_inverted_index
0,https://openalex.org/W3103362336,Power-Law Distributions in Empirical Data,"{'Power-law': [0], 'distributions': [1], 'occu..."
1,https://openalex.org/W2047940964,Finding community structure in very large netw...,"{'The': [0, 147], 'discovery': [1], 'and': [2,..."
2,https://openalex.org/W2018045523,Hierarchical Organization of Modularity in Met...,"{'Spatially': [0], 'or': [1], 'chemically': [2..."
3,https://openalex.org/W2119298903,Evaluating Online Labor Markets for Experiment...,"{'We': [0, 16, 32, 57, 69], 'examine': [1], 't..."
4,https://openalex.org/W1987228002,Limits of Predictability in Human Mobility,"{'Predictable': [0], 'Travel': [1], 'Routines'..."


> To conclude we have:

**Data Overview and Reflection questions: Answer the following questions:**

- **Dataset summary.** 

We have 11595 unique articles, with 17713 unique co-authors.

- **Code efficiency.** 
  
To improve efficiency, author IDs were processed in chunks. The filter `cited_by_count:>10` was applied directly within the API request. After retrieval, works were further filtered based on relevant topics and the number of authors. Parallel requests were utilized alongside cursor-based pagination, with the maximum page size of 200 to reduce the number of requests. These optimizations reduced the runtime to just 1.5 minutes on a Mac M1.


- **Filtering Criteria and Dataset Relevance**


Filtering out authors with fewer than 5 works helps exclude individuals who may not have a substantial influence in social science. Similarly, setting a minimum citation count of 10 helps focus the analysis on works that have made a measurable impact. Since the goal is to study the network of Social-SCI researchers, it's important to highlight the influential works. Papers with large author lists can create oversized clusters in the network, which may obscure the underlying structures. Finally, filtering for works that are explicitly relevant to Social-SCI ensures we are analyzing the right subset of research.


However, these filters may exclude meta-studies with lots of citations and new studies (with few citations), which could, for example, restrict the data to an older time period.

## Part 4: The Network of Computational Social Scientists
Week 4, ex 1. Please use the final dataset you collected from both authors and co-authors (IC2S2 2024).

### 4.1: Network Construction

Note that:
- Nodes = authors of academic papers
- Link/edge = authors A and B have written a paper together (co-authored)
- Link weight = number of papers written by both author A and B
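
As a small, self-contained illustration of these definitions, the sketch below counts co-authored papers per unordered author pair. The helper name `weighted_edges` and the toy author IDs are made up for this example; sorting each author list is one way (our assumption, not necessarily the notebook's approach) to make sure that (A, B) and (B, A) are counted as the same undirected link.

```python
from collections import Counter
from itertools import combinations


def weighted_edges(papers):
    """Count co-authored papers per unordered author pair."""
    weights = Counter()
    for authors in papers:
        # sorted(set(...)) drops repeated IDs and gives every pair a fixed order
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


# Made-up author IDs: A and B share two papers, listed in different orders
toy_papers = [["A", "B", "C"], ["B", "A"], ["C"]]
print(weighted_edges(toy_papers))
```

Here the pair A-B gets weight 2 even though the two papers list the authors in opposite order, and the single-author paper contributes no link.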

#### 4.1.1: Weighted Edgelist Creation

In [85]:
import itertools

# generate co-author pairs and count occurrences
edge_list = []
for authors in df_articles['author_ids']:
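    # note: combinations keeps the order of the author list, so a pair can be stored
    # as (A, B) for one paper and as (B, A) for another
    pairs = list(itertools.combinations(authors, 2))  # make all possible author A-B pairs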
    edge_list.extend(pairs)

# count occurrences of each pair
edge_weights = {}
for pair in edge_list:
    if pair in edge_weights:
        edge_weights[pair] += 1
    else:
        edge_weights[pair] = 1

# convert to DataFrame (Weighted Edge List)
weighted_edge_list = pd.DataFrame(
    [(a, b, w) for (a, b), w in edge_weights.items()],
    columns=["source", "target", "weight"]
)

In [86]:
weighted_edge_list

Unnamed: 0,source,target,weight
0,A5014647140,A5082953212,1
1,A5014647140,A5067142016,4
2,A5082953212,A5067142016,1
3,A5014647140,A5008033989,5
4,A5067142016,A5008033989,1
...,...,...,...
59770,A5083702049,A5109934253,1
59771,A5100452647,A5004273745,1
59772,A5100452647,A5109934253,1
59773,A5004273745,A5109934253,1


#### 4.1.2: Graph Construction

In [87]:
# make undirected graph
G = nx.Graph()

# add weighted edges
G.add_weighted_edges_from(weighted_edge_list.itertuples(index=False, name=None))

#### 4.1.3: Node Attributes

Here, numerical data is converted to `int` as JSON was unable to encode `numpy.int64` datatypes.

In [88]:
# Convert df_authors into a dictionary for quick lookup
author_metadata = {
    row["id"]: {
        "display_name": row["display_name"],
        "country_code": row["country_code"]
    }
    for _, row in df_authors.iterrows()
}

# Convert df_articles into a dictionary for first_pub_year & citation_count
# note: only the first article encountered for each author is recorded here
author_publication_info = {}
for _, row in df_articles.iterrows():
    for author in row["author_ids"]:
        if author not in author_publication_info:
            author_publication_info[author] = {
                "first_pub_year": int(row["publication_year"]),  # Convert numpy.int64 to int
                "citation_count": int(row["cited_by_count"])  # Convert numpy.int64 to int
            }


In [89]:
for node in G.nodes():
    # Get author metadata from df_authors
    G.nodes[node]["display_name"] = author_metadata.get(node, {}).get("display_name", "Unknown")
    G.nodes[node]["country_code"] = author_metadata.get(node, {}).get("country_code", "Unknown")

    # Get publication info from df_articles
    G.nodes[node]["first_pub_year"] = author_publication_info.get(node, {}).get("first_pub_year", None)
    G.nodes[node]["citation_count"] = author_publication_info.get(node, {}).get("citation_count", 0)


Example lookup of an author in the graph, programmatically:

In [90]:
G.nodes["A5082130337"]

{'display_name': 'A. Marthe Möller',
 'country_code': 'NL',
 'first_pub_year': 2017,
 'citation_count': 230}

Finally, save the graph as a JSON file:

In [91]:
import json
from networkx.readwrite import json_graph

# Convert NetworkX graph to JSON format
graph_data = json_graph.node_link_data(G)

# Save to a JSON file
with open("coauthorship_network.json", "w") as f:
    json.dump(graph_data, f, indent=4)
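
To check that nothing is lost when saving, the node-link data can be loaded back with `json_graph.node_link_graph`. The sketch below does the round trip in memory on a small hypothetical graph instead of the real file:

```python
import json
import networkx as nx
from networkx.readwrite import json_graph

# Hypothetical toy graph with one weighted edge and one node attribute
toy = nx.Graph()
toy.add_edge("A1", "A2", weight=3)
toy.nodes["A1"]["display_name"] = "Hypothetical Author"

# Serialize to a JSON string and rebuild the graph from it
payload = json.dumps(json_graph.node_link_data(toy))
restored = json_graph.node_link_graph(json.loads(payload))

print(restored["A1"]["A2"]["weight"])
print(restored.nodes["A1"]["display_name"])
```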

### 4.2: Preliminary Network Analysis

#### 4.2.1: Network Metrics

> What is the total number of nodes (authors) and links (collaborations) in the network? 

Number of nodes and edges:

In [92]:
G_nodes = len(G.nodes())
G_edges = len(G.edges())

print(f"Total number of nodes (authors): {G_nodes}")
print(f"Total number of links (collaborations): {G_edges}")

Total number of nodes (authors): 17704
Total number of links (collaborations): 56903


> Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.

To calculate the network's density, we first need the maximum possible number of links, which for an undirected graph like this one is:
$$
E_{\max} = \frac{n(n-1)}{2}
$$
where $n$ is the number of nodes.

In [93]:
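# maximum number of links in an undirected graph without self-loops
E_max = G_nodes * (G_nodes - 1) // 2  # integer division avoids float rounding
print(E_max)

# density = actual links / maximum possible links
ratio_ = G_edges / E_max

ratio_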

156706956


0.000363117256900836
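
As a cross-check, NetworkX has a built-in `nx.density`, which for an undirected graph computes the same ratio, $2m / (n(n-1))$ with $m$ edges and $n$ nodes. A small sketch on a toy graph (not the real network):

```python
import networkx as nx

toy = nx.path_graph(4)  # 4 nodes in a line, 3 edges
n, m = toy.number_of_nodes(), toy.number_of_edges()

manual = m / (n * (n - 1) // 2)  # same formula as in the cell above
builtin = nx.density(toy)

print(manual, builtin)  # 0.5 0.5
```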

The network is sparse: only 56,903 of the 156,706,956 possible links exist, a density of about 0.036%. In other words, the vast majority of author pairs have never coauthored a paper.

> Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?

No, the network is not fully connected, i.e. it is disconnected:

In [94]:
nx.is_connected(G)

False

This means that the network splits into several separate connected components. That makes sense, as people from different backgrounds are less likely to work together unless the research is interdisciplinary.

> If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset.

In [95]:
nx.number_connected_components(G)

273
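
The count alone does not say how the nodes are distributed over the components. A sketch of how the component sizes and the share of nodes in the largest component could be inspected, shown on a hypothetical toy graph:

```python
import networkx as nx

# Hypothetical toy graph with two components: {A1, A2, A3} and {B1, B2}
toy = nx.Graph([("A1", "A2"), ("A2", "A3"), ("B1", "B2")])

# Component sizes, largest first
sizes = sorted((len(c) for c in nx.connected_components(toy)), reverse=True)

# Fraction of all nodes that sit in the largest component
largest_share = sizes[0] / toy.number_of_nodes()

print(sizes, largest_share)  # [3, 2] 0.6
```

Running the same lines on `G` would show whether the 273 components are dominated by one large component or are all of similar size.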

> How many isolated nodes are there in your network?  An isolated node is defined as a node with no connections to any other node in the network.

In [96]:
nx.number_of_isolates(G), list(nx.isolates(G))

(0, [])

There are no isolated (degree-zero) nodes, so every author in the network has coauthored at least one paper with someone else. This is expected by construction: the graph was built only from coauthor pairs, so an author who has only published alone never enters the network. An author may still have some solo papers alongside their collaborations.

> Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why?

Yes. As explained above, a low density is expected because people from different fields are less likely to work together; maximizing the number of edges would require every author to have collaborated with every other author. The same reasoning explains why the network is disconnected.

#### 4.2.2: Degree Analysis

> Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree). What do these metrics tell us about the network?

In [97]:
# Get degrees for all nodes (authors)
node_degree = pd.DataFrame(list(G.degree()), columns=["id", "degree"])
display(node_degree.describe())

|       | degree    |
|-------|-----------|
| count | 17704.0   |
| mean  | 6.428265  |
| std   | 10.74354  |
| min   | 1.0       |
| 25%   | 3.0       |
| 50%   | 5.0       |
| 75%   | 7.0       |
| max   | 362.0     |


Note that the median is the 50th percentile.
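
`describe()` does not report the mode, which the question also asks for. A minimal sketch of how it could be computed, using a hypothetical list of degrees in place of the real ones:

```python
from statistics import mean, median, multimode

# Hypothetical degrees; in the notebook this could be [d for _, d in G.degree()]
degrees = [1, 2, 2, 3, 3, 3, 10]

summary = {
    "mean": mean(degrees),
    "median": median(degrees),
    "mode": multimode(degrees),  # a list, since several values can tie
    "min": min(degrees),
    "max": max(degrees),
}
print(summary["median"], summary["mode"])  # 3 [3]
```

With the DataFrame from the cell above, `node_degree["degree"].mode()` should give the same information.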

In general, these metrics give insight into collaboration tendencies: the average author has collaborated with around 6 other authors, and the most connected author has 362 distinct coauthors. The standard deviation (10.7) is large relative to the mean, which is expected since some authors, e.g. those whose work spans several fields, collaborate far more widely than others. With the mean (6.4) above the median (5) and a maximum of 362, the degree distribution is very likely right-skewed.

In [98]:
# Pass the weight attribute so that degrees are weighted (node strength)
node_degree_weighted = pd.DataFrame(list(G.degree(weight="weight")), columns=["id", "degree"])
display(node_degree_weighted.describe())

|       | degree    |
|-------|-----------|
| count | 17704.0   |
| mean  | 8.15488   |
| std   | 17.13154  |
| min   | 1.0       |
| 25%   | 3.0       |
| 50%   | 5.0       |
| 75%   | 8.0       |
| max   | 540.0     |


The weight property on the edges represents the strength of a coauthorship, i.e. how many times two authors have worked together, rather than just whether they have worked together (as in the unweighted case). The mean strength (8.2) is higher than the mean degree (6.4) while the median stays at 5, which suggests that most pairs have collaborated only once and that a smaller group of authors collaborates repeatedly.
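
To make the difference between degree and strength concrete, here is a small sketch on a hypothetical three-author graph:

```python
import networkx as nx

toy = nx.Graph()
# A1 wrote 3 papers with A2 and 1 paper with A3 (made-up numbers)
toy.add_weighted_edges_from([("A1", "A2", 3), ("A1", "A3", 1)])

degree = dict(toy.degree())                   # number of distinct coauthors
strength = dict(toy.degree(weight="weight"))  # sum of edge weights

print(degree["A1"], strength["A1"])  # 2 4
```

A1 has two coauthors (degree 2) but four collaborations in total (strength 4).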

#### 4.2.3: Top Authors

> Identify the top 5 authors by degree. What role do these nodes play in the network?

In [99]:
# Get top 5 authors
top_5 = node_degree.nlargest(5, "degree")
top_5

|       | id          | degree |
|-------|-------------|--------|
| 8081  | A5100322712 | 362    |
| 16297 | A5005421447 | 306    |
| 8693  | A5077712228 | 279    |
| 108   | A5007176508 | 263    |
| 4830  | A5059645286 | 256    |


In [100]:
# Get their names (author_id avoids shadowing the built-in id())
for author_id in top_5["id"]:
    display(df_authors[df_authors["id"] == author_id])

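|      | id          | display_name | country_code |
|------|-------------|--------------|--------------|
| 8085 | A5100322712 | Yan Wang     | US           |


|       | id          | display_name | country_code |
|-------|-------------|--------------|--------------|
| 16306 | A5005421447 | Yi Yang      | AU           |


|      | id          | display_name   | country_code |
|------|-------------|----------------|--------------|
| 8698 | A5077712228 | Simon A. Levin | GB           |


|     | id          | display_name  | country_code |
|-----|-------------|---------------|--------------|
| 108 | A5007176508 | Alex Pentland | US           |


|      | id          | display_name | country_code |
|------|-------------|--------------|--------------|
| 4832 | A5059645286 | Robert West  | GB           |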
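
The ID-to-name lookup above could also be done in one step by merging the top rows with the author table, which gives a single table with both degree and name. A sketch on hypothetical toy frames with the same column names:

```python
import pandas as pd

# Hypothetical stand-ins for node_degree and df_authors
toy_degree = pd.DataFrame({"id": ["A1", "A2", "A3"], "degree": [5, 9, 2]})
toy_authors = pd.DataFrame({
    "id": ["A1", "A2", "A3"],
    "display_name": ["Author One", "Author Two", "Author Three"],
    "country_code": ["DK", "US", "NL"],
})

# Take the top rows by degree, then attach names; a left merge keeps the ranking order
top_2 = toy_degree.nlargest(2, "degree").merge(toy_authors, on="id", how="left")
print(top_2)
```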


They can be seen as "hubs" in the network: each of them has collaborated with a very large number of other authors.

Judging by their OpenAlex author entries, Yan Wang, Yi Yang and Alex Pentland specialize mainly in computer science and AI/ML, with some works in the social science subfield. This makes sense, as a lot of their data may have roots in social science.

Simon A. Levin and Robert West work in health, biology, chemistry and/or environmental engineering, with only a few works in computer science. Their broad research "footprint" probably explains their large degree.