# 02467 Computational Social Science
## Assignment 1
### Group 15

Our GitHub repo is available at: https://github.com/Simo067m/ComSocSci-Assignments <br>
Contribution:
- s233304 : Part 2 + Part 3
- s214592 : Part 1 + Part 4

In [1]:
# Import packages
import pandas as pd
import networkx
import netwulf
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import ast

from multiprocessing import Pool
import time

## Part 1: Web-scraping
Web-scraping the list of participants of the International Conference on Computational Social Science

In [2]:
# Define a function for finding all unique researchers
def scrape_IC2S2(soup : BeautifulSoup):
    # Find all the names from the top table
    names = []
    # Find all the table rows
    table_rows = soup.find_all("tr")
    for tr in table_rows:
        tds = tr.find_all("td")
        for row in tds:
            a = row.find_all("a")
            for text in a:
                text_content = text.text
                if ("Keynote" in text_content):
                    text_split = text_content.split("-")
                    stripped = text_split[1].strip()
                    if (stripped not in names):
                        names.append(stripped)
    
    # Find all the names from the bottom lists
    # Find all the unordered lists
    ul = soup.find_all("ul", class_="nav_list")
    # Find all the list elements
    for ul_element in ul:
        found_names = ul_element.find_all("i")
        # For every found name line, separate into individual names
        for name in found_names:
            found_names_separated = name.text.split(", ")
            for separated_name in found_names_separated:
                if (separated_name.strip() not in names):
                    names.append(separated_name.strip())

    # Find all the names of the chairs
    headers = soup.find_all("h2")
    for header in headers:
        text = header.find("i")
        if (text is not None):
            separated_name = text.text.split(": ")
            if (separated_name[1].strip() not in names):
                names.append(separated_name[1].strip())

    return names

In [3]:
# Define url and collect content
LINK = "https://ic2s2-2023.org/program"
r = requests.get(LINK)
soup = BeautifulSoup(r.content)

# find participant names
IC2S2_names = scrape_IC2S2(soup)
# Save to a pandas DataFrame
IC2S2_names_df = pd.DataFrame(IC2S2_names, columns=["name"])
IC2S2_names_df.to_csv("IC2S2_names.csv", index=False)

### Q5
_How many unique researchers do you get?_
>- 1491

In [4]:
print(len(pd.read_csv("IC2S2_names.csv")))

1491


### Q6:
_Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices **(answer in max 150 words)**._

>- By inspecting the webpage, we found that names in the schedule tables were always contained in an &lt;a&gt; element, so finding an &lt;a&gt; element within one of these tables would guarantee a name. For other names, such as those labelled as "chair", we split the label off so that only the name is retrieved.
>- Before a name is added to the list, we check that it is not already there, so the list contains only unique names. Unwanted whitespace is removed by calling the $\texttt{str.strip()}$ method before adding.
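
The stripping and uniqueness checks described above can be sketched in isolation. This is only a minimal illustration, not the scraper itself: the helper `add_unique` and the sample strings are hypothetical and do not come from the IC2S2 page.

```python
def add_unique(names, raw_name):
    """Strip whitespace and append the name only if it is new and non-empty."""
    name = raw_name.strip()
    if name and name not in names:
        names.append(name)


names = []

# A comma-separated author line (made-up sample data)
for raw_name in "Alice Smith, Bob Jones , Alice Smith".split(","):
    add_unique(names, raw_name)

# A chair line: keep only the part after the "Chair: " label
chair_line = "Chair: Carol White "
add_unique(names, chair_line.split(": ")[1])

print(names)  # ['Alice Smith', 'Bob Jones', 'Carol White']
```

Checking membership in a list is fine for a few thousand names; for much larger lists a set would probably be the better choice.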

## Part 2: Ready Made vs Custom Made Data

### Q1
_What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book (answer in max 150 words)._

>- **custom-made data**: The pros are that specific hypotheses can be investigated within a controlled environment without any external influences. However, this control can also be a disadvantage, as the created clusters were created randomly and might only artificially fit the definition of friend group and the studied social network definitions, and people might behave differently as the people they interact with are not really friends.
>- **ready-made data**: The advantages are that a big dataset is already available and ecologically valid. However, the researchers do not have any control over how the data is collected and this collection purpose might not match the needs of the research. Nevertheless, this can also be seen as a potential pro as this could lead to more creative and innovative methods used by the researchers which might in the next step create new findings outside of the standard procedure.
(146 words)

### Q2
_How do you think these differences can influence the interpretation of the results in each study? (answer in max 150 words)_
> These differences can mainly influence the limitations of a study. In Centola's study, the degree of realism within the experimental setup can influence the interpretation. While these results were observed within an artificially created network, it has to be investigated if these outcomes can be generalized to other naturally formed networks. In Nicolaides's study, other effects apart from the contagiousness of the network could be the reason for the change in running behaviour. While there were some tests applied to check the outcome, for example, the weather check, it is still always questionable if the "right" testing has been done. (100 words)

## Part 3: Gathering Research Articles using the OpenAlex API


In [5]:
BASE_URL = "https://api.openalex.org"
WORKS_URL = "/works"
CONCEPT_URL = "/concepts"
AUTHORS_URL = "/authors"

In [6]:
"""
Retrieve concept ids and return a tuple of lists that reflect ([Sociology, Psychology, Economics, Political Science], [Mathematics, Physics, Computer science])
"""

def create_concept_filters():
    #Retrieve concepts so they can be used for filtering
    response_concepts = requests.get(BASE_URL + CONCEPT_URL)
    all_concepts = (response_concepts.json()['results'])
    
    # We want either one of: Sociology OR Psychology OR Economics OR Political Science
    com_soc_sci_list = ['Sociology', 'Psychology', 'Economics', 'Political science']
    # AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science)
    quant_disc_list = ['Mathematics', 'Physics', 'Computer science']
           
    com_soc_sci_ids = [c['id'] for c in all_concepts if c['display_name'] in com_soc_sci_list]
    quant_disc_ids = [c['id'] for c in all_concepts if c['display_name'] in quant_disc_list]

    return com_soc_sci_ids, quant_disc_ids

In [7]:
"""
Returns a parallel executable "to-do" list of parameters for retrieving batches of works at once.
"""

def create_todo_list(author_id_list):
    todo_list = []
    batch_size = 25

    com_soc_sci_ids, quant_disc_ids = create_concept_filters()
    
    for i in range(0, len(author_id_list), batch_size):
        author_filter = '|'.join(author_id_list[i:i + batch_size])
        filter_string = f"authorships.author.id:{author_filter},cited_by_count:>10,authors_count:<10,concepts.id:{'|'.join(com_soc_sci_ids)},concepts.id:{'|'.join(quant_disc_ids)}"

        params = {
            "filter": filter_string,
            "per-page": 200,
            "cursor": "*"
        }
        todo_list.append(params)
    return todo_list
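
As a rough illustration of how the filter string above is composed: in the OpenAlex filter syntax, as we understand it, a comma between conditions acts as AND and a pipe between values acts as OR. The helper `build_filter` and all IDs below are placeholders for illustration and are not part of the notebook.

```python
def build_filter(author_ids, social_ids, quant_ids):
    """Combine conditions: ',' joins conditions (AND), '|' joins alternatives (OR)."""
    conditions = [
        "authorships.author.id:" + "|".join(author_ids),
        "cited_by_count:>10",
        "authors_count:<10",
        # Repeating concepts.id requires a match in BOTH groups
        "concepts.id:" + "|".join(social_ids),
        "concepts.id:" + "|".join(quant_ids),
    ]
    return ",".join(conditions)


# Placeholder IDs, not real OpenAlex identifiers
example = build_filter(["A1", "A2"], ["C10", "C11"], ["C20"])
print(example)
# authorships.author.id:A1|A2,cited_by_count:>10,authors_count:<10,concepts.id:C10|C11,concepts.id:C20
```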

In [8]:
"""
Retrieve works for the given parameters using paging.
"""

def retrieve_all_works(params, specific_url):
    # Pause for 2 seconds to stay within limits of API calls
    time.sleep(2)
    
    # List to store all works retrieved
    all_works = []
    
    # Make requests until all works are fetched
    while True:
        # Make the GET request
        response = requests.get(BASE_URL + specific_url, params=params)
        
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse the JSON body once and reuse it
            data = response.json()
            results = data["results"]

            # Append retrieved works to the list
            all_works.extend(results)
            
            # Check if there are more works to fetch (paging)
            if len(results) == params["per-page"]:
                # Update the cursor for the next page
                params["cursor"] = data["meta"]["next_cursor"]
            else:
                # No more works to fetch
                break
        else:
            # Print an error message if the request was not successful
            print("Error:", response.status_code, params)
            break
    
    return all_works
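
The cursor paging used above can be tried offline with a stand-in for the API. This is a sketch under assumptions: `fake_fetch` is a hypothetical stub, and it assumes the API returns an empty `next_cursor` once the results are exhausted, so the loop stops on the cursor instead of comparing the page length with `per-page`.

```python
def fake_fetch(cursor, per_page=2):
    """Stand-in for one API call: returns a page of results and the next cursor."""
    data = ["w1", "w2", "w3", "w4", "w5"]
    start = 0 if cursor == "*" else int(cursor)
    page = data[start:start + per_page]
    # Assumption: an exhausted result set is signalled by next_cursor = None
    next_cursor = str(start + len(page)) if page else None
    return {"results": page, "meta": {"next_cursor": next_cursor}}


def fetch_all(fetch):
    """Follow the cursor until the (fake) API reports that nothing is left."""
    results = []
    cursor = "*"
    while cursor is not None:
        page = fetch(cursor)
        results.extend(page["results"])
        cursor = page["meta"]["next_cursor"]
    return results


all_items = fetch_all(fake_fetch)
print(all_items)  # ['w1', 'w2', 'w3', 'w4', 'w5']
```

Stopping on the cursor costs one extra (empty) request at the end, but it does not depend on the page size matching the number of returned results.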

In [9]:
"""
Return object including id, publication_year, cited_by_count and author_ids.
"""

def create_paper(work):
    author_ids = [a['author']['id'] for a in work['authorships']] 
    return {
        'id': work['id'],
        'publication_year': work['publication_year'],
        'cited_by_count': work['cited_by_count'],
        'author_ids': author_ids
    }

In [10]:
"""
Return object including id, title and abstract_inverted_index.
"""

def create_abstract(work):
    return {
        'id': work['id'],
        'title': work['title'],
        'abstract_inverted_index': work['abstract_inverted_index']
    }

In [11]:
"""
Process retrieved data and extract relevant information.
"""
def process_data():
    for upper_list in unprocessed_data:
        for work in upper_list:
            IC2S2_papers.append(create_paper(work))
            IC2S2_abstracts.append(create_abstract(work))

### Retrieve and process data

In [12]:
IC2S2_papers = []
IC2S2_abstracts = []

papers_df = pd.DataFrame()
abstracts_df = pd.DataFrame()

In [13]:
IC2S2_authors_df = pd.read_csv('week2-authors.csv')

# Filters: Only include IC2S2 authors with a total work count between 5 and 5,000
filtered_IC2S2_authors_df = IC2S2_authors_df[(IC2S2_authors_df['works_count'] > 5) & (IC2S2_authors_df['works_count'] < 5000)]

In [14]:
def work_worker(params):
    return retrieve_all_works(params, WORKS_URL)

In [15]:
if __name__ == "__main__":
    # Prepare all request parameters that have to be done in parallel
    todo_list = create_todo_list(filtered_IC2S2_authors_df.id.tolist())
    
    # Perform all requests with the prepared parameters in parallel
    with Pool() as p:
        unprocessed_data = list(tqdm(p.imap(work_worker, todo_list), total=len(todo_list)))

    # Process collected data
    process_data()   

    # Remove duplicates and store in dataframes
    papers_df = pd.DataFrame(IC2S2_papers).drop_duplicates(subset=['id'])
    abstracts_df = pd.DataFrame(IC2S2_abstracts).drop_duplicates(subset=['id'])
    
    # Export data
    papers_df.to_csv('papers.csv', index=False)
    abstracts_df.to_csv('abstracts.csv', index=False)

100%|████████████████████████████████████████████████████████████████| 48/48 [00:41<00:00,  1.15it/s]


#### Retrieving information about co-authors in preparation for Part 4

In [16]:
"""
Returns a parallel executable "to-do" list of parameters for retrieving batches of authors at once.
"""
def create_auth_todo_list(author_id_list):
    auth_todo_list = []
    batch_size = 25
    
    for i in range(0, len(author_id_list), batch_size):
        author_filter = "|".join(author_id_list[i:i + batch_size])
        params = {
            "filter": f"ids.openalex:{author_filter}",
            "per-page": 200,
            "cursor": "*"
        }
        auth_todo_list.append(params)

    return auth_todo_list

In [26]:
"""
Return object including id, display_name, works_api_url, works_count and, when available, h_index and country_code.
"""
def create_author(author):
    obj = {
        'id': author['id'],
        'display_name': author['display_name'],
        'works_api_url': author['works_api_url'],
        'works_count': author['works_count'],
    }

    # summary_stats and last_known_institution can be missing or empty
    if author.get('summary_stats'):
        obj['h_index'] = author['summary_stats']['h_index']

    if author.get('last_known_institution'):
        obj['country_code'] = author['last_known_institution']['country_code']

    return obj

In [18]:
def author_worker(params):
    return retrieve_all_works(params, AUTHORS_URL)

In [19]:
co_authors = []
co_authors_df = pd.DataFrame()

In [20]:
# Start with the unique author IDs in the IC2S2 papers dataset from part 3
author_ids_set = set()
for d in IC2S2_papers:
    author_ids_set.update(d['author_ids'])

# Exclude the IC2S2 authors from this query since you already have their data. 
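# Build the lookup set once instead of rebuilding a list for every author
known_author_ids = set(IC2S2_authors_df.id)
co_authors_id = [a for a in author_ids_set if a not in known_author_ids]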

In [27]:
if __name__ == "__main__":
    # Prepare all request parameters that have to be done in parallel
    co_author_todo = create_auth_todo_list(co_authors_id)

    # Perform all requests with the prepared parameters in parallel
    with Pool() as p:
        unprocessed_auth_data = list(tqdm(p.imap(author_worker, co_author_todo), total=len(co_author_todo)))

    # Clean-up data 
    for data in unprocessed_auth_data:
        for author_object in data:
            co_authors.append(create_author(author_object))

    # Remove duplicates and store in dataframes
    co_authors_df = pd.DataFrame(co_authors).drop_duplicates(subset=['id'])
    print(co_authors_df.shape)

    # Export co-authors
    co_authors_df.to_csv('co-authors.csv', index=False)

(14850, 6)


In [None]:
# Create one dataframe holding all authors and co-authors
authors_df = pd.concat([IC2S2_authors_df, co_authors_df])
authors_df.to_csv('authors.csv', index=False)

### Q1 - Dataset summary
- _How many works are listed in your IC2S2 papers dataframe?_
>- 10,155
- _How many unique researchers have co-authored these works?_
>- 14,850

In [17]:
Q1_works = papers_df.shape[0]
print(f"How many works are listed in your IC2S2 papers dataframe? {Q1_works}")

How many works are listed in your IC2S2 papers dataframe? 10155


In [43]:
Q1_authors = co_authors_df.shape[0]
print(f"How many unique researchers have co-authored these works? {Q1_authors}")

How many unique researchers have co-authored these works? 14850


### Q2 - Efficiency in code
_Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?_

> I have used multiprocessing of batches with paging while using filters when retrieving the data. 
First of all, I have filtered the authors from week 2 exercise 2 to start with a smaller set already. I have then created a list of requests that can be executed in parallel. Each of these requests includes the filters (described in the assignment) for one 25-author batch, and a page size of 200.
Next, I executed these requests in parallel using the function “imap” by the class “multiprocessing.Pool”. To keep within the limits of the API I have added a sleep for two seconds. 
The data retrieval has taken a total of about 36 seconds. (112 words)
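
The batching idea can be sketched without any network calls. The helper `make_batches` and the IDs are hypothetical; the sketch only shows how grouping 25 authors per request reduces the number of requests.

```python
def make_batches(ids, batch_size=25):
    """Split a list of IDs into '|'-joined strings of at most batch_size IDs."""
    return [
        "|".join(ids[i:i + batch_size])
        for i in range(0, len(ids), batch_size)
    ]


# Hypothetical IDs; the real ones are OpenAlex author URLs
ids = [f"A{n}" for n in range(60)]
batches = make_batches(ids)

print(len(batches))                 # 3 requests instead of 60
print(len(batches[-1].split("|")))  # 10 IDs in the last, partly filled batch
```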

### Q3 - Filtering Criteria and Dataset Relevance
_Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?_

> When focusing on the total number of works by an author and the citation count, one can say that, while this filtering helps include work from established researchers with a proven track record of impactful research, it excludes upcoming researchers or those from underrepresented backgrounds or smaller fields of computational social science which might be as valuable as the others.
This could lead to an overrepresentation of researchers in a more general field and an underrepresentation of niche researchers.
The filter on the number of authors per work can help to create a balanced representation of single-authored and multi-authored works. However, collaborative research projects which may be in interdisciplinary environments will be excluded.
When looking into the specific fields, research that does not directly fall into one of these pre-defined fields might still hold a lot of value within these fields and may bridge traditional disciplinary boundaries. (147 words)

## Part 4: The Network of Computational Social Scientists

### Constructing the Computational Social Scientists Network

### Part 4.1: Network Construction

In [1]:
# Re-import what Part 4 needs so it also runs in a fresh kernel
import ast
import pandas as pd
from tqdm import tqdm

# Load the papers dataset
papers_df = pd.read_csv("papers.csv")
# Load the authors dataset
authors_df = pd.read_csv("week2-authors.csv")
# Load the abstracts dataset
abstracts_df = pd.read_csv("abstracts.csv")

In [None]:
# Create weighted edgelist (WIP): weight = number of papers a pair co-authored
from collections import Counter
from itertools import combinations

def create_weighted_edgelist(paper_authors_ids, author_ids):
    # Note: author_ids is not used yet
    edge_weights = Counter()
    # Loop through the author lists of all papers
    for authors in tqdm(paper_authors_ids, desc="Progress"):
        # Sort the unique authors so (a, b) and (b, a) map to the same pair
        unique_authors = sorted({a for a in authors if a is not None})
        for pair in combinations(unique_authors, 2):
            edge_weights[pair] += 1
    # Return ((author_1, author_2), weight) tuples
    return list(edge_weights.items())

papers_authors_ids = papers_df["author_ids"].apply(ast.literal_eval)
author_ids = authors_df["id"]
edgelist = create_weighted_edgelist(papers_authors_ids, author_ids)
# Print the number of edges and a small sample instead of the full list
print(len(edgelist), edgelist[:5])