| Assignment 1 contribution table   | Teis Aggerholm (s234822) | Andreas Holm Matthiassen (s234838) | Hector Helt Jakobsen (s234822) |
|-------------|---------|---------|---------|
| Part 1 | 100%     | 0%     | 0%     |
| Part 2 | 100%     | 0%     | 0%     |
| Part 3 | 0%     | 0%     | 100%     |
| Part 4 | 0%     | 100%     | 0%     |

Link to our GitHub repository: `https://github.com/Andreas-Holm-2/02467-Assignment-1`

# `Part 1:` Web-scraping

1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported in the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling.

2. Some instructions for success:
   - First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.
   - Use the BeautifulSoup Python package to navigate through the hierarchy and extract the elements you need from the page.
   - You can use the `find_all` method to find elements that match specific filters. Check the documentation of the library for detailed explanations on how to set filters.
   - Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas or other unwanted characters).
   - The overall idea is to adapt the procedure I have used here for the specific page you are scraping.

3. Create the set of unique researchers that joined the conference and store it into a file.
Important: If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible.

4. Optional: For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at this link; (ii) the organizers of tutorials, that can be found at this link

### Scraping the URL:

In [1]:
from bs4 import BeautifulSoup
import requests

def fetch_names_program():
    url = "https://ic2s2-2023.org/program"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    names = set()
    for tag in soup.find_all("i"): # We use find_all as recommended
        text = tag.get_text()
        for name in text.split(", "):  # Handle multiple names in one <i> tag
            clean_name = name.replace("Chair: ", "").strip() # remove 'Chair: ' from certain strings
            names.add(clean_name)
            
    for tr in soup.find(id="summary").find_all("tr"): # Loop through the rows of the summary table containing keynote speakers
        tds = tr.find_all("td")
        if len(tds) >= 2: # Keep only rows that have at least a second column (Room A)
            text = tds[1].get_text() # Get column containing keynote speakers
            if text:
                prefix = "Keynote - " # Prefix that needs to be cleaned
                if text.startswith(prefix):
                    clean_name = text[len(prefix) : ].strip()
                    names.add(clean_name)

    return names

def fetch_names_program_commitee():
    url = "https://ic2s2-2023.org/program_committee"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    names = set()
    
    for tag in soup.find_all("b"):
        name = tag.get_text().strip()
        if not any(char in name for char in ["√", "°", "©", "§"]): # Exclude all misspelled names with special characters 
            names.add(name)
        
    return names

def fetch_names_tutorials():
    url = "https://ic2s2-2023.org/tutorials"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    
    names = set()
    
    section = soup.find("section", {"class": "wrapper style3"}) # Navigate to the section containing the professors and organizers

    for b in section.find_all("b"):
        clean_name = b.get_text().strip()
        names.add(clean_name)
        
    
    
    return names

names = fetch_names_program()
names = names.union(fetch_names_program_commitee())
names = names.union(fetch_names_tutorials())
names = sorted(names)

print(f'Initially, we have found {len(names)} names. The names are {names}')

Initially, we have found 1652 names. The names are ['Aaron Clauset', 'Aaron J. Schwartz', 'Aaron Schein', 'Aaron Smith', 'Abbas Haidar', 'Abby Smith', 'Abdulkadir Celikkanat', 'Abdullah Almaatouq', 'Abdullah Zameek', 'Abeer ElBahrawy', 'Abigail Z Jacobs', 'Adam Dunn', 'Adam Finnemann', 'Adam Frank', 'Adam H. Russell', 'Adam Stefkovics', 'Adam Sutton', 'Aditi Dutta', 'Adriano Belisario', 'Adrienne Mendrik', 'Afra Mashhadi', 'Agnieszka Czaplicka', 'Agnieszka Falenska', 'Aguru Ishibashi', 'Ahmad Hesam', 'Ahmed Nasser Mostafa', 'Aidan Combs', 'Aidar Zinnatullin', 'Akeela Careem', 'Akhil Arora', 'Akira Hashimoto', 'Akira Matsui', 'Akira Tsurushima', 'Akrati Saxena', 'Alain Barrat', 'Alan Paul Kwan', 'Alba Motes Rodrigo', 'Albert-Laszlo Barabasi', 'Alberto Amaduzzi', 'Alberto Antonioni', 'Aleix Bassolas', 'Alejandro Beltran', 'Alejandro Dinkelberg', 'Alejandro Hermida Carrillo', 'Aleksandra Aloric', 'Aleksandra Urman', 'Alessandra Urbinati', 'Alessandro De Gaetano', 'Alessandro Di Nallo', 'A

### Removing researcher names spelled slightly differently

In [2]:
from thefuzz import fuzz, process

def find_fuzzy_matches(list_of_names, threshold=80):
    
    fuzzy_matches = [] # List to store fuzzy matches
    seen_matches = set()  # To avoid duplicates

    for i, name in enumerate(list_of_names):
        # Find top matches in the remaining names (excluding self-matches)
        matches = process.extract(name, list_of_names[i+1:], scorer=fuzz.ratio, limit=5)

        for match, score in matches:
            if score >= threshold and (name, match) not in seen_matches and (match, name) not in seen_matches:
                fuzzy_matches.append((name, match, score))
                seen_matches.add((name, match))

    
    fuzzy_matches.sort(key=lambda x: x[2], reverse=True) # Sort matches by descending score

    return fuzzy_matches


fuzzy_matches = find_fuzzy_matches(names, threshold=85)

print(f"Found {len(fuzzy_matches)} fuzzy matches.")
print("\nFuzzy Matches:")
for name1, name2, score in fuzzy_matches:
    print(f"{name1} & {name2} (Score: {score})")

Found 59 fuzzy matches.

Fuzzy Matches:
Bedoor AlShebli & Bedoor Alshebli (Score: 100)
Diogo Pacheco & diogo pacheco (Score: 100)
Federico Barrera Lemarchand & Federico Barrera-Lemarchand (Score: 100)
Lisette Espin Noboa & Lisette Espin-Noboa (Score: 100)
Luca Verginer & luca verginer (Score: 100)
NaLette Brodnax & Nalette Brodnax (Score: 100)
Ollin Demian Langle Chimal & Ollin Demian Langle-Chimal (Score: 100)
Sonja M Schmer Galunder & Sonja M Schmer-Galunder (Score: 100)
Valeria D'Andrea & Valeria d'Andrea (Score: 100)
Woo-Sung Jung & Woo-sung Jung (Score: 100)
Alessandro Flamini & Alessandro Flammini (Score: 97)
Duncan J Watts & Duncan J. Watts (Score: 97)
Maximilan Schich & Maximilian Schich (Score: 97)
Pantelis P Analytis & Pantelis P. Analytis (Score: 97)
Anne C Kroon & Anne C. Kroon (Score: 96)
Diogo Pachecho & Diogo Pacheco (Score: 96)
Diogo Pachecho & diogo pacheco (Score: 96)
Fabio Carella & Fabio Carrella (Score: 96)
Paul C Bauer & Paul C. Bauer (Score: 96)
Scott A Hale & Sc

In [3]:
# We run the function again and remove the first name of each matched pair down to a threshold of 85; below that score we began to see mistakes, i.e. matches between different people

fuzzy_matches = find_fuzzy_matches(names, threshold=85)

print(f'Removing {len(fuzzy_matches)} duplicate names from the original name set')

fuzzy_matches = [match[0] for match in fuzzy_matches] # We pick a very simple policy: remove the first name of each matched pair to save time.
names = [name for name in names if name not in fuzzy_matches]

print(f'We now have a list of {len(names)} researchers. Their names are {names}')


Removing 59 duplicate names from the original name set
We now have a list of 1598 researchers. Their names are ['Aaron Clauset', 'Aaron J. Schwartz', 'Aaron Schein', 'Aaron Smith', 'Abbas Haidar', 'Abby Smith', 'Abdulkadir Celikkanat', 'Abdullah Almaatouq', 'Abdullah Zameek', 'Abeer ElBahrawy', 'Abigail Z Jacobs', 'Adam Dunn', 'Adam Finnemann', 'Adam Frank', 'Adam H. Russell', 'Adam Stefkovics', 'Adam Sutton', 'Aditi Dutta', 'Adriano Belisario', 'Adrienne Mendrik', 'Afra Mashhadi', 'Agnieszka Czaplicka', 'Agnieszka Falenska', 'Aguru Ishibashi', 'Ahmad Hesam', 'Ahmed Nasser Mostafa', 'Aidan Combs', 'Aidar Zinnatullin', 'Akeela Careem', 'Akhil Arora', 'Akira Hashimoto', 'Akira Matsui', 'Akira Tsurushima', 'Akrati Saxena', 'Alain Barrat', 'Alan Paul Kwan', 'Alba Motes Rodrigo', 'Albert-Laszlo Barabasi', 'Alberto Amaduzzi', 'Alberto Antonioni', 'Aleix Bassolas', 'Alejandro Beltran', 'Alejandro Dinkelberg', 'Alejandro Hermida Carrillo', 'Aleksandra Aloric', 'Aleksandra Urman', 'Alessandra U

In [7]:
with open("assignment_part_1_names.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(names))

5. How many unique researchers do you get?

In [4]:
print(f'In total, we have found {len(names)} different researchers')

In total, we have found 1598 different researchers


6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices (answer in max 150 words).


To collect as many names as possible, we scraped all three pages: the programme, the programme committee, and the tutorials. For each page we inspected the HTML to identify the tags that hold names (`<i>` tags and the summary table on the programme page, `<b>` tags on the other two). We then checked the extracted strings for artefacts such as the prefixes "Chair: " and "Keynote - ", which we stripped, and for garbled special characters such as "√", "°", "©" and "§", in which case we excluded the name.

To ensure that no name is repeated with slightly different spelling, we used the Python library `thefuzz`, which performs fuzzy string matching: it scores how similar two strings are on a scale from 0 to 100, so that similar but not identical names can be identified (https://redis.io/blog/what-is-fuzzy-matching/).
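
To give an intuition for such similarity scores, here is a small sketch that uses `difflib.SequenceMatcher` from the standard library as a stand-in. It is not the implementation used by `thefuzz`, and the helpers `normalise` and `similarity` are hypothetical names; the point is only to show how a 0 to 100 score separates spelling variants from different people.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Casefold and treat hyphens as spaces so trivial variants compare equal."""
    return " ".join(name.casefold().replace("-", " ").split())

def similarity(a, b):
    """Return a 0-100 score based on the share of matching characters."""
    return round(100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio())

pairs = [
    ("Diogo Pacheco", "diogo pacheco"),              # differs only in case
    ("Lisette Espin Noboa", "Lisette Espin-Noboa"),  # differs only in a hyphen
    ("Duncan J Watts", "Duncan J. Watts"),           # differs in one period
    ("Aaron Clauset", "Aaron Schein"),               # two different people
]
for a, b in pairs:
    print(f"{a} & {b}: {similarity(a, b)}")
```

Pairs that differ only in case, hyphenation or punctuation score at or close to 100, while two different people who merely share a first name fall well below the threshold of 85.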

Lastly, for every pair of names with a similarity score of at least 85 (out of 100), we removed the first name of the pair. Below this threshold, the matches started to include different people.
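
A possible refinement of this policy, which we did not use, is to group all names that are linked by a match into clusters and keep exactly one spelling per cluster. The sketch below assumes a standard-library similarity score (`difflib`, not `thefuzz`) and an arbitrary rule for picking the representative (prefer mixed case, then the longest spelling); both are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """0-100 similarity score, ignoring case (stand-in for a fuzzy matcher)."""
    return round(100 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio())

def deduplicate(names, threshold=85):
    """Cluster names whose score reaches the threshold and keep one per cluster."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path on the way
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)  # Merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)

    # Assumed rule: prefer a spelling with capital letters, then the longest one
    return sorted(
        max(group, key=lambda n: (n != n.lower(), len(n)))
        for group in clusters.values()
    )

sample = ["Duncan J Watts", "Duncan J. Watts", "luca verginer",
          "Luca Verginer", "Aaron Clauset"]
print(deduplicate(sample))
```

Comparing all pairs is quadratic in the number of names, so for a list of around 1,600 names this sketch would be noticeably slower than limiting the comparison to the top few matches per name.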

# `Part 2:` Ready Made vs Custom Made Data


> **Exercise 1: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood the key points of my lecture and the reading. Remember to come and ask me, if you have any question about this! 
>
> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book. 

One advantage of the custom-made data in Centola's experiment is that the researchers could design and tailor the experiment to investigate differences in behaviour under controlled conditions. A designed experiment could make participants reactive, but since the participants were recruited through an ad, we consider them nonreactive. A drawback is that behaviour may still change because the experiment is run on an online social platform, so even with nonreactive participants it is difficult to ensure that the data directly reflects people's behaviour.

In Nicolaides's study, one drawback of using ready-made data is that it's difficult to control the setting so that only the effects of a specific social behaviour are examined. On the other hand, ready-made data is abundant. Since the data is always-on, it allows for measurements over time, and the inclusion of demographic data enables the examination of various confounders, including even the weather.

> 2. How do you think these differences can influence the interpretation of the results in each study?

In the Centola experiment, since the data is custom-made, the results can be more confidently attributed to the underlying social mechanism being tested. The controlled conditions of the experiment make it possible to find a causal relationship by isolating the tested variable. However, because the experiment takes place on a website, participants might alter their behaviour, either to present a certain version of themselves or by holding back due to unfamiliarity. This could affect how the results translate to real-world social settings. 

On the other hand, Nicolaides's study, which uses ready-made data, provides real-world data on social interactions over time, which increases the generalizability and validity of the results. However, since the study is based in a real-world setting, confounding variables could influence the results. This makes it more difficult to establish a direct causal relationship. 

# `Part 3:` Gathering Research Articles using the OpenAlex API

> **Exercise 1: Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2024 conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**
>
> 
> **Steps:**
>  
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
> 
>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest to adjust this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
>
> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>  
>
> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
> 
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts). 
>
> **Efficiency Tips:**
> Writing efficient code in this exercise is **crucial**. To speed up your process:
> - **Apply filters directly in your request:** When possible, use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) of the *works* endpoint to apply the filters above directly in your API request, ensuring only relevant data is returned. Learn about combining multiple filters [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists).  
> - **Bulk requests:** Instead of sending one request for each author, you can use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) to query works by multiple authors in a single request. *Note: My testing suggests that you can only include up to 25 authors per request.*
> - **Use multiprocessing:** Implement multiprocessing to handle multiple requests simultaneously. I highly recommend [Joblib’s Parallel](https://joblib.readthedocs.io/en/stable/) function for that, and [tqdm](https://tqdm.github.io/) can help monitor progress of your jobs. Remember to stay within [the rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 10 requests per second.
>
>
>   
> For reference, employing these strategies allowed me to fetch the data in about 30 seconds using 5 cores on my laptop. I obtained a dataset of approximately 25 MB (including both the *IC2S2 abstracts* and *IC2S2 papers* files).

### Defining our dataframes, the concept IDs used for filtering, and the author dataframe from week 2

In [1]:
import pickle 
import requests
import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm
import time

#ID's for concepts: 
id_soc = 'https://openalex.org/C144024400'
id_psyc = 'https://openalex.org/C15744967'
id_econ = 'https://openalex.org/C162324750'
id_pol = "https://openalex.org/C17744445"
id_math = "https://openalex.org/C33923547"
id_physics = "https://openalex.org/C121332964"
id_cs = "https://openalex.org/C41008148"

#Loading the authors dataframe from week 2: 
with open("final_dataframe.pkl", "rb") as f: 
    data = pickle.load(f)

#filtering works_count 
data = data[( data["works_count"] < 5000) & ( data["works_count"] > 5)]

#Initializing two empty dataframes for holding the data: 
paperdata = pd.DataFrame(columns = ["id", "publication_year", "cited_by_count", "author_ids"])
abstractdata = pd.DataFrame(columns = ["id", "title", "abstract_inverted"])

### Defining our data fetch function

In [None]:
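def fetch_data(urls):
    # Each works_api_url ends in "author.id:<ID>"; collect the IDs and OR them with "|"
    author_ids = "|".join(u.split("id:")[1] for u in urls)
    social = "|".join([id_soc, id_psyc, id_econ, id_pol])
    quantitative = "|".join([id_math, id_physics, id_cs])
    url = f"https://api.openalex.org/works?filter=author.id:{author_ids}"
    # Comma-separated filters are ANDed; repeating concept.id requires a match in both groups
    full_url = f"{url},cited_by_count:>10,authors_count:<10,concept.id:{social},concept.id:{quantitative}"

    retries = 3
    while retries > 0:
        results_p, results_a = [], []  # Reset on every attempt so a retry does not duplicate rows
        try:
            response = requests.get(full_url + "&per-page=200&cursor=*").json() 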
            
            while response.get("results"):
                for res in response["results"]:
                    results_p.append({
                        "id": res["id"],
                        "publication_year": res["publication_year"],
                        "cited_by_count": res["cited_by_count"],
                        "author_ids": [aut["author"]["id"] for aut in res["authorships"]]
                    })
                    results_a.append({
                        "id": res["id"],
                        "title": res["title"],
                        "abstract_inverted": res.get("abstract_inverted_index", {})
                    })
                
                next_cursor = response["meta"].get("next_cursor")
                if not next_cursor:
                    break
                
                time.sleep(1)
                response = requests.get(full_url + f"&per-page=200&cursor={next_cursor}").json()

            return results_p, results_a
        
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            retries -= 1
            time.sleep(1)

    return [], []

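# Parallel processing: each iteration handles 100 authors, split into requests of 20 authors each
batch_size = 5  # Number of parallel jobs
for i in tqdm(range(0, len(data["works_api_url"]), 100)):
    batch_urls = data["works_api_url"][i:i+100].tolist()
    urls = [batch_urls[j:j+20] for j in range(0, len(batch_urls), 20)]  # Avoids empty chunks in the last batch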

    results = Parallel(n_jobs=batch_size)(
        delayed(fetch_data)(url) for url in urls
    )

    for p, a in results:
        if p and a:
            paperdata = pd.concat([paperdata, pd.DataFrame(p)], ignore_index=True)
            abstractdata = pd.concat([abstractdata, pd.DataFrame(a)], ignore_index=True)

    time.sleep(1)

100%|██████████| 10/10 [01:09<00:00,  6.99s/it]
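
The abstracts collected above are stored as inverted indexes, i.e. a mapping from each word to the positions where it occurs. As a minimal sketch of how such an index could be turned back into plain text, the hypothetical helper below sorts the words by position; the sample index is made up for illustration.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild the abstract text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""  # Works without an abstract have an empty or missing index
    positions = [
        (position, word)
        for word, word_positions in inverted_index.items()
        for position in word_positions
    ]
    # Sorting by position restores the original word order
    return " ".join(word for _, word in sorted(positions))

# Made-up example index, not taken from the dataset
example = {"Networks": [0], "shape": [1], "behaviour": [2, 4], "and": [3]}
print(abstract_from_inverted_index(example))
```

The helper also returns an empty string for `None`, which is what a missing abstract looks like when the field is present in the response but has no value.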


> **Data Overview and Reflection questions:** Answer the following questions: 
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works? 

In [4]:
authors = set()
for row in paperdata["author_ids"]: 
    for aut in row: 
        authors.add(aut)
print("I found this many unique papers: ", (paperdata["id"].unique().shape)[0] )
print("I found this many unique researchers to have co-authored the works: ",len(authors))

I found this many unique papers:  11666
I found this many unique researchers to have co-authored the works:  17845


> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?

The first attempt to retrieve the data ran for 30 minutes: it looped through each author individually and did not filter in the API call. In the second version, we used the library `joblib` to send requests in parallel and applied the filters directly in the API call. This brought the running time down to around 5 minutes. 
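
To illustrate what filtering directly in the API call means, the sketch below builds a single request URL for a group of authors. The function name `build_works_url` and the author IDs `A1` and `A2` are placeholders; the concept IDs are the ones defined earlier in the notebook. In the filter syntax, values joined by `|` are alternatives (OR) and filters separated by commas must all hold (AND).

```python
BASE_URL = "https://api.openalex.org/works"
OPENALEX = "https://openalex.org/"

# Level-0 concepts: sociology, psychology, economics, political science
SOCIAL = [OPENALEX + c for c in ("C144024400", "C15744967", "C162324750", "C17744445")]
# Level-0 concepts: mathematics, physics, computer science
QUANTITATIVE = [OPENALEX + c for c in ("C33923547", "C121332964", "C41008148")]

def build_works_url(author_ids, per_page=200):
    """Build one works query for several authors with all filters applied."""
    filters = [
        "author.id:" + "|".join(author_ids),      # any of these authors
        "cited_by_count:>10",                     # more than 10 citations
        "authors_count:<10",                      # fewer than 10 authors
        "concept.id:" + "|".join(SOCIAL),         # at least one social science concept
        "concept.id:" + "|".join(QUANTITATIVE),   # and at least one quantitative concept
    ]
    return f"{BASE_URL}?filter={','.join(filters)}&per-page={per_page}&cursor=*"

print(build_works_url(["A1", "A2"]))  # A1 and A2 are placeholder author IDs
```

Because the server only returns works that pass all filters, far fewer pages have to be downloaded and no filtering is needed afterwards.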

Lastly, querying 20 authors per request (below the suggested limit of 25) with 5 parallel jobs resulted in a runtime of around 1 minute on a MacBook Pro (M1 Pro).
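
The two ingredients of this final version, grouping authors into chunks and following the cursor until no pages are left, can be sketched without any network access. The helpers `chunked` and `fetch_all_pages` are hypothetical names, and `FAKE_PAGES` is a made-up stand-in for API responses that mimics the `results` and `meta.next_cursor` fields.

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_all_pages(get_page):
    """Collect results from every page; get_page(cursor) returns one response dict."""
    cursor, rows = "*", []  # "*" requests the first page
    while cursor:
        page = get_page(cursor)
        rows.extend(page.get("results", []))
        cursor = page.get("meta", {}).get("next_cursor")  # None on the last page
    return rows

# Fake responses keyed by cursor, standing in for real API calls
FAKE_PAGES = {
    "*": {"results": ["W1", "W2"], "meta": {"next_cursor": "c1"}},
    "c1": {"results": ["W3"], "meta": {"next_cursor": None}},
}

print(chunked(["A1", "A2", "A3", "A4", "A5"], 2))
print(fetch_all_pages(FAKE_PAGES.get))
```

In a real run, `get_page` would send the request for one chunk of authors with the given cursor, and each chunk could be handed to a separate parallel job.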

> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?

Applying the filters significantly affects the relevance and balance of the dataset. By bounding the total number of works per author between 5 and 5,000, we ensure that the included authors have experience in the field while preventing a few extremely prolific authors from dominating. This avoids overrepresenting highly established researchers while still including authors who contribute actively. 

Requiring more than 10 citations also helps to include relevant research papers, but it could unintentionally underrepresent emerging studies, creating a bias towards older and widely discussed topics. 

Additionally, filtering for specific fields ensures that the papers in our dataset remain relevant to computational social science, while still accounting for papers categorized under other scientific disciplines. Since our goal is to highlight computational social science, we risk underrepresenting more classical social studies that are not labelled with concepts such as Computer Science or Mathematics at level=0 but still use data science tools.

# `Part 4:` The Network of Computational Social Scientists