https://github.com/CarolineMortensen/Group_project_CSS.git

# Assignment 1

## Imports

In [2]:
import pickle
import requests
import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm
import time
from bs4 import BeautifulSoup
import re

## Part 1: Web-scraping

> **Exercise 3 : Web-scraping the list of participants to the International Conference in Computational Social Science**    
> You can find the programme of the 2023 edition of the conference at [this link](https://ic2s2-2023.org/program). As you can see the conference programme included many different contributions: keynote presentations, parallel talks, tutorials, posters. 
> 1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported in the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling.
> 2. Some instructions for success: 
>    * First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.   
>    * Use the [BeautifulSoup Python package](https://pypi.org/project/beautifulsoup4/) to navigate through the hierarchy and extract the elements you need from the page. 
>    * You can use the [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to find elements that match specific filters. Check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) of the library for detailed explanations on how to set filters.  
>    * Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas, or other unwanted characters)
>    * The overall idea is to adapt the procedure I have used [here](https://nbviewer.org/github/TheYuanLiao/comsocsci2025/blob/main/additional_notebooks/ScreenScraping.ipynb) for the specific page you are scraping.
> 3. Create the set of unique researchers that joined the conference and store it into a file.
>     * *Important:* If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible. 
> 4. *Optional:* For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at [this link](https://ic2s2-2023.org/program_committee); (ii) the organizers of tutorials, that can be found at [this link](https://ic2s2-2023.org/tutorials).
> 5. How many unique researchers do you get?
> 6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices (answer in max 150 words).


In [None]:
LINK = "https://ic2s2-2023.org/program#session_1b"
r = requests.get(LINK)
soup = BeautifulSoup(r.content)

table = soup.find("table", {"class": "tutorials"})
print(table)

<table class="tutorials" id="summary">
<!--<tr>
								<th><b style="color:#ffffff">Time</b></th>
								<th><b style="color:#ffffff"><a href="#parallelday1">July 18th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday2">July 19th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday2">July 20th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday1">July 18th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday2">July 19th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday2">July 20th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday1">July 18th (morning)</a></b></th>
								<th><b style="color:#ffffff"><a href="#parallelday2">July 19th (morning)</a></b></th>
							</tr>-->
<tr>
<th><b style="color:#ffffff"></b></th>
<th colspan="100%"><b style="color:#ffffff">Tutorial day (July 17th)</b></th>
</tr>
<tr>


In [None]:
rows = []
names_temp = []
table_names = []
table_rows = table.find_all("tr")

# Extracting the header
ths = table_rows[0].find_all("th")
headings = [th.text.replace("\n", "") for th in ths]

# Getting the rows
for tr in table_rows[1:]:
    tds = tr.find_all("td")
    row = [td.text.replace("\n", "") for td in tds]
    
    if any(re.search(r'Keynote', td.text, re.IGNORECASE) for td in tds):
        rows.append(row)

for row in rows: 
    nam = [r for r in row if "Keynote" in r]
    names_temp.append(nam[0])  

for name in names_temp:
    name = name.split("- ")[1]
    table_names.append(name)


table_names

['Jevin West',
 'Linda Steg',
 'Sharad Goel',
 'Molly Crockett',
 'Lisa Anne Hendriks',
 'Stefan Gössling',
 'Joanna Bryson',
 'Tim Althoff',
 'Lauren Brent',
 'Esteban Moro']

In [None]:
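# Session chairs and author lists are stored in <i> elements
# (named `italics` to avoid shadowing the built-in `all`)
italics = soup.find_all("i")
raw = [tag.text for tag in italics]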
raw

ind_names = []
for ra in raw: 
    if ra.startswith('Chair:'):
        chair_name = ra.split('Chair:')[1].strip()
        ind_names.append(chair_name)
    else:
        names = [name.strip() for name in ra.split(',')]
        ind_names.extend(names)

ind_names

['Claudia Wagner',
 'Jonas L Juul',
 'Jon Kleinberg',
 'Chloe Ahn',
 'Xinyi Wang',
 'Giuseppe Russo',
 'luca verginer',
 'Manoel Horta Ribeiro',
 'Giona Casiraghi',
 'Almog Simchon',
 'Adam Sutton',
 'Matthew Edwards',
 'Stephan Lewandowsky',
 'Arianna Pera',
 'Manuel Vimercati',
 'Matteo Palmonari',
 'Mohammed Alsobay',
 'Abdullah Almaatouq',
 'David G. Rand',
 'Duncan J. Watts',
 'Sara Venturini',
 'Satyaki Sikdar',
 'Francesco Rinaldi',
 'Francesco Tudisco',
 'Santo Fortunato',
 'Isabella Loaiza',
 'Takahiro Yabe',
 'Alex Pentland',
 'Taha Yasseri',
 'Silvia De Sojo Caso',
 'Mia Ann Jørgensen',
 'Sune Lehmann',
 'Laura Alessandretti',
 'Emil Bakkensen Johansen',
 'Mathias Wullum Nielsen',
 'Rubén Rodríguez Casañ',
 'Antonio Ariño Villarroya',
 'Sunny Rai',
 'Ashley Francisco',
 'Salvatore Giorgi',
 'Brenda Curtis',
 'Lyle Ungar',
 'Sharath Chandra Guntuku',
 'Allison Koenecke',
 'Eric Giannella',
 'Robb Willer',
 'Sharad Goel',
 'Ziv Epstein',
 'Hause Lin',
 'Levin Brinkmann',
 'Bra

In [None]:
# program committee
link = "https://ic2s2-2023.org/program_committee"
r = requests.get(link)
soup = BeautifulSoup(r.content)

In [None]:
# tutorials
link = "https://ic2s2-2023.org/tutorials"
r = requests.get(link)
soup = BeautifulSoup(r.content)

In [None]:
names = []
for li in soup.find_all('li'):
    b_tag = li.find('b')
    if b_tag:
        a_tag = b_tag.find('a')
        if a_tag and 'href' in a_tag.attrs:
            name = a_tag.get_text(strip=True)
            names.append(name)

for name in names:
    print(name)

Étienne Ollion
Rubing Shen
Yelena Mejova
Kyriaki Kalimeri
Giovanni Da San Martino
Oscar Araque
Michael Szell
Ane Rahbek Vierø
Anastassia Vybornova
Mohammed Alsobay
James Houghton
James Evans
Bhargav Srinivasa Desikan
August Lohse
Simon P. von der Maase
Sanja Šćepanović
Ingmar Weber
Philipp Lorenz-Spreen
Julian Jaursch


In [None]:
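# Union of all collected names. Note that `raw` holds the unsplit <i> strings
# (whole author lists and "Chair: ..." entries), so these are counted here too.
IC2023 = set(table_names) | set(ind_names) | set(raw) | set(names)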
len(IC2023)

2072

In [None]:
path = '/Users/carolinemortensen/Desktop/KID/6. Semester/CSS/IC2S2_2024_program_overview.xlsx - Sheet1.csv'
program = pd.read_csv(path)
program.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Irvine Auditorium Main Hall (A),Houston: Bodek Lounge,Houston: Hall of Flags (K),Houston: Ben Franklin Room (B),Irvine: Cafe 58 (C),Irvine: Amado Recital Hall (D),Houston: Golkin (E),Irvine: G7 (F),Irvine: G16 (G),Irvine: Green Room (H),Williams Hall G01 (I),Williams Hall 723 (J),Penn Museum
0,"Wednesday, July 17, 2024",9:00 AM - 12:30 PM,,,Tutorial: New Approaches and Data Sources to S...,Tutorial: Making Models We Can Understand: An ...,Tutorial: Collecting Digital Trace Data Throug...,Tutorial: Exploring Emerging Social Media: Acq...,,,,,,,
1,,12:30 PM - 1:30 PM,,Lunch,,,,,,,,,,,
2,,1:00 PM - 1:45 PM,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...,Dirk Helbing: Societies are complex systems. S...
3,,1:30 PM - 5:00 PM,,,Tutorial: Training Computational Social Scienc...,Tutorial: Using LLMs for Computational Social ...,,Tutorial: Thinking With Deep Learning: An Expo...,Tutorial: Active Agents: An Active Inference A...,,,,,,
4,,5:00 PM - 6:30 PM,,Welcome mixer,,,,,,,,,,,


In [None]:
irv = program['Irvine Auditorium Main Hall (A)']
irv

0                                                   NaN
1                                                   NaN
2     Dirk Helbing: Societies are complex systems. S...
3                                                   NaN
4                                                   NaN
5                                                   NaN
6                      Breakfast (Main Hall Box Office)
7                                       Lightning talks
8     Meta panel: Talia Stroud, University of Texas;...
9                 02. Algorithmics & Public Opinion: 1A
10                                                  NaN
11                                                  NaN
12    Publishing with Nature Portfolio journals – an...
13    Talal Rahwan: A Global Perspective of Scientis...
14    Amy Orben: Screen Savers: Protecting Adolescen...
15                                                  NaN
16                     Breakfast (Main Hall Box Office)
17                                      Lightnin

In [None]:
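irvine_auditorium = []
# Drop NaN cells and make sure every entry is a string
irv = program['Irvine Auditorium Main Hall (A)'].dropna().astype(str)

for i in range(len(irv)):
    # Keep only entries that mention "keynote"; the speaker name precedes the first colon
    if "keynote" in irv.iloc[i]:
        name = irv.iloc[i].split(':')
        irvine_auditorium.append(name[0])

# The Meta panel entry also matches; its speakers are extracted in the next cell
irvine_auditorium.remove('Meta panel')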

In [None]:
# Row 8 holds the Meta panel entry
text = irv[8]
# Keep only the text after the last colon (the list of panelists)
text = text.split(":")[-1]

# Split on commas and semicolons and normalize whitespace within each item
names = [" ".join(item.strip().split()) for item in re.split(r'[;,]', text)]
# Drop affiliations, which are not names
names.remove('University of Texas')
names.remove('Meta')
names.remove('European Union Institute (keynote)')
irvine_auditorium.extend(names)

In [None]:
tutorial = program.iloc[0].dropna().astype(str)
tut_temp = []

for i in range(len(tutorial)):
    if "Tutorial" in tutorial.iloc[i]:
        splits = tutorial.iloc[i].split(';')
        tut_temp.append(splits[1])

cleaned_names = [re.sub(r'\s*\(.*?\)', '', item).strip().split(', ') for item in tut_temp]

# Flatten the list
names_list = [name.strip() for sublist in cleaned_names for name in sublist]
names_list

['Sebastian Stier',
 'Philipp Lorenz-Spreen',
 'Lisa Oswald',
 'David Lazer',
 'Chudi Zhong',
 'Alina Jade Barnett',
 'Harsh Parikh',
 'Laura Boeschoten',
 'Niek de Schipper',
 'Filipi Nascimento Silva',
 'Kaicheng Yang',
 'Bao Tran Truong',
 'Wanying Zhao']

In [None]:
irvine_auditorium.extend(names_list)
irvine_auditorium

['Dirk Helbing',
 'Talal Rahwan',
 'Amy Orben',
 'Dan Jurafsky',
 'Elizabeth Bruch',
 'Divya Sharma',
 'Matthew Gentzkow',
 'Desmond Patton',
 'Talia Stroud',
 'Chad P. Kiewiet De Jonge',
 'Kevin Munger',
 'Sebastian Stier',
 'Philipp Lorenz-Spreen',
 'Lisa Oswald',
 'David Lazer',
 'Chudi Zhong',
 'Alina Jade Barnett',
 'Harsh Parikh',
 'Laura Boeschoten',
 'Niek de Schipper',
 'Filipi Nascimento Silva',
 'Kaicheng Yang',
 'Bao Tran Truong',
 'Wanying Zhao']

In [None]:
posters = pd.read_csv('/Users/carolinemortensen/Desktop/KID/6. Semester/CSS/IC2S2_2024_posters.xlsx - Sheet1.csv')
posters.head()

Unnamed: 0,Date,Poster title,Poster authors,Easel assignment
0,18-Jul,Big Tech dominance despite global mistrust,"Hazem Ibrahim, New York University; Talal Rahw...",1
1,18-Jul,Estimating Police Anti-Black Bias in Use of Fo...,"Vikram Balasubramanian, University of Dublin, ...",7
2,18-Jul,What Contributions are Favoured in the Communi...,"Guodong Ju, London School of Economics and Pol...",13
3,18-Jul,Single Board Computers: A Case Study in Using ...,"Stuart Duncan, Toronto Metropolitan University",14
4,18-Jul,Can likes returned from peers within a day imp...,"Kenji Yokotani, Tokushima University; Masanori...",47


In [None]:
authors = posters.loc[:, 'Poster authors'].dropna().astype(str)
aut_names = []

for i in range(len(authors)):
    names_a = authors.iloc[i].split(';')
    aut_names.append(names_a)
    
author_names = [entry.split(',')[0].strip() for sublist in aut_names for entry in sublist]
author_names

['Hazem Ibrahim',
 'Talal Rahwan',
 'Yasir Zaki',
 'Vikram Balasubramanian',
 'Guodong Ju',
 'Stuart Duncan',
 'Kenji Yokotani',
 'Masanori Takano',
 'Nobuhito Abe',
 'Tenzin Tamang',
 'Arianna Pera',
 'Luca Maria Aiello',
 'Arianna Pera',
 'Luca Maria Aiello',
 'Lynnette Hui Xian Ng',
 'Jennifer M Krebsbach',
 'Ting Luo',
 'Yan Wang',
 'Qian Shen',
 'Xiaomeng Xiong',
 'Xueyan Gao',
 'Masahiro Kuwahara',
 'Yukihisa Fujita',
 'CAI Tianji',
 'Alex Pentland',
 'Alex Rutherford',
 'Esteban Moro',
 'Iyad Rahwan',
 'Morgan Ryan Frank',
 'Tobin South',
 'Marios Papachristou',
 'Yuan Yuan',
 'Cameron Lai',
 'Fujio Toriumi',
 'Haruhiko Tashiro',
 'Cameron Lai',
 'Fujio Toriumi',
 'Mitsuo Yoshida',
 'Koji Suzuki',
 'Takehiko Murai',
 'Toshio Murase',
 'Lukas Erhard',
 'Yutong Si',
 'Jen-Hao Chen',
 'Lu Hsiu-Chi',
 'Chloe Ahn',
 'Drew Dimmery',
 'Kevin Munger',
 'Linda Vecgaile',
 'Zhuangyuan Fan',
 'Tobias Kamelski',
 'Benjamin C. Kennedy',
 'John Diaz',
 'Lily Hofstetter',
 'William P Hogan',
 

In [None]:
panels = pd.read_csv('/Users/carolinemortensen/Desktop/KID/6. Semester/CSS/IC2S2_2024_oral_panels.xlsx - Sheet1.csv')
panels.head()

Unnamed: 0,Session,Date,Time,Location,Session track,Presentation title,Presentation authors
0,1A,18-Jul,11:00 AM - 12:30 PM,Irvine Auditorium Main Hall,02. Algorithmics & Public Opinion,Youtube’s recommendation algorithm is left-lea...,"Hazem Ibrahim, New York University; Nouar AlDa..."
1,1A,18-Jul,11:00 AM - 12:30 PM,Irvine Auditorium Main Hall,02. Algorithmics & Public Opinion,"Lower Quantity, Higher Quality: Perceptions an...","Alvin Zhou, University of Minnesota - Twin Cit..."
2,1A,18-Jul,11:00 AM - 12:30 PM,Irvine Auditorium Main Hall,02. Algorithmics & Public Opinion,Filter Bubble or Homogenization? Disentangling...,"Grant Schoenebeck, University of Michigan; Md ..."
3,1A,18-Jul,11:00 AM - 12:30 PM,Irvine Auditorium Main Hall,02. Algorithmics & Public Opinion,Confrontational Consumption: How Hostility Pro...,"Seonhye Noh, University of California, Los Ang..."
4,1A,18-Jul,11:00 AM - 12:30 PM,Irvine Auditorium Main Hall,02. Algorithmics & Public Opinion,Dynamics of Digital Discourse: A Field Experim...,"Lisa Oswald, Max-Planck Institute; Philipp Lor..."


In [None]:
presentations = panels.loc[:, 'Presentation authors'].dropna().astype(str)
pres_names = []

for i in range(len(presentations)):
    names_pres = presentations.iloc[i].split(';')
    pres_names.append(names_pres)
    
presenters = [entry.split(',')[0].strip() for sublist in pres_names for entry in sublist]
presenters

['Hazem Ibrahim',
 'Nouar AlDahoul',
 'Talal Rahwan',
 'Yasir Zaki',
 'Alvin Zhou',
 'Danaë Metaxa',
 'Shengchun Huang',
 'Stephanie Wang',
 'Grant Schoenebeck',
 'Md Sanzeed Anwar',
 'Paramveer Dhillon',
 'Seonhye Noh',
 'Stuart Soroka',
 'Lisa Oswald',
 'Philipp Lorenz-Spreen',
 'Alexander Wenz',
 'Anna-Carolina Haensch',
 'Leah von der Heyde',
 'Hans William Alexander Hanley',
 'Jennifer Pan',
 'Yingdan Lu',
 'Chloe Ahn',
 'Junghyun Lim',
 'Ashton Anderson',
 'George Eilender',
 'Joshua Introne',
 'Una Joh',
 'Ceren Budak',
 'Julia Mendelsohn',
 'Patrick Gordon Wall',
 'Asael H Sorensen',
 'Asmeret Naugle',
 'Casey Doyle',
 'Dan J. Krofcheck',
 'Matt Sweitzer',
 'Christian Zingg',
 'Christoph Benedikt Gote',
 'Frank Schweitzer',
 'Giona Casiraghi',
 'Giuseppe Russo',
 'Luca Verginer',
 'Arnab Kumar Sarker',
 'Patrick Park',
 'Bruno Lepri',
 'Eduardo López',
 'Robin Dunbar',
 'Sam G B Roberts',
 'Simone Centellegher',
 'Valentin Vergara Hidd',
 'Balaraju BATTU',
 'Talal Rahwan',
 'Al

In [None]:
talks = pd.read_csv('/Users/carolinemortensen/Desktop/KID/6. Semester/CSS/IC2S2_2024_lightning_talks.xlsx - Sheet1.csv')
talks.head()

Unnamed: 0,Date,Time,Location,Presentation title,Presentation authors
0,18-Jul,9:00 AM - 9:03 AM,Irvine Auditorium Main Hall,Measuring Implicit Bias in Explicitly Unbiased...,"Xuechunzi Bai, The University of Chicago; Ange..."
1,18-Jul,9:03 AM - 9:06 AM,Irvine Auditorium Main Hall,Platform-Driven Collaboration Patterns: Struct...,"Babak Heydari, Northeastern University; Negin ..."
2,18-Jul,9:06 AM - 9:09 AM,Irvine Auditorium Main Hall,Neural embedding of beliefs reveals the role o...,"Byunghwee Lee, Indiana University; Haewoon Kwa..."
3,18-Jul,9:09 AM - 9:12 AM,Irvine Auditorium Main Hall,Why do people think liberals drink lattes? How...,"Kathleen M. Carley, Carnegie Mellon University..."
4,18-Jul,9:12 AM - 9:15 AM,Irvine Auditorium Main Hall,The Arrival of Fast Internet and Infant Mortality,"Yifan Wang, University of Illinois at Urbana-C..."


In [None]:
presentations_talks = talks.loc[:, 'Presentation authors'].dropna().astype(str)
pres_t_names = []

for i in range(len(presentations_talks)):
    names_pres_t = presentations_talks.iloc[i].split(';')
    pres_t_names.append(names_pres_t)
    
talkers = [entry.split(',')[0].strip() for sublist in pres_t_names for entry in sublist]
talkers

['Xuechunzi Bai',
 'Angelina Wang',
 'Ilia Sucholutsky',
 'Tom Griffiths',
 'Babak Heydari',
 'Negin Maddah',
 'Byunghwee Lee',
 'Haewoon Kwak',
 'Jisun An',
 'Rachith Aiyappa',
 'Yong-Yeol Ahn',
 'Kathleen M. Carley',
 'Kenneth Joseph',
 'Samantha C Phillips',
 'Yifan Wang',
 'Esteban Moro',
 'Guangyuan Weng',
 'Minsuk Kim',
 'Yong-Yeol Ahn',
 'Ankita Gupta',
 'Ethan Zuckerman',
 "Brendan O'Connor",
 'Diego Gomez-Zara',
 'Nandini Banerjee',
 'Daniel Romero',
 'David Gamba',
 'Grant Schoenebeck',
 'Yuan Yuan',
 'Yulin Yu',
 'Andreas Bjerre-Nielsen',
 'Jacob Aarup Dalsgaard',
 'Roberta Sinatra',
 'Giuseppe Russo',
 'Maciej Styczen',
 'Manoel Horta Ribeiro',
 'Robert West',
 'Andrea Musso',
 'Elisabeth Stockinger',
 'Laura Maria Alessandretti',
 'Fred Morstatter',
 'Zhuoyu Shi',
 'Filippo Menczer',
 'Kai-Cheng Yang',
 'Joshua Grossman',
 'Julian Nyarko',
 'Sharad Goel',
 'Alicja Chaszczewicz',
 'Emma Pierson',
 'Emma Wang',
 'Jure Leskovec',
 'Maya Josifovska',
 'Serina Chang',
 'Joshua 

In [None]:
IC2024 = irvine_auditorium + author_names + presenters + talkers
uniques2024 = []

for name in IC2024:
    if name not in uniques2024:
        uniques2024.append(name)

len(uniques2024)

1244

In [None]:
def extract_names(author_column):
    """Split 'Name, Affiliation; Name, Affiliation' strings and return the names."""
    names = []
    for entry in author_column.tolist():
        for person in entry.split(";"):
            # The name precedes the first ", "; strip leading spaces left by the split
            names.append(person.split(", ")[0].lstrip(" "))
    return names


names1 = extract_names(talks["Presentation authors"])
names2 = extract_names(panels["Presentation authors"])
names3 = extract_names(posters["Poster authors"])

total_names = set(names1 + names2 + names3)
print(len(total_names))

# Store the set of unique names (pickle is imported at the top of the notebook)
with open("Total_names2024.pkl", "wb") as f:
    pickle.dump(total_names, f)


1232


5. How many unique researchers do you get?

By scraping the different sources, a total of 1232 unique researchers were found (see the code above).

6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices __(answer in max 150 words)__.

To scrape the website, the *requests* package retrieves the HTML and *BeautifulSoup* parses it. First, the relevant *table* element is located by its class attribute, and keynote speakers are extracted from rows whose text matches "Keynote". The remaining names are extracted from *i* elements, which hold session chairs and author lists, and are split into individual names. Names from the tutorials and program committee pages are retrieved as well.

To ensure complete and accurate names, they were cross-referenced with names found in the conference schedules stored in Excel files, including keynotes, panels, posters, and lightning talks. The data is cleaned by removing institutional names, handling missing values, and keeping only unique entries.

The quality of the retrieved data is checked by looking for duplicates and removing irrelevant text. Combining multiple sources and verifying them against structured datasets yields a more complete and accurate list.
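
The duplicate check described above can be made robust to differences in casing, accents, and spacing, such as 'luca verginer' versus 'Luca Verginer' in the lists above. The following is a minimal sketch and not the procedure used in this notebook: `name_key` and `group_variants` are hypothetical helpers, and ignoring case, accents, and punctuation is an assumption that could merge distinct people in rare cases.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(name):
    """Return a comparison key: accents removed, punctuation dropped, lowercase, single spaces."""
    # Decompose accented characters and drop the combining marks (assumption: accents are ignored)
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation such as periods and hyphens with a space; letters like "ø" are kept
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    return " ".join(no_punct.lower().split())


def group_variants(names):
    """Map each key to the set of spellings that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return dict(groups)


# Sample names copied from the outputs above
sample = ["luca verginer", "Luca Verginer", "Eduardo López", "Kaicheng Yang", "Kai-Cheng Yang"]
groups = group_variants(sample)
# Keep only the keys that have more than one spelling
variants = {key: spellings for key, spellings in groups.items() if len(spellings) > 1}
print(variants)
```

Keys with more than one spelling are candidates for manual review. A key of this kind does not catch variants such as 'Kaicheng Yang' and 'Kai-Cheng Yang', or names with and without a middle name, so a fuzzy comparison or manual inspection would still be needed for those.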

## Part 2: Ready Made vs Custom Made Data

1. What are pros and cons of the custom-made data used in Centola's experiment and the ready-made data used in Nicolaides's study?

Custom-made data is collected purposely for a specific research project, e.g. through surveys, where people's internal states can be accessed directly in a controlled sample. However, respondents may not answer truthfully. Custom-made data also allows causal relationships to be investigated through experiments in which variables are controlled to avoid confounding and bias. On the other hand, such experiments can be expensive, and their results can be difficult to generalize because of the artificial environment.

Ready-made data is typically collected for other purposes but can be used to answer social science questions. An advantage is that ready-made data is often large-scale and gathered over long periods in real-world settings, which allows behavior to be studied over time without people being aware of it. However, the data can be incomplete, biased (only the actions of certain individuals are captured), and influenced by unobserved variables. Additionally, accessing real-world data can be challenging because of privacy concerns and access restrictions, and the data often requires cleaning because of errors or missing values.

2. How do you think these differences can influence the interpretation of the results in each study?

## Part 3: Gathering Research Articles using the OpenAlex API

In [2]:
# Load dataset
with open("datav2.pkl", "rb") as f:
    df = pickle.load(f)

df = df[(df['works_count'] > 5) & (df['works_count'] < 5000)]  # Filtering

#Initialize DataFrames
papers = pd.DataFrame(columns=['id', 'publication_year', 'cited_by_count', 'author_ids'])
abstracts = pd.DataFrame(columns=['id', 'title', 'abstract_inverted_index'])

# Define concept IDs
concept_ids = [
    "C144024400",  # Sociology
    "C15744967",   # Psychology
    "C162324750",  # Economics
    "C17744445",   # Political Science
    "C33923547",   # Mathematics
    "C121332964",  # Physics
    "C41008148",   # Computer Science
]

paperdata = []
abstractdata = []

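def get_data(i):
    # Extract the author ID that follows "id:" in each works_api_url
    ids = [aut.split("id:")[1] for aut in i]
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    BASE_URL = (
        f"https://api.openalex.org/works?filter=author.id:({'|'.join(ids)}),cited_by_count:>10,"
        f"authors_count:<10,concept.id:({'|'.join(concept_ids[:4])}),concept.id:({'|'.join(concept_ids[4:])})"
    )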

    retries = 0
    papers = []
    abstracts = []

    while retries < 3:
        try:
            response = requests.get(BASE_URL + "&per-page=200&cursor=*").json()

            while response.get("results"):
                for result in response["results"]:
                    papers.append({
                        "id": result.get("id"),
                        "publication_year": result.get("publication_year"),
                        "cited_by_count": result.get("cited_by_count"),
                        "author_ids": [
                            auth["author"]["id"]
                            for auth in result.get("authorships", [])
                            if "author" in auth and "id" in auth["author"]
                        ],
                    })
                    abstracts.append({
                        "id": result.get("id"),
                        "title": result.get("title"),
                        "abstract_inverted_index": result.get("abstract_inverted_index"),
                    })

                next_cursor = response.get("meta", {}).get("next_cursor")
                if not next_cursor:
                    break

                time.sleep(1) 
                response = requests.get(BASE_URL + f"&per-page=200&cursor={next_cursor}").json()

            return papers, abstracts

        except Exception as e:
            print(f"Error fetching works for author IDs {ids}: {e}")
            retries += 1
            time.sleep(1)

    return [], []

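# Parallel processing
num_batch = 5         # number of parallel workers
batch_size = 100      # authors handled per outer iteration
sub_batch_size = 25   # authors per API request

for start in tqdm(range(0, len(df["works_api_url"]), batch_size)):
    batch_indexes = df["works_api_url"][start:start + batch_size].tolist()
    # Split into sub-batches; bounding by len(batch_indexes) avoids empty requests in the last batch
    batches = [batch_indexes[j:j + sub_batch_size] for j in range(0, len(batch_indexes), sub_batch_size)]
    # Retrieve data in parallel
    results = Parallel(n_jobs=num_batch)(
        delayed(get_data)(batch) for batch in batches
    )

    # Collect results (`abstr` avoids shadowing the built-in `abs`)
    for pap, abstr in results:
        if pap and abstr:
            paperdata.extend(pap)
            abstractdata.extend(abstr)

    time.sleep(2)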

#Convert collected lists into DataFrames
paperdata_df = pd.DataFrame(paperdata)
abstractdata_df = pd.DataFrame(abstractdata)

# Save results
paperdata_df.to_csv("papers.csv", index=False)
abstractdata_df.to_csv("abstracts.csv", index=False)


100%|██████████| 11/11 [01:36<00:00,  8.74s/it]


In [7]:
#How many works are listed? (id in this dataframe is an id of a work)
len(paperdata_df['id'])

13362

In [19]:
# How many unique researchers have co-authored these works?
paperdata_df.explode('author_ids')['author_ids'].nunique()

18004

> - **Dataset summary.**

There is a total of 13362 works listed in the *IC2S2 papers* dataframe with 18004 unique researchers who have co-authored these works.
> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__

When retrieving data from the API, some requests failed or stalled, which resulted in a very long run-time. To address this, we wrapped each request in a try-except block that makes up to three attempts before moving on to the next request. We also sent several requests at the same time using joblib's *Parallel* (with *tqdm* to track progress), which significantly reduced the run-time. Lastly, applying filters directly in the API request ensured that only relevant data was downloaded. Combined, these strategies reduced the execution time from close to an hour to about a minute or two.
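
The retry-and-parallelize pattern described above can be sketched without any network access. This is an illustration under stated assumptions, not the notebook's code: `flaky_fetch` is a hypothetical stand-in for an API request, `with_retries` is a hypothetical helper, and the standard library's `ThreadPoolExecutor` is swapped in for joblib's `Parallel`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

attempts_seen = {}  # batch id -> number of attempts made (for demonstration only)


def flaky_fetch(batch_id):
    """Hypothetical stand-in for an API request: odd batches fail on their first attempt."""
    attempts_seen[batch_id] = attempts_seen.get(batch_id, 0) + 1
    if batch_id % 2 == 1 and attempts_seen[batch_id] == 1:
        raise ConnectionError(f"simulated failure for batch {batch_id}")
    return [f"work-{batch_id}-{k}" for k in range(2)]


def with_retries(func, arg, max_attempts=3, delay=0.01):
    """Call func(arg), retrying on any exception; return [] if every attempt fails."""
    for _ in range(max_attempts):
        try:
            return func(arg)
        except Exception:
            time.sleep(delay)  # brief pause before the next attempt
    return []


# Run the batches concurrently; map() returns results in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda b: with_retries(flaky_fetch, b), range(4)))

# Flatten the per-batch lists into one list of works
all_works = [work for batch in results for work in batch]
print(len(all_works))
```

Because the requests spend most of their time waiting on the network, running them concurrently shortens the total run-time roughly in proportion to the number of workers, as long as the API's rate limits are respected.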

> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

Using specific thresholds ensures that the fetched data is both relevant and of high quality, and it makes data collection more manageable, as fewer works need to be processed. The threshold on an author's total number of works prevents very prolific authors from being overrepresented, but it also excludes inactive or emerging authors. Similarly, the citation threshold highlights influential and relevant studies, but may overlook newer works that have yet to gain recognition. Limiting the number of authors per work favors small collaborations over large ones, potentially filtering out large interdisciplinary research. Field-based filtering broadens the scope to include relevant interdisciplinary studies but may still underrepresent qualitative approaches. Applying filters enhances the relevance of the dataset, but may also introduce bias.
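
To make the thresholds discussed above explicit, the filter string can be assembled from named parameters. The sketch below mirrors the filter format used in `get_data`; `build_filter` and its parameter names are hypothetical, the author IDs in the usage example are placeholders, and the concept IDs are copied from `concept_ids` above.

```python
def build_filter(author_ids, min_citations, max_authors, social_concepts, quantitative_concepts):
    """Assemble a filter string in the same format as the one built in get_data."""
    parts = [
        f"author.id:({'|'.join(author_ids)})",
        f"cited_by_count:>{min_citations}",
        f"authors_count:<{max_authors}",
        # Comma-separated filters are combined with AND, so a work needs
        # at least one concept from each of the two groups
        f"concept.id:({'|'.join(social_concepts)})",
        f"concept.id:({'|'.join(quantitative_concepts)})",
    ]
    return ",".join(parts)


example = build_filter(
    author_ids=["A1", "A2"],  # placeholder IDs, not real authors
    min_citations=10,
    max_authors=10,
    social_concepts=["C144024400", "C15744967"],  # Sociology, Psychology
    quantitative_concepts=["C41008148"],          # Computer Science
)
print(example)
```

Keeping the thresholds as arguments makes it easy to rerun the collection with different values and to compare how each choice changes the size and composition of the dataset.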

## Part 4: The Network of Computational Social Scientists