### GitHub repository
Link to repository used to colaborate on the assignment:
https://github.com/KarolineKlan/Assignments_ComSocSci2024.git

### Contribution statement

Team members:

- Jacob (s214596)
- Kristoffer (s214609)
- Karoline (s214638)

All members collaborated and contributed to every part of the assignment.


# Assignment 1
This assignment was formed using Web-scraping tools from the program of the International Conference in Computational Social Science 2023  https://ic2s2-2023.org/program, and acessing data of Authors and Research Articles using the OpenAlex API https://docs.openalex.org/.

In [58]:
#Import relevant libraries
from bs4 import BeautifulSoup 
import requests
import pandas as pd
from tqdm import tqdm
from Levenshtein import distance
import numpy as np
import ast
import networkx as nx
from joblib import Parallel, delayed
import os 

## Part 1 - Webscraping
In the following task we use web-scraping tools to get the list of participants in the International Conference in Computational Social Science (CSC) 2023

In [59]:
# define link to scrape, and beautifulsoup object
LINK = "https://ic2s2-2023.org/program"
LINK_OPTIONAL1 = "https://ic2s2-2023.org/program_committee"
r_OPTIONAL1 = requests.get(LINK_OPTIONAL1)
soup_OPTIONAL1 = BeautifulSoup(r_OPTIONAL1.content)
LINK_OPTIONAL2 = "https://ic2s2-2023.org/tutorials"
r_OPTIONAL2 = requests.get(LINK_OPTIONAL2)
soup_OPTIONAL2 = BeautifulSoup(r_OPTIONAL2.content)
r = requests.get(LINK)
soup = BeautifulSoup(r.content)

# Find all relevant places in the HTML code where names are stored
speaker = soup.findAll("ul", {"class" : "nav_list"})
chair = soup.findAll("h2")
table = soup.find("table", {"class" : "tutorials"})
table = table.find_all("td")
main = soup_OPTIONAL1.find("section", {"id" : "main"})
names_members = main.findAll("li")
names_teachers = soup_OPTIONAL2.findAll("div", {"class" : "col-5 col-12-medium"})

# Loop through the HTML code and extract names
keynote_names = [table[k].text.lower().split("- ")[1] for k in range(len(table)) if "Keynote" in table[k].text]
chair_names = [chair[k].text.lower().split(": ")[2] for k in range(len(chair)) if "Chair" in chair[k].text]
speaker_names = [speaker[k].find_all("i")[j].text.lower().split(", ")  for k in range(len(speaker)) for j in range(len(speaker[k].find_all("i")))]
speaker_names = sum(speaker_names, [])
names_members_lst = [names_members[i].find("b").text.lower() for i in range(len(names_members))]
names_teachers = [names_teachers[i].findAll("li")[k].find("b").text.lower() for i in range(len(names_teachers)) for k in range(len(names_teachers[i].findAll("li")))]



# Print results for each category
print(f"Number of unique speakers:  {len(set(speaker_names))}")
print(f"Number of unique keynote speakers:  {len(set(keynote_names))}")
print(f"Number of unique chairs:  {len(set(chair_names))}")
print(f"Number of unique members from optional link1:  {len(set(names_members_lst))}")
print(f"Number of unique teachers from optional link2:  {len(set(names_teachers))}")

# Add all names together to find total unique names
total_names = speaker_names + keynote_names + chair_names + names_members_lst + names_teachers
df = pd.DataFrame(total_names, columns = ["Name"])
df["Name"] = df["Name"].str.replace(".", "")
uniq_names = pd.DataFrame(set(df["Name"]), columns=["Name"])
#uniq_names = uniq_names.sort_values('Name', ascending=True)
print(f"Total number of unique speakers:  {len((uniq_names))}")

pd.DataFrame(uniq_names).to_csv("data/authors_part1.csv", index=False)

Number of unique speakers:  1472
Number of unique keynote speakers:  10
Number of unique chairs:  49
Number of unique members from optional link1:  333
Number of unique teachers from optional link2:  19
Total number of unique speakers:  1645



**The process of the web-scraping:** 

In the process of web-scraping the website and collect the specific names of all the researchers, a thourough investigation of the HTML setup was initiated in order to understand the hierarchical and nested structure of the page. The main structure of the page was after inspection divided into 3 main parts, where different approaches were utilized in order to access the data from different structures:
1. collect the names og the key-note speakers from the overview table structure
2. collect the names for the chair speakers in the "h2" sections
3. collect names from the text sections in the "ul" sections 

## Part 2 - Ready Made vs Custom Made Data

1. **Custom-made data**, as used in Centola's experiment, offers the advantage of experimental control and manipulation, allowing the researchers to precisely design conditions and variables and data is not incomplete, since all the needed data is collected. However, it can be time-consuming and costly to collect, and it may lack real-world validity due to potential response biases when trying to record behaviors with digital systems that are highly engineered to induce specific behaviors. 
**Ready-made data**, like used in Nicolaides's study, offers readily available information as it is always-on, and then enables the study of unexpected events and real-time measurements. It reflects real-world behavior but may lack experimental control and relevance to specific research questions, as samples can be incomplete or biased. It also poses challenges in identifying and controlling confounding variables and can be dirty and difficult to access.

2. To be made....


## Part 3 - Gathering Research Articles using the OpenAlex API

In this part of the assignment API endpoints from the OpenAlex API are used to collect Research Articles from IC2S2 Authors.

In [60]:
BASE_URL = 'https://api.openalex.org/'
RESOURCE = 'works'
COMPLETE_URL = BASE_URL + RESOURCE

df = pd.read_csv('data/authors_part3.csv', index_col=[0])

IC2S2_author = df.loc[df['works_count'] > 5]
IC2S2_authors = IC2S2_author.loc[IC2S2_author['works_count'] < 5000]

IC2S2_authors = IC2S2_authors.reset_index()

lst = [str(IC2S2_authors['id'][i]) + '|' for i in range(len(IC2S2_authors['id']))]
n = 25

chunked_lists = [lst[i:i + n] for i in range(0, len(lst), n)]

# Filters for the API
Filter1 = "cited_by_count:>10"
Filter2 = "authors_count:<10"

all_concepts = requests.get(BASE_URL+"concepts", params={"filter":"level:0"}).json()

social_concepts= [i["id"] for i in all_concepts["results"] if i['display_name'] in (["Sociology","Psychology","Economics","Political Science"])]
math_concepts= [i["id"] for i in all_concepts["results"] if i['display_name'] in (["Mathematics","Physics","Computer Science"])]

Filter3="concepts.id:"+"|".join(social_concepts)
Filter4="concepts.id:"+"|".join(math_concepts)

In [61]:
IC2S2_papers = {}
IC2S2_abstracts = {}

def get_papers_and_abstracts(chunked_list):
    ids = ''.join(chunked_list)
    ids = ids[:-1]
    cursorState = '*'
    while cursorState:
        PARAMS = {"per_page":200,
                "filter" :f'{",".join([Filter1,Filter2,Filter3,Filter4])},authorships.author.id:{ids}',
                "cursor": f'{cursorState}'}
            
        response = requests.get(COMPLETE_URL, params=PARAMS).json()
        if response['meta']['next_cursor']:
            cursorState = str(response['meta']['next_cursor'])
        else: cursorState = None

        for i in range(len(response['results'])):
            id = response['results'][i]['id']
            pub_year = response['results'][i]['publication_year']
            cited = response['results'][i]['cited_by_count']
            authors = [response['results'][i]['authorships'][l]['author']['id'] for l in range(len(response['results'][i]['authorships']))]
            title = response['results'][i]['title']
            abstract = response['results'][i]['abstract_inverted_index']
            IC2S2_papers[str(id)] = {'id':id, 'publication_year':pub_year,'cited_by_count':cited, 'author_ids':authors}
            IC2S2_abstracts[str(id)] = {'id':id, 'title':title, 'abstract_inverted_index':abstract}
    
    return IC2S2_papers, IC2S2_abstracts

results = Parallel(n_jobs=2)(delayed(get_papers_and_abstracts)(chunked_list) for chunked_list in tqdm(chunked_lists))


papers_dict = {k: v for result in results for k, v in result[0].items()}
abstracts_dict = {k: v for result in results for k, v in result[1].items()}

Papers = pd.DataFrame(papers_dict).transpose()
Abstracts = pd.DataFrame(abstracts_dict).transpose()


  0%|          | 0/42 [00:00<?, ?it/s]

100%|██████████| 42/42 [01:04<00:00,  1.53s/it]


In [62]:
BASE_URL = 'https://api.openalex.org/'
RESOURCE = 'authors'
COMPLETE_URL = BASE_URL + RESOURCE
unique_authors = list(set([i for a in Papers['author_ids'] for i in a]))
new_authors = list(set(unique_authors).difference(list(IC2S2_authors['id'])))

lst = [str(new_authors[i]) + '|' for i in range(len(new_authors))]
n = 25

chunked_lists_new = [lst[i:i + n] for i in range(0, len(lst), n)]

In [63]:
def fetch_author_data(chunk):
    cursorState = '*'
    ids = ''.join(chunk)
    ids = ids[:-1]
    responses = {}
    count = 0
    while cursorState:
        PARAMS = {"per_page":200,
                "filter" :f'ids.openalex:{ids}',
                "cursor": f'{cursorState}'}
        
        responses[str(count)] = requests.get(COMPLETE_URL, params=PARAMS).json()
        if responses[str(count)]['meta']['next_cursor']:
            cursorState = str(responses[str(count)]['meta']['next_cursor'])
        else: cursorState = None
        count += 1
    return responses

In [64]:
def query_data(response):
    lists = []
    for idx in response:
        result = response[str(idx)]['results']
        CO_list = {'display_name':[],'id':[],'works_api_url':[],'h_index':[],'works_count':[],'country_code':[]}
        for i in range(len(result)):
            CO_list['display_name']+=[result[i]['display_name']]

            CO_list['id']+=[result[i]['id']]
            
            CO_list['works_api_url']+=[result[i]['works_api_url']]
            
            CO_list['h_index']+=[result[i]['summary_stats']['h_index']]
            CO_list['works_count']+=[result[i]['works_count']]
            
            if result[i]['last_known_institution'] is not None:
                CO_list['country_code']+=[result[i]['last_known_institution']['country_code']]
            else: 
                CO_list['country_code']+=['None']
        if len(CO_list['display_name'])>= 1:
            lists.append(CO_list)
    
    return lists

In [65]:
A_data = Parallel(n_jobs=2)(delayed(fetch_author_data)(chunk) for chunk in tqdm(chunked_lists_new))
author_list = Parallel(n_jobs=2)(delayed(query_data)(A_data[idx]) for idx in tqdm(range(len(A_data))))

100%|██████████| 311/311 [03:17<00:00,  1.58it/s]
100%|██████████| 311/311 [00:00<00:00, 1525.02it/s]


In [66]:
df = pd.DataFrame(author_list[0][0])
for i in range(1,len(author_list)):
    df_authors = pd.concat([df,pd.DataFrame(author_list[i][0])], ignore_index=True)
    df = df_authors

In [67]:
BASE_URL = 'https://api.openalex.org/'
RESOURCE = 'works'
COMPLETE_URL = BASE_URL + RESOURCE

def fetch_works_data(chunk):
    ids = ''.join(chunk)
    ids = ids[:-1]
    cursorState = '*'
    responses = {}
    count = 0
    while cursorState:
        PARAMS = {"per_page":200,
                "filter" :f'{",".join([Filter1,Filter2,Filter3,Filter4])},authorships.author.id:{ids}',
                "cursor": f'{cursorState}'}
            
        responses[str(count)] = requests.get(COMPLETE_URL, params=PARAMS).json()
        if responses[str(count)]['meta']['next_cursor']:
            cursorState = str(responses[str(count)]['meta']['next_cursor'])
        else: cursorState = None
        count+=1
    return responses

def query_data_works_paper(response):
    lists = []
    for idx in response:
        result = response[str(idx)]['results']
        Paper_list = {'id':[],'publication_year':[],'cited_by_count':[],'author_ids':[]}
        for i in range(len(result)):
            Paper_list['id'] += [result[i]['id']]
            Paper_list['publication_year'] += [result[i]['publication_year']]
            Paper_list['cited_by_count'] += [result[i]['cited_by_count']]
            
            test = [result[i]['authorships'][l]['author']['id'] for l in range(len(result[i]['authorships']))]
            Paper_list['author_ids'] += [list(set(test).intersection(unique_authors))]
        if len(Paper_list['id'])>= 1:
            lists.append(Paper_list)
    return lists

def query_data_works_abstract(response):
    lists = []
    for idx in response:
        result = response[str(idx)]['results']
        Paper_list = {'id':[],'title':[],'abstract_inverted_index':[]}
        for i in range(len(result)):
            Paper_list['id'] += [result[i]['id']]
            Paper_list['title'] += [result[i]['title']]
            Paper_list['abstract_inverted_index'] += [result[i]['abstract_inverted_index']]
            
        if len(Paper_list['id'])>= 1:
            lists.append(Paper_list)
    return lists


In [68]:
CO_works_data = Parallel(n_jobs=2)(delayed(fetch_works_data)(chunk) for chunk in tqdm(chunked_lists_new))
Paper_data = Parallel(n_jobs=2)(delayed(query_data_works_paper)(CO_works_data[idx]) for idx in tqdm(range(len(CO_works_data))))
Abstract_data = Parallel(n_jobs=2)(delayed(query_data_works_abstract)(CO_works_data[idx]) for idx in tqdm(range(len(CO_works_data))))

100%|██████████| 311/311 [12:57<00:00,  2.50s/it]
100%|██████████| 311/311 [00:11<00:00, 27.52it/s]
100%|██████████| 311/311 [02:01<00:00,  2.55it/s]


In [69]:
df = pd.DataFrame(Paper_data[0][0])
for i in range(1,len(Paper_data)):
    df_paper = pd.concat([df,pd.DataFrame(Paper_data[i][0])], ignore_index=True)
    df = df_paper
    
df = pd.DataFrame(Abstract_data[0][0])
for i in range(1,len(Abstract_data)):
    df_abstract = pd.concat([df,pd.DataFrame(Abstract_data[i][0])], ignore_index=True)
    df = df_abstract
    
Co_Paper = df_paper
Co_abstract = df_abstract
Co_authors = df_authors


In [70]:
all_papers = pd.concat([Papers,Co_Paper],ignore_index=True)
all_abstracts = pd.concat([Abstracts,Co_abstract],ignore_index=True)
all_author = pd.concat([IC2S2_authors,Co_authors],ignore_index=True)

all_papers.drop_duplicates(subset="id", keep='first', inplace=True)

In [71]:
all_author.to_csv('data/IC2S2_all_authors',index=False)
all_papers.to_csv('data/IC2S2_all_papers',index=False)
all_abstracts.to_csv('data/IC2S2_all_abstracts',index=False)

### Data Overview and Reflection questions:

##### **How many works are listed in your IC2S2 papers dataframe?**

- 45105 works are listed in the IC2S2 papers dataframe. 

##### **How many unique researchers have co-authored these works?**

- The number of unique researchers is 8808.

##### **Efficiency in code**
- Our approach to making our code more efficient was to consider two main things. We realize that the bottleneck in the code is the rate limit of the API which is 2 request per second. We therefore implemented, instead of using the 'search' parameter where one author can be requested at a time, we use in the filter parameter 'authorships.author.id' to request 25 authors at a time where we concatenate the author ids with the "|" (OR) parameter. This way we can request 25 authors at a time.
Furthermore, we used the python library 'joblib' to parallelize the requests to the API. We changed all the for loops in our code, to function and run it through joblib to run more jobs at a time. We see that this has an impact in the run-time of the code, and make it more efficient.


##### **Filtering Criteria and Dataset Relevance**
- By setting our filter on the API request to set a threshold of works count between 5 and 5000, ensures that authors represented in the dataset. To further ensure that the dataset is relevant, we also set a filter to the citation count to be above 5 and that the amount of authors on the work is below 10. This ensures that the works and authors are relevant.
Furthermore we filter for works that include both either "Sociology", "Psychology", "Economics", "Political Science" or "Mathematics","Physics","Computer Science". This leads to an overrepresentation in these field and will lead to an underrepresentation in fields that include "History", "Biology", "Medicine" and more fields.

## Part 4: The Network of Computational Social Scientists

In this part of the assignment we construct and investigate the Computational Social Scientists Network.

In [72]:
import networkx as nx
import pandas as pd
import ast

all_papers = pd.read_csv("data/IC2S2_all_papers")

def list_maker(string):
    return ast.literal_eval(string)

all_papers = all_papers.drop_duplicates(['id'])
Paper_author = all_papers['author_ids'].apply(list_maker)
all_papers['author_ids'] = Paper_author

#1.1 Weighted edgelist made from Papers dataset
def find_pairs(my_list): # A function to rearange a list into a list of tuples of pairs
    pairs = []
    for n, author in enumerate(my_list):
        for author2 in my_list[n+1:]:
            pairs.append((author,author2))
    return pairs

pairs = all_papers['author_ids'].apply(lambda x: tuple(sorted(x))).apply(find_pairs) #rearange id's to avoid duplicates eg. (a,b),(b,a)
all_pairs = pairs.explode()
sorted_pairs = all_pairs.groupby(all_pairs).count().sort_values()
edgelist = [(a1,a2,v) for (a1,a2), v in zip(sorted_pairs.index, sorted_pairs.values)]

In [73]:
# 1.2 Graph construction based on the created edgelist
G = nx.Graph()
G.add_weighted_edges_from(edgelist)

In [74]:
# 1.3 save authors as nodes in the graph, with certain information about each, saved as attributes.
Author_df = pd.read_csv('data/IC2S2_all_authors')
Author_papers = all_papers.explode('author_ids')
publication_df = Author_papers.groupby('author_ids')['publication_year'].min().reset_index()
publication_df.columns = ['id', 'first_publication_year']
cited_df = Author_papers.groupby('author_ids')['cited_by_count'].sum().reset_index()
cited_df.columns = ['id','cited_by_count']
Author_df = Author_df.merge(publication_df, on='id')
Author_df = Author_df.merge(cited_df, on='id')
Author_df = Author_df.fillna('')

In [75]:
for node in G.nodes:
    if node in Author_df['id'].values:
        G.nodes[node]['display_name'] = Author_df[Author_df['id']== node]['display_name'].values[0]
        G.nodes[node]['country_code'] = Author_df[Author_df['id']== node]['country_code'].values[0]
        G.nodes[node]['cited_by_count'] = Author_df[Author_df['id']== node]['cited_by_count'].values[0]
        G.nodes[node]['first_publication_year'] = Author_df[Author_df['id']== node]['first_publication_year'].values[0]

        #OBS gem som JSON

In [76]:
#2.1 Network metrics
CC = 0
IN = 0
components = nx.connected_components(G)
for c in components:
    CC += 1
    if len(c) == 1:
        IN += 1

print(f'The number of nodes in G is {G.number_of_nodes()} and the amount of edges are {G.number_of_edges()}')
print(f'\nThe denisty of G is {nx.density(G)}')
print(f'\nIs the graph fully connected (not disconnected): {nx.is_connected(G)}')
print(f'\nSince graph G is disconnected; there are {CC} different connected components and {IN} isolated nodes')

#OBS besvar tekst spørgsmålet

The number of nodes in G is 8464 and the amount of edges are 26290

The denisty of G is 0.0007340414529877302

Is the graph fully connected (not disconnected): False

Since graph G is disconnected; there are 139 different connected components and 0 isolated nodes


In [77]:
s1 = sorted([val for idx, val in G.degree])
s2 = sorted([val for idx, val in G.degree(weight='weight')])
for n,sorted_degrees in enumerate([s1, s2]):
    average_degree = sum(sorted_degrees)/G.number_of_nodes()
    median_degree = sorted_degrees[int(len(sorted_degrees)/2)]
    mode_degree = max(set(sorted_degrees),key=sorted_degrees.count)
    min_degree = sorted_degrees[0]
    max_degree = sorted_degrees[-1]
    if n == 0:
        print(f'Regarding the degree of nodes in G. \naverage : {average_degree}\nmedian : {median_degree}\nmode {mode_degree}\nmin : {min_degree}\nmax : {max_degree} ')
    else: print(f'Regarding the weighted degree of nodes in G. \naverage : {average_degree}\nmedian : {median_degree}\nmode {mode_degree}\nmin : {min_degree}\nmax : {max_degree} ')

Regarding the degree of nodes in G. 
average : 6.21219281663516
median : 5
mode 4
min : 1
max : 122 
Regarding the weighted degree of nodes in G. 
average : 9.108931947069943
median : 6
mode 4
min : 1
max : 212 


In [78]:
#2.3 Top 5 Authors, based on the amount of degrees (collaborative works with unique authors)
top5 = pd.DataFrame(G.degree, columns=['id', 'degree']).sort_values(by='degree',ascending=False).head(5)
top5['names'] = None
display_names = [Author_df.loc[Author_df["id"] == top5["id"].values[i]]["display_name"].values[0] for i in range(len(top5))]
top5['names'] = display_names
top5

#OBS BESVAR TEKST SPØRGSMÅL

Unnamed: 0,id,degree,names
978,https://openalex.org/A5088141761,122,Jonathan D. Cohen
2956,https://openalex.org/A5029100305,104,Denny Borsboom
579,https://openalex.org/A5075080019,95,Qin Li
925,https://openalex.org/A5055710645,94,Jon Kleinberg
255,https://openalex.org/A5065243448,92,Qin Wang
