### GitHub repository
Link to the repository used to collaborate on the assignment:
https://github.com/KarolineKlan/Assignments_ComSocSci2024.git

### Contribution statement

Team members:

- Jacob (s214596)
- Kristoffer (s214609)
- Karoline (s214638)

All members collaborated and contributed to every part of the assignment.


# Assignment 1
This assignment is based on web-scraping the program of the International Conference in Computational Social Science 2023 (https://ic2s2-2023.org/program) and on accessing data on authors and research articles through the OpenAlex API (https://docs.openalex.org/).

In [4]:
#Import relevant libraries
from bs4 import BeautifulSoup 
import requests
import pandas as pd
from tqdm import tqdm
from Levenshtein import distance
import numpy as np
import ast
import networkx as nx
from joblib import Parallel, delayed
import os 
import json

## Part 1 - Webscraping
In the following task we use web-scraping tools to get the list of participants in the International Conference in Computational Social Science (IC2S2) 2023.

In [6]:
# define link to scrape, and beautifulsoup object
LINK = "https://ic2s2-2023.org/program"
LINK_OPTIONAL1 = "https://ic2s2-2023.org/program_committee"
r_OPTIONAL1 = requests.get(LINK_OPTIONAL1)
soup_OPTIONAL1 = BeautifulSoup(r_OPTIONAL1.content)
LINK_OPTIONAL2 = "https://ic2s2-2023.org/tutorials"
r_OPTIONAL2 = requests.get(LINK_OPTIONAL2)
soup_OPTIONAL2 = BeautifulSoup(r_OPTIONAL2.content)
r = requests.get(LINK)
soup = BeautifulSoup(r.content)

# Find all relevant places in the HTML code where names are stored
speaker = soup.findAll("ul", {"class" : "nav_list"})
chair = soup.findAll("h2")
table = soup.find("table", {"class" : "tutorials"})
table = table.find_all("td")
main = soup_OPTIONAL1.find("section", {"id" : "main"})
names_members = main.findAll("li")
names_teachers = soup_OPTIONAL2.findAll("div", {"class" : "col-5 col-12-medium"})

# Loop through the HTML code and extract names
keynote_names = [table[k].text.lower().split("- ")[1] for k in range(len(table)) if "Keynote" in table[k].text]
chair_names = [chair[k].text.lower().split(": ")[2] for k in range(len(chair)) if "Chair" in chair[k].text]
speaker_names = [speaker[k].find_all("i")[j].text.lower().split(", ")  for k in range(len(speaker)) for j in range(len(speaker[k].find_all("i")))]
speaker_names = sum(speaker_names, [])
names_members_lst = [names_members[i].find("b").text.lower() for i in range(len(names_members))]
names_teachers = [names_teachers[i].findAll("li")[k].find("b").text.lower() for i in range(len(names_teachers)) for k in range(len(names_teachers[i].findAll("li")))]



# Print results for each category
print(f"Number of unique speakers:  {len(set(speaker_names))}")
print(f"Number of unique keynote speakers:  {len(set(keynote_names))}")
print(f"Number of unique chairs:  {len(set(chair_names))}")
print(f"Number of unique members from optional link1:  {len(set(names_members_lst))}")
print(f"Number of unique teachers from optional link2:  {len(set(names_teachers))}")

# Add all names together to find total unique names
total_names = speaker_names + keynote_names + chair_names + names_members_lst + names_teachers
df = pd.DataFrame(total_names, columns = ["Name"])
df["Name"] = df["Name"].str.replace(".", "", regex=False)  # treat "." as a literal dot, not as a regex wildcard
df["Name"] = df["Name"].str.lstrip(" ")
uniq_names = pd.DataFrame(set(df["Name"]), columns=["Name"])
uniq_names = uniq_names.sort_values('Name', ascending=True)

print(f"Total number of unique speakers:  {len((uniq_names))}")

pd.DataFrame(uniq_names).to_csv("data/authors_part1.csv", index=False)

Number of unique speakers:  1472
Number of unique keynote speakers:  10
Number of unique chairs:  49
Number of unique members from optional link1:  333
Number of unique teachers from optional link2:  19
Total number of unique speakers:  1645



**The process of the web-scraping:** 

When web-scraping the website to collect the names of all the researchers, we first inspected the HTML thoroughly in order to understand the hierarchical and nested structure of the page. After inspection, the main structure of the page was divided into 3 parts, each requiring a different approach to access the data:
1. collect the names of the keynote speakers from the overview table structure
2. collect the names of the chairs in the "h2" sections
3. collect the names from the text sections in the "ul" sections

The optional parts were added later: the program committee members were extracted using find_all on the "li" structure, and the tutorial teachers from the special class "col-5 col-12-medium".

When collecting the names, all strings were converted to lower case, and dots and leading spaces were stripped in order to avoid counting the same person several times because of differently abbreviated middle names. The set() function makes sure we only save unique names.
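
The normalization described above can be illustrated on a few made-up names. This is a minimal sketch and not the exact notebook code: the helper `normalize_name` and the example names are hypothetical, and it uses `strip()` on both ends where the notebook only strips leading spaces.

```python
def normalize_name(raw_name: str) -> str:
    """Lower-case a name, drop dots and trim surrounding whitespace."""
    return raw_name.lower().replace(".", "").strip()


# Hypothetical scraped names, including spelling variants of the same person
scraped = ["Jane A. Doe", " jane a doe", "JANE A DOE", "John Smith"]

# A set comprehension keeps each normalized name only once
unique_names = sorted({normalize_name(name) for name in scraped})
print(unique_names)
```

The three variants of the first name collapse into a single entry, which is the effect the cleaning step aims for.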




## Part 2 - Ready Made vs Custom Made Data

1. **Custom-made data**, as used in Centola's experiment, offers experimental control and manipulation: the researchers can design conditions and variables precisely, and the data is complete, since everything needed is collected by design. However, it can be time-consuming and costly to collect, and it may lack real-world validity, because recording behavior in digital systems that are highly engineered to induce specific behaviors can introduce response biases.
**Ready-made data**, as used in Nicolaides's study, is readily available because it is always-on, which enables the study of unexpected events and real-time measurements. It reflects real-world behavior but may lack experimental control and relevance to specific research questions, as samples can be incomplete or biased. It also makes confounding variables hard to identify and control, and it can be dirty and difficult to access.

2. Centola's custom-made data allows for clear causal inference regarding the spread of behavior, but the experiment takes place in an artificial environment, which potentially limits how well the results apply to real-life settings and how well they generalize. In Nicolaides's ready-made data, confounding variables and biases can be hard to identify, which complicates the interpretation of the results and can lead to wrong conclusions. Bias can affect the results for both types of data and must be considered in the interpretation. In custom-made data, response biases should be considered, as the design of the experiment might induce specific types of behavior. In ready-made data, bias from potentially incomplete or nonrepresentative datasets, along with other characteristics of big data, should be considered.


## Part 3 - Gathering Research Articles using the OpenAlex API

In this part of the assignment the API endpoint "works" from the OpenAlex API is used to collect Research Articles from IC2S2 Authors.

The file "authors_part3.csv" from week 2, exercise 2 is loaded as a dataframe in order to retrieve the works of the authors. In week 2 the list of authors retrieved in part 1 was reduced from 1645 to 1257 entries, because the Levenshtein distance was used to make sure that the names returned by the OpenAlex API match the scraped names. Names with a Levenshtein distance above 6 were excluded.
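
To make the name-matching rule concrete, the sketch below computes the Levenshtein distance with a small pure-Python function and keeps only pairs within the threshold of 6. The notebook itself imports `distance` from the `Levenshtein` package; the function `levenshtein`, the constant `MAX_DISTANCE` and the example pairs here are hypothetical stand-ins for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


MAX_DISTANCE = 6  # threshold stated in the text

# Hypothetical (scraped name, name returned by the API) pairs
candidates = [
    ("jane doe", "jane doe"),
    ("jane doe", "jane a doe"),
    ("john smith", "completely different person"),
]

# Keep a pair only if the two spellings are close enough
kept = [(scraped, returned) for scraped, returned in candidates
        if levenshtein(scraped, returned) <= MAX_DISTANCE]
print(kept)
```

A missing middle initial costs only two edits and is kept, while an unrelated name is far above the threshold and is dropped.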

In [3]:
BASE_URL = 'https://api.openalex.org/'
RESOURCE = 'works'
COMPLETE_URL = BASE_URL + RESOURCE

#load df from week 2 exercise 2
df = pd.read_csv('data/authors_part3.csv', index_col=[0])

#keep authors with more than 5 and fewer than 5000 works
IC2S2_author = df.loc[df['works_count'] > 5]
IC2S2_authors = IC2S2_author.loc[IC2S2_author['works_count'] < 5000]
IC2S2_authors = IC2S2_authors.reset_index()

#split the list of authors into chunks of 25
lst = [str(IC2S2_authors['id'][i]) + '|' for i in range(len(IC2S2_authors['id']))]
n = 25
chunked_lists = [lst[i:i + n] for i in range(0, len(lst), n)]

# Specify the Filters used in the filtering process for the API
Filter1 = "cited_by_count:>10"
Filter2 = "authors_count:<10"

all_concepts = requests.get(BASE_URL+"concepts", params={"filter":"level:0"}).json()

social_concepts= [i["id"] for i in all_concepts["results"] if i['display_name'] in (["Sociology","Psychology","Economics","Political Science"])]
math_concepts= [i["id"] for i in all_concepts["results"] if i['display_name'] in (["Mathematics","Physics","Computer Science"])]

Filter3="concepts.id:"+"|".join(social_concepts)
Filter4="concepts.id:"+"|".join(math_concepts)

In [4]:
# Define function to get papers and abstracts from the API in chunks of 25 authors
def get_papers_and_abstracts(chunked_list):
    # Local dictionaries, so that each parallel job only returns its own results
    papers = {}
    abstracts = {}
    # Every id ends with '|'; join them and drop the trailing '|'
    ids = ''.join(chunked_list)[:-1]
    cursor_state = '*'
    while cursor_state:
        params = {"per_page": 200,
                  "filter": f'{",".join([Filter1, Filter2, Filter3, Filter4])},authorships.author.id:{ids}',
                  "cursor": cursor_state}

        response = requests.get(COMPLETE_URL, params=params).json()
        # next_cursor is None on the last page, which ends the loop
        cursor_state = response['meta']['next_cursor']

        for work in response['results']:
            work_id = work['id']
            author_ids = [authorship['author']['id'] for authorship in work['authorships']]
            papers[work_id] = {'id': work_id, 'publication_year': work['publication_year'],
                               'cited_by_count': work['cited_by_count'], 'author_ids': author_ids}
            abstracts[work_id] = {'id': work_id, 'title': work['title'],
                                  'abstract_inverted_index': work['abstract_inverted_index']}

    return papers, abstracts

# Use joblib to parallelize the process of getting papers and abstracts faster
results = Parallel(n_jobs=2)(delayed(get_papers_and_abstracts)(chunked_list) for chunked_list in tqdm(chunked_lists))


papers_dict = {k: v for result in results for k, v in result[0].items()}
abstracts_dict = {k: v for result in results for k, v in result[1].items()}

Papers = pd.DataFrame(papers_dict).transpose()
Abstracts = pd.DataFrame(abstracts_dict).transpose()


100%|██████████| 42/42 [01:01<00:00,  1.46s/it]
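
The cursor loop in the function above can be tried without network access. In the sketch below, `FAKE_PAGES` and `fetch_page` are hypothetical stand-ins for the API response; the only assumption taken from the notebook is that each response carries a `meta.next_cursor` field which is empty on the last page.

```python
# Hypothetical stand-in for the API: each cursor maps to one page of results
FAKE_PAGES = {
    "*": {"meta": {"next_cursor": "c1"}, "results": [{"id": "W1"}, {"id": "W2"}]},
    "c1": {"meta": {"next_cursor": "c2"}, "results": [{"id": "W3"}]},
    "c2": {"meta": {"next_cursor": None}, "results": []},
}


def fetch_page(cursor: str) -> dict:
    """Hypothetical replacement for requests.get(...).json()."""
    return FAKE_PAGES[cursor]


def collect_all(start_cursor: str = "*") -> list:
    """Follow next_cursor from page to page until it is empty."""
    works = []
    cursor = start_cursor
    while cursor:
        page = fetch_page(cursor)
        works.extend(page["results"])
        cursor = page["meta"]["next_cursor"]
    return works


all_works = collect_all()
print([work["id"] for work in all_works])
```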


In [13]:
print(f"Number of works in the IC2S2_papers dataframe: {len(Papers)}")
print(f"Number of unique authors that have co-authored the works: {len(Papers['author_ids'].explode().unique())}")

Number of works in the IC2S2_papers dataframe: 4485
Number of unique authors that have co-authored the works: 8474


### **Data Overview and Reflection questions:**

##### **How many works are listed in your IC2S2 papers dataframe?**

- 4485 works are listed in the IC2S2 papers dataframe. 

##### **How many unique researchers have co-authored these works?**

- The number of unique researchers is 8474.

##### **Efficiency in code**
- To make our code more efficient we considered two main things. The bottleneck in the code is the rate limit of the API, which is 2 requests per second. Therefore, instead of using the 'search' parameter, where only one author can be requested at a time, we use the filter 'authorships.author.id' and concatenate the author ids with the "|" (OR) operator. This way we can request the works of 25 authors in a single call.
Furthermore, we used the Python library 'joblib' to parallelize the requests to the API. We turned the for loops in our code into functions and ran them through joblib in order to run several jobs at a time. We see that this reduces the run time of the code and makes it more efficient.
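
The batching idea can be sketched without calling the API. The helpers `chunk` and `author_filter` and the author ids below are hypothetical; only the filter name `authorships.author.id`, the "|" separator and the chunk size of 25 come from the notebook.

```python
def chunk(items: list, size: int) -> list:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def author_filter(author_ids: list) -> str:
    """Build one filter value that matches any of the given authors."""
    return "authorships.author.id:" + "|".join(author_ids)


# Hypothetical author ids
author_ids = [f"A{n}" for n in range(1, 61)]

chunks = chunk(author_ids, 25)
filters = [author_filter(ids) for ids in chunks]

print(len(filters))   # number of requests needed instead of 60
print(filters[-1])    # filter for the last, shorter chunk
```

Joining with `"|".join(...)` avoids the trailing separator that the notebook removes afterwards with `ids[:-1]`.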


##### **Filtering Criteria and Dataset Relevance**
- By keeping only authors with a works count between 5 and 5000 before querying the API, we ensure that the authors represented in the dataset are relevant. To further ensure that the dataset is relevant, the API request only returns works with more than 10 citations and fewer than 10 authors. This ensures that the works and authors are relevant.
Furthermore, the request only returns works that are tagged with at least one of "Sociology", "Psychology", "Economics", "Political Science" and at least one of "Mathematics", "Physics", "Computer Science". This leads to an overrepresentation of these fields and to an underrepresentation of fields such as "History", "Biology" and "Medicine".
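
The combined effect of the four filters can be mimicked locally. The sketch below assumes that filters separated by a comma must all hold (AND) and that values separated by "|" are alternatives (OR). The function `passes_filters` and the flat work dictionaries are hypothetical simplifications, not the shape of real API responses.

```python
SOCIAL = {"Sociology", "Psychology", "Economics", "Political Science"}
QUANTITATIVE = {"Mathematics", "Physics", "Computer Science"}


def passes_filters(work: dict) -> bool:
    """Local mirror of the four filters used in the request."""
    concepts = set(work["concepts"])
    return (work["cited_by_count"] > 10        # Filter1
            and work["authors_count"] < 10     # Filter2
            and bool(concepts & SOCIAL)        # Filter3: any social concept
            and bool(concepts & QUANTITATIVE)) # Filter4: any quantitative concept


# Hypothetical works
works = [
    {"id": "W1", "cited_by_count": 25, "authors_count": 3,
     "concepts": ["Sociology", "Computer Science"]},
    {"id": "W2", "cited_by_count": 40, "authors_count": 2,
     "concepts": ["Sociology", "History"]},
    {"id": "W3", "cited_by_count": 5, "authors_count": 4,
     "concepts": ["Economics", "Physics"]},
    {"id": "W4", "cited_by_count": 90, "authors_count": 15,
     "concepts": ["Psychology", "Mathematics"]},
]

kept_ids = [work["id"] for work in works if passes_filters(work)]
print(kept_ids)
```

Only the first work survives: the second lacks a quantitative concept, the third has too few citations and the fourth has too many authors.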

## Part 4: The Network of Computational Social Scientists

In this part of the assignment we construct and investigate the Computational Social Scientists Network.

In [24]:

all_papers = Papers#pd.read_csv("data/IC2S2_all_papers")

# def list_maker(string):
#     return ast.literal_eval(string)

#all_papers = all_papers.drop_duplicates(['id'])
#Paper_author = all_papers['author_ids'].apply(list_maker)
#all_papers['author_ids'] = Paper_author

#1.1 Weighted edgelist made from Papers dataset
def find_pairs(my_list): # A function to rearrange a list into a list of tuples of pairs
    pairs = []
    for n, author in enumerate(my_list):
        for author2 in my_list[n+1:]:
            pairs.append((author,author2))
    return pairs

pairs = Papers['author_ids'].apply(lambda x: tuple(sorted(x))).apply(find_pairs) #sort the ids to avoid duplicates, e.g. (a,b) and (b,a)
all_pairs = pairs.explode()
sorted_pairs = all_pairs.groupby(all_pairs).count().sort_values()
edgelist = [(a1,a2,v) for (a1,a2), v in zip(sorted_pairs.index, sorted_pairs.values)]
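
The pair counting in the cell above can be checked on a toy example. This sketch uses `itertools.combinations` and `collections.Counter` instead of the notebook's `find_pairs` and `groupby`, which should give the same counts. The author lists and the names `toy_papers` and `toy_edgelist` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper
toy_papers = [["B", "A", "C"], ["A", "B"], ["C", "D"], ["E"]]

pair_counts = Counter()
for authors in toy_papers:
    # Sorting first makes (a, b) and (b, a) the same pair
    pair_counts.update(combinations(sorted(authors), 2))

# One weighted edge per pair: (author, author, number of joint papers)
toy_edgelist = [(a, b, weight) for (a, b), weight in pair_counts.items()]
print(toy_edgelist)
```

Note that the single-author paper by "E" produces no pair, so "E" never appears in the edgelist.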

In [25]:
# 1.2 Graph construction based on the created edgelist
G = nx.Graph()
G.add_weighted_edges_from(edgelist)

In [32]:
# 1.3 save authors as nodes in the graph, with certain information about each, saved as attributes.
Author_df = IC2S2_authors #pd.read_csv('data/IC2S2_Authors')


Long_paper_df = all_papers.explode('author_ids')

# Find the smallest publication year for each author
publication_df = Long_paper_df.groupby('author_ids')['publication_year'].min().reset_index()
publication_df.columns = ['id', 'first_publication_year']

#sum the citation counts of each author's works in the dataset
cited_df = Long_paper_df.groupby('author_ids')['cited_by_count'].sum().reset_index()
cited_df.columns = ['id','cited_by_count']

#add the information to the Author_df 
Author_df = Author_df.merge(publication_df, on='id')
Author_df = Author_df.merge(cited_df, on='id')
Author_df = Author_df.fillna('')

Author_df.head()

Unnamed: 0,id,display_name,works_api_url,h_index,works_count,country_code,first_publication_year,cited_by_count
0,https://openalex.org/A5014647140,Aaron Clauset,https://api.openalex.org/works?filter=author.i...,43,201,US,2005,2652
1,https://openalex.org/A5047404909,Aaron J. Schwartz,https://api.openalex.org/works?filter=author.i...,8,23,US,2021,30
2,https://openalex.org/A5015211752,Aaron Smith,https://api.openalex.org/works?filter=author.i...,37,157,GB,2004,24
3,https://openalex.org/A5064296964,Abdullah Almaatouq,https://api.openalex.org/works?filter=author.i...,10,47,US,2016,67
4,https://openalex.org/A5063098095,Abeer ElBahrawy,https://api.openalex.org/works?filter=author.i...,6,17,GB,2018,132


In [33]:
#add the information to each node in the graph
attribute_columns = ['display_name', 'country_code', 'cited_by_count', 'first_publication_year']
# one dictionary of attributes per author id (first row wins if an id is repeated)
node_attributes = Author_df.drop_duplicates('id').set_index('id')[attribute_columns].to_dict('index')
# only authors that are nodes in G receive attributes
nx.set_node_attributes(G, {node: attrs for node, attrs in node_attributes.items() if node in G})
G.nodes.data()


NodeDataView({'https://openalex.org/A5000017075': {'display_name': 'Pavlin Mavrodiev', 'country_code': 'CH', 'cited_by_count': 180, 'first_publication_year': 2013}, 'https://openalex.org/A5020270223': {'display_name': 'Claudio J. Tessone', 'country_code': 'CH', 'cited_by_count': 113, 'first_publication_year': 2013}, 'https://openalex.org/A5037087050': {'display_name': 'Eduardo López', 'country_code': 'US', 'cited_by_count': 476, 'first_publication_year': 2014}, 'https://openalex.org/A5088539840': {'display_name': 'Esteban Moro', 'country_code': 'US', 'cited_by_count': 1948, 'first_publication_year': 2002}, 'https://openalex.org/A5070200645': {'display_name': 'Robert L. Axtell', 'country_code': 'US', 'cited_by_count': 34, 'first_publication_year': 2011}, 'https://openalex.org/A5048877432': {'display_name': 'Bruno Lepri', 'country_code': 'IT', 'cited_by_count': 637, 'first_publication_year': 2006}, 'https://openalex.org/A5037077388': {'display_name': 'Stefan A. Frisch', 'country_code': '

In [36]:
#Save the graph as a JSON file

# Convert the graph to a dictionary
graph_dict = nx.node_link_data(G)

# Convert NumPy integers (e.g. numpy.int64) to int for JSON serialization
def convert(o):
    if isinstance(o, np.integer):
        return int(o)
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")

# Write the graph dictionary to a JSON file
with open("data/network_with_attributes.json", "w") as f:
    json.dump(graph_dict, f, default=convert)

In [37]:
#2.1 Network metrics
CC = 0
IN = 0
components = nx.connected_components(G)
for c in components:
    CC += 1
    if len(c) == 1:
        IN += 1

print(f'The number of nodes in G is {G.number_of_nodes()} and the number of edges is {G.number_of_edges()}')
print(f'\nThe density of G is {nx.density(G)}')
print(f'\nIs the graph fully connected (not disconnected): {nx.is_connected(G)}')
print(f'\nSince graph G is disconnected; there are {CC} different connected components and {IN} isolated nodes')

# NOTE: answer the text question

# TODO: G has 8464 nodes, so 10 of the 8474 authors from part 3 are missing.
# A likely reason: authors who only appear on single-author papers never form a pair, so they never enter the edgelist.

The number of nodes in G is 8464 and the number of edges is 23742

The density of G is 0.0006628989036452906

Is the graph fully connected (not disconnected): False

Since graph G is disconnected; there are 261 different connected components and 0 isolated nodes
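
As a sanity check on the printed density: for an undirected graph with $n$ nodes and $m$ edges, the density is the share of all possible node pairs that are connected. Plugging in the numbers printed above reproduces the value:

$$
d = \frac{2m}{n(n-1)} = \frac{2 \cdot 23742}{8464 \cdot 8463} = \frac{47484}{71630832} \approx 6.63 \times 10^{-4}
$$

The same numerator gives the average degree, $2m/n = 47484/8464 \approx 5.61$.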


In [38]:
s1 = sorted([val for idx, val in G.degree])
s2 = sorted([val for idx, val in G.degree(weight='weight')])
for n,sorted_degrees in enumerate([s1, s2]):
    average_degree = sum(sorted_degrees)/G.number_of_nodes()
    median_degree = sorted_degrees[int(len(sorted_degrees)/2)]
    mode_degree = max(set(sorted_degrees),key=sorted_degrees.count)
    min_degree = sorted_degrees[0]
    max_degree = sorted_degrees[-1]
    if n == 0:
        print(f'Regarding the degree of nodes in G. \naverage : {average_degree}\nmedian : {median_degree}\nmode {mode_degree}\nmin : {min_degree}\nmax : {max_degree} ')
    else: print(f'Regarding the weighted degree of nodes in G. \naverage : {average_degree}\nmedian : {median_degree}\nmode {mode_degree}\nmin : {min_degree}\nmax : {max_degree} ')

Regarding the degree of nodes in G. 
average : 5.6101134215500945
median : 4
mode 3
min : 1
max : 122 
Regarding the weighted degree of nodes in G. 
average : 6.8676748582230625
median : 5
mode 3
min : 1
max : 212 
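
The difference between the two sets of statistics can be seen on a small example: the degree counts distinct co-authors, while the weighted degree sums the number of joint papers. The sketch below uses plain Python and the `statistics` module instead of networkx, and the edgelist `toy_edges` is hypothetical. `statistics.median` averages the two middle values for an even number of nodes, whereas the notebook picks the upper middle value.

```python
from collections import defaultdict
from statistics import mean, median, mode

# Hypothetical weighted edgelist: (author, author, number of joint papers)
toy_edges = [("A", "B", 3), ("A", "C", 1), ("B", "C", 1), ("C", "D", 2)]

degree = defaultdict(int)    # number of distinct co-authors
strength = defaultdict(int)  # weighted degree: total number of joint papers
for a, b, weight in toy_edges:
    for node in (a, b):
        degree[node] += 1
        strength[node] += weight

for label, per_node in (("degree", degree), ("weighted degree", strength)):
    values = sorted(per_node.values())
    print(f"{label}: average {mean(values)}, median {median(values)}, "
          f"mode {mode(values)}, min {values[0]}, max {values[-1]}")
```

Author "A" has two co-authors but four joint papers, which is why the weighted statistics are always at least as large as the unweighted ones.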


In [None]:
#2.3 Top 5 authors by degree (number of unique co-authors)
top5 = pd.DataFrame(G.degree, columns=['id', 'degree']).sort_values(by='degree',ascending=False).head(5)
top5['names'] = None
display_names = [Author_df.loc[Author_df["id"] == top5["id"].values[i]]["display_name"].values[0] for i in range(len(top5))]
top5['names'] = display_names
top5

# NOTE: answer the text question

Unnamed: 0,id,degree,names
885,https://openalex.org/A5088141761,122,Jonathan D. Cohen
2977,https://openalex.org/A5029100305,104,Denny Borsboom
587,https://openalex.org/A5075080019,95,Qin Li
924,https://openalex.org/A5055710645,94,Jon Kleinberg
294,https://openalex.org/A5065243448,92,Qin Wang
