# Assignment 1

Link to GitHub: https://github.com/NikolajT84/CSS_assignment1

### Collaboration

We have written the code working together on Nikolaj's machine, which is why all commits are from the same user.

In [80]:
# Imports for all exercises
import json
import requests
import pandas as pd
import networkx as nx
from tqdm import tqdm
from bs4 import BeautifulSoup
from itertools import combinations
from joblib import Parallel, delayed
from statistics import mean, mode, median

### Part 1

We follow the instructions, and use bs4 to parse the html.

In [123]:
# Get the HTML content
link = 'https://ic2s2-2023.org/program'
r = requests.get(link)
soup = BeautifulSoup(r.content)

# Find all the lists of things going on
programs = soup.find_all("ul", {"class":"nav_list"})

# Find all lists of authors
program_authors = [program.find_all("i") for program in programs]

# Get the text
all_authors = [author.text for authors in program_authors for author in authors]

# Get the individual names
all_names = [name.lower() for authors in all_authors for name in authors.split(", ")]
all_names_unique = list(set(all_names))

print(len(all_names_unique))
all_names_unique.sort()
print(all_names_unique)

# Save file as json
with open('authors.json', 'w') as f:
    json.dump(all_names_unique, f)

1472
[' bokányi', 'aaron clauset', 'aaron j. schwartz', 'aaron schein', 'aaron smith', 'abbas haidar', 'abby smith', 'abdulkadir celikkanat', 'abdullah almaatouq', 'abdullah zameek', 'abeer elbahrawy', 'adam finnemann', 'adam frank', 'adam h. russell', 'adam stefkovics', 'adam sutton', 'aditi dutta', 'adriano belisario', 'adrienne mendrik', 'agnieszka czaplicka', 'agnieszka falenska', 'aguru ishibashi', 'ahmad hesam', 'ahmed nasser mostafa', 'aidan combs', 'aidar zinnatullin', 'akeela careem', 'akhil arora', 'akira hashimoto', 'akira matsui', 'akira tsurushima', 'akrati saxena', 'alain barrat', 'alan paul kwan', 'alba motes rodrigo', 'albert-laszlo barabasi', 'alberto amaduzzi', 'alejandro beltran', 'alejandro dinkelberg', 'alejandro hermida carrillo', 'aleksandra urman', 'alessandra urbinati', 'alessandro de gaetano', 'alessandro flamini', 'alessandro flammini', 'alessandro gambetti', 'alessandro lomi', 'alessia antelmi', 'alessia melegaro', 'alessio vincenzo cardillo', 'alex mielke',

_5. How many unique researchers do you get?_

As seen in the code output above we get 1472 unique authors.

_6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices?_

We first open the webpage and inspect its HTML. We notice that all the programs are in elements marked 'ul' with class name 'nav_list'. We get these programs and find all the names in them, contained in the elements marked 'i'. Then it's just a matter of formatting the resulting lists and deleting duplicates. The approach of taking all the navigable lists ensures that we get names from all the types of events. We look at a subset of the names to check that they are indeed names and not something else. Since we will be using the search function of the OpenAlex API to find authors, we deem further checking of the names unnecessary: misspellings, duplicates, etc. will be caught at that point.
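
As a possible refinement, the entry `' bokányi'` in the output above suggests that splitting on `", "` can leave stray whitespace in a name. The following is a minimal sketch of a normalization step, not the code we ran: `normalize_name`, `unique_names` and the sample strings are made up for illustration.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Return a canonical form of a scraped author name."""
    # Unify Unicode representations (e.g. composed vs. decomposed accents).
    name = unicodedata.normalize("NFC", raw)
    # Collapse runs of whitespace and trim both ends.
    name = re.sub(r"\s+", " ", name).strip()
    return name.lower()


def unique_names(author_strings):
    """Split comma-separated author strings and return sorted unique names."""
    names = set()
    for authors in author_strings:
        for part in authors.split(","):
            cleaned = normalize_name(part)
            if cleaned:  # skip empty fragments such as trailing commas
                names.add(cleaned)
    return sorted(names)


# Hypothetical input mimicking the text of the <i> elements.
sample = ["Aaron Clauset, Sune  Lehmann", "sune lehmann,  Alain Barrat ,"]
print(unique_names(sample))
```

Applying such a step would probably change the count of unique names slightly, so the number reported above refers to the list without it.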

### Part 2

_1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book._

For the custom-made data used by Centola, there are the usual downsides to this type of data: it probably took quite a lot of effort to set up and design the experiment, and they couldn't be sure beforehand how many people they would get to sign up to the network. On the other hand, they got the exact, specific data they wanted, and were thus able to get more out of their limited data.

For the ready-made data used in the second study, there was the advantage that all the data was collected beforehand. Thus, the researchers could jump right to analyzing it. The volume of data from the fitness tracker would most likely also be far bigger than anything they could have collected themselves. The fact that data is collected over time also provides the added advantage of facilitating longitudinal analysis. The downside is that, since they had no control over the collection process, they had to be more careful when doing their causal inference, as some variables could not be measured directly and the biases of the collection process are harder to uncover.

_2. How do you think these differences can influence the interpretation of the results in each study?_

With ready-made data, one will often have to do some data manipulation to infer the variables of interest. This can introduce added uncertainty to the conclusions drawn from the study. On the other hand, with custom-made data, one will often have to extrapolate or do inference from fewer data points, since the collection process is more arduous. This can limit the value of the study's conclusions.



### Part 3

Here we use the author database that we collected in week 2.
We first get all authors with a works count above 5 and below 5000.

In [64]:
authors = pd.read_pickle('authors.pkl')
authors = authors.loc[(authors['works_count'] > 5) &
                      (authors['works_count'] < 5000)]

We then construct the filter for the concepts, i.e.
"(Sociology|Psychology|Economics|Political science) & (Mathematics|Physics|Computer science)",
but using the concept ids, which we first need to look up.

In [35]:
# Define the base url
base_url = 'https://api.openalex.org/works'

# Columns for the dataframes
cols_papers = ['id', 'publication_year', 'cited_by_count', 'author_ids']
cols_abstract = ['id', 'title', 'abstract_inverted_index']

# Produce the concept filter
# First get all high-level concepts
concepts_url = 'https://api.openalex.org/concepts'
params_concepts = {'filter': 'level:0'}
result_concepts = requests.get(concepts_url, params=params_concepts).json()['results']
concepts = {concept['display_name']: concept['id'] for concept in result_concepts}

# Then define the lists of each category
soc_concepts = ['Sociology', 'Psychology', 'Economics', 'Political science']
quant_concepts = ['Mathematics', 'Physics', 'Computer science']

# Construct the OR-condition for each category
condition_soc = '|'.join(concepts[c] for c in soc_concepts)
condition_quant = '|'.join(concepts[c] for c in quant_concepts)

# Construct filter
concepts_filter_soc = f'concept.id:{condition_soc}'
concepts_filter_quant = f'concept.id:{condition_quant}'
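
To make the logic of the two filters explicit: in the OpenAlex filter syntax, values joined with `|` are combined with OR, and filter terms joined with `,` are combined with AND. The sketch below shows the shape of the resulting string. The helper names `or_group` and `and_filters` and the ids are placeholders for illustration, not real OpenAlex concept ids.

```python
def or_group(attribute, values):
    """Build one filter term that matches ANY of the given values."""
    return f"{attribute}:" + "|".join(values)


def and_filters(*terms):
    """Join filter terms with commas, which are combined with AND."""
    return ",".join(terms)


# Placeholder ids, NOT real OpenAlex concept ids.
soc_ids = ["C_SOC_1", "C_SOC_2"]
quant_ids = ["C_QUANT_1", "C_QUANT_2"]

combined = and_filters(or_group("concept.id", soc_ids),
                       or_group("concept.id", quant_ids))
print(combined)
# concept.id:C_SOC_1|C_SOC_2,concept.id:C_QUANT_1|C_QUANT_2
```

A work therefore has to carry at least one concept from each group to pass the combined filter.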

We define our filters and the dataframes, and define a function to get the works of a batch of authors.
Doing it in this manner enables us to use parallelization to speed up the process.

In [18]:
# Collect filters
filters = ',cited_by_count:>10,authors_count:<10,' + concepts_filter_soc + ',' + concepts_filter_quant

author_ids_list = list(authors['id_fixed'])

# Produce batch indexes for querying authors in bulk
batch_size = 10
author_batches_idx = []
for batch in range(0, len(authors), batch_size):
    author_batches_idx.append((batch, min(batch + batch_size, len(authors))))

# Define dataframes
papers_all = pd.DataFrame(columns=cols_papers)
abstract_all = pd.DataFrame(columns=cols_abstract)

# Function to query the api
def get_works(author_batch, author_ids_list):
    i, j = author_batch
    author_ids_str = '|'.join(author_ids_list[i:j])
    params = {'filter': 'author.id:' + author_ids_str + filters,
              'per-page': '200'}
    next_cursor = '*'

    # Define dataframes
    papers_df = pd.DataFrame(columns=cols_papers)
    abstract_df = pd.DataFrame(columns=cols_abstract)

    # Flip through the pages
    while next_cursor is not None:
        params['cursor'] = next_cursor
        result = requests.get(base_url, params=params).json()
        works = result['results']
        next_cursor = result['meta']['next_cursor']
        for work in works:
            # We take the characters [21:] from the ids in order to avoid the
            # start of the url, and get only the id itself.
            author_ids_work = [author['author']['id'][21:] for author in work['authorships']]
            new_paper = pd.DataFrame([[work[key] if not key == 'author_ids' else author_ids_work
                                       for key in cols_papers]],
                                     columns=cols_papers)
            new_abstract = pd.DataFrame([[work[key] for key in cols_abstract]],
                                        columns=cols_abstract)
            # Concatenate to dataframe
            papers_df = pd.concat([papers_df, new_paper], ignore_index=True)
            abstract_df = pd.concat([abstract_df, new_abstract], ignore_index=True)

    return papers_df, abstract_df
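
The cursor loop in `get_works` can be isolated and checked offline. Below is a minimal sketch under stated assumptions: `iter_pages`, `fetch_page` and `FAKE_PAGES` are hypothetical names, and the fake responses only mimic the `results` and `meta.next_cursor` fields used above.

```python
def iter_pages(fetch_page):
    """Yield every result from a cursor-paginated endpoint.

    `fetch_page(cursor)` is a hypothetical callable returning a dict shaped
    like {'results': [...], 'meta': {'next_cursor': str or None}}.
    """
    cursor = "*"  # starting cursor, as in get_works above
    while cursor is not None:
        page = fetch_page(cursor)
        yield from page["results"]
        cursor = page["meta"]["next_cursor"]


# Fake two-page response, used only to exercise the loop without the API.
FAKE_PAGES = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}],
          "meta": {"next_cursor": "abc"}},
    "abc": {"results": [{"id": "W3"}],
            "meta": {"next_cursor": None}},
}

works = list(iter_pages(FAKE_PAGES.__getitem__))
print([w["id"] for w in works])
```

Collecting the yielded works in a list and building the dataframe once at the end would likely be cheaper than concatenating one row at a time, although with the data volume here the API requests dominate the running time.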

We can now run the query, using the recommended tricks of searching for multiple authors, plus parallel processing.

In [21]:
# Run queries in parallel
result = Parallel(n_jobs=-1)(delayed(get_works)(author_batch, author_ids_list) 
                             for author_batch in tqdm(author_batches_idx))
for pdf, adf in result:
    papers_all = pd.concat([papers_all, pdf], ignore_index=True)
    abstract_all = pd.concat([abstract_all, adf], ignore_index=True)

100%|██████████| 104/104 [00:46<00:00,  2.23it/s]


And now we can save our data.

In [27]:
papers_all.to_pickle('IC2S2_papers.pkl')
abstract_all.to_pickle('IC2S2_abstracts.pkl')

#### Dataset summary.
_How many works are listed in your IC2S2 papers dataframe? How many unique researchers have co-authored these works?_

We take a look:


In [32]:
print('Total number of papers: ' + str(len(papers_all)))

paper_authors = set([ids for author_ids in papers_all['author_ids'] for ids in author_ids])
print('Number of unique authors: ' + str(len(paper_authors)))

Total number of papers: 10457
Number of unique authors: 13810


We see that the dataframe contains 10457 papers, and that these works were authored by 13810 unique authors.

#### Efficiency in code. 
_Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?_

First of all, as much of the filtering as possible was frontloaded in the filters or in the selection of authors, in order to reduce the bottleneck, which was the API requests. Second of all, we searched by groups of authors. This reduced the overhead of connecting to the API for each author, and allowed us to pull more data each time. Lastly, we used parallel processing to use all the compute available. All these things sped up the process to the point where the job only took ~45 seconds, instead of upwards of an hour.
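
The effect of querying authors in groups can be illustrated with a small sketch. `make_batches` is a hypothetical helper and the ids are made up; the count of 1035 is chosen only for illustration and is not our exact number of authors.

```python
def make_batches(ids, batch_size):
    """Split ids into consecutive pipe-joined batches of at most batch_size."""
    return ["|".join(ids[i:i + batch_size])
            for i in range(0, len(ids), batch_size)]


# Made-up ids; the count is for illustration only.
ids = [f"A{n}" for n in range(1035)]
batches = make_batches(ids, batch_size=10)

print(len(ids), "single-author queries vs", len(batches), "batched queries")
print(batches[-1])  # the last batch may be shorter than batch_size
```

Each batched query may still need several requests if its results span more than one page, so the number of batches is a lower bound on the number of requests.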

#### Filtering Criteria and Dataset Relevance 
_Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?_

Only taking authors with work counts between 5 and 5000 eliminates outliers and keeps authors who can be assumed to have worked at least semi-regularly in CSS. The citation count ensures that the papers we collect have relevance in the field, and the limit on the number of authors may again help to weed out outlier papers that are not representative of the field. Using the filter for the different concepts ensures that the papers are indeed relevant to the field of CSS.

These filters provide us with a dataset where we can be fairly sure that all papers are relevant to CSS. However, newer researchers or those who work in smaller, less recognized fields might be underrepresented, which could become an issue.

### Part 4
We first construct the graph from the edge list.

In [90]:
author_set = set(authors['id_fixed'])
edge_dict = {}

for co_authors in papers_all['author_ids']:
    # Check that we only use authors from the author file
    co_authors = list(set(co_authors) & author_set)
    # Leave out papers with only one author
    if len(co_authors) < 2:
        continue
    co_authors.sort()
    author_pairs = list(combinations(co_authors, r=2))
    for author_pair in author_pairs:
        if author_pair not in edge_dict:
            edge_dict[author_pair] = 1
        else:
            edge_dict[author_pair] += 1

edge_list = [[author_pair[0], author_pair[1], weight]
             for author_pair, weight in edge_dict.items()]
G = nx.Graph()
G.add_weighted_edges_from(edge_list, weight='weight')
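
The edge counting above could equivalently be written with `collections.Counter`. This is a minimal sketch on toy data with made-up ids, not a replacement for the cell above; `count_coauthor_edges` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def count_coauthor_edges(papers, allowed_authors):
    """Count how many papers each pair of allowed authors shares."""
    weights = Counter()
    for author_ids in papers:
        # Keep known authors only; sorting makes (a, b) and (b, a) identical.
        kept = sorted(set(author_ids) & allowed_authors)
        # Papers with fewer than two kept authors yield no pairs.
        weights.update(combinations(kept, 2))
    return weights


# Toy data with made-up ids.
papers = [["A1", "A2", "A3"], ["A2", "A1"], ["A3", "A9"], ["A1"]]
allowed = {"A1", "A2", "A3"}
edges = count_coauthor_edges(papers, allowed)
print(sorted(edges.items()))
```

Here the pair `("A1", "A2")` gets weight 2 because it appears on two papers, while `"A9"` is dropped because it is not in the allowed set.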

We also need the first year of publication:

In [65]:
# Get first year of publication
first_pub = papers_all.explode('author_ids').groupby(['author_ids'])['publication_year'].min()
first_pub_df = pd.DataFrame(first_pub)
first_pub_df.index.name = 'idx'
first_pub_df.columns = ['first_pub']
first_pub_df['id_fixed'] = first_pub_df.index
authors = pd.merge(authors, first_pub_df, on='id_fixed')

Finally, we can add the node attributes and save the network.

In [91]:
attribute_dict = {author['id_fixed']: {'display_name': author['display_name'],
                                       'country': author['country_code'],
                                       'citations_count': author['citations_count'],
                                       'first_pub': author['first_pub']}
                  for i, author in authors.iterrows()}
nx.set_node_attributes(G, attribute_dict)

# Write to list
data = nx.readwrite.json_graph.node_link_data(G)

# Write the data to a json file
with open('graph.json', 'w') as f:
    json.dump(data, f)

#### Network Metrics

In [104]:
print('Number of nodes: ', len(G.nodes))
print('Number of edges: ', len(G.edges))
print('Density: ', nx.density(G))
print('Fully connected: ', nx.is_connected(G))
print('Number of components: ', len(list(nx.connected_components(G))))
print('Number of isolates: ', len(list(nx.isolates(G))))

Number of nodes:  405
Number of edges:  754
Density:  0.009216477203275883
Fully connected:  False
Number of components:  42
Number of isolates:  0


From the calculations above we see that the network has 405 nodes and 754 links. The density is 0.009, which is low, meaning the network is indeed sparse. This is expected, since papers take a long time to write and most authors have only written a handful, and have thereby only collaborated with a tiny fraction of the whole field. We also see that the network is not fully connected, and that there are 42 connected components. Since we constructed the graph from an edge list only, there are no isolates.
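
As a sanity check, the density can be recomputed by hand from the node and edge counts, using the standard formula for a simple undirected graph, $2E / (N(N-1))$. The helper name `undirected_density` is ours and is only for illustration.

```python
def undirected_density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2E / (N (N - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Node and edge counts reported in the output above.
density = undirected_density(405, 754)
print(density)
```

The result should agree with the value printed by `nx.density(G)` above, i.e. roughly 0.0092.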

All in all this seems to be what one would expect. For the reasons given above it makes sense that the network is sparse, and for geographical/institutional reasons it makes sense that there are more than a few connected components, despite the globalized nature of modern research.

#### Degree Analysis

In [105]:
degrees = [d for _, d in list(G.degree)]
print('Degree mean: ', mean(degrees))
print('Degree median: ', median(degrees))
print('Degree mode: ', mode(degrees))

Degree mean:  3.723456790123457
Degree median:  2
Degree mode:  1


In [106]:
strength = [d for _, d in list(G.degree(weight='weight'))]
print('Strength mean: ', mean(strength))
print('Strength median: ', median(strength))
print('Strength mode: ', mode(strength))

Strength mean:  25.145679012345678
Strength median:  12
Strength mode:  2


From the degree analysis we see that there are many nodes with low degree/strength, and fewer nodes with higher values (mean>median>mode). This makes sense, in that there are few authors with many papers and collaborations (often in leadership positions) and many authors with fewer papers, probably lots of PhD students and the like.
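
Two small checks support this reading. First, in an undirected graph every edge contributes to the degree of two nodes, so the mean degree must equal $2E/N$. Second, a right-skewed sequence produces the ordering mean > median > mode. The sketch below uses the node and edge counts reported earlier, plus a toy degree sequence that is made up for illustration.

```python
from statistics import mean, median, mode

# Node and edge counts from the network metrics output.
n_nodes, n_edges = 405, 754
mean_degree = 2 * n_edges / n_nodes
print(mean_degree)

# Toy right-skewed degree sequence (made up): many low values, one hub.
toy_degrees = [1, 1, 1, 2, 2, 3, 5, 15]
print(mean(toy_degrees), median(toy_degrees), mode(toy_degrees))
```

The first value should match the degree mean printed above, and the toy sequence shows how a single high-degree node pulls the mean above the median.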

#### Top Authors

In [118]:
sort_idxs = sorted(range(len(degrees)),key=degrees.__getitem__)
degrees_sorted = [list(G.degree)[i] for i in sort_idxs]
top_five = degrees_sorted[:-6:-1]
print('Top five authors:')
for author_id, degree in top_five:
    author_name = authors.loc[authors['id_fixed']==author_id]['display_name'].values[0]
    print(author_name, ': ', degree)

Top five authors:
Ciro Cattuto :  22
Nicola Perra :  20
Sune Lehmann :  19
Filippo Menczer :  18
Alain Barrat :  18
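
A more compact way to pick the top nodes is `heapq.nlargest`. This is a sketch on made-up degrees, not the code that produced the output above, and `top_k_by_degree` is a hypothetical helper name.

```python
import heapq


def top_k_by_degree(degree_by_node, k):
    """Return the k (node, degree) pairs with the highest degree."""
    return heapq.nlargest(k, degree_by_node.items(), key=lambda item: item[1])


# Made-up degrees, for illustration only.
toy_degrees = {"a": 3, "b": 7, "c": 1, "d": 5}
print(top_k_by_degree(toy_degrees, 2))
```

With the graph above, one would presumably call it as `top_k_by_degree(dict(G.degree), 5)`. The order among nodes with equal degree may differ from the listing above.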


We do a little research on each author:

__Ciro Cattuto__: Scientific Director of ISI Foundation, and a founder and principal investigator of the SocioPatterns collaboration. _"My work focuses on measuring and understanding complex phenomena in systems that entangle human behaviors and digital platforms."_

__Nicola Perra__: _"I serve as Reader in Applied Mathematics at Queen Mary University of London, UK and chair of the British Chapter of the network Society."_

__Sune Lehmann__: _"I’m a Professor of Networks and Complexity Science at DTU Compute, Technical University of Denmark. I’m also a Professor of Social Data Science at the Center for Social Data Science (SODAS), University of Copenhagen."_

__Filippo Menczer__: University Distinguished Professor and the Luddy Professor of Informatics and Computer Science at the Luddy School of Informatics, Computing, and Engineering, Indiana University.

__Alain Barrat__: Senior researcher affiliated with CNRS, CPT, and Turing Center for Living systems in Marseille, France.

__Comments__: We would say all of these seem highly related to CSS, based on their descriptions of their work. These are the authors with the most distinct co-authors within the authors dataset, and it makes sense that most of them occupy leadership positions in institutes focused on CSS or related fields, since getting authorships on hundreds of papers probably means that you oversee a lot of projects. Given that these are the most heavily connected authors, they serve as 'hubs' within their institutes and the field at large.