# Assignment 1

Link to github: https://github.com/NikolajT84/CSS_assignment1

### Collaboration

We have written the code working together on Nikolaj's machine, which is why all commits are from the same user.

In [80]:
# Imports for all exercises
import json
import requests
import pandas as pd
import networkx as nx
from tqdm import tqdm
from bs4 import BeautifulSoup
from itertools import combinations
from joblib import Parallel, delayed
from statistics import mean, mode, median

### Part 1

We follow the instructions and use bs4 (BeautifulSoup) to parse the HTML.

In [5]:
# Get the HTML content
link = 'https://ic2s2-2023.org/program'
r = requests.get(link)
soup = BeautifulSoup(r.content)

# Find all the lists of things going on
programs = soup.find_all("ul", {"class":"nav_list"})

# Find all lists of authors
program_authors = [program.find_all("i") for program in programs]

# Get the text
all_authors = [author.text for authors in program_authors for author in authors]

# Get the individual names
all_names = [name.lower() for authors in all_authors for name in authors.split(", ")]
all_names_unique = list(set(all_names))

print(len(all_names_unique))
print(all_names_unique[:100])

# Save file as json
with open('authors.json', 'w') as f:
    json.dump(all_names_unique, f)

1472
['kristoffer lind glavind', 'matthew deverna', 'sho cho', 'balazs vedres', 'manju bura', 'anya hommadova lu', 'sagar kumar', 'piotr bródka', 'laura maria alessandretti', 'salvatore giorgi', 'naoki yoshinaga', 'ashlyn b. aske', 'nakao ran', 'sanja scepanovic', 'camille testard', 'sharon kang', 'alexandra segerberg', 'angelo brayner', 'michael cook', 'shaun bevan', 'laura boeschoten', 'yuan zhang', 'indrajeet patil', 'michael szell', 'daniele rama', 'anna rogers', 'neeley pate', 'rob chew', 'jon roozenbeek', 'louis boucherie', 'kai-cheng yang', 'xindi wang', 'brenda curtis', 'ryan louis stevens', 'nalette brodnax', 'andrea failla', 'miriam hurtado bodell', 'daniel larremore', 'lisette espin-noboa', 'zishan lan', 'edmond awad', 'vadim voskresenskii', 'yu-wen chen', 'katayoun farrahi', 'zoe k. rahwan', 'dan dai', 'ivano bison', 'd. sunshine hillygus', 'filipi nascimento silva', 'miriam redi', 'zhemeng xie', 'gemma read', 'gerardo iñiguez', 'cynthia rudin', 'yuan liao', 'fredrik jansso

_5. How many unique researchers do you get?_

As the code output above shows, we get 1472 unique authors.

_6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices?_

We first open the webpage and inspect its HTML. We notice that all the programs sit in elements marked 'ul' with the class name 'nav_list'. We collect these lists and extract the names, which are contained in the elements marked 'i'. Each 'i' element holds a comma-separated list of authors, so we split on ", ", lowercase the names (which merges entries that differ only in capitalization), and remove duplicates with a set. Taking all the navigable lists ensures that we get names from all types of events. To assess the quality of the final list, we print a subset of the names and check that the entries are indeed names and not something else.
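
Beyond eyeballing a subset, a few automatic checks could flag entries that are probably not names. The sketch below is not part of our pipeline: the helper `flag_suspicious` and its rules (digits, fewer than two tokens, unusually long strings) are assumptions chosen for illustration, and the sample list is made up.

```python
import re

def flag_suspicious(names, max_len=40):
    """Return the entries that look unlikely to be a person's name.

    The rules are illustrative assumptions, not a validated heuristic.
    """
    flagged = []
    for name in names:
        tokens = name.split()
        has_digit = re.search(r"\d", name) is not None
        if has_digit or len(tokens) < 2 or len(name) > max_len:
            flagged.append(name)
    return flagged

# Hypothetical sample: two plausible names and three entries that should be flagged
sample = ["michael szell", "kai-cheng yang", "chair", "room 2.14", ""]
print(flag_suspicious(sample))
```

Anything flagged this way would still need a manual look, since legitimate single-token names exist.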

### Part 2

_1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book._

For the custom-made data used by Centola, there are the usual downsides to this type of data: it probably took quite a lot of effort to set up and design the experiment, and they couldn't be sure beforehand how many people they would get to sign up to the network. On the other hand, they got the exact, specific data they wanted, and were thus able to get more out of their limited data.

For the ready-made data used in the second study, the advantage was that all the data had been collected beforehand, so the researchers could jump right to analyzing it. The volume of data from the fitness tracker was most likely also far larger than anything they could have collected themselves. The fact that the data is collected over time provides the added advantage of facilitating longitudinal analysis. The downside is that, since they had no control over the collection process, they had to be more careful when doing their causal inference: some variables could not be measured directly, and the biases of the collection process are harder to uncover.

_2. How do you think these differences can influence the interpretation of the results in each study?_

With ready-made data, one will often have to do some data manipulation to infer the variables of interest. This can add uncertainty to the conclusions drawn from the study. On the other hand, with custom-made data, one will often have to extrapolate or do inference from fewer data points, since the collection process is more arduous. This can limit the value of the study's conclusions.



### Part 3

Here we use the author database that we collected in week 2.
We first get all authors with a works count above 5 and below 5000 (both bounds are exclusive in the code).

In [64]:
authors = pd.read_pickle('authors.pkl')
authors = authors.loc[(authors['works_count'] > 5) &
                      (authors['works_count'] < 5000)]

We then construct the filter for the concepts, i.e.
"(Sociology|Psychology|Economics|Political science)
& (Mathematics|Physics|Computer science)",
but using the concept IDs instead of the names, so we first need to look these up.

In [35]:
# Define the base url
base_url = 'https://api.openalex.org/works'

# Columns for the dataframes
cols_papers = ['id', 'publication_year', 'cited_by_count', 'author_ids']
cols_abstract = ['id', 'title', 'abstract_inverted_index']

# Produce the concept filter
# First get all high-level concepts
concepts_url = 'https://api.openalex.org/concepts'
params_concepts = {'filter': 'level:0'}
result_concepts = requests.get(concepts_url, params=params_concepts).json()['results']
concepts = {concept['display_name']: concept['id'] for concept in result_concepts}

# Then define the lists of each category
soc_concepts = ['Sociology', 'Psychology', 'Economics', 'Political science']
quant_concepts = ['Mathematics', 'Physics', 'Computer science']

# Construct the conditions for each category
condition_soc = '|'.join(concepts[c] for c in soc_concepts)
condition_quant = '|'.join(concepts[c] for c in quant_concepts)

# Construct filter
concepts_filter_soc = f'concept.id:{condition_soc}'
concepts_filter_quant = f'concept.id:{condition_quant}'
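
To make the structure of the filter concrete, the sketch below builds the same kind of string from placeholder IDs. The IDs `C1` to `C7` and the helper `or_filter` are made up for illustration (the real IDs come from the query above). In the OpenAlex filter syntax, `|` within one filter means OR, and a comma between filters means AND.

```python
# Placeholder concept IDs, hypothetical and for illustration only
example_concepts = {
    'Sociology': 'C1', 'Psychology': 'C2', 'Economics': 'C3',
    'Political science': 'C4', 'Mathematics': 'C5', 'Physics': 'C6',
    'Computer science': 'C7',
}
soc = ['Sociology', 'Psychology', 'Economics', 'Political science']
quant = ['Mathematics', 'Physics', 'Computer science']

def or_filter(names, lookup):
    """Join the IDs of the given concepts with '|' (OR within one filter)."""
    return 'concept.id:' + '|'.join(lookup[name] for name in names)

# A comma between the two filters requires both to hold (AND)
example_filter = or_filter(soc, example_concepts) + ',' + or_filter(quant, example_concepts)
print(example_filter)
```

A work passes this filter only if it carries at least one social-science concept and at least one quantitative concept.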

We define our filters and the dataframes, and define a function to get the works of a batch of authors.
Doing it in this manner enables us to use parallelization to speed up the process.

In [18]:
# Collect filters
filters = ',cited_by_count:>10,authors_count:<10,' + concepts_filter_soc + ',' + concepts_filter_quant

author_ids_list = list(authors['id_fixed'])

# Produce batch indexes for querying authors in bulk
batch_size = 10
author_batches_idx = []
for batch in range(0, len(authors), batch_size):
    author_batches_idx.append((batch, min(batch + batch_size, len(authors))))

# Define dataframes
papers_all = pd.DataFrame(columns=cols_papers)
abstract_all = pd.DataFrame(columns=cols_abstract)

# Function to query the api
def get_works(author_batch, author_ids_list):
    i, j = author_batch
    author_ids_str = '|'.join(author_ids_list[i:j])
    params = {'filter': 'author.id:' + author_ids_str + filters,
              'per-page': '200'}
    next_cursor = '*'

    # Define dataframes
    papers_df = pd.DataFrame(columns=cols_papers)
    abstract_df = pd.DataFrame(columns=cols_abstract)

    # Flip through the pages
    while next_cursor is not None:
        params['cursor'] = next_cursor
        result = requests.get(base_url, params=params).json()
        works = result['results']
        next_cursor = result['meta']['next_cursor']
        for work in works:
            # We take the characters [21:] from the ids in order to avoid the
            # start of the url, and get only the id itself.
            author_ids_work = [author['author']['id'][21:] for author in work['authorships']]
            new_paper = pd.DataFrame([[work[key] if not key == 'author_ids' else author_ids_work
                                       for key in cols_papers]],
                                     columns=cols_papers)
            new_abstract = pd.DataFrame([[work[key] for key in cols_abstract]],
                                        columns=cols_abstract)
            # Concatenate to dataframe
            papers_df = pd.concat([papers_df, new_paper], ignore_index=True)
            abstract_df = pd.concat([abstract_df, new_abstract], ignore_index=True)

    return papers_df, abstract_df
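
The cursor loop inside `get_works` can be isolated as a small generator. The sketch below is only an illustration under stated assumptions: `fetch_page` is a hypothetical stand-in for the `requests.get(...).json()` call, and `fake_fetch` simulates three pages so that the loop can be run without network access.

```python
def iter_results(fetch_page):
    """Yield every item across pages until the API reports no next cursor."""
    cursor = '*'  # cursor paging starts with '*', as in get_works
    while cursor is not None:
        page = fetch_page(cursor)
        yield from page['results']
        cursor = page['meta']['next_cursor']

# Simulated API (hypothetical data): maps each cursor to a page and its successor
_pages = {
    '*': {'results': [1, 2], 'meta': {'next_cursor': 'a'}},
    'a': {'results': [3], 'meta': {'next_cursor': 'b'}},
    'b': {'results': [], 'meta': {'next_cursor': None}},
}

def fake_fetch(cursor):
    """Return the simulated page for the given cursor."""
    return _pages[cursor]

collected = list(iter_results(fake_fetch))
print(collected)
```

The loop ends on the first page whose `next_cursor` is `None`, which is the same stopping condition used in `get_works`.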

We can now run the query, using the recommended tricks of searching for multiple authors at once, plus parallel processing.

In [21]:
# Run queries in parallel
result = Parallel(n_jobs=-1)(delayed(get_works)(author_batch, author_ids_list) 
                             for author_batch in tqdm(author_batches_idx))
for pdf, adf in result:
    papers_all = pd.concat([papers_all, pdf], ignore_index=True)
    abstract_all = pd.concat([abstract_all, adf], ignore_index=True)

100%|██████████| 104/104 [00:46<00:00,  2.23it/s]


And now we can save our data.

In [27]:
papers_all.to_pickle('IC2S2_papers.pkl')
abstract_all.to_pickle('IC2S2_abstracts.pkl')

#### Dataset summary.
_How many works are listed in your IC2S2 papers dataframe? How many unique researchers have co-authored these works?_

We take a look:


In [32]:
print('Total number of papers: ' + str(len(papers_all)))

paper_authors = set([ids for author_ids in papers_all['author_ids'] for ids in author_ids])
print('Number of unique authors: ' + str(len(paper_authors)))

Total number of papers: 10457
Number of unique authors: 13810


We see that the dataframe contains 10457 papers, which were written by 13810 unique authors.

#### Efficiency in code. 
_Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?_

First of all, as much of the filtering as possible was frontloaded in the filters or in the selection of authors, in order to reduce the bottleneck, which was the API requests. Second of all, we searched by groups of authors. This reduced the overhead of connecting to the API for each author, and allowed us to pull more data each time. Lastly, we used parallel processing to use all the compute available. All these things sped up the process to the point where the job only took ~45 seconds, instead of upwards of an hour.
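
The effect of batching on the number of queries can be sketched as follows. `batch_ranges` mirrors the index construction used in Part 3; the author count of 1000 is a hypothetical figure for illustration, not the size of our dataset.

```python
def batch_ranges(n_items, batch_size):
    """Return (start, stop) index pairs covering range(n_items) in chunks."""
    return [(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]

n_authors = 1000  # hypothetical count, for illustration only
batches = batch_ranges(n_authors, batch_size=10)

# One query per batch instead of one per author
# (each query may still need several requests if its results span multiple pages)
print(len(batches), 'batched queries instead of', n_authors)
```

Because the batches are independent of each other, they can be handed to separate workers, which is what makes the parallel step possible.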

#### Filtering Criteria and Dataset Relevance 
_Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?_

Only taking authors with work counts between 5 and 5000 eliminates outliers and keeps authors who can be assumed to have worked at least semi-regularly in CSS. The citation threshold (more than 10 citations) ensures that the papers we collect have had some impact in the field, and the cap on the number of authors (fewer than 10) helps weed out outlier papers that are not representative of the field. The concept filters ensure that each paper combines a social science with a quantitative discipline, which is what makes it relevant to CSS.

These filters give us a dataset in which we can be fairly sure that all papers are relevant to CSS. However, newer researchers and those who work in smaller, less recognized fields might be underrepresented, since they are less likely to pass the works-count and citation thresholds. This could become an issue.

### Part 4

In [41]:
authors

['A5017095669', 'A5035808622', 'A5085139454', 'A5088539840']

In [90]:
author_set = set(authors['id_fixed'])
edge_dict = {}

for co_authors in papers_all['author_ids']:
    # Check that we only use authors from the author file
    co_authors = list(set(co_authors) & author_set)
    # Leave out papers with only one author
    if len(co_authors) < 2:
        continue
    co_authors.sort()
    author_pairs = list(combinations(co_authors, r=2))
    for author_pair in author_pairs:
        if author_pair not in edge_dict:
            edge_dict[author_pair] = 1
        else:
            edge_dict[author_pair] += 1

# Turn the {(author_a, author_b): weight} dict into (author_a, author_b, weight) triples
edge_list = [(author_a, author_b, weight)
             for (author_a, author_b), weight in edge_dict.items()]
G = nx.Graph()
G.add_weighted_edges_from(edge_list, weight='weight')
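
The pair-counting step above can also be written with `collections.Counter`. The sketch below uses made-up author IDs and papers, and is meant only to illustrate how the co-authorship weights arise; it is not a replacement for the cell above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical papers, each given as a list of author IDs
papers = [['A1', 'A2', 'A3'], ['A2', 'A1'], ['A4'], ['A3', 'A9']]
known_authors = {'A1', 'A2', 'A3', 'A4'}  # 'A9' is not in the author file

edge_weights = Counter()
for co_authors in papers:
    # Keep only known authors; sorting makes (A1, A2) and (A2, A1) the same key
    kept = sorted(set(co_authors) & known_authors)
    # combinations yields nothing when fewer than two authors remain
    edge_weights.update(combinations(kept, 2))

print(dict(edge_weights))
```

The weight of an edge is thus the number of papers in the dataset that the two authors share.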

In [65]:
# Get first year of publication
first_pub = papers_all.explode('author_ids').groupby(['author_ids'])['publication_year'].min()
first_pub_df = pd.DataFrame(first_pub)
first_pub_df.index.name = 'idx'
first_pub_df.columns = ['first_pub']
first_pub_df['id_fixed'] = first_pub_df.index
authors = pd.merge(authors, first_pub_df, on='id_fixed')

In [91]:
attribute_dict = {author['id_fixed']: {'display_name': author['display_name'],
                                       'country': author['country_code'],
                                       'citations_count': author['citations_count'],
                                       'first_pub': author['first_pub']}
                  for i, author in authors.iterrows()}
nx.set_node_attributes(G, attribute_dict)

# Write to list
data = nx.readwrite.json_graph.node_link_data(G)

# Write the data to a json file
with open('graph.json', 'w') as f:
    json.dump(data, f)

In [92]:
print(len(G.nodes))
print(len(G.edges))
print(nx.density(G))
print(nx.is_connected(G))
print(len(list(nx.connected_components(G))))
print(len(list(nx.isolates(G))))

405
754
0.009216477203275883
False
42
0


In [93]:
degrees = [d for _, d in list(G.degree)]
print(mean(degrees))
print(median(degrees))
print(mode(degrees))

3.723456790123457
1
2
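
As a sanity check, the density and the mean degree can be recomputed by hand from the node and edge counts printed above, using the standard formulas for an undirected simple graph with `n` nodes and `m` edges: density is `2m / (n(n - 1))` and mean degree is `2m / n`.

```python
n_nodes = 405  # number of nodes, from the output above
n_edges = 754  # number of edges, from the output above

# Fraction of all possible node pairs that are actually connected
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

# Each edge contributes to the degree of two nodes
mean_degree = 2 * n_edges / n_nodes

print(density)
print(mean_degree)
```

Both values should agree with the results of `nx.density(G)` and `mean(degrees)` above, up to floating-point rounding.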


In [95]:
strength = [d for _, d in list(G.degree(weight='weight'))]
print(mean(strength))
print(median(strength))
print(mode(strength))

25.145679012345678
2
12


In [103]:
# Sort the nodes by strength (weighted degree), strongest first, and show the top 5
strengths_sorted = sorted(G.degree(weight='weight'), key=lambda pair: pair[1], reverse=True)
top_5 = strengths_sorted[:5]
print(top_5)
for author_id, _ in top_5:
    print(authors.loc[authors['id_fixed'] == author_id, 'display_name'].values[0])

[('A5021346979', 340), ('A5011228873', 278), ('A5069885186', 228), ('A5014662127', 202), ('A5034406723', 172)]
Filippo Menczer
Alessandro Flammini
Ciro Cattuto
Alain Barrat
Luca Maria Aiello
