### GitHub Repository
https://github.com/MathiasDamsgaard/Comp_Social_Sci_Assigments.git

### Contribution statement
We all helped each other with the different parts of the assignment, as we sat together when reviewing the exercises from the first four weeks. Andreas finished Part 1, Mathias corrected the answers for Part 2, and Anton added the code for Parts 3 and 4, before we all looked through each other's work.

In [1]:
# Imports
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import networkx as nx
import time
import concurrent.futures
from tqdm import tqdm
import ast
from itertools import combinations
from collections import defaultdict

## Part 1: Web-scraping

> **Exercise: Web-scraping the list of participants to the International Conference in Computational Social Science**    
>
> You can find the programme of the 2023 edition of the conference at [this link](https://ic2s2-2023.org/program). As you can see the conference programme included many different contributions: keynote presentations, parallel talks, tutorials, posters. 
> 1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported in the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling. 
> 2. Some instructions for success: 
>    * First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.   
>    * Use the [BeautifulSoup Python package](https://pypi.org/project/beautifulsoup4/) to navigate through the hierarchy and extract the elements you need from the page. 
>    * You can use the [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to find elements that match specific filters. Check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) of the library for detailed explanations on how to set filters.  
>    * Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas, or other unwanted characters)
>    * The overall idea is to adapt the procedure I have used [here](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/ScreenScraping.ipynb) for the specific page you are scraping. 
> 3. Create the set of unique researchers that joined the conference and *store it into a file*.
>     * *Important:* If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible. 
> 4. *Optional:* For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at [this link](https://ic2s2-2023.org/program_committee); (ii) the organizers of tutorials, that can be found at [this link](https://ic2s2-2023.org/tutorials)

In [2]:
# The link to the conference program
LINK = "https://ic2s2-2023.org/program"
r = requests.get(LINK) 

# Parse the HTML content of the page (naming the parser explicitly avoids a parser-guessing warning)
soup = BeautifulSoup(r.content, "html.parser") 

# Find the table with the keynote speakers
table1 = soup.find("table",{"class":"tutorials"})

# Find all 'a' tags within the table as all the names are within 'a' tags
links = table1.find_all('a')

# Extract the text from each 'a' tag (which should be the names)
keynote_names = [link.text for link in links if "Keynote" in link.text]

# Split each string on ' - ' and keep the second part, as this is the name of the speaker
# .strip() removes any leading/trailing whitespace, and .title() capitalizes the first letter
# of each word and lowercases the rest, so inconsistent capitalization does not create duplicates
names = [name.split(' - ')[1].strip().title() for name in keynote_names]

In [3]:
# Find all the containers with the plenary talks
# As we noticed that the plenary speakers were all in italics, we use that to find them all at once
table2 = soup.find_all("div",{"class":"col-9 col-12-medium"})

# Loop through the containers and extract elements with 'i' tags
for tab in table2:
    plenary_talks = tab.find_all('i')
    # Extract the text from each 'i' tag (which should be the names)
    plenary_talks = [talkers.text for talkers in plenary_talks]

    # Process each string in the list
    for talk in plenary_talks:
        # If the string starts with 'Chair:', remove that part and trim whitespace
        if talk.startswith('Chair:'):
            talk = talk.replace('Chair:', '').strip().title()

        # If the string contains a comma, split it into multiple names
        if ',' in talk:
            names.extend([name.strip().title() for name in talk.split(',')])

        # If the string does not contain a comma, it's a single name. Pass the name to the list of names
        else:
            names.append(talk)
    

# Convert the list to a set, which removes duplicates and then back to a sorted list
unique_names = sorted(set(names))

# Print the length of the list with unique names
print(f"Number of unique researchers found: {len(unique_names)}")


# Create a dataframe from the list and save it to a CSV file
df = pd.DataFrame(unique_names)
df.to_csv('data/names.csv', index=False, header=False)

Number of unique researchers found: 1488


> 5. How many unique researchers do you get?

We got 1488 unique researchers.

> 6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices __(answer in max 150 words)__.

We scraped the names by first identifying the table containing the keynote speakers, where every speaker is listed in an “a” tag. Keeping only the tags whose text contains the word ‘Keynote’ gave us the 10 keynote speakers. For the plenary speakers, we noticed that their names were written in italics (“i” tags). We therefore extracted all “i” tags from the rest of the page and handled three cases: names starting with ‘Chair:’, comma-separated names, and single names. Since only names were written in italics, we are fairly confident that we captured nearly all of them. 
To retrieve the names accurately, we used functions like .strip() and .title() to remove surrounding whitespace and standardize capitalization, so that the same name written with different capitalization is not counted twice. 
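
The three cases described above can be sketched as one small helper. This is a minimal illustration rather than the exact notebook code: the function name `parse_names` and the sample fragments are made up for the example, and no web request is involved.

```python
def parse_names(raw: str) -> list[str]:
    """Split one italic text fragment into cleaned, title-cased names."""
    text = raw.strip()
    # Case 1: drop a leading 'Chair:' label if present
    if text.startswith("Chair:"):
        text = text[len("Chair:"):]
    # Cases 2 and 3: split on commas; a fragment without commas yields one name
    parts = [part.strip() for part in text.split(",")]
    # Collapse repeated whitespace, normalize capitalization, skip empty pieces
    return [" ".join(part.split()).title() for part in parts if part]


# Hypothetical fragments standing in for the text of the 'i' tags
fragments = ["Chair: jane DOE", "John Smith,  Jane Doe", " ada lovelace "]
unique_names = sorted({name for frag in fragments for name in parse_names(frag)})
print(unique_names)
```

One trade-off to keep in mind is that `.title()` also lowercases letters inside a word, so a name such as "McDonald" becomes "Mcdonald".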


## Part 2: Ready Made vs Custom Made Data

> **Exercise: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood the key points of my lecture and the reading. 
>
> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book __(answer in max 150 words)__.

- The benefit of custom-made data is the influence you have on how the data is collected, and thereby on its representativeness, sensitivity, and completeness. A con is how resource-heavy collecting it is. Participants are also often aware that they are part of an experiment, which can make them reactive and behave in a certain way. In Centola's experiment, the subjects were not told that there was a test and a control group, which makes their behaviour more believable.
- The opposite often holds for ready-made data: you have access to big data, but not all of it may be useful to you, and it requires a lot of clean-up and consideration of its truthfulness. In Nicolaides's study the data matched what they wanted to do; however, they had no way of knowing the individuals they were analyzing, or whether the data reflects what they wanted to investigate.

> 2. How do you think these differences can influence the interpretation of the results in each study? __(answer in max 150 words)__

When choosing between custom-made and ready-made data, researchers must consider the benefits and consequences of that choice, some of which were explained above. If they fail to properly account for possible trends, relevancy, representativeness, or a possible confounder for the behaviour, the analysis loses its reliability. Custom-made data will generally be more reliable, but its higher cost and smaller sample size can lead to a skewed distribution of participants. These differences make it hard to decide whether to rely on big data, as the ten common characteristics also highlight relevant problems, some of which have already been explained. Doing the necessary groundwork is therefore important, and if the considerations are properly presented, their influence on the results of a study such as Nicolaides's will become smaller.

## Part 3: Gathering Research Articles using the OpenAlex API

> **Exercise : Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2023 conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**
>
> 
> **Steps:**
>  
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
> 
>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest to adjust this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
>
> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>  
>
> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
> 
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts). 
>
> **Efficiency Tips:**
> Writing efficient code in this exercise is **crucial**. To speed up your process:
> - **Apply filters directly in your request:** When possible, use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) of the *works* endpoint to apply the filters above directly in your API request, ensuring only relevant data is returned. Learn about combining multiple filters [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists).  
> - **Bulk requests:** Instead of sending one request for each author, you can use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) to query works by multiple authors in a single request. *Note: My testing suggests that can only include up to 25 authors per request.*
> - **Use multiprocessing:** Implement multiprocessing to handle multiple requests simultaneously. I highly recommend [Joblib’s Parallel](https://joblib.readthedocs.io/en/stable/) function for that, and [tqdm](https://tqdm.github.io/) can help monitor progress of your jobs. Remember to stay within [the rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 10 requests per second.
>
>
>   
> For reference, employing these strategies allowed me to fetch the data in about 30 seconds using 5 cores on my laptop. I obtained a dataset of approximately 25 MB (including both the *IC2S2 abstracts* and *IC2S2 papers* files).



In [30]:
# Load the data from Week 2 Part 3 - we uploaded the data from the exercise in the data folder ourselves instead of adding the code
authors = pd.read_csv('data/authors.csv')
authors = authors[(authors["works_count"] >= 5) & (authors["works_count"] <= 5000)]
authors = authors['id'].tolist()

In [31]:
# We get the concepts for the different fields 
concept_url = 'https://api.openalex.org/concepts'
filter_conc1 = 'level:0,display_name:Sociology|Psychology|Economics|Political science'
filter_conc2 = 'level:0,display_name:Mathematics|Physics|Computer science'

select = ['id']

params1 = {'filter': filter_conc1, 'select': select}
params2 = {'filter': filter_conc2, 'select': select}

response1 = requests.get(concept_url, params=params1)
response2 = requests.get(concept_url, params=params2)

result1 = response1.json().get('results')
result2 = response2.json().get('results')

# We get the ids of the concepts to create the filters for the works
concepts1 = '|'.join([concept['id'] for concept in result1])
concepts2 = '|'.join([concept['id'] for concept in result2])

# Furthermore we add the filters for the works with the found concepts
work_url = 'https://api.openalex.org/works'
select = 'id,publication_year,cited_by_count,authorships,title,abstract_inverted_index'
filters = 'cited_by_count:>10,authors_count:<10,concepts.id:' + concepts1 + ',concepts.id:' + concepts2

limit = 200

params = {'filter': filters, 'select': select, 'per-page': limit}

In [32]:
# We create a function to get the data from the API, and handle the pagination and errors
def fetch_data(author, url, params):
    try:
        # Work on a copy, so that worker threads do not overwrite each other's filter and cursor
        params = dict(params)
        params['filter'] = filters + ',author.id:' + '|'.join(author)
        cursor = '*'
        papers = []
        abstracts = []

        while cursor:
            params['cursor'] = cursor
            response = requests.get(url, params=params)
            
            # We handle different errors that can occur
            if response.status_code == 429:
                print("Rate limit exceeded. Sleeping for 5 seconds.")
                time.sleep(5)
                response = requests.get(url, params=params)  # Retry the request once

            if response.status_code == 403:
                print("Forbidden access. Please check your API credentials or contact the API provider.")
                return [], []
            
            elif response.status_code != 200:
                print(f"Error occurred while fetching data for author {author}: {response.text}")
                return [], []

            # Decode the response body once and reuse it
            payload = response.json()
            data = payload.get('results')
            meta = payload.get('meta')

            if not data:
                break

            # Retrieve the relevant data into the lists
            for paper in data:
                papers.append([paper['id'], paper['publication_year'], paper['cited_by_count'],
                               [authorship['author']['id'] for authorship in paper['authorships']]])
                
                abstracts.append([paper['id'], paper['title'], paper['abstract_inverted_index']])

            cursor = meta['next_cursor']  

        return papers, abstracts
    
    except Exception as e:
        print(f"Error occurred while fetching data for author {author}: {e}")
        return [], []
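
The cursor loop in `fetch_data` can be looked at in isolation. The sketch below assumes only the response shape used in this notebook (a `results` list and a `meta.next_cursor` value that becomes `None` when there are no more pages). `collect_all` and `FAKE_PAGES` are hypothetical names, and the dictionary stands in for real API responses so the example runs offline.

```python
def collect_all(fetch_page):
    """Follow next_cursor values until the API reports no further page."""
    cursor, results = "*", []
    while cursor:
        page = fetch_page(cursor)
        results.extend(page["results"])
        # The last page carries no cursor, which ends the loop
        cursor = page["meta"]["next_cursor"]
    return results


# Hypothetical stand-in for requests.get(...).json(), keyed by cursor value
FAKE_PAGES = {
    "*": {"results": ["W1", "W2"], "meta": {"next_cursor": "abc"}},
    "abc": {"results": ["W3"], "meta": {"next_cursor": "def"}},
    "def": {"results": [], "meta": {"next_cursor": None}},
}

works = collect_all(FAKE_PAGES.__getitem__)
print(works)
```

Passing the page fetcher in as a function keeps the paging logic separate from the HTTP and error handling, which makes it easy to test without hitting the rate limit.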

In [39]:
papers_data = []
abstracts_data = []

# We use a bulk size of 25 authors to get data more quickly.
bulk_size = 25
author_chunks = [authors[i:i + bulk_size] for i in range(0, len(authors), bulk_size)]

# We use 3 worker threads, which let us get the data without hitting the rate limit.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # We fetch the data from the API using the function.
    results = list(tqdm(executor.map(fetch_data, author_chunks, [work_url]*len(author_chunks),
                                     [params]*len(author_chunks)), total=len(author_chunks)))
    for result in results:
        papers_data.extend(result[0])
        abstracts_data.extend(result[1])

  0%|          | 0/43 [00:00<?, ?it/s]

100%|██████████| 43/43 [00:34<00:00,  1.23it/s]


In [41]:
# Organize the data into dataframes
papers_df = pd.DataFrame(papers_data, columns=["id", "publication_year", "cited_by_count", "coauthor_ids"])
abstracts_df = pd.DataFrame(abstracts_data, columns=["id", "title", "abstract_inverted_index"])
papers_df.drop_duplicates(inplace=True, subset=['id'])
abstracts_df.drop_duplicates(inplace=True, subset=['id'])

# Save the dataframes to csv files
papers_df.to_csv("data/IC2S2_papers.csv", index=False)
abstracts_df.to_csv("data/IC2S2_abstracts.csv", index=False)

# Print answers to the questions
print(f"Total number of papers: {len(papers_df)}")
print(f"Total number of unique authors: {papers_df['coauthor_ids'].explode().nunique()}")

Total number of papers: 7726
Total number of unique authors: 12247


> **Data Overview and Reflection questions:** Answer the following questions: 
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works? 

We get 7726 works in our IC2S2 papers dataframe, and 12247 unique researchers have co-authored these works. However, the code seems to generate a different number each time, often varying by a couple of hundred; the numbers above are the most recently generated.

> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__

We requested authors in bulks of 25 to get the data faster, and ran the requests in a small thread pool (`max_workers=3` in the code above) without hitting the rate limit of the API. This allowed us to get the data in a reasonable time of around 30 seconds. Furthermore, we applied the filters directly in the request, and we filtered the authors dataframe beforehand so that we only requested authors with enough works. This gave us the data we wanted without having to filter it afterwards, which made the process more efficient. However, we still had to handle pagination and the errors that could occur.
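
The bulk-request idea can be illustrated without calling the API. The sketch below only builds the per-request filter strings; the helper names `chunked` and `author_filter` and the author IDs are made up for the example, and the base filter is shortened compared to the one used in the notebook.

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def author_filter(base_filter, author_ids):
    """Append an OR-joined author.id clause to an existing filter string."""
    return base_filter + ",author.id:" + "|".join(author_ids)


# Hypothetical author IDs; the real ones come from data/authors.csv
author_ids = [f"A{n}" for n in range(1, 61)]
chunks = chunked(author_ids, 25)
filters_per_request = [
    author_filter("cited_by_count:>10,authors_count:<10", chunk) for chunk in chunks
]

print(len(chunks), [len(chunk) for chunk in chunks])
```

With 60 authors this gives three requests instead of 60, which is where most of the time saving comes from.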

> - **Filtering Criteria and Dataset Relevance**. Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

Some of the filters make sense to incorporate. Looking only at well-cited works ensures that we consider works that have made an impact on the field. The same goes for removing authors with very few works, as they probably have not become well-established yet. This can of course mean that an author who has written one good article is missed. However, filtering for authors with fewer than 5001 works and works with fewer than 10 co-authors serves more to reduce computation time and complexity, and can influence the relevance of the works found. Lastly, the concept filters limit the fields of the works quite a bit and could be expanded if one wanted to broaden the search, but currently they help in finding relevant works.

## Part 4: The Network of Computational Social Scientists

> **Exercise: Constructing the Computational Social Scientists Network**
>
> In this exercise, we will create a network of researchers in the field of Computational Social Science using the NetworkX library. In our network, nodes represent authors of academic papers, with a direct link from node _A_ to node _B_ indicating a joint paper written by both. The link's weight reflects the number of papers written by both _A_ and _B_.
>
> **Part 1: Network Construction**
>
> 1. **Weighted Edgelist Creation:** Start with your dataframe of *papers*. Construct a _weighted edgelist_ where each list element is a tuple containing three elements: the _author ids_ of two collaborating authors and the total number of papers they've co-authored. Ensure each author pair is listed only once. 

In [9]:
papers = pd.read_csv('data/IC2S2_full_papers.csv')

# Initialize a dictionary to store author pairs and their counts
author_pairs = defaultdict(int)

# Iterate over each row in the DataFrame
for _, row in papers.iterrows():
    # Convert the 'coauthor_ids' string into a list
    authors_list = ast.literal_eval(row['coauthor_ids'])
    
    # Generate all combinations of authors for the current row
    authors = combinations(authors_list, 2)

    # For each combination, increment the count in the dictionary
    for pair in authors:
        author_pairs[pair] += 1

# Convert the dictionary to a list of tuples (edgelist)
edgelist = [(pair[0], pair[1], weight) for pair, weight in author_pairs.items()]
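
One detail worth checking in the loop above is that `combinations` yields pairs in the order the IDs appear in each paper, so two authors listed in a different order on two papers would end up under two separate keys. As far as we understand `add_weighted_edges_from`, adding the reversed pair to an undirected graph then replaces the weight instead of summing it. A minimal sketch of an order-independent count, using made-up author IDs and hypothetical variable names, could look like this:

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper
papers_authors = [["A2", "A1", "A3"], ["A1", "A2"], ["A3", "A1"]]

pair_counts = Counter()
for author_list in papers_authors:
    # Sorting gives every pair one canonical order; set() drops repeated IDs
    for pair in combinations(sorted(set(author_list)), 2):
        pair_counts[pair] += 1

toy_edgelist = sorted((a, b, weight) for (a, b), weight in pair_counts.items())
print(toy_edgelist)
```

Here the pair A1 and A2 gets weight 2 even though the two papers list the authors in opposite orders.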

> 2. **Graph Construction:**
>    - Use NetworkX to create an undirected [``Graph``](https://networkx.org/documentation/stable/reference/classes/graph.html).
>    - Employ the [`add_weighted_edges_from`](https://networkx.org/documentation/stable/reference/classes/generated/networkx.Graph.add_weighted_edges_from.html#networkx.Graph.add_weighted_edges_from) function to populate the graph with the weighted edgelist from step 1, creating a weighted, undirected graph.

In [10]:
# Create a weighted graph from the edgelist
G = nx.Graph()
G.add_weighted_edges_from(edgelist)

> 3. **Node Attributes:**
>    - For each node, add attributes for the author's _display name_, _country_, _citation count_, and the _year of their first publication_ in Computational Social Science. The _display name_ and _country_ can be retrieved from your _authors_ dataset. The _year of their first publication_ and the _citation count_  can be retrieved from the _papers_ dataset.
>    - Save the network as a JSON file.

In [11]:
authors_data = pd.read_csv('data/IC2S2_full_authors.csv')

papers['coauthor_ids'] = papers['coauthor_ids'].apply(ast.literal_eval)
authors = np.hstack(papers['coauthor_ids'].values)
authors = list(set(authors))

# Only keep an author if they're in the authors_data
authors = [author for author in authors if author in authors_data['id'].values]

# Create a dictionary of author IDs and their corresponding total citation counts and first year of publication.
additional_data = {}
for author in authors:
    # Get the citations by finding papers that each author has authored and summing the citations for those papers.
    # Then get the first year of an authors publication by finding papers that each author has authored in and selecting the minimum value.
    author_papers = papers[papers['coauthor_ids'].apply(lambda x: author in x)]
    additional_data[author] = [author_papers['cited_by_count'].sum(), author_papers['publication_year'].min()]
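
The loop above scans the whole papers dataframe once per author. If that becomes slow, the same two values can be accumulated in a single pass over the papers. The sketch below shows the idea on made-up records with plain dictionaries; the variable names are hypothetical and it is not a drop-in replacement for the cell above.

```python
from collections import defaultdict

# Hypothetical paper records with the same three fields used above
toy_papers = [
    {"cited_by_count": 12, "publication_year": 2015, "coauthor_ids": ["A1", "A2"]},
    {"cited_by_count": 30, "publication_year": 2011, "coauthor_ids": ["A1", "A3"]},
    {"cited_by_count": 20, "publication_year": 2019, "coauthor_ids": ["A2"]},
]

citations = defaultdict(int)
first_year = {}
for paper in toy_papers:
    year = paper["publication_year"]
    for author in paper["coauthor_ids"]:
        # Sum citations and keep the earliest publication year per author
        citations[author] += paper["cited_by_count"]
        first_year[author] = min(year, first_year.get(author, year))

summary = {author: (citations[author], first_year[author]) for author in sorted(citations)}
print(summary)
```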

In [12]:
# Add node attributes
for author in G.nodes:
    if authors_data.loc[authors_data['id'] == author, 'display_name'].values.size == 0:
        continue
    G.nodes[author]['display_name'] = authors_data.loc[authors_data['id'] == author, 'display_name'].values[0]
    G.nodes[author]['country_code'] = authors_data.loc[authors_data['id'] == author, 'country_code'].values[0]
    G.nodes[author]['citations'] = additional_data[author][0]
    G.nodes[author]['first_year'] = additional_data[author][1]

In [13]:
# Save the network as a JSON file
class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            if np.isnan(obj):
                return None
            elif np.isinf(obj):
                return str(obj)
            # Finite NumPy floats (e.g. np.float32) become plain Python floats
            return float(obj)
        return super().default(obj)

# The above step was done to avoid the error of numpy.int64 not being serializable

data = nx.node_link_data(G)
with open('data/authors_network.json', 'w') as f:
    json.dump(data, f, cls=MyEncoder)

> **Part 2: Preliminary Network Analysis**
> Now, with the network constructed, perform a basic analysis to explore its features.

In [14]:
# Read the JSON file back into a networkx graph
with open('data/authors_network.json', 'r') as f:
    data = json.load(f)
G = nx.node_link_graph(data)

> 1. **Network Metrics:**
>    - What is the total number of nodes (authors) and links (collaborations) in the network?

In [15]:
# Total number of nodes (authors) and links (collaborations)
print(f'Total number of nodes: {G.number_of_nodes()}')
print(f'Total number of links: {G.number_of_edges()}')

Total number of nodes: 103424
Total number of links: 364335


>    - Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.

As the network density is approximately 0.00007, the network is definitely sparse: it contains only a tiny fraction of the maximum possible number of links.

In [16]:
# Network density
density = nx.density(G)
print(f'Network density: {density}')

Network density: 6.81227902439113e-05
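
As a sanity check, the density can be recomputed by hand from the node and link counts printed earlier, using the definition from the exercise (actual links divided by the maximum possible number of links, N(N-1)/2 for an undirected graph). The variable names below are only for this illustration.

```python
# Counts reported by G.number_of_nodes() and G.number_of_edges() above
n_nodes = 103424
n_links = 364335

# Maximum number of links in an undirected graph without self-loops
max_links = n_nodes * (n_nodes - 1) // 2
density = n_links / max_links

# The average degree follows from the same two numbers
average_degree = 2 * n_links / n_nodes

print(f"Maximum possible links: {max_links}")
print(f"Density: {density:.3e}")
print(f"Average degree: {average_degree:.2f}")
```

The same two counts also give the average degree, so this doubles as a consistency check against the degree analysis further down.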


>    - Is the network fully connected?

In [17]:
# Network connection between nodes
is_connected = nx.is_connected(G)
print(f'Is the network connected: {is_connected}')

Is the network connected: False


>    - If the network is disconnected, how many connected components does it have?

In [18]:
# Number of connected components
connected_components = nx.number_connected_components(G)
print(f'Number of connected components: {connected_components}')

Number of connected components: 68


>    - How many isolated nodes are there in your network?  An isolated node is defined as a node with no connections to any other node in the network.

In [19]:
# Number of isolated nodes
isolated_nodes = list(nx.isolates(G))
print(f'Number of isolated nodes: {len(isolated_nodes)}')

Number of isolated nodes: 0


>    - Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why?  __(answer in max 150 words)__

That the density is low makes sense, as you would not expect every author to have co-authored with every other author: the works span different fields, and the authors most likely specialize in just a couple of those. This also relates to connectivity, as this specialization creates different groups in the network, so it is not fully connected. We would also expect every author to have co-authored with at least one other author, and since nodes only enter the graph through co-authorship links, it makes sense that there are no isolated nodes.

> 2. **Degree Analysis:**
>    - Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree).

In [29]:
# Degree
degree = dict(G.degree())
degree_values = list(degree.values())

# Strength
strength = dict(G.degree(weight='weight'))
strength_values = list(strength.values())

# Average
avg_degree = np.mean(degree_values)
avg_strength = np.mean(strength_values)

# Median
median_degree = np.median(degree_values)
median_strength = np.median(strength_values)

# Mode
mode_degree = max(set(degree_values), key=degree_values.count)
mode_strength = max(set(strength_values), key=strength_values.count)

# Minimum
min_degree = min(degree_values)
min_strength = min(strength_values)

# Maximum
max_degree = max(degree_values)
max_strength = max(strength_values)

print(f'Degree Analysis:')
print(f'Average: {avg_degree:.2f}')
print(f'Median: {median_degree}')
print(f'Mode: {mode_degree}')
print(f'Minimum: {min_degree}')
print(f'Maximum: {max_degree}')
print(f'\nStrength Analysis:')
print(f'Average: {avg_strength:.2f}')
print(f'Median: {median_strength}')
print(f'Mode: {mode_strength}')
print(f'Minimum: {min_strength}')
print(f'Maximum: {max_strength}')

Degree Analysis:
Average: 7.05
Median: 5.0
Mode: 4
Minimum: 1
Maximum: 1548

Strength Analysis:
Average: 8.48
Median: 5.0
Mode: 4
Minimum: 1
Maximum: 1763


> What do these metrics tell us about the network? __(answer in max 150 words)__

These metrics tell us that the median author (in both the degree and the strength analysis) has 5 co-authors. As expected, the average rises in the strength analysis. This is because strength is the weighted degree: the more papers an author has written with another author, the higher the strength. The maximum strength of 1763 belongs to an author whose co-authorship ties, counted once per shared paper, sum to 1763 under the filters we've used. Such a high value also skews the average. This author is known as a 'hub', and the contrast with the relatively low median shows the heavy-tailed distribution that is typical for real networks. The strength analysis is also very similar to the Science Collaboration network in Table 2.1 in the Network Science book. This further substantiates the universality of networks.
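
To make the difference between degree and strength concrete, the sketch below computes both from a tiny made-up weighted edge list using only the standard library. The edge list and the helper `summarize` are hypothetical; in the notebook the values come from `G.degree()` and `G.degree(weight='weight')`.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical weighted edges: (author, author, number of joint papers)
toy_edges = [("A", "B", 3), ("A", "C", 1), ("B", "C", 1), ("C", "D", 2)]

degree = defaultdict(int)    # number of distinct co-authors
strength = defaultdict(int)  # sum of the weights of the links
for u, v, weight in toy_edges:
    for node in (u, v):
        degree[node] += 1
        strength[node] += weight


def summarize(values):
    """Return mean, median, mode, minimum and maximum of a collection."""
    values = list(values)
    mode = Counter(values).most_common(1)[0][0]
    return {"mean": mean(values), "median": median(values), "mode": mode,
            "min": min(values), "max": max(values)}


print("Degree:", summarize(degree.values()))
print("Strength:", summarize(strength.values()))
```

Node A has two co-authors but a strength of 4, because three of its joint papers are with B. Using `Counter` for the mode is a design choice that avoids rescanning the full list for every distinct value.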

> 3. **Top Authors:**
>    - Identify the top 5 authors by degree. What role do these node play in the network?

The top five authors are found below and displayed together with their degree value.
In the network, they act as connectors between many different nodes. As their degree is high, they have many connections to other nodes, meaning they have contributed to many different works with a variety of authors.

In [21]:
# Identify the top 5 authors by degree.
top_5_degree = sorted(degree.items(), key=lambda x: x[1], reverse=True)[:5]
top_5_degree

[('https://openalex.org/A5027835055', 1548),
 ('https://openalex.org/A5071773009', 898),
 ('https://openalex.org/A5070812231', 762),
 ('https://openalex.org/A5073501391', 745),
 ('https://openalex.org/A5036726873', 613)]

>    - Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? __(answer in max 150 words)__

The top five authors found are named Jun Li, Yu Zhang, Rui Wang, Jie Zhang, and Meng Wang. While they have work counts between 10,000 and 34,000, these authors still appear in the network, as they are co-authors of papers found in the last step of the data collection process and thus weren't filtered away even though they have more than 5,000 works.
The authors seem to specialize in fields different from Computational Social Science, as their recent works focus on different biotechnical topics ranging from the human body to ethnopharmacology and antibiotics. These don't match the concepts we tried to filter by when retrieving the works. One possible reason is that an author helped with the scientific aspects of a work and thus was rightfully credited. However, this leads us to authors that don't necessarily work in any of the categories we tried to capture.