### GitHub repository
Link to the repository used to collaborate on the assignment:
https://github.com/KarolineKlan/Assignments_ComSocSci2024.git

### Contribution statement

Team members:

- Jacob (s214596)
- Kristoffer (s214609)
- Karoline (s214638)

All members collaborated and contributed to every part of the assignment.


# Assignment 1
This assignment was built by web-scraping the program of the International Conference on Computational Social Science 2023 (https://ic2s2-2023.org/program) and by accessing data on authors and research articles through the OpenAlex API (https://docs.openalex.org/).

In [2]:
#Import relevant libraries
from bs4 import BeautifulSoup 
import requests
import pandas as pd
from tqdm import tqdm
import Levenshtein
import numpy as np
import ast
import networkx as nx
from 

## Part 1 - Webscraping
In the following task we use web-scraping tools to get the list of participants in the International Conference on Computational Social Science (IC2S2) 2023.

In [42]:
# define link to scrape, and beautifulsoup object
LINK = "https://ic2s2-2023.org/program"
LINK_OPTIONAL1 = "https://ic2s2-2023.org/program_committee"
r_OPTIONAL1 = requests.get(LINK_OPTIONAL1)
soup_OPTIONAL1 = BeautifulSoup(r_OPTIONAL1.content)
LINK_OPTIONAL2 = "https://ic2s2-2023.org/tutorials"
r_OPTIONAL2 = requests.get(LINK_OPTIONAL2)
soup_OPTIONAL2 = BeautifulSoup(r_OPTIONAL2.content)
r = requests.get(LINK)
soup = BeautifulSoup(r.content)

# Find all relevant places in the HTML code where names are stored
speaker = soup.findAll("ul", {"class" : "nav_list"})
chair = soup.findAll("h2")
table = soup.find("table", {"class" : "tutorials"})
table = table.find_all("td")
main = soup_OPTIONAL1.find("section", {"id" : "main"})
names_members = main.findAll("li")
names_teachers = soup_OPTIONAL2.findAll("div", {"class" : "col-5 col-12-medium"})

# Loop through the HTML code and extract names
keynote_names = [table[k].text.lower().split("- ")[1] for k in range(len(table)) if "Keynote" in table[k].text]
chair_names = [chair[k].text.lower().split(": ")[2] for k in range(len(chair)) if "Chair" in chair[k].text]
speaker_names = [speaker[k].find_all("i")[j].text.lower().split(", ")  for k in range(len(speaker)) for j in range(len(speaker[k].find_all("i")))]
speaker_names = sum(speaker_names, [])
names_members_lst = [names_members[i].find("b").text.lower() for i in range(len(names_members))]
names_teachers = [names_teachers[i].findAll("li")[k].find("b").text.lower() for i in range(len(names_teachers)) for k in range(len(names_teachers[i].findAll("li")))]



# Print results for each category
print(f"Number of unique speakers:  {len(set(speaker_names))}")
print(f"Number of unique keynote speakers:  {len(set(keynote_names))}")
print(f"Number of unique chairs:  {len(set(chair_names))}")
print(f"Number of unique members from optional link1:  {len(set(names_members_lst))}")
print(f"Number of unique teachers from optional link2:  {len(set(names_teachers))}")

# Add all names together to find total unique names
total_names = speaker_names + keynote_names + chair_names + names_members_lst + names_teachers
df = pd.DataFrame(total_names, columns = ["Name"])
df["Name"] = df["Name"].str.replace(".", "", regex=False)  # remove literal periods; "." must not be read as a regex wildcard
uniq_names = pd.DataFrame(set(df["Name"]), columns=["Name"])
uniq_names["Name"] = uniq_names["Name"].str.lstrip(" ")
uniq_names = uniq_names.sort_values('Name', ascending=True)
print(f"Total number of unique speakers:  {len((uniq_names))}")

pd.DataFrame(uniq_names).to_csv("data/authors_part1.csv", index=False)

Number of unique speakers:  1472
Number of unique keynote speakers:  10
Number of unique chairs:  49
Number of unique members from optional link1:  333
Number of unique teachers from optional link2:  19
Total number of unique speakers:  1645



**The process of the web-scraping:** 

To scrape the website and collect the names of all the researchers, we first inspected the HTML thoroughly to understand the hierarchical, nested structure of the page. After inspection, the program page was divided into three main parts, each requiring its own approach to access the data:
1. Collect the names of the keynote speakers from the overview table structure.
2. Collect the names of the chairs from the "h2" sections.
3. Collect the remaining names from the text in the "ul" sections.
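Once names have been collected from the different structures, they still have to be merged into a single list of unique names. The snippet below is a minimal sketch of that step using only the standard library. The function names `normalize_name` and `unique_names` and the sample names are hypothetical and not part of the notebook. The sketch cleans each name before de-duplicating, so that variants such as `" jane doe"` and `"Jane Doe"` collapse into one entry.

```python
def normalize_name(raw: str) -> str:
    """Lower-case a name, drop periods, and collapse extra whitespace."""
    cleaned = raw.lower().replace(".", "")
    return " ".join(cleaned.split())


def unique_names(*name_lists):
    """Merge several lists of raw names into one sorted list of unique names."""
    merged = {normalize_name(name) for names in name_lists for name in names}
    merged.discard("")  # ignore entries that are empty after cleaning
    return sorted(merged)


# Hypothetical sample input standing in for the scraped lists
speakers = ["Jane Doe", " jane doe", "John A. Smith"]
keynotes = ["john a smith"]
chairs = ["Ada Lovelace "]

names = unique_names(speakers, keynotes, chairs)
print(names)
```

Normalizing before building the set matters: if whitespace is stripped only after de-duplication, two spellings that differ by a leading space are counted as two different people.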

## Part 2 - Ready Made vs Custom Made Data

Write

## Part 3 - Gathering Research Articles using the OpenAlex API

In [6]:
import requests # import requests module
from tqdm import tqdm # import tqdm module
import pandas as pd
from Levenshtein import distance
from joblib import Parallel, delayed


authors = pd.read_csv("data/authors_part1.csv")
authors = authors["Name"].tolist()

BASE_URL = "https://api.openalex.org/"
RESOURCE = "authors"

complete_url = BASE_URL + RESOURCE
def get_author_week2(name):
    """Look up one name in OpenAlex and return a row for the top hit, or None."""
    PARAMETERS = {"search": name}
    response = requests.get(complete_url, params=PARAMETERS).json()

    # No search results for this name
    if response["meta"]["count"] == 0:
        return None

    top_hit = response["results"][0]
    # Keep the top hit only if its display name is fewer than 6 edits away from the scraped name
    if distance(name, top_hit["display_name"]) >= 6:
        return None

    if top_hit["last_known_institution"] is not None:
        country_code = top_hit["last_known_institution"]["country_code"]
    else:
        country_code = "No last known institution"

    return [top_hit["id"], top_hit["display_name"], top_hit["works_api_url"],
            top_hit["summary_stats"]["h_index"], top_hit["works_count"], country_code]


# Each call returns one row (or None). Results are collected from the return values,
# because a shared global list is not reliably updated across parallel workers.
rows = Parallel(n_jobs=2)(delayed(get_author_week2)(name) for name in tqdm(authors))
rows = [row for row in rows if row is not None]

df = pd.DataFrame(rows, columns=["id", "display_name", "works_api_url", "h_index", "works_count", "country_code"])
df = df.drop_duplicates(subset="id", keep='first')
df.to_csv("data/authors_part3.csv", index=False)

100%|██████████| 1645/1645 [07:55<00:00,  3.46it/s]
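The author lookup above keeps an OpenAlex hit only when the edit distance between the scraped name and the returned `display_name` is below 6. The sketch below illustrates how that filter behaves. It is a pure-Python reimplementation of the Levenshtein distance written for illustration, not the `Levenshtein` library's own code, and the helper `is_match` with its `max_distance` parameter is hypothetical. Lower-casing both strings before comparing is an assumption that is not in the notebook; it is shown because the scraped names are lower-cased while display names are not, so every capital letter otherwise counts as one edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def is_match(scraped: str, display_name: str, max_distance: int = 5) -> bool:
    """Case-insensitive variant of the `distance(...) < 6` filter (hypothetical helper)."""
    return levenshtein(scraped.lower(), display_name.lower()) <= max_distance


# Case differences alone already use up part of the allowed distance
print(levenshtein("jane doe", "Jane Doe"))
print(is_match("jane doe", "Jane M. Doe"))
```

With a case-sensitive comparison, a name with several capitalized parts can exceed the threshold even when the spelling is identical, which is one reason a matching author might be dropped.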


### Data Overview and Reflection questions: 

- **Dataset summary.** 

- **Efficiency in code.** 


- **Filtering Criteria and Dataset Relevance**

## Part 4: The Network of Computational Social Scientists

In this part of the assignment we construct and investigate the Computational Social Scientists Network.