| Assignment 1 contribution table   | Teis Aggerholm (s234822) | Andreas Holm Matthiassen (s234838) | Hector Helt Jakobsen (s234822) |
|-------------|---------|---------|---------|
| Part 1 | 100%     | 0%     | 0%     |
| Part 2 | 100%     | 0%     | 0%     |
| Part 3 | 0%     | 0%     | 100%     |
| Part 4 | 0%     | 100%     | 0%     |

Link to our GitHub repository: `https://github.com/Andreas-Holm-2/02467-Assignment-1`

# `Part 1:` Web-scraping

1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible, including keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported on the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling.

2. Some instructions for success:
First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.
Use the BeautifulSoup Python package to navigate through the hierarchy and extract the elements you need from the page.
You can use the find_all method to find elements that match specific filters. Check the documentation of the library for detailed explanations on how to set filters.
Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas or other unwanted characters).
The overall idea is to adapt the procedure I have used here for the specific page you are scraping.

3. Create the set of unique researchers that joined the conference and store it into a file.
Important: If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible.

4. Optional: For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at this link; (ii) the organizers of tutorials, that can be found at this link

## Scraping the URL:

In [2]:
from bs4 import BeautifulSoup
import requests

def fetch_names(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    names = set()
    for tag in soup.find_all("i"): # We use find_all as recommended
        text = tag.get_text()
        for name in text.split(", "):  # Handle multiple names in one <i> tag
            clean_name = name.replace("Chair: ", "").strip() # remove 'Chair: ' from certain strings
            names.add(clean_name)
    
    return names

URL = "https://ic2s2-2023.org/program#plenaryday1"
names = fetch_names(URL)
names = sorted(names)

print(f'Initially, we have found {len(names)} names. The names are {names}')

Initially, we have found 1484 names. The names are ['Aaron Clauset', 'Aaron J. Schwartz', 'Aaron Schein', 'Aaron Smith', 'Abbas Haidar', 'Abby Smith', 'Abdulkadir Celikkanat', 'Abdullah Almaatouq', 'Abdullah Zameek', 'Abeer ElBahrawy', 'Adam Finnemann', 'Adam Frank', 'Adam H. Russell', 'Adam Stefkovics', 'Adam Sutton', 'Aditi Dutta', 'Adriano Belisario', 'Adrienne Mendrik', 'Agnieszka Czaplicka', 'Agnieszka Falenska', 'Aguru Ishibashi', 'Ahmad Hesam', 'Ahmed Nasser Mostafa', 'Aidan Combs', 'Aidar Zinnatullin', 'Akeela Careem', 'Akhil Arora', 'Akira Hashimoto', 'Akira Matsui', 'Akira Tsurushima', 'Akrati Saxena', 'Alain Barrat', 'Alan Paul Kwan', 'Alba Motes Rodrigo', 'Albert-Laszlo Barabasi', 'Alberto Amaduzzi', 'Alejandro Beltran', 'Alejandro Dinkelberg', 'Alejandro Hermida Carrillo', 'Aleksandra Urman', 'Alessandra Urbinati', 'Alessandro De Gaetano', 'Alessandro Flamini', 'Alessandro Flammini', 'Alessandro Gambetti', 'Alessandro Lomi', 'Alessia Antelmi', 'Alessia Melegaro', 'Alessio 
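The string handling inside `fetch_names` could be made slightly more defensive. The following is a minimal sketch, not the code used to produce the output above: `clean_names` is a hypothetical helper, the example strings are made up, and it assumes the text of each `<i>` tag has already been extracted. It removes the "Chair:" label regardless of capitalisation, tolerates commas without a following space, and discards empty fragments.

```python
import re

def clean_names(raw_text):
    """Split the text of one <i> tag into a list of cleaned names."""
    # Drop a leading "Chair:" label, whatever its capitalisation.
    text = re.sub(r"^\s*chair:\s*", "", raw_text, flags=re.IGNORECASE)
    # Split on commas, then collapse repeated whitespace inside each part.
    cleaned = (" ".join(part.split()) for part in text.split(","))
    # Discard empty fragments, e.g. from a trailing comma or a blank tag.
    return [name for name in cleaned if name]

# Made-up inputs for illustration only.
for raw in ["Chair: Jane Doe", "John  Smith,Maria Rossi,", "  "]:
    print(clean_names(raw))
```

Filtering out empty strings matters because a set built from raw splits can otherwise contain `''` as a "name", which inflates the count by one.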

## Removing researcher names spelled slightly differently

In [3]:
from thefuzz import fuzz, process

def find_fuzzy_matches(list_of_names, threshold=80):
    
    fuzzy_matches = [] # List to store fuzzy matches
    seen_matches = set()  # To avoid duplicates

    for i, name in enumerate(list_of_names):
        # Find top matches in the remaining names (excluding self-matches)
        matches = process.extract(name, list_of_names[i+1:], scorer=fuzz.ratio, limit=5)

        for match, score in matches:
            if score >= threshold and (name, match) not in seen_matches and (match, name) not in seen_matches:
                fuzzy_matches.append((name, match, score))
                seen_matches.add((name, match))

    
    fuzzy_matches.sort(key=lambda x: x[2], reverse=True) # Sort matches by descending score

    return fuzzy_matches


fuzzy_matches = find_fuzzy_matches(names)

print(f"Found {len(fuzzy_matches)} fuzzy matches.")
print("\nFuzzy Matches:")
for name1, name2, score in fuzzy_matches:
    print(f"{name1} & {name2} (Score: {score})")

Found 74 fuzzy matches.

Fuzzy Matches:
Bedoor AlShebli & Bedoor Alshebli (Score: 100)
Federico Barrera Lemarchand & Federico Barrera-Lemarchand (Score: 100)
Lisette Espin Noboa & Lisette Espin-Noboa (Score: 100)
Luca Verginer & luca verginer (Score: 100)
Sonja M Schmer Galunder & Sonja M Schmer-Galunder (Score: 100)
Woo-Sung Jung & Woo-sung Jung (Score: 100)
Alessandro Flamini & Alessandro Flammini (Score: 97)
Duncan J Watts & Duncan J. Watts (Score: 97)
Maximilan Schich & Maximilian Schich (Score: 97)
Pantelis P Analytis & Pantelis P. Analytis (Score: 97)
Anne C Kroon & Anne C. Kroon (Score: 96)
Diogo Pachecho & Diogo Pacheco (Score: 96)
Fabio Carella & Fabio Carrella (Score: 96)
Scott A Hale & Scott A. Hale (Score: 96)
Ana Maria Jaramillo & Ana María Jaramillo (Score: 95)
Jose Javier Ramasco & José Javier Ramasco (Score: 95)
Nicholas A Christakis & Nicholas Christakis (Score: 95)
Alexander Gates & Alexander J Gates (Score: 94)
David M Rothschild & David Rothschild (Score: 94)
Martin
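Many of the highest-scoring pairs above differ only in capitalisation, hyphens, periods or accents. As a complement to fuzzy matching, such variants can be grouped deterministically with a normalised comparison key. This is a sketch under our own assumptions: `normalise` and `group_variants` are hypothetical helpers that were not part of the analysis, and the sample names are taken from the output above.

```python
import unicodedata
from collections import defaultdict

def normalise(name):
    """Return a comparison key: no accents, no hyphens or periods, lower case."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Treat hyphens as spaces and drop periods, then collapse whitespace.
    no_punct = no_accents.replace("-", " ").replace(".", "")
    return " ".join(no_punct.lower().split())

def group_variants(names):
    """Map each key to the spellings that share it (only keys with 2+ spellings)."""
    groups = defaultdict(set)
    for name in names:
        groups[normalise(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

sample = [
    "Lisette Espin Noboa", "Lisette Espin-Noboa",
    "Duncan J Watts", "Duncan J. Watts",
    "Ana Maria Jaramillo", "Ana María Jaramillo",
    "Aaron Clauset",
]
for key, variants in group_variants(sample).items():
    print(key, "->", variants)
```

This approach cannot catch genuine typos such as a missing letter, so it would supplement the fuzzy step rather than replace it.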

In [4]:
# We run the function again with a threshold of 85 and remove the first name of each matched pair; below that score we began to see mistaken matches.

fuzzy_matches = find_fuzzy_matches(names, threshold=85)

print(f'Removing {len(fuzzy_matches)} duplicate names from the original name set')

fuzzy_matches = [match[0] for match in fuzzy_matches] # Simple policy: drop the first name of each matched pair to save time.
names = [name for name in names if name not in fuzzy_matches]

print(f'We now have a list of {len(names)} researchers. Their names are {names}')


Removing 42 duplicate names from the original name set
We now have a list of 1446 researchers. Their names are ['Aaron Clauset', 'Aaron J. Schwartz', 'Aaron Schein', 'Aaron Smith', 'Abbas Haidar', 'Abby Smith', 'Abdulkadir Celikkanat', 'Abdullah Almaatouq', 'Abdullah Zameek', 'Abeer ElBahrawy', 'Adam Finnemann', 'Adam Frank', 'Adam H. Russell', 'Adam Stefkovics', 'Adam Sutton', 'Aditi Dutta', 'Adriano Belisario', 'Adrienne Mendrik', 'Agnieszka Czaplicka', 'Agnieszka Falenska', 'Aguru Ishibashi', 'Ahmad Hesam', 'Ahmed Nasser Mostafa', 'Aidan Combs', 'Aidar Zinnatullin', 'Akeela Careem', 'Akhil Arora', 'Akira Hashimoto', 'Akira Matsui', 'Akira Tsurushima', 'Akrati Saxena', 'Alain Barrat', 'Alan Paul Kwan', 'Alba Motes Rodrigo', 'Albert-Laszlo Barabasi', 'Alberto Amaduzzi', 'Alejandro Beltran', 'Alejandro Dinkelberg', 'Alejandro Hermida Carrillo', 'Aleksandra Urman', 'Alessandra Urbinati', 'Alessandro De Gaetano', 'Alessandro Flammini', 'Alessandro Gambetti', 'Alessandro Lomi', 'Alessia A
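Always dropping the first name of a pair is fast, but it sometimes discards the better spelling (for example an accented or capitalised variant). One possible refinement is sketched below. It is an illustration, not what was run above: `preferred` and `drop_duplicates` are hypothetical helpers, and the scoring rule is a heuristic of our own that can still pick a misspelling when the typo makes the name longer.

```python
def preferred(name_a, name_b):
    """Pick the spelling to keep from a matched pair (heuristic, not foolproof)."""
    def score(name):
        return (
            sum(ord(c) > 127 for c in name),  # favour spellings with accented letters
            name != name.lower(),             # favour capitalised over all-lowercase
            len(name),                        # favour variants with initials or periods
        )
    # On a tie, max() returns the first argument.
    return max(name_a, name_b, key=score)

def drop_duplicates(names, matched_pairs):
    """Remove the less preferred spelling of every matched pair."""
    to_drop = set()
    for name_a, name_b in matched_pairs:
        keep = preferred(name_a, name_b)
        to_drop.add(name_b if keep == name_a else name_a)
    return [name for name in names if name not in to_drop]

# Pairs taken from the fuzzy-match output above.
pairs = [
    ("Luca Verginer", "luca verginer"),
    ("Jose Javier Ramasco", "José Javier Ramasco"),
    ("Duncan J Watts", "Duncan J. Watts"),
]
all_names = ["Aaron Clauset"] + [name for pair in pairs for name in pair]
print(drop_duplicates(all_names, pairs))
```

Collecting the names to drop in a set also makes the final filter faster than testing membership in a list.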

In [5]:
print(f'In total, we have found {len(names)} different researchers')

In total, we have found 1446 different researchers
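The task also asks for the set of researchers to be stored in a file, a step that is not shown in the cells above. A minimal sketch of how this could be done follows; `save_names`, `load_names` and the file name `researchers_2023.txt` are hypothetical choices, and the demo writes to a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path

def save_names(names, path):
    """Write one name per line, sorted and de-duplicated; return the count."""
    unique_sorted = sorted(set(names))
    Path(path).write_text("\n".join(unique_sorted) + "\n", encoding="utf-8")
    return len(unique_sorted)

def load_names(path):
    """Read the names back as a list, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()

# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "researchers_2023.txt"  # hypothetical file name
    count = save_names(["Aaron Clauset", "José Javier Ramasco", "Aaron Clauset"], target)
    print(count, load_names(target))
```

Writing with an explicit UTF-8 encoding keeps accented names intact when the file is read back on a different machine.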


6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices (answer in max 150 words).


To complete the exercise as well as possible, we started by locating the names on the webpage; by inspecting the HTML we noticed that they all sit inside `<i>` tags.

We then scraped the 2023 conference URL for names and added them all to a set, which removes exact duplicates.

By inspection, we noticed that 'Chair: ' was extracted together with the names of the session chairs, so we removed that prefix from the affected strings.

To ensure that no name is repeated with a slightly different spelling, we used the Python library `thefuzz`, which performs fuzzy matching, i.e. it scores how similar two non-identical strings are (https://redis.io/blog/what-is-fuzzy-matching/).

Lastly, for every pair of names with a similarity score of at least 85 (on a 0 to 100 scale), we deleted one of the two names.
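To make the similarity score that the threshold is applied to more concrete, the sketch below computes a comparable score with the standard-library `difflib.SequenceMatcher`. This is a stand-in chosen for illustration: it is not guaranteed to give the same numbers as `fuzz.ratio` in `thefuzz`, and `similarity` is a hypothetical helper. Lower-casing both strings is our assumption, made so that pure capitalisation differences score 100.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale after lower-casing both strings."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(100 * ratio)

THRESHOLD = 85  # same cut-off as in the analysis above

# The first two pairs come from the fuzzy-match output; the last is a clear non-match.
pairs = [
    ("Luca Verginer", "luca verginer"),
    ("Alessandro Flamini", "Alessandro Flammini"),
    ("Aaron Clauset", "Adam Frank"),
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "duplicate" if score >= THRESHOLD else "distinct"
    print(f"{a} & {b}: {score} ({verdict})")
```

Printing scores for a handful of known pairs like this is a quick way to sanity-check a threshold before applying it to the full list.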



# `Part 2:` Ready Made vs Custom Made Data


# `Part 3:` Gathering Research Articles using the OpenAlex API

# `Part 4:` The Network of Computational Social Scientists