# Week 1

## Overview

As explained in the [*Before week 1* notebook](https://nbviewer.org/github/lalessan/comsocsci2024/blob/master/lectures/Before_week_1.ipynb), each week of this class is a Jupyter notebook like this one. **_In order to follow the class, you simply start reading from the top_**, following the instructions.

**Hint**: You can ask me for help at any point if you get stuck!

## Today

This first lecture will go over a few different topics to get you started:

* **Part 1:** You picked this course in **Computational Social Science** but... *What does that even mean??* The first thing we will do today is to learn a bit more about it by reading some chapters of the book and listening to a short lecture by me.

* **Part 2:**  In the second part of this class, I will introduce you to **a topic we will work on for the rest of this course** and I will ask you to reflect upon it.

* **Part 3:**  In the final part of this class, we will start working on something hands-on. We will use **web scraping** to gather some data.


## Part 1: Intro to Computational Social Science


*What is Computational Social Science?* In the video below, I will give you some answers. There will be a little bit of history, examples of topics and datasets, an overview of the methods, and some reflections on the challenges faced by researchers, including in relation to Ethics and Privacy.


> **_Video lecture_**: Watch the video below about Computational Social Science

In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo("3dA1GYdSg-A",width=600, height=337.5)


In this course, we are going to read some parts of the amazing book by Matthew Salganik ["_Bit By Bit: Social Research in the Digital Age_"](https://www.bitbybitbook.com/). 
Salganik is a professor in Sociology at Princeton, and an active researcher in Computational Social Science. 
You can read the book online, but I encourage you to buy it. 

> *Reading*: [Bit by Bit, chapter 1](https://www.bitbybitbook.com/en/1st-ed/introduction/) Start by reading the Introduction of the book, where you will get an understanding of the history of the field and the general framework.    
>
> *Reading 2*: [Bit by Bit, chapter 6](https://www.bitbybitbook.com/en/1st-ed/ethics/) Read the Ethics chapter of the book. Here, I don't expect you to read all the details. However, I want to make sure you get an overall understanding of the ethical challenges and some of the approaches that are used in the field to deal with these complex issues. You can focus on sections 6.4 and 6.6.    
>
> *Optional Reading*: [Computational Social Science](https://www.science.org/doi/pdf/10.1126/science.1167742). This is a crucial article written by some of the pioneers in Computational Social Science. The paper came out in 2009 in the prestigious scientific journal Science. People had already been working on using large-scale data and new computational tools to study society and behaviour, but the article is the first to acknowledge and describe this emerging field. I encourage you to read it or skim through it! 

> **Exercise 1 : Topics in Computational Social Science** By now, you must have a grasp of what Computational Social Science is about.
>
> - Work in pairs. Based on what you know so far, come up with three social science topics that you think it would be interesting and possible to work on using computational social science methods.
> - Do you have an idea of the data that you could use for your research? Come up with one dataset for each of the topics you have identified (if you have any doubt you can talk to me).
> - Go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/187754) and fill in the survey "_Topics in Computational Social Science_"

**Answer** <br>
*Topics*
1. Financial wealth based on political news consumption (i.e. level of engagement in politics)
2. Education level based on social media usage
3. Political opinion based on phone brand owned (e.g. Apple vs. Samsung)

*Data*

1. Wealth distribution statistics combined with survey or mobile data (e.g. news reading)
2. Educational level statistics combined with survey or mobile data (e.g. screen time or cookies)
3. Survey data, possibly compared with data on social media presence

## Part 2 : A topic we will work on for the rest of this course
 

All right, so as I promised, this course will be very hands-on. We will learn some of the methods and modelling approaches used in Computational Social Science and we will put this learning into practice.
The way we will do it is that we will apply the methods we learn to study a specific topic throughout the rest of this course.

In this video, I will explain to you what the whole project will be about. Ready? Watch the video below!


> **_Video lecture_**: Watch the video below
    

In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo("uEJltY5Pv1U",width=600, height=337.5)


> **Exercise 2 : Understanding the field of Computational Social Science in a data-driven way**    
> In the rest of this course, we will do a meta-study on the field of Computational Social Science. 
> We will start by gathering data on Computational Social Science researchers, their interactions, and their scientific production. 
> - Work in pairs. Discuss possible data sources that you could collect to address our research question. 
> - Which dataset(s) would you collect?
> - How would you practically collect it? 
> - What are some limitations of your data? 
> - Go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/187754) and fill in the survey "_Data for Computational Social Science_"
>
>
> Remember that in this exercise, there is not a single correct answer. There could be multiple ways to gather data, and different data sources could shed light on different aspects of scientists' work and interactions.

**Answer**
- Possible data sources: Research paper authors and inter-paper references. Subcategories of CSS studies and how much they overlap.
- Which datasets: I would collect author names of research papers as well as "Referenced by" data from these same research papers. Also collect the title and abstracts of the papers.
- How would you practically collect it: Scrape data from a publisher or research paper site and either filter the papers for CSS or find a journal that specializes in CSS.
- Limitations of the data: Authors on research papers are not necessarily guaranteed to be only computational social scientists, and references can be made from papers that are not CSS research papers as well. The subject of a paper can be very difficult to infer automatically from simply the title or the abstract.

## Prelude to part 3:  Basic HTML

In the final part of this class, we will talk about web-scraping. 
For web-scraping, you need a little bit of knowledge about the structure of web-pages.
The standard way to write web pages is to use a language called HTML. 

**Useful tutorial:**  If you are not familiar with HTML, I recommend [reading this tutorial](https://www.internetingishard.com/html-and-css/basic-web-pages).

**Useful resource:** HTML pages are built in a hierarchical structure and are composed of elements such as tables, titles, paragraphs, sections, etc. A complete list of HTML elements can be found [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
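
To make the idea of a hierarchical structure concrete, here is a minimal sketch that prints the nesting of a tiny page using only Python's standard library `html.parser` module. The page content and the `OutlinePrinter` class are made up for illustration, and the sketch assumes every tag is explicitly closed (elements such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser

# Made-up page used only for illustration
SAMPLE_HTML = """
<html>
  <body>
    <h1>Programme</h1>
    <ul>
      <li>First talk</li>
      <li>Second talk</li>
    </ul>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Collect one line per opening tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then go one level deeper
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(SAMPLE_HTML)
outline = printer.outline
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting: the two `li` elements sit inside `ul`, which sits inside `body`, which sits inside `html`.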

## Part 3: Using web-scraping to gather data.

All right, so now it's really time to start working on something a bit hands-on. 
The first thing we need to do is indeed to get DATA. 
As I said a few times by now, in this class we will do things from scratch. 
One of the ways to gather data from the web is to use web-scraping, which basically means getting information directly from web-pages. 
In the video below, I will give a brief overview of how to scrape web pages. 
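
As a rough sketch of what "getting information directly from web pages" looks like in code, the example below parses a small HTML string with BeautifulSoup and pulls out the text of specific elements. The HTML snippet, the `talks` class, and the names in it are invented for illustration; for a real page you would first download the HTML (for example with `requests`) and inspect it to find the right elements.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for a downloaded page (hypothetical class and names)
SAMPLE_HTML = """
<ul class="talks">
  <li>Talk A <i>Ada Lovelace, Alan Turing</i></li>
  <li>Talk B <i>Grace Hopper</i></li>
</ul>
"""

# Name the parser explicitly so the result does not depend on installed parsers
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

authors = []
# Find every <ul> with the class "talks", then every <i> element inside it
for ul in soup.find_all("ul", class_="talks"):
    for italic in ul.find_all("i"):
        # Each <i> holds a comma-separated list of names
        authors.extend(name.strip() for name in italic.get_text().split(","))

print(authors)
```

The same three steps (parse, locate elements, extract and clean text) apply to any page; only the tags and classes you search for change.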


> **_Video lecture_**: Watch the video below about web scraping (you can find [here the notebook that I use in the video](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/ScreenScraping.ipynb))

In [4]:
from IPython.display import YouTubeVideo
YouTubeVideo("nK_d0UQp4cE",width=600, height=337.5)


> **Exercise 3 : Web-scraping the list of participants to the International Conference on Computational Social Science**    
> It's time to put things into practice. Remember that our goal will be to gather a dataset describing Computational Social Scientists and their work. As we have discussed, the field of Computational Social Science is loosely defined. To gather data, we will start from the list of researchers who took part in the most important scientific conference in Computational Social Science in 2023. The conference is called the International Conference on Computational Social Science (*IC2S2* for short). The assumption here is that the scientists who contribute to this conference are at the core of the field of Computational Social Science.    
>
>
> You can find the programme of the 2023 edition of the conference at [this link](https://ic2s2-2023.org/program). As you can see the conference programme included many different contributions: keynote presentations, parallel talks, tutorials, posters. 
> 1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported on the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling. 
> 2. Some instructions for success: 
>    * First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.   
>    * Use the [BeautifulSoup Python package](https://pypi.org/project/beautifulsoup4/) to navigate through the hierarchy and extract the elements you need from the page. 
>    * You can use the [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to find elements that match specific filters. Check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) of the library for detailed explanations on how to set filters.  
>    * Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas, or other unwanted characters)
>    * The overall idea is to adapt the procedure I have used [here](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/ScreenScraping.ipynb) for the specific page you are scraping. 
> 3. Create the set of unique researchers that joined the conference and store it into a file.
>     * *Important:* If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible. 
> 4. *Optional:* For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, which can be found at [this link](https://ic2s2-2023.org/program_committee); (ii) the organizers of tutorials, who can be found at [this link](https://ic2s2-2023.org/tutorials)
> 5. Go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/187754) and fill in the survey "_Web Scraping_"
> 6. How many unique researchers do you get?
> 7. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices. 
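
One possible cleaning strategy for the list of names is sketched below, using only the standard library. It is an illustration under stated assumptions, not the expected solution: the helper names (`clean_name`, `name_key`, `unique_names`) and the input list are hypothetical, and treating names that differ only by case or accents as the same person is an assumption that may occasionally merge two different people.

```python
import re
import unicodedata


def clean_name(raw):
    """Trim stray separators and collapse repeated whitespace."""
    name = raw.strip(" ,;\t\n")
    return re.sub(r"\s+", " ", name)


def name_key(name):
    """Comparison key that ignores case and accents (an assumption, may over-merge)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.casefold()


def unique_names(raw_names):
    """Keep the first spelling seen for each comparison key, in input order."""
    seen = {}
    for raw in raw_names:
        name = clean_name(raw)
        if name:  # skip empty strings left over from splitting
            seen.setdefault(name_key(name), name)
    return list(seen.values())


# Hypothetical input, not taken from the conference page
raw = ["Ada Lovelace", " ada  lovelace,", "Stefan Gössling", "Stefan Gossling", ""]
unique = unique_names(raw)
print(unique)
```

Using a dictionary keyed on the normalised name also avoids scanning the whole list for every new name, which matters once the list grows to thousands of entries. The result could then be stored with something like `Path("researchers.txt").write_text("\n".join(unique), encoding="utf-8")`, where the file name is just a placeholder.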



In [5]:
# Import packages
import requests
from bs4 import BeautifulSoup

In [6]:
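# Define variables and collect content
LINK = "https://ic2s2-2023.org/program"
r = requests.get(LINK)
r.raise_for_status()  # Stop early if the request failed
# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")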

In [7]:
# Find the keynote speakers' names from the top table
names = []
# Loop over the links inside the table cells
# (rows are searched in the whole page; the "summary" table lookup was never used)
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        for link in td.find_all("a"):
            text_content = link.text
            # Only keep keynote links that contain a "-" separator before the name
            if "Keynote" in text_content and "-" in text_content:
                # Split only on the first hyphen so hyphenated names stay intact
                speaker = text_content.split("-", 1)[1].strip()
                if speaker not in names:
                    names.append(speaker)

In [8]:
# Find all the names from the bottom lists
# Find all the unordered lists
nav_lists = soup.find_all("ul", class_="nav_list")
# Find all the list elements
for nav_list in nav_lists:  # avoid shadowing the built-in `list`
    found_names = nav_list.find_all("i")
    # For every found name line, separate into individual names
    for name_line in found_names:
        for separated_name in name_line.text.split(", "):
            if separated_name.strip() not in names:
                names.append(separated_name.strip())

In [9]:
# Find all the names of the chairs
headers = soup.find_all("h2")
for header in headers:
    text = header.find("i")
    if text is not None:
        # Expect a label, then ": ", then the name; skip headers without that pattern
        parts = text.text.split(": ", 1)
        if len(parts) == 2 and parts[1].strip() not in names:
            names.append(parts[1].strip())
print(len(names))

# Define the name finding as a function for later use
def get_researcher_names():
    # Define variables and collect content
    LINK = "https://ic2s2-2023.org/program"
    r = requests.get(LINK)
    soup = BeautifulSoup(r.content, "html.parser")
    names = []

    # Find the keynote speakers' names from the links in the table cells
    for tr in soup.find_all("tr"):
        for td in tr.find_all("td"):
            for link in td.find_all("a"):
                text_content = link.text
                if "Keynote" in text_content and "-" in text_content:
                    # Split only on the first hyphen so hyphenated names stay intact
                    speaker = text_content.split("-", 1)[1].strip()
                    if speaker not in names:
                        names.append(speaker)

    # Find all the names from the bottom lists
    for nav_list in soup.find_all("ul", class_="nav_list"):
        # For every found name line, separate into individual names
        for name_line in nav_list.find_all("i"):
            for separated_name in name_line.text.split(", "):
                if separated_name.strip() not in names:
                    names.append(separated_name.strip())

    # Find all the names of the chairs
    for header in soup.find_all("h2"):
        text = header.find("i")
        if text is not None:
            # Expect a label, then ": ", then the name; skip headers without that pattern
            parts = text.text.split(": ", 1)
            if len(parts) == 2 and parts[1].strip() not in names:
                names.append(parts[1].strip())

    # Return the collected names so that callers can use them
    return names

1491
['Jevin West', 'Linda Steg', 'Sharad Goel', 'Molly Crockett', 'Lisa Anne Hendriks', 'Stefan Gössling', 'Joanna Bryson', 'Tim Althoff', 'Lauren Brent', 'Esteban Moro', 'Jonas L Juul', 'Jon Kleinberg', 'Chloe Ahn', 'Xinyi Wang', 'Giuseppe Russo', 'luca verginer', 'Manoel Horta Ribeiro', 'Giona Casiraghi', 'Almog Simchon', 'Adam Sutton', 'Matthew Edwards', 'Stephan Lewandowsky', 'Arianna Pera', 'Manuel Vimercati', 'Matteo Palmonari', 'Mohammed Alsobay', 'Abdullah Almaatouq', 'David G. Rand', 'Duncan J. Watts', 'Sara Venturini', 'Satyaki Sikdar', 'Francesco Rinaldi', 'Francesco Tudisco', 'Santo Fortunato', 'Isabella Loaiza', 'Takahiro Yabe', 'Alex Pentland', 'Silvia De Sojo Caso', 'Mia Ann Jørgensen', 'Sune Lehmann', 'Laura Alessandretti', 'Emil Bakkensen Johansen', 'Mathias Wullum Nielsen', 'Rubén Rodríguez Casañ', 'Antonio Ariño Villarroya', 'Sunny Rai', 'Ashley Francisco', 'Salvatore Giorgi', 'Brenda Curtis', 'Lyle Ungar', 'Sharath Chandra Guntuku', 'Allison Koenecke', 'Eric Gianne