Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Quanpu Xiao"
STUDENT_ID = "14368978"

---

# Analyzing Gender Distribution Among Scientific Authors in Computational Social Science

*Objective*: Understand the gender distribution of authors across different scientific disciplines using web scraping and API-based gender identification.

Gender diversity in research is crucial for ensuring diverse perspectives and approaches in scientific inquiry, and for the comprehensiveness and richness of research findings. A balanced gender representation can help challenge systemic biases that might otherwise marginalize or overlook significant areas of study. A diverse research community can also act as a role model, inspiring future generations of all genders to pursue scientific endeavors.

This assignment focuses on the question of the gender distribution of researchers in different disciplines, and on identifying how often women are the first or last author of publications. 

To do so, you will scrape a preprint website, and you will use the API genderize.io to identify the gender of the author based on their name.

1. Prepare: Identify a source and decide a scraping strategy

2. Scrape the list of articles and authors

3. Use API to identify gender 

4. Analyze gender distribution and authorship order

5. Reflect on your findings. 

6. Scrape the paper abstracts

### Setup and requirements
First make sure that you have the needed libraries for Python correctly installed.

In [2]:
# Selenium
# !pip install selenium
# !pip install webdriver-manager
# !pip install webdriver-manager --upgrade
# !pip install packaging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# driver.get("https://www.google.com")

In [3]:
# Request
# !pip install requests
import requests

In [4]:
# Beautifulsoup
# !pip install beautifulsoup4
from bs4 import BeautifulSoup

In [5]:
import pandas as pd
import numpy as np

## 1. Plan and strategize

We first need to decide which site to scrape and our strategy for doing so. We will focus on a preprint repository. Preprint repositories host and disseminate research papers before they are peer-reviewed and published in academic journals. They therefore give a view of the latest research.

There are several repositories that represent different scientific disciplines (e.g., PubMed for life sciences, arXiv for physics and computer science, JSTOR for humanities and social sciences, SocArxiv for social science, etc.) 

We will here focus on arxiv.org, where many Computational Social Scientists publish, often under the category "Computers and Society".

You need to pick a page on arXiv where you can get a representative sample of these research papers -- and which you are allowed to scrape.

1. Browse arxiv.org, and select a page on the website where you can find a sample of research papers.
2. Check the robots.txt. Are you allowed to scrape the page you selected? (If not, you will have to choose another one!)
3. Decide a strategy for scraping the page as quickly and easily as possible to find the names of the authors for each paper, their titles, and a link to the pages.
4. Choose which Python libraries for scraping that you will use.
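
The robots.txt check in step 2 can also be done programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the `example.org` URLs are hypothetical and only illustrate the mechanics; they are not the actual rules of arxiv.org, and `is_allowed` is a helper name introduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to illustrate the check.
# The real rules must be read from the site you plan to scrape.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /list/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.org/list/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.org/private/page"))
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called.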

### Question 1: Which library is most suitable?

Given the structure of the website, which Python libraries for scraping do you think are most appropriate to use? Motivate your choice in a few sentences.

_[For scraping the listing pages on arxiv.org, 'requests' combined with 'BeautifulSoup' is the best fit: requests makes the HTTP requests to the website, and BeautifulSoup provides easy methods for parsing and navigating the HTML structure to extract the required data. BeautifulSoup is also well suited to static websites like arxiv.org and performs better than a browser-driven approach, since it does not need to load and render the whole page.]_

[Evaluation: This is an open question. Any motivation that makes sense is fine, but in general, requests make more sense for this page than selenium, since the site in question is not dynamic. Using selenium will be slower and more difficult.]

## 2. Scrape the list of articles and authors 

Implement your scraping strategy. Scrape the page and collect the information about the publication. 

- You will need to get (1) the link to the article, (2) the title of the article, (3) the names of all authors of the paper, in the same order as they appear on the paper. 
- You need to scrape 200 research papers.

- Note that you may need to iterate over multiple pages.
- Note that you need to handle possible exceptions and that your code needs to be able to restart if it crashes.
- Your final result should be a list of dicts, with keys 'title', 'url', and 'authors'. 'authors' should consist of a list where the authors are listed in the order that they were on the paper. 
- You need to clean and validate your data: check that all papers have authors, that all papers have titles, clean the texts to remove empty spaces and similar, etc.
- Store the resulting array persistently as a pickle with the name 'scraping_result.pkl'.

For instance: [{'title': 'How to use Large Language Models for Text Analysis', 'authors': ['Törnberg, Petter'], 'url':'https://arxiv.org/abs/2307.13106' } ...]
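
The cleaning and validation step listed above could look roughly like the sketch below. It assumes the records follow the 'title' / 'authors' / 'url' layout described in the task; `clean_records` is a hypothetical helper name, and dropping duplicate URLs is an extra assumption that the task does not explicitly require.

```python
def clean_records(records):
    """Normalize whitespace and drop incomplete or duplicate paper records."""
    cleaned = []
    seen_urls = set()
    for record in records:
        # Collapse runs of whitespace (including newlines) into single spaces
        title = " ".join(str(record.get("title", "")).split())
        url = str(record.get("url", "")).strip()
        authors = [
            " ".join(author.split())
            for author in record.get("authors", [])
            if author and author.strip()
        ]
        # Skip records that are missing a title, a URL, or all authors
        if not title or not url or not authors:
            continue
        # Skip papers that were already collected (assumption: URL identifies a paper)
        if url in seen_urls:
            continue
        seen_urls.add(url)
        cleaned.append({"title": title, "authors": authors, "url": url})
    return cleaned


sample = [
    {"title": "  A  title\n", "authors": [" Ada  Lovelace ", ""], "url": "https://arxiv.org/abs/0000.00001 "},
    {"title": "", "authors": ["Nobody"], "url": "https://arxiv.org/abs/0000.00002"},
]
print(clean_records(sample))
```

Running the cleaning before pickling keeps later steps from having to re-check for empty titles or author lists.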


In [6]:
import pickle

data_list = ...

# YOUR CODE HERE
# raise NotImplementedError()
import os
import urllib.parse

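# If a previous run already saved results, they are loaded further down and
# the scraping loop is skipped; calling exit() here would stop the kernel.
if os.path.exists('scraping_result.pkl'):
    print('A scraping result file already exists. Reusing the saved results.')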

def scrape_page(url):
    for _ in range(5):
        try:
            response = requests.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            articles = []
            for item in soup.find_all('dt'):
                dd_item = item.find_next_sibling('dd')
                if dd_item:
                    title = dd_item.find('div', {'class': 'list-title'}).text.replace("Title:", "").strip()
                    authors = [author.text.strip() for author in dd_item.find('div', {'class': 'list-authors'}).find_all('a')]
                    
                    article_id = item.find('span', {'class': 'list-identifier'}).find('a').get('href').split('/')[-1]
                    article_url = f'https://arxiv.org/abs/{article_id}'
                    
                    articles.append({'title': title, 'authors': authors, 'url': article_url})
            return articles
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}. Retrying in 5 seconds...")
            time.sleep(5)
    print(f"Failed to retrieve the page: {url} after 5 attempts.")
    return []

data_list = []
if os.path.exists('scraping_result.pkl'):
    with open('scraping_result.pkl', 'rb') as f:
        data_list = pickle.load(f)

page_num = 0
papers_per_page = 50

while len(data_list) < 200:
    page_num += 1
    url = f'https://arxiv.org/list/cs.CY/recent?skip={page_num * papers_per_page}&show={papers_per_page}'
    new_articles = scrape_page(url)
    data_list.extend(new_articles)
    print(f'Scraped page {page_num}, total papers collected: {len(data_list)}')

    if len(data_list) >= 200:
        print('Collected 200 papers.')
        break

with open('scraping_result.pkl', 'wb') as f:
    pickle.dump(data_list, f)

Scraped page 1, total papers collected: 3
Scraped page 2, total papers collected: 53
Scraped page 3, total papers collected: 103
Scraped page 4, total papers collected: 153
Scraped page 5, total papers collected: 203
Collected 200 papers.


In [7]:
# Check if keys exists in dictionary
assert 'title' in data_list[0], "Key 'title' not found in dictionary"
assert 'authors' in data_list[0], "Key 'authors' not found in dictionary"
assert 'url' in data_list[0], "Key 'url' not found in dictionary"

## 3. Use Genderize.io to identify author gender

The next step is to identify the gender of the authors. To do so, we will use the free API genderize.io. 

1. Go to https://genderize.io/ and read the API documentation.
2. Do you need to register to use it? Do you need an API key? 
3. How do you call the API? What parameters do you need to send? 
4. What rate limiting is used? How long do you need to wait between calls?

You will use what you learned to carry out the following tasks.


#### Task 1: _identify_gender()_
Write a function _identify_gender(first_name)_ that takes a name, and uses the API to guess the gender. The function should send a request to genderize.io, and parse the resulting json to a dict. The function should return a dict with the data provided by the API.

In [8]:
# API limits test

def get_rate_limits():
    url = 'https://api.genderize.io/?name=John' + f'&apikey={"ef85eae72c459a406b138172e7613786"}'
    response = requests.get(url)
    
    limit = response.headers.get('X-Rate-Limit-Limit', 'Not provided')
    remaining = response.headers.get('X-Rate-Limit-Remaining', 'Not provided')
    reset = response.headers.get('X-Rate-Reset', 'Not provided')

    return {
        "limit": limit,
        "remaining": remaining,
        "reset": reset
    }

rate_limits = get_rate_limits()
print(rate_limits)


{'limit': '100000', 'remaining': '98585', 'reset': 'Not provided'}


In [9]:
def identify_gender(first_name):

    # YOUR CODE HERE
    api_key = "ef85eae72c459a406b138172e7613786"

    # Pass the name via params so that requests URL-encodes it
    # (names may contain spaces or accented characters)
    response = requests.get(
        'https://api.genderize.io/',
        params={'name': first_name, 'apikey': api_key}
    )
    response.raise_for_status()
    gender_data = response.json()
    return gender_data

# Test
print(identify_gender("Sasha"))

{'count': 14512, 'name': 'Sasha', 'gender': 'female', 'probability': 0.52}


#### Task 2: Identify gender of all authors

Your task is now to use your new function to identify the genders of all authors that you previously scraped. 

To do so, you first need to extract the first name of each author. You need to iterate over these names and use your function to identify the gender of the author.

Your result should be a dataframe with the following columns:

- article_url | author_full_name | first_name | author_order | estimated_gender | gender_probability

Author_order should be a number specifying where the author was in the author list for the publication (e.g., 0 = first author, 1 = second author...) _Estimated_gender_ should contain the API response on gender, and _gender_probability_ the certainty of the gender, according to the API.
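
As an illustration of the one-author-per-row layout, here is a small sketch in plain Python. It assumes author names come either as "First Last" or as "Last, First" (the format used in the example in step 2); `extract_first_name` and `to_author_rows` are hypothetical helper names. The two gender columns are left out here because they are filled in after the API calls.

```python
def extract_first_name(full_name):
    """Guess the first name from 'First Last' or 'Last, First' (assumed formats)."""
    if "," in full_name:
        # 'Last, First': the given name follows the comma
        given = full_name.split(",", 1)[1]
    else:
        given = full_name
    parts = given.split()
    return parts[0] if parts else None


def to_author_rows(papers):
    """Flatten a list of paper dicts into one row (dict) per author."""
    rows = []
    for paper in papers:
        for order, full_name in enumerate(paper["authors"]):
            rows.append({
                "article_url": paper["url"],
                "author_full_name": full_name,
                "first_name": extract_first_name(full_name),
                "author_order": order,  # 0 = first author
            })
    return rows


papers = [{"title": "Example", "url": "https://arxiv.org/abs/2307.13106",
           "authors": ["Törnberg, Petter"]}]
print(to_author_rows(papers))
```

The resulting list of dicts can be passed directly to `pd.DataFrame(...)` to obtain the dataframe.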

Note:
- You will need to transform your dict to the dataframe shown above, with one author per line. (This means that each URL will be associated to multiple author names.)
- Make sure that you respect the rate limiting of the API. 
- Make sure that you handle exceptions and that your function continues running after an error.
- Note that you get a maximum of 1,000 free calls per day, so you need to make sure that you do not waste your API calls!
- The API may not have all names stored. For these names, store a _np.nan_ value as the gender.

Pickle the resulting dataframe with the name: 'author_gender.df.pkl'
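
One way to avoid wasting API calls is to look up each distinct first name only once and to send names in batches. The sketch below shows the idea without touching the network: `fetch_batch` is a hypothetical callable standing in for the real API request, `fake_fetch_batch` is a stand-in used only for the demonstration, and the batch size of 10 is an assumption that should be checked against the API documentation.

```python
import math


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_genders(first_names, fetch_batch, batch_size=10):
    """Map each distinct first name to a (gender, probability) tuple.

    `fetch_batch` takes a list of names and returns one dict per name, in the
    same order (assumed response shape: 'gender' and 'probability' keys).
    """
    unique_names = sorted({name for name in first_names if name})
    results = {}
    for batch in chunked(unique_names, batch_size):
        try:
            response = fetch_batch(batch)
        except Exception as error:
            # Keep going: a failed batch should not stop the whole run
            print(f"Batch {batch} failed: {error}")
            continue
        for name, entry in zip(batch, response):
            # Unknown names come back without a gender; store NaN instead
            gender = entry.get("gender") or math.nan
            results[name] = (gender, entry.get("probability", math.nan))
    return results


def fake_fetch_batch(names):
    """Stand-in for the real API call, for demonstration only."""
    known = {"Sasha": ("female", 0.52)}
    response = []
    for name in names:
        gender, probability = known.get(name, (None, 0.0))
        response.append({"name": name, "gender": gender, "probability": probability})
    return response


genders = lookup_genders(["Sasha", "Sasha", "Qwxz"], fake_fetch_batch)
print(genders["Sasha"])
```

The returned mapping can then be joined back onto the per-author rows by first name, so that repeated names cost a single lookup.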

In [10]:
import pickle
import pandas as pd
import requests
import numpy as np
import time

# Load your scraped data
with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)

# YOUR CODE HERE
# raise NotImplementedError()

api_key = "ef85eae72c459a406b138172e7613786"

def identify_gender(names):
    url = f'https://api.genderize.io/?name[]=' + '&name[]='.join(names) + f'&apikey={api_key}'
    response = requests.get(url)
    response.raise_for_status()
    gender_data = response.json()
    return gender_data

author_gender_data = []

for article in data_list:
    article_url = article['url']
    authors = article['authors']
    first_names_batch = [author_full_name.split()[0] for author_full_name in authors]
    try:
        gender_data_batch = identify_gender(first_names_batch)
        for author_order, gender_data in enumerate(gender_data_batch):
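            # The API returns null (None) for names it does not know,
            # so fall back to NaN as the task requires
            estimated_gender = gender_data.get('gender') or np.nan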
            gender_probability = gender_data.get('probability', np.nan)
            author_full_name = authors[author_order]
            first_name = first_names_batch[author_order]
            author_gender_data.append({
                'article_url': article_url,
                'author_full_name': author_full_name,
                'first_name': first_name,
                'author_order': author_order,
                'estimated_gender': estimated_gender,
                'gender_probability': gender_probability
            })
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
    print(f'Processed gender identification for {article_url}')
    time.sleep(0.1)

df = pd.DataFrame(author_gender_data)

with open('author_gender.df.pkl', 'wb') as f:
    pickle.dump(df, f)
    
print("Done")

Processed gender identification for https://arxiv.org/abs/2310.05396
Processed gender identification for https://arxiv.org/abs/2310.04875
Processed gender identification for https://arxiv.org/abs/2310.04465
Processed gender identification for https://arxiv.org/abs/2310.08795
Processed gender identification for https://arxiv.org/abs/2310.08532
Processed gender identification for https://arxiv.org/abs/2310.08455
Processed gender identification for https://arxiv.org/abs/2310.08349
Processed gender identification for https://arxiv.org/abs/2310.08133
Processed gender identification for https://arxiv.org/abs/2310.07888
Processed gender identification for https://arxiv.org/abs/2310.07806
Processed gender identification for https://arxiv.org/abs/2310.07739
Processed gender identification for https://arxiv.org/abs/2310.08433
Processed gender identification for https://arxiv.org/abs/2310.07915
Processed gender identification for https://arxiv.org/abs/2310.07563
Processed gender identification fo

In [11]:
assert 'article_url' in df.columns, "article_url column is missing"
assert 'author_full_name' in df.columns, "author_full_name column is missing"
assert 'first_name' in df.columns, "first_name column is missing"
assert 'author_order' in df.columns, "author_order column is missing"
assert 'estimated_gender' in df.columns, "estimated_gender column is missing"
assert 'gender_probability' in df.columns, "gender_probability column is missing"

with open('author_gender.df.pkl', 'wb') as f:
    pickle.dump(df, f)

display(df.head(10))

| | article_url | author_full_name | first_name | author_order | estimated_gender | gender_probability |
|---|---|---|---|---|---|---|
| 0 | https://arxiv.org/abs/2310.05396 | Ru Wang | Ru | 0 | male | 0.62 |
| 1 | https://arxiv.org/abs/2310.05396 | Nihan Zhou | Nihan | 1 | female | 0.94 |
| 2 | https://arxiv.org/abs/2310.05396 | Tam Nguyen | Tam | 2 | male | 0.61 |
| 3 | https://arxiv.org/abs/2310.05396 | Sanbrita Mondal | Sanbrita | 3 | female | 1.0 |
| 4 | https://arxiv.org/abs/2310.05396 | Bilge Mutlu | Bilge | 4 | female | 0.87 |
| 5 | https://arxiv.org/abs/2310.05396 | Yuhang Zhao | Yuhang | 5 | male | 0.86 |
| 6 | https://arxiv.org/abs/2310.04875 | Gabriele Tolomei | Gabriele | 0 | male | 0.85 |
| 7 | https://arxiv.org/abs/2310.04875 | Cesare Campagnano | Cesare | 1 | male | 1.0 |
| 8 | https://arxiv.org/abs/2310.04875 | Fabrizio Silvestri | Fabrizio | 2 | male | 1.0 |
| 9 | https://arxiv.org/abs/2310.04875 | Giovanni Trappolini | Giovanni | 3 | male | 1.0 |


## 4. Analyze gender distribution and authorship order

Now that you have gathered the necessary data, you will use this data to answer some research questions about gender equality in CSS research. Note that in calculating this, you need to handle that the API may have failed to identify the gender of some authors.

1. What fraction of the authors included are women? 
2. What happens to this fraction if you only include authors for which the gender_probability is higher than 80%? 
3. Being the first or single author on a research paper is an important status signal among researchers: it often means that you did most of the work. What fraction of the first or single authors are women? 
4. Being the _last_ author on a research paper with _three or more authors_ is also an important status signal: it tends to mean that you were the one to acquire funding or lead the lab. What fraction of the last authors on papers with three or more authors are women?
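
The four questions can be sketched in plain Python as below, under the assumption that authors with an unidentified gender are excluded from both numerator and denominator (one possible way to handle missing values, not the only one). The rows in `sample_rows` are made up for illustration, and `female_fraction` and `last_authors` are hypothetical helper names.

```python
from collections import defaultdict


def female_fraction(rows, min_probability=None):
    """Fraction of women among rows with an identified gender.

    Rows whose gender is missing are ignored. If `min_probability` is given,
    only rows with gender_probability strictly above it are counted.
    Returns None when no row qualifies.
    """
    known = [
        row for row in rows
        if row["estimated_gender"] in ("female", "male")
        and (min_probability is None or row["gender_probability"] > min_probability)
    ]
    if not known:
        return None
    return sum(row["estimated_gender"] == "female" for row in known) / len(known)


def last_authors(rows, min_authors=3):
    """Return the last-listed author row of every paper with at least min_authors authors."""
    by_paper = defaultdict(list)
    for row in rows:
        by_paper[row["article_url"]].append(row)
    return [
        max(group, key=lambda row: row["author_order"])
        for group in by_paper.values()
        if len(group) >= min_authors
    ]


# Made-up rows, for illustration only
sample_rows = [
    {"article_url": "u1", "author_order": 0, "estimated_gender": "female", "gender_probability": 0.95},
    {"article_url": "u1", "author_order": 1, "estimated_gender": "male", "gender_probability": 0.99},
    {"article_url": "u1", "author_order": 2, "estimated_gender": None, "gender_probability": 0.0},
    {"article_url": "u2", "author_order": 0, "estimated_gender": "male", "gender_probability": 0.60},
]

first_authors = [row for row in sample_rows if row["author_order"] == 0]
print(female_fraction(sample_rows))
print(female_fraction(sample_rows, min_probability=0.8))
print(female_fraction(first_authors))
print(female_fraction(last_authors(sample_rows)))
```

When doing the same in pandas, note that `GroupBy.last()` skips missing values by default, so it can return the gender of an earlier author when the last author's gender is unknown; selecting the final row of each group (for example with `tail(1)`) avoids this.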


In [12]:
# YOUR CODE HERE
# raise NotImplementedError()

# 1
female_authors_count = df[df['estimated_gender'] == 'female'].shape[0]
total_authors_count = df.shape[0]
female_fraction = female_authors_count / total_authors_count
print(f'Fraction of authors who are women: {female_fraction:.2f}')

# 2
filtered_df = df[df['gender_probability'] > 0.8]
female_authors_high_prob_count = filtered_df[filtered_df['estimated_gender'] == 'female'].shape[0]
total_authors_high_prob_count = filtered_df.shape[0]
female_high_prob_fraction = female_authors_high_prob_count / total_authors_high_prob_count
print(f'Fraction of authors who are women (with high gender probability of 80%): {female_high_prob_fraction:.2f}')

# 3
first_authors_df = df[df['author_order'] == 0]
female_first_authors_count = first_authors_df[first_authors_df['estimated_gender'] == 'female'].shape[0]
total_first_authors_count = first_authors_df.shape[0]
female_first_authors_fraction = female_first_authors_count / total_first_authors_count
print(f'Fraction of first or single authors who are women: {female_first_authors_fraction:.2f}')

# 4
# Group by article_url and filter groups with 3 or more authors
groups = df.groupby('article_url').filter(lambda x: len(x) >= 3).groupby('article_url')

last_authors_df = groups.last()

female_last_authors_count = last_authors_df[last_authors_df['estimated_gender'] == 'female'].shape[0]
total_last_authors_count = last_authors_df.shape[0]
female_last_authors_fraction = female_last_authors_count / total_last_authors_count
print(f'Fraction of last-authors on papers with three or more authors who are women: {female_last_authors_fraction:.2f}')

Fraction of authors who are women: 0.28
Fraction of authors who are women (with high gender probability of 80%): 0.29
Fraction of first or single authors who are women: 0.27
Fraction of last-authors on papers with three or more authors who are women: 0.33


## 5. Reflect on your findings

You have now carried out your analysis of the gender distribution in articles in CSS using scraped data. Reflect on your findings and method, and answer each of the following questions in a few sentences.

1. What are the implications of the observed gender distribution and author order in CSS? How do these distributions compare with your expectations?
2. How accurate do you think your findings are? What are the limitations of determining gender based solely on names? Are there cultural or regional nuances that the API might miss?
3. Reflect on the ethical considerations involved in scraping this data. 


YOUR ANSWER HERE

1. The observed gender distribution suggests that there is a gender gap in Computational Social Science (CSS), sampled here from the arXiv category "Computers and Society", with a lower representation of women as authors. The fraction of female authors is 28%, which increases slightly to 29% when considering only authors with a high gender probability. Moreover, the slightly lower fraction of women as first or single authors (27%) may indicate a lesser representation in leading roles on projects. However, the somewhat higher fraction of women as last authors on papers with three or more authors (33%) may signify a better representation in senior or supervisory roles. These findings might be indicative of underlying gender disparities in the field, which could be due to a variety of systemic or cultural factors. The distributions fall well short of what one would expect if gender representation in the field were balanced.

2. The accuracy of the findings is largely dependent on the accuracy of the genderize.io API, which determines gender based solely on names. This method has inherent limitations as it might not accurately reflect an individual's gender, especially in cases where names are unisex or where the gender distribution of a name differs significantly across different cultures or regions. Additionally, the method does not account for non-binary or transgender individuals. The cultural or regional nuances, the potential for misgendering, and the inability to identify non-binary genders are significant limitations that could affect the accuracy and comprehensiveness of the findings.

3. The ethical considerations in scraping this data include respecting the privacy and consent of the individuals whose information is being collected and analyzed. While the data is publicly available, the individuals may not have consented to having their gender inferred and analyzed in this manner. Moreover, the method of gender determination based on names can be seen as reinforcing a binary understanding of gender, which can be exclusionary. Additionally, the implications of the findings could be significant and might contribute to broader discussions or actions regarding gender equality in the field. Therefore, ensuring accuracy, transparency, and respect for individuals' identities and privacy are crucial ethical considerations in such analyses.

## 6. Scrape the paper abstract

Your next task is to get the abstract for each paper. You will use these abstracts in a later exercise in the course, where we will use text analysis to examine whether the content of research papers is a function of the gender of the author. 

To do so, you need to iterate over the papers that you have already identified, and scrape the abstract from the URL listed. 

#### Task 1: scrape_abstract()
Write a function scrape_abstract(url) that goes to the research paper URL, and scrapes the content of the abstract. The function should return the abstract as a string, and nothing else.

In [13]:
import requests
from bs4 import BeautifulSoup

def scrape_abstract(url):
    """
    Fetch the abstract from the provided arXiv URL using BeautifulSoup.

    Parameters:
    - url (str): The URL of the arXiv paper.

    Returns:
    - str: The abstract of the paper.
    """

    # YOUR CODE HERE
    # raise NotImplementedError()
    
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    abstract_block = soup.find('blockquote', {'class': 'abstract'})
    if abstract_block:
        # The page renders the label as "Abstract:" with no trailing space,
        # so strip the label itself (first occurrence only)
        abstract_text = abstract_block.text.replace("Abstract:", "", 1).strip()
        return abstract_text
    else:
        print(f"No abstract found for {url}")
        return None

# Test
url = "https://arxiv.org/abs/2307.13106"
print(scrape_abstract(url))


This guide introduces Large Language Models (LLM) as a highly versatile text analysis method within the social sciences. As LLMs are easy-to-use, cheap, fast, and applicable on a broad range of text analysis tasks, ranging from text annotation and classification to sentiment analysis and critical discourse analysis, many scholars believe that LLMs will transform how we do text analysis. This how-to guide is aimed at students and researchers with limited programming experience, and offers a simple introduction to how LLMs can be used for text analysis in your own research project, as well as advice on best practices. We will go through each of the steps of analyzing textual data with LLMs using Python: installing the software, setting up the API, loading the data, developing an analysis prompt, analyzing the text, and validating the results. As an illustrative example, we will use the challenging task of identifying populism in political texts, and show how LLMs move beyond the

#### Task 2: Scrape all urls

You will now use your function to scrape all the URLs that you collected in step 2.

The following will provide instructions for how you can go about this task. However, there are several ways to do this, and you are free to choose your preferred method.

Prepare your data:

1. Load your list of dicts from step 2 (scraping_result.pkl)
2. Use it to create a dataframe. 
3. Add a column 'scraped' which should be False for all rows, and a column 'abstract' that should be None for all rows.
4. Store the dataframe persistently (e.g., by pickling it.)

The scraping procedure:

1. Load the persistent pickle as dataframe (so that if your computer crashes, the function will continue where you were)
2. Repeat the following steps until there are no more rows with scraped == False:
3. Fetch a random row with scraped == False
4. Go to the URL and scrape the abstract.
5. Set abstract column in the dataframe to the resulting abstract, set scraped to True.
6. Store the dataframe persistently as a pickle. 

Remember: 
- You may use another strategy. However, since you will be scraping many pages, you should expect your scraper to encounter problems along the way. You therefore need to make sure that you regularly store the results persistently.
- Make sure to handle any exceptions gracefully.
- Be respectful toward the website owners: wait at least one second between each call. 

Your final result should be a dataframe stored as 'scraped_abstracts.df.pkl', with filled 'abstract' and 'scraped' columns.
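
The procedure above can be sketched independently of pandas as a small resumable loop. In this sketch the progress is kept in a plain dict mapping each URL to its abstract (or None), `fetch` is a hypothetical callable standing in for `scrape_abstract`, and the `max_attempts` cap is an added assumption to keep a permanently failing URL from looping forever. All helper names are introduced here for illustration.

```python
import os
import pickle
import random
import tempfile
import time


def load_state(path, urls):
    """Load saved progress if it exists, otherwise start with nothing scraped."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {url: None for url in urls}


def save_state(path, state):
    """Write to a temporary file first so a crash cannot corrupt the saved progress."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def scrape_all(urls, fetch, path, delay=1.0, max_attempts=3):
    """Fetch every URL that has no abstract yet, saving after each success."""
    state = load_state(path, urls)
    attempts = {url: 0 for url in urls}
    while True:
        pending = [u for u in urls if state.get(u) is None and attempts[u] < max_attempts]
        if not pending:
            break
        url = random.choice(pending)
        attempts[url] += 1
        try:
            abstract = fetch(url)
        except Exception as error:
            print(f"Failed on {url}: {error}")
            abstract = None
        if abstract is not None:
            state[url] = abstract
            save_state(path, state)
        time.sleep(delay)  # be polite: pause between requests
    return state


# Demonstration with a stand-in fetch function and a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "progress.pkl")
demo = scrape_all(["url-a", "url-b"], lambda url: f"abstract of {url}", demo_path, delay=0)
print(demo)
```

Calling `scrape_all` again with the same `path` picks up the saved progress and only fetches URLs that are still missing.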

<!-- [Evaluation: ]
- Load dataframe as df
- Check that the len of df = len of the result list from question 2. 
- Check that each line has an abstract, with len() > 100 e.g.
 -->

In [14]:
df = pd.DataFrame(data_list)

# YOUR CODE HERE
# raise NotImplementedError()

with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)

df = pd.DataFrame(data_list)

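# Only create the working file on the first run, so that a restart resumes
# from the saved progress instead of resetting it
if not os.path.exists('initial_dataframe.pkl'):
    df['scraped'] = False
    df['abstract'] = None

    with open('initial_dataframe.pkl', 'wb') as f:
        pickle.dump(df, f)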

def scrape_all_abstracts():
    with open('initial_dataframe.pkl', 'rb') as f:
        df = pickle.load(f)
    
    while df['scraped'].sum() < len(df):
        random_row = df[df['scraped'] == False].sample(1)
        index = random_row.index[0]
        url = random_row['url'].values[0]
        
        print(f'Scraping abstract from: {url}')
        
        abstract = scrape_abstract(url)
        
        if abstract is not None:
            df.at[index, 'abstract'] = abstract
            df.at[index, 'scraped'] = True
        
            with open('initial_dataframe.pkl', 'wb') as f:
                pickle.dump(df, f)
        
        # Wait at least one second between calls, as the task requires
        time.sleep(1)
    
    with open('scraped_abstracts.df.pkl', 'wb') as f:
        pickle.dump(df, f)

scrape_all_abstracts()
print("Done")


Scraping abstract from: https://arxiv.org/abs/2310.04824
Scraping abstract from: https://arxiv.org/abs/2310.04465
Scraping abstract from: https://arxiv.org/abs/2310.07577
Scraping abstract from: https://arxiv.org/abs/2310.06856
Scraping abstract from: https://arxiv.org/abs/2310.05689
Scraping abstract from: https://arxiv.org/abs/2310.06778
Scraping abstract from: https://arxiv.org/abs/2310.05421
Scraping abstract from: https://arxiv.org/abs/2310.05421
Scraping abstract from: https://arxiv.org/abs/2310.08433
Scraping abstract from: https://arxiv.org/abs/2310.05628
Scraping abstract from: https://arxiv.org/abs/2310.06061
Scraping abstract from: https://arxiv.org/abs/2310.08455
Scraping abstract from: https://arxiv.org/abs/2310.08532
Scraping abstract from: https://arxiv.org/abs/2310.04824
Scraping abstract from: https://arxiv.org/abs/2310.04739
Scraping abstract from: https://arxiv.org/abs/2310.05598
Scraping abstract from: https://arxiv.org/abs/2310.04628
Scraping abstract from: https:/