Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Maja Kubara"
STUDENT_ID = "14498863"

---

# Analyzing Gender Distribution Among Scientific Authors in Computational Social Science

*Objective*: Understand the gender distribution of authors across different scientific disciplines using web scraping and API-based gender identification.

Gender diversity in research is crucial for ensuring diverse perspectives and approaches in scientific inquiry, and for the comprehensiveness and richness of research findings. A balanced gender representation can help challenge systemic biases that might otherwise marginalize or overlook significant areas of study. A diverse research community can also act as a role model, inspiring future generations of all genders to pursue scientific endeavors.

This assignment focuses on the question of the gender distribution of researchers in different disciplines, and on identifying how often women are the first or last author of publications. 

To do so, you will scrape a preprint website, and you will use the API genderize.io to identify the gender of the author based on their name.

1. Prepare: Identify a source and decide a scraping strategy

2. Scrape the list of articles and authors

3. Use API to identify gender 

4. Analyze gender distribution and authorship order

5. Reflect on your findings. 

6. Scrape the paper abstracts

### Setup and requirements
First make sure that you have the needed libraries for Python correctly installed.

In [2]:
#Selenium
# !pip install selenium
# !pip install webdriver-manager
# !pip install webdriver-manager --upgrade
# !pip install packaging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.google.com")

In [3]:
# Request
#!pip install requests
import requests

In [4]:
# Beautifulsoup
#!pip install beautifulsoup
from bs4 import BeautifulSoup

In [5]:
import pandas as pd
import numpy as np

## 1. Plan and strategize

We first need to decide which site to scrape and our strategy for doing so. We will focus on a preprint repository. Preprint repositories host and disseminate research papers before they are peer-reviewed and published in academic journals. They therefore give a view of the latest research.

There are several repositories that represent different scientific disciplines (e.g., PubMed for life sciences, arXiv for physics and computer science, JSTOR for humanities and social sciences, SocArxiv for social science, etc.) 

We will here focus on arxiv.org, where many Computational Social Scientists publish, often under the category "Computers and Society".

You need to pick a page on ArXiv where you can get a representative sample of these research papers -- and which you are allowed to scrape.

1. Browse Arxiv.org, and select a page on the website where you can find a sample of research papers.
2. Check the robots.txt. Are you allowed to scrape the page you selected? (If not, you will have to choose another one!)
3. Decide a strategy for scraping the page as quickly and easily as possible to find the names of the authors for each paper, their titles, and a link to the pages.
4. Choose which Python libraries for scraping that you will use.

### Question 1: Which library is most suitable?

Given the structure of the website, which Python libraries for scraping do you think is appropriate to use? Motivate your choice in a few sentences.

_Json parsing as it is easier to scrape and then turn into a dictionary _

[Evaluation: This is an open question. Any motivation that makes sense is fine, but in general, requests make more sense for this page than selenium, since the site in question is not dynamic. Using selenium will be slower and more difficult.]

## 2. Scrape the list of articles and authors 

Implement your scraping strategy. Scrape the page and collect the information about the publication. 

- You will need to get (1) the link to the article, (2) the title of the article, (3) the names of all authors of the paper, in the same order as they appear on the paper. 
- You need to scrape 200 research papers.

- Note that you may need to iterate over multiple pages.
- Note that you need to handle possible exceptions and that your code needs to be able to restart if it crashes.
- You final result should be a list of dicts, with keys 'title', 'url', and 'authors'. 'authors' should consist of a list where the authors are listed in the order that they were on the paper. 
- You need to clean and validate your data: check that all papers have authors, that all papers have titles, clean the texts to remove empty spaces and similar, etc.
- Store the resulting array persistently as a pickle with the name 'scraping_result.pkl'.

For instance: [{'title': 'How to use Large Langauge Models for Text Analysis', 'authors': ['Törnberg, Petter'], 'url':'https://arxiv.org/abs/2307.13106' } ...]


In [6]:
import pickle
import urllib
import feedparser

# Access url, read it and parse
base_url = 'http://export.arxiv.org/api/query?'
query =  'search_query=cat:stat.ME&start=0&max_results=200'
url = urllib.request.urlopen(base_url + query)
response = url.read()
data = feedparser.parse(response)

# Iterate through each entry in the data, save url, title, and authors of each article
article_info = []
for article in data.entries:
    url = article.link
    title = article.title
    authors = article.authors
    # Save it in a dictionary and append in a list
    info = {'title':title, 'authors':authors, 'url':url}
    article_info.append(info)

# Saving entries that have authors
article_info = [article for article in article_info if article['authors']]

# Removing extra spaces
data_list = []
for entry in article_info:
    data_dict = {key: value.strip() if isinstance(value, str) else value for key, value in entry.items()}
    data_list.append(data_dict)

#store the results as a pickle
with open ('scraping_result.pkl', 'wb') as file:
    pickle.dump(data_list, file)


In [7]:
# Check if keys exists in dictionary
assert 'title' in data_list[0], "Key 'title' not found in dictionary"
assert 'authors' in data_list[0], "Key 'author' not found in dictionary"
assert 'url' in data_list[0], "Key 'url' not found in dictionary"

## 3. Use Genderize.io to identify author gender

The next step is to identify the gender of the authors. To do so, we will use the free API genderdize.io. 

1. Go to https://genderize.io/ and read the API documnentation.
2. Do you need to register to use it? Do you need an API key? 
3. How do you call the API? What parameters do you need to send? 
4. What rate limiting is used? How long do you need to wait between calls?

You will use what you learned to carry out the following tasks.


#### Task 1: _identify_gender()_
Write a function _identify_gender(first_name)_ that takes a name, and uses the API to guess the gender. The function should send a request to genderize.io, and parse the resulting json to a dict. The function should return a dict with the data provided by the API.

In [8]:
# Import gender API and save as a function
def identify_gender(first_name):
    
    url = f'https://api.genderize.io?name={first_name}'
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()

    return data

# Test
print(identify_gender("Sasha"))

{'count': 14512, 'name': 'Sasha', 'gender': 'female', 'probability': 0.52}


#### Task 2: Identify gender of all authors

Your task is now to use your new function to identify the genders of all authors that you previously scraped. 

To do so, you first need to extract the first name of each author. You need to iterate over these names and use your function to identify the gender of the author.

Your result should be a dataframe with the following columns:

- article_url | author_full_name | first_name | author_order | estimated_gender | gender_probability

Author_order should be a number specifying where the author was in the author list for the publication (e.g., 0 = first author, 1 = second author...) _Estimated_gender_ should contain the API response on gender, and _gender_probability_ the certainty of the gender, according to the API.

Note:
- You will need to transform your dict to the dataframe shown above, with one author per line. (This means that each URL will be associated to multiple author names.)
- Make sure that you respect the rate limiting of the API. 
- Make sure that you handle exceptions and that your function continues 
- Note that you get a maximum of 1,000 free calls per day, so you need to make sure that you do not waste your API calls!
- The API may not have all names stored. For these names, store a _np.nan_ value as the gender.

Pickle the resulting dataframe with the name: 'author_gender.df.pkl'

In [9]:
import pickle
import pandas as pd
import requests
import numpy as np
import time
import json

# Load scraped data
with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)

articles_info = []

# Iterate over each article in a list
for article in data_list: 
    # Save url, title and authors of each article, count the order of autors  
    url = article['url']
    title = article['title']
    authors = article.get("authors", [])
    order = 0

    # Iterate over each author in a list
    for author in authors: 
        # Get name of each author
        full_name = author.get("name")
        full_name = str(full_name)

        # Check if the author value is missing
        if full_name: 
            # Get first name      
            full_name = full_name.split()
            name = full_name[0]
            # Use gender function
            gender_info = identify_gender(name)
            # Save gender and probability
            gender = gender_info.get('gender')
            probability = gender_info.get('probability')

        #if missing then np.nan   
        else:               
            name = [np.nan]

        #create a dictionary and append to a list   
        article_info = {'article_url':url, 'author_full_name':full_name, 'first_name':name,'author_order':order,'estimated_gender':gender,'gender_probability':probability}
        articles_info.append(article_info)

        order += 1
        
   

df = pd.DataFrame(articles_info)

In [10]:
assert 'article_url' in df.columns, "article_url column is missing"
assert 'author_full_name' in df.columns, "author_full_name column is missing"
assert 'first_name' in df.columns, "first_name column is missing"
assert 'author_order' in df.columns, "author_order column is missing"
assert 'estimated_gender' in df.columns, "estimated_gender column is missing"
assert 'gender_probability' in df.columns, "gender_probability column is missing"

with open('author_gender.df.pkl', 'wb') as f:
    pickle.dump(df, f)

display(df.head(10))

Unnamed: 0,article_url,author_full_name,first_name,author_order,estimated_gender,gender_probability
0,http://arxiv.org/abs/0705.0700v3,"[Raydonal, Ospina]",Raydonal,0,,0.0
1,http://arxiv.org/abs/0705.0700v3,"[Silvia, L., P., Ferrari]",Silvia,1,female,1.0
2,http://arxiv.org/abs/0705.2938v1,"[Guilhem, Coq]",Guilhem,0,male,0.99
3,http://arxiv.org/abs/0705.2938v1,"[Olivier, Alata]",Olivier,1,male,1.0
4,http://arxiv.org/abs/0705.2938v1,"[Marc, Arnaudon]",Marc,2,male,1.0
5,http://arxiv.org/abs/0705.2938v1,"[Christian, Olivier]",Christian,3,male,1.0
6,http://arxiv.org/abs/0705.4588v1,"[Shurong, Zheng]",Shurong,0,female,0.64
7,http://arxiv.org/abs/0705.4588v1,"[Guodong, Song]",Guodong,1,male,0.96
8,http://arxiv.org/abs/0705.4588v1,"[Ning-Zhong, Shi]",Ning-Zhong,2,,0.0
9,http://arxiv.org/abs/0706.1287v1,"[Helen, Armstrong]",Helen,0,female,1.0


## 4. Analyze gender distribution and authorship order

Now that you have gathered the necessary data, you will use this data to answer some research questions about gender equality in CSS research. Note that in calculating this, you need to handle that the API may have failed to identify the gender of some authors.

1. What fraction of the authors included are women? 
2. What happens to this fraction if you only include authors for which the gender_probability is higher than 80%? 
3. Being the first or single author on a research paper is an important status signal among researchers: it often means that you made the most work. What fraction of the first or single authors are women? 
4. Being the _last_ author on a research paper with _three or more authors_ is also an important status signal: it tends to mean that you were the one to acquire funding or lead the lab. What fraction of the last-authors on papers with three or more author are women?


In [11]:
#Question 1
criteria_1 = df['estimated_gender'] == "female"
female_total = criteria_1.sum()
total_rows = len(df)
female_fraction = female_total/total_rows
female_fraction = round(female_fraction, 2)
print(f'{female_fraction} of authors are women')
#Question 2
criteria_2 = (df['estimated_gender'] == "female") & (df['gender_probability'] > 0.8)
female_total_2 = criteria_2.sum()
fraction_2 = female_total_2/total_rows
fraction_2 = round(fraction_2, 2)
print(f'If the probability is changed to above 80% the fraction equals to {fraction_2}')
#Question 3
criteria_3 = (df['author_order'] == 0) & (df['estimated_gender'] == "female")
single_female = criteria_3.sum()
fraction_3 = single_female/total_rows
fraction_3 = round(fraction_3, 2)
print(f'The amount of women that are either first or a single author is {fraction_3}')
#Question 4
criteria_4= df[(df['author_order'] <=3) &(df['estimated_gender']=='female')]
female= len(criteria_4)
fraction_4 = (female)/(total_rows)
print(f'{fraction_4}')

0.24 of authors are women
If the probability is changed to above 80% the fraction equals to 0.18
The amount of women that are either first or a single author is 0.1
0.22877358490566038


## 5. Reflect on your findings

You have now carried out your analysis of the gender distribution in articles in CSS using scraped data. Reflect on your findings and method, and answer each of the following questions in a few sentences.

1. What are the implications of the observed gender distribution and author order in CSS? How do these distributions compare with your expectations?
2. How accurate do you think your findings are? What are the limitations of determining gender based solely on names? Are there cultural or regional nuances that the API might miss?
3. Reflect on the ethical considerations involved in scraping this data. 


1. The fraction of women is less than fraction of men authors, more male authors are first in order
2. Not accurate, the names could be used for both genders depepdning on a religion, culture, orvpersonal preferences
3. Consent, the authors did not have a chance to consent to use their data for study and Transparency, authors are not aware of the study purposes.

## 6. Scrape the paper abstract

Your next task is to get the abstract for each paper. You will use these abstracts in a later exercise in the course, where we will use text analysis to examine whether the content of research papers are a function of the gender of the author. 

To do so, you need to iterate over the papers that you have already identified, and scrape the abstract from the URL listed. 

#### Task 1: scrape_abstract()
Write a function scrape_abstract(url) that goes to the research paper URL, and scrapes the content of the abstract. The function should return the abstract as a string, and nothing else.

In [12]:
import requests
from bs4 import BeautifulSoup

def scrape_abstract(url):
    # get url and parse html
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    #find abstract by class name and extract it
    abstract = soup.find('blockquote',class_="abstract mathjax")
    abstract_text = abstract.get_text(separator = ' ')
    
    return abstract_text


# Test
url = "https://arxiv.org/abs/2307.13106"
print(scrape_abstract(url))



 Abstract: This guide introduces Large Language Models (LLM) as a highly versatile text analysis method within the social sciences. As LLMs are easy-to-use, cheap, fast, and applicable on a broad range of text analysis tasks, ranging from text annotation and classification to sentiment analysis and critical discourse analysis, many scholars believe that LLMs will transform how we do text analysis. This how-to guide is aimed at students and researchers with limited programming experience, and offers a simple introduction to how LLMs can be used for text analysis in your own research project, as well as advice on best practices. We will go through each of the steps of analyzing textual data with LLMs using Python: installing the software, setting up the API, loading the data, developing an analysis prompt, analyzing the text, and validating the results. As an illustrative example, we will use the challenging task of identifying populism in political texts, and show how LLMs move beyond 

#### Task 2: Scrape all urls

You will now use your function to scrape all the URLs that you collected in step 2.

The following will provide instructions for how you can go about this task. However, there are several ways to do this, and you are free to choose your preferred method.

Prepare your data:

1. Load your list of dicts from step 2 (scraping_result.pkl)
2. Use it to create a dataframe. 
3. Add a column 'scraped' which should be False for all rows, and a column 'abstract' that should be None for all rows.
4. Store the dataframe persistently (e.g., by pickling it.)

The scraping procedure:

1. Load the persitent pickle as dataframe (so that if your computer crashes, the function will continue where you were)
2. Repeat the following steps until there are no more rows with scraped == False:
3. Fetch a random row with scraped == False
4. Go to the URL and scrape the abstract.
5. Set abstract column in the dataframe to the resulting abstract, set scraped to True.
6. Store the dataframe persistently as a pickle. 

Remember: 
- You may use another strategy. However, since you will be scraping many pages, you should expect your scraper to encounter problems along the way. You therefore need to make sure that you regularly store the results persistently.
- Make sure to handle any exceptions gracefully.
- Be respectful toward the website owners: wait at least one second between each call. 

Your final result should be a dataframe stored as 'scraped_abstracts.df.pkl', with filled 'abstract' and 'scraped' columns.

<!-- [Evaluation: ]
- Load dataframe as df
- Check that the len of df = len of the result list from question 2. 
- Check that each line has an abstract, with len() > 100 e.g.
 -->

In [13]:
#open pickle file
with open ('scraping_result.pkl', 'rb') as file:
    data_list = pickle.load(file)

#convert to dataframe
df = pd.DataFrame(data_list)

#add scraped and abstract columns 
df['scraped'] = 'False'
df['abstract'] = 'None'

#for each link scrape abstract and save it in the dataframe
for link in df['url']:
    abstract = scrape_abstract(link)
    if abstract:
        df['scraped'] = 'True'
        df['abstract'] = abstract
    
display(df.head(5))

# Rename and save the final dataframe
df.to_pickle('scraped_abstracts.df.pkl')

Unnamed: 0,title,authors,url,scraped,abstract
0,Inflated Beta Distributions,"[{'name': 'Raydonal Ospina'}, {'name': 'Silvia...",http://arxiv.org/abs/0705.0700v3,True,\n Abstract: A sequential multiple testing p...
1,Codage arithmetique pour la description d'une ...,"[{'name': 'Guilhem Coq'}, {'name': 'Olivier Al...",http://arxiv.org/abs/0705.2938v1,True,\n Abstract: A sequential multiple testing p...
2,Variable Selection Incorporating Prior Constra...,"[{'name': 'Shurong Zheng'}, {'name': 'Guodong ...",http://arxiv.org/abs/0705.4588v1,True,\n Abstract: A sequential multiple testing p...
3,Bayesian Covariance Matrix Estimation using a ...,"[{'name': 'Helen Armstrong'}, {'name': 'Christ...",http://arxiv.org/abs/0706.1287v1,True,\n Abstract: A sequential multiple testing p...
4,Sensitivity of principal Hessian direction ana...,"[{'name': 'Luke A. Prendergast'}, {'name': 'Jo...",http://arxiv.org/abs/0706.1408v1,True,\n Abstract: A sequential multiple testing p...
