<a href="https://colab.research.google.com/github/AEGriffith/PhDUtilities/blob/main/ACM_Paper_Exctractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
#@title ACM Search Information
#@markdown Enter your search url:
search_url = "https://dl.acm.org/action/doSearch?fillQuickSearch=false&target=advanced&ContentItemType=research-article&expand=dl&CCSAnd=60&AfterYear=2018&BeforeYear=2023&AllField=%28Keyword%3A%28Creativity%29+OR+%28Fulltext%3A%28AI%2C+agent%2C+%22Artificial+Intelligence%22%29+AND+Fulltext%3A%28Creativity%29+AND+Fulltext%3A%28Collab*+Support+Tool%29%29%29" #@param {type: "string"}
#@markdown Enter the first and last search year:
start_year = 2018 #@param {type: "integer"}
end_year = 2023 #@param {type: "integer"}
#@markdown Enter the file path for the output CSV (including the file name):
filepath = "/content/drive/MyDrive/Quals/papers.csv" #@param {type: "string"} 


Note: This code is based on a script that my colleague, Gloria Katuka (https://github.com/gkatuka), wrote. I have adapted it for this specific use case.

# Run All

## Setup

In [24]:
%%capture
# install chromium, its driver, and selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
# configure Chrome to run headless (Colab has no display)
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

In [25]:
import urllib3
import pandas as pd

from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from collections import Counter
import re

In [26]:
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

## Functions

In [27]:
def modify_link(url, year):
  """
    This function modifies a given search URL by setting the page size to 50, restricting the date range
    to a single year, and finding the number of pages required to display all search results for that year.
    It relies on the global Selenium `driver` to load the first page of results.

    Parameters:
    url (str): The URL to be modified.
    year (int): The year to set the date range to.

    Returns:
    Tuple[str, int]: A tuple containing the modified URL, which still holds the {year} and {page}
    placeholders to be filled in with str.format, and the number of pages required to display all search results.
  """

  # Set the page size to 50 by replacing any existing pageSize parameter in the URL, or by adding one if it doesn't exist.
  if re.findall(r'pageSize=\d+', url):
    url = re.sub(r'pageSize=\d+', 'pageSize=50', url)
  else: 
    url = url + "&pageSize=50"
  # Set the date range to search within a given year.
  # This is done so that we can search one year at a time, as the ACM limits the number of papers shown to 2000.
  url = re.sub(r'AfterYear=\d+', 'AfterYear={year}', url)
  url = re.sub(r'BeforeYear=\d+', 'BeforeYear={year}', url)
  # Find the number of pages by dividing the number of search results by 50 (rounded up).
  # To do this the function uses Selenium to get the html source of the page, and then uses BeautifulSoup to parse it.
  driver.get(url.format(year=year))
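  # Wait (up to 10 s) until the result count is present; WebDriverWait does nothing until .until() is called.
  WebDriverWait(driver, 10).until(lambda d: d.find_elements("css selector", "span.hitsLength"))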
  html = driver.page_source
  soup = BeautifulSoup(html, "html.parser")
  num_results = int((soup.find("span", {"class": "hitsLength"}).text).replace(",",""))
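  # Ceiling division: avoids requesting an empty extra page when the count is an exact multiple of 50.
  num_pages: int = (num_results + 49) // 50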
  # Replace the 'startPage' parameter with a {page} placeholder, or add one if it doesn't exist.
  # The caller fills in the placeholder to navigate through multiple pages of results.
  if re.findall(r'startPage=\d+', url):
    url = re.sub(r'startPage=\d+', 'startPage={page}', url)
  else:
    url = url + "&startPage={page}"
  return url, num_pages
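
The URL rewriting and page arithmetic in `modify_link` can be checked without launching a browser. The sketch below is a minimal, standalone illustration, not part of the original script: `build_year_url` and `pages_needed` are hypothetical helper names, and `example` is a shortened stand-in for a real ACM search URL.

```python
import re

PAGE_SIZE = 50  # same page size that modify_link requests


def build_year_url(url: str, year: int, page: int) -> str:
    """Restrict a search URL to one year, force the page size, and set the page."""
    if re.search(r'pageSize=\d+', url):
        url = re.sub(r'pageSize=\d+', f'pageSize={PAGE_SIZE}', url)
    else:
        url += f'&pageSize={PAGE_SIZE}'
    # Searching one year at a time keeps each query under the result limit.
    url = re.sub(r'AfterYear=\d+', f'AfterYear={year}', url)
    url = re.sub(r'BeforeYear=\d+', f'BeforeYear={year}', url)
    if re.search(r'startPage=\d+', url):
        url = re.sub(r'startPage=\d+', f'startPage={page}', url)
    else:
        url += f'&startPage={page}'
    return url


def pages_needed(num_results: int, page_size: int = PAGE_SIZE) -> int:
    """Ceiling division: number of pages needed to list num_results items."""
    return (num_results + page_size - 1) // page_size


# Hypothetical, shortened search URL used only for illustration.
example = "https://dl.acm.org/action/doSearch?AfterYear=2018&BeforeYear=2023&AllField=Creativity"
print(build_year_url(example, 2020, 0))
print(pages_needed(432), pages_needed(450), pages_needed(0))  # 9 9 0
```

With 432 results the last page is partially filled (9 pages); with exactly 450 results, ceiling division still gives 9 pages rather than 10.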

In [28]:
def get_paper_info(search_url, filepath):
  """
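    This function extracts information about papers that match a given search query on the ACM Digital Library.
    It returns a DataFrame containing the title, DOI, month, year, citation count, and download count for each paper,
    and saves it as a csv. The year range comes from the global `start_year` and `end_year` values, and pages are
    loaded with the global Selenium `driver`, which is closed when the function finishes.

    Parameters:
    search_url (str): The search URL for the query on the ACM Digital Library
    filepath (str): The path (including file name) where the csv is saved

    Returns:
    df (DataFrame): A DataFrame containing the extracted information for each paper.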
    """

  paper_dict = {"paper_doi": [], "paper_title": [], "paper_month": [], "paper_year":[], "citation_count": [], "download_count": []}
  paper_urls = []

  # Loop through all years in the specified date range
  for year in range(start_year, end_year+1):
    
    page_url, num_pages = modify_link(search_url, year)

    # loop through all the pages of the search results for each year
    for page in range(num_pages):
      driver.get(page_url.format(page=page, year=year))
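      # Wait (up to 10 s) until result titles are present; WebDriverWait does nothing until .until() is called.
      WebDriverWait(driver, 10).until(lambda d: d.find_elements("css selector", "span.hlFld-Title"))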
      html = driver.page_source
      soup = BeautifulSoup(html, "html.parser")

      # Extract title and doi
      title_spans = soup.find_all("span", {"class": "hlFld-Title"})
      for title_span in title_spans:
        paper_url = title_span.find("a", href=True)
        if paper_url:
          paper_url = urllib3.util.url.parse_url(paper_url["href"]).url
          # Build the absolute link on the same host as the search URL (dl.acm.org).
          paper_urls.append(f'https://dl.acm.org/{paper_url.lstrip("/")}')
        paper_title = title_span.find("a").text
        paper_dict["paper_title"].append(paper_title)
        paper_dict["paper_doi"].append(paper_url)

      # Extract citation and download counts
      metrics = soup.find_all("li", {"class": "metric-holder"})
      for metric in metrics:
        # citation count
        paper_citation = metric.find("div", {"class": "citation"})
        if paper_citation:
          citation = paper_citation.text
          citation = citation.replace("Total Citations", "")
          citation = citation.replace(",", "")
          citation = citation.replace(" ", "")
          paper_dict["citation_count"].append(int(citation))
        else:
          paper_dict["citation_count"].append(0)
        # download count
        paper_download = metric.find("div", {"class": "metric"})
        if paper_download:
          download = paper_download.text
          download = download.replace("Total Downloads", "")
          download = download.replace(",", "")
          download = download.replace(" ", "")
          paper_dict["download_count"].append(int(download))
        else:
          paper_dict["download_count"].append(0)

      # Extract paper dates
      paper_dates = soup.find_all("div", {"class": "bookPubDate"})
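      for paper_date in paper_dates:
        # Split the date text into month and year. Distinct names are used so that
        # the outer loop variable `year` is not overwritten.
        pub_month, pub_year = paper_date.text.split(" ")
        paper_dict["paper_month"].append(pub_month)
        paper_dict["paper_year"].append(pub_year)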
  df = pd.DataFrame(paper_dict)
  df.to_csv(filepath)
  driver.quit()
  return df
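
The citation and download branches in `get_paper_info` apply the same three string replacements before calling `int`. As a minimal sketch of that cleanup step in isolation, the hypothetical helper `parse_count` below (not part of the original script) strips the label, commas, and whitespace; the sample strings are made up and only assume that the metric text consists of a label plus a number.

```python
import re


def parse_count(text: str, label: str) -> int:
    """Remove a label such as "Total Citations", then commas and whitespace,
    and return the remaining digits as an int (0 if nothing is left)."""
    digits = re.sub(r'[,\s]', '', text.replace(label, ''))
    return int(digits) if digits else 0


# Made-up sample strings for illustration only.
print(parse_count("Total Citations 1,234", "Total Citations"))  # 1234
print(parse_count("Total Downloads 56", "Total Downloads"))     # 56
print(parse_count("", "Total Downloads"))                       # 0
```

Returning 0 for empty text mirrors the `else` branches above, which record 0 when a metric is missing.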

In [29]:
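# chromedriver was copied to /usr/bin during setup, so it is found on PATH.
# Recent Selenium 4 releases no longer accept the driver path as a positional argument.
driver = webdriver.Chrome(options=options)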
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Collect the paper information into a dataframe and save it as a csv.
df = get_paper_info(search_url, filepath)


# Specific processing for my Qualifying Exam

In [34]:
df["download_cutoff"] = False
df["citation_cutoff"] = False

for year in range(start_year, end_year+1):
  # Flag papers at or above the mean download / citation count of their publication year.
  df_by_year = df[df['paper_year'].astype(int) == year]
  avg_downloads = df_by_year['download_count'].mean()
  avg_citations = df_by_year['citation_count'].mean()
  df_keep_download = df_by_year[df_by_year['download_count'] >= avg_downloads]
  df_keep_citation = df_by_year[df_by_year['citation_count'] >= avg_citations]
  df.loc[df_keep_download.index, 'download_cutoff'] = True
  df.loc[df_keep_citation.index, 'citation_cutoff'] = True

df.to_csv(filepath)
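
To make the cutoff rule concrete: a paper is flagged when its count is at or above the mean for papers published in the same year. The sketch below applies that rule to a few made-up rows using only the standard library. The rows and the name `flag_above_yearly_mean` are illustrative assumptions, not data or code from the scrape.

```python
from collections import defaultdict
from statistics import mean


def flag_above_yearly_mean(rows, key):
    """Return one boolean per row: True when row[key] is at or above the
    mean of that key among rows with the same paper_year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["paper_year"]].append(row[key])
    yearly_mean = {year: mean(values) for year, values in by_year.items()}
    return [row[key] >= yearly_mean[row["paper_year"]] for row in rows]


# Made-up rows for illustration only.
rows = [
    {"paper_year": 2018, "download_count": 100, "citation_count": 2},
    {"paper_year": 2018, "download_count": 300, "citation_count": 0},
    {"paper_year": 2019, "download_count": 50, "citation_count": 4},
    {"paper_year": 2019, "download_count": 50, "citation_count": 6},
]
download_cutoff = flag_above_yearly_mean(rows, "download_count")
citation_cutoff = flag_above_yearly_mean(rows, "citation_count")
print(download_cutoff)  # [False, True, True, True]
print(citation_cutoff)  # [True, False, False, True]
```

Because the comparison is `>=`, both 2019 papers pass the download cutoff: each sits exactly at that year's mean of 50.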

In [35]:
# count number of papers that meet the criteria for downloads or citations
print("Number of papers that meet the criteria for downloads: ", df[df['download_cutoff'] == True].shape[0])
print("Number of papers that meet the criteria for citations: ", df[df['citation_cutoff'] == True].shape[0])

# count the number of papers that meet the criteria for downloads but not citations
print("Number of papers that meet the criteria for downloads but not citations: ", df[(df['download_cutoff'] == True) & (df['citation_cutoff'] == False)].shape[0])

# count the number of papers that meet the criteria for citations but not downloads
print("Number of papers that meet the criteria for citations but not downloads: ", df[(df['download_cutoff'] == False) & (df['citation_cutoff'] == True)].shape[0])

# count the number of papers that meet the criteria for both downloads and citations
print("Number of papers that meet the criteria for both downloads and citations: ", df[(df['download_cutoff'] == True) & (df['citation_cutoff'] == True)].shape[0])

Number of papers that meet the criteria for downloads:  840
Number of papers that meet the criteria for citations:  924
Number of papers that meet the criteria for downloads but not citations:  291
Number of papers that meet the criteria for citations but not downloads:  375
Number of papers that meet the criteria for both downloads and citations:  549
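
The counts above split into disjoint groups: 549 + 291 = 840 papers meet the download criterion, and 549 + 375 = 924 meet the citation criterion. A single pass that tallies the pair of flags for each paper yields all of these groups at once. The sketch below shows the idea on made-up flags; `tally_flags` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def tally_flags(download_flags, citation_flags):
    """Count papers in each combination of the two cutoff flags."""
    pairs = Counter(zip(download_flags, citation_flags))
    return {
        "both": pairs[(True, True)],
        "downloads_only": pairs[(True, False)],
        "citations_only": pairs[(False, True)],
        "neither": pairs[(False, False)],
    }


# Made-up flags for five papers, for illustration only.
counts = tally_flags(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
print(counts)
# {'both': 2, 'downloads_only': 1, 'citations_only': 1, 'neither': 1}
```

Here `both + downloads_only` gives the papers meeting the download criterion, and `both + citations_only` gives those meeting the citation criterion.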


In [36]:
# count the number of papers that meet the criteria for both downloads and citations for each year
for year in range(start_year, end_year + 1):
  print(f"Total number of papers in {year}: ", df[df['paper_year'].astype(int) == year].shape[0])
  print(f"Number of papers that meet the criteria for both downloads and citations in {year}: ", df[(df['download_cutoff'] == True) & (df['citation_cutoff'] == True) & (df['paper_year'].astype(int) == year)].shape[0])
  print(f"Number of papers that meet the criteria for downloads but not citations in {year}: ", df[(df['download_cutoff'] == True) & (df['citation_cutoff'] == False) & (df['paper_year'].astype(int) == year)].shape[0])
  print(f"Number of papers that meet the criteria for citations but not downloads in {year}: ", df[(df['download_cutoff'] == False) & (df['citation_cutoff'] == True) & (df['paper_year'].astype(int) == year)].shape[0])

Total number of papers in 2018:  432
Number of papers that meet the criteria for both downloads and citations in 2018:  97
Number of papers that meet the criteria for downloads but not citations in 2018:  26
Number of papers that meet the criteria for citations but not downloads in 2018:  53
Total number of papers in 2019:  449
Number of papers that meet the criteria for both downloads and citations in 2019:  87
Number of papers that meet the criteria for downloads but not citations in 2019:  38
Number of papers that meet the criteria for citations but not downloads in 2019:  53
Total number of papers in 2020:  508
Number of papers that meet the criteria for both downloads and citations in 2020:  104
Number of papers that meet the criteria for downloads but not citations in 2020:  30
Number of papers that meet the criteria for citations but not downloads in 2020:  76
Total number of papers in 2021:  624
Number of papers that meet the criteria for both downloads and citations in 2021:  

In [33]:
print("Number of papers that meet the criteria for either downloads or citations or both: ", df[
    (df['download_cutoff'] == True) | (df['citation_cutoff'] == True)].shape[0])

# count the number of papers that meet the criteria for either downloads or citations or both for each year
for year in range(start_year, end_year + 1):
    print("Number of papers that meet the criteria for either downloads or citations or both in {}: ".format(year), df[
        ((df['download_cutoff'] == True) | (df['citation_cutoff'] == True)) & (
                    df['paper_year'].astype(int) == year)].shape[0])
    



Number of papers that meet the criteria for either downloads or citations or both:  840
Number of papers that meet the criteria for either downloads or citations or both in 2018:  123
Number of papers that meet the criteria for either downloads or citations or both in 2019:  125
Number of papers that meet the criteria for either downloads or citations or both in 2020:  134
Number of papers that meet the criteria for either downloads or citations or both in 2021:  202
Number of papers that meet the criteria for either downloads or citations or both in 2022:  255
Number of papers that meet the criteria for either downloads or citations or both in 2023:  1
