# Research Papers Web Scraping from ACM Digital Library

## Team Snap Papers

This notebook explains how to scrape the research papers published by members of the Humanities & Engineering Special Interest Group from the ACM Digital Library.

Import the necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time

Define the Profile ID of the author:

In [2]:
profile_id = '81321495239'

The response is obtained with the <code>requests.get</code> method:

In [3]:
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
url = 'https://dl.acm.org/profile/{}/publications?Role=author&startPage=0&pageSize=50'.format(profile_id)

response = requests.get(url, headers=headers)
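
The `startPage` and `pageSize` query parameters in the URL above suggest that the publication list is paginated. The following is a minimal sketch of fetching several pages, under the assumption (not confirmed by this notebook) that `startPage` is a zero-based page index. The helper names `build_publications_url` and `fetch_publication_pages` are hypothetical. The sketch also checks the HTTP status and pauses between requests.

```python
import time

import requests

# Same URL template as in the notebook, with the paging values made configurable
BASE_URL = "https://dl.acm.org/profile/{}/publications?Role=author&startPage={}&pageSize={}"
HEADERS = {"user-agent": "Mozilla/5.0"}  # shortened; the notebook sends a full browser string


def build_publications_url(profile_id, start_page=0, page_size=50):
    """Hypothetical helper: fill in the URL template for one listing page."""
    return BASE_URL.format(profile_id, start_page, page_size)


def fetch_publication_pages(profile_id, max_pages=3, delay_seconds=2.0):
    """Hypothetical helper: yield the HTML text of successive listing pages.

    Assumes startPage is a zero-based page index; verify this against the site.
    """
    for page in range(max_pages):
        url = build_publications_url(profile_id, start_page=page)
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        yield response.text
        time.sleep(delay_seconds)  # pause before the next request
```

Calling `build_publications_url('81321495239')` reproduces the URL used above. Each string yielded by `fetch_publication_pages` could be passed to `BeautifulSoup` in the same way as `response.text`.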

Parse the HTML of the response into a BeautifulSoup object:

In [4]:
paper_doc = BeautifulSoup(response.text,'html.parser')

Extract and print the information for every publication on the page:

In [5]:
paper_names = []
for paper in paper_doc.find_all("li", {"class" : "issue-item-container"}):
    paper_type = ''
    paper_title = ''
    author_section = []
    abstract = ''
    details = ''
    citations = '0'
    metrics = '0'
    authors_list = []
    authors_list_string = ''
    url = ''
    
    if paper.find('div', {'class' : 'issue-heading'}) is not None:
        paper_type = paper.find('div', {'class' : 'issue-heading'}).text
    paper_title = paper.find('h5', {'class': 'issue-item__title'}).text
    author_section = paper.find("ul", {"title" : "list of authors"})
    abstract = paper.find("div", {"class" : "issue-item__abstract"}).find("p").text
    if paper.find("span", {"class" : "citation"}) is not None:
        citations = paper.find("span", {"class" : "citation"}).find("span").text
    if paper.find("span", {"class" : "metric"}) is not None:
        metrics = paper.find("span", {"class" : "metric"}).find("span").text
    details = paper.find("div", {"class" : "issue-item__detail"}).text
    url = "https://dl.acm.org"+paper.find("h5", {"class" : "issue-item__title"}).find("a").get("href")
    
    print(paper_type)
    print(paper_title)
    for author in author_section.find_all("a"):
        authors_list.append(author.get("title"))
    authors_list_string = ' | '.join(authors_list)
    print(authors_list_string)
    print(details)
    print(abstract)
    print(citations)
    print(metrics)
    print(url)
    
    print("--------------------------")

Doctoral Theses
Knowledge Management Using SpiCE
Timothy Joseph Maciag | Hepting, Daryl | Arbuthnott, Katherine | Delbaere, Marjorie

The idea of Knowledge Management (KM) is continually evolving. A traditional and popular idea of KM is one that emphasizes the activity of transforming data to information, and information to knowledge. Another popular idea of KM emphasizes the ...
0
0
https://dl.acm.org/doi/book/10.5555/AAI28140953
--------------------------

Supporting Sustainable Decision-Making: Evaluation of Previous Support Tools with New Designs
Timothy Maciag

Revision with unchanged content. The quality of the natural environment has become one of the primary concerns in present society. However, very little has been done to illuminate the various connections between our household purchases and the effect ...
0
0
https://dl.acm.org/doi/book/10.5555/2378550
--------------------------
Article
Web-Based Support of Crop Selection for Climate Adaptation
Daryl H. Hepting | Timothy Mac
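
The loop above guards the heading, citation, and metric lookups with `is not None`, but the title, abstract, and detail lookups call `.text` directly and would raise `AttributeError` for an entry that lacks one of those elements. A minimal sketch of one way to make every lookup tolerant is shown below. The helper `text_or_default` is hypothetical, and the HTML snippet is hand-written to imitate the class names used above, not real ACM markup.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, attrs, default=""):
    """Hypothetical helper: stripped text of the first matching tag, or a default."""
    node = parent.find(name, attrs)
    return node.get_text(strip=True) if node is not None else default


# Hand-written sample that imitates the class names used above; not real ACM markup
sample_html = """
<li class="issue-item-container">
  <h5 class="issue-item__title"><a href="/doi/10.0000/example">An Example Title</a></h5>
  <div class="issue-item__detail"> Example Venue, 2021 </div>
</li>
"""
item = BeautifulSoup(sample_html, "html.parser").find("li", {"class": "issue-item-container"})

title = text_or_default(item, "h5", {"class": "issue-item__title"})
# The sample has no abstract element, so the default (empty string) is returned
abstract = text_or_default(item, "div", {"class": "issue-item__abstract"})
details = text_or_default(item, "div", {"class": "issue-item__detail"})

print(repr(title), repr(abstract), repr(details))
```

Unlike the loop above, this helper reads the text of the whole matched element and strips surrounding whitespace, so the values may differ slightly from those obtained with `.find("p").text`.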

Prepare the dictionary of information

In [6]:
# Build the record (one future row of the table) for a single paper
def add_in_paper_repo(paper_title, paper_type, authors_list_string, abstract, details, citations, metrics, url):
    # Return a fresh dictionary on every call so that records of earlier papers are never overwritten
    paper_repos_dict = {
                        'Paper Title' : paper_title,
                        'Publication Type' : paper_type,
                        'Authors' : authors_list_string,
                        'Abstract' : abstract,
                        'Publication' : details,
                        'Citations' : citations,
                        'Metrics' : metrics,
                        'URL' : url }

    return paper_repos_dict

Scrape the data and generate the CSV file:

In [7]:
paper_names = []
data = []
for paper in paper_doc.find_all("li", {"class" : "issue-item-container"}):
    paper_type = ''
    paper_title = ''
    author_section = []
    abstract = ''
    details = ''
    citations = '0'
    metrics = '0'
    authors_list = []
    authors_list_string = ''
    url = ''
    
    if paper.find('div', {'class' : 'issue-heading'}) is not None:
        paper_type = paper.find('div', {'class' : 'issue-heading'}).text
    paper_title = paper.find('h5', {'class': 'issue-item__title'}).text
    author_section = paper.find("ul", {"title" : "list of authors"})
    abstract = paper.find("div", {"class" : "issue-item__abstract"}).find("p").text
    if paper.find("span", {"class" : "citation"}) is not None:
        citations = paper.find("span", {"class" : "citation"}).find("span").text
    if paper.find("span", {"class" : "metric"}) is not None:
        metrics = paper.find("span", {"class" : "metric"}).find("span").text
    details = paper.find("div", {"class" : "issue-item__detail"}).text
    url = "https://dl.acm.org"+paper.find("h5", {"class" : "issue-item__title"}).find("a").get("href")

    for author in author_section.find_all("a"):
        authors_list.append(author.get("title"))
    authors_list_string = ' | '.join(authors_list)
    
    # Build the record once, print it, and keep a copy for the DataFrame
    record = add_in_paper_repo(paper_title, paper_type, authors_list_string, abstract, details, citations, metrics, url)
    print(record)
    
    print("--------------------------")
    
    data.append(record.copy())

df = pd.DataFrame(data)
df.to_csv("./data.csv", sep='#', encoding='utf-8')

{'Paper Title': 'Knowledge Management Using SpiCE', 'Publication Type': 'Doctoral Theses', 'Authors': 'Timothy Joseph Maciag | Hepting, Daryl | Arbuthnott, Katherine | Delbaere, Marjorie', 'Abstract': 'The idea of Knowledge Management (KM) is continually evolving. A traditional and popular idea of KM is one that emphasizes the activity of transforming data to information, and information to knowledge. Another popular idea of KM emphasizes the ...', 'Publication': '', 'Citations': '0', 'Metrics': '0', 'URL': 'https://dl.acm.org/doi/book/10.5555/AAI28140953'}
--------------------------
{'Paper Title': 'Supporting Sustainable Decision-Making: Evaluation of Previous Support Tools with New Designs', 'Publication Type': '', 'Authors': 'Timothy Maciag', 'Abstract': 'Revision with unchanged content. The quality of the natural environment has become one of the primary concerns in present society. However, very little has been done to illuminate the various connections between our household purc
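
Because the file is written with `sep='#'`, it has to be read back with the same separator. The sketch below uses made-up rows rather than the scraped data and a temporary file path, both chosen only for illustration. It also converts the count columns from text to integers, assuming the counts might contain thousands separators such as `1,204` (an assumption, not something observed in the output above).

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the scraped records
rows = [
    {"Paper Title": "Example Paper #1", "Citations": "3", "Metrics": "1,204"},
    {"Paper Title": "Another Example", "Citations": "0", "Metrics": "0"},
]

# Temporary location used only for this illustration
csv_path = os.path.join(tempfile.mkdtemp(), "example_data.csv")
pd.DataFrame(rows).to_csv(csv_path, sep="#", encoding="utf-8")

# Use the same separator as when writing; index_col=0 reads the saved row index back as the index
loaded = pd.read_csv(
    csv_path,
    sep="#",
    index_col=0,
    dtype={"Citations": str, "Metrics": str},  # keep counts as text until they are cleaned
)

# Remove thousands separators (assumed format) and convert the counts to integers
for column in ["Citations", "Metrics"]:
    loaded[column] = loaded[column].str.replace(",", "", regex=False).astype(int)

print(loaded)
```

A title that itself contains `#` is quoted by `to_csv` by default, so it survives the round trip.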

The code above is based on a Medium article by Nandini Saini:

Saini, N. (2021). Scraping Information of Research Papers on Google Scholar using Python. Medium. Web Article. https://medium.com/@nandinisaini021/scraping-publications-of-aerial-image-research-papers-on-google-scholar-using-python-a0dee9744728