## Pulling Papers

In [1]:
# Load Libraries
import os
import shutil
import DancePartner as dance
import pandas as pd

# Define the output directory
output_directory = os.path.join(os.getcwd(), "pulling_papers")

# Remove it if it already exists and start anew
if os.path.exists(output_directory):
    shutil.rmtree(output_directory)
os.mkdir(output_directory, mode = 0o777)



DancePartner can be used to pull papers from PubMed, Scopus, and OSTI. Papers can be pulled from databases individually, or multiple databases at a time. The default priority is clean text whenever possible followed by titles and abstracts. Users may specify different priorities when pulling text. 

Here we will go through various ways that publications can be pulled. Our examples are:

1. Pulling papers from a single database
2. Pulling papers from multiple databases


### Example 1: Pulling papers from a single database

Here, we have a list of PubMed paper IDs, which can be either numeric or string type. The last ID does not reference a real paper; it demonstrates how the package behaves when it encounters a paper that does not exist.

In [2]:
paper_ids = [9851916, 16803962, 12628183, 15035988, 17626846, 18675916, 21858180, 16803962333333]
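Because IDs may arrive as a mix of numbers and strings, it can help to tidy the list before pulling. The helper below is a minimal sketch, not part of DancePartner: `normalize_ids` is a hypothetical name, and the choice to convert everything to strings and drop duplicates is an assumption made for illustration.

```python
def normalize_ids(ids):
    """Convert mixed numeric/string IDs to unique strings, keeping first-seen order."""
    seen = set()
    normalized = []
    for raw in ids:
        text = str(raw).strip()  # handles ints and strings with stray whitespace
        if text and text not in seen:
            seen.add(text)
            normalized.append(text)
    return normalized


# Illustrative input only: two spellings of the same ID plus stray whitespace
mixed_ids = [9851916, "16803962", " 12628183 ", 9851916]
print(normalize_ids(mixed_ids))
# prints ['9851916', '16803962', '12628183']
```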

Now we call the paper-pulling function, `pull_papers`. Note that this function may take a while depending on the number of publications to pull. Expect approximately 1-5 seconds per publication.
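As a rough planning aid, the 1-5 seconds per publication figure can be turned into a time range for a given list of IDs. This is only back-of-the-envelope arithmetic; `estimate_pull_seconds` is a hypothetical helper, and real timings depend on network conditions and the databases involved.

```python
def estimate_pull_seconds(n_papers, low=1.0, high=5.0):
    """Return a (minimum, maximum) estimate in seconds for pulling n_papers."""
    return n_papers * low, n_papers * high


# Eight IDs, as in the example list above
fastest, slowest = estimate_pull_seconds(8)
print(f"Expect roughly {fastest:.0f} to {slowest:.0f} seconds")
# prints Expect roughly 8 to 40 seconds
```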

In [3]:
os.mkdir(os.path.join(output_directory, "pubmed_paper_output_folder"))
dance.pull_papers(pubmed_ids = paper_ids, output_directory = os.path.join(output_directory, "pubmed_paper_output_folder"))

2025-04-24 10:47:35 WE45748 metapub.findit[19799] INFO FindIt Cache initialized at /Users/degn400/.cache/findit.db


Let's take a look at the content of the folder that we created:

In [4]:
os.listdir(os.path.join(output_directory, "pubmed_paper_output_folder"))

['pubmed_tarballs',
 'pubmed_clean',
 'output_summary.txt',
 'pubmed_abstracts',
 'pubmed_pdfs']

We see that the output folder contains an `output_summary.txt` file as well as several subdirectories. The `pubmed_tarballs` folder contains the `[].tar.gz` files that are downloaded as an intermediate step when collecting full papers from the PubMed database. They are large files that do not need to be kept. If you would like to write them to a different location, you can specify this in `dance.pull_papers()` with the optional `tarball_path` parameter. The other three subfolders contain the papers themselves. The `output_summary.txt` file gives additional insight into these folders. Let's take a look at what it says:

In [5]:
with open(os.path.join(output_directory, "pubmed_paper_output_folder", "output_summary.txt"), "r") as f:
    print(f.read())

Output Summary for Pulling Papers
Created: 2025-04-24 10:47:45.759754
Total Num. Articles: 8
Total Num. Articles Found: 7
Number of Full Text: 3
Number of Title & Abstracts: 4
Number Missing: 1



This gives a detailed look at the results of the `pull_papers` function. It shows when the files were downloaded, the total number of articles that were searched for, and a breakdown of how many papers of each download type were found. Users may also specify which publication type they would like to pull with the `type` argument, whether that be "abstracts", "full text", or "both", where priority is given to full-text publications. Let's pull some abstracts.

In [6]:
# Create directory to hold papers 
os.mkdir(os.path.join(output_directory, "pubmed_abstracts"))

# Pull papers
dance.pull_papers(pubmed_ids = paper_ids, output_directory = os.path.join(output_directory, "pubmed_abstracts"), type = "abstract")

# Read summary file 
with open(os.path.join(output_directory, "pubmed_abstracts", "output_summary.txt"), "r") as f:
    print(f.read())

Output Summary for Pulling Papers
Created: 2025-04-24 10:47:50.005828
Total Num. Articles: 8
Total Num. Articles Found: 7
Number of Full Text: 0
Number of Title & Abstracts: 7
Number Missing: 1
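The summary files printed above follow a simple `key: value` layout, so they can be read into a dictionary when the counts are needed programmatically. The parser below is a minimal sketch that assumes the layout stays exactly as shown in these outputs; `parse_summary` is a hypothetical helper, not a DancePartner function.

```python
def parse_summary(text):
    """Parse 'key: value' lines from an output summary into a dict."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip the title line and blank lines
        value = value.strip()
        # Counts become ints; anything else (such as the timestamp) stays a string
        summary[key.strip()] = int(value) if value.isdigit() else value
    return summary


# Text copied from the summary printed above
example = """Output Summary for Pulling Papers
Created: 2025-04-24 10:47:50.005828
Total Num. Articles: 8
Total Num. Articles Found: 7
Number of Full Text: 0
Number of Title & Abstracts: 7
Number Missing: 1
"""
summary = parse_summary(example)
print(summary["Number Missing"])
# prints 1
```

In practice the text would come from `f.read()` on the `output_summary.txt` file rather than from a literal string.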



If using the output CSV from LitPortal, read the table into Python. For PubMed and OSTI, please use the `OriginId` column. For Scopus, please use the `DOI` column.

Let's pull papers from Scopus. Scopus requires an API key. Save the key as "scopus_key.txt" and put it in your example_data folder. More details can be found here: https://dev.elsevier.com/

In [7]:
# Create directory to hold papers
os.mkdir(os.path.join(output_directory, "scopus_papers"))

# Read the Scopus API key, dropping surrounding whitespace such as a trailing newline
with open(os.path.join(os.getcwd(), "../example_data/scopus_key.txt"), "r") as f:
    scopus_api_key = f.read().strip()

# Pull papers
dance.pull_papers(scopus_ids = ["10.1186/s40168-021-01035-8", "10.1002/bit.26296", "10.1002/pmic.200300397", "10.1074/mcp.M115.057117"],
                output_directory = os.path.join(output_directory, "scopus_papers"), scopus_api_key = scopus_api_key)

# Read summary file 
with open(os.path.join(output_directory, "scopus_papers", "output_summary.txt"), "r") as f:
    print(f.read())

Output Summary for Pulling Papers
Created: 2025-04-24 10:47:53.884140
Total Num. Articles: 4
Total Num. Articles Found: 4
Number of Full Text: 1
Number of Title & Abstracts: 3
Number Missing: 0
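For the LitPortal case mentioned earlier, the relevant ID column can be extracted from the exported CSV before calling `pull_papers`. The sketch below uses only the standard library and assumes the export has a header row containing the `OriginId` and `DOI` columns named in the text. The helper name `read_id_column` and the two demo rows are hypothetical, chosen only to illustrate the idea.

```python
import csv
import io


def read_id_column(handle, column):
    """Return the non-empty values of one column from an open CSV file."""
    values = []
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        if value:
            values.append(value)
    return values


# Stand-in for a LitPortal export; a real file would be opened with open(path, newline="")
demo_csv = io.StringIO(
    "OriginId,DOI\n"
    "9851916,\n"
    ",10.1002/bit.26296\n"
)
pubmed_ids = read_id_column(demo_csv, "OriginId")
demo_csv.seek(0)  # rewind before reading a second column
scopus_ids = read_id_column(demo_csv, "DOI")
print(pubmed_ids, scopus_ids)
# prints ['9851916'] ['10.1002/bit.26296']
```

The resulting lists could then be passed to `pull_papers` through the `pubmed_ids`, `osti_ids`, or `scopus_ids` arguments used in the cells above.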



And finally, here is an example using OSTI. 

In [8]:
# Create directory to hold osti papers
os.mkdir(os.path.join(output_directory, "osti_papers"))

# Pull papers
dance.pull_papers(osti_ids = ["2229172", "1629838", "1766618", "1379914"], output_directory = os.path.join(output_directory, "osti_papers"))

# Read summary file 
with open(os.path.join(output_directory, "osti_papers", "output_summary.txt"), "r") as f:
    print(f.read())





Output Summary for Pulling Papers
Created: 2025-04-24 10:47:56.264270
Total Num. Articles: 4
Total Num. Articles Found: 3
Number of Full Text: 0
Number of Title & Abstracts: 3
Number Missing: 1
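As noted earlier, the `pubmed_tarballs` folder holds large intermediate files that do not need to be kept. One way to reclaim the space after a pull is sketched below. `remove_tarballs` is a hypothetical helper rather than a DancePartner function, and it assumes the tarballs were written to the default `pubmed_tarballs` subfolder instead of a custom `tarball_path`.

```python
import os
import shutil
import tempfile


def remove_tarballs(pull_folder, tarball_name="pubmed_tarballs"):
    """Delete the tarball subfolder if it exists; return True if it was removed."""
    tarball_dir = os.path.join(pull_folder, tarball_name)
    if os.path.isdir(tarball_dir):
        shutil.rmtree(tarball_dir)
        return True
    return False


# Demonstrate on a throwaway folder that mimics part of the output layout
with tempfile.TemporaryDirectory() as demo_folder:
    os.mkdir(os.path.join(demo_folder, "pubmed_tarballs"))
    os.mkdir(os.path.join(demo_folder, "pubmed_clean"))
    removed = remove_tarballs(demo_folder)
    remaining = sorted(os.listdir(demo_folder))

print(removed, remaining)
# prints True ['pubmed_clean']
```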



### Example 2: Pulling Papers from Multiple Databases

Oftentimes, we want to pull papers from more than just one database. To do so, we pass a different set of arguments to our `pull_papers` function. Instead of specifying a database and a list of IDs, we can supply the paths to the CSV files downloaded from each database.

In [None]:
import os
import shutil
import DancePartner as dance
import pandas as pd


pubmed_path = os.path.join(os.getcwd(), "vignette_data/PubMed_Export.csv")
scopus_path = os.path.join(os.getcwd(), "vignette_data/Scopus_Export.csv")
osti_path = os.path.join(os.getcwd(), "vignette_data/OSTI_Export.csv")

First, let's deduplicate the papers with the `deduplicate_papers` function.

Afterwards, we will specify the output folder location as we did previously. Please note that using Scopus requires an API key to function properly. Instructions to obtain one can be found [here](https://dev.elsevier.com/). We will read in our key when pulling the papers, so remember to replace those lines with yours.

In [10]:
deduplicated_papers = dance.deduplicate_papers(pubmed_path, scopus_path, osti_path)
deduplicated_papers

| | pubmed | DOI | scopus | osti | Title |
|---|---|---|---|---|---|
| 0 | | 10.1002/aic.16396 | | 1610933 | |
| 1 | | 10.1002/ange.202212074 | | 1894730 | |
| 2 | | 10.1002/anie.202212074 | | 1900267 | |
| 3 | 15822095 | 10.1002/arch.20053 | | | |
| 4 | | 10.1002/bies.202300188 | | 2376130 | |
| ... | ... | ... | ... | ... | ... |
| 288 | | 10.7554/eLife.87303 | | 2282793 | |
| 289 | | 10.7554/elife.60049 | | 1825553 | |
| 290 | | 10.7717/peerj.5245 | 2-s2.0-85050639015 | | |
| 291 | 12934925 | | | | probing the molecular physiology of the microb... |


This is the `deduped_table` that is needed by `pull_papers()`. Papers will be pulled prioritizing full text over abstracts, in the order of PubMed, Scopus, and OSTI.
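To make the database ordering concrete, the sketch below picks the first available ID for a row in the order PubMed, Scopus, OSTI. It is an illustration of the stated priority only, not DancePartner's implementation: `first_available_source` is a hypothetical helper, the rows are plain dictionaries with `None` for missing values (a DataFrame would hold `NaN` instead), and the full-text-versus-abstract fallback is not modeled.

```python
SOURCE_PRIORITY = ("pubmed", "scopus", "osti")


def first_available_source(row, priority=SOURCE_PRIORITY):
    """Return (source, id) for the first non-empty ID in priority order, or None."""
    for source in priority:
        value = row.get(source)
        if value is not None and str(value).strip():
            return source, str(value).strip()
    return None


# Rows taken from the deduplicated table shown above
rows = [
    {"pubmed": "15822095", "DOI": "10.1002/arch.20053", "scopus": None, "osti": None},
    {"pubmed": None, "DOI": "10.7717/peerj.5245", "scopus": "2-s2.0-85050639015", "osti": None},
    {"pubmed": None, "DOI": "10.1002/aic.16396", "scopus": None, "osti": "1610933"},
]
for row in rows:
    print(first_available_source(row))
# prints ('pubmed', '15822095')
# prints ('scopus', '2-s2.0-85050639015')
# prints ('osti', '1610933')
```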

In [11]:
# Make an example for this deduplicated data 
os.mkdir(os.path.join(output_directory, "deduped_example"))

# Read the Scopus API key, dropping surrounding whitespace such as a trailing newline
with open(os.path.join(os.getcwd(), "../example_data/scopus_key.txt"), "r") as f:
    scopus_api_key = f.read().strip()

# To save time, keep only the first 10 rows and the last 10 rows of the deduplicated papers
a_subset = pd.concat([deduplicated_papers.head(10), deduplicated_papers.tail(10)]).reset_index(drop = True)

# Pull the papers for this subset
dance.pull_papers(deduped_table = a_subset, output_directory = os.path.join(output_directory, "deduped_example"), scopus_api_key = scopus_api_key)

# Read summary file 
with open(os.path.join(output_directory, "deduped_example", "output_summary.txt"), "r") as f:
    print(f.read())





Output Summary for Pulling Papers
Created: 2025-04-24 10:48:02.478693
Total Num. Articles: 20
Total Num. Articles Found: 11
Number of Full Text: 0
Number of Title & Abstracts: 11
Number Missing: 9



A note on adding publications: they must be added as txt files. The `pypdf` package can be used to convert a PDF to a txt file using its `PdfReader` class. Example code is below.

```{python}
# Load library
from pypdf import PdfReader

# Read data
reader = PdfReader("your_file.pdf")

# Write the text of each page to a txt file, one page after another
with open("your_file.txt", "w") as file:
    for page in reader.pages:
        file.write(page.extract_text() + "\n")
```