# How to fetch articles and references from Scopus with journal names?

## Philosophy corpus

This notebook fetches articles and their references through the Scopus API, using the pybliometrics package. Jacob Hamel-Mottiez is responsible for this file. 

In [1]:
PATH_TO_DATA = r'C:/Users/jacob/OneDrive - Université Laval/biophilo/Data/pybiblio'
import pandas as pd
import numpy as np
import pybliometrics
from pybliometrics.scopus import AbstractRetrieval
from pybliometrics.scopus import ScopusSearch
from tqdm import tqdm
import sys
import contextlib
import io

### The philosophy of science journals we will work with

As the lists below show, we work with both generalist philosophy of science journals and more specialized philosophy of biology journals. The escaped quotes (`\"`) are important: they make sure we match the exact journal name rather than another journal whose title merely contains all the same words.
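
As a minimal sketch of how these escaped names end up in a query, the helper below wraps a plain journal title in the same escaped quotes and in the `EXACTSRCTITLE` field used later in this notebook. The name `build_query` is ours for illustration; it is not part of pybliometrics.

```python
def build_query(journal_title):
    """Wrap a plain journal title in escaped quotes and EXACTSRCTITLE()."""
    # r"\"" is a raw string holding a backslash followed by a double quote,
    # exactly like the entries of the journal lists.
    quoted = r"\"" + journal_title + r"\""
    return "EXACTSRCTITLE(" + quoted + ")"


print(build_query("BIOLOGICAL THEORY"))
```

The printed string has the same shape as the queries echoed in the output of the fetching cells.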

Be careful with the journal "Studies in History and Philosophy of Science". As Wikipedia states: 

"_**Studies in History and Philosophy of Science**_ is a series of three [peer-reviewed](https://en.wikipedia.org/wiki/Peer-review "Peer-review") [academic journals](https://en.wikipedia.org/wiki/Academic_journal "Academic journal") published by [Elsevier](https://en.wikipedia.org/wiki/Elsevier "Elsevier"). It was established in 1970 as a single journal, and was split into two sections–_**Studies in History and Philosophy of Science Part A**_ and _**Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics**_–in 1995. In 1998, a third section, _**Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences**_, was created.[[1]](https://en.wikipedia.org/wiki/Studies_in_History_and_Philosophy_of_Science#cite_note-1)[[2]](https://en.wikipedia.org/wiki/Studies_in_History_and_Philosophy_of_Science#cite_note-2) In January 2021, all three sections were merged back into Part A, _Studies in History and Philosophy of Science_."

In [2]:
general_philo_of_science =[
# GENERAL PHILOSOPHY OF SCIENCE JOURNALS 
  #r"\"PHILOSOPHY OF SCIENCE\"",
  #r"\"BRITISH JOURNAL FOR THE PHILOSOPHY OF SCIENCE\"", 
  #r"\"SYNTHESE\"", 
  #r"\"ERKENNTNIS\"", 
  #r"\"EUROPEAN JOURNAL FOR THE PHILOSOPHY OF SCIENCE\"", 
  #r"\"INTERNATIONAL STUDIES IN THE PHILOSOPHY OF SCIENCE\"", 
  #r"\"JOURNAL FOR GENERAL PHILOSOPHY OF SCIENCE\"", 
  #r"\"FOUNDATIONS OF SCIENCE\"",
  #r"\"STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE\""
]

specialized_philo_bio_journals = [
# SPECIALIZED PHILOSOPHY OF BIOLOGY JOURNALS
  #r"\"BIOLOGY & PHILOSOPHY\"",
  #r"\"BIOLOGY AND PHILOSOPHY\"",
  r"\"BIOLOGICAL THEORY\"",
  #r"\"STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE PART C\"",
  #r"\"HISTORY AND PHILOSOPHY OF THE LIFE SCIENCES\"",
  #r"\"ACTA BIOTHEORETICA\"",
  #r"\"BEHAVIORAL AND BRAIN SCIENCES\"",
  #r"\"BIOESSAYS\"",
  #r"\"BIOSEMIOTICS\"",
]

These parallel lists hold file-safe versions of the journal names, which simplifies saving and loading files later. 

In [3]:
for_name_general_philo_of_science =[
    
# GENERAL PHILOSOPHY OF SCIENCE JOURNALS    
  #"PHILOSOPHY_OF_SCIENCE",
  #"THE_BRITISH_JOURNAL_FOR_THE_PHILOSOPHY_OF_SCIENCE", 
  #"SYNTHESE", 
  #"ERKENNTNIS", 
  #"EUROPEAN_JOURNAL_FOR_THE_PHILOSOPHY_OF_SCIENCE", 
  #"INTERNATIONAL_STUDIES_IN_THE_PHILOSOPHY_OF_SCIENCE", 
  #"JOURNAL_FOR_GENERAL_PHILOSOPHY_OF_SCIENCE", 
  #"FOUNDATIONS_OF_SCIENCE",
  #"STUDIES_IN_HISTORY_AND_PHILOSOPHY_OF_SCIENCE"
]
for_name_specialized_philo_bio_journals = [
# SPECIALIZED PHILOSOPHY OF BIOLOGY JOURNALS
  #"BIOLOGY_&_PHILOSOPHY",
  #"BIOLOGY_AND_PHILOSOPHY",
  "BIOLOGICAL_THEORY",
  #"STUDIES_IN_HISTORY_AND_PHILOSOPHY_OF_SCIENCE_PART_C",
  #"HISTORY_AND_PHILOSOPHY_OF_THE_LIFE_SCIENCES",
  #"ACTA_BIOTHEORETICA",
  #"BEHAVIORAL_AND_BRAIN_SCIENCES",
  #"BIOESSAYS",
  #"BIOSEMIOTICS",
]
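
The file-safe names above are maintained by hand in parallel with the query lists. As a sketch of an alternative, assuming the only transformation needed is upper-casing and replacing separators with underscores, the names could be derived from the titles. The helper `to_file_stem` is hypothetical and not part of the original workflow.

```python
import re


def to_file_stem(journal_title):
    """Turn a journal title into an upper-case, underscore-separated name."""
    # Collapse every run of characters other than letters, digits and "&"
    # into a single underscore, then trim underscores at both ends.
    return re.sub(r"[^A-Z0-9&]+", "_", journal_title.upper()).strip("_")


titles = ["BIOLOGICAL THEORY", "BIOLOGY & PHILOSOPHY"]
file_stems = [to_file_stem(t) for t in titles]
print(file_stems)
```

Because the escaped quotes are stripped as well, the same helper can be applied directly to the entries of the query lists, which keeps the two lists from drifting apart.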

### Fetch Philosophy of Science (both general and specialized) Articles

The main class for fetching articles is `ScopusSearch`. If your subscription allows it, request the "COMPLETE" view to get the maximum amount of information available. 

In [21]:
import os

pybliometrics.scopus.init()  # Load the configuration and API key once

for title, name in zip(specialized_philo_bio_journals, for_name_specialized_philo_bio_journals):
    query = "EXACTSRCTITLE(" + title + ")"
    print(query)
    s = ScopusSearch(query, verbose=True, subscriber=True, view="COMPLETE")
    result_df = pd.DataFrame(s.results)
    # Save under PATH_TO_DATA, where the loading step looks for the files
    result_df.to_csv(os.path.join(PATH_TO_DATA, name + ".csv"))




EXACTSRCTITLE(\"BIOLOGY & PHILOSOPHY\")
EXACTSRCTITLE(\"BIOLOGY AND PHILOSOPHY\")
EXACTSRCTITLE(\"BIOLOGICAL THEORY\")
EXACTSRCTITLE(\"STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE PART C\")
Downloading results for query "EXACTSRCTITLE(\"STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE PART C\")":


100%|██████████| 42/42 [00:29<00:00,  1.39it/s]


EXACTSRCTITLE(\"HISTORY AND PHILOSOPHY OF THE LIFE SCIENCES\")


The journals were selected following Khelfaoui et al. (2021), Malaterre et al. (2021) and others. We kept only general journals and excluded journals devoted to a specialty other than philosophy of biology. Note that the RScopus package in R is limited to 40 references per article, whereas pybliometrics has no such limit. Another limitation of RScopus is that you cannot directly fetch more than 5000 entries; this problem does not arise with pybliometrics either. If you still want to work with RScopus, you can find a script [here](https://github.com/christopherBelter/scopusAPI), but some tweaking will be necessary (notably to add your insttoken if you have one).


### Fetch Philosophy of Science (both general and specialized) References

This code chunk simply loads the article files generated previously. We will need them to find their references. 

In [7]:
import os


for name in for_name_specialized_philo_bio_journals:
    file_path = os.path.join(PATH_TO_DATA, f"{name}.csv")  # Assuming CSV format

    if os.path.exists(file_path):  # Check if file exists
        globals()[name] = pd.read_csv(file_path)  # Assign to a variable with the file name
    else:
        print(f"Warning: {file_path} not found!")
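
Assigning DataFrames through `globals()` works, but it creates variables whose names only exist at run time. As a sketch of an alternative, the frames could be kept in a dictionary keyed by journal name. The function `load_journal_frames` is a hypothetical helper, and the cells that follow would have to read from the dictionary instead of `globals()`.

```python
import os

import pandas as pd


def load_journal_frames(names, data_dir):
    """Read <data_dir>/<name>.csv for each name and return {name: DataFrame}."""
    frames = {}
    for name in names:
        file_path = os.path.join(data_dir, f"{name}.csv")
        if os.path.exists(file_path):  # Skip journals that were not downloaded
            frames[name] = pd.read_csv(file_path)
        else:
            print(f"Warning: {file_path} not found!")
    return frames
```

With this helper, `articles = load_journal_frames(for_name_specialized_philo_bio_journals, PATH_TO_DATA)` would give access to a journal through `articles["BIOLOGICAL_THEORY"]`.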

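We define a function called `parse_abstract` that takes the EID of an article and fetches the corresponding references. We apply it to the `eid` column with tqdm's `progress_apply` (a pandas `apply` with a progress bar) instead of an explicit for-loop, because the former seems quicker and less prone to errors. The references of each article are appended to `references_list` and concatenated once per journal. 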

In [8]:
# Initialize a list to store reference data
references_list = []

def parse_abstract(eid):
    with contextlib.redirect_stdout(io.StringIO()):  # Suppresses print output
        try:
            print(f"Processing EID: {eid}")
            s = AbstractRetrieval(eid, id_type="eid", view="FULL")

            if s.references:  # Ensure references exist
                df = pd.DataFrame(s.references)
                df['citing_eid'] = eid
                references_list.append(df)  # Collect results
            else:
                print(f"No references found for {eid}")

        except Exception as e:
            print(f"Error processing {eid}: {e}")

tqdm.pandas()  # Initializes tqdm for pandas
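
Because `parse_abstract` runs inside `redirect_stdout`, its error messages are discarded along with the other prints, so failed EIDs leave no trace. The sketch below shows one way to keep a record of failures. It is not the notebook's method: `collect_references` and its `fetch_references` argument are hypothetical, the latter standing in for a call such as `AbstractRetrieval(eid, id_type="eid", view="FULL").references` so that the example runs without network access.

```python
import pandas as pd


def collect_references(eids, fetch_references):
    """Gather references per EID and keep a log of the EIDs that failed."""
    frames, failed = [], []
    for eid in eids:
        try:
            refs = fetch_references(eid)
        except Exception as exc:
            failed.append((eid, str(exc)))  # Remember the failure instead of printing it
            continue
        if refs:  # Ensure references exist
            df = pd.DataFrame(refs)
            df["citing_eid"] = eid
            frames.append(df)
    if frames:
        references_df = pd.concat(frames, ignore_index=True)
    else:
        references_df = pd.DataFrame(columns=["citing_eid"])
    failed_df = pd.DataFrame(failed, columns=["eid", "error"])
    return references_df, failed_df


def fake_fetch(eid):
    """Toy stand-in for the Scopus call, used only for this demonstration."""
    if eid == "bad":
        raise ValueError("not found")
    if eid == "empty":
        return None
    return [{"id": "r1", "title": "A"}, {"id": "r2", "title": "B"}]


references_df, failed_df = collect_references(["e1", "bad", "empty"], fake_fetch)
print(references_df)
print(failed_df)
```

Saving `failed_df` next to the references file would make it possible to retry only the EIDs that failed.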

In [None]:
references_list = []  # List to store extracted references
pybliometrics.scopus.init()
for name in for_name_specialized_philo_bio_journals:
    if name in globals():  # Ensure the DataFrame exists
        print(f"Processing journal: {name}")
        
        dfs = globals()[name]  # Fetch the actual DataFrame
        
        # Apply the function to extract references
        df = dfs['eid'].progress_apply(parse_abstract)  
        
        # Ensure references_list has data before concatenating
        if references_list:
            references_df = pd.concat(references_list, ignore_index=True)
        else:
            references_df = pd.DataFrame(columns=['eid', 'citing_eid'])  # Placeholder for empty case

        # Save only the extracted references, not the original articles
        output_path = f"C:\\Users\\jacob\\OneDrive - Université Laval\\biophilo\\Data\\pybiblio\\{name}_refs_pyblio.csv"
        references_df.to_csv(output_path, index=False)
        
        print(f"Saved references: {output_path}")
        
        # Clear the list for the next journal to avoid mixing references
        references_list.clear()
    else:
        print(f"Warning: No DataFrame found for {name}")

Processing journal: BIOLOGICAL_THEORY


  0%|          | 0/714 [00:00<?, ?it/s]

100%|██████████| 714/714 [00:15<00:00, 45.91it/s]


Saved references: C:\Users\jacob\OneDrive - Université Laval\biophilo\Data\pybiblio\BIOLOGICAL_THEORY_refs_pyblio.csv


## Biology Corpus

Once we have the philosophy journals, we also want to fetch the journals that philosophers of biology cite. 

For this we proceeded as follows: 
1. We took the mean citescore for each journal (the mean, across all years, of the number of citations divided by the number of articles published). 
2. We weighted this citescore by the number of citations from our philosophy of biology corpus. 
3. We selected 31 journals in biology, ranging from general biology to more specific areas such as evolution, cell biology, ecology and molecular biology. We do not claim that this selection is fully representative of biology itself, but we contend that it is representative of what philosophers consider *relevant* biology. 
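
The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the computation actually used: the data are toy values, the column names are hypothetical, and "weighting" is assumed here to mean multiplying the mean citescore by the number of citations received from the philosophy of biology corpus.

```python
import pandas as pd

# Toy yearly counts per journal (hypothetical values and column names).
yearly = pd.DataFrame({
    "journal": ["A", "A", "B", "B"],
    "citations": [100, 300, 50, 150],
    "articles": [50, 100, 50, 50],
})

# Step 1: yearly ratio of citations to articles, averaged over the years.
yearly["citescore"] = yearly["citations"] / yearly["articles"]
mean_citescore = yearly.groupby("journal")["citescore"].mean()

# Step 2 (assumed weighting): multiply by the citations from the corpus.
corpus_citations = pd.Series({"A": 10, "B": 40})
ranked = pd.DataFrame({
    "mean_citescore": mean_citescore,
    "corpus_citations": corpus_citations,
})
ranked["weighted_score"] = ranked["mean_citescore"] * ranked["corpus_citations"]
ranked = ranked.sort_values("weighted_score", ascending=False)
print(ranked)
```

Taking the top rows of `ranked` would then correspond to the selection in step 3.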

In [4]:
biology_journals = [
    #r"\"JOURNAL OF THEORETICAL BIOLOGY\"",
    r"\"EVOLUTION\"",
    r"\"CELL\"",
    r"\"PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA\"",
    r"\"NATURE REVIEWS GENETICS\"",
    r"\"GENETICS\"",
    r"\"AMERICAN NATURALIST\"",
    r"\"CURRENT BIOLOGY\"",
    r"\"TRENDS IN COGNITIVE SCIENCES\"",
    r"\"TRENDS IN ECOLOGY AND EVOLUTION\"",
    r"\"JOURNAL OF EVOLUTIONARY BIOLOGY\"",
    r"\"QUARTERLY REVIEW OF BIOLOGY\"",
    r"\"BIOSCIENCE\"",
    r"\"COGNITION\"",
    r"\"PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B: BIOLOGICAL SCIENCES\"",
    r"\"NATURE GENETICS\"",
    r"\"NEURON\"",
    r"\"PLOS BIOLOGY\"",
    r"\"SYSTEMATIC BIOLOGY\"",
    r"\"ECOLOGY\"",
    r"\"NATURE NEUROSCIENCE\"",
    r"\"ANNALS OF THE NEW YORK ACADEMY OF SCIENCES\"",
    r"\"JOURNAL OF MOLECULAR BIOLOGY\"",
    r"\"NATURE REVIEWS MICROBIOLOGY\"",
    r"\"NATURE REVIEWS NEUROSCIENCE\"",
    r"\"BIOLOGICAL REVIEWS\"",
    r"\"TRENDS IN MICROBIOLOGY\"",
    r"\"JOURNAL OF NEUROSCIENCE\"",
    r"\"MOLECULAR BIOLOGY AND EVOLUTION\"",
    r"\"TRENDS IN GENETICS\"",
    r"\"NATURE REVIEWS MOLECULAR CELL BIOLOGY\"",
    r"\"GENOME BIOLOGY\"",
    r"\"GENOME RESEARCH\"",
    r"\"ANNUAL REVIEW OF MICROBIOLOGY\"",
    r"\"ANNUAL REVIEW OF NEUROSCIENCE\"",
    r"\"ECOLOGY LETTERS\"",
    r"\"ANNUAL REVIEW OF ECOLOGY, EVOLUTION, AND SYSTEMATICS\"",
    r"\"ANNUAL REVIEW OF GENETICS\"",
    r"\"NATURE REVIEWS CANCER\"",
    r"\"AMERICAN JOURNAL OF HUMAN GENETICS\"",
    r"\"TRENDS IN NEUROSCIENCES\"",
    #r"\"MICROBIOLOGY AND MOLECULAR BIOLOGY REVIEWS\"",
    #r"\"TRENDS IN BIOCHEMICAL SCIENCES\"",
]


for_name_biology_journals = [
    #"JOURNAL_OF_THEORETICAL_BIOLOGY",
    "EVOLUTION",
    "CELL",
    "PROCEEDINGS_OF_THE_NATIONAL_ACADEMY_OF_SCIENCES_OF_THE_UNITED_STATES_OF_AMERICA",
    "NATURE_REVIEWS_GENETICS",
    "GENETICS",
    "AMERICAN_NATURALIST",
    "CURRENT_BIOLOGY",
    "TRENDS_IN_COGNITIVE_SCIENCES",
    "TRENDS_IN_ECOLOGY_AND_EVOLUTION",
    "JOURNAL_OF_EVOLUTIONARY_BIOLOGY",
    "QUARTERLY_REVIEW_OF_BIOLOGY",
    "BIOSCIENCE",
    "COGNITION",
    "PHILOSOPHICAL_TRANSACTIONS_OF_THE_ROYAL_SOCIETY_B_BIOLOGICAL_SCIENCES",
    "NATURE_GENETICS",
    "NEURON",
    "PLOS_BIOLOGY",
    "SYSTEMATIC_BIOLOGY",
    "ECOLOGY",
    "NATURE_NEUROSCIENCE",
    "ANNALS_OF_THE_NEW_YORK_ACADEMY_OF_SCIENCES",
    "JOURNAL_OF_MOLECULAR_BIOLOGY",
    "NATURE_REVIEWS_MICROBIOLOGY",
    "NATURE_REVIEWS_NEUROSCIENCE",
    "BIOLOGICAL_REVIEWS",
    "TRENDS_IN_MICROBIOLOGY",
    "JOURNAL_OF_NEUROSCIENCE",
    "MOLECULAR_BIOLOGY_AND_EVOLUTION",
    "TRENDS_IN_GENETICS",
    "NATURE_REVIEWS_MOLECULAR_CELL_BIOLOGY",
    "GENOME_BIOLOGY",
    "GENOME_RESEARCH",
    "ANNUAL_REVIEW_OF_MICROBIOLOGY",
    "ANNUAL_REVIEW_OF_NEUROSCIENCE",
    "ECOLOGY_LETTERS",
    "ANNUAL_REVIEW_OF_ECOLOGY_EVOLUTION_AND_SYSTEMATICS",
    "ANNUAL_REVIEW_OF_GENETICS",
    "NATURE_REVIEWS_CANCER",
    "AMERICAN_JOURNAL_OF_HUMAN_GENETICS",
    "TRENDS_IN_NEUROSCIENCES",
    #"MICROBIOLOGY_AND_MOLECULAR_BIOLOGY_REVIEWS",
    #"TRENDS_IN_BIOCHEMICAL_SCIENCES"
]




### Fetch Biology Articles

In [5]:
import os

pybliometrics.scopus.init()  # Load the configuration and API key once

for title, name in zip(biology_journals, for_name_biology_journals):
    query = "EXACTSRCTITLE(" + title + ")"
    print(query)
    s = ScopusSearch(query, verbose=True, subscriber=True, view="COMPLETE")
    result_df = pd.DataFrame(s.results)
    # os.path.join adds the separator that was missing between the folder and the file name
    result_df.to_csv(os.path.join(PATH_TO_DATA, name + ".csv"))



EXACTSRCTITLE(\"EVOLUTION\")
Downloading results for query "EXACTSRCTITLE(\"EVOLUTION\")":


100%|██████████| 5268/5268 [1:07:15<00:00,  1.31it/s]


EXACTSRCTITLE(\"CELL\")
Downloading results for query "EXACTSRCTITLE(\"CELL\")":


100%|██████████| 19097/19097 [3:52:25<00:00,  1.37it/s]  


EXACTSRCTITLE(\"PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA\")
Downloading results for query "EXACTSRCTITLE(\"PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA\")":


100%|██████████| 6074/6074 [1:19:52<00:00,  1.27it/s]  


EXACTSRCTITLE(\"NATURE REVIEWS GENETICS\")
Downloading results for query "EXACTSRCTITLE(\"NATURE REVIEWS GENETICS\")":


100%|██████████| 156/156 [02:33<00:00,  1.01it/s]


EXACTSRCTITLE(\"GENETICS\")
Downloading results for query "EXACTSRCTITLE(\"GENETICS\")":


100%|██████████| 14011/14011 [2:51:36<00:00,  1.36it/s]  


EXACTSRCTITLE(\"AMERICAN NATURALIST\")
Downloading results for query "EXACTSRCTITLE(\"AMERICAN NATURALIST\")":


100%|██████████| 306/306 [05:15<00:00,  1.03s/it]


EXACTSRCTITLE(\"CURRENT BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"CURRENT BIOLOGY\")":


100%|██████████| 896/896 [10:45<00:00,  1.39it/s]


EXACTSRCTITLE(\"TRENDS IN COGNITIVE SCIENCES\")
Downloading results for query "EXACTSRCTITLE(\"TRENDS IN COGNITIVE SCIENCES\")":


100%|██████████| 128/128 [01:29<00:00,  1.42it/s]


EXACTSRCTITLE(\"TRENDS IN ECOLOGY AND EVOLUTION\")
Downloading results for query "EXACTSRCTITLE(\"TRENDS IN ECOLOGY AND EVOLUTION\")":


100%|██████████| 200/200 [02:05<00:00,  1.59it/s]


EXACTSRCTITLE(\"JOURNAL OF EVOLUTIONARY BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"JOURNAL OF EVOLUTIONARY BIOLOGY\")":


100%|██████████| 212/212 [02:21<00:00,  1.49it/s]


EXACTSRCTITLE(\"QUARTERLY REVIEW OF BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"QUARTERLY REVIEW OF BIOLOGY\")":


100%|██████████| 29/29 [00:19<00:00,  1.44it/s]


EXACTSRCTITLE(\"BIOSCIENCE\")
Downloading results for query "EXACTSRCTITLE(\"BIOSCIENCE\")":


100%|██████████| 2227/2227 [25:49<00:00,  1.44it/s] 


EXACTSRCTITLE(\"COGNITION\")
Downloading results for query "EXACTSRCTITLE(\"COGNITION\")":


100%|██████████| 2013/2013 [23:41<00:00,  1.41it/s]  


EXACTSRCTITLE(\"PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B: BIOLOGICAL SCIENCES\")
Downloading results for query "EXACTSRCTITLE(\"PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B: BIOLOGICAL SCIENCES\")":


100%|██████████| 371/371 [04:47<00:00,  1.29it/s]


EXACTSRCTITLE(\"NATURE GENETICS\")
Downloading results for query "EXACTSRCTITLE(\"NATURE GENETICS\")":


100%|██████████| 409/409 [06:07<00:00,  1.11it/s]


EXACTSRCTITLE(\"NEURON\")
Downloading results for query "EXACTSRCTITLE(\"NEURON\")":


100%|██████████| 568/568 [07:27<00:00,  1.27it/s]


EXACTSRCTITLE(\"PLOS BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"PLOS BIOLOGY\")":


100%|██████████| 285/285 [03:46<00:00,  1.25it/s]


EXACTSRCTITLE(\"SYSTEMATIC BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"SYSTEMATIC BIOLOGY\")":


100%|██████████| 117/117 [01:26<00:00,  1.35it/s]


EXACTSRCTITLE(\"ECOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"ECOLOGY\")":


100%|██████████| 13757/13757 [2:48:49<00:00,  1.36it/s]  


EXACTSRCTITLE(\"NATURE NEUROSCIENCE\")
Downloading results for query "EXACTSRCTITLE(\"NATURE NEUROSCIENCE\")":


100%|██████████| 296/296 [05:50<00:00,  1.19s/it]


EXACTSRCTITLE(\"ANNALS OF THE NEW YORK ACADEMY OF SCIENCES\")
Downloading results for query "EXACTSRCTITLE(\"ANNALS OF THE NEW YORK ACADEMY OF SCIENCES\")":


100%|██████████| 2628/2628 [30:26<00:00,  1.44it/s] 


EXACTSRCTITLE(\"JOURNAL OF MOLECULAR BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"JOURNAL OF MOLECULAR BIOLOGY\")":


100%|██████████| 1388/1388 [17:36<00:00,  1.31it/s]


EXACTSRCTITLE(\"NATURE REVIEWS MICROBIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"NATURE REVIEWS MICROBIOLOGY\")":


100%|██████████| 150/150 [01:58<00:00,  1.25it/s]


EXACTSRCTITLE(\"NATURE REVIEWS NEUROSCIENCE\")
Downloading results for query "EXACTSRCTITLE(\"NATURE REVIEWS NEUROSCIENCE\")":


100%|██████████| 170/170 [01:49<00:00,  1.54it/s]


EXACTSRCTITLE(\"BIOLOGICAL REVIEWS\")
Downloading results for query "EXACTSRCTITLE(\"BIOLOGICAL REVIEWS\")":


100%|██████████| 103/103 [01:24<00:00,  1.21it/s]


EXACTSRCTITLE(\"TRENDS IN MICROBIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"TRENDS IN MICROBIOLOGY\")":


100%|██████████| 161/161 [02:05<00:00,  1.28it/s]


EXACTSRCTITLE(\"JOURNAL OF NEUROSCIENCE\")
Downloading results for query "EXACTSRCTITLE(\"JOURNAL OF NEUROSCIENCE\")":


100%|██████████| 3367/3367 [41:32<00:00,  1.35it/s]  


EXACTSRCTITLE(\"MOLECULAR BIOLOGY AND EVOLUTION\")
Downloading results for query "EXACTSRCTITLE(\"MOLECULAR BIOLOGY AND EVOLUTION\")":


100%|██████████| 343/343 [04:54<00:00,  1.16it/s]


EXACTSRCTITLE(\"TRENDS IN GENETICS\")
Downloading results for query "EXACTSRCTITLE(\"TRENDS IN GENETICS\")":


100%|██████████| 191/191 [02:10<00:00,  1.46it/s]


EXACTSRCTITLE(\"NATURE REVIEWS MOLECULAR CELL BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"NATURE REVIEWS MOLECULAR CELL BIOLOGY\")":


100%|██████████| 162/162 [01:45<00:00,  1.52it/s]


EXACTSRCTITLE(\"GENOME BIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"GENOME BIOLOGY\")":


100%|██████████| 450/450 [05:45<00:00,  1.30it/s]


EXACTSRCTITLE(\"GENOME RESEARCH\")
Downloading results for query "EXACTSRCTITLE(\"GENOME RESEARCH\")":


100%|██████████| 365/365 [04:53<00:00,  1.24it/s]


EXACTSRCTITLE(\"ANNUAL REVIEW OF MICROBIOLOGY\")
Downloading results for query "EXACTSRCTITLE(\"ANNUAL REVIEW OF MICROBIOLOGY\")":


100%|██████████| 73/73 [00:51<00:00,  1.41it/s]


EXACTSRCTITLE(\"ANNUAL REVIEW OF NEUROSCIENCE\")
Downloading results for query "EXACTSRCTITLE(\"ANNUAL REVIEW OF NEUROSCIENCE\")":


100%|██████████| 41/41 [00:28<00:00,  1.39it/s]


EXACTSRCTITLE(\"ECOLOGY LETTERS\")
Downloading results for query "EXACTSRCTITLE(\"ECOLOGY LETTERS\")":


100%|██████████| 177/177 [02:15<00:00,  1.30it/s]


EXACTSRCTITLE(\"ANNUAL REVIEW OF ECOLOGY, EVOLUTION, AND SYSTEMATICS\")
Downloading results for query "EXACTSRCTITLE(\"ANNUAL REVIEW OF ECOLOGY, EVOLUTION, AND SYSTEMATICS\")":


100%|██████████| 23/23 [00:17<00:00,  1.23it/s]


EXACTSRCTITLE(\"ANNUAL REVIEW OF GENETICS\")
Downloading results for query "EXACTSRCTITLE(\"ANNUAL REVIEW OF GENETICS\")":


100%|██████████| 48/48 [00:31<00:00,  1.49it/s]


EXACTSRCTITLE(\"NATURE REVIEWS CANCER\")
Downloading results for query "EXACTSRCTITLE(\"NATURE REVIEWS CANCER\")":


100%|██████████| 149/149 [01:37<00:00,  1.51it/s]


EXACTSRCTITLE(\"AMERICAN JOURNAL OF HUMAN GENETICS\")
Downloading results for query "EXACTSRCTITLE(\"AMERICAN JOURNAL OF HUMAN GENETICS\")":


100%|██████████| 516/516 [06:54<00:00,  1.24it/s]


EXACTSRCTITLE(\"TRENDS IN NEUROSCIENCES\")
Downloading results for query "EXACTSRCTITLE(\"TRENDS IN NEUROSCIENCES\")":


100%|██████████| 219/219 [02:25<00:00,  1.49it/s]


### Fetch Biology References

In [6]:
import os

PATH_TO_DATA = r'C:/Users/jacob/OneDrive - Université Laval/biophilo/Data/pybiblio'

for name in for_name_biology_journals:
    file_path = os.path.join(PATH_TO_DATA, f"{name}.csv")  # Assuming CSV format

    if os.path.exists(file_path):  # Check if file exists
        globals()[name] = pd.read_csv(file_path)  # Assign to a variable with the file name
    else:
        print(f"Warning: {file_path} not found!")



In [7]:
# Initialize a list to store reference data
references_list = []

def parse_abstract(eid):
    with contextlib.redirect_stdout(io.StringIO()):  # Suppresses print output
        try:
            print(f"Processing EID: {eid}")
            s = AbstractRetrieval(eid, id_type="eid", view="FULL")

            if s.references:  # Ensure references exist
                df = pd.DataFrame(s.references)
                df['citing_eid'] = eid
                references_list.append(df)  # Collect results
            else:
                print(f"No references found for {eid}")

        except Exception as e:
            print(f"Error processing {eid}: {e}")

tqdm.pandas()  # Initializes tqdm for pandas

In [17]:
references_list = []  # List to store extracted references
pybliometrics.scopus.init()
for name in for_name_biology_journals:
    if name in globals():  # Ensure the DataFrame exists
        print(f"Processing journal: {name}")
        
        dfs = globals()[name]  # Fetch the actual DataFrame
        
        # Apply the function to extract references
        df = dfs['eid'].progress_apply(parse_abstract)  
        
        # Ensure references_list has data before concatenating
        if references_list:
            references_df = pd.concat(references_list, ignore_index=True)
        else:
            references_df = pd.DataFrame(columns=['eid', 'citing_eid'])  # Placeholder for empty case

        # Save only the extracted references, not the original articles
        output_path = f"C:\\Users\\jacob\\OneDrive - Université Laval\\biophilo\\Data\\pybiblio\\{name}_refs_pyblio.csv"
        references_df.to_csv(output_path, index=False)
        
        print(f"Saved references: {output_path}")
        
        # Clear the list for the next journal to avoid mixing references
        references_list.clear()
    else:
        print(f"Warning: No DataFrame found for {name}")

Processing journal: JOURNAL_OF_THEORETICAL_BIOLOGY


100%|██████████| 16811/16811 [1:55:25<00:00,  2.43it/s]  


Saved references: C:\Users\jacob\OneDrive - Université Laval\biophilo\Data\pybiblio\JOURNAL_OF_THEORETICAL_BIOLOGY_refs_pyblio.csv


## Trying to parallelize the process

In [None]:
import pybliometrics
import concurrent.futures
import pandas as pd
from pybliometrics.scopus import AbstractRetrieval
from tqdm import tqdm

def parse_abstract(eid):
    """Return the references of one article as a DataFrame, or None."""
    # contextlib.redirect_stdout is not used here: it swaps the global
    # sys.stdout and is not safe to use from several threads at once.
    try:
        s = AbstractRetrieval(eid, id_type="eid", view="FULL")
    except Exception:
        return None  # Skip articles that cannot be retrieved

    if not s.references:  # Ensure references exist
        return None
    df = pd.DataFrame(s.references)
    df['citing_eid'] = eid
    return df

# Start pybliometrics
pybliometrics.scopus.init()

def process_journal(name):
    if name in globals():  # Ensure the DataFrame exists
        print(f"Processing journal: {name}")

        dfs = globals()[name]  # Fetch the actual DataFrame

        # Use ThreadPoolExecutor to parallelize the reference extraction process
        with concurrent.futures.ThreadPoolExecutor() as executor:
            results = list(tqdm(executor.map(parse_abstract, dfs['eid']), total=len(dfs)))

        # Keep only the articles for which references were found
        references_list = [df for df in results if df is not None]

        # Ensure references_list has data before concatenating
        if references_list:
            references_df = pd.concat(references_list, ignore_index=True)
        else:
            references_df = pd.DataFrame(columns=['eid', 'citing_eid'])  # Placeholder for empty case

        # Save only the extracted references, not the original articles
        output_path = f"C:\\Users\\jacob\\OneDrive - Université Laval\\biophilo\\Data\\pybiblio\\{name}_refs_pyblio.csv"
        references_df.to_csv(output_path, index=False)

        print(f"Saved references: {output_path}")
    else:
        print(f"Warning: No DataFrame found for {name}")

# Journals are processed one after another; the requests within each journal run in parallel
for name in for_name_biology_journals:
    process_journal(name)