This notebook contains the code used to import all the data necessary for the analysis.

## Table of Contents

* [Chapter 0 - Libraries](#chapter0)
* [Chapter 1 - Retraction Watch Database](#chapter1)
* [Chapter 2 - Journals to consider](#chapter2)   
* [Chapter 3 - Bibliometric Data of Retractions](#chapter3)  
* [Chapter 4 - Bibliometric Data by Journals](#chapter4)  
    * [4.1 - Webscraping by Journal Range](#section_4_1)
    * [4.2 - Number of Articles per Journal](#section_4_2)
    * [4.3 - Journals Individual Export](#section_4_3)
        * [4.3.1 - Medium Big Journals](#section_4_3_1)
        * [4.3.2 - Journals with more than 100.000 observations](#section_4_3_2)
    * [4.4 - Join data into less datasets](#section_4_4)
    * [4.5 - Check if Journals were all scraped](#section_4_5)
* [Chapter 5 - Bibliometric Data of Citations](#chapter5)
* [Chapter 6 - Bibliometrics Data of Erratums](#chapter6)
    

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 0 - Libraries <a class="anchor" id="chapter0"></a>

In [72]:
import pandas as pd
import numpy as np
import math

import time

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from tqdm import tqdm

import os

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 1 - Retraction Watch Database <a class="anchor" id="chapter1"></a>

In [73]:
rw = pd.read_excel('./retractions_data/retraction_watch_database.xlsx')
rw.head()

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
0,47271,Binding of DCC by Netrin-1 to Mediate Axon Gui...,(BLS) Biology - Cellular;(BLS) Biology - Gener...,Departments of Anatomy and of Biochemistry and...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Yimin Zou;Mu-ming Poo;Marc Tessier-...,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1521,0.0,2001-03-09 00:00:00,10.1126/science.1059391,11239160.0,Retraction,+Investigation by Company/Institution;+Manipul...,No,
1,47270,Hierarchical Organization of Guidance Receptor...,(BLS) Biochemistry;(BLS) Biology - General;(BL...,Department of Anatomy and Department of Bioche...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Marc Tessier-Lavigne,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1517,0.0,2001-02-08 00:00:00,10.1126/science.1058445,11239147.0,Retraction,+Duplication of Image;+Investigation by Compan...,No,
2,47243,Therapeutic potential of targeting IRES-depend...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Division of Hematology-Oncology, UCLA-Greater ...",Oncogene,Springer - Nature Publishing Group,United States,Y Shi;Y Yang;C Bardeleben;B Holmes;J Gera;Alan...,,Research Article;,2023-08-31 00:00:00,10.1038/s41388-023-02820-5,0.0,2015-05-11 00:00:00,10.1038/onc.2015.156,25961916.0,Retraction,+Concerns/Issues About Data;+Concerns/Issues A...,No,see also: https://pubpeer.com/publications/704...
3,47233,A classifier based on 273 urinary peptides pre...,(BLS) Biochemistry;(HSC) Medicine - Cardiovasc...,"Department of Nephrology, The Third Affiliated...",Journal of Hypertension,Wolters Kluwer - Lippincott Williams & Wilkins,China,Lirong Lin;Chunxuan Wang;Jiangwen Ren;Mei Mei;...,,Research Article;,2023-08-30 00:00:00,10.1097/HJH.0000000000003551,37642599.0,2023-08-01 00:00:00,10.1097/HJH.0000000000003467,37199562.0,Retraction,+Concerns/Issues About Results;+Investigation ...,No,see also https://journals.lww.com/jhypertensio...
4,47227,"Age, Gender Demographics and Comorbidity Preva...",(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedics, Dhanalakshmi Srini...",Journal of Coastal Life Medicine,Journal of Coastal Life Medicine,India,S Venkatesh Kumar;Mohith Singh;Gowtham Singh;K...,,Research Article;,2023-08-30 00:00:00,unavailable,0.0,2023-01-01 00:00:00,unavailable,0.0,Retraction,+Notice - Lack of;+Withdrawal;,No,"date of retraction unknown, article title repl..."


In [74]:
rw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42700 entries, 0 to 42699
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Record ID              42700 non-null  int64  
 1   Title                  42700 non-null  object 
 2   Subject                42700 non-null  object 
 3   Institution            42699 non-null  object 
 4   Journal                42700 non-null  object 
 5   Publisher              42700 non-null  object 
 6   Country                42700 non-null  object 
 7   Author                 42700 non-null  object 
 8   URLS                   21687 non-null  object 
 9   ArticleType            42700 non-null  object 
 10  RetractionDate         42700 non-null  object 
 11  RetractionDOI          42209 non-null  object 
 12  RetractionPubMedID     37599 non-null  float64
 13  OriginalPaperDate      42700 non-null  object 
 14  OriginalPaperDOI       40173 non-null  object 
 15  Or

In [75]:
rw[rw['OriginalPaperDOI']=='10.2174/1381612043383782']

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
10454,30868,Effects of Vitamin K2 on Osteoporosis,(BLS) Toxicology;(HSC) Medicine - Orthopedics;...,"Department of Sports Medicine, Keio University...",Current Pharmaceutical Design,Bentham Science Publishers,Japan,Jun Iwamoto;Tsuyoshi Takeda;Yoshihiro Sato,http://retractionwatch.com/?s=Jun+Iwamoto;http...,Article in Press;Review Article;,2021-07-06 00:00:00,10.2174/138161282719210608092930,34264184.0,2004-10-15 00:00:00,10.2174/1381612043383782,15320745.0,Retraction,+Cites Retracted Work;+Concerns/Issues About D...,No,Yoshihiro Sato and Jun Iwamoto have institutio...


In [76]:
rw['RetractionDate'] = pd.to_datetime(rw['RetractionDate'], errors='coerce')  # unparseable dates become NaT
rw['OriginalPaperDate'] = pd.to_datetime(rw['OriginalPaperDate'])

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 2 - Journals to consider <a class="anchor" id="chapter2"></a>

In [77]:
journals = pd.read_csv('../scimagojr_2022.csv', sep=';')
journals

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2022),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
0,1,28773,Ca-A Cancer Journal for Clinicians,journal,"15424863, 00079235",86091,Q1,198,44,118,...,30318,85,29999,9700,United States,Northern America,Wiley-Blackwell,1950-2022,Hematology (Q1); Oncology (Q1),Medicine
1,2,29431,Quarterly Journal of Economics,journal,"00335533, 15314650",36730,Q1,292,36,122,...,2141,122,1483,6661,United Kingdom,Western Europe,Oxford University Press,1886-2022,Economics and Econometrics (Q1),"Economics, Econometrics and Finance"
2,3,20315,Nature Reviews Molecular Cell Biology,journal,"14710072, 14710080",34201,Q1,485,121,328,...,13331,156,3547,8929,United Kingdom,Western Europe,Nature Publishing Group,2000-2022,Cell Biology (Q1); Molecular Biology (Q1),"Biochemistry, Genetics and Molecular Biology"
3,4,18434,Cell,journal,"00928674, 10974172",26494,Q1,856,420,1637,...,67791,1440,4380,6574,United States,Northern America,Cell Press,1974-2022,"Biochemistry, Genetics and Molecular Biology (...","Biochemistry, Genetics and Molecular Biology"
4,5,15847,New England Journal of Medicine,journal,"00284793, 15334406",26015,Q1,1130,1410,4561,...,133956,1854,3393,1021,United States,Northern America,Massachussetts Medical Society,1945-2022,Medicine (miscellaneous) (Q1),Medicine
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18031,18032,17500154901,Progress in Molecular Biology and Translationa...,journal,"18771173, 18780814",,-,110,90,314,...,1170,0,272,10888,Netherlands,Western Europe,Academic Press Inc.,2008-2022,Molecular Biology; Molecular Medicine,"Biochemistry, Genetics and Molecular Biology"
18032,18033,25192,Reviews of Environmental Contamination and Tox...,journal,01795953,,-,94,22,99,...,441,0,440,15168,United States,Northern America,Springer New York,1987-2022,"Health, Toxicology and Mutagenesis; Medicine (...",Environmental Science; Medicine
18033,18034,5700155185,Voprosy Istorii (discontinued),journal,00428779,,-,5,168,1316,...,38,1316,003,796,Russian Federation,Eastern Europe,"Rossiiskaya Akademiya Nauk, Institut Istorii (...","1965, 1972, 1975, 1978-1982, 1985, 1988, 1999-...",History; Medicine (miscellaneous),Arts and Humanities; Medicine
18034,18035,21100873483,Wisdom (discontinued),journal,18293824,,-,7,66,180,...,76,180,045,2561,Armenia,Eastern Europe,Khachatur Abovyan Armenian State Pedagogical U...,2018-2022,Philosophy,Arts and Humanities


In [78]:
# Calculate the threshold for the top 10%
threshold = int(0.10 * len(journals))

# Select the top 10% of rows
top_10_percent = journals.head(threshold)
top_10_percent

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2022),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
0,1,28773,Ca-A Cancer Journal for Clinicians,journal,"15424863, 00079235",86091,Q1,198,44,118,...,30318,85,29999,9700,United States,Northern America,Wiley-Blackwell,1950-2022,Hematology (Q1); Oncology (Q1),Medicine
1,2,29431,Quarterly Journal of Economics,journal,"00335533, 15314650",36730,Q1,292,36,122,...,2141,122,1483,6661,United Kingdom,Western Europe,Oxford University Press,1886-2022,Economics and Econometrics (Q1),"Economics, Econometrics and Finance"
2,3,20315,Nature Reviews Molecular Cell Biology,journal,"14710072, 14710080",34201,Q1,485,121,328,...,13331,156,3547,8929,United Kingdom,Western Europe,Nature Publishing Group,2000-2022,Cell Biology (Q1); Molecular Biology (Q1),"Biochemistry, Genetics and Molecular Biology"
3,4,18434,Cell,journal,"00928674, 10974172",26494,Q1,856,420,1637,...,67791,1440,4380,6574,United States,Northern America,Cell Press,1974-2022,"Biochemistry, Genetics and Molecular Biology (...","Biochemistry, Genetics and Molecular Biology"
4,5,15847,New England Journal of Medicine,journal,"00284793, 15334406",26015,Q1,1130,1410,4561,...,133956,1854,3393,1021,United States,Northern America,Massachussetts Medical Society,1945-2022,Medicine (miscellaneous) (Q1),Medicine
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1798,1799,19900192175,Journal of Topology,journal,"17538416, 17538424",1575,Q1,28,56,119,...,137,119,111,3668,United Kingdom,Western Europe,John Wiley and Sons Ltd,2010-2022,Geometry and Topology (Q1),Mathematics
1799,1800,21100875643,Materials Chemistry Frontiers,journal,20521537,1575,Q1,70,367,1245,...,8782,1233,721,6392,United Kingdom,Western Europe,Royal Society of Chemistry,2017-2022,Materials Chemistry (Q1); Materials Science (m...,Materials Science
1800,1801,15061,Agricultural Systems,journal,"0308521X, 18732267",1574,Q1,126,166,607,...,4408,601,702,7249,United Kingdom,Western Europe,Elsevier BV,1976-2022,Agronomy and Crop Science (Q1); Animal Science...,Agricultural and Biological Sciences
1801,1802,26112,Frontiers of Hormone Research,journal,03013073,1574,Q1,42,0,45,...,156,41,092,000,Switzerland,Western Europe,S. Karger AG,"1975, 1977, 1984, 1996-1997, 1999-2002, 2004-2...","Endocrinology (Q1); Endocrinology, Diabetes an...","Biochemistry, Genetics and Molecular Biology; ..."
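
The selection above relies on the file already being sorted by `Rank`. A minimal sketch of the same top-10% cut with the ordering made explicit, in plain Python; `top_fraction_by_rank` and the 18,035-record list are hypothetical and only illustrate the arithmetic.

```python
def top_fraction_by_rank(records, fraction=0.10):
    """Return the best-ranked `fraction` of records (rank 1 is best)."""
    # Same threshold rule as the notebook: truncate towards zero.
    threshold = int(fraction * len(records))
    # Sort explicitly instead of trusting the order of the input file.
    return sorted(records, key=lambda r: r["Rank"])[:threshold]


# Hypothetical input: records deliberately supplied in reverse rank order.
records = [{"Rank": rank} for rank in range(18035, 0, -1)]
top = top_fraction_by_rank(records)
print(len(top), top[0]["Rank"], top[-1]["Rank"])
```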


In [79]:
top_10_percent['Issn']

0       15424863, 00079235
1       00335533, 15314650
2       14710072, 14710080
3       00928674, 10974172
4       00284793, 15334406
               ...        
1798    17538416, 17538424
1799              20521537
1800    0308521X, 18732267
1801              03013073
1802    14617307, 13505076
Name: Issn, Length: 1803, dtype: object

In [80]:
top_10_percent[top_10_percent['Issn'].str.contains(',')]['Issn']

0       15424863, 00079235
1       00335533, 15314650
2       14710072, 14710080
3       00928674, 10974172
4       00284793, 15334406
               ...        
1795    15565653, 00150282
1796    19360533, 19360541
1798    17538416, 17538424
1800    0308521X, 18732267
1802    14617307, 13505076
Name: Issn, Length: 1191, dtype: object

In [81]:
top_10_percent[top_10_percent['Issn'].str.count(',') > 1]['Issn']

143     15383598, 00987484, 00029955
263     03785912, 01662236, 1878108X
304     13624326, 03765067, 09680004
1545    08203946, 00084409, 14882329
Name: Issn, dtype: object
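
The `Issn` field holds one, two, or three comma-separated values. A minimal sketch of turning such a field into a Web of Science `IS=` query for any number of ISSNs; `build_issn_query` is a hypothetical helper and not part of the notebook.

```python
def build_issn_query(issn_field: str) -> str:
    """Turn a SCImago 'Issn' cell into an 'IS=(...) OR IS=(...)' query string."""
    # Split on commas and drop surrounding whitespace and empty pieces.
    issns = [part.strip() for part in issn_field.split(",") if part.strip()]
    if not issns:
        raise ValueError("no ISSN found in field")
    return " OR ".join(f"IS=({issn})" for issn in issns)


print(build_issn_query("20521537"))
print(build_issn_query("15383598, 00987484, 00029955"))
```

Joining the terms this way avoids a separate branch for each possible number of ISSNs.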

In [82]:
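# NOTE: this rebinds `journals` from the SCImago DataFrame to an array of the first 100 ISSN strings
journals = top_10_percent['Issn'][0:100].values
journals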

array(['15424863, 00079235', '00335533, 15314650', '14710072, 14710080',
       '00928674, 10974172', '00284793, 15334406', '1546170X, 10788956',
       '10575987, 15458601', '15461696, 10870156', '20588437', '00028282',
       '1474175X, 14741768', '14764687, 00280836', '00223808, 1537534X',
       '00346861, 15390756', '20587546', '14710056, 14710064',
       '14741784, 14741776', '15206890, 00092665', '14741741, 14741733',
       '01492195, 1545861X', '10614036, 15461718', '00018392, 19303815',
       '19416520, 19416067', '00221082, 15406261', '15453278, 07320582',
       '15458636, 15460738', '03060012, 14604744', '17594782, 17594774',
       '10974180, 10747613', '01406736, 1474547X', '00346527, 1467937X',
       '15487091, 15487105', '17238617', '1553877X', '14764660, 14761122',
       '19358245, 19358237', '15221210, 00319333', '10959203, 00368075',
       '25201158', '17483387, 17483395', '00129682, 14680262', '00220515',
       '19457790, 19457782', '15356108, 18783686', '175

In [83]:
top_10_percent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1803 entries, 0 to 1802
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Rank                    1803 non-null   int64 
 1   Sourceid                1803 non-null   int64 
 2   Title                   1803 non-null   object
 3   Type                    1803 non-null   object
 4   Issn                    1803 non-null   object
 5   SJR                     1803 non-null   object
 6   SJR Best Quartile       1803 non-null   object
 7   H index                 1803 non-null   int64 
 8   Total Docs. (2022)      1803 non-null   int64 
 9   Total Docs. (3years)    1803 non-null   int64 
 10  Total Refs.             1803 non-null   int64 
 11  Total Cites (3years)    1803 non-null   int64 
 12  Citable Docs. (3years)  1803 non-null   int64 
 13  Cites / Doc. (2years)   1803 non-null   object
 14  Ref. / Doc.             1803 non-null   object
 15  Coun
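
The `object` dtype of `SJR`, `Cites / Doc. (2years)` and `Ref. / Doc.` suggests that these columns use a decimal comma and were read as strings. A minimal sketch of the conversion in plain Python, assuming that is the cause; the sample rows are hypothetical, and `parse_decimal_comma` is an illustrative helper. With pandas, passing `decimal=','` to `pd.read_csv` is the usual way to get the same effect at load time.

```python
import csv
import io


def parse_decimal_comma(text: str) -> float:
    """Convert a number written with a decimal comma (e.g. '0,45') to float."""
    return float(text.strip().replace(",", "."))


# Hypothetical sample in the same semicolon-separated layout as the SCImago file.
sample = 'Rank;Title;SJR\n1;Journal A;"12,345"\n2;Journal B;"0,45"\n'
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
sjr_values = [parse_decimal_comma(row["SJR"]) for row in rows]
print(sjr_values)
```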

In [84]:
top_10_percent[['Total Docs. (2022)', 'Total Docs. (3years)']]

Unnamed: 0,Total Docs. (2022),Total Docs. (3years)
0,44,118
1,36,122
2,121,328
3,420,1637
4,1410,4561
...,...,...
1798,56,119
1799,367,1245
1800,166,607
1801,0,45


In [85]:
top_10_percent[['Total Docs. (2022)', 'Total Docs. (3years)']].describe()

Unnamed: 0,Total Docs. (2022),Total Docs. (3years)
count,1803.0,1803.0
mean,305.364947,802.886855
std,585.720319,1464.200862
min,0.0,3.0
25%,66.0,166.0
50%,146.0,379.0
75%,311.5,879.0
max,8195.0,20676.0


In [86]:
top_10_percent[top_10_percent['Total Docs. (3years)']==20676]

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2022),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
1269,1270,25349,Science of the Total Environment,journal,"00489697, 18791026",1946,Q1,317,8195,20676,...,225488,20446,1094,7309,Netherlands,Western Europe,Elsevier,"1970, 1972-2023",Environmental Chemistry (Q1); Environmental En...,Environmental Science


<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 3 - Bibliometric Data of Retractions <a class="anchor" id="chapter3"></a>

In [87]:
def webscraping_wos_range(website_path, first_year, last_year):
    """
    Exports Web of Science records for every year in a range.

    Parameters:
    - website_path (str): The URL to the Web of Science search results page.
    - first_year (int): The first year of the range.
    - last_year (int): The last year of the range (inclusive).

    Description:
    This function automates the export of records from Web of Science.
    It opens the provided search results page, rejects cookies, and then, for each year
    from first_year to last_year, enters the year in the second search box, reruns the search,
    and exports all results as tab-delimited files in batches of 500 records
    ("Full Record and Cited References").

    Note:
    - ChromeDriver is downloaded automatically by webdriver_manager; Google Chrome itself must be installed.
    - The function interacts with the Web of Science website, so it may need adjustments
      if the website's structure changes.
    - When too many records are exported in one session, the function may stop because
      Web of Science asks you to prove that you are human.

    Example Usage:
    website_path = "URL_TO_WEB_OF_SCIENCE_SEARCH_RESULTS"
    first_year = 2011
    last_year = 2019
    webscraping_wos_range(website_path, first_year, last_year)
    """

    # path for chrome driver
    path = Service(ChromeDriverManager().install())

    # create the driver
    driver = webdriver.Chrome(service = path)

    # open chromedriver window
    driver.get(website_path)
    time.sleep(10)
    
    # reject cookies
    driver.find_element('xpath', '//*[@id="onetrust-reject-all-handler"]').click()
    
    id = 1
    for year in range(first_year, last_year+1):
        # click on search bar
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div').click()

        # select correct year
        driver.find_element(By.CSS_SELECTOR,'input[aria-label="Search box 2"]').clear()
        time.sleep(0.5)
        driver.find_element(By.CSS_SELECTOR,'input[aria-label="Search box 2"]').send_keys(year)

        # click search 
        driver.find_element('xpath', '//*[@id="snSearchType"]/div[5]/button[2]/span[1]').click()
        time.sleep(3)

        # calculate total number of publications (used to split the export into batches of 500)
        n_publications = int(driver.find_element(By.CLASS_NAME, "brand-blue").text.replace(',', ''))
        
        for page in range(math.ceil((n_publications - 0) / 500)):

            time.sleep(2)
            
            # calculate the range of articles to scrape on the current page
            page_start_article = page * 500 + 1 + 0
            page_end_article = min((page + 1) * 500 + 0, n_publications)
            
            try:
                # open "Export" options
                driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

                time.sleep(2)

                print(f"Scraping articles {page_start_article} to {page_end_article}")
                
                # export to tab-delimited file
                driver.find_element('xpath', '//*[@id="exportToTabWinButton"]').click()

                # export to bibtex file
                #driver.find_element('xpath', '//*[@id="exportToBibtexButton"]').click()

            except Exception:
                print(f'Articles {page_start_article} to {page_end_article} were not exported')
                continue
            
            time.sleep(3)
            driver.find_element('xpath', '//*[@id="radio3"]/label/span[1]/span[2]').click()        
            time.sleep(1)
            
            # starting range
            driver.find_element(By.NAME, "markFrom").clear()
            time.sleep(0.5)
            driver.find_element(By.NAME, "markFrom").send_keys(page_start_article)
            time.sleep(0.5)
            
            # ending range
            driver.find_element(By.NAME, "markTo").clear()
            time.sleep(0.5)
            driver.find_element(By.NAME, "markTo").send_keys(page_end_article)
            time.sleep(0.5)
            
            # dropdown "Record Content:"
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[1]/wos-select/button').click()

            time.sleep(2)
            
            # choose the record content to be "Full Record and Cited References"
            driver.find_element('xpath', '//*[@id="global-select"]/div/div/div[4]/span').click()

            time.sleep(2)
            
            # click export
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[2]/button[1]').click()

            time.sleep(13)
        id +=1
        
    # close the chrome window
    driver.quit()

In [88]:
#website = "https://www.webofscience.com/wos/woscc/summary/91e4b578-c1f9-49cc-bd35-86dbfd429de1-b1766607/relevance/1"
#webscraping_wos_range(website, 2011, 2019)
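
The function exports results in batches of 500 records, using the "from" and "to" fields of the export dialog. A minimal sketch of how those ranges are derived from the total number of publications; `export_batches` is a hypothetical helper that mirrors the arithmetic in the loop above and does not touch the browser.

```python
import math


def export_batches(n_publications: int, batch_size: int = 500):
    """Return the (first, last) record numbers of each export batch."""
    n_batches = math.ceil(n_publications / batch_size)
    return [
        # Records are numbered from 1; the last batch is cut at the total.
        (page * batch_size + 1, min((page + 1) * batch_size, n_publications))
        for page in range(n_batches)
    ]


print(export_batches(1234))  # [(1, 500), (501, 1000), (1001, 1234)]
```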

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 4 - Bibliometric Data by Journals <a class="anchor" id="chapter4"></a>

<a class="anchor"> 

## 4.1 - Webscraping by Journal Range <a class="anchor" id="section_4_1"></a>

In [89]:
def webscraping_journal_range(website_path, start_journal, end_journal):
    """
    Scrapes scientific articles from Web of Science within a specified range of journals.

    Parameters:
    - website_path (str): The URL to the Web of Science search results page.
    - start_journal (int): Index of the first journal to scrape, i.e. its position in top_10_percent['Issn'].
    - end_journal (int): Index at which to stop (exclusive, as in a Python slice).

    Description:
    This function automates the web scraping process of scientific articles from Web of Science.
    It allows you to specify a range of journals you want to scrape (e.g., journal 1 to 100).
    The function will visit the provided website, reject cookies, change the query to the ISSN(s) of
    the journal for the current loop iteration, and export the articles returned by that query.
    
    Note:
    - ChromeDriver is downloaded automatically by webdriver_manager; Google Chrome itself must be installed.
    - The function interacts with the Web of Science website, so it may need adjustments
      if the website's structure changes.
    - When you are scraping too many articles the function might stop because it asks to prove that
      you are human.

    Exceptions:
    To circumvent the CAPTCHA, if too many articles are scraped in one go, the following exceptions were put in place:
    - If the number of publications from query is above 30000, the function will return the index of that journal and reason "big_journal"
    - If the sum of the number of downloads (loop iterations) the function has gone through and the next journal exceeds 60, 
      the function will return the index of the next journal and reason "exceed_n_journals". 
    In case there is an error exporting the data:
    - Gives 30 more seconds to finish downloading the data from previous iteration, as elements might not be accessible yet
    - In case previous step doesn't work, the function returns the index of the journal and "STOP"

    Example Usage:
    website_path = "URL_TO_WEB_OF_SCIENCE_SEARCH_RESULTS"
    start_journal = 1
    end_journal = 100
    webscraping_journal_range(website_path, start_journal, end_journal)
    """

    # path for chrome driver
    path = Service(ChromeDriverManager().install())

    # create the driver
    driver = webdriver.Chrome(service = path)

    # open chromedriver window
    driver.get(website_path)
    time.sleep(10)

    driver.maximize_window()
    time.sleep(3)
    # reject cookies
    driver.find_element('xpath', '//*[@id="onetrust-reject-all-handler"]').click()
    
    journals = top_10_percent['Issn'][start_journal:end_journal].values

    journal_nr = start_journal
    loop_number = 0
    for issn in tqdm(journals):
        try: 
            # click on search bar
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div').click()
        except Exception:
            # the page may still be busy with the previous export; wait and retry once
            time.sleep(30)
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div').click()

                
        # enter the ISSN query for the current journal
        driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').clear()
        time.sleep(0.5)
        issn_list = issn.replace(" ", "").split(',')
        comma_count = issn.count(',')
        if comma_count == 0:
            driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]})")
        elif comma_count == 1:
            driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]}) OR IS=({issn_list[1]})")
        else:
            driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]}) OR IS=({issn_list[1]}) OR IS=({issn_list[2]})")

        # click search 
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div/div[4]/div[2]/app-advanced-search-form/form/div[2]/div[1]/div/div/button[2]/span[1]').click()
        time.sleep(3)

        # calculate total number of publications
        n_publications = int(driver.find_element(By.CLASS_NAME, "brand-blue").text.replace(',', ''))

        if n_publications > 30_000:
            print(f'This journal ({journal_nr}) has too many articles. Export individually.')
            return journal_nr, 'big_journal'
        
        nr_iterations = math.ceil((n_publications - 0) / 500)

        if nr_iterations + loop_number >60:
            print(f'The next journal ({journal_nr}) will exceed the limit of automatic exports')
            return journal_nr, 'exceed_n_journals'


        for page in range(nr_iterations):
            loop_number += 1

            time.sleep(2)
            driver.implicitly_wait(100)
            # calculate the range of articles to scrape on the current page
            page_start_article = page * 500 + 1 + 0
            page_end_article = min((page + 1) * 500 + 0, n_publications)
            
            try:
                # open "Export" options
                driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

                time.sleep(2)

                print(f"Scraping articles {page_start_article} to {page_end_article}")
                
                # export to tab-delimited file
                driver.find_element('xpath', '//*[@id="exportToTabWinButton"]').click()

                # export to bibtex file
                #driver.find_element('xpath', '//*[@id="exportToBibtexButton"]').click()

            except Exception:
                try: 
                    time.sleep(30)
                    # open "Export" options
                    driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

                    time.sleep(2)

                    print(f"Scraping articles {page_start_article} to {page_end_article}")
                    
                    # export to tab-delimited file
                    driver.find_element('xpath', '//*[@id="exportToTabWinButton"]').click()
                    
                except Exception:
                    print(f'Articles {page_start_article} to {page_end_article} were not exported')
                    return journal_nr, 'STOP'
                   
            
            time.sleep(3)
            driver.find_element('xpath', '//*[@id="radio3"]/label/span[1]/span[2]').click()
            time.sleep(1)
            
            # starting range
            driver.find_element(By.NAME, "markFrom").clear()
            time.sleep(0.5)
            driver.find_element(By.NAME, "markFrom").send_keys(page_start_article)
            time.sleep(0.5)
            
            # ending range
            driver.find_element(By.NAME, "markTo").clear()
            time.sleep(0.5)
            driver.find_element(By.NAME, "markTo").send_keys(page_end_article)
            time.sleep(0.5)
            
            # dropdown "Record Content:"
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[1]/wos-select/button').click()

            time.sleep(2)
            
            # choose the record content to be "Full Record and Cited References"
            driver.find_element('xpath', '//*[@id="global-select"]/div/div/div[4]/span').click()

            time.sleep(2)
            
            # click export
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[2]/button[1]').click()

            time.sleep(13)

        journal_nr += 1
        
    # close the chrome window
    driver.quit()
    return journal_nr+10, 'STOP'
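
The function repeats the same "try, wait 30 seconds, try again" pattern in several places. A minimal sketch of factoring that pattern into one helper; `retry_once` and `flaky_click` are hypothetical names, and the `sleep` argument is injectable only so the example runs without actually waiting.

```python
import time


def retry_once(action, wait_seconds=30, sleep=time.sleep):
    """Run `action`; if it raises, wait and run it one more time."""
    try:
        return action()
    except Exception:
        # Give the page time to finish the previous step, then retry.
        sleep(wait_seconds)
        return action()  # a second failure propagates to the caller


# Stand-in for a Selenium click that fails the first time it is called.
calls = {"n": 0}


def flaky_click():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("element not ready")
    return "clicked"


waits = []
result = retry_once(flaky_click, wait_seconds=30, sleep=waits.append)
print(result, waits)
```

In the scraper, `action` would be a small function or lambda that wraps the `driver.find_element(...).click()` call.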

In [90]:
top_10_percent['Issn'][4]
#1931-3128

'00284793, 15334406'

In [91]:
top_10_percent[top_10_percent['Issn'].astype(str).str.contains('00377732')].index

Index([1055], dtype='int64')

In [92]:
website = "https://www.webofscience.com/wos/woscc/summary/75146b30-3227-443c-bb5a-771738929e0b-b2272def/relevance/1"

In [93]:
#webscraping_journal_range(website,116, 200)

In [94]:
too_big = []

In [95]:
starting_journal = 0
ending_journal = 100
# while starting_journal<ending_journal:
#     journal_nr, reason = webscraping_journal_range(website,starting_journal, ending_journal)
#     time.sleep(100)
#     if reason == 'big_journal':
#         too_big.append(journal_nr)
#         starting_journal = journal_nr + 1
#     elif reason == 'exceed_n_journals':
#         starting_journal = journal_nr
#     elif reason == 'STOP':
#         print(journal_nr)
#         starting_journal=journal_nr
#         break

In [96]:
too_big

[]

In [97]:
starting_journal

0
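
The commented-out loop above restarts the scraper according to the reason it returns. A minimal sketch of that control flow, runnable without a browser: `fake_scrape` is a hypothetical stand-in that only reproduces the return rules described in the docstring (30,000-record limit, 60 exports per session), the article counts are invented, and the `"done"` reason is an assumption of this sketch rather than a value the notebook returns.

```python
import math

MAX_RECORDS = 30_000  # journals above this are exported individually
MAX_EXPORTS = 60      # exports allowed in one browser session
BATCH_SIZE = 500      # records per export


def fake_scrape(counts, start, end):
    """Mimic the (journal_nr, reason) return values without opening a browser."""
    exports_done = 0
    for journal_nr in range(start, end):
        n_publications = counts[journal_nr]
        if n_publications > MAX_RECORDS:
            return journal_nr, "big_journal"
        needed = math.ceil(n_publications / BATCH_SIZE)
        if exports_done + needed > MAX_EXPORTS:
            return journal_nr, "exceed_n_journals"
        exports_done += needed
    return end, "done"  # hypothetical reason: whole range finished


def run_all(scrape, counts, start, end):
    """Call the scraper repeatedly until the range is finished or it stops."""
    too_big, sessions = [], 0
    while start < end:
        journal_nr, reason = scrape(counts, start, end)
        sessions += 1
        if reason == "big_journal":
            too_big.append(journal_nr)  # remember it and skip past it
            start = journal_nr + 1
        elif reason == "exceed_n_journals":
            start = journal_nr          # resume here in a fresh session
        else:                           # "done" or an unrecoverable "STOP"
            start = journal_nr
            break
    return too_big, start, sessions


counts = [10_000, 25_000, 45_000, 8_000, 30_000]  # invented article counts
too_big, next_start, sessions = run_all(fake_scrape, counts, 0, len(counts))
print(too_big, next_start, sessions)
```

The real loop also pauses between sessions (`time.sleep(100)` in the commented code), which this sketch leaves out.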

<a class="anchor"> 

## 4.2 - Number of Articles per Journal <a class="anchor" id="section_4_2"></a>

This step records how many articles each journal has in Web of Science. The counts are used to identify journals that are too big to be exported in one go and to check whether the final dataset is missing articles for any journal.
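
A minimal sketch of the two uses of these counts, assuming they are kept as a mapping from journal identifier to number of articles; `split_and_check`, the identifiers used as keys, and all counts are hypothetical.

```python
MAX_RECORDS = 30_000  # same limit as in the scraping function


def split_and_check(expected_counts, scraped_counts):
    """Return journals above the limit and the shortfall per incomplete journal."""
    too_big = [j for j, n in expected_counts.items() if n > MAX_RECORDS]
    missing = {
        j: n - scraped_counts.get(j, 0)
        for j, n in expected_counts.items()
        if scraped_counts.get(j, 0) < n
    }
    return too_big, missing


# Invented counts: what Web of Science reports vs. what the exports contain.
expected = {"journal_a": 1_200, "journal_b": 45_000, "journal_c": 800}
scraped = {"journal_a": 1_200, "journal_c": 750}
too_big, missing = split_and_check(expected, scraped)
print(too_big)
print(missing)
```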

In [98]:
def nr_articles_per_journal(website_path, start_journal, end_journal):
    """
    Returns the number of articles and index of journal in top_10_percent if it exceeds 30_000.

    Parameters:
    - website_path (str): The URL to the Web of Science search results page.
    - start_journal (int): Index of the first journal to count, i.e. its position in top_10_percent['Issn'].
    - end_journal (int): Index at which to stop (exclusive, as in a Python slice).

    Example Usage:
    website_path = "URL_TO_WEB_OF_SCIENCE_SEARCH_RESULTS"
    start_journal = 1
    end_journal = 100
    webscraping_wos_range(website_path, start_journal, end_journal)
    """

    # path for chrome driver
    path = Service(ChromeDriverManager().install())

    # create the driver
    driver = webdriver.Chrome(service = path)

    # open chromedriver window
    driver.get(website_path)
    time.sleep(10)

    driver.maximize_window()
    time.sleep(3)
    # reject cookies
    driver.find_element('xpath', '//*[@id="onetrust-reject-all-handler"]').click()
    
    journals = top_10_percent['Issn'][start_journal:end_journal].values
    journal_dimensions = []
    journal_nr = start_journal
    issn_index = []
    loop = 0
    for issn in tqdm(journals):
        try: 
            # click on search bar
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div').click()
        except:
            time.sleep(30)
            driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div').click()

                
        # select correct issn
        driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').clear()
        time.sleep(0.5)
        issn_list = issn.replace(" ", "").split(',')
        comma_count = issn.count(',')
        if comma_count == 0:
            driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]})")
        elif comma_count == 1:
            driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]}) OR IS=({issn_list[1]})")
        else:
            driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]}) OR IS=({issn_list[1]}) OR IS=({issn_list[2]})")

        # click search 
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div/div[4]/div[2]/app-advanced-search-form/form/div[2]/div[1]/div/div/button[2]/span[1]').click()
        time.sleep(3)

        # calculate total number of publications
        n_publications = int(driver.find_element(By.CLASS_NAME, "brand-blue").text.replace(',', ''))
        journal_dimensions.append(n_publications)
        issn_index.append(start_journal+loop)

        if loop > 50:
            # stop after a batch of journals and close the chrome window before returning
            driver.quit()
            return journal_nr, issn_index, journal_dimensions
        
        time.sleep(2)
        journal_nr += 1
        loop += 1
        
    # close the chrome window (this must happen before the return, otherwise it is never executed)
    driver.quit()
    return journal_nr, issn_index, journal_dimensions
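
The function above builds the search string with one branch per number of ISSNs and uses at most the first three. The same string could be built by a small helper that joins one `IS=(...)` clause per ISSN. This is a sketch, not the notebook's code: `build_issn_query` is a hypothetical name, and unlike the branches above it includes every ISSN in the field rather than only the first three.

```python
def build_issn_query(issn_field):
    """Build a Web of Science advanced-search string from a comma-separated ISSN field."""
    # remove spaces and drop empty entries, e.g. '00284793, 15334406' -> ['00284793', '15334406']
    issns = [part for part in issn_field.replace(" ", "").split(",") if part]
    # one IS=(...) clause per ISSN, joined with OR
    return " OR ".join(f"IS=({issn})" for issn in issns)


query = build_issn_query("00284793, 15334406")
print(query)
```

The resulting string could then be passed to `send_keys` in place of the three separate f-strings.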


In [99]:
journal_issn = []
journal_dimensions = []

In [100]:
website = "https://www.webofscience.com/wos/woscc/summary/75146b30-3227-443c-bb5a-771738929e0b-b2272def/relevance/1"

In [101]:
starting_journal = 0
ending_journal = 1803

# while starting_journal<ending_journal:
#     try: 
#         journal_nr, issns, dimensions = nr_articles_per_journal(website,starting_journal, ending_journal)
#         journal_issn.extend(issns)
#         journal_dimensions.extend(dimensions)
#         starting_journal = journal_nr
        
#     except:
#         continue


In [102]:
# dimensions_df = pd.DataFrame({'Issn': journal_issn, 'Dimension': journal_dimensions})

In [103]:
# dimensions_df.drop_duplicates(inplace=True)
# dimensions_df

In [104]:
#dimensions_df.to_excel('../thesis_data/dimensions_df.xlsx', index = False)

<a class="anchor"> 

## 4.3 - Journals Individual Export <a class="anchor" id="section_4_3"></a>

In [105]:
dimensions_df = pd.read_excel('dimensions_df.xlsx')

In [106]:
dimensions_df

Unnamed: 0,Issn,Dimension
0,0,1937
1,1,3244
2,2,4002
3,3,23230
4,4,114969
...,...,...
1798,1798,613
1799,1799,2578
1800,1800,3937
1801,1801,604


In [107]:
# according to the export function, a big journal is one that has more than 30_000 records
medium_big_journals = dimensions_df[(dimensions_df['Dimension']>30_000) & (dimensions_df['Dimension']<=100_000)]

# if journal has more than 100_000 records, the query needs to be broken down (by year) to be able to be imported
journal_abv_100_000 = dimensions_df[dimensions_df['Dimension']>100_000]

### 4.3.1 - Medium Big Journals <a class="anchor" id="section_4_3_1"></a>

This section is dedicated to the download of data from journals that have too many articles to be extracted in one go. 
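
Large journals are exported in batches of 500 records, which is the batch size used in the code of this section. The arithmetic that turns an article range into export batches can be isolated as follows. This is a sketch for illustration: `export_batches` is a hypothetical helper that mirrors the page calculation in `webscraping_article_range`, not a function defined in the notebook.

```python
import math


def export_batches(start_article, end_article, batch_size=500):
    """Return the (first, last) record numbers of each export batch, 1-based and inclusive."""
    n_batches = math.ceil((end_article - start_article) / batch_size)
    batches = []
    for page in range(n_batches):
        first = page * batch_size + 1 + start_article
        last = min((page + 1) * batch_size + start_article, end_article)
        batches.append((first, last))
    return batches


batches = export_batches(0, 1200)
print(batches)
```

Each pair corresponds to the values typed into the "markFrom" and "markTo" fields of the export dialog.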

In [108]:
def webscraping_article_range(website_path, journal_number, start_article, end_article):
    """
    Scrapes scientific articles from Web of Science within a specified range.

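    Parameters:
    - website_path (str): The URL to the Web of Science search results page.
    - journal_number (int): The index of the journal in top_10_percent['Issn'].
    - start_article (int): The starting article number for scraping.
    - end_article (int): The ending article number for scraping.

    Returns:
    - n_publications (int): The total number of records found for the journal.
    - reason (str): 'loop_finished' if every batch was exported, 'STOP' if an export failed.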

    Description:
    This function automates the web scraping process of scientific articles from Web of Science.
    It allows you to specify a range of articles you want to scrape (e.g., articles 5000 to 6000).
    The function will visit the provided website, reject cookies, and then scrape articles within
    the specified range.

    Note:
    - Ensure you have the ChromeDriver installed and its path specified in the function.
    - The function interacts with the Web of Science website, so it may need adjustments
      if the website's structure changes.
    - When you are scraping too many articles the function might stop because it asks to prove that
      you are human.

    Example Usage:
    website_path = "URL_TO_WEB_OF_SCIENCE_SEARCH_RESULTS"
    journal_number = 97
    start_article = 5000
    end_article = 6000
    webscraping_article_range(website_path, journal_number, start_article, end_article)
    """

    # path for chrome driver
    path = Service(ChromeDriverManager().install())

    # create the driver
    driver = webdriver.Chrome(service = path)

    # open chromedriver window
    driver.get(website_path)
    time.sleep(1)
    driver.maximize_window()
    time.sleep(10)
    
    # reject cookies
    driver.find_element('xpath', '//*[@id="onetrust-reject-all-handler"]').click()
    
    issn = top_10_percent['Issn'][journal_number]

    driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div').click()

    # enter the ISSN query for the journal
    driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').clear()
    time.sleep(0.5)
    issn_list = issn.replace(" ", "").split(',')
    comma_count = issn.count(',')
    if comma_count == 0:
        driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]})")
    elif comma_count == 1:
        driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]}) OR IS=({issn_list[1]})")
    else:
        driver.find_element('xpath','//*[@id="advancedSearchInputArea"]').send_keys(f"IS=({issn_list[0]}) OR IS=({issn_list[1]}) OR IS=({issn_list[2]})")

    # click search 
    driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route/app-base-summary-component/app-search-friendly-display/div[1]/app-general-search-friendly-display/app-query-modifier/div/div[1]/div/div[4]/div[2]/app-advanced-search-form/form/div[2]/div[1]/div/div/button[2]/span[1]').click()
    time.sleep(3)

    # calculate total number of publications
    n_publications = int(driver.find_element(By.CLASS_NAME, "brand-blue").text.replace(',', ''))
    
    nr_iterations = math.ceil((end_article - start_article) / 500)
    
    for page in range(nr_iterations):

        time.sleep(2)
        
        # calculate the range of articles to scrape on the current page
        page_start_article = page * 500 + 1 + start_article
        page_end_article = min((page + 1) * 500 + start_article, end_article)
        
        try:
            driver.implicitly_wait(100)
            # open "Export" options
            driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

            time.sleep(2)

            print(f"Scraping articles {page_start_article} to {page_end_article}")
            
            driver.implicitly_wait(100)
            # export to tab-delimited file
            driver.find_element('xpath', '//*[@id="exportToTabWinButton"]').click()

            # export to bibtex file
            #driver.find_element('xpath', '//*[@id="exportToBibtexButton"]').click()

        except:
            try: 
                time.sleep(30)
                # open "Export" options
                driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

                time.sleep(2)

                print(f"Scraping articles {page_start_article} to {page_end_article}")
                
                # export to tab-delimited file
                driver.find_element('xpath', '//*[@id="exportToTabWinButton"]').click()

            except:
                print(f'Page number {page_start_article} until {page_end_article} was not exported')
                return n_publications, 'STOP'
        
        time.sleep(3)
        driver.find_element('xpath', '//*[@id="radio3"]/label/span[1]/span[2]').click()
    
        time.sleep(1)
        
        # starting range
        driver.find_element(By.NAME, "markFrom").clear()
        time.sleep(0.5)
        driver.find_element(By.NAME, "markFrom").send_keys(page_start_article)
        time.sleep(0.5)
        
        # ending range
        driver.find_element(By.NAME, "markTo").clear()
        time.sleep(0.5)
        driver.find_element(By.NAME, "markTo").send_keys(page_end_article)
        time.sleep(0.5)
        
        # dropdown "Record Content:"
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[1]/wos-select/button').click()

        time.sleep(2)
        
        # choose the record content to be "Full Record and Cited References"
        driver.find_element('xpath', '//*[@id="global-select"]/div/div/div[4]/span').click()

        time.sleep(2)
        
        # click export
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[2]/button[1]').click()

        time.sleep(13)

        
    # close the chrome window
    driver.quit()
    return n_publications, 'loop_finished'
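
The function above handles a failed click on the export button by waiting 30 seconds and trying once more. That wait-and-retry pattern could be factored into a generic helper, sketched below. `retry` and `flaky_click` are hypothetical names introduced for illustration; the stand-in action replaces the real Selenium click so that the example runs without a browser.

```python
import time


def retry(action, attempts=2, delay=30.0):
    """Call action() until it succeeds or the attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # give up: let the caller decide how to stop
            time.sleep(delay)  # give the page time to load before trying again


# Illustration with a stand-in action that fails once and then succeeds
calls = []


def flaky_click():
    calls.append(1)
    if len(calls) < 2:
        raise RuntimeError("element not ready")
    return "clicked"


result = retry(flaky_click, attempts=2, delay=0)
print(result, len(calls))
```

With a helper like this, the nested `try`/`except` blocks would shrink to a single call, and the caller would catch the final exception to return `'STOP'`.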

In [109]:
website = "https://www.webofscience.com/wos/woscc/summary/75146b30-3227-443c-bb5a-771738929e0b-b2272def/relevance/1"

In [110]:
#webscraping_article_range(website, 97, 40000, 50874) #50874

In [111]:
dimensions_df.iloc[97]['Dimension']

50820

In [112]:
medium_big_journals.iloc[96]['Issn']

1795

In [113]:
top_10_percent.iloc[1569]['Issn']

'03702693'

In [114]:
#top_10_percent[9]

In [115]:
medium_big_journals.shape

(97, 2)

In [116]:
#n_publications, reason = webscraping_article_range(website, medium_big_journals.iloc[85]['Issn'], 9500, 30_000) #

In [117]:
#n_publications, reason = webscraping_article_range(website, medium_big_journals.iloc[88]['Issn'], 23000, 45185)

starting_index = 90
ending_index = 96

for index_issn in tqdm(range(starting_index, ending_index+1)):
    # get dimension of current journal to calculate necessary number of loops
    dimension = medium_big_journals.iloc[index_issn]['Dimension']
    issn = medium_big_journals.iloc[index_issn]['Issn']
    starting_journal = 0
    
    # For each 20k observations, a new loop is created
    while starting_journal < dimension:
        if dimension < (starting_journal + 20_000):
            ending_journal = dimension
        else:
            ending_journal = starting_journal + 20_000
        
        # the function returns the total number of records found for the journal and the reason it stopped
        n_publications, reason = webscraping_article_range(website, issn, starting_journal, ending_journal)
        dimension = n_publications
        starting_journal = ending_journal

        time.sleep(100)

        if reason == 'STOP':
            print('issn index: ', index_issn)
            break

    if reason == 'STOP':
        print('issn index: ', index_issn)
        break
    
    time.sleep(60)

In [118]:
top_10_percent.iloc[1569]['Issn']

'03702693'

In [119]:
top_10_percent[top_10_percent['Issn'].astype(str).str.contains('0195668X')].index

Index([432], dtype='int64')

### 4.3.2 - Journals with more than 100.000 observations <a class="anchor" id="section_4_3_2"></a>

In [120]:
medium_filtered_years = journal_abv_100_000.drop([71, 109, 113, 247, 561], axis=0)

# website with year filter
website = "https://www.webofscience.com/wos/woscc/summary/f87068a7-0e14-490a-8ab7-2d21798c5277-bf2c4882/relevance/1"

In [121]:
#
#n_publications, reason = webscraping_article_range(website, medium_filtered_years.iloc[9]['Issn'], 80000, 97573)

starting_index = 10
ending_index = 15

for index_issn in tqdm(range(starting_index, ending_index+1)):
    # get dimension of current journal to calculate necessary number of loops
    dimension = medium_filtered_years.iloc[index_issn]['Dimension']
    issn = medium_filtered_years.iloc[index_issn]['Issn']
    starting_journal = 0
    
    # For each 20k observations, a new loop is created
    while starting_journal < dimension:
        if dimension < (starting_journal + 20_000):
            ending_journal = dimension
        else:
            ending_journal = starting_journal + 20_000
        
        # the function returns the total number of records found for the journal and the reason it stopped
        n_publications, reason = webscraping_article_range(website, issn, starting_journal, ending_journal)
        dimension = n_publications
        starting_journal = ending_journal

        time.sleep(100)

        if reason == 'STOP':
            print('issn index: ', index_issn)
            break

    if reason == 'STOP':
        print('issn index: ', index_issn)
        break
    
    time.sleep(60)

In [122]:
journal_abv_100_000 = journal_abv_100_000.filter(items = [71, 109, 113, 247, 561], axis=0)

In [123]:
# Use a filtered version of the query: with years 2000-2010
website = "https://www.webofscience.com/wos/woscc/summary/b44bc781-7f09-44f1-9999-16e8b1469fb4-bf84385d/relevance/1"

starting_index = 0
ending_index = 4

for index_issn in tqdm(range(starting_index, ending_index+1)):
    # get dimension of current journal to calculate necessary number of loops
    dimension = journal_abv_100_000.iloc[index_issn]['Dimension']
    issn = journal_abv_100_000.iloc[index_issn]['Issn']
    starting_journal = 0
    
    # For each 20k observations, a new loop is created
    while starting_journal < dimension:
        if dimension < (starting_journal + 20_000):
            ending_journal = dimension
        else:
            ending_journal = starting_journal + 20_000
        
        # the function returns the total number of records found for the journal and the reason it stopped
        n_publications, reason = webscraping_article_range(website, issn, starting_journal, ending_journal)
        dimension = n_publications
        starting_journal = ending_journal

        time.sleep(100)

        if reason == 'STOP':
            print('issn index: ', index_issn)
            break

    if reason == 'STOP':
        print('issn index: ', index_issn)
        break
    
    time.sleep(60)

In [124]:
# Use a filtered version of the query: with years 2010-2023
website = "https://www.webofscience.com/wos/woscc/summary/40e0af8f-4735-4c40-b659-07564403483d-bf843c26/relevance/1"

starting_index = 0
ending_index = 4

for index_issn in tqdm(range(starting_index, ending_index+1)):
    # get dimension of current journal to calculate necessary number of loops
    dimension = journal_abv_100_000.iloc[index_issn]['Dimension']
    issn = journal_abv_100_000.iloc[index_issn]['Issn']
    starting_journal = 0
    
    # For each 20k observations, a new loop is created
    while starting_journal < dimension:
        if dimension < (starting_journal + 20_000):
            ending_journal = dimension
        else:
            ending_journal = starting_journal + 20_000
        
        # the function returns the total number of records found for the journal and the reason it stopped
        n_publications, reason = webscraping_article_range(website, issn, starting_journal, ending_journal)
        dimension = n_publications
        starting_journal = ending_journal

        time.sleep(100)

        if reason == 'STOP':
            print('issn index: ', index_issn)
            break

    if reason == 'STOP':
        print('issn index: ', index_issn)
        break
    
    time.sleep(60)
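
The loops above all split a journal's records into chunks of 20,000 before calling the export function. That splitting step can be written once as a helper. This is a sketch under one simplifying assumption: the total is known up front, whereas the notebook's loops also update it with the count returned by the first call. `chunk_bounds` is a hypothetical name.

```python
def chunk_bounds(total, chunk_size=20_000):
    """Split the record range from 0 to total into consecutive (start, end) chunks."""
    bounds = []
    start = 0
    while start < total:
        end = min(start + chunk_size, total)
        bounds.append((start, end))
        start = end
    return bounds


chunks = chunk_bounds(50_820)
print(chunks)
```

Each `(start, end)` pair corresponds to one call of `webscraping_article_range(website, issn, start, end)`.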

<a class="anchor"> 

## 4.4 - Join data into fewer datasets <a class="anchor" id="section_4_4"></a>

import pandas as pd
import os

# Define a function to preprocess data before reading CSV
def preprocess_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the file content
        content = file.read()
        # Replace problematic literal backslash-n sequences (a backslash followed by 'n') with a space;
        # real line breaks are not affected
        cleaned_content = content.replace('\\n', ' ')
        
        # Write the cleaned content back to the same file (the original file is overwritten)
        with open(file_path, 'w', encoding='utf-8') as cleaned_file:
            cleaned_file.write(cleaned_content)
            
# Empty DataFrame to store combined data
journals_data = pd.DataFrame()

# Directory containing CSV files
directory = "C:/Users/isabe/Desktop/Tese/journals_data/journals200-299"
filenames = os.listdir(directory)

for file in filenames:
    file_path = os.path.join(directory, file)
    
    # Preprocess the file before reading
    preprocess_csv(file_path)
    
    # Read the CSV file and handle any specific parameters
    data = pd.read_csv(file_path, sep='\t')
    
    # Concatenate data to the main DataFrame
    journals_data = pd.concat([journals_data, data], ignore_index=True)


In [125]:
# journals_data = pd.DataFrame()
# filnames = os.listdir(f"C:/Users/isabe/Desktop/Tese/journals_data/journals200-299")
# for file in filnames:
#     data = pd.read_csv(f"C:/Users/isabe/Desktop/Tese/journals_data/journals200-299/{file}", sep = '\t')
#     journals_data = pd.concat([journals_data, data], ignore_index= True)

In [126]:
folder_list =  os.listdir(f"C:/Users/isabe/Desktop/Tese/journals_data")
directory = "C:/Users/isabe/Desktop/Tese/journals_data/"
problem_files = []
problem_folders = []
error_message = []

In [127]:
# Define a function to preprocess data before reading CSV
def preprocess_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the file content
        content = file.read()
        # Replace problematic literal backslash-n sequences (a backslash followed by 'n') with a space;
        # real line breaks are not affected
        cleaned_content = content.replace('\\n', ' ')
        
        # Write the cleaned content back to the same file (the original file is overwritten)
        with open(file_path, 'w', encoding='utf-8') as cleaned_file:
            cleaned_file.write(cleaned_content)
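
The replacement inside `preprocess_csv` targets the two-character sequence backslash + `n`, not real line breaks, which is easy to misread. The short sketch below shows the difference on an in-memory string. `replace_literal_backslash_n` and the sample text are hypothetical and only illustrate the string operation; no file is touched.

```python
def replace_literal_backslash_n(text):
    """Replace the two-character sequence backslash + 'n' with a space; real newlines are kept."""
    return text.replace('\\n', ' ')


# 'first\\nrow' contains a literal backslash followed by 'n'; '\n' is a real line break
sample = 'first\\nrow\nsecond row'
cleaned = replace_literal_backslash_n(sample)
print(repr(cleaned))
```

The rows of the tab-delimited file therefore stay on separate lines, and only the stray escape text inside a field is removed.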

In [128]:
# for file in os.listdir(f"C:/Users/isabe/Desktop/Tese/journals_data/journals100-199"):
#     file_path = os.path.join("C:/Users/isabe/Desktop/Tese/journals_data/journals100-199", file)
#     # Preprocess the file before reading
#     preprocess_csv(file_path)
#     df_test = pd.read_csv("C:/Users/isabe/Desktop/Tese/journals_data/journals100-199/"+file, sep='\t', on_bad_lines='warn', engine='python')


In [129]:
print(pd.__version__) #1.3.5

2.0.3


for folder in tqdm(folder_list):
    filenames = os.listdir(f"C:/Users/isabe/Desktop/Tese/journals_data/{folder}")
    journals_data = pd.DataFrame()
    for file in filenames:
        try: 
            file_path = os.path.join(directory+folder, file)
            
            # Preprocess the file before reading
            preprocess_csv(file_path)
            
            # Read the CSV file and handle any specific parameters
            data = pd.read_csv(file_path, sep='\t', on_bad_lines='warn')
            
            # Concatenate data to the main DataFrame
            journals_data = pd.concat([journals_data, data], ignore_index=True)

            #data = pd.read_csv(f"C:/Users/isabe/Desktop/Tese/journals_data/{folder}/{file}", sep = '\t')
            #journals_data = pd.concat([journals_data, data], ignore_index= True)
        except Exception as e:
            problem_files.append(file)
            problem_folders.append(folder)
            error_type = type(e).__name__
            error_message.append(error_type)
            print(f"Error reading {file}: {error_type}")
            
    journals_data.to_csv(f'./retractions_data/journals_data{folder}.csv', index= False)

In [130]:
# Saving the problem files and folders information to a CSV for analysis if needed
# problem_data = pd.DataFrame({'Problem Files': problem_files, 'Problem Folders': problem_folders, 'Error Message': error_message})
# problem_data.to_csv('./retractions_data/problem_files.csv', index=False)

In [131]:
#journals_data.to_csv(f'./retractions_data/journals_data{folder}.csv', index= False)

In [132]:
#problem_files

In [133]:
filenames = os.listdir(f"C:/Users/isabe/Desktop/Tese/Retractions/retractions_data")
filenames = [file for file in filenames if file.startswith('journals_data')]
filenames

['journals_data100-199.csv',
 'journals_data200-299.csv',
 'journals_data300-399.csv',
 'journals_data400-499.csv',
 'journals_databig2000-10.csv',
 'journals_databig2010-2023.csv',
 'journals_datajournals0-99.csv',
 'journals_datajournals100-199.csv',
 'journals_datajournals1000-1100.csv',
 'journals_datajournals1100-1200.csv',
 'journals_datajournals1200-1299.csv',
 'journals_datajournals1300-1399-1.csv',
 'journals_datajournals1300-1399.csv',
 'journals_datajournals1400-1499.csv',
 'journals_datajournals1500-1599.csv',
 'journals_datajournals1600-1699.csv',
 'journals_datajournals1700-1803.csv',
 'journals_datajournals200-299.csv',
 'journals_datajournals300-399.csv',
 'journals_datajournals400-499.csv',
 'journals_datajournals500-599.csv',
 'journals_datajournals600-699.csv',
 'journals_datajournals700-799.csv',
 'journals_datajournals900-999.csv',
 'journals_datamedium0-10.csv',
 'journals_datamedium11-29.csv',
 'journals_datamedium30-49.csv',
 'journals_datamedium50-89.csv',
 'jo

In [134]:
len(filenames)

30

final_dataset = pd.DataFrame()
# only the first 15 files are loaded in this cell
for i, file in enumerate(filenames, start=1): 
    dataset = pd.read_csv(f'C:/Users/isabe/Desktop/Tese/Retractions/retractions_data/{file}')
    final_dataset= pd.concat([final_dataset, dataset], ignore_index=True)
    del dataset
    if i==15:
        break


In [135]:
import pyarrow as pa
import pyarrow.parquet as pq

In [136]:
#final_dataset['BA'].dtype

In [137]:
#final_dataset.info()

In [138]:
#final_dataset['BA'] = final_dataset['BA'].astype(str)

In [139]:
import dask.dataframe as dd

In [140]:
# ddf = dd.from_pandas(final_dataset, npartitions=10)
# ddf.to_parquet(f'./retractions_data/control_dataset.parquet', index= False)

In [142]:
#del final_dataset

<a class="anchor"> 

## 4.5 - Check if Journals were all scraped <a class="anchor" id="section_4_5"></a>

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 5 - Bibliometric Data of Citations <a class="anchor" id="chapter5"></a>

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 6 - Bibliometric Data of Erratums <a class="anchor" id="chapter6"></a>

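To obtain the data about corrections (errata), the export function used for the big journals is reused in a simplified form: the search is already encoded in the query URL, so no journal number is passed, and the records are exported as BibTeX files.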

In [143]:
def webscraping_article_range(website_path, start_article, end_article):
    """
    Scrapes scientific articles from Web of Science within a specified range.

    Parameters:
    - website_path (str): The URL to the Web of Science search results page.
    - start_article (int): The starting article number for scraping.
    - end_article (int): The ending article number for scraping.

    Description:
    This function automates the web scraping process of scientific articles from Web of Science.
    It allows you to specify a range of articles you want to scrape (e.g., articles 5000 to 6000).
    The function will visit the provided website, reject cookies, and then scrape articles within
    the specified range.

    Note:
    - Ensure you have the ChromeDriver installed and its path specified in the function.
    - The function interacts with the Web of Science website, so it may need adjustments
      if the website's structure changes.
    - When you are scraping too many articles the function might stop because it asks to prove that
      you are human.

    Example Usage:
    website_path = "URL_TO_WEB_OF_SCIENCE_SEARCH_RESULTS"
    start_article = 5000
    end_article = 6000
    webscraping_article_range(website_path, start_article, end_article)
    """

    # path for chrome driver
    path = Service(ChromeDriverManager().install())

    # create the driver
    driver = webdriver.Chrome(service = path)

    # open chromedriver window
    driver.get(website_path)
    time.sleep(1)
    driver.maximize_window()
    time.sleep(10)
    
    # reject cookies
    driver.find_element('xpath', '//*[@id="onetrust-reject-all-handler"]').click()
    time.sleep(3)

    # calculate total number of publications
    n_publications = int(driver.find_element(By.CLASS_NAME, "brand-blue").text.replace(',', ''))
    
    nr_iterations = math.ceil((end_article - start_article) / 500)
    
    for page in range(nr_iterations):

        time.sleep(2)
        
        # calculate the range of articles to scrape on the current page
        page_start_article = page * 500 + 1 + start_article
        page_end_article = min((page + 1) * 500 + start_article, end_article)
        
        try:
            driver.implicitly_wait(100)
            # open "Export" options
            driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

            time.sleep(2)

            print(f"Scraping articles {page_start_article} to {page_end_article}")
            
            driver.implicitly_wait(100)
            # export to tab-delimited file
            #driver.find_element('xpath', '//*[@id="exportToTabWinButton"]').click()

            # export to bibtex file
            driver.find_element('xpath', '//*[@id="exportToBibtexButton"]').click()

        except:
            try: 
                time.sleep(30)
                # open "Export" options
                driver.find_element('xpath', '//*[@id="snRecListTop"]/app-export-menu/div/button').click()

                time.sleep(2)

                print(f"Scraping articles {page_start_article} to {page_end_article}")
                
                # export to bibtex file (same format as the first attempt)
                driver.find_element('xpath', '//*[@id="exportToBibtexButton"]').click()

            except:
                print(f'Page number {page_start_article} until {page_end_article} was not exported')
                return n_publications, 'STOP'
        
        time.sleep(3)
        driver.find_element('xpath', '//*[@id="radio3"]/label/span[1]/span[2]').click()
    
        time.sleep(1)
        
        # starting range
        driver.find_element(By.NAME, "markFrom").clear()
        time.sleep(0.5)
        driver.find_element(By.NAME, "markFrom").send_keys(page_start_article)
        time.sleep(0.5)
        
        # ending range
        driver.find_element(By.NAME, "markTo").clear()
        time.sleep(0.5)
        driver.find_element(By.NAME, "markTo").send_keys(page_end_article)
        time.sleep(0.5)
        
        # dropdown "Record Content:"
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[1]/wos-select/button').click()

        time.sleep(2)
        
        # choose the record content to be "Full Record and Cited References"
        driver.find_element('xpath', '//*[@id="global-select"]/div/div/div[4]/span').click()

        time.sleep(2)
        
        # click export
        driver.find_element('xpath', '/html/body/app-wos/main/div/div/div[2]/div/div/div[2]/app-input-route[1]/app-export-overlay/div/div[3]/div[2]/app-export-out-details/div/div[2]/form/div/div[2]/button[1]').click()

        time.sleep(13)

        
    # close the chrome window
    driver.quit()
    return n_publications, 'loop_finished'

In [149]:
websites = [#'https://www.webofscience.com/wos/woscc/summary/5e4cb912-0659-4a95-88d3-e5dc48c59551-c2ff58a3/relevance/1', # 2000-05
            #'https://www.webofscience.com/wos/woscc/summary/70209162-7c4a-471a-bf34-f23efa67a645-c2ff6264/relevance/1', #2005-13
            #'https://www.webofscience.com/wos/woscc/summary/d017afa7-dce3-4e46-96e3-4e7405e16c8d-c2ff6687/relevance/1', #2013-18
            #'https://www.webofscience.com/wos/woscc/summary/acf1bc03-2ed1-440d-8b73-47d4e4bcff41-c2ff6ad7/relevance/1', #2018-2021
            'https://www.webofscience.com/wos/woscc/summary/a5f03bfd-4f9a-43e3-9d49-4236a3d5cc7b-c2ff6d49/relevance/1' #2021-23
            ]

In [150]:
reason = 'loop_finished'
for query in websites:
    # restart the article counter for every query
    starting_article = 0
    # initial guess; replaced by the real number of records after the first call
    dimension = 30_000
    # For each 20k observations, a new loop is created
    while starting_article < dimension:
        ending_article = min(starting_article + 20_000, dimension)
        
        # the function returns the total number of records of the query and the reason it stopped
        n_publications, reason = webscraping_article_range(query, starting_article, ending_article)
        dimension = n_publications
        starting_article = ending_article

        time.sleep(100)

        if reason == 'STOP':
            print('query: ', query)
            break

    if reason == 'STOP':
        break
    
    time.sleep(60)   

Scraping articles 1 to 500
Scraping articles 501 to 1000
Scraping articles 1001 to 1500
Scraping articles 1501 to 2000
Scraping articles 2001 to 2500
Scraping articles 2501 to 3000
Scraping articles 3001 to 3500
Scraping articles 3501 to 4000
Scraping articles 4001 to 4500
Scraping articles 4501 to 5000
Scraping articles 5001 to 5500
Scraping articles 5501 to 6000
Scraping articles 6001 to 6500
Scraping articles 6501 to 7000
Scraping articles 7001 to 7500
Scraping articles 7501 to 8000
Scraping articles 8001 to 8500
Scraping articles 8501 to 9000
Scraping articles 9001 to 9500
Scraping articles 9501 to 10000
Scraping articles 10001 to 10500
Scraping articles 10501 to 11000
Scraping articles 11001 to 11500
Scraping articles 11501 to 12000
Scraping articles 12001 to 12500
Scraping articles 12501 to 13000
Scraping articles 13001 to 13500
Scraping articles 13501 to 14000
Scraping articles 14001 to 14500
Scraping articles 14501 to 15000
Scraping articles 15001 to 15500
Scraping articles 155

_____end here_____

In [None]:
downloads_folder =  os.path.expanduser('~') + '/Downloads' #"C:/Users/isabe/Downloads"
count = 0

# List all files in the Downloads folder
for filename in os.listdir(downloads_folder):
    # Check if the filename contains "savedrecs" and ends with ".txt"
    if "savedrecs" in filename and filename.endswith(".txt"):
        count += 1

print(f"Number of 'savedrecs' .txt files in Downloads: {count}")

In [None]:
frames = []
for file in range(count):
    # The first export is "savedrecs.txt"; later ones are "savedrecs (1).txt", "savedrecs (2).txt", ...
    name = "savedrecs.txt" if file == 0 else f"savedrecs ({file}).txt"
    frames.append(pd.read_csv(os.path.join(downloads_folder, name), sep = '\t'))

# Concatenate once instead of growing the DataFrame inside the loop
retractions_data = pd.concat(frames, ignore_index= True) if frames else pd.DataFrame()
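
The import above builds file names from a counter, so it expects the exports to be numbered without gaps. A possible alternative, sketched here under the assumption that every export starts with `savedrecs` and is tab-separated like the files read above, is to match the files by pattern. `read_savedrecs` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def read_savedrecs(folder):
    """Read every tab-separated 'savedrecs*.txt' export in `folder` into one DataFrame.

    Hypothetical helper: files are matched by pattern, so a missing number in
    the 'savedrecs (n).txt' sequence does not break the import.
    """
    files = sorted(Path(folder).glob("savedrecs*.txt"))
    if not files:
        # Nothing to import: return an empty frame instead of failing in concat
        return pd.DataFrame()
    frames = [pd.read_csv(path, sep="\t") for path in files]
    return pd.concat(frames, ignore_index=True)
```

It could be called as `read_savedrecs(Path.home() / "Downloads")`. Note that the pattern only matches names that begin with `savedrecs`, which is slightly stricter than the substring check used when counting the files.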

In [None]:
rename_columns = {
    "FN": "File Name",
    "VR": "Version Number",
    "PT": "Publication Type", # (J=Journal; B=Book; S=Series; P=Patent)
    "AU": "Authors",
    "AF": "Author Full Name",
    "BA": "Book Authors",
    "BF": "Book Authors Full Name",
    "CA": "Group Authors",
    "GP": "Book Group Authors",
    "BE": "Editors",
    "TI": "Document Title",
    "SO": "Publication Name",
    "SE": "Book Series Title",
    "BS": "Book Series Subtitle",
    "LA": "Language",
    "DT": "Document Type",
    "CT": "Conference Title",
    "CY": "Conference Date",
    "CL": "Conference Location",
    "SP": "Conference Sponsors",
    "HO": "Conference Host",
    "DE": "Author Keywords",
    "ID": "Keywords Plus",
    "AB": "Abstract",
    "C1": "Author Address",
    "RP": "Reprint Address",
    "EM": "E-mail Address",
    "RI": "ResearcherID Number",
    "OI": "ORCID Identifier (Open Researcher and Contributor ID)",
    "FU": "Funding Agency and Grant Number",
    'FP': 'Funding Name Preferred',
    "FX": "Funding Text",
    "CR": "Cited References",
    "NR": "Cited Reference Count",
    "TC": "Web of Science Core Collection Times Cited Count",
    "Z9": "Total Times Cited Count",
    "U1": "Usage Count (Last 180 Days)",
    "U2": "Usage Count (Since 2013)",
    "PU": "Publisher",
    "PI": "Publisher City",
    "PA": "Publisher Address",
    "SN": "International Standard Serial Number (ISSN)",
    "EI": "Electronic International Standard Serial Number (eISSN)",
    "BN": "International Standard Book Number (ISBN)",
    "J9": "29-Character Source Abbreviation",
    "JI": "ISO Source Abbreviation",
    "PD": "Publication Date",
    "PY": "Year Published",
    "VL": "Volume",
    "IS": "Issue",
    "SI": "Special Issue",
    "PN": "Part Number",
    "SU": "Supplement",
    "MA": "Meeting Abstract",
    "BP": "Beginning Page",
    "EP": "Ending Page",
    "AR": "Article Number",
    "DI": "Digital Object Identifier (DOI)",
    'DL': 'DOI Link',
    "D2": "Book Digital Object Identifier (DOI)",
    "EA": "Early access date",
    "EY": "Early access year",
    "PG": "Page Count",
    "P2": "Chapter Count (Book Citation Index)",
    "WC": "Web of Science Categories",
    'WE': 'Web of Science Index',
    "SC": "Research Areas",
    "GA": "Document Delivery Number",
    "PM": "PubMed ID",
    "UT": "Accession Number",
    "OA": "Open Access Indicator",
    "HP": "ESI Hot Paper", # Note that this field is valued only for ESI subscribers.
    "HC": "ESI Highly Cited Paper", # Note that this field is valued only for ESI subscribers.
    "DA": "Date this report was generated",
    "ER": "End of Record",
    "EF": "End of File"
}

retractions_data.rename(columns = rename_columns, inplace = True)
retractions_data.columns

In [None]:
retractions_data
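
Because the exports are downloaded in separate batches, the same record could in principle appear in more than one file. Below is a minimal sketch of a check that could run before saving. It assumes the renamed `Accession Number` column (WoS field `UT`) identifies a record uniquely. `drop_duplicate_records` is a hypothetical helper and the example values are made up.

```python
import pandas as pd


def drop_duplicate_records(df, key="Accession Number"):
    """Return a copy of df with one row per key value, plus the number of rows dropped.

    Hypothetical helper. Rows with a missing key are treated as equal to each
    other by drop_duplicates, so only the first of them would be kept.
    """
    deduplicated = df.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
    return deduplicated, len(df) - len(deduplicated)


# Made-up records for illustration only
example = pd.DataFrame({
    "Accession Number": ["WOS:1", "WOS:2", "WOS:1"],
    "Document Title": ["A", "B", "A"],
})
clean, n_dropped = drop_duplicate_records(example)
print(n_dropped)  # 1
```

Returning the number of dropped rows makes it visible whether any batches overlapped.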

In [None]:
retractions_data.to_csv('./retractions_data/wos_retractions_data.csv', index= False)

### Importing the information into Python

In [None]:
# def import_text_files_as_dataframe(folder_path, prefix = ''):
#     """
#     Import all text files from a folder whose names start with a specific prefix
#     and concatenate them into a single pandas DataFrame.

#     Parameters:
#     - folder_path (str): The path to the folder containing the text files.
#     - prefix (str, optional): The prefix that the filenames should start with. Default is an empty string.

#     Returns:
#     - pandas.DataFrame: A DataFrame containing the content of all matching text files.

#     Description:
#     This function reads all text files within a specified folder that have filenames
#     starting with the provided prefix. It imports the content of these files and
#     concatenates them into a single pandas DataFrame.

#     - folder_path: The path to the folder containing the text files.
#     - prefix: The prefix that the filenames should start with. If not provided, all text files
#       in the folder will be imported.

#     Example Usage:
#     folder_path = '/path/to/your/folder'
#     prefix = 'your_prefix'
#     df = import_text_files_as_dataframe(folder_path, prefix)
#     """

#     files_in_folder = os.listdir(folder_path)
#     matching_files = [file for file in files_in_folder if file.startswith(prefix) and file.endswith('.txt')]

#     if not matching_files:
#         print("No matching files found.")
#         return None

#     concat_df = pd.DataFrame()
#     for file_name in matching_files:
#         file_path = os.path.join(folder_path, file_name)
#         df = pd.read_csv(file_path, sep = '\t')
#         concat_df = pd.concat([concat_df, df])
        
#     return concat_df

In [None]:
# folder_path = os.path.join(work_path, "WoS Electric Car tsv")
# prefix = 'savedrecs'

# df = import_text_files_as_dataframe(folder_path, prefix)
# df.head()

In [None]:
# rename_columns = {
#     "FN": "File Name",
#     "VR": "Version Number",
#     "PT": "Publication Type", # (J=Journal; B=Book; S=Series; P=Patent)
#     "AU": "Authors",
#     "AF": "Author Full Name",
#     "BA": "Book Authors",
#     "BF": "Book Authors Full Name",
#     "CA": "Group Authors",
#     "GP": "Book Group Authors",
#     "BE": "Editors",
#     "TI": "Document Title",
#     "SO": "Publication Name",
#     "SE": "Book Series Title",
#     "BS": "Book Series Subtitle",
#     "LA": "Language",
#     "DT": "Document Type",
#     "CT": "Conference Title",
#     "CY": "Conference Date",
#     "CL": "Conference Location",
#     "SP": "Conference Sponsors",
#     "HO": "Conference Host",
#     "DE": "Author Keywords",
#     "ID": "Keywords Plus",
#     "AB": "Abstract",
#     "C1": "Author Address",
#     "RP": "Reprint Address",
#     "EM": "E-mail Address",
#     "RI": "ResearcherID Number",
#     "OI": "ORCID Identifier (Open Researcher and Contributor ID)",
#     "FU": "Funding Agency and Grant Number",
#     "FX": "Funding Text",
#     "CR": "Cited References",
#     "NR": "Cited Reference Count",
#     "TC": "Web of Science Core Collection Times Cited Count",
#     "Z9": "Total Times Cited Count",
#     "U1": "Usage Count (Last 180 Days)",
#     "U2": "Usage Count (Since 2013)",
#     "PU": "Publisher",
#     "PI": "Publisher City",
#     "PA": "Publisher Address",
#     "SN": "International Standard Serial Number (ISSN)",
#     "EI": "Electronic International Standard Serial Number (eISSN)",
#     "BN": "International Standard Book Number (ISBN)",
#     "J9": "29-Character Source Abbreviation",
#     "JI": "ISO Source Abbreviation",
#     "PD": "Publication Date",
#     "PY": "Year Published",
#     "VL": "Volume",
#     "IS": "Issue",
#     "SI": "Special Issue",
#     "PN": "Part Number",
#     "SU": "Supplement",
#     "MA": "Meeting Abstract",
#     "BP": "Beginning Page",
#     "EP": "Ending Page",
#     "AR": "Article Number",
#     "DI": "Digital Object Identifier (DOI)",
#     "D2": "Book Digital Object Identifier (DOI)",
#     "EA": "Early access date",
#     "EY": "Early access year",
#     "PG": "Page Count",
#     "P2": "Chapter Count (Book Citation Index)",
#     "WC": "Web of Science Categories",
#     "SC": "Research Areas",
#     "GA": "Document Delivery Number",
#     "PM": "PubMed ID",
#     "UT": "Accession Number",
#     "OA": "Open Access Indicator",
#     "HP": "ESI Hot Paper", # Note that this field is valued only for ESI subscribers.
#     "HC": "ESI Highly Cited Paper", # Note that this field is valued only for ESI subscribers.
#     "DA": "Date this report was generated",
#     "ER": "End of Record",
#     "EF": "End of File"
# }

# df.rename(columns = rename_columns, inplace = True)
# df.columns

In [None]:
# df.shape

In [None]:
# df.drop_duplicates(subset = "Accession Number", inplace = True)
# df.shape

In [None]:
# # saving the dataframe to csv for later use

# csv_path = os.path.join(work_path, "WoSElectricCarData.csv")
# df.to_csv(csv_path, index = False)

## Analysis

In [None]:
# # importing the data from the csv file

# csv_path = os.path.join(work_path, "WoSElectricCarData.csv")

# df = pd.read_csv(csv_path)
# df.head()

In [None]:
# df.info()

In [None]:
# df['Year Published'].describe()

In [None]:
# df_year = df.groupby("Year Published", as_index = False).count()[["Year Published", "Publication Name"]]
# df_year

In [None]:
# # fig = px.line(df_year, x = 'Year Published', y = 'Publication Name', markers = True, text = 'Publication Name', title = "Number of Articles by Year", labels = {"Publication Name": "", "Year Published": "Year"}, width = 800, height = 600)
# # fig.update_traces(textposition = 'top center')

# fig = px.line(df_year, x = 'Year Published', y = 'Publication Name', markers = True, title = "Number of Articles by Year", labels = {"Publication Name": "", "Year Published": "Year"}, width = 800, height = 600)

# number_articles_year_path = os.path.join(work_path, "NumberArticlesYear.pdf")
# fig.write_image(number_articles_year_path)

# fig.show()

In [None]:
# # importing the data from the csv file

# csv_path = os.path.join(work_path, "WoS_Rdata.csv")

# df = pd.read_csv(csv_path)
# df.head()

In [None]:
# df.columns

In [None]:
# df['PY'].describe()

In [None]:
# df_year = df.groupby("PY", as_index = False).count()[["PY", "TI"]]
# df_year

In [None]:
# # fig = px.line(df_year, x = 'Year Published', y = 'Publication Name', markers = True, text = 'Publication Name', title = "Number of Articles by Year", labels = {"Publication Name": "", "Year Published": "Year"}, width = 800, height = 600)
# # fig.update_traces(textposition = 'top center')

# fig = px.line(df_year, x = 'PY', y = 'TI', markers = True, title = "Number of Articles by Year", labels = {"TI": "", "PY": "Year"}, width = 800, height = 600)

# number_articles_year_path = os.path.join(work_path, "NumberArticlesYear.pdf")
# fig.write_image(number_articles_year_path)

# fig.show()