The purpose of this notebook is to combine data from the different sources into the following datasets:
- *retractions*: joins data from the Retraction Watch Database (RW) with bibliometric data on retractions from Web of Science (RD).
- *retracted_in_journals*: joins data from the Retraction Watch Database (RW), journal metrics from SCImago, and bibliometric data on all articles in the best-ranked journals (JD).


<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 0 - Libraries <a class="anchor" id="chapter0"></a>

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
#import matplotlib.pyplot as plt
#import seaborn as sns
#import plotly.express as px

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 1 - Individual Analysis <a class="anchor" id="chapter1"></a>

<a class="anchor"> 

## 1.1 - Retraction Watch Database (RW) <a class="anchor" id="section_1_1"></a>

In [3]:
rw = pd.read_excel('./retractions_data/retraction_watch_database.xlsx', dtype={'RetractionPubMedID': object, 'OriginalPaperPubMedID': object})
rw.head()

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
0,47271,Binding of DCC by Netrin-1 to Mediate Axon Gui...,(BLS) Biology - Cellular;(BLS) Biology - Gener...,Departments of Anatomy and of Biochemistry and...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Yimin Zou;Mu-ming Poo;Marc Tessier-...,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1521,0,2001-03-09 00:00:00,10.1126/science.1059391,11239160,Retraction,+Investigation by Company/Institution;+Manipul...,No,
1,47270,Hierarchical Organization of Guidance Receptor...,(BLS) Biochemistry;(BLS) Biology - General;(BL...,Department of Anatomy and Department of Bioche...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Marc Tessier-Lavigne,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1517,0,2001-02-08 00:00:00,10.1126/science.1058445,11239147,Retraction,+Duplication of Image;+Investigation by Compan...,No,
2,47243,Therapeutic potential of targeting IRES-depend...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Division of Hematology-Oncology, UCLA-Greater ...",Oncogene,Springer - Nature Publishing Group,United States,Y Shi;Y Yang;C Bardeleben;B Holmes;J Gera;Alan...,,Research Article;,2023-08-31 00:00:00,10.1038/s41388-023-02820-5,0,2015-05-11 00:00:00,10.1038/onc.2015.156,25961916,Retraction,+Concerns/Issues About Data;+Concerns/Issues A...,No,see also: https://pubpeer.com/publications/704...
3,47233,A classifier based on 273 urinary peptides pre...,(BLS) Biochemistry;(HSC) Medicine - Cardiovasc...,"Department of Nephrology, The Third Affiliated...",Journal of Hypertension,Wolters Kluwer - Lippincott Williams & Wilkins,China,Lirong Lin;Chunxuan Wang;Jiangwen Ren;Mei Mei;...,,Research Article;,2023-08-30 00:00:00,10.1097/HJH.0000000000003551,37642599,2023-08-01 00:00:00,10.1097/HJH.0000000000003467,37199562,Retraction,+Concerns/Issues About Results;+Investigation ...,No,see also https://journals.lww.com/jhypertensio...
4,47227,"Age, Gender Demographics and Comorbidity Preva...",(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedics, Dhanalakshmi Srini...",Journal of Coastal Life Medicine,Journal of Coastal Life Medicine,India,S Venkatesh Kumar;Mohith Singh;Gowtham Singh;K...,,Research Article;,2023-08-30 00:00:00,unavailable,0,2023-01-01 00:00:00,unavailable,0,Retraction,+Notice - Lack of;+Withdrawal;,No,"date of retraction unknown, article title repl..."


In [4]:
rw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42700 entries, 0 to 42699
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Record ID              42700 non-null  int64 
 1   Title                  42700 non-null  object
 2   Subject                42700 non-null  object
 3   Institution            42699 non-null  object
 4   Journal                42700 non-null  object
 5   Publisher              42700 non-null  object
 6   Country                42700 non-null  object
 7   Author                 42700 non-null  object
 8   URLS                   21687 non-null  object
 9   ArticleType            42700 non-null  object
 10  RetractionDate         42700 non-null  object
 11  RetractionDOI          42209 non-null  object
 12  RetractionPubMedID     37599 non-null  object
 13  OriginalPaperDate      42700 non-null  object
 14  OriginalPaperDOI       40173 non-null  object
 15  OriginalPaperPubMed

In [5]:
# put date variables in correct format
rw['RetractionDate'] = pd.to_datetime(rw['RetractionDate'], errors='coerce') #, infer_datetime_format=True
rw['OriginalPaperDate'] = pd.to_datetime(rw['OriginalPaperDate'])

In [6]:
# Drop rows with a missing 'RetractionDOI' so the string check below does not fail on NaN
rw_filtered = rw.dropna(subset=['RetractionDOI'])

# Filter rows starting with "http://dx.doi.org/"
rw_filtered[rw_filtered['RetractionDOI'].str.startswith("http://dx.doi.org/")]['RetractionDOI']

Series([], Name: RetractionDOI, dtype: object)

### Duplicates

In theory, there should only be one DOI per article, and each retracted paper should only have one record in the database. This means that all DOIs should be unique.

In [7]:
rw['OriginalPaperDOI'].nunique()

36846

In [8]:
rw['RetractionDOI'].nunique()

36506

In [9]:
testing_dupes = rw[rw.duplicated(subset='OriginalPaperDOI', keep=False)]
testing_dupes['OriginalPaperDOI'].value_counts()

OriginalPaperDOI
Unavailable                        2234
unavailable                        1074
10.1136/jim-2021-SRMC                 6
10.1002/tox.21941                     2
10.1016/j.lfs.2019.116709             2
10.1038/s41598-021-03765-z            2
10.1007/s12275-012-2294-z             2
10.1016/j.cej.2011.04.016             2
10.1016/j.swevo.2021.100868           2
10.1016/j.esxm.2021.100447            2
10.1093/jge/aabc74                    2
10.1088/1742-2140/aaaf57              2
10.1088/1742-2140/aa953a              2
10.1016/j.carbpol.2019.115799         2
10.1001/archpediatrics.2012.999       2
10.1007/s13277-014-2995-5             2
10.3109/02699052.2016.1162060         2
10.1016/j.rapm.2005.05.009            2
10.1524/9783486834062.275             2
Name: count, dtype: int64
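
The counts above show that most repeated values are the placeholders `Unavailable` and `unavailable` rather than real DOIs. A minimal sketch of one way to separate the two cases, using a small made-up frame (the `demo` data and the `normalise_doi` helper are illustrative assumptions, not part of this notebook):

```python
import pandas as pd

def normalise_doi(doi: pd.Series) -> pd.Series:
    """Map placeholder strings such as 'Unavailable' to missing values."""
    cleaned = doi.astype("string").str.strip()
    is_placeholder = (cleaned.str.lower() == "unavailable").fillna(False).astype(bool)
    return cleaned.mask(is_placeholder)

# Hypothetical sample: two placeholders, one missing value, one real duplicate
demo = pd.DataFrame({
    "OriginalPaperDOI": ["Unavailable", "unavailable", None,
                         "10.1000/a", "10.1000/a", "10.1000/b"]
})
demo["doi_clean"] = normalise_doi(demo["OriginalPaperDOI"])

# value_counts ignores missing values, so only real DOIs are counted
counts = demo["doi_clean"].value_counts()
real_dupes = counts[counts > 1]
print(real_dupes)
```

With the placeholders mapped to missing values, only DOIs that genuinely occur more than once remain in `real_dupes`.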

In [10]:
filtered_dupes = testing_dupes[testing_dupes['OriginalPaperDOI'].str.lower() != 'unavailable'].sort_values('OriginalPaperDOI')
filtered_dupes

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
21229,14089,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;Retracted Article;,2017-10-20,10.1001/jamapediatrics.2017.4603,0,2012-10-01,10.1001/archpediatrics.2012.999,22911396,Retraction,+Breach of Policy by Author;+Error in Data;+Er...,No,Journal previously named Archives of Pediatric...
21230,11994,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;,2017-09-21,10.1001/jamapediatrics.2017.3136,28973133,2012-10-01,10.1001/archpediatrics.2012.999,22911396,Retraction,+Error in Analyses;+Error in Data;+Error in Me...,No,note: the paper was retracted again on October...
5685,38940,"Erratum to: Î±,Î²-Unsaturated aldehyde polluta...",(BLS) Biology - Cellular;(BLS) Toxicology;,"Department of Clinical Immunology, Xijing Hosp...",Environmental Toxicology,Wiley,China;United States,Zhenbiao Wu;Emily Y He;Glenda I Scott;Jun Ren,https://retractionwatch.com/2022/07/25/univers...,Correction/Erratum/Corrigendum;,2022-07-27,10.1002/tox.23620,35894684,2021-09-12,10.1002/tox.21941,34514704,Retraction,+Updated to Retraction;,No,
5696,38375,"Î±,Î²-Unsaturated aldehyde pollutant acrolein ...",(BLS) Biology - Cellular;(BLS) Toxicology;,"Department of Clinical Immunology, Xijing Hosp...",Environmental Toxicology,Wiley,China;United States,Zhenbiao Wu;Emily Y He;Glenda I Scott;Jun Ren,https://retractionwatch.com/2022/07/25/univers...,Research Article;,2022-07-27,10.1002/tox.23620,35894684,2013-12-23,10.1002/tox.21941,24376112,Retraction,+Falsification/Fabrication of Image;+Investiga...,No,
6643,37342,Identification of the Vibrio vulnificus htpG G...,(BLS) Genetics;(BLS) Microbiology;,"Department of Agricultural Biotechnology, Seou...",Journal of Microbiology,Springer,South Korea,Slae Choi;Kyungku Jang;Seulah Choi;Hee Jee Yun...,,Research Article;,2022-05-23,10.1007/s12275-022-1680-4,35606641,2012-08-25,10.1007/s12275-012-2294-z,22923124,Retraction,+Concerns/Issues About Authorship;+Upgrade/Upd...,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42578,4780,Phenylephrine stress in the evaluation of pati...,(HSC) Medicine - Cardiology;(HSC) Medicine - P...,"Department of Radiology, University of Califor...",Investigative Radiology,Wolters Kluwer,United States,Robert A Slutsky,http://retractionwatch.com/the-retraction-watc...,Research Article;,1986-02-01,,3514538,1983-03-01,,6345451,Retraction,+Concerns/Issues About Results;+Legal Reasons/...,No,
42579,4781,Thallium pulmonary scintigraphy. Relationship ...,(HSC) Medicine - Cardiology;(HSC) Medicine - P...,"Department of Radiology, University of Califor...",Investigative Radiology,Wolters Kluwer,United States,Robert A Slutsky,http://retractionwatch.com/the-retraction-watc...,Research Article;,1986-02-01,,3514538,1984-11-01,,6392156,Retraction,+Concerns/Issues About Results;+Legal Reasons/...,No,"Article is Nov/Dec 1984 (vol. 19, iss. 6, no d..."
42611,1494,Specific antigen exclusion and non-specific fa...,(BLS) Biology - Molecular;,"Department of Immunology, Institute of Child H...",Clinical and Experimental Immunology,Blackwell Publishing,United Kingdom,S A Roberts;M C Reinhardt;R Paganelli;R J Levi...,,Research Article;,1985-01-01,,3882286,1981-07-01,,6171369,Retraction,+Error in Analyses;+Results Not Reproducible;+...,No,No DOI for Original/Notice 3/24/2017;
42613,4249,Concurrent measurement of plasma levels of vit...,(HSC) Medicine - Cardiovascular;(HSC) Medicine...,Endocrinology-Mineral Metabolism and Nephrolog...,Translational Research: The Journal of Laborat...,Elsevier,United States,PW Lambert;PB DeOreo;BW Hollis;IY Fu;DJ Ginsbe...,,Research Article;,1984-10-01,,6384395,1981-10-01,,6270222,Retraction,+Notice - Unable to Access via current resources;,Unknown,Journal formerly known as: The Journal of Labo...


In [11]:
def find_changed_columns(group):
    """Return the columns whose values differ between records sharing a DOI."""
    n_unique = group.nunique().drop(['OriginalPaperDOI', 'Record ID'], errors='ignore')
    return n_unique[n_unique > 1].index.to_list()


filtered_dupes.groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index()

Unnamed: 0,OriginalPaperDOI,0
0,10.1001/archpediatrics.2012.999,"[ArticleType, RetractionDate, RetractionDOI, R..."
1,10.1002/tox.21941,"[Title, ArticleType, OriginalPaperDate, Origin..."
2,10.1007/s12275-012-2294-z,"[RetractionDate, RetractionDOI, RetractionPubM..."
3,10.1007/s13277-014-2995-5,"[ArticleType, RetractionDate, RetractionDOI, R..."
4,10.1016/j.carbpol.2019.115799,"[RetractionDate, RetractionDOI, RetractionPubM..."
5,10.1016/j.cej.2011.04.016,"[RetractionDate, RetractionDOI, Notes]"
6,10.1016/j.esxm.2021.100447,"[RetractionDate, RetractionDOI, RetractionPubM..."
7,10.1016/j.lfs.2019.116709,"[Subject, RetractionDate, RetractionDOI, Retra..."
8,10.1016/j.rapm.2005.05.009,"[RetractionDate, RetractionDOI, RetractionPubM..."
9,10.1016/j.swevo.2021.100868,"[RetractionDate, RetractionDOI, Notes]"


In [12]:
# The following code is commented out so as not to overwrite the changes made in the file
# with pd.ExcelWriter('./wos_rd/DOI_Duplicated_RWD.xlsx') as writer:
#     filtered_dupes.groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index().to_excel(writer, sheet_name= "Differing vars",index = False)
#     filtered_dupes.to_excel(writer, sheet_name = "Duplicate records",index = False)

In [13]:
rw[rw['OriginalPaperDOI']=='10.1136/jim-2021-SRMC']

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
11352,30679,Acute kidney injury and collapsing glomerulopa...,(HSC) Medicine - Infectious Disease;(HSC) Medi...,"Arkana Laboratories, Little Rock, Arkansas; De...",Journal of Investigative Medicine: The Officia...,BMJ Publishing,Australia;United States,Cesar F Hernandez-Arroyo;Christopher P Larson;...,http://retractionwatch.com/retracted-coronavir...,Conference Abstract/Paper;,2021-04-01,10.1136/jim-2021-SRMC.111.621wit,0,2021-01-25,10.1136/jim-2021-SRMC,0,Retraction,+Duplication of Article;+Withdrawal;,Yes,There are two withdrawn abstracts with the sam...
11353,30686,Acute kidney injury and collapsing glomerulopa...,(HSC) Medicine - Infectious Disease;(HSC) Medi...,"Arkana Laboratories, Little Rock, Arkansas; De...",Journal of Investigative Medicine: The Officia...,BMJ Publishing,Australia;United States,Cesar F Hernandez-Arroyo;Christopher P Larson;...,http://retractionwatch.com/retracted-coronavir...,Conference Abstract/Paper;,2021-04-01,10.1136/jim-2021-SRMC.111.621wit,0,2021-01-25,10.1136/jim-2021-SRMC,0,Retraction,+Duplication of Article;+Withdrawal;,Yes,There are two withdrawn abstracts with the sam...
11364,30687,"Filter clotting, anticoagulation and duration ...",(HSC) Medicine - Infectious Disease;(HSC) Medi...,"Department of Nephrology, Ochsner Health Syste...",Journal of Investigative Medicine: The Officia...,BMJ Publishing,United States,Yuang Wen;Jason R LeDoux;Akanksh Ramanand;Kevi...,http://retractionwatch.com/retracted-coronavir...,Conference Abstract/Paper;,2021-04-01,10.1136/jim-2021-SRMC.112.643wit,0,2021-01-25,10.1136/jim-2021-SRMC,0,Retraction,+Duplication of Article;+Withdrawal;,Yes,There are two withdrawn abstracts with the sam...
11365,30691,"Filter clotting, anticoagulation and duration ...",(HSC) Medicine - Infectious Disease;(HSC) Medi...,"Department of Nephrology, Ochsner Health Syste...",Journal of Investigative Medicine: The Officia...,BMJ Publishing,United States,Yuang Wen;Jason R LeDoux;Akanksh Ramanand;Kevi...,http://retractionwatch.com/retracted-coronavir...,Conference Abstract/Paper;,2021-04-01,10.1136/jim-2021-SRMC.112.643wit,0,2021-01-25,10.1136/jim-2021-SRMC,0,Retraction,+Duplication of Article;+Withdrawal;,Yes,There are two withdrawn abstracts with the sam...
11381,30693,Phenotype and outcomes of acute kidney injury ...,(HSC) Medicine - Infectious Disease;(HSC) Medi...,"Department of Nephrology, Ochsner Health Syste...",Journal of Investigative Medicine: The Officia...,BMJ Publishing,Australia;United States,Aldo E Torres-Ortiz;Muner M Mohamed;Joseph B W...,http://retractionwatch.com/retracted-coronavir...,Conference Abstract/Paper;,2021-04-01,10.1136/jim-2021-SRMC.113.445wit,0,2021-01-25,10.1136/jim-2021-SRMC,0,Retraction,+Duplication of Article;,Yes,There are two withdrawn abstracts with the sam...
11382,30697,Phenotype and outcomes of acute kidney injury ...,(HSC) Medicine - Infectious Disease;(HSC) Medi...,"Department of Nephrology, Ochsner Health Syste...",Journal of Investigative Medicine: The Officia...,BMJ Publishing,Australia;United States,Aldo E Torres-Ortiz;Muner M Mohamed;Joseph B W...,http://retractionwatch.com/retracted-coronavir...,Conference Abstract/Paper;,2021-04-01,10.1136/jim-2021-SRMC.113.445wit,0,2021-01-25,10.1136/jim-2021-SRMC,0,Retraction,+Duplication of Article;,Yes,There are two withdrawn abstracts with the sam...


In [14]:
# records that should be deleted
records_to_delete = [6000, 2175, 7242]
rw = rw[~rw['Record ID'].isin(records_to_delete)]

In [15]:
rw.sort_values(by=['OriginalPaperDOI', 'OriginalPaperDate'], ascending=[True, False], inplace=True)

# Keep only the first occurrence of each unique DOI (the most recent date)
filtered_rw = rw.drop_duplicates(subset='OriginalPaperDOI')
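
Note that `drop_duplicates` treats every `Unavailable`, every `unavailable`, and every missing DOI as one key each, so `filtered_rw` keeps only a single record per placeholder spelling and one for missing values. If those records should survive, a sketch along these lines may help (the sample data is made up, and keeping the placeholder rows is an assumption about the intended behaviour):

```python
import pandas as pd

demo = pd.DataFrame({
    "Record ID": [1, 2, 3, 4, 5, 6],
    "OriginalPaperDOI": ["10.1000/a", "10.1000/a", "Unavailable",
                         "unavailable", None, "10.1000/b"],
    "OriginalPaperDate": pd.to_datetime([
        "2012-01-01", "2013-06-01", "2001-01-01",
        "2002-01-01", "2003-01-01", "2020-01-01"]),
})

# Rows without a usable DOI cannot be matched on DOI, so set them aside
no_doi = demo["OriginalPaperDOI"].isna() | (
    demo["OriginalPaperDOI"].str.lower() == "unavailable"
)

# Deduplicate only rows with a real DOI, keeping the most recent paper date
with_doi = (
    demo[~no_doi]
    .sort_values(["OriginalPaperDOI", "OriginalPaperDate"], ascending=[True, False])
    .drop_duplicates(subset="OriginalPaperDOI", keep="first")
)

# Put the untouched no-DOI rows back and restore the original row order
deduped = pd.concat([with_doi, demo[no_doi]]).sort_index()
print(deduped["Record ID"].tolist())
```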

### Title analysis

In [16]:
rw[rw['Title'].str.startswith("Retracted:")]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
13359,45110,Retracted: Lifting the lid on lobbying in Indi...,(B/T) Government;,"Department of Management Studies, Indian Insti...",Journal of Public Affairs,Wiley,India,Pankaj K P Shreyaskar;Pramod Pathak,,Research Article;,2020-09-21,10.1002/pa.2423,0,2020-09-21,10.1002/pa.2423,0,Retraction,+Date of Retraction/Other Unknown;+Euphemisms ...,No,
1326,46570,Retracted: miR-214-3p Protects and Restores th...,(BLS) Biology - Molecular;(BLS) Genetics;(HSC)...,Key Laboratory of Advanced Technologies of Mat...,Evidence-Based Complementary and Alternative M...,Hindawi,China,Yuan Cheng;Qing He;Tao Jin;Na Li,https://retractionwatch.com/2022/09/28/exclusi...,Research Article;,2023-06-21,10.1155/2023/9823451,37388114,2022-07-18,10.1155/2022/1175935,35899226,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,No,See also: https://pubpeer.com/publications/C08...


In [17]:
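# Remove the literal "Retracted:" marker and the whitespace it leaves behind
rw['Title'] = rw['Title'].str.replace('Retracted:', '', regex=False).str.strip()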

In [18]:
rw[rw['Title'].str.startswith("Retracted:")]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes


In [19]:
rw.iloc[[1326,13359]]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
12550,25797,LncRNA ATB promotes proliferation and metastas...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Department of Respiratory Medicine, The Affili...",Journal of Cellular Biochemistry,Wiley,China,Yiwei Cao;Xiangjun Luo;Xiaoqian Ding;Shichao C...,http://retractionwatch.com/2021/03/08/journal-...,Research Article;,2020-12-15,10.1002/jcb.29877,33590514,2018-04-25,10.1002/jcb.26894,29693289,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,No,see also: https://pubpeer.com/publications/B2B...
11996,44387,Preparation of self-healing anti-corrosion coa...,(PHY) Engineering - Chemical;(PHY) Materials S...,"Department of Materials Engineering, Isfahan U...",Surface Engineering,Taylor and Francis,Iran,Sogand Abbaspour;Ali Ashrafi;Mehdi Salehi,,Research Article;,2021-02-01,10.1080/02670844.2021.1883242,0,2019-11-21,10.1080/02670844.2019.1689641,0,Retraction,+Concerns/Issues About Image;+Concerns/Issues ...,No,
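
The two rows returned above carry index labels 12550 and 11996, not 1326 and 13359. `rw` was sorted in place earlier, and `.iloc` selects by position whereas `.loc` selects by index label. A small illustration with made-up data:

```python
import pandas as pd

demo = pd.DataFrame({"Title": ["c", "a", "b"]}, index=[0, 1, 2])
demo.sort_values("Title", inplace=True)  # index order is now 1, 2, 0

by_position = demo.iloc[[0]]  # first row after sorting (label 1)
by_label = demo.loc[[0]]      # row whose index label is 0

print(by_position.index.tolist(), by_label.index.tolist())
```

To re-inspect the two records whose titles started with "Retracted:", a label-based lookup such as `rw.loc[[1326, 13359]]` would be the likely intent.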


In [20]:
# regex=False matches the parentheses literally instead of treating them as a capture group
rw[rw['Title'].str.contains("(Withdrawn Publication)", regex=False)]


Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes


In [21]:
rw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42697 entries, 29155 to 42695
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Record ID              42697 non-null  int64         
 1   Title                  42697 non-null  object        
 2   Subject                42697 non-null  object        
 3   Institution            42696 non-null  object        
 4   Journal                42697 non-null  object        
 5   Publisher              42697 non-null  object        
 6   Country                42697 non-null  object        
 7   Author                 42697 non-null  object        
 8   URLS                   21686 non-null  object        
 9   ArticleType            42697 non-null  object        
 10  RetractionDate         42697 non-null  datetime64[ns]
 11  RetractionDOI          42206 non-null  object        
 12  RetractionPubMedID     37596 non-null  object        
 13  Or

<a class="anchor"> 

## 1.2 - Retractions from WoS (RD) <a class="anchor" id="section_2_2"></a>

In [22]:
# The raw file uses WoS field tags as column names, so the PubMed ID column is 'PM' at read time
wos_rd = pd.read_csv('./retractions_data/journals_data100-199.csv', dtype={'PM': object})
wos_rd.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 471543 entries, 0 to 471542
Data columns (total 71 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   PT      471543 non-null  object 
 1   AU      471004 non-null  object 
 2   BA      0 non-null       float64
 3   BE      2529 non-null    object 
 4   GP      3 non-null       object 
 5   AF      471004 non-null  object 
 6   BF      0 non-null       float64
 7   CA      8671 non-null    object 
 8   TI      471543 non-null  object 
 9   SO      471543 non-null  object 
 10  SE      3521 non-null    object 
 11  BS      16 non-null      object 
 12  LA      471543 non-null  object 
 13  DT      471541 non-null  object 
 14  CT      24713 non-null   object 
 15  CY      24711 non-null   object 
 16  CL      24711 non-null   object 
 17  SP      23239 non-null   object 
 18  HO      566 non-null     object 
 19  DE      96472 non-null   object 
 20  ID      261715 non-null  object 
 21  AB      24

In [23]:
rename_columns = {
    "FN": "File Name",
    "VR": "Version Number",
    "PT": "Publication Type", # (J=Journal; B=Book; S=Series; P=Patent)
    "AU": "Authors",
    "AF": "Author Full Name",
    "BA": "Book Authors",
    "BF": "Book Authors Full Name",
    "CA": "Group Authors",
    "GP": "Book Group Authors",
    "BE": "Editors",
    "TI": "Document Title",
    "SO": "Publication Name",
    "SE": "Book Series Title",
    "BS": "Book Series Subtitle",
    "LA": "Language",
    "DT": "Document Type",
    "CT": "Conference Title",
    "CY": "Conference Date",
    "CL": "Conference Location",
    "SP": "Conference Sponsors",
    "HO": "Conference Host",
    "DE": "Author Keywords",
    "ID": "Keywords Plus",
    "AB": "Abstract",
    "C1": "Author Address",
    "RP": "Reprint Address",
    "EM": "E-mail Address",
    "RI": "ResearcherID Number",
    "OI": "ORCID Identifier (Open Researcher and Contributor ID)",
    "FU": "Funding Agency and Grant Number",
    "FX": "Funding Text",
    "CR": "Cited References",
    "NR": "Cited Reference Count",
    "TC": "Web of Science Core Collection Times Cited Count",
    "Z9": "Total Times Cited Count",
    "U1": "Usage Count (Last 180 Days)",
    "U2": "Usage Count (Since 2013)",
    "PU": "Publisher",
    "PI": "Publisher City",
    "PA": "Publisher Address",
    "SN": "International Standard Serial Number (ISSN)",
    "EI": "Electronic International Standard Serial Number (eISSN)",
    "BN": "International Standard Book Number (ISBN)",
    "J9": "29-Character Source Abbreviation",
    "JI": "ISO Source Abbreviation",
    "PD": "Publication Date",
    "PY": "Year Published",
    "VL": "Volume",
    "IS": "Issue",
    "SI": "Special Issue",
    "PN": "Part Number",
    "SU": "Supplement",
    "MA": "Meeting Abstract",
    "BP": "Beginning Page",
    "EP": "Ending Page",
    "AR": "Article Number",
    "DI": "Digital Object Identifier (DOI)",
    "D2": "Book Digital Object Identifier (DOI)",
    "EA": "Early access date",
    "EY": "Early access year",
    "PG": "Page Count",
    "P2": "Chapter Count (Book Citation Index)",
    "WC": "Web of Science Categories",
    "SC": "Research Areas",
    "GA": "Document Delivery Number",
    "PM": "PubMed ID",
    "UT": "Accession Number",
    "OA": "Open Access Indicator",
    "HP": "ESI Hot Paper", # Note that this field is valued only for ESI subscribers.
    "HC": "ESI Highly Cited Paper", # Note that this field is valued only for ESI subscribers.
    "DA": "Date this report was generated",
    "ER": "End of Record",
    "EF": "End of File"
}

wos_rd.rename(columns = rename_columns, inplace = True)
wos_rd.columns

Index(['Publication Type', 'Authors', 'Book Authors', 'Editors',
       'Book Group Authors', 'Author Full Name', 'Book Authors Full Name',
       'Group Authors', 'Document Title', 'Publication Name',
       'Book Series Title', 'Book Series Subtitle', 'Language',
       'Document Type', 'Conference Title', 'Conference Date',
       'Conference Location', 'Conference Sponsors', 'Conference Host',
       'Author Keywords', 'Keywords Plus', 'Abstract', 'Author Address', 'C3',
       'Reprint Address', 'E-mail Address', 'ResearcherID Number',
       'ORCID Identifier (Open Researcher and Contributor ID)',
       'Funding Agency and Grant Number', 'FP', 'Funding Text',
       'Cited References', 'Cited Reference Count',
       'Web of Science Core Collection Times Cited Count',
       'Total Times Cited Count', 'Usage Count (Last 180 Days)',
       'Usage Count (Since 2013)', 'Publisher', 'Publisher City',
       'Publisher Address', 'International Standard Serial Number (ISSN)',
       '

In [24]:
#wos_rd.dtypes[[2,6,36,58,60,65,67,68]]

In [25]:
columns_to_convert = ['Usage Count (Since 2013)', 'Page Count', 'PubMed ID', 'Year Published']
new_data_types = {'Usage Count (Since 2013)': 'int64', 'Page Count': 'int64', 'PubMed ID': 'object', 'Year Published': 'int64'}

for col in columns_to_convert:
    try:
        wos_rd[col] = pd.to_numeric(wos_rd[col], errors='coerce').astype(new_data_types[col])
    except ValueError:
        wos_rd[col] = pd.to_numeric(wos_rd[col], errors='coerce')
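
Because `astype('int64')` raises a `ValueError` whenever coercion has produced `NaN`, the loop above falls back to float for any column with unparseable or missing values. If integer columns with gaps are preferred, the nullable `Int64` dtype in pandas is one option. A sketch on made-up data (the sample values are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({"Year Published": ["2014", "2021", "n/a", None]})

# Coerce unparseable entries to NaN, then use the nullable integer dtype,
# which can hold missing values alongside whole numbers
years = pd.to_numeric(demo["Year Published"], errors="coerce").astype("Int64")
print(years.dtype, years.isna().sum())
```

This keeps values such as 2014 as integers, which avoids the `2014.0` rendering when the column is later converted to strings.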

In [26]:
wos_rd.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Book Authors,0.0,,,,,,,
Book Authors Full Name,0.0,,,,,,,
Web of Science Core Collection Times Cited Count,471541.0,75.503125,269.042517,0.0,1.0,17.0,73.0,57414.0
Total Times Cited Count,471543.0,78.089661,277.848086,0.0,1.0,18.0,75.0,57919.0
Usage Count (Last 180 Days),471542.0,1.794612,8.272906,0.0,0.0,0.0,1.0,1933.0
Usage Count (Since 2013),471519.0,24.87848,89.359259,0.0,0.0,3.0,16.0,12137.0
Year Published,471529.0,2086.243048,40343.100244,6.0,1996.0,2010.0,2017.0,19590850.0
Page Count,471519.0,9.471427,20.787054,0.0,2.0,7.0,13.0,7002.0


In [27]:
wos_rd[wos_rd.duplicated()].shape[0]

71701

In [28]:
wos_rd['Year Published'] = np.where(wos_rd['Year Published'] < 1000, 2000, wos_rd['Year Published'])

In [29]:
wos_rd['Publication Date']

0         DEC
1         SEP
2         JUN
3         JUN
4         APR
         ... 
471538    JUN
471539    JUN
471540    FEB
471541    JAN
471542    JAN
Name: Publication Date, Length: 471543, dtype: object

In [30]:
wos_rd['Year Published'].astype(str).value_counts()

Year Published
2014.0        22284
2021.0        21154
2022.0        21004
2020.0        19100
2010.0        18017
              ...  
1903.0           20
1902.0           18
nan              14
2024.0            8
19590850.0        2
Name: count, Length: 127, dtype: int64

In [31]:
wos_rd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 471543 entries, 0 to 471542
Data columns (total 71 columns):
 #   Column                                                   Non-Null Count   Dtype  
---  ------                                                   --------------   -----  
 0   Publication Type                                         471543 non-null  object 
 1   Authors                                                  471004 non-null  object 
 2   Book Authors                                             0 non-null       float64
 3   Editors                                                  2529 non-null    object 
 4   Book Group Authors                                       3 non-null       object 
 5   Author Full Name                                         471004 non-null  object 
 6   Book Authors Full Name                                   0 non-null       float64
 7   Group Authors                                            8671 non-null    object 
 8   Document Title

In [35]:
wos_rd['Year Published'].fillna(2000).astype(int).astype(str).value_counts()

Year Published
2014        22284
2021        21154
2022        21004
2020        19100
2010        18017
            ...  
1931           37
1903           20
1902           18
2024            8
19590850        2
Name: count, Length: 126, dtype: int64

In [41]:
wos_rd['Year Published'].sort_values()

240226    1900.0
239741    1900.0
261973    1900.0
178693    1900.0
178888    1900.0
           ...  
187733       NaN
189215       NaN
190151       NaN
281783       NaN
307970       NaN
Name: Year Published, Length: 471543, dtype: float64

In [36]:
wos_rd['Publication Date'].fillna('JAN', inplace=True)

wos_rd['Publication Date'] = pd.to_datetime(wos_rd['Publication Date'] + ' ' + wos_rd['Year Published'].astype(str), errors='coerce')
wos_rd['Publication Date'].fillna(pd.to_datetime('JAN 01 ' + wos_rd['Year Published'].fillna(2000).astype(int).astype(str), format='%b %d %Y'), inplace=True)

# Sort the DataFrame by DOI and 'Publication Date' in descending order
wos_rd.sort_values(by=['Digital Object Identifier (DOI)', 'Publication Date'], ascending=[True, False], inplace=True)



ValueError: unconverted data remains when parsing with format "%b %d %Y": "0850", at position 125. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
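
The `ValueError` above is triggered by the year value 19590850, which survives the earlier `< 1000` check and cannot be parsed with `%Y`. A minimal sketch of one way to build the date column, assuming that years outside 1000 to 2100 should be treated as missing and, as elsewhere in this notebook, defaulted to 2000 (the bounds, the three-letter month slice, and the `build_publication_date` helper are all assumptions):

```python
import pandas as pd

def build_publication_date(month: pd.Series, year: pd.Series) -> pd.Series:
    """Combine a month abbreviation and a year into a first-of-month date."""
    # Treat implausible years as missing, then apply the default of 2000
    plausible = year.where(year.between(1000, 2100))
    year_str = plausible.fillna(2000).astype(int).astype(str)

    # Keep only the first three letters of the month field; missing months become JAN
    month_abbr = month.fillna("JAN").str.upper().str.slice(0, 3)
    combined = month_abbr + " 01 " + year_str

    # Entries that are not a month abbreviation fall back to 1 January of that year
    parsed = pd.to_datetime(combined, format="%b %d %Y", errors="coerce")
    fallback = pd.to_datetime("JAN 01 " + year_str, format="%b %d %Y")
    return parsed.fillna(fallback)

# Hypothetical sample covering a normal row, a missing month,
# an implausible year, and a non-month label with a missing year
demo = pd.DataFrame({
    "Publication Date": ["DEC", None, "JUN", "SPR"],
    "Year Published": [2014.0, 2010.0, 19590850.0, float("nan")],
})
dates = build_publication_date(demo["Publication Date"], demo["Year Published"])
print(dates.tolist())
```

Converting the year to an integer before building the string also avoids values such as `2014.0`, which a strict `%Y` format would reject.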

In [None]:
# Drop rows with a missing DOI so the string check below does not fail on NaN
wos_rd_filtered = wos_rd.dropna(subset=['Digital Object Identifier (DOI)'])

# Filter rows starting with "http://dx.doi.org/"
wos_rd_filtered[wos_rd_filtered['Digital Object Identifier (DOI)'].str.startswith("http://dx.doi.org/")]['Digital Object Identifier (DOI)']

732       http://dx.doi.org/10.1016/j.jhydrol.2021.127122
3324      http://dx.doi.org/10.1016/j.jhydrol.2021.127122
5962     http://dx.doi.org/10.1016/j.theochem.2009.04.010
22123    http://dx.doi.org/10.1016/j.theochem.2009.04.010
2426           http://dx.doi.org/10.1021/acsomega.0c04799
Name: Digital Object Identifier (DOI), dtype: object

In [30]:
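# Strip the resolver prefix so that only the bare DOI remains (literal match, not a regex)
wos_rd['Digital Object Identifier (DOI)'] = wos_rd['Digital Object Identifier (DOI)'].str.replace('http://dx.doi.org/', '', regex=False)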

### Title Analysis

In [31]:
# Titles that carry none of the retraction markers (regex=False matches the text literally)
title = wos_rd['Document Title']
wos_rd[
    ~title.str.startswith("RETRACTED: ")
    & ~title.str.contains("(Withdrawn Publication)", regex=False)
    & ~title.str.contains("Retracted Article. See", regex=False)
]


Unnamed: 0,Publication Type,Authors,Book Authors,Editors,Book Group Authors,Author Full Name,Book Authors Full Name,Group Authors,Document Title,Publication Name,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
15076,J,"Gangidi, PR; Souriyasak, N",,,,"Gangidi, Prashant Reddy; Souriyasak, Noy",,,Solder Selection for Reflowing Large Ceramic S...,JOURNAL OF FAILURE ANALYSIS AND PREVENTION,...,"Engineering, Multidisciplinary",Emerging Sources Citation Index (ESCI),Engineering,FH4QU,,,,,05/11/2023,WOS:000411144100009
14979,J,"Lyu, WH; Zhang, J",,,,"Lyu, Weihua; Zhang, Jian",,,The Influence of Childhood Psychological Maltr...,EURASIA JOURNAL OF MATHEMATICS SCIENCE AND TEC...,...,Education & Educational Research,Social Science Citation Index (SSCI),Education & Educational Research,FO1JZ,,gold,,,05/11/2023,WOS:000416517000040


In [32]:
# Remove the retraction/withdrawal markers and stray <bold> tags added to the article titles
# (regex=False so that parentheses are matched literally on every pandas version)
wos_rd['Document Title'] = wos_rd['Document Title'].str.replace('RETRACTED: ', '', regex=False)
wos_rd['Document Title'] = wos_rd['Document Title'].str.replace('(Withdrawn Publication)', '', regex=False)
wos_rd['Document Title'] = wos_rd['Document Title'].str.replace('(Withdrawn publication)', '', regex=False)
wos_rd['Document Title'] = wos_rd['Document Title'].str.replace('</bold>', '', regex=False)
wos_rd['Document Title'] = wos_rd['Document Title'].str.replace('<bold>', '', regex=False)

In [33]:
# Some records include variations of the phrase "(Retracted article. See XX)",
# where XX varies in length and content but always ends with ")".
def remove_retraction_phrase(title):
    # Match the retraction phrase; note that the whole title is lowercased before matching
    pattern = r'\(retracted article\. see [^\)]+\)'
    return re.sub(pattern, '', str.lower(title)).strip()

# Apply the function to the 'Document Title' column
wos_rd['Document Title'] = wos_rd['Document Title'].apply(remove_retraction_phrase)

In [None]:
# Some records include the phrase "(Withdrawal of XX)", which is removed the same way
def remove_withdrawal_phrase(title):
    # Match the withdrawal phrase; the title is lowercased before matching
    pattern = r'\(withdrawal of [^\)]+\)'
    return re.sub(pattern, '', str.lower(title)).strip()

# Apply the function to the 'Document Title' column
wos_rd['Document Title'] = wos_rd['Document Title'].apply(remove_withdrawal_phrase)

In [34]:
# Remove spaces at the end of the string
wos_rd['Document Title'] = wos_rd['Document Title'].str.rstrip()
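
The title-cleaning steps above run as several separate passes. As a sketch only, the same markers could be handled in one function. The name `clean_title` is hypothetical, and the pattern list is an assumption that covers only the markers seen in this section. Unlike the cells above, this sketch removes `RETRACTED:` only at the start of the title and passes non-string values through unchanged.

```python
import re

# Assumed marker patterns, taken from the cleaning cells in this section.
_MARKERS = [
    r'^retracted:\s*',                      # leading "RETRACTED: "
    r'\(withdrawn publication\)',           # "(Withdrawn Publication)"
    r'\(retracted article\. see [^)]+\)',   # "(Retracted article. See XX)"
    r'\(withdrawal of [^)]+\)',             # "(Withdrawal of XX)"
    r'</?bold>',                            # stray <bold> / </bold> tags
]
_MARKER_RE = re.compile('|'.join(_MARKERS), re.IGNORECASE)

def clean_title(title):
    """Strip retraction markers, collapse whitespace, and lowercase."""
    if not isinstance(title, str):
        return title  # leave NaN / None untouched
    cleaned = _MARKER_RE.sub('', title)
    return re.sub(r'\s+', ' ', cleaned).strip().lower()

print(clean_title('RETRACTED: Some <bold>Title</bold> (Retracted article. See vol. 12, 2020)'))
# some title
```

Keeping the patterns in one list makes it easier to add a new marker when the check in cell `In [31]` surfaces one.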

In [35]:
#wos_rd[(~wos_rd['Document Title'].str.startswith("RETRACTED: ")) & (~wos_rd['Document Title'].str.contains("(Withdrawn Publication)"))]

### Duplicates

In [36]:
# Temporarily replace NaN with a sentinel string so that missing values compare as equal
wos_rd = wos_rd.fillna('This is a missing value')

# Drop rows that are exact duplicates across all columns
wos_rd.drop_duplicates(inplace=True)

# Restore the sentinel to NaN
wos_rd = wos_rd.replace('This is a missing value', np.nan)
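
As a side note, pandas' `drop_duplicates` already treats missing values in the same column as equal, so the sentinel round trip may not be strictly required. The toy frame below is invented purely to illustrate that behaviour. It is not drawn from the WoS data.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): two identical rows that both contain NaN.
toy = pd.DataFrame({
    'doi': ['10.1/a', '10.1/a', '10.1/b'],
    'pmid': [np.nan, np.nan, 1.0],
})

# The two '10.1/a' rows collapse into one even though 'pmid' is NaN in both.
deduped = toy.drop_duplicates()
print(len(toy), len(deduped))
# 3 2
```

Skipping the sentinel would also avoid turning numeric or datetime columns into mixed-type object columns during the `fillna` step.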

In [37]:
wos_rd[wos_rd.duplicated(subset='Digital Object Identifier (DOI)', keep=False)].shape

(531, 71)

In [38]:
wos_rd['Digital Object Identifier (DOI)'].nunique()

12335

In [39]:
wos_rd.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12858 entries, 13322 to 23067
Data columns (total 71 columns):
 #   Column                                                   Non-Null Count  Dtype         
---  ------                                                   --------------  -----         
 0   Publication Type                                         12858 non-null  object        
 1   Authors                                                  12856 non-null  object        
 2   Book Authors                                             43 non-null     object        
 3   Editors                                                  1244 non-null   object        
 4   Book Group Authors                                       714 non-null    object        
 5   Author Full Name                                         12856 non-null  object        
 6   Book Authors Full Name                                   43 non-null     object        
 7   Group Authors                                     

In [40]:
#wos_rd.iloc[[2,6,36,58,60,65,67,68]]

In [41]:
wos_rd['Digital Object Identifier (DOI)'].value_counts()

Digital Object Identifier (DOI)
10.1016/j.anbehav.2009.11.027    2
10.1109/EEM.2015.7216687         2
10.1007/s10735-016-9659-2        2
10.1021/acsomega.1c02942         2
10.1088/1402-4896/aa6e8b         2
                                ..
10.1016/j.micpro.2020.103683     1
10.1016/j.micpro.2020.103688     1
10.1016/j.micpro.2020.103689     1
10.1016/j.micpro.2020.103691     1
10.1021/acsomega.0c04799         1
Name: count, Length: 12335, dtype: int64

In [42]:
wos_rd.shape

(12858, 71)

In [43]:
wos_rd[wos_rd.duplicated(subset='Digital Object Identifier (DOI)', keep=False)]

Unnamed: 0,Publication Type,Authors,Book Authors,Editors,Book Group Authors,Author Full Name,Book Authors Full Name,Group Authors,Document Title,Publication Name,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
13875,J,"Gao, SM; Cheng, C; Chen, HW; Li, M; Liu, KH; W...",,,,"Gao, Shuming; Cheng, Cai; Chen, Hanwen; Li, Mi...",,,igf1 3′utr functions as a cerna in promoting a...,JOURNAL OF MOLECULAR HISTOLOGY,...,Cell Biology,Science Citation Index Expanded (SCI-EXPANDED),Cell Biology,VL7MI,,,,,05/11/2023,WOS:000908834400001
13876,J,"Gao, SM; Cheng, C; Chen, HW; Li, M; Liu, KH; W...",,,,"Gao, Shuming; Cheng, Cai; Chen, Hanwen; Li, Mi...",,,igf1 3'utr functions as a cerna in promoting a...,JOURNAL OF MOLECULAR HISTOLOGY,...,Cell Biology,Science Citation Index Expanded (SCI-EXPANDED),Cell Biology,DH9MA,,,,,05/11/2023,WOS:000373119000006
6906,J,"Fernández-Gil, A; Swenson, JE; Granda, C; Pére...",,,,"Fernandez-Gil, Alberto; Swenson, Jon E.; Grand...",,,evidence of sexually selected infanticide in a...,ANIMAL BEHAVIOUR,...,Behavioral Sciences; Zoology,Science Citation Index Expanded (SCI-EXPANDED),Behavioral Sciences; Zoology,548SJ,,,,,05/11/2023,WOS:000273986300034
7068,J,"Fernández-Gil, A; Swenson, JE; Granda, C; Pére...",,,,"Fernandez-Gil, Alberto; Swenson, Jon E.; Grand...",,,evidence of sexually selected infanticide in a...,ANIMAL BEHAVIOUR,...,Behavioral Sciences; Zoology,Science Citation Index Expanded (SCI-EXPANDED),Behavioral Sciences; Zoology,548SJ,,,,,05/11/2023,WOS:000273986300034
14544,J,"Iwamoto, J; Takeda, T; Matsumoto, H",,,,"Iwamoto, J.; Takeda, T.; Matsumoto, H.",,,efficacy of risedronate against hip fracture i...,BONE,...,Endocrinology & Metabolism,Science Citation Index Expanded (SCI-EXPANDED)...,Endocrinology & Metabolism,436SP,,,,,05/11/2023,WOS:000265436200053
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2377,J,"BOLDT, J; KLING, D; DAPPER, F; HEMPELMANN, G",,,,"BOLDT, J; KLING, D; DAPPER, F; HEMPELMANN, G",,,myocardial temperature during cardiac operatio...,JOURNAL OF THORACIC AND CARDIOVASCULAR SURGERY,...,Cardiac & Cardiovascular Systems; Respiratory ...,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology; Respirator...,EC910,2214832.0,,,,05/11/2023,WOS:A1990EC91000010
2325,J,"BOLDT, J; KLING, D; VONBORMANN, B; ZUGE, M; SC...",,,,"BOLDT, J; KLING, D; VONBORMANN, B; ZUGE, M; SC...",,,blood conservation in cardiac operations - cel...,JOURNAL OF THORACIC AND CARDIOVASCULAR SURGERY,...,Cardiac & Cardiovascular Systems; Respiratory ...,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology; Respirator...,AA700,2786116.0,,,,05/11/2023,WOS:A1989AA70000004
11636,J,"BOLDT, J; VONBORMANN, B; KLING, D; BORNER, U; ...",,,,"BOLDT, J; VONBORMANN, B; KLING, D; BORNER, U; ...",,,volume replacement with a new hydroxyethylstar...,INFUSIONSTHERAPIE UND KLINISCHE ERNAHRUNG,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,D1812,2427448.0,,,,05/11/2023,WOS:A1986D181200006
23516,J,"GLUECK, CJ; MELLIES, MJ; DINE, M; PERRY, T; LA...",,,,"GLUECK, CJ; MELLIES, MJ; DINE, M; PERRY, T; LA...",,,safety and efficacy of long-term diet and diet...,PEDIATRICS,...,Pediatrics,Science Citation Index Expanded (SCI-EXPANDED),Pediatrics,D4187,3526270.0,,,,05/11/2023,WOS:A1986D418700025


In [44]:
testing_dupes = wos_rd[wos_rd.duplicated(subset='Digital Object Identifier (DOI)', keep=False)]
testing_dupes['Digital Object Identifier (DOI)'].value_counts()

Digital Object Identifier (DOI)
10.1007/s10735-016-9659-2        2
10.1016/j.anbehav.2009.11.027    2
10.1016/j.bone.2009.01.095       2
10.1021/acsomega.1c02942         2
10.1088/1402-4896/aa6e8b         2
10.1093/icc/dtq041               2
10.1109/EEM.2015.7216687         2
10.1109/NISS.2009.21             2
Name: count, dtype: int64

In [45]:
unique_counts = testing_dupes[testing_dupes['Digital Object Identifier (DOI)']=='10.1093/aje/kwq207'].nunique()
unique_counts[unique_counts > 1].index

Index([], dtype='object')

In [46]:
wos_rd.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12858 entries, 13322 to 23067
Data columns (total 71 columns):
 #   Column                                                   Non-Null Count  Dtype         
---  ------                                                   --------------  -----         
 0   Publication Type                                         12858 non-null  object        
 1   Authors                                                  12856 non-null  object        
 2   Book Authors                                             43 non-null     object        
 3   Editors                                                  1244 non-null   object        
 4   Book Group Authors                                       714 non-null    object        
 5   Author Full Name                                         12856 non-null  object        
 6   Book Authors Full Name                                   43 non-null     object        
 7   Group Authors                                     

In [47]:
# number of different values per DOI
aux = testing_dupes.groupby('Digital Object Identifier (DOI)').nunique()

# number of variables that differ for each DOI
aux.gt(1).sum(axis=1).sort_values()

Digital Object Identifier (DOI)
10.1016/j.anbehav.2009.11.027     1
10.1016/j.bone.2009.01.095        1
10.1088/1402-4896/aa6e8b          1
10.1093/icc/dtq041                1
10.1007/s10735-016-9659-2         7
10.1021/acsomega.1c02942          8
10.1109/EEM.2015.7216687         13
10.1109/NISS.2009.21             13
dtype: int64

In [48]:
# variables that differ in duplicates
aux.columns[aux.max() > 1]

Index(['Document Title', 'Publication Name', 'Conference Title',
       'Conference Date', 'Conference Location', 'Keywords Plus', 'Abstract',
       'Author Address', 'C3', 'Reprint Address', 'Cited References',
       'Cited Reference Count', 'Usage Count (Since 2013)',
       'International Standard Book Number (ISBN)', 'Meeting Abstract',
       'Beginning Page', 'Article Number', 'Page Count',
       'Web of Science Categories', 'Research Areas',
       'Document Delivery Number', 'Accession Number'],
      dtype='object')

Of the variables listed above as having different values for the same DOI, only those needed for the methodology are inspected individually:
- 'Document Title'
- 'Keywords Plus'
- 'Abstract'
- 'C3' 
- 'Cited References'
- 'Cited Reference Count'
- 'Article Number'
- 'Web of Science Categories'
- 'Research Areas'

In [49]:
variable = 'Document Title'
pd.set_option('display.max_colwidth', None)  # Show full content of columns
pd.set_option('display.max_rows', None)      # Display all rows
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[variable]

13875    igf1 3′utr functions as a cerna in promoting angiogenesis by sponging mir-29 family in osteosarcoma
13876    igf1 3'utr functions as a cerna in promoting angiogenesis by sponging mir-29 family in osteosarcoma
Name: Document Title, dtype: object

In [50]:
# The two titles differ only in the apostrophe character (′ vs '), so the same text is used for both records
wos_rd.at[13875, 'Document Title'] = wos_rd.at[13876, 'Document Title']
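
Differences like this one can also be handled generically rather than row by row. The sketch below builds a comparison key that maps typographic apostrophes and primes to a plain `'`. The function name `title_key` and the set of characters it handles are assumptions made for illustration.

```python
import re
import unicodedata

# Map the prime and curly quotes to a plain apostrophe (assumed set of variants).
_APOSTROPHES = str.maketrans({'\u2032': "'", '\u2019': "'", '\u2018': "'"})

def title_key(title):
    """Return a normalized key for comparing titles, or None for non-strings."""
    if not isinstance(title, str):
        return None
    text = unicodedata.normalize('NFKC', title).translate(_APOSTROPHES)
    return re.sub(r'\s+', ' ', text).strip().casefold()

a = "igf1 3\u2032utr functions as a cerna"
b = "igf1 3'utr functions as a cerna"
print(title_key(a) == title_key(b))
# True
```

Such a key is meant only for matching. The displayed title can stay in its original form.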

In [51]:
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [52]:
variable = 'Abstract'
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[['Digital Object Identifier (DOI)',variable]].sort_values('Digital Object Identifier (DOI)')

Unnamed: 0,Digital Object Identifier (DOI),Abstract
13875,10.1007/s10735-016-9659-2,Osteosarcoma is one of the most common maligna...
13876,10.1007/s10735-016-9659-2,Osteosarcoma is one of the most common maligna...
3382,10.1021/acsomega.1c02942,The physicochemical approaches and biological ...
3383,10.1021/acsomega.1c02942,The physicochemical approaches and biological ...
12103,10.1109/EEM.2015.7216687,Demand response has been identified as a solut...
12299,10.1109/EEM.2015.7216687,Demand response has been identified as a solut...


In [53]:
wos_rd.at[13875, 'Abstract'] = wos_rd.at[13876, 'Abstract']
wos_rd.at[3383, 'Abstract'] = wos_rd.at[3382, 'Abstract']
wos_rd.at[12103, 'Abstract'] = wos_rd.at[12299, 'Abstract']

In [54]:
variable = 'C3'
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[['Digital Object Identifier (DOI)',variable]].sort_values('Digital Object Identifier (DOI)')

Unnamed: 0,Digital Object Identifier (DOI),C3
22201,10.1109/NISS.2009.21,Beijing University of Posts & Telecommunicatio...
22581,10.1109/NISS.2009.21,Beijing University of Posts & Telecommunications


In [55]:
wos_rd.at[22581, 'C3'] = wos_rd.at[22201, 'C3']

In [56]:
variable = 'Cited Reference Count' #'Cited References'
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[['Digital Object Identifier (DOI)',variable]].sort_values('Digital Object Identifier (DOI)')

Unnamed: 0,Digital Object Identifier (DOI),Cited Reference Count
3382,10.1021/acsomega.1c02942,105
3383,10.1021/acsomega.1c02942,101
12103,10.1109/EEM.2015.7216687,14
12299,10.1109/EEM.2015.7216687,15


In [57]:
variable = 'Web of Science Categories'
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[['Digital Object Identifier (DOI)',variable]].sort_values('Digital Object Identifier (DOI)')

Unnamed: 0,Digital Object Identifier (DOI),Web of Science Categories
12103,10.1109/EEM.2015.7216687,"Engineering, Electrical & Electronic"
12299,10.1109/EEM.2015.7216687,"Energy & Fuels; Engineering, Electrical & Elec..."
22201,10.1109/NISS.2009.21,"Computer Science, Information Systems; Compute..."
22581,10.1109/NISS.2009.21,"Engineering, Electrical & Electronic; Telecomm..."


In [58]:
#wos_rd.at[12103, 'Web of Science Categories'] = wos_rd.at[12299, 'Web of Science Categories']

In [59]:
variable = 'Article Number' 
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[['Digital Object Identifier (DOI)',variable]].sort_values('Digital Object Identifier (DOI)')

Unnamed: 0,Digital Object Identifier (DOI),Article Number
15151,10.1088/1402-4896/aa6e8b,64004
23406,10.1088/1402-4896/aa6e8b,64004


In [60]:
wos_rd.at[23406, 'Article Number'] = wos_rd.at[15151, 'Article Number']

In [61]:
variable = 'Research Areas'
testing_dupes.groupby('Digital Object Identifier (DOI)').filter(lambda x: x[variable].nunique() > 1)[['Digital Object Identifier (DOI)',variable]].sort_values('Digital Object Identifier (DOI)')

Unnamed: 0,Digital Object Identifier (DOI),Research Areas
12103,10.1109/EEM.2015.7216687,Engineering
12299,10.1109/EEM.2015.7216687,Energy & Fuels; Engineering
22201,10.1109/NISS.2009.21,Computer Science; Engineering
22581,10.1109/NISS.2009.21,Engineering; Telecommunications


In [62]:
testing_dupes.groupby('Digital Object Identifier (DOI)').nunique()

Unnamed: 0_level_0,Publication Type,Authors,Book Authors,Editors,Book Group Authors,Author Full Name,Book Authors Full Name,Group Authors,Document Title,Publication Name,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
Digital Object Identifier (DOI),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10.1007/s10735-016-9659-2,1,1,0,0,0,1,0,0,2,1,...,1,1,1,2,0,0,0,0,1,2
10.1016/j.anbehav.2009.11.027,1,1,0,0,0,1,0,0,1,1,...,1,1,1,1,0,0,0,0,1,1
10.1016/j.bone.2009.01.095,1,1,0,0,0,1,0,0,1,1,...,1,1,1,1,0,0,0,0,1,1
10.1021/acsomega.1c02942,1,1,0,0,0,1,0,0,1,1,...,1,1,1,2,0,1,0,0,1,2
10.1088/1402-4896/aa6e8b,1,1,0,0,0,1,0,0,1,1,...,1,1,1,1,0,0,0,0,1,1
10.1093/icc/dtq041,1,1,0,0,0,1,0,0,1,1,...,1,1,1,1,0,0,0,0,1,1
10.1109/EEM.2015.7216687,1,1,0,0,1,1,0,0,1,2,...,2,1,2,2,0,0,0,0,1,2
10.1109/NISS.2009.21,1,1,0,0,1,1,0,0,1,2,...,2,1,2,2,0,0,0,0,1,2


In [63]:
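# Keep only the first row per DOI. This is the most recent record only if the
# sort by DOI and descending 'Publication Date' above succeeded.
# Note: rows with a missing DOI count as duplicates of one another, so only one of them is kept.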
filtered_wos = wos_rd.drop_duplicates(subset='Digital Object Identifier (DOI)')
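
Because `drop_duplicates` treats missing keys as equal, deduplicating on the DOI column also collapses every record without a DOI into a single row. Whether those records should be kept is a methodological choice. The sketch below, on invented toy data, shows one way to deduplicate only the rows that have a DOI. The helper name `dedupe_by_doi` and the column names are hypothetical.

```python
import pandas as pd

def dedupe_by_doi(df, doi_col, date_col):
    """Keep the most recent row per DOI; keep all rows without a DOI."""
    has_doi = df[doi_col].notna()
    with_doi = (
        df[has_doi]
        .sort_values(date_col, ascending=False, kind='stable')
        .drop_duplicates(subset=doi_col, keep='first')
    )
    return pd.concat([with_doi, df[~has_doi]])

# Toy data (illustrative only): one duplicated DOI and two records without a DOI.
toy = pd.DataFrame({
    'doi': ['10.1/a', '10.1/a', None, None],
    'date': pd.to_datetime(['2020-01-01', '2021-06-01', '2019-01-01', '2018-01-01']),
})

deduped = dedupe_by_doi(toy, 'doi', 'date')
print(len(toy.drop_duplicates(subset='doi')), len(deduped))
# 2 3
```

Records without a DOI cannot be matched on DOI in the merge that follows anyway, so dropping them explicitly before the merge would be an equally defensible choice.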

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 2 - Merge Data <a class="anchor" id="chapter2"></a>

<a class="anchor"> 

## 2.1 - Retractions Data (RW + WoS RD) <a class="anchor" id="section_4_1"></a>

In [76]:
filtered_rw.shape
#filtered_wos.shape

(36847, 20)

In [77]:
# Inner join on DOI: keep only the records present in both RW and WoS
test = filtered_rw.merge(filtered_wos, how='inner', left_on='OriginalPaperDOI', right_on='Digital Object Identifier (DOI)')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher_x,Country,Author,URLS,ArticleType,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
0,5729,The prevention of hip fracture with risedronat...,(HSC) Medicine - Geriatric;(HSC) Medicine - Ne...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Tomohiro Kanoko;Kei Satoh;Jun I...,http://retractionwatch.com/2016/06/03/jama-jou...,Clinical Study;Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087821.0,,,,05/11/2023,WOS:000231034800010
1,5728,Risedronate sodium therapy for prevention of h...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Jun Iwamoto;Tomohiro Kanoko,http://retractionwatch.com/2016/06/03/jama-jou...,Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087822.0,,,,05/11/2023,WOS:000231034800011
2,895,The Relation Between Pulse Pressure and Cardio...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Dietetics and Nutrition, Harokop...",JAMA Internal Medicine,JAMA Network,Finland;Greece;Italy;Japan;Netherlands;Serbia;...,Demosthenes B Panagiotakos;Daan Kromhout;Aless...,,Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,974FN,16217005.0,,,,05/11/2023,WOS:000232576800013
3,938,A Randomized Clinical Trial of a Single Dose o...,(HSC) Medicine - Ophthalmology;(HSC) Medicine ...,"Department of Anesthesiology, Toride Kyodo Gen...",JAMA Ophthalmology,American Medical Association,Japan,Yoshitaka Fujii;Hiroyoshi Tanaka;Mutsuko Ito,http://retractionwatch.com/2012/06/18/three-mo...,Clinical Study;,...,Ophthalmology,Science Citation Index Expanded (SCI-EXPANDED),Ophthalmology,886RL,15642807.0,,,,05/11/2023,WOS:000226245900002
4,14089,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;Retracted Article;,...,Pediatrics,Science Citation Index Expanded (SCI-EXPANDED),Pediatrics,014ZL,22911396.0,,,,05/11/2023,WOS:000309414400018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9707,37586,Birth Weight Predicts Anthropometric and Body ...,(HSC) Medicine - General;,"Medicine Department, The Royal Hospital, Musca...",Journal of Obesity & Metabolic Syndrome (JOMES),Korean Society for the Study of Obesity,Oman;United Arab Emirates,Issa Al Salmi;Suad Hannawi,,Research Article;,...,Endocrinology & Metabolism,Emerging Sources Citation Index (ESCI),Endocrinology & Metabolism,WR7JE,34446614.0,"Green Published, gold",,,05/11/2023,WOS:000714671900010
9708,21244,The seasonal reproduction number of dengue fev...,(ENV) Climatology;(HSC) Biostatistics/Epidemio...,"Department of Mathematics, Faculty of Science,...",PeerJ,PeerJ,Thailand,Sittisede Polwiang,,Research Article;,...,Multidisciplinary Sciences,Science Citation Index Expanded (SCI-EXPANDED),Science & Technology - Other Topics,CN8KV,26213648.0,"Green Submitted, Green Published, gold",,,05/11/2023,WOS:000358690200002
9709,18340,Evaluation of Impact of Pregnancy on Oral Heal...,(HSC) Medicine - Dentistry;(HSC) Medicine - Ob...,"Department of Public Health Dentistry, Governm...",Journal of Clinical and Diagnostic Research (J...,JCDR Research and Publications Limited,India;Nepal;Pakistan,Aasim Farooq Shah;Manu Batra;Ambrina Qureshi,,Research Article;,...,"Medicine, General & Internal",Emerging Sources Citation Index (ESCI),General & Internal Medicine,FA1IY,28658896.0,"Green Published, gold",,,05/11/2023,WOS:000405194100123
9710,30813,Correlation Curve Correction and Spinal Length...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedic, Bone and Joint Reco...",Journal of Clinical and Diagnostic Research (J...,JCDR Research and Publications Limited,Iran,Hasan Ghandhari;Mir Bahram Safari;Ebrahim Amer...,,Clinical Study;,...,"Medicine, General & Internal",Emerging Sources Citation Index (ESCI),General & Internal Medicine,FV2VD,,gold,,,05/11/2023,WOS:000424425200031


In [79]:
# Number of DOIs from filtered_wos not in filtered_rw
doi_not_in_rw = filtered_wos[~filtered_wos['Digital Object Identifier (DOI)'].isin(filtered_rw['OriginalPaperDOI'])]
num_dois_in_wos_not_in_rw = len(doi_not_in_rw)

# Number of DOIs from filtered_rw not in filtered_wos
doi_not_in_wos = filtered_rw[~filtered_rw['OriginalPaperDOI'].isin(filtered_wos['Digital Object Identifier (DOI)'])]
num_dois_in_rw_not_in_wos = len(doi_not_in_wos)

In [86]:
filtered_rw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36847 entries, 29155 to 22442
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Record ID              36847 non-null  int64         
 1   Title                  36847 non-null  object        
 2   Subject                36847 non-null  object        
 3   Institution            36846 non-null  object        
 4   Journal                36847 non-null  object        
 5   Publisher              36847 non-null  object        
 6   Country                36847 non-null  object        
 7   Author                 36847 non-null  object        
 8   URLS                   19034 non-null  object        
 9   ArticleType            36847 non-null  object        
 10  RetractionDate         36847 non-null  datetime64[ns]
 11  RetractionDOI          36604 non-null  object        
 12  RetractionPubMedID     34013 non-null  object        
 13  Or

In [81]:
print(f"Number of DOIs from filtered_wos that are not in filtered_rw: {num_dois_in_wos_not_in_rw}")
print(f"Number of DOIs from filtered_rw that are not in filtered_wos: {num_dois_in_rw_not_in_wos}")

Number of DOIs from filtered_wos that are not in filtered_rw: 2624
Number of DOIs from filtered_rw that are not in filtered_wos: 27135
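
The same overlap counts can be obtained in one pass with an outer merge and `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The sketch below uses small invented frames, `rw_toy` and `wos_toy`, in place of `filtered_rw` and `filtered_wos`.

```python
import pandas as pd

# Toy stand-ins for filtered_rw and filtered_wos (illustrative only).
rw_toy = pd.DataFrame({'OriginalPaperDOI': ['10.1/a', '10.1/b', '10.1/c']})
wos_toy = pd.DataFrame({'Digital Object Identifier (DOI)': ['10.1/b', '10.1/c', '10.1/d', '10.1/e']})

overlap = rw_toy.merge(
    wos_toy,
    how='outer',
    left_on='OriginalPaperDOI',
    right_on='Digital Object Identifier (DOI)',
    indicator=True,  # adds the '_merge' column
)
counts = overlap['_merge'].value_counts()
print(counts['left_only'], counts['both'], counts['right_only'])
# 1 2 2
```

These counts equal record counts only when the key is unique on each side. Note also that pandas matches missing keys against each other in a merge, so rows with a missing DOI on both sides would be reported as `both`.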


In [68]:
# Left join on DOI: keep every RW record; WoS columns are NaN where no match exists
test = filtered_rw.merge(filtered_wos, how='left', left_on='OriginalPaperDOI', right_on='Digital Object Identifier (DOI)')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher_x,Country,Author,URLS,ArticleType,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
0,985,Early Depth Assessment of Local Burns by Dermo...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Burns Unit, Department of Dermatology, Nagasak...",Archives of Dermatology,JAMA Network,Japan,Kyomi Mihara;Hajime Shindo;Hiroya Mihara;Minak...,,Research Article;,...,,,,,,,,,,
1,5729,The prevention of hip fracture with risedronat...,(HSC) Medicine - Geriatric;(HSC) Medicine - Ne...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Tomohiro Kanoko;Kei Satoh;Jun I...,http://retractionwatch.com/2016/06/03/jama-jou...,Clinical Study;Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087821.0,,,,05/11/2023,WOS:000231034800010
2,5728,Risedronate sodium therapy for prevention of h...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Jun Iwamoto;Tomohiro Kanoko,http://retractionwatch.com/2016/06/03/jama-jou...,Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087822.0,,,,05/11/2023,WOS:000231034800011
3,895,The Relation Between Pulse Pressure and Cardio...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Dietetics and Nutrition, Harokop...",JAMA Internal Medicine,JAMA Network,Finland;Greece;Italy;Japan;Netherlands;Serbia;...,Demosthenes B Panagiotakos;Daan Kromhout;Aless...,,Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,974FN,16217005.0,,,,05/11/2023,WOS:000232576800013
4,19230,"First Foods Most: After 18-Hour Fast, People D...",(BLS) Nutrition;(SOC) Psychology;,Dyson School of Applied Economics and Manageme...,JAMA Internal Medicine,American Medical Association,United States,Brian Wansink;Aner Tal;Mitsuru Shimizu,http://retractionwatch.com/2018/04/13/caught-o...,Letter;Research Article;,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36842,44860,A consideration of the impact of contaminated ...,(ENV) Ground/Surface Water;(HSC) Public Health...,"Department of Communication and Culture, Facul...",Human Life Culture Research,Otsuma Women's University Research Institute f...,Japan,Atsushi Okeda,,Research Article;,...,,,,,,,,,,
36843,43526,Study on the Waste Information and Statistics ...,(ENV) Environmental Sciences;(ENV) Ground/Surf...,"Member, Ph.D. Candidate, Department of Disaste...",Journal of the Korean Society of Hazard Mitiga...,Korean Society of Hazard Mitigation,South Korea,Eun-han Lee;Waon-ho Yi,,Research Article;,...,,,,,,,,,,
36844,38531,On the Feasibility of Stealthily Introducing V...,(B/T) Computer Science;(B/T) Technology;,University of Minnesota,2021 IEEE Symposium on Security and Privacy,IEEE: Institute of Electrical and Electronics ...,United States,Qiushi Wu;Kangjie Lu,,Conference Abstract/Paper;,...,,,,,,,,,,
36845,47227,"Age, Gender Demographics and Comorbidity Preva...",(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedics, Dhanalakshmi Srini...",Journal of Coastal Life Medicine,Journal of Coastal Life Medicine,India,S Venkatesh Kumar;Mohith Singh;Gowtham Singh;K...,,Research Article;,...,,,,,,,,,,


In [69]:
# Right join on DOI: keep every WoS record; RW columns are NaN where no match exists
test = filtered_rw.merge(filtered_wos, how='right', left_on='OriginalPaperDOI', right_on='Digital Object Identifier (DOI)')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher_x,Country,Author,URLS,ArticleType,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
0,5729.0,The prevention of hip fracture with risedronat...,(HSC) Medicine - Geriatric;(HSC) Medicine - Ne...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Tomohiro Kanoko;Kei Satoh;Jun I...,http://retractionwatch.com/2016/06/03/jama-jou...,Clinical Study;Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087821.0,,,,05/11/2023,WOS:000231034800010
1,5728.0,Risedronate sodium therapy for prevention of h...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Jun Iwamoto;Tomohiro Kanoko,http://retractionwatch.com/2016/06/03/jama-jou...,Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087822.0,,,,05/11/2023,WOS:000231034800011
2,895.0,The Relation Between Pulse Pressure and Cardio...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Dietetics and Nutrition, Harokop...",JAMA Internal Medicine,JAMA Network,Finland;Greece;Italy;Japan;Netherlands;Serbia;...,Demosthenes B Panagiotakos;Daan Kromhout;Aless...,,Research Article;,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,974FN,16217005.0,,,,05/11/2023,WOS:000232576800013
3,938.0,A Randomized Clinical Trial of a Single Dose o...,(HSC) Medicine - Ophthalmology;(HSC) Medicine ...,"Department of Anesthesiology, Toride Kyodo Gen...",JAMA Ophthalmology,American Medical Association,Japan,Yoshitaka Fujii;Hiroyoshi Tanaka;Mutsuko Ito,http://retractionwatch.com/2012/06/18/three-mo...,Clinical Study;,...,Ophthalmology,Science Citation Index Expanded (SCI-EXPANDED),Ophthalmology,886RL,15642807.0,,,,05/11/2023,WOS:000226245900002
4,14089.0,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;Retracted Article;,...,Pediatrics,Science Citation Index Expanded (SCI-EXPANDED),Pediatrics,014ZL,22911396.0,,,,05/11/2023,WOS:000309414400018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12331,,,,,,,,,,,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,EZ345,2101137.0,Bronze,,,05/11/2023,WOS:A1990EZ34500002
12332,38977.0,Determining the accuracy of different water in...,(BLS) Agriculture;(ENV) Environmental Sciences...,"Department of Irrigation and Drainage, Faculty...",Journal of Hydrology,Elsevier,Iran,Ahmad Khasraei;Hamid Zare Abyaneh;Mehdi Jovzi;...,,Research Article;,...,Science Citation Index Expanded (SCI-EXPANDED),Engineering; Geology; Water Resources,XQ7GG,,,,,05/11/2023,WOS:000731712200002,
12333,5478.0,"Theoretical study of the mechanism, regio- and...",(PHY) Chemistry;(PHY) Mathematics;,"DÃ©partement de Chimie, FacultÃ© des Sciences,...",Journal of Molecular Structure: THEOCHEM,Elsevier,Algeria,H Chemouri;SM Mekelleche,,Research Article;,...,Science Citation Index Expanded (SCI-EXPANDED),Chemistry,471GB,,,,,05/11/2023,WOS:000268042900002,
12334,32100.0,Roflumilast Ameliorates Isoflurane-Induced Inf...,(BLS) Anatomy/Physiology;(BLS) Biology - Cellu...,"Department of Anesthesiology, the Affiliated B...",ACS Omega,American Chemical Society (ACS),China,Chunyuan Zhang;Zeting Xing;Meiyun Tan;Yanwen W...,,Research Article;,...,Science Citation Index Expanded (SCI-EXPANDED),Chemistry,QL2PM,33644540,,,,05/11/2023,WOS:000620923200008,


In [72]:
filtered_rw.shape

(36847, 20)

In [67]:
# Inner join on exact title: keeps only rows whose RW 'Title' is identical to a WoS 'Document Title'
test = rw.merge(wos_rd, how='inner', left_on='Title', right_on='Document Title')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher_x,Country,Author,URLS,ArticleType,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
0,6180,latrogenic perforation of the right pulmonary ...,(HSC) Medicine - Cardiology;(HSC) Medicine - C...,"Department of Cardiac Surgery, Clinica Capio, ...",European Journal of Cardio-Thoracic Surgery: O...,Oxford Academic,Spain,John J Trujillo;Sergio Beltrame;Stefano Urso;G...,,Research Article;,...,Cardiac & Cardiovascular Systems; Respiratory ...,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology; Respirator...,384DN,18951035.0,Bronze,,,05/11/2023,WOS:000261724600021
1,2257,100 years of lung cancer,(HSC) Medicine - Oncology;(HSC) Medicine - Pul...,Department of Anesthesiology and Critical Care...,Respiratory Medicine,Elsevier,Poland,Michael Pirozynski,,Review Article;,...,Cardiac & Cardiovascular Systems; Respiratory ...,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology; Respirator...,124NO,17056245.0,Bronze,,,05/11/2023,WOS:000243374700001
2,24730,24-h ambulatory blood pressure versus clinic b...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,Department of Social Medicine and Health Educa...,Journal of Hypertension,Wolters Kluwer - Lippincott Williams & Wilkins,China;United Kingdom,Hong Fan;Igho J Onakpoya;Carl J Heneghan,,Meta-Analysis;,...,Peripheral Vascular Disease,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology,OQ0PL,32618886.0,,,,05/11/2023,WOS:000588494400001
3,44233,"p300 promotes proliferation, migration, and in...",(BLS) Biology - Cancer;(BLS) Biology - Cellula...,"Department of Medical Oncology, Sun Yat-sen Un...",BMC Cancer,Springer - Biomed Central (BMC),China,Xue Hou;Run Gong;Jianhua Zhan;Ting Zhou;Yuxian...,,Research Article;,...,Oncology,Science Citation Index Expanded (SCI-EXPANDED),Oncology,GI6DT,29879950.0,"Green Published, gold",,,05/11/2023,WOS:000434460400003


In [66]:
# Right join on exact title: keeps every WoS row; RW columns are NaN where no title matches
test = rw.merge(wos_rd, how='right', left_on='Title', right_on='Document Title')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher_x,Country,Author,URLS,ArticleType,...,Web of Science Categories,WE,Research Areas,Document Delivery Number,PubMed ID,Open Access Indicator,ESI Highly Cited Paper,ESI Hot Paper,Date this report was generated,Accession Number
0,,,,,,,,,,,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087821.0,,,,05/11/2023,WOS:000231034800010
1,,,,,,,,,,,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,952VW,16087822.0,,,,05/11/2023,WOS:000231034800011
2,,,,,,,,,,,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,974FN,16217005.0,,,,05/11/2023,WOS:000232576800013
3,,,,,,,,,,,...,Ophthalmology,Science Citation Index Expanded (SCI-EXPANDED),Ophthalmology,886RL,15642807.0,,,,05/11/2023,WOS:000226245900002
4,,,,,,,,,,,...,Pediatrics,Science Citation Index Expanded (SCI-EXPANDED),Pediatrics,014ZL,22911396.0,,,,05/11/2023,WOS:000309414400018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12853,,,,,,,,,,,...,Cardiac & Cardiovascular Systems; Respiratory ...,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology; Respirator...,EC910,2214832.0,,,,05/11/2023,WOS:A1990EC91000010
12854,,,,,,,,,,,...,Cardiac & Cardiovascular Systems; Respiratory ...,Science Citation Index Expanded (SCI-EXPANDED),Cardiovascular System & Cardiology; Respirator...,AA700,2786116.0,,,,05/11/2023,WOS:A1989AA70000004
12855,,,,,,,,,,,...,"Medicine, General & Internal",Science Citation Index Expanded (SCI-EXPANDED),General & Internal Medicine,D1812,2427448.0,,,,05/11/2023,WOS:A1986D181200006
12856,,,,,,,,,,,...,Pediatrics,Science Citation Index Expanded (SCI-EXPANDED),Pediatrics,D4187,3526270.0,,,,05/11/2023,WOS:A1986D418700025
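
Scanning the right-join output for empty RW columns is hard to do by eye. A minimal sketch of a quicker audit, assuming the same join keys, is to pass `indicator=True` to `merge`, which adds a `_merge` column labelling each row as `both`, `left_only`, or `right_only`. The toy frames below are invented for illustration and are not the real `rw` / `wos_rd` data.

```python
import pandas as pd

# Toy stand-ins for rw and wos_rd (invented rows, for illustration only).
left = pd.DataFrame({
    "Record ID": [10, 11],
    "Title": ["Alpha study", "Beta study"],
})
right = pd.DataFrame({
    "Document Title": ["Alpha study", "Gamma study", "Delta study"],
    "Accession Number": ["WOS:DEMO1", "WOS:DEMO2", "WOS:DEMO3"],
})

# Same right join as above, with a per-row label of where the row came from.
audit = left.merge(
    right,
    how="right",
    left_on="Title",
    right_on="Document Title",
    indicator=True,
)

# Count matched vs. unmatched rows.
match_counts = audit["_merge"].value_counts()

# List the right-hand records that found no partner, for manual inspection.
unmatched = audit.loc[audit["_merge"] == "right_only", ["Document Title", "Accession Number"]]

print(match_counts)
print(unmatched)
```

On the real data, the `both` count gives the number of WoS records matched to a Retraction Watch record, and `unmatched` is a starting point for checking why the remaining titles differ.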
