The purpose of this notebook is to combine data from the different sources into the following datasets:
- *retractions*: joins data from the Retraction Watch database (RW) with bibliometric data on retractions from Web of Science (RD).
- *retracted_in_journals*: joins data from the Retraction Watch database (RW), journal metrics from SCImago, and bibliometric data on all articles in the best-ranked journals (JD).
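
As a rough illustration of the kind of join described above, the sketch below merges two toy frames on the DOI. It assumes the DOI is the join key (`OriginalPaperDOI` on the Retraction Watch side, `doi` on the Web of Science side); the frames `rwd_toy` and `wos_toy` and all their values are invented for the example and are not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the two sources; every value here is invented.
rwd_toy = pd.DataFrame({
    "OriginalPaperDOI": ["10.1000/a", "10.1000/b", "10.1000/c"],
    "RetractionNature": ["Retraction", "Retraction", "Retraction"],
})
wos_toy = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/c", "10.1000/d"],
    "total_times_cited_count": [12, 3, 40],
})

# A left join keeps every Retraction Watch record; indicator=True adds a
# '_merge' column that says whether a Web of Science match was found.
retractions_toy = rwd_toy.merge(
    wos_toy,
    how="left",
    left_on="OriginalPaperDOI",
    right_on="doi",
    indicator=True,
)
matched = int((retractions_toy["_merge"] == "both").sum())
print(retractions_toy)
print("records matched in both sources:", matched)
```

The `_merge` column makes it easy to count how many retraction records have no bibliometric counterpart before deciding whether to keep them.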


* [Chapter 0 - Libraries](#chapter0)
* [Chapter 1 - Individual Analysis](#chapter1)
    * [1.1 - Retraction Watch Database (RWD)](#section_1_1)
    * [1.2 - Control Set (CS)](#section_1_2)
    * [1.3 - Citation Data (CIT)](#section_1_3)
    * [1.4 - Corrections Data (COR)](#section_1_4)
* [Chapter 2 - Merge Data](#chapter2)   
    * [2.1 - Retraction Data (RD)](#section_2_1)

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 0 - Libraries <a class="anchor" id="chapter0"></a>

In [211]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
#import matplotlib.pyplot as plt
#import seaborn as sns
#import plotly.express as px

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 1 - Individual Analysis <a class="anchor" id="chapter1"></a>

<a class="anchor"> 

## 1.1 - Retraction Watch Database (RWD) <a class="anchor" id="section_1_1"></a>

In [212]:
rwd = pd.read_excel('./retractions_data/retraction_watch_database.xlsx', dtype={'RetractionPubMedID': object, 'OriginalPaperPubMedID': object})
rwd.head()

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
0,47271,Binding of DCC by Netrin-1 to Mediate Axon Gui...,(BLS) Biology - Cellular;(BLS) Biology - Gener...,Departments of Anatomy and of Biochemistry and...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Yimin Zou;Mu-ming Poo;Marc Tessier-...,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1521,0,2001-03-09 00:00:00,10.1126/science.1059391,11239160,Retraction,+Investigation by Company/Institution;+Manipul...,No,
1,47270,Hierarchical Organization of Guidance Receptor...,(BLS) Biochemistry;(BLS) Biology - General;(BL...,Department of Anatomy and Department of Bioche...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Marc Tessier-Lavigne,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1517,0,2001-02-08 00:00:00,10.1126/science.1058445,11239147,Retraction,+Duplication of Image;+Investigation by Compan...,No,
2,47243,Therapeutic potential of targeting IRES-depend...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Division of Hematology-Oncology, UCLA-Greater ...",Oncogene,Springer - Nature Publishing Group,United States,Y Shi;Y Yang;C Bardeleben;B Holmes;J Gera;Alan...,,Research Article;,2023-08-31 00:00:00,10.1038/s41388-023-02820-5,0,2015-05-11 00:00:00,10.1038/onc.2015.156,25961916,Retraction,+Concerns/Issues About Data;+Concerns/Issues A...,No,see also: https://pubpeer.com/publications/704...
3,47233,A classifier based on 273 urinary peptides pre...,(BLS) Biochemistry;(HSC) Medicine - Cardiovasc...,"Department of Nephrology, The Third Affiliated...",Journal of Hypertension,Wolters Kluwer - Lippincott Williams & Wilkins,China,Lirong Lin;Chunxuan Wang;Jiangwen Ren;Mei Mei;...,,Research Article;,2023-08-30 00:00:00,10.1097/HJH.0000000000003551,37642599,2023-08-01 00:00:00,10.1097/HJH.0000000000003467,37199562,Retraction,+Concerns/Issues About Results;+Investigation ...,No,see also https://journals.lww.com/jhypertensio...
4,47227,"Age, Gender Demographics and Comorbidity Preva...",(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedics, Dhanalakshmi Srini...",Journal of Coastal Life Medicine,Journal of Coastal Life Medicine,India,S Venkatesh Kumar;Mohith Singh;Gowtham Singh;K...,,Research Article;,2023-08-30 00:00:00,unavailable,0,2023-01-01 00:00:00,unavailable,0,Retraction,+Notice - Lack of;+Withdrawal;,No,"date of retraction unknown, article title repl..."


In [213]:
rwd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42700 entries, 0 to 42699
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Record ID              42700 non-null  int64 
 1   Title                  42700 non-null  object
 2   Subject                42700 non-null  object
 3   Institution            42699 non-null  object
 4   Journal                42700 non-null  object
 5   Publisher              42700 non-null  object
 6   Country                42700 non-null  object
 7   Author                 42700 non-null  object
 8   URLS                   21687 non-null  object
 9   ArticleType            42700 non-null  object
 10  RetractionDate         42700 non-null  object
 11  RetractionDOI          42209 non-null  object
 12  RetractionPubMedID     37599 non-null  object
 13  OriginalPaperDate      42700 non-null  object
 14  OriginalPaperDOI       40173 non-null  object
 15  OriginalPaperPubMed

In [214]:
# put date variables in correct format
rwd['RetractionDate'] = pd.to_datetime(rwd['RetractionDate'], errors='coerce') #, infer_datetime_format=True
rwd['OriginalPaperDate'] = pd.to_datetime(rwd['OriginalPaperDate'])

In [215]:
# Drop rows with a missing 'RetractionDOI'
rwd_filtered = rwd.dropna(subset=['RetractionDOI'])

# List DOIs that still carry the "http://dx.doi.org/" URL prefix
rwd_filtered[rwd_filtered['RetractionDOI'].str.startswith("http://dx.doi.org/")]['RetractionDOI']

Series([], Name: RetractionDOI, dtype: object)

### Duplicates

In theory, there should only be one DOI per article, and each retracted paper should only have one record in the database. This means that all DOIs should be unique.
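
One way to test this expectation is to count missing, placeholder, and repeated DOI values separately, so that placeholders such as "Unavailable" do not dominate the duplicate count. The helper `doi_uniqueness_report` below is a hypothetical sketch, not part of the notebook, and the toy values are invented.

```python
import pandas as pd

def doi_uniqueness_report(dois, placeholders=("unavailable",)):
    """Count missing, placeholder, and repeated DOI values (hypothetical helper)."""
    cleaned = dois.astype("string").str.strip()
    is_missing = cleaned.isna()
    # isin() returns False for missing values, so the two masks do not overlap
    is_placeholder = cleaned.str.lower().isin(list(placeholders))
    real = cleaned[~is_missing & ~is_placeholder]
    return {
        "missing": int(is_missing.sum()),
        "placeholder": int(is_placeholder.sum()),
        "repeated": int(real.duplicated(keep=False).sum()),
    }

# Invented example values
toy_dois = pd.Series(
    ["10.1000/x", "10.1000/x", "Unavailable", "unavailable", None, "10.1000/y"]
)
report = doi_uniqueness_report(toy_dois)
print(report)
```

Applied to `rwd['OriginalPaperDOI']`, a non-zero `repeated` count would point at the records that need manual review.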

In [216]:
rwd['OriginalPaperDOI'].nunique()

36846

In [217]:
rwd['RetractionDOI'].nunique()

36506

In [218]:
testing_dupes = rwd[rwd.duplicated(subset='OriginalPaperDOI', keep=False)]
testing_dupes['OriginalPaperDOI'].value_counts()

OriginalPaperDOI
Unavailable                        2234
unavailable                        1074
10.1136/jim-2021-SRMC                 6
10.1002/tox.21941                     2
10.1016/j.lfs.2019.116709             2
10.1038/s41598-021-03765-z            2
10.1007/s12275-012-2294-z             2
10.1016/j.cej.2011.04.016             2
10.1016/j.swevo.2021.100868           2
10.1016/j.esxm.2021.100447            2
10.1093/jge/aabc74                    2
10.1088/1742-2140/aaaf57              2
10.1088/1742-2140/aa953a              2
10.1016/j.carbpol.2019.115799         2
10.1001/archpediatrics.2012.999       2
10.1007/s13277-014-2995-5             2
10.3109/02699052.2016.1162060         2
10.1016/j.rapm.2005.05.009            2
10.1524/9783486834062.275             2
Name: count, dtype: int64

In [219]:
filtered_dupes = testing_dupes[testing_dupes['OriginalPaperDOI'].str.lower() != 'unavailable'].sort_values('OriginalPaperDOI')
filtered_dupes

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
21229,14089,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;Retracted Article;,2017-10-20,10.1001/jamapediatrics.2017.4603,0,2012-10-01,10.1001/archpediatrics.2012.999,22911396,Retraction,+Breach of Policy by Author;+Error in Data;+Er...,No,Journal previously named Archives of Pediatric...
21230,11994,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;,2017-09-21,10.1001/jamapediatrics.2017.3136,28973133,2012-10-01,10.1001/archpediatrics.2012.999,22911396,Retraction,+Error in Analyses;+Error in Data;+Error in Me...,No,note: the paper was retracted again on October...
5685,38940,"Erratum to: Î±,Î²-Unsaturated aldehyde polluta...",(BLS) Biology - Cellular;(BLS) Toxicology;,"Department of Clinical Immunology, Xijing Hosp...",Environmental Toxicology,Wiley,China;United States,Zhenbiao Wu;Emily Y He;Glenda I Scott;Jun Ren,https://retractionwatch.com/2022/07/25/univers...,Correction/Erratum/Corrigendum;,2022-07-27,10.1002/tox.23620,35894684,2021-09-12,10.1002/tox.21941,34514704,Retraction,+Updated to Retraction;,No,
5696,38375,"Î±,Î²-Unsaturated aldehyde pollutant acrolein ...",(BLS) Biology - Cellular;(BLS) Toxicology;,"Department of Clinical Immunology, Xijing Hosp...",Environmental Toxicology,Wiley,China;United States,Zhenbiao Wu;Emily Y He;Glenda I Scott;Jun Ren,https://retractionwatch.com/2022/07/25/univers...,Research Article;,2022-07-27,10.1002/tox.23620,35894684,2013-12-23,10.1002/tox.21941,24376112,Retraction,+Falsification/Fabrication of Image;+Investiga...,No,
6643,37342,Identification of the Vibrio vulnificus htpG G...,(BLS) Genetics;(BLS) Microbiology;,"Department of Agricultural Biotechnology, Seou...",Journal of Microbiology,Springer,South Korea,Slae Choi;Kyungku Jang;Seulah Choi;Hee Jee Yun...,,Research Article;,2022-05-23,10.1007/s12275-022-1680-4,35606641,2012-08-25,10.1007/s12275-012-2294-z,22923124,Retraction,+Concerns/Issues About Authorship;+Upgrade/Upd...,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42578,4780,Phenylephrine stress in the evaluation of pati...,(HSC) Medicine - Cardiology;(HSC) Medicine - P...,"Department of Radiology, University of Califor...",Investigative Radiology,Wolters Kluwer,United States,Robert A Slutsky,http://retractionwatch.com/the-retraction-watc...,Research Article;,1986-02-01,,3514538,1983-03-01,,6345451,Retraction,+Concerns/Issues About Results;+Legal Reasons/...,No,
42579,4781,Thallium pulmonary scintigraphy. Relationship ...,(HSC) Medicine - Cardiology;(HSC) Medicine - P...,"Department of Radiology, University of Califor...",Investigative Radiology,Wolters Kluwer,United States,Robert A Slutsky,http://retractionwatch.com/the-retraction-watc...,Research Article;,1986-02-01,,3514538,1984-11-01,,6392156,Retraction,+Concerns/Issues About Results;+Legal Reasons/...,No,"Article is Nov/Dec 1984 (vol. 19, iss. 6, no d..."
42611,1494,Specific antigen exclusion and non-specific fa...,(BLS) Biology - Molecular;,"Department of Immunology, Institute of Child H...",Clinical and Experimental Immunology,Blackwell Publishing,United Kingdom,S A Roberts;M C Reinhardt;R Paganelli;R J Levi...,,Research Article;,1985-01-01,,3882286,1981-07-01,,6171369,Retraction,+Error in Analyses;+Results Not Reproducible;+...,No,No DOI for Original/Notice 3/24/2017;
42613,4249,Concurrent measurement of plasma levels of vit...,(HSC) Medicine - Cardiovascular;(HSC) Medicine...,Endocrinology-Mineral Metabolism and Nephrolog...,Translational Research: The Journal of Laborat...,Elsevier,United States,PW Lambert;PB DeOreo;BW Hollis;IY Fu;DJ Ginsbe...,,Research Article;,1984-10-01,,6384395,1981-10-01,,6270222,Retraction,+Notice - Unable to Access via current resources;,Unknown,Journal formerly known as: The Journal of Labo...


In [220]:
def find_changed_columns(group):
    changed_cols = group.apply(lambda x: x.nunique()).drop(['OriginalPaperDOI', 'Record ID'])
    value = changed_cols[changed_cols>1].index.to_list()
    return value


filtered_dupes.groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index()

Unnamed: 0,OriginalPaperDOI,0
0,10.1001/archpediatrics.2012.999,"[ArticleType, RetractionDate, RetractionDOI, R..."
1,10.1002/tox.21941,"[Title, ArticleType, OriginalPaperDate, Origin..."
2,10.1007/s12275-012-2294-z,"[RetractionDate, RetractionDOI, RetractionPubM..."
3,10.1007/s13277-014-2995-5,"[ArticleType, RetractionDate, RetractionDOI, R..."
4,10.1016/j.carbpol.2019.115799,"[RetractionDate, RetractionDOI, RetractionPubM..."
5,10.1016/j.cej.2011.04.016,"[RetractionDate, RetractionDOI, Notes]"
6,10.1016/j.esxm.2021.100447,"[RetractionDate, RetractionDOI, RetractionPubM..."
7,10.1016/j.lfs.2019.116709,"[Subject, RetractionDate, RetractionDOI, Retra..."
8,10.1016/j.rapm.2005.05.009,"[RetractionDate, RetractionDOI, RetractionPubM..."
9,10.1016/j.swevo.2021.100868,"[RetractionDate, RetractionDOI, Notes]"


In [221]:
# The following code is commented out so as not to overwrite the changes made in the file
# with pd.ExcelWriter('./wos_rd/DOI_Duplicated_RWD.xlsx') as writer:
#     filtered_dupes.groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index().to_excel(writer, sheet_name= "Differing vars",index = False)
#     filtered_dupes.to_excel(writer, sheet_name = "Duplicate records",index = False)

In [222]:
rwd[rwd['OriginalPaperDOI']=='10.3109/02699052.2016.1162060']

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
23288,8429,Are rehabilitation outcomes after anoxic brain...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,Unavailable,Brain Injury,Taylor and Francis,Netherlands;Unknown,Emre Adiguzel;Evren Yasar;Yasin Demir;Ismail S...,,Conference Abstract/Paper;,2016-08-17,10.1080/02699052.2016.1210325,27533125,2016-05-19,10.3109/02699052.2016.1162060,27196965,Retraction,+Notice - Limited or No Information;,No,Part of Accepted Abstracts from the Internatio...
23290,6000,The effect of demographic and clinical charact...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,Unavailable,Brain Injury,Taylor and Francis,Netherlands;Unknown,Evren Yasar;Serdar Kesikburun;Ummugulsum Dogan...,,Conference Abstract/Paper;,2016-08-17,10.1080/02699052.2016.1210325,27533125,2016-05-19,10.3109/02699052.2016.1162060,27196965,Retraction,+Notice - Limited or No Information;,No,


In [223]:
# Reuse find_changed_columns (defined above) on the two records that share this DOI
filtered_dupes[filtered_dupes['Record ID'].isin([8429, 6000])].groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index()

Unnamed: 0,OriginalPaperDOI,0
0,10.3109/02699052.2016.1162060,"[Title, Author]"


In [224]:
pd.set_option('display.max_colwidth', None)  # Show full content of columns
pd.set_option('display.max_rows', None)      # Display all rows

In [225]:
filtered_dupes[(filtered_dupes['Record ID'] == 30679) | (filtered_dupes['Record ID'] == 30686)]['Title']

11353    Acute kidney injury and collapsing glomerulopathy associated with COVID-19 and APOL1 high risk genotype Abstract 621
11352    Acute kidney injury and collapsing glomerulopathy associated with COVID-19 and APOL1 high risk genotype Abstract 111
Name: Title, dtype: object

In [226]:
filtered_dupes[(filtered_dupes['Record ID'] == 30687) | (filtered_dupes['Record ID'] == 30691)]['Title']

11365    Filter clotting, anticoagulation and duration of sled in patients with COVID-19 and acute kidney injury Abstract 643
11364    Filter clotting, anticoagulation and duration of sled in patients with COVID-19 and acute kidney injury Abstract 112
Name: Title, dtype: object

In [227]:
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [228]:
# records that should be deleted
records_to_delete = [6000, 2175, 7242]
rwd = rwd[~rwd['Record ID'].isin(records_to_delete)]

In [229]:
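# Sort by DOI (ascending) and original publication date (most recent first)
rwd.sort_values(by=['OriginalPaperDOI', 'OriginalPaperDate'], ascending=[True, False], inplace=True)

# Keep only the first occurrence of each DOI (the most recent date).
# Note: rows whose DOI is missing, 'Unavailable' or 'unavailable' also collapse to one row per value.
filtered_rwd = rwd.drop_duplicates(subset='OriginalPaperDOI')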
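
Because `drop_duplicates` treats every missing or placeholder DOI as the same key, a variant that deduplicates only rows with a real DOI may be preferable. The sketch below shows one possible approach; the helper `drop_duplicate_dois` is hypothetical, the toy records are invented, and treating "unavailable" as the only placeholder is an assumption based on the value counts shown earlier.

```python
import pandas as pd

def drop_duplicate_dois(df, doi_col="OriginalPaperDOI", date_col="OriginalPaperDate"):
    """Deduplicate on DOI but leave rows without a usable DOI untouched (hypothetical helper)."""
    doi = df[doi_col].astype("string").str.strip().str.lower()
    has_real_doi = doi.notna() & ~doi.isin(["unavailable"])
    real = (
        df[has_real_doi]
        .sort_values([doi_col, date_col], ascending=[True, False])
        .drop_duplicates(subset=doi_col, keep="first")  # most recent date wins
    )
    # Rows with a missing or placeholder DOI are appended unchanged
    return pd.concat([real, df[~has_real_doi]]).sort_index()

# Invented example records
toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4, 5],
    "OriginalPaperDOI": ["10.1000/a", "10.1000/a", "Unavailable", "unavailable", None],
    "OriginalPaperDate": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2010-01-01", "2011-01-01", "2009-01-01"]
    ),
})
deduped = drop_duplicate_dois(toy)
print(deduped)
```

Only the older of the two `10.1000/a` rows is dropped; the three rows without a usable DOI all survive.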

### Title analysis

In [230]:
rwd[rwd['Title'].str.startswith("Retracted:")]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
13359,45110,Retracted: Lifting the lid on lobbying in Indi...,(B/T) Government;,"Department of Management Studies, Indian Insti...",Journal of Public Affairs,Wiley,India,Pankaj K P Shreyaskar;Pramod Pathak,,Research Article;,2020-09-21,10.1002/pa.2423,0,2020-09-21,10.1002/pa.2423,0,Retraction,+Date of Retraction/Other Unknown;+Euphemisms ...,No,
1326,46570,Retracted: miR-214-3p Protects and Restores th...,(BLS) Biology - Molecular;(BLS) Genetics;(HSC)...,Key Laboratory of Advanced Technologies of Mat...,Evidence-Based Complementary and Alternative M...,Hindawi,China,Yuan Cheng;Qing He;Tao Jin;Na Li,https://retractionwatch.com/2022/09/28/exclusi...,Research Article;,2023-06-21,10.1155/2023/9823451,37388114,2022-07-18,10.1155/2022/1175935,35899226,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,No,See also: https://pubpeer.com/publications/C08...


In [231]:
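# Remove the "Retracted:" marker and the space it leaves at the start of the title
rwd['Title'] = rwd['Title'].str.replace('Retracted:', '', regex=False).str.lstrip()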

In [232]:
rwd[rwd['Title'].str.startswith("Retracted:")]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes


In [233]:
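# Note: .iloc is positional, so after the sort above it no longer returns the rows
# labelled 1326 and 13359; use rwd.loc[[1326, 13359]] to inspect the cleaned titles.
rwd.iloc[[1326,13359]]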

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
12550,25797,LncRNA ATB promotes proliferation and metastas...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Department of Respiratory Medicine, The Affili...",Journal of Cellular Biochemistry,Wiley,China,Yiwei Cao;Xiangjun Luo;Xiaoqian Ding;Shichao C...,http://retractionwatch.com/2021/03/08/journal-...,Research Article;,2020-12-15,10.1002/jcb.29877,33590514,2018-04-25,10.1002/jcb.26894,29693289,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,No,see also: https://pubpeer.com/publications/B2B...
11996,44387,Preparation of self-healing anti-corrosion coa...,(PHY) Engineering - Chemical;(PHY) Materials S...,"Department of Materials Engineering, Isfahan U...",Surface Engineering,Taylor and Francis,Iran,Sogand Abbaspour;Ali Ashrafi;Mehdi Salehi,,Research Article;,2021-02-01,10.1080/02670844.2021.1883242,0,2019-11-21,10.1080/02670844.2019.1689641,0,Retraction,+Concerns/Issues About Image;+Concerns/Issues ...,No,


In [234]:
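# regex=False matches the parentheses literally instead of treating them as a regex group
rwd[rwd['Title'].str.contains("(Withdrawn Publication)", regex=False)]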


Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
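
The difference between literal and regex matching for this marker can be checked on a few toy titles. This is a minimal sketch with invented titles; the non-capturing group `(?:...)` stands in for what the parenthesised pattern matches when it is interpreted as a regular expression.

```python
import pandas as pd

# Invented titles for illustration
titles = pd.Series([
    "AN EXAMPLE TITLE (WITHDRAWN PUBLICATION)",
    "Another example (Withdrawn Publication)",
    "Withdrawn Publication notice without parentheses",
    "An unrelated title",
])

# Literal match: the parentheses must be present and the case must agree
literal = titles.str.contains("(Withdrawn Publication)", regex=False)

# Literal match that ignores case
literal_any_case = titles.str.contains("(Withdrawn Publication)", regex=False, case=False)

# As a regex, the parentheses only group the words, so the bare phrase also matches
as_regex = titles.str.contains(r"(?:Withdrawn Publication)", regex=True)

print(pd.DataFrame({"literal": literal, "any_case": literal_any_case, "regex": as_regex}))
```

For upper-case titles, such as those exported by Web of Science, only the case-insensitive variant finds the marker.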


### Title normalization

In [235]:
# Remove spaces at the end of the string
rwd['Title'] = rwd['Title'].str.rstrip()

#lower case string
rwd['Title'] = rwd['Title'].str.lower()
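
The title-cleaning steps of this section could also be gathered into one function, which makes it easier to apply the same normalization to titles from other sources. The helper `normalize_title` below is a hypothetical sketch; matching the "Retracted:" prefix case-insensitively and collapsing repeated whitespace are assumptions that go beyond what the cells above do.

```python
import pandas as pd

def normalize_title(titles):
    """Strip a leading 'Retracted:' marker, tidy whitespace, and lower-case (hypothetical helper)."""
    return (
        titles.str.replace(r"^\s*retracted:\s*", "", regex=True, case=False)
        .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
        .str.strip()
        .str.lower()
    )

# Invented example titles
toy_titles = pd.Series([
    "Retracted: Lifting the lid ",
    "RETRACTED:  Some   Title",
    "Plain title",
])
normalized = normalize_title(toy_titles)
print(normalized.tolist())
```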

In [236]:
rwd.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42697 entries, 29155 to 42695
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Record ID              42697 non-null  int64         
 1   Title                  42697 non-null  object        
 2   Subject                42697 non-null  object        
 3   Institution            42696 non-null  object        
 4   Journal                42697 non-null  object        
 5   Publisher              42697 non-null  object        
 6   Country                42697 non-null  object        
 7   Author                 42697 non-null  object        
 8   URLS                   21686 non-null  object        
 9   ArticleType            42697 non-null  object        
 10  RetractionDate         42697 non-null  datetime64[ns]
 11  RetractionDOI          42206 non-null  object        
 12  RetractionPubMedID     37596 non-null  object        
 13  Or

<a class="anchor"> 

## 1.2 - Web of Science Data <a class="anchor" id="section_1_2"></a>

In [237]:
import pyarrow.parquet as pq

In [238]:
wos_dois_imported = pq.read_table("../thesis_data/processed_data/WoS_RWD_DOIS.parquet").to_pandas()

In [239]:
wos_dois_imported.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21054 entries, 0 to 21053
Data columns (total 73 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   AU        21054 non-null  object 
 1   AF        21051 non-null  object 
 2   CR        20943 non-null  object 
 3   AB        19481 non-null  object 
 4   AR        6038 non-null   object 
 5   BE        161 non-null    object 
 6   BN        110 non-null    object 
 7   BP        14782 non-null  object 
 8   C1        20491 non-null  object 
 9   C3        18706 non-null  object 
 10  CA        54 non-null     object 
 11  CL        403 non-null    object 
 12  CT        403 non-null    object 
 13  CY        403 non-null    object 
 14  DA        21054 non-null  object 
 15  DE        13002 non-null  object 
 16  DI        21054 non-null  object 
 17  DT        21054 non-null  object 
 18  EA        2257 non-null   object 
 19  EF        724 non-null    object 
 20  EI        17538 non-null  ob

In [240]:
wos_dois = wos_dois_imported.copy()

In [241]:
wos_dois.filter(like='X.').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21054 entries, 0 to 21053
Empty DataFrame


In [242]:
wos_dois.columns

Index(['AU', 'AF', 'CR', 'AB', 'AR', 'BE', 'BN', 'BP', 'C1', 'C3', 'CA', 'CL',
       'CT', 'CY', 'DA', 'DE', 'DI', 'DT', 'EA', 'EF', 'EI', 'EM', 'EP', 'ER',
       'FU', 'FX', 'GA', 'GP', 'HC', 'HO', 'HP', 'ID', 'IS', 'J9', 'JI', 'LA',
       'MA', 'NR', 'OA', 'OI', 'PA', 'PD', 'PG', 'PI', 'PM', 'PN', 'PT', 'PU',
       'PY', 'RI', 'RP', 'SC', 'SE', 'SI', 'SN', 'SO', 'SP', 'SU', 'TC', 'TI',
       'U1', 'U2', 'UT', 'VL', 'WC', 'WE', 'Z9', 'DB', 'AU_UN', 'AU1_UN',
       'AU_UN_NR', 'SR_FULL', 'SR'],
      dtype='object')

In [243]:
rename_columns = {
    'AU': "authors", 
    'AF': "author_fullnames", #
    'CR': "cited_references", 
    'AB': "abstract", 
    'AR': "article_number", 
    'BE': "editors", 
    'BN': "isbn", 
    'BP': "beginning_page", #
    'C1': "author_address", 
    'C3': "author_institution", ###
    'CA': "group_authors", ##
    'CL': "conference_location", #
    'CT': "conference_title", #
    'CY': "conference_date", #
    'DA': "date_report_generated", 
    'DE': "author_keywords", 
    'DI': "doi", 
    'DT': "document_type", 
    'EA': "early_access_date", #
    'EF': "ef", #
    'EI': "eissn", #
    'EM':"email_address", 
    'EP': "ending_page", #
    'ER': "end_of_record", #
    'FU': "funding_agency_and_grant_number", 
    'FX': "funding_text", 
    'GA': "document_delivery_number", 
    'GP': "book_group_authors", ##
    'HC': "esi_highly_cited", ##
    'HO': "conference_host", #
    'HP': "esi_hot_paper", ##
    'ID': "keywords_plus", 
    'IS': "issue", #
    'J9': "29_character_source_abv", 
    'JI': "iso_source_abv", 
    'LA': "language", 
    'MA': "meeting_abstract", ##
    'NR': "cited_reference_count", 
    'OA': "open_access_indicator", 
    'OI': "orcid", #
    'PA': "publisher_address", 
    'PD': "publication_date", #
    'PG': "page_count", #
    'PI': "publisher_city", #
    'PM': "pubmed_id", #
    'PN': "part_number",
    'PT': "publication_type", # (J=Journal; B=Book; S=Series; P=Patent)
    'PU': "publisher", 
    'PY': "year_published",  
    'RI': "researcher_id_nr", ##
    'RP': "reprint_address", 
    'SC': "research_areas",
    'SE': "book_series_title", 
    'SI': "special_issue", #
    'SN': "issn", 
    'SO': "publication_name", 
    'SP': "conference_sponsors", #
    'SU': "supplement", #
    'TC': "wos_core_collection_times_cited_count", 
    'TI': "document_title", 
    'U1': "usage_count_last_180_days", #
    'U2': "usage_count_since_2013", 
    'UT': "accession_number",
    'VL': "volume", 
    'WC': "wos_categories", #
    'WE': "we", #
    'Z9': "total_times_cited_count", #
    'DB': "database", 
    'AU_UN': "authors_affiliations", 
    'AU1_UN': "corresponding_author_affiliation",
    'AU_UN_NR': "not_recognized_affiliations", 
    'SR_FULL': "short_full_reference", 
    'SR': "short_reference"

}

wos_dois.rename(columns = rename_columns, inplace = True)
wos_dois.columns

Index(['authors', 'author_fullnames', 'cited_references', 'abstract',
       'article_number', 'editors', 'isbn', 'beginning_page', 'author_address',
       'author_institution', 'group_authors', 'conference_location',
       'conference_title', 'conference_date', 'date_report_generated',
       'author_keywords', 'doi', 'document_type', 'early_access_date', 'ef',
       'eissn', 'email_address', 'ending_page', 'end_of_record',
       'funding_agency_and_grant_number', 'funding_text',
       'document_delivery_number', 'book_group_authors', 'esi_highly_cited',
       'conference_host', 'esi_hot_paper', 'keywords_plus', 'issue',
       '29_character_source_abv', 'iso_source_abv', 'language',
       'meeting_abstract', 'cited_reference_count', 'open_access_indicator',
       'orcid', 'publisher_address', 'publication_date', 'page_count',
       'publisher_city', 'pubmed_id', 'part_number', 'publication_type',
       'publisher', 'year_published', 'researcher_id_nr', 'reprint_address',
  

In [244]:
variables_to_keep = [
       'authors', 'author_fullnames', 'cited_references', 'abstract', 
       'author_keywords', 'doi', 'document_type', 'funding_agency_and_grant_number', 
       'esi_highly_cited', 'esi_hot_paper', 'keywords_plus', 'language', 'early_access_date',
       'cited_reference_count', 'open_access_indicator', 'orcid',  'publication_date', 
       'page_count', 'pubmed_id', 'publication_type', 'publisher', 'year_published', 'researcher_id_nr', 
       'research_areas', 'issn', 'publication_name', 'wos_core_collection_times_cited_count', 'document_title',
       'wos_categories', 'total_times_cited_count','authors_affiliations', 'corresponding_author_affiliation'
       ]


wos_dois = wos_dois[variables_to_keep]

In [245]:
wos_dois.select_dtypes(include=['object']).columns

Index(['authors', 'author_fullnames', 'cited_references', 'abstract',
       'author_keywords', 'doi', 'document_type',
       'funding_agency_and_grant_number', 'esi_highly_cited', 'esi_hot_paper',
       'keywords_plus', 'language', 'early_access_date',
       'cited_reference_count', 'open_access_indicator', 'orcid',
       'publication_date', 'page_count', 'pubmed_id', 'publication_type',
       'publisher', 'researcher_id_nr', 'research_areas', 'issn',
       'publication_name', 'document_title', 'wos_categories',
       'total_times_cited_count', 'authors_affiliations',
       'corresponding_author_affiliation'],
      dtype='object')

In [246]:
wos_dois['total_times_cited_count'].value_counts()

total_times_cited_count
1      2042
0      1697
2      1459
3      1188
4      1014
       ... 
309       1
459       1
405       1
152       1
204       1
Name: count, Length: 425, dtype: int64

In [247]:
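# Nullable integer dtype ('Int64') keeps missing values without casting the column to float
new_data_types = {'cited_reference_count': 'Int64',
                  'page_count': 'Int64',
                  'total_times_cited_count': 'Int64',
                  'year_published': 'Int64'}

# Apply the dtype declared for each column in the mapping above
for col, dtype in new_data_types.items():
    wos_dois[col] = pd.array(wos_dois[col], dtype=dtype)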

wos_dois['early_access_date'] = pd.to_datetime(wos_dois['early_access_date'], format='%b %Y')
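
If the count columns arrive as strings, an alternative is to go through `pd.to_numeric` before casting to the nullable `Int64` dtype. The sketch below uses an invented frame `raw`; whether the real export stores these columns as strings is an assumption.

```python
import pandas as pd

# Invented example: counts stored as strings, with one missing value
raw = pd.DataFrame({
    "page_count": ["7", "12", None],
    "year_published": [2011.0, 2017.0, 2020.0],
})

for col in ["page_count", "year_published"]:
    # to_numeric parses strings and turns None into NaN; errors="raise" stops on bad input
    raw[col] = pd.to_numeric(raw[col], errors="raise").astype("Int64")

print(raw.dtypes)
print(raw)
```

The missing page count stays missing (`<NA>`) while the other values become true integers.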

In [248]:
wos_dois.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21054 entries, 0 to 21053
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   authors                                21054 non-null  object        
 1   author_fullnames                       21051 non-null  object        
 2   cited_references                       20943 non-null  object        
 3   abstract                               19481 non-null  object        
 4   author_keywords                        13002 non-null  object        
 5   doi                                    21054 non-null  object        
 6   document_type                          21054 non-null  object        
 7   funding_agency_and_grant_number        9005 non-null   object        
 8   esi_highly_cited                       47 non-null     object        
 9   esi_hot_paper                          47 non-null     object

In [249]:
wos_dois = wos_dois[wos_dois['year_published']<2023]

In [250]:
wos_dois.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
early_access_date,2027.0,2020-11-06 01:52:14.681795840,2017-08-01 00:00:00,2020-04-01 00:00:00,2020-11-01 00:00:00,2021-06-01 00:00:00,2024-01-01 00:00:00,
cited_reference_count,20727.0,36.205288,0.0,22.0,32.0,45.0,912.0,27.119912
page_count,20727.0,10.17523,0.0,7.0,9.0,12.0,572.0,9.837019
year_published,20727.0,2014.803155,1940.0,2011.0,2017.0,2020.0,2022.0,6.714835
wos_core_collection_times_cited_count,20727.0,23.465528,0.0,2.0,8.0,23.0,2540.0,61.874001
total_times_cited_count,20727.0,26.08665,0.0,3.0,9.0,25.0,2646.0,69.575116


In [251]:
wos_dois[wos_dois.duplicated()].shape[0]

2

In [252]:
wos_dois.shape

(20727, 32)

In [253]:
# drop_duplicates already treats missing values in the same column as equal,
# so no placeholder value is needed to remove the duplicated rows
wos_dois = wos_dois.drop_duplicates()
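
That pandas already counts rows as duplicates when they have missing values in the same positions can be checked on a small frame. This is a minimal sketch with invented values.

```python
import numpy as np
import pandas as pd

# Invented example: the first two rows are identical, including a missing abstract
df = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/a", "10.1000/b"],
    "abstract": [np.nan, np.nan, "text"],
})

n_dupes = int(df.duplicated().sum())  # NaN in the same column compares as equal here
deduped = df.drop_duplicates()
print(n_dupes, len(deduped))
```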

In [254]:
wos_dois.shape

(20725, 32)

In [255]:
wos_dois['publication_date']

0              SEP 9
1                AUG
3           2021 SEP
4        2021 APR 27
5        2021 APR 15
            ...     
21026          DEC 1
21027       2021 FEB
21049         NOV 23
21051            APR
21052          APR 7
Name: publication_date, Length: 20725, dtype: object
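
The output above mixes several partial formats ('SEP 9', 'AUG', '2021 SEP', '2021 APR 27'). Values that already begin with a year may fail to parse once the publication year is appended a second time, in which case they end up as 1 January. The sketch below is one possible token-based parser; the function `parse_wos_date` and the month table are hypothetical, and it assumes the publication year is always present.

```python
import re
import pandas as pd

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

def parse_wos_date(raw, year):
    """Build a Timestamp from a partial date string and the publication year (hypothetical helper)."""
    month, day = None, None
    if isinstance(raw, str):
        for token in re.findall(r"[A-Za-z]+|\d+", raw.upper()):
            if month is None and token.isalpha() and token[:3] in MONTHS:
                month = MONTHS[token[:3]]
            elif day is None and token.isdigit() and len(token) <= 2:
                day = int(token)  # four-digit tokens (years) are ignored
    try:
        return pd.Timestamp(year=int(year), month=month or 1, day=day or 1)
    except ValueError:
        # Impossible day for the month: keep the month, fall back to day 1
        return pd.Timestamp(year=int(year), month=month or 1, day=1)

examples = [("SEP 9", 2020), ("AUG", 2019), ("2021 SEP", 2021), ("2021 APR 27", 2021), (None, 2015)]
parsed = [parse_wos_date(raw, year) for raw, year in examples]
print(parsed)
```

Missing months and days default to January and the first of the month, which mirrors the fallback used in this notebook.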

In [256]:
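# 'publication_date' holds partial dates such as 'SEP 9' or 'AUG'; default missing values to 1 January
wos_dois['publication_date'] = wos_dois['publication_date'].fillna('JAN 1')

# Append the publication year and parse; strings that cannot be parsed become NaT
wos_dois['publication_date'] = pd.to_datetime(wos_dois['publication_date'] + ' ' + wos_dois['year_published'].astype(str), errors='coerce')
# Fall back to 1 January of the publication year for dates that could not be parsed
wos_dois['publication_date'] = wos_dois['publication_date'].fillna(pd.to_datetime('JAN 01 ' + wos_dois['year_published'].astype(str), format='%b %d %Y'))

# Sort by DOI (ascending) and publication date (most recent first)
wos_dois = wos_dois.sort_values(by=['doi', 'publication_date'], ascending=[True, False])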



In [257]:
# Drop rows with a missing 'doi'
wos_dois_filtered = wos_dois.dropna(subset=['doi'])

# Filter rows starting with "http://dx.doi.org/"
wos_dois_filtered[wos_dois_filtered['doi'].str.startswith("http://dx.doi.org/")]['doi']

Series([], Name: doi, dtype: object)

In [258]:
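# Strip the URL prefix; regex=False keeps the dots in the prefix from acting as regex wildcards
wos_dois['doi'] = wos_dois['doi'].str.replace('http://dx.doi.org/', '', regex=False)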
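
Since DOIs from different sources can differ in prefix and letter case, a shared normalization step may help before the two datasets are joined. The helper `normalize_doi` below is a hypothetical sketch; which prefixes it handles and the lower-casing are assumptions (DOIs are case-insensitive, so lower-casing only affects matching).

```python
import pandas as pd

# Resolver URL or 'doi:' label at the start of the value, matched case-insensitively
DOI_PREFIX = r"(?i)^\s*(?:https?://(?:dx\.)?doi\.org/|doi:\s*)"

def normalize_doi(dois):
    """Strip resolver prefixes and lower-case DOI strings (hypothetical helper)."""
    return dois.str.replace(DOI_PREFIX, "", regex=True).str.strip().str.lower()

toy_dois = pd.Series([
    "http://dx.doi.org/10.1126/Science.1059391",
    "https://doi.org/10.1038/onc.2015.156",
    "doi:10.1002/tox.21941",
    "10.1002/pa.2423 ",
])
clean_dois = normalize_doi(toy_dois)
print(clean_dois.tolist())
```

Applying the same function to `rwd['OriginalPaperDOI']` and `wos_dois['doi']` would also fold "Unavailable" and "unavailable" into a single placeholder value.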

### Title Analysis

In [259]:
wos_dois.columns

Index(['authors', 'author_fullnames', 'cited_references', 'abstract',
       'author_keywords', 'doi', 'document_type',
       'funding_agency_and_grant_number', 'esi_highly_cited', 'esi_hot_paper',
       'keywords_plus', 'language', 'early_access_date',
       'cited_reference_count', 'open_access_indicator', 'orcid',
       'publication_date', 'page_count', 'pubmed_id', 'publication_type',
       'publisher', 'year_published', 'researcher_id_nr', 'research_areas',
       'issn', 'publication_name', 'wos_core_collection_times_cited_count',
       'document_title', 'wos_categories', 'total_times_cited_count',
       'authors_affiliations', 'corresponding_author_affiliation'],
      dtype='object')

In [260]:
# Titles that carry none of the retraction/withdrawal markers.
# regex=False matches the markers literally (parentheses included); matching is case-sensitive.
wos_dois[(~wos_dois['document_title'].str.startswith("RETRACTED: ")) & (~wos_dois['document_title'].str.contains("(Withdrawn Publication)", regex=False)) & (~wos_dois['document_title'].str.contains("Retracted Article. See", regex=False))]


Unnamed: 0,authors,author_fullnames,cited_references,abstract,author_keywords,doi,document_type,funding_agency_and_grant_number,esi_highly_cited,esi_hot_paper,...,researcher_id_nr,research_areas,issn,publication_name,wos_core_collection_times_cited_count,document_title,wos_categories,total_times_cited_count,authors_affiliations,corresponding_author_affiliation
9916,DAS RR;SINGH M,"DAS, RASHMI RANJAN;SINGH, MEENU","EBY GA, 2010, MED HYPOTHESES, V74, P482, DOI 1...",CLINICAL QUESTION: IS ORAL ZINC ASSOCIATED WIT...,,10.1001/jama.2014.1404,EDITORIAL MATERIAL; WITHDRAWN PUBLICATION,,,,...,,GENERAL & INTERNAL MEDICINE,0098-7484,JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION,9.0,ORAL ZINC FOR THE COMMON COLD (WITHDRAWN PUBLI...,"MEDICINE, GENERAL & INTERNAL",10,ALL INDIA INSTITUTE OF MEDICAL SCIENCES (AIIMS...,POSTGRAD INST MED EDUC AND RES
20227,FAVINI N;HOCKENBERRY JM;GILMAN M;JAIN S;ONG MK...,"FAVINI, NATHAN;HOCKENBERRY, JASON M.;GILMAN, M...","CAREY K, 2016, HEALTH AFFAIR, V35, P1918, DOI ...",,,10.1001/jama.2017.1469,LETTER,ROBERT WOOD JOHNSON FOUNDATION,,,...,"ONG, MICHAEL K/F-7397-2013; ONG, MICHAEL/AAR-3...",GENERAL & INTERNAL MEDICINE,0098-7484,JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION,18.0,COMPARATIVE TRENDS IN PAYMENT ADJUSTMENTS BETW...,"MEDICINE, GENERAL & INTERNAL",19,UNIVERSITY OF CALIFORNIA SYSTEM; UNIVERSITY OF...,EMORY UNIV
8143,ROSCHER I;FALK RS;VOS L;CLAUSEN OPF;HELSING P;...,"ROSCHER, INGRID;FALK, RAGNHILD S.;VOS, LINDA;C...","ALAM M, 2001, NEW ENGL J MED, V344, P975, DOI ...",IMPORTANCE CUTANEOUS SQUAMOUS CELL CARCINOMA (...,,10.1001/jamadermatol.2017.6428,ARTICLE,OSLO UNIVERSITY HOSPITAL; CANCER REGISTRY OF N...,,,...,"FALK, RAGNHILD SØRUM/H-5613-2019",DERMATOLOGY,2168-6068,JAMA DERMATOLOGY,60.0,VALIDATING 4 STAGING SYSTEMS FOR CUTANEOUS SQU...,DERMATOLOGY,61,UNIVERSITY OF OSLO; UNIVERSITY OF OSLO; UNIVER...,OSLO UNIV HOSP
4139,YOUSAF A;LEE JS;FANG W;KOLODNEY MS,"YOUSAF, AHMED;LEE, JUSTIN;FANG, WEI;KOLODNEY, ...","ANONYMOUS, 2007, UK BIOB RAT DES DEV;BRENNER M...",IMPORTANCE ALOPECIA AREATA (AA) IS A COMPLEX I...,,10.1001/jamadermatol.2021.0144,ARTICLE; EARLY ACCESS,WILLIAMWELTON ENDOWMENT FUND,,,...,,DERMATOLOGY,2168-6068,JAMA DERMATOLOGY,3.0,ASSOCIATION BETWEEN ALOPECIA AREATA AND NATURA...,DERMATOLOGY,3,WEST VIRGINIA UNIVERSITY,WEST VIRGINIA UNIV
10542,HOLLON SD;DERUBEIS RJ;FAWCETT J;AMSTERDAM JD;S...,"HOLLON, STEVEN D.;DERUBEIS, ROBERT J.;FAWCETT,...","BECK A. T., 1979, COGNITIVE THERAPY OF DEPRESS...",IMPORTANCE ANTIDEPRESSANT MEDICATION (ADM) IS ...,,10.1001/jamapsychiatry.2014.1054,ARTICLE,"NATIONAL INSTITUTE OF MENTAL HEALTH [MH60713, ...",,,...,"DERUBEIS, ROBERT J./A-1049-2007",PSYCHIATRY,2168-622X,JAMA PSYCHIATRY,122.0,EFFECT OF COGNITIVE THERAPY WITH ANTIDEPRESSAN...,PSYCHIATRY,138,VANDERBILT UNIVERSITY; UNIVERSITY OF PENNSYLVA...,VANDERBILT UNIV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15421,POEHLMAN ET;TOTH MJ;GARDNER AW,"POEHLMAN, ET;TOTH, MJ;GARDNER, AW","ALOIA JF, 1991, AM J CLIN NUTR, V53, P1378, DO...",OBJECTIVE: TO DESCRIBE THE EFFECTS OF MENOPAUS...,,10.7326/0003-4819-123-9-199511010-00005,NOTE,"NIA NIH HHS [AG-07857, T32-AG00219, KO4-AG0056...",,,...,,GENERAL & INTERNAL MEDICINE,0003-4819,ANNALS OF INTERNAL MEDICINE,403.0,CHANGES IN ENERGY-BALANCE AND BODY-COMPOSITION...,"MEDICINE, GENERAL & INTERNAL",457,UNIVERSITY SYSTEM OF MARYLAND; UNIVERSITY OF M...,VET AFFAIRS MED CTR
20180,ZHANG J;XIA X;LI S;RAN W,"ZHANG, J.;XIA, X.;LI, S.;RAN, W.","ZHANG JC, 2018, PEERJ, V6",,,10.7717/peerj.4267,RETRACTION,,,,...,,SCIENCE & TECHNOLOGY - OTHER TOPICS,2167-8359,PEERJ,1.0,RETRACTION: RESPONSE OF METHANE PRODUCTION VIA...,MULTIDISCIPLINARY SCIENCES,1,,NOTREPORTED
20427,BLUM K;BADGAIYAN RD;GOLD MS,"BLUM, KENNETH;BADGAIYAN, RAJENDRA D.;GOLD, MAR...","ADDAD M, 1989, MED LAW, V8, P611;ALLEN CLIFFOR...",,,10.7759/cureus.290,RETRACTION,"LIFE EXTENSION FOUNDATION, FT. LAUDERDALE, FL....",,,...,"BADGAIYAN, RAJENDRA D/B-8183-2012",GENERAL & INTERNAL MEDICINE,,CUREUS JOURNAL OF MEDICAL SCIENCE,6.0,RETRACTION: HYPERSEXUALITY ADDICTION AND WITHD...,"MEDICINE, GENERAL & INTERNAL",6,,NOTREPORTED
8510,GHANDHARI H;SAFARI MB;AMERI E;KHEIRABADI H;TAB...,"GHANDHARI, HASAN;SAFARI, MIR BAHRAM;AMERI, EBR...","AMERI E, 2017, CLIN SPINE SURG, V30, PE485, DO...",INTRODUCTION: THE SPINAL CORRECTION AND FUSION...,ANTERIOR CORRECTION WITH FUSION SURGERY; COBB ...,10.7860/JCDR/2017/29396.11006,ARTICLE; WITHDRAWN PUBLICATION,,,,...,"SAFARI, MIR BAHRAM/N-5081-2017; GHANDHARI, HAS...",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,1.0,CORRELATION CURVE CORRECTION AND SPINAL LENGTH...,"MEDICINE, GENERAL & INTERNAL",1,IRAN UNIVERSITY OF MEDICAL SCIENCES; URMIA UNI...,URMIA UNIV MED SCI


In [261]:
# Titles that carry at least one of the retraction/withdrawal markers.
# regex=False matches the markers literally (parentheses included); matching is case-sensitive.
wos_dois[(wos_dois['document_title'].str.startswith("RETRACTED: ")) | (wos_dois['document_title'].str.contains("(Withdrawn Publication)", regex=False)) | (wos_dois['document_title'].str.contains("Retracted Article. See", regex=False))]


Unnamed: 0,authors,author_fullnames,cited_references,abstract,author_keywords,doi,document_type,funding_agency_and_grant_number,esi_highly_cited,esi_hot_paper,...,researcher_id_nr,research_areas,issn,publication_name,wos_core_collection_times_cited_count,document_title,wos_categories,total_times_cited_count,authors_affiliations,corresponding_author_affiliation
10235,SATO Y;KANOKO T;SATOH K;IWAMOTO J,"SATO, Y;KANOKO, T;SATOH, K;IWAMOTO, J","ANONYMOUS, COCHRANE DATABASE SY;BISCHOFF-FERRA...","BACKGROUND: A HIGH INCIDENCE OF FRACTURES, PAR...",,10.1001/archinte.165.15.1737,ARTICLE; RETRACTED PUBLICATION,,,,...,,GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,48.0,RETRACTED: THE PREVENTION OF HIP FRACTURE WITH...,"MEDICINE, GENERAL & INTERNAL",52,HIROSAKI UNIVERSITY; HIROSAKI UNIVERSITY; KEIO...,MITATE HOSP
10236,SATO Y;IWAMOTO J;KANOKO T;SATOH K,"SATO, Y;IWAMOTO, J;KANOKO, T;SATOH, K","CHAPUY MC, 1992, NEW ENGL J MED, V327, P1637, ...",BACKGROUND: THERE IS A HIGH INCIDENCE OF HIP F...,,10.1001/archinte.165.15.1743,ARTICLE; RETRACTED PUBLICATION,,,,...,,GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,113.0,RETRACTED: RISEDRONATE SODIUM THERAPY FOR PREV...,"MEDICINE, GENERAL & INTERNAL",119,KEIO UNIVERSITY; HIROSAKI UNIVERSITY; HIROSAKI...,MITATE HOSP
15186,PANAGIOTAKOS DB;KROMHOUT D;MENOTTI A;CHRYSOHOO...,"PANAGIOTAKOS, DB;KROMHOUT, D;MENOTTI, A;CHRYSO...","ANDERSON JT, 1956, CLIN CHEM, V2, P145;BENETOS...",BACKGROUND: HYPERTENSION IS A DOMINANT CHARACT...,,10.1001/archinte.165.18.2142,ARTICLE; RETRACTED PUBLICATION,,,,...,"PANAGIOTAKOS, DEMOSTHENES/K-8294-2019; PANAGIO...",GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,53.0,RETRACTED: THE RELATION BETWEEN PULSE PRESSURE...,"MEDICINE, GENERAL & INTERNAL",55,HAROKOPIO UNIVERSITY ATHENS; NETHERLANDS NATIO...,DB (CORRESPONDING AUTHOR)
20101,WANSINK B;TAL A;SHIMIZU M,"WANSINK, BRIAN;TAL, ANER;SHIMIZU, MITSURU","FROST G, 1987, HUM NUTR-APPL NUTR, V41A, P47;G...",,,10.1001/archinternmed.2012.1278,LETTER; RETRACTED PUBLICATION,,,,...,,GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,14.0,RETRACTED: FIRST FOODS MOST: AFTER 18-HOUR FAS...,"MEDICINE, GENERAL & INTERNAL",14,CORNELL UNIVERSITY,CORNELL UNIV
13201,FUJII Y;TANAKA H;ITO M,"FUJII, Y;TANAKA, H;ITO, M","ABRAMOWITZ MD, 1983, ANESTHESIOLOGY, V59, P579...",BACKGROUND: POSTOPERATIVE VOMITING (POV) AFTER...,,10.1001/archopht.123.1.25,ARTICLE; RETRACTED PUBLICATION,,,,...,,OPHTHALMOLOGY,0003-9950,ARCHIVES OF OPHTHALMOLOGY,10.0,RETRACTED: A RANDOMIZED CLINICAL TRIAL OF A SI...,OPHTHALMOLOGY,11,TORIDE KYODO GEN HOSP;TORIDE KYODO GEN HOSP,UNIV TSUKUBA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6086,BISWAS S;VERMA R;BHATIA VK;CHAUDHARY AK;CHANDR...,"BISWAS, SONIYA;VERMA, REETU;BHATIA, VINOD KUMA...","ASIDA SM, 2012, EGYPT J ANAESTH, V28, P55, DOI...",INTRODUCTION: POSTOPERATIVE PAIN AFTER THORACO...,BUPIVACAINE; FENTANYL; PAINFUL SURGERIES,10.7860/JCDR/2016/19159.8489,ARTICLE; RETRACTED PUBLICATION,,,,...,"CHAUDHARY, AJAY K/A-1399-2015",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,19.0,RETRACTED: COMPARISON BETWEEN THORACIC EPIDURA...,"MEDICINE, GENERAL & INTERNAL",21,KING GEORGE'S MEDICAL UNIVERSITY,R (CORRESPONDING AUTHOR)
6244,JAIN MJ;MAVANI KJ,"JAIN, MOHIT J.;MAVANI, KINJAL J.","ANDERSON JOHN T, 2004, IOWA ORTHOP J, V24, P53...",INTRODUCTION: THE MANAGEMENT OF HIGHLY COMMINU...,DISTAL END RADIUS FRACTURES; DYNAMIC COMPRESSI...,10.7860/JCDR/2016/21926.9036,ARTICLE; RETRACTED PUBLICATION,,,,...,"JAIN, MOHIT/AAY-4869-2020",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,9.0,RETRACTED: A COMPREHENSIVE STUDY OF INTERNAL D...,"MEDICINE, GENERAL & INTERNAL",10,SANJEEVANI MULTISPECIAL HOSP;MARATHA MANDAL INST,MJ (CORRESPONDING AUTHOR)
8989,SHAH AF;BATRA M;QURESHI A,"SHAH, AASIM FAROOQ;BATRA, MANU;QURESHI, AMBRINA","ACHARYA S, 2009, INT J DENT HYG, V7, P102, DOI...",INTRODUCTION: ORAL HEALTH IS A KEY COMPONENT O...,POSTPARTUM FEMALES; PREGNANT WOMEN; PERIODONTA...,10.7860/JCDR/2017/25862.9769,ARTICLE; RETRACTED PUBLICATION,,,,...,,GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,6.0,RETRACTED: EVALUATION OF IMPACT OF PREGNANCY O...,"MEDICINE, GENERAL & INTERNAL",7,DOW UNIVERSITY OF HEALTH SCIENCES,GOVT COLL AND HOSP
7864,PATRO S;MOHAPATRA S;MISHRA S,"PATRO, SWADHEENA;MOHAPATRA, SATYAJIT;MISHRA, S...","AHLBERG KMF, 1995, INT ENDOD J, V28, P30, DOI ...",INTRODUCTION: IT IS CLINICALLY VERY IMPORTANT ...,EVALUATION STUDIES; OBTURATION; PREMETREXED; T...,10.7860/JCDR/2019/32601.12609,ARTICLE; RETRACTED PUBLICATION,,,,...,"PATRO, SWADHEENA/AAZ-8764-2021",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,0.0,RETRACTED: COMPARATIVE EVALUATION OF APICAL MI...,"MEDICINE, GENERAL & INTERNAL",0,SIKSHA 'O' ANUSANDHAN UNIVERSITY; SIKSHA 'O' A...,SOA UNIV


In [262]:
# Remove the markers that WoS adds to the titles of retracted or withdrawn records.
# Matching is case-insensitive, so "(WITHDRAWN PUBLICATION)" is removed as well.
wos_dois['document_title'] = wos_dois['document_title'].str.replace(r'RETRACTED: |\(Withdrawn Publication\)|</?bold>', '', case=False, regex=True)

In [263]:
# Some records include variations of the phrase "(Retracted article. See XX)",
# where XX varies in length and content but always ends with ")"
def remove_retraction_phrase(title):

    # Lowercase the title, then remove the retraction phrase
    pattern = r'\(retracted article\. see [^\)]+\)'
    return re.sub(pattern, '', str.lower(title)).strip()

# Apply the function to the 'document_title' column
wos_dois['document_title'] = wos_dois['document_title'].apply(remove_retraction_phrase)

In [264]:
# Other records include the phrase "(Withdrawal of XX)", handled in the same way
def remove_withdrawal_phrase(title):

    # Lowercase the title, then remove the withdrawal phrase
    pattern = r'\(withdrawal of [^\)]+\)'
    return re.sub(pattern, '', str.lower(title)).strip()

# Apply the function to the 'document_title' column
wos_dois['document_title'] = wos_dois['document_title'].apply(remove_withdrawal_phrase)

In [265]:
# Remove spaces at the end of the string
wos_dois['document_title'] = wos_dois['document_title'].str.rstrip()

#lower case string
wos_dois['document_title'] = wos_dois['document_title'].str.lower()
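
The title clean-up steps of this section can also be expressed as a single function, which makes them easier to test on a handful of strings before touching the DataFrame. This is a sketch under assumptions: `clean_title` is a hypothetical helper, the example titles are invented, and the `RETRACTED:` prefix is only removed at the start of the title.

```python
import re

# One pattern for every marker handled in this section (case-insensitive)
_TITLE_NOISE = re.compile(
    r"^\s*retracted:\s*"                                         # leading "RETRACTED: "
    r"|\((?:retracted article\. see|withdrawal of)\s[^)]*\)"     # "(Retracted article. See XX)" / "(Withdrawal of XX)"
    r"|\(withdrawn publication\)"                                # "(Withdrawn Publication)"
    r"|</?bold>",                                                # leftover <bold> tags
    re.IGNORECASE,
)

def clean_title(title):
    """Strip retraction/withdrawal markers, collapse whitespace and lowercase."""
    cleaned = _TITLE_NOISE.sub("", title)
    return re.sub(r"\s+", " ", cleaned).strip().lower()

# Invented titles, used only to exercise each marker
print(clean_title("RETRACTED: EXAMPLE TITLE (WITHDRAWN PUBLICATION)"))
print(clean_title("EXAMPLE TITLE (RETRACTED ARTICLE. SEE VOL. 1, 2020)"))
print(clean_title("<bold>EXAMPLE</bold> TITLE (Withdrawal of vol 2, pg 10)"))
```

Because the pattern is compiled once and ignores case, the result does not depend on whether the lowercasing happens before or after the markers are removed.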

### Duplicates

In [266]:
wos_dois[wos_dois.duplicated(subset='doi', keep=False)].shape

(216, 32)

In [267]:
wos_dois['doi'].nunique()

20617

In [268]:
wos_dois.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20725 entries, 10235 to 19747
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   authors                                20725 non-null  object        
 1   author_fullnames                       20723 non-null  object        
 2   cited_references                       20620 non-null  object        
 3   abstract                               19231 non-null  object        
 4   author_keywords                        12766 non-null  object        
 5   doi                                    20725 non-null  object        
 6   document_type                          20725 non-null  object        
 7   funding_agency_and_grant_number        8917 non-null   object        
 8   esi_highly_cited                       47 non-null     object        
 9   esi_hot_paper                          47 non-null     object 

In [269]:
#control_set.iloc[[2,6,36,58,60,65,67,68]]

In [270]:
wos_dois['doi'].value_counts()

doi
10.1186/1479-5876-6-27        2
10.1186/ar4300                2
10.1177/1533033819861971      2
10.1177/1533033819870209      2
10.1177/1533033819870791      2
                             ..
10.1016/j.omtn.2019.06.003    1
10.1016/j.omtn.2019.05.020    1
10.1016/j.omtn.2019.05.014    1
10.1016/j.omtn.2019.02.022    1
10.9713/kcer.2019.57.6.790    1
Name: count, Length: 20617, dtype: int64

In [271]:
wos_dois.shape

(20725, 32)

In [272]:
wos_dois[wos_dois.duplicated(subset='doi', keep=False)]

Unnamed: 0,authors,author_fullnames,cited_references,abstract,author_keywords,doi,document_type,funding_agency_and_grant_number,esi_highly_cited,esi_hot_paper,...,researcher_id_nr,research_areas,issn,publication_name,wos_core_collection_times_cited_count,document_title,wos_categories,total_times_cited_count,authors_affiliations,corresponding_author_affiliation
2865,YU L;LI HT;LIU WH;ZHANG LG;TIAN Q;LI HR;LI M,"YU, LING;LI, HAITING;LIU, WENHU;ZHANG, LIGONG;...","ALKASIR R, 2017, PROTEIN CELL, V8, P90, DOI 10...",BACKGROUND: NUMEROUS MICRORNAS (MIRNAS) HAVE B...,AKT3; ALZHEIMER'S DISEASE; APOPTOSIS; DIAGNOSI...,10.1002/mgg3.1548,ARTICLE,,,,...,,GENETICS & HEREDITY,2324-9269,MOLECULAR GENETICS & GENOMIC MEDICINE,19.0,mir-485-3p serves as a biomarker and therapeut...,GENETICS & HEREDITY,20,SHENGLI OILFIELD CENT HOSP,SHENGLI OILFIELD CENT HOSP
2866,YU L;LI HT;LIU WH;ZHANG LG;TIAN Q;LI HR;LI M,"YU, LING;LI, HAITING;LIU, WENHU;ZHANG, LIGONG;...","ALKASIR R, 2017, PROTEIN CELL, V8, P90, DOI 10...",BACKGROUND: NUMEROUS MICRORNAS (MIRNAS) HAVE B...,AKT3; ALZHEIMER'S DISEASE; APOPTOSIS; DIAGNOSI...,10.1002/mgg3.1548,ARTICLE; EARLY ACCESS; RETRACTED PUBLICATION,,,,...,,GENETICS & HEREDITY,2324-9269,MOLECULAR GENETICS & GENOMIC MEDICINE,19.0,mir-485-3p serves as a biomarker and therapeut...,GENETICS & HEREDITY,20,SHENGLI OILFIELD CENT HOSP,SHENGLI OILFIELD CENT HOSP
20732,BISWAS S;RAHMAN I,"BISWAS, SAIBAL;RAHMAN, IRFAN","BISWAS S, 2008, MOL NUTR FOOD RES, V52, P987, ...",,,10.1002/mnfr.200700259,CORRECTION,,,,...,,FOOD SCIENCE & TECHNOLOGY,1613-4125,MOLECULAR NUTRITION & FOOD RESEARCH,0.0,modulation of steroid activity in chronic infl...,FOOD SCIENCE & TECHNOLOGY,0,,NOTREPORTED
14734,BISWAS S;RAHMAN I,"BISWAS, SAIBAL;RAHMAN, IRFAN","ABE Y, 1999, PHARMACOL RES, V39, P41, DOI 10.1...",THE EXPRESSION OF NF-KAPPAB (NF-KAPPA B)-DEPEN...,ASTHMA; CHRONIC OBSTRUCTIVE PULMONARY DISEASE;...,10.1002/mnfr.200700259,REVIEW; RETRACTED PUBLICATION,NIEHS NIH HHS [ES-01247] FUNDING SOURCE: MEDLINE,,,...,,FOOD SCIENCE & TECHNOLOGY,1613-4125,MOLECULAR NUTRITION & FOOD RESEARCH,43.0,modulation of steroid activity in chronic infl...,FOOD SCIENCE & TECHNOLOGY,49,UNIVERSITY OF ROCHESTER,UNIV ROCHESTER
16487,ALLAM AA;EL-GHAREEB AW;ABDUL-HAMID M;BAKERY AE...,"ALLAM, AHMED ALY;EL-GHAREEB, ABDEL WHAAB;ABDUL...","ABDUL-HAMID M., 2005, J EGYPT GER SOC ZOOL, V2...",ACRYLAMIDE HAS BEEN EMPLOYED AS AN EXPERIMENTA...,ACRYLAMIDE; LIVER DEVELOPMENT; LIPID PEROXIDAT...,10.1007/s00204-009-0475-2,ARTICLE; RETRACTED PUBLICATION,,,,...,,TOXICOLOGY,0340-5761,ARCHIVES OF TOXICOLOGY,0.0,effect of prenatal and perinatal acrylamide on...,TOXICOLOGY,0,EGYPTIAN KNOWLEDGE BANK (EKB); BENI SUEF UNIVE...,OREGON HLTH AND SCI UNIV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17688,YANG J;MOHSENI E;BEHFOROUZ B;KHOTBEHSARA MM,"YANG, JIAN;MOHSENI, EHSAN;BEHFOROUZ, BABAK;KHO...","ANONYMOUS, 2003, ANN BOOK ASTM STAND, V04.05;A...","IN THIS PAPER, THE EFFECTS OF USING CR2O3 AND ...",SELF-COMPACTING MORTAR; CR2O3 NANOPARTICLES; Z...,10.3139/146.111245,ARTICLE; RETRACTED PUBLICATION,,,,...,"YANG, JIAN/AHB-7153-2022; BEHFOROUZ, BABAK/AAC...",METALLURGY & METALLURGICAL ENGINEERING,1862-5282,INTERNATIONAL JOURNAL OF MATERIALS RESEARCH,0.0,an experimental investigation into the effects...,METALLURGY & METALLURGICAL ENGINEERING,0,SHANGHAI JIAO TONG UNIVERSITY; UNIVERSITY OF G...,UNIV GUILAN
301,FANG X;TONG KZ;WANG X;NI HF,"FANG, XING;TONG, KE-ZHEN;WANG, XIN;NI, HUA-FU","FANG X, 2022, J OLEO SCI, V71, P1695, DOI 10.5...","IN THIS STUDY, TWO NEW MIXED-LIGAND COORDINATI...",COORDINATION POLYMERS; MIXED-LIGAND; X-RAY DIF...,10.5650/jos.ess20047,RETRACTION; RETRACTED PUBLICATION,,,,...,,CHEMISTRY; FOOD SCIENCE & TECHNOLOGY,1345-8957,JOURNAL OF OLEO SCIENCE,1.0,retraction: two mixed-ligand coordination poly...,"CHEMISTRY, APPLIED; FOOD SCIENCE & TECHNOLOGY",1,,NOTREPORTED
324,FANG X;TONG KZ;WANG X;NI HF,"FANG, XING;TONG, KE-ZHEN;WANG, XIN;NI, HUA-FU","ALVAREZ N, 2018, INORG CHIM ACTA, V483, P61, D...","IN THIS STUDY, TWO NEW MIXED-LIGAND COORDINATI...",COORDINATION POLYMERS; MIXED-LIGAND; X-RAY DIF...,10.5650/jos.ess20047,ARTICLE,,,,...,,CHEMISTRY; FOOD SCIENCE & TECHNOLOGY,1345-8957,JOURNAL OF OLEO SCIENCE,1.0,two mixed-ligand coordination polymers: crysta...,"CHEMISTRY, APPLIED; FOOD SCIENCE & TECHNOLOGY",1,BEILUN PEOPLES HOSP;NINGBO YINZHOU PEOPLES HOSP,BEILUN PEOPLES HOSP
20789,POEHLMAN ET;TOTH MJ;GARDNER AW,"POEHLMAN, ET;TOTH, MJ;GARDNER, AW","POEHLMAN ET, 1995, ANN INTERN MED, V123, P673,...",,,10.7326/0003-4819-123-9-199511010-00005,CORRECTION,,,,...,,GENERAL & INTERNAL MEDICINE,0003-4819,ANNALS OF INTERNAL MEDICINE,3.0,changes in energy balance and body composition...,"MEDICINE, GENERAL & INTERNAL",3,,NOTREPORTED


In [273]:
testing_dupes = wos_dois[wos_dois.duplicated(subset='doi', keep=False)]
testing_dupes['doi'].value_counts()

doi
10.1002/mgg3.1548                          2
10.1159/000099107                          2
10.1177/1533033819850189                   2
10.1177/1533033818821401                   2
10.1177/1466138108089466                   2
                                          ..
10.1038/s41467-019-13388-8                 2
10.1038/s41419-018-1090-z                  2
10.1038/s41419-018-0978-y                  2
10.1038/s41388-018-0628-y                  2
10.7326/0003-4819-123-9-199511010-00005    2
Name: count, Length: 108, dtype: int64

In [274]:
testing_dupes = testing_dupes.sort_values('doi')

In [275]:
testing_dupes

Unnamed: 0,authors,author_fullnames,cited_references,abstract,author_keywords,doi,document_type,funding_agency_and_grant_number,esi_highly_cited,esi_hot_paper,...,researcher_id_nr,research_areas,issn,publication_name,wos_core_collection_times_cited_count,document_title,wos_categories,total_times_cited_count,authors_affiliations,corresponding_author_affiliation
2865,YU L;LI HT;LIU WH;ZHANG LG;TIAN Q;LI HR;LI M,"YU, LING;LI, HAITING;LIU, WENHU;ZHANG, LIGONG;...","ALKASIR R, 2017, PROTEIN CELL, V8, P90, DOI 10...",BACKGROUND: NUMEROUS MICRORNAS (MIRNAS) HAVE B...,AKT3; ALZHEIMER'S DISEASE; APOPTOSIS; DIAGNOSI...,10.1002/mgg3.1548,ARTICLE,,,,...,,GENETICS & HEREDITY,2324-9269,MOLECULAR GENETICS & GENOMIC MEDICINE,19.0,mir-485-3p serves as a biomarker and therapeut...,GENETICS & HEREDITY,20,SHENGLI OILFIELD CENT HOSP,SHENGLI OILFIELD CENT HOSP
2866,YU L;LI HT;LIU WH;ZHANG LG;TIAN Q;LI HR;LI M,"YU, LING;LI, HAITING;LIU, WENHU;ZHANG, LIGONG;...","ALKASIR R, 2017, PROTEIN CELL, V8, P90, DOI 10...",BACKGROUND: NUMEROUS MICRORNAS (MIRNAS) HAVE B...,AKT3; ALZHEIMER'S DISEASE; APOPTOSIS; DIAGNOSI...,10.1002/mgg3.1548,ARTICLE; EARLY ACCESS; RETRACTED PUBLICATION,,,,...,,GENETICS & HEREDITY,2324-9269,MOLECULAR GENETICS & GENOMIC MEDICINE,19.0,mir-485-3p serves as a biomarker and therapeut...,GENETICS & HEREDITY,20,SHENGLI OILFIELD CENT HOSP,SHENGLI OILFIELD CENT HOSP
20732,BISWAS S;RAHMAN I,"BISWAS, SAIBAL;RAHMAN, IRFAN","BISWAS S, 2008, MOL NUTR FOOD RES, V52, P987, ...",,,10.1002/mnfr.200700259,CORRECTION,,,,...,,FOOD SCIENCE & TECHNOLOGY,1613-4125,MOLECULAR NUTRITION & FOOD RESEARCH,0.0,modulation of steroid activity in chronic infl...,FOOD SCIENCE & TECHNOLOGY,0,,NOTREPORTED
14734,BISWAS S;RAHMAN I,"BISWAS, SAIBAL;RAHMAN, IRFAN","ABE Y, 1999, PHARMACOL RES, V39, P41, DOI 10.1...",THE EXPRESSION OF NF-KAPPAB (NF-KAPPA B)-DEPEN...,ASTHMA; CHRONIC OBSTRUCTIVE PULMONARY DISEASE;...,10.1002/mnfr.200700259,REVIEW; RETRACTED PUBLICATION,NIEHS NIH HHS [ES-01247] FUNDING SOURCE: MEDLINE,,,...,,FOOD SCIENCE & TECHNOLOGY,1613-4125,MOLECULAR NUTRITION & FOOD RESEARCH,43.0,modulation of steroid activity in chronic infl...,FOOD SCIENCE & TECHNOLOGY,49,UNIVERSITY OF ROCHESTER,UNIV ROCHESTER
16487,ALLAM AA;EL-GHAREEB AW;ABDUL-HAMID M;BAKERY AE...,"ALLAM, AHMED ALY;EL-GHAREEB, ABDEL WHAAB;ABDUL...","ABDUL-HAMID M., 2005, J EGYPT GER SOC ZOOL, V2...",ACRYLAMIDE HAS BEEN EMPLOYED AS AN EXPERIMENTA...,ACRYLAMIDE; LIVER DEVELOPMENT; LIPID PEROXIDAT...,10.1007/s00204-009-0475-2,ARTICLE; RETRACTED PUBLICATION,,,,...,,TOXICOLOGY,0340-5761,ARCHIVES OF TOXICOLOGY,0.0,effect of prenatal and perinatal acrylamide on...,TOXICOLOGY,0,EGYPTIAN KNOWLEDGE BANK (EKB); BENI SUEF UNIVE...,OREGON HLTH AND SCI UNIV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17688,YANG J;MOHSENI E;BEHFOROUZ B;KHOTBEHSARA MM,"YANG, JIAN;MOHSENI, EHSAN;BEHFOROUZ, BABAK;KHO...","ANONYMOUS, 2003, ANN BOOK ASTM STAND, V04.05;A...","IN THIS PAPER, THE EFFECTS OF USING CR2O3 AND ...",SELF-COMPACTING MORTAR; CR2O3 NANOPARTICLES; Z...,10.3139/146.111245,ARTICLE; RETRACTED PUBLICATION,,,,...,"YANG, JIAN/AHB-7153-2022; BEHFOROUZ, BABAK/AAC...",METALLURGY & METALLURGICAL ENGINEERING,1862-5282,INTERNATIONAL JOURNAL OF MATERIALS RESEARCH,0.0,an experimental investigation into the effects...,METALLURGY & METALLURGICAL ENGINEERING,0,SHANGHAI JIAO TONG UNIVERSITY; UNIVERSITY OF G...,UNIV GUILAN
301,FANG X;TONG KZ;WANG X;NI HF,"FANG, XING;TONG, KE-ZHEN;WANG, XIN;NI, HUA-FU","FANG X, 2022, J OLEO SCI, V71, P1695, DOI 10.5...","IN THIS STUDY, TWO NEW MIXED-LIGAND COORDINATI...",COORDINATION POLYMERS; MIXED-LIGAND; X-RAY DIF...,10.5650/jos.ess20047,RETRACTION; RETRACTED PUBLICATION,,,,...,,CHEMISTRY; FOOD SCIENCE & TECHNOLOGY,1345-8957,JOURNAL OF OLEO SCIENCE,1.0,retraction: two mixed-ligand coordination poly...,"CHEMISTRY, APPLIED; FOOD SCIENCE & TECHNOLOGY",1,,NOTREPORTED
324,FANG X;TONG KZ;WANG X;NI HF,"FANG, XING;TONG, KE-ZHEN;WANG, XIN;NI, HUA-FU","ALVAREZ N, 2018, INORG CHIM ACTA, V483, P61, D...","IN THIS STUDY, TWO NEW MIXED-LIGAND COORDINATI...",COORDINATION POLYMERS; MIXED-LIGAND; X-RAY DIF...,10.5650/jos.ess20047,ARTICLE,,,,...,,CHEMISTRY; FOOD SCIENCE & TECHNOLOGY,1345-8957,JOURNAL OF OLEO SCIENCE,1.0,two mixed-ligand coordination polymers: crysta...,"CHEMISTRY, APPLIED; FOOD SCIENCE & TECHNOLOGY",1,BEILUN PEOPLES HOSP;NINGBO YINZHOU PEOPLES HOSP,BEILUN PEOPLES HOSP
20789,POEHLMAN ET;TOTH MJ;GARDNER AW,"POEHLMAN, ET;TOTH, MJ;GARDNER, AW","POEHLMAN ET, 1995, ANN INTERN MED, V123, P673,...",,,10.7326/0003-4819-123-9-199511010-00005,CORRECTION,,,,...,,GENERAL & INTERNAL MEDICINE,0003-4819,ANNALS OF INTERNAL MEDICINE,3.0,changes in energy balance and body composition...,"MEDICINE, GENERAL & INTERNAL",3,,NOTREPORTED


In [276]:
testing_dupes.to_excel('../testing_dupes_wos_dois.xlsx', index= False)

In [277]:
wos_dois['document_type'].value_counts()

document_type
ARTICLE; RETRACTED PUBLICATION                             15769
ARTICLE; EARLY ACCESS; RETRACTED PUBLICATION                 930
ARTICLE; EARLY ACCESS                                        918
REVIEW; RETRACTED PUBLICATION                                867
RETRACTION                                                   609
                                                           ...  
EDITORIAL MATERIAL; EARLY ACCESS; WITHDRAWN PUBLICATION        1
BIOGRAPHICAL-ITEM; RETRACTED PUBLICATION                       1
ARTICLE; DATA PAPER; EARLY ACCESS                              1
ITEM WITHDRAWAL; RETRACTED PUBLICATION                         1
NOTE                                                           1
Name: count, Length: 62, dtype: int64

In [278]:
wos_dois['publication_name'].value_counts()

publication_name
PLOS ONE                                                   633
JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING    419
JOURNAL OF BIOLOGICAL CHEMISTRY                            368
EVIDENCE-BASED COMPLEMENTARY AND ALTERNATIVE MEDICINE      332
JOURNAL OF HEALTHCARE ENGINEERING                          270
                                                          ... 
SEMINARS IN ULTRASOUND CT AND MRI                            1
SEMINARS IN PERINATOLOGY                                     1
JOURNAL OF CHILD SCIENCE                                     1
EXPERIMENTAL AND CLINICAL ENDOCRINOLOGY & DIABETES           1
KOREAN CHEMICAL ENGINEERING RESEARCH                         1
Name: count, Length: 4316, dtype: int64

In [279]:
unique_counts = testing_dupes[testing_dupes['doi']=='10.1038/nrm1196'].nunique()
unique_counts[unique_counts > 1].index

Index([], dtype='object')

In [280]:
wos_dois.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20725 entries, 10235 to 19747
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   authors                                20725 non-null  object        
 1   author_fullnames                       20723 non-null  object        
 2   cited_references                       20620 non-null  object        
 3   abstract                               19231 non-null  object        
 4   author_keywords                        12766 non-null  object        
 5   doi                                    20725 non-null  object        
 6   document_type                          20725 non-null  object        
 7   funding_agency_and_grant_number        8917 non-null   object        
 8   esi_highly_cited                       47 non-null     object        
 9   esi_hot_paper                          47 non-null     object 

In [281]:
# number of different values per DOI
aux = testing_dupes.groupby('doi').nunique()

# number of variables that differ for each DOI
aux.gt(1).sum(axis=1).sort_values()

doi
10.1007/s10586-018-2296-7         2
10.1177/1466138108089466          3
10.1007/s10735-016-9659-2         3
10.1038/ncb3547                   4
10.1007/s11010-013-1650-6         4
                                 ..
10.1159/000099107                14
10.1109/IFITA.2009.315           15
10.1371/journal.ppat.0020025     16
10.1080/21645515.2017.1349046    16
10.1007/s11069-007-9148-8        16
Length: 108, dtype: int64

In [282]:
# variables that differ in duplicates
aux.columns[aux.max() > 1]

Index(['authors', 'author_fullnames', 'cited_references', 'abstract',
       'author_keywords', 'document_type', 'funding_agency_and_grant_number',
       'keywords_plus', 'cited_reference_count', 'open_access_indicator',
       'orcid', 'publication_date', 'page_count', 'pubmed_id', 'publisher',
       'year_published', 'researcher_id_nr', 'research_areas', 'issn',
       'publication_name', 'wos_core_collection_times_cited_count',
       'document_title', 'wos_categories', 'total_times_cited_count',
       'authors_affiliations', 'corresponding_author_affiliation'],
      dtype='object')

Of the variables listed above as taking different values for the same DOI, only those needed for the methodology will be inspected individually:
- 'authors', 
- 'author_keywords', 
- 'keywords_plus', 
- 'cited_references',
- 'abstract', 
- 'affiliations', 
- 'eissn', 
- 'funding_agency_and_grant_number',
- 'issn', 
- 'iso_source_abv', 
- 'publication_name', 
- 'month',
- 'cited_reference_count', 
- 'open_access_indicator', 
- 'publisher',
- 'research_areas', 
- 'researcher_id_numbers',
- 'wos_core_collection_times_cited_count', 
- 'document_title',
- 'document_type', 
- 'wos_categories', 
- 'year_published',
- 'authors_affiliations', 
- 'corresponding_author_affiliation',
- 'publication_date'

In [283]:
# Keep one record per DOI: the first occurrence, which is the most recent one
# because wos_dois was sorted by publication date in descending order above
filtered_wos = wos_dois.drop_duplicates(subset='doi', keep='first')
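
Keeping the most recent record is one possible rule. Several of the duplicated DOIs shown earlier pair an original record with its retracted or corrected counterpart, so an alternative would be to prefer the row whose `document_type` mentions `RETRACTED PUBLICATION` and use the date only as a tie-breaker. The sketch below illustrates that alternative on made-up rows; `pick_one_per_doi` is a hypothetical helper and this is not the rule the notebook applies.

```python
import pandas as pd

def pick_one_per_doi(df):
    """Keep one row per DOI, preferring retracted records, then the latest date."""
    ranked = df.assign(
        _is_retracted=df["document_type"].str.contains(
            "RETRACTED PUBLICATION", regex=False, na=False
        )
    )
    # Within each DOI: retracted rows first, then most recent publication date
    ranked = ranked.sort_values(
        ["doi", "_is_retracted", "publication_date"],
        ascending=[True, False, False],
    )
    return ranked.drop_duplicates(subset="doi", keep="first").drop(columns="_is_retracted")

# Invented rows: the two records of '10.1/a' differ in type and date
toy = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", "10.1/b"],
    "document_type": ["ARTICLE", "ARTICLE; RETRACTED PUBLICATION", "CORRECTION"],
    "publication_date": pd.to_datetime(["2021-05-01", "2020-01-01", "2019-01-01"]),
})
one_per_doi = pick_one_per_doi(toy)
print(one_per_doi)
```

In the toy data the retracted record of `10.1/a` is kept even though the plain `ARTICLE` row carries the later date.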

### Abstract pre-processing

In [284]:
wos_dois.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20725 entries, 10235 to 19747
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   authors                                20725 non-null  object        
 1   author_fullnames                       20723 non-null  object        
 2   cited_references                       20620 non-null  object        
 3   abstract                               19231 non-null  object        
 4   author_keywords                        12766 non-null  object        
 5   doi                                    20725 non-null  object        
 6   document_type                          20725 non-null  object        
 7   funding_agency_and_grant_number        8917 non-null   object        
 8   esi_highly_cited                       47 non-null     object        
 9   esi_hot_paper                          47 non-null     object 

In [285]:
control_abstract = wos_dois[['doi','abstract']].dropna(inplace=False)

In [286]:
import re

In [287]:
import string

In [288]:
# Remove punctuation 
punctuation = string.punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Get the unicode characters that are not punctuation and that do not add value to sentiment

In [289]:
def count_special_characters(text, char_counts):
  """
  Count the characters of text that are neither alphanumeric nor punctuation.
  The counts are accumulated in place in the char_counts dictionary.
  """
  for char in text:
    if not char.isalnum() and char not in punctuation:
      char_counts[char] = char_counts.get(char, 0) + 1

In [290]:
invalid_char_counts = {}

control_abstract["abstract"].apply(lambda x: count_special_characters(x, invalid_char_counts))

print(invalid_char_counts)

{' ': 4026148}


Remove HTML and Markdown symbols

In [291]:
html_mkdown_regex = r'(?!<[ \d$])<[^>]+>|&[a-zA-Z]+;|\*\*+|__'
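
The pattern above matches HTML tags, HTML entities such as `&amp;`, and the Markdown emphasis markers `**` and `__`, while its negative lookahead leaves a `<` that is followed by a space, a digit or `$` untouched (as in `p < 0.05`). A short sketch of how it might be applied; `strip_markup` is a hypothetical helper and the sample strings are invented.

```python
import re

html_mkdown_regex = r'(?!<[ \d$])<[^>]+>|&[a-zA-Z]+;|\*\*+|__'

def strip_markup(text):
    """Remove HTML tags/entities and Markdown emphasis markers from text."""
    return re.sub(html_mkdown_regex, '', text)

print(strip_markup("<b>bold</b> &amp; **strong** __x__"))
print(strip_markup("p < 0.05 and q > 1"))
```

Note that a comparison written without a space or digit after `<` (for example `a<b and c>d`) would still be treated as a tag.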

Replace the following patterns with a placeholder, using regular expressions:
- Emails
- URLs

In [292]:
def replace_emails(text):
  """
  Replaces email addresses in a given text with a generic placeholder.
  """
  email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
  cleaned_text = re.sub(email_pattern, 'email', text)
  return cleaned_text

In [293]:
def replace_urls(text):
  """
  Replaces URLs in a given text with a generic placeholder.
  """
  url_pattern = r'http\S+|www\S+'
  cleaned_text = re.sub(url_pattern, 'url', text)
  return cleaned_text

Stopwords

In [294]:
from nltk.corpus import stopwords

In [295]:
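def remove_stop_words(text, language):
  """
  Removes the stop words of the given language from a text.
  Tokens are split on whitespace and compared exactly as written (case-sensitive).
  """
  stop = set(stopwords.words(language))
  if stop:
    return " ".join([word for word in text.split() if word not in stop])
  else:
    return text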

In [296]:
# example usage
remove_stop_words('Hello, this is a test of stop word removal', "english")

'Hello, test stop word removal'
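Because `remove_stop_words` compares tokens exactly as written and NLTK's stop-word lists are lowercase, a capitalised stop word ("This") or one with attached punctuation ("the,") is kept. A possible variant, sketched below, normalises each token before the lookup while preserving the original token in the output. The small `STOP_WORDS` set is a stand-in for `stopwords.words("english")`, used only to keep the example self-contained, and the function name is hypothetical.

```python
import string

# Stand-in stop-word list (assumption); the notebook uses nltk's stopwords instead
STOP_WORDS = {"this", "is", "a", "of", "the"}


def remove_stop_words_normalized(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased, punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        key = token.strip(string.punctuation).lower()
        if key not in stop_words:
            kept.append(token)  # keep the token as originally written
    return " ".join(kept)


result = remove_stop_words_normalized("This is a test. Of the method, this!")
print(result)
```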

Word Contractions

In [297]:
import contractions

In [298]:
words_with_contractions = ["can't", "don't", "aren't", "there's", "couldn't", "didn't"]

for word in words_with_contractions:
  print(contractions.fix(word))

cannot
do not
are not
there is
could not
did not


### Export dataset

In [299]:
wos_dois.to_parquet('./retractions_data/wos_dois.parquet', index = False)

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 2 - Merge Data <a class="anchor" id="chapter2"></a>

<a class="anchor"> 

## 2.1 - Retractions Data (rwd+wos) <a class="anchor" id="section_2_1"></a>

In [300]:
filtered_rwd.shape
#filtered_wos.shape

(36847, 20)

In [301]:
processed_data_retractions = filtered_rwd.merge(filtered_wos, how= 'inner', left_on= 'OriginalPaperDOI', right_on= 'doi')
processed_data_retractions

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,...,researcher_id_nr,research_areas,issn,publication_name,wos_core_collection_times_cited_count,document_title,wos_categories,total_times_cited_count,authors_affiliations,corresponding_author_affiliation
0,5729,The prevention of hip fracture with risedronat...,(HSC) Medicine - Geriatric;(HSC) Medicine - Ne...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Tomohiro Kanoko;Kei Satoh;Jun I...,http://retractionwatch.com/2016/06/03/jama-jou...,Clinical Study;Research Article;,...,,GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,48.0,the prevention of hip fracture with risedronat...,"MEDICINE, GENERAL & INTERNAL",52,HIROSAKI UNIVERSITY; HIROSAKI UNIVERSITY; KEIO...,MITATE HOSP
1,5728,Risedronate sodium therapy for prevention of h...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Jun Iwamoto;Tomohiro Kanoko,http://retractionwatch.com/2016/06/03/jama-jou...,Research Article;,...,,GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,113.0,risedronate sodium therapy for prevention of h...,"MEDICINE, GENERAL & INTERNAL",119,KEIO UNIVERSITY; HIROSAKI UNIVERSITY; HIROSAKI...,MITATE HOSP
2,895,The Relation Between Pulse Pressure and Cardio...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Dietetics and Nutrition, Harokop...",JAMA Internal Medicine,JAMA Network,Finland;Greece;Italy;Japan;Netherlands;Serbia;...,Demosthenes B Panagiotakos;Daan Kromhout;Aless...,,Research Article;,...,"PANAGIOTAKOS, DEMOSTHENES/K-8294-2019; PANAGIO...",GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,53.0,the relation between pulse pressure and cardio...,"MEDICINE, GENERAL & INTERNAL",55,HAROKOPIO UNIVERSITY ATHENS; NETHERLANDS NATIO...,DB (CORRESPONDING AUTHOR)
3,19230,"First Foods Most: After 18-Hour Fast, People D...",(BLS) Nutrition;(SOC) Psychology;,Dyson School of Applied Economics and Manageme...,JAMA Internal Medicine,American Medical Association,United States,Brian Wansink;Aner Tal;Mitsuru Shimizu,http://retractionwatch.com/2018/04/13/caught-o...,Letter;Research Article;,...,,GENERAL & INTERNAL MEDICINE,0003-9926,ARCHIVES OF INTERNAL MEDICINE,14.0,"first foods most: after 18-hour fast, people d...","MEDICINE, GENERAL & INTERNAL",14,CORNELL UNIVERSITY,CORNELL UNIV
4,938,A Randomized Clinical Trial of a Single Dose o...,(HSC) Medicine - Ophthalmology;(HSC) Medicine ...,"Department of Anesthesiology, Toride Kyodo Gen...",JAMA Ophthalmology,American Medical Association,Japan,Yoshitaka Fujii;Hiroyoshi Tanaka;Mutsuko Ito,http://retractionwatch.com/2012/06/18/three-mo...,Clinical Study;,...,,OPHTHALMOLOGY,0003-9950,ARCHIVES OF OPHTHALMOLOGY,10.0,a randomized clinical trial of a single dose o...,OPHTHALMOLOGY,11,TORIDE KYODO GEN HOSP;TORIDE KYODO GEN HOSP,UNIV TSUKUBA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20255,23657,A Comprehensive Study of Internal Distraction ...,(HSC) Medicine - Orthopedics;(HSC) Medicine - ...,"Department of Orthopaedics, Sanjeevani Multisp...",Journal of Clinical and Diagnostic Research (J...,JCDR Research and Publications Limited,India,Mohit J Jain;Kinjal J Mavani,,Clinical Study;,...,"JAIN, MOHIT/AAY-4869-2020",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,9.0,a comprehensive study of internal distraction ...,"MEDICINE, GENERAL & INTERNAL",10,SANJEEVANI MULTISPECIAL HOSP;MARATHA MANDAL INST,MJ (CORRESPONDING AUTHOR)
20256,18340,Evaluation of Impact of Pregnancy on Oral Heal...,(HSC) Medicine - Dentistry;(HSC) Medicine - Ob...,"Department of Public Health Dentistry, Governm...",Journal of Clinical and Diagnostic Research (J...,JCDR Research and Publications Limited,India;Nepal;Pakistan,Aasim Farooq Shah;Manu Batra;Ambrina Qureshi,,Research Article;,...,,GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,6.0,evaluation of impact of pregnancy on oral heal...,"MEDICINE, GENERAL & INTERNAL",7,DOW UNIVERSITY OF HEALTH SCIENCES,GOVT COLL AND HOSP
20257,30813,Correlation Curve Correction and Spinal Length...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedic, Bone and Joint Reco...",Journal of Clinical and Diagnostic Research (J...,JCDR Research and Publications Limited,Iran,Hasan Ghandhari;Mir Bahram Safari;Ebrahim Amer...,,Clinical Study;,...,"SAFARI, MIR BAHRAM/N-5081-2017; GHANDHARI, HAS...",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,1.0,correlation curve correction and spinal length...,"MEDICINE, GENERAL & INTERNAL",1,IRAN UNIVERSITY OF MEDICAL SCIENCES; URMIA UNI...,URMIA UNIV MED SCI
20258,30815,Comparative Evaluation of Apical Microleakage ...,(BLS) Biochemistry;(BLS) Biology - Cellular;(H...,Department of Conservative Dentistry and Endod...,Journal of Clinical and Diagnostic Research (J...,JCDR Research and Publications Limited,India,Swadheena Patro;Satyajit Mohapatra;Sumita Mishra,,Research Article;,...,"PATRO, SWADHEENA/AAZ-8764-2021",GENERAL & INTERNAL MEDICINE,2249-782X,JOURNAL OF CLINICAL AND DIAGNOSTIC RESEARCH,0.0,comparative evaluation of apical microleakage ...,"MEDICINE, GENERAL & INTERNAL",0,SIKSHA 'O' ANUSANDHAN UNIVERSITY; SIKSHA 'O' A...,SOA UNIV
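An inner merge silently drops every record whose DOI has no exact string match, and DOIs that differ only in case, surrounding whitespace, or a resolver prefix will not match. The sketch below, on toy data, normalises the keys first and uses `indicator=True` to count matched and unmatched records. Whether the real DOI columns need this normalisation is not established here; `normalize_doi`, `doi_key`, and the toy frames are assumptions.

```python
import pandas as pd


def normalize_doi(series):
    """Trim whitespace, lowercase, and drop a leading doi.org resolver URL."""
    return (
        series.astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"^https?://(dx\.)?doi\.org/", "", regex=True)
    )


# Toy stand-ins for filtered_rwd and filtered_wos
rwd_toy = pd.DataFrame({
    "Record ID": [1, 2, 3],
    "OriginalPaperDOI": ["10.1000/ABC.1", " 10.1000/xyz.2", "10.1000/none.3"],
})
wos_toy = pd.DataFrame({
    "doi": ["10.1000/abc.1", "https://doi.org/10.1000/xyz.2"],
    "total_times_cited_count": [52, 7],
})

rwd_toy["doi_key"] = normalize_doi(rwd_toy["OriginalPaperDOI"])
wos_toy["doi_key"] = normalize_doi(wos_toy["doi"])

# A left merge with indicator=True shows which records found a partner
checked = rwd_toy.merge(wos_toy, how="left", on="doi_key", indicator=True)
print(checked["_merge"].value_counts())

# Keeping only the rows flagged "both" reproduces the inner merge
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
```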


In [302]:
processed_data_retractions['doi'].value_counts()

doi
10.1001/archinte.165.15.1737    1
10.1155/2021/9968016            1
10.1155/2021/9985041            1
10.1155/2021/9984003            1
10.1155/2021/9983023            1
                               ..
10.1016/j.neucom.2016.07.081    1
10.1016/j.neucom.2015.03.059    1
10.1016/j.neucom.2013.08.027    1
10.1016/j.neucom.2012.06.051    1
10.7883/yoken.JJID.2014.147     1
Name: count, Length: 20260, dtype: int64
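The `value_counts()` output has as many entries (20260) as the merged frame has rows, so each DOI appears once. A more direct way to express this check is sketched below on toy data; the frame `merged_toy` is an assumption.

```python
import pandas as pd

# Toy frame with one DOI repeated on purpose
merged_toy = pd.DataFrame({
    "Record ID": [10, 11, 12, 13],
    "doi": ["10.1000/a", "10.1000/b", "10.1000/b", "10.1000/c"],
})

# True only if no DOI occurs more than once
is_unique = merged_toy["doi"].is_unique

# keep=False flags every occurrence of a repeated DOI, not only the later ones
repeated = merged_toy[merged_toy["doi"].duplicated(keep=False)]

print(is_unique)
print(repeated)
```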

In [303]:
# rwd_wo_doi = rwd[rwd['OriginalPaperDOI'].isnull() | (rwd['OriginalPaperDOI'] == '')]
# wos_dois_wo_doi = wos_dois[wos_dois['doi'].isnull() | (wos_dois['doi'] == '')]
# data_retractions_by_title = rwd_wo_doi.merge(wos_dois, how= 'inner', left_on= 'Title', right_on= 'document_title')
# processed_data_retractions = pd.concat([processed_data_retractions, data_retractions_by_title], ignore_index = True)
# processed_data_retractions = processed_data_retractions.drop_duplicates()

In [305]:
#data_retractions_by_title

In [306]:
processed_data_retractions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20260 entries, 0 to 20259
Data columns (total 52 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Record ID                              20260 non-null  int64         
 1   Title                                  20260 non-null  object        
 2   Subject                                20260 non-null  object        
 3   Institution                            20259 non-null  object        
 4   Journal                                20260 non-null  object        
 5   Publisher                              20260 non-null  object        
 6   Country                                20260 non-null  object        
 7   Author                                 20260 non-null  object        
 8   URLS                                   7933 non-null   object        
 9   ArticleType                            20260 non-null  object

<a class="anchor"> 

## 2.2 - Retractions Data with journals (rwd_wos+scimago) <a class="anchor" id="section_2_2"></a>

In [307]:
journals = pd.read_csv('../scimagojr_2022.csv', sep=';')
journals.head(5)

Unnamed: 0,Rank,Sourceid,Title,Type,Issn,SJR,SJR Best Quartile,H index,Total Docs. (2022),Total Docs. (3years),...,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country,Region,Publisher,Coverage,Categories,Areas
0,1,28773,Ca-A Cancer Journal for Clinicians,journal,"15424863, 00079235",86091,Q1,198,44,118,...,30318,85,29999,9700,United States,Northern America,Wiley-Blackwell,1950-2022,Hematology (Q1); Oncology (Q1),Medicine
1,2,29431,Quarterly Journal of Economics,journal,"00335533, 15314650",36730,Q1,292,36,122,...,2141,122,1483,6661,United Kingdom,Western Europe,Oxford University Press,1886-2022,Economics and Econometrics (Q1),"Economics, Econometrics and Finance"
2,3,20315,Nature Reviews Molecular Cell Biology,journal,"14710072, 14710080",34201,Q1,485,121,328,...,13331,156,3547,8929,United Kingdom,Western Europe,Nature Publishing Group,2000-2022,Cell Biology (Q1); Molecular Biology (Q1),"Biochemistry, Genetics and Molecular Biology"
3,4,18434,Cell,journal,"00928674, 10974172",26494,Q1,856,420,1637,...,67791,1440,4380,6574,United States,Northern America,Cell Press,1974-2022,"Biochemistry, Genetics and Molecular Biology (...","Biochemistry, Genetics and Molecular Biology"
4,5,15847,New England Journal of Medicine,journal,"00284793, 15334406",26015,Q1,1130,1410,4561,...,133956,1854,3393,1021,United States,Northern America,Massachussetts Medical Society,1945-2022,Medicine (miscellaneous) (Q1),Medicine


In [308]:
journals['SJR Best Quartile'].value_counts()

SJR Best Quartile
Q1    7186
Q2    5393
Q3    3735
Q4    1702
-       20
Name: count, dtype: int64

In [309]:
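# Work on a copy so the assignments below do not trigger SettingWithCopyWarning
quartiles = journals[['Issn', 'SJR Best Quartile']].copy()
# A journal can list several ISSNs separated by commas: one row per ISSN
quartiles['Issn'] = quartiles['Issn'].str.split(',')
quartiles = quartiles.explode('Issn')
quartiles['Issn'] = quartiles['Issn'].str.replace(' ', '')
# Drop the empty strings left by missing ISSNs
quartiles = quartiles[quartiles['Issn'].str.strip() != ""]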
quartiles


Unnamed: 0,Issn,SJR Best Quartile
0,15424863,Q1
0,00079235,Q1
1,00335533,Q1
1,15314650,Q1
2,14710072,Q1
...,...,...
18031,18780814,-
18032,01795953,-
18033,00428779,-
18034,18293824,-


In [310]:
quartiles.to_excel('quartiles.xlsx')

In [311]:
processed_data_retractions['issn'] = processed_data_retractions['issn'].str.replace('-', '')

In [312]:
rw_merged = processed_data_retractions.merge(quartiles, how='left', left_on='issn', right_on='Issn')
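A left merge against a lookup table multiplies rows whenever a key occurs more than once on the right-hand side. The sketch below shows how `validate="many_to_one"` could guard this step; the toy frames `papers` and `lookup` are assumptions, with a duplicated ISSN inserted on purpose.

```python
import pandas as pd

papers = pd.DataFrame({
    "doi": ["d1", "d2", "d3"],
    "issn": ["00039926", "00039926", "2249782X"],
})
# Lookup with a duplicated ISSN, as could happen after explode()
lookup = pd.DataFrame({
    "Issn": ["00039926", "00039926", "2249782X"],
    "SJR Best Quartile": ["Q1", "Q1", "Q3"],
})

# validate="many_to_one" raises MergeError if the right-hand keys are not unique
try:
    papers.merge(lookup, how="left", left_on="issn", right_on="Issn",
                 validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# After de-duplicating the lookup, the merge keeps exactly one row per paper
lookup_unique = lookup.drop_duplicates(subset="Issn")
merged = papers.merge(lookup_unique, how="left", left_on="issn",
                      right_on="Issn", validate="many_to_one")
print(duplicate_keys_detected, len(merged))
```

In the notebook, the row count stays at 20260 after the merge, which is consistent with unique keys in `quartiles`; the `validate` argument only makes that assumption explicit.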

In [313]:
rw_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20260 entries, 0 to 20259
Data columns (total 54 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Record ID                              20260 non-null  int64         
 1   Title                                  20260 non-null  object        
 2   Subject                                20260 non-null  object        
 3   Institution                            20259 non-null  object        
 4   Journal                                20260 non-null  object        
 5   Publisher                              20260 non-null  object        
 6   Country                                20260 non-null  object        
 7   Author                                 20260 non-null  object        
 8   URLS                                   7933 non-null   object        
 9   ArticleType                            20260 non-null  object

In [314]:
rw_merged['SJR Best Quartile'].value_counts()

SJR Best Quartile
Q1    9761
Q2    6192
Q3    1178
Q4     239
-      234
Name: count, dtype: int64

In [315]:
rw_merged.groupby('SJR Best Quartile')['issn'].nunique()

SJR Best Quartile
-        2
Q1    1985
Q2    1203
Q3     443
Q4      81
Name: issn, dtype: int64

In [316]:
rw_merged['issn'].nunique() - rw_merged['Issn'].nunique()

308
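The difference above counts ISSNs in the merged data that found no SCImago entry (`nunique()` ignores missing values). To see which journals these are rather than only how many, the unmatched rows can be listed directly, as in this sketch on toy data. The frame `rw_merged_toy` and its values are assumptions that mimic the column names of `rw_merged`.

```python
import pandas as pd

# Toy stand-in for rw_merged: 'Issn' is missing where the left merge found no match
rw_merged_toy = pd.DataFrame({
    "issn": ["00039926", "00039926", "2249782X", "12345678"],
    "Issn": ["00039926", "00039926", None, None],
    "publication_name": ["JOURNAL A", "JOURNAL A", "JOURNAL B", "JOURNAL C"],
})

# Rows without a SCImago match, summarised per journal
unmatched = (
    rw_merged_toy[rw_merged_toy["Issn"].isna()]
    .groupby(["issn", "publication_name"])
    .size()
    .reset_index(name="n_papers")
)
print(unmatched)
```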

In [317]:
rw_merged['Issn'].nunique()

3714

In [318]:
datetime_variable_conversion = {"RetractionDate": "object",
                                "OriginalPaperDate": "object",
                                "publication_date": "object",
                                "early_access_date": "object"
                                }

rw_merged = rw_merged.astype(datetime_variable_conversion)

In [319]:
rw_merged.to_excel('./retractions_data/rw_wos_scimago_by_dois.xlsx', index = False)