The purpose of this notebook is to combine the data from the different sources into the following datasets:
- *retractions*: joins data from the Retraction Watch database (RW) with bibliometric data on retractions from Web of Science (RD).
- *retracted_in_journals*: joins data from the Retraction Watch database (RW), journal metrics from SCImago, and bibliometric data on all articles in the best-ranked journals (JD).


* [Chapter 0 - Libraries](#chapter0)
* [Chapter 1 - Individual Analysis](#chapter1)
    * [1.1 - Retraction Watch Database (RWD)](#section_1_1)
    * [1.2 - Control Set (CS)](#section_1_2)
    * [1.3 - Citation Data (CIT)](#section_1_3)
    * [1.4 - Corrections Data (COR)](#section_1_4)
* [Chapter 2 - Merge Data](#chapter2)   
    * [2.1 - Retraction Data (RD)](#section_2_1)

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 0 - Libraries <a class="anchor" id="chapter0"></a>

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
#import matplotlib.pyplot as plt
#import seaborn as sns
#import plotly.express as px

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 1 - Individual Analysis <a class="anchor" id="chapter1"></a>

<a class="anchor"> 

## 1.1 - Retraction Watch Database (RWD) <a class="anchor" id="section_1_1"></a>

In [2]:
rwd = pd.read_excel('./retractions_data/retraction_watch_database.xlsx', dtype={'RetractionPubMedID': object, 'OriginalPaperPubMedID': object})
rwd.head()

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
0,47271,Binding of DCC by Netrin-1 to Mediate Axon Gui...,(BLS) Biology - Cellular;(BLS) Biology - Gener...,Departments of Anatomy and of Biochemistry and...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Yimin Zou;Mu-ming Poo;Marc Tessier-...,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1521,0,2001-03-09 00:00:00,10.1126/science.1059391,11239160,Retraction,+Investigation by Company/Institution;+Manipul...,No,
1,47270,Hierarchical Organization of Guidance Receptor...,(BLS) Biochemistry;(BLS) Biology - General;(BL...,Department of Anatomy and Department of Bioche...,Science,American Association for the Advancement of Sc...,United States,Elke Stein;Marc Tessier-Lavigne,https://retractionwatch.com/2023/08/31/stanfor...,Research Article;,2023-08-31 00:00:00,10.1126/science.adk1517,0,2001-02-08 00:00:00,10.1126/science.1058445,11239147,Retraction,+Duplication of Image;+Investigation by Compan...,No,
2,47243,Therapeutic potential of targeting IRES-depend...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Division of Hematology-Oncology, UCLA-Greater ...",Oncogene,Springer - Nature Publishing Group,United States,Y Shi;Y Yang;C Bardeleben;B Holmes;J Gera;Alan...,,Research Article;,2023-08-31 00:00:00,10.1038/s41388-023-02820-5,0,2015-05-11 00:00:00,10.1038/onc.2015.156,25961916,Retraction,+Concerns/Issues About Data;+Concerns/Issues A...,No,see also: https://pubpeer.com/publications/704...
3,47233,A classifier based on 273 urinary peptides pre...,(BLS) Biochemistry;(HSC) Medicine - Cardiovasc...,"Department of Nephrology, The Third Affiliated...",Journal of Hypertension,Wolters Kluwer - Lippincott Williams & Wilkins,China,Lirong Lin;Chunxuan Wang;Jiangwen Ren;Mei Mei;...,,Research Article;,2023-08-30 00:00:00,10.1097/HJH.0000000000003551,37642599,2023-08-01 00:00:00,10.1097/HJH.0000000000003467,37199562,Retraction,+Concerns/Issues About Results;+Investigation ...,No,see also https://journals.lww.com/jhypertensio...
4,47227,"Age, Gender Demographics and Comorbidity Preva...",(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedics, Dhanalakshmi Srini...",Journal of Coastal Life Medicine,Journal of Coastal Life Medicine,India,S Venkatesh Kumar;Mohith Singh;Gowtham Singh;K...,,Research Article;,2023-08-30 00:00:00,unavailable,0,2023-01-01 00:00:00,unavailable,0,Retraction,+Notice - Lack of;+Withdrawal;,No,"date of retraction unknown, article title repl..."


In [3]:
rwd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42700 entries, 0 to 42699
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Record ID              42700 non-null  int64 
 1   Title                  42700 non-null  object
 2   Subject                42700 non-null  object
 3   Institution            42699 non-null  object
 4   Journal                42700 non-null  object
 5   Publisher              42700 non-null  object
 6   Country                42700 non-null  object
 7   Author                 42700 non-null  object
 8   URLS                   21687 non-null  object
 9   ArticleType            42700 non-null  object
 10  RetractionDate         42700 non-null  object
 11  RetractionDOI          42209 non-null  object
 12  RetractionPubMedID     37599 non-null  object
 13  OriginalPaperDate      42700 non-null  object
 14  OriginalPaperDOI       40173 non-null  object
 15  OriginalPaperPubMed

In [4]:
# Convert the date columns to datetime; unparseable retraction dates become NaT
rwd['RetractionDate'] = pd.to_datetime(rwd['RetractionDate'], errors='coerce')
rwd['OriginalPaperDate'] = pd.to_datetime(rwd['OriginalPaperDate'])

In [5]:
# Drop rows with a missing RetractionDOI so the string test below does not hit NaN
rwd_filtered = rwd.dropna(subset=['RetractionDOI'])

# List any DOIs that still carry the "http://dx.doi.org/" resolver prefix
rwd_filtered[rwd_filtered['RetractionDOI'].str.startswith("http://dx.doi.org/")]['RetractionDOI']

Series([], Name: RetractionDOI, dtype: object)
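The check above looks for a single resolver prefix. A minimal sketch of a more general normaliser follows, assuming other forms such as `https://doi.org/` or `doi:` could also occur. The helper name `normalize_doi` and the prefix list are hypothetical and not part of this notebook; lowercasing relies on DOIs being case-insensitive.

```python
import pandas as pd

# Hypothetical helper: the prefix list is an assumption, not taken from the data.
DOI_PREFIX = r"^(?:https?://(?:dx\.)?doi\.org/|doi:)"

def normalize_doi(dois: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and resolver prefixes, then lowercase."""
    return (
        dois.str.strip()
            .str.replace(DOI_PREFIX, "", regex=True, case=False)
            .str.lower()
    )

sample = pd.Series([
    "http://dx.doi.org/10.1126/science.1059391",
    "https://doi.org/10.1002/PA.2423",
    " 10.1038/onc.2015.156 ",
    None,
])
normalized = normalize_doi(sample)
print(normalized.tolist())
```

Missing values pass through unchanged, so the helper could be applied before any `dropna` step.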

### Duplicates

In theory, each article has exactly one DOI and each retracted paper has exactly one record in the database, so every DOI should be unique.

In [6]:
rwd['OriginalPaperDOI'].nunique()

36846

In [7]:
rwd['RetractionDOI'].nunique()

36506

In [8]:
testing_dupes = rwd[rwd.duplicated(subset='OriginalPaperDOI', keep=False)]
testing_dupes['OriginalPaperDOI'].value_counts()

OriginalPaperDOI
Unavailable                        2234
unavailable                        1074
10.1136/jim-2021-SRMC                 6
10.1002/tox.21941                     2
10.1016/j.lfs.2019.116709             2
10.1038/s41598-021-03765-z            2
10.1007/s12275-012-2294-z             2
10.1016/j.cej.2011.04.016             2
10.1016/j.swevo.2021.100868           2
10.1016/j.esxm.2021.100447            2
10.1093/jge/aabc74                    2
10.1088/1742-2140/aaaf57              2
10.1088/1742-2140/aa953a              2
10.1016/j.carbpol.2019.115799         2
10.1001/archpediatrics.2012.999       2
10.1007/s13277-014-2995-5             2
10.3109/02699052.2016.1162060         2
10.1016/j.rapm.2005.05.009            2
10.1524/9783486834062.275             2
Name: count, dtype: int64
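Most of the repeated values above are the placeholder `unavailable` in two capitalisations rather than real DOIs. A minimal sketch of counting only genuine repeats, using a toy frame in place of `rwd` (all values below are illustrative):

```python
import pandas as pd

# Toy frame standing in for rwd; values are illustrative only.
toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4, 5],
    "OriginalPaperDOI": ["Unavailable", "unavailable", "10.1000/a", "10.1000/a", "10.1000/b"],
})

# Treat the placeholder (in any capitalisation) as a missing DOI.
is_placeholder = toy["OriginalPaperDOI"].str.lower().eq("unavailable")
doi = toy["OriginalPaperDOI"].mask(is_placeholder)

# value_counts ignores missing values, so only real DOIs are counted.
counts = doi.value_counts()
repeated = counts[counts > 1]
print(repeated)
```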

In [9]:
filtered_dupes = testing_dupes[testing_dupes['OriginalPaperDOI'].str.lower() != 'unavailable'].sort_values('OriginalPaperDOI')
filtered_dupes

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
21229,14089,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;Retracted Article;,2017-10-20,10.1001/jamapediatrics.2017.4603,0,2012-10-01,10.1001/archpediatrics.2012.999,22911396,Retraction,+Breach of Policy by Author;+Error in Data;+Er...,No,Journal previously named Archives of Pediatric...
21230,11994,Can Branding Improve School Lunches?,(B/T) Business - Marketing;(BLS) Nutrition;(SO...,Charles H. Dyson School of Applied Economics a...,JAMA Pediatrics,JAMA Network,United States,Brian Wansink;David R Just;Collin R Payne,http://retractionwatch.com/?s=brian+wansink;ht...,Letter;Research Article;,2017-09-21,10.1001/jamapediatrics.2017.3136,28973133,2012-10-01,10.1001/archpediatrics.2012.999,22911396,Retraction,+Error in Analyses;+Error in Data;+Error in Me...,No,note: the paper was retracted again on October...
5685,38940,"Erratum to: Î±,Î²-Unsaturated aldehyde polluta...",(BLS) Biology - Cellular;(BLS) Toxicology;,"Department of Clinical Immunology, Xijing Hosp...",Environmental Toxicology,Wiley,China;United States,Zhenbiao Wu;Emily Y He;Glenda I Scott;Jun Ren,https://retractionwatch.com/2022/07/25/univers...,Correction/Erratum/Corrigendum;,2022-07-27,10.1002/tox.23620,35894684,2021-09-12,10.1002/tox.21941,34514704,Retraction,+Updated to Retraction;,No,
5696,38375,"Î±,Î²-Unsaturated aldehyde pollutant acrolein ...",(BLS) Biology - Cellular;(BLS) Toxicology;,"Department of Clinical Immunology, Xijing Hosp...",Environmental Toxicology,Wiley,China;United States,Zhenbiao Wu;Emily Y He;Glenda I Scott;Jun Ren,https://retractionwatch.com/2022/07/25/univers...,Research Article;,2022-07-27,10.1002/tox.23620,35894684,2013-12-23,10.1002/tox.21941,24376112,Retraction,+Falsification/Fabrication of Image;+Investiga...,No,
6643,37342,Identification of the Vibrio vulnificus htpG G...,(BLS) Genetics;(BLS) Microbiology;,"Department of Agricultural Biotechnology, Seou...",Journal of Microbiology,Springer,South Korea,Slae Choi;Kyungku Jang;Seulah Choi;Hee Jee Yun...,,Research Article;,2022-05-23,10.1007/s12275-022-1680-4,35606641,2012-08-25,10.1007/s12275-012-2294-z,22923124,Retraction,+Concerns/Issues About Authorship;+Upgrade/Upd...,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42578,4780,Phenylephrine stress in the evaluation of pati...,(HSC) Medicine - Cardiology;(HSC) Medicine - P...,"Department of Radiology, University of Califor...",Investigative Radiology,Wolters Kluwer,United States,Robert A Slutsky,http://retractionwatch.com/the-retraction-watc...,Research Article;,1986-02-01,,3514538,1983-03-01,,6345451,Retraction,+Concerns/Issues About Results;+Legal Reasons/...,No,
42579,4781,Thallium pulmonary scintigraphy. Relationship ...,(HSC) Medicine - Cardiology;(HSC) Medicine - P...,"Department of Radiology, University of Califor...",Investigative Radiology,Wolters Kluwer,United States,Robert A Slutsky,http://retractionwatch.com/the-retraction-watc...,Research Article;,1986-02-01,,3514538,1984-11-01,,6392156,Retraction,+Concerns/Issues About Results;+Legal Reasons/...,No,"Article is Nov/Dec 1984 (vol. 19, iss. 6, no d..."
42611,1494,Specific antigen exclusion and non-specific fa...,(BLS) Biology - Molecular;,"Department of Immunology, Institute of Child H...",Clinical and Experimental Immunology,Blackwell Publishing,United Kingdom,S A Roberts;M C Reinhardt;R Paganelli;R J Levi...,,Research Article;,1985-01-01,,3882286,1981-07-01,,6171369,Retraction,+Error in Analyses;+Results Not Reproducible;+...,No,No DOI for Original/Notice 3/24/2017;
42613,4249,Concurrent measurement of plasma levels of vit...,(HSC) Medicine - Cardiovascular;(HSC) Medicine...,Endocrinology-Mineral Metabolism and Nephrolog...,Translational Research: The Journal of Laborat...,Elsevier,United States,PW Lambert;PB DeOreo;BW Hollis;IY Fu;DJ Ginsbe...,,Research Article;,1984-10-01,,6384395,1981-10-01,,6270222,Retraction,+Notice - Unable to Access via current resources;,Unknown,Journal formerly known as: The Journal of Labo...


In [10]:
def find_changed_columns(group):
    changed_cols = group.apply(lambda x: x.nunique()).drop(['OriginalPaperDOI', 'Record ID'])
    value = changed_cols[changed_cols>1].index.to_list()
    return value


filtered_dupes.groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index()

Unnamed: 0,OriginalPaperDOI,0
0,10.1001/archpediatrics.2012.999,"[ArticleType, RetractionDate, RetractionDOI, R..."
1,10.1002/tox.21941,"[Title, ArticleType, OriginalPaperDate, Origin..."
2,10.1007/s12275-012-2294-z,"[RetractionDate, RetractionDOI, RetractionPubM..."
3,10.1007/s13277-014-2995-5,"[ArticleType, RetractionDate, RetractionDOI, R..."
4,10.1016/j.carbpol.2019.115799,"[RetractionDate, RetractionDOI, RetractionPubM..."
5,10.1016/j.cej.2011.04.016,"[RetractionDate, RetractionDOI, Notes]"
6,10.1016/j.esxm.2021.100447,"[RetractionDate, RetractionDOI, RetractionPubM..."
7,10.1016/j.lfs.2019.116709,"[Subject, RetractionDate, RetractionDOI, Retra..."
8,10.1016/j.rapm.2005.05.009,"[RetractionDate, RetractionDOI, RetractionPubM..."
9,10.1016/j.swevo.2021.100868,"[RetractionDate, RetractionDOI, Notes]"
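To make the output above easier to read, here is a small self-contained sketch of the same idea on toy data. It is an alternative formulation that uses `groupby().nunique()` instead of `groupby().apply()`; the toy values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Record ID": [10, 11, 12, 13],
    "OriginalPaperDOI": ["10.1000/a", "10.1000/a", "10.1000/b", "10.1000/b"],
    "Title": ["Same title", "Same title", "Title one", "Title two"],
    "RetractionDate": ["2017-09-21", "2017-10-20", "2016-08-17", "2016-08-17"],
})

# Distinct values per column within each DOI group (NaN is not counted).
n_distinct = toy.drop(columns="Record ID").groupby("OriginalPaperDOI").nunique()

# For each DOI, list the columns that take more than one value.
changed = {doi: row.index[row > 1].to_list() for doi, row in n_distinct.iterrows()}
print(changed)
```

Because `nunique` skips missing values, a column that is filled in one record and empty in the other is not reported as differing.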


In [11]:
# The following code is commented out so as not to overwrite the changes made in the file
# with pd.ExcelWriter('./wos_rd/DOI_Duplicated_RWD.xlsx') as writer:
#     filtered_dupes.groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index().to_excel(writer, sheet_name= "Differing vars",index = False)
#     filtered_dupes.to_excel(writer, sheet_name = "Duplicate records",index = False)

In [12]:
rwd[rwd['OriginalPaperDOI']=='10.3109/02699052.2016.1162060']

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
23288,8429,Are rehabilitation outcomes after anoxic brain...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,Unavailable,Brain Injury,Taylor and Francis,Netherlands;Unknown,Emre Adiguzel;Evren Yasar;Yasin Demir;Ismail S...,,Conference Abstract/Paper;,2016-08-17,10.1080/02699052.2016.1210325,27533125,2016-05-19,10.3109/02699052.2016.1162060,27196965,Retraction,+Notice - Limited or No Information;,No,Part of Accepted Abstracts from the Internatio...
23290,6000,The effect of demographic and clinical charact...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,Unavailable,Brain Injury,Taylor and Francis,Netherlands;Unknown,Evren Yasar;Serdar Kesikburun;Ummugulsum Dogan...,,Conference Abstract/Paper;,2016-08-17,10.1080/02699052.2016.1210325,27533125,2016-05-19,10.3109/02699052.2016.1162060,27196965,Retraction,+Notice - Limited or No Information;,No,


In [13]:
def find_changed_columns(group):
    changed_cols = group.apply(lambda x: x.nunique()).drop(['OriginalPaperDOI', 'Record ID'])
    value = changed_cols[changed_cols>1].index.to_list()
    return value

filtered_dupes[(filtered_dupes['Record ID'] == 8429) | (filtered_dupes['Record ID'] == 6000)].groupby('OriginalPaperDOI').apply(find_changed_columns).reset_index()

Unnamed: 0,OriginalPaperDOI,0
0,10.3109/02699052.2016.1162060,"[Title, Author]"


In [14]:
pd.set_option('display.max_colwidth', None)  # Show full content of columns
pd.set_option('display.max_rows', None)      # Display all rows

In [15]:
filtered_dupes[(filtered_dupes['Record ID'] == 30679) | (filtered_dupes['Record ID'] == 30686)]['Title']

11353    Acute kidney injury and collapsing glomerulopathy associated with COVID-19 and APOL1 high risk genotype Abstract 621
11352    Acute kidney injury and collapsing glomerulopathy associated with COVID-19 and APOL1 high risk genotype Abstract 111
Name: Title, dtype: object

In [16]:
filtered_dupes[(filtered_dupes['Record ID'] == 30687) | (filtered_dupes['Record ID'] == 30691)]['Title']

11365    Filter clotting, anticoagulation and duration of sled in patients with COVID-19 and acute kidney injury Abstract 643
11364    Filter clotting, anticoagulation and duration of sled in patients with COVID-19 and acute kidney injury Abstract 112
Name: Title, dtype: object

In [17]:
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [18]:
# records that should be deleted
records_to_delete = [6000, 2175, 7242]
rwd = rwd[~rwd['Record ID'].isin(records_to_delete)]

In [19]:
rwd.sort_values(by=['OriginalPaperDOI', 'OriginalPaperDate'], ascending=[True, False], inplace=True)

# Keep only the first occurrence of each unique DOI (the most recent date)
filtered_rwd = rwd.drop_duplicates(subset='OriginalPaperDOI')
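One caveat with de-duplicating on `OriginalPaperDOI`: rows whose DOI is missing or set to the `unavailable` placeholder all share the same key, so they would collapse into a single row. A minimal sketch that de-duplicates only rows with a real DOI, assuming the placeholder spelling seen earlier; the toy frame is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Record ID": [1, 2, 3, 4, 5],
    "OriginalPaperDOI": ["10.1000/a", "10.1000/a", "unavailable", "Unavailable", None],
    "OriginalPaperDate": pd.to_datetime(
        ["2012-10-01", "2013-12-23", "2001-01-01", "2002-01-01", "1984-11-01"]
    ),
})

doi = toy["OriginalPaperDOI"]
# A usable DOI is neither missing nor the placeholder (any capitalisation).
has_doi = ~doi.fillna("unavailable").str.lower().eq("unavailable")

# De-duplicate only rows with a real DOI, keeping the latest OriginalPaperDate.
with_doi = (
    toy[has_doi]
    .sort_values(["OriginalPaperDOI", "OriginalPaperDate"], ascending=[True, False])
    .drop_duplicates(subset="OriginalPaperDOI")
)

# Rows without a usable DOI are kept untouched.
deduped = pd.concat([with_doi, toy[~has_doi]]).sort_index()
print(deduped["Record ID"].tolist())
```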

### Title analysis

In [20]:
rwd[rwd['Title'].str.startswith("Retracted:")]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
13359,45110,Retracted: Lifting the lid on lobbying in Indi...,(B/T) Government;,"Department of Management Studies, Indian Insti...",Journal of Public Affairs,Wiley,India,Pankaj K P Shreyaskar;Pramod Pathak,,Research Article;,2020-09-21,10.1002/pa.2423,0,2020-09-21,10.1002/pa.2423,0,Retraction,+Date of Retraction/Other Unknown;+Euphemisms ...,No,
1326,46570,Retracted: miR-214-3p Protects and Restores th...,(BLS) Biology - Molecular;(BLS) Genetics;(HSC)...,Key Laboratory of Advanced Technologies of Mat...,Evidence-Based Complementary and Alternative M...,Hindawi,China,Yuan Cheng;Qing He;Tao Jin;Na Li,https://retractionwatch.com/2022/09/28/exclusi...,Research Article;,2023-06-21,10.1155/2023/9823451,37388114,2022-07-18,10.1155/2022/1175935,35899226,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,No,See also: https://pubpeer.com/publications/C08...


In [21]:
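# Strip a leading "Retracted:" marker together with the whitespace that follows it
rwd['Title'] = rwd['Title'].str.replace(r'^Retracted:\s*', '', regex=True)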

In [22]:
rwd[rwd['Title'].str.startswith("Retracted:")]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes


In [23]:
rwd.iloc[[1326,13359]]

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
12550,25797,LncRNA ATB promotes proliferation and metastas...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Department of Respiratory Medicine, The Affili...",Journal of Cellular Biochemistry,Wiley,China,Yiwei Cao;Xiangjun Luo;Xiaoqian Ding;Shichao C...,http://retractionwatch.com/2021/03/08/journal-...,Research Article;,2020-12-15,10.1002/jcb.29877,33590514,2018-04-25,10.1002/jcb.26894,29693289,Retraction,+Concerns/Issues About Data;+Concerns/Issues a...,No,see also: https://pubpeer.com/publications/B2B...
11996,44387,Preparation of self-healing anti-corrosion coa...,(PHY) Engineering - Chemical;(PHY) Materials S...,"Department of Materials Engineering, Isfahan U...",Surface Engineering,Taylor and Francis,Iran,Sogand Abbaspour;Ali Ashrafi;Mehdi Salehi,,Research Article;,2021-02-01,10.1080/02670844.2021.1883242,0,2019-11-21,10.1080/02670844.2019.1689641,0,Retraction,+Concerns/Issues About Image;+Concerns/Issues ...,No,
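The rows shown here carry index labels 12550 and 11996 rather than 1326 and 13359. `iloc` selects by position in the current row order, whereas `loc` selects by index label. A small sketch of the difference on toy data:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Title": ["c-title", "a-title", "b-title"]},
    index=[0, 1, 2],
)
sorted_toy = toy.sort_values("Title")  # index order is now 1, 2, 0

by_position = sorted_toy.iloc[[0]]  # first row in the current order
by_label = sorted_toy.loc[[0]]      # row whose index label is 0
print(by_position["Title"].item(), by_label["Title"].item())
```

Since `rwd` was sorted by `OriginalPaperDOI` earlier in the notebook, re-inspecting the two records found in the `Retracted:` check would presumably need `rwd.loc[[1326, 13359]]`.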


In [24]:
# regex=False matches the literal text, parentheses included
rwd[rwd['Title'].str.contains("(Withdrawn Publication)", regex=False)]


Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
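`str.contains` interprets its pattern as a regular expression by default, so parentheses act as a group instead of matching literal brackets. A short sketch of the two interpretations on invented titles:

```python
import pandas as pd

titles = pd.Series([
    "Some paper (Withdrawn Publication)",
    "Withdrawn Publication notice",
    "Unrelated title",
])

# Literal match: the parentheses must be present in the title.
literal = titles.str.contains("(Withdrawn Publication)", regex=False)

# Regex match with a non-capturing group: the parentheses are syntax only.
as_regex = titles.str.contains(r"(?:Withdrawn Publication)")
print(literal.tolist(), as_regex.tolist())
```

Which of the two is wanted depends on whether the marker always appears in brackets in the titles.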


## Title Analysis

In [25]:
# Remove trailing whitespace from the title
rwd['Title'] = rwd['Title'].str.rstrip()

# Convert the title to lower case
rwd['Title'] = rwd['Title'].str.lower()

In [26]:
rwd.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42697 entries, 29155 to 42695
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Record ID              42697 non-null  int64         
 1   Title                  42697 non-null  object        
 2   Subject                42697 non-null  object        
 3   Institution            42696 non-null  object        
 4   Journal                42697 non-null  object        
 5   Publisher              42697 non-null  object        
 6   Country                42697 non-null  object        
 7   Author                 42697 non-null  object        
 8   URLS                   21686 non-null  object        
 9   ArticleType            42697 non-null  object        
 10  RetractionDate         42697 non-null  datetime64[ns]
 11  RetractionDOI          42206 non-null  object        
 12  RetractionPubMedID     37596 non-null  object        
 13  Or

<a class="anchor"> 

## 1.2 - Control Set (CS) <a class="anchor" id="section_1_2"></a>

In [27]:
control_set_imported = pd.DataFrame()

In [28]:
import glob
import pyarrow.parquet as pq

In [29]:
glob.glob("../thesis_data/processed_data/*.parquet")

['../thesis_data/processed_data\\WoS_journals0-99_Rdata.parquet',
 '../thesis_data/processed_data\\WoS_journals1400-1499_Rdata.parquet',
 '../thesis_data/processed_data\\WoS_journals1600-1803_P1_Rdata.parquet',
 '../thesis_data/processed_data\\WoS_journals1600-1803_P2_Rdata.parquet',
 '../thesis_data/processed_data\\WoS_journalsleft_Rdata.parquet']

In [30]:
# Read every parquet file and concatenate once, instead of growing the frame inside the loop
frames = [pq.read_table(file).to_pandas() for file in glob.glob("../thesis_data/processed_data/*.parquet")]
control_set_imported = pd.concat(frames, ignore_index=True)

In [31]:
control_set_imported.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943818 entries, 0 to 943817
Columns: 212 entries, AU to X.....................and....i
dtypes: float64(2), object(210)
memory usage: 1.5+ GB


In [32]:
control_set = control_set_imported.copy()

In [33]:
control_set.filter(like='X.').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943818 entries, 0 to 943817
Columns: 154 entries, X....paidb..ext.link.type to X.....................and....i
dtypes: object(154)
memory usage: 1.1+ GB


In [34]:
control_set.drop(columns=control_set.filter(like='X.').columns, inplace=True)
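`filter(like='X.')` keeps every column whose name contains `X.` anywhere, not only at the start. A sketch of a prefix-only alternative using `filter(regex=...)`; the column `MAX.value` is a hypothetical name added to show the difference and does not occur in the data.

```python
import pandas as pd

# "MAX.value" is a made-up column used only to illustrate the substring match.
toy = pd.DataFrame(columns=["AU", "DI", "X....paidb..ext.link.type", "MAX.value"])

substring_match = toy.filter(like="X.").columns.tolist()
prefix_match = toy.filter(regex=r"^X\.").columns.tolist()
print(substring_match)
print(prefix_match)
```

Judging by the column list printed later, none of the retained names contains `X.`, so both variants should give the same result for this dataset.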

In [35]:
control_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943818 entries, 0 to 943817
Data columns (total 58 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AU                          943818 non-null  object 
 1   DE                          544644 non-null  object 
 2   ID                          883685 non-null  object 
 3   C1                          940677 non-null  object 
 4   CR                          940482 non-null  object 
 5   AB                          912823 non-null  object 
 6   PA                          943818 non-null  object 
 7   affiliations                924960 non-null  object 
 8   AR                          136585 non-null  object 
 9   EM                          858780 non-null  object 
 10  book.author                 145 non-null     object 
 11  BO                          16783 non-null   object 
 12  da                          943818 non-null  object 
 13  DI            

In [36]:
control_set.columns

Index(['AU', 'DE', 'ID', 'C1', 'CR', 'AB', 'PA', 'affiliations', 'AR', 'EM',
       'book.author', 'BO', 'da', 'DI', 'GA', 'eissn',
       'esi.highly.cited.paper', 'esi.hot.paper', 'earlyaccessdate', 'BE',
       'FU', 'FX', 'BN', 'SN', 'JI', 'SO', 'LA', 'month', 'note', 'NR', 'PN',
       'oa', 'orcid.numbers', 'organization', 'PP', 'PU', 'SC',
       'researcherid.numbers', 'SE', 'TC', 'TI', 'DT', 'UT',
       'usage.count.last.180.days', 'U2', 'VL', 'web.of.science.categories.',
       'web.of.science.index', 'PY', 'RP', 'DB', 'J9', 'AU_UN', 'AU1_UN',
       'AU_UN_NR', 'SR_FULL', 'SR', 'book.group.author'],
      dtype='object')

In [37]:
rename_columns = {
    'AU': "authors", 
    'DE': "author_keywords", 
    'ID': "keywords_plus", 
    'C1': "author_address", 
    'CR': "cited_references", 
    'AB': "abstract", 
    'PA': "publisher_address", 
    #'affiliations', 
    'AR': "article_number", 
    'EM': "email_address",
    'book.author': "book_author", 
    ###'BO': "book",  -> not in the documentation
    'da': "date_report_generated", 
    'DI': "doi", 
    'GA': "document_delivery_number", 
    #'eissn',
    'esi.highly.cited.paper': "esi_highly_cited_paper", 
    'esi.hot.paper': "esi_hot_paper", 
    'earlyaccessdate': "early_access_date", 
    'BE': "editors",
    'FU': "funding_agency_and_grant_number", 
    'FX': "funding_text", 
    'BN': "isbn", 
    'SN': "issn", 
    'JI': "iso_source_abv", 
    'SO': "publication_name", 
    'LA': "language", 
    #'month', 
    #'note', 
    'NR': "cited_reference_count", 
    'PN': "part_number",
    'oa': "open_access_indicator", 
    'orcid.numbers': "orcid_numbers", 
    #'organization', 
    ###'PP', -> not in the documentation
    'PU': "publisher", 
    'SC': "research_areas",
    'researcherid.numbers': "researcher_id_numbers", 
    'SE': "book_series_title", 
    'TC': "wos_core_collection_times_cited_count", 
    'TI': "document_title", 
    'DT': "document_type", 
    'UT': "accession_number",
    'usage.count.last.180.days': "usage_count_last_180_days", 
    'U2': "usage_count_since_2013", 
    'VL': "volume", 
    'web.of.science.categories.': "wos_categories",
    'web.of.science.index': "wos_index", 
    'PY': "year_published", 
    'RP': "reprint_address", 
    'DB': "database", 
    'J9': "29_character_source_abv", 
    'AU_UN': "authors_affiliations", 
    'AU1_UN': "corresponding_author_affiliation",
    'AU_UN_NR': "not_recognized_affiliations", 
    'SR_FULL': "short_full_reference", 
    'SR': "short_reference"
}

control_set.rename(columns = rename_columns, inplace = True)
control_set.columns

Index(['authors', 'author_keywords', 'keywords_plus', 'author_address',
       'cited_references', 'abstract', 'publisher_address', 'affiliations',
       'article_number', 'email_address', 'book_author', 'BO',
       'date_report_generated', 'doi', 'document_delivery_number', 'eissn',
       'esi_highly_cited_paper', 'esi_hot_paper', 'early_access_date',
       'editors', 'funding_agency_and_grant_number', 'funding_text', 'isbn',
       'issn', 'iso_source_abv', 'publication_name', 'language', 'month',
       'note', 'cited_reference_count', 'part_number', 'open_access_indicator',
       'orcid_numbers', 'organization', 'PP', 'publisher', 'research_areas',
       'researcher_id_numbers', 'book_series_title',
       'wos_core_collection_times_cited_count', 'document_title',
       'document_type', 'accession_number', 'usage_count_last_180_days',
       'usage_count_since_2013', 'volume', 'wos_categories', 'wos_index',
       'year_published', 'reprint_address', 'database',
       '29_c
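By default `DataFrame.rename` silently ignores mapping keys that match no column, and columns missing from the mapping keep their original names (for example `BO`, `PP` and `book.group.author` here). A minimal sketch of a coverage check, with a toy frame and a deliberately wrong key `ZZ`:

```python
import pandas as pd

toy = pd.DataFrame(columns=["AU", "DI", "BO", "PP", "eissn"])
rename_map = {"AU": "authors", "DI": "doi", "ZZ": "not_present"}  # "ZZ" is a deliberate miss

# Keys in the map that match no column (rename skips these without complaint).
unused_keys = sorted(set(rename_map) - set(toy.columns))
# Columns that will keep their original name.
unrenamed = [c for c in toy.columns if c not in rename_map]
print(unused_keys, unrenamed)
```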

In [38]:
control_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943818 entries, 0 to 943817
Data columns (total 58 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   authors                                943818 non-null  object 
 1   author_keywords                        544644 non-null  object 
 2   keywords_plus                          883685 non-null  object 
 3   author_address                         940677 non-null  object 
 4   cited_references                       940482 non-null  object 
 5   abstract                               912823 non-null  object 
 6   publisher_address                      943818 non-null  object 
 7   affiliations                           924960 non-null  object 
 8   article_number                         136585 non-null  object 
 9   email_address                          858780 non-null  object 
 10  book_author                            145 non-null     

In [39]:
cols_to_drop = ['author_address', 'publisher_address',  'email_address', 'date_report_generated', 'document_delivery_number', 'editors', 'article_number', 'book_author', 'BO', 'funding_text', 'part_number', 'orcid_numbers', 'PP',  'accession_number', 'usage_count_last_180_days',
'usage_count_since_2013', 'volume', 'reprint_address', 'database', '29_character_source_abv', 'wos_index', 'not_recognized_affiliations',
'short_full_reference', 'short_reference']

control_set = control_set.drop(cols_to_drop, axis=1)

In [40]:
new_data_types = {'cited_reference_count': 'Int64',
                  'wos_core_collection_times_cited_count': 'Int64',
                  'year_published': 'Int64'}

# Convert each column to the nullable integer dtype named in the mapping
for col, dtype in new_data_types.items():
    control_set[col] = pd.array(control_set[col], dtype=dtype)

control_set['early_access_date'] = pd.to_datetime(control_set['early_access_date'], format='%b %Y')
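If a count column ever contained text that is not a number, a direct conversion to `Int64` would fail. A hedged sketch of a more tolerant conversion via `pd.to_numeric`; the sample values, including the `"n/a"` entry, are invented for illustration.

```python
import pandas as pd

raw = pd.Series(["52", "42", None, "n/a"], dtype="object")

# errors="coerce" turns unparseable entries into NaN; Int64 then stores them as <NA>.
as_int = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(as_int)
```

The trade-off is that bad values are dropped silently, so counting the missing entries before and after the conversion is a sensible sanity check.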

In [41]:
control_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943818 entries, 0 to 943817
Data columns (total 34 columns):
 #   Column                                 Non-Null Count   Dtype         
---  ------                                 --------------   -----         
 0   authors                                943818 non-null  object        
 1   author_keywords                        544644 non-null  object        
 2   keywords_plus                          883685 non-null  object        
 3   cited_references                       940482 non-null  object        
 4   abstract                               912823 non-null  object        
 5   affiliations                           924960 non-null  object        
 6   doi                                    930355 non-null  object        
 7   eissn                                  730427 non-null  object        
 8   esi_highly_cited_paper                 37895 non-null   object        
 9   esi_hot_paper                          37895 non

In [42]:
control_set = control_set[control_set['year_published']<2023]

In [43]:
control_set.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
early_access_date,103985.0,2021-08-14 17:11:16.309082624,2017-10-01 00:00:00,2021-03-01 00:00:00,2021-09-01 00:00:00,2022-04-01 00:00:00,2023-09-01 00:00:00,
cited_reference_count,938418.0,52.958932,0.0,29.0,42.0,59.0,4723.0,55.109981
wos_core_collection_times_cited_count,938418.0,99.088385,0.0,12.0,30.0,80.0,67880.0,337.584042
year_published,938418.0,2013.203568,2000.0,2008.0,2014.0,2019.0,2022.0,6.484763


In [44]:
control_set[control_set.duplicated()].shape[0]

4784

In [45]:
control_set.shape

(938418, 34)

In [46]:
# Replace all NaN values with a sentinel string so that rows with missing values can be compared
control_set = control_set.fillna('This is a missing value')

# Drop the fully duplicated rows
control_set.drop_duplicates(inplace=True)

# Turn the sentinel back into NaN
control_set = control_set.replace('This is a missing value', np.nan)

In [47]:
control_set.shape

(933634, 34)
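The row count fell by 938418 - 933634 = 4784, the same number that `duplicated()` reported before any sentinel was inserted. This is consistent with pandas already treating missing values in the same column as equal when comparing rows, which the following toy sketch illustrates:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/a", "10.1000/b"],
    "note": [np.nan, np.nan, "x"],
})

# The first two rows differ in nothing but are missing the same field.
n_flagged = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()
print(n_flagged, len(deduped))
```

If that holds for the full dataset, a plain `drop_duplicates()` would probably give the same result without the sentinel round trip.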

In [48]:
control_set['month']

0         JAN 28
1         JAN 28
2            MAR
3          NOV 7
4         JAN 28
           ...  
943812    NOV 26
943813       APR
943815     APR 1
943816    FEB 15
943817    SEP 25
Name: month, Length: 933634, dtype: object

In [49]:
# Rows with no month default to 1 January
control_set['month'] = control_set['month'].fillna('JAN 1')

# Build the publication date from month (and day, when present) plus year
control_set['publication_date'] = pd.to_datetime(control_set['month'] + ' ' + control_set['year_published'].astype(str), errors='coerce')
# Dates that could not be parsed fall back to 1 January of the publication year
control_set['publication_date'] = control_set['publication_date'].fillna(pd.to_datetime('JAN 01 ' + control_set['year_published'].astype(str), format='%b %d %Y'))

# Sort by DOI (ascending) and publication date (most recent first)
control_set.sort_values(by=['doi', 'publication_date'], ascending=[True, False], inplace=True)
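The `month` column mixes values such as `JAN 28` and `MAR`, so a single inferred format may not parse both. A minimal sketch that tries two explicit formats in turn, assuming these are the only two shapes present; the toy values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": ["JAN 28", "MAR", "NOV 7", None],
    "year_published": pd.array([2021, 2019, 2005, 2010], dtype="Int64"),
})

text = toy["month"].fillna("JAN 1") + " " + toy["year_published"].astype(str)

# Try "MON DD YYYY" first, then fall back to "MON YYYY" (day defaults to 1).
with_day = pd.to_datetime(text, format="%b %d %Y", errors="coerce")
month_only = pd.to_datetime(text, format="%b %Y", errors="coerce")
toy["publication_date"] = with_day.fillna(month_only)
print(toy["publication_date"].dt.strftime("%Y-%m-%d").tolist())
```

Compared with falling back to 1 January, this keeps the month for entries that have no day.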




In [50]:
# Drop rows with a missing DOI so the string test below does not hit NaN
control_set_filtered = control_set.dropna(subset=['doi'])

# List any DOIs that still carry the "http://dx.doi.org/" resolver prefix
control_set_filtered[control_set_filtered['doi'].str.startswith("http://dx.doi.org/")]['doi']

Series([], Name: doi, dtype: object)

In [51]:
control_set['doi'] = control_set['doi'].str.replace('http://dx.doi.org/', '', regex=False)
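The prefix removal is easiest to reason about as a literal string replacement. The sketch below uses made-up DOIs to show the effect of `regex=False`, which keeps "." and "/" from being read as regular-expression syntax and leaves missing values as they are.

```python
import pandas as pd

# Hypothetical DOIs: one with the resolver prefix, one without, one missing.
dois = pd.Series(["http://dx.doi.org/10.1000/xyz", "10.1000/abc", None])

# regex=False treats the pattern as plain text rather than a regular expression.
clean = dois.str.replace("http://dx.doi.org/", "", regex=False)
```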

### Title Analysis

In [52]:
control_set.columns

Index(['authors', 'author_keywords', 'keywords_plus', 'cited_references',
       'abstract', 'affiliations', 'doi', 'eissn', 'esi_highly_cited_paper',
       'esi_hot_paper', 'early_access_date', 'funding_agency_and_grant_number',
       'isbn', 'issn', 'iso_source_abv', 'publication_name', 'language',
       'month', 'note', 'cited_reference_count', 'open_access_indicator',
       'organization', 'publisher', 'research_areas', 'researcher_id_numbers',
       'book_series_title', 'wos_core_collection_times_cited_count',
       'document_title', 'document_type', 'wos_categories', 'year_published',
       'authors_affiliations', 'corresponding_author_affiliation',
       'book.group.author', 'publication_date'],
      dtype='object')

In [53]:
# regex=False matches the markers literally (parentheses and "." are regex metacharacters)
control_set[(~control_set['document_title'].str.startswith("RETRACTED: ")) & (~control_set['document_title'].str.contains("(Withdrawn Publication)", regex=False)) & (~control_set['document_title'].str.contains("Retracted Article. See", regex=False))]


Unnamed: 0,authors,author_keywords,keywords_plus,cited_references,abstract,affiliations,doi,eissn,esi_highly_cited_paper,esi_hot_paper,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
484329,KHOSHNAVAZ MJ,,TIME MIGRATION; INVERSION; LOCATION; DOMAIN; GAS,"ALONAIZI F, 2013, GEOPHYS PROSPECT, V61, P1206...",BUILDING AN ACCURATE VELOCITY MODEL PLAYS A VI...,UNIVERSITY OF TEHRAN,10.0090/geo2021-0173.1,1942-2156,,,...,,4,HIGH-RESOLUTION SEISMIC VELOCITY ANALYSIS BY S...,ARTICLE,GEOCHEMISTRY \& GEOPHYSICS,2021,UNIV TEHRAN;UNIV TEHRAN,UNIV TEHRAN,,2021-01-01
91310,CARTENI G;FIORENTINO R;VECCHIONE L;CHIURAZZI C,COLORECTAL CANCER; FULLY HUMAN MONOCLONAL ANTI...,GROWTH-FACTOR RECEPTOR; MONOCLONAL-ANTIBODY; C...,"BASELGA J, 2000, J CLIN ONCOL, V18, P904, DOI ...",THE EPIDERMAL GROWTH FACTOR RECEPTOR (EGFR) IS...,ANTONIO CARDARELLI HOSPITAL; UNIVERSITA DELLA ...,10.0093/annonc/mdm218,1569-8041,,,...,,22,PANITUMUMAB A NOVEL DRUG IN CANCER TREATMENT,ARTICLE; PROCEEDINGS PAPER,ONCOLOGY,2007,C (CORRESPONDING AUTHOR);NAPLES,C (CORRESPONDING AUTHOR),,2007-01-01
180029,LEE H;SMITH KG;GRIMM CM;SCHOMBURG A,COMPETITIVE DYNAMICS; FIRST MOVER ADVANTAGE; E...,RETURNS; RIVALRY; EVENT,"BALDWIN WL, 1969, SOUTHERN ECON J, V36, P18, D...","THIS RESEARCH EXAMINED THE EFFECTS OF TIMING, ...",GEORGE MASON UNIVERSITY; UNIVERSITY SYSTEM OF ...,10.1002/(SICI)1097-0266(200001)21:1<23::AID-SM...,,,,...,,205,"TIMING, ORDER AND DURABILITY OF NEW PRODUCT AD...",ARTICLE,BUSINESS; MANAGEMENT,2000,GEORGE MASON UNIV;GEORGE MASON UNIV;UNIV MARYL...,GEORGE MASON UNIV,,2000-01-01
179248,SIMERLY RL;LI MF,ENVIRONMENTAL DYNAMISM; ORGANIZATIONAL ECONOMI...,TRANSACTION COST ECONOMICS; CORPORATE-STRATEGY...,"AIKEN L. S., 1991, MULTIPLE REGRESSION: TESTIN...",AN ONGOING ARGUMENT IN FINANCIAL MANAGEMENT HA...,CALIFORNIA STATE UNIVERSITY SYSTEM; CALIFORNIA...,10.1002/(SICI)1097-0266(200001)21:1<31::AID-SM...,1097-0266,,,...,,256,"ENVIRONMENTAL DYNAMISM, CAPITAL STRUCTURE AND ...",REVIEW,BUSINESS; MANAGEMENT,2000,CALIF STATE UNIV NORTHRIDGE;CALIF STATE UNIV N...,CALIF STATE UNIV NORTHRIDGE,,2000-01-01
180038,GERINGER JM;TALLMAN S;OLSEN DM,DIVERSIFICATION; MULTINATIONAL; PERFORMANCE,RESOURCE-BASED VIEW; PROFIT PERFORMANCE; COMPE...,"AAKER DA, 1987, ACAD MANAGE J, V30, P277, DOI ...",THIS PAPER EXAMINES THE RELATIONSHIP OF PERFOR...,UTAH SYSTEM OF HIGHER EDUCATION; UNIVERSITY OF...,10.1002/(SICI)1097-0266(200001)21:1<51::AID-SM...,,,,...,,383,PRODUCT AND INTERNATIONAL DIVERSIFICATION AMON...,ARTICLE,BUSINESS; MANAGEMENT,2000,UNIV UTAH;UNIV UTAH;CRANFIELD UNIV;CALIF POLYT...,UNIV UTAH,,2000-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
877831,[ANONYMOUS] A,,,,,,,1476-5608,,,...,,0,"THE BRITISH PROSTATE GROUP SPRING MEETING, 16-...",ARTICLE,ONCOLOGY; UROLOGY \& NEPHROLOGY,2000,,NOTREPORTED,,2000-01-01
878180,MAYFIELD JA;REIBER GE;SANDERS LJ;JANISSE D;POG...,,,"AMERICAN DIABETES ASSOCIATION, 1999, DIABETES ...",,,,,,,...,,2,PREVENTIVE FOOT CARE IN PEOPLE WITH DIABETES,ARTICLE,ORTHOPEDICS,2000,,NOTREPORTED,,2000-01-01
878610,[ANONYMOUS] A,,,,,,,1365-2907,,,...,,0,THE PROCEEDINGS OF A SYMPOSIUM HELD AT THE THI...,ARTICLE,ECOLOGY; ZOOLOGY,2000,,NOTREPORTED,,2000-01-01
926523,O'BRIEN T;JOHNSON LH;ALDRICH JL;ALLEN SG;LIANG...,IMMUNOASSAYS; BIDIFFRACTIVE GRATING BIOSENSOR;...,SURFACE,"ANDERSON GP, 1996, ASAIO J, V42, P942, DOI 10....",A CRITICAL NEED EXISTS FOR A FIELD DEPLOYABLE ...,UNITED STATES DEPARTMENT OF DEFENSE; UNITED ST...,,,,,...,,53,THE DEVELOPMENT OF IMMUNOASSAYS TO FOUR BIOLOG...,ARTICLE,BIOPHYSICS; BIOTECHNOLOGY \& APPLIED MICROBIOL...,2000,T (CORRESPONDING AUTHOR);MED RES CTR;BATTELLE ...,T (CORRESPONDING AUTHOR),,2000-01-01


In [54]:
# regex=False matches the markers literally (parentheses and "." are regex metacharacters)
control_set[(control_set['document_title'].str.startswith("RETRACTED: ")) | (control_set['document_title'].str.contains("(Withdrawn Publication)", regex=False)) | (control_set['document_title'].str.contains("Retracted Article. See", regex=False))]


Unnamed: 0,authors,author_keywords,keywords_plus,cited_references,abstract,affiliations,doi,eissn,esi_highly_cited_paper,esi_hot_paper,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
275054,SCHÖN JH;KLOC C;WILDEMAN J;HADZIIOANNOU G,,,"AHN CH, 1999, SCIENCE, V284, P1152, DOI 10.112...",,ALCATEL-LUCENT; LUCENT TECHNOLOGIES; AT\&T; UN...,10.1002/1521-4095(200108)13:16<1273::AID-ADMA1...,1521-4095,,,...,,6,RETRACTED: GATE-INDUCED SUPERCONDUCTIVITY IN O...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2001,BELL LABS;UNIV GRONINGEN,BELL LABS,,2001-08-16
275784,KUMAR G;HO CC;CO CC,,ENDOTHELIAL-CELLS; GEOMETRIC CONTROL; ADHESION...,"BARRY JJA, 2006, ADV MATER, V18, P1406, DOI 10...",,UNIVERSITY SYSTEM OF OHIO; UNIVERSITY OF CINCI...,10.1002/adma.200601629,1521-4095,,,...,,87,RETRACTED: GUIDING CELL MIGRATION USING ONE-WA...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2007,UNIV CINCINNATI;UNIV CINCINNATI,UNIV CINCINNATI,,2007-04-20
145097,YU LW;CHEN KJ;SONG J;XU J;LI HM;WANG M;LI XF;H...,,THIN-FILMS; ARRAYS; COALESCENCE; NUCLEATION; S...,"ANONYMOUS, 1983, RCA ENG; BANIN U, 1999, NATUR...",SELF-ASSEMBLED SI QUANTUM-RING STRUCTURES ON A...,NANJING UNIVERSITY,10.1002/adma.200602804,1521-4095,,,...,,10,RETRACTED: SELF-ASSEMBLED SI QUANTUM-RING STRU...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2007,NANJING UNIV;NANJING UNIV,NANJING UNIV,,2007-06-18
274423,GHAFFARI M;ZHOU Y;XU H;LIN M;KIM TY;RUOFF RS;Z...,NANO-POROUS MICROWAVE EXFOLIATED GRAPHITE OXID...,DOUBLE-LAYER CAPACITORS; WALLED CARBON NANOTUB...,"BARBIERI O, 2005, CARBON, V43, P1303, DOI 10.1...",,PENNSYLVANIA COMMONWEALTH SYSTEM OF HIGHER EDU...,10.1002/adma.201301243,1521-4095,,,...,,103,RETRACTED: HIGH-VOLUMETRIC PERFORMANCE ALIGNED...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2013,PENN STATE UNIV;PENN STATE UNIV;PENN STATE UNI...,PENN STATE UNIV,,2013-09-20
151506,GHAFFARI M;KINSMAN W;ZHOU Y;MURALI S;BURLINGAM...,,POLYMER ACTUATORS; ELECTROMECHANICAL RESPONSE;...,"ANONYMOUS, APPL PHYS LETT; BAR-COHEN Y, 2008, ...",A HIGH-DENSITY ALIGNED NANOPOROUS ACTIVATED MI...,PENNSYLVANIA COMMONWEALTH SYSTEM OF HIGHER EDU...,10.1002/adma.201301370,1521-4095,,,...,,11,RETRACTED: ALIGNED NANO-POROUS MICROWAVE EXFOL...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2013,PENN STATE UNIV;PENN STATE UNIV;PENN STATE UNI...,PENN STATE UNIV,,2013-11-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
311196,BAI R;YUAN C;SUN W;ZHANG J;LUO Y;GAO Y;LI Y;GO...,NEK2; NSCLC; TUMORIGENESIS; WNT/BETA-CATENIN; ...,DRUG-RESISTANCE; POOR-PROGNOSIS; IMMUNOTHERAPY...,"BURGER PE, 2005, P NATL ACAD SCI USA, V102, P7...",ABNORMAL EXPRESSION AND DYSFUNCTION OF NEVER-I...,WUHAN UNIVERSITY; WUHAN UNIVERSITY; WUHAN UNIV...,10.7150/ijbs.59019,,,,...,,10,RETRACTED: NEK2 PLAYS AN ACTIVE ROLE IN TUMORI...,ARTICLE; RETRACTED PUBLICATION,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2021,WUHAN UNIV;WUHAN UNIV;WUHAN UNIV;WUHAN UNIV;WU...,WUHAN UNIV,,2021-01-01
305862,ZHANG QQ;DING Y;LEI Y;QI CL;HE XD;LAN T;LI JC;...,ANDROGRAPHOLIDE; INSULINOMA; GROWTH,NF-KAPPA-B; MOUSE MODELS; CANCER; APOPTOSIS; C...,"BERGERS G, 1999, SCIENCE, V284, P808, DOI 10.1...","INSULINOMAS ARE RARE TUMORS, AND APPROXIMATELY...",GUANGDONG PHARMACEUTICAL UNIVERSITY; UNIVERSIT...,10.7150/ijbs.7723,,,,...,,42,RETRACTED: ANDROGRAPHOLIDE SUPPRESS TUMOR GROW...,ARTICLE; RETRACTED PUBLICATION,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2014,GUANGDONG PHARMACEUT UNIV;GUANGDONG PHARMACEUT...,GUANGDONG PHARMACEUT UNIV,,2014-01-01
802659,GUILLEMINAULT C;QUO S;HUYNH NT;LI K,PREPUBERTAL CHILDREN; PEDIATRIC OBSTRUCTIVE SL...,RAPID MAXILLARY EXPANSION; NASAL AIRWAY-RESIST...,"ANONYMOUS, 1968, MANUAL STANDARDIZED; BONNET M...",STUDY OBJECTIVE: RAPID MAXILLARY EXPANSION AND...,STANFORD UNIVERSITY; UNIVERSITY OF CALIFORNIA ...,,1550-9109,,,...,,27,RETRACTED: ORTHODONTIC EXPANSION TREATMENT AND...,ARTICLE; RETRACTED PUBLICATION,CLINICAL NEUROLOGY; NEUROSCIENCES,2008,STANFORD UNIV;STANFORD UNIV;UNIV CALIF SAN FRA...,STANFORD UNIV,,2008-07-01
619961,KOUTKIA P;MYLONAKIS E;LEVIN RM,,PNEUMOCYSTIS-CARINII INFECTION; IMMUNE-DEFICIE...,"BASILIO-DE-OLIVEIRA C A, 2000, BRAZ J INFECT D...",ABNORMALITIES OF THYROID FUNCTION ARE ASSOCIAT...,BOSTON UNIVERSITY; HARVARD UNIVERSITY; MASSACH...,,1557-9077,,,...,,10,RETRACTED: HUMAN IMMUNODEFICIENCY VIRUS INFECT...,REVIEW; RETRACTED PUBLICATION,ENDOCRINOLOGY \& METABOLISM,2002,BOSTON UNIV;BOSTON UNIV;HARVARD UNIV,BOSTON UNIV,,2002-07-01
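
The three title checks can be collected into a single boolean mask. This is a sketch on invented titles, not the notebook's data; it assumes the markers should be matched literally and that rows with a missing title should count as not flagged (`na=False`).

```python
import pandas as pd

# Invented titles covering each marker, a near miss, and a missing value.
titles = pd.Series([
    "RETRACTED: SOME TITLE",
    "ANOTHER TITLE (Withdrawn Publication)",
    "THIRD TITLE (Retracted Article. See vol 1, pg 2, 2010)",
    "WITHDRAWN PUBLICATION RATES IN ONCOLOGY",
    None,
])

flagged = (
    titles.str.startswith("RETRACTED: ", na=False)
    | titles.str.contains("(Withdrawn Publication)", regex=False, na=False)
    | titles.str.contains("Retracted Article. See", regex=False, na=False)
)
```

With a regular-expression match, the parentheses would form a group and the fourth title could be caught by a case-insensitive search; the literal match keeps it out.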


In [55]:
# Remove the markers added to the titles of retracted or withdrawn articles
# (regex=False: the parentheses must be matched literally)
control_set['document_title'] = control_set['document_title'].str.replace('RETRACTED: ', '', regex=False)
control_set['document_title'] = control_set['document_title'].str.replace('(Withdrawn Publication)', '', regex=False)
control_set['document_title'] = control_set['document_title'].str.replace('(Withdrawn publication)', '', regex=False)
control_set['document_title'] = control_set['document_title'].str.replace('</bold>', '', regex=False)
control_set['document_title'] = control_set['document_title'].str.replace('<bold>', '', regex=False)

In [56]:
# Some records include variations of the phrase "(Retracted article. See XX)",
# where XX varies in length and content but always ends with ")"
def remove_retraction_phrase(title):

    # Pattern matching the retraction phrase (applied to the lower-cased title)
    pattern = r'\(retracted article\. see [^\)]+\)'
    return re.sub(pattern, '', str.lower(title)).strip()

# Apply the function to the 'document_title' column
control_set['document_title'] = control_set['document_title'].apply(remove_retraction_phrase)

In [57]:
# Some records include variations of the phrase "(Withdrawal of XX)"
def remove_withdrawal_phrase(title):

    # Pattern matching the withdrawal phrase (applied to the lower-cased title)
    pattern = r'\(withdrawal of [^\)]+\)'
    return re.sub(pattern, '', str.lower(title)).strip()

# Apply the function to the 'document_title' column
control_set['document_title'] = control_set['document_title'].apply(remove_withdrawal_phrase)

In [58]:
# Remove spaces at the end of the string
control_set['document_title'] = control_set['document_title'].str.rstrip()

# Lower-case the titles
control_set['document_title'] = control_set['document_title'].str.lower()
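
The title clean-up steps can also be folded into one function, which makes it easier to apply the same normalization to another dataset later. The sketch below is an assumption-laden consolidation of the patterns used in the cells above; `normalize_title` and `NOTE_PATTERN` are hypothetical names, and missing titles are passed through unchanged.

```python
import re
import pandas as pd

# Parenthesised notes to strip; the patterns mirror the ones used above.
NOTE_PATTERN = re.compile(
    r"\((?:retracted article\. see|withdrawal of) [^)]+\)|\(withdrawn publication\)",
    flags=re.IGNORECASE,
)

def normalize_title(title):
    """Return a lower-cased title without retraction or withdrawal markers."""
    if pd.isna(title):
        return title  # leave missing titles untouched
    title = title.replace("RETRACTED: ", "")
    title = re.sub(r"</?bold>", "", title)
    title = NOTE_PATTERN.sub("", title)
    return title.lower().strip()

example = normalize_title(
    "RETRACTED: <bold>Some</bold> Title (Retracted article. See vol 5, pg 10, 2012)"
)
```

It would be applied with `control_set['document_title'].apply(normalize_title)`.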

In [59]:
#wos_rd[(~wos_rd['Document Title'].str.startswith("RETRACTED: ")) & (~wos_rd['Document Title'].str.contains("(Withdrawn Publication)"))]

### Duplicates

In [60]:
control_set[control_set.duplicated(subset='doi', keep=False)].shape

(20850, 35)

In [61]:
control_set['doi'].nunique()

916488

In [62]:
control_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 933634 entries, 484329 to 926825
Data columns (total 35 columns):
 #   Column                                 Non-Null Count   Dtype         
---  ------                                 --------------   -----         
 0   authors                                933634 non-null  object        
 1   author_keywords                        539770 non-null  object        
 2   keywords_plus                          873956 non-null  object        
 3   cited_references                       930337 non-null  object        
 4   abstract                               902793 non-null  object        
 5   affiliations                           914940 non-null  object        
 6   doi                                    920195 non-null  object        
 7   eissn                                  722054 non-null  object        
 8   esi_highly_cited_paper                 37361 non-null   object        
 9   esi_hot_paper                          37361 non

In [63]:
#control_set.iloc[[2,6,36,58,60,65,67,68]]

In [64]:
control_set['doi'].value_counts()

doi
10.1093/nar/gkl022                 3
10.1093/nar/gkg112                 3
10.1016/j.lungcan.2004.07.000      3
10.1038/s41586-021-03689-8         2
10.1093/nar/29.9.1960              2
                                  ..
10.1016/j.jcis.2015.01.002         1
10.1016/j.jcis.2015.01.003         1
10.1016/j.jcis.2015.01.004         1
10.1016/j.jcis.2015.01.005         1
y10.1016/j.physletb.2019.03.059    1
Name: count, Length: 916488, dtype: int64

In [65]:
control_set.shape

(933634, 35)

In [66]:
control_set[control_set.duplicated(subset='doi', keep=False)]

Unnamed: 0,authors,author_keywords,keywords_plus,cited_references,abstract,affiliations,doi,eissn,esi_highly_cited_paper,esi_hot_paper,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
178879,BOWEN HP;WIERSEMA MF,FOREIGN COMPETITION; DIVERSIFICATION; CORPORAT...,BUSINESS CYCLES; PERFORMANCE; PRODUCT; UNCERTA...,"ABOWD JM, 1990, W3351 NBER; AI CR, 2003, ECON ...",SINCE THE MID-1980S U.S. DOMESTIC FIT-INS HAVE...,UNIVERSITY OF CALIFORNIA SYSTEM; UNIVERSITY OF...,10.1002/smj.499,1097-0266,,,...,,126,foreign-based competition and corporate divers...,ARTICLE,BUSINESS; MANAGEMENT,2005,UNIV CALIF IRVINE;UNIV CALIF IRVINE;VLERICK LE...,UNIV CALIF IRVINE,,2005-12-01
179945,KUMAR MVS,JOINT VENTURES; REAL OPTIONS; TERMINATION; EVE...,STRATEGIC ALLIANCES; MARKET VALUATION; PERFORM...,"ADNER R, 2004, ACAD MANAGE REV, V29, P74, DOI ...",THIS STUDY EXAMINES THE VALUE CREATED FROM ACQ...,CITY UNIVERSITY OF NEW YORK (CUNY) SYSTEM; BAR...,10.1002/smj.499,1097-0266,,,...,,91,the value from acquiring and divesting a joint...,ARTICLE,BUSINESS; MANAGEMENT,2005,CUNY BERNARD M BARUCH COLL;CUNY BERNARD M BARU...,CUNY BERNARD M BARUCH COLL,,2005-04-01
689937,RODRÍGUEZ JR;GONZÁLEZ-PÉREZ A;DEL CASTILLO J,ALKYLDIMETHYLBENZYLAMMONIUM CHLORIDE; CMC; TEM...,CRITICAL MICELLE CONCENTRATIONS; ENTHALPY-ENTR...,"CHEN LJ, 1998, J PHYS CHEM B, V102, P4350, DOI...",SPECIFIC CONDUCTIVITIES OF ALKYLDIMETHYLBENZYL...,UNIVERSIDADE DE SANTIAGO DE COMPOSTELA; JAGIEL...,10.1006/jcis.2002.8263,,,,...,,68,thermodynamics of micellization of alkyldimeth...,ARTICLE,"CHEMISTRY, PHYSICAL",2002,UNIV SANTIAGO COMPOSTELA;UNIV SANTIAGO COMPOST...,UNIV SANTIAGO COMPOSTELA,,2002-06-15
690682,NGUYEN AV,MENISCUS; GAS-LIQUID INTERFACE; YOUNG-LAPLACE ...,,"ADAMSON AW, 1997, PHYSICAL CHEM SURFAC; DERJAG...",IN THIS PAPER THE PROBLEM OF CALCULATING THE D...,UNIVERSITY OF NEWCASTLE,10.1006/jcis.2002.8263,,,,...,,21,empirical equations for meniscus depression by...,ARTICLE,"CHEMISTRY, PHYSICAL",2002,UNIV NEWCASTLE;UNIV NEWCASTLE,UNIV NEWCASTLE,,2002-05-01
737916,LAI WA;REKHA PD;ARUN AB;YOUNG CC,AZOSPIRILLUM; PLANT GROWTH PROMOTION; CHEMICAL...,PLANT-GROWTH; ROOT DEVELOPMENT; ACC DEAMINASE;...,"ADEGBIDI HG, 2003, BIOMASS BIOENERG, V25, P389...",BENEFITS FROM THE APPLICATION OF PLANT GROWTH-...,NATIONAL CHUNG HSING UNIVERSITY,10.1007/s00374-008-0313-3,1432-0789,,,...,,0,"effect of mineral fertilizer, pig manure, and ...",ARTICLE,SOIL SCIENCE,2008,NATL CHUNG HSING UNIV;NATL CHUNG HSING UNIV,NATL CHUNG HSING UNIV,,2008-11-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
877831,[ANONYMOUS] A,,,,,,,1476-5608,,,...,,0,"the british prostate group spring meeting, 16-...",ARTICLE,ONCOLOGY; UROLOGY \& NEPHROLOGY,2000,,NOTREPORTED,,2000-01-01
878180,MAYFIELD JA;REIBER GE;SANDERS LJ;JANISSE D;POG...,,,"AMERICAN DIABETES ASSOCIATION, 1999, DIABETES ...",,,,,,,...,,2,preventive foot care in people with diabetes,ARTICLE,ORTHOPEDICS,2000,,NOTREPORTED,,2000-01-01
878610,[ANONYMOUS] A,,,,,,,1365-2907,,,...,,0,the proceedings of a symposium held at the thi...,ARTICLE,ECOLOGY; ZOOLOGY,2000,,NOTREPORTED,,2000-01-01
926523,O'BRIEN T;JOHNSON LH;ALDRICH JL;ALLEN SG;LIANG...,IMMUNOASSAYS; BIDIFFRACTIVE GRATING BIOSENSOR;...,SURFACE,"ANDERSON GP, 1996, ASAIO J, V42, P942, DOI 10....",A CRITICAL NEED EXISTS FOR A FIELD DEPLOYABLE ...,UNITED STATES DEPARTMENT OF DEFENSE; UNITED ST...,,,,,...,,53,the development of immunoassays to four biolog...,ARTICLE,BIOPHYSICS; BIOTECHNOLOGY \& APPLIED MICROBIOL...,2000,T (CORRESPONDING AUTHOR);MED RES CTR;BATTELLE ...,T (CORRESPONDING AUTHOR),,2000-01-01


In [67]:
testing_dupes = control_set[control_set.duplicated(subset='doi', keep=False)]
testing_dupes['doi'].value_counts()

doi
10.1016/j.lungcan.2004.07.000    3
10.1093/nar/gkl022               3
10.1093/nar/gkg112               3
10.1093/nar/29.1.277             2
10.1093/nar/29.1.294             2
                                ..
10.1039/c3nr01739g               2
10.1039/c3nr01847d               2
10.1039/c3nr01889j               2
10.1039/c3nr02076b               2
10.4244/EIJY16M06\_02            2
Name: count, Length: 3704, dtype: int64

In [68]:
testing_dupes = testing_dupes.sort_values('doi')

In [69]:
testing_dupes

Unnamed: 0,authors,author_keywords,keywords_plus,cited_references,abstract,affiliations,doi,eissn,esi_highly_cited_paper,esi_hot_paper,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
178879,BOWEN HP;WIERSEMA MF,FOREIGN COMPETITION; DIVERSIFICATION; CORPORAT...,BUSINESS CYCLES; PERFORMANCE; PRODUCT; UNCERTA...,"ABOWD JM, 1990, W3351 NBER; AI CR, 2003, ECON ...",SINCE THE MID-1980S U.S. DOMESTIC FIT-INS HAVE...,UNIVERSITY OF CALIFORNIA SYSTEM; UNIVERSITY OF...,10.1002/smj.499,1097-0266,,,...,,126,foreign-based competition and corporate divers...,ARTICLE,BUSINESS; MANAGEMENT,2005,UNIV CALIF IRVINE;UNIV CALIF IRVINE;VLERICK LE...,UNIV CALIF IRVINE,,2005-12-01
179945,KUMAR MVS,JOINT VENTURES; REAL OPTIONS; TERMINATION; EVE...,STRATEGIC ALLIANCES; MARKET VALUATION; PERFORM...,"ADNER R, 2004, ACAD MANAGE REV, V29, P74, DOI ...",THIS STUDY EXAMINES THE VALUE CREATED FROM ACQ...,CITY UNIVERSITY OF NEW YORK (CUNY) SYSTEM; BAR...,10.1002/smj.499,1097-0266,,,...,,91,the value from acquiring and divesting a joint...,ARTICLE,BUSINESS; MANAGEMENT,2005,CUNY BERNARD M BARUCH COLL;CUNY BERNARD M BARU...,CUNY BERNARD M BARUCH COLL,,2005-04-01
689937,RODRÍGUEZ JR;GONZÁLEZ-PÉREZ A;DEL CASTILLO J,ALKYLDIMETHYLBENZYLAMMONIUM CHLORIDE; CMC; TEM...,CRITICAL MICELLE CONCENTRATIONS; ENTHALPY-ENTR...,"CHEN LJ, 1998, J PHYS CHEM B, V102, P4350, DOI...",SPECIFIC CONDUCTIVITIES OF ALKYLDIMETHYLBENZYL...,UNIVERSIDADE DE SANTIAGO DE COMPOSTELA; JAGIEL...,10.1006/jcis.2002.8263,,,,...,,68,thermodynamics of micellization of alkyldimeth...,ARTICLE,"CHEMISTRY, PHYSICAL",2002,UNIV SANTIAGO COMPOSTELA;UNIV SANTIAGO COMPOST...,UNIV SANTIAGO COMPOSTELA,,2002-06-15
690682,NGUYEN AV,MENISCUS; GAS-LIQUID INTERFACE; YOUNG-LAPLACE ...,,"ADAMSON AW, 1997, PHYSICAL CHEM SURFAC; DERJAG...",IN THIS PAPER THE PROBLEM OF CALCULATING THE D...,UNIVERSITY OF NEWCASTLE,10.1006/jcis.2002.8263,,,,...,,21,empirical equations for meniscus depression by...,ARTICLE,"CHEMISTRY, PHYSICAL",2002,UNIV NEWCASTLE;UNIV NEWCASTLE,UNIV NEWCASTLE,,2002-05-01
737916,LAI WA;REKHA PD;ARUN AB;YOUNG CC,AZOSPIRILLUM; PLANT GROWTH PROMOTION; CHEMICAL...,PLANT-GROWTH; ROOT DEVELOPMENT; ACC DEAMINASE;...,"ADEGBIDI HG, 2003, BIOMASS BIOENERG, V25, P389...",BENEFITS FROM THE APPLICATION OF PLANT GROWTH-...,NATIONAL CHUNG HSING UNIVERSITY,10.1007/s00374-008-0313-3,1432-0789,,,...,,0,"effect of mineral fertilizer, pig manure, and ...",ARTICLE,SOIL SCIENCE,2008,NATL CHUNG HSING UNIV;NATL CHUNG HSING UNIV,NATL CHUNG HSING UNIV,,2008-11-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
877831,[ANONYMOUS] A,,,,,,,1476-5608,,,...,,0,"the british prostate group spring meeting, 16-...",ARTICLE,ONCOLOGY; UROLOGY \& NEPHROLOGY,2000,,NOTREPORTED,,2000-01-01
878180,MAYFIELD JA;REIBER GE;SANDERS LJ;JANISSE D;POG...,,,"AMERICAN DIABETES ASSOCIATION, 1999, DIABETES ...",,,,,,,...,,2,preventive foot care in people with diabetes,ARTICLE,ORTHOPEDICS,2000,,NOTREPORTED,,2000-01-01
878610,[ANONYMOUS] A,,,,,,,1365-2907,,,...,,0,the proceedings of a symposium held at the thi...,ARTICLE,ECOLOGY; ZOOLOGY,2000,,NOTREPORTED,,2000-01-01
926523,O'BRIEN T;JOHNSON LH;ALDRICH JL;ALLEN SG;LIANG...,IMMUNOASSAYS; BIDIFFRACTIVE GRATING BIOSENSOR;...,SURFACE,"ANDERSON GP, 1996, ASAIO J, V42, P942, DOI 10....",A CRITICAL NEED EXISTS FOR A FIELD DEPLOYABLE ...,UNITED STATES DEPARTMENT OF DEFENSE; UNITED ST...,,,,,...,,53,the development of immunoassays to four biolog...,ARTICLE,BIOPHYSICS; BIOTECHNOLOGY \& APPLIED MICROBIOL...,2000,T (CORRESPONDING AUTHOR);MED RES CTR;BATTELLE ...,T (CORRESPONDING AUTHOR),,2000-01-01


In [70]:
testing_dupes.to_excel('../testing_dupes_control_set.xlsx', index= False)

In [71]:
control_set['document_type'].value_counts()

document_type
ARTICLE                                              809132
REVIEW                                                82678
ARTICLE; PROCEEDINGS PAPER                            35443
REVIEW; BOOK CHAPTER                                   2995
ARTICLE; EARLY ACCESS                                  1132
ARTICLE; BOOK CHAPTER                                  1088
ARTICLE; RETRACTED PUBLICATION                          439
EDITORIAL MATERIAL; EARLY ACCESS                        417
REVIEW; EARLY ACCESS                                    118
CORRECTION; EARLY ACCESS                                 49
ARTICLE; PUBLICATION WITH EXPRESSION OF CONCERN          37
LETTER; EARLY ACCESS                                     32
ARTICLE; DATA PAPER                                      20
NEWS ITEM; EARLY ACCESS                                  14
BOOK REVIEW; EARLY ACCESS                                14
ARTICLE; PROCEEDINGS PAPER; RETRACTED PUBLICATION        12
REVIEW; RETRACTED PUBLICAT

In [72]:
control_set['publication_name'].value_counts()

publication_name
MATERIALS SCIENCE AND ENGINEERING A-STRUCTURAL MATERIALS PROPERTIES MICROSTRUCTURE AND PROCESSING    29925
NUCLEIC ACIDS RESEARCH                                                                               27961
JOURNAL OF COLLOID AND INTERFACE SCIENCE                                                             23707
NATURE                                                                                               20875
NANOSCALE                                                                                            20827
                                                                                                     ...  
AUTOIMMUNITY, PT D                                                                                       1
THERAPEUTICS FOR COGNITIVE AGING                                                                         1
MEETING THE HUMAN IMMUNOLOGY CHALLENGE                                                                   1
PROBIOTICS: FROM BEN

In [73]:
unique_counts = testing_dupes[testing_dupes['doi']=='10.1038/nrm1196'].nunique()
unique_counts[unique_counts > 1].index

Index(['keywords_plus', 'cited_references', 'affiliations', 'eissn', 'issn',
       'iso_source_abv', 'publication_name', 'month', 'document_type',
       'authors_affiliations', 'publication_date'],
      dtype='object')

In [74]:
control_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 933634 entries, 484329 to 926825
Data columns (total 35 columns):
 #   Column                                 Non-Null Count   Dtype         
---  ------                                 --------------   -----         
 0   authors                                933634 non-null  object        
 1   author_keywords                        539770 non-null  object        
 2   keywords_plus                          873956 non-null  object        
 3   cited_references                       930337 non-null  object        
 4   abstract                               902793 non-null  object        
 5   affiliations                           914940 non-null  object        
 6   doi                                    920195 non-null  object        
 7   eissn                                  722054 non-null  object        
 8   esi_highly_cited_paper                 37361 non-null   object        
 9   esi_hot_paper                          37361 non

In [75]:
# number of different values per DOI
aux = testing_dupes.groupby('doi').nunique()

# number of variables that differ for each DOI
aux.gt(1).sum(axis=1).sort_values()

doi
10.1111/cpsp.12377             1
10.1039/c7nr01116d             1
10.1039/c7nr01755c             1
10.1039/c7nr01845b             1
10.1039/c7nr01894k             1
                              ..
10.1016/j.msea.2006.03.011    15
10.3389/fncel.2015.00310      15
10.4244/EIJY16M06\_02         15
10.1196/annals.1322.002       19
10.1196/annals.1322.021       20
Length: 3704, dtype: int64

In [76]:
# variables that differ in duplicates
aux.columns[aux.max() > 1]

Index(['authors', 'author_keywords', 'keywords_plus', 'cited_references',
       'abstract', 'affiliations', 'eissn', 'funding_agency_and_grant_number',
       'isbn', 'issn', 'iso_source_abv', 'publication_name', 'month', 'note',
       'cited_reference_count', 'open_access_indicator', 'organization',
       'publisher', 'research_areas', 'researcher_id_numbers',
       'wos_core_collection_times_cited_count', 'document_title',
       'document_type', 'wos_categories', 'year_published',
       'authors_affiliations', 'corresponding_author_affiliation',
       'publication_date'],
      dtype='object')

Of the variables listed above as taking different values for the same DOI, only those required by the methodology will be inspected individually:
- 'authors'
- 'author_keywords'
- 'keywords_plus'
- 'cited_references'
- 'abstract'
- 'affiliations'
- 'eissn'
- 'funding_agency_and_grant_number'
- 'issn'
- 'iso_source_abv'
- 'publication_name'
- 'month'
- 'cited_reference_count'
- 'open_access_indicator'
- 'publisher'
- 'research_areas'
- 'researcher_id_numbers'
- 'wos_core_collection_times_cited_count'
- 'document_title'
- 'document_type'
- 'wos_categories'
- 'year_published'
- 'authors_affiliations'
- 'corresponding_author_affiliation'
- 'publication_date'
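
One way to inspect a single variable is to list the DOIs whose duplicate rows disagree on it. The helper below is a hypothetical sketch on toy data (`conflicting_dois` is not part of the notebook); it reuses the same `groupby('doi').nunique()` idea as the cells above.

```python
import pandas as pd

# Toy duplicates (hypothetical): DOI "a" disagrees on authors, DOI "b" on issn.
toy = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/b"],
    "authors": ["X", "Y", "Z", "Z"],
    "issn": ["1", "1", "2", "3"],
})

def conflicting_dois(frame, column):
    """DOIs whose duplicate rows hold more than one distinct value in `column`."""
    counts = frame.groupby("doi")[column].nunique()
    return counts[counts > 1].index.tolist()

authors_conflicts = conflicting_dois(toy, "authors")
issn_conflicts = conflicting_dois(toy, "issn")
```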

In [77]:
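# Keep one row per DOI. control_set is sorted by publication_date (descending)
# within each DOI, so the first occurrence is the most recent record.
# Note: rows with a missing DOI count as duplicates of each other, so only one of them is kept.
filtered_wos = control_set.drop_duplicates(subset='doi')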
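
Because `drop_duplicates(subset='doi')` treats missing DOIs as equal to one another, all records without a DOI collapse into a single row. If those records need to be kept, one option is to de-duplicate only the rows that have a DOI. This is a sketch on toy data under that assumption, not the notebook's actual step.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one DOI with two versions, two records without DOI.
toy = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", np.nan, np.nan],
    "document_title": ["old", "new", "no doi 1", "no doi 2"],
    "publication_date": pd.to_datetime(
        ["2005-04-01", "2005-12-01", "2000-01-01", "2000-01-01"]),
})
toy = toy.sort_values(["doi", "publication_date"], ascending=[True, False])

# Plain de-duplication keeps a single row for all missing DOIs.
collapsed = toy.drop_duplicates(subset="doi")

has_doi = toy["doi"].notna()
latest = pd.concat([
    toy[has_doi].drop_duplicates(subset="doi"),  # first row = most recent date
    toy[~has_doi],                               # DOI-less rows kept untouched
])
```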

In [78]:
control_set.to_parquet('./retractions_data/control_set.parquet', index = False)

<div class="alert alert-block alert-info" style = "background:#d0de6f; color:#000000; border:0;">

# Chapter 2 - Merge Data <a class="anchor" id="chapter2"></a>

<a class="anchor"> 

## 2.1 - Retractions Data (rwd+control_set) <a class="anchor" id="section_2_1"></a>

In [79]:
filtered_rwd.shape
#filtered_wos.shape

(36847, 20)

In [80]:
processed_data_retractions = filtered_rwd.merge(filtered_wos, how= 'inner', left_on= 'OriginalPaperDOI', right_on= 'doi')
processed_data_retractions

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
0,19131,Guiding Cell Migration Using One-Way Micropatt...,(BLS) Biochemistry;(BLS) Biology - Cellular;,"Chemical & Materials Engineering Department, U...",Advanced Materials,Wiley,United States,Girish Kumar;Chia Chi Ho;Carlos C Co,,Research Article;,...,,87,guiding cell migration using one-way micropatt...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2007,UNIV CINCINNATI;UNIV CINCINNATI,UNIV CINCINNATI,,2007-04-20
1,5767,Self-Assembled Si Quantum-Ring Structures on a...,(PHY) Materials Science;(PHY) Nanotechnology;,National Laboratory of Solid State Microstruct...,Advanced Materials,Wiley,China,Lin Wei Yu;Kun Ji Chen;Jie Song;Jun Xu;Wei Li;...,,Research Article;,...,,10,self-assembled si quantum-ring structures on a...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2007,NANJING UNIV;NANJING UNIV,NANJING UNIV,,2007-06-18
2,5650,High-volumetric performance aligned nano-porou...,(PHY) Engineering - Chemical;(PHY) Engineering...,Department of Materials Science and Engineerin...,Advanced Materials,Wiley,China;United States,Mehdi Ghaffari;Yue Zhou;Haiping Xu;Minren Lin;...,http://retractionwatch.com/2016/10/28/material...,Research Article;,...,,103,high-volumetric performance aligned nano-porou...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2013,PENN STATE UNIV;PENN STATE UNIV;PENN STATE UNI...,PENN STATE UNIV,,2013-09-20
3,6221,Aligned Nano-Porous Microwave Exfoliated Graph...,(PHY) Engineering - Chemical;(PHY) Engineering...,Department of Materials Science and Engineerin...,Advanced Materials,Wiley,United States,Mehdi Ghaffari;QM Zhang;W Kinsman;Yue Zhou;Sha...,http://retractionwatch.com/2016/10/28/material...,Research Article;,...,,11,aligned nano-porous microwave exfoliated graph...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2013,PENN STATE UNIV;PENN STATE UNIV;PENN STATE UNI...,PENN STATE UNIV,,2013-11-20
4,45124,Highly Sensitive MoS2 Humidity Sensors Array f...,(PHY) Engineering - Chemical;,Beijing National Laboratory for Condensed Matt...,Advanced Materials,Wiley,China,Jing Zhao;Na Li;Zheng Wei;Mengzhou Liao;Peng C...,,Research Article;,...,,349,highly sensitive mos<sub>2</sub> humidity sens...,ARTICLE; RETRACTED PUBLICATION,"CHEMISTRY, MULTIDISCIPLINARY; CHEMISTRY, PHYSI...",2017,INST PHYS;BEJING INST NANOENRGY AND NANOSYST;U...,INST PHYS,,2017-09-13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,38204,miR-1204 promotes hepatocellular carcinoma pro...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Department of Hepatobiliary Surgery, the First...",International Journal of Biological Sciences,Ivyspring International Publisher,China,Liang Wang;Liankang Sun;Yufeng Wang;Bowen Yao;...,,Research Article;,...,,24,mir-1204 promotes hepatocellular carcinoma pro...,ARTICLE; RETRACTED PUBLICATION,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2019,XI AN JIAO TONG UNIV;XI AN JIAO TONG UNIV,XI AN JIAO TONG UNIV,,2019-01-01
656,34228,"HCRP-1 regulates cell migration, invasion and ...",(BLS) Biology - Cancer;(BLS) Biology - Cellular;,"Cancer Institute, Xuzhou Medical University, X...",International Journal of Biological Sciences,Ivyspring International Publisher,China,Feifei Chen;Jianqiang Wu;Jingwei Teng;Wang Li;...,,Research Article;,...,,8,"hcrp-1 regulates cell migration, invasion and ...",ARTICLE; RETRACTED PUBLICATION,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2020,XUZHOU MED UNIV;XUZHOU MED COLL;XUZHOU MED UNI...,NOTREPORTED;XUZHOU MED UNIV,,2020-01-01
657,38203,NEK2 plays an active role in Tumorigenesis and...,(BLS) Biochemistry;(BLS) Biology - Cancer;(BLS...,"Department of Radiation and Medical Oncology, ...",International Journal of Biological Sciences,Ivyspring International Publisher,China,Rui Bai;Cheng Yuan;Wenjie Sun;Jianguo Zhang;Yu...,,Research Article;,...,,10,nek2 plays an active role in tumorigenesis and...,ARTICLE; RETRACTED PUBLICATION,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2021,WUHAN UNIV;WUHAN UNIV;WUHAN UNIV;WUHAN UNIV;WU...,WUHAN UNIV,,2021-01-01
658,34229,Andrographolide Suppress Tumor Growth by Inhib...,(BLS) Biology - Cancer;(BLS) Biology - Cellula...,"Vascular Biology Research Institute, Guangdong...",International Journal of Biological Sciences,IOS Publishing (Institute of Science Publishing),China,Qian Qian Zhang;Yi Ding;Yan Lei;Cui Ling Qi;Xi...,,Research Article;,...,,42,andrographolide suppress tumor growth by inhib...,ARTICLE; RETRACTED PUBLICATION,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2014,GUANGDONG PHARMACEUT UNIV;GUANGDONG PHARMACEUT...,GUANGDONG PHARMACEUT UNIV,,2014-01-01


In [81]:
processed_data_retractions['doi'].value_counts()

doi
10.1002/adma.200601629           1
10.1056/NEJM200104263441702      1
10.1056/NEJMoa033374             1
10.1056/NEJMoa060467             1
10.1056/NEJMoa1101324            1
                                ..
10.1016/j.lungcan.2010.06.005    1
10.1016/j.lungcan.2014.08.015    1
10.1016/j.matdes.2009.01.022     1
10.1016/j.matdes.2015.07.028     1
10.7150/ijbs.7723                1
Name: count, Length: 659, dtype: int64
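The `value_counts()` output above only shows the first and last few DOIs, so a repeated DOI in the middle of the list would be easy to miss. A minimal sketch of a direct uniqueness check follows; the `toy` frame and its DOIs are made up for illustration and stand in for `processed_data_retractions`.

```python
import pandas as pd

# Toy stand-in for processed_data_retractions; these DOIs are invented.
toy = pd.DataFrame({"doi": ["10.1/a", "10.1/b", "10.1/b", None]})

# value_counts() skips missing values by default, so drop them here too.
dois = toy["doi"].dropna()
all_unique = dois.is_unique

# keep=False flags every row of a repeated DOI, not only the later copies.
repeated = toy[toy["doi"].duplicated(keep=False) & toy["doi"].notna()]

print(all_unique)
print(repeated)
```

If `all_unique` is false, `repeated` lists the rows to inspect before any further merge.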

In [95]:
# Fallback for records without a DOI: match on the title instead.
rwd_wo_doi = rwd[rwd['OriginalPaperDOI'].isnull() | (rwd['OriginalPaperDOI'] == '')]
control_set_wo_doi = control_set[control_set['doi'].isnull() | (control_set['doi'] == '')]
# Exact string match between the RW 'Title' and the WoS 'document_title'.
data_retractions_by_title = rwd_wo_doi.merge(control_set_wo_doi, how='inner', left_on='Title', right_on='document_title')
# Append the title matches to the existing matches and drop exact duplicate rows.
processed_data_retractions = pd.concat([processed_data_retractions, data_retractions_by_title], ignore_index=True)
processed_data_retractions = processed_data_retractions.drop_duplicates()

In [96]:
data_retractions_by_title

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
0,13650,introduction,(B/T) Business - Economics;(HUM) History - Eur...,Unavailable;,Journal of Markets & Morality,Acton Institute for the Study of Religion & Li...,Unknown,Francisco Gómez Camacho,http://retractionwatch.com/2015/07/20/two-retr...,Book Chapter/Reference Work;,...,,2,introduction,ARTICLE,ORTHOPEDICS,2014,LUND UNIV;LUND UNIV,LUND UNIV,,2014-12-01
1,4261,nonlymphoid reservoirs of hiv replication in c...,(BLS) Biology - Cellular;(BLS) Biology - Molec...,"University of Washington School of Medicine, V...",Journal of Leukocyte Biology (JLB),Society for Leukocyte Biology,United States,Scott J Brodie,http://retractionwatch.com/2016/06/13/fraudste...,Research Article;,...,,15,nonlymphoid reservoirs of hiv replication in c...,ARTICLE; PROCEEDINGS PAPER; RETRACTED PUBLICATION,CELL BIOLOGY; HEMATOLOGY; IMMUNOLOGY,2000,UNIV WASHINGTON;UNIV WASHINGTON,UNIV WASHINGTON,,2000-09-01
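In the output above, the generic title "introduction" links a *Journal of Markets & Morality* record to a WoS article in the ORTHOPEDICS category, which suggests that a title-only match can pair unrelated papers. A minimal sketch of one possible guard follows. It assumes that `Journal` (RW) and `publication_name` (WoS) hold comparable journal names, which the notebook does not confirm; the helper `norm` and all toy values are hypothetical.

```python
import pandas as pd

def norm(s: pd.Series) -> pd.Series:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return (s.fillna("").str.lower()
             .str.replace(r"[^a-z0-9 ]+", " ", regex=True)
             .str.split().str.join(" "))

# Toy stand-ins for rwd_wo_doi and control_set_wo_doi; the values are invented.
rw = pd.DataFrame({
    "Title": ["introduction", "a specific retracted paper"],
    "Journal": ["Journal of Markets & Morality", "Journal of Examples"],
})
wos = pd.DataFrame({
    "document_title": ["introduction", "a specific retracted paper"],
    "publication_name": ["SOME ORTHOPEDICS JOURNAL", "JOURNAL OF EXAMPLES"],
})

by_title = rw.merge(wos, how="inner", left_on="Title", right_on="document_title")

# Accept a title match only when the normalized journal names also agree.
same_journal = norm(by_title["Journal"]) == norm(by_title["publication_name"])
confirmed = by_title[same_journal]
needs_review = by_title[~same_journal]

print(confirmed[["Title", "Journal"]])
print(needs_review[["Title", "Journal", "publication_name"]])
```

Rows in `needs_review` are not necessarily wrong, since journal names can be abbreviated differently across sources, so they are better checked by hand than dropped automatically.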


In [84]:
#data_retractions_by_title['doi'].value_counts()

In [97]:
processed_data_retractions.columns

Index(['Record ID', 'Title', 'Subject', 'Institution', 'Journal', 'Publisher',
       'Country', 'Author', 'URLS', 'ArticleType', 'RetractionDate',
       'RetractionDOI', 'RetractionPubMedID', 'OriginalPaperDate',
       'OriginalPaperDOI', 'OriginalPaperPubMedID', 'RetractionNature',
       'Reason', 'Paywalled', 'Notes', 'authors', 'author_keywords',
       'keywords_plus', 'cited_references', 'abstract', 'affiliations', 'doi',
       'eissn', 'esi_highly_cited_paper', 'esi_hot_paper', 'early_access_date',
       'funding_agency_and_grant_number', 'isbn', 'issn', 'iso_source_abv',
       'publication_name', 'language', 'month', 'note',
       'cited_reference_count', 'open_access_indicator', 'organization',
       'publisher', 'research_areas', 'researcher_id_numbers',
       'book_series_title', 'wos_core_collection_times_cited_count',
       'document_title', 'document_type', 'wos_categories', 'year_published',
       'authors_affiliations', 'corresponding_author_affiliation',
   

In [98]:
processed_data_retractions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 55 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Record ID                              662 non-null    int64 
 1   Title                                  662 non-null    object
 2   Subject                                662 non-null    object
 3   Institution                            662 non-null    object
 4   Journal                                662 non-null    object
 5   Publisher                              662 non-null    object
 6   Country                                662 non-null    object
 7   Author                                 662 non-null    object
 8   URLS                                   327 non-null    object
 9   ArticleType                            662 non-null    object
 10  RetractionDate                         662 non-null    object
 11  RetractionDOI      

In [87]:
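# Cast the date columns to the generic object dtype.
# Note: despite the dict's name, this does not parse anything into datetimes.
datetime_variable_conversion = {"RetractionDate": "object",
                                "OriginalPaperDate": "object",
                                "publication_date": "object",
                                "early_access_date": "object"
                                }

processed_data_retractions = processed_data_retractions.astype(datetime_variable_conversion)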
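The cast above leaves the four date columns with `object` dtype. If real datetime columns are wanted instead, for example to sort or subtract dates, a sketch along the following lines may help. The `toy` frame and its values are assumptions for illustration, and only two of the four columns are shown.

```python
import pandas as pd

# Toy stand-in for processed_data_retractions; the values are illustrative only.
toy = pd.DataFrame({
    "RetractionDate": ["2023-08-31 00:00:00", "not a date"],
    "publication_date": ["2013-11-20", None],
})

date_cols = ["RetractionDate", "publication_date"]
for col in date_cols:
    # errors="coerce" turns unparseable entries into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

Counting the `NaT` values afterwards shows how many entries could not be parsed and may need manual cleaning.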

In [88]:
processed_data_retractions.to_excel('./retractions_data/processed_data_retractions.xlsx', index = False)

In [89]:
# Rows of filtered_wos whose DOI does not appear in filtered_rwd
doi_not_in_rwd = filtered_wos[~filtered_wos['doi'].isin(filtered_rwd['OriginalPaperDOI'])]
num_dois_in_wos_not_in_rwd = len(doi_not_in_rwd)

# Rows of filtered_rwd whose DOI does not appear in filtered_wos
doi_not_in_wos = filtered_rwd[~filtered_rwd['OriginalPaperDOI'].isin(filtered_wos['doi'])]
num_dois_in_rwd_not_in_wos = len(doi_not_in_wos)

In [90]:
print(f"Number of DOIs from filtered_wos that are not in filtered_rwd: {num_dois_in_wos_not_in_rwd}")
print(f"Number of DOIs from filtered_rwd that are not in filtered_wos: {num_dois_in_rwd_not_in_wos}")

Number of DOIs from filtered_wos that are not in filtered_rwd: 915829
Number of DOIs from filtered_rwd that are not in filtered_wos: 36187
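The two counts above can also be obtained from a single outer merge by using the `indicator` parameter of `DataFrame.merge`, which adds a `_merge` column that records where each row came from. The sketch below uses small invented frames in place of `filtered_rwd` and `filtered_wos`.

```python
import pandas as pd

# Toy stand-ins for filtered_rwd and filtered_wos; the DOIs are invented.
rw = pd.DataFrame({"OriginalPaperDOI": ["10.1/a", "10.1/b", "10.1/c"]})
wos = pd.DataFrame({"doi": ["10.1/b", "10.1/c", "10.1/d", "10.1/e"]})

# indicator=True adds a '_merge' column: left_only, right_only, or both.
merged = rw.merge(wos, how="outer", left_on="OriginalPaperDOI",
                  right_on="doi", indicator=True)
counts = merged["_merge"].value_counts()

print(counts)
```

Here `left_only` corresponds to RW records with no WoS match, `right_only` to WoS records with no RW match, and `both` to matched pairs. Two caveats apply to the real data: pandas matches missing keys with each other in a merge, so rows without a DOI are best dropped first, and duplicated DOIs on either side multiply rows and inflate the `both` count.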


In [91]:
filtered_rwd.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36847 entries, 29155 to 22442
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Record ID              36847 non-null  int64         
 1   Title                  36847 non-null  object        
 2   Subject                36847 non-null  object        
 3   Institution            36846 non-null  object        
 4   Journal                36847 non-null  object        
 5   Publisher              36847 non-null  object        
 6   Country                36847 non-null  object        
 7   Author                 36847 non-null  object        
 8   URLS                   19034 non-null  object        
 9   ArticleType            36847 non-null  object        
 10  RetractionDate         36847 non-null  datetime64[ns]
 11  RetractionDOI          36604 non-null  object        
 12  RetractionPubMedID     34013 non-null  object        
 13  Or

In [92]:
test = filtered_rwd.merge(filtered_wos, how= 'left', left_on= 'OriginalPaperDOI', right_on= 'doi')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
0,985,Early Depth Assessment of Local Burns by Dermo...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Burns Unit, Department of Dermatology, Nagasak...",Archives of Dermatology,JAMA Network,Japan,Kyomi Mihara;Hajime Shindo;Hiroya Mihara;Minak...,,Research Article;,...,,,,,,,,,,NaT
1,5729,The prevention of hip fracture with risedronat...,(HSC) Medicine - Geriatric;(HSC) Medicine - Ne...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Tomohiro Kanoko;Kei Satoh;Jun I...,http://retractionwatch.com/2016/06/03/jama-jou...,Clinical Study;Research Article;,...,,,,,,,,,,NaT
2,5728,Risedronate sodium therapy for prevention of h...,(HSC) Medicine - Neurology;(HSC) Medicine - Re...,"Department of Neurology, Mitate Hospital, Taga...",Archives of Internal Medicine,JAMA Network,Japan,Yoshihiro Sato;Jun Iwamoto;Tomohiro Kanoko,http://retractionwatch.com/2016/06/03/jama-jou...,Research Article;,...,,,,,,,,,,NaT
3,895,The Relation Between Pulse Pressure and Cardio...,(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Dietetics and Nutrition, Harokop...",JAMA Internal Medicine,JAMA Network,Finland;Greece;Italy;Japan;Netherlands;Serbia;...,Demosthenes B Panagiotakos;Daan Kromhout;Aless...,,Research Article;,...,,,,,,,,,,NaT
4,19230,"First Foods Most: After 18-Hour Fast, People D...",(BLS) Nutrition;(SOC) Psychology;,Dyson School of Applied Economics and Manageme...,JAMA Internal Medicine,American Medical Association,United States,Brian Wansink;Aner Tal;Mitsuru Shimizu,http://retractionwatch.com/2018/04/13/caught-o...,Letter;Research Article;,...,,,,,,,,,,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36842,44860,A consideration of the impact of contaminated ...,(ENV) Ground/Surface Water;(HSC) Public Health...,"Department of Communication and Culture, Facul...",Human Life Culture Research,Otsuma Women's University Research Institute f...,Japan,Atsushi Okeda,,Research Article;,...,,,,,,,,,,NaT
36843,43526,Study on the Waste Information and Statistics ...,(ENV) Environmental Sciences;(ENV) Ground/Surf...,"Member, Ph.D. Candidate, Department of Disaste...",Journal of the Korean Society of Hazard Mitiga...,Korean Society of Hazard Mitigation,South Korea,Eun-han Lee;Waon-ho Yi,,Research Article;,...,,,,,,,,,,NaT
36844,38531,On the Feasibility of Stealthily Introducing V...,(B/T) Computer Science;(B/T) Technology;,University of Minnesota,2021 IEEE Symposium on Security and Privacy,IEEE: Institute of Electrical and Electronics ...,United States,Qiushi Wu;Kangjie Lu,,Conference Abstract/Paper;,...,,,,,,,,,,NaT
36845,47227,"Age, Gender Demographics and Comorbidity Preva...",(HSC) Biostatistics/Epidemiology;(HSC) Medicin...,"Department of Orthopaedics, Dhanalakshmi Srini...",Journal of Coastal Life Medicine,Journal of Coastal Life Medicine,India,S Venkatesh Kumar;Mohith Singh;Gowtham Singh;K...,,Research Article;,...,,,,,,,,,,NaT


In [93]:
test = filtered_rwd.merge(filtered_wos, how= 'right', left_on= 'OriginalPaperDOI', right_on= 'doi')
test

Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,...,book_series_title,wos_core_collection_times_cited_count,document_title,document_type,wos_categories,year_published,authors_affiliations,corresponding_author_affiliation,book.group.author,publication_date
0,,,,,,,,,,,...,,4,high-resolution seismic velocity analysis by s...,ARTICLE,GEOCHEMISTRY \& GEOPHYSICS,2021,UNIV TEHRAN;UNIV TEHRAN,UNIV TEHRAN,,2021-01-01
1,,,,,,,,,,,...,,22,panitumumab a novel drug in cancer treatment,ARTICLE; PROCEEDINGS PAPER,ONCOLOGY,2007,C (CORRESPONDING AUTHOR);NAPLES,C (CORRESPONDING AUTHOR),,2007-01-01
2,,,,,,,,,,,...,,205,"timing, order and durability of new product ad...",ARTICLE,BUSINESS; MANAGEMENT,2000,GEORGE MASON UNIV;GEORGE MASON UNIV;UNIV MARYL...,GEORGE MASON UNIV,,2000-01-01
3,,,,,,,,,,,...,,256,"environmental dynamism, capital structure and ...",REVIEW,BUSINESS; MANAGEMENT,2000,CALIF STATE UNIV NORTHRIDGE;CALIF STATE UNIV N...,CALIF STATE UNIV NORTHRIDGE,,2000-01-01
4,,,,,,,,,,,...,,383,product and international diversification amon...,ARTICLE,BUSINESS; MANAGEMENT,2000,UNIV UTAH;UNIV UTAH;CRANFIELD UNIV;CALIF POLYT...,UNIV UTAH,,2000-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
916484,,,,,,,,,,,...,,27,decrease of let-7f in low-dose metronomic pacl...,ARTICLE,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2015,HARBIN MED UNIV;HARBIN MED UNIV;HARBIN MED UNI...,HARBIN MED UNIV,,2015-01-01
916485,,,,,,,,,,,...,,38,the prostate basal cell (bc) heterogeneity and...,ARTICLE,BIOCHEMISTRY \& MOLECULAR BIOLOGY; BIOLOGY,2014,BAYLOR COLL MED;TEXAS AANDM UNIV;LUZHOU MED COLL,BAYLOR COLL MED,,2014-01-01
916486,,,,,,,,,,,...,,25,"global costs, health benefits, and economic be...",ARTICLE,ONCOLOGY,2021,HARVARD UNIV;HARVARD UNIV;HARVARD UNIV;OLIVIA ...,HARVARD UNIV,,2021-03-01
916487,,,,,,,,,,,...,,58,measurement and interpretation of differential...,ARTICLE,"ASTRONOMY \& ASTROPHYSICS; PHYSICS, NUCLEAR; P...",2019,YEREVAN PHYS INST;YEREVAN PHYS INST;KRATSCHMER...,YEREVAN PHYS INST,,2019-05-10


In [94]:
filtered_wos.shape

(916489, 35)