# Data Cleaning
To clean this dataset, we'll load it, check for duplicate records, and remove records that are not relevant to our analysis.

First, we'll import our packages, set up our directories, and load the dataset.

In [1]:
import pandas as pd
from pathlib import Path

# Set up directories
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'

df = pd.read_csv(input_dir / '01_raw_data.csv', low_memory=False)

## Duplicate Records
Comparing the number of rows in the dataset with the number of unique DOIs tells us how many duplicate records we have.


In [2]:
df.shape

(106107, 51)

In [3]:
len(set(df['DOI']))

106036

In [4]:
# Drop duplicate records, keeping the first occurrence of each DOI
df.drop_duplicates(subset=['DOI'], keep='first', inplace=True)
df.shape

(106036, 51)
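
The duplicate count can also be read off directly rather than by comparing two numbers (in the real data the difference is 106107 - 106036 = 71). The sketch below uses a small made-up dataframe, `toy`, in place of the real one. Because DOIs are case-insensitive, it also shows an optional check on a lowercased copy; whether the real data contains such case variants is not known from this notebook.

```python
import pandas as pd

# Toy stand-in for the real dataframe; these DOIs are made up for illustration.
toy = pd.DataFrame({
    'DOI': ['10.1000/a1', '10.1000/a1', '10.1000/B2', '10.1000/b2', '10.1000/c3'],
    'title': ['first', 'first again', 'second', 'second again', 'third'],
})

# Rows whose DOI already appeared earlier in the frame
n_exact_dupes = toy['DOI'].duplicated().sum()

# DOIs are case-insensitive, so compare a lowercased copy as well
n_caseless_dupes = toy['DOI'].str.lower().duplicated().sum()

# Non-mutating equivalent of drop_duplicates(..., inplace=True)
deduped = toy.drop_duplicates(subset=['DOI'], keep='first')

print(n_exact_dupes, n_caseless_dupes, deduped.shape)
```

Assigning the result to a new name, as with `deduped` here, leaves the original frame untouched, which can make it easier to re-run a cell.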

## Keeping All Columns
We will not drop any columns here. Instead, select the columns of interest when importing the data, so that the file keeps its full data structure.
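
One way to select columns at import time is the `usecols` parameter of `pd.read_csv`. The sketch below reads a tiny in-memory CSV with made-up contents; the list `columns_of_interest` is a hypothetical choice, not the set used in this project.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the raw data file (contents are made up).
raw_csv = io.StringIO(
    "DOI,title,publisher,score\n"
    "10.1000/a1,First,Pub A,0.0\n"
    "10.1000/b2,Second,Pub B,0.0\n"
)

# Hypothetical choice of columns; adjust to the analysis at hand.
columns_of_interest = ['DOI', 'title', 'publisher']

# Only the listed columns are parsed; the file on disk is unchanged.
subset = pd.read_csv(raw_csv, usecols=columns_of_interest)
print(subset.columns.tolist())
```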

## Editors
Very few records have a value in the *editor* column. Some of our prior work indicates that an editor value can be a sign that a work has been mislabeled as a 'journal article', so we'll explore some of these records to verify that.

We'll set up a dataframe of just those records that have data in the *editor* column.
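
A minimal sketch of building that subset, assuming missing *editor* values are loaded as NaN. The dataframe `toy` and the name `with_editor` are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; values are made up.
toy = pd.DataFrame({
    'title': ['Masthead', 'A regular article', 'Preface'],
    'editor': ["[{'given': 'A.', 'family': 'Example'}]", np.nan, np.nan],
})

# Keep only the rows where the editor column is populated
with_editor = toy.loc[toy['editor'].notna()]
print(len(with_editor), 'of', len(toy), 'records have an editor value')
```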

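Next, we'll search the titles for a few keywords that suggest non-article content. Note that the search below runs over the titles of the full dataset, not only the records with an editor value.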

In [5]:
editorial = df.loc[df.title.str.contains(r'editorial|errata|contents|conference|proceedings|masthead|symposium|abstract|Book Review|preface|title page', 
                                         regex=True, case=False, na=False)]
editorial

Index,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
5,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",1,Wiley,52,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['ChemInform'],"{'date-parts': [[2015, 12]]}",10.1002/chin.201552196,journal-article,"{'date-parts': [[2015, 12, 13]], 'date-time': ...",no-no,Crossref,0,['ChemInform Abstract: Supramolecular Polymeri...,10.10020,46,"[{'given': 'Takeharu', 'family': 'Haino', 'seq...",311.0,"[{'key': '10.1002/chin.201552196-BIB1|cit1', '...",['ChemInform'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 2, 17]], 'date-time': '...",0.0,{'primary': {'URL': 'http://doi.wiley.com/10.1...,"{'date-parts': [[2015, 12]]}",1,"{'issue': '52', 'published-print': {'date-part...",http://dx.doi.org/10.1002/chin.201552196,['0931-7597'],"[{'value': '0931-7597', 'type': 'print'}]",['General Materials Science'],"{'date-parts': [[2015, 12]]}",,"{'date-parts': [[2015, 12, 10]]}",['Portico'],,,,,,,,,,,,,,
23,"{'date-parts': [[2022, 4, 4]], 'date-time': '2...",0,Elsevier BV,1,"[{'start': {'date-parts': [[1965, 6, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['IFAC Proceedings Volumes'],"{'date-parts': [[1965, 6]]}",10.1016/s1474-6670(17)69139-0,journal-article,"{'date-parts': [[2017, 7, 1]], 'date-time': '2...",577,Crossref,0,['Symposium Closing Remarks'],10.10160,2,,78.0,,['IFAC Proceedings Volumes'],en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2018, 8, 30]], 'date-time': '...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[1965, 6]]}",0,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/s1474-6670(17)69139-0,['1474-6670'],"[{'value': '1474-6670', 'type': 'print'}]","['General Economics, Econometrics and Finance']","{'date-parts': [[1965, 6]]}",['S1474667017691390'],,,,,,,,,,,,,,,,
30,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Commun. Numer. Meth. Engng.'],"{'date-parts': [[1995, 3]]}",10.1002/cnm.1640110301,journal-article,"{'date-parts': [[2005, 8, 8]], 'date-time': '2...",fmi-fmi,Crossref,0,['Masthead'],10.10020,11,,311.0,,['Communications in Numerical Methods in Engin...,en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 2]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[1995, 3]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/cnm.1640110301,['1069-8299'],"[{'value': '1069-8299', 'type': 'print'}]","['Applied Mathematics', 'Computational Theory ...","{'date-parts': [[1995, 3]]}",,"{'date-parts': [[2005, 6, 21]]}",,,,,,,,,,,,,,,
33,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",1,Wiley,33,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Chemischer Informationsdienst'],"{'date-parts': [[1972, 8, 15]]}",10.1002/chin.197233207,journal-article,"{'date-parts': [[2016, 2, 26]], 'date-time': '...",no-no,Crossref,0,['ChemInform Abstract: FRIEDEL-CRAFTS-ACYLIERU...,10.10020,3,"[{'given': 'J. K.', 'family': 'GROVES', 'seque...",311.0,"[{'key': '10.1002/chin.197233207-BIB1|cit1', '...",['Chemischer Informationsdienst'],en,[{'URL': 'http://api.wiley.com/onlinelibrary/t...,"{'date-parts': [[2021, 7, 1]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[1972, 8, 15]]}",1,"{'issue': '33', 'published-print': {'date-part...",http://dx.doi.org/10.1002/chin.197233207,['0009-2975'],"[{'value': '0009-2975', 'type': 'print'}]",['General Medicine'],"{'date-parts': [[1972, 8, 15]]}",,"{'date-parts': [[2016, 2, 23]]}",['Portico'],,,,,,,,,,,,,,
37,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",0,Wiley,27-29,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Z. Pflanzenernaehr. Dueng. Bodenk.'],{'date-parts': [[1931]]},10.1002/jpln.19310102701,journal-article,"{'date-parts': [[2007, 2, 7]], 'date-time': '2...",fmi-fmi,Crossref,0,['Masthead'],10.10020,10,,311.0,,"['Zeitschrift für Pflanzenernährung, Düngung, ...",de,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 5]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,{'date-parts': [[1931]]},0,"{'issue': '27-29', 'published-print': {'date-p...",http://dx.doi.org/10.1002/jpln.19310102701,"['0372-9702', '1522-2624']","[{'value': '0372-9702', 'type': 'print'}, {'va...","['Plant Science', 'Soil Science']",{'date-parts': [[1931]]},,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105970,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,Dissolution Technologies,2,,"{'domain': [], 'crossmark-restriction': False}",['Dissolution Technol.'],{'date-parts': [[2006]]},10.14227/dt130206p25,journal-article,"{'date-parts': [[2015, 2, 3]], 'date-time': '2...",25-26,Crossref,1,['Book Review: Pharmaceutical Dissolution Test...,10.14227,13,"[{'given': 'Tahseen', 'family': 'Mirza', 'sequ...",5341.0,,['Dissolution Technologies'],,[{'URL': 'http://www.dissolutiontech.com/DTres...,"{'date-parts': [[2017, 4, 12]], 'date-time': '...",0.0,{'primary': {'URL': 'http://www.dissolutiontec...,{'date-parts': [[2006]]},0,"{'issue': '2', 'published-online': {'date-part...",http://dx.doi.org/10.14227/dt130206p25,['1521-298X'],"[{'value': '1521-298X', 'type': 'print'}]",['Pharmaceutical Science'],{'date-parts': [[2006]]},,{'date-parts': [[2006]]},,,,,,,,,,,,,,,
106018,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Elsevier BV,,"[{'start': {'date-parts': [[2017, 4, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Endocrine Practice'],"{'date-parts': [[2017, 4]]}",10.1016/s1530-891x(20)44162-x,journal-article,"{'date-parts': [[2020, 12, 31]], 'date-time': ...",230-231,Crossref,0,['Abstract #1037: Hypercalcitoninemia Mediated...,10.10160,23,"[{'given': 'Lubaina', 'family': 'Presswala', '...",78.0,,['Endocrine Practice'],en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2021, 11, 4]], 'date-time': '...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2017, 4]]}",0,,http://dx.doi.org/10.1016/s1530-891x(20)44162-x,['1530-891X'],"[{'value': '1530-891X', 'type': 'print'}]","['Endocrinology', 'General Medicine', 'Endocri...","{'date-parts': [[2017, 4]]}",['S1530891X2044162X'],,,,,,,,,,,,,,,,
106040,"{'date-parts': [[2022, 4, 6]], 'date-time': '2...",0,SAGE Publications,2,"[{'start': {'date-parts': [[1994, 6, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['British Journalism Review'],"{'date-parts': [[1994, 6]]}",10.1177/095647489400500214,journal-article,"{'date-parts': [[2007, 3, 18]], 'date-time': '...",54-56,Crossref,0,['Book Reviews : More an uncle than an ogre'],10.11770,5,"[{'given': 'Robert', 'family': 'Edwards', 'seq...",179.0,,['British Journalism Review'],en,[{'URL': 'http://journals.sagepub.com/doi/pdf/...,"{'date-parts': [[2021, 3, 16]], 'date-time': '...",0.0,{'primary': {'URL': 'http://journals.sagepub.c...,"{'date-parts': [[1994, 6]]}",0,"{'issue': '2', 'published-print': {'date-parts...",http://dx.doi.org/10.1177/095647489400500214,"['0956-4748', '1741-2668']","[{'value': '0956-4748', 'type': 'print'}, {'va...","['Industrial and Manufacturing Engineering', '...","{'date-parts': [[1994, 6]]}",['10.1177/095647489400500214'],"{'date-parts': [[2016, 7, 22]]}",,,,,,,,,['Michael Foot: by Mervyn Jones Victor Gollanc...,,,,,,
106050,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",1,AIP Publishing,8,,"{'domain': [], 'crossmark-restriction': False}",['Journal of Applied Physics'],"{'date-parts': [[1985, 4, 15]]}",10.1063/1.334625,journal-article,"{'date-parts': [[2002, 7, 26]], 'date-time': '...",4237-4237,Crossref,0,['Domains in constructional steel: Theory and ...,10.10630,57,"[{'given': 'J. A.', 'family': 'Szpunar', 'sequ...",317.0,"[{'key': 'r1', 'first-page': '1470', 'volume':...",['Journal of Applied Physics'],en,[{'URL': 'http://aip.scitation.org/doi/pdf/10....,"{'date-parts': [[2016, 12, 28]], 'date-time': ...",0.0,{'primary': {'URL': 'http://aip.scitation.org/...,"{'date-parts': [[1985, 4, 15]]}",1,"{'issue': '8', 'published-print': {'date-parts...",http://dx.doi.org/10.1063/1.334625,"['0021-8979', '1089-7550']","[{'value': '0021-8979', 'type': 'print'}, {'va...",['General Physics and Astronomy'],"{'date-parts': [[1985, 4, 15]]}",['10.1063/1.334625'],,,,,,,,,,,,,,,,


We've found some editorials, mastheads, conference proceedings, and abstracts, so we'll drop them from our dataset.

In [6]:
df.drop(editorial.index, inplace=True)

In [7]:
df.shape

(102486, 51)
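
This step removes 106036 - 102486 = 3550 records. Since broad keywords such as 'abstract' or 'contents' can also occur in the titles of ordinary articles, it may be worth tallying which keyword triggered each match before dropping. The sketch below does this on a few made-up titles with a shortened keyword list; it records only the first keyword found in each title.

```python
import re
import pandas as pd

# Shortened, illustrative keyword list (the search above uses more terms).
keywords = ['editorial', 'masthead', 'abstract', 'book review']
pattern = '(' + '|'.join(keywords) + ')'

# Made-up titles, written in the stringified-list form seen in the output above.
titles = pd.Series([
    "['Masthead']",
    "['ChemInform Abstract: an example']",
    "['Abstract reasoning in primates']",  # ordinary article caught by 'abstract'
    "['Book Review: an example']",
    "['A regular article']",
    None,
])

# First keyword matched in each title (NaN where nothing matches)
first_hit = titles.str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower()

# Number of titles attributed to each keyword
hits_per_keyword = first_hit.value_counts()
print(hits_per_keyword)
```

A keyword with a surprisingly high count is a natural candidate for a manual look at a sample of its matches.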

## Conferences
Looking back at **editorial**, we see a couple of 'Conferences' and 'Proceedings' in the *container-title* column. Let's see how many of the records remaining in our dataset come from these journals/containers.

Additionally, we see a few records from *ChemInform*, a journal that published chemistry abstracts, so we'll check whether any of those records remain as well.

We'll use a keyword search in the *container-title* column to find these records. The search also matches container titles containing 'news' or 'CrossRef Listing of Deleted DOIs', as well as every record published by EDP Sciences.

In [8]:
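# na=False treats records with a missing container title as non-matches
title_match = df['container-title'].str.contains(r'conference|ChemInform|news|CrossRef Listing of Deleted DOIs', regex=True, case=False, na=False)
conferences = df.loc[title_match | (df.publisher == 'EDP Sciences')]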
conferences

Index,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
35,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,EDP Sciences,3,"[{'start': {'date-parts': [[2021, 6, 28]], 'da...","{'domain': [], 'crossmark-restriction': False}",['Europhysics News'],{'date-parts': [[2021]]},10.1051/epn/2021307,journal-article,"{'date-parts': [[2021, 6, 28]], 'date-time': '...",32-32,Crossref,0,['Is there new physics around the corner?'],10.1051,52,"[{'given': 'Hans Peter', 'family': 'Beck', 'se...",250.0,,['Europhysics News'],,[{'URL': 'https://www.europhysicsnews.org/10.1...,"{'date-parts': [[2021, 10, 11]], 'date-time': ...",0.0,{'primary': {'URL': 'https://www.europhysicsne...,{'date-parts': [[2021]]},0,{'issue': '3'},http://dx.doi.org/10.1051/epn/2021307,"['0531-7479', '1432-1092']","[{'value': '0531-7479', 'type': 'print'}, {'va...",['General Physics and Astronomy'],{'date-parts': [[2021]]},['epn2021523p32'],"{'date-parts': [[2021, 6, 28]]}",,,,,,,,,,,,,,,
57,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",6,IOP Publishing,1,"[{'start': {'date-parts': [[2021, 2, 1]], 'dat...","{'domain': ['iopscience.iop.org'], 'crossmark-...",['IOP Conf. Ser.: Earth Environ. Sci.'],"{'date-parts': [[2021, 2, 1]]}",10.1088/1755-1315/660/1/012131,journal-article,"{'date-parts': [[2021, 2, 22]], 'date-time': '...",012131,Crossref,0,['Application of wavelet multi-scale analysis ...,10.1088,660,"[{'given': 'Hailong', 'family': 'Sun', 'sequen...",266.0,"[{'key': 'EES_660_1_012131bib1', 'author': 'Ha...",['IOP Conference Series: Earth and Environment...,,[{'URL': 'https://iopscience.iop.org/article/1...,"{'date-parts': [[2022, 1, 29]], 'date-time': '...",0.0,{'primary': {'URL': 'https://iopscience.iop.or...,"{'date-parts': [[2021, 2, 1]]}",6,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1088/1755-1315/660/1/012131,"['1755-1307', '1755-1315']","[{'value': '1755-1307', 'type': 'print'}, {'va...",,"{'date-parts': [[2021, 2, 1]]}",,,,http://dx.doi.org/10.1088/crossmark-policy,[{'value': 'Application of wavelet multi-scale...,,,,<jats:title>Abstract</jats:title>\n ...,,,,,,,,
117,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",17,EDP Sciences,,"[{'start': {'date-parts': [[2021, 4, 26]], 'da...","{'domain': [], 'crossmark-restriction': False}",['EPJ Web Conf.'],{'date-parts': [[2021]]},10.1051/epjconf/202124801022,journal-article,"{'date-parts': [[2021, 4, 26]], 'date-time': '...",01022,Crossref,0,['Distributions of Two Atoms Collisions over t...,10.1051,248,"[{'given': 'Sergey', 'family': 'Zheltov', 'seq...",250.0,"[{'key': 'R1', 'doi-asserted-by': 'crossref', ...",['EPJ Web of Conferences'],,[{'URL': 'https://www.epj-conferences.org/10.1...,"{'date-parts': [[2021, 4, 26]], 'date-time': '...",0.0,{'primary': {'URL': 'https://www.epj-conferenc...,{'date-parts': [[2021]]},17,,http://dx.doi.org/10.1051/epjconf/202124801022,['2100-014X'],"[{'value': '2100-014X', 'type': 'electronic'}]","['General Earth and Planetary Sciences', 'Gene...",{'date-parts': [[2021]]},['epjconf_mnps2021_01022'],"{'date-parts': [[2021, 4, 26]]}",,,,,,,<jats:p>The processes of heat and mass transfe...,,,,"[{'given': 'A.', 'family': 'Nadykto', 'sequenc...",,,,
140,"{'date-parts': [[2022, 11, 8]], 'date-time': '...",0,Association for the Advancement of Artificial ...,1,,"{'domain': [], 'crossmark-restriction': False}",['AAAI'],,10.1609/aaai.v32i1.11721,journal-article,"{'date-parts': [[2022, 6, 24]], 'date-time': '...",,Crossref,6,['SC2Net: Sparse LSTMs for Sparse Coding'],10.1609,32,"[{'given': 'Joey Tianyi', 'family': 'Zhou', 's...",9382.0,,['Proceedings of the AAAI Conference on Artifi...,,[{'URL': 'https://ojs.aaai.org/index.php/AAAI/...,"{'date-parts': [[2022, 11, 7]], 'date-time': '...",0.0,{'primary': {'URL': 'https://ojs.aaai.org/inde...,"{'date-parts': [[2018, 4, 29]]}",0,"{'issue': '1', 'published-online': {'date-part...",http://dx.doi.org/10.1609/aaai.v32i1.11721,"['2374-3468', '2159-5399']","[{'value': '2374-3468', 'type': 'electronic'},...",['General Medicine'],"{'date-parts': [[2018, 4, 29]]}",,"{'date-parts': [[2018, 4, 29]]}",,,,,,,<jats:p>\n \n The iterative hard-t...,,,,,,,,
159,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Wiley,5,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Sci News'],,10.1002/scin.2007.5591710502,journal-article,"{'date-parts': [[2010, 4, 23]], 'date-time': '...",67-67,Crossref,0,['Suburb of stonehenge: Ritual village found n...,10.1002,171,"[{'given': 'Bruce', 'family': 'Bower', 'sequen...",311.0,,['Science News'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 4]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[2009, 9, 30]]}",0,{'issue': '5'},http://dx.doi.org/10.1002/scin.2007.5591710502,"['0036-8423', '1943-0930']","[{'value': '0036-8423', 'type': 'print'}, {'va...",['General Engineering'],"{'date-parts': [[2009, 9, 30]]}",,"{'date-parts': [[2009, 9, 30]]}",,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105726,"{'date-parts': [[2022, 8, 5]], 'date-time': '2...",24,EDP Sciences,1,,"{'domain': [], 'crossmark-restriction': False}",['A&amp;A'],"{'date-parts': [[2006, 10]]}",10.1051/0004-6361:20065495,journal-article,"{'date-parts': [[2006, 8, 4]], 'date-time': '2...",15-19,Crossref,41,['E- and B-mode mixing from incomplete knowled...,10.1051,457,"[{'given': 'M.', 'family': 'Kilbinger', 'seque...",250.0,"[{'key': 'R1', 'doi-asserted-by': 'crossref', ...",['Astronomy &amp; Astrophysics'],,[{'URL': 'http://www.aanda.org/10.1051/0004-63...,"{'date-parts': [[2017, 5, 22]], 'date-time': '...",0.0,{'primary': {'URL': 'http://www.aanda.org/10.1...,"{'date-parts': [[2006, 9, 12]]}",24,{'issue': '1'},http://dx.doi.org/10.1051/0004-6361:20065495,"['0004-6361', '1432-0746']","[{'value': '0004-6361', 'type': 'print'}, {'va...","['Space and Planetary Science', 'Astronomy and...","{'date-parts': [[2006, 9, 12]]}",['aa5495-06'],"{'date-parts': [[2006, 9, 12]]}",,,,,,,,,,,,,,,
105831,"{'date-parts': [[2022, 4, 6]], 'date-time': '2...",0,JSTOR,15,,"{'domain': [], 'crossmark-restriction': False}",['The Science News-Letter'],"{'date-parts': [[1954, 4, 10]]}",10.2307/3933390,journal-article,"{'date-parts': [[2007, 11, 28]], 'date-time': ...",227,Crossref,0,['H-Bomb Damage Officially Bared'],10.2307,65,,1121.0,,['The Science News-Letter'],,,"{'date-parts': [[2018, 5, 9]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://www.jstor.org/sta...,"{'date-parts': [[1954, 4, 10]]}",0,"{'issue': '15', 'published-print': {'date-part...",http://dx.doi.org/10.2307/3933390,['0096-4018'],"[{'value': '0096-4018', 'type': 'print'}]",,"{'date-parts': [[1954, 4, 10]]}",,,,,,,,,,,,,,,,,
105880,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,EDP Sciences,Suppl. 1,,"{'domain': [], 'crossmark-restriction': False}",['Ann. Zootech.'],{'date-parts': [[1995]]},10.1051/animres:19950579,journal-article,"{'date-parts': [[2007, 10, 10]], 'date-time': ...",109-109,Crossref,3,['Herbage intake rates and grazing behaviour o...,10.1051,44,"[{'given': 'PD', 'family': 'Penning', 'sequenc...",250.0,,['Annales de Zootechnie'],,[{'URL': 'http://animres.edpsciences.org/10.10...,"{'date-parts': [[2017, 1, 11]], 'date-time': '...",0.0,{'primary': {'URL': 'http://www.edpsciences.or...,{'date-parts': [[1995]]},0,{'issue': 'Suppl. 1'},http://dx.doi.org/10.1051/animres:19950579,['0003-424X'],"[{'value': '0003-424X', 'type': 'print'}]",['Animal Science and Zoology'],{'date-parts': [[1995]]},['Ann.Zootech._0003-424X_1995_44_Suppl1_ART0079'],,,,,,,,,,,,,,,,
105945,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",13,EDP Sciences,,"[{'start': {'date-parts': [[2020, 11, 25]], 'd...","{'domain': [], 'crossmark-restriction': False}",['BIO Web Conf.'],{'date-parts': [[2020]]},10.1051/bioconf/20202700018,journal-article,"{'date-parts': [[2020, 11, 25]], 'date-time': ...",00018,Crossref,0,['Detoxication agents and technologies for ani...,10.1051,27,"[{'given': 'Lyubov L.', 'family': 'Zakharova',...",250.0,"[{'key': 'R1', 'unstructured': 'Aturova V.P., ...",['BIO Web of Conferences'],,[{'URL': 'https://www.bio-conferences.org/10.1...,"{'date-parts': [[2020, 11, 25]], 'date-time': ...",0.0,{'primary': {'URL': 'https://www.bio-conferenc...,{'date-parts': [[2020]]},13,,http://dx.doi.org/10.1051/bioconf/20202700018,['2117-4458'],"[{'value': '2117-4458', 'type': 'electronic'}]",['General Medicine'],{'date-parts': [[2020]]},['bioconf_fies-20_00018'],"{'date-parts': [[2020, 11, 25]]}",,,,,,,<jats:p>The impact of human economic activity ...,,,,"[{'given': 'A.', 'family': 'Valiev', 'sequence...",,,,


In [9]:
df.drop(conferences.index, inplace=True)
df.shape

(100912, 51)
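
This step removes a further 102486 - 100912 = 1574 records. Because the filter combines a title search with a publisher match, it can help to see which containers are affected and how many records are caught by the publisher condition alone. The sketch below uses made-up rows and a shortened pattern; the names `toy`, `matched`, and `by_container` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; rows are made up.
toy = pd.DataFrame({
    'container-title': ["['EPJ Web of Conferences']", "['Science News']",
                        "['Physics of Fluids']", np.nan, "['Science News']"],
    'publisher': ['EDP Sciences', 'Wiley', 'AIP Publishing', 'EDP Sciences', 'Wiley'],
})

# Shortened pattern; na=False turns missing titles into non-matches
title_match = toy['container-title'].str.contains(
    r'conference|news', case=False, regex=True, na=False)
publisher_match = toy['publisher'] == 'EDP Sciences'
matched = toy.loc[title_match | publisher_match]

# Records per container among the matches (missing titles counted too)
by_container = matched['container-title'].value_counts(dropna=False)

# Records matched only because of their publisher
publisher_only = (publisher_match & ~title_match).sum()
print(by_container)
print(publisher_only)
```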

In [10]:
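# Randomly pick 912 records (fixed seed for reproducibility) and drop them, leaving 100,000 records
extras = df.sample(n=912, random_state=42)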
df.drop(extras.index, inplace=True)
df

Index,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
0,"{'date-parts': [[2022, 10, 7]], 'date-time': '...",14,Wiley,1,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Syst. Dyn. Rev.'],{'date-parts': [[2000]]},10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,journal-article,"{'date-parts': [[2002, 9, 10]], 'date-time': '...",27-41,Crossref,57,['The validation of commercial system dynamics...,10.10020,16,"[{'given': 'Geoff', 'family': 'Coyle', 'sequen...",311.0,[{'key': '10.1002/(SICI)1099-1727(200021)16:1<...,['System Dynamics Review'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 1]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,{'date-parts': [[2000]]},14,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/(sici)1099-1727(2000...,"['0883-7066', '1099-1727']","[{'value': '0883-7066', 'type': 'print'}, {'va...","['Management of Technology and Innovation', 'S...",{'date-parts': [[2000]]},,,,,,,,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",12,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1979, 3, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MTB'],"{'date-parts': [[1979, 3]]}",10.1007/bf02653972,journal-article,"{'date-parts': [[2007, 7, 17]], 'date-time': '...",57-62,Crossref,20,['Effect of system geometry on the leaching be...,10.10070,10,"[{'given': 'C.', 'family': 'Vu', 'sequence': '...",297.0,"[{'key': 'BF02653972_CR1', 'volume-title': 'Ph...",['Metallurgical Transactions B'],en,[{'URL': 'http://link.springer.com/content/pdf...,"{'date-parts': [[2019, 5, 20]], 'date-time': '...",0.0,{'primary': {'URL': 'http://link.springer.com/...,"{'date-parts': [[1979, 3]]}",12,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1007/bf02653972,"['0360-2141', '1543-1916']","[{'value': '0360-2141', 'type': 'print'}, {'va...","['Materials Chemistry', 'Metals and Alloys', '...","{'date-parts': [[1979, 3]]}",['BF02653972'],,,,,,,,,,,,,,,,
2,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2017, 11, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['RECIEL'],"{'date-parts': [[2017, 11]]}",10.1111/reel.12221,journal-article,"{'date-parts': [[2017, 12, 1]], 'date-time': '...",243-254,Crossref,2,['The international law on transboundary haze ...,10.11110,26,"[{'given': 'Shawkat', 'family': 'Alam', 'seque...",311.0,,"['Review of European, Comparative &amp; Intern...",en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2017, 12, 1]], 'date-time': '...",0.0,{'primary': {'URL': 'http://doi.wiley.com/10.1...,"{'date-parts': [[2017, 11]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1111/reel.12221,['2050-0386'],"[{'value': '2050-0386', 'type': 'print'}]","['Law', 'Management, Monitoring, Policy and La...","{'date-parts': [[2017, 11]]}",,"{'date-parts': [[2017, 11, 28]]}",['Portico'],,,,,,,,,,,,,,
3,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Crop Science Society of Japan,1-2,,"{'domain': [], 'crossmark-restriction': False}","['Japanese journal of crop science', 'Jpn. J. ...",{'date-parts': [[1951]]},10.1626/jcs.20.219,journal-article,"{'date-parts': [[2011, 9, 20]], 'date-time': '...",219-222,Crossref,0,['Studies on the influence of pruning on the v...,10.16260,20,"[{'given': 'C.', 'family': 'TSUDA', 'sequence'...",632.0,,['Japanese Journal of Crop Science'],en,[{'URL': 'http://www.jstage.jst.go.jp/article/...,"{'date-parts': [[2021, 4, 30]], 'date-time': '...",0.0,{'primary': {'URL': 'http://www.jstage.jst.go....,{'date-parts': [[1951]]},0,"{'issue': '1-2', 'published-print': {'date-par...",http://dx.doi.org/10.1626/jcs.20.219,"['0011-1848', '1349-0990']","[{'value': '0011-1848', 'type': 'print'}, {'va...","['Genetics', 'Agronomy and Crop Science', 'Foo...",{'date-parts': [[1951]]},,,,,,,,,,,,,,,,,
4,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",60,Elsevier BV,6,"[{'start': {'date-parts': [[2018, 12, 1]], 'da...","{'domain': ['clinicalkey.fr', 'elsevier.com', ...",['Revue de Pneumologie Clinique'],"{'date-parts': [[2018, 12]]}",10.1016/j.pneumo.2018.09.002,journal-article,"{'date-parts': [[2018, 10, 10]], 'date-time': ...",391-399,Crossref,0,['Le tabagisme et l’aide à l’arrêt du tabac de...,10.10160,74,"[{'given': 'J.', 'family': 'Perriot', 'sequenc...",78.0,[{'key': '10.1016/j.pneumo.2018.09.002_bib0305...,['Revue de Pneumologie Clinique'],fr,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2019, 10, 26]], 'date-time': ...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2018, 12]]}",60,"{'issue': '6', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/j.pneumo.2018.09.002,['0761-8417'],"[{'value': '0761-8417', 'type': 'print'}]",['Pulmonary and Respiratory Medicine'],"{'date-parts': [[2018, 12]]}",['S0761841718301792'],,,http://dx.doi.org/10.1016/elsevier_cm_policy,"[{'value': 'Elsevier', 'name': 'publisher', 'l...",,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106102,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",2,Elsevier BV,2,"[{'start': {'date-parts': [[1988, 4, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['British Journal of Oral and Maxillofacial Su...,"{'date-parts': [[1988, 4]]}",10.1016/0266-4356(88)90016-2,journal-article,"{'date-parts': [[2004, 4, 29]], 'date-time': '...",171-172,Crossref,1,['A champy plate template'],10.10160,26,"[{'given': 'M.T.', 'family': 'Simpson', 'seque...",78.0,"[{'key': '10.1016/0266-4356(88)90016-2_BIB1', ...",['British Journal of Oral and Maxillofacial Su...,en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2019, 2, 9]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[1988, 4]]}",2,"{'issue': '2', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/0266-4356(88)90016-2,['0266-4356'],"[{'value': '0266-4356', 'type': 'print'}]","['Otorhinolaryngology', 'Oral Surgery', 'Surge...","{'date-parts': [[1988, 4]]}",['0266435688900162'],,,,,,,,,,,,,,,,
106103,"{'date-parts': [[2023, 1, 10]], 'date-time': '...",41,"Impact Journals, LLC",3,,"{'domain': [], 'crossmark-restriction': False}",['Oncotarget'],"{'date-parts': [[2018, 1, 9]]}",10.18632/oncotarget.23280,journal-article,"{'date-parts': [[2017, 12, 15]], 'date-time': ...",3946-3955,Crossref,27,['Validation of a hypoxia related gene signatu...,10.18632,9,"[{'given': 'Lingjian', 'family': 'Yang', 'sequ...",7892.0,"[{'key': '1', 'doi-asserted-by': 'crossref', '...",['Oncotarget'],en,[{'URL': 'https://www.oncotarget.com/lookup/do...,"{'date-parts': [[2020, 7, 15]], 'date-time': '...",0.0,{'primary': {'URL': 'https://www.oncotarget.co...,"{'date-parts': [[2017, 12, 12]]}",41,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.18632/oncotarget.23280,['1949-2553'],"[{'value': '1949-2553', 'type': 'electronic'}]",['Oncology'],"{'date-parts': [[2017, 12, 12]]}","['23280', '29423096']","{'date-parts': [[2017, 12, 12]]}",,,,,,,,,,,,,,,
106104,"{'date-parts': [[2022, 10, 16]], 'date-time': ...",33,AIP Publishing,9,,"{'domain': ['aip.scitation.org'], 'crossmark-r...",,"{'date-parts': [[1996, 9]]}",10.1063/1.869021,journal-article,"{'date-parts': [[2002, 7, 26]], 'date-time': '...",2365-2374,Crossref,77,['An experimental study of deep water plunging...,10.10630,8,"[{'given': 'Marc', 'family': 'Perlin', 'sequen...",317.0,"[{'key': '10.1063/1.869021_r1', 'doi-asserted-...",['Physics of Fluids'],en,[{'URL': 'http://aip.scitation.org/doi/10.1063...,"{'date-parts': [[2017, 11, 20]], 'date-time': ...",0.0,{'primary': {'URL': 'http://aip.scitation.org/...,"{'date-parts': [[1996, 9]]}",33,"{'issue': '9', 'published-print': {'date-parts...",http://dx.doi.org/10.1063/1.869021,"['1070-6631', '1089-7666']","[{'value': '1070-6631', 'type': 'print'}, {'va...","['Condensed Matter Physics', 'Fluid Flow and T...","{'date-parts': [[1996, 9]]}",,,,http://dx.doi.org/10.1063/aip-crossmark-policy...,,,,,,,,,,,,,
106105,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",5,Oxford University Press (OUP),8,"[{'start': {'date-parts': [[1999, 8, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",,"{'date-parts': [[2002, 12, 12]]}",10.1002/bjs.1155,journal-article,"{'date-parts': [[2006, 6, 16]], 'date-time': '...",1099-1100,Crossref,0,"[""Authors' reply""]",10.10930,86,"[{'given': 'R E K', 'family': 'Marshall', 'seq...",286.0,"[{'key': '2021070922214553300_bib1', 'doi-asse...",['British Journal of Surgery'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 10]], 'date-time': '...",0.0,{'primary': {'URL': 'https://academic.oup.com/...,"{'date-parts': [[1999, 8]]}",5,"{'issue': '8', 'published-online': {'date-part...",http://dx.doi.org/10.1002/bjs.1155,"['0007-1323', '1365-2168']","[{'value': '0007-1323', 'type': 'print'}, {'va...",['Surgery'],"{'date-parts': [[1999, 8]]}",,"{'date-parts': [[2002, 12, 12]]}",,,,,,,,,,"{'date-parts': [[1999, 8]]}",,,,,


## Cleaning Dates
Here we'll reformat three of the date columns, *created*, *deposited*, and *published*, into a form that is easier to parse. Not all records have month and day values for the *published* field, so we'll keep only the year for that column. For *created* and *deposited* we'll use a YYYY-MM-DD format.

We've chosen these dates because each carries information we'll use later on. *Created* is the date the item was first added to the Crossref database. *Deposited* is the last time the publisher deposited the record (possibly, but not necessarily, with changes). *Published* is when the item itself was published.

We'll use regular expressions to extract the dates from the records in each of those three columns, then convert *created* and *deposited* to datetime dtypes. *Published* stays as a four-digit year string.

In [11]:
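date_columns = ['created', 'deposited']

for col in date_columns:
    # Capture the YYYY-MM-DD part of the quoted 'date-time' value
    df[col] = df[col].str.extract(r"\'(\d{4}\S\d{2}\S\d{2})", expand=False)
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Keep only the first four-digit run (the year) for 'published'
df['published'] = df['published'].str.extract(r"(\d{4})", expand=False)
df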

Unnamed: 0_level_0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
0,"{'date-parts': [[2022, 10, 7]], 'date-time': '...",14,Wiley,1,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Syst. Dyn. Rev.'],{'date-parts': [[2000]]},10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,journal-article,2002-09-10,27-41,Crossref,57,['The validation of commercial system dynamics...,10.10020,16,"[{'given': 'Geoff', 'family': 'Coyle', 'sequen...",311.0,[{'key': '10.1002/(SICI)1099-1727(200021)16:1<...,['System Dynamics Review'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,2021-07-01,0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,{'date-parts': [[2000]]},14,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/(sici)1099-1727(2000...,"['0883-7066', '1099-1727']","[{'value': '0883-7066', 'type': 'print'}, {'va...","['Management of Technology and Innovation', 'S...",2000,,,,,,,,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",12,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1979, 3, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MTB'],"{'date-parts': [[1979, 3]]}",10.1007/bf02653972,journal-article,2007-07-17,57-62,Crossref,20,['Effect of system geometry on the leaching be...,10.10070,10,"[{'given': 'C.', 'family': 'Vu', 'sequence': '...",297.0,"[{'key': 'BF02653972_CR1', 'volume-title': 'Ph...",['Metallurgical Transactions B'],en,[{'URL': 'http://link.springer.com/content/pdf...,2019-05-20,0.0,{'primary': {'URL': 'http://link.springer.com/...,"{'date-parts': [[1979, 3]]}",12,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1007/bf02653972,"['0360-2141', '1543-1916']","[{'value': '0360-2141', 'type': 'print'}, {'va...","['Materials Chemistry', 'Metals and Alloys', '...",1979,['BF02653972'],,,,,,,,,,,,,,,,
2,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2017, 11, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['RECIEL'],"{'date-parts': [[2017, 11]]}",10.1111/reel.12221,journal-article,2017-12-01,243-254,Crossref,2,['The international law on transboundary haze ...,10.11110,26,"[{'given': 'Shawkat', 'family': 'Alam', 'seque...",311.0,,"['Review of European, Comparative &amp; Intern...",en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,2017-12-01,0.0,{'primary': {'URL': 'http://doi.wiley.com/10.1...,"{'date-parts': [[2017, 11]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1111/reel.12221,['2050-0386'],"[{'value': '2050-0386', 'type': 'print'}]","['Law', 'Management, Monitoring, Policy and La...",2017,,"{'date-parts': [[2017, 11, 28]]}",['Portico'],,,,,,,,,,,,,,
3,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Crop Science Society of Japan,1-2,,"{'domain': [], 'crossmark-restriction': False}","['Japanese journal of crop science', 'Jpn. J. ...",{'date-parts': [[1951]]},10.1626/jcs.20.219,journal-article,2011-09-20,219-222,Crossref,0,['Studies on the influence of pruning on the v...,10.16260,20,"[{'given': 'C.', 'family': 'TSUDA', 'sequence'...",632.0,,['Japanese Journal of Crop Science'],en,[{'URL': 'http://www.jstage.jst.go.jp/article/...,2021-04-30,0.0,{'primary': {'URL': 'http://www.jstage.jst.go....,{'date-parts': [[1951]]},0,"{'issue': '1-2', 'published-print': {'date-par...",http://dx.doi.org/10.1626/jcs.20.219,"['0011-1848', '1349-0990']","[{'value': '0011-1848', 'type': 'print'}, {'va...","['Genetics', 'Agronomy and Crop Science', 'Foo...",1951,,,,,,,,,,,,,,,,,
4,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",60,Elsevier BV,6,"[{'start': {'date-parts': [[2018, 12, 1]], 'da...","{'domain': ['clinicalkey.fr', 'elsevier.com', ...",['Revue de Pneumologie Clinique'],"{'date-parts': [[2018, 12]]}",10.1016/j.pneumo.2018.09.002,journal-article,2018-10-10,391-399,Crossref,0,['Le tabagisme et l’aide à l’arrêt du tabac de...,10.10160,74,"[{'given': 'J.', 'family': 'Perriot', 'sequenc...",78.0,[{'key': '10.1016/j.pneumo.2018.09.002_bib0305...,['Revue de Pneumologie Clinique'],fr,[{'URL': 'https://api.elsevier.com/content/art...,2019-10-26,0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2018, 12]]}",60,"{'issue': '6', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/j.pneumo.2018.09.002,['0761-8417'],"[{'value': '0761-8417', 'type': 'print'}]",['Pulmonary and Respiratory Medicine'],2018,['S0761841718301792'],,,http://dx.doi.org/10.1016/elsevier_cm_policy,"[{'value': 'Elsevier', 'name': 'publisher', 'l...",,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106102,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",2,Elsevier BV,2,"[{'start': {'date-parts': [[1988, 4, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['British Journal of Oral and Maxillofacial Su...,"{'date-parts': [[1988, 4]]}",10.1016/0266-4356(88)90016-2,journal-article,2004-04-29,171-172,Crossref,1,['A champy plate template'],10.10160,26,"[{'given': 'M.T.', 'family': 'Simpson', 'seque...",78.0,"[{'key': '10.1016/0266-4356(88)90016-2_BIB1', ...",['British Journal of Oral and Maxillofacial Su...,en,[{'URL': 'https://api.elsevier.com/content/art...,2019-02-09,0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[1988, 4]]}",2,"{'issue': '2', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/0266-4356(88)90016-2,['0266-4356'],"[{'value': '0266-4356', 'type': 'print'}]","['Otorhinolaryngology', 'Oral Surgery', 'Surge...",1988,['0266435688900162'],,,,,,,,,,,,,,,,
106103,"{'date-parts': [[2023, 1, 10]], 'date-time': '...",41,"Impact Journals, LLC",3,,"{'domain': [], 'crossmark-restriction': False}",['Oncotarget'],"{'date-parts': [[2018, 1, 9]]}",10.18632/oncotarget.23280,journal-article,2017-12-15,3946-3955,Crossref,27,['Validation of a hypoxia related gene signatu...,10.18632,9,"[{'given': 'Lingjian', 'family': 'Yang', 'sequ...",7892.0,"[{'key': '1', 'doi-asserted-by': 'crossref', '...",['Oncotarget'],en,[{'URL': 'https://www.oncotarget.com/lookup/do...,2020-07-15,0.0,{'primary': {'URL': 'https://www.oncotarget.co...,"{'date-parts': [[2017, 12, 12]]}",41,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.18632/oncotarget.23280,['1949-2553'],"[{'value': '1949-2553', 'type': 'electronic'}]",['Oncology'],2017,"['23280', '29423096']","{'date-parts': [[2017, 12, 12]]}",,,,,,,,,,,,,,,
106104,"{'date-parts': [[2022, 10, 16]], 'date-time': ...",33,AIP Publishing,9,,"{'domain': ['aip.scitation.org'], 'crossmark-r...",,"{'date-parts': [[1996, 9]]}",10.1063/1.869021,journal-article,2002-07-26,2365-2374,Crossref,77,['An experimental study of deep water plunging...,10.10630,8,"[{'given': 'Marc', 'family': 'Perlin', 'sequen...",317.0,"[{'key': '10.1063/1.869021_r1', 'doi-asserted-...",['Physics of Fluids'],en,[{'URL': 'http://aip.scitation.org/doi/10.1063...,2017-11-20,0.0,{'primary': {'URL': 'http://aip.scitation.org/...,"{'date-parts': [[1996, 9]]}",33,"{'issue': '9', 'published-print': {'date-parts...",http://dx.doi.org/10.1063/1.869021,"['1070-6631', '1089-7666']","[{'value': '1070-6631', 'type': 'print'}, {'va...","['Condensed Matter Physics', 'Fluid Flow and T...",1996,,,,http://dx.doi.org/10.1063/aip-crossmark-policy...,,,,,,,,,,,,,
106105,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",5,Oxford University Press (OUP),8,"[{'start': {'date-parts': [[1999, 8, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",,"{'date-parts': [[2002, 12, 12]]}",10.1002/bjs.1155,journal-article,2006-06-16,1099-1100,Crossref,0,"[""Authors' reply""]",10.10930,86,"[{'given': 'R E K', 'family': 'Marshall', 'seq...",286.0,"[{'key': '2021070922214553300_bib1', 'doi-asse...",['British Journal of Surgery'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,2021-07-10,0.0,{'primary': {'URL': 'https://academic.oup.com/...,"{'date-parts': [[1999, 8]]}",5,"{'issue': '8', 'published-online': {'date-part...",http://dx.doi.org/10.1002/bjs.1155,"['0007-1323', '1365-2168']","[{'value': '0007-1323', 'type': 'print'}, {'va...",['Surgery'],1999,,"{'date-parts': [[2002, 12, 12]]}",,,,,,,,,,"{'date-parts': [[1999, 8]]}",,,,,
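
To make the extraction step easier to follow, here is a minimal sketch that applies the same two regular expressions to a couple of made-up values shaped like the Crossref date fields. The sample strings (including the time and timestamp portions) are illustrative assumptions, not records from the dataset.

```python
import pandas as pd

# Hypothetical values shaped like the raw 'created' / 'published' fields
sample = pd.Series([
    "{'date-parts': [[2002, 9, 10]], 'date-time': '2002-09-10T21:14:13Z', 'timestamp': 1031692453000}",
    "{'date-parts': [[2000]]}",
])

# A quote followed by YYYY-MM-DD only occurs in the 'date-time' value
full_dates = sample.str.extract(r"\'(\d{4}\S\d{2}\S\d{2})", expand=False)
full_dates = pd.to_datetime(full_dates, format="%Y-%m-%d")

# The first four-digit run is the year, even when month and day are missing
years = sample.str.extract(r"(\d{4})", expand=False)

print(full_dates.tolist())
print(years.tolist())
```

The second value has no full date, so the first pattern finds nothing and the result is a missing value (NaT), while the year pattern still matches. That is why only the year is kept for *published*.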


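## String Slicing
Now that the dates are converted, one of the last problems to address is the excess characters in the *title*, *short-container-title*, and *container-title* fields. Each value is stored as the text of a list, such as ['A champy plate template'], so slicing off the first two and last two characters removes the outer brackets and quotes.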

In [12]:
cols = ['title', 'short-container-title', 'container-title']
for col in cols:
    df[col] = df[col].str.slice(start=2, stop=-2)

In [13]:
df['title'][0]

'The validation of commercial system dynamics models'
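
Slicing works cleanly when a value holds a single entry. As a hedged alternative for values that hold several entries, the sketch below parses the text back into a Python list with ast.literal_eval and keeps the first item. The helper name first_item and the sample values are hypothetical, and this is not the approach used in the notebook.

```python
import ast
import pandas as pd

def first_item(value):
    """Return the first element of a stringified list, or None if it can't be parsed."""
    if not isinstance(value, str):
        return None
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None
    if isinstance(parsed, list) and parsed:
        return parsed[0]
    return None

# Made-up values shaped like the raw title fields
sample = pd.Series([
    "['A single title']",
    "['First name', 'Second name']",
    None,
])

sliced = sample.str.slice(start=2, stop=-2)  # the notebook's approach
parsed = sample.map(first_item)              # the alternative sketched here

print(sliced.tolist())
print(parsed.tolist())
```

With two entries, slicing leaves the inner quotes and comma in place, while parsing returns just the first entry. Which behavior is preferable depends on whether the extra entries matter for the analysis.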

## Cleaning XML tags
We'll be looking at the *abstract* column, so it will help to strip out the XML tags and keep only the relevant text for each record. We'll write a quick function to do that.

In [14]:
# Import Beautiful Soup
from bs4 import BeautifulSoup as bs

def clean_abstracts(abstract):
    """Return the abstract text with markup tags removed, or None if it can't be parsed."""
    try:
        soup = bs(abstract, features='lxml')
        return soup.get_text()
    except Exception:
        # Anything Beautiful Soup can't parse becomes None
        return None

df['abstract'] = df.abstract.map(clean_abstracts)
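
To show what the tag stripping does to a single value, here is a small self-contained sketch on a made-up JATS-style abstract. It assumes Python's built-in html.parser backend rather than lxml so that it runs without the extra dependency, and the function name strip_tags is hypothetical.

```python
from bs4 import BeautifulSoup

def strip_tags(abstract):
    """Return the text content of a marked-up abstract, or None for non-string input."""
    if not isinstance(abstract, str):
        # Missing abstracts are typically NaN (a float), so skip them explicitly
        return None
    return BeautifulSoup(abstract, features="html.parser").get_text()

# Made-up abstract using JATS-style tags, not a record from the dataset
sample = "<jats:title>Abstract</jats:title>\n<jats:p>A short example.</jats:p>"

cleaned = strip_tags(sample)
print(repr(cleaned))
```

The tags are dropped and the text between them is kept, including the newline that separated the two elements. Checking the input type up front is one way to avoid relying on an exception for missing values.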

In [15]:
df['abstract'][100080]

'Abstract\nIn this paper, Neural Network (NN) approach is developed and utilised to detect winding faults in an electrical machine using the samples data of electrical machine in both the healthy and different fault conditions (i.e. shorted-turn fault, phase-to-ground fault and coil-to-coil fault). This is done by interfacing a data acquisition device connected to the machine with a computer in the laboratory. Thereafter, a two-layer feed-forward network with Levenberg–Marquardt back-propagation algorithm is created with the collected input dataset. The NN model developed was tested with both the healthy and the four different fault conditions of the electrical machine. The results from the NN approach was also compared with other results obtained by determining the fault index (FI) of an electrical machine using signal processing approach. The results show that the NN approach can identify each of the electrical machine condition with high accuracy. The percentage accuracy for healthy

Looks great! Now we'll save our cleaned dataset.

In [16]:
df.to_csv(input_dir / '02_cleaned_data.csv', index=False)