# Data Cleaning
To clean this dataset, we'll start by loading it, checking for duplicate records, and deciding which columns are relevant to our analysis.

First, we'll load our packages, set up our directories, and read in the dataset.

In [1]:
import pandas as pd
from pathlib import Path

#Set up directories
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'

df = pd.read_csv(input_dir / '01_raw_data.csv').set_index('Unnamed: 0')
df.index.names = ['Index']


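The load step above sets the index in two statements. A minimal sketch of a one-step equivalent follows, using a small in-memory CSV as a stand-in for `01_raw_data.csv`; the toy rows and the names `raw_csv` and `toy` are assumptions for illustration. `low_memory=False` is optional and only asks pandas to infer column types from the whole file at once, which can help if a mixed-type warning appears on load.

```python
import io

import pandas as pd

# Toy stand-in for the raw file: the first column has an empty header,
# which pandas would otherwise label 'Unnamed: 0'.
raw_csv = io.StringIO(
    ",DOI,title\n"
    "0,10.1000/a,First\n"
    "1,10.1000/b,Second\n"
)

# index_col=0 uses the first column as the index; rename_axis names it 'Index'.
toy = pd.read_csv(raw_csv, index_col=0, low_memory=False).rename_axis('Index')
```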
## Duplicate Records
Comparing the number of rows in the dataset with the number of unique DOIs tells us how many duplicate records we have.


In [2]:
df.shape

(105107, 49)

In [3]:
len(set(df['DOI']))

105039

In [4]:
#Dropping duplicate records
df.drop_duplicates(subset=['DOI'], keep='first', inplace=True)
df.shape

(105039, 49)
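The two counts above differ by 105107 - 105039 = 68, which is the number of rows removed. As a rough sketch, the same figure can be read directly with `DataFrame.duplicated`; the toy frame below is an assumption for illustration, not the real data.

```python
import pandas as pd

# Toy frame with one repeated DOI.
toy = pd.DataFrame({
    'DOI': ['10.1000/a', '10.1000/b', '10.1000/a', '10.1000/c'],
    'title': ['First', 'Second', 'First again', 'Third'],
})

# Rows flagged True repeat a DOI already seen earlier in the frame.
n_duplicates = toy.duplicated(subset=['DOI'], keep='first').sum()

# Non-mutating equivalent of drop_duplicates(..., inplace=True).
deduped = toy.drop_duplicates(subset=['DOI'], keep='first')
```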

## Columns
We will not drop any columns here. Instead, select the columns of interest when importing the data, which keeps the data structure in the file intact.
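One way to select columns at import time is the `usecols` argument of `pd.read_csv`. The sketch below uses an in-memory CSV, and the list `columns_of_interest` is a hypothetical choice, not one made in this notebook.

```python
import io

import pandas as pd

# Toy stand-in for the saved file, with an unnamed index column.
raw_csv = io.StringIO(
    ",DOI,title,container-title,score\n"
    "0,10.1000/a,First,Journal A,0.0\n"
    "1,10.1000/b,Second,Journal B,0.0\n"
)

# Hypothetical selection; the file on disk keeps every column.
columns_of_interest = ['DOI', 'title', 'container-title']
subset = pd.read_csv(raw_csv, usecols=columns_of_interest)
```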

## Editors
Very few records have a value in the *editor* column. Our prior work indicates that a value there can signal a work that has been mislabeled as a 'journal article', so we'll explore some of these records to verify that.

We'll set up a dataframe of just those records that have data in the *editor* column.

Next, we'll search the titles of these records for a few keywords.
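The cell that follows runs the keyword search over every title in `df`. For reference, a minimal sketch of the editor-only subset described above, assuming missing editors are stored as NaN; the toy rows and the shortened keyword pattern are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy records: only the first one has a value in the editor column.
toy = pd.DataFrame({
    'title': ["['Masthead']", "['A study of X']", "['Conference report']"],
    'editor': ["[{'given': 'A.', 'family': 'Editor'}]", np.nan, np.nan],
})

# Keep only records with data in the editor column.
has_editor = toy.loc[toy['editor'].notna()]

# Search the titles of those records for a few keywords.
pattern = r'editorial|masthead|conference'
flagged = has_editor.loc[
    has_editor['title'].str.contains(pattern, regex=True, case=False, na=False)
]
```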

In [5]:
editorial = df.loc[df.title.str.contains(r'editorial|errata|contents|conference|proceedings|masthead|symposium|abstract|Book Review', 
                                         regex=True, case=False, na=False)]
editorial

Index,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,update-to,relation
5,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",1,Wiley,52,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['ChemInform'],"{'date-parts': [[2015, 12]]}",10.1002/chin.201552196,journal-article,"{'date-parts': [[2015, 12, 13]], 'date-time': ...",no-no,Crossref,0,['ChemInform Abstract: Supramolecular Polymeri...,10.1002,46,"[{'given': 'Takeharu', 'family': 'Haino', 'seq...",311.0,"[{'key': '10.1002/chin.201552196-BIB1|cit1', '...",['ChemInform'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 2, 17]], 'date-time': '...",0.0,{'primary': {'URL': 'http://doi.wiley.com/10.1...,"{'date-parts': [[2015, 12]]}",1,"{'issue': '52', 'published-print': {'date-part...",http://dx.doi.org/10.1002/chin.201552196,['0931-7597'],"[{'value': '0931-7597', 'type': 'print'}]",['General Materials Science'],"{'date-parts': [[2015, 12]]}",,"{'date-parts': [[2015, 12, 10]]}",['Portico'],,,,,,,,,,,,
23,"{'date-parts': [[2022, 4, 4]], 'date-time': '2...",0,Elsevier BV,1,"[{'start': {'date-parts': [[1965, 6, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['IFAC Proceedings Volumes'],"{'date-parts': [[1965, 6]]}",10.1016/s1474-6670(17)69139-0,journal-article,"{'date-parts': [[2017, 7, 1]], 'date-time': '2...",577,Crossref,0,['Symposium Closing Remarks'],10.1016,2,,78.0,,['IFAC Proceedings Volumes'],en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2018, 8, 30]], 'date-time': '...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[1965, 6]]}",0,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/s1474-6670(17)69139-0,['1474-6670'],"[{'value': '1474-6670', 'type': 'print'}]","['General Economics, Econometrics and Finance']","{'date-parts': [[1965, 6]]}",['S1474667017691390'],,,,,,,,,,,,,,
30,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Commun. Numer. Meth. Engng.'],"{'date-parts': [[1995, 3]]}",10.1002/cnm.1640110301,journal-article,"{'date-parts': [[2005, 8, 8]], 'date-time': '2...",fmi-fmi,Crossref,0,['Masthead'],10.1002,11,,311.0,,['Communications in Numerical Methods in Engin...,en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 2]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[1995, 3]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/cnm.1640110301,['1069-8299'],"[{'value': '1069-8299', 'type': 'print'}]","['Applied Mathematics', 'Computational Theory ...","{'date-parts': [[1995, 3]]}",,"{'date-parts': [[2005, 6, 21]]}",,,,,,,,,,,,,
33,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",1,Wiley,33,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Chemischer Informationsdienst'],"{'date-parts': [[1972, 8, 15]]}",10.1002/chin.197233207,journal-article,"{'date-parts': [[2016, 2, 26]], 'date-time': '...",no-no,Crossref,0,['ChemInform Abstract: FRIEDEL-CRAFTS-ACYLIERU...,10.1002,3,"[{'given': 'J. K.', 'family': 'GROVES', 'seque...",311.0,"[{'key': '10.1002/chin.197233207-BIB1|cit1', '...",['Chemischer Informationsdienst'],en,[{'URL': 'http://api.wiley.com/onlinelibrary/t...,"{'date-parts': [[2021, 7, 1]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[1972, 8, 15]]}",1,"{'issue': '33', 'published-print': {'date-part...",http://dx.doi.org/10.1002/chin.197233207,['0009-2975'],"[{'value': '0009-2975', 'type': 'print'}]",['General Medicine'],"{'date-parts': [[1972, 8, 15]]}",,"{'date-parts': [[2016, 2, 23]]}",['Portico'],,,,,,,,,,,,
37,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",0,Wiley,27-29,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Z. Pflanzenernaehr. Dueng. Bodenk.'],{'date-parts': [[1931]]},10.1002/jpln.19310102701,journal-article,"{'date-parts': [[2007, 2, 7]], 'date-time': '2...",fmi-fmi,Crossref,0,['Masthead'],10.1002,10,,311.0,,"['Zeitschrift für Pflanzenernährung, Düngung, ...",de,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 5]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,{'date-parts': [[1931]]},0,"{'issue': '27-29', 'published-print': {'date-p...",http://dx.doi.org/10.1002/jpln.19310102701,"['0372-9702', '1522-2624']","[{'value': '0372-9702', 'type': 'print'}, {'va...","['Plant Science', 'Soil Science']",{'date-parts': [[1931]]},,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104971,"{'date-parts': [[2022, 4, 4]], 'date-time': '2...",0,American Association for Cancer Research (AACR),2,,"{'domain': [], 'crossmark-restriction': False}",['The American Journal of Cancer'],"{'date-parts': [[1934, 6, 1]]}",10.1158/ajc.1934.391,journal-article,"{'date-parts': [[2013, 5, 16]], 'date-time': '...",391-499,Crossref,0,['Abstracts'],10.1158,21,,1086.0,,['The American Journal of Cancer'],en,,"{'date-parts': [[2013, 5, 16]], 'date-time': '...",0.0,{'primary': {'URL': 'http://cancerres.aacrjour...,"{'date-parts': [[1934, 6, 1]]}",0,"{'issue': '2', 'published-print': {'date-parts...",http://dx.doi.org/10.1158/ajc.1934.391,['0099-7374'],"[{'value': '0099-7374', 'type': 'electronic'}]","['Cancer Research', 'Oncology']","{'date-parts': [[1934, 6, 1]]}",,,,,,,,,,,,,,,
104976,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Res. Nurs. Health'],"{'date-parts': [[1982, 9]]}",10.1002/nur.4770050302,journal-article,"{'date-parts': [[2007, 4, 12]], 'date-time': '...",111-111,Crossref,1,['Editorial'],10.1002,5,"[{'given': 'Margaret', 'family': 'Grier', 'seq...",311.0,,['Research in Nursing &amp; Health'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 4]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,"{'date-parts': [[1982, 9]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/nur.4770050302,"['0160-6891', '1098-240X']","[{'value': '0160-6891', 'type': 'print'}, {'va...",['General Nursing'],"{'date-parts': [[1982, 9]]}",,,,,,,,,,,,,,,
104979,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Ovid Technologies (Wolters Kluwer Health),suppl_1,,"{'domain': [], 'crossmark-restriction': False}",['Stroke'],"{'date-parts': [[2013, 2]]}",10.1161/str.44.suppl_1.atp242,journal-article,"{'date-parts': [[2021, 7, 3]], 'date-time': '2...",,Crossref,0,['Abstract TP242: Intravenous rt-PA Therapy fo...,10.1161,44,"[{'given': 'Naoki', 'family': 'Hayashi', 'sequ...",276.0,,['Stroke'],en,[{'URL': 'http://journals.lww.com/00007670-201...,"{'date-parts': [[2022, 3, 20]], 'date-time': '...",0.0,{'primary': {'URL': 'https://www.ahajournals.o...,"{'date-parts': [[2013, 2]]}",0,"{'issue': 'suppl_1', 'published-print': {'date...",http://dx.doi.org/10.1161/str.44.suppl_1.atp242,"['0039-2499', '1524-4628']","[{'value': '0039-2499', 'type': 'print'}, {'va...","['Advanced and Specialized Nursing', 'Cardiolo...","{'date-parts': [[2013, 2]]}",['10.1161/str.44.suppl_1.ATP242'],,,,,,,,<jats:p>\n <jats:bold>Background an...,,,,,,
104993,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",0,Test accounts,,,"{'domain': [], 'crossmark-restriction': False}",['Bulletin'],{'date-parts': [[1994]]},10.1306/a25ff503-171b-11d7-8645000102c1865d,journal-article,"{'date-parts': [[2002, 12, 31]], 'date-time': ...",,Crossref,0,['Kukersite Oil Shale in Estonia: Basin Geolog...,10.5555,78,"[{'family': 'Heikki Bauert', 'sequence': 'firs...",7822.0,,['AAPG Bulletin'],en,,"{'date-parts': [[2007, 2, 13]], 'date-time': '...",0.0,{'primary': {'URL': 'http://www.crossref.org/d...,{'date-parts': [[1994]]},0,,http://dx.doi.org/10.1306/a25ff503-171b-11d7-8...,['0149-1423'],"[{'value': '0149-1423', 'type': 'print'}]",['Earth and Planetary Sciences (miscellaneous)...,{'date-parts': [[1994]]},['A25FF503-171B-11D7-8645000102C1865D'],,,,,,,,,,,,,,


We've found some editorials, mastheads, conference proceedings, and abstracts, so we'll drop them from our dataset.

In [6]:
df.drop(editorial.index, inplace=True)

In [7]:
df.shape

(101616, 49)

## Conferences
Looking back at **editorial**, we see a few values in the *container-title* column that contain 'Conference' or 'Proceedings'. Let's see how many of the records remaining in our dataset come from these journals/containers.

We also see a few records from *ChemInform*, a journal that published chemistry abstracts. We'll check whether any of those records remain as well.

We'll use a keyword search in the *container-title* column to find these records.

In [8]:
conferences = df.loc[df['container-title'].str.contains(r'conference|ChemInform', regex=True, case=False, na=False)]
conferences

Index,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,update-to,relation
57,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",6,IOP Publishing,1,"[{'start': {'date-parts': [[2021, 2, 1]], 'dat...","{'domain': ['iopscience.iop.org'], 'crossmark-...",['IOP Conf. Ser.: Earth Environ. Sci.'],"{'date-parts': [[2021, 2, 1]]}",10.1088/1755-1315/660/1/012131,journal-article,"{'date-parts': [[2021, 2, 22]], 'date-time': '...",012131,Crossref,0,['Application of wavelet multi-scale analysis ...,10.1088,660,"[{'given': 'Hailong', 'family': 'Sun', 'sequen...",266.0,"[{'key': 'EES_660_1_012131bib1', 'author': 'Ha...",['IOP Conference Series: Earth and Environment...,,[{'URL': 'https://iopscience.iop.org/article/1...,"{'date-parts': [[2022, 1, 29]], 'date-time': '...",0.0,{'primary': {'URL': 'https://iopscience.iop.or...,"{'date-parts': [[2021, 2, 1]]}",6,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1088/1755-1315/660/1/012131,"['1755-1307', '1755-1315']","[{'value': '1755-1307', 'type': 'print'}, {'va...",,"{'date-parts': [[2021, 2, 1]]}",,,,http://dx.doi.org/10.1088/crossmark-policy,[{'value': 'Application of wavelet multi-scale...,,,,<jats:title>Abstract</jats:title>\n ...,,,,,,
117,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",17,EDP Sciences,,"[{'start': {'date-parts': [[2021, 4, 26]], 'da...","{'domain': [], 'crossmark-restriction': False}",['EPJ Web Conf.'],{'date-parts': [[2021]]},10.1051/epjconf/202124801022,journal-article,"{'date-parts': [[2021, 4, 26]], 'date-time': '...",01022,Crossref,0,['Distributions of Two Atoms Collisions over t...,10.1051,248,"[{'given': 'Sergey', 'family': 'Zheltov', 'seq...",250.0,"[{'key': 'R1', 'doi-asserted-by': 'crossref', ...",['EPJ Web of Conferences'],,[{'URL': 'https://www.epj-conferences.org/10.1...,"{'date-parts': [[2021, 4, 26]], 'date-time': '...",0.0,{'primary': {'URL': 'https://www.epj-conferenc...,{'date-parts': [[2021]]},17,,http://dx.doi.org/10.1051/epjconf/202124801022,['2100-014X'],"[{'value': '2100-014X', 'type': 'electronic'}]","['General Earth and Planetary Sciences', 'Gene...",{'date-parts': [[2021]]},['epjconf_mnps2021_01022'],"{'date-parts': [[2021, 4, 26]]}",,,,,,,<jats:p>The processes of heat and mass transfe...,,,,"[{'given': 'A.', 'family': 'Nadykto', 'sequenc...",,
140,"{'date-parts': [[2022, 11, 8]], 'date-time': '...",0,Association for the Advancement of Artificial ...,1,,"{'domain': [], 'crossmark-restriction': False}",['AAAI'],,10.1609/aaai.v32i1.11721,journal-article,"{'date-parts': [[2022, 6, 24]], 'date-time': '...",,Crossref,6,['SC2Net: Sparse LSTMs for Sparse Coding'],10.1609,32,"[{'given': 'Joey Tianyi', 'family': 'Zhou', 's...",9382.0,,['Proceedings of the AAAI Conference on Artifi...,,[{'URL': 'https://ojs.aaai.org/index.php/AAAI/...,"{'date-parts': [[2022, 11, 7]], 'date-time': '...",0.0,{'primary': {'URL': 'https://ojs.aaai.org/inde...,"{'date-parts': [[2018, 4, 29]]}",0,"{'issue': '1', 'published-online': {'date-part...",http://dx.doi.org/10.1609/aaai.v32i1.11721,"['2374-3468', '2159-5399']","[{'value': '2374-3468', 'type': 'electronic'},...",['General Medicine'],"{'date-parts': [[2018, 4, 29]]}",,"{'date-parts': [[2018, 4, 29]]}",,,,,,,<jats:p>\n \n The iterative hard-t...,,,,,,
173,"{'date-parts': [[2022, 4, 1]], 'date-time': '2...",3,Springer Science and Business Media LLC,S1,"[{'start': {'date-parts': [[2013, 3, 1]], 'dat...","{'domain': ['link.springer.com'], 'crossmark-r...",['J Cheminform'],"{'date-parts': [[2013, 3]]}",10.1186/1758-2946-5-s1-o12,journal-article,"{'date-parts': [[2013, 3, 22]], 'date-time': '...",,Crossref,0,['Quantifying the shifts in physicochemical pr...,10.1186,5,"[{'given': 'Johannes', 'family': 'Kirchmair', ...",297.0,"[{'key': '389_CR1', 'doi-asserted-by': 'publis...",['Journal of Cheminformatics'],en,[{'URL': 'http://link.springer.com/content/pdf...,"{'date-parts': [[2021, 9, 1]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://jcheminf.biomedce...,"{'date-parts': [[2013, 3]]}",3,"{'issue': 'S1', 'published-print': {'date-part...",http://dx.doi.org/10.1186/1758-2946-5-s1-o12,['1758-2946'],"[{'value': '1758-2946', 'type': 'electronic'}]","['Library and Information Sciences', 'Computer...","{'date-parts': [[2013, 3]]}",['389'],"{'date-parts': [[2013, 3, 22]]}",,http://dx.doi.org/10.1007/springer_crossmark_p...,"[{'value': '22 March 2013', 'order': 1, 'name'...",,O12,,,,,,,,
229,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",13,EDP Sciences,,"[{'start': {'date-parts': [[2019, 6, 21]], 'da...","{'domain': [], 'crossmark-restriction': False}",['E3S Web Conf.'],{'date-parts': [[2019]]},10.1051/e3sconf/201910502010,journal-article,"{'date-parts': [[2019, 6, 21]], 'date-time': '...",02010,Crossref,1,['Computer Simulation of the Physical Processe...,10.1051,105,"[{'given': 'Maxim', 'family': 'Gucal', 'sequen...",250.0,"[{'key': 'R1', 'doi-asserted-by': 'crossref', ...",['E3S Web of Conferences'],,[{'URL': 'https://www.e3s-conferences.org/10.1...,"{'date-parts': [[2020, 3, 13]], 'date-time': '...",0.0,{'primary': {'URL': 'https://www.e3s-conferenc...,{'date-parts': [[2019]]},13,,http://dx.doi.org/10.1051/e3sconf/201910502010,['2267-1242'],"[{'value': '2267-1242', 'type': 'electronic'}]","['Pulmonary and Respiratory Medicine', 'Pediat...",{'date-parts': [[2019]]},['e3sconf_iims18_02010'],"{'date-parts': [[2019, 6, 21]]}",,,,,,,<jats:p>An algorithm for determining the elect...,,,,"[{'given': 'M.', 'family': 'Tyulenev', 'sequen...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104394,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",20,EDP Sciences,,"[{'start': {'date-parts': [[2018, 1, 9]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MATEC Web Conf.'],{'date-parts': [[2018]]},10.1051/matecconf/201814504003,journal-article,"{'date-parts': [[2018, 1, 9]], 'date-time': '2...",04003,Crossref,0,['Effect of post type on the fracture resistan...,10.1051,145,"[{'given': 'Ekaterina', 'family': 'Karteva', '...",250.0,"[{'key': 'R1', 'first-page': '107', 'volume': ...",['MATEC Web of Conferences'],,[{'URL': 'https://www.matec-conferences.org/10...,"{'date-parts': [[2020, 4, 9]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://www.matec-confere...,{'date-parts': [[2018]]},20,,http://dx.doi.org/10.1051/matecconf/201814504003,['2261-236X'],"[{'value': '2261-236X', 'type': 'electronic'}]",,{'date-parts': [[2018]]},['matecconf_nctam2018_04003'],"{'date-parts': [[2018, 1, 9]]}",,,,,,,<jats:p>Endodontically treated teeth (ETT) are...,,,,"[{'given': 'V.M.', 'family': 'Vassilev', 'sequ...",,
104528,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",5,IOP Publishing,1,"[{'start': {'date-parts': [[2021, 3, 1]], 'dat...","{'domain': ['iopscience.iop.org'], 'crossmark-...",['IOP Conf. Ser.: Earth Environ. Sci.'],"{'date-parts': [[2021, 3, 1]]}",10.1088/1755-1315/681/1/012054,journal-article,"{'date-parts': [[2021, 3, 24]], 'date-time': '...",012054,Crossref,1,['Farmer’s response on government policy in so...,10.1088,681,"[{'given': 'P Tandi', 'family': 'Balla', 'sequ...",266.0,"[{'key': 'EES_681_1_012054bib1', 'article-titl...",['IOP Conference Series: Earth and Environment...,,[{'URL': 'https://iopscience.iop.org/article/1...,"{'date-parts': [[2022, 1, 15]], 'date-time': '...",0.0,{'primary': {'URL': 'https://iopscience.iop.or...,"{'date-parts': [[2021, 3, 1]]}",5,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1088/1755-1315/681/1/012054,"['1755-1307', '1755-1315']","[{'value': '1755-1307', 'type': 'print'}, {'va...",['General Medicine'],"{'date-parts': [[2021, 3, 1]]}",,,,http://dx.doi.org/10.1088/crossmark-policy,[{'value': 'Farmer’s response on government po...,,,,<jats:title>Abstract</jats:title>\n ...,,,,,,
104575,"{'date-parts': [[2022, 6, 24]], 'date-time': '...",0,Association for the Advancement of Artificial ...,1,,"{'domain': [], 'crossmark-restriction': False}",['AAAI'],,10.1609/aaai.v30i1.9835,journal-article,"{'date-parts': [[2022, 6, 24]], 'date-time': '...",,Crossref,0,['DECT: Distributed Evolving Context Tree for ...,10.1609,30,"[{'given': 'Xiaokui', 'family': 'Shu', 'sequen...",9382.0,,['Proceedings of the AAAI Conference on Artifi...,,[{'URL': 'https://ojs.aaai.org/index.php/AAAI/...,"{'date-parts': [[2022, 6, 24]], 'date-time': '...",0.0,{'primary': {'URL': 'https://ojs.aaai.org/inde...,"{'date-parts': [[2016, 3, 5]]}",0,"{'issue': '1', 'published-online': {'date-part...",http://dx.doi.org/10.1609/aaai.v30i1.9835,"['2374-3468', '2159-5399']","[{'value': '2374-3468', 'type': 'electronic'},...",['General Medicine'],"{'date-parts': [[2016, 3, 5]]}",,"{'date-parts': [[2016, 3, 5]]}",,,,,,,<jats:p>\n \n Internet user behavi...,,,,,,
104634,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,California Digital Library (CDL),,,"{'domain': [], 'crossmark-restriction': False}",['vertebrate_pest_conference'],,10.5070/v42811014,journal-article,"{'date-parts': [[2019, 12, 12]], 'date-time': ...",,Crossref,0,['Direct and Indirect Impacts to Ranchers from...,10.5070,28,"[{'given': 'Dan', 'family': 'Macon', 'sequence...",29705.0,,['Proceedings of the Vertebrate Pest Conference'],,,"{'date-parts': [[2019, 12, 12]], 'date-time': ...",0.0,{'primary': {'URL': 'https://escholarship.org/...,{'date-parts': [[2018]]},0,,http://dx.doi.org/10.5070/v42811014,['2641-273X'],"[{'value': '2641-273X', 'type': 'electronic'}]","['General Earth and Planetary Sciences', 'Gene...",{'date-parts': [[2018]]},,{'date-parts': [[2018]]},,,,,,,,,,,,,
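Record 173 in the output above comes from the *Journal of Cheminformatics*: with `case=False`, the pattern `ChemInform` also matches as a substring of 'Cheminformatics'. If that journal is meant to be kept, one option is to anchor the keyword with word boundaries. This is a sketch on toy container titles, not the pattern used in this notebook.

```python
import pandas as pd

# Toy container titles in the same stringified-list form as the dataset.
titles = pd.Series([
    "['ChemInform']",
    "['Journal of Cheminformatics']",
    "['EPJ Web of Conferences']",
    "['Physical Review A']",
])

# Substring match: also catches 'Cheminformatics'.
loose = titles.str.contains(r'conference|ChemInform', regex=True, case=False, na=False)

# \b requires 'ChemInform' to stand alone as a whole word.
strict = titles.str.contains(r'conference|\bChemInform\b', regex=True, case=False, na=False)
```

Switching patterns would change the row counts reported below, so the counts would need to be regenerated.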


In [9]:
df.drop(conferences.index, inplace=True)
df.shape

(100784, 49)

In [11]:
#Randomly sample 784 records and drop them, leaving an even 100,000
extras = df.sample(n=784, random_state=42)
df.drop(extras.index, inplace=True)
df

Index,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,update-to,relation
0,"{'date-parts': [[2022, 10, 7]], 'date-time': '...",14,Wiley,1,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Syst. Dyn. Rev.'],{'date-parts': [[2000]]},10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,journal-article,"{'date-parts': [[2002, 9, 10]], 'date-time': '...",27-41,Crossref,57,['The validation of commercial system dynamics...,10.1002,16,"[{'given': 'Geoff', 'family': 'Coyle', 'sequen...",311.0,[{'key': '10.1002/(SICI)1099-1727(200021)16:1<...,['System Dynamics Review'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2021, 7, 1]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,{'date-parts': [[2000]]},14,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/(sici)1099-1727(2000...,"['0883-7066', '1099-1727']","[{'value': '0883-7066', 'type': 'print'}, {'va...","['Management of Technology and Innovation', 'S...",{'date-parts': [[2000]]},,,,,,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",12,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1979, 3, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MTB'],"{'date-parts': [[1979, 3]]}",10.1007/bf02653972,journal-article,"{'date-parts': [[2007, 7, 17]], 'date-time': '...",57-62,Crossref,20,['Effect of system geometry on the leaching be...,10.1007,10,"[{'given': 'C.', 'family': 'Vu', 'sequence': '...",297.0,"[{'key': 'BF02653972_CR1', 'volume-title': 'Ph...",['Metallurgical Transactions B'],en,[{'URL': 'http://link.springer.com/content/pdf...,"{'date-parts': [[2019, 5, 20]], 'date-time': '...",0.0,{'primary': {'URL': 'http://link.springer.com/...,"{'date-parts': [[1979, 3]]}",12,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1007/bf02653972,"['0360-2141', '1543-1916']","[{'value': '0360-2141', 'type': 'print'}, {'va...","['Materials Chemistry', 'Metals and Alloys', '...","{'date-parts': [[1979, 3]]}",['BF02653972'],,,,,,,,,,,,,,
2,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2017, 11, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['RECIEL'],"{'date-parts': [[2017, 11]]}",10.1111/reel.12221,journal-article,"{'date-parts': [[2017, 12, 1]], 'date-time': '...",243-254,Crossref,2,['The international law on transboundary haze ...,10.1111,26,"[{'given': 'Shawkat', 'family': 'Alam', 'seque...",311.0,,"['Review of European, Comparative &amp; Intern...",en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,"{'date-parts': [[2017, 12, 1]], 'date-time': '...",0.0,{'primary': {'URL': 'http://doi.wiley.com/10.1...,"{'date-parts': [[2017, 11]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1111/reel.12221,['2050-0386'],"[{'value': '2050-0386', 'type': 'print'}]","['Law', 'Management, Monitoring, Policy and La...","{'date-parts': [[2017, 11]]}",,"{'date-parts': [[2017, 11, 28]]}",['Portico'],,,,,,,,,,,,
3,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Crop Science Society of Japan,1-2,,"{'domain': [], 'crossmark-restriction': False}","['Japanese journal of crop science', 'Jpn. J. ...",{'date-parts': [[1951]]},10.1626/jcs.20.219,journal-article,"{'date-parts': [[2011, 9, 20]], 'date-time': '...",219-222,Crossref,0,['Studies on the influence of pruning on the v...,10.1626,20,"[{'given': 'C.', 'family': 'TSUDA', 'sequence'...",632.0,,['Japanese Journal of Crop Science'],en,[{'URL': 'http://www.jstage.jst.go.jp/article/...,"{'date-parts': [[2021, 4, 30]], 'date-time': '...",0.0,{'primary': {'URL': 'http://www.jstage.jst.go....,{'date-parts': [[1951]]},0,"{'issue': '1-2', 'published-print': {'date-par...",http://dx.doi.org/10.1626/jcs.20.219,"['0011-1848', '1349-0990']","[{'value': '0011-1848', 'type': 'print'}, {'va...","['Genetics', 'Agronomy and Crop Science', 'Foo...",{'date-parts': [[1951]]},,,,,,,,,,,,,,,
4,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",60,Elsevier BV,6,"[{'start': {'date-parts': [[2018, 12, 1]], 'da...","{'domain': ['clinicalkey.fr', 'elsevier.com', ...",['Revue de Pneumologie Clinique'],"{'date-parts': [[2018, 12]]}",10.1016/j.pneumo.2018.09.002,journal-article,"{'date-parts': [[2018, 10, 10]], 'date-time': ...",391-399,Crossref,0,['Le tabagisme et l’aide à l’arrêt du tabac de...,10.1016,74,"[{'given': 'J.', 'family': 'Perriot', 'sequenc...",78.0,[{'key': '10.1016/j.pneumo.2018.09.002_bib0305...,['Revue de Pneumologie Clinique'],fr,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2019, 10, 26]], 'date-time': ...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2018, 12]]}",60,"{'issue': '6', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/j.pneumo.2018.09.002,['0761-8417'],"[{'value': '0761-8417', 'type': 'print'}]",['Pulmonary and Respiratory Medicine'],"{'date-parts': [[2018, 12]]}",['S0761841718301792'],,,http://dx.doi.org/10.1016/elsevier_cm_policy,"[{'value': 'Elsevier', 'name': 'publisher', 'l...",,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105102,"{'date-parts': [[2022, 4, 6]], 'date-time': '2...",0,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1930, 12, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['Neophilologus'],"{'date-parts': [[1930, 12]]}",10.1007/bf01510212,journal-article,"{'date-parts': [[2005, 4, 18]], 'date-time': '...",248-249,Crossref,0,"['Filocolo, Filocopo, Filopono']",10.1007,15,"[{'given': 'G. A.', 'family': 'Nauta', 'sequen...",297.0,,['Neophilologus'],en,[{'URL': 'http://link.springer.com/article/10....,"{'date-parts': [[2019, 5, 3]], 'date-time': '2...",0.0,{'primary': {'URL': 'http://link.springer.com/...,"{'date-parts': [[1930, 12]]}",0,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1007/bf01510212,"['0028-2677', '1572-8668']","[{'value': '0028-2677', 'type': 'print'}, {'va...","['Literature and Literary Theory', 'Linguistic...","{'date-parts': [[1930, 12]]}",['BF01510212'],,,,,,,,,,,,,,
105103,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",0,Elsevier BV,3,"[{'start': {'date-parts': [[2020, 2, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Biophysical Journal'],"{'date-parts': [[2020, 2]]}",10.1016/j.bpj.2019.11.2324,journal-article,"{'date-parts': [[2020, 2, 7]], 'date-time': '2...",411a,Crossref,0,['Contributions of the Transmembrane Domain to...,10.1016,118,"[{'given': 'Aerial M.', 'family': 'Pratt', 'se...",78.0,,['Biophysical Journal'],en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2021, 2, 7]], 'date-time': '2...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2020, 2]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/j.bpj.2019.11.2324,['0006-3495'],"[{'value': '0006-3495', 'type': 'print'}]",['Biophysics'],"{'date-parts': [[2020, 2]]}",['S0006349519332576'],,,,,,,,,,,,,,
105104,"{'date-parts': [[2022, 4, 4]], 'date-time': '2...",7,Elsevier BV,,"[{'start': {'date-parts': [[2016, 1, 1]], 'dat...","{'domain': ['elsevier.com', 'sciencedirect.com...",['Procedia Structural Integrity'],{'date-parts': [[2016]]},10.1016/j.prostr.2016.06.306,journal-article,"{'date-parts': [[2016, 7, 22]], 'date-time': '...",2447-2455,Crossref,2,['A Probabilistic Fatigue Assessment Diagram T...,10.1016,2,"[{'given': 'S.', 'family': 'Jallouf', 'sequenc...",78.0,[{'key': '10.1016/j.prostr.2016.06.306_bib0001...,['Procedia Structural Integrity'],en,[{'URL': 'https://api.elsevier.com/content/art...,"{'date-parts': [[2018, 9, 10]], 'date-time': '...",0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,{'date-parts': [[2016]]},7,,http://dx.doi.org/10.1016/j.prostr.2016.06.306,['2452-3216'],"[{'value': '2452-3216', 'type': 'print'}]",['General Medicine'],{'date-parts': [[2016]]},['S2452321616303171'],,,http://dx.doi.org/10.1016/elsevier_cm_policy,"[{'value': 'Elsevier', 'name': 'publisher', 'l...",,,,,,,,,,
105105,"{'date-parts': [[2022, 4, 7]], 'date-time': '2...",10,American Physical Society (APS),7,"[{'start': {'date-parts': [[1991, 10, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['Phys. Rev. A'],,10.1103/physreva.44.4757,journal-article,"{'date-parts': [[2002, 7, 27]], 'date-time': '...",4757-4760,Crossref,1,['Effect of squeezed light on the photon-numbe...,10.1103,44,"[{'given': 'Lu-Bi', 'family': 'Deng', 'sequenc...",16.0,"[{'key': 'PhysRevA.44.4757Cc1R1', 'doi-asserte...",['Physical Review A'],en,[{'URL': 'http://link.aps.org/article/10.1103/...,"{'date-parts': [[2017, 6, 15]], 'date-time': '...",0.0,{'primary': {'URL': 'https://link.aps.org/doi/...,"{'date-parts': [[1991, 10, 1]]}",10,"{'issue': '7', 'published-print': {'date-parts...",http://dx.doi.org/10.1103/physreva.44.4757,"['1050-2947', '1094-1622']","[{'value': '1050-2947', 'type': 'print'}, {'va...","['Atomic and Molecular Physics, and Optics']","{'date-parts': [[1991, 10, 1]]}",,"{'date-parts': [[1991, 10, 1]]}",,,,,,,,,,,,,


## Cleaning Dates
Here we'll reformat three date columns, *created*, *deposited*, and *published*, into formats that are easier to parse. Not all records have month and day values for the *published* field, so we'll keep only the year for that column. For *created* and *deposited* we'll use a YYYY-MM-DD format.

We've chosen these dates because each carries information that we'll use later on. *Created* is the date when the item was first inserted into the Crossref database. *Deposited* reflects the last time the record was deposited by the publisher (possibly, but not necessarily, with changes to the record). *Published* reflects when the item itself was published.

We'll use a regular expression to extract the dates from each record in those three columns, then convert *created* and *deposited* to datetime dtypes. *Published* is kept as a four-digit year string.

In [12]:
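date_columns = ['created', 'deposited']

for col in date_columns:
    # Capture the YYYY-MM-DD that opens the quoted 'date-time' value
    df[col] = df[col].str.extract(r"\'(\d{4}\S\d{2}\S\d{2})", expand=False)
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Keep only the year: the first four-digit run in the field
df['published'] = df['published'].str.extract(r"(\d{4})", expand=False)
df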

Unnamed: 0_level_0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,created,page,source,is-referenced-by-count,title,prefix,volume,author,member,reference,container-title,language,link,deposited,score,resource,issued,references-count,journal-issue,URL,ISSN,issn-type,subject,published,alternative-id,published-online,archive,update-policy,assertion,funder,article-number,accepted,abstract,original-title,subtitle,published-other,editor,update-to,relation
0,"{'date-parts': [[2022, 10, 7]], 'date-time': '...",14,Wiley,1,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Syst. Dyn. Rev.'],{'date-parts': [[2000]]},10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,journal-article,2002-09-10,27-41,Crossref,57,['The validation of commercial system dynamics...,10.1002,16,"[{'given': 'Geoff', 'family': 'Coyle', 'sequen...",311.0,[{'key': '10.1002/(SICI)1099-1727(200021)16:1<...,['System Dynamics Review'],en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,2021-07-01,0.0,{'primary': {'URL': 'https://onlinelibrary.wil...,{'date-parts': [[2000]]},14,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1002/(sici)1099-1727(2000...,"['0883-7066', '1099-1727']","[{'value': '0883-7066', 'type': 'print'}, {'va...","['Management of Technology and Innovation', 'S...",2000,,,,,,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",12,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1979, 3, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MTB'],"{'date-parts': [[1979, 3]]}",10.1007/bf02653972,journal-article,2007-07-17,57-62,Crossref,20,['Effect of system geometry on the leaching be...,10.1007,10,"[{'given': 'C.', 'family': 'Vu', 'sequence': '...",297.0,"[{'key': 'BF02653972_CR1', 'volume-title': 'Ph...",['Metallurgical Transactions B'],en,[{'URL': 'http://link.springer.com/content/pdf...,2019-05-20,0.0,{'primary': {'URL': 'http://link.springer.com/...,"{'date-parts': [[1979, 3]]}",12,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1007/bf02653972,"['0360-2141', '1543-1916']","[{'value': '0360-2141', 'type': 'print'}, {'va...","['Materials Chemistry', 'Metals and Alloys', '...",1979,['BF02653972'],,,,,,,,,,,,,,
2,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2017, 11, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['RECIEL'],"{'date-parts': [[2017, 11]]}",10.1111/reel.12221,journal-article,2017-12-01,243-254,Crossref,2,['The international law on transboundary haze ...,10.1111,26,"[{'given': 'Shawkat', 'family': 'Alam', 'seque...",311.0,,"['Review of European, Comparative &amp; Intern...",en,[{'URL': 'https://api.wiley.com/onlinelibrary/...,2017-12-01,0.0,{'primary': {'URL': 'http://doi.wiley.com/10.1...,"{'date-parts': [[2017, 11]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1111/reel.12221,['2050-0386'],"[{'value': '2050-0386', 'type': 'print'}]","['Law', 'Management, Monitoring, Policy and La...",2017,,"{'date-parts': [[2017, 11, 28]]}",['Portico'],,,,,,,,,,,,
3,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Crop Science Society of Japan,1-2,,"{'domain': [], 'crossmark-restriction': False}","['Japanese journal of crop science', 'Jpn. J. ...",{'date-parts': [[1951]]},10.1626/jcs.20.219,journal-article,2011-09-20,219-222,Crossref,0,['Studies on the influence of pruning on the v...,10.1626,20,"[{'given': 'C.', 'family': 'TSUDA', 'sequence'...",632.0,,['Japanese Journal of Crop Science'],en,[{'URL': 'http://www.jstage.jst.go.jp/article/...,2021-04-30,0.0,{'primary': {'URL': 'http://www.jstage.jst.go....,{'date-parts': [[1951]]},0,"{'issue': '1-2', 'published-print': {'date-par...",http://dx.doi.org/10.1626/jcs.20.219,"['0011-1848', '1349-0990']","[{'value': '0011-1848', 'type': 'print'}, {'va...","['Genetics', 'Agronomy and Crop Science', 'Foo...",1951,,,,,,,,,,,,,,,
4,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",60,Elsevier BV,6,"[{'start': {'date-parts': [[2018, 12, 1]], 'da...","{'domain': ['clinicalkey.fr', 'elsevier.com', ...",['Revue de Pneumologie Clinique'],"{'date-parts': [[2018, 12]]}",10.1016/j.pneumo.2018.09.002,journal-article,2018-10-10,391-399,Crossref,0,['Le tabagisme et l’aide à l’arrêt du tabac de...,10.1016,74,"[{'given': 'J.', 'family': 'Perriot', 'sequenc...",78.0,[{'key': '10.1016/j.pneumo.2018.09.002_bib0305...,['Revue de Pneumologie Clinique'],fr,[{'URL': 'https://api.elsevier.com/content/art...,2019-10-26,0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2018, 12]]}",60,"{'issue': '6', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/j.pneumo.2018.09.002,['0761-8417'],"[{'value': '0761-8417', 'type': 'print'}]",['Pulmonary and Respiratory Medicine'],2018,['S0761841718301792'],,,http://dx.doi.org/10.1016/elsevier_cm_policy,"[{'value': 'Elsevier', 'name': 'publisher', 'l...",,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105102,"{'date-parts': [[2022, 4, 6]], 'date-time': '2...",0,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1930, 12, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['Neophilologus'],"{'date-parts': [[1930, 12]]}",10.1007/bf01510212,journal-article,2005-04-18,248-249,Crossref,0,"['Filocolo, Filocopo, Filopono']",10.1007,15,"[{'given': 'G. A.', 'family': 'Nauta', 'sequen...",297.0,,['Neophilologus'],en,[{'URL': 'http://link.springer.com/article/10....,2019-05-03,0.0,{'primary': {'URL': 'http://link.springer.com/...,"{'date-parts': [[1930, 12]]}",0,"{'issue': '1', 'published-print': {'date-parts...",http://dx.doi.org/10.1007/bf01510212,"['0028-2677', '1572-8668']","[{'value': '0028-2677', 'type': 'print'}, {'va...","['Literature and Literary Theory', 'Linguistic...",1930,['BF01510212'],,,,,,,,,,,,,,
105103,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",0,Elsevier BV,3,"[{'start': {'date-parts': [[2020, 2, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Biophysical Journal'],"{'date-parts': [[2020, 2]]}",10.1016/j.bpj.2019.11.2324,journal-article,2020-02-07,411a,Crossref,0,['Contributions of the Transmembrane Domain to...,10.1016,118,"[{'given': 'Aerial M.', 'family': 'Pratt', 'se...",78.0,,['Biophysical Journal'],en,[{'URL': 'https://api.elsevier.com/content/art...,2021-02-07,0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,"{'date-parts': [[2020, 2]]}",0,"{'issue': '3', 'published-print': {'date-parts...",http://dx.doi.org/10.1016/j.bpj.2019.11.2324,['0006-3495'],"[{'value': '0006-3495', 'type': 'print'}]",['Biophysics'],2020,['S0006349519332576'],,,,,,,,,,,,,,
105104,"{'date-parts': [[2022, 4, 4]], 'date-time': '2...",7,Elsevier BV,,"[{'start': {'date-parts': [[2016, 1, 1]], 'dat...","{'domain': ['elsevier.com', 'sciencedirect.com...",['Procedia Structural Integrity'],{'date-parts': [[2016]]},10.1016/j.prostr.2016.06.306,journal-article,2016-07-22,2447-2455,Crossref,2,['A Probabilistic Fatigue Assessment Diagram T...,10.1016,2,"[{'given': 'S.', 'family': 'Jallouf', 'sequenc...",78.0,[{'key': '10.1016/j.prostr.2016.06.306_bib0001...,['Procedia Structural Integrity'],en,[{'URL': 'https://api.elsevier.com/content/art...,2018-09-10,0.0,{'primary': {'URL': 'https://linkinghub.elsevi...,{'date-parts': [[2016]]},7,,http://dx.doi.org/10.1016/j.prostr.2016.06.306,['2452-3216'],"[{'value': '2452-3216', 'type': 'print'}]",['General Medicine'],2016,['S2452321616303171'],,,http://dx.doi.org/10.1016/elsevier_cm_policy,"[{'value': 'Elsevier', 'name': 'publisher', 'l...",,,,,,,,,,
105105,"{'date-parts': [[2022, 4, 7]], 'date-time': '2...",10,American Physical Society (APS),7,"[{'start': {'date-parts': [[1991, 10, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['Phys. Rev. A'],,10.1103/physreva.44.4757,journal-article,2002-07-27,4757-4760,Crossref,1,['Effect of squeezed light on the photon-numbe...,10.1103,44,"[{'given': 'Lu-Bi', 'family': 'Deng', 'sequenc...",16.0,"[{'key': 'PhysRevA.44.4757Cc1R1', 'doi-asserte...",['Physical Review A'],en,[{'URL': 'http://link.aps.org/article/10.1103/...,2017-06-15,0.0,{'primary': {'URL': 'https://link.aps.org/doi/...,"{'date-parts': [[1991, 10, 1]]}",10,"{'issue': '7', 'published-print': {'date-parts...",http://dx.doi.org/10.1103/physreva.44.4757,"['1050-2947', '1094-1622']","[{'value': '1050-2947', 'type': 'print'}, {'va...","['Atomic and Molecular Physics, and Optics']",1991,,"{'date-parts': [[1991, 10, 1]]}",,,,,,,,,,,,,
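
To make the extraction step easier to follow, here is a minimal sketch that applies the same two patterns to a tiny, made-up frame. The `sample` values are assumptions shaped like the stringified Crossref fields shown above, not rows from the dataset, and the separator in the pattern is written as a literal hyphen rather than `\S`.

```python
import pandas as pd

# Made-up values shaped like the stringified date fields in the dataset.
sample = pd.DataFrame({
    "created": [
        "{'date-parts': [[2002, 9, 10]], 'date-time': '2002-09-10T12:23:55Z'}",
        None,  # a missing value stays missing after extraction
    ],
    "published": ["{'date-parts': [[2000]]}", "{'date-parts': [[1979, 3]]}"],
})

# Capture the YYYY-MM-DD that directly follows a single quote,
# which is where the 'date-time' value begins.
created = sample["created"].str.extract(r"'(\d{4}-\d{2}-\d{2})", expand=False)
sample["created"] = pd.to_datetime(created, format="%Y-%m-%d")

# The first four-digit run in 'date-parts' is the year.
sample["published"] = sample["published"].str.extract(r"(\d{4})", expand=False)

print(sample)
```

Passing `expand=False` returns a Series instead of a one-column DataFrame, which keeps the assignment back to a single column unambiguous.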


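## String Slicing
Now that the dates are converted, one of the last problems to address is the excess characters in the *title*, *short-container-title*, and *container-title* fields. Each value is stored as the string form of a list, such as ['Neophilologus'], so slicing off the first two and last two characters removes the brackets and quotes.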

In [14]:
cols = ['title', 'short-container-title', 'container-title']
for col in cols:
    df[col] = df[col].str.slice(start=2, stop=-2)

In [15]:
df['title'][0]

'The validation of commercial system dynamics models'
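
Slicing assumes each field holds exactly one item. Some *short-container-title* values above hold two, in which case the inner quotes and comma survive the slice. As a hedged alternative (not what this notebook does), the sketch below parses the string with `ast.literal_eval` and keeps the first item. The helper name `first_item` and the sample values are hypothetical.

```python
import ast
import pandas as pd

def first_item(value):
    """Return the first element of a stringified list, or None if it cannot be parsed."""
    if not isinstance(value, str):
        return None  # missing values (NaN) end up here
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None
    if isinstance(parsed, list) and parsed:
        return parsed[0]
    return None

# Hypothetical values: one single-item list, one two-item list, one missing value.
sample = pd.Series([
    "['System Dynamics Review']",
    "['Long Journal Name', 'Long J. Name']",
    float("nan"),
])

sliced = sample.str.slice(start=2, stop=-2)  # the approach used in this notebook
parsed = sample.map(first_item)              # the alternative sketched here

print(sliced.tolist())
print(parsed.tolist())
```

The trade-off is that parsing silently drops any additional items, so it only fits if the first entry is the one wanted.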

## Cleaning XML tags
We'll be looking at the *abstract* column later, so it will help to strip out the XML tags and keep only the relevant text for each record. We'll write a quick function to do that.

In [17]:
# Import Beautiful Soup
from bs4 import BeautifulSoup as bs

def clean_abstracts(abstract):
    """Strip markup tags from an abstract; return None if it cannot be parsed."""
    try:
        soup = bs(abstract, features='lxml')
        return soup.get_text()
    except Exception:
        # Missing abstracts (NaN) are not strings and are expected to end up here
        return None

df['abstract'] = df['abstract'].map(clean_abstracts)
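
As a small illustration of what the tag stripping does, the sketch below runs the same `get_text()` call on a single made-up abstract. It makes two assumptions that differ from the cell above: it uses Python's built-in `html.parser` instead of `lxml`, so no extra parser is needed, and it checks for missing values explicitly rather than relying on an exception. The name `strip_tags` and the sample string are hypothetical.

```python
from bs4 import BeautifulSoup

def strip_tags(abstract):
    """Return the text content of a tagged abstract, or None for non-string input."""
    if not isinstance(abstract, str):
        return None  # NaN / missing abstracts
    # 'html.parser' ships with Python; the notebook itself uses the 'lxml' parser.
    return BeautifulSoup(abstract, features="html.parser").get_text()

# A made-up abstract with JATS-style tags.
sample = "<jats:p>Neural networks can detect <jats:italic>winding</jats:italic> faults.</jats:p>"

print(strip_tags(sample))
print(strip_tags(float("nan")))
```

Checking the type up front keeps genuine parsing errors visible instead of folding them into the same `None` result as missing values.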



In [18]:
df['abstract'][100080]

'Abstract\nIn this paper, Neural Network (NN) approach is developed and utilised to detect winding faults in an electrical machine using the samples data of electrical machine in both the healthy and different fault conditions (i.e. shorted-turn fault, phase-to-ground fault and coil-to-coil fault). This is done by interfacing a data acquisition device connected to the machine with a computer in the laboratory. Thereafter, a two-layer feed-forward network with Levenberg–Marquardt back-propagation algorithm is created with the collected input dataset. The NN model developed was tested with both the healthy and the four different fault conditions of the electrical machine. The results from the NN approach was also compared with other results obtained by determining the fault index (FI) of an electrical machine using signal processing approach. The results show that the NN approach can identify each of the electrical machine condition with high accuracy. The percentage accuracy for healthy

Looks great! Now we'll save our cleaned dataset.

In [19]:
df.to_csv(input_dir / '02_cleaned_data.csv')