# Data Cleaning
To clean this dataset, we'll load it, check for duplicates, and drop records that are not relevant to our analysis.

First, we'll load our packages, set up our directories, and read in the dataset.

In [1]:
import pandas as pd
from pathlib import Path

#Set up directories
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'

df = pd.read_csv(input_dir / '01_raw_data.csv', low_memory=False)

## Duplicate Records
Looking at the shape of the dataset against the number of unique DOIs will let us know just how many duplicate records we have.


In [2]:
df.shape

(106107, 51)

In [3]:
len(set(df['DOI']))

106036

In [4]:
#Dropping duplicate records
df.drop_duplicates(subset=['DOI'], keep='first', inplace=True)
df.shape

(106036, 51)
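For our data the difference is 106107 − 106036 = 71 duplicate rows. The same check can be done in one step with `duplicated()`. The sketch below runs on a small made-up frame (the DOIs are placeholders, not records from our dataset) and is only meant to illustrate the idea.

```python
import pandas as pd

# Toy frame with placeholder DOIs (not taken from the real dataset)
toy = pd.DataFrame({
    'DOI': ['10.1000/a', '10.1000/b', '10.1000/a', '10.1000/c'],
    'title': ['first', 'second', 'first again', 'third'],
})

# duplicated() flags every repeat after the first occurrence of a DOI
n_duplicates = int(toy.duplicated(subset=['DOI'], keep='first').sum())

# Dropping them should leave exactly one row per unique DOI
deduped = toy.drop_duplicates(subset=['DOI'], keep='first')
```

Counting with `duplicated()` before dropping gives a number to compare against the change in `df.shape` afterwards.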

## Editors
There are very few records that have a value in the *editor* column. Some of our prior work indicates that this can be a sign of a work that has been mislabeled as a 'journal article'. So we'll explore some of the records with a value in the editor column in order to verify that.

We'll set up a dataframe of just those records that have data in the *editor* column.
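A minimal sketch of that filter, assuming that records without an editor hold a missing value (NaN) in the *editor* column. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; only the first one has an editor entry
toy = pd.DataFrame({
    'title': ['Editorial', 'A regular research paper', 'Preface'],
    'editor': ["[{'given': 'A.', 'family': 'Example'}]", None, None],
})

# Keep only the rows where the editor column is populated
has_editor = toy.loc[toy['editor'].notna()]
```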

Next, we'll search the titles for a few keywords. Note that the search below runs over every record in `df`, not only those with a value in the *editor* column.

In [5]:
editorial = df.loc[df.title.str.contains(r'editorial|errata|contents|conference|proceedings|masthead|symposium|abstract|Book Review|preface|title page', 
                                         regex=True, case=False, na=False)]
editorial

Unnamed: 0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,...,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
5,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",1,Wiley,52,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['ChemInform'],"{'date-parts': [[2015, 12]]}",10.1002/chin.201552196,journal-article,...,,,,,,,,,,
23,"{'date-parts': [[2022, 4, 4]], 'date-time': '2...",0,Elsevier BV,1,"[{'start': {'date-parts': [[1965, 6, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['IFAC Proceedings Volumes'],"{'date-parts': [[1965, 6]]}",10.1016/s1474-6670(17)69139-0,journal-article,...,,,,,,,,,,
30,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Commun. Numer. Meth. Engng.'],"{'date-parts': [[1995, 3]]}",10.1002/cnm.1640110301,journal-article,...,,,,,,,,,,
33,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",1,Wiley,33,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Chemischer Informationsdienst'],"{'date-parts': [[1972, 8, 15]]}",10.1002/chin.197233207,journal-article,...,,,,,,,,,,
37,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",0,Wiley,27-29,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Z. Pflanzenernaehr. Dueng. Bodenk.'],{'date-parts': [[1931]]},10.1002/jpln.19310102701,journal-article,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105970,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",0,Dissolution Technologies,2,,"{'domain': [], 'crossmark-restriction': False}",['Dissolution Technol.'],{'date-parts': [[2006]]},10.14227/dt130206p25,journal-article,...,,,,,,,,,,
106018,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Elsevier BV,,"[{'start': {'date-parts': [[2017, 4, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Endocrine Practice'],"{'date-parts': [[2017, 4]]}",10.1016/s1530-891x(20)44162-x,journal-article,...,,,,,,,,,,
106040,"{'date-parts': [[2022, 4, 6]], 'date-time': '2...",0,SAGE Publications,2,"[{'start': {'date-parts': [[1994, 6, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['British Journalism Review'],"{'date-parts': [[1994, 6]]}",10.1177/095647489400500214,journal-article,...,,,,['Michael Foot: by Mervyn Jones Victor Gollanc...,,,,,,
106050,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",1,AIP Publishing,8,,"{'domain': [], 'crossmark-restriction': False}",['Journal of Applied Physics'],"{'date-parts': [[1985, 4, 15]]}",10.1063/1.334625,journal-article,...,,,,,,,,,,


We've found some editorials, mastheads, conference proceedings, and abstracts. We'll go ahead and drop them from our dataset.

In [7]:
df.drop(editorial.index, inplace=True)

In [8]:
df.shape

(102486, 51)
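To see how the keyword search behaves, here is a small sketch on made-up titles with a shortened keyword list. It is an illustration of the matching rules only, not a rerun of the cell above.

```python
import pandas as pd

# Made-up titles, including one missing value
titles = pd.Series([
    'Editorial',
    'Book review: a history of physics',
    'Heat transfer in porous media',
    None,
])

# Shortened keyword list for illustration
pattern = r'editorial|book review|masthead'

# case=False ignores capitalisation; na=False treats missing titles as non-matches
mask = titles.str.contains(pattern, regex=True, case=False, na=False)
flagged = titles[mask]
```

Because the match is on substrings, a keyword such as 'abstract' will also flag a research paper whose title merely contains that word, so it is worth skimming the flagged rows before dropping them.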

## Conferences
Looking back at **editorial**, we see a couple of 'Conferences' and 'Proceedings' in the *container-title* column. Let's see how many of the records remaining in our dataset come from these journals/containers.

Additionally, we see a few records from *ChemInform*, a journal that publishes chemistry abstracts. We'll check whether any of those records remain as well.

We'll use a keyword search in the *container-title* column to find these records. The same search also catches containers with 'news' in their title and the 'CrossRef Listing of Deleted DOIs', and we include every record published by EDP Sciences.

In [9]:
# na=False treats records with a missing container-title as non-matches
conferences = df.loc[(df['container-title'].str.contains(r'conference|ChemInform|news|CrossRef Listing of Deleted DOIs', regex=True, case=False, na=False)) | (df.publisher == 'EDP Sciences')]
conferences

Unnamed: 0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,...,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
35,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,EDP Sciences,3,"[{'start': {'date-parts': [[2021, 6, 28]], 'da...","{'domain': [], 'crossmark-restriction': False}",['Europhysics News'],{'date-parts': [[2021]]},10.1051/epn/2021307,journal-article,...,,,,,,,,,,
57,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",6,IOP Publishing,1,"[{'start': {'date-parts': [[2021, 2, 1]], 'dat...","{'domain': ['iopscience.iop.org'], 'crossmark-...",['IOP Conf. Ser.: Earth Environ. Sci.'],"{'date-parts': [[2021, 2, 1]]}",10.1088/1755-1315/660/1/012131,journal-article,...,,<jats:title>Abstract</jats:title>\n ...,,,,,,,,
117,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",17,EDP Sciences,,"[{'start': {'date-parts': [[2021, 4, 26]], 'da...","{'domain': [], 'crossmark-restriction': False}",['EPJ Web Conf.'],{'date-parts': [[2021]]},10.1051/epjconf/202124801022,journal-article,...,,<jats:p>The processes of heat and mass transfe...,,,,"[{'given': 'A.', 'family': 'Nadykto', 'sequenc...",,,,
140,"{'date-parts': [[2022, 11, 8]], 'date-time': '...",0,Association for the Advancement of Artificial ...,1,,"{'domain': [], 'crossmark-restriction': False}",['AAAI'],,10.1609/aaai.v32i1.11721,journal-article,...,,<jats:p>\n \n The iterative hard-t...,,,,,,,,
159,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Wiley,5,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Sci News'],,10.1002/scin.2007.5591710502,journal-article,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105726,"{'date-parts': [[2022, 8, 5]], 'date-time': '2...",24,EDP Sciences,1,,"{'domain': [], 'crossmark-restriction': False}",['A&amp;A'],"{'date-parts': [[2006, 10]]}",10.1051/0004-6361:20065495,journal-article,...,,,,,,,,,,
105831,"{'date-parts': [[2022, 4, 6]], 'date-time': '2...",0,JSTOR,15,,"{'domain': [], 'crossmark-restriction': False}",['The Science News-Letter'],"{'date-parts': [[1954, 4, 10]]}",10.2307/3933390,journal-article,...,,,,,,,,,,
105880,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,EDP Sciences,Suppl. 1,,"{'domain': [], 'crossmark-restriction': False}",['Ann. Zootech.'],{'date-parts': [[1995]]},10.1051/animres:19950579,journal-article,...,,,,,,,,,,
105945,"{'date-parts': [[2022, 4, 2]], 'date-time': '2...",13,EDP Sciences,,"[{'start': {'date-parts': [[2020, 11, 25]], 'd...","{'domain': [], 'crossmark-restriction': False}",['BIO Web Conf.'],{'date-parts': [[2020]]},10.1051/bioconf/20202700018,journal-article,...,,<jats:p>The impact of human economic activity ...,,,,"[{'given': 'A.', 'family': 'Valiev', 'sequence...",,,,


In [10]:
df.drop(conferences.index, inplace=True)
df.shape

(100912, 51)

In [11]:
# Randomly drop 912 records (fixed seed) so that 100,000 remain
extras = df.sample(n=912, random_state=42)
df.drop(extras.index, inplace=True)
df

Unnamed: 0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,...,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
0,"{'date-parts': [[2022, 10, 7]], 'date-time': '...",14,Wiley,1,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Syst. Dyn. Rev.'],{'date-parts': [[2000]]},10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,journal-article,...,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",12,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1979, 3, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MTB'],"{'date-parts': [[1979, 3]]}",10.1007/bf02653972,journal-article,...,,,,,,,,,,
2,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2017, 11, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['RECIEL'],"{'date-parts': [[2017, 11]]}",10.1111/reel.12221,journal-article,...,,,,,,,,,,
3,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Crop Science Society of Japan,1-2,,"{'domain': [], 'crossmark-restriction': False}","['Japanese journal of crop science', 'Jpn. J. ...",{'date-parts': [[1951]]},10.1626/jcs.20.219,journal-article,...,,,,,,,,,,
4,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",60,Elsevier BV,6,"[{'start': {'date-parts': [[2018, 12, 1]], 'da...","{'domain': ['clinicalkey.fr', 'elsevier.com', ...",['Revue de Pneumologie Clinique'],"{'date-parts': [[2018, 12]]}",10.1016/j.pneumo.2018.09.002,journal-article,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106102,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",2,Elsevier BV,2,"[{'start': {'date-parts': [[1988, 4, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['British Journal of Oral and Maxillofacial Su...,"{'date-parts': [[1988, 4]]}",10.1016/0266-4356(88)90016-2,journal-article,...,,,,,,,,,,
106103,"{'date-parts': [[2023, 1, 10]], 'date-time': '...",41,"Impact Journals, LLC",3,,"{'domain': [], 'crossmark-restriction': False}",['Oncotarget'],"{'date-parts': [[2018, 1, 9]]}",10.18632/oncotarget.23280,journal-article,...,,,,,,,,,,
106104,"{'date-parts': [[2022, 10, 16]], 'date-time': ...",33,AIP Publishing,9,,"{'domain': ['aip.scitation.org'], 'crossmark-r...",,"{'date-parts': [[1996, 9]]}",10.1063/1.869021,journal-article,...,,,,,,,,,,
106105,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",5,Oxford University Press (OUP),8,"[{'start': {'date-parts': [[1999, 8, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",,"{'date-parts': [[2002, 12, 12]]}",10.1002/bjs.1155,journal-article,...,,,,,"{'date-parts': [[1999, 8]]}",,,,,


## Cleaning Dates
Here we are going to re-format three of the date columns, *created*, *deposited*, and *published*, into a more easily parsed format. Not all records have month and day values for the *published* field, so we'll only take the year from those. For *created* and *deposited* we will use a YYYY-MM-DD format.

We've chosen these dates because they carry information that we'll use later on. *Created* is the date when the item was first inserted into the Crossref database. *Deposited* reflects the last time the publisher deposited the record (potentially, but not necessarily, with changes to it). *Published* reflects when the item itself was actually published.

We'll use a regular expression to extract the date from each record in those three columns. *Created* and *deposited* are then converted to datetime dtypes, while *published* is kept as a four-digit year string.

In [12]:
date_columns = ['created', 'deposited']

for col in date_columns:
    df[col] = df[col].str.extract(r"\'(\d{4}\S\d{2}\S\d{2})")
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")
df['published'] = df['published'].str.extract(r"(\d{4})")
df

Unnamed: 0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,published-print,DOI,type,...,accepted,abstract,original-title,subtitle,published-other,editor,relation,update-to,translator,clinical-trial-number
0,"{'date-parts': [[2022, 10, 7]], 'date-time': '...",14,Wiley,1,"[{'start': {'date-parts': [[2015, 9, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['Syst. Dyn. Rev.'],{'date-parts': [[2000]]},10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,journal-article,...,,,,,,,,,,
1,"{'date-parts': [[2022, 3, 29]], 'date-time': '...",12,Springer Science and Business Media LLC,1,"[{'start': {'date-parts': [[1979, 3, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['MTB'],"{'date-parts': [[1979, 3]]}",10.1007/bf02653972,journal-article,...,,,,,,,,,,
2,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",0,Wiley,3,"[{'start': {'date-parts': [[2017, 11, 1]], 'da...","{'domain': [], 'crossmark-restriction': False}",['RECIEL'],"{'date-parts': [[2017, 11]]}",10.1111/reel.12221,journal-article,...,,,,,,,,,,
3,"{'date-parts': [[2022, 4, 3]], 'date-time': '2...",0,Crop Science Society of Japan,1-2,,"{'domain': [], 'crossmark-restriction': False}","['Japanese journal of crop science', 'Jpn. J. ...",{'date-parts': [[1951]]},10.1626/jcs.20.219,journal-article,...,,,,,,,,,,
4,"{'date-parts': [[2022, 3, 31]], 'date-time': '...",60,Elsevier BV,6,"[{'start': {'date-parts': [[2018, 12, 1]], 'da...","{'domain': ['clinicalkey.fr', 'elsevier.com', ...",['Revue de Pneumologie Clinique'],"{'date-parts': [[2018, 12]]}",10.1016/j.pneumo.2018.09.002,journal-article,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106102,"{'date-parts': [[2022, 3, 30]], 'date-time': '...",2,Elsevier BV,2,"[{'start': {'date-parts': [[1988, 4, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",['British Journal of Oral and Maxillofacial Su...,"{'date-parts': [[1988, 4]]}",10.1016/0266-4356(88)90016-2,journal-article,...,,,,,,,,,,
106103,"{'date-parts': [[2023, 1, 10]], 'date-time': '...",41,"Impact Journals, LLC",3,,"{'domain': [], 'crossmark-restriction': False}",['Oncotarget'],"{'date-parts': [[2018, 1, 9]]}",10.18632/oncotarget.23280,journal-article,...,,,,,,,,,,
106104,"{'date-parts': [[2022, 10, 16]], 'date-time': ...",33,AIP Publishing,9,,"{'domain': ['aip.scitation.org'], 'crossmark-r...",,"{'date-parts': [[1996, 9]]}",10.1063/1.869021,journal-article,...,,,,,,,,,,
106105,"{'date-parts': [[2022, 4, 5]], 'date-time': '2...",5,Oxford University Press (OUP),8,"[{'start': {'date-parts': [[1999, 8, 1]], 'dat...","{'domain': [], 'crossmark-restriction': False}",,"{'date-parts': [[2002, 12, 12]]}",10.1002/bjs.1155,journal-article,...,,,,,"{'date-parts': [[1999, 8]]}",,,,,
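The truncated display above does not show the converted columns, so here is a small sketch of the extraction on made-up strings shaped like the stringified Crossref date fields. The values are invented for illustration, and `expand=False` is used here so that each extraction returns a Series.

```python
import pandas as pd

# Made-up strings shaped like the stringified Crossref date fields
toy = pd.DataFrame({
    'created': ["{'date-parts': [[2022, 3, 29]], 'date-time': '2022-03-29T10:15:00Z', 'timestamp': 1648548900000}"],
    'published': ["{'date-parts': [[2017, 11]]}"],
})

# The quote before the digits anchors the match on the ISO 'date-time' value
created = toy['created'].str.extract(r"\'(\d{4}\S\d{2}\S\d{2})", expand=False)
created = pd.to_datetime(created, format='%Y-%m-%d')

# Only the year is reliable for published, so take the first four-digit run
published_year = toy['published'].str.extract(r"(\d{4})", expand=False)
```

Note that `published_year` is still a string after this step. It would need an explicit conversion before any numeric comparison.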


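## String Slicing
Now that the dates are converted, one of the last problems to address is the excess characters in the *title*, *short-container-title*, and *container-title* fields. Each value is stored as the text of a list, such as `['Syst. Dyn. Rev.']`, so we slice off the leading `['` and the trailing `']`.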

In [13]:
cols = ['title', 'short-container-title', 'container-title']
for col in cols:
    df[col] = df[col].str.slice(start=2, stop=-2)

In [14]:
df['title'][0]

'The validation of commercial system dynamics models'
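The slice gives a clean result when the field holds a single item. When it holds two items, the quotes and comma between them are left in place. The sketch below shows this on made-up values and, as one possible alternative, parses the text with `ast.literal_eval`. It assumes every non-missing value is a valid Python list literal, and `first_item` is a hypothetical helper, not part of this notebook.

```python
import ast
import pandas as pd

# Made-up values: a one-item list, a two-item list, and a missing value
raw = pd.Series([
    "['Example Journal']",
    "['Example Journal', 'Ex. J.']",
    None,
])

# Slicing removes the outer [' and '] but keeps the separator inside a two-item list
sliced = raw.str.slice(start=2, stop=-2)

def first_item(value):
    """Hypothetical helper: parse the list literal and return its first element."""
    if not isinstance(value, str):
        return None
    items = ast.literal_eval(value)
    return items[0] if items else None

parsed = raw.map(first_item)
```

Whether the first element is the right one to keep depends on the analysis, so this is a design choice to make deliberately.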

## Cleaning XML tags
We'll be looking at the *abstract* column, so it will benefit us to strip out the XML tags and keep only the relevant text for each record. We'll write a quick function to do that.

In [15]:
# Import Beautiful Soup
from bs4 import BeautifulSoup as bs

def clean_abstracts(abstract):
    """Strip markup from one abstract; return None if it cannot be parsed."""
    try:
        soup = bs(abstract, features='lxml')
        return soup.get_text()
    except Exception:
        # e.g. missing abstracts (NaN), which are not strings
        return None

df['abstract'] = df['abstract'].map(clean_abstracts)

In [16]:
df['abstract'][100080]

'Abstract\nIn this paper, Neural Network (NN) approach is developed and utilised to detect winding faults in an electrical machine using the samples data of electrical machine in both the healthy and different fault conditions (i.e. shorted-turn fault, phase-to-ground fault and coil-to-coil fault). This is done by interfacing a data acquisition device connected to the machine with a computer in the laboratory. Thereafter, a two-layer feed-forward network with Levenberg–Marquardt back-propagation algorithm is created with the collected input dataset. The NN model developed was tested with both the healthy and the four different fault conditions of the electrical machine. The results from the NN approach was also compared with other results obtained by determining the fault index (FI) of an electrical machine using signal processing approach. The results show that the NN approach can identify each of the electrical machine condition with high accuracy. The percentage accuracy for healthy

Looks great! Now we'll save our cleaned dataset.

In [17]:
df.to_csv(input_dir / '02_cleaned_data.csv', index=False)