# Cleaning Data

This notebook cleans the metadata retrieved with the software Arcas.

In [58]:
import glob
import pandas as pd

In [28]:
files = ['../data/Arxiv.json', '../data/Ieee.json', '../data/Plos.json', '../data/Nature.json', 
         '../data/Springer.json', '../data/bibliography.json']

In [29]:
dfs = []
for filename in files:
    dfs.append(pd.read_json(filename))

In [30]:
df = pd.concat(dfs, ignore_index=True, sort=False)

In [31]:
df.provenance.unique()

array(['arXiv', 'IEEE', 'PLOS', 'Nature', 'Springer', 'Manual'],
      dtype=object)

In [32]:
len(df.title.unique())

3653

In [33]:
len(df.unique_key.unique())

3704

In [34]:
provenance_size = df.groupby(['unique_key', 'provenance']).size().reset_index().groupby('provenance').size()
provenance_size

provenance
IEEE         319
Manual        90
Nature       666
PLOS         416
Springer     334
arXiv       1879
dtype: int64

In [35]:
df = df[~(df['date'] < 1950)]

In [36]:
df.to_json('../data/pd_November_2018.json')

Cleaning authors' names 
----------------------------

The issue with names is that one's name can be written in various ways. This could not have been avoided during data collection, because journals and the authors themselves write names differently.

> *ex. Nikoleta Evdokia Glynatsi, Nikoleta E Glynatsi, N E Glynatsi, N Glynatsi.*

Few efficient approaches to this problem were found. After reviewing several string comparison methods, the Levenshtein distance was chosen as the measure. It is a string metric that counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one sequence into the other ([Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance)).

To compute the difference in Python, the open source library [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) will be used.
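
As a rough illustration of what the metric measures, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence in plain Python. It is not the code that fuzzywuzzy runs internally, and the function name `levenshtein` is an assumption made for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    # Keep the shorter string in `b` so the working rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


# Two lowercased variants of the example name above.
variant_distance = levenshtein("n e glynatsi", "n glynatsi")
```

A small distance relative to the length of the names suggests the two strings may refer to the same person, but it cannot confirm it, so a manual check is still needed.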

In [37]:
df = pd.read_json('../data/pd_November_2018.json')

In [38]:
# First, lowercase every letter in the author strings.
df.author = df.author.str.lower()

In [39]:
from fuzzywuzzy import fuzz
import itertools

In [40]:
import tqdm

We can output the names that are very similar, but the final check has to be done manually.

In [41]:
temp = df

In [42]:
pairs = itertools.combinations(temp.author.unique(), 2)

In [44]:
to_check = []
for i, j in tqdm.tqdm(pairs):
    ratio = fuzz.token_set_ratio(i, j)
    # Keep pairs scoring at least 90 but below a perfect 100 for manual review.
    if ratio >= 90 and ratio != 100:
        to_check.append((i, j))

In [64]:
to_check

[('s.cho', 's.chow'),
 ('d.cole', 'd.coyle'),
 ('p.grossman', 'g.grossman'),
 ('p.grossman', 'r.grossman'),
 ('j.campbell', 'o.campbell'),
 ('j.campbell', 't.campbell'),
 ('y.yamaguchi', 'm.yamaguchi'),
 ('k.kanazawa', 's.kanazawa'),
 ('x.han', 'x.shan'),
 ('d.zhao', 'd.hao'),
 ('t.mori', 't.omori'),
 ('e.anderson', 'j.anderson'),
 ('e.anderson', 'p.anderson'),
 ('e.anderson', 'a.anderson'),
 ('e.anderson', 'd.anderson'),
 ('m.seredynski', 'f.seredynski'),
 ('y.nakashima', 't.nakashima'),
 ('y.nakashima', 'y.nagashima'),
 ('y.nakashima', 'y.kashima'),
 ('y.nakashima', 'h.nakashima'),
 ('j.williams', 'm.williams'),
 ('j.williams', 't.williams'),
 ('j.williams', 'v.williams'),
 ('r.sorensen', 'h.sorensen'),
 ('r.sorensen', 't.sorensen'),
 ('s.schuster', 'm.schuster'),
 ('s.salmi', 's.almi'),
 ('c.backer', 'c.baker'),
 ('k.rudnicki', 'r.rudnicki'),
 ('t.zhou', 't.zhu'),
 ('h.sorensen', 't.sorensen'),
 ('c.huia', 'c.hui'),
 ('j.mendez-naya', 'l.mendez-naya'),
 ('a.kuhn', 'a.kun'),
 ('c.cha

In [11]:
df[df['author'] == 'r.grossman']['title'].unique()

array(['Rationale, design and critical end points for the Riluzole in Acute Spinal Cord Injury Study (RISCIS): a randomized, double-blinded, placebo-controlled parallel multi-center trial'],
      dtype=object)

In [8]:
df[df['author'] == 'd.coyle']['title'].unique()

array(['Summer books'], dtype=object)
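
Once a pair has been confirmed by hand, the variants could be collapsed into one spelling. The sketch below is one possible way to do this and is not a step this notebook performs. The mapping `canonical_names` and the helper `canonicalise` are hypothetical, and the only names treated as the same person are the variants from the example given earlier.

```python
# Hypothetical, manually confirmed mapping: variant spelling -> chosen spelling.
canonical_names = {
    "nikoleta evdokia glynatsi": "nikoleta e glynatsi",
    "n e glynatsi": "nikoleta e glynatsi",
    "n glynatsi": "nikoleta e glynatsi",
}


def canonicalise(name: str, aliases: dict) -> str:
    """Return the confirmed spelling, or the name unchanged if it is unknown."""
    return aliases.get(name, name)


authors = ["n glynatsi", "n e glynatsi", "d.coyle"]
cleaned_authors = [canonicalise(name, canonical_names) for name in authors]
```

On the data frame itself, `df.author.replace(canonical_names)` would apply the same mapping to the whole column.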

Duplicate articles
------------------

In [45]:
table = df.groupby(['title', 'unique_key']).size().reset_index().groupby('title').count()
duplicates = table[table['unique_key']>1]
duplicates

title,unique_key,0
Analyzing coevolutionary games with dynamic fitness landscapes,2,2
Bad Boy of Physics,2,2
Beyond pairwise strategy updating in the prisoner's dilemma game,2,2
Books,21,21
Comparing reactive and memory-one strategies of direct reciprocity,2,2
Computer Recreations,2,2
"Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach",2,2
Do good actions inspire good actions in others?,2,2
Evolution of cooperation driven by zealots,2,2
Excessive abundance of common resources deters social responsibility,2,2


In [46]:
duplicates_title = df[df['title'].isin(duplicates.index)]['title'].unique()

In [47]:
duplicates_in_arxiv = df[(df['title'].isin(duplicates.index)) & (df['provenance'] == 'arXiv')]['title'].unique()

In [48]:
diff = list(set(duplicates_title) - set(duplicates_in_arxiv))

In [49]:
df_without_arxiv = df[~(df['provenance']=='arXiv')]

In [50]:
df_without_arxiv = df_without_arxiv.drop_duplicates(subset='title')

In [51]:
df_without_arxiv.to_json('../data/pd_November_2018_without_arxiv.json')

**Drop duplicates.**

In [52]:
articles_to_drop = df[(df['title'].isin(duplicates.index)) & (df['provenance']=='arXiv')]['unique_key'].unique()

In [53]:
df = df[~df['unique_key'].isin(articles_to_drop)]

In [54]:
df = df.drop_duplicates(subset='title')

In [56]:
len(df['title'].unique()), len(df['unique_key'].unique())

(3637, 3637)
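
The duplicate handling above can be sketched on a toy table to make each step easier to follow. The frame `toy` and its values are invented for illustration, and `nunique` is used here as a shorter equivalent of the two-step `groupby`; this is a sketch under those assumptions, not the notebook's own cell.

```python
import pandas as pd

# Invented example: title "A" exists both on arXiv and in a journal,
# and title "C" has two rows that share the same key.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C"],
    "unique_key": ["k1", "k2", "k3", "k4", "k4"],
    "provenance": ["arXiv", "Nature", "arXiv", "PLOS", "PLOS"],
})

# Titles that appear under more than one unique key.
keys_per_title = toy.groupby("title")["unique_key"].nunique()
duplicated_titles = keys_per_title[keys_per_title > 1].index

# For those titles, drop every row belonging to the arXiv copy.
is_arxiv_copy = toy["title"].isin(duplicated_titles) & (toy["provenance"] == "arXiv")
keys_to_drop = toy.loc[is_arxiv_copy, "unique_key"].unique()
cleaned = toy[~toy["unique_key"].isin(keys_to_drop)]

# Keep only the first remaining row for each title.
cleaned = cleaned.drop_duplicates(subset="title")
```

Note that `drop_duplicates(subset="title")` keeps a single row per title, so any other rows that share that title are discarded as well.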

**Export clean json.**

In [57]:
df.to_json('../data/pd_November_2018_clean.json')