# Cleaning Data

This notebook cleans the article metadata retrieved with the software Arcas.

In [1]:
import glob
import pandas as pd

In [2]:
import json
import arcas

In [3]:
def normalise_names(s):
    """Abbreviate every name except the last to an initial,
    e.g. 'nikoleta e glynatsi' becomes 'N.E.Glynatsi'."""
    # split the string into a list of name parts
    parts = s.split()
    new = ""

    # add the capitalised first character of each leading part
    for part in parts[:-1]:
        new += part[0].upper() + '.'

    # keep the last part in full, in title case
    new += parts[-1].title()

    return new

In [4]:
raw_articles = []
for filename in glob.glob('../raw_data/*_Springer_*.json'):
    with open(filename) as json_data:
        d = json.load(json_data)
        raw_articles.append(d)

In [5]:
flat_list = [item for sublist in raw_articles for item in sublist]

In [6]:
api =  arcas.Springer()
articles = []
for art in flat_list:
    articles.append(api.to_dataframe(art))

In [7]:
dataframe = pd.concat(articles, ignore_index=True)

In [8]:
dataframe = dataframe[~(dataframe['author']=='No authors found for this document.')]

In [9]:
# dataframe = dataframe[~(dataframe['author']==None)]

In [15]:
names = dataframe.author

In [16]:
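# Split each name at the first space and rejoin the two parts.
# Names that contain no space raise a ValueError here.
edited = []
for name in names:
    first, last = name.split(' ', 1)
    edited.append(first + ' ' + last)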

In [17]:
edited_names = [normalise_names(name) for name in edited]

In [18]:
dataframe.author = edited_names

In [19]:
dataframe.head()

|   | url | key | unique_key | title | author | abstract | doi | date | journal | provenance | category | score | open_access |
|---|-----|-----|------------|-------|--------|----------|-----|------|---------|------------|----------|-------|-------------|
| 0 | http://dx.doi.org/10.1007/978-3-319-94511-8_6 | Michele2019 | b25dc6c833044fc195ffb5ced9674508 | Using Photovoice with Ex-prisoners: An Exemplar | J.Michele | This chapter is drawn from the author’s Photov... | 10.1007/978-3-319-94511-8_6 | 2019 | Photovoice Handbook for Social Workers | Springer | Not available | Not available | False |
| 1 | http://dx.doi.org/10.1038/s41598-018-34116-0 | Yu’e2018 | 0ed2b091cdccbc34e928f3f1d6c7209b | Environment-based preference selection promote... | W.Yu’E | The impact of environment on individuals is pa... | 10.1038/s41598-018-34116-0 | 2018 | Scientific Reports | Springer | Not available | Not available | True |
| 2 | http://dx.doi.org/10.1038/s41598-018-34116-0 | Yu’e2018 | 0ed2b091cdccbc34e928f3f1d6c7209b | Environment-based preference selection promote... | Z.Shuhua | The impact of environment on individuals is pa... | 10.1038/s41598-018-34116-0 | 2018 | Scientific Reports | Springer | Not available | Not available | True |
| 3 | http://dx.doi.org/10.1038/s41598-018-34116-0 | Yu’e2018 | 0ed2b091cdccbc34e928f3f1d6c7209b | Environment-based preference selection promote... | Z.Zhipeng | The impact of environment on individuals is pa... | 10.1038/s41598-018-34116-0 | 2018 | Scientific Reports | Springer | Not available | Not available | True |
| 4 | http://dx.doi.org/10.1007/s11277-018-5328-y | Jiayu2018 | 880f765582be433336c8dcc5b2f069a5 | Trust Degree can Preserve Community Structure ... | Z.Jiayu | Community structure is one of the most ubiquit... | 10.1007/s11277-018-5328-y | 2018 | Wireless Personal Communications | Springer | Not available | Not available | False |


In [20]:
api.export(dataframe, '../Springer.json')

In [21]:
data = []
for filename in glob.glob('../*.json'):
    data.append(pd.read_json(filename))

In [23]:
df = pd.concat(data, ignore_index=True, sort=False)

In [25]:
df.provenance.unique()

array(['Springer', 'IEEE', 'PLOS', 'Nature', 'arXiv'], dtype=object)

In [26]:
len(df.title.unique())

442

In [28]:
len(df.unique_key.unique())

448

In [30]:
d = pd.read_json('../Nature.json')

In [31]:
len(d.title.unique())

13

Cleaning authors' names 
----------------------------

The issue with names is that a person's name can be written in various ways. This could not have been avoided during data collection, because journals and the authors themselves write names differently.

> *ex. Nikoleta Evdokia Glynatsi, Nikoleta E Glynatsi, N E Glynatsi, N Glynatsi.*
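
As a rough illustration of how far simple normalisation gets with these variants, the sketch below reduces each one to initials plus surname. The helper `to_initials` is a hypothetical name introduced here; it follows the same idea as `normalise_names` earlier in this notebook and assumes the surname is the last whitespace-separated token.

```python
def to_initials(name):
    """Reduce every given name to an initial and keep the surname.

    Hypothetical helper for illustration; assumes the surname is the
    last whitespace-separated token of the name.
    """
    parts = name.split()
    if not parts:
        return ""
    initials = "".join(part[0].upper() + "." for part in parts[:-1])
    return initials + parts[-1].title()


variants = [
    "Nikoleta Evdokia Glynatsi",
    "Nikoleta E Glynatsi",
    "N E Glynatsi",
    "N Glynatsi",
]

# Collect the distinct normalised forms of the four variants.
normalised = sorted({to_initials(v) for v in variants})
print(normalised)  # ['N.E.Glynatsi', 'N.Glynatsi']
```

Three of the four variants collapse to the same string, but the variant without a middle initial stays distinct, so exact matching on normalised names would still miss it.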

Few efficient ways of addressing this problem were found. After a search of string comparison methods, the Levenshtein distance was chosen as the measure. The [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) is a string metric for the difference between two sequences: the minimum number of single-character insertions, deletions or substitutions needed to turn one string into the other.
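
To make the measure concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The function name `levenshtein` is introduced here for illustration only; it is not the implementation used by any particular library.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))         # 3
print(levenshtein("n e glynatsi", "n glynatsi"))  # 2
```

A small distance relative to the length of the names suggests that two strings may refer to the same author, although short names can be close to each other by coincidence.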

To compute the difference in Python, the open source library [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) is used.

In [6]:
df = pd.read_json('../data/data_nov_2017.json')

In [7]:
# Lowercase all author names before comparing them.
df.author = df.author.str.lower()

In [8]:
from fuzzywuzzy import fuzz

We can output the names that are very similar, but the final check has to be done manually.

In [13]:
temp = df

In [14]:
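# Compare every ordered pair of author names and print the pairs that
# score at least 85 but not 100.
authors = temp.author.unique()
for i in authors:
    for j in authors:
        ratio = fuzz.token_set_ratio(i, j)
        if ratio >= 85 and ratio != 100:
            print(i, j)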

yilei wang wang  yiling
yilei wang lei wang
zhe liu liu zheng
cong li rong li
long wang jinlong wang
he tao shen tao
lin wang lin yang
lin wang xin wang
lin wang lei wang
lin wang wang  l.
xiaofan wang xiaoyang wang
xiaofan wang xiaofeng wang
li yang lin yang
li yang hang li
li yang yi  yang
li yang yang yi
lei xue lei xu
chen shen chen  zhen
chen shen zhen chen
jianwei huang jianwei wang
mei sun min sun
mei sun sun  min
stefan schauer schuster  stefan
lin yang lin wang
lin yang li yang
richard y. chen y. -z. chen
shiwei zhang wei zhang
hui li hui lin
hui lin hui li
xiuzhen cheng xiuzhen feng
chen  michael z. q. y. -z. chen
anatol rapoport m. rapoport
francisco c. santos m. santos
jinlong wang long wang
jinlong wang zhang jinlong
jinlong wang wang  long
zhen wang yang  zhen
jianwei wang jianwei huang
hang li li yang
y. -z. chen richard y. chen
y. -z. chen chen  michael z. q.
yan zhang yang zhang
yan zhang yanfu zhang
victor m. eguluz victor m. eguiluz
qiang li xiang li
ling jing jing l
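
Because the loop above visits every ordered pair, most matches are printed twice, once in each order. The sketch below compares each unordered pair once and returns the candidates sorted by score, which may be easier to review by hand. The names `similar_pairs` and `difflib_ratio` are hypothetical, and the default scorer is a stand-in built on the standard library's `difflib.SequenceMatcher`, so its scores will differ from those of `fuzz.token_set_ratio`; passing `scorer=fuzz.token_set_ratio` would use the measure from this notebook instead.

```python
from difflib import SequenceMatcher
from itertools import combinations


def difflib_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (not fuzzywuzzy's token_set_ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(names, scorer=difflib_ratio, threshold=85):
    """Return (score, name_a, name_b) for each unordered pair of distinct
    names whose score is at least `threshold` but below 100."""
    unique_names = sorted(set(names))
    pairs = []
    for a, b in combinations(unique_names, 2):
        score = scorer(a, b)
        if threshold <= score < 100:
            pairs.append((score, a, b))
    # Highest scores first, so the most likely matches are reviewed first.
    return sorted(pairs, reverse=True)


# A few names taken from the output above.
authors = [
    "lei xue",
    "lei xu",
    "victor m. eguluz",
    "victor m. eguiluz",
    "anatol rapoport",
]
result = similar_pairs(authors)
for score, a, b in result:
    print(f"{score:3d}  {a!r}  {b!r}")
```

Printing the names with `repr` also makes doubled spaces visible, which the plain `print(i, j)` output hides.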

In [13]:
df[df['author'] == 'yang  zhen']['title'].unique()

array([ 'Spectrum sharing in iterated Prisoner’s Dilemma game based on evolutionary strategies for Cognitive Radios'], dtype=object)

Duplicate articles
------------------

In [10]:
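# Count the distinct unique_key values recorded under each title.
table = df.groupby(['title', 'unique_key']).size().reset_index().groupby('title').count()
# Keep the titles that appear under exactly two different keys.
duplicates = table[table['unique_key']==2]
duplicates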

| title | unique_key | 0 |
|-------|------------|---|
| Beyond pairwise strategy updating in the prisoner's dilemma game | 2 | 2 |
| Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach | 2 | 2 |
| Human behavior in Prisoner's Dilemma experiments suppresses network reciprocity | 2 | 2 |
| Playing a quantum game on polarization vortices | 2 | 2 |
| The Art of War: Beyond Memory-one Strategies in Population Games | 2 | 2 |
| The Prisoner’s Dilemma | 2 | 2 |


In [11]:
len(duplicates)

6
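
The same check can be written more directly with `nunique`. The sketch below runs on a small toy frame whose titles, keys and provenances are made up for illustration; only the column names match this notebook. It flags titles with more than one key, rather than exactly two, so a title collected three or more times would also be caught.

```python
import pandas as pd

# Toy data with hypothetical values, one row per (article, author).
toy = pd.DataFrame(
    {
        "title": ["Paper A", "Paper A", "Paper A", "Paper B", "Paper C", "Paper C"],
        "unique_key": ["k1", "k1", "k2", "k3", "k4", "k4"],
        "provenance": ["arXiv", "arXiv", "PLOS", "IEEE", "Nature", "Nature"],
    }
)

# Count the distinct keys recorded under each title.
keys_per_title = toy.groupby("title")["unique_key"].nunique()

# A title with more than one key was collected more than once.
duplicated_titles = keys_per_title[keys_per_title > 1]
print(duplicated_titles)
```

Here "Paper C" has two rows but a single key, so it is not flagged: repeated rows for the co-authors of one article are expected and are not duplicates.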

In [13]:
df[df['title'].isin(duplicates.index)].head()

|   | abstract | author | date | journal | key | key_word | labels | list_strategies | pages | provenance | read | score | title | unique_key |
|---|----------|--------|------|---------|-----|----------|--------|-----------------|-------|------------|------|-------|-------|------------|
| 10248 | Cooperation is of utmost importance to society... | anders johansson | 2010 | Helbing, D and Johansson, A (2010) Cooperation... | Helbing2010 |  |  |  |  | arXiv |  |  | Cooperation, Norms, and Revolutions: A Unified... | f7ef3626edc9fb376e5703f804b31d9f |
| 10249 | Cooperation is of utmost importance to society... | dirk helbing | 2010 | Helbing, D and Johansson, A (2010) Cooperation... | Helbing2010 |  |  |  |  | arXiv |  |  | Cooperation, Norms, and Revolutions: A Unified... | f7ef3626edc9fb376e5703f804b31d9f |
| 1087 | The model of the subject with reflexion allows... | lefebvre a. vladimir | 2001 | Algebra of Conscience | Lefebvre2001 |  |  |  |  | Springer |  |  | The Prisoner’s Dilemma | 1bbb5d8a9929baf02be426da7daf5b29 |
| 2130 | The quantum mechanical approach to the well kn... | a. g. m. schmidt | 2013 | arXiv | Pinheiro2013 |  |  |  |  | arXiv |  |  | Playing a quantum game on polarization vortices | 37ab4593323d0cf0901a71416ff5876c |
| 2131 | The quantum mechanical approach to the well kn... | a. r. c. pinheiro | 2013 | arXiv | Pinheiro2013 |  |  |  |  | arXiv |  |  | Playing a quantum game on polarization vortices | 37ab4593323d0cf0901a71416ff5876c |


**Provenance of duplicates.**

In [37]:
for tlt in duplicates.index:
    print(tlt, df[df['title'] == tlt]['provenance'].unique())

Beyond pairwise strategy updating in the prisoner's dilemma game ['Nature' 'arXiv']
Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach ['arXiv' 'PLOS']
Human behavior in Prisoner's Dilemma experiments suppresses network reciprocity ['Nature' 'arXiv']
Playing a quantum game on polarization vortices ['arXiv' 'IEEE']
The Art of War: Beyond Memory-one Strategies in Population Games ['PLOS' 'arXiv']
The Prisoner’s Dilemma ['Springer']


**Drop duplicates.**

In [38]:
articles_to_drop = df[(df['title'].isin(duplicates.index)) & (df['provenance']=='arXiv')]['unique_key'].unique()
articles_to_drop

array(['f7ef3626edc9fb376e5703f804b31d9f',
       '37ab4593323d0cf0901a71416ff5876c',
       '7e64918889fd1a9be63d428604057056',
       '9f7bb1dc93e57eb0a938eacdba9b6231',
       'e39363f9882c617dbf6f0cc1e1a448dc'], dtype=object)

In [39]:
df = df[~df['unique_key'].isin(articles_to_drop)]

In [40]:
df = df[~df['unique_key'].isin(['d25332adc4378bb2320319c6007decf3', 'e45e8a6e0e7738f987f86e45f71db931'])]

In [41]:
len(df['title'].unique()), len(df['unique_key'].unique())

(1142, 1143)

**Export clean JSON.**

In [42]:
df.to_json('../data/data_nov_2017_clean.json')