# Cleaning Data

This notebook cleans the metadata retrieved with the software Arcas.

In [1]:
import glob
import pandas as pd

In [2]:
# raw articles retrieved with Arcas
dfs = []
for filename in glob.glob('../data/articles/*.json'):
    dfs.append(pd.read_json(filename))

In [3]:
df = pd.concat(dfs, ignore_index=True)

In [5]:
# these records have been manually inspected and cleaned

In [6]:
df.to_json('../data/data.json')

Cleaning authors' names 
----------------------------

The issue with names is that one's name can be written in various ways. This could not have been avoided during data collection, because journals and the authors themselves write names differently.

> *ex. Nikoleta Evdokia Glynatsi, Nikoleta E Glynatsi, N E Glynatsi, N Glynatsi.*

Few efficient ways of addressing the problem were found. After a search over various methods of string comparison, the Levenshtein distance was chosen as the measure. The [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) is a string metric that measures the difference between two sequences as the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one into the other.
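To make the metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic programming recurrence. The function name `levenshtein` is chosen for illustration only; it is not the routine that fuzzywuzzy runs internally.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("jorge pena", "jorge peña"))                    # 1
print(levenshtein("nikoleta e glynatsi", "n e glynatsi"))         # 7
```

A small distance relative to the length of the names suggests that two strings may refer to the same author, although abbreviated first names can still produce large distances.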

To compute the difference in Python, the open source library [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) will be used.

In [7]:
df = pd.read_json('../data/data_nov_2017.json')

In [4]:
# Initially, all letters in the author strings are lowercased.
df.author = df.author.str.lower()

In [5]:
from fuzzywuzzy import fuzz

We can output the names that are very similar, but the final check has to be done manually.

In [8]:
temp = df[1000:2000]

In [9]:
# print every ordered pair of names that are similar but not identical
authors = temp.author.unique()
for i in authors:
    for j in authors:
        ratio = fuzz.token_set_ratio(i, j)
        if ratio >= 85 and ratio != 100:
            print(i, j)

jorge pena jorge peña
li chen lin chen
gyorgy szabo györgy szabó
györgy szabó gyorgy szabo
matjaz perc matjaž perc
christopher griffin christopher lee
j. a. cuesta josé a. cuesta
zhen wang yang  zhen
künneth  christopher christopher lee
 stephen e. stephen g. z. smith
lin chen li chen
stephen g. z. smith  stephen e.
christopher lee christopher griffin
christopher lee künneth  christopher
y. -h. wang z. y. wang
y. -h. wang p. y. wang
josé a. cuesta j. a. cuesta
matjaž perc matjaz perc
z. y. wang y. -h. wang
z. y. wang p. y. wang
p. y. wang y. -h. wang
p. y. wang z. y. wang
yang  zhen zhen wang
jorge peña jorge pena
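Several of the pairs above differ only by accents, and every pair is printed in both orders. The sketch below is one possible way to shorten the list before the manual check; it is not part of the original workflow. It uses only the standard library: `unicodedata` to strip accents and `difflib.SequenceMatcher` as a stand-in for `fuzz.token_set_ratio` (its scores are not the same as fuzzywuzzy's). The helper names `strip_accents` and `candidate_pairs` and the 0.85 threshold are assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def strip_accents(name):
    """Return name with combining accent marks removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def candidate_pairs(names, threshold=0.85):
    """Return each unordered pair of distinct names scoring at least threshold."""
    # Names that differ only by accents collapse to a single entry here.
    unique_names = sorted({strip_accents(n) for n in names})
    return [
        (a, b)
        for a, b in combinations(unique_names, 2)  # each pair visited once
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]


names = ["jorge pena", "jorge peña", "li chen", "lin chen",
         "matjaz perc", "matjaž perc"]
pairs = candidate_pairs(names)
print(pairs)
```

With these six names, only the pair `('li chen', 'lin chen')` is left to check by hand. Stripping accents can also merge genuinely different people, so the manual check is still needed.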


In [13]:
df[df['author'] == 'yang  zhen']['title'].unique()

array([ 'Spectrum sharing in iterated Prisoner’s Dilemma game based on evolutionary strategies for Cognitive Radios'], dtype=object)

Duplicate articles
------------------

In [10]:
table = df.groupby(['title', 'unique_key']).size().reset_index().groupby('title').count()
duplicates = table[table['unique_key']==2]
duplicates

title,unique_key,0
Beyond pairwise strategy updating in the prisoner's dilemma game,2,2
"Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach",2,2
Human behavior in Prisoner's Dilemma experiments suppresses network reciprocity,2,2
Playing a quantum game on polarization vortices,2,2
The Art of War: Beyond Memory-one Strategies in Population Games,2,2
The Prisoner’s Dilemma,2,2


In [11]:
len(duplicates)

6
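The double `groupby` above counts how many distinct `unique_key` values each title has. An equivalent and perhaps more direct formulation, shown here on a made-up toy frame, uses `nunique`. The frame `toy` and its values are hypothetical, and the sketch flags titles with more than one key rather than exactly two.

```python
import pandas as pd

# Toy records, one row per (article, author) as in the notebook's data.
# All values are invented for illustration.
toy = pd.DataFrame({
    "title": ["Paper A", "Paper A", "Paper A", "Paper B"],
    "unique_key": ["k1", "k1", "k2", "k3"],
    "provenance": ["arXiv", "arXiv", "Nature", "PLOS"],
})

# Number of distinct keys recorded for each title.
keys_per_title = toy.groupby("title")["unique_key"].nunique()

# Titles that map to more than one key are candidate duplicates.
duplicate_titles = keys_per_title[keys_per_title > 1].index.tolist()
print(duplicate_titles)
```

Here "Paper A" is flagged because it appears under the keys `k1` and `k2`, while "Paper B" has a single key.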

In [13]:
df[df['title'].isin(duplicates.index)].head()

index,abstract,author,date,journal,key,key_word,labels,list_strategies,pages,provenance,read,score,title,unique_key
10248,Cooperation is of utmost importance to society...,anders johansson,2010,"Helbing, D and Johansson, A (2010) Cooperation...",Helbing2010,,,,,arXiv,,,"Cooperation, Norms, and Revolutions: A Unified...",f7ef3626edc9fb376e5703f804b31d9f
10249,Cooperation is of utmost importance to society...,dirk helbing,2010,"Helbing, D and Johansson, A (2010) Cooperation...",Helbing2010,,,,,arXiv,,,"Cooperation, Norms, and Revolutions: A Unified...",f7ef3626edc9fb376e5703f804b31d9f
1087,The model of the subject with reflexion allows...,lefebvre a. vladimir,2001,Algebra of Conscience,Lefebvre2001,,,,,Springer,,,The Prisoner’s Dilemma,1bbb5d8a9929baf02be426da7daf5b29
2130,The quantum mechanical approach to the well kn...,a. g. m. schmidt,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
2131,The quantum mechanical approach to the well kn...,a. r. c. pinheiro,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c


**Provenance of duplicates.**

In [37]:
for tlt in duplicates.index:
    print(tlt, df[df['title'] == tlt]['provenance'].unique())

Beyond pairwise strategy updating in the prisoner's dilemma game ['Nature' 'arXiv']
Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach ['arXiv' 'PLOS']
Human behavior in Prisoner's Dilemma experiments suppresses network reciprocity ['Nature' 'arXiv']
Playing a quantum game on polarization vortices ['arXiv' 'IEEE']
The Art of War: Beyond Memory-one Strategies in Population Games ['PLOS' 'arXiv']
The Prisoner’s Dilemma ['Springer']


**Drop duplicates.**

In [38]:
articles_to_drop = df[(df['title'].isin(duplicates.index)) & (df['provenance']=='arXiv')]['unique_key'].unique()
articles_to_drop

array(['f7ef3626edc9fb376e5703f804b31d9f',
       '37ab4593323d0cf0901a71416ff5876c',
       '7e64918889fd1a9be63d428604057056',
       '9f7bb1dc93e57eb0a938eacdba9b6231',
       'e39363f9882c617dbf6f0cc1e1a448dc'], dtype=object)

In [39]:
df = df[~df['unique_key'].isin(articles_to_drop)]

In [40]:
df = df[~df['unique_key'].isin(['d25332adc4378bb2320319c6007decf3', 'e45e8a6e0e7738f987f86e45f71db931'])]

In [41]:
len(df['title'].unique()), len(df['unique_key'].unique())

(1142, 1143)

**Export clean json.**

In [42]:
df.to_json('../data/data_nov_2017_clean.json')