# 0.1. Cleaning Data

In [1]:
import glob
import pandas as pd

In [None]:
# raw articles retrieved with Arcas
dfs = []
for filename in glob.glob('../data/articles/*.json'):
    dfs.append(pd.read_json(filename))

In [None]:
df = pd.concat(dfs, ignore_index=True)

In [None]:
# the concatenated records have been manually inspected and cleaned

In [None]:
df.to_json('../data/data.json')

Cleaning authors' names 
----------------------------

The issue with names is that one's name can be written in various ways. This could not have been avoided during data collection, because journals and the authors themselves write names differently.

> *ex. Nikoleta Evdokia Glynatsi, Nikoleta E Glynatsi, N E Glynatsi, N Glynatsi.*

Few efficient ways of addressing this problem were found. After surveying several string comparison methods, the Levenshtein distance was chosen as the measure. The Levenshtein distance is a string metric for the difference between two sequences: the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one into the other ([Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance)).
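To make the metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard row-by-row dynamic programme. The function name `levenshtein` is illustrative and is not part of the notebook; fuzzywuzzy reports normalised similarity scores between 0 and 100 rather than this raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Two spellings of the same author from the example above.
print(levenshtein("nikoleta e glynatsi", "n e glynatsi"))  # 7
```

The two spellings differ only by the seven deleted characters of the first name, which is why abbreviated names can still be far apart under the raw distance.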

To compute the difference in Python, the open source library [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) is used.

In [32]:
df = pd.read_json('../data/data_nov_2017.json')

In [33]:
# First, lowercase all letters in the author strings.
df.author = df.author.str.lower()

In [18]:
from fuzzywuzzy import fuzz



We can output the names that are very similar, but the final check has to be done manually.

In [None]:
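# compare every pair of author names; token_set_ratio ignores
# word order and repeated tokens
authors = df.author.unique()
for i in authors:
    for j in authors:
        ratio = fuzz.token_set_ratio(i, j)
        # keep close matches, skip names that score as identical
        if 85 <= ratio < 100:
            print(i, j)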
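Once the printed pairs have been checked by hand, the confirmed variants can be collapsed to a single spelling. The sketch below shows one way to do this with `Series.replace`; the toy frame and the `canonical` mapping are hypothetical and are not taken from the notebook's data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real author column.
toy = pd.DataFrame(
    {"author": ["nikoleta e glynatsi", "n glynatsi", "dirk helbing"]}
)

# Hypothetical mapping written by hand after the manual check:
# variant spelling -> spelling to keep.
canonical = {"n glynatsi": "nikoleta e glynatsi"}

# Exact-match replacement; names not in the mapping are left unchanged.
toy["author"] = toy["author"].replace(canonical)
print(toy["author"].unique())
```

Keeping the mapping as an explicit dictionary records every manual decision, so the cleaning step can be rerun on a fresh download.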

Duplicate articles
------------------

In [34]:
table = df.groupby(['title', 'unique_key']).size().reset_index().groupby('title').count()
duplicates = table[table['unique_key']==2]
duplicates

title,unique_key,0
Beyond pairwise strategy updating in the prisoner's dilemma game,2,2
"Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach",2,2
Human behavior in Prisoner's Dilemma experiments suppresses network reciprocity,2,2
Playing a quantum game on polarization vortices,2,2
The Art of War: Beyond Memory-one Strategies in Population Games,2,2
The Prisoner’s Dilemma,2,2


In [35]:
len(duplicates)

6
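The count of distinct keys per title can also be obtained directly with `nunique`. This is a sketch on hypothetical toy data, not the notebook's method; note that it filters on more than one key, so unlike the `== 2` filter above it would also catch a title with three or more keys.

```python
import pandas as pd

# Hypothetical toy data: one row per (article, author), as in the real frame.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "unique_key": ["k1", "k1", "k2", "k3", "k3"],
    "provenance": ["arXiv", "arXiv", "PLOS", "Nature", "Nature"],
})

# Number of distinct article keys recorded under each title.
keys_per_title = toy.groupby("title")["unique_key"].nunique()

# Titles that appear under more than one key are candidate duplicates.
repeated = keys_per_title[keys_per_title > 1]
print(repeated)
```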

In [36]:
df[df['title'].isin(duplicates.index) ]

index,abstract,author,date,journal,key,key_word,labels,list_strategies,pages,provenance,read,score,title,unique_key
10248,Cooperation is of utmost importance to society...,anders johansson,2010,"Helbing, D and Johansson, A (2010) Cooperation...",Helbing2010,,,,,arXiv,,,"Cooperation, Norms, and Revolutions: A Unified...",f7ef3626edc9fb376e5703f804b31d9f
10249,Cooperation is of utmost importance to society...,dirk helbing,2010,"Helbing, D and Johansson, A (2010) Cooperation...",Helbing2010,,,,,arXiv,,,"Cooperation, Norms, and Revolutions: A Unified...",f7ef3626edc9fb376e5703f804b31d9f
1087,The model of the subject with reflexion allows...,lefebvre a. vladimir,2001,Algebra of Conscience,Lefebvre2001,,,,,Springer,,,The Prisoner’s Dilemma,1bbb5d8a9929baf02be426da7daf5b29
2130,The quantum mechanical approach to the well kn...,a. g. m. schmidt,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
2131,The quantum mechanical approach to the well kn...,a. r. c. pinheiro,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
2132,The quantum mechanical approach to the well kn...,a. z. khoury,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
2133,The quantum mechanical approach to the well kn...,c. e. r. souza,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
2134,The quantum mechanical approach to the well kn...,d. p. caetano,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
2135,The quantum mechanical approach to the well kn...,j. a. o. huguenin,2013,arXiv,Pinheiro2013,,,,,arXiv,,,Playing a quantum game on polarization vortices,37ab4593323d0cf0901a71416ff5876c
3777,The quantum mechanical approach to the well kn...,a. g. m. schmidt,2013,CLEO: 2013,Pinheiro2013,Game theory,,,1-2,IEEE,,,Playing a quantum game on polarization vortices,5df4649dc7e744bf412f6cc5b05241a5


In [37]:
for tlt in duplicates.index:
    print(tlt, df[df['title'] == tlt]['provenance'].unique())

Beyond pairwise strategy updating in the prisoner's dilemma game ['Nature' 'arXiv']
Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach ['arXiv' 'PLOS']
Human behavior in Prisoner's Dilemma experiments suppresses network reciprocity ['Nature' 'arXiv']
Playing a quantum game on polarization vortices ['arXiv' 'IEEE']
The Art of War: Beyond Memory-one Strategies in Population Games ['PLOS' 'arXiv']
The Prisoner’s Dilemma ['Springer']


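**Drop duplicates.** Where a duplicated title has an arXiv record, the arXiv record is dropped and the other record is kept.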

In [38]:
articles_to_drop = df[(df['title'].isin(duplicates.index)) & (df['provenance']=='arXiv')]['unique_key'].unique()
articles_to_drop

array(['f7ef3626edc9fb376e5703f804b31d9f',
       '37ab4593323d0cf0901a71416ff5876c',
       '7e64918889fd1a9be63d428604057056',
       '9f7bb1dc93e57eb0a938eacdba9b6231',
       'e39363f9882c617dbf6f0cc1e1a448dc'], dtype=object)

In [39]:
df = df[~df['unique_key'].isin(articles_to_drop)]
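The filter above relies on every duplicated arXiv title also having a record from another source. A more defensive variant, sketched here on hypothetical toy data, drops an arXiv row only when the same title also appears with a non-arXiv provenance; the names `toy`, `has_published_copy` and `kept` are illustrative.

```python
import pandas as pd

# Hypothetical toy data: title A has a published copy, title B exists
# only as two arXiv records, title C is a single arXiv record.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C"],
    "unique_key": ["k1", "k2", "k3", "k4", "k5"],
    "provenance": ["arXiv", "PLOS", "arXiv", "arXiv", "arXiv"],
})

is_arxiv = toy["provenance"] == "arXiv"

# For each row: does any row with the same title come from a non-arXiv source?
has_published_copy = (~is_arxiv).groupby(toy["title"]).transform("any")

# Drop the arXiv row only when a published copy of the same title exists.
kept = toy[~(is_arxiv & has_published_copy)]
print(kept["unique_key"].tolist())  # ['k2', 'k3', 'k4', 'k5']
```

With this guard, titles that exist only on arXiv are never removed entirely, at the cost of leaving arXiv-only duplicates for a separate manual check.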

In [40]:
# drop two further records, identified by their unique_key
df = df[~df['unique_key'].isin(['d25332adc4378bb2320319c6007decf3', 'e45e8a6e0e7738f987f86e45f71db931'])]

In [41]:
len(df['title'].unique()), len(df['unique_key'].unique())

(1142, 1143)

In [42]:
df.to_json('../data/data_nov_2017_clean.json')