General Analysis
===================

This is one of a series of notebooks supporting my literature review of the iterated prisoner's dilemma.

This section follows a circumstantial review of the prisoner's dilemma timeline
conducted by the authors. It analyses that timeline using a large dataset of
metadata from articles on the prisoner's dilemma.

Using various machine learning techniques, the number of articles published and
the topics researched within the field over the years are discussed. Moreover, we
explore the connections between the authors who have worked on the game using
network theory.


Data collection
-----------------

Academic articles are accessible through scholarly databases and collections of
journals. Several databases and collections today offer access through an open
API. An API (application programming interface) allows users to talk
directly to the database, skipping the user interface of a journal.
Interacting with an API has two phases:


- requesting;
- receiving;


The requesting phase consists of composing a URL that carries the request.
The head of the URL is the address of the API and the tail holds the search
arguments, for example that the word 'prisoner' must exist within the title. The
address of the API and the search arguments themselves differ from journal to
journal, so different journals can require completely different request URLs.
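
As a minimal sketch of the requesting phase, the snippet below joins an address with encoded search arguments using only the standard library. The address `https://api.example.org/search` and the argument names `query` and `max_results` are hypothetical; each real API defines its own.

```python
from urllib.parse import urlencode

# Hypothetical API address, for illustration only.
API_ADDRESS = "https://api.example.org/search"


def build_request_url(address, field, term, max_results=10):
    """Join the API address (head) with the encoded search arguments (tail)."""
    # The argument names below are assumptions; real APIs use their own.
    arguments = {"query": "{}:{}".format(field, term), "max_results": max_results}
    return "{}?{}".format(address, urlencode(arguments))


url = build_request_url(API_ADDRESS, "title", "prisoner")
print(url)
```

Because the argument names live in one place, supporting another journal would mostly mean swapping the address and the `arguments` dictionary.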

The receiving phase consists of receiving the raw metadata of the articles that
satisfied the request. The response is commonly in XML format, but here too the
number of features and the syntax of the XML file differ from journal to
journal.
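
The receiving phase can be sketched in a similar way. The response below is made up, and the tag names (`results`, `article`, `title`, `author`, `year`) are assumptions chosen for illustration; a real response would need its own mapping.

```python
import xml.etree.ElementTree as ET

# A made-up response; real APIs use their own tag names and nesting.
RAW_RESPONSE = """
<results>
  <article>
    <title>Example title one</title>
    <author>A. Author</author>
    <author>B. Author</author>
    <year>2015</year>
  </article>
  <article>
    <title>Example title two</title>
    <author>C. Author</author>
    <year>2016</year>
  </article>
</results>
"""


def parse_articles(raw_xml):
    """Turn the raw XML into a list of dictionaries, one per article."""
    root = ET.fromstring(raw_xml.strip())
    articles = []
    for node in root.findall("article"):
        articles.append({
            "title": node.findtext("title"),
            "authors": [author.text for author in node.findall("author")],
            "year": int(node.findtext("year")),
        })
    return articles


articles = parse_articles(RAW_RESPONSE)
print(len(articles), articles[0]["authors"])
```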

Data collection is a crucial procedure. We wanted to include a large number of
articles from various journals for the analysis to be objective. Moreover, we
wanted the data to be collected within a short period of time. For these reasons
an open source library was developed for the purpose of this work. The library
is called Arcas, and though the package itself will not be analysed here, the
source code can be found at https://github.com/Nikoleta-v3/Arcas.

Arcas serves as a translator between us and the various APIs. More specifically, it
works with five different journals. For Arcas to collect data, a series
of keywords had to be specified. Each keyword is checked individually for whether
it exists within the title or the abstract of an article. An article is collected
only if this check is satisfied.
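
The keyword check can be illustrated with a short sketch. This is not the Arcas implementation: the function names are hypothetical, and the case-insensitive comparison and the "at least one keyword" rule are our assumptions.

```python
def matches_keyword(article, keyword):
    """Return True if the keyword occurs in the title or the abstract."""
    # Case-insensitive matching is an assumption, not confirmed by the source.
    keyword = keyword.lower()
    title = (article.get("title") or "").lower()
    abstract = (article.get("abstract") or "").lower()
    return keyword in title or keyword in abstract


def collect(articles, keywords):
    """Keep the articles that match at least one of the keywords."""
    return [a for a in articles if any(matches_keyword(a, k) for k in keywords)]


# Made-up records used only to exercise the check.
sample = [
    {"title": "The Prisoner's Dilemma revisited", "abstract": None},
    {"title": "A study of auctions", "abstract": "We model bidding behaviour."},
    {"title": "Cooperation in repeated games",
     "abstract": "An iterated prisoner's dilemma tournament."},
]
collected = collect(sample, ["prisoner"])
print([a["title"] for a in collected])
```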


In this notebook:

- 1. General Analysis

A general analysis describing the data set is carried out. 

In [24]:
import pandas as pd
import numpy as np

The open source Python library [pandas](http://pandas.pydata.org/) will be used throughout this article for the analysis.

In [25]:
df = pd.read_json('../data/data_nov_2017.json')

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10987 entries, 0 to 9999
Data columns (total 14 columns):
abstract           10987 non-null object
author             10987 non-null object
date               10987 non-null int64
journal            10987 non-null object
key                10987 non-null object
key_word           10987 non-null object
labels             10987 non-null object
list_strategies    10987 non-null object
pages              10987 non-null object
provenance         10987 non-null object
read               10987 non-null object
score              10987 non-null object
title              10987 non-null object
unique_key         10987 non-null object
dtypes: int64(1), object(13)
memory usage: 1.3+ MB


The pandas `info` function summarises the data set itself.

We can see that the data set contains the following columns:
- Abstract. The abstract of the article.
- Author. A single author from the list of authors of the respective article.
- Date. Year of publication.
- Journal. Journal of publication.
- Key. A generated key containing an author's name and the publication year (e.g. Glynatsi2017).
- Key_word. A single keyword assigned to the article by the given journal.
- Labels. A single label assigned to the article manually by us.
- Pages. Pages of publication.
- Provenance. Scholarly database from which the article was collected.
- Score. Score given to the article by the given journal.
- Title. Title of the article.
- Unique key. A unique key generated using the [hashlib python library](https://docs.python.org/2/library/hashlib.html). The hashable string is created from: [author name, title, year, abstract]
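
A minimal sketch of how such a unique key could be generated is given below. The hash function and the way the string is built are assumptions: the 32-character hexadecimal keys in the data are consistent with MD5, but the source does not state which `hashlib` function was used, nor which author name enters the string.

```python
import hashlib


def unique_key(author, title, year, abstract):
    """Hash the list [author name, title, year, abstract] into a hex string."""
    # Building the hashable string with str() is an assumption.
    hashable = str([author, title, year, abstract])
    # MD5 is assumed here; it yields a 32-character hexadecimal digest.
    return hashlib.md5(hashable.encode("utf-8")).hexdigest()


# Made-up metadata used only to exercise the function.
key = unique_key("A. Author", "Example title", 2017, "Example abstract.")
print(key)
```

Identical metadata always produces the same key, which is what allows the rows belonging to one article to be grouped later on.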


The data set also contains the columns `list_strategies` and `read`, but they are dropped for this analysis.

In [27]:
df = df.drop(['read', 'list_strategies'], axis=1)

In [28]:
df.describe()

Unnamed: 0,date
count,10987.0
mean,2009.706926
std,5.954897
min,1944.0
25%,2007.0
50%,2010.0
75%,2013.0
max,2017.0


Using the `describe` function of pandas we can see that there are in total 10987 rows of data
in our data set. Only date is displayed because it is the only integer column in the data set.
The minimum year is 1944 and the maximum is 2017.

**Total number of articles.**

In [30]:
total_articles = len(df['unique_key'].unique())
total_articles

1150

In [31]:
with open("/home/nightwing/rsc/Literature-Article/assets/total_articles.txt", 'w') as file:
    file.write('{}'.format(total_articles))

There are in total 1150 articles within the data set.
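
The number of rows is much larger than the number of articles because each row holds a single author, keyword and label, so one article spans several rows. The toy data frame below is made up and only illustrates this layout.

```python
import pandas as pd

# Made-up data: article "a1" has two authors and two labels, "b2" has one of each.
toy = pd.DataFrame({
    "unique_key": ["a1", "a1", "a1", "a1", "b2"],
    "author": ["X", "X", "Y", "Y", "Z"],
    "labels": ["textbook", "early experiments",
               "textbook", "early experiments", "textbook"],
})

n_rows = len(toy)                          # one row per author/label combination
n_articles = toy["unique_key"].nunique()   # one key per article
print(n_rows, n_articles)
```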

In [32]:
df = df.replace('None', np.nan)

**Unique titles**

In [33]:
len(df['title'].unique()), len(df['unique_key'].unique())

(1144, 1150)

In [34]:
unique_titles = len(df['title'].unique())

In [35]:
with open("/home/nightwing/rsc/Literature-Article/assets/unique_titles.txt", 'w') as file:
    file.write('{}'.format(unique_titles + 1))

In [36]:
number_of_duplicates = total_articles - (unique_titles + 1)
number_of_duplicates

5

In [38]:
with open("/home/nightwing/rsc/Literature-Article/assets/number_of_duplicates.txt", 'w') as file:
    file.write('{}'.format(number_of_duplicates))
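
The value of `number_of_duplicates` says how many duplicates there are but not which titles are involved. One possible way to list them, sketched here on made-up data, is to count the distinct unique keys per title; the helper name `duplicated_titles` is hypothetical.

```python
import pandas as pd


def duplicated_titles(frame):
    """Return the titles that are attached to more than one unique key."""
    keys_per_title = frame.groupby("title")["unique_key"].nunique()
    return keys_per_title[keys_per_title > 1]


# Made-up data: "Title A" appears under two different keys.
toy = pd.DataFrame({
    "title": ["Title A", "Title A", "Title B", "Title C", "Title C"],
    "unique_key": ["k1", "k2", "k3", "k4", "k4"],
})
duplicates = duplicated_titles(toy)
print(duplicates)
```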

**Number of authors and year range.**

In [43]:
number_of_authors = len(df['author'].unique())
number_of_authors

2109

In [44]:
with open("/home/nightwing/rsc/Literature-Article/assets/number_of_authors.txt", 'w') as file:
    file.write('{}'.format(number_of_authors))

Provenance
----------

The total number of articles is given above. Here we illustrate the provenance of these articles,
that is, from which journal they have been collected and how many articles have been added manually by us. The table below shows the number of articles for each provenance.

In [9]:
prov = df.groupby(['unique_key', 'provenance']).size().reset_index().groupby('provenance').size()

In [46]:
with open("/home/nightwing/rsc/Literature-Article/assets/prov_maual.txt", 'w') as file:
    file.write('{}'.format(prov['Manual']))

In [51]:
pd.DataFrame(prov, columns=['Number of articles'])

provenance,Number of articles
IEEE,241
Manual,41
Nature,25
PLOS,63
Springer,312
arXiv,470
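
The chained `groupby` used to build `prov` first collapses the rows to one per (article, provenance) pair and then counts the pairs per provenance. An equivalent formulation is sketched below on made-up data; the helper name `articles_per_provenance` is hypothetical.

```python
import pandas as pd


def articles_per_provenance(frame):
    """Count the distinct articles collected from each provenance."""
    # One row per (article, provenance) pair, whatever the number of authors.
    pairs = frame[["unique_key", "provenance"]].drop_duplicates()
    return pairs.groupby("provenance").size()


# Made-up data: article "k3" was collected from two sources.
toy = pd.DataFrame({
    "unique_key": ["k1", "k1", "k2", "k3", "k3"],
    "provenance": ["arXiv", "arXiv", "IEEE", "arXiv", "Springer"],
})
counts = articles_per_provenance(toy)
print(counts)
```

With this way of counting, an article collected from two sources is counted once for each source, so the counts can sum to more than the number of unique articles.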


In [53]:
with open("/home/nightwing/rsc/Literature-Article/assets/provenance_table.tex", 'w') as file:
    file.write(pd.DataFrame(prov, columns=['Number of articles']).to_latex())

As mentioned before, not all results from each API have the same format and the same information. For example, keywords are only given by IEEE and Nature. Furthermore, not all journals had full information for specific articles.

Here we look at the percentage of coverage of each column, that is, the share of articles with a non-missing value in that column:

In [55]:
temp = df.drop_duplicates('unique_key')
for col in df.columns:
    # share of unique articles with a non-missing value in this column
    perc = len(temp[col].dropna()) / len(temp) * 100
    print(col, ":", round(perc, 2))

abstract : 97.39
author : 100.0
date : 100.0
journal : 99.48
key : 100.0
key_word : 21.65
labels : 11.57
pages : 24.52
provenance : 100.0
score : 6.35
title : 100.0
unique_key : 100.0
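
The same coverage figures can be obtained without an explicit loop. The sketch below uses made-up data and a hypothetical helper, `coverage`, which takes the mean of the non-missing indicator over one row per article.

```python
import numpy as np
import pandas as pd


def coverage(frame, key="unique_key"):
    """Percentage of articles with a non-missing value, per column."""
    per_article = frame.drop_duplicates(key)
    return (per_article.notna().mean() * 100).round(2)


# Made-up data: only one of the four articles has page information.
toy = pd.DataFrame({
    "unique_key": ["k1", "k1", "k2", "k3", "k4"],
    "title": ["A", "A", "B", "C", "D"],
    "pages": ["1-5", "1-5", np.nan, np.nan, np.nan],
})
result = coverage(toy)
print(result)
```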


In [67]:
df.sort_values('date')


Unnamed: 0,abstract,author,date,journal,key,key_word,labels,pages,provenance,score,title,unique_key
10429,This is the classic work upon which modern-day...,O .Morgenstern,1944,Princeton University Press,Neuman1944,,textbook,625,Manual,,Theory of Games and Economic Behavior,b774511806b2e4dde03d669ac935a72a
10428,This is the classic work upon which modern-day...,J Von Neumann,1944,Princeton University Press,Neuman1944,,textbook,625,Manual,,Theory of Games and Economic Behavior,b774511806b2e4dde03d669ac935a72a
10433,Von Neumann and Morgenstern have developed a v...,John Nash,1950,Annals of Mathematics,Nash1950,,textbook,286-295,Manual,,Non- Cooperative Games,020985d4aab4030d0d76f64ea9a31821
10430,This paper reports the results of six experime...,Merrill M. Flood,1958,Management Science,Flood1958,,textbook,5-26,Manual,,Some Experimental Games,139e34b049eee5799472fb05b7340d76
10423,,Daniel R. Lutzker,1961,The Journal of Conflict Resolution,Lutzker1961,,human experiments,366-368,Manual,,"Sex role, cooperation and competition in a two...",61180fd8d3ad6a2a02542f32fa987379
10422,,Daniel R. Lutzker,1961,The Journal of Conflict Resolution,Lutzker1961,,early experiments,366-368,Manual,,"Sex role, cooperation and competition in a two...",61180fd8d3ad6a2a02542f32fa987379
10967,"The term ""Prisoner's Dilemma"" comes from the o...",Anatol Rapoport,1965,University of Michigan Press,Rapoport1965,,textbook,,Manual,,Prisoner's Dilemma: A Study in Conflict and Co...,key01
10966,"The term ""Prisoner's Dilemma"" comes from the o...",Anatol Rapoport,1965,University of Michigan Press,Rapoport1965,,early experiments,,Manual,,Prisoner's Dilemma: A Study in Conflict and Co...,key01
10964,"The term ""Prisoner's Dilemma"" comes from the o...",Albert M. Chammah,1965,University of Michigan Press,Rapoport1965,,early experiments,,Manual,,Prisoner's Dilemma: A Study in Conflict and Co...,key01
10965,"The term ""Prisoner's Dilemma"" comes from the o...",Albert M. Chammah,1965,University of Michigan Press,Rapoport1965,,textbook,,Manual,,Prisoner's Dilemma: A Study in Conflict and Co...,key01
