# Key papers

This Jupyter notebook performs a basic publication analysis for a branch of science.

**Features:**

1. Subtopic analysis based on co-citation graph clustering:
    * Chord diagram for co-citation graph
    * Comparison of subtopics by size
    * Timeline of each subtopic
    * Extraction of 1-, 2-, and 3-grams describing each subtopic
2. Detection of highlight papers:
    * Top cited papers overall
    * Detection of most cited papers for each year
    * Detection of papers with max relative citation gain for each year
3. Citation dynamics visualization for highlight papers
4. Subtopic evolution tracking based on co-citation graph clustering for different time periods
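
The co-citation counts that the subtopic analysis relies on can be illustrated with a minimal sketch. It is not the `keypaper` implementation: the function name `count_cocitations` and the toy data are hypothetical, and the sketch only assumes that two papers are co-cited when they appear together in the reference list of the same citing paper.

```python
from collections import Counter
from itertools import combinations


def count_cocitations(citing_to_cited):
    """Count how often each pair of papers shares a citing paper.

    Hypothetical helper for illustration, not part of keypaper.
    """
    pair_counts = Counter()
    for cited in citing_to_cited.values():
        # Deduplicate and sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(cited)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy data (made up): citing paper -> papers it cites.
citations = {
    'X': ['A', 'B', 'C'],
    'Y': ['A', 'B'],
    'Z': ['B', 'C'],
}

cocitations = count_cocitations(citations)
for pair, count in cocitations.most_common():
    print(pair, count)
```

The resulting pair counts can serve as edge weights of a co-citation graph, which is then clustered into subtopics.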

## Getting Started

1. Define the `SEARCH_TERMS` variable in the cell below as a list of keywords that describe the branch of science you are interested in.
2. Run all cells and review the results.

In [14]:
SEARCH_TERMS = ['dna', 'methylation', 'clock']

## Publication Analysis

In [15]:
import logging

from bokeh.plotting import show, output_notebook
from matplotlib import pyplot as plt

from keypaper.analysis import KeyPaperAnalyzer
from keypaper.visualization import Plotter

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)s: %(message)s')
output_notebook()
%matplotlib inline

2019-05-17 23:56:34,103 DEBUG: backend module://ipykernel.pylab.backend_inline version unknown


In [16]:
analyzer = KeyPaperAnalyzer()
analyzer.launch(*SEARCH_TERMS)

TODO: handle queries which return more than 1000000 items
TODO: use local database instead of PubMed API


2019-05-17 23:56:37,516 INFO: Found 299 articles about ('dna', 'methylation', 'clock')
2019-05-17 23:56:37,519 INFO: Loading publication data
2019-05-17 23:56:37,522 INFO: Creating pmids table for request with index.
2019-05-17 23:56:55,500 INFO: Found 223 publications in the local database

2019-05-17 23:56:55,502 INFO: Started loading citation stats
2019-05-18 00:00:46,042 INFO: Done loading citation stats
2019-05-18 00:00:46,126 INFO: Loaded citation stats for 174 of 299 articles. Others may either have zero citations or be absent in the local database.
2019-05-18 00:00:46,136 INFO: 170 articles are further analyzed

2019-05-18 00:00:46,139 INFO: Calculating co-citations for selected articles
2019-05-18 00:00:46,302 INFO: Loaded 417 lines of citing info
2019-05-18 00:00:46,303 INFO: Found 4865 co-cited pairs of articles
2019-05-18 00:00:46,305 INFO: Aggregating co-citations
2019-05-18 00:00:46,533 INFO: Filtering top maximum top 10000 of all the co-citations
2019-05-18 00:00:46,534 

'Found 299 articles about (\'dna\', \'methylation\', \'clock\')\nLoading publication data\nCreating pmids table for request with index.\nFound 223 publications in the local database\n\nStarted loading citation stats\nDone loading citation stats\nLoaded citation stats for 174 of 299 articles. Others may either have zero citations or be absent in the local database.\n170 articles are further analyzed\n\nCalculating co-citations for selected articles\nLoaded 417 lines of citing info\nFound 4865 co-cited pairs of articles\nAggregating co-citations\nFiltering top maximum top 10000 of all the co-citations\nBuilding co-citations graph\nCo-citations graph nodes 134 edges 1093\n\nLouvain community clustering of co-citation graph\nFound 4 components\nGraph modularity: 0.275\nMerging components smaller than 0.01 to "Other" component\nAll components are bigger than 0.01, no need to reassign\nCluster 0: 36 (26%)\nCluster 1: 54 (40%)\nCluster 2: 29 (21%)\nCluster 3: 15 (11%)\nGetting n-gram descript

In [17]:
plotter = Plotter(analyzer)

## Subtopics a.k.a. Clusters in the Co-citation Graph

In [18]:
show(plotter.chord_diagram_components())

2019-05-18 00:00:51,623 INFO: Visualizing components with Chord diagram


In [19]:
show(plotter.component_size_summary())

2019-05-18 00:00:53,022 INFO: Summary component detailed info visualization


In [20]:
for p in plotter.subtopic_timeline_graphs():
    show(p)

2019-05-18 00:00:53,531 INFO: Per component detailed info visualization


## Top Cited Papers Overall

In [21]:
show(plotter.top_cited_papers())

## Top Cited Papers for Each Year

In [22]:
show(plotter.max_gain_papers())

2019-05-18 00:00:54,693 INFO: Different colors encode different papers


## Top by Relative Gain for Each Year

In [23]:
show(plotter.max_relative_gain_papers())

2019-05-18 00:00:55,087 INFO: Top papers in relative gain for each year
2019-05-18 00:00:55,089 INFO: Relative gain (year) = Citation Gain (year) / Citations before year
2019-05-18 00:00:55,090 INFO: Different colors encode different papers
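
The relative gain formula from the log above can be worked through on toy numbers. This is a sketch under stated assumptions, not the `keypaper` code: it assumes that "Citation Gain (year)" is the number of citations received in that year, that "Citations before year" is the cumulative total over earlier years, and that years with no earlier citations are skipped because the ratio is undefined. The function names and the data are hypothetical.

```python
def relative_gains(citations_per_year):
    """Return {year: citations in year / citations before year} for one paper."""
    gains = {}
    total_before = 0
    for year in sorted(citations_per_year):
        gain = citations_per_year[year]
        # The ratio is undefined when the paper has no earlier citations,
        # so such years are skipped (an assumption of this sketch).
        if total_before > 0:
            gains[year] = gain / total_before
        total_before += gain
    return gains


def top_relative_gain_per_year(papers):
    """Return {year: (paper_id, relative gain)} with the best paper per year."""
    best = {}
    for paper_id, per_year in papers.items():
        for year, gain in relative_gains(per_year).items():
            if year not in best or gain > best[year][1]:
                best[year] = (paper_id, gain)
    return best


# Toy data (made up): paper -> citations received in each year.
papers = {
    'paper_a': {2015: 10, 2016: 5, 2017: 5},
    'paper_b': {2015: 2, 2016: 6, 2017: 4},
}

top = top_relative_gain_per_year(papers)
for year in sorted(top):
    print(year, top[year])
```

In this toy data `paper_a` has more citations overall, but `paper_b` leads in relative gain in both 2016 and 2017, which is the kind of fast-growing paper this view is meant to surface.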


## Citation per Year Dynamics

In [24]:
plotter.article_citation_dynamics()

2019-05-18 00:00:55,721 INFO: Choose ID to get detailed citations timeline for top cited / max gain or relative gain papers


# Experimental Features

## Component Evolution

In [25]:
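# Use a separate name so that the matplotlib `plt` imported above is not shadowed
evolution_plot = plotter.subtopic_evolution()
evolution_plot.show()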

AttributeError: 'KeyPaperAnalyzer' object has no attribute 'evolution_df'

In [26]:
import math
import holoviews as hv
hv.extension('bokeh')

import pandas as pd

In [28]:
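# NOTE: `evol` is never defined in this notebook. The code below expects a DataFrame,
# presumably indexed by PMID, whose columns after the first hold the component
# assigned to each paper in consecutive time periods.
cols = evol.columns[1:]
pairs = list(zip(cols, cols[1:]))
edges = []

# `now` is the earlier period and `then` is the period that follows it
for now, then in pairs:
    nodes_now = [f'{now} {c}' for c in evol[now].unique()]
    nodes_then = [f'{then} {c}' for c in evol[then].unique()]

    # changes[v][u] counts the papers that move from node v to node u
    changes = {v: {u: 0 for u in nodes_then} for v in nodes_now}
    for pmid, comp in evol.iterrows():
        c_now, c_then = comp[now], comp[then]
        changes[f'{now} {c_now}'][f'{then} {c_then}'] += 1

    # Keep only non-empty transitions as [source, target, weight] edges
    for v in nodes_now:
        for u in nodes_then:
            if changes[v][u] > 0:
                edges.append([v, u, changes[v][u]])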

NameError: name 'evol' is not defined

In [None]:
sankey = hv.Sankey(edges)
sankey.opts(width=960, height=400)
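
Because `evol` is not defined in this notebook, the edge construction can be tried on toy data instead. The sketch below is a standard-library stand-in for the pandas loop: `transition_edges`, `periods`, and `assignments` are hypothetical names, and the component labels are made up. It assumes only that each paper has one component label per time period.

```python
from collections import Counter


def transition_edges(periods, assignments):
    """Build [source, target, weight] edges between consecutive periods.

    periods: ordered list of period labels.
    assignments: paper id -> {period: component} (hypothetical layout).
    """
    edges = []
    for now, then in zip(periods, periods[1:]):
        # Count papers moving from a component in `now` to one in `then`.
        counts = Counter(
            (f'{now} {comps[now]}', f'{then} {comps[then]}')
            for comps in assignments.values()
        )
        edges.extend([src, dst, n] for (src, dst), n in sorted(counts.items()))
    return edges


# Toy data (made up): four papers, two periods, two components.
periods = [2010, 2015]
assignments = {
    'pmid1': {2010: 0, 2015: 0},
    'pmid2': {2010: 0, 2015: 1},
    'pmid3': {2010: 1, 2015: 1},
    'pmid4': {2010: 1, 2015: 1},
}

edges = transition_edges(periods, assignments)
for edge in edges:
    print(edge)
```

The resulting list has the same `[source, target, weight]` shape that the cell above passes to `hv.Sankey`.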

## PageRank for Citation Analysis