# Key papers

**TOPICS OF INTEREST** - from the meeting in January

* inflammation aging chronic (2004) - 13k papers
* genome editing / manipulation, CRISPR - 13k papers
* induced stem cells - 73k papers - 3h for calculating co-citations
* single-cell sequencing (2012) - 3k papers
* ATAC-seq (2015) - 276 papers
* immunomodulation cancer - 71k papers
* Telomere Theories of Aging - ??
* mTOR pathway - 14255 papers
* autophagy - ??
* Calorie restriction - 3933 papers

Complement Factor H + Age-Related Macular Degeneration - investigate

**Issues**:

1. Tooltips with long titles may extend beyond the plot bounds.
2. How should articles published in the same year be placed? (Currently the y-axis position is random in [0, 1].)
3. Clustering algorithms need further research (see also `networkx.algorithms.community`).
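
For issue 2, one option is to replace the random y position with a deterministic one: rank the papers within each year by citation count and spread them evenly over the vertical axis. The sketch below is only an illustration; the function name `stacked_y_positions` and the toy table are assumptions, and the column names `year` and `total` are borrowed from the tables shown later in this notebook.

```python
import pandas as pd


def stacked_y_positions(df, year_col="year", weight_col="total"):
    """Return a y position in (0, 1) per row, spreading same-year rows evenly."""
    # Rank papers within each year; the most cited paper gets rank 1.
    rank = df.groupby(year_col)[weight_col].rank(method="first", ascending=False)
    # Number of papers that share each row's year.
    size = df[year_col].map(df[year_col].value_counts())
    # Map ranks 1..n to the centres of n equal slots; rank 1 lands highest.
    return 1.0 - (rank - 0.5) / size


# Hypothetical toy data: three papers from 2005 and one from 2014.
papers = pd.DataFrame({
    "pmid": [1, 2, 3, 4],
    "year": [2005, 2005, 2005, 2014],
    "total": [102, 94, 21, 436],
})
papers["y"] = stacked_y_positions(papers)
print(papers)
```

With this scheme a year with a single paper puts it at 0.5, and crowded years fill the axis from top (most cited) to bottom, so reruns of the notebook produce the same picture.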

**Functions**:

1. Subtopic Analysis based on co-citation graph clustering
2. Top Cited Papers detection (overall and for certain year)
3. Citation Dynamics for a certain article

## Search Terms

In [1]:
SEARCH_TERMS = ['DNA', 'methylation', 'clock']

In [2]:
# from importlib import reload
import logging
# reload(logging)

import re
import gc
import ipywidgets as widgets
import math
import networkx as nx
import numpy as np
import pandas as pd

from bokeh.io import push_notebook
from bokeh.models import ColumnDataSource, LabelSet, OpenURL, CustomJS
from bokeh.plotting import figure, show, output_notebook
from bokeh.transform import factor_cmap
from bokeh.core.properties import value
from bokeh.colors import Color, RGB
from bokeh.models import Plot, Range1d, MultiLine, Circle
# Tools used: hover,pan,tap,wheel_zoom,box_zoom,reset,save
from bokeh.models import HoverTool, PanTool, TapTool, WheelZoomTool, BoxZoomTool, ResetTool, SaveTool
from bokeh.models.graphs import from_networkx

from IPython.display import display
from matplotlib import pyplot as plt
%matplotlib inline

In [3]:
from keypaper.analysis import KeyPaperAnalyzer
from keypaper.visualization import build_data_source, serve_scatter_article_layout, serve_citation_dynamics_layout
from keypaper.utils import get_most_common_ngrams

output_notebook()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nikol\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
analyzer = KeyPaperAnalyzer('nikolay.kapralov@gmail.com')
analyzer.search(*SEARCH_TERMS)
analyzer.load_publications()
analyzer.pub_df.head()

TODO: handle queries which return more than 1000000 items
TODO: use local database instead of PubMed API


2019-05-06 00:14:03,766 INFO: Found 295 articles about ('DNA', 'methylation', 'clock')
2019-05-06 00:14:03,769 INFO: Loading publication data
2019-05-06 00:14:06,833 INFO: Found 223 publications in the local database


Unnamed: 0,pmid,title,year
0,1722018,DNA methylation and cellular ageing.,1991.0
1,1943146,Quantitative genetic variation and development...,1991.0
2,2777259,Cytosine methylation and the fate of CpG dinuc...,1989.0
3,2857475,Control of haemoglobin switching by a developm...,1985.0
4,11032969,Crisis periods and apoptotic commitment: death...,2000.0


In [5]:
# If this call takes too long, you can restart the PostgreSQL server:
# pg_ctl -D /usr/local/var/postgres stop -s -m fast
# pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start
analyzer.load_cocitations()
analyzer.cocit_df.head()

2019-05-06 00:14:07,182 INFO: Calculating co-citations for selected articles
2019-05-06 00:14:08,441 INFO: Found 4876 co-cited pairs of articles
2019-05-06 00:14:08,443 INFO: Building co-citations graph
2019-05-06 00:14:08,499 INFO: Co-citations graph nodes 138 edges 1131


Unnamed: 0,citing,cited_1,cited_2,year
0,16760426,1722018,15975143,2006.0
1,18535014,2777259,15941485,2008.0
2,18662928,2777259,17029560,2008.0
3,25261778,2777259,17029560,2014.0
4,25788985,2777259,25313081,2015.0
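
The table above has one row per (citing paper, pair of co-cited papers). How `load_cocitations` builds it is not shown in this notebook, so the following is only a minimal sketch of the general idea: every paper that cites two articles contributes one co-citation to that pair, and the pair count becomes the edge weight in the graph. The function names and the toy reference lists are assumptions.

```python
from collections import Counter
from itertools import combinations


def cocitation_pairs(references):
    """Expand {citing: [cited, ...]} into (citing, cited_1, cited_2) rows."""
    rows = []
    for citing, cited in references.items():
        # Sort so that each unordered pair is always written the same way.
        for a, b in combinations(sorted(set(cited)), 2):
            rows.append((citing, a, b))
    return rows


def cocitation_weights(rows):
    """Count how many citing papers mention each pair together."""
    return Counter((a, b) for _, a, b in rows)


# Hypothetical reference lists: p1 cites a, b and c; p2 cites a and b.
references = {"p1": ["a", "b", "c"], "p2": ["b", "a"], "p3": ["c"]}
rows = cocitation_pairs(references)
weights = cocitation_weights(rows)
print(weights)
```

A paper with a single reference (`p3` here) produces no pairs, which is one reason the co-citation graph has fewer nodes than the search result.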


In [6]:
analyzer.load_citation_stats()
analyzer.cit_df.head()

2019-05-06 00:14:08,527 INFO: Started loading citation stats
2019-05-06 00:14:08,718 INFO: Done loading citation stats
2019-05-06 00:14:08,734 INFO: Filtering top 1000 or 50% of all the papers
2019-05-06 00:14:08,736 INFO: Done aggregation
2019-05-06 00:14:08,738 INFO: Loaded citation stats for 87 of 295 articles. Others may either have zero citations or be absent in the local database.


year,pmid,1985,1986,1988,1990,1991,1992,1993,1994,1995,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,total
59,24138928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,32.0,90.0,135.0,150.0,29.0,0.0,436.0
72,25313081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,25.0,37.0,38.0,4.0,0.0,104.0
2,2777259,0.0,0.0,0.0,5.0,6.0,5.0,2.0,1.0,0.0,...,9.0,5.0,7.0,7.0,5.0,9.0,2.0,0.0,0.0,102.0
6,15790588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,10.0,11.0,13.0,7.0,7.0,9.0,2.0,0.0,102.0
7,15860628,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,15.0,12.0,9.0,8.0,6.0,8.0,1.0,0.0,94.0
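
The wide layout above (one row per article, one column per year, plus `total`) can be derived from a long list of individual citations. The sketch below shows one way to do that with `pd.crosstab`; it is not necessarily what `load_citation_stats` does, and the long table with columns `pmid` (the cited article) and `year` (year of the citing paper) is a made-up example.

```python
import pandas as pd

# Hypothetical long format: one row per citation received.
citations = pd.DataFrame({
    "pmid": [10, 10, 10, 20, 20, 30],
    "year": [2014, 2015, 2015, 2015, 2016, 2016],
})

# Count citations per article and year; missing combinations become 0.
stats = pd.crosstab(citations["pmid"], citations["year"])
# Sum over the year columns before adding the total column itself.
stats["total"] = stats.sum(axis=1)
# Most cited articles first, as in the table above.
stats = stats.sort_values("total", ascending=False)
print(stats)
```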


## Subtopic Analysis

In [7]:
analyzer.subtopic_analysis()

2019-05-06 00:22:53,603 INFO: Louvain community clustering of co-citation graph
2019-05-06 00:22:53,756 INFO: Found 4 components
2019-05-06 00:22:53,758 INFO: Merging components smaller than 0.01 to "Other" component
2019-05-06 00:22:53,759 INFO: 0: 55 (39%)
1: 54 (39%)
2: 15 (10%)
3: 14 (10%)
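
The log mentions that components smaller than 0.01 are merged into an "Other" component. The analyzer's actual code is not part of this notebook; the sketch below is one plausible reading of that step, assuming the threshold is a fraction of all clustered nodes and that components are renumbered by decreasing size. The function name and the toy partition are hypothetical.

```python
from collections import Counter


def merge_small_components(partition, min_fraction=0.01):
    """Relabel {node: component} so that tiny components share one label."""
    sizes = Counter(partition.values())
    n_nodes = len(partition)
    # Components that are large enough to keep, biggest first.
    kept = [c for c, size in sizes.most_common() if size / n_nodes >= min_fraction]
    relabel = {c: i for i, c in enumerate(kept)}
    # Everything else gets the next free index, acting as "Other".
    other = len(kept)
    return {node: relabel.get(c, other) for node, c in partition.items()}


# Toy partition with a larger threshold so the effect is visible on 6 nodes.
partition = {"n1": 7, "n2": 7, "n3": 7, "n4": 3, "n5": 3, "n6": 9}
merged = merge_small_components(partition, min_fraction=0.2)
print(merged)
```

In the run above all four components hold at least 10% of the nodes, so nothing would be merged at the 0.01 threshold.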


In [8]:
analyzer.pubcit_df.head()

Unnamed: 0,pmid,title,year,1985,1986,1988,1990,1991,1992,1993,...,2012,2013,2014,2015,2016,2017,2018,2019,total,comp
0,2777259,Cytosine methylation and the fate of CpG dinuc...,1989.0,0.0,0.0,0.0,5.0,6.0,5.0,2.0,...,5.0,7.0,7.0,5.0,9.0,2.0,0.0,0.0,102.0,2
1,15790588,"Deregulated expression of the PER1, PER2 and P...",2005.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,11.0,13.0,7.0,7.0,9.0,2.0,0.0,102.0,1
2,15860628,PERIOD1-associated proteins modulate the negat...,2005.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,15.0,12.0,9.0,8.0,6.0,8.0,1.0,0.0,94.0,1
3,15975143,Age-related human small intestine methylation:...,2005.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,21.0,2
4,16314580,Counting human somatic cell replications: meth...,2005.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,23.0,2


In [9]:
logging.info('Visualize components')

G = analyzer.CG.copy()

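# NOTE: `pm` (a mapping node -> component index) is not defined anywhere in this notebook,
# so this cell stops with the NameError shown below. One possible definition, assuming
# every graph node has a row in pubcit_df:
# pm = dict(zip(analyzer.pubcit_df['pmid'], analyzer.pubcit_df['comp']))
n_comps = len(set(pm.values()))
cmap = plt.cm.get_cmap('nipy_spectral', n_comps)
comp_palette = [RGB(*[round(c*255) for c in cmap(i)[:3]]) for i in range(n_comps)]

# set node attributes
node_color = {node: comp_palette[pm[node]] for node in G.nodes()}
nx.set_node_attributes(G, node_color, 'colors')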

# Show with Bokeh
plot = Plot(plot_width=400, plot_height=400, x_range=Range1d(-1.1, 1.1), y_range=Range1d(-1.1, 1.1))
plot.title.text = 'Components visualization'

graph = from_networkx(G, nx.spring_layout, scale=1, center=(0, 0))
graph.node_renderer.glyph = Circle(size=10, fill_color='colors')

# add data for rendering
graph.node_renderer.data_source.data['id'] = list(G.nodes())

# add tools to the plot
# hover,pan,tap,wheel_zoom,box_zoom,reset,save
plot.add_tools(HoverTool(tooltips=[("PMID", "@id")]), 
               PanTool(), WheelZoomTool(), BoxZoomTool(), ResetTool(), SaveTool())

plot.renderers.append(graph)

show(plot)

2019-05-06 00:27:07,451 INFO: Visualize components


NameError: name 'pm' is not defined

In [11]:
years = analyzer.pubcit_df.columns.values[3:-2].astype(int)
min_year, max_year = np.min(years), np.max(years)

In [12]:
logging.info('Summary component detailed info visualization')

n_comps = analyzer.pubcit_df['comp'].nunique()
cmap = plt.cm.get_cmap('nipy_spectral', n_comps)
palette = [RGB(*[round(c*255) for c in cmap(i)[:3]]) for i in range(n_comps)]

components = [str(i) for i in range(n_comps)]
# range() excludes its end point, so add 1 to keep the last year on the axis
year_values = range(min_year, max_year + 1)
years = [str(y) for y in year_values]
data = {'years': years}
for c in range(n_comps):
    # publication years of the papers in component c
    comp_years = analyzer.pubcit_df.loc[analyzer.pubcit_df['comp'] == c, 'year']
    data[str(c)] = [int((comp_years == y).sum()) for y in year_values]

# the data source has a 'years' column (not 'components'), so the tooltip refers to @years
p = figure(x_range=years, plot_width=960, plot_height=300, title="Components by Year",
           toolbar_location=None, tools="hover", tooltips="$name @years: @$name")

p.vbar_stack(components, x='years', width=0.9, color=palette, source=data, alpha=0.5,
             legend=[value(c) for c in components])

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

show(p)

2019-05-06 00:28:08,862 INFO: Summary component detailed info visualization


In [14]:
logging.info('Per component detailed info visualization')
n_comps = analyzer.pubcit_df['comp'].nunique()
ds = [None] * n_comps
layouts = [None] * n_comps
most_common = [None] * n_comps
cmap = plt.cm.get_cmap('nipy_spectral', n_comps)
for c in range(n_comps):
    # df_all is not defined at this point; pubcit_df is used instead,
    # assuming both names refer to the same publications + citations table
    comp_df = analyzer.pubcit_df[analyzer.pubcit_df['comp'] == c]
    ds[c] = build_data_source(comp_df)
    most_common[c] = dict(get_most_common_ngrams(comp_df['title'].values, 5))
    kwd = ', '.join([f'{k} ({v:.2f})' for k, v in most_common[c].items()])
    title = f'Subtopic #{c}: {kwd}'
    layouts[c] = serve_scatter_article_layout(ds[c], title, year_range=[min_year, max_year],
                                              color=RGB(*[round(ch*255) for ch in cmap(c)[:3]]))
    show(layouts[c])

2019-05-06 00:28:35,020 INFO: Per component detailed info visualization


NameError: name 'np' is not defined

## Top Cited Papers Overall

In [49]:
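# df_all is not defined earlier in the notebook; assuming it is the publications + citations table
df_all = analyzer.pubcit_df.sort_values(by='total', ascending=False)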

In [50]:
THRESHOLD = 0.1 # 10 %
MAX_PAPERS = 100

In [51]:
print('TODO: color me by components colors')
papers_to_show = min(MAX_PAPERS, round(len(analyzer.cit_df) * THRESHOLD))
ds_top = build_data_source(df_all.iloc[:papers_to_show, :])
layout_top = serve_scatter_article_layout(ds_top, 'Top cited papers', year_range=[min_year, max_year])
show(layout_top)

TODO: color me by components colors
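
The number of papers shown is a share of all papers with citation stats, capped at a maximum. A small standalone sketch of that selection rule follows; the helper names are hypothetical, and the PMIDs and totals are the five rows from the citation stats table earlier in this notebook.

```python
def n_papers_to_show(n_papers, threshold=0.1, max_papers=100):
    """Number of top cited papers to display: a share of all papers, capped."""
    return min(max_papers, round(n_papers * threshold))


def top_cited(totals, threshold=0.1, max_papers=100):
    """Return (pmid, total) pairs for the most cited papers, highest first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_papers_to_show(len(totals), threshold, max_papers)]


# Totals taken from the head of the citation stats table.
totals = {24138928: 436, 25313081: 104, 2777259: 102, 15790588: 102, 15860628: 94}
# A larger threshold than 10% so that the toy example selects something.
print(top_cited(totals, threshold=0.4))
print(n_papers_to_show(87))
```

If `cit_df` holds the 87 articles reported in the log, the 10% threshold selects 9 papers. Note that Python's `round` rounds exact halves to the nearest even integer, so the count can be one lower than expected for some table sizes.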


## Top Cited Papers for Each Year

In [52]:
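max_gain_data = []
# year columns sit between the metadata columns and the trailing 'total' and 'comp'
cols = df_all.columns[3:-2]
for col in cols:
    max_gain = df_all[col].astype(int).max()
    if max_gain > 0:
        # first paper that reached the maximum number of citations in this year
        sel = df_all[df_all[col] == max_gain]
        max_gain_data.append([col, sel['pmid'].values[0],
                              sel['title'].values[0], max_gain])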
        
max_gain_df = pd.DataFrame(max_gain_data, columns=['year', 'pmid', 'title', 'count'])
max_gain_df.head(20)

ds_max = ColumnDataSource(data=dict(year=max_gain_df['year'], pmid=max_gain_df['pmid'].astype(str),
                                   title=max_gain_df['title'], count=max_gain_df['count']))
factors=max_gain_df['pmid'].astype(str).unique()
cmap = plt.cm.get_cmap('nipy_spectral', len(factors))
palette = [RGB(*[round(c*255) for c in cmap(i)[:3]]) for i in range(len(factors))]
colors = factor_cmap('pmid', palette=palette, factors=factors)

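year_range = [min_year, max_year]
# TOOLS is not defined earlier in the notebook; this list follows the "Tools used" comment in cell 2
TOOLS = "hover,pan,tap,wheel_zoom,box_zoom,reset,save"
p = figure(tools=TOOLS, toolbar_location="above",
           plot_width=960, plot_height=300, x_range=year_range, title='Max gain')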
p.xaxis.axis_label = 'Year'
p.yaxis.axis_label = 'Number of citations'
p.hover.tooltips = [
    ("PMID", '@pmid'),
    ("Title", '@title'),
    ("Year", '@year'),
    ("Cited by", '@count papers in @year')
]

p.vbar(x='year', width=0.8, top='count', fill_alpha=0.5, source=ds_max, fill_color=colors, line_color=colors)

show(p)
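
The "max gain" table picks, for every year, the paper that received the most citations in that year. The loop in the cell above can also be written with `idxmax`; the sketch below is an alternative under the assumption that PMIDs are unique, using a made-up table with the same layout (`pmid`, `title`, one column per year). On ties it keeps the first paper in table order, like the loop does.

```python
import pandas as pd


def max_gain_per_year(df, year_cols):
    """For each year, return the paper with the most citations in that year."""
    counts = df.set_index("pmid")[year_cols]
    result = pd.DataFrame({
        "year": list(year_cols),
        # idxmax gives the PMID of the first row holding the column maximum.
        "pmid": counts.idxmax().values,
        "count": counts.max().astype(int).values,
    })
    # Years without any citation have no winner.
    result = result[result["count"] > 0].reset_index(drop=True)
    titles = df.set_index("pmid")["title"]
    result["title"] = result["pmid"].map(titles)
    return result[["year", "pmid", "title", "count"]]


# Hypothetical toy table: nobody is cited in 2014, papers 2 and 3 tie in 2016.
toy = pd.DataFrame({
    "pmid": [1, 2, 3],
    "title": ["A", "B", "C"],
    2014: [0.0, 0.0, 0.0],
    2015: [5.0, 2.0, 0.0],
    2016: [1.0, 7.0, 7.0],
})
gain = max_gain_per_year(toy, [2014, 2015, 2016])
print(gain)
```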

## Citation per Year Dynamics

In [53]:
p, h, panel = serve_citation_dynamics_layout()
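
`serve_citation_dynamics_layout` is defined in `keypaper.visualization` and its internals are not shown here. As a rough illustration of the data behind such a plot, the sketch below pulls the per-year citation counts of one article out of a wide table and adds a running total. The function name is hypothetical; the numbers are the 2015-2018 counts for PMID 25313081 from the citation stats table earlier in this notebook, with the other years left out.

```python
import pandas as pd


def citation_dynamics(df, pmid, year_cols):
    """Per-year and cumulative citation counts for a single article."""
    row = df.loc[df["pmid"] == pmid, year_cols]
    if row.empty:
        raise KeyError(f"PMID {pmid} not found")
    per_year = row.iloc[0].astype(int)
    per_year.index = per_year.index.astype(int)
    return pd.DataFrame({"per_year": per_year, "cumulative": per_year.cumsum()})


# Excerpt of the wide citation table (only four year columns kept).
wide = pd.DataFrame({
    "pmid": [25313081],
    2015: [25.0],
    2016: [37.0],
    2017: [38.0],
    2018: [4.0],
})
dynamics = citation_dynamics(wide, 25313081, [2015, 2016, 2017, 2018])
print(dynamics)
```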