### Text-mining techniques are effective for helping to answer a specific research question and for identifying similar articles.  However, the quality of the answer given is just as important as having an answer:

If asking "What conditions make someone more susceptible to contracting COVID-19?", you might be recommended this article via text-mining: [Relationship between the ABO Blood Group and the COVID-19 Susceptibility](http://www.medrxiv.org/content/10.1101/2020.03.11.20031096v2)

A confirmed relationship between blood type and susceptibility to COVID19 might greatly impact how COVID19 tests are given and distributed.  I've already seen this article being shared on social media sites.  However, this study did not use a large sample size, and their work has not been replicated.  A researcher might be able to make a more informed decision if they viewed similar papers on the MERS coronavirus.

Additionally, to fully understand the papers recommended, you might need to look at some articles for background knowledge.  However, articles discussing similar topics might cite different background papers, some of which could be out of date, while others rapidly gain popularity with researchers.  With the way COVID19 research is rapidly growing and changing, background knowledge may become out of date quickly.  Therefore, it would be useful to visualize which background papers are the highest-cited by a set of recommended articles.


How do you address these issues?  Article metadata, specifically, an analysis of article citations.  Commonly, the relationship between (source) articles and the ones they cite is visualized in the form of a directed graph, where sources "point" to their citations:


![](https://3spxpi1radr22mzge33bla91-wpengine.netdna-ssl.com/wp-content/uploads/2016/09/citation-cartel-closeup.png)
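To make the "sources point to their citations" idea concrete, here is a minimal sketch that uses plain Python dictionaries instead of a graph library.  The article titles are made up for illustration; they are not from the dataset.

```python
from collections import defaultdict

# Hypothetical (source, cited) title pairs, standing in for rows of citation data
citations = [
    ("Paper A", "Background 1"),
    ("Paper B", "Background 1"),
    ("Paper B", "Background 2"),
    ("Paper C", "Background 1"),
]

# Adjacency list: each source "points" to the articles it cites
cites = defaultdict(list)
# In-degree: how many sources point at each cited article
times_cited = defaultdict(int)

for source, cited in citations:
    cites[source].append(cited)
    times_cited[cited] += 1

# The node with the most incoming edges is the most-cited article
most_cited = max(times_cited, key=times_cited.get)
print(most_cited, times_cited[most_cited])
```

The direction of the edges is what makes a highly cited background paper stand out: many arrows end at it, even if it cites nothing inside the collection.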







In this notebook, I'll demonstrate how to make citation networks utilizing the COVID19 articles, as well as how to incorporate the graphs into text mining results. All of these graphs are created with pyvis: https://pyvis.readthedocs.io/en/latest/tutorial.html  Feel free to repurpose/copy the notebook code in whatever way suits your needs and project.  


**Pros to this approach:**
* Citation graphs can be incorporated into any approach.  If your method of choice is clustering, you could create a feature to generate a pyvis graph for a cluster or clusters selected by a user

* Reduces the amount of time researchers/users have to spend hunting for the most established article or background paper on a topic

* If made interactive, users can identify "missing links," or articles between several topics that could answer multiple questions at once

* This could save users of a tool time and help them decide which articles are the most credible, or, at least, most frequently acknowledged/regarded within the COVID19 research community


**Cons to this approach:**
* Citation data has to be updated every time new articles are added to the collection.  
* Graphs are (computationally) expensive.  Although the graphs in this demo can be Jupyter Notebook outputs, if you wanted to create citation networks for a couple thousand nodes (or, say, a half of the 50k corpus), you would have to move this feature to a website.
* The JSON files, not the cleaner metadata, contained all the citations.  There are far more cited articles than cited articles with metadata.  This poses limitations when creating interactive graphs that display a url, abstract, etc.

* The citation graph is only as useful as the text mining technique recommending/filtering articles. With that in mind...
### ***Important Note*: The code to perform LDA / recommend articles is the work of Daniel Wolffram.  His complete notebook (with interactive widgets) is [here](https://www.kaggle.com/danielwolffram/topic-modeling-finding-related-articles) and this is the link to his team's website: https://discovid.ai/search.  I've also cited each of the sections of functions he wrote below, as I do not want to take credit for his work**  

# Install/Load Packages
### Use of scispacy is for Wolffram's LDA model

*Internet access needs to be switched on for this to work!*

In [None]:
from IPython.utils import io
with io.capture_output() as captured:
    !pip install scispacy
    !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
    !pip install pyvis

In [None]:
import numpy as np 
import pandas as pd

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import scispacy
import spacy
import en_core_sci_lg

from scipy.spatial.distance import jensenshannon

import joblib

from IPython.display import HTML, display

from ipywidgets import interact, Layout, HBox, VBox, Box
import ipywidgets as widgets
from IPython.display import clear_output

from tqdm import tqdm
from os.path import isfile

import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

plt.style.use("dark_background")


from pyvis import network as net
import networkx as nx

import textwrap


# Load in the datasets

### Load in the metadata for all of the articles (dataset already part of Wolffram's notebook)

In [None]:
df = pd.read_csv('../input/cord-19-create-dataframe/cord19_df.csv')

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df['source'].value_counts()


### Load in the citation data

In [None]:
citation_df = pd.read_csv('../input/covid19-for-citation-networks/network_all_datasets.csv')
citation_df.dropna(inplace = True)
citation_df.drop_duplicates(inplace = True)
del citation_df['Unnamed: 0']

In [None]:
citation_df.sample(10)

# Creating a citation network graph

How many articles have researchers cited in this dataset?  What is the most-cited article?

In [None]:
citation_df['cited_article'].describe()

In [None]:
citation_df['cited_article'].describe().top
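If it is unclear why `describe()` answers both questions, here is a small sketch on made-up data.  The column names match the citation data above, but the titles are hypothetical.  On a column of titles, `describe()` reports the number of rows, the number of unique titles, the most frequent title (`top`), and how often it appears (`freq`).

```python
import pandas as pd

# Hypothetical citation pairs; only the column names mirror the real data
toy = pd.DataFrame({
    "source_article": ["Paper A", "Paper B", "Paper C", "Paper C"],
    "cited_article": ["Background 1", "Background 1", "Background 1", "Background 2"],
})

summary = toy["cited_article"].describe()
print(summary["unique"], summary["top"], summary["freq"])

# value_counts() gives the full ranking rather than only the top entry
counts = toy["cited_article"].value_counts()
print(counts)
```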

Using a basic network graph, you can explore the different types of articles citing 'Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia.' The paper discusses the MERS coronavirus.

When combined with citation data for recently-published coronavirus articles, you can quickly see how the MERS and COVID19 research communities are connected. What aspects of MERS are COVID19 researchers focusing on?

Who is citing this popular article?

In [None]:
mers_articles = citation_df[citation_df['cited_article'] == 'Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia'].copy()


In [None]:
mers_articles

In [None]:
to_graph_mers = mers_articles[mers_articles['cited_article'] == 'Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia']

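This dataframe contains every article that has cited 'Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia.'  Because so many studies have cited this article (and have likely cited many other articles as well), one way to thin the graph is to keep only articles cited 2 or more times across all articles.  That filter is left commented out below, so the graphs in this section use every citing article.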

In [None]:
#mean = mers_articles['cited_article'].value_counts().describe()['mean']
#to_graph_mers = mers_articles[mers_articles.groupby('cited_article')['cited_article'].transform('size') > (mean)]
#to_graph_mers = to_graph_mers[to_graph_mers['source_article'].isin(list(to_graph_mers['cited_article']))]

In [None]:
#to_graph_mers

How many COVID19 articles are one hop away from this core network?

In [None]:
covid_articles = list(df[df['is_covid19'] == True]['title'])
covid_hop = citation_df[citation_df['source_article'].isin(covid_articles)]

all_relevant_mers_articles = list(to_graph_mers['source_article']) + list(to_graph_mers['cited_article'])


covid_hop = covid_hop[covid_hop['cited_article'].isin(all_relevant_mers_articles)]

In [None]:
covid_hop
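The "one hop" filter can be hard to picture on the full dataset, so here is a sketch of the same two-step `isin` filter on made-up titles.  Everything in the toy dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical citation pairs
toy_citations = pd.DataFrame({
    "source_article": ["MERS study 1", "MERS study 2", "COVID study 1", "COVID study 2"],
    "cited_article": ["Core MERS paper", "Core MERS paper", "MERS study 1", "Unrelated paper"],
})
covid_titles = ["COVID study 1", "COVID study 2"]

# Core network: the popular article plus everything that cites it
core = toy_citations[toy_citations["cited_article"] == "Core MERS paper"]
core_titles = list(set(core["source_article"]) | set(core["cited_article"]))

# Step 1: keep citations made by COVID19 articles
from_covid = toy_citations[toy_citations["source_article"].isin(covid_titles)]
# Step 2: keep only those that point at a node in the core network
one_hop = from_covid[from_covid["cited_article"].isin(core_titles)]
print(one_hop)
```

Only "COVID study 1" survives, because it cites a paper that is already part of the core network.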

In [None]:
saudi_arabia_g = net.Network(height = 1000, width = 1000, directed = True, notebook = True)


for item in to_graph_mers.iterrows():
    data = item[1]

    saudi_arabia_g.add_node(data['source_article'], label = item[0], title = data['source_article'],color = 'orange') 

    
    saudi_arabia_g.add_node(data['cited_article'], label = item[0], title = data['cited_article'],color = 'orangered') 
   
    saudi_arabia_g.add_edge(data['source_article'], data['cited_article'])
        
for item in covid_hop.iterrows():
    data = item[1]

    saudi_arabia_g.add_node(data['source_article'], label = item[0], title = data['source_article'],color = 'orchid') 

    
    saudi_arabia_g.add_node(data['cited_article'], label = item[0], title = data['cited_article'],color = 'violet') 
   
    saudi_arabia_g.add_edge(data['source_article'], data['cited_article'])
saudi_arabia_g.barnes_hut(gravity=-5000, central_gravity=0, spring_length=200, spring_strength=0.009, damping=0.025, overlap=0)
    

In [None]:
saudi_arabia_g.show('MERS_COVID19_Connections_Graph.html')

That....looks a little chaotic. Once you get past ~1000 nodes, things can get a little out of hand.  In the example above, the graph is displayed as a notebook output. However, if your user can download/view HTML files, you can create graphs that, when downloaded, include tools to alter node color, edges, and the 'physics' (the movement/arrangement of the nodes).  Importantly, by disabling 'physics' you can get the nodes to remain in place.

In [None]:
saudi_arabia_html_g = net.Network(height = 1000, width = 1000, directed = True)


for item in to_graph_mers.iterrows():
    data = item[1]

    saudi_arabia_html_g.add_node(data['source_article'], label = item[0], title = data['source_article'],color = 'orange') 

    
    saudi_arabia_html_g.add_node(data['cited_article'], label = item[0], title = data['cited_article'],color = 'orangered') 
   
    saudi_arabia_html_g.add_edge(data['source_article'], data['cited_article'])
        
for item in covid_hop.iterrows():
    data = item[1]

    saudi_arabia_html_g.add_node(data['source_article'], label = item[0], title = data['source_article'],color = 'orchid') 

    
    saudi_arabia_html_g.add_node(data['cited_article'], label = item[0], title = data['cited_article'],color = 'violet') 
   
    saudi_arabia_html_g.add_edge(data['source_article'], data['cited_article'])
    

saudi_arabia_html_g.show_buttons(filter_=['nodes','edges', 'physics'])
saudi_arabia_html_g.show('HTML_MERS_COVID19_Connections_Graph.html')


Check your output folder for the graph!  Download the file and open it; it will display in a new browser tab.

# A (brief) user guide to pyvis graphs

Although pyvis graphs can be engaging, *how* exactly you move around nodes or zoom in and out might not be all that intuitive if you have never used Gephi or related network visualization tools.  Here are the basics:


**General guide**
* To see the name/article name represented by the node: Hover over it with your cursor or click on the node 

* To zoom in and out on parts of the network: Use the scroll wheel/scroll bar, use two fingers to scroll up/down to zoom in/out

* To pan left/right/up/down: click on the background (white space behind the graph) and drag up/down/left/right as needed

**Graphs uploaded as HTML Files versus Displayed in Jupyter Notebook**

* HTML graph files are key for making graphs with HTML elements, which can allow you to link nodes to the url for their corresponding papers.  Notebook outputs cannot show HTML elements and have very few formatting options for text.  However, for smaller graphs and testing purposes, they are great.

* As you can see from the first notebook output, large graphs have trouble conforming to a layout, making it difficult to click on nodes.  Once you open your HTML graph, scroll down to the section that says 'physics' in large bold letters and, underneath it, uncheck the box that says 'enabled'




# Creating an interactive citation graph using additional metadata

## Prepare data

In [None]:
#filter citation data to only use citations for available articles in the metadata
citation_df = citation_df[citation_df['source_article'].isin(df['title'])]

In [None]:
citation_df.reset_index(inplace = True)

In [None]:
del citation_df['index']

In [None]:
citation_df.head()

However, for the graph to be fully interactive, you would only look at cited articles with available metadata

In [None]:
#create a separate, smaller dataframe containing both source and cited articles in the covid19_df. 
in_metadata_citation_df = citation_df[citation_df['cited_article'].isin(df['title'])].copy()

In [None]:
in_metadata_citation_df.reset_index(inplace = True)

In [None]:
del in_metadata_citation_df['index']

In [None]:
in_metadata_citation_df

## In this example, let's explore the network graph of all new(dated 2019-2020)/prepublished articles, as many of those will be discussing COVID-19.

## Prepare the subset of data to graph

In [None]:
#titles of articles published (or submitted) from 2019 to the present
#(many of these are pre-publication articles, as COVID19 articles are being rapidly submitted to journals)
recent_covid_articles = df[df['publish_year'].isin([2019, 2020])]['title']


#keep citations where the cited article is recent...
recent_covid_df = in_metadata_citation_df[in_metadata_citation_df['cited_article'].isin(recent_covid_articles)].copy()


#...plus citations where the source article is recent
#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
recent_covid_df = pd.concat([recent_covid_df, in_metadata_citation_df[in_metadata_citation_df['source_article'].isin(recent_covid_articles)]])
recent_covid_df.drop_duplicates(inplace = True)
recent_covid_df.dropna(inplace = True)

In [None]:
recent_covid_df.reset_index(inplace = True)

In [None]:
del recent_covid_df['index']

In [None]:
recent_covid_df

In [None]:
recent_covid_df['cited_article'].describe(), recent_covid_df['source_article'].describe() 

In [None]:
recent_covid_df['cited_article'].value_counts().describe()

A while back, I mentioned that network graphs can help users identify highly cited or well-regarded sources.  Let's start by identifying articles whose citation count is greater than the mean (here, 2 or more citations)

In [None]:
mean = recent_covid_df['cited_article'].value_counts().describe()['mean']
graph_to_plot = recent_covid_df[recent_covid_df.groupby('cited_article')['cited_article'].transform('size') > (mean)]
graph_to_plot
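The `groupby(...).transform('size')` line does a lot in one step, so here is a sketch of it on made-up data (the single-letter titles are hypothetical).  It labels every row with the number of times its cited article appears, which lets you filter rows without losing the source/cited pairing.

```python
import pandas as pd

# Hypothetical citation pairs: X is cited 3 times, Y twice, Z once
toy = pd.DataFrame({
    "source_article": ["A", "B", "C", "A", "B", "D"],
    "cited_article": ["X", "X", "X", "Y", "Y", "Z"],
})

# Mean number of citations per cited article: (3 + 2 + 1) / 3 = 2.0
mean_citations = toy["cited_article"].value_counts().mean()

# One value per row: how often that row's cited article occurs overall
times_cited = toy.groupby("cited_article")["cited_article"].transform("size")

# Strictly greater than the mean, so Y (exactly 2) is dropped along with Z
above_mean = toy[times_cited > mean_citations]
print(above_mean)
```

Note that the comparison is strict: an article cited exactly the mean number of times is filtered out.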

In [None]:
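#assign() returns a new dataframe, which avoids writing into a filtered slice of recent_covid_df
graph_to_plot = graph_to_plot.assign(title = graph_to_plot['source_article'])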
merged_graph_to_plot = pd.merge(graph_to_plot, df, on = 'title')
merged_graph_to_plot.drop_duplicates(inplace = True)

In [None]:
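#treat the string 'None' as missing, then fill missing years with 2020
#(assigning back to the column avoids in-place edits on a column accessed as an attribute)
merged_graph_to_plot['cited_article_year'] = merged_graph_to_plot['cited_article_year'].replace('None', np.nan).fillna(2020)
merged_graph_to_plot['publish_year'] = merged_graph_to_plot['publish_year'].fillna(2020)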

In [None]:
merged_graph_to_plot.head()

# Create the Graph!

In [None]:
notebook_display_g = net.Network(height = 1000, width = 1000, directed = True,notebook = True)
html_link_g = net.Network(height = 1000, width = 1000, directed = True)


for item in merged_graph_to_plot.iterrows():

    data = item[1]
    
    
    #color code nodes according to whether or not they are a paper about the COVID-19
    if data['is_covid19']:
        sourceNodeColor = "lightcoral"
    else:
        sourceNodeColor = "lightskyblue"
    
    #create node with an HTML-formatted "Title" containing information about each node (citing an article)
    source_html_title = '<a href="' + data['url'] + '" target="_blank">'+ data['source_article'] + '</a>' + "<p><b>Year Published or Submitted:</b></p> {0}<p><b>Authors:</b></p>{1}<p><b>Abstract:</b></p>{2}".format(data['publish_year'], data['authors'], data['abstract'])   
    notebook_display_g.add_node(data['source_article'], label = item[0], title = data['source_article'],color = sourceNodeColor) 
    
    html_link_g.add_node(data['source_article'], label = item[0], title = source_html_title,color = sourceNodeColor)
    

    
    #look up the cited article's metadata once and reuse the row
    cited_row = df[df['title'] == data['cited_article']].iloc[0]
    
    if cited_row['is_covid19']:
        citeNodeColor = "darkred"
    else:
        citeNodeColor = "darkblue"
        
    cited_url = cited_row['url']
    cited_date = cited_row['publish_year']
    cited_authors = cited_row['authors']
    cited_abstract = cited_row['abstract']
    
    
    #create node with an HTML-formatted "Title" containing information about each node (an article being cited)
    cited_html_title = '<a href="' + cited_url + '" target="_blank">'+ data['cited_article'] + '</a>' + "<p><b>Year Published or Submitted:</b></p> {0}<p><b>Authors:</b></p>{1}<p><b>Abstract:</b></p>{2}".format(cited_date, cited_authors, cited_abstract)
    notebook_display_g.add_node(data['cited_article'], label = item[0], title = data['cited_article'], color = citeNodeColor)
    html_link_g.add_node(data['cited_article'], label = item[0], title = cited_html_title, color = citeNodeColor)
    
    
    
    
    notebook_display_g.add_edge(data['source_article'], data['cited_article'])
    html_link_g.add_edge(data['source_article'], data['cited_article'])
    
   

    
    
notebook_display_g.barnes_hut(gravity=-5000, central_gravity=0, spring_length=200, spring_strength=0.009, damping=0.025, overlap=0)
html_link_g.barnes_hut(gravity=-5000, central_gravity=0, spring_length=200, spring_strength=0.009, damping=0.025, overlap=0)
html_link_g.show_buttons(filter_=['nodes','edges', 'physics'])

html_link_g.show('COVID19_Graph_Interactive.html')

In [None]:
notebook_display_g.show('COVID19_Notebook_Graph.html')

Check your output folder again to download the HTML version of this graph.  Looking over the graph (without any details besides the article names) you can identify some highly-cited COVID19 papers, and explore their relationship to non-COVID19 research

# Applying citation networks to Topic Modelling 

Finally, let's go over an example of how to incorporate citation graphs into a text-mining approach.  

## Back to Topic Modelling! (Code by Daniel Wolffram)

We consider the text body, but the approach could also be applied to the abstracts only.

In [None]:
all_texts = df.body_text

In [None]:
# example snippet
all_texts[0][:500]

# Latent Dirichlet Allocation (Wolffram)

For preprocessing we use [scispaCy](https://allenai.github.io/scispacy/), which is a Python package containing [spaCy](https://spacy.io) models for processing biomedical, scientific or clinical text.

In [None]:
# large scispaCy model (en_core_sci_lg)
nlp = en_core_sci_lg.load(disable=["tagger", "parser", "ner"])
nlp.max_length = 2000000

In [None]:
def spacy_tokenizer(sentence):
    return [word.lemma_ for word in nlp(sentence) if not (word.like_num or word.is_stop or word.is_punct or word.is_space or len(word)==1)]

In [None]:
# New stop words list 
customize_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 'CZI',
    '-PRON-'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

In [None]:
filepath = '../input/topic-modeling-finding-related-articles/'

Generate files/models if they are not there yet.

In [None]:
#vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, min_df=2)
#data_vectorized = vectorizer.fit_transform(tqdm(all_texts))

In [None]:
#data_vectorized.shape

In [None]:
# vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, max_features=800000)
# data_vectorized = vectorizer.fit_transform(tqdm(all_texts))

In [None]:
# data_vectorized.shape # with bigrams: 6428134

# data_vectorized.shape # all 1.2 mio?

In [None]:
# most frequent words
#word_count = pd.DataFrame({'word': vectorizer.get_feature_names(), 'count': np.asarray(data_vectorized.sum(axis=0))[0]})

#word_count.sort_values('count', ascending=False).set_index('word')[:20].sort_values('count', ascending=True).plot(kind='barh')

In [None]:
#joblib.dump(vectorizer, 'vectorizer.csv')
#joblib.dump(data_vectorized, 'data_vectorized.csv')

In [None]:
if not (isfile(filepath + 'vectorizer.csv') & isfile(filepath + 'data_vectorized.csv')):
    print('Files not there: generating')
    vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, max_features=800000)
    data_vectorized = vectorizer.fit_transform(tqdm(all_texts))
    joblib.dump(vectorizer, 'vectorizer.csv')
    joblib.dump(data_vectorized, 'data_vectorized.csv')

else:
    vectorizer = joblib.load(filepath + 'vectorizer.csv')
    data_vectorized = joblib.load(filepath + 'data_vectorized.csv')

In [None]:
#lda = LatentDirichletAllocation(n_components=50, random_state=0)
#lda.fit(data_vectorized)
#joblib.dump(lda, 'lda.csv')

In [None]:
# # Train/Load Model
if not (isfile(filepath + 'lda.csv')):
    print('File not there: generating')
    lda = LatentDirichletAllocation(n_components=50, random_state=0)
    lda.fit(data_vectorized)

    joblib.dump(lda, 'lda.csv')

else:
    lda = joblib.load(filepath + 'lda.csv') 

## Discovered Topics (Wolffram)

In [None]:
def print_top_words(model, vectorizer, n_top_words):
    feature_names = vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        message = "\nTopic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [None]:
print_top_words(lda, vectorizer, n_top_words=25)

Each article is a mixture of topics / a distribution over topics

In [None]:
#doc_topic_dist = pd.DataFrame(lda.transform(data_vectorized))
#doc_topic_dist.to_csv('doc_topic_dist.csv', index=False)

In [None]:
if not (isfile(filepath + 'doc_topic_dist.csv')):
    print('File not there: generating')
    doc_topic_dist = pd.DataFrame(lda.transform(data_vectorized))
    doc_topic_dist.to_csv('doc_topic_dist.csv', index=False)
else:
    doc_topic_dist = pd.read_csv(filepath + 'doc_topic_dist.csv')  

In [None]:
doc_topic_dist[df.paper_id == '90b5ecf991032f3918ad43b252e17d1171b4ea63']


# Get Nearest Papers (in Topic Space) (Wolffram)

In [None]:
is_covid19_article = df.body_text.str.contains('COVID-19|SARS-CoV-2|2019-nCov|SARS Coronavirus 2|2019 Novel Coronavirus')

In [None]:
def get_k_nearest_docs(doc_dist, k=5, lower=1950, upper=2020, only_covid19=False, get_dist=False):
    '''
    doc_dist: topic distribution (sums to 1) of one article
    
    Returns the index of the k nearest articles (as measured by Jensen–Shannon distance in topic space). 
    '''
    
    relevant_time = df.publish_year.between(lower, upper)
    
    if only_covid19:
        temp = doc_topic_dist[relevant_time & is_covid19_article]
        
        #print(temp)
        
    else:
        temp = doc_topic_dist[relevant_time]
        #print(temp)
         
    distances = temp.apply(lambda x: jensenshannon(x, doc_dist), axis=1)
    k_nearest = distances[distances != 0].nsmallest(n=k).index
    #print(k_nearest)
    
    if get_dist:
        k_distances = distances[distances != 0].nsmallest(n=k)
        return k_nearest, k_distances
    else:
        return k_nearest
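To see what the nearest-paper lookup does without loading the LDA model, here is a sketch on a made-up topic table with four "articles" and three "topics."  The numbers are invented; only the steps (row-wise distance, drop the zero distance of the query itself, take the k smallest) follow the function above.

```python
import pandas as pd
from scipy.spatial.distance import jensenshannon

# Hypothetical topic distributions: one row per article, each row sums to 1
toy_topics = pd.DataFrame(
    [[0.8, 0.1, 0.1],   # article 0: the query
     [0.7, 0.2, 0.1],   # article 1: very similar to the query
     [0.1, 0.1, 0.8],   # article 2: a different topic entirely
     [0.6, 0.3, 0.1]]   # article 3: somewhat similar
)
query = toy_topics.iloc[0]

# Distance from every article to the query
distances = toy_topics.apply(lambda row: jensenshannon(row, query), axis=1)

# A distance of exactly 0 is the query itself, so it is excluded
nearest = distances[distances != 0].nsmallest(n=2).index
print(list(nearest))
```

One side effect of the `distances != 0` filter is that any other article with an identical topic distribution would be excluded as well.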

In [None]:
d = get_k_nearest_docs(doc_topic_dist[df.paper_id == '90b5ecf991032f3918ad43b252e17d1171b4ea63'].iloc[0])

#sb.kdeplot(d)

In [None]:
def plot_article_dna(paper_id, width=20):
    t = df[df.paper_id == paper_id].title.values[0]
    doc_topic_dist[df.paper_id == paper_id].T.plot(kind='bar', legend=None, title=t, figsize=(width, 4))
    plt.xlabel('Topic')

def compare_dnas(paper_id, recommendation_id, width=20):
    t = df[df.paper_id == recommendation_id].title.values[0]
    temp = doc_topic_dist[df.paper_id == paper_id]
    ymax = temp.max(axis=1).values[0]*1.25
    temp = pd.concat([temp, doc_topic_dist[df.paper_id == recommendation_id]])
    temp.T.plot(kind='bar', title=t, figsize=(width, 4), ylim= [0, ymax])
    plt.xlabel('Topic')
    plt.legend(['Selection', 'Recommendation'])

# compare_dnas('90b5ecf991032f3918ad43b252e17d1171b4ea63', 'a137eb51461b4a4ed3980aa5b9cb2f2c1cf0292a')

def dna_tabs(paper_ids):
    k = len(paper_ids)
    outs = [widgets.Output() for i in range(k)]

    tab = widgets.Tab(children = outs)
    tab_titles = ['Paper ' + str(i+1) for i in range(k)]
    for i, t in enumerate(tab_titles):
        tab.set_title(i, t)
    display(tab)

    for i, t in enumerate(tab_titles):
        with outs[i]:
            ax = plot_article_dna(paper_ids[i])
            plt.show(ax)

def compare_tabs(paper_id, recommendation_ids):
    k = len(recommendation_ids)
    outs = [widgets.Output() for i in range(k)]

    tab = widgets.Tab(children = outs)
    tab_titles = ['Paper ' + str(i+1) for i in range(k)]
    for i, t in enumerate(tab_titles):
        tab.set_title(i, t)
    display(tab)

    for i, t in enumerate(tab_titles):
        with outs[i]:
            ax = compare_dnas(paper_id, recommendation_ids[i])
            plt.show(ax)

# Search related papers to a chosen one (Wolffram)

As a similarity measure we use 1 - Jensen-Shannon distance.
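A short sketch of what that score looks like at the extremes, using `scipy.spatial.distance.jensenshannon` (the same function imported at the top of the notebook).  The distributions are made up.  SciPy returns the Jensen-Shannon *distance* (the square root of the divergence) and uses the natural logarithm unless `base` is given, so with the default settings the distance tops out at sqrt(ln 2), roughly 0.83, rather than 1.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Identical distributions: distance 0, so similarity 1
same = jensenshannon([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])

# Distributions with no topic in common: the largest possible distance
disjoint = jensenshannon([1.0, 0.0], [0.0, 1.0])

# With base=2 the distance is rescaled to lie between 0 and 1
disjoint_base2 = jensenshannon([1.0, 0.0], [0.0, 1.0], base=2)

print(1 - same, 1 - disjoint, 1 - disjoint_base2)
```

In other words, with the default base a similarity of about 0.17 already means "nothing in common"; passing `base=2` would be one way to stretch the score over the full 0 to 1 range.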

In [None]:
def recommendation(paper_id, k=5, lower=1950, upper=2020, only_covid19=False, plot_dna=False):
    '''
    Returns the title of the k papers that are closest (topic-wise) to the paper given by paper_id.
    '''
    
    #print(df.title[df.paper_id == paper_id].values[0])

    recommended, dist = get_k_nearest_docs(doc_topic_dist[df.paper_id == paper_id].iloc[0], k, lower, upper, only_covid19, get_dist=True)
    recommended = df.iloc[recommended].copy()
    recommended['similarity'] = 1 - dist
    
    h = '<br/>'.join(['<a href="' + l + '" target="_blank">'+ n + '</a>' +' (Similarity: ' + "{:.2f}".format(s) + ')' for l, n, s in recommended[['url','title', 'similarity']].values])
    display(HTML(h))
    
  
    if plot_dna:
        compare_tabs(paper_id, recommended.paper_id.values)
 
    return recommended

# Given a list of recommended papers, create a citation network

All right!  Now that we have the ability to look for similar articles, let's use some helper functions to create the graph.  The code is mostly similar to that used for the COVID19 graph, but broken down into functions for easier use.

As mentioned previously, displaying citation networks could be a part of any text-mining widget.  For example, you could repurpose the functions below to intake a dataframe of articles in the cluster (or clusters) and label and color code nodes according to their cluster number.

With these functions, we are working with 3 dataframes:

* A dataframe of recommended articles (a subsection of 'df', or all of the metadata retrieved using Wolffram's method) 
* The dataframe of article citations (where both source/cited articles have available metadata) 
* The dataframe of metadata for all of the articles 

In [None]:
"""
Given a dataframe of recommended articles (including their metadata),
retrieve citations associated with these papers and build and return a dataframe of citations
"""


def recommended_paper_citation_network(df_recommended):
    #get all of the articles cited by the recommended papers
    recommended_citations = in_metadata_citation_df[in_metadata_citation_df['source_article'].isin(df_recommended['title'])]
    
    # who is citing the same papers as the recommended papers?
    other_source_papers = in_metadata_citation_df[in_metadata_citation_df['cited_article'].isin(recommended_citations['cited_article'])]
  

    #who is citing the recommended papers that appear as sources above?
    second_network_hop = in_metadata_citation_df[in_metadata_citation_df['cited_article'].isin(recommended_citations['source_article'])]


    #who is citing the recommended papers themselves?
    citing_the_recommended = in_metadata_citation_df[in_metadata_citation_df['cited_article'].isin(df_recommended['title'])]


    #combine all of the dataframes (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    recommended_citations = pd.concat([recommended_citations, other_source_papers, second_network_hop, citing_the_recommended])
    recommended_citations.drop_duplicates(inplace = True)
    recommended_citations.dropna(inplace = True)
    
    return recommended_citations
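Here is a sketch of the same idea on made-up data, to show which citations end up in the neighborhood of a recommended paper.  The titles and variable names are hypothetical, and the sketch leaves out the second-hop step for brevity.

```python
import pandas as pd

# Hypothetical citation pairs
toy_citations = pd.DataFrame({
    "source_article": ["Rec 1", "Other 1", "Other 2", "Other 3"],
    "cited_article": ["Background", "Background", "Rec 1", "Unrelated"],
})
recommended_titles = ["Rec 1"]

# What do the recommended papers cite?
cited_by_recommended = toy_citations[toy_citations["source_article"].isin(recommended_titles)]
# Who else cites those same papers?
shared_background = toy_citations[toy_citations["cited_article"].isin(cited_by_recommended["cited_article"])]
# Who cites the recommended papers themselves?
citing_recommended = toy_citations[toy_citations["cited_article"].isin(recommended_titles)]

# Stack the pieces and drop rows that were picked up more than once
neighborhood = pd.concat([cited_by_recommended, shared_background, citing_recommended]).drop_duplicates()
print(neighborhood)
```

The citation from "Other 3" to "Unrelated" is left out because it touches neither a recommended paper nor anything a recommended paper cites.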



In [None]:
"""
Given the title of the paper, use the covid_df to assign both source and cited nodes a color
Nodes that have been recommended should be given a separate color
Other nodes are color-coded according to whether or not they mention the Covid19

"""

def assign_source_node_color(source_node_name, master_metadata_df, recommended_df):
    #print("Node name: ", source_node_name)
    #print(master_metadata_df[master_metadata_df['title'] == source_node_name])
    covid_node = master_metadata_df[master_metadata_df['title'] == source_node_name].iloc[0]['is_covid19']
    #covid_node = master_metadata_df[master_metadata_df['title'] == source_node_name].loc['is_covid19']
    if source_node_name in list(recommended_df['title']):
        #print('NODE SHOULD BE GREEN')
        #print(source_node_name)
        sourceNodeColor = "palegreen"    
    elif covid_node:
        sourceNodeColor = "lightcoral"
    else:
        sourceNodeColor = "lightskyblue"
    return sourceNodeColor
        
        
def assign_cited_node_color(cited_node_name, master_metadata_df, recommended_df):
    
    #print("Node name: ", cited_node_name)
    #print(master_metadata_df[master_metadata_df['title'] == cited_node_name])
    covid_node = master_metadata_df[master_metadata_df['title'] == cited_node_name].iloc[0]['is_covid19']
    #covid_node = master_metadata_df[master_metadata_df['title'] == cited_node_name].loc['is_covid19']    
    
    if cited_node_name in list(recommended_df['title']):
        #print('NODE SHOULD BE GREEN')
        #print(cited_node_name)
        citedNodeColor = "palegreen"    
    elif covid_node:
        citedNodeColor = "darkred"
    else:
        citedNodeColor = "darkblue"
    
    return citedNodeColor

In [None]:
"""
Given the title of the paper, use the covid_df to create an HTML 'title', such that,
when a user uploads a network graph as an HTML file, they can click on nodes and see
basic information about the papers, as well as a link to click on and read the full paper

"""

def create_HTML_Title(node_name, master_metadata_df, recommended_df):
    
    #look up the paper's metadata row once; its fields fill in the HTML element
    paper = master_metadata_df[master_metadata_df['title'] == node_name].iloc[0]
    url = paper['url']
    date = paper['publish_year']
    authors = paper['authors']
    abstract = paper['abstract']
    
    if node_name in list(recommended_df['title']):
        similarity_value = recommended_df[recommended_df['title'] == node_name].iloc[0]['similarity']
        html_title = '<a href="' + url + '" target="_blank">'+ node_name + '</a>' + "<p><b>Similarity:</b></p> {0}<p><b>Year Published or Submitted:</b></p> {1}<p><b>Authors:</b></p>{2}<p><b>Abstract:</b></p>{3}".format(similarity_value, date, authors, abstract) 
        
    else:
        html_title = '<a href="' + url + '" target="_blank">'+ node_name + '</a>' + "<p><b>Year Published or Submitted:</b></p> {0}<p><b>Authors:</b></p>{1}<p><b>Abstract:</b></p>{2}".format(date, authors, abstract)
        
    return html_title
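One thing to watch for with the string concatenation above: if a paper has no url in the metadata, pandas stores the gap as NaN (a float), and `'<a href="' + url` raises a `TypeError`.  Below is a sketch of a more defensive tooltip builder.  `build_tooltip` is a hypothetical helper, not part of the notebook, and it only uses the standard library.

```python
import html

def build_tooltip(title, url=None, year=None, authors=None, abstract=None):
    """Build an HTML tooltip, tolerating missing metadata (None or float NaN)."""
    def shown(value):
        # NaN is the only value that is not equal to itself
        if value is None or value != value:
            return "Not available"
        # Escape characters such as < and & so titles cannot break the markup
        return html.escape(str(value))

    link = shown(title)
    # Only wrap the title in a link when a url is actually present
    if url is not None and url == url:
        link = '<a href="{0}" target="_blank">{1}</a>'.format(html.escape(str(url)), link)

    return (link
            + "<p><b>Year Published or Submitted:</b></p> " + shown(year)
            + "<p><b>Authors:</b></p>" + shown(authors)
            + "<p><b>Abstract:</b></p>" + shown(abstract))

tip = build_tooltip("A <b> title", url=float("nan"), year=2020)
linked = build_tooltip("T", url="http://example.org")
print(tip)
```

This sketch handles `None` and float NaN only; other missing-value markers would need their own check.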

In [None]:
def create_network_graph(citation_df, covid_df, recommended_by_function):
    
    #citation_df: the small df of every cited/citing article that will be graphed
    #covid_df: the covid19 df
    #recommended_by_function: the df returned by the "recommendation(...)" function
    
    
    function_recc_notebook_display_g = net.Network(height = 1000, width = 1000, directed = True,notebook = True)
    function_recc_html_link_g = net.Network(height = 1000, width = 1000, directed = True)
    
    
    for item in citation_df.iterrows():
        data = item[1]
        
        #define source_node_color
        color_source = assign_source_node_color(data['source_article'], covid_df, recommended_by_function)
        #print(data['source_article'], color_source)]    
        
        #create HTML 'title' for source node
        source_html = create_HTML_Title(data['source_article'], covid_df, recommended_by_function)  
        
        #add source nodes
        function_recc_notebook_display_g.add_node(data['source_article'], label = item[0], title = data['source_article'],color = color_source) 
        function_recc_html_link_g.add_node(data['source_article'], label = item[0], title = source_html,color = color_source)
    

        #define cited_node_color
        color_cited = assign_cited_node_color(data['cited_article'], covid_df, recommended_by_function)
        #print(data['cited_article'], color_cited)
        
        #define create HTML 'title' for cited node
        cited_html = create_HTML_Title(data['cited_article'], covid_df, recommended_by_function)
        
        
        #add cited node
        function_recc_notebook_display_g.add_node(data['cited_article'], label = item[0], title = data['cited_article'],color = color_cited) 
        function_recc_html_link_g.add_node(data['cited_article'], label = item[0], title = cited_html,color = color_cited)
        
        
        
        
        #add the edge
        function_recc_notebook_display_g.add_edge(data['source_article'], data['cited_article'])
        function_recc_html_link_g.add_edge(data['source_article'], data['cited_article'])
        
       
    
    #save the graphs once, after every node and edge has been added
    function_recc_notebook_display_g.show('Recommended_Notebook_Graph.html')
    
    function_recc_html_link_g.show_buttons(filter_=['nodes','edges', 'physics'])
    function_recc_html_link_g.show('Recommended_HTML_Interactive_Graph.html')
       
        
    #return the graph so it can be run in the following jupyter cell
    return function_recc_notebook_display_g
        
        
 

In [None]:
recommended = recommendation('a137eb51461b4a4ed3980aa5b9cb2f2c1cf0292a', k=20, plot_dna=False)


network_graph_df = recommended_paper_citation_network(recommended)


graph = create_network_graph(network_graph_df , df, recommended)

In [None]:
graph.show('Recommended_Notebook_Graph_1.html')

All of the recommended nodes (and their scores) are in green.  Like for the other graphs, check your output folder for the html graph file.

I hope this is informative.  There are definitely improvements that can be made to make the code more universal to any type of input, particularly when color-coding the nodes. 

Please use and modify this for your own project if you are interested!