# Cooking RISOTTO artifacts

The purpose of this notebook is to build the binary files used by RISOTTO's GUI.
We'll need to build at least three Pandas DataFrames:

- `papers`:
contains the data of the papers, including the `cord_uid` identifier and the PageRank scores.

- `papers_topics`:
contains the associations between papers, topics, and subtopics.
The papers are referenced by their `cord_uid` identifier.

- `topics`:
contains the word distributions of the different topics and subtopics.

All of these DataFrames are stored in a single HDF file (`artifacts.hdf` in the code below), each under its own key.
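
To make the relationship between the three DataFrames concrete, here is a minimal sketch with toy data. Only `cord_uid` and the PageRank score come from the description above; the column names `title`, `topic`, `subtopic`, `word`, and `weight`, and all the values, are assumptions made for illustration and may not match the real artifacts.

```python
import pandas as pd

# Toy stand-in for `papers`: indexed by cord_uid, with a PageRank score.
# All identifiers and values here are invented for illustration.
papers = pd.DataFrame(
    {"title": ["Paper A", "Paper B"], "pagerank": [0.6, 0.4]},
    index=pd.Index(["uid_a", "uid_b"], name="cord_uid"),
)

# Toy stand-in for `papers_topics`: one row per (paper, topic, subtopic)
# association. The column names are hypothetical.
papers_topics = pd.DataFrame(
    {
        "cord_uid": ["uid_a", "uid_b", "uid_b"],
        "topic": [0, 0, 1],
        "subtopic": [2, 1, 0],
    }
)

# Toy stand-in for `topics`: a word distribution per (topic, subtopic).
# The column names are hypothetical.
topics = pd.DataFrame(
    {
        "topic": [0, 0, 1],
        "subtopic": [1, 2, 0],
        "word": ["virus", "vaccine", "protein"],
        "weight": [0.30, 0.25, 0.40],
    }
)

# Resolve each association to its paper through the shared cord_uid key.
papers_with_topics = papers_topics.join(papers, on="cord_uid")
print(papers_with_topics[["cord_uid", "title", "topic", "subtopic"]])
```

Each frame can then be written to the same HDF file with `DataFrame.to_hdf`, using a different `key` per frame.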

In [1]:
# default_exp artifacts

In [4]:
# Install dependencies
%load_ext autoreload
%autoreload 2
!pip install -q -r requirements.txt

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
You are using pip version 19.0.3, however version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
import pandas as pd
import networkx as nx

from risotto.references import load_papers_from_metadata_file, process_references, build_papers_reference_graph


CORD19_DATASET_FOLDER = "./datasets/CORD-19-research-challenge"

In [4]:
# exports

def load_papers_metadata(dataset_folder):
    metadata_df = pd.read_csv(
        f"{dataset_folder}/metadata.csv",
        index_col="cord_uid"
    )
    return metadata_df


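def build_papers_artifact(dataset_folder, should_dump=True):
    """Join PageRank scores with the paper metadata, one row per `cord_uid`.

    Returns `(unique_df, pagerank_df, metadata_df)`. When `should_dump` is
    true, `unique_df` is also written to "artifacts.hdf" under key "papers".
    """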
    metadata_df = load_papers_metadata(dataset_folder)
    papers, _ = load_papers_from_metadata_file(dataset_folder)
    process_references(papers)
    references_graph = build_papers_reference_graph(papers)
    pagerank = nx.pagerank(references_graph)
    
    # Build PageRank DataFrame dict
    pagerank_df_dict = {"cord_uid": [], "pagerank": []}
    for paper, score in pagerank.items():
        pagerank_df_dict["cord_uid"].append(paper._metadata_row.name)
        pagerank_df_dict["pagerank"].append(score)
    
    pagerank_df = pd.DataFrame.from_dict(pagerank_df_dict)
    pagerank_df = pagerank_df.set_index("cord_uid")
    
    joined_df = pagerank_df.join(metadata_df)
    unique_df = joined_df.loc[~joined_df.index.duplicated(keep='first')]
    
    if should_dump:
        unique_df.to_hdf("artifacts.hdf", key="papers")
    
    return unique_df, pagerank_df, metadata_df


def load_papers_artifact():
    return pd.read_hdf("artifacts.hdf", key="papers")


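def build_papers_topics_artifact(dataset_folder, should_dump=True):
    """
    TODO: Add code.
    """
    # `joined_df` is never built in this stub, so fail explicitly instead of
    # raising a NameError. Once implemented, dump it to "artifacts.hdf" under
    # key="papers_topics" when `should_dump` is set, then return it.
    raise NotImplementedError("build_papers_topics_artifact is not implemented yet")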


def load_papers_topics_artifacts():
    return pd.read_hdf("artifacts.hdf", key="papers_topics")


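def build_topics_artifact(dataset_folder, should_dump=True):
    """
    TODO: Add code.
    """
    # `joined_df` is never built in this stub, so fail explicitly instead of
    # raising a NameError. Once implemented, dump it to "artifacts.hdf" under
    # key="topics" when `should_dump` is set, then return it.
    raise NotImplementedError("build_topics_artifact is not implemented yet")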


def load_topics_artifacts():
    return pd.read_hdf("artifacts.hdf", key="topics")
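
The join and de-duplication step at the end of `build_papers_artifact` can be checked in isolation. The sketch below uses toy data (all identifiers, titles, and scores are invented) and assumes that duplicates arise because the metadata contains more than one row for the same `cord_uid`.

```python
import pandas as pd

# Hypothetical PageRank scores, indexed by cord_uid.
pagerank_df = pd.DataFrame(
    {"pagerank": [0.5, 0.3, 0.2]},
    index=pd.Index(["uid_a", "uid_b", "uid_c"], name="cord_uid"),
)

# Hypothetical metadata in which uid_b appears twice and uid_c is missing.
metadata_df = pd.DataFrame(
    {"title": ["Paper A", "Paper B", "Paper B (duplicate row)"]},
    index=pd.Index(["uid_a", "uid_b", "uid_b"], name="cord_uid"),
)

# DataFrame.join is a left join on the index by default: every scored paper
# is kept, uid_c gets NaN metadata, and uid_b is emitted once per matching
# metadata row.
joined_df = pagerank_df.join(metadata_df)

# Keep only the first row for each cord_uid, as in build_papers_artifact.
unique_df = joined_df.loc[~joined_df.index.duplicated(keep="first")]
print(unique_df)
```

Because the join starts from `pagerank_df`, papers that are absent from the reference graph do not appear in the result, while scored papers without metadata keep their row with missing values.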

In [5]:
papers_artifact, pagerank_df, metadata_df = build_papers_artifact(CORD19_DATASET_FOLDER)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['sha', 'source_x', 'title', 'doi', 'pmcid', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'WHO #Covidence',
       'full_text_file', 'url'],
      dtype='object')]

  encoding=encoding,


In [15]:
# tell nbdev to generate library from notebooks
from nbdev.export import *
notebook2script()

Converted 00_downloader.ipynb.
Converted 01_references.ipynb.
Converted 02_representations_and_lda.ipynb.
Converted 03_hierarchical_topic_modelling.ipynb.
Converted 04_lda2vec.ipynb.
Converted 05_cook_artifacts.ipynb.
Converted 98_risotto_precook.ipynb.
Converted 99_risotto_gui.ipynb.
Converted index.ipynb.
