## Question answering with LDA
In this notebook, we attempt to use the trained [LDA topics model](./LDA.ipynb) to find information for the tasks. More specifically, for each task:

1) We define a set of keywords which define the task and treat the task a document consisting of these words.

2) Calculate the topic distribution of the "document" of the task.

3) Find the k abstracts with the most similar topic distributions to the task topic distribution.

In [41]:
import os
import json
from time import time
from collections import Counter

%matplotlib inline

import pickle 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import jensenshannon

### Load LDA model and papers data

In [42]:
print("Loading LDA model...")
lda_dir = os.path.join('data', 'lda_10_topics')
with open(os.path.join(lda_dir, 'model.pkl'), 'rb') as f:
    lda = pickle.load(f)
with open(os.path.join(lda_dir, 'vectorizer.pkl'), 'rb') as f:
    vectorizer = pickle.load(f)
with open(os.path.join(lda_dir, 'count_data.pkl'), 'rb') as f:
    count_data = pickle.load(f)

print("Loading all paper data...")
with open('data/preprocessed_text.json', 'r') as f:
    json_data = json.load(f)
    
paper_ids = list(json_data.keys())
index_to_paperid_map = {ind: paper_ids[ind] for ind in range(len(paper_ids))}

paper_topic_dist = lda.transform(count_data) # topic distribution of each paper

assert(len(paper_ids) == paper_topic_dist.shape[0])
paper_topic_dist.shape

(33375, 10)

In [51]:
def top_k_similar(paper_topic_dist: np.ndarray, query_topic_dist: np.ndarray, 
                  metric='euclidean', k=10, return_distances=False) -> list:
    """
        Input:
            paper_topic_dist - the LDA topic distribution of each paper
            query_topic_dist - the LDA topic distribution of the query
        Returns:
            Indicies of the papers closest in topic distribution to the query
    """
    assert(paper_topic_dist.ndim == 2)
    assert(paper_topic_dist.shape[1] == query_topic_dist.size)
    
    distances = pairwise_distances(paper_topic_dist, query_topic_dist.reshape(1, -1))
    distances = distances.flatten()
    
    assert(distances.shape[0] == paper_topic_dist.shape[0])
    
    indexed_distances = [(i, distances[i]) for i in range(distances.shape[0])]
    sorted_indexed_distances = sorted(indexed_distances, key=lambda p: p[1])
    ind, d = zip(*sorted_indexed_distances[:k])
    
    if return_distances:
        return ind, d
    else:
        return ind

### Task 1
#### What is known about transmission, incubation, and environmental stability? What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?
* Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
* Prevalence of asymptomatic shedding and transmission (e.g., particularly children)
* Seasonality of transmission
* Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
* Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).
* Disease models, including animal models for infection, disease and transmission
* Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings

In [28]:
task1_questions = [
    "Range of incubation periods for the disease in humans and how this varies across age and health status and how long individuals are contagious, even after recovery.",
    "Prevalence of asymptomatic shedding and transmission particularly children", 
    "Seasonality season of transmission",
    "Physical science of the coronavirus charge distribution, adhesion to hydrophilic phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding",
    "Persistence and stability on a multitude of substrates and sources nasal discharge sputum  urine  fecal matter  blood",
    "Disease models including animal models for infection disease and transmission",
    "Effectiveness of movement control strategies to prevent secondary transmission in health care healthcare and community settings"]

task1_document = '. '.join(task1_questions)
task1_counts = vectorizer.transform([task1_document])[0]
task1_topic_dist = lda.transform(task1_counts.reshape(1, -1))[0]

assert(task1_topic_dist.sum() == 1)
task1_topic_dist

array([0.14472785, 0.04632768, 0.21557333, 0.0307841 , 0.01620475,
       0.00092609, 0.0347647 , 0.0557862 , 0.40437845, 0.05052684])

In [64]:
def display_closest_papers(indices: list, distances: list):
    for ind, d in zip(indices, distances):
        print("------------------------------------------------")
        print(d)
        paper_id = index_to_paperid_map[ind]
        abstract = json_data[paper_id]['abstract']
        title = json_data[paper_id]['title']
        print("Title: ", title)
        print("Abstract: ", abstract)

In [65]:
indices, distances = top_k_similar(paper_topic_dist, task1_topic_dist, 
                                   return_distances=True, k=5, metric=jensenshannon)
display_closest_papers(indices, distances)

------------------------------------------------
0.06884586410941963
Title:  Minimising prescribing errors in the ICU DJ Melia, S Saha Queen' s Hospital, Romford, UK Critical Care
Abstract:  aimed audit prescribing practice busy 14-bedd general icu develop standardised practices tools improve safety prescribing errors occur commonly uk hospital admissions costing extra bed days per admission costing national health service estimated £1 billion per annum majority mistakes avoidable methods audited daily infusion charts patients three separate spot checks week assessed aspects prescriptions make legal valid accordance national guidance new procedures introduced included standardised prescription sticker common preprinted infusion prescriptions noradrenaline propofol forth education using new prescription stickers month later audit process repeated assessed prescriptions fi rst round intervention demonstrating improvement safe prescribing prescriptions initially fulfi lled best practice c