# Representations, LDA

This notebook computes representations of the papers from their text.
These representations are then used for topic modeling with Latent Dirichlet Allocation (LDA).
Finally, the most relevant papers in each topic are identified using each paper's PageRank score.

In [None]:
import json
import pickle

class BasePaper:
    def __init__(self, metadata_row, file_path):
        self._metadata_row = metadata_row
        self._file_path = file_path
        self._file_contents = self._load_json_contents(file_path)
        
        self._referenced_by = []
        self._references = []
    
    def __getstate__(self):
        """
        Avoid RecursionErrors by not pickling references.
        """
        state = self.__dict__.copy()
        del state["_referenced_by"]
        del state["_references"]
        return state
    
    def __setstate__(self, state):
        self.__dict__.update(state)
        self._referenced_by = []
        self._references = []
        
    @staticmethod
    def _load_json_contents(path):
        with open(path) as file:
            contents = json.load(file)
        return contents

    @property
    def title(self):
        return self._metadata_row["title"]
        
    @property
    def authors(self):
        return self._metadata_row["authors"]
        
    @property
    def publish_time(self):
        return self._metadata_row["publish_time"]
        
    @property
    def abstract(self):
        return self._metadata_row["abstract"]
        
    @property
    def bib_entries(self):
        return self._file_contents["bib_entries"]
    
    def register_reference(self, reference):
        self._references.append(reference)
        reference.register_referenced(self)
    
    def register_referenced(self, referenced):
        self._referenced_by.append(referenced)
    
    
class PDFPaper(BasePaper):
    pass
        

class PMCPaper(BasePaper):
    pass

with open("./pageranks.pkl", "rb") as dump_file:
    pageranks = pickle.load(dump_file)

pageranks

{<__main__.PDFPaper at 0x7f170915ae90>: 0.01430039297450863,
 <__main__.PDFPaper at 0x7f16d3a0fed0>: 0.006843458528155105,
 <__main__.PMCPaper at 0x7f16d3a216d0>: 0.004866234092719767,
 <__main__.PDFPaper at 0x7f16d3a3da50>: 0.004775174613208849,
 <__main__.PMCPaper at 0x7f170ca30750>: 0.00475030023872773,
 <__main__.PMCPaper at 0x7f170ca48c50>: 0.004457157589108286,
 <__main__.PDFPaper at 0x7f16d39e8710>: 0.0044072907443844,
 <__main__.PMCPaper at 0x7f16d3995250>: 0.004233732165218011,
 <__main__.PDFPaper at 0x7f16d39ad8d0>: 0.003923803811602406,
 <__main__.PDFPaper at 0x7f16d395c0d0>: 0.0037811442078439805,
 <__main__.PDFPaper at 0x7f16d3862a90>: 0.0036584608150057128,
 <__main__.PDFPaper at 0x7f16d3867c90>: 0.0034915974841629935,
 <__main__.PDFPaper at 0x7f16d38853d0>: 0.003391606502514324,
 <__main__.PMCPaper at 0x7f16d3831f90>: 0.0029271613198010357,
 <__main__.PDFPaper at 0x7f16d383d850>: 0.002651556917500903,
 <__main__.PDFPaper at 0x7f16d37d0c10>: 0.002460381954514867,
 <__main

## Paper representations

The paper representations are built with the following libraries:

- spaCy: https://spacy.io/
- scispaCy: https://allenai.github.io/scispacy/

The language model used is `en_core_sci_sm`, a biomedical English model with a vocabulary of roughly 100,000 words.
Models with larger vocabularies are available if needed.

In [None]:
!pip install spacy scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

You are using pip version 19.0.3, however version 20.1b1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz


In [None]:
import scispacy
import en_core_sci_sm

# Load scispacy's scientific-language pipeline
nlp = en_core_sci_sm.load()

In [None]:
# Pick a sample paper to see how spacy behaves
sample_paper = list(pageranks.keys())[0]
sample_text = "\n".join([ paragraph["text"] for paragraph in sample_paper._file_contents["body_text"]])
sample_text

doc = nlp(sample_text, disable=["tagger", "parser", "ner"])
doc[17].lemma_

'continue'

The output above comes from the document as tokenized by the `spacy` pipeline.
A useful property of `spacy` with a trained model is that it readily provides vector representations of the document and of its tokens.

The number of out-of-vocabulary tokens matters for the performance of the downstream models.
The next cell iterates over the tokens to detect them.
It also prints every token ending in `virus`, indicating whether or not it is in the vocabulary.

In [None]:
num_oov = 0
for token in doc:
    if token.is_oov and token.string != "\n":
        if token.string.endswith("virus"):
            print(token, "not found")
        num_oov += 1
    else:
        if token.string.endswith("virus"):
            print(token, "found")
num_oov, 100 * num_oov / len(doc)

adenovirus found
virus found
virus found
coronavirus found
adenovirus found
metapneumovirus found
virus found
virus found
virus found
coronavirus found
virus found
coronavirus found
virus found
Torovirus not found
virus found
virus found
virus found
torovirus not found
coronavirus found
virus found
coronavirus found
virus found
coronavirus found
virus found


(95, 2.1610555050045495)

In [None]:
# Next, try out the mechanism for removing stopwords, punctuation and
# whitespace, and then extracting the lemma of each remaining token
for token in doc:
    if not token.is_stop and not token.is_punct and not token.is_space:
        print(token.lemma_)

outbreak 
atypical 
pneumonia 
guangdong 
province
people
republic 
china
continued 
november
2002
reported 
affected 
792 
people 
caused 
31 
deaths
1 
adjacent 
hong 
kong
surveillance 
severe 
atypical 
pneumonia 
heightened 
public 
hospital 
network 
hospital 
authority 
hong 
kong
end 
february
2003
clusters 
patients 
pneumonia 
noted 
hong 
kong
affected 
close 
contacts 
health-care 
workers
disease 
respond 
empirical 
antimicrobial 
treatment 
acute 
community-acquired 
typical 
atypical 
pneumonia
bacteriological 
virological 
pathogens 
known 
cause 
pneumonia 
identified
new 
disorder 
called 
severe 
acute 
respiratory 
syndrome 
sars
subsequently
sars 
spread 
worldwide 
involve 
patients 
north 
america
europe
asian 
countries
1 
investigated 
patients 
hong 
kong 
try 
identify 
causal 
agent
included 
study 
50 
patients 
fitting 
modified 
definition 
sars 
admitted 
acute 
regional 
hospitals 
hong 
kong 
feb 
26 
march 
26
2003
2 
briefly
case 
definition 
fever 

In total, 95 tokens are out of vocabulary, which is 2.16% of the tokens in the document.
On the other hand, relevant tokens such as `coronavirus` are included in the vocabulary.
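
The out-of-vocabulary rate reported above is the share of tokens missing from the model vocabulary. A minimal, dependency-free sketch of that calculation follows; `oov_rate` and the toy vocabulary are hypothetical names used only for illustration (the notebook itself relies on spaCy's `token.is_oov`).

```python
def oov_rate(tokens, vocabulary):
    """Return (count, percentage) of tokens that are not in `vocabulary`."""
    if not tokens:
        return 0, 0.0
    num_oov = sum(1 for token in tokens if token not in vocabulary)
    return num_oov, 100 * num_oov / len(tokens)


# Toy data, not taken from the corpus
toy_vocabulary = {"coronavirus", "virus", "pneumonia", "outbreak"}
toy_tokens = ["coronavirus", "outbreak", "torovirus", "pneumonia"]

toy_count, toy_percentage = oov_rate(toy_tokens, toy_vocabulary)
print(toy_count, toy_percentage)
```

Only `torovirus` is missing from the toy vocabulary, so one of the four tokens counts as out of vocabulary.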

## Latent Dirichlet Allocation (LDA)

This section runs the topic modeling experiments with LDA.
The implementation used is the one included in the `scikit-learn` library.

In [None]:
from tqdm.notebook import tqdm

# First, the text of every document is processed.
def process_papers_file_contents(papers):
    texts = []
    for paper in tqdm(papers):
        text = " \n ".join([ paragraph["text"] for paragraph in paper._file_contents["body_text"]])
        # NOTE: to keep development fast, topics were modeled using only the
        # title and the abstract. To include the paper body, change the
        # following line so that it also interpolates {text}.
        texts.append(f"{paper.title} \n {paper.abstract}")
    return texts

docs = process_papers_file_contents(
    papers=list(pageranks.keys()),
)
docs[:10]





["Coronavirus as a possible cause of severe acute respiratory syndrome \n Summary Background An outbreak of severe acute respiratory syndrome (SARS) has been reported in Hong Kong. We investigated the viral cause and clinical presentation among 50 patients. Methods We analysed case notes and microbiological findings for 50 patients with SARS, representing more than five separate epidemiologically linked transmission clusters. We defined the clinical presentation and risk factors associated with severe disease and investigated the causal agents by chest radiography and laboratory testing of nasopharyngeal aspirates and sera samples. We compared the laboratory findings with those submitted for microbiological investigation of other diseases from patients whose identity was masked. Findings Patients' age ranged from 23 to 74 years. Fever, chills, myalgia, and cough were the most frequent complaints. When compared with chest radiographic changes, respiratory symptoms and auscultatory findi

Papers are represented as vectors that count the occurrences of each token in them.
Weighted representations such as `tf-idf` are not used because the topic modeling technique is LDA, which is defined over raw term counts.
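
To make the count representation concrete, the following sketch builds term-count vectors by hand for two toy sentences. It assumes naive whitespace tokenization and uses a hypothetical helper, `count_vectors`; the notebook itself uses `CountVectorizer` with a spaCy-based tokenizer.

```python
from collections import Counter


def count_vectors(documents):
    """Build a sorted vocabulary and one term-count row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Toy documents, not taken from the corpus
toy_docs = ["virus infects cell", "cell response to virus virus"]
toy_vocabulary, toy_rows = count_vectors(toy_docs)
print(toy_vocabulary)
for row in toy_rows:
    print(row)
```

Each row has one entry per vocabulary term, which is the same layout as the document-term matrix built in the next cell, only dense and much smaller.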

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def tokenizer(sentence):
    tokens = []
    for token in nlp(sentence, disable=["tagger", "parser", "ner"]):
        # Discard numbers, stopwords, punctuation, whitespace and single-character tokens
        if not (token.like_num or token.is_stop or token.is_punct or token.is_space or len(token)==1):
            tokens.append(token.lemma_)
    return tokens

count_vectorizer = CountVectorizer(
    tokenizer=tokenizer,
    lowercase=True,
)
vectorized_docs = count_vectorizer.fit_transform(docs)

In [None]:
vectorized_docs

<38882x149389 sparse matrix of type '<class 'numpy.int64'>'
	with 2797505 stored elements in Compressed Sparse Row format>

In [None]:
count_vectorizer.vocabulary_

{'coronavirus': 40377,
 'possible': 110379,
 'cause': 33506,
 'severe': 125651,
 'acute': 17917,
 'respiratory': 118791,
 'syndrome': 132435,
 'summary': 131458,
 'background': 26581,
 'outbreak': 101681,
 'sars': 123140,
 'report': 118430,
 'hong': 67800,
 'kong': 79214,
 'investigate': 75880,
 'viral': 142671,
 'clinical': 37245,
 'presentation': 112007,
 'patient': 105033,
 'method': 87402,
 'analyse': 21076,
 'case': 33096,
 'note': 98054,
 'microbiological': 88039,
 'finding': 56943,
 'represent': 118456,
 'separate': 125091,
 'epidemiologically': 52838,
 'link': 81901,
 'transmission': 137321,
 'cluster': 37468,
 'define': 44652,
 'risk': 120047,
 'factor': 55434,
 'associate': 24913,
 'disease': 46893,
 'causal': 33491,
 'agent': 19064,
 'chest': 35727,
 'radiography': 116082,
 'laboratory': 80019,
 'test': 134365,
 'nasopharyngeal': 93803,
 'aspirate': 24815,
 'serum': 125496,
 'sample': 122883,
 'compare': 38834,
 'submit': 130958,
 'investigation': 75883,
 'identity': 70791,


The result is a sparse matrix with 38,882 rows, one per document, and 149,389 columns, one per token.
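
The matrix summary printed above also reports 2,797,505 stored elements, so the matrix is very sparse. A quick back-of-the-envelope check using only those reported figures (the variable names are illustrative):

```python
# Figures copied from the sparse-matrix summary printed above
num_docs = 38882
num_terms = 149389
stored_elements = 2797505

# Fraction of cells that hold a non-zero count
density = stored_elements / (num_docs * num_terms)
# Each stored element is one distinct token occurring in one document
avg_terms_per_doc = stored_elements / num_docs

print(f"density: {density:.4%}")
print(f"average distinct tokens per document: {avg_terms_per_doc:.1f}")
```

Roughly 0.05% of the cells are non-zero, which is about 72 distinct tokens per document. That is consistent with only titles and abstracts being vectorized.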

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Fit an LDA model to identify topics
lda = LatentDirichletAllocation(
    n_components=10, # number of topics
    verbose=2,
)
lda = lda.fit(vectorized_docs)
lda

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   33.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 1 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   24.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 2 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 3 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   19.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 4 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 5 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   15.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 6 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   14.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 7 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 8 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   15.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 9 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


iteration: 10 of max_iter: 10


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.8s finished


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=2)

The most relevant tokens of each identified topic are shown next.

In [None]:
def print_topic_words(topic_model, vectorizer, num_words):
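    # On scikit-learn >= 1.0 use get_feature_names_out() instead;
    # get_feature_names() was removed in version 1.2.
    feature_names = vectorizer.get_feature_names()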
    for topic_idx, topic in enumerate(topic_model.components_):
        message = "\nTopic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-num_words - 1:-1]])
        print(message)

print_topic_words(lda, count_vectorizer, 20)


Topic #0: sequence protein structure antibody epitope analysis bind peptide study genome domain result method acid gene structural region identify base interaction

Topic #1: virus vaccine human viral influenza disease development pathogen assay drug detection new method antiviral target potential infection use test system

Topic #2: health disease public study care risk infectious nan system research control include review outbreak pandemic global patient influenza infection need

Topic #3: patient respiratory infection virus child clinical viral pneumonia study influenza acute result case test de age severe associate year sample

Topic #4: model case covid-19 transmission outbreak disease epidemic datum china spread numb estimate result study time rate infection coronavirus contact control

Topic #5: infection mers-cov virus mouse respiratory vaccine syndrome middle east coronavirus response disease lung study antibody day animal result infect high

Topic #6: cell response infection
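
The slice `topic.argsort()[:-num_words - 1:-1]` used in `print_topic_words` selects the indices of the largest weights, largest first. A minimal pure-Python sketch of the same idiom, with made-up weights and a hypothetical helper, `top_indices`:

```python
def top_indices(weights, num_words):
    """Indices of the `num_words` largest weights, largest first."""
    # Equivalent of argsort: indices ordered by ascending weight
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_topic_words: walk backwards from the end
    return ascending[:-num_words - 1:-1]


toy_feature_names = ["cell", "protein", "vaccine", "virus"]
toy_topic = [0.5, 3.0, 1.2, 7.4]  # made-up topic-word weights

top_words = [toy_feature_names[i] for i in top_indices(toy_topic, 2)]
print(top_words)
```

The negative step reverses the ascending order, and the stop index `-num_words - 1` cuts the walk off after `num_words` elements.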

The next step is to classify the papers into the topics identified above.

In [None]:
docs_classified = lda.transform(vectorized_docs)
docs_classified[:5]



[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   14.6s finished


array([[6.89740189e-04, 6.89802180e-04, 6.89828066e-04, 7.06973626e-01,
        1.71530305e-01, 6.89808919e-04, 6.89793022e-04, 8.00286583e-02,
        3.73286408e-02, 6.89798120e-04],
       [1.42857889e-02, 1.42858855e-02, 1.42875152e-02, 1.42933597e-02,
        1.42866198e-02, 1.42901097e-02, 1.42860976e-02, 1.42860173e-02,
        8.71412087e-01, 1.42865195e-02],
       [9.43747076e-04, 9.43621468e-04, 9.43632205e-04, 9.43586700e-04,
        9.43548655e-04, 9.43617014e-04, 7.14733276e-02, 9.43674895e-04,
        2.64524996e-01, 6.57396248e-01],
       [3.68712454e-02, 1.29894830e-03, 1.29892554e-03, 6.96339470e-01,
        1.29901699e-03, 1.29915550e-03, 2.57695754e-01, 1.29937567e-03,
        1.29889846e-03, 1.29921047e-03],
       [1.25010851e-03, 1.25013402e-03, 1.25015393e-03, 8.58983662e-01,
        1.25024100e-03, 1.25028197e-03, 1.25016220e-03, 9.45583853e-02,
        3.77066875e-02, 1.25018315e-03]])
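
Each row of this array can be read as one document's distribution over the 10 topics: the entries sum to approximately 1 and the largest entry marks the dominant topic. A small check on the first row printed above, in plain Python (the variable names are illustrative):

```python
# First row of the array printed above
first_doc_topics = [
    6.89740189e-04, 6.89802180e-04, 6.89828066e-04, 7.06973626e-01,
    1.71530305e-01, 6.89808919e-04, 6.89793022e-04, 8.00286583e-02,
    3.73286408e-02, 6.89798120e-04,
]

total = sum(first_doc_topics)
# Index of the largest probability, i.e. what argmax would return
dominant_topic = max(range(len(first_doc_topics)), key=first_doc_topics.__getitem__)

print(round(total, 6), dominant_topic)
```

For this document the dominant topic is index 3, with a probability of about 0.71.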

Finally, for each identified topic, the top 5 papers belonging to it are printed, ordered by their *PageRank*.

In [None]:
from collections import defaultdict
import numpy as np


def display_paper(paper):
    if isinstance(paper, list):
        for elem in paper:
            display_paper(elem)
            print("\n", end="")
    else:
        print(f"""Title: {paper.title}
Authors: {paper.authors}
Publish time: {paper.publish_time}
Abstract: {paper.abstract}""")
        
        
# Assign each paper to its highest-probability topic
docs_topics = docs_classified.argmax(1)
topic_papers = defaultdict(list)
# Same order as the rows of vectorized_docs (both come from pageranks.keys())
all_papers = list(pageranks.keys())
for idx, topic_id in enumerate(docs_topics):
    topic_papers[topic_id].append(all_papers[idx])
# For each topic, print the five papers with the highest PageRank
for topic_id, papers in sorted(topic_papers.items(), key=lambda t: t[0]):
    print(f"Topic ID #{topic_id}")
    sorted_papers = sorted(papers, reverse=True, key=lambda p: pageranks[p])
    display_paper(sorted_papers[:5])
    print("\n", end="")

Topic ID #0
Title: An efficient method to make human monoclonal antibodies from memory B cells: potent neutralization of SARS coronavirus
Authors: Traggiai, Elisabetta; Becker, Stephan; Subbarao, Kanta; Kolesnikova, Larissa; Uematsu, Yasushi; Gismondo, Maria Rita; Murphy, Brian R; Rappuoli, Rino; Lanzavecchia, Antonio
Publish time: 2004-07-11
Abstract: Passive serotherapy can confer immediate protection against microbial infection, but methods to rapidly generate human neutralizing monoclonal antibodies are not yet available. We have developed an improved method for Epstein-Barr virus transformation of human B cells. We used this method to analyze the memory repertoire of a patient who recovered from severe acute respiratory syndrome coronavirus (SARS-CoV) infection and to isolate monoclonal antibodies specific for different viral proteins, including 35 antibodies with in vitro neutralizing activity ranging from 10(−8)M to 10(−11)M. One such antibody confers protection in vivo in a mou