In [1]:
# BEIR dataset identifier shared by every step in this notebook
DATA_NAME = "trec-covid"

## Downloader

In [2]:
from model.prepare import BEIRDatasetDownloader

downloader = BEIRDatasetDownloader(data_name=DATA_NAME)

In [3]:
%%time
downloader.download()

trec-covid dataset already exists. 
Wall time: 0 ns


## Vectorizer

In [4]:
%%time
from model.prepare import DocumentsVectorizer

vectorizer = DocumentsVectorizer(data_name=DATA_NAME)

Parsing documents. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\corpus.jsonl 
Tokenizing documents. 
Wall time: 6.66 s
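
The log above shows the vectorizer reading `corpus.jsonl`. BEIR corpora are distributed as JSON Lines, one document per line with `_id`, `title` and `text` fields. How `DocumentsVectorizer` parses the file is not shown in this notebook, so the following is only a minimal sketch of that step; `parse_corpus` and the two sample records are hypothetical.

```python
import io
import json


def parse_corpus(lines):
    """Map each document id to its text (hypothetical helper, not the project's parser)."""
    documents = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        documents[record["_id"]] = record.get("text", "")
    return documents


# Made-up records in the BEIR corpus layout; a real run would iterate over the open file.
sample = io.StringIO(
    '{"_id": "d1", "title": "T1", "text": "first abstract"}\n'
    '{"_id": "d2", "title": "T2", "text": "second abstract"}\n'
)
corpus = parse_corpus(sample)
```

Reading line by line keeps memory use proportional to the parsed result rather than to the raw file.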


In [5]:
%%time
vectorizer.vectorize()

Learning vocabulary idf. 
Generating vector. 
Computing document length
Generating permutation index
Permuting matrix
Generating mappings
Wall time: 375 ms
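
The messages "Learning vocabulary idf" and "Generating vector" suggest a TF-IDF weighting, although the notebook does not show the implementation. As a rough illustration under that assumption, the sketch below builds L2-normalised TF-IDF rows for a toy tokenized corpus. The function `tfidf_vectors`, the smoothed idf formula, and the toy tokens are all assumptions and may differ from what `DocumentsVectorizer.vectorize()` does.

```python
import math
from collections import Counter

import numpy as np


def tfidf_vectors(tokenized_docs):
    """Return (matrix, vocabulary) with one L2-normalised TF-IDF row per document."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {tok: j for j, tok in enumerate(vocabulary)}
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}

    matrix = np.zeros((n_docs, len(vocabulary)))
    for i, doc in enumerate(tokenized_docs):
        for tok, count in Counter(doc).items():
            matrix[i, column[tok]] = count * idf[tok]
        norm = np.linalg.norm(matrix[i])
        if norm > 0:
            matrix[i] /= norm  # unit length, so a dot product equals cosine similarity
    return matrix, vocabulary


docs_tokens = [
    ["virus", "rna", "replication"],
    ["virus", "host"],
    ["pneumonia", "infection", "cough", "fever"],
]
matrix, vocabulary = tfidf_vectors(docs_tokens)
```

A dense array is used only to keep the sketch short; at realistic vocabulary sizes a sparse format such as the CSR matrix saved by this notebook is the practical choice.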


In [6]:
vectorizer.save()

Saving C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorized.npz 
Saving C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorization_mapping.json 
Saving C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorization_inverse_mapping.json 


## Comparison

In [7]:
from model.documents import DocumentsCollection

docs = DocumentsCollection(data_name=DATA_NAME)

Parsing documents. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\corpus.jsonl 


In [8]:
docs.get_document(id_=docs.doc_ids[0])

Document: ug7v899j

In [9]:
docs.content_comparison(ids=(docs.doc_ids[0], docs.doc_ids[1]))

Document ug7v899j: 
OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and ma

## Dimensionality reduction

In [10]:
from model.documents import DocumentVectors

dv = DocumentVectors(data_name=DATA_NAME)

Loading vectors. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorized.npz 
Loading mapping. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorization_mapping.json 
Loading inverse mapping. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorization_inverse_mapping.json 


In [11]:
dv.vectors

<1000x10360 sparse matrix of type '<class 'numpy.float64'>'
	with 82002 stored elements in Compressed Sparse Row format>

In [12]:
# Look up the document stored at row 326 of the matrix
dv.get_row_info(row=326)

['sxdstw4a', 98]

In [13]:
# Inverse lookup: from a document id back to its row index
dv.get_doc_row(doc_id='sxdstw4a')

326
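
The two lookups above behave like a row-to-document table and its inverse, presumably the contents of `vectorization_mapping.json` and `vectorization_inverse_mapping.json`. The sketch below shows one way such a pair of tables could be built and round-tripped through JSON. The layout is an assumption, the ids and numbers are made up, and the second element of each entry is a placeholder because the notebook does not say what the `98` in `['sxdstw4a', 98]` measures.

```python
import json

# Hypothetical per-document metadata: (document id, some per-document number).
rows = [("doc_a", 3), ("doc_b", 1), ("doc_c", 2)]

# Forward table: row index -> [doc_id, value]; inverse table: doc_id -> row index.
mapping = {row: [doc_id, value] for row, (doc_id, value) in enumerate(rows)}
inverse_mapping = {doc_id: row for row, (doc_id, _) in mapping.items()}

# JSON object keys are always strings, so integer row indices must be
# converted back after loading.
restored = {int(key): entry for key, entry in json.loads(json.dumps(mapping)).items()}
```

Keeping both directions on disk avoids a linear scan whenever a document id has to be resolved to a matrix row.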

In [14]:
%%time
# Project the 10360-dimensional document vectors down to 20 components
dv.perform_dimensionality_reduction(new_dim=20)

Wall time: 1.03 s


In [15]:
dv.vectors_reduced

array([[ 0.18391328, -0.01438054,  0.10168957, ...,  0.01999485,
        -0.03988555, -0.02960575],
       [ 0.21629227, -0.11306294,  0.25950149, ...,  0.03137865,
        -0.06586437, -0.10656926],
       [ 0.25667586,  0.09295038, -0.1463625 , ...,  0.0048079 ,
         0.04430867, -0.02935193],
       ...,
       [-0.01222098, -0.07102374, -0.02499673, ..., -0.02226886,
         0.0077969 , -0.0075583 ],
       [-0.01222098, -0.07102374, -0.02499673, ..., -0.02226886,
         0.0077969 , -0.0075583 ],
       [-0.01222098, -0.07102374, -0.02499673, ..., -0.02226886,
         0.0077969 , -0.0075583 ]])
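
The notebook does not say which technique `perform_dimensionality_reduction` uses. Truncated SVD is a common choice for sparse term-weight matrices, so the sketch below uses it purely as an illustrative stand-in; `truncated_svd` and the random toy matrix are hypothetical.

```python
import numpy as np


def truncated_svd(matrix, new_dim):
    """Project rows onto the top `new_dim` singular directions."""
    # full_matrices=False returns the compact factorisation U * S * Vt.
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Singular values come back sorted in descending order,
    # so the leading columns carry the most variance.
    return u[:, :new_dim] * s[:new_dim]


rng = np.random.default_rng(0)
toy = rng.random((6, 10))  # 6 "documents", 10 "terms"
reduced = truncated_svd(toy, new_dim=3)
print(reduced.shape)  # (6, 3)
```

`np.linalg.svd` needs a dense array. For a sparse matrix of realistic size, `scipy.sparse.linalg.svds` or scikit-learn's `TruncatedSVD` compute only the leading components without densifying the input.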

## Evaluation

In [16]:
from model.evaluation import ExactSolutionEvaluation

eval_ = ExactSolutionEvaluation(data_name=DATA_NAME)

Loading vectors. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorized.npz 
Loading mapping. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorization_mapping.json 
Loading inverse mapping. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\vector\vectorization_inverse_mapping.json 


In [17]:
# Keep only the document pairs whose similarity reaches the 0.7 threshold
eval_.evaluate(threshold=0.7)

Evaluating. 


In [18]:
eval_.pairs

[('tkfozwpf', 'zpk3gjwo')]
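
An exact solution to this kind of similar-pairs search can be obtained by comparing every pair of documents. The sketch below assumes cosine similarity and an inclusive threshold; neither is confirmed by the notebook, and `similar_pairs` with its toy vectors is hypothetical rather than the code behind `ExactSolutionEvaluation`.

```python
import numpy as np


def similar_pairs(vectors, doc_ids, threshold):
    """Return every (id_i, id_j) with i < j whose cosine similarity is >= threshold."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged instead of dividing by zero
    unit = vectors / norms
    sims = unit @ unit.T  # all pairwise cosine similarities
    # Upper triangle without the diagonal: each unordered pair exactly once.
    rows, cols = np.triu_indices(len(doc_ids), k=1)
    return [
        (doc_ids[i], doc_ids[j])
        for i, j in zip(rows, cols)
        if sims[i, j] >= threshold
    ]


toy_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.9],
    [0.0, 1.0, 0.0],
])
toy_ids = ["a", "b", "c"]
pairs = similar_pairs(toy_vectors, toy_ids, threshold=0.7)
print(pairs)  # [('a', 'b')]
```

The full similarity matrix grows quadratically with the number of documents, which is acceptable for a collection of 1000 rows like the one loaded here but is the reason approximate methods are used at larger scale.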

In [19]:
from model.documents import DocumentsCollection

docs = DocumentsCollection(data_name=DATA_NAME)

Parsing documents. 
Loading C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\corpus.jsonl 


In [20]:
docs.content_comparison(ids=eval_.pairs[0])

Document tkfozwpf: 
Host factors are recruited into viral replicase complexes to aid replication of plus-strand RNA viruses. In this paper, we show that deletion of eukaryotic translation elongation factor 1Bgamma (eEF1Bγ) reduces Tomato bushy stunt virus (TBSV) replication in yeast host. Also, knock down of eEF1Bγ level in plant host decreases TBSV accumulation. eEF1Bγ binds to the viral RNA and is one of the resident host proteins in the tombusvirus replicase complex. Additional in vitro assays with whole cell extracts prepared from yeast strains lacking eEF1Bγ demonstrated its role in minus-strand synthesis by opening of the structured 3′ end of the viral RNA and reducing the possibility of re-utilization of (+)-strand templates for repeated (-)-strand synthesis within the replicase. We also show that eEF1Bγ plays a synergistic role with eukaryotic translation elongation factor 1A in tombusvirus replication, possibly via stimulation of the proper positioning of the viral RNA-depende

In [21]:
eval_.save()

Saving pairs. 
Creating directory C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\evaluation 
Saving C:\Users\user\Documents\GitHub\LWMD_assignments\assignment3\datasets\trec-covid\evaluation\exact_solution.json 
