In [1]:
import sys
sys.path.append('/home/sebaq/Documents/GitHub/LWMD_assignments')

## Evaluation over small example

This part computes the exact solution and two heuristics on the small example and compares their results.

In [2]:
DATA_NAME = 'small'

In [3]:
SIMILARITY = 0.8

Loading document content:

In [4]:
from assignment3.model.documents import DocumentsCollection

docs = DocumentsCollection(data_name=DATA_NAME)

Parsing documents... 


In [5]:
docs

small Documents [5000]

### Exact solution

In [6]:
from assignment3.model.evaluation import ExactSolutionEvaluation

eval_ = ExactSolutionEvaluation(data_name=DATA_NAME, threshold=SIMILARITY)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [7]:
eval_

ExactSolutionEvaluation - small (4735 docs, 0.8 similarity) 

In [8]:
eval_.evaluate()

Evaluating... 
Computing cosine similarity... 
Inspecting similarities... 
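The log lines above suggest two steps: compute all pairwise cosine similarities, then keep the pairs above the threshold. The following is a minimal sketch of that idea, not the code in `assignment3`: `similar_pairs` is a hypothetical helper, the vectors are assumed to be dense NumPy rows, and the comparison is assumed to be `>=`.

```python
import numpy as np

def similar_pairs(vectors, threshold):
    """Return index pairs (i, j), i < j, with cosine similarity >= threshold."""
    # Normalise rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T
    # Strict upper triangle: each pair appears once and self-pairs are excluded.
    rows, cols = np.triu_indices(len(vectors), k=1)
    mask = sims[rows, cols] >= threshold
    return [(int(i), int(j)) for i, j in zip(rows[mask], cols[mask])]

# Toy data: only the first two rows point in nearly the same direction.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(similar_pairs(toy, threshold=0.8))
```

The full similarity matrix grows quadratically with the number of documents, which is what the two heuristics below try to reduce.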


Total execution time:

In [9]:
eval_.execution_time

3.4334569109996664

Pairs found:

In [10]:
len(eval_.pairs)

7

In [11]:
eval_.save()

Saving evaluation results. 


Content comparison:

In [12]:
docs.content_comparison(ids=eval_.pairs[0])

Document 1czjl0hz: 
Torque teno sus virus 1 (TTSuV1) is a novel virus that has been found widely distributed in the swine population in recent years. Analysis of codon usage can reveal much about the molecular evolution of TTSuV1. In this study, synonymous codon usage patterns and the key determinants in the coding region of 29 available complete TTSuV1 genome sequences were examined. By calculating the nucleotide content and relative synonymous codon usage (RSCU) of TTSuV1 coding sequences, we found that the preferentially used codons were mostly those ending with A or C nucleotides; less-used codons were mostly codons ending with U or G nucleotides, and these were mainly affected by composition constraints. Although there was a variation in codon usage bias among different TTSuV1 genomes, the codon usage bias and GC content in the TTSuV1 coding region was lower, which was mainly determined by the base composition in the third codon position and the effective number of codons (ENC) va

In [13]:
docs.content_comparison(ids=eval_.pairs[1])

Document 1czjl0hz: 
Torque teno sus virus 1 (TTSuV1) is a novel virus that has been found widely distributed in the swine population in recent years. Analysis of codon usage can reveal much about the molecular evolution of TTSuV1. In this study, synonymous codon usage patterns and the key determinants in the coding region of 29 available complete TTSuV1 genome sequences were examined. By calculating the nucleotide content and relative synonymous codon usage (RSCU) of TTSuV1 coding sequences, we found that the preferentially used codons were mostly those ending with A or C nucleotides; less-used codons were mostly codons ending with U or G nucleotides, and these were mainly affected by composition constraints. Although there was a variation in codon usage bias among different TTSuV1 genomes, the codon usage bias and GC content in the TTSuV1 coding region was lower, which was mainly determined by the base composition in the third codon position and the effective number of codons (ENC) va

In [14]:
docs.content_comparison(ids=eval_.pairs[2])

Document 76uk9tj5: 
Pigeon circovirus (PiCV) is the most frequently diagnosed virus in pigeons and is thought to be one of the causative factors of a complex disease called the young pigeon disease syndrome (YPDS). The development of a vaccine against this virus could be a strategy for YPDS control. Since laboratory culture of PiCV is impossible, its recombinant capsid protein (rCP) can be considered as a potential antigen candidate in sub-unit vaccines. The aim of this basic research was to evaluate the immune response of pigeons to PiCV rCP. Sixty six-week-old carrier pigeons were divided into two groups (experimental immunized with PiCV rCP mixed with an adjuvant, and control immunized with an adjuvant only), and immunized twice in a 21-day interval. On the day of immunization and on two, 23, 39, and 46 days post first immunization (dpv), samples of blood, spleen, and bursa of Fabricius were collected from six birds from each group to examine anti-PiCV rCP IgY, anti-PiCV rCP IgY-sec

### Dimensionality Heuristic

We apply a heuristic that reduces the dimensionality of the vectors within a given approximation error.
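The notebook does not say which reduction technique `DimensionalityHeuristicEvaluation` uses. One common way to trade an approximation error `eps` for fewer dimensions is a Gaussian random projection sized with the Johnson-Lindenstrauss bound; the sketch below assumes that technique, and `jl_min_dim` and `random_projection` are hypothetical names.

```python
import math
import numpy as np

def jl_min_dim(n_samples, eps):
    """Johnson-Lindenstrauss bound on the target dimension for distortion eps."""
    return math.ceil(4 * math.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3))

def random_projection(vectors, eps, seed=0):
    """Project rows onto a random lower-dimensional subspace (assumed technique)."""
    n_samples, n_features = vectors.shape
    # Never project to more dimensions than the input already has.
    k = min(jl_min_dim(n_samples, eps), n_features)
    rng = np.random.default_rng(seed)
    # Entries drawn from N(0, 1/k) preserve squared norms in expectation.
    projection = rng.normal(0.0, 1.0 / math.sqrt(k), size=(n_features, k))
    return vectors @ projection

# Target dimension for the 4735 vectors of the small example with eps = 0.3.
print(jl_min_dim(4735, 0.3))
```

Under this assumption the target dimension depends only on the number of documents and on `eps`, not on the original vocabulary size, and comes out at roughly 940 for the small example.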

In [15]:
ERROR = 0.3

In [16]:
from assignment3.model.evaluation import DimensionalityHeuristicEvaluation
dim_heuristic = DimensionalityHeuristicEvaluation(data_name=DATA_NAME, threshold=SIMILARITY, eps=ERROR)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [17]:
dim_heuristic

DimensionalityHeuristicEvaluation - small (4735 docs, 0.8 similarity)  ['dim_reduction':  0.3 approx error] 

In [18]:
dim_heuristic.evaluate()

Performing dimensionality reduction... 
Computing similarities... 
Inspecting similarities... 


Total execution time:

In [19]:
dim_heuristic.execution_time

2.309927112000878

Pairs found:

In [20]:
len(dim_heuristic.pairs)

6

Score with respect to the exact solution:

In [21]:
dim_heuristic.score

0.8571428571428571
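How `score` is computed is not shown here. The value above equals 6/7, which is what the Jaccard similarity between the heuristic's pair set and the exact pair set would give if all 6 pairs found were among the 7 exact ones. A minimal sketch under that assumption, with a hypothetical `jaccard` helper and placeholder pairs:

```python
def jaccard(found, exact):
    """Jaccard similarity of two collections of unordered document pairs."""
    # frozenset makes (a, b) and (b, a) count as the same pair.
    found = {frozenset(p) for p in found}
    exact = {frozenset(p) for p in exact}
    union = found | exact
    return len(found & exact) / len(union) if union else 1.0

# Placeholder ids: 7 exact pairs, of which the heuristic recovers 6.
exact = [(i, i + 1) for i in range(0, 14, 2)]
found = exact[:6]
print(jaccard(found, exact))
```

Unlike plain recall, this measure also drops when the heuristic reports pairs that the exact solution does not contain.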

In [22]:
dim_heuristic.save()

Saving evaluation results. 


### Doc-size heuristic

We apply a heuristic that skips the similarity computation for document pairs whose lengths differ too much.
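A minimal sketch of such a length filter, assuming that `k` is a multiplicative factor and that a pair is compared only when the longer document is at most `k` times the shorter one. The helper `candidate_pairs` is hypothetical, and the actual rule in `DocSizeHeuristicEvaluation` may differ.

```python
def candidate_pairs(lengths, k):
    """Index pairs whose document lengths differ by at most a factor k."""
    # Sorting by length lets the inner loop stop at the first mismatch.
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if lengths[j] > k * lengths[i]:
                break  # every later document is longer still
            pairs.append((min(i, j), max(i, j)))
    return pairs

# Four toy documents; only the pairs with similar lengths survive.
print(candidate_pairs([100, 120, 200, 250], k=1.3))
```

Only the surviving pairs would then go through the cosine-similarity check, so a smaller `k` prunes more work but risks discarding true matches.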


In [23]:
K = 1.3

In [24]:
from assignment3.model.evaluation import DocSizeHeuristicEvaluation
docsize_heuristic = DocSizeHeuristicEvaluation(data_name=DATA_NAME, threshold=SIMILARITY, k=K)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [25]:
docsize_heuristic

DocSizeHeuristicEvaluation - small (4735 docs, 0.8 similarity)  ['docs_size': 1.3 mult factor]

In [26]:
docsize_heuristic.evaluate()

Evaluating... 
Computing cosine similarity... 
Inspecting similarities... 


Total execution time:

In [27]:
docsize_heuristic.execution_time

5.280275529999926

Pairs found:

In [28]:
len(docsize_heuristic.pairs)

5

Score with respect to the exact solution:

In [29]:
docsize_heuristic.score

0.7142857142857143

In [30]:
docsize_heuristic.save()

Saving evaluation results. 


## Evaluation over medium sample

This part computes the exact solution and two heuristics on the medium sample and compares their results.

In [31]:
DATA_NAME = 'medium'

In [32]:
SIMILARITY = 0.85

Loading document content:

In [33]:
from assignment3.model.documents import DocumentsCollection

docs = DocumentsCollection(data_name=DATA_NAME)

Parsing documents... 


In [34]:
docs

medium Documents [12000]

### Exact solution

In [35]:
from assignment3.model.evaluation import ExactSolutionEvaluation

eval_ = ExactSolutionEvaluation(data_name=DATA_NAME, threshold=SIMILARITY)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [36]:
eval_

ExactSolutionEvaluation - medium (9374 docs, 0.85 similarity) 

In [37]:
eval_.evaluate()

Evaluating... 
Computing cosine similarity... 
Inspecting similarities... 


Total execution time:

In [38]:
eval_.execution_time

15.127424928999972

Pairs found:

In [39]:
len(eval_.pairs)

59

In [40]:
eval_.save()

Saving evaluation results. 


Content comparison:

In [41]:
docs.content_comparison(ids=eval_.pairs[0])

Document cd5yoh0l: 
The expected cost explosion in transfusion medicine (increasing imbalance between donors and potential recipients, treatment of transfusion-associated complications) increases the socio-economic significance of specific institutional transfusion programs. In this context the estimated use of the patient’s physiologic tolerance to anemia enables 1) the tolerance of larger blood losses (loss of “diluted blood”), 2) the onset of transfusion to the time after surgical control of bleeding to be delayed and 3) the perioperative collection of autologous red blood cells. The present review article summarizes the mechanisms, influencing factors and limits of this natural tolerance to anemia and deduces the indication for perioperative red blood cell transfusion. Under strictly controlled conditions (anesthesia, normovolemia, complete muscular relaxation, hyperoxemia, mild hypothermia) extremely low hemoglobin concentrations [Hb <3 g/dl (<1.86 mmol/l)] are tolerated without t

In [42]:
docs.content_comparison(ids=eval_.pairs[1])

Document 1czjl0hz: 
Torque teno sus virus 1 (TTSuV1) is a novel virus that has been found widely distributed in the swine population in recent years. Analysis of codon usage can reveal much about the molecular evolution of TTSuV1. In this study, synonymous codon usage patterns and the key determinants in the coding region of 29 available complete TTSuV1 genome sequences were examined. By calculating the nucleotide content and relative synonymous codon usage (RSCU) of TTSuV1 coding sequences, we found that the preferentially used codons were mostly those ending with A or C nucleotides; less-used codons were mostly codons ending with U or G nucleotides, and these were mainly affected by composition constraints. Although there was a variation in codon usage bias among different TTSuV1 genomes, the codon usage bias and GC content in the TTSuV1 coding region was lower, which was mainly determined by the base composition in the third codon position and the effective number of codons (ENC) va

In [43]:
docs.content_comparison(ids=eval_.pairs[2])

Document 76uk9tj5: 
Pigeon circovirus (PiCV) is the most frequently diagnosed virus in pigeons and is thought to be one of the causative factors of a complex disease called the young pigeon disease syndrome (YPDS). The development of a vaccine against this virus could be a strategy for YPDS control. Since laboratory culture of PiCV is impossible, its recombinant capsid protein (rCP) can be considered as a potential antigen candidate in sub-unit vaccines. The aim of this basic research was to evaluate the immune response of pigeons to PiCV rCP. Sixty six-week-old carrier pigeons were divided into two groups (experimental immunized with PiCV rCP mixed with an adjuvant, and control immunized with an adjuvant only), and immunized twice in a 21-day interval. On the day of immunization and on two, 23, 39, and 46 days post first immunization (dpv), samples of blood, spleen, and bursa of Fabricius were collected from six birds from each group to examine anti-PiCV rCP IgY, anti-PiCV rCP IgY-sec

### Dimensionality Heuristic

We apply a heuristic that reduces the dimensionality of the vectors within a given approximation error.

In [44]:
ERROR = 0.3

In [45]:
from assignment3.model.evaluation import DimensionalityHeuristicEvaluation
dim_heuristic = DimensionalityHeuristicEvaluation(data_name=DATA_NAME, threshold=SIMILARITY, eps=ERROR)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [46]:
dim_heuristic

DimensionalityHeuristicEvaluation - medium (9374 docs, 0.85 similarity)  ['dim_reduction':  0.3 approx error] 

In [47]:
dim_heuristic.evaluate()

Performing dimensionality reduction... 
Computing similarities... 
Inspecting similarities... 


Total execution time:

In [48]:
dim_heuristic.execution_time

9.244992007999826

Pairs found:

In [49]:
len(dim_heuristic.pairs)

60

Score with respect to the exact solution:

In [50]:
dim_heuristic.score

0.9508196721311475
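Here the heuristic reports 60 pairs while the exact solution has 59, so some reported pairs cannot be correct. If the score is a Jaccard similarity (an assumption, since its definition is not shown), the size of the overlap can be recovered from the three reported numbers. The helper name below is hypothetical.

```python
def overlap_from_jaccard(n_exact, n_found, score):
    """Recover |exact ∩ found| from the set sizes and their Jaccard score."""
    # J = I / (n_exact + n_found - I)  =>  I = J * (n_exact + n_found) / (1 + J)
    return round(score * (n_exact + n_found) / (1 + score))

shared = overlap_from_jaccard(n_exact=59, n_found=60, score=0.9508196721311475)
missed = 59 - shared     # exact pairs the heuristic did not report
spurious = 60 - shared   # reported pairs absent from the exact solution
print(shared, missed, spurious)
```

Under that reading, the score 58/61 corresponds to 58 shared pairs, 1 missed pair and 2 spurious pairs.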

In [51]:
dim_heuristic.save()

Saving evaluation results. 


### Doc-size heuristic

We apply a heuristic that skips the similarity computation for document pairs whose lengths differ too much.


In [52]:
K = 1.3

In [53]:
from assignment3.model.evaluation import DocSizeHeuristicEvaluation
docsize_heuristic = DocSizeHeuristicEvaluation(data_name=DATA_NAME, threshold=SIMILARITY, k=K)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [54]:
docsize_heuristic

DocSizeHeuristicEvaluation - medium (9374 docs, 0.85 similarity)  ['docs_size': 1.3 mult factor]

In [55]:
docsize_heuristic.evaluate()

Evaluating... 
Computing cosine similarity... 
Inspecting similarities... 


Total execution time:

In [56]:
docsize_heuristic.execution_time

17.105103316998793

Pairs found:

In [57]:
len(docsize_heuristic.pairs)

58

Score with respect to the exact solution:

In [58]:
docsize_heuristic.score

0.9830508474576272

In [59]:
docsize_heuristic.save()

Saving evaluation results. 


## Evaluation over large example

This part computes the exact solution and two heuristics on the large example and compares their results.

In [60]:
DATA_NAME = 'large'

In [61]:
SIMILARITY = 0.9

Loading document content:

In [62]:
from assignment3.model.documents import DocumentsCollection

docs = DocumentsCollection(data_name=DATA_NAME)

Parsing documents... 


In [63]:
docs

large Documents [20000]

### Exact solution

In [64]:
from assignment3.model.evaluation import ExactSolutionEvaluation

eval_ = ExactSolutionEvaluation(data_name=DATA_NAME, threshold=SIMILARITY)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [65]:
eval_

ExactSolutionEvaluation - large (13641 docs, 0.9 similarity) 

In [66]:
eval_.evaluate()

Evaluating... 
Computing cosine similarity... 
Inspecting similarities... 


Total execution time:

In [67]:
eval_.execution_time

29.89112720400044

Pairs found:

In [68]:
len(eval_.pairs)

124

In [69]:
eval_.save()

Saving evaluation results. 


Content comparison:

In [70]:
docs.content_comparison(ids=eval_.pairs[0])

Document xu3gfwpu: 
Proteases are ubiquitous in biosystems where they have diverse roles in the biochemical, physiological, and regulatory aspects of cells and organisms. Proteases represent the largest segment of the industrial enzyme market where they are used in detergents, in food processing, in leather and fabric upgrading, as catalysts in organic synthesis, and as therapeutics. Microbial protease overproducing strains have been developed by conventional screening, mutation/selection strategies and genetic engineering, and wholly new enzymes, with altered specificity or stability, have been designed through techniques such as site-directed mutagenesis and directed evolution. Complete sequencing of the genomes of key Bacillus and Aspergillus workhorse extracellular enzyme producers and other species of interest has contributed to enhanced production yields of indigenous proteases as well as to production of heterologous proteases. With annual protease sales of about $1.5–1.8 billio

In [71]:
docs.content_comparison(ids=eval_.pairs[1])

Document hha2sctb: 
Koorts wordt bij niet-immuungecompromitteerde volwassenen in de eerste lijn zonder recent verblijf in het buitenland meestal veroorzaakt door een luchtweginfectie. Bij ouderen vormen urineweginfecties een relatief frequente oorzaak van koorts. Bij ontbreken van richtinggevende voorgeschiedenis, klachten of verschijnselen volstaat men in eerste instantie met lichamelijk onderzoek van KNO-gebied en longen. Bij negatieve bevindingen volgt urineonderzoek. Wanneer geen afwijkingen gevonden worden, gaat men uit van een onschuldige virale oorzaak. Bij een ernstig zieke indruk of verminderd bewustzijn is het lichamelijk onderzoek allereerst gericht op eventuele stoornissen in de vitale functies, omdat deze onmiddellijke therapeutische implicaties hebben. Als de koorts een week aanhoudt of eerder bij verandering van het beeld, dient een uitgebreide anamnese en algemeen lichamelijk onderzoek plaats te vinden om diagnostische aanknopingspunten op te sporen. Ook kan de arts dan

In [72]:
docs.content_comparison(ids=eval_.pairs[2])

Document ntbuwf8i: 
BACKGROUND: Acute respiratory illnesses are the leading cause of death from infectious diseases around the world, and occasional outbreaks of particularly virulent strains are can be public health disasters. Recently, a large outbreak of fatal Middle East respiratory syndrome-coronavirus (MERS-CoV) occurred following a single patient exposure in the emergency department (ED) of the Samsung Medical Center, a tertiary-care hospital in South Korea, which resulted in significant public health and economic burden. After this outbreak, a febrile respiratory infectious disease unit (FRIDU) with a negative pressure ventilation system was constructed outside the emergency department (ED) in 2015, to screen for patients with contagious diseases requiring isolation. METHODS: This is a retrospective cohort study of patients who visited the ED with febrile illness between August 2015 and July 2016. Ultimately, 1562 patients who were hospitalized after FRIDU screening were analyz

### Dimensionality Heuristic

We apply a heuristic that reduces the dimensionality of the vectors within a given approximation error.

In [73]:
ERROR = 0.3

In [74]:
from assignment3.model.evaluation import DimensionalityHeuristicEvaluation
dim_heuristic = DimensionalityHeuristicEvaluation(data_name=DATA_NAME, threshold=SIMILARITY, eps=ERROR)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [75]:
dim_heuristic

DimensionalityHeuristicEvaluation - large (13641 docs, 0.9 similarity)  ['dim_reduction':  0.3 approx error] 

In [76]:
dim_heuristic.evaluate()

Performing dimensionality reduction... 
Computing similarities... 
Inspecting similarities... 


Total execution time:

In [77]:
dim_heuristic.execution_time

19.696330592998493

Pairs found:

In [78]:
len(dim_heuristic.pairs)

120

Score with respect to the exact solution:

In [79]:
dim_heuristic.score

0.967741935483871

In [80]:
dim_heuristic.save()

Saving evaluation results. 


### Doc-size heuristic

We apply a heuristic that skips the similarity computation for document pairs whose lengths differ too much.


In [81]:
K = 1.2

In [82]:
from assignment3.model.evaluation import DocSizeHeuristicEvaluation
docsize_heuristic = DocSizeHeuristicEvaluation(data_name=DATA_NAME, threshold=SIMILARITY, k=K)

Loading vectors... 
Loading mapping... 
Loading inverse mapping... 


In [83]:
docsize_heuristic

DocSizeHeuristicEvaluation - large (13641 docs, 0.9 similarity)  ['docs_size': 1.2 mult factor]

In [84]:
docsize_heuristic.evaluate()

Evaluating... 
Computing cosine similarity... 
Inspecting similarities... 


Total execution time:

In [85]:
docsize_heuristic.execution_time

26.247446769000817

Pairs found:

In [86]:
len(docsize_heuristic.pairs)

122

Score with respect to the exact solution:

In [87]:
docsize_heuristic.score

0.9838709677419355

In [88]:
docsize_heuristic.save()

Saving evaluation results. 
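To compare the runs side by side, the execution times reported in this notebook can be turned into speed-up ratios against the exact solution. This is a small sketch: the dictionary simply copies the values printed above, `speedup` is a hypothetical helper, and each figure comes from a single run, so the ratios are only indicative.

```python
# Execution times copied from the outputs above, in the timer's own unit.
reported = {
    "small":  {"exact": 3.4334569109996664, "dim": 2.309927112000878,  "docsize": 5.280275529999926},
    "medium": {"exact": 15.127424928999972, "dim": 9.244992007999826,  "docsize": 17.105103316998793},
    "large":  {"exact": 29.89112720400044,  "dim": 19.696330592998493, "docsize": 26.247446769000817},
}

def speedup(dataset, heuristic):
    """Exact time divided by heuristic time; values above 1 mean faster."""
    times = reported[dataset]
    return times["exact"] / times[heuristic]

for dataset in reported:
    for heuristic in ("dim", "docsize"):
        print(f"{dataset:6} {heuristic:8} {speedup(dataset, heuristic):.2f}x")
```

On these runs the dimensionality heuristic is faster than the exact solution on all three datasets, while the doc-size heuristic is slower on the small and medium ones and faster only on the large one.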
