# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the environment variables that tell Jupyter Notebook where the Python source files are. If you haven't done this, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
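If the environment variables were not set before Jupyter started, the source directory can also be added to `sys.path` from inside the notebook. This is a minimal sketch and not part of the project: `add_src_to_path` is a hypothetical helper, and `/path/to/covid19/src` is the same placeholder used in the commands above.

```python
import os
import sys


def add_src_to_path(src_dir):
    """Put src_dir at the front of sys.path (only once) and return its absolute path."""
    src_dir = os.path.abspath(src_dir)
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir


# Placeholder path: replace it with the real location of the covid19 checkout.
add_src_to_path("/path/to/covid19/src")
```

Run this in the first cell, before `from search.searchengine import SearchEngine`, so the import can be resolved.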

In [1]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

unable to import 'smart_open.gcs', disabling that module


In [2]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentence metadata
2. Bi-gram model
3. Tri-gram model
4. Trained FastText vectors

In [3]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
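Because the constructor takes four file paths, it can help to confirm that they all exist before loading anything. The sketch below is an assumption about how one might do that, not part of the project: `find_missing_files` is a hypothetical helper, and the file names are the ones used in this notebook.

```python
import os


def find_missing_files(data_dir, relative_paths):
    """Return the joined paths under data_dir that are not existing files."""
    paths = [os.path.join(data_dir, rel) for rel in relative_paths]
    return [path for path in paths if not os.path.isfile(path)]


# File names as used in this notebook; the data directory depends on your setup.
required = [
    "sentences_with_metadata.csv",
    "covid_bigram_model_v0.pkl",
    "covid_trigram_model_v0.pkl",
    "fasttext_no_subwords_trigrams/word-vectors-100d.txt",
]
missing = find_missing_files("../../../workspace/kaggle/covid19/data", required)
if missing:
    print("Missing input files:")
    for path in missing:
        print("  " + path)
```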


A simple wrapper around `search_engine.search` that takes a list of keywords and prints either the matching sentences alone or the full result records:

In [4]:
def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    print(f"\nSearch for terms {keywords}\n\n")
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")
    
    if only_sentences:
        for result in results:
            print(result['sentence'] + "\n")
    else:
        pprint(results)
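The `synonyms_threshold` argument is passed straight to `search_engine.search`, whose internals are not shown in this notebook. Since the engine loads word vectors, one plausible reading is that a vocabulary term counts as a synonym when its cosine similarity to a keyword reaches the threshold. The toy sketch below illustrates only that idea: the vectors, `cosine_similarity` and `expand_with_synonyms` are hypothetical and are not the project's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expand_with_synonyms(keywords, vectors, threshold=0.8):
    """Return the keywords plus every vocabulary term similar enough to one of them."""
    expanded = list(keywords)
    for keyword in keywords:
        if keyword not in vectors:
            continue  # no vector for this keyword, so nothing to compare against
        for term, vector in vectors.items():
            if term in expanded:
                continue
            if cosine_similarity(vectors[keyword], vector) >= threshold:
                expanded.append(term)
    return expanded


# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {
    "genome": [1.0, 0.0],
    "genomes": [0.9, 0.1],
    "wildlife": [0.0, 1.0],
}
print(expand_with_synonyms(["genome"], toy_vectors))  # ['genome', 'genomes']
```

Under this reading, raising the threshold towards 1.0 keeps only near-identical terms, and lowering it admits looser matches.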

Let's see some examples:

In [8]:
search(["tracking", "genomes", "diagnostics", "therapeutics", "variations"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['tracking', 'genomes', 'diagnostics', 'therapeutics', 'variations']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['tracking', 'genomes', 'diagnostics_therapeutics', 'variations', 'genome', 'viral_genomes', 'virus_genomes', 'avenues_development', 'new_frontier', 'fundamental_research', 'diagnostic_therapeutic_agents', 'advancement_field', 'antibody_engineering_techniques', 'basic_research_clinical', 'current_developments', 'opportunities_design', 'tremendous_efforts_have_been', 'variation']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

All assembled query genomes in FASTA format were analyzed using Genome  
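In the cleaned query above, the adjacent keywords `diagnostics` and `therapeutics` were merged into the single term `diagnostics_therapeutics`. How the loaded bi-gram and tri-gram models decide this is not shown in this notebook. As a rough illustration only, the sketch below joins adjacent tokens whenever the pair appears in a set of known phrases; `join_known_phrases` and the phrase set are hypothetical.

```python
def join_known_phrases(tokens, known_phrases, delimiter="_"):
    """Greedily merge adjacent token pairs that appear in known_phrases."""
    joined = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_phrases:
            joined.append(pair[0] + delimiter + pair[1])
            i += 2  # both tokens were consumed by the merged phrase
        else:
            joined.append(tokens[i])
            i += 1
    return joined


# Hypothetical phrase set; a trained phrase model would supply this in practice.
known = {("diagnostics", "therapeutics")}
query = ["tracking", "genomes", "diagnostics", "therapeutics", "variations"]
print(join_known_phrases(query, known))
# ['tracking', 'genomes', 'diagnostics_therapeutics', 'variations']
```

Applying the same kind of pass a second time over the merged tokens would be one way to build three-word phrases out of two-word ones.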

In [16]:
search(["circulation", "geographic", "genome", "sample sets", "Nagoya Protocol"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['circulation', 'geographic', 'genome', 'sample sets', 'Nagoya Protocol']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['circulation', 'geographic', 'genome', 'sample_sets', 'nagoya_protocol', 'geographical', 'genomes', 'benefit_sharing', 'agreement_traderelated_aspects_intellectual', 'agreement_trade', 'sanitary_phytosanitary_measures', 'fair_equitable_sharing', 'dualuse_issues', 'essential_medicines_•', 'convention_biological_diversity', 'implementation_regulations', 'property_rights_trips']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

However, the fifth gene in the Betacoronavirus core genome, the envel

In [14]:
search(["geographic distribution", "genomic differences", "geographic"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['geographic distribution', 'genomic differences', 'geographic']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['geographic_distribution', 'genomic', 'differences', 'geographic', 'geographical_distribution', 'difference', 'differences_between', 'geographical']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Individual differences between donors from different sources may lead to differences in the analysis results of the two scRNA-seq datasets.

Adjusting for differences in underlying demography and assuming no overall difference in the attack rate by age, we estimate a high level of under-ascertainment of cas

In [32]:
search(["wildlife", "animals", "infections", "host range"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['wildlife', 'animals', 'infections', 'host range']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['wildlife', 'animals', 'infections', 'host_range', 'wildlife_species', 'domestic_animal', 'wildlife_populations', 'wildlife_conservation', 'wild_animals', 'infection', 'host_tropism', 'host_ranges']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

This chain poses a great risk for the contact between human and these wild animals, which may put the handlers at risk of infection not only for coronaviruses but also for other pathogens that these animals and birds may harbor.

While we did not look at other healthcare

In [34]:
search(["Animal host", "wildlife", "spill-over", "animal to human"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['Animal host', 'wildlife', 'spill-over', 'animal to human']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['animal', 'host', 'wildlife', 'spillover', 'animal_human', 'wildlife_species', 'domestic_animal', 'wildlife_populations', 'wildlife_conservation', 'wild_animals', 'spillover_events', 'spillover_from', 'spillover_humans', 'spillover_infections', 'human_animal']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

SARS-CoV-2 is likely a batorigin coronavirus that was transmitted to humans through a spillover from bats or through yet undetermined intermediate animal host (avian, swine, phocine, bovine, canine, o

In [41]:
search(["spill-over", "Socioeconomic"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['spill-over', 'Socioeconomic']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['spillover', 'socioeconomic', 'spillover_events', 'spillover_from', 'spillover_humans', 'spillover_infections', 'social_determinants', 'poverty_social', 'social_inequality', 'social_economic', 'societal_factors', 'socioeconomical', 'social_economic_factors', 'political_socioeconomic', 'socioeconomic_cultural', 'social_cultural_economic']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Every location has a different socio-economic profile such that the growth rate of the epidemic (and hence R 0 ) might differ.

" 6 At this time of se

In [42]:
search(["Sustainable", "risk reduction", "strategies"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['Sustainable', 'risk reduction', 'strategies']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['sustainable', 'risk_reduction', 'strategies', 'strategy']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Interpretation As strong social distancing intervention strategies had the most effect in reducing the epidemic peak, this strategy may be considered when weaker strategies are first tried and found to be less effective.

The lockdown resulted in a downward trend in national and provincial epidemic curves, however, these measures are not sustainable and eventually there will be a strategy to return to normality.