# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the environment variables that tell Jupyter Notebook where the Python source files are. If you haven't done this yet, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
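
If the notebook is already running, one possible alternative is to extend `sys.path` from inside the notebook. This is only a sketch: the helper name `add_src_to_path` is hypothetical, and the path is the same placeholder used in the commands above.

```python
import os
import sys


def add_src_to_path(src_dir):
    """Prepend src_dir to sys.path unless it is already listed."""
    src_dir = os.path.abspath(src_dir)
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir


# Placeholder path; replace it with the location of your own checkout.
added = add_src_to_path("/path/to/covid19/src")
```

This only affects the current kernel, so it has to be run again after a kernel restart.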

In [1]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

unable to import 'smart_open.gcs', disabling that module


In [2]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentences metadata
2. bi-gram model
3. tri-gram model
4. Trained FastText vectors

In [3]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
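
The bi-gram and tri-gram models are what turn a query such as "Sampling methods" into the single token `sampling_methods` in the search output further down. The toy sketch below illustrates the general idea of cleaning a query and merging known adjacent word pairs. It is not the `SearchEngine` implementation: `KNOWN_PHRASES`, `clean` and `merge_phrases` are hypothetical names, and the phrase set is made up for illustration.

```python
import re

# Hypothetical phrase vocabulary; the real models are loaded from the .pkl files.
KNOWN_PHRASES = {
    ("sampling", "methods"),
    ("public", "health"),
    ("diagnostic", "platforms"),
}


def clean(text):
    """Lowercase the text, drop everything except letters, digits and spaces, then split."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def merge_phrases(tokens, phrases):
    """Greedily join adjacent token pairs found in `phrases` with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_phrases(clean("Sampling methods"), KNOWN_PHRASES))
print(merge_phrases(clean("existing diagnostic platforms"), KNOWN_PHRASES))
```

Running a second pass over the merged tokens, with a phrase set that contains already merged bigrams, would be one way to obtain trigrams.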


A simple search function that takes a list of keywords, runs the search, and prints the results:

In [4]:
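def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    """Run a search with the global `search_engine` and print the results.

    All arguments except `only_sentences` are forwarded to `SearchEngine.search`.
    With `only_sentences=True`, only each result's 'sentence' field is printed;
    otherwise the full results are pretty-printed.
    """
    print(f"\nSearch for terms {keywords}\n\n")
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")

    if only_sentences:
        for result in results:
            print(result['sentence'] + "\n")
    else:
        pprint(results)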
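
The `synonyms_threshold` argument suggests that query terms are expanded with words whose vectors are close enough to the original term. The sketch below shows one way such a threshold could work, using cosine similarity over toy vectors. That `SearchEngine` uses cosine similarity in this way is an assumption; `TOY_VECTORS`, `cosine` and `expand` are hypothetical names, and the 3-d vectors are invented for illustration (the notebook loads 100-d FastText vectors).

```python
import math

# Invented 3-d vectors, for illustration only.
TOY_VECTORS = {
    "capacity": [1.0, 0.0, 0.0],
    "capacities": [0.9, 0.1, 0.0],
    "capability": [0.8, 0.3, 0.0],
    "burden": [0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def expand(term, vectors, threshold=0.8):
    """Return the other words whose similarity to `term` is at least `threshold`."""
    if term not in vectors:
        return []
    scored = [
        (cosine(vectors[term], vec), word)
        for word, vec in vectors.items()
        if word != term
    ]
    # Most similar words first; words below the threshold are dropped.
    return [word for score, word in sorted(scored, reverse=True) if score >= threshold]


print(expand("capacity", TOY_VECTORS))
```

Under this reading, raising the threshold keeps fewer, closer synonyms, while lowering it widens the query at the cost of looser matches.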

Let's see some examples:

In [9]:
search(["demographic", "Sampling methods", "asymptomatic", "serosurveys", "convalescent samples", "screening", "ELISAs"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['demographic', 'Sampling methods', 'asymptomatic', 'serosurveys', 'convalescent samples', 'screening', 'ELISAs']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['demographic', 'sampling_methods', 'asymptomatic', 'serosurveys', 'convalescent_samples', 'screening', 'elisas', 'demographics', 'demography', 'sociodemographic', 'sampling_techniques', 'symptomatic', 'seroprevalence_studies', 'serological_surveys', 'serologic_surveys', 'seroepidemiological_studies', 'serological_surveillance', 'serologic_investigations', 'serosurvey', 'serological_studies', 'serological_investigations', 'convalescentphase_serum_sample', '≥_4fold_increase', 'increase_igg_titer', 'elisa_macelisa', 'acutephase_samples', 'convalescent_serology', 'serum_samples_taken', '4fold_increase_antibody_titer', 'fourfold_increase_titer', 'between_acuteand', 'elisa_tests', 'elisa', 'indirect_elisas', 'enzymelinked_immunosorbent_assays_elisas', 'two_elisas', 'enzymelinked_immunosorbe

In [12]:
search(["Denominators", "testing", "demographics", "sharing information"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Denominators', 'testing', 'demographics', 'sharing information']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['denominators', 'testing', 'demographics', 'sharing_information', 'numerators', 'denominator_calculate', 'using_multiple_imputation', 'random_effects_models', 'burden_influenza_ah1n1pdm09', 'appropriate_denominators', 'fit_lognormal_distribution', 'reporting_completeness_proportions', 'sex_age_groups', 'national_estimates', 'demographic', 'demographic_characteristics', 'demographic_data', 'sociodemographic', 'questionnaire_addressed', 'demographic_features', 'informationsharing', 'communication_cooperation', 'electronic_communications', 'information_sharing', 'sharing_surveillance', 'datasharing', 'international_coordination', 'nhcmoh', 'sharing_information_between', 'ihr_2005_implementation']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19'

In [14]:
search(["mitigation", "government", "strategies"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['mitigation', 'government', 'strategies']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['mitigation', 'government', 'strategies', 'local_government', 'governments', 'central_government', 'national_government', 'governmental', 'local_agencies', 'government_s', 'stateowned_enterprises_soes', 'chinese_government', 'central_local_governments', 'strategy']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

As the virus spreads globally it is likely that government strategies will shift from containment and delay towards mitigation (4) .

4 As a result, some of the public health precautionary strategies are selfiniti

In [21]:
search(["existing diagnostic platforms", "burden"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['existing diagnostic platforms', 'burden']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['existing', 'diagnostic_platforms', 'burden', 'naatbased', 'tests_nats', 'nucleic_acidbased_amplification', 'simultaneously_detect_multiple', 'rapid_turnaround_times', 'dipstick_assays', 'detection_platforms', 'over_conventional_methods', 'detection_capabilities', 'poc_diagnosis']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Taken together, our observations suggest that any conclusion drawn, at present, about existing lineages and direction of viral spread, based on phylogenetic analysis of SARS-CoV-2 sequence data, i

In [27]:
search(["Recruit", "expertise", "capacity"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['Recruit', 'expertise', 'capacity']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['recruit', 'expertise', 'capacity', 'recruited', 'recruits', 'capability', 'capacities', 'capabilities']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

South Korea, as of writing, has the most extensive capabilities of testing individuals with a capacity of around 20,000 tests a day.

We assessed the required expertise and capacity for molecular detection of 2019-nCoV in specialised laboratories in 30 European Union/European Economic Area (EU/EEA) countries.

Organisations such as the Global Outbreak Alert and Response Network

In [29]:
search(["government", "best practices", "guidelines", "public", "public health"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['government', 'best practices', 'guidelines', 'public', 'public health']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['government', 'best_practices', 'guidelines', 'public', 'public_health', 'local_government', 'governments', 'central_government', 'national_government', 'governmental', 'local_agencies', 'government_s', 'stateowned_enterprises_soes', 'chinese_government', 'central_local_governments', 'recommendations', 'international_guidelines', 'national_guidelines', 'published_guidelines', 'guideline', 'guidance', 'standards_guidelines', 'newspapers_internet', 'medical_public_health']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuh

In [30]:
search(["point-of-care test", "rapid influenza test", "speed", "accuracy"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['point-of-care test', 'rapid influenza test', 'speed', 'accuracy']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['pointofcare_test', 'rapid', 'influenza', 'test', 'speed_accuracy', 'liat', 'alere_i', 'filmarray_respiratory_panel', 'verigene_respiratory', 'xpress_flursv', 'easytoperform', 'cliawaived', 'panel_rp', 'pointofcare_settings', 'cepheid_sunnyvale_ca_usa', 'influenza_virus', 'influenza_a', 'infl_uenza', 'influenza_a_b', 'tests', 'economical_method', 'speed_cost', 'accuracy_sensitivity', 'efficiency_speed', 'multiplexing_capacity', 'metagenomics_datasets', 'standardization_automation', 'automatization', 'sophisticated_tools', 'fast_turnaround_time']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_c

In [36]:
search(["PCR", "testers", "report", "area"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['PCR', 'testers', 'report', 'area']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['pcr', 'testers', 'report', 'area', 'polymerase_chain_reaction_pcr', 'skincontact', 'accessory_device', 'sensors_attached', 'must_kept_clean', 'monitored_periodically', 'indirect_ophthalmoscope', 'fluorescent_powder', 'rhinolaryngoscope', 'measuring_tape', 'removing_clothing', 'areas']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Subsequently, total 10 treatment areas (with 162 doctors and 322 nurses) were successively established within one week, which included respiratory intensive care unit (RICU) ( for critical patients)