# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the Python source path environment variables for Jupyter Notebook. If you haven't done this, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
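If changing the shell environment is inconvenient, a roughly equivalent alternative is to extend `sys.path` from inside the notebook before the import. This is only a sketch: it uses the same placeholder path as the commands above, and the helper name `add_src_to_path` is hypothetical, not part of the project.

```python
import os
import sys


def add_src_to_path(src_dir):
    """Put src_dir at the front of sys.path if it is not already listed."""
    src_dir = os.path.abspath(src_dir)
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir


# Placeholder path taken from the export commands; replace with your checkout.
add_src_to_path("/path/to/covid19/src")
```

This affects only the running kernel, so it has to be executed in every new session before the `search` package is imported.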

In [1]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

unable to import 'smart_open.gcs', disabling that module


In [2]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentences metadata
2. bi-gram model
3. tri-gram model
4. Trained FastText vectors
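Loading these four inputs can take a while, so it may be worth checking that they exist before constructing the object. The sketch below assumes the data directory and file names used in this notebook; `missing_files` is a hypothetical helper, not part of the search engine.

```python
import os


def missing_files(data_dir, relative_paths):
    """Return the paths under data_dir that do not exist as regular files."""
    full_paths = [os.path.join(data_dir, p) for p in relative_paths]
    return [p for p in full_paths if not os.path.isfile(p)]


# File names as they are used in this notebook.
REQUIRED_FILES = [
    "sentences_with_metadata.csv",
    "covid_bigram_model_v0.pkl",
    "covid_trigram_model_v0.pkl",
    "fasttext_no_subwords_trigrams/word-vectors-100d.txt",
]

missing = missing_files("../../../workspace/kaggle/covid19/data", REQUIRED_FILES)
if missing:
    print("Missing input files:")
    for path in missing:
        print("  " + path)
```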

In [3]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt


Define a simple wrapper function that takes a list of keywords, runs the search, and prints the results:

In [4]:
def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    print(f"\nSearch for terms {keywords}\n\n")
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")
    
    if only_sentences:
        for result in results:
            print(result['sentence'] + "\n")
    else:
        pprint(results)
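The `synonyms_threshold` argument presumably controls how similar a vocabulary term must be to a keyword before it is added as a synonym. The internals of `SearchEngine` are not shown in this notebook, so the following is only a sketch of one common approach (cosine similarity over word vectors). The toy 2-d vectors and the `expand_with_synonyms` helper are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_with_synonyms(keywords, vectors, threshold=0.8):
    """Append vocabulary terms whose similarity to any keyword is >= threshold."""
    expanded = list(keywords)
    for keyword in keywords:
        if keyword not in vectors:
            continue  # out-of-vocabulary keywords are kept but not expanded
        for term, vec in vectors.items():
            if term not in expanded and cosine(vectors[keyword], vec) >= threshold:
                expanded.append(term)
    return expanded


# Toy vectors, invented for illustration only (not from the trained model).
toy_vectors = {
    "summer": (1.0, 0.1),
    "winter": (0.9, 0.3),
    "copper": (0.0, 1.0),
}
print(expand_with_synonyms(["summer"], toy_vectors, threshold=0.8))
```

Under this reading, a higher threshold yields fewer, closer synonyms, and a lower threshold broadens the query at the cost of precision.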

Let's see some examples:

In [5]:
search(keywords=["animals", "zoonotic", "spillover", "animal to human",
             "bats", "snakes", "exotic animals", "seafood"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"])


Search for terms ['animals', 'zoonotic', 'spillover', 'animal to human', 'bats', 'snakes', 'exotic animals', 'seafood']

Search terms after cleaning, bigrams, trigrams and synonym expansion: ['animals', 'zoonotic_spillover', 'animal_human', 'bats', 'snakes', 'exotic_animals', 'seafood', 'spillover_humans', 'hostswitching_events', 'battohuman', 'host_switching_events', 'from_dromedaries_humans', 'interspecies_jumping', 'animaltoanimal_animaltohuman', 'humantohuman_transmission_events', 'emergence_events', 'zoonotic_spillover_events', 'human_animal', 'bat_species', 'fruit_bats', 'insectivorous_bats', 'species_bats', 'bat', 'exotic_pets', 'wild_animal_species', 'consumption_raw', 'salads', 'ice_cream', 'dairy_products', 'cold_cuts', 'contaminated_beef', 'beef_products', 'meats', 'improperly_cooked', 'raw_undercooked']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid1

In [6]:
search(keywords=["seasonality", "transmission", "humidity", "heat", "summer"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"])


Search for terms ['seasonality', 'transmission', 'humidity', 'heat', 'summer']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['seasonality', 'transmission', 'humidity_heat', 'summer', 'seasonal_patterns', 'seasonal_variation', 'seasonal_pattern', 'seasonal_trends', 'seasonality_influenza', 'seasonal_variations', 'trans_mission', 'transmissions', 'disease_transmission', 'contact_transmission', 'overall_discomfort', 'microclimate_temperature_humidity', 'emotional_benefits', 'perceived_comfort', 'n95mask_combination', 'perceived_exertion', 'thermophysiological', 'nanofunctional', 'physical_discomfort', 'subjective_ratings', 'winter', 'autumn', 'during_winter', 'during_summer', 'rainy', 'winter_summer', 'summer_autumn', 'fall_winter', 'cold_winter', 'during_winter_spring']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronav

In [7]:
search(["incubation_time", "incubation", "age"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=20, only_sentences=True)


Search for terms ['incubation_time', 'incubation', 'age']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['incubationtime', 'incubation', 'age']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

It suggests that age 40 can be a key age cutoff for the incubation of COVID-19 along with previous statistical analysis.  

To estimate the asymptomatic infected population, we looked at publicly available data with at least 15 days of significant increase in confirmed COVID19 case numbers and back-calculated the population count that would have likely been in the pre-symptomatic incubation phase on previous dates.

The following findings

In [8]:
search(["Prevalence", "asymptomatic", "shedding", "transmission", "children"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=20, only_sentences=True)


Search for terms ['Prevalence', 'asymptomatic', 'shedding', 'transmission', 'children']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['prevalence', 'asymptomatic_shedding', 'transmission', 'children', 'prevalence_rates', 'prolonged_shedding', 'asymptomatic_persistence', 'persistent_shedding', 'trans_mission', 'transmissions', 'disease_transmission', 'contact_transmission', 'young_children', 'children_who', 'infants_children', 'older_children', 'infants', 'adults', 'children_adults', 'among_children', 'healthy_children']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

5 The prevalence of colonisation in children, which appears

In [9]:
search(["seasonality", "transmission", "humidity", "heat", "summer"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['seasonality', 'transmission', 'humidity', 'heat', 'summer']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['seasonality', 'transmission', 'humidity_heat', 'summer', 'seasonal_patterns', 'seasonal_variation', 'seasonal_pattern', 'seasonal_trends', 'seasonality_influenza', 'seasonal_variations', 'trans_mission', 'transmissions', 'disease_transmission', 'contact_transmission', 'overall_discomfort', 'microclimate_temperature_humidity', 'emotional_benefits', 'perceived_comfort', 'n95mask_combination', 'perceived_exertion', 'thermophysiological', 'nanofunctional', 'physical_discomfort', 'subjective_ratings', 'winter', 'autumn', 'during_winter', 'during_summer', 'rainy', 'winter_summer', 'summer_autumn', 'fall_winter', 'cold_winter', 'during_winter_spring']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronav

In [10]:
search(["adhesion", "hydrophilic", "hydrophobic", "surfaces", "decontamination"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['adhesion', 'hydrophilic', 'hydrophobic', 'surfaces', 'decontamination']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['adhesion', 'hydrophilic_hydrophobic', 'surfaces', 'decontamination', 'hydrophobic_hydrophilic', 'ligand_shell', 'oppositely_charged', 'hydrophilichydrophobic', 'polar_nonpolar', 'adsorbent_surface', 'anionic_cationic', 'neutral_charged', 'both_hydrophilic_hydrophobic', 'micelles_liposomes', 'disinfection']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Although the viral load of coronaviruses on inanimate surfaces is not known during an outbreak situation it seems plausible to reduce the v

In [11]:
search(["Persistence", "stability","nasal discharge", "sputum", "urine", "fecal matter"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Persistence', 'stability', 'nasal discharge', 'sputum', 'urine', 'fecal matter']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['persistence', 'stability', 'nasal_discharge', 'sputum_urine', 'fecal_matter', 'sneezing_nasal', 'sneezing_nasal_discharge', 'ocular_discharge', 'watery_eyes', 'sneezing_cough', 'discharge_cough', 'cough_nasal', 'swollen_sinuses', 'eye_discharge', 'tracheal_rales', 'urine_sputum', 'lavage_specimen', 'sputum_endotracheal', 'pleural_tap', 'sputum_bal_fluid', 'fluid_pleural_fluid', 'antibiotic_treatment_commenced', 'sputum_bronchoalveolar_lavage_fluid', 'culture_sputum', 'sent_culture', 'secretions_feces', 'sweat_urine', 'saliva_nasal', 'contaminated_feces', 'other_body_secretions', 'secretions_urine', 'faecal_material', 'sweat_tears', 'animal_feces', 'fecally_contaminated']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'out

In [12]:
search(["Persistence", "materials", "copper", "stainless steel", "plastic"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Persistence', 'materials', 'copper', 'stainless steel', 'plastic']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['persistence', 'materials', 'copper', 'stainless_steel', 'plastic', 'stainless', 'galvanized_steel']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

HCoV-19 was most stable on plastic and stainless steel and viable virus could be detected up to 33 72 hours post application ( Figure 1A ), though the virus titer was greatly reduced (plastic from 10 3.7 to 34 10 0.6 TCID 50 /mL after 72 hours, stainless steel from 10 3.7 to 10 0.6 TCID 50 /mL after 48 hours).

For example, aspects of the social cont

In [13]:
search(["natural", "history", "virus", "infected"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['natural', 'history', 'virus', 'infected']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['natural_history', 'virus', 'infected']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

In 1918, individuals infected with influenza typically passed on the virus to between 1 and 2 of their social contacts [1] .

However, these symptoms are nonspecific, as there are isolated cases where, for example, in an asymptomatic infected family a chest CT scan revealed pneumonia and the pathogenic test for the virus came back positive.

The initial symptoms of recently infected patients seem more insidious, indicating that the ne

In [14]:
search(["implementation", "diagnostics", "product", "clinical", "process"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['implementation', 'diagnostics', 'product', 'clinical', 'process']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['implementation', 'diagnostics', 'product', 'clinical', 'process', 'implementing', 'adoption', 'products']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Effective teamwork can increase the efficiency of the work, in the process of care, while buy and use the new improved care products, not only more convenient and effective; the current hospital has been promoted to the implementation of the respiratory care ward, To share the unit to promote the implementation of experience, in order to achieve

In [15]:
search(["disease models", "animals", "infection", "transmission"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['disease models', 'animals', 'infection', 'transmission']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['disease', 'models', 'animals', 'infection', 'transmission', 'model', 'infections', 'trans_mission', 'transmissions', 'disease_transmission', 'contact_transmission']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Based on those findings, we develop a mathematical transmission model to disentangle how transmission is affected by age differences in the biology of COVID-19 infection and disease, and altered mixing patterns due to social distancing.

The model generates an ensemble of possible epidemic scenar

In [17]:
search(["Tools", "studies", "monitor", "phenotypic change", "potential adaptation", "virus", "mutation"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Tools', 'studies', 'monitor', 'phenotypic change', 'potential adaptation', 'virus', 'mutation']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['tools', 'studies', 'monitor', 'phenotypic_change', 'potential', 'adaptation', 'virus', 'mutation', 'powerful_computational', 'adaption', 'mutations', 'point_mutation', 'point_mutations']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Here, we analyzed the potential mutations that may have evolved after the virus became epidemic among humans and also the mutations resulting in the human adaptation.

Thus, it is urgent to tightly monitor the mutation and adaptation of

In [19]:
search(["Immune", "Immunity", "Immune response"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Immune', 'Immunity', 'Immune response']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['immune', 'immunity', 'immune_response', 'immunity_against', 'immune_responses', 'immune_responses', 'adaptive_immune_response', 'immune_response_against', 'cellular_immune_response', 'humoral_immune_response', 'innate_adaptive_immune_responses']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

It is currently considered that the lung injury is not only associated with direct virus-induced injury, but also COVID-19 invasion triggers the immune responses that lead to the activation of immune cells(monocyte, macrophage, T and

In [21]:
search(["Effectiveness", "movement control", "restrictions", "strategy", "prevent secondary transmission"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Effectiveness', 'movement control', 'restrictions', 'strategy', 'prevent secondary transmission']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['effectiveness', 'movement', 'control', 'restrictions', 'strategy_prevent', 'secondary_transmission', 'efficacy', 'movements', 'strategy_preventing', 'secondary_cases', 'among_close_contacts', 'unrecognized_cases', 'superspreading_events_occurred']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Since COVID-19 emerged in early December, 2019 in Wuhan and swept across China Mainland, a series of large-scale public health interventions, especially Wuhan lock-down comb

In [22]:
search(["Effectiveness", "personal protective equipment", "PPE", "strategy", "prevent transmission"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['Effectiveness', 'personal protective equipment', 'PPE', 'strategy', 'prevent transmission']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['effectiveness', 'personal_protective_equipment_ppe', 'strategy_prevent', 'transmission', 'efficacy', 'personal_protective_equipment', 'appropriate_ppe', 'equipment_ppe', 'n95_higher', 'proper_personal_protective', 'proper_use_ppe', 'including_n95_masks', 'use_personal_protection', 'wearing_full', 'protective_gear', 'strategy_preventing', 'trans_mission', 'transmissions', 'disease_transmission', 'contact_transmission']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Since

In [None]:
search(["transm"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)