# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the Python source path environment variables for Jupyter Notebook. If you haven't done this yet, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
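
If exporting environment variables is inconvenient, a notebook-local alternative is to extend `sys.path` in the first cell. The following is a minimal sketch: the path is the same placeholder as above, and `add_src_to_path` is a hypothetical helper name, not part of the project.

```python
import os
import sys


def add_src_to_path(src_dir):
    """Prepend src_dir to sys.path (only once) so modules under it can be imported."""
    src_dir = os.path.abspath(src_dir)
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir


# Placeholder path; replace it with the location of your own checkout.
src_path = add_src_to_path("/path/to/covid19/src")
```

This only affects the running kernel, so it has to be executed before the `from search.searchengine import SearchEngine` cell.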

In [1]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

unable to import 'smart_open.gcs', disabling that module


In [2]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentence metadata
2. Bi-gram model
3. Tri-gram model
4. Trained FastText vectors
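
Loading these four inputs takes a while, so it can help to confirm that they exist before constructing the object. The sketch below assumes the file names used in this notebook; `find_missing_files` is a hypothetical helper, not part of the search engine.

```python
import os


def find_missing_files(data_dir, relative_paths):
    """Return the joined paths under data_dir that do not exist as files."""
    paths = [os.path.join(data_dir, rel) for rel in relative_paths]
    return [path for path in paths if not os.path.isfile(path)]


# Same data directory and file names as in the surrounding cells.
data_dir = "../../../workspace/kaggle/covid19/data"
required = [
    "sentences_with_metadata.csv",
    "covid_bigram_model_v0.pkl",
    "covid_trigram_model_v0.pkl",
    "fasttext_no_subwords_trigrams/word-vectors-100d.txt",
]

missing = find_missing_files(data_dir, required)
if missing:
    print("Missing input files:")
    for path in missing:
        print("  " + path)
```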

In [3]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt


A simple wrapper function that takes a list of keywords to search for and prints the results:

In [4]:
def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    """Run a query against the global search_engine and print the results.

    With only_sentences=True, print just the sentence text of each result;
    otherwise pretty-print the full result records.
    """
    print(f"\nSearch for terms {keywords}\n\n")
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")

    if only_sentences:
        for result in results:
            print(result['sentence'] + "\n")
    else:
        pprint(results)
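
The `synonyms_threshold` argument suggests that query terms are expanded with words whose vectors are close enough to the originals. How `SearchEngine` does this internally is not shown here, so the following is only an illustrative sketch of threshold-based expansion using cosine similarity over toy vectors. The functions `cosine_similarity` and `expand_with_synonyms` and the vectors themselves are assumptions made up for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_with_synonyms(terms, vectors, threshold=0.8):
    """Append every vocabulary word whose similarity to a query term is >= threshold."""
    expanded = list(terms)
    for term in terms:
        if term not in vectors:
            continue  # out-of-vocabulary terms are kept but not expanded
        for candidate, vec in vectors.items():
            if candidate in expanded:
                continue
            if cosine_similarity(vectors[term], vec) >= threshold:
                expanded.append(candidate)
    return expanded


# Toy 2-d vectors; the real vectors in this notebook are 100-dimensional.
toy_vectors = {
    "smoking": [1.0, 0.0],
    "cigarette_smoking": [0.9, 0.1],
    "mitigation": [0.0, 1.0],
}
print(expand_with_synonyms(["smoking"], toy_vectors, threshold=0.8))
# prints ['smoking', 'cigarette_smoking']
```

Under this reading, a higher threshold yields fewer but closer expansion terms, and a lower one yields a broader query.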

Let's see some examples:

In [6]:
search(["risk_factor", "Smoking", "pulmonary disease"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['risk_factor', 'Smoking', 'pulmonary disease']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactor', 'smoking', 'pulmonary_disease', 'adjust_potential_confounding', 'important_confounders', 'causal_pathway', 'order_identify_factors', 'veal_calf_mortality', 'demographic_factors_eg', 'factors_age_sex', 'possible_confounding_factors', 'gender_comorbidities', 'probable_risk_factors', 'cigarette_smoking']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

■ Smoking may cause mucosal keratinisation and pigmentary incompetence and is linked to oral cancer ■ Oral snuff dipping and chewing tobacco predispose to le
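
The "search terms after cleaning" line above shows `'risk_factor'` becoming `'riskfactor'`, `'Smoking'` becoming `'smoking'`, and `'pulmonary disease'` becoming `'pulmonary_disease'`. One way to reproduce that behaviour is sketched below. It is a guess based on the printed output, not the project's actual code: `clean_term`, `join_known_phrases`, and the `known_phrases` set (a toy stand-in for the trained bi-gram model) are all hypothetical.

```python
import re


def clean_term(term):
    """Lowercase and drop every character that is not a letter, digit, or space."""
    return re.sub(r"[^a-z0-9 ]", "", term.lower())


def join_known_phrases(tokens, known_phrases):
    """Greedily merge adjacent tokens whose underscore-joined form is a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + "_" + tokens[i + 1] in known_phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


known_phrases = {"pulmonary_disease"}  # toy stand-in for a trained bi-gram model
query = ["risk_factor", "Smoking", "pulmonary disease"]
tokens = [tok for term in query for tok in clean_term(term).split()]
print(join_known_phrases(tokens, known_phrases))
# prints ['riskfactor', 'smoking', 'pulmonary_disease']
```

Note that stripping punctuation also removes underscores typed by the user, which is why `risk_factor` ends up as the single token `riskfactor` in this sketch.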

In [7]:
search(["risk_factor", "Co-infections", "viral infections", "co-morbidities"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['risk_factor', 'Co-infections', 'viral infections', 'co-morbidities']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactor', 'coinfections', 'viral_infections', 'comorbidities', 'adjust_potential_confounding', 'important_confounders', 'causal_pathway', 'order_identify_factors', 'veal_calf_mortality', 'demographic_factors_eg', 'factors_age_sex', 'possible_confounding_factors', 'gender_comorbidities', 'probable_risk_factors', 'coinfection', 'mixed_infections', 'respiratory_viral_infections', 'comorbidity', 'comorbid_conditions', 'underlying_conditions', 'underlying_diseases', 'underlying_comorbidities', 'comorbid_diseases', 'preexisting_medical_conditions', 'underlying_medical_conditions', 'underlying_illness', 'presence_comorbidities']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019nc

In [13]:
search(["risk_factors", "pregnant women", "covid19"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['risk_factors', 'pregnant women', 'covid19']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactors', 'pregnant_women_covid19', 'iliarti', 'presenting_emergency_departments', 'differences_age_sex', 'hrv_species_genotypes', 'risk_factors_presenting_ed', 'diagnoses_pneumonia', 'risk_factors_hospitalization', 'rhinitis_asthma_respiratory', 'hivpositive_negative', 'antihistamine_prescription', 'premorbid_conditions', 'shortand_longterm_outcomes', 'however_paucity_data', 'obstetrical_outcomes', 'sarscov2related', 'bad_outcomes', 'risk_factors_poor_prognosis', 'confounding_indication', 'childhood_cancer_survivors', 'risk_factors_severe_illness']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: []

Results:

According to most recent US Centers for Disease Control and Prevention (CDC) guidance, risk factors for severe illness are not yet clear, although older patients and those with chronic medical conditions may b

In [15]:
search(["risk_factors", "Socio-economic"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['risk_factors', 'Socio-economic']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactors', 'socioeconomic', 'iliarti', 'presenting_emergency_departments', 'differences_age_sex', 'hrv_species_genotypes', 'risk_factors_presenting_ed', 'diagnoses_pneumonia', 'risk_factors_hospitalization', 'rhinitis_asthma_respiratory', 'hivpositive_negative', 'antihistamine_prescription', 'social_determinants', 'poverty_social', 'social_inequality', 'social_economic', 'societal_factors', 'socioeconomical', 'social_economic_factors', 'political_socioeconomic', 'socioeconomic_cultural', 'social_cultural_economic']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbr

In [18]:
search(["risk_factors", "Transmission dynamics", "reproductive number", "incubation period", "serial interval"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['risk_factors', 'Transmission dynamics', 'reproductive number', 'incubation period', 'serial interval']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactors', 'transmission_dynamics', 'reproductive_number', 'incubation_period_serial_interval', 'iliarti', 'presenting_emergency_departments', 'differences_age_sex', 'hrv_species_genotypes', 'risk_factors_presenting_ed', 'diagnoses_pneumonia', 'risk_factors_hospitalization', 'rhinitis_asthma_respiratory', 'hivpositive_negative', 'antihistamine_prescription', 'disease_dynamics', 'epidemiological_dynamics', 'spatiotemporal_spread', 'transmission_patterns', 'spatiotemporal_variation', 'disease_transmission_dynamics', 'human_mobility_patterns', 'spatial_transmission', 'simulate_spread', 'epidemic_dynamics', 'reproduction_number', 'basic_reproductive_number', 'basic_reproduction_number', 'epidemic_growth_rate', 'effective_reproductive_number', 'basic_reproduction', 'value_r_0', 'effective_repro

In [20]:
search(["risk_factors", "Severity of disease", "fatality", "symptomatic hospitalized patients", "high-risk"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['risk_factors', 'Severity of disease', 'fatality', 'symptomatic hospitalized patients', 'high-risk']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactors', 'severity_disease', 'fatality', 'symptomatic', 'hospitalized_patients', 'highrisk', 'iliarti', 'presenting_emergency_departments', 'differences_age_sex', 'hrv_species_genotypes', 'risk_factors_presenting_ed', 'diagnoses_pneumonia', 'risk_factors_hospitalization', 'rhinitis_asthma_respiratory', 'hivpositive_negative', 'antihistamine_prescription', 'severity_illness', 'disease_severity', 'fatality_rate', 'asymptomatic', 'hospitalised_patients', 'patients_admitted_hospital', 'patients_hospitalized', 'patients_requiring_hospitalization', 'inpatients', 'hospital_inpatients', 'hospitalized_patients_who', 'children_admitted_pediatric_intensive', 'outpatients', 'patients_requiring_admission', 'high_risk', 'lowrisk']
Optional search terms after cleaning, bigrams, trigrams and synonym expans

In [26]:
search(["risk_factors", "population susceptibility"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['risk_factors', 'population susceptibility']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactors', 'population', 'susceptibility', 'iliarti', 'presenting_emergency_departments', 'differences_age_sex', 'hrv_species_genotypes', 'risk_factors_presenting_ed', 'diagnoses_pneumonia', 'risk_factors_hospitalization', 'rhinitis_asthma_respiratory', 'hivpositive_negative', 'antihistamine_prescription', 'populations']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Thus, given the dataset x and an estimate R 0 of the associated basic reproductive ratio, the matrix R should satisfy If we look at the ith row of th

In [28]:
search(["risk_factors", "public health", "mitigation", "measure"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['risk_factors', 'public health', 'mitigation', 'measure']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['riskfactors', 'public_health', 'mitigation', 'measure', 'iliarti', 'presenting_emergency_departments', 'differences_age_sex', 'hrv_species_genotypes', 'risk_factors_presenting_ed', 'diagnoses_pneumonia', 'risk_factors_hospitalization', 'rhinitis_asthma_respiratory', 'hivpositive_negative', 'antihistamine_prescription', 'medical_public_health']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Public health and healthcare experts agree that mitigation is required in order to slow the spread of COVID-19 and p