# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the Python source path environment variables for Jupyter Notebook. If you haven't done this yet, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
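
If restarting Jupyter is inconvenient, a common alternative is to extend `sys.path` from the first notebook cell. The sketch below is only one way to do it: `add_src_to_path` is a hypothetical helper, not part of the project, and the path is the same placeholder used in the commands above. It affects only the current kernel's Python imports.

```python
import sys
from pathlib import Path


def add_src_to_path(src_dir):
    """Put src_dir at the front of sys.path if it is not already listed."""
    resolved = str(Path(src_dir).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Placeholder path, same as in the export commands; adjust to your checkout.
add_src_to_path("/path/to/covid19/src")
```

Calling the helper twice with the same directory leaves `sys.path` unchanged the second time.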

In [1]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

unable to import 'smart_open.gcs', disabling that module


In [2]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentence metadata
2. Bi-gram model
3. Tri-gram model
4. Trained FastText vectors

In [3]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
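
The search-term logs in the examples further down show multi-word phrases joined with underscores (for example `local_public_health`), which is presumably what the bi-gram and tri-gram models contribute. How the pickled models work internally is not shown in this notebook, so the following is only a toy illustration of the idea: a greedy pass merges adjacent tokens whose pair appears in a known phrase set, and a second pass over the merged tokens builds tri-grams. The function `merge_phrases` and both phrase sets are made up for illustration.

```python
def merge_phrases(tokens, known_phrases):
    """Greedily join adjacent tokens whose pair is in known_phrases."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both tokens were consumed by the merged phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase sets; the real models are learned from the corpus.
bigrams = {("public", "health"), ("emergency", "response")}
trigrams = {("local", "public_health")}

tokens = ["coordinate", "local", "public", "health", "emergency", "response"]
after_bigrams = merge_phrases(tokens, bigrams)
after_trigrams = merge_phrases(after_bigrams, trigrams)
print(after_trigrams)
# ['coordinate', 'local_public_health', 'emergency_response']
```

Running the tri-gram pass on the output of the bi-gram pass is what lets a three-word phrase form from an already merged pair plus one neighbor.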


Define a simple helper that takes a list of keywords, runs the search, and prints the results:

In [4]:
def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    """Run a search and print either the matching sentences or the full results."""
    print(f"\nSearch for terms {keywords}\n\n")
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")

    if only_sentences:
        # Print just the sentence text of each result.
        for result in results:
            print(result['sentence'] + "\n")
    else:
        # Pretty-print the full result objects.
        pprint(results)
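
The `synonyms_threshold` argument (0.8 by default) suggests that the engine expands each keyword with terms whose word vectors are similar enough to it. The engine's actual implementation is not shown in this notebook, so the following is a minimal sketch of threshold-based expansion using cosine similarity. `expand_with_synonyms`, `cosine`, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_with_synonyms(keyword, vectors, threshold=0.8):
    """Return the keyword followed by every term at or above the threshold."""
    if keyword not in vectors:
        return [keyword]  # unknown word: nothing to expand with
    query = vectors[keyword]
    scored = [
        (cosine(query, vec), term)
        for term, vec in vectors.items()
        if term != keyword
    ]
    # Most similar terms first, keeping only those that pass the threshold.
    neighbors = [term for score, term in sorted(scored, reverse=True) if score >= threshold]
    return [keyword] + neighbors


# Made-up vectors; real FastText vectors here would have 100 dimensions.
toy_vectors = {
    "guidelines": [1.0, 0.0],
    "guidance": [0.9, 0.1],
    "recommendations": [0.8, 0.4],
    "prison": [0.0, 1.0],
}
print(expand_with_synonyms("guidelines", toy_vectors, threshold=0.8))
# ['guidelines', 'guidance', 'recommendations']
```

Raising the threshold narrows the expansion: with `threshold=0.95` only `guidance` survives in this toy vocabulary, while a lower threshold admits looser associations.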

Let's see some examples:

In [13]:
search(["coordinate", "local", "public health", "emergency response"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['coordinate', 'local', 'public health', 'emergency response']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['coordinate', 'local_public_health', 'emergency_response', 'local_state', 'health_departments', 'state_local_health_departments', 'local_state_federal', 'local_county', 'local_health', 'public_health_departments', 'local_health_authorities', 'state_local_health', 'health_department', 'disaster_response', 'disaster_emergency', 'command_control', 'hospital_preparedness', 'emergency_preparedness', 'preparedness', 'responding_public_health', 'hospital_emergency_management', 'contingency_plan', 'preparedness_efforts']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'n

In [14]:
search(["federal", "state", "local", "public health", "emergency response", "surveillance"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['federal', 'state', 'local', 'public health', 'emergency response', 'surveillance']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['federal_state_local', 'public_health_emergency_response', 'surveillance', 'local_state_federal', 'response_agencies', 'local_regional_state', 'federal_agencies', 'local_provincial', 'federal_level', 'state_local_public_health', 'departments_agencies', 'other_federal_agencies', 'local_state', 'health_emergency_response', 'skills_techniques', 'national_subnational_levels', 'response_disasters', 'leadership_training_program', 'ihr_2005_implementation', 'accreditation_process', 'national_coordination', 'coordination_communication', 'risk_assessment_risk_communication']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'sin

In [16]:
search(["investments", "baseline", "public health", "response", "infrastructure preparedness"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['investments', 'baseline', 'public health', 'response', 'infrastructure preparedness']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['investments', 'baseline', 'public_health_response', 'infrastructure', 'preparedness', 'investment', 'investments_health', 'public_investment', 'publicprivate_partnerships', 'maternal_neonatal_child', 'investments_made', 'international_donors', 'pledges', 'fiscal_space', 'economic_diversification', 'response_efforts', 'response_plans', 'public_health_actions', 'response_plan', 'outbreak_control_measures', 'preparedness_efforts', 'emergencies_international_concern', 'preparedness_planning', 'public_health_crisis', 'timely_response', 'infrastructures', 'infrastructure_support', 'physical_infrastructure', 'communications_infrastructure', 'resources_infrastructure', 'emergency_preparedness', 'preparedness_response', 'preparedness_efforts', 'preparedness_planning', 'disaster_preparedness', 'level_preparedness', 'em

In [21]:
search(["communicate", "talk", "elders"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['communicate', 'talk', 'elders']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['communicate', 'talk', 'elders', 'lower_income', 'lowincome_families', 'more_likely_live', 'vulnerable_group', 'poorly_educated', 'alaska_natives', 'higher_income', 'disabled_people', 'among_older_people', 'live_rural_areas']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

What we know about social distancing policies is based largely on models of influenza, 4,13 where children are a vulnerable group.

As it is highly contagious, many people are frightened by it and even talk fearfully about coronavirus, which can also be observed

In [22]:
search(["Risk communication", "guidelines", "simple", "families"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['Risk communication', 'guidelines', 'simple', 'families']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['risk_communication', 'guidelines', 'simple', 'families', 'risk_communications', 'crisis_emergency', 'health_communication', 'effective_risk_communication', 'communication_strategies', 'preparedness_planning', 'communicating_public', 'risk_communication_principles', 'emergency_responses', 'outbreak_communication', 'recommendations', 'international_guidelines', 'national_guidelines', 'published_guidelines', 'guideline', 'guidance', 'standards_guidelines', 'family']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Resu

In [29]:
search(["indications", "risk", "population groups"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['indications', 'risk', 'population groups']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['indications', 'risk', 'population', 'groups', 'risks', 'populations', 'group']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

For these measures to be effective, special attention should be devoted to those population groups that are more at risk and patterns of intergenerational contact.

For these measures to be effective, a special attention should be devoted to those population groups that are more at risk and to the strength of the connections across generations.

Younger parts of a population present a much lowe

In [30]:
search(["Misunderstanding", "containment", "mitigation"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['Misunderstanding', 'containment', 'mitigation']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['misunderstanding', 'containment_mitigation', 'unease', 'misunderstandings', 'hospitality_tourism_industry', 'dismissive', 'divergence_views', 'scaremongering', 'local_customs', 'raising_public', 'reporting_suicide', 'miscommunication', 'developing_policies', 'response_future_outbreaks', 'policies_interventions', 'infectious_diseases_threats', 'informing_policy', 'implementation_eff_ective', 'distribution_scarce_resources', 'risk_communication_efforts', 'current_future_outbreaks', 'pandemic_planning_response']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus

In [32]:
search(["inequity", "public health", "access", "information", "available"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['inequity', 'public health', 'access', 'information', 'available']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['inequity', 'public_health', 'access', 'information_available', 'inequalities_access', 'health_inequality', 'social_inequality', 'social_inequalities', 'health_inequity', 'inequities_health', 'health_inequalities', 'inequities', 'disparities_health', 'china_faces', 'medical_public_health']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

This is why this chapter is focusing mainly on the issues related to the source, access, extent, and quality of medicines information available among pharmacists a

In [33]:
search(["marginalized populations", "disadvantaged populations", "research priorities", "minorities"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['marginalized populations', 'disadvantaged populations', 'research priorities', 'minorities']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['marginalized_populations', 'disadvantaged_populations', 'research_priorities', 'minorities', 'disenfranchised', 'disadvantaged_groups', 'poor_marginalized', 'socially_disadvantaged', 'socially_marginalized', 'hardtoreach_populations', 'unhealthy_behaviours', 'sexual_minorities', 'barriers_accessing', 'marginalised_communities', 'disadvantaged_groups', 'marginalized_populations', 'gender_norms', 'aboriginal_canadians', 'inequities_health', 'sexual_minorities', 'reproductive_rights', 'marginalised_communities', 'barriers_accessing', 'ethnic_diversity', 'priority_actions', 'intersectoral_collaboration', 'planning_preparedness', 'research_gaps', 'effective_intervention_strategies', 'environmental_health_sectors', 'implementation_strategies', 'suicide_suicide_attempts', 'oh_research', 'epidemic_preparedness

In [39]:
search(["prisoner", "prison", "access to information", "diagnosis", "treatment"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['prisoner', 'prison', 'information', 'lack', 'diagnosis', 'treatment']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['prisoner', 'prison', 'information', 'lack', 'diagnosis_treatment', 'found_guilty', 'hackers', 'sadhus', 'really_bad', 'rnao_2003', 'protesting', 'jailed', 'lovers', 'compulsorily', 'protested', 'inmates', 'correctional_facilities', 'information_about', 'relevant_information', 'information_regarding', 'diagnosis_management']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

With SARS-CoV-2 being discovered very recently, there is currently a lack of immunological information available about the 

In [40]:
search(["coverage policies", "diagnosis", "treatment", "care"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=15, only_sentences=True)


Search for terms ['coverage policies', 'diagnosis', 'treatment', 'care']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['coverage', 'policies', 'diagnosis_treatment', 'care', 'policy', 'polices', 'national_policy', 'existing_laws', 'national_policies', 'diagnosis_management', 'health_care']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Genereux, Schluter, Tamahashi et al. [4] argued that standardizing psychometrically robust instruments would also be urgently needed to identify at-risk patients throughout-before, during, and after emergencies and disasters-to ensure that mental and social health needs are addressed throughou