# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the Python source path environment variables for Jupyter Notebook. If you haven't done this yet, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
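
If the notebook is already running and the variables were not exported, one possible workaround is to extend `sys.path` from inside the notebook. This is only a sketch: `add_src_to_path` is a hypothetical helper, not part of the project, and `/path/to/covid19/src` is the same placeholder used in the commands above.

```python
import os
import sys


def add_src_to_path(src_dir):
    """Put src_dir at the front of sys.path unless it is already listed."""
    src_dir = os.path.abspath(src_dir)
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir


# Placeholder path from the instructions above; replace it with your checkout.
added = add_src_to_path("/path/to/covid19/src")
```

Calling the helper a second time with the same directory leaves `sys.path` unchanged, so the cell can be re-run safely.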

In [1]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

unable to import 'smart_open.gcs', disabling that module


In [2]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentences metadata
2. Bi-gram model
3. Tri-gram model
4. Trained FastText vectors
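
Loading these artifacts can take a while, so it may help to confirm that all four files are present before constructing the engine. The sketch below assumes the file names used in this notebook; `missing_files` is a hypothetical helper introduced only for illustration.

```python
import os


def missing_files(base_dir, relative_paths):
    """Return the relative paths that do not exist as files under base_dir."""
    return [
        path for path in relative_paths
        if not os.path.isfile(os.path.join(base_dir, path))
    ]


# File names as used in this notebook; adjust them if your data layout differs.
REQUIRED_FILES = [
    "sentences_with_metadata.csv",
    "covid_bigram_model_v0.pkl",
    "covid_trigram_model_v0.pkl",
    "fasttext_no_subwords_trigrams/word-vectors-100d.txt",
]

data_dir = "../../../workspace/kaggle/covid19/data"
missing = missing_files(data_dir, REQUIRED_FILES)
if missing:
    print("Missing input files:", missing)
```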

In [3]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt


A simple search helper that takes a list of required keywords, plus optional keywords, and prints the top results:

In [4]:
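def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    """Run a query against the global `search_engine` and print the results."""
    print(f"\nSearch for terms {keywords}\n\n")
    # All tuning parameters are passed through unchanged to SearchEngine.search.
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")

    if only_sentences:
        # Print only the sentence text of each result.
        for result in results:
            print(result['sentence'] + "\n")
    else:
        # Print the full result records, including metadata.
        pprint(results)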
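
The `synonyms_threshold` argument and the "synonym expansion" line in the output below suggest that query terms are expanded with nearby words from the FastText vectors. The internals of `SearchEngine` are not shown in this notebook, so the following is only a sketch of one common way to do this: cosine similarity with a cutoff. `expand_with_synonyms` and the three-dimensional `toy_vectors` are made up for illustration and are not the project's code or data.

```python
import numpy as np


def expand_with_synonyms(terms, vectors, threshold=0.8):
    """Return terms plus vocabulary words with cosine similarity >= threshold."""
    vocab = list(vectors)
    matrix = np.array([vectors[word] for word in vocab], dtype=float)
    # Normalize rows so a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

    expanded = list(terms)
    for term in terms:
        if term not in vectors:
            continue  # Out-of-vocabulary terms are kept but not expanded.
        query = np.asarray(vectors[term], dtype=float)
        query = query / np.linalg.norm(query)
        scores = matrix @ query
        # Walk candidates from most to least similar and stop at the cutoff.
        for idx in np.argsort(-scores):
            if scores[idx] < threshold:
                break
            if vocab[idx] not in expanded:
                expanded.append(vocab[idx])
    return expanded


# Toy vectors, invented for this example only.
toy_vectors = {
    "surgical_masks": [1.0, 0.0, 0.0],
    "face_masks": [0.9, 0.1, 0.0],
    "school_closures": [0.0, 1.0, 0.0],
    "school_closure": [0.1, 0.95, 0.0],
    "twitter": [0.0, 0.0, 1.0],
}
result = expand_with_synonyms(["surgical_masks", "school_closures"], toy_vectors)
```

Under this scheme, raising the threshold yields fewer but closer expansions, and lowering it yields a broader and noisier query. The sketch assumes no zero-length vectors, since those cannot be normalized.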

Let's see some examples:

In [12]:
search(["ethical principles", "issues", "articulate", "translate"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['ethical principles', 'articulate', 'translate']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['ethical_principles', 'articulate', 'translate', 'ethical_standards', 'nonmaleficence', 'respect_persons', 'beneficence_justice', 'bioethical', 'guiding_principles', 'statutes_regulations', 'professional_codes', 'belmont_report', 'bioethical_principles', 'public_deliberation', 'interpretative_frameworks', 'different_audiences', 'articulating', 'communicate_clearly', 'moral_values']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

For example, methods exist to reconstruct transmission trees for sampled sequences usin

In [14]:
search(["ethics", "ethical issues", "thematic", "oversight"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['ethics', 'ethical issues', 'thematic', 'oversight']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['ethics', 'ethical_issues', 'thematic', 'oversight', 'medical_ethics', 'bioethics', 'ethical_legal', 'ethical', 'ethical_aspects', 'ethical_challenges', 'ethical_concerns', 'ethical_questions', 'ethical_considerations', 'practical_ethical', 'ethical_legal', 'legal_issues', 'practicalities', 'ethical', 'hashtags', 'semantic_networks', 'recurring_themes', 'three_main_themes', 'meaning_units', 'online_discussion', 'coding_framework', 'covid19_epidemic_sina_microblog', 'categories_themes', 'federal_agency', 'rdna_research', 'security_requirements', 'scientific_advice', 'national_authority', 'regulatory_oversight', 'federal_state_local', 'select_agent_research', 'codes_practice', 'federal_agencies']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_

In [21]:
search(["methods", "ethics", "access", "information"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['methods', 'ethics', 'access', 'information']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['methods', 'ethics', 'access_information', 'methodologies', 'medical_ethics', 'bioethics', 'ethical_legal', 'ethical', 'ethical_aspects', 'health_alerts', 'access_uptodate', 'teleconferencing', 'electronic_communication', 'use_internet', 'electronic_communications', 'requesters', 'access_internet', 'handhelds', 'online_training']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Medical ethics review should strictly abide by the review work content of the "Ethical Review Methods for Biomedical Research Involving Humans"

In [24]:
search(["WHO", "research", "platform", "connect", "global"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['WHO', 'research', 'platform', 'connect', 'global']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['who', 'research', 'platform', 'connect', 'global', 'platforms']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

Organisations such as the Global Outbreak Alert and Response Network (GOARN), the Coalition for Epidemic Preparedness Innovations (CEPI), the Global Research Collaboration For Infectious Disease Preparedness (GloPID-R) and the Global Initiative on Sharing All Influenza Data (GISAID) have been supported by the WHO Research Blueprint and its Global Coordinating Mechanism to provide a forum where those w

In [27]:
search(["efforts", "public health", "measures", "surgical masks", "SRH", "school closures", "effective"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['efforts', 'public health', 'measures', 'surgical masks', 'SRH', 'school closures', 'effective']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['efforts', 'public_health_measures', 'surgical_masks', 'srh', 'school_closures', 'effective', 'effort', 'containment_measures', 'pharmaceutical_measures', 'quarantining_contacts', 'quarantine_measures', 'control_measures_implemented', 'nonpharmaceutical_interventions_npis', 'public_health_interventions', 'isolation_quarantine', 'travel_bans', 'border_control_measures', 'medical_masks', 'n95_respirators', 'n95_masks', 'face_masks', 'respirators', 'surgical_mask', 'n95_ffrs', 'cloth_masks', 'n95_respirator', 'ffrs', 'school_closure', 'closing_schools', 'school_closings', 'social_distancing_measures', 'social_distancing', 'most_effective', 'very_effective', 'highly_effective']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019nc

In [36]:
search(["psychological stress", "health", "nurse", "doctor"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['psychological stress', 'health', 'nurse', 'doctor']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['psychological_stress', 'health', 'nurse_doctor', 'psychological_responses', 'emotional_stress', 'wellbeing', 'health_wellbeing', 'medical_appointments', 'remind_them', 'meditation_training', 'complete_diary', 'off_duty', 'ppe_should_worn', 'hospital_administrator', 'were_asked_refrain', 'scheduled_appointments', 'registration_clerk']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavirus_covid19', '2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originating_wuhan_china', 'novel_coronavirus_outbreak', 'wuhan_coronavirus']

Results:

This is the most quoted definition of health, which clearly stresses "well-being.

The reported vulnerability analysis informs publ

In [37]:
search(["fear", "anxiety", "stigma", "misinformation", "fake news", "social media"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"],
       top_n=10, only_sentences=True)


Search for terms ['fear', 'anxiety', 'stigma', 'misinformation', 'fake news', 'social media']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['fear_anxiety', 'stigma', 'misinformation', 'fake_news', 'social_media', 'anxiety_fear', 'living_upper_floors', 'negative_feelings', 'helplessness', 'fear_death', 'fear_frustration', 'strange_environments', 'hopelessness', 'feelings_anxiety', 'anger_fear', 'stigmatization', 'shame', 'rumors', 'fake_news', 'myths', 'rumour', 'false_information', 'sensationalist', 'sentiments', 'misconceptions', 'sensationalist', 'disinformation', 'public_discourse', 'hoax', 'patriotism', 'some_commentators', 'myths', 'false_information', 'rumour', 'communicating_information', 'twitter', 'mass_media', 'facebook', 'facebook_twitter', 'youtube', 'use_social_media', 'blogs', 'through_social_media', 'healthrelated_information', 'weibo']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['newcoronavirus', 'coronavir