# Search Engine For Candidate Sentences

## Demonstration of how to use the simple search engine for fetching relevant sentences

Let's import our search engine from the `src` directory.

First, set the Python source path environment variables for Jupyter Notebook. If you haven't done this yet, run these two commands BEFORE starting Jupyter Notebook:
1. `export PYTHONPATH=/path/to/covid19/src`
2. `export JUPYTER_PATH=/path/to/covid19/src`
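
If exporting the variables before launch is not convenient, a rough alternative is to extend `sys.path` from inside the notebook. This is only a sketch: `SRC_DIR` reuses the placeholder path from the commands above, and `add_src_to_path` is a hypothetical helper, not part of the project. It affects imports in the current kernel only and does not set `JUPYTER_PATH`.

```python
import sys
from pathlib import Path

# Placeholder path taken from the export commands above; replace it with your checkout.
SRC_DIR = Path("/path/to/covid19/src")


def add_src_to_path(src_dir):
    """Prepend src_dir to sys.path unless it is already listed (hypothetical helper)."""
    src = str(src_dir)
    if src not in sys.path:
        sys.path.insert(0, src)
    return src


add_src_to_path(SRC_DIR)
```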

In [8]:
from search.searchengine import SearchEngine
from pprint import pprint
import os

In [9]:
data_dir = "../../../workspace/kaggle/covid19/data"

Initialize our `SearchEngine` object with:
1. Sentence metadata (CSV)
2. Bi-gram model
3. Tri-gram model
4. Trained FastText vectors

In [12]:
search_engine = SearchEngine(
    os.path.join(data_dir, "sentences_with_metadata.csv"),
    os.path.join(data_dir, "covid_bigram_model_v0.pkl"),
    os.path.join(data_dir, "covid_trigram_model_v0.pkl"),
    os.path.join(data_dir, "fasttext_no_subwords_trigrams/word-vectors-100d.txt"),
)

Loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary...
Finished loading CSV: ../../../workspace/kaggle/covid19/data/sentences_with_metadata.csv and building mapping dictionary
Loaded 249343 sentences
Loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Finished loading bi-gram model: ../../../workspace/kaggle/covid19/data/covid_bigram_model_v0.pkl
Loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Finished loading tri-gram model: ../../../workspace/kaggle/covid19/data/covid_trigram_model_v0.pkl
Loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
Finished loading fasttext model: ../../../workspace/kaggle/covid19/data/fasttext_no_subwords_trigrams/word-vectors-100d.txt
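
Loading takes a while, so it can help to confirm that all four input files exist before constructing the engine. The following is a minimal sketch, assuming the same directory layout as above; `missing_files` and `REQUIRED_FILES` are hypothetical names introduced here for illustration, not part of `SearchEngine`.

```python
import os

# File names copied from the SearchEngine call above.
REQUIRED_FILES = [
    "sentences_with_metadata.csv",
    "covid_bigram_model_v0.pkl",
    "covid_trigram_model_v0.pkl",
    "fasttext_no_subwords_trigrams/word-vectors-100d.txt",
]


def missing_files(data_dir, relative_paths):
    """Return the relative paths that are not regular files under data_dir."""
    return [
        rel for rel in relative_paths
        if not os.path.isfile(os.path.join(data_dir, rel))
    ]


# An empty list means every input is in place.
print(missing_files("../../../workspace/kaggle/covid19/data", REQUIRED_FILES))
```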


A simple search helper that takes a list of keywords, runs the search, and pretty-prints the results:

In [13]:
def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8):
    print(f"\nSearch for terms {keywords}\n\n")
    result = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:")
    pprint(result)
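
The notebook does not show how `synonyms_threshold` is applied inside `SearchEngine`. One plausible reading is a cosine-similarity cutoff over the word vectors, which the toy sketch below illustrates. Everything in it is an assumption: the two-dimensional vectors are made up, and `cosine` and `expand_term` are hypothetical helpers rather than the engine's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_term(term, vectors, threshold=0.8):
    """Return the other words whose similarity to term is at least threshold."""
    if term not in vectors:
        return []
    base = vectors[term]
    return sorted(
        word for word, vec in vectors.items()
        if word != term and cosine(base, vec) >= threshold
    )


# Made-up 2-d vectors, for illustration only.
toy_vectors = {
    "bats": [1.0, 0.0],
    "bat": [0.9, 0.1],
    "salads": [0.0, 1.0],
}

print(expand_term("bats", toy_vectors, threshold=0.8))
```

Under this reading, raising the threshold keeps only near-identical terms, while lowering it admits looser neighbours.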

Let's see some examples:

In [14]:
search(keywords=["animals", "zoonotic", "spillover", "animal to human",
             "bats", "snakes", "exotic animals", "seafood"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"])


Search for terms ['animals', 'zoonotic', 'spillover', 'animal to human', 'bats', 'snakes', 'exotic animals', 'seafood']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['spillover_humans', 'hostswitching_events', 'battohuman', 'host_switching_events', 'from_dromedaries_humans', 'interspecies_jumping', 'animaltoanimal_animaltohuman', 'humantohuman_transmission_events', 'emergence_events', 'zoonotic_spillover_events', 'human_animal', 'bat_species', 'fruit_bats', 'insectivorous_bats', 'species_bats', 'bat', 'exotic_pets', 'wild_animal_species', 'consumption_raw', 'salads', 'ice_cream', 'dairy_products', 'cold_cuts', 'contaminated_beef', 'beef_products', 'meats', 'improperly_cooked', 'raw_undercooked']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covid19', 'originati
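
Expanded terms such as `zoonotic_spillover_events` and `fruit_bats` are multi-word phrases joined with underscores. The pickled bi-gram and tri-gram models are not shown here, so the sketch below only illustrates the general idea with a greedy longest-match merge over a hand-written phrase set. `merge_phrases` and `toy_phrases` are hypothetical and do not reproduce the trained models.

```python
def merge_phrases(tokens, phrases, max_len=3):
    """Greedily join the longest known phrase starting at each position with '_'."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try trigrams first, then bigrams.
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hand-written phrase set, for illustration only.
toy_phrases = {
    ("zoonotic", "spillover", "events"),
    ("fruit", "bats"),
}

print(merge_phrases(["zoonotic", "spillover", "events", "in", "fruit", "bats"], toy_phrases))
```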

In [16]:
search(keywords=["seasonality", "transmission", "humidity", "heat", "summer"],
       optional_keywords=["new_coronavirus", "coronavirus", "covid19"])


Search for terms ['seasonality', 'transmission', 'humidity', 'heat', 'summer']


Search terms after cleaning, bigrams, trigrams and synonym expansion: ['seasonal_patterns', 'seasonal_variation', 'seasonal_pattern', 'seasonal_trends', 'seasonality_influenza', 'seasonal_variations', 'trans_mission', 'transmissions', 'disease_transmission', 'contact_transmission', 'overall_discomfort', 'microclimate_temperature_humidity', 'emotional_benefits', 'perceived_comfort', 'n95mask_combination', 'perceived_exertion', 'thermophysiological', 'nanofunctional', 'physical_discomfort', 'subjective_ratings', 'winter', 'autumn', 'during_winter', 'during_summer', 'rainy', 'winter_summer', 'summer_autumn', 'fall_winter', 'cold_winter', 'during_winter_spring']
Optional search terms after cleaning, bigrams, trigrams and synonym expansion: ['2019ncov_covid19', 'outbreak_2019_novel', 'sarscov2_2019ncov', 'coronavirus_2019ncov', 'ongoing_outbreak_novel_coronavirus', 'since_late_december', 'ongoing_outbreak_covi