# ScienceSearch NLP Keywords with Visualization and Saving Results Example


## Import modules
Import modules and set up logging

In [1]:
# imports
from pathlib import Path
from sciencesearch.nlp.hyper import Hyper, algorithms_from_results
from sciencesearch.nlp.sweep import Sweep
from sciencesearch.nlp.models import Rake, Yake, KPMiner, Ensemble
from sciencesearch.nlp.train import train_hyper, load_hyper, run_hyper
from sciencesearch.nlp.search import Searcher
from operator import attrgetter
# logging
import logging
logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)
from sciencesearch.nlp.visualize_kws import JsonView

## Generate a tuned algorithm for extracting keywords
The first step is to tune the parameters of the available algorithms to the particular type of text that will be processed. This is best done by providing some "gold standard" keywords for sample documents, then letting the system run combinations of parameters to see experimentally which combination comes closest to generating the same keywords automatically. Our approach here is to run three NLP algorithms (Rake, Yake, and KPMiner) across a variety of settings and keep every combination whose F1 score comes close to the best one. These algorithm/parameter combinations are then encapsulated in an "ensemble" algorithm that takes the union of the keywords generated by each individual algorithm.

Note: The F1 score balances two performance metrics: precision and recall. Here, precision is the proportion of generated keywords that match the gold standard, and recall is the proportion of gold standard keywords that were generated at all. Since these two metrics tend to vary inversely (in particular, generating _lots_ of keywords tends to give good recall but poor precision), F1 balances them by taking their harmonic mean. As a result, roughly speaking, F1 stays close to the lower of the two scores.
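
To make the note above concrete, here is a minimal sketch of the three metrics for two keyword lists. The helper `keyword_f1` and the example keywords are hypothetical and are not part of `sciencesearch`; matching keywords case-insensitively is also an assumption, and the library may compare them differently.

```python
def keyword_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of keywords."""
    # Case-insensitive exact matching is an assumption made for this sketch.
    pred = {k.lower() for k in predicted}
    ref = {k.lower() for k in gold}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: 4 gold keywords, 10 generated keywords, 3 of them correct.
gold = ["moon", "garden", "daimyo", "Lady Aya"]
predicted = ["moon", "garden", "daimyo", "river", "castle",
             "bridge", "night", "sword", "temple", "village"]

precision, recall, f1 = keyword_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Generating many keywords gives a recall of 3/4 here but a precision of only 3/10, and the resulting F1 lands much nearer to the precision than to the recall.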

In [2]:
textdir = Path.cwd().parent / "data" / "jft"

epsilon = 0.1
max_alg = 5
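
The sketch below illustrates one plausible reading of these two settings: `epsilon` as the largest allowed gap from the best F1 score, and `max_alg` as a cap on how many configurations go into the ensemble. That reading, the helper names, and all scores and keywords are assumptions made for illustration; the actual selection logic lives inside `sciencesearch` and may differ.

```python
def select_near_best(scores, epsilon, max_alg):
    """Keep configurations whose F1 is within epsilon of the best, at most max_alg."""
    best = max(scores.values())
    close = [(name, f1) for name, f1 in scores.items() if best - f1 <= epsilon]
    close.sort(key=lambda item: item[1], reverse=True)  # best scores first
    return [name for name, _ in close[:max_alg]]


def ensemble_keywords(outputs, selected):
    """Union of the keywords produced by each selected configuration."""
    merged = set()
    for name in selected:
        merged |= set(outputs[name])
    return merged


# Hypothetical F1 scores and keyword outputs for four configurations.
scores = {
    "yake(ws=2)": 0.42,
    "rake(max_len=2)": 0.35,
    "kpminer(lasf=1)": 0.20,
    "yake(ws=3)": 0.40,
}
outputs = {
    "yake(ws=2)": ["moon", "garden"],
    "rake(max_len=2)": ["daimyo"],
    "kpminer(lasf=1)": ["maidens"],
    "yake(ws=3)": ["moon", "Lady Aya"],
}

selected = select_near_best(scores, epsilon=0.1, max_alg=5)
keywords = ensemble_keywords(outputs, selected)
print(selected)
print(sorted(keywords))
```

With these made-up numbers, the KPMiner configuration falls more than `epsilon` below the best score, so its keywords do not enter the union.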

In [3]:
hyperparameter = Hyper()

## Set up parameter sweeps
The `Sweep` class from the `sciencesearch.nlp.sweep` module is used to configure the algorithm and range of parameters to use in the hyperparameter tuning.
The available parameters for each algorithm class are listed by its `print_params()` method. These include a set of parameters shared across all the algorithms, for which there are reasonable defaults.

In [4]:
Yake.print_params()
sweep = Sweep(alg=Yake)
sweep.set_param_range("ws", lb=1, ub=3, step=1)
sweep.set_param_discrete("dedup", [0.8, 0.9, 0.95])
sweep.set_param_discrete("dedup_method", ["leve", "seqm"])  # "jaro" is also available but not swept here
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Yake:
  - int ws: YAKE window size. Default is 2
  - float dedup: Deduplication limit for YAKE. Default is 0.9
  - str dedup_method: method ('leve', 'seqm' or 'jaro'). Default is leve
  - int ngram: Maximum ngram size. Default is 2
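
To give a feel for how quickly a sweep grows, the following sketch expands the Yake settings above into explicit parameter combinations using only the standard library. It does not use the `Sweep` class; `expand_range` is a hypothetical helper, and treating `ub` as inclusive is an assumption about how `set_param_range` behaves.

```python
from itertools import product


def expand_range(lb, ub, step):
    """Integer values from lb to ub; an inclusive upper bound is assumed here."""
    return list(range(lb, ub + 1, step))


# The same values as the Yake sweep, written out as plain lists.
grid = {
    "ws": expand_range(1, 3, 1),
    "dedup": [0.8, 0.9, 0.95],
    "dedup_method": ["leve", "seqm"],
}

# One dict per combination of parameter values (the Cartesian product).
names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combos))
print(combos[0])
```

Under these assumptions the Yake sweep alone covers 3 x 3 x 2 = 18 configurations, which is why adding further parameters to a sweep can make tuning slow.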


In [5]:
Rake.print_params()
sweep = Sweep(alg=Rake)
sweep.set_param_range("min_len", lb=1, ub=1, step=1)
sweep.set_param_range("max_len", lb=1, ub=3, step=1)
sweep.set_param_range("min_kw_occ", lb=1, ub=10, step=1)
sweep.set_param_discrete("include_repeated_phrases", [False, True])
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Rake:
  - int min_len: Minimum ngram size. Default is 1
  - int max_len: Maximum ngram size. Default is 3
  - int min_kw_len: Minimum keyword length. Applied as post-processing filter.. Default is 3
  - int min_kw_occ: Mimumum number of occurences of keyword in text string.Applied as post-processing filter.. Default is 4
  - Any ranking_metric: ranking parameter for rake algorithm. Default is Metric.DEGREE_TO_FREQUENCY_RATIO
  - bool include_repeated_phrases: boolean for determining whether multiple of the same keywords are output by rake. Default is True


In [6]:
KPMiner.print_params()
sweep = Sweep(alg=KPMiner)
sweep.set_param_range("lasf", lb=1, ub=3, step=1)
# Sweeping the parameters below as well takes a very long time, so they are disabled
#sweep.set_param_range("cutoff", lb=200, ub=1300, nsteps=5)
#sweep.set_param_range("alpha", lb=3.0, ub=4.0, step=0.2)
#sweep.set_param_range("sigma", lb=2.6, ub=3.2, step=0.2)
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
KPMiner:
  - int lasf: Last allowable seen frequency. Default is 3
  - int cutoff: Cutoff threshold for number of words after which if a phrase appears for the first time it is ignored. Default is 400
  - float alpha: Weight-adjustment parameter 1 for boosting factor.See original paper for definition. Default is 2.3
  - float sigma: Weight-adjustment parameter 2 for boosting factor.See original paper for definition. Default is 3.0
  - object doc_freq_info: Document frequency counts. Default (None) uses the semeval2010 countsprovided in 'df-semeval2010.tsv.gz'. Default is None


## Train and run models
In this example, we pick the 'best' result for each algorithm by training on two files with some user-provided keywords.
Then we extract keywords from a third file using the trained model.

A `Searcher` reads the training data described in a search configuration file and selects the keywords produced by the best model.
We save the results of the hyperparameter training in a serialized Python "pickle" file so we don't need to repeat the training.
The same hyperparameters can then be run on multiple files without retraining by using `run_hyper()`.
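
The caching idea can be sketched with the standard `pickle` module alone. The helper `load_or_train` and the stand-in training function below are hypothetical; they only illustrate the pattern and are not how `train_hyper` or `load_hyper` are implemented.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(cache_path, train_fn):
    """Return cached results if the pickle file exists, otherwise train and save."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = train_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_training():
    """Stand-in for an expensive training run; records each invocation."""
    calls.append(1)
    return {"best_f1": 0.42}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "hyper.pkl"
    first = load_or_train(cache, fake_training)   # trains and writes the pickle
    second = load_or_train(cache, fake_training)  # reads the pickle, no training

print(first, second, len(calls))
```

Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.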

In [7]:
s = Searcher()
demo = Searcher.from_config("seach_vis_config.json")

## Visualize Results
In this example, we save the results for easy viewing: first as JSON, then as a color-coded HTML page that shows the keywords in context.

There are two options for saving and visualizing: a single set of keywords, or multiple sets of keywords together.

In [8]:
json_viewer = JsonView(demo)

In [9]:
demo.training_keywords

{'file1.txt': ['hunter',
  'princes',
  'Mystic Mountain',
  'mimulus',
  'Fugi',
  'dreamer'],
 'file2.txt': ['Aya',
  'daimyo',
  'Lady Aya',
  'moon',
  'maidens',
  'garden',
  'Lord',
  'Lord of Ako']}

In [10]:
json_viewer.save_predicted_keywords

<bound method JsonView.save_predicted_keywords of <sciencesearch.nlp.visualize_kws.JsonView object at 0x342ac0040>>

The first option uses only one set of keywords.

In this example, the predicted keywords are saved, and the resulting JSON file is visualized as one HTML page per input file.

In [14]:
json_viewer.save_predicted_keywords(filename="../examples/results/predicted_keywords.json")
JsonView.visualize_from_config(
    config_file="seach_vis_config.json",
    json_file="../examples/results/predicted_keywords.json",
    save_filename="../examples/results/predicted_keywords_html",
)
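
For readers who want to inspect or post-process such a file, here is a minimal sketch of writing a file-to-keywords mapping as JSON and reading it back. The layout shown (one list of keywords per file name) is an assumption that mirrors the `training_keywords` output above; the file actually written by `JsonView` may be structured differently, and `save_keywords` and `load_keywords` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def save_keywords(keywords_by_file, filename):
    """Write a {file name: [keywords]} mapping as indented JSON."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(keywords_by_file, f, indent=2, ensure_ascii=False)


def load_keywords(filename):
    """Read the mapping back from disk."""
    with Path(filename).open(encoding="utf-8") as f:
        return json.load(f)


# Assumed layout, using a few keywords from the training set shown earlier.
keywords = {
    "file1.txt": ["hunter", "Mystic Mountain"],
    "file2.txt": ["Lady Aya", "moon", "garden"],
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results" / "keywords.json"
    save_keywords(keywords, target)
    restored = load_keywords(target)

print(restored == keywords)
```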

The second option uses multiple sets of keywords.

In this example, both the predicted keywords and the training keywords are saved, and the resulting JSON file is visualized as one HTML page per input file. Each set of keywords is color-coded, and a key identifies the sets.

In [15]:
json_viewer.save_all_keyword_sets("../examples/results/keywords_all_sets.json")
JsonView.visualize_from_config(
    config_file="seach_vis_config.json",
    json_file="../examples/results/keywords_all_sets.json",
    save_filename="../examples/results/keywords_all_sets_html",
)
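
The idea of color-coding keywords in context can be sketched in a few lines of standard-library Python. This is not the `JsonView` implementation: the `highlight` function, the colors, and the rule that the first keyword set wins when a keyword belongs to several sets are all assumptions made for illustration.

```python
import html
import re

# Hypothetical colors, one per keyword set.
COLORS = {"training": "#ffd54f", "predicted": "#81d4fa"}


def highlight(text, keyword_sets, colors=COLORS):
    """Wrap each keyword occurrence in a <mark> tag colored by its keyword set."""
    # Map each lowercased keyword to the first set that contains it.
    owner = {}
    for set_name, keywords in keyword_sets.items():
        for kw in keywords:
            owner.setdefault(kw.lower(), set_name)
    if not owner:
        return html.escape(text)
    # Try longer keywords first so "Lady Aya" is preferred over "Aya".
    alternatives = sorted(owner, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in alternatives) + r")\b",
        re.IGNORECASE,
    )
    pieces, pos = [], 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[pos:match.start()]))
        color = colors[owner[match.group(0).lower()]]
        pieces.append(
            f'<mark style="background:{color}">{html.escape(match.group(0))}</mark>'
        )
        pos = match.end()
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)


text = "Lady Aya walked in the garden under the moon."
sets = {"training": ["Lady Aya", "moon"], "predicted": ["garden", "moon"]}
page = highlight(text, sets)
print(page)
```

A full page would also need a legend that maps each color to its keyword set, which corresponds to the key mentioned above.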