# ScienceSearch NLP Keywords Example
How to use natural language processing (NLP) to generate high-quality metadata (keywords) for searching documents.

## Import modules
Import modules and set up logging

In [13]:
# imports
from pathlib import Path
import pickle
from sciencesearch.nlp.hyper import Hyper, algorithms_from_results
from sciencesearch.nlp.sweep import Sweep
from sciencesearch.nlp.models import Rake, Yake, KPMiner, Ensemble
from sciencesearch.nlp.train import train_hyper, load_hyper, run_hyper
from operator import attrgetter
# logging
import logging
logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)

## Generate a tuned algorithm for extracting keywords
The first step is to tune the parameters of the available algorithms to the particular type of text that will be processed. This is best done by providing some "gold standard" keywords for sample documents, then letting the system run combinations of parameters to see experimentally which ones come closest to generating the same keywords automatically. Our approach here is to run three different NLP algorithms (Rake, Yake, and KPMiner) across a variety of settings and pick all combinations that come close to the best F1 score. These algorithm/parameter combinations are then encapsulated in an "ensemble" algorithm that takes the union of the keywords generated by each individual algorithm.
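
The selection and union steps can be illustrated with a minimal sketch in plain Python. It does not use the `sciencesearch` API: the result tuples, the scores, the keywords, and the `tolerance` name are all made up for illustration, and the library's own selection logic may differ.

```python
# Hypothetical tuning results: (label, F1 score, keywords produced for one document).
# None of these values come from a real run.
results = [
    ("yake ws=2", 0.42, {"beam", "detector"}),
    ("rake max_len=2", 0.40, {"beam", "undulator"}),
    ("kpminer lasf=3", 0.25, {"shift"}),
]


def select_near_best(results, tolerance):
    """Keep every combination whose F1 is within `tolerance` of the best F1."""
    best = max(score for _, score, _ in results)
    return [r for r in results if best - r[1] <= tolerance]


def ensemble_keywords(selected):
    """Union of the keywords generated by each selected combination."""
    union = set()
    for _, _, keywords in selected:
        union |= keywords
    return union


selected = select_near_best(results, tolerance=0.05)
keywords = ensemble_keywords(selected)
print([label for label, _, _ in selected])
print(sorted(keywords))
```

In this toy data the KPMiner combination falls too far below the best score, so only the Yake and Rake keywords reach the union.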

Note: The F1 score balances two performance metrics: precision and recall. Here, precision is the proportion of generated keywords that match the gold standard, and recall is the proportion of gold standard keywords that were generated at all. Since these two metrics tend to vary inversely (in particular, generating _lots_ of keywords tends to give good recall but poor precision), the F1 score balances them by taking their harmonic mean. As a result, roughly speaking, the F1 score is pulled toward the lower of the two.
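
As a small worked example of these definitions, the sketch below computes precision, recall, and F1 for one document using exact string matching. The keyword lists are invented, and the matching rule used by the library (for example, case handling or stemming) may differ.

```python
def precision_recall_f1(predicted, gold):
    """Compare two keyword collections by exact match."""
    predicted, gold = set(predicted), set(gold)
    matches = len(predicted & gold)
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up keywords: many predictions, so recall is good but precision is poor
gold = ["beam", "detector", "undulator", "laser"]
predicted = ["beam", "detector", "shift", "run", "status", "laser", "vacuum", "magnet"]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Three of the eight predictions match and three of the four gold keywords are found, giving precision 0.375, recall 0.75, and F1 0.5, which sits closer to the lower score than the arithmetic mean (0.5625) would.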

In [2]:
hyperparameter = Hyper()

## Set up parameter sweeps
The `Sweep` class from the `sciencesearch.nlp.sweep` module is used to configure the algorithm and range of parameters to use in the hyperparameter tuning.
The list of possible parameters is shown with the `.print_params` method of each algorithm class. Note that these include a set of parameters shared across all the algorithms, for which there are reasonable defaults.
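
To give a feel for how many combinations a sweep produces, here is a rough sketch that expands ranges and discrete lists into a grid with `itertools.product`. It is not how `Sweep` is implemented. The helper names are hypothetical, and treating `ub` as inclusive is an assumption.

```python
from itertools import product


def range_values(lb, ub, step):
    """Values from lb to ub in increments of step (inclusive ub is an assumption)."""
    count = int((ub - lb) / step + 1e-9) + 1
    return [lb + i * step for i in range(count)]


def expand(param_values):
    """Cartesian product of all parameter values, one dict per combination."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


# Same parameter values as the Yake sweep configured in this notebook
yake_grid = expand({
    "ws": range_values(1, 3, 1),
    "dedup": [0.8, 0.9, 0.95],
    "dedup_method": ["leve", "seqm"],
})
print(len(yake_grid))
print(yake_grid[0])
```

Under these assumptions the Yake sweep covers 3 x 3 x 2 = 18 combinations, which is why adding more parameters or finer steps quickly increases tuning time.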

In [3]:
Yake.print_params()
sweep = Sweep(alg=Yake)
sweep.set_param_range("ws", lb=1, ub=3, step=1)
sweep.set_param_discrete("dedup", [0.8, 0.9, 0.95])
sweep.set_param_discrete("dedup_method", ["leve", "seqm"])  # "jaro" is also accepted but is not swept here
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Yake:
  - int ws: YAKE window size. Default is 2
  - float dedup: Deduplication limit for YAKE. Default is 0.9
  - str dedup_method: method ('leve', 'seqm' or 'jaro'). Default is leve
  - int ngram: Maximum ngram size. Default is 2



In [4]:
Rake.print_params()
sweep = Sweep(alg=Rake)
sweep.set_param_range("min_len", lb=1, ub=1, step=1)
sweep.set_param_range("max_len", lb=1, ub=3, step=1)
sweep.set_param_range("min_kw_occ", lb=1, ub=10, step=1)
sweep.set_param_discrete("include_repeated_phrases", [False, True])
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Rake:
  - int min_len: Minimum ngram size. Default is 1
  - int max_len: Maximum ngram size. Default is 3
  - int min_kw_len: Minimum keyword length. Applied as post-processing filter.. Default is 3
  - int min_kw_occ: Mimumum number of occurences of keyword in text string.Applied as post-processing filter.. Default is 4
  - Any ranking_metric: ranking parameter for rake algorithm. Default is Metric.DEGREE_TO_FREQUENCY_RATIO
  - bool include_repeated_phrases: boolean for determining whether multiple of the same keywords are output by rake. Default is True



In [5]:
KPMiner.print_params()
sweep = Sweep(alg=KPMiner)
sweep.set_param_range("lasf", lb=1, ub=3, step=1)
# The sweeps below take a very long time to run, so they are left disabled.
#sweep.set_param_range("cutoff", lb=200, ub=1300, nsteps=5)
#sweep.set_param_range("alpha", lb=3.0, ub=4.0, step=0.2)
#sweep.set_param_range("sigma", lb=2.6, ub=3.2, step=0.2)
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
KPMiner:
  - int lasf: Last allowable seen frequency. Default is 3
  - int cutoff: Cutoff threshold for number of words after which if a phrase appears for the first time it is ignored. Default is 400
  - float alpha: Weight-adjustment parameter 1 for boosting factor.See original paper for definition. Default is 2.3
  - float sigma: Weight-adjustment parameter 2 for boosting factor.See original paper for definition. Default is 3.0
  - object doc_freq_info: Document frequency counts. Default (None) uses the semeval2010 countsprovided in 'df-semeval2010.tsv.gz'. Default is None



### Create configuration to automatically train and build a Searcher object that allows you to find files by their predicted and gold keywords

See the example in `examples/search-vis-demo-config`.

To use the existing configuration:
1. Add input files to `private_data/slac_logs`.
2. Add the training keyword file as `slac_keywords.csv`.
    - Each row has the format `filename`, `"list, of, keywords, as, a, comma, separated, string"`
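
A file in this format can be read with Python's standard `csv` module, as in the sketch below. The filenames and keywords are invented, the `read_keywords` helper is hypothetical, and the absence of a header row is an assumption. The library's own loader may behave differently.

```python
import csv
import io

# Invented sample rows in the filename, "comma, separated, keywords" format
sample = 'log_001.txt,"beam, detector, undulator"\nlog_002.txt,"laser, vacuum"\n'


def read_keywords(handle):
    """Map each filename to its list of gold keywords (assumes no header row)."""
    keywords = {}
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        filename, keyword_string = row[0].strip(), row[1]
        keywords[filename] = [k.strip() for k in keyword_string.split(",") if k.strip()]
    return keywords


gold = read_keywords(io.StringIO(sample))
print(gold)
```

The quotes around the keyword string matter: without them, the CSV parser would treat each keyword as a separate column.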

#### Understanding the config file


##### Section 1: Declare the algorithms (Yake, Rake, and/or KPMiner)

In `algorithms` include:
- `yake`
    - `module`: `sciencesearch.nlp.models`
    - `class`: `Yake`
- `rake`
    - `module`: `sciencesearch.nlp.models`
    - `class`: `Rake`
- `kpminer`
    - `module`: `sciencesearch.nlp.models`
    - `class`: `KPMiner`

<details> <summary>Example</summary>
  
```json
"algorithms": {
        "yake": {
            "module": "sciencesearch.nlp.models",
            "class": "Yake"
        },
        "rake": {
            "module": "sciencesearch.nlp.models",
            "class": "Rake"
        },
        "kpminer": {
            "module": "sciencesearch.nlp.models",
            "class": "KPMiner"
        }
    },
```
</details>

##### Section 2: Define potential hyperparameter combinations

See `Set up parameter sweeps` for details

<details> <summary>Example</summary>
  
``` json
"sweeps": {
        "kpminer": {
            "lasf": {
                "_type": "range",
                "lb": 1,
                "ub": 3,
                "step": 1
            },
            "-cutoff": {
                "_type": "range",
                "lb": 200,
                "ub": 1300,
                "nsteps": 5
            },
            "-alpha": {
                "_type": "range",
                "lb": 3.0,
                "ub": 4.0,
                "step": 0.2
            },
            "sigma": {
                "_type": "range",
                "lb": 2.8,
                "ub": 3.0,
                "step": 0.2
            }
        },
        "rake": {
            "min_len": {
                "_type": "range",
                "lb": 1,
                "ub": 1,
                "step": 1
            },
            "max_len": {
                "_type": "range",
                "lb": 1,
                "ub": 3,
                "step": 1
            },
            "min_kw_occ": {
                "_type": "range",
                "lb": 1,
                "ub": 10,
                "step": 1
            },
            "include_repeated_phrases": {
                "_type": "discrete",
                "values": [false, true]
            }
        },
        "yake": {
            "ws": {
                "_type": "range",
                "lb": 1,
                "ub": 3,
                "step": 1
            },
            "dedup": {
                "_type": "discrete",
                "values": [0.8, 0.9, 0.95]
            },
            "dedup_method": {
                "_type": "discrete",
                "values": ["leve", "seqm"]
            }
        }
    },
```
</details>

##### Section 3: Define filepaths to training data, etc

In `training` include: 
- `directory`: Base directory containing training data
- `input_files`: glob pattern(s) for input text files
- `keywords`: CSV file containing training keywords
- `epsilon`: Learning rate or noise parameter for training
- `save_file`: Output file for trained model or hyperparameters
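
As an illustration of how these settings might be resolved into concrete paths, the sketch below expands the glob patterns with `pathlib`. The `resolve_training_paths` helper is hypothetical, and resolving both the input patterns and the keyword file relative to `directory` is an assumption. A temporary directory with invented files stands in for real data.

```python
import tempfile
from pathlib import Path


def resolve_training_paths(training):
    """Expand glob patterns and keyword file names relative to the base directory."""
    base = Path(training["directory"])
    inputs = sorted(p for pattern in training["input_files"] for p in base.glob(pattern))
    keyword_files = [base / name for name in training["keywords"]]
    return inputs, keyword_files


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Invented files standing in for real training data
    (base / "a.txt").write_text("first log")
    (base / "b.txt").write_text("second log")
    (base / "slac_keywords.csv").write_text('a.txt,"beam"\n')

    training = {
        "directory": tmp,
        "input_files": ["*.txt"],
        "keywords": ["slac_keywords.csv"],
    }
    inputs, keyword_files = resolve_training_paths(training)
    input_names = [p.name for p in inputs]
    print(input_names, [p.name for p in keyword_files])
```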


<details> <summary>Example</summary>
  
``` json
"training": {
        "directory": "../private_data/slac_logs",
        "input_files": ["*.txt"],
        "keywords": ["slac_keywords.csv"],
        "epsilon": 0.05,
        "save_file": "slac_hyper.pkl"
    },
```
</details>

##### Section 4: Define file paths for saving output

In `saving` include: 
- `output_files_directory`: Output file directory for search results
- `css_filepath`: Filepath to styling for highlighted text HTML

<details> <summary>Example</summary>
  
``` json
"saving": {
        "css_filepath": "../../shared/keyword_vis.css",
        "output_files_directory": "../private_data/results"
    }
```
</details>
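
Before training, it can help to sanity-check a configuration assembled from the four sections. The sketch below uses a hypothetical `check_config` helper, which is not part of `sciencesearch`. It assumes all four sections are required and that every sweep must refer to a declared algorithm.

```python
import json

REQUIRED_SECTIONS = ("algorithms", "sweeps", "training", "saving")


def check_config(config):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in config]
    declared = set(config.get("algorithms", {}))
    for name in config.get("sweeps", {}):
        if name not in declared:
            problems.append(f"sweep for undeclared algorithm: {name}")
    return problems


# A reduced configuration built from the examples in this section
config = json.loads("""
{
    "algorithms": {
        "yake": {"module": "sciencesearch.nlp.models", "class": "Yake"}
    },
    "sweeps": {
        "yake": {"ws": {"_type": "range", "lb": 1, "ub": 3, "step": 1}}
    },
    "training": {
        "directory": "../private_data/slac_logs",
        "input_files": ["*.txt"],
        "keywords": ["slac_keywords.csv"],
        "epsilon": 0.05,
        "save_file": "slac_hyper.pkl"
    },
    "saving": {
        "css_filepath": "../../shared/keyword_vis.css",
        "output_files_directory": "../private_data/results"
    }
}
""")

problems = check_config(config)
print(problems or "config looks consistent")
```

Parsing with `json.loads` also catches syntax slips, such as a missing comma between sections, before any training starts.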