# Methods for Hyperparameterization and Documentation of Model Parameters

Learn how this library creates a tuned ensemble algorithm for extracting keywords from text documents. The approach involves optimizing multiple NLP algorithms and combining their best-performing configurations. 

## Import modules
Import modules and set up logging

In [5]:
# imports
from pathlib import Path
import pickle
from sciencesearch.nlp.hyper import Hyper, algorithms_from_results
from sciencesearch.nlp.sweep import Sweep
from sciencesearch.nlp.models import Rake, Yake, KPMiner, Ensemble
from sciencesearch.nlp.train import train_hyper, load_hyper, run_hyper
from operator import attrgetter

# logging
import logging

logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)

## Methodology

A three-step process:
1. Tune parameters for the target text type

2. Evaluate three algorithms across a range of settings

3. Combine the high-performing algorithm configurations into an ensemble


#### Performance Evaluation
__F1 Score Metric__

The primary evaluation metric is the F1 score, a widely used measure that balances two key aspects of performance: precision and recall.

***

## Outcome 
An ensemble algorithm for keyword extraction that combines the strengths of multiple algorithms while maintaining high precision and recall.
***

#### Step 1: Parameter Tuning Process

We tune algorithm parameters specifically for our target text type to create an effective keyword extraction system. 

The process has four steps:

1. Establish Ground Truth: Provide "gold standard" keywords for sample documents

2. Explore Parameters: Test various parameter combinations across multiple algorithms

3. Evaluate Performance: Compare automated results against the gold standard

4. Determine Selection Criteria: Identify configurations that achieve near-optimal performance
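
The selection step can be sketched in a few lines. This is only an illustration of "near-optimal": the function name `near_optimal`, the `tolerance` value, and the scores are all made up for the example and are not taken from the library or from real runs.

```python
def near_optimal(results, tolerance=0.05):
    """Keep every configuration whose F1 is within `tolerance` of the best F1.

    `results` is a list of (configuration, f1_score) pairs. Both the data
    layout and the tolerance are assumptions of this sketch.
    """
    best = max(score for _, score in results)
    return [(cfg, score) for cfg, score in results if score >= best - tolerance]


# Toy scores, invented purely to show the filtering behavior
results = [
    ({"alg": "Yake", "ws": 1, "dedup": 0.9}, 0.38),
    ({"alg": "Yake", "ws": 2, "dedup": 0.9}, 0.46),
    ({"alg": "Rake", "max_len": 2, "min_kw_occ": 3}, 0.44),
    ({"alg": "KPMiner", "lasf": 3}, 0.30),
]

selected = near_optimal(results, tolerance=0.05)
for cfg, score in selected:
    print(cfg["alg"], score)
```

Keeping every configuration close to the best score, rather than only the single winner, is what gives the later ensemble step more than one component to combine.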


*Note: The F1 score balances two performance metrics: precision and recall. Here, precision is the proportion of generated keywords that match the gold standard, and recall is the proportion of gold standard keywords that were generated at all. These two metrics tend to vary inversely (in particular, generating _lots_ of keywords tends to give good recall but poor precision), so the F1 score balances them by taking their harmonic mean. As a result, roughly speaking, the F1 score tracks the lower of the two.*
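
As a concrete illustration of these definitions, the following minimal sketch computes precision, recall, and F1 for one document. The helper name `precision_recall_f1`, the case-insensitive exact matching, and the example keywords are assumptions made for this illustration; the library's own scoring may match keywords differently.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, f1) for two collections of keywords."""
    # Assumption of this sketch: exact, case-insensitive matching, duplicates ignored
    pred = {k.lower() for k in predicted}
    ref = {k.lower() for k in gold}
    matches = len(pred & ref)

    precision = matches / len(pred) if pred else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 3 of 6 generated keywords are in a 4-keyword gold standard
gold = ["neutron scattering", "superconductor", "phase transition", "magnetism"]
predicted = ["superconductor", "magnetism", "sample", "temperature", "data", "phase transition"]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.75 f1=0.60
```

With precision 0.50 and recall 0.75, the harmonic mean is 0.60, which sits below the arithmetic mean (0.625) and closer to the lower of the two values.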
***


#### Step 2: Joint Multi-Algorithm Approach

__We test three complementary NLP algorithms__

1. RAKE (Rapid Automatic Keyword Extraction)
   
2. YAKE (Yet Another Keyword Extractor)
    
3. KPMiner (Keyphrase Mining algorithm)

Each algorithm is tested across multiple parameter settings to find optimal configurations.
***

#### Step 3: Ensemble Creation

The final system combines multiple high-performing algorithm/parameter combinations into a unified ensemble that:

1. Takes the union of keywords from each component algorithm

2. Leverages the strengths of different extraction approaches

3. Provides more robust and comprehensive keyword identification
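
The union step can be sketched as follows. This is not the library's `Ensemble` implementation; the function `ensemble_union`, the case-insensitive deduplication, and the keyword lists are assumptions chosen only to show the idea of merging the outputs of several extractors.

```python
def ensemble_union(keyword_lists):
    """Merge several keyword lists into one, keeping first-seen order.

    Duplicates are detected case-insensitively, which is an assumption
    of this sketch rather than documented library behavior.
    """
    seen = set()
    merged = []
    for keywords in keyword_lists:
        for kw in keywords:
            key = kw.lower()
            if key not in seen:
                seen.add(key)
                merged.append(kw)
    return merged


# Invented outputs standing in for three tuned extractors
yake_out = ["keyword extraction", "ensemble", "F1 score"]
rake_out = ["ensemble", "parameter sweep"]
kpminer_out = ["f1 score", "gold standard"]

combined = ensemble_union([yake_out, rake_out, kpminer_out])
print(combined)
```

Because a union only adds keywords, it tends to raise recall; precision depends on how well each component was tuned beforehand.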

***

## Walkthrough of Hyperparameter Optimization

In [6]:
# Create a hyperparameter object
hyperparameter = Hyper()

### Set up parameter sweeps
The `Sweep` class from the `sciencesearch.nlp.sweep` module is used to configure the algorithm and range of parameters to use in the hyperparameter tuning.


Each algorithm class lists its available parameters through its `.print_params` method.
*Note that these include a set of parameters shared across all the algorithms, for which there are reasonable defaults.*
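
To get a feel for how quickly a sweep grows, the sketch below expands a range and two discrete value sets into every combination, using the YAKE values from this walkthrough. It uses only the standard library and does not call `Sweep`. Two things are assumed here and may not match the library: that the upper bound `ub` is inclusive, and that the sweep evaluates the full Cartesian product of parameter values.

```python
from itertools import product


def expand_range(lb, ub, step):
    """Integer values from lb up to and including ub (inclusive bound is an assumption)."""
    return list(range(lb, ub + 1, step))


# Parameter values mirroring the YAKE sweep in this walkthrough
grid = {
    "ws": expand_range(1, 3, 1),
    "dedup": [0.8, 0.9, 0.95],
    "dedup_method": ["leve", "seqm"],
}

# One dict per combination of parameter values
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 3 * 3 * 2 = 18
print(combinations[0])
```

Each additional swept parameter multiplies the number of runs, which is why some sweeps are best kept narrow.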

In [7]:
Yake.print_params()
sweep = Sweep(alg=Yake)
sweep.set_param_range("ws", lb=1, ub=3, step=1)
sweep.set_param_discrete("dedup", [0.8, 0.9, 0.95])
sweep.set_param_discrete("dedup_method", ["leve", "seqm"])  # "jaro" is also available but not swept here
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 7
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Yake:
  - int ws: YAKE window size. Default is 2
  - float dedup: Deduplication limit for YAKE. Default is 0.9
  - str dedup_method: method ('leve', 'seqm' or 'jaro'). Default is leve
  - int ngram: Maximum ngram size. Default is 2


Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Yake:
  - int ws: YAKE window size. Default is 2
  - float dedup: Deduplication limit for YAKE. Default is 0.9
  - str dedup_method: method ('leve', 'seqm' or 'jaro'). Default is leve
  - int ngram: Maximum ngram size. Default is 2

In [8]:
Rake.print_params()
sweep = Sweep(alg=Rake)
sweep.set_param_range("min_len", lb=1, ub=1, step=1)
sweep.set_param_range("max_len", lb=1, ub=3, step=1)
sweep.set_param_range("min_kw_occ", lb=1, ub=10, step=1)
sweep.set_param_discrete("include_repeated_phrases", [False, True])
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 7
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Rake:
  - int min_len: Minimum ngram size. Default is 1
  - int max_len: Maximum ngram size. Default is 3
  - int min_kw_len: Minimum keyword length. Applied as post-processing filter.. Default is 3
  - int min_kw_occ: Mimumum number of occurences of keyword in text string.Applied as post-processing filter.. Default is 4
  - Any ranking_metric: ranking parameter for rake algorithm. Default is Metric.DEGREE_TO_FREQUENCY_RATIO
  - bool include_repeated_phrases: boolean for determining whether multiple of the same keywords are output by rake. Default is True


Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
Rake:
  - int min_len: Minimum ngram size. Default is 1
  - int max_len: Maximum ngram size. Default is 3
  - int min_kw_len: Minimum keyword length. Applied as post-processing filter.. Default is 3
  - int min_kw_occ: Mimumum number of occurences of keyword in text string.Applied as post-processing filter.. Default is 4
  - Any ranking_metric: ranking parameter for rake algorithm. Default is Metric.DEGREE_TO_FREQUENCY_RATIO
  - bool include_repeated_phrases: boolean for determining whether multiple of the same keywords are output by rake. Default is True

In [9]:
KPMiner.print_params()
sweep = Sweep(alg=KPMiner)
sweep.set_param_range("lasf", lb=1, ub=3, step=1)
# The sweeps below are commented out because they take a very long time to run
# sweep.set_param_range("cutoff", lb=200, ub=1300, nsteps=5)
# sweep.set_param_range("alpha", lb=3.0, ub=4.0, step=0.2)
# sweep.set_param_range("sigma", lb=2.6, ub=3.2, step=0.2)
hyperparameter.add_sweep(sweep)

Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 7
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
KPMiner:
  - int lasf: Last allowable seen frequency. Default is 3
  - int cutoff: Cutoff threshold for number of words after which if a phrase appears for the first time it is ignored. Default is 400
  - float alpha: Weight-adjustment parameter 1 for boosting factor.See original paper for definition. Default is 2.3
  - float sigma: Weight-adjustment parameter 2 for boosting factor.See original paper for definition. Default is 3.0
  - object doc_freq_info: Document frequency counts. Default (None) uses the semeval2010 countsprovided in 'df-semeval2010.tsv.gz'. Default is None


Common:
  - Stopwords stopwords: Stopwords. Default is None
  - bool stemming: Whether to do stemming. Default is False
  - int num_keywords: How many keywords to extract. Default is 10
  - list keyword_sort: sort orderings: occ (number of occurrences), score, or a dict with weights for each of these keys, e.g., {'occ': 0.75, 'score': 0.25}, and additionally a flag 'i' for ignoring keyword case. Default is []
KPMiner:
  - int lasf: Last allowable seen frequency. Default is 3
  - int cutoff: Cutoff threshold for number of words after which if a phrase appears for the first time it is ignored. Default is 400
  - float alpha: Weight-adjustment parameter 1 for boosting factor.See original paper for definition. Default is 2.3
  - float sigma: Weight-adjustment parameter 2 for boosting factor.See original paper for definition. Default is 3.0
  - object doc_freq_info: Document frequency counts. Default (None) uses the semeval2010 countsprovided in 'df-semeval2010.tsv.gz'. Default is None