## __Notebook to Reproduce Results and Run Experiments__

### Steps for reproducing results  
- Step 0: Import modules

- Step 1: Set up the demo with a database

- Step 2: Select which experiment (keyword source) you would like to run

- Step 3: Set up the config and preprocess files

- Step 4: Train and run models

- Step 5: Visualize keywords in the context of the input file



### Step 0: Import modules
Import modules and set up logging.

See the [documentation](https://deepwiki.com/ScienceSearch/sciencesearch/1-overview) for details on the classes.
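
The logging configuration used in this step (root logger at `ERROR`, `sciencesearch` logger at `WARNING`) relies on how Python's `logging` hierarchy resolves levels: a logger without its own level inherits from its nearest configured ancestor. A minimal standalone sketch of that behavior, using only the standard library; the logger name `some_third_party` is a placeholder, not a real package.

```python
import logging

# Root logger at ERROR: loggers without their own level stay quiet below ERROR.
logging.root.setLevel(logging.ERROR)

# A named logger with an explicit level overrides the level inherited from root.
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)

# Child loggers inherit from the nearest ancestor that has a level set.
child = logging.getLogger("sciencesearch.nlp")   # inherits WARNING
other = logging.getLogger("some_third_party")    # placeholder name, inherits ERROR

child_level = logging.getLevelName(child.getEffectiveLevel())
other_level = logging.getLevelName(other.getEffectiveLevel())
print(child_level, other_level)
```

To see more detail while debugging, the `sciencesearch` logger could be lowered to `logging.INFO` without changing the root level.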

In [None]:
# imports
import logging
from operator import attrgetter
from pathlib import Path

from IPython.core.display import HTML
from sciencesearch.nlp.hyper import Hyper, algorithms_from_results
from sciencesearch.nlp.sweep import Sweep
from sciencesearch.nlp.models import Rake, Yake, KPMiner, Ensemble
from sciencesearch.nlp.train import train_hyper, load_hyper, run_hyper
from sciencesearch.nlp.search import KeywordExplorer
from sciencesearch.nlp.visualize_kws import JsonView

# logging
logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)

***

### Step 1: Set up the demo with a database
*This demo will only work if you are a SLAC employee with access to the correct data.*

***

To begin, create a `private_data` folder in the root directory, `sciencesearch/`.

Add a database source to the `private_data` folder.


*Note: This code generalizes to any database with `logbook` and `experiments` tables (see the structure in `simplified_elog.db`).*
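
Whether a given database has the two required tables can be checked before running the demo. The sketch below assumes the `.db` file is a SQLite database (suggested by the file extension but not stated here). `missing_tables` is a hypothetical helper, and the in-memory tables with a single `id` column are placeholders that do not reflect the real schema.

```python
import sqlite3

# Table names taken from the note above.
REQUIRED_TABLES = {"logbook", "experiments"}


def missing_tables(conn):
    """Return the required table names that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return REQUIRED_TABLES - present


# Demo on an in-memory database; the single `id` column is a placeholder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY)")
missing_before = missing_tables(conn)

conn.execute("CREATE TABLE logbook (id INTEGER PRIMARY KEY)")
missing_after = missing_tables(conn)
conn.close()

print(missing_before, missing_after)
```

For a real file, `sqlite3.connect(":memory:")` would be replaced by a connection to the database path given in the configuration.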

***

### Step 2: Select a source to extract keywords

Extract keywords from:

1. All experiment logs (elogs)
2. Only experiment descriptions
3. Only elogs with experiment parameters
4. Only elogs that are misc. commentary

See `README.md` for more details.

In [None]:
# TODO: Select an experiment to run
source = 4

source_configs = {
    1: "slac_config_all_elogs.json",
    2: "slac_config_descriptions.json",
    3: "slac_config_params.json",
    4: "slac_config_commentary.json"
}

# Set config filepath 
config_fp = f'config_files/{source_configs[source]}'
config_fp

***

### Step 3: Set up config and preprocess files 

The filepath to your database should be specified in the configuration:

```
"database": "private_data/{database_name}.db"
```


Preprocessed files are saved as `.txt` files in the directory specified in the config.

This setting does not need to be changed:

```
"training": {
    "directory": "../private_data/{custom_directory}",
    "input_files": ["*.txt"]
}
```
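
To make the `training` block concrete, the sketch below shows one way the `directory` and `input_files` entries could be resolved into a list of files. It is an illustration under assumptions, not the library's loader: `list_training_files` is a hypothetical helper, the demo directory name is made up, and how `sciencesearch` resolves relative paths is not specified here.

```python
import json
import tempfile
from pathlib import Path


def list_training_files(config, base_dir):
    """Expand each glob pattern in `input_files` under the training directory."""
    training = config["training"]
    # Assumption: the directory is interpreted relative to `base_dir`.
    directory = Path(base_dir) / training["directory"]
    files = []
    for pattern in training["input_files"]:
        files.extend(sorted(directory.glob(pattern)))
    return files


with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway directory with one matching and one non-matching file.
    data_dir = Path(tmp) / "demo_logs"
    data_dir.mkdir()
    (data_dir / "exp_a.txt").write_text("sample log text")
    (data_dir / "notes.md").write_text("not matched by *.txt")

    config = json.loads(
        '{"training": {"directory": "demo_logs", "input_files": ["*.txt"]}}'
    )
    found_names = [p.name for p in list_training_files(config, tmp)]

print(found_names)
```

Only files matching the `*.txt` pattern are picked up, which is why the preprocessing step writes its output with that extension.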
#### Run preprocessing of data files for the selected experiment



In [None]:
from sciencesearch.nlp.slac_data_extractor import SLACDatabaseDataExtractor

# Set up the database data extractor
data_extractor = SLACDatabaseDataExtractor(config_file=config_fp)
data_extractor.get_tables()

# Run the data extraction and cleaning method that matches your experiment type
if source == 1:
    # Preprocess files for experiment 1;
    # results will be saved in private_data/slac_logs
    data_extractor.process_elogs()

elif source == 2:
    # Preprocess files for experiment 2;
    # results will be saved in private_data/descriptions
    data_extractor.process_experiment_descriptions()

elif source == 3:
    # Preprocess files for experiment 3;
    # results will be saved in private_data/params
    data_extractor.process_experiment_elog_parameters()

elif source == 4:
    # Preprocess files for experiment 4;
    # results will be saved in private_data/commentary
    data_extractor.process_experiment_elog_commentary()
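
The `if`/`elif` chain above maps each source number to one extractor method. The same mapping could be written as a lookup table, which also gives a clear error for an unsupported number. A sketch under assumptions: `run_preprocessing` and `StubExtractor` are hypothetical names introduced for illustration; only the four `process_*` method names come from this notebook, and the stub merely records calls instead of touching a database.

```python
# Source number -> extractor method name (method names as used in this notebook).
PREPROCESSORS = {
    1: "process_elogs",
    2: "process_experiment_descriptions",
    3: "process_experiment_elog_parameters",
    4: "process_experiment_elog_commentary",
}


def run_preprocessing(extractor, source):
    """Call the preprocessing method registered for `source`."""
    if source not in PREPROCESSORS:
        valid = sorted(PREPROCESSORS)
        raise ValueError(f"Unknown source {source!r}; expected one of {valid}")
    method_name = PREPROCESSORS[source]
    getattr(extractor, method_name)()
    return method_name


class StubExtractor:
    """Hypothetical stand-in that records which process_* method was called."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that are not defined normally.
        if not name.startswith("process_"):
            raise AttributeError(name)
        return lambda: self.calls.append(name)


stub = StubExtractor()
called = run_preprocessing(stub, 4)
print(called, stub.calls)
```

With the real extractor, the call would be `run_preprocessing(data_extractor, source)`.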


***

### Step 4: Train and run models

When a `KeywordExplorer` object is created from a configuration file, it:

1. Reads in training data from the search configuration

2. Runs hyperparameter optimization: compares many algorithm parameter settings and selects the highest-performing ones

3. Trains models according to the 'best' hyperparameters

4. Extracts keywords using the trained models
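
The hyperparameter optimization step above can be pictured as a grid search. The following is a toy sketch, not the `Hyper` or `Sweep` implementation: `toy_extract` is a stand-in extractor invented for illustration (real runs use algorithms such as Rake or Yake), the parameter names `top_n` and `min_len` are hypothetical, and F1 against the training keywords is assumed as the scoring metric.

```python
from collections import Counter
from itertools import product


def toy_extract(text, top_n, min_len):
    """Stand-in extractor: the top_n most frequent words of at least min_len chars."""
    words = [w for w in text.lower().split() if len(w) >= min_len]
    return {word for word, _ in Counter(words).most_common(top_n)}


def f1(predicted, gold):
    """F1 score between predicted and gold keyword sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up training example: one text and its hand-labeled keywords.
text = "magnet scan magnet beam scan beam magnet at the the the of"
gold = {"magnet", "scan", "beam"}

# Try every combination of parameter values and keep the first best score.
grid = {"top_n": [1, 3, 5], "min_len": [1, 4]}
best_score, best_params = -1.0, None
for top_n, min_len in product(grid["top_n"], grid["min_len"]):
    score = f1(toy_extract(text, top_n, min_len), gold)
    if score > best_score:
        best_score, best_params = score, {"top_n": top_n, "min_len": min_len}

print(best_params, best_score)
```

In this toy case, requiring at least four characters filters out short function words, so the settings `top_n=3`, `min_len=4` recover exactly the three labeled keywords.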


We save the models in a serialized Python "pickle" file so we don't need to repeat the training.

The same models can be reused on multiple files without retraining by calling `run_hyper()`.

*Note: To retrain the model, delete `private_data/{training_directory}/{save_file}.pkl`.*
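
The train-once, reuse-later behavior follows a common caching pattern, which also explains why deleting the `.pkl` file forces retraining. A minimal sketch with the standard `pickle` module, assuming nothing about how `sciencesearch` stores its models; `load_or_train` and the dictionary standing in for a trained model are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(cache_path, train_fn):
    """Load a pickled model if the cache file exists; otherwise train and save."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Only unpickle files you created yourself; pickle can execute code.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    model = train_fn()
    with cache_path.open("wb") as f:
        pickle.dump(model, f)
    return model


train_calls = []


def fake_train():
    """Hypothetical stand-in for an expensive training run."""
    train_calls.append(1)
    return {"algorithm": "demo", "params": {"top_n": 3}}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "models.pkl"
    first = load_or_train(cache, fake_train)   # trains and writes the cache
    second = load_or_train(cache, fake_train)  # loads from the cache
    cache.unlink()                             # deleting the file forces retraining
    third = load_or_train(cache, fake_train)

print(len(train_calls))
```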

In [None]:

# Create a KeywordExplorer object from the configuration
slac_searcher = KeywordExplorer.from_config(config_file=config_fp)

#### Explore keyword results

In [None]:
### See all file keywords (predicted and training)
# slac_searcher.file_keywords

### See all predicted keywords
# predicted_keywords = slac_searcher.predicted_keywords

### See training keywords
# slac_searcher.training_keywords

# See both training keywords and predicted keywords
slac_searcher.training_and_predicted_keywords()

#### Search for all experiments that have a particular keyword

In [None]:
# TODO: Set keyword variable to a keyword you would like to look for

keyword = "magnet"
slac_searcher.find(keyword)
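
Conceptually, looking up every experiment that has a given keyword is an inverted-index query. The sketch below is not how `KeywordExplorer.find` is implemented (that is not shown here); it assumes a plain dictionary from file name to keyword list, with made-up file names, and `build_index` is a hypothetical helper.

```python
from collections import defaultdict


def build_index(file_keywords):
    """Map each lowercased keyword to the set of files that contain it."""
    index = defaultdict(set)
    for filename, keywords in file_keywords.items():
        for kw in keywords:
            index[kw.lower()].add(filename)
    return index


# Made-up data: file name -> keywords extracted for that file.
file_keywords = {
    "exp_a.txt": ["magnet", "beam"],
    "exp_b.txt": ["Magnet", "detector"],
    "exp_c.txt": ["laser"],
}

index = build_index(file_keywords)
magnet_files = sorted(index.get("magnet", set()))
unknown_files = sorted(index.get("cryostat", set()))
print(magnet_files, unknown_files)
```

Lowercasing on insert makes the lookup case-insensitive, as long as the query is lowercased too.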

#### Save keywords
The output location for results is defined in the configuration file.

Example:

```
"saving": {
    "css_filepath": "../../shared/keyword_vis.css",
    "output_files_directory": "../private_data/results"
}
```

In [None]:
# TODO: define save file name
file_name = f"experiment_{source}_results"
slac_searcher.save_keywords_to_file(file_name=file_name)

***

### Step 5: Visualize keywords

Highlight the keywords in the input file (rendered as HTML).

Options include:

(1) Viewing training and/or predicted keywords

(2) Viewing one file or all files
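
As a rough picture of what highlighting keywords in context involves, the sketch below wraps each keyword occurrence in a `<span>` and escapes the rest of the text. It is not the `view_keywords` implementation: `highlight` is a hypothetical helper, the `kw` CSS class is made up, and the sample text is invented.

```python
import html
import re


def highlight(text, keywords, css_class="kw"):
    """Return HTML with every keyword occurrence wrapped in a <span>."""
    # Longest keywords first, so multi-word phrases win over their substrings.
    ordered = sorted(keywords, key=len, reverse=True)
    if not ordered:
        return html.escape(text)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then wrap the match itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(
            f'<span class="{css_class}">{html.escape(match.group(0))}</span>'
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


demo_html = highlight(
    "Magnet scan at 5 T; magnet quench <1 s", ["magnet", "magnet scan"]
)
print(demo_html)
```

Matching on the raw text and escaping each piece afterwards keeps characters such as `<` from being interpreted as markup when the result is passed to `HTML(...)`.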



#### Visualize keywords: Single file


In [None]:

# TODO: set file name to {experiment_id}.txt
filename = "test.txt"
HTML(
    slac_searcher.view_keywords(
        show_training=True, show_predicted=True, textfilename=filename
    )
)

#### Visualize keywords: All files

In [None]:
from IPython.core.display import HTML

# View keywords in context of text logs (all files)
HTML(
    slac_searcher.view_keywords(
        show_training=True, show_predicted=True, textfilename=None
    )
)