# ScienceSearch NLP Keywords with Visualization and Saving Results Example


## Table of Contents
##### Step 0: Import modules
##### Step 1: Setup demo with database
##### Step 2: Select which experiment you would like to run
##### Step 3: Preprocess files 
##### Step 4: Train and run models
##### Step 5: Visualize results in context of the input file


## Step 0: Import modules
Import modules and set up logging

In [None]:
# imports
from pathlib import Path
from sciencesearch.nlp.hyper import Hyper, algorithms_from_results
from sciencesearch.nlp.sweep import Sweep
from sciencesearch.nlp.models import Rake, Yake, KPMiner, Ensemble
from sciencesearch.nlp.train import train_hyper, load_hyper, run_hyper
from sciencesearch.nlp.search import Searcher
from operator import attrgetter
from IPython.display import HTML

# logging
import logging

logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)
from sciencesearch.nlp.visualize_kws import JsonView

## Step 1: Setup demo with database
*This demo will only work if you are a SLAC employee with access to the correct data.*

To begin, download the `private_data` folder and add it to the root directory, `sciencesearch/`.

In this directory you will see:
1. The database shared with us
2. An empty folder for results
3. An empty folder for the experiment you are running
4. A README that explains which experiments can be run

----
Folders are as follows and will be populated with input files:

- Experiment 1: `slac_logs`
- Experiment 2: `descriptions`
- Experiment 3: `params`
- Experiment 4: `commentary`
----

## Step 2: Select which experiment you would like to run

See `README.md` in the `private_data` folder for experiment descriptions.

After selecting your experiment, determine the corresponding configuration file:
- Experiment 1: `slac_config_all_elogs.json`
- Experiment 2: `slac_config_descriptions.json`
- Experiment 3: `slac_config_params.json`
- Experiment 4: `slac_config_commentary.json`

In [None]:
# TODO: define config file's filepath 
config_fp = 'slac_config_all_elogs.json'
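
The experiment folders from Step 1 and the configuration files listed above can be kept in one lookup table, so that only the experiment number has to change. This is a minimal sketch: the `EXPERIMENTS` dictionary and the `config_for` helper are hypothetical names introduced here for illustration and are not part of `sciencesearch`.

```python
# Hypothetical lookup table built from the lists in Steps 1 and 2.
EXPERIMENTS = {
    1: {"folder": "slac_logs", "config": "slac_config_all_elogs.json"},
    2: {"folder": "descriptions", "config": "slac_config_descriptions.json"},
    3: {"folder": "params", "config": "slac_config_params.json"},
    4: {"folder": "commentary", "config": "slac_config_commentary.json"},
}


def config_for(experiment: int) -> str:
    """Return the configuration file name for an experiment number."""
    try:
        return EXPERIMENTS[experiment]["config"]
    except KeyError:
        raise ValueError(f"Unknown experiment: {experiment}") from None


config_fp = config_for(1)
print(config_fp)
```

Raising a `ValueError` for an unknown number surfaces a typo immediately instead of at the point where the configuration file fails to load.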

#### Instructions to run your own experiment and create a custom configuration file
Please see `examples/pipeline` for instructions on building a custom configuration file.



## Step 3: Preprocess files 
Preprocess the data files so that every input file is saved in the location and with the file type defined in your config file:

```
"training": {
    "directory": "../private_data/{your_directory}",
    "input_files": ["*.txt"]
}
```
The cell below contains the preprocessing call for each experiment; find the one for the experiment you are running.
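
To check that preprocessing wrote files where the configuration expects them, the `training` section can be read back and its glob patterns expanded. The sketch below is an illustration, not the loader used by `sciencesearch`: `list_training_files` is a hypothetical helper, and it assumes that a relative `directory` is resolved against the folder containing the config file. It runs on throwaway files so that it is self-contained.

```python
import json
import tempfile
from pathlib import Path


def list_training_files(config_path: Path) -> list[Path]:
    """Resolve the "training" section of a config into concrete file paths."""
    config = json.loads(config_path.read_text())
    training = config["training"]
    # Assumption: a relative "directory" is resolved against the config file's folder.
    directory = (config_path.parent / training["directory"]).resolve()
    files = []
    for pattern in training["input_files"]:
        files.extend(sorted(directory.glob(pattern)))
    return files


# Self-contained demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_dir = root / "data"
    data_dir.mkdir()
    (data_dir / "a.txt").write_text("first log")
    (data_dir / "b.txt").write_text("second log")
    (data_dir / "notes.md").write_text("not matched by *.txt")
    config_path = root / "demo_config.json"
    config_path.write_text(
        json.dumps({"training": {"directory": "data", "input_files": ["*.txt"]}})
    )
    found = [p.name for p in list_training_files(config_path)]

print(found)
```

An empty list here would suggest that the preprocessing output directory and the `directory` entry in the config do not match.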

In [None]:
from private_data.preprocessing.extract_data_from_db import DBDataExtractor

# TODO: define the filepath of the database
db_path = 'simplified_elog.db'

# Setup DBDataExtractor
data_extractor = DBDataExtractor(db_filepath=db_path)

# Preprocess files for experiment 1
data_extractor.prepare_experiment1_data()

# Preprocess files for experiment 2
data_extractor.prepare_experiment2_data()


# Preprocess files for experiment 3
data_extractor.prepare_experiment3_data()

# Preprocess files for experiment 4
data_extractor.prepare_experiment4_data()


##### Experiment 1: Extract keywords from all experiment logs



##### Experiment 2: Extract keywords from experiment descriptions


##### Experiment 3: Extract keywords from all experiment logs regarding parameters


##### Experiment 4: Extract keywords from all experiment logs that do not regard parameters


## Step 4: Train and run models
In this example, we pick the 'best' result for each algorithm by training on two files with some user-provided keywords.
Then we extract keywords from a third file using the trained model.

A `Searcher` reads the training data named in the search configuration and selects the keywords from the best model.
The results of the hyperparameter training are saved in a serialized Python "pickle" file, so the training does not need to be repeated.
The same hyperparameters can also be run on multiple files without retraining by calling `run_hyper()`.
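
The caching behavior described above follows a common load-or-train pattern. The sketch below illustrates that pattern with the standard `pickle` module only; `load_or_train` and `expensive_training` are hypothetical stand-ins, the `max_ngram` value is a placeholder, and none of this is the actual implementation in `sciencesearch.nlp.train`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(cache_file: Path, train):
    """Return cached results if the pickle exists, otherwise train and cache."""
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    results = train()
    with cache_file.open("wb") as f:
        pickle.dump(results, f)
    return results


calls = []


def expensive_training():
    # Stand-in for a hyperparameter sweep; records each invocation.
    calls.append(1)
    return {"best_params": {"max_ngram": 3}}  # placeholder result


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "hyper.pkl"
    first = load_or_train(cache, expensive_training)   # trains and writes the pickle
    second = load_or_train(cache, expensive_training)  # reads the pickle, no retraining
    cache.unlink()                                      # deleting the file forces a retrain
    third = load_or_train(cache, expensive_training)

print(len(calls))
```

Deleting the pickle is the only way to invalidate the cache in this pattern, which matches the instruction in the next cell to remove the `.pkl` file before retraining.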

In [None]:

# TODO: If you would like to re-train the model, delete `private_data/{training_directory}/{save_file}.pkl`

# Create a Searcher object from the configuration
slac_searcher = Searcher.from_config(config_file=config_fp)

### Use the Searcher object to find all files that contain a given keyword

In [None]:
# find all files that have a keyword
# TODO: set keyword variable to a keyword you would like to look for
keyword = "diffraction"
slac_searcher.find(keyword)
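
Conceptually, finding files by keyword is a lookup in an inverted index that maps each keyword to the files tagged with it. The sketch below illustrates that idea on invented data; it is not how `Searcher.find` is implemented, and the case-insensitive matching is an assumption made for this example.

```python
from collections import defaultdict


def build_index(file_keywords: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each lower-cased keyword to the set of files that contain it."""
    index = defaultdict(set)
    for filename, keywords in file_keywords.items():
        for kw in keywords:
            index[kw.lower()].add(filename)
    return dict(index)


def find(index: dict[str, set[str]], keyword: str) -> list[str]:
    """Return the sorted list of files tagged with the keyword (case-insensitive)."""
    return sorted(index.get(keyword.lower(), set()))


# Toy data, invented for illustration only.
file_keywords = {
    "run1.txt": ["diffraction", "detector"],
    "run2.txt": ["Diffraction", "laser"],
    "run3.txt": ["alignment"],
}
index = build_index(file_keywords)
print(find(index, "diffraction"))
```

Building the index once makes every later lookup a single dictionary access, rather than a scan over all files.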

In [None]:
# see all file keywords (predicted and training)
# slac_searcher.file_keywords

# see all predicted keywords
# predicted_keywords = slac_searcher.predicted_keywords
# predicted_keywords

# see training keywords
# training_keywords = slac_searcher.training_keywords
# training_keywords

# see separated training keywords and predicted keywords
all_keywords = slac_searcher.training_and_predicted_keywords()
all_keywords

In [None]:
import pandas as pd

# Build one row per entry in all_keywords: the name plus its keyword columns
rows = []
for name, conditions in all_keywords.items():
    row = {'experiment_name': name}
    row.update(conditions)
    rows.append(row)

# TODO: define save file name and location
file_name = 'results'

# Create df and save to CSV
df = pd.DataFrame(rows)
df.to_csv(f'../private_data/results/{file_name}.csv', index=False)
df

## Step 5: Visualize results in context of the input file


In [None]:
# view keywords in context of text logs (single file)
# TODO: set file name to {experiment_id}.txt
filename = "test.txt"
HTML(
    slac_searcher.view_keywords(
        show_training=True, show_predicted=True, textfilename=filename
    )
)

In [None]:
from IPython.display import HTML

# view keywords in context of text logs (all files)
HTML(
    slac_searcher.view_keywords(
        show_training=True, show_predicted=True, textfilename=None
    )
)
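
Viewing keywords in context amounts to wrapping each keyword occurrence in markup that the notebook renders. The sketch below shows one way to do this with the standard library; `highlight` is a hypothetical helper, and the use of `<mark>` tags and whole-word, case-insensitive matching are assumptions, not a description of what `view_keywords` does internally.

```python
import html
import re


def highlight(text: str, keywords: list[str]) -> str:
    """Wrap every whole-word keyword match in <mark> tags, HTML-escaping the rest."""
    if not keywords:
        return html.escape(text)
    # Longest keywords first so multi-word phrases win over their sub-words.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b", re.IGNORECASE
    )
    parts = []
    last = 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[last:m.start()]))
        parts.append(f"<mark>{html.escape(m.group(0))}</mark>")
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


snippet = highlight("X-ray diffraction at 5 < 10 keV", ["diffraction"])
print(snippet)
```

In a notebook, passing `snippet` to `HTML(...)` would render the highlighted text. Escaping the surrounding text keeps characters such as `<` in the logs from being interpreted as markup.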