# Extract keywords from SLAC experiment logs

This example notebook demonstrates how to configure and run the ScienceSearch Python tools for keyword extraction.

For more information about ScienceSearch, see also:
- [sciencesearch GitHub repository](https://github.com/ScienceSearch/sciencesearch)
- AI-generated [documentation pages](https://deepwiki.com/ScienceSearch/sciencesearch/1-overview)

## Prerequisites
- A Python environment that includes the ScienceSearch Python package `sciencesearch` (see [../README.md](../README.md))
- A SLAC-generated SQLite database

## Setup
Python imports and some logging setup

In [None]:
# imports
import json
import logging
from pathlib import Path

from IPython.display import HTML
from sciencesearch.nlp.search import KeywordExplorer
from sciencesearch.nlp.slac_data_extractor import SLACDatabaseDataExtractor
from sciencesearch.nlp.visualize_kws import JsonView

# logging setup
logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)

## Initialize source database
Before you can run the algorithms, you need to copy your SLAC-generated database into a file called "simplified_elog.db" in the "private_data" directory.

The database must have the _logbook_ and _experiments_ tables.

You will also need a file called "queries_info.json" in the same "private_data" directory.

In [None]:
# create target directory
p = Path("../private_data")  # assume this notebook is run from the `examples/` subdirectory
p.mkdir(exist_ok=True)
dbfile = "simplified_elog.db"
if not (p / dbfile).exists():
    print(f"Please copy database to:\n{p.resolve() / dbfile}")
if not (p / "queries_info.json").exists():
    print(f"Please copy 'queries_info.json' to directory {p.resolve()}")
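
The existence check above does not confirm that the database contains the required _logbook_ and _experiments_ tables. A minimal sketch of such a check using only the standard library follows; the helper name `missing_tables` is hypothetical and is not part of `sciencesearch`.

```python
import sqlite3
from pathlib import Path


def missing_tables(db_path, required=("logbook", "experiments")):
    """Return the required table names that are absent from the SQLite file.

    Hypothetical helper for illustration; not part of the sciencesearch API.
    """
    # Open read-only so a mistyped path raises an error
    # instead of silently creating an empty database file.
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    present = {name for (name,) in rows}
    return [table for table in required if table not in present]
```

For example, `missing_tables(p / dbfile)` should return an empty list when both tables are present.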

## Initialize configuration file
You will also need a configuration file that specifies the algorithms, initial settings, and directory locations.
For the initial run, which uses all the elogs, this file is named "slac_config_all_elogs.json" and lives in the "config_files" directory.

In [None]:
conf_dir = Path(".") / "config_files"
conf_dir.mkdir(exist_ok=True)
conf_file = conf_dir / "slac_config_all_elogs.json"
if not conf_file.exists():
    print(f"Please create configuration file {conf_file.resolve()}")
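
The full configuration schema is not described in this notebook, but the training-data cell further down reads `conf["training"]["directory"]` and the first entry of `conf["training"]["keywords"]`. A small sketch that checks only those two keys is shown here; the function name `check_training_section` is hypothetical, and a real configuration file will contain more settings than this covers.

```python
def check_training_section(conf):
    """Return a list of problems found in the 'training' part of a config dict.

    Hypothetical helper; it checks only the two keys this notebook reads.
    """
    training = conf.get("training")
    if not isinstance(training, dict):
        return ["missing 'training' section"]
    problems = []
    if not isinstance(training.get("directory"), str):
        problems.append("'training.directory' should be a string path")
    keywords = training.get("keywords")
    if not (isinstance(keywords, list) and keywords):
        problems.append("'training.keywords' should be a non-empty list of file names")
    return problems
```

Calling it on the parsed JSON, for example `check_training_section(json.load(f))`, returns an empty list when both keys look usable.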

## Populate training data 
Before you can train the models, you will need to provide training data.

Format: 
```
filename1, "list,of,keywords"
filename2, "another,list,of,keywords"
```
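
To sanity-check a training file before running the extraction, the format above can be parsed with the standard `csv` module. This is only a sketch under the assumption that the file is plain comma-separated text with the keyword list in double quotes; the function name `parse_training_keywords` is hypothetical, and ScienceSearch may read the file differently.

```python
import csv
import io


def parse_training_keywords(text):
    """Map each file name to its list of training keywords.

    Hypothetical helper for checking the format; not part of sciencesearch.
    """
    # skipinitialspace lets the quoted field start after ", " as in the format above
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    result = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        filename, keywords = row[0].strip(), row[1]
        result[filename] = [k.strip() for k in keywords.split(",") if k.strip()]
    return result
```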

In [None]:
# Find directory to place training data
with open(conf_file) as f:
    conf = json.load(f)
input_file_dir = Path(conf["training"]["directory"])
training_keywords_file = input_file_dir / conf["training"]["keywords"][0]
if not training_keywords_file.exists():
    print(f"Please add training data file {training_keywords_file.resolve()}")


## Extract keywords from elogs
Using the provided configuration file, we will tell ScienceSearch to perform the following steps:
1. Load data from the database using the `SLACDatabaseDataExtractor` class
2. Call the appropriate method on this class to preprocess the data to remove non-technical words, HTML tags, etc.
3. Using the `KeywordExplorer` class, choose the 'best' keyword extraction based on a comparison with training data and extract keywords
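
To make step 3 more concrete, here is one common way to compare candidate keyword sets against training data: score each candidate by its mean set-based F1 and keep the highest. This is an illustrative assumption only. The notebook does not state which metric `KeywordExplorer` uses, and the names `f1_score` and `best_candidate` are hypothetical.

```python
def f1_score(predicted, expected):
    """Set-based F1 between predicted and expected keywords."""
    predicted, expected = set(predicted), set(expected)
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


def best_candidate(candidates, training):
    """Return the name of the candidate with the highest mean F1.

    candidates: {candidate_name: {filename: [keywords]}}
    training:   {filename: [keywords]}
    """
    def mean_f1(predictions):
        scores = [
            f1_score(predictions.get(name, []), keywords)
            for name, keywords in training.items()
        ]
        return sum(scores) / len(scores) if scores else 0.0

    return max(candidates, key=lambda name: mean_f1(candidates[name]))
```

Each candidate here stands for one algorithm and parameter combination; files missing from a candidate's output count as a score of zero.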

In [None]:
# create data preprocessing class
data_extractor = SLACDatabaseDataExtractor(conf_file)
# load and preprocess data
print("Load and preprocess data")
data_extractor.process_elogs()
# choose keyword parameters and extract keywords
print("Extracting keywords - this may take a minute or two")
kwe = KeywordExplorer.from_config(conf_file)

In [None]:
# show file keywords
print("\n".join([f"{k} => {', '.join(v)}" for k, v in kwe.file_keywords.items()]))

## Extract keywords from other sources
In addition to elogs, we have written some variations of the process above to extract from:
* experiment descriptions
* elogs and experiment parameters
* elogs that are labeled as misc. commentary

These variations are coded into methods in the `SLACDatabaseDataExtractor` class. Distinct configuration files are used to keep the hyperparameters and output data cleanly separated.

Uncomment the appropriate line below to run one of these other experiments.

You will also need to make sure the corresponding directory and configuration file are created for these to run successfully.

In [None]:
# Leave exactly ONE of the following sections uncommented

## Experiment descriptions
# conf_file = conf_dir / "slac_config_descriptions.json"
# SLACDatabaseDataExtractor(conf_file).process_experiment_descriptions()

## Elogs and experiment parameters
conf_file = conf_dir / "slac_config_params.json"
SLACDatabaseDataExtractor(conf_file).process_experiment_descriptions()

## Only elogs that are misc. commentary
# conf_file = conf_dir / "slac_config_commentary.json"
# SLACDatabaseDataExtractor(conf_file).process_experiment_descriptions()

In [None]:
# Common: extract keywords with chosen algorithm
kwe = KeywordExplorer.from_config(conf_file)

## Explore keyword results
We can now use the extracted keywords together with the original text to either search or visualize the keywords in context.
The code below uses the `KeywordExplorer` instance created when you extracted the keywords in the previous step.

In [None]:
# Show training and predicted keywords
kwe.training_and_predicted_keywords()

### Search for all experiments that have a particular keyword

In [None]:
# Search for a keyword
keyword = "magnet"
kwe.find(keyword)
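
For repeated lookups outside of `KeywordExplorer`, the `kwe.file_keywords` mapping printed earlier (file name to list of keywords) can be turned into an inverted index. The sketch below is a standalone illustration, not how `kwe.find` is implemented; the function names are hypothetical and matching is case-insensitive by assumption.

```python
from collections import defaultdict


def build_keyword_index(file_keywords):
    """Invert {filename: [keywords]} into {keyword: {filenames}}."""
    index = defaultdict(set)
    for filename, keywords in file_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(filename)
    return index


def files_with_keyword(index, keyword):
    """Return the sorted file names whose keyword list contains `keyword`."""
    return sorted(index.get(keyword.lower(), set()))
```

With the notebook's objects this would be used as `files_with_keyword(build_keyword_index(kwe.file_keywords), "magnet")`.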

### Visualize keywords
You can also view the keywords in context with a styled HTML output.

In [None]:
filename = "mfxp17218_content.txt"
# filename = None  # all files
HTML(kwe.view_keywords(show_training=True, show_predicted=True, textfilename=filename))
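
The idea behind keyword-in-context highlighting can be sketched with the standard library alone. The function below wraps each keyword occurrence in a `<mark>` tag and escapes everything else; it is a simplified illustration, not the styling that `view_keywords` produces, and the name `highlight_keywords` is hypothetical.

```python
import html
import re


def highlight_keywords(text, keywords):
    """Return HTML in which every keyword occurrence is wrapped in <mark>."""
    # Longest terms first so multi-word keywords win over their substrings
    terms = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not terms:
        return html.escape(text)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then the match itself
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group(0))}</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

In a notebook the result can be rendered with `HTML(highlight_keywords(text, keywords))`.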