# Keyword Extraction: Impact of Abbreviation Expansion

This notebook demonstrates how expanding abbreviations affects keyword extraction results. We'll compare keywords extracted from the same documents with and without abbreviation expansion across multiple experiments.

For more information about ScienceSearch, see also:
- [sciencesearch GitHub repository](https://github.com/ScienceSearch/sciencesearch)
- AI-generated [documentation pages](https://deepwiki.com/ScienceSearch/sciencesearch/1-overview)

## Prerequisites
- A Python environment with the ScienceSearch Python package `sciencesearch` installed (see [../README.md](../README.md))
- A SLAC-generated SQLite database


## Overview

We will:
1. Create utility functions to compare keyword sets
2. Run keyword extraction experiments with and without abbreviation expansion
3. Visualize the differences
4. Analyze the impact of abbreviation expansion on keyword quality
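
To make the idea concrete, the sketch below shows what abbreviation expansion does to a piece of text before keywords are extracted. It is a toy illustration, not the `sciencesearch` implementation: the `expand_abbreviations` helper and the abbreviation table are hypothetical and exist only for this example.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "XPP": "X-ray Pump Probe",
    "DAQ": "data acquisition",
}


def expand_abbreviations(text: str, table: dict[str, str]) -> str:
    """Replace whole-word abbreviations in `text` with their expansions."""
    if not table:
        return text
    # Try longer abbreviations first so a short key never shadows a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], text)


print(expand_abbreviations("Restarted DAQ at XPP", ABBREVIATIONS))
# prints: Restarted data acquisition at X-ray Pump Probe
```

A keyword extractor run on the expanded text can surface phrases such as "data acquisition" where the original text would only offer "DAQ", which is the effect the experiments in this notebook measure.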


## Step 1: Setup

### 1.1 Python imports and logging setup

In [None]:
# imports
import json
import logging
from pathlib import Path

from IPython.display import HTML

from sciencesearch.nlp.keyword_comparator import KeywordComparator
from sciencesearch.nlp.search import KeywordExplorer
from sciencesearch.nlp.slac_data_extractor import SLACDatabaseDataExtractor

# logging setup
logging.root.setLevel(logging.ERROR)  # silence pke warnings
slog = logging.getLogger("sciencesearch")
slog.setLevel(logging.WARNING)

### 1.2 Connect to config directory

You will also need a configuration file for each experiment, specifying the algorithms, initial settings, and directory locations.

We will generate keywords from four data sources:
* all elog content
* experiment descriptions
* elogs that are labeled as experiment parameters
* elogs that are labeled as misc. commentary

In [None]:
conf_dir = Path(".") / "config_files"
conf_dir.mkdir(exist_ok=True)
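
Before running the experiments, it can help to check that all four configuration files are in place. The sketch below is one way to do that. The `missing_configs` helper and the `CONFIG_NAMES` list are hypothetical; the file names are the ones the experiment cells in this notebook load.

```python
from pathlib import Path

# Configuration file names used by the four experiments in this notebook.
CONFIG_NAMES = [
    "slac_config_all_elogs.json",
    "slac_config_descriptions.json",
    "slac_config_params.json",
    "slac_config_commentary.json",
]


def missing_configs(conf_dir: Path, names: list[str]) -> list[Path]:
    """Return the expected configuration files that do not exist yet."""
    return [conf_dir / name for name in names if not (conf_dir / name).exists()]


# Report every missing file up front instead of one per experiment cell.
for path in missing_configs(Path(".") / "config_files", CONFIG_NAMES):
    print(f"Please create configuration file {path.resolve()}")
```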

## Step 2: Run keyword generation with and without abbreviation expansion

### Generate Keywords 

#### Experiment 1: all elog content 


In [None]:
conf_file_elogs = conf_dir / "slac_config_all_elogs.json"
if not conf_file_elogs.exists():
    print(f"Please create configuration file {conf_file_elogs.resolve()}")

## All elog content WITHOUT ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = False
SLACDatabaseDataExtractor(conf_file_elogs, replace_abbrv=False).process_elogs()

# extract keywords and save to file
kwe_elogs_norep = KeywordExplorer.from_config(conf_file_elogs)
kwe_elogs_norep.save_keywords_to_file("elogs_keywords")


## All elog content WITH ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = True
SLACDatabaseDataExtractor(conf_file_elogs, replace_abbrv=True).process_elogs()

# extract keywords and save to file
kwe_elogs_rep = KeywordExplorer.from_config(conf_file_elogs)
kwe_elogs_rep.save_keywords_to_file("acronym_expansion_elogs_keywords")

#### Experiment 2: experiment descriptions

In [None]:
conf_file_descriptions = conf_dir / "slac_config_descriptions.json"
if not conf_file_descriptions.exists():
    print(f"Please create configuration file {conf_file_descriptions.resolve()}")

## Experiment descriptions WITHOUT ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = False
SLACDatabaseDataExtractor(
    conf_file_descriptions, replace_abbrv=False
).process_experiment_descriptions()

# extract keywords and save to file
kwe_des_norep = KeywordExplorer.from_config(conf_file_descriptions)
kwe_des_norep.save_keywords_to_file("description_keywords")


## Experiment descriptions WITH ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = True
SLACDatabaseDataExtractor(
    conf_file_descriptions, replace_abbrv=True
).process_experiment_descriptions()

# extract keywords and save to file
kwe_des_rep = KeywordExplorer.from_config(conf_file_descriptions)
kwe_des_rep.save_keywords_to_file("acronym_expansion_description_keywords")

#### Experiment 3: elogs that are labeled as experiment parameters

In [None]:
conf_file_params = conf_dir / "slac_config_params.json"
if not conf_file_params.exists():
    print(f"Please create configuration file {conf_file_params.resolve()}")


## Elogs labeled as experiment parameters WITHOUT ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = False
SLACDatabaseDataExtractor(
    conf_file_params, replace_abbrv=False
).process_experiment_elog_parameters()

# extract keywords and save to file
kwe_param_norep = KeywordExplorer.from_config(conf_file_params)
kwe_param_norep.save_keywords_to_file("params_keywords")


## Elogs labeled as experiment parameters WITH ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = True
SLACDatabaseDataExtractor(
    conf_file_params, replace_abbrv=True
).process_experiment_elog_parameters()

# extract keywords and save to file
kwe_param_rep = KeywordExplorer.from_config(conf_file_params)
kwe_param_rep.save_keywords_to_file("acronym_expansion_params_keywords")

#### Experiment 4: elogs that are labeled as misc. commentary

In [None]:
conf_file_commentary = conf_dir / "slac_config_commentary.json"
if not conf_file_commentary.exists():
    print(f"Please create configuration file {conf_file_commentary.resolve()}")

# Only elogs that are misc. commentary WITHOUT ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = False
SLACDatabaseDataExtractor(
    conf_file_commentary, replace_abbrv=False
).process_experiment_elog_commentary()

# extract keywords and save to file
kwe_comment_norep = KeywordExplorer.from_config(conf_file_commentary)
kwe_comment_norep.save_keywords_to_file("commentary_keywords")

# Only elogs that are misc. commentary WITH ABBREVIATION REPLACEMENT
# initialize data extractor with attribute replace_abbrv = True
SLACDatabaseDataExtractor(
    conf_file_commentary, replace_abbrv=True
).process_experiment_elog_commentary()

# extract keywords and save to file
kwe_comment_rep = KeywordExplorer.from_config(conf_file_commentary)
kwe_comment_rep.save_keywords_to_file("acronym_expansion_commentary_keywords")

## Step 3: Abbreviation Expansion Impact Analysis

We now compare, for each of the four experiments, the keywords extracted with and without abbreviation expansion.

### 3.1 Save Keyword Diffs

Use a `KeywordComparator` object to format and explore comparisons between two sets of keywords.
Here we compare the keywords extracted with and without acronym/abbreviation expansion.

To also report similarity metrics (overlap score, Jaccard similarity, and Dice coefficient), pass `similarity_metrics=True`. The default is `similarity_metrics=False`.
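
For reference, the three metrics have standard set-based definitions, sketched below for two keyword sets. This is a minimal illustration of the formulas, not the `KeywordComparator` implementation, which may differ in details such as normalization. The function names and example keywords are made up, and "overlap score" is assumed here to mean the overlap (Szymkiewicz-Simpson) coefficient.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set[str], b: set[str]) -> float:
    """Dice coefficient: 2 |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set[str], b: set[str]) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Made-up keyword sets: one shared keyword, two replaced by their expansions.
baseline = {"xpp", "daq", "laser timing"}
expanded = {"x-ray pump probe", "data acquisition", "laser timing"}

print(f"jaccard={jaccard(baseline, expanded):.3f}")  # prints: jaccard=0.200
print(f"dice={dice(baseline, expanded):.3f}")        # prints: dice=0.333
print(f"overlap={overlap(baseline, expanded):.3f}")  # prints: overlap=0.333
```

Empty inputs return 0.0 here by convention. All three metrics range from 0 (no shared keywords) to 1; low values after expansion indicate that many keywords changed.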

In [None]:
# Initialize Comparator
comparator = KeywordComparator()

# Experiment 1: all elog content
elog_diff = comparator.diff_acronyms(
    "../private_data/results/acronym_expansion_elogs_keywords.csv",
    "../private_data/results/elogs_keywords.csv",
    "../private_data/commentary/replaced_abbr_counter.csv",
    # similarity_metrics=True,  # uncomment to also report similarity metrics
)
# Save results
elog_diff.to_csv("../private_data/results/comparison_all_elog.csv", index=False)

# Experiment 2: experiment descriptions
description_diff = comparator.diff_acronyms(
    "../private_data/results/acronym_expansion_description_keywords.csv",
    "../private_data/results/description_keywords.csv",
    "../private_data/commentary/replaced_abbr_counter.csv",
)
# Save results
description_diff.to_csv(
    "../private_data/results/comparison_descriptions.csv", index=False
)

# Experiment 3: elogs that are labeled as experiment parameters
param_diff = comparator.diff_acronyms(
    "../private_data/results/acronym_expansion_params_keywords.csv",
    "../private_data/results/params_keywords.csv",
    "../private_data/commentary/replaced_abbr_counter.csv",
)
# Save results
param_diff.to_csv("../private_data/results/comparison_params.csv", index=False)

# Experiment 4: elogs that are labeled as misc. commentary
comment_diff = comparator.diff_acronyms(
    "../private_data/results/acronym_expansion_commentary_keywords.csv",
    "../private_data/results/commentary_keywords.csv",
    "../private_data/commentary/replaced_abbr_counter.csv",
)
# Save results
comment_diff.to_csv("../private_data/results/comparison_commentary.csv", index=False)
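
The four comparisons above follow the same pattern, so the file paths could also be derived from a small table. The sketch below only builds and prints the paths; `RESULTS_DIR`, `EXPERIMENTS`, `comparison_paths`, and `abbr_counter_path` are hypothetical names introduced for illustration, and the file names are those used in the cell above.

```python
from pathlib import Path

RESULTS_DIR = Path("../private_data/results")

# Keyword-file stem -> comparison output file name, as used in the cell above.
EXPERIMENTS = {
    "elogs": "comparison_all_elog.csv",
    "description": "comparison_descriptions.csv",
    "params": "comparison_params.csv",
    "commentary": "comparison_commentary.csv",
}


def comparison_paths(stem: str) -> tuple[Path, Path, Path]:
    """Return (expanded keywords, baseline keywords, comparison output) paths."""
    expanded = RESULTS_DIR / f"acronym_expansion_{stem}_keywords.csv"
    baseline = RESULTS_DIR / f"{stem}_keywords.csv"
    output = RESULTS_DIR / EXPERIMENTS[stem]
    return expanded, baseline, output


for stem in EXPERIMENTS:
    expanded, baseline, output = comparison_paths(stem)
    print(f"{expanded.name} vs {baseline.name} -> {output.name}")
    # With the notebook's `comparator` in scope, each iteration could then run
    # (abbr_counter_path is a hypothetical variable holding the counter CSV path):
    # diff = comparator.diff_acronyms(str(expanded), str(baseline), abbr_counter_path)
    # diff.to_csv(output, index=False)
```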

## Step 4: Visualize Keywords

In [None]:
## Uncomment the line for the graph you would like to see


# Experiment 1: all elog content
# kwe_elogs_norep.export(format="graph") # format options: "graph", "excel", "json"
# kwe_elogs_rep.export(format="graph") # format options: "graph", "excel", "json"

# Experiment 2: experiment descriptions
# kwe_des_norep.export(format="graph") # format options: "graph", "excel", "json"
# kwe_des_rep.export(format="graph") # format options: "graph", "excel", "json"

# Experiment 3: elogs that are labeled as experiment parameters
kwe_param_norep.export(format="graph")  # format options: "graph", "excel", "json"
# kwe_param_rep.export(format="graph") # format options: "graph", "excel", "json"

# Experiment 4: elogs that are labeled as misc. commentary
# kwe_comment_norep.export(format="graph") # format options: "graph", "excel", "json"
# kwe_comment_rep.export(format="graph") # format options: "graph", "excel", "json"