# Biomedical Abstract Sentence Extraction using YAKE and Pre-Extracted Entities

This notebook processes biomedical abstracts that already contain **pre-extracted entities** (e.g., using named entity recognition tools). The goal is to create a concise, informative **summary** of each abstract by selecting only the **most relevant sentences**, using a combination of:

- **Keyword extraction (YAKE)**
- **Pre-extracted biomedical entities**
- **Keyword- and entity-based sentence filtering**

## Workflow Summary

1. **Input**
   - A JSON file containing biomedical abstracts.
   - Each abstract includes:
     - `pmid`: PubMed ID
     - `title`: Title of the paper
     - `abstract`: Full abstract text
     - `entities`: A list of biomedical terms pre-extracted with a tool such as scispaCy.

2. **Keyword Extraction (YAKE)**
   - YAKE extracts potential keywords and phrases from the abstract.
   - Keywords are scored for significance; we retain them without filtering by score initially.

3. **Keyword Filtering**
   - From the YAKE output, we keep only:
     - Exact matches to the entities.
     - Phrases that **contain** any of the entities (e.g., "inflammatory cytokines" if "cytokines" is an entity).
   - This helps avoid keyword bloat and ensures relevance.

4. **Sentence Matching**
   - Sentences in the abstract are tokenized.
   - A sentence is retained only if it contains **at least one filtered keyword** (case-insensitive substring match).
   - This filters the abstract down to its most important parts.

5. **Deduplication**
   - Keywords contained in a longer retained phrase are dropped, so only the most specific phrase is kept.
   - Duplicate sentences are also filtered out, keeping the summary brief.

6. **Final Output**
   Each entry in the output JSON includes:
   - `pmid`
   - `title`
   - `abstract`
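   - `all_entities`: Lowercased union of the pre-extracted entities and the raw YAKE keywords (the original `entities` field is dropped)
   - `combined_keywords`: Filtered, deduplicated YAKE keywords that match or contain an entity
   - `matched_text`: Concise summary built from the most relevant, unique sentences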
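
For orientation, a single input record might look like the sketch below. The field names come from the list above, but the values are invented for illustration, and `missing_fields` is a hypothetical helper that is not part of this notebook's pipeline.

```python
# Hypothetical input record; all values are made up for illustration.
example_input = {
    "pmid": "00000000",
    "title": "Example title",
    "abstract": "Inflammatory cytokines were elevated in patients. "
                "The sample size was small.",
    "entities": ["cytokines", "patients"],
}

# Fields the workflow expects on every input record
REQUIRED_INPUT_FIELDS = ("pmid", "title", "abstract", "entities")

def missing_fields(record, required=REQUIRED_INPUT_FIELDS):
    """Return the required field names that are absent from a record."""
    return [name for name in required if name not in record]

print(missing_fields(example_input))  # []
```

A check like this can be run over the loaded JSON before processing to catch records with an absent `abstract` or `entities` field early.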

> This approach creates targeted, information-dense summaries of biomedical abstracts, ideal for downstream tasks like indexing, classification, or text generation.
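
The filtering and sentence-matching steps can be sketched without YAKE or NLTK as follows. This is a minimal illustration, not the notebook's implementation: `filter_keywords` and `select_sentences` are hypothetical names, the candidate keywords are hard-coded instead of coming from YAKE, and a naive regex splitter stands in for NLTK's `sent_tokenize`.

```python
import re

def filter_keywords(candidates, entities):
    """Keep candidates that equal an entity or contain one as a whole word or phrase."""
    patterns = [
        re.compile(r"(?<!\w)" + re.escape(e.lower()) + r"(?!\w)")
        for e in entities if e
    ]
    return [kw for kw in candidates if any(p.search(kw.lower()) for p in patterns)]

def select_sentences(abstract, keywords):
    """Return unique sentences, in original order, that mention at least one keyword."""
    # Naive splitter, used only to keep this sketch dependency-free
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    lowered = [kw.lower() for kw in keywords]
    selected = []
    for sent in sentences:
        if sent and sent not in selected and any(kw in sent.lower() for kw in lowered):
            selected.append(sent)
    return selected

abstract = ("Inflammatory cytokines were elevated in treated mice. "
            "The study ran for six weeks. "
            "Cytokines returned to baseline afterwards.")
entities = ["cytokines"]
candidates = ["inflammatory cytokines", "six weeks", "cytokines"]  # stand-in for YAKE output

keywords = filter_keywords(candidates, entities)
print(keywords)  # ['inflammatory cytokines', 'cytokines']
print(select_sentences(abstract, keywords))
```

Here the second sentence is dropped because it mentions none of the retained keywords, which is the behavior the workflow relies on to shorten an abstract.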



In [None]:
# Install YAKE for keyword extraction
!pip install git+https://github.com/LIAAD/yake.git -q

In [None]:
# Mount Google Drive (only needed when running in Google Colab)
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Update with your actual input file path
input_path = '../../data/enriched/abstracts_with_entities.json'

# Path where the processed abstracts will be written
output_path = '../../data/enriched/abstracts_to_text.json'


In [None]:
import yake
import json
import nltk
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
# Download the sentence tokenizer data ('punkt' for older NLTK releases, 'punkt_tab' for newer ones)
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
# Initialize YAKE extractor
yake_kw_extractor = yake.KeywordExtractor(lan="en", n=10, top=20)


In [None]:
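# Load the abstracts; read explicitly as UTF-8 so non-ASCII characters are handled consistently
with open(input_path, "r", encoding="utf-8") as f:
    data = json.load(f)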


In [None]:
# Optional: keep only the first 10 abstracts for a quick test run.
# Skip this cell to process the full dataset.
data = data[:10]

In [None]:
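import re
from tqdm import tqdm
from nltk.tokenize import sent_tokenize

def phrase_contains_any(word_set, phrase):
    # True if any entity occurs in the phrase as a whole word or word sequence,
    # so multi-word entities match too (splitting on whitespace would miss them)
    pattern = r"(?<!\w){}(?!\w)"
    return any(w and re.search(pattern.format(re.escape(w)), phrase.lower()) for w in word_set)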

def deduplicate_phrases(phrases):
    # Process longest phrases first so shorter, subsumed phrases are dropped
    phrases_sorted = sorted(phrases, key=lambda x: -len(x))
    result = []
    for phrase in phrases_sorted:
        key = phrase.lower()
        # Skip phrases equal to, or contained in, a phrase already kept (case-insensitive)
        if not any(key in p.lower() for p in result):
            result.append(phrase)
    return result

for entry in tqdm(data):
    abstract = entry.get("abstract", "")
    entities = entry.get("entities", [])

    # 1. Extract YAKE keywords
    yake_keywords = [kw for kw, score in yake_kw_extractor.extract_keywords(abstract)]

    # 2. Save all_entities: lowercased union of raw YAKE keywords and entities,
    #    sorted so the output is reproducible across runs
    all_entities = sorted(set(map(str.lower, entities + yake_keywords)))
    entry["all_entities"] = all_entities

    # 3. Filter YAKE keywords that overlap with any entity
    entity_words = set(e.lower() for e in entities)
    filtered_yake_keywords = [kw for kw in yake_keywords if phrase_contains_any(entity_words, kw)]

    # 4. Deduplicate overlapping phrases
    deduped_keywords = deduplicate_phrases(filtered_yake_keywords)
    entry["combined_keywords"] = deduped_keywords

    # 5. Match sentences from abstract
    sentences = sent_tokenize(abstract)
    matched_sentences = [
        sent.strip() for sent in sentences
        if any(kw.lower() in sent.lower() for kw in deduped_keywords)
    ]

    # 6. Deduplicate matched sentences
    seen = set()
    unique_matched_sentences = []
    for sent in matched_sentences:
        if sent not in seen:
            unique_matched_sentences.append(sent)
            seen.add(sent)

    # 7. Compose matched text, adding a period only when a sentence lacks end punctuation
    entry["matched_text"] = " ".join(
        sent if sent.endswith((".", "!", "?")) else sent + "." for sent in unique_matched_sentences
    )

    # 8. Optional cleanup: remove original entities field
    entry.pop("entities", None)


In [None]:
# Write the processed abstracts to disk
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

print(f"Output saved to: {output_path}")
