# GENIA Term Extraction with KeyBERT

Authors: Samuel Sarria Hurtado and Paul Sheridan

Last update: 2023-10-02

Description: Score the terms in the preprocessed GENIA Term corpus version 3.02 dataset with KeyBERT.

Inputs:
* Preprocessed documents (JSON): GENIAcorpus3.02-preprocessed.json

Outputs:
* KeyBERT term scores (JSON): keybert-scores.json

## Imports

In [1]:
import json
from keybert import KeyBERT
import random
from sklearn.feature_extraction.text import CountVectorizer

## Read and Process GENIA Data

In [2]:
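# Read in the preprocessed GENIA documents (a JSON list of strings)
genia_path = '../0-data-preprocessed/GENIAcorpus3.02-preprocessed.json'

with open(genia_path, 'r') as infile:
  genia = json.load(infile)

# Join all documents into one string so KeyBERT scores terms against the whole corpus
genia_str = ' '.join(genia)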

In [3]:
# Build the vocabulary from every whitespace-separated token in the GENIA corpus
# (sorted so the vocabulary order is the same on every run)
vocab = sorted({token for doc in genia for token in doc.split()})

In [4]:
# Tokenize on whitespace only, so every unique token in the GENIA file can serve as a vocabulary term
def analyzer_custom(doc):
  return doc.split()

In [5]:
# Define the vectorizer (KeyBERT applies it to the data during keyword extraction)
counter = CountVectorizer(lowercase=False, vocabulary=vocab, analyzer=analyzer_custom)

## Calculate KeyBERT Rankings

In [6]:
# Set random seed
random.seed(20230807)

In [7]:
# Calculate rankings (this might take a while)
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
  genia_str,
  keyphrase_ngram_range=(1, 1),
  stop_words=None,
  vectorizer=counter,
  top_n=40804,
)
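
For intuition about the scores, KeyBERT ranks candidate terms by the cosine similarity between each candidate's embedding and the document's embedding. The sketch below illustrates that ranking step only, using hypothetical three-dimensional vectors that are made up for this example. Real embeddings come from a sentence-transformer model and have hundreds of dimensions, and `cosine_similarity`, `doc_embedding`, and `candidate_embeddings` are illustrative names, not part of the KeyBERT API.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for illustration
doc_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = {
    "protein": [1.0, 0.0, 0.0],
    "cell": [1.0, 1.0, 0.0],
    "the": [0.0, 1.0, 0.0],
}

# Score every candidate against the document and sort from most to least similar
ranked = sorted(
    ((word, cosine_similarity(vec, doc_embedding))
     for word, vec in candidate_embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for word, score in ranked:
    print(word, round(score, 4))
```

The result has the same shape as `keywords` above: a list of `(term, score)` pairs ordered by descending score.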

## Write to File

In [8]:
# Write the KeyBERT rankings to a JSON file (overwrites any existing file)
keybert_scores_name = 'keybert-scores.json'
with open(keybert_scores_name, 'w') as outfile:
  json.dump(keywords, outfile)
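
When the scores are read back later, note that JSON has no tuple type, so each `(term, score)` pair is stored as a two-element list. The sketch below shows the round trip on hypothetical scores made up for illustration. It serializes to a string instead of a file so that it does not touch `keybert-scores.json`.

```python
import json

# Hypothetical scores in the (term, score) shape used above; values are made up
toy_keywords = [("protein", 0.61), ("cell", 0.42), ("the", 0.05)]

# Round trip through JSON
serialized = json.dumps(toy_keywords)
loaded = json.loads(serialized)

# Each pair comes back as a two-element list; a dict gives lookup by term
scores = {term: score for term, score in loaded}

print(loaded[0])
print(scores["cell"])
```

Converting to a dictionary is one convenient option when looking up the score of a particular term; keeping the list preserves the ranking order.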