# How to compute and load DF (document frequency) counts using `pke`

DF counts are required by several keyphrase extraction models (e.g. TfIdf, Kea) to weight candidates. Below is an example of how to compute DF counts for a given text collection.
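
To make the notion concrete, here is a minimal pure-Python sketch of what a document frequency count is: for each n-gram, the number of documents that contain it at least once. It is only an illustration and not how `pke` implements it; the function `document_frequency`, its `max_n` parameter, and the toy documents are hypothetical.

```python
from collections import Counter


def document_frequency(docs, max_n=2):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # Collect the distinct n-grams of this document so that repeated
        # occurrences inside one document are counted only once.
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        df.update(seen)
    return df


# Hypothetical, already tokenized toy collection
toy_docs = [
    ["keyphrase", "extraction", "with", "graphs"],
    ["keyphrase", "extraction", "with", "tfidf"],
    ["graphs", "and", "more", "graphs"],
]

toy_df = document_frequency(toy_docs)
print(toy_df["keyphrase extraction"])  # 2
print(toy_df["graphs"])                # 2 (counted once per document)
```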

In [None]:
!pip install git+https://github.com/boudinfl/pke.git
!pip install datasets
!python -m spacy download en_core_web_sm


## Preamble on keyphrase extraction datasets using 🤗 datasets

For simplicity and ease of use, we rely on the `datasets` module from 🤗 Hugging Face to load and access sample documents from keyphrase extraction datasets. Please have a look at our organization page (https://huggingface.co/taln-ls2n) for more information on which datasets are available.

In [None]:
from datasets import load_dataset

# load the inspec dataset
dataset = load_dataset('taln-ls2n/inspec')

In [None]:
import spacy
from tqdm.notebook import tqdm

nlp = spacy.load("en_core_web_sm")

# Tokenization fix for in-word hyphens: 'non-linear' is kept as one token
# instead of spaCy's default split into 'non', '-', 'linear'.
# https://spacy.io/usage/linguistic-features#native-tokenizer-additions

from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # The default rule that splits on hyphens between letters is disabled:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

# Build a list of spaCy Doc objects, one per training document
# (title and abstract joined into a single text)
docs = []
for sample in tqdm(dataset['train']):
    docs.append(nlp(sample["title"] + ". " + sample["abstract"]))

In [None]:
import logging
from pke import compute_document_frequency
from string import punctuation

logging.basicConfig(level=logging.INFO)

# compute DF counts and write them to a gzipped file
compute_document_frequency(
    documents=docs,
    output_file='inspec.df.gz',
    language='en',              # language of the input files
    normalization='stemming',   # use porter stemmer
    stoplist=list(punctuation), # stoplist (punctuation marks)
    n=5                         # compute n-grams up to 5-grams
)

In [None]:
from pke import load_document_frequency_file

df = load_document_frequency_file(input_file='inspec.df.gz')
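
As a rough illustration of how such counts can feed candidate weighting, the sketch below turns DF counts into an add-one smoothed inverse document frequency. It assumes the loaded object behaves like a dictionary that maps each n-gram to its count and stores the collection size under a `--NB_DOC--` key; check this against your own file. The `toy_df` values and the `idf` helper are hypothetical, and the exact formula used by a given `pke` model may differ.

```python
import math

# Hypothetical stand-in for the dictionary loaded above
# (assumption: '--NB_DOC--' holds the number of documents)
toy_df = {"--NB_DOC--": 1000, "neural network": 99, "algorithm": 499}


def idf(term, df):
    """Add-one smoothed inverse document frequency (illustrative only)."""
    n_docs = 1 + df.get("--NB_DOC--", 0)
    term_df = 1 + df.get(term, 0)  # unseen terms get a count of 1
    return math.log2(n_docs / term_df)


# Rarer phrases receive a higher weight than frequent ones
for term in ("algorithm", "neural network", "unseen phrase"):
    print(term, round(idf(term, toy_df), 3))
```

With real data, the same helper could be called on the loaded `df` instead of `toy_df`, provided the assumption about its structure holds.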