## KeyPhrase extraction
* Fit a TF-IDF vectorizer on the [train](https://github.com/boudinfl/ake-datasets/blob/master/datasets/Inspec/train/) split of the Inspec dataset (abstracts)
* Run inference on the [test](https://github.com/boudinfl/ake-datasets/blob/master/datasets/Inspec/test/) split of the same dataset
* Evaluate precision, recall, F1, and precision@5 for each document, then report the mean of each metric
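
As a rough illustration of the metrics listed above, the following sketch scores one document by exact string match. The function name `keyphrase_metrics` and the toy phrases are hypothetical. The precision@k here follows the conventional definition (hits among the top-k predictions divided by k), which is an assumption and may differ from how a given notebook cell computes it.

```python
def keyphrase_metrics(predicted, gold, k=5):
    """Exact-match precision, recall, F1 and precision@k for one document.

    `predicted` is a ranked list of keyphrases, `gold` the reference list.
    """
    predicted_set, gold_set = set(predicted), set(gold)
    hits = len(predicted_set & gold_set)

    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Conventional precision@k: hits among the k highest-ranked predictions.
    precision_at_k = len(set(predicted[:k]) & gold_set) / k

    return {'precision': precision, 'recall': recall,
            'f1': f1, 'precision_at_k': precision_at_k}


# Toy data, invented for illustration only.
toy_predicted = ['neural network', 'training', 'gradient descent', 'loss', 'data']
toy_gold = ['neural network', 'gradient descent', 'backpropagation', 'optimization']
metrics = keyphrase_metrics(toy_predicted, toy_gold)
print(metrics)
```

Two of the five toy predictions match the four references, so precision is 2/5 and recall is 2/4.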

In [1]:
import operator
import json
import numpy as np
from pathlib import Path
from glob import glob
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from xml.etree import ElementTree

from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

In [2]:
from nltk.stem.snowball import SnowballStemmer
sno = SnowballStemmer('english')

In [3]:
def read(directory):
    docs = {}
    for doc_path in tqdm(glob(f'{directory}/*.xml')):
        doc = ElementTree.parse(doc_path)
        sentences = []
        for sentence in doc.find('document').find('sentences').findall('sentence'):
            sentences.append(' '.join([token.find('lemma').text.lower() 
                                       for token in sentence.find('tokens').findall('token')]))

        # Use the file name without its extension as the document id (portable across OS path separators)
        docs[Path(doc_path).stem] = '\n'.join(sentences)
    return docs
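
To make the XML traversal in `read` easier to follow, here is a minimal sketch run on an inline sample instead of the dataset files. The sample document and the helper name `lemmatized_text` are hypothetical. The element layout (`document/sentences/sentence/tokens/token/lemma`) is inferred from the calls in `read`, not taken from the dataset's documentation.

```python
from xml.etree import ElementTree

# Invented sample mirroring the element names that read() traverses.
SAMPLE_XML = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>Acquiring</word><lemma>acquire</lemma></token>
          <token id="2"><word>materials</word><lemma>material</lemma></token>
        </tokens>
      </sentence>
      <sentence id="2">
        <tokens>
          <token id="1"><word>Libraries</word><lemma>library</lemma></token>
          <token id="2"><word>needed</word><lemma>need</lemma></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>"""


def lemmatized_text(xml_string):
    """Join lowercased lemmas with spaces, one sentence per line."""
    root = ElementTree.fromstring(xml_string)
    sentences = []
    for sentence in root.find('document').find('sentences').findall('sentence'):
        lemmas = [token.find('lemma').text.lower()
                  for token in sentence.find('tokens').findall('token')]
        sentences.append(' '.join(lemmas))
    return '\n'.join(sentences)


text = lemmatized_text(SAMPLE_XML)
print(text)
```

Because the text is built from lemmas rather than surface words, the vectorizer later sees forms such as `material` instead of `materials`.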

In [4]:
train_sentences = read('ake-datasets/datasets/Inspec/train')
test_sentences = read('ake-datasets/datasets/Inspec/test')
len(train_sentences), len(test_sentences)

(1000, 500)

In [5]:
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 3))
trainvec = vectorizer.fit_transform(train_sentences.values())
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [6]:
with open('ake-datasets/datasets/Inspec/references/test.uncontr.json', 'r') as f:
    target = json.load(f)
    target = {doc_name: [k[0] for k in keyphrases] for doc_name, keyphrases in target.items()}
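
The `k[0]` indexing above suggests that each reference keyphrase is stored as a list of variants, of which only the first is kept. The sketch below shows that reduction on an invented JSON string. The nested-list layout is an inference from the code, and the document id and variants are hypothetical.

```python
import json

# Invented sample: each keyphrase is assumed to be a list of variants.
raw = ('{"demo-doc": [["out-of-print materials", "out-of-print material"], '
       '["acquisition"]]}')

references = json.loads(raw)

# Keep only the first variant of every keyphrase, as the cell above does.
first_variants = {doc_name: [variants[0] for variants in keyphrases]
                  for doc_name, keyphrases in references.items()}
print(first_variants)
```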

In [7]:
target['193']

['out-of-print materials',
 'recurring issues',
 'changing practices',
 'out-of-print books',
 'library materials',
 'acquisition']

In [8]:
print(test_sentences['193'])

twenty year of the literature on acquire out-of-print material
this article review the last two-and-a-half decade of literature on acquire out-of-print material to assess recur issue and identify change practice .
the out-of-print literature be uniform in its assertion that library need to acquire o.p. material to replace worn or damaged copy , to replace missing copy , to duplicate copy of heavily used material , to fill gap in collection , to strengthen weak collection , to continue to develop strong collection , and to provide material for new course , new program , and even entire new library


In [9]:
def extract_keyphrases(vec, feature_names, nb_keywords=5):
    feature_index = vec.nonzero()[1]
    tfidf_scores = zip(feature_index, [vec[0, x] for x in feature_index])
    # Scale scores by n-gram length
    scores = {feature_names[i]: s * len(feature_names[i].split()) for i, s in tfidf_scores}
    scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)[:nb_keywords]
    return [keyphrase for keyphrase, score in scores]
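
The length scaling in `extract_keyphrases` can be illustrated without scikit-learn. This sketch applies the same rule (score multiplied by the number of words in the n-gram) to a plain dictionary. The function name `top_keyphrases` and the scores are invented for illustration.

```python
def top_keyphrases(tfidf_scores, nb_keywords=5):
    """Rank phrases by TF-IDF score multiplied by their word count."""
    scaled = {phrase: score * len(phrase.split())
              for phrase, score in tfidf_scores.items()}
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:nb_keywords]]


# Invented scores: longer n-grams start lower but gain from the scaling.
toy_scores = {
    'print': 0.40,
    'material': 0.35,
    'copy': 0.30,
    'print material': 0.25,
    'acquire print material': 0.15,
}
top3 = top_keyphrases(toy_scores, nb_keywords=3)
print(top3)
```

With these toy values the bigram and trigram overtake the unigrams after scaling, which is the intended effect of the multiplier. Whether that happens on real abstracts depends on how much lower the raw scores of longer n-grams are.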

In [10]:
predictions = {}
for doc_id, doc in test_sentences.items():
    vec = vectorizer.transform([doc])[0]
    keyphrases = extract_keyphrases(vec, feature_names=feature_names, nb_keywords=5)
    predictions[doc_id] = keyphrases

In [11]:
predictions['193'], target['193']

(['material', 'print', 'copy', 'collection', 'acquire'],
 ['out-of-print materials',
  'recurring issues',
  'changing practices',
  'out-of-print books',
  'library materials',
  'acquisition'])

In [12]:
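# Note: stem() receives each whole phrase as a single string, so for multi-word
# keyphrases only the end of the phrase is effectively stemmed.
predictions = {doc_id: [sno.stem(candidate) for candidate in candidates] for doc_id, candidates in predictions.items()}
target = {doc_id: [sno.stem(candidate) for candidate in candidates] for doc_id, candidates in target.items()}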

In [13]:
precision, recall, f1, precision_5 = [], [], [], []
for doc_id in sorted(predictions.keys()):
    p = set(predictions[doc_id])
    t = set(target[doc_id])
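    # Note: this keeps the first five *reference* keyphrases;
    # it is not a cutoff on the ranked predictions.
    at_5 = set(target[doc_id][:5])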

    # We always predict 5 keywords
    precision.append(len(p.intersection(t)) / len(p))
    recall.append(len(p.intersection(t)) / len(t))
    f1.append(0 if precision[-1] + recall[-1] == 0 else 2 * precision[-1] * recall[-1] / (precision[-1] + recall[-1]))
    precision_5.append(len(p.intersection(at_5)) / len(p))
    print(f'{doc_id:5} -> Precision: {precision[-1]:.2f} Recall: {recall[-1]:.2f} F1: {f1[-1]:.2f} precision@5: {precision_5[-1]:.2f}')

print()
print('--------------Mean-------------')
print(f'Precision: {np.mean(precision):.2f} Recall: {np.mean(recall):.2f} F1: {np.mean(f1):.2f}   precision@5: {np.mean(precision_5):.2f}')

193   -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1930  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1931  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1932  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1933  -> Precision: 0.20 Recall: 0.20 F1: 0.20 precision@5: 0.20
1934  -> Precision: 0.20 Recall: 0.17 F1: 0.18 precision@5: 0.20
1935  -> Precision: 0.20 Recall: 0.17 F1: 0.18 precision@5: 0.00
1936  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1937  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1938  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1939  -> Precision: 0.20 Recall: 0.14 F1: 0.17 precision@5: 0.00
194   -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1940  -> Precision: 0.20 Recall: 0.20 F1: 0.20 precision@5: 0.20
1941  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1942  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
1943  -> Precision: 0.00 