## KeyPhrase extraction
* Fit the TF-IDF vectorizer on the SemEval 2010 [train](https://github.com/boudinfl/ake-datasets/blob/master/datasets/SemEval-2010/train/) set
* Run inference on the SemEval 2010 [test](https://github.com/boudinfl/ake-datasets/blob/master/datasets/SemEval-2010/test/) set
* Evaluate the results with precision, recall, F1, and precision@5, reported per document and as the mean over all documents
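
The four metrics in the last bullet can be sketched for a single document as follows. This is a minimal illustration, not the notebook's evaluation cell: the function name `score_document` and the toy keyphrases are assumptions, and precision@k is computed here in the common way, by truncating the ranked predictions to the first k.

```python
def score_document(predicted, reference, k=5):
    """Return (precision, recall, f1, precision_at_k) for one document.

    `predicted` is a ranked list of keyphrases, `reference` the gold keyphrases.
    """
    pred_set, ref_set = set(predicted), set(reference)
    hits = len(pred_set & ref_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(ref_set) if ref_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    # precision@k: share of the first k predictions that appear in the reference
    top_k = predicted[:k]
    precision_at_k = sum(p in ref_set for p in top_k) / k
    return precision, recall, f1, precision_at_k


# Toy example with made-up, already-stemmed keyphrases
predicted = ['imag', 'imag retriev', 'activ learn', 'experiment design', 'imag databas']
reference = ['imag retriev', 'activ learn', 'relev feedback']
print(score_document(predicted, reference))
```

Two of the five predictions are correct, so precision is 2/5, recall is 2/3, and F1 is their harmonic mean.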

In [1]:
import operator
import json
import numpy as np
from pathlib import Path
from glob import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from xml.etree import ElementTree

from tqdm.notebook import tqdm

In [2]:
from nltk.stem.snowball import SnowballStemmer
sno = SnowballStemmer('english')

In [3]:
def read(directory):
    """Return {document id: lemmatized text} for every XML file in `directory`."""
    docs = {}
    for doc_path in tqdm(glob(f'{directory}/*.xml')):
        doc = ElementTree.parse(doc_path)
        sentences = []
        for sentence in doc.find('document').find('sentences').findall('sentence'):
            # One line per sentence: lowercased lemmas joined by spaces
            sentences.append(' '.join(token.find('lemma').text.lower()
                                      for token in sentence.find('tokens').findall('token')))

        # Use the file name without its extension (e.g. 'H-11') as the document id
        docs[Path(doc_path).stem] = '\n'.join(sentences)
    return docs
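
To make the traversal in `read` concrete, here is a small sketch that applies the same `find`/`findall` chain to an in-memory XML string. The sample markup is an assumption inferred from the element names used above (`document`, `sentences`, `sentence`, `tokens`, `token`, `lemma`); the real files may contain additional elements and attributes.

```python
from xml.etree import ElementTree

# Hypothetical miniature of one dataset file; only the elements read() touches
SAMPLE_XML = """
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>Images</word><lemma>image</lemma></token>
          <token id="2"><word>were</word><lemma>be</lemma></token>
          <token id="3"><word>retrieved</word><lemma>retrieve</lemma></token>
        </tokens>
      </sentence>
      <sentence id="2">
        <tokens>
          <token id="1"><word>Users</word><lemma>User</lemma></token>
          <token id="2"><word>labelled</word><lemma>label</lemma></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>
"""


def lemmatized_text(root):
    """Join lowercased lemmas with spaces, one sentence per line."""
    sentences = []
    for sentence in root.find('document').find('sentences').findall('sentence'):
        lemmas = [token.find('lemma').text.lower()
                  for token in sentence.find('tokens').findall('token')]
        sentences.append(' '.join(lemmas))
    return '\n'.join(sentences)


root = ElementTree.fromstring(SAMPLE_XML.strip())
print(lemmatized_text(root))
```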

In [4]:
train_sentences = read('ake-datasets/datasets/SemEval-2010/train')
test_sentences = read('ake-datasets/datasets/SemEval-2010/test')
len(train_sentences), len(test_sentences)

(144, 100)

In [5]:
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 3))
trainvec = vectorizer.fit_transform(train_sentences.values())
feature_names = vectorizer.get_feature_names_out()
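
The `ngram_range=(1, 3)` setting means every unigram, bigram, and trigram becomes a candidate feature. The sketch below approximates that candidate generation in plain Python to show what ends up in the vocabulary; it is not scikit-learn's implementation. The helper name `candidate_ngrams` and the three-word stop list are illustrative assumptions (the real `stop_words='english'` list is much longer), and the token pattern mirrors the vectorizer's documented default of two or more word characters.

```python
import re

TOKEN_PATTERN = re.compile(r'\b\w\w+\b')  # same shape as the default token_pattern
TOY_STOP_WORDS = {'for', 'the', 'be'}     # stand-in for the built-in English stop list


def candidate_ngrams(text, ngram_range=(1, 3), stop_words=TOY_STOP_WORDS):
    """Lowercase, tokenize, drop stop words, then emit all n-grams in range."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


print(candidate_ngrams('laplacian optimal design for image retrieval'))
```

Because stop words are removed before n-grams are formed, `design image` appears as a bigram even though `for` separated the two words in the text.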

In [6]:
with open('ake-datasets/datasets/SemEval-2010/references/test.author.stem.json', 'r') as f:
    target = json.load(f)
    target = {doc_name: [k[0] for k in keyphrases] for doc_name, keyphrases in target.items()}
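
The comprehension above keeps only `k[0]`, which suggests that each keyphrase in the reference file is stored as a list whose first entry is the main form. A minimal sketch under that assumption, using an inline JSON string with made-up variants rather than the real file:

```python
import json

# Hypothetical excerpt; the second keyphrase has an invented alternative form
raw = '{"H-11": [["imag retriev"], ["activ learn", "activ learner"], ["relev feedback"]]}'

reference = json.loads(raw)
# Keep the first variant of every keyphrase, as the loading cell does
first_variants = {doc_name: [variants[0] for variants in keyphrases]
                  for doc_name, keyphrases in reference.items()}
print(first_variants)
```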

In [7]:
target['H-11']

['imag retriev', 'activ learn', 'relev feedback']

In [8]:
print(test_sentences['H-11'])

laplacian optimal design for imag e retrieval
abstract
relevance feedback be a powerful technique to enhance contentbased image retrieval -lrb- cbir -rrb- performance .
it solicit the user 's relevance judgment on the retrieve image return by the cbir system .
the user 's labeling be then use to learn a classifier to distinguish between relevant and irrelevant image .
however , the top return image may not be the most informative one .
the challenge be thus to determine which unlabeled image would be the most informative -lrb- i.e. , improve the classifier the most -rrb- if they be label and use as training sample .
in this paper , we propose a novel active learning algorithm , call laplacian optimal design -lrb- lod -rrb- , for relevance feedback image retrieval .
we algorithm be base on a regression model which minimize the least square error on the measure -lrb- or , label -rrb- image and simultaneously preserve the local geometrical structure of the image space .
specifically , we 

In [9]:
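def extract_keyphrases(vec, feature_names, nb_keywords=5):
    """Return the nb_keywords highest-scoring n-grams of a single-row TF-IDF vector."""
    # Column indices of the n-grams that occur in this document
    feature_index = vec.nonzero()[1]
    tfidf_scores = zip(feature_index, [vec[0, x] for x in feature_index])
    # Scale scores by n-gram length, which favors multi-word phrases over unigrams
    scores = {feature_names[i]: s * len(feature_names[i].split()) for i, s in tfidf_scores}
    # Sort by scaled score, highest first, and keep the top nb_keywords
    scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)[:nb_keywords]
    return [keyphrase for keyphrase, score in scores]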
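
The effect of the length scaling in `extract_keyphrases` is easiest to see on a handful of scores. The sketch below repeats the scaling and ranking steps on a plain dictionary; the TF-IDF values are invented for illustration and `rank_scaled` is a hypothetical helper, not part of the notebook.

```python
import operator


def rank_scaled(tfidf_by_ngram, nb_keywords=5):
    """Multiply each score by the n-gram's word count, then keep the top entries."""
    scaled = {ngram: score * len(ngram.split())
              for ngram, score in tfidf_by_ngram.items()}
    ranked = sorted(scaled.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[:nb_keywords]


# Made-up raw TF-IDF scores for one document
raw_scores = {'image': 0.20, 'image retrieval': 0.12, 'active learning algorithm': 0.07}
for ngram, score in rank_scaled(raw_scores):
    print(f'{ngram}: {score:.2f}')
```

Without the scaling, the unigram `image` would rank first; with it, the two longer phrases move ahead.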

In [10]:
predictions = {}
for doc_id, doc in test_sentences.items():
    vec = vectorizer.transform([doc])[0]
    keyphrases = extract_keyphrases(vec, feature_names=feature_names, nb_keywords=5)
    predictions[doc_id] = keyphrases

In [11]:
predictions['H-11'], target['H-11']

(['image',
  'image retrieval',
  'active learning',
  'experimental design',
  'image database'],
 ['imag retriev', 'activ learn', 'relev feedback'])

In [12]:
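# Stem predictions and references with the same stemmer so they can be compared as strings.
# Note that sno.stem() receives each keyphrase as one string, not word by word.
predictions = {doc_id: [sno.stem(candidate) for candidate in candidates]
               for doc_id, candidates in predictions.items()}
target = {doc_id: [sno.stem(candidate) for candidate in candidates]
          for doc_id, candidates in target.items()}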
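
A stemmer such as `SnowballStemmer` works on single words, so one option for multi-word keyphrases is to stem each token and re-join the result. The sketch below shows that pattern with a pluggable `stem_word` callable; `stem_phrase` and the crude `toy_stem` suffix stripper are illustrative stand-ins so the example runs without NLTK, and in the notebook one would presumably pass `sno.stem` instead.

```python
def stem_phrase(phrase, stem_word):
    """Stem every whitespace-separated token of a keyphrase independently."""
    return ' '.join(stem_word(token) for token in phrase.split())


def toy_stem(word):
    """Very rough suffix stripper for demonstration only; not a real stemmer."""
    for suffix in ('ing', 'al', 'e'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


print(stem_phrase('image retrieval', toy_stem))
print(stem_phrase('active learning', toy_stem))
```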

In [13]:
precision, recall, f1, precision_5 = [], [], [], []
for doc_id in sorted(predictions.keys()):
    p = set(predictions[doc_id])
    t = set(target[doc_id])
    # precision@5 here compares the predictions against the first 5 reference keyphrases
    at_5 = set(target[doc_id][:5])
    hits = len(p & t)

    # We always predict 5 keywords
    precision.append(hits / len(p))
    recall.append(hits / len(t))
    # F1 is the harmonic mean of precision and recall, defined as 0 when both are 0
    denom = precision[-1] + recall[-1]
    f1.append(0 if denom == 0 else 2 * precision[-1] * recall[-1] / denom)
    precision_5.append(len(p & at_5) / len(p))
    print(f'{doc_id:5} -> Precision: {precision[-1]:.2f} Recall: {recall[-1]:.2f} F1: {f1[-1]:.2f} precision@5: {precision_5[-1]:.2f}')

print()
print('--------------Mean-------------')
print(f'Precision: {np.mean(precision):.2f} Recall: {np.mean(recall):.2f} F1: {np.mean(f1):.2f}   precision@5: {np.mean(precision_5):.2f}')

C-1   -> Precision: 0.20 Recall: 0.17 F1: 0.18 precision@5: 0.20
C-14  -> Precision: 0.40 Recall: 0.40 F1: 0.40 precision@5: 0.40
C-17  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-18  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-19  -> Precision: 0.20 Recall: 0.33 F1: 0.25 precision@5: 0.20
C-20  -> Precision: 0.20 Recall: 0.33 F1: 0.25 precision@5: 0.20
C-22  -> Precision: 0.20 Recall: 0.25 F1: 0.22 precision@5: 0.20
C-23  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-27  -> Precision: 0.20 Recall: 0.25 F1: 0.22 precision@5: 0.20
C-28  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-29  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-3   -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-30  -> Precision: 0.20 Recall: 0.33 F1: 0.25 precision@5: 0.20
C-31  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-32  -> Precision: 0.00 Recall: 0.00 F1: 0.00 precision@5: 0.00
C-33  -> Precision: 0.20 