## Exercise 1

In this exercise we will examine how TF/IDF ranking works.

Implement the vector space retrieval model, based on the code framework provided below.

For testing, we provide a small collection of five documents in the file __bread.txt__:

  DocID | Document Text
  ------|------------------
  1     | How to Bake Breads Without Baking Recipes
  2     | Smith Pies: Best Pies in London
  3     | Numerical Recipes: The Art of Scientific Computing
  4     | Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
  5     | Pastry: A Book of Best French Pastry Recipes

Now, for the query $Q$ = "baking", find the top-ranked documents according to their TF/IDF scores.
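As a sanity check for the query above, the following is a minimal, self-contained sketch of the ranking on the five bread titles. It is not the notebook's pipeline: it assumes plain lowercasing and punctuation stripping only (no stemming, no stop-word removal), and the helper names `simple_tokens`, `tfidf_vector`, `cosine`, `idf_toy`, and `toy_ranking` are made up for this illustration. With stemming, "Bake" and "Baking" would collapse into one term, so the exact scores from the full implementation will differ.

```python
import math
import string
from collections import Counter

# The five titles from bread.txt, keyed by DocID
docs = {
    1: "How to Bake Breads Without Baking Recipes",
    2: "Smith Pies: Best Pies in London",
    3: "Numerical Recipes: The Art of Scientific Computing",
    4: "Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes",
    5: "Pastry: A Book of Best French Pastry Recipes",
}

def simple_tokens(text):
    # Assumption: lowercase and strip punctuation only (no stemming, no stop words)
    cleaned = "".join(ch for ch in text if ch not in string.punctuation)
    return cleaned.lower().split()

tokenized = {doc_id: simple_tokens(text) for doc_id, text in docs.items()}
vocab = sorted({t for toks in tokenized.values() for t in toks})
n_docs = len(tokenized)

# idf(t) = ln(N / n_t), where n_t is the number of documents containing t
idf_toy = {
    t: math.log(n_docs / sum(t in toks for toks in tokenized.values()))
    for t in vocab
}

def tfidf_vector(tokens):
    # tf = frequency / maximum frequency in the same document (or query)
    counts = Counter(tokens)
    max_count = max(counts.values())
    return [counts[t] / max_count * idf_toy[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector(simple_tokens("baking"))
toy_scores = {
    doc_id: cosine(query_vec, tfidf_vector(toks))
    for doc_id, toks in tokenized.items()
}
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_ranking[:2])  # documents 1 and 4 are the only ones containing "baking"
```

Under these assumptions document 1 ranks above document 4: both contain "baking" once, but document 4 has more terms and therefore a longer vector, which lowers its cosine similarity.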

For further testing, use the collection __epfldocs.txt__, which contains recent tweets mentioning EPFL.

Also compare your results with those of the reference implementation built on the scikit-learn library.

In [77]:
# Loading of libraries and documents

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')

# Tokenize, stem a document
stemmer = PorterStemmer()
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Read a list of documents from a file. Each line in a file is a document
with open("epfldocs.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]

[nltk_data] Downloading package stopwords to /Users/yawen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/yawen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
# TF/IDF code

# create the vocabulary: unique terms across all documents, minus English stop words
stop_words = set(stopwords.words('english'))  # build the set once instead of once per term
vocabulary = sorted({term for doc in documents for term in doc if term not in stop_words})

print(vocabulary)

['art', 'bake', 'best', 'book', 'bread', 'cake', 'comput', 'french', 'london', 'numer', 'pastri', 'pie', 'quantiti', 'recip', 'scientif', 'smith', 'without']


In [112]:
# compute term occurrence: the number of documents that contain the term
def term_occurrence(term, documents):
    occ = 0
    for doc in documents:
        if term in doc:
            occ = occ + 1
    return occ

# compute IDF, storing idf values in a dictionary
def idf_values(vocabulary, documents):
    idf = {}
    num_documents = len(documents)
    for term in vocabulary:
        # number of documents containing the term
        ni = term_occurrence(term, documents)
        if ni != 0:
            # natural logarithm; never negative because ni <= num_documents
            idf[term] = math.log(num_documents / ni)
        else:
            idf[term] = 0

    return idf

# Function to generate the vector for a document (with normalisation)
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    # term frequencies in the document
    counts = Counter(document)
    # an empty document (or query) maps to the zero vector
    if not counts:
        return vector
    # frequency of the most common term, used to normalise tf
    max_count = counts.most_common(1)[0][1]
    for i,term in enumerate(vocabulary):
        # tf = freq / max-freq
        tf = counts[term] / max_count
        # weight = tf * idf
        vector[i] = tf * idf[term]
    return vector

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
# print("IDFs are: \n",  idf)
# print("Document vectors are:")
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]
# print(document_vectors)

# Function to compute cosine similarity
def cosine_similarity(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    if sumxx == 0 or sumyy == 0:
        # a zero vector has no direction; define the similarity as 0
        result = 0
    else:
        result = sumxy / math.sqrt(sumxx*sumyy)
    return result

import numpy as np

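def cosine_sim(v1, v2):
    # NumPy variant of cosine_similarity; returns 0 when either vector is all zeros
    dot = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    return dot/(norm_v1 * norm_v2)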
    

# computing the search result (get the topk documents)
def search_vec(query, topk):
    # preprocess the query exactly like the documents (punctuation, lowercasing, stemming)
    q = tokenize(query).split()
    query_vector = vectorize(q, vocabulary, idf)
    scores = [[cosine_similarity(query_vector, document_vectors[d]), d] for d in range(len(documents))]
    scores.sort(key=lambda x: -x[0])
    doc_ids = []
    # never ask for more results than there are documents
    for i in range(min(topk, len(scores))):
        print(original_documents[scores[i][1]])
        doc_ids.append(scores[i][1])
    return doc_ids
# HINTS
# natural logarithm function
#     math.log(n,math.e)
# Function to count term frequencies in a document
#     Counter(document)
# most common elements for a list
#     counts.most_common(1)

In [113]:
search_vec('computer science', 5)

Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
New at @epfl_en Life Sciences @epflSV: "From PhD directly to Independent Group Leader" #ELFIR_EPFL:  Early Independence Research Scholars. See https://t.co/evqyqD7FFl, also for computational biology #compbio https://t.co/e3pDCg6NVb Deadline April 1 2018 at https://t.co/mJqcrfIqkb
Video of Nicola Marzari from @EPFL_en  on Computational Discovery in the 21st Century during #PASC17 now online: https://t.co/tfCkEvYKtq https://t.co/httPdHcK9W
@CodeWeekEU is turning 5, yay! We look very much forward to computational thinking unplugged activities during @CodeWeek_CH https://t.co/yDPrlKg4hw
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj


[4, 30, 89, 333, 795]

In [106]:
# Reference code using scikit-learn
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()
new_features = tf.transform(['computer science'])

cosine_similarities = linear_kernel(new_features, features).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]
topk = 5
for i in range(topk):
    print(related_docs_indices[i])
    print(original_documents[related_docs_indices[i]])

4
Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
838
New computer model shows how proteins are controlled "at a distance" https://t.co/zNjK3bZ6mO  via @EPFL_en #VDtech https://t.co/b9TglXO4KD
795
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj
420
Exposure Science Film Hackathon 2017 applications open! Come join our Scicomm-film-hacking event! #Science #scicomm https://t.co/zwtKPlh6HT
300
Le mystère Soulages éblouit la science @EPFL  https://t.co/u3uNICyAdi
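The two rankings above agree on the first hit and then diverge. One likely reason is preprocessing: the reference code does not stem, so "computer" cannot match "computational", whereas the stemmed implementation maps both to the same term. Another is the weighting scheme. With its default settings, `TfidfVectorizer` uses raw term counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2 normalisation, while the implementation above uses max-normalised tf and idf = ln(N / df). The sketch below only contrasts the two idf formulas; the function names `idf_plain` and `idf_smooth` are made up for this illustration.

```python
import math

def idf_plain(n_docs, df):
    # idf used by the hand-written implementation: ln(N / df)
    return math.log(n_docs / df)

def idf_smooth(n_docs, df):
    # scikit-learn's default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 5  # size of the small bread collection
for df in (1, 2, 5):
    print(df, round(idf_plain(n_docs, df), 3), round(idf_smooth(n_docs, df), 3))
```

A term that occurs in every document gets weight 0 under the plain formula but weight 1 under the smoothed one, so very common terms still contribute to the scikit-learn scores.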



## Exercise 2: Evaluate retrieval results

In this exercise, we treat the scikit-learn reference code as an “oracle” whose results are assumed to be correct. Your task is to compare the TF/IDF retrieval model above with this oracle for the following queries: "computer science", "IC school", and "information systems".



In [82]:
from operator import itemgetter

# Retrieval oracle 
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()

# Return all document ids whose cosine similarity with the query is at least the threshold
def search_vec_sklearn(query, features, threshold=0.1):
    new_features = tf.transform([query])
    cosine_similarities = linear_kernel(new_features, features).flatten()
    related_docs_indices, cos_sim_sorted = zip(*sorted(enumerate(cosine_similarities), key=itemgetter(1), 
                                                       reverse=True))
    doc_ids = []
    for i, cos_sim in enumerate(cos_sim_sorted):
#         print(cos_sim_sorted)
        if cos_sim < threshold:
            break
        doc_ids.append(related_docs_indices[i])
    return doc_ids

In [85]:
ret_ids = search_vec_sklearn('computer science', features)
print(ret_ids)
# for i, v in enumerate(ret_ids):
#     print(original_documents[v])

[4, 838, 795, 420, 300, 810, 713, 426, 730, 778, 131, 904, 616, 201, 1056, 600, 764, 358, 837, 524, 250, 443, 969, 49, 210, 1054]


In [59]:
queries = ["computer science", "IC school", "information systems"]

## Exercise 2.1: Compute the precision and recall at k

In [103]:
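def compute_recall_at_k(predict, gt, k):
    # predict: ranked doc ids returned by our own search function
    # gt: ground truth, i.e. all relevant doc ids (from the oracle)
    # k: int, rank cutoff
    # recall@k = |top-k retrieved ∩ relevant| / |relevant|
    relevant = set(gt)
    if not relevant:
        return 0.0
    tp = len(set(predict[:k]) & relevant)
    return tp / len(relevant)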

In [104]:
def compute_precision_at_k(predict, gt, k):
    # predict: ranked doc ids returned by our own search function
    # gt: ground truth, i.e. all relevant doc ids (from the oracle)
    # k: int, rank cutoff
    # precision@k = |top-k retrieved ∩ relevant| / k
    if k <= 0:
        return 0.0
    tp = len(set(predict[:k]) & set(gt))
    return tp / k
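As a worked check of the two definitions, the sketch below plugs in the doc ids printed earlier for "computer science": the top five from `search_vec` and the 26 ids returned by the oracle. The helpers `p_at_k` and `r_at_k` are stand-ins written for this illustration, so the block runs on its own.

```python
def p_at_k(ranked, relevant, k):
    # precision@k: fraction of the top-k results that are relevant
    return len(set(ranked[:k]) & set(relevant)) / k

def r_at_k(ranked, relevant, k):
    # recall@k: fraction of all relevant documents found in the top k
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

# Top-5 doc ids from search_vec('computer science', 5), as printed above
my_top5 = [4, 30, 89, 333, 795]
# Doc ids returned by the oracle for the same query (threshold 0.1)
oracle_ids = [4, 838, 795, 420, 300, 810, 713, 426, 730, 778, 131, 904, 616,
              201, 1056, 600, 764, 358, 837, 524, 250, 443, 969, 49, 210, 1054]

print(p_at_k(my_top5, oracle_ids, 1))  # 1.0: the first hit (doc 4) is relevant
print(p_at_k(my_top5, oracle_ids, 5))  # 0.4: docs 4 and 795 out of five
print(r_at_k(my_top5, oracle_ids, 5))  # 2 of the 26 relevant docs
```

Note that both metrics compare sets of ids, not positions: document 795 counts as a hit even though the two systems rank it differently.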
    

## Exercise 2.2: Compute the MAP score

In [105]:
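def compute_map(queries, topk=10):
    # topk: number of results retrieved per query (assumed cutoff; adjust as needed)
    map_score = 0
    for query in queries:
        predict = search_vec(query, topk) # ranked doc ids from our own function
        gt = search_vec_sklearn(query, features) # relevant doc ids from the oracle
        if not gt:
            continue # no relevant documents: this query contributes an AP of 0
        relevant = set(gt)
        hits = 0
        precision_for_query = 0
        for k in range(1, len(predict) + 1):
            # only ranks that hold a relevant document contribute to average precision
            if predict[k - 1] in relevant:
                hits += 1
                precision_for_query += hits / k # hits / k is precision@k
        # average precision: normalise by the number of relevant documents
        map_score += precision_for_query / len(relevant)
    map_score = map_score / len(queries)
    return map_score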

In [None]:
compute_map(queries)
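To make the MAP computation concrete, here is a small sketch on made-up data. Average precision (AP) for one query sums precision@k over the ranks k that hold a relevant document and divides by the number of relevant documents; MAP is the mean of AP over all queries. The doc ids and the helper name `average_precision` are hypothetical and not taken from the EPFL collection.

```python
def average_precision(ranked, relevant):
    # AP = (sum of precision@k at each rank k holding a relevant doc) / |relevant|
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)

# Two hypothetical queries: (ranked result list, set of relevant doc ids)
toy_runs = [
    ([1, 7, 3, 9], {1, 3, 5}),  # hits at ranks 1 and 3: (1/1 + 2/3) / 3
    ([2, 4], {4}),              # hit at rank 2: (1/2) / 1
]

ap_values = [average_precision(ranked, relevant) for ranked, relevant in toy_runs]
toy_map = sum(ap_values) / len(ap_values)
print(ap_values, toy_map)
```

Dividing by the number of relevant documents rather than by the number of hits penalises a ranking that never retrieves some relevant documents, as with document 5 in the first toy query.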