# CS5481 - Tutorial 8
## Information Retrieval

Welcome to CS5481 Tutorial 8. In this tutorial, you will learn how to use classic information retrieval methods in practice.

## Preparation
- Python
- Python Libraries
  - numpy

# Contents
1. Boolean Retrieval
2. Vector Space Model
3. BM25 Model

# Boolean Retrieval


Let's consider the following documents:

In [None]:
docs = ["The top surface of the Model A's car-like exterior is a mesh so that air can pass through to eight propellers inside the body which provide lift.",
       "But flying any distance using these alone, without the assistance of wings, would require prohibitive amounts of power.",
       "Alef's proposed solution is novel - for longer flights the Model A transforms into a biplane.",
       "It's an ingenious idea, but is it a practical one?",
       "The mesh, as visualised, might also cause significant aerodynamic drag, he adds."]

Q1. Construct a vocabulary table for these documents

In [None]:
import numpy as np

def tokenize(doc):
    # split on whitespace and strip surrounding punctuation from each token
    tokens = [v.strip("-.?,") for v in doc.strip().split()]
    # drop empty tokens, e.g. the lone "-" in the third document
    return [v for v in tokens if v]

# collect the unique words; sorting keeps the column order reproducible
vocabs = sorted({v for doc in docs for v in tokenize(doc)})
vocabs

Q2. Represent documents with one-hot vectors

In [None]:
one_hot_docs = np.zeros((len(docs), len(vocabs)))
print(one_hot_docs.shape) # doc_num x vocab_len
for i, doc in enumerate(docs):
    for v in tokenize(doc):
        # mark that word v occurs in document i
        one_hot_docs[i][vocabs.index(v)] = 1
one_hot_docs

Q3. Retrieve the documents that satisfy the following Boolean queries
1. Model OR power
2. Model AND air

In [None]:
Model_id = vocabs.index("Model")
power_id = vocabs.index("power")
air_id = vocabs.index("air")

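# query vector for "Model OR power"
retri_1 = np.zeros(len(vocabs))
retri_1[Model_id] = 1
retri_1[power_id] = 1

# OR: a document matches if it contains at least one of the query words
retri_1_results = np.sum(one_hot_docs * retri_1, axis=1)
print("Retrieved docs: ", retri_1_results >= 1)

# query vector for "Model AND air"
retri_2 = np.zeros(len(vocabs))
retri_2[Model_id] = 1
retri_2[air_id] = 1

# AND: a document matches only if it contains both query words
retri_2_results = np.sum(one_hot_docs * retri_2, axis=1)
print("Retrieved docs: ", retri_2_results >= 2)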
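The one-hot matrix above answers Boolean queries with arithmetic. An alternative that is often used for Boolean retrieval is an inverted index, where each word maps to the set of documents containing it and OR/AND become set union/intersection. The following is a minimal sketch under stated assumptions: `toy_docs`, `inverted_index` and `postings` are hypothetical names, and the toy documents are not the tutorial's `docs`.

```python
from collections import defaultdict

# Toy documents (hypothetical, already tokenized), used only for illustration.
toy_docs = [
    ["Model", "air", "mesh"],
    ["power", "wings"],
    ["Model", "biplane"],
]

# Build the inverted index: word -> set of ids of the documents containing it.
inverted_index = defaultdict(set)
for doc_id, tokens in enumerate(toy_docs):
    for token in tokens:
        inverted_index[token].add(doc_id)

def postings(word):
    # Return an empty set for unseen words without adding them to the index.
    return inverted_index.get(word, set())

or_result = sorted(postings("Model") | postings("power"))  # Model OR power
and_result = sorted(postings("Model") & postings("air"))   # Model AND air
print(or_result, and_result)
```

With this layout a query only touches the posting sets of its own words, rather than every column of a document-by-vocabulary matrix.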

# Vector Space Model

Here, we mainly use the TF-IDF method to retrieve documents.

Q1. Represent documents with TF-IDF Model

In [None]:
corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document'
]

# tokenize words
word_list = []
for i in range(len(corpus)):
    word_list.append(corpus[i].split(' '))
word_list

In [None]:
# assign an id to each word and obtain each word's frequency
from collections import Counter
dictionary = Counter([v for item in word_list for v in item])
dictionary

In [None]:
# represent each document by its term frequencies (word count / document length)
words = list(dictionary.keys())
print(words)
tf = np.zeros((len(word_list), len(dictionary)))
for i, doc in enumerate(word_list):
    for v in doc:
        # word freq in current document
        tf[i, words.index(v)] += 1 / len(doc)
tf

In [None]:
# compute each word's inverse document frequency
# first mark which words occur in which documents
idf = np.zeros((len(word_list), len(dictionary)))
for i, doc in enumerate(word_list):
    for v in doc:
        idf[i, words.index(v)] = 1
# number of documents containing each word, plus 1 for smoothing
idf = idf.sum(0)+1
# with this smoothing, a word that occurs in every document gets a slightly negative weight
idf = np.log(len(word_list) / idf)
idf

In [None]:
tfidf = tf * idf
tfidf
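For reference, the quantities computed in the cells of this section can be written as follows. This is a summary of the code as given, not the only definition of TF-IDF; the symbols $count(t, d)$, $|d|$ and $df(t)$ are notation introduced here for illustration.

$$tf(t, d) = \frac{count(t, d)}{|d|}, \qquad idf(t) = \log\frac{N}{df(t) + 1}, \qquad tfidf(t, d) = tf(t, d) \cdot idf(t)$$

Here $N$ is the number of documents, $|d|$ is the number of words in document $d$, and $df(t)$ is the number of documents containing word $t$. With the four-document corpus, "the" occurs in every document, so $idf = \log\frac{4}{5} \approx -0.223$, while "second" occurs in one document, so $idf = \log\frac{4}{2} \approx 0.693$.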

Q2. Compute the similarity between the following document and the above documents, using the dot product of their TF-IDF vectors

In [None]:
query_doc = "this is second document"
tf_query = np.zeros(len(dictionary))
for v in query_doc.split():
    # accumulate with += so that a word repeated in the query is counted each time
    tf_query[words.index(v)] += 1 / len(query_doc.split())
print("tf_query: ", tf_query)
idf_query = np.zeros(len(dictionary))
for v in query_doc.split():
    idf_query[words.index(v)] = idf[words.index(v)]
print("idf_query: ", idf_query)
tfidf_query = tf_query * idf_query
print("tfidf_query: ", tfidf_query)

In [None]:
# compute dot-product similarity (the vectors are not normalized, so this is not cosine similarity)
similarity = tfidf_query * tfidf
similarity.sum(1)
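The dot product favours vectors with large weights. Cosine similarity removes that effect by dividing by the vector lengths. The following is a minimal sketch, assuming the query and documents are already TF-IDF vectors; `cosine_similarity`, `eps` and the example weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_similarity(query_vec, doc_matrix, eps=1e-12):
    # Dot product of the query with every document row.
    dots = doc_matrix @ query_vec
    # Product of the vector lengths; eps avoids division by zero for all-zero vectors.
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / (norms + eps)

# Hypothetical TF-IDF weights: rows are documents, columns are words.
doc_matrix = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
query_vec = np.array([1.0, 0.0])
print(cosine_similarity(query_vec, doc_matrix))
```

The first two rows point in the same direction as the query, so they receive (almost exactly) the same score even though the second row has larger weights.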

# BM25 Model

$Score(Q, d) = \sum_{i=1}^{n} W_i R(q_i, d)$, where $Q$ is the query, $d$ is a document, $n$ is the number of words in $Q$, and $q_i$ is the $i$-th word in the query. $W_i$ is the weight of this word, and $R(q_i, d)$ is the relevance score between word $q_i$ and document $d$.

In the BM25 model:

$W_i = \log\frac{N-df_i+0.5}{df_i+0.5}$, where $N$ is the number of documents in the document database, and $df_i$ is the number of documents containing word $q_i$.

$R(q_i, d) = \frac{f_i(k_1+1)}{f_i+K} \cdot \frac{qf_i(k_2+1)}{qf_i+k_2}$

$K = k_1 \cdot (1 - b + b \cdot \frac{dl}{avg\_dl})$, where $k_1$, $k_2$, and $b$ are tuning parameters, $f_i$ is the number of times word $q_i$ appears in document $d$, $qf_i$ is the number of times word $q_i$ appears in the query, $dl$ is the length of document $d$, and $avg\_dl$ is the average length of all documents.
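To make the formulas concrete, the following sketch evaluates the contribution of a single query word. The function name `bm25_term_score` and its argument names are hypothetical. The defaults `k1=2`, `k2=1`, `b=0.5` mirror the defaults of the `BM25_Model` class in this tutorial, and the numbers are taken from the four-sentence example corpus used in this tutorial (the word "second" in 'this is the second second document').

```python
import math

def bm25_term_score(f, qf, df, n_docs, dl, avg_dl, k1=2, k2=1, b=0.5):
    # W_i: weight of the query word, based on its document frequency.
    w = math.log((n_docs - df + 0.5) / (df + 0.5))
    # K: document-length normalization.
    big_k = k1 * (1 - b + b * dl / avg_dl)
    # R(q_i, d): document-side factor times query-side factor.
    r = (f * (k1 + 1) / (f + big_k)) * (qf * (k2 + 1) / (qf + k2))
    return w * r

# "second" occurs twice in a 6-word document, once in the query,
# and in 1 of 4 documents whose average length is 5 words.
score = bm25_term_score(f=2, qf=1, df=1, n_docs=4, dl=6, avg_dl=5)
print(round(score, 4))
```

The full score $Score(Q, d)$ is the sum of such terms over the words of the query.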

Q1. Construct a BM25 Model

In [None]:
import numpy as np
from collections import Counter


class BM25_Model(object):
    def __init__(self, documents_list, k1=2, k2=1, b=0.5):
        # list of documents, where each document is a list of words
        self.documents_list = documents_list
        self.documents_number = len(documents_list)
        # average document length
        self.avg_documents_len = sum([len(document) for document in documents_list]) / self.documents_number
        # save each word's frequency in each document
        self.f = []
        # word's weight
        self.idf = {}
        self.k1 = k1
        self.k2 = k2
        self.b = b
        # obtain f and idf from input documents
        self.init()

    def init(self):
        # document frequency: number of documents containing each word
        df = {}
        for document in self.documents_list:
            # word frequency in current document
            temp = {}
            # count how often each word occurs in this document
            for word in document:
                temp[word] = temp.get(word, 0) + 1
            # save word frequency
            self.f.append(temp)
            # update the document frequency of every word in this document
            for key in temp.keys():
                df[key] = df.get(key, 0) + 1
        # compute word's idf
        for key, value in df.items():
            self.idf[key] = np.log((self.documents_number - value + 0.5) / (value + 0.5))
            
    # compute the similarity score between the query and the index-th document
    def get_score(self, index, query):
        score = 0.0
        # document length in words (not the number of distinct words)
        document_len = len(self.documents_list[index])
        # frequency of each word in the query
        qf = Counter(query)
        for q in query:
            if q not in self.f[index]:
                continue
            score += self.idf[q] * (self.f[index][q] * (self.k1 + 1) / (
                        self.f[index][q] + self.k1 * (1 - self.b + self.b * document_len / self.avg_documents_len))) * (
                                 qf[q] * (self.k2 + 1) / (qf[q] + self.k2))

        return score

    # compute similarity scores between the query and all documents
    def get_documents_score(self, query):
        score_list = []
        for i in range(self.documents_number):
            score_list.append(self.get_score(i, query))
        return score_list

# Practice

Q1. Use the above BM25 model to compute the similarity between a given query and the documents

In [None]:
query = "This is second documents"
corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document'
]



In [None]:
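bm25_model = BM25_Model([doc.split() for doc in corpus])  # each document becomes a list of words
bm25_model.get_documents_score(query.lower().split())  # lowercase so that "This" matches "this"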
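Once one score per document is available, for example the list returned by `get_documents_score`, the documents can be ranked by sorting the scores in descending order. The following is a minimal sketch; the `scores` values are hypothetical and are not actual BM25 output.

```python
import numpy as np

# Hypothetical scores, one per document; not actual BM25 output.
scores = np.array([0.4, 1.2, 0.0, 0.4])

# Sort by descending score; a stable sort keeps the original order for ties.
ranking = np.argsort(-scores, kind="stable")
for rank, doc_id in enumerate(ranking, start=1):
    print(rank, doc_id, scores[doc_id])
```

Documents 0 and 3 tie here, and the stable sort keeps document 0 ahead of document 3.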