## <center>**Information Retrieval**</center>
## <center>**Coursework 3 - Search Engine Implementation**</center><br>
_**Name: Kweku E. Acquaye<br>
Group: 68 (single member group)**_<br> 

### **<center> Table of Contents </center>**

1. Section 1: Introduction
  * Section 1.1: Outline
  * Section 1.2: Model Search Engine
2. Section 2: Building the Model
  * Section 2.1: Loading the data
  * Section 2.2: Creating the index
3. Section 3: Creating a BM25 Model 
  * Section 3.1: Creating retrieval function
  * Section 3.2: Searching and ranking documents for queries
4. Section 4: Evaluating BM25 model
  * Section 4.1: Evaluation by Harmonic Mean (F1-score)
  * Section 4.2: Evaluation by Normalized Discounted Cumulative Gain (nDCG)
5. Section 5: Discussion and Conclusions

### **Section 1: Introduction**<br> 
#### **Section 1.1: Outline**
This report uses modern data science methods to build, implement, and evaluate the performance of a model search engine designed earlier in Coursework 2. The aim is to understand how a search engine works by building one. This notebook is to be submitted together with a PDF summarising the theory and a video presentation demonstrating the model's ability to receive a query (via a command line) and output a ranked list of documents (to a file or on screen).<br>  

This work constitutes Coursework 3a of the Information Retrieval module.

**Declaration:** Some of the code used in this assignment has been adapted and customized from www.docs.python.org/, www.pandas.pydata.org/docs, www.pytorch.org/tutorials/, www.matplotlib.org/stable/, www.stackoverflow.com/questions/, www.geeksforgeeks.org/fundamentals-of-algorithms/, www.machinelearningmind.com/, www.kaggle.com/, www.scikit-learn.org, www.numpy.org/doc/stable/user/, www.github.com/, www.ethen8181.github.io/machine-learning/search/bm25_intro.html, and IR Lab, Tutorial, and Lecture Notes.<br> 

#### **Section 1.2: Model Search Engine**
In **_Coursework 2 - Design a Model Search Engine_**, the following design architecture was created:<br> 

<center>
<div>
<img src= "https://drive.google.com/uc?id=1zk40emIkynMDI1IWKMPpovoNs1s96_4h" alt= "star schema" width=600/>
</div>
</center>

This design architecture is now implemented in this notebook, with two minor differences. First, the dataset has been changed: in early implementation trials, execution time on the 30 GB CORD-19 dataset (Wang _et al_, 2020) proved too long, even for the simplest operations, to be of use for a **_model_** search engine.<br> 

It has therefore been replaced with a fairly broad set of 30 Wikipedia articles on Covid-19. Instead of storing the dataset locally or in the cloud, the data are accessed and parsed directly using the Wikipedia API for Python, version 1.4 (online reference 1).<br> 

Second, instead of Mean Average Precision (MAP) as the second method of evaluation, normalized Discounted Cumulative Gain (nDCG) is used.

### **Section 2: Building the Model**
The following series of steps instantiates, builds and implements the network architecture and pipeline:

#### **Section 2.1: Loading the data**

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import os
import re

# Installing and upgrading wikipedia and statsmodels packages
!pip install wikipedia
!pip install statsmodels --upgrade

import wikipedia



For the purpose of this coursework, the entire content of each article is cast as a document.<br> 
This next step defines and loads the 10-megabyte 30-document corpus:

In [None]:
# Defining dataset of documents
articles=['COVID-19','COVID-19 vaccine','Long COVID','COVID-19 testing','COVID-19 pandemic cases','Symptoms of COVID-19','Variants of SARS-CoV-2',
          'COVID-19 pandemic in North America', 'COVID-19 pandemic in Europe', 'COVID-19 pandemic in Asia', 'COVID-19 pandemic in Africa',
          'COVID-19 pandemic in Australia', 'COVID-19 pandemic by country and territory', 'COVID-19 lockdowns', 'COVID-19 pandemic in South America', 
          'Impact of the COVID-19 pandemic on education', 'SARS-CoV-2 Delta variant', 'SARS-CoV-2 Gamma variant', 'SARS-CoV-2 Omicron variant', 
          'COVID-19 vaccine clinical research', 'Coronavirus', 'Economic impact of the COVID-19 pandemic', 'COVID-19 pandemic in Antarctica', 'Virus',
          'History of COVID-19 vaccine development', 'Impact of the COVID-19 pandemic on religion', 'Political impact of the COVID-19 pandemic',
          'Severe acute respiratory syndrome coronavirus 2', 'Investigations into the origin of COVID-19', 'Angiotensin-converting enzyme 2', 
          ]
documents=[]
title=[]

# Loading wikipedia articles
for article in articles:
   print("loading content: ",article)
   documents.append(wikipedia.page(article,auto_suggest=False).content)
   title.append(article)

loading content:  COVID-19
loading content:  COVID-19 vaccine
loading content:  Long COVID
loading content:  COVID-19 testing
loading content:  COVID-19 pandemic cases
loading content:  Symptoms of COVID-19
loading content:  Variants of SARS-CoV-2
loading content:  COVID-19 pandemic in North America
loading content:  COVID-19 pandemic in Europe
loading content:  COVID-19 pandemic in Asia
loading content:  COVID-19 pandemic in Africa
loading content:  COVID-19 pandemic in Australia
loading content:  COVID-19 pandemic by country and territory
loading content:  COVID-19 lockdowns
loading content:  COVID-19 pandemic in South America
loading content:  Impact of the COVID-19 pandemic on education
loading content:  SARS-CoV-2 Delta variant
loading content:  SARS-CoV-2 Gamma variant
loading content:  SARS-CoV-2 Omicron variant
loading content:  COVID-19 vaccine clinical research
loading content:  Coronavirus
loading content:  Economic impact of the COVID-19 pandemic
loading content:  COVID-19 

In [None]:
# Outputting 1st document
print(documents[0])

Coronavirus disease 2019 (COVID-19) is a contagious disease caused by a virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first known case was identified in Wuhan, China, in December 2019. The disease has since spread worldwide, leading to the ongoing COVID-19 pandemic.Symptoms of COVID‑19 are variable, but often include fever, cough, headache, fatigue, breathing difficulties, loss of smell, and loss of taste. Symptoms may begin one to fourteen days after exposure to the virus. At least a third of people who are infected do not develop noticeable symptoms. Of those people who develop symptoms noticeable enough to be classed as patients, most (81%) develop mild to moderate symptoms (up to mild pneumonia), while 14% develop severe symptoms (dyspnea, hypoxia, or more than 50% lung involvement on imaging), and 5% suffer critical symptoms (respiratory failure, shock, or multiorgan dysfunction). Older people are at a higher risk of developing severe symptoms. Some 

#### **Section 2.2: Creating the index**<br> 
The next series of steps creates a document index by computing the required quantities: term frequencies _(tf)_, document frequencies _(df)_, a term-document pair _(t-d)_ dictionary, and inverse document frequencies _(idf)_.

In [None]:
# Creating document terms
document_terms = [doc.split(' ') for doc in documents]

In [None]:
# Removing stop words and vectorizing
vectorizer = CountVectorizer(stop_words= {'00', '000',	'00011',	'001',	'002',	'006',	'009',	'00906',	'00937', 'δ156',	'δ211',	'δ31',	'δ69',	
                             '01',	'010',	'011',	'012',	'013',	'015',	'016',	'017',	'019',	'δfvi',	'δh69',	'δv70',	'κορώνη',	'ṣaḥn',	'ἰός', 
                             'áder',	'áñez',	'édouard',	'état',	'être',	'óscar',	'δ105',	'δ1265',	'δ143', '02',	'020',	'021',	'02140',	'023',
                             '025',	'0257',	'026',	'028',	'english'})
documents_vectorized = vectorizer.fit_transform(documents)
vocabulary = vectorizer.get_feature_names_out()

In [None]:
# Peeking vocabulary data type 
vocabulary

array(['03', '030', '035', ..., 'zulia', 'zycov', 'zürich'], dtype=object)
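
A custom `stop_words` collection is matched token by token, so the entry `'english'` in the cell above removes only the literal token "english". scikit-learn's built-in English list is selected by passing the string `stop_words='english'`, or by merging `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS` into the custom list. The pure-Python sketch below illustrates the difference. The `tokenize` helper and the three-word `FUNCTION_WORDS` set are hypothetical stand-ins, not scikit-learn's implementation or its full list; only the token pattern mirrors the `CountVectorizer` default.

```python
import re

# Hypothetical stand-in for a built-in English stop list (the real one is much longer).
FUNCTION_WORDS = {"the", "of", "is"}

def tokenize(text, stop_words):
    """Lowercase, keep tokens of 2+ word characters, drop exact stop-word matches."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

text = "The genome of the virus is described in English"

custom_only = {"00", "000", "english"}   # 'english' is just another token here
merged = custom_only | FUNCTION_WORDS    # custom tokens plus the language list

print(tokenize(text, custom_only))
print(tokenize(text, merged))
```

With the custom set alone, function words such as "the" and "of" stay in the vocabulary; they drop out only once a language stop list is merged in.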

In [None]:
# Outputting data term frequency matrix
dataframe = pd.DataFrame(documents_vectorized.toarray(), columns=vocabulary)
dataframe

Unnamed: 0,03,030,035,03614,037,04,041,042,043,045,...,zones,zooanthroponosis,zoom,zoonosis,zoonotic,zoos,zoster,zulia,zycov,zürich
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,2,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


There are 13,249 terms in the vocabulary.

In [None]:
# Creating term-document pairs dictionary
from collections import defaultdict

term_occurence_dict = defaultdict(list)

for doc_id, doc in enumerate(document_terms):
  for term in doc:
    term_occurence_dict[term].append(doc_id)

In [None]:
# Checking success of term-document dictionary
terms = ['genome', 'etiology']                   # find all documents with the word "genome" or "etiology"
result_atcls = []

for term in terms:
  doc_ids = term_occurence_dict[term]
  for doc_id in doc_ids:
    result_atcls.append(articles[doc_id])

result_atcls = list(set(result_atcls))
result_atcls                                     # return document titles

['SARS-CoV-2 Gamma variant',
 'Variants of SARS-CoV-2',
 'SARS-CoV-2 Delta variant',
 'Coronavirus',
 'Investigations into the origin of COVID-19',
 'Severe acute respiratory syndrome coronavirus 2',
 'COVID-19',
 'Angiotensin-converting enzyme 2',
 'SARS-CoV-2 Omicron variant',
 'Virus']

In [None]:
# Returning full documents
terms = ['genome', 'etiology']                   # find all documents with the word "genome" or "etiology"
result_docs = []

for term in terms:
  doc_ids = term_occurence_dict[term]
  for doc_id in doc_ids:
    result_docs.append(documents[doc_id])

result_docs = list(set(result_docs))
result_docs                                      # return full documents

['Coronavirus disease 2019 (COVID-19) is a contagious disease caused by a virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first known case was identified in Wuhan, China, in December 2019. The disease has since spread worldwide, leading to the ongoing COVID-19 pandemic.Symptoms of COVID‑19 are variable, but often include fever, cough, headache, fatigue, breathing difficulties, loss of smell, and loss of taste. Symptoms may begin one to fourteen days after exposure to the virus. At least a third of people who are infected do not develop noticeable symptoms. Of those people who develop symptoms noticeable enough to be classed as patients, most (81%) develop mild to moderate symptoms (up to mild pneumonia), while 14% develop severe symptoms (dyspnea, hypoxia, or more than 50% lung involvement on imaging), and 5% suffer critical symptoms (respiratory failure, shock, or multiorgan dysfunction). Older people are at a higher risk of developing severe symptoms. Som

Term-document pair dictionary created successfully.
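
The dictionary above stores one posting per term occurrence and keys on raw space-separated strings, so `genome` and `genome,` count as different terms. A minimal sketch of an alternative inverted index follows, assuming lowercase word tokens and set-valued postings. `build_inverted_index`, `search_any`, and the three toy documents are illustrative names and data, not part of the notebook.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

def search_any(index, terms):
    """Return sorted ids of documents containing at least one of the terms (Boolean OR)."""
    hits = set()
    for term in terms:
        hits |= index.get(term.lower(), set())
    return sorted(hits)

toy_docs = [
    "The viral genome, sequenced in 2020.",
    "Etiology of the disease remains under study.",
    "Lockdowns affected education.",
]
toy_index = build_inverted_index(toy_docs)
print(search_any(toy_index, ["genome", "etiology"]))
```

Using sets keeps each document id once per term, so no de-duplication step is needed after the lookup.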

In [None]:
# Calculating document frequency (i.e. how many documents each term appears in)
dfs = (dataframe > 0).sum(axis=0)
dfs

03        1
030       1
035       1
03614     1
037       1
         ..
zoos      2
zoster    1
zulia     1
zycov     1
zürich    1
Length: 13249, dtype: int64

In [None]:
# Creating idf for every term in the dataset
N = dataframe.shape[0]
idfs = np.log10(N/dfs)
idfs

03        1.477121
030       1.477121
035       1.477121
03614     1.477121
037       1.477121
            ...   
zoos      1.176091
zoster    1.477121
zulia     1.477121
zycov     1.477121
zürich    1.477121
Length: 13249, dtype: float64

An index of inverse document frequencies _(idf)_ has now been created for every vocabulary term in the dataset.<br> 
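
As a quick check of the values above: with N = 30 documents, a term found in one document has idf = log10(30/1) and a term found in two documents has idf = log10(30/2). The `idf` helper below is an illustrative name; it follows the same base-10 formula as the cell above.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, log base 10, as in idfs = log10(N / dfs)."""
    return math.log10(n_docs / doc_freq)

print(round(idf(30, 1), 6))   # term in 1 of 30 documents
print(round(idf(30, 2), 6))   # term in 2 of 30 documents
print(idf(30, 30))            # term in every document carries no weight
```

The first two values correspond to the `03` and `zoos` rows of the output above.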

### **Section 3: Creating a BM25 Model**
The seminal Best Match 25 (BM25) probabilistic Information Retrieval (IR) model (Robertson _et al_, 1994) remains a standard baseline against which other IR models are measured. It is calculated by the function

**BM25 Function**

<center>
<div>
<img src= "https://drive.google.com/uc?id=1cvkpSt2PtX2f9eqcqCQNPG9P7CKHbRM7" alt= "BM25_formula" width=500/>
</div>
</center>  

#### **Section 3.1: Creating retrieval function**<br> 
In this section, a simplified version of the above function is used to implement Okapi BM25 in the next few steps, according to the equation:

<center>
<div>
<img src= "https://drive.google.com/uc?id=1OrKL8Wncj_iG6-7rv3eEfk_NNoGep04y" alt= "BM25_formula" width=400/>
</div>
</center>

where RSV = Retrieval Status Value. 
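
For reference, the score computed by the code cells in this section can be written as follows. This is transcribed from the code rather than from the equation image, with $L_d$ and $L_{ave}$ standing for the variables `dls` and `avgdl`:

$$
RSV_d = \sum_{t \in q} \log_{10}\!\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1)\, tf_{t,d}}{k_1\left((1 - b) + b\,\dfrac{L_d}{L_{ave}}\right) + tf_{t,d}}
$$

Here $tf_{t,d}$ is the count of term $t$ in document $d$, $df_t$ is the number of documents containing $t$, $N$ is the number of documents, and $k_1$ and $b$ are the tuning parameters set in the next cell.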

In [None]:
# Defining variables
k_1 = 1.2                                       # single value
b = 0.8                                         # single value

# Document lengths: term counts per document after stop-word removal.
# (Counting every space-separated word instead would be [len(d.split(' ')) for d in documents].)
dls = dataframe.sum(axis=1).tolist()            # vector
avgdl = np.mean(dls)                            # single value

In [None]:
# Calculating BM25 term frequency quantification
tf = np.array(dataframe)                                       # term counts, one row per document
length_norm = k_1 * ((1 - b) + b * (np.array(dls) / avgdl))    # one value per document
numerator = (k_1 + 1) * tf
denominator = length_norm.reshape(N, 1) + tf

BM25_tf = numerator / denominator

idfs = np.array(idfs)

BM25_score = BM25_tf * idfs

In [None]:
# Outputting BM25 scores
bm25_idf = pd.DataFrame(BM25_score, columns=vocabulary)
bm25_idf

Unnamed: 0,03,030,035,03614,037,04,041,042,043,045,...,zones,zooanthroponosis,zoom,zoonosis,zoonotic,zoos,zoster,zulia,zycov,zürich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.055068,0.0,0.0,0.625033,0.84005,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.449506,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.868284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.043824,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.757947,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.286587
9,0.0,0.0,0.0,0.0,1.59437,0.0,0.0,1.59437,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The above dataframe is a table of BM25 weights, with one row per document and one column per vocabulary term.

#### **Section 3.2: Searching and ranking documents for queries**<br> 
The next few steps demonstrate the search engine receiving a query input, searching the documents index, and outputting a ranked list of suitable documents in answer to the search: 

In [None]:
# Querying search engine for the terms 'vaccine' or 'symptom'
q_terms = ['vaccine', 'symptom']

To find documents that fit the query well based on the BM25 scores:

In [None]:
# Selecting the BM25 columns for the query terms and summing them per document
q_terms_only_df = bm25_idf[q_terms]
score_q_d = q_terms_only_df.sum(axis=1)

In [None]:
# Outputting top 10 retrieved documents by rank and scores
sorted(zip(documents, score_q_d.values), key=lambda tup: tup[1], reverse=True)[:10]

[('Long COVID is a condition characterized by long-term consequences persisting or appearing after the typical convalescence period of COVID-19. It is also known as post-COVID-19 syndrome, post-COVID-19 condition, post-acute sequelae of COVID-19 (PASC), or chronic COVID syndrome (CCS). Long COVID can affect nearly every organ system, with sequelae including respiratory system disorders, nervous system and neurocognitive disorders, mental health disorders, metabolic disorders, cardiovascular disorders, gastrointestinal disorders, malaise, fatigue, musculoskeletal pain, and anemia. A wide range of symptoms are commonly reported, including fatigue, headaches, shortness of breath, anosmia (loss of smell), parosmia (distorted smell), muscle weakness, low fever and cognitive dysfunction.The exact nature of symptoms and the number of people who experience long-term symptoms are unknown; these vary according to the definition used, the population being studied, and the time period used in the 

The retrieval function returns a ranked list as expected. Because BM25 normalises by document length, of two documents containing the query terms equally often, the longer one is scored lower and the shorter one higher. Documents containing rarer query terms are also ranked higher, because the idf factor gives such terms more weight, in line with Luhn's analysis (Kocabaş _et al_, 2011).<br> 
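
The effect of document length and term rarity can be illustrated with a small sketch. `bm25_weight` is an illustrative helper that applies the Section 3.1 formula and parameters (k_1 = 1.2, b = 0.8) to a single term; the document lengths and idf values are made-up numbers, not taken from the corpus.

```python
def bm25_weight(tf, idf, dl, avgdl, k_1=1.2, b=0.8):
    """BM25 weight of one term in one document."""
    norm = k_1 * ((1 - b) + b * (dl / avgdl))
    return idf * ((k_1 + 1) * tf) / (norm + tf)

avgdl = 1000
short_doc = bm25_weight(tf=3, idf=1.0, dl=500, avgdl=avgdl)
long_doc = bm25_weight(tf=3, idf=1.0, dl=2000, avgdl=avgdl)
rare_term = bm25_weight(tf=3, idf=1.477, dl=1000, avgdl=avgdl)
common_term = bm25_weight(tf=3, idf=0.301, dl=1000, avgdl=avgdl)

print(short_doc > long_doc)      # same tf, shorter document scores higher
print(rare_term > common_term)   # same tf and length, rarer term scores higher
```

The weight also saturates: however large `tf` becomes, it stays below `idf * (k_1 + 1)`.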

### **Section 4: Evaluating BM25 model**<br> 
The performance of the created model is evaluated in the next series of steps:

In [None]:
df = dataframe

In [None]:
type(df)

pandas.core.frame.DataFrame

In [None]:
df[:5]

Unnamed: 0,03,030,035,03614,037,04,041,042,043,045,...,zones,zooanthroponosis,zoom,zoonosis,zoonotic,zoos,zoster,zulia,zycov,zürich
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Turning term frequencies into BM25-idf weights

def BM25_IDF_df(df, k_1=1.2, b=0.8):
  """
  Calculates BM25-idf weights beforehand from a term-frequency dataframe,
  using the same parameters and log10 idf as in Section 3.1.
  """
  dfs = (df > 0).sum(axis=0)                            # document frequency per term
  N = df.shape[0]                                       # number of documents
  idfs = np.log10(N / dfs).to_numpy()                   # inverse document frequency per term

  # Document lengths: term counts per document after stop-word removal
  dls = df.sum(axis=1).to_numpy()                       # vector
  avgdl = dls.mean()                                    # single value

  tf = df.to_numpy()
  numerator = (k_1 + 1) * tf
  denominator = (k_1 * ((1 - b) + b * (dls / avgdl))).reshape(N, 1) + tf
  return pd.DataFrame(numerator / denominator * idfs, columns=df.columns)

In [None]:
bm25_df = BM25_IDF_df(df) # a dataframe with BM25-idf weights
bm25_df[:5]

Unnamed: 0,03,030,035,03614,037,04,041,042,043,045,...,zones,zooanthroponosis,zoom,zoonosis,zoonotic,zoos,zoster,zulia,zycov,zürich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.055068,0.0,0.0,0.625033,0.84005,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.449506,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The queries "vaccine effectiveness" and "cell membrane" are input to observe the relevance of documents returned in response to the queries:

In [None]:
# Inputting query 
queries = dict(enumerate(['vaccine effectiveness','cell membrane']))      # create dictionary = {query_id: query}
queries

{0: 'vaccine effectiveness', 1: 'cell membrane'}

In [None]:
terms = ['vaccine effectiveness','cell membrane']
result_atcls                                     # still holds the titles found for 'genome'/'etiology' in Section 2.2

['SARS-CoV-2 Gamma variant',
 'Variants of SARS-CoV-2',
 'SARS-CoV-2 Delta variant',
 'Coronavirus',
 'Investigations into the origin of COVID-19',
 'Severe acute respiratory syndrome coronavirus 2',
 'COVID-19',
 'Angiotensin-converting enzyme 2',
 'SARS-CoV-2 Omicron variant',
 'Virus']

In [None]:
result_docs

['Coronavirus disease 2019 (COVID-19) is a contagious disease caused by a virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first known case was identified in Wuhan, China, in December 2019. The disease has since spread worldwide, leading to the ongoing COVID-19 pandemic.Symptoms of COVID‑19 are variable, but often include fever, cough, headache, fatigue, breathing difficulties, loss of smell, and loss of taste. Symptoms may begin one to fourteen days after exposure to the virus. At least a third of people who are infected do not develop noticeable symptoms. Of those people who develop symptoms noticeable enough to be classed as patients, most (81%) develop mild to moderate symptoms (up to mild pneumonia), while 14% develop severe symptoms (dyspnea, hypoxia, or more than 50% lung involvement on imaging), and 5% suffer critical symptoms (respiratory failure, shock, or multiorgan dysfunction). Older people are at a higher risk of developing severe symptoms. Som

#### **Section 4.1: Evaluation by Harmonic Mean (F1-score)**<br> 
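First, the model is evaluated by calculating the harmonic mean of precision and recall (F1-score) for each query.<br> 

Manually assigned relevance judgements are encoded below and used to calculate Precision@k and F1 scores. Precision is computed over the judged documents among the top k, so retrieved documents without a judgement are not counted: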

In [None]:
# Inputting relevance judgement list as (query_id, document_id, judgement) with judgement 1 = relevant and 0 = not relevant 
qrels = [
         (0,28,0),
         (0,7,1),
         (0,0,0),
         (0,18,1),
         (0,23,1),
         (0,17,1),
         (0,29,1),
         (0,27,0),
         (0,16,0),
         (0,20,0),

         (1,28,0),
         (1,7,1),
         (1,0,0),
         (1,18,0),
         (1,23,1),
         (1,17,1),
         (1,29,0),
         (1,27,1),
         (1,16,0),
         (1,20,0),
]

In [None]:
def retrieve_ranking(query, bm25_df):
  # Creating function for ranking retrieval
  q_terms = query.split(' ')
  q_terms_only = bm25_df[q_terms]
  score_q_d = q_terms_only.sum(axis=1)
  return sorted(zip(bm25_df.index.values,score_q_d.values), key = lambda tup:tup[1], reverse=True)
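
Selecting columns with `bm25_df[q_terms]` raises a `KeyError` when a query word is not in the vocabulary. One way to handle this, sketched below with plain dictionaries instead of a dataframe, is to let unknown terms contribute nothing to the score. `rank_documents` and the toy `weights` table are hypothetical and for illustration only.

```python
def rank_documents(query, weights):
    """Rank document ids by summed term weights; terms missing from a document add 0."""
    q_terms = query.lower().split()
    scores = {
        doc_id: sum(term_weights.get(t, 0.0) for t in q_terms)
        for doc_id, term_weights in weights.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Toy BM25 weights in the form {doc_id: {term: weight}}
weights = {
    0: {"vaccine": 1.2, "cell": 0.3},
    1: {"vaccine": 0.4, "effectiveness": 1.5},
    2: {"membrane": 0.9},
}
print(rank_documents("vaccine effectiveness", weights))
print(rank_documents("vaccine unknownword", weights))
```

Python's sort is stable, so documents with equal scores keep their original order.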

In [None]:
def precision_at_k(query_id, k=5):
  # Creating function for calculating precision
  doc_ranking = retrieve_ranking(queries[query_id], bm25_df)
  retrieved = [doc[0] for doc in doc_ranking[:k]] # take only the document id, rather than score

  TP = np.array([int((query_id, doc, 1) in qrels) for doc in retrieved]).sum()
  FP = np.array([int((query_id, doc, 0) in qrels) for doc in retrieved]).sum()

  precision = TP / (TP+FP)

  return TP, FP, precision

In [None]:
def f1_score_at_k(query_id, k=5):
  # Creating function for calculating F1 score
  TP, FP, precision = precision_at_k(query_id, k)
  relevant_docs = np.array(qrels)
  relevant_docs = relevant_docs[relevant_docs[:, 0] == query_id][:,2].sum()   # judged-relevant documents for this query
  FN = relevant_docs - TP

  recall = TP / (TP+FN)
  f1 = (2 * precision * recall) / (precision + recall)
  
  return f1

In [None]:
# Calculating accuracy metrics for each query
k = 5
for query_id, query in queries.items():
  tp, fp, precision = precision_at_k(query_id, k=k)
  f1_score = f1_score_at_k(query_id, k=k)
  print('retrieved query "{}" with Precision@{} = {} and F1-score = {}'.format(query, k, precision, f1_score))

retrieved query "vaccine effectiveness" with Precision@5 = 1.0 and F1-score = 0.33333333333333337
retrieved query "cell membrane" with Precision@5 = 0.4 and F1-score = 0.4444444444444445
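
The reported values can be reproduced by hand. Working back from the output and the judgement list, the first query has TP = 1 and FP = 0 among the judged documents in its top 5, with 5 relevant documents in `qrels`; the second has TP = 2 and FP = 3, with 4 relevant documents. `prf_from_counts` below is an illustrative helper, not part of the notebook; unlike the functions above, it also guards against division by zero.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts, using 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

print(prf_from_counts(tp=1, fp=0, fn=4))   # "vaccine effectiveness"
print(prf_from_counts(tp=2, fp=3, fn=2))   # "cell membrane"
```

Only one of the top five documents for the first query has a judgement, so dividing by k rather than by the number of judged documents would give a Precision@5 of 0.2 instead of 1.0.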


#### **Section 4.2: Evaluation by Normalized Discounted Cumulative Gain (nDCG)**<br>  
Using the nDCG method instead of the Mean Average Precision (MAP) proposed in the design, the model is evaluated by calculating nDCG scores for the queries as follows. Documents without a relevance judgement are treated as not relevant:

In [None]:
# Importing library
from sklearn.metrics import ndcg_score

In [None]:
for query_id, query in queries.items():
  # Calculate normalized dcg (ndcg) at k
  y_score = np.array(sorted(retrieve_ranking(queries[query_id], bm25_df)))[:,1]
  y_true = np.zeros(y_score.size)
  np_qrels = np.array(qrels)

  for data in np_qrels[np_qrels[:, 0] == query_id]:
    y_true[data[1]] = data[2]

  ndcg = ndcg_score(np.expand_dims(y_true,axis=0), np.expand_dims(y_score,axis=0), k=k)
  print(f'retrieved for {query} with nDCG@{k} of {ndcg}')

retrieved for vaccine effectiveness with nDCG@5 of 0.16958010263680806
retrieved for cell membrane with nDCG@5 of 0.4414924137367807
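
To show what the metric computes, here is a minimal pure-Python sketch of nDCG@k with linear gain and a log2 discount. It ignores tied scores, which `sklearn.metrics.ndcg_score` averages over by default, so the two can differ when many documents share a score. `dcg_at_k`, `ndcg_at_k`, and the toy judgements are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the ranking divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of documents in the order the engine ranked them (toy data)
ranked = [1, 0, 1, 0, 0, 1]
print(ndcg_at_k(ranked, k=5))
```

A ranking that places every relevant document first scores 1.0, and the score falls as relevant documents move down the list or out of the top k.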


### **Section 5: Discussion and Conclusions**<br> 
In this notebook, the model search engine designed in Coursework 2 is successfully built and implemented. Utilising a BM25 model, I demonstrate how a search engine works in indexing documents, retrieving documents, and ranking retrieved documents in relation to input queries. 

Time constraints due to single-membership in this group prevented implementation of a Divergence from Randomness (DFR) retrieval method. 

The model was evaluated successfully using two different methods. 

With the Harmonic Mean method, F1 scores of 0.333 and 0.444 were obtained, with Precision@5 of 1.0 and 0.4, for the queries "vaccine effectiveness" and "cell membrane" respectively. 

With the Normalized Discounted Cumulative Gain method, nDCG@5 scores of 0.170 and 0.441 were obtained for the queries "vaccine effectiveness" and "cell membrane" respectively. 

Overall, this exercise is deemed an excellent success in terms of the purpose for which it is undertaken.

### **<center>References</center>**

1. Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P., 2020, _Cord-19: The covid-19 open research dataset_, ArXiv, [online]: 
https://doi.org/10.48550/arXiv.2004.07180

2. Online Reference 1, [online]: https://pypi.org/project/wikipedia/

3. Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M. and Gatford, M., 1994, Okapi at TREC-2, NIST SPECIAL PUBLICATION SP, pp. 21-21, [online]: https://books.google.co.uk/books?hl=en&lr=&id=W8MZAQAAIAAJ&oi=fnd&pg=PA21&dq=Stephen+E.+Robertson%3B+Steve+Walker%3B+Susan+Jones&ots=3WAX_FtN7f&sig=od8KRRiryjacsQxVFNkktfJd4Us&redir_esc=y#v=onepage&q&f=false

4. Jones, K.S., 1999, _Information retrieval and artificial intelligence_, Artificial Intelligence, **114**, (1-2), pp. 257-281, [online]: https://reader.elsevier.com/reader/sd/pii/S0004370299000752?token=427130C67846D0099AD0B9969F30C0156F72E7A7F9EA27FFC30E89DDD8F5031A5CC69B617DA9DAC5B5FD99BDD90D0191&originRegion=eu-west-1&originCreation=20220409195834

5. Kocabaş, İ., Dincer, B.T. and Karaoğlan, B., 2011, _Investigation of Luhn's claim on information retrieval_, Turkish Journal of Electrical Engineering and Computer Science, **19**, (6), pp. 993-1004, [online]: https://www.researchgate.net/publication/237266486_Investigation_of_Luhn's_claim_on_information_retrieval

6. Information Retrieval Lecture, Tutorial, and Lab notes.