# Evaluation Approaches #

So far we have learned about various concepts in probability theory and machine learning, and we have implemented several ML models. In this lab session we'll learn about different evaluation approaches for analyzing the performance of ML models.

## Precision and Recall ##

Two of the most commonly used measures of effectiveness in information retrieval (IR) and machine learning (ML) are precision and recall. They were originally introduced to summarize and compare search results, and they are often referred to as relevance measures. Precision measures how many of the returned results are relevant, whereas recall measures how many of the relevant results are returned. In IR they are defined in the following way:  

Given a search query and its set of relevant documents (known a priori), the retrieval model returns a ranked list of documents, i.e. the **retrieved** documents. **Precision** is defined as the ratio between the number of documents that are both retrieved and relevant and the total number of **retrieved** documents. It gives us the proportion of retrieved documents that are relevant. **Recall** is defined as the ratio between the number of documents that are both retrieved and relevant and the total number of **relevant** documents for that query. Recall gives us the proportion of relevant documents that are retrieved:
$$\Large Precision= \frac{|Relevant\ \&\ Retrieved|}{|Retrieved|}$$  
$$\Large Recall = \frac{|Relevant\ \&\ Retrieved|}{|Relevant|}$$  

In IR, precision and recall are often computed at a given rank position, e.g. precision or recall at rank 10 (P@10/P10, R@10/R10). Sometimes a system's performance is reported as the average of the precision values at the ranks where relevant documents occur, known as average precision (AP). When AP is computed across multiple test queries, the mean of the AP values is reported. This measure is called mean average precision (MAP).  
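
The rank-based measures described above can be sketched on a toy ranked list. This is a minimal illustration, not the lab's reference implementation: the document IDs and the helper names `precision_at_k` and `average_precision` are made up, and AP is averaged over the relevant documents that appear in the list, following the description above.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of P@k taken at every rank k where a relevant document occurs."""
    precisions = [
        precision_at_k(ranked, relevant, k)
        for k, doc in enumerate(ranked, start=1)
        if doc in relevant
    ]
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical toy data: the document IDs are made up for illustration.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}

p_at_3 = precision_at_k(ranked, relevant, 3)
ap = average_precision(ranked, relevant)
print("P@3 =", round(p_at_3, 4))
print("AP  =", round(ap, 4))
```

MAP would then be the mean of `average_precision` over the ranked lists of several test queries.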

In ML, for tasks such as classification where the model output is not a ranked list, precision and recall are defined in terms of the number of true positives (tp), false positives (fp), and false negatives (fn):  

$$\Large Precision= \frac{tp}{tp+fp}$$  
$$\Large Recall = \frac{tp}{tp+fn}$$  

In this case the instances that the classifier labels as positive play the role of the retrieved documents, i.e. the classifier predicts that all of them are relevant. fp and fn are also known as Type I and Type II errors, respectively.
Below is an illustration of how to compute Precision and Recall when evaluating a classifier:

![Two Bins](precision_recall.png)
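
As a small numeric companion to the illustration, the sketch below counts tp, fp and fn for a binary classifier and plugs them into the two formulas. The label vectors are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical labels: 1 = positive (relevant), 0 = negative (not relevant).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted positive, truly positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, truly negative
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, truly positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("tp, fp, fn =", tp, fp, fn)
print("precision =", precision, "recall =", recall)
```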

In the next example we are going to compute precision and recall on two ranked lists of documents that were generated by two different retrieval models (model a and model b).

### Computation Example ###

In this example we are going to evaluate the performance of two retrieval models using precision and recall. We have a test query and its relevance set, which consists of a set of documents stored in a file (3238.rel). Given the test query, each model returns a ranked list of retrieved documents; the two lists are stored in two separate files (model_a.out and model_b.out). With the code below we are going to implement both measures. 

We'll start by reading both ranked lists (model outputs) and the relevance set.

In [None]:
import numpy as np
# read the output of the two models:
model_a = np.loadtxt('model_a.out')
model_b = np.loadtxt('model_b.out')

# read the set of relevant documents
rel_set = np.loadtxt('3238.rel')

In order to compute both measures we iterate over the two ranked lists and, for each rank, check whether the ranked document is in the relevance set:

In [None]:
rank = 10
# create lists for both models where we'll store the rank positions of the
# relevant documents found in the ranked list up to the given rank:
rank_rel_a = []
rank_rel_b = []
# iterate over the ranked lists:
for i in range(rank):
    # if the ranked document is in the relevance set, add its position to
    # the list of relevant documents found for the given model
    # (column 2 of each row is assumed to hold the document id):
    if model_a[i][2] in rel_set:
        rank_rel_a.append(i)
    if model_b[i][2] in rel_set:
        rank_rel_b.append(i)

# compute the number of documents that are both relevant and retrieved:
rank_rel_a = len(rank_rel_a)
rank_rel_b = len(rank_rel_b)

# compute the number of relevant documents:
num_rel = len(rel_set)
print("Number of relevant documents=" + str(num_rel))

**[Assignment 1]**
Given the above code, your task is to compute Precision and Recall at rank 10 (rank=10) for both models.

**[Solution 1]** 

In [7]:
#Enter your code here.

**[Question 1]**  
Change the value of the rank at which you are computing precision and recall. Start with rank=5 and continue with rank=10, 20, 30 and 40. See if you can notice a trend in how the values of Precision and Recall change. 

**[Answer 1]** Type your answer here.

## Log-likelihood and Perplexity ##


One way to categorize evaluation measures is by whether they evaluate the model on a particular task (i.e. extrinsically) or by how well the model represents the collection (i.e. intrinsically). Extrinsic evaluation requires ground truth information, as we saw in the previous example when computing precision and recall. In many instances, due to the nature of the model and/or the lack of a well-defined extrinsic task, models are instead evaluated based on how well they predict a held-out set. One commonly used intrinsic evaluation approach is to measure the log-likelihood of the held-out set (i.e. the test set).  

Given a held-out set of $n$ data points $x_i$ and the trained model parameters $\theta$, the log-likelihood is computed as the sum of the log-probabilities $\log p(x_i|\theta)$ of the data points given the model parameters:  
$$\mathcal{L}(x)\ =\ \sum_{i=1}^{n} \log p(x_i|\theta)$$

For many ML and natural language processing (NLP) models Perplexity is used rather than log-likelihood. Perplexity is defined as:

$$ perplexity(x)\ =\ \exp \Bigg\{-\frac{\mathcal{L}(x)}{n} \Bigg\} $$
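
To make the two definitions concrete, here is a minimal sketch that computes the log-likelihood and the perplexity of a tiny held-out set. The probabilities are hypothetical values standing in for $p(x_i|\theta)$; a real model would supply them.

```python
import math

# Hypothetical probabilities p(x_i | theta) that a trained model assigns
# to four held-out data points.
probs = [0.25, 0.5, 0.125, 0.5]

# Log-likelihood: sum of the log-probabilities of the data points.
log_likelihood = sum(math.log(p) for p in probs)

# Perplexity: exponential of the negative average log-likelihood.
perplexity = math.exp(-log_likelihood / len(probs))

print("log-likelihood =", round(log_likelihood, 4))
print("perplexity     =", round(perplexity, 4))
```

As a sanity check, a model that assigns probability 1/4 to every held-out point has a perplexity of exactly 4, so lower perplexity corresponds to a model that is less "surprised" by the held-out data.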

In the following example we are going to use perplexity to evaluate topic models. Earlier today we learned about topic models and we used a topic model configuration with 100 topics (T=100) to infer topics on an NYT article. We are now going to evaluate that model using perplexity. In order to do that, we first write down the perplexity expression for topic models.  
In topic models we are dealing with documents $d$ and words $w$. The model gives us the per-document topic distributions and the per-topic word distributions. For a test set $D$ of $M$ documents, where document $d$ contains $N_d$ words denoted $w_d$, perplexity is defined as:

$$ perplexity(D)\ =\ \exp \Bigg\{-\frac{\sum_{d=1}^M{\log p(w_d)}}{\sum_{d=1}^M{N_d}} \Bigg\} $$

### Computation Example ###

In [2]:
import gensim
import numpy as np
import collections
import matplotlib.pyplot as plt
nytimes = []
with open('nytimes_100docs.txt') as inputfile:
    for line in inputfile:
        nytimes.append(line.lower().split())

# obtain the frequency of each word
frequency = collections.defaultdict(int)
for doc in nytimes:
    for token in doc:
        frequency[token] += 1

# obtain the frequency of the words as a numpy array
n_most_common = 25
np_freq = np.zeros(len(frequency))
count = 0
for token in frequency:
    np_freq[count] = frequency[token]
    count += 1
# sort the frequencies
np_freq_sorted = np.sort(np_freq)
# obtain the maximum allowed frequency
max_freq = np_freq_sorted[-n_most_common]

# remove words that appear only once or whose frequency is at least max_freq
# (roughly the n_most_common most frequent words)
doc_noLowFreq = [[token for token in text if frequency[token] > 1 and frequency[token]<max_freq]
                  for text in nytimes]

dictionary = gensim.corpora.Dictionary(doc_noLowFreq)
corpus = [dictionary.doc2bow(doc) for doc in doc_noLowFreq]
#Split the collection into 80% for training and 20% for test (held out set)
train_corpus = corpus[:80]
test_corpus = corpus[80:]
T=100
model = gensim.models.LdaModel(train_corpus, id2word=dictionary, num_topics=T)



**[Assignment 2]** Given that we've trained an LDA model using 80% of the corpus as a training set, compute the per-word log-likelihood bound of the test set using the gensim method "model.log_perplexity".

**[Solution 2]**

In [6]:
#Enter your code here.

## Significance Testing ##

In this section we'll implement and explore the randomization test, which is also known as the permutation test. The randomization test helps us determine whether the difference in the test statistic used to compare two models/systems is statistically significant, i.e. whether it is unlikely to have happened by chance (e.g. through noise introduced in the evaluation process). We have two evaluation results, each obtained by averaging a certain evaluation measure across the points in our test set for one of the two models. To determine which model performs better, we typically take the difference between the two evaluation results; this difference is our test statistic.  

In the randomization test the null hypothesis states that the results of the two models across the data points in the test set were generated by the same underlying process and that there is therefore no difference in the performance of the two models. The first step of the randomization test is to generate the distribution of the test statistic under the null hypothesis. We do that by generating two new sets of evaluation results for the two models using the following procedure:  
(1) Go over each data point in the evaluation set  
(2) Randomly choose an evaluation result from the two model results for that point  
(3) Repeat the process for the two models and compute the mean for each of the newly generated sets of evaluation results  
(4) Compute the difference between the two means (i.e. the test statistic)  
(5) Repeat steps 1 through 4 n times and store each test statistic. These steps are often referred to as creating the distribution of the test statistic under the null hypothesis.   
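
Steps 1 through 5 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions: the per-query scores are made up, `null_distribution` is a hypothetical helper name, and "randomly choose" is interpreted in the usual paired sense, where one model receives the chosen result for a data point and the other model receives the remaining one.

```python
import random


def null_distribution(scores_a, scores_b, n_iter=1000, seed=0):
    """Differences of means obtained by randomly swapping paired scores."""
    rng = random.Random(seed)
    n_points = len(scores_a)
    stats = []
    for _ in range(n_iter):
        sum_a = sum_b = 0.0
        # Steps 1-3: for each data point, randomly decide which model
        # receives which of the two observed results.
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        # Step 4: the test statistic is the difference between the two means.
        stats.append(sum_a / n_points - sum_b / n_points)
    # Step 5: the stored statistics approximate the null distribution.
    return stats


# Hypothetical per-query scores for two models (made up for illustration).
scores_a = [0.42, 0.31, 0.58, 0.47, 0.25, 0.66]
scores_b = [0.40, 0.35, 0.50, 0.49, 0.20, 0.60]

null_stats = null_distribution(scores_a, scores_b)
print("number of generated test statistics:", len(null_stats))
print("range:", round(min(null_stats), 4), "to", round(max(null_stats), 4))
```

Fixing the seed makes the run reproducible, which is convenient when comparing p-values across experiments.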

Finally, go over the n generated test statistics and count how many of them are larger than our original test statistic. The p-value is the probability that a test statistic generated under the null hypothesis is larger than our original test statistic. It is estimated by dividing that count by n, the total number of test statistics generated under the null hypothesis. 

We are now going to implement the randomization test. 

In [None]:
import numpy as np
# read the output of the two models:
model_1 = np.loadtxt('model_f.res')
model_2 = np.loadtxt('model_e.res')

# the performance of each model is the mean of its per-point scores (column 1)
perf_model_1 = model_1[:,1].mean()
perf_model_2 = model_2[:,1].mean()
diff = perf_model_1 - perf_model_2
print("Model 1 MAP is="+str(round(perf_model_1,4)))
print("Model 2 MAP is="+str(round(perf_model_2,4)))
print("Difference in performance is:")
print("MAP(1)-MAP(2)="+str(round(diff,4)))

# first we are going to concatenate the two results
all_res = np.append(model_1[:,1], model_2[:,1])
# next we set the number of samples
iter=10000
iter_val=np.zeros(iter)
for j in range(0,iter):
    # shuffle the pooled results and split them into two new result sets
    perm = np.random.permutation(len(all_res))
    tr1 = all_res[perm[:len(model_1)]]
    tr2 = all_res[perm[len(model_1):]]
    # store the test statistic for this sample
    iter_val[j] = tr1.mean() - tr2.mean()

**[Assignment 3]** Use the steps outlined above to compute the p-value.

**[Solution 3]**

In [8]:
#type your code here.

**[Assignment 4]** We just used the randomization test to determine how significant the performance difference is between models "e" and "f". Aside from the results of these two models, we have also provided you with the results of two other models, "c" and "d". Use the randomization test to perform the same significance test on these two models. 

## Correlation Coefficients ##


In many data science problems we are tasked with finding the relationship between ranked lists. For example, these ranked lists could contain the performance of a particular model across different configurations of the same parameter. To give a concrete example, let's say we have 4 different types of topic models and each model is evaluated using 7 different topic configurations (T=20, 30, 40, 50, 60, 70 and 80). Given two such ranked lists, we need to decide whether model A ranks the different settings similarly to model B. In other words, we need to quantify the correlation between the two ranked lists. We do that using correlation coefficients. To measure the strength of the linear relationship we use linear correlation coefficients such as the Pearson coefficient (often referred to as Pearson's $R$). When the goal of our correlation analysis is the relationship between the rankings produced by different models, we turn to rank correlation coefficients such as Spearman's rank correlation coefficient ($\rho$).
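
The difference between the two kinds of coefficients can be seen in a small hand-rolled sketch. It relies on the fact that Spearman's $\rho$ equals Pearson's coefficient computed on the ranks of the values. The data and the helper names `pearson` and `ranks` are hypothetical, and the ranking helper assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) upward; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical scores: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6, 7]
y = [v ** 3 for v in x]

r = pearson(x, y)                  # linear relationship on the raw values
rho = pearson(ranks(x), ranks(y))  # Spearman: Pearson computed on the ranks
print("Pearson  =", round(r, 4))
print("Spearman =", round(rho, 4))
```

Because the two lists order the settings identically, the rank correlation is perfect even though the relationship between the raw values is not linear.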

### Example ###
In this example we are going to compute Pearson's ($R$) and Spearman's ($\rho$) coefficients across 4 ranked lists. Each ranked list contains the performance of one of the 4 different ML models across 7 different parameter values.

In [4]:
import scipy.stats

m1 = [0.0583 , 0.1521 , 0.2187 , 0.2562 , 0.3042 , 0.3542 , 0.3564]
m2 = [0.0274 , 0.1428 , 0.2010 , 0.0807 , 0.0854 , 0.1102 , 0.1155]
m3 = [0.1669 , 0.1715 , 0.2551 , 0.2729 , 0.2750 , 0.3170 , 0.3182]
m4 = [0.1154 , 0.2099 , 0.2878 , 0.3412 , 0.1124 , 0.4671 , 0.2723]

#Convert the above values into ranks:
rm1 = scipy.stats.rankdata(m1)
rm2 = scipy.stats.rankdata(m2)
rm3 = scipy.stats.rankdata(m3)
rm4 = scipy.stats.rankdata(m4)

print (rm1)
print (rm2)
print (rm3)
print (rm4)

[ 1.  2.  3.  4.  5.  6.  7.]
[ 1.  6.  7.  2.  3.  4.  5.]
[ 1.  2.  3.  4.  5.  6.  7.]
[ 2.  3.  5.  6.  1.  7.  4.]


**[Assignment 5]**  
Use the first ranked list (rm1) as a reference and compute Pearson's and Spearman's correlation coefficients with the other 3 ranked lists. Use the **scipy.stats** implementation of these two correlation coefficients: **scipy.stats.pearsonr** and **scipy.stats.spearmanr**.

**[Solution 5]**

In [None]:
#Compute the Pearson's correlation coefficient here.

In [None]:
#Compute the Spearman's correlation coefficient here.