# INF8111 - Data Mining

## TP1 SUMMER 2020 - recommendation system

##### Team Members:

    - Kacem Khaled ()
    - Oumayma Messoussi ()
    - Semah Aissaoui ()


## 1 - Overview

Stack Exchange is a network of question-and-answer (Q&A) websites on topics in diverse fields, each site covering a specific topic. On Stack Exchange website, a thread is composed of a question and their answers and comments. In this assignment, *we will implement a recommendation system that returns threads (question + answers) that are related to a specific question*. Before submitting questions, the  website will use this engine to show the most similar threads to users in order to reduce the number of duplicate questions.

## 2 Setup

Please run the code below to install the packages needed for this assignment.

In [60]:
# If you want, you can use anaconda and install after nltk library
# pip install --user numpy
# pip install --user sklearn
# pip install --user scipy
# pip install --user nltk


#python

import nltk
nltk.download("punkt")
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

## 3 - Data

Please download the zip file in the following url: https://drive.google.com/file/d/1032N1oZkytHlHs20AXE9jPMBQCTyhb6H/view?usp=sharing

In this zip file, there are:

1. test.json: This file contains queries (new questions) and the relevant threads(question + answer) for each one these queries.
2. threads: It is a folder that contains the thread html sources of threads. Each html file name follows the pattern **thread_id.html**.


Figure below depicts an thread page example:

![thread_img](thread_example.png)

The figure contains 4 hilighted areas. Area A, B, C, D and E are the question subject, question body, question comments, answer body, and anwer comments, respectively. 


In [61]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import os

# define the folder path that contain the data
# FOLDER_PATH = "Define folder path that contain threads folder and test.json"
#FOLDER_PATH = "../../../datasets/TP1/dataset/"
#THREAD_FOLDER = os.path.join(FOLDER_PATH, 'threads')
PATH = "drive/My Drive/1 Polymtl/Ete20/INF8111/TP1/" # changer le path avec votre path

# Load the evaluation dataset
import json


#test = json.load(open(os.path.join(FOLDER_PATH, "test.json")))
test = json.load(open(PATH+"test.json"))
relevant_threads_by_query = dict()


for (query_id, cand_id, label) in test: 
    if label == 'Irrelevant':
        continue
        
    l = relevant_threads_by_query.setdefault(query_id, [])
    l.append(cand_id)
    


## 4 - Web scraping

Web scraping consists in extracting relevant data from pages and prepare it for computational analysis.


### 4.1 - Question 1 (0.5 point)

Special and non-ASCII characters can be encoded into their html representantion (html entities). For instance, apostrophe (') is encoded as **\&amp;**. The webpage encoding in the data folder are incosistent. Only in a portion of the webpages, the **special** and **non-ASCII** characters were encoded into  html entities. We will fix this inconsistency by transforming the html entities into character representations, e.g., **\&amp;** is represented as **'**.

*Implement the function fix_encoding that encodes the html entities (special and non-ASCII characters) into their UTF-8 encoding.*

In [0]:
from html import unescape

def fix_encoding(text):
    """
    Encodes the html entities in a text into UTF-8 encoding. For instance, "I&apos;m ..." => "I'm ..."
    
    :param text: string.
    :return: fixed text(sting)
    """
    return unescape(text)

In [83]:
# test
txt = fix_encoding("Hello I&apos;m &amp; Foulen &ac; Fouelni &amp;&ac;&ac;&apos;&amp;")
print(txt)

Hello I'm & Foulen ∾ Fouelni &∾∾'&


### 4.2 - Question 2 (3 points)

Implement extract_data_from_page function. This function extracts question subject, question body, question comments, answer body, and anwer comments from the thread webpage. It returns a dictionary with the following structure: *{"thread_id": int,"question":{"subject": string, "body": string, "comments": [string]}, answers: [{"body": string, "comments": [string]}]}*

**Use the fix_encoding function to fix the text encoding. You can use the library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) in this question. All html tags have to be removed from comment, question and answer textual data.**


In [0]:
from bs4 import BeautifulSoup

def extract_data_from_page(pagepath):
    """
    Scrap question, answer and comments from thread page.
    
    :param pagepath: the path of thread html file.
    :return: 
        {
            "thread_id": thread id,
            "question":{
                "subject": question subject text (Area A in the figure), 
                "body": question body text (Area B in the figure), 
                "comments": list of comment texts (Area C in the figure)
                }, 
            "answers": [
                {
                    "body": answer body text (Area D in the figure),
                    "comments": list of answer texts (Area E in the figure)
                }
                ]
            }
    """
    data = {}
    answer = {}
    data['thread_id'] = pagepath.split('/')[-1][:-5]
    data['question'] = {}
    
    soup = BeautifulSoup(open(pagepath,encoding='utf8'))
    question =  soup.find("div", class_="question")
    answers =  soup.find_all("div", class_="answer")

    data['question']['subject'] = soup.find("a", class_="question-hyperlink").get_text()
    data['question']['body'] = question.find("div", class_="postcell post-layout--right").find("div", class_="post-text").get_text().strip()
    data['question']['comments'] = [s.get_text().strip() for s in question.find_all("span", class_="comment-copy")]
    data['answers'] = []
    for ans in answers:
        answer['body'] = ans.find("div", class_="post-text").get_text().strip()
        answer['comments'] = [s.get_text().strip() for s in ans.find_all("span", class_="comment-copy")]
        data['answers'].append(answer)
        answer = {}
    return data   

In [66]:
# test on 255875915822.html, 100853498510.html
pagepath = os.path.join(THREAD_FOLDER, "255875915822.html") 
data = extract_data_from_page(pagepath)

print(json.dumps(data, indent=4, sort_keys=False))

NameError: ignored

### 4.3 - Extract text from HTML


In [0]:
import os
from multiprocessing import Pool, TimeoutError
from time import time
from tqdm import tqdm

import json
# Index each thread by its id
#index_path = os.path.join(THREAD_FOLDER, 'threads.json')
index_path = os.path.join(PATH, 'threads.json')
if os.path.isfile(index_path):
    # Load threads that webpage content were already extracted.
    thread_index = json.load(open(index_path))
else:
    # Extract webpage content
    # This can be slow (around 30 minutes). Test your code with a small sample. lxml parse is faster than html.parser
    files = (os.path.join(THREAD_FOLDER, filename) for filename in os.listdir(THREAD_FOLDER))
    threads = map(extract_data_from_page, files)
    thread_index = dict(((thread['thread_id'], thread) for thread in tqdm(threads,total=28403)))
    # Save preprocessed threads
    json.dump(thread_index, open(index_path,'w'))

## 5 - Data Preprocessing

Preprocessing is a crucial task in data mining. This task clean and transform the raw data in a format that can better suit data analysis and machine learning techniques. In natural language processing (NLP), *tokenization* and *stemming* are two well known preprocessing steps. Besides these two steps, we will implement an additional step that is designed exclusively for the twitter domain.

### 5.1 - Tokenization

In this preprocessing step, a *tokenizer* is responsible for breaking a text in a sequence of tokens (words, symbols, and punctuations). 

For instance, the sentence *It's the student's notebook.* can be split into the following list of tokens: ['It', "'s", 'the', 'student', "'s", 'notebook', '.'].


#### 5.1.1 - Question 3 (0.5 point) 

Implement the following functions: 
- **tokenize_space** tokenizes the tokens that are separated by whitespace (space, tab, newline). This is a naive tokenizer.

- **tokenize_nltk** uses the default method of the nltk package (https://www.nltk.org/api/nltk.html) to tokenize the text.

**All tokenizers have to lowercase the tokens.**

In [0]:
from nltk.tokenize import word_tokenize
def tokenize_space(object):
    """
    Tokenize the tokens that are separated by whitespace (space, tab, newline). 
    We consider that any tokenization was applied in the text when we use this tokenizer.
    
    For example: "hello\tworld of\nNLP" is split in ['hello', 'world', 'of', 'NLP']
    """
    # return a list of tokens
    return [w.lower() for w in object.split()] 
        
def tokenize_nltk(object):
    """
    This tokenizer uses the default function of nltk package (https://www.nltk.org/api/nltk.html) to tokenize the text.
    """
    # return a list of tokens
    return [w.lower() for w in word_tokenize(object)] 
    
        

In [69]:
# test
msg1 =  "hello\tworld of\nNLP"
msg2 = "It's the student's notebook."

print(tokenize_space(msg1))
print(tokenize_nltk(msg1))

print(tokenize_space(msg2))
print(tokenize_nltk(msg2))

['hello', 'world', 'of', 'nlp']
['hello', 'world', 'of', 'nlp']
["it's", 'the', "student's", 'notebook.']
['it', "'s", 'the', 'student', "'s", 'notebook', '.']


### 5.2 - Filtering Insignificant Tokens

#### 5.2.1 -  Question 4 (1 point)

There are a set of tokens that are not signficant to the similarity comparison since they appear in many different threads pages. Thus, removing them decreases the vector dimensionality and turns the similarity calculation computationally cheaper. Describe the tokens that are insignificant for the thread similarity comparison? Moreover, implement the function filter_tokens that removes these words from a list of tokens.

**We ignore the words that does not increase/decrease comprehension of the document such as preopositions because they could not measure the similarity between documents. Part of those words is assembled in the predefined lass stopwords.**

In [0]:
from nltk.corpus import stopwords
def filter_tokens(words):
    stop_words = set(stopwords.words('english'))
    words = [w.lower() for w in words if not w.lower() in stop_words]
    return words

In [71]:
# test
print(filter_tokens(tokenize_nltk(msg1)))
print(filter_tokens(tokenize_nltk(msg2)))

['hello', 'world', 'nlp']
["'s", 'student', "'s", 'notebook', '.']


## 5.3 - Stemming

The process to convert words with the same stem (word reduction that keeps word prefixes) to a standard form is called *stemming*. For instance, "fishing", "fished" and "fishes" are transformed to the stem "fish.


In [72]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")



word1 = ["Visitors", "from", "all", "over", "the", "world", "fishes", "during", "the", "summer","."]

print([ stemmer.stem(w) for w in word1])

word2 = ['I', 'was', 'fishing',]
print([ stemmer.stem(w) for w in word2])

['visitor', 'from', 'all', 'over', 'the', 'world', 'fish', 'dure', 'the', 'summer', '.']
['i', 'was', 'fish']


### 5.3.1 - Question 5 (1 point) 

Explain how stemming can benift our search engine?


**The stemming is useful for our comparison since by looking at two subjects, we can find two words which have been written in different ways but they have the same root such as: fish and peach. So to distinguish if two words have the same meaning or the same root, we apply stemming to directly compare the words by their roots. In other words, Stemming can help our search engine through reducing the vocabulary and therefore reduce the search time. It helps focusing on the sense of a text instead of its deeper meaning.**




# 6 - Data representation

## 6.1 - Bag of Words

Many algorithms only accept inputs that have the same size. However, there are some data types whose sizes are not fixed, for instance, a text can have an unlimited number of words. Imagine that we retrieve two sentences: ”Board games are much better than video games” and ”Monoply is an awesome game!”. These sentences are respectively named as Sentence 1 and 2. Table below depicts how we could represent both sentences using a fixed representation.

|<i></i>        | an | are | ! | monopoly | awesome | better | games | than | video | much | board | is | game |
|------------|----|-----|---|----------|---------|--------|-------|------|-------|------|-------|----|------|
| Sentence 1 | 0  | 1   | 0 | 0        | 0       | 1      | 2     | 1    | 1     | 1    | 1     | 0  | 0    |
| Sentence 2 | 1  | 0   | 0 | 1        | 1       | 0      | 0     | 0    | 0     | 0    | 0     | 1  | 1    |

Each column of this table 2.1 represents one of 13 vocabulary words, whereas the rows contains the word
frequencies in each sentence. For instance, the cell in row 1 and column 7 has the value 2
because the word games occurs twice in Sentence 1. Since the rows have always 13 values, we
could use those vectors to represent the Sentences 1 and 2. The table above illustrates a technique called bag-of-words. Bag-of-words represents a document as a vector whose dimensions are equal to the number of times that vocabulary words appeared in the document. Thus, each token will be related to a dimension, i.e., an integer.

### 6.1.2 - Question 6 (2.5 points)

Implement the bag-of-words model that weights the vector with the absolute word frequency.

**For this exercise, you cannot use any external python library (e.g., scikit-learn). However, if you have a problem with memory size, you can use the class scipy.sparse.csr_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html)
https://stackoverflow.com/questions/52299420/scipy-csr-matrix-understand-indptr
https://en.wikipedia.org/wiki/Sparse_matrix**

In [73]:
# Example
from scipy.sparse import csr_matrix
import numpy as np
a = np.asarray([[0,0,1,4],[2,0,0,5], [0,0,0,0], [5,6,0,7]]) # Representation of 4 sentences. Vocabulary size = 4
print("#DENSE MATRIX")
print(a)
data = [1,4,2,5,5,6,7] # All non-zero values
indices = [2,3,0,3,0,1,3] # Column index of each non-zero value
indptr = [0,2,4,4,7] #  then maps the elements of data and indices to the rows of the sparse matrix. This is done with the following reasoning:.
b = csr_matrix((data, indices, indptr))
print("#SPARSE MATRIX")
print(b)
print("#SPARSE MATRIX=> DENSE MATRIX")
print(b.toarray()) 

#DENSE MATRIX
[[0 0 1 4]
 [2 0 0 5]
 [0 0 0 0]
 [5 6 0 7]]
#SPARSE MATRIX
  (0, 2)	1
  (0, 3)	4
  (1, 0)	2
  (1, 3)	5
  (3, 0)	5
  (3, 1)	6
  (3, 3)	7
#SPARSE MATRIX=> DENSE MATRIX
[[0 0 1 4]
 [2 0 0 5]
 [0 0 0 0]
 [5 6 0 7]]


In [0]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def transform_count_bow(X):
    """
    This method preprocesses the data using the pipeline object, relates each token to a specific integer and  
    transforms the text in a vector. Vectors are weighted using the token frequencies in the sentence.

    X: document tokens. e.g: [['I','will', 'be', 'back', '.'], ['Helllo', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]

    :return: vector representation of each document
    """ 

    print("\ngenerating count bow :")
    indices = []
    data = []
    indptr = [0]
    vocab= {}  
    for doc in X:
        for word in doc:
            index = vocab.setdefault(word, len(vocab))
            indices.append(index)
            data.append(1)
        indptr.append(len(indices))
    print('length of vocab:',len(vocab))
    return csr_matrix((data, indices, indptr), dtype=np.uint8)

In [0]:
# test
X = [['I','will', 'be', 'back', '.'], ['I','you', 'I', 'you','?'] , ['Hello', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]
bow = transform_count_bow(X)
print(bow.toarray())
#print(bow.data)
#print(bow.indices)
print(bow)

## 6.2 - TF-IDF

Using raw frequency in the bag-of-words can be problematic. The word frequency distribution
is skewed - only a few words have high frequencies in a document. Consequently, the
weight of these words will be much bigger than the other ones which can give them more
impact on some tasks, like similarity comparison. Besides that, a set of words (including
those with high frequency) appears in most of the documents and, therefore, they do not
help to discriminate documents. For instance, the word *of* appears in a significant
part of documents. Thus, having the word *of* does not make
documents more or less similar. However, the word *terrible* is rarer and documents that
have this word are more likely to be negative. TF-IDF is a technique that overcomes the word frequency disadvantages.

TF-IDF weights the vector using inverse document frequency (IDF) and word frequency, called term frequency (TF).
TF is the local information about how important is a word to a specific document.  IDF measures the discrimination level of the words in a dataset.  Common words in a domain are not helpful to discriminate documents since most of them contain these terms. So, to reduce their relevance in the documents, these words should have low weights in the vectors . 
The following equation calculates the word IDF:
\begin{equation}
	idf_i = \log\left( \frac{N}{df_i} \right),
\end{equation}
where $N$ is the number of documents in the dataset, $df_i$ is the number of documents that contain a word $i$.
The new weight $w_{ij}$ of a word $i$ in a document $j$ using TF-IDF is computed as:
\begin{equation}
	w_{ij} = tf_{ij} \times idf_i,
\end{equation}
where $tf_{ij}$ is the term frequency of word $i$ in the document $j$.


### 6.2.1 - Question 7 (3.5 points)

Implement a bag-of-words model that weights the vector using TF-IDF.

**For this exercise, you cannot use any external python library (e.g., scikit-learn). However, if you have a problem with memory size, you can use the class scipy.sparse.csr_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html)**

In [0]:
def transform_tf_idf_bow(X):
    """
    This method preprocesses the data using the pipeline object, calculates the IDF and TF and 
    transforms the text in vectors. Vectors are weighted using TF-IDF method.

    X: document tokens. e.g: [['I','will', 'be', 'back', '.'], ['Hello', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]

    :return: vector representation of each document
    """
    print("\ngenerating tf-idf bow :")
    indices = []
    tf = []
    idf = []
    indptr = [0]
    vocab= {} # dict to collect each new word [key] and its index [value]
    df = {} # dict to collect nb of docs [value] that contain each word i [key]
    N = len(X)
    for doc in X:
        for word in doc:
            index = vocab.setdefault(word, len(vocab))
            indices.append(index)
            tf.append(1)
            if word in df:
                df[word] += 1
            else:
                df.setdefault(word, 1)
        indptr.append(len(indices))
    print('length of vocab:',len(vocab))
    for doc in X:
        for i in doc:
            idf.append((np.log2(N / df[i])))
    w = np.array(tf) * np.array(idf)
    return csr_matrix((list(w), indices, indptr), dtype=float)

In [0]:
bow = transform_tf_idf_bow(X)
print(bow.toarray())
print(bow.data)
print(bow.indices)
print(bow)

# 7 - Our Recommendation System

## 7.1 - Question 8 (1.5 points)

The pipeline is a sequence of preprocessing steps that transform the raw data to a format that is suitable for your problem. For our problem, you have to implement a pipeline composed of the following steps:

1. Concatenate answer, question and comment texts of thread $t$ in the dictionary thread_dict.
2. Tokenize the thread texts.
3. Filter the insignificant tokens.
4. Stem the tokens
5. Generate the vector representation using TFIDFBoW or CountBoW
6. Returns thread ids and thread vector representations.


In [0]:
def nlp_pipeline(thread_dict, tokenization_type, vectorizer_type, enable_filter_tokens, enable_stemming):
    """
    Preprocess and vectorize the threads.
    
    thread_dict: dictionary whose keys and values are thread ids and thread objects, respectively.
    tokenization_type: two possible values "space_tokenization" and "nltk_tokenization".
                            - space_tokenization: tokenize_space function is used to tokenize.
                            - nltk_tokenization: tokenize_nltk function is used to tokenize.
                            
    vectorizer_type: two possible values "count" and "tf_idf".
                            - count: use transform_count_bow to vectorize the text
                            - tf_idf: use transform_tf_idf_bow to vectorize the text
                            
    enable_filter_tokens: enable the insignificant token removal;
    
    enable_stemming: enable stemming
    
    return: a list L with thread ids and matrix B that contains the vector of each thread. B[idx] is the fixed-length representation of L[idx].
    """
    # enable this to limit the results on the first 5 threads of thread_dict
    testing = False # to limit testing
    
    L = list(thread_dict.keys())
    if testing: L = L[:5] # to limit testing
    tokens_list = []
    if testing: count=0 # to limit testing
    print("Pipeline:")
    for thread in thread_dict.values():
        thread_concat = []
        # 1. Concatenate answer, question and comment texts of thread  𝑡  in the dictionary thread_dict.
        thread_concat = thread["question"]["subject"]+' '
        thread_concat += thread["question"]["body"] + ' '
        for question_comment in thread["question"]["comments"]:
            thread_concat += ((question_comment + ' '))
        for answer in thread["answers"]:
            thread_concat += ((answer['body'] + ' '))
            for answer_comment in answer["comments"]:
                thread_concat += ((answer_comment + ' '))
        #print(thread_concat)
        # 2. Tokenize the thread texts.
        if tokenization_type == "space_tokenization":
            tokens = tokenize_space(thread_concat)
        elif tokenization_type == "nltk_tokenization":
            tokens = tokenize_nltk(thread_concat)
        else:
            raise Exception("Undefined tokenization_type : ",tokenization_type)
        
        # 3. Filter the insignificant tokens
        if enable_filter_tokens:
            tokens = filter_tokens(tokens)
        elif type(enable_filter_tokens)!=bool:
            raise Exception("enable_filter_tokens can only be True or False!")

        # 4. Stem the tokens
        if enable_stemming:
            tokens = [ stemmer.stem(token) for token in tokens]
        elif type(enable_stemming)!=bool:
            raise Exception("enable_stemming can only be True or False!")
        
        tokens_list.append(tokens)
        
        if testing: count+=1 # to limit testing
        if testing and count == 5: break # to limit testing
            
    # 5. Generate the vector representation using TFIDFBoW or CountBoW
    if vectorizer_type == "count":
        B = transform_count_bow(tokens_list)
    elif vectorizer_type == "tf_idf":
        B = transform_tf_idf_bow(tokens_list)
    else:
        raise Exception("Undefined vectorizer_type : ",vectorizer_type)
    print("Pipeline : done")
    # 6. Returns thread ids and thread vector representations.
    return L,B

In [0]:
# test
L,B = nlp_pipeline(thread_dict=thread_index, tokenization_type="nltk_tokenization", vectorizer_type="count", enable_filter_tokens=True, enable_stemming=True)
print(L[:5])
#print(len(B))
#print(B.toarray().shape)
#print(B.toarray())
print(B)
print(B[1].shape)
print(B[0])
thread = thread_index["100021749708"]
print(json.dumps(thread, indent=4, sort_keys=False))

## 7.2 - Question 9 (1.5 points)

*Implement the function rank that returns a list of thread ids sorted by thread and query similarity*. We will use the [cosine similarity function](https://en.wikipedia.org/wiki/Cosine_similarity) to compare two threads. In this assignment, query is a thread without answers and comments.

**Remove the query in the sorted list (rank output)**


In [0]:
from sklearn.metrics.pairwise import cosine_similarity
def rank(query_id, all_thread_ids, X):
    """
    Return a list of thread ids sorted by thread and query similarity. Cosine similarity is used to compare threads. 
    
    query_id: thread id 
    all_thread_ids: list of thread ids
    X: thread data representations
    
    return: ranked list of thread ids. 
    """

    # Compute the similarity of thread representations(vectors) using cosine similarity function

    i = all_thread_ids.index(query_id)
    similarities = cosine_similarity(X, X[i])
    similaritiesD = dict(zip(all_thread_ids, similarities.T.tolist()[0]))
    del similaritiesD[query_id]
    return [x[0] for x in sorted(similaritiesD.items(), key=lambda x:x[1], reverse=True)]
    

In [0]:
# test
all_thread_ids, X = nlp_pipeline(thread_dict=thread_index, tokenization_type="nltk_tokenization", vectorizer_type="count", enable_filter_tokens=True, enable_stemming=True)
print(all_thread_ids)
element = '100083250150'
similarities = rank(element, all_thread_ids, X)
print(similarities)
print(all_thread_ids)

## 7.3 - Evaluation

We will test different configurations of our recommender system. These configurations are compared using the [mean average precision (MAP) metric](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision). Basically, the closer relevant threads are from ranked list begining, the higher MAP is. Additional materials to undertand MAP: [recall and precision over ranks](https://youtu.be/H7oAofuZjjE) and [MAP](https://youtu.be/pM6DJ0ZZee0).


The function *eval* evaluates a specific configurantion of our recommender system



In [0]:
from statistics import mean 


def calculate_map(x):
    res = 0.0
    n = 0.0
    
    
    for relevant_threads, ranked_list in x:
        precisions = []
               
        for k, thread_id in enumerate(ranked_list):
            if thread_id in relevant_threads:
                prec_at_k = (len(precisions) + 1)/(k+1)
                precisions.append(prec_at_k)
                
            if len(precisions) == len(relevant_threads):
                break
        res += mean(precisions)
        n += 1

    
    return res/n
            

def eval(tokenization_type, vectorizer, enable_filter_tokens, enable_stemming):
    all_thread_ids, X = nlp_pipeline(thread_index, tokenization_type, vectorizer, enable_filter_tokens, enable_stemming)
    all_thread_ids = [int(t_id) for t_id in all_thread_ids]    
    queries,relevant_threads = zip(*relevant_threads_by_query.items())
    ranked_list = (rank(query_id, all_thread_ids, X) for query_id in queries)
        
        
    return calculate_map(zip(relevant_threads,ranked_list))
        

## 7.4 - Question 10 (5 points)

Evaluate our recommedation system performamnce(MAP) using each one of the following configurations:
1. count(BoW) + space_tokenization (sans tokenizer)
2. count(BoW) + nltk_tokenization
3. count(BoW) + nltk_tokenization + Filtrer les tokens sans importance
4. count(BoW) + nltk_tokenization + Filtrer les tokens sans importance + Stemming
5. tf_idf + nltk_tokenization
6. tf_idf + nltk_tokenization + Filtrer les tokens sans importance
7. tf_idf + nltk_tokenization + Filtrer les tokens sans importance + Stemming 

Describe the results found by you and answer the following questions:
- Was our recommendation system negatively or positively impacted by data preprocessing steps?
- TF-IDF has achieved a better performance than CountBoW? If yes, why do you think that this has occurred? 

In [0]:
# 1. count(BoW) + space_tokenization (sans tokenizer)
test1= eval(tokenization_type = "space_tokenization", vectorizer="count", enable_filter_tokens=False, enable_stemming=False)
print("\n1:",test1)

In [0]:
# 2. count(BoW) + nltk_tokenization
test2= eval(tokenization_type = "nltk_tokenization", vectorizer="count", enable_filter_tokens=False, enable_stemming=False)
print("\n2:",test2)

In [0]:
# 3. count(BoW) + nltk_tokenization + Filtrer les tokens sans importance
test3= eval(tokenization_type = "nltk_tokenization", vectorizer="count", enable_filter_tokens=True, enable_stemming=False)
print("\n3:",test3)

In [0]:
# 4. count(BoW) + nltk_tokenization + Filtrer les tokens sans importance + Stemming
test4= eval(tokenization_type = "nltk_tokenization", vectorizer="count", enable_filter_tokens=True, enable_stemming=True)
print("\n4:",test4)

In [0]:
# 5. tf_idf + nltk_tokenization
test5= eval(tokenization_type = "nltk_tokenization", vectorizer="tf_idf", enable_filter_tokens=False, enable_stemming=False)
print("\n5:",test5)

In [0]:
# 6. tf_idf + nltk_tokenization + Filtrer les tokens sans importance
test6= eval(tokenization_type = "nltk_tokenization", vectorizer="tf_idf", enable_filter_tokens=True, enable_stemming=False)
print("\n6:",test6)

In [0]:
# 7. tf_idf + nltk_tokenization + Filtrer les tokens sans importance + Stemming 
test7= eval(tokenization_type = "nltk_tokenization", vectorizer="tf_idf", enable_filter_tokens=True, enable_stemming=True)
print("\n7:",test7)

|Test    | tokenization_type | vectorizer | enable_filter_tokens| enable_stemming |MPE |
|------------|----|-----|---|----------|----------|
| 1 | space_tokenization  | count  | False  | False | 0.09392091474140551 | 
| 2 | nltk_tokenization  | count  | False  | False | 0.09417570136174147 | 
| 3 | nltk_tokenization  | count  | True  | False | 0.18419627640997 | 
| 4 | nltk_tokenization  | count  | True  | True | 0.212339985842106 | 
| 5 | nltk_tokenization  | tf_idf  | False  | False | 0.016234752329669106 | 
| 6 | nltk_tokenization  | tf_idf  | True  | False | 0.06281772959310575 | 
| 7 | nltk_tokenization  | tf_idf  | True  | True | 0.06836937398190894 | 


Our recommendation system was positively influenced by the data pre-processing stages. This is proven by the test values ​​carried out with filtration of unimportant tokens and the stemming

Theoritically, yes, TF-IDF is more efficient than BoW since it eliminates the problem of redundance of words that do not matter. Seeing this word in common does not allow us to conclude that documents are similar. On the contrary, when the word is rare, documents containing this word are more likely to have similar content. TF-IDF is therefore a method which overcomes this problem. Unfortunately, here the results does not reflect that compared to CountBOW.


