# INF8111 - Data Mining

## TP1 SUMMER 2020 - recommendation system

##### Team Members:

    - Kacem Khaled
    - Oumayma Messoussi
    - Semah Aissaoui


## 1 - Overview

Stack Exchange is a network of question-and-answer (Q&A) websites on topics in diverse fields, each site covering a specific topic. On Stack Exchange website, a thread is composed of a question and their answers and comments. In this assignment, *we will implement a recommendation system that returns threads (question + answers) that are related to a specific question*. Before submitting questions, the  website will use this engine to show the most similar threads to users in order to reduce the number of duplicate questions.

## 2 Setup

Please run the code below to install the packages needed for this assignment.

In [56]:
# If you want, you can use anaconda and install after nltk library
# pip install --user numpy
# pip install --user sklearn
# pip install --user scipy
# pip install --user nltk


#python
import numpy as np
import sklearn as sk
import scipy as sp
import nltk
from time import time
# nltk.download("punkt")
# nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('universal_tagset')

## 3 - Data

Please download the zip file in the following url: https://drive.google.com/file/d/1032N1oZkytHlHs20AXE9jPMBQCTyhb6H/view?usp=sharing

In this zip file, there are:

1. test.json: This file contains queries (new questions) and the relevant threads(question + answer) for each one these queries.
2. threads: It is a folder that contains the thread html sources of threads. Each html file name follows the pattern **thread_id.html**.


Figure below depicts an thread page example:

![thread_img](thread_example.png)

The figure contains 4 hilighted areas. Area A, B, C, D and E are the question subject, question body, question comments, answer body, and anwer comments, respectively. 


In [57]:
import os

# define the folder path that contain the data
# FOLDER_PATH = "Define folder path that contain threads folder and test.json"
FOLDER_PATH = '..\\..\\..\\datasets\\TP1\dataset\\'
THREAD_FOLDER = os.path.join(FOLDER_PATH, 'threads\\')


# Load the evaluation dataset
import json


test = json.load(open(os.path.join(FOLDER_PATH, "test.json")))
relevant_threads_by_query = dict()


for (query_id, cand_id, label) in test: 
    if label == 'Irrelevant':
        continue
        
    l = relevant_threads_by_query.setdefault(query_id, [])
    l.append(cand_id)
    


## 4 - Web scraping

Web scraping consists in extracting relevant data from pages and prepare it for computational analysis.


### 4.1 - Question 1 (0.5 point)

Special and non-ASCII characters can be encoded into their html representantion (html entities). For instance, apostrophe (') is encoded as **\&apos;**. The webpage encoding in the data folder are incosistent. Only in a portion of the webpages, the **special** and **non-ASCII** characters were encoded into  html entities. We will fix this inconsistency by transforming the html entities into character representations, e.g., **\&apos;** is represented as **'**.

*Implement the function fix_encoding that encodes the html entities (special and non-ASCII characters) into their UTF-8 encoding.*

In [61]:
from html.entities import html5
from html import unescape

def fix_encoding(text):
    """
    Encodes the html entities in a text into UTF-8 encoding. For instance, "I&amp;m ..." => "I'm ..."
    
    :param text: string.
    :return: fixed text(sting)
    """
    #start = time()
    if text == '':
        return text
    else:
    
#         start_index = text.find('&')
#         while (start_index != -1):
#             if text[start_index+1:start_index+4] == 'amp':
#                 start_index = text.find('&', start_index+1)
#                 continue
#             end_index = text.find(';', start_index)
#             text = text.replace(text[start_index:end_index+1], html5[text[start_index+1:end_index+1]])
#             start_index = text.find('&', start_index+1)

#         text = text.replace('&amp;', html5['amp;'])

        text = unescape(text.lower())
    #print('Total runtime in sec = ' + str(time() - start))
    
    return text
        
#     return text

In [62]:
# test
txt = fix_encoding("Hello I&apos;m &amp; oumayma &ac; messoussi &amp;&ac;&ac;&apos;&amp;")
print(txt)

hello i'm & oumayma ∾ messoussi &∾∾'&


### 4.2 - Question 2 (3 points)

Implement extract_data_from_page function. This function extracts question subject, question body, question comments, answer body, and anwer comments from the thread webpage. It returns a dictionary with the following structure: *{"thread_id": int,"question":{"subject": string, "body": string, "comments": [string]}, answers: [{"body": string, "comments": [string]}]}*

**Use the fix_encoding function to fix the text encoding. You can use the library Beatiful Soap in this question. All html tags have to be removed from comment, question and answer textual data.**


In [63]:
from bs4 import BeautifulSoup

def extract_data_from_page(pagepath):
    """
    Scrap question, answer and comments from thread page.
    
    :param pagepath: the path of thread html file.
    :return: 
        {
            "thread_id": thread id,
            "question":{
                "subject": question subject text (Area A in the figure), 
                "body": question body text (Area B in the figure), 
                "comments": list of comment texts (Area C in the figure)
                }, 
            "answers": [
                {
                    "body": answer body text (Area D in the figure),
                    "comments": list of answer texts (Area E in the figure)
                }
                ]
            }
    """
    #start = time()
    
    data = {}
    answer = {}
    data['thread_id'] = pagepath.split('\\')[-1][:-5]
    data['question'] = {}
    data['question']['comments'] = []
    data['answers'] = []
    
    soup = BeautifulSoup(open(pagepath, encoding='utf8'), features="lxml")
    
    data['question']['subject'] = fix_encoding(soup.find('a', class_='question-hyperlink').get_text().strip())
            
    QnA = soup.find(id='mainbar')
    
    question = QnA.find(id='question')     
    raw_Q = question.find('div', class_='post-text')
    data['question']['body'] = fix_encoding(raw_Q.get_text().strip())
    
    raw_Q_comm = question.find_all('div', class_='comments')
    for Q_comm in raw_Q_comm:
        comms = Q_comm.find_all('li')
        for comm in comms:
            span = comm.find('span', class_='comment-copy')
            data['question']['comments'].append(fix_encoding(span.get_text().strip()))
                        
    answers = QnA.find(id='answers')
    answers = answers.find_all('div', class_='answer')
    for answer in answers:
        tmp = {}
        tmp['comments'] = []
        ans = answer.find('div', class_='post-text')
        tmp['body'] = fix_encoding(ans.get_text().strip())
        
        raw_A_comms = answer.find_all('div', class_='comments')
        for A_comm in raw_A_comms:
            comms = A_comm.find_all('li')
            for comm in comms:
                span = comm.find('span', class_='comment-copy')
                tmp['comments'].append(fix_encoding(span.get_text().strip()))

        data['answers'].append(tmp)

#     question =  soup.find("div", class_="question")
#     answers =  soup.find_all("div", class_="answer")

#     data['question']['subject'] = soup.find("a", class_="question-hyperlink").get_text()
#     data['question']['body'] = question.find("div", class_="postcell post-layout--right").find("div", class_="post-text").get_text().strip()
#     data['question']['comments'] = [s.get_text().strip() for s in question.find_all("span", class_="comment-copy")]
#     data['answers'] = []
#     for ans in answers:
#         answer['body'] = ans.find("div", class_="post-text").get_text().strip()
#         answer['comments'] = [s.get_text().strip() for s in ans.find_all("span", class_="comment-copy")]
#         data['answers'].append(answer)
#         answer = {}
    
    #print('Total runtime in sec = ' + str(time() - start))
    return data
    

In [6]:
data = extract_data_from_page(THREAD_FOLDER+'255875915822.html')
print(json.dumps(data, indent=4, sort_keys=False))

{
    "thread_id": "255875915822",
    "question": {
        "comments": [
            "I found an outside gym thinking I could increase my strength, but I wasn't able to do anything with the weights. I would like to think hunting would improve your shooting.",
            "Does anything cause your skills to decrease"
        ],
        "subject": "What's the fastest/easiest way to level up your skills",
        "body": "Basically subj. The skills and what I found so far is: Stamina: just run, sprint, cycle, swim, whatever. Just move and it will grow(not that fast though) Shooting: personally I found shooting range to be a really fast way to improve your shooting skills Strength: fist fighting Stealth: performing stealth kills Flying: probably just flying, didn't bother about it yet Driving: got the most problems here.\n Playing as a Trevor I can't get past 1.8 or 1.9 bars(which is 36-38%). Drove for hours around and still can't get it any higher Lung Capacity: staying underwater for a

In [64]:
data = extract_data_from_page(THREAD_FOLDER+'100410376731.html')
print(json.dumps(data, indent=4, sort_keys=False))

{
    "thread_id": "100410376731",
    "question": {
        "comments": [
            "\"i have a number of schools\" and that number is...",
            "i modified my question to show that i have 8 schools and a population of 475/47/54."
        ],
        "subject": "how do you get to 100% education",
        "body": "i have a 8 schools set up in the game, and the are always fully staffed.  the population is 475/47/54.  i have had less population, and still the same results.  however, i can't seem to get past about 72% educated in my 200-300 person city.\n  will more schools help, or do i need to do something different"
    },
    "answers": [
        {
            "comments": [
                "and if they're not dying fast enough, send them to the mines ;-)"
            ],
            "body": "only children can be educated, once someone becomes an uneducated adult, they remain uneducated until they die.\n eight staffed schools will be able to educate 160 students simultaneously, 

### 4.3 - Extract text from HTML


In [65]:
import os
from multiprocessing import Pool, TimeoutError
from time import time
import json
from tqdm import tqdm

# FOLDER_PATH = "../../../datasets/TP1/dataset/"
# THREAD_FOLDER = os.path.join(FOLDER_PATH, 'test')

# Index each thread by its id
index_path = os.path.join(THREAD_FOLDER, 'threads.json')
if os.path.isfile(index_path):
    
    # Load threads that webpage content were already extracted.
    thread_index = json.load(open(index_path))
else:
    
    # Extract webpage content
    # This can be slow (around 30 minutes). Test your code with a small sample. lxml parse is faster than html.parser
    files = (os.path.join(THREAD_FOLDER, filename) for filename in os.listdir(THREAD_FOLDER))
    threads = map(extract_data_from_page, files)
    thread_index = dict(((thread['thread_id'], thread) for thread in tqdm(threads,total=28403)))
    
    # Save preprocessed threads
    json.dump(thread_index, open(index_path,'w'))
    

100%|██████████| 28403/28403 [24:29<00:00, 19.32it/s]


In [51]:
thread_index

{'100021749708': {'thread_id': '100021749708',
  'question': {'comments': ['There are official Microsoft controllers with permanently attached cables.  The hologram sounds suspicious though.',
    'I own a counterfeit mouse-pad with the same Microsoft logo you described.',
    "Buy from Amazon. You'll usually get a good price and you'll definitely get peace of mind. I'll pay $5 more for that any day.",
    'Good picture examples can be found on this dupe: http://gaming.stackexchange.com/questions/89194/is-this-xbox-360-controller-fake-how-can-i-tell'],
   'subject': 'How can I identify a counterfeit Xbox 360 controller',
   'body': 'I recently bought a Wired Xbox 360 controller from Ebay, and I\'m trying to figure out whether it\'s counterfeit. It\'s two-tone black and gray, which according to Wikipedia means it\'s probably the Elite edition of the controller.\n Here\'s what makes me suspicous: There\'s no breakaway cable near the USB port (Microsoft\'s term is Inline Cable Release), t

## 5 - Data Preprocessing

Preprocessing is a crucial task in data mining. This task cleans and transforms the raw data in a format that can better suit data analysis and machine learning techniques. In natural language processing (NLP), *tokenization* and *stemming* are two well known preprocessing steps. Besides these two steps, we will implement an additional step that is designed exclusively for the twitter domain.

### 5.1 - Tokenization

In this preprocessing step, a *tokenizer* is responsible for breaking a text in a sequence of tokens (words, symbols, and punctuations). 

For instance, the sentence *It's the student's notebook.* can be split into the following list of tokens: ['It', "'s", 'the', 'student', "'s", 'notebook', '.'].


#### 5.1.1 - Question 3 (0.5 point) 

Implement the following functions: 
- **tokenize_space** tokenizes the tokens that are separated by whitespace (space, tab, newline). This is a naive tokenizer.
- **tokenize_nltk** uses the default method of the nltk package (https://www.nltk.org/api/nltk.html) to tokenize the text.

**All tokenizers have to lowercase the tokens.**

In [9]:
def tokenize_space(object):
    """
    Tokenize the tokens that are separated by whitespace (space, tab, newline). 
    We consider that any tokenization was applied in the text when we use this tokenizer.
    
    For example: "hello\tworld of\nNLP" is split in ['hello', 'world', 'of', 'NLP']
    """
    # return a list of tokens
    
    assert isinstance(object, str)
    return object.split()
        
def tokenize_nltk(object):
    """
    This tokenizer uses the default function of nltk package (https://www.nltk.org/api/nltk.html) to tokenize the text.
    """
    # return a list of tokens
    
    assert isinstance(object, str)
    
#     sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
#     return sent_detector.tokenize(object.strip())
    return nltk.word_tokenize(object)

### 5.2 - Filtering Insignificant Tokens

#### 5.2.1 -  Question 4 (1 point)

There are a set of tokens that are not signficant to the similarity comparison since they appear in many different threads pages. Thus, removing them decreases the vector dimensionality and turns the similarity calculation computationally cheaper. Describe the tokens that are insignificant for the thread similarity comparison? Moreover, implement the function filter_word that removes these words from a list of tokens.

In [66]:
from nltk.corpus import stopwords

def filter_word(tokens):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    stop_words = set(stopwords.words('english'))
    words = [w for w in tokens if w not in stop_words and w not in symbols]
    return words


## 5.3 - Stemming

The process to convert words with the same stem (word reduction that keeps word prefixes) to a standard form is called *stemming*. For instance, "fishing", "fished" and "fishes" are transformed to the stem "fish.


In [11]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

word1 = ["Visitors", "from", "all", "over", "the", "world", "fishes", "during", "the", "summer","."]
print([stemmer.stem(w) for w in word1])

word2 = ['I', 'was', 'fishing']
print([stemmer.stem(w) for w in word2])

['visitor', 'from', 'all', 'over', 'the', 'world', 'fish', 'dure', 'the', 'summer', '.']
['i', 'was', 'fish']


### 5.3.1 - Question 5 (1 point) 

Explain how stemming can benefit our search engine?

--> Answer: It allows us to get more search results from more forms of words with the same stem.

# 6 - Data representation

## 6.1 - Bag of Words

Many algorithms only accept inputs that have the same size. However, there are some data types whose sizes are not fixed, for instance, a text can have an unlimited number of words. Imagine that we retrieve two sentences: ”Board games are much better than video games” and ”Monoply is an awesome game!”. These sentences are respectively named as Sentence 1 and 2. Table below depicts how we could represent both sentences using a fixed representation.

|            | an | are | ! | monopoly | awesome | better | games | than | video | much | board | is | game |
|------------|----|-----|---|----------|---------|--------|-------|------|-------|------|-------|----|------|
| Sentence 1 | 0  | 1   | 0 | 0        | 0       | 1      | 2     | 1    | 1     | 1    | 1     | 0  | 0    |
| Sentence 2 | 1  | 0   | 0 | 1        | 1       | 0      | 0     | 0    | 0     | 0    | 0     | 1  | 1    |

Each column of this table 2.1 represents one of 13 vocabulary words, whereas the rows contains the word
frequencies in each sentence. For instance, the cell in row 1 and column 7 has the value 2
because the word games occurs twice in Sentence 1. Since the rows have always 13 values, we
could use those vectors to represent the Sentences 1 and 2. The table above illustrates a technique called bag-of-words. Bag-of-words represents a document as a vector whose dimensions are equal to the number of times that vocabulary words appeared in the document. Thus, each token will be related to a dimension, i.e., an integer.

### 6.1.2 - Question 6 (2.5 points)

Implement the bag-of-words model that weights the vector with the absolute word frequency.

**For this exercise, you cannot use any external python library (e.g., scikit-learn). However, if you have a problem with memory size, you can use the class scipy.sparse.csr_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html)
**

In [79]:
def transform_count_bow(X):
    """
    This method preprocesses the data using the pipeline object, relates each token to a specific integer and  
    transforms the text in a vector. Vectors are weighted using the token frequencies in the sentence.

    X: document tokens. e.g: [['I','will', 'be', 'back', '.'], ['Hello', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]

    :return: vector representation of each document
    """   
    
#     vocab = []
#     for l in X:
#         for w in l:
#             vocab.append(w)
#     vocab = list(set(vocab))
    
#     bow = np.zeros((len(X), len(vocab)))
#     for i, sentence in enumerate(X):
#         for j, v in enumerate(vocab):
#             bow[i,j] = sentence.count(v)
     
#     return sp.sparse.csr_matrix(bow)

    indptr = [0]
    indices = []
    data = []
    vocab = {}  # a dictionary
    for sentence in X:
        for w in sentence:
            index = vocab.setdefault(w, len(vocab))
            indices.append(index)
            data.append(1)
        indptr.append(len(indices))
    result = sp.sparse.csr_matrix((data, indices, indptr), dtype=np.uint8)
    return result

In [80]:
# test
text = [['I','will', 'be', 'back', 'if', 'you', 'call', 'back', '.'], ['Hello', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]
transform_count_bow(text)

<3x18 sparse matrix of type '<class 'numpy.uint8'>'
	with 20 stored elements in Compressed Sparse Row format>

## 6.2 - TF-IDF

Using raw frequency in the bag-of-words can be problematic. The word frequency distribution
is skewed - only a few words have high frequencies in a document. Consequently, the
weight of these words will be much bigger than the other ones which can give them more
impact on some tasks, like similarity comparison. Besides that, a set of words (including
those with high frequency) appears in most of the documents and, therefore, they do not
help to discriminate documents. For instance, the word *of* appears in a significant
part of documents. Thus, having the word *of* does not make
documents more or less similar. However, the word *terrible* is rarer and documents that
have this word are more likely to be negative. TF-IDF is a technique that overcomes the word frequency disadvantages.

TF-IDF weights the vector using inverse document frequency (IDF) and word frequency, called term frequency (TF).
TF is the local information about how important is a word to a specific document.  IDF measures the discrimination level of the words in a dataset.  Common words in a domain are not helpful to discriminate documents since most of them contain these terms. So, to reduce their relevance in the documents, these words should have low weights in the vectors . 
The following equation calculates the word IDF:
\begin{equation}
	idf_i = \log\left( \frac{N}{df_i} \right),
\end{equation}
where $N$ is the number of documents in the dataset, $df_i$ is the number of documents that contain a word $i$.
The new weight $w_{ij}$ of a word $i$ in a document $j$ using TF-IDF is computed as:
\begin{equation}
	w_{ij} = tf_{ij} \times idf_i,
\end{equation}
where $tf_{ij}$ is the term frequency of word $i$ in the document $j$.


### 6.2.1 - Question 7 (3.5 points)

Implement a bag-of-words model that weights the vector using TF-IDF.

**For this exercise, you cannot use any external python library (e.g., scikit-learn). However, if you have a problem with memory size, you can use the class scipy.sparse.csr_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html)**

In [68]:
def transform_tf_idf_bow(X):
    """
    This method preprocesses the data using the pipeline object, calculates the IDF and TF and 
    transforms the text in vectors. Vectors are weighted using TF-IDF method.

    X: document tokens. e.g: [['I','will', 'be', 'back', '.'], ['Helllo', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]

    :return: vector representation of each document
    """    
    
    vocab = []
    N = len(X)
    for i, l in enumerate(X):
        for j, w in enumerate(l):
            vocab.append(w)
    vocab = list(set(vocab))
    
    df = np.zeros((len(vocab)))
    for sentence in X:
        for j, word in enumerate(vocab):
            if word in sentence:
                df[j] += 1
                
    tfidf = np.zeros((len(X), len(vocab)))
    for i, sentence in enumerate(X):
        for j, v in enumerate(vocab):
            tfidf[i,j] = sentence.count(v) * np.log(N/df[j])
    
    return sp.sparse.csr_matrix(tfidf)

In [18]:
# test
text = [['I','will', 'be', 'back', 'if', 'you', 'call', 'back', '.'], ['Hello', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]
idf = transform_tf_idf_bow(text)
print(idf)

[[0.         1.09861229 0.         2.19722458 1.09861229 1.09861229
  0.         0.         0.         1.09861229 0.40546511 0.
  0.         0.         1.09861229 0.         0.40546511]
 [1.09861229 0.         0.         0.         0.         0.
  0.         0.         1.09861229 0.         0.         0.
  0.         0.         0.         1.09861229 0.        ]
 [0.         0.         1.09861229 0.         0.         0.
  1.09861229 1.09861229 0.         0.         0.40546511 1.09861229
  1.09861229 1.09861229 0.         0.         0.40546511]]


In [19]:
# test
X = [['I','will', 'be', 'back', '.'], ['you', 'I', 'you','?'] , ['Hello', 'world', '!'], ['If', 'you', 'insist', 'on', 'using', 'a', 'damp', 'cloth']]
idf = transform_tf_idf_bow(X)
print(idf)

[[0.         1.38629436 0.         0.         1.38629436 1.38629436
  0.         0.         0.         0.69314718 0.         0.
  0.         0.         1.38629436 0.         0.        ]
 [0.         0.         1.38629436 0.         0.         0.
  0.         0.         0.         0.69314718 0.         0.
  0.         0.         0.         0.         1.38629436]
 [1.38629436 0.         0.         0.         0.         0.
  0.         0.         1.38629436 0.         0.         0.
  0.         0.         0.         1.38629436 0.        ]
 [0.         0.         0.         1.38629436 0.         0.
  1.38629436 1.38629436 0.         0.         1.38629436 1.38629436
  1.38629436 1.38629436 0.         0.         0.69314718]]


# 7 - Our Recommendation System

## 7.1 - Question 8 (1.5 points)

The pipeline is a sequence of preprocessing steps that transform the raw data to a format that is suitable for your problem. For our problem, you have to implement a pipeline composed of the following steps:

1. Concatenate answer, question and comment texts of thread $t$ in the dictionary thread_dict.
2. Tokenize the thread texts.
3. Filter the insignificant tokens.
4. Stem the tokens
5. Generate the vector representation using TFIDFBoW or CountBoW
6. Returns thread ids and thread vector representations.


In [69]:
from nltk.stem.snowball import SnowballStemmer

def nlp_pipeline(thread_dict, tokenization_type, vectorizer_type, enable_filter_tokens, enable_stemming):
    """
    Preprocess and vectorize the threads.
    
    thread_dict: dictionary whose keys and values are thread ids and thread objects, respectively.
    tokenization_type: two possible values "space_tokenization" and "nltk_tokenization".
                            - space_tokenization: tokenize_space function is used to tokenize.
                            - nltk_tokenization: tokenize_nltk function is used to tokenize.
                            
    vectorizer_type: two possible values "count" and "tf_idf".
                            - count: use transform_count_bow to vectorize the text
                            - tf_idf: use transform_tf_idf_bow to vectorize the text
                            
    enable_filter_tokens: enable the insignificant token removal;
    
    enable_stemming: enable stemming
    
    return: a list L with thread ids and matrix B that contains the vector of each thread. B[idx] is the fixed-length representation of L[idx].
    """
    B = []
    stemmer = SnowballStemmer("english")

    for thread_id, thread_obj in zip(thread_dict.keys(), thread_dict.values()):
        thread_cont = []
        thread_cont = thread_obj['question']['subject'] + ' ' + thread_obj['question']['body']
        for Q_com in thread_obj['question']['comments']:
            thread_cont += ' ' + Q_com
        for answer in thread_obj['answers']:
            thread_cont += ' ' + answer['body']
            for A_com in answer['comments']:
                thread_cont += ' ' + A_com
                
        assert tokenization_type == 'space_tokenization' or tokenization_type == 'nltk_tokenization', 'invalid tokenization_type'
        if 'nltk' in tokenization_type:
            tokenized_text = tokenize_nltk(thread_cont)  
        else: 
            tokenized_text = tokenize_space(thread_cont)
        
        if enable_filter_tokens:
            filtered = filter_word(tokenized_text)  
        else:
            filtered = tokenized_text

        if enable_stemming:
            stemmed = [stemmer.stem(w) for w in filtered]
        else:
            stemmed = filtered

        B.append(stemmed)

    assert vectorizer_type == 'count' or vectorizer_type == 'tf_idf', 'invalid vectorizer_type'
    if vectorizer_type == 'count':
        vectorized = transform_count_bow(B)  
    else: 
        vectorized = transform_tf_idf_bow(B)
        
    return list(thread_dict.keys()), vectorized

In [81]:
start = time()
#th = ['100853498510', '998262128151', '914877405342', '333155030164', '187076740795']
# {key: thread_index[key] for key in th}
l, b = nlp_pipeline(thread_index, 'nltk_tokenization', 'count', True, True)
print(l)
print(b)
print("Total exec time = "+str(time() - start))

['100021749708', '100024321271', '100073330161', '100083250150', '100087712014', '100097586419', '100108268409', '100112303200', '100115821161', '100125886564', '100159472041', '100196169850', '100203181946', '100209235655', '100216394556', '100238406177', '100241894701', '100257400326', '100266923505', '100277786036', '100310360984', '100317075844', '100322923603', '100341593711', '100357429520', '100366855898', '100381285889', '100390067576', '100400234583', '100409027501', '100410376731', '100449477856', '100454955916', '100455876961', '100463201754', '100469551720', '100472834249', '100491871353', '100502307521', '100516231637', '100525713342', '100542751908', '100548559010', '100558174368', '100559908626', '100575387732', '100580269162', '100598649407', '100613252997', '100613706886', '100633132179', '100646864335', '100649783550', '100664395365', '100669489690', '100677418245', '100683229929', '100688940981', '100689438278', '100695346060', '100715524195', '100719090409', '100724

In [None]:
min(b.sum(axis=0))

In [83]:
b.shape

(28403, 132497)

In [23]:
thread_index

{'threads\\100021749708': {'thread_id': 'threads\\100021749708',
  'question': {'comments': ['There are official Microsoft controllers with permanently attached cables.  The hologram sounds suspicious though.',
    'I own a counterfeit mouse-pad with the same Microsoft logo you described.',
    "Buy from Amazon. You'll usually get a good price and you'll definitely get peace of mind. I'll pay $5 more for that any day.",
    'Good picture examples can be found on this dupe: http://gaming.stackexchange.com/questions/89194/is-this-xbox-360-controller-fake-how-can-i-tell'],
   'subject': 'How can I identify a counterfeit Xbox 360 controller',
   'body': 'I recently bought a Wired Xbox 360 controller from Ebay, and I\'m trying to figure out whether it\'s counterfeit. It\'s two-tone black and gray, which according to Wikipedia means it\'s probably the Elite edition of the controller.\n Here\'s what makes me suspicous: There\'s no breakaway cable near the USB port (Microsoft\'s term is Inline

## 7.2 - Question 9 (1.5 points)

*Implement the function rank that returns a list of thread ids sorted by thread and query similarity*. We will use the [cosine similarity function](https://en.wikipedia.org/wiki/Cosine_similarity) to compare two threads. In this assignment, query is a thread without answers and comments.

**Remove the query in the sorted list (rank output)**


In [86]:
from sklearn.metrics.pairwise import cosine_similarity

def rank(query_id, all_thread_ids, X):
    """
    Return a list of thread ids sorted by thread and query similarity. Cosine similarity is used to compare threads. 
    
    query_id: thread id 
    all_thread_ids: list of thread ids
    X: thread data representations
    
    return: ranked list of thread ids. 
    """
    
    # Compute the similarity of thread representations (vectors) using cosine similarity function
    # Sort the thread ids by the similarity
    
    query = all_thread_ids.index(query_id)
    vector = np.array(X[query])
    print(vector)
    
    cos = cosine_similarity(vector, X) #Should return size (1, len(X))
    ranked_list = []
    ranked_list = ranked_list.extend(cos[:query])
    ranked_list = ranked_list.extend(cos[query+1:])
    
    return sorted(ranked_list)
    

## 7.3 - Evaluation

We will test different configurations of our recommender system. These configurations are compared using the [mean average precision (MAP) metric](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision). Basically, the closer relevant threads are from ranked list begining, the higher MAP is. Additional materials to undertand MAP: [recall and precision over ranks](https://youtu.be/H7oAofuZjjE) and [MAP](https://youtu.be/pM6DJ0ZZee0).


The function *eval* evaluates a specific configurantion of our recommender system



In [84]:
from statistics import mean

def calculate_map(x):
    res = 0.0
    n = 0.0
    for relevant_threads, ranked_list in x:
        precisions = []
        for k, thread_id in enumerate(ranked_list):
            if thread_id in relevant_threads:
                prec_at_k = (len(precisions) + 1)/(k+1)
                precisions.append(prec_at_k)
            if len(precisions) == len(relevant_threads):
                break
        res += mean(precisions)
        n += 1
    return res/n

def eval(tokenization_type, vectorizer, enable_filter_tokens, enable_stemming):
    all_thread_ids, X = nlp_pipeline(thread_index, tokenization_type, vectorizer, enable_filter_tokens, enable_stemming)
    all_thread_ids = [int(t_id) for t_id in all_thread_ids]    
    queries,relevant_threads = zip(*relevant_threads_by_query.items())
    ranked_list = (rank(query_id, all_thread_ids, X) for query_id in queries)        
    return calculate_map(zip(relevant_threads,ranked_list))

## 7.4 - Question 10 (5 points)

Evaluate our recommedation system performamnce(MAP) using each one of the following configurations:
1. count(BoW) + space_tokenization (sans tokenizer)
2. count(BoW) + nltk_tokenization
3. count(BoW) + nltk_tokenization + Filtrer les tokens sans importance
4. count(BoW) + nltk_tokenization + Filtrer les tokens sans importance + Stemming
5. tf_idf + nltk_tokenization
6. tf_idf + nltk_tokenization + Filtrer les tokens sans importance
7. tf_idf + nltk_tokenization + Filtrer les tokens sans importance + Stemming 

Describe the results found by you and answer the following questions:
- Was our recommendation system negatively or positively impacted by data preprocessing steps?
- TF-IDF has achieved a better performance than CountBoW? If yes, why do you think that this has occurred? 

In [87]:
start = time()
eval('space_tokenization', 'count', False, False)
print("Total exec time = "+str(time() - start))

  (0, 0)	1
  (0, 985)	1
  (0, 31)	1
  (0, 1279)	1
  (0, 38)	1
  (0, 336)	1
  (0, 19496)	1
  (0, 300)	1
  (0, 1727)	1
  (0, 1321)	1
  (0, 62)	1
  (0, 34966)	1
  (0, 2)	1
  (0, 68)	1
  (0, 613)	1
  (0, 363)	1
  (0, 82)	1
  (0, 1704)	1
  (0, 75)	1
  (0, 4314)	1
  (0, 27661)	1
  (0, 66)	1
  (0, 99)	1
  (0, 209277)	1
  (0, 50)	1
  :	:
  (0, 31)	1
  (0, 8590)	1
  (0, 2298)	1
  (0, 2)	1
  (0, 1252)	1
  (0, 75)	1
  (0, 31)	1
  (0, 64972)	1
  (0, 32096)	1
  (0, 1475)	1
  (0, 14)	1
  (0, 53)	1
  (0, 34)	1
  (0, 31)	1
  (0, 19878)	1
  (0, 50)	1
  (0, 430)	1
  (0, 163)	1
  (0, 4)	1
  (0, 21063)	1
  (0, 10629)	1
  (0, 103)	1
  (0, 50)	1
  (0, 289)	1
  (0, 173404)	1


ValueError: setting an array element with a sequence.