# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *W*

**Names:**

* *Cloux Olivier*
* *Reiss Saskia*
* *Urien Thibault*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
import pickle as pk
import numpy as np
import string #string operations
import re #useful for regular expressions
import nltk # to have the lemmatizer and stemmer
import time #measure time between some operations, to have a metric

from numpy.linalg import norm
from scipy.sparse import csr_matrix, find
from utils import load_json, load_pkl, save_pkl

from nltk.stem.porter import *
from nltk.stem import WordNetLemmatizer

#from __future__ import print_function

from lab04_helper import *

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
allCourses = load_json('data/courses.txt') 
stopwords = load_pkl('data/stopwords.pkl')

In [None]:
# def pickleDump(filename, value):
#     """Save an object (e.g. table) to a pickle file to be readable by other notebooks"""
#     with open(filename, "wb") as f:
#         pk.dump(value, f)
        
def listPrettyPrint(l, n):
    """Prints a list l on n columns to improve readability"""
    for start in range(0, len(l), n): #one row of at most n items per iteration
        row = l[start:start + n]
        #pad every item but the last to 30 characters so that the columns line up
        print(''.join('{:<30}'.format(item) for item in row[:-1]) + '{:<}'.format(row[-1]))

## Exercise 4.1: Pre-processing

For the complete list of punctuation characters, please see the *lab04_helper.py* file.
To avoid errors, we keep a single preprocessing function in that file, so that notebook 2 (LSI) can reuse it to perform term searches. 

We implemented the following functions :
   * **Stopwords** : Terms such as "*the*" or "*it*" do not belong in our bag of words. They appear very often and in (almost) every description, so they only add noise.
   * **Taking numbers out** : This was a decision we made. Many numbers are useless (such as those describing time), but some are meaningful (e.g. "*3SAT*"). Therefore, we only remove numbers that stand alone or are separated by the character *h*. This already removes most of the noise.
   * **Split appended** : Not quite a word-filtering function, but still part of the preprocessing. Due to the scraping method used, the word ending a line and the word beginning the next one (originally separated by a *\n*) were stuck together. Our function separates them based on the presence of an uppercase character in the middle of a word. Sentences starting with a lowercase character (thus creating fused words with only lowercase letters) could not be split.
   * **Punctuation** : "*word*", "*word!*" and "*word.*" should be treated equally ; furthermore, a punctuation sign standing on its own (not appended to a word) would become a word of its own, which should not be the case. Thus, we removed all punctuation signs (see the complete list in the *helper* file).
   * **Lower cased** : Our system should not be case-sensitive, so once capital letters are no longer needed (they are used for splitting), we convert all letters to lowercase. 
   
These functions are called in a careful order to produce our bag of words.
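
To make the steps above concrete, here is a minimal sketch of such a pipeline. It is not the `cleaner` function from the helper file: the name `clean_sketch`, the tiny stopword set, the regular expressions and the order of the steps are all assumptions made for illustration, and only the ASCII punctuation from `string.punctuation` is handled.

```python
import re
import string

# Hypothetical, tiny stopword set for illustration only (the lab loads its own list from a pickle).
EXAMPLE_STOPWORDS = {"the", "it", "of", "and", "a", "is", "in", "to"}

def clean_sketch(text, stopwords=EXAMPLE_STOPWORDS):
    """Illustrative cleaning: split fused words, strip punctuation, lowercase, drop numbers and stopwords."""
    # 1. Split words fused by a lost newline: a lowercase letter directly followed by an uppercase one.
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    # 2. Replace every ASCII punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # 3. Lowercase only now, because step 1 needed the capital letters.
    tokens = text.lower().split()
    # 4. Drop lone numbers and hour-like tokens such as "2h" or "2h30"; a token like "3sat" is kept.
    tokens = [t for t in tokens if not re.fullmatch(r"\d+(h\d*)?", t)]
    # 5. Drop stopwords.
    return [t for t in tokens if t not in stopwords]

print(clean_sketch("The exam lasts 2h30.It covers 3SAT and Markov chainsRandom walks!"))
```

The order matters in this sketch: lowercasing before the split would erase the only clue that two words were fused.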

In [None]:
#Creation of a dictionary ref_corpus that contains :
# - course IDs as keys
# - a 3-tuple (uniqueIndex, title, list[separated words]) as value
#indexCourse maps each unique index back to its course ID (inverse mapping)
ref_corpus = dict() 
indexCourse = dict()
index = 0
for c in allCourses:
    if c['courseId'] not in ref_corpus: #skip courses that appear several times in the data
        ref_corpus[c['courseId']] = (index, c['name'], cleaner(c['description']))
        indexCourse[index] = c['courseId']
        index += 1
        
print("We are working with a total of %d courses" % len(ref_corpus))
save_pkl(ref_corpus, r"cidWithBag.txt")
save_pkl(indexCourse, r"indexToCourse.txt")

In [None]:
ixWords = sorted(ref_corpus['COM-308'][2])
print("Words for Internet Analytics course are (in alphabetical order) :")
listPrettyPrint(ixWords, 4)

Note : some words look strange because they have been stemmed.

## Exercise 4.2: Term-document matrix

In [None]:
#Creation of 2 dictionaries.
#wordToIndex contains all distinct words as keys and their unique index as value
#indexToWord is the inverse mapping (index as key, word as value). 
wordToIndex = dict() 
index = 0
for name in ref_corpus:
    for word in ref_corpus[name][2]:
        if word not in wordToIndex:
            wordToIndex[word] = index
            index += 1

indexToWord = dict((v, k) for k, v in wordToIndex.items())
print("After preprocessing, we are dealing with a total of %d \"unique\" words" % len(indexToWord))
save_pkl(indexToWord,r"indexToWord")
save_pkl( wordToIndex,r"wordToIndex")

In [None]:
#Creation of the sparse occurrence matrix. We define its values in occValues and their indices in occRow and occCol. 
#If two pairs of indices are identical, their values are added when the matrix is built. 
def findOccurenceMatrix(corpus):
    """
    Takes a corpus of texts (cleaned) and outputs its occurrence matrix 
    (number of times each word appears in each description of the corpus)"""
    occValues = []
    occRow = [] #indices of words
    occCol = [] #indices of courses

    for cid in corpus: #iterate through all course IDs (and their bag of words)
        cIndex = corpus[cid][0] #get column for this course
        for word in corpus[cid][2]: #then append to correct list :
            if word not in wordToIndex: #word unseen in the reference corpus (possible for a query)
                continue #it has no row in the matrix, so skip it instead of raising a KeyError
            occCol.append(cIndex) #the col index
            occRow.append(wordToIndex[word]) #row index
            occValues.append(1) #value (1, as each word represents 1 occurrence)
    return csr_matrix((occValues, (occRow, occCol)), 
                      shape=((len(wordToIndex), len(corpus))), 
                      dtype=np.float64)

def getTFIDF(corpus, occMatrix):
    """For a corpus of text and its occurence matrix, output the corresponding TF-IDF score matrix
    """
    #for each course, keep only max value ≃ occurence of most frequent word
    mostFreqWord = occMatrix.max(axis=0).data 

    #by definition : TF is term freq divided by freq of most freq word in the same document
    TF = csr_matrix(occMatrix/mostFreqWord)
    IDF = csr_matrix(-np.log2((occMatrix != 0).sum(1)/len(corpus)))
    return TF.multiply(IDF)


In [None]:
ref_occurenceMatrix = findOccurenceMatrix(ref_corpus)
ref_TFIDF = getTFIDF(ref_corpus, ref_occurenceMatrix)

#save for use in different notebooks
np.save("TFIDF", ref_TFIDF)
save_sparse_csr("occ_matrix", ref_occurenceMatrix)

In [None]:
## some work on COM-308
ixIndex = ref_corpus['COM-308'][0]
#dense vector with one TF-IDF score per word index (the sparse .data array only holds the stored
#entries, so its positions are not word indices)
ix_col = ref_TFIDF.getcol(ixIndex).toarray().ravel()
ixBigIndices = np.argsort(ix_col)[-15:][::-1] #word indices of the 15 highest scores, best first

ixBigScores = [indexToWord[i] for i in ixBigIndices]
print("15 words with greatest scores are :")
listPrettyPrint(ixBigScores, 3)

**Explanation**

The difference between high and low scores is the essence of TF-IDF : a high score indicates that the term is very frequent in the document but appears in few documents, whereas a low score shows that the word is rare in the document (appears only once or twice) while most documents contain it.
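
As a small worked illustration of this contrast, the sketch below recomputes the scores on a toy corpus, using the same definitions as `getTFIDF`: TF is the term count divided by the count of the most frequent term in the document, and IDF is $\log_2(N/\mathrm{df})$, which equals $-\log_2(\mathrm{df}/N)$. The corpus and the name `tfidf_sketch` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus of three already-cleaned documents.
docs = {
    "d1": ["markov", "chain", "markov", "model"],
    "d2": ["model", "network"],
    "d3": ["model", "graph", "network"],
}

def tfidf_sketch(docs):
    """Return {doc: {word: score}} with TF = count / max count in the doc and IDF = log2(N / df)."""
    n_docs = len(docs)
    counts = {name: Counter(words) for name, words in docs.items()}
    # Document frequency: number of documents that contain each word.
    df = Counter(word for c in counts.values() for word in c)
    scores = {}
    for name, c in counts.items():
        max_count = max(c.values())
        scores[name] = {w: (k / max_count) * math.log2(n_docs / df[w]) for w, k in c.items()}
    return scores

scores = tfidf_sketch(docs)
# Print the words of d1 from highest to lowest score.
for word, score in sorted(scores["d1"].items(), key=lambda kv: -kv[1]):
    print(f"{word:<10}{score:.3f}")
```

In `d1`, "markov" is both the most frequent word and absent from the other documents, so it ranks first, while "model" appears in every document and therefore gets an IDF (and a score) of zero.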

## Exercise 4.3: Document similarity search

In [None]:
## We create a "new" document with only the words we want, as a reference vector
def sim(vec1, vec2):
    """Compute similarity (cosine of angle) between 2 vectors of same size"""
    assert(len(vec1) == len(vec2))
    prod_norm = norm(vec1)*norm(vec2)
    prod_elem = (vec1.T*vec2).item() #.item() replaces np.asscalar, which is deprecated and absent from recent NumPy versions
    return prod_elem/prod_norm

def find5closest(query):
    """Finds the 5 documents with smallest angle (cosine closest to 1)
    
    For that, a copy of the reference corpus is made (to avoid the query affecting future queries),
    then a new entry is added to this temp corpus (consider query as a new document) as before, with
    a 'cleaned' query. Then the occurence matrix and TF-IDF are computed with this new 'document'
    
    This version is not the most efficient as it recomputes the whole TFIDF matrix for every query. But as the 
    corpus/# of words is not gigantic, this is acceptable
    
    Keyword Arguments:
    query -- A string
    """
    new_corpus = ref_corpus.copy() #To not affect the ref corpus, make a temp copy
    query_index = len(ref_corpus) #query goes at last index 
    new_corpus['query'] = (query_index, 'my_query', cleaner(query)) #add an entry to the corpus
    query_occ_matrix = findOccurenceMatrix(new_corpus)
    query_TFIDF = getTFIDF(new_corpus, query_occ_matrix)
    
    query_vec = query_TFIDF.getcol(query_index) # ≃ TFIDF score of the query  vector
    
    #compare the query vector to every course vector of the reference corpus
    best_matches = []
    for i in range(len(ref_corpus)):
        simil = sim(query_vec.todense(), ref_TFIDF.getcol(i).todense())
        if(simil > 0):
            best_matches.append((i, simil))
    ranked = sorted(best_matches, key=lambda x : x[1]) #ascending score order
    return ranked[-5:] #slicing already copes with fewer than 5 matches
    
def print_closest(query, result):
    """Pretty printer to print queried results
    
    Keyword Arguments:
    query -- The query used to find the results (string)
    result -- List of tuples (index, score) sorted in ascending score order
    """
    print("The course(s) most relevant to '%s' is(are):" % query)
    for i in result:
        print("\t",indexCourse[i[0]],"(",ref_corpus[indexCourse[i[0]]][1],") with score",i[1])
    print('\n')

In [None]:
bef = time.time()
markovClosest = find5closest("markov chains")
mid = time.time()
facebookClosest = find5closest("facebook")
aft = time.time()

print("%.4f s  for first query, %.4f s for second, %.4f s for both" % (mid-bef, aft-mid, aft-bef))
print_closest("markov chains", markovClosest)

print_closest("facebook", facebookClosest)

#### Explanations
For markov chains, 5 courses are returned (20 courses actually have a non-zero score, and we keep only the best 5). This is an expected result ; the scores are quite high even across such diverse courses, indicating that these courses are tightly linked to our query.

On the other hand, for facebook, only 1 course is found. This is normal, as this course (EE-727) is the only one in the whole corpus to contain the word 'facebook'. Indeed, to obtain a similarity score > 0 with the query, a course must contain at least one word of the query (after 'cleaning'). Here, facebook being the only word, only courses containing it can appear.
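
The reason a course sharing no word with the query scores exactly 0 can be seen in a minimal sketch of the cosine similarity on sparse vectors. Vectors are stored here as `{word: weight}` dictionaries rather than matrix columns; the weights and the name `cosine_sketch` are hypothetical and not taken from our data.

```python
import math

def cosine_sketch(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight} dictionaries."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen in this sketch for empty vectors
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights, for illustration only.
query = {"facebook": 1.0}
course_a = {"facebook": 2.0, "network": 1.0}
course_b = {"markov": 1.5, "chain": 1.5}

print(cosine_sketch(query, course_a))  # shares "facebook" with the query, so the score is positive
print(cosine_sketch(query, course_b))  # shares no word with the query, so the dot product is zero
```

With a single-word query, the dot product reduces to the weight of that one word in the course, so every course without it is filtered out by the `simil > 0` test.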