# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *W*

**Names:**

* *Cloux Olivier*
* *Reiss Saskia*
* *Urien Thibault*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
import pickle as pk
import numpy as np
import string #string operations
import re #useful for regular expressions
import nltk # to have the lemmatizer and stemmer
#import time #measure time between some operations, to have a metric

from scipy.sparse import csr_matrix, find
from utils import load_json, load_pkl

from nltk.stem.porter import *
from nltk.stem import WordNetLemmatizer

from lab04_helper import *

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
courses = load_json('data/courses.txt') 
stopwords = load_pkl('data/stopwords.pkl')

In [None]:
def pickleDump(filename, value):
    """Save an object (e.g. table) to a pickle file to be readable by other notebooks"""
    with open(filename, "wb") as f:
        pk.dump(value, f)
        
def listPrettyPrint(l, n):
    """Prints a list l on n columns to improve readability"""
    for start in range(0, len(l), n):
        row = l[start:start + n] #the last row may hold fewer than n items
        #pad every column to 30 characters, without trailing spaces
        print(''.join('{:<30}'.format(item) for item in row).rstrip())

## Exercise 4.1: Pre-processing

For the complete list of punctuation signs, see the *lab04_helper.py* file.
To avoid inconsistencies, we keep the preprocessing function in that file, so that notebook 2 (LSI) can apply the same preprocessing when performing term searches.

We implemented the following functions:
   * **Stopwords**: Terms such as "*the*" and "*it*" do not belong in our bag of words. They appear often, but in (almost) every description, so they only add noise.
   * **Taking numbers out**: This was a design decision. Many numbers are useless (e.g. those describing times), but some are meaningful (e.g. "*3SAT*"). We therefore remove numbers only when they stand alone or are separated by the character *h*, which already removes most of the noise.
   * **Split appended**: Not strictly a word filter, but still a preprocessing step. Due to the scraping method used, the word ending one line and the word beginning the next (originally separated by a *\n*) were fused together. Our function separates them based on the presence of an uppercase character in the middle of a word. Lines starting with a lowercase character (which produce fused words containing only lowercase letters) could not be split.
   * **Punctuation**: "*word*", "*word!*" and "*word.*" should be treated identically; furthermore, a punctuation sign standing on its own (not attached to a word) would otherwise become a word of its own. We therefore removed all punctuation signs (see the complete list in the *helper* file).
   * **Lower cased**: Our system should not be case-sensitive, so once capital letters are no longer needed (i.e. after splitting), we convert all letters to lowercase.
   
The order in which these functions are called matters (for instance, splitting must happen before lowercasing) to produce our bag of words.
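
The steps listed above can be sketched as follows. This is only a minimal illustration and not the `cleaner` function from the helper module: the name `clean_sketch`, the regular expressions, the tiny stopword set and the exact order of the steps are all assumptions made for this example.

```python
import re
import string

# Hypothetical, tiny stopword set; the notebook loads the real one from data/stopwords.pkl
SKETCH_STOPWORDS = {"the", "it", "a", "of", "and", "is", "at"}

def clean_sketch(text, stopwords=SKETCH_STOPWORDS):
    """Illustrative preprocessing pipeline (assumed order, not the helper's code)."""
    # 1. Split fused words on a lowercase-to-uppercase boundary ("graphsThe" -> "graphs The")
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    # 2. Replace every punctuation sign by a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # 3. Lowercase, now that capitals are no longer needed for splitting
    words = text.lower().split()
    # 4. Drop lone numbers and numbers separated by 'h' ("2h", "14h30"), but keep "3sat"
    words = [w for w in words if not re.fullmatch(r"\d+(h\d*)?", w)]
    # 5. Drop stopwords
    return [w for w in words if w not in stopwords]

print(clean_sketch("We study graphsThe course is 2h at 14h30, see 3SAT!"))
```

In this sketch the split is done first because it relies on capital letters, which the lowercasing step would destroy.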

In [None]:
#Creation of a dictionary descDict that contains:
# - course IDs as keys
# - a 3-tuple (uniqueIndex, title, list of separated words) as value
#indexCourse maps each unique index back to its course ID (bijective mapping)
descDict = dict() 
indexCourse = dict()
index = 0
for c in courses:
    if c['courseId'] not in descDict: #skip courses listed several times in the data
        descDict[c['courseId']] = (index, c['name'], cleaner(c['description']))
        indexCourse[index] = c['courseId']
        index += 1
        
print("We are working with a total of %d courses" % len(descDict))
pickleDump(r"cidWithBag.txt", descDict)
pickleDump(r"indexToCourse.txt", indexCourse)

In [None]:
ixWords = sorted(descDict['COM-308'][2])
print("Words for the Internet Analytics course (in alphabetical order):")
listPrettyPrint(ixWords, 4)

## Exercise 4.2: Term-document matrix

In [None]:
#Creation of 2 dictionaries.
#wordIndex contains all distinct words as keys and their unique index as value
#indexWord is the inverse mapping (index to word).
wordIndex = dict() 
index = 0
for i in descDict:
    for word in descDict[i][2]:
        if word not in wordIndex:
            wordIndex[word] = index
            index += 1

indexWord = dict((v, k) for k, v in wordIndex.items())
assert(len(indexWord) == len(wordIndex))
print("After preprocessing, we are dealing with a total of %d \"unique\" words" % len(indexWord))
pickleDump("indexToWord", indexWord)
pickleDump("wordToIndex", wordIndex)

In [None]:
#Creation of the sparse occurrence matrix. We define its values in occValues and their indices in occRow and occCol.
#If two pairs of indices are identical, their values are added together.
occValues = []
occRow = [] #indices of words
occCol = [] #indices of courses

for cid in descDict: #iterate through all course IDs (and their bag of words)
    cIndex = descDict[cid][0] #get column for this course
    for word in descDict[cid][2]: #then append to the correct list:
        occCol.append(cIndex) #the col index
        occRow.append(wordIndex[word]) #row index
        occValues.append(1) #value (1, as each word represents 1 occurrence)

occurenceMatrix = csr_matrix((occValues, (occRow, occCol)), shape=((len(wordIndex), len(descDict))), dtype=np.float64)
save_sparse_csr("occ_matrix", occurenceMatrix)
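
The construction above relies on a property of the `csr_matrix((data, (row, col)))` constructor: entries that share the same `(row, col)` pair are summed. The following sketch shows this on made-up data; the toy bags of words and the names `toy_bags`, `toy_word_index` and `toy_occ` are hypothetical and not taken from the course dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data (hypothetical): two "courses", each with its bag of words
toy_bags = [["graph", "data", "graph"], ["data", "matrix"]]
toy_word_index = {"graph": 0, "data": 1, "matrix": 2}

values, rows, cols = [], [], []
for course_idx, bag in enumerate(toy_bags):
    for word in bag:
        rows.append(toy_word_index[word])  # row = word index
        cols.append(course_idx)            # column = course index
        values.append(1)                   # one entry per appearance of the word

# "graph" appears twice in course 0, so (0, 0) is listed twice and its values are summed
toy_occ = csr_matrix((values, (rows, cols)), shape=(3, 2), dtype=np.float64)
print(toy_occ.toarray())
```

Appending one entry per word appearance therefore yields per-course word counts without any explicit counting step.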

In [None]:
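#Creation of TF, IDF and TFIDF matrices
#max occurrence count for each course (one entry per column, even for an all-zero column)
mostFreqWord = occurenceMatrix.max(axis=0).toarray().ravel()

#by definition: TF is the term count divided by the count of the most frequent word in the same document
TF = csr_matrix(occurenceMatrix/mostFreqWord)
#IDF is -log2 of the fraction of documents containing the word (column vector, one entry per word)
IDF = csr_matrix(-np.log2((occurenceMatrix != 0).sum(1)/len(descDict)))

TFIDF = TF.multiply(IDF) #each row of TF is scaled by the IDF of its word
#note: np.save pickles the sparse matrix object; reload with np.load("TFIDF.npy", allow_pickle=True).item()
np.save("TFIDF", TFIDF)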

In [None]:
ixIndex = descDict['COM-308'][0]
tmp = TFIDF.getcol(ixIndex) #TF-IDF scores of every word for the Internet Analytics course
#indices of the 15 highest scores, from highest to lowest
ixBigIndices = np.argsort(tmp.toarray().ravel())[-15:][::-1]

#words corresponding to the highest scores
ixBigScores = [indexWord[int(i)] for i in ixBigIndices]
print("Words with greatest scores are :")
listPrettyPrint(ixBigScores, 3)

**Explanation**

The difference between high and low scores is the essence of TF-IDF: a high score indicates that the term is frequent in the document but appears in few documents, whereas a low score indicates that the word is rare in the document (it appears only once or twice), that most documents contain it, or both.
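
To make this concrete, here is a small sketch on a made-up occurrence matrix with 3 words and 3 documents (not the course data). It uses the same definitions as in the cell above: TF is the count divided by the count of the most frequent word in the document, and IDF is $-\log_2$ of the fraction of documents containing the word. All names prefixed with `toy_` are hypothetical.

```python
import numpy as np

# Hypothetical occurrence counts: rows = words, columns = documents
toy_occ = np.array([
    [4.0, 0.0, 0.0],   # word 0: frequent in document 0, absent elsewhere
    [1.0, 2.0, 1.0],   # word 1: present in every document
    [2.0, 0.0, 3.0],   # word 2: present in two documents
])

n_docs = toy_occ.shape[1]
toy_tf = toy_occ / toy_occ.max(axis=0)        # normalise by the most frequent word per document
doc_freq = (toy_occ != 0).sum(axis=1)         # number of documents containing each word
toy_idf = np.log2(n_docs / doc_freq)          # same as -log2(doc_freq / n_docs)
toy_tfidf = toy_tf * toy_idf[:, np.newaxis]   # scale each word row by its IDF

print(np.round(toy_tfidf[:, 0], 3))           # scores of the three words in document 0
```

In document 0, word 0 gets the highest score because it is frequent there and absent from the other documents, while word 1 gets a score of 0 because it appears in every document.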

## Exercise 4.3: Document similarity search