# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *W*

**Names:**

* *Cloux Olivier*
* *Reiss Saskia*
* *Urien Thibault*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle as pk
import numpy as np
import string
import re
import nltk
import time

from scipy.sparse import csr_matrix, find
from utils import load_json, load_pkl

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from lab04_helper import *

In [2]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [3]:
courses = load_json('data/courses.txt') 
stopwords = load_pkl('data/stopwords.pkl')

In [4]:
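def pickleDump(filename, value):
    # Serialize `value` to `filename` with pickle.
    with open(filename, "wb") as f:
        pk.dump(value, f)

def listPrettyPrint(l):
    # Print the items of `l` in four left-aligned columns.
    # zip stops at the shortest slice, so when len(l) is not a multiple
    # of four the trailing items are not printed.
    for a, b, c, d in zip(l[::4], l[1::4], l[2::4], l[3::4]):
        print('{:<30}{:<30}{:<30}{:<}'.format(a, b, c, d))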

## Exercise 4.1: Pre-processing

See the `lab04_helper` module.
To avoid errors, we placed the pre-processing functions in a separate module so that they are also available in part 2 (LSI), where they are needed to perform term search.
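
The actual `cleaner` function lives in `lab04_helper` and is not reproduced in this notebook. Purely as an illustration, a pre-processing step of this kind might look like the sketch below. The function name `clean_text`, the tiny stopword set and the tokenisation regex are assumptions made for the example, not the helper's implementation (which takes a whole course entry rather than a string).

```python
import re

# Hypothetical stand-in for the stopword list loaded from data/stopwords.pkl.
EXAMPLE_STOPWORDS = {"the", "of", "and", "to", "a", "in"}

def clean_text(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase the text, split it into tokens and drop stopwords.

    Hyphenated terms such as "real-world" are kept as a single token.
    """
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in stopwords]

print(clean_text("Analysis of real-world data in the Cloud"))
```

Keeping such a function in a shared module means a search query can be tokenised in exactly the same way as the course descriptions.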

# TODO
explain why those functions

In [5]:
# Build two dictionaries:
# descDict maps each course ID to a 3-tuple (unique index, title, list of words)
# indexCourse maps each unique index back to its course ID
descDict = dict()
indexCourse = dict()
index = 0
for i in courses:
    if i['courseId'] not in descDict:  # keep only the first entry of each course
        descDict[i['courseId']] = (index, i['name'], cleaner(i))
        indexCourse[index] = i['courseId']
        index += 1

pickleDump(r"cidWithBag.txt", descDict)
pickleDump(r"indexToCourse.txt", indexCourse)

In [6]:
ixWords = sorted(descDict['COM-308'][2])
print("Words for Internet Analytics course are (in alphabetical order) :")
listPrettyPrint(ixWords)

Words for Internet Analytics course are (in alphabetical order) :
acquired                      activities                    ad                            ad
advertisement                 algebra                       algebra                       algorithms
algorithms                    analysis                      analytics                     analytics
analyze                       apache                        applications                  applications
assessment                    auctions                      auctions                      balance
based                         based                         basic                         basic
basic                         cathedra                      chains                        class
class                         class                         cloud                         clustering
clustering                    collection                    com-3                         combination
communication                 community     

In [7]:
# Build two dictionaries:
# wordIndex maps each distinct word to a unique index
# indexWord is the inverse mapping (index to word)
wordIndex = dict()
index = 0
for i in descDict:
    for word in descDict[i][2]:
        if word not in wordIndex:
            wordIndex[word] = index
            index += 1

indexWord = dict((v, k) for k, v in wordIndex.items())
assert(len(indexWord) == len(wordIndex))
pickleDump("indexToWord", indexWord)
pickleDump("wordToIndex", wordIndex)

In [8]:
# Build the sparse occurrence matrix. Values go in occValues and their indices in occRow and occCol.
# If two pairs of indices are identical, their values are added together.
occValues = []
occRow = []  # indices of words
occCol = []  # indices of courses

for cid in descDict:  # iterate through all courses (and their bag of words)
    cIndex = descDict[cid][0]  # column for this course
    for word in descDict[cid][2]:  # then append to the matching list:
        occCol.append(cIndex)  # column index
        occRow.append(wordIndex[word])  # row index
        occValues.append(1)  # value (1, as each word represents one occurrence)

occurenceMatrix = csr_matrix((occValues, (occRow, occCol)), shape=((len(wordIndex), len(descDict))), dtype=np.float64)
save_sparse_csr("occ_matrix", occurenceMatrix)
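
The construction above relies on `csr_matrix` summing entries that share the same (row, column) pair. A minimal sketch on toy data (the indices below are made up for the example) shows this behaviour:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data: word 0 appears twice in course 0, word 1 appears once in course 1.
rows = [0, 0, 1]      # word indices
cols = [0, 0, 1]      # course indices
values = [1, 1, 1]    # one entry per occurrence

toy = csr_matrix((values, (rows, cols)), shape=(2, 2), dtype=np.float64)
print(toy.toarray())
```

The two entries at position (0, 0) collapse into a single count of 2, which is why appending a 1 per word occurrence yields term counts.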

In [9]:
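# Build the TF, IDF and TF-IDF matrices
# Count of the most frequent word in each course (one value per column)
mostFreqWord = occurenceMatrix.max(axis=0).data

# By definition, TF is the term count divided by the count of the most frequent word in the same document
TF = csr_matrix(occurenceMatrix/mostFreqWord)
# IDF is -log2 of the fraction of courses that contain the word (one value per row)
IDF = csr_matrix(-np.log2((occurenceMatrix != 0).sum(1)/len(descDict)))

TFIDF = TF.multiply(IDF)
# np.save pickles the sparse matrix as an object array, so np.load needs allow_pickle=True
np.save("TFIDF", TFIDF)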
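
As a sanity check of these definitions, the sketch below recomputes TF, IDF and TF-IDF on a toy 3-word by 2-course matrix. It is not the notebook's code: it scales rows and columns with diagonal matrices (`scipy.sparse.diags`) so that every intermediate stays sparse, and it assumes every course contains at least one word. The toy counts are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Toy occurrence matrix: 3 words (rows) x 2 courses (columns).
occ = csr_matrix(np.array([[2.0, 0.0],
                           [1.0, 1.0],
                           [0.0, 4.0]]))

# TF: divide each column by the count of its most frequent word.
max_per_course = occ.max(axis=0).toarray().ravel()      # [2., 4.]
tf = occ @ diags(1.0 / max_per_course)

# IDF: -log2 of the fraction of courses that contain the word.
doc_freq = np.asarray((occ != 0).sum(axis=1)).ravel()   # [1, 2, 1]
idf = -np.log2(doc_freq / occ.shape[1])

# TF-IDF: scale each row by the IDF of its word.
tfidf = (diags(idf) @ tf).tocsr()
print(tfidf.toarray())
```

The middle word occurs in both courses, so its IDF is 0 and its TF-IDF score vanishes in every column, while the two course-specific words keep a score of 1.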

In [43]:
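ixIndex = descDict['COM-308'][0]  # column of the Internet Analytics course
tmp = TFIDF.getcol(ixIndex)
# Row indices of the 15 largest TF-IDF scores, in ascending order of score
ixBigIndices = (np.argsort(tmp.todense(), axis=0)[-15:])

# Map the indices back to words and put the highest score first
ixBigScores = []
for i in ixBigIndices:
    ixBigScores.append(indexWord[int(i)])
ixBigScores.reverse()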
print("Words with greatest scores are :")
listPrettyPrint(ixBigScores)

Words with greatest scores are :
services                      online                        real-world                    social
mining                        explore                       networking                    e-commerce
hadoop                        large-scale                   recommender                   ad
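
The same top-k selection can be done without converting the column to a dense vector. The sketch below is one possible variant, not the code used above: the helper name `top_k_terms` and the toy scores and words are hypothetical, and only non-zero scores are considered, so fewer than `k` words are returned when the column has fewer than `k` non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_k_terms(matrix, col, index_to_word, k):
    """Return the k highest-scoring words of one column, best first."""
    column = matrix[:, [col]].tocoo()            # non-zero entries of the column
    order = np.argsort(column.data)[::-1][:k]    # positions of the largest scores
    return [index_to_word[int(column.row[i])] for i in order]

# Toy scores for four words in a single course.
toy_scores = csr_matrix(np.array([[0.1], [0.7], [0.0], [0.4]]))
toy_words = {0: "basic", 1: "hadoop", 2: "class", 3: "mining"}
print(top_k_terms(toy_scores, 0, toy_words, 2))
```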


**Explanation**
The difference between high and low scores is the essence of TF-IDF: a high score indicates that the term is frequent in the document but appears in few documents overall, whereas a low score indicates that the term is rare in the document (it appears only once or twice), that most documents contain it, or both.
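
To make this concrete, here is a small worked example using the same definitions as the notebook (TF relative to the most frequent word of the document, IDF as $-\log_2$ of the document fraction). All numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

def tf_idf(count, max_count, doc_freq, n_docs):
    """TF-IDF of one term in one document."""
    tf = count / max_count                 # frequency relative to the most frequent word
    idf = -math.log2(doc_freq / n_docs)    # rarity of the term across the corpus
    return tf * idf

# Hypothetical corpus of 1024 documents; the most frequent word occurs 6 times.
# A term used 3 times that appears in only 2 documents:
high = tf_idf(count=3, max_count=6, doc_freq=2, n_docs=1024)
# A term used once that appears in half of the documents:
low = tf_idf(count=1, max_count=6, doc_freq=512, n_docs=1024)
print(high, low)
```

The first term scores 0.5 × 9 = 4.5, while the second scores (1/6) × 1, roughly 0.17: both the lower in-document frequency and the wider spread across documents pull the score down.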

## Exercise 4.2: Term-document matrix

## Exercise 4.3: Document similarity search