# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *P*

**Names:**

* *Pierre Fouche*
* *Matthias Leroy*


---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don't forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_json
import string
import math
import collections
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [None]:
#function that, given a list, returns a new list with the original elements followed by their bigrams and trigrams.
def bitrigrams(l):
    bigrams = [' '.join(map(str, tup)) for tup in zip(l, l[1:])]
    trigrams = [' '.join(map(str, tup)) for tup in zip(l, l[1:], l[2:])]
    return l + bigrams + trigrams

l1=[1,2,3,4,5,6,7,8]
print(bitrigrams(l1))

In [None]:
#function that removes the infrequent words, i.e. the words that occur only once in the given list
def removeInfrequentWords(wordsList):
    count = collections.Counter(wordsList)
    #build a new list instead of calling remove() repeatedly on the input
    return [word for word in wordsList if count[word] > 1]

In [None]:
ps = PorterStemmer()
wnl = WordNetLemmatizer()
ls = LancasterStemmer()
translator = str.maketrans('', '', string.punctuation)
newCourses =[]

#test = wnl.lemmatize('studing')
#print(test)


#we pre-process each course to build a new description: the list of all the terms we decide to keep.
for course in courses:
    temp = course['description'].lower()
    temp = temp.translate(translator)
    #split on any whitespace so that repeated spaces or newlines do not produce empty tokens
    temp = temp.split()
    temp=[word for word in temp if word not in stopwords]
    temp=[ps.stem(word) for word in temp]
    temp = removeInfrequentWords(temp)
    temp = bitrigrams(temp)
    temp = removeInfrequentWords(temp)
    newCourses.append({'name':course['name'],'listDescription':temp,'courseId':course['courseId'], 'description':course['description']})

save_json(newCourses, 'courses.txt')

1) First, we convert every word to lower case so that the same word with different casing is not treated as two different terms. We remove punctuation so that a word stuck to a punctuation mark does not become a separate term. We also delete the stopwords, because they carry little meaning and are likely to be very frequent. We then stem the words to group those that share the same root, since they express the same idea. Next, we remove the infrequent words (those that appear only once in a description) because, like the stopwords, they are not very relevant or useful. Finally, we add bigrams and trigrams to the vocabulary, because some expressions make more sense than isolated words. We remove the infrequent terms one more time to drop the bigrams and trigrams that are not frequent and to avoid overloading the descriptions.
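
The steps above can be traced on a toy sentence. This is a minimal sketch that uses only the standard library: the stemming step is left out (the notebook relies on NLTK's `PorterStemmer` for it), and the stopword set `toy_stopwords`, the helper names and the sample text are all hypothetical.

```python
import collections
import string

# Hypothetical stand-in for the stopword list loaded from data/stopwords.pkl.
toy_stopwords = {"the", "of", "and", "a", "in"}

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    translator = str.maketrans('', '', string.punctuation)
    words = text.lower().translate(translator).split()
    return [w for w in words if w not in toy_stopwords]

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def keep_repeated(terms):
    """Keep only the terms that occur more than once in the list."""
    counts = collections.Counter(terms)
    return [t for t in terms if counts[t] > 1]

text = "Markov chains, and the theory of Markov chains in networks."
tokens = keep_repeated(tokenize(text))      # first filtering pass
terms = keep_repeated(tokens + ngrams(tokens, 2) + ngrams(tokens, 3))  # second pass
print(terms)
```

In this toy run, only `markov`, `chains` and the bigram `markov chains` survive the two filtering passes, each of them twice.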

In [None]:
#2) we print the terms in the pre-processed description of the IX class in alphabetical order.
for course in newCourses:
    if course['name'] == 'Internet analytics':
        for word in sorted(course['listDescription']):
            print(word)

## Exercise 4.2: Term-document matrix

In [None]:
terms =[]
#we create a list with all the words in every list description
for item in newCourses:
    terms.extend(item['listDescription'])
countTerms = collections.Counter(terms)

#we create a dictionary that links each index to its term (for the future matrix, so that a term can easily be recovered from its row index)
termsDict = dict(enumerate(countTerms.keys()))
nb_terms = len(termsDict)
print(nb_terms)

#we create a dictionary that links each index to its course (for the columns of the future matrix)
newCoursesDict = dict(enumerate(newCourses))
nb_courses = len(newCoursesDict)
print(nb_courses)

#function that counts the number of descriptions in which a given term appears
def countDocWithTerm(term,docs):
    result = 0
    for doc in docs:
        if term in doc['listDescription']:
            result += 1
    return result

with open("newCoursesDict.pickle", "wb") as f:
    pickle.dump(newCoursesDict, f)
with open("termsDict.pickle", "wb") as f:
    pickle.dump(termsDict, f)

In [None]:
values =[]
rows=[]
columns=[]

#we construct a MxN term-document sparse matrix X, where M is the number of terms and N is the number of documents.

#In order to do this we create the array with the values, the index rows and columns
for i,term in termsDict.items():
    #the idf depends only on the term, so we compute it once per term
    idf = math.log(nb_courses/countDocWithTerm(term,newCourses))
    for index,doc in newCoursesDict.items():
        if term in doc['listDescription']:
            #we compute the tf-idf of the term in this document
            tf = doc['listDescription'].count(term)/len(doc['listDescription'])
            tf_idf=tf*idf

            values.append(tf_idf)
            rows.append(i)
            columns.append(index)

#then we construct the sparse matrix from this information.
X = csr_matrix((values, (rows, columns)), shape=(len(termsDict), len(newCoursesDict)))

#we save it because we are going to use it in the other exercises.
with open("matrix.pickle", "wb") as f:
    pickle.dump(X, f)

In [None]:
#print(X.count_nonzero())
Xarray = X.toarray()

#we get the column index of the IX class thanks to the newCoursesDict we created above.
a = 0
for key,value in newCoursesDict.items():
    if value['name'] == 'Internet analytics':
        a = key
        break

#We get the 15 terms in the description of the IX class with the highest TF-IDF scores.
idxIX = np.argsort(Xarray[:,a])[::-1][:15]

b=0
for key,value in termsDict.items():
    if value == 'system':
        b = key
        break
#print(Xarray[b][a])

#1) then we print them
print('First 15 terms in the description of the IX class with the highest TF-IDF scores. \n')
for i in idxIX:
    print(termsDict[i])

2) A term has a large score in a document if it appears often in that document while being rare in the corpus as a whole. Conversely, a term has a small score in a document if it appears rarely in that document or if it is very frequent across the whole corpus.
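
A small worked example may make this concrete. The sketch below uses the same definitions as the cell above (term frequency relative to the description length, and the natural logarithm of the number of documents over the document frequency) on a made-up three-document corpus. The function name `tf_idf` and the toy data are hypothetical.

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`: relative frequency times log(N / document frequency).

    Assumes `term` occurs in at least one document of `docs`.
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# Made-up toy corpus: each document is a list of pre-processed terms.
docs = [
    ["graph", "graph", "data", "network"],
    ["data", "learn", "model", "data"],
    ["data", "network", "signal", "filter"],
]

for term in ["graph", "network", "data"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
# graph 0.549
# network 0.101
# data 0.0
```

Here `graph` is frequent in the first document and absent elsewhere, so it scores highest, while `data` appears in every document and gets a score of zero.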

## Exercise 4.3: Document similarity search

In [None]:
'''
fb = 0
mc = 0
for key,value in termsDict.items():
    if value==ps.stem('facebook'):
        fb = key
    elif value==ps.stem('markov chains'):
        mc = key
'''

#function that computes the cosine similarity between two vectors
def similarity(a,b):
    sim = (np.dot(a.T,b))/(np.linalg.norm(a)*np.linalg.norm(b))
    return sim

In [None]:
import itertools

#we create a function that prints the courses whose description contains a given query (word or expression),
#then computes the cosine similarity between all pairs of these courses and prints the 5 closest pairs.
def query(q):
    cosSim = {}
    idxCourses = []
    #we stem each word of the query separately so that it matches the stemmed terms and n-grams
    stemmedQuery = ' '.join(ps.stem(word) for word in q.lower().split())
    print('Courses with a description that contains',q,':\n')
    for i,doc in newCoursesDict.items():        
        if stemmedQuery in doc['listDescription']:
            print(doc['name'])
            idxCourses.append(i)
    print('------------------------------\n')
    combi = itertools.combinations(idxCourses, 2)
    
    for idx in combi:
        a = Xarray[:,idx[0]]
        b = Xarray[:,idx[1]]
        cosSim[idx] = similarity(a,b)

    npCosSim = np.array(list(cosSim.values()))
    idxQuery = np.argsort(npCosSim)[::-1][:5]
    
    print('Five most similar pairs of courses for the query',q,':\n')
    pairs = list(cosSim.keys())
    for j in idxQuery:
        tup = pairs[j]
        #the course indices are the keys of newCoursesDict, so we can look the names up directly
        print(newCoursesDict[tup[0]]['name'])
        print(newCoursesDict[tup[1]]['name'])
        print('similarity score:',cosSim[tup])
        print('------------------------------')

In [None]:
query('markov chain')

In [None]:
query('facebook')

2) For the query "facebook", only one course has this word in its description, so it is not possible to compute a cosine similarity between different courses. For "markov chain", we assumed that two courses are close if their descriptions share many of the same terms with large TF-IDF scores.
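
To illustrate this assumption, here is a minimal sketch of the cosine similarity on made-up TF-IDF vectors, written in plain Python rather than NumPy. The function name `cosine_similarity`, the zero-vector guard and the three toy course vectors are hypothetical and are not taken from the notebook's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up TF-IDF columns for three courses over a four-term vocabulary.
course_a = [0.5, 0.2, 0.0, 0.0]
course_b = [0.4, 0.1, 0.0, 0.1]
course_c = [0.0, 0.0, 0.6, 0.3]

# Courses a and b put their weight on the same terms; a and c share none.
print(cosine_similarity(course_a, course_b) > cosine_similarity(course_a, course_c))
```

Two courses whose large TF-IDF weights fall on the same terms get a similarity close to 1, while courses with no term in common get 0.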