# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *Your group letter.*

**Names:**

* *Name 1*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [79]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_pkl
import collections 
from collections import OrderedDict
import operator
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

First, let's gather all the course IDs in a list. Some IDs appear several times in the data, so we keep each ID only once.

In [40]:
Ids_courses = list({c['courseId']: c for c in courses})  # list of unique course IDs

In [41]:
Ids_courses

['MSE-440',
 'BIO-695',
 'FIN-523',
 'MICRO-614',
 'ME-231(a)',
 'AR-402(v)',
 'ChE-421',
 'CH-403',
 'COM-302',
 'EE-432',
 'MGT-430',
 'PHYS-455',
 'EE-517',
 'MSE-613',
 'MSE-423',
 'MGT-621',
 'MSE-437',
 'MSE-431',
 'BIOENG-448',
 'BIOENG-450',
 'MSE-474',
 'MICRO-424',
 'ME-432',
 'ENV-400',
 'HUM-429(a)',
 'BIOENG-517',
 'MSE-420',
 'ENV-501',
 'MATH-111(en)',
 'ChE-302',
 'MICRO-505',
 'CS-352',
 'HUM-417(a)',
 'CIVIL-429',
 'CIVIL-449',
 'FIN-405',
 'CS-699(1)',
 'ME-551',
 'MSE-463',
 'COM-500',
 'MATH-106(en)',
 'MGT-414',
 'BIO-501',
 'COM-308',
 'MGT-526',
 'CH-332',
 'ME-476',
 'EE-605',
 'ENG-603',
 'MSE-425',
 'MATH-408',
 'CS-322',
 'ME-453',
 'MSE-629',
 'CS-490',
 'ENV-715',
 'CS-699(2)',
 'MATH-625',
 'ME-231(b)',
 'AR-402(w)',
 'ME-499',
 'MICRO-562',
 'MSE-443(b)',
 'CH-709',
 'MICRO-504',
 'ENG-431',
 'MATH-232',
 'BIO-676',
 'MSE-646',
 'MICRO-452',
 'PHYS-630',
 'ChE-601(a)',
 'ENV-426',
 'CH-313',
 'MICRO-602',
 'PHYS-437',
 'BIO-617',
 'PHYS-622',
 'PHYS-600'

In [42]:
len(Ids_courses)

854

We notice an entry whose ID is "Caution, these contents corresponds to the coursebooks of last year". Let's create two lists of courses, one with and one without this "weird" ID:

In [43]:
courses_with = list({c['courseId']: c for c in courses}.values())

In [44]:
courses_without = list({c['courseId']: c for c in courses if c['courseId'] != 'Caution, these contents corresponds to the coursebooks of last year'}.values())
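
To make the deduplication step concrete, here is a minimal sketch on made-up records (the `toy_courses` list and its IDs are invented for illustration and are not part of the lab data). A dict comprehension keyed on `courseId` keeps one entry per ID; when an ID repeats, the later record overwrites the earlier one.

```python
# Toy records, invented for illustration; the real data comes from data/courses.txt
toy_courses = [
    {'courseId': 'A-101', 'description': 'first version'},
    {'courseId': 'B-202', 'description': 'another course'},
    {'courseId': 'A-101', 'description': 'second version'},
]

# Keying a dict on courseId keeps exactly one record per ID (the last one seen)
unique = {c['courseId']: c for c in toy_courses}

# Same idea with a filter that drops an unwanted ID
without_b = {c['courseId']: c for c in toy_courses if c['courseId'] != 'B-202'}

print(sorted(unique))                   # the distinct IDs
print(unique['A-101']['description'])   # the later duplicate wins
print(sorted(without_b))
```

If the first occurrence should win instead, one option is to iterate over the reversed list when building the dict.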

In the same way, we create a set of unique stopwords:

In [45]:
stop = set(stopwords)

If we look at the `stop` set, we notice that it contains no punctuation. Let's create our own list of punctuation characters:

In [46]:
import string
punct = list(string.punctuation)  # list of punctuation characters

In [47]:
punct[0:5]  # first 5 characters

['!', '"', '#', '$', '%']

Let's update our stop words list with punctuation:

In [48]:
stop.update(punct)

Now we want to remove stop words and punctuation and compute word frequency for each document:

In [49]:
from nltk.tokenize import wordpunct_tokenize 

In [50]:
# Build a dict with the course ID as key and the list of cleaned tokens as value
word_dict = {}

for course in courses_without:
    desc = course['description']
    w = []
    for token in wordpunct_tokenize(desc):
        word = token.lower()
        if word not in stop:
            w.append(word)

    # Strip punctuation characters left inside tokens (e.g. runs of several symbols)
    w = [''.join(c for c in s if c not in punct) for s in w]
    w = [s for s in w if s]  # remove empty strings

    word_dict[course['courseId']] = w  # list of words for this course
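
The cleaning steps of the cell above can be tried on a single sentence. The sketch below is self-contained and makes two assumptions: `toy_stop` is a tiny invented stop word set (the lab uses `data/stopwords.pkl`), and `tokenize` is a regular-expression stand-in that should behave like `wordpunct_tokenize` (runs of word characters, or runs of punctuation).

```python
import re
import string

punct_chars = set(string.punctuation)

# Hypothetical stop word set for the example, extended with single punctuation characters
toy_stop = {'the', 'of', 'and', 'a', 'is'} | punct_chars

def tokenize(text):
    # Stand-in for nltk's wordpunct_tokenize: runs of word characters,
    # or runs of characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

def preprocess(text, stop):
    tokens = [t.lower() for t in tokenize(text)]
    kept = [t for t in tokens if t not in stop]
    # Tokens such as '!!' are not in the stop set, so strip punctuation inside tokens
    stripped = [''.join(ch for ch in t if ch not in punct_chars) for t in kept]
    return [t for t in stripped if t]  # drop tokens that became empty

tokens = preprocess("The analysis of large-scale data, and mining!!", toy_stop)
print(tokens)
```

The second pass matters because the stop set only contains single punctuation characters, so a token made of several symbols survives the first filter.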

In [51]:
len(word_dict.keys())

853

In [52]:
word_dict

{'AR-201(c)': ['house',
  'simple',
  'topic',
  'studio',
  'matter',
  'simple',
  'complexity',
  'defining',
  'space',
  'corner',
  'cascade',
  'rooms',
  'arriving',
  'simple',
  'complexity',
  'house',
  'learning',
  'house',
  'learning',
  'architecture',
  'house',
  'contexts',
  'house',
  'content',
  'cut',
  'con',
  '\xad',
  'struct',
  'conceive',
  'studio',
  'architecten',
  'de',
  'vylder',
  'vinck',
  'taillieu',
  'deals',
  'idea',
  'reference',
  'frame',
  'reference',
  'idea',
  'prac',
  '\xad',
  'tice',
  'hand',
  'working',
  'projects',
  'time',
  'hand',
  'starting',
  'detail',
  'immediately',
  'studio',
  'architecture',
  'simulated',
  'exercise',
  'architect',
  'studio',
  'simulating',
  'practice',
  'observation',
  'analyze',
  'imagination',
  'concept',
  'part',
  'ap',
  '\xad',
  'proach',
  'strong',
  'belief',
  'variety',
  'media',
  'handmade',
  'drawing',
  'crafted',
  'modeling',
  'result',
  'ongoing',
  'metho

We perform stemming to remove suffixes:

In [53]:
from nltk.stem.porter import PorterStemmer

In [54]:
stemmer = PorterStemmer()
stemmed_words = []   # all stemmed tokens across courses, used later to build the vocabulary
word_dict_stem = {}  # course ID -> list of stemmed tokens

for course, words in word_dict.items():
    temp = [stemmer.stem(w) for w in words]
    stemmed_words += temp
    word_dict_stem[course] = temp
   

Now we want to compute the count of each word in each course description:

In [55]:
word_dict_count={}
for course, val in word_dict_stem.items():
    count=collections.Counter(val)
    word_dict_count[course]=dict(count)
  

And finally the relative frequency of each word within its description:

In [56]:
word_dict_freq={}
for course, val in word_dict_count.items():
   
    total = sum(val.values(), 0.0)
    new_val = {k: v / total for k, v in val.items()}
    word_dict_freq[course]=new_val
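
As a small worked example of the two steps above (counts, then relative frequencies), here is a sketch on an invented token list; `toy_tokens` is not taken from the course data.

```python
from collections import Counter

# Toy list of stemmed tokens for one document, invented for illustration
toy_tokens = ['data', 'mine', 'data', 'graph']

counts = Counter(toy_tokens)    # raw term counts
total = sum(counts.values())    # number of tokens in the document
freqs = {term: n / total for term, n in counts.items()}  # relative frequencies

print(dict(counts))
print(freqs)
```

The relative frequencies of one document sum to 1, which makes descriptions of different lengths comparable.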
   
    


## Exercise 4.2: Term-document matrix

Let's create the term-document matrix, with one row per term and one column per course:

In [57]:
terms=list(set(stemmed_words))
classes = list(set(Ids_courses))

In [58]:
Matrix_TF = np.zeros((len(terms), len(classes)))

In [59]:
Matrix_TF.shape

(11816, 854)

In [60]:
# Fill the matrix with raw counts; columns follow the iteration order of word_dict_count
for idc, course in enumerate(word_dict_count):
    for word, count in word_dict_count[course].items():
        idw = terms.index(word)
        Matrix_TF[idw, idc] = count
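
Calling `terms.index(word)` scans the whole list on every lookup, and the column of each course depends on the iteration order of `word_dict_count`. One possible alternative, sketched here on invented data, is to fix the row and column order once and keep explicit index dictionaries. The names `toy_counts`, `term_index` and `doc_index` are assumptions made for this example.

```python
import numpy as np

# Toy term counts per document, invented for illustration
toy_counts = {
    'doc_a': {'data': 2, 'mine': 1},
    'doc_b': {'graph': 3, 'data': 1},
}

# Fix the row and column order once, and keep dicts for constant-time index lookups
toy_terms = sorted({t for counts in toy_counts.values() for t in counts})
toy_docs = sorted(toy_counts)
term_index = {t: i for i, t in enumerate(toy_terms)}
doc_index = {d: j for j, d in enumerate(toy_docs)}

tf = np.zeros((len(toy_terms), len(toy_docs)))
for doc, counts in toy_counts.items():
    for term, n in counts.items():
        tf[term_index[term], doc_index[doc]] = n

print(toy_terms)
print(tf)
```

With this layout, the column of a given document is always `doc_index[doc]`, whatever order the dictionaries are iterated in.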

In [61]:
terms.index('intend')

1884

In [62]:
Matrix_TF[1884][:]

array([ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0

Now let's compute the inverse document frequency:

In [63]:
terms.index('latest')

11481

In [64]:
terms[0]

'light4'

In [65]:
# Document frequency: number of course descriptions in which each term appears
word_dict_inv_count = {key: 0 for key in terms}
for word in terms:
    for k, v in word_dict_freq.items():
        if word in v:  # the term occurs at least once in this document
            word_dict_inv_count[word] += 1
            
    

In [66]:
word_dict_inv_freq={}
for k, v in word_dict_inv_count.items():
    word_dict_inv_freq[k]=np.log(len(classes)/v)

In [67]:
len(word_dict_inv_freq)

11816

In [68]:
IDF=list(word_dict_inv_freq.values())#IDF vector

In [69]:
TF_IDF = Matrix_TF * np.array([IDF]).T

In [70]:
save_pkl(TF_IDF, 'tfidf_mat.pkl')
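
The cells above compute $\mathrm{idf}_t = \log(N / \mathrm{df}_t)$, where $N$ is the number of documents and $\mathrm{df}_t$ the number of documents containing term $t$, and then scale each row of the count matrix by it. The same computation can be sketched in vectorized form on a toy matrix (the values are invented for illustration):

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents), invented for illustration
tf = np.array([
    [2.0, 1.0, 0.0],   # term present in 2 of 3 documents
    [0.0, 3.0, 0.0],   # term present in 1 of 3 documents
    [1.0, 1.0, 1.0],   # term present in every document
])

n_docs = tf.shape[1]
df = (tf > 0).sum(axis=1)         # document frequency of each term
idf = np.log(n_docs / df)         # natural logarithm, like np.log in the cells above
tfidf = tf * idf[:, np.newaxis]   # broadcast: scale each row (term) by its IDF

print(idf)
print(tfidf)
```

A term that appears in every document gets an IDF of 0, so its whole row vanishes in the TF-IDF matrix.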

Now let's have a look at the results for our Internet Analytics class, COM-308:

In [71]:
index_IA = Ids_courses.index('COM-308')

In [72]:
classes

['MGT-641(b)',
 'FIN-504',
 'COM-303',
 'MSE-231',
 'COM-308',
 'ChE-302',
 'MATH-472',
 'AR-522',
 'MATH-408',
 'CH-630(2)',
 'MATH-640',
 'ME-484',
 'COM-402',
 'COM-507',
 'FIN-610',
 'EE-724',
 'CS-450',
 'CIVIL-443',
 'ChE-409',
 'COM-401',
 'CS-208',
 'MSE-471',
 'BIOENG-801',
 'COM-502',
 'MGT-482',
 'CS-699(1)',
 'MSE-470(a)',
 'HUM-422(a)',
 'MGT-401',
 'BIO-617',
 'CS-206',
 'CH-707',
 'CS-210',
 'EE-548',
 'PENS-210',
 'ENV-461',
 'EE-730',
 'MICRO-421',
 'MICRO-617',
 'HUM-348',
 'BIO-480',
 'MATH-634',
 'MGT-454',
 'PHYS-433',
 'CH-422',
 'ChE-204',
 'HUM-429(b)',
 'BIO-471',
 'ME-421',
 'ChE-403',
 'MICRO-513',
 'AR-402(b)',
 'MGT-439',
 'MATH-454',
 'EE-490(d)',
 'ENG-601(2)',
 'MSE-478',
 'PHYS-702',
 'MGT-707',
 'MICRO-567',
 'MSE-803',
 'MICRO-711',
 'CS-491',
 'CH-453',
 'ChE-413',
 'MICRO-486',
 'CS-422',
 'CS-490',
 'BIO-504',
 'CH-402',
 'FIN-406',
 'ENG-435',
 'MSE-628',
 'MICRO-553',
 'FIN-506',
 'ME-602',
 'MGT-400',
 'BIO-630',
 'MICRO-614',
 'MGT-466',
 'BIOE

In [73]:
index_IA

43

We want to retrieve the TF-IDF score of every term for this class:

In [74]:
scores_IA = {}
for idx, value in enumerate(TF_IDF[:,index_IA]):
    
    scores_IA[terms[idx]] = value

In [75]:
scores_IA

{'light4': 0.0,
 'nanopattern': 0.0,
 'approxim': 0.0,
 'appoit': 0.0,
 'analysi': 0.99735855496293691,
 'divers': 0.0,
 'matanya': 0.0,
 'refriger': 0.0,
 'entrop': 0.0,
 'review': 0.0,
 'spoken': 0.0,
 'multiconductor': 0.0,
 'nem': 0.0,
 'addit': 0.0,
 'nutrit': 0.0,
 'agaros': 0.0,
 'illu': 0.0,
 'brief': 0.0,
 'quantit': 0.0,
 'recallbas': 0.0,
 'projectassess': 0.0,
 'regen': 0.0,
 'bioremedi': 0.0,
 'copolymerizationr': 0.0,
 'thermohydraul': 0.0,
 'boil': 0.0,
 'semiconductorsdifferenti': 0.0,
 'cooper': 0.0,
 'renen': 0.0,
 'procedur': 0.0,
 '13formul': 0.0,
 'ito': 0.0,
 'gu': 0.0,
 'bench': 0.0,
 'crop': 0.0,
 'materialsmagnet': 0.0,
 'focal': 0.0,
 'vivian': 0.0,
 'bsse': 0.0,
 'kelvin': 0.0,
 'primit': 0.0,
 'e1formul': 0.0,
 'chang': 0.0,
 'immunostain': 0.0,
 'moder': 0.0,
 'assaydesign': 0.0,
 'travers': 0.0,
 'leak': 0.0,
 'resist': 0.0,
 'proxim': 0.0,
 'profit': 0.0,
 'methodolog': 0.0,
 'polycycl': 0.0,
 'palmer': 0.0,
 'bliefert': 0.0,
 'electoron': 0.0,
 'pull': 0

In [77]:
from collections import OrderedDict

In [80]:
Scores_ordered_IA = OrderedDict(sorted(scores_IA.items(),key = operator.itemgetter(1),reverse = True))
Scores_ordered_IA

OrderedDict([('mine', 18.681958608434936),
             ('onlin', 17.459173278835436),
             ('social', 15.695066405721727),
             ('explor', 15.061307877526007),
             ('world', 14.393650914403395),
             ('hadoop', 12.11356802645725),
             ('real', 11.42011537566993),
             ('servic', 10.843310933578261),
             ('auction', 10.727273665337359),
             ('commerc', 10.727273665337359),
             ('retriev', 9.6080420894665135),
             ('internet', 9.6080420894665135),
             ('network', 9.3728477860972674),
             ('dataset', 8.5300490880011388),
             ('stream', 8.369963672654066),
             ('data', 8.0565565339707064),
             ('ad', 7.9546849430975772),
             ('larg', 7.8683904262304338),
             ('cluster', 7.5083978404691578),
             ('graph', 7.3177774808605083),
             ('scale', 7.2575935605067166),
             ('lab', 7.1796671012969338),
             ('servicesd

In [81]:
list(Scores_ordered_IA)[0:14]  # first 14 terms

['mine',
 'onlin',
 'social',
 'explor',
 'world',
 'hadoop',
 'real',
 'servic',
 'auction',
 'commerc',
 'retriev',
 'internet',
 'network',
 'dataset']
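
Sorting the full dictionary works, but the top terms can also be read directly from the matrix column. The sketch below assumes a toy vocabulary and score vector (both invented for illustration, not the actual COM-308 scores) and uses `np.argsort` to pick the `k` largest entries.

```python
import numpy as np

# Toy vocabulary and TF-IDF column for one document, invented for illustration
toy_terms = ['data', 'graph', 'mine', 'onlin', 'lab']
toy_column = np.array([8.1, 7.3, 18.7, 17.5, 0.0])

k = 3
top_idx = np.argsort(toy_column)[::-1][:k]   # indices of the k largest scores
top_terms = [(toy_terms[i], float(toy_column[i])) for i in top_idx]

print(top_terms)
```

Note that a slice `[:k]` returns exactly `k` items, so the 15 best terms would need `k = 15`.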

## Exercise 4.3: Document similarity search