# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Maxime Lucas Lanvin*
* *Victor Salvia*
* *Erik Axel Wilhelm Sjöberg*

---

#### Imports

In [36]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
from collections import Counter
from numpy.linalg import norm
import copy

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

Pre-process the corpus to create bag-of-words representations of each document. You are free
to proceed as you wish.

1. Explain which ones you implemented and why.
2. Print the terms in the pre-processed description of the IX class in alphabetical order

#### Removing special characters, lemmatization and stemming

In [2]:
# Creating the tokenizer, the stemmer and the lemmatizer.
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True,preserve_case=False)
ps = PorterStemmer() 
lemmatizer = WordNetLemmatizer()

In [3]:
# Creating a list containing all of the special chars. This list is then added to the stopwords and together they form
# the ignored words. These words will be removed from the corpus. 
specialchar = ['.', ',', '(', ')', '&', ':', '/','-','"',';','', ' ', '..', '...',"'",'%']
ignored_words = set(list(stopwords) + specialchar)

In [4]:
# Checks whether or not there is a digit in a string.
def NoNumbers(s):
    return not any(char.isdigit() for char in s)

# Stems every token of a string, dropping ignored words and tokens that contain digits.
def stemmer(s):
    word_tokens = tknzr.tokenize(s)
    temp_list = [ps.stem(w) for w in word_tokens if w not in ignored_words]
    return [w for w in temp_list if NoNumbers(w)]

# Lemmatizes every token of a string, with the same filtering as stemmer.
def lemmazation(s):
    word_tokens = tknzr.tokenize(s)
    temp_list = [lemmatizer.lemmatize(w) for w in word_tokens if w not in ignored_words]
    return [w for w in temp_list if NoNumbers(w)]

# Lemmatizes and then stems every token of a string, with the same filtering as stemmer.
def lem_n_stem(s):
    word_tokens = tknzr.tokenize(s)
    temp_list = [ps.stem(lemmatizer.lemmatize(w)) for w in word_tokens if w not in ignored_words]
    return [w for w in temp_list if NoNumbers(w)]

# Helper function for tokenize_ngram (case n == 1).
# Tokenizes each description, stems and/or lemmatizes the tokens, and removes the ignored words.
def tokenize_1gram(l,lem,stemlem):
    courses_loc = copy.deepcopy(l)
    for i in courses_loc:
        description = i['description']
        if lem:
            if stemlem:
                i['description'] = lem_n_stem(description)
            else:
                i['description'] = lemmazation(description)
        else:
            i['description'] = stemmer(description)
    return courses_loc

# Description: Tokenizes each course description, stems and/or lemmatizes the tokens, and removes the ignored words.
# For n > 1, the n-grams are then built sentence by sentence over the cleaned tokens.

# @ l: The corpus, a list of course dicts with a 'description' string.
# @ n: The order of the n-grams to return. Default is 1.
# @ lem: Boolean; True selects lemmatization, False selects stemming. Default is lemmatization.
# @ stemlem: Boolean; if True (and lem is True), lemmatization is followed by stemming.
def tokenize_ngram(l,n=1,lem=True,stemlem=False):
    if n ==1:
        return tokenize_1gram(l,lem,stemlem)  
    courses_loc = copy.deepcopy(l)
    for i in courses_loc:
        description = i['description']
        sentences = description.split('.')
        grams = []
        for s in sentences:
            if lem == True:
                if stemlem == False:
                    tokens = lemmazation(s)
                else:
                    tokens = lem_n_stem(s)
            else:
                tokens = stemmer(s)
            grams = grams + list(ngrams(tokens,n))
        i['description'] = grams
    return courses_loc   

In [5]:
# Tokenizing and lemmatizing the corpus, for 1-grams, 2-grams and 3-grams
lemmatized_1gram = tokenize_ngram(courses,1)
lemmatized_2gram = tokenize_ngram(courses,2)
lemmatized_3gram = tokenize_ngram(courses,3)

# Tokenizing and stemming the corpus, for 1-grams, 2-grams and 3-grams
stemmed_1gram = tokenize_ngram(courses,1,lem=False)
stemmed_2gram = tokenize_ngram(courses,2,lem=False)
stemmed_3gram = tokenize_ngram(courses,3,lem=False)

# Tokenizing, lemmatizing and stemming the corpus, for 1-grams, 2-grams and 3-grams
lem_and_stem_1gram = tokenize_ngram(courses,1,stemlem=True)
lem_and_stem_2gram = tokenize_ngram(courses,2,stemlem=True)
lem_and_stem_3gram = tokenize_ngram(courses,3,stemlem=True)

In this step we removed all stopwords (using the list provided to us in the handout) and the special characters defined above. We also collected the bigrams and trigrams sentence by sentence: each description was split into sentences, and the n-grams were built within each sentence, because words after a period are assumed to be unrelated to the words before it. Tokens containing digits were also removed from each corpus. These were mostly course IDs, percentages, durations (e.g. 1.5 hours), etc. At the end of this step we had 9 corpora in total, the ones listed directly above.
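
As a minimal, dependency-free sketch of the sentence-wise n-gram idea, the snippet below splits a text on periods and builds n-grams inside each sentence only. The function name `sentence_ngrams` and the whitespace tokenizer are illustrative assumptions; the notebook itself uses `TweetTokenizer` and `nltk.util.ngrams`.

```python
def sentence_ngrams(text, n=2):
    """Build n-grams sentence by sentence so that no n-gram spans a period."""
    grams = []
    for sentence in text.split('.'):
        # Whitespace split is a stand-in for the real tokenizer (assumption).
        tokens = sentence.lower().split()
        # Zipping n shifted views of the token list yields consecutive n-tuples.
        grams += list(zip(*(tokens[i:] for i in range(n))))
    return grams

example = "Markov chains mix fast. Social networks grow."
bigrams = sentence_ngrams(example, 2)
print(bigrams)
```

Note that the pair `('fast', 'social')` is never produced, since the two words belong to different sentences.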

#### Removing rare and very common words

In [6]:
def remove_frequent_infrequent_words(d,lower_limit=5,higher_limit=500):
    global_dictionary, dictionary_mapping = get_dictionary(d)
    counts = dict(Counter(global_dictionary))
    low_freq_grams = dict((k, v) for (k,v) in counts.items() if v < lower_limit)
    high_freq_grams = dict((k, v) for (k,v) in counts.items() if v > higher_limit)
    
    cleaned_dict = copy.deepcopy(d)
    
    # Would be nice to use a helper here
    for k,v in low_freq_grams.items():
        temp_list = dictionary_mapping[k]
        for u in temp_list:
            cleaned_dict[u]['description'].remove(k)
    for k,v in high_freq_grams.items():
        temp_list = dictionary_mapping[k]
        for u in temp_list:
            cleaned_dict[u]['description'].remove(k)
    return cleaned_dict
    
def get_dictionary(d):
    global_dictionary = []
    dictionary_mapping = {}
    for i in range(0,len(d)):
        temp_list = d[i]
        global_dictionary = global_dictionary + temp_list['description']
        for w in temp_list['description']:
            if w in dictionary_mapping:
                dictionary_mapping[w].append(i)
            else:
                dictionary_mapping[w] = [i]
    return global_dictionary, dictionary_mapping

In [7]:
global_dictionary, dictionary_mapping = get_dictionary(lemmatized_1gram)

In [8]:
# Examples of very common words
Counter(global_dictionary).most_common(10)

[('student', 2029),
 ('method', 1765),
 ('learning', 1472),
 ('system', 1063),
 ('content', 917),
 ('model', 788),
 ('design', 787),
 ('course', 759),
 ('analysis', 727),
 ('basic', 702)]

In [9]:
# Examples of the least common words
n=10
Counter(global_dictionary).most_common()[:-n-1:-1]

[('mandelbrot', 1),
 ('matplotlib', 1),
 ('lapack', 1),
 ('calculati', 1),
 ('blokesch', 1),
 ('fluorescently', 1),
 ('bacterium', 1),
 ('unknown', 1),
 ('microbetracker', 1),
 ('artifical', 1)]

In [10]:
counts = dict(Counter(global_dictionary))
low_freq_words = dict((k, v) for (k,v) in counts.items() if v < 3)
high_freq_words = dict((k, v) for (k,v) in counts.items() if v > 500)

In [11]:
# For Lemmatizing
print('The number of words that are removed due to low frequency in lemmatized_1gram: ' + str(len(low_freq_words)))
print('The number of words that are removed due to high frequency in lemmatized_1gram: ' + str(len(high_freq_words)))

The number of words that are removed due to low frequency in lemmatized_1gram: 10703
The number of words that are removed due to high frequency in lemmatized_1gram: 25


In [12]:
lemmatized_1gram = remove_frequent_infrequent_words(lemmatized_1gram,lower_limit=3,higher_limit=500)
lemmatized_2gram = remove_frequent_infrequent_words(lemmatized_2gram,lower_limit=3,higher_limit=400)
lemmatized_3gram = remove_frequent_infrequent_words(lemmatized_3gram,lower_limit=2,higher_limit=400)

stemmed_1gram = remove_frequent_infrequent_words(stemmed_1gram,lower_limit=3,higher_limit=500)
stemmed_2gram = remove_frequent_infrequent_words(stemmed_2gram,lower_limit=3,higher_limit=400)
stemmed_3gram = remove_frequent_infrequent_words(stemmed_3gram,lower_limit=2,higher_limit=400)

lem_and_stem_1gram = remove_frequent_infrequent_words(lem_and_stem_1gram,lower_limit=3,higher_limit=500)
lem_and_stem_2gram = remove_frequent_infrequent_words(lem_and_stem_2gram,lower_limit=3,higher_limit=400)
lem_and_stem_3gram = remove_frequent_infrequent_words(lem_and_stem_3gram,lower_limit=2,higher_limit=400)

In this step we removed the very common and the rare terms from the corpora, based on their total number of occurrences across all descriptions. 1-grams and 2-grams that occur fewer than 3 times are removed, as are 3-grams that occur only once. Terms that occur more than 500 times (1-grams) or more than 400 times (2-grams and 3-grams) are removed as well. The lower limit is smaller for the 3-grams because a 3-gram that is repeated at all is more likely to be deliberate than a repeated single word or bigram.
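
The thresholding can be illustrated on a toy corpus. This is a simplified sketch, not the function used above: `filter_by_frequency` and `toy_docs` are hypothetical names, and the documents are plain token lists rather than course dicts.

```python
from collections import Counter

def filter_by_frequency(docs, lower_limit, higher_limit):
    """Keep tokens whose total corpus count lies in [lower_limit, higher_limit]."""
    # Count every occurrence across all documents (not the document frequency).
    counts = Counter(token for doc in docs for token in doc)
    return [[t for t in doc if lower_limit <= counts[t] <= higher_limit]
            for doc in docs]

# Toy corpus: 'data' occurs 4 times, 'graph' 3 times, 'rare' once.
toy_docs = [["data", "data", "graph"], ["data", "rare"], ["graph", "data", "graph"]]
filtered = filter_by_frequency(toy_docs, lower_limit=2, higher_limit=3)
print(filtered)
```

With these limits, `'rare'` is dropped for being too infrequent and `'data'` for being too frequent, so only `'graph'` survives.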

#### Index of Internet Analytics course

In [13]:
course_name_to_id = {}
for i in range(len(courses)):
    course_name_to_id[courses[i]['name']] = i
course_id_to_name = dict((v,k) for k,v in course_name_to_id.items())

In [14]:
# Finding COM-308's place in the list. 
course_name_to_id['Internet analytics']

43

#### Lemmatization

In [15]:
ix_id = 43
temp_1 = lemmatized_1gram[ix_id]['description'].copy()
temp_2 = lemmatized_2gram[ix_id]['description'].copy()
temp_3 = lemmatized_3gram[ix_id]['description'].copy()
temp_1.sort(), temp_2.sort(), temp_3.sort()
print('--------------------------------------------------- 1 grams -------------------------------------------------')
print(temp_1), 
print('\n--------------------------------------------------- 2 grams -------------------------------------------------')
print(temp_2)
print('\n--------------------------------------------------- 3 grams -------------------------------------------------')
print(temp_3)

--------------------------------------------------- 1 grams -------------------------------------------------
['acquired', 'ad', 'ad', 'algebra', 'algebra', 'algorithm', 'algorithm', 'analytics', 'analytics', 'auction', 'auction', 'balance', 'based', 'based', 'cathedra', 'chain', 'class', 'class', 'class', 'cloud', 'clustering', 'clustering', 'collection', 'combination', 'communication', 'community', 'community', 'computing', 'computing', 'concrete', 'coverage', 'current', 'data', 'data', 'data', 'data', 'data', 'data', 'datasets', 'datasets', 'decade', 'dedicated', 'designed', 'detection', 'detection', 'dimensionality', 'draw', 'e-commerce', 'e-commerce', 'effectiveness', 'efficiency', 'exam', 'expected', 'explore', 'explore', 'explore', 'explore', 'explores', 'field', 'final', 'foundational', 'framework', 'function', 'fundamental', 'good', 'graph', 'graph', 'hadoop', 'hadoop', 'hands-on', 'homework', 'homework', 'important', 'information', 'information', 'infrastructure', 'inspired',

#### Stemming

In [16]:
ix_id = 43
temp_1 = stemmed_1gram[ix_id]['description'].copy()
temp_2 = stemmed_2gram[ix_id]['description'].copy()
temp_3 = stemmed_3gram[ix_id]['description'].copy()
temp_1.sort(), temp_2.sort(), temp_3.sort()
print('--------------------------------------------------- 1 grams -------------------------------------------------')
print(temp_1), 
print('\n--------------------------------------------------- 2 grams -------------------------------------------------')
print(temp_2)
print('\n--------------------------------------------------- 3 grams -------------------------------------------------')
print(temp_3)

--------------------------------------------------- 1 grams -------------------------------------------------
['acquir', 'ad', 'ad', 'algebra', 'algebra', 'algorithm', 'algorithm', 'analyt', 'analyt', 'auction', 'auction', 'balanc', 'base', 'base', 'cathedra', 'chain', 'class', 'class', 'class', 'cloud', 'cluster', 'cluster', 'collect', 'combin', 'commun', 'commun', 'commun', 'comput', 'comput', 'concret', 'coverag', 'current', 'data', 'data', 'data', 'data', 'data', 'data', 'dataset', 'dataset', 'decad', 'dedic', 'detect', 'detect', 'dimension', 'draw', 'e-commerc', 'e-commerc', 'effect', 'effici', 'exam', 'expect', 'explor', 'explor', 'explor', 'explor', 'explor', 'field', 'final', 'foundat', 'framework', 'function', 'fundament', 'good', 'graph', 'graph', 'hadoop', 'hadoop', 'hands-on', 'homework', 'homework', 'import', 'inform', 'inform', 'infrastructur', 'inspir', 'internet', 'internet', 'java', 'key', 'knowledg', 'lab', 'lab', 'lab', 'laboratori', 'large-scal', 'large-scal', 'larg

#### Stemming & Lemmatizing

In [17]:
ix_id = 43
temp_1 = lem_and_stem_1gram[ix_id]['description'].copy()
temp_2 = lem_and_stem_2gram[ix_id]['description'].copy()
temp_3 = lem_and_stem_3gram[ix_id]['description'].copy()
temp_1.sort(), temp_2.sort(), temp_3.sort()
print('--------------------------------------------------- 1 grams -------------------------------------------------')
print(temp_1), 
print('\n--------------------------------------------------- 2 grams -------------------------------------------------')
print(temp_2)
print('\n--------------------------------------------------- 3 grams -------------------------------------------------')
print(temp_3)

--------------------------------------------------- 1 grams -------------------------------------------------
['acquir', 'ad', 'ad', 'algebra', 'algebra', 'algorithm', 'algorithm', 'analyt', 'analyt', 'auction', 'auction', 'balanc', 'base', 'base', 'cathedra', 'chain', 'class', 'class', 'class', 'cloud', 'cluster', 'cluster', 'collect', 'combin', 'commun', 'commun', 'commun', 'comput', 'comput', 'concret', 'coverag', 'current', 'data', 'data', 'data', 'data', 'data', 'data', 'dataset', 'dataset', 'decad', 'dedic', 'detect', 'detect', 'dimension', 'draw', 'e-commerc', 'e-commerc', 'effect', 'effici', 'exam', 'expect', 'explor', 'explor', 'explor', 'explor', 'explor', 'field', 'final', 'foundat', 'framework', 'function', 'fundament', 'good', 'graph', 'graph', 'hadoop', 'hadoop', 'hands-on', 'homework', 'homework', 'import', 'inform', 'inform', 'infrastructur', 'inspir', 'internet', 'internet', 'java', 'key', 'knowledg', 'lab', 'lab', 'lab', 'laboratori', 'large-scal', 'large-scal', 'larg

#### Merging the single words with the bigrams:
By looking at the printouts above, the bigrams make a lot of sense, e.g. real-world problem, social network, linear algebra, topic model, etc.

The trigrams, on the other hand, are not very informative. We skip them and continue with a corpus in which the 1-grams and 2-grams are merged.

From here on we continue with the lemmatized and stemmed corpus containing the 1-grams and 2-grams. Judging by a few examples like the one above, the words do not seem to be overstemmed, so we use stemming in combination with lemmatization.


In [18]:
# We do not care for the position here so just add them up
def corpus_merge(c1,c2):
    merged_corpus =  copy.deepcopy(c1)
    for i in range(len(c1)):
        merged_corpus[i]['description'] = merged_corpus[i]['description'] + c2[i]['description']
    return merged_corpus

In [19]:
lem_stem_corpus = corpus_merge(lem_and_stem_1gram, lem_and_stem_2gram)

## Exercise 4.2: Term-document matrix

Construct an M × N term-document matrix X, where M is the number of terms and N is the
number of documents. The matrix X should be sparse. You are not allowed to use libraries


1. Print the 15 terms in the description of the IX class with the highest TF-IDF scores.
2. Explain where the difference between the large scores and the small ones comes from

In [20]:
# Constructs a term-document matrix from a corpus
def get_term_document_matrix(corpus):
    global_dictionary, dictionary_mapping = get_dictionary(corpus)
    df_index = dict((k,len(list(set(v)))) for k,v in dictionary_mapping.items()) # A dict where the key is the term and the value is in how many documents the term is present
    unique_words = list(df_index.keys()) #The unique words
    word_to_index = dict(zip(unique_words,(range(len(unique_words))))) # Mapping from word to index (That we use for the encoding)
    index_to_word = dict((v,k) for k,v in word_to_index.items()) # Mapping back from index to word. 


    m = len(unique_words)
    n = len(corpus)

    values = []
    rows = []
    columns = []

    for i in range(n):
        tokens = corpus[i]['description']
        loc_word_count = len(tokens)
        loc_counts = Counter(tokens)
        unique_tokens = list(loc_counts.keys())

        for token in unique_tokens:
            tf = loc_counts[token]/loc_word_count
            df = df_index[token]
            idf = np.log(n/(df+1))

            rows.append(word_to_index[token])
            columns.append(i)
            values.append(tf*idf)

    return csr_matrix((values, (rows, columns)), shape=(m, n)), index_to_word, word_to_index

def get_column_scores(mat, i,index_to_word, n=-1):
    a = mat.getcol(i)
    non_zero_rows = csr_matrix.nonzero(a)[0]
    d = {}
    for i in non_zero_rows:
        d[index_to_word[i]] = a[i,0]
    order_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    if n == -1:
        return Counter(order_d).most_common()
    else:
        return Counter(order_d).most_common(n)
    
def get_row_scores(mat, word,word_to_index, n=-1,corpus=courses):
    a = mat.getrow(word_to_index[word])
    non_zero_cols = csr_matrix.nonzero(a)[1]
    d = {}
    for i in non_zero_cols:
        d[corpus[i]['name']] = a[0,i]
    
    order_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    if n == -1:
        return Counter(order_d).most_common()
    else:
        return Counter(order_d).most_common(n)

In [21]:
X, index_to_word, word_to_index = get_term_document_matrix(lem_stem_corpus)
ix_id = 43
get_column_scores(X,ix_id,index_to_word,15)

[('onlin', 0.07574768262684738),
 ('real-world', 0.0734207339706497),
 ('social', 0.06825318057153912),
 ('explor', 0.06605836788388599),
 (('data', 'mine'), 0.06523910163895413),
 (('social', 'network'), 0.06321080322017443),
 ('mine', 0.05990403442700461),
 ('large-scal', 0.05153576118068886),
 ('hadoop', 0.04957297285193386),
 (('system', 'cluster'), 0.04957297285193386),
 (('commun', 'detect'), 0.04957297285193386),
 ('e-commerc', 0.04704944590060245),
 (('recommend', 'system'), 0.04704944590060245),
 (('topic', 'model'), 0.04704944590060245),
 ('servic', 0.046998386361060844)]

In [22]:
ix_id_scores = get_column_scores(X,ix_id,index_to_word,15)
l = []
for w in ix_id_scores:
    l.append((w[0], lem_stem_corpus[ix_id]['description'].count(w[0])))
l

[('onlin', 5),
 ('real-world', 4),
 ('social', 5),
 ('explor', 5),
 (('data', 'mine'), 3),
 (('social', 'network'), 3),
 ('mine', 3),
 ('large-scal', 3),
 ('hadoop', 2),
 (('system', 'cluster'), 2),
 (('commun', 'detect'), 2),
 ('e-commerc', 2),
 (('recommend', 'system'), 2),
 (('topic', 'model'), 2),
 ('servic', 3)]

In [23]:
global_dictionary, dictionary_mapping = get_dictionary(lem_stem_corpus)
df_index = dict((k,len(list(set(v)))) for k,v in dictionary_mapping.items())
l = []
for w in ix_id_scores:
    l.append((w[0], df_index[w[0]]))
l

[('onlin', 26),
 ('real-world', 12),
 ('social', 37),
 ('explor', 41),
 (('data', 'mine'), 5),
 (('social', 'network'), 6),
 ('mine', 8),
 ('large-scal', 16),
 ('hadoop', 2),
 (('system', 'cluster'), 2),
 (('commun', 'detect'), 2),
 ('e-commerc', 3),
 (('recommend', 'system'), 3),
 (('topic', 'model'), 3),
 ('servic', 23)]

The difference in TF-IDF score comes from the fact that terms with a higher score are very frequent in the description, occur in few documents, or both. The two lists above show how many times each lemmatized and stemmed word/bigram appears in the description of Internet Analytics and in how many documents it occurs. For example, onlin, social and explor each appear 5 times in the description, yet onlin has the highest score of the three. This is because onlin occurs in fewer documents (26, against 37 and 41), so it is less diluted across the corpus and gets a higher IDF.
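
The effect of the document frequency can be reproduced with the same formula as in `get_term_document_matrix`, i.e. `tf * log(n / (df + 1))`. In the sketch below the term counts and document frequencies are taken from the lists above, while `N_DOCS` and `DOC_LENGTH` are assumed placeholder values, so the absolute scores are illustrative and only their ordering is meaningful.

```python
import math

N_DOCS = 850      # assumed corpus size, for illustration only
DOC_LENGTH = 300  # assumed number of tokens in the description

def tf_idf(count, df, doc_length=DOC_LENGTH, n_docs=N_DOCS):
    """TF-IDF with the same smoothing as the notebook: log(n / (df + 1))."""
    tf = count / doc_length
    idf = math.log(n_docs / (df + 1))
    return tf * idf

# term -> (occurrences in the description, number of documents containing it)
terms = {"onlin": (5, 26), "social": (5, 37), "explor": (5, 41)}
scores = {t: tf_idf(c, df) for t, (c, df) in terms.items()}
for t, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(t, round(s, 4))
```

Since the three terms share the same term frequency, the ranking is decided entirely by the IDF factor.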

## Exercise 4.3: Document similarity search

Search for "markov chains" and "facebook".

1. Display the top five courses together with their similarity score for each query.
2. What do you think of the results? Give your intuition on what is happening.

In [190]:
def cosine_similarity(a,b):
    return np.dot(np.transpose(a),b)/(norm(a)*norm(b))

# Returns the n courses most similar to the search term.
def term_query(term,n=5):
    if term not in word_to_index:
        print('Try another term, not present in the global dictionary')
        return
    # One-hot query vector with one entry per term (row) of X
    b = np.zeros((X.shape[0],1))
    b[word_to_index[term]] = 1
    term_scores = {}
    for i in range(len(courses)):
        u = cosine_similarity(X.getcol(i).toarray(),b)[0][0]
        if u > 0: # We are only interested in courses with a similarity above 0
            term_scores[i] = u
    return dict((course_id_to_name[k],v) for k,v in dict(Counter(term_scores).most_common(n)).items())

In [192]:
print('---------------------- markov chains ------------------------')
markov_chain = term_query(('markov', 'chain'))
for k,v in markov_chain.items():
    print(k + ': '+ str(np.round(v,4)))
print('\n---------------------- facebook ------------------------')
facebook = term_query('facebook')
for k,v in facebook.items():
    print(k + ': '+ str(np.round(v,4)))

---------------------- markov chains ------------------------
Applied stochastic processes: 0.3453
Markov chains and algorithmic applications: 0.2822
Applied probability & stochastic processes: 0.2615
Optimization and simulation: 0.1025
Networks out of control: 0.0742

---------------------- facebook ------------------------
Computational Social Media: 0.1369


Only one course contains the term 'facebook', so it is the only one returned by the query (all other courses have a similarity score of 0). To see why, consider the numerator of the cosine similarity, $d_{i}^{T}d_{j}$. It contains:

1. $d_{i}^{T}$, the encoding of our search term, which is zero in every entry except the one corresponding to the term facebook.

2. $d_{j}$, the encoding of document j, which has non-zero elements only for the terms that are in its description.

Therefore $d_{i}^{T}d_{j} = 0$ whenever document j does not contain the term facebook. To find courses with content related to facebook, we can run a query on the somewhat broader term social network.

In [179]:
print('\n---------------------- social network ------------------------')
social_network = term_query(('social', 'network'))
for k,v in social_network.items():
    print(k + ': '+ str(np.round(v,4)))


---------------------- social network ------------------------
Internet analytics: 0.197
Networks out of control: 0.0763
Computational Social Media: 0.0724
A Network Tour of Data Science: 0.0711
Applied data analysis: 0.0441
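
The zero-similarity argument for single-term queries can be checked on toy vectors. The sketch below is dependency-free and uses a hypothetical three-term vocabulary with made-up weights; it is not the sparse matrix `X` from the notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary: index 0 = 'facebook', 1 = 'network', 2 = 'markov'
query = [1.0, 0.0, 0.0]             # one-hot encoding of the term 'facebook'
doc_with_term = [0.2, 0.1, 0.0]     # made-up TF-IDF weights
doc_without_term = [0.0, 0.3, 0.4]  # no weight on 'facebook'

sim_with = cosine(doc_with_term, query)
sim_without = cosine(doc_without_term, query)
print(sim_with, sim_without)
```

For a one-hot query the similarity reduces to the document's weight for that term divided by the document's norm, which is why a document without the term always scores exactly 0.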
