# Text 4: Word2Vec
**Internet Analytics - Lab 4**

---

**Group:** K

**Names:**

* Luca Mouchel
* Mathieu Sauser
* Heikel Jebali
* Jérémy Chaverot

---

#### Instructions

*This is a template for part 4 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import re
import numpy as np
from scipy.sparse import csr_matrix
from collections import defaultdict
import json
import string

from utils import *
import gensim
from sklearn.cluster import KMeans


courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
model = gensim.models.KeyedVectors.load_word2vec_format('/ix/model.txt')  



## Redo pre-processing

Let's see which kinds of words are not represented in the model, using the first description in the dataset as an example.

In [2]:
## Just print examples of words that are not in the model
description = courses[0]['description']
words = description.split(" ")
for word in words:
    if word not in stopwords:
        try:
            model[word]
        except KeyError:
            print(word)


discussed.
Nanocomposites,
biocomposites
presented.
development,
work.
materialsConstituentsProcessing
compositesDesign
structures Current
developmentNanocomposites
compositesBiocompositesAdaptive
composites ApplicationsDriving
marketsCost
analysisAerospaceAutomotiveSport
Nanocomposites
Biocomposites
Prerequisites
By
course,
to:
Propose
design,
partApply
materialsDiscuss
task.Use
IT
toolsCommunicate
disciplines.Evaluate
one's
team,
feedback.
part,
 


We can see that some words are not separated by whitespace (e.g., _"analysisAerospaceAutomotiveSport"_), while others have punctuation attached to them (e.g., _"part,"_).

In [3]:
"""
Our preprocessing separates strings such as "abcDef" into the words "abc" and "Def".
It also fixes the issue where "3D" is glued to a neighbouring word, which prevents the model from recognizing that word:
we split "3D" off from the word it is attached to (e.g., "image3D" becomes "image", "3D").
A similar problem affects "PhD", which ends up as "Ph D" (the camel-case rule splits it), so we join it back together.
We then replace all punctuation with spaces because it is not needed.
Finally, we remove stopwords, pure numbers and words that appear at least max_freq times in the description.
"""
def preprocess(description, max_freq=25):
    # Split camel-case runs, e.g. "lookingForAnswers" -> "looking For Answers"
    description = re.sub(r'([a-z])([A-Z])', r'\1 \2', description)
    # "3D" is often glued to a neighbouring word, e.g. "3DMovie" -> " 3D Movie"
    description = re.sub(r'3D', r' 3D ', description)
    # Re-join "PhD" after it has been split into "Ph D"
    description = re.sub(r'Ph D', r'PhD', description)
    # Replace every punctuation character with a space, then tokenize on whitespace
    words = description.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split()

    # Count how often each word occurs in this description
    term_frequency = {}
    for word in words:
        term_frequency[word] = term_frequency.get(word, 0) + 1

    # Report the words that are dropped for being too frequent
    frequent_words = {word for word in words if term_frequency[word] >= max_freq}
    if frequent_words:
        print(frequent_words)

    # Drop stopwords, pure numbers and overly frequent words
    clean = [word for word in words
             if word.lower() not in stopwords and not word.isdigit() and term_frequency[word] < max_freq]
    return clean
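
As a quick check of the splitting rules, the sketch below applies the same three substitutions and the punctuation stripping to a few of the problematic tokens printed earlier. The helper name `split_glued_words` is hypothetical: it only mirrors the first lines of `preprocess` and skips the stopword and frequency filtering.

```python
import re
import string

def split_glued_words(text):
    """Hypothetical helper: tokenization steps only, no stopword/frequency filtering."""
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)  # camel-case boundaries
    text = re.sub(r'3D', r' 3D ', text)               # detach "3D" from its neighbour
    text = re.sub(r'Ph D', r'PhD', text)              # undo the split of "PhD"
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.translate(table).split()

print(split_glued_words("analysisAerospaceAutomotiveSport part, image3D"))
# ['analysis', 'Aerospace', 'Automotive', 'Sport', 'part', 'image', '3D']
print(split_glued_words("task.Use"))
# ['task', 'Use']
```

Note that the camel-case rule only fires on a lowercase letter followed by an uppercase one, so acronyms such as "IT" are left intact.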

Now let's create a **clean** dataset, in which the course descriptions are preprocessed.

In [4]:
print("as you can see, the terms that appear frequently are terms that do not contribute to the meaning of the text and can therefore be removed")
clean_courses = list(map(lambda course: (course['courseId'], course['name'], preprocess(course['description'])), courses))

as you can see, the terms that appear frequently are terms that do not contribute to the meaning of the text and can therefore be removed
{'and', 'the'}
{'of', 'the', 'and'}
{'the'}
{'and'}
{'the'}
{'the'}
{'and'}
{'of'}
{'the'}
{'and'}
{'the'}
{'and'}
{'the'}
{'of', 'and', 'the'}
{'of'}
{'the'}
{'the'}
{'and'}
{'of'}
{'of', 'the'}
{'and', 'the', 'to'}
{'and'}
{'and'}
{'the'}
{'and'}
{'and'}
{'of', 'the'}
{'of'}
{'of', 'and', 'the'}
{'the'}
{'and'}
{'the'}
{'and', 'the'}
{'and', 'the'}
{'of', 'and', 'the'}
{'the'}
{'and', 'the'}
{'the'}
{'and'}
{'of', 'and', 'the'}
{'of', 'and', 'the'}
{'the'}
{'the', 'and'}
{'and'}
{'the'}
{'and'}
{'of', 'the'}
{'and'}
{'the'}
{'and'}
{'and'}
{'of', 'the'}
{'and', 'the'}
{'and'}
{'of'}
{'the'}
{'of'}
{'the'}
{'the'}
{'the'}
{'of'}
{'and'}
{'of', 'the'}
{'and'}
{'the'}
{'and'}
{'and'}
{'and'}
{'and', 'the'}
{'the'}
{'and'}
{'and', 'the'}
{'and'}
{'of', 'and', 'the'}
{'and'}
{'and', 'the'}
{'of', 'and', 'the'}
{'of', 'and', 'the'}
{'of', 'and'}
{'and'}


## Exercise 4.12 : Clustering word vectors

In [5]:
"""
Vectors in the model are 300-dimensional (shape (300,)), so we try to fetch the word from the model and,
if the word isn't in the model, we simply return a vector of 300 zeros.
"""
def get_vector(word):
    try:
        return model[word]
    except KeyError:
        return np.zeros(300)

In [6]:
# Collect every preprocessed word from the descriptions of all courses
all_words = [word for _, _, words in clean_courses for word in words]

For KMeans, we discard words that are not in the model, as their shared default (zero) vector would otherwise create one large cluster.

In [7]:
all_words = list(filter(lambda word: word in model.key_to_index, all_words))
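
The reason for dropping out-of-vocabulary words can be seen in the assignment step of k-means: identical points always get the same nearest centroid, so every word that falls back to the zero vector lands in one cluster regardless of its meaning. Below is a minimal, dependency-free sketch of that step. The function `nearest_centroid`, the 2-d vectors and the centroids are all made up for illustration and are not part of the lab code.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i]))

# Toy 2-d "embeddings": two known words and three out-of-vocabulary words
# that all fall back to the zero vector (illustrative values only).
embeddings = {
    "matrix": [0.9, 0.1],
    "cell": [0.1, 0.9],
    "oov_a": [0.0, 0.0],
    "oov_b": [0.0, 0.0],
    "oov_c": [0.0, 0.0],
}
centroids = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]  # arbitrary example centroids

assignment = {word: nearest_centroid(vec, centroids) for word, vec in embeddings.items()}
print(assignment)
```

All three `oov_*` entries receive the same cluster index, which is the effect the filter above avoids.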

In [8]:
w2v = {word: get_vector(word) for word in all_words}

Let's define the KMeans model now

In [9]:
kmeans = KMeans(n_clusters=30, random_state=0).fit(list(w2v.values()))

In [10]:
clusters = kmeans.predict(list(w2v.values()))

In [21]:
"""
Print the ten words closest (by cosine similarity) to each cluster center, closest last.
"""
vectors = np.array(list(w2v.values()))
vocabulary = np.array(list(w2v.keys()))
for i, center in enumerate(kmeans.cluster_centers_):
    in_cluster = clusters == i
    similarities = model.cosine_similarities(center, vectors[in_cluster])
    top_idx = np.argsort(similarities)[-10:]
    print(f'The 10 closest words to cluster {i+1} are:\n{vocabulary[in_cluster][top_idx]}\n')


The 10 closest words to cluster 1 are:
['Methodology' 'Paradigms' 'Methods' 'Perturbation' 'Perspective'
 'Processes' 'Analysis' 'Methodologies' 'Theory' 'Context']

The 10 closest words to cluster 2 are:
['arguments' 'questions' 'explaining' 'facts' 'questioning' 'implication'
 'justification' 'comment' 'arguing' 'statements']

The 10 closest words to cluster 3 are:
['renderings' 'compositions' 'pictorial' 'illustrative' 'pictures'
 'images' 'drawings' 'illustrating' 'texts' 'illustrations']

The 10 closest words to cluster 4 are:
['equations' 'convolution' 'univariate' 'discretized' 'invariant'
 'eigenvalues' 'summability' 'finite' 'ODEs' 'quadratic']

The 10 closest words to cluster 5 are:
['Dielectric' 'Reactivity' 'Waveform' 'Pulsed' 'Extrusion' 'Detectors'
 'Impedance' 'Measurement' 'Diode' 'Resistive']

The 10 closest words to cluster 6 are:
['graduates' 'education' 'taught' 'faculty' 'students' 'coursework'
 'courses' 'graduate' 'teaching' 'undergraduate']

The 10 closest words

We can observe that the clusters are loosely organized around scientific fields, general concepts (e.g., understanding, time), names and course topics.

### Labeling clusters
* cluster #1: Methodology
* cluster #3: Illustrations and pictures
* cluster #4: Mathematics
* cluster #6: Education and School work
* cluster #7: Biology and cell vocabulary
* cluster #9: Finance and investment
* cluster #15: Medicine and medical conditions
* cluster #20: Programming and software
* cluster #24: Biology, physics and chemistry
* cluster #25: Notion of time

### Comparing with LSI and LDA

Clusters #7, #4 and #1, among others, are very similar to those found using LDA. Similarly, cluster #4 appears both here and in LSI. Otherwise, words tend to be more closely tied to their clusters here. For example, for mathematics, LDA finds 'learning', 'model', 'time', 'method', 'management', 'analysis' and 'stochastic', whereas we find 'equations', 'convolution', 'univariate', 'discretized', 'invariant', 'eigenvalues', 'summability', 'finite', 'ODEs' and 'quadratic'.

## Exercise 4.13 : Document similarity search

In [13]:
DF = {}
# Count every occurrence of each word across all descriptions;
# this count is what we use as "DF" below
for word in all_words:
    DF[word] = DF.get(word, 0) + 1
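
The `DF` dictionary above is incremented once per occurrence in `all_words`, so a word that appears several times in one description is counted several times. For comparison, here is a small sketch of the two counting schemes on toy data. `document_frequency` and `collection_frequency` are hypothetical helper names, and the token lists are made up.

```python
from collections import Counter

def document_frequency(documents):
    """Number of documents in which each word occurs at least once."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # the set counts each word once per document
    return df

def collection_frequency(documents):
    """Total number of occurrences of each word across all documents."""
    return Counter(word for tokens in documents for word in tokens)

docs = [["markov", "chain", "markov"], ["markov", "process"]]
df = document_frequency(docs)
cf = collection_frequency(docs)
print(df["markov"], cf["markov"])  # prints: 2 3
```

Either count can serve as the denominator of a TF-IDF-style weight; which one fits best here would have to be checked against the search results.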

In [14]:
from collections import defaultdict, Counter

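def get_DF(word):
    # Words missing from the model were filtered out of DF: give them a default count of 1
    if word not in model:
        DF[word] = 1
    return DF[word]


def compute_TF_IDF(description):
    # Weight of each distinct word: its count in this description divided by its DF count
    TF_IDF = {word: TF/get_DF(word) for word, TF in Counter(description).items()}
    total_TF_IDF = np.sum(list(TF_IDF.values()))
    # Sum the weighted word vectors over every token of the description
    vec = np.sum([get_vector(word) * TF_IDF[word] / total_TF_IDF  for word in description], axis=0)
    return TF_IDF, total_TF_IDF, vec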


"""
Compute the TF-IDF-weighted vector of a preprocessed course (clean_course). The index i of the course
is used to retrieve the original, unprocessed description from the dataset.
Each course is transformed into: {courseId, name, description, vector}.
The new field, vector, is a weighted combination of the word vectors of the course's description:
each word vector is multiplied by the word's TF-IDF weight and divided by the total TF-IDF score of the description,
and the resulting vectors are summed into a single vector representation of the description.
"""
def transform_course_description_to_vector(i, clean_course):
    description = clean_course[2]
    TF_IDF, total_TF_IDF, vec = compute_TF_IDF(description)
    return {'courseId': clean_course[0],
            'name': clean_course[1],
            'description': courses[i]['description'],
            'vector': vec}


"""
Get a vector representation for courses and their descriptions
"""
def vectorize_course_descriptions():
    return [transform_course_description_to_vector(i, clean_course) for i, clean_course in enumerate(clean_courses)]
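
To make the weighting concrete, here is a minimal, dependency-free sketch of a TF/DF-weighted average on made-up 2-d vectors. The name `weighted_average`, the vectors and the frequencies are illustrative assumptions. Unlike `compute_TF_IDF`, which iterates over every token, this sketch sums over distinct words so that the weights add up to exactly 1.

```python
from collections import Counter

def weighted_average(tokens, vectors, doc_freq):
    """Average the vectors of the distinct words in `tokens`, weighted by TF / DF."""
    dim = len(next(iter(vectors.values())))
    weights = {word: tf / doc_freq.get(word, 1) for word, tf in Counter(tokens).items()}
    total = sum(weights.values())
    result = [0.0] * dim
    if total == 0:  # empty description: return the zero vector
        return result
    for word, weight in weights.items():
        vec = vectors.get(word, [0.0] * dim)  # unknown words contribute a zero vector
        for k in range(dim):
            result[k] += (weight / total) * vec[k]
    return result

# Made-up 2-d vectors and frequencies, for illustration only
vectors = {"markov": [1.0, 0.0], "chain": [0.0, 1.0]}
doc_freq = {"markov": 2, "chain": 1}
print(weighted_average(["markov", "markov", "chain"], vectors, doc_freq))  # [0.5, 0.5]
```

Here "markov" occurs twice but is also twice as common in the toy corpus, so both words end up with the same weight.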

In [15]:
def contained_in_DF(words):
    # True only if every query word has an entry in DF
    return all(word in DF for word in words)

def get_search_vector(query):
    words = query.split(" ")
    if contained_in_DF(words):
        _, _, vec = compute_TF_IDF(words)
        return vec
    else:
        return np.mean([get_vector(word) for word in words], axis=0)

In [16]:
courses_with_vectors = vectorize_course_descriptions()
def get_top_courses(query_vector,top=5):
    top_matches = list(map(lambda course: (course, model.cosine_similarities(query_vector, [course['vector']])[0]), courses_with_vectors))
    top_matches.sort(key=lambda tup: tup[1], reverse=True)
    top_matches = top_matches[:top]
    return top_matches

def print_top_matching_courses(query):
    search_vector = get_search_vector(query)
    top_matches = get_top_courses(search_vector)
    for i, (course, similarity) in enumerate(top_matches, start=1):
        print("Result {i}: course {courseId}, {name}, with similarity: {dist}".format(i=i, courseId=course['courseId'], name=course['name'], dist=round(similarity, 3)))
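
The ranking step boils down to scoring every course vector by its cosine similarity to the query vector and keeping the best few. The sketch below shows this on made-up 2-d vectors without any external library. `cosine_similarity`, `rank_documents` and the toy data are hypothetical and only illustrate the idea behind `get_top_courses`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vector, doc_vectors, top=5):
    """Return the `top` (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Made-up document vectors, for illustration only
doc_vectors = {"stochastic": [1.0, 0.2], "chemistry": [0.1, 1.0], "networks": [0.7, 0.7]}
for name, score in rank_documents([1.0, 0.0], doc_vectors, top=2):
    print(name, round(score, 3))
```

Because cosine similarity ignores vector length, the overall scale of a course vector does not affect its rank; only its direction does.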

In [17]:
print_top_matching_courses("Markov chains")

Result 1: course MATH-332, Applied stochastic processes, with similarity: 0.562
Result 2: course MGT-484, Applied probability & stochastic processes, with similarity: 0.529
Result 3: course COM-516, Markov chains and algorithmic applications, with similarity: 0.519
Result 4: course CH-311, Molecular and cellular biophysic I, with similarity: 0.483
Result 5: course MSE-211, Organic chemistry, with similarity: 0.472


In [18]:
print_top_matching_courses("Facebook")

Result 1: course EE-727, Computational Social Media, with similarity: 0.755
Result 2: course COM-308, Internet analytics, with similarity: 0.494
Result 3: course CS-622, Privacy Protection, with similarity: 0.478
Result 4: course COM-208, Computer networks, with similarity: 0.472
Result 5: course CS-486, Human computer interaction, with similarity: 0.469


### Comparing with the vector space model and LSI

When comparing with the vector space model, we find that **MATH-332, COM-516 and MGT-484** are in the top 5 for both W2V and the VSM. Likewise, LSI and W2V find the same relevant courses. All of LSI, W2V and LDA yield similar results. However, W2V also returns two chemistry courses (CH-311 and MSE-211), which are not relevant to Markov chains. Nevertheless, the results are quite good for both models.

## Exercise 4.14: Document similarity search with outside terms

In [19]:
print_top_matching_courses("MySpace Orkut")

Result 1: course EE-727, Computational Social Media, with similarity: 0.712
Result 2: course COM-208, Computer networks, with similarity: 0.522
Result 3: course COM-308, Internet analytics, with similarity: 0.519
Result 4: course MGT-517, Entrepreneurship laboratory (e-lab), with similarity: 0.489
Result 5: course CS-486, Human computer interaction, with similarity: 0.482


The matching courses are similar to those returned when searching for _Facebook_. This makes sense, as _Orkut_ and _MySpace_ are two older social media platforms.

In [20]:
print_top_matching_courses("coronavirus")

Result 1: course BIO-657, Landmark Papers in Cancer and Infection, with similarity: 0.597
Result 2: course BIO-477, Infection biology, with similarity: 0.588
Result 3: course BIO-638, Practical - Lemaitre Lab, with similarity: 0.571
Result 4: course CH-414, Pharmacological chemistry, with similarity: 0.548
Result 5: course BIOENG-433, Biotechnology lab (for CGC), with similarity: 0.541


The results seem reasonable: they relate to infections and biology, two topics closely tied to coronavirus, which causes an infectious disease.