# Similarity Method

In this notebook we will create a method for __quantifying similarity between two EPFL courses__. 

We have 3 datasets contaning some data for each EPFL course such as name, description, summary, etc. Using word embeddings, we will create a method which will take that data for a pair of courses and __establish whether they are similar or not.__ There is a small dataset of ground truth (list of pairs of courses for which professors said are similar) which we will use to evaluate the efficiency of our method.

The notebook is divided into three parts:
1. __Pre-processing:__ Data cleaning, fixing inconsistencies, pre-processing pipeline, creating final dataframe. 
2. __Method:__ Explanation and implementation of a word-embeddings based similarity method.
3. __Evaluation:__ Using our ground-truth to evaluate the efficiency of our similarity method.

__Note:__ In this notebook, we will use terms similar courses and related courses synonymously, since we assume that the notion of course "similarity" is the same as course "relatedness". 

In [1]:
%matplotlib notebook

# general
import re, itertools
import pandas as pd
import numpy as np
import string

#sklearn
from sklearn.metrics.pairwise import cosine_similarity

#gensim
import gensim
from gensim.models import Doc2Vec
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument

#nltk
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/rinjac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rinjac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 1. Pre-processing

### 1. 1 Data cleaning

Firstly, let's get the datasets. We have three datasets, which are:
1. __course_dependencies__ - Contains ground truth (__some EPFL courses__ for which professors said are similar)
2. __course_descriptions__ - Contains course name, description, summary, and other data for __all EPFL courses__
3. __course_keywords__ - Contains keywords for each EPFL course (Keywords are usually main concepts taught in the course)

In [2]:
cddf = pd.read_csv('datasets/course_dependencies.csv', index_col=0, keep_default_na=False)
tdf = pd.read_csv('datasets/course_desc.csv', index_col=0, keep_default_na=False)
kwdf = pd.read_csv('datasets/course_keywords.csv', index_col=0, keep_default_na=False)

Fixing minor inconsistencies in the datasets.

In [3]:
# in a lof of database rows, we don't have the course title in English ('CourseNameEN')
# but the 'CourseNameFR' field is actually title in English
# so we fix this
cddf['CourseName1'] = cddf.apply(lambda row: row['CourseNameFR1'] if not row['CourseNameEN1'] else row['CourseNameEN1'], axis=1)
cddf['CourseName2'] = cddf.apply(lambda row: row['CourseNameFR2'] if not row['CourseNameEN2'] else row['CourseNameEN2'], axis=1)
tdf['CourseName'] = tdf.apply(lambda row: row['CourseNameFR'] if not row['CourseNameEN'] else row['CourseNameEN'], axis=1)

# we drop the not needed rows
cddf = cddf.drop(columns=['CourseNameEN1', 'CourseNameEN2', 'CourseNameFR1', 'CourseNameFR2'])
tdf = tdf.drop(columns=['CourseNameEN', 'CourseNameFR'])

# if the course name is still null, we put it as empty string
cddf['CourseName1'] = cddf['CourseName1'].apply(lambda x: " " if not x else x)
cddf['CourseName2'] = cddf['CourseName2'].apply(lambda x: " " if not x else x)
tdf['CourseName'] = tdf['CourseName'].apply(lambda x: " " if not x else x)

For each course we have the following data:
1. __course name__: Name of the course.
2. __course description__: Around 50-100 words, description about what is the course about.
3. __course summary__: Summary of the curriculum of the course. 
4. __course keywords__: Important concepts learned in the course.

__We will use data obtained from all 4 in our method__. So, for each course we make a "course_content" field which will just be all 4 of that, concatenated.

In [4]:
# concat the course content and summary
tdf['CourseContent'] = tdf['CourseContent'] + tdf['SummaryEN']
tdf = tdf.drop(columns=['SummaryEN'])

cddf['CourseContent1'] = cddf['CourseContent1'] + cddf['SummaryEN1']
cddf['CourseContent2'] = cddf['CourseContent2'] + cddf['SummaryEN2']
cddf = cddf.drop(columns=['SummaryEN1', 'SummaryEN2'])

# remove rows which still have null values
cddf = cddf.dropna()
tdf = tdf.dropna()

# group keywords by course code
kwdf = kwdf.groupby('CourseCode')['TagValue'].agg(lambda col: ' '.join(col))
kwdf = pd.DataFrame(kwdf)

# add keywords for TDF dataframe
tdf = tdf.set_index(keys=['CourseCode'], drop=True)
tdf = tdf.merge(kwdf, on=['CourseCode'])
tdf = tdf.reset_index()

# add keywords to the first course
cddf = cddf.rename(columns={'CourseCode1': 'CourseCode'})
cddf = cddf.set_index(keys=['CourseCode'], drop=True)
cddf = cddf.merge(kwdf, on=['CourseCode'])
cddf = cddf.reset_index()
cddf = cddf.rename(columns={'CourseCode': 'CourseCode1', 'TagValue': 'TagValue1'})

# add keywords to the second course
cddf = cddf.rename(columns={'CourseCode2': 'CourseCode'})
cddf = cddf.set_index(keys=['CourseCode'], drop=True)
cddf = cddf.merge(kwdf, on=['CourseCode'])
cddf = cddf.reset_index()
cddf = cddf.rename(columns={'CourseCode': 'CourseCode2', 'TagValue': 'TagValue2'})

Defining pre-processing functions we are going to use:

In [5]:
def remove_nonascii(s):
    return s.encode('ascii', 'ignore').decode("utf-8")

def remove_newline(s):
    return s.replace('\n','')

def remove_squote(s):
    return s.replace('<squote/>',' ')

stop_words = stopwords.words('english')
def remove_stop_words(tokens):
    return [word for word in tokens if word not in stop_words]

french_stop_words = stopwords.words('french')
def remove_french_stop_words(tokens):
    return [word for word in tokens if word not in french_stop_words]

punc = list(string.punctuation)
def remove_punc(tokens):
    return [word for word in tokens if word not in punc]

def to_lower(tokens):
    return [token.lower() for token in tokens]

def apply_preproc(df, column, func): 
    df[column] = df[column].apply(func)

### 1.2 Finding most common words

__For reasons explained below (in the section 2.2)__ we will write the code which finds the N most common words in course descriptions and a pre-processing function which will remove them. 

In [6]:
ndf = tdf.copy()

str_list = []
for i in range(len(tdf)):
    str_list.append(tdf.iloc[i]['CourseContent'])
all_content = ''.join(str_list)
                    
all_content = word_tokenize(all_content)
all_content = to_lower(all_content)
all_content = remove_stop_words(all_content)
all_content = remove_punc(all_content)

freq_d = dict()
for w in all_content:
    freq_d[w] = 1 + freq_d.get(w, 0)

In [7]:
freq = [(freq_d[key], key) for key in freq_d]
freq.sort()
freq.reverse()

In [8]:
most_common_f = freq[:100]
_, most_common = [list(tup) for tup in zip(*most_common_f)]
print(most_common)

['course', 'semicolon/', 'students', 'design', 'systems', 'analysis', 'methods', 'introduction', 'squote/', 'basic', 'models', 'techniques', 'energy', 'project', 'theory', 'concepts', '2', 'applications', 'materials', 'data', 'research', '3', 'different', 'processes', 'principles', 'system', 'tools', 'student', 'development', 'work', 'well', 'engineering', 'field', 'linear', 'topics', 'management', 'properties', 'part', 'also', 'control', '4', 'use', 'main', 'understanding', 'practical', 'learning', 'social', 'knowledge', 'using', 'study', 'understand', 'modeling', 'theoretical', 'problems', 'examples', 'various', 'structure', 'aspects', 'digital', 'science', '5', 'process', 'urban', 'fundamental', 'based', '1', 'application', 'new', "''", 'specific', 'used', 'processing', 'issues', 'information', 'time', 'approach', 'power', 'model', 'structures', 'learn', 'physical', 'chemical', '``', 'studies', 'case', 'technology', 'related', 'general', 'types', 'optical', 'quantum', 'physics', 'in

In [9]:
def remove_most_common_words(tokens):
    return [word for word in tokens if word not in most_common]

### 1.3 Creating final dataframe

We need two types of course-pairs: 
1. the courses which are similar
2. the courses which are not similar

We have 1. from the dataset course_dependencies (the ground truth). To make 2., we are going to randomly select some course pairs from all of the courses, as long as they are not in the "similar course-pairs list" which is defined by 1.

After we get two dataframes (one for course-pairs which are similar and one for ones which are not) we will merge them into a single dataframe.

In [10]:
# take a random sample of the course descriptions table (used for making pairs of not-similar samples)
tdf1 = tdf.sample(n = 250)
tdf2 = tdf.sample(n = 250)
tdf1 = tdf1.rename(index=str, columns={"CourseCode": "CourseCode1", "CourseName": "CourseName1", "CourseContent": "CourseContent1", "TagValue": "TagValue1"})
tdf2 = tdf2.rename(index=str, columns={"CourseCode": "CourseCode2", "CourseName": "CourseName2", "CourseContent": "CourseContent2", "TagValue": "TagValue2"})
tdf1 = tdf1.reset_index(drop=True)
tdf2 = tdf2.reset_index(drop=True)

# create the table of samples which are not-similar 
tdf = pd.concat([tdf1, tdf2], axis = 1)
tdf.insert(2, 'Relationship', 'None')
tdf = tdf.drop_duplicates(subset=['CourseCode1', 'CourseCode2'], keep='first')
tdf = tdf.query("CourseCode1 != CourseCode2")

In [11]:
# merge the similar and not-similar samples into a single table 
cddf = pd.concat([cddf, tdf], axis=0, sort=False)
cddf = cddf.reset_index(drop=True)

# remove duplicates
# (since the sampling is random, it could happen that a pair of courses is both similar and not-similar
# we keep just the similar version of these pairs, since that is the ground truth)
cddf = cddf.drop_duplicates(subset=['CourseCode1', 'CourseCode2'], keep='first')

# shuffle
cddf = cddf.sample(frac = 1)
cddf = cddf.reset_index(drop=True)

Now that we have one dataframe for all course-pairs (both similar and not-similar) we will perform pre-processing on it.

In [12]:
columns = ['CourseContent1', 'CourseName1', 'CourseContent2', 'CourseName2', 'TagValue1', 'TagValue2']

for column in columns:
    # tokenization
    apply_preproc(cddf, column, word_tokenize)
    # to lower
    apply_preproc(cddf, column, to_lower)
    # remove stop words
    apply_preproc(cddf, column, remove_stop_words)
    # remove french stop words
    apply_preproc(cddf, column, remove_french_stop_words)
    # remove punc
    apply_preproc(cddf, column, remove_punc)

# remove most common words
# ONLY for the course content
apply_preproc(cddf, 'CourseContent1', remove_most_common_words)
apply_preproc(cddf, 'CourseContent2', remove_most_common_words)

# merge content and keywords
cddf['CourseContent1'] = cddf['TagValue1'] + cddf['CourseContent1']
cddf['CourseContent2'] = cddf['TagValue2'] + cddf['CourseContent2']
cddf = cddf.drop(columns=['TagValue1', 'TagValue2'])

# merge name and content
cddf['CourseContent1'] = cddf['CourseName1'] + cddf['CourseContent1']
cddf['CourseContent2'] = cddf['CourseName2'] + cddf['CourseContent2']
cddf = cddf.drop(columns=['CourseName1', 'CourseName2'])

Let's check out how our pre-processed dataframe looks like:

In [13]:
cddf.head(10)

Unnamed: 0,CourseCode2,CourseCode1,CourseContent1,Relationship,CourseContent2
0,ME-473,EE-704,"[computational, perception, using, multimodal,...",,"[computational, solid, structural, dynamics, d..."
1,MICRO-420,MICRO-424,"[optics, laboratories, ii, waveguide, fiber, o...",Depends on,"[selected, topics, advanced, optics, optics, c..."
2,CS-101,CS-251,"[theory, computation, théorie, complexité, np-...",Depends on,"[advanced, information, computation, communica..."
3,MICRO-330,ME-402,"[mechanical, engineering, project, ii, communi...",,"[sensors, capteurs, sensors, sensors, characte..."
4,CS-322,CS-449,"[systems, data, science, databases, data-paral...",Depends on,"[introduction, database, systems, database, ma..."
5,HUM-365,HUM-436(a),"[sciences, religions, naturalisme, scientifiqu...",Benefits from,"[sciences, religions, b, créationnismes, evolu..."
6,MICRO-523,MICRO-423,"[optics, laboratories, he-ne, laser, optical, ...",Benefits from,"[optical, detectors, caméras, cmos, photodiode..."
7,COM-401,MATH-409,"[algebraic, curves, cryptography, discrete, lo...",Benefits from,"[cryptography, security, cryptography, secure,..."
8,HUM-245(a),MGT-403,"[economics, innovation, biomedical, industry, ...",,"[economy, innovation, innovation, micro-économ..."
9,MSE-632,CIVIL-604,"[introduction, digital, signal, processing, us...",,"[ccmx, winter, school, nanoparticles, fundamen..."


In [14]:
print(len(cddf))

422


__As we can see, we have 422 course pairs, of which around half are similar and half are not.__ For each course of the pair we have a course code and course content.

## 2. Method

In this section we will describe and create our similarity method.

### 2.1 Word embeddings method

Our method uses word embeddings. The word embeddings we chose are _fasttext_ word2vec embeddings: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens). It's possible to download them [here](https://fasttext.cc/docs/en/english-vectors.html) (download _wiki-news-300d-1M.vec.zip_)



In [15]:
# pretrained word2vec on wikipedia
model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
EMBEDDING_SIZE = 300

Our similarity method is simple. __To establish similarity between two courses, we project both courses into vector space and find the cosine similarity between them.__

To project the course into a vector space, we are using word embeddings. We have a list of 200 to 1000 words related to the course for each course (the column course_content). We get the word2vec word embeddings vector for each word. Then we take the average of the word embeddings for words in course_content, and that is the embedding of the course in vector space.

__Why we think this will work?__ If two words are semantically similar, their word embedding vectors should be close to each other in the vector space. The words in the course_content list should mostly be related to the course. If the take the "course embedding" for each course (which is calculated from that list of words), it should be that similar courses are near to each other in vector space: most of their words in course_content list should be semantically similar.

Now, let's define a function which calculates the average word embeddings from a list of words.

In [16]:
def get_average_vector(words_list):
    """ Function for getting the average word embedding out of list of words """
    
    base = np.zeros(EMBEDDING_SIZE)
    word_vec = 0
    n = 0
    for word in words_list:
        try:
            word_vec = model[word]
            n += 1
        except KeyError:
            word_vec = np.zeros(300)
        base = np.add(base, word_vec)
    base = np.divide(base, n)
    
    return base.reshape(1, -1)

Also, we are going to define a function which returnes cosine similarity between two lists of words, using the above function.

In [17]:
def get_similarity(words1, words2):
    base1 = get_average_vector(words1)
    base2 = get_average_vector(words2)
    sim = cosine_similarity(base1, base2)
    return sim[0][0]

### 2.2 Explanation of most common word removal 

In the pre-processing step, we removed top 100 most common words from our course_content column. Why did we do this?

We want each course_content list to have as much as possible words related to the course, and the least possible amount of words which are not related to the course. Considering a lot of words are related to all courses (such as "course", "project", "theory", etc), these words genereally increase our similarity scores by making all courses more similar to each other, because they all share those words. And even worse, if some course has a smaller proportion of these common words, it will seem less related to other courses just because of that. So we fix this by removing the top 100 most common words found in all courses.

Does this actually help? Let's have a small example.

We have two related and unrelated courses, and the course_content word list for them without the top 100 most common words removed.

In [19]:
# data from two not related courses (CIVIL-552 and EE-201)
unrelated1 = ['Introduction', 'Conceptual', 'seismic', 'design', 'Analysis', 'methods', 'Design', 'evaluation', 'methods', 'Design', 'philosophies', 'Reinforced', 'concrete', 'structures', 'Existing', 'reinforced', 'concrete', 'masonry', 'structuresThis', 'course', 'deals', 'main', 'aspects', 'seismic', 'design', 'buildings', 'bridges', 'It', 'covers', 'different', 'structural', 'design', 'evaluation', 'philosophies', 'new', 'existing', 'reinforced', 'concrete', 'masonry', 'structures']
unrelated2 = ['See', 'French', 'textThis', 'course', 'deals', 'electromagnetism', 'free', 'space', 'continuous', 'media', 'Starting', 'basic', 'principles', 'establish', 'methods', 'solving', 'Maxwell', 'squote/', 'equation', 'vacuum', 'complex', 'material', 'media']

# data from two related courses (CS-328 and CS-440)
related1 = ['This', 'course', 'provides', 'first', 'introduction', 'field', 'numerical', 'analysis', 'strong', 'focus', 'visual', 'computing', 'applications', 'Using', 'examples', 'computer', 'graphics', 'geometry', 'processing', 'computer', 'vision', 'computational', 'photography', 'students', 'gain', 'hands-on', 'experience', 'range', 'essential', 'numerical', 'algorithms', 'The', 'course', 'begin', 'review', 'important', 'considerations', 'regarding', 'floating', 'point', 'arithmetic', 'error', 'propagation', 'numerical', 'computations', 'Following', 'students', 'study', 'experiment', 'several', 'techniques', 'solve', 'systems', 'linear', 'non-linear', 'equations', 'Since', 'many', 'interesting', 'problems', 'solved', 'exactly', 'numerical', 'optimization', 'techniques', 'constitute', 'second', 'major', 'topic', 'course', 'Students', 'learn', 'principal', 'component', 'analysis', 'leveraged', 'compress', 'reduce', 'dimension', 'large', 'datasets', 'make', 'easier', 'store', 'analyze', 'The', 'course', 'concludes', 'review', 'numerical', 'methods', 'make', 'judicious', 'use', 'randomness', 'solve', 'problems', 'would', 'otherwise', 'intractable', 'Students', 'opportunity', 'gain', 'practical', 'experience', 'discussed', 'methods', 'using', 'programming', 'assignments', 'based', 'Scientific', 'Python']
related2 = ['This', 'project-based', 'course', 'students', 'initially', 'receive', 'basic', 'software', 'package', 'lacks', 'rendering-related', 'functionality', 'Over', 'course', 'semester', 'discuss', 'variety', 'concepts', 'tools', 'including', 'basic', 'physical', 'quantities', 'light', 'interacts', 'surfaces', 'solve', 'resulting', 'mathematical', 'problem', 'numerically', 'create', 'realistic', 'images', 'Advanced', 'topics', 'include', 'participating', 'media', 'material', 'models', 'sub-surface', 'light', 'transport', 'Markov', 'Chain', 'Monte', 'Carlo', 'Methods', 'Each', 'major', 'topic', 'accompanied', 'assignment', 'students', 'implement', 'solution', 'algorithms', 'obtain', 'practical', 'experience', 'techniques', 'within', 'software', 'framework', 'Towards', 'end', 'course', 'students', 'realize', 'self-directed', 'final', 'project', 'extends', 'rendering', 'software', 'additional', 'features', 'choosing', 'The', 'objective', 'final', 'project', 'create', 'single', 'image', 'technical', 'artistic', 'merit', 'entered', 'rendering', 'competition', 'judged', 'independent', 'panel', 'computer', 'graphics', 'experts']

# calculate similarity for unrelated
sim1 = get_similarity(unrelated1, unrelated2)
print("Unrelated similarity is " + str(sim1))

# calculate similarity for related
sim2 = get_similarity(related1, related2)
print("Related similarity is " + str(sim2))

print("Difference in similarity is " + str(sim2 - sim1))

Unrelated similarity is 0.8689054222463344
Related similarity is 0.9740207131280505
Difference in similarity is 0.10511529088171612


 Now, we will see how do the similarity scores change when we remove the top 100 most common words.

In [20]:
unrelated1 = remove_most_common_words(unrelated1)
unrelated2 = remove_most_common_words(unrelated2)
related1 = remove_most_common_words(related1)
related2 = remove_most_common_words(related2)

# calculate similarity for unrelated
sim1 = get_similarity(unrelated1, unrelated2)
print("Unrelated similarity is " + str(sim1))

# calculate similarity for related
sim2 = get_similarity(related1, related2)
print("Related similarity is " + str(sim2))

print("Difference in similarity is " + str(sim2 - sim1))

Unrelated similarity is 0.8056721208269843
Related similarity is 0.9634055755047493
Difference in similarity is 0.157733454677765


As you can see, we managed to increase the difference to 0.15. This is just a small example, but it does give credit to the idea that we should remove the top 100 most common words.

Finally, we will have a human pick the words which seem most relevant to the course from the course_content list (before top 100 most common removal). 

In [21]:
unrelated1_ = ['Conceptual', 'seismic', 'design', 'philosophies', 'concrete', 'structures', 'reinforced', 'masonry', 'seismic', 'design', 'buildings', 'bridges', 'structural', 'design', 'reinforced', 'masonry', 'structures']
unrelated2_ = ['electromagnetism', 'free', 'space', 'continuous', 'solving', 'Maxwell', 'equation', 'vacuum', 'complex', 'material']

related1_ = ['numerical', 'analysis', 'visual', 'computing', 'computer', 'graphics', 'geometry', 'processing', 'computer', 'vision', 'computational', 'photography', 'numerical', 'algorithms', 'floating', 'point', 'arithmetic', 'error', 'propagation', 'numerical', 'computations', 'experiment', 'techniques', 'solve', 'systems', 'linear', 'non-linear', 'equations', 'numerical', 'optimization', 'techniques', 'leveraged', 'compress', 'reduce', 'dimension', 'large', 'datasets', 'store', 'analyze', 'numerical', 'methods', 'randomness', 'solve', 'problems', 'Scientific', 'Python']
related2_ = ['project-based', 'software', 'package', 'rendering-related', 'physical', 'quantities', 'light', 'interacts', 'surfaces', 'mathematical', 'problem', 'numerically', 'create', 'realistic', 'images', 'models', 'sub-surface', 'Markov', 'Chain', 'Monte', 'Carlo', 'implement', 'solution', 'algorithms', 'practical', 'experience', 'software', 'framework', 'self-directed', 'rendering', 'software', 'features', 'project', 'computer', 'graphics',]

# calculate similarity for unrelated
sim1 = get_similarity(unrelated1_, unrelated2_)
print("Unrelated similarity is " + str(sim1))

# calculate similarity for related
sim2 = get_similarity(related1_, related2_)
print("Related similarity is " + str(sim2))

print("Difference in similarity is " + str(sim2 - sim1))

Unrelated similarity is 0.7204567820856593
Related similarity is 0.9433678431101818
Difference in similarity is 0.22291106102452252


The human managed to have the difference at 0.22, so there is possibly room for improvement. However, the removal of top 100 words seems to help our method, so we will keep it.

## 3. Evaluation

Now that we have our method, we have to evaluate it. We have the ground truth (we know which courses truly are similar to each other), so we will compare our method's results with those.

In the previous chapter, we established a method for measuring the similarity of two courses. But this method gives us a continuous value between 0 and 1 - and we want to have a discrete value: related or not related. We need one hyperparameter: threshold. If the similarity is above the threshold, the courses are related, and if it is below, they are not.

To calculate the best threshold we will use cross-validation. We will split the dataset: 70% for the train set and 30% for the test set. We will find the best value for our hyperparamater, the threshold, on the test set. It will be the threshold for which the F-score is the highest. Then, we will evaluate our method on the test set.

In [22]:
# split data in train and test set
train = cddf.sample(frac=0.7)
test = cddf.drop(train.index)

train = train.reset_index()
test = test.reset_index()

Now we will use the train set to find the threshold value.

In [23]:
# similarity threshold will be between 0 and 1 
thresholds = np.linspace(0, 1, num=101)

best_f1_score = 0
best_threshold = 0

# for EVERY value of threshold we can have
# on the train set
# calculate cosine similarity between courses descriptions using word2vec
# take word embedding of each word in text, then find their average
for threshold in thresholds:
    
    t_p = 0
    t_n = 0
    f_p = 0
    f_n = 0
    
    for k in range(len(train)):
        list1 = train.iloc[k]['CourseContent1']
        list2 = train.iloc[k]['CourseContent2']

        sim = get_similarity(list1, list2)

        if sim>=threshold:
            if train.iloc[k]['Relationship']!='None':
                t_p += 1
            else:
                f_p += 1
        else:
            if train.iloc[k]['Relationship']!='None':
                f_n += 1
            else:
                t_n += 1
    
    precision = 0
    recall = 0
    f1_score = 0
    if t_p + f_p != 0:
        precision = t_p / (t_p + f_p)
    if t_p + f_n != 0:
        recall = t_p / (t_p + f_n)
    if precision + recall != 0:
        f1_score = (2 * (precision * recall)) / (precision + recall)
    
    if f1_score > best_f1_score:
        best_f1_score = f1_score
        best_threshold = threshold

In [24]:
THRESHOLD = best_threshold
print(THRESHOLD)

0.89


Now that we have threshold value, we can evaluate our method on the test set.

In [25]:
def print_eval(t_p, t_n, f_p, f_n):
    accuracy = (t_p + t_n) / (t_p + t_n + f_p + f_n)
    precision = t_p / (t_p + f_p)
    recall = t_p / (t_p + f_n)
    f1_score = (2 * (precision * recall)) / (precision + recall)

    print(str(t_p) + " | " + str(f_p))
    print(str(f_n) + " | " + str(t_n))

    print("Accuracy: " + str(accuracy))
    print("Precision: " + str(precision))
    print("Recall: " + str(recall))
    print("################")
    print("F1-score: " + str(f1_score))

In [26]:
t_p = 0
t_n = 0
f_p = 0
f_n = 0

f_ns = []
f_ps = []

# calculate cosine similarity between courses descriptions using word2vec
# take word embedding of each word in text, then find their average
for k in range(len(test)):
    list1 = test.iloc[k]['CourseContent1']
    list2 = test.iloc[k]['CourseContent2']
    
    sim = get_similarity(list1, list2)
    
    if sim>=THRESHOLD:
        if test.iloc[k]['Relationship']!='None':
            t_p += 1
        else:
            f_p += 1
    else:
        if test.iloc[k]['Relationship']!='None':
            f_n += 1
        else:
            t_n += 1
    
print_eval(t_p, t_n, f_p, f_n)

42 | 12
11 | 62
Accuracy: 0.8188976377952756
Precision: 0.7777777777777778
Recall: 0.7924528301886793
################
F1-score: 0.7850467289719626


As you can see, our method manages to predict whether the two courses are related quite well. 

The number of false positives is larger than the number of false negatives because the not-related courses in our dataframe might actually be related (we got not-related by just taking courses which are not in the "related courses" dataset, but it could be that some courses which are related were not noted by the professors which made the related-courses list.)  