# Creating a Similarity Method

In this notebook we will create a method for quantifying the similarity between two courses. To check how well the method works, we use a small ground-truth dataset: a list of course pairs that professors identified as similar.

In [2]:
%matplotlib notebook

# general
import re, itertools
import pandas as pd
import numpy as np
import string

#sklearn
from sklearn.metrics.pairwise import cosine_similarity

#gensim
import gensim
from gensim.models import Doc2Vec
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument

#nltk
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/rinjac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/rinjac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 1. Pre-processing

### 1.1 Data cleaning

First, let's load the datasets. We have three:
1. __course_dependencies__ - Contains the ground truth (pairs of __some courses__ that professors identified as similar)
2. __course_descriptions__ - Contains the course name, description, summary, and other data for __all courses__
3. __course_keywords__ - Contains keywords for each course (keywords are usually the main concepts taught in the course)

In [66]:
cddf = pd.read_csv('datasets/course_dependencies.csv', index_col=0, keep_default_na=False)
tdf = pd.read_csv('datasets/course_desc.csv', index_col=0, keep_default_na=False)
kwdf = pd.read_csv('datasets/course_keywords.csv', index_col=0, keep_default_na=False)

Fixing minor inconsistencies in the datasets.

In [67]:
# in a lot of database rows, the English course title ('CourseNameEN') is missing,
# but the 'CourseNameFR' field actually holds the title in English,
# so we fall back to it
cddf['CourseName1'] = cddf.apply(lambda row: row['CourseNameFR1'] if not row['CourseNameEN1'] else row['CourseNameEN1'], axis=1)
cddf['CourseName2'] = cddf.apply(lambda row: row['CourseNameFR2'] if not row['CourseNameEN2'] else row['CourseNameEN2'], axis=1)
tdf['CourseName'] = tdf.apply(lambda row: row['CourseNameFR'] if not row['CourseNameEN'] else row['CourseNameEN'], axis=1)

# we drop the columns that are no longer needed
cddf = cddf.drop(columns=['CourseNameEN1', 'CourseNameEN2', 'CourseNameFR1', 'CourseNameFR2'])
tdf = tdf.drop(columns=['CourseNameEN', 'CourseNameFR'])

# if the course name is still empty, we replace it with a single space
cddf['CourseName1'] = cddf['CourseName1'].apply(lambda x: " " if not x else x)
cddf['CourseName2'] = cddf['CourseName2'].apply(lambda x: " " if not x else x)
tdf['CourseName'] = tdf['CourseName'].apply(lambda x: " " if not x else x)

# concat the course content and summary
# (separated by a space so the last word of the content does not fuse with the first word of the summary)
tdf['CourseContent'] = tdf['CourseContent'] + ' ' + tdf['SummaryEN']
tdf = tdf.drop(columns=['SummaryEN'])

cddf['CourseContent1'] = cddf['CourseContent1'] + ' ' + cddf['SummaryEN1']
cddf['CourseContent2'] = cddf['CourseContent2'] + ' ' + cddf['SummaryEN2']
cddf = cddf.drop(columns=['SummaryEN1', 'SummaryEN2'])

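# remove rows which still have null values
# (with keep_default_na=False, empty cells are read as '' rather than NaN, so this is only a safeguard)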
cddf = cddf.dropna()
tdf = tdf.dropna()
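
The name fallback above can also be written without a row-wise `apply`. The following is a minimal sketch on an invented toy frame (the rows are made up; only the column names come from the notebook), using `Series.where` to keep the English name where it is non-empty and fall back to the FR field otherwise:

```python
import pandas as pd

# Toy frame: the column names mirror the notebook, the rows are invented.
toy = pd.DataFrame({
    "CourseNameEN": ["Machine learning", "", ""],
    "CourseNameFR": ["Apprentissage automatique", "Data visualization", ""],
})

# Keep the English name where it is non-empty, otherwise use the FR field.
has_en = toy["CourseNameEN"].str.len() > 0
toy["CourseName"] = toy["CourseNameEN"].where(has_en, toy["CourseNameFR"])

# Names that are still empty become a single-space placeholder, as in the cell above.
toy["CourseName"] = toy["CourseName"].replace("", " ")

print(toy["CourseName"].tolist())
```

On a table of this size the difference is mostly one of readability; the `apply` version in the cell above does the same thing.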

Merging the keyword dataset into the other two datasets.

In [68]:
# group keywords by course code
kwdf = kwdf.groupby('CourseCode')['TagValue'].agg(lambda col: ' '.join(col))
kwdf = pd.DataFrame(kwdf)

# add keywords for TDF dataframe
tdf = tdf.set_index(keys=['CourseCode'], drop=True)
tdf = tdf.merge(kwdf, on=['CourseCode'])
tdf = tdf.reset_index()

# add keywords to the first course
cddf = cddf.rename(columns={'CourseCode1': 'CourseCode'})
cddf = cddf.set_index(keys=['CourseCode'], drop=True)
cddf = cddf.merge(kwdf, on=['CourseCode'])
cddf = cddf.reset_index()
cddf = cddf.rename(columns={'CourseCode': 'CourseCode1', 'TagValue': 'TagValue1'})

# add keywords to the second course
cddf = cddf.rename(columns={'CourseCode2': 'CourseCode'})
cddf = cddf.set_index(keys=['CourseCode'], drop=True)
cddf = cddf.merge(kwdf, on=['CourseCode'])
cddf = cddf.reset_index()
cddf = cddf.rename(columns={'CourseCode': 'CourseCode2', 'TagValue': 'TagValue2'})
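
One thing worth keeping in mind about the merges above: `merge` performs an inner join by default, so courses without any keywords are dropped. The sketch below illustrates this on invented data (the course codes and keywords are made up) and shows a left join as one possible alternative if those courses should be kept:

```python
import pandas as pd

# Invented rows; the column names follow the notebook.
keywords = pd.DataFrame({
    "CourseCode": ["CS-101", "CS-101", "MATH-201"],
    "TagValue": ["algorithms", "data structures", "linear algebra"],
})
courses = pd.DataFrame({
    "CourseCode": ["CS-101", "MATH-201", "PHYS-301"],
    "CourseName": ["Programming", "Algebra", "Optics"],
})

# One row per course, keywords joined into a single space-separated string.
per_course = keywords.groupby("CourseCode")["TagValue"].agg(" ".join).reset_index()

# Inner join (the default): PHYS-301 has no keywords and is dropped.
inner = courses.merge(per_course, on="CourseCode")

# Left join keeps every course; missing keywords become an empty string.
left = courses.merge(per_course, on="CourseCode", how="left").fillna({"TagValue": ""})

print(inner["CourseCode"].tolist())
print(left["TagValue"].tolist())
```

Whether dropping keyword-less courses is desirable depends on how much the similarity method relies on the keywords.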

Defining pre-processing functions we are going to use:

In [69]:
def remove_nonascii(s):
    return s.encode('ascii', 'ignore').decode("utf-8")

def remove_newline(s):
    return s.replace('\n','')

def remove_squote(s):
    return s.replace('<squote/>',' ')

stop_words = stopwords.words('english')
def remove_stop_words(tokens):
    return [word for word in tokens if word not in stop_words]

french_stop_words = stopwords.words('french')
def remove_french_stop_words(tokens):
    return [word for word in tokens if word not in french_stop_words]

punc = list(string.punctuation)
def remove_punc(tokens):
    return [word for word in tokens if word not in punc]

def to_lower(tokens):
    return [token.lower() for token in tokens]

def apply_preproc(df, column, func): 
    df[column] = df[column].apply(func)
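
To see how these helpers compose, here is a small sketch that chains the same kinds of steps on one invented sentence. So that it runs without NLTK data, it uses hypothetical stand-ins: a whitespace tokenizer instead of `word_tokenize` and a tiny hand-written stop-word set instead of `stopwords.words(...)`. It also replaces newlines with a space, whereas `remove_newline` above uses an empty string.

```python
import string

# Hypothetical stand-ins so the sketch runs without NLTK data.
TOY_STOP_WORDS = {"the", "of", "and", "to", "a"}
PUNCT = set(string.punctuation)

def clean_text(s):
    # String-level cleaning: markup residue, newlines, non-ASCII characters.
    s = s.replace("<squote/>", " ").replace("\n", " ")
    return s.encode("ascii", "ignore").decode("ascii")

def clean_tokens(tokens):
    # Token-level cleaning: lowercase first, then drop stop words and punctuation.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in TOY_STOP_WORDS and t not in PUNCT]

raw = "Introduction to the Théorie\nof Signals , and Systems"
tokens = clean_tokens(clean_text(raw).split())
print(tokens)
```

Lowercasing has to happen before stop-word removal, because the stop-word lists are lowercase; the pre-processing loop later in the notebook follows the same order.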

### 1.2 Finding most common words

For reasons explained below (in the Example section), we will write code that finds the N most common words in the course descriptions, along with a pre-processing function that removes them.

In [32]:
ndf = tdf.copy()

# join all course descriptions into one string
# (joined with a space so that the last word of one description
# does not fuse with the first word of the next)
all_content = ' '.join(tdf['CourseContent'])
                    
all_content = word_tokenize(all_content)
all_content = to_lower(all_content)
all_content = remove_stop_words(all_content)
all_content = remove_punc(all_content)

freq_d = dict()
for w in all_content:
    freq_d[w] = 1 + freq_d.get(w, 0)

In [33]:
freq = [(freq_d[key], key) for key in freq_d]
# sort by frequency, highest first
freq.sort(reverse=True)

In [34]:
most_common_f = freq[:100]
_, most_common = [list(tup) for tup in zip(*most_common_f)]
print(most_common)

['course', 'students', 'semicolon/', 'design', 'analysis', 'squote/', 'methods', 'systems', 'introduction', 'project', 'basic', '2', 'research', 'energy', 'techniques', '3', 'student', 'concepts', 'work', 'development', 'theory', 'materials', 'field', 'models', 'social', 'tools', 'different', 'linear', 'also', 'well', 'applications', 'urban', '4', 'study', 'understand', 'use', 'part', 'new', 'engineering', 'theoretical', 'principles', 'management', 'various', 'knowledge', 'using', 'data', 'system', '5', 'approach', 'properties', '1', 'structures', 'main', 'practical', 'process', 'understanding', 'science', "''", 'topics', 'fundamental', '``', 'specific', 'learning', 'issues', 'processes', 'elements', 'structure', 'model', 'microscopy', 'application', '--', 'electron', 'control', 'aspects', 'power', 'teaching', 'physics', 'one', 'examples', 'time', 'technology', 'related', 'first', 'architecture', '6', 'scientific', 'information', 'studies', 'semester', 'types', 'modeling', 'constructio

In [35]:
def remove_most_common_words(tokens):
    return [word for word in tokens if word not in most_common]
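
The frequency count and the removal function can be sketched more compactly with `collections.Counter`. The token list and the cutoff of 2 below are invented for illustration (the notebook uses 100), and a set is used for the membership test because lookups in a list are linear in its length:

```python
from collections import Counter

# Invented token stream standing in for the tokenized course descriptions.
tokens = ["course", "energy", "course", "design", "energy", "course", "optics"]

counts = Counter(tokens)
top_n = 2  # hypothetical cutoff; the notebook uses 100
most_common_set = {word for word, _ in counts.most_common(top_n)}

def remove_most_common(tokens, common=most_common_set):
    # Set membership keeps the filter fast even for long documents.
    return [t for t in tokens if t not in common]

print(counts.most_common(top_n))
print(remove_most_common(tokens))
```

Note that ties are broken differently: sorting `(count, word)` tuples in reverse orders equal counts reverse-alphabetically, while `Counter.most_common` keeps first-seen order, so the words right at the cutoff may differ.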

### 1.3 Final pre-processing

We need two types of course-pairs: 
1. the courses which are similar
2. the courses which are not similar

We have 1. from the course_dependencies dataset (the ground truth). To build 2., we randomly select course pairs from all of the courses and keep only those that do not appear in the list of similar course pairs defined by 1.

After we get two dataframes (one for course-pairs which are similar and one for ones which are not) we will merge them into a single dataframe.

In [70]:
# take a random sample of the course descriptions table (used for making not-similar samples)
tdf1 = tdf.sample(n = 250)
tdf2 = tdf.sample(n = 250)
tdf1 = tdf1.rename(index=str, columns={"CourseCode": "CourseCode1", "CourseName": "CourseName1", "CourseContent": "CourseContent1", "TagValue": "TagValue1"})
tdf2 = tdf2.rename(index=str, columns={"CourseCode": "CourseCode2", "CourseName": "CourseName2", "CourseContent": "CourseContent2", "TagValue": "TagValue2"})
tdf1 = tdf1.reset_index(drop=True)
tdf2 = tdf2.reset_index(drop=True)

# create the table of samples which are not-similar 
tdf = pd.concat([tdf1, tdf2], axis = 1)
tdf.insert(2, 'Relationship', 'None')
tdf = tdf.drop_duplicates(subset=['CourseCode1', 'CourseCode2'], keep='first')
tdf = tdf.query("CourseCode1 != CourseCode2")

In [71]:
# merge the similar and not-similar samples into a single table 
cddf = pd.concat([cddf, tdf], axis=0, sort=False)
cddf = cddf.reset_index(drop=True)

# remove duplicates
# (since the sampling is random, a pair of courses could end up both similar and not-similar;
# we keep just the similar version of these pairs, since that is the ground truth.
# keep='first' does this because the similar pairs come first in the concatenation)
cddf = cddf.drop_duplicates(subset=['CourseCode1', 'CourseCode2'], keep='first')

# shuffle
cddf = cddf.sample(frac = 1)
cddf = cddf.reset_index(drop=True)
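
The same idea, sampling random pairs and excluding known similar ones, can be sketched in plain Python. This is an illustrative variant rather than the notebook's method: the course codes and pairs are invented, the function name `sample_negative_pairs` is hypothetical, and, unlike the `drop_duplicates` call above, it treats pairs as unordered, so (B, A) is excluded when (A, B) is a known similar pair.

```python
import random

# Invented course codes and ground-truth pairs.
codes = ["A", "B", "C", "D", "E"]
similar_pairs = {("A", "B"), ("C", "D")}

# Compare pairs regardless of order, so (B, A) counts as the known pair (A, B).
known = {frozenset(p) for p in similar_pairs}

def sample_negative_pairs(codes, known, n, seed=0):
    """Draw up to n distinct course pairs that are not in the known-similar set."""
    rng = random.Random(seed)
    candidates = [
        (a, b)
        for i, a in enumerate(codes)
        for b in codes[i + 1:]
        if frozenset((a, b)) not in known
    ]
    return rng.sample(candidates, min(n, len(candidates)))

negatives = sample_negative_pairs(codes, known, n=4)
print(negatives)
```

Enumerating every candidate pair is quadratic in the number of courses, which is fine for a toy list but may be slow for a full catalogue; the notebook's approach of pairing two random samples avoids that cost.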

Now that we have one dataframe for all course-pairs (both similar and not-similar) we will perform pre-processing on it.

In [72]:
columns = ['CourseContent1', 'CourseName1', 'CourseContent2', 'CourseName2', 'TagValue1', 'TagValue2']

for column in columns:
    # tokenization
    apply_preproc(cddf, column, word_tokenize)
    # to lower
    apply_preproc(cddf, column, to_lower)
    # remove stop words
    apply_preproc(cddf, column, remove_stop_words)
    # remove french stop words
    apply_preproc(cddf, column, remove_french_stop_words)
    # remove punc
    apply_preproc(cddf, column, remove_punc)

# remove most common words
# ONLY for the course content
apply_preproc(cddf, 'CourseContent1', remove_most_common_words)
apply_preproc(cddf, 'CourseContent2', remove_most_common_words)

# merge content and keywords
cddf['CourseContent1'] = cddf['TagValue1'] + cddf['CourseContent1']
cddf['CourseContent2'] = cddf['TagValue2'] + cddf['CourseContent2']
cddf = cddf.drop(columns=['TagValue1', 'TagValue2'])

# merge name and content
cddf['CourseContent1'] = cddf['CourseName1'] + cddf['CourseContent1']
cddf['CourseContent2'] = cddf['CourseName2'] + cddf['CourseContent2']
cddf = cddf.drop(columns=['CourseName1', 'CourseName2'])
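
The merge steps above rely on the fact that, after tokenization, each cell holds a Python list, so `+` between two columns concatenates the lists row by row. A minimal sketch with one invented row:

```python
import pandas as pd

# One invented row; each cell is a list of tokens, as after word_tokenize.
toy = pd.DataFrame({
    "CourseName1": [["data", "visualization"]],
    "TagValue1": [["d3", "charts"]],
    "CourseContent1": [["perception", "design"]],
})

# `+` on object columns applies Python's `+` row by row, so lists are concatenated.
toy["CourseContent1"] = toy["TagValue1"] + toy["CourseContent1"]
toy["CourseContent1"] = toy["CourseName1"] + toy["CourseContent1"]

print(toy.loc[0, "CourseContent1"])
```

The resulting order is name, then keywords, then content, which matches the token lists shown in the output of the pre-processed dataframe.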

Let's check what our pre-processed dataframe looks like:

In [73]:
cddf.head(10)

| | CourseCode2 | CourseCode1 | CourseContent1 | Relationship | CourseContent2 |
|---|---|---|---|---|---|
| 0 | CS-323 | CS-323(a) | [operating, systems, implementation, linux, op... | Depends on | [introduction, operating, systems, operating, ... |
| 1 | CS-486 | COM-480 | [data, visualization, data, science, data, viz... | Benefits from | [human, computer, interaction, user, experienc... |
| 2 | MATH-106(e) | DH-403 | [measuring, literature, distant, reading, digi... | | [analysis, ii, lagrange, multipliers, differen... |
| 3 | CH-313 | CH-411 | [cellular, signalling, protein, modifications,... | Benefits from | [biochemistry, ii, cofacteur, metabolism, cata... |
| 4 | ME-232 | FIN-404 | [derivatives, derivatives, évaluation, arbitra... | | [mechanics, structures, gm, structures, equili... |
| 5 | CS-473 | CS-309 | [projet, systems-on-chip, oscilloscope, embedd... | Prepares for | [embedded, systems, embedded, systems, microco... |
| 6 | PHYS-106(h) | MGT-408 | [technology, policy, energy, transition, marke... | | [general, physics, ii, rigid, body, équation, ... |
| 7 | ENG-445 | MICRO-270 | [chemistry, surfaces, caractérisation, surface... | | [building, energetics, energy, flows, building... |
| 8 | HUM-216 | CIVIL-443 | [advanced, composites, engineering, structures... | | [philosophy, science, creationism, natural, se... |
| 9 | PHYS-312 | PHYS-311 | [nuclear, particle, physics, physique, hautes,... | Prepares for | [nuclear, particle, physics, ii, nuclear, phys... |
