
# Calculate Course Similarity using BoW Features

Similarity measurement underpins both kinds of recommendation used here. If a new course is similar to courses a user has already enrolled in, we can recommend that course to the user. Likewise, if user A is similar to user B, we can recommend to user A the courses that user B has taken and user A has not yet seen, because the two users likely share interests.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module\_2/images/course_sim.png)

## Prepare and set up the lab environment

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import gensim
import nltk

from scipy.spatial.distance import cosine
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams
from gensim import corpora

%matplotlib inline

In [2]:
# Mount drive
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# random state
rs = 123

## Code

In [4]:
# Load the BoW features as Pandas dataframe
bows_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/courses_bows.csv"
bows_df = pd.read_csv(bows_url)
bows_df = bows_df[['doc_id', 'token', 'bow']]

In [5]:
bows_df.head(10)

,doc_id,token,bow
0,ML0201EN,ai,2
1,ML0201EN,apps,2
2,ML0201EN,build,2
3,ML0201EN,cloud,1
4,ML0201EN,coming,1
5,ML0201EN,create,1
6,ML0201EN,data,1
7,ML0201EN,developer,1
8,ML0201EN,found,1
9,ML0201EN,fun,1
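
For orientation, the sketch below shows how rows in this long `(doc_id, token, bow)` format could be produced from raw text. It is only an illustration: the helper `text_to_bow_rows`, the course ID `DEMO01`, and the plain whitespace tokenization are assumptions made here, not the preprocessing that generated `courses_bows.csv`.

```python
from collections import Counter

import pandas as pd


def text_to_bow_rows(doc_id, text):
    """Return one (doc_id, token, bow) row per distinct token in `text`."""
    # Hypothetical tokenization: lowercase, then split on whitespace
    counts = Counter(text.lower().split())
    return pd.DataFrame(
        [(doc_id, token, n) for token, n in sorted(counts.items())],
        columns=["doc_id", "token", "bow"],
    )


rows = text_to_bow_rows("DEMO01", "build ai apps build cloud ai")
print(rows)
```

Each distinct token gets one row, and `bow` holds the number of times it occurs in the document.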


In [6]:
# Load the course dataframe
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_df = pd.read_csv(course_url)

In [7]:
course_df.head(10)

,COURSE_ID,TITLE,DESCRIPTION
0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...
1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...
2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...
3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...
5,CNSC02EN,cloud native security conference data security,introduction to data security on cloud
6,DX0106EN,data science bootcamp with r for university pr...,a multi day intensive in person data science ...
7,GPXX0FTCEN,learn how to use docker containers for iterati...,learn how to use docker containers for iterati...
8,RAVSCTEST1,scorm test 1,scron test course
9,GPXX06RFEN,create your first mongodb database,in this guided project you will get started w...


Two courses normally have different sets of tokens. To compare their BoWs, we first take the union of the two token sets and then pivot each BoW into a row vector over that shared set, filling tokens that a course lacks with 0. The helper `pivot_two_bows` does this:

In [9]:
def pivot_two_bows(basedoc, comparedoc):
    base = basedoc.copy()
    base['type'] = 'base'
    compare = comparedoc.copy()
    compare['type'] = 'compare'
    # Append the two token sets vertically
    join = pd.concat([base, compare])
    # Pivot the two joined courses
    joinT = join.pivot(index=['doc_id', 'type'], columns='token').fillna(0).reset_index(level=[0, 1])
    # Assign columns
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]
    return joinT
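
The effect of the pivot is easiest to see on a toy input. The sketch below uses two made-up courses (`A` and `B`, hypothetical data with the same columns as `bows_df`) and a simplified pivot indexed on `doc_id` only. It illustrates the idea behind `pivot_two_bows` and is not a replacement for it.

```python
import pandas as pd

# Two toy BoW tables (hypothetical data, same columns as bows_df)
doc_a = pd.DataFrame({"doc_id": "A", "token": ["data", "python", "learning"], "bow": [2, 1, 1]})
doc_b = pd.DataFrame({"doc_id": "B", "token": ["data", "r", "learning"], "bow": [1, 1, 3]})

# Stack the long-format rows, then pivot so that each course becomes one row
# and each token in the union of both token sets becomes one column
stacked = pd.concat([doc_a, doc_b])
wide = stacked.pivot(index="doc_id", columns="token", values="bow").fillna(0)
print(wide)
```

Tokens that appear in only one course (`python`, `r`) come out of the pivot as missing values, which `fillna(0)` turns into zero counts so both rows have the same length.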

## Code Test

In [10]:
# Pivot two courses into aligned BoW row vectors
course1 = bows_df[bows_df['doc_id'] == 'ML0151EN']
course2 = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
bow_vectors = pivot_two_bows(course1, course2)
bow_vectors

,doc_id,type,approachable,basics,beneficial,comparison,course,dives,free,future,...,relates,started,statistical,supervised,tool,tools,trends,unsupervised,using,vs
0,ML0101ENv3,compare,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
1,ML0151EN,base,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


In [11]:
# Cosine similarity = 1 - cosine distance
similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
similarity

0.6626221399549089
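
`scipy.spatial.distance.cosine` returns a cosine *distance*, which is why the cell above subtracts it from 1. As a sanity check, the sketch below computes the similarity directly from its definition, the dot product divided by the product of the vector norms, on two made-up count vectors (`u` and `v` are illustrative values, not course data).

```python
import numpy as np
from scipy.spatial.distance import cosine

# Two toy BoW count vectors over the same token order (hypothetical values)
u = np.array([2.0, 1.0, 1.0, 0.0])
v = np.array([1.0, 3.0, 0.0, 1.0])

# Definition: cos(u, v) = (u . v) / (||u|| * ||v||)
manual = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# SciPy returns the distance 1 - cos(u, v), so convert it back
via_scipy = 1 - cosine(u, v)

print(manual, via_scipy)
```

For non-negative count vectors such as BoWs, the similarity lies between 0 (no shared tokens) and 1 (identical direction).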

Next, find all courses similar to the course `Machine Learning with Python` (`ML0101ENv3`).

In [14]:
mlwp = course_df[course_df['COURSE_ID']=='ML0101ENv3']
mlwp

,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


In [18]:
similar_courses = {
    'course_id':[],
    'title':[],
    'description':[],
    'similarity':[]
}
# Compare every course in course_df against ML0101ENv3 (this includes
# ML0101ENv3 itself, which scores 1.0): use pivot_two_bows to align the two
# BoWs as row vectors, compute the cosine similarity, and keep the courses
# whose similarity is at least the threshold.
threshold = 0.5
course1 = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
for cid in course_df['COURSE_ID']:
    course2 = bows_df[bows_df['doc_id'] == cid]
    bow_vectors = pivot_two_bows(course1, course2)
    similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
    if similarity >= threshold:
        row = course_df[course_df['COURSE_ID'] == cid]
        similar_courses['course_id'].append(cid)
        similar_courses['title'].append(row['TITLE'].item())
        similar_courses['description'].append(row['DESCRIPTION'].item())
        similar_courses['similarity'].append(similarity)

output = pd.DataFrame(similar_courses)

In [22]:
output.head(10)

,course_id,title,description,similarity
0,ML0109EN,machine learning dimensionality reduction,machine learning dimensionality reduction,0.521749
1,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...,1.0
2,ML0151EN,machine learning with r,this machine learning with r course dives into...,0.662622
3,excourse46,machine learning,machine learning is the science of getting com...,0.612054
4,excourse47,machine learning for all,machine learning often called artificial inte...,0.634755
5,excourse60,introduction to tensorflow for artificial inte...,if you are a software developer who wants to b...,0.54904
