<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Calculate Course Similarity using BoW Features**


Estimated time needed: **45** minutes


Similarity measurement between items is the foundation of many recommendation algorithms, especially content-based ones. For example, if a new course is similar to courses a user has already enrolled in, we could recommend that new course to the user. Or, if user A is similar to user B, we can recommend to user A some of user B's courses that user A has not yet seen, because the two users may have similar interests.


In a previous course, you learned several similarity measurements such as `cosine` similarity, the `jaccard index`, and `euclidean distance`. These methods operate on either two vectors or two sets (and sometimes on matrices or tensors).

In previous labs, we extracted the BoW features from the courses' textual content. Given the course BoW feature vectors, we can easily apply a similarity measurement to calculate the course similarity, as shown in the figure below.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/course_sim.png)


## Objectives


After completing this lab you will be able to:


* Calculate the similarity between any two courses using BoW feature vectors


----


## Prepare and set up the lab environment


First let's install and import required libraries:


In [None]:
#!pip install nltk==3.6.7
#!pip install gensim==4.1.2

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import gensim
import nltk

from scipy.spatial.distance import cosine
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams
from gensim import corpora

%matplotlib inline

In [2]:
# also set a random state
rs = 123

### Calculate the cosine similarity between two example courses


Suppose we have two simple example courses:


In [3]:
course1 = "machine learning for everyone"

In [4]:
course2 = "machine learning for beginners"

Next we can quickly tokenize them using the `split()` method (or the `word_tokenize()` function provided by `nltk`, as we did in the previous lab).


In [10]:
tokens = set(course1.split() + course2.split())
tokens

{'beginners', 'everyone', 'for', 'learning', 'machine'}

In [11]:
tokens = list(tokens)
tokens

['for', 'everyone', 'machine', 'beginners', 'learning']

Then we generate BoW features (token counts) for these two courses (or use the `tokens_dict.doc2bow()` method provided by `gensim`, similar to what we did in the previous lab).


In [7]:
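def generate_sparse_bow(course):
    # Build a 0/1 vector over the global `tokens` list, in that list's order
    bow_vector = []
    words = course.split()
    for token in tokens:
        # 1 if the token occurs in the course title, else 0
        if token in words:
            bow_vector.append(1)
        else:
            bow_vector.append(0)
    return bow_vector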

In [8]:
bow1 = generate_sparse_bow(course1)
bow1

[1, 1, 1, 0, 1]

In [9]:
bow2 = generate_sparse_bow(course2)
bow2

[1, 0, 1, 1, 1]

From the above cell outputs, we can see the two vectors are very similar: they differ in only two of the five dimensions (the positions for `everyone` and `beginners`).
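
Note that `generate_sparse_bow` records only whether a token is present (0 or 1). That coincides with the token count here because no word repeats in either title. As a minimal sketch of a count-based variant, assuming a hypothetical helper name `generate_count_bow` and a made-up title with a repeated word (neither is part of the lab):

```python
from collections import Counter


def generate_count_bow(course, vocabulary):
    """Return the count of each vocabulary token in the course text."""
    counts = Counter(course.split())
    # A Counter returns 0 for tokens that do not occur in the text
    return [counts[token] for token in vocabulary]


# Hypothetical vocabulary and title, chosen only to show a repeated token
vocabulary = ["for", "everyone", "machine", "beginners", "learning"]
count_bow = generate_count_bow(
    "machine learning for machine learning beginners", vocabulary
)
print(count_bow)  # [1, 0, 2, 1, 2]
```

For titles without repeated words, this variant produces the same vectors as the presence-based function.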


Now we can quickly apply the cosine similarity measurement on the two vectors:


In [12]:
cos_sim = 1 - cosine(bow1, bow2)

In [14]:
cos_sim


0.7499999999999999

In [13]:
print(f"The cosine similarity between course `{course1}` and course `{course2}` is {round(cos_sim, 2) * 100}%")

The cosine similarity between course `machine learning for everyone` and course `machine learning for beginners` is 75.0%
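
To see where the 75% comes from, the cosine similarity can be computed by hand as the dot product divided by the product of the two vector norms. The sketch below hard-codes the two vectors from the cell outputs above; the variable names are illustrative only:

```python
import math

# Binary BoW vectors copied from the outputs of bow1 and bow2
v1 = [1, 1, 1, 0, 1]
v2 = [1, 0, 1, 1, 1]

# Dot product: number of tokens the two titles share
dot = sum(a * b for a, b in zip(v1, v2))

# Euclidean norms: each title has four distinct tokens, so each norm is 2
norm1 = math.sqrt(sum(a * a for a in v1))
norm2 = math.sqrt(sum(b * b for b in v2))

manual_cos_sim = dot / (norm1 * norm2)
print(manual_cos_sim)  # 0.75
```

`scipy.spatial.distance.cosine` returns the cosine *distance*, which is why the lab computes `1 - cosine(bow1, bow2)`; the `0.7499999999999999` shown earlier differs from 0.75 only by floating-point rounding.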


_Practice: Try other similarity measurements such as Euclidean Distance or Jaccard index._


In [15]:
# WRITE YOUR CODE HERE
from scipy.spatial.distance import euclidean
euq = euclidean(bow1,bow2)
euq


1.4142135623730951

For example, the Euclidean distance between two $n$-dimensional points $p$ and $q$ is given by $d(p,q)=\sqrt{\sum_{i=1}^{n}(p_{i}-q_{i})^{2}}$. You can use the `euclidean(p, q)` function from the `scipy` package to calculate it. For `bow1` and `bow2`, which differ by 1 in exactly two dimensions, this gives $\sqrt{2} \approx 1.414$.
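
For the Jaccard index, one possible sketch treats each course title as a set of tokens and divides the size of the intersection by the size of the union. The helper name `jaccard_index` is an assumption made for illustration:

```python
def jaccard_index(text_a, text_b):
    """Jaccard index of the token sets of two whitespace-separated texts."""
    set_a = set(text_a.split())
    set_b = set(text_b.split())
    union = set_a | set_b
    if not union:
        # Two empty texts: define the index as 0.0 to avoid dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)


jac = jaccard_index("machine learning for everyone",
                    "machine learning for beginners")
print(jac)  # 0.6
```

The two titles share three of five distinct tokens, hence 3/5.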


### TASK: Find similar courses to the course `Machine Learning with Python`


Now you have learned how to calculate cosine similarity between two sample BoW feature vectors. Let's work on some real course BoW feature vectors.


In [16]:
# Load the BoW features as Pandas dataframe
bows_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/courses_bows.csv"
bows_df = pd.read_csv(bows_url)
bows_df = bows_df[['doc_id', 'token', 'bow']]

In [17]:
bows_df.head(10)

Unnamed: 0,doc_id,token,bow
0,ML0201EN,ai,2
1,ML0201EN,apps,2
2,ML0201EN,build,2
3,ML0201EN,cloud,1
4,ML0201EN,coming,1
5,ML0201EN,create,1
6,ML0201EN,data,1
7,ML0201EN,developer,1
8,ML0201EN,found,1
9,ML0201EN,fun,1


The `bows_df` dataframe contains the BoW feature vectors for each course in a vertical and dense format, with one row per course-token pair. It has three columns: `doc_id` is the course ID, `token` is the token value, and `bow` is the BoW value (token count).


Then, let's load another course content dataset which contains the course title and description:


In [18]:
# Load the course dataframe
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_df = pd.read_csv(course_url)

In [19]:
course_df.head(10)

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...
1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...
2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...
3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...
5,CNSC02EN,cloud native security conference data security,introduction to data security on cloud
6,DX0106EN,data science bootcamp with r for university pr...,a multi day intensive in person data science ...
7,GPXX0FTCEN,learn how to use docker containers for iterati...,learn how to use docker containers for iterati...
8,RAVSCTEST1,scorm test 1,scron test course
9,GPXX06RFEN,create your first mongodb database,in this guided project you will get started w...


Given course ID `ML0101ENv3`, let's find out its title and description:


In [20]:
course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


We can see it is a machine learning course taught with Python, so we can expect other machine learning or Python-related courses to be similar.


Then, let's print its associated BoW features:


In [21]:
ml_course = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
ml_course

Unnamed: 0,doc_id,token,bow
2747,ML0101ENv3,course,1
2748,ML0101ENv3,learning,4
2749,ML0101ENv3,machine,3
2750,ML0101ENv3,need,1
2751,ML0101ENv3,get,1
2752,ML0101ENv3,started,1
2753,ML0101ENv3,python,2
2754,ML0101ENv3,tool,1
2755,ML0101ENv3,tools,1
2756,ML0101ENv3,predict,1


We can see the BoW feature vector is in vertical format, but feature vectors are normally in horizontal format. A simple `transpose()` does not produce the layout we want; one way to convert the feature vector from vertical to horizontal is the Pandas `pivot()` method:


In [23]:
# simple transpose does not result in what we want
ml_course.transpose()

Unnamed: 0,2747,2748,2749,2750,2751,2752,2753,2754,2755,2756,2757,2758,2759,2760,2761,2762,2763,2764,2765
doc_id,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3,ML0101ENv3
token,course,learning,machine,need,get,started,python,tool,tools,predict,free,hidden,insights,beneficial,future,trends,give,supervised,unsupervised
bow,1,4,3,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1


In [22]:
ml_courseT = ml_course.pivot(index=['doc_id'], columns='token').reset_index(level=[0])
ml_courseT

Unnamed: 0_level_0,doc_id,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow
token,Unnamed: 1_level_1,beneficial,course,free,future,get,give,hidden,insights,learning,machine,need,predict,python,started,supervised,tool,tools,trends,unsupervised
0,ML0101ENv3,1,1,1,1,1,1,1,1,4,3,1,1,2,1,1,1,1,1,1
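
The column header above has two levels because `pivot()` was called without a `values` argument. As a small sketch on made-up toy data (the `toy` dataframe is an assumption, not part of the course dataset), passing `values='bow'` yields a single-level header with one column per token:

```python
import pandas as pd

# Toy long-format BoW table for a single hypothetical course "C1"
toy = pd.DataFrame({
    "doc_id": ["C1", "C1", "C1"],
    "token": ["machine", "learning", "python"],
    "bow": [3, 4, 2],
})

# One row per course, one column per token (columns come out sorted)
wide = toy.pivot(index="doc_id", columns="token", values="bow")
print(wide)
```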


To compare the BoWs of any two courses, which normally have different sets of tokens, we need to create a union token set and then pivot both courses onto it. We have provided a function called `pivot_two_bows` as follows:


In [30]:
def pivot_two_bows(basedoc, comparedoc):
    # Work on copies so the input dataframes are not modified
    base = basedoc.copy()
    base['type'] = 'base'
    compare = comparedoc.copy()
    compare['type'] = 'compare'
    # Append the two token sets vertically
    join = pd.concat([base, compare])
    # Pivot the two joined courses; tokens missing from one course get a count of 0
    joinT = join.pivot(index=['doc_id', 'type'], columns='token').fillna(0).reset_index(level=[0, 1])
    # Flatten the two-level column header to plain token names
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]
    return joinT
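
The idea behind `pivot_two_bows` can also be sketched without Pandas: collect the union of both token sets, then read each course's count for every token in that union, using 0 where a token is absent. The helper `align_bows` and the token counts below are hypothetical, for illustration only:

```python
def align_bows(bow_a, bow_b):
    """Align two token->count dicts on their union vocabulary."""
    vocab = sorted(set(bow_a) | set(bow_b))
    # A token missing from a course contributes a count of 0
    vec_a = [bow_a.get(token, 0) for token in vocab]
    vec_b = [bow_b.get(token, 0) for token in vocab]
    return vocab, vec_a, vec_b


# Made-up counts for two courses with partly different tokens
vocab, vec_a, vec_b = align_bows(
    {"machine": 3, "learning": 4, "python": 2},
    {"machine": 4, "learning": 5, "r": 2},
)
print(vocab)  # ['learning', 'machine', 'python', 'r']
print(vec_a)  # [4, 3, 2, 0]
print(vec_b)  # [5, 4, 0, 2]
```

Once both vectors share the same token order and length, any vector similarity measure can be applied to them.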

In [31]:
course1 = bows_df[bows_df['doc_id'] == 'ML0151EN']
course2 = bows_df[bows_df['doc_id'] == 'ML0101ENv3']

In [63]:
course1

Unnamed: 0,doc_id,token,bow
3512,ML0151EN,learn,1
3513,ML0151EN,course,1
3514,ML0151EN,learning,5
3515,ML0151EN,machine,4
3516,ML0151EN,using,1
3517,ML0151EN,r,2
3518,ML0151EN,basics,1
3519,ML0151EN,language,1
3520,ML0151EN,programming,1
3521,ML0151EN,statistical,1


In [32]:
bow_vectors = pivot_two_bows(course1, course2)
bow_vectors

Unnamed: 0,doc_id,type,approachable,basics,beneficial,comparison,course,dives,free,future,...,relates,started,statistical,supervised,tool,tools,trends,unsupervised,using,vs
0,ML0101ENv3,compare,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
1,ML0151EN,base,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


Similarly, we can use the cosine method to calculate their similarity:


In [33]:
bow_vectors.iloc[1, 2:]

approachable    1.0
basics          1.0
beneficial      0.0
comparison      1.0
course          1.0
dives           1.0
free            0.0
future          0.0
get             0.0
give            0.0
hidden          0.0
insights        0.0
known           1.0
language        1.0
learn           1.0
learning        5.0
look            1.0
machine         4.0
modeling        1.0
need            0.0
predict         0.0
programming     1.0
python          0.0
r               2.0
relates         1.0
started         0.0
statistical     1.0
supervised      1.0
tool            0.0
tools           0.0
trends          0.0
unsupervised    1.0
using           1.0
vs              1.0
Name: 1, dtype: object

In [34]:
similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
similarity

0.6626221399549089

Now it's your turn to perform a task of finding all courses similar to the course `Machine Learning with Python`:


In [35]:
course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


You can set a similarity threshold such as 0.5 to determine if two courses are similar enough.


_TODO: Find courses that are similar to the course `Machine Learning with Python (ML0101ENv3)`, and show the titles and descriptions of those courses._


In [60]:
course_IDs = course_df.COURSE_ID.to_list()
course_IDs.remove("ML0101ENv3")
course_IDs

['ML0201EN',
 'ML0122EN',
 'GPXX0ZG0EN',
 'RP0105EN',
 'GPXX0Z2PEN',
 'CNSC02EN',
 'DX0106EN',
 'GPXX0FTCEN',
 'RAVSCTEST1',
 'GPXX06RFEN',
 'GPXX0SDXEN',
 'CC0271EN',
 'WA0103EN',
 'DX0108EN',
 'GPXX0PICEN',
 'DAI101EN',
 'GPXX0W7KEN',
 'GPXX0QR3EN',
 'BD0145EN',
 'HCC105EN',
 'DE0205EN',
 'DS0132EN',
 'OS0101EN',
 'DS0201EN',
 'BENTEST4',
 'CC0210EN',
 'PA0103EN',
 'HCC104EN',
 'GPXX0A1YEN',
 'TMP0105EN',
 'PA0107EN',
 'DB0113EN',
 'PA0109EN',
 'PHPM002EN',
 'GPXX03HFEN',
 'RP0103',
 'RP0103EN',
 'BD0212EN',
 'GPXX0IBEN',
 'SECM03EN',
 'SC0103EN',
 'GPXX0YXHEN',
 'RP0151EN',
 'TA0105',
 'SW0201EN',
 'TMP0106',
 'GPXX0BUBEN',
 'ST0201EN',
 'ST0301EN',
 'SW0101EN',
 'TMP0101EN',
 'DW0101EN',
 'BD0143EN',
 'WA0101EN',
 'GPXX04HEEN',
 'BD0141EN',
 'CO0401EN',
 'ML0122ENv1',
 'BD0151EN',
 'TA0106EN',
 'TMP107',
 'ML0111EN',
 'GPXX048OEN',
 'CO0201EN',
 'GPXX01DCEN',
 'GPXX04XJEN',
 'GPXX0JZ4EN',
 'GPXX0ZYVEN',
 'GPXX0ZMZEN',
 'GPXX0742EN',
 'GPXX0KV4EN',
 'GPXX01RYEN',
 'CC0120EN',
 'QC01

In [66]:
# Look up the base course and keep its title and ID
course1 = course_df[course_df['COURSE_ID'] == 'ML0101ENv3']
print(type(course1))
course1_name = course1["TITLE"].item()
print(course1_name)
course_1_ID = course1["COURSE_ID"].item()

# The base course BoW rows do not change, so select them once
base_bow = bows_df[bows_df['doc_id'] == course_1_ID]
threshold = 0.5
similar_courses_ID = []
for course_ID in course_IDs:
    compare_bow = bows_df[bows_df['doc_id'] == course_ID]
    # Align the two courses on their union token set (one row per course)
    bow_vectors = pivot_two_bows(base_bow, compare_bow)
    similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
    if similarity > threshold:
        similar_courses_ID.append(course_ID)


<class 'pandas.core.frame.DataFrame'>
machine learning with python

In [71]:
course_df[course_df['COURSE_ID'].isin(similar_courses_ID)]

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
157,ML0109EN,machine learning dimensionality reduction,machine learning dimensionality reduction
200,ML0151EN,machine learning with r,this machine learning with r course dives into...
259,excourse46,machine learning,machine learning is the science of getting com...
260,excourse47,machine learning for all,machine learning often called artificial inte...
273,excourse60,introduction to tensorflow for artificial inte...,if you are a software developer who wants to b...


In [None]:
# WRITE YOUR CODE HERE

## For each course other than ML0101ENv3, use pivot_two_bows to convert it, together with course ML0101ENv3, into two horizontal BoW feature vectors
## Then use the cosine method to calculate the similarity
## Report all courses with similarities larger than a specific threshold (such as 0.5)


<details>
    <summary>Click here for Hints</summary>
    
You can use `bows_df[bows_df['doc_id'] == 'ML0101ENv3']` to find the BoW of course 'ML0101ENv3'. In a similar manner, you can find the BoW of each course_id other than 'ML0101ENv3'. Then you can join the two BoWs using the predefined `pivot_two_bows` function and calculate the similarity with the cosine method, as we just did. Print the course IDs with similarity > 0.5.
</details>
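
As an alternative to looping over course pairs, all similarities to a base course can be computed in one step by pivoting the long-format table into a course-by-token matrix. The following is a minimal sketch on made-up toy data; the `toy_bows` dataframe and the course IDs `A`, `B`, and `C` are assumptions, and it presumes every course has at least one token so that no row norm is zero:

```python
import numpy as np
import pandas as pd

# Toy long-format BoW table with three hypothetical courses
toy_bows = pd.DataFrame({
    "doc_id": ["A", "A", "B", "B", "C", "C"],
    "token": ["machine", "learning", "machine", "python", "cloud", "security"],
    "bow": [2, 1, 1, 1, 1, 1],
})

# One row per course, one column per token, 0 where a token is absent
matrix = toy_bows.pivot_table(index="doc_id", columns="token",
                              values="bow", fill_value=0)

# Scale each row to unit length so a dot product equals cosine similarity
vectors = matrix.to_numpy(dtype=float)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine similarity of every course to base course "A"
base_row = unit[matrix.index.get_loc("A")]
sims = pd.Series(unit @ base_row, index=matrix.index)

# Drop the base course itself and keep courses above the threshold
similar = sims.drop("A")
similar = similar[similar > 0.5].sort_values(ascending=False)
print(similar)
```

On the toy data only course `B` passes the 0.5 threshold, because it shares the token `machine` with `A`, while `C` shares no tokens at all.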


### Summary


Congratulations, you have finished the course similarity lab. In this lab, you used cosine similarity and course BoW features to calculate the similarities among courses. Such similarity measurements are at the core of many content-based recommender systems, which you will learn about and practice in later labs.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
