**Calculate Course Similarity using BoW Features**

Measuring similarity between items is the foundation of many recommendation algorithms, especially content-based ones. For example, if a new course is similar to a user's enrolled courses, we could recommend that course to the user. Or, if user A is similar to user B, we can recommend some of user B's courses that user A has not yet seen, because the two users may have similar interests.


Objectives
- Calculate the similarity between any two courses using BoW feature vectors

In a previous course, you learned several similarity measurements such as cosine similarity, the Jaccard index, and Euclidean distance. These methods operate on either two vectors or two sets (and sometimes on matrices or tensors).
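
As a quick reminder, these three measurements are commonly defined as follows for two vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$ and two sets $A$, $B$. The symbols are introduced here for illustration only and are not used elsewhere in this lab.

$$
\operatorname{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert},
\qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

Cosine similarity and the Jaccard index grow as two items become more alike, whereas Euclidean distance shrinks, so a distance has to be inverted or rescaled before it can be read as a similarity.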

In previous labs, we extracted BoW features from the courses' textual content. Given these BoW feature vectors, we can apply a similarity measurement to calculate the similarity between courses, as shown in the figure below.

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/course_sim.png)

In [None]:
#!pip install nltk==3.6.7
#!pip install gensim==4.1.2

In [None]:
# import required libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import gensim
import nltk

from scipy.spatial.distance import cosine
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams
from gensim import corpora

%matplotlib inline

In [None]:
# also set a random state
rs = 123

In [None]:
# Suppose we have two simple example courses:
course1 = "machine learning for everyone"
course2 = "machine learning for beginners"

# tokenize them using the split() method
# (or use the word_tokenize() function provided by nltk)
tokens = set(course1.split() + course2.split())
tokens = list(tokens)
tokens

In [None]:
# generate BoW features (token counts) for these two courses
# (or use the Dictionary.doc2bow() method provided by gensim)
# Note: despite its name, this returns a dense vector aligned with `tokens`
def generate_sparse_bow(course):
    bow_vector = []
    words = course.split()
    for token in tokens:
        # count how often the token occurs in this course (0 if absent)
        bow_vector.append(words.count(token))
    return bow_vector

bow1 = generate_sparse_bow(course1)
print(bow1)
bow2 = generate_sparse_bow(course2)
print(bow2)
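
The vectors above store one count for every vocabulary token, including zeros. As a minimal sketch of the alternative, a sparse BoW keeps only the tokens that actually occur, as `(token id, count)` pairs, which is the shape gensim's `doc2bow()` returns. The helpers `build_token2id` and `to_sparse_bow` are hypothetical names used for illustration, and the second sample text repeats a word on purpose so that a count above 1 is visible.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc.split():
            token2id.setdefault(token, len(token2id))
    return token2id

def to_sparse_bow(doc, token2id):
    """Return sorted (token id, count) pairs for the tokens present in doc."""
    counts = Counter(doc.split())
    return sorted((token2id[t], c) for t, c in counts.items() if t in token2id)

# Toy documents (assumed for illustration, not taken from the course dataset)
docs = ["machine learning for everyone",
        "machine learning for machine beginners"]
token2id = build_token2id(docs)
sparse_bows = [to_sparse_bow(doc, token2id) for doc in docs]

print(token2id)
print(sparse_bows)
```

Tokens that do not occur in a document are simply absent from its list, which keeps storage proportional to the document length rather than to the vocabulary size.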

In [None]:
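# apply the cosine similarity measurement on the two vectors
# (scipy's cosine() returns a distance, so similarity = 1 - distance):
cos_sim = 1 - cosine(bow1, bow2)
print(f"The cosine similarity between course `{course1}` and course `{course2}` is {cos_sim:.0%}")

# Try other similarity measurements
# such as Euclidean distance or the Jaccard index.
from scipy.spatial.distance import euclidean
# Euclidean is a distance (0 means identical), so subtracting 1 does not give a similarity.
# One common convention maps it into (0, 1] with 1 / (1 + distance).
euc_dist = euclidean(bow1, bow2)
euc_sim = 1 / (1 + euc_dist)
print(f"Euclidean distance: {euc_dist:.2f}, similarity: {euc_sim:.0%}")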
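
The Jaccard index mentioned in the comment works on sets rather than vectors, so it can be applied to the token sets directly. The following is a minimal sketch using only built-in Python sets; `jaccard_index` is a hypothetical helper, and returning 0.0 for two empty texts is an assumption made here to avoid a division by zero.

```python
def jaccard_index(text_a, text_b):
    """Share of distinct tokens that the two texts have in common."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty texts get a similarity of 0.0
        return 0.0
    return len(set_a & set_b) / len(union)

jac_sim = jaccard_index("machine learning for everyone",
                        "machine learning for beginners")
print(f"Jaccard index: {jac_sim:.0%}")
```

The two titles share 3 of their 5 distinct tokens, so the index is 0.6. Unlike the count-based vectors, this measure ignores how often a token appears.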


In [None]:
# Load the BoW features as Pandas dataframe
bows_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/courses_bows.csv"
bows_df = pd.read_csv(bows_url)
bows_df = bows_df[['doc_id', 'token', 'bow']]
bows_df.head(10)

In [None]:
# Load the course dataframe
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_df = pd.read_csv(course_url)
course_df.head(10)

In [None]:
course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

In [None]:
ml_course = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
ml_course

In [None]:
ml_courseT = ml_course.pivot(index=['doc_id'], columns='token').reset_index(level=[0])
ml_courseT

In [None]:
def pivot_two_bows(basedoc, comparedoc):
    base = basedoc.copy()
    base['type'] = 'base'
    compare = comparedoc.copy()
    compare['type'] = 'compare'
    # Append the two token sets vertically
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    join = pd.concat([base, compare])
    # Pivot the two joined courses
    joinT = join.pivot(index=['doc_id', 'type'], columns='token').fillna(0).reset_index(level=[0, 1])
    # Assign columns
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]
    return joinT

In [None]:
course1 = bows_df[bows_df['doc_id'] == 'ML0151EN']
course2 = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
bow_vectors = pivot_two_bows(course1, course2)
bow_vectors

In [None]:
similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
similarity
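
Pivoting is convenient for inspecting the vectors, but the same cosine similarity can also be computed straight from long-format `(doc_id, token, bow)` rows. The sketch below assumes each course is held as a `{token: count}` dict. The rows are made-up toy data, not values from `courses_bows.csv`, and `rows_to_bow` and `sparse_cosine` are hypothetical helpers.

```python
import math

def rows_to_bow(rows, doc_id):
    """Collect the {token: count} mapping for one document."""
    return {token: bow for d, token, bow in rows if d == doc_id}

def sparse_cosine(bow_a, bow_b):
    """Cosine similarity of two {token: count} dicts; 0.0 if either is empty."""
    # Only tokens present in both documents contribute to the dot product
    dot = sum(count * bow_b.get(token, 0) for token, count in bow_a.items())
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy long-format rows: (doc_id, token, bow)
toy_rows = [
    ("C1", "machine", 2), ("C1", "learning", 1),
    ("C2", "machine", 1), ("C2", "python", 1),
]
toy_sim = sparse_cosine(rows_to_bow(toy_rows, "C1"), rows_to_bow(toy_rows, "C2"))
print(round(toy_sim, 4))
```

Because absent tokens count as zero, this gives the same value as filling the pivoted table with `fillna(0)`, without materialising the zero entries.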

In [None]:
# WRITE YOUR CODE HERE
## For each course other than ML0101ENv3, use pivot_two_bows to convert it, together with course ML0101ENv3, into two horizontal BoW feature vectors
## Then use the cosine method to calculate the similarity
## Report all courses with similarities larger than a specific threshold (such as 0.5)
base_id = 'ML0101ENv3'
threshold = 0.5
# the base course does not change, so select it once outside the loop
base_course = bows_df[bows_df['doc_id'] == base_id]
for course_id in course_df['COURSE_ID'].unique():
    # skip the base course itself instead of filtering on similarity < 1
    if course_id == base_id:
        continue
    compare_course = bows_df[bows_df['doc_id'] == course_id]
    # skip courses that have no BoW rows (the pivot would yield a single row)
    if compare_course.empty:
        continue
    bow_vectors = pivot_two_bows(base_course, compare_course)
    similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
    if similarity > threshold:
        print(f"similarity score of {base_id} with {course_id} : {similarity:.2%}")
    
    