<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Content-based Course Recommender System using Course Similarities**


Estimated time needed: **45** minutes


In one of the previous labs, you learned and practiced how to calculate the similarity between two courses using Bag of Words (BoW) features. For example, the similarity between course1 `Machine Learning for Everyone` and course2 `Machine Learning for Beginners` is `75%`, as shown below.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/course_sim.png)


As we mentioned before, a content-based recommender system relies heavily on the similarity between items, which is measured from the content or features of those items. Course genres are one important type of feature, and BoW values are another, representing a course's textual content.


In this lab, you will use the course similarity matrix to recommend new courses that are similar to a user's currently enrolled courses.


## Objectives


After completing this lab you will be able to:


* Obtain the similarity between courses from a course similarity matrix
* Use the course similarity matrix to find and recommend new courses which are similar to enrolled courses


----


## Prepare and set up the lab environment


Let's first install and import the required libraries:


In [None]:
!pip install seaborn
!pip install numpy
!pip install pandas
!pip install matplotlib

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

%matplotlib inline

In [3]:
# also set a random state
rs = 123

Next, let's load a pre-made course similarity matrix. If you are interested, you could easily calculate such a similarity matrix by iterating through all possible course pairs and calculating their similarities.
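As a rough illustration of that pairwise approach, the sketch below builds a tiny similarity matrix from BoW counts. It assumes cosine similarity (the lab does not state which measure produced `sim.csv`), and the `toy_bow_df` frame and the `build_similarity_matrix` helper are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy BoW data in the same long format as courses_bows.csv
toy_bow_df = pd.DataFrame({
    "doc_index": [0, 0, 1, 1, 2],
    "token": ["machine", "learning", "machine", "learning", "cloud"],
    "bow": [1, 1, 2, 2, 3],
})

def build_similarity_matrix(bow_df):
    # Pivot to a dense (course x token) count matrix; missing tokens become 0
    counts = bow_df.pivot_table(index="doc_index", columns="token",
                                values="bow", fill_value=0).to_numpy(dtype=float)
    n = counts.shape[0]
    sim = np.zeros((n, n))
    # Iterate through all course pairs (assumption: cosine similarity)
    for i in range(n):
        for j in range(n):
            norm = np.linalg.norm(counts[i]) * np.linalg.norm(counts[j])
            sim[i, j] = counts[i] @ counts[j] / norm if norm > 0 else 0.0
    return sim

toy_sim = build_similarity_matrix(toy_bow_df)
print(np.round(toy_sim, 2))
```

Courses 0 and 1 use the same tokens in the same proportions, so their similarity is 1, while course 2 shares no tokens with them and scores 0.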


In [4]:
sim_url = "sim.csv"

In [None]:
sim_df = pd.read_csv(sim_url)
sim_df

The similarity matrix is a real-valued, symmetric matrix in which each element is the similarity value (ranging from 0 to 1) between course index `i` and course index `j`.
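These properties can be checked directly. The sketch below uses a small hypothetical matrix, `toy_sim`, in place of `sim_df.to_numpy()`:

```python
import numpy as np

# Hypothetical 3x3 similarity matrix standing in for sim_df.to_numpy()
toy_sim = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.0],
    [0.1, 0.0, 1.0],
])

# Symmetric: sim[i][j] equals sim[j][i] for every pair
is_symmetric = np.allclose(toy_sim, toy_sim.T)
# Bounded: every value lies in [0, 1]
in_unit_range = bool(((toy_sim >= 0) & (toy_sim <= 1)).all())
print(is_symmetric, in_unit_range)
```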


We can use `seaborn` to visualize the similarity matrix. Since it is symmetric, we only need to show the lower-left triangle:


In [None]:
# Configure seaborn to set the plot style to 'white'
sns.set_theme(style="white")

# Create a mask for the upper triangle of the similarity matrix
mask = np.triu(np.ones_like(sim_df, dtype=bool))

# Create a new figure and axis for the heatmap
_, ax = plt.subplots(figsize=(11, 9))

# Create a diverging color palette for the heatmap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Plot a similarity heat map using seaborn's heatmap function
sns.heatmap(sim_df, mask=mask, cmap=cmap, vmin=0.01, vmax=1, center=0,
            square=True)


As we can see from the heatmap, there are many hot spots, which means many courses are similar to each other. Such patterns suggest that it is possible to build a recommender system based on course similarities.


Let's take a look at a quick example:


In [49]:
# Let's first load the course content and BoW dataset
course_url = "course_processed.csv"
course_df = pd.read_csv(course_url)
bow_url = "courses_bows.csv"
bow_df = pd.read_csv(bow_url)

In [107]:
bow_df.head()

|Unnamed: 0|doc_index|doc_id|token|bow|
|-|-|-|-|-|
|0|0|ML0201EN|ai|2|
|1|0|ML0201EN|apps|2|
|2|0|ML0201EN|build|2|
|3|0|ML0201EN|cloud|1|
|4|0|ML0201EN|coming|1|


First, note that the matrix indices are course indices (such as `0, 1, 2, 3`). We often need to look up the actual course ids (such as `ML0151EN` and `ML0101ENv3`) from course indices and vice versa.


Based on the `doc_index` and `doc_id` columns, we create an index-to-id mapping and an id-to-index mapping as two Python dictionaries for later queries:


In [105]:
# Create course id to index and index to id mappings
def get_doc_dicts(bow_df):
    # Group the DataFrame by course index and ID, and get the maximum value for each group
    grouped_df = bow_df.groupby(['doc_index', 'doc_id']).max().reset_index(drop=False)
    # Create a dictionary mapping indices to course IDs
    idx_id_dict = grouped_df[['doc_id']].to_dict()['doc_id']
    # Create a dictionary mapping course IDs to indices
    id_idx_dict = {v: k for k, v in idx_id_dict.items()}
    # Clean up temporary DataFrame
    del grouped_df
    return idx_id_dict, id_idx_dict
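To see what the two dictionaries look like, here is a minimal sketch on a toy frame. It builds the mappings with `drop_duplicates` and `zip` instead of `groupby`, which is an alternative formulation rather than the lab's own code; `toy_bow_df` and `build_doc_dicts` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy BoW frame with the same doc_index / doc_id columns
toy_bow_df = pd.DataFrame({
    "doc_index": [0, 0, 1, 2, 2],
    "doc_id": ["ML0201EN", "ML0201EN", "ML0122EN", "RP0105EN", "RP0105EN"],
    "token": ["ai", "apps", "machine", "r", "data"],
    "bow": [2, 2, 1, 1, 3],
})

def build_doc_dicts(bow_df):
    # Keep one row per course, then pair each index with its id
    pairs = bow_df[["doc_index", "doc_id"]].drop_duplicates()
    idx_id = dict(zip(pairs["doc_index"], pairs["doc_id"]))
    # Invert the mapping to look up an index from a course id
    id_idx = {doc_id: idx for idx, doc_id in idx_id.items()}
    return idx_id, id_idx

toy_idx_id, toy_id_idx = build_doc_dicts(toy_bow_df)
print(toy_idx_id)
print(toy_id_idx)
```

Because this version reads the index from the `doc_index` column itself, it does not depend on the rows being sorted or the indices being contiguous.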

Now suppose we have two example courses:


In [52]:
course1 = course_df[course_df['COURSE_ID'] == "ML0151EN"]
course1

|Unnamed: 0|COURSE_ID|TITLE|DESCRIPTION|
|-|-|-|-|
|200|ML0151EN|machine learning with r|this machine learning with r course dives into the basics of machine learning using an approachable and well known programming language you ll learn about supervised vs unsupervised learning look into how statistical modeling relates to machine learning and do a comparison of each|


In [11]:
course2 = course_df[course_df['COURSE_ID'] == "ML0101ENv3"]
course2

|Unnamed: 0|COURSE_ID|TITLE|DESCRIPTION|
|-|-|-|-|
|158|ML0101ENv3|machine learning with python|machine learning can be an incredibly benefici...|


From their titles, we can see that both are about machine learning, so they should be very similar to each other. Let's find their similarity in the similarity matrix.

With their course ids, we can use the `id_idx_dict` dictionary to query their row and column index on the similarity matrix:


In [104]:
idx_id_dict, id_idx_dict = get_doc_dicts(bow_df)
idx1 = id_idx_dict["ML0151EN"]
idx2 = id_idx_dict["ML0101ENv3"]
print(f"Course 1's index is {idx1} and Course 2's index is {idx2}")

Course 1's index is 200 and Course 2's index is 158


Then we can locate their similarity value in row 200 and col 158, `sim_matrix[200][158]`:


In [45]:
sim_matrix = sim_df.to_numpy()

In [46]:
sim = sim_matrix[idx1][idx2]
sim

0.6626221399549089

It's about 66%, meaning these two courses are quite similar to each other.


### TASK: Find courses which are similar enough to your enrolled courses.


Now you know how to easily use the pre-computed similarity matrix to query the similarity between any two courses. Do you want to make some course recommendations for yourself?

Let's assume you are an end user of the online course platform and have already audited or completed some courses. You would now expect the system to recommend similar courses based on your enrollment history.


From the full course list, choose any courses that interest you, such as the machine learning courses:


In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', None)
course_df[['COURSE_ID', 'TITLE']]

In [71]:
# Reset pandas settings
pd.reset_option('display.max_rows')
pd.reset_option('max_colwidth')

_TODO: Browse the course list and choose the courses that interest you_


In [55]:
enrolled_course_ids = [ ] # add the ids of the courses that interest you to the list

In [72]:
enrolled_courses = course_df[course_df['COURSE_ID'].isin(enrolled_course_ids)]
enrolled_courses

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION


Given the full course list, we can find those unselected courses:


In [73]:
all_courses = set(course_df['COURSE_ID'])

In [None]:
unselected_course_ids = all_courses.difference(enrolled_course_ids)
unselected_course_ids

Now, you can iterate over each unselected course and check whether it is similar enough to any of your selected courses. If the similarity is larger than a threshold such as 0.5 or 0.6, add the course to your recommendation list:


_TODO: Complete the following method to recommend courses which are similar to your enrolled courses_


In [112]:
def generate_recommendations_for_one_user(enrolled_course_ids, unselected_course_ids, id_idx_dict, sim_matrix):
    # Create a dictionary to store your recommendation results
    res = {}
    # Set a threshold for similarity
    threshold = 0.6 
    # Iterate over enrolled courses
    for enrolled_course in enrolled_course_ids:
        # Iterate over unselected courses
        for unselect_course in unselected_course_ids:
            # Check if both enrolled and unselected courses exist in the id_idx_dict
            if enrolled_course in id_idx_dict and unselect_course in id_idx_dict:
                # Find the two indices for the enrolled_course and the unselect_course, based on their ids
                # e.g., Course ML0151EN's index is 200 and Course ML0101ENv3's index is 158
                idx1 = id_idx_dict[enrolled_course]
                idx2 = id_idx_dict[unselect_course]
                # Find the similarity value from the sim_matrix
                sim = sim_matrix[idx1][idx2]
                # Check if the similarity exceeds the threshold
                if sim > threshold:
                    # Update recommendation dictionary with course ID and similarity score
                    if unselect_course not in res:
                        # If the unselected course is not already in the recommendation dictionary (`res`), add it.
                        res[unselect_course] = sim
                    else:
                        # If the unselected course is already in the recommendation dictionary (`res`), compare the similarity score.
                        # If the current similarity score is greater than or equal to the existing similarity score for the course,
                        # update the similarity score in the recommendation dictionary (`res`) with the current similarity score.
                        if sim >= res[unselect_course]:
                            res[unselect_course] = sim
                            
    # Sort the results by similarity
    res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1], reverse=True)}
    # Return the recommendation dictionary
    return res
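The nested loops above can also be expressed with NumPy indexing. The following is a sketch of that alternative, assuming a small hypothetical similarity matrix and id mapping; `recommend_vectorized`, `toy_sim`, and `toy_id_idx` are illustrative names, not part of the lab.

```python
import numpy as np

# Hypothetical toy data: 4 courses and their pairwise similarities
toy_id_idx = {"A": 0, "B": 1, "C": 2, "D": 3}
toy_sim = np.array([
    [1.0, 0.8, 0.3, 0.65],
    [0.8, 1.0, 0.2, 0.1],
    [0.3, 0.2, 1.0, 0.7],
    [0.65, 0.1, 0.7, 1.0],
])

def recommend_vectorized(enrolled_ids, unselected_ids, id_idx_dict, sim_matrix, threshold=0.6):
    # Keep only ids that exist in the mapping
    enrolled = [c for c in enrolled_ids if c in id_idx_dict]
    candidates = [c for c in unselected_ids if c in id_idx_dict]
    if not enrolled or not candidates:
        return {}
    rows = [id_idx_dict[c] for c in enrolled]
    cols = [id_idx_dict[c] for c in candidates]
    # For each candidate, take its highest similarity to any enrolled course
    best = sim_matrix[np.ix_(rows, cols)].max(axis=0)
    res = {c: float(s) for c, s in zip(candidates, best) if s > threshold}
    # Sort by similarity, highest first
    return dict(sorted(res.items(), key=lambda item: item[1], reverse=True))

toy_recs = recommend_vectorized(["A"], ["B", "C", "D"], toy_id_idx, toy_sim)
print(toy_recs)
```

Taking the column-wise maximum matches the loop version's rule of keeping the highest score when a candidate is similar to more than one enrolled course.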

<details>
    <summary>Click here for Hints</summary>
    
You can find the indices of the courses by using the `id_idx_dict` dictionary, for example `id_idx_dict[enrolled_course]`. Then use `sim_matrix` to find the similarity of the courses, as shown earlier in the lab.

</details>


For reference, an id-to-index mapping such as `id_idx_dict` looks like this (output truncated):

{'ML0201EN': 0,
 'ML0122EN': 1,
 'GPXX0ZG0EN': 2,
 'RP0105EN': 3,
 'GPXX0Z2PEN': 4,
 'CNSC02EN': 5,
 'DX0106EN': 6,
 'GPXX0FTCEN': 7,
 'RAVSCTEST1': 8,
 'GPXX06RFEN': 9,
 'GPXX0SDXEN': 10,
 'CC0271EN': 11,
 'WA0103EN': 12,
 'DX0108EN': 13,
 'GPXX0PICEN': 14,
 'DAI101EN': 15,
 'GPXX0W7KEN': 16,
 'GPXX0QR3EN': 17,
 'BD0145EN': 18,
 'HCC105EN': 19,
 'DE0205EN': 20,
 'DS0132EN': 21,
 'OS0101EN': 22,
 'DS0201EN': 23,
 'BENTEST4': 24,
 'CC0210EN': 25,
 'PA0103EN': 26,
 'HCC104EN': 27,
 'GPXX0A1YEN': 28,
 'TMP0105EN': 29,
 'PA0107EN': 30,
 'DB0113EN': 31,
 'PA0109EN': 32,
 'PHPM002EN': 33,
 'GPXX03HFEN': 34,
 'RP0103': 35,
 'RP0103EN': 36,
 'BD0212EN': 37,
 'GPXX0IBEN': 38,
 'SECM03EN': 39,
 'SC0103EN': 40,
 'GPXX0YXHEN': 41,
 'RP0151EN': 42,
 'TA0105': 43,
 'SW0201EN': 44,
 'TMP0106': 45,
 'GPXX0BUBEN': 46,
 'ST0201EN': 47,
 'ST0301EN': 48,
 'SW0101EN': 49,
 'TMP0101EN': 50,
 'DW0101EN': 51,
 'BD0143EN': 52,
 'WA0101EN': 53,
 'GPXX04HEEN': 54,
 'BD0141EN': 55,
 'CO0401EN': 56,
 'ML0122ENv1':

In [133]:
from collections import Counter

def most_frequently_recommended_courses(id_idx_dict):
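    # NOTE: Counter(id_idx_dict) treats each course's index as its count, so this
    # ranks courses by index, not by how often they are recommended. Pass a list of
    # recommended course ids instead to count actual recommendations.
    course_recommendation_counts = Counter(id_idx_dict)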
    
    # Get the top-10 most frequently recommended courses
    top_10_courses = course_recommendation_counts.most_common(10)
    
    return top_10_courses

# Compute the most frequently recommended courses
top_10_recommended_courses = most_frequently_recommended_courses(id_idx_dict)

# Print the results
print("Top-10 Most Frequently Recommended Courses:", top_10_recommended_courses)

Top-10 Most Frequently Recommended Courses: [('excourse93', 306), ('excourse92', 305), ('excourse91', 304), ('excourse90', 303), ('excourse89', 302), ('excourse88', 301), ('excourse87', 300), ('excourse86', 299), ('excourse85', 298), ('excourse84', 297), ('excourse83', 296), ('excourse82', 295), ('excourse81', 294), ('excourse80', 293), ('excourse79', 292), ('excourse78', 291), ('excourse77', 290), ('excourse76', 289), ('excourse75', 288), ('excourse74', 287)]


The completed `generate_recommendations_for_one_user(...)` may output a dictionary like this:


{'ML0151EN': 0.6626221399549089,
 'excourse47': 0.6347547807096177,
 'excourse46': 0.6120541193300345}




### TASK: Generate course recommendations based on course similarities for all test users


In the previous task, you made some recommendations for yourself. Next, let's try to make recommendations for all the test users in the test dataset.


In [40]:
test_users_url = "ratings.csv"
test_users_df = pd.read_csv(test_users_url)

Let's look at how many test users we have in the dataset.


In [41]:
test_users = test_users_df.groupby(['user']).max().reset_index(drop=False)
test_user_ids = test_users['user'].to_list()
print(f"Total numbers of test users {len(test_user_ids)}")

Total numbers of test users 33901


_TODO: Complete the ``generate_recommendations_for_all()`` method to generate recommendations for all users. You are free to implement it in a different way_


In [32]:
# WRITE YOUR CODE HERE
def generate_recommendations_for_all():
    users = []
    courses = []
    sim_scores = []
    # Test user dataframe
    # Course similarity matrix
    sim_df = pd.read_csv(sim_url)
    # Course content dataframe
    course_df = pd.read_csv(course_url)
    # Course BoW features
    bow_df = pd.read_csv(bow_url)
    test_users = test_users_df.groupby(['user']).max().reset_index(drop=False)
    test_user_ids = test_users['user'].to_list()
    
    # ...
    
    for user_id in test_user_ids:
        users.append(user_id)
        # For each user, call generate_recommendations_for_one_user() to generate the recommendation results
        # Save the result to courses, sim_scores list
        pass
    
    return users, courses, sim_scores

<details>
    <summary>Click here for Hints</summary>
    
Note that you can use the `generate_recommendations_for_one_user` function to find the recommended courses for each user. Get the `enrolled_course_ids` list for a user by running `test_users[test_users['user']==user_id]['item']`. To get the unselected courses, apply `all_courses.difference()` with `enrolled_course_ids` as its parameter (as done earlier in the lab). Keep the last two parameters of `generate_recommendations_for_one_user` the same.
</details>
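One possible shape for the per-user loop is sketched below on toy data. This is only an illustration under stated assumptions: `toy_ratings`, `toy_sim`, `toy_id_idx`, `recommend_for_one_user`, and `recommend_for_all` are hypothetical names, and the full course set is taken from the id mapping rather than from `course_df` to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inputs mirroring the lab's objects
toy_id_idx = {"A": 0, "B": 1, "C": 2}
toy_sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.7],
    [0.2, 0.7, 1.0],
])
toy_ratings = pd.DataFrame({
    "user": [1, 2, 2],
    "item": ["A", "A", "B"],
})

def recommend_for_one_user(enrolled_ids, unselected_ids, id_idx_dict, sim_matrix, threshold=0.6):
    # Keep the best similarity between each unselected course and any enrolled course
    res = {}
    for enrolled in enrolled_ids:
        for candidate in unselected_ids:
            if enrolled in id_idx_dict and candidate in id_idx_dict:
                sim = sim_matrix[id_idx_dict[enrolled]][id_idx_dict[candidate]]
                if sim > threshold and sim > res.get(candidate, 0):
                    res[candidate] = float(sim)
    return dict(sorted(res.items(), key=lambda item: item[1], reverse=True))

def recommend_for_all(ratings_df, id_idx_dict, sim_matrix):
    users, courses, sim_scores = [], [], []
    all_courses = set(id_idx_dict)
    # One pass per user: collect enrolled courses, then score the rest
    for user_id, group in ratings_df.groupby("user"):
        enrolled = set(group["item"])
        unselected = all_courses.difference(enrolled)
        recs = recommend_for_one_user(enrolled, unselected, id_idx_dict, sim_matrix)
        # Append one row per (user, recommended course) so the lists stay aligned
        for course_id, score in recs.items():
            users.append(user_id)
            courses.append(course_id)
            sim_scores.append(score)
    return users, courses, sim_scores

toy_users, toy_courses, toy_scores = recommend_for_all(toy_ratings, toy_id_idx, toy_sim)
print(pd.DataFrame({"USER": toy_users, "COURSE_ID": toy_courses, "SCORE": toy_scores}))
```

Appending the user id once per recommended course, rather than once per user, keeps the three lists the same length, which `pd.DataFrame` requires when building the results table.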


After you have completed the `generate_recommendations_for_all()` function, you can call it and save the results into a dataframe:


In [None]:
res_dict = {}
users, courses, sim_scores = generate_recommendations_for_all()
res_dict['USER'] = users
res_dict['COURSE_ID'] = courses
res_dict['SCORE'] = sim_scores
res_df = pd.DataFrame(res_dict, columns=['USER', 'COURSE_ID', 'SCORE'])
res_df

In [39]:
import pandas as pd

# Example function to generate recommendations (replace with your actual logic)
def generate_recommendations_for_all():
    users = ['user1', 'user1', 'user2']
    courses = ['course1', 'course2']  # The lists have different lengths in this example
    sim_scores = [0.9, 0.75, 0.85]

    # Ensure all lists have the same length
    min_length = min(len(users), len(courses), len(sim_scores))
    users = users[:min_length]
    courses = courses[:min_length]
    sim_scores = sim_scores[:min_length]

    return users, courses, sim_scores

# Generate recommendations
users, courses, sim_scores = generate_recommendations_for_all()

# Debug: Print lengths of each list
print(f"Users list length: {len(users)}")
print(f"Courses list length: {len(courses)}")
print(f"Sim scores list length: {len(sim_scores)}")

# Create dictionary and DataFrame
res_dict = {
    'USER': users,
    'COURSE_ID': courses,
    'SCORE': sim_scores
}
res_df = pd.DataFrame(res_dict, columns=['USER', 'COURSE_ID', 'SCORE'])

# Display the DataFrame
print(res_df)


Users list length: 2
Courses list length: 2
Sim scores list length: 2
    USER COURSE_ID  SCORE
0  user1   course1   0.90
1  user1   course2   0.75


Similar to the previous user profile and course genre lab, with the recommendations generated for each user, you need to write some extra analytic code to answer the following questions:


- On average, how many new/unseen courses have been recommended to each user?
- What are the most frequently recommended courses? Return the top-10 most commonly recommended courses across all users.


In [110]:
from collections import Counter

def most_frequently_recommended_courses(test_data):
    # Count the number of times each course is recommended
    course_recommendation_counts = Counter(test_data['item'])
    
    # Get the top-10 most frequently recommended courses
    top_10_courses = course_recommendation_counts.most_common(10)
    
    return top_10_courses

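# Compute the most frequently recommended courses
# (`test_data` is assumed to be a DataFrame with an 'item' column; it is not defined earlier in this notebook)
top_10_recommended_courses = most_frequently_recommended_courses(test_data)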

# Print the results
print("Top-10 Most Frequently Recommended Courses:", top_10_recommended_courses)

Top-10 Most Frequently Recommended Courses: [('course1', 2), ('course2', 2)]


For example, suppose we have only 3 test users, each user receives the following recommendations:


- User1: ['course1', 'course2']
- User2: ['course3', 'course4']
- User3: ['course3', 'course4', 'course5']


Then, the average number of recommended courses per user is $(2 + 2 + 3) / 3 \approx 2.33$. The top-2 recommended courses are `course3` (2 times) and `course4` (2 times).
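The arithmetic in this example can be reproduced with a few lines of Python. The `recommendations` dictionary below simply restates the three users from the example; it is not output from the lab's data.

```python
from collections import Counter

# Recommendations from the worked example above
recommendations = {
    "User1": ["course1", "course2"],
    "User2": ["course3", "course4"],
    "User3": ["course3", "course4", "course5"],
}

# Average number of recommended courses per user
avg_per_user = sum(len(recs) for recs in recommendations.values()) / len(recommendations)

# Count how often each course appears across all users
counts = Counter(course for recs in recommendations.values() for course in recs)
top_2 = counts.most_common(2)

print(round(avg_per_user, 2))
print(top_2)
```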


Note that the answers depend on your similarity threshold (default is 0.6). A lower similarity threshold yields more recommended courses, but they are less relevant on average.

Ideally, we should limit the recommendations to fewer than 20 courses per user.
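One way to enforce such a cap is to keep only the highest-scoring rows for each user. A minimal sketch, assuming a results frame with `USER`, `COURSE_ID`, and `SCORE` columns like `res_df`; the toy values and the `max_per_user` parameter are hypothetical:

```python
import pandas as pd

# Hypothetical results in the same USER / COURSE_ID / SCORE layout as res_df
toy_res_df = pd.DataFrame({
    "USER": [1, 1, 1, 2, 2],
    "COURSE_ID": ["A", "B", "C", "A", "D"],
    "SCORE": [0.61, 0.95, 0.72, 0.80, 0.66],
})

max_per_user = 2  # the lab suggests staying below 20; 2 keeps the toy output short

# Sort by score within each user, then keep the first max_per_user rows per user
capped_df = (
    toy_res_df.sort_values(["USER", "SCORE"], ascending=[True, False])
    .groupby("USER")
    .head(max_per_user)
    .reset_index(drop=True)
)
print(capped_df)
```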


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
