<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Clustering based Course Recommender System**


Estimated time needed: **90** minutes


Previously, we have generated user profile vectors based on course ratings and genres. 

A user profile vector may look like a row vector in the following matrix. For example, the Database column for user2 has a value of 1, which means user2 is very interested in courses related to databases. With the user profile vectors generated, we can also easily compute the similarity among users based on their shared interests.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/userprofiles.png)


Furthermore, we could apply clustering algorithms such as K-means or DBSCAN to group users with similar learning interests. For example, the user clusters below contain users who have learned courses related to machine learning, cloud computing, databases, web development, etc.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/userprofiles_clustering.png)


For each user group, we can come up with a list of popular courses. For example, for the machine learning user cluster (learning group), we can count the most frequently enrolled courses. These are very likely to be the most popular and well-regarded machine learning courses, because many users who are interested in machine learning have enrolled in them.

If we know a user belongs to the machine learning group, we may recommend the most enrolled courses to them, and it is very likely that the user will be interested in them.


Next in this lab, you will be implementing some clustering-based recommender system algorithms.


## Objectives


After completing this lab you will be able to:


* Perform k-means clustering on the original user profile feature vectors
* Apply PCA (Principal Component Analysis) on user profile feature vectors to reduce dimensions
* Perform k-means clustering on the PCA transformed main components
* Generate course recommendations based on other group members' enrollment history


----


## Prepare and set up lab environment


First install and import required libraries:


In [ ]:
!pip install scikit-learn==1.0.2
!pip install seaborn==0.11.1

In [ ]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

%matplotlib inline

In [ ]:
# also set a random state
rs = 123

### Load the user profile dataset


Let's first load the original user profile feature vectors:


In [ ]:
user_profile_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_profile.csv"
user_profile_df = pd.read_csv(user_profile_url)
user_profile_df.head()

In [ ]:
user_profile_df.shape

We can then list the feature names, which are the topics users are interested in (course genres):


In [ ]:
feature_names = list(user_profile_df.columns[1:])
feature_names

As we can see from the user profile dataset, we have about 33K unique users with interests in areas like ``Database``, ``Python``, ``CloudComputing``, etc. Then, let's check the  summary statistics for each feature.


In [ ]:
user_profile_df.describe()

The original user profile feature vectors are not normalized, which may cause issues when we perform clustering and principal component analysis (PCA). Therefore, we standardize the data.


In [ ]:
# Use StandardScaler to make each feature with mean 0, standard deviation 1
scaler = StandardScaler()
user_profile_df[feature_names] = scaler.fit_transform(user_profile_df[feature_names])
print("mean {} and standard deviation {}".format(user_profile_df[feature_names].mean(), user_profile_df[feature_names].std()))

In [ ]:
user_profile_df.describe()

The normalized user profile features are: 


In [ ]:
features = user_profile_df.loc[:, user_profile_df.columns != 'user']
features

We can also save the user ids for later recommendation tasks:


In [ ]:
user_ids = user_profile_df.loc[:, user_profile_df.columns == 'user']
user_ids

### TASK: Perform K-means clustering algorithm on the user profile feature vectors


With the user profile dataset ready, you need to use the `KMeans` class provided by the scikit-learn library to perform clustering on the user profile feature vectors.


For the `KMeans` algorithm, one important hyperparameter is the number of clusters, `n_clusters`. A good way to find an optimized `n_clusters` is to grid search a list of candidates and compare a clustering evaluation metric across them, such as the sum of squared distances of the samples to their closest cluster center:


_TODO: grid search the optimized `n_clusters` for the KMeans() model_


In [ ]:
# WRITE YOUR CODE HERE

# Find an optimized number of clusters k from a candidate list such as list_k = list(range(1, 30))


<details>
    <summary>Click here for Hints</summary>
    
Create a list that will hold the sum of squared distances for each fitted model. For each k in the candidate list, fit a model by calling `KMeans(n_clusters=k, random_state=rs).fit(features)` and append `model.inertia_` to the list. Plot the sums of squared distances against the k values.

</details>
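
The hint above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `toy_features` is a made-up stand-in generated with `make_blobs`, not the lab's `features`, the candidate range is shortened to 1 through 10, and `n_init=10` is passed explicitly so the call behaves the same across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rs = 123  # same random state value as the lab

# Synthetic stand-in for the standardized user profile features (assumption):
# 300 points drawn around 4 centers in 5 dimensions.
toy_features, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=rs)

list_k = list(range(1, 11))
sum_of_squared_distances = []
for k in list_k:
    # Fit one model per candidate k and record its inertia
    model = KMeans(n_clusters=k, random_state=rs, n_init=10).fit(toy_features)
    sum_of_squared_distances.append(model.inertia_)

for k, ssd in zip(list_k, sum_of_squared_distances):
    print(k, round(ssd, 1))
```

Plotting `list_k` against `sum_of_squared_distances`, for example with `plt.plot(list_k, sum_of_squared_distances, marker="o")`, should then give an elbow-shaped curve for this toy data.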


If you plot the grid search process, you may get an elbow plot like the following:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/clusters_elbow.png)


From the elbow plot, you should visually identify the point where the metric starts to flatten, which indicates the optimized number of clusters.


Once you have identified the best number of clusters, you can apply `KMeans()` again to generate a cluster label for every user.


In [ ]:
cluster_labels = [None] * len(user_ids)

_TODO: Apply KMeans() on the features with the optimized `n_clusters` parameter. After model fitting, you can find the output cluster labels in the `model.labels_` attribute_


In [ ]:
## WRITE YOUR CODE HERE

## ...
## cluster_labels = model.labels_
## ...


<details>
    <summary>Click here for Hints</summary>
    
Create a model by calling `KMeans(n_clusters=k, random_state=rs).fit(features)`. Save the labels by accessing `model.labels_`.

</details>
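
As a minimal sketch of this step, the snippet below fits a single `KMeans` model and reads its `labels_` attribute. The data and `best_k = 3` are assumptions chosen for illustration; in the lab you would use `features` and the k you read off your own elbow plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rs = 123

# Synthetic stand-in for the user profile features (assumption)
toy_features, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=rs)

best_k = 3  # assumed value, as if read off an elbow plot
model = KMeans(n_clusters=best_k, random_state=rs, n_init=10).fit(toy_features)

# labels_ holds one integer cluster index per input row, in input order
cluster_labels = model.labels_
print(cluster_labels[:10])
print(np.bincount(cluster_labels))  # number of points in each cluster
```

Because the labels follow the row order of the input, they can be paired with the user ids position by position.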


The cluster labels you generated are a list of integers indicating cluster indices. You may use the following utility function to combine the cluster labels and user ids into a dataframe, so that you know which cluster each user belongs to:


In [ ]:
def combine_cluster_labels(user_ids, labels):
    labels_df = pd.DataFrame(labels)
    cluster_df = pd.merge(user_ids, labels_df, left_index=True, right_index=True)
    cluster_df.columns = ['user', 'cluster']
    return cluster_df

Your clustering results may look like the following screenshot:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/cluster_labels.png)


Now each user has been assigned to a cluster; in other words, we have created many clusters of learning communities. Learners within each community share very similar learning interests.


### TASK: Apply PCA on user profile feature vectors to reduce dimensions


In the previous step, we applied `KMeans` on the original user profile feature vectors which have 14 original features (the course genres).


In [ ]:
features = user_profile_df.loc[:, user_profile_df.columns != 'user']
user_ids = user_profile_df.loc[:, user_profile_df.columns == 'user']
feature_names = list(user_profile_df.columns[1:])

In [ ]:
print(f"There are {len(feature_names)} features for each user profile.")

If we plot a covariance matrix of the user profile feature vectors with 14 features, we can observe that some features are actually correlated:


In [ ]:
sns.set_theme(style="white")

# Compute the covariance matrix (the features are standardized, so it is close to the correlation matrix)
corr = features.cov()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})


plt.show()

For example, the feature `MachineLearning` and the feature `DataScience` are correlated. Such covariances among features may indicate that we can apply PCA to find the principal components (the eigenvectors of the covariance matrix with the largest eigenvalues).

If we only keep the independent main components, then we can reduce the dimensions of our user profile feature vectors.
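
To make the link between PCA and the covariance matrix concrete, the following sketch compares the eigenvalues of a covariance matrix with the `explained_variance_` reported by scikit-learn's `PCA`. The toy data is invented for illustration: two correlated columns and one independent column.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)

# Toy data (assumption): column 1 is mostly a scaled copy of column 0,
# column 2 is independent noise.
a = rng.normal(size=500)
toy = np.column_stack([a, 0.8 * a + 0.2 * rng.normal(size=500), rng.normal(size=500)])

# Eigenvalues of the sample covariance matrix, largest first
cov = np.cov(toy, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Variance explained by each principal component
pca = PCA(n_components=3).fit(toy)

print(eigenvalues)
print(pca.explained_variance_)
```

The two printed arrays should agree up to floating point error, which is why dropping components with small eigenvalues loses little of the variance.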


Now let's apply the `PCA()` provided by  `scikit-learn` to find the main components in user profile feature vectors and see if we can reduce its dimensions by only keeping the main components.


Note that when calling the `PCA()` class, there is also an important argument called `n_components`, which indicates how many components you want to keep in the PCA result. One way to find an optimized `n_components` is to do a grid search on a list of argument candidates (such as `range(1, 15)`) and calculate the accumulated explained variance ratio for each candidate.

If the accumulated variance ratio of a candidate `n_components` is larger than a threshold, e.g., 90%, then the retained components explain about 90% of the variance of the original data, and that candidate can be considered an optimized number of components.


_TODO: Find the optimized `n_components` for PCA_


In [ ]:
# WRITE YOUR CODE HERE

# - For a list of candidate `n_components` values such as 1 to 14, find the minimal `n` whose components explain at least 90% of the variance of the original data
# - In the fitted PCA() model, you can read explained_variance_ratio_ and use the sum() function to get the accumulated variance ratio


<details>
    <summary>Click here for Hints</summary>
    
* For each `n_components` from 1 to 14, you can create `pca = PCA(n_components=component)` and then fit it by calling `pca.fit_transform(features)`, where `features = user_profile_df.loc[:, user_profile_df.columns != 'user']`.
* Then you can find `accumulated_variance_ratios` by applying `sum()` to `pca.explained_variance_ratio_`. 
* Then find the smallest n_components value for which `accumulated_variance_ratios >= 0.9` and return it.
</details>
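
The search described in the hints might look like the sketch below. It runs on invented data (six features driven by two latent factors plus noise), so the resulting number of components says nothing about the lab dataset; `threshold`, `best_n`, and `accumulated` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Toy data (assumption): 6 observed features mixed from 2 latent factors
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
toy = latent @ mixing + 0.1 * rng.normal(size=(500, 6))
toy = StandardScaler().fit_transform(toy)

threshold = 0.9
best_n = None
accumulated = []
for n in range(1, toy.shape[1] + 1):
    pca = PCA(n_components=n).fit(toy)
    ratio = pca.explained_variance_ratio_.sum()
    accumulated.append(ratio)
    # Remember the first (smallest) n that reaches the threshold
    if best_n is None and ratio >= threshold:
        best_n = n

print(best_n)
print([round(r, 3) for r in accumulated])
```

A cheaper variant fits `PCA()` once with all components and takes `np.cumsum(pca.explained_variance_ratio_)`, which yields the same accumulated ratios.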


If you visualize your hyperparameter searching process, you may get a trend line like the following:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/pca.png)


Once you have found the optimized `n_components` argument value, you can apply PCA on the user profile feature vectors and reduce the 14 features to `n_components` features.


_TODO: Perform PCA to transform original user profile features_


In [ ]:
# WRITE YOUR CODE HERE

# - Create a PCA() model with the optimized `n_components` found in the previous step
# - Call fit_transform() on the user profile features to get the transformed components
# - Merge the user ids and transformed features into a new dataframe


<details>
    <summary>Click here for Hints</summary>
    
* Call PCA class as `pca = PCA(n_components=n_components)` 
* Fit PCA model using predefined `features` variable as only parameter
* Get the components by calling `pca.fit_transform(features)` 
* Create a `pd.DataFrame(data=components)` and use `pd.merge` to merge it with `user_ids`. Don't forget to specify `left_index=True, right_index=True` in the `merge` function parameters.
    
</details> 
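
A minimal sketch of the transform-and-merge step, assuming a toy feature table and made-up user ids. The column names `PC0`, `PC1`, `PC2` and `n_components = 3` are illustrative choices, not values taken from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)

# Toy stand-ins for `features` and `user_ids` (assumptions)
toy_features = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"f{i}" for i in range(5)])
toy_user_ids = pd.DataFrame({"user": range(1, 101)})

n_components = 3  # assumed; use the value found by your own search
pca = PCA(n_components=n_components)
components = pca.fit_transform(toy_features)

# Wrap the transformed array in a dataframe and join it to the ids by row index
components_df = pd.DataFrame(components, columns=[f"PC{i}" for i in range(n_components)])
pca_df = pd.merge(toy_user_ids, components_df, left_index=True, right_index=True)
print(pca_df.head())
```

The index-based merge works here because both dataframes keep the default row index, so row i of the components belongs to row i of the user ids.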
    


Your PCA transformed dataframe may look like the following:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/pca_res.png)


### TASK: Perform k-means clustering on the PCA transformed feature vectors


Now, you have the PCA  components of the original profile vectors. You can perform k-means on them again:


_TODO: Perform K-means on the PCA transformed features_


In [ ]:
## WRITE YOUR CODE HERE

## - Apply KMeans() on the PCA features
## - Obtain the cluster label lists from model.labels_ attribute
## - Assign each user a cluster label by combining user ids and cluster labels


Your clustering results should have the same format as the k-means on the original dataset:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/cluster_labels.png)
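
One possible way to chain standardization, PCA, and k-means is scikit-learn's `make_pipeline`, sketched below on synthetic data. The component and cluster counts are assumptions for illustration, and the lab itself performs these steps separately.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rs = 123

# Synthetic stand-in for the raw user profile features (assumption)
toy_features, _ = make_blobs(n_samples=200, centers=3, n_features=6, random_state=rs)

pipeline = make_pipeline(
    StandardScaler(),                                  # mean 0, standard deviation 1
    PCA(n_components=2),                               # assumed number of components
    KMeans(n_clusters=3, random_state=rs, n_init=10),  # assumed number of clusters
)

# fit_predict runs every step and returns the k-means label of each row
pca_cluster_labels = pipeline.fit_predict(toy_features)
print(pca_cluster_labels[:10])
```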


Great, now all users find their learning interest groups, either based on their original or the PCA transformed user profile features. 


When a user is in a group or a community, it is very likely that the user will be interested in the courses enrolled by other members within the same group.


### TASK: Generate course recommendations based on the popular courses in the same cluster


The intuition of clustering-based course recommendation is very simple and can be illustrated via the following example:


Suppose a user has joined a machine learning group (via a clustering algorithm). In the group, the user finds that the top-3 courses enrolled by all other group members are `Machine Learning for Everyone`, `Machine Learning with Python`, and `Machine Learning with Scikit-learn`. Since the user has already completed `Machine Learning for Everyone`, the user decides to trust the group members' choices and enroll in the other two unselected courses, `Machine Learning with Python` and `Machine Learning with Scikit-learn`.


In summary, the clustering-based recommender system first groups all users based on their profiles, and maintains a popular courses list for each group. 

For any group member who needs course recommendations, the algorithm recommends the unselected courses from the popular course lists.
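
The example above boils down to filtering a popular-course list by what the user has already taken. A tiny sketch, using the course titles from the example and hypothetical variable names:

```python
# Popular courses of the user's cluster, from the example above
popular_courses = [
    "Machine Learning for Everyone",
    "Machine Learning with Python",
    "Machine Learning with Scikit-learn",
]

# Courses the user has already completed
completed = {"Machine Learning for Everyone"}

# Recommend the popular courses the user has not selected yet
recommendations = [course for course in popular_courses if course not in completed]
print(recommendations)
# → ['Machine Learning with Python', 'Machine Learning with Scikit-learn']
```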


Next, suppose we have a set of test users, and we want to recommend new courses to them using a clustering-based recommender system:


In [ ]:
test_user_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/rs_content_test.csv"
test_users_df = pd.read_csv(test_user_url)[['user', 'item']]
test_users_df.head()

The test users dataset has only two columns, the user id and the enrolled course id. 


For each user, let's find its cluster label using the k-means results you obtained in the previous steps, assuming they are stored in a dataframe named `cluster_df`.


You can assign a cluster label to every test user by merging the test users with the clustering labels (`cluster_df`):


In [ ]:
# test_users_labelled = pd.merge(test_users_df, cluster_df, left_on='user', right_on='user')

The merged test dataset may look like the following:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/test_users_label.png)


From the above dataframe, we know each user's enrolled courses and its cluster index.


If we use a `groupby` and `sum` aggregation, we can get the enrollment count for each course in each group, as in the following code snippet:


In [ ]:
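'''
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'count' column
courses_cluster = test_users_labelled[['item', 'cluster']].copy()
courses_cluster['count'] = 1
# Keep the aggregated result so that it can be reused later
courses_cluster = courses_cluster.groupby(['cluster','item']).agg(enrollments=('count','sum')).reset_index()
'''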

_TODO: For each test user, try to recommend any unseen courses based on the popular courses in his/her cluster. You may use an enrollment count threshold (such as larger than 10) to determine if it is a popular course in the cluster_ 


In [ ]:
## WRITE YOUR CODE HERE

## - For each user, first find its cluster label

    ## - Then get all courses belonging to the same cluster and figure out which ones are popular (such as courses with enrollments beyond your chosen threshold)
    
    ## - Get the user's current enrolled courses
    
    ## - Check if there are any courses on the popular course list which are new/unseen to the user. 
    
    ## If yes, make those unseen and popular courses as recommendation results for the user


<details>
    <summary>Click here for Hints</summary>
    
* First of all, create a `user_subset` of  `test_users_labelled` where `test_users_labelled['user'] == user_id`. 
* Get the enrolled courses by  simply accessing `['item']` column of `user_subset`
* Find its cluster label by accessing `['cluster']` column of `user_subset`. You can just use the first one (`.iloc[0]`) since every value in the column is the same for an individual user.
* You can find all courses in the same cluster by accessing the `['item']` column of the subset `test_users_labelled[test_users_labelled['cluster'] == cluster_id]`
* You can find courses that are new/unseen to the user by taking the set difference between the courses in the cluster and `enrolled_courses` using the `.difference` method (don't forget to convert the two lists into sets before calling the method).
* Use the `courses_cluster` dataset to find the popularity of the new/unseen courses, and return the unseen and popular courses as the recommendation results for the user
    
</details> 
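
The hints above can be sketched end to end on a toy table. Everything here is an assumption made for illustration: `toy_labelled` stands in for `test_users_labelled`, the course ids are made up, and a course counts as popular when it has at least `threshold = 2` enrollments in its cluster (the lab suggests a larger threshold for the real data).

```python
import pandas as pd

# Toy stand-in for test_users_labelled: one row per enrollment
toy_labelled = pd.DataFrame({
    "user":    [1, 1, 2, 2, 3, 3, 4],
    "item":    ["A", "B", "A", "C", "A", "C", "D"],
    "cluster": [0, 0, 0, 0, 0, 0, 1],
})
threshold = 2  # assumed popularity threshold for this toy example

# Enrollment count per (cluster, item)
enrollments = toy_labelled.groupby(["cluster", "item"]).size().reset_index(name="enrollments")

# Popular courses of each cluster as a dict: cluster id -> set of items
popular = enrollments[enrollments["enrollments"] >= threshold]
popular_by_cluster = popular.groupby("cluster")["item"].apply(set).to_dict()

recommendations = {}
for user_id, user_subset in toy_labelled.groupby("user"):
    cluster_id = user_subset["cluster"].iloc[0]  # same value on every row of a user
    enrolled = set(user_subset["item"])
    # Popular in the user's cluster, minus what the user already enrolled in
    unseen = popular_by_cluster.get(cluster_id, set()) - enrolled
    recommendations[user_id] = sorted(unseen)

print(recommendations)
```

In this toy data only user 1 receives a recommendation (course `C`), because users 2 and 3 have already taken both popular courses of cluster 0 and cluster 1 has no course that reaches the threshold.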


With the recommendation results, you also need to write some analytic code to answer the following two questions:


- On average, how many new/unseen courses have been recommended to each user?
- What are the most frequently recommended courses? Return the top-10 commonly recommended courses across all users.


For example, suppose we have only 3 test users, each user receives the following recommendations:


- User1: ['course1', 'course2']
- User2: ['course3', 'course4']
- User3: ['course3', 'course4', 'course5']


Then, the average number of recommended courses per user is $(2 + 2 + 3) / 3 \approx 2.33$. The top-2 recommended courses are `course3` (2 times) and `course4` (2 times).
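
The same numbers can be reproduced with a few lines of standard-library Python. The dictionary below simply restates the three-user example; its name and layout are illustrative.

```python
from collections import Counter

# Recommendations from the three-user example above
recommendations = {
    "User1": ["course1", "course2"],
    "User2": ["course3", "course4"],
    "User3": ["course3", "course4", "course5"],
}

# Average number of recommended courses per user
average = sum(len(courses) for courses in recommendations.values()) / len(recommendations)

# How often each course is recommended across all users
counts = Counter(course for courses in recommendations.values() for course in courses)
top_2 = counts.most_common(2)

print(round(average, 2))  # → 2.33
print(top_2)              # → [('course3', 2), ('course4', 2)]
```

For the lab questions, `counts.most_common(10)` would return the top-10 courses in the same way.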


Note that the answers will depend on how you compute the popular courses for each cluster. A lower threshold yields more recommended courses but with lower confidence, so some test users may receive very long course recommendation lists and feel overwhelmed.

Ideally, we should limit each user to fewer than 20 recommended courses.


### Explore other clustering algorithms


As you have learned in the previous unsupervised learning course, there are many other clustering algorithms such as `DBSCAN` and `Hierarchical Clustering`. You are encouraged to try them on the user profile feature vectors and compare the results with K-means.
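
As a starting point, the sketch below runs `DBSCAN` and `AgglomerativeClustering` from scikit-learn on synthetic data. The `eps`, `min_samples`, and `n_clusters` values are placeholders chosen for this toy data and would need tuning on the real user profiles.

```python
from collections import Counter

from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the user profile features (assumption)
toy, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=123)
toy = StandardScaler().fit_transform(toy)

# DBSCAN infers the number of clusters itself; the label -1 marks noise points
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(toy)

# Hierarchical (agglomerative) clustering with a fixed number of clusters
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(toy)

print(Counter(dbscan_labels.tolist()))
print(Counter(agg_labels.tolist()))
```

Comparing the cluster sizes printed here with those from k-means is one simple way to see how differently the algorithms partition the same users.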


### Summary


Congratulations! In this lab, you have applied clustering algorithms to group users with similar interests and also tried PCA to reduce the dimensions of user feature vectors.

Furthermore, with each user assigned to a learning interest group, you have also implemented a clustering-based course recommender system that makes recommendations based on the popular course choices of each user's group members.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
