# Exercises in Recommender systems

This notebook contains exercises in recommender systems.

In [91]:
import os
import kagglehub
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel


In [92]:
# Download the latest version; dataset_download returns the local dataset directory
path = kagglehub.dataset_download("khusheekapoor/coursera-courses-dataset-2021")
# Reuse the returned directory instead of a machine-specific absolute path
dataset_path = path
files = os.listdir(dataset_path)
file_path = os.path.join(dataset_path, "Coursera.csv")




In [93]:
ccd = pd.read_csv(file_path)
ccd.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID`, count the ratings per user, and sort the counts in descending order to find the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from step 3 together with the `Books.csv` dataset.

In [94]:
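tfidf = TfidfVectorizer(stop_words='english')

# Each fit_transform call refits the vectorizer, so `tfidf` ends up holding the Skills vocabulary
tfidf_matrix_dsc = tfidf.fit_transform(ccd['Course Description'])
tfidf_matrix_skll = tfidf.fit_transform(ccd['Skills'])

# TF-IDF rows are L2-normalised by default, so the dot product equals the cosine similarity
cosine_sim_dsc = linear_kernel(tfidf_matrix_dsc, tfidf_matrix_dsc)
cosine_sim_skll = linear_kernel(tfidf_matrix_skll, tfidf_matrix_skll)

# Map course name -> row position.
# Note: drop_duplicates() compares the values (row positions), not the names, so repeated names are kept
indices = pd.Series(ccd.index, index=ccd['Course Name']).drop_duplicates()
indices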

Course Name
Write A Feature Length Screenplay For Film Or Television                 0
Business Strategy: Business Model Canvas Analysis with Miro              1
Silicon Thin Film Solar Cells                                            2
Finance for Managers                                                     3
Retrieve Data using Single-Table SQL Queries                             4
                                                                      ... 
Capstone: Retrieving, Processing, and Visualizing Data with Python    3517
Patrick Henry: Forgotten Founder                                      3518
Business intelligence and data analytics: Generate insights           3519
Rigid Body Dynamics                                                   3520
Architecting with Google Kubernetes Engine: Production                3521
Length: 3522, dtype: int64
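
The cell above imports `cosine_similarity` but calls `linear_kernel`. A minimal sketch of why the two agree for TF-IDF input, using a made-up three-document corpus (`toy_docs` and the other names are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus standing in for the course descriptions.
toy_docs = [
    "introduction to python programming",
    "advanced python for data analysis",
    "history of the american revolution",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Plain dot products versus explicitly normalised cosine similarity.
dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

# With the default norm="l2", both matrices agree up to floating-point error.
same = np.allclose(dot_products, cosines)
print(same)
```

The equivalence relies on the vectorizer's default `norm="l2"`; with `norm=None` the dot product would no longer be a cosine.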

In [95]:
def get_recommendations(title, cosine_sim, indices=indices, ccd=ccd, top_n=10):
    """Return the names of the top_n courses most similar to `title`."""
    idx = indices[title]
    # Pair every row position with its similarity to the query course
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the query course itself
    sim_scores = sim_scores[1:top_n + 1]
    course_indices = [i[0] for i in sim_scores]
    return ccd['Course Name'].iloc[course_indices]


def get_recommendations_desc(title, cosine_sim=cosine_sim_dsc, indices=indices):
    # Recommendations based on Course Description similarity
    return get_recommendations(title, cosine_sim, indices)


def get_recommendations_skll(title, cosine_sim=cosine_sim_skll, indices=indices, ccd=ccd):
    # Recommendations based on Skills similarity
    return get_recommendations(title, cosine_sim, indices, ccd)
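
The lookup `indices[title]` yields a single row position only when the course name is unique in the index. Whether this dataset contains repeated names is not established above, so the following is a sketch on made-up data (`toy_courses`, `toy_indices`, and `unique_indices` are hypothetical names) showing one way to keep the first occurrence of each name:

```python
import pandas as pd

# Hypothetical toy frame with one repeated course name.
toy_courses = pd.DataFrame(
    {"Course Name": ["Intro to SQL", "Rigid Body Dynamics", "Intro to SQL"]}
)

toy_indices = pd.Series(toy_courses.index, index=toy_courses["Course Name"])

# Keep only the first row position for each course name.
unique_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]

# A repeated name returns several positions from the raw mapping...
print(toy_indices["Intro to SQL"].tolist())
# ...but a single position once duplicated index labels are removed.
print(int(unique_indices["Intro to SQL"]))
```

Deduplicating on the index labels, rather than on the values, is what makes the lookup return a scalar for every title.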




In [96]:
print(get_recommendations_desc('Patrick Henry: Forgotten Founder'))
print(get_recommendations_skll('Patrick Henry: Forgotten Founder'))

527                                      Age of Jefferson
2579    Chemerinsky on Constitutional Law - The Struct...
906     Revolutionary Ideas: Borders, Elections, Const...
3307    From Freedom Rides to Ferguson: Narratives of ...
1512    The Making of the US President: A Short Histor...
510     Chemerinsky on Constitutional Law - Individual...
2276    Revolutionary Ideas: Utility, Justice, Equalit...
766                         Innovating in a Digital World
2646                                   The Ancient Greeks
3226             Introduction to Satellite Communications
Name: Course Name, dtype: object
2365    Deciphering Secrets: The Illuminated Manuscrip...
1431    Plagues, Witches, and War: The Worlds of Histo...
2980    The History of Modern Israel - Part I: From an...
78                   Russian History: from Lenin to Putin
1430             Ideas from the History of Graphic Design
1092                       Religions and Society in China
811                          America Th

In [97]:
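books_ratings = pd.read_csv('Books_Ratings.csv')
books = pd.read_csv('Books.csv', dtype={'Year-Of-Publication': str})

# Drop rows with a rating of 0
books_ratings_df = books_ratings[books_ratings['Book-Rating'] != 0]
# The 200 users with the most remaining ratings; value_counts() yields a per-user 'count'
top200 = books_ratings_df['User-ID'].value_counts().nlargest(200)
# Inner merge keeps only ratings from those users and adds the 'count' column
books_ratings_df_t200 = books_ratings_df.merge(top200.to_frame(), on='User-ID')

# Attach book metadata; ratings whose ISBN is missing from Books.csv are dropped
ratings_df = books_ratings_df_t200.merge(books, on='ISBN')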

In [98]:
ratings_df_m = ratings_df.pivot_table(index=["User-ID"], columns=['ISBN'], values="Book-Rating")
ratings_df_m.shape

(200, 52079)
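
The filter-then-pivot steps can be checked on a small example. This is a sketch on made-up ratings that reuse the column names of `Ratings.csv`; `toy_ratings` and the other names are hypothetical, and `isin` is used here as an alternative to the merge in the cell above:

```python
import pandas as pd

# Hypothetical toy ratings using the same column names as Ratings.csv.
toy_ratings = pd.DataFrame(
    {
        "User-ID": [1, 1, 1, 2, 2, 3],
        "ISBN": ["A", "B", "C", "A", "B", "A"],
        "Book-Rating": [8, 0, 9, 7, 10, 5],
    }
)

# Drop ratings of 0, then count the remaining ratings per user.
explicit = toy_ratings[toy_ratings["Book-Rating"] != 0]
top_users = explicit["User-ID"].value_counts().nlargest(2).index

# Keep only rows from the most active users and pivot to a user x book matrix.
top_ratings = explicit[explicit["User-ID"].isin(top_users)]
toy_matrix = top_ratings.pivot_table(
    index="User-ID", columns="ISBN", values="Book-Rating"
)
print(toy_matrix)
```

Books a user has not rated appear as `NaN` in the pivoted matrix, which is why the real matrix is wide and sparse.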

In [99]:
books_ratings_df_t200

Unnamed: 0,User-ID,ISBN,Book-Rating,count
0,2276,0020960808,10,212
1,2276,0030632366,9,212
2,2276,0061030643,8,212
3,2276,0061098353,8,212
4,2276,0061099155,9,212
...,...,...,...,...
79626,274061,1892213737,10,215
79627,274061,189221394X,10,215
79628,274061,1892213958,10,215
79629,274061,1892213966,10,215


In [100]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71677 entries, 0 to 71676
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   User-ID              71677 non-null  int64 
 1   ISBN                 71677 non-null  object
 2   Book-Rating          71677 non-null  int64 
 3   count                71677 non-null  int64 
 4   Book-Title           71677 non-null  object
 5   Book-Author          71676 non-null  object
 6   Year-Of-Publication  71677 non-null  object
 7   Publisher            71675 non-null  object
 8   Image-URL-S          71677 non-null  object
 9   Image-URL-M          71677 non-null  object
 10  Image-URL-L          71677 non-null  object
dtypes: int64(3), object(8)
memory usage: 6.0+ MB


In [101]:
def user_based_recommender(input_user, user_book_df, rate_ratio=0.1, num_recommendations=5):
    input_user_df = user_book_df[user_book_df.index == input_user]
    input_user_books_rated = input_user_df.dropna(axis=1).columns.tolist()
    
    books_rated_df = user_book_df[input_user_books_rated]

    # Count, for each user, how many of the input user's rated books they have also rated
    user_book_count = books_rated_df.T.notnull().sum()
    user_book_count = user_book_count.reset_index()
    user_book_count.columns = ["User-ID", "book_count"]
    
    # Keep users who rated more than rate_ratio of the input user's books
    user_same_books = user_book_count[user_book_count["book_count"] > (len(input_user_books_rated) * rate_ratio)]["User-ID"]

    # Creating a correlation matrix based on ratings
    final_df = books_rated_df[books_rated_df.index.isin(user_same_books)]
    corr_df = final_df.T.corr()
    
    # Rank the other users by their correlation with the input user
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user: 'correlation'})
    user_corr = user_corr.sort_values(by="correlation", ascending=False)
    user_corr = user_corr.loc[user_corr["User-ID"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    # Weight each rating by its user's correlation with the input user
    top_users_ratings = user_corr.merge(ratings_df[["User-ID", "ISBN", "Book-Rating"]], how="inner")
    top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["Book-Rating"]

    # Creating a recommendation dataframe
    recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by="weighted_rating", ascending=False)
    recommendation_df = recommendation_df.reset_index()

    # Creating the final recommendations
    books_to_be_recommended = recommendation_df.merge(books[['ISBN', 'Book-Title']], on="ISBN")
    books_to_be_recommended = books_to_be_recommended.head(num_recommendations)

    return books_to_be_recommended["Book-Title"]


In [102]:
user_based_recommender(2276, ratings_df_m)

0              A Kid's Guide To How to Save the Planet
1                                          The Dionnes
2    Ultimate Japanese: Advanced (Living Language U...
3                                  Stanislaski Sisters
4                36 Hours Christmas (Silhouette Promo)
Name: Book-Title, dtype: object
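
`user_based_recommender` scores every book rated by the correlated users, including books the input user has already rated, and lets negatively correlated users contribute to the mean. A possible variant, sketched on a made-up matrix, excludes already-rated books and uses a normalised, correlation-weighted average. The names `toy_recommend` and `toy_matrix` and the choice to keep only positive correlations are assumptions for illustration, not the method used above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy user x book matrix (NaN = not rated).
toy_matrix = pd.DataFrame(
    {
        "A": [10, 9, 2],
        "B": [8, 7, 9],
        "C": [3, 4, 10],
        "D": [np.nan, 9, 2],
        "E": [np.nan, 3, 10],
    },
    index=pd.Index([1, 2, 3], name="User-ID"),
)


def toy_recommend(user, matrix, top_n=2):
    # Pearson correlation between users, computed over co-rated books.
    user_corr = matrix.T.corr()[user].drop(user)
    # Keep positively correlated neighbours only (an illustrative choice).
    neighbours = user_corr[user_corr > 0]
    # Candidate books: those the target user has not rated yet.
    unseen = matrix.columns[matrix.loc[user].isna()]
    neighbour_ratings = matrix.loc[neighbours.index, unseen]
    # Correlation-weighted average rating per candidate book.
    weights = neighbour_ratings.notna().astype(float).mul(neighbours, axis=0)
    scores = neighbour_ratings.mul(neighbours, axis=0).sum() / weights.sum()
    return scores.dropna().sort_values(ascending=False).head(top_n)


recs = toy_recommend(1, toy_matrix)
print(recs)
```

Dividing by the summed weights keeps the scores on the original rating scale, so a book rated by one strongly correlated neighbour is not penalised relative to a book rated by many.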

## Exercise 2

Use the "Coursera Courses Dataset 2021" from Exercise 1 to do the following:

1. [Optional] Create a Content-based filtering recommender system based on both the Course Descriptions and the Skills.
2. [Optional] Can you come up with a way of including Difficulty Level and Course Rating in your recommender system?
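
One possible approach to both optional tasks, sketched on made-up rows that reuse the dataset's column names: concatenate the two text fields before vectorizing, then nudge the similarity score with a difficulty-match bonus and a rescaled rating. The function `recommend`, the weights `same_level_bonus` and `rating_weight`, and the 0-5 rating scale are all assumptions chosen for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy rows using the dataset's column names.
toy = pd.DataFrame(
    {
        "Course Name": ["SQL Basics", "Advanced SQL", "Film Writing", "SQL for Analysts"],
        "Course Description": [
            "learn sql queries on a single table",
            "window functions and query tuning in sql",
            "write a screenplay for film",
            "sql queries for data analysis",
        ],
        "Skills": ["sql database", "sql performance", "drama screenwriting", "sql analytics"],
        "Difficulty Level": ["Beginner", "Advanced", "Beginner", "Beginner"],
        "Course Rating": [4.6, 4.2, 4.8, 4.9],
    }
)

# 1. Concatenate both text fields and build one TF-IDF similarity matrix.
combined_text = toy["Course Description"] + " " + toy["Skills"]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(combined_text)
combined_sim = linear_kernel(tfidf_matrix, tfidf_matrix)


def recommend(title, top_n=2, same_level_bonus=0.1, rating_weight=0.2):
    """Rank courses by text similarity, nudged by difficulty match and rating."""
    idx = toy.index[toy["Course Name"] == title][0]
    score = pd.Series(combined_sim[idx], index=toy.index)
    # 2a. Small bonus for courses at the same difficulty level (assumed weight).
    score += same_level_bonus * (toy["Difficulty Level"] == toy.loc[idx, "Difficulty Level"])
    # 2b. Add the rating rescaled to [0, 1], assuming a 0-5 scale;
    #     non-numeric ratings are coerced to NaN and treated as 0.
    ratings = pd.to_numeric(toy["Course Rating"], errors="coerce").fillna(0)
    score += rating_weight * ratings / 5
    # Exclude the query course itself before ranking.
    score = score.drop(idx)
    return toy.loc[score.nlargest(top_n).index, "Course Name"].tolist()


recs = recommend("SQL Basics")
print(recs)
```

The additive weights are a design choice: small values keep text similarity dominant, while larger values let well-rated courses at the same level outrank closer textual matches.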