# Exercises in Recommender systems

This notebook contains exercises on recommender systems.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID`, count the `Book-Rating` entries per user, and sort in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [2]:
courses = pd.read_csv("Coursera.csv")
courses.head(2)

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...


In [3]:
courses["Course Description"].head()

0    Write a Full Length Feature Film Script  In th...
1    By the end of this guided project, you will be...
2    This course consists of a general presentation...
3    When it comes to numbers, there is always more...
4    In this course you'll learn how to effectively...
Name: Course Description, dtype: object

In [4]:
courses["Course Description"].isna().sum()

np.int64(0)

In [5]:
tfidf = TfidfVectorizer(stop_words="english")

In [6]:
tfidf_matrix = tfidf.fit_transform(courses["Course Description"])

In [7]:
tfidf_matrix.shape

(3522, 20074)

In [8]:
tfidf_matrix.toarray()[1, :]

array([0., 0., 0., ..., 0., 0., 0.], shape=(20074,))

In [9]:
tfidf_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 253718 stored elements and shape (3522, 20074)>

In [10]:
%%time
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

CPU times: total: 250 ms
Wall time: 255 ms


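Since the TF-IDF vectorizer L2-normalizes every row by default, the dot product between two rows directly gives their cosine similarity. sklearn's `linear_kernel()` can therefore be used instead of `cosine_similarity()`, and it is faster because it skips the normalization step. The cells below keep that alternative commented out, so `cosine_sim` is still the result from `cosine_similarity()`.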
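As a quick check of that claim, here is a minimal sketch on a made-up three-document corpus (the texts and the `toy_` names are illustrative assumptions, not part of the Coursera data). It compares the two kernels on the same TF-IDF matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus, made up for illustration (not taken from Coursera.csv)
docs = [
    "write a screenplay for film",
    "business strategy with a model canvas",
    "write your first novel",
]

toy_tfidf = TfidfVectorizer(stop_words="english")  # norm="l2" is the default
toy_matrix = toy_tfidf.fit_transform(docs)

sim_cosine = cosine_similarity(toy_matrix, toy_matrix)
sim_linear = linear_kernel(toy_matrix, toy_matrix)  # plain dot products

# Rows already have unit length, so both kernels give the same matrix
same = np.allclose(sim_cosine, sim_linear)
print(same)
```

If the vectorizer were created with `norm=None`, the two results would generally differ, so the shortcut only applies to normalized rows.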

In [11]:
##from sklearn.metrics.pairwise import linear_kernel

In [12]:

# %%time
# cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix) 

In [13]:
cosine_sim

array([[1.00000000e+00, 3.12366523e-02, 1.97603991e-02, ...,
        3.17538002e-02, 3.33859933e-02, 1.96231367e-02],
       [3.12366523e-02, 1.00000000e+00, 8.58915185e-03, ...,
        3.13671991e-02, 4.88239107e-03, 4.56033552e-02],
       [1.97603991e-02, 8.58915185e-03, 1.00000000e+00, ...,
        3.45669421e-03, 1.65197252e-02, 6.37237740e-03],
       ...,
       [3.17538002e-02, 3.13671991e-02, 3.45669421e-03, ...,
        1.00000000e+00, 5.07544593e-04, 6.72367274e-03],
       [3.33859933e-02, 4.88239107e-03, 1.65197252e-02, ...,
        5.07544593e-04, 1.00000000e+00, 1.14068789e-03],
       [1.96231367e-02, 4.56033552e-02, 6.37237740e-03, ...,
        6.72367274e-03, 1.14068789e-03, 1.00000000e+00]],
      shape=(3522, 3522))

The similarity matrix is symmetric:

In [14]:
cosine_sim[0, 1]

np.float64(0.0312366522978012)

In [15]:
cosine_sim[1, 0]

np.float64(0.0312366522978012)

Reverse mapping from course names to row indices:

In [16]:
indices = pd.Series(courses.index, index=courses["Course Name"])
indices = indices[~indices.index.duplicated()]  # keep the first row for each course name

The output below shows that `cosine_sim[0, 1]` is the similarity between the courses "Write A Feature Length Screenplay For Film Or Television" and "Business Strategy: Business Model Canvas Analysis with Miro".

In [17]:
indices[0:5]

Course Name
Write A Feature Length Screenplay For Film Or Television       0
Business Strategy: Business Model Canvas Analysis with Miro    1
Silicon Thin Film Solar Cells                                  2
Finance for Managers                                           3
Retrieve Data using Single-Table SQL Queries                   4
dtype: int64

In [18]:
def get_recommendations(course_name, cosine_sim=cosine_sim):
    # Get the row index of the course that matches the name
    idx = indices[course_name]

    # Get the pairwise similarity scores of all courses with that course
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the courses by similarity score, most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar courses, skipping position 0 (normally the course itself)
    sim_scores = sim_scores[1:11]

    # Get the course indices
    course_indices = [i[0] for i in sim_scores]

    # Return the names of the 10 most similar courses
    return courses["Course Name"].iloc[course_indices]

In [19]:
get_recommendations("Write A Feature Length Screenplay For Film Or Television")

1481    Script Writing: Write a Pilot Episode for a TV...
1629                               Write Your First Novel
3481                                   Transmedia Writing
2186         Presentation skills: Public Speaking Project
3445                   Better Business Writing in English
3384              English for Effective  Business Writing
2894    Automating Team Communication with Google Shee...
614                      Writing in English at University
2732    Writing Professional Email and Memos (Project-...
104                                      Business Writing
Name: Course Name, dtype: object

2. Create a Content-based filtering recommender system based on the Skills.

In [20]:
courses["Skills"].head()

0    Drama  Comedy  peering  screenwriting  film  D...
1    Finance  business plan  persona (user experien...
2    chemistry  physics  Solar Energy  film  lambda...
3    accounts receivable  dupont analysis  analysis...
4    Data Analysis  select (sql)  database manageme...
Name: Skills, dtype: object

In [21]:
courses["Skills"].isna().sum()

np.int64(0)

In [22]:
tfidf_matrix_skills = tfidf.fit_transform(courses["Skills"])

In [23]:
tfidf_matrix_skills.shape

(3522, 4337)

In [24]:
%%time
cosine_sim_skills = cosine_similarity(tfidf_matrix_skills, tfidf_matrix_skills)

CPU times: total: 141 ms
Wall time: 159 ms


In [25]:
cosine_sim_skills

array([[1.        , 0.        , 0.05204333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.20061523, 0.        ,
        0.01306076],
       [0.05204333, 0.        , 1.        , ..., 0.        , 0.1787157 ,
        0.00490933],
       ...,
       [0.        , 0.20061523, 0.        , ..., 1.        , 0.        ,
        0.03178263],
       [0.        , 0.        , 0.1787157 , ..., 0.        , 1.        ,
        0.00459616],
       [0.        , 0.01306076, 0.00490933, ..., 0.03178263, 0.00459616,
        1.        ]], shape=(3522, 3522))

In [26]:
cosine_sim_skills[0, 1]

np.float64(0.0)

In [27]:
cosine_sim_skills[1, 0]

np.float64(0.0)

There is no similarity: the cosine similarity is 0, i.e. the two TF-IDF vectors are orthogonal because the two courses share no skill terms.

In [28]:
indices_skills = pd.Series(courses.index, index=courses["Course Name"])
indices_skills = indices_skills[~indices_skills.index.duplicated()]  # keep the first row for each course name

In [29]:
indices_skills[0:5]

Course Name
Write A Feature Length Screenplay For Film Or Television       0
Business Strategy: Business Model Canvas Analysis with Miro    1
Silicon Thin Film Solar Cells                                  2
Finance for Managers                                           3
Retrieve Data using Single-Table SQL Queries                   4
dtype: int64

In [30]:
get_recommendations("Write A Feature Length Screenplay For Film Or Television", cosine_sim_skills)

1451    Creative Writing: The Craft of Setting and Des...
1481    Script Writing: Write a Pilot Episode for a TV...
3462                 Creative Writing: The Craft of Style
2424                      Writing Stories About Ourselves
3005                             Writing a Personal Essay
339     Memoir and Personal Essay: Managing Your Relat...
3481                                   Transmedia Writing
535                 Writing in First Person Point of View
1629                               Write Your First Novel
3255                         So You Think You Know Tango?
Name: Course Name, dtype: object

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID`, count the `Book-Rating` entries per user, and sort in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.

In [31]:
ratings = pd.read_csv("Books_Ratings.csv")
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [32]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [33]:
user_rating_count = ratings.groupby("User-ID").size()  # number of ratings (rows) per user
user_rating_count

User-ID
2          1
7          1
8         18
9          3
10         2
          ..
278846     2
278849     4
278851    23
278852     1
278854     8
Length: 105283, dtype: int64

In [34]:
top_200_users = user_rating_count.sort_values(ascending=False).head(200).index

In [35]:
filtered_ratings = ratings[ratings["User-ID"].isin(top_200_users)]
filtered_ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
4330,278418,0006128831,0
4331,278418,0006542808,5
4332,278418,0020209606,0
4333,278418,0020418809,0
4334,278418,0020420900,0
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
1147615,275970,9626340762,8


4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [36]:
books = pd.read_csv("Books.csv")
books.head(2)


Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...


In [37]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [38]:
df = books.merge(filtered_ratings, how="left", on="ISBN")
df.head(2)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Book-Rating
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,,
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676.0,8.0


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414693 entries, 0 to 414692
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ISBN                 414693 non-null  object 
 1   Book-Title           414693 non-null  object 
 2   Book-Author          414691 non-null  object 
 3   Year-Of-Publication  414693 non-null  object 
 4   Publisher            414691 non-null  object 
 5   Image-URL-S          414693 non-null  object 
 6   Image-URL-M          414693 non-null  object 
 7   Image-URL-L          414690 non-null  object 
 8   User-ID              270629 non-null  float64
 9   Book-Rating          270629 non-null  float64
dtypes: float64(2), object(8)
memory usage: 31.6+ MB


In [40]:
df.shape

(414693, 10)

In [41]:
df["Book-Title"].nunique()

242135

In [42]:
df["User-ID"].nunique()

200

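The merged DF has 414693 rows covering 242135 unique book titles, and every rating in it comes from one of the 200 users we selected before. Because this is a left join from `books`, books that none of these users rated are kept as rows with a missing `User-ID` and `Book-Rating`, which we will address: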

In [43]:
df[["Book-Title","Book-Rating"]].isna().sum()


Book-Title          0
Book-Rating    144064
dtype: int64

We see that 144064 books have no rating from the 200 users.

In [44]:
df[["Book-Title","Book-Rating"]].dropna().drop(columns=["Book-Rating"]).value_counts()

Book-Title                                                                                                
Bridget Jones's Diary                                                                                         117
Wild Animus                                                                                                   100
The Pelican Brief                                                                                              99
Message in a Bottle                                                                                            97
The Notebook                                                                                                   93
                                                                                                             ... 
Ã?Â?lpiraten.                                                                                                   1
 Deceived                                                                                      

Dropping the missing values, we see that 115766 distinct book titles have been rated by the 200 users.

We now create a new DF containing only the User-IDs, the book titles, and the rating that the user gave the book (if any).

In [45]:
user_rating_books_df = df[["User-ID", "Book-Title", "Book-Rating"]]

In [46]:
user_rating_books_df

Unnamed: 0,User-ID,Book-Title,Book-Rating
0,,Classical Mythology,
1,11676.0,Clara Callan,8.0
2,177458.0,Clara Callan,0.0
3,,Decision in Normandy,
4,197659.0,Flu: The Story of the Great Influenza Pandemic...,9.0
...,...,...,...
414688,,There's a Bat in Bunk Five,
414689,,From One to One Hundred,
414690,,Lily Dale : The True Story of the Town that Ta...,
414691,,Republic (World's Classics),


We now convert the dataset into a "user-item" matrix (wide format). That is, each row represents a unique user and each column represents a unique book title. Each cell holds the rating the user gave that book, or NaN if the user did not rate it.
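The reshaping can be sketched on a tiny, made-up long-format table (the users, titles and ratings below are hypothetical). Note that `pivot_table` aggregates duplicate (user, title) pairs with the mean by default, which is presumably why fractional ratings such as 3.333333 show up further down when several editions share one title:

```python
import pandas as pd

# Hypothetical long-format ratings (made-up values, not from the dataset)
toy_long = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2],
    "Book-Title": ["Dune", "Dune", "Emma", "Dune", "Ulysses"],
    "Book-Rating": [10, 6, 7, 0, 9],
})

# One row per user, one column per title; duplicates are averaged (default aggfunc="mean")
toy_wide = toy_long.pivot_table(index="User-ID", columns="Book-Title", values="Book-Rating")
print(toy_wide)
```

User 1 rated "Dune" twice (10 and 6), so the cell holds the mean, 8.0, while unrated combinations stay NaN.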

In [47]:
user_rating_books_df = user_rating_books_df.pivot_table(index=["User-ID"], columns=["Book-Title"], values="Book-Rating")

In [48]:
user_rating_books_df.head()

Book-Title,"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",Always Have Popsicles,Apple Magic (The Collector's series),Beyond IBM: Leadership Marketing and Finance for the 1990s,Dark Justice,Deceived,"Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth",Final Fantasy Anthology: Official Strategy Guide (Brady Games),Garfield Bigger and Better (Garfield (Numbered Paperback)),"Good Wives: Image and Reality in the Lives of Women in Northern New England, 1650-1750",...,whataboutrick.com: a poetic tribute to Richard A. Ricci,"Â¡Corre, perro, corre!",Â¡Cristina! confidencias de una rubia,Â¿Eres tu mi mamÃ¡?/Are You My Mother?,"Â¿QuÃ© me quieres, amor?","Ã?ber den Wunsch, sich wohlzufÃ¼hlen: Geschichten",Ã?Â?ber das Fernsehen.,Ã?Â?ber die Pflicht zum Ungehorsam gegen den Staat.,Ã?Â?lpiraten.,Ã?Â?stlich der Berge.
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3363.0,,,,,,,,,,,...,,,,,,,,,,
6251.0,,,,,,,,,,,...,,,,,,,,,,
6575.0,,,,,,,,,,,...,,,,,,,,,,
7346.0,,,,,,,,,,,...,,,,,,,,,,
11601.0,,,,0.0,,,,,,,...,,,,,,,,,,


In [49]:
user_rating_books_df.shape

(200, 115766)

We are now ready to build the recommender. We take User-ID 7346 as an example and get the books that this user has rated:

In [50]:
user_7346_df = user_rating_books_df[user_rating_books_df.index == 7346]
user_7346_df

Book-Title,"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",Always Have Popsicles,Apple Magic (The Collector's series),Beyond IBM: Leadership Marketing and Finance for the 1990s,Dark Justice,Deceived,"Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth",Final Fantasy Anthology: Official Strategy Guide (Brady Games),Garfield Bigger and Better (Garfield (Numbered Paperback)),"Good Wives: Image and Reality in the Lives of Women in Northern New England, 1650-1750",...,whataboutrick.com: a poetic tribute to Richard A. Ricci,"Â¡Corre, perro, corre!",Â¡Cristina! confidencias de una rubia,Â¿Eres tu mi mamÃ¡?/Are You My Mother?,"Â¿QuÃ© me quieres, amor?","Ã?ber den Wunsch, sich wohlzufÃ¼hlen: Geschichten",Ã?Â?ber das Fernsehen.,Ã?Â?ber die Pflicht zum Ungehorsam gegen den Staat.,Ã?Â?lpiraten.,Ã?Â?stlich der Berge.
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7346.0,,,,,,,,,,,...,,,,,,,,,,


In [51]:
user_7346_books_rated = user_7346_df.columns[user_7346_df.notna().any()].tolist()
user_7346_books_rated

['10,000 Things to Praise God for',
 '101 Dalmatians',
 '1984',
 '36 Hour Day : A Family Guide to Caring for Person with          Alzheimer Disease',
 'A 2nd Helping of Chicken Soup for the Soul (Chicken Soup for the Soul Series (Paper))',
 'A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash',
 'A CLEAR CASE OF MURDER',
 'A Child Called \\It\\": One Child\'s Courage to Survive"',
 'A Civil Action',
 'A Cow on the Line and Other Thomas the Tank Engine Stories (Please Read to Me)',
 'A DRAGON IN THE FAMILY : A DRAGON IN THE FAMILY',
 'A Dark Traveling',
 'A Fool for Murder: A Mystery',
 'A Girl of the Limberlost',
 'A Great Day for the Deadly',
 'A Hog on Ice and Other Curious Expressions (Harper Colophon Books)',
 'A Kiss Gone Bad',
 'A Lesson Before Dying (Vintage Contemporaries)',
 'A Light in the Window (The Mitford Years)',
 'A Little Princess',
 "A Midsummer Night's Dream",
 'A Most Contagious Game',
 'A Murderous Yarn (Needlecraft Mysteries)',
 'A Night

In [52]:
len(user_7346_books_rated)

972

We see that our user has rated 972 books.

We now create a new DF consisting of only the books that our user has rated. We will use this to look for similar users.

In [53]:
books_rated_df = user_rating_books_df[user_7346_books_rated]

In [54]:
books_rated_df

Book-Title,"10,000 Things to Praise God for",101 Dalmatians,1984,36 Hour Day : A Family Guide to Caring for Person with Alzheimer Disease,A 2nd Helping of Chicken Soup for the Soul (Chicken Soup for the Soul Series (Paper)),A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash,A CLEAR CASE OF MURDER,"A Child Called \It\"": One Child's Courage to Survive""",A Civil Action,A Cow on the Line and Other Thomas the Tank Engine Stories (Please Read to Me),...,Wondrous Beginnings,"Word Freak: Heartbreak, Triumph, Genius, and Obsession in the World of Competitive Scrabble Players",Working Woman's Art of War: Winning Without Confrontation,Working Wounded: Advice That Adds Insight to Injury,Wouldn't It Be Nice?: My Own Story,Wuthering Heights (The World's Classics),Xanth 13: Isle of View,Xanth 14: Question Quest,Xanth 15: The Color of Her Panties,"\O\"" Is for Outlaw"""
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3363.0,,,,,0.0,,,,0.0,,...,,,,,,,,,,
6251.0,,,,,,0.0,,,,,...,,,,,,,,,,
6575.0,,,,,,,,,0.0,,...,,,,,,,,,,
7346.0,8.0,10.0,8.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,7.0,6.0,6.0,6.0,8.0
11601.0,,,,,,,,0.0,,,...,,,,,,,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271284.0,,,,,,,,,,,...,,,,,,,,,,
274061.0,,,,,,,,,,,...,,,,,,,10.0,,10.0,
274308.0,,,,,,,,,,,...,,,,,,,,,,
275970.0,,,0.0,,,,,,,,...,,,,,,,,,,


In [55]:
books_rated_df.shape

(200, 972)

To find similar users, we first count how many of user 7346's rated books each of the other users has also rated.

Here we transpose `books_rated_df`, so that the columns become rows and the rows become columns. In other words, each book title becomes a row, and each column holds one User-ID with that user's rating for the book (if any). Taking `.sum()` of `.notnull()` then gives the number of non-NaN ratings for each user.
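A minimal sketch of this counting step on a made-up 3x3 matrix (hypothetical users and books). It also shows that counting along `axis=1` gives the same numbers without the transpose:

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are books
toy = pd.DataFrame(
    {
        "Book A": [8.0, np.nan, 0.0],
        "Book B": [10.0, 5.0, np.nan],
        "Book C": [7.0, np.nan, np.nan],
    },
    index=pd.Index([101, 102, 103], name="User-ID"),
)

count_via_transpose = toy.T.notnull().sum()  # same pattern as in the notebook
count_via_axis = toy.notna().sum(axis=1)     # equivalent, without transposing

print(count_via_axis)
```

A rating of 0 still counts as "rated" here, because only NaN is treated as missing.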

In [56]:
user_book_count = books_rated_df.T.notnull().sum()

In [57]:
user_book_count

User-ID
3363.0       57
6251.0       70
6575.0       82
7346.0      972
11601.0      82
           ... 
271284.0     40
274061.0     26
274308.0     62
275970.0     43
278418.0     82
Length: 200, dtype: int64

Resetting the index and naming the columns makes the result easier to interpret:

In [58]:
user_book_count = user_book_count.reset_index()
user_book_count.columns = ["User-ID", "book_count"]
user_book_count

Unnamed: 0,User-ID,book_count
0,3363.0,57
1,6251.0,70
2,6575.0,82
3,7346.0,972
4,11601.0,82
...,...,...
195,271284.0,40
196,274061.0,26
197,274308.0,62
198,275970.0,43


We now filter out the users who have rated 15% or less of the books that user 7346 has rated. User 7346 has rated a large number of books, so we choose a low overlap threshold of 15%. Ideally this would be set much higher, e.g. at 70%. For now we stick with this value.

In [63]:
user_same_books = user_book_count[user_book_count["book_count"] > (len(user_7346_books_rated)*15)/100]["User-ID"]
user_same_books

3        7346.0
5       11676.0
25      35859.0
137    198711.0
Name: User-ID, dtype: float64

The above shows the User-IDs (in the values column) of the users who have rated more than 15% of the books that user 7346 has rated. User 7346 itself is included.
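The threshold logic can be illustrated with a small made-up count table (all numbers are hypothetical). Because the comparison uses `>`, a user sitting exactly on the 15% mark is excluded:

```python
import pandas as pd

# Hypothetical overlap counts; user 1 plays the role of the target user
toy_counts = pd.DataFrame({
    "User-ID": [1, 2, 3, 4],
    "book_count": [100, 16, 15, 3],
})
n_target_books = 100  # number of books rated by the hypothetical target user

# Same expression as in the notebook: strictly more than 15% of the target's books
threshold = (n_target_books * 15) / 100
toy_overlap_users = toy_counts[toy_counts["book_count"] > threshold]["User-ID"]

print(toy_overlap_users.tolist())
```

Switching to `>=` would also keep user 3, who has rated exactly 15 of the 100 books.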

We create a new DF consisting of the ratings of the users that passed the overlap threshold above.

In [66]:
final_df = books_rated_df[books_rated_df.index.isin(user_same_books)]
final_df

Book-Title,"10,000 Things to Praise God for",101 Dalmatians,1984,36 Hour Day : A Family Guide to Caring for Person with Alzheimer Disease,A 2nd Helping of Chicken Soup for the Soul (Chicken Soup for the Soul Series (Paper)),A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash,A CLEAR CASE OF MURDER,"A Child Called \It\"": One Child's Courage to Survive""",A Civil Action,A Cow on the Line and Other Thomas the Tank Engine Stories (Please Read to Me),...,Wondrous Beginnings,"Word Freak: Heartbreak, Triumph, Genius, and Obsession in the World of Competitive Scrabble Players",Working Woman's Art of War: Winning Without Confrontation,Working Wounded: Advice That Adds Insight to Injury,Wouldn't It Be Nice?: My Own Story,Wuthering Heights (The World's Classics),Xanth 13: Isle of View,Xanth 14: Question Quest,Xanth 15: The Color of Her Panties,"\O\"" Is for Outlaw"""
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7346.0,8.0,10.0,8.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,7.0,6.0,6.0,6.0,8.0
11676.0,,,3.333333,,0.0,0.0,,0.0,3.5,,...,,7.5,,,,,,,,3.5
35859.0,,,,,,0.0,,0.0,,,...,,0.0,,,,,,,,
198711.0,,0.0,,,,,,,0.0,0.0,...,,,,,,,,,,


We now want to calculate the correlation between the users. We need to transpose the DF, because `.corr()` calculates the correlation between columns.
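To make the transposition concrete, here is a small sketch on a made-up matrix (hypothetical users and ratings). `DataFrame.corr()` computes the Pearson correlation for each pair of columns using only the rows where both have a value, and it returns NaN when one of the two columns has no variance:

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are books
toy = pd.DataFrame(
    {
        "Book A": [8.0, 6.0, 0.0],
        "Book B": [10.0, 9.0, 0.0],
        "Book C": [2.0, 1.0, 0.0],
        "Book D": [7.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3], name="User-ID"),
)

# After transposing, each column is a user, so .corr() compares users
toy_corr = toy.T.corr()

print(toy_corr)
```

Users 1 and 2 rate the shared books similarly and get a correlation close to 1. User 3 gave every book a 0, so the correlation with that user is undefined (NaN).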

In [None]:
example_df = final_df.T ##Showcasing the transposed df, as an example. 
example_df

User-ID,7346.0,11676.0,35859.0,198711.0
Book-Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"10,000 Things to Praise God for",8.0,,,
101 Dalmatians,10.0,,,0.0
1984,8.0,3.333333,,
36 Hour Day : A Family Guide to Caring for Person with Alzheimer Disease,0.0,,,
A 2nd Helping of Chicken Soup for the Soul (Chicken Soup for the Soul Series (Paper)),7.0,0.000000,,
...,...,...,...,...
Wuthering Heights (The World's Classics),7.0,,,
Xanth 13: Isle of View,6.0,,,
Xanth 14: Question Quest,6.0,,,
Xanth 15: The Color of Her Panties,6.0,,,


Transposing and doing the correlation:

In [112]:
corr_df = final_df.T.corr()
corr_df

User-ID,7346.0,11676.0,35859.0,198711.0
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
7346.0,1.0,-0.056281,-0.043172,
11676.0,-0.056281,1.0,-0.095168,
35859.0,-0.043172,-0.095168,1.0,
198711.0,,,,


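As expected, the correlations are poor: both are slightly negative and close to zero, so neither user 11676 nor user 35859 resembles user 7346. The correlation for user 198711 is NaN, i.e. undefined. This typically happens when a user's ratings for these books have no variance (for example, when they are all 0).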

We now make a new DF that shows the correlation of each selected user with user 7346:

In [None]:
user_corr = corr_df[7346].reset_index()  # reset the index, turning the selected column from a Series into a DataFrame
user_corr

Unnamed: 0,User-ID,7346.0
0,7346.0,1.0
1,11676.0,-0.056281
2,35859.0,-0.043172
3,198711.0,


In [None]:
user_corr = user_corr.rename(columns={7346: "correlation"}) ##Renaming column "7346", so it instead is called correlation
user_corr

Unnamed: 0,User-ID,correlation
0,7346.0,1.0
1,11676.0,-0.056281
2,35859.0,-0.043172
3,198711.0,


In [127]:
user_corr = user_corr.sort_values(by="correlation", ascending=False)
user_corr

Unnamed: 0,User-ID,correlation
0,7346.0,1.0
2,35859.0,-0.043172
1,11676.0,-0.056281
3,198711.0,


In [128]:
user_corr = user_corr.loc[user_corr["User-ID"] != 7346]
user_corr

Unnamed: 0,User-ID,correlation
2,35859.0,-0.043172
1,11676.0,-0.056281
3,198711.0,


In [129]:
user_corr = user_corr.reset_index(drop=True)
user_corr

Unnamed: 0,User-ID,correlation
0,35859.0,-0.043172
1,11676.0,-0.056281
2,198711.0,


The above is now a sorted table of correlations between the selected users and user 7346 (with user 7346 itself excluded).

We now merge the correlation df with the ratings df, giving us the correlated users alongside the ISBN and their rating for each book. We use an inner join because we only want users that are present in both dataframes to be included in the final result.

In [131]:
top_users_ratings = user_corr.merge(ratings[["User-ID", "ISBN", "Book-Rating"]], how="inner")
top_users_ratings

Unnamed: 0,User-ID,correlation,ISBN,Book-Rating
0,35859.0,-0.043172,0004722124,10
1,35859.0,-0.043172,0006543936,0
2,35859.0,-0.043172,0006547230,0
3,35859.0,-0.043172,0007101937,0
4,35859.0,-0.043172,0020186002,0
...,...,...,...,...
26997,198711.0,,8511839102,0
26998,198711.0,,9307166813,0
26999,198711.0,,9590624067,0
27000,198711.0,,9631172937,0


Here we create the weighted ratings. More similar users and higher ratings will give us higher weighted ratings. In other words, the weighted rating helps us prioritize ratings of the users who are more similar to the input user. 
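A minimal sketch of the weighting on made-up neighbours (the correlations, ISBNs and ratings are hypothetical). It also shows two edge cases: a NaN correlation produces a NaN weighted rating, and a rating of 0 always produces a weight of 0:

```python
import numpy as np
import pandas as pd

# Hypothetical neighbours with made-up correlations and ratings
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "correlation": [0.8, 0.8, 0.2, 0.2, np.nan],
    "ISBN": ["A", "B", "A", "C", "A"],
    "Book-Rating": [10, 0, 10, 10, 10],
})

# Weight each rating by how similar the rater is to the input user
toy_ratings["weighted_rating"] = toy_ratings["correlation"] * toy_ratings["Book-Rating"]

print(toy_ratings)
```

The same rating of 10 counts for 8.0 when it comes from the strongly correlated user 1, but only 2.0 when it comes from user 2. With negative correlations, as in the notebook's output, higher ratings give more negative weights.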

In [132]:
top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["Book-Rating"]
top_users_ratings

Unnamed: 0,User-ID,correlation,ISBN,Book-Rating,weighted_rating
0,35859.0,-0.043172,0004722124,10,-0.431723
1,35859.0,-0.043172,0006543936,0,-0.000000
2,35859.0,-0.043172,0006547230,0,-0.000000
3,35859.0,-0.043172,0007101937,0,-0.000000
4,35859.0,-0.043172,0020186002,0,-0.000000
...,...,...,...,...,...
26997,198711.0,,8511839102,0,
26998,198711.0,,9307166813,0,
26999,198711.0,,9590624067,0,
27000,198711.0,,9631172937,0,


We now calculate the mean weighted rating for each book. This final weighted rating can be read as how strongly the book is recommended for the input user.
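The aggregation can be sketched on a made-up table of weighted ratings (hypothetical ISBNs and values). The extra `.dropna()` is an assumption added here, not something the notebook does: it removes books whose weights are all NaN, so they do not trail at the end of the ranking:

```python
import numpy as np
import pandas as pd

# Hypothetical weighted ratings, several rows per book
toy_weighted = pd.DataFrame({
    "ISBN": ["A", "A", "B", "C", "C", "D"],
    "weighted_rating": [8.0, 2.0, 3.0, 2.0, np.nan, np.nan],
})

toy_scores = (
    toy_weighted.groupby("ISBN")["weighted_rating"]
    .mean()                        # NaN entries are skipped within each group
    .dropna()                      # assumption: drop books with no usable weight at all
    .sort_values(ascending=False)
    .reset_index()
)

print(toy_scores)
```

Book "C" keeps its single valid weight (2.0), while book "D" has no valid weight and is removed.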

In [139]:
recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
recommendation_df = recommendation_df.reset_index() ## reset index, so that ISBN is not the index anymore
recommendation_df

Unnamed: 0,ISBN,weighted_rating
0,O805063196,0.0
1,O77O428452,0.0
2,0 7336 1053 6,0.0
3,9997511417,0.0
4,9993763128,0.0
...,...,...
24710,8467003995,
24711,8511839102,
24712,9590624067,
24713,9631172937,


In [140]:
books_to_be_recommended = recommendation_df.merge(books[["ISBN", "Book-Title"]], on="ISBN")
books_to_be_recommended = books_to_be_recommended.head()
books_to_be_recommended

Unnamed: 0,ISBN,weighted_rating,Book-Title
0,9997511417,0.0,A Bundle for the Toff
1,9993763128,0.0,Star Wars: From the Adventures of Luke Skywalker
2,9871138016,0.0,Cronica De Una Muerte Anunciada
3,0003252477,0.0,A midsummer night's dream; (The Alexander Shak...
4,000617891X,0.0,At the Stroke of Twelve


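The result is not exactly the same as the previous table, because some ISBNs in the ratings have no matching entry in `Books.csv` (an ISBN without a book title), so the merge drops them.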

In [142]:
books.nunique()

ISBN                   271360
Book-Title             242135
Book-Author            102022
Year-Of-Publication       202
Publisher               16807
Image-URL-S            271044
Image-URL-M            271044
Image-URL-L            271041
dtype: int64

We now wrap the steps above in a recommender function:

In [None]:
def user_based_recommender(input_user, user_rating_books_df, rate_ratio=0.70, num_recommendations=5):
    # Note: relies on the global `ratings` and `books` DataFrames loaded above

    # Books that the input user has rated
    input_user_df = user_rating_books_df[user_rating_books_df.index == input_user]
    input_user_books_rated = input_user_df.columns[input_user_df.notna().any()].tolist()
    books_rated_df = user_rating_books_df[input_user_books_rated]

    # Number of those books that each user has rated
    user_book_count = books_rated_df.T.notnull().sum()
    user_book_count = user_book_count.reset_index()
    user_book_count.columns = ["User-ID", "book_count"]

    # Keep users who rated more than `rate_ratio` of the input user's books
    user_same_books = user_book_count[user_book_count["book_count"] > (len(input_user_books_rated)*rate_ratio)]["User-ID"]

    # Correlation between the remaining users, based on the shared books
    final_df = books_rated_df[books_rated_df.index.isin(user_same_books)]
    corr_df = final_df.T.corr()

    # Correlation of every other user with the input user, highest first
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user: "correlation"})
    user_corr = user_corr.sort_values(by="correlation", ascending=False)
    user_corr = user_corr.loc[user_corr["User-ID"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    # Weight each rating by the rater's correlation with the input user
    top_users_ratings = user_corr.merge(ratings[["User-ID", "ISBN", "Book-Rating"]], how="inner")
    top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["Book-Rating"]

    # Mean weighted rating per book, highest first
    recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by="weighted_rating", ascending=False)
    recommendation_df = recommendation_df.reset_index()

    # Attach the titles and keep the top recommendations
    books_to_be_recommended = recommendation_df.merge(books[["ISBN", "Book-Title"]], on="ISBN")
    books_to_be_recommended = books_to_be_recommended.head(num_recommendations)

    return books_to_be_recommended["Book-Title"]

In [160]:
user_based_recommender(6575, user_rating_books_df, 0.25)

0           Three's a Crowd (Sweet Valley Twins, No 7)
1    Encyclopedia Brown: Boy Detective (Encyclopedi...
2    David Letterman's Book of Top Ten Lists and Ze...
3                               Hawk O'Toole's Hostage
4                                         Miracle Cure
Name: Book-Title, dtype: object

## Exercise 2

Use the "Coursera Courses Dataset 2021" from Exercise 1 to do the following:

1. [Optional] Create a Content-based filtering recommender system based on both the Course Descriptions and the Skills.
2. [Optional] Can you come up with a way of including Difficulty Level and Course Rating in your recommender system?