## Exercise 1
Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load in the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [1]:
#Import libraries
import pandas as pd
import numpy as np

## 1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [3]:
#Load data
coursera_df = pd.read_csv("Datasets/Coursera.csv")
coursera_df.info()
coursera_df.head()
coursera_df.isnull().sum()
print("Shape of coursera dataset", coursera_df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB
Shape of coursera dataset (3522, 7)


To create the content-based recommender system, we have to define a way to represent the content and a way to measure the distance between pieces of content. For this exercise, the content is the Course Description.


In [4]:
coursera_df["Course Description"][0]

'Write a Full Length Feature Film Script  In this course, you will write a complete, feature-length screenplay for film or television, be it a serious drama or romantic comedy or anything in between. You\'ll learn to break down the creative process into components, and you\'ll discover a structured process that allows you to produce a polished and pitch-ready script by the end of the course. Completing this project will increase your confidence in your ideas and abilities, and you\'ll feel prepared to pitch your first script and get started on your next. This is a course designed to tap into your creativity and is based in "Active Learning". Most of the actual learning takes place within your own activities - that is, writing! You will learn by doing.  Here is a link to a TRAILER for the course. To view the trailer, please copy and paste the link into your browser. https://vimeo.com/382067900/b78b800dc0  Learner review: "Love the approach Professor Wheeler takes towards this course. It\'s

We will use TF-IDF to turn each course description into a numeric vector.

In [5]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF vectorizer object. Remove all English stop words
tfidf = TfidfVectorizer(stop_words="english")

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(coursera_df["Course Description"])

#Output the shape of the tfidf_matrix
tfidf_matrix.shape

(3522, 20074)

In [6]:
tfidf_matrix.toarray()[1, :]

array([0., 0., 0., ..., 0., 0., 0.])

In [7]:
tfidf_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 253718 stored elements and shape (3522, 20074)>

Each course description is now represented as a vector of length 20074, and we need a way to measure how close two such vectors are. We are going to use cosine similarity.
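Cosine similarity is the dot product of two vectors divided by the product of their lengths. It is a similarity rather than a distance: for non-negative TF-IDF vectors, 1 means the two vectors point in the same direction and 0 means they share no terms. The sketch below computes it by hand with NumPy. The helper name `cosine` and the three vectors are assumptions made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, only longer
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

print(cosine(a, b))  # vectors pointing the same way
print(cosine(a, c))  # vectors with nothing in common
```

Because the measure ignores vector length, a long description and a short one can still score as highly similar when they use the same terms in the same proportions.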

In [8]:
#Import the cosine similarity function
from sklearn.metrics.pairwise import cosine_similarity

#Calculate the pairwise similarities
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [9]:
cosine_sim.shape

(3522, 3522)

In [10]:
#Verify that the matrix is symmetric (spot check on a single pair)
cosine_sim[0,1] == cosine_sim[1,0]

np.True_
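The check above compares one pair of entries with exact floating-point equality. A more thorough variant, sketched here under the assumption that a small tolerance is acceptable, compares the whole matrix with its transpose using `np.allclose`. The helper name `is_symmetric` and the toy matrices are hypothetical.

```python
import numpy as np

def is_symmetric(matrix, tol=1e-8):
    """Return True if every entry matches its mirrored entry within tol."""
    matrix = np.asarray(matrix)
    return bool(np.allclose(matrix, matrix.T, atol=tol))

# Hypothetical toy matrices for illustration
symmetric_toy = np.array([[1.0, 0.3], [0.3, 1.0]])
asymmetric_toy = np.array([[1.0, 0.3], [0.7, 1.0]])

print(is_symmetric(symmetric_toy))
print(is_symmetric(asymmetric_toy))
```

On the notebook's data this would be called as `is_symmetric(cosine_sim)`.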

In [11]:
#Construct a reverse map from course names to row indices
indices = pd.Series(coursera_df.index, index=coursera_df["Course Name"]).drop_duplicates()


In [12]:
indices

Course Name
Write A Feature Length Screenplay For Film Or Television                 0
Business Strategy: Business Model Canvas Analysis with Miro              1
Silicon Thin Film Solar Cells                                            2
Finance for Managers                                                     3
Retrieve Data using Single-Table SQL Queries                             4
                                                                      ... 
Capstone: Retrieving, Processing, and Visualizing Data with Python    3517
Patrick Henry: Forgotten Founder                                      3518
Business intelligence and data analytics: Generate insights           3519
Rigid Body Dynamics                                                   3520
Architecting with Google Kubernetes Engine: Production                3521
Length: 3522, dtype: int64
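One detail worth knowing: `Series.drop_duplicates()` compares the values of the Series, not its index labels. Here the values are row numbers, which are all unique, so the call leaves the Series unchanged (the length is still 3522). If a course name were repeated, `indices[name]` would return several rows instead of one integer. The sketch below shows one way to keep only the first row per name, using a hypothetical three-row frame. Whether the real dataset contains repeated names is not checked here.

```python
import pandas as pd

# Hypothetical toy frame with a repeated course name
toy = pd.DataFrame({"Course Name": ["A", "B", "A"]})
toy_indices = pd.Series(toy.index, index=toy["Course Name"])

# drop_duplicates looks at the values (0, 1, 2), so nothing is removed
unchanged = len(toy_indices.drop_duplicates()) == len(toy_indices)

# Deduplicate on the index labels instead, keeping the first occurrence
unique_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]

print(unchanged)
print(unique_indices)
```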

In [13]:
#Function that takes in a course title and outputs similar courses based on the course description
def get_recommendation(course_title, cosine_sim=cosine_sim):

    #Get the index of the course that matches the course title
    idx = indices[course_title]

    #Get the pairwise similarity scores of all course titles with that course
    sim_scores = list(enumerate(cosine_sim[idx]))

    #sort the course titles based on the similarity scores
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse=True)

    #Get the scores of the 10 most similar course titles
    sim_scores = sim_scores[1:11]

    #Get the course indices
    course_indices = [i[0] for i in sim_scores]

    #Return the top 10 most similar course titles
    return coursera_df["Course Name"].iloc[course_indices]

In [14]:
get_recommendation('Programming Languages, Part A')

3505                        Programming Languages, Part C
1930                        Programming Languages, Part B
1706                   Functional Program Design in Scala
3042           Functional Programming Principles in Scala
1258               Introduction to Programming in Swift 5
1000                               Crash Course on Python
2364         Mastering Software Development in R Capstone
3362    Miracles of Human Language: An Introduction to...
857                              Programming with Scratch
16                          Python Programming Essentials
Name: Course Name, dtype: object

## 2. Create a Content-based filtering recommender system based on the Skills.


In [16]:
coursera_df["Skills"][0]

'Drama  Comedy  peering  screenwriting  film  Document Review  dialogue  creative writing  Writing  unix shells arts-and-humanities music-and-art'

In [17]:
#Replace missing values with an empty string
coursera_df["Skills"] = coursera_df["Skills"].fillna("")

#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF vectorizer object. Remove all English stop words.
tfidf = TfidfVectorizer(stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(coursera_df["Skills"])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(3522, 4337)

In [18]:
#Calculate the cosine similarity table
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [20]:
#Verify symmetry with a spot check on a single pair
if cosine_sim[0,1] == cosine_sim[1,0]:
    print("Symmetry achieved")
else:
    print("Symmetry failed")

#Construct a reverse map of indices and course titles
indices = pd.Series(coursera_df.index, index = coursera_df["Course Name"]).drop_duplicates()

indices[0:10]

Symmetry achieved


Course Name
Write A Feature Length Screenplay For Film Or Television                                         0
Business Strategy: Business Model Canvas Analysis with Miro                                      1
Silicon Thin Film Solar Cells                                                                    2
Finance for Managers                                                                             3
Retrieve Data using Single-Table SQL Queries                                                     4
Building Test Automation Framework using Selenium and TestNG                                     5
Doing Business in China Capstone                                                                 6
Programming Languages, Part A                                                                    7
The Roles and Responsibilities of Nonprofit Boards of Directors within the Governance Process    8
Business Russian Communication. Part 3                                                           9

In [26]:
#Function that takes in a course name as input and outputs similar courses based on skills
def get_recommendation_skills(course_name, cosine_sim=cosine_sim):

    #Get the index of the course that matches the title
    idx = indices[course_name]

    #Get the pairwise similarity scores of all courses
    sim_scores = list(enumerate(cosine_sim[idx]))

    #Sort the courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    #Get the scores of the 10 most similar courses
    sim_scores = sim_scores[1:11]

    #Get the courses indices
    course_indices = [i[0] for i in sim_scores]

    #Return the top most similar courses
    return coursera_df["Course Name"].iloc[course_indices]

In [29]:
get_recommendation_skills('Programming Languages, Part A')

3042           Functional Programming Principles in Scala
346              Functions, Methods, and Interfaces in Go
3022    Implementing Hangman Game Using Basics of Pyth...
3505                        Programming Languages, Part C
2990                           Kotlin for Java Developers
747             Python Functions, Files, and Dictionaries
1780                               Python Data Structures
1258               Introduction to Programming in Swift 5
1210    Compose and Program Music in Python using Ears...
1399                               Advanced R Programming
Name: Course Name, dtype: object

## 3. Load in the `Books_Ratings.csv` file. Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.


In [32]:
#Load rating data
book_ratings = pd.read_csv("Datasets/Books_Ratings.csv")

In [124]:
#Drop rows with a Book-Rating of 0 (in this dataset, 0 marks an implicit interaction rather than a score)
book_ratings = book_ratings[book_ratings["Book-Rating"] != 0]
#Count ratings per user and keep the 200 users with the most ratings
top_users = book_ratings["User-ID"].value_counts().nlargest(200).index
book_ratings_top = book_ratings[book_ratings["User-ID"].isin(top_users)]


## 4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [125]:
#Load book metadata
books = pd.read_csv("Datasets/Books.csv", low_memory=False)

In [126]:
#Attach book titles to the ratings via ISBN
book_index = book_ratings_top.merge(books, on="ISBN")

In [127]:
#Build the user-item matrix: one row per user, one column per book title
book_index = book_index.pivot_table(index=["User-ID"], columns=["Book-Title"], values="Book-Rating")
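The pivot turns the long ratings table into a user-item matrix in which unrated books are `NaN`. The sketch below shows the same call on a hypothetical five-row table (users, titles and ratings are made up for illustration). Note that `pivot_table` aggregates with the mean by default, so if the same user rated two editions with the same title, the cell holds the average of those ratings.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the Book Recommendation Dataset
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "Book-Title": ["Dune", "Emma", "Dune", "Ulysses", "Emma"],
    "Book-Rating": [8, 6, 9, 7, 5],
})

# One row per user, one column per title, NaN where no rating exists
toy_user_item = toy_ratings.pivot_table(
    index="User-ID", columns="Book-Title", values="Book-Rating"
)

print(toy_user_item)
```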

In [128]:
def user_based_recommender(input_user, user_book_df, rate_ratio = 0.1, num_recommendations = 5):

    #Creating a list of books the input user has rated
    input_user_df = user_book_df[user_book_df.index == input_user]
    input_user_books_rated = input_user_df.columns[input_user_df.notna().any()].tolist()

    #Creating a dataframe with all users' ratings of the books the input user has rated
    books_rated_df = user_book_df[input_user_books_rated]

    #Counting how many of the books rated by the input user each other user has also rated
    user_book_count = books_rated_df.T.notnull().sum()
    user_book_count = user_book_count.reset_index()
    user_book_count.columns = ["User-ID", "Book-Count"]

    #Selecting users who share more than rate_ratio of the input user's rated books
    user_same_books = user_book_count[user_book_count["Book-Count"] > (len(input_user_books_rated)* rate_ratio)]["User-ID"]

    #Creating a correlation matrix based on ratings
    final_df = books_rated_df[books_rated_df.index.isin(user_same_books)]
    corr_df = final_df.T.corr()

    #Create top correlated users
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user:'correlation'})
    user_corr = user_corr.sort_values(by='correlation', ascending=False)
    user_corr = user_corr.loc[user_corr["User-ID"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    #Weighting each rating by the rater's correlation with the input user
    top_users_ratings = user_corr.merge(book_ratings[["User-ID", "ISBN", "Book-Rating"]], on="User-ID", how="inner")
    top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["Book-Rating"]

    #Creating a recommendation dataframe
    recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
    recommendation_df = recommendation_df.reset_index()

    #Create the final recommendation
    books_to_be_recommended = recommendation_df.merge(books[["ISBN", "Book-Title"]], left_on="ISBN", right_on="ISBN").drop(columns=["ISBN"])
    books_to_be_recommended = books_to_be_recommended.head(num_recommendations)

    

    return books_to_be_recommended["Book-Title"]
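The core of the function is `final_df.T.corr()`, which computes the Pearson correlation between every pair of users over the books both have rated. The sketch below reproduces that step and the weighting step on a hypothetical three-user matrix (all values are made up). It also shows why the `rate_ratio` overlap threshold matters: with only two books in common, the correlation can only be +1 or -1.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are book titles
toy = pd.DataFrame(
    {"Dune": [8, 9, 2], "Emma": [6, 7, 9], "Ulysses": [7, 8, np.nan]},
    index=pd.Index([1, 2, 3], name="User-ID"),
)

# Transpose so that users become columns, then correlate the columns;
# pandas uses only the books that both users in a pair have rated
toy_corr = toy.T.corr()

# Weight user 2's and user 3's ratings by their correlation with user 1
toy_weights = toy_corr[1].drop(index=1)
toy_weighted = toy.loc[toy_weights.index].mul(toy_weights, axis=0)

print(toy_corr)
print(toy_weighted)
```

User 2 rates every book one point above user 1, so their correlation is 1. User 3 shares only two books with user 1 and orders them the opposite way, so the correlation is -1 and that user's ratings receive negative weight.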

In [129]:
user_based_recommender(6242, book_index)

0                 Aloha Las Vegas: And Other Plays
1    Ender's Game (Ender Wiggins Saga (Paperback))
2                                 My Louisiana Sky
3                            The Absence of Nectar
4            Ellen Foster (Vintage contemporaries)
Name: Book-Title, dtype: object

In [133]:
#Loop over all users and collect their recommendations, skipping users with none
result = {}
for user_id in book_index.index:
    recommendations = user_based_recommender(user_id, book_index)

    if not recommendations.empty:
        result[f"User-ID:{user_id}"] = pd.DataFrame(recommendations)



In [131]:
#Print results
result

[0                    MARBLE HEART
 1          Angel: City of (Angel)
 2    The Coming Global Superstorm
 3                      The Breach
 4                    Halcyon Daze
 Name: Book-Title, dtype: object,
 Series([], Name: Book-Title, dtype: object),
 Series([], Name: Book-Title, dtype: object),
 0                 Aloha Las Vegas: And Other Plays
 1    Ender's Game (Ender Wiggins Saga (Paperback))
 2                                 My Louisiana Sky
 3                            The Absence of Nectar
 4            Ellen Foster (Vintage contemporaries)
 Name: Book-Title, dtype: object,
 0                    MARBLE HEART
 1          Angel: City of (Angel)
 2    The Coming Global Superstorm
 3                      The Breach
 4                    Halcyon Daze
 Name: Book-Title, dtype: object,
 0                                      Relative Danger
 1    The Girl's Got Bite: The Original Unauthorized...
 2                             Detective Inspector Huss
 3                      Midn