# 10: Recommender Systems

The hand-in exercise for this topic is Exercise 1 from the notebook "Exercises in Recommender systems.ipynb".

---

## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

---

In [40]:
# !pip install surprise

In [41]:
# Lib imports for the notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from collections import defaultdict

In [42]:
# Only needed when working in Google Colab; can be removed when using Jupyter
# from google.colab import files
# uploaded = files.upload()

---

### Coursera Courses Dataset 2021
1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Some EDA to prepare and look at the structure of the data

In [43]:
coursera_data = pd.read_csv('Coursera.csv')

In [44]:
coursera_data.head(10)

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...
5,Building Test Automation Framework using Selen...,Coursera Project Network,Beginner,4.7,https://www.coursera.org/learn/building-test-a...,Selenium is one of the most widely used functi...,maintenance test case test automation scree...
6,Doing Business in China Capstone,The Chinese University of Hong Kong,Advanced,3.3,https://www.coursera.org/learn/doing-business-...,Doing Business in China Capstone enables you t...,marketing plan Planning Marketing consumpti...
7,"Programming Languages, Part A",University of Washington,Intermediate,4.9,https://www.coursera.org/learn/programming-lan...,This course is an introduction to the basic co...,inference ml (programming language) higher-o...
8,The Roles and Responsibilities of Nonprofit Bo...,The State University of New York,Intermediate,4.3,https://www.coursera.org/learn/nonprofit-gov-2,This course provides a more in-depth look at t...,Planning Peer Review fundraising strategic ...
9,Business Russian Communication. Part 3,Saint Petersburg State University,Intermediate,Not Calibrated,https://www.coursera.org/learn/business-russia...,Russian is considered to be one of the most di...,Russian market (economics) tax exemption co...


In [45]:
coursera_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


---

### 1. Create a Content-based filtering recommender system based on the Course Descriptions.

### Recommendation based on similarity in the 'Course Description' feature

* Pass the name of a course from the data that you like, and the function recommends the courses whose descriptions have the most similar TF-IDF vectors.

* I like the course "Retrieve Data using Single-Table SQL Queries", so I am going to look for 5 similar ones.

In [46]:
# Use TF-IDF vectorization to weight the terms of each description, so documents with similar vectors can be matched
tfidf_vectorizer = TfidfVectorizer(stop_words='english') # skip common English words so they don't affect the search
# Fit and transform the Course Descriptions
tfidf_matrix = tfidf_vectorizer.fit_transform(coursera_data['Course Description'])
# Pairwise cosine similarity between all course descriptions
cosine_similar_descriptions = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Recommend the courses whose descriptions are most similar to the given course
def recommend_courses_based_on_desc(course_title, num_recommendations):
    if course_title not in coursera_data['Course Name'].values:
        return "Course not found." # the course does not exist in the data

    # Row index of the first course with a matching name
    idx = coursera_data[coursera_data['Course Name'] == course_title].index[0]
    # (row index, similarity) pairs, sorted from most to least similar
    similarity_scores = list(enumerate(cosine_similar_descriptions[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is expected to be the course itself (similarity 1)
    similarity_scores = similarity_scores[1:num_recommendations+1]
    recommended_courses = [coursera_data.iloc[i[0]]['Course Name'] for i in similarity_scores]
    return recommended_courses

# Search for recommendations based on the course I like
recommend_courses_based_on_desc("Retrieve Data using Single-Table SQL Queries", 5) # I want 5 recommendations

['Creating Database Tables with SQL',
 'Querying Databases Using SQL SELECT statement',
 'Advanced Relational Database and SQL',
 'Create Relational Database Tables Using SQLiteStudio',
 'Intermediate Relational Database and SQL']
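
To make the ranking step easier to follow, here is a minimal sketch of the same TF-IDF plus cosine similarity idea on a tiny corpus. The titles and descriptions (`toy_titles`, `toy_descriptions`) are hypothetical and are not taken from `Coursera.csv`; only the scikit-learn calls match the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus (assumption, not rows from Coursera.csv)
toy_titles = ["SQL basics", "Advanced SQL", "Film writing"]
toy_descriptions = [
    "query a database table with sql select statements",
    "join database tables and write advanced sql queries",
    "write a screenplay for a feature film",
]

# One TF-IDF row vector per description, English stop words removed
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_descriptions)

# Square matrix: entry [i, j] is the cosine similarity of documents i and j
toy_similarity = cosine_similarity(toy_tfidf)

# Rank the other documents by similarity to document 0, most similar first
ranking = toy_similarity[0].argsort()[::-1]
most_similar = [toy_titles[i] for i in ranking if i != 0]
print(most_similar)
```

The two SQL descriptions share the terms "database" and "sql", so they end up closest to each other, while the film description shares no non-stop-word term with the first document and gets a similarity of zero.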

---

### 2. Create a Content-based filtering recommender system based on the Skills

* We can use the same approach as before, but target the 'Skills' feature instead.

* Pass the name of a course from the data that you like, and the function recommends the courses whose skills have the most similar TF-IDF vectors.

* I like the course "Retrieve Data using Single-Table SQL Queries", so I am going to look for 5 similar ones.

In [47]:
# Use TF-IDF vectorization to weight the terms of each skill list, so courses with similar vectors can be matched
tfidf_vectorizer = TfidfVectorizer(stop_words='english') # skip common English words so they don't affect the search
# Fit and transform the Skills
tfidf_matrix = tfidf_vectorizer.fit_transform(coursera_data['Skills'])
# Pairwise cosine similarity between the skill vectors of all courses
cosine_similar_skills = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Recommend the courses whose skills are most similar to the given course
def recommend_courses_based_on_skills(course_title, num_recommendations):
    if course_title not in coursera_data['Course Name'].values:
        return "Course not found." # the course does not exist in the data

    # Row index of the first course with a matching name
    idx = coursera_data[coursera_data['Course Name'] == course_title].index[0]
    # (row index, similarity) pairs, sorted from most to least similar
    similarity_scores = list(enumerate(cosine_similar_skills[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is expected to be the course itself (similarity 1)
    similarity_scores = similarity_scores[1:num_recommendations+1]
    recommended_courses = [coursera_data.iloc[i[0]]['Course Name'] for i in similarity_scores]
    return recommended_courses

# Search for recommendations based on the course I like
recommend_courses_based_on_skills("Retrieve Data using Single-Table SQL Queries", 5) # I want 5 recommendations

['Manipulating Data with SQL',
 'Creating Database Tables with SQL',
 'Managing Big Data with MySQL',
 'Create Relational Database Tables Using SQLiteStudio',
 'Retrieve Data with Multiple-Table SQL Queries']
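
Since the two recommenders differ only in the text column they read, the shared logic could be factored into one helper. The sketch below is one possible way to do that, not the notebook's own code: `build_recommender` and the toy DataFrame are hypothetical, and only the column names are borrowed from `Coursera.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_recommender(df, text_column, name_column="Course Name"):
    """Return a recommend(title, n) function based on TF-IDF of text_column."""
    df = df.reset_index(drop=True)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(df[text_column])
    similarity = cosine_similarity(tfidf)

    def recommend(title, n):
        matches = df.index[df[name_column] == title]
        if len(matches) == 0:
            return []  # unknown title
        idx = matches[0]
        # Sort all rows by similarity, then drop the query row itself
        order = similarity[idx].argsort()[::-1]
        order = [i for i in order if i != idx][:n]
        return df.loc[order, name_column].tolist()

    return recommend


# Hypothetical toy data (assumption) with the same column names as Coursera.csv
toy_courses = pd.DataFrame({
    "Course Name": ["SQL Basics", "SQL Joins", "Screenwriting"],
    "Course Description": [
        "learn to query a database",
        "combine database tables with joins",
        "write a film script",
    ],
    "Skills": ["sql database", "sql joins database", "drama film"],
})

by_description = build_recommender(toy_courses, "Course Description")
by_skills = build_recommender(toy_courses, "Skills")
print(by_description("SQL Basics", 1))
print(by_skills("SQL Basics", 1))
```

Each call to `build_recommender` keeps its own similarity matrix in a closure, so the description-based and skills-based recommenders can coexist without overwriting each other's variables. Filtering out the query row by index, rather than slicing off the first position, also avoids relying on the course always ranking first against itself.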

---

### Book Recommendation Dataset
3. Load in the Ratings.csv file (on moodle, it is called Books_Ratings.csv). Group by User-ID and sort by Book-Rating in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the Books.csv dataset.

### First steps EDA

In [48]:
books_ratings_data = pd.read_csv('Books_Ratings.csv')

In [49]:
books_ratings_data.head(10)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
5,276733,2080674722,0
6,276736,3257224281,8
7,276737,0600570967,6
8,276744,038550120X,7
9,276745,342310538,10


In [50]:
books_ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


---

### 3. Load in the Books_Ratings.csv file. Group by User-ID and sort by Book-Rating in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.

In [51]:
# Count the number of ratings per user
user_rating_count = books_ratings_data.groupby("User-ID").size().reset_index(name="User-Rating-Count")
# Quick look to see if it looks correct
user_rating_count

Unnamed: 0,User-ID,User-Rating-Count
0,2,1
1,7,1
2,8,18
3,9,3
4,10,2
...,...,...
105278,278846,2
105279,278849,4
105280,278851,23
105281,278852,1


In [52]:
# Sort the grouped data in descending order
# Filter the data into a new dataframe, which will only include the 200 users with most ratings in descending order
top_users = user_rating_count.sort_values(by="User-Rating-Count", ascending=False).head(200)
# Quick look to see if it looks correct
top_users

Unnamed: 0,User-ID,User-Rating-Count
4213,11676,13602
74815,198711,7550
58113,153662,6109
37356,98391,5891
13576,35859,5850
...,...,...
99371,262998,679
79124,210035,678
50591,133747,677
40296,106225,677


In [53]:
filtered_books_ratings_data = books_ratings_data[books_ratings_data["User-ID"].isin(top_users["User-ID"])]
# The filtered data now contains only the 200 most active users and all the books they rated
filtered_books_ratings_data['User-ID'].nunique()

200
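
The count, sort and filter steps above can be checked on a small example where the answer is easy to verify by hand. This is only a sketch on hypothetical toy ratings; it uses `value_counts()`, which should give the same ranking as the `groupby(...).size()` plus `sort_values` combination, with N = 2 instead of 200.

```python
import pandas as pd

# Hypothetical toy ratings (assumption) with the same column names as Books_Ratings.csv
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "Book-Rating": [5, 0, 8, 7, 0, 10],
})

# value_counts() sorts by count in descending order,
# so the first N index labels are the N most active users
top_user_ids = toy_ratings["User-ID"].value_counts().head(2).index

# Keep every rating made by one of those users
toy_filtered = toy_ratings[toy_ratings["User-ID"].isin(top_user_ids)]

print(sorted(top_user_ids))
print(len(toy_filtered))
```

User 1 has three ratings and user 2 has two, so user 3 is dropped and five of the six rows remain. Note that the ranking depends on how many ratings a user made, not on the values in `Book-Rating`.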

---

### 4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the Books.csv dataset.

In [54]:
books_data = pd.read_csv('Books.csv')


In [55]:
books_data.head(10)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
5,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
6,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...
7,0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...
8,0679425608,Under the Black Flag: The Romance and the Real...,David Cordingly,1996,Random House,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...


In [56]:
books_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB
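
Both tables share the `ISBN` column, so one plausible way to bring the ratings and the book titles together is a merge on that key, followed by a pivot into a user-by-item matrix. The sketch below uses hypothetical toy frames with the same column names; the choice of an inner join and of `pivot_table` is an assumption for illustration, not necessarily the approach the exercise expects.

```python
import pandas as pd

# Hypothetical toy data (assumption) using column names from the two CSV files
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "ISBN": ["111", "222", "111", "333", "999"],
    "Book-Rating": [8, 5, 7, 9, 6],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Title": ["Book A", "Book B", "Book C"],
})

# Inner join drops ratings whose ISBN has no entry in the books table
toy_merged = toy_ratings.merge(toy_books, on="ISBN", how="inner")

# Users as rows, titles as columns, NaN where a user has not rated a book
user_item = toy_merged.pivot_table(
    index="User-ID", columns="Book-Title", values="Book-Rating"
)
print(user_item)
```

The rating for ISBN `999` has no matching book, so user 3 disappears from the merged data. A matrix like `user_item` is the usual starting point for collaborative filtering, where the missing entries are the ratings a model tries to predict.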


---