# Exercises in Recommender systems

This notebook contains exercises in recommender systems.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID`, count the `Book-Rating` entries per user, and sort the counts in descending order to find the users who rated the most books. Filter the rating data to keep only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from step 3 together with the `Books.csv` dataset.

1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [None]:
# Load the dataset
df = pd.read_csv("Coursera.csv")

# Display dataset structure
print(df.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB
None


In [3]:
df_course = df[['Course Name', 'Course Description']].dropna()

# Display first few rows
df_course.head()


| | Course Name | Course Description |
|---|---|---|
| 0 | Write A Feature Length Screenplay For Film Or ... | Write a Full Length Feature Film Script In th... |
| 1 | Business Strategy: Business Model Canvas Analy... | By the end of this guided project, you will be... |
| 2 | Silicon Thin Film Solar Cells | This course consists of a general presentation... |
| 3 | Finance for Managers | When it comes to numbers, there is always more... |
| 4 | Retrieve Data using Single-Table SQL Queries | In this course you'll learn how to effectively... |


In [11]:
# Define TF-IDF Vectorizer
tfidf_vectorizer_desc = TfidfVectorizer(stop_words='english')

# Transform the course descriptions into TF-IDF matrix
tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(df_course['Course Description'])

# Compute cosine similarity
cosine_sim_desc = cosine_similarity(tfidf_matrix_desc, tfidf_matrix_desc)

# Print shape
print(cosine_sim_desc.shape)


(3522, 3522)
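To make the meaning of the similarity matrix concrete, the following is a minimal sketch on a toy corpus. The three descriptions are hypothetical and are not taken from the Coursera data; the sketch only illustrates that documents sharing vocabulary get a higher cosine similarity than documents that share none.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy descriptions (not from the Coursera dataset)
toy_docs = [
    "introduction to corporate finance and accounting",
    "finance and accounting for managers",
    "writing a screenplay for film",
]

# Same pipeline as above: TF-IDF vectors, then pairwise cosine similarity
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)
toy_sim = cosine_similarity(toy_tfidf)

print(toy_sim.shape)      # one row and one column per document
print(toy_sim.round(2))   # the diagonal is each document compared with itself
```

The two finance descriptions share terms and therefore score above zero against each other, while the screenplay description shares no non-stop-word terms with the first one.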


In [13]:
def recommend_courses_by_description(course_name, num_recommendations=5):
    if course_name not in df_course['Course Name'].values:
        return "Course not found in dataset!"
    
    # Get the row position of the first matching course
    # (rows of cosine_sim_desc follow the row order of df_course)
    idx = (df_course['Course Name'] == course_name).values.argmax()

    # Get similarity scores
    sim_scores = list(enumerate(cosine_sim_desc[idx]))

    # Sort courses by similarity and skip the first entry, which is expected to be the course itself
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:num_recommendations+1]

    # Get row positions of recommended courses
    course_indices = [i[0] for i in sim_scores]

    # Return recommended courses
    return df_course.iloc[course_indices][['Course Name', 'Course Description']]

# Example usage
recommend_courses_by_description("Finance for Managers", num_recommendations=5)


| | Course Name | Course Description |
|---|---|---|
| 1839 | Fundamentals of financial and management accou... | This is an introductory course on financial an... |
| 1891 | Accounting and Finance for IT professionals | This course presents an introduction to the ba... |
| 1985 | Introduction to Finance: The Basics | In the Introduction to Finance I: The Basics c... |
| 419 | Finance for Non-Financial Managers | Finance is for "Non-financial Managers" who wa... |
| 1164 | Corporate Finance Essentials | Corporate Finance Essentials will enable you t... |


2. Create a Content-based filtering recommender system based on the Skills.

In [6]:
df_skills = df[['Course Name', 'Skills']].dropna()
df_skills.head()


| | Course Name | Skills |
|---|---|---|
| 0 | Write A Feature Length Screenplay For Film Or ... | Drama Comedy peering screenwriting film D... |
| 1 | Business Strategy: Business Model Canvas Analy... | Finance business plan persona (user experien... |
| 2 | Silicon Thin Film Solar Cells | chemistry physics Solar Energy film lambda... |
| 3 | Finance for Managers | accounts receivable dupont analysis analysis... |
| 4 | Retrieve Data using Single-Table SQL Queries | Data Analysis select (sql) database manageme... |


In [7]:
# Define TF-IDF Vectorizer
tfidf_vectorizer_skills = TfidfVectorizer(stop_words='english')

# Transform the course skills into TF-IDF matrix
tfidf_matrix_skills = tfidf_vectorizer_skills.fit_transform(df_skills['Skills'])

# Compute cosine similarity
cosine_sim_skills = cosine_similarity(tfidf_matrix_skills, tfidf_matrix_skills)

# Print shape
print(cosine_sim_skills.shape)


(3522, 3522)


In [18]:
def recommend_courses_by_skills(course_name, num_recommendations=5):
    if course_name not in df_skills['Course Name'].values:
        return "Course not found in dataset!"
    
    # Get the row position of the first matching course
    # (rows of cosine_sim_skills follow the row order of df_skills)
    idx = (df_skills['Course Name'] == course_name).values.argmax()

    # Get similarity scores
    sim_scores = list(enumerate(cosine_sim_skills[idx]))

    # Sort courses by similarity and skip the first entry, which is expected to be the course itself
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:num_recommendations+1]

    # Get row positions of recommended courses
    course_indices = [i[0] for i in sim_scores]

    # Return recommended courses
    return df_skills.iloc[course_indices][['Course Name', 'Skills']]

# Example usage
recommend_courses_by_skills("Finance for Managers", 5)


| | Course Name | Skills |
|---|---|---|
| 1228 | The Language and Tools of Financial Analysis | analysis ratio analysis Finance financial r... |
| 1936 | Formal Financial Accounting | accounts payable Finance Accounting debits ... |
| 1126 | Financial Accounting: Foundations | financial statement accrual income Financia... |
| 1298 | Management and financial accounting: Know your... | internality balance sheet Leadership and Man... |
| 2995 | Entrepreneurship | Entrepreneurship interview market (economics... |


3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID`, count the `Book-Rating` entries per user, and sort the counts in descending order to find the users who rated the most books. Filter the rating data to keep only the 200 users who rated the most books.

In [None]:
# Load the book ratings dataset
df_ratings = pd.read_csv("Books_Ratings.csv")

# Count how many books each user has rated
user_counts = df_ratings.groupby("User-ID")["Book-Rating"].count().reset_index()

# Get the top 200 users with the most ratings
top_200_users = user_counts.sort_values(by="Book-Rating", ascending=False).head(200)["User-ID"]

# Filter ratings dataset to only include the top 200 users
filtered_ratings = df_ratings[df_ratings["User-ID"].isin(top_200_users)]

# Load the books dataset
df_books = pd.read_csv("Books.csv")

# Merge ratings with book details using ISBN
df_merged = filtered_ratings.merge(df_books, on="ISBN", how="left")

# Keep only relevant columns
df_merged = df_merged[["User-ID", "ISBN", "Book-Title", "Book-Rating"]]

# Display dataset
print(df_merged.head())




      User-ID        ISBN  Book-Rating
4330   278418  0006128831            0
4331   278418  0006542808            5
4332   278418  0020209606            0
4333   278418  0020418809            0
4334   278418  0020420900            0
   User-ID        ISBN                                         Book-Title  \
0   278418  0006128831                                                NaN   
1   278418  0006542808                              Silence of the Sirens   
2   278418  0020209606                                NEVER ALONE REISSUE   
3   278418  0020418809                                    CADDIE WOODLAWN   
4   278418  0020420900  Paul Revere : Boston Patriot (Childhood Of Fam...   

   Book-Rating  
0            0  
1            5  
2            0  
3            0  
4            0  
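The selection of the most active users can also be checked on a small example. The sketch below uses a hypothetical toy ratings table and `value_counts`, which counts rows per user and sorts the counts in descending order, as an alternative to the `groupby` and `sort_values` combination used above. The column names match the dataset; the values are made up.

```python
import pandas as pd

# Hypothetical toy ratings (values are illustrative only)
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "Book-Rating": [0, 5, 7, 8, 0, 10],
})

# value_counts() returns the number of ratings per user, largest first
top_users = toy_ratings["User-ID"].value_counts().head(2).index

# Keep only the ratings made by the selected users
toy_filtered = toy_ratings[toy_ratings["User-ID"].isin(top_users)]

print(sorted(top_users))
print(len(toy_filtered))
```

Here users 1 and 2 are kept because they have three and two ratings respectively, and the single rating from user 3 is filtered out.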


4. Create a Collaborative filtering recommender system based on the user ratings from step 3 together with the `Books.csv` dataset.

In [58]:
# Create a User-Item rating matrix (Users as rows, Books as columns)
user_item_matrix = df_merged.pivot_table(index="User-ID", columns="Book-Title", values="Book-Rating", fill_value=0)

# Compute cosine similarity between users
user_similarity = cosine_similarity(user_item_matrix)

# Convert similarity matrix into a DataFrame
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)

# Display similarity matrix
print(user_similarity_df.head())


User-ID    3363      6251      6575      7346      11601     11676     12538   \
User-ID                                                                         
3363     1.000000  0.000000  0.026805  0.013205  0.000000  0.022112  0.009240   
6251     0.000000  1.000000  0.048537  0.014077  0.019330  0.037508  0.000000   
6575     0.026805  0.048537  1.000000  0.055537  0.012073  0.063106  0.020387   
7346     0.013205  0.014077  0.055537  1.000000  0.007492  0.051761  0.034727   
11601    0.000000  0.019330  0.012073  0.007492  1.000000  0.020077  0.019004   

User-ID    13552     15408     16634   ...    264321    265115    265313  \
User-ID                                ...                                 
3363     0.007907  0.012508  0.000000  ...  0.005234  0.005454  0.000000   
6251     0.010237  0.006046  0.020697  ...  0.013283  0.007398  0.024124   
6575     0.021962  0.018973  0.010158  ...  0.019694  0.032359  0.012910   
7346     0.025568  0.015265  0.009065  ...  0.015554
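The user-user similarity step can be illustrated on a toy table. This is a minimal sketch with hypothetical ratings, assuming the same pivot and cosine-similarity calls as in the cell above. It shows that cosine similarity compares the direction of two rating vectors rather than their magnitude, and that users with no books in common get a similarity of zero.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: user 2 rates the same books as user 1, at half the score
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "Book-Title": ["A", "B", "A", "B", "C"],
    "Book-Rating": [8, 6, 4, 3, 9],
})

# Users as rows, books as columns, missing ratings filled with 0
toy_matrix = toy.pivot_table(
    index="User-ID", columns="Book-Title", values="Book-Rating", fill_value=0
)

# Pairwise cosine similarity between the user rows
toy_user_sim = pd.DataFrame(
    cosine_similarity(toy_matrix),
    index=toy_matrix.index,
    columns=toy_matrix.index,
)

print(toy_user_sim.round(2))
```

Users 1 and 2 have proportional rating vectors, so their similarity is 1 even though user 2 rates lower overall. User 3 shares no books with the others and scores 0 against both.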

In [62]:
def recommend_books(user_id, rating_matrix, user_similarity_df, df_books, top_n=5):
    
    if user_id not in user_similarity_df.index:
        return "User not found in top 200!"

    # Get similarity scores for the user, excluding themselves
    sim_scores = user_similarity_df.loc[user_id].drop(user_id)

    # Select top 5 most similar users
    similar_users = sim_scores.sort_values(ascending=False).head(5).index

    # Aggregate ratings from similar users
    similar_ratings = rating_matrix.loc[similar_users]

    # Compute the average rating for each book among these similar users
    avg_ratings = similar_ratings.mean(axis=0)

    # Find books the target user has not rated
    target_user_ratings = rating_matrix.loc[user_id]
    unrated_books = target_user_ratings[target_user_ratings == 0].index

    # Filter ratings to only include unrated books
    recommendations = avg_ratings.loc[unrated_books].sort_values(ascending=False).head(top_n)

    # Look up book details by title; a title can match several ISBNs (editions),
    # so more than top_n rows may be returned, in df_books order rather than score order
    recommended_books = df_books[df_books['Book-Title'].isin(recommendations.index)][['ISBN', 'Book-Title', 'Book-Author']]

    return recommended_books


In [61]:
example_user = top_200_users.iloc[0]  # Select the first user in the top 200

top_books = recommend_books(example_user, user_item_matrix, user_similarity_df, df_books, top_n=5)

print(f"Top {len(top_books)} recommended books for User {example_user}:")
print(top_books)

Top 10 recommended books for User 11676:
              ISBN                                    Book-Title  \
1360    0671759361                    Pearl in the Mist (Landry)   
4269    0316284955  White Oleander : A Novel (Oprah's Book Club)   
15611   0064400026                   Little House on the Prairie   
24362   0064400069                The Long Winter (Little House)   
26127   0440167531                                      Palomino   
43124   0060581859                The Long Winter (Little House)   
108090  0060522410                The Long Winter (Little House)   
188054  0702219487                                      Palomino   
202942  0061070068                   Little House on the Prairie   
219583  044056753X                                      Palomino   

                 Book-Author  
1360            V.C. Andrews  
4269             Janet Fitch  
15611   Laura Ingalls Wilder  
24362   Laura Ingalls Wilder  
26127         DANIELLE STEEL  
43124   Laura Ingalls Wi
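The output lists ten rows although `top_n=5` was requested, because several titles appear in `Books.csv` under more than one ISBN. One possible way to return a single row per title, ordered by predicted score, is sketched below on hypothetical toy data. The `score` column and the toy catalogue are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy catalogue in which one title appears under two ISBNs
toy_books = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Title": ["Palomino", "Palomino", "White Oleander"],
    "Book-Author": ["Danielle Steel", "DANIELLE STEEL", "Janet Fitch"],
})

# Hypothetical predicted scores, indexed by title as in recommend_books
toy_scores = pd.Series({"White Oleander": 4.0, "Palomino": 2.5})

deduped = (
    toy_books[toy_books["Book-Title"].isin(toy_scores.index)]
    .drop_duplicates(subset="Book-Title")                      # keep the first edition per title
    .assign(score=lambda d: d["Book-Title"].map(toy_scores))   # attach the predicted score
    .sort_values("score", ascending=False)                     # best recommendation first
    .reset_index(drop=True)
)

print(deduped)
```

Keeping the first edition is an arbitrary choice; any rule for picking one ISBN per title would work equally well here.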

## Exercise 2

Use the "Coursera Courses Dataset 2021" from Exercise 1 to do the following:

1. [Optional] Create a Content-based filtering recommender system based on both the Course Descriptions and the Skills.
2. [Optional] Can you come up with a way of including Difficulty Level and Course Rating in your recommender system?
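One possible approach to both optional tasks is sketched below on a hypothetical toy table with the same column names as the Coursera data. It concatenates the description and skills text before TF-IDF, blends the similarity with a normalised rating, and optionally restricts results to the same difficulty level. The function `recommend_combined`, the `rating_weight` parameter, the toy rows, and the `"unrated"` placeholder are all assumptions for illustration. The `object` dtype of `Course Rating` in the `df.info()` output suggests the column may need converting to numbers first, which is what `pd.to_numeric(..., errors="coerce")` does here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy courses (same column names as the dataset, made-up values)
toy_courses = pd.DataFrame({
    "Course Name": ["Finance Basics", "Accounting 101", "Advanced Finance", "Screenwriting"],
    "Difficulty Level": ["Beginner", "Beginner", "Advanced", "Beginner"],
    "Course Rating": ["4.8", "4.2", "unrated", "4.9"],
    "Course Description": [
        "learn finance and budgeting",
        "learn accounting and bookkeeping",
        "finance valuation and derivatives",
        "write a film script",
    ],
    "Skills": ["finance budgeting", "accounting finance", "finance valuation", "film writing"],
})

# Task 1: one text field per course, built from description and skills
combined_text = toy_courses["Course Description"] + " " + toy_courses["Skills"]
sim = cosine_similarity(TfidfVectorizer(stop_words="english").fit_transform(combined_text))

# Task 2: numeric rating scaled to [0, 1]; non-numeric entries get the mean rating
rating = pd.to_numeric(toy_courses["Course Rating"], errors="coerce")
rating_norm = rating.fillna(rating.mean()) / 5.0


def recommend_combined(course_name, n=2, same_difficulty=False, rating_weight=0.2):
    """Rank courses by a weighted blend of text similarity and rating."""
    pos = toy_courses.index[toy_courses["Course Name"] == course_name][0]
    score = pd.Series(sim[pos], index=toy_courses.index)
    score = (1 - rating_weight) * score + rating_weight * rating_norm
    score = score.drop(pos)  # never recommend the query course itself
    if same_difficulty:
        keep = toy_courses["Difficulty Level"] == toy_courses.loc[pos, "Difficulty Level"]
        score = score[keep.drop(pos)]
    top = score.sort_values(ascending=False).head(n)
    return toy_courses.loc[top.index, ["Course Name", "Difficulty Level", "Course Rating"]]


print(recommend_combined("Finance Basics"))
print(recommend_combined("Finance Basics", same_difficulty=True))
```

The weight of 0.2 is arbitrary. A larger value pushes highly rated courses up even when their text is less similar, so it would need tuning on the real data.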