<h1 style="color:orange">This notebook was created in collaboration between Shibab Ahsan (79826) and Sebastian Rix (71411)</h1>

## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on moodle, to do the following:

3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from scipy.sparse.linalg import svds
import kagglehub



In [2]:
path = kagglehub.dataset_download("khusheekapoor/coursera-courses-dataset-2021")
df = pd.read_csv(path+"/Coursera.csv")


df.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


In [3]:
print(df.columns)


Index(['Course Name', 'University', 'Difficulty Level', 'Course Rating',
       'Course URL', 'Course Description', 'Skills'],
      dtype='object')


### Task 1: Create a Content-based filtering recommender system based on the Course Descriptions.


In [4]:
# Step 1: Import required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Step 2: Preprocess the descriptions
df['Course Description'] = df['Course Description'].fillna("")

# Step 3: Convert text to TF-IDF vectors
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['Course Description'])

# Step 4: Compute cosine similarity between all courses
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

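# Step 5: Create mapping from course name to index
# (de-duplicate on the name labels so each name maps to a single row; the first row wins)
indices = pd.Series(df.index, index=df['Course Name'])
indices = indices[~indices.index.duplicated(keep='first')]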

# Step 6: Build the recommendation function
def recommend_courses(course_name, num_recommendations=5):
    if course_name not in indices:
        return f"Course '{course_name}' not found in dataset."

    idx = indices[course_name]
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the query course explicitly instead of assuming it sorts first,
    # then keep the highest similarity scores
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:num_recommendations]

    # Get indices of the top similar courses
    course_indices = [i[0] for i in sim_scores]

    # Return recommended course details
    return df[['Course Name', 'University', 'Difficulty Level', 'Course Rating']].iloc[course_indices]


In [5]:
recommend_courses("Finance for Managers")


Unnamed: 0,Course Name,University,Difficulty Level,Course Rating
1839,Fundamentals of financial and management accou...,Politecnico di Milano,Beginner,4.7
1891,Accounting and Finance for IT professionals,Indian School of Business,Beginner,4.5
1985,Introduction to Finance: The Basics,University of Illinois at Urbana-Champaign,Advanced,4.6
419,Finance for Non-Financial Managers,Emory University,Beginner,4.2
1164,Corporate Finance Essentials,IESE Business School,Beginner,4.8
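
The description-based recommender above combines TF-IDF vectors with cosine similarity. The following is a minimal sketch of that idea on a hypothetical three-document corpus (the texts are made up for illustration and are not taken from the Coursera data). It also checks that `linear_kernel`, which is imported in the first cell but not used, gives the same scores here, because `TfidfVectorizer` L2-normalises each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions, not taken from the Coursera dataset
docs = [
    "corporate finance and accounting for managers",
    "introduction to finance and accounting",
    "writing a screenplay for film",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)  # rows are L2-normalised by default

cos = cosine_similarity(matrix)
lin = linear_kernel(matrix, matrix)

# For unit-length rows the plain dot product equals the cosine similarity
same = bool(np.allclose(cos, lin))

# Rank the other documents by similarity to document 0, skipping document 0 itself
order = [int(i) for i in np.argsort(-cos[0]) if i != 0]
print(same, order)
```

Document 1 shares the terms "finance" and "accounting" with document 0, while document 2 shares none, so it ranks last with a similarity of zero.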


### Task 2: Create a Content-based filtering recommender system based on the Skills.

In [6]:
# Step 1: Import required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Step 2: Preprocess the 'Skills' column
df['Skills'] = df['Skills'].fillna("")

# Step 3: Convert 'Skills' to TF-IDF vectors
tfidf_skills = TfidfVectorizer(stop_words='english')
skills_matrix = tfidf_skills.fit_transform(df['Skills'])

# Step 4: Compute cosine similarity
skills_similarity = cosine_similarity(skills_matrix, skills_matrix)

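# Step 5: Create course name to index mapping
# (de-duplicate on the name labels so each name maps to a single row; the first row wins)
indices_skills = pd.Series(df.index, index=df['Course Name'])
indices_skills = indices_skills[~indices_skills.index.duplicated(keep='first')]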

# Step 6: Define recommender function (based on skills)
def recommend_by_skills(course_name, num_recommendations=5):
    if course_name not in indices_skills:
        return f"Course '{course_name}' not found in dataset."

    idx = indices_skills[course_name]
    sim_scores = list(enumerate(skills_similarity[idx]))

    # Drop the query course explicitly, then sort by similarity
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:num_recommendations]

    # Get indices of top similar courses
    top_indices = [i[0] for i in sim_scores]

    return df[['Course Name', 'University', 'Skills', 'Course Rating']].iloc[top_indices]


In [7]:
recommend_by_skills("Retrieve Data using Single-Table SQL Queries")


Unnamed: 0,Course Name,University,Skills,Course Rating
2892,Manipulating Data with SQL,Coursera Project Network,system u table (database) relational databas...,4.6
1546,Creating Database Tables with SQL,Coursera Project Network,SQL HTML5 mysql database management systems...,4.6
119,Managing Big Data with MySQL,Duke University,SQL Leadership and Management analysis rela...,4.6
92,Create Relational Database Tables Using SQLite...,Coursera Project Network,database management systems Databases web br...,4.7
3272,Retrieve Data with Multiple-Table SQL Queries,Coursera Project Network,relational database data retrieval Databases...,4.7
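
Both recommenders look up a course through a name-to-row mapping, which only works cleanly when each name resolves to a single row. The sketch below uses a hypothetical toy frame with one repeated name to show how `drop_duplicates()` on the Series compares with de-duplicating on the index labels. Whether the Coursera data actually contains repeated course names is not checked here.

```python
import pandas as pd

# Hypothetical toy frame with one repeated course name
toy = pd.DataFrame({"Course Name": ["SQL Basics", "Finance 101", "SQL Basics"]})

# The values are row positions, which are all unique, so drop_duplicates()
# keeps every entry and the repeated name survives in the index
by_values = pd.Series(toy.index, index=toy["Course Name"]).drop_duplicates()

# De-duplicating on the index labels keeps the first row for each name
mapping = pd.Series(toy.index, index=toy["Course Name"])
by_labels = mapping[~mapping.index.duplicated(keep="first")]

print(len(by_values), len(by_labels))
print(type(by_values["SQL Basics"]).__name__)  # lookup on a repeated label
print(int(by_labels["SQL Basics"]))            # lookup on a unique label
```

With a repeated label, the lookup returns a Series of positions rather than one position, which would then select several rows of the similarity matrix at once.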


### Task 3: Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.


In [8]:
path = kagglehub.dataset_download("arashnic/book-recommendation-dataset")
df = pd.read_csv(path+"/Ratings.csv")
df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [9]:
import pandas as pd
# Step 1: Load the dataset and count the ratings per 'User-ID'
df = pd.read_csv(path+"/Ratings.csv")
user_ratings = df.groupby('User-ID').size().reset_index(name='Book Count')
# Step 2: Sort by 'Book Count' in descending order and keep the top 200 users
top_users = user_ratings.sort_values(by='Book Count', ascending=False).head(200)
# Step 3: Filter the original dataset to only include these top users
top_user_ids = top_users['User-ID'].unique()
filtered_df = df[df['User-ID'].isin(top_user_ids)]
filtered_df


Unnamed: 0,User-ID,ISBN,Book-Rating
4330,278418,0006128831,0
4331,278418,0006542808,5
4332,278418,0020209606,0
4333,278418,0020418809,0
4334,278418,0020420900,0
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
1147615,275970,9626340762,8


In [10]:
import pandas as pd

# Step 1: Load the dataset
ratings_df = pd.read_csv(path+"/Ratings.csv")

# Step 2: Count number of ratings per user
user_rating_counts = ratings_df.groupby('User-ID').size().sort_values(ascending=False)

# Step 3: Get top 200 users who rated the most books
top_200_users = user_rating_counts.head(200).index

# Step 4: Filter ratings for those users
filtered_ratings_df = ratings_df[ratings_df['User-ID'].isin(top_200_users)]

# Step 5: Sort by User-ID and Book-Rating in descending order
filtered_ratings_df = filtered_ratings_df.sort_values(by=['User-ID', 'Book-Rating'], ascending=[True, False])

# Step 6: View result
filtered_ratings_df.head(50)


Unnamed: 0,User-ID,ISBN,Book-Rating
17882,3363,0060294698,10
17905,3363,006098824X,10
17912,3363,0064405176,10
17920,3363,0064460932,10
17926,3363,0064472795,10
17941,3363,0140186484,10
17956,3363,0140365931,10
17972,3363,0152006737,10
17974,3363,0152021973,10
17991,3363,0312046448,10
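
The filtering step in Task 3 can be checked by hand on a small example. This is a minimal sketch on hypothetical toy ratings; the helper name `keep_most_active` is an assumption introduced for illustration and does not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy ratings using the same column names as Ratings.csv
toy_ratings = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "ISBN":        list("abcdefghij"),
    "Book-Rating": [0, 5, 8, 10, 0, 7, 3, 0, 9, 6],
})

def keep_most_active(ratings, n_users):
    """Keep only the rows of the n_users users with the most ratings."""
    counts = ratings["User-ID"].value_counts()  # sorted in descending order
    top_ids = counts.head(n_users).index
    return ratings[ratings["User-ID"].isin(top_ids)]

subset = keep_most_active(toy_ratings, n_users=2)
print(sorted(subset["User-ID"].unique()))
print(len(subset))
```

User 4 has four ratings and user 1 has three, so those two users and their seven rows remain. When several users tie at the cut-off, `head` keeps whichever of them happens to come first.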


### Task 4: Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [11]:
books_df = pd.read_csv(path+"/Books.csv")
books_df.head(100)




Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
95,0671867156,Pretend You Don't See Her,Mary Higgins Clark,1998,Pocket,http://images.amazon.com/images/P/0671867156.0...,http://images.amazon.com/images/P/0671867156.0...,http://images.amazon.com/images/P/0671867156.0...
96,0312252617,Fast Women,Jennifer Crusie,2001,St. Martin's Press,http://images.amazon.com/images/P/0312252617.0...,http://images.amazon.com/images/P/0312252617.0...,http://images.amazon.com/images/P/0312252617.0...
97,0312261594,Female Intelligence,Jane Heller,2001,St. Martin's Press,http://images.amazon.com/images/P/0312261594.0...,http://images.amazon.com/images/P/0312261594.0...,http://images.amazon.com/images/P/0312261594.0...
98,0316748641,Pasquale's Nose: Idle Days in an Italian Town,Michael Rips,2002,Back Bay Books,http://images.amazon.com/images/P/0316748641.0...,http://images.amazon.com/images/P/0316748641.0...,http://images.amazon.com/images/P/0316748641.0...


In [12]:

# Step 1: Re-import everything cleanly to prepare for collaborative filtering
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Load ratings and books datasets
ratings_df = pd.read_csv(path+"/Ratings.csv")
books_df = pd.read_csv(path+"/Books.csv")

# Step 2: Filter top 200 users from Task 3
user_rating_counts = ratings_df.groupby('User-ID').size().sort_values(ascending=False)
top_200_users = user_rating_counts.head(200).index
filtered_ratings_df = ratings_df[ratings_df['User-ID'].isin(top_200_users)]

# Step 3: Merge with Books.csv to get book titles
merged_df = pd.merge(filtered_ratings_df, books_df[['ISBN', 'Book-Title']], on='ISBN', how='inner')

# Step 4: Create the User-Book matrix
user_book_matrix = merged_df.pivot_table(index='User-ID', columns='Book-Title', values='Book-Rating')

# Fill missing values with 0 (unrated books)
user_book_matrix_filled = user_book_matrix.fillna(0)

# Step 5: Build the collaborative filtering model using Nearest Neighbors (User-Based)
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(user_book_matrix_filled)

# Step 6: Define recommendation function
def recommend_books_for_user(user_id, num_recommendations=5):
    if user_id not in user_book_matrix_filled.index:
        return f"User-ID {user_id} not found."

    user_vector = user_book_matrix_filled.loc[user_id].values.reshape(1, -1)
    distances, indices = model_knn.kneighbors(user_vector, n_neighbors=6)  # include the user itself

    similar_users = user_book_matrix_filled.index[indices.flatten()[1:]]  # exclude the input user

    # Get books rated highly by similar users but not by the target user
    user_books = set(user_book_matrix.loc[user_id].dropna().index)
    recommendations = {}

    for sim_user in similar_users:
        sim_user_ratings = user_book_matrix.loc[sim_user].dropna()
        for book, rating in sim_user_ratings.items():
            if book not in user_books and rating >= 8:  # only consider strong ratings
                recommendations[book] = recommendations.get(book, 0) + rating

    # Sort and return top N recommended books
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return [book for book, _ in sorted_recommendations[:num_recommendations]]

# Example usage: take the 11th User-ID in the filtered set (any User-ID in the matrix index works)
example_user_id = user_book_matrix_filled.index[10]
recommend_books_for_user(example_user_id)




["Charlotte's Web (Trophy Newbery)",
 'Harry Potter and the Chamber of Secrets (Book 2)',
 'On the Banks of Plum Creek',
 'By the Shores of Silver Lake (Little House)',
 'These Happy Golden Years (Little House)']
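
To see what the user-based approach does, it helps to trace it on a matrix small enough to check by hand. The sketch below uses a hypothetical 4x4 ratings table (users `u1` to `u4` and books A to D are made up) and follows the same idea as the function above: find the nearest users by cosine distance, then collect books they rated at least 8 that the target user has not rated. The helper names `nearest_users` and `recommend` are assumptions for illustration, and here 0 is treated as "not rated".

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings: rows are users, columns are books, 0 = not rated
toy = pd.DataFrame(
    [[10, 9, 0, 0],
     [9, 10, 8, 0],
     [0, 0, 9, 10],
     [0, 1, 10, 9]],
    index=["u1", "u2", "u3", "u4"],
    columns=["Book A", "Book B", "Book C", "Book D"],
)

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy.values)

def nearest_users(user, k=1):
    # Ask for k + 1 neighbours, then drop the query user by label
    _, idx = knn.kneighbors(toy.loc[[user]].values, n_neighbors=k + 1)
    return [u for u in toy.index[idx.flatten()] if u != user][:k]

def recommend(user, k=1, min_rating=8):
    seen = toy.loc[user] > 0
    neighbours = toy.loc[nearest_users(user, k)]
    # Sum only the strong ratings given by the neighbours
    scores = neighbours.where(neighbours >= min_rating, 0).sum()
    scores = scores[~seen & (scores > 0)]
    return list(scores.sort_values(ascending=False).index)

print(nearest_users("u1"))
print(recommend("u1"))
print(recommend("u3"))
```

User `u2` is closest to `u1` because both rated Books A and B highly, so `u1` is offered Book C. User `u3` gets an empty list, since the nearest neighbour `u4` has no strong rating for a book that `u3` has not already rated.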