# Notebook Overview

## 1. Data Loading & Preprocessing
- Load `cleaned_book_ratings_plus.csv`.  
- Convert `user_id` and `isbn` to string type.  
- Separate books and users dataframes.  

## 2. Train/Test Split
- Split ratings so each user has ~30% of ratings in the test set.  
- Function ensures each user appears in both train and test.  

## 3. Non-Personalized Recommendation
- Recommend top books based on `weighted_score`.  
- Filter books with at least 30 ratings.  

## 4. Content-Based Recommendation
- TF-IDF vectorization of titles.  
- One-hot encoding of authors and publishers.  
- Min-max scaling for numerical features (year).  
- Recommend based on cosine similarity with user profile.  

## 5. Collaborative Filtering
- Pivot table of users vs books.  
- Fill missing values with user mean ratings.  
- Recommend based on top-5 similar users using cosine similarity.  

## 6. Evaluation
- Precision@k and Recall@k computed for sampled users.  
- Low precision/recall is expected due to sparsity of Book-Crossing dataset.


# Data Loading

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

In [2]:
df = pd.read_csv("../data/cleaned_book_ratings_plus.csv")
df['user_id']=df['user_id'].astype('str')
df['isbn']=df['isbn'].astype('str')

In [3]:
books=df.drop_duplicates(subset='isbn')[['isbn','book_rating','title','author','year','publisher','img_url','num_of_rating','avg_book_rate','weighted_score']]
users=df.drop_duplicates(subset='user_id')[['user_id','user_age','location','fav_author','fav_publisher']]

In [4]:
print(df.shape)
print(books.shape)
print(users.shape)

(98605, 15)
(10135, 10)
(8203, 5)


# Data Split

Split ratings into training set and test set so that:

- Each user has at least num_of_rates_in_test ratings placed in the test set.

- The rest of the ratings stay in the train set.

This ensures every user appears in both train and test, which is crucial for evaluating recommendation systems.

In [5]:
# check min number of rating for each user
df.groupby('user_id').size().min()

3

In [6]:
def train_test_split_one_in_test(df):
    grouped=df.groupby('user_id')
    train_list=[]
    test_list=[]
    for uid,g in grouped:
        #test size = 30%
        #train size = 70%
        num_of_rates_in_test=int(len(g)*0.3)
        test_idx = g.sample(n=num_of_rates_in_test,random_state=42).index
        train_idx = g.index.difference(test_idx)
        test_list.extend(list(test_idx))
        train_list.extend(list(train_idx))

    train_df=df.loc[train_list]
    test_df=df.loc[test_list]
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

In [7]:
train_df, test_df=train_test_split_one_in_test(df)
print("train:", train_df.shape, "test:", test_df.shape)

train: (73359, 15) test: (25246, 15)


# Models

## Non-Personalized

In [8]:
#select only books which have sufficient number of ratings
most_freq_books=books[books['num_of_rating']>30]
most_freq_books.shape

(929, 10)

In [9]:
#function for non_personalized top books 
def select_non_personalized_top_books(num_books):
    return most_freq_books.sort_values('weighted_score').head(num_books).isbn.values

select_non_personalized_top_books(3)

array(['0971880107', '0871138190', '039914739X'], dtype=object)

## Item Content Based

### Overview
The content-based recommender suggests books to a user based on the features of the books they have liked in the past.
It builds a user profile by aggregating the features of previously liked books and recommends books that are most similar in content.

### Steps

#### 1. Feature Extraction
- **Title:** TF-IDF vectorization (`min_df=2`, `max_df=0.7`)  
- **Author & Publisher:** One-hot encoding  
- **Year:** Min-Max scaling  

#### 2. User Profile Construction
- Select books the user rated above a threshold (e.g., 5).  
- Compute the mean vector of the selected books’ features to represent the user profile.  

#### 3. Recommendation
- Compute cosine similarity between the user profile and all other books.  
- Exclude books the user has already rated.  
- Return the top-k most similar books.


In [10]:
#create tfidfvectorizer to generate features for titles
#min_df = 2 -> make a word features if it only accours at least twice
#max_df = 0.7 -> if word appear in more than 70% of titles then ingore it 
tfidvec=TfidfVectorizer(min_df=2,max_df=0.7)
vectorized_titles=tfidvec.fit_transform(books.title)

#create datafram and put features for each book in it
books_features=pd.DataFrame(vectorized_titles.toarray(),columns=tfidvec.get_feature_names_out(),index=books['isbn'])

In [11]:
#encode author using one hot encoding and add it as features
encoder=OneHotEncoder()
encoded_authors=encoder.fit_transform(books[['author']])
encoded_authors=pd.DataFrame(encoded_authors.toarray(),columns=encoder.get_feature_names_out(),index=books['isbn'])
books_features=pd.concat((books_features,encoded_authors),axis=1)

In [12]:
#encode publisher using one hot encoding and add it as features
encoder=OneHotEncoder()
encoded_publisher=encoder.fit_transform(books[['publisher']])
encoded_publisher=pd.DataFrame(encoded_publisher.toarray(),columns=encoder.get_feature_names_out(),index=books['isbn'])
books_features=pd.concat((books_features,encoded_publisher),axis=1)

In [13]:
#for year we use minmaxscaler as it is numerical column
scaler=MinMaxScaler()
scaled_numric=scaler.fit_transform(books[['year']])
scaled_numric=pd.DataFrame(scaled_numric,columns=['year'],index=books['isbn'])
books_features=pd.concat((books_features,scaled_numric),axis=1)

In [14]:
books_features

Unnamed: 0_level_0,000,03,10,100,1001,101,11,12,13,14,...,publisher_Xlibris Corporation,publisher_Yearling,publisher_Yearling Books,publisher_Yossi Ghinsberg,publisher_Zebra Books,publisher_Zebra Books (Mass Market),publisher_Zondervan Publishing Company,publisher_Zumaya Publications,publisher_btb,year
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0060517794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.976471
0671537458,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.882353
0679776818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.905882
0060096195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.964706
0141310340,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.964706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0886776791,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.894118
0553238132,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.929412
0804115419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.964706
075640049X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.964706


In [15]:
def find_similar_books_content_based(book_isbn, num_books):
    # compute cosine similarity between the given book and all other books
    sim_scores = cosine_similarity(
        books_features.loc[book_isbn].values.reshape(1, -1),
        books_features.drop(book_isbn, axis=0)
    )
    
    # convert similarity scores into DataFrame with book ISBNs as index
    sim_scores = pd.DataFrame(
        sim_scores[0],
        index=books_features.drop(book_isbn, axis=0).index,
        columns=['score']
    )
    
    # return top N most similar books sorted by similarity score
    return sim_scores.sort_values('score', ascending=False).head(num_books)


In [16]:
find_similar_books_content_based('0151010668',3)

Unnamed: 0_level_0,score
isbn,Unnamed: 1_level_1
151009716,0.572806
151006040,0.571068
151006903,0.571068


In [17]:
def recommend_for_user_content_based(user_id, num_books):
    # keep only books the user rated above 5
    user_pervious_books = df[(df['user_id'] == user_id) & (df['book_rating'] > 5)]

    # if user has no such books, return empty recommendations
    if user_pervious_books.empty:
        return pd.DataFrame([], columns=['score'])

    # take features for these books
    user_pervious_books = books_features.loc[user_pervious_books['isbn']]

    # build user profile
    user_pervious_books_mean = user_pervious_books.mean().values.reshape(1, -1)

    # compute similarity with all other books
    sim_scores = cosine_similarity(
        user_pervious_books_mean,
        books_features.drop(user_pervious_books.index, axis=0)
    )

    # wrap results
    sim_scores = pd.DataFrame(
        sim_scores[0],
        index=books_features.drop(user_pervious_books.index, axis=0).index,
        columns=['score']
    )

    return sim_scores.sort_values('score', ascending=False).head(num_books).index


In [18]:
recommend_for_user_content_based('276747',5)

Index(['0671864173', '0060502258', '0060976845', '0671872001', '0670839809'], dtype='object', name='isbn')

## User Based Colaporative Filtering

### Overview
The collaborative filtering recommender suggests books to a user based on the ratings of similar users. It finds users with similar taste and recommends books they liked that the target user hasn’t read yet.

### Steps

#### 1. User-Book Matrix
- Pivot the dataset to create a matrix with `user_id` as rows, `isbn` as columns, and `book_rating` as values.  
- Fill missing ratings with the user’s mean rating.  

#### 2. User Similarity
- Compute cosine similarity between the target user and all other users.  

#### 3. Top-N Recommendations
- Identify top-k most similar users.  
- Compute mean ratings of books from these users.  
- Exclude books already rated by the target user.  
- Return top-k books with the highest mean ratings.


In [19]:
#create pivot tabel user accros book
user_book_pivot=train_df.pivot(index='user_id',columns='isbn',values='book_rating')

In [20]:
#fill null values with mean for each user
user_book_pivot=user_book_pivot.apply(lambda row:row.fillna(row.mean()),axis=1)

In [21]:
def recommend_for_user_colaporative_filtering(user_id, num_books):
    # Compute cosine similarity between the target user and all other users
    sim = cosine_similarity(
        user_book_pivot.loc[user_id].values.reshape(1, -1),
        user_book_pivot.drop(user_id, axis=0).values
    )

    # Store similarities in a DataFrame, indexed by user_id
    users_score = pd.DataFrame(
        sim.reshape(-1, 1),
        columns=['score'],
        index=user_book_pivot.drop(user_id, axis=0).index
    )

    # Pick top-5 most similar users
    top_users = users_score.sort_values('score', ascending=False).head(5).index

    # Average their ratings for each book
    mean_ratings = user_book_pivot.loc[top_users].mean(axis=0)

    # Exclude user rated books
    mean_ratings= mean_ratings[~mean_ratings.index.isin(train_df[train_df['user_id']==user_id]['isbn'])]
    
    # Return top books with highest mean rating
    return mean_ratings.sort_values(ascending=False).head(num_books).index


In [22]:
recommend_for_user_colaporative_filtering('276747',5)

Index(['0002251760', '0609806564', '0609801864', '0609803875', '0609804138'], dtype='object', name='isbn')

# Evaluation

In [None]:
def precision_recall_at_k(model, test_df, k=5):
    precisions, recalls = [], []
    i=0
    for user in test_df['user_id'].unique():
        # true books the user rated in test
        if i%50==0:
            print(f'{model.__name__} :user number{i}')
        i+=1
        true_books = test_df[test_df['user_id']==user]['isbn'].values
        
        # skip users with no test data
        if len(true_books) == 0:
            continue

        # get recommendations
        try:
            recs = model(user, k)  # your recommender function
        except:
            continue

        recs = set(recs)  # recommended
        true_books = set(true_books)  # actual
        # compute precision and recall
        hit = len(recs & true_books)
        precisions.append(hit / k if k > 0 else 0)
        recalls.append(hit / len(true_books) if len(true_books) > 0 else 0)

    return np.mean(precisions), np.mean(recalls)


In [None]:
sample_users=users.iloc[:500]['user_id']
sample_users=test_df[test_df['user_id'].isin(sample_users)]
sample_users

In [27]:
p_cb, r_cb = precision_recall_at_k(recommend_for_user_content_based, sample_users, k=20)
p_cf, r_cf = precision_recall_at_k(recommend_for_user_colaporative_filtering, sample_users, k=20)

k=20
print(f"Content-based: Precision@{k} =", p_cb, f"Recall@{k} =", r_cb)
print(f"Collaborative: Precision@{k} =", p_cf, f"Recall@{k} =", r_cf)


recommend_for_user_content_based :user number0
recommend_for_user_content_based :user number50
recommend_for_user_content_based :user number100
recommend_for_user_content_based :user number150
recommend_for_user_content_based :user number200
recommend_for_user_content_based :user number250
recommend_for_user_content_based :user number300
recommend_for_user_content_based :user number350
recommend_for_user_colaporative_filtering :user number0
recommend_for_user_colaporative_filtering :user number50
recommend_for_user_colaporative_filtering :user number100
recommend_for_user_colaporative_filtering :user number150
recommend_for_user_colaporative_filtering :user number200
recommend_for_user_colaporative_filtering :user number250
recommend_for_user_colaporative_filtering :user number300
recommend_for_user_colaporative_filtering :user number350
Content-based: Precision@20 = 0.0010638297872340426 Recall@20 = 0.012879939209726445
Collaborative: Precision@20 = 0.0007978723404255319 Recall@20 = 0