### Book Recommendation System


### Step 1: Loading and Preparing Data

Importing necessary libraries:

In [None]:
import pandas as pd
import numpy as np


Load the data with pandas and take a first look at it.


In [None]:
# Load the data from CSV file
df = pd.read_csv('Book_Ratings.csv', sep=';', encoding='utf-8')

# Display the first few rows of the dataframe
print("Data Overview:")
df.head(10)


Data Overview:


Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
5,276733,2080674722,0
6,276736,3257224281,8
7,276737,0600570967,6
8,276744,038550120X,7
9,276745,342310538,10


### Step 2: Data Analysis
Let's analyze the data by checking the number of users, the number of books, and the distribution of ratings.


In [None]:
# Basic data analysis
num_users = df['User-ID'].nunique()  # Count of unique users
num_books = df['ISBN'].nunique()  # Count of unique books
rating_distribution = df['Book-Rating'].value_counts()  # Distribution of ratings

print(f'Number of users: {num_users}')
print(f'Number of books: {num_books}')
print(f'Distribution of ratings: \n{rating_distribution}')


Number of users: 941
Number of books: 9340
Distribution of ratings: 
0     7353
8      630
7      516
9      410
10     393
5      294
6      254
4       72
3       48
2       20
1        9
Name: Book-Rating, dtype: int64
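
About 74% of the rows above (7,353 of 9,999) carry a rating of 0. If this file follows the Book-Crossing convention, where 0 marks an implicit interaction rather than a score of zero (an assumption; the notebook does not say), it can help to look at the explicit ratings separately. A minimal sketch on a made-up toy frame with the same column names:

```python
import pandas as pd

# Toy ratings in the same column layout as the notebook's dataframe
# (values are made up for illustration).
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "A", "B", "C"],
    "Book-Rating": [0, 8, 5, 0, 10],
})

# Assumption: a rating of 0 means "interacted but gave no score".
explicit = toy[toy["Book-Rating"] > 0]

share_implicit = (toy["Book-Rating"] == 0).mean()
mean_explicit = explicit["Book-Rating"].mean()

print(f"Share of zero ratings: {share_implicit:.0%}")
print(f"Mean explicit rating: {mean_explicit:.2f}")
```

On the real data, the same filter would be applied to `df` instead of `toy`.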


### Step 3: Preparing the Scoring Matrix
Now, let's create a utility matrix (scoring matrix), where each row represents a user, each column represents a book, and each cell contains the rating given by that user to that book.


In [8]:
# Creating the scoring matrix (pivot table)
scoring_matrix = df.pivot_table(index='User-ID', columns='ISBN', values='Book-Rating', fill_value=0)

# Display the scoring matrix
scoring_matrix.head(100)


User-ID,0002005018,0002231115,0002232766,0002240114,000225669X,000254794,0002558122,0002740230,0006128831,0006379702,...,9782264014184,9782922145441,9788401499883,9871138016,9995585227,B0000BLD7X,B158991965,DITISEENSOORT,N3453124715,O6712345670
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
276733,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
276736,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
276737,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
276744,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
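
Going by the counts printed in Step 2, this matrix has 941 × 9,340 = 8,788,940 cells, of which at most 2,646 can hold a non-zero rating, so it is almost entirely zeros. A small sketch of how the density could be measured, using a made-up toy frame in place of `df`:

```python
import pandas as pd

# Toy ratings (illustrative values only).
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [5, 8, 0, 10],
})

# Same pivot as in the notebook: users as rows, books as columns.
matrix = toy.pivot_table(index="User-ID", columns="ISBN",
                         values="Book-Rating", fill_value=0)

n_cells = matrix.size
n_nonzero = int((matrix.to_numpy() != 0).sum())
density = n_nonzero / n_cells

print(matrix.shape, f"density = {density:.1%}")
```

A very low density is one reason a dense pivot table can exhaust memory as more rows are loaded.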


### Step 4: Collaborative Filtering (User-Based)
User-based collaborative filtering relies on a measure of similarity between users. I'll use two: cosine similarity and Pearson correlation.

#### 4.1: Cosine Similarity
Cosine similarity measures how alike two users are as the cosine of the angle between their rating vectors. With non-negative ratings it ranges from 0 (no rated books in common) to 1 (ratings point in exactly the same direction).



In [10]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix between users
cosine_sim = cosine_similarity(scoring_matrix)

# Convert the cosine similarity matrix to a dataframe for better readability
cosine_sim_df = pd.DataFrame(cosine_sim, index=scoring_matrix.index, columns=scoring_matrix.index)

# Display the cosine similarity matrix for the first few users
cosine_sim_df.head(100)


User-ID,2,7,8,9,10,12,14,16,17,19,...,278832,278836,278838,278843,278844,278846,278849,278851,278852,278854
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
276733,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
276736,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
276737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
276744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
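
The 0.0 diagonal entries for users 2 and 7 in the output above suggest that these users have all-zero rating vectors, for which the cosine is undefined and is reported as 0. The sketch below illustrates the definition on made-up vectors; the `cosine` helper and the user names are hypothetical, and the zero-vector convention is an assumption chosen to match that output.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two rating vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

alice = np.array([5.0, 0.0, 8.0])
bob = np.array([10.0, 0.0, 16.0])   # same direction as alice, twice the magnitude
carol = np.array([0.0, 7.0, 0.0])   # no rated books in common with alice
nobody = np.zeros(3)                # a user with no non-zero ratings

print(cosine(alice, bob))      # same direction
print(cosine(alice, carol))    # orthogonal vectors
print(cosine(nobody, nobody))  # undefined case, reported as 0.0
```

Users with no non-zero ratings are similar to nobody under this measure, so no ratings can be predicted for them.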


#### 4.2: Pearson Correlation
Pearson correlation measures the linear relationship between the ratings of two users. Because it subtracts each user's mean rating first, it compensates for users who rate consistently high or low.



In [11]:
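# Compute the Pearson correlation matrix.
# Note: DataFrame.corr() correlates columns, so on this user x book matrix the
# result is indexed by ISBN (book-to-book), as the output below shows.
# For a user-to-user matrix, transpose first: scoring_matrix.T.corr()
pearson_sim = scoring_matrix.corr()

# Display the first few rows of the correlation matrix
pearson_sim.head()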





ISBN,0002005018,0002231115,0002232766,0002240114,000225669X,000254794,0002558122,0002740230,0006128831,0006379702,...,9782264014184,9782922145441,9788401499883,9871138016,9995585227,B0000BLD7X,B158991965,DITISEENSOORT,N3453124715,O6712345670
0002005018,1.0,,,-0.001064,-0.001064,-0.001064,,-0.001064,,-0.001064,...,,-0.001064,,,,,-0.001064,,-0.001064,-0.001064
0002231115,,,,,,,,,,,...,,,,,,,,,,
0002232766,,,,,,,,,,,...,,,,,,,,,,
0002240114,-0.001064,,,1.0,-0.001064,-0.001064,,-0.001064,,-0.001064,...,,-0.001064,,,,,-0.001064,,-0.001064,-0.001064
000225669X,-0.001064,,,-0.001064,1.0,-0.001064,,-0.001064,,-0.001064,...,,-0.001064,,,,,-0.001064,,-0.001064,-0.001064
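
The output above is indexed by ISBN on both axes because `DataFrame.corr()` computes pairwise correlations between columns. The following sketch, on a made-up 3 × 3 matrix, shows how transposing first would give a user-by-user matrix instead, and where the `NaN` entries come from (a column with no variation has no defined correlation).

```python
import pandas as pd

# Toy user x book matrix (rows: users, columns: books); values are illustrative.
toy = pd.DataFrame(
    [[5, 3, 0],
     [10, 6, 0],
     [1, 9, 0]],
    index=pd.Index(["u1", "u2", "u3"], name="User-ID"),
    columns=pd.Index(["bookA", "bookB", "bookC"], name="ISBN"),
)

book_corr = toy.corr()     # correlates columns -> book x book
user_corr = toy.T.corr()   # transpose first    -> user x user

print(book_corr.round(3))  # bookC is constant, so its row and column are NaN
print(user_corr.round(3))  # u1 and u2 rate proportionally, so they correlate at 1
```

To use Pearson similarity with `recommend_books`, the user-by-user form is the one that matches what the function expects.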


### Step 5: Making Book Recommendations
Now that we have computed the similarities between users, I will use these similarities to predict ratings for books that a user hasn't rated yet. I'll recommend the top 5 books for a specific user based on their predicted ratings.

#### 5.1: Function to Recommend Books Based on Similarity
The function below predicts a rating for each book the user has not rated, as the similarity-weighted average of the ratings from users who did rate it, and returns the `top_n` books with the highest predictions.



In [12]:
# Function to recommend books for a specific user
def recommend_books(user_id, sim_matrix, scoring_matrix, top_n=5):
    # Get the similarities for the target user
    sim_scores = sim_matrix[user_id]
    
    # Get the books rated by the user
    user_ratings = scoring_matrix.loc[user_id]
    
    # Find books that the user has not rated (a stored 0 is treated as unrated)
    unrated_books = user_ratings[user_ratings == 0].index
    
    # Calculate weighted ratings for each unrated book
    book_scores = {}
    for book in unrated_books:
        total_score = 0
        total_weight = 0
        for other_user in scoring_matrix.index:
            if scoring_matrix.at[other_user, book] > 0:
                weight = sim_scores[other_user]  # Similarity score as weight
                total_score += weight * scoring_matrix.at[other_user, book]
                total_weight += weight
        if total_weight > 0:
            book_scores[book] = total_score / total_weight
    
    # Sort books by predicted rating and return the top N
    recommended_books = sorted(book_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    
    return recommended_books

# Example: Recommend books for user 278418
recommended_books = recommend_books(278418, cosine_sim_df, scoring_matrix)
print(f'Recommended books for user 278418: {recommended_books}')


Recommended books for user 278418: [('780446525800', 10.0), ('0060572965', 9.0), ('038533558X', 9.0), ('0671658301', 9.0), ('0671891537', 9.0)]
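
The nested loops in `recommend_books` visit every (book, user) pair, which gets slow as the matrix grows. As a sketch only, the same similarity-weighted average can be written with matrix products; `predict_scores` is a hypothetical helper, not part of the notebook, and the toy similarity values are made up.

```python
import numpy as np
import pandas as pd

def predict_scores(user_id, sim_df, ratings):
    """Similarity-weighted mean rating per book, using only users who rated it."""
    weights = sim_df.loc[user_id].reindex(ratings.index).to_numpy(dtype=float)
    R = ratings.to_numpy(dtype=float)
    rated = (R > 0).astype(float)       # assumption: 0 means "not rated"
    numerator = weights @ R             # unrated cells contribute nothing
    denominator = weights @ rated       # total weight of users who rated each book
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = np.where(denominator > 0, numerator / denominator, np.nan)
    scores = pd.Series(scores, index=ratings.columns)
    # Keep only books the target user has not rated, best first
    return scores[ratings.loc[user_id] == 0].dropna().sort_values(ascending=False)

users = ["u1", "u2", "u3"]
ratings = pd.DataFrame([[8, 0, 0], [8, 6, 0], [0, 10, 4]],
                       index=users, columns=["A", "B", "C"])
sim = pd.DataFrame([[1.0, 0.8, 0.2], [0.8, 1.0, 0.1], [0.2, 0.1, 1.0]],
                   index=users, columns=users)

top = predict_scores("u1", sim, ratings)
print(top.head(5))
```

For `u1`, book B is predicted as (0.8·6 + 0.2·10) / (0.8 + 0.2) = 6.8 and book C as (0.2·4) / 0.2 = 4.0, the same values the loop-based formula gives.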


### Step 6: Reporting and Documentation

#### Data Overview: 
Describe the dataset, including the number of unique users and books, and the distribution of ratings.
#### Similarity Metrics: 
Explain the two similarity metrics used:

Cosine Similarity: Compares the direction of two users’ rating vectors and ignores their magnitude, so two users who favor the same books count as similar even if one user’s ratings are proportionally higher.

Pearson Correlation: Measures the linear relationship between two users’ ratings after subtracting each user’s mean, which corrects for users who rate systematically high or low.
#### Recommendation Function: 
Discuss how the recommendation system works:

1. I compute the similarity scores between users.
2. I predict ratings for books that a user hasn’t rated yet.
3. I recommend books with the highest predicted ratings.
#### Results: 
Provide examples of recommended books for specific users and explain why the recommendation works (i.e., why the most similar users' ratings influence the recommendations).


### Conclusion
#### Cosine Similarity vs Pearson Correlation: 
Either metric can be used to weight other users' ratings; the example in Step 5 uses cosine similarity. Cosine similarity may be the better fit when ratings are sparse or when overall rating patterns matter more than exact values.

#### Performance: 
For large datasets, where the user-item matrix becomes increasingly sparse, matrix factorization techniques such as Singular Value Decomposition (SVD) could improve performance.
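
To make the SVD idea concrete, here is a minimal sketch of a rank-`k` approximation on a made-up 4 × 4 matrix using NumPy. It treats zeros as actual values, which is a simplification; recommender-oriented factorization methods usually fit only the observed ratings. The matrix values and the choice `k = 2` are assumptions for illustration.

```python
import numpy as np

# Toy user x book rating matrix (illustrative values; 0 = unrated).
R = np.array([
    [8, 7, 0, 0],
    [9, 8, 0, 1],
    [0, 0, 6, 7],
    [1, 0, 7, 8],
], dtype=float)

k = 2  # number of latent factors to keep (a tuning choice)

# Thin SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative reconstruction error of the rank-k approximation (Frobenius norm)
rel_error = np.linalg.norm(R - R_k) / np.linalg.norm(R)

print(R_k.round(2))
print(f"relative error with k={k}: {rel_error:.3f}")
```

The entries of `R_k` in positions where `R` is 0 can be read as rough predicted scores, and storing `U[:, :k]`, `s[:k]` and `Vt[:k, :]` takes far less memory than the full matrix when `k` is small.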

This collaborative filtering-based recommendation system will allow the bookstore to suggest relevant books to users based on their preferences, thereby enhancing their user experience and potentially increasing sales.


##### For this exercise, I used only the first 10,000 records because of limited memory on my system.

### Thank you!