# Content
The Book-Crossing dataset comprises 3 files.

* Users
Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL-values.

* Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon web site.

* Ratings
Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
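
Because a rating of 0 marks an implicit interaction rather than a low score, it can help to separate the two kinds of rating before computing averages. The snippet below is a minimal sketch on a made-up toy frame that reuses the dataset's column names; the values and the variable names (`toy_ratings`, `explicit`, `implicit`) are assumptions for illustration only.

```python
import pandas as pd

# Toy ratings frame using the dataset's column names (values are invented).
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 5, 0],
})

# Explicit ratings are on the 1-10 scale; a 0 marks an implicit interaction.
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

# Averaging only the explicit ratings avoids treating 0 as "worst possible".
explicit_mean = explicit["Book-Rating"].mean()
print(len(explicit), len(implicit), explicit_mean)
```

Whether implicit ratings should be dropped, kept as 0, or modelled separately depends on the recommender being built; the rest of this notebook keeps them as 0.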

In [87]:
# Importing Libraries
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [88]:
#loading rating dataset
ratings = pd.read_csv("Ratings.csv")
print(ratings.head())

   User-ID        ISBN  Book-Rating
0   276725  034545104X            0
1   276726  0155061224            5
2   276727  0446520802            0
3   276729  052165615X            3
4   276729  0521795028            6


In [89]:
# loading books dataset
books = pd.read_csv("Books.csv")
print(books.head())

         ISBN                                         Book-Title  \
0  0195153448                                Classical Mythology   
1  0002005018                                       Clara Callan   
2  0060973129                               Decision in Normandy   
3  0374157065  Flu: The Story of the Great Influenza Pandemic...   
4  0393045218                             The Mummies of Urumchi   

            Book-Author Year-Of-Publication                   Publisher  \
0    Mark P. O. Morford                2002     Oxford University Press   
1  Richard Bruce Wright                2001       HarperFlamingo Canada   
2          Carlo D'Este                1991             HarperPerennial   
3      Gina Bari Kolata                1999        Farrar Straus Giroux   
4       E. J. W. Barber                1999  W. W. Norton &amp; Company   

                                         Image-URL-S  \
0  http://images.amazon.com/images/P/0195153448.0...   
1  http://images.amazon.com/



In [90]:
# loading users dataset
users = pd.read_csv("Users.csv")
print(users.head())

   User-ID                            Location   Age
0        1                  nyc, new york, usa   NaN
1        2           stockton, california, usa  18.0
2        3     moscow, yukon territory, russia   NaN
3        4           porto, v.n.gaia, portugal  17.0
4        5  farnborough, hants, united kingdom   NaN


In [91]:
ratings_with_name=ratings.merge(books, on="ISBN")

In [92]:
df=ratings_with_name.merge(users, on="User-ID")
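
Both merges above use pandas' default inner join, so any rating whose ISBN has no entry in the books table (or whose user is missing from the users table) is silently dropped. A possible way to check how many rows are affected is a left merge with `indicator=True`. This is only a sketch on invented toy frames; `toy_ratings`, `toy_books`, `audit`, and `unmatched` are hypothetical names.

```python
import pandas as pd

# Toy frames: ISBN "Z" is rated but has no entry in the books table.
toy_ratings = pd.DataFrame({"ISBN": ["A", "B", "Z"], "Book-Rating": [5, 0, 7]})
toy_books = pd.DataFrame({"ISBN": ["A", "B", "C"],
                          "Book-Title": ["Alpha", "Beta", "Gamma"]})

# Default (inner) merge keeps only ISBNs present in both frames.
inner = toy_ratings.merge(toy_books, on="ISBN")

# A left merge with indicator=True adds a "_merge" column that flags
# ratings without a matching book as "left_only".
audit = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(inner), unmatched["ISBN"].tolist())
```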

In [93]:
df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,Location,Age
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,"tyler, texas, usa",
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,"cincinnati, ohio, usa",23.0
2,2313,0812533550,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books,http://images.amazon.com/images/P/0812533550.0...,http://images.amazon.com/images/P/0812533550.0...,http://images.amazon.com/images/P/0812533550.0...,"cincinnati, ohio, usa",23.0
3,2313,0679745580,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage,http://images.amazon.com/images/P/0679745580.0...,http://images.amazon.com/images/P/0679745580.0...,http://images.amazon.com/images/P/0679745580.0...,"cincinnati, ohio, usa",23.0
4,2313,0060173289,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins,http://images.amazon.com/images/P/0060173289.0...,http://images.amazon.com/images/P/0060173289.0...,http://images.amazon.com/images/P/0060173289.0...,"cincinnati, ohio, usa",23.0


In [94]:
df.columns

Index(['User-ID', 'ISBN', 'Book-Rating', 'Book-Title', 'Book-Author',
       'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M',
       'Image-URL-L', 'Location', 'Age'],
      dtype='object')

In [95]:
df.drop(columns=["Image-URL-L", "Image-URL-S", "Image-URL-M", "Age", "ISBN"], inplace=True)

In [96]:
df

Unnamed: 0,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Location
0,276725,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"tyler, texas, usa"
1,2313,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"cincinnati, ohio, usa"
2,2313,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books,"cincinnati, ohio, usa"
3,2313,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage,"cincinnati, ohio, usa"
4,2313,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins,"cincinnati, ohio, usa"
...,...,...,...,...,...,...,...
1031131,276442,7,Le Huit,Katherine Neville,2002,Le Cherche Midi,"genève, genève, switzerland"
1031132,276618,5,Ludwig Marum: Briefe aus dem Konzentrationslag...,Ludwig Marum,1984,C.F. MÃ¼ller,"stuttgart, \n/a\""., germany"""
1031133,276647,0,Christmas With Anne and Other Holiday Stories:...,L. M. Montgomery,2001,Starfire,"arlington heights, illinois, usa"
1031134,276647,10,Heaven (Coretta Scott King Author Award Winner),Angela Johnson,1998,Simon &amp; Schuster Children's Publishing,"arlington heights, illinois, usa"


In [97]:
# Statistical Analysis of Ratings
n_ratings = len(df["Book-Rating"])
n_books = len(df['Book-Title'].unique())
n_users = len(df['User-ID'].unique())

print(f"Number of ratings: {n_ratings}")
print(f"Number of unique book titles: {n_books}")
print(f"Number of unique users: {n_users}")
print(f"Average ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average ratings per book: {round(n_ratings/n_books, 2)}")

Number of ratings: 1031136
Number of unique book titles: 241071
Number of unique users: 92106
Average ratings per user: 11.2
Average ratings per book: 4.28


In [98]:
# User Rating Frequency
user_freq = df[['User-ID', 'Book-Title']].groupby('User-ID').count().reset_index()
user_freq.columns = ['User-ID', 'n_ratings']
print(user_freq.head())

   User-ID  n_ratings
0        2          1
1        8         17
2        9          3
3       10          1
4       12          1


In [99]:
from sklearn.preprocessing import LabelEncoder
booksID = LabelEncoder()
df['Book-ID'] = booksID.fit_transform(df['Book-Title'])    

In [100]:
df

Unnamed: 0,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Location,Book-ID
0,276725,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"tyler, texas, usa",67829
1,2313,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"cincinnati, ohio, usa",67829
2,2313,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books,"cincinnati, ohio, usa",60582
3,2313,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage,"cincinnati, ohio, usa",90902
4,2313,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins,"cincinnati, ohio, usa",52982
...,...,...,...,...,...,...,...,...
1031131,276442,7,Le Huit,Katherine Neville,2002,Le Cherche Midi,"genève, genève, switzerland",104678
1031132,276618,5,Ludwig Marum: Briefe aus dem Konzentrationslag...,Ludwig Marum,1984,C.F. MÃ¼ller,"stuttgart, \n/a\""., germany""",113038
1031133,276647,0,Christmas With Anne and Other Holiday Stories:...,L. M. Montgomery,2001,Starfire,"arlington heights, illinois, usa",35210
1031134,276647,10,Heaven (Coretta Scott King Author Award Winner),Angela Johnson,1998,Simon &amp; Schuster Children's Publishing,"arlington heights, illinois, usa",81799


In [101]:
# Book rating analysis
# Find the lowest- and highest-rated books by mean rating.
# Note: implicit ratings (0) are included in these means.
mean_rating = df.groupby('Book-ID')[['Book-Rating']].mean()
# Lowest rated
lowest_rated = mean_rating['Book-Rating'].idxmin()
df.loc[df['Book-ID'] == lowest_rated]
# Highest rated
highest_rated = mean_rating['Book-Rating'].idxmax()
df.loc[df['Book-ID'] == highest_rated]
# Show the individual ratings given to the highest-rated book
df[df['Book-ID']==highest_rated]
# Show the individual ratings given to the lowest-rated book
df[df['Book-ID']==lowest_rated]

# The books above have very few ratings, so their mean is not a reliable score.
# A Bayesian average is more robust; it needs the per-book rating count and mean.
book_stats = df.groupby('Book-ID')[['Book-Rating']].agg(['count', 'mean'])
book_stats.columns = book_stats.columns.droplevel()
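
The cell above prepares the per-book `count` and `mean` but stops before computing the Bayesian average itself. One common formulation shrinks each book's mean towards a global prior: `(C * m + sum of ratings) / (C + n)`, where `n` is the book's rating count. The sketch below assumes `C` is the average number of ratings per book and `m` is the average of the per-book means; both choices, and the toy data, are assumptions rather than something fixed by this notebook.

```python
import pandas as pd

# Toy data: book 1 has a single perfect rating, book 0 has five high ratings.
toy = pd.DataFrame({
    "Book-ID":     [0, 0, 0, 0, 0, 1, 2, 2],
    "Book-Rating": [9, 9, 10, 9, 8, 10, 2, 4],
})

stats = toy.groupby("Book-ID")["Book-Rating"].agg(["count", "mean"])

C = stats["count"].mean()  # prior weight: average number of ratings per book (assumed choice)
m = stats["mean"].mean()   # prior mean: average of the per-book means (assumed choice)

# count * mean equals the sum of a book's ratings.
stats["bayesian_avg"] = (C * m + stats["count"] * stats["mean"]) / (C + stats["count"])

print(stats.sort_values("bayesian_avg", ascending=False))
```

On this toy data the raw mean ranks the single-rating book first, while the Bayesian average ranks the book with five consistent ratings first. The same expression could be applied to `book_stats` from the cell above.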

In [102]:
df.columns

Index(['User-ID', 'Book-Rating', 'Book-Title', 'Book-Author',
       'Year-Of-Publication', 'Publisher', 'Location', 'Book-ID'],
      dtype='object')

In [104]:
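# User-Item Matrix Creation
# Build a sparse rating matrix with SciPy's csr_matrix.
# Note the orientation: rows are books and columns are users.
from scipy.sparse import csr_matrix

def create_matrix(df):
    """Return the sparse book-by-user matrix plus ID <-> index mappers."""
    N = df['User-ID'].nunique()
    M = df['Book-ID'].nunique()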

    # Map Ids to indices
    user_mapper = dict(zip(np.unique(df["User-ID"]), list(range(N))))
    book_mapper = dict(zip(np.unique(df["Book-ID"]), list(range(M))))

    # Map indices to IDs
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["User-ID"])))
    book_inv_mapper = dict(zip(list(range(M)), np.unique(df["Book-ID"])))

    user_index = [user_mapper[i] for i in df['User-ID']]
    book_index = [book_mapper[i] for i in df['Book-ID']]

    X = csr_matrix((df["Book-Rating"], (book_index, user_index)), shape=(M, N))

    return X, user_mapper, book_mapper, user_inv_mapper, book_inv_mapper

X, user_mapper, book_mapper, user_inv_mapper, book_inv_mapper = create_matrix(df)
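
To make the `(data, (row, col))` constructor used in `create_matrix` more concrete, here is a small sketch on invented values. It also illustrates one behaviour worth keeping in mind: when the same (book, user) pair appears more than once, `csr_matrix` sums the entries. That may matter here because `Book-ID` is derived from the title, so a user who rated two editions of the same title could end up with a combined value. The toy indices and ratings are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Four toy ratings; the first two share the same (book, user) pair.
ratings    = np.array([5, 7, 9, 3])
book_index = [0, 0, 1, 2]   # row index of each rating
user_index = [1, 1, 1, 0]   # column index of each rating

# Rows are books, columns are users, as in create_matrix.
toy_X = csr_matrix((ratings, (book_index, user_index)), shape=(3, 2))
dense = toy_X.toarray()
print(dense)
```

In the dense view, the duplicated pair at row 0, column 1 holds 5 + 7 = 12. If that is not the intended behaviour, the ratings could be aggregated per (`User-ID`, `Book-ID`) pair, for example with a `groupby(...).mean()`, before the matrix is built.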


In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031136 entries, 0 to 1031135
Data columns (total 8 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   User-ID              1031136 non-null  int64 
 1   Book-Rating          1031136 non-null  int64 
 2   Book-Title           1031136 non-null  object
 3   Book-Author          1031135 non-null  object
 4   Year-Of-Publication  1031136 non-null  object
 5   Publisher            1031134 non-null  object
 6   Location             1031136 non-null  object
 7   Book-ID              1031136 non-null  int32 
dtypes: int32(1), int64(2), object(5)
memory usage: 66.9+ MB


In [106]:
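# Book Similarity Analysis
"""
Find similar books using KNN
"""
from sklearn.neighbors import NearestNeighbors
def find_similar_books(book_id, X, k, metric='cosine', show_distance=False):
    """Return the Book-IDs of the k nearest neighbours of book_id.

    Uses the module-level book_mapper and book_inv_mapper. If show_distance
    is True, the matching distances are returned as a second value.
    """
    book_ind = book_mapper[book_id]
    book_vec = X[book_ind].reshape(1, -1)
    # Ask for k + 1 neighbours: the closest hit is normally the query book itself.
    kNN = NearestNeighbors(n_neighbors=k + 1, algorithm="brute", metric=metric)
    kNN.fit(X)
    distances, indices = kNN.kneighbors(book_vec, return_distance=True)
    # Drop the first hit and map matrix row indices back to Book-IDs.
    neighbour_ids = [book_inv_mapper[n] for n in indices[0][1:]]
    if show_distance:
        return neighbour_ids, distances[0][1:].tolist()
    return neighbour_ids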


book_titles = dict(zip(df['Book-ID'], df['Book-Title']))

book_id = 67829

similar_ids = find_similar_books(book_id, X, k=10)
book_title = book_titles[book_id]

print(f"Since you read {book_title}")
for i in similar_ids:
    print(book_titles[i])


Since you read Flesh Tones: A Novel
Frontier Woman
Bullets
The Devil's Apocrypha: There Are Two Sides to Every Story
Harvest of Murder: A Gardening Mystery (Gardening Mysteries (Hardcover))
First Light (A. D. Chronicles, No. 1)
Bulletproof Billionaire : New Orleans Confidential (Intrigue)
A Murder of Justice
Top Hook
Winter's End (Alex Rourke, 1)
The Devil in Bellminster: An Unlikely Mystery (Unlikely Heroes)
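
The `metric='cosine'` setting used above ranks books by cosine distance, that is, one minus the cosine similarity of their rating vectors. The following sketch computes that distance by hand with NumPy on an invented 3-book, 4-user matrix to show what the neighbour search is comparing. The helper `cosine_distances_to` is hypothetical and assumes no row is entirely zero.

```python
import numpy as np

# Rows are books, columns are users (toy values).
toy_X = np.array([
    [5, 0, 3, 0],   # book 0
    [4, 0, 3, 0],   # book 1: rated by the same users as book 0
    [0, 2, 0, 5],   # book 2: rated by different users
], dtype=float)

def cosine_distances_to(row, X):
    """Cosine distance from X[row] to every row of X (assumes no all-zero rows)."""
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(X[row])
    return 1.0 - (X @ X[row]) / norms

dist = cosine_distances_to(0, toy_X)
order = np.argsort(dist)   # nearest first; the query book itself comes out on top
print(order, dist.round(3))
```

Book 1 ends up much closer to book 0 than book 2 does, because cosine distance depends on which users rated both books and in what proportion, not on the rating magnitudes alone. In the real matrix, books with only implicit (0) ratings have all-zero rows, which this simple helper does not handle.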


In [107]:
# Book Recommendation Based on User Preferences
# Recommend books similar to the one the user rated highest.
def recommend_Books_for_user(user_id, X, user_mapper, book_mapper, book_inv_mapper, k=10):
    df1 = df[df['User-ID'] == user_id]

    if df1.empty:
        print(f"User with ID {user_id} does not exist.")
        return

    book_id = df1[df1['Book-Rating'] == max(df1['Book-Rating'])]['Book-ID'].iloc[0]

    book_titles = dict(zip(df['Book-ID'], df['Book-Title']))

    similar_ids = find_similar_books(book_id, X, k)
    book_title = book_titles.get(book_id, "Book not found")

    if book_title == "Book not found":
        print(f"Book with ID {book_id} not found.")
        return

    print(f"Since you Read {book_title}, you might also like:")
    for i in similar_ids:
        print(book_titles.get(i, "Book not found"))


# Recommend books for a given user

In [108]:
user_id = 2313 # Replace with the desired user ID
recommend_Books_for_user(user_id, X, user_mapper, book_mapper, book_inv_mapper, k=10)

Since you Read Godel, Escher, Bach: An Eternal Golden Braid, you might also like:
Making Peace
Dmca: The Digital Millenium Copyright Act
Winners, Losers &amp; Microsoft: Competition and Antitrust in High Technology
The Ugly American
Swamp Monster in Third Grade (Little Apple)
Struggle for Intimacy
The Eros of Everyday Life: Essays on Ecology, Gender and Society
Personal Fouls
Forgiveness and Other Acts of Love
Clicking With Your Dog: Step-By-Step in Pictures
