In [1]:
import numpy as np
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/book-recommendation-dataset/Ratings.csv
/kaggle/input/book-recommendation-dataset/Users.csv
/kaggle/input/book-recommendation-dataset/classicRec.png
/kaggle/input/book-recommendation-dataset/Books.csv
/kaggle/input/book-recommendation-dataset/DeepRec.png
/kaggle/input/book-recommendation-dataset/recsys_taxonomy2.png


### A content-based recommendation system recommends items to users based on the content or characteristics of the items. This type of recommendation system focuses on understanding the properties of items and learning user preferences from the items they have interacted with in the past.

How Does it Work?

1. Feature Extraction: Extract relevant features from the items. For example, in a book recommendation system, features could include title, author, and category

2. User Profile: Create a user profile based on their interactions with items. This profile is essentially a summary of the features of items the user has liked or interacted with in the past.

3. Recommendation: Calculate the similarity between the user profile and each item's features. Items that are most similar to the user profile are recommended.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

content_ratings = pd.read_csv('../input/book-recommendation-dataset/Ratings.csv')
content_users = pd.read_csv('../input/book-recommendation-dataset/Users.csv')
content_books = pd.read_csv('../input/book-recommendation-dataset/Books.csv')

  content_books = pd.read_csv('../input/book-recommendation-dataset/Books.csv')


In [3]:
content_books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
content_users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [5]:
content_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [6]:
print("Null values in books data: \n", content_books.isnull().sum())

Null values in books data: 
 ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64


In [7]:
# Dropping unnecessary columns
content_books.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'Year-Of-Publication'], axis=1, inplace=True)

content_users.drop(['Location', 'Age'], axis=1, inplace=True)

In [8]:
# Merge ratings data with cleaned books and users data
content = pd.merge(content_ratings, content_books, on='ISBN')
content = pd.merge(content, content_users, on='User-ID')

In [9]:
content.shape

(1031136, 6)

In [10]:
content.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,Ballantine Books
1,276726,0155061224,5,Rites of Passage,Judith Rae,Heinle
2,276727,0446520802,0,The Notebook,Nicholas Sparks,Warner Books
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,Cambridge University Press
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,Cambridge University Press


In [11]:
content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1031136 entries, 0 to 1031135
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1031136 non-null  int64 
 1   ISBN         1031136 non-null  object
 2   Book-Rating  1031136 non-null  int64 
 3   Book-Title   1031136 non-null  object
 4   Book-Author  1031134 non-null  object
 5   Publisher    1031134 non-null  object
dtypes: int64(2), object(4)
memory usage: 47.2+ MB


In [12]:
# Concatenate relevant columns to create feature
content['Features'] = content['Book-Title'] + ', ' + content['Book-Author'] + ', ' + content['Publisher']

In [13]:
content.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Publisher,Features
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,Ballantine Books,"Flesh Tones: A Novel, M. J. Rose, Ballantine B..."
1,276726,0155061224,5,Rites of Passage,Judith Rae,Heinle,"Rites of Passage, Judith Rae, Heinle"
2,276727,0446520802,0,The Notebook,Nicholas Sparks,Warner Books,"The Notebook, Nicholas Sparks, Warner Books"
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,Cambridge University Press,"Help!: Level 1, Philip Prowse, Cambridge Unive..."
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,Cambridge University Press,The Amsterdam Connection : Level 4 (Cambridge ...


In [14]:
content.drop(['Book-Author', 'Publisher', 'User-ID', 'Book-Rating'], axis=1, inplace=True)

In [15]:
content.isnull().sum()

ISBN          0
Book-Title    0
Features      4
dtype: int64

In [16]:
# Drop rows with NaN values in columns
content.dropna(subset=['Features'], inplace=True)

In [17]:
content['Features'] = content['Features'].apply(lambda x : str(x).lower())
content

Unnamed: 0,ISBN,Book-Title,Features
0,034545104X,Flesh Tones: A Novel,"flesh tones: a novel, m. j. rose, ballantine b..."
1,0155061224,Rites of Passage,"rites of passage, judith rae, heinle"
2,0446520802,The Notebook,"the notebook, nicholas sparks, warner books"
3,052165615X,Help!: Level 1,"help!: level 1, philip prowse, cambridge unive..."
4,0521795028,The Amsterdam Connection : Level 4 (Cambridge ...,the amsterdam connection : level 4 (cambridge ...
...,...,...,...
1031131,0876044011,Edgar Cayce on the Akashic Records: The Book o...,edgar cayce on the akashic records: the book o...
1031132,1563526298,Get Clark Smart : The Ultimate Guide for the S...,get clark smart : the ultimate guide for the s...
1031133,0679447156,Eight Weeks to Optimum Health: A Proven Progra...,eight weeks to optimum health: a proven progra...
1031134,0515107662,The Sherbrooke Bride (Bride Trilogy (Paperback)),the sherbrooke bride (bride trilogy (paperback...


In [18]:
content.head()

Unnamed: 0,ISBN,Book-Title,Features
0,034545104X,Flesh Tones: A Novel,"flesh tones: a novel, m. j. rose, ballantine b..."
1,0155061224,Rites of Passage,"rites of passage, judith rae, heinle"
2,0446520802,The Notebook,"the notebook, nicholas sparks, warner books"
3,052165615X,Help!: Level 1,"help!: level 1, philip prowse, cambridge unive..."
4,0521795028,The Amsterdam Connection : Level 4 (Cambridge ...,the amsterdam connection : level 4 (cambridge ...


In [19]:

unique_book_count = content1000['Book-Title'].nunique()
print("Number of unique books:", unique_book_count)
content1000.shape

Number of unique books: 968


(1000, 7)

In [20]:
duplicate_counts = sample_merged_df['Book-Title'].value_counts()
duplicate_books = duplicate_counts[duplicate_counts > 1]
total_duplicate_count = duplicate_books.sum()
print("Total count of duplicate books:", total_duplicate_count)

for book_title, count in duplicate_books.items():
    indices = sample_merged_df.index[sample_merged_df['Book-Title'] == book_title].tolist()
    print(f"Book Title: {book_title} (Count: {count}) - Indices: {indices}")

Total count of duplicate books: 63
Book Title: The Rescue (Count: 3) - Indices: [955006, 995065, 790692]
Book Title: Stupid White Men ...and Other Sorry Excuses for the State of the Nation! (Count: 2) - Indices: [421246, 319514]
Book Title: False Memory (Count: 2) - Indices: [814569, 321524]
Book Title: Bright Eyes (Coulter Family Series) (Count: 2) - Indices: [165652, 835473]
Book Title: Bridget Jones's Diary (Count: 2) - Indices: [240811, 178870]
Book Title: Live and Learn and Pass It on: People Ages 5 to 95 Share What They'Ve Discovered About Life, Love, and Other Good Stuff (Live &amp; Learn &amp; Pass It on) (Count: 2) - Indices: [253730, 876150]
Book Title: The Client (Count: 2) - Indices: [863808, 384948]
Book Title: The Poet (Count: 2) - Indices: [781372, 899129]
Book Title: Mind Prey (Count: 2) - Indices: [652505, 755426]
Book Title: The Lovely Bones: A Novel (Count: 2) - Indices: [312388, 472622]
Book Title: Black Notice (Count: 2) - Indices: [648352, 113198]
Book Title: The 

### using count_vectorizer

In [21]:
# Reduce size of the data for testing
sample_size = 1000 

content_cv = content.sample(n=sample_size, random_state=42)


In [22]:
content_cv.head()

Unnamed: 0,ISBN,Book-Title,Features
320091,0393020312,The Selected Stories of Patricia Highsmith,"the selected stories of patricia highsmith, pa..."
237244,0345353145,Sphere,"sphere, michael crichton, ballantine books"
504863,0688174574,Blindsighted,"blindsighted, karin slaughter, william morrow ..."
82683,078688908X,Chain of Evidence,"chain of evidence, ridley pearson, hyperion"
46501,0671776126,Plain Truth,"plain truth, jodi picoult, atria"


In [23]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000, stop_words='english')
cv.fit_transform(content_cv['Features']).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [24]:
profiles = cv.fit_transform(content_cv['Features']).toarray()
profiles.shape

(1000, 1000)

In [25]:
# Similarity calculation (cosine similarity) with progress bar
def calculate_cosine_similarity_with_progress(profiles):
    # Initialize an empty cosine similarity matrix
    cosine_sim = np.zeros((profiles.shape[0], profiles.shape[0]))

    # Iterate over rows of the matrix and calculate cosine similarity
    for i in tqdm(range(profiles.shape[0]), desc="Calculating Cosine Similarity"):
        for j in range(profiles.shape[0]):
            cosine_sim[i][j] = cosine_similarity(profiles[i].reshape(1, -1), profiles[j].reshape(1, -1))[0, 0]
    return cosine_sim

# Call the function with the profiles matrix to calculate cosine similarity
cosine_sim = calculate_cosine_similarity_with_progress(profiles)

Calculating Cosine Similarity: 100%|██████████| 1000/1000 [05:43<00:00,  2.91it/s]


In [28]:
def recommend_cv(book):
    index = np.where(content_cv['Book-Title'] == book)[0][0]
    similar_books = sorted(enumerate(cosine_sim[index]), key=lambda x: x[1], reverse=True)[1:10]

    print("\nTop 5 recommended movies based on your preferences:")
    for i in similar_books:
#         print(content_cv['Book-Title'][i[0]])
        print(content['Book-Title'][i[0]])

In [29]:
recommend_cv('Chain of Evidence')


Top 5 recommended movies based on your preferences:
La Casa De Bernarda Alba
Die MÃ?Â¤dchen mit den dunklen Augen.
Maudit Manege
Rolling Thunder
Night Sins
A Kiss of Shadows (Meredith Gentry Novels (Paperback))
Disney's The Lion King (A Golden Look-Look Book)
Saving Private Ryan
Flesh Tones: A Novel


In [30]:
from tabulate import tabulate

def recommend_book_cv(book):
    index = np.where(content_cv['Book-Title'] == book)[0][0]
    similar_books = sorted(enumerate(cosine_sim[index]), key=lambda x: x[1], reverse=True)[1:6]
    
    table_data = []
    for i in similar_books:
        recommended_index = i[0]
#         recommended_title = content_cv['Book-Title'][recommended_index]
        recommended_title = content['Book-Title'][recommended_index]
        similarity_score = i[1]
        table_data.append([recommended_index, recommended_title, similarity_score])
    
    print(tabulate(table_data, headers=['Index', 'Recommended Book', 'Similarity Score'], tablefmt='grid'))


In [31]:
recommend_book_cv('The Client')

+---------+--------------------------------------------------------------------+--------------------+
|   Index | Recommended Book                                                   |   Similarity Score |
|     973 | Witch's Christmas                                                  |           1        |
+---------+--------------------------------------------------------------------+--------------------+
|     156 | Auf dÃ?Â¼nnem Eis.                                                 |           0.912871 |
+---------+--------------------------------------------------------------------+--------------------+
|     941 | An ALTOGETHER NEW BOOK OF TOP TEN LISTS LATE NIGHT DAVID LETTERMAN |           0.833333 |
+---------+--------------------------------------------------------------------+--------------------+
|     351 | Contacto                                                           |           0.612372 |
+---------+--------------------------------------------------------------------+--

In [32]:
recommend_book_cv('Deadly Decisions')

+---------+--------------------------------------------------------+--------------------+
|   Index | Recommended Book                                       |   Similarity Score |
|     304 | The Coral Island (Puffin Classics)                     |           0.67082  |
+---------+--------------------------------------------------------+--------------------+
|     471 | El espectador                                          |           0.353553 |
+---------+--------------------------------------------------------+--------------------+
|      22 | Reise nach Ixtlan. Die Lehre des Don Juan.             |           0.288675 |
+---------+--------------------------------------------------------+--------------------+
|     815 | The Cat Who Robbed a Bank (Cat Who... (Paperback))     |           0.288675 |
+---------+--------------------------------------------------------+--------------------+
|     385 | Dawn and the We Love Kids Club (Baby-Sitters Club, 72) |           0.223607 |
+---------

In [29]:
recommend_book_cv('Bright Eyes (Coulter Family Series)')

+---------+------------------------------+--------------------+
|   Index | Recommended Book             |   Similarity Score |
|     957 | Something Special            |           1        |
+---------+------------------------------+--------------------+
|     190 | O Pioneers! (Bantam Classic) |           0.5      |
+---------+------------------------------+--------------------+
|     654 | Mother Night                 |           0.5      |
+---------+------------------------------+--------------------+
|     409 | Cosmetique De L'Enneme       |           0.408248 |
+---------+------------------------------+--------------------+
|     681 | Amazonia                     |           0.408248 |
+---------+------------------------------+--------------------+


In [30]:
duplicate_counts = content_cv['Book-Title'].value_counts()
duplicate_books = duplicate_counts[duplicate_counts > 1]
total_duplicate_count = duplicate_books.sum()
print("Total count of duplicate books:", total_duplicate_count)


for book_title, count in duplicate_books.items():
    indices = content_cv.index[content_cv['Book-Title'] == book_title].tolist()
    print(f"Book Title: {book_title} (Count: {count}) - Indices: {indices}")

Total count of duplicate books: 63
Book Title: The Rescue (Count: 3) - Indices: [955006, 995065, 790692]
Book Title: Stupid White Men ...and Other Sorry Excuses for the State of the Nation! (Count: 2) - Indices: [421246, 319514]
Book Title: False Memory (Count: 2) - Indices: [814569, 321524]
Book Title: Bright Eyes (Coulter Family Series) (Count: 2) - Indices: [165652, 835473]
Book Title: Bridget Jones's Diary (Count: 2) - Indices: [240811, 178870]
Book Title: Live and Learn and Pass It on: People Ages 5 to 95 Share What They'Ve Discovered About Life, Love, and Other Good Stuff (Live &amp; Learn &amp; Pass It on) (Count: 2) - Indices: [253730, 876150]
Book Title: The Client (Count: 2) - Indices: [863808, 384948]
Book Title: The Poet (Count: 2) - Indices: [781372, 899129]
Book Title: Mind Prey (Count: 2) - Indices: [652505, 755426]
Book Title: The Lovely Bones: A Novel (Count: 2) - Indices: [312388, 472622]
Book Title: Black Notice (Count: 2) - Indices: [648352, 113198]
Book Title: The 

In [31]:
unique_book_count = content_cv['Book-Title'].nunique()
print("Number of unique books:", unique_book_count)
content_cv.shape

Number of unique books: 968


(1000, 3)

### Should use Count Vectorizer: 
to convert the textual features into numerical vectors. It simply counts the frequency of each word in the text data and creates a matrix where each row represents a book in my case) and each column represents a unique word in the corpus. 
If not concerned about down-weighting the presence of certain words (such as Book-Title, Book-Author, and Publisher) based on their frequency across all books (documents), CountVectorizer is more suitable than TF-IDF.

### Reasoning: 
In the context of content-based recommendation systems for books, I want to capture the presence of specific words (such as Book-Title, Book-Author, and Publisher) in each book's features. CountVectorizer does exactly that by representing each book's textual features as a frequency count of words, which helps in identifying similarities between books based on their textual content. 
As I'm interested in treating all words equally without considering their rarity across all books, TF-IDF's IDF component is not necessary in this case.