# Content Based Filtering

- Filtering done on the basis of the features 'Title', 'authors', and 'categories'.
- TFIDF Vectorization and Cosine Similarity used for recommending similar books based on the above three attributes.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('final_data.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Id,Title,User_id,review/helpfulness,review/score,review/time,review/summary,review/text,description,authors,publisher,publishedDate,categories,ratingsCount,compound,Sentiment
0,0,1882931173,Its Only Art If Its Well Hung!,AVCGYZL8FQQTD,7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...,,['Julie Strain'],,1996,['Comics & Graphic Novels'],2.0,0.9408,positive
1,1,826414346,Dr. Seuss: American Icon,A30TK6U7DNS82R,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...,Philip Nel takes a fascinating look into the k...,['Philip Nel'],A&C Black,2005-01-01,['Biography & Autobiography'],2.0,0.9876,positive
2,2,826414346,Dr. Seuss: American Icon,A3UH4UZ4RSVO82,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t...",Philip Nel takes a fascinating look into the k...,['Philip Nel'],A&C Black,2005-01-01,['Biography & Autobiography'],2.0,0.9932,positive
3,3,826414346,Dr. Seuss: American Icon,A2MVUWT453QH61,7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D...",Philip Nel takes a fascinating look into the k...,['Philip Nel'],A&C Black,2005-01-01,['Biography & Autobiography'],2.0,0.9782,positive
4,4,826414346,Dr. Seuss: American Icon,A22X4XUPKF66MR,3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...,Philip Nel takes a fascinating look into the k...,['Philip Nel'],A&C Black,2005-01-01,['Biography & Autobiography'],2.0,0.9604,positive


## Data Cleaning

- Dropping the rows with NULL entries in any one the three book attributes chosen for providing recommendations.

In [4]:
df = df.dropna(subset=['categories', 'authors', 'Title'])

In [5]:
columns = df['categories'] + " " + df['authors'] + " " + df['Title']

## TF-IDF Vectorization

- Using the TF-IDF Vectorizer to convert the combined text data into a matrix of TFIDF features.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words = 'english')
tf_idf_matrix = vectorizer.fit_transform(columns)

In [7]:
print(tf_idf_matrix.shape)

(1951308, 114930)


## Dimensionality Reduction - SVD

- Employing Singular Value Decomposition (SVD) for reducing the number of features for each textual entry down to the 100 most useful ones.
- Converting each element of the SVD matrix into float32 for easy processing with Cosine Similarity calculations.

In [8]:
from sklearn.decomposition import TruncatedSVD

In [9]:
svd = TruncatedSVD(n_components = 100)
svd_matrix = svd.fit_transform(tf_idf_matrix)

In [10]:
print(svd_matrix.shape)

(1951308, 100)


In [11]:
import numpy as np
svd_matrix = svd_matrix.astype(np.float32)

## Cosine Similarity

- Calculating the cosine similarity between all pairs of books based on their numerical feature vectors (obtained from TFIDF and then SVD).
- Because of the enormous size of the dataset, computing the cosine similarity matrix on all the books in the dataset throws a Memory Limit Exceeded error. 
- Therefore, 1% of the books are sampled from the dataset randomly, and used for the generation of the cosine similarity matrix.

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
n_rows = svd_matrix.shape[0]
sample_size = int(0.01*n_rows)
random_indices = np.random.choice(n_rows, size=sample_size, replace=False)
sampled_svd_matrix = svd_matrix[random_indices, :]

In [14]:
cosine_sim = cosine_similarity(sampled_svd_matrix)

## Book Recommendations
- Using the Cosine Similarity matrix, the function takes the Title of a book as input, and outputs a list of the most similar books based on the cosine similarity scores.
- The idea here is that books that are similar to each other, based on the book attributes ('Title', 'authors', and 'categories') have higher cosine similarity scores.
- Cosine Similarity can capture more nuanced similarities based on the combination of terms used in books' titles, categories etc.
- Therefore, the cosine similarity matrix quantifies the similarity between all pairs of books in the dataset. This matrix enables to recommend books that are content-wise similar to a particular book.

In [40]:
def recommend_books(title, cosine_sim = cosine_sim):
    book_index = df.loc[df['Title'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[book_index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    books_recommended = []
    book_indices = []
    
    for sim in sim_scores:
        if sim[0] != book_index and sim[0] not in book_indices and df['Title'].iloc[sim[0]] not in books_recommended:
            book_indices.append(sim[0])
            books_recommended.append(df['Title'].iloc[sim[0]])
            if len(book_indices) == 10: 
                break
                
    return df['Title'].iloc[book_indices]

In [41]:
print('Top 10 most relevant books are: ')
print()
print(recommend_books('Dr. Seuss: American Icon'))

Top 10 most relevant books are: 

16620                        One Hundred Years of Solitude
4010     Lifetimes: The Beautiful Way to Explain Death ...
8783     Text and Thought: An Integrated Approach to Co...
11002                            The Clan of the Cave Bear
13891                                       Fahrenheit 451
9324                                      Forward the Mage
19753                        California Real Estate Primer
21599                                          Jim the Boy
1312                                        Edge of Danger
3413                                             Rakaposhi
Name: Title, dtype: object
