To access the deployed project, click this link: Book Recommendation System
A book recommendation system that recommends a selection of books to users based on their interests.
Data used for this project was taken from here
The dataset consists of three tables: Books, Users, and Ratings. Each table is cleaned and preprocessed separately, as briefly described below.
For Books Table:
Drop all three Image URL features.
Check the number of null values in each column. There are only 3 null values in the table; replace these three empty cells with ‘Other’.
Check the unique years of publication. Two values in the year column are publisher names, and in three tuples the author's name was merged into the book title. Manually set the correct value of each feature for these three tuples, using the ISBN of the book.
Convert the year of publication feature to integer type.
Treat a year as valid if it is non-zero and less than 2022, and replace all invalid years with the mode of the publication years, which is 2002.
Upper-case all letters in the ISBN column and remove duplicate rows from the table.
For Users Table:
Check for null values in the table. The Age column has more than 1 lakh (100,000) null values.
Check the unique values in the Age column. It contains many invalid ages, such as 0 or 244.
Taking 10 to 80 as the valid age range of readers, replace null values and invalid ages in the Age column with the mean of the valid ages.
The Location column holds 3 values: city, state, and country. Split these into 3 separate columns named City, State, and Country. Where a value is null, assign 'other'.
Remove duplicate entries from the table.
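The Users steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the project's actual code: the column names `User-ID`, `Location`, and `Age` are assumed, and `clean_users` is a hypothetical helper name.

```python
import pandas as pd

def clean_users(users: pd.DataFrame) -> pd.DataFrame:
    """Repair ages and split Location (assumed columns: User-ID, Location, Age)."""
    users = users.copy()

    # Ages outside 10..80 and null ages both count as invalid.
    valid = users["Age"].between(10, 80)
    mean_age = users.loc[valid, "Age"].mean()
    users["Age"] = users["Age"].where(valid, mean_age)

    def part(location, i):
        # Return the i-th comma-separated piece, or 'other' when it is missing.
        pieces = [p.strip() for p in location.split(",")] if isinstance(location, str) else []
        return pieces[i] if i < len(pieces) and pieces[i] else "other"

    for i, name in enumerate(["City", "State", "Country"]):
        users[name] = users["Location"].map(lambda loc, i=i: part(loc, i))

    return users.drop(columns="Location").drop_duplicates()
```

Empty pieces such as the missing state in `"delhi, , india"` are also mapped to 'other' here, which is an assumption beyond the null handling described above.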
For Ratings Table:
Check for null values in the table.
Check that the Rating and User-ID columns are integers.
Remove punctuation from the ISBN column values, and keep a rating only if the resulting ISBN is present in the books dataset; otherwise drop that entry.
Upper-case all letters in the ISBN column.
Remove duplicate entries from the table.
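Two of the cleaning rules described above lend themselves to a short pandas sketch: replacing invalid publication years with the mode, and filtering ratings by cleaned ISBN. The column names `Year-Of-Publication`, `ISBN`, `User-ID`, and `Book-Rating` are assumptions, and both function names are hypothetical. Unlike the manual repair described for the Books table, this sketch simply treats non-numeric years as invalid, and it upper-cases ISBNs before matching them against the books table.

```python
import pandas as pd

def clean_publication_years(books: pd.DataFrame, max_year: int = 2022) -> pd.DataFrame:
    """Replace invalid years (non-numeric, 0, or >= max_year) with the mode."""
    books = books.copy()
    years = pd.to_numeric(books["Year-Of-Publication"], errors="coerce")
    valid = (years > 0) & (years < max_year)
    mode_year = int(years[valid].mode().iloc[0])
    books["Year-Of-Publication"] = years.where(valid, mode_year).astype(int)
    return books

def clean_ratings(ratings: pd.DataFrame, books: pd.DataFrame) -> pd.DataFrame:
    """Normalise ISBNs and keep only ratings whose ISBN exists in books."""
    ratings = ratings.copy()
    # Strip everything except letters and digits, then upper-case.
    ratings["ISBN"] = (
        ratings["ISBN"].astype(str)
        .str.replace(r"[^0-9A-Za-z]", "", regex=True)
        .str.upper()
    )
    known = set(books["ISBN"].astype(str).str.upper())
    ratings = ratings[ratings["ISBN"].isin(known)]
    ratings = ratings.astype({"User-ID": int, "Book-Rating": int})
    return ratings.drop_duplicates()
```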
Popular in the Whole Collection
The dataset is sorted by the total number of ratings each book has received, in non-increasing order, and the top n books are recommended.
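A minimal sketch of this popularity ranking, assuming a ratings table with one row per rating and an `ISBN` column. `popular_books` is a hypothetical name used only for illustration.

```python
import pandas as pd

def popular_books(ratings: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n books with the most ratings, most-rated first."""
    counts = ratings.groupby("ISBN").size().rename("total_ratings")
    return counts.sort_values(ascending=False).head(n).reset_index()
```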
The dataset is filtered by a given place (city, state, or country), the books are sorted in decreasing order of the total ratings they received from users of that place, and the top n books are recommended.
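The place-based variant could look something like the sketch below. It assumes the users table already has the split `City`, `State`, and `Country` columns and that both tables share a `User-ID` column; `popular_in_place` is a hypothetical name.

```python
import pandas as pd

def popular_in_place(ratings: pd.DataFrame, users: pd.DataFrame,
                     place: str, n: int = 10) -> pd.DataFrame:
    """Top n most-rated books among users whose city, state, or country matches."""
    place = place.strip().lower()
    cols = ["City", "State", "Country"]
    # A user matches if any of the three location columns equals the place.
    in_place = users[cols].apply(lambda c: c.str.lower() == place).any(axis=1)
    local_users = users.loc[in_place, "User-ID"]
    local = ratings[ratings["User-ID"].isin(local_users)]
    counts = local.groupby("ISBN").size().rename("total_ratings")
    return counts.sort_values(ascending=False).head(n).reset_index()
```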
For this model, books with the same author and with the same publisher as the given book are sorted by rating, and the top n books are recommended.
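One way to sketch this lookup is shown below. It assumes `Book-Title`, `Book-Author`, `Publisher`, and `ISBN` columns in the books table and ranks by average rating, which is an assumption since the text only says "by rating". `similar_by` is a hypothetical helper; call it once with `by="Book-Author"` and once with `by="Publisher"`.

```python
import pandas as pd

def similar_by(books: pd.DataFrame, ratings: pd.DataFrame, title: str,
               by: str = "Book-Author", n: int = 5) -> pd.DataFrame:
    """Top n other books sharing the given book's author (or publisher)."""
    key = books.loc[books["Book-Title"] == title, by].iloc[0]
    candidates = books[(books[by] == key) & (books["Book-Title"] != title)]
    # Average rating per ISBN; books without any rating are dropped by the inner join.
    avg = ratings.groupby("ISBN")["Book-Rating"].mean().rename("avg_rating")
    ranked = candidates.join(avg, on="ISBN", how="inner")
    return ranked.sort_values("avg_rating", ascending=False).head(n)
```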
This is the most basic model: all books published in the same year are grouped together, and the top-rated book of each year is recommended.
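A rough sketch of the yearly grouping, assuming `Year-Of-Publication` and `ISBN` columns and taking "top-rated" to mean the highest average rating (an assumption). `top_book_per_year` is a hypothetical name.

```python
import pandas as pd

def top_book_per_year(books: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame:
    """For each publication year, keep the book with the highest average rating."""
    avg = ratings.groupby("ISBN")["Book-Rating"].mean().rename("avg_rating")
    rated = books.join(avg, on="ISBN", how="inner").reset_index(drop=True)
    # idxmax gives, per year, the row label of the best-rated book.
    best_rows = rated.groupby("Year-Of-Publication")["avg_rating"].idxmax()
    return rated.loc[best_rows].sort_values("Year-Of-Publication").reset_index(drop=True)
```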
A weighted score is calculated for every book using the formula below, and the books with the highest scores are recommended.
score = t/(t+m) * a + m/(m+t) * c
where
t is the total number of ratings received by the book,
m is the minimum number of total ratings required for a book to be included,
a is the average rating of the book, and
c is the mean rating of all the books.
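The formula translates directly into a small function. In the sketch below the book triples, the threshold `m = 50`, and the global mean `c = 7.0` are made-up illustration values, not figures from the dataset.

```python
def weighted_score(t: float, a: float, m: float, c: float) -> float:
    """Blend a book's own average `a` with the global mean `c`.

    Books with few ratings (small t) are pulled towards c; books with
    many ratings keep a score close to their own average.
    """
    return (t / (t + m)) * a + (m / (m + t)) * c

# Hypothetical (title, number of ratings, average rating) triples.
books = [("Book A", 500, 8.1), ("Book B", 2, 9.0), ("Book C", 120, 7.4)]
m = 50    # assumed minimum number of ratings
c = 7.0   # assumed mean rating of all books

ranked = sorted(books, key=lambda b: weighted_score(b[1], b[2], m, c), reverse=True)
```

With these values, "Book B" has the highest raw average but only two ratings, so its score is dominated by `c` and it ranks last.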
The collaborative filtering recommendation system uses user ratings and finds cosine similarities between the ratings given by different users to recommend books. To implement this, only books with at least 50 ratings in total were used.
For this model, a user-book rating matrix is created from only those books that have more than 50 total ratings, and a correlation matrix is computed from it. For the input book, the most strongly correlated books are recommended.
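A correlation-based recommender along these lines can be sketched with `pivot_table` and `corrwith`. The sketch assumes the ratings have already been merged with book titles, so the input has `User-ID`, `Book-Title`, and `Book-Rating` columns, and it uses Pearson correlation (the pandas default), which the text does not specify. `correlated_books` is a hypothetical name.

```python
import pandas as pd

def correlated_books(ratings: pd.DataFrame, title: str,
                     min_ratings: int = 50, n: int = 5) -> pd.Series:
    """Books whose rating pattern correlates most with the given title."""
    counts = ratings.groupby("Book-Title")["Book-Rating"].count()
    popular = counts[counts > min_ratings].index
    subset = ratings[ratings["Book-Title"].isin(popular)]

    # Rows are users, columns are books, cells are ratings (NaN if unrated).
    matrix = subset.pivot_table(index="User-ID", columns="Book-Title",
                                values="Book-Rating")
    corr = matrix.corrwith(matrix[title]).drop(labels=title).dropna()
    return corr.sort_values(ascending=False).head(n)
```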
To train the Nearest Neighbours model, we build a compressed sparse row (CSR) matrix holding the rating given to each book by each user. The model is fitted on this matrix and then used to find the n nearest neighbours under the cosine similarity metric.
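The approach can be sketched with `scipy.sparse.csr_matrix` and scikit-learn's `NearestNeighbors`. The column names, the zero fill for unrated books, and the `brute` algorithm choice are assumptions, and `fit_knn` and `nearest_books` are hypothetical helper names.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def fit_knn(ratings: pd.DataFrame):
    """Fit a cosine-distance nearest-neighbours model on a book x user matrix."""
    pivot = ratings.pivot_table(index="Book-Title", columns="User-ID",
                                values="Book-Rating").fillna(0)
    model = NearestNeighbors(metric="cosine", algorithm="brute")
    model.fit(csr_matrix(pivot.values))
    return model, pivot

def nearest_books(model, pivot: pd.DataFrame, title: str, n: int = 5):
    """Return up to n titles closest to `title`, excluding the title itself."""
    row = pivot.index.get_loc(title)
    k = min(n + 1, len(pivot))  # one extra because the query matches itself
    query = csr_matrix(pivot.values[row:row + 1])
    _, idx = model.kneighbors(query, n_neighbors=k)
    return [pivot.index[i] for i in idx[0] if i != row][:n]
```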
This system recommends books by calculating similarities between book titles. TF-IDF feature vectors are created from the unigrams and bigrams of Book-Titles; only books with at least 80 ratings are considered.
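A minimal sketch of title-based similarity using scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` for unigrams and bigrams. Comparing the vectors by cosine similarity is an assumption, since the text does not name the similarity measure, and `similar_titles` is a hypothetical name. The input list is assumed to be already restricted to books with at least 80 ratings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_titles(titles: list, query: str, n: int = 5) -> list:
    """Return the n titles most similar to `query`, which must be in `titles`."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
    tfidf = vectorizer.fit_transform(titles)
    row = titles.index(query)
    sims = cosine_similarity(tfidf[row], tfidf).ravel()
    # Highest similarity first, skipping the query title itself.
    order = [i for i in sims.argsort()[::-1] if i != row]
    return [titles[i] for i in order[:n]]
```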
A hybrid recommendation system was built by combining the content-based filtering and collaborative filtering systems. The results from each model are given a percentile score, and these scores are combined to recommend the top n books.
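One possible reading of the percentile combination is sketched below. How the two percentile scores are combined is not stated, so summing them and giving 0 to a book missing from one list are assumptions, as is the input format (a mapping from title to raw model score). `hybrid_rank` is a hypothetical name.

```python
import pandas as pd

def hybrid_rank(content_scores: dict, collab_scores: dict, n: int = 5) -> list:
    """Combine two score mappings by percentile rank and return the top n titles."""
    # rank(pct=True) maps each score to its percentile within its own model.
    content_pct = pd.Series(content_scores).rank(pct=True)
    collab_pct = pd.Series(collab_scores).rank(pct=True)
    # Books missing from one model contribute 0 for that model.
    combined = content_pct.add(collab_pct, fill_value=0)
    return combined.sort_values(ascending=False).head(n).index.tolist()
```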
ipython-notebook - Python text editor
sklearn - machine learning library
seaborn, matplotlib - visualization libraries
numpy, scipy - numerical Python libraries
pandas - data handling library