**Book Recommendation System using Python**

This project builds a Book Recommendation System by combining data processing and machine learning with an understanding of user preferences. A Book Recommendation System is a data-driven application that suggests books to users based on their preferences, reading history, and behaviour.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import plotly.express as px
import plotly.graph_objects as go

data = pd.read_csv("/content/drive/MyDrive/books_data (2).csv")
print(data.head())

   bookID                                              title  \
0       1  Harry Potter and the Half-Blood Prince (Harry ...   
1       2  Harry Potter and the Order of the Phoenix (Har...   
2       4  Harry Potter and the Chamber of Secrets (Harry...   
3       5  Harry Potter and the Prisoner of Azkaban (Harr...   
4       8  Harry Potter Boxed Set  Books 1-5 (Harry Potte...   

                      authors average_rating  
0  J.K. Rowling/Mary GrandPré           4.57  
1  J.K. Rowling/Mary GrandPré           4.49  
2                J.K. Rowling           4.42  
3  J.K. Rowling/Mary GrandPré           4.56  
4  J.K. Rowling/Mary GrandPré           4.78  


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   bookID          11127 non-null  int64 
 1   title           11127 non-null  object
 2   authors         11127 non-null  object
 3   average_rating  11127 non-null  object
dtypes: int64(1), object(3)
memory usage: 347.8+ KB


Let’s see the distribution of average ratings of all the books:

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Convert 'average_rating' column to numeric and drop NaNs
data['average_rating'] = pd.to_numeric(data['average_rating'], errors='coerce')
data = data.dropna(subset=['average_rating'])

# Create histogram
fig = px.histogram(data, x='average_rating', nbins=30,
                   title='Distribution of Average Ratings',
                   color_discrete_sequence=['#636EFA'])

# Calculate the mean rating for the reference line and annotation
mean_rating = data['average_rating'].mean()

# Add a dashed vertical line at the mean, spanning the full plot height.
# yref="paper" uses 0-1 plot coordinates, so the line does not depend on
# the bin counts (value_counts() counts exact ratings, not histogram bins).
fig.add_shape(
    type="line",
    x0=mean_rating, x1=mean_rating,
    y0=0, y1=1, yref="paper",
    line=dict(color="crimson", width=3, dash="dash"),
    name="Mean"
)

# Add mean as an annotation at the top of the line
fig.add_annotation(x=mean_rating, y=1, yref="paper",
                   text=f"Mean: {mean_rating:.2f}",
                   showarrow=True, arrowhead=1, ax=20)

# Update axis labels and title
fig.update_xaxes(title_text='Average Rating', title_font=dict(size=14), tickfont=dict(size=12))
fig.update_yaxes(title_text='Frequency', title_font=dict(size=14), tickfont=dict(size=12))

# Update layout for cleaner look
fig.update_layout(title_font_size=16, title_x=0.5, template="simple_white")

fig.show()



Now, let’s have a look at the number of books for the ten authors with the most books in the dataset:

In [None]:
import plotly.express as px

top_authors = data['authors'].value_counts().head(10)

# Create a bar chart for the top 10 authors by book count
fig = px.bar(top_authors,
             x=top_authors.values,
             y=top_authors.index,
             orientation='h',
             color=top_authors.values,  # Adds color variation
             color_continuous_scale='Blues',  # Color scheme for visual appeal
             labels={'x': 'Number of Books', 'y': 'Author'},
             title='Number of Books per Author')

# Update layout for readability
fig.update_layout(
    title={'text': 'Number of Books per Author', 'x': 0.5, 'xanchor': 'center'},
    xaxis_title='Number of Books',
    yaxis_title='Author',
    coloraxis_showscale=False,  # Hide color scale as it's not necessary here
    template='simple_white'
)

# Add text labels for each bar
fig.update_traces(text=top_authors.values, textposition='auto')

# Adjust font sizes for title and axis labels
fig.update_layout(
    title_font_size=16,
    xaxis_title_font_size=14,
    yaxis_title_font_size=14,
    yaxis=dict(tickfont=dict(size=12)),
    xaxis=dict(tickfont=dict(size=12))
)

fig.show()


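The average rating column is stored as an object data type in the original dataset. We already converted it before plotting the histogram; the conversion is repeated here so that this step also works on its own: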

In [None]:
# Convert 'average_rating' to a numeric data type
data['average_rating'] = pd.to_numeric(data['average_rating'], errors='coerce')
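
To see what `errors='coerce'` does, here is a small sketch on made-up values. The strings in `raw` are assumptions for illustration, not entries taken from the dataset:

```python
import pandas as pd

# Hypothetical toy values; the real column's non-numeric entries are not shown here
raw = pd.Series(["4.57", "3.90", "not a rating", "4.10"])

# errors='coerce' turns anything unparseable into NaN instead of raising an error
ratings = pd.to_numeric(raw, errors="coerce")
print(ratings)

# Count how many values could not be parsed
n_bad = int(ratings.isna().sum())
print(f"Unparseable values: {n_bad}")
```

Checking the number of NaN values after the conversion is a quick way to find out how many rows a later `dropna` would remove.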

To consider book content for recommendations, we’ll use the book titles and authors. Let’s combine these features into a single text feature:

In [None]:
# Create a new column 'book_content' by combining 'title' and 'authors'
data['book_content'] = data['title'] + ' ' + data['authors']

Now, we will transform the text-based features into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization:

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(data['book_content'])

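TF-IDF converts each book's text into a numerical vector: a word gets a higher weight when it appears in that book's text but in few other books. The resulting sparse matrix has one row per book and one column per distinct term, which is a representation suitable for recommendation algorithms.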

Now, we’ll use a simple content-based recommendation system algorithm based on the cosine similarity between books:

In [None]:
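# Compute the cosine similarity between books. TfidfVectorizer L2-normalizes
# each row by default, so the dot product from linear_kernel equals cosine similarity.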
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
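
The reason `linear_kernel` can stand in for cosine similarity here can be checked on a small example. This is a minimal sketch on a hypothetical toy corpus, assuming the default `norm='l2'` setting of `TfidfVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus standing in for data['book_content']
toy_content = [
    "Ulysses James Joyce",
    "Dubliners James Joyce",
    "The Door Into Summer Robert A. Heinlein",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_content)

# Plain dot products between the (already unit-length) TF-IDF rows
dot_products = linear_kernel(toy_matrix, toy_matrix)

# Explicit cosine similarity for comparison
cosines = cosine_similarity(toy_matrix, toy_matrix)

same = np.allclose(dot_products, cosines)
print(same)
print(np.round(dot_products, 2))
```

Note that the full similarity matrix is dense: for roughly 11,000 books it holds on the order of 120 million values, close to 1 GB as float64, so this approach is best suited to datasets of about this size or smaller.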

Now, let’s define a function to recommend books based on user preferences:

In [None]:
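def recommend_books(book_title, cosine_sim=cosine_sim):
    # Get the row position of the book that matches the title.
    # A position (not an index label) is needed because dropna() earlier left
    # gaps in the index, while rows of cosine_sim are numbered 0..n-1.
    matches = np.flatnonzero((data['title'] == book_title).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Title not found: {book_title!r}")
    idx = matches[0]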

    # Get the cosine similarity scores for all books with this book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 10 most similar books (excluding the input book)
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top 10 recommended books
    return data['title'].iloc[book_indices]

This function will take a book title as input and recommend books with high cosine similarity. Now, let’s test the recommendation system by providing a book title and getting recommendations:

In [None]:
book_title = "Dubliners: Text  Criticism  and Notes"
recommended_books = recommend_books(book_title)
print(recommended_books)

2837                 The Long-Lost Map (Ulysses Moore #2)
7025    Things Pondered: From the Heart of a Lesser Woman
3044                                        Bruno's Dream
151                                  The Door Into Summer
3099                                       A Door of Hope
3688                                James Joyce's Ulysses
2261                                The Door in the Hedge
6196                                        Ulysses Found
4383                            One Door Away from Heaven
3512                                              Ulysses
Name: title, dtype: object
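
One thing worth checking with a title lookup like the one in `recommend_books` is whether index labels and row positions still agree, because the rows of `cosine_sim` are addressed by position. After rows are removed with `dropna`, the two can drift apart. The sketch below shows the effect on a hypothetical four-row frame (the titles and ratings are made up):

```python
import pandas as pd

# Hypothetical toy frame; one row has an unparseable rating
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": ["4.1", "oops", "3.9", "4.4"],
})
toy["average_rating"] = pd.to_numeric(toy["average_rating"], errors="coerce")
toy = toy.dropna(subset=["average_rating"])

# The index label of "Book D" still reflects its original row number
label = toy[toy["title"] == "Book D"].index[0]

# The row position of "Book D" in the filtered frame is one lower
position = toy.index.get_loc(label)
print(label, position)

# Resetting the index makes labels and positions agree again
toy = toy.reset_index(drop=True)
label_after = toy[toy["title"] == "Book D"].index[0]
print(label_after)
```

If labels and positions differ, the similarity row that gets looked up belongs to a different book than the one requested, which would show up as recommendations unrelated to the input title.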


**Conclusion**

In this project, we built a content-based book recommendation system. By applying TF-IDF vectorization and cosine similarity, we created a model that suggests books similar to a given title, where similarity is measured on the combined title and author text. Because it relies only on this text, the system recommends books that share words or authors with the input book rather than books matched to an individual user's history. Future improvements could include user-based collaborative filtering or hybrid recommendation techniques that personalize recommendations using user ratings and preferences. This foundational system shows how machine learning can support personalized book suggestions.