# **Group Project:**

**Drive Mounting:**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Importing the required libraries:**

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import plotly.express as px
import plotly.graph_objects as go

**Loading the dataset:**

In [None]:
bookdata = pd.read_excel("/content/drive/MyDrive/Files/books_data.xlsx")
bookdata.head(100)

Unnamed: 0,bookID,title,authors,average_rating,prices,rating_type,prices_type,author_category
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,1502,Very High,Very High,"Novelists, playwrights, poets, and non-fiction"
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,1363,Very High,High,"Novelists, playwrights, poets, and non-fiction"
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,1822,Very High,Medium,"Novelists, playwrights, poets, and non-fiction"
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,1601,Very High,High,"Novelists, playwrights, poets, and non-fiction"
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,2.44,1229,Medium,High,"Novelists, playwrights, poets, and non-fiction"
...,...,...,...,...,...,...,...,...
95,157,Anna Karenina,Leo Tolstoy/Constance Garnett/Amy Mandelker,4.05,1071,Very High,High,Authors and Influential Figures in Literature ...
96,159,Dinner with Anna Karenina,Gloria Goldreich,2.99,1574,Medium,Medium,Authors and Influential Figures in Literature ...
97,160,Tolstoy: Anna Karenina,Anthony Thorlby,4.19,979,Very High,High,Authors and Influential Figures in Literature ...
98,162,Untouchable,Mulk Raj Anand/E.M. Forster,3.71,1084,High,Low,Authors and Influential Figures in Literature ...


**Basic Info about dataset:**

In [None]:
#Print name of the column
print("Name of the columns: \n")
print(bookdata.columns)
print('\n')


Name of the columns: 

Index(['bookID', 'title', 'authors', 'average_rating', 'prices', 'rating_type',
       'prices_type', 'author_category'],
      dtype='object')




In [None]:
#Print the datatype of the column
print("Datatype of the columns: \n")
print(bookdata.dtypes)
print('\n')

Datatype of the columns: 

bookID               int64
title               object
authors             object
average_rating     float64
prices               int64
rating_type         object
prices_type         object
author_category     object
dtype: object




In [None]:
#Print information of the column
print("Information of the columns: \n")
print(bookdata.info())
print('\n')


Information of the columns: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bookID           11127 non-null  int64  
 1   title            11127 non-null  object 
 2   authors          11127 non-null  object 
 3   average_rating   11127 non-null  float64
 4   prices           11127 non-null  int64  
 5   rating_type      11127 non-null  object 
 6   prices_type      11127 non-null  object 
 7   author_category  11126 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 695.6+ KB
None




In [None]:
#Print the shape of the dataset
print("Shape of the dataset: \n")
print(bookdata.shape)
print('\n')

#Number of row
print("Number of rows: \n")
print(len(bookdata))

#Number of column
print("Number of columns: \n")
print(len(bookdata.columns))

Shape of the dataset: 

(11127, 8)


Number of rows: 

11127
Number of columns: 

8


In [None]:
#Print the description of the column
print("Description of the columns: \n")
print(bookdata.describe())
print('\n')

Description of the columns: 

             bookID  average_rating        prices
count  11127.000000    11127.000000  11127.000000
mean   21310.938887        3.935630   1652.107576
std    13093.358023        0.325376    492.740985
min        1.000000        1.000000    800.000000
25%    10287.000000        3.770000   1220.000000
50%    20287.000000        3.960000   1660.000000
75%    32104.500000        4.130000   2078.000000
max    45641.000000        5.000000   2500.000000




In [None]:
#Print the null values of the column
print("Null values of the columns: \n")
print(bookdata.isnull().sum())

Null values of the columns: 

bookID             0
title              0
authors            0
average_rating     0
prices             0
rating_type        0
prices_type        0
author_category    1
dtype: int64


**Visualizations:**

In [None]:
histogram_plot = px.histogram(bookdata, x='average_rating',
                   nbins=40,
                   title='Distribution of Average Ratings')
histogram_plot.update_xaxes(title_text='Average Rating')
histogram_plot.update_yaxes(title_text='Frequency')
histogram_plot.show()

In [None]:
top_authors = bookdata['authors'].value_counts().head(30)
author_books_chart = px.bar(top_authors, x=top_authors.values, y=top_authors.index,
             labels={'x': 'Number of Books', 'y': 'Author'},
             title='Number of Books per Author')
author_books_chart.show()

In [None]:
author_average_ratings = bookdata.groupby('authors')['average_rating'].mean().reset_index()
author_average_ratings_table = go.Figure(data=[go.Table(
    header=dict(values=['Author', 'Average Rating'],
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[author_average_ratings.authors, author_average_ratings.average_rating],
               fill_color='lavender',
               align='left'))
])

author_average_ratings_table.update_layout(title='Average Rating per Author')
author_average_ratings_table.show()


In [None]:
author_average_price = bookdata.groupby('authors')['prices'].mean().reset_index()
author_average_price_table = go.Figure(data=[go.Table(
    header=dict(values=['Author', 'Average Price'],
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[author_average_price.authors, author_average_price.prices],
               fill_color='lavender',
               align='left'))
])

author_average_price_table.update_layout(title='Average Price per Author')
author_average_price_table.show()


In [None]:
price_counts = bookdata['prices_type'].value_counts()
fig = px.pie(price_counts, values=price_counts.values, names=price_counts.index, title='Distribution of Price Types')
fig.show()


In [None]:
rating_counts = bookdata['rating_type'].value_counts()
fig = px.pie(rating_counts, values=rating_counts.values, names=rating_counts.index, title='Distribution of Rating Types')
fig.show()


In [None]:
fig = px.scatter(bookdata, x='rating_type', y='average_rating',
                 title='Average Rating vs. Rating_Type',
                 labels={'rating_type': 'Rating_Type', 'average_rating': 'Average Rating'})
fig.show()


In [None]:
fig = px.scatter(bookdata, x='average_rating', y='prices',
                 trendline="ols",
                 title='Correlation between Average Rating and Price',
                 labels={'average_rating': 'Average Rating', 'prices': 'Prices'})
fig.show()


**Adding a new column combining the data of title and author:**

In [None]:
# Convert 'average_rating' to a numeric data type
bookdata['average_rating'] = pd.to_numeric(bookdata['average_rating'],
                                       errors='coerce')

In [None]:
# Convert 'title' to string before concatenation
bookdata['title'] = bookdata['title'].astype(str)

# Convert 'authors' to string
bookdata['authors'] = bookdata['authors'].astype(str)

# Combine title and authors into a single text column
bookdata['title_author'] = bookdata['title'] + ' by ' + bookdata['authors']

bookdata.head()

Unnamed: 0,bookID,title,authors,average_rating,prices,rating_type,prices_type,author_category,title_author
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,1502,Very High,Very High,"Novelists, playwrights, poets, and non-fiction",Harry Potter and the Half-Blood Prince (Harry ...
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,1363,Very High,High,"Novelists, playwrights, poets, and non-fiction",Harry Potter and the Order of the Phoenix (Har...
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,1822,Very High,Medium,"Novelists, playwrights, poets, and non-fiction",Harry Potter and the Chamber of Secrets (Harry...
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,1601,Very High,High,"Novelists, playwrights, poets, and non-fiction",Harry Potter and the Prisoner of Azkaban (Harr...
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,2.44,1229,Medium,High,"Novelists, playwrights, poets, and non-fiction",Harry Potter Boxed Set Books 1-5 (Harry Potte...


In [None]:
#Print the shape of the dataset
print("Shape of the dataset: \n")
print(bookdata.shape)
print('\n')

#Number of row
print("Number of rows: \n")
print(len(bookdata))

#Number of column
print("Number of columns: \n")
print(len(bookdata.columns))

Shape of the dataset: 

(11127, 9)


Number of rows: 

11127
Number of columns: 

9


**Creating a TfidfVectorizer Instance and Fitting and Transforming Data:**

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(bookdata['title_author'])

**Similarity between books:**

In [None]:
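# Compute the cosine similarity between books. TfidfVectorizer L2-normalises
# each row by default, so the linear kernel (dot product) equals cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)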

**Recommendation function:**

In [None]:
def recommend_books(book_title, cosine_sim=cosine_sim):
    # Get the index of the book that matches the title (exact match)
    matches = bookdata[bookdata['title'] == book_title].index
    if len(matches) == 0:
        # Unknown title: return an empty Series instead of raising IndexError
        return bookdata['title'].iloc[[]]
    idx = matches[0]

    # Get the cosine similarity scores for all books with this book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 10 most similar books (excluding the input book)
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top 10 recommended books
    return bookdata['title'].iloc[book_indices]

**Testing:**

In [None]:
book_title = input("Enter the book title: ")
recommended_books = recommend_books(book_title)
print(f"Recommended books for '{book_title}':")
recommended_books.head(10)

Enter the book title: My Inventions
Recommended books for 'My Inventions':


Unnamed: 0,title
211,Tesla Papers
210,Nikola Tesla: A Spark of Genius
209,Wizard: The Life and Times of Nikola Tesla: Bi...
3712,Girls Think of Everything: Stories of Ingeniou...
7103,Teleportation: From Star Trek to Tesla
0,Harry Potter and the Half-Blood Prince (Harry ...
1,Harry Potter and the Order of the Phoenix (Har...
2,Harry Potter and the Chamber of Secrets (Harry...
3,Harry Potter and the Prisoner of Azkaban (Harr...
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...


In [None]:
def recommend_books(book_title, cosine_sim=cosine_sim):
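    # Get the index of the book that matches the title (exact match)
    matches = bookdata[bookdata['title'] == book_title].index
    if len(matches) == 0:
        return bookdata.iloc[[]]  # Empty DataFrame when the title is not found
    idx = matches[0]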

    # Get the cosine similarity scores for all books with this book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 10 most similar books (excluding the input book)
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top 10 recommended books as a DataFrame
    return bookdata.iloc[book_indices]  # Return a DataFrame instead of a Series

In [None]:
book_title = input("Enter the book title: ")
recommended_books = recommend_books(book_title)

# Display the recommended books
print(f"Recommended books for '{book_title}':")

# Check if there are any recommended books
if not recommended_books.empty:
    for index, row in recommended_books.iterrows():
        # Assuming the DataFrame has columns like bookID, title, authors, average_rating, prices, rating_type, prices_type, author_category
        print(f"Book ID: {row['bookID']}")
        print(f"Title: {row['title']}")
        print(f"Authors: {row['authors']}")
        print(f"Average Rating: {row['average_rating']}")
        print(f"Prices: {row['prices']}")
        print(f"Rating Type: {row['rating_type']}")
        print(f"Price Type: {row['prices_type']}")
        print(f"Author Type: {row['author_category']}")
        print("-" * 40)  # Separator for readability
else:
    print("No recommendations found.")


Enter the book title: My Inventions
Recommended books for 'My Inventions':
Book ID: 498
Title: Tesla Papers
Authors: Nikola Tesla/David Hatcher Childress
Average Rating: 4.13
Prices: 1734
Rating Type: Very High
Price Type: High
Author Type: Authors in Literature and Philosophy
----------------------------------------
Book ID: 497
Title: Nikola Tesla: A Spark of Genius
Authors: Carol Dommermuth-Costa
Average Rating: 3.93
Prices: 1521
Rating Type: High
Price Type: High
Author Type: Authors in Literature and Philosophy
----------------------------------------
Book ID: 494
Title: Wizard: The Life and Times of Nikola Tesla: Biography of a Genius
Authors: Marc J.  Seifer/William H. Terbo
Average Rating: 3.78
Prices: 1529
Rating Type: High
Price Type: Very High
Author Type: Authors in Literature and Philosophy
----------------------------------------
Book ID: 13460
Title: Girls Think of Everything: Stories of Ingenious Inventions by Women
Authors: Catherine Thimmesh/Melissa Sweet
Average Rati

In [None]:
def recommend_books_by_author(author_name, cosine_sim=cosine_sim):
    # Get the indices of books by the given author
    author_indices = bookdata[bookdata['authors'] == author_name].index.tolist()

    if not author_indices:
        return pd.DataFrame()  # Return an empty DataFrame if no books by the author are found

    # Calculate the average similarity scores for all books by the author
    sim_scores = np.mean([cosine_sim[idx] for idx in author_indices], axis=0)

    # Sort the books based on the average similarity scores
    sim_scores = list(enumerate(sim_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 10 most similar books (excluding books by the same author)
    sim_scores = [score for score in sim_scores if score[0] not in author_indices][:10]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top 10 recommended books as a DataFrame
    return bookdata.iloc[book_indices]


In [None]:
# Get the author's name from user input
author = input("Enter the author name: ")
recommended_books_by_author = recommend_books_by_author(author)

# Display the recommended books
print(f"Recommended books by '{author}':")

# Check if there are any recommended books
if not recommended_books_by_author.empty:
    # Sort the recommended books by average_rating in descending order
    recommended_books_by_author = recommended_books_by_author.sort_values(by='average_rating', ascending=False)

    for index, row in recommended_books_by_author.iterrows():
        # Display the details of each recommended book
        print(f"Book ID: {row['bookID']}")
        print(f"Title: {row['title']}")
        print(f"Authors: {row['authors']}")
        print(f"Average Rating: {row['average_rating']}")
        print(f"Prices: {row['prices']}")
        print(f"Rating Type: {row['rating_type']}")
        print(f"Price Type: {row['prices_type']}")
        print(f"Author Type: {row['author_category']}")
        print("-" * 40)  # Separator for readability
else:
    print("No recommendations found.")


Enter the author name: Bill Bryson
Recommended books by 'Bill Bryson':
Book ID: 4824
Title: Before the Mayflower: A History of Black America
Authors: Lerone Bennett Jr.
Average Rating: 4.44
Prices: 1184
Rating Type: Very High
Price Type: High
Author Type: Literary and Intellectual Contributors
----------------------------------------
Book ID: 4829
Title: Before The Mayflower A History of Black America
Authors: Lerone Bennett Jr.
Average Rating: 4.44
Prices: 2193
Rating Type: Very High
Price Type: Medium
Author Type: Literary and Intellectual Contributors
----------------------------------------
Book ID: 42895
Title: Una breve historia de casi todo
Authors: Bill Bryson/José Manuel Álvarez
Average Rating: 4.21
Prices: 2353
Rating Type: Very High
Price Type: Very High
Author Type: Authors and Influential Figures in Literature and Non-Fiction
----------------------------------------
Book ID: 7415
Title: The Shaping of America: A Geographical Perspective on 500 Years of History: Volume 2:

In [None]:
def get_top_rated_books_by_author(author_name, num_books=5):
    # Select the books whose 'authors' field matches the name exactly
    author_books = bookdata[bookdata['authors'] == author_name]
    # Keep the num_books highest-rated books
    top_rated_books = author_books.sort_values(by='average_rating', ascending=False).head(num_books)
    return top_rated_books

# Get the author's name from user input
author = input("Enter the author name: ")
top_rated_books = get_top_rated_books_by_author(author)

# Display the top rated books
print(f"Top rated books by '{author}':")
if not top_rated_books.empty:
    for _, row in top_rated_books.iterrows():
        print(f"Title: {row['title']} (Average Rating: {row['average_rating']})(Authors: {row['authors']})")
else:
    print("No books found for this author.")


Enter the author name: Bill Bryson
Top rated books by 'Bill Bryson':
Title: A Short History of Nearly Everything (Average Rating: 4.21)(Authors: Bill Bryson)
Title: A Short History of Nearly Everything (Average Rating: 4.21)(Authors: Bill Bryson)
Title: A Short History of Nearly Everything (Illustrated Edition) (Average Rating: 4.21)(Authors: Bill Bryson)
Title: Bill Bryson: The Complete Notes (Average Rating: 4.09)(Authors: Bill Bryson)
Title: In a Sunburned Country (Average Rating: 4.07)(Authors: Bill Bryson)


## **Project Proposal: Book Recommendation System**

### **1. Introduction**

The goal of this project is to develop a book recommendation system using a dataset of books and their attributes. This system will leverage natural language processing (NLP) techniques to analyze book titles and authors, and generate recommendations based on the similarity between books.

### **2. Data and Methodology**

#### **2.1 Data Source**

The project uses a dataset of 11,127 books with eight attributes: book ID, title, authors, average rating, price, rating type, price type, and author category. The dataset is loaded from an Excel file (`books_data.xlsx`) located in Google Drive.

#### **2.2 Data Exploration and Visualization**

Initial data exploration involves understanding the structure and characteristics of the dataset. This includes:

- Examining column names, data types, and basic statistics.
- Visualizing the distribution of average ratings using a histogram.
- Analyzing the number of books per author using a bar chart.
- Displaying average ratings and prices per author using tables.
- Visualizing the distribution of price and rating types using pie charts.
- Exploring the relationship between average rating and rating type using a scatter plot.
- Plotting price against average rating with an OLS trendline to check for correlation.

#### **2.3 Feature Engineering**

A new feature is created by combining the book title and author into a single column (`title_author`). This combined feature will be used for text analysis and similarity calculations.
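
As a minimal sketch of this step, the snippet below builds the combined column on a toy DataFrame. The rows are hypothetical, and filling missing values with empty strings is an assumption for illustration (the notebook casts with `astype(str)` instead, which would turn a missing value into the literal text `nan`).

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values).
books = pd.DataFrame({
    "title": ["My Inventions", "Tesla Papers", None],
    "authors": ["Nikola Tesla", "Nikola Tesla/David Hatcher Childress", "Unknown"],
})

# Replace missing values with empty strings before joining, so a missing
# title does not end up as the literal text "None" or "nan".
books["title_author"] = (
    books["title"].fillna("") + " by " + books["authors"].fillna("")
).str.strip()

print(books["title_author"].tolist())
```

In the actual dataset every `title` and `authors` value is non-null, so both approaches should give the same column there.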

#### **2.4 Text Vectorization**

The `TfidfVectorizer` from scikit-learn is employed to convert the text data in the `title_author` column into numerical vectors. This process allows for the quantification of textual similarity between books.
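
The following sketch shows what the vectorizer produces on a three-line corpus. The strings are hypothetical examples in the same "title by author" shape as the `title_author` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical "title by author" strings, mirroring the title_author column.
corpus = [
    "My Inventions by Nikola Tesla",
    "Tesla Papers by Nikola Tesla",
    "Anna Karenina by Leo Tolstoy",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: one row per book

# One column per distinct token that survives stop-word removal.
print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

Common English words such as "by" are dropped by `stop_words="english"`, so the joining word in `title_author` does not contribute to similarity.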

#### **2.5 Similarity Calculation**

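Cosine similarity measures how close the TF-IDF vectors of two books are. Because `TfidfVectorizer` L2-normalises each vector by default, the notebook computes it with `linear_kernel`: for unit-length vectors the dot product equals the cosine similarity. A high score means two books share distinctive words in their titles or author names.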
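
The sketch below checks, on a hypothetical three-book corpus, that `linear_kernel` and `cosine_similarity` agree on TF-IDF output. It assumes the vectorizer's default `norm="l2"` setting is left unchanged.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical "title by author" strings.
corpus = [
    "My Inventions by Nikola Tesla",
    "Tesla Papers by Nikola Tesla",
    "Anna Karenina by Leo Tolstoy",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# With the default norm="l2", every row vector has unit length.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(row_norms)

# For unit-length rows the plain dot product is the cosine similarity.
dot = linear_kernel(tfidf, tfidf)
cos = cosine_similarity(tfidf, tfidf)
print(np.allclose(dot, cos))
```

`linear_kernel` skips the normalisation that `cosine_similarity` performs, which is why it is a common choice when the inputs are already normalised.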

### **3. Recommendation System**

#### **3.1 Recommendation Function**

A function `recommend_books` is defined to generate book recommendations based on a given book title. The steps involved are:

1. Identify the index of the input book in the dataset.
2. Calculate cosine similarity scores between the input book and all other books.
3. Sort the books in descending order of similarity scores.
4. Select the top 10 most similar books (excluding the input book) as recommendations.
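
The four steps can be sketched on a toy catalogue as follows. This is an illustrative variant, not the notebook's function: `recommend_sketch` and the sample rows are hypothetical, it computes only the one similarity row it needs instead of the full matrix, it excludes the input book by index, and it drops books with zero similarity. It also assumes the DataFrame has a default `RangeIndex`, so index labels equal row positions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy catalogue standing in for the real dataset.
books = pd.DataFrame({
    "title": ["My Inventions", "Tesla Papers",
              "Nikola Tesla: A Spark of Genius", "Anna Karenina"],
    "authors": ["Nikola Tesla", "Nikola Tesla/David Hatcher Childress",
                "Carol Dommermuth-Costa", "Leo Tolstoy"],
})
text = books["title"] + " by " + books["authors"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(text)


def recommend_sketch(book_title, top_n=10):
    """Return up to top_n titles most similar to book_title (illustrative only)."""
    # Step 1: locate the input book; return nothing if the title is absent.
    matches = books.index[books["title"] == book_title]
    if len(matches) == 0:
        return []
    idx = int(matches[0])

    # Step 2: similarity of this one book against every book (a single row).
    scores = linear_kernel(tfidf[idx:idx + 1], tfidf).ravel()

    # Step 3: sort row positions by descending similarity.
    order = np.argsort(-scores, kind="stable")

    # Step 4: skip the input book and any book with zero similarity.
    picks = [i for i in order if i != idx and scores[i] > 0][:top_n]
    return books["title"].iloc[picks].tolist()


print(recommend_sketch("My Inventions"))
print(recommend_sketch("Not in the catalogue"))
```

Dropping zero-similarity books is a design choice: without it, a query with fewer than ten related books is padded with unrelated titles from the start of the dataset, as with the Harry Potter entries in the "My Inventions" output above. Computing a single row also avoids holding an 11,127 × 11,127 dense matrix in memory.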

#### **3.2 User Interaction**

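The notebook prompts the user for a book title and displays the 10 most similar books. Two further cells prompt for an author name: `recommend_books_by_author` averages the similarity scores of that author's books and returns the 10 closest books outside that set, and `get_top_rated_books_by_author` lists the author's five highest-rated books. All lookups use exact string matches on the `title` or `authors` column.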

### **4. Future Work**

- **Enhanced Recommendation Algorithm:** Explore more sophisticated recommendation algorithms, such as collaborative filtering or matrix factorization, to improve recommendation accuracy.
- **User Interface:** Develop a user-friendly interface to interact with the recommendation system.
- **Additional Features:** Incorporate other book attributes, such as genre or publication year, to refine recommendations.
- **Evaluation:** Implement evaluation metrics to assess the performance of the recommendation system.

### **5. Conclusion**

This project demonstrates the development of a basic book recommendation system using NLP and similarity analysis. The system provides a foundation for further development and improvement, with the potential to offer personalized book recommendations to users.
