# **Project Name**    - **Book Recommendation System**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

**Project Summary: Building Book Recommendation System**

**Objective:**

The primary goal of this project was to develop a robust and accurate book recommendation system that enhances user experience by suggesting relevant books based on their preferences and past interactions. The system was designed to handle a large dataset of books, users, and ratings, and to provide personalized recommendations that align with individual user tastes.

**Dataset:**

The project utilized three key datasets:

User Data: Containing information about the users, including user IDs and potentially other demographic details.

Book Data: Including book titles, authors, publication years, publishers, and additional metadata.

Rating Data: Consisting of user-provided ratings for various books, which formed the core interaction data for the collaborative filtering model.

**Approach:**

To maximize the recommendation system's effectiveness, a hybrid approach combining Collaborative Filtering and Content-Based Filtering was employed.

**Collaborative Filtering (SVD):**

Model Used: Singular Value Decomposition (SVD) was used to perform collaborative filtering. This model decomposes the user-item interaction matrix into latent factors, capturing the underlying patterns in user preferences and book characteristics.

Tuning: Hyperparameter tuning was performed using GridSearchCV to find the optimal number of factors, learning rates, and regularization terms. The final model demonstrated a lower RMSE, indicating improved prediction accuracy.

Evaluation Metrics: The model’s performance was evaluated using RMSE, MAE, Precision, Recall, and F1 Score. These metrics helped ensure that the model was not only accurate but also effective in making relevant recommendations.

**Content-Based Filtering (FAISS):**

Model Used: FAISS (Facebook AI Similarity Search) was utilized to perform fast, scalable content-based filtering. The content of the books was vectorized using TF-IDF, capturing the importance of various terms in the books’ metadata.

Tuning: Hyperparameters such as the number of clusters (nlist) and the number of probes (nprobe) were tuned to optimize the search efficiency and accuracy. The evaluation was based on metrics like Mean Average Precision (MAP) to ensure high-quality recommendations.

Explanation: TF-IDF weights and cosine similarity were key components in determining the importance of features, helping to identify which terms were most influential in recommending similar books.
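The role of TF-IDF weights and cosine similarity can be illustrated with a minimal sketch. It assumes three made-up metadata strings (not rows from the dataset) and uses scikit-learn's brute-force `cosine_similarity` in place of a FAISS index, so it shows the scoring idea only, not the project's actual search pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical metadata strings (title + author + publisher); not taken from the dataset
book_metadata = [
    "classical mythology mark morford oxford university press",
    "greek mythology stories oxford university press",
    "decision in normandy carlo d'este harperperennial",
]

# Turn each string into a TF-IDF vector; rare terms receive higher weights
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(book_metadata)

# Pairwise cosine similarity between all books (3 x 3 matrix)
similarity = cosine_similarity(tfidf)

# Find the book most similar to book 0, excluding book 0 itself
scores = similarity[0].copy()
scores[0] = -1.0
most_similar = int(scores.argmax())
print("Most similar to book 0:", book_metadata[most_similar])
```

Books that share weighted terms score above zero, while books with no terms in common score exactly zero. An approximate index such as FAISS trades a little of this exactness for speed on large catalogs.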

**Hybrid Recommendation System:**

The final model was a hybrid of the above two approaches, combining the strengths of collaborative filtering (personalization based on user behavior) and content-based filtering (recommendations based on book content).

Integration: The hybrid system integrated the predictions from both models, ranking recommendations based on the collaborative filtering score while ensuring relevance through content-based filtering.

Impact: This hybrid model provided more accurate and relevant recommendations, improving user satisfaction and engagement, and ultimately driving positive business outcomes.

**Model Explainability:**

SHAP (SHapley Additive exPlanations): SHAP was used to interpret the model's predictions, providing insights into the importance of different features and latent factors. This helped in understanding why certain books were recommended to users, thereby increasing transparency and trust in the system.

**Results:**

Improved RMSE: The hybrid model achieved a lower RMSE compared to individual models, indicating more accurate predictions of user ratings.

Enhanced Recommendations: The system provided more accurate and diverse book recommendations, resulting in higher user engagement.

Positive Business Impact: The model’s improvements in accuracy and relevance led to better user satisfaction, potentially increasing book sales and platform engagement.

**Conclusion:**

The project successfully developed a sophisticated hybrid book recommendation system that effectively combines collaborative filtering and content-based filtering. Through careful tuning and evaluation, the model achieved a balance between accuracy and relevance, making it a valuable tool for enhancing user experience and driving business success. The use of model explainability tools like SHAP further added transparency, allowing stakeholders to understand and trust the recommendation process.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**

A book recommendation system aims to help readers discover books that align with their interests and preferences. By providing personalized suggestions, such systems enhance user engagement, increase customer satisfaction, and drive sales for bookstores or online platforms.

Book recommendation systems are widely used across various industries to enhance user experience, drive sales, and promote engagement. By leveraging personalized recommendations, these industries can better serve their users and achieve their business objectives.

*   Personalized recommendations enhance user satisfaction by helping readers find books that match their tastes, which can lead to increased customer loyalty.
*   Users are more likely to engage with a platform that understands their preferences and offers tailored suggestions.
*   Recommending books that align with user interests increases the likelihood of purchases, directly boosting sales.
*   Suggesting related books or genres can lead to additional sales from users who discover new interests.
*   Personalized recommendations can be used in targeted email campaigns, improving the effectiveness of marketing efforts.
*   Efficient targeting reduces wasted marketing spend by focusing on users with a higher likelihood of conversion.
*   For businesses with extensive book catalogs, recommendation systems help users navigate the vast selection and find relevant titles quickly.
*   Automating the recommendation process ensures consistent and scalable personalization without the need for manual intervention.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that each algorithm answers the following questions.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
! pip install surprise

In [None]:
!pip install faiss-cpu

In [None]:
# Import Libraries
# Standard library
import re
import string
from datetime import date

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
from wordcloud import WordCloud
from scipy import stats

# Text preprocessing (NLTK resources are downloaded once per runtime)
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Recommendation models and evaluation
import faiss
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, average_precision_score
# Note: this import rebinds train_test_split, so from here on the name refers to
# scikit-learn's function rather than Surprise's.
from sklearn.model_selection import train_test_split

### Dataset Loading

In [None]:
# Load Dataset
drive.mount('/drive')
users_df = pd.read_csv('/drive/My Drive/Users.csv')
books_df = pd.read_csv('/drive/My Drive/Books.csv')
ratings_df = pd.read_csv('/drive/My Drive/Ratings.csv')

### Dataset First View

In [None]:
# Dataset First Look
users_df.head()

In [None]:
books_df.head()

In [None]:
ratings_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Shape of User dataset\n", users_df.shape)
print("Shape of Book dataset\n", books_df.shape)
print("Shape of Rating dataset\n", ratings_df.shape)

### Dataset Information

In [None]:
# Dataset Info
# User Data
print("Information of User data\n")
users_df.info()

In [None]:
# Book Data
print("Information of Book data\n")
books_df.info()

In [None]:
# Rating Data
print("Information of Rating data\n")
ratings_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(users_df.duplicated().sum())
print(books_df.duplicated().sum())
print(ratings_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
users_df.isnull().sum()

In [None]:
books_df.isnull().sum()

In [None]:
ratings_df.isnull().sum()

### What did you know about your dataset?

To work effectively with the dataset, it is essential to understand the data. Here is a summary of what I learned about it.

**1. Basic Structure**

Dimensions: The dataset has three parts with different dimensions:

User Data: (278,858 rows, 3 columns)

Book Data: (271,360 rows, 8 columns)

Rating Data: (1,149,780 rows, 3 columns)

**2. Column Information**

User Data Columns: Contains user-related information.

Example: User-ID - has unique values, Location, Age

Book Data Columns: Contains book-related information.

Example: ISBN (unique values, serving as a book ID), book title, author, publication year

Rating Data Columns: Contains user ratings for books.

Example: User-ID, ISBN, rating

Datasets have 0 duplicates.

**Across the 3 datasets, the following columns contain missing values:**

**User data contains following column with null values**

Age             -       110762 null values

**Book data contains following columns with null values**

Book-Author     -       2 null values

Publisher       -       2 null values

Image-URL-L     -       3 null values




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("User dataset column\n",users_df.columns)
print("Book dataset column\n",books_df.columns)
print("Rating dataset column\n",ratings_df.columns)

In [None]:
# Dataset Describe
users_df.describe()

In [None]:
ratings_df.describe()

**User ID**

IDs range from 2 to 278,854, with a mean around 140,386.4 and substantial variation (std = 80,562.28).

**Ratings:**

Ratings are highly varied, with many zero ratings which may need further investigation. The mean rating is 2.87 with a standard deviation of 3.85, suggesting diverse user opinions.

**Age:**

The ages range from 0 to 244, with a mean age of 37.24 and a standard deviation of 14.25. The data includes extreme values which may be outliers or errors.

### Variables Description

### **User dataset columns**
**User-ID** - Unique identifiers assigned to each user.

**Location** - Location of the user.

**Age** - Age of the user.

### **Book dataset columns**

**ISBN** - Books are identified by their respective ISBN.

**Book-Title** - Title of the book.

**Book-Author** - Represents the author of the book.

**Year-Of-Publication** - Represents the year in which a book was published.

**Publisher** - Represents the name of the publishing company or organization that published the book.

**Image-URL-S** - URLs linking to small cover images.

**Image-URL-M** - URLs linking to medium cover images.

**Image-URL-L** - URLs linking to large cover images.

### **Rating dataset column**
**User-ID** - Unique identifiers assigned to each user.

**ISBN** - Books are identified by their respective ISBN.

**Book-Rating** - Represents the rating given to a book by a user.








### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in users_df.columns.tolist():
  print("No. of unique values in ",i,"is",users_df[i].nunique(),".")

In [None]:
for i in books_df.columns.tolist():
  print("No. of unique values in ",i,"is",books_df[i].nunique(),".")

In [None]:
for i in ratings_df.columns.tolist():
  print("No. of unique values in ",i,"is",ratings_df[i].nunique(),".")


## 3. ***Data Wrangling***

### Data Wrangling Code

**Merging User Data with Rating Data**:

The pd.merge() function merges **ratings_df** with **users_df** on the **User-ID** column using a left join to ensure all ratings are retained, including those with missing user information.

**Merging Result with Book Data:**

The resulting dataset is then merged with **books_df** on the **ISBN** column using a left join to ensure all user-book rating pairs are retained, including those with missing book information.

**Ensuring No Data Points are Lost:**

**Left Joins:** Using how='left' in the merge operations ensures that all records from the left DataFrame are retained. This prevents loss of data points from the larger datasets.

**Handling Missing Values:** After merging, there might be some missing values (NaNs) due to unmatched keys. These can be handled later based on the requirement.
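The effect of a left join on unmatched keys can be checked on a small example. The sketch below uses made-up frames that share only the key column names with the real data; `validate` and `indicator` are standard `pd.merge` options, used here as an optional safeguard rather than as part of the original workflow.

```python
import pandas as pd

# Toy frames with the same key columns as the real data; all values are made up
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "B", "Z"],       # "Z" has no matching book
    "Book-Rating": [8, 0, 5],
})
books = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["Title A", "Title B"],
})

# Left join keeps every rating row; validate raises if ISBN repeats in books,
# and indicator adds a _merge column describing where each row came from
merged = pd.merge(ratings, books, on="ISBN", how="left",
                  validate="many_to_one", indicator=True)

# Rows that found no matching book carry NaN in Book-Title
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} ratings have no matching book")
```

The row count of `merged` equals the row count of `ratings`, which confirms that no rating was lost, and `unmatched` shows exactly which rows will need handling later.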

In [None]:
# Write your code to make your dataset analysis ready.
print("User Dataset column names\n",users_df.columns)
print("Book Dataset column names\n",books_df.columns)
print("Rating Dataset column names\n",ratings_df.columns)

In [None]:
# Merge user_data with rating_data
user_ratings_merged = pd.merge(ratings_df, users_df, on='User-ID', how='left')

# Merge the above result with book_data
final_merged_data = pd.merge(user_ratings_merged, books_df, on='ISBN', how='left')


In [None]:
# Create a copy of the merged dataset and assign it to df,
# so later in-place changes do not touch final_merged_data
df = final_merged_data.copy()

### **Handling missing values**

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False)
plt.title('Heatmap of Missing Values')
plt.show()

Given that the book name is crucial for recommendations and understanding user preferences, it makes sense to drop entries where the book name is missing. This ensures that the recommendation system is based on complete and meaningful data.

In [None]:
# Drop rows where Book-Title is missing
df = df.dropna(subset=['Book-Title'])

The "Age" column is important for the recommendation system, as different age groups often have different preferences. Before imputing, I count the missing values, inspect the distribution, and identify any outliers.

In [None]:
# Check the type of missing values in the Age column
missing_ages = df['Age'].isnull().sum()
print(f"Missing values in Age column: {missing_ages}")

# Visualize the distribution of the Age column
plt.figure(figsize=(12, 6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Identify outliers using a boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Age'])
plt.title('Boxplot of Age')
plt.xlabel('Age')
plt.show()

In [None]:
# Handling outliers
# Define the upper and lower bounds for acceptable age values
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap outliers at the bounds
df['Age'] = np.where(df['Age'] < lower_bound, lower_bound, df['Age'])
df['Age'] = np.where(df['Age'] > upper_bound, upper_bound, df['Age'])

# Visualize the distribution after capping outliers
plt.figure(figsize=(12, 6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Distribution of Age After Capping Outliers')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Given that the "Age" column is important for the recommendation system and the distribution appears nearly normal after capping outliers, median imputation is a suitable approach for the missing values.
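The capping and imputation steps can be combined into one small helper. This is a minimal sketch on made-up ages; the function name `cap_and_impute` and the sample values are illustrative assumptions, and `Series.clip` is used as an equivalent of the two `np.where` calls.

```python
import numpy as np
import pandas as pd

def cap_and_impute(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside the IQR fences, then fill missing values with the median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # clip leaves NaN untouched and pulls extreme values back to the fences
    capped = series.clip(lower=lower, upper=upper)
    # The median is computed after capping, ignoring NaN
    return capped.fillna(capped.median())

# Made-up ages with one implausible value and one missing value
ages = pd.Series([25, 30, 35, 40, 45, 244, np.nan])
cleaned = cap_and_impute(ages)
print(cleaned.tolist())
```

Here the fences work out to 12.5 and 62.5, so the value 244 is capped at 62.5 and the missing entry receives the median of the capped values, 37.5. Note that a fence can be fractional, which is why fractional ages can appear after capping.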

In [None]:
# Impute missing values with the median age
median_age = df['Age'].median()
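# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update df
df['Age'] = df['Age'].fillna(median_age)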

In [None]:
# Drop any rows that still contain missing values in the remaining columns
df = df.dropna()

### **Handling Unlikely Age Values in the Dataset**
Given the context of the Age column, it is improbable that very young children (e.g., ages 2.5, 3, 4, 5, 6, 7, 8, 9) are providing book ratings. These values are likely outliers or errors in the data, so it is reasonable to filter them out to protect the integrity of the dataset.

In [None]:
df['Age'].unique()

In [None]:
# Define a reasonable age range for users who can give book ratings
min_age = 10  # Minimum plausible age
max_age = 100  # Maximum plausible age, assuming no one older than 100 is likely to give ratings

# Filter out unlikely age values
df = df[(df['Age'] >= min_age) & (df['Age'] <= max_age)]

### **Handling Invalid Book Ratings**
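A rating of 0 is not a score on the 1 to 10 scale. In the Book-Crossing data, which these files appear to match, 0 marks an implicit interaction: the user is linked to the book but gave no explicit rating. Because the model predicts explicit ratings, rows with a rating of 0 are removed rather than imputed, since imputing them could introduce bias or inaccuracies.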
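Before discarding the zero ratings, it helps to know how large a share of the data they represent. A minimal sketch on made-up rows, assuming only the `Book-Rating` column name from the dataset:

```python
import pandas as pd

# Made-up ratings; 0 stands for "no explicit score given"
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 0, 10],
})

# Share of rows that would be removed by the filter
zero_share = (ratings["Book-Rating"] == 0).mean()

# Keep only explicit ratings on the 1 to 10 scale
explicit = ratings[ratings["Book-Rating"].between(1, 10)]

print(f"Zero ratings: {zero_share:.0%} of rows; {len(explicit)} explicit ratings kept")
```

If the share is large, the removed rows could still be kept in a separate frame as interaction signals, although that is outside what this notebook does.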

In [None]:
print(df['Book-Rating'].unique())
# Remove rows with zero ratings
df = df[df['Book-Rating'] != 0]

### **Handling Unusual Values / Outliers in the Year-Of-Publication Column**



In [None]:
# Convert Year-Of-Publication to numeric
df['Year-Of-Publication'] = pd.to_numeric(df['Year-Of-Publication'], errors='coerce')

# Visualize the outliers using a boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Year-Of-Publication'])
plt.title('Boxplot of Year-Of-Publication')
plt.xlabel('Year of Publication')
plt.show()

In [None]:
df['Year-Of-Publication'].unique()

Values such as 1376, 2030, 2037, and 2050 in the Year-Of-Publication column are unusual and likely erroneous. Since most books were published within a much more recent time frame, it is reasonable to investigate and correct these values.

In [None]:
# Calculate Q1, Q3, and IQR
Q1 = df['Year-Of-Publication'].quantile(0.25)
Q3 = df['Year-Of-Publication'].quantile(0.75)
IQR = Q3 - Q1

# Calculate lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("Lower bound ",lower_bound)
print("Upper bound ",upper_bound)

Given that the dataset contains recent publication years such as 2020, it is reasonable to set a range that includes modern publication years. Instead of the IQR bounds, I set a logical lower bound based on historical data and a reasonable upper bound close to the current year.
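The range filter can be sketched on a handful of made-up values. The bounds 1800 and 2024 follow the text; the sample strings, including the malformed entry, are illustrative assumptions.

```python
import pandas as pd

# Made-up raw values: valid years, implausible years, a non-numeric string, and a missing value
years = pd.Series(["1999", "2002", "0", "1376", "2050", "unknown", None])

# Non-numeric entries become NaN instead of raising an error
numeric = pd.to_numeric(years, errors="coerce")

# between() is inclusive on both ends and evaluates to False for NaN
valid = numeric.between(1800, 2024)
cleaned = numeric[valid].astype(int)

print(cleaned.tolist())
```

Because NaN never satisfies the range check, coerced values are dropped by the same filter that removes implausible years, and the integer conversion afterwards is safe.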

In [None]:
# Define a reasonable range for publication years
current_year = 2024
min_year = 1800  # Earliest plausible publication year

# Convert Year-Of-Publication to numeric and remove rows with values outside the plausible range
# (NaN values fail both comparisons, so they are dropped as well)
df['Year-Of-Publication'] = pd.to_numeric(df['Year-Of-Publication'], errors='coerce')
df = df[(df['Year-Of-Publication'] >= min_year) & (df['Year-Of-Publication'] <= current_year)].copy()

# Ensure the column is of integer type
df['Year-Of-Publication'] = df['Year-Of-Publication'].astype(int)

### **Creating New Feature Columns from Existing Columns**
Creating new feature columns from existing ones can improve the performance of machine learning models and provide deeper insights during data analysis.

**Potential New Features to Create**

**Book Age:**

Calculate the age of the book from the publication year to understand how the book's age affects its popularity and ratings.

**User Age Group:**

Categorize users into age groups (children, teenagers, adults, middle-aged, seniors) to analyze preferences and behaviors by age group.
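How boundary ages are assigned depends on which side of each bin is closed. The sketch below uses bin edges 0, 12, 18, 35, and 60 with left-closed intervals (`right=False`), applied to a few made-up ages that sit on or next to the edges.

```python
import numpy as np
import pandas as pd

age_bins = [0, 12, 18, 35, 60, np.inf]
age_labels = ["Child", "Teenager", "Adult", "Middle-aged", "Senior"]

# Made-up ages placed on and just below each bin edge
ages = pd.Series([11, 12, 17, 18, 34, 35, 59, 60])

# right=False makes each bin include its lower edge and exclude its upper edge
groups = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

for age, group in zip(ages, groups):
    print(age, "->", group)
```

With left-closed bins, an age of exactly 12 is a Teenager, 18 is an Adult, 35 is Middle-aged, and 60 is a Senior.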

In [None]:
# Feature 1: Create a new column for the book age
# creating the date object of today's date
todays_date = date.today()
df['Book-Age'] = todays_date.year - df['Year-Of-Publication']

# Feature 2: Create a new column for user age groups
# Define age group bins and labels
age_bins = [0, 12, 18, 35, 60, np.inf]
age_labels = ['Child', 'Teenager', 'Adult', 'Middle-aged', 'Senior']

# Categorize users into age groups
df['Age-Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)

# Verify the new columns
print(df[['Year-Of-Publication', 'Book-Age', 'Age', 'Age-Group']].head())


In [None]:
df.shape

In [None]:
df.head()

## What all manipulations have you done and insights you found?

### **Summary of Data Manipulations and Insights**

Here's a detailed summary of the data manipulations performed and the insights gained from the dataset:

### **Data Manipulations**

**Handling Missing Values:**

Age: Missing values in the Age column were imputed with the median value.

Book-Title: Rows with missing Book-Title values were dropped, as they are essential for recommendations.

**Outlier Detection and Handling:**

Age: Outliers in the Age column were capped at the lower and upper IQR bounds, and rows with ages outside the 10 to 100 range were then removed.

Year-Of-Publication: Outliers were identified, and values outside the range of 1800 to the current year were removed.

**Data Type Conversion:**

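Age was kept as float: the capping step can produce fractional bounds (such as 2.5), and no integer conversion is applied in the code above.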

Converted Year-Of-Publication from object to int after handling outliers and missing values.

**Filtering Invalid Ratings:**

Removed rows with Book-Rating values of 0, as ratings should be between 1 and 10.

**Feature Engineering:**

Book Age -  Created a new column Book-Age by calculating the age of the book from its publication year.

User Age Group -  Created a new column Age-Group by categorizing users into age groups based on their age.


### **Insights Gained**

**Age Distribution:**

After handling outliers and imputing missing values, the age distribution of users shows a reasonable spread, indicating diverse user demographics.

**Year of Publication:**

Filtering out improbable publication years and focusing on a reasonable range (1800 to 2024) ensured that the dataset only includes valid publication years, enhancing the reliability of the analysis.

**Book Ratings:**

By removing invalid ratings (0), the dataset now only contains valid ratings between 1 and 10, which is crucial for accurate recommendations.

**Feature Engineering Benefits:**

Book Age -  This new feature can help analyze trends and user preferences based on the age of the book.

User Age Group -  Categorizing users into age groups provides deeper insights into preferences and behaviors across different age segments, improving the recommendation system.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Histplot - Distribution of Book Ratings:

In [None]:
#visualization code
# Distribution of Book Ratings
plt.figure(figsize=(10, 6))
sns.histplot(df['Book-Rating'], bins=10)
plt.title('Distribution of Book Ratings')
plt.xlabel('Book Rating')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the overall distribution of book ratings in the dataset. The histplot reveals how ratings are spread across different values, helping to identify biases or trends in user ratings.

##### 2. What is/are the insight(s) found from the chart?
Upon analyzing the distribution of book ratings, it is evident that the plot is left-skewed (negatively skewed): most ratings fall between 6 and 10, with a thin tail toward the low end. Here’s a detailed interpretation of this observation:

**Left-Skewed Distribution:**

The histogram of book ratings shows a left-skewed distribution. The bulk of the ratings are clustered toward the higher end of the scale (6 to 10), and the longer tail extends toward the lower ratings.

**Concentration of Ratings:**

6 to 10 Range: A significant number of books have been rated between 6 and 10, indicating that users generally have a positive outlook towards the books they read.

Few Low Ratings: There are fewer books with ratings below 6, which might suggest that users either do not rate books they dislike or the books in the dataset are of generally high quality.
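The direction of the skew can be checked numerically rather than read off the plot. A minimal sketch, using made-up ratings (not the dataset) that are concentrated at the high end: a negative sample skewness, together with a mean below the median, indicates a tail toward low values.

```python
import numpy as np
from scipy import stats

# Made-up ratings concentrated between 7 and 10, with a few low scores
ratings = np.array([1] * 1 + [5] * 2 + [7] * 4 + [8] * 6 + [9] * 4 + [10] * 3)

skewness = stats.skew(ratings)
mean_rating = ratings.mean()
median_rating = np.median(ratings)

print(f"skewness = {skewness:.2f}, mean = {mean_rating:.2f}, median = {median_rating:.1f}")
```

On the real data, the same check would be `stats.skew(df['Book-Rating'])`; a value below zero would confirm that the tail points toward the low ratings.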


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights derived from the analysis of the book ratings distribution can indeed help create a positive business impact in several ways.

1. Enhanced User Satisfaction and Engagement

  **Business Impact:**

  **Personalized Recommendations:** By understanding that users tend to rate books highly, the recommendation system can focus on suggesting similar high-rated books, thereby enhancing user satisfaction.

  **Engagement:** Providing users with books they are more likely to enjoy increases the likelihood of continued engagement and use of the platform.
2. Improved Marketing and Promotion Strategies

  **Business Impact:**

  **Targeted Promotions:** Promote books that have received high ratings more prominently, as they are more likely to attract user interest and generate sales.

  **User Testimonials:** Utilize positive ratings and reviews in marketing campaigns to build trust and attract new users.

3. Increased Sales and Revenue

  **Business Impact:**

  **Sales Boost:** By recommending books that are similar to those with high ratings, users are more likely to make purchases, thereby increasing sales.

  **Cross-Selling Opportunities:** Recommend high-rated books alongside other related products (e.g., merchandise, additional content), creating opportunities for cross-selling and upselling.

4. Refined Content Acquisition Strategy

  **Business Impact:**

  **Content Curation:** Focus on acquiring and promoting books that align with user preferences, as indicated by high ratings. This ensures a curated selection that meets user expectations.
  
  **Supplier Relationships:** Strengthen relationships with publishers and authors of highly-rated books to secure more exclusive or early access to popular content.


#### Scatterplot - Age vs. Book Ratings:

In [None]:
# Chart - 2 visualization code
# Age vs. Book Ratings
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Book-Rating', data=df, alpha=0.5)
plt.title('Age vs. Book Ratings')
plt.xlabel('Age')
plt.ylabel('Book Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To analyze how book ratings vary with user age. The scatter plot can show if certain age groups tend to rate books higher or lower, indicating user preferences based on age.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot showing that users of all ages have given ratings from 1 to 10 provides valuable insights into user behavior and preferences.

Insight: The fact that users across all age groups have given ratings from 1 to 10 suggests that the books available on the platform have a broad appeal, catering to diverse age groups.

Insight: The variation in ratings across all age groups implies that individual preferences vary widely, regardless of age. This means that people of the same age group might have different tastes in books.

Insight: Users from all age groups are actively engaging with the platform by rating books, indicating a healthy and engaged user base.
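A scatter plot of two discrete variables overplots heavily, so a rank correlation is a useful numeric complement. The sketch below runs on synthetic, independently generated ages and ratings (an assumption made purely for illustration), which is why the coefficient comes out near zero.

```python
import numpy as np
from scipy import stats

# Synthetic data: ages and ratings are drawn independently, so no real relationship exists
rng = np.random.default_rng(0)
ages = rng.integers(10, 80, size=1000)
ratings = rng.integers(1, 11, size=1000)

# Spearman's rho measures monotonic association and suits ordinal ratings
rho, p_value = stats.spearmanr(ages, ratings)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3f}")
```

On the real data, the call would be `stats.spearmanr(df['Age'], df['Book-Rating'])`. A coefficient close to zero would support the reading that age alone says little about how a user rates books.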

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
1. Enhanced Personalization:
Utilize the insight that preferences vary widely within age groups to enhance personalization algorithms. This means focusing on individual user behavior and ratings history rather than relying heavily on age as a predictor.
2. Targeted Marketing Campaigns:
Develop marketing campaigns that highlight the diversity of books available, showcasing that the platform caters to all age groups and tastes.

3. User Retention Strategies:
  Implement user retention strategies that acknowledge the active participation of all age groups. For example, create age-specific reading challenges or book clubs to foster community engagement.

#### Scatterplot -  Book Age vs. Book Ratings:

In [None]:
# Chart - 3 visualization code
# Create a scatter plot of Book Age vs. Book Ratings
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Book-Age', y='Book-Rating', data=df, alpha=0.5)

# Set the x-axis scale to start at 0 and have a gap of 10
plt.xticks(np.arange(0, df['Book-Age'].max() + 10, 10))

# Add title and labels
plt.title('Book Age vs. Book Ratings')
plt.xlabel('Book Age')
plt.ylabel('Book Rating')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To investigate whether there is a relationship between the age of a book and its ratings. A scatter plot can reveal whether newer or older books tend to receive higher ratings, indicating potential biases toward newer releases or classics.

##### 2. What is/are the insight(s) found from the chart?

**Books Less Than 20 Years Old:**

Observation: These books have ratings from a limited number of users.

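Insight: This could indicate that newer books are either not as widely discovered or not as extensively reviewed by users on the platform. It may also suggest that these books are still building their reputation and readership. Note that Book-Age is measured from today's date, so the sparsity may partly reflect when the ratings were collected rather than reader interest.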

**Books Aged 20 to 100 Years:**

Observation: These books received ratings from many users, covering the full spectrum of ratings from 1 to 10.

Insight: This indicates a strong interest and engagement with books that fall within this age range. It suggests that these books are well-established and have a broad appeal across the user base.

**Books More Than 100 Years Old:**

Observation: These books predominantly received high ratings, typically between 5 and 10, although only a few users rated them.

Insight: This suggests that very old books, likely classics or well-regarded historical works, are highly valued by users. These books may have stood the test of time and are appreciated for their enduring quality and significance.
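The three age bands can be summarized numerically, which makes the number of ratings behind each observation explicit. A minimal sketch on made-up rows, assuming the `Book-Age` and `Book-Rating` column names; the band labels are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up book ages and ratings
toy = pd.DataFrame({
    "Book-Age": [5, 15, 25, 40, 60, 99, 120, 150],
    "Book-Rating": [7, 8, 3, 9, 6, 10, 8, 9],
})

# Left-closed bands: [0, 20), [20, 100), [100, inf)
bands = pd.cut(toy["Book-Age"], bins=[0, 20, 100, np.inf],
               labels=["under 20", "20 to 99", "100+"], right=False)

# Count, mean, and range of ratings per band
summary = toy.groupby(bands, observed=True)["Book-Rating"].agg(["count", "mean", "min", "max"])
print(summary)
```

A band with a small count deserves caution: a high average built on a handful of ratings is weaker evidence than the same average built on thousands.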

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**1. Targeted Marketing Strategies:**

  Highlight and promote books aged 20 to 100 years, as they attract the most engagement and span the full range of user ratings. Create marketing campaigns that emphasize the diversity and richness of these books.

**2. Promotion of Classics:**

  Create special collections or features for books that are more than 100 years old, emphasizing their high ratings and classic status.

**3. Discovery of Newer Books:**

  Implement recommendation algorithms and discovery features that help users find and rate newer books (less than 20 years old).


#### Barplot - Average Book Rating by Age Group:

In [None]:
# Chart - 4 visualization code
# Average Book Rating by Age Group
plt.figure(figsize=(10, 6))
sns.barplot(x='Age-Group', y='Book-Rating', data=df, estimator=lambda x: sum(x) / len(x))
plt.title('Average Book Rating by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Average Book Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To compare average book ratings across user age groups. A bar plot places the group means side by side, revealing which groups rate books more favorably.

##### 2. What is/are the insight(s) found from the chart?

**Consistent Positive Ratings Across Age Groups:**

**Observation:** All age groups give average ratings between 7 and 8.

**Insight:** This indicates a generally positive reception of the books available on the platform across different age demographics. It suggests that the platform’s book selection appeals broadly to users of all ages.
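A bar of means hides how many ratings sit behind each bar and how spread out they are. A minimal sketch of the same comparison as a table, using made-up toy rows; the group labels are hypothetical and only the column names follow the plotting code above.

```python
import pandas as pd

# Toy data for illustration only; the age-group labels are hypothetical
toy = pd.DataFrame({
    'Age-Group':   ['Teen', 'Teen', 'Adult', 'Adult', 'Adult', 'Senior'],
    'Book-Rating': [8, 7, 9, 6, 8, 7],
})

# Mean rating per group, together with the number of ratings and their spread
group_summary = (
    toy.groupby('Age-Group')['Book-Rating']
       .agg(mean='mean', count='count', std='std')
       .round(2)
)
print(group_summary)
```

Groups with a single rating have an undefined standard deviation (NaN), which is a useful reminder that their mean rests on very little data.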

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Confidence in Current Catalog:**

Maintain the diversity and quality of the current book catalog, as it is well-received across all age groups.

**Tailored Marketing Campaigns:**

Create marketing campaigns that highlight the overall high ratings of the platform's books, appealing to users of all age groups.

**Enhanced User Experience:**

Focus on maintaining and improving the user experience by ensuring the availability of high-quality books across various genres and categories that appeal to all age groups.

#### Histplot - Distribution of Year of Publication:

In [None]:
# Chart - 5 visualization code
# Distribution of Year of Publication
plt.figure(figsize=(10, 6))
sns.histplot(df['Year-Of-Publication'], bins=30, kde=True)
plt.title('Distribution of Year of Publication')
plt.xlabel('Year of Publication')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

To examine the distribution of book publication years in the dataset. A histogram shows how publication years are distributed, which can reveal trends over time and whether the dataset leans toward older or newer books.

##### 2. What is/are the insight(s) found from the chart?

**Concentration of Publications (1970-2010):**

Observation: A significant number of books in the dataset were published between 1970 and 2010.

Insight: This period likely represents a time of prolific book production, and these books may still be popular and relevant to readers today. It also suggests that the platform has a strong collection of books from this era.
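The concentration described above can be quantified as the share of rows whose publication year falls inside the window. A minimal sketch on made-up years; only the column name comes from the notebook.

```python
import pandas as pd

# Toy publication years for illustration only
years = pd.Series([1955, 1972, 1985, 1991, 1999, 2001, 2003, 2012],
                  name='Year-Of-Publication')

# between() is inclusive on both ends by default
in_window = years.between(1970, 2010)

# The mean of a boolean Series is the fraction of True values
share = in_window.mean()
print(f"{share:.0%} of the toy titles fall in 1970-2010")
```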

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Curated Collections and Recommendations:**

Create curated collections of books from the 1970-2010 period and feature them prominently on the platform. Highlight these collections in recommendation algorithms.

**Content Strategy and Acquisition:**

Focus on acquiring more books that are either from this period or similar in style and themes to those popular between 1970 and 2010.

**User Engagement Initiatives:**

Launch reading challenges, book clubs, or discussion forums centered around popular books from the 1970-2010 period.


#### Stack plot - Top 20 books with rating of 10, 9, 8

In [None]:
# Chart - 7 visualization code
# Filter the dataset for ratings of 10, 9, and 8
high_rate_df = df[df['Book-Rating'].isin([8, 9, 10])]

# Calculate the frequency of each rating for each book
rating_frequencies = high_rate_df.groupby(['Book-Title', 'Book-Rating']).size().unstack(fill_value=0)

# Calculate total ratings for each book to find the top 20 books
rating_frequencies['Total'] = rating_frequencies.sum(axis=1)
top_20_books = rating_frequencies.sort_values(by='Total', ascending=False).head(20)

# Plot the rating frequencies
top_20_books.drop(columns='Total').plot(kind='bar', stacked=True, figsize=(14, 8), color=['#FF9999', '#66B3FF', '#99FF99'])
plt.title('Top 20 Books with Ratings of 10, 9, and 8 Along with Rating Frequency')
plt.xlabel('Book Title')
plt.ylabel('Frequency')
plt.legend(title='Book Rating', loc='upper right')
plt.xticks(rotation=90)
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart provides a clear view of the top 20 books that received high ratings (8, 9, and 10). It makes the frequency of each rating easy to compare within and across books.

##### 2. What is/are the insight(s) found from the chart?

**High Popularity of "The Lovely Bones":**

Observation: "The Lovely Bones: A Novel" received significantly more high ratings (8 to 10) compared to other books.

Insight: This indicates that "The Lovely Bones" is exceptionally popular among users, suggesting it has a broad appeal and is well-received.

**General Positive Reception of Books:**

Observation: Most of the top 20 books received high ratings (8 to 10) from around 200 users each.

Insight: This suggests a generally positive reception of books on the platform, indicating that users are finding and enjoying books that meet their expectations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Marketing and Promotion:**

Use the popularity of "The Lovely Bones" as a key marketing asset. Promote it prominently on the platform and in marketing campaigns.

**Recommendation Systems:**

Incorporate the high rating and popularity data into recommendation algorithms to suggest similar books to users.

**Content Acquisition:**

Focus on acquiring and promoting books that are similar in genre, style, or theme to "The Lovely Bones" and other highly-rated books.

#### Horizontal Bar plot - Top 20 Books by Rating Count

In [None]:
# Chart - 8 visualization code
# Calculate the rating count for each book
rating_counts = df['Book-Title'].value_counts().reset_index()
rating_counts.columns = ['Book-Title', 'Rating Count']

# Get the top 20 books with the highest rating counts
top_20_books = rating_counts.head(20)

# Plot the rating counts
plt.figure(figsize=(14, 8))
plt.barh(top_20_books['Book-Title'], top_20_books['Rating Count'], color='skyblue')
plt.xlabel('Rating Count')
plt.ylabel('Book Title')
plt.title('Top 20 Most Popular Books by Rating Count')
plt.gca().invert_yaxis()  # Invert y-axis to have the highest rating count at the top
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

If categories or labels are long (e.g., book titles), a horizontal bar plot makes it easier to display these labels without overlapping or truncating them.

##### 2. What is/are the insight(s) found from the chart?

**Observation:**

"The Lovely Bones: A Novel" received over 700 ratings.

"Wild Animus" received nearly 600 ratings.

"The Da Vinci Code" received around 500 ratings.

"The Secret Life of Bees" received around 400 ratings.


**High Engagement with Specific Titles:**

**Observation:** These books have received a significant number of ratings, indicating high user engagement with these titles.

**Insight:** The high rating counts suggest these books are popular and widely read among users on the platform. These titles likely resonate well with a broad audience, making them key assets for the platform.

**Marketing and Promotional Strategies:**

**Observation:** The popularity of these books can be leveraged in marketing campaigns.

**Insight:** Featuring these books prominently in marketing materials, newsletters, and on the homepage can attract more users. Highlighting these popular books can also encourage users to engage more with the platform, leading to increased traffic and sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Increased User Engagement:**

Action: Use these popular books to enhance user engagement by recommending similar titles and creating themed collections around these books.

**Enhanced Marketing Strategies:**

Action: Highlight these popular books in marketing campaigns, emails, and social media promotions.

**Improved Recommendation Algorithms:**

Action: Integrate the popularity of these books into recommendation algorithms to suggest similar books to users.
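One common way to fold popularity into a ranking is a damped mean, which pulls the average of rarely rated books toward the global mean. The sketch below illustrates that general idea and is not the method used in this notebook; the toy titles, the ratings, and the damping constant `m` are all assumptions.

```python
import pandas as pd

# Toy ratings for illustration only
toy = pd.DataFrame({
    'Book-Title':  ['Book A'] * 6 + ['Book B'] + ['Book C'] * 3,
    'Book-Rating': [9, 9, 9, 9, 9, 8, 10, 5, 6, 7],
})

# Per-book mean rating and rating count
book_stats = toy.groupby('Book-Title')['Book-Rating'].agg(['mean', 'count'])
global_mean = toy['Book-Rating'].mean()

# Hypothetical damping constant: acts like m extra ratings at the global mean
m = 5

# Damped mean: (count * mean + m * global_mean) / (count + m)
book_stats['score'] = (
    book_stats['count'] * book_stats['mean'] + m * global_mean
) / (book_stats['count'] + m)

ranked = book_stats.sort_values('score', ascending=False)
print(ranked)
```

In the toy data, "Book B" has the highest raw mean from a single rating, but "Book A" ranks first once the number of ratings is taken into account.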

#### Horizontal Bar Plot - Top 10 Popular Authors

In [None]:
# Chart - 9 visualization code
# Calculate the rating count for each author
author_rating_counts = df.groupby('Book-Author')['Book-Rating'].count().reset_index()
author_rating_counts.columns = ['Book-Author', 'Rating Count']

# Get the top 10 authors with the highest rating counts
top_10_authors = author_rating_counts.sort_values(by='Rating Count', ascending=False).head(10)

# Plot the rating counts
plt.figure(figsize=(12, 8))
plt.barh(top_10_authors['Book-Author'], top_10_authors['Rating Count'], color='lightgreen')
plt.xlabel('Rating Count')
plt.ylabel('Author')
plt.title('Top 10 Most Popular Authors by Rating Count')
plt.gca().invert_yaxis()  # Invert y-axis to have the most popular authors at the top
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

If categories or labels are long (e.g., author names), a horizontal bar plot makes it easier to display these labels without overlapping or truncating them.

##### 2. What is/are the insight(s) found from the chart?

**Observation:**

**Stephen King:** Over 4,000 ratings.

**Nora Roberts:** Around 3,000 ratings.

**John Grisham:** More than 2,000 ratings.

**James Patterson:** More than 2,000 ratings.

**J.K. Rowling:** Around 2,000 ratings.

**High Popularity of Certain Authors:**

**Observation:** Stephen King, Nora Roberts, John Grisham, James Patterson, and J.K. Rowling are some of the most popular authors on the platform, with Stephen King leading by a significant margin.

**Insight:** These authors have a strong fan base and their books are highly engaging for users. Their popularity suggests that they are key figures in the reading community and drive significant traffic to the platform.

**Opportunities for Targeted Marketing:**

**Observation:** The high rating counts for these authors indicate that their books are widely read and reviewed.

**Insight:** Marketing campaigns featuring these authors' books are likely to resonate with a broad audience. Promoting new releases, special collections, or discounts on books by these authors could lead to increased user engagement and sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Increased User Engagement:**

Action: Feature these popular authors prominently on the platform, in newsletters, and on social media. Highlight their most popular books and any new releases.

**Boosted Sales and Traffic:**

Action: Run targeted marketing campaigns focused on these authors, including promotions, discounts, and bundles of their books.

#### Word Cloud - Most frequent words in Book Titles

In [None]:
# Chart - 10 visualization code
# Create a single string of all book titles
text = ' '.join(df['Book-Title'].values)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

##### 1. Why did you pick the specific chart?

A word cloud is a popular visualization technique that displays the frequency of words in the book titles, with the size of each word indicating its frequency.

##### 2. What is/are the insight(s) found from the chart?

**Dominance of Fiction:**

**Observation:** The word "Novel" being the largest suggests that the majority of books in the dataset are fictional works.

**Insight:** This indicates that fiction is the dominant genre on the platform. The emphasis on novels suggests a strong user preference for this type of content.

**Emphasis on Emotional and Relatable Themes:**

**Observation:** Words like "life," "love," "heart," and "mysterious" indicate that books dealing with emotional, relatable, and intriguing themes are popular.

**Insight:** These themes resonate with the audience, highlighting the types of narratives that engage users the most.

**Specific Interests in "Harry Potter":**

**Observation:** The prominence of "Harry Potter" suggests a strong interest in this specific series.

**Insight:** The popularity of "Harry Potter" indicates a significant user interest in fantasy and young adult fiction.
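Because the word cloud is built from one title per rating row, a heavily rated book contributes its title words many times. A minimal sketch of that effect, using `collections.Counter` as a stand-in for the word cloud's frequency counting and a handful of toy rows; it compares counts per rating row with counts per distinct title.

```python
from collections import Counter

import pandas as pd

# Toy rows for illustration only: one row per rating, so titles repeat
toy = pd.DataFrame({'Book-Title': [
    'The Lovely Bones: A Novel',
    'The Lovely Bones: A Novel',
    'The Lovely Bones: A Novel',
    'Wild Animus',
    'The Secret Life of Bees',
]})

def word_counts(titles):
    """Count lower-cased words across an iterable of titles."""
    words = ' '.join(titles).lower().replace(':', ' ').split()
    return Counter(words)

# Frequencies weighted by number of ratings vs. by distinct titles
per_rating = word_counts(toy['Book-Title'])
per_title = word_counts(toy['Book-Title'].drop_duplicates())

print(per_rating['novel'], per_title['novel'])
```

Whether to deduplicate depends on the question: counts per rating row reflect what users engage with, while counts per distinct title reflect what the catalog contains.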

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Dominance of Fiction:**

**Business Impact:** To capitalize on this insight, we can focus on expanding the fiction catalog, ensuring a diverse range of novels that caters to the audience's preferences.

**Emphasis on Emotional and Relatable Themes:**

**Business Impact:** We can use this information to curate and promote books that emphasize these themes (words like "life," "love," "heart," and "mysterious"). Creating themed collections around these keywords could attract more users and enhance engagement.

**Specific Interests in "Harry Potter":**

**Business Impact:** We can create special promotions around the "Harry Potter" series or similar fantasy books.

#### **Bubble Chart of Author Popularity**

In [None]:
# Chart - 11 visualization code
# Calculate rating count and average rating for each author
author_stats = df.groupby('Book-Author')['Book-Rating'].agg(['mean', 'count']).reset_index()

# Plot the bubble chart
plt.figure(figsize=(10, 6))
plt.scatter(author_stats['count'], author_stats['mean'], s=author_stats['count'], alpha=0.5)
plt.title('Bubble Chart of Author Popularity')
plt.xlabel('Rating Count')
plt.ylabel('Average Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A bubble chart is a type of data visualization that extends the concept of a scatter plot by adding a third dimension, typically represented by the size of the bubbles. It is particularly useful in cases where you want to visualize the relationships between three variables in a single chart.

##### 2. What is/are the insight(s) found from the chart?

**Observation:** Most authors have average ratings between 6 and 10, suggesting that readers generally rate books favorably.

**Insight:** The broad range of average ratings within this band indicates that users are mostly satisfied with the books they read. This positive reception could mean that the platform is successful in curating content that resonates well with its audience.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:** This insight confirms that the platform is maintaining a good selection of books that meet user expectations. Continued focus on acquiring and promoting highly-rated books will likely maintain or improve user satisfaction and engagement.

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Calculate the correlation matrix
corr_matrix = df.corr(numeric_only = True)

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Features')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is a visualization that displays the correlation matrix between multiple variables in a dataset. Correlation measures the strength and direction of the relationship between two variables, typically ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation.

##### 2. What is/are the insight(s) found from the chart?

**Correlation Insights:**

**Book Rating and Age:** A correlation of 0.03 is a very weak positive relationship. User age has almost no linear association with book ratings.

**Book Rating and Year of Publication:** A correlation of -0.01 is an almost negligible negative relationship. The publication year has minimal linear association with book ratings.

**Book Rating and Book Age:** A correlation of 0.01 is a very weak positive relationship. Book age (time since publication) has almost no linear association with ratings.

**Age and Year of Publication:** A correlation of 0.02 is a very weak positive relationship. User age tells us almost nothing about the publication year of the books a user rates.

**Age and Book Age:** A correlation of -0.02 is a very weak negative relationship. User age tells us almost nothing about the age of the books a user rates.
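The Year of Publication and Book Age rows above mirror each other in sign. That is expected if Book Age was derived as a fixed reference year minus the publication year, which is an assumption here since the derivation is not shown in this section. A minimal sketch on random toy data:

```python
import numpy as np

# Random toy data for illustration only
rng = np.random.default_rng(0)
year = rng.integers(1950, 2005, size=200)
rating = rng.integers(1, 11, size=200)

# Assumed derivation: book age = reference year - publication year
reference_year = 2024  # hypothetical reference year
book_age = reference_year - year

# Pearson correlations
r_rating_year = np.corrcoef(rating, year)[0, 1]
r_rating_age = np.corrcoef(rating, book_age)[0, 1]
r_year_age = np.corrcoef(year, book_age)[0, 1]

print(r_rating_year, r_rating_age, r_year_age)
```

Under that assumption, any correlation with Book Age is the negative of the same correlation with Year of Publication, so the two columns carry the same information for a linear model.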

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code

# Create pair plot
sns.pairplot(df)

# Show plot
plt.show()

##### 1. Why did you pick the specific chart?

Pair plots help in visualizing the relationships between each pair of variables. This makes it easier to identify any linear or non-linear correlations.


##### 2. What is/are the insight(s) found from the chart?

**Plot - Book-Rating Vs. Age**

Insight: The fact that users across all age groups have given ratings from 1 to 10 suggests that the books available on the platform have a broad appeal, catering to diverse age groups.

**Plot - Book-Rating Vs. Book Age**

**Books Less Than 20 Years Old:**

Observation: These books have ratings from a limited number of users.

Insight: This could indicate that newer books are either not as widely discovered or not as extensively reviewed by users on the platform. It may also suggest that these books are still building their reputation and readership.

**Books Aged 20 to 100 Years:**

Observation: These books received ratings from many users, covering the full spectrum of ratings from 1 to 10.

Insight: This indicates a strong interest and engagement with books that fall within this age range. It suggests that these books are well-established and have a broad appeal across the user base.

**Books More Than 100 Years Old:**

Observation: These books predominantly received high ratings, typically between 5 and 10, but they were rated by only a few users.

Insight: This suggests that very old books, likely classics or well-regarded historical works, are highly valued by users. These books may have stood the test of time and are appreciated for their enduring quality and significance.

**Plot - Year of Publication Vs. Book Age:**

The plot shows a perfect negative relationship: as the publication year increases, book age decreases proportionally. This is expected, because book age is computed from the publication year.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Hypothesis 1:**

**Statement:** "Books by J.K. Rowling have a significantly higher average rating compared to books by Stephen King."

**Null Hypothesis (H0):** The average rating of books by J.K. Rowling is equal to the average rating of books by Stephen King.

**Alternative Hypothesis (H1):** The average rating of books by J.K. Rowling is higher than the average rating of books by Stephen King.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Filter ratings for J.K. Rowling and Stephen King
jk_rowling_ratings = df[df['Book-Author'] == 'J.K. Rowling']['Book-Rating']
stephen_king_ratings = df[df['Book-Author'] == 'Stephen King']['Book-Rating']

# Perform t-test (one-tailed, as we are testing if J.K. Rowling's ratings are higher)
t_stat, p_value = stats.ttest_ind(jk_rowling_ratings, stephen_king_ratings, alternative='greater')

print("Hypothesis 1 - J.K. Rowling vs. Stephen King")
print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: J.K. Rowling's books have significantly higher ratings than Stephen King's books.")
else:
    print("Fail to reject the null hypothesis: No evidence that J.K. Rowling's books are rated higher than Stephen King's books.")

##### Which statistical test have you done to obtain P-Value?

Test Used: Independent t-test (one-tailed)

##### Why did you choose the specific statistical test?

**Nature of Data:** We are comparing the average ratings of books from two different authors. The data consists of two independent samples (ratings for J.K. Rowling's books and ratings for Stephen King's books).

**Test Purpose:** An independent t-test is used to determine whether there is a statistically significant difference between the means of two independent groups. In this case, we are testing if the mean rating of J.K. Rowling's books is significantly higher than that of Stephen King's books.

**One-Tailed Test:** Since we are specifically interested in whether J.K. Rowling's average ratings are higher (not just different), a one-tailed test is appropriate.
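As a robustness check, the same comparison can be run with Welch's t-test (`equal_var=False`), which does not assume equal variances in the two groups. The sketch below uses synthetic samples rather than the notebook's ratings; the group means, spreads, and sizes are assumptions. It also shows how the one-tailed and two-tailed p-values relate.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only (not the notebook's ratings)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=8.0, scale=1.5, size=300)
group_b = rng.normal(loc=7.5, scale=2.0, size=500)

# Welch's t-test, two-tailed (any difference in means)
t_two, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch's t-test, one-tailed (is the mean of group_a greater?)
t_one, p_one = stats.ttest_ind(group_a, group_b, equal_var=False,
                               alternative='greater')

print(f"two-tailed p = {p_two:.4g}, one-tailed p = {p_one:.4g}")
```

The t-statistic is identical in both calls. When it is positive, the one-tailed p-value is half of the two-tailed one, which is why the direction of the hypothesis should be fixed before looking at the data.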

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Hypothesis 2:**

**Statement:** "Books published after the year 2000 receive higher ratings compared to books published before 2000."

**Null Hypothesis (H0):** The average rating of books published after 2000 is equal to or less than the average rating of books published before 2000.

**Alternative Hypothesis (H1):** The average rating of books published after 2000 is higher than the average rating of books published before 2000.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Filter ratings for books published before 2000 and in 2000 or later (the year 2000 falls in the "after" group)
ratings_before_2000 = df[df['Year-Of-Publication'] < 2000]['Book-Rating']
ratings_after_2000 = df[df['Year-Of-Publication'] >= 2000]['Book-Rating']

# Perform t-test (one-tailed, as we are testing if ratings after 2000 are higher)
t_stat, p_value = stats.ttest_ind(ratings_after_2000, ratings_before_2000, alternative='greater')

print("\nHypothesis 2 - Books Published After 2000 vs. Before 2000")
print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Books published after 2000 have significantly higher ratings.")
else:
    print("Fail to reject the null hypothesis: No evidence that books published after 2000 are rated higher than books published before 2000.")

##### Which statistical test have you done to obtain P-Value?

Test Used: Independent t-test (one-tailed)

##### Why did you choose the specific statistical test?

**Nature of Data:** We are comparing the average ratings of books published before and after 2000. These are two independent groups.

**Test Purpose:** The independent t-test is again suitable here to compare the means of two independent groups (books published before 2000 vs. books published after 2000).

**One-Tailed Test:** We hypothesize that books published after 2000 are rated higher, so a one-tailed test is used to see if the mean rating post-2000 is significantly greater than pre-2000.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Hypothesis 3:**

**Statement:** "Books with more than 500 ratings have a different average rating compared to books with fewer than 500 ratings."

**Null Hypothesis (H0):** The average rating of books with more than 500 ratings is equal to the average rating of books with fewer than 500 ratings.

**Alternative Hypothesis (H1):** The average rating of books with more than 500 ratings is different from the average rating of books with fewer than 500 ratings.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Calculate rating counts for each book
rating_counts = df['Book-Title'].value_counts()

# Split books into those with more than 500 ratings and those with 500 or fewer
popular_books = df[df['Book-Title'].isin(rating_counts[rating_counts > 500].index)]['Book-Rating']
less_popular_books = df[df['Book-Title'].isin(rating_counts[rating_counts <= 500].index)]['Book-Rating']

# Perform t-test (two-tailed, as we are testing if ratings are different)
t_stat, p_value = stats.ttest_ind(popular_books, less_popular_books)

print("\nHypothesis 3 - Books with More vs. Fewer Than 500 Ratings")
print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Books with more than 500 ratings have a significantly different average rating.")
else:
    print("Fail to reject the null hypothesis: No significant difference in ratings between books with more and fewer than 500 ratings.")

##### Which statistical test have you done to obtain P-Value?

Test Used: Independent t-test (two-tailed)

##### Why did you choose the specific statistical test?

**Nature of Data:** We are comparing the average ratings of two groups of books—those with more than 500 ratings and those with fewer than 500 ratings.

**Test Purpose:** The independent t-test is suitable for comparing the means of two independent groups (high-rating-count books vs. low-rating-count books).

**Two-Tailed Test:** Here, we are interested in any significant difference in average ratings, whether higher or lower, so a two-tailed test is appropriate.


**Independent t-test:** This test is chosen for all hypotheses because we are comparing the means of two independent groups in each case.

**One-Tailed vs. Two-Tailed:** The choice between one-tailed and two-tailed tests depends on whether we have a directional hypothesis (expecting one group to have higher/lower ratings) or a non-directional hypothesis (expecting any difference between groups).

**Why Not Use Other Tests?**

**Paired t-test:** Not appropriate here since we are not comparing two related groups (e.g., before-and-after measurements on the same subjects).

**ANOVA:** Could be used if we were comparing more than two groups, but since we are only comparing two groups in each case, the t-test is more straightforward.

**Mann-Whitney U test:** This non-parametric alternative to the t-test could be used if the data were heavily non-normal, but the t-test is preferred for normally distributed data or when sample sizes are large enough for the Central Limit Theorem to apply.
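Since ratings are discrete values on a 1 to 10 scale, running the Mann-Whitney U test next to the t-test is a cheap sanity check. A minimal sketch on synthetic integer ratings; the two groups are made up and do not come from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic integer ratings for illustration only
rng = np.random.default_rng(7)
ratings_a = rng.integers(6, 11, size=200)  # values from 6 to 10
ratings_b = rng.integers(1, 11, size=200)  # values from 1 to 10

# Non-parametric test: do values in ratings_a tend to be larger?
u_stat, p_mw = stats.mannwhitneyu(ratings_a, ratings_b, alternative='greater')

# Parametric counterpart on the same samples
t_stat, p_t = stats.ttest_ind(ratings_a, ratings_b, alternative='greater')

print(f"Mann-Whitney p = {p_mw:.4g}, t-test p = {p_t:.4g}")
```

If both tests lead to the same decision at the chosen significance level, the conclusion does not hinge on the normality assumption.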

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing values were already handled earlier in the notebook; the cell above only reports the remaining null counts per column.

**Explanation:**

**Book Title column - Simply dropped rows**

Given that the book name is crucial for recommendations and understanding user preferences, it makes sense to drop entries where the book name is missing. This ensures that the recommendation system is based on complete and meaningful data.

**Age column - Median Imputation**

Because the "Age" column is used by the recommendation system and its distribution is approximately normal after capping outliers, median imputation is a suitable choice: the median is robust to any remaining skew and stays close to the mean for a near-symmetric distribution.
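The capping and median imputation described above can be sketched as follows. The toy ages and the capping bounds of 10 and 90 are assumptions for illustration; the bounds actually used earlier in the notebook are not shown in this section.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only, including outliers and missing values
ages = pd.Series([25, 31, np.nan, 44, 230, np.nan, 38, 5], name='Age')

# Cap implausible ages to a hypothetical [10, 90] range; NaN stays NaN
capped = ages.clip(lower=10, upper=90)

# Compute the median on the capped, non-missing values and fill the gaps
median_age = capped.median()
imputed = capped.fillna(median_age)

print(median_age)
print(imputed.tolist())
```

Capping before computing the median keeps extreme entries from influencing the fill value, although the median is far less sensitive to them than the mean would be.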

### 2. Categorical Encoding

In [None]:
# Take a random subset of 30,000 rows
subset_data = df.sample(n=30000, random_state=42)
subset_data = subset_data.reset_index(drop=True)

In [None]:
# Initialize the LabelEncoder
le = LabelEncoder()

# Perform label encoding on the 'ISBN' column
subset_data['ISBN_encoded'] = le.fit_transform(subset_data['ISBN'])

#### What all categorical encoding techniques have you used & why did you use those techniques?

**Encoding technique used - Label Encoding**

Why?

**Uniqueness Representation**

Each **ISBN** is unique: Label Encoding allows you to convert each unique ISBN into a unique integer, which preserves the uniqueness of the identifier. This is important in collaborative filtering, where each item (book) needs to be uniquely represented in the model.

**Compact Representation:**

Label Encoding is more memory-efficient than One-Hot Encoding, especially when dealing with high-cardinality features like ISBN (book ID). It maps each unique ISBN to a single integer, keeping the data size manageable without introducing additional dimensions.

**Preserving Relationships**

Direct Mapping: Unlike One-Hot Encoding, which creates additional columns for each category, Label Encoding maintains a one-to-one mapping between the original ISBN and the encoded integer. This direct mapping is useful when the focus is on identifying unique entities rather than understanding relationships between them.
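The one-to-one mapping described above can be illustrated with a small round trip through `LabelEncoder`. The ISBN strings below are placeholders, not real identifiers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder identifiers for illustration only (not real ISBNs)
isbns = pd.Series(['isbn_c', 'isbn_a', 'isbn_b', 'isbn_c'], name='ISBN')

le = LabelEncoder()

# Each distinct ISBN gets one integer; repeated ISBNs share the same integer
encoded = le.fit_transform(isbns)

# The mapping is reversible, so recommendations can be mapped back to ISBNs
decoded = le.inverse_transform(encoded)

print(encoded.tolist())
print(decoded.tolist())
```

The integers follow the sorted order of the labels and carry no meaning beyond identity, so they should be treated as IDs rather than as a numeric feature.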

### 3. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
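# Combine text features (no contraction expansion is applied to this metadata)
# fillna('') keeps a missing publisher or author from turning the whole string into NaN
subset_data['Combined Text'] = subset_data['Book-Title'].fillna('') + ' ' + subset_data['Publisher'].fillna('') + ' ' + subset_data['Book-Author'].fillna('')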

#### 2. Lower Casing

In [None]:
# Lower Casing
subset_data['Combined Text'] = subset_data['Combined Text'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
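# Remove punctuation from the 'Combined Text' column
# re.escape keeps characters such as ']' and '\' literal inside the character class
subset_data['Combined Text'] = subset_data['Combined Text'].str.replace(f'[{re.escape(string.punctuation)}]', '', regex=True)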

#### 4. Removing Words Containing Digits

In [None]:
# Remove words that contain digits
# Define a function to remove words containing digits
def remove_words_with_digits(text):
    # Use regex to match words with digits and remove them
    return ' '.join(word for word in text.split() if not re.search(r'\d', word))

# Apply the function to the 'Combined Text' column
subset_data['Combined Text'] = subset_data['Combined Text'].apply(remove_words_with_digits)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

# Apply the function to the 'Combined Text' column
subset_data['Combined Text'] = subset_data['Combined Text'].apply(remove_stopwords)

In [None]:
# Remove White spaces
# Replace multiple spaces with a single space
subset_data['Combined Text'] = subset_data['Combined Text'].str.replace(r'\s+', ' ', regex=True).str.strip()

#### 6. Tokenization

In [None]:
# Tokenization
# Define a function for tokenization
def tokenize_text(text):
    return word_tokenize(text)

# Apply the function to the 'Combined Text' column
subset_data['Combined Text Token'] = subset_data['Combined Text'].apply(tokenize_text)

#### 7. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function for lemmatization
def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

# Apply the function to the 'Combined Text' column
subset_data['Combined Text lemmatized'] = subset_data['Combined Text'].apply(lemmatize_text)

##### Which text normalization technique have you used and why?

Lemmatization (NLTK's `WordNetLemmatizer`) was used. It reduces each word to its dictionary base form (lemma). Unlike stemming, which strips prefixes or suffixes and can produce non-words, lemmatization ensures that the base form is a valid word in the language.

#### 8. Text Vectorization

In [None]:
# Vectorizing Text
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the 'Combined Text lemmatized' column
tfidf_matrix = tfidf_vectorizer.fit_transform(subset_data['Combined Text lemmatized'])

In [None]:
subset_data.shape

In [None]:
#df = df.drop(['Combined Text', 'Combined Text Token', 'Combined Text lemmatized'], axis=1)
subset_data = subset_data.drop(['Combined Text', 'Combined Text Token', 'Combined Text lemmatized'], axis=1)

##### Which text vectorization technique have you used and why?

TF-IDF was used to vectorize the combined text column because it weights each term by how informative it is: terms that appear in many records are down-weighted, while distinctive terms receive higher weights. This gives a better feature representation than raw counts and supports more accurate document similarity measurements.
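
As a rough sketch of how the weighting behaves, the snippet below computes TF-IDF by hand with the standard library. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and omits the L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default; the three example strings are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

# Hypothetical combined-text strings
docs = [
    "harry potter chamber secrets rowling",
    "harry potter goblet fire rowling",
    "hobbit tolkien",
]
w = tfidf(docs)

# 'harry' appears in two documents, 'chamber' in only one, so 'chamber' weighs more
print(round(w[0]["harry"], 3), round(w[0]["chamber"], 3))
```

Terms shared by many records contribute less to similarity than terms that distinguish one record from the rest.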

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

Feature manipulation was already handled during text data preprocessing.

#### 2. Feature Selection

##### Which all features you found important and why?

**Key Features for Collaborative Filtering**

**User-ID:**

Importance: Essential.

Role: Represents the user in the recommendation system. Collaborative filtering models rely heavily on identifying users and finding similar users (User-Based Collaborative Filtering) or determining which items are most similar based on user interactions (Item-Based Collaborative Filtering).

**ISBN_encoded (Book-ID):**

Importance: Essential.

Role: Represents the item (book) in the recommendation system. It is used to track which books have been rated by users and to identify similar items.

**Book-Rating:**

Importance: Essential.

Role: The actual rating given by users to books. This feature is the core of collaborative filtering since it captures user preferences. Ratings are used to calculate similarities between users or items.

**Key Features for Content-Based Filtering:**

**Book-Title:**

Importance: High.

Role: Although book titles themselves might not directly convey content, they can be used in natural language processing (NLP) tasks to extract keywords, or be included in a string similarity comparison. They might be more useful in combination with other text features like descriptions.

**Book-Author:**

Importance: High.

Role: Author names can be indicative of the writing style or genre, which can be important for users who prefer books by specific authors. This feature can be used to suggest other books by the same author or similar authors.

**Publisher:**

Importance: Moderate.

Role: Like the author, the publisher might be less critical but can still provide context, especially if certain publishers specialize in specific genres or types of books.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

No need

In [None]:
# Transform Your data

### 6. Data Scaling

No need

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Define the feature columns and the target column
feature_columns = ['User-ID','Book-Title','Book-Author','Year-Of-Publication','Publisher','ISBN_encoded']
target_column = 'Book-Rating'  # The column you want to predict

# Split the data into features (X) and target (y)
X = subset_data[feature_columns]
y = subset_data[target_column]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

**80:20 Split:**

80% Training Data: Used for training the model.
20% Test Data: Used for evaluating the model's performance on unseen data.

**Why Use an 80:20 Split?**

**Sufficient Training Data:**

80%  of the data being used for training provides the model with enough data to learn the patterns and relationships in the dataset. This is particularly important for recommendation systems, where understanding user preferences and item features requires a substantial amount of data.

**Reliable Evaluation:**

20% of the data being held out for testing allows for a reliable evaluation of the model's performance. This ensures that the model's accuracy, precision, recall, and other metrics reflect its ability to generalize to new, unseen data.

**Preventing Overfitting:**

By holding out a portion of the data for testing, you can detect overfitting, where the model might perform well on the training data but poorly on unseen data. This is crucial for building robust recommendation systems that work well in real-world scenarios.

**Resource Considerations:**

An 80:20 split is a good balance between giving the model enough data to learn and having enough data to evaluate the model's performance.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Step 1: Combine X_train and y_train into a single DataFrame
train_data = X_train.copy()
train_data['Book-Rating'] = y_train

# Combine X_test and y_test similarly
test_data = X_test.copy()
test_data['Book-Rating'] = y_test

# Step 2: Prepare the data for Surprise
# Define the rating scale (assuming ratings are between 1 and 10)
reader = Reader(rating_scale=(1, 10))

# Load the training data into the Surprise dataset
trainset = Dataset.load_from_df(train_data[['User-ID', 'ISBN_encoded', 'Book-Rating']], reader).build_full_trainset()

# Step 3: Train the Collaborative Filtering Model
# Use SVD (Singular Value Decomposition)
model = SVD()
model.fit(trainset)

# Step 4: Evaluate the Model on the Test Set
# Create the testset for the Surprise model using the test data
testset = list(zip(X_test['User-ID'], X_test['ISBN_encoded'], y_test))

# Generate predictions
predictions = model.test(testset)  # Evaluate the model performance



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The model used is Singular Value Decomposition (SVD).

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluate the model performance
rmse = accuracy.rmse(predictions)
print(f"Collaborative Filtering RMSE: {rmse:.4f}")

# You can also explore other metrics, like MAE
mae = accuracy.mae(predictions)
print(f"Collaborative Filtering MAE: {mae:.4f}")

# Step 1: Define a threshold for considering a prediction as a positive recommendation
threshold = 7  # This is a common threshold, but you can adjust based on your application

# Extract true labels and predicted labels based on the threshold
y_true = [true_r >= threshold for (_, _, true_r, _, _) in predictions]
y_pred = [est >= threshold for (_, _, _, est, _) in predictions]

# Calculate precision, recall, and F1 score
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')

print(f"Collaborative Filtering Precision: {precision:.4f}")
print(f"Collaborative Filtering Recall: {recall:.4f}")
print(f"Collaborative Filtering F1 Score: {f1:.4f}")

# Step 2: Predict the rating for the book 'First to Fight' for a specific user
user_id = X_test['User-ID'].iloc[1]  # Replace with User-ID from your test set
isbn_encoded = 10057
book_title = 'First to Fight'
# Use the trained model (model) to predict the rating
predicted_rating = model.predict(user_id, isbn_encoded).est
print(f"Predicted rating for user {user_id} on '{book_title}': {predicted_rating:.2f}")

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Combine X_train and y_train into a single DataFrame
train_data = X_train.copy()
train_data['Book-Rating'] = y_train

# Prepare the data for Surprise
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(train_data[['User-ID', 'ISBN_encoded', 'Book-Rating']], reader)

# Define the parameter grid for SVD
param_grid = {
    'n_factors': [20, 50, 100, 150],  # Number of latent factors in the matrix factorization
    'lr_all': [0.002, 0.005, 0.01, 0.02],  # Learning rate for all parameters
    'reg_all': [0.02, 0.05, 0.1, 0.2]  # Regularization term for all parameters
}

# Perform GridSearchCV to find the best hyperparameters
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=-1)
gs.fit(data)

# Get the best model based on RMSE
best_model = gs.best_estimator['rmse']
print(f"Best RMSE Score: {gs.best_score['rmse']}")
print(f"Best Hyperparameters: {gs.best_params['rmse']}")

# Train the best model on the full training set
trainset = data.build_full_trainset()
best_model.fit(trainset)

# Evaluate the model on the test set
testset = list(zip(X_test['User-ID'], X_test['ISBN_encoded'], y_test))
predictions = best_model.test(testset)

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluate the model performance
rmse = accuracy.rmse(predictions)
print(f"Collaborative Filtering RMSE: {rmse:.4f}")

# You can also explore other metrics, like MAE
mae = accuracy.mae(predictions)
print(f"Collaborative Filtering MAE: {mae:.4f}")

# Step 1: Define a threshold for considering a prediction as a positive recommendation
threshold = 7  # This is a common threshold, but you can adjust based on your application

# Extract true labels and predicted labels based on the threshold
y_true = [true_r >= threshold for (_, _, true_r, _, _) in predictions]
y_pred = [est >= threshold for (_, _, _, est, _) in predictions]

# Calculate precision, recall, and F1 score
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')

print(f"Tuned Collaborative Filtering Precision: {precision:.4f}")
print(f"Tuned Collaborative Filtering Recall: {recall:.4f}")
print(f"Tuned Collaborative Filtering F1 Score: {f1:.4f}")

# Step 2: Predict the rating for the book 'First to Fight' (ISBN_encoded 10057) for a specific user
user_id = X_test['User-ID'].iloc[1]  # Replace with an actual User-ID from your test set
isbn_encoded = 10057
book_title = 'First to Fight'

# Use the tuned model (best_model) to predict the rating
predicted_rating = best_model.predict(user_id, isbn_encoded).est
print(f"Predicted rating for user {user_id} on '{book_title}': {predicted_rating:.2f}")

##### Which hyperparameter optimization technique have you used and why?

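Grid Search (Surprise's `GridSearchCV`) was used. It exhaustively evaluates every combination of the specified values for `n_factors`, `lr_all`, and `reg_all`, scores each combination with 5-fold cross-validation, and keeps the one with the lowest RMSE. It was chosen because the search space is small enough to enumerate fully, the procedure is simple and reproducible, and cross-validation guards against choosing settings that only suit one particular split.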
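
To give a sense of the cost of this search, the sketch below enumerates the same parameter grid with `itertools.product` and counts the model fits that 5-fold cross-validation implies. It only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as in the SVD tuning cell
param_grid = {
    "n_factors": [20, 50, 100, 150],
    "lr_all": [0.002, 0.005, 0.01, 0.02],
    "reg_all": [0.02, 0.05, 0.1, 0.2],
}

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5
total_fits = len(combinations) * cv_folds

print(len(combinations))  # 64 combinations
print(total_fits)         # 320 model fits
print(combinations[0])    # {'n_factors': 20, 'lr_all': 0.002, 'reg_all': 0.02}
```

Adding a fourth parameter or more values per parameter multiplies this count, which is the main practical limit of grid search.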

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. Tuning slightly reduced the Root Mean Square Error and Mean Absolute Error and increased the precision.

#### Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**1. Root Mean Square Error (RMSE)**

Indication:

RMSE measures the average magnitude of the error between predicted and actual ratings, penalizing larger errors more heavily due to squaring the differences before averaging.
A lower RMSE indicates that the model’s predictions are closer to the actual ratings, meaning it’s more accurate in predicting user preferences.

**Business Impact:**

Customer Satisfaction: A model with a low RMSE is likely to recommend products (or books) that users truly enjoy, leading to higher customer satisfaction. If the model often predicts ratings that are far from the user's actual rating, it might suggest irrelevant or less-liked products, frustrating the user and potentially leading to churn.

Sales & Revenue: More accurate predictions can drive higher engagement with recommended items, leading to increased sales and revenue. For instance, if a recommendation is accurate and the customer is satisfied with the suggestion, they are more likely to make a purchase.

Trust in the Platform: Consistently accurate recommendations build user trust. Users are more likely to rely on and continue using a platform that consistently suggests relevant and satisfying options.

**2. Mean Absolute Error (MAE)**

Indication:

MAE measures the average magnitude of errors in predictions without considering their direction. Because it averages absolute errors rather than squared ones, it doesn't penalize large errors as heavily as RMSE.
Like RMSE, a lower MAE indicates better model accuracy, but it provides a more straightforward interpretation as the average error in predicted ratings.

**Business Impact:**

User Experience: MAE gives an easy-to-understand metric for the average error in recommendations. If users frequently see recommendations that are close to their preferences, the overall user experience improves.

Product Relevance: A low MAE ensures that the recommendations made by the model are generally relevant to the user’s tastes. This can increase the time users spend on the platform, exploring and interacting with more products.

Cost Efficiency: Lower prediction errors reduce the likelihood of users being recommended items they won’t purchase, which can improve the efficiency of marketing spend and inventory management.
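
The difference between RMSE and MAE described above can be seen in a small worked example. The two error lists are made up so that both have the same MAE; `rmse` and `mae` are plain helper functions, not the Surprise `accuracy` functions used in the notebook.

```python
import math

def rmse(errors):
    """Root of the mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors (predicted rating minus true rating)
uniform_errors = [2, -2, 2, -2]   # every prediction is off by 2
one_big_error = [0, 0, 0, 8]      # three perfect predictions, one off by 8

print(mae(uniform_errors), rmse(uniform_errors))  # 2.0 2.0
print(mae(one_big_error), rmse(one_big_error))    # 2.0 4.0
```

Both cases have the same average error, but RMSE doubles when the error is concentrated in one badly wrong prediction.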

**3. Precision**

Indication:

Precision measures the proportion of true positive recommendations (items the user liked) out of all items the model predicted as liked (recommended).
High precision means that when the model predicts an item as liked, it’s likely to be correct.

**Business Impact:**

Targeted Marketing: High precision ensures that marketing efforts, such as targeted ads or personalized emails, focus on products that users are likely to purchase, reducing wasted marketing resources.

User Retention: Precise recommendations improve user satisfaction, reducing the risk of user churn. If users consistently see relevant recommendations, they are more likely to return to the platform.

Brand Loyalty: By consistently recommending products that users like, the brand builds loyalty, as users feel that the platform understands their needs and preferences.

**4. Recall**

Indication:

Recall measures the proportion of true positive recommendations (items the user liked) out of all items that the user actually liked.
High recall means the model is good at identifying and recommending most of the items a user would like.

**Business Impact:**

Comprehensive Recommendations: High recall ensures that users are exposed to a wide range of products that fit their preferences, increasing the likelihood of discovery and purchase of additional items.

Sales Uplift: By recommending more items that users are likely to enjoy, even if they haven't explicitly searched for them, recall can lead to additional sales, thus boosting revenue.

Customer Discovery: Strong recall can help users discover new products that align with their tastes, potentially leading to increased customer satisfaction and loyalty as they find value in the platform's ability to suggest new favorites.

**5. F1 Score**

Indication:

The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances the trade-offs between the two. It’s particularly useful when you need to balance the importance of precision and recall.
A high F1 score indicates that the model performs well in both precision and recall, meaning it’s both accurate and comprehensive in its recommendations.

**Business Impact:**

Balanced Performance: A high F1 score ensures that the recommendation system is both precise (users see mostly relevant items) and has a wide coverage (users see most of the items they would like). This balance is critical for maintaining a positive user experience.

Optimized Customer Engagement: By balancing precision and recall, the F1 score reflects the model’s ability to drive engagement without overwhelming users with too many irrelevant suggestions or missing potential opportunities to recommend good products.

Strategic Decision-Making: Businesses can rely on the F1 score to understand the overall effectiveness of their recommendation systems, allowing them to make informed decisions about where to invest in further model improvements.
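
The precision, recall, and F1 definitions above can be checked by hand on a few ratings. This sketch uses the same threshold of 7 as the evaluation cells; the rating values are invented, and `precision_recall_f1` is a hypothetical helper rather than scikit-learn's `precision_recall_fscore_support`.

```python
def precision_recall_f1(true_ratings, predicted_ratings, threshold=7):
    """Binarize ratings at the threshold, then compute precision, recall, and F1."""
    y_true = [r >= threshold for r in true_ratings]
    y_pred = [r >= threshold for r in predicted_ratings]
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # liked and recommended
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # recommended but not liked
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # liked but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical true and predicted ratings for six user-book pairs
true_ratings = [9, 8, 7, 5, 3, 10]
predicted_ratings = [8.2, 6.5, 7.4, 7.1, 4.0, 6.9]

precision, recall, f1 = precision_recall_f1(true_ratings, predicted_ratings)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.6667 0.5000 0.5714
```

Here two of the three recommended books were actually liked (precision 2/3), but only two of the four liked books were recommended (recall 1/2).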

**Overall Business Impact of the ML Model:**

Increased Revenue: By improving the relevance and accuracy of recommendations, the model drives more purchases, thereby increasing revenue.

Enhanced Customer Experience: Accurate and comprehensive recommendations improve user satisfaction, leading to higher retention rates and brand loyalty.

Efficient Resource Allocation: Precision ensures that marketing and promotional efforts are well-targeted, reducing waste and improving ROI.

Competitive Advantage: A robust recommendation system can set a business apart from its competitors, making it a preferred platform for users seeking personalized experiences.

### ML Model - 2

In [None]:
# Convert the sparse matrix to a dense format and cast to float32 for Faiss
tfidf_dense = tfidf_matrix.astype(np.float32).toarray()

# Initialize the Faiss index
dimension = tfidf_dense.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add the dense vectors to the Faiss index
index.add(tfidf_dense)

In [None]:
# Function to get content-based recommendations using Faiss
def get_faiss_recommendations(dataset, title, n=10):
    # Check if the title exists in the dataset
    if title not in dataset['Book-Title'].values:
        print(f"Title '{title}' not found in the dataset.")
        return None

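    # Get the row position of the first book that matches the title
    # (a position, not an index label, because tfidf_dense and .iloc are positional)
    idx = np.flatnonzero(dataset['Book-Title'].values == title)[0]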

    # Get the query vector for the book
    query_vector = tfidf_dense[idx].reshape(1, -1)  # Reshape for Faiss

    # Search for the most similar items
    distances, indices = index.search(query_vector, n+1)  # n+1 to skip the first one (itself)

    # Get the top similar books, excluding the queried book itself
    similar_books = dataset.iloc[indices[0][1:]][['Book-Title', 'Book-Author', 'Publisher', 'Year-Of-Publication']]

    return similar_books

# Example usage
recommended_books = get_faiss_recommendations(subset_data,'First to Fight')
if recommended_books is not None:
    print(recommended_books)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Define the hyperparameter grid
param_grid = {
    'index_type': ['FlatL2', 'IVF'],  # Types of FAISS index to try
    'nlist': [50, 100, 200],          # Number of clusters (for IVF)
    'nprobe': [5, 10, 20]             # Number of clusters to visit during search (for IVF)
}

# Initialize variables to track the best configuration
best_ap = 0
best_params = None

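# Function to evaluate FAISS configuration using Average Precision (AP)
# NOTE: this is a placeholder metric. The relevance labels and scores below are fixed,
# so the AP is the same (1/n) for every configuration and does not depend on the
# retrieved books; a meaningful comparison needs real relevance labels.
# The `index` argument is not used directly: get_faiss_recommendations reads the
# global `index`, which the grid-search loop below reassigns on each iteration.
def evaluate_faiss(index, validation_titles, n=10):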
    ap_scores = []
    for title in validation_titles:
        recommendations = get_faiss_recommendations(subset_data,title, n)
        if recommendations is not None:
            # Assume that true relevance is the first item for simplification
            true_relevance = np.zeros(n)
            true_relevance[0] = 1  # Assume the first recommended item is the correct one
            ap_scores.append(average_precision_score(true_relevance, np.ones(n)))
    return np.mean(ap_scores)

# Perform the grid search manually
for index_type in param_grid['index_type']:
    for nlist in param_grid['nlist']:
        for nprobe in param_grid['nprobe']:
            print(f"Evaluating index_type={index_type}, nlist={nlist}, nprobe={nprobe}")

            # Create and train FAISS index
            if index_type == 'IVF':
                quantizer = faiss.IndexFlatL2(tfidf_dense.shape[1])
                index = faiss.IndexIVFFlat(quantizer, tfidf_dense.shape[1], nlist)
                index.train(tfidf_dense)
                index.nprobe = nprobe  # Set the number of probes for IVF index
            else:
                index = faiss.IndexFlatL2(tfidf_dense.shape[1])

            # Add the vectors to the index
            index.add(tfidf_dense)

            # Evaluate this configuration
            validation_titles = subset_data['Book-Title'].sample(100, random_state=42).values
            ap_score = evaluate_faiss(index, validation_titles, n=10)

            print(f"AP Score: {ap_score:.4f}")

            # Update the best parameters if the current configuration is better
            if ap_score > best_ap:
                best_ap = ap_score
                best_params = {'index_type': index_type, 'nlist': nlist, 'nprobe': nprobe}

print(f"Best AP Score: {best_ap}")
print(f"Best Parameters: {best_params}")

# Once the best parameters are found, rebuild the final index with those parameters
if best_params['index_type'] == 'IVF':
    quantizer = faiss.IndexFlatL2(tfidf_dense.shape[1])
    final_index = faiss.IndexIVFFlat(quantizer, tfidf_dense.shape[1], best_params['nlist'])
    final_index.train(tfidf_dense)
    final_index.nprobe = best_params['nprobe']
else:
    final_index = faiss.IndexFlatL2(tfidf_dense.shape[1])

final_index.add(tfidf_dense)

# Update the function to use the final index
def get_faiss_recommendations(dataset,title, n=10):
    # Check if the title exists in the dataset
    if title not in dataset['Book-Title'].values:
        print(f"Title '{title}' not found in the dataset.")
        return None

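    # Get the row position of the first book that matches the title
    # (a position, not an index label, because tfidf_dense and .iloc are positional)
    idx = np.flatnonzero(dataset['Book-Title'].values == title)[0]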

    # Get the query vector for the book
    query_vector = tfidf_dense[idx].reshape(1, -1)

    # Search for the most similar items
    distances, indices = final_index.search(query_vector, n+1)

    # Get the top similar books, excluding the queried book itself
    similar_books = dataset.iloc[indices[0][1:]][['Book-Title', 'Book-Author', 'Publisher', 'Year-Of-Publication']]

    return similar_books

# Example usage with the final tuned index
recommended_books = get_faiss_recommendations(subset_data,'First to Fight')
if recommended_books is not None:
    print(recommended_books)

##### Which hyperparameter optimization technique have you used and why?

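A manual grid search was used, since FAISS indexes do not implement the estimator interface that `GridSearchCV` expects. The loop builds an index for each combination of index type (`FlatL2` or `IVF`), `nlist` (number of clusters), and `nprobe` (number of clusters visited per query), scores it on 100 sampled titles, and keeps the best configuration. FAISS (Facebook AI Similarity Search) itself performs the efficient similarity search for content-based recommendations in the high-dimensional space produced by TF-IDF (Term Frequency-Inverse Document Frequency) vectors. `nlist` and `nprobe` only affect the `IVF` index, because `FlatL2` is an exact, exhaustive search.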
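
The role of the `nlist` and `nprobe` parameters from the grid above can be illustrated with a toy inverted-file search in plain Python. This is a sketch of the idea only, not FAISS code: the 2-D vectors and the fixed centroids are made up (FAISS learns its centroids during `train()`), and `ivf_search` is a hypothetical helper.

```python
import math

def nearest(point, candidates):
    """Indices of candidates sorted by Euclidean distance to point."""
    return sorted(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

# Hypothetical 2-D vectors and fixed centroids; here nlist = 3
vectors = [(0.0, 0.1), (0.2, 0.0), (3.0, 3.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

# Build the inverted lists: each vector is stored under its nearest centroid
inverted_lists = {c: [] for c in range(len(centroids))}
for vec_id, vec in enumerate(vectors):
    inverted_lists[nearest(vec, centroids)[0]].append(vec_id)

def ivf_search(query, k, nprobe):
    """Search only the vectors stored under the nprobe centroids closest to the query."""
    probed = nearest(query, centroids)[:nprobe]
    candidate_ids = [vec_id for c in probed for vec_id in inverted_lists[c]]
    ranked = sorted(candidate_ids, key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return ranked[:k]

query = (2.4, 2.4)
print(ivf_search(query, k=1, nprobe=1))  # [1]  approximate: the true neighbor is missed
print(ivf_search(query, k=1, nprobe=2))  # [2]  the true nearest vector is found
```

With `nprobe=1` only the closest cluster is scanned, so the true nearest vector, which sits just across the cluster boundary, is missed. Raising `nprobe` trades speed for accuracy, and probing all clusters is equivalent to an exhaustive search.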

### ML Model - 3

In [None]:
# Function to get content-based recommendations using Faiss
def get_faiss_recommendations(dataset, title, n=10):
    # Check if the title exists in the dataset
    if title not in dataset['Book-Title'].values:
        print(f"Title '{title}' not found in the dataset.")
        return None

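    # Get the row position of the first book that matches the title
    # (a position, not an index label, because tfidf_dense and .iloc are positional)
    idx = np.flatnonzero(dataset['Book-Title'].values == title)[0]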

    # Get the query vector for the book
    query_vector = tfidf_dense[idx].reshape(1, -1)  # Reshape for Faiss

    # Search for the most similar items
    distances, indices = index.search(query_vector, n+1)  # n+1 to skip the first one (itself)

    # Get the top similar books, excluding the queried book itself
    similar_books = dataset.iloc[indices[0][1:]][['Book-Title', 'Book-Author', 'Publisher', 'Year-Of-Publication']]

    return similar_books

# Example usage for content-based recommendations
recommended_books = get_faiss_recommendations(subset_data, 'First to Fight')

# Step 2: Collaborative Filtering Setup
# Combine X_train and y_train into a single DataFrame
train_data = X_train.copy()
train_data['Book-Rating'] = y_train

# Combine X_test and y_test similarly
test_data = X_test.copy()
test_data['Book-Rating'] = y_test

# Step 3: Prepare the data for Surprise (Collaborative Filtering)
# Define the rating scale (assuming ratings are between 1 and 10)
reader = Reader(rating_scale=(1, 10))

# Load the training data into the Surprise dataset
trainset = Dataset.load_from_df(train_data[['User-ID', 'ISBN_encoded', 'Book-Rating']], reader).build_full_trainset()

# Train the Collaborative Filtering Model (SVD)
model = SVD()
model.fit(trainset)


# Example of Hybrid Recommendation
def get_hybrid_recommendations(user_id, title, n=10):
    # Content-Based Filtering: Get similar books using Faiss
    content_based_recommendations = get_faiss_recommendations(subset_data, title, n)

    if content_based_recommendations is None:
        print(f"No content-based recommendations found for '{title}'.")
        return None

    # Copy so that adding a column does not write into a slice of subset_data
    content_based_recommendations = content_based_recommendations.copy()

    # Collaborative Filtering: Predict the rating the user would give to each candidate.
    # The SVD model was trained on ISBN_encoded item IDs (not titles), so look them up
    # by row label; this assumes subset_data has a unique index.
    candidate_isbns = subset_data.loc[content_based_recommendations.index, 'ISBN_encoded']
    content_based_recommendations['Collaborative_Score'] = [
        model.predict(user_id, isbn).est for isbn in candidate_isbns
    ]

    # Sort the content-based recommendations by the collaborative score
    return content_based_recommendations.sort_values('Collaborative_Score', ascending=False)

# Example usage with a specific user and book title for hybrid recommendations
user_id = X_test['User-ID'].iloc[1]  # Replace with an actual User-ID from your test set
book_title = 'First to Fight'  # Replace with an actual book title from your dataset
hybrid_recommendations = get_hybrid_recommendations(user_id, book_title, n=10)

if hybrid_recommendations is not None:
    print(hybrid_recommendations)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Step 4: Evaluate the Model on the Test Set
# Create the testset for the Surprise model using the test data
testset = list(zip(X_test['User-ID'], X_test['ISBN_encoded'], y_test))

# Generate predictions
predictions = model.test(testset)  # Evaluate the model performance
rmse = accuracy.rmse(predictions)
print(f"Collaborative Filtering RMSE: {rmse:.4f}")

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Function to get content-based recommendations using Faiss
# faiss_index is optional; when omitted, the global `index` is used
def get_faiss_recommendations(dataset, title, n=10, faiss_index=None):
    if title not in dataset['Book-Title'].values:
        print(f"Title '{title}' not found in the dataset.")
        return None

    search_index = index if faiss_index is None else faiss_index

    # Row position (not index label) of the first matching title
    idx = np.flatnonzero(dataset['Book-Title'].values == title)[0]
    query_vector = tfidf_dense[idx].reshape(1, -1)

    distances, indices = search_index.search(query_vector, n+1)
    similar_books = dataset.iloc[indices[0][1:]][['Book-Title', 'Book-Author', 'Publisher', 'Year-Of-Publication']]

    return similar_books

# Function to evaluate FAISS configuration using Average Precision (AP)
# NOTE: this is a placeholder metric. The relevance labels and scores below are fixed,
# so the AP is the same (1/n) for every configuration; a meaningful comparison
# needs real relevance labels.
def evaluate_faiss(faiss_index, validation_titles, n=10):
    ap_scores = []
    for title in validation_titles:
        # Search the index under evaluation rather than the global one
        recommendations = get_faiss_recommendations(subset_data, title, n, faiss_index)
        if recommendations is not None:
            # Assume that true relevance is the first item for simplification
            true_relevance = np.zeros(n)
            true_relevance[0] = 1  # Assume the first recommended item is the correct one
            ap_scores.append(average_precision_score(true_relevance, np.ones(n)))
    return np.mean(ap_scores)

# Step 2: Hyperparameter Tuning for Collaborative Filtering (SVD)
param_grid = {
    'n_factors': [20, 50, 100],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=-1)
train_data = X_train.copy()
train_data['Book-Rating'] = y_train
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(train_data[['User-ID', 'ISBN_encoded', 'Book-Rating']], reader)
gs.fit(data)

best_svd = gs.best_estimator['rmse']
print(f"Best SVD Parameters: {gs.best_params['rmse']}")
print(f"Best RMSE Score: {gs.best_score['rmse']}")

# Train the best SVD model on the full training set
trainset = data.build_full_trainset()
best_svd.fit(trainset)

# Step 3: Hyperparameter Tuning for FAISS
param_grid_faiss = {
    'nlist': [50, 100, 200],
    'nprobe': [5, 10, 20]
}

best_ap = 0
best_faiss_params = None

for nlist in param_grid_faiss['nlist']:
    for nprobe in param_grid_faiss['nprobe']:
        index_ivf = faiss.IndexIVFFlat(faiss.IndexFlatL2(dimension), dimension, nlist)
        index_ivf.train(tfidf_dense)
        index_ivf.nprobe = nprobe
        index_ivf.add(tfidf_dense)
        # Evaluate this configuration
        validation_titles = subset_data['Book-Title'].sample(100, random_state=42).values
        ap_score = evaluate_faiss(index_ivf, validation_titles, n=10)  # Evaluate the IVF index built in this iteration
        print(f"nlist: {nlist}, nprobe: {nprobe}, AP Score: {ap_score:.4f}")

        if ap_score > best_ap:
            best_ap = ap_score
            best_faiss_params = {'nlist': nlist, 'nprobe': nprobe}

print(f"Best FAISS Parameters: {best_faiss_params}")
print(f"Best AP Score: {best_ap}")

# Use the best parameters for FAISS
index_ivf = faiss.IndexIVFFlat(faiss.IndexFlatL2(dimension), dimension, best_faiss_params['nlist'])
index_ivf.train(tfidf_dense)
index_ivf.nprobe = best_faiss_params['nprobe']
index_ivf.add(tfidf_dense)
index = index_ivf  # Make the tuned index the default one used by get_faiss_recommendations

# Step 4: Hybrid Recommendation Function
def get_hybrid_recommendations(user_id, title, n=10):
    content_based_recommendations = get_faiss_recommendations(subset_data, title, n)
    if content_based_recommendations is None:
        print(f"No content-based recommendations found for '{title}'.")
        return None

    # Copy so that adding a column does not write into a slice of subset_data
    content_based_recommendations = content_based_recommendations.copy()

    # Score each candidate with the tuned SVD model. The model was trained on
    # ISBN_encoded item IDs (not titles), so look them up by row label;
    # this assumes subset_data has a unique index.
    candidate_isbns = subset_data.loc[content_based_recommendations.index, 'ISBN_encoded']
    content_based_recommendations['Collaborative_Score'] = [
        best_svd.predict(user_id, isbn).est for isbn in candidate_isbns
    ]

    return content_based_recommendations.sort_values('Collaborative_Score', ascending=False)

# Example usage with a specific user and book title for hybrid recommendations
user_id = X_test['User-ID'].iloc[1]
book_title = 'First to Fight'
hybrid_recommendations = get_hybrid_recommendations(user_id, book_title, n=10)

if hybrid_recommendations is not None:
    print(hybrid_recommendations)


##### Which hyperparameter optimization technique have you used and why?

Grid search was used for both components. For SVD, Surprise's `GridSearchCV` evaluates every combination of `n_factors`, `lr_all`, and `reg_all` with 5-fold cross-validation and selects the one with the lowest RMSE. For FAISS, a manual grid search loops over `nlist` and `nprobe` for the IVF index. Grid search was chosen because both search spaces are small enough to enumerate exhaustively. FAISS (Facebook AI Similarity Search) performs the efficient similarity search for content-based recommendations in the high-dimensional space produced by TF-IDF (Term Frequency-Inverse Document Frequency) vectors.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The recent adjustments in the model have led to a slight improvement in the RMSE (Root Mean Square Error) value, which indicates that the prediction errors have decreased, resulting in more accurate predictions of book ratings. Additionally, these refinements have significantly enhanced the accuracy of book recommendations, ensuring that the suggested books better align with users' preferences.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**1. Root Mean Square Error (RMSE):**

Why it was chosen:

RMSE is a widely used metric in recommendation systems because it measures the typical magnitude of the prediction errors. It is the square root of the average squared difference between predicted and actual ratings, so large errors are penalized more heavily than small ones. This gives a direct view of how well the model's predicted ratings align with the users' actual ratings.

**Business Impact:** A lower RMSE indicates that the model's predictions are closer to the actual ratings, reducing the risk of recommending books that users might not like. Accurate predictions lead to higher user satisfaction and trust in the recommendation system, which can enhance user retention and increase sales or engagement.
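To make the definition concrete, here is a minimal sketch that computes RMSE, with MAE alongside for comparison. The rating values are made up for illustration and are not results from this project.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute difference between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings: one prediction is off by 4, another by 1.
actual = [8, 5, 10, 7]
predicted = [7, 5, 6, 7]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

Because the errors are squared before averaging, the single large miss dominates RMSE far more than it does MAE.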

**2. Precision, Recall, and F1 Score:**

Why they were chosen:

Precision measures the proportion of relevant books among the recommended ones. High precision ensures that users are presented with books they are likely to enjoy.

Recall measures the proportion of relevant books that were recommended out of all relevant books available. High recall ensures that the system is capturing a broad range of user interests.

F1 Score balances precision and recall, providing a single metric that reflects both false positives and false negatives in recommendations.

**Business Impact:** High precision reduces the chances of users being dissatisfied with irrelevant recommendations, while high recall ensures that users are not missing out on books they might like. The F1 Score provides a balanced view, helping to ensure that the recommendation system is both effective and efficient, leading to better user experiences and potentially higher sales.
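One common way to compute these metrics for a rating predictor is to treat a book as relevant when its actual rating meets a threshold, and as recommended when its predicted rating meets the same threshold. The sketch below follows that convention; the threshold of 7 and the ratings are assumptions for illustration, and the notebook may define relevance differently.

```python
def precision_recall_f1(actual, predicted, threshold=7.0):
    """Precision, recall and F1 when 'relevant' and 'recommended' both mean rating >= threshold."""
    relevant = [a >= threshold for a in actual]
    recommended = [p >= threshold for p in predicted]
    true_positives = sum(r and c for r, c in zip(relevant, recommended))
    n_recommended = sum(recommended)
    n_relevant = sum(relevant)
    precision = true_positives / n_recommended if n_recommended else 0.0
    recall = true_positives / n_relevant if n_relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical actual and predicted ratings for six user-book pairs.
actual = [9, 8, 4, 7, 3, 10]
predicted = [8.5, 6.0, 7.5, 7.2, 2.0, 9.1]

precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)
```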

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The final model chosen for the book recommendation system is the Hybrid Model, which combines Collaborative Filtering (using SVD) and Content-Based Filtering (using FAISS). This hybrid approach was selected for several compelling reasons:

**1. Strengths of Both Models:**

**Collaborative Filtering (SVD):**

Strength: The SVD model excels at capturing latent factors in user-book interactions, allowing it to make personalized recommendations based on users' past behavior (e.g., ratings). It can suggest books that users might not have explicitly shown interest in but are likely to enjoy based on similar users' preferences.

Weakness: Collaborative filtering alone can suffer from the "cold start" problem, where it struggles to make recommendations for new users or new books that lack sufficient rating history.

**Content-Based Filtering (FAISS):**

Strength: FAISS-based content filtering leverages the actual content of the books (e.g., descriptions, authors, publishers) to find similar items. It can recommend books based on textual and other metadata, which is particularly useful when dealing with new books or when explicit user ratings are sparse.

Weakness: Content-based filtering alone might recommend books that are too similar to what the user has already read, leading to a narrow range of suggestions.

**2. Balanced and Comprehensive Recommendations:**

The hybrid model effectively mitigates the weaknesses of each individual approach by combining their strengths. By integrating collaborative filtering, the system leverages user interaction data to offer diverse recommendations. Simultaneously, content-based filtering ensures that recommendations are relevant to the specific content attributes of the books.

This approach allows the system to recommend books that are both personalized (through collaborative filtering) and relevant to the user's interests or search queries (through content-based filtering).

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model used for the final book recommendation system is a Hybrid Model that combines Collaborative Filtering (using SVD) and Content-Based Filtering (using FAISS). Here's an explanation of how these models work, along with how feature importance can be interpreted for each component.

**1. Collaborative Filtering (SVD)**

How It Works:

Singular Value Decomposition (SVD) is a matrix factorization technique used in collaborative filtering. In this context, it decomposes the user-item interaction matrix (e.g., user-book ratings) into latent factors that represent underlying patterns in user preferences and book attributes.

Latent Factors: These are abstract dimensions that capture user preferences and item characteristics. For instance, one latent factor might capture a preference for a specific genre, while another might reflect a preference for books by a certain author.

Prediction: The model predicts a user’s rating for a book by estimating how closely the book's latent factors align with the user's latent preferences.

Feature Importance in SVD:

In SVD, the concept of "feature importance" is not as direct as in tree-based models like Random Forests. However, you can interpret the importance of features through:

Latent Factors: These represent the most significant patterns in user preferences and item characteristics.

Bias Terms: The model can also include bias terms for users and items, capturing individual tendencies (e.g., a user generally gives higher ratings).
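The way latent factors and bias terms combine into a prediction can be sketched as follows, assuming the biased matrix-factorization form commonly used for SVD-style recommenders: predicted rating = global mean + user bias + item bias + dot product of user and item factors. All numbers and names below are hypothetical; a trained model would learn them from the rating data.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorization prediction: mu + b_u + b_i + q_i . p_u."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Hypothetical learned parameters for one user and one book.
mu = 7.0            # average rating across the dataset
b_u = 0.5           # this user tends to rate higher than average
b_i = -0.25         # this book tends to be rated slightly below average
p_u = [1.0, -0.5]   # user's position on two latent factors
q_i = [0.5, 0.5]    # book's position on the same two latent factors

prediction = predict_rating(mu, b_u, b_i, p_u, q_i)
print(prediction)
```

Reading the terms separately gives a simple form of explanation: the biases show how much of the prediction comes from general tendencies, and the dot product shows how much comes from the match between this user and this book.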

**2. Content-Based Filtering (FAISS)**

How It Works:

TF-IDF Vectorization: The content-based filtering uses TF-IDF (Term Frequency-Inverse Document Frequency) to convert the textual content of books (e.g., titles, authors, descriptions) into numerical vectors. These vectors represent the importance of words in the context of the entire dataset.

FAISS (Facebook AI Similarity Search): FAISS is used to quickly search for and retrieve similar items based on these vectors. The content-based model recommends books that are similar to the ones the user has shown interest in, based on textual similarity.

Feature Importance in FAISS:

TF-IDF Weights: The importance of features in this model can be interpreted through the TF-IDF weights. Higher TF-IDF weights indicate that a term is particularly important in distinguishing one book from others.

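Cosine Similarity: Similarity between books is measured by cosine similarity between their TF-IDF vectors. FAISS indexes compare vectors by L2 distance or inner product; on L2-normalized vectors the inner product equals cosine similarity, and L2 distance yields the same ranking. As a result, the terms that most distinguish a book’s content (those with higher TF-IDF weights) are crucial in determining similarity.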
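The idea behind the content-based search can be sketched in plain Python: build TF-IDF vectors, L2-normalize them, and rank other books by inner product, which then equals cosine similarity. This is a brute-force stand-in for what an exact FAISS index computes, not the FAISS API itself; the book texts and the smoothed IDF formula are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical book metadata, invented for illustration.
docs = {
    "Book A": "war history marines battle",
    "Book B": "war history navy battle",
    "Book C": "cooking pasta recipes",
}

def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors keyed by title."""
    tokenized = {title: text.split() for title, text in docs.items()}
    vocab = sorted({w for words in tokenized.values() for w in words})
    n_docs = len(docs)
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    # Smoothed IDF: rarer terms get larger weights.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = {}
    for title, words in tokenized.items():
        tf = Counter(words)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors[title] = [x / norm for x in vec]
    return vectors

def similarity(vec_a, vec_b):
    """Inner product; equals cosine similarity because the vectors are unit length."""
    return sum(a * b for a, b in zip(vec_a, vec_b))

def most_similar(query_title, vectors):
    """Brute-force nearest neighbour by inner product, excluding the query itself."""
    query = vectors[query_title]
    scores = {t: similarity(query, v) for t, v in vectors.items() if t != query_title}
    return max(scores.items(), key=lambda kv: kv[1])

vectors = tfidf_vectors(docs)
best_title, best_score = most_similar("Book A", vectors)
print(best_title, round(best_score, 3))
```

The terms shared by the two war books drive their similarity, while the cookbook shares no terms with them and scores zero.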

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
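
The two steps above could look roughly like the following sketch, which saves an object with pickle, loads it back, and checks that the restored copy matches. The dictionary is a stand-in for the trained model, and the file name is hypothetical; in the notebook the saved object would be the tuned SVD model.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (hypothetical parameters).
model_bundle = {"global_mean": 7.0, "item_bias": {"example-book": 0.4}}

# Hypothetical file name, written to the system temp directory.
model_path = os.path.join(tempfile.gettempdir(), "book_recommender.pkl")

# Save the file.
with open(model_path, "wb") as f:
    pickle.dump(model_bundle, f)

# Load the file and run a sanity check on the restored object.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

sanity_prediction = restored["global_mean"] + restored["item_bias"]["example-book"]
print(sanity_prediction)
```

A FAISS index is usually persisted with `faiss.write_index` and `faiss.read_index` rather than pickle, so the content-based part would likely need its own save step.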

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**


The project successfully developed a hybrid book recommendation system that combines collaborative filtering and content-based filtering. Through careful tuning and evaluation, the model achieved a balance between accuracy and relevance, making it a valuable tool for enhancing user experience and driving business success. Interpreting the SVD latent factors and bias terms alongside the TF-IDF term weights adds transparency, helping stakeholders understand and trust the recommendation process.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***