In [4]:
!python -m pip install --upgrade pip



Collecting pip
  Obtaining dependency information for pip from https://files.pythonhosted.org/packages/ef/7d/500c9ad20238fcfcb4cb9243eede163594d7020ce87bd9610c9e02771876/pip-24.3.1-py3-none-any.whl.metadata
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.2.1
    Uninstalling pip-23.2.1:
      Successfully uninstalled pip-23.2.1
Successfully installed pip-24.3.1


In [8]:
import pandas as pd
from surprise import Dataset as SurpriseDataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

In [9]:
# Step 1: Load a Subset of the Dataset
# Load only the first 10,000 rows for proof of concept
data = pd.read_csv(r'C:\Users\shrey\Downloads\Sabudh Project\shreya.csv', low_memory=False, nrows=10000)

In [10]:

# Step 2: Data Preprocessing

# Remove Unnamed Columns
data = data.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')

In [11]:
# Convert Columns to Appropriate Data Types
data['book_id'] = data['book_id'].fillna(0).astype(int)
data['publication_year'] = data['publication_year'].fillna(0).astype(int)
data['publication_month'] = data['publication_month'].fillna(0).astype(int)
data['average_rating'] = data['average_rating'].astype(float)
data['ratings_count'] = data['ratings_count'].fillna(0).astype(int)
data['language_code'] = data['language_code'].fillna('unknown').astype(str)
data['country_code'] = data['country_code'].fillna('unknown').astype(str)
data['num_pages'] = data['num_pages'].fillna(data['num_pages'].median())
data['publisher'] = data['publisher'].fillna('unknown')
data['text_reviews_count'] = data['text_reviews_count'].fillna(0).astype(int)


In [24]:
# Feature Engineering: flag books published within one year of the reference year (2023)
data['is_new_release'] = ((2023 - data['publication_year']) <= 1).astype(int)


Feature Engineering:

Added a new column, is_new_release, that is 1 for books with a publication_year of 2022 or later (within one year of the hard-coded reference year 2023) and 0 otherwise.
The features considered most relevant are text_reviews_count, average_rating, num_pages, publication_year, and language_code.

In [12]:
# Remove outliers from ratings
data = data[(data['average_rating'] >= 1) & (data['average_rating'] <= 5)]

# Remove duplicates
data = data.drop_duplicates()

In [13]:
# Check the final structure of the data
print(data.info())
print(data.describe())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 29 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   isbn                  10000 non-null  object 
 1   text_reviews_count    10000 non-null  int32  
 2   series                10000 non-null  object 
 3   country_code          10000 non-null  object 
 4   language_code         10000 non-null  object 
 5   popular_shelves       10000 non-null  object 
 6   asin                  0 non-null      float64
 7   is_ebook              10000 non-null  bool   
 8   average_rating        10000 non-null  float64
 9   kindle_asin           3550 non-null   object 
 10  similar_books         10000 non-null  object 
 11  description           10000 non-null  object 
 12  format                9127 non-null   object 
 13  link                  10000 non-null  object 
 14  authors               10000 non-null  object 
 15  publisher           

In [15]:
print(data.columns)


Index(['isbn', 'text_reviews_count', 'series', 'country_code', 'language_code',
       'popular_shelves', 'asin', 'is_ebook', 'average_rating', 'kindle_asin',
       'similar_books', 'description', 'format', 'link', 'authors',
       'publisher', 'num_pages', 'publication_day', 'isbn13',
       'publication_month', 'edition_information', 'publication_year', 'url',
       'image_url', 'book_id', 'ratings_count', 'work_id', 'title',
       'title_without_series'],
      dtype='object')


In [17]:
# Create a synthetic user_id column
import numpy as np
num_users = 100  # Number of synthetic users
data['user_id'] = np.random.randint(1, num_users + 1, size=len(data))
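
Because the user IDs are drawn at random, every run of the cell above produces a different assignment, and so a different split, RMSE, and recommendation list. A minimal sketch of a reproducible variant follows; the helper name `assign_synthetic_users` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def assign_synthetic_users(n_rows, num_users=100, seed=42):
    """Return n_rows synthetic user IDs drawn uniformly from 1..num_users.

    Hypothetical helper: the seed makes the assignment repeatable across runs.
    """
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound by default, hence num_users + 1
    return rng.integers(1, num_users + 1, size=n_rows)

user_ids = assign_synthetic_users(10, num_users=100, seed=42)
```

In the notebook this could be used as `data['user_id'] = assign_synthetic_users(len(data), num_users)`. The IDs remain random, so they still carry no real preference information.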

In [25]:
# Step 3: Build the Recommender System

# Prepare the data for Surprise
reader = Reader(rating_scale=(1, 5))

# Load the data into the Surprise format
dataset = SurpriseDataset.load_from_df(data[['user_id', 'book_id', 'average_rating']], reader)

# Split the dataset
trainset, testset = train_test_split(dataset, test_size=0.2)


In [26]:
# Create and train the SVD model
model = SVD()
model.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x182a0a9f280>

In [27]:
# Evaluate the model
predictions = model.test(testset)
# accuracy.rmse prints the score itself (verbose=True by default) and returns it
rmse = accuracy.rmse(predictions)


RMSE: 0.3619
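
For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings. The sketch below computes it by hand from plain (true, estimated) pairs; the function name `rmse_from_pairs` and the sample values are illustrative assumptions.

```python
import math

def rmse_from_pairs(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("at least one prediction is required")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two made-up predictions, each off by 0.5
example = rmse_from_pairs([(4.0, 3.5), (3.0, 3.5)])
```

With Surprise predictions, the pairs could be built as `[(p.r_ui, p.est) for p in predictions]`.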


In [21]:
# Step 4: Generate Recommendations

def get_recommendations(model, user_id, n_recommendations=5):
    all_book_ids = data['book_id'].unique()
    # Use a set of values: `in` on a pandas Series tests the index labels, not the values
    rated_books = set(data.loc[data['user_id'] == user_id, 'book_id'])

    to_predict = [book for book in all_book_ids if book not in rated_books]

    predictions = [model.predict(user_id, book) for book in to_predict]
    predictions.sort(key=lambda x: x.est, reverse=True)

    recommended_books = predictions[:n_recommendations]
    return [(pred.iid, pred.est) for pred in recommended_books]


Model Training: The code uses the SVD algorithm from the Surprise library to train the recommender system.

Recommendations: The function get_recommendations generates book recommendations based on the predicted ratings for books that the user has not yet rated.
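
As a rough illustration of what the SVD model computes for one (user, book) pair: Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item latent factor vectors, then clips the result to the rating scale. The sketch below reproduces that rule in plain Python; the function name `svd_estimate` and all numbers are made up for illustration and are not taken from the trained model.

```python
def svd_estimate(global_mean, user_bias, item_bias,
                 user_factors, item_factors, rating_scale=(1, 5)):
    """Estimate a rating as mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    return min(high, max(low, raw))

# Illustrative values only
est = svd_estimate(3.9, 0.1, 0.2, [0.5, -0.2], [0.4, 0.1])
clipped = svd_estimate(4.8, 0.3, 0.3, [0.0, 0.0], [0.0, 0.0])
```

The second call shows the clipping step: the raw sum exceeds the top of the 1 to 5 scale, so the estimate is capped at 5.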

In [28]:
# Example usage for a synthetic user_id
user_id = np.random.randint(1, num_users + 1)  # Randomly choose a user ID
recommendations = get_recommendations(model, user_id)
print(f"Recommendations for synthetic user {user_id}: {recommendations}")

Recommendations for synthetic user 33: [(3116884, 4.243889914616535), (14070444, 4.231791759124515), (2775591, 4.221983167485723), (16718170, 4.209368087000669), (18984670, 4.191291333821746)]


The output is a list of recommended books for the synthetic user (here, user ID 33). Each tuple holds two values:

- Book ID: The first element is the ID of the recommended book. It matches a book_id in the dataset.

- Estimated Rating: The second element is the rating the model predicts this user would give the book. SVD learns it from the (user_id, book_id, rating) triples alone and does not use book attributes. Because the user IDs were assigned at random and the rating is each book's average_rating, the estimates cannot reflect genuine user preferences; this is a proof of concept for the pipeline only.

Interpretation
- Book IDs: Look these IDs up in the original dataset to find more details about each book (e.g., title and authors).
- Estimated Ratings: Higher values mean the model predicts the user will enjoy the book more. For example, a predicted rating of 4.209 suggests the model expects the user to rate that book quite positively.

In [29]:
recommended_ids = [rec[0] for rec in recommendations]
recommended_books = data[data['book_id'].isin(recommended_ids)]
print(recommended_books[['book_id', 'title', 'authors', 'average_rating']])


       book_id                                              title  \
1936  16718170         The Third Wheel (Diary of a Wimpy Kid, #7)   
3196  14070444  Viaje al Bosque: un maletín lleno de Historias...   
5790  18984670                      How to Steal a Dragon's Sword   
8168   3116884                 Curious George Learns the Alphabet   
9518   2775591                            The Teddy Bears' Picnic   

                                                authors  average_rating  
1936              [{'author_id': '221559', 'role': ''}]            4.20  
3196  [{'author_id': '288388', 'role': ''}, {'author...            4.83  
5790  [{'author_id': '23894', 'role': ''}, {'author_...            4.43  
8168              [{'author_id': '967839', 'role': ''}]            4.23  
9518  [{'author_id': '60143', 'role': ''}, {'author_...            4.22  


The output is a DataFrame with details about the recommended books. Here’s a breakdown of what each column represents:

- The leftmost number is the row's index label in the DataFrame, i.e. where that book sits in the loaded data.

- book_id: The unique identifier of each book in the dataset. It matches the IDs returned in the recommendation output.

- title: The title of the book.

- authors: Information about the book's authors. In the output it is a list of dictionaries, each holding an author ID and a role field (such as primary author or editor). The column holds IDs rather than names, so displaying names would require a separate author lookup.

- average_rating: The book's average rating as recorded in the dataset. It indicates how well-received the book is in general.
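
To make the authors column easier to read, the author IDs can be pulled out of each cell. The sketch below assumes that, after the CSV round-trip, each cell is the text form of a list of dictionaries like the ones printed above; the helper name `extract_author_ids` is hypothetical.

```python
import ast

def extract_author_ids(authors_cell):
    """Return the author_id values from one cell of the authors column."""
    # A CSV round-trip stores the list as text, so parse it back safely
    if isinstance(authors_cell, str):
        authors_cell = ast.literal_eval(authors_cell)
    return [entry["author_id"] for entry in authors_cell]

sample = "[{'author_id': '221559', 'role': ''}]"
print(extract_author_ids(sample))  # prints ['221559']
```

Applied to the whole column, this would look like `recommended_books['authors'].apply(extract_author_ids)`.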

Further Feature Engineering: Features can be refined further based on insights from EDA or additional data sources.
Experiment with Models: Try other algorithms available in the Surprise library (such as KNN and NMF) to see which performs best.