In [36]:
import pandas as pd

In [37]:
df = pd.read_csv("../data/cleaned_book_ratings.csv")
print("Shape:", df.shape)
df.head()

Shape: (98605, 11)


Unnamed: 0,user_id,isbn,book_rating,location,user_age,title,author,year,publisher,img_url,num_of_rating
0,276747,60517794,9,"iowa city, iowa, usa",25.0,Little Altars Everywhere,Rebecca Wells,2003.0,HarperTorch,http://images.amazon.com/images/P/0060517794.0...,20
1,276747,671537458,9,"iowa city, iowa, usa",25.0,Waiting to Exhale,Terry McMillan,1995.0,Pocket,http://images.amazon.com/images/P/0671537458.0...,20
2,276747,679776818,8,"iowa city, iowa, usa",25.0,Birdsong: A Novel of Love and War,Sebastian Faulks,1997.0,Vintage Books USA,http://images.amazon.com/images/P/0679776818.0...,10
3,276822,60096195,10,"calgary, alberta, canada",11.0,The Boy Next Door,Meggin Cabot,2002.0,Avon Trade,http://images.amazon.com/images/P/0060096195.0...,34
4,276822,141310340,9,"calgary, alberta, canada",11.0,Skin and Other Stories (Now in Speak!),Roald Dahl,2002.0,Puffin Books,http://images.amazon.com/images/P/0141310340.0...,4


# Feature Engineering

## User-level Features

In [38]:
# --- Favorite Author ---
# Compute average rating per user-author pair
user_author_mean = df.groupby(["user_id", "author"])["book_rating"].mean().reset_index()

# For each user, find the author with the max average rating
fav_author = user_author_mean.loc[user_author_mean.groupby("user_id")["book_rating"].idxmax()]

# Keep only user_id and author
fav_author = fav_author[["user_id", "author"]].rename(columns={"author": "fav_author"}).reset_index(drop=True)
fav_author

Unnamed: 0,user_id,fav_author
0,92,J. R. R. Tolkien
1,99,Robert T. Kiyosaki
2,114,Dan Brown
3,165,Rebecca Wells
4,254,Amy Tan
...,...,...
8198,278586,ELIZABETH GEORGE
8199,278694,Jeffery Deaver
8200,278832,Jack Canfield
8201,278843,Amy Tan
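
The `groupby` + `idxmax` pattern used above can be checked on a small example. The sketch below uses a hypothetical toy frame (not rows from the real dataset) to show how ties are resolved: `idxmax` returns the first row holding the maximum, and because `groupby` sorts its keys, that is the alphabetically first author among the tied ones.

```python
import pandas as pd

# Toy ratings (hypothetical data, for illustration only)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "author": ["A", "A", "B", "C", "D"],
    "book_rating": [8, 10, 9, 7, 7],
})

# Mean rating per (user, author) pair; groupby sorts the keys by default
pair_mean = toy.groupby(["user_id", "author"])["book_rating"].mean().reset_index()

# For each user, idxmax gives the label of the first row with the maximum mean
toy_fav = pair_mean.loc[pair_mean.groupby("user_id")["book_rating"].idxmax()]
toy_fav = toy_fav[["user_id", "author"]].reset_index(drop=True)
print(toy_fav)
```

User 1 rates authors A and B with the same mean (9.0) and user 2 does the same for C and D (7.0), so the result keeps A and C. If a different tie-break is wanted (for example, the author with more rated books), it would have to be added explicitly.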


In [39]:
# --- Favorite Publisher ---

# Count number of books rated by each user-publisher pair
user_publisher_count = df.groupby(["user_id", "publisher"])["book_rating"].count().reset_index(name="count")

# For each user, get the publisher with the max count
fav_publisher = user_publisher_count.loc[user_publisher_count.groupby("user_id")["count"].idxmax()]

# Keep only user_id and publisher
fav_publisher = fav_publisher[["user_id", "publisher"]].rename(columns={"publisher": "fav_publisher"}).reset_index(drop=True)
fav_publisher

Unnamed: 0,user_id,fav_publisher
0,92,Minotauro
1,99,St. Martin's Press
2,114,Warner Books
3,165,HarperTorch
4,254,Harlequin
...,...,...
8198,278586,Bantam
8199,278694,Pocket
8200,278832,Health Communications
8201,278843,Penguin Books


## Book-level Features

In [40]:
# --- average rating ---
avg_book_rating=df.groupby('isbn')['book_rating'].mean().reset_index(name='avg_book_rate')
avg_book_rating

Unnamed: 0,isbn,avg_book_rate
0,0002251760,8.800000
1,0002550563,6.500000
2,0003300277,8.600000
3,000648302X,6.500000
4,0006485200,8.600000
...,...,...
10130,8495618605,7.555556
10131,950491036X,9.500000
10132,9505156642,9.000000
10133,9505156944,9.500000


In [41]:
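# num_of_rating is repeated on every row of a book, so keep one row per (isbn, num_of_rating)
num_rating=df[['isbn','num_of_rating']].drop_duplicates()
# Attach the rating count to the per-book average
book_info=avg_book_rating.merge(num_rating,on='isbn',how='inner')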
book_info

Unnamed: 0,isbn,avg_book_rate,num_of_rating
0,0002251760,8.800000,5
1,0002550563,6.500000,4
2,0003300277,8.600000,5
3,000648302X,6.500000,9
4,0006485200,8.600000,5
...,...,...,...
10130,8495618605,7.555556,9
10131,950491036X,9.500000,8
10132,9505156642,9.000000,4
10133,9505156944,9.500000,4



### Popularity / Weighted Score (IMDb-style)

**Formula:**


$$
\text{score} = \frac{v}{v+m}R + \frac{m}{v+m}C
$$

Where:
- **R** = average rating for the book  
- **v** = number of ratings for the book  
- **m** = minimum ratings threshold   
- **C** = global mean rating, computed over every individual rating in the dataset  

**How it works:**
- If a book has **few ratings**, the score leans closer to **C** (the overall average).  
- If a book has **many ratings**, the score leans closer to its own average **R**.  

👉 This prevents books with just a few perfect ratings from unfairly dominating the ranking.
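
As a quick worked example of the formula, the sketch below compares a book with a few perfect ratings against a book with many slightly lower ratings. The function name `weighted_score` and the values of `C`, `R`, and `v` are hypothetical and chosen only for illustration.

```python
def weighted_score(R, v, m, C):
    """Shrink a book's mean rating R toward the global mean C.

    The weight on R is v / (v + m), so it grows with the number of ratings v.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


C = 8.0   # hypothetical global mean rating
m = 13    # minimum ratings threshold

few = weighted_score(R=10.0, v=4, m=m, C=C)     # 4 perfect ratings
many = weighted_score(R=9.0, v=300, m=m, C=C)   # 300 ratings averaging 9.0

print(round(few, 3), round(many, 3))  # 8.471 8.958
```

Although the first book has the higher raw average, its score is pulled most of the way toward `C`, so the heavily rated book ranks above it.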


In [42]:
print(book_info['num_of_rating'].describe())
m = 13     # minimum ratings threshold
# - m = 13 is the 75th percentile of num_of_rating: 75% of books have <= 13 ratings
# - Books with fewer ratings than m get a score weighted mostly toward the global mean C,
#   so a handful of high ratings cannot dominate the ranking

C = df['book_rating'].mean()  # global mean over all individual ratings

count    10135.000000
mean        13.383029
std         19.664095
min          4.000000
25%          4.000000
50%          7.000000
75%         13.000000
max        336.000000
Name: num_of_rating, dtype: float64
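
The threshold is hard-coded from the `describe()` output. One possible alternative, sketched below, is to compute it with `Series.quantile` so it follows the data if the dataset changes. The `counts` series is a hypothetical stand-in for `book_info['num_of_rating']`.

```python
import pandas as pd

# Hypothetical rating counts standing in for book_info['num_of_rating']
counts = pd.Series([4, 4, 5, 7, 9, 13, 20, 336])

# 75th percentile (pandas uses linear interpolation by default)
m_from_data = counts.quantile(0.75)
print(m_from_data)  # 14.75
```

On the real data, the `75%` row of the `describe()` output above is this same quantile, so the computed value would be 13.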


In [43]:
v = book_info['num_of_rating']   # number of ratings per book
R = book_info['avg_book_rate']   # average rating per book
# score = v/(v+m) * R + m/(v+m) * C
book_info['weighted_score'] = v * R / (v + m) + m * C / (v + m)
book_info

Unnamed: 0,isbn,avg_book_rate,num_of_rating,weighted_score
0,0002251760,8.800000,5,8.140467
1,0002550563,6.500000,4,7.560495
2,0003300277,8.600000,5,8.084912
3,000648302X,6.500000,9,7.319473
4,0006485200,8.600000,5,8.084912
...,...,...,...,...
10130,8495618605,7.555556,9,7.751291
10131,950491036X,9.500000,8,8.501353
10132,9505156642,9.000000,4,8.148730
10133,9505156944,9.500000,4,8.266377


# New Dataset

In [44]:
df_plus=df.copy()  # copy so that later changes to df_plus cannot affect df

In [45]:
df_plus=df_plus.merge(fav_author,on='user_id',how='left').merge(fav_publisher,on='user_id',how='left')
df_plus

Unnamed: 0,user_id,isbn,book_rating,location,user_age,title,author,year,publisher,img_url,num_of_rating,fav_author,fav_publisher
0,276747,0060517794,9,"iowa city, iowa, usa",25.0,Little Altars Everywhere,Rebecca Wells,2003.0,HarperTorch,http://images.amazon.com/images/P/0060517794.0...,20,Rebecca Wells,HarperTorch
1,276747,0671537458,9,"iowa city, iowa, usa",25.0,Waiting to Exhale,Terry McMillan,1995.0,Pocket,http://images.amazon.com/images/P/0671537458.0...,20,Rebecca Wells,HarperTorch
2,276747,0679776818,8,"iowa city, iowa, usa",25.0,Birdsong: A Novel of Love and War,Sebastian Faulks,1997.0,Vintage Books USA,http://images.amazon.com/images/P/0679776818.0...,10,Rebecca Wells,HarperTorch
3,276822,0060096195,10,"calgary, alberta, canada",11.0,The Boy Next Door,Meggin Cabot,2002.0,Avon Trade,http://images.amazon.com/images/P/0060096195.0...,34,Eoin Colfer,Avon Trade
4,276822,0141310340,9,"calgary, alberta, canada",11.0,Skin and Other Stories (Now in Speak!),Roald Dahl,2002.0,Puffin Books,http://images.amazon.com/images/P/0141310340.0...,4,Eoin Colfer,Avon Trade
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98600,276681,0060809833,10,"chicago, illinois, usa",43.0,Brave New World,Aldous Huxley,1989.0,Harpercollins,http://images.amazon.com/images/P/0060809833.0...,82,Aldous Huxley,Perennial
98601,276681,0060930535,9,"chicago, illinois, usa",43.0,The Poisonwood Bible: A Novel,Barbara Kingsolver,1999.0,Perennial,http://images.amazon.com/images/P/0060930535.0...,91,Aldous Huxley,Perennial
98602,276681,0060938455,9,"chicago, illinois, usa",43.0,Fast Food Nation: The Dark Side of the All-Ame...,Eric Schlosser,2002.0,Perennial,http://images.amazon.com/images/P/0060938455.0...,81,Aldous Huxley,Perennial
98603,276681,0399144463,8,"chicago, illinois, usa",43.0,Who Moved My Cheese? An Amazing Way to Deal wi...,Spencer Johnson,1998.0,Putnam Pub Group (Paper),http://images.amazon.com/images/P/0399144463.0...,44,Aldous Huxley,Perennial


In [46]:
del book_info['num_of_rating'] # df_plus already contains the num_of_rating column, so there is no need to merge it again

In [47]:
df_plus=df_plus.merge(book_info,on='isbn',how='left')
df_plus

Unnamed: 0,user_id,isbn,book_rating,location,user_age,title,author,year,publisher,img_url,num_of_rating,fav_author,fav_publisher,avg_book_rate,weighted_score
0,276747,0060517794,9,"iowa city, iowa, usa",25.0,Little Altars Everywhere,Rebecca Wells,2003.0,HarperTorch,http://images.amazon.com/images/P/0060517794.0...,20,Rebecca Wells,HarperTorch,7.850000,7.864497
1,276747,0671537458,9,"iowa city, iowa, usa",25.0,Waiting to Exhale,Terry McMillan,1995.0,Pocket,http://images.amazon.com/images/P/0671537458.0...,20,Rebecca Wells,HarperTorch,7.230769,7.489206
2,276747,0679776818,8,"iowa city, iowa, usa",25.0,Birdsong: A Novel of Love and War,Sebastian Faulks,1997.0,Vintage Books USA,http://images.amazon.com/images/P/0679776818.0...,10,Rebecca Wells,HarperTorch,7.000000,7.501235
3,276822,0060096195,10,"calgary, alberta, canada",11.0,The Boy Next Door,Meggin Cabot,2002.0,Avon Trade,http://images.amazon.com/images/P/0060096195.0...,34,Eoin Colfer,Avon Trade,8.275862,8.168249
4,276822,0141310340,9,"calgary, alberta, canada",11.0,Skin and Other Stories (Now in Speak!),Roald Dahl,2002.0,Puffin Books,http://images.amazon.com/images/P/0141310340.0...,4,Eoin Colfer,Avon Trade,7.500000,7.795789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98600,276681,0060809833,10,"chicago, illinois, usa",43.0,Brave New World,Aldous Huxley,1989.0,Harpercollins,http://images.amazon.com/images/P/0060809833.0...,82,Aldous Huxley,Perennial,8.647059,8.543024
98601,276681,0060930535,9,"chicago, illinois, usa",43.0,The Poisonwood Bible: A Novel,Barbara Kingsolver,1999.0,Perennial,http://images.amazon.com/images/P/0060930535.0...,91,Aldous Huxley,Perennial,8.274725,8.226235
98602,276681,0060938455,9,"chicago, illinois, usa",43.0,Fast Food Nation: The Dark Side of the All-Ame...,Eric Schlosser,2002.0,Perennial,http://images.amazon.com/images/P/0060938455.0...,81,Aldous Huxley,Perennial,8.296296,8.239664
98603,276681,0399144463,8,"chicago, illinois, usa",43.0,Who Moved My Cheese? An Amazing Way to Deal wi...,Spencer Johnson,1998.0,Putnam Pub Group (Paper),http://images.amazon.com/images/P/0399144463.0...,44,Aldous Huxley,Perennial,7.318182,7.447867
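
A left merge only preserves the row count if the right-hand table has one row per key. A minimal sketch of guarding against accidental row duplication with the `validate` argument of `DataFrame.merge` follows; the frames `ratings` and `books` are hypothetical toy data.

```python
import pandas as pd

# Toy stand-ins for df_plus (many rows per isbn) and book_info (one row per isbn)
ratings = pd.DataFrame({"isbn": ["a", "a", "b"], "book_rating": [8, 9, 7]})
books = pd.DataFrame({"isbn": ["a", "b"], "weighted_score": [8.1, 7.5]})

# validate="many_to_one" raises MergeError if isbn is not unique in `books`
merged = ratings.merge(books, on="isbn", how="left", validate="many_to_one")
assert len(merged) == len(ratings)  # no rows were duplicated

# With a duplicated key on the right side, the same call fails loudly
dup_books = pd.concat([books, books.iloc[[0]]], ignore_index=True)
try:
    ratings.merge(dup_books, on="isbn", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    print("duplicate keys detected:", exc)
```

The same check could be applied to the merges in this notebook, assuming `user_id` and `isbn` are unique in the respective feature tables.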


In [48]:
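# index=False keeps the row index out of the file, so reloading does not add an "Unnamed: 0" column
df_plus.to_csv('../data/cleaned_book_ratings_plus.csv', index=False)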
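
The effect of the `index` argument of `to_csv` on a round trip can be seen in a small sketch. It writes a hypothetical toy frame to in-memory buffers instead of the real output file.

```python
import io
import pandas as pd

# Hypothetical toy frame, written to memory rather than to disk
toy = pd.DataFrame({"isbn": ["a", "b"], "weighted_score": [8.1, 7.5]})

# Default: the row index is written as an unnamed first column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with = pd.read_csv(buf_with_index).columns.tolist()

# index=False: only the data columns are written
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'isbn', 'weighted_score']
print(cols_without)  # ['isbn', 'weighted_score']
```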

### 🛠 Feature Engineering: New Columns Added

After feature engineering, the following new features have been added to the dataset:

1. **fav_author**  
   - The author that the user most prefers.  
   - Computed as the author with the highest average rating from that user; if several authors tie, the first one in sorted order is kept.

2. **fav_publisher**  
   - The publisher that the user most prefers.  
   - Calculated as the publisher from which the user has rated the most books (highest count).

3. **avg_book_rate**  
   - The average rating for each book across all users.  
   - Measures how well the book is rated on average; it does not account for how many ratings the book received.

4. **weighted_score**  
   - An IMDb-style weighted rating for each book.  
   - Balances a book’s average rating with the global mean to prevent books with very few ratings from dominating the ranking.
