## Part 1: Book Recommendation

In [33]:
import pandas as pd

df = pd.read_csv('../data/books_data.csv')

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df["Clean Title"] = df['Title'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
df['combined_features'] = df['Clean Title'].fillna('') + ' ' + df['authors'].fillna('') + ' ' + df['description'].fillna('')
df['combined_features'] = df['combined_features'].str.lower().str.replace(r'[^\w\s]', '', regex=True)

# Remove any non-unique rows
df = df.drop_duplicates(subset=['Clean Title'])

# Define variables
ID_COLUMN = 'Title'
COMPARISON_COLUMN = 'combined_features'
SPECIFIC_ID = 'Its Only Art If Its Well Hung!'

# Vectorize the combined text features
v = TfidfVectorizer(stop_words='english')
X = v.fit_transform(df[COMPARISON_COLUMN])

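# Map IDs to indices
# NOTE: df.index holds row labels, while X is indexed by position; after
# drop_duplicates these differ unless the index is reset.
Id2idx = pd.Series(df.index, index=df[ID_COLUMN])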

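def get_most_similar(item_id):
    """Return the five most similar items and their cosine-similarity scores."""
    idx = Id2idx[item_id]
    scores = cosine_similarity(X, X[idx]).flatten()
    # Skip position 0 of the ranking, which is normally the item itself
    recommended = (-scores).argsort()[1:6]
    return df[ID_COLUMN].iloc[recommended], scores[recommended]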

# Get similar items
similar = get_most_similar(SPECIFIC_ID)

# Create DataFrame with results
df_similar = pd.DataFrame({
    'ID': similar[0],
    'Score': similar[1],
    'Comment': similar[0].apply(lambda x: df[df[ID_COLUMN] == x][COMPARISON_COLUMN].values[0])
})

print('Original:')
print(df[df[ID_COLUMN] == SPECIFIC_ID][COMPARISON_COLUMN].values[0])

print(df_similar[["ID", "Score"]].head(5))

Original:
its only art if its well hung julie strain 
                                         ID     Score
209127                   Hung by the tongue  0.365535
203498                  Julie Of The Wolves  0.321138
65990                            The Return  0.290065
40737                The RETURN: THE RETURN  0.288097
167024  Gardening Without Stress and Strain  0.282020


### Evaluate the model

For now, let's keep model evaluation simple by sampling a few books and manually reviewing their top recommendations. Later, we'll take a more complex approach.

In [57]:
# Manual review

# Sample five books to assess recommendations
sample_books = df.sample(5, random_state=20)

top_recs = []
for _, row in sample_books.iterrows():
    book_recs = get_most_similar(row["Title"])
    top_recs.append(book_recs[0])

# Print the sampled titles with top recommendations for that title
for i, rec in enumerate(top_recs):
    print(f"Book: {sample_books.iloc[i]['Title']}")
    print(f"Top rec: {rec.iloc[0]}")

Book: The Sierra Club: Mountain Light Postcard Collection: A Portfolio
Top rec: The Encyclopedia of Ancient Civilizations
Book: Starting and Succeeding in Real Estate
Top rec: The Official XMLSPY Handbook
Book: The Eye of the Abyss (Franz Schmidt, 1)
Top rec: The Art of Translating Prose
Book: The Kabbalah Pillars: A Romance of The Ages
Top rec: How To Make The Devil Obey You!!!
Book: Iridescent Soul
Top rec: Wallace Stevens: A Poet's Growth


Our recommendations seem a bit all over the place right now. In Part 3, we'll do some work to improve their quality.
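One thing worth checking before attributing the scattered results to TF-IDF alone: `Id2idx` stores `df.index` labels, while the rows of `X` are numbered by position, and after `drop_duplicates` the two no longer coincide for most rows. The sketch below shows the difference on toy data (the titles are hypothetical). Whether this explains the results above would need a rerun.

```python
import pandas as pd

# Toy stand-in for the book table (hypothetical titles, one duplicate).
toy = pd.DataFrame({"Title": ["A", "A", "B", "C"]})
toy = toy.drop_duplicates(subset=["Title"])

# Index labels keep their original values after rows are dropped...
labels = pd.Series(toy.index, index=toy["Title"])
# ...while rows of a matrix fitted on `toy` are numbered 0..len(toy)-1.
positions = pd.Series(range(len(toy)), index=toy["Title"])

print(labels.to_dict())     # {'A': 0, 'B': 2, 'C': 3}
print(positions.to_dict())  # {'A': 0, 'B': 1, 'C': 2}
```

If the mismatch matters here, calling `reset_index(drop=True)` right after deduplicating, or mapping titles to positions as in `positions`, would keep the lookup aligned with the matrix rows.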

## Part 2: Composite Book Ratings

In [37]:
import pandas as pd

ratings = pd.read_csv('../data/Books_rating.csv')
ratings["Clean Title"] = ratings["Title"].str.lower().str.replace(r'[^\w\s]', '', regex=True)

In [16]:
import nltk

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/adene/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy.sparse import hstack
import numpy as np

ratings_sample = ratings.sample(frac = 0.05, random_state = 42)

# Build a model mapping review/summary + review/text -> review/score
v = TfidfVectorizer(stop_words='english')
combined = ratings_sample["review/summary"].fillna("") + " " + ratings_sample["review/text"].fillna("")


# Add Vader sentiment analysis
sid_obj = SentimentIntensityAnalyzer()
ratings_sample["Polarity"] = ratings_sample["review/summary"].fillna("").apply(lambda x: sid_obj.polarity_scores(x)["compound"])

# Convert polarity series to a 2D numpy array (shape: n_samples x 1)
X_polarity = np.array(ratings_sample["Polarity"]).reshape(-1, 1)

# Get the TF-IDF features (sparse matrix)
X_text = v.fit_transform(combined)

# Combine the two using hstack so that each row is a concatenation of TF-IDF features and the polarity value.
X = hstack([X_text, X_polarity])
y = ratings_sample["review/score"]

In [39]:
from sklearn.model_selection import train_test_split
from sklearn import tree

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = tree.DecisionTreeRegressor(max_depth=15)
model.fit(X_train, y_train)

test_predictions = model.predict(X_test)
train_predictions = model.predict(X_train)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
DECIMALS = 3

# Testing metrics
test_MAE = mean_absolute_error(y_test, test_predictions)
test_MSE = mean_squared_error(y_test, test_predictions)
test_RMSE = np.sqrt(test_MSE)
test_r2 = r2_score(y_test, test_predictions)
print('Testing: MAE=', round(test_MAE, DECIMALS), ', MSE=', round(test_MSE, DECIMALS), ', RMSE=', round(test_RMSE, DECIMALS), ', R-squared: ', round(test_r2, DECIMALS))

# Training metrics
train_MAE = mean_absolute_error(y_train, train_predictions)
train_MSE = mean_squared_error(y_train, train_predictions)
train_RMSE = np.sqrt(train_MSE)
train_r2 = r2_score(y_train, train_predictions)
print('Training: MAE=', round(train_MAE, DECIMALS), ', MSE=', round(train_MSE, DECIMALS), ', RMSE=', round(train_RMSE, DECIMALS), ', R-squared: ', round(train_r2, DECIMALS))

Testing: MAE= 0.766 , MSE= 1.099 , RMSE= 1.049 , R-squared:  0.236
Training: MAE= 0.708 , MSE= 0.929 , RMSE= 0.964 , R-squared:  0.358


In [None]:
from tqdm import tqdm
# To predict the average review, let's predict individual reviews and average

# Group ratings by book and sample 5% of all unique titles
unique_titles = ratings["Clean Title"].drop_duplicates()

sample_books = unique_titles.sample(frac = 0.05, random_state = 42)

grouped = ratings.groupby("Clean Title")

abs_error = []

for title in tqdm(sample_books):
    # Retrieve all rows (i.e., reviews) for this book.
    book_reviews = grouped.get_group(title).copy()

    book_reviews.loc[:, "Polarity"] = book_reviews["review/summary"].fillna("").apply(lambda x: sid_obj.polarity_scores(x)["compound"])

    # Convert polarity series to a 2D numpy array (shape: n_samples x 1)
    X_polarity = np.array(book_reviews["Polarity"]).reshape(-1, 1)

    # Get the TF-IDF features (sparse matrix)
    X_text = v.transform(book_reviews["review/summary"].fillna("") + " " + book_reviews["review/text"].fillna(""))

    # Combine the two using hstack so that each row is a concatenation of TF-IDF features and the polarity value.
    X = hstack([X_text, X_polarity])
    
    predicted_scores = model.predict(X)
    avg_pred = np.mean(predicted_scores)
    
    # Here, we compute the actual average review score:
    avg_score = book_reviews["review/score"].mean()
    
    abs_error.append(abs(avg_pred - avg_score))

print("Mean absolute error:", round(np.mean(abs_error), 3))

100%|██████████| 10355/10355 [00:22<00:00, 464.34it/s]

Mean absolute error: 0.52





By averaging individual predictions, we achieved an MAE of 0.52. This is a 0.34 improvement over the null model, which predicts the same average rating for every book.
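The per-book comparison can also be written without an explicit loop. The sketch below uses a toy review table in which the `pred` column stands in for `model.predict` output (the values are made up), and it assumes the null model predicts the mean of the per-book averages. The notebook's exact null-model definition is not shown, so treat that choice as an assumption.

```python
import pandas as pd

# Toy review table; `pred` is a hypothetical stand-in for model.predict output.
reviews = pd.DataFrame({
    "Clean Title": ["a", "a", "b", "b", "c"],
    "review/score": [5.0, 3.0, 4.0, 4.0, 1.0],
    "pred": [4.0, 4.0, 4.5, 3.5, 3.0],
})

# Average the actual and predicted scores per book.
per_book = reviews.groupby("Clean Title")[["review/score", "pred"]].mean()

# MAE of the averaged predictions against the actual per-book averages.
model_mae = (per_book["pred"] - per_book["review/score"]).abs().mean()

# Null model (assumed): one constant, the mean of per-book averages, for every book.
null_pred = per_book["review/score"].mean()
null_mae = (per_book["review/score"] - null_pred).abs().mean()

print(round(model_mae, 3), round(null_mae, 3))
```

Computing both errors on the same set of books makes the size of the improvement directly comparable.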

In [48]:
# Find the grouped standard deviation of average ratings per book
averages = grouped["review/score"].mean()
std_dev = averages.std()
print("Standard deviation of average ratings per book:", round(std_dev, 3))

Standard deviation of average ratings per book: 0.824


## Part 3: Comprehensive Recommender System

To improve on the recommendation model from Part 1, let's consider which books users tend to like when they also liked the reference book.
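As a rough illustration of the idea, the sketch below filters a toy ratings table down to titles that were rated highly by users who also rated the reference book highly. The users, titles, and the helper name `co_liked` are hypothetical. The notebook's own version additionally ranks these candidates by TF-IDF similarity.

```python
import pandas as pd

# Toy ratings (hypothetical users and titles).
toy_ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "title": ["ref", "x", "ref", "y", "ref", "z"],
    "rating": [5, 5, 5, 4, 3, 5],
})

def co_liked(ratings, ref_title, threshold=5):
    """Titles rated `threshold` by users who also rated `ref_title` at `threshold`."""
    liked = ratings[ratings["rating"] == threshold]
    fans = set(liked.loc[liked["title"] == ref_title, "user"])
    mask = liked["user"].isin(fans) & (liked["title"] != ref_title)
    return sorted(liked.loc[mask, "title"].unique())

print(co_liked(toy_ratings, "ref"))  # ['x']
```

Here `y` is excluded because its rating is below the threshold, and `z` is excluded because `u3` did not rate the reference book highly.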

In [229]:
# Potentially much simpler solution
import pandas as pd

df = pd.read_csv('../data/books_data.csv')
ratings = pd.read_csv('../data/Books_rating.csv')

In [231]:
ratings.head()

| Unnamed: 0 | Id | Title | Price | User_id | profileName | review/helpfulness | review/score | review/time | review/summary | review/text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1882931173 | Its Only Art If Its Well Hung! | | AVCGYZL8FQQTD | Jim of Oz "jim-of-oz" | 7/7 | 4.0 | 940636800 | Nice collection of Julie Strain images | This is only for Julie Strain fans. It's a col... |
| 1 | 826414346 | Dr. Seuss: American Icon | | A30TK6U7DNS82R | Kevin Killian | 10/10 | 5.0 | 1095724800 | Really Enjoyed It | I don't care much for Dr. Seuss but after read... |
| 2 | 826414346 | Dr. Seuss: American Icon | | A3UH4UZ4RSVO82 | John Granger | 10/11 | 5.0 | 1078790400 | Essential for every personal and Public Library | If people become the books they read and if "t... |
| 3 | 826414346 | Dr. Seuss: American Icon | | A2MVUWT453QH61 | Roy E. Perry "amateur philosopher" | 7/7 | 4.0 | 1090713600 | Phlip Nel gives silly Seuss a serious treatment | Theodore Seuss Geisel (1904-1991), aka &quot;D... |
| 4 | 826414346 | Dr. Seuss: American Icon | | A22X4XUPKF66MR | D. H. Richards "ninthwavestore" | 3/3 | 4.0 | 1107993600 | Good academic overview | Philip Nel - Dr. Seuss: American IconThis is b... |


In [233]:
# Pairwise recommendation setup
# Rename rating columns, build text features, and join ratings to books on Clean Title
ratings_renamed = ratings.rename(columns={"User_id": "user", "review/score": "rating"})
df['combined_features'] = df['Title'].fillna('') + ' ' + df['authors'].fillna('') + ' ' + df['description'].fillna('')
df['combined_features'] = df['combined_features'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
df["Clean Title"] = df['Title'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
ratings_renamed["Clean Title"] = ratings_renamed['Title'].str.lower().str.replace(r'[^\w\s]', '', regex=True)

df_merge = pd.merge(df, ratings_renamed[["Clean Title", "user", "rating"]], on="Clean Title", how="left")

print("Vectorizing...")
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['combined_features'])
print("Setting up book vectors...")
book_vectors = {i: tfidf_matrix[i] for i in range(tfidf_matrix.shape[0])}

Vectorizing...
Setting up book vectors...


In [234]:
# Build a dictionary mapping each Clean Title (converted to string, lowercased, and stripped) to its first index in df.
clean_title_to_index = {}
for idx, title in df["Clean Title"].astype(str).items():
    cleaned_title = title.strip().lower()
    if cleaned_title not in clean_title_to_index:
        clean_title_to_index[cleaned_title] = idx

clean_title_to_index['the great gatsby']

19805

In [235]:
# Construct the TF-IDF matrix (X) from the combined features column
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['combined_features'])

In [236]:
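def get_top5_recommendations(ref_idx, rating_threshold=5, top_n=5):
    """Return the top_n books most similar to df.loc[ref_idx], restricted where
    possible to books rated rating_threshold by users who liked the reference."""
    # Use df for reference lookup so indices match X and book_vectors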
    ref_title = df.loc[ref_idx, "Clean Title"]
    ref_vector = X[ref_idx]  # Using the TF-IDF matrix from df
     
    # Find users who rated the reference book with the desired threshold (fallback to threshold-1)
    users = set(df_merge[(df_merge["Clean Title"] == ref_title) & (df_merge["rating"] == rating_threshold)]["user"])
    if not users:
        users = set(df_merge[(df_merge["Clean Title"] == ref_title) & (df_merge["rating"] == rating_threshold - 1)]["user"])
    
    candidates = pd.DataFrame()
    if users:
        # Filter candidate reviews from these users (and exclude the ref book)
        candidates = df_merge[
            (df_merge["user"].isin(users)) &
            (df_merge["rating"] == rating_threshold) &
            (df_merge["Clean Title"] != ref_title)
        ].drop_duplicates(subset=["Clean Title"])
    if len(candidates) == 0:
        # If no such users, default: use all books except the reference, taking only unique titles
        candidates = df_merge[df_merge["Clean Title"] != ref_title].drop_duplicates(subset=["Clean Title"])
        
    # Build a list of candidate indices corresponding to df.
    # (Remember: book_vectors and X were built on df, not df_merge.)
    candidate_df_indices = []  # Indices from df
    candidate_mapping = {}     # Map from candidate index in df to candidate index in df_merge
    for merge_idx in candidates.index:
        candidate_title = df_merge.loc[merge_idx, "Clean Title"]
        if pd.isna(candidate_title):
            continue
        candidate_key = candidate_title.strip().lower()
        if candidate_key in clean_title_to_index:
            df_idx = clean_title_to_index[candidate_key]
            candidate_df_indices.append(df_idx)
            candidate_mapping[df_idx] = merge_idx  # record mapping back to df_merge
    
    if not candidate_df_indices:
        return pd.DataFrame()  # No candidates found
    
    # Compute cosine similarities in one batch:
    candidate_matrix = X[candidate_df_indices]
    sim_values = cosine_similarity(ref_vector, candidate_matrix)[0]
    
    # Create a list of tuples: (merge_idx, similarity)
    candidate_sim_tuples = []
    for df_idx, sim in zip(candidate_df_indices, sim_values):
        merge_idx = candidate_mapping[df_idx]
        candidate_sim_tuples.append((merge_idx, sim))
    
    # Sort by similarity (highest first) and select top_n candidates
    top_candidates = sorted(candidate_sim_tuples, key=lambda x: x[1], reverse=True)[:top_n]
    
    recs = []
    for merge_idx, sim in top_candidates:
        candidate_title = df_merge.loc[merge_idx, "Clean Title"]
        # Get the corresponding df index from the lookup dictionary
        df_candidate_idx = clean_title_to_index[candidate_title.strip().lower()]
        recs.append({
            "candidate_index": df_candidate_idx,
            "Title": df.loc[df_candidate_idx, "Title"],
            "similarity": sim,
            "rating": df_merge.loc[merge_idx, "rating"]
        })
    return pd.DataFrame(recs)

In [242]:
ref_idx = clean_title_to_index['harry potter and the chamber of secrets']
# ref_idx = clean_title_to_index['the great gatsby']
# ref_idx = 101
top5_recs = get_top5_recommendations(ref_idx)
print("Reference book:", df.loc[ref_idx, "Title"])
print(top5_recs)

Reference book: Harry Potter and the Chamber of Secrets
   candidate_index                                              Title  \
0            59516  Harry Potter Y La Camara Secreta (Harry Potter...   
1           124209  Harry Potter And The Chamber Of Secrets/ Teach...   
2           136930  There's Something About Harry: A Catholic Anal...   
3            16478         Harry Potter und der Gefangene von Azkaban   
4            36776              Harry Potter and The Sorcerer's Stone   

   similarity  rating  
0    0.434626     5.0  
1    0.428419     5.0  
2    0.413875     5.0  
3    0.409391     5.0  
4    0.399634     5.0  
