## Part 1: Book Recommendation

In [86]:
import pandas as pd

df = pd.read_csv('../data/books_data.csv')

In [87]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df["Clean Title"] = df['Title'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
df['combined_features'] = df['Clean Title'].fillna('') + ' ' + df['authors'].fillna('') + ' ' + df['description'].fillna('')
# df['combined_features'] = df['Clean Title'].fillna('') + ' ' + df['description'].fillna('')
# df['combined_features'] = df['description'].fillna('') # The problem here is that some books don't have a description
df['combined_features'] = df['combined_features'].str.lower().str.replace(r'[^\w\s]', '', regex=True)

# Remove any non-unique rows
df = df.drop_duplicates(subset=['Clean Title'])

# Define variables
ID_COLUMN = 'Title'
COMPARISON_COLUMN = 'combined_features'
SPECIFIC_ID = 'Its Only Art If Its Well Hung!'

# Vectorize the combined text features
v = TfidfVectorizer(stop_words='english')
X = v.fit_transform(df[COMPARISON_COLUMN])

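# Map each title to its row position in X. df.index keeps the original labels
# (with gaps after drop_duplicates), so labels would not line up with X's rows.
Id2idx = pd.Series(range(len(df)), index=df[ID_COLUMN])

def get_most_similar(item_id):
    """Return the five most similar titles and their cosine similarity scores."""
    idx = Id2idx[item_id]
    scores = cosine_similarity(X, X[idx]).flatten()
    # Sort descending and skip the first hit, which is normally the query itself
    recommended = (-scores).argsort()[1:6]
    return df[ID_COLUMN].iloc[recommended], scores[recommended]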

# Get similar items
similar = get_most_similar(SPECIFIC_ID)

# Create DataFrame with results
df_similar = pd.DataFrame({
    'ID': similar[0],
    'Score': similar[1],
    'Comment': similar[0].apply(lambda x: df[df[ID_COLUMN] == x][COMPARISON_COLUMN].values[0])
})

print('Original:')
print(df[df[ID_COLUMN] == SPECIFIC_ID][COMPARISON_COLUMN].values[0])

print(df_similar[["ID", "Score"]].head(5))

Original:
its only art if its well hung julie strain 
                                         ID     Score
209127                   Hung by the tongue  0.365535
203498                  Julie Of The Wolves  0.321138
65990                            The Return  0.290065
40737                The RETURN: THE RETURN  0.288097
167024  Gardening Without Stress and Strain  0.282020
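
To make the ranking step above easier to follow, here is a minimal sketch of the same TF-IDF plus cosine similarity idea on a toy corpus. The four titles and the names `docs`, `X_toy`, and `query_pos` are made up for illustration and do not come from `books_data.csv`. Unlike the cell above, this sketch filters the query out explicitly instead of assuming it is the first hit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical titles, not from books_data.csv)
docs = [
    "wolves of the north",
    "julie of the wolves",
    "gardening without stress",
    "a guide to stress free gardening",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_toy = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

query_pos = 0
scores = cosine_similarity(X_toy, X_toy[query_pos]).flatten()

# Sort descending, then drop the query itself explicitly
ranked = [i for i in (-scores).argsort() if i != query_pos]
top = ranked[:2]
print([(docs[i], round(float(scores[i]), 3)) for i in top])
```

Only the second title shares a non-stop-word ("wolves") with the query, so it ranks first; the gardening titles score zero against it.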


### Evaluate the model

For now, let's keep model evaluation simple by sampling a few books and manually reviewing their top recommendations. Later, we'll take a more systematic approach.

In [88]:
# Manual review

# Sample five books to assess recommendations
sample_books = df.sample(5, random_state=20)

top_recs = []
for _, row in sample_books.iterrows():
    book_recs = get_most_similar(row["Title"])
    top_recs.append(book_recs[0])

# Print the sampled titles with top recommendations for that title
for i, rec in enumerate(top_recs):
    print(f"Book: {sample_books.iloc[i]['Title']}")
    print(f"Top rec: {rec.iloc[0]}")

Book: The Sierra Club: Mountain Light Postcard Collection: A Portfolio
Top rec: The Encyclopedia of Ancient Civilizations
Book: Starting and Succeeding in Real Estate
Top rec: The Official XMLSPY Handbook
Book: The Eye of the Abyss (Franz Schmidt, 1)
Top rec: The Art of Translating Prose
Book: The Kabbalah Pillars: A Romance of The Ages
Top rec: How To Make The Devil Obey You!!!
Book: Iridescent Soul
Top rec: Wallace Stevens: A Poet's Growth


Our recommendations seem a bit all over the place right now. In Part 3, we'll work on improving their quality.
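
Before tuning anything, one thing that may be worth ruling out in a pipeline like this is index alignment: `drop_duplicates` leaves gaps in the DataFrame index, while the rows of a TF-IDF matrix are numbered by position. The sketch below uses a made-up four-row frame (`toy`, `by_label`, and `by_position` are illustrative names) to show how the two kinds of lookup can diverge.

```python
import pandas as pd

# Toy frame (hypothetical data): rows 1 and 2 share a title
toy = pd.DataFrame({"Title": ["A", "B", "B", "C"]})
toy = toy.drop_duplicates(subset=["Title"])

print(list(toy.index))  # [0, 1, 3]: the label of the dropped row is missing

# Label-based lookup: the value for "C" is its original index label
by_label = pd.Series(toy.index, index=toy["Title"])
# Position-based lookup: the value for "C" is its row number after deduplication
by_position = pd.Series(range(len(toy)), index=toy["Title"])

print(by_label["C"], by_position["C"])  # 3 2
```

A matrix built from `toy` has only three rows (positions 0 to 2), so the label `3` would point past the end, or at the wrong row in a larger frame.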

## Part 2: Composite Book Ratings

In [3]:
import pandas as pd

ratings = pd.read_csv('../data/Books_rating.csv')

In [None]:
ratings.head()

| | Id | Title | Price | User_id | profileName | review/helpfulness | review/score | review/time | review/summary | review/text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1882931173 | Its Only Art If Its Well Hung! | | AVCGYZL8FQQTD | Jim of Oz "jim-of-oz" | 7/7 | 4.0 | 940636800 | Nice collection of Julie Strain images | This is only for Julie Strain fans. It's a col... |
| 1 | 826414346 | Dr. Seuss: American Icon | | A30TK6U7DNS82R | Kevin Killian | 10/10 | 5.0 | 1095724800 | Really Enjoyed It | I don't care much for Dr. Seuss but after read... |
| 2 | 826414346 | Dr. Seuss: American Icon | | A3UH4UZ4RSVO82 | John Granger | 10/11 | 5.0 | 1078790400 | Essential for every personal and Public Library | If people become the books they read and if "t... |
| 3 | 826414346 | Dr. Seuss: American Icon | | A2MVUWT453QH61 | Roy E. Perry "amateur philosopher" | 7/7 | 4.0 | 1090713600 | Phlip Nel gives silly Seuss a serious treatment | Theodore Seuss Geisel (1904-1991), aka &quot;D... |
| 4 | 826414346 | Dr. Seuss: American Icon | | A22X4XUPKF66MR | D. H. Richards "ninthwavestore" | 3/3 | 4.0 | 1107993600 | Good academic overview | Philip Nel - Dr. Seuss: American IconThis is b... |


## Temp
Things we could try:

- Sentiment-specific analyzer (let's try this one first)
- More advanced regression models

In [16]:
import nltk

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/adene/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy.sparse import hstack
import numpy as np

# Work on a 5% sample of the ratings to keep experiments fast
df = ratings.sample(frac=0.05)

# TODO: Build review/text and review/summary -> review/score model
combined = df["review/summary"].fillna("") + " " + df["review/text"].fillna("")

# Get the TF-IDF features (sparse matrix, one row per review)
v = TfidfVectorizer(stop_words='english')
X_text = v.fit_transform(combined)

# Optional: add the VADER compound sentiment score of each review summary
sid_obj = SentimentIntensityAnalyzer()
df["Polarity"] = df["review/summary"].fillna("").apply(lambda x: sid_obj.polarity_scores(x)["compound"])

# Convert the polarity series to a 2D numpy array (shape: n_samples x 1)
X_polarity = df["Polarity"].to_numpy().reshape(-1, 1)

# Combine the two with hstack so that each row is the TF-IDF features followed by the polarity value.
# hstack returns COO format; convert to CSR so rows can be indexed.
X = hstack([X_text, X_polarity]).tocsr()

# Earlier attempts:
# df["Polarity"] = df["review/summary"].apply(sid_obj.polarity_scores)
# X = pd.DataFrame(v.fit_transform(combined), df["Polarity"])

y = df["review/score"]
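
To see what the feature-combination step produces, here is a minimal sketch on made-up numbers. `X_text_toy` stands in for a TF-IDF matrix and `polarity_toy` for the VADER compound scores; both are hypothetical values, not real output. The dense column is wrapped in `csr_matrix` here so that every block passed to `hstack` is sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for a TF-IDF matrix: 3 documents x 4 terms (made-up values)
X_text_toy = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
]))

# Stand-in for compound sentiment scores, one per document, as a column
polarity_toy = csr_matrix(np.array([0.6, -0.4, 0.0]).reshape(-1, 1))

# Append the polarity column on the right; convert to CSR for row indexing
X_toy = hstack([X_text_toy, polarity_toy]).tocsr()

print(X_toy.shape)         # one extra column compared with X_text_toy
print(X_toy[1].toarray())  # second document: four text features, then its polarity
```

The row count is unchanged and the column count grows by one, so the target vector still lines up with the rows.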

In [18]:
sid_obj.polarity_scores("My cat is sad")

{'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

In [15]:
print(df["review/score"].std())

1.2065157212408122
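
One plausible use of this number is as a baseline: a model that always predicts the mean score has an RMSE equal to the population standard deviation of the scores, so a useful regressor should beat roughly that value. The sketch below checks the relationship on made-up ratings (the array `scores` is illustrative only). Note that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), which is slightly larger but nearly identical for large samples.

```python
import numpy as np

# Made-up star ratings, only to illustrate the relationship
scores = np.array([5.0, 4.0, 5.0, 1.0, 3.0, 5.0, 2.0, 4.0])

# Baseline "model": always predict the mean rating
baseline_pred = np.full_like(scores, scores.mean())
baseline_rmse = np.sqrt(np.mean((scores - baseline_pred) ** 2))

# RMSE of the mean predictor next to the population standard deviation
print(round(float(baseline_rmse), 3), round(float(scores.std(ddof=0)), 3))
```

On real data the baseline would be computed from the training mean and evaluated on the test split, so the match is approximate rather than exact.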


In [None]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

DECIMALS = 3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = tree.DecisionTreeRegressor(max_depth=15)
model.fit(X_train, y_train)

test_predictions = model.predict(X_test)
train_predictions = model.predict(X_train)

def report_metrics(label, y_true, y_pred):
    """Print MAE, MSE, RMSE, and R-squared for one data split."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f'{label}: MAE=', round(mae, DECIMALS), ', MSE=', round(mse, DECIMALS), ', RMSE=', round(rmse, DECIMALS), ', R-squared: ', round(r2, DECIMALS))

# Testing metrics
report_metrics('Testing', y_test, test_predictions)

# Training metrics
report_metrics('Training', y_train, train_predictions)

ValueError: setting an array element with a sequence.

In [7]:
type(summaries.todense())

numpy.matrix

In [None]:
# TODO: Test-train model evaluation

In [None]:
# TODO: Multiple reviews -> average review/score model

## Part 3: Comprehensive Recommender System

In [None]:
# Pairwise recommendation tuning