# "I like *Lord of the Rings*, what else should I read?"

The goal of this analysis is to understand the behavior of users who liked *The Lord of the Rings* or other books by J.R.R. Tolkien. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import sys
sys.path.append('../src')  # Metrics defined in ../src for clarity
from metrics_notebook_02 import compute_reader_ratio, compute_lift, compute_lift_modif

# Load data
books = pd.read_csv('../data/Books.csv', dtype={"ISBN": str}, low_memory=False)
users = pd.read_csv('../data/Users.csv')
ratings = pd.read_csv('../data/Ratings.csv')

# Basic cleanup (inherited from Notebook 01)

# Strip whitespace and normalize ISBN
books['ISBN'] = books['ISBN'].str.strip()
ratings['ISBN'] = ratings['ISBN'].str.strip()

# Drop unused image columns to save memory
books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True)

# Keep only ratings whose User-ID and ISBN exist in the users and books tables
ratings = ratings[
    ratings['User-ID'].isin(users['User-ID']) &
    ratings['ISBN'].isin(books['ISBN'])
]
# Clean book titles and authors 
books['Book-Title'] = books['Book-Title'].astype(str).str.strip()
books['Book-Author'] = books['Book-Author'].astype(str).str.strip()


# Normalize author and title names
books['Book-Author'] = books['Book-Author'].str.lower()
books['Book-Title'] = books['Book-Title'].str.lower()


# Preview
print(f"Books: {books.shape}, Users: {users.shape}, Ratings: {ratings.shape}")


Books: (271360, 5), Users: (278858, 3), Ratings: (1031139, 3)


## First Idea - Selection by Author "Tolkien"
*The Lord of the Rings* is a tricky title to search for: the book has three parts (*Fellowship*, *Towers*, *King*), and those parts may appear in various languages or without the full series name, so searching by title alone can be unreliable.
However, "Tolkien" is a relatively unique and consistent author name, which makes it a more reliable filter for identifying relevant works.

In [2]:
# Filter books where author contains "tolkien"
tolkien_books = books[books['Book-Author'].str.contains("tolkien", na=False)].copy()

# Preview the books
tolkien_books.head(10)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
91,8445071408,el senor de los anillos: la comunidad del anil...,j. r. r. tolkien,2001,Minotauro
92,8445071769,el senor de los anillos: las dos torres (lord ...,j. r. r. tolkien,2001,Minotauro
93,8445071777,el senor de los anillos: el retorno del rey (t...,j. r. r. tolkien,2001,Distribooks
780,0345339711,"the two towers (the lord of the rings, part 2)",j.r.r. tolkien,1986,Del Rey
1171,034537522X,the book of lost tales 2 (the history of middl...,j.r.r. tolkien,1992,Del Rey
1172,0345375211,the book of lost tales 1 (the history of middl...,j. r. r. tolkien,1992,Del Rey Books
1431,0345325818,the silmarillion,j.r.r. tolkien,1985,Del Rey
1571,0345339703,the fellowship of the ring (the lord of the ri...,j.r.r. tolkien,1986,Del Rey
1703,0618126988,the silmarillion,j.r.r. tolkien,2001,Houghton Mifflin
1763,0395177111,the hobbit (leatherette collector's edition),j. r. r. tolkien,1973,Houghton Mifflin Company


In [3]:
# All the variants of the author's name
tolkien_books['Book-Author'].value_counts()

Book-Author
j. r. r. tolkien             174
j.r.r. tolkien                57
john ronald reuel tolkien     19
j r r tolkien                  9
j. r. r tolkien                7
tolkien                        4
christopher tolkien            2
simon tolkien                  1
john r. r. tolkien             1
tolkienquest                   1
tolkien j.r.r.                 1
Name: count, dtype: int64

#### Almost all books are by J.R.R. Tolkien (author of The Lord of the Rings).
Christopher is his son and Simon his grandson; Christopher in particular edited and published much of his father's work, so we don't necessarily have to drop them. There are also some (~30) books with "Lord of the Rings" in the title; the next cell looks at those not credited to a Tolkien.

In [4]:
# Filter books not written by Tolkien
non_tolkien_books = books[~books['Book-Author'].str.contains("tolkien", na=False)]

# Find books with "lord of the rings" in the title
lotr_named_books = non_tolkien_books[
    non_tolkien_books['Book-Title'].str.contains("lord of the rings", na=False)
]

# Show relevant columns: title and author
lotr_named_books[['Book-Title', 'Book-Author']].sort_values(by='Book-Author').head()

Unnamed: 0,Book-Title,Book-Author
50539,the fellowship of the ring photo guide (the lo...,alison sage
76216,j.r.r. tolkien's the hobbit and the lord of th...,anne m. pienciak
107350,the real middle earth : exploring the magic an...,brian bates
135851,the lord of the rings official movie guide,brian sibley
135063,the making of the movie trilogy (the lord of t...,brian sibley


## First, we can focus on what (author/title) the readers or fans of Tolkien also read and enjoy.

The **primary** goal of this analysis is to identify other **fantasy worlds created by different authors** that resonate with readers who enjoy Tolkien’s work.
Later, we will revisit this assumption and ask a more nuanced question: **Who is my customer?**
Do they truly want to explore more richly built fantasy universes? Or are they simply looking for other **popular titles** they've heard of, something accessible and mainstream (e.g. *Harry Potter*, *The Da Vinci Code*)? Or are they perhaps looking for **literary quality** from the 20th century (as my high school teacher would, for example)?

**In the "Main Line"** we don’t have to limit ourselves to The Lord of the Rings, but rather consider Tolkien’s work as a whole, since the majority of his books take place in the same fantasy world. We can reasonably expect (and will validate later) that fans of The Lord of the Rings also enjoy The Hobbit or The Silmarillion.



In [5]:
# Filter books by authors containing "tolkien"
tolkien_isbns = books[books['Book-Author'].str.contains('tolkien', case=False, na=False)]['ISBN']

# Filter ratings related to Tolkien books
tolkien_ratings = ratings[ratings['ISBN'].isin(tolkien_isbns)]

# Get users who rated Tolkien books
tolkien_fans = tolkien_ratings['User-ID'].unique()

# Get all ratings from Tolkien fans
tolkien_fan_ratings = ratings[ratings['User-ID'].isin(tolkien_fans)]

# Merge with book metadata
tolkien_fan_books = tolkien_fan_ratings.merge(books, on='ISBN', how='left')

# Exclude Tolkien books themselves
tolkien_fan_books = tolkien_fan_books[~tolkien_fan_books['Book-Author'].str.contains('tolkien', case=False, na=False)]

# Count occurrences of each rating (0–10) for Tolkien books
rating_counts = tolkien_ratings['Book-Rating'].value_counts().sort_index()

# Calculate percentage of each rating
rating_percentages = (rating_counts / rating_counts.sum()) * 100

# Calculate percentages for nonzero ratings only
nonzero_rating_percentages = (rating_counts / (rating_counts.sum() - rating_counts.get(0, 0))) * 100
nonzero_rating_percentages[0] = np.nan

# Combine into a summary table
tolkien_rating_summary = pd.DataFrame({
    'Count': rating_counts,
    'Percentage': rating_percentages.round(2),
    'Percentage between nonzero ratings': nonzero_rating_percentages.round(2)
})
# Rename index
tolkien_rating_summary.index.name = "Tolkien Books Rating"

tolkien_rating_summary

Unnamed: 0_level_0,Count,Percentage,Percentage between nonzero ratings
Tolkien Books Rating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1047,44.14,
1,3,0.13,0.23
2,2,0.08,0.15
3,8,0.34,0.6
4,8,0.34,0.6
5,109,4.6,8.23
6,40,1.69,3.02
7,81,3.41,6.11
8,189,7.97,14.26
9,250,10.54,18.87


### How to Define "Tolkien Fans"?

Looking at the ratings of Tolkien's books:

- **44% of ratings are 0**, which this analysis treats as "read but not rated". Ignoring these would discard nearly **half the available data**, a significant information loss.
- Among non-zero ratings:
  - **67% rated 9–10** (strong positive sentiment)
  - **82% rated 8–10** (generally favorable)

This confirms Tolkien’s books are **highly appreciated** in general.  
Should we define a fan by their *explicit rating*, or just by the *fact they read the book*?

- **Ratings are highly subjective** — a user might love a book and still rate it 8, while another user automatically gives 10 to anything they like.

Rather than setting a threshold (such as ≥8), we **include all users who have read** any Tolkien book, regardless of rating. We could exclude very negative reviews (ratings 1–4), although they account for less than 2% of the data. Alternatively, we could exclude all lower ratings (1–6 or 1–7). In that case we would also have to estimate the attitude of users who did not rate the books at all (otherwise we would lose almost half of the data), perhaps by analyzing the types of books they did rate and their rating patterns, but this would be a complex task. For the sake of simplicity, we proceed by including all readers.
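
The alternatives discussed above can be written as one small selection function. This is a minimal plain-Python sketch on invented toy data; `Rating`, `select_fans`, `min_rating`, and `keep_implicit` are hypothetical names for illustration and are not part of the notebook.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    user_id: int
    isbn: str
    rating: int  # 0 means "no explicit rating" in this dataset

def select_fans(ratings, target_isbns, min_rating=None, keep_implicit=True):
    """Return the set of users who count as fans of the target books.

    min_rating=None -> every explicit rating counts (the choice made here)
    min_rating=8    -> only explicit ratings >= 8 count
    keep_implicit   -> whether rating 0 (read, not rated) also counts
    """
    fans = set()
    for r in ratings:
        if r.isbn not in target_isbns:
            continue
        if r.rating == 0:
            if keep_implicit:
                fans.add(r.user_id)
        elif min_rating is None or r.rating >= min_rating:
            fans.add(r.user_id)
    return fans

# Toy data, invented for illustration
toy = [
    Rating(1, "A", 0),
    Rating(2, "A", 10),
    Rating(3, "A", 5),
    Rating(4, "B", 9),   # not a target book
    Rating(5, "A", 8),
]
targets = {"A"}

all_readers = select_fans(toy, targets)                                # every reader
strict = select_fans(toy, targets, min_rating=8, keep_implicit=False)  # explicit >= 8 only
lenient = select_fans(toy, targets, min_rating=8)                      # explicit >= 8 or unrated
print(all_readers, strict, lenient)
```

The strict variant drops every user with rating 0, which is exactly the data loss described above; the lenient variant keeps them without knowing their attitude.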

In [6]:
tolkien_fan_books.head(5)

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,276786,842333533X,0,vieja nueva york,edith wharton,2003,Planeta Pub Corp
1,276786,8433914456,0,invitacion a la etica,fernando savater,1995,Anagrama
2,276786,8437606322,8,anna karenina,leo tolstoy,1999,Ediciones Catedra S.A.
4,276786,847765011X,0,isabelle eberhardt,eglal errera,1991,Circe
5,276786,8478442588,6,el elogio de la sombra,tanazaki,1998,Siruela


In [7]:
# Tolkien Reader Count (any rating, including 0), grouped by ISBN
reader_count = (
    tolkien_fan_books.groupby('ISBN')
    .size()
    .reset_index(name='Reader Count')
)

# Rating Count (non-zero ratings only), grouped by ISBN
non_zero_ratings = tolkien_fan_books[tolkien_fan_books['Book-Rating'] > 0]
rating_count = (
    non_zero_ratings.groupby('ISBN')
    .size()
    .reset_index(name='Rating Count')
)

# Average Rating (non-zero only), grouped by ISBN
avg_rating = (
    non_zero_ratings.groupby('ISBN')['Book-Rating']
    .mean()
    .reset_index(name='Average Rating')
)

# Merge all stats
summary = reader_count.merge(rating_count, on='ISBN', how='left')
summary = summary.merge(avg_rating, on='ISBN', how='left')

# Add metadata (Book Title and Author)
summary = summary.merge(books[['ISBN', 'Book-Title', 'Book-Author']], on='ISBN', how='left')

# Sort by Reader Count and display
summary_sorted = summary.sort_values(by='Reader Count', ascending=False)
summary_sorted.head(5)

Unnamed: 0,ISBN,Reader Count,Rating Count,Average Rating,Book-Title,Book-Author
106821,0971880107,234,37.0,4.72973,wild animus,rich shapero
17590,0316666343,159,65.0,8.0,the lovely bones: a novel,alice sebold
35884,0385504209,155,66.0,8.787879,the da vinci code,dan brown
65700,059035342X,142,64.0,8.90625,harry potter and the sorcerer's stone (harry p...,j. k. rowling
19678,0345337662,131,53.0,7.528302,interview with the vampire,anne rice


### Some results were expected
*Harry Potter* and *The Da Vinci Code* were expected, as they are popular worldwide. I had not heard of *Wild Animus*. The results seem to mirror the usual most-read titles, so I will look at "non-Tolkien readers" as well.

In [8]:
# Get ratings from users who are NOT Tolkien fans
non_tolkien_ratings = ratings[~ratings['User-ID'].isin(tolkien_fans)]

# Merge with books metadata
non_tolkien_books = non_tolkien_ratings.merge(books, on='ISBN', how='left')

# Reader Count (any rating, including 0), grouped by ISBN
non_tolkien_reader_count = (
    non_tolkien_books.groupby('ISBN')
    .size()
    .reset_index(name='Reader Count')
)

# Rating Count (non-zero ratings only)
non_zero_nt_ratings = non_tolkien_books[non_tolkien_books['Book-Rating'] > 0]
non_tolkien_rating_count = (
    non_zero_nt_ratings.groupby('ISBN')
    .size()
    .reset_index(name='Rating Count')
)

# Average Rating (non-zero only)
non_tolkien_avg_rating = (
    non_zero_nt_ratings.groupby('ISBN')['Book-Rating']
    .mean()
    .reset_index(name='Average Rating')
)

# Merge all stats
non_tolkien_summary = non_tolkien_reader_count.merge(non_tolkien_rating_count, on='ISBN', how='left')
non_tolkien_summary = non_tolkien_summary.merge(non_tolkien_avg_rating, on='ISBN', how='left')

# Add Book Title and Author (for display)
book_meta = books[['ISBN', 'Book-Title', 'Book-Author']].drop_duplicates()
non_tolkien_summary = non_tolkien_summary.merge(book_meta, on='ISBN', how='left')

# Sort by Reader Count
non_tolkien_summary_sorted = non_tolkien_summary.sort_values(by='Reader Count', ascending=False)

# Display top results
non_tolkien_summary_sorted.head(5)

Unnamed: 0,ISBN,Reader Count,Rating Count,Average Rating,Book-Title,Book-Author
176607,971880107,2268,544.0,4.367647,wild animus,rich shapero
32149,316666343,1136,642.0,8.20405,the lovely bones: a novel,alice sebold
59305,385504209,728,421.0,8.380048,the da vinci code,dan brown
26757,312195516,631,353.0,8.135977,the red tent (bestselling backlist),anita diamant
6081,60928336,609,288.0,7.822917,divine secrets of the ya-ya sisterhood: a novel,rebecca wells


### Almost the Same Results
The results are nearly identical, even when all readers (Tolkien and non-Tolkien together) are considered; that variant was tested but is omitted here to avoid redundancy.

## Identify Authors/Books Disproportionately Read by Tolkien Fans  
This analysis aims to identify authors who are read disproportionately more by Tolkien fans compared to other readers.

- My original idea: **Reader Ratio**:  
  `Reader Ratio = (# of Tolkien fans who read the author) / (# of all readers who read the author)`  

- The metric typically used: **Lift** *(used as the primary metric)*:  
  `Lift = P(Author | Tolkien fan) / P(Author)`  
  I also worked with a variant of lift: `P(Author | Tolkien fan) / P(Author | not a Tolkien fan)`

Both metrics should help identify authors with unusually high engagement among Tolkien fans compared to the general reader base.
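
The notebook's `compute_reader_ratio` and `compute_lift` live in `../src` and are not shown here, so the following is only a plain-Python sketch of the two formulas above on invented toy data. It assumes that probabilities are estimated as shares of distinct users; `ratio_and_lift` is a hypothetical helper name.

```python
from collections import defaultdict

def ratio_and_lift(pairs, target_users):
    """pairs: iterable of (user_id, author); target_users: set of user ids.

    Returns {author: (target_readers, global_readers, reader_ratio, lift)}.
    """
    readers = defaultdict(set)   # author -> set of users who read that author
    all_users = set()
    for user, author in pairs:
        readers[author].add(user)
        all_users.add(user)

    n_all = len(all_users)
    n_target = len(target_users & all_users)
    out = {}
    for author, users in readers.items():
        t = len(users & target_users)        # target-group readers of the author
        g = len(users)                       # all readers of the author
        reader_ratio = t / g
        lift = (t / n_target) / (g / n_all)  # P(author | fan) / P(author)
        out[author] = (t, g, reader_ratio, lift)
    return out

# Toy data, invented for illustration: users 1 and 2 are the "fans"
pairs = [
    (1, "tolkien"), (2, "tolkien"),
    (1, "eddings"), (2, "eddings"), (3, "eddings"),
    (1, "brown"), (3, "brown"), (4, "brown"), (5, "brown"),
]
fans = {1, 2}

stats = ratio_and_lift(pairs, fans)
for author, (t, g, ratio, lift) in sorted(stats.items(), key=lambda kv: -kv[1][3]):
    print(f"{author:8s} target={t} global={g} ratio={ratio:.3f} lift={lift:.3f}")
```

A lift above 1 means the author is read more often inside the target group than in the whole user base; a lift below 1 means the opposite.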



## Author Recommendation

Since books often appear under multiple title variants and different ISBNs (making them difficult to merge), I decided to start with a simple author-based recommendation approach. Specifically, I group books by the author's **surname**, which is relatively easy to extract from the full author name.

This method works well for authors with distinctive surnames like *Tolkien*, *Rowling*, or *Paolini*. However, it has clear limitations when dealing with common surnames such as *King*, *Brown*, or *Smith*, where different authors share the same last name.
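
The surname limitation is easy to see on a handful of names. The sketch below applies the same `(\w+)$` pattern with Python's `re` module (rather than pandas) to an illustrative list of author strings; the helper `surname` is a hypothetical name.

```python
import re
from collections import defaultdict

SURNAME_RE = re.compile(r"(\w+)$")  # last run of word characters in the string

def surname(author):
    """Return the lowercased last word of the author string, or None."""
    match = SURNAME_RE.search(author.strip().lower())
    return match.group(1) if match else None

authors = [
    "j. r. r. tolkien",
    "J.R.R. Tolkien",
    "stephen king",
    "laurie r. king",
    "ursula k. le guin",
    "tolkien j.r.r.",      # ends with a dot, so the pattern finds no surname
]

groups = defaultdict(set)
for name in authors:
    groups[surname(name)].add(name.lower())

for key, names in groups.items():
    print(key, sorted(names))
```

Two different authors end up under `king`, a two-part surname is reduced to `guin`, and the reversed form `tolkien j.r.r.` yields no surname at all.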

In [9]:
# Extract author surnames
books['Author_Surname'] = books['Book-Author'].str.extract(r'(\w+)$').iloc[:, 0].str.lower()

# Merge ratings with extracted author surnames
ratings_with_authors = ratings.merge(books[['ISBN', 'Author_Surname']], on='ISBN', how='left')

# Identify Tolkien readers (based on surname)
tolkien_fans = ratings_with_authors[
    ratings_with_authors['Author_Surname'] == 'tolkien'
]['User-ID'].unique()

# Author Reader Ratio
author_reader_ratio = compute_reader_ratio(
    ratings_df=ratings_with_authors,
    target_users=tolkien_fans,                 
    entity_col='Author_Surname',
    min_target_readers=30,
    min_global_readers=50
)
print("Top authors by Reader Ratio:")
print(author_reader_ratio.head(10))

# Author Lift
author_lift = compute_lift(
    ratings_df=ratings_with_authors,
    target_users=tolkien_fans,
    entity_col='Author_Surname',
    min_target_readers=30,
    min_global_readers=50
)
print("\nTop authors by Lift:")
print(author_lift.head(10))

Top authors by Reader Ratio:
  Author_Surname  Target_Readers  Global_Readers  Reader_Ratio
0        tolkien            1329            1329      1.000000
1          hobbs              31              65      0.476923
2       rockwell              39              90      0.433333
3        paolini              31              72      0.430556
4            hoh              35              84      0.416667
5          hoban              31              77      0.402597
6     saberhagen              63             157      0.401274
7         farley              34              85      0.400000
8        bunting              40             101      0.396040
9       moorcock              36              92      0.391304

Top authors by Lift:
  Author_Surname  Target_Readers  Global_Readers       Lift
0        tolkien            1329            1329  69.304740
1          hobbs              31              65  33.053030
2       rockwell              39              90  30.032054
3        paolini

### Promising but Not Fully Relevant (?)

#### How to Evaluate the Results?

In practice, the only real measure of recommendation quality is user satisfaction—whether the reader enjoyed the recommended book and to what extent. Since we lack explicit feedback data, we face the challenge of defining what makes a recommendation "good" or "bad."

For example, recommending *Christopher Paolini* (author of *Eragon*) might be considered a good match for Tolkien fans due to similar themes and genre. On the other hand, recommending *Agatha Christie* may be less relevant, as her detective novels do not align with Tolkien's fantasy style.

While we can use existing professional recommendation systems for benchmarking, a quick and creative alternative is **prompt-based evaluation** using ChatGPT. Acting as a simulated recommendation system or proxy for reader feedback, ChatGPT can evaluate the relevance of recommended authors.

We used the following prompt for this purpose:

> *"Decide how relevant these authors are to Tolkien's work. Grade each recommendation on a scale from 1 (worst) to 5 (best)."*

This approach offers a fast, human-like evaluation when explicit user feedback is unavailable.

**Note**: This evaluation method can be significantly improved by applying prompt-engineering principles (e.g., one-shot or few-shot prompting, where the prompt includes graded examples).
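
As a rough illustration of the few-shot idea, the prompt can be assembled from a couple of graded examples before the authors to be judged. This sketch only builds the prompt text and calls no model; `build_relevance_prompt` is a hypothetical helper, and the example grades are illustrative, based on the Paolini and Christie contrast described above.

```python
def build_relevance_prompt(authors, examples=()):
    """Build a grading prompt, optionally preceded by graded examples."""
    lines = [
        "Decide how relevant these authors are to Tolkien's work.",
        "Grade each recommendation on a scale from 1 (worst) to 5 (best).",
    ]
    if examples:
        lines.append("")
        lines.append("Examples of graded recommendations:")
        for name, grade, reason in examples:
            lines.append(f"- {name}: {grade} ({reason})")
    lines.append("")
    lines.append("Authors to grade:")
    lines.extend(f"- {name}" for name in authors)
    return "\n".join(lines)

# Illustrative few-shot examples (grades are assumptions, not measured data)
examples = [
    ("Christopher Paolini", 4, "fantasy with similar themes and genre"),
    ("Agatha Christie", 1, "detective novels, not fantasy"),
]

prompt = build_relevance_prompt(["hobbs", "rockwell", "paolini"], examples)
print(prompt)
```

Passing full author names instead of bare surnames would likely also reduce wrong guesses about who the author is.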

| Rank | Author      | Relevance (1–5) | Justification |
|------|-------------|----------------|---------------|
| 1    | tolkien     | —              | Reference group. |
| 2    | hobbs       | 4              | Robin Hobb is a leading epic fantasy author with richly developed worlds and character-driven plots—very Tolkien-esque. |
| 3    | rockwell    | 1              | Possibly Anne Rockwell (children’s lit) or another non-fantasy writer—likely unrelated to Tolkien-style fantasy. |
| 4    | paolini     | 4              | Christopher Paolini (*Eragon*) is directly inspired by Tolkien, though targeted toward younger readers. Clear fantasy relevance. |
| 5    | hoh         | 1              | Likely a non-fantasy author; little to no clear connection with Tolkienian themes. |
| 6    | hoban       | 2              | Russell Hoban has written speculative and surreal fiction (*Riddley Walker*), which is imaginative but not high fantasy. |
| 7    | saberhagen  | 3              | Fred Saberhagen wrote sci-fi and fantasy, including *Swords* series—some overlap with Tolkien fans possible. |
| 8    | farley      | 1              | Walter Farley (*The Black Stallion*)—classic children’s adventure stories; not fantasy-related. |
| 9    | bunting     | 1              | Eve Bunting writes children’s fiction; not related to Tolkien-like fantasy. |
| 10   | moorcock    | 5              | Michael Moorcock (*Elric of Melniboné*) is a foundational fantasy author, often seen as a philosophical foil to Tolkien. Very relevant. |

Fun fact: the surname "hobbs" does not belong to Robin Hobb (whose surname would be extracted as "hobb") but to an unrelated author. Paolini (*Eragon*) and Moorcock, on the other hand, are highly relevant.

### Same Strategy for Book Recommendation

In [10]:


# Merge ratings with book titles
ratings_with_titles = ratings.merge(books[['ISBN', 'Book-Title']], on='ISBN', how='left')

# Book recommendation using Reader Ratio
book_reader_ratio = compute_reader_ratio(
    ratings_df=ratings_with_titles,
    target_users=tolkien_fans,
    entity_col='Book-Title',
    min_target_readers=5,
    min_global_readers=20
)

# Add author names
book_reader_ratio = book_reader_ratio.merge(
    books[['Book-Title', 'Book-Author']].drop_duplicates(), on='Book-Title', how='left'
)

# Filter out Tolkien books
tolkien_titles = tolkien_books['Book-Title'].str.lower().unique()
book_reader_ratio = book_reader_ratio[
    ~book_reader_ratio['Book-Title'].str.lower().isin(tolkien_titles)
]



# Book recommendation using Lift
book_lift = compute_lift(
    ratings_df=ratings_with_titles,
    target_users=tolkien_fans,
    entity_col='Book-Title',
    min_target_readers=5,
    min_global_readers=20
)

# Add author names
book_lift = book_lift.merge(
    books[['Book-Title', 'Book-Author']].drop_duplicates(), on='Book-Title', how='left'
)

# Filter out Tolkien books
book_lift = book_lift[
    ~book_lift['Book-Title'].str.lower().isin(tolkien_titles)
]

# Display top results
print("Top books by Reader Ratio:")
book_reader_ratio[['Book-Title', 'Book-Author', 'Target_Readers', 'Global_Readers', 'Reader_Ratio']].head(10)


Top books by Reader Ratio:


Unnamed: 0,Book-Title,Book-Author,Target_Readers,Global_Readers,Reader_Ratio
25,"heretics of dune (dune chronicles, book 5)",frank herbert,24,33,0.727273
26,darkness and light (dragonlance: preludes),paul b. thompson,14,20,0.7
27,sapphire rose (elenium (paperback)),david eddings,16,23,0.695652
28,midnight louie's pet detectives,carole nelson douglas,14,21,0.666667
29,"the seeress of kell (the malloreon, book 5)",david eddings,19,29,0.655172
30,"the path of daggers (the wheel of time, book 8)",robert jordan,22,34,0.647059
31,the courtship of princess leia (star wars),dave wolverton,15,24,0.625
32,king of the murgos (malloreon (paperback rando...,david eddings,18,29,0.62069
33,second heaven,judith guest,14,23,0.608696
34,"wedge's gamble (star wars: x-wing series, book 2)",michael a. stackpole,12,20,0.6


In [11]:
# Display top results
print("Top books by Lift:")
book_lift[['Book-Title', 'Book-Author', 'Target_Readers', 'Global_Readers', 'Lift']].head(10)


Top books by Lift:


Unnamed: 0,Book-Title,Book-Author,Target_Readers,Global_Readers,Lift
25,"heretics of dune (dune chronicles, book 5)",frank herbert,24,33,50.403448
26,darkness and light (dragonlance: preludes),paul b. thompson,14,20,48.513318
27,sapphire rose (elenium (paperback)),david eddings,16,23,48.211993
28,midnight louie's pet detectives,carole nelson douglas,14,21,46.20316
29,"the seeress of kell (the malloreon, book 5)",david eddings,19,29,45.406554
30,"the path of daggers (the wheel of time, book 8)",robert jordan,22,34,44.844244
31,the courtship of princess leia (star wars),dave wolverton,15,24,43.315463
32,king of the murgos (malloreon (paperback rando...,david eddings,18,29,43.016735
33,second heaven,judith guest,14,23,42.185494
34,"winds of change (the mage winds, book 2)",mercedes lackey,12,20,41.582844


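### Promising Results, and Both Metrics (Ratio and Lift) Give the Same Ranking

For a fixed target group (e.g., Tolkien fans), the two metrics (lift and reader ratio) differ only by a constant factor, so they produce the same ranking, apart from the order of ties (which is why the tenth row differs between the two tables above). From here on we use only lift.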
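
The equivalence can be checked directly. The symbols below are my own notation: $T_a$ is `Target_Readers` and $G_a$ is `Global_Readers` for an author or book $a$, $n_T$ is the number of users in the target group, and $N$ is the number of all users. The derivation assumes that `compute_lift` estimates both probabilities as shares of users, which the printed numbers are consistent with.

$$
\text{Reader Ratio}(a) = \frac{T_a}{G_a},
\qquad
\text{Lift}(a) = \frac{T_a / n_T}{G_a / N} = \frac{N}{n_T}\cdot\frac{T_a}{G_a} = \frac{N}{n_T}\,\text{Reader Ratio}(a)
$$

The factor $N / n_T$ does not depend on $a$. In the author table, the `tolkien` row has a ratio of 1.0 and a lift of 69.30474, so $N / n_T \approx 69.30$; for `hobbs`, $0.476923 \times 69.30474 \approx 33.053$, which matches the printed lift.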

### Evaluation of the results
ChatGPT evaluation:

| Rank | Book Title                                      | Author                   | Relevance (1–5) | Justification                                                                                   |
|------|--------------------------------------------------|--------------------------|-----------------|--------------------------------------------------------------------------------------------------|
| 1    | Heretics of Dune (Dune Chronicles, Book 5)       | Frank Herbert            | 5               | Dune is a foundational sci-fi saga with deep lore and myth-making—appeals to Tolkien fans who appreciate vast, philosophical world-building. |
| 2    | Darkness and Light (Dragonlance: Preludes)       | Paul B. Thompson         | 4               | Dragonlance is a key fantasy universe; strong Tolkien overlap due to its roots in heroic quests and traditional fantasy archetypes. |
| 3    | Sapphire Rose (Elenium)                          | David Eddings            | 4               | Eddings is a mainstay of classic fantasy, with sword-and-sorcery vibes and moral clarity akin to Tolkien. |
| 4    | Midnight Louie’s Pet Detectives                  | Carole Nelson Douglas    | 1               | Cozy cat mysteries—humorous and modern; not fantasy or relevant to Tolkien-style readers.       |
| 5    | The Seeress of Kell (Malloreon, Book 5)          | David Eddings            | 4               | Same as above—Malloreon is a classic high fantasy series.                                        |
| 6    | The Path of Daggers (The Wheel of Time, Book 8)  | Robert Jordan            | 5               | Wheel of Time is perhaps the most direct spiritual successor to Tolkien’s epic narrative scope.  |
| 7    | The Courtship of Princess Leia (Star Wars)       | Dave Wolverton           | 3               | Sci-fi, but space opera with adventure and mythic tropes. May appeal to Tolkien fans who enjoy expansive universes. |
| 8    | King of the Murgos (Malloreon)                   | David Eddings            | 4               | Another entry in a Tolkien-adjacent fantasy series.                                              |
| 9    | Second Heaven                                    | Judith Guest             | 1               | Contemporary family drama—not genre fiction, no clear Tolkien connection.                       |
| 10   | Wedge’s Gamble (Star Wars: X-Wing Series, Book 2)| Michael A. Stackpole     | 3               | Military sci-fi in a beloved universe. Not Tolkien-esque, but fans of expansive lore may enjoy it. |


### Parameter Tuning

By tuning the `min_target_readers` parameter, we observe a spectrum of results:

- **High values** return generally popular books like **Harry Potter** or **Da Vinci Code**.
- **Medium to low** values highlight books that are more relevant to the task, such as other **fantasy novels**.
- **Very low values** may surface **irrelevant books** that were read by just a few readers who happen to be Tolkien fans.

A similar approach can be applied using ratings, although this would result in losing more than half of the available data.
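
The effect of the threshold can be reproduced on three invented entries. This is a plain-Python sketch with made-up counts, not the notebook's `compute_lift`; `rank_by_lift` and all numbers are assumptions for illustration.

```python
def rank_by_lift(counts, n_target, n_all, min_target_readers):
    """counts: {name: (target_readers, global_readers)} -> [(name, lift), ...]."""
    ranked = []
    for name, (t, g) in counts.items():
        if t < min_target_readers:
            continue  # too few target-group readers to trust the estimate
        lift = (t / n_target) / (g / n_all)
        ranked.append((name, round(lift, 2)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Invented counts: 100 target users out of 1000 users in total
counts = {
    "bestseller": (40, 400),     # read by everyone, including fans
    "fantasy novel": (12, 30),   # niche, but clearly over-represented among fans
    "obscure title": (2, 2),     # two readers, both of them fans by chance
}

for threshold in (1, 5, 30):
    print(threshold, rank_by_lift(counts, n_target=100, n_all=1000,
                                  min_target_readers=threshold))
```

With a threshold of 1 the two-reader title wins, with 5 the niche fantasy novel comes first, and with 30 only the bestseller is left.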


## Use Just "Geek Fans"

*The Lord of the Rings* and *The Hobbit* are widely read classics, often picked up by general readers—not just core Tolkien fans. Many people, including myself, read them simply because they were among the most popular books worldwide, alongside titles like *Harry Potter*.

If we want to tailor recommendations for someone who is **specifically** looking for deeper or more niche fantasy content, it might make sense to **exclude *The Lord of the Rings* and *The Hobbit*** from the training dataset. Instead, we can focus on less mainstream Tolkien works like **The Silmarillion**, which are more likely to be read by dedicated fans or "geek" readers. This could help surface more relevant fantasy recommendations for true genre enthusiasts.
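
Separating mainstream titles from the rest relies on matching keywords in book titles, and plain substring matching can over-match: for example, "king" is a substring of "making". The sketch below compares substring matching with word-boundary patterns on invented titles. The patterns and the `classify` helper are illustrative assumptions, not the keyword lists used in this notebook.

```python
import re

def classify(title, lotr_patterns, hobbit_patterns):
    """Label a title as 'Hobbit', 'LOTR', or 'Other' using regex patterns."""
    text = title.lower()
    if any(re.search(p, text) for p in hobbit_patterns):
        return "Hobbit"
    if any(re.search(p, text) for p in lotr_patterns):
        return "LOTR"
    return "Other"

# Substring-style patterns versus word-boundary patterns (both illustrative)
substring_lotr = ["ring", "king"]
boundary_lotr = [r"\brings?\b", r"\bking\b"]
hobbit = [r"\bhobbit\b"]

titles = [
    "the fellowship of the ring",
    "the lord of the rings",
    "the hobbit",
    "the making of a legend",   # invented title, not about the trilogy
]

for title in titles:
    print(f"{title!r}: substring={classify(title, substring_lotr, hobbit)}, "
          f"boundary={classify(title, boundary_lotr, hobbit)}")
```

Word boundaries avoid the "making" false positive, at the cost of having to list plural forms such as "rings" explicitly.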

In [12]:
# Define LOTR-related keywords 
lotr_keywords = ['lord', 'ring', 'fellowship', 'towers', 'return', 'king',
                 'seigneur', 'anneaux', 'anillos', 'senor', 'gefahrten',
                 'turme', 'wiederkehr', 'konigs', 'retour', 'roi', 'tours'] # english, french, german, spanish
hobbit_keywords = ['hobbit', 'hobit']

# Mark series type in Tolkien books ---
tolkien_books = books[books['Book-Author'].str.lower().str.contains('tolkien', na=False)].copy()
tolkien_books['series_group'] = 'Other'
tolkien_books.loc[tolkien_books['Book-Title'].str.lower().str.contains('|'.join(lotr_keywords), na=False), 'series_group'] = 'LOTR'
tolkien_books.loc[tolkien_books['Book-Title'].str.lower().str.contains('|'.join(hobbit_keywords), na=False), 'series_group'] = 'Hobbit'

# Merge with ratings ---
tolkien_ratings = ratings.merge(tolkien_books[['ISBN', 'series_group']], on='ISBN', how='inner')

# Identify geek Tolkien fans
# Users who read Tolkien books tagged as 'Other'
geek_tolkien_fans = tolkien_ratings[
    tolkien_ratings['series_group'] == 'Other'
]['User-ID'].unique()

# Compute author lift for geek Tolkien fans
geek_author_lift = compute_lift(
    ratings_df=ratings_with_authors,
    target_users=geek_tolkien_fans,
    entity_col='Author_Surname',
    min_target_readers=30,
    min_global_readers=50
)

# Show top results
geek_author_lift.head(10)

Unnamed: 0,Author_Surname,Target_Readers,Global_Readers,Lift
0,tolkien,304,1329,69.30474
1,silverberg,32,235,41.256887
2,weis,45,359,37.978027
3,clavell,32,258,37.578947
4,guin,49,433,34.28645
5,eddings,34,307,33.554817
6,ripley,30,290,31.342786
7,norton,32,315,30.778947
8,donaldson,30,301,30.197368
9,wilder,33,342,29.234938


In [13]:
# Book recommendation for Geek Tolkien fans using Lift (ISBN-based)
geek_book_lift = compute_lift(
    ratings_df=ratings,  # assuming this has 'ISBN' not 'Book-Title'
    target_users=geek_tolkien_fans,
    entity_col='ISBN',
    min_target_readers=10,
    min_global_readers=30
)

# Add book metadata (title + author)
geek_book_lift = geek_book_lift.merge(
    books[['ISBN', 'Book-Title', 'Book-Author']].drop_duplicates(), on='ISBN', how='left'
)

# Filter out Tolkien books
tolkien_isbns = books[books['Book-Author'].str.lower().str.contains('tolkien', na=False)]['ISBN'].unique()
geek_book_lift = geek_book_lift[~geek_book_lift['ISBN'].isin(tolkien_isbns)]

# Display top results
print("Top books by Lift for Geek Tolkien Fans (unique by ISBN):")
geek_book_lift[['ISBN', 'Book-Title', 'Book-Author', 'Target_Readers', 'Global_Readers', 'Lift']].head(10)

Top books by Lift for Geek Tolkien Fans (unique by ISBN):


Unnamed: 0,ISBN,Book-Title,Book-Author,Target_Readers,Global_Readers,Lift
1,812513754,"lord of chaos (the wheel of time, book 6)",robert jordan,10,36,84.161184
2,64408639,the austere academy (a series of unfortunate e...,lemony snicket,10,41,73.897625
3,64407691,the miserable mill (a series of unfortunate ev...,lemony snicket,10,42,72.138158
4,61020680,mort (discworld novels (paperback)),terry pratchett,10,42,72.138158
5,345313151,bearing an hourglass (incarnations of immortal...,piers anthony,13,55,71.613517
6,812550307,"the fires of heaven (the wheel of time, book 5)",robert jordan,12,51,71.289474
7,345362861,dragonsdawn,anne mccaffrey,14,61,69.536454
8,375826688,"eragon (inheritance, book 1)",christopher paolini,14,61,69.536454
9,451456521,lady of avalon,marion zimmer bradley,10,44,68.859151
10,553562738,doomsday book,connie willis,15,69,65.865275


### No Visibly Better Results

While the idea of focusing only on readers of Tolkien's lesser-known works (e.g., *The Silmarillion*) is conceptually sound, it did not lead to noticeably better results (though what "better" means here is itself an open question). The main limitation is the relatively small number of readers of these books, which reduces the quality and robustness of the recommendations.


## Lectures, Research, Theory...

From my lectures for *Internet and Classification Tasks* (a course I also teach): **K-Nearest Neighbors (KNN)** is typically used for this purpose.  

In particular, it is widely applied in **collaborative filtering**, where an **item-user pivot matrix** is constructed from **readership behavior** or **explicit ratings**.  

We use KNN to compare **books** (not users) — identifying similar titles based on how users interact with them.

For more context, see the [Wikipedia article on Recommender Systems](https://en.wikipedia.org/wiki/Recommender_system).
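
To make the item-based idea concrete, here is a minimal plain-Python sketch of cosine similarity between books, where each book is a sparse vector of user ratings. The data and the helper names `cosine` and `nearest_books` are invented for illustration. Omitting a user from a vector is equivalent to a 0 in a dense item-user matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {user: rating}."""
    dot = sum(rating * v[user] for user, rating in u.items() if user in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_books(target, book_vectors, k=3):
    """Return the k books most similar to `target`, best first."""
    scores = [
        (other, cosine(book_vectors[target], vector))
        for other, vector in book_vectors.items()
        if other != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]

# Toy item-user data: book -> {user_id: rating}
book_vectors = {
    "fellowship": {1: 10, 2: 9, 3: 8},
    "two_towers": {1: 10, 2: 9},
    "cookbook": {3: 1, 4: 7},
}

for book, score in nearest_books("fellowship", book_vectors, k=2):
    print(f"{book}: {score:.2f}")
```

Books rated by the same users with similar scores end up close to each other, while a book that shares only one low rating with the target scores near zero.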


In [14]:
# Keep only books with at least `min_readers` ratings
min_readers = 2
book_counts = ratings['ISBN'].value_counts()
popular_isbns = book_counts[book_counts >= min_readers].index
filtered_ratings = ratings[ratings['ISBN'].isin(popular_isbns)]

# Annotate books as LOTR, HOBBIT, OTHER (OTHER even for non-Tolkien books)
books = books.copy()
books['series_group'] = 'Other'
books.loc[books['Book-Title'].str.contains('|'.join(lotr_keywords), case=False, na=False), 'series_group'] = 'LOTR'
books.loc[books['Book-Title'].str.contains('|'.join(hobbit_keywords), case=False, na=False), 'series_group'] = 'Hobbit'

# Identify Tolkien books
tolkien_isbns = books[books['Book-Author'].str.lower().str.contains('tolkien', na=False)]['ISBN'].unique()
tolkien_books = books[books['ISBN'].isin(tolkien_isbns)]
tolkien_other_isbns = tolkien_books.loc[tolkien_books['series_group'] == 'Other', 'ISBN'].unique()


# Get all ratings from Tolkien fans
tolkien_fan_ratings = ratings[ratings['User-ID'].isin(tolkien_fans)]

# Find users who read Tolkien "Other" books
geek_tolkien_fans = ratings[ratings['ISBN'].isin(tolkien_other_isbns)]['User-ID'].unique()


In [15]:
book_user_matrix = tolkien_fan_ratings.pivot_table(
    index="ISBN",
    columns="User-ID",
    values='Book-Rating',
    fill_value=0
)

In [16]:
# Get the vector for the target book (as 2D array)
target_vector = book_user_matrix.loc[['0345339703']]  # selected one LOTR book (Fellowship) from 1986 

# Get all other vectors
other_vectors = book_user_matrix.drop(index='0345339703')

# Compute cosine similarities
similarities = cosine_similarity(target_vector, other_vectors)[0]

# Create a Series with ISBNs as index and similarities as values
similarity_series = pd.Series(similarities, index=other_vectors.index)

# Sort and take top 10
top_similar_books = similarity_series.sort_values(ascending=False).head(10)

# Reset index to turn ISBNs into a column
top_similar_books_df = top_similar_books.reset_index()
top_similar_books_df.columns = ['ISBN', 'Similarity']

# Merge with book metadata
top_similar_books_named = top_similar_books_df.merge(
    books[['ISBN', 'Book-Title', 'Book-Author']],
    on='ISBN',
    how='left'
)

# Round similarity values
top_similar_books_named['Similarity'] = top_similar_books_named['Similarity'].round(2)

# Display the results
top_similar_books_named = top_similar_books_named[['ISBN','Book-Title', 'Book-Author', 'Similarity']]
print(top_similar_books_named)

         ISBN                                         Book-Title  \
0  0345339738  the return of the king (the lord of the rings,...   
1  0345339711     the two towers (the lord of the rings, part 2)   
2  0345339681  the hobbit : the enchanting prelude to the lor...   
3  068485080X         diana her true story commemorative edition   
4  0345397819  lasher: lives of the mayfair witches (lives of...   
5  0812550706      ender's game (ender wiggins saga (paperback))   
6  096463161X                              conceived without sin   
7  0553288202                               the fall of hyperion   
8  0764502611                             windows 98 for dummies   
9  0688163424                                        to timbuktu   

          Book-Author  Similarity  
0      j.r.r. tolkien        0.39  
1      j.r.r. tolkien        0.38  
2      j.r.r. tolkien        0.22  
3       andrew morton        0.16  
4           anne rice        0.14  
5    orson scott card        0.14  

### Results make sense (and are quite funny as well :))
Readers who enjoyed *The Fellowship of the Ring* also tended to read and enjoy *The Two Towers* and *The Return of the King*.

However, a key limitation is that this analysis used only one ISBN, which may not capture all editions or formats of the same book.

### Idea – Merge All Tolkien Books into One
Treat all of Tolkien’s works (*The Hobbit*, *The Lord of the Rings* trilogy, *The Silmarillion*, and others) as a single unified book, regardless of edition, publisher, or language.
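
A small worked example of how such a merged vector can be built, assuming each user's entry is the mean of their non-zero ratings across all editions. The matrix `editions` and its labels are hypothetical and only illustrate the averaging step.

```python
import numpy as np
import pandas as pd

# Hypothetical mini matrix: rows are editions, columns are users, 0 = not rated
editions = pd.DataFrame(
    [[10, 0, 0],
     [8, 6, 0]],
    index=["edition_1", "edition_2"],
    columns=["user_a", "user_b", "user_c"],
)

# Treat 0 as missing so it does not pull the average down,
# then put 0 back for users who rated none of the editions
composite = editions.replace(0, np.nan).mean(axis=0).fillna(0)
print(composite)
```

Here `user_a` gets (10 + 8) / 2 = 9, `user_b` gets 6 from the single edition they rated, and `user_c` stays at 0.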

In [17]:


# Sub-matrix for Tolkien books only
tolkien_matrix = book_user_matrix.loc[book_user_matrix.index.isin(tolkien_isbns)]

# Create a composite vector: each user's average non-zero rating across Tolkien books
tolkien_vector = tolkien_matrix.replace(0, np.nan).mean(axis=0).fillna(0).values.reshape(1, -1)

# Compare this vector to all books (Tolkien titles are dropped from the results below)
other_vectors = book_user_matrix
cosine_similarities = cosine_similarity(tolkien_vector, other_vectors)[0]

# Create a Series with ISBNs as index
similarity_series = pd.Series(cosine_similarities, index=book_user_matrix.index)

# Optional: Drop Tolkien books from results
similarity_series = similarity_series.drop(index=tolkien_isbns, errors='ignore')

# Get top N most similar books
top_similar_books = similarity_series.sort_values(ascending=False).head(20).reset_index()
top_similar_books.columns = ['ISBN', 'Similarity']

# Merge with book metadata
top_similar_books_named = top_similar_books.merge(
    books[['ISBN', 'Book-Title', 'Book-Author']],
    on='ISBN',
    how='left'
)

# Final result
top_similar_books_named = top_similar_books_named[['ISBN', 'Book-Title', 'Book-Author', 'Similarity']]
print(top_similar_books_named)

          ISBN                                         Book-Title  \
0   0439064864   harry potter and the chamber of secrets (book 2)   
1   043935806X  harry potter and the order of the phoenix (boo...   
2   0439136350  harry potter and the prisoner of azkaban (book 3)   
3   059035342X  harry potter and the sorcerer's stone (harry p...   
4   0439139597       harry potter and the goblet of fire (book 4)   
5   0385504209                                  the da vinci code   
6   0590353403     harry potter and the sorcerer's stone (book 1)   
7   0439064872   harry potter and the chamber of secrets (book 2)   
8   0345313860   the vampire lestat (vampire chronicles, book ii)   
9   0345337662                         interview with the vampire   
10  0439136369  harry potter and the prisoner of azkaban (book 3)   
11  0451524934                                               1984   
12  0380002930                                     watership down   
13  0441172717                    

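## Popularity bias

The top of the list is dominated by bestsellers (*Harry Potter*, *The Da Vinci Code*). It may be relevant for users looking for popular, somewhat fantasy-leaning books, but probably not for core Tolkien fans.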
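
One way to see the bias is to compare a ranking by raw fan readership with a ranking by a lift-style ratio. The sketch below uses made-up counts and a generic lift definition (share among fans divided by share among all users); it is not necessarily the `compute_lift` from `../src`, whose definition is not shown here.

```python
import pandas as pd

# Hypothetical reader counts, not taken from the Book-Crossing data
counts = pd.DataFrame(
    {"fan_readers": [40, 12], "all_readers": [2000, 60]},
    index=["bestseller", "niche_fantasy"],
)
n_fans, n_users = 100, 10_000  # assumed group sizes

# Share of readers within each group
fan_share = counts["fan_readers"] / n_fans
overall_share = counts["all_readers"] / n_users

# Generic lift: how over-represented the book is among fans
counts["lift"] = fan_share / overall_share

ranked_by_count = counts.sort_values("fan_readers", ascending=False).index.tolist()
ranked_by_lift = counts.sort_values("lift", ascending=False).index.tolist()
print(ranked_by_count, ranked_by_lift)
```

With these numbers the bestseller wins on raw count, but the niche title is far more over-represented among fans, so the two rankings disagree.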

### Same approach for lesser-known Tolkien books such as *The Silmarillion*

We can apply the same strategy to users who have read Tolkien’s lesser-known works to see if their behavior reveals more specific and targeted book recommendations. The analysis continues in notebook 04, where the code will be refactored for robustness.