## Final Base for Recommendation (Lift & KNN)

This notebook contains the final version of the code used for book recommendation based on collaborative filtering.  
Two main approaches are demonstrated:

- **Lift-based recommendation**
- **K-Nearest Neighbors (KNN)**  


### Data Filtering and Preprocessing

In [1]:
import pandas as pd
import numpy as np
import sys

sys.path.append('../src')  
from functions_notebook_04 import split_custom_segment, display_recommendations_knn, compute_lift 
from metrics_notebook_04 import compute_knn_similarity, compute_knn_author_similarity

# Load raw data
books = pd.read_csv('../data/Books.csv', dtype={'ISBN': str}, low_memory=False)
users = pd.read_csv('../data/Users.csv')
ratings = pd.read_csv('../data/Ratings.csv')

# Clean books
books['ISBN'] = books['ISBN'].str.strip()
books['Book-Title'] = books['Book-Title'].astype(str).str.strip().str.lower()
books['Book-Author'] = books['Book-Author'].astype(str).str.strip().str.lower()
books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'Publisher'], inplace=True)

# Strip everything except letters, commas, and whitespace from Book-Author
books['Book-Author'] = books['Book-Author'].str.replace(r'[^a-zA-Z,\s]', '', regex=True).str.strip()

# Extract the surname into a new column (Book-Author itself is kept)
def extract_surname(author):
    if pd.isna(author) or author.strip() == '':
        return np.nan
    if ',' in author:
        return author.split(',')[0].strip()
    parts = author.split()
    return parts[-1] if parts else np.nan

books['Author-Surename'] = books['Book-Author'].apply(extract_surname)

# Clean users and assign age group
def assign_age_group(age):
    if pd.isna(age) or age < 6 or age > 99:
        return 'unknown'
    elif 6 <= age <= 12:
        return 'child'
    elif 13 <= age <= 17:
        return 'teen'
    elif 18 <= age <= 24:
        return 'young adult'
    elif 25 <= age <= 39:
        return 'adult'
    elif 40 <= age <= 59:
        return 'middle-aged'
    else:
        return 'senior'

users['Age_Group'] = users['Age'].apply(assign_age_group)
users.drop(columns=['Age', 'Location'], inplace=True)


# Fill in missing age groups using the idea from notebook_03 (not implemented here)


# Clean ratings
ratings = ratings[
    ratings['User-ID'].isin(users['User-ID']) &
    ratings['ISBN'].isin(books['ISBN'])
]


# Merge all into one clean DataFrame
ratings_clean = ratings.merge(users[['User-ID', 'Age_Group']], on='User-ID', how='left')
ratings_clean = ratings_clean.merge(books, on='ISBN', how='left')

# Year from object to float64
ratings_clean['Year-Of-Publication'] = pd.to_numeric(ratings_clean['Year-Of-Publication'], errors='coerce')

In [2]:
# Count how many ratings each user gave
user_rating_counts = ratings_clean['User-ID'].value_counts()

# Keep only users with at least 2 ratings; users with fewer are irrelevant for the recommendation system
# (they could still be relevant for estimating missing ages or ratings)
active_users = user_rating_counts[user_rating_counts >= 2].index

# Filter the dataset
ratings_clean = ratings_clean[ratings_clean['User-ID'].isin(active_users)]

### Lift-Based Book Recommendations

You start by selecting:
- A specific **author** (e.g., `Tolkien`)
- Optionally a **title keyword** (e.g., `"Lord"` to target *Lord of the Rings*)
- Optionally an **age group** (e.g., `"teen"`)

From this input, we identify a **target segment of readers** and compare their behavior to the rest of the population. Using the `compute_lift()` function, we calculate Lift values for books.

To discover **new authors**, we filter out books from the target author and sort by highest Lift. This surfaces books that are disproportionately popular among readers similar to those who read the selected author or title.

Optionally, we can limit the output to show only the **top book from each author**, helping diversify the recommendations across authors while keeping only their most relevant title.
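
The implementation of `compute_lift()` lives in `../src` and is not shown in this notebook. As a rough illustration of the idea only, the sketch below assumes Lift is the share of segment readers who read a book divided by the share of all readers who read it. The function `lift_by_readers`, its parameters, and the toy data are hypothetical and are not the project's code. The last lines show the "top book per author" step described above.

```python
import pandas as pd

def lift_by_readers(ratings, segment_user_ids, min_target_support=1):
    """Hypothetical stand-in for compute_lift(..., mode="read"); not the project's code."""
    segment_user_ids = list(segment_user_ids)
    n_segment = len(set(segment_user_ids))
    n_global = ratings["User-ID"].nunique()

    # Unique readers per book, overall and among segment users only
    global_readers = ratings.groupby("ISBN")["User-ID"].nunique()
    segment_readers = (
        ratings[ratings["User-ID"].isin(segment_user_ids)]
        .groupby("ISBN")["User-ID"]
        .nunique()
    )

    out = pd.DataFrame({
        "Author-Surename": ratings.groupby("ISBN")["Author-Surename"].first(),
        "Segment_Readers": segment_readers,
        "Global_Readers": global_readers,
    }).dropna(subset=["Segment_Readers"])  # drop books nobody in the segment read
    out = out[out["Segment_Readers"] >= min_target_support]

    # Lift = share of segment readers who read the book / share of all readers who did
    out["Lift"] = (out["Segment_Readers"] / n_segment) / (out["Global_Readers"] / n_global)
    return out.sort_values("Lift", ascending=False)

# Toy data (made up): users 1 and 2 form the "Tolkien" segment
toy = pd.DataFrame(
    [(1, "A", "tolkien"), (1, "B", "herbert"), (1, "D", "herbert"),
     (2, "A", "tolkien"), (2, "B", "herbert"), (2, "F", "jordan"),
     (3, "C", "king"), (3, "D", "herbert"), (3, "F", "jordan"),
     (4, "B", "herbert"), (4, "C", "king")],
    columns=["User-ID", "ISBN", "Author-Surename"],
)

lift = lift_by_readers(toy, segment_user_ids=[1, 2])

# Discover new authors: drop the target author, then keep each author's best title.
# The frame is already sorted by Lift, so keep="first" retains the top book.
others = lift[lift["Author-Surename"] != "tolkien"]
top_per_author = others.drop_duplicates(subset="Author-Surename", keep="first")
print(top_per_author)
```

A Lift above 1 means the book is over-represented among the segment's readers. Books with very few global readers can reach high Lift by chance, which is presumably why the notebook sets `min_target_support=20`.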

In [3]:
author = ["Tolkien"]
title = ["Lord"]

[target_df, rest_df, target_user_ids] = split_custom_segment(
    ratings_df=ratings_clean,
    author_name=author
    #title_keyword=title
    # age_group=["teen"]
)

lift_df = compute_lift(target_df, ratings_clean, min_target_support=20, mode="read")

# Filter out books from the target author
non_target_lift_df = lift_df[~lift_df['Author-Surename'].str.lower().isin([a.lower() for a in author])]

# Sort by Lift descending and show top results
top_lift_books = non_target_lift_df.sort_values(by="Lift", ascending=False).head(20)

# Display result
display(top_lift_books)

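# Keep only the top book per author (sort by Lift first so that keep="first" retains the best title)
#top_lift_books_unique = non_target_lift_df.sort_values(by="Lift", ascending=False).drop_duplicates(subset="Author-Surename", keep="first").head(20)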

# Display result
#display(top_lift_books_unique)

| ISBN | Book-Title | Author-Surename | Segment_Readers | Global_Readers | Lift |
|---|---|---|---|---|---|
| 441328008 | heretics of dune (dune chronicles, book 5) | herbert | 21.0 | 29 | 24.644828 |
| 812550293 | the path of daggers (the wheel of time, book 8) | jordan | 22.0 | 34 | 22.021569 |
| 441294677 | god emperor of dune (dune chronicles, book 4) | herbert | 20.0 | 31 | 21.956989 |
| 812513754 | lord of chaos (the wheel of time, book 6) | jordan | 21.0 | 35 | 20.42 |
| 446357421 | if tomorrow comes | sheldon | 20.0 | 34 | 20.019608 |
| 64408647 | the ersatz elevator (a series of unfortunate e... | snicket | 20.0 | 35 | 19.447619 |
| 812513738 | the shadow rising (the wheel of time, book 4) | jordan | 25.0 | 45 | 18.907407 |
| 441172695 | dune messiah (dune chronicles, book 2) | herbert | 33.0 | 60 | 18.718333 |
| 441104029 | children of dune (dune chronicles, book 3) | herbert | 28.0 | 52 | 18.325641 |
| 345313151 | bearing an hourglass (incarnations of immortal... | anthony | 29.0 | 54 | 18.27716 |


### KNN-Based Book Recommendations

This section provides personalized book recommendations using **K-nearest neighbors (KNN)** similarity. You define a **target segment of users** by filtering readers based on:
- Specific **author(s)** (e.g., `"Tolkien"`)
- Optionally a **book title keyword**
- Optionally an **age group**

We analyze the reading behavior of this segment and compute their similarity with other books using either of two modes:

- **`mode="rating"`**: Uses average rating patterns — recommends books whose readers rate similarly to your segment.
- **`mode="read"`**: Uses binary data (read vs. not read) — recommends books that tend to be read by similar users.

In both cases, we build a book-user matrix, compute cosine similarity between each book and the aggregated behavior of the target segment, and return the top-N most similar books.
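
The actual `compute_knn_similarity()` is imported from `../src` and not reproduced here. The sketch below illustrates only the `mode="read"` idea, and it assumes that the "aggregated behavior" of the segment is an indicator vector over users (1 for segment members, 0 otherwise). The function name `book_similarity_to_segment` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def book_similarity_to_segment(ratings, segment_user_ids, top_n=5):
    """Hypothetical sketch of read-mode similarity; not the project's code."""
    # Binary book-user matrix: 1 if the user rated the book at all, else 0
    matrix = pd.crosstab(ratings["ISBN"], ratings["User-ID"]).clip(upper=1)

    # Assumed segment profile: 1 for users in the segment, aligned to the matrix columns
    profile = matrix.columns.isin(list(segment_user_ids)).astype(float)

    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1) * np.linalg.norm(profile)

    # Cosine similarity per book; books with a zero norm get similarity 0
    sims = np.divide(values @ profile, norms, out=np.zeros(len(values)), where=norms > 0)
    return (
        pd.Series(sims, index=matrix.index, name="Similarity")
        .sort_values(ascending=False)
        .head(top_n)
    )

# Toy data (made up): users 1 and 2 form the target segment
toy = pd.DataFrame(
    [(1, "A"), (1, "B"), (1, "D"),
     (2, "A"), (2, "B"),
     (3, "C"), (3, "D"),
     (4, "B"), (4, "C")],
    columns=["User-ID", "ISBN"],
)

sims = book_similarity_to_segment(toy, segment_user_ids=[1, 2])
print(sims)
```

In this toy example, book `A` is read by exactly the segment and scores highest, while book `C` shares no readers with the segment and scores 0. A rating-based variant would fill the matrix with ratings instead of 0/1 values.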


In [4]:
author = ["Tolkien"]
title = ["Lord of the Rings"]

[target_df, rest_df, target_user_ids] = split_custom_segment(
    ratings_df=ratings_clean,
    author_name=author,
    title_keyword=title
    # age_group=["teen"]
)
recommendations=compute_knn_similarity(target_df, ratings_clean, min_readers=20, top_n=200, mode="rating")

### This is how a nice answer should look :)

In [5]:
display_recommendations_knn(recommendations, author, title)


📚 Recommendation Summary:
If you just read **Lord of the Rings** by **Tolkien**, you might also enjoy:

🔁 More from Tolkien:
- the hobbit: or there and back again (score: 0.159)
- the hobbit (leatherette collector's edition) (score: 0.097)
- the silmarillion (score: 0.097)
- the book of lost tales 1 (the history of middle-earth - volume 1) (score: 0.086)
- unfinished tales: the lost lore of middle-earth (score: 0.080)

🎯 Books from other authors:
- j k rowling — harry potter and the chamber of secrets (book 2) (score: 0.152)
- stephen king — wizard and glass (the dark tower, book 4) (score: 0.095)
- anne rice — the tale of the body thief (vampire chronicles (paperback)) (score: 0.094)
- e b white — charlotte's web (trophy newbery) (score: 0.094)
- lemony snicket — the wide window (a series of unfortunate events, book 3) (score: 0.086)
- isaac asimov — second foundation (foundation novels (paperback)) (score: 0.083)
- thomas harris — silence of the lambs (score: 0.083)
- frank herbert 

### KNN for Author Recommendations

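The author-level results show the same issues as the book-level ones: popularity bias (discussed in the next sections) and messy data (for example, `staff` appears as a surname in the output below).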

In [6]:
recommendations=compute_knn_author_similarity(target_df, ratings_clean, min_readers=20, top_n=200, mode="rating")
recommendations.head(5)

|  | Author-Surename | Similarity |
|---|---|---|
| 4308 | tolkien | 0.690732 |
| 3746 | rowling | 0.170762 |
| 2577 | lewis | 0.128816 |
| 2327 | king | 0.120797 |
| 4107 | staff | 0.111187 |


### Strong Popularity Bias
Since similarity is computed directly from overlapping user interactions, very popular books (e.g., by J.K. Rowling, Stephen King, or C.S. Lewis) tend to dominate the recommendations.
These titles are recommended frequently not necessarily due to thematic similarity, but simply because they appear in many users’ histories.

### Idea – Penalize Popularity
 
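We chose post-processing for its simplicity and flexibility: it requires no major code changes. There is still a lot of room for improvement, for example by penalizing the pivot matrix directly. The adjusted score is `Adjusted_Similarity = Similarity / log(1 + Num_Readers) ** penalty_strength`, where `Num_Readers` is the number of unique readers of the book and `log` is the natural logarithm.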
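
To make the effect of `penalty_strength` concrete, here is a minimal sketch of the adjustment applied to two made-up items. The helper name `adjusted_similarity` and the numbers are illustrative assumptions; only the formula itself comes from this notebook.

```python
import numpy as np

def adjusted_similarity(similarity, num_readers, penalty_strength=1.0):
    """Divide similarity by log(1 + num_readers) raised to penalty_strength."""
    return similarity / np.log1p(num_readers) ** penalty_strength

# Made-up example: a very popular book vs. a niche book
popular = {"similarity": 0.15, "num_readers": 4000}
niche = {"similarity": 0.10, "num_readers": 100}

for strength in (0.0, 1.0, 1.5):
    p = adjusted_similarity(popular["similarity"], popular["num_readers"], strength)
    n = adjusted_similarity(niche["similarity"], niche["num_readers"], strength)
    winner = "popular" if p > n else "niche"
    print(f"penalty_strength={strength}: popular={p:.5f}, niche={n:.5f} -> {winner} ranks first")
```

With `penalty_strength=0` the scores are unchanged and the popular book ranks first. At the strengths 1.0 and 1.5 used in this example, the niche book overtakes it, because the logarithm grows with the number of readers. The logarithm keeps the penalty mild compared with dividing by the raw reader count.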


In [7]:
author = ["Tolkien"]
penalty_strength = 1.5  # 0 = no penalization; higher values penalize popular books more strongly

[target_df, rest_df, target_user_ids] = split_custom_segment(
    ratings_df=ratings_clean,
    author_name=author,
    #title_keyword=["Silma"]  
    age_group=["teen"]      
)

# Get base recommendations using KNN
recommendations = compute_knn_similarity(
    target_df=target_df,
    full_df=ratings_clean,
    min_readers=20,
    top_n=200,
    mode="rating"
)

# Remove books by the target author
#recommendations = recommendations[
#    ~recommendations['Author-Surename'].str.lower().isin([a.lower() for a in author])
#].copy()

# Compute popularity (unique readers per book)
popularity = ratings_clean.groupby('ISBN')['User-ID'].nunique().rename('Num_Readers')

# Merge popularity into recommendations
recommendations = recommendations.merge(popularity, on='ISBN', how='left')

# Apply log-based penalty to similarity scores
recommendations['Adjusted_Similarity'] = recommendations['Similarity'] / (
    np.log1p(recommendations['Num_Readers']) ** penalty_strength
)

# Sort by adjusted similarity
recommendations = recommendations.sort_values(by='Adjusted_Similarity', ascending=False)

# Display final result using pretty print
display_recommendations_knn(recommendations_df=recommendations, author_name=author)


📚 Recommendation Summary:
If you just read books by **Tolkien**, you might also enjoy:

🔁 More from Tolkien:
- unfinished tales: the lost lore of middle-earth (score: 0.137)
- the hobbit : the enchanting prelude to the lord of the rings (score: 0.133)
- the return of the king (the lord of the rings, part 3) (score: 0.131)
- the silmarillion (score: 0.120)
- the fellowship of the ring (the lord of the rings, part 1) (score: 0.113)

🎯 Books from other authors:
- lemony snicket — the hostile hospital (a series of unfortunate events, book 8) (score: 0.117)
- laurence yep — dragonwings : golden mountain chronicles: 1903 (golden mountain chronicles) (score: 0.102)
- lois lowry — the giver (readers circle) (score: 0.100)
- christopher paolini — eragon (inheritance, book 1) (score: 0.094)
- ursula k le guin — a wizard of earthsea (pelican books) (score: 0.093)
- j k rowling — harry potter and the chamber of secrets (book 2) (score: 0.092)
- margaret weis — dragons of winter night (dragonlance

### Penalized Author Recommendations

In [8]:
author = ["Tolkien"]
penalty_strength = 1  # how much to penalize popular authors

# Select target segment
[target_df, rest_df, target_user_ids] = split_custom_segment(
    ratings_df=ratings_clean,
    author_name=author
    #age_group=["teen"] 
)

# Compute base author similarities
recommendations = compute_knn_author_similarity(
    target_df=target_df,
    full_df=ratings_clean,
    min_readers=100,
    top_n=200,
    mode="rating"
)

# Filter out the target authors
recommendations = recommendations[
    ~recommendations['Author-Surename'].str.lower().isin([a.lower() for a in author])
].copy()

# Compute popularity = number of unique readers per author
author_popularity = (
    ratings_clean.groupby('Author-Surename')['User-ID']
    .nunique()
    .rename('Num_Readers')
)

# Merge popularity into recommendations
recommendations = recommendations.merge(author_popularity, on='Author-Surename', how='left')

# Penalize using log-based adjustment
recommendations['Adjusted_Similarity'] = recommendations['Similarity'] / (
    np.log1p(recommendations['Num_Readers']) ** penalty_strength
)

# Sort by penalized score
recommendations = recommendations.sort_values(by='Adjusted_Similarity', ascending=False)

display(recommendations[['Author-Surename', 'Similarity', 'Num_Readers', 'Adjusted_Similarity']].head(20))

|  | Author-Surename | Similarity | Num_Readers | Adjusted_Similarity |
|---|---|---|---|---|
| 0 | rowling | 0.181841 | 1437 | 0.025009 |
| 3 | herbert | 0.121922 | 679 | 0.018694 |
| 2 | lewis | 0.131608 | 1234 | 0.018487 |
| 8 | staff | 0.098312 | 242 | 0.017897 |
| 28 | chaucer | 0.083806 | 119 | 0.017505 |
| 6 | asimov | 0.106738 | 632 | 0.016547 |
| 12 | na | 0.097206 | 362 | 0.016491 |
| 1 | king | 0.135519 | 3979 | 0.016349 |
| 20 | watterson | 0.088021 | 222 | 0.016279 |
| 4 | adams | 0.118159 | 1839 | 0.015718 |


### Final Summary

To answer the question “I like Lord of the Rings, what else should I read?” we developed two recommendation approaches: Lift-based and KNN-based. Both produced promising results. While it’s difficult to objectively assess which is better without proper evaluation, the lift model felt slightly more relevant—though this may reflect better luck with parameter tuning rather than inherent superiority.

It’s worth noting that the underlying data was often noisy, inconsistent, and only lightly preprocessed. This limited the overall reliability and accuracy of the models, and further cleaning could significantly improve recommendation quality.