# Sentiment Analysis for Book Recommendations

In this notebook, we'll use Large Language Models (LLMs) to perform sentiment analysis on book descriptions. This will help us determine the emotional tone of books, which can be used as an additional feature in our recommendation system.

## Why Sentiment Analysis?

By extracting emotional content from book descriptions, we can:
- Allow users to filter books based on their desired emotional tone
  - Someone looking for an exciting read might choose something suspenseful
  - Someone wanting to be cheered up might choose something joyful
- Provide an additional degree of control for users in our recommender system

## Our Approach: Fine-tuned Models

We'll classify text into **7 discrete emotion categories**:
1. **Anger**
2. **Disgust** 
3. **Fear**
4. **Joy**
5. **Sadness**
6. **Surprise**
7. **Neutral** (for text without emotional content)

### Fine-tuning vs Zero-shot Classification

Instead of using zero-shot classification, we're using a **fine-tuned model**. Here's how fine-tuning works:

1. Start with a pre-trained model (like RoBERTa) with its encoder layers intact
2. Remove the original final layers (used for masked word prediction)
3. Replace them with new layers designed for emotion classification
4. Train on a labeled emotion dataset
5. The result is a model that keeps its general language understanding while learning emotion-specific patterns

This gives us an LLM specifically designed for emotion classification tasks.
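As a rough illustration of what the new classification layers produce: the head emits one raw score (a logit) per emotion, and a softmax turns those scores into probabilities that sum to 1. The sketch below uses only the standard library, and the logits are made-up values for illustration, not output from the actual model.

```python
import math

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence (illustrative values only)
example_logits = [0.1, -1.2, -2.0, 4.5, 0.3, -0.5, 0.8]
probabilities = dict(zip(EMOTIONS, softmax(example_logits)))

# The predicted emotion is the label with the highest probability
top_emotion = max(probabilities, key=probabilities.get)
print(top_emotion, round(probabilities[top_emotion], 3))
```

This is why every prediction we get back later contains seven scores, one per emotion, rather than a single label.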

## Load the Data

First, let's load our book dataset that contains the predicted categories from previous steps.

In [32]:
import pandas as pd

books = pd.read_csv('books_with_categories.csv', encoding='utf-8')

## Setting Up the Emotion Classification Model

We're using a fine-tuned DistilRoBERTa model (a smaller, distilled version of RoBERTa) from Hugging Face: `j-hartmann/emotion-english-distilroberta-base`

**Model Details:**
- Fine-tuned specifically for 6 basic emotions + neutral class
- Evaluation accuracy: **66%**, well above the random-chance baseline of about 14% (1 in 7 classes)
- Well-established model with good performance metrics

**Configuration:**
- `top_k=None`: Returns all emotion probabilities (not just the top prediction)
- `device=0`: Runs on the first GPU for faster processing (use `device=-1` to run on CPU)
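If you are not sure whether a GPU is available, one option is to pick the device at runtime. The helper below is a minimal sketch and not part of the original notebook: `pick_device` is a hypothetical name, and it assumes PyTorch is the backend (falling back to CPU if it cannot be imported).

```python
def pick_device():
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: the pipeline runs on the PyTorch backend
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1

device = pick_device()
print(f"Using device={device}")
```

The resulting value can be passed as `device=device` when creating the pipeline.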

In [33]:
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base",
                      top_k=None,
                      device=0)  # Use device=-1 for CPU

# Test the classifier
classifier("I love this!")

Device set to use cuda:0


[[{'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'surprise', 'score': 0.008528691716492176},
  {'label': 'neutral', 'score': 0.0057645998895168304},
  {'label': 'anger', 'score': 0.004419785924255848},
  {'label': 'sadness', 'score': 0.0020923952106386423},
  {'label': 'disgust', 'score': 0.0016119939973577857},
  {'label': 'fear', 'score': 0.0004138521908316761}]]

## Choosing the Right Granularity: Sentence vs. Whole Description

We need to decide at what level to apply sentiment analysis:

### Option 1: Whole Description
- Analyze the entire book description as one piece
- May lose nuanced emotional information

### Option 2: Sentence-by-Sentence (Our Choice)
- Split description into individual sentences
- Analyze each sentence separately
- Capture more variety and nuanced emotions
- Take maximum probability for each emotion across all sentences
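One practical detail of splitting on `"."`: it leaves behind empty or whitespace-only fragments (for example after the final period), and it fails on missing descriptions. A small helper can tidy this up. This is a sketch under assumptions, not the notebook's method: `split_sentences` is a hypothetical name, and a plain period split will still break on abbreviations such as "Dr.".

```python
def split_sentences(description):
    """Split on periods, trim whitespace, and drop empty fragments."""
    if not isinstance(description, str):
        return []  # e.g. a missing value (NaN) loaded by pandas
    return [part.strip() for part in description.split(".") if part.strip()]

sample = "A quiet town. A terrible secret.  "
print(split_sentences(sample))
```

The cells in this notebook use the plain `.split(".")` call, so their sentence lists can include an empty trailing fragment.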

Let's test both approaches to see the difference:

In [34]:
# Look at the first book description
books["description"][0]

'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world ha

In [35]:
# Approach 1: Classify entire description
print("=== WHOLE DESCRIPTION ANALYSIS ===")
whole_result = classifier(books["description"][0])
# With top_k=None the pipeline returns a list of label/score dicts per input,
# so whole_result[0] is a list; pick the entry with the highest score.
top_emotion = max(whole_result[0], key=lambda x: x["score"])
print(f"Dominant emotion: {top_emotion['label']} ({top_emotion['score']:.2%})")
print("\nThis might miss nuanced emotional content in different sentences.")

In [None]:
# Approach 2: Classify by sentences
print("=== SENTENCE-BY-SENTENCE ANALYSIS ===")
sentences_result = classifier(books["description"][0].split("."))
print("This captures much more variety:")
for i, sentence_emotions in enumerate(sentences_result[:3]):  # Show first 3 sentences
    top_emotion = max(sentence_emotions, key=lambda x: x['score'])
    print(f"Sentence {i+1}: {top_emotion['label']} ({top_emotion['score']:.2%})")

## Examining Individual Sentences

Let's look at specific sentences to verify our classifier is working correctly:

In [None]:
sentences = books["description"][0].split(".")
predictions = classifier(sentences)

print("=== SENTENCE ANALYSIS ===")
print(f"First sentence: '{sentences[0]}'")
print(f"Prediction: {predictions[0]}")
print()
print(f"Fourth sentence: '{sentences[3]}'")
print(f"Prediction: {predictions[3]}")

## Processing Challenge: Multiple Emotions per Book

The sentence-by-sentence approach introduces complexity:
- Each book now has multiple emotions associated with it
- The classifier output is ordered by score (different order for each sentence)

**Our Solution:**
1. Create separate columns for each of the 7 emotion categories
2. For each emotion, take the **highest probability** from across all sentences in the description
3. This gives us a comprehensive emotion profile for each book

### Data Processing Steps:
1. Sort predictions by label (to ensure consistent ordering)
2. Extract maximum score for each emotion across all sentences
3. Create a structured dataframe with emotion columns
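To make the aggregation concrete, here is a toy example with two sentences whose labels arrive in different orders. The scores are invented for illustration and only three labels are used to keep it short. Note that this variant looks scores up by label instead of sorting; `max_score_per_label` is a hypothetical helper, not the function used later in the notebook.

```python
# Toy predictions for two sentences; label order differs between them
toy_predictions = [
    [{"label": "joy", "score": 0.7},
     {"label": "fear", "score": 0.2},
     {"label": "neutral", "score": 0.1}],
    [{"label": "fear", "score": 0.6},
     {"label": "neutral", "score": 0.3},
     {"label": "joy", "score": 0.1}],
]

def max_score_per_label(predictions):
    """Return the highest score seen for each label across all sentences."""
    best = {}
    for sentence_prediction in predictions:
        for item in sentence_prediction:
            label, score = item["label"], item["score"]
            best[label] = max(score, best.get(label, 0.0))
    return best

toy_max = max_score_per_label(toy_predictions)
print(toy_max)
```

Each label ends up with its single strongest score, regardless of which sentence it came from or where it appeared in the list.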

In [None]:
# Show the ordering problem
print("=== ORDERING CHALLENGE ===")
print("Raw predictions have different label orders:")
print(f"Sentence 1 order: {[p['label'] for p in predictions[0]]}")
print(f"Sentence 2 order: {[p['label'] for p in predictions[1]]}")
print()
print("After sorting by label:")
sorted_pred = sorted(predictions[0], key=lambda x: x["label"])
print(f"Consistent order: {[p['label'] for p in sorted_pred]}")

## Building the Emotion Extraction System

Now we'll create the infrastructure to process all our book descriptions:

In [None]:
import numpy as np

# Define our emotion categories (alphabetical order for consistency)
emotion_labels = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

# Initialize storage for results
isbn = []  # To merge back with original dataframe
emotion_scores = {label: [] for label in emotion_labels}  # Dictionary to become dataframe columns

def calculate_max_emotion_scores(predictions):
    """
    Extract maximum emotion scores from sentence-level predictions.
    
    Args:
        predictions: List of predictions, one per sentence
        
    Returns:
        Dictionary with maximum score for each emotion across all sentences
    """
    # Initialize storage for this description
    per_emotion_scores = {label: [] for label in emotion_labels}
    
    # Process each sentence
    for prediction in predictions:
        # Sort to ensure consistent emotion order
        sorted_predictions = sorted(prediction, key=lambda x: x["label"])
        
        # emotion_labels is alphabetical too, so position `index` matches `label`
        for index, label in enumerate(emotion_labels):
            per_emotion_scores[label].append(sorted_predictions[index]["score"])
    
    # Return maximum score for each emotion
    return {label: np.max(scores) for label, scores in per_emotion_scores.items()}

## Testing Our Function

Let's verify our emotion extraction function works correctly:

In [None]:
# Test with the first book
sentences = books["description"][0].split(".")
predictions = classifier(sentences)
max_scores = calculate_max_emotion_scores(predictions)

print("=== MAXIMUM EMOTION SCORES ===")
for emotion, score in max_scores.items():
    print(f"{emotion.capitalize()}: {score:.3f}")

## Processing All Books

Now let's apply our emotion analysis to the entire dataset. This will take some time as we're processing over 5,000 book descriptions:

**Process for each book:**
1. Extract ISBN13 for merging later
2. Split description into sentences
3. Get emotion predictions for all sentences
4. Calculate maximum scores for each emotion
5. Store results in our data structures

In [None]:
from tqdm import tqdm

# Reset our storage (in case we're re-running)
emotion_labels = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
isbn = []
emotion_scores = {label: [] for label in emotion_labels}

# Process all books with progress bar
for i in tqdm(range(len(books)), desc="Analyzing emotions"):
    # Store ISBN for merging
    isbn.append(books["isbn13"][i])
    
    # Process description
    sentences = books["description"][i].split(".")
    predictions = classifier(sentences)
    max_scores = calculate_max_emotion_scores(predictions)
    
    # Store results
    for label in emotion_labels:
        emotion_scores[label].append(max_scores[label])

## Creating and Merging the Emotions DataFrame

Convert our results into a pandas DataFrame and merge it back with our original book data:

In [None]:
# Create emotions dataframe
emotions_df = pd.DataFrame(emotion_scores)
emotions_df["isbn13"] = isbn

print("=== EMOTIONS DATAFRAME ===")
print(f"Shape: {emotions_df.shape}")
print(emotions_df.head())

In [None]:
# Merge with original books dataframe
books = pd.merge(books, emotions_df, on="isbn13")

print("=== MERGED DATAFRAME ===")
print(f"Shape: {books.shape}")
print("\nNew emotion columns:")
print([col for col in books.columns if col in emotion_labels])
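A merge on `isbn13` only keeps the row count stable if each ISBN appears once on both sides; duplicated keys would multiply rows. The check below is a minimal standard-library sketch for spotting repeated keys before merging. `find_duplicate_keys` is a hypothetical helper and the sample values are placeholders, not real ISBNs from the dataset.

```python
from collections import Counter

def find_duplicate_keys(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]

# Placeholder key values for illustration
sample_keys = [1001, 1002, 1001, 1003]
print(find_duplicate_keys(sample_keys))
```

In the notebook this could be called as `find_duplicate_keys(isbn)`. Alternatively, `pd.merge` accepts `validate="one_to_one"`, which raises an error when either side contains duplicate keys.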

## Examining the Results

Let's look at the distribution of emotions across our book dataset:

In [None]:
# Display emotion statistics
print("=== EMOTION DISTRIBUTION STATISTICS ===")
emotion_stats = books[emotion_labels].describe()
print(emotion_stats)

print("\n=== KEY INSIGHTS ===")
print("- We have a good distribution across most emotions")
print("- Sadness shows quite high probabilities in many books")
print("- This gives us valuable variables for book filtering and recommendation")

## Saving the Enhanced Dataset

Save our enriched dataset with emotion features for use in the final recommendation dashboard:

In [None]:
books.to_csv("books_with_emotions.csv", index=False)
print("✅ Enhanced dataset saved as 'books_with_emotions.csv'")
print(f"📊 Final dataset shape: {books.shape}")
print(f"🎭 Emotion features added: {emotion_labels}")

## Summary

We've successfully implemented sentiment analysis for our book recommendation system:

1. **Fine-tuned Model**: Used a specialized emotion classification model (66% accuracy)
2. **Granular Analysis**: Analyzed emotions at sentence level to capture more of the emotional variety within each description
3. **Comprehensive Features**: Created 7 emotion columns for each book
4. **Smart Aggregation**: Used maximum probability across sentences for each emotion
5. **Enhanced Dataset**: Added emotion features to support advanced filtering

This sentiment analysis capability showcases how LLMs can extract meaningful features from text data that wouldn't be available in traditional recommender systems!