# 03. Text Embedding

This notebook focuses on creating text embeddings for our dataset. We will:
- Load the data prepared in the previous notebooks (which includes image embeddings).
- Prepare separate text fields for reviews and product metadata.
- Use a `SentenceTransformer` model to generate embeddings for both text fields.
- Fuse the review and metadata embeddings into a single text embedding.
- Save the final DataFrame containing both image and text embeddings.

## 1. Setup and Data Loading

In [1]:
import sys
import os
import pandas as pd
import numpy as np

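# Add the project root (the parent of the notebook's working directory) to the import path.
# '__file__' is quoted on purpose: notebooks do not define __file__, so abspath resolves it against os.getcwd().
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath('__file__'))))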

from src.data_utils import prepare_text_columns
from src.embedding_utils import get_text_model, generate_text_embeddings, fuse_embeddings

# --- Configuration ---
CATEGORY = "CDs_and_Vinyl"
INPUT_FILE = f"../reviews_with_img_emb_{CATEGORY}.parquet"
OUTPUT_FILE = f"../reviews_with_img_text_emb_{CATEGORY}.parquet"

# Load data
print(f"Loading data from {INPUT_FILE}...")
df = pd.read_parquet(INPUT_FILE)
print("Data loaded successfully. Shape:", df.shape)

Loading data from ../reviews_with_img_emb_CDs_and_Vinyl.parquet...
Data loaded successfully. Shape: (5000, 23)


## 2. Prepare Text Columns

We'll create two new columns: `review_text`, which joins each review's title and body, and `meta_text`, which joins the product title and description. The `prepare_text_columns` helper function handles both.

In [2]:
df = prepare_text_columns(df)

print("Review text sample:", df["review_text"].iloc[0])
print("\nMeta text sample:", df["meta_text"].iloc[0])

Review text sample: Great Scott Ok...am I missing something here? Scott Walker is the greatest singer ever. It's that simple. This guy has sung it all from schlock broadway tunes to brooding interpretations of J. Brel and there has never been a bad note. The celebration of that voice has probably obscured the fact that the man is also responsible for some of the most intense self-penned anguish in the recorded history of recorded history. Tilt is a benchmark of total excellence for a genre that doesn't even exist. No one is as deep space profound as Scott Walker and no one has ever come close to the stratas-fear of pain and plight as THIS artist on THIS CD. Drive a nine-inch-nail in your head. Chant that Bowie is far from being the man who fell to earth. Whisper to Frank that Wee Small Hours just got smaller. Radiohead? RadioDEAD.  Julian Cope dubst Scott Walker is an incredible God-like Genius. That is an understatement. Years since Tilt and still waiting. The plea to Scott...&quot;mo
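The internals of `prepare_text_columns` are not shown in this notebook. As a minimal sketch, assuming it joins `title_x` with `text` and `title_y` with `description` (the column names used in the inline cell at the end of this notebook), it might look like the following. The names `to_text_sketch` and `prepare_text_columns_sketch` are hypothetical stand-ins, not the actual `src.data_utils` code.

```python
import numpy as np
import pandas as pd


def to_text_sketch(value):
    """Hypothetical stand-in for src.data_utils.to_text: coerce a cell to a plain string."""
    # List-like cells (e.g. multi-part descriptions) are joined with spaces.
    if isinstance(value, (list, tuple, np.ndarray)):
        return " ".join(str(v) for v in value if v is not None)
    # Missing values become empty strings.
    if value is None or (isinstance(value, float) and np.isnan(value)):
        return ""
    return str(value)


def prepare_text_columns_sketch(df):
    """Assumed behavior: build review_text and meta_text from title/body columns."""
    out = df.copy()

    def col(name):
        # A missing column contributes empty strings instead of raising.
        if name in out.columns:
            return out[name].apply(to_text_sketch)
        return pd.Series("", index=out.index)

    out["review_text"] = (col("title_x") + " " + col("text")).str.strip()
    out["meta_text"] = (col("title_y") + " " + col("description")).str.strip()
    return out


# Toy data, not taken from the real dataset.
demo = pd.DataFrame({
    "title_x": ["Great Scott", None],
    "text": ["Best singer ever.", "Fine."],
    "title_y": ["Tilt", "Untitled"],
    "description": [["Studio album.", "1995."], None],
})
demo = prepare_text_columns_sketch(demo)
```

Stripping after concatenation avoids a stray leading or trailing space when one of the two parts is empty.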

## 3. Generate Text Embeddings

Now, we'll load the pre-trained `all-MiniLM-L6-v2` `SentenceTransformer` model and use it to encode both the review text and the metadata text into 384-dimensional vectors.

In [3]:
model_text = get_text_model("all-MiniLM-L6-v2")

review_corpus = df["review_text"].tolist()
meta_corpus = df["meta_text"].tolist()

print("Generating review embeddings...")
review_emb = generate_text_embeddings(review_corpus, model_text)

print("Generating metadata embeddings...")
meta_emb = generate_text_embeddings(meta_corpus, model_text)

print("review_emb shape:", review_emb.shape)
print("meta_emb shape:", meta_emb.shape)

Generating review embeddings...
Generating metadata embeddings...

review_emb shape: (5000, 384)
meta_emb shape: (5000, 384)
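`generate_text_embeddings` lives in `src.embedding_utils` and is not shown here. A minimal sketch, assuming it simply forwards to the model's `encode` method with a batch size of 64 and casts the result to `float32` (as the inline cell at the end of this notebook does), could look like this. `generate_text_embeddings_sketch` and `StubEncoder` are hypothetical; the stub only mimics the `encode` keyword arguments so the example runs without downloading model weights.

```python
import math

import numpy as np


def generate_text_embeddings_sketch(texts, model, batch_size=64):
    """Hypothetical stand-in: encode texts with any object exposing encode()."""
    emb = model.encode(
        list(texts),
        batch_size=batch_size,
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return np.asarray(emb, dtype=np.float32)


class StubEncoder:
    """Toy encoder returning random vectors; NOT a real language model."""

    def __init__(self, dim=384):
        self.dim = dim

    def encode(self, sentences, batch_size=32, show_progress_bar=False,
               convert_to_numpy=True):
        rng = np.random.default_rng(0)  # fixed seed keeps the demo deterministic
        return rng.standard_normal((len(sentences), self.dim))


emb = generate_text_embeddings_sketch(["first text", "second text", "third"], StubEncoder())

# 5000 rows at 64 rows per batch gives the 79 batches reported by the progress bar.
n_batches = math.ceil(5000 / 64)
```

With a real model, passing `SentenceTransformer("all-MiniLM-L6-v2")` in place of `StubEncoder()` should work unchanged, since the keyword arguments used here exist on `SentenceTransformer.encode`.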


## 4. Fuse Embeddings and Save

We'll combine the review and metadata embeddings into a single text embedding. The `alpha` parameter sets the weight on the review embedding, and the metadata embedding receives `1 - alpha`; with `alpha = 0.7` the fused vector leans toward the review. Finally, we add all three embeddings to the DataFrame and save it.
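The `fuse_embeddings` helper is defined in `src.embedding_utils` and not shown here. As a rough sketch, assuming it computes the same weighted sum followed by row-wise L2 normalization that the inline cell at the end of this notebook uses, it could be written with NumPy alone. The name `fuse_embeddings_sketch` and the `eps` guard are illustrative assumptions.

```python
import numpy as np


def fuse_embeddings_sketch(review_emb, meta_emb, alpha=0.7, eps=1e-12):
    """Assumed behavior: alpha-weighted sum of two matrices, then unit-normalize rows."""
    review_emb = np.asarray(review_emb, dtype=np.float32)
    meta_emb = np.asarray(meta_emb, dtype=np.float32)
    if review_emb.shape != meta_emb.shape:
        raise ValueError("review_emb and meta_emb must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    fused = alpha * review_emb + (1.0 - alpha) * meta_emb
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    # eps prevents division by zero; an all-zero row stays all-zero.
    return (fused / np.maximum(norms, eps)).astype(np.float32)


# Toy 2-D vectors instead of real 384-D embeddings.
review = np.array([[1.0, 0.0], [0.0, 2.0]], dtype=np.float32)
meta = np.array([[0.0, 1.0], [0.0, 4.0]], dtype=np.float32)
fused = fuse_embeddings_sketch(review, meta, alpha=0.7)
```

Normalizing after the weighted sum keeps every fused vector at unit length, so a plain dot product between two fused vectors equals their cosine similarity.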

In [4]:
# Fuse review + meta into one text vector
alpha = 0.7  # Weight on review text; (1-alpha) on meta
text_emb = fuse_embeddings(review_emb, meta_emb, alpha=alpha)

# Add embeddings to DataFrame
df["review_text_emb"] = list(review_emb)
df["meta_text_emb"] = list(meta_emb)
df["text_emb"] = list(text_emb.astype("float32"))  # Fused text emb

# Save the final DataFrame
df.to_parquet(OUTPUT_FILE, index=False)
print(f"✅ Saved final DataFrame with text embeddings to: {OUTPUT_FILE}")

✅ Saved final DataFrame with text embeddings to: ../reviews_with_img_text_emb_CDs_and_Vinyl.parquet
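Because each embedding is stored as one array per row, downstream code usually needs to stack a column back into a 2-D matrix before computing similarities. A minimal sketch, with `embedding_matrix` as a hypothetical helper and a toy DataFrame standing in for the saved file:

```python
import numpy as np
import pandas as pd


def embedding_matrix(df, column):
    """Stack a column of per-row vectors into one (n_rows, dim) float32 array."""
    # tolist() yields one vector per row, whether the cells hold lists or arrays.
    return np.stack(df[column].tolist()).astype(np.float32)


# Toy stand-in for the saved DataFrame: three rows of 2-D "embeddings".
demo = pd.DataFrame({"text_emb": list(np.arange(6, dtype=np.float32).reshape(3, 2))})
matrix = embedding_matrix(demo, "text_emb")
```

The same call should apply to a DataFrame loaded with `pd.read_parquet(OUTPUT_FILE)`, assuming the Parquet engine returns each cell as a list or array of equal length.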


## Step 3: Text Embedding
This notebook generates text embeddings from product reviews and metadata, using a pre-trained transformer model to encode the text for later analysis.

- Load the prepared data
- Build the review and metadata text fields
- Generate text embeddings (the model tokenizes the text internally)
- Save the embeddings for later use

## Text Embedding with Sentence Transformers
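The cell below runs the pipeline inline, calling `SentenceTransformer` directly: it prepares the review and metadata text, encodes both with the model, fuses the embeddings, and saves the result.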

In [6]:
import os
import sys

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

# Add the project root to the import path (same trick as in the first cell)
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath('__file__'))))
from src.data_utils import to_text


def text_col(name):
    """Return column `name` converted to text, or empty strings if the column is missing."""
    if name not in df.columns:
        return pd.Series("", index=df.index)
    return df[name].apply(to_text).fillna("")


# Prepare review and meta text columns
df["review_text"] = (text_col("title_x") + " " + text_col("text")).str.strip()
df["meta_text"] = (text_col("title_y") + " " + text_col("description")).str.strip()

# Encode both text fields with the same model
model_text = SentenceTransformer("all-MiniLM-L6-v2")
review_corpus = df["review_text"].tolist()
meta_corpus = df["meta_text"].tolist()

review_emb = model_text.encode(review_corpus, batch_size=64, show_progress_bar=True, convert_to_numpy=True).astype("float32")
meta_emb = model_text.encode(meta_corpus, batch_size=64, show_progress_bar=True, convert_to_numpy=True).astype("float32")

# Fuse: weighted sum of the two embeddings, then L2-normalize each row
alpha = 0.7  # weight on review text; (1 - alpha) on meta
text_emb = normalize(alpha * review_emb + (1 - alpha) * meta_emb, axis=1)
df["review_text_emb"] = list(review_emb)
df["meta_text_emb"] = list(meta_emb)
df["text_emb"] = list(text_emb.astype("float32"))

# Write to the same location as the earlier cell (OUTPUT_FILE is defined in the first cell)
df.to_parquet(OUTPUT_FILE, index=False)
