# Enhanced Embeddings with Top 3 Models

We embed the **`title`**, **`plot`**, and **`genres`** fields separately with three sentence-transformer models, combine the field embeddings using fixed weights, time each embedding step, and save the combined embeddings for each model to a separate file.

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Dict, Union
import time




In [3]:
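import ast
df = pd.read_csv('final_dataset.csv')
# read_csv returns lists as strings; parse them before joining (assumes a list literal such as "['Action', 'Drama']")
df['genres_str'] = df['genres'].apply(lambda x: ' '.join(ast.literal_eval(x)))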

## Model Selection

Three models were chosen from the [Sentence-Transformers pretrained models list](https://www.sbert.net/docs/pretrained_models.html):

1. **all-mpnet-base-v2**: Highest general-purpose accuracy (12-layer, 768-hidden-dimension)
2. **all-MiniLM-L12-v2**: Balance of speed and performance (12-layer, 384-hidden-dimension)
3. **multi-qa-distilbert-cos-v1**: Compact model trained specifically for semantic search on question-answer pairs (6-layer, 768-hidden-dimension)

In [None]:
WEIGHTS = {'title': 0.15, 'plot': 0.6, 'genres': 0.25}

# A list keeps the processing order deterministic (a set does not guarantee order)
MODELS = [
    'all-MiniLM-L12-v2',
    'multi-qa-distilbert-cos-v1',
    'all-mpnet-base-v2',
]
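The weighting scheme can be illustrated on toy vectors. The sketch below is not the notebook's pipeline: `combine_embeddings` is a hypothetical helper, the toy arrays stand in for real model output, and the optional re-normalization step is an assumption added here (a weighted sum of unit vectors is generally shorter than unit length).

```python
import numpy as np

# Same weights as in the cell above
WEIGHTS = {'title': 0.15, 'plot': 0.6, 'genres': 0.25}

def combine_embeddings(field_embeddings, weights, renormalize=True):
    """Weighted sum of per-field embedding matrices (hypothetical helper)."""
    combined = sum(weights[name] * emb for name, emb in field_embeddings.items())
    if renormalize:
        # Assumption: rescale each row to unit length after mixing
        norms = np.linalg.norm(combined, axis=1, keepdims=True)
        combined = combined / np.clip(norms, 1e-12, None)
    return combined

# Toy unit-length vectors standing in for real embeddings (2 items, 3 dimensions)
toy = {
    'title': np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    'plot': np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]),
    'genres': np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]),
}

raw = combine_embeddings(toy, WEIGHTS, renormalize=False)
combined = combine_embeddings(toy, WEIGHTS)
print(np.linalg.norm(raw, axis=1))       # row norms before re-normalization
print(np.linalg.norm(combined, axis=1))  # row norms after re-normalization
```

When all three fields point in the same direction, the weighted sum already has unit length; when they disagree, the norm shrinks, which matters only if downstream code uses dot products instead of cosine similarity.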


## Embedding Generation

We'll generate embeddings for each model, time the process, and save the results.
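The per-field timing pattern can also be written as a small context manager. This is a minimal sketch rather than the notebook's code: `timed` and `timings` are hypothetical names, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so partial runs still leave a timing
        results[label] = time.perf_counter() - start

timings = {}
with timed('demo_step', timings):
    total = sum(i * i for i in range(100_000))  # placeholder workload

print(f"demo_step: {timings['demo_step']:.4f} seconds")
```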

In [None]:
import os
import time

import numpy as np
from sentence_transformers import SentenceTransformer

os.makedirs('embeddings', exist_ok=True)

def generate_embeddings(model, texts):
    # Return L2-normalized embeddings as a NumPy array so they can be
    # weighted with plain arithmetic and written with np.save
    return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

# Dictionary to store timing results
timing_results = {}

for model_name in MODELS:
    start_time = time.time()

    # Load the model once and reuse it for all three fields
    model = SentenceTransformer(model_name)

    # Generate all embeddings
    title_start = time.time()
    title_emb = generate_embeddings(model, df['title'].tolist())
    title_time = time.time() - title_start

    plot_start = time.time()
    plot_emb = generate_embeddings(model, df['plot'].tolist())
    plot_time = time.time() - plot_start

    genres_start = time.time()
    genres_emb = generate_embeddings(model, df['genres_str'].tolist())
    genres_time = time.time() - genres_start

    # Combine the field embeddings using the fixed weights
    combined_emb = (
        WEIGHTS['title'] * title_emb +
        WEIGHTS['plot'] * plot_emb +
        WEIGHTS['genres'] * genres_emb
    )

    # Save one file per model, e.g. embeddings/embeddings_all_mpnet_base_v2.npy
    filename = f"embeddings/embeddings_{model_name.replace('-', '_')}.npy"
    np.save(filename, combined_emb)

    # Calculate total time (includes model loading and saving)
    total_time = time.time() - start_time

    # Store timing results
    timing_results[model_name] = {
        'title_embedding_time': title_time,
        'plot_embedding_time': plot_time,
        'genres_embedding_time': genres_time,
        'total_time': total_time
    }

    # Report timings for this model
    print(f"Timing for {model_name}:")
    print(f"  Title embeddings: {title_time:.2f} seconds")
    print(f"  Plot embeddings: {plot_time:.2f} seconds")
    print(f"  Genres embeddings: {genres_time:.2f} seconds")
    print(f"  Total time: {total_time:.2f} seconds")

Timing for all-MiniLM-L12-v2:
  Title embeddings: 224.69 seconds
  Plot embeddings: 1349.76 seconds
  Genres embeddings: 229.35 seconds
  Total time: 1804.65 seconds



Timing for multi-qa-distilbert-cos-v1:
  Title embeddings: 254.51 seconds
  Plot embeddings: 6538.55 seconds
  Genres embeddings: 382.33 seconds
  Total time: 7179.21 seconds



KeyboardInterrupt: 

In [None]:
with open('embedding_generation_times.txt', 'a') as f:
    for model_name, times in timing_results.items():
        f.write(f"Model: {model_name}\n")
        f.write(f"  Title embeddings: {times['title_embedding_time']:.2f} seconds\n")
        f.write(f"  Plot embeddings: {times['plot_embedding_time']:.2f} seconds\n")
        f.write(f"  Genres embeddings: {times['genres_embedding_time']:.2f} seconds\n")
        f.write(f"  Total time: {times['total_time']:.2f} seconds\n\n")

print("Embedding generation times saved to 'embedding_generation_times.txt'")

Embedding generation times saved to 'embedding_generation_times.txt'
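Once a combined embedding file exists, it can be loaded with `np.load` and used for similarity lookups. The sketch below shows one possible way to do that; `top_k_similar` is a hypothetical helper, and a small synthetic array stands in for the saved file so the example runs without the dataset.

```python
import numpy as np

def top_k_similar(embeddings, query_index, k=5):
    """Indices of the k rows most cosine-similar to row query_index (hypothetical helper)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # normalize rows for cosine similarity
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query item itself
    k = min(k, len(scores) - 1)
    return np.argsort(-scores)[:k]

# Synthetic stand-in; with real data this would be something like
# np.load('embeddings/embeddings_all_MiniLM_L12_v2.npy')
demo = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbours = top_k_similar(demo, query_index=0, k=2)
print(neighbours)
```

Re-normalizing inside the helper keeps the ranking a true cosine ranking even though the saved vectors are weighted sums and are not guaranteed to have unit length.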
