# Embeddings

We compute a separate embedding for each of the **`title`**, **`plot`** and **`genres`** fields using the *all-MiniLM-L6-v2* model, then combine them with per-field weights.

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Dict, Union




In [8]:
df = pd.read_csv('final_dataset.csv')

In [9]:
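# read_csv loads `genres` as plain text, so ' '.join(x) would put a space between every character; strip list punctuation instead
df['genres_str'] = df['genres'].astype(str).str.replace(r"[\[\]'\",|]", ' ', regex=True).str.split().str.join(' ')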

In [10]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Request NumPy arrays rather than torch tensors so the np.linalg / np.save calls that follow work directly
title_emb = model.encode(df['title'].tolist(), convert_to_numpy=True)
plot_emb = model.encode(df['plot'].tolist(), convert_to_numpy=True)
genres_emb = model.encode(df['genres_str'].tolist(), convert_to_numpy=True)

### Normalization and weighting

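Each field's embeddings are first L2-normalized so that the weights alone determine how much each field contributes. The weights can be adjusted depending on the importance of the fields (for example, `plot` is weighted more heavily than `title`):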

In [11]:
weights = {
    'title': 0.1,
    'plot': 0.6,
    'genres': 0.3
}

# L2-normalize each row so every field starts with unit-length vectors
title_emb = title_emb / np.linalg.norm(title_emb, axis=1, keepdims=True)
plot_emb = plot_emb / np.linalg.norm(plot_emb, axis=1, keepdims=True)
genres_emb = genres_emb / np.linalg.norm(genres_emb, axis=1, keepdims=True)

# Weighted sum of the three fields: one combined vector per movie
combined_emb = weights['title'] * title_emb + weights['plot'] * plot_emb + weights['genres'] * genres_emb
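
The normalize-then-weight step can be checked on toy data without loading the model. The sketch below is illustrative only: `l2_normalize`, `combine` and the random `toy_fields` are hypothetical names and data, not part of the notebook. It assumes the weights sum to 1, in which case each combined row has norm at most 1.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def combine(fields: dict, weights: dict) -> np.ndarray:
    """Weighted sum of row-normalized field embeddings (hypothetical helper)."""
    return sum(weights[name] * l2_normalize(emb) for name, emb in fields.items())

# Toy stand-ins for the real embeddings: 4 "movies", 8 dimensions per field
rng = np.random.default_rng(0)
toy_fields = {name: rng.normal(size=(4, 8)) for name in ("title", "plot", "genres")}
toy_weights = {"title": 0.1, "plot": 0.6, "genres": 0.3}

toy_combined = combine(toy_fields, toy_weights)
toy_norms = np.linalg.norm(toy_combined, axis=1)
print(toy_combined.shape)
print(toy_norms)
```

The combined rows are generally shorter than unit length, so if a plain dot product is later used as a similarity score, they would need to be re-normalized first. `cosine_similarity` from scikit-learn does that normalization internally.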

In [12]:
np.save('movie_embeddings.npy', combined_emb)
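
One plausible use of the saved matrix is nearest-neighbour lookup by cosine similarity. The following is a minimal sketch under that assumption: `top_k_similar` is a hypothetical helper, and the small `toy` matrix stands in for the real embeddings so the block runs on its own.

```python
import numpy as np

def top_k_similar(embeddings: np.ndarray, query_idx: int, k: int = 5) -> list:
    """Return (row index, cosine similarity) for the k rows closest to row `query_idx`.

    Assumes no row of `embeddings` is all zeros.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[query_idx]      # cosine similarity to the query row
    scores[query_idx] = -np.inf          # exclude the query itself
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]

# Toy matrix: row 1 points almost the same way as row 0, row 2 is orthogonal, row 3 is opposite
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbours = top_k_similar(toy, query_idx=0, k=2)
print(neighbours)
```

With the real data, the matrix would come from `np.load('movie_embeddings.npy')`, and the returned row indices map back to rows of `df`.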