# Usage
We ultimately want to fill in manga ratings that are NA or null.


We choose SBERT (Sentence-BERT) over TF-IDF cosine similarity because we want to predict ratings from synopses that are semantically similar. SBERT uses transformer-based architectures such as BERT, which model context and the relationships between words, so it can capture the meaning of a sentence as a whole. TF-IDF is a poor fit here because it primarily measures word occurrence and frequency, which does not capture context or semantics.
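
To make the limitation of TF-IDF concrete, here is a minimal sketch in plain Python. It uses a simplified TF-IDF weighting (not any particular library's implementation) and three invented toy synopses that are not taken from the dataset. The first two describe a similar story with no words in common; the first and third share words but not meaning.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF * smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
                  * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy synopses, invented for illustration only.
toy_synopses = [
    "young swordsman seeks revenge for murdered family",
    "orphaned boy hunts the killers of his parents",
    "young swordsman opens a bakery",
]

vectors = tfidf_vectors(toy_synopses)
sim_paraphrase = cosine(vectors[0], vectors[1])  # similar plot, no shared words
sim_lexical = cosine(vectors[0], vectors[2])     # shared words, different plot

print(sim_paraphrase)  # 0.0, because the two synopses have no term in common
print(sim_lexical)     # greater than 0, driven only by "young" and "swordsman"
```

Under this weighting the two revenge stories score exactly zero, while the unrelated bakery synopsis scores higher. A sentence embedding model is expected to rank these pairs the other way around, which is the behaviour we want for rating prediction.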

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer




In [2]:
df = pd.read_csv('dataset/modified_manga.csv')
# MODEL = 'all-mpnet-base-v2' # General-purpose, best for applications where accuracy is more important than speed.
# MODEL = 'paraphrase-MiniLM-L6-v2' # Balanced Performance. Paraphrase identification, text similarity tasks.
MODEL = 'paraphrase-mpnet-base-v2' # High-accuracy paraphrase identification and text similarity tasks. Slightly slower than MiniLM

In [3]:
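# Assuming the column containing synopses is named 'synopsis'.
# Missing synopses become empty strings so that every entry passed to the model is a str.
synopses = df['synopsis'].fillna('').astype(str).tolist()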

# Load a pre-trained Sentence-BERT model
model = SentenceTransformer(MODEL) 

# Compute embeddings for each synopsis
embeddings = model.encode(synopses, convert_to_tensor=True)


In [None]:
# Now, 'embeddings' is a tensor containing the sentence embeddings for each synopsis
# You can convert it to a numpy array if needed:
embeddings_np = embeddings.cpu().numpy()

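# If you want to add the embeddings back to the dataframe and save it.
# Store each embedding as a plain Python list: a CSV cell holding a NumPy array is
# written as its space-separated repr, which is hard to parse back. A list is written
# as "[0.1, 0.2, ...]" and can be restored with json.loads.
df['embeddings'] = embeddings_np.tolist()
df.to_csv('manga_dataset_with_embeddings.csv', index=False)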
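
Once embeddings exist, one possible way to reach the goal stated at the top (filling in missing ratings) is a nearest-neighbour estimate: for each unrated manga, take the `k` most similar rated manga by cosine similarity and average their ratings, weighted by similarity. The sketch below is only an illustration under stated assumptions, not necessarily the method this notebook ends up using. The function name `fill_missing_ratings`, the parameter `k`, and the toy arrays are all hypothetical; in practice the inputs would be `embeddings_np` and a ratings column from `df` whose name is not shown in this section.

```python
import numpy as np

def fill_missing_ratings(embeddings, ratings, k=3):
    """Fill NaN ratings with a similarity-weighted mean of the k nearest rated items."""
    embeddings = np.asarray(embeddings, dtype=float)
    ratings = np.asarray(ratings, dtype=float)

    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    filled = ratings.copy()
    rated_idx = np.flatnonzero(~np.isnan(ratings))
    if rated_idx.size == 0:
        return filled  # nothing to learn from

    for i in np.flatnonzero(np.isnan(ratings)):
        sims = unit[rated_idx] @ unit[i]           # similarity to every rated item
        top = np.argsort(sims)[::-1][:k]           # positions of the k most similar
        neighbour_ratings = ratings[rated_idx][top]
        weights = np.clip(sims[top], 0.0, None)    # ignore negative similarities
        if weights.sum() > 0:
            filled[i] = np.average(neighbour_ratings, weights=weights)
        else:
            filled[i] = neighbour_ratings.mean()   # fall back to a plain mean
    return filled

# Hypothetical 2-dimensional "embeddings" and ratings, for illustration only.
toy_embeddings = np.array([
    [1.00, 0.00],
    [0.90, 0.10],
    [0.00, 1.00],
    [0.95, 0.05],   # unrated item, close to the first two
])
toy_ratings = np.array([8.0, 7.0, 3.0, np.nan])

filled = fill_missing_ratings(toy_embeddings, toy_ratings, k=2)
print(filled)
```

In the toy data the unrated item points in almost the same direction as the first two items, so its estimate lands between their ratings of 7 and 8 and is unaffected by the dissimilar third item. The choice of `k` and of the weighting scheme are tuning decisions that would need validation against known ratings.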