#### Cosine Similarity (ModernBertEmbed version)

This time I am going to use a different model and compare it to the BERT version of the "Recommendations engine". The model is modernbert-embed-base, an embedding model built on ModernBERT and developed by Nomic AI.

Some features of ModernBertEmbed:

* Flash attention: It is meant to be more efficient because it can use Flash Attention, an algorithm that computes attention faster and with less memory on the GPU
* Longer context: up to 8192 tokens vs 512 in BERT
* Prefixes: It requires a prefix on every input: "search_document:" for the texts being indexed and searched over (the documents), and "search_query:" for the queries or questions used to search them
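
As a minimal sketch of how the prefixes are applied: documents that will be searched over get "search_document: " and the text used to search them gets "search_query: ". The helper name `add_prefix` and the sample strings are made up for illustration; only the two prefix strings come from the model's documented usage.

```python
# Prefixes expected by modernbert-embed-base
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def add_prefix(texts, kind="document"):
    """Prepend the document or query prefix to each text (hypothetical helper)."""
    if kind == "document":
        prefix = DOCUMENT_PREFIX
    elif kind == "query":
        prefix = QUERY_PREFIX
    else:
        raise ValueError("kind must be 'document' or 'query'")
    # str() guards against non-string values in the input
    return [prefix + str(text) for text in texts]


docs = add_prefix(["A cozy farming simulator."], kind="document")
queries = add_prefix(["relaxing farming games"], kind="query")
print(docs[0])
print(queries[0])
```

In this notebook only the document prefix is needed, because every game description is compared against other game descriptions.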

Let's install dependencies first

In [1]:
pip install --upgrade "transformers>=4.48.0" sentence-transformers accelerate torch

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install flash-attn --no-build-isolation

Collecting flash-attn
  Using cached flash_attn-2.8.3.tar.gz (8.4 MB)
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting einops (from flash-attn)
  Using cached einops-0.8.1-py3-none-any.whl.metadata (13 kB)
Using cached einops-0.8.1-py3-none-any.whl (64 kB)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (pyproject.toml): started
  Building wheel for flash-attn (pyproject.toml): finished with status 'error'
Failed to build flash-attn
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  Building wheel for flash-attn (pyproject.toml) did not run successfully.
  exit code: 1
  
  [157 lines of output]
  !!
  
          ********************************************************************************
          Please consider removing the following classifiers in favor of a SPDX license expression:
  
          License :: OSI Approved :: BSD License
  
          See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
          ********************************************************************************
  
  !!
    self._finalize_license_expression()
  
  
  torch.__version__  = 2.10.0.dev20251210+cu128
  
  
  running bdist_wheel
  W1223 22:15:36.737000 34796 site-packages\torch\utils\cpp_extension.py:659] Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  Guessing wheel URL:  https://github.com/Dao-AILab

The flash-attn wheel failed to build, so I will not use it. The model still runs without it, just with the standard attention implementation.
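
A small sketch of how the choice could be made automatically, assuming the attention backend is selected through the `attn_implementation` setting of `transformers`. The function name `pick_attention_backend` is hypothetical, and the commented-out model call is an assumption about how the setting would be passed through `sentence-transformers`; it is not what this notebook ran.

```python
import importlib.util


def pick_attention_backend():
    """Return 'flash_attention_2' if flash-attn is importable, else 'sdpa'."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Fall back to PyTorch's built-in scaled dot-product attention
    return "sdpa"


backend = pick_attention_backend()
print(backend)

# Assumed usage (not run here): forward the choice to transformers.
# model = SentenceTransformer(
#     "nomic-ai/modernbert-embed-base",
#     device="cuda",
#     model_kwargs={"attn_implementation": backend},
# )
```

Flash Attention 2 only works with half-precision weights (fp16 or bf16), so the fallback is also the simpler option when the model is loaded in float32.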

In [4]:
pip install hf_xet

Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd

In [7]:
sg_df_clean = pd.read_csv("sg_df_clean.csv")

In [8]:
sg_df_clean.columns

Index(['ID', 'name', 'release_date', 'detailed_description', 'about_the_game',
       'short_description', 'metacritic_score', 'categories', 'genres',
       'positive', 'negative', 'estimated_owners', 'tags', 'user_reviews',
       'tags_dict', 'top_5_tags', 'rating'],
      dtype='object')

Let's load and run the model

In [11]:
from sentence_transformers import SentenceTransformer

# Load Model
model = SentenceTransformer('nomic-ai/modernbert-embed-base', device='cuda')
model.max_seq_length = 512  # truncate inputs to 512 tokens to save memory (the model supports up to 8192)

# Format Text
print("Formatting text...")
texts = []  # Create an empty list

# Loop through every row in the column
for text in sg_df_clean['about_the_game']:
    
    # Add the prefix. "search_document: " marks the text as a document to be
    # searched over rather than a query; the model was trained with these prefixes.
    new_text = "search_document: " + str(text)
    
    #  Add to the list
    texts.append(new_text)

# Generate Embeddings 
print("Generating embeddings...")
embeddings = model.encode(
    texts,
    batch_size=256,  
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
)

sg_df_clean["embeddings"] = list(embeddings)
print("Done.")

Formatting text...
Generating embeddings...


Batches:   0%|          | 0/283 [00:00<?, ?it/s]

Done.


In [13]:
sg_df_clean["embeddings"]

0        [-0.011169485, -0.013404448, 0.04623577, -0.04...
1        [-0.012440475, 0.007200008, -0.03891932, 0.017...
2        [-0.0075391605, -0.03697401, 0.0120488545, -0....
3        [-0.043684047, 0.040152546, 0.02400123, -0.021...
4        [-0.021892304, 0.031989895, -0.03811961, -0.02...
                               ...                        
72366    [0.042880114, 0.024679743, 0.037190583, -0.099...
72367    [0.009328879, -0.03561734, 0.04149969, -0.0447...
72368    [-0.0095696505, -0.037554722, 0.005656592, -0....
72369    [-0.0069766436, 0.06417681, 0.013568733, -0.07...
72370    [0.012446061, 0.047930624, 0.015035044, -0.049...
Name: embeddings, Length: 72371, dtype: object

In [18]:
import numpy as np

In [20]:
embedding_matrix = np.array(sg_df_clean["embeddings"].to_list()) #we need to "unpack" the column to create a matrix to use for cosine similarity

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

In [26]:
sim_matrix_embeddings = cosine_similarity(embedding_matrix) #cosine similarity for the new embeddings
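
Because the embeddings were generated with `normalize_embeddings=True`, every row already has unit length, so cosine similarity should reduce to a plain dot product. The sketch below checks this on random toy vectors (not the real embeddings); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.normal(size=(4, 8)).astype(np.float32)

# Normalise each row to unit length, as normalize_embeddings=True does
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)

# Full cosine formula on the raw vectors: (a . b) / (|a| * |b|)
norms = np.linalg.norm(toy, axis=1)
cosine = (toy @ toy.T) / np.outer(norms, norms)

# Plain dot product on the unit-length vectors
dot = unit @ unit.T

print(np.allclose(cosine, dot, atol=1e-5))
```

On that basis `embedding_matrix @ embedding_matrix.T` would be an alternative to `cosine_similarity` here, although the result has the same size either way.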

In [28]:
tfidf_matrix = pd.read_pickle("TF-IDF_V1.pkl")

In [30]:
sim_matrix_tfidf = cosine_similarity(tfidf_matrix) #cosine similarity for tf-idf

Let's free up some memory

In [None]:
import gc
gc.collect()

In [46]:
del tfidf_matrix
del sg_df_clean
del embedding_matrix

In [54]:
sim_matrix_tfidf = sim_matrix_tfidf.astype(np.float32) #saving memory

In [56]:
gc.collect()

1040

In [58]:
sim_matrix_embeddings = sim_matrix_embeddings.astype(np.float32) #saving memory

In [59]:
gc.collect()

0
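
To see why the clean-up above matters, here is a rough estimate of how much memory one dense similarity matrix needs for the 72371 games in the dataset. The helper `matrix_gib` is a hypothetical name; the calculation is just rows × columns × bytes per value.

```python
n_games = 72371  # number of rows in sg_df_clean, from the output above


def matrix_gib(n, bytes_per_value):
    """Size in GiB of a dense n x n matrix (hypothetical helper)."""
    return n * n * bytes_per_value / 1024**3


print(f"float64: {matrix_gib(n_games, 8):.1f} GiB")
print(f"float32: {matrix_gib(n_games, 4):.1f} GiB")
```

That works out to about 39 GiB per matrix in float64 and about 19.5 GiB in float32. Note that `astype` returns a copy by default, so the old and new arrays exist side by side for a moment during the conversion.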

##### Cosine Similarity

We will now repeat the same steps as in notebook 2 to create the combined cosine similarity matrix.

Let's add the weights to each matrix. I am going to decrease the TF-IDF weight to 0.3, as it carries less context than the "about the game" section, which the embeddings are built from.

In [50]:
tfidf_w = 0.3
embeddings_w = 0.7

In [62]:
weighted_matrix_1 = tfidf_w * sim_matrix_tfidf

In [64]:
del sim_matrix_tfidf

In [66]:
gc.collect()

0

In [68]:
weighted_matrix_2 = embeddings_w * sim_matrix_embeddings

In [70]:
del sim_matrix_embeddings

In [71]:
gc.collect()

0

In [74]:
final_matrix = weighted_matrix_1 + weighted_matrix_2

In [75]:
gc.collect()

0

In [76]:
del weighted_matrix_1 
del weighted_matrix_2
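
The weighted sum above can also be sketched with in-place operations, which avoid allocating extra full-size matrices along the way. This is an alternative under the assumption that the two input matrices are no longer needed afterwards; `blend_in_place` and the toy matrices are illustrative names, not part of the notebook.

```python
import numpy as np


def blend_in_place(sim_a, sim_b, weight_a, weight_b):
    """Overwrite sim_a with weight_a * sim_a + weight_b * sim_b.

    Both inputs are modified, so no new full-size array is created.
    """
    sim_a *= weight_a
    sim_b *= weight_b
    sim_a += sim_b
    return sim_a


tfidf_sim = np.array([[1.0, 0.5], [0.5, 1.0]], dtype=np.float32)
embed_sim = np.array([[1.0, 0.2], [0.2, 1.0]], dtype=np.float32)

# Reference result computed the ordinary way, before the inputs are changed
expected = 0.3 * tfidf_sim + 0.7 * embed_sim

final = blend_in_place(tfidf_sim, embed_sim, 0.3, 0.7)
print(np.allclose(final, expected))
```

The trade-off is that both inputs are overwritten, which is acceptable here because they are deleted right after the sum anyway.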

In [80]:
pd.to_pickle(final_matrix, "Full_cosine_matrix_modernbertembed.pkl") #exporting the matrix to pickle file
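
As a rough sketch of how a matrix like this could be queried later, the function below returns the indices of the most similar rows for one game. The name `top_k_similar` and the toy matrix are hypothetical; the real matrix would be loaded back with `pd.read_pickle`.

```python
import numpy as np


def top_k_similar(sim_matrix, index, k=5):
    """Return indices of the k rows most similar to `index`, excluding itself."""
    scores = sim_matrix[index].copy()
    scores[index] = -np.inf  # a game should not recommend itself
    order = np.argsort(scores)[::-1][:k]
    return order.tolist()


toy_matrix = np.array(
    [
        [1.0, 0.9, 0.1, 0.4],
        [0.9, 1.0, 0.2, 0.3],
        [0.1, 0.2, 1.0, 0.8],
        [0.4, 0.3, 0.8, 1.0],
    ],
    dtype=np.float32,
)

print(top_k_similar(toy_matrix, 0, k=2))
```

The returned positions are row numbers, so they would still need to be mapped back to game names through the dataframe.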