# Vectorizing item titles from the metadata datasets
The idea behind our design is to vectorize each item's title (and possibly, later, its description, once the text in that field has been properly processed), then compare the query vector against these item vectors to find the items closest to the user's query.

The [query and item vectorizer](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) I used here is a sentence-embedding model, a fine-tuned version of Microsoft's MPNet. After vectorization, I index the vectors with [Faiss](https://github.com/facebookresearch/faiss), a library for efficient similarity search and clustering of dense vectors.
Finally, I save the vectors and the index to `vectorized_texts_v2.pkl` and `faiss_index_v2.bin`, respectively.
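
To make the "find the closest items" step concrete, here is a minimal NumPy sketch of an exact nearest-neighbour search by squared L2 distance, which is the kind of comparison a flat L2 index performs. The function name `l2_search` and the toy 2-D vectors are hypothetical stand-ins for illustration; they are not part of this notebook or of the Faiss API.

```python
import numpy as np

def l2_search(item_vectors, query_vector, k=3):
    """Return (squared distances, row indices) of the k items nearest to the query."""
    diffs = item_vectors - query_vector               # broadcast the query across all rows
    sq_dists = np.einsum('ij,ij->i', diffs, diffs)    # squared L2 distance per row
    order = np.argsort(sq_dists, kind='stable')[:k]   # smallest distance first
    return sq_dists[order], order

# Toy 2-D "embeddings" standing in for real title vectors (hypothetical data)
items = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype='float32')
query = np.array([0.9, 0.1], dtype='float32')

dists, ids = l2_search(items, query, k=2)
print(ids)    # row indices of the two closest items
print(dists)  # their squared L2 distances
```

The returned row indices can be used to look up the matching rows of the DataFrame, since the vectors are added to the index in row order.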

In [34]:
import pandas as pd
import numpy as np
import faiss
from transformers import MPNetModel, MPNetTokenizer
from sentence_transformers import SentenceTransformer
import torch
from tqdm.auto import tqdm

tqdm.pandas()

In [35]:
df = pd.read_csv('data/all_beauty_meta_amazon.csv', usecols=['parent_asin','title','description'])

In [36]:

# Ensure CUDA is available for PyTorch
assert torch.cuda.is_available(), "CUDA is not available. Please check your PyTorch installation and GPU settings."

# Batch encoding function with GPU acceleration
def encode_texts_in_batches_gpu(texts, batch_size=32):
    tokenizer = MPNetTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
    model = MPNetModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').cuda()  # Move model to GPU
    model.eval()  # Evaluation mode

    all_embeddings = []
    
    for i in tqdm(range(0, len(texts), batch_size), desc="Encoding Texts"):
        batch_texts = texts[i:i + batch_size]
        encoded_input = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors='pt').to('cuda')
        
        with torch.no_grad():
            model_output = model(**encoded_input)
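        # Mean-pool over real tokens only, so padding added by the tokenizer does not skew the average
        mask = encoded_input['attention_mask'].unsqueeze(-1).float()
        summed = (model_output.last_hidden_state * mask).sum(dim=1)
        embeddings = (summed / mask.sum(dim=1).clamp(min=1e-9)).cpu().numpy()  # Move embeddings back to CPU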
        all_embeddings.extend(embeddings)
    
    return np.array(all_embeddings)

# Vectorize the item titles
texts = df['title'].astype(str).tolist()

# Encode the titles and store one vector per row
df['vector'] = list(encode_texts_in_batches_gpu(texts))

# Saving DataFrame with vectors
df.to_pickle("vectorized_texts_v2.pkl")

# Faiss index creation and saving
d = df['vector'].iloc[0].shape[0]  # Dimension of vectors
index = faiss.IndexFlatL2(d)
index.add(np.vstack(df['vector'].values))
faiss.write_index(index, "faiss_index_v2.bin")

Encoding Texts: 100%|██████████| 3519/3519 [02:37<00:00, 22.40it/s]
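
The encoding cell pools token embeddings into one vector per title. Because `padding=True` pads every title in a batch to the length of the longest one, the pooling step is sensitive to how padding positions are handled. The sketch below, in plain NumPy with made-up numbers, compares a mask-weighted mean (the style of pooling shown on the all-mpnet-base-v2 model card) with a naive mean over all positions. The function name `masked_mean_pool` and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # guard against division by zero
    return summed / counts

# Hypothetical batch: 2 sequences, 3 token slots, 2-dim embeddings.
# The second sequence has one real token followed by two padding slots.
tokens = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 4.0], [9.0, 9.0], [9.0, 9.0]],
])
mask = np.array([[1, 1, 1], [1, 0, 0]])

pooled = masked_mean_pool(tokens, mask)  # padding-aware average
naive = tokens.mean(axis=1)              # average that also counts padding slots

print(pooled)
print(naive)
```

For the first sequence the two results agree, since it has no padding. For the second, the naive mean is pulled toward the values in the padding slots, so the same title could get a different vector depending on which other titles share its batch.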
