# Part 02: Loading and Embedding the Data

## 🗒️ This notebook is divided into 3 sections:
1. Loading the Feature Group from the Hopsworks Feature Store
2. Embedding the data using the sentence-transformers library
3. Saving the model to the Hopsworks Model Registry

In [2]:
from dotenv import load_dotenv
import os
import streamlit as st
import hopsworks

## Pulling the Feature Group

In [5]:
# Load the Hopsworks API key from a .env file (or from Streamlit's secrets.toml)
load_dotenv()

HOPSWORKS_API_KEY = os.environ.get('HOPSWORKS_API_KEY')
# HOPSWORKS_API_KEY = st.secrets.HOPSWORKS.HOPSWORKS_API_KEY

# os.environ.get returns None instead of raising, so check the value explicitly
if not HOPSWORKS_API_KEY:
    raise Exception('Set environment variable HOPSWORKS_API_KEY')

In [6]:
try:
    project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
    fs = project.get_feature_store()
    
    print("Connected to the Hopsworks Feature Store")
except Exception as e:
    print(f"An error occurred: {e}")

2025-01-01 17:51:50,076 INFO: Initializing external client
2025-01-01 17:51:50,079 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-01 17:51:50,079 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-01 17:51:53,616 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1208511
Connected to the Hopsworks Feature Store


In [10]:
feature_group = fs.get_feature_group("papers_info", version=1)

In [11]:
# Pull the feature group as a Pandas DataFrame
df = feature_group.read()

2025-01-01 17:57:06,609 ERROR: Not data found for featuregroup paperrecommendation.papers_info_1. Detail: Python exception: FlyingDuckException. gRPC client debug context: UNKNOWN:Error received from peer ipv4:51.79.26.27:5005 {created_time:"2025-01-01T09:57:06.542442+00:00", grpc_status:2, grpc_message:"Not data found for featuregroup paperrecommendation.papers_info_1. Detail: Python exception: FlyingDuckException"}. Client context: IOError: Server never sent a data message. Detail: Internal
Traceback (most recent call last):
  File "c:\Users\aldir\anaconda3\envs\venvpaper\Lib\site-packages\hsfs\core\arrow_flight_client.py", line 364, in afs_error_handler_wrapper
    return func(instance, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\aldir\anaconda3\envs\venvpaper\Lib\site-packages\hsfs\core\arrow_flight_client.py", line 427, in read_query
    return self._get_dataset(
           ^^^^^^^^^^^^^^^^^^
  File "c:\Users\aldir\anaconda3\envs\venvpaper\Lib\site-package

FeatureStoreException: Could not read data using Hopsworks Feature Query Service.
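
The read above fails inside the Hopsworks Feature Query Service (the Arrow Flight client shown in the traceback). One possible way to handle this, sketched below, is to try the default read and, if it raises, retry once with alternative read options. The helper `read_with_fallback` and the stand-in class `FakeFeatureGroup` are hypothetical and exist only so the sketch runs on its own. The `read_options={"use_hive": True}` value is an assumption based on the hsfs documentation for the Python engine and may not be supported by every version. If the feature group genuinely holds no data yet, no read path will return rows.

```python
def read_with_fallback(feature_group, fallback_options):
    """Try the default read; on failure, retry once with fallback_options."""
    try:
        return feature_group.read()
    except Exception as primary_error:
        print(f"Default read failed: {primary_error}")
        return feature_group.read(read_options=fallback_options)


class FakeFeatureGroup:
    """Hypothetical stand-in that fails unless read_options are supplied."""

    def read(self, read_options=None):
        if not read_options:
            raise RuntimeError(
                "Could not read data using Hopsworks Feature Query Service."
            )
        return [{"titles": "An example paper title"}]


# Assumed fallback option; check the installed hsfs version before relying on it
rows = read_with_fallback(FakeFeatureGroup(), {"use_hive": True})
print(rows)
```

With a real feature group, the equivalent call would look like `df = read_with_fallback(feature_group, {"use_hive": True})`, under the same assumption about the option name.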

In [None]:
import pandas as pd
# Setting pandas option to display the full content of DataFrame columns without truncation
pd.set_option('display.max_colwidth', None)

df.head()

## Embedding process

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# The feature we want to encode
sentences = df['titles']

# Encode the titles by calling model.encode(); passing a plain list of strings
# avoids depending on the DataFrame's index
embeddings = model.encode(sentences.tolist())
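
`model.encode()` returns one fixed-length vector per title (384 dimensions for `all-MiniLM-L6-v2`), and titles with similar meaning should end up with vectors pointing in similar directions. The sketch below illustrates how such vectors could be compared with cosine similarity. It uses tiny hand-made 3-dimensional vectors rather than real embeddings, and the helper names `cosine_similarity` and `most_similar` are hypothetical, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_k=2):
    """Return the indices of the top_k vectors most similar to query."""
    scores = [cosine_similarity(query, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy vectors standing in for title embeddings (assumed values)
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.05, 0.0]

top = most_similar(query_vector, toy_embeddings)
print(top)
```

In practice the same comparison would usually be done with vectorized NumPy operations on the array returned by `model.encode()`; the pure-Python version is only meant to make the arithmetic visible.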

In [None]:
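from itertools import islice

# Print the first few sentences with the length of their embedding vectors
max_rows = 10
for sentence, embedding in islice(zip(sentences, embeddings), max_rows):
    print("Sentence:", sentence)
    print("Embedding length:", len(embedding))  # each embedding is a NumPy array of floats
    print("")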

In [None]:
import pickle

# Make sure the output directory exists before writing to it
os.makedirs('../models', exist_ok=True)

# Save the embeddings and the corresponding sentences
with open('../models/titles_embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

with open('../models/titles_sentences.pkl', 'wb') as f:
    pickle.dump(sentences, f)
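
Because the two pickle files are uploaded and consumed later, it can be worth confirming that what was written can be read back unchanged. A minimal sketch of such a round-trip check follows. It writes stand-in data to a temporary directory rather than the real `embeddings` and `sentences`, and the file name inside that directory is only illustrative.

```python
import os
import pickle
import tempfile

# Stand-in data (assumed values) in place of the real titles and embeddings
sample_sentences = ["A title about graphs", "A title about transformers"]
sample_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "titles_embeddings.pkl")

    # Write the object, then read it back from the same path
    with open(path, "wb") as f:
        pickle.dump(sample_embeddings, f)
    with open(path, "rb") as f:
        restored_embeddings = pickle.load(f)

# Check that there is one embedding per sentence and the content is unchanged
print(len(restored_embeddings) == len(sample_sentences))
print(restored_embeddings == sample_embeddings)
```

With the real NumPy array returned by `model.encode()`, `==` compares element by element, so a check such as `numpy.array_equal` would be needed instead.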

## Saving the model to the Hopsworks Model Registry

In [None]:
try:
    mr = project.get_model_registry()
    
    print("Connected to the Hopsworks Model Registry")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
mr_sentences = mr.python.create_model(
    name="titles_sentences",
    description="Scientific papers titles"
)

In [None]:
mr_sentences.save("../models/titles_sentences.pkl")

In [None]:
mr_embeddings = mr.python.create_model(
    name="titles_embeddings",
    description="Scientific papers embeddings"
)

In [None]:
mr_embeddings.save("../models/titles_embeddings.pkl")