# 1. Create a Database from a CSV File Using Milvus and Pandas
- Read cleaned data
- Create a local database for testing
- Add necessary data to database

## Follow [milvus_quickstart](./milvus_quickstart.ipynb) first, then explore further

In [7]:
import pandas as pd
from pymilvus import model
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

## 1.1. Read Data

In [2]:
df = pd.read_csv("/workspaces/Music_Playlist_Generation/music_playlist_generation/data/data.csv")
df.head()

Unnamed: 0,input,track_name
0,danceability:0.676 track_genre:acoustic valenc...,Comedy
1,danceability:0.42 track_genre:acoustic valence...,Ghost - Acoustic
2,danceability:0.438 track_genre:acoustic valenc...,To Begin Again
3,danceability:0.266 track_genre:acoustic valenc...,Can't Help Falling In Love
4,danceability:0.618 track_genre:acoustic valenc...,Hold On


Here `track_name` is the value we expect the vector database to return when it is queried with the features stored in `input`.
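To make the shape of the `input` column concrete, here is a minimal sketch that splits one such string back into named features. It assumes each value is a space-separated list of `key:value` pairs with no spaces inside a value, which is what the truncated preview above suggests but does not confirm. The helper name `parse_features` and the sample string are hypothetical and are not taken from the dataset.

```python
def parse_features(text):
    """Split a 'key:value key:value' string into a dict.

    Assumes pairs are separated by whitespace and values contain no spaces.
    Values that look numeric are converted to float; the rest stay strings.
    """
    features = {}
    for pair in text.split():
        key, _, raw = pair.partition(":")
        try:
            features[key] = float(raw)
        except ValueError:
            features[key] = raw
    return features


# Made-up example in the same style as the `input` column (not a real row).
example = "danceability:0.5 track_genre:acoustic valence:0.7"
parsed = parse_features(example)
print(parsed)
```

The embedding model later receives the whole string as-is, so this parsing is only useful for inspecting or validating rows before they are vectorized.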

## 1.2 Setting up Milvus DB

In [3]:
client = MilvusClient("/workspaces/Music_Playlist_Generation/music_playlist_generation/databases/mpg_v0x1.db")

### 1.2.1 Creating a Collection

A collection in Milvus is like a table in a traditional database: it is where our data will be stored. Each collection can have multiple fields, akin to columns in a table. A collection has a primary key field, which uniquely identifies each entity within the collection so that it can be looked up and accessed. Here `auto_id` is enabled, so Milvus generates the values of the primary key field (`music_id`) automatically.

In [4]:
if client.has_collection(collection_name="music_collection"):
    print("Deleting old data")
    client.drop_collection(collection_name="music_collection")
client.create_collection(
    collection_name="music_collection",
    dimension=768,  # The vectors used in this demo have 768 dimensions
    primary_field_name="music_id",
    auto_id=True
)

Deleting old data


### 1.2.2 Data Preparation

- Convert the DataFrame to a list of dicts (each dict becomes a new row), or alternatively use `Collection.construct_from_dataframe`.
- Vectorize the data using an embedding model. The vectors are what Milvus searches over, so in this scenario we vectorize the `input` column, which combines `danceability`, `track_genre` and `valence`.
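The first bullet can be sketched without Milvus or an embedding model. The list `records` below stands in for what `df.to_dict(orient="records")` would return, and the three-element vectors stand in for real 768-dimensional embeddings; both contain made-up values. The helper name `attach_vectors` is hypothetical, and the field name `vector` is assumed to be the default vector field of a collection created the way it is above.

```python
def attach_vectors(records, vectors, vector_field="vector"):
    """Return new row dicts with each embedding stored under `vector_field`.

    No primary key is added: with auto_id enabled, Milvus assigns it on insert.
    """
    if len(records) != len(vectors):
        raise ValueError("records and vectors must have the same length")
    return [
        {**record, vector_field: list(vector)}
        for record, vector in zip(records, vectors)
    ]


# Stand-in rows shaped like df.to_dict(orient="records") output (values are made up).
records = [
    {"input": "danceability:0.5 track_genre:acoustic valence:0.7", "track_name": "Example A"},
    {"input": "danceability:0.9 track_genre:dance valence:0.4", "track_name": "Example B"},
]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # real embeddings would have 768 values

rows = attach_vectors(records, toy_vectors)
print(rows[0].keys())
```

Keeping the original columns next to the vector means a search hit can return `track_name` directly, without a second lookup in the DataFrame.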

In [6]:
# Pull the text to embed and the matching track names out of the DataFrame
inputs = df["input"].to_list()  # named `inputs` to avoid shadowing the built-in `input`
track_names = df["track_name"].to_list()

# The first call downloads the "all-mpnet-base-v2" sentence-transformers model.
embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(
    model_name='all-mpnet-base-v2', # Specify the model name
    device='cpu' # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

# Vectorize input data
vectors = embedding_fn.encode_documents(inputs)
# The output vectors have 768 dimensions, matching the collection that we just created.
print("Dim:", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)

TypeError: OnnxEmbeddingFunction.__init__() got an unexpected keyword argument 'clean_up_tokenization_spaces'

In [None]:
# This will download a small embedding model "paraphrase-albert-small-v2" (~50MB).
embedding_fn = model.DefaultEmbeddingFunction()

# Embed the feature strings from the DataFrame loaded in section 1.1
docs = df["input"].to_list()
track_names = df["track_name"].to_list()
vectors = embedding_fn.encode_documents(docs)
# The output vector has 768 dimensions, matching the collection that we just created.
print("Dim:", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)

# Each entity has a vector representation, the raw input text, and the track name
# that a search should return.
# No primary key is supplied because the collection was created with auto_id=True.
data = [
    {"vector": vectors[i], "input": docs[i], "track_name": track_names[i]}
    for i in range(len(vectors))
]

print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))
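The last step in the outline, adding the data to the database, could be sketched as follows. `MilvusClient.insert` accepts a `collection_name` and a list of row dicts as `data`. Splitting the rows into batches is an assumption made here to keep individual requests small, not something the notebook prescribes, and the helper name `insert_in_batches` is hypothetical. So that the sketch runs without a Milvus instance, it is exercised with a small stand-in client that only records calls; in the notebook, the real `client` would be passed instead.

```python
def insert_in_batches(client, collection_name, rows, batch_size=1000):
    """Insert `rows` into `collection_name` in chunks of `batch_size`.

    `client` is expected to expose insert(collection_name=..., data=...),
    as MilvusClient does. Returns the number of rows sent.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        client.insert(collection_name=collection_name, data=batch)
        sent += len(batch)
    return sent


class RecordingClient:
    """Stand-in for MilvusClient (an assumption for this demo): records insert calls."""

    def __init__(self):
        self.calls = []

    def insert(self, collection_name, data):
        self.calls.append((collection_name, len(data)))


# Toy rows with 3-element vectors; real rows would carry 768-dimensional embeddings.
toy_rows = [{"vector": [0.0, 0.1, 0.2], "track_name": f"Track {i}"} for i in range(5)]

fake_client = RecordingClient()
total = insert_in_batches(fake_client, "music_collection", toy_rows, batch_size=2)
print(total, fake_client.calls)
```

With the real client, the equivalent call would be `insert_in_batches(client, "music_collection", data)`.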