# Load Processed Data into Vector Database

This notebook loads the output of Data Prep Kit into a Milvus vector database.

**Step-4 in this workflow**

![](../media/rag-overview-2.png)


## Configuration

In [1]:
class MyConfig:
    pass
MY_CONFIG = MyConfig()

MY_CONFIG.PROCESSED_DATA_DIR = 'output/output_final'

MY_CONFIG.DB_URI = './rag_1_dpk.db'  # For embedded instance
#MY_CONFIG.DB_URI = 'http://localhost:19530'  # For Docker instance
MY_CONFIG.COLLECTION_NAME = 'dpk_walmart_docs'

## Load Parquet Data

Load all `.parquet` files in the given directory.

In [2]:
import pandas as pd
import glob

print ('Loading data from : ', MY_CONFIG.PROCESSED_DATA_DIR)

# Get a list of all Parquet files in the directory
parquet_files = glob.glob(f'{MY_CONFIG.PROCESSED_DATA_DIR}/*.parquet')
print ("Number of parquet files to read : ", len(parquet_files))
print ()

# Create an empty list to store the DataFrames
dfs = []

# Loop through each Parquet file and read it into a DataFrame
for file in parquet_files:
    df = pd.read_parquet(file)
    print (f"Read file: '{file}'.  number of rows = {df.shape[0]}")
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
data_df = pd.concat(dfs, ignore_index=True)

print (f"\nTotal number of rows = {data_df.shape[0]}")

Loading data from :  output/output_final
Number of parquet files to read :  2

Read file: 'output/output_final/Walmart-10K-Reports-Optimized_2023.parquet'.  number of rows = 666
Read file: 'output/output_final/Walmart_2024.parquet'.  number of rows = 636

Total number of rows = 1302
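
If the directory contains no `.parquet` files, `pd.concat` fails with a rather generic "No objects to concatenate" error. A minimal sketch of a guard that reports the problem earlier is shown below. The helper name `find_parquet_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_parquet_files(data_dir: str) -> list:
    """Return the sorted paths of all .parquet files directly inside data_dir.

    Hypothetical helper: raises early with a clear message when the
    directory holds no Parquet files.
    """
    files = sorted(str(p) for p in Path(data_dir).glob("*.parquet"))
    if not files:
        raise FileNotFoundError(f"No .parquet files found in '{data_dir}'")
    return files
```

In this notebook, the returned list could take the place of the `glob.glob(...)` result before the read loop.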


In [3]:

## Shape the data

MY_CONFIG.EMBEDDING_LENGTH =  len(data_df.iloc[0]['embeddings'])
print ('embedding length: ', MY_CONFIG.EMBEDDING_LENGTH)

# rename 'embeddings' columns as 'vector' to match default schema
# if 'vector' not in data_df.columns and 'embeddings' in data_df.columns:
#     data_df = data_df.rename( columns= {'embeddings' : 'vector'})
# if 'text' not in data_df.columns and 'contents' in data_df.columns:
#     data_df = data_df.rename( columns= {'contents' : 'text'})

data_df = data_df.rename( columns= {'embeddings' : 'vector', 'contents' : 'text'})

print (data_df.info())
data_df.head(3)

embedding length:  384
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302 entries, 0 to 1301
Data columns (total 29 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   filename                      1302 non-null   object 
 1   num_pages                     1302 non-null   int64  
 2   num_tables                    1302 non-null   int64  
 3   num_doc_elements              1302 non-null   int64  
 4   document_id                   1302 non-null   object 
 5   ext                           1302 non-null   object 
 6   hash                          1302 non-null   object 
 7   size                          1302 non-null   int64  
 8   date_acquired                 1302 non-null   object 
 9   pdf_convert_time              1302 non-null   float64
 10  source_filename               1302 non-null   object 
 11  text                          1302 non-null   object 
 12  doc_jsonpath                  1302 non-

|   | filename | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | ... | docq_symbol_to_word_ratio | docq_sentence_count | docq_lorem_ipsum_ratio | docq_curly_bracket_ratio | docq_contain_bad_word | docq_bullet_point_ratio | docq_ellipsis_line_ratio | docq_alphabet_word_ratio | docq_contain_common_en_words | vector |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart-10K-Reports-Optimized_2023.pdf | 100 | 81 | 1163 | d626ab9b-0f53-446c-b55d-150fbbd93066 | pdf | ea5544f26fe0831ec9befbf7aaf68b1b256df6c3ae18b8... | 1159974 | 2024-08-29T00:43:21.059856 | 332.679391 | ... | 0.0 | 3 | 0.0 | 0.0 | False | 0.0 | 0.0 | 1.0 | True | [-0.006206639, 0.010256912, 0.023658218, -0.02... |
| 1 | Walmart-10K-Reports-Optimized_2023.pdf | 100 | 81 | 1163 | d626ab9b-0f53-446c-b55d-150fbbd93066 | pdf | ea5544f26fe0831ec9befbf7aaf68b1b256df6c3ae18b8... | 1159974 | 2024-08-29T00:43:21.059856 | 332.679391 | ... | 0.0 | 1 | 0.0 | 0.0 | False | 0.0 | 0.0 | 0.909091 | True | [-0.0497427, 0.046492133, -0.02381167, 0.02798... |
| 2 | Walmart-10K-Reports-Optimized_2023.pdf | 100 | 81 | 1163 | d626ab9b-0f53-446c-b55d-150fbbd93066 | pdf | ea5544f26fe0831ec9befbf7aaf68b1b256df6c3ae18b8... | 1159974 | 2024-08-29T00:43:21.059856 | 332.679391 | ... | 0.0 | 1 | 0.0 | 0.0 | False | 0.0 | 0.0 | 0.875 | False | [-0.03265641, -0.040947884, 0.017305722, 0.022... |
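
The cell above takes the embedding length from the first row only. Since that length later becomes the collection's `dimension`, it may be worth confirming that every row agrees. The following is a minimal sketch; the helper name `common_vector_length` is hypothetical and the check is an addition, not something the original notebook does.

```python
def common_vector_length(vectors) -> int:
    """Return the length shared by all vectors, or raise if they differ.

    `vectors` can be any iterable of sequences, for example a pandas
    column of embedding arrays or a plain list of lists.
    """
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"Expected exactly one vector length, found: {sorted(lengths)}")
    return lengths.pop()


# Small demonstration with plain lists
print(common_vector_length([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))  # 3
```

In this notebook, a call such as `common_vector_length(data_df['vector'])` after the rename would be expected to return 384.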


## Connect to Vector Database

Milvus can run as an embedded instance backed by a local file, which makes it easy to use.

<span style="color:blue;">Note: If you see an error saying the database cannot be loaded, try the following: </span>

- <span style="color:blue;">In **VS Code**: **restart the kernel** of the previous notebook. This releases the db.lock. </span>
- <span style="color:blue;">In **Jupyter**: choose `File --> Close and Shutdown Notebook` for the previous notebook. This releases the db.lock.</span>
- <span style="color:blue;">Then re-run this cell.</span>




In [4]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(MY_CONFIG.DB_URI)

print ("✅ Connected to Milvus instance:", MY_CONFIG.DB_URI)

✅ Connected to Milvus instance: ./rag_1_dpk.db


## Create a Collection



In [5]:
# if we already have a collection, clear it first
if milvus_client.has_collection(collection_name=MY_CONFIG.COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=MY_CONFIG.COLLECTION_NAME)
    print ('✅ Cleared collection :', MY_CONFIG.COLLECTION_NAME)


milvus_client.create_collection(
    collection_name=MY_CONFIG.COLLECTION_NAME,
    dimension=MY_CONFIG.EMBEDDING_LENGTH,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
    auto_id=True
)
print ("✅ Created collection :", MY_CONFIG.COLLECTION_NAME)


✅ Cleared collection : dpk_walmart_docs
✅ Created collection : dpk_walmart_docs
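
The collection uses the inner product (`"IP"`) metric. Inner product ranks results the same way as cosine similarity only when the vectors are unit length. Whether the embeddings in this dataset are normalized is not stated here, so treat that as an assumption to verify. The sketch below uses made-up two-dimensional vectors to show the relationship.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Illustrative vectors only, not taken from the dataset
a, b = [3.0, 4.0], [4.0, 3.0]

ip_raw = inner_product(a, b)                         # depends on vector magnitude
ip_unit = inner_product(normalize(a), normalize(b))  # magnitude removed
cos = cosine_similarity(a, b)

print(ip_raw, ip_unit, cos)
```

After normalization the inner product and the cosine similarity agree up to floating-point rounding, while the raw inner product also reflects vector magnitude.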


In [6]:
res = milvus_client.insert(collection_name=MY_CONFIG.COLLECTION_NAME, data=data_df.to_dict('records'))

print('inserted # rows', res['insert_count'])

milvus_client.get_collection_stats(MY_CONFIG.COLLECTION_NAME)

inserted # rows 1302


{'row_count': 1302}
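
The single `insert` call above sends all 1302 rows at once, which works at this size. For much larger datasets, inserting in batches may keep memory use and request sizes manageable. The following is a minimal sketch; the helper name `iter_batches` and the batch size of 500 are arbitrary assumptions, not values from the original notebook.

```python
def iter_batches(records, batch_size=500):
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Hypothetical usage with the client and DataFrame from this notebook:
#   records = data_df.to_dict('records')
#   total = 0
#   for batch in iter_batches(records):
#       res = milvus_client.insert(collection_name=MY_CONFIG.COLLECTION_NAME, data=batch)
#       total += res['insert_count']
#   print('inserted # rows', total)

# Demonstration on plain integers standing in for rows
sizes = [len(batch) for batch in iter_batches(list(range(1302)))]
print(sizes)  # [500, 500, 302]
```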

## Close DB Connection

Close the connection so that the lock files are released and other notebooks can access the database.

In [7]:
milvus_client.close()

print ("✅ SUCCESS")


✅ SUCCESS
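
If an earlier cell raises an exception, the `close()` call may never run and the lock can stay held. One way to guard against that in script-style code is `contextlib.closing`, which calls `close()` on exit even when an error occurs. The sketch below uses a stand-in class, `FakeClient`, which is hypothetical and only mimics the `close()` method that the notebook calls on the Milvus client.

```python
from contextlib import closing


class FakeClient:
    """Stand-in for a database client; only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


client = FakeClient()
try:
    with closing(client):
        # Simulate a failure in the middle of the work
        raise RuntimeError("simulated failure during insert")
except RuntimeError:
    pass

print(client.closed)  # True
```

The same pattern should apply to any object that exposes a `close()` method, so wrapping the real client in `closing(...)` is a plausible way to reuse it here.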


## Test your data by doing a Vector Search

See notebook [vector_search.ipynb](vector_search.ipynb)