# CSV Embeddings with Pandas
Generate sentence embeddings for a text column of a CSV file, then run semantic search over them, first with SciPy cosine distances and then with ChromaDB.

## 1. Install Required Packages
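
The install cell is empty in this export. Judging by the imports in the next section, the notebook needs `pandas`, `numpy`, `scipy`, `sentence-transformers`, and `chromadb`, which would typically be installed with `pip install pandas numpy scipy sentence-transformers chromadb`. A minimal sketch for checking which of them are missing before running the rest; the `missing_packages` helper is hypothetical and not part of the original notebook.

```python
import importlib.util

# Import name -> PyPI name (they differ for sentence-transformers).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scipy": "scipy",
    "sentence_transformers": "sentence-transformers",
    "chromadb": "chromadb",
}

def missing_packages(required=REQUIRED):
    """Return the PyPI names of the packages that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```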

## 2. Import Libraries & Load CSV

In [31]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from scipy.spatial import distance
import chromadb

# Load CSV file
csv_file = "/Users/kipronno/Mini Projects/prompts.csv"
df = pd.read_csv(csv_file)

print(f"Loaded {len(df)} rows")
print(f"\nColumns: {df.columns.tolist()}")
print("\nFirst few rows:")
df.head()

Loaded 1013 rows

Columns: ['act', 'prompt', 'for_devs', 'type', 'contributor']

First few rows:


| | act | prompt | for_devs | type | contributor |
|---|---|---|---|---|---|
| 0 | Ethereum Developer | Imagine you are an experienced Ethereum develo... | True | TEXT | ameya-2003 |
| 1 | Linux Terminal | I want you to act as a linux terminal. I will ... | True | TEXT | f |
| 2 | English Translator and Improver | I want you to act as an English translator, sp... | False | TEXT | f |
| 3 | Job Interviewer | I want you to act as an interviewer. I will be... | False | TEXT | f,iltekin |
| 4 | JavaScript Console | I want you to act as a javascript console. I w... | True | TEXT | omerimzali |
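
`model.encode` expects strings, so missing or blank values in the text column are worth handling before the embedding step. A minimal sketch, assuming a small in-memory CSV as a stand-in for `prompts.csv`; the sample rows and the `clean_text_column` helper are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for prompts.csv, with one missing and one blank prompt.
sample_csv = io.StringIO(
    "act,prompt\n"
    "Linux Terminal,I want you to act as a linux terminal.\n"
    "Empty Row,\n"
    "Blank Row,   \n"
    "Job Interviewer,I want you to act as an interviewer.\n"
)
sample_df = pd.read_csv(sample_csv)

def clean_text_column(frame, column):
    """Drop rows whose text is missing or whitespace-only, then reset the index."""
    text = frame[column].fillna("").astype(str).str.strip()
    keep = text != ""
    return frame.loc[keep].reset_index(drop=True)

cleaned = clean_text_column(sample_df, "prompt")
print(f"Kept {len(cleaned)} of {len(sample_df)} rows")
```

Resetting the index keeps row labels aligned with the positional ids (`range(len(df))`) that the notebook assigns later when storing rows in ChromaDB.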


## 3. Load Embedding Model

In [32]:
# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 1580.48it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## 4. Generate Embeddings

In [33]:
# Select the text column and convert to list
descriptions = df['prompt'].tolist()

# Generate embeddings
embeddings = model.encode(descriptions, convert_to_numpy=True, show_progress_bar=True)

print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding dimension: {embeddings[0].shape if len(embeddings) > 0 else 'N/A'}")

Batches: 100%|██████████| 32/32 [00:05<00:00,  5.66it/s]

Generated 1013 embeddings
Each embedding dimension: (384,)
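
Cosine similarity depends only on direction, so it can be useful to check whether the embeddings are unit length; for unit-length vectors, cosine similarity reduces to a plain dot product. A minimal sketch using random vectors as a stand-in for the `embeddings` array; `l2_normalize` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# Random stand-in with the same width as the notebook's embeddings (384).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 384)).astype(np.float32)

unit = l2_normalize(fake_embeddings)
row_norms = np.linalg.norm(unit, axis=1)
print(unit.shape, np.allclose(row_norms, 1.0, atol=1e-5))
```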





## 5. Semantic Search Function

In [34]:
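def find_similar_items(query_text, embeddings, df, n_results=3):
    """Return the n_results rows of df whose embeddings are closest to query_text."""
    # Encode the query with the same model used for the rows
    query_embedding = model.encode([query_text], convert_to_numpy=True)[0]
    
    # Cosine distance to every row (0 = same direction, larger = less similar)
    distances = [distance.cosine(query_embedding, emb) for emb in embeddings]
    
    # Indices of the n_results smallest distances, nearest first
    top_indices = np.argsort(distances)[:n_results]
    
    return df.iloc[top_indices]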

print("Function defined!")

Function defined!
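
The loop in `find_similar_items` calls `distance.cosine` once per row, which is fine for 1013 rows but scales poorly. A vectorized variant is sketched below as an alternative, not as the notebook's method: `top_k_cosine` and the toy vectors are hypothetical, and the code assumes no all-zero rows.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=3):
    """Return (indices, distances) of the k rows of matrix closest to query_vec."""
    matrix = np.asarray(matrix, dtype=np.float64)
    query_vec = np.asarray(query_vec, dtype=np.float64)
    # Cosine similarity = dot product divided by the product of the norms.
    sims = (matrix @ query_vec) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    )
    distances = 1.0 - sims
    order = np.argsort(distances)[:k]  # smallest distance first
    return order, distances[order]

# Toy 2-D vectors standing in for the embedding matrix.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
indices, dists = top_k_cosine(np.array([1.0, 0.1]), toy, k=2)
print(indices, dists)
```

With the notebook's objects, the returned indices could be passed to `df.iloc[...]` in the same way `top_indices` is used above.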


## 6. Test Semantic Search

In [35]:
# Test semantic search
query1 = "Similar prompts"

print(f"Query: '{query1}'")
print("\n" + "="*50)

results1 = find_similar_items(query1, embeddings, df, n_results=3)

print("Top 3 similar items:")
results1

Query: 'Similar prompts'

==================================================
Top 3 similar items:


| | act | prompt | for_devs | type | contributor |
|---|---|---|---|---|---|
| 165 | Prompt Enhancer | Act as a Prompt Enhancer AI that takes user-in... | True | TEXT | iuzn |
| 214 | Reverse Prompt Engineer | I want you to act as a Reverse Prompt Engineer... | True | TEXT | jcordon5 |
| 363 | Senior Prompt Engineer Role Guide | Senior Prompt Engineer,"Imagine you are a worl... | False | TEXT | iamcanturk |


## 7. Create ChromaDB Collection

In [36]:
client = chromadb.PersistentClient()

# Create collection
collection = client.get_or_create_collection(name="prompts_embeddings")

if collection:
    print(f"Collection created: {collection.name}")

Collection created: prompts_embeddings
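
The collection is created here without naming a distance metric. As far as the ChromaDB documentation indicates, the default is squared L2 rather than the cosine distance used in section 5; treat that as an assumption and check it against your installed version. For unit-length vectors the two give the same ranking, because the squared L2 distance equals twice the cosine distance. The sketch below checks that identity with NumPy on random stand-in vectors.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(6, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # make rows unit length

query = vectors[0]
others = vectors[1:]

cosine_distance = 1.0 - others @ query
squared_l2 = np.sum((others - query) ** 2, axis=1)

# For unit vectors: squared L2 = 2 * cosine distance, so the rankings agree.
print(np.allclose(squared_l2, 2.0 * cosine_distance))
print(np.argsort(cosine_distance), np.argsort(squared_l2))
```

If the embeddings are not unit length, ChromaDB has accepted `metadata={"hnsw:space": "cosine"}` at collection creation to request cosine distance; verify the exact option for your version before relying on it.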


## 8. Store Embeddings in ChromaDB

In [37]:
# Prepare data for storage
docs = descriptions
ids = [str(i) for i in range(len(df))]

# Store in ChromaDB
collection.add(
    documents=docs,
    ids=ids,
    embeddings=embeddings
)
print(f"Stored {len(df)} items in ChromaDB")

Stored 1013 items in ChromaDB
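
`collection.add` is called once with all 1013 rows. For larger files it is common to insert in batches, since vector stores typically cap the number of records per call. A minimal sketch of the batching logic only, with no ChromaDB dependency; `batched` is a hypothetical helper and the batch size of 500 is an arbitrary assumption.

```python
def batched(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

spans = list(batched(1013, 500))
print(spans)

# With the notebook's objects, each span would be passed on like this:
# for start, stop in batched(len(ids), 500):
#     collection.add(
#         documents=docs[start:stop],
#         ids=ids[start:stop],
#         embeddings=embeddings[start:stop],
#     )
```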


## 9. Query ChromaDB

In [38]:
def search_chromadb(query_text, collection, n_results=3):
    # Encode query and convert to list
    query_embedding = model.encode([query_text], convert_to_numpy=True).tolist()
    
    # Query the collection
    results = collection.query(query_embeddings=query_embedding, n_results=n_results)
    
    return results

print("Search function defined!")

Search function defined!
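
`collection.query` returns a dictionary whose values hold one inner list per query, which is why the next cell indexes `chroma_results['documents'][0]`. Because the ids were stored as stringified row positions, they can be mapped back to DataFrame rows. A sketch using a hand-written dictionary in that shape; the sample values and the `ids_to_positions` helper are hypothetical.

```python
# Hand-written stand-in for the return value of collection.query
# for a single query text (values are made up for illustration).
sample_results = {
    "ids": [["165", "214", "363"]],
    "documents": [["doc a", "doc b", "doc c"]],
    "distances": [[0.91, 0.97, 1.02]],
}

def ids_to_positions(results, query_index=0):
    """Convert the string ids of one query's hits back to integer row positions."""
    return [int(item_id) for item_id in results["ids"][query_index]]

positions = ids_to_positions(sample_results)
for pos, dist in zip(positions, sample_results["distances"][0]):
    print(f"row {pos}: distance {dist:.2f}")

# With the notebook's objects: df.iloc[ids_to_positions(chroma_results)]
```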


## 10. Test ChromaDB Search

In [39]:
# Test ChromaDB search
chroma_results = search_chromadb("Semantic Search", collection, n_results=3)

print("Search Results:")
for i, doc in enumerate(chroma_results['documents'][0], 1):
    print(f"\n{i}. {doc}")

Search Results:

1. # ROLE: PALADIN OCTEM (Competitive Research Swarm)

## 🏛️ THE PRIME DIRECTIVE
You are not a standard assistant. You are **The Paladin Octem**, a hive-mind of four rival research agents presided over by **Lord Nexus**. Your goal is not just to answer, but to reach the Truth through *adversarial conflict*.

## 🧬 THE RIVAL AGENTS (Your Search Modes)
When I submit a query, you must simulate these four distinct personas accessing Perplexity's search index differently:

1. **[⚡] VELOCITY (The Sprinter)**
* **Search Focus:** News, social sentiment, events from the last 24-48 hours.
* **Tone:** "Speed is truth." Urgent, clipped, focused on the *now*.
* **Goal:** Find the freshest data point, even if unverified.

2. **[📜] ARCHIVIST (The Scholar)**
* **Search Focus:** White papers, .edu domains, historical context, definitions.
* **Tone:** "Context is king." Condescending, precise, verbose.
* **Goal:** Find the deepest, most cited source to prove Velocity wrong.
