
GOUTHAM-2002/VectorDB-FromScratch

📊 VectorDB From Scratch

A learning project: I built my own vector database from scratch to understand how vector databases work and to trace every step.



Why I Built This

To understand how vector databases like Pinecone, Weaviate, and Chroma work by building one myself.

Goal: Learn by doing, not just using.


What I Learned

✅ Vector embeddings and similarity search
✅ TF-IDF vectorization from scratch
✅ Word embeddings (Word2Vec concepts)
✅ PCA for dimensionality reduction
✅ Visualizing high-dimensional data in 2D


What It Does

  • Converts text to vector embeddings
  • Stores vectors in memory
  • Searches similar vectors using cosine similarity
  • Visualizes embeddings with PCA

Quick Start

python wordEmbedding.py
# or
python tfidf.py

Files

  • wordEmbedding.py - Word embedding implementation
  • tfidf.py - TF-IDF vectorization
  • Both include PCA visualization
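
For readers who want to trace TF-IDF by hand, here is a minimal sketch using only the standard library. It is not the code in tfidf.py: the function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of text documents.

    Hypothetical helper: tokenization and IDF smoothing are assumptions.
    """
    # Naive tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        # Term frequency (relative count) times IDF, one entry per vocabulary term.
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the dog sat", "cats and dogs"]
vocab, vecs = tfidf_vectors(docs)
```

Each document becomes one vector with an entry per vocabulary term. Terms that appear in fewer documents (such as "cat") get a larger weight than terms shared across documents (such as "the").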

How It Works

# 1. Convert text to vectors
vectors = create_embeddings(documents)

# 2. Store in "database" (Python dict)
db = VectorDatabase()
db.add(vectors)

# 3. Search similar vectors
results = db.search(query_vector, top_k=5)

# 4. Visualize with PCA
visualize_2d(vectors)
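
Steps 2 and 3 above can be sketched as a small in-memory class. This is an illustrative version, not the repository's actual implementation: the sequential integer ids, the `_vectors` attribute, and the `(id, score)` return format are assumptions. Search is a brute-force scan that scores every stored vector with cosine similarity.

```python
import numpy as np


class VectorDatabase:
    """Minimal in-memory vector store backed by a Python dict (a sketch)."""

    def __init__(self):
        self._vectors = {}  # id -> 1-D NumPy array

    def add(self, vectors):
        """Store an iterable of vectors; ids are assigned sequentially."""
        for vec in vectors:
            self._vectors[len(self._vectors)] = np.asarray(vec, dtype=float)

    def search(self, query_vector, top_k=5):
        """Return up to top_k (id, cosine_similarity) pairs, best first."""
        query = np.asarray(query_vector, dtype=float)
        q_norm = np.linalg.norm(query)
        scored = []
        for vec_id, vec in self._vectors.items():
            denom = q_norm * np.linalg.norm(vec)
            # Cosine similarity: dot product divided by the product of norms.
            score = float(query @ vec / denom) if denom else 0.0
            scored.append((vec_id, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


db = VectorDatabase()
db.add([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
results = db.search([1.0, 0.1], top_k=2)
```

A linear scan like this costs time proportional to the number of stored vectors per query, which is the main thing index structures such as HNSW are meant to avoid.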

What Vector DBs Are Used For

  • Semantic search (search by meaning, not keywords)
  • Recommendation systems
  • RAG (Retrieval Augmented Generation) for LLMs
  • Image similarity search
  • Duplicate detection

Key Concepts I Traced

Embeddings: Text → Numbers (vectors)
Similarity: Cosine similarity between vectors (the cosine of the angle between them)
Indexing: Fast lookup structures
PCA: High dimensions → 2D for visualization
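
To make the PCA step traceable as well, here is a rough sketch of a 2D projection written with NumPy alone. The project itself lists scikit-learn for PCA, so the function name `pca_2d` and the SVD-based approach are assumptions for illustration. The random input stands in for real embeddings.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    # PCA works on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T


# Placeholder data: 10 random 50-dimensional "embeddings".
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 50))
points = pca_2d(embeddings)  # shape (10, 2), ready for a scatter plot
```

The two columns of `points` can be passed to Matplotlib's `plt.scatter` as x and y coordinates. scikit-learn's `PCA(n_components=2).fit_transform` should give an equivalent projection, possibly with the sign of an axis flipped.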


Tech Stack

  • Python
  • NumPy (vector operations)
  • scikit-learn (PCA)
  • Matplotlib (visualization)

Next Steps

  • Add FAISS indexing for speed
  • Implement HNSW algorithm
  • Add persistence (save/load)
  • Build REST API
  • Compare with real vector DBs

Author

Goutham N
GitHub: @GOUTHAM-2002


⭐ Star if you're also learning by building!


Note: This is a learning project. For production use, check out Pinecone, Weaviate, or Chroma.

About

A custom vector database that uses PCA to depict vector embeddings in a two-dimensional representation.
