This project lets a user run semantic search over a database of case studies. Users can upload additional case studies in PDF format. The case studies are stored as an index of vectors in Pinecone, a cloud-hosted vector database.
To set up the project on your local machine, do the following:
- Clone the repo
git clone https://github.com/00AR/semantic_search.git && cd semantic_search
- Create a Python virtual environment
python -m venv .env
- Activate the virtual environment
source .env/bin/activate
- Install the requirements
pip install -r requirements.txt
- Set up environment variables
- Create a new file named .config.env and add the following environment variables with the required values:
BASE_DIR=/path/to/the/repo/semantic_search
MEDIA=media
PINECONE_API_KEY=your_pinecone_api_key
- Run the app using
uvicorn app.main:app
- Build the Docker image using
docker build -t semantic-search-app .
- Run the Docker image
docker run -p 8000:8000 -e pinecone_api_key=your_api_key semantic-search-app
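The setup steps above keep the settings in a .config.env file. How the app reads that file is not shown in this README. As a minimal sketch, assuming the file holds plain KEY=VALUE lines, a hypothetical helper named `parse_env_text` (not part of the repo) could load the values like this:

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines into a dict (hypothetical helper, not the app's code)."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an '=' separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Placeholder values copied from the setup steps; replace them with real ones.
example = """
BASE_DIR=/path/to/the/repo/semantic_search
MEDIA=media
PINECONE_API_KEY=your_pinecone_api_key
"""
settings = parse_env_text(example)
```

To read the real file, pass `Path(".config.env").read_text()` to the helper instead of the example string.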
The project is built using FastAPI. It uses the Pinecone vector database to store embeddings along with metadata for each case study. The metadata of a case study includes industry, use case, and geography.
When the user enters a search term on the /search endpoint, the query is converted into an embedding.
The embedding is then matched against the case study embeddings stored in Pinecone using cosine similarity.
The best matches are returned as the response. The response includes a title and a filename.
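In the app, Pinecone performs the similarity matching. To illustrate what cosine similarity ranking does, here is a small local sketch in plain Python. The toy three-dimensional vectors, the filenames, and the `top_matches` helper are made up for this example and are not taken from the project:

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def top_matches(
    query: Sequence[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real embeddings have hundreds of dimensions.
toy_index = {
    "retail_case.pdf": [0.9, 0.1, 0.0],
    "health_case.pdf": [0.0, 1.0, 0.2],
    "finance_case.pdf": [0.1, 0.0, 1.0],
}
results = top_matches([1.0, 0.2, 0.0], toy_index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The query vector points mostly along the first axis, so the retail vector ranks first and the health vector second.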
Users can upload a case study file in PDF format through the /upload endpoint.
Additionally, users can download a case study of interest from the /media/{filename} endpoint. The filename from the /search results must be supplied as the filename to this endpoint.
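As a small sketch of the download step, the helper below builds the /media/{filename} URL from a filename returned by /search. The base URL assumes the app is running locally on port 8000, and both `download_url` and the example filename are hypothetical:

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # assumption: local server on port 8000


def download_url(filename: str, base_url: str = BASE_URL) -> str:
    """Build the /media/{filename} URL for a filename returned by /search."""
    # Percent-encode spaces and other unsafe characters; safe="" also encodes "/".
    return f"{base_url}/media/{quote(filename, safe='')}"


url = download_url("retail case study.pdf")
print(url)
```

Encoding the filename keeps names with spaces usable in the URL path.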
This will regenerate the embeddings and metadata for each case study and store them in an empty Pinecone database index. It uses the case studies stored in the samples folder, which holds one PDF per case study.
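The regeneration code itself is not shown here. As a rough sketch of its first step only, the snippet below collects one record per PDF in a folder. Using the file stem as the record id is an assumption, `list_case_studies` is a hypothetical name, and the embedding and Pinecone upload steps are left out:

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_case_studies(samples_dir: Path) -> List[Dict[str, str]]:
    """Return one record per PDF in samples_dir, sorted by filename."""
    records = []
    for pdf_path in sorted(samples_dir.glob("*.pdf")):
        # Assumption: the file stem is unique and can serve as the record id.
        records.append({"id": pdf_path.stem, "filename": pdf_path.name})
    return records


# Demonstrate with a throwaway folder instead of the real samples folder.
with tempfile.TemporaryDirectory() as tmp:
    samples = Path(tmp)
    (samples / "b_case.pdf").write_bytes(b"%PDF-1.4")
    (samples / "a_case.pdf").write_bytes(b"%PDF-1.4")
    (samples / "notes.txt").write_text("not a case study")
    records = list_case_studies(samples)

print(records)
```

Files that are not PDFs are ignored, so stray notes in the folder do not end up in the index.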
The app is deployed in a Docker container on Hugging Face.
- Extract metadata such as industry and use case from each case study and store it in the Pinecone index along with the embeddings
- Extract similar metadata from the user's search query and filter the results accordingly
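One possible way to approach the query-side filtering, shown only as a sketch: look for known metadata values in the query text and turn them into a filter dictionary. The value lists and the `build_filter` helper are hypothetical. The `$eq` operator comes from Pinecone's metadata filter syntax, and passing the filter to an actual query is not shown:

```python
from typing import Dict, List

# Hypothetical vocabularies; a real system would derive these from the indexed metadata.
KNOWN_VALUES: Dict[str, List[str]] = {
    "industry": ["healthcare", "retail", "finance"],
    "geography": ["europe", "asia", "north america"],
}


def build_filter(query: str) -> Dict[str, Dict[str, str]]:
    """Return a Pinecone-style metadata filter for known values found in the query."""
    lowered = query.lower()
    metadata_filter: Dict[str, Dict[str, str]] = {}
    for field, values in KNOWN_VALUES.items():
        for value in values:
            if value in lowered:
                metadata_filter[field] = {"$eq": value}
                break  # keep only the first matching value per field
    return metadata_filter


flt = build_filter("Retail pricing case studies in Europe")
print(flt)
```

Plain substring matching is crude (for example, "asia" also matches inside "Eurasia"), so a real implementation would likely need tokenization or a model-based extractor.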