This project is a Streamlit application that allows users to upload PDF files, process them into text chunks, and perform similarity searches on the text. Users can adjust chunking parameters and view the most similar chunks retrieved.
Features:
- Upload multiple PDF files
- Extract text from PDFs
- Chunk text with adjustable parameters
- Search for similar text chunks
- Display top similar chunks with similarity scores
- User-configurable settings via sidebar
Requirements:
- Python 3.7+
- Streamlit
- langchain_community
- langchain_openai
- PyPDF2
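Before installing, it can help to check which of the packages above are already importable. The following is a minimal sketch, not part of the app: the helper names `missing_modules` and `python_is_supported` are hypothetical, and `streamlit` is assumed to be the import name of the Streamlit package.

```python
import importlib.util
import sys

# Import names taken from the requirements list above; "streamlit" is an
# assumed import name for the Streamlit package.
REQUIRED_MODULES = ["streamlit", "langchain_community", "langchain_openai", "PyPDF2"]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 7)):
    """Return True when the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.7 or newer is required.")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Running the script inside the activated virtual environment shows whether `pip install -r requirements.txt` still needs to be run.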
Installation:
- Clone the repository:
    git clone https://github.com/hamadandrabi/streamlit-chunker.git
    cd streamlit-chunker
- Create a virtual environment and activate it:
    python -m venv env
    source env/bin/activate  # On Windows, use `env\Scripts\activate`
- Install the required packages:
    pip install -r requirements.txt
Usage:
- Run the Streamlit app:
    streamlit run main.py
- Open a web browser and navigate to http://localhost:8501.
- In the sidebar, paste your OpenAI API key.
- Adjust the chunking parameters (chunk size and overlap) as needed.
- Upload your PDF files and click "Process".
- Enter a search query to find similar text chunks.
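The search step can be pictured as ranking chunks by their similarity to the query. The sketch below is an illustration only: it swaps the OpenAI embeddings and LangChain vector store for a toy bag-of-words vector so that it runs without an API key, and the names `embed`, `cosine_similarity`, and `top_k_chunks` are hypothetical rather than taken from `main.py`.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query, chunks, k=3):
    """Return up to k (score, chunk) pairs, highest score first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    chunks = [
        "Streamlit renders the sidebar controls.",
        "Each PDF page is converted to plain text.",
        "Text is split into overlapping chunks.",
    ]
    for score, chunk in top_k_chunks("split text into chunks", chunks, k=2):
        print(f"{score:.3f}  {chunk}")
```

With real embeddings the ranking reflects meaning rather than shared words, but the shape of the result is the same: a short list of chunks, each with a score.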
Sidebar settings:
- API Key: Enter your OpenAI API key in the sidebar.
- Chunk Size: Use the slider to select the size of each text chunk.
- Chunk Overlap: Use the slider to select the overlap between chunks.
- Number of Chunks: Specify how many top similar chunks to display.
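To see how chunk size and chunk overlap interact, here is a minimal sketch of a sliding-window splitter. It assumes both settings are measured in characters and uses a hypothetical `chunk_text` helper; the app itself may split text differently, for example with a LangChain text splitter.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    # prints ['abcd', 'cdef', 'efgh', 'ghij']
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap repeats more text between neighbouring chunks, which helps keep sentences that straddle a boundary searchable at the cost of producing more chunks.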
To run the Streamlit app on ports 80 or 443, you need administrative privileges.
- On Windows, open Command Prompt as an administrator and run the app with the desired port:
    streamlit run main.py --server.port 80
- On Linux or macOS, open a terminal and run the app with sudo:
    sudo streamlit run main.py --server.port 80
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgements:
- Streamlit for the awesome web app framework
- LangChain for the vector store and embedding support
- PyPDF2 for PDF text extraction
Link to the App: https://chunker.streamlit.app/