This project implements a question answering system that retrieves information from scraped web pages and indexed documents. It:
- scrapes content from specific tabs on a website,
- preprocesses the text data,
- creates a PDF report, and
- sets up an interactive querying interface using GenAI for natural language processing.
Ensure you have the following dependencies installed:
requests
beautifulsoup4
transformers
sentence-transformers
faiss-cpu
pandas
nltk
chromadb
reportlab
langchain==0.0.187
unstructured
docx2txt
genai
You can install them using pip:
pip install -r requirements.txt
- Clone the repository:
git clone https://github.com/Diksha-Bisht/Question-Answer.git
cd Question-Answer
- Install the dependencies as mentioned above.
- Obtain an API key for GenAI from the GenAI website and store it securely.
- Ensure you have access to a directory containing PDF documents for indexing.
- Run the cells directly in any Jupyter environment, OR
- Run the script. To do so, first convert the file from .ipynb format to .py format (for example, with jupyter nbconvert --to script), then execute the following command. The quotes keep the shell from interpreting the & in the file name:
python "Q&A.py"
- The script will prompt you to enter a question after initialisation.
- Ensure the question is relevant to the content scraped and indexed.
- Web Scraping
Uses requests and BeautifulSoup to extract content from specific tabs of a website.
Combines scraped text data into a unified corpus for further processing.
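As a rough illustration of the scraping step, the sketch below fetches a list of pages with requests, extracts their visible text with BeautifulSoup, and joins the results into one corpus. The function names (`extract_text`, `scrape_corpus`), the timeout value, and the decision to drop `script` and `style` tags are assumptions made for this example and are not taken from the project's notebook.

```python
import requests
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: script and style contents are not useful for question answering.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Join text fragments with spaces and collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


def scrape_corpus(urls: list[str], timeout: float = 10.0) -> str:
    """Fetch each URL and combine the extracted text into one corpus."""
    parts = []
    for url in urls:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # fail loudly on HTTP errors
        parts.append(extract_text(response.text))
    # Separate pages with a blank line so they stay distinguishable.
    return "\n\n".join(parts)
```

Keeping the parsing in its own function means it can be checked on a saved HTML string without any network access.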
- Text Preprocessing
Normalizes text by converting to lowercase and removing unnecessary characters like newlines.
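A minimal sketch of this normalization, assuming that newlines and other whitespace runs should be replaced by single spaces rather than deleted outright (so that words on adjacent lines do not merge). The function name `preprocess` is hypothetical.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase text and replace newlines and repeated whitespace with single spaces."""
    text = text.lower()
    # \s matches newlines, tabs, and spaces, so this also removes line breaks.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(preprocess("Hello,\nWorld!\n\n  Second   LINE\t"))
# prints: hello, world! second line
```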
- PDF Generation
Utilizes reportlab to create a PDF report from the preprocessed text data.
Saves the generated PDF in a specified directory (/content/sample_data in this case).
- Document Indexing and Querying
Sets up a document indexing pipeline using ChromaDB and VectorStoreIndex.
Uses HuggingFace for document embeddings and GenAI for querying.
Creates an interactive loop to input questions and retrieve answers based on indexed documents.
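The interactive loop can be sketched independently of the index itself. In the example below, `answer_question` is a hypothetical stand-in for whatever callable queries the index and returns an answer string; the exit keywords and the prompt text are also assumptions. The input and output functions are passed in as parameters so the loop can be exercised without a terminal.

```python
from typing import Callable


def run_query_loop(
    answer_question: Callable[[str], str],
    read_input: Callable[[str], str] = input,
    write_output: Callable[[str], None] = print,
) -> None:
    """Prompt for questions until the user types 'exit' or 'quit'."""
    while True:
        question = read_input("Enter a question (or 'exit' to quit): ").strip()
        if question.lower() in {"exit", "quit"}:
            break
        if not question:
            continue  # ignore empty input instead of querying the index
        write_output(answer_question(question))
```

In the real pipeline, `answer_question` would wrap the query call on the index built from the PDF documents.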
The code can still be improved for better performance; corrections and contributions are always welcome.
- Diksha Bisht: bishtdiksha096@gmail.com
- Deepak Garg: gargdeepak114@gmail.com