Question Answering System with Web Scraping and Document Indexing

This project implements a question answering system that retrieves information from scraped web pages and indexed documents. It:

scrapes content from specific tabs on a website,
preprocesses the text data,
creates a PDF report, and
sets up an interactive querying interface using GenAI for natural language processing.

Dependencies

Ensure you have the following dependencies installed:

requests
beautifulsoup4
transformers
sentence-transformers
faiss-cpu
pandas
nltk
chromadb
reportlab
langchain==0.0.187
unstructured
docx2txt
genai

You can install them using pip:

pip install -r requirements.txt

Setup Instructions

Clone the repository:

git clone https://github.com/Diksha-Bisht/Question-Answer.git
cd Question-Answer

Install the dependencies as mentioned above.
Obtain an API key for GenAI from the GenAI website and store it securely.
Ensure you have access to a directory containing PDF documents for indexing.

Usage

Running the Script

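You can run the project in either of two ways:

Run the cells of Q&A.ipynb directly in any Jupyter environment, OR
Convert the notebook from .ipynb format to .py format (for example with jupyter nbconvert --to script) and run the resulting script:

python "Q&A.py"

The file name is quoted because most shells treat an unquoted & as a control operator.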

Input Requirements

The script will prompt you to enter a question after initialisation.
Ensure the question is relevant to the content scraped and indexed.

Components

Web Scraping

Uses requests and BeautifulSoup to extract content from specific tabs of a website.

Combines scraped text data into a unified corpus for further processing.
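The scraping step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names fetch_html, extract_text, and build_corpus are hypothetical, and the URLs in the comment are placeholders for the real tabs.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_text(html: str) -> str:
    """Return the visible text of an HTML page, without scripts or styles."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-visible content before reading the text
    return soup.get_text(separator=" ", strip=True)


def build_corpus(pages: list[str]) -> str:
    """Join the text of several HTML pages into one corpus, skipping empty pages."""
    texts = [extract_text(html) for html in pages]
    return "\n".join(text for text in texts if text)


# Example usage (placeholder URLs, assumed for illustration only):
# tab_urls = ["https://example.com/about", "https://example.com/services"]
# corpus = build_corpus([fetch_html(url) for url in tab_urls])
```

Keeping the download separate from the parsing makes it possible to try the parsing on saved HTML without touching the network.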

Text Preprocessing

Normalizes text by converting to lowercase and removing unnecessary characters like newlines.
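A minimal sketch of this normalization, assuming that "unnecessary characters" means newlines, tabs, and repeated spaces. The function name preprocess is hypothetical, and the notebook may clean the text differently.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase the text and collapse newlines, tabs, and repeated spaces."""
    text = text.lower()
    # Replace every run of whitespace (including \n, \r, \t) with one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For example, preprocess("Hello\nWorld") returns "hello world".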

PDF Generation

Utilizes reportlab to create a PDF report from the preprocessed text data.

Saves the generated PDF in a specified directory (/content/sample_data in this case).

Document Indexing and Querying

Sets up a document indexing pipeline using ChromaDB and VectorStoreIndex.

Uses HuggingFace for document embeddings and GenAI for querying.

Creates an interactive loop to input questions and retrieve answers based on indexed documents.
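The interactive loop can be sketched independently of the indexing backend. In the sketch below, answer_fn is a hypothetical callable standing in for whatever queries the index and the language model, and the exit and quit keywords are assumptions rather than the notebook's actual behaviour.

```python
from typing import Callable


def run_query_loop(
    answer_fn: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> int:
    """Prompt for questions until 'exit' or 'quit'; return how many were answered."""
    answered = 0
    while True:
        try:
            question = read("Question (or 'exit'): ").strip()
        except EOFError:
            break  # input stream closed, e.g. Ctrl-D in a terminal
        if question.lower() in {"exit", "quit"}:
            break
        if not question:
            continue  # ignore blank lines instead of querying the index
        write(answer_fn(question))
        answered += 1
    return answered
```

Passing the read and write functions as parameters lets the loop be exercised with scripted input, while the defaults keep the normal prompt-and-print behaviour.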

Note:

The code needs correction for better performance; corrections are always welcome.

Thank you

Collaborators:

Diksha Bisht: bishtdiksha096@gmail.com
Deepak Garg: gargdeepak114@gmail.com

Please read Explaination.txt for an explanation of the code.
