
Question Answering System with Web Scraping and Document Indexing

This project implements a question answering system that retrieves information from scraped web pages and indexed documents. It:

  • uses web scraping to gather content from specific tabs on a website,
  • preprocesses the text data,
  • creates a PDF report, and
  • sets up an interactive querying interface using GenAI for natural language processing.

Table of Contents

  1. Dependencies
  2. Setup Instructions
  3. Usage
  4. Components
  5. License

Dependencies

Ensure you have the following dependencies installed:

  • requests
  • beautifulsoup4
  • transformers
  • sentence-transformers
  • faiss-cpu
  • pandas
  • nltk
  • chromadb
  • reportlab
  • langchain==0.0.187
  • unstructured
  • docx2txt
  • genai

You can install them using pip:

pip install -r requirements.txt

Setup Instructions

  1. Clone the repository:
git clone https://github.com/Diksha-Bisht/Question-Answer.git
cd Question-Answer
  2. Install the dependencies as mentioned above.

  3. Obtain an API key for GenAI from the GenAI website and store it securely (see the sketch after this list).

  4. Ensure you have access to a directory containing PDF documents for indexing.
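A minimal sketch of loading the key at run time, assuming a hypothetical environment variable name GENAI_API_KEY (the notebook may read the key differently):

import os

# Hypothetical variable name; export GENAI_API_KEY before launching the notebook or script.
api_key = os.environ.get("GENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the GENAI_API_KEY environment variable with your GenAI key.")
# The key is then passed to the GenAI client when it is initialised.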

Usage

Running the Script

  1. Run the cells directly in any Jupyter environment, OR
  2. Run the script from the command line.

To run the script, execute the following command (the filename is quoted because & is a special character in most shells):

python "Q&A.py"

To do so, you first need to convert the file from .ipynb format to .py format.
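One way to do the conversion, assuming the notebook is named Q&A.ipynb, is with nbconvert:

jupyter nbconvert --to script "Q&A.ipynb"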

Input Requirements

  • The script will prompt you to enter a question after initialisation.
  • Ensure the question is relevant to the content scraped and indexed.

Components

  1. Web Scraping

Uses requests and BeautifulSoup to extract content from specific tabs of a website.

Combines scraped text data into a unified corpus for further processing.
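A minimal sketch of this step, using placeholder tab URLs (the actual site and tabs are defined in the notebook):

import requests
from bs4 import BeautifulSoup

tab_urls = [
    "https://example.com/about",     # placeholder tab URLs
    "https://example.com/services",
]

corpus_parts = []
for url in tab_urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Keep only the visible text of each page
    corpus_parts.append(soup.get_text(separator=" ", strip=True))

corpus = "\n".join(corpus_parts)   # unified corpus for the next steps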

  2. Text Preprocessing

Normalizes text by converting to lowercase and removing unnecessary characters like newlines.
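A minimal sketch of the normalization described above, assuming only lowercasing and whitespace cleanup:

import re

def preprocess(text):
    # Lowercase, drop newlines, and collapse repeated whitespace
    text = text.lower().replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Scraped TEXT\nwith   extra   whitespace"))   # -> "scraped text with extra whitespace"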

  3. PDF Generation

Utilizes reportlab to create a PDF report from the preprocessed text data.

Saves the generated PDF in a specified directory (/content/sample_data in this case).
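A minimal sketch using reportlab's Platypus layer; the output path is a placeholder:

from xml.sax.saxutils import escape
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph

def save_pdf(text, path="report.pdf"):
    # Write the preprocessed text into a simple, flowing PDF report
    doc = SimpleDocTemplate(path, pagesize=A4)
    style = getSampleStyleSheet()["Normal"]
    # One Paragraph per sentence so the text wraps and paginates cleanly
    story = [Paragraph(escape(chunk), style) for chunk in text.split(". ") if chunk]
    doc.build(story)

# save_pdf(clean_corpus, "/content/sample_data/report.pdf")   # path used in the notebook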

  4. Document Indexing and Querying

Sets up a document indexing pipeline using ChromaDB and VectorStoreIndex.

Uses HuggingFace for document embeddings and GenAI for querying.

Creates an interactive loop to input questions and retrieve answers based on indexed documents.
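A minimal sketch of the retrieval part, using chromadb and sentence-transformers directly; the notebook's own pipeline (ChromaDB + VectorStoreIndex + GenAI) may differ in detail, and the embedding model name is an assumption:

import chromadb
from sentence_transformers import SentenceTransformer

clean_corpus = "placeholder text; use the preprocessed corpus from the earlier steps"

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
client = chromadb.Client()
collection = client.create_collection("qa_docs")

# Index fixed-size chunks of the preprocessed text
chunks = [clean_corpus[i:i + 500] for i in range(0, len(clean_corpus), 500)]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Interactive loop: retrieve the most relevant chunk for each question
while True:
    question = input("Ask a question (or type 'exit'): ")
    if question.lower() == "exit":
        break
    result = collection.query(
        query_embeddings=embedder.encode([question]).tolist(),
        n_results=1,
    )
    context = " ".join(result["documents"][0])
    print(context)   # in the notebook this context is passed to GenAI to compose the answer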

Note:

The code may still need corrections for better performance; any fixes or improvements are always welcome.

Thank you!

Collaborators:

  1. Diksha Bisht: bishtdiksha096@gmail.com
  2. Deepak Garg: gargdeepak114@gmail.com

Please read Explaination.txt for an explanation of the code.
