This project demonstrates the following tasks:
- Web scraping to extract comments from a website.
- Saving the extracted comments in a text file and a PDF file.
- Reading text from a text file and splitting it into smaller chunks for further processing.
- Using OpenAI's DocumentSearch to find relevant documents for a given query.
- Answering the query based on the relevant documents using OpenAI's Question-Answering chain.
The Python script uses the `requests` and `BeautifulSoup` libraries to scrape comments from a website. For each comment it extracts the author, timestamp, and message, and stores them in a list of dictionaries.
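To make the "list of dictionaries" shape concrete, here is a minimal sketch. It swaps in the standard library's `html.parser` for `BeautifulSoup` so it runs without any installs, and it parses an inline HTML string instead of fetching a page with `requests`. The markup, the class names (`comment`, `author`, `timestamp`, `message`), and the names `CommentParser` and `parse_comments` are all assumptions for illustration; a real site will need its own selectors.

```python
from html.parser import HTMLParser

# Hypothetical markup: each comment is a <div class="comment"> holding
# elements whose class is "author", "timestamp" or "message".
SAMPLE_HTML = """
<div class="comment">
  <span class="author">alice</span>
  <span class="timestamp">2023-05-01 10:00</span>
  <p class="message">First post!</p>
</div>
<div class="comment">
  <span class="author">bob</span>
  <span class="timestamp">2023-05-01 10:05</span>
  <p class="message">Nice write-up.</p>
</div>
"""

FIELDS = {"author", "timestamp", "message"}


class CommentParser(HTMLParser):
    """Collects one dict per comment block (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.comments = []   # one dict per <div class="comment"> seen so far
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "comment" in classes:
            # A new comment block starts: add an empty record for it.
            self.comments.append({"author": "", "timestamp": "", "message": ""})
        elif self.comments:
            for name in classes:
                if name in FIELDS:
                    self._field = name

    def handle_data(self, data):
        # Only keep text that sits inside a recognised field element.
        if self._field and self.comments:
            self.comments[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        # Any closing tag ends the field that was being read.
        self._field = None


def parse_comments(html):
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    return parser.comments


comments = parse_comments(SAMPLE_HTML)
for comment in comments:
    print(comment)
```

In the actual script the HTML would come from a `requests` response and the lookups would be done with `BeautifulSoup`; the resulting data structure is the same.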
The extracted comments are saved to a text file (`comments.txt`) using Python's built-in `open` function. They are also saved to a PDF file (`comments.pdf`) using the `fpdf` library.
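The text-file half of this step might look like the sketch below. The one-line-per-comment layout and the helper names `format_comment` and `save_comments` are assumptions, not taken from the script, and the PDF export with `fpdf` is left out. The example writes into a temporary directory so that it does not overwrite a real `comments.txt`.

```python
import os
import tempfile


def format_comment(comment):
    # Assumed layout: "author (timestamp): message"
    return f"{comment['author']} ({comment['timestamp']}): {comment['message']}"


def save_comments(comments, path):
    # Write one formatted comment per line using the built-in open().
    with open(path, "w", encoding="utf-8") as f:
        for comment in comments:
            f.write(format_comment(comment) + "\n")


sample_comments = [
    {"author": "alice", "timestamp": "2023-05-01 10:00", "message": "First post!"},
    {"author": "bob", "timestamp": "2023-05-01 10:05", "message": "Nice write-up."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "comments.txt")
    save_comments(sample_comments, out_path)
    # Read the file back to confirm what was written.
    with open(out_path, encoding="utf-8") as f:
        saved_text = f.read()

print(saved_text)
```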
The text is read from a file (`comments.txt`) and split into smaller chunks using `CharacterTextSplitter`, a text-splitting class from the LangChain library. Splitting keeps each chunk within the model's token limit during information retrieval.
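The idea behind chunking can be shown with a small, dependency-free sketch. This is not the `CharacterTextSplitter` implementation; it is a simplified fixed-size splitter with overlap, and the function name `split_text` and the default sizes are assumptions chosen for illustration. The overlap means text cut at a chunk boundary still appears whole in one of the neighbouring chunks.

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# In the project the text would be read from comments.txt; a literal is used here.
raw_text = "alice: First post!\nbob: Nice write-up.\ncarol: Thanks for sharing."
texts = split_text(raw_text, chunk_size=30, chunk_overlap=5)
for chunk in texts:
    print(repr(chunk))
```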
The document search step creates a FAISS index over the text chunks for efficient similarity search. Given a query, the script finds the most similar documents (text chunks) from the `texts` variable.
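What "most similar" means can be illustrated without FAISS or an embedding model. The sketch below is a stand-in, not the script's method: it uses word-count vectors and cosine similarity with a brute-force scan, whereas the project relies on a FAISS index over learned embeddings. The names `embed`, `cosine`, and `similarity_search` and the sample texts are assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy "embedding": lowercase word counts (stand-in for a real embedding model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(query, texts, k=2):
    # Score every chunk against the query and return the k best matches.
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(t)), i) for i, t in enumerate(texts)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [texts[i] for _, i in scored[:k]]


texts = [
    "The battery life of this laptop is excellent.",
    "Shipping took two weeks, which was too slow.",
    "The laptop screen is bright and sharp.",
]
docs = similarity_search("How is the laptop battery?", texts, k=2)
for doc in docs:
    print(doc)
```

A vector index such as FAISS serves the same purpose but avoids comparing the query against every chunk one by one in Python, which matters once there are many chunks.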
OpenAI's Question-Answering chain then generates an answer from the relevant documents: the query and the retrieved documents are the input, and the answer is the output.
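One common way such a chain works is to place the retrieved documents and the query into a single prompt and pass that prompt to a language model. The sketch below shows only that shape and is an assumption about the approach, not the script's code: the prompt wording, `build_prompt`, `answer_query`, and `fake_llm` are hypothetical, and `fake_llm` stands in for a real model call.

```python
def build_prompt(query, docs):
    # Put all retrieved documents into one context block, followed by the question.
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer_query(query, docs, llm):
    # llm is any callable that maps a prompt string to an answer string.
    return llm(build_prompt(query, docs))


def fake_llm(prompt):
    # Placeholder for a real model call; always returns a fixed string.
    return "(stub answer)"


docs = ["The battery lasts about ten hours.", "The screen is bright."]
query = "How long does the battery last?"

prompt = build_prompt(query, docs)
answer = answer_query(query, docs, fake_llm)
print(prompt)
print(answer)
```

Passing the model in as a callable keeps the prompt construction testable without network access or an API key.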
To run the code, install the following libraries:
- requests
- beautifulsoup4
- fpdf
- openai
- faiss-cpu
- langchain (provides `CharacterTextSplitter`)
You can install them using the following command:
pip install requests beautifulsoup4 fpdf openai faiss-cpu langchain