Project: Web Scraping, Text Processing, and Question Answering

This project demonstrates the following tasks:

Web scraping to extract comments from a website.
Saving the extracted comments in a text file and a PDF file.
Reading text from a text file and splitting it into smaller chunks for further processing.
Building a FAISS similarity index to find the text chunks most relevant to a given query.
Answering the query from the relevant chunks using a question-answering chain backed by an OpenAI model.

Task 1: Web Scraping

The Python script uses the requests library to download a page and BeautifulSoup to parse it. Each comment's author, timestamp, and message are extracted and stored as one dictionary in a list.
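The extraction step can be sketched as follows. This is a minimal illustration, not the project's actual script: the markup and the class names (comment, author, timestamp, message) are hypothetical, and the standard library's html.parser is swapped in for BeautifulSoup so the sketch runs without third-party packages or network access. In the real script the HTML would come from requests and be parsed with BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real site's tags and class names are not given here.
SAMPLE_HTML = """
<div class="comment">
  <span class="author">Ivan</span>
  <span class="timestamp">2023-04-01 10:15</span>
  <p class="message">First comment.</p>
</div>
<div class="comment">
  <span class="author">Maria</span>
  <span class="timestamp">2023-04-01 11:02</span>
  <p class="message">Second comment.</p>
</div>
"""

FIELDS = ("author", "timestamp", "message")


class CommentParser(HTMLParser):
    """Collect author, timestamp, and message from each comment block."""

    def __init__(self):
        super().__init__()
        self.comments = []     # one dict per comment, in page order
        self._current = None   # the comment currently being filled in
        self._field = None     # the field whose text is being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "comment" in classes:
            # A new comment block starts: open a fresh dictionary.
            self._current = {}
            self.comments.append(self._current)
        elif self._current is not None:
            for name in FIELDS:
                if name in classes:
                    self._field = name
                    self._current[name] = ""

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        # Fields in the sample contain no nested tags, so any end tag closes the field.
        self._field = None


def parse_comments(html):
    """Return a list of dictionaries with stripped field values."""
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    return [{key: value.strip() for key, value in c.items()} for c in parser.comments]


comments = parse_comments(SAMPLE_HTML)
for comment in comments:
    print(comment)
```

The result has the shape described above: a list with one dictionary per comment, each holding the author, timestamp, and message.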

Task 2: Saving Comments in Text and PDF Files

The extracted comments are saved in a text file (comments.txt) using the built-in open function in Python. Additionally, the comments are saved in a PDF file (comments.pdf) using the fpdf library.
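A minimal sketch of the text-file half of this step is shown below. The sample comments, the helper names (format_comment, save_comments_txt), and the layout of each entry are assumptions made for illustration; the sketch writes into a temporary directory so it does not touch the working directory.

```python
import tempfile
from pathlib import Path

# Sample data in the shape produced by the scraping step (values are made up).
comments = [
    {"author": "Ivan", "timestamp": "2023-04-01 10:15", "message": "First comment."},
    {"author": "Maria", "timestamp": "2023-04-01 11:02", "message": "Second comment."},
]


def format_comment(comment):
    """Render one comment as a small labelled block of text (assumed layout)."""
    return (
        f"Author: {comment['author']}\n"
        f"Timestamp: {comment['timestamp']}\n"
        f"Message: {comment['message']}"
    )


def save_comments_txt(comments, path):
    """Write all comments to a UTF-8 text file, separated by blank lines."""
    with open(path, "w", encoding="utf-8") as f:
        for comment in comments:
            f.write(format_comment(comment) + "\n\n")


out_dir = Path(tempfile.mkdtemp())
txt_path = out_dir / "comments.txt"
save_comments_txt(comments, txt_path)
print(txt_path.read_text(encoding="utf-8"))
```

The PDF half is not shown. With the fpdf package it would typically go through the FPDF class (add_page, set_font, multi_cell, output), and non-Latin text such as Cyrillic would likely need a Unicode font registered with add_font first.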

Task 3: Reading and Splitting Text

The text is read from a file (comments.txt) and split into smaller chunks using LangChain's CharacterTextSplitter. Splitting keeps each chunk within the token limits of the models used in the retrieval and question-answering steps.
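The idea behind chunking can be sketched in plain Python. This is a simplified stand-in, not CharacterTextSplitter itself: it cuts the text into fixed-size windows of characters that overlap, whereas the library class splits on a separator first. The function name and the default sizes are assumptions.

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# Small sizes make the overlap easy to see.
texts = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(texts)
```

In the project the input would be the contents of comments.txt, and the sizes would be chosen to stay well under the model's token limit.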

Task 4: Finding Relevant Documents

A FAISS index is built over vector embeddings of the text chunks for efficient similarity search. Given a query, the script returns the chunks from the texts variable that are most similar to it.
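The retrieval step can be illustrated without FAISS or an embedding model. The sketch below swaps in word-count vectors and a brute-force cosine-similarity ranking; it shows the same "embed, compare, take the top k" logic, but it is not how the project computes embeddings. The function names and the sample chunks are assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query, texts, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vector, embed(t)), reverse=True)
    return ranked[:k]


# Made-up chunks standing in for the output of the splitting step.
texts = [
    "The delivery was late and the box was damaged.",
    "Great price, I would buy again.",
    "Customer support answered my delivery question quickly.",
]

docs = similarity_search("What do people say about delivery?", texts, k=2)
for doc in docs:
    print(doc)
```

A FAISS index serves the same purpose as the sorted call here, but avoids comparing the query against every chunk one by one once the collection grows large.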

Task 5: Question Answering

A question-answering chain backed by an OpenAI model generates an answer from the relevant chunks. The chain takes the query and the relevant chunks as input and returns the answer as output.
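One common way such a chain works is to place all relevant chunks into a single prompt together with the question. The sketch below shows only that assembly step under stated assumptions: the prompt wording, the function names, and fake_llm are hypothetical, and no OpenAI call is made. In a real run, the llm argument would be a function that sends the prompt to the model and returns its reply.

```python
def build_qa_prompt(question, docs):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(docs)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


def answer_question(question, docs, llm):
    """Run the question-answering step with any callable that maps prompt -> reply."""
    return llm(build_qa_prompt(question, docs))


def fake_llm(prompt):
    """Placeholder model so the sketch runs offline; it does not answer anything."""
    return f"(model reply to a {len(prompt)}-character prompt)"


docs = [
    "Customer support answered my delivery question quickly.",
    "The delivery was late and the box was damaged.",
]
query = "What do people say about delivery?"

print(build_qa_prompt(query, docs))
print(answer_question(query, docs, fake_llm))
```

Because every chunk is pasted into one prompt, the combined length has to fit within the model's context window, which is why the earlier splitting and top-k retrieval steps matter.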

Dependencies

To run the code, the following libraries need to be installed:

requests
beautifulsoup4
fpdf
openai
langchain
faiss-cpu

You can install them using the following command:

pip install requests beautifulsoup4 fpdf openai langchain faiss-cpu

Repository: Alexey3250/Bulgarian-Spider
