🤖🌐 AI Scraper

Welcome to AI Scraper, a powerful tool built with Streamlit and the LangChain library, designed to transform unstructured web data into structured, actionable insights. This application makes it easy to scrape data from websites and automatically generate Pydantic models for structured data extraction.

Check out this article if you want to know more about this project.

Features

  • Model Definition: Dynamically create Pydantic models based on user-defined schemas directly from the UI.
  • Data Extraction: Enter a URL and scrape data according to the defined Pydantic model.
  • Data Download: Export the scraped data in JSON format for ease of use in further applications.

How It Works

AI Scraper operates in two main stages:

  1. Model Creation:

    • Define your data model by specifying attributes such as name, type, and description.
    • Validate the model to ensure all fields are correctly filled out.
    • Automatically generate a Pydantic model to be used in the scraping process (see the sketch after this list).
  2. Data Scraping:

    • Enter the URL of the website from which you want to scrape data.
    • Execute the scraping process, which uses the previously defined Pydantic model to parse and structure the HTML content.
    • Download the structured data as a JSON file or view it directly within the app.
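
Under the hood, this kind of dynamic model creation maps naturally onto pydantic.create_model. The snippet below is a minimal sketch of that idea; the schema rows, the type mapping, and the build_model helper are illustrative assumptions, not the app's actual code.

from pydantic import BaseModel, Field, create_model

# Hypothetical schema rows as they might come from the UI table:
# each row has an attribute name, a type, and a description.
schema_rows = [
    {"name": "title", "type": "str", "description": "Product title"},
    {"name": "price", "type": "float", "description": "Price in USD"},
]

TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}

def build_model(rows: list[dict]) -> type[BaseModel]:
    """Build a Pydantic model class from user-defined schema rows."""
    fields = {
        row["name"]: (TYPE_MAP[row["type"]], Field(description=row["description"]))
        for row in rows
    }
    return create_model("ScrapedItem", **fields)

ScrapedItem = build_model(schema_rows)
print(ScrapedItem.model_json_schema())  # Pydantic v2; use .schema() on v1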

How to Use

Craft Your Model

Start by defining your data model in the provided table format. Ensure each attribute is carefully described, specifying the type and a brief description.
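
For instance, a model describing a product page might be defined with rows like the following (the attributes are purely illustrative):

Name      Type    Description
title     str     The product's display name
price     float   The listed price in USD
in_stock  bool    Whether the product is currently available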

Mark the Spot

Input the URL of the webpage you wish to scrape. The application supports various content types as long as they can be parsed into HTML.
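
Fetching the page can be as simple as an HTTP GET with the Requests library. The snippet below is a minimal sketch; the URL and the fetch_html helper are illustrative.

import requests

def fetch_html(url: str) -> str:
    """Download the raw HTML of a page; raises on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

html = fetch_html("https://example.com/products")  # illustrative URL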

Summon the Data

Click the 'Generate Pydantic Model and Scrape' button to start the extraction process. The data matching your model will be retrieved and displayed.
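
One way LangChain supports this kind of extraction is by binding a Pydantic model to an OpenAI chat model as structured output. The snippet below is a sketch under that assumption, not the app's exact chain; ScrapedItem and html refer to the earlier sketches, and the model name is an assumption.

from langchain_openai import ChatOpenAI

# Bind the dynamically generated Pydantic model to the LLM so the response
# is parsed and validated into a ScrapedItem instance automatically.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an assumption
structured_llm = llm.with_structured_output(ScrapedItem)

item = structured_llm.invoke(
    "Extract the fields defined by the schema from this page:\n\n" + html
)
print(item)  # a ScrapedItem instance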

Treasure Awaits

Download the structured data in JSON format, or explore it directly within the application.
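
Serializing the validated result to JSON and offering it for download can be done with Pydantic's JSON export and Streamlit's download_button. A sketch, assuming Pydantic v2 and the item object from the extraction step:

import streamlit as st

json_payload = item.model_dump_json(indent=2)  # Pydantic v2; use item.json() on v1
st.json(json_payload)  # show the structured data directly in the app
st.download_button(
    label="Download JSON",
    data=json_payload,
    file_name="scraped_data.json",
    mime="application/json",
)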

Setup and Installation

  1. Clone the repository:
git clone https://yourrepositorylink.git
  2. Create a virtual environment and install the requirements:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  3. Create a .env file with the following content (see the sketch after this list for how the key is loaded):
# OPENAI Key
OPENAI_API_KEY=<OPENAI_API_KEY>
  4. Start the application:
streamlit run app.py
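
The OPENAI_API_KEY from the .env file is typically loaded at startup with dotenv; a minimal sketch of how that might look inside the app:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")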

Dependencies

  • Streamlit
  • LangChain
  • Pydantic
  • Requests
  • dotenv
  • json

Contributing

Feel free to fork the repository, make changes, and submit pull requests. If you encounter any issues or have suggestions for improvement, please submit an issue.

Acknowledgments

  • LangChain Library: For providing the tools to integrate AI capabilities seamlessly.
  • Streamlit: For making it possible to build interactive web applications quickly and easily.

About

This repository explains how to perform web scraping using AI. Utilizing LLMs and various web scraping techniques, the application provides guidance on extracting data from webpages.
