Welcome to AI Scraper, a powerful tool built with Streamlit and the LangChain library, designed to transform unstructured web data into structured, actionable insights. This application makes it easy to scrape data from websites and automatically generate Pydantic models for structured data extraction.
See this article if you want to know more about this project.
- Model Definition: Dynamically create Pydantic models based on user-defined schemas directly from the UI.
- Data Extraction: Enter a URL and scrape data according to the defined Pydantic model.
- Data Download: Export the scraped data in JSON format for ease of use in further applications.
AI Scraper operates in two main stages:
- Model Creation:
  - Define your data model by specifying attributes such as name, type, and description.
  - Validate the model to ensure all fields are correctly filled out.
  - Automatically generate a Pydantic model to be used in the scraping process.
- Data Scraping:
  - Enter the URL of the website from which you want to scrape data.
  - Execute the scraping process, which uses the previously defined Pydantic model to parse and structure the HTML content.
  - Download the structured data as a JSON file or view it directly within the app.
Start by defining your data model in the provided table format. Ensure each attribute is carefully described, specifying the type and a brief description.
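As a rough illustration of how rows of name, type, and description could become a Pydantic model, here is a minimal sketch built on Pydantic's `create_model`. The `build_model` helper, the `TYPE_MAP` table, and the row layout are hypothetical names chosen for this example; the app's own implementation may differ.

```python
from typing import Any, Dict, List, Type

from pydantic import BaseModel, Field, create_model

# Hypothetical mapping from type names a user might enter in the table
# to Python types; the real app may support a different set.
TYPE_MAP: Dict[str, type] = {"str": str, "int": int, "float": float, "bool": bool}


def build_model(model_name: str, rows: List[Dict[str, str]]) -> Type[BaseModel]:
    """Build a Pydantic model from rows of {"name", "type", "description"}."""
    fields: Dict[str, Any] = {}
    for row in rows:
        name = row.get("name", "").strip()
        type_name = row.get("type", "").strip()
        # Basic validation: every row needs a usable name and a known type.
        if not name.isidentifier():
            raise ValueError(f"Invalid attribute name: {name!r}")
        if type_name not in TYPE_MAP:
            raise ValueError(f"Unsupported type for {name!r}: {type_name!r}")
        # (type, Field(...)) marks the field as required and attaches the description.
        fields[name] = (
            TYPE_MAP[type_name],
            Field(..., description=row.get("description", "")),
        )
    if not fields:
        raise ValueError("At least one attribute is required")
    return create_model(model_name, **fields)


rows = [
    {"name": "title", "type": "str", "description": "Product title"},
    {"name": "price", "type": "float", "description": "Price in USD"},
]
Product = build_model("Product", rows)
item = Product(title="Desk lamp", price=19.99)
print(item.title, item.price)
```

Rejecting empty names and unknown types before calling `create_model` keeps validation errors close to the table row that caused them.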
Input the URL of the webpage you wish to scrape. The application can handle any page whose content can be parsed as HTML.
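To make the "parsed as HTML" requirement concrete, the sketch below reduces an HTML document to its visible text using only the standard library. It is an assumption about one reasonable preprocessing step, not the app's actual pipeline; `TextExtractor` and `html_to_text` are hypothetical names. In practice the HTML string would come from a fetch such as `requests.get(url).text`.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style blocks.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Desk lamp</h1><script>var x = 1;</script>"
    "<p>Price: $19.99</p></body></html>"
)
print(html_to_text(sample))
```

Stripping markup before extraction keeps the text passed to a language model short, which matters for pages with large scripts or stylesheets.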
Click the 'Generate Pydantic Model and Scrape' button to start the extraction process. The data matching your model will be retrieved and displayed.
Download the structured data in JSON format, or explore it directly within the application.
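The JSON export can be sketched as follows. This assumes the scraped results are available as a list of plain dictionaries; `to_json_bytes` and the sample record are hypothetical and only illustrate the serialization step.

```python
import json
from typing import Any, Dict, List


def to_json_bytes(records: List[Dict[str, Any]]) -> bytes:
    """Serialize scraped records to pretty-printed UTF-8 JSON."""
    # ensure_ascii=False keeps non-ASCII text (e.g. product names) readable.
    return json.dumps(records, indent=2, ensure_ascii=False).encode("utf-8")


records = [{"title": "Desk lamp", "price": 19.99}]
payload = to_json_bytes(records)
print(payload.decode("utf-8"))
```

In a Streamlit app, bytes like these could be handed to `st.download_button(label="Download JSON", data=payload, file_name="scraped_data.json", mime="application/json")`; the label and file name here are placeholders.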
- Clone the Repository:
git clone https://yourrepositorylink.git
- Create a venv and install all the requirements:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Create a .env file with the following content:
# OPENAI Key
OPENAI_API_KEY=<OPENAI_API_KEY>
- To start the application, run:
streamlit run app.py
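Because the setup depends on `OPENAI_API_KEY` being present in the `.env` file, a startup check along these lines can surface a missing key early. This is a sketch, not code taken from the app: `require_env` is a hypothetical helper, and it assumes the `.env` values have already been loaded into the environment (for example with `load_dotenv()` from python-dotenv).

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demonstration with an explicit mapping instead of the real environment.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "placeholder-key"}))
```

Accepting the environment as a parameter makes the check easy to exercise without touching real credentials.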
- Streamlit
- LangChain
- Pydantic
- Requests
- dotenv
- json
Feel free to fork the repository, make changes, and submit pull requests. If you encounter any issues or have suggestions for improvement, please submit an issue.
- LangChain Library: For providing the tools to integrate AI capabilities seamlessly.
- Streamlit: For making it possible to build interactive web applications quickly and easily.