- Python 3.8 or higher
- Required libraries:
  - requests
  - bs4 (BeautifulSoup)
Clone the repository:
git clone https://github.com/pythonshik/ai-html-parser.git
OR
Install with pip:
pip install ai-html-parse
- Import the AIparser class:
  from AIparse import AIparser
- Initialize the parser with a URL:
  element = AIparser("https://www.youtube.com/@PythonShik")
- Parse specific elements:
  for i in ["number of videos", "number of subscribers"]:
      parsed_data = element.parse(i)
      print(f"{parsed_data['explain']}: {parsed_data['value']}")
- Output example:
  { "value": "96", "explain": "Number of subscribers", "result": "96 subscribers" }
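Since the result is produced by a language model, it can be worth checking its shape before using it. The helpers below are a minimal sketch, assuming the result is a dict with the three keys shown in the output example; `check_result` and `value_as_int` are hypothetical names and are not part of the package.

```python
# Keys taken from the output example; the helper names are illustrative only.
REQUIRED_KEYS = ("value", "explain", "result")


def check_result(data):
    """Return data unchanged if it is a dict with all expected keys."""
    if not isinstance(data, dict):
        raise ValueError("expected a dict")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return data


def value_as_int(data):
    """Convert the "value" string to an int, ignoring thousands separators."""
    return int(str(data["value"]).replace(",", "").strip())
```

With the output example, `value_as_int(check_result(parsed_data))` would give the subscriber count as a number instead of a string.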
This project is an AI-powered HTML parser designed to extract specific data from web pages using Google Gemini's text generation API. The parser processes the HTML source code of a webpage, identifies specific elements, and returns the desired information in a structured JSON format.
- AI Integration: Utilizes Google Gemini for intelligent text analysis.
- HTML Parsing: Extracts and processes HTML elements using BeautifulSoup.
- Customizable Instructions: Supports user-defined parsing instructions.
- JSON Output: Provides clear and structured results in JSON format.
- User Input: Provide a URL and the target element to parse.
- HTML Fetching: The tool fetches the HTML source code of the webpage.
- AI Analysis: The HTML source and target element are sent to the AI for processing.
- JSON Output: The AI generates a structured response containing the extracted information.
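The steps above can be sketched roughly as follows. This is not the project's code: `visible_text`, `build_prompt`, `run_pipeline`, and the `ask_ai` callable are hypothetical names, the prompt wording is made up, and dropping `<script>` and `<style>` content before sending is an assumption. To stay dependency-free and runnable offline, the sketch takes the HTML as a string instead of fetching it and uses the standard library's `html.parser` in place of BeautifulSoup.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Reduce a page to its visible text, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def build_prompt(html, target):
    """Combine the target description and the page text (assumed wording)."""
    return (
        f"Find the element described as: {target!r}.\n"
        'Reply with JSON containing the keys "value", "explain" and "result".\n'
        "Page text:\n" + visible_text(html)
    )


def run_pipeline(html, target, ask_ai):
    """ask_ai is any callable that takes a prompt and returns JSON text."""
    reply = ask_ai(build_prompt(html, target))
    return json.loads(reply)
```

Passing the AI call in as a function keeps the sketch testable with a stub; in the real tool that role is played by the Gemini request.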
The core class for interacting with Google Gemini's text generation API.
- Features:
  - API key management.
  - Methods for adding and managing conversation history.
  - Text generation using the generate() method.
- Key Methods:
  - history_add(role, content): Adds messages to the conversation history.
  - generate(): Sends data to the Gemini API and retrieves the generated text.
  - export_history(filename): Saves conversation history to a file.
  - import_history(filename): Loads conversation history from a file.
  - clear_history(filename): Clears the conversation history.
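To illustrate how the history methods might fit together, here is a minimal sketch. It is an assumption about the behaviour, not the class's actual implementation: the `History` class name, the list-of-dicts message format, and JSON as the file format are all made up for the example, and `clear_history` here takes no argument, unlike the signature listed above.

```python
import json


class History:
    """Hypothetical stand-in for the history part of the AI class."""

    def __init__(self):
        self.messages = []

    def history_add(self, role, content):
        # Store each message as a small dict (assumed format).
        self.messages.append({"role": role, "content": content})

    def export_history(self, filename):
        # Save the conversation as JSON (the file format is an assumption).
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.messages, f, ensure_ascii=False, indent=2)

    def import_history(self, filename):
        # Replace the current conversation with the saved one.
        with open(filename, encoding="utf-8") as f:
            self.messages = json.load(f)

    def clear_history(self):
        self.messages.clear()
```

Exporting before clearing lets a later session pick up the same conversation with `import_history`.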
Defines the instruction format for AI tasks.
- Key Class: Instructions
  - first_instruction: Provides a detailed guide for parsing HTML elements and formatting the response.
The main entry point for the application.
- Features:
  - Manages the parsing process using AIparser.
  - Configures and interacts with the Gen class for AI communication.
  - Outputs results for specific elements like "number of subscribers" or "number of videos".
- Key Methods:
  - AIparser.__init__: Initializes the parser with a URL and target element.
  - AIparser.parse(element): Parses the given element and retrieves AI-generated results.
This tool is ideal for:
- Marketers and Analysts: For monitoring trends, gathering competitor data, and extracting insights.
- Small and Medium Businesses: To automate tasks like market monitoring or customer review aggregation.
- SEO Specialists: To analyze site content, keywords, and metadata.
- Developers and Freelancers: To speed up the execution of client parsing tasks.
- Journalists and Bloggers: To gather data for articles and posts effortlessly.
- Speed: Processing can take up to 45 seconds per request because of the AI generation step.
- Dependencies: Requires an active internet connection and a valid API key.
- Scalability: Not optimized for high-frequency requests.
- Monitoring changes on web pages.
- Extracting market research data.
- Analyzing competitors' content.
- Automating reporting tasks.
- Optimize performance with batch processing and caching.
- Add support for local AI models to reduce dependency on external APIs.
- Expand parsing capabilities to include other data formats like JSON and XML.
- Develop a user-friendly interface (e.g., Telegram bot or web app).
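As one possible shape for the caching idea in the list above, the sketch below wraps any parse function in a small in-memory cache with an expiry time. Nothing here exists in the project yet: `CachedParser`, `parse_fn`, `ttl_seconds`, and `clock` are hypothetical names chosen for illustration.

```python
import time


class CachedParser:
    """Caches results per (url, element) pair for ttl_seconds."""

    def __init__(self, parse_fn, ttl_seconds=3600.0, clock=time.monotonic):
        self._parse_fn = parse_fn  # callable(url, element) -> result
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._cache = {}

    def parse(self, url, element):
        key = (url, element)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # still fresh: skip the slow AI call
        result = self._parse_fn(url, element)
        self._cache[key] = (now, result)
        return result
```

Based on the usage shown earlier, it could be wired up as `CachedParser(lambda url, el: AIparser(url).parse(el))`, so repeated questions about the same page within the expiry window would not trigger another AI request.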
Feel free to contribute to the project by submitting issues or pull requests.
This project is licensed under the MIT License.