This Python application extracts news data from pre-extracted news media section URLs stored in the file final_url_sections_vX.json (where X is an integer file version), filters and cleans the data, optionally summarizes the article bodies using the OpenAI API (GPT models), and stores the information in a structured database. The script is organized into several functions, each serving a specific purpose in the data extraction process.
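As a rough sketch of how the sections file might be consumed (the loader name and the JSON layout, a mapping from media names to lists of section URLs, are assumptions, not the script's actual structure):

```python
import json

def load_url_sections(version: int) -> dict:
    """Load the pre-extracted section URLs for every media source."""
    with open(f"final_url_sections_v{version}.json", encoding="utf-8") as f:
        return json.load(f)

sections = load_url_sections(version=1)  # e.g. final_url_sections_v1.json
for media_name, section_urls in sections.items():
    print(media_name, len(section_urls))
```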
The web scraping process is designed to be executed in a multi-process environment to improve efficiency.
This script can be a valuable tool for collecting and analyzing news data from various sources.
To execute the main process and view the extracted news on a locally hosted, Django-backed webpage, refer to the helper file run-project-end-to-end.txt.
The script can be run to extract news data from various media sources. It processes a list of media URLs and their associated sections, extracting news URLs and storing the data in a structured database.
Please note that the script relies on external APIs, so you'll need to set the necessary API key as an environment variable:

export OPENAI_API_KEY=yourOpenAIAPIKey
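Inside the script, the key could then be picked up like this (a minimal sketch using the official openai Python SDK; the explicit check is illustrative):

```python
import os
from openai import OpenAI

# Fail fast if the key is missing; OpenAI() would also read it implicitly.
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is not set")

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
```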
This function extracts news data from JSON objects within the HTML content of a news webpage. It looks for various attributes like the article body, article type, creation date, and more. The data is stored in a dictionary and returned.
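A minimal sketch of this kind of extraction, assuming the JSON objects are schema.org JSON-LD blocks and using BeautifulSoup (the function name and the exact attribute keys are illustrative):

```python
import json
from bs4 import BeautifulSoup

def extract_jsonld_data(html: str) -> dict:
    """Pull article attributes from JSON-LD blocks embedded in the page."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            obj = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # some sites ship malformed JSON-LD; skip it
        if isinstance(obj, list):  # a page may embed several objects
            obj = next((o for o in obj if isinstance(o, dict)), {})
        data.setdefault("article_type", obj.get("@type"))
        data.setdefault("article_body", obj.get("articleBody"))
        data.setdefault("creation_date", obj.get("datePublished"))
    return data
```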
This function extracts additional data from meta tags in the HTML of the news webpage. It retrieves information like the title, description, creation date, and more. The extracted data is added to the dictionary passed to the function.
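A sketch of what this meta-tag pass could look like; the OpenGraph property names shown are common conventions, not necessarily the exact tags the script targets:

```python
from bs4 import BeautifulSoup

def extract_meta_data(html: str, data: dict) -> dict:
    """Add title/description/date from <meta> tags to an existing dict."""
    soup = BeautifulSoup(html, "html.parser")

    def meta(*names: str) -> str | None:
        for name in names:
            tag = soup.find("meta", attrs={"property": name}) or \
                  soup.find("meta", attrs={"name": name})
            if tag and tag.get("content"):
                return tag["content"]
        return None

    data.setdefault("title", meta("og:title", "twitter:title"))
    data.setdefault("description", meta("og:description", "description"))
    data.setdefault("creation_date", meta("article:published_time"))
    return data
```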
This function uses the GPT API to extract additional keys from the text content of the webpage. It looks for relevant information such as the number of tokens, title, tags, creation date, and more.
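A hedged sketch of such a call using the openai SDK; the model name, prompt, and returned key names are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_extract_keys(text: str) -> tuple[dict, int]:
    """Ask a GPT model to return structured fields as JSON (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the script's actual model may differ
        messages=[
            {"role": "system",
             "content": "Return a JSON object with keys: title, tags, creation_date. "
                        "Use null for anything you cannot find."},
            {"role": "user", "content": text[:8000]},  # crude guard against context overflow
        ],
    )
    keys = json.loads(response.choices[0].message.content)  # may raise if the model strays from JSON
    return keys, response.usage.total_tokens
```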
This function extracts the main body of the news article using the GPT API. It processes the text content of the webpage and returns the article body and the number of tokens.
This function uses the GPT API to summarize the article body, providing a brief summary of the news content. It takes the text, URL, and media name as input and returns the summarized body and the number of tokens. Note that you can disable API calls by setting the constant BLOCK_API_CALL to True in constants.py, where application variables are stored.
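A sketch of how the summarization call and the BLOCK_API_CALL guard could fit together (the function signature and model are illustrative; BLOCK_API_CALL and constants.py come from this project):

```python
from openai import OpenAI
from constants import BLOCK_API_CALL  # application-wide switch from constants.py

client = OpenAI()

def summarize_body(text: str, url: str, media_name: str) -> tuple[str, int]:
    """Return (summary, tokens_used); skip the paid call when blocked."""
    if BLOCK_API_CALL:
        return text, 0  # no API call: pass the body through unchanged
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=[
            {"role": "system",
             "content": f"Write a brief summary of this article from {media_name} ({url})."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content, response.usage.total_tokens
```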
This function processes a list of raw news URLs, filtering out invalid or undesirable URLs. It also interacts with a database to determine which URLs are already stored.
This function filters out invalid URLs based on various criteria, including query symbols, file extensions, and specific URL patterns. It also checks if URLs are already present in the database.
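Illustrative filtering logic under those criteria (the extension list and helper name are assumptions):

```python
from urllib.parse import urlparse

UNDESIRED_EXTENSIONS = (".jpg", ".png", ".pdf", ".xml", ".mp4")  # illustrative

def filter_urls(raw_urls: list[str], stored_urls: set[str]) -> list[str]:
    """Drop URLs with query strings, media extensions, or already in the DB."""
    kept = []
    for url in raw_urls:
        parsed = urlparse(url)
        if parsed.query:                      # query symbols usually mean non-article pages
            continue
        if parsed.path.lower().endswith(UNDESIRED_EXTENSIONS):
            continue
        if url in stored_urls:                # already processed on a previous run
            continue
        kept.append(url)
    return kept
```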
This function processes a list of news URLs, extracts the necessary data from each URL, and stores it in a structured format. It also identifies URLs that couldn't be processed.
This function orchestrates the multi-process execution of the web scraping script. It divides the workload among multiple processes, each handling a specific set of media URLs.
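One plausible shape for this orchestration using multiprocessing.Pool (the worker and function names are illustrative):

```python
from multiprocessing import Pool

def scrape_media(media_item: tuple[str, list[str]]) -> str:
    """Worker: handle every section URL for one media source."""
    media_name, section_urls = media_item
    # ... fetch each section, extract news URLs, store results ...
    return media_name

def run_all(sections: dict[str, list[str]], processes: int = 4) -> None:
    # One media source per task, spread across a small pool of workers.
    # On spawn-based platforms, call run_all() under `if __name__ == "__main__":`.
    with Pool(processes=processes) as pool:
        for done in pool.imap_unordered(scrape_media, sections.items()):
            print(f"finished {done}")
```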
This function contains the main per-worker logic for processing news URLs. It iterates through media URLs, retrieves the HTML content of the news sections, and processes the news URLs found there.
This function checks the publication date of a news article to determine if it's too old. If the article is older than a specified threshold, it is filtered out.
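A sketch of the age check, assuming ISO-formatted dates and an illustrative threshold:

```python
from datetime import datetime, timedelta

MAX_AGE_DAYS = 7  # illustrative threshold

def is_too_old(creation_date: str) -> bool:
    """Return True when the article predates the configured threshold."""
    published = datetime.fromisoformat(creation_date)
    published = published.replace(tzinfo=None)  # normalize in case it's tz-aware
    return datetime.now() - published > timedelta(days=MAX_AGE_DAYS)
```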
This function arranges the keys in a dictionary in a specified order, making the resulting dictionary more structured and easier to work with.
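For example, relying on Python's insertion-ordered dicts (the key order shown is illustrative):

```python
KEY_ORDER = ["url", "title", "creation_date", "article_body"]  # illustrative

def arrange_keys(data: dict) -> dict:
    """Rebuild the dict with known keys first, in a fixed order."""
    ordered = {k: data[k] for k in KEY_ORDER if k in data}
    ordered.update({k: v for k, v in data.items() if k not in ordered})
    return ordered
```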
These functions interact with a database to store and retrieve news data. read_stored_news retrieves URLs that are already stored, and create_news_table initializes the database table if it doesn't exist.
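A sketch of what these helpers could look like with sqlite3; the backend and schema are assumptions, as this README doesn't specify them:

```python
import sqlite3

def create_news_table(conn: sqlite3.Connection) -> None:
    """Create the news table on first run; a no-op afterwards."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS news (
               url TEXT PRIMARY KEY,
               title TEXT,
               creation_date TEXT,
               article_body TEXT
           )"""
    )
    conn.commit()

def read_stored_news(conn: sqlite3.Connection) -> set[str]:
    """Return the set of URLs already persisted, for deduplication."""
    rows = conn.execute("SELECT url FROM news").fetchall()
    return {url for (url,) in rows}
```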
This function inserts news data into the database table, making it available for later retrieval and analysis.
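Continuing the sqlite3 sketch above, the insert might look like this (INSERT OR IGNORE keeps reruns idempotent on the url key):

```python
import sqlite3

def insert_news(conn: sqlite3.Connection, item: dict) -> None:
    """Persist one news record; duplicate URLs are silently skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO news (url, title, creation_date, article_body) "
        "VALUES (:url, :title, :creation_date, :article_body)",
        item,
    )
    conn.commit()
```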