# News Scraper and Summarizer using OpenAI

This notebook automates the process of scraping news articles, summarizing them using OpenAI's GPT model, saving them to a CSV file, and displaying them in a user-friendly format. The process involves:
1. Scraping articles from a website.
2. Summarizing the content with OpenAI.
3. Saving the articles and summaries into a CSV file.
4. Displaying the data interactively.

## Libraries and Configuration
First, we import necessary libraries and set up configurations such as logging and the OpenAI API key.

In [1]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import openai
from config import OPENAI_API_KEY # Ensure this file contains your OpenAI API key
import csv
import pandas as pd
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # Import directly from urllib3 rather than the legacy requests.packages alias
from requests.exceptions import RequestException

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("script_web.log"), # Log to file
        logging.StreamHandler()                # Log to console
    ]
)

# Use API key
openai.api_key = OPENAI_API_KEY

logging.info("Logging setup complete and OpenAI API key successfully set.")

2024-12-23 15:17:06,037 - INFO - Logging setup complete and OpenAI API key successfully set.
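
The cell above imports `OPENAI_API_KEY` from a local `config` module whose contents are not shown in this notebook. As a minimal sketch of one way such a module could be written, the key can be read from an environment variable so that it never ends up in version control. The helper name `load_api_key` and the choice of an `OPENAI_API_KEY` environment variable are assumptions made for illustration, not the author's actual setup.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: it raises a clear error instead of silently
    handing an empty key to the OpenAI client.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty.")
    return key


# In a hypothetical config.py, the module-level constant could then be:
# OPENAI_API_KEY = load_api_key()
```

Failing early here makes a missing key show up during setup rather than as an API error in the middle of a scraping run.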


## Initialize Session with Retry Logic

We create an HTTP session with retry logic to ensure the script handles intermittent connection issues gracefully.


In [2]:
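# Initialize session with retry logic
session = requests.Session()
retry_strategy = Retry(
    total=5,           # Retry a failed request up to 5 times
    backoff_factor=1   # Wait progressively longer between attempts
)
adapter = HTTPAdapter(max_retries=retry_strategy)

# Use the retrying adapter for both plain and TLS connections
session.mount('http://', adapter)
session.mount('https://', adapter)

logging.info("HTTP session with retry logic initialized.")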

2024-12-23 15:17:07,233 - INFO - HTTP session with retry logic initialized.


## Functions
Below are the main functions used in this script:
1. `scrape_articles(url)`: Scrapes articles from a given website URL.
2. `summarize_with_openai(text)`: Summarizes the given text using OpenAI.
3. `parse_article_date(date_str)`: Parses and standardizes article dates.
4. `get_article_content(url)`: Fetches the main content of an article from its URL.
5. `save_to_csv(articles, filename)`: Saves articles to a CSV file.


In [4]:
# Function to scrape articles from the website
def scrape_articles(url):
    """
    Scrapes articles from the provided website URL.
    
    Args:
        url (str): URL of the website to scrape articles from.
    
    Returns:
        list: A list of dictionaries, where each dictionary contains article data (title, link, date, type).
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        response = session.get(url, timeout=30, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        article_data = []

        # Find the Featured section
        featured_section = soup.find('section', class_='featured')
        if featured_section:

            # Find all featured article blocks with specific class
            featured_blocks = featured_section.find_all('div', class_='cell blocks small-12 medium-3 large-3')

            for block in featured_blocks:
                try:
                    # Get the image link and title
                    link_element = block.find('a', class_='img-link')
                    title_element = block.find('h3')
                    date_div = block.find('div', class_='content')  # Date extraction

                    if link_element and title_element:
                        title = link_element['title'].strip()  # Title is in the link's title attribute
                        link = link_element['href'].strip()   # Article URL
                        date_str = date_div.text.strip().split('|')[0].strip() if date_div else 'No date available'
                        date = parse_article_date(date_str) 

                        article_data.append({
                            'Title': title,
                            'Link': link,
                            'Date': date,
                            'Type': 'featured'
                        })

                except Exception as e:
                    logging.warning(f"Error processing featured article: {e}")
                    continue

        # Get regular articles
        regular_articles = soup.find_all('article')

        for article in regular_articles:
            try:
                title = article.find('h3').get_text(strip=True)
                link = article.find('a')['href']
                date_div = article.find('div', class_='content')
                date_str = date_div.text.strip().split('|')[0].strip() if date_div else 'No date available'
                date = parse_article_date(date_str)

                # Check if this article is already in our list
                if not any(a['Link'] == link for a in article_data):
                    article_data.append({
                        'Title': title,
                        'Link': link,
                        'Date': date,
                        'Type': 'regular'
                    })
            except Exception as e:
                logging.warning(f"Error processing article: {e}")
                continue

        return article_data
    except RequestException as e:
        logging.error(f"Request error: {e}")
        return []


# Function to summarize article content using OpenAI
def summarize_with_openai(text):
    """
    Summarizes the given text using OpenAI API.

    Args:
        text (str): Text to summarize.

    Returns:
        str: Summary of the text, or an empty string if an error occurs.
    """
    try:
        user_prompt = f"Please provide a concise summary of this article: {text}"
        completion = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that summarizes news articles."},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,
            max_tokens=200
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        logging.error(f"OpenAI API error: {e}")
        return ""

# Function to parse and standardize dates
def parse_article_date(date_str):
    """
    Converts a date string into a standardized date object.

    Args:
        date_str (str): Date string to parse.

    Returns:
        date: Parsed date object or None if parsing fails.
    """

    formats = ['%d %B %Y', '%Y-%m-%d', '%B %d, %Y'] # Add more formats if needed
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    logging.warning(f"Unrecognized date format: {date_str}")
    return None

# Function to fetch the content of an article
def get_article_content(url):
    """
    Fetches the main content of an article from its URL.

    Args:
        url (str): URL of the article.

    Returns:
        str: Extracted article content, or an empty string if extraction fails.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        response = session.get(url, timeout=30, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        content_container = soup.find('div', class_='article-content') or soup.find('article')

        if content_container:
            paragraphs = content_container.find_all('p')
            return ' '.join([p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)])
        return ""
    except RequestException as e:
        logging.error(f"Error fetching article content: {e}")
        return ""

# Function to save articles to CSV
def save_to_csv(articles, filename):
    """
    Saves a list of articles to a CSV file.

    Args:
        articles (list): List of dictionaries containing article data.
        filename (str): Name of the CSV file to save.
    """
    try:
        keys = ['Title', 'Date', 'Link', 'Type', 'Summary']
        with open(filename, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=keys)
            writer.writeheader()
            writer.writerows(articles)
        logging.info(f"Articles saved to {filename}")
    except Exception as e:
        logging.error(f"Error saving to CSV: {e}")
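
`save_to_csv` relies on `csv.DictWriter` with a fixed list of field names, which has two consequences worth knowing. The sketch below, which writes to an in-memory buffer instead of a file, shows them: a row missing one of the fields is written with an empty value, while a row carrying an unexpected key is rejected with `ValueError`. The helper `rows_to_csv_text` and the sample row are hypothetical and exist only for this illustration.

```python
import csv
import io

FIELDNAMES = ['Title', 'Date', 'Link', 'Type', 'Summary']


def rows_to_csv_text(rows, fieldnames=FIELDNAMES):
    """Serialize a list of dicts to CSV text in memory."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A row without 'Summary' is written with an empty last column
partial_row = {'Title': 'Example', 'Date': '2024-12-23',
               'Link': 'https://example.com/a', 'Type': 'regular'}
print(rows_to_csv_text([partial_row]))

# A row with a key outside FIELDNAMES raises ValueError
try:
    rows_to_csv_text([{**partial_row, 'Author': 'unknown'}])
except ValueError as error:
    print(f"Rejected: {error}")
```

In `save_to_csv` that `ValueError` would be caught by the broad `except Exception` and only logged, so an unexpected key in any article dictionary would leave the CSV incomplete without stopping the notebook.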


## Main Script
In this section, we scrape today's news articles, summarize them, and save the results in a CSV file. The data is also displayed for easy viewing in the notebook.
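
The main cell writes its output under `data/` and `results/`. Python's `open` does not create missing parent directories, so on a fresh checkout those writes would fail unless the folders already exist. A minimal sketch of a guard, assuming the folders may be absent, is shown here; the helper name `ensure_parent_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(filename):
    """Create the parent directory of `filename` if it is missing."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "articles_2024-12-23.csv")
    target.write_text("Title,Date,Link,Type,Summary\n", encoding="utf-8")
    print(target.parent.is_dir(), target.exists())
```

Calling such a helper on both output paths before `save_to_csv` and `to_html` would be one way to make the cell independent of the working directory's layout.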


In [5]:
# Main execution logic
url = "https://www.artificialintelligence-news.com"

# Scrape articles
articles = scrape_articles(url)

# Use today's date for filtering
today = datetime.now().date()

filtered_articles = []
for article in articles:
    if article['Date'] == today:
        content = get_article_content(article['Link'])
        if content:
            article['Summary'] = summarize_with_openai(content)
            filtered_articles.append(article)

# Save articles to a CSV file
today_str = today.strftime('%Y-%m-%d')
filename = f"data/articles_{today_str}.csv"
save_to_csv(filtered_articles, filename)

# Display articles in Jupyter Notebook with clickable links and formatted display
if filtered_articles:
    df = pd.DataFrame(filtered_articles)
    df_display = df.drop(['Type'], axis=1)

    # Function to make links clickable
    def make_clickable(val):
        return '<a href="{}" target="_blank">{}</a>'.format(val, val)

    # Define the CSS styles
    styles = [
        # Header style
        dict(selector="th", props=[
            ("background-color", "#914048"),  # Dark red background
            ("color", "white"),               # White text
            ("font-weight", "bold"),
            ("text-align", "center"),
            ("padding", "10px")
        ]),
        # Add grid to cells
        dict(selector="td", props=[
            ("border", "1px solid #ddd"),
            ("padding", "8px")
        ]),
        # Add grid to header cells
        dict(selector="th", props=[
            ("border", "1px solid #ddd")
        ])
    ]
    # Apply the formatting to the Link column
    df_styled = df_display.style\
    .format({'Link': make_clickable})\
    .set_table_styles(styles)
    # If you're working in a Jupyter notebook, display the styled DataFrame
    display(df_styled)

    filename_output = f"results/articles_{today_str}.html"
    # If you're saving to HTML
    df_styled.to_html(filename_output, escape=False)
else:
    print("No articles found for today.")

2024-12-23 15:18:19,094 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-12-23 15:18:29,500 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-12-23 15:18:38,814 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-12-23 15:18:46,318 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-12-23 15:18:46,327 - INFO - Articles saved to articles_2024-12-23.csv


Unnamed: 0,Title,Link,Date,Summary
0,OpenAI funds $1 million study on AI and morality at Duke University,https://www.artificialintelligence-news.com/news/openai-funds-1-million-study-on-ai-and-morality-at-duke-university/,2024-12-23,"OpenAI has awarded a $1 million grant to Duke University's Moral Attitudes and Decisions Lab (MADLAB) for a project called ""Making Moral AI"". The research team, led by ethics professor Walter Sinnott-Armstrong and co-investigator Jana Schaich Borg, aims to develop a ""moral GPS"" that could guide ethical decision-making. The project will explore how AI might predict or influence moral judgments, such as ethical dilemmas in autonomous vehicles or business practices. However, the initiative raises questions about who determines the moral framework guiding these tools and whether AI should be trusted to make decisions with ethical implications. The grant will support the development of algorithms that predict human moral judgments in areas like medicine, law, and business."
1,Manhattan Project 2.0? US eyes AGI breakthrough in escalating China rivalry,https://www.artificialintelligence-news.com/news/manhattan-project-2-0-us-eyes-agi-breakthrough-in-escalating-china-rivalry/,2024-12-23,"The US-China Economic and Security Review Commission (USCC) has recommended a Manhattan Project-style initiative and restrictions on humanoid robots in its latest report to Congress. The report proposes a government-backed program to develop Artificial General Intelligence (AGI), AI systems that could match or exceed human cognitive abilities. The AGI initiative would provide multi-year contracts to leading AI companies, cloud providers, and data center operators, backed by the Defense Department’s highest priority, “DX Rating”. The report also suggests restricting imports of Chinese-made autonomous humanoid robots with advanced capabilities and targets energy infrastructure products with remote monitoring capabilities. The Commission also recommends stronger oversight of technology transfers and investment flows, and the creation of an Outbound Investment Office to prevent US capital and expertise from advancing China’s technological capabilities in sensitive sectors. The report also suggests eliminating China’s Permanent Normal Trade Relations status, which could reshape the technology supply chain and trade flows."
2,"How blockchain, IoT, and AI are shaping the future of digital transformation",https://www.artificialintelligence-news.com/news/how-blockchain-iot-and-ai-are-shaping-the-future-of-digital-transformation/,2024-12-23,"Blockchain, IoT, and AI are converging to redefine industries, according to David Palmer, chief product officer of Pairpoint by Vodafone. Blockchain has evolved from experimental concepts to practical tools, with applications in supply chain management and decentralized finance. IoT devices, expected to number around 30 billion worldwide by 2030, generate vast amounts of data that AI systems can use to provide actionable insights. Blockchain ensures the security and reliability of this data. Digital wallets, expected to grow from 4 billion today to 5.6 billion by 2030, are becoming a cornerstone of this ecosystem. The integration of finance into IoT devices allows for autonomous transactions, while decentralized physical infrastructure networks allow for shared resources. Governments are also exploring the potential of blockchain through Central Bank Digital Currencies and tokenized deposits. The convergence of these technologies could reshape industries and economies by 2030."
3,Ordnance Survey: Navigating the role of AI and ethical considerations in geospatial technology,https://www.artificialintelligence-news.com/news/ordnance-survey-navigating-the-role-of-ai-and-ethical-considerations-in-geospatial-technology/,2024-12-23,"Manish Jethwa, CTO at Ordnance Survey (OS), predicts significant advancements in artificial intelligence (AI) and machine learning (ML) in the coming year, particularly in the geospatial sector. He anticipates the integration of large language models with sophisticated agents to perform complex tasks and reduce barriers to interaction. This will make geospatial datasets more accessible and user-friendly. Jethwa also emphasizes the need for ethical considerations in AI development, including creating transparent, fair, and unbiased systems. He highlights the importance of workforce development and retraining to prepare employees for the impact of AI and digital transformation. Despite the potential of these advancements, challenges such as cultural resistance, change fatigue, and cybersecurity threats persist. Jethwa urges companies to develop comprehensive strategies to address these issues and to maintain a clear vision of future goals."
