# News Scraper and Summarizer using OpenAI

This notebook scrapes AI news articles from several websites, summarizes them with an OpenAI GPT model, saves the results to a CSV file, and emails them as a formatted HTML report. The process involves:
1. Scraping articles from each news website.
2. Summarizing the content with OpenAI.
3. Saving the articles and summaries to a CSV file.
4. Emailing the combined report to the recipients.

## Libraries and Configuration
First, we import necessary libraries and set up configurations such as logging and the OpenAI API key.
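
The `config` module imported in the next cell is not shown in this notebook. A minimal sketch of what such a file might look like, assuming the values are supplied through environment variables (the environment variable names are hypothetical; only the four constant names come from the import statement):

```python
# config.py (hypothetical sketch): read secrets from the environment
# instead of hard-coding them in the notebook or repository.
import os

# Only these four constant names come from the notebook's import line;
# the environment variable names are assumptions made for illustration.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
EMAIL = os.environ.get("NEWS_SENDER_EMAIL", "")
PASSWORD = os.environ.get("NEWS_SENDER_PASSWORD", "")
RECIPIENT_EMAIL = os.environ.get("NEWS_RECIPIENT_EMAIL", "")

# Collect the names of unset values so a misconfiguration is caught early.
missing = [name for name, value in {
    "OPENAI_API_KEY": OPENAI_API_KEY,
    "EMAIL": EMAIL,
    "PASSWORD": PASSWORD,
    "RECIPIENT_EMAIL": RECIPIENT_EMAIL,
}.items() if not value]
```

Keeping credentials out of the notebook itself means the file can be shared without exposing the API key or the mail password.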

In [16]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import openai
from config import OPENAI_API_KEY, EMAIL, PASSWORD, RECIPIENT_EMAIL # config.py must define the OpenAI API key and the email settings
import csv
import pandas as pd
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from requests.exceptions import RequestException
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from apscheduler.schedulers.blocking import BlockingScheduler

In [2]:
# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("script_web.log"), # Log to file
        logging.StreamHandler()                # Log to console
    ]
)

# Use API key
openai.api_key = OPENAI_API_KEY

logging.info("Logging setup complete and OpenAI API key successfully set.")

2025-01-27 15:01:34,262 - INFO - Logging setup complete and OpenAI API key successfully set.


## Initialize Session with Retry Logic

We create an HTTP session with retry logic to ensure the script handles intermittent connection issues gracefully.
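
To make the idea of retrying concrete, here is a minimal, library-free sketch of retrying a flaky operation with exponential backoff. The helper `retry_with_backoff` and the `flaky_fetch` function are hypothetical illustrations and are not part of this notebook or of `requests`; the session in the next cell delegates retries to `HTTPAdapter` instead.

```python
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts.
    Hypothetical helper for illustration only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

# Simulated flaky call: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated connection drop")
    return "page content"

delays = []  # Record the delays instead of actually sleeping
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the backoff schedule easy to inspect.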


In [3]:
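# Initialize session with retry logic
session = requests.Session()
# An integer max_retries only retries failed DNS lookups, socket connections,
# and connection timeouts; it does not retry requests that reached the server
adapter = HTTPAdapter(max_retries=5)

session.mount('http://', adapter)
session.mount('https://', adapter)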

logging.info("HTTP session with retry logic initialized.")

2025-01-27 15:01:35,650 - INFO - HTTP session with retry logic initialized.


## Functions
Below are the main functions used in this script:
1. `scrape_articles(url)`: Scrapes articles from a given website URL.
2. `summarize_with_openai(text)`: Summarizes the given text using OpenAI.
3. `parse_article_date(date_str)`: Parses and standardizes article dates.
4. `get_article_content(url)`: Fetches the main content of an article from its URL.
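
As an illustration of the date handling in items 1 and 3, the sketch below shows how a listing's raw text could be reduced to a date. The sample strings are made up for the example, and `parse_date_sketch` is a simplified stand-in for `parse_article_date` (same format list, no logging).

```python
from datetime import datetime, date

# Same formats that parse_article_date tries, in the same order
FORMATS = ['%d %B %Y', '%Y-%m-%d', '%B %d, %Y']

def parse_date_sketch(date_str):
    """Return a date for the first matching format, or None if none match."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None

# Hypothetical listing text: the date is the part before the first '|'
raw_text = "  16 January 2025 | Ethics & Society  "
date_str = raw_text.strip().split('|')[0].strip()

parsed = [parse_date_sketch(s) for s in
          (date_str, "2025-01-16", "January 16, 2025", "No date available")]
print(parsed)
```

Note that `%B` matches month names in the current locale, so this approach assumes the site's dates and the runtime locale are both English.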


In [14]:
# Function to scrape articles from the website
def scrape_articles(url):
    """
    Scrapes articles from the provided website URL.
    
    Args:
        url (str): URL of the website to scrape articles from.
    
    Returns:
        list: A list of dictionaries, where each dictionary contains article data (title, link, date, type).
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        response = session.get(url, timeout=30, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        article_data = []

        # Find the Featured section
        featured_section = soup.find('section', class_='featured')
        if featured_section:

            # Find all featured article blocks with specific class
            featured_blocks = featured_section.find_all('div', class_='cell blocks small-12 medium-3 large-3')

            for block in featured_blocks:
                try:
                    # Get the image link and title
                    link_element = block.find('a', class_='img-link')
                    title_element = block.find('h3')
                    date_div = block.find('div', class_='content')  # Date extraction

                    if link_element and title_element:
                        title = link_element['title'].strip()  # Title is in the link's title attribute
                        link = link_element['href'].strip()   # Article URL
                        date_str = date_div.text.strip().split('|')[0].strip() if date_div else 'No date available'
                        date = parse_article_date(date_str) 

                        article_data.append({
                            'Title': title,
                            'Link': link,
                            'Date': date,
                            'Type': 'featured'
                        })

                except Exception as e:
                    logging.warning(f"Error processing featured article: {e}")
                    continue

        # Get regular articles
        regular_articles = soup.find_all('article')

        for article in regular_articles:
            try:
                title = article.find('h3').get_text(strip=True)
                link = article.find('a')['href']
                date_div = article.find('div', class_='content')
                date_str = date_div.text.strip().split('|')[0].strip() if date_div else 'No date available'
                date = parse_article_date(date_str)

                # Check if this article is already in our list
                if not any(a['Link'] == link for a in article_data):
                    article_data.append({
                        'Title': title,
                        'Link': link,
                        'Date': date,
                        'Type': 'regular'
                    })
            except Exception as e:
                logging.warning(f"Error processing article: {e}")
                continue

        return article_data
    except RequestException as e:
        logging.error(f"Request error: {e}")
        return []

def scrape_mit_articles(url):
    """
    Scrapes articles from MIT AI News
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        }
        
        response = session.get(url, timeout=30, headers=headers)  # Reuse the shared session
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        article_data = []

        # Find all article elements with the correct class
        articles = soup.find_all('article', class_='term-page--news-article--item')

        for article in articles:
            try:
                # Extract title
                title_element = article.find('h3', class_='term-page--news-article--item--title')
                title = title_element.find('a').get_text(strip=True) if title_element else None

                # Extract link
                link_element = article.find('a', class_='term-page--news-article--item--title--link')
                link = link_element['href'] if link_element else None
                if link and not link.startswith('http'):
                    link = f"https://news.mit.edu{link}"

                # Extract date and convert to date object
                date_element = article.find('time')
                date_str = date_element['datetime'] if date_element else None
                if date_str:
                    date_obj = datetime.fromisoformat(date_str.replace('Z', '+00:00')).date()
                else:
                    date_obj = None

                # Extract summary
                summary_element = article.find('p', class_='term-page--news-article--item--dek')
                summary = summary_element.get_text(strip=True) if summary_element else None

                if all([title, link]):  # Add article if at least title and link are present
                    article_data.append({
                        'Title': title,
                        'Link': link,
                        'Date': date_obj,
                        'Summary': summary
                    })

            except Exception as e:
                logging.warning(f"Error processing MIT article: {e}")
                continue

        return article_data
    except Exception as e:
        logging.error(f"Error in scraping MIT: {e}")
        return []

def scrape_stanford_articles(url):
    """
    Scrapes articles from Stanford AI News
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        }
        
        response = session.get(url, timeout=30, headers=headers)  # Reuse the shared session
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        article_data = []

        news_container = soup.find('div', {'data-component': 'topic-subtopic-listing'})
        if news_container:
            import json
            props = json.loads(news_container['data-hydration-props'])
            articles = props.get('data', [])
            
            for article in articles:
                try:
                    # Convert timestamp to date object immediately
                    if article.get('date'):
                        date_obj = datetime.fromtimestamp(article.get('date')/1000).date()
                    else:
                        date_obj = None

                    article_data.append({
                        'Title': article.get('title'),
                        'Link': article.get('liveUrl'),
                        'Date': date_obj,
                        'Summary': article.get('description', [''])[0] if isinstance(article.get('description'), list) else article.get('description')
                    })

                except Exception as e:
                    logging.warning(f"Error processing Stanford article: {e}")
                    continue

        return article_data
    
    except Exception as e:
        logging.error(f"Error in scraping Stanford: {e}")
        return []

# Function to summarize article content using OpenAI
def summarize_with_openai(text):
    """
    Summarizes the given text using OpenAI API.

    Args:
        text (str): Text to summarize.

    Returns:
        str: Summary of the text, or an empty string if an error occurs.
    """
    try:
        user_prompt = f"Please provide a concise summary of this article: {text}"
        completion = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that summarizes news articles."},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,
            max_tokens=200
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        logging.error(f"OpenAI API error: {e}")
        return ""

# Function to parse and standardize dates
def parse_article_date(date_str):
    """
    Converts a date string into a standardized date object.

    Args:
        date_str (str): Date string to parse.

    Returns:
        date: Parsed date object or None if parsing fails.
    """

    formats = ['%d %B %Y', '%Y-%m-%d', '%B %d, %Y'] # Add more formats if needed
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    logging.warning(f"Unrecognized date format: {date_str}")
    return None

# Function to fetch the content of an article
def get_article_content(url):
    """
    Fetches the main content of an article from its URL.

    Args:
        url (str): URL of the article.

    Returns:
        str: Extracted article content, or an empty string if extraction fails.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        response = session.get(url, timeout=30, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        content_container = soup.find('div', class_='article-content') or soup.find('article')

        if content_container:
            paragraphs = content_container.find_all('p')
            return ' '.join([p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)])
        return ""
    except RequestException as e:
        logging.error(f"Error fetching article content: {e}")
        return ""

def get_mit_article_content(url):
    """
    Fetches the content of an MIT News article
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        response = session.get(url, timeout=30, headers=headers)  # Reuse the shared session
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Try to find the main article content
        content_container = (
            soup.find('div', class_='news-article--content--body') or  # Try specific content class first
            soup.find('article') or                                    # Then try main article tag
            soup.find('main')                                         # Finally try main content area
        )
        
        if content_container:
            # Get all paragraphs
            paragraphs = content_container.find_all('p')
            
            # Clean and join the text
            content = ' '.join([
                p.get_text(strip=True) 
                for p in paragraphs 
                if p.get_text(strip=True) and 
                   'Previous image' not in p.get_text() and
                   'Next image' not in p.get_text()
            ])
            
            # Additional cleaning
            content = content.replace('Previous imageNext image', '')
            
            if len(content) > 100:  # Basic check to ensure we got meaningful content
                return content
                
        return ""
    except Exception as e:
        logging.error(f"Error fetching MIT article content: {e}")
        return ""

def get_stanford_article_content(url):
    """
    Fetches the content of a Stanford News article
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        response = session.get(url, timeout=30, headers=headers)  # Reuse the shared session
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        # Find the main article content
        content_container = soup.find('div', class_='su-page-content') or soup.find('article')
        
        if content_container:
            paragraphs = content_container.find_all('p')
            return ' '.join([p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)])
        return ""
    except Exception as e:
        logging.error(f"Error fetching Stanford article content: {e}")
        return ""
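
The Stanford scraper above reads its articles from a JSON blob stored in a `data-hydration-props` attribute rather than from visible HTML. The sketch below walks through that decoding step on a made-up payload. The field names (`title`, `liveUrl`, `date`, `description`) follow the scraper, while the values and the choice of UTC for the timestamp conversion are assumptions for the example.

```python
import json
from datetime import datetime, timezone

# Hypothetical attribute value, shaped like the props the scraper expects
raw_props = json.dumps({
    "data": [
        {
            "title": "Example article",
            "liveUrl": "https://example.com/article",
            "date": 1736985600000,  # milliseconds since the Unix epoch
            "description": ["First paragraph", "Second paragraph"],
        },
        {"title": "Undated article", "liveUrl": "https://example.com/other",
         "description": "Plain string description"},
    ]
})

def to_record(item):
    """Convert one JSON item into the dictionary shape used by the scrapers."""
    ms = item.get("date")
    # Dividing by 1000 converts milliseconds to seconds; UTC keeps the result
    # independent of the machine's local time zone.
    day = datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date() if ms else None
    desc = item.get("description")
    summary = (desc[0] if desc else "") if isinstance(desc, list) else desc
    return {"Title": item.get("title"), "Link": item.get("liveUrl"),
            "Date": day, "Summary": summary}

records = [to_record(item) for item in json.loads(raw_props).get("data", [])]
for record in records:
    print(record)
```

The sketch also guards against an empty description list, a case in which indexing the first element directly would raise an `IndexError`.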


## Main Script
In this section, we scrape the past week's news articles from all three sources, summarize them, save the results to a CSV file, and email a combined HTML report to the recipients.
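
The date filtering at the heart of this section can be sketched in isolation. The example below computes the seven-day window ending on a target date and keeps only the articles inside it. The sample articles are invented, and `week_window` and `in_window` are hypothetical helper names.

```python
from datetime import date, datetime, timedelta

def week_window(target_date=None):
    """Return (start, end) dates for the 7 days ending on target_date (YYYY-MM-DD)."""
    end = datetime.strptime(target_date, '%Y-%m-%d').date() if target_date else date.today()
    return end - timedelta(days=7), end

def in_window(article, start, end):
    """True when the article has a date and it falls inside [start, end]."""
    article_date = article.get('Date')
    return article_date is not None and start <= article_date <= end

# Invented sample data for illustration
articles = [
    {'Title': 'Inside the window', 'Date': date(2025, 1, 12)},
    {'Title': 'On the start boundary', 'Date': date(2025, 1, 9)},
    {'Title': 'Too old', 'Date': date(2025, 1, 8)},
    {'Title': 'Undated', 'Date': None},
]

start, end = week_window("2025-01-16")
kept = [a['Title'] for a in articles if in_window(a, start, end)]
print(start, end, kept)
```

Both endpoints are inclusive, so the window spans eight calendar days, and articles whose date could not be parsed are skipped rather than reported.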


In [12]:
def scrape_all_sources():
    """
    Scrapes articles from all three sources and combines them
    """
    all_articles = []
    
    # Scrape AI News
    ai_news_url = "https://www.artificialintelligence-news.com"
    ai_articles = scrape_articles(ai_news_url)
    for article in ai_articles:
        article['Source'] = 'AI News'
        all_articles.append(article)
    
    # Scrape MIT News
    mit_url = "https://news.mit.edu/topic/artificial-intelligence2"
    mit_articles = scrape_mit_articles(mit_url)
    for article in mit_articles:
        article['Source'] = 'MIT News'
        all_articles.append(article)
    
    # Scrape Stanford News
    stanford_url = "https://news.stanford.edu/artificial-intelligence"
    stanford_articles = scrape_stanford_articles(stanford_url)
    for article in stanford_articles:
        article['Source'] = 'Stanford News'
        all_articles.append(article)
    
    return all_articles

def safe_str(value):
    """Convert any value to string safely."""
    if value is None:
        return ""
    return str(value)

# Function to save articles to CSV
def save_to_csv(articles, filename):
    """
    Saves a list of articles to a CSV file.

    Args:
        articles (list): List of dictionaries containing article data.
        filename (str): Name of the CSV file to save.
    """
    try:
        keys = ['Title', 'Date', 'Link', 'Type', 'Summary', 'Source']  # Added 'Source'
        with open(filename, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=keys, extrasaction='ignore')
            writer.writeheader()
            writer.writerows(articles)
        logging.info(f"Articles saved to {filename}")
    except Exception as e:
        logging.error(f"Error saving to CSV: {e}")
        raise  # Re-raise the exception to handle it in the calling function

def send_combined_email_report(articles, date_str, recipients):
    """
    Sends an email with articles from all sources
    """
    try:
        # Create HTML content with source grouping
        html_content = f"""
        <html>
            <head>
                <style>
                    table {{ 
                        border-collapse: collapse; 
                        width: 100%; 
                        margin-bottom: 30px;
                    }}
                    th, td {{ 
                        padding: 8px; 
                        text-align: left; 
                        border: 1px solid #ddd; 
                    }}
                    th {{ 
                        background-color: #914048; 
                        color: white; 
                    }}
                    .source-header {{
                        background-color: #f5f5f5;
                        padding: 10px;
                        margin: 20px 0 10px 0;
                        font-size: 1.2em;
                        font-weight: bold;
                    }}
                    a {{ 
                        color: #0066cc; 
                        text-decoration: none; 
                    }}
                    a:hover {{ 
                        text-decoration: underline; 
                    }}
                </style>
            </head>
            <body>
                <h2>Weekly AI News Summary {date_str}</h2>
        """

        # Group articles by source
        sources = ['AI News', 'MIT News', 'Stanford News']
        for source in sources:
            source_articles = [a for a in articles if a.get('Source') == source]
            if source_articles:
                html_content += f"""
                    <div class="source-header">{source}</div>
                    <table>
                        <tr>
                            <th>Title</th>
                            <th>Summary</th>
                            <th>Link</th>
                        </tr>
                """
                
                for article in source_articles:
                    html_content += f"""
                        <tr>
                            <td>{safe_str(article.get('Title'))}</td>
                            <td>{safe_str(article.get('Summary'))}</td>
                            <td><a href="{safe_str(article.get('Link'))}">{safe_str(article.get('Link'))}</a></td>
                        </tr>
                    """
                
                html_content += "</table>"

        html_content += """
            </body>
        </html>
        """

        # Create SMTP session and send email
        with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
            server.login(EMAIL, PASSWORD)
            
            # Handle single recipient or list of recipients
            if isinstance(recipients, str):
                recipients = [recipients]
            
            for recipient in recipients:
                try:
                    msg = MIMEMultipart('alternative')
                    msg['Subject'] = f'Weekly AI News Summary {date_str}'
                    msg['From'] = EMAIL
                    msg['To'] = recipient
                    
                    html_part = MIMEText(html_content, 'html', 'utf-8')
                    msg.attach(html_part)
                    
                    server.send_message(msg)
                    logging.info(f"Email sent successfully to {recipient}")
                    
                except Exception as e:
                    logging.error(f"Error sending email to {recipient}: {e}")
                    continue
            
    except Exception as e:
        logging.error(f"Error in email sending process: {e}")
        raise

def process_all_news(recipients, target_date=None):
    """
    Process news from all sources for the past week and send combined email
    """
    try:
        # Set target date range
        if target_date:
            try:
                end_date = datetime.strptime(target_date, '%Y-%m-%d').date()
            except ValueError:
                raise ValueError("Date must be in format 'YYYY-MM-DD'")
        else:
            end_date = datetime.now().date()
            
        # Calculate start date (7 days before end date)
        start_date = end_date - timedelta(days=7)
        date_str = f"{start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
            
        logging.info(f"Processing news for date range: {date_str}")
        
        # Get all articles
        all_articles = scrape_all_sources()
        logging.info(f"Total articles found: {len(all_articles)}")
        
        # Filter and process articles
        processed_articles = []
        for article in all_articles:
            article_date = article.get('Date')
            
            # Check if article date is within the week range
            if article_date and start_date <= article_date <= end_date:
                # Get content and create summary based on source
                if article['Source'] == 'AI News':
                    content = get_article_content(article['Link'])
                elif article['Source'] == 'MIT News':
                    content = get_mit_article_content(article['Link'])
                elif article['Source'] == 'Stanford News':
                    content = get_stanford_article_content(article['Link'])
                else:
                    content = ""

                if content:
                    article['Summary'] = summarize_with_openai(content)
                processed_articles.append(article)
        
        logging.info(f"Found {len(processed_articles)} articles for date range {date_str}")
        
        if processed_articles:
            # Save to CSV
            save_path = f"data/articles_week_{end_date.strftime('%Y-%m-%d')}.csv"
            try:
                save_to_csv(processed_articles, save_path)
                
                # Send email
                send_combined_email_report(processed_articles, date_str, recipients)
                logging.info("Combined email sent successfully!")
            except Exception as e:
                logging.error(f"Error in saving or sending: {e}")
                raise
        else:
            logging.info(f"No articles found for date range: {date_str}")
            
    except Exception as e:
        logging.error(f"Error in process_all_news: {e}")
        raise

In [13]:
process_all_news(RECIPIENT_EMAIL, target_date="2025-01-16")

2025-01-27 15:06:04,533 - INFO - Processing news for date: 2025-01-16


Found 15 MIT articles


2025-01-27 15:06:07,835 - INFO - Total articles found: 59


Found 10 Stanford articles


2025-01-27 15:06:17,578 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-01-27 15:06:28,339 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-01-27 15:06:35,414 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-01-27 15:06:44,788 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-01-27 15:06:44,793 - INFO - Found 4 articles for date 2025-01-16
2025-01-27 15:06:44,795 - INFO - Articles saved to data/combined_articles_2025-01-16.csv
2025-01-27 15:06:47,119 - INFO - Email sent successfully to karina.canziani@gmail.com
2025-01-27 15:06:47,292 - INFO - Combined email sent successfully!
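
Because the three scrapers return dictionaries with slightly different keys, `save_to_csv` depends on two `csv.DictWriter` behaviors: `extrasaction='ignore'` drops keys that are not in `fieldnames`, and missing keys are written as empty cells (the default `restval`). A small in-memory sketch with made-up rows:

```python
import csv
import io

fieldnames = ['Title', 'Date', 'Link', 'Type', 'Summary', 'Source']

# Made-up rows: the first lacks 'Summary', the second lacks 'Type'
# and carries an extra key that is not part of the CSV schema.
rows = [
    {'Title': 'A', 'Date': '2025-01-15', 'Link': 'https://example.com/a',
     'Type': 'featured', 'Source': 'AI News'},
    {'Title': 'B', 'Date': '2025-01-14', 'Link': 'https://example.com/b',
     'Summary': 'Short summary', 'Source': 'MIT News', 'Extra': 'ignored'},
]

buffer = io.StringIO()  # Stands in for the file opened in save_to_csv
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)

csv_lines = buffer.getvalue().splitlines()
for line in csv_lines:
    print(line)
```

Without `extrasaction='ignore'`, the second row would raise a `ValueError` because of its unexpected `Extra` key.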


In [None]:
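if __name__ == "__main__":
    # For manual testing
    process_all_news(RECIPIENT_EMAIL)

    # Create the scheduler; the weekly interval is an assumption based on the
    # "Weekly AI News Summary" report and can be adjusted as needed
    scheduler = BlockingScheduler()
    scheduler.add_job(process_all_news, 'interval', weeks=1, args=[RECIPIENT_EMAIL])

    try:
        logging.info("Starting scheduler...")
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        logging.info("Scheduler stopped")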