# NPR News Web Scraper

This notebook demonstrates how to scrape NPR news articles through the Decodo API and parse the returned HTML with BeautifulSoup.

## Setup and Dependencies

First, we'll install the required Python packages:

In [None]:
%pip install python-dotenv requests beautifulsoup4 pandas

Now we'll import the necessary libraries for web scraping, HTML parsing, and environment variable management:

In [1]:
import os
import json
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import pandas as pd


## Environment Configuration

Load environment variables from a `.env` file so that the Decodo API authentication token stays out of the notebook itself:

In [2]:
load_dotenv()
DECODO_AUTH = os.getenv('DECODO_AUTH_FIELD')

`DECODO_AUTH` now holds the token read from the `DECODO_AUTH_FIELD` variable; it is sent as the `authorization` header on every Decodo request.
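If the `.env` entry is missing, `os.getenv` returns `None`, and the problem would likely only surface later when the API rejects the request. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` that is not part of the original notebook:

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; the notebook itself calls os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value
```

With this in place, the assignment above could be written as `DECODO_AUTH = require_env('DECODO_AUTH_FIELD')` after `load_dotenv()` has run.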

## Web Scraping Function

Define a function that sends a URL to the Decodo scraping API and returns the raw HTTP response. The scraped content is extracted from the response body in later cells:

In [3]:
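import requests

def crwal_url(url_to_crawl):
  """Fetch a page through the Decodo scraping API and return the raw response."""
  url = "https://scraper-api.decodo.com/v2/scrape"

  # Target page that Decodo fetches on our behalf
  payload = {
      "url": url_to_crawl
  }

  # The token loaded from .env is passed in the authorization header
  headers = {
      "accept": "application/json",
      "content-type": "application/json",
      "authorization": DECODO_AUTH
  }

  response = requests.post(url, json=payload, headers=headers)

  return response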

## Target URLs Configuration

Define the NPR news category URLs that we want to scrape. Each category has its own endpoint:

In [4]:
urls_to_crawl = {
  "Politics": "https://www.npr.org/get/1014/render/partial/next",
  "business": "https://www.npr.org/get/1006/render/partial/next",
  "Health": "https://www.npr.org/get/1128/render/partial/next",
  "Science": "https://www.npr.org/get/1007/render/partial/next",
  "Climate": "https://www.npr.org/get/1167/render/partial/next",
}

## Scraping Politics Articles

Test the scraper by crawling the Politics section with pagination parameters (start index and batch size):

In [5]:
category_url = urls_to_crawl["Politics"]
start_index = 1
batch_size = 10
crawled_url = crwal_url(f"{category_url}?start={start_index}&count={batch_size}")
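The paginated URL can also be assembled with the standard library rather than an f-string. The sketch below assumes a hypothetical helper named `build_page_url`; the `start` and `count` parameter names are taken from the cell above.

```python
from urllib.parse import urlencode

def build_page_url(category_url, start_index=1, batch_size=10):
    """Append the `start` and `count` query parameters to a category URL.

    Hypothetical helper; the parameter names mirror the f-string used in the notebook.
    """
    query = urlencode({"start": start_index, "count": batch_size})
    return f"{category_url}?{query}"

# Second page of the Politics listing when paging in batches of 10
page_url = build_page_url("https://www.npr.org/get/1014/render/partial/next", 11, 10)
print(page_url)
```

Using `urlencode` keeps the query string correctly escaped if more parameters are added later.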


## Processing the Response

Parse the JSON response from the Decodo API to extract the scraped content:

In [None]:
crawled_url_json = json.loads(crawled_url.text)
crawled_url_json['results']

## Extracting HTML Content

Get the HTML content from the first result in the scraped data:

In [None]:
html_string = crawled_url_json['results'][0]['content']
html_string
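Indexing straight into `results[0]` raises an exception if the list is empty or the fetch failed. A minimal sketch of a more defensive lookup, assuming the response shape used in this notebook (a `results` list whose entries carry `status_code` and `content`); the helper name `first_result_content` is hypothetical:

```python
def first_result_content(response_json):
    """Return the HTML of the first result, or None when it is missing or not a 200.

    Assumes the response shape used in this notebook:
    {"results": [{"status_code": ..., "content": ...}]}
    """
    results = response_json.get("results") or []
    if not results:
        return None
    first = results[0]
    if first.get("status_code") != 200:
        return None
    return first.get("content")

# Invented payloads for illustration, not real Decodo output
ok_payload = {"results": [{"status_code": 200, "content": "<article></article>"}]}
failed_payload = {"results": [{"status_code": 404, "content": ""}]}

print(first_result_content(ok_payload))
print(first_result_content(failed_payload))
```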

## HTML Parsing with BeautifulSoup

Parse the HTML content and extract article information. Here we take the first `article` element and read the `href` of its first anchor tag:

In [None]:
soup = BeautifulSoup(html_string, 'html.parser')
# Only the first <article> is inspected; the loop exits after one iteration
for article in soup.find_all('article'):
  anchor_tag = article.find('a')
  article_url = anchor_tag['href']
  break
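The same idea extends to every article on the page. The following is a sketch only: the helper name `extract_article_urls` is hypothetical and the sample markup is made up, so the real NPR structure may differ.

```python
from bs4 import BeautifulSoup

def extract_article_urls(html_string):
    """Collect one link per <article> element, skipping articles without a link."""
    soup = BeautifulSoup(html_string, "html.parser")
    urls = []
    for article in soup.find_all("article"):
        # href=True matches only anchors that actually carry an href attribute
        anchor_tag = article.find("a", href=True)
        if anchor_tag is not None:
            urls.append(anchor_tag["href"])
    return urls

# Made-up markup for illustration, not real NPR output
sample_html = """
<article><a href="https://example.com/story-1">One</a></article>
<article><p>No link here</p></article>
<article><a href="https://example.com/story-2">Two</a></article>
"""
print(extract_article_urls(sample_html))
```

Checking for a missing anchor avoids the `TypeError` that `anchor_tag['href']` would raise when `find` returns `None`.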

## Article Text Extraction Function

Define a function to extract the text content of an individual article. It fetches the article page through the Decodo API, locates the `storytext` container, and returns its text, or `None` if any step fails:

In [10]:
def get_article_text(article_url):
  try:
    crawled_article = crwal_url(article_url)
    crawled_article_json = json.loads(crawled_article.text)
    # Skip articles that Decodo could not fetch successfully
    if crawled_article_json['results'][0]["status_code"] != 200:
      return None

    html_string = crawled_article_json['results'][0]['content']
    soup = BeautifulSoup(html_string, 'html.parser')
    # The article body lives in <div id="storytext">
    story_div = soup.find('div', id='storytext')
    if story_div is None:
      return None

    article_text = story_div.get_text(strip=True, separator='\n')

    return article_text
  except (requests.RequestException, ValueError, KeyError, IndexError):
    # Network failure, invalid JSON, or an unexpected response shape
    return None

## Article Iterator Function

Create a generator function that iterates through all articles in a category, requesting one page of `batch_size` links at a time:

In [11]:
def get_next_article(category_url, batch_size=10):
  start_index = 1
  while True:
    crawled_page = crwal_url(f"{category_url}?start={start_index}&count={batch_size}")
    crawled_page_json = json.loads(crawled_page.text)

    # Stop when the listing page could not be fetched
    if crawled_page_json['results'][0]['status_code'] != 200:
      break

    html_string = crawled_page_json['results'][0]['content']
    soup = BeautifulSoup(html_string, 'html.parser')

    articles = soup.find_all('article')
    # Stop when a page contains no articles; otherwise the loop would never end
    if not articles:
      break

    for article in articles:
      anchor_tag = article.find('a')
      if anchor_tag is None:
        continue
      article_url = anchor_tag['href']
      article_text = get_article_text(article_url)
      if article_text is None:
        continue

      yield article_text

    # Advance to the next page of results
    start_index += batch_size
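The pagination pattern can be tried without any network access. The sketch below replaces the Decodo request and HTML parsing with a hypothetical `fetch_page(start, count)` callable and fake data, so it illustrates the control flow rather than the notebook's actual function:

```python
from itertools import islice

def paginate(fetch_page, batch_size=10, start_index=1):
    """Yield items page by page until fetch_page returns an empty list.

    fetch_page(start, count) is a hypothetical callable standing in for the
    Decodo request plus HTML parsing done in the notebook.
    """
    while True:
        items = fetch_page(start_index, batch_size)
        if not items:
            break
        yield from items
        start_index += batch_size

# Fake data source with 25 items; start is 1-based, as in the notebook
ALL_ITEMS = [f"article-{i}" for i in range(1, 26)]

def fake_fetch(start, count):
    return ALL_ITEMS[start - 1:start - 1 + count]

# islice stops the generator early, so only the first page is requested
first_five = list(islice(paginate(fake_fetch), 5))
print(first_five)
```

Because the generator is lazy, a consumer that stops after a few items does not trigger requests for later pages.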

## Main Scraping Loop

Execute the main scraping process across all news categories, collecting up to five articles per category:

**TODO: When you hit the API limit, create a new DECODO account and re-run these cells**

In [12]:
data = []
for news_category, category_url in urls_to_crawl.items():
  print(f"Crawling {news_category}")
  article_crawled_num = 0
  for article_text in get_next_article(category_url):
    data.append({'news_categoty' : news_category, 'article' : article_text})
    article_crawled_num += 1
    print(f"Crawled {article_crawled_num} articles")
    if article_crawled_num >= 5:
      break

Crawling Politics
Crawled 1 articles
Crawled 2 articles
Crawled 3 articles
Crawled 4 articles
Crawled 5 articles
Crawling business
Crawled 1 articles
Crawled 2 articles
Crawled 3 articles
Crawled 4 articles
Crawled 5 articles
Crawling Health
Crawled 1 articles
Crawled 2 articles
Crawled 3 articles
Crawled 4 articles
Crawled 5 articles
Crawling Science
Crawled 1 articles
Crawled 2 articles
Crawled 3 articles
Crawled 4 articles
Crawled 5 articles
Crawling Climate
Crawled 1 articles
Crawled 2 articles
Crawled 3 articles
Crawled 4 articles
Crawled 5 articles

## Save Data to CSV

Convert the collected article data into a pandas DataFrame and save it as a CSV file:

In [13]:
df = pd.DataFrame(data)
df.to_csv('news_articles_Dataset.csv', index=False)

## Data Preview

Load and display a sample of the scraped data to verify the results:

In [17]:
head_csv = pd.read_csv('news_articles_Dataset.csv')
head_csv.sample(5)

| | news_categoty | article |
|---|---|---|
| 13 | Health | Florida's Surgeon General Dr. Joseph Ladapo at... |
| 20 | Climate | Energy Secretary Chris Wright spearheaded a re... |
| 0 | Politics | Sen. Elizabeth Warren, D-Mass., speaks during ... |
| 19 | Science | Three scientists learned they carry genes that... |
| 7 | business | "Vanny," as this 2005 Chrysler Town & Country ... |
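Beyond eyeballing a random sample, a couple of quick checks can confirm that every category was collected and that no article body is empty. This is a sketch on invented rows held in memory; with the real file, `pd.read_csv('news_articles_Dataset.csv')` would replace the `StringIO` input. The column names match those written by the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for news_articles_Dataset.csv; the rows are invented for illustration
csv_text = """news_categoty,article
Politics,First story
Politics,Second story
Health,Third story
"""

df_check = pd.read_csv(io.StringIO(csv_text))

# Number of articles collected per category
counts = df_check["news_categoty"].value_counts()
print(counts.to_dict())

# Number of rows whose article text is missing or blank
is_blank = df_check["article"].isna() | (df_check["article"].str.strip() == "")
empty_rows = int(is_blank.sum())
print(empty_rows)
```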