# Scrape RSS Newsfeed

In this notebook we scrape an RSS newsfeed that contains the latest articles posted by CNN across its various subdomains, extract the article information, and store the results in a CSV file.

<img src="./img/CNN_scrape.png" alt="CNN scraping" width="800"/>

Each article result will contain:
 - Date
 - Publisher
 - Title
 - Author(s)
 - Article URL
 - Content


## Program Parameters

 - `NEWS_FEEDS` -> The supported newsfeeds, mapped from publisher name to feed URL. More may be added in the future; only CNN is supported for now.

In [10]:
NEWS_FEEDS = {
    'CNN' : 'http://rss.cnn.com/rss/cnn_latest.rss'
    }

## Imports

We use `requests` to fetch pages and `BeautifulSoup` to parse the returned XML and HTML. `datetime` parses date strings into `datetime` objects, and `pandas` is used to export the results. The `'xml'` and `'lxml'` parsers used below also require the `lxml` package to be installed.

In [11]:
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import pandas as pd

## Scraping Functions

In [12]:
def scrape_urls_from_rss(url: str, format: str):
    '''Scrape article URLs from an RSS feed. `url` is the URL of the feed. `format` is the format of the feed.'''
    supported_news_feeds = {'CNN'}
    if(format not in supported_news_feeds):
        raise Exception('Format not recognised.')
    html = requests.get(url)
    xml = BeautifulSoup(html.text, 'xml')
    article_urls = list()
    articles = xml.find_all('item')
    for article in articles:
        article_urls.append(article.find('guid').text)
    return article_urls
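
To see what `scrape_urls_from_rss` pulls out of a feed without making a network request, the sketch below parses a small hand-written RSS document. It uses the standard library's `xml.etree.ElementTree` in place of `BeautifulSoup` so that it runs without extra packages. The sample feed, its example.com URLs, and the helper name `extract_guids` are illustrative assumptions, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed (hypothetical content, not real CNN data)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <guid>https://example.com/2022/10/05/first/index.html</guid>
    </item>
    <item>
      <title>Second story</title>
      <guid>https://example.com/2022/10/05/second/index.html</guid>
    </item>
  </channel>
</rss>"""


def extract_guids(rss_text):
    """Return the text of the <guid> element of every <item> that has one."""
    root = ET.fromstring(rss_text)
    guids = []
    for item in root.iter('item'):
        guid = item.find('guid')
        # Skip items without a guid instead of raising AttributeError
        if guid is not None and guid.text:
            guids.append(guid.text.strip())
    return guids


article_urls = extract_guids(SAMPLE_RSS)
print(article_urls)
```

Unlike the function above, this sketch skips items that lack a `<guid>` rather than failing on them; whether that is the right behaviour depends on the feed.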

In [13]:
def scrape_cnn_article(html_text):
    '''Scrape a CNN article from any subdomain. `html_text` is the parsed BeautifulSoup document.'''
    article = dict()
    
    # Title
    title = html_text.find('h1', class_='pg-headline')
    if(title is None):
        title = html_text.find('h1', class_='headline__text')
    if(title is None):
        title = html_text.find('h1', class_='Article__title')
    if(title is None):
        title = html_text.find('h1', class_='PageHead__title')
    if(title is None):
        title = html_text.find('div', class_='SpecialArticle__headTitle')

    article["Title"] = title.text.strip()

    # Authors
    author_list = ""
    i = 0
    authors = html_text.find('span', class_= 'metadata__byline__author')
    if authors is None:
        authors = html_text.find('div', class_= 'Article__subtitle')
        if authors is None:
            authors = html_text.find_all('span', class_= 'byline__name')
            if authors == []:
                authors = html_text.find_all('span', class_='Authors__writer')
            if authors == []:
                authors = html_text.find_all('span', class_='SpecialArticle__writer')
            for author in authors:
                if(i == 0):
                    author_list += author.text
                else:
                    author_list += ", " + author.text
                i += 1
        else:
            author_list = authors.text[:authors.text.find('•')-6] 
    else:
        authors = authors.text.split(' ')
        print(f'Metadata found: Authors - {authors}')
        if 'CNN' in authors:
            del authors[authors.index('CNN'):]
        if 'By' in authors:
            del authors[:authors.index('By')+1]
        if 'by' in authors:
            del authors[:authors.index('by')+1]
        print(f'Trimmed Author List - {authors}')
        author_list = " ".join(authors)
    if author_list.endswith(','):
        author_list = author_list[:-1]

    article['Authors'] = author_list or 'Anonymous'

    # Content
    content = ""
    lead_text_ps = html_text.find_all('p', class_='zn-body__paragraph speakable')
    if lead_text_ps == []:
        paragraphs = html_text.find_all('p', class_='paragraph inline-placeholder')
        if paragraphs == []:
            paragraphs = html_text.find_all('div', class_='Paragraph__component')
            if paragraphs == []:
                paragraphs = html_text.find_all('div', class_='SpecialArticle__paragraph')
    else:
        for p in lead_text_ps:
            exclude_text = False
            for parent in p.parents:
                # Compare the tag name; a Tag never equals the plain string 'q'
                if parent.name == 'q':
                    exclude_text = True
            if not exclude_text:
                content = p.text[p.text.find(')')+1:].strip()

        paragraphs = html_text.find_all('div', class_='zn-body__paragraph')

    for paragraph in paragraphs:
        content += ' ' + paragraph.text.strip()

    article['Content'] = content.replace('\n', "")

    # Date
    date = None
    updated = html_text.find('p', class_='update-time')
    if updated is None:
        updated = html_text.find('div', class_='timestamp')
        if updated is None:
            updated = html_text.find('div', class_='PageHead__published')
            if updated is None:
                updated = html_text.find('div', class_='SpecialArticle__details')
                if updated is None:
                    updated = html_text.find('div', class_='Article__subtitle')
            # The last three tokens are treated as day-with-suffix, month, year;
            # the two-character suffix is dropped and the day is zero-padded
            date_array = updated.text.strip().split(" ")[-3:]
            date_array[0] = date_array[0][:-2]
            if len(date_array[0]) == 1:
                date_array[0] = '0' + date_array[0]
            date_string = " ".join(date_array)
            date = datetime.strptime(date_string, '%d %B %Y')
        else:
            date_array = updated.text.strip().split(" ")[1:]
            date_string = " ".join(date_array).strip()
            date_string = " ".join(date_string.split(" ")[4:])
    else:
        date_string = updated.text[updated.text.find(')')+2:].strip()

    # Branches that only produced a date string share the '%B %d, %Y' format
    if date is None:
        date = datetime.strptime(date_string, '%B %d, %Y')

    article['Date'] = date

    return article
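
The date handling in `scrape_cnn_article` comes down to two `strptime` formats: `'%d %B %Y'` for text whose day carries a two-letter ordinal suffix, and `'%B %d, %Y'` for month-first text. A minimal sketch of those two paths follows. The helper names and the sample strings are assumptions chosen for illustration, not captured CNN markup.

```python
from datetime import datetime


def parse_ordinal_date(text):
    """Parse text ending in '<day><2-letter suffix> <Month> <year>'."""
    day, month, year = text.strip().split(' ')[-3:]
    day = day[:-2].zfill(2)  # drop 'st'/'nd'/'rd'/'th', then zero-pad
    return datetime.strptime(f'{day} {month} {year}', '%d %B %Y')


def parse_month_first_date(text):
    """Parse text of the form '<Month> <day>, <year>'."""
    return datetime.strptime(text.strip(), '%B %d, %Y')


first = parse_ordinal_date('Published 5th October 2022')
second = parse_month_first_date('October 5, 2022')
print(first.date(), second.date())
```

`%B` matches full month names in the active locale, so both formats assume an English locale.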

In [14]:
def scrape_article_from_url(url: str, format: str):
    '''Scrape a news article from a given URL. `url` is the URL of the article. `format` is the format of the publisher's articles.'''
    supported_formats = {'CNN'}
    if(format not in supported_formats):
        raise Exception('Format not recognised.')
    
    html = requests.get(url)
    html_text = BeautifulSoup(html.text, 'lxml')
    
    # Debug HTML
    # print(html_text)

    if(format == 'CNN'):
        if '/live-news/' not in url:
            article = scrape_cnn_article(html_text)
        else:
            article = None
            print('/live-news/ page skipped.\n')
    elif(format == 'Fox'):
        pass
    elif(format == 'NPR'):
        pass
    else:
        raise Exception('Format not recognised.')

    if article is None:
        return
    else:
        # Publisher
        article["Publisher"] = format
        # URL
        article['URL'] = url
        return article
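
The `if`/`elif` chain above already reserves branches for other publishers. One possible way to keep that extensible is a dictionary that maps each format name to its scraping function, sketched below. The stub `scrape_stub_article`, the `SCRAPERS` table, and `scrape_with_registry` are hypothetical placeholders (the stub stands in for `scrape_cnn_article`) so that the example runs without network access.

```python
def scrape_stub_article(html_text):
    """Hypothetical stand-in for a publisher-specific scraper such as scrape_cnn_article."""
    return {'Title': html_text.strip()}


# Hypothetical registry: format name -> function that builds an article dict
SCRAPERS = {
    'CNN': scrape_stub_article,
}


def scrape_with_registry(html_text, url, format):
    """Pick the scraper for `format` and attach the publisher and URL fields."""
    scraper = SCRAPERS.get(format)
    if scraper is None:
        raise ValueError(f'Format not recognised: {format}')
    if '/live-news/' in url:
        return None  # live-news pages are skipped, as in the notebook
    article = scraper(html_text)
    article['Publisher'] = format
    article['URL'] = url
    return article


article = scrape_with_registry(' Example headline ', 'https://example.com/story/index.html', 'CNN')
print(article)
```

With this layout, supporting another publisher would mean adding one entry to the table rather than another `elif` branch and another entry in each set of supported formats.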

In [15]:
def scrape_articles_from_feed(url: str, format: str):
    '''Scrape news articles from an RSS feed. `url` is the URL of the feed. `format` is the format of the articles.'''
    results = list()

    # Use a separate loop variable so the feed URL parameter is not shadowed
    for article_url in scrape_urls_from_rss(url, format):
        print(f'Scraping URL => {article_url}')
        article = scrape_article_from_url(article_url, format)
        if article is not None:
            print('|\n|==> Scraped: ' + article['Title'] + '\n')
            results.append(article)
    
    return results
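
Because `scrape_cnn_article` depends on specific CSS classes, a single page with unexpected markup raises an exception and stops the whole loop. The sketch below shows one way to record failures and carry on. The `scrape_all` helper, its `scrape_one` argument, and the `fake_scraper` stub are hypothetical stand-ins for the notebook's functions, used so the example runs offline.

```python
def scrape_all(urls, scrape_one):
    """Scrape every URL, collecting articles and the URLs that failed."""
    results, failed = [], []
    for article_url in urls:
        try:
            article = scrape_one(article_url)
        except Exception as error:  # broad on purpose: any page-specific failure is logged
            print(f'Skipping {article_url}: {error!r}')
            failed.append(article_url)
            continue
        if article is not None:
            results.append(article)
    return results, failed


def fake_scraper(article_url):
    """Hypothetical scraper: fails on one URL and skips another."""
    if 'broken' in article_url:
        raise AttributeError("'NoneType' object has no attribute 'text'")
    if '/live-news/' in article_url:
        return None
    return {'Title': 'Example', 'URL': article_url}


urls = [
    'https://example.com/a/index.html',
    'https://example.com/broken/index.html',
    'https://example.com/live-news/b/index.html',
]
results, failed = scrape_all(urls, fake_scraper)
print(len(results), len(failed))
```

Keeping the failed URLs makes it easier to inspect which page layouts the selectors do not yet cover.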
    

## Scraping Newsfeed

In [16]:
## Test scrape Newsfeed
source = 'CNN'
results = scrape_articles_from_feed(url=NEWS_FEEDS[source], format=source)

Scraping URL => https://www.cnn.com/2022/10/05/world/spacex-nasa-crew-5-astronaut-launch-scn/index.html
|
|==> Scraped: SpaceX, NASA launch 3 astronauts and 1 cosmonaut to the ISS. Here’s everything you need to know

Scraping URL => https://www.cnn.com/2022/10/05/football/alex-ferguson-jose-mourinho-english-dictionary-spt-intl/index.html
|
|==> Scraped: Soccer lexicon: ‘Squeaky bum time’ and ‘park the bus’ added to Oxford English Dictionary

Scraping URL => https://www.cnn.com/2022/10/05/uk/liz-truss-greenpeace-conservative-conference-gbr-intl/index.html
|
|==> Scraped: Greenpeace campaigners disrupt Liz Truss’s party conference speech

Scraping URL => https://www.cnn.com/2022/10/05/us/hurricane-ian-florida-recovery-wednesday/index.html
|
|==> Scraped: Sanibel Island residents return to see if their homes survived devastating Hurricane Ian

Scraping URL => https://www.cnn.com/2022/10/05/us/california-family-missing-wednesday/index.html
|
|==> Scraped: Search continues for abducted Cali

In [17]:
# Create dataframe
df = pd.DataFrame.from_dict(results)
df = df.reindex(columns=[
        'Date',
        'Publisher',
        'Title',
        'Authors',
        'URL',
        'Content',
    ])
df.head(5)

| | Date | Publisher | Title | Authors | URL | Content |
|---|---|---|---|---|---|---|
| 0 | 2022-10-05 | CNN | SpaceX, NASA launch 3 astronauts and 1 cosmona... | Jackie Wattles | https://www.cnn.com/2022/10/05/world/spacex-na... | SpaceX and NASA launched a crew of astronauts... |
| 1 | 2022-10-05 | CNN | Soccer lexicon: ‘Squeaky bum time’ and ‘park t... | Alasdair Howorth | https://www.cnn.com/2022/10/05/football/alex-f... | Soccer’s lexicon is a rich reservoir of often... |
| 2 | 2022-10-05 | CNN | Greenpeace campaigners disrupt Liz Truss’s par... | Peter Wilkinson, Chris Liakos | https://www.cnn.com/2022/10/05/uk/liz-truss-gr... | Protesters from the environmental group Green... |
| 3 | 2022-10-05 | CNN | Sanibel Island residents return to see if thei... | Nouran Salahieh, Dakin Andone | https://www.cnn.com/2022/10/05/us/hurricane-ia... | Residents of Florida’s Sanibel Island are war... |
| 4 | 2022-10-05 | CNN | Search continues for abducted California famil... | Aya Elamroussi, Natasha Chen, Jack Hannah | https://www.cnn.com/2022/10/05/us/california-f... | The search for a family of four kidnapped in ... |


## Output Results

In [18]:
timestamp_string = str(int(datetime.now().timestamp()))
df.to_csv(path_or_buf=f'outputs/{source}_{timestamp_string}.csv', index=False)
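
`DataFrame.to_csv` does not create missing directories, so the cell above fails if `outputs/` does not exist. Below is a minimal sketch of building the timestamped path and creating the folder first, using only the standard library. The helper name `build_output_path` is an assumption, and the demonstration writes into a temporary directory to avoid touching the working directory.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def build_output_path(source, output_dir='outputs'):
    """Return '<output_dir>/<source>_<unix timestamp>.csv', creating the directory if needed."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    timestamp_string = str(int(datetime.now().timestamp()))
    return directory / f'{source}_{timestamp_string}.csv'


# Demonstrate inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    output_path = build_output_path('CNN', output_dir=Path(tmp) / 'outputs')
    directory_created = output_path.parent.is_dir()

print(output_path.name, directory_created)
```

In the notebook, the returned path could then be passed on as `df.to_csv(path_or_buf=build_output_path(source), index=False)`.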