# <u>Web Scraping</u>

## <u>Part \#4: Getting XML from RSS Feeds</u>

### <u>What is an RSS feed?</u>

RSS stands for "Really Simple Syndication." An RSS feed is a page of XML data that a site updates frequently and that programs can process automatically.
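
To make "processed in an automated way" concrete, here is a minimal sketch. The feed text, site name, and URLs are made up for illustration, and the sketch uses Python's built-in `xml.etree.ElementTree` module rather than Beautiful Soup so that it runs without a network connection or extra packages.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document (not a real feed).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)      # Parse the text into a tree; root is <rss>
channel = root.find("channel")         # The single <channel> holds everything else
feed_title = channel.findtext("title") # Text of the channel's own <title>
items = channel.findall("item")        # One <item> per story

print(feed_title)
for item in items:
    print(item.findtext("title"), "->", item.findtext("link"))
```

The point is only the shape: one `<channel>` describing the feed, followed by repeated `<item>` elements that all share the same tags.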

### <u>Exploring Some RSS Feeds</u>

Many organizations have RSS feeds. The links below point to some of these feeds, or to pages that list them. Spend a few minutes exploring some prominent examples using Google Chrome.

*Note: If the RSS feed you are looking at is a collection of HTML-style tags, you are in the right place. If not, right-click and select "View page source."* 

Okay, time to explore:

* Local News:
  * __Chicago Tribune__: https://www.chicagotribune.com/about/ct-chicago-tribune-rss-feeds-htmlstory.html
  * __The Chicago Sun Times__: https://blog.feedspot.com/chicago_sun_times_rss_feeds/
  * __Block Club Chicago__: https://blockclubchicago.org/feed


* National/International News: 
  * __Reuters__: https://www.reutersagency.com/en/reutersbest/reuters-best-rss-feeds/
  * __USA Today__: http://rssfeeds.usatoday.com/usatoday-NewsTopStories
  * __The New York Times__: https://www.nytimes.com/rss
  * __BBC News__: http://www.bbc.com/news/10628494


* Technology News: 
  * __Wired.com__: https://www.wired.com/about/rss-feeds/
  * __Ars Technica__: https://arstechnica.com/rss-feeds/
  * __CNET__: https://www.cnet.com/rss/
  
  
* Miscellaneous (Sports, Government, Science):  
  * __ESPN__: http://www.espn.com/espn/news/story?page=rssinfo
  * __City of Chicago__: https://www.chicago.gov/city/en/rss.html
  * __US Congress__: https://www.congress.gov/rss
  * __NASA__: https://www.nasa.gov/content/nasa-rss-feeds


**<u>Task \#1:</u>** After you explore a few of the feeds above, try to find an RSS feed for another website you are interested in. This may be a news website for a certain type of news you like to follow (video games, style/fashion, politics, etc.). Then fill in the information below for the feed (or feeds) you found:

Organization(s): Reddit Meme Subreddit

URL(s): https://www.reddit.com/r/meme.rss

Description(s): Memes

#### <u>An RSS Feed from the Wall Street Journal</u>

The beauty of an RSS feed is that its content is updated regularly, but the structure of its tags always stays the same. This allows you to extract up-to-date data in an automated fashion. 

For example, here are the top stories from the technology section (WSJD) of _The Wall Street Journal_, taken from its RSS feed: https://feeds.a.dj.com/rss/RSSWSJD.xml

After looking at this link in Chrome, explore it using Beautiful Soup:

In [2]:
from bs4 import BeautifulSoup       # Parser for HTML and XML documents
from urllib.request import urlopen  # Opens a URL and returns a file-like response

xml_page = urlopen("https://feeds.a.dj.com/rss/RSSWSJD.xml")   # Request the feed
bs_obj = BeautifulSoup(xml_page, 'xml')  # Parse the response as XML (the 'xml' parser requires lxml to be installed)

print(bs_obj.prettify()[:1000])  # Print the first 1,000 characters of the indented, more readable ("pretty") XML

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dj="http://dowjones.net/rss/" xmlns:wsj="http://dowjones.net/rss/">
 <channel>
  <title>
   WSJ.com: WSJD
  </title>
  <link>
   http://online.wsj.com
  </link>
  <atom:link href="http://online.wsj.com" rel="self" type="application/rss+xml"/>
  <description>
   WSJD - Technology
  </description>
  <language>
   en-us
  </language>
  <pubDate>
   Mon, 23 May 2022 21:23:12 -0400
  </pubDate>
  <lastBuildDate>
   Mon, 23 May 2022 21:23:12 -0400
  </lastBuildDate>
  <copyright>
   Dow Jones &amp; Company, Inc.
  </copyright>
  <generator>
   http://online.wsj.com
  </generator>
  <docs>
   http://cyber.law.harvard.edu/rss/rss.html
  </docs>
  <image>
   <title>
    WSJ.com: WSJD
   </title>
   <link>
    http://online.wsj.com
   </link>
   <url>
    http://online.wsj.com/img/wsj_sm_logo.gif
   </url>
  </image>
  <item>
   <title>

Now you can get a list of all headlines:

In [3]:
headlines = bs_obj.find_all('title') # Extracts and creates a list of all the <title> tags

print(headlines)

[<title>WSJ.com: WSJD</title>, <title>WSJ.com: WSJD</title>, <title>GameStop Launches Digital Wallet for NFTs, Cryptocurrencies</title>, <title>Broadcom in Advanced Talks to Buy VMware</title>, <title>Apple Looks to Boost Production Outside China</title>, <title>Zoom Sales Growth Slows as Pandemic Boom Wanes</title>, <title>Buffalo Shooting Tests Internet Antiterrorism Accord</title>, <title>Didi Says It Will Proceed With Delisting From NYSE</title>, <title>Tinder Owner Match Group, Google Reach Deal on App Store Payment Rules</title>, <title>Hello? Hello? Is This Facebook?


If you want to strip out the `<title>` tags, use the *.getText()* method:

In [4]:
headlines = [story.getText() for story in headlines] # Keep only the text inside each <title> tag

print(headlines)
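
Feed text sometimes carries stray tabs and line breaks, as the tail of the printed list of `<title>` tags above suggests. A small helper like the one sketched below can tidy each string after `.getText()`. The name `clean_text` and the sample string are made up for illustration; the helper relies only on built-in string methods.

```python
def clean_text(text):
    """Collapse every run of spaces, tabs, and newlines into a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces,
    # so joining the pieces with one space also trims both ends.
    return " ".join(text.split())

# A made-up title containing the kind of whitespace a feed might include.
raw_title = "Hello? Hello?\n\t\t\n\t\t\t Is anyone there?\n"
tidy_title = clean_text(raw_title)

print(repr(raw_title))
print(repr(tidy_title))
```

Applied to the notebook's data, this could look like `headlines = [clean_text(h) for h in headlines]`.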



In this feed, the first two titles belong to the feed itself (the channel and its logo image) rather than to news stories. Slicing them off is an easy fix:

In [5]:
headlines = headlines[2:]

print(headlines)



Let's do the same thing with the links to these stories by creating a list of all links:

In [6]:
urls = bs_obj.find_all('link')            # Collect every <link> tag in the document
urls = [link.getText() for link in urls]  # Keep only the text inside each <link> tag
print(urls)



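This time, the story links start at index 3, the fourth entry. The second entry is an empty string, most likely because the self-closing `<atom:link>` tag has no text, and the third is the link inside `<image>`: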

In [7]:
urls = urls[3:]

print(urls)
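
Slicing with `headlines[2:]` and `urls[3:]` depends on exactly how many feed-level tags come before the stories. A sturdier alternative, sketched below, is to walk the `<item>` elements and read the title and link from inside each one, so every headline stays paired with its own link. The sample XML is made up to mimic the layout shown earlier, and the sketch uses the built-in `xml.etree.ElementTree` module so it runs offline.

```python
import xml.etree.ElementTree as ET

# Made-up feed with the same layout as above: channel info, a logo image, then stories.
SAMPLE = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example Site</title>
    <link>https://example.com</link>
    <atom:link href="https://example.com" rel="self" type="application/rss+xml"/>
    <image>
      <title>Example Site</title>
      <link>https://example.com</link>
    </image>
    <item>
      <title>Story A</title>
      <link>https://example.com/a</link>
    </item>
    <item>
      <title>Story B</title>
    </item>
    <item>
      <title>Story C</title>
      <link>https://example.com/c</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE)

stories = []  # List of (headline, link) tuples, one per complete story
for item in root.iter("item"):
    title = item.findtext("title")  # None if the item has no <title>
    link = item.findtext("link")    # None if the item has no <link>
    if title and link:              # Skip stories that are missing either piece
        stories.append((title.strip(), link.strip()))

print(stories)
```

With Beautiful Soup the same idea would be to loop over `bs_obj.find_all('item')` and call `.find('title')` and `.find('link')` on each item.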



**<u>Task \#2:</u>** Write a function, *random_headline(headline_list, link_list)*, that accepts a list of headlines and a list of links as input and returns an output string in the format "HEADLINE, read more at LINK."

*Note: Be sure to test your function to make sure it works as expected, and show the results of your tests below. Keep in mind that not every headline may have a link, so your two lists may not be "parallel" (the headline at a given index may not match the link at that index). Try re-running the cells for the most up-to-date listings!*

#### <u>HINT:</u>
* Import the `random` module to generate random numbers
* Create a function named random_headline(headline_list, link_list)
    * Generate a random number between 0 and one less than the length of headline_list
    * Create a variable to hold the value of headline_list at the index of the randomly generated number
    * Create a variable to hold the value of link_list at the index of the randomly generated number
    * Create an output string in the format "HEADLINE, read more at LINK" replacing HEADLINE and LINK with your variables
    * Return the output string
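
The steps above could be sketched roughly as follows. Treat it as one possible starting point rather than the expected answer: the length check, the error message, and the sample data are assumptions added for illustration, and the output follows the hint's format without a trailing period.

```python
import random

def random_headline(headline_list, link_list):
    """Return a randomly chosen story as 'HEADLINE, read more at LINK'."""
    # Assumption: headline_list[i] and link_list[i] describe the same story.
    # Only indexes that exist in both lists are used, in case the lengths differ.
    usable = min(len(headline_list), len(link_list))
    if usable == 0:
        raise ValueError("need at least one headline and one link")
    index = random.randrange(usable)  # Random integer with 0 <= index < usable
    headline = headline_list[index]
    link = link_list[index]
    return f"{headline}, read more at {link}"

# Made-up sample data for a quick test.
sample_headlines = ["Story A", "Story B", "Story C"]
sample_links = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

result = random_headline(sample_headlines, sample_links)
print(result)
```

Because the result is random, a test can check that it is one of the valid combinations rather than one specific string.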

In [None]:
# Your code here


### <u>Processing Other RSS Feeds</u>

You already perused a few RSS feeds. This time, pick one of those feeds (or a new one) and explore it by writing code. As a reminder, here are the feeds listed above: 

* Local News:
  * __Chicago Tribune__: https://www.chicagotribune.com/about/ct-chicago-tribune-rss-feeds-htmlstory.html
  * __The Chicago Sun Times__: https://blog.feedspot.com/chicago_sun_times_rss_feeds/
  * __Block Club Chicago__: https://blockclubchicago.org/feed


* National/International News: 
  * __Reuters__: https://www.reutersagency.com/en/reutersbest/reuters-best-rss-feeds/
  * __USA Today__: http://rssfeeds.usatoday.com/usatoday-NewsTopStories
  * __The New York Times__: https://www.nytimes.com/rss
  * __BBC News__: http://www.bbc.com/news/10628494


* Technology News: 
  * __Wired.com__: https://www.wired.com/about/rss-feeds/
  * __Ars Technica__: https://arstechnica.com/rss-feeds/
  * __CNET__: https://www.cnet.com/rss/
  
  
* Miscellaneous (Sports, Government, Science):  
  * __ESPN__: http://www.espn.com/espn/news/story?page=rssinfo
  * __City of Chicago__: https://www.chicago.gov/city/en/rss.html
  * __US Congress__: https://www.congress.gov/rss
  * __NASA__: https://www.nasa.gov/content/nasa-rss-feeds

**<u>Task \#3:</u>** Pick one or more of the feeds above (or find your own). Experiment using Python code, and show the results of your experimentation below. 

**<u>Note:</u>** This question is fairly open-ended, but at a minimum you must do the following: 

* Create a "random headline"-style function like you did above. You should not expect code such as *headlines = headlines[2:]* or *urls = urls[3:]* to fit perfectly with your data since these were modifications that may be specific to the way *The Wall Street Journal*'s RSS feed is organized.

* You must show that you engaged with the feed(s) you picked using the Beautiful Soup module and at least one Python data structure (probably lists). You will need to analyze the XML for the feed you pick and write your code to fit with the data. 
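
One way to start analyzing an unfamiliar feed is to count how often each tag appears; tags that repeat once per story are usually the ones worth extracting. The sketch below does this on a made-up sample with the built-in `xml.etree.ElementTree` and `collections.Counter`. The function name `tag_counts` and the sample feed are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# A made-up feed used only to demonstrate the counting.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com</link>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def tag_counts(xml_text):
    """Count how many times each tag name appears in an XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() visits the root element and every element nested below it.
    return Counter(element.tag for element in root.iter())

counts = tag_counts(SAMPLE_FEED)
for tag, count in counts.most_common():
    print(tag, count)
```

Here `title` and `link` appear once more than `item`, which hints that one of each belongs to the feed itself. Note that ElementTree reports namespaced tags in the form `{namespace-uri}name`, and that some feeds use the Atom format, where stories live in `<entry>` elements rather than `<item>`.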

In [None]:
# Your code here
