# Web Scraping

## Part \#4: Getting XML from RSS Feeds

### What is an RSS feed?

RSS stands for "Really Simple Syndication." An RSS feed is a page of data in XML format that is updated frequently and has a predictable structure, so a program can process it automatically.
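
To make this concrete, here is a minimal sketch of what a feed looks like and how a program might read it. The feed text is made up for illustration (it does not come from a real site), and the sketch uses Python's built-in `xml.etree.ElementTree` module rather than Beautiful Soup so that it runs without a network connection or extra packages.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document (hypothetical data, for illustration only)
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)        # Parse the XML text into a tree
channel = root.find("channel")           # <rss> contains one <channel>
feed_name = channel.find("title").text   # The feed's own title

# Each story lives in its own <item> element inside the channel
story_titles = [item.find("title").text for item in channel.findall("item")]

print(feed_name)
print(story_titles)
```

Real feeds usually carry more tags per story (such as a description and a publication date), but the nesting of `<rss>`, `<channel>`, and `<item>` is the part that stays predictable.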

### Exploring Some RSS Feeds

Many organizations have RSS feeds. Some links are provided below that will allow you to find some of these feeds. Spend a few minutes exploring some prominent examples using Google Chrome. 

*Note: If the RSS feed you are looking at is a collection of HTML-style tags, you are in the right place. If not, right-click and select "View page source."* 

Okay, time to explore:

* Local News:
  * __Chicago Tribune__: http://www.chicagotribune.com/cs-rssfeeds-htmlstory.html
  * __The Daily Herald__: http://www.dailyherald.com/rss/
  * __The Chicago Sun Times__: http://www.thesuntimes.com/section/feed


* National/International News: 
  * __Reuters__: https://www.reuters.com/tools/rss
  * __USA Today__: https://www.usatoday.com/rss/
  * __The New York Times__: http://www.nytimes.com/services/xml/rss/index.html
  * __BBC News__: http://www.bbc.com/news/10628494


* Technology News: 
  * __Wired.com__: https://www.wired.com/about/rss_feeds/
  * __Ars Technica__: https://arstechnica.com/rss-feeds/
  * __CNET__: https://www.cnet.com/rss/
  
  
* Miscellaneous (Sports, Government, Science):  
  * __ESPN__: http://www.espn.com/espn/news/story?page=rssinfo
  * __Illinois Commerce Commission__: https://www.icc.illinois.gov/rss/
  * __US Congress__: https://www.congress.gov/rss
  * __NASA__: https://www.nasa.gov/content/nasa-rss-feeds


**Task \#1:** After you explore a few of the feeds above, try to find an RSS feed for another website you are interested in. This may be a news website for a certain type of news you like to follow (video games, style/fashion, politics, etc). Then fill in the information below for the feed (or feeds) you found:

Organization(s): Buzzfeed

URL(s): https://www.buzzfeed.com/rss

Description(s): The page has three sections. The first covers the main sections of the website, such as the homepage, quizzes, and other main sections of Buzzfeed. The second acts more as a topic finder and lists topics Buzzfeed frequently talks about. The last one is for miscellaneous items that have their own subjects within Buzzfeed but aren't as prominent.


#### An RSS Feed from the Wall Street Journal

The beauty of an RSS feed is that its content is updated regularly, but the structure of its tags always stays the same. This allows you to extract up-to-date data in an automated fashion. 

For example, here are the top stories from the technology section (WSJD) of _The Wall Street Journal_, taken from their RSS feed: https://feeds.a.dj.com/rss/RSSWSJD.xml

After looking at this link in Chrome, explore it using Beautiful Soup:

In [1]:
from bs4 import BeautifulSoup       # Import BeautifulSoup
from urllib.request import urlopen  # Import urlopen

xml_page = urlopen("https://feeds.a.dj.com/rss/RSSWSJD.xml")   # Opens whatever page we are requesting
bs_obj = BeautifulSoup(xml_page, 'xml')  # Extract xml data

print(bs_obj.prettify())  # Prints the XML with indentation so it is easier to read

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dj="http://dowjones.net/rss/" xmlns:wsj="http://dowjones.net/rss/">
 <channel>
  <title>
   WSJ.com: WSJD
  </title>
  <link>
   http://online.wsj.com
  </link>
  <atom:link href="http://online.wsj.com" rel="self" type="application/rss+xml"/>
  <description>
   WSJD - Technology
  </description>
  <language>
   en-us
  </language>
  <pubDate>
   Sat, 09 May 2020 14:09:58 -0400
  </pubDate>
  <lastBuildDate>
   Sat, 09 May 2020 14:09:58 -0400
  </lastBuildDate>
  <copyright>
   Dow Jones &amp; Company, Inc.
  </copyright>
  <generator>
   http://online.wsj.com
  </generator>
  <docs>
   http://cyber.law.harvard.edu/rss/rss.html
  </docs>
  <image>
   <title>
    WSJ.com: WSJD
   </title>
   <link>
    http://online.wsj.com
   </link>
   <url>
    http://online.wsj.com/img/wsj_sm_logo.gif
   </url>
  </image>
  <item>
   <title>

Now you can get a list of all headlines:

In [2]:
headlines = bs_obj.find_all('title') # Extracts and creates a list of all the <title> tags

print(headlines)

[<title>WSJ.com: WSJD</title>, <title>WSJ.com: WSJD</title>, <title>Amazon, Berkshire, JPMorgan Health-Care Venture Looking for New CEO</title>, <title>Uber Redraws Road Map to Profit in Wake of Weakened Ridership</title>, <title>LiveXLive to Join Podcast Fray With PodcastOne Acquisition</title>, <title>Google Parent Alphabet Drops Controversial 'Smart City' Project</title>, <title>The Trouble With Coronavirus Contact-Tracing Apps</title>, <title>Elon Musk, Tech's Cash-Poor Billionaire</title>, <title>Nintendo Scores on 'Animal Crossing' Sales Thanks to Coronavirus Lockdowns</title>, <title>Vista Follows Facebook Into India's Jio With $1.5 Billion Digital Bet</title>, <title>Spending Too Much Time Online? Here Are Tips for Unplugging.</title>, <title>Facebook-Backed Libra Project Gets New CEO</title>, <title>China's WeChat Monitors Foreign Users to Refine Censorship at Home</title>, <title>Grandma to the Rescue! Parents Get Virtual Help</title>, <title>Airbnb to Cut 25% of Workforce</t

If you want to strip out the `<title>` tags, use the *.getText()* method:

In [69]:
headlines = [story.getText() for story in headlines] # Creates a list of innerHTML for each <title> tag in headlines list

print(headlines)

AttributeError: 'str' object has no attribute 'getText'

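# Hello Mr. Nichols! I hope you are well. Although the cell above shows an error, the code did do its job: in the cells below, you can see that the `<title>` tags have been removed. An error like this typically appears when the cell is run a second time, after `headlines` has already been converted to a list of strings, because strings have no `.getText()` method. My computer has also been acting up, so sorry for any trouble this may cause.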
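
The `AttributeError` shown above is what Python raises when `.getText()` is called on plain strings. One way to make such a cell safe to re-run is to keep the parsed tags and the extracted text in two separate variables. The sketch below illustrates the idea on a made-up feed using the built-in `xml.etree.ElementTree` module (an assumption made so it runs offline; its `.text` attribute plays the role of `.getText()`). The names `SAMPLE_FEED` and `title_elements` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up feed text, for illustration only
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title></item>
    <item><title>Second story</title></item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)

# Keep the parsed elements under their own name and never overwrite them
title_elements = list(root.iter("title"))

# Build the list of strings in a different variable
headlines = [element.text for element in title_elements]

# Running the same line again is harmless: title_elements still holds elements
headlines = [element.text for element in title_elements]

print(headlines)
```

With Beautiful Soup the same pattern would be to store the result of `bs_obj.find_all('title')` in one variable (for example `title_tags`, a hypothetical name) and then write `headlines = [tag.getText() for tag in title_tags]`.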

In this feed, the first two titles appear to be for the news website rather than for news stories themselves. This is an easy fix:

In [9]:
headlines = headlines[2:]

print(headlines)

['Amazon, Berkshire, JPMorgan Health-Care Venture Looking for New CEO', 'Uber Redraws Road Map to Profit in Wake of Weakened Ridership', 'LiveXLive to Join Podcast Fray With PodcastOne Acquisition', "Google Parent Alphabet Drops Controversial 'Smart City' Project", 'The Trouble With Coronavirus Contact-Tracing Apps', "Elon Musk, Tech's Cash-Poor Billionaire", "Nintendo Scores on 'Animal Crossing' Sales Thanks to Coronavirus Lockdowns", "Vista Follows Facebook Into India's Jio With $1.5 Billion Digital Bet", 'Spending Too Much Time Online? Here Are Tips for Unplugging.', 'Facebook-Backed Libra Project Gets New CEO', "China's WeChat Monitors Foreign Users to Refine Censorship at Home", 'Grandma to the Rescue! Parents Get Virtual Help', 'Airbnb to Cut 25% of Workforce', 'Welcome Back to the Office. Your Every Move Will Be Watched.', 'Hospitals Deploy Technology to Reduce ICU Staff Exposure to Covid-19', 'New York City Apartment Renting Turns to Video Chats and Virtual Tours']


Let's do the same thing with the links to these stories by creating a list of all links:

In [10]:
urls = bs_obj.find_all('link')            # Extracts and creates a list of all the <link> tags
urls = [link.getText() for link in urls]  # Creates a list of innerHTML for each <link> tag in the urls list
print(urls)

['http://online.wsj.com', '', 'http://online.wsj.com', 'https://www.wsj.com/articles/gawande-in-talks-about-leaving-helm-of-health-care-venture-haven-11588987079?mod=rss_Technology', 'https://www.wsj.com/articles/ubers-first-quarter-loss-balloons-on-coronavirus-impact-11588882349?mod=rss_Technology', 'https://www.wsj.com/articles/livexlive-to-join-podcast-fray-with-podcastone-acquisition-11588939211?mod=rss_Technology', 'https://www.wsj.com/articles/alphabet-subsidiary-sidewalk-labs-abandons-toronto-smart-city-project-11588867545?mod=rss_Technology', 'https://www.wsj.com/articles/curbing-coronavirus-with-a-contact-tracing-app-its-not-so-simple-11588996809?mod=rss_Technology', 'https://www.wsj.com/articles/elon-musk-techs-cash-poor-billionaire-11588967043?mod=rss_Technology', 'https://www.wsj.com/articles/nintendo-scores-on-animal-crossing-sales-thanks-to-coronavirus-lockdowns-11588842065?mod=rss_Technology', 'https://www.wsj.com/articles/vista-follows-facebook-into-indias-jio-with-1-5-

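This time, it looks like index 3 (the fourth entry) is where we want to start. Notice that the second entry is an empty string! It appears to come from the self-closing `<atom:link>` tag, which has attributes but no text.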

In [11]:
urls = urls[3:]

print(urls)

['https://www.wsj.com/articles/gawande-in-talks-about-leaving-helm-of-health-care-venture-haven-11588987079?mod=rss_Technology', 'https://www.wsj.com/articles/ubers-first-quarter-loss-balloons-on-coronavirus-impact-11588882349?mod=rss_Technology', 'https://www.wsj.com/articles/livexlive-to-join-podcast-fray-with-podcastone-acquisition-11588939211?mod=rss_Technology', 'https://www.wsj.com/articles/alphabet-subsidiary-sidewalk-labs-abandons-toronto-smart-city-project-11588867545?mod=rss_Technology', 'https://www.wsj.com/articles/curbing-coronavirus-with-a-contact-tracing-app-its-not-so-simple-11588996809?mod=rss_Technology', 'https://www.wsj.com/articles/elon-musk-techs-cash-poor-billionaire-11588967043?mod=rss_Technology', 'https://www.wsj.com/articles/nintendo-scores-on-animal-crossing-sales-thanks-to-coronavirus-lockdowns-11588842065?mod=rss_Technology', 'https://www.wsj.com/articles/vista-follows-facebook-into-indias-jio-with-1-5-billion-digital-bet-11588926647?mod=rss_Technology', '
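
The offsets used above (`[2:]` and `[3:]`) depend on how many non-story `<title>` and `<link>` tags happen to appear before the first story. A less fragile alternative, sketched here under assumptions, is to loop over the `<item>` elements and read the title and link from inside each one, so every headline stays paired with its own link. The feed text is made up, and the sketch uses the built-in `xml.etree.ElementTree` module so that it runs offline.

```python
import xml.etree.ElementTree as ET

# Made-up feed text: note the extra <title>/<link> tags outside the items
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <image>
      <title>Example News</title>
      <link>http://example.com/</link>
    </image>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Story without a link</title>
    </item>
    <item>
      <title>Third story</title>
      <link>http://example.com/third</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)

stories = []
for item in root.iter("item"):        # Only look inside <item> elements
    title = item.findtext("title")    # None if the tag is missing
    link = item.findtext("link")
    if title and link:                # Skip stories missing either piece
        stories.append((title.strip(), link.strip()))

# Two lists that are parallel by construction
headlines = [title for title, link in stories]
urls = [link for title, link in stories]

print(stories)
```

In Beautiful Soup, the corresponding calls would be `bs_obj.find_all('item')` for the loop and `item.find('title')` and `item.find('link')` inside it.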

**Task \#2:** Write a function, *random_headline(headline_list, link_list)*, that accepts a list of headlines and a list of links as input and returns an output string in the format "HEADLINE, read more at LINK."

**Note:** Be sure to test your function to make sure it works as expected, and show the results of your tests below. Also, not every headline may have a link, so your two lists may not be 'parallel'. Try re-running the cells for the most up-to-date listings!

#### HINT:
* Import random to generate random numbers
* Create a function named random_headline(headline_list, link_list)
    * Generate a random number between 0 and one less than the length of headline_list
    * Create a variable to hold the value of headline_list at the index of the randomly generated number
    * Create a variable to hold the value of link_list at the index of the randomly generated number
    * Create an output string in the format "HEADLINE, read more at LINK" replacing HEADLINE and LINK with your variables
    * Return the output string

In [12]:
# Your code here
import random as rand

def random_headline(headline_list, link_list):
    head_length = len(headline_list)
    num = rand.randint(0, head_length - 1)   # Random index; randint includes both endpoints
    chosen_head = headline_list[num]         # Headline at the random index
    chosen_link = link_list[num]             # Link at the same index (assumes the lists are parallel)
    output = chosen_head + ", read more at " + chosen_link
    return output

In [21]:
random_headline(headlines, urls)

'The Trouble With Coronavirus Contact-Tracing Apps, read more at https://www.wsj.com/articles/curbing-coronavirus-with-a-contact-tracing-app-its-not-so-simple-11588996809?mod=rss_Technology'
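
As the note for this task points out, the two lists may not have the same length. The variant below is one possible way to guard against that, not the required solution: it pairs the lists with `zip`, which stops at the shorter list, and then picks a random pair. The function name `random_headline_paired` and the sample data are hypothetical.

```python
import random

def random_headline_paired(headline_list, link_list):
    """Return 'HEADLINE, read more at LINK' for a random headline/link pair."""
    pairs = list(zip(headline_list, link_list))  # zip stops at the shorter list
    if not pairs:                                # Nothing to choose from
        return "No headlines available."
    headline, link = random.choice(pairs)        # Pick one pair at random
    return headline + ", read more at " + link

# Made-up data: one more headline than there are links
sample_headlines = ["First story", "Second story", "Extra headline with no link"]
sample_links = ["http://example.com/first", "http://example.com/second"]

print(random_headline_paired(sample_headlines, sample_links))
```

This only prevents an `IndexError`. If a link is missing from the middle of the feed, the pairs after it would still be mismatched, so checking the feed's structure remains necessary.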

### Processing Other RSS Feeds

You already perused a few RSS feeds. This time, pick one of those feeds (or a new one) and explore it by writing code. As a reminder, here are some recommended feeds: 

* Local News:
  * __Chicago Tribune__: http://www.chicagotribune.com/cs-rssfeeds-htmlstory.html
  * __The Daily Herald__: http://www.dailyherald.com/rss/
  * __The Chicago Sun Times__: http://www.thesuntimes.com/section/feed


* National/International News: 
  * __Reuters__: https://www.reuters.com/tools/rss
  * __USA Today__: https://www.usatoday.com/rss/
  * __The New York Times__: http://www.nytimes.com/services/xml/rss/index.html
  * __BBC News__: http://www.bbc.com/news/10628494


* Technology News: 
  * __Wired.com__: https://www.wired.com/about/rss_feeds/
  * __Ars Technica__: https://arstechnica.com/rss-feeds/
  * __CNET__: https://www.cnet.com/rss/
  
  
* Miscellaneous (Sports, Government, Science):  
  * __ESPN__: http://www.espn.com/espn/news/story?page=rssinfo
  * __Illinois Commerce Commission__: https://www.icc.illinois.gov/rss/
  * __US Congress__: https://www.congress.gov/rss
  * __NASA__: https://www.nasa.gov/content/nasa-rss-feeds

**Task \#3:** Pick any of the feeds above (or more than one). Experiment using Python code, and show the results of your experimentation below. 

**Note:** This question is fairly open-ended, but at a minimum you must do the following: 

* Create a "random headline"-style function like you did above. You should not expect code such as __headlines = headlines[2:]__ or __urls = urls[3:]__ to fit perfectly with your data since these were modifications that may be specific to the way __The Wall Street Journal__'s RSS feed is organized.

* You must show that you engaged with the feed(s) you picked using the Beautiful Soup module and at least one Python data structure (probably lists). You will need to analyze the XML for the feed you pick and write your code to fit with the data. 

In [36]:
# Your code here
from bs4 import BeautifulSoup       # Import BeautifulSoup
from urllib.request import urlopen  # Import urlopen

DH_page = urlopen("http://www.dailyherald.com/rss/")   # Opens whatever page we are requesting
DH_obj = BeautifulSoup(DH_page, 'xml')  # Extract xml data

print(DH_obj.prettify())  # Prints the XML with indentation so it is easier to read

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
 <channel>
  <title>
   DailyHerald.com  &gt; Top News
  </title>
  <copyright>
   Copyright 2020 Daily Herald, Paddock Publications, Inc.
  </copyright>
  <link>
   http://www.dailyherald.com/
  </link>
  <description/>
  <language>
   en-us
  </language>
  <lastBuildDate>
   Sat, 9 May 2020 14:08:57 -0400
  </lastBuildDate>
  <item>
   <title>
    COVID-19 nursing home deaths climb to 1,553 &amp;#xad;-- 48% of state total
   </title>
   <link>
    http://www.dailyherald.com/news/20200508/covid-19-nursing-home-deaths-climb-to-1553-xadx2014-48-of-state-total
   </link>
   <guid>
    http://www.dailyherald.com/article/20200508/news/200509257
   </guid>
   <pubDate>
    Fri, 8 May 2020 20:43:54 -0400
   </pubDate>
   <description>
    Coronavirus outbreaks continue to ravage Illinois nursing homes as new state data show at least 1,553 deaths associated with long-term care facilities.
   </

In [47]:
Description = DH_obj.find_all('description') # Extracts and creates a list of all the <description> tags

print(Description)

[<description/>, <description>Coronavirus outbreaks continue to ravage Illinois nursing homes as new state data show at least 1,553 deaths associated with long-term care facilities.</description>, <description>Testing for COVID-19 surpassed the 20,000-a-day milestone, officials said, but some measures of Illinois' success fighting the deadly virus lag, raising the question of whether schools could reopen this fall -- a hope Gov. J.B. Pritzker expressed Friday.</description>, <description>Little Richard, one of the chief architects of rock 'n' roll whose piercing wail, pounding piano and towering pompadour irrevocably altered popular music while introducing black R&amp;B to white America, died Saturday after battling bone cancer. He was 87.</description>, <description>With social distancing precautions in place, the McHenry Outdoor Theater welcomes movie fans this weekend for a prehistoric double bill under the stars.</description>, <description>Sean Thomas, grandson of Wendy's founder 

In [60]:
Description = [story.getText() for story in Description] # Creates a list of the text inside each <description> tag

print(Description)

AttributeError: 'str' object has no attribute 'getText'

In [61]:
Description = Description[3:]  # Assign the slice so it lines up with url = url[3:] below
Description

["Little Richard, one of the chief architects of rock 'n' roll whose piercing wail, pounding piano and towering pompadour irrevocably altered popular music while introducing black R&B to white America, died Saturday after battling bone cancer. He was 87.",
 'With social distancing precautions in place, the McHenry Outdoor Theater welcomes movie fans this weekend for a prehistoric double bill under the stars.',
 "Sean Thomas, grandson of Wendy's founder Dave Thomas, plans to open his own burger restaurant Monday in Kildeer. Fresh Stack Burger Co. will open for curbside and third-party delivery service until restaurants are allowed to reopen for dining in.",
 'The Elgin Youth Symphony Orchestra will hold a virtual concert Sunday.',
 'A Naperville firefighter was injured at the scene of a fire in the 1700 block of Baybrook Lane early Saturday. The residents of the two-story home were able to evacuate safely, officials said.',
 "Jeanne Hansen of Round Lake is back home for Mother's Day aft

In [62]:
url = DH_obj.find_all('link')            # Extracts and creates a list of all the <link> tags
url = [link.getText() for link in url]  # Creates a list of innerHTML for each <link> tag in the urls list
print(url)

['http://www.dailyherald.com/', 'http://www.dailyherald.com/news/20200508/covid-19-nursing-home-deaths-climb-to-1553-xadx2014-48-of-state-total', 'http://www.dailyherald.com/news/20200508/testing-in-illinois-passes-20000-a-day-but-can-schools-reopen-this-fall', 'http://www.dailyherald.com/entlife/20200509/little-richard-flamboyant-rock-n-roll-pioneer-dead-at-87', 'http://www.dailyherald.com/news/20200508/night-out-at-the-movies-mchenry-drive-in-delivers-double-bill-with-pandemic-precautions', 'http://www.dailyherald.com/news/20200509/fresh-stack-burger-co-set-to-open-monday-in-kildeer', 'http://www.dailyherald.com/news/20200509/eyso-creates-virtual-way-to-conclude-44th-season', 'http://www.dailyherald.com/news/20200509/firefighter-injured-at-scene-of-naperville-fire', 'http://www.dailyherald.com/news/20200509/mothers-day-means-even-more-this-year-for-hansen', 'http://www.dailyherald.com/news/20200508/may-8-covid-19-cases-per-county-search-by-zip-code', 'http://www.dailyherald.com/entli

In [63]:
url = url[3:]
print(url)

['http://www.dailyherald.com/entlife/20200509/little-richard-flamboyant-rock-n-roll-pioneer-dead-at-87', 'http://www.dailyherald.com/news/20200508/night-out-at-the-movies-mchenry-drive-in-delivers-double-bill-with-pandemic-precautions', 'http://www.dailyherald.com/news/20200509/fresh-stack-burger-co-set-to-open-monday-in-kildeer', 'http://www.dailyherald.com/news/20200509/eyso-creates-virtual-way-to-conclude-44th-season', 'http://www.dailyherald.com/news/20200509/firefighter-injured-at-scene-of-naperville-fire', 'http://www.dailyherald.com/news/20200509/mothers-day-means-even-more-this-year-for-hansen', 'http://www.dailyherald.com/news/20200508/may-8-covid-19-cases-per-county-search-by-zip-code', 'http://www.dailyherald.com/entlife/20200509/a-song-to-make-you-smile-i-saved-the-world-today', 'http://www.dailyherald.com/news/20200508/2-swans-found-dead-in-pen-at-itasca-business-park', 'http://www.dailyherald.com/news/20200509/watch-cardinal-cupich-thanks-educators', 'http://www.dailyhera

In [64]:
import random as rand

def random_ad(Description_list, link_list):
    head_length = len(Description_list)
    num = rand.randint(0, head_length - 1)   # Random index into the description list
    chosen_head = Description_list[num]
    chosen_link = link_list[num]             # Same index, so both lists must be aligned
    output = str(chosen_head) + ". Please read more at " + str(chosen_link)
    return output

In [65]:
random_ad(Description, url)

'The rookies with the Pittsburgh Steelers are trying to find a way to get their careers off on the right foot amid the COVID-19 pandemic. Please read more at http://www.dailyherald.com/article/20200509/news/305099988/'