# feedparser and beautifulsoup4 - A learning notebook
The purpose of this notebook is to help me learn how to use the `feedparser` and `BeautifulSoup4` modules. Currently, my plan is to use these modules to collect all of my AI-related news feeds in my `ai_news_writer` application.

In [1]:
ai_news_feeds = [
    {"name": "AI Trends", "url": "https://www.aitrends.com/feed/"},
#     {"name": "Science Daily", "url": "https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml"},
    {"name": "MIT News - Artificial Intelligence", "url": "http://news.mit.edu/rss/topic/artificial-intelligence2"},
    {"name": "reddit artificial", "url": "https://www.reddit.com/r/artificial/.rss"},
    {"name": "Chatbots Magazine - Medium", "url": "https://chatbotsmagazine.com/feed"},
    {"name": "Towards Data Science - Medium", "url": "https://towardsdatascience.com/feed"},
    {"name": "Chatbots Life - Medium", "url": "https://chatbotslife.com/feed"},
    {"name": "AWS Machine Learning Blog", "url": "https://aws.amazon.com/blogs/machine-learning/feed/"},
    {"name": "Artificial Intelligence - IBM Developer",
     "url": "https://developer.ibm.com/patterns/category/artificial-intelligence/feed/"},
    {"name": "Lex Fridman - Artificial Intelligence (AI)", "url": "https://lexfridman.com/category/ai/feed/"},
    {"name": "reddit singularity", "url": "https://www.reddit.com/r/singularity/.rss?format=xml"},
    {"name": "Archie.AI - Medium", "url": "https://medium.com/feed/archieai"},
    {"name": "The Official NVIDIA Blog", "url": "http://feeds.feedburner.com/nvidiablog"},
    {"name": "OpenAI Blog", "url": "https://openai.com/blog/rss/"},
    {"name": "VentureBeat", "url": "http://feeds.feedburner.com/venturebeat/SZYF?format=xml"},
]

## feedparser and RSS feeds in general

I've not worked with RSS feeds before, so once again, this is a learning opportunity...

Currently, we're taking a look at the `feedparser` module and we're going to try and figure out how this works. It seems that there are two formats we'll have to contend with and those are **RSS** and **ATOM**. It does appear that both of these formats are mixed in the sources defined above. We'll need to take a look at each source and try to figure out if we can determine its type in the payload somehow, but we're getting a little ahead of ourselves...
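
As a first stab at telling the two formats apart, here is a minimal sketch that only looks at the root element of the XML. The helper name `sniff_feed_type` and the two sample documents are made up for illustration. As far as I know, the result of `feedparser.parse` also carries a `version` field (values such as `rss20` or `atom10`), which is probably the better source once the feed is parsed.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 documents
ATOM_NS = "http://www.w3.org/2005/Atom"


def sniff_feed_type(xml_data):
    """Guess the feed format from the root element (hypothetical helper)."""
    root = ET.fromstring(xml_data)
    if root.tag == "rss":
        return "rss"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    # Anything else (for example RSS 1.0 / RDF) is reported as unknown
    return "unknown"


# Tiny stand-in documents, not real feed payloads
rss_sample = '<rss version="2.0"><channel><title>Example</title></channel></rss>'
atom_sample = f'<feed xmlns="{ATOM_NS}"><title>Example</title></feed>'

print(sniff_feed_type(rss_sample))
print(sniff_feed_type(atom_sample))
```

For the saved files, passing the raw bytes (opened with `"rb"`) is the safer choice, because `ET.fromstring` rejects a `str` that starts with an XML declaration naming an encoding.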

Let's import `feedparser` and get started. We'll also need to use the `requests` module to actually download the feeds and save the `xml` files so we can work locally.

In order to get around an issue I ran into with reddit, I'm going to import `time` so I can put a sleep statement in between all of our calls. This should get around the issue with being flagged as a *bot* as mentioned below (yes, I'm retconning this document).

In [2]:
import feedparser
import requests
import time

Now that we've got our imports, let's:
* loop through the defined collection
* call `requests.get` to grab each of the feeds
* create a new file based on the source of the feed
* save the contents of the feed to a file
* sleep for a few seconds before moving to the next item

In [3]:
for feed_source in ai_news_feeds:
    feed = requests.get(feed_source['url'])
    
    # The with block closes the file for us, so no explicit close() is needed
    with open(f"data/{feed_source['name']}.xml", "wb") as file:
        file.write(feed.content)
    
    # Going to sleep for 3 seconds between calls just to be safe
    time.sleep(3.0)
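
The loop above uses the feed name directly as the file name, spaces and parentheses included. A small sketch of how the names could be made friendlier to the file system follows; `feed_path` is a hypothetical helper and the character whitelist is my own assumption, not something the notebook requires.

```python
import re
from pathlib import Path


def feed_path(name, data_dir="data"):
    """Build a file-system friendly path for a feed's XML file (hypothetical helper)."""
    # Replace every run of characters other than letters, digits,
    # dot, underscore and hyphen with a single underscore
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return Path(data_dir) / f"{slug}.xml"


print(feed_path("Archie.AI - Medium").name)
print(feed_path("Lex Fridman - Artificial Intelligence (AI)").name)
```

Calling `Path("data").mkdir(parents=True, exist_ok=True)` once before the download loop would also avoid an error when the `data` folder does not exist yet.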

## Explore the feeds in each of the files

Now that we've captured our feeds, we need to try and figure out their contents. As I mentioned before, it seems like we're going to have to deal with **RSS** and **ATOM** feed formats, so I guess the first thing we should do is try to figure that out based on the contents of the file.

I *believe* we have both file formats in the first two list items (*AI Trends* and *Science Daily*). Based on my previous tests, *AI Trends* seems to work perfectly with `feedparser` and it has `content`, but the one from *Science Daily* does not have `content` items and throws an exception when iterating through its collection of items.

In [4]:
ai_trends = feedparser.parse('data/AI Trends.xml')
archie_ai = feedparser.parse('data/Archie.AI - Medium.xml')

> **Notes and Important Information**: It appears I was wrong about *Science Daily* and I'm not getting anything back from them (so I changed this "experiment" to use Archie.AI instead). It could be because I am a *bot* and they've cut me off. I will try again at a later date. I also got an error from *r/artificial* because they flagged me as a bot and I had too many requests. I'll need to work through that as well. Fun stuff we'll have to keep in mind.
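
One way to cope with being rate limited is to retry with a growing pause. The sketch below assumes the server answers with HTTP status 429 when there are too many requests and that the response object has a `status_code` attribute, as `requests.Response` does. The function name `fetch_with_backoff`, its parameters, and the `FakeResponse` stand-in are all made up for illustration.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each HTTP 429 response (hypothetical helper)."""
    response = fetch(url)
    for attempt in range(retries):
        if response.status_code != 429:
            break
        # Wait base_delay, then twice that, then four times that, and so on
        sleep(base_delay * 2 ** attempt)
        response = fetch(url)
    return response


class FakeResponse:
    """Stand-in for a real HTTP response, used only for this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a server that rejects the first request and accepts the second
canned = iter([FakeResponse(429), FakeResponse(200)])
result = fetch_with_backoff(lambda url: next(canned), "https://example.com/feed",
                            base_delay=0.01)
print(result.status_code)
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url, headers={"User-Agent": "ai_news_writer/0.1"}, timeout=30)`. The User-Agent string here is only a placeholder; I am assuming that a descriptive one makes it less likely to be treated as an anonymous bot.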

In [5]:
ai_trends

{'feed': {'title': 'AI Trends',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': '',
   'value': 'AI Trends'},
  'links': [{'href': 'https://www.aitrends.com/feed/',
    'rel': 'self',
    'type': 'application/rss+xml'},
   {'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://www.aitrends.com'}],
  'link': 'https://www.aitrends.com',
  'subtitle': 'The Business and Technology of Enterprise AI',
  'subtitle_detail': {'type': 'text/html',
   'language': None,
   'base': '',
   'value': 'The Business and Technology of Enterprise AI'},
  'updated': 'Fri, 18 Oct 2019 11:03:27 +0000',
  'updated_parsed': time.struct_time(tm_year=2019, tm_mon=10, tm_mday=18, tm_hour=11, tm_min=3, tm_sec=27, tm_wday=4, tm_yday=291, tm_isdst=0),
  'language': 'en',
  'sy_updateperiod': 'hourly',
  'sy_updatefrequency': '1',
  'generator_detail': {'name': 'https://wordpress.org/?v=4.9.8'},
  'generator': 'https://wordpress.org/?v=4.9.8'},
 'entries': [{'title': 'Data Privacy

In [6]:
print(ai_trends['feed']['title'])
print(ai_trends['feed']['subtitle'])
print(ai_trends['feed']['updated'])

for entry in ai_trends.entries:
    for term in entry.tags:
        print(f"\tTag: {term.term}")
        
    print(f"{entry.title}\r\n{entry.published}\r\n{entry.author}")
    print(entry.summary)
    
    for content in entry.content:
        print(content.value)

AI Trends
The Business and Technology of Enterprise AI
Fri, 18 Oct 2019 11:03:27 +0000
	Tag: Data Privacy and Security
	Tag: Ethics and Social Issues
	Tag: Machine Learning
	Tag: ai in business
	Tag: data security and privacy
	Tag: ethics and social issues
	Tag: machine learning
Data Privacy Clashing with Demand for Data to Power AI Applications
Thu, 17 Oct 2019 21:30:40 +0000
Benjamin Ross
<img width="100" height="70" src="https://www.aitrends.com/wp-content/uploads/2019/10/10-18GDPR-CompliantForm-2-100x70.jpg" class="webfeedsFeaturedVisual wp-post-image" alt="" style="float: left; margin-right: 5px;" link_thumbnail="" srcset="https://www.aitrends.com/wp-content/uploads/2019/10/10-18GDPR-CompliantForm-2-100x70.jpg 100w, https://www.aitrends.com/wp-content/uploads/2019/10/10-18GDPR-CompliantForm-2-218x150.jpg 218w" sizes="(max-width: 100px) 100vw, 100px" />By AI Trends Staff Your data has value, but unlocking it for your own benefit is challenging. Understanding how valuable data are c

Okay, there are a couple of things to look at here in the above example. We're doing a lot, and it seems like there will be more to do.

1. The `feed` key is the top-level element of the parsed result that describes the feed itself. It contains the title, subtitle, and when the feed was last updated.
2. Each feed has a list of entries; the entries are the actual content we'll be looking at.
> For each piece of content, we'll want to look at the tags and see if it's related to AI, ML, or whatever, to make sure it's something we're actually interested in. This should be configurable.
3. After we've determined that we're interested in this particular piece of content, we'll want to grab:
    * title of the article
    * author of the article
    * the original publish date, and possibly the modified date
    * summary of the content
    * content

In [7]:
archie_ai

{'feed': {'title': 'Archie.AI - Medium',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': '',
   'value': 'Archie.AI - Medium'},
  'subtitle': 'ML &amp; Tech Articles from the team behind Archie.AI - Medium',
  'subtitle_detail': {'type': 'text/html',
   'language': None,
   'base': '',
   'value': 'ML &amp; Tech Articles from the team behind Archie.AI - Medium'},
  'links': [{'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://medium.com/archieai?source=rss----4e8922a89498---4'},
   {'href': 'https://medium.com/feed/archieai',
    'rel': 'self',
    'type': 'application/rss+xml'},
   {'href': 'http://medium.superfeedr.com',
    'rel': 'hub',
    'type': 'text/html'}],
  'link': 'https://medium.com/archieai?source=rss----4e8922a89498---4',
  'image': {'href': 'https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png',
   'title': 'Archie.AI - Medium',
   'title_detail': {'type': 'text/plain',
    'language': None,
    'base': '',
    'val

So we got everything processed for the `ai_trends` collection (which was in *RSS* format); now we'll need to do the same thing for the `archie_ai` collection.

In [8]:
print(archie_ai['feed']['title'])
print(archie_ai['feed']['subtitle'])
print(archie_ai['feed']['updated'])

Archie.AI - Medium
ML &amp; Tech Articles from the team behind Archie.AI - Medium
Sat, 19 Oct 2019 22:54:51 GMT


So far, so good...

> I am not nearly as confident in `archie_ai` as I was in `ai_trends`, so we're taking baby steps

In [9]:
for entry in archie_ai.entries:
    for term in entry.tags:
        print(f"\tTag: {term['term']}")
        
    print(f"{entry.title}\r\n{entry.published}\r\n{entry.author}")
    print(entry.summary)
    
    for content in entry.content:
        print(content.value)

	Tag: ethereum
	Tag: bitcoin
	Tag: investing
	Tag: cryptocurrency
How Much Money Do You Need to Move the Bitcoin Market?
Thu, 12 Jul 2018 03:46:18 GMT
Ishtiaq Rahman
<h4>Whale Science 🐋</h4><p>I did a quick estimation to figure out what kind of investment you’d need to be able to move the Bitcoin market in the short term, in a particular direction.</p><p>I looked at Gdax trading data which handled about 2.98% of all BTC trading volume on February 26th, 2018.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZNfAigHdJMdvAk1J6T-1kQ.png" /><figcaption>Source: Coinmarkercap.com</figcaption></figure><p>On February 26, between 11: 54 PM PST and 11:55 PM PST, the total volume traded on Gdax was 65 Bitcoins which was worth around $686,440 USD at the time. This resulted in a .557% increase in the price of Bitcoin.</p><p>I picked this particular minute for my example because it had visibly higher trading volume compared to the minutes before it.(See image below).</p><p>We ca

`archie_ai` worked much better than I had anticipated, but there are several articles that we're not going to be interested in. I'll work on excluding those later, but we'll need to keep an eye on the terms and figure out which ones we're really interested in. We could also have our **writing engine** determine if the content/subject matter is something we're interested in and let it sort everything out for us...
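
To keep an eye on the terms, a quick tally of how often each one appears might help decide which to keep. A minimal sketch, assuming each entry exposes `tags` as a list of dict-like objects with a `term` key (as in the loops above); `count_terms` and `sample_entries` are made up for illustration, and the second sample entry is invented.

```python
from collections import Counter


def count_terms(entries):
    """Tally tag terms across entries, ignoring case and missing tags."""
    counts = Counter()
    for entry in entries:
        for tag in entry.get("tags", []):
            term = (tag.get("term") or "").strip().lower()
            if term:
                counts[term] += 1
    return counts


# Stand-in entries; with real data this would be archie_ai.entries
sample_entries = [
    {"tags": [{"term": "ethereum"}, {"term": "bitcoin"},
              {"term": "investing"}, {"term": "cryptocurrency"}]},
    {"tags": [{"term": "Machine Learning"}, {"term": "bitcoin"}]},
    {},  # an entry without any tags
]

for term, count in count_terms(sample_entries).most_common(3):
    print(f"{count:3d}  {term}")
```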

For grins and giggles, let's take a look at the *VentureBeat* feed and see what kind of articles it includes.
> There are several articles that show up in my **Google** news feed that are usually interesting and they keep up with the things we're interested in here, so that's why I included it.

In [10]:
venture_beat = feedparser.parse("data/VentureBeat.xml")

In [11]:
venture_beat

{'feed': {'title': 'VentureBeat',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': '',
   'value': 'VentureBeat'},
  'links': [{'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://venturebeat.com'},
   {'rel': 'self',
    'type': 'application/rss+xml',
    'href': 'http://feeds.feedburner.com/venturebeat/SZYF'},
   {'rel': 'hub',
    'href': 'http://pubsubhubbub.appspot.com/',
    'type': 'text/html'}],
  'link': 'https://venturebeat.com',
  'subtitle': 'Tech news that matters',
  'subtitle_detail': {'type': 'text/html',
   'language': None,
   'base': '',
   'value': 'Tech news that matters'},
  'updated': 'Sat, 19 Oct 2019 22:31:12 +0000',
  'updated_parsed': time.struct_time(tm_year=2019, tm_mon=10, tm_mday=19, tm_hour=22, tm_min=31, tm_sec=12, tm_wday=5, tm_yday=292, tm_isdst=0),
  'language': 'en-US',
  'sy_updateperiod': 'hourly',
  'sy_updatefrequency': '1',
  'generator_detail': {'name': 'https://wordpress.org/?v=5.2.4'},
  'generator': 'ht

In [12]:
print(venture_beat['feed']['title'])
print(venture_beat['feed']['subtitle'])
print(venture_beat['feed']['updated'])

VentureBeat
Tech news that matters
Sat, 19 Oct 2019 22:31:12 +0000


In [13]:
for entry in venture_beat.entries:
    for term in entry.tags:
        print(f"\tTag: {term['term']}")
        
    print(f"{entry.title}\r\n{entry.published}\r\n{entry.author}")
    print(entry.summary)
    
    for content in entry.content:
        print(content.value)

	Tag: Business
	Tag: Games
	Tag: category-/Games/Computer & Video Games
	Tag: DeanBeat News
	Tag: Eidos
	Tag: Emily Greer
	Tag: Keith Boesky
Former Eidos president and frequent game startup adviser Keith Boesky passes away
Sat, 19 Oct 2019 22:33:41 +0000
Dean Takahashi
Keith Boesky, former president of Tomb Raider publisher Eidos and a frequent adviser to game companies at Boesky &#038; Co., has died from cancer.
<img width="578" height="373" src="https://venturebeat.com/wp-content/uploads/2019/10/keith-boesky.jpg?fit=578%2C373&amp;strip=all" class="attachment-single-feed size-single-feed wp-post-image" alt="Keith Boesky was the former president of Eidos." srcset="https://venturebeat.com/wp-content/uploads/2019/10/keith-boesky.jpg?w=877&amp;strip=all 877w, https://venturebeat.com/wp-content/uploads/2019/10/keith-boesky.jpg?w=300&amp;strip=all 300w, https://venturebeat.com/wp-content/uploads/2019/10/keith-boesky.jpg?w=768&amp;strip=all 768w, https://venturebeat.com/wp-content/uploads/20

As mentioned above, *VentureBeat* was for grins and giggles, and it looks like the feed itself is only links to stories. We'll probably end up having to handle this one differently. If this is the case for one feed, it certainly could be the case for other feeds as well, so we'll have to plan for that and have the ability to crawl each of the links to grab the content we want.
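
One possible way to plan for this is to guess, per entry, whether the feed carries the full article or only a teaser, and collect the links that would need crawling. The sketch below uses only the standard library to keep it dependency-free. The names `visible_text`, `needs_crawl`, and `links_to_crawl`, the 500-character threshold, and the sample entries with `example.com` links are all assumptions for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with the tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts).strip()


def needs_crawl(entry, min_chars=500):
    """Guess that an entry is only a teaser when its longest content body is short."""
    bodies = [part.get("value", "") for part in entry.get("content", [])]
    longest = max((len(visible_text(body)) for body in bodies), default=0)
    return longest < min_chars


def links_to_crawl(entries, min_chars=500):
    """List the links of entries whose content looks like a teaser."""
    return [entry["link"] for entry in entries
            if entry.get("link") and needs_crawl(entry, min_chars)]


# Stand-in entries: one teaser, one long body (links are placeholders)
sample_entries = [
    {"link": "https://example.com/teaser",
     "content": [{"value": "<p>A short blurb. <a href='#'>Read More</a></p>"}]},
    {"link": "https://example.com/full",
     "content": [{"value": "<p>" + "word " * 200 + "</p>"}]},
]

print(links_to_crawl(sample_entries))
```

The threshold is a blunt heuristic; checking for a trailing "Read More" link could be another signal, but I have not verified how consistent that is across feeds.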

## BeautifulSoup4 - Grabbing the Basic Content/Text from the Feed

Okay, now that we've got our feed data, we're ready to pull out the content so that we can begin to run it through our NLP modules. Let's take a look at the content using `BeautifulSoup4`.

In [14]:
from bs4 import BeautifulSoup

for entry2 in ai_trends.entries:
    for content in entry2.content:
        soup = BeautifulSoup(content.value, 'html.parser')
        print(soup.get_text())

By AI Trends Staff
Your data has value, but unlocking it for your own benefit is challenging. Understanding how valuable data are collected and approved for use can help you to get there.
Two primary means for differentiating audiences by their data collection methods are site-authenticated data collection and people-based data collection, suggested a recent piece in BulletinHealthcare written by Justin Fadgen, chief corporate development officer for the firm.
Site-authenticated data are sourced from individual authentication events, such as when a user completes an online form, and generally agrees to a privacy policy that includes a data use agreement. User data are then be combined with other data sources that add meaning, becoming the basis of advertising targeting for instance. In marketing for healthcare, this is the National Provider Identifier (NPI), a 10-digit numeric identifier for covered healthcare providers under HIPAA.
People-based data collection does not come from a reg

In [15]:
for entry2 in archie_ai.entries:
    for content in entry2.content:
        soup = BeautifulSoup(content.value, 'html.parser')
        print(soup.get_text())

Whale Science 🐋I did a quick estimation to figure out what kind of investment you’d need to be able to move the Bitcoin market in the short term, in a particular direction.I looked at Gdax trading data which handled about 2.98% of all BTC trading volume on February 26th, 2018.Source: Coinmarkercap.comOn February 26, between 11: 54 PM PST and 11:55 PM PST, the total volume traded on Gdax was 65 Bitcoins which was worth around $686,440 USD at the time. This resulted in a .557% increase in the price of Bitcoin.I picked this particular minute for my example because it had visibly higher trading volume compared to the minutes before it.(See image below).We can now estimate that the total volume of Bitcoins traded in ALL markets during that 1-minute was about 2181 Bitcoins (~23 million USD worth) from the fact 65 Bitcoins traded on Gdax represented 2.98% of all Bitcoin trade volume in the world.The actual total volume should be lower since it is unlikely that all markets are completely effic

In [16]:
for entry2 in venture_beat.entries:
    for content in entry2.content:
        soup = BeautifulSoup(content.value, 'html.parser')
        print(soup.get_text())

Keith Boesky, former president of Tomb Raider publisher Eidos and a frequent adviser to game companies at Boesky & Co., has died from cancer. Read More
Former Bungie CEO Harold Ryan showed everybody how to go big or go home earlier this month. He announced that his ProbablyMonsters is announcing it has raised $18.8 million for the company’s two triple-A game studios. Ryan is going to need a lot more money than that to pull off not one but two ambitious triple-A projects in an age when there are a…Read More
I spoke with Drew Henry, senior vice president at Arm, about the growing IoT ecosystem's compute demands and the impact that could have on climate change.Read More
Tilt Five announced two new tabletop AR game partners with Tabletopia and Monocle Society.Read More
Paradox Interactive and Harebrained Schemes announced the Heavy Metal expansion for mech combat game Battletech will go live on November 21.Read More
Age of Wonders: Planetfall's first expansion is Revelations, and it launch

All of the text in the above `feedparser` collections seems to look really good with no processing after the fact. It looks like we're at the point that we can save this text off and start working on our NLP portion of the application.
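
One detail worth noting in the *Archie.AI* output is that headings and paragraphs run together ("Whale Science 🐋I did a quick estimation..."), because `get_text()` joins text nodes with nothing in between by default. A small sketch of how that might be handled with the `separator` and `strip` arguments of `get_text`; the helper name `html_to_text` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Extract text, putting each text node on its own line (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # separator is placed between text nodes; strip trims each node
    # and drops the ones that are empty after trimming
    return soup.get_text(separator="\n", strip=True)


# Shortened stand-in for a Medium-style content body
sample = ("<h4>Whale Science</h4>"
          "<p>I did a quick estimation.</p>"
          "<p>I looked at Gdax trading data.</p>")

print(html_to_text(sample))
```

A newline separator also splits text around inline tags such as links or bold spans, so a single space may be the better separator if sentence flow matters more to the NLP step than paragraph boundaries.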

> Actually, now that I say that, I need to make this notebook available in my repository and convert this code into actual *usable* code within my **ai_news_writer** application.