# Working with RSS Feeds Lab

Complete the following set of exercises to solidify your knowledge of parsing RSS feeds and extracting information from them.

* RSS stands for Rich Site Summary (also commonly expanded as Really Simple Syndication). It is a standard web feed format for publishing frequently updated information such as blog entries, news headlines, audio, and video.

* __What is feedparser?__

feedparser is a Python library for parsing feeds in formats such as RSS and Atom.
[source](https://github.com/kurtmckee/feedparser)

- Documentation: https://pythonhosted.org/feedparser/

In [3]:
! pip install feedparser

Collecting feedparser
  Using cached https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2
Building wheels for collected packages: feedparser
  Building wheel for feedparser (setup.py) ... done
  Stored in directory: /Users/liviaclarete/Library/Caches/pip/wheels/8c/69/b7/f52763c41c5471df57703a0ef718a32a5e81ee35dcf6d4f97f
Successfully built feedparser
Installing collected packages: feedparser
Successfully installed feedparser-5.2.1


In [4]:
# import feedparser
import feedparser

### 1. Use feedparser to parse the following RSS feed URL.

In [5]:
url = 'http://feeds.feedburner.com/oreilly/radar/atom'

In [6]:
# call feedparser.parse(), passing the URL as an argument
feed_oreilly = feedparser.parse(url)

In [7]:
# feed type
type(feed_oreilly)

feedparser.FeedParserDict

In [8]:
# note that the object behaves like a dict
feed_oreilly

{'feed': {'title': "All - O'Reilly Media",
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': 'http://feeds.feedburner.com/oreilly/radar/atom',
   'value': "All - O'Reilly Media"},
  'id': 'https://www.oreilly.com',
  'guidislink': True,
  'link': 'https://www.oreilly.com',
  'updated': '2019-09-04T14:38:31Z',
  'updated_parsed': time.struct_time(tm_year=2019, tm_mon=9, tm_mday=4, tm_hour=14, tm_min=38, tm_sec=31, tm_wday=2, tm_yday=247, tm_isdst=0),
  'subtitle': 'All of our Ideas and Learning material from all of our topics.',
  'subtitle_detail': {'type': 'text/plain',
   'language': None,
   'base': 'http://feeds.feedburner.com/oreilly/radar/atom',
   'value': 'All of our Ideas and Learning material from all of our topics.'},
  'links': [{'href': 'https://www.oreilly.com',
    'rel': 'alternate',
    'type': 'text/html'},
   {'rel': 'self',
    'type': 'application/atom+xml',
    'href': 'http://feeds.feedburner.com/oreilly/radar/atom'},
   {'rel': 'hub',
    

### 2. Obtain a list of components (keys) that are available for this feed.

In [9]:
# list the top-level keys by calling keys() on the parsed feed
keys = list(feed_oreilly.keys())

In [10]:
len(keys)

12

In [11]:
# 12 keys are listed
keys

['feed',
 'entries',
 'bozo',
 'headers',
 'etag',
 'updated',
 'updated_parsed',
 'href',
 'status',
 'encoding',
 'version',
 'namespaces']

### 3. Obtain a list of components (keys) that are available for the *feed* component of this RSS feed.

In [12]:
# access the keys inside feed element
feed_key = list(feed_oreilly.feed.keys())

In [13]:
# print the list of keys inside the feed element
feed_key

['title',
 'title_detail',
 'id',
 'guidislink',
 'link',
 'updated',
 'updated_parsed',
 'subtitle',
 'subtitle_detail',
 'links',
 'authors',
 'author_detail',
 'author',
 'feedburner_info',
 'geo_lat',
 'geo_long',
 'feedburner_emailserviceid',
 'feedburner_feedburnerhostname']

In [14]:
# each of these elements can be accessed as an attribute
print(feed_oreilly.feed.title)
print(feed_oreilly.feed.author)
print(feed_oreilly.feed.geo_lat)

All - O'Reilly Media
O'Reilly Media
38.393314


### 4. Extract and print the feed title, subtitle, author, and link.

* Extract information from the nested attributes

In [15]:
title = feed_oreilly.feed.title
subtitle = feed_oreilly.feed.subtitle
author = feed_oreilly.feed.author
link = feed_oreilly.feed.link

In [16]:
# print the attributes
print(f'Title:{title}, \nSubtitle:{subtitle}, \nAuthor:{author}, \nLink:{link}')

Title:All - O'Reilly Media, 
Subtitle:All of our Ideas and Learning material from all of our topics., 
Author:O'Reilly Media, 
Link:https://www.oreilly.com


### 5. Count the number of entries that are contained in this RSS feed.

In [17]:
# Print the number of entries
len(feed_oreilly.entries)

60

### 6. Obtain a list of components (keys) available for an entry.

*Hint: Remember to index first before requesting the keys*

In [18]:
# select the first element from feed_oreilly.entries and call the keys() function
entries_key = feed_oreilly.entries[0].keys()

In [19]:
# convert the dict_keys into a list
entries_key = list(entries_key)

In [20]:
entries_key

['title',
 'title_detail',
 'updated',
 'updated_parsed',
 'id',
 'guidislink',
 'link',
 'content',
 'summary',
 'links',
 'feedburner_origlink']
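
Only the first entry is inspected here, and entries in a feed need not all share the same keys. As a minimal sketch, using plain dictionaries as hypothetical stand-ins for feed entries, the keys found in any entry and the keys common to all entries can be compared like this:

```python
# Plain dictionaries standing in for feed entries (hypothetical data).
entries = [
    {'title': 'Post A', 'link': 'https://example.com/a', 'author': 'Ann'},
    {'title': 'Post B', 'link': 'https://example.com/b'},
    {'title': 'Post C', 'summary': 'Short text', 'link': 'https://example.com/c'},
]

# Keys present in at least one entry.
all_keys = set().union(*(entry.keys() for entry in entries))

# Keys present in every entry.
common_keys = set(entries[0]).intersection(*(entry.keys() for entry in entries[1:]))

print(sorted(all_keys))
print(sorted(common_keys))
```

With the parsed feed, the same two expressions should work on `feed_oreilly.entries`, since each entry is dict-like.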

### 7. Extract a list of entry titles.

* We need a for loop (here, a list comprehension) to extract the title nested in each entry

In [21]:
# iterate through each element and return the 'title' from each entry
titles = [entry['title'] 
          for entry in feed_oreilly.entries]

In [22]:
titles

['New live online training courses',
 'Four short links: 4 September 2019',
 'How new tools in data and AI are being used in health care and medicine',
 'Four short links: 3 September 2019',
 'Four short links: 2 September 2019',
 'Four short links: 30 August 2019',
 'Becoming a machine learning practitioner',
 'Four short links: 29 August 2019',
 'One simple chart: Who is interested in Apache Pulsar?',
 'Four short links: 28 August 2019',
 'Four short links: 27 August 2019',
 'Four short links: 26 August 2019',
 'How organizations are sharpening their skills to better understand and use AI',
 'Four short links: 23 August 2019',
 'Four short links: 22 August 2019',
 'Four short links: 21 August 2019',
 'Four short links: 20 August 2019',
 'Four short links: 19 August 2019',
 'Antitrust regulators are using the wrong tools to break up Big Tech',
 'Labeling, transforming, and structuring training data sets for machine learning',
 'Four short links: 15 August 2019',
 'Four short links: 14

### 8. Calculate the percentage of "Four short links" entry titles.

In [23]:
# keep only the titles that contain a given phrase
four_short_links = [
    # return the title
    title
    # iterate through each element in titles
    for title in titles
    # if the title contains 'Four short links'
    if "Four short links" in title]

In [24]:
four_short_links

['Four short links: 4 September 2019',
 'Four short links: 3 September 2019',
 'Four short links: 2 September 2019',
 'Four short links: 30 August 2019',
 'Four short links: 29 August 2019',
 'Four short links: 28 August 2019',
 'Four short links: 27 August 2019',
 'Four short links: 26 August 2019',
 'Four short links: 23 August 2019',
 'Four short links: 22 August 2019',
 'Four short links: 21 August 2019',
 'Four short links: 20 August 2019',
 'Four short links: 19 August 2019',
 'Four short links: 15 August 2019',
 'Four short links: 14 August 2019',
 'Four short links: 13 August 2019',
 'Four short links: 12 August 2019',
 'Four short links: 9 August 2019',
 'Four short links: 8 August 2019',
 'Four short links: 7 August 2019',
 'Four short links: 6 August 2019',
 'Four short links: 5 August 2019',
 'Four short links: 2 August 2019',
 'Four short links: 1 August 2019',
 'Four short links: 31 July 2019',
 'Four short links: 30 July 2019',
 'Four short links: 29 July 2019',
 'Four s

In [25]:
# calculate the percentage of titles containing 'Four short links'
perct = 100 * len(four_short_links) / len(titles)
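
The same percentage can be computed without building the intermediate list. This is a sketch of an alternative, not the lab's required approach; it uses a handful of titles copied from the output above and the hypothetical names `sample_titles`, `n_matches`, and `share`.

```python
# A few titles copied from the list above.
sample_titles = [
    'New live online training courses',
    'Four short links: 4 September 2019',
    'Becoming a machine learning practitioner',
    'Four short links: 3 September 2019',
    'Four short links: 2 September 2019',
]

prefix = 'Four short links'

# True counts as 1, so sum() counts the titles that start with the prefix.
n_matches = sum(title.startswith(prefix) for title in sample_titles)
share = 100 * n_matches / len(sample_titles)

print(f'{share:.2f}%')
```

`startswith` is stricter than `in`: it only matches titles that begin with the phrase, which fits the "Four short links: <date>" pattern seen in this feed.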


In [26]:
print(f'"Four short links" represents {round(perct,2)}% of the list')

"Four short links" represents 56.67% of the list


### 9. Create a Pandas data frame from the feed's entries.

In [27]:
import pandas as pd

In [28]:
# The dict format is a perfect match to be read with DataFrame() from pandas
df_entries = pd.DataFrame(feed_oreilly.entries)

In [29]:
df_entries.head(1)

|   | author | author_detail | authors | content | feedburner_origlink | guidislink | id | link | links | summary | title | title_detail | updated | updated_parsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  |  | `[{'type': 'text/html', 'language': None, 'base...` | https://www.oreilly.com/ideas/new-live-online-... | True | tag:www.oreilly.com,2019-09-04:/ideas/new-live... | http://feedproxy.google.com/~r/oreilly/radar/a... | `[{'href': 'http://feedproxy.google.com/~r/orei...` | `<p><img src='https://d3ucjech6zwjp8.cloudfront...` | New live online training courses | `{'type': 'text/plain', 'language': None, 'base...` | 2019-09-04T12:10:00Z | (2019, 9, 4, 12, 10, 0, 2, 247, 0) |
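
The empty `author` cells in the row above suggest that not every entry carries every key. As a small sketch with hypothetical records, `pd.DataFrame` fills the gaps with `NaN` when it is built from dictionaries that have different keys:

```python
import pandas as pd

# Hypothetical entries; the second one has no 'author' key.
records = [
    {'title': 'Post A', 'author': 'Ann'},
    {'title': 'Post B'},
]

df = pd.DataFrame(records)

print(df)
# isna() flags the rows where the key was missing.
print(df['author'].isna().tolist())
```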


* After creating the data frame, let's convert the 'updated' column to a datetime format

In [275]:
# write the format string to match the 'updated' column (e.g. 2019-09-04T12:10:00Z)
date_format = '%Y-%m-%dT%H:%M:%SZ'

In [276]:
# create a new column 'datetime'
df_entries['datetime'] = pd.to_datetime(
    # call the pd.to_datetime passing the column and the format as an argument
    df_entries['updated'], format=date_format)
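
A quick way to check a format string against one sample value before applying it to the whole column is the standard library's `datetime.strptime`. The helper name `matches` is hypothetical, and pandas' own parser can be more or less strict than `strptime` depending on the version, so treat this as a sanity check only.

```python
from datetime import datetime

# Value shown in the 'updated' column above.
sample = '2019-09-04T12:10:00Z'


def matches(value, fmt):
    """Return True if value can be parsed with fmt (hypothetical helper)."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# A space between date and time does not match the literal 'T' in the sample.
print(matches(sample, '%Y-%m-%d %H:%M:%S'))

# Spelling out the literal 'T' and 'Z' matches the sample.
print(matches(sample, '%Y-%m-%dT%H:%M:%SZ'))
```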

In [277]:
# check the number of posts by month using the dt.month attribute
df_entries.datetime.dt.month.value_counts()

7    41
6    19
Name: datetime, dtype: int64

In [278]:
# check the number of posts by day of the week
# (dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it)
df_entries.datetime.dt.day_name().value_counts()

Monday       16
Wednesday    15
Thursday     13
Friday        9
Tuesday       7
Name: datetime, dtype: int64

* As we saw above, most of the page's posts were published on Mondays, followed by Wednesdays
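
`value_counts()` orders the result by frequency. To read the counts in calendar order instead, one option is to reindex against an explicit weekday list. A minimal sketch follows; the first timestamp comes from the feed output above, the others are made up for illustration.

```python
import pandas as pd

# The first timestamp comes from the feed output above; the others are made up.
stamps = pd.to_datetime(pd.Series([
    '2019-09-04T12:10:00Z',
    '2019-09-04T10:50:00Z',
    '2019-09-03T10:45:00Z',
    '2019-09-02T11:00:00Z',
]), utc=True)

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Count posts per weekday name, then put the rows in calendar order;
# days with no posts get a count of 0.
counts = (stamps.dt.day_name()
                .value_counts()
                .reindex(weekday_order, fill_value=0))

print(counts)
```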

In [279]:
# now let's check the time of publication
df_entries.datetime.dt.time.value_counts().head()

20:00:00    11
15:50:00     8
11:00:00     7
08:00:00     3
10:45:00     3
Name: datetime, dtype: int64

### 10. Count the number of entries per author and sort them in descending order.

In [280]:
# grouping the entries by author and calling the size()
n_entries = df_entries.groupby(['author']).size()

In [281]:
# sort the entries in descending order
n_entries = n_entries.sort_values(ascending=False)

In [282]:
n_entries.head()

author
Nat Torkington                           26
Ben Lorica                                5
Ben Lorica, Harish Doddi, David Talby     2
Jenn Webb                                 2
VM Brasseur                               1
dtype: int64

In [283]:
# the two steps above can be written in a single line
n_entries = df_entries.groupby(['author']).size().sort_values(ascending=False)

In [284]:
n_entries.head()

author
Nat Torkington                           26
Ben Lorica                                5
Ben Lorica, Harish Doddi, David Talby     2
Jenn Webb                                 2
VM Brasseur                               1
dtype: int64
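
One thing to keep in mind: by default `groupby` leaves out rows whose author is missing, so the counts may not add up to the number of entries. The sketch below, on a hypothetical four-row frame, compares that with `value_counts(dropna=False)`, which keeps the missing group.

```python
import pandas as pd

# Hypothetical authors; None stands for an entry without an author.
df = pd.DataFrame({
    'author': ['Nat Torkington', 'Ben Lorica', None, 'Nat Torkington'],
})

# groupby skips the row with the missing author.
grouped = df.groupby('author').size().sort_values(ascending=False)

# value_counts(dropna=False) keeps a separate count for missing authors.
with_missing = df['author'].value_counts(dropna=False)

print(grouped)
print(with_missing)
print(grouped.sum(), with_missing.sum())
```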

### 11. Add a new column to the data frame that contains the length (number of characters) of each entry title. Return a data frame that contains the title, author, and title length of each entry in descending order (longest title length at the top).

In [285]:
# Method 1: calling len() using the str method over the column 'title'
df_entries['len_title'] = df_entries['title'].str.len()

In [286]:
# Method 2: using apply() and lambda
df_entries['len_title'] = df_entries.title.apply(
    lambda x: len(x))

In [287]:
# Select the 'author', 'title', and 'len_title' columns
columns_df = ['author', 'title', 'len_title']

In [288]:
# subset the dataset with the columns_df list
new_df_entries = df_entries[columns_df]

In [290]:
new_df_entries.head()

|   | author | title | len_title |
|---|---|---|---|
| 0 | Nat Torkington | Four short links: 19 July 2019 | 30 |
| 1 | Adam Jacob | The war for the soul of open source | 35 |
| 2 | VM Brasseur | Ask not what Brands™ can do for you | 35 |
| 3 | Roger Magoulas | O’Reilly Radar: Open source technology trends—... | 68 |
| 4 |  | O'Reilly Open Source and Frank Willison Awards | 46 |


In [292]:
# sort df by 'len_title'
new_df_entries = new_df_entries.sort_values('len_title',
                                           ascending=False)

In [294]:
new_df_entries.head()

|   | author | title | len_title |
|---|---|---|---|
| 40 | Ben Lorica | RISELab’s AutoPandas hints at automation tech ... | 97 |
| 16 | Ben Lorica, Harish Doddi, David Talby | Managing machine learning in the enterprise: L... | 81 |
| 23 | Jenn Webb | Highlights from the O'Reilly Artificial Intell... | 79 |
| 11 | Mac Slocum | Highlights from the O'Reilly Open Source Softw... | 77 |
| 50 | Ben Lorica | Enabling end-to-end machine learning pipelines... | 73 |
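
The add-column, subset, and sort steps can also be chained into one expression. This is a sketch of an equivalent formulation on three title/author pairs taken from the output above; the variable name `result` is illustrative.

```python
import pandas as pd

# Three title/author pairs taken from the output above (None = no author).
df = pd.DataFrame({
    'title': ['Four short links: 19 July 2019',
              'The war for the soul of open source',
              "O'Reilly Open Source and Frank Willison Awards"],
    'author': ['Nat Torkington', 'Adam Jacob', None],
})

result = (df.assign(len_title=df['title'].str.len())        # add the length column
            .loc[:, ['title', 'author', 'len_title']]       # keep the requested columns
            .sort_values('len_title', ascending=False)      # longest title first
            .reset_index(drop=True))

print(result)
```

Using `assign` returns a new data frame instead of modifying `df` in place, which keeps the original intact for later cells.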


### 12. Create a list of entry titles whose summary includes the phrase "machine learning."

In [295]:
# create a boolean mask filtering elements with 'machine learning'
mask = df_entries['summary'].str.contains('machine learning')

In [298]:
# subset the dataset
titles_ml = df_entries[mask]

In [301]:
# the new dataset contains 12 rows
titles_ml.shape

(12, 16)

In [302]:
# convert the titles of the rows that contain 'machine learning' to a list
titles_ml = list(titles_ml['title'])

In [303]:
titles_ml

['Acquiring and sharing high-quality data',
 "Highlights from the O'Reilly Open Source Software Conference in Portland 2019",
 'Managing machine learning in the enterprise: Lessons from banking and health care',
 "Highlights from the O'Reilly Artificial Intelligence Conference in Beijing 2019",
 'The future of machine learning is tiny',
 'Tools for machine learning development',
 'New live online training courses',
 'RISELab’s AutoPandas hints at automation tech that will change the nature of software development',
 'AI and machine learning will require retraining your entire organization',
 'Enabling end-to-end machine learning pipelines in real-world applications',
 'What are model governance and model operations?',
 'The quest for high-quality data']
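
`str.contains` is case-sensitive and treats the pattern as a regular expression by default, and it returns missing values for rows with no summary. The sketch below, on hypothetical rows, shows the `case`, `regex`, and `na` parameters that make the filter more forgiving; whether that is desirable depends on how strictly the phrase should be matched.

```python
import pandas as pd

# Hypothetical rows; None stands for an entry without a summary.
df = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post C', 'Post D'],
    'summary': ['Notes on machine learning tools',
                'Machine Learning at scale',
                None,
                'Four short links'],
})

mask = df['summary'].str.contains('machine learning',
                                  case=False,   # match any capitalization
                                  regex=False,  # treat the phrase literally
                                  na=False)     # missing summaries count as no match

titles = df.loc[mask, 'title'].tolist()
print(titles)
```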