# Working with RSS Feeds Lab

Complete the following set of exercises to solidify your knowledge of parsing RSS feeds and extracting information from them.

In [1]:
import feedparser

### 1. Use feedparser to parse the following RSS feed URL.

In [2]:
url = 'http://feeds.feedburner.com/oreilly/radar/atom'

In [3]:
reddit=feedparser.parse(url)

### 2. Obtain a list of components (keys) that are available for this feed.

In [4]:
reddit.keys()

dict_keys(['feed', 'entries', 'bozo', 'headers', 'etag', 'updated', 'updated_parsed', 'href', 'status', 'encoding', 'version', 'namespaces'])

### 3. Obtain a list of components (keys) that are available for the *feed* component of this RSS feed.

In [5]:
reddit.feed.keys()

dict_keys(['title', 'title_detail', 'links', 'link', 'subtitle', 'subtitle_detail', 'updated', 'updated_parsed', 'language', 'sy_updateperiod', 'sy_updatefrequency', 'generator_detail', 'generator', 'feedburner_info', 'geo_lat', 'geo_long', 'feedburner_emailserviceid', 'feedburner_feedburnerhostname'])

### 4. Extract and print the feed title, subtitle, author, and link.

In [12]:
print(f'title: {reddit.feed.title}, subtitle: {reddit.feed.subtitle}, author: {reddit.feed.get("author", "unknown")}, link: {reddit.feed.link}')

title: Radar, subtitle: Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology, author: unknown, link: https://www.oreilly.com/radar


### 5. Count the number of entries that are contained in this RSS feed.

In [19]:
print(f'the number of entries is {len(reddit.entries)}')

the number of entries is 60


### 6. Obtain a list of components (keys) available for an entry.

*Hint: Remember to index first before requesting the keys*

In [34]:
list(reddit.entries[0].keys())

['title',
 'title_detail',
 'links',
 'link',
 'comments',
 'published',
 'published_parsed',
 'authors',
 'author',
 'author_detail',
 'tags',
 'id',
 'guidislink',
 'summary',
 'summary_detail',
 'content',
 'wfw_commentrss',
 'slash_comments',
 'feedburner_origlink']

### 7. Extract a list of entry titles.

In [40]:
titles = [entry.title for entry in reddit.entries]
titles

['Four Short Links: 19 August 2020',
 'Why Best-of-Breed is a Better Choice than All-in-One Platforms for Data Science',
 'Four short links: 14 August 2020',
 'The Least Liked Programming Languages',
 'Four short links: 11 Aug 2020',
 'Four short links: 7 Aug 2020',
 'Four short links: 5 August 2020',
 'Radar trends to watch: August 2020',
 'Four short links: 31 July 2020',
 'Four short links: 30 July 2020',
 'Four short links: 29 July 2020',
 'Bringing an AI Product to Market',
 'Power, Harms, and Data',
 'Four short links: 27 July 2020',
 'Four short links: 24 July 2020',
 'Four short links: 26 July 2020',
 'Four short links: 22 July 2020',
 'AI, Protests, and Justice',
 'Four short links: 21 July 2020',
 'Four short links: 20 July 2020',
 'Four short links: 17 July 2020',
 'Four short links: 16 July 2020',
 'Microservices Adoption in 2020',
 'Four short links: 15 July 2020',
 'Society-Centered Design',
 'Four short links: 14 July 2020',
 'Four short links: 13 July 2020',
 'Four shor

### 8. Calculate the percentage of "Four short links" entry titles.

In [41]:
phrase = 'Four short links'
# Case-sensitive match: 'Four Short Links: 19 August 2020' is not counted.
count = sum(phrase in title for title in titles)
print(f'The percentage is: {(count/len(titles))*100}%')

The percentage is: 73.33333333333333%
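
The `in` test is case-sensitive, so a title such as 'Four Short Links: 19 August 2020' is not matched by the phrase 'Four short links'. The sketch below shows a case-insensitive variant. The helper name `phrase_share` is hypothetical, and the four sample titles are copied from the output of exercise 7 rather than the full feed.

```python
def phrase_share(titles, phrase):
    """Return the percentage of titles that contain phrase, ignoring case."""
    if not titles:
        return 0.0
    needle = phrase.casefold()
    hits = sum(needle in title.casefold() for title in titles)
    return 100 * hits / len(titles)


# Four titles taken from the exercise 7 output (not the full feed).
sample_titles = [
    'Four Short Links: 19 August 2020',
    'Why Best-of-Breed is a Better Choice than All-in-One Platforms for Data Science',
    'Four short links: 14 August 2020',
    'The Least Liked Programming Languages',
]

print(f"{phrase_share(sample_titles, 'Four short links'):.1f}%")  # prints 50.0%
```

A plain case-sensitive count over the same four titles would find only one match, or 25%.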


### 9. Create a Pandas data frame from the feed's entries.

In [46]:
import pandas as pd
import re

In [47]:
entries=pd.DataFrame(titles, columns=['Titles'])
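
The frame above holds only the titles. If more fields are wanted up front, one option is to build a row per entry, as in the sketch below. It uses plain dictionaries as stand-ins for feed entries, and the sample data and chosen field names are illustrative assumptions. Using `.get` leaves a missing field empty instead of raising an error.

```python
import pandas as pd

# Plain dicts standing in for feed entries (hypothetical sample data).
sample_entries = [
    {'title': 'Four short links: 1 January 2020',
     'author': 'A. Writer',
     'summary': 'Notes on machine learning and tooling'},
    {'title': 'A Longer Essay',
     'summary': 'An essay about software architecture'},  # no author field
]

fields = ['title', 'author', 'summary']

# One row per entry; .get() yields None when a field is absent.
rows = [{field: entry.get(field) for field in fields} for entry in sample_entries]
df = pd.DataFrame(rows, columns=fields)

print(df.shape)
print(df['author'].isna().sum())  # number of entries without an author
```

Because feedparser entries are dictionary-like, the same comprehension should also work over `reddit.entries`.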

### 10. Count the number of entries per author and sort them in descending order.

In [74]:
entries['Authors'] = [entry.authors[0]['name'] for entry in reddit.entries]
# value_counts() sorts by count in descending order by default.
entries['Authors'].value_counts()

Nat Torkington                                      45
Mike Loukides                                        9
Justin Norman, Peter Skomoroch and Mike Loukides     1
Sarah Gold                                           1
Matthew Rocklin and Hugo Bowne-Anderson              1
Adam Jacob, Nat Torkington and Mike Loukides         1
Hugo Bowne-Anderson                                  1
Mike Loukides and Steve Swoyer                       1
Name: Authors, dtype: int64
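
The counts above treat a combined byline such as 'Mike Loukides and Steve Swoyer' as one author. If per-person counts are wanted, the sketch below shows one possible approach. The helper `split_byline` is hypothetical, it assumes names are separated only by commas and the word "and", and the sample bylines are a small subset copied from the output above.

```python
import re
from collections import Counter


def split_byline(byline):
    """Split a byline like 'A, B and C' into individual names."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", byline)
    return [part.strip() for part in parts if part.strip()]


# A few bylines taken from the value_counts() output (not the full feed).
bylines = [
    'Nat Torkington',
    'Mike Loukides',
    'Adam Jacob, Nat Torkington and Mike Loukides',
    'Mike Loukides and Steve Swoyer',
]

per_author = Counter(name for byline in bylines for name in split_byline(byline))

# most_common() lists authors from the highest count to the lowest.
for name, count in per_author.most_common():
    print(name, count)
```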

### 11. Add a new column to the data frame that contains the length (number of characters) of each entry title. Return a data frame that contains the title, author, and title length of each entry in descending order (longest title length at the top).

In [80]:
entries['Characters'] = entries['Titles'].str.len()
entries.sort_values('Characters', ascending=False)

| | Titles | Authors | Characters |
|---|---|---|---|
| 1 | Why Best-of-Breed is a Better Choice than All-... | Matthew Rocklin and Hugo Bowne-Anderson | 79 |
| 28 | Automated Coding and the Future of Programming | Mike Loukides | 46 |
| 53 | Machine Learning and the Production Gap | Mike Loukides | 39 |
| 3 | The Least Liked Programming Languages | Mike Loukides | 37 |
| 47 | Decision-Making in a Time of Crisis | Hugo Bowne-Anderson | 35 |
| 7 | Radar trends to watch: August 2020 | Mike Loukides | 34 |
| 0 | Four Short Links: 19 August 2020 | Nat Torkington | 32 |
| 11 | Bringing an AI Product to Market | Justin Norman, Peter Skomoroch and Mike Loukides | 32 |
| 2 | Four short links: 14 August 2020 | Nat Torkington | 32 |
| 35 | Radar trends to watch: July 2020 | Mike Loukides | 32 |


### 12. Create a list of entry titles whose summary includes the phrase "machine learning."

In [85]:
phrase = 'machine learning'
# Case-sensitive match against each entry's summary text.
ml_titles = [entry.title for entry in reddit.entries if phrase in entry.summary]
ml_titles

['Four short links: 8 July 2020', 'Machine Learning and the Production Gap']
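
The search above is case-sensitive, so a summary that says "Machine Learning" would be skipped. If the summaries were stored in the data frame, pandas could do a case-insensitive search instead, as sketched below. The `Summary` column and the sample rows are assumptions made for illustration. The lab's `entries` frame does not contain that column.

```python
import pandas as pd

# Hypothetical frame with a Summary column; one summary is missing.
sample = pd.DataFrame({
    'Titles': ['Post A', 'Post B', 'Post C'],
    'Summary': ['All about Machine Learning',
                'notes on machine learning ops',
                None],
})

# case=False ignores letter case; regex=False treats the phrase literally;
# na=False counts a missing summary as "no match".
mask = sample['Summary'].str.contains('machine learning',
                                      case=False, regex=False, na=False)
matches = sample.loc[mask, 'Titles'].tolist()
print(matches)
```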