# Working with RSS Feeds Lab

Complete the following set of exercises to solidify your knowledge of parsing RSS feeds and extracting information from them.

In [1]:
import feedparser

### 1. Use feedparser to parse the following RSS feed URL.

In [2]:
nytimes = feedparser.parse('https://feeds.simplecast.com/54nAGcIl')

In [3]:
print(nytimes['feed'])

{'links': [{'href': 'https://feeds.simplecast.com/54nAGcIl', 'rel': 'self', 'title': 'MP3 Audio', 'type': 'application/atom+xml'}, {'href': 'https://simplecast.superfeedr.com/', 'rel': 'hub', 'type': 'text/html'}, {'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.nytimes.com/the-daily'}], 'generator_detail': {'name': 'https://simplecast.com'}, 'generator': 'https://simplecast.com', 'title': 'The Daily', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://feeds.simplecast.com/54nAGcIl', 'value': 'The Daily'}, 'subtitle': 'This is what the news should sound like. The biggest stories of our time, told by the best journalists in the world. Hosted by Michael Barbaro and Sabrina Tavernise. Twenty minutes a day, five days a week, ready by 6 a.m.\n\nListen to this podcast in New York Times Audio, our new iOS app for news subscribers. Download now at nytimes.com/audioapp', 'subtitle_detail': {'type': 'text/html', 'language': None, 'base': 'https://feeds.simple

### 2. Obtain a list of components (keys) that are available for this feed.

In [4]:
nytimes.keys()

dict_keys(['bozo', 'entries', 'feed', 'headers', 'etag', 'updated', 'updated_parsed', 'href', 'status', 'encoding', 'version', 'namespaces'])
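
Several of these top-level keys describe the parse itself rather than the feed content. The sketch below collects a few of them into a small summary. The helper name `describe_parse_result` and the `sample_result` dictionary are hypothetical stand-ins so the snippet runs without network access; it only assumes the result behaves like a dictionary, as the `keys()` call above suggests.

```python
def describe_parse_result(result):
    """Summarise a few top-level fields of a feedparser-style result mapping."""
    return {
        # 'bozo' is truthy when the parser flagged the feed as not well-formed
        'well_formed': not result.get('bozo', False),
        # HTTP status code, present only when the feed was fetched over HTTP
        'status': result.get('status'),
        # Detected feed format, for example 'rss20'
        'version': result.get('version'),
        'entry_count': len(result.get('entries', [])),
    }


# Hypothetical stand-in for the object returned by feedparser.parse()
sample_result = {'bozo': False, 'status': 200, 'version': 'rss20', 'entries': [{}, {}]}

summary = describe_parse_result(sample_result)
print(summary)
```

With the parsed feed from this lab, the call would be `describe_parse_result(nytimes)`.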

### 3. Obtain a list of components (keys) that are available for the *feed* component of this RSS feed.

In [5]:
# The exercise asks for the keys of the *feed* component, not of the top-level result
feed_keys = list(nytimes.feed.keys())
feed_keys

### 4. Extract and print the feed title, subtitle, author, and link.

In [10]:
# Print each field followed by a blank line for readability
print(nytimes.feed.title)
print()
print(nytimes.feed.subtitle)
print()
print(nytimes.feed.author)
print()
print(nytimes.feed.link)

The Daily

This is what the news should sound like. The biggest stories of our time, told by the best journalists in the world. Hosted by Michael Barbaro and Sabrina Tavernise. Twenty minutes a day, five days a week, ready by 6 a.m.

Listen to this podcast in New York Times Audio, our new iOS app for news subscribers. Download now at nytimes.com/audioapp

The New York Times

https://www.nytimes.com/the-daily


### 5. Count the number of entries that are contained in this RSS feed.

In [14]:
len(nytimes.entries)

1902

### 6. Obtain a list of components (keys) available for an entry.

*Hint: Remember to index into the entries list before requesting the keys.*

In [17]:
nytimes.entries[0].keys()

dict_keys(['id', 'guidislink', 'title', 'title_detail', 'summary', 'summary_detail', 'published', 'published_parsed', 'authors', 'author', 'author_detail', 'links', 'link', 'content', 'itunes_title', 'itunes_duration', 'subtitle', 'subtitle_detail', 'itunes_explicit', 'itunes_episodetype'])

### 7. Extract a list of entry titles.

In [18]:
# Collect the title of every entry (empty string if an entry has no title)
titles = [entry.get('title', '') for entry in nytimes.entries]
titles

### 8. Calculate the percentage of "Four short links" entry titles.
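
No solution cell is present for this exercise. A minimal sketch of one approach follows, written as a plain-Python helper so it can be tried without the feed. The function name `percentage_with_phrase` and the `sample_titles` list are hypothetical. For this particular feed the result may well be 0 if no title contains the phrase.

```python
def percentage_with_phrase(titles, phrase):
    """Return the percentage (0-100) of titles that contain phrase, ignoring case."""
    if not titles:
        return 0.0
    needle = phrase.lower()
    matches = sum(1 for title in titles if needle in title.lower())
    return 100 * matches / len(titles)


# Hypothetical sample titles, used only to demonstrate the helper
sample_titles = [
    'Four short links: 1 May',
    'Radar trends',
    'Four Short Links: 2 May',
    'Other',
]

pct = percentage_with_phrase(sample_titles, 'Four short links')
print(f'{pct:.1f}%')
```

Applied to the feed parsed above, the call would be `percentage_with_phrase([e.get('title', '') for e in nytimes.entries], 'Four short links')`.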

### 9. Create a Pandas data frame from the feed's entries.

In [19]:
import pandas as pd

In [20]:
# Entry fields to keep as data frame columns
columns = [
    'id', 'guidislink', 'title', 'summary', 'published',
    'authors', 'author', 'links', 'link', 'content',
    'itunes_title', 'itunes_duration', 'subtitle',
    'itunes_explicit', 'itunes_episodetype',
]

# Build one row per entry, using an empty string for any missing field
data = [
    {column: entry.get(column, '') for column in columns}
    for entry in nytimes.entries
]

df = pd.DataFrame(data)

print(df)

                                                     id  guidislink  \
0                  3f6e5473-d26a-4c9c-bb97-6e17e5b00260       False   
1                  6a349a51-f9de-4fd0-820d-e8a9f137b199       False   
2                  b4282e9e-bbb4-43cb-80ca-49a76f0d9c2f       False   
3                  bacbf190-63d7-446c-a92e-4b15823e482f       False   
4                  3e818886-9ecf-4ca9-992c-ac45d931f420       False   
...                                                 ...         ...   
1897  gid://art19-episode-locator/V0/hEU3jczQz949Kcm...       False   
1898  gid://art19-episode-locator/V0/In2dGOFxV52Kl-p...       False   
1899  gid://art19-episode-locator/V0/1ft_DGtLLsokwij...       False   
1900  gid://art19-episode-locator/V0/8fJKNp6i648MNqY...       False   
1901  gid://art19-episode-locator/V0/dBmW5PqKxnbikwB...       False   

                                                  title  \
0                      The Secret History of Gun Rights   
1     Italy’s Giorgia Meloni 

### 10. Count the number of entries per author and sort them in descending order.

In [22]:
authors = []

for entry in nytimes.entries:
    authors.append(entry.get('author', ''))

author_series = pd.Series(authors)

entry_counts = author_series.value_counts().sort_values(ascending=False)

print(entry_counts)

The New York Times    1902
dtype: int64
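
The same count can be produced without pandas. The sketch below uses `collections.Counter` from the standard library, whose `most_common()` method returns pairs ordered from the highest count to the lowest. The helper name `count_by_author` and the `sample_entries` list are hypothetical and stand in for `nytimes.entries`.

```python
from collections import Counter


def count_by_author(entries):
    """Return (author, count) pairs sorted from most to fewest entries."""
    # Entries without an author are grouped under the empty string
    authors = (entry.get('author', '') for entry in entries)
    return Counter(authors).most_common()


# Hypothetical sample entries, used only to demonstrate the helper
sample_entries = [{'author': 'A'}, {'author': 'B'}, {'author': 'A'}, {}]

counts = count_by_author(sample_entries)
print(counts)
```

With the feed from this lab, the call would be `count_by_author(nytimes.entries)`.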


### 11. Add a new column to the data frame that contains the length (number of characters) of each entry title. Return a data frame that contains the title, author, and title length of each entry in descending order (longest title length at the top).

In [23]:
# Add the title length as a new column on the data frame built in exercise 9
df['title_length'] = df['title'].str.len()

# Keep only the requested columns, with the longest titles at the top
result_df = df[['title', 'author', 'title_length']].sort_values(by='title_length', ascending=False)

print(result_df)

                                                  title              author  \
49    Special Episode: A Crash Course in Dembow, a M...  The New York Times   
731   Bonus: The N-Word is Both Unspeakable and Ubiq...  The New York Times   
778   The Sunday Read: ‘The Amateur Cloud Society Th...  The New York Times   
455   The Sunday Read: ‘Animals That Infect Humans A...  The New York Times   
343   The Sunday Read: ‘How Houston Moved 25,000 Peo...  The New York Times   
...                                                 ...                 ...   
1015                                      A Glut in Oil  The New York Times   
1126                                      Year in Sound  The New York Times   
807                                       Hacked, Again  The New York Times   
379                                         One Million  The New York Times   
803                                             Delilah  The New York Times   

      title_length  
49             114  
731      

### 12. Create a list of entry titles whose summary includes the phrase "machine learning."

In [25]:

machinelearning = []

for entry in nytimes.entries:
    summary = entry.get('summary', '')
    if "machine learning" in summary.lower():
        machinelearning.append(entry.get('title', ''))

print(machinelearning)

[]
