# Working with RSS Feeds Lab

Complete the following set of exercises to solidify your knowledge of parsing RSS feeds and extracting information from them.

In [3]:
%pip install feedparser

Collecting feedparser
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
     ---------------------------------------- 0.0/81.1 kB ? eta -:--:--
     ---------------------------------------- 81.1/81.1 kB 4.7 MB/s eta 0:00:00
Collecting sgmllib3k
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Installing collected packages: sgmllib3k, feedparser
  Running setup.py install for sgmllib3k: started
  Running setup.py install for sgmllib3k: finished with status 'done'
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0
Note: you may need to restart the kernel to use updated packages.




In [223]:
import feedparser
from datetime import datetime

### 1. Use feedparser to parse the following RSS feed URL.

In [51]:
url = 'https://www.nasa.gov/rss/dyn/onthestation_rss.rss'
nasa = feedparser.parse(url)

### 2. Obtain a list of components (keys) that are available for this feed.

In [19]:
nasa.keys()

dict_keys(['bozo', 'entries', 'feed', 'headers', 'href', 'status', 'encoding', 'version', 'namespaces'])

### 3. Obtain a list of components (keys) that are available for the *feed* component of this RSS feed.

In [27]:
nasa['feed'].keys()

dict_keys(['language', 'title', 'title_detail', 'subtitle', 'subtitle_detail', 'links', 'link'])

### 4. Extract and print the feed title, subtitle, author, and link.

I could not find `subtitle` or `author` keys in the entries, so I substituted `language` and `summary`.


In [103]:
extraction = {}
for i, entry in enumerate(nasa['entries']):
    title = entry['title']
    language = entry['title_detail']['language']
    link = entry['link']
    summ = entry['summary']
    extraction[i] = {'title': title, 'language': language, 'link': link, 'summary': summ}
for key, value in extraction.items():
    print(f"Entry {key}:")
    print(f"Title: {value['title']}")
    print(f"Language: {value['language']}")
    print(f"Link: {value['link']}")
    print(f"Summary: {value['summary']}")
    print()

Entry 0:
Title: Space Station Science Highlights: Week of April 24, 2023
Language: en
Link: http://www.nasa.gov/mission_pages/station/research/news/space-station-science-highlights-24apr23
Summary: Crew members aboard the International Space Station conducted scientific investigations during the week of April 24.

Entry 1:
Title: Space Station Science Highlights: Week of April 17, 2023
Language: en
Link: http://www.nasa.gov/mission_pages/station/research/news/space-station-science-highlights-17apr23
Summary: Crew members aboard the International Space Station conducted scientific investigations during the week of April 17 that included demonstrating a liquid-based carbon dioxide removal system, measuring eye changes during spaceflight, and offering students in Europe an opportunity to use computers on the space station for specific challenges.

Entry 2:
Title: Space Station Studies Help Monitor Climate Change
Language: en
Link: http://www.nasa.gov/mission_pages/station/research/news/Sp
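
The exercise asks for feed-level fields, which live under `nasa['feed']` rather than in the individual entries. A minimal sketch, assuming a dictionary shaped like `nasa['feed']` (the sample values are copied from this feed's output, and `feed_summary` is a hypothetical helper name), that reads each field with `.get()` so a missing key such as `author` does not raise a `KeyError`:

```python
# Sample dict shaped like nasa['feed']; values copied from this feed's output.
feed = {
    'language': 'en-us',
    'title': 'On the Station - Latest News',
    'subtitle': 'On the Station - Latest News',
    'link': 'http://www.nasa.gov/',
}

def feed_summary(feed, fields=('title', 'subtitle', 'author', 'link')):
    """Return the requested feed-level fields, using None for missing keys."""
    return {field: feed.get(field) for field in fields}

summary = feed_summary(feed)
for field, value in summary.items():
    shown = value if value is not None else 'not provided'
    print(f"{field.capitalize()}: {shown}")
```

With the real feed, `feed_summary(nasa['feed'])` should behave the same way, since feedparser results support dictionary-style access.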

### 5. Count the number of entries that are contained in this RSS feed.

In [73]:
len(nasa['entries'])

10

### 6. Obtain a list of components (keys) available for an entry.

*Hint: Remember to index first before requesting the keys*

In [111]:
# Index the first entry, then request its keys.
nasa['entries'][0].keys()

dict_keys(['title', 'title_detail', 'links', 'link', 'summary', 'summary_detail', 'id', 'guidislink', 'published', 'published_parsed', 'source'])

### 7. Extract a list of entry titles.

In [115]:
titles = [entry['title'] for entry in nasa['entries']]
titles

['Space Station Science Highlights: Week of April 24, 2023',
 'Space Station Science Highlights: Week of April 17, 2023',
 'Space Station Studies Help Monitor Climate Change',
 'NASA Teams Persevere Through Plant Challenges in Space',
 'Space Station Science Highlights: Week of April 10, 2023',
 "NASA, SpaceX's 27th Resupply Mission Returns Science Samples for Study",
 'Space Station Science Highlights: Week of April 3, 2023',
 'Space Station Science Highlights: Week of March 27, 2023',
 'Celebrating Women’s History Month: Female Space Station Crew Members',
 'Space Station Science Highlights: Week of March 20, 2023']

### 8. Calculate the percentage of "Four short links" entry titles.

In [118]:
nasa['entries'][0]

# To review: I don't understand the question...

{'title': 'Space Station Science Highlights: Week of April 24, 2023',
 'title_detail': {'type': 'text/plain',
  'language': 'en',
  'base': 'http://www.nasa.gov/',
  'value': 'Space Station Science Highlights: Week of April 24, 2023'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'http://www.nasa.gov/mission_pages/station/research/news/space-station-science-highlights-24apr23'},
  {'length': '2224270',
   'type': 'image/jpeg',
   'href': 'http://www.nasa.gov/sites/default/files/styles/1x1_cardfeed/public/thumbnails/image/canadarm.jpg?itok=vVHvKY0-',
   'rel': 'enclosure'}],
 'link': 'http://www.nasa.gov/mission_pages/station/research/news/space-station-science-highlights-24apr23',
 'summary': 'Crew members aboard the International Space Station conducted scientific investigations during the week of April 24.',
 'summary_detail': {'type': 'text/html',
  'language': 'en',
  'base': 'http://www.nasa.gov/',
  'value': 'Crew members aboard the International Space Stati
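
One way to read this exercise is: what share of entry titles start with a given phrase? A minimal sketch under that assumption, using the titles listed in exercise 7 (`percentage_with_prefix` is a hypothetical helper, not part of feedparser). None of this feed's titles start with "Four short links", so that phrase gives 0%, while the weekly highlights prefix gives a non-trivial result:

```python
titles = [
    'Space Station Science Highlights: Week of April 24, 2023',
    'Space Station Science Highlights: Week of April 17, 2023',
    'Space Station Studies Help Monitor Climate Change',
    'NASA Teams Persevere Through Plant Challenges in Space',
    'Space Station Science Highlights: Week of April 10, 2023',
    "NASA, SpaceX's 27th Resupply Mission Returns Science Samples for Study",
    'Space Station Science Highlights: Week of April 3, 2023',
    'Space Station Science Highlights: Week of March 27, 2023',
    'Celebrating Women’s History Month: Female Space Station Crew Members',
    'Space Station Science Highlights: Week of March 20, 2023',
]

def percentage_with_prefix(titles, prefix):
    """Return the percentage of titles that start with `prefix` (case-insensitive)."""
    if not titles:
        return 0.0
    matches = sum(title.lower().startswith(prefix.lower()) for title in titles)
    return 100 * matches / len(titles)

four_short = percentage_with_prefix(titles, 'Four short links')
highlights = percentage_with_prefix(titles, 'Space Station Science Highlights')
print(f"{four_short:.1f}%")   # 0.0%
print(f"{highlights:.1f}%")   # 60.0%
```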

### 9. Create a Pandas data frame from the feed's entries.

In [119]:
import pandas as pd

In [186]:
nasa['feed'].keys()

dict_keys(['language', 'title', 'title_detail', 'subtitle', 'subtitle_detail', 'links', 'link'])

Since I was not sure whether the exercise asks for the entries or the feed, I build a data frame for both.

In [191]:
def extraction_feed():
    # Gather the selected feed-level fields into a single row.
    include = ['language', 'title', 'subtitle', 'link']
    ext = {key: value for key, value in nasa['feed'].items() if key in include}
    return [ext]

In [194]:
def extraction_entry():
    result = []
    for entry in nasa['entries']:
        # Build a new dict per entry so that each row keeps its own values.
        ext = {
            'title': entry['title'],
            'link': entry['link'],
            'summ': entry['summary'],
            'published': entry['published'],
        }
        result.append(ext)
    return result

In [192]:
nasa_db_feed = pd.DataFrame(extraction_feed())
nasa_db_feed

| | language | title | subtitle | link |
|---|---|---|---|---|
| 0 | en-us | On the Station - Latest News | On the Station - Latest News | http://www.nasa.gov/ |

In [195]:
nasa_db_entry = pd.DataFrame(extraction_entry())
nasa_db_entry


### 10. Count the number of entries per author and sort them in descending order.

The entries have no `author` key, so I sort them by publication date instead, from oldest to most recent.

In [216]:
nasa['entries'][0]['published_parsed']

time.struct_time(tm_year=2023, tm_mon=4, tm_mday=28, tm_hour=18, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=118, tm_isdst=0)

In [225]:
def fechas():
    # The first three fields of published_parsed are year, month, and day.
    return [datetime(*entry['published_parsed'][:3]) for entry in nasa['entries']]

In [229]:
fechas()

[datetime.datetime(2023, 4, 28, 0, 0),
 datetime.datetime(2023, 4, 21, 0, 0),
 datetime.datetime(2023, 4, 21, 0, 0),
 datetime.datetime(2023, 4, 20, 0, 0),
 datetime.datetime(2023, 4, 14, 0, 0),
 datetime.datetime(2023, 4, 13, 0, 0),
 datetime.datetime(2023, 4, 7, 0, 0),
 datetime.datetime(2023, 3, 31, 0, 0),
 datetime.datetime(2023, 3, 28, 0, 0),
 datetime.datetime(2023, 3, 24, 0, 0)]

In [228]:
orden = sorted(fechas())
orden

[datetime.datetime(2023, 3, 24, 0, 0),
 datetime.datetime(2023, 3, 28, 0, 0),
 datetime.datetime(2023, 3, 31, 0, 0),
 datetime.datetime(2023, 4, 7, 0, 0),
 datetime.datetime(2023, 4, 13, 0, 0),
 datetime.datetime(2023, 4, 14, 0, 0),
 datetime.datetime(2023, 4, 20, 0, 0),
 datetime.datetime(2023, 4, 21, 0, 0),
 datetime.datetime(2023, 4, 21, 0, 0),
 datetime.datetime(2023, 4, 28, 0, 0)]
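
For feeds whose entries do carry an `author` key, the count this exercise asks for could be sketched with `collections.Counter`. The entries below are hypothetical sample data (this feed has no authors, so the names are invented), and `count_by_author` is an illustrative helper name:

```python
from collections import Counter

# Hypothetical entries shaped like feedparser entries; the author names are invented.
sample_entries = [
    {'title': 'Post A', 'author': 'Author One'},
    {'title': 'Post B', 'author': 'Author Two'},
    {'title': 'Post C', 'author': 'Author One'},
    {'title': 'Post D'},  # no author key, as in the NASA feed
]

def count_by_author(entries):
    """Count entries per author, most frequent first; missing authors count as 'Unknown'."""
    counts = Counter(entry.get('author', 'Unknown') for entry in entries)
    return counts.most_common()

author_counts = count_by_author(sample_entries)
print(author_counts)
```

Applied to `nasa['entries']`, this sketch would place every entry under `'Unknown'`, which confirms that the feed carries no author information.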

### 11. Add a new column to the data frame that contains the length (number of characters) of each entry title. Return a data frame that contains the title, author, and title length of each entry in descending order (longest title length at the top).
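
A minimal sketch of this exercise, assuming pandas is installed and using three of this feed's titles as sample data. Because the entries have no `author` key, the `author` column is filled with `None` via `.get()`; `title_length_frame` is a hypothetical helper name:

```python
import pandas as pd

# Sample entries shaped like feedparser entries; titles taken from this feed.
sample_entries = [
    {'title': 'Space Station Studies Help Monitor Climate Change'},
    {'title': 'NASA Teams Persevere Through Plant Challenges in Space'},
    {'title': 'Space Station Science Highlights: Week of April 24, 2023'},
]

def title_length_frame(entries):
    """Build a frame of title, author, and title length, longest titles first."""
    df = pd.DataFrame({
        'title': [entry['title'] for entry in entries],
        'author': [entry.get('author') for entry in entries],
    })
    # Number of characters in each title.
    df['title_length'] = df['title'].str.len()
    return df.sort_values('title_length', ascending=False).reset_index(drop=True)

lengths = title_length_frame(sample_entries)
print(lengths[['title', 'author', 'title_length']])
```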

### 12. Create a list of entry titles whose summary includes the phrase "machine learning."
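
A minimal sketch of this exercise, using two entries from this feed as sample data and a case-insensitive substring match. The helper name `titles_mentioning` is hypothetical. Neither sample summary mentions "machine learning", so that search returns an empty list; a second phrase is included to show a match:

```python
# Sample entries shaped like feedparser entries; text copied from this feed's output.
sample_entries = [
    {
        'title': 'Space Station Science Highlights: Week of April 24, 2023',
        'summary': 'Crew members aboard the International Space Station conducted '
                   'scientific investigations during the week of April 24.',
    },
    {
        'title': 'Space Station Science Highlights: Week of April 17, 2023',
        'summary': 'Crew members aboard the International Space Station conducted '
                   'scientific investigations during the week of April 17 that included '
                   'demonstrating a liquid-based carbon dioxide removal system, measuring '
                   'eye changes during spaceflight, and offering students in Europe an '
                   'opportunity to use computers on the space station for specific challenges.',
    },
]

def titles_mentioning(entries, phrase):
    """Return titles of entries whose summary contains `phrase` (case-insensitive)."""
    needle = phrase.lower()
    return [
        entry['title']
        for entry in entries
        if needle in entry.get('summary', '').lower()
    ]

ml_titles = titles_mentioning(sample_entries, 'machine learning')
co2_titles = titles_mentioning(sample_entries, 'carbon dioxide')
print(ml_titles)
print(co2_titles)
```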