##Objective
Write a program to scrape news from the https://www.space.com/news web page. 

##Method:
Write a function that retrieves the space.com news page, extracts the news stories, and prints the **headline, author, synopsis, and date and time of each story.**

##The Scraping Process
To parse a website for data, we need to perform a few steps:

1. Identify the website where the data is located: [site](https://www.space.com/news)
2. Verify that scraping the data does not violate the site's policies. The site's `robots.txt` file lists the paths that crawlers should not request:

``` 
User-Agent: *
Disallow: /search.php
Disallow: /social.php
Disallow: /newsletter-signup
Disallow: /_proxy*
Disallow: /search

Sitemap: https://www.space.com/sitemap.xml 

```

3. Manually examine the website with Chrome Developer Tools to learn the HTML structure of the site and how the site requests its data, that is, how the URL is structured to get the data
4. Use the requests library to fetch the HTML code of the website
5. Use BeautifulSoup to parse the HTML and extract the content
6. Follow links to get more data (if necessary)
7. Store the data in a file for future analysis (if needed)
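
The policy check in step 2 can also be done in code. The sketch below uses the standard library's `urllib.robotparser` and assumes the rules quoted in step 2 are the site's complete `robots.txt`; the names `ROBOTS_TXT` and `is_allowed` are made up for this illustration. It parses the rules from a string, so it runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt in step 2 (assumed to be complete)
ROBOTS_TXT = """\
User-Agent: *
Disallow: /search.php
Disallow: /social.php
Disallow: /newsletter-signup
Disallow: /_proxy*
Disallow: /search

Sitemap: https://www.space.com/sitemap.xml
"""

def is_allowed(url, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules without fetching anything
    return parser.can_fetch(user_agent, url)

# The news page is not under any Disallow rule; the search page is
print(is_allowed("https://www.space.com/news"))
print(is_allowed("https://www.space.com/search"))
```

To check the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the rules before `can_fetch()` is called.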


In [0]:
import requests
url = 'https://www.space.com/news'
response = requests.get(url)

# test for a valid response
if response.ok:
  # get all data from the response
  data = response.text
  print(data)

<!DOCTYPE html>
<html lang="en" dir="ltr" data-locale="US">
<head>
<script>
if(window&&!("Promise"in window)){var xhr=new XMLHttpRequest;xhr.open("GET",'/media/shared/js/promise.js',false);xhr.onerror=function(){console.log(xhr.statusText)};xhr.send("");eval(xhr.responseText);}
</script> <script>window.usingBordeauxAds = true</script>
<!-- Consent Manager Tag : Stubbed (minified) (updated 2018-04-27T17:00) -->
<script type="text/javascript">
(function(){function d(){if(!window.frames.__cmpLocator)if(document.body&&document.body.firstChild){var a=document.body,c=document.createElement("iframe");c.style.display="none";c.height=c.width=0;c.name="__cmpLocator";a.insertBefore(c,a.firstChild)}else setTimeout(d,5)}function g(){var a=arguments;__cmp.a=__cmp.a||[];if(a.length)if("ping"===a[0])a[2]({gdprAppliesGlobally:!1,cmpLoaded:!1},!0);else __cmp.a.push([].slice.apply(a));else return __cmp.a}function b(a){var c="string"===typeof a.data,b=a.data;
if(c)try{b=JSON.parse(a.data)}catch(h){}if(b._

In [0]:
#let's import BeautifulSoup to make sense of the code

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html data-locale="US" dir="ltr" lang="en">
 <head>
  <script>
   if(window&&!("Promise"in window)){var xhr=new XMLHttpRequest;xhr.open("GET",'/media/shared/js/promise.js',false);xhr.onerror=function(){console.log(xhr.statusText)};xhr.send("");eval(xhr.responseText);}
  </script>
  <script>
   window.usingBordeauxAds = true
  </script>
  <!-- Consent Manager Tag : Stubbed (minified) (updated 2018-04-27T17:00) -->
  <script type="text/javascript">
   (function(){function d(){if(!window.frames.__cmpLocator)if(document.body&&document.body.firstChild){var a=document.body,c=document.createElement("iframe");c.style.display="none";c.height=c.width=0;c.name="__cmpLocator";a.insertBefore(c,a.firstChild)}else setTimeout(d,5)}function g(){var a=arguments;__cmp.a=__cmp.a||[];if(a.length)if("ping"===a[0])a[2]({gdprAppliesGlobally:!1,cmpLoaded:!1},!0);else __cmp.a.push([].slice.apply(a));else return __cmp.a}function b(a){var c="string"===typeof a.data,b=a.data;
if(c)try{b=JSON.parse(

###Here is a portion of the parsed HTML containing the info for one news article: title, synopsis, author, and date and time.

```
 <div class="content">
           <header>
            <h3 class="article-name">
             Blue Origin Kicks Off Kids' Space Club With Offer to Launch Postcards
            </h3>
            <p class="byline">
             <span class="by-author">
              By
              <span style="white-space:nowrap">
               Robert Z. Pearlman
              </span>
             </span>
             <time class="published-date relative-date" data-published-date="2019-05-11T12:38:01Z" datetime="2019-05-11T12:38:01Z">
             </time>
            </p>
           </header>
           <p class="synopsis">
            Jeff Bezos is adding another title to his credit: space postmaster.
           </p>
          </div>
         </article>
        </a>
            
```
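
Before running against the live page, it can help to try the extraction on this one excerpt. The sketch below is an illustration only: `SAMPLE_HTML` is the markup above with the stray closing tags left out, and `soup_sample` and `story` are names chosen here so that the notebook's `soup` variable is not overwritten.

```python
from bs4 import BeautifulSoup

# One story's markup, copied from the excerpt above
SAMPLE_HTML = """
<div class="content">
  <header>
    <h3 class="article-name">
      Blue Origin Kicks Off Kids' Space Club With Offer to Launch Postcards
    </h3>
    <p class="byline">
      <span class="by-author">
        By
        <span style="white-space:nowrap">
          Robert Z. Pearlman
        </span>
      </span>
      <time class="published-date relative-date" data-published-date="2019-05-11T12:38:01Z" datetime="2019-05-11T12:38:01Z">
      </time>
    </p>
  </header>
  <p class="synopsis">
    Jeff Bezos is adding another title to his credit: space postmaster.
  </p>
</div>
"""

soup_sample = BeautifulSoup(SAMPLE_HTML, "html.parser")
story = soup_sample.find(class_="content")

headline = story.find(class_="article-name").get_text(strip=True)
# The name sits in the nested <span>, so the leading "By" is skipped
author = story.find(class_="by-author").find("span").get_text(strip=True)
synopsis = story.find(class_="synopsis").get_text(strip=True)
# The timestamp is stored in an attribute, not in the element's text
published = story.find("time")["datetime"]

print(headline, author, synopsis, published, sep="\n")
```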

In [0]:
# first find all elements with class *content*
summary = soup.find_all(class_='content')

# let's review results
print(summary)

[<div class="content">
<header>
<h3 class="article-name">This NASA Spacecraft Will Crash into an Asteroid to Test Earth's Defenses Against Space Rocks</h3>
<p class="byline">
<span class="by-author">
By
<span style="white-space:nowrap">
Meghan Bartels </span>
</span>
<time class="published-date relative-date" data-published-date="2019-05-14T17:12:25Z" datetime="2019-05-14T17:12:25Z"></time>
</p>
</header>
<p class="synopsis">
NASA's first planetary defense mission is preparing to launch in June 2021, making sure all the pieces are in place for the spacecraft to successfully slam into the small "moon" of a binary asteroid.
</p>
</div>, <div class="content">
<header>
<h3 class="article-name">Mars' Moon Phobos Looks Sweet Enough to Eat in New Images</h3>
<p class="byline">
<span class="by-author">
By
<span style="white-space:nowrap">
Elizabeth Howell </span>
</span>
<time class="published-date relative-date" data-published-date="2019-05-14T17:10:55Z" datetime="2019-05-14T17:10:55Z"></time

In [0]:
#let's get all article titles

art_names = soup.find_all(class_='article-name')
print(art_names)

[<h3 class="article-name">This NASA Spacecraft Will Crash into an Asteroid to Test Earth's Defenses Against Space Rocks</h3>, <h3 class="article-name">Mars' Moon Phobos Looks Sweet Enough to Eat in New Images</h3>, <h3 class="article-name">Supermoon Looms Over Portuguese Palace in Dreamy Night-Sky Photo</h3>, <h3 class="article-name">Watch Live Now! See the 2019 Humans to Mars Summit</h3>, <h3 class="article-name">Humans to Mars Summit 2019 Launches in D.C. This Week: Watch It Live!</h3>, <h3 class="article-name">On This Day in Space! May 14, 1973: NASA Launches Skylab Space Station</h3>, <h3 class="article-name">How Earth Life Could Come Back from a Sterilizing Asteroid Impact</h3>, <h3 class="article-name">The Universe Probably 'Remembers' Every Single Gravitational Wave</h3>, <h3 class="article-name">'Apollo's Legacy': Space Historian Talks Lunar Science, Politics — and a Return to the Moon</h3>, <h3 class="article-name">NASA Names New Moon Landing Program Artemis After Apollo's Sis

In [0]:
# get a clean list of article names

titles = [name.get_text() for name in art_names]
print(titles)

["This NASA Spacecraft Will Crash into an Asteroid to Test Earth's Defenses Against Space Rocks", "Mars' Moon Phobos Looks Sweet Enough to Eat in New Images", 'Supermoon Looms Over Portuguese Palace in Dreamy Night-Sky Photo', 'Watch Live Now! See the 2019 Humans to Mars Summit', 'Humans to Mars Summit 2019 Launches in D.C. This Week: Watch It Live!', 'On This Day in Space! May 14, 1973: NASA Launches Skylab Space Station', 'How Earth Life Could Come Back from a Sterilizing Asteroid Impact', "The Universe Probably 'Remembers' Every Single Gravitational Wave", "'Apollo's Legacy': Space Historian Talks Lunar Science, Politics — and a Return to the Moon", "NASA Names New Moon Landing Program Artemis After Apollo's Sister", "Trump Proposes Extra $1.6 Billion for NASA's 2024 Return to Moon", "'Exhalation' Collection Will Expand Your Mind: A Q&A with Short Story Author Ted Chiang", 'Moonquakes Rattle the Moon as It Shrinks Like a Raisin', 'SpaceX to Launch 60 Satellites for Starlink Megacons

In [0]:
#let's get all article authors

art_auth = soup.find_all(class_='by-author')

print(art_auth)

[<span class="by-author">
By
<span style="white-space:nowrap">
Meghan Bartels </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Elizabeth Howell </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Miguel Claro </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
SPACE.com Staff </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Hanneke Weitering </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Hanneke Weitering </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Mike Wall </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Rafi Letzter </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Sarah Lewin </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Robert Z. Pearlman </span>
</span>, <span class="by-author">
By
<span style="white-space:nowra

In [0]:
# get a list of authors

authors = [tag.get_text().strip() for tag in art_auth]
print(authors)

['By\n\nMeghan Bartels', 'By\n\nElizabeth Howell', 'By\n\nMiguel Claro', 'By\n\nSPACE.com Staff', 'By\n\nHanneke Weitering', 'By\n\nHanneke Weitering', 'By\n\nMike Wall', 'By\n\nRafi Letzter', 'By\n\nSarah Lewin', 'By\n\nRobert Z. Pearlman', 'By\n\nMike Wall', 'By\n\nSarah Lewin', 'By\n\nCharles Q. Choi', 'By\n\nMike Wall', 'By\n\nMeghan Bartels', 'By\n\nScott Snowden', 'By\n\nScott Snowden', 'By\n\nJasmin Malik Chua', 'By\n\nHanneke Weitering', 'By\n\nElizabeth Howell']
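
Each string above still starts with `By` and two newlines. One possible way to remove that prefix without counting characters is sketched below; `clean_author` is a hypothetical helper, not part of BeautifulSoup.

```python
def clean_author(raw):
    """Return the author's name from byline text that may start with 'By'."""
    parts = raw.split()  # split on any whitespace, including the newlines
    if parts and parts[0] == "By":
        parts = parts[1:]  # drop the leading "By"
    return " ".join(parts)

print(clean_author("By\n\nMeghan Bartels"))
print(clean_author("By\n\nSPACE.com Staff"))
```

Because it splits on whitespace, the helper keeps working if the number of newlines after `By` changes, and it leaves a byline without the prefix untouched.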


In [0]:
# get all dates

days = soup.find_all(class_='published-date relative-date')
print(days)

[<time class="published-date relative-date" data-published-date="2019-05-14T17:12:25Z" datetime="2019-05-14T17:12:25Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T17:10:55Z" datetime="2019-05-14T17:10:55Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T17:09:46Z" datetime="2019-05-14T17:09:46Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T11:28:27Z" datetime="2019-05-14T11:28:27Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T11:23:25Z" datetime="2019-05-14T11:23:25Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T11:03:27Z" datetime="2019-05-14T11:03:27Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T10:52:30Z" datetime="2019-05-14T10:52:30Z"></time>, <time class="published-date relative-date" data-published-date="2019-05-14T10:51:42Z" datetime="2019-05-14T10:

In [0]:
# get a list of all dates
for timestamp in soup.find_all('time'):
  if timestamp.has_attr('datetime'):
    print(timestamp['datetime'])

2019-05-14T17:12:25Z
2019-05-14T17:10:55Z
2019-05-14T17:09:46Z
2019-05-14T11:28:27Z
2019-05-14T11:23:25Z
2019-05-14T11:03:27Z
2019-05-14T10:52:30Z
2019-05-14T10:51:42Z
2019-05-14T10:51:13Z
2019-05-14T10:48:54Z
2019-05-14T00:22:41Z
2019-05-13T20:36:29Z
2019-05-13T19:58:07Z
2019-05-13T19:46:58Z
2019-05-13T19:28:42Z
2019-05-13T19:27:06Z
2019-05-13T16:14:50Z
2019-05-13T16:11:14Z
2019-05-13T14:40:49Z
2019-05-13T11:22:33Z
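
The timestamps are in a fixed UTC format, so they can be turned into `datetime` objects instead of being handled as plain strings. This is a minimal sketch that assumes every value follows the `YYYY-MM-DDTHH:MM:SSZ` pattern seen above; `parse_timestamp` is a name invented for this example.

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    """Parse a UTC timestamp such as '2019-05-14T17:12:25Z' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)  # the trailing 'Z' denotes UTC

stamp = parse_timestamp("2019-05-14T17:12:25Z")
# Date and time can now be read separately
print(stamp.date().isoformat(), stamp.time().isoformat())
```

A parsed value raises `ValueError` if the page ever serves a timestamp in a different format, which string slicing would silently accept.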


In [0]:
# iterate over all space news

summary = soup.select('.content')
articles = []
for article in summary:
  name = article.select_one('.article-name').get_text()  # extract the article name
  author = article.select_one('.by-author').get_text().strip()  # extract the author, e.g. 'By\n\nMeghan Bartels'
  synopsis = article.select_one('.synopsis').get_text().strip()  # get the synopsis
  date = article.time.attrs['datetime']  # obtain date and time, e.g. '2019-05-14T17:12:25Z'

  # build dictionary: author[4:] drops the leading 'By\n\n';
  # the date slices split the timestamp at the 'T' and drop the trailing 'Z'
  record = {'title': name, 'author': author[4:], 'synopsis': synopsis, 'date': date[:10], 'time': date[11:-1]}
  articles.append(record)  # add dictionary to the list

print(articles)

[{'title': "This NASA Spacecraft Will Crash into an Asteroid to Test Earth's Defenses Against Space Rocks", 'author': 'Meghan Bartels', 'synopsis': 'NASA\'s first planetary defense mission is preparing to launch in June 2021, making sure all the pieces are in place for the spacecraft to successfully slam into the small "moon" of a binary asteroid.', 'date': '2019-05-14', 'time': '17:12:25'}, {'title': "Mars' Moon Phobos Looks Sweet Enough to Eat in New Images", 'author': 'Elizabeth Howell', 'synopsis': "New images of one of Mars' moons show a sweet surprise: It looks like candy. The moon, Phobos, shines like a rainbow-colored jawbreaker in new images from NASA's Mars Odyssey spacecraft.", 'date': '2019-05-14', 'time': '17:10:55'}, {'title': 'Supermoon Looms Over Portuguese Palace in Dreamy Night-Sky Photo', 'author': 'Miguel Claro', 'synopsis': 'In this dreamy moonlit scene, the Super Snow Moon rises over the Pena Palace (Palácio da Pena), a 19th-century, Romanticist hilltop castle ove

In [0]:
# wrap everything in a function (requires the requests and beautifulsoup4 packages)
import requests
from bs4 import BeautifulSoup

url = 'https://www.space.com/news'

def space_scraper(url):
  """Return a list of dictionaries, one per story on the news page."""
  response = requests.get(url)
  if not response.ok:
    print("Sorry, there is a problem with your HTTP request")
    return []  # keep the return type consistent on failure
  soup = BeautifulSoup(response.text, 'html.parser')
  articles = []
  for article in soup.select('.content'):
    name = article.select_one('.article-name').get_text()  # extract the article name
    author = article.select_one('.by-author').get_text().strip()  # extract the author
    synopsis = article.select_one('.synopsis').get_text().strip()  # get the synopsis
    timestamp = article.time.attrs['datetime']  # e.g. '2019-05-14T17:12:25Z'
    articles.append({'title': name, 'author': author[4:], 'synopsis': synopsis, 'date': timestamp[:10], 'time': timestamp[11:-1]})
  return articles

# print the stories, as the objective asks
print(space_scraper(url))

[{'title': "This NASA Spacecraft Will Crash into an Asteroid to Test Earth's Defenses Against Space Rocks", 'author': 'Meghan Bartels', 'synopsis': 'NASA\'s first planetary defense mission is preparing to launch in June 2021, making sure all the pieces are in place for the spacecraft to successfully slam into the small "moon" of a binary asteroid.', 'date': '2019-05-14', 'time': '17:12:25'}, {'title': "Mars' Moon Phobos Looks Sweet Enough to Eat in New Images", 'author': 'Elizabeth Howell', 'synopsis': "New images of one of Mars' moons show a sweet surprise: It looks like candy. The moon, Phobos, shines like a rainbow-colored jawbreaker in new images from NASA's Mars Odyssey spacecraft.", 'date': '2019-05-14', 'time': '17:10:55'}, {'title': 'Supermoon Looms Over Portuguese Palace in Dreamy Night-Sky Photo', 'author': 'Miguel Claro', 'synopsis': 'In this dreamy moonlit scene, the Super Snow Moon rises over the Pena Palace (Palácio da Pena), a 19th-century, Romanticist hilltop castle ove
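
The last step of the scraping process, storing the data in a file, could look like the sketch below. It uses the standard library's `csv` module; `write_articles_csv` and `FIELDS` are names made up for this example, and the single record is copied from the output above. An in-memory buffer stands in for a real file so the sketch runs without touching the disk.

```python
import csv
import io

# Column order for the CSV file; matches the keys of each article dictionary
FIELDS = ["title", "author", "synopsis", "date", "time"]

def write_articles_csv(articles, handle):
    """Write a list of article dictionaries to an open text file as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(articles)

# One record taken from the scraper output above
sample = [{
    "title": "Mars' Moon Phobos Looks Sweet Enough to Eat in New Images",
    "author": "Elizabeth Howell",
    "synopsis": "New images of one of Mars' moons show a sweet surprise: It looks like candy. "
                "The moon, Phobos, shines like a rainbow-colored jawbreaker in new images "
                "from NASA's Mars Odyssey spacecraft.",
    "date": "2019-05-14",
    "time": "17:10:55",
}]

buffer = io.StringIO()
write_articles_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open("space_news.csv", "w", newline="", encoding="utf-8")` in place of the buffer; the file name is only a placeholder.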