
##Objective
Write a program to scrape news from the https://www.space.com/news web page. 

##Method:
Write a function that retrieves the space.com news page, extracts the news stories, and prints the **headline, author, synopsis, and date and time for each story.**

##The Scraping Process
To parse a website for data properly, we need to perform a few steps:

1. Identify the website where the data is located: [site](https://www.space.com/news)
2. Verify that scraping the data does not violate the site's policies, for example by reading its `robots.txt` file, shown here:

``` 
User-Agent: *
Disallow: /search.php
Disallow: /social.php
Disallow: /newsletter-signup
Disallow: /_proxy*
Disallow: /search

Sitemap: https://www.space.com/sitemap.xml 

```

3. Manually examine the website with Chrome Developer Tools to learn the HTML structure of the site and how the site requests its data, that is, how the URL is structured to get the data
4. Use the requests library to fetch the HTML code of the website
5. Use BeautifulSoup to parse the HTML and extract the content
6. Follow links to get more data (if necessary)
7. Store the data in a file for future analysis (if needed)
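
Step 2 can also be checked programmatically. The following is a minimal sketch, assuming the rules quoted above are still an accurate copy of the site's `robots.txt`. The `ROBOTS_TXT` constant and the `is_allowed` helper are illustrative names introduced here; the matching itself is done by the standard-library `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Copy of the rules quoted above (assumed to still match the live robots.txt).
ROBOTS_TXT = """\
User-Agent: *
Disallow: /search.php
Disallow: /social.php
Disallow: /newsletter-signup
Disallow: /_proxy*
Disallow: /search

Sitemap: https://www.space.com/sitemap.xml
"""

def is_allowed(url, user_agent="*"):
    """Return True if the quoted rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://www.space.com/news"))    # True
print(is_allowed("https://www.space.com/search"))  # False
```

The news page is not covered by any `Disallow` rule, so the check passes for `/news` and fails for `/search`.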


In [0]:
import requests
url = 'https://www.space.com/news'
response = requests.get(url)

#test for a valid response
if response.ok:
  #get all data from the response
  data = response.text
else:
  #stop here, otherwise `data` would be undefined in the next cell
  response.raise_for_status()
#to see what the HTML looks like, uncomment the print call below
#it is skipped for now to declutter the output
#print(data)
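
The fetch above could be made a little more defensive. This is only a sketch, assuming a 10-second timeout is acceptable. `fetch_html` and its `get` parameter are hypothetical names introduced here; the parameter exists only so the function can be tried with a stand-in that needs no network access.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Fetch url and return its body as text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without touching the network.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

With this helper, the cell would reduce to `data = fetch_html('https://www.space.com/news')`.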

In [0]:
#let's import BeautifulSoup to make sense of the code

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

#should we need to review the results, print function can be used
#print(soup.prettify())

###Here is a portion of the HTML containing one news article's info: title, author, date and time, and synopsis.

```
 <div class="content">
           <header>
            <h3 class="article-name">
             Blue Origin Kicks Off Kids' Space Club With Offer to Launch Postcards
            </h3>
            <p class="byline">
             <span class="by-author">
              By
              <span style="white-space:nowrap">
               Robert Z. Pearlman
              </span>
             </span>
             <time class="published-date relative-date" data-published-date="2019-05-11T12:38:01Z" datetime="2019-05-11T12:38:01Z">
             </time>
            </p>
           </header>
           <p class="synopsis">
            Jeff Bezos is adding another title to his credit: space postmaster.
           </p>
          </div>
         </article>
        </a>
            
```

In [3]:
#first find all elements with class "content"
summary = soup.find_all(class_='content')

#let's review the results
print(summary)

[<div class="content">
<header>
<h3 class="article-name">Solar Eclipse Glasses: Where to Buy the Best, High-Quality Eyewear</h3>
<p class="byline">
<span class="by-author">
By
<span style="white-space:nowrap">
Hanneke Weitering </span>
</span>
<time class="published-date relative-date" data-published-date="2019-06-25T20:44:15Z" datetime="2019-06-25T20:44:15Z"></time>
</p>
</header>
<p class="synopsis">
Whether you're looking for a new pair of eclipse glasses or you've already purchased some form of eye protection, here's what you need to know to avoid burning your eyes during the solar eclipse.
</p>
</div>, <div class="content">
<header>
<h3 class="article-name">One Week Until the Great South American Total Solar Eclipse!</h3>
<p class="byline">
<span class="by-author">
By
<span style="white-space:nowrap">
Joe Rao </span>
</span>
<time class="published-date relative-date" data-published-date="2019-06-25T20:21:30Z" datetime="2019-06-25T20:21:30Z"></time>
</p>
</header>
<p class="synopsi

In [4]:
#let's get all article titles

art_names = soup.find_all(class_='article-name')
print(art_names)

[<h3 class="article-name">Solar Eclipse Glasses: Where to Buy the Best, High-Quality Eyewear</h3>, <h3 class="article-name">One Week Until the Great South American Total Solar Eclipse!</h3>, <h3 class="article-name">Dairy Queen Whips Up 'Zero Gravity' Blizzard for Moon Landing 50th</h3>, <h3 class="article-name">How Apollo 8 Morphed Into a Moon Mission: Exclusive Clip </h3>, <h3 class="article-name">Watch 3 'BIRDS' Take Flight from the International Space Station</h3>, <h3 class="article-name">SpaceX Falcon Heavy Launch Spotted from Space (Photos)</h3>, <h3 class="article-name">Asteroid That's 3 Times As Long As a Football Field Will Whiz by Earth Thursday</h3>, <h3 class="article-name">Executive Order Could Reduce Number of NASA Advisory Committees</h3>, <h3 class="article-name">Pictures from Space! Our Image of the Day</h3>, <h3 class="article-name">On This Day in Space! June 25, 1997: Russian Cargo Craft Collides With Mir Space Station</h3>, <h3 class="article-name">July New Moon 20

In [5]:
# get a clean list of article names

titles = [name.get_text() for name in art_names]
print(titles)

['Solar Eclipse Glasses: Where to Buy the Best, High-Quality Eyewear', 'One Week Until the Great South American Total Solar Eclipse!', "Dairy Queen Whips Up 'Zero Gravity' Blizzard for Moon Landing 50th", 'How Apollo 8 Morphed Into a Moon Mission: Exclusive Clip ', "Watch 3 'BIRDS' Take Flight from the International Space Station", 'SpaceX Falcon Heavy Launch Spotted from Space (Photos)', "Asteroid That's 3 Times As Long As a Football Field Will Whiz by Earth Thursday", 'Executive Order Could Reduce Number of NASA Advisory Committees', 'Pictures from Space! Our Image of the Day', 'On This Day in Space! June 25, 1997: Russian Cargo Craft Collides With Mir Space Station', 'July New Moon 2019: Catch a Total Solar Eclipse, Bright Planets and Constellations ', 'Space Webcasts: Space Station Crew Landing and a SpaceX Falcon Heavy Launch', 'Boeing to Move Space Headquarters to Florida', 'China Launches Latest Beidou Satellite for Global Navigation System', 'SpaceX Falcon Heavy Rocket Lofts 24

In [6]:
#let's get all article authors

art_auth = soup.find_all(class_='by-author')

print(art_auth)

[<span class="by-author">
By
<span style="white-space:nowrap">
Hanneke Weitering </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Joe Rao </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Robert Z. Pearlman </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Elizabeth Howell </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Passant Rabie </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Mike Wall </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Laura Geggel </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Jeff Foust </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Hanneke Weitering </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
Hanneke Weitering </span>
</span>, <span class="by-author">
By
<span style="white-space:nowrap">
J

In [7]:
# get a list of authors (each entry still starts with the 'By' label)

authors = [tag.get_text().strip() for tag in art_auth]
print(authors)

['By\n\nHanneke Weitering', 'By\n\nJoe Rao', 'By\n\nRobert Z. Pearlman', 'By\n\nElizabeth Howell', 'By\n\nPassant Rabie', 'By\n\nMike Wall', 'By\n\nLaura Geggel', 'By\n\nJeff Foust', 'By\n\nHanneke Weitering', 'By\n\nHanneke Weitering', 'By\n\nJesse Emspak', 'By\n\nSPACE.com Staff', 'By\n\nJeff Foust', 'By\n\nAndrew Jones', 'By\n\nMeghan Bartels', 'By\n\nRobert Z. Pearlman', 'By\n\nMeghan Bartels', 'By\n\nHanneke Weitering', 'By\n\nStefano Coledan', 'By\n\nCharles Q. Choi']
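
Each author string above still carries the `By` label and the newlines that followed it in the HTML. Below is a small sketch of one way to remove the label without counting characters. `clean_author` is a hypothetical helper, not part of the notebook.

```python
def clean_author(raw):
    """Drop a leading 'By' label and collapse whitespace in a byline."""
    words = raw.split()  # split on any whitespace, including newlines
    if words and words[0].lower() == "by":
        words = words[1:]
    return " ".join(words)

print(clean_author("By\n\nHanneke Weitering"))  # Hanneke Weitering
print(clean_author("By\n\nSPACE.com Staff"))    # SPACE.com Staff
```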


In [8]:
# get all dates

days = soup.find_all(class_='published-date relative-date')
print(days)

[<time class="published-date relative-date" data-published-date="2019-06-25T20:44:15Z" datetime="2019-06-25T20:44:15Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T20:21:30Z" datetime="2019-06-25T20:21:30Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T19:30:08Z" datetime="2019-06-25T19:30:08Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T18:25:34Z" datetime="2019-06-25T18:25:34Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T18:20:58Z" datetime="2019-06-25T18:20:58Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T17:43:07Z" datetime="2019-06-25T17:43:07Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T16:32:22Z" datetime="2019-06-25T16:32:22Z"></time>, <time class="published-date relative-date" data-published-date="2019-06-25T16:29:25Z" datetime="2019-06-25T16:

In [9]:
# get a list of all dates
for timestamp in soup.find_all('time'):
  if timestamp.has_attr('datetime'):
    print(timestamp['datetime'])

2019-06-25T20:44:15Z
2019-06-25T20:21:30Z
2019-06-25T19:30:08Z
2019-06-25T18:25:34Z
2019-06-25T18:20:58Z
2019-06-25T17:43:07Z
2019-06-25T16:32:22Z
2019-06-25T16:29:25Z
2019-06-25T12:56:47Z
2019-06-25T12:36:14Z
2019-06-25T12:25:45Z
2019-06-25T12:01:01Z
2019-06-25T11:00:00Z
2019-06-25T11:00:00Z
2019-06-25T07:17:59Z
2019-06-25T03:02:36Z
2019-06-25T01:49:42Z
2019-06-25T01:40:40Z
2019-06-24T21:45:11Z
2019-06-24T21:30:53Z
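
The `datetime` attribute holds ISO 8601 timestamps in UTC (the trailing `Z`). As a sketch, the standard library can parse them into `datetime` objects, which makes the date and time parts available without string slicing. `parse_timestamp` is an illustrative name introduced here.

```python
from datetime import datetime, timezone

def parse_timestamp(stamp):
    """Parse a timestamp such as '2019-06-25T20:44:15Z' as UTC."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

published = parse_timestamp("2019-06-25T20:44:15Z")
print(published.date().isoformat())  # 2019-06-25
print(published.time().isoformat())  # 20:44:15
```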


In [10]:
# iterate over all space news

summary = soup.select('.content')
articles = []
for item in summary:

  name = item.select_one('.article-name').get_text() # extract the article name

  author = item.select_one('.by-author').get_text().strip() # extract the author, e.g. 'By\n\nJoe Rao'

  synopsis = item.select_one('.synopsis').get_text().strip() # get the synopsis

  date = item.time.attrs['datetime'] # obtain date and time, e.g. '2019-06-25T20:21:30Z'

  # build dictionary: author[4:] drops the leading 'By\n\n',
  # date[:10] is the date and date[11:-1] is the time without the trailing 'Z'
  article = {'title': name, 'author': author[4:], 'synopsis': synopsis, 'date': date[:10], 'time': date[11:-1]}

  articles.append(article) # add dictionary to the list

print(articles)

[{'title': 'Solar Eclipse Glasses: Where to Buy the Best, High-Quality Eyewear', 'author': 'Hanneke Weitering', 'synopsis': "Whether you're looking for a new pair of eclipse glasses or you've already purchased some form of eye protection, here's what you need to know to avoid burning your eyes during the solar eclipse.", 'date': '2019-06-25', 'time': '20:44:15'}, {'title': 'One Week Until the Great South American Total Solar Eclipse!', 'author': 'Joe Rao', 'synopsis': 'On Tuesday, July 2, a lot of ocean and a few tiny bits of land will lie under a moon-blackened sun.', 'date': '2019-06-25', 'time': '20:21:30'}, {'title': "Dairy Queen Whips Up 'Zero Gravity' Blizzard for Moon Landing 50th", 'author': 'Robert Z. Pearlman', 'synopsis': 'Eating it upside down is optional.', 'date': '2019-06-25', 'time': '19:30:08'}, {'title': 'How Apollo 8 Morphed Into a Moon Mission: Exclusive Clip ', 'author': 'Elizabeth Howell', 'synopsis': 'Apollo 8 was originally not supposed to go to the moon.', 'dat
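
The loop above assumes every `.content` block contains all four elements. `select_one` returns `None` when an element is missing, and calling `.get_text()` on `None` raises `AttributeError`. The following is a hedged sketch of a more tolerant extraction, run on a small hand-written HTML sample: the first card follows the markup shown earlier, while the second card (with no synopsis or time) is invented for illustration. `text_or_none` and `parse_card` are hypothetical helpers, and the `.by-author span` selector assumes the author name always sits in the nested `span`.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="content">
  <header>
    <h3 class="article-name">Blue Origin Kicks Off Kids' Space Club With Offer to Launch Postcards</h3>
    <p class="byline">
      <span class="by-author">By <span style="white-space:nowrap">Robert Z. Pearlman</span></span>
      <time class="published-date relative-date" datetime="2019-05-11T12:38:01Z"></time>
    </p>
  </header>
  <p class="synopsis">Jeff Bezos is adding another title to his credit: space postmaster.</p>
</div>
<div class="content">
  <header><h3 class="article-name">Invented card without synopsis or time</h3></header>
</div>
"""

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

def parse_card(card):
    """Build one article dictionary, tolerating missing elements."""
    time_tag = card.find("time")
    return {
        "title": text_or_none(card, ".article-name"),
        "author": text_or_none(card, ".by-author span"),
        "synopsis": text_or_none(card, ".synopsis"),
        "timestamp": time_tag.get("datetime") if time_tag is not None else None,
    }

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = [parse_card(card) for card in soup.select(".content")]
for card in cards:
    print(card)
```

Missing fields come back as `None`, so a later step can decide whether to skip such a card or keep it with gaps.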

In [11]:
# wrap everything in a function; the libraries are assumed to be imported already
# import requests
# from bs4 import BeautifulSoup

url = 'https://www.space.com/news'

def space_scraper(url):
  response = requests.get(url)
  if not response.ok:
    print("Sorry, there is a problem with your HTTP request")
    return [] # no articles could be retrieved
  soup = BeautifulSoup(response.text, 'html.parser')
  articles = []
  for item in soup.select('.content'):
    name = item.select_one('.article-name').get_text() # extract the article name
    author = item.select_one('.by-author').get_text().strip() # extract the author
    synopsis = item.select_one('.synopsis').get_text().strip() # get the synopsis
    timestamp = item.time.attrs['datetime'] # e.g. '2019-06-25T20:21:30Z'
    # author[4:] drops the leading 'By\n\n'; the slices split the timestamp into date and time
    article = {'title': name, 'author': author[4:], 'synopsis': synopsis, 'date': timestamp[:10], 'time': timestamp[11:-1]}
    articles.append(article)
  return articles # return the list so the caller can print or store it

print(space_scraper(url))

[{'title': 'Solar Eclipse Glasses: Where to Buy the Best, High-Quality Eyewear', 'author': 'Hanneke Weitering', 'synopsis': "Whether you're looking for a new pair of eclipse glasses or you've already purchased some form of eye protection, here's what you need to know to avoid burning your eyes during the solar eclipse.", 'date': '2019-06-25', 'time': '20:44:15'}, {'title': 'One Week Until the Great South American Total Solar Eclipse!', 'author': 'Joe Rao', 'synopsis': 'On Tuesday, July 2, a lot of ocean and a few tiny bits of land will lie under a moon-blackened sun.', 'date': '2019-06-25', 'time': '20:21:30'}, {'title': "Dairy Queen Whips Up 'Zero Gravity' Blizzard for Moon Landing 50th", 'author': 'Robert Z. Pearlman', 'synopsis': 'Eating it upside down is optional.', 'date': '2019-06-25', 'time': '19:30:08'}, {'title': 'How Apollo 8 Morphed Into a Moon Mission: Exclusive Clip ', 'author': 'Elizabeth Howell', 'synopsis': 'Apollo 8 was originally not supposed to go to the moon.', 'dat
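
The process described at the start ends with storing the data in a file for future analysis. One option is a CSV file. Below is a minimal sketch using only the standard library. `save_articles`, `FIELDS`, and the file name are assumptions introduced here, and the two sample rows are copied from the output above.

```python
import csv
import os
import tempfile

FIELDS = ["title", "author", "synopsis", "date", "time"]

def save_articles(articles, path):
    """Write a list of article dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(articles)

sample = [
    {"title": "One Week Until the Great South American Total Solar Eclipse!",
     "author": "Joe Rao",
     "synopsis": "On Tuesday, July 2, a lot of ocean and a few tiny bits of land will lie under a moon-blackened sun.",
     "date": "2019-06-25", "time": "20:21:30"},
    {"title": "Dairy Queen Whips Up 'Zero Gravity' Blizzard for Moon Landing 50th",
     "author": "Robert Z. Pearlman",
     "synopsis": "Eating it upside down is optional.",
     "date": "2019-06-25", "time": "19:30:08"},
]

# Written to a temporary directory so the sketch does not clutter the working directory.
path = os.path.join(tempfile.mkdtemp(), "space_news.csv")
save_articles(sample, path)

# Read the file back to confirm the rows survived the round trip.
with open(path, newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
print(len(rows), rows[0]["author"])  # 2 Joe Rao
```

In the notebook itself, the list returned by the scraping code would take the place of `sample`.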