<h1>Step 1 - Scraping</h1>

Complete your initial scraping using Jupyter Notebook, BeautifulSoup, Pandas, and Requests/Splinter.

<h2>NASA Mars News</h2>

* Scrape the NASA Mars News Site and collect the latest News Title and Paragraph Text.
* Assign the text to variables that you can reference later.

In [1]:
# Import dependencies
from bs4 import BeautifulSoup as bs
import requests

# Scrape HTML from NASA website
url = 'https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest'
response = requests.get(url)
parsed = bs(response.text, 'html.parser')
print(parsed.prettify())

<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <!-- Always force latest IE rendering engine or request Chrome Frame -->
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"5e33925808","applicationID":"59562082","transactionName":"JVcPR0MLWApSRU1eAQVVEhxSC1oSUlkWbBMHXwRAHhdcCUA=","queueTime":0,"applicationTime":226,"agent":""}
  </script>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={xpid:"VQcPUlZTDxAFXVRUBQEPVA=="};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0

In [2]:
# Find and save titles and descriptions into lists
news_title_list = []
news_p_list = []

for div in parsed.find_all('div', class_ = 'slide'):
    news_title = div.find('div', class_ = 'content_title').text.strip()
    news_p = div.find('div', class_ = 'rollover_description_inner').text.strip()
    news_title_list.append(news_title)
    news_p_list.append(news_p)

In [3]:
# Check lists of news titles and descriptions
for news_title, news_p in zip(news_title_list, news_p_list):
    print(news_title, '\n', news_p, '\n')

NASA's InSight Places First Instrument on Mars 
 In deploying its first instrument onto the surface of Mars, the lander completes a major mission milestone. 

NASA Announces Landing Site for Mars 2020 Rover 
 After a five-year search, NASA has chosen Jezero Crater as the landing site for its upcoming Mars 2020 rover mission. 

Opportunity Hunkers Down During Dust Storm 
 It's the beginning of the end for the planet-encircling dust storm on Mars. But it could still be weeks, or even months, before skies are clear enough for NASA's Opportunity rover to recharge its batteries and phone home. 

NASA Finds Ancient Organic Material, Mysterious Methane on Mars 
 NASA’s Curiosity rover has found evidence on Mars with implications for NASA’s search for life. 

NASA Invests in Visionary Technology 
 NASA is investing in technology concepts, including several from JPL, that may one day be used for future space exploration missions. 

NASA is Ready to Study the Heart of Mars 
 NASA is about to go 

<h2>JPL Mars Space Images - Featured Image</h2>

* Use Splinter to navigate the site and find the image URL for the current Featured Mars Image.
* Assign the URL string to a variable called featured_image_url.

In [4]:
# Scrape HTML from JPL Mars Space Images
jplmars_url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
response = requests.get(jplmars_url)
parsed_jplmars = bs(response.text, 'html.parser')
# print(parsed_jplmars.prettify())

In [5]:
# Find and save featured image url
# (Splinter is not used here: geckodriver, the Selenium driver it relies on,
# was blocked by my macOS security settings, so the page is fetched with Requests.)
for a in parsed_jplmars.find_all('a', class_ = 'button fancybox'):
    featured_image_url = 'https://www.jpl.nasa.gov' + a.get('data-fancybox-href')
    print(featured_image_url)

https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA17254_ip.jpg
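
Concatenating the site root with the scraped attribute works as long as `data-fancybox-href` is a site-relative path. As a minimal sketch, assuming the attribute could also hold an absolute URL, `urllib.parse.urljoin` from the standard library handles both cases. The helper name `build_image_url` and the constant `JPL_BASE_URL` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

# Site root used to resolve relative links (same value as in the cell above).
JPL_BASE_URL = 'https://www.jpl.nasa.gov'


def build_image_url(href, base=JPL_BASE_URL):
    """Resolve a scraped href against the site root.

    Relative paths are joined to the base; absolute URLs are returned unchanged.
    """
    return urljoin(base, href)


featured_image_url = build_image_url('/spaceimages/images/mediumsize/PIA17254_ip.jpg')
print(featured_image_url)
```

In the loop above, this would replace the string concatenation with `build_image_url(a.get('data-fancybox-href'))`.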


<h2>Mars Weather</h2>

* Visit the Mars Weather Twitter page and scrape the latest Mars weather tweet from the page.
* Save the tweet text for the weather report as a variable called mars_weather.

In [6]:
# Scrape HTML from Mars Weather's Twitter Page
twitter_url = 'https://twitter.com/marswxreport?lang=en'
response = requests.get(twitter_url)
parsed_twitter = bs(response.text, 'html.parser')

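# Scrape the latest Mars weather tweet from the page.
# find() returns only the first matching <p>, i.e. the most recent tweet;
# its first child node holds the tweet text.
latest_tweet = parsed_twitter.find('p', class_='TweetTextSize TweetTextSize--normal js-tweet-text tweet-text')
mars_weather = str(latest_tweet.contents[0])
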
print(mars_weather)

Sol 2299 (2019-01-24), high -5C/23F, low -74C/-101F, pressure at 8.18 hPa, daylight 06:46-18:55
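
The weather report is saved as one string. If the individual readings are needed later, for example to display them separately, a regular expression can split them out. This is only a sketch: the pattern assumes every tweet follows the format of the sample output above, and the function name `parse_weather` and the dictionary keys are hypothetical.

```python
import re

# Pattern written against the sample tweet above; other tweet formats will not match.
WEATHER_PATTERN = re.compile(
    r'Sol (?P<sol>\d+) \((?P<date>\d{4}-\d{2}-\d{2})\), '
    r'high (?P<high_c>-?\d+)C/(?P<high_f>-?\d+)F, '
    r'low (?P<low_c>-?\d+)C/(?P<low_f>-?\d+)F, '
    r'pressure at (?P<pressure_hpa>[\d.]+) hPa'
)


def parse_weather(tweet_text):
    """Return the readings as a dict, or None if the text does not match."""
    match = WEATHER_PATTERN.search(tweet_text)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        'sol': int(fields['sol']),
        'date': fields['date'],
        'high_c': int(fields['high_c']),
        'low_c': int(fields['low_c']),
        'pressure_hpa': float(fields['pressure_hpa']),
    }


sample = ('Sol 2299 (2019-01-24), high -5C/23F, low -74C/-101F, '
          'pressure at 8.18 hPa, daylight 06:46-18:55')
weather_fields = parse_weather(sample)
print(weather_fields)
```

Returning `None` for unmatched text lets the caller fall back to the raw `mars_weather` string when the latest tweet is not a weather report.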


<h2>Mars Facts</h2>

* Visit the Mars Facts webpage and use Pandas to scrape the table containing facts about the planet, including Diameter, Mass, etc.
* Use Pandas to convert the data to an HTML table string.

In [7]:
# Import Pandas
import pandas as pd

# Scrape table from Mars Facts using Pandas
spacefacts_url = 'https://space-facts.com/mars/'
tables = pd.read_html(spacefacts_url)
df = tables[0]
df

Unnamed: 0,0,1
0,Equatorial Diameter:,"6,792 km"
1,Polar Diameter:,"6,752 km"
2,Mass:,6.42 x 10^23 kg (10.7% Earth)
3,Moons:,2 (Phobos & Deimos)
4,Orbit Distance:,"227,943,824 km (1.52 AU)"
5,Orbit Period:,687 days (1.9 years)
6,Surface Temperature:,-153 to 20 °C
7,First Record:,2nd millennium BC
8,Recorded By:,Egyptian astronomers
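
The scraped frame carries the integer column labels 0 and 1 plus a numeric index, and both end up in any HTML generated from it. A minimal sketch of one way to tidy this, assuming the column names `description` and `value` (they are illustrative, not from the source page), and using two rows copied from the output above so the example runs without network access:

```python
import pandas as pd

# Two rows copied from the table above stand in for the scraped frame.
facts_df = pd.DataFrame([
    ['Equatorial Diameter:', '6,792 km'],
    ['Polar Diameter:', '6,752 km'],
])

# Replace the default 0/1 labels with descriptive (hypothetical) names.
facts_df.columns = ['description', 'value']

# index=False drops the numeric row index from the rendered table.
facts_html = facts_df.to_html(index=False)
print(facts_html)
```

The same two steps can be applied to `df` before it is converted to an HTML string.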


In [8]:
# Use Pandas to convert the data to an HTML table string
html_table_str = df.to_html()
print(html_table_str)

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Equatorial Diameter:</td>
      <td>6,792 km</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Polar Diameter:</td>
      <td>6,752 km</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Mass:</td>
      <td>6.42 x 10^23 kg (10.7% Earth)</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Moons:</td>
      <td>2 (Phobos &amp; Deimos)</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Orbit Distance:</td>
      <td>227,943,824 km (1.52 AU)</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Orbit Period:</td>
      <td>687 days (1.9 years)</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Surface Temperature:</td>
      <td>-153 to 20 °C</td>
    </tr>
    <tr>
      <th>7</th>
      <td>First Record:</td>
      <td>2nd millennium BC</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Recorde

<h2>Mars Hemispheres</h2>

* Visit planetary.org to obtain high-resolution images for each of Mars's hemispheres.
* Save both the image URL string for the full-resolution hemisphere image and the hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys img_url and title.
* Append the dictionary with the image URL string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.

In [9]:
# Scrape HTML from planetary.org
hemispheres_url = 'http://www.planetary.org/blogs/guest-blogs/bill-dunford/20140203-the-faces-of-mars.html'
response = requests.get(hemispheres_url)
parsed_hemisphere = bs(response.text, 'html.parser')

hemisphere_image_urls = []

for img in parsed_hemisphere.find_all('img', class_ = 'img840'):
    
    hemisphere_title = img.get('alt')
    hemisphere_url = img.get('src')
    
    new_dict = {
        'title': hemisphere_title,
        'img_url': hemisphere_url
    }

    hemisphere_image_urls.append(new_dict)

hemisphere_image_urls

[{'title': 'Mars: Valles Marineris Hemisphere',
  'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_valles_marineris_enhanced_f840.jpg'},
 {'title': 'Mars: Syrtis Major Hemisphere',
  'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_syrtis_major_enhanced_f840.jpg'},
 {'title': 'Mars: Cerberus Hemisphere ',
  'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_cerberus_enhanced_f840.jpg'},
 {'title': 'Mars: Schiaparelli Hemisphere ',
  'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_schiaparelli_enhanced_f840.jpg'}]
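
Two of the scraped titles above end with a trailing space, which comes straight from the `alt` attribute. A small sketch of a cleanup pass, assuming the list keeps the `title` and `img_url` keys used above; the function name `clean_hemispheres` is hypothetical.

```python
def clean_hemispheres(hemispheres):
    """Return copies of the hemisphere dicts with whitespace-trimmed titles."""
    return [
        {'title': item['title'].strip(), 'img_url': item['img_url']}
        for item in hemispheres
    ]


# One entry copied from the output above, including its trailing space.
sample = [
    {'title': 'Mars: Cerberus Hemisphere ',
     'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_cerberus_enhanced_f840.jpg'},
]
cleaned = clean_hemispheres(sample)
print(repr(cleaned[0]['title']))
```

Calling `clean_hemispheres(hemisphere_image_urls)` before storing the list would keep the stray whitespace out of the database.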

<h1>Step 2 - MongoDB and Flask Application</h1>

Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above.

In [10]:
import pymongo
conn = "mongodb://127.0.0.1:27017"
client = pymongo.MongoClient(conn)

In [11]:
db = client["mars_db"]

In [12]:
dict_of_scraped = {
    "news_title_list": news_title_list,
    "news_p_list": news_p_list,
    "featured_image_url": featured_image_url,
    "mars_weather": mars_weather,
    "html_table_str": html_table_str,
    "hemisphere_image_urls": hemisphere_image_urls
}
db.mars_db.insert_one(dict_of_scraped)

<pymongo.results.InsertOneResult at 0x11ddd2588>
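
Because `insert_one` adds a new document on every run, re-running the notebook leaves several scrape results in the collection. One possible alternative, sketched here under the assumption that only the latest scrape needs to be kept, is an upsert through `update_one`, which is a standard method on PyMongo collections. The helper name `save_latest` is hypothetical.

```python
def save_latest(collection, document):
    """Overwrite the stored scrape result, inserting it if none exists yet.

    `collection` is expected to be a PyMongo collection (anything exposing
    update_one with the same signature works).
    """
    # The empty filter {} matches the first document in the collection;
    # upsert=True creates it when the collection is still empty.
    return collection.update_one({}, {'$set': document}, upsert=True)
```

With the objects defined above, the call would be `save_latest(db.mars_db, dict_of_scraped)`.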

In [13]:
from pprint import pprint
mars_data = db.mars_db.find()
for data in mars_data:
    pprint(data)

{'_id': ObjectId('5c3f7856b520d9e3e3a37417'),
 'featured_image_url': 'https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA19036_ip.jpg',
 'hemisphere_image_urls': [{'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_valles_marineris_enhanced_f840.jpg',
                            'title': 'Mars: Valles Marineris Hemisphere'},
                           {'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_syrtis_major_enhanced_f840.jpg',
                            'title': 'Mars: Syrtis Major Hemisphere'},
                           {'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_cerberus_enhanced_f840.jpg',
                            'title': 'Mars: Cerberus Hemisphere '},
                           {'img_url': 'https://planetary.s3.amazonaws.com/assets/images/4-mars/2014/20140202_schiaparelli_enhanced_f840.jpg',
                            'title': 'Mars: Schiaparelli Hemisphere '}]