## HW_12_Part 1 WebScraping
### NASA Mars News

* Scrape the [NASA Mars News Site](https://mars.nasa.gov/news/) and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later.

In [48]:
from splinter import Browser
from splinter.exceptions import ElementDoesNotExist
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

In [49]:
# Check the path of your chromedriver
!which chromedriver

/usr/local/bin/chromedriver


In [50]:
# Start Chrome Driver to navigate through websites
def init_browser():
    executable_path = {"executable_path": "/usr/local/bin/chromedriver"}
    return Browser("chrome", **executable_path, headless=False)
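
Every scraping cell below opens a browser and must remember to call `browser.quit()`. A minimal sketch of one way to guarantee the cleanup, assuming a hypothetical helper named `managed_browser` (not part of the original notebook) that takes any zero-argument factory such as `init_browser`. The `FakeBrowser` class is only a stand-in so the sketch runs without Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Yield a browser built by `factory` and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        # Runs even if scraping raises, so no browser window is left open
        browser.quit()


class FakeBrowser:
    """Stand-in for a real browser so the sketch runs without Chrome."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


# Demonstration with the stand-in; in the notebook the factory would be init_browser
with managed_browser(FakeBrowser) as demo:
    pass
print(demo.closed)
```

In the notebook this could be used as `with managed_browser(init_browser) as browser: browser.visit(news_url)`, which removes the need for a separate `browser.quit()` line in each cell.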

In [51]:
# Main dictionary to store scraped data
mars_data = {}

### Scrape for News Titles and Paragraph Text

In [52]:
browser = init_browser()

news_url = "https://mars.nasa.gov/news/"
browser.visit(news_url)

time.sleep(1)  # Wait 1 second for the page to load; increase if the page needs longer

# Scrape page into Soup
html = browser.html
soup = bs(html, "html.parser")

# Collect the latest News Title and Paragraph Text
news_title = soup.find('div', class_ = 'content_title').text
news_p = soup.find('div', class_ = 'article_teaser_body').text

# Store data in a dictionary
mars_data = {
    "news_title": news_title,
    "news_p": news_p
}

# Close the browser after scraping
browser.quit()

In [53]:
# Print saved scrape data
mars_data

{'news_title': 'Watch NASA Build Its Next Mars Rover',
 'news_p': "A newly installed webcam offers the public a live, bird's-eye view of NASA's Mars 2020 rover as it takes shape at NASA's Jet Propulsion Laboratory. "}
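
The fixed `time.sleep(1)` above has to be tuned by hand when the page loads slowly. A possible alternative is to poll until the content is present. The sketch below assumes a hypothetical helper `wait_until` (not in the original notebook) that takes any zero-argument callable.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

For the news page, a call such as `wait_until(lambda: "content_title" in browser.html)` would return as soon as the title markup appears, and would raise `TimeoutError` instead of silently parsing an incomplete page.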

### JPL Mars Space Images - Featured Image

* Visit the url for JPL Featured Space Image [here](https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars).

In [42]:
# Inspect JPL website
# Click on "FULL IMAGE" button
# Click on "more info" button

# Image saved here: 
# <figure class="lede">
#               <a href="/spaceimages/images/largesize/PIA16815_hires.jpg"><img alt="This image shows the first holes into rock drilled by NASA's Mars rover Curiosity, with drill tailings around the holes plus piles of powdered rock collected from the deeper hole and later discarded." title="This image shows the first holes into rock drilled by NASA's Mars rover Curiosity, with drill tailings around the holes plus piles of powdered rock collected from the deeper hole and later discarded." class="main_image" src="/spaceimages/images/largesize/PIA16815_hires.jpg"></a>
#             </figure>

In [70]:
# JPL base URL and the featured-image page from the instructions

jpl_home = "https://www.jpl.nasa.gov"
jpl_url = "https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars"

# Open the browser and load the featured-image page
browser = init_browser()
browser.visit(jpl_url)
time.sleep(1)

try:
    # click on first button
    browser.click_link_by_partial_text('FULL IMAGE')
    time.sleep(2)

    # click on next page button
    browser.click_link_by_partial_text('more info')
    time.sleep(1)

    # Scrape page into Soup
    html = browser.html
    soup = bs(html, "html.parser")

    # Find image and save into variable
    figure_img = soup.find('figure', class_="lede").find('a')['href']

    # Create url of image
    feature_img_url = jpl_home + figure_img
    print(f"Feature image url: {feature_img_url}")

    # Store data in the dictionary only when the scrape succeeded
    mars_data["feature_img_url"] = feature_img_url

except ElementDoesNotExist:
    print("Could not find the expected link; the page layout may have changed")

finally:
    # Close the browser after scraping, even if a click failed
    browser.quit()

Feature image url: https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA19673_hires.jpg


In [72]:
mars_data

{'news_title': 'Watch NASA Build Its Next Mars Rover',
 'news_p': "A newly installed webcam offers the public a live, bird's-eye view of NASA's Mars 2020 rover as it takes shape at NASA's Jet Propulsion Laboratory. ",
 'feature_img_url': 'https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA19673_hires.jpg'}
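
The featured image URL is built by concatenating the site root with the relative `href`. A small sketch of the same step using the standard library's `urljoin`, which also copes with an `href` that is already absolute. The helper name `absolute_url` is an assumption for illustration.

```python
from urllib.parse import urljoin

jpl_home = "https://www.jpl.nasa.gov"


def absolute_url(base, href):
    """Resolve `href` against `base`, whether it is relative or already absolute."""
    return urljoin(base, href)


relative = "/spaceimages/images/largesize/PIA19673_hires.jpg"
print(absolute_url(jpl_home, relative))
```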

### Mars Weather

* Visit the Mars Weather Twitter account [here](https://twitter.com/marswxreport?lang=en) and scrape the latest Mars weather tweet from the page. Save the tweet text for the weather report as a variable called `mars_weather`.

In [73]:
# <p class="TweetTextSize TweetTextSize--normal js-tweet-text tweet-text" data-aria-label-part="0" lang="en">
# InSight sol 188 (2019-06-07) low -102.5ºC (-152.6ºF) high -21.9ºC (-7.4ºF)
# winds from the SSE at 4.8 m/s (10.8 mph) gusting to 15.6 m/s (35.0 mph)
# pressure at 7.60 hPa
# <a href="https://t.co/ocUTA1rgaU" class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr">
# pic.twitter.com/ocUTA1rgaU</a>
# </p>

In [76]:
mars_twitter = "https://twitter.com/marswxreport?lang=en"

# Open the browser and load the Twitter page
browser = init_browser()
browser.visit(mars_twitter)
time.sleep(1)

# Scrape page into Soup
html = browser.html
soup = bs(html, "html.parser")

# Close the browser after scraping
browser.quit()

# Collect the latest tweet (the first matching <p> on the page)
mars_weather = soup.find('p', class_ = 'tweet-text').text

# Store data in a dictionary
mars_data["mars_weather"] = mars_weather
mars_data

{'news_title': 'Watch NASA Build Its Next Mars Rover',
 'news_p': "A newly installed webcam offers the public a live, bird's-eye view of NASA's Mars 2020 rover as it takes shape at NASA's Jet Propulsion Laboratory. ",
 'feature_img_url': 'https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA19673_hires.jpg',
 'mars_weather': 'InSight sol 188 (2019-06-07) low -102.5ºC (-152.6ºF) high -21.9ºC (-7.4ºF)\nwinds from the SSE at 4.8 m/s (10.8 mph) gusting to 15.6 m/s (35.0 mph)\npressure at 7.60 hPapic.twitter.com/ocUTA1rgaU'}
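
The scraped tweet text ends with the embedded picture link (`pic.twitter.com/...`) fused onto the last word. A minimal sketch of one way to strip it, assuming a hypothetical helper `clean_tweet` and that the link always appears at the very end of the text:

```python
import re


def clean_tweet(text):
    """Drop a trailing pic.twitter.com link and surrounding whitespace."""
    return re.sub(r"pic\.twitter\.com/\S+\s*$", "", text).strip()


raw = (
    "InSight sol 188 (2019-06-07) low -102.5ºC (-152.6ºF) high -21.9ºC (-7.4ºF)\n"
    "winds from the SSE at 4.8 m/s (10.8 mph) gusting to 15.6 m/s (35.0 mph)\n"
    "pressure at 7.60 hPapic.twitter.com/ocUTA1rgaU"
)
print(clean_tweet(raw))
```

Applied in the notebook, this would be `mars_data["mars_weather"] = clean_tweet(mars_weather)`.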

### Mars Facts

* Visit the Mars Facts webpage [here](https://space-facts.com/mars/) and use Pandas to scrape the table containing facts about the planet including Diameter, Mass, etc.

In [32]:
# <table id="tablepress-mars" class="tablepress tablepress-id-mars">
# <tbody>
# <tr class="row-1 odd">
# <td class="column-1"><strong>Equatorial Diameter:</strong></td><td class="column-2">6,792 km<br>
# </td>
# </tr>
# ...
# </tbody>
# </table>

In [77]:
mars_facts_url = "https://space-facts.com/mars/"
tables = pd.read_html(mars_facts_url, header=None)
tables

[                      0                              1
 0  Equatorial Diameter:                       6,792 km
 1       Polar Diameter:                       6,752 km
 2                 Mass:  6.42 x 10^23 kg (10.7% Earth)
 3                Moons:            2 (Phobos & Deimos)
 4       Orbit Distance:       227,943,824 km (1.52 AU)
 5         Orbit Period:           687 days (1.9 years)
 6  Surface Temperature:                  -153 to 20 °C
 7         First Record:              2nd millennium BC
 8          Recorded By:           Egyptian astronomers]

In [78]:
type(tables)

list

In [79]:
df = tables[0]
df

|   | 0 | 1 |
|---|---|---|
| 0 | Equatorial Diameter: | 6,792 km |
| 1 | Polar Diameter: | 6,752 km |
| 2 | Mass: | 6.42 x 10^23 kg (10.7% Earth) |
| 3 | Moons: | 2 (Phobos & Deimos) |
| 4 | Orbit Distance: | 227,943,824 km (1.52 AU) |
| 5 | Orbit Period: | 687 days (1.9 years) |
| 6 | Surface Temperature: | -153 to 20 °C |
| 7 | First Record: | 2nd millennium BC |
| 8 | Recorded By: | Egyptian astronomers |


In [80]:
df.columns = ["Key", "Measurement"]
df

|   | Key | Measurement |
|---|---|---|
| 0 | Equatorial Diameter: | 6,792 km |
| 1 | Polar Diameter: | 6,752 km |
| 2 | Mass: | 6.42 x 10^23 kg (10.7% Earth) |
| 3 | Moons: | 2 (Phobos & Deimos) |
| 4 | Orbit Distance: | 227,943,824 km (1.52 AU) |
| 5 | Orbit Period: | 687 days (1.9 years) |
| 6 | Surface Temperature: | -153 to 20 °C |
| 7 | First Record: | 2nd millennium BC |
| 8 | Recorded By: | Egyptian astronomers |
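
The labels in the `Key` column all carry a trailing colon. If the facts are later needed as a plain dictionary rather than an HTML table, a sketch like the following could tidy them. The helper name `rows_to_facts` is hypothetical, and the sample rows are copied from the table above.

```python
rows = [
    ("Equatorial Diameter:", "6,792 km"),
    ("Polar Diameter:", "6,752 km"),
    ("Moons:", "2 (Phobos & Deimos)"),
]


def rows_to_facts(rows):
    """Map each label (without its trailing colon) to its measurement."""
    return {key.rstrip(":").strip(): value.strip() for key, value in rows}


facts = rows_to_facts(rows)
print(facts)
```

With the DataFrame from this notebook, the rows could be supplied as `df.itertuples(index=False, name=None)`, which yields plain `(Key, Measurement)` tuples.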


In [81]:
# set_index shifts the index header and other column headers on different rows
# df.set_index("Key", inplace=True)
# df

In [82]:
df.to_html('tables/mars_table.html', index = False)

In [88]:
# find current path
file_path = !pwd

# pwd returns the folder without a trailing slash, so add the separator
table_path = file_path[0] + "/tables/mars_table.html"
mars_data["table_path"] = table_path
# del mars_data["key"] # Remove item from dictionary
mars_data

{'news_title': 'Watch NASA Build Its Next Mars Rover',
 'news_p': "A newly installed webcam offers the public a live, bird's-eye view of NASA's Mars 2020 rover as it takes shape at NASA's Jet Propulsion Laboratory. ",
 'feature_img_url': 'https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA19673_hires.jpg',
 'mars_weather': 'InSight sol 188 (2019-06-07) low -102.5ºC (-152.6ºF) high -21.9ºC (-7.4ºF)\nwinds from the SSE at 4.8 m/s (10.8 mph) gusting to 15.6 m/s (35.0 mph)\npressure at 7.60 hPapic.twitter.com/ocUTA1rgaU',
 'table_path': '/Users/bic/Desktop/GW_DATA_2019/Module-12/HW_12/tables/mars_table.html'}

In [89]:
!open tables/mars_table.html

### Mars Hemispheres

* Visit the USGS Astrogeology site [here](https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars) to obtain high-resolution images for each of Mars's hemispheres.

In [98]:
astro_home ="https://astrogeology.usgs.gov"
hemi_url = "https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars"

browser = init_browser()
browser.visit(hemi_url)
time.sleep(1)

In [99]:
# Gather list of names to scrape for
hemi_names = []
for n in range(4):
    name = browser.find_by_css('h3')[n].text
    hemi_names.append(name)
print(hemi_names)

['Cerberus Hemisphere Enhanced', 'Schiaparelli Hemisphere Enhanced', 'Syrtis Major Hemisphere Enhanced', 'Valles Marineris Hemisphere Enhanced']


In [100]:
# Empty list to save dictionaries
mars_img = []

try:
    for name in hemi_names:
        browser.click_link_by_partial_text(name)
        time.sleep(1)

        html = browser.html
        soup = bs(html, "html.parser")

        img_url = soup.find('div', class_ ='downloads').find('li').find('a')['href']
        print(f"Scraping {name}")

        if any(x.get("img_title") == name for x in mars_img):
            print("No new items added.")
        else:
            # Append dictionaries to a list
            mars_img.append({
                "img_title": name,
                "img_url": img_url})
            
        browser.back()
        
except ElementDoesNotExist:
    print("Scraping Complete")
    
browser.quit()

Scraping Cerberus Hemisphere Enhanced
Scraping Schiaparelli Hemisphere Enhanced
Scraping Syrtis Major Hemisphere Enhanced
Scraping Valles Marineris Hemisphere Enhanced


In [101]:
mars_img

[{'img_title': 'Cerberus Hemisphere Enhanced',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
 {'img_title': 'Schiaparelli Hemisphere Enhanced',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg'},
 {'img_title': 'Syrtis Major Hemisphere Enhanced',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg'},
 {'img_title': 'Valles Marineris Hemisphere Enhanced',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg'}]
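
The scraping loop guards against duplicates by scanning the whole list for each new title. A sketch of an alternative that keys the results by title, so re-running the cell cannot add the same hemisphere twice. The helper name `collect_unique` and the placeholder URLs are assumptions for illustration.

```python
def collect_unique(pairs):
    """Keep the first URL seen for each title, preserving insertion order."""
    seen = {}
    for title, url in pairs:
        # setdefault ignores later entries with a title already stored
        seen.setdefault(title, url)
    return [{"img_title": t, "img_url": u} for t, u in seen.items()]


pairs = [
    ("Cerberus Hemisphere Enhanced", "first.jpg"),
    ("Schiaparelli Hemisphere Enhanced", "second.jpg"),
    ("Cerberus Hemisphere Enhanced", "duplicate.jpg"),
]
print(collect_unique(pairs))
```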

In [102]:
mars_data["Mars images"] = mars_img

In [103]:
mars_data

{'news_title': 'Watch NASA Build Its Next Mars Rover',
 'news_p': "A newly installed webcam offers the public a live, bird's-eye view of NASA's Mars 2020 rover as it takes shape at NASA's Jet Propulsion Laboratory. ",
 'feature_img_url': 'https://www.jpl.nasa.gov/spaceimages/images/largesize/PIA19673_hires.jpg',
 'mars_weather': 'InSight sol 188 (2019-06-07) low -102.5ºC (-152.6ºF) high -21.9ºC (-7.4ºF)\nwinds from the SSE at 4.8 m/s (10.8 mph) gusting to 15.6 m/s (35.0 mph)\npressure at 7.60 hPapic.twitter.com/ocUTA1rgaU',
 'table_path': '/Users/bic/Desktop/GW_DATA_2019/Module-12/HW_12/tables/mars_table.html',
 'Mars images': [{'img_title': 'Cerberus Hemisphere Enhanced',
   'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
  {'img_title': 'Schiaparelli Hemisphere Enhanced',
   'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg'},
  {'img_title': 'Syrtis Major Hemisphere En