# Scrape Mars Data(10.3.3)

https://redplanetscience.com/

In [28]:
# Import Splinter, BeautifulSoup, and the Chrome driver manager

from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

# Import Pandas(10.3.5)

import pandas as pd

In [2]:
# Set up the ChromeDriver path and launch an automated Chrome browser
executable_path = {"executable_path": ChromeDriverManager().install()}
browser = Browser("chrome", **executable_path, headless=False)

In [3]:
# Visit the Mars NASA news site
# Assign the URL and instruct the browser to visit it.

url = "https://redplanetscience.com/"
browser.visit(url)  # The automated browser takes control and navigates to the site.

# Optional delay for loading the page

browser.is_element_present_by_css('div.list_text', wait_time=1)

    # We are accomplishing two things:
    
        # We're searching for elements with a specific combination of tag (div) and class (list_text).
            # For example, ul.item_list would be found in HTML as <ul class="item_list">.
            
        # We're also telling the browser to wait up to one second for those elements to appear. This optional delay is useful because dynamic pages can take a little while to load, especially if they are image-heavy.

True
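
To make the idea of an optional delay concrete, here is a minimal sketch of what "waiting for an element" amounts to: check a condition repeatedly until it is true or the time limit runs out. The names `wait_until` and `page_has_loaded` are hypothetical and exist only for this illustration; this is not Splinter's actual implementation.

```python
import time


def wait_until(condition, wait_time=1.0, poll_interval=0.05):
    """Poll `condition` until it returns True or `wait_time` seconds pass.

    Hypothetical helper for illustration only; not part of Splinter.
    """
    deadline = time.monotonic() + wait_time
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Simulate a page whose content only shows up on the third check.
checks = {"count": 0}


def page_has_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until(page_has_loaded, wait_time=1.0)
never_found = wait_until(lambda: False, wait_time=0.2)
print(found, never_found)
```

The first call returns as soon as the condition holds, so a generous `wait_time` costs nothing when the page loads quickly. The second call gives up once the time limit passes.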

In [13]:
# Parse HTML using BeautifulSoup
    # Convert the browser's current page into an HTML string and parse it with soup

html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('div.list_text')

    # slide_elem holds the <div /> tag and its descendants (the other tags nested within the <div /> element). This is our parent element.
    
    # The . is used for selecting classes, such as list_text, so the code 'div.list_text' pinpoints the <div /> tag with the class of list_text.
    
    # select_one returns only the first element that matches the selector, not a list of every match.
        # Here, the first matching element returned is a <div /> element with a class of list_text, along with all the elements nested within it.
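
As a small illustration of how `select_one` differs from `select`, the sketch below parses a hand-written HTML snippet. The snippet and its headlines are assumptions made up for this example, not the live page.

```python
from bs4 import BeautifulSoup

# Hand-written sample HTML (an assumption, not the live page).
sample_html = """
<div class="list_text">
  <div class="content_title">First headline</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns only the first element that matches the selector.
first = sample_soup.select_one("div.list_text")
first_title = first.find("div", class_="content_title").get_text()

# select returns every match, in document order.
all_titles = [
    block.find("div", class_="content_title").get_text()
    for block in sample_soup.select("div.list_text")
]

print(first_title)
print(all_titles)
```

Because the news site lists the newest article first, taking the first match is enough to get the latest headline.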

In [14]:
slide_elem

<div class="list_text">
<div class="list_date">August 17, 2022</div>
<div class="content_title">Air Deliveries Bring NASA's Perseverance Mars Rover Closer to Launch</div>
<div class="article_teaser_body">A NASA Wallops Flight Facility cargo plane transported more than two tons of equipment — including the rover's sample collection tubes — to Florida for this summer's liftoff.</div>
</div>

In [15]:
# Find Title for the first article

slide_elem.find('div', class_='content_title')

# The title is in that mix of HTML in our output: 
    # Use get_text() method to find just the title

<div class="content_title">Air Deliveries Bring NASA's Perseverance Mars Rover Closer to Launch</div>

In [16]:
# Find Title for the first article and pull out just the text
# Use the parent element to find the <div /> tag with a class of content_title and save its text as `news_title`

news_title=slide_elem.find('div', class_='content_title').get_text()
news_title

"Air Deliveries Bring NASA's Perseverance Mars Rover Closer to Launch"

In [17]:
# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p

"A NASA Wallops Flight Facility cargo plane transported more than two tons of equipment — including the rover's sample collection tubes — to Florida for this summer's liftoff."

# Jet Propulsion Laboratory's Space Images(10.3.4)

https://spaceimages-mars.com

### Featured Images

In [18]:
# Visit URL

url = 'https://spaceimages-mars.com'
browser.visit(url)

In [24]:
# Find and click the full image button
full_image_elem = browser.find_by_tag('button')[1]
full_image_elem.click()

# Notice the indexing chained at the end of the first line of code? 
    # With this, we've stipulated that we want our browser to click the second button.
    
# The automated browser should automatically "click" the button and change the view to a slideshow of images

In [None]:
# Click the More Info button to get to the next page.
    # Let's look at the DevTools again to see what elements we can use for our scraping.

In [25]:
# Parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

In [26]:
# Find the relative image url
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

# We've done a lot with that single line.

#Let's break it down:

    # An img tag is nested within this HTML, so we've included it.
    # .get('src') pulls the link to the image.

'image/featured/mars2.jpg'

- What we've done here is tell BeautifulSoup to look inside the <img /> tag for an image with a class of fancybox-image. Basically we're saying, "This is where the image we want lives—use the link that's inside these tags."

- We were able to pull the link to the image by pointing BeautifulSoup to where the image will be, instead of grabbing the URL directly. This way, when JPL updates its image page, our code will still pull the most recent image.

- But if we copy and paste this link into a browser, it won't work. This is because it's only a partial link, as the base URL isn't included. If we look at our address bar in the webpage, we can see the entire URL up there already; we just need to add the first portion to our app.

In [27]:
# Use the base URL to create an absolute URL
img_url = f'https://spaceimages-mars.com/{img_url_rel}'
img_url


'https://spaceimages-mars.com/image/featured/mars2.jpg'

- We're using an f-string here because it's a cleaner way to build a string from variables. F-strings are evaluated at run time, so the final URL isn't assembled until the code executes, and it uses whatever value img_url_rel holds at that moment. This works well for our scraping app because the data we're scraping is live and will be updated frequently.
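
The following sketch shows the run-time evaluation in action, together with the standard library's `urljoin` as an alternative way to build the absolute URL. The helper name `build_image_url` and the path `mars3.jpg` are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com"


def build_image_url(relative_path):
    # The f-string is evaluated each time the function runs,
    # so it always uses the current value of relative_path.
    return f"{base_url}/{relative_path}"


url_today = build_image_url("image/featured/mars2.jpg")
# Hypothetical later value, as if the featured image had changed.
url_later = build_image_url("image/featured/mars3.jpg")

# urljoin produces the same result and also copes with a leading slash.
joined = urljoin(base_url + "/", "image/featured/mars2.jpg")
joined_with_slash = urljoin(base_url + "/", "/image/featured/mars2.jpg")

print(url_today)
print(url_later)
print(joined == url_today, joined_with_slash == url_today)
```

The f-string is fine while the relative path never starts with a slash. `urljoin` is a reasonable choice if the site's `src` values might vary in form.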

# Mars Facts(10.3.5)

https://galaxyfacts-mars.com/

- The next bit of information Robin wants to have included in her app is a collection of Mars facts. With news articles and high-quality images, a collection of facts is a solid addition to her web app.

- Robin already has a great photo and an article, so all she wants from this page is the table. 

- Her plan is to display it as a table on her own web app, so **keeping the current HTML table format** is important.

In [29]:
# Import Pandas(10.3.5)-Added to top

# Create a new DataFrame from the HTML table

df = pd.read_html('https://galaxyfacts-mars.com')[0]
df.columns=['description', 'Mars', 'Earth']
df.set_index('description', inplace=True)
df

| description | Mars | Earth |
| --- | --- | --- |
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


Now let's break it down:

- df = pd.read_html('https://galaxyfacts-mars.com')[0] With this line, we're creating a new DataFrame from the HTML table.
    - The Pandas function read_html() **specifically searches for and returns a list of tables** found in the HTML. 
    - By specifying an index of 0, we're telling Pandas to pull only the first table it encounters, or the first item in the list. Then, it turns the table into a DataFrame.

- df.columns=['description', 'Mars', 'Earth'] Here, we assign columns to the new DataFrame for additional clarity.

- df.set_index('description', inplace=True) By using the .set_index() method, we're turning the description column into the DataFrame's index. inplace=True means that the DataFrame itself is modified, so we don't have to reassign the result to a new variable.

- How do we add the DataFrame to a web application?
    - **Pandas** also has a way to **easily convert our DataFrame back into HTML-ready code** using the **.to_html()** method.

- Robin's web app is going to be an actual webpage. Our data is live: if the table is updated, then we want that change to appear in Robin's app also.

In [30]:
# Use Pandas to convert our DataFrame back into HTML-ready code

df.to_html()

# Below is a <table /> element with a lot of nested elements. 
# This means success. After adding this exact block of code to Robin's web app, the data it's storing will be presented in an easy-to-read tabular format.

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>\n</table>'
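
The column naming, index setting, and HTML conversion steps can be tried offline on a tiny DataFrame. This is a minimal sketch: the two rows are typed in by hand rather than scraped, and the CSS class names passed to `classes` are assumptions chosen for illustration.

```python
import pandas as pd

# Hand-typed sample rows standing in for the scraped table (an assumption).
sample = pd.DataFrame(
    [
        ["Diameter:", "6,779 km", "12,742 km"],
        ["Moons:", "2", "1"],
    ]
)

# Name the columns, then promote 'description' to the index.
# Reassigning the result is equivalent to passing inplace=True.
sample.columns = ["description", "Mars", "Earth"]
sample = sample.set_index("description")

# to_html returns a string; `classes` adds CSS classes to the <table> tag.
table_html = sample.to_html(classes="table table-striped")
print(table_html)
```

The returned value is an ordinary Python string, so it can be stored or handed to a web page template like any other text.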

#### End the automated browsing session

- We really only want the automated browser to remain active while we're scraping data. It's like turning off a light switch when you're ready to leave the room or home.

In [31]:
# End the automated browsing session

browser.quit()
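
One way to make sure the session always ends, even when a scraping step fails, is a `try`/`finally` block. The sketch below uses `FakeBrowser`, a hypothetical stand-in class written for this example (it is not Splinter's `Browser`), so that the pattern can be run without opening a real browser.

```python
class FakeBrowser:
    """Hypothetical stand-in for an automated browser (not Splinter's class)."""

    def __init__(self):
        self.is_open = True

    def visit(self, url):
        # Pretend that any URL containing 'broken' fails to load.
        if "broken" in url:
            raise RuntimeError(f"could not load {url}")

    def quit(self):
        self.is_open = False


def scrape_with_cleanup(browser, url):
    try:
        browser.visit(url)
        return "ok"
    finally:
        # Runs whether visit() succeeded or raised an error.
        browser.quit()


good = FakeBrowser()
result = scrape_with_cleanup(good, "https://example.com")

bad = FakeBrowser()
try:
    scrape_with_cleanup(bad, "https://example.com/broken")
except RuntimeError as err:
    error_message = str(err)

print(result, good.is_open, bad.is_open)
```

In both cases the browser ends up closed, which is the "turn off the light" behavior described above.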

#### IMPORTANT

- Live sites are a great resource for fresh data, but the layout of the site may be updated or otherwise changed. When this happens, there's a good chance your scraping code will break and need to be reviewed and updated to be used again.

- For example, an image may suddenly become embedded within an inaccessible block of code because the developers switched to a new JavaScript library. **It's not uncommon to revise code to find workarounds or even look for a different, scraping-friendly site altogether.**
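
A common way to soften this kind of breakage is to catch the error raised when an expected element is missing. The sketch below is one possible approach, not the only one: `extract_news` is a hypothetical helper name, and both HTML snippets are hand-written assumptions rather than real pages.

```python
from bs4 import BeautifulSoup


def extract_news(html):
    """Return (title, teaser), or (None, None) if the layout has changed."""
    page = BeautifulSoup(html, "html.parser")
    try:
        slide_elem = page.select_one("div.list_text")
        title = slide_elem.find("div", class_="content_title").get_text()
        teaser = slide_elem.find("div", class_="article_teaser_body").get_text()
    except AttributeError:
        # select_one or find returned None, so an expected element is missing.
        return None, None
    return title, teaser


current_layout = """
<div class="list_text">
  <div class="content_title">Sample title</div>
  <div class="article_teaser_body">Sample teaser</div>
</div>
"""
# A made-up redesign in which the expected classes no longer exist.
changed_layout = '<section class="news"><h2>Sample title</h2></section>'

print(extract_news(current_layout))
print(extract_news(changed_layout))
```

Returning `None` values instead of crashing lets the rest of a scraping run continue, though the selectors still need to be updated by hand once the layout changes.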

# Export to Python

- Jupyter is great because each chunk can be tested and run independently, but we can't automate the scraping using the Jupyter Notebook. To fully automate it, the notebook will need to be converted into a .py file.

- The next step in making this an automated process is to download the current code into a Python file. It won't transition over perfectly, so we'll need to clean it up a bit, but it's an easier task than copying each cell and pasting it over in the correct order.


#### An important feature of the Jupyter ecosystem is **being able to download the notebook into different formats.**

There are several formats available, but we'll focus on one by downloading to a Python file.

1. While your notebook is open, navigate to the File menu at the top of the page.