# 10.3.3 Scrape Mars Data: The News

In [23]:
# Import Splinter and BeautifulSoup
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

# Import Pandas
import pandas as pd

In [12]:
#Set up Splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)




[WDM] - Current google-chrome version is 103.0.5060
INFO:WDM:Current google-chrome version is 103.0.5060
[WDM] - Get LATEST chromedriver version for 103.0.5060 google-chrome
INFO:WDM:Get LATEST chromedriver version for 103.0.5060 google-chrome
[WDM] - Driver [C:\Users\ssteffen\.wdm\drivers\chromedriver\win32\103.0.5060.134\chromedriver.exe] found in cache
INFO:WDM:Driver [C:\Users\ssteffen\.wdm\drivers\chromedriver\win32\103.0.5060.134\chromedriver.exe] found in cache


In [4]:
# Visit the mars nasa news site
url = 'https://redplanetscience.com'
browser.visit(url)
# Optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

True
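
The `wait_time` argument gives the page up to that many seconds for the element to appear before the check returns `False`. The general idea can be sketched in plain Python as a polling loop. This is only an illustration of the concept: `wait_for`, `interval`, and `element_present` are hypothetical names, not part of Splinter.

```python
import time


def wait_for(predicate, wait_time=1.0, interval=0.1):
    """Call predicate() repeatedly until it returns True or wait_time seconds pass."""
    deadline = time.monotonic() + wait_time
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "is the element on the page yet?": becomes True on the third check.
checks = {"count": 0}


def element_present():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_for(element_present, wait_time=1.0, interval=0.01)
missing = wait_for(lambda: False, wait_time=0.05, interval=0.01)
print(found, missing)
```

A short wait like this keeps the scrape from parsing a page that has not finished loading its content.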

In [5]:
#set up the HTML parser
html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('div.list_text')

In [6]:
#locate the html webpage article title and summary text
slide_elem.find('div', class_='content_title')

<div class="content_title">NASA's Perseverance Rover Will Look at Mars Through These 'Eyes'</div>

In [7]:
# Save the article title text to `news_title`
# Find the `content_title` div inside the parent element and extract its text
news_title = slide_elem.find('div', class_='content_title').get_text()
news_title

"NASA's Perseverance Rover Will Look at Mars Through These 'Eyes'"

In [8]:
# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p

'A pair of zoomable cameras will help scientists and rover drivers with high-resolution color images.'
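
To practice these calls without opening a browser, the same pattern can be run against a small static HTML string. This is a minimal sketch: `sample_html` is a made-up, trimmed-down stand-in for the news page, and only the class names come from the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for two article blocks on the news page.
sample_html = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First summary.</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
  <div class="article_teaser_body">Second summary.</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns only the first match, i.e. the first article listed.
first_article = sample_soup.select_one("div.list_text")
title = first_article.find("div", class_="content_title").get_text()
summary = first_article.find("div", class_="article_teaser_body").get_text()

# select returns every match, which shows what select_one skipped.
all_titles = [
    div.get_text() for div in sample_soup.select("div.list_text div.content_title")
]
print(title, summary, all_titles)
```

Searching inside `first_article` rather than the whole soup is what keeps the title and summary paired with the same article.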

# 10.3.4 Scrape Mars Data: Featured Image

### Featured Images

In [18]:
# Visit URL
url = 'https://spaceimages-mars.com'
browser.visit(url)

Run this code to make sure it's working correctly. A new automated browser should open to the featured images webpage.

Next, we want to click the "Full Image" button. This button will direct our browser to an image slideshow. Let's take a look at the button's HTML tags and attributes with the DevTools.

This is a fairly straightforward HTML tag: the ```<button>``` element has two classes (btn and btn-outline-light) and a string reading "FULL IMAGE". First, let's use the dev tools to search for all the button elements.
    
Since there are only three buttons, and we want to click the full-size image button, we can go ahead and use the HTML tag in our code.

In [19]:
# Find and click the full image button
full_image_elem = browser.find_by_tag('button')[1]
full_image_elem.click()

Notice the indexing chained at the end of the first line of code? With this, we've stipulated that we want our browser to click the second button.
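
Indexing starts at zero, so `[1]` selects the second match. The sketch below shows the same idea on a static page with BeautifulSoup instead of a live browser. The HTML is hypothetical: only the "FULL IMAGE" label and its two classes come from the lesson, and the other two buttons are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three buttons.
buttons_html = """
<button class="navbar-toggler">Menu</button>
<button class="btn btn-outline-light">FULL IMAGE</button>
<button class="btn-close">Close</button>
"""
buttons = BeautifulSoup(buttons_html, "html.parser").find_all("button")

# Position-based lookup: index 1 is the second button on the page.
second_button = buttons[1]
label = second_button.get_text(strip=True)

# Label-based lookup: does not depend on the button's position.
by_text = next(b for b in buttons if b.get_text(strip=True) == "FULL IMAGE")
print(label, by_text is second_button)
```

Selecting by position is short, but it would pick the wrong element if the site added a button earlier in the page; matching on the label is one way to guard against that.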

Go ahead and run this code. The automated browser should automatically "click" the button and change the view to a slideshow of images, so we're on the right track. We need to click the More Info button to get to the next page. Let's look at the DevTools again to see what elements we can use for our scraping.

With the new page loaded onto our automated browser, it needs to be parsed so we can continue and scrape the full-size image URL. In the next empty cell, type the following:

In [20]:
# Parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

Now we need to find the relative image URL. In our browser (make sure you're on the same page as the automated one), activate your DevTools again. This time, let's find the image link for that image. This is a little more tricky. Remember, Robin wants to pull the most recently posted image for her web app. If she uses the image URL below, she'll only ever pull that specific image when using her app.

It's important to note that the value of the src will be different every time the page is updated, so we can't simply record the current value—we would only pull that image each time the code is executed, instead of the most recent one.

We'll use the image tag and class (```<img />``` and ```fancybox-image```) to build the URL to the full-size image. Let's go back to Jupyter Notebook to do that.

In [21]:
# Find the relative image url
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

'image/featured/mars1.jpg'

We've done a lot with that single line.

Let's break it down:

Calling ```find('img', class_='fancybox-image')``` tells BeautifulSoup to look for an ```<img />``` tag with a class of ```fancybox-image```.

Chaining ```.get('src')``` pulls the link stored in that tag's ```src``` attribute.

Basically we're saying, "This is where the image we want lives; use the link that's inside this tag."

Run the notebook cell to see the output of the link.

This looks great! We were able to pull the link to the image by pointing BeautifulSoup to where the image will be, instead of grabbing the URL directly. This way, when JPL updates its image page, our code will still pull the most recent image.
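
To see why selecting by tag and class keeps working after the page changes, the lookup can be run against two snapshots of the same markup. This is a sketch under assumptions: `featured_src` is a hypothetical helper, and `page_later` with `mars2.jpg` is an invented example of a future update.

```python
from bs4 import BeautifulSoup


def featured_src(html):
    """Return the src of the image tagged with the fancybox-image class."""
    img = BeautifulSoup(html, "html.parser").find("img", class_="fancybox-image")
    return img.get("src")


page_today = '<div><img class="fancybox-image" src="image/featured/mars1.jpg"></div>'
# Hypothetical later version of the page with a new featured image.
page_later = '<div><img class="fancybox-image" src="image/featured/mars2.jpg"></div>'

src_today = featured_src(page_today)
src_later = featured_src(page_later)
print(src_today, src_later)
```

The code never mentions a file name, so it returns whichever image currently sits in that tag.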

But if we copy and paste this link into a browser, it won't work. This is because it's only a partial link, as the base URL isn't included. If we look at our address bar in the webpage, we can see the entire URL up there already; we just need to add the first portion to our app.

In [22]:
# Use the base URL to create an absolute URL
img_url = f'https://spaceimages-mars.com/{img_url_rel}'
img_url

'https://spaceimages-mars.com/image/featured/mars1.jpg'

We're using an f-string to build the URL because it's a clean way to combine fixed text with a variable. F-strings are evaluated at run time, so the final URL is assembled from whatever value ```img_url_rel``` holds when the code executes. This works well for our scraping app because the data we're scraping is live and will be updated frequently.
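
As an alternative sketch, the standard library's `urllib.parse.urljoin` can combine the base and relative parts. It is not what the lesson uses, but it handles a relative path that starts with a slash, which plain string formatting does not. The variable names other than `img_url_rel` and `img_url` are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com/"
img_url_rel = "image/featured/mars1.jpg"

# Same result as the f-string for the relative path scraped above.
img_url = urljoin(base_url, img_url_rel)

# A leading slash on the relative path still produces a clean URL.
img_url_slash = urljoin(base_url, "/" + img_url_rel)

# The f-string approach would double the slash in that case.
naive = f"https://spaceimages-mars.com/{'/' + img_url_rel}"
print(img_url, img_url_slash, naive)
```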

# 10.3.5 Scrape Mars Data: Mars Facts

Robin has chosen to collect her data from Mars Facts, so let's visit the webpage to look at what we'll be working with. Robin already has a great photo and an article, so all she wants from this page is the table. Her plan is to display it as a table on her own web app, so keeping the current HTML table format is important.

Let's look at the webpage again, this time using our DevTools. All of the data we want is in a ```<table />``` tag. HTML code used to create a table looks fairly complex, but it's really just breaking down and naming each component.

Tables in HTML are basically made up of many smaller containers. The main container is the ```<table />``` tag. Inside the table is ```<tbody />```, which is the body of the table—the headers, columns, and rows.

```<tr />``` is the tag for each table row. Within that tag, the table data is stored in ```<td />``` tags. This is where the columns are established.
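
To make the nesting concrete, the sketch below walks a tiny table by hand with the standard library's `html.parser`, grouping each cell under its row. The `TableRows` class and the two-row `table_html` sample are illustrative assumptions, not the markup of the actual site.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


table_html = """
<table>
  <tbody>
    <tr><td>Diameter:</td><td>6,779 km</td><td>12,742 km</td></tr>
    <tr><td>Moons:</td><td>2</td><td>1</td></tr>
  </tbody>
</table>
"""
parser = TableRows()
parser.feed(table_html)
rows = parser.rows
print(rows)
```

Each inner list is one `<tr />`, and each string in it is one `<td />`. Writing this by hand for every table is tedious, which is why a ready-made table reader is attractive.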

Instead of scraping each row, or the data in each ```<td />```, we're going to scrape the entire table with Pandas' .read_html() function.

At the top of your Jupyter Notebook, add import pandas as pd to the dependencies and rerun the cell. This way, we'll be able to use this new function without generating an error.

Back at the bottom of your notebook, in the next blank cell, let's set up our code.

In [24]:
df = pd.read_html('https://galaxyfacts-mars.com')[0]
df.columns=['description', 'Mars', 'Earth']
df.set_index('description', inplace=True)
df

| description | Mars | Earth |
| --- | --- | --- |
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


Now let's break it down:

```df = pd.read_html('https://galaxyfacts-mars.com')[0]``` With this line, we're creating a new DataFrame from the HTML table. The Pandas function ```read_html()``` searches the HTML for tables and returns them as a list of DataFrames. By specifying an index of 0, we're keeping only the first table it encounters, or the first item in the list.

```df.columns=['description', 'Mars', 'Earth']``` Here, we assign column names to the new DataFrame for additional clarity.

```df.set_index('description', inplace=True)``` By using the ```.set_index()``` function, we're turning the ```description``` column into the DataFrame's index. ```inplace=True``` means the DataFrame is modified directly, so we don't have to reassign the result to a variable.

Now, when we call the DataFrame, we're presented with a tidy, Pandas-friendly representation of the HTML table we were just viewing on the website.
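
The effect of `inplace=True` is easiest to see side by side with the default behavior. The sketch below uses a small hand-built DataFrame as a stand-in for the table returned by `read_html()`, so it runs without a network connection; the two sample rows and the names `raw`, `tables`, and `indexed` are assumptions for illustration.

```python
import pandas as pd

# Hand-built stand-in for the list of tables that read_html() returns.
raw = pd.DataFrame(
    [
        ["Diameter:", "6,779 km", "12,742 km"],
        ["Moons:", "2", "1"],
    ]
)
tables = [raw]
df = tables[0].copy()  # [0] picks the first table in the list
df.columns = ["description", "Mars", "Earth"]

# Default behavior: set_index returns a new DataFrame and leaves df unchanged.
indexed = df.set_index("description")
still_has_column = "description" in df.columns

# inplace=True: df itself is modified and the call returns None.
result = df.set_index("description", inplace=True)
print(still_has_column, result, df.index.name)
```

Either style works; the important part is not to mix them, since assigning the result of an `inplace=True` call would replace the DataFrame with `None`.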

This is exactly what Robin is looking to add to her web application. How do we add the DataFrame to a web application? Robin's web app is going to be an actual webpage. Our data is live—if the table is updated, then we want that change to appear in Robin's app also.

Thankfully, Pandas also has a way to easily convert our DataFrame back into HTML-ready code using the .to_html() function. Add this line to the next cell in your notebook and then run the code.



In [25]:
# convert the dataframe you just made back into html
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>

The result is a slightly confusing-looking set of HTML code—it's a ```<table />``` element with a lot of nested elements. This means success. After adding this exact block of code to Robin's web app, the data it's storing will be presented in an easy-to-read tabular format.
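
As a small sketch of how the generated markup can be adjusted for a web page, `to_html()` accepts a `classes` argument that adds CSS class names to the ```<table />``` tag. The Bootstrap-style class names and the two-row `facts` DataFrame here are assumptions for illustration; the lesson itself does not say how Robin's app will be styled.

```python
import pandas as pd

# Small stand-in for the Mars facts DataFrame built above.
facts = pd.DataFrame(
    {"Mars": ["6,779 km", "2"], "Earth": ["12,742 km", "1"]},
    index=pd.Index(["Diameter:", "Moons:"], name="description"),
)

# classes= appends the given names to the table's class attribute.
table_html = facts.to_html(classes="table table-striped")
print(table_html)
```

The returned value is an ordinary Python string, so it can be stored or passed to a page template like any other text.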

Now that we've gathered everything on Robin's list, we can end the automated browsing session. This is an important line to add to our web app also. Without it, the automated browser won't know to shut down—it will continue to listen for instructions and use the computer's resources (it may put a strain on memory or a laptop's battery if left on). We really only want the automated browser to remain active while we're scraping data. It's like turning off a light switch when you're ready to leave the room or home.

In the last empty cell of Jupyter Notebook, add ```browser.quit()``` and execute that cell to end the session.

In [26]:
# add code to quit the browser
browser.quit()
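
In a script, one way to make sure the browser always shuts down, even when a page fails to load, is a `try`/`finally` block. The sketch below demonstrates the pattern with a stand-in class so it runs without Chrome; `FakeBrowser` and `scrape` are hypothetical, and a real script would pass in the Splinter `browser` object instead.

```python
class FakeBrowser:
    """Stand-in for a Splinter Browser so the pattern runs without Chrome."""

    def __init__(self):
        self.is_open = True

    def visit(self, url):
        # Simulate a page that fails to load.
        raise RuntimeError(f"could not load {url}")

    def quit(self):
        self.is_open = False


def scrape(browser):
    try:
        browser.visit("https://redplanetscience.com")
        return "scraped"
    finally:
        # Runs whether visit() succeeded or raised.
        browser.quit()


browser = FakeBrowser()
try:
    scrape(browser)
except RuntimeError as err:
    error_message = str(err)

print(browser.is_open, error_message)
```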

Live sites are a great resource for fresh data, but the layout of the site may be updated or otherwise changed. When this happens, there's a good chance your scraping code will break and need to be reviewed and updated to be used again.

For example, an image may suddenly become embedded within an inaccessible block of code because the developers switched to a new JavaScript library. It's not uncommon to revise code to find workarounds or even look for a different, scraping-friendly site altogether.
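
One common way to soften this kind of breakage is to catch the error raised when an expected tag is missing and return an empty result instead of crashing. The following is a minimal sketch, assuming a hypothetical `mars_news` helper and two invented HTML snippets; it reuses the selectors from the news cells above.

```python
from bs4 import BeautifulSoup


def mars_news(html):
    """Return (title, summary), or (None, None) if the expected tags are missing."""
    news_soup = BeautifulSoup(html, "html.parser")
    try:
        slide_elem = news_soup.select_one("div.list_text")
        news_title = slide_elem.find("div", class_="content_title").get_text()
        news_p = slide_elem.find("div", class_="article_teaser_body").get_text()
    except AttributeError:
        # select_one or find returned None because the layout changed.
        return None, None
    return news_title, news_p


current_layout = (
    '<div class="list_text">'
    '<div class="content_title">T</div>'
    '<div class="article_teaser_body">P</div>'
    "</div>"
)
changed_layout = '<section class="news"><h2>T</h2></section>'

ok = mars_news(current_layout)
broken = mars_news(changed_layout)
print(ok, broken)
```

Returning `None` does not fix the scraper, but it lets the rest of the app keep running and makes the failure easy to detect.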

# 10.3.6 Export to Python

The next step in making this an automated process is to download the current code into a Python file. It won't transition over perfectly, so we'll need to clean it up a bit, but it's an easier task than copying each cell and pasting it over in the correct order.

The Jupyter ecosystem is an extremely versatile tool. We already know many of its great functions, such as the different libraries that work well with it and also how easy it is to troubleshoot code. Another feature is being able to download the notebook into different formats.

There are several formats available, but we'll focus on one by downloading to a Python file.

While your notebook is open, navigate to the File menu at the top of the page.

From here, scroll down to the "Download as" section of the drop-down menu.

Select "Python (.py)" from the next menu to download the code.

If you get a warning about downloading this type of file, click "Keep" to continue the download. 

Navigate to your Downloads folder and open the new file. A brief look at the first lines of code shows us that the code wasn't the only thing to be ported over. The number of times each cell has been run is also there, for example.

Clean up the code by removing unnecessary blank spaces and comments.
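
Part of this cleanup can be scripted. The sketch below assumes the exported file marks each cell with a comment such as `# In[23]:` and pads cells with extra blank lines; `clean_exported_script` is a hypothetical helper, and the `exported` sample is a shortened illustration rather than the real download. Review the result by hand afterward, since it only removes markers and extra spacing.

```python
import re


def clean_exported_script(text):
    """Drop cell markers like '# In[23]:' and collapse runs of blank lines."""
    lines = []
    for line in text.splitlines():
        if re.fullmatch(r"#\s*In\[[\d ]*\]:\s*", line):
            continue  # skip execution-count markers
        lines.append(line.rstrip())
    cleaned = "\n".join(lines)
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)  # at most one blank line in a row
    return cleaned.strip() + "\n"


exported = """#!/usr/bin/env python
# coding: utf-8

# In[23]:


from splinter import Browser


# In[12]:


url = 'https://redplanetscience.com'
"""
cleaned = clean_exported_script(exported)
print(cleaned)
```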

When you're done tidying up the code, make sure you save it in your working folder with your notebook code as scraping.py. You can also test the script by running it through your terminal.

The final scraping.py file should look like this: