In [1]:
# Import Splinter, BeautifulSoup, ChromeDriverManager, and pandas
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

In [2]:
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 89.0.4389
Get LATEST driver version for 89.0.4389
Get LATEST driver version for 89.0.4389
Trying to download new driver from https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_mac64.zip
Driver has been saved in cache [/Users/jacquelineesbri/.wdm/drivers/chromedriver/mac64/89.0.4389.23]


In [3]:
# assign the url and instruct the browser to visit it.
# Visit the mars nasa news site
url = 'https://redplanetscience.com'
browser.visit(url)
# Optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

True

#### With the line browser.is_element_present_by_css('div.list_text', wait_time=1), we are accomplishing two things.

First, we're searching for elements with a specific combination of tag (div) and class (list_text). As an example, ul.item_list would be found in HTML as <ul class="item_list">.
Second, we're telling the browser to wait up to one second for a matching element to appear. The call returns True once the element is found, or False if it still hasn't appeared when the time runs out. The optional delay is useful because dynamic pages can take a little while to load, especially if they are image-heavy.
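
To make the idea of "wait up to N seconds" concrete, here is a minimal sketch of a polling loop. It is not Splinter's actual implementation; wait_until, poll_interval, and the simulated element are hypothetical names used only for illustration.

```python
import time


def wait_until(predicate, wait_time=1.0, poll_interval=0.05):
    """Hypothetical helper: poll `predicate` until it is True or `wait_time` seconds pass."""
    deadline = time.monotonic() + wait_time
    while True:
        if predicate():
            return True          # condition met, stop waiting early
        if time.monotonic() >= deadline:
            return False         # time ran out without the condition being met
        time.sleep(poll_interval)


# Simulate an element that "appears" 0.2 seconds after the page starts loading.
appears_at = time.monotonic() + 0.2
found = wait_until(lambda: time.monotonic() >= appears_at, wait_time=1.0)

# Simulate an element that never appears.
never_found = wait_until(lambda: False, wait_time=0.1)

print(found, never_found)
```

The point of the sketch is that the wait is an upper bound: a page that loads quickly does not pay the full delay.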

In [4]:
# set up the HTML parser: 10.3.3
html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('div.list_text')

Notice how we've assigned slide_elem as the variable that holds the <div /> tag and its descendants (the other tags within the <div /> element)? This is our parent element: it holds all of the other elements within it, and we'll reference it when we want to filter search results even further. The . is used for selecting classes, such as list_text, so the code 'div.list_text' pinpoints <div /> tags with the class of list_text. Several elements on a page can match the same selector, and select_one returns only the first match in document order, along with all of the elements nested within it.
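
The difference between select_one and select can be sketched on a small, made-up HTML snippet. The markup below is an assumption that only mimics the class names used in this notebook; it is not the real page source.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two articles that share the same class names.
html = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First summary</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
  <div class="article_teaser_body">Second summary</div>
</div>
"""

page = BeautifulSoup(html, "html.parser")

first = page.select_one("div.list_text")     # first match in document order
all_matches = page.select("div.list_text")   # every match, as a list

# Searching inside the parent element only sees its own descendants.
first_title = first.find("div", class_="content_title").get_text()

print(first_title)
print(len(all_matches))
```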

In [5]:
# 10.3.3

# Search for the HTML components you'll use to identify
# the title and paragraph you want. We'll use the
# HTML attribute class="content_title" to scrape
# the article's title (we're looking for a <div /> with a class of "content_title").

slide_elem.find('div', class_='content_title')

<div class="content_title">NASA's Mars Helicopter Attached to Mars 2020 Rover </div>

In [6]:
# The title is in that mix of HTML in our output,
# but we need just the text; the extra HTML
# isn't necessary.
# Use the parent element to find the `content_title` div and save its text as `news_title`

news_title = slide_elem.find('div', class_='content_title').get_text()
news_title

"NASA's Mars Helicopter Attached to Mars 2020 Rover "

The title is in that mix of HTML in our output, but we need just the text; the extra HTML isn't necessary. We've added something new to our .find() method here: .get_text(). When this method is chained onto .find(), only the text of the element is returned. The code above, for example, returns only the title of the news article and not any of the HTML tags or attributes. We have created a new variable for the title, added the get_text() method, and we're searching within the parent element for the title.
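
One caveat with chaining .get_text() directly onto .find(): if the element is missing, .find() returns None and the chain raises an AttributeError. The sketch below shows one possible guard. extract_news is a hypothetical helper, and the sample HTML is an assumption built from the output shown above. It also passes strip=True to drop the trailing space visible in the title.

```python
from bs4 import BeautifulSoup


def extract_news(html):
    """Hypothetical helper: return (title, summary), or (None, None) if the markup is missing."""
    page = BeautifulSoup(html, "html.parser")
    parent = page.select_one("div.list_text")
    if parent is None:
        return None, None
    title_tag = parent.find("div", class_="content_title")
    summary_tag = parent.find("div", class_="article_teaser_body")
    if title_tag is None or summary_tag is None:
        return None, None
    # strip=True removes leading and trailing whitespace from the text
    return title_tag.get_text(strip=True), summary_tag.get_text(strip=True)


sample = """
<div class="list_text">
  <div class="content_title">NASA's Mars Helicopter Attached to Mars 2020 Rover </div>
  <div class="article_teaser_body">The helicopter will be first aircraft to perform flight tests on another planet.</div>
</div>
"""

title, summary = extract_news(sample)
missing_title, missing_summary = extract_news("<p>no articles here</p>")

print(repr(title))
print(missing_title, missing_summary)
```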

In [7]:
# Next we need to add the summary text. This time we're
# searching for the summary instead of the title, so we'll
# reuse the code above and change the class to
# "article_teaser_body", the class associated with the summary.
# The output is the summary.

In [8]:
# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p

'The helicopter will be first aircraft to perform flight tests on another planet.'

### Featured Images

In [9]:
# Visit URL
url = 'https://spaceimages-mars.com'
browser.visit(url)

In [10]:
# Find and click the full image button
full_image_elem = browser.find_by_tag('button')[1]
full_image_elem.click()

In [11]:
# Parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

In [12]:
# Find the relative image url
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

'image/featured/mars2.jpg'

But if we copy and paste this link into a browser, it won't work. This is because it's only a partial link, as the base URL isn't included. Let's add the base URL to our code.

In [13]:
# Use the base URL to create an absolute URL
img_url = f'https://spaceimages-mars.com/{img_url_rel}'
img_url

'https://spaceimages-mars.com/image/featured/mars2.jpg'
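
As an alternative to building the absolute URL with an f-string, the standard library's urljoin can combine a base URL and a relative path. This is a sketch, not the notebook's own method; base_url is a name introduced here, and the example.com address is a placeholder.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com/"

# A relative path, as returned by the scrape above.
relative = urljoin(base_url, "image/featured/mars2.jpg")

# A path with a leading slash still yields a single slash after the host.
rooted = urljoin(base_url, "/image/featured/mars2.jpg")

# An already-absolute URL is returned unchanged.
absolute = urljoin(base_url, "https://example.com/other.jpg")

print(relative)
print(rooted)
print(absolute)
```

The f-string works for the path scraped here; urljoin mainly helps if the src attribute ever starts with a slash or is already a full URL.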

In [14]:
df = pd.read_html('https://galaxyfacts-mars.com')[0]
df.columns=['description', 'Mars', 'Earth']
df.set_index('description', inplace=True)
df

description,Mars,Earth
Mars - Earth Comparison,Mars,Earth
Diameter:,"6,779 km","12,742 km"
Mass:,6.39 × 10^23 kg,5.97 × 10^24 kg
Moons:,2,1
Distance from Sun:,"227,943,824 km","149,598,262 km"
Length of Year:,687 Earth days,365.24 days
Temperature:,-87 to -5 °C,-88 to 58°C


Now let's break it down:

df = pd.read_html('https://galaxyfacts-mars.com')[0] With this line, we're creating a new DataFrame from the HTML table. The Pandas function read_html() searches the HTML for tables and returns them as a list of DataFrames. By specifying an index of 0, we're telling Pandas to give us only the first item in that list, which is the first table it encountered on the page.

df.columns=['description', 'Mars', 'Earth'] Here, we assign column names to the new DataFrame for additional clarity.

df.set_index('description', inplace=True) By using the .set_index() method, we're turning the description column into the DataFrame's index. inplace=True means that the DataFrame itself is modified, without having to reassign the result to a new variable.

Now, when we call the DataFrame, we're presented with a tidy, Pandas-friendly representation of the HTML table we were just viewing on the website.
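
To see what inplace=True changes, the sketch below builds a small DataFrame by hand from two of the rows shown above (an assumption made so the example runs without a network connection) and compares both ways of setting the index.

```python
import pandas as pd

# Two rows copied from the table above, built by hand instead of scraped.
rows = [
    ["Diameter:", "6,779 km", "12,742 km"],
    ["Moons:", "2", "1"],
]
df = pd.DataFrame(rows)
df.columns = ["description", "Mars", "Earth"]

# Without inplace, set_index returns a new DataFrame and leaves df untouched.
reassigned = df.set_index("description")
columns_before = list(df.columns)

# With inplace=True, df itself is modified and the call returns None.
result = df.set_index("description", inplace=True)
columns_after = list(df.columns)

print(columns_before)
print(columns_after)
print(result)
```

Both approaches end with the same table; the only difference is whether the original variable is changed or a new one is created.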

In [67]:
# Pandas also has a way to easily convert our DataFrame back 
# into HTML-ready code using the .to_html() function.
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>
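
If the generated table is going to be dropped into a web page, to_html() accepts a classes argument for adding CSS classes to the <table> tag. The sketch below uses a hand-built DataFrame and Bootstrap-style class names; both are assumptions for illustration.

```python
import pandas as pd

# Small hand-built DataFrame with the same shape as the scraped table.
df = pd.DataFrame(
    {"Mars": ["6,779 km", "2"], "Earth": ["12,742 km", "1"]},
    index=pd.Index(["Diameter:", "Moons:"], name="description"),
)

# `classes` adds CSS classes alongside the default "dataframe" class.
table_html = df.to_html(classes="table table-striped")

print(table_html)
```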

In [69]:
# To end sessions
#browser.quit()