# Mission to Mars Multiple Web Scrapes

Ultimately, with each item we scrape, we'll also save and then serve it on our own website. 

https://courses.bootcampspot.com/courses/676/pages/10-dot-3-3-scrape-mars-data-the-news?module_item_id=190909

In [21]:
# Import Dependencies
import pandas as pd
# Import Splinter and BeautifulSoup
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# Set up Splinter with a Chrome driver installed by webdriver_manager
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [C:\Users\cfole\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache


## Scrape Mars News (NASA)

In [3]:
# Visit the mars nasa news site
url = 'https://redplanetscience.com'
browser.visit(url)
# Optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

True

In [4]:
# Convert the browser HTML to a soup object and select the first article container
html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('div.list_text')

In [5]:
slide_elem.find('div', class_='content_title')

<div class="content_title">NASA's Push to Save the Mars InSight Lander's Heat Probe</div>

In [6]:
# Within the parent element, find the `content_title` div and save its text as `news_title`.
# .get_text() returns only the title of the news article, without any HTML tags or elements.
news_title = slide_elem.find('div', class_='content_title').get_text()
news_title

"NASA's Push to Save the Mars InSight Lander's Heat Probe"

In [9]:
# .find() returns only the first element that matches the tag and attribute we specify
# (on this page, that is the latest article);
# .find_all() returns every element that matches the specified tag and attribute.
# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p

"The scoop on the end of the spacecraft's robotic arm will be used to 'pin' the mole against the wall of its hole."
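The difference between `.find()` and `.find_all()` noted in the cell above can be checked offline. The snippet below is a minimal sketch that parses a small hand-written HTML string instead of the live site. The markup and both headlines are made up for illustration and only mimic the `div.list_text` structure of the news page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the news page markup (illustrative only).
sample_html = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First teaser</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
  <div class="article_teaser_body">Second teaser</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .find() returns only the first matching tag (or None if nothing matches).
first_title = sample_soup.find("div", class_="content_title").get_text()

# .find_all() returns a list-like ResultSet containing every matching tag.
all_titles = [
    tag.get_text()
    for tag in sample_soup.find_all("div", class_="content_title")
]

print(first_title)
print(all_titles)
```

If the app ever needs more than the latest article, looping over the `.find_all()` result in this way is one option.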

## Scrape Mars Images (JPL) - Featured Images

https://courses.bootcampspot.com/courses/676/pages/10-dot-3-4-scrape-mars-data-featured-image?module_item_id=190916

In [15]:
# Visit URL
url = 'https://spaceimages-mars.com'
browser.visit(url)

In [16]:
# Find and click the full image button. Indexing starts at 0, so [1] selects the second <button> on the page.
full_image_elem = browser.find_by_tag('button')[1]
full_image_elem.click()

In [17]:
# Parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

JPL rotates its featured image, so a different image of Mars may come up every time we open the website. Instead of grabbing the image URL directly, we point BeautifulSoup to where the image will be on the page and pull the link from there. This way, when JPL updates its image page, our code will still pull the most recent image.

In [18]:
# Find the relative image url. 
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

'image/featured/mars2.jpg'

But if we copy and paste this link into a browser, it won't work. This is because it's only a partial link, as the base URL isn't included. If we look at our address bar in the webpage, we can see the entire URL up there already; we just need to add the first portion to our app.

* https://spaceimages-mars.com/
    
Let's add the base URL to our code.

In [19]:
# Use the base URL to create an absolute URL
img_url = f'https://spaceimages-mars.com/{img_url_rel}'
img_url

'https://spaceimages-mars.com/image/featured/mars2.jpg'

Note: We're using an f-string here because it's a cleaner way to build a string than concatenation. F-strings are also evaluated at run time, which means the final URL doesn't exist until the code is executed and `img_url_rel` has a value; the result is not a hard-coded constant. This works well for our scraping app because the data we're scraping is live and will be updated frequently.
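As an alternative sketch, the standard library's `urllib.parse.urljoin` builds the same absolute URL and also copes with a leading slash on the relative path. The `base_url` variable name is introduced here for illustration; the relative path is the value returned by the earlier cell.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com/"   # hypothetical variable name
img_url_rel = "image/featured/mars2.jpg"     # value scraped in the earlier cell

# f-string version, equivalent to the notebook cell
img_url_fstring = f"{base_url}{img_url_rel}"

# urljoin version: resolves the relative path against the base URL
img_url_joined = urljoin(base_url, img_url_rel)

print(img_url_fstring)
print(img_url_joined)
```

Both approaches produce the same string for this page; `urljoin` is mainly useful when it is unclear whether the scraped path will start with a slash.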

## Mars Facts (https://galaxyfacts-mars.com/)

https://courses.bootcampspot.com/courses/676/pages/10-dot-3-5-scrape-mars-data-mars-facts?module_item_id=190922

In [22]:
# We will be pulling data from the table. Instead of scraping each row, or the data in each <td />,
# we're going to scrape the entire table with Pandas' .read_html() function.

# Create a DataFrame from the HTML table.
# read_html() searches for and returns a list of the tables found in the HTML.
# By specifying an index of 0, we tell Pandas to pull only the first table it encounters,
# that is, the first item in the list.
df = pd.read_html('https://galaxyfacts-mars.com')[0]

# Assign column names to the new DataFrame for additional clarity
df.columns=['description', 'Mars', 'Earth']

# set_index() turns the 'description' column into the DataFrame's index.
# inplace=True modifies the DataFrame in place, so we don't have
# to reassign it to a new variable.
df.set_index('description', inplace=True)
df

| description | Mars | Earth |
|---|---|---|
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


How do we add the DataFrame to a web application? Robin's web app is going to be an actual webpage. Our data is live—if the table is updated, then we want that change to appear in Robin's app also.

Thankfully, Pandas also has a way to easily convert our DataFrame back into HTML-ready code using the .to_html() function. Add this line to the next cell in your notebook and then run the code.

In [23]:
#Convert the Dataframe back into HTML-ready code
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>
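The conversion to HTML can be tried on a small DataFrame built by hand, without a network call. This is a minimal sketch: the two rows are copied from the table above, and the `classes` argument of `to_html()` is used to add Bootstrap-style class names, which is only an assumption about how the web page might be styled.

```python
import pandas as pd

# Two rows copied from the scraped table, entered by hand for illustration.
facts = pd.DataFrame(
    {
        "description": ["Diameter:", "Moons:"],
        "Mars": ["6,779 km", "2"],
        "Earth": ["12,742 km", "1"],
    }
).set_index("description")

# classes= appends CSS class names to the generated <table> tag.
facts_html = facts.to_html(classes="table table-striped")

print(facts_html)
```

The returned value is a plain string, so it can be stored alongside the other scraped items and dropped into a page template later.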

Now that we've gathered everything on Robin's list, we can end the automated browsing session. This is an important line to add to our web app also. Without it, the automated browser won't know to shut down—it will continue to listen for instructions and use the computer's resources (it may put a strain on memory or a laptop's battery if left on). We really only want the automated browser to remain active while we're scraping data. It's like turning off the lights. ;)

In [24]:
#Shut down the automated browsing session
browser.quit()

Important: Live sites are a great resource for fresh data, but the layout of the site may be updated or otherwise changed. When this happens, there's a good chance your scraping code will break and need to be reviewed and updated to be used again.

For example, an image may suddenly become embedded within an inaccessible block of code because the developers switched to a new JavaScript library. It's not uncommon to revise code to find workarounds or even look for a different, scraping-friendly site altogether.
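One way to soften this kind of breakage is to catch the `AttributeError` that Python raises when `.select_one()` or `.find()` returns `None` and a method is then called on that result. The sketch below uses a hypothetical helper name, `extract_news`, and hand-written HTML strings in place of the live page.

```python
from bs4 import BeautifulSoup


def extract_news(html):
    """Return (title, teaser) for the first article, or (None, None) if the layout changed."""
    page = BeautifulSoup(html, "html.parser")
    try:
        slide_elem = page.select_one("div.list_text")
        title = slide_elem.find("div", class_="content_title").get_text()
        teaser = slide_elem.find("div", class_="article_teaser_body").get_text()
    except AttributeError:
        # select_one()/find() returned None, so the expected tags are missing.
        return None, None
    return title, teaser


# Hand-written samples (illustrative only): one with the expected layout, one without.
expected_layout = (
    '<div class="list_text">'
    '<div class="content_title">Sample title</div>'
    '<div class="article_teaser_body">Sample teaser</div>'
    "</div>"
)
changed_layout = '<section class="news">Sample title</section>'

print(extract_news(expected_layout))
print(extract_news(changed_layout))
```

Returning `None` values keeps the rest of the scrape running, and the missing data is then easy to spot when the results are reviewed.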

## Export to Python

https://courses.bootcampspot.com/courses/676/pages/10-dot-3-6-export-to-python?module_item_id=190925

## Mongo DB

https://courses.bootcampspot.com/courses/676/pages/10-dot-4-1-store-the-data?module_item_id=190935

Highlights: 
    * 1st terminal: run 'mongod' and just let it run; it is the server.
    * 2nd terminal: run 'mongo'; it is the command shell used to speak to the server.

DB Commands:
    * show all databases - 'show dbs' (make sure you are in the correct database before using the commands below)
    * show the collections in the current database - 'show collections' (collections hold documents)
    * show the documents in a collection, with all of their fields - 'db.collectionName.find()'
    * update
        * create or switch to a database - 'use name'
        * insert a document - 'db.collectionName.insert({key: value})'
        * remove matching documents - 'db.collectionName.remove({"specific key": "specific value"})'
        * empty an entire collection - 'db.collectionName.remove({})'
        * drop an entire collection - 'db.collectionName.drop()'
        * drop the current database - 'db.dropDatabase()'

Quit: 
    * You can quit the Mongo shell from the keyboard with Ctrl + C, which works in both macOS and Windows terminals. Shut it down every time you finish.
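The same kinds of operations are available from Python through PyMongo, which is how a web app would normally talk to the server. The sketch below is hedged: the helper names (`save_latest`, `clear_collection`, `connect`), the database name `mars_app`, and the collection name `mars` are all assumptions made for illustration. The helpers accept any collection object, so they can be exercised without a running `mongod`.

```python
def save_latest(collection, mars_data):
    """Insert the scrape result, or overwrite the stored document if one exists."""
    # An empty filter matches the first document in the collection;
    # upsert=True inserts a new document when nothing matches.
    return collection.update_one({}, {"$set": mars_data}, upsert=True)


def clear_collection(collection):
    """Python counterpart of db.collectionName.remove({}) in the shell."""
    return collection.delete_many({})


def connect(uri="mongodb://localhost:27017", db_name="mars_app", coll_name="mars"):
    """Return a collection handle; needs the pymongo package and a running mongod."""
    # Imported here so the helpers above can be loaded without pymongo installed.
    from pymongo import MongoClient

    return MongoClient(uri)[db_name][coll_name]
```

With the server running in the first terminal, a call such as `save_latest(connect(), {"news_title": "..."})` would store the latest scrape as a single document.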

## Flask App (app.py)

https://courses.bootcampspot.com/courses/676/pages/10-dot-5-1-use-flask-to-create-a-web-app?module_item_id=190945

Flask is a web microframework that helps developers build a web application. The Pythonic tools and libraries it comes with provide the means to create anything from a small webpage or blog to something large enough for commercial use.

## Refactor the Code (scraping.py)

https://courses.bootcampspot.com/courses/676/pages/10-dot-5-2-update-the-code?module_item_id=190951

## Integrate MongoDB Into the Web App

https://courses.bootcampspot.com/courses/676/pages/10-dot-5-3-integrate-mongodb-into-the-web-app?module_item_id=190956

The code that connects to the database is added at the top of the file, right after the dependency imports.

Hints:

Although the word "browser" appears twice here, the two do not need to have the same name.

    * "browser" is the name of the variable passed into the function, and
    * "Browser" is the name of a parameter. 

Headless:

    * headless=True means that we do not see the scraping in action,
    * but while we are developing or updating the code we should set it to False.
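The naming hint and the browser cleanup step can be illustrated without launching Chrome. In the sketch below, `StubBrowser` is a hypothetical stand-in for Splinter's `Browser` that only records calls, and `scrape_all` and `mars_news` are assumed names for functions in the refactored script.

```python
class StubBrowser:
    """Hypothetical stand-in for splinter.Browser so the sketch runs without Chrome."""

    def __init__(self, headless=True):
        self.headless = headless
        self.visited = []
        self.closed = False
        self.html = "<html></html>"

    def visit(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


def mars_news(browser):
    # `browser` is the parameter name; any other name would work just as well.
    browser.visit("https://redplanetscience.com")
    return browser.html


def scrape_all(make_browser=StubBrowser):
    # In the real script this would be something like
    # Browser('chrome', **executable_path, headless=True).
    session = make_browser(headless=True)
    try:
        # `session` is the argument; inside mars_news it is known as `browser`.
        return {"news_html": mars_news(session)}
    finally:
        # Always end the browsing session, even if a scrape step raises.
        session.quit()


result = scrape_all()
print(result)
```

Wrapping the scrape in `try`/`finally` is one way to make sure the browser is shut down whether or not the scrape succeeds.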