In [1]:
# Import Splinter and Beautiful Soup.
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# Set up Splinter.
executable_path = {"executable_path" : ChromeDriverManager().install()}
browser = Browser("chrome", **executable_path, headless = False, incognito = True)

In [3]:
# Visit the Mars News site for scraping.
url = "https://redplanetscience.com"
browser.visit(url)

# Optional delay for loading the page.
browser.is_element_present_by_css("div.list_text", wait_time = 1)

True

In [4]:
# Parse the HTML.
html = browser.html
news_soup = soup(html, "html.parser")

In [5]:
# Scrape the articles on the landing page.
articles = news_soup.find_all("div", class_ = "list_text")

In [6]:
# Scrape the article titles.
for article in articles:
    title = article.find("div", class_ = "content_title").text
    print(title)

Mars InSight Lander to Push on Top of the 'Mole'
Alabama High School Student Names NASA's Mars Helicopter
NASA Invites Public to Share Excitement of Mars 2020 Perseverance Rover Launch
How NASA's Perseverance Mars Team Adjusted to Work in the Time of Coronavirus 
NASA Establishes Board to Initially Review Mars Sample Return Plans
Mars Helicopter Attached to NASA's Perseverance Rover
NASA's MAVEN Explores Mars to Understand Radio Interference at Earth
Two Rovers to Roll on Mars Again: Curiosity and Mars 2020
Hear Audio From NASA's Perseverance As It Travels Through Deep Space
6 Things to Know About NASA's Ingenuity Mars Helicopter
NASA's MAVEN Maps Winds in the Martian Upper Atmosphere that Mirror the Terrain Below and Gives Clues to Martian Climate
NASA's Perseverance Rover Bringing 3D-Printed Metal Parts to Mars
The Man Who Wanted to Fly on Mars
AI Is Helping Scientists Discover Fresh Craters on Mars
8 Martian Postcards to Celebrate Curiosity's Landing Anniversary


In [7]:
# Scrape the article previews.
for article in articles:
    preview = article.find("div", class_ = "article_teaser_body").text
    print(preview)

Engineers have a plan for pushing down on the heat probe, which has been stuck at the Martian surface for a year.
Vaneeza Rupani's essay was chosen as the name for the small spacecraft, which will mark NASA's first attempt at powered flight on another planet.
There are lots of ways to participate in the historic event, which is targeted for July 30.
Like much of the rest of the world, the Mars rover team is pushing forward with its mission-critical work while putting the health and safety of their colleagues and community first.
The board will assist with analysis of current plans and goals for one of the most difficult missions humanity has ever undertaken.
The team also fueled the rover's sky crane to get ready for this summer's history-making launch.
NASA’s MAVEN spacecraft has discovered “layers” and “rifts” in the electrically charged part of the upper atmosphere of Mars.
They look like twins. But under the hood, the rover currently exploring the Red Planet and the one launching t

In [8]:
# Put the article titles and preview texts together into a list of dictionaries using list comprehension.
article_list = [{"title" : article.find("div", class_ = "content_title").text,
                 "preview" : article.find("div", class_ = "article_teaser_body").text} for article in articles]

In [9]:
# Print the results.
for article in article_list:
    print(article)

{'title': "Mars InSight Lander to Push on Top of the 'Mole'", 'preview': 'Engineers have a plan for pushing down on the heat probe, which has been stuck at the Martian surface for a year.'}
{'title': "Alabama High School Student Names NASA's Mars Helicopter", 'preview': "Vaneeza Rupani's essay was chosen as the name for the small spacecraft, which will mark NASA's first attempt at powered flight on another planet."}
{'title': 'NASA Invites Public to Share Excitement of Mars 2020 Perseverance Rover Launch', 'preview': 'There are lots of ways to participate in the historic event, which is targeted for July 30.'}
{'title': "How NASA's Perseverance Mars Team Adjusted to Work in the Time of Coronavirus ", 'preview': 'Like much of the rest of the world, the Mars rover team is pushing forward with its mission-critical work while putting the health and safety of their colleagues and community first.'}
{'title': 'NASA Establishes Board to Initially Review Mars Sample Return Plans', 'preview': '
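
The list comprehension that builds `article_list` assumes every article contains both a `content_title` and an `article_teaser_body` element. If one is missing, `find` returns `None` and the `.text` lookup raises an `AttributeError`. A minimal defensive sketch follows; the helper name `text_or_none` and the `SAMPLE_HTML` markup are hypothetical and only illustrate the idea on inline data rather than the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second article has no teaser element.
SAMPLE_HTML = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First preview.</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
</div>
"""


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching div, or None if absent."""
    element = parent.find("div", class_=css_class)
    return element.get_text(strip=True) if element is not None else None


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_articles = sample_soup.find_all("div", class_="list_text")

# Same shape as article_list, but missing elements become None instead of raising.
safe_article_list = [
    {"title": text_or_none(a, "content_title"),
     "preview": text_or_none(a, "article_teaser_body")}
    for a in sample_articles
]
print(safe_article_list)
```

Note that `get_text(strip=True)` also trims leading and trailing whitespace, so titles with a trailing space would differ slightly from the `.text` results shown above.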

In [10]:
# Close the Splinter session.
browser.quit()

In [11]:
# Export the list of dictionaries into a JSON file.
import json

# The context manager closes the file even if the write fails.
with open("article_list.json", "w") as json_file:
    json.dump(article_list, json_file)
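
Before importing the file into Mongo, it may be worth confirming that the export reads back unchanged. The sketch below is one way to do that round trip. It uses a small hypothetical `sample_articles` list and a temporary directory so it runs without the scraped data. Passing `ensure_ascii=False` with a UTF-8 file is an optional choice that keeps characters such as curly quotes readable in the file.

```python
import json
import os
import tempfile

# Hypothetical stand-in for article_list.
sample_articles = [{"title": "Sample title", "preview": "Sample “quoted” preview."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "article_list.json")

    # Write the list as a single JSON array.
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(sample_articles, json_file, ensure_ascii=False)

    # Read it back and compare with the original list.
    with open(path, encoding="utf-8") as json_file:
        reloaded = json.load(json_file)

round_trip_ok = reloaded == sample_articles
print(round_trip_ok)
```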

The list of dictionaries can be imported into a Mongo database collection in two ways. The first uses the `mongoimport` command-line tool to load the newly created JSON file into the collection.

1. Start Mongo by running `mongod` on Windows, or `brew services start mongodb/brew/mongodb-community` on Mac. (This step is required for both methods.)
2. In the terminal, use `cd` to navigate to the resources folder that contains the file named `article_list.json`.
3. Import this file to a Mongo database using this command:

`mongoimport --type json -d mars_news -c article_list --drop --jsonArray article_list.json`

This command tells Mongo that it needs to:

    * import a JSON file (`--type json`)
    * into a database called "mars_news" (`-d mars_news`)
    * into a collection called "article_list" (`-c article_list`)
    * treat the input source as a JSON array (`--jsonArray`)
    * remove the existing "article_list" collection (`--drop`), if it exists, before adding the new documents from the JSON file.
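
The `--jsonArray` flag is needed because serializing a Python list with the `json` module writes one JSON array. Without the flag, `mongoimport` expects the file to hold one document after another instead. The sketch below contrasts the two layouts using a hypothetical `sample_articles` list; it only builds strings and does not call `mongoimport`.

```python
import json

# Hypothetical stand-in for article_list.
sample_articles = [
    {"title": "Sample title A", "preview": "Sample preview A."},
    {"title": "Sample title B", "preview": "Sample preview B."},
]

# Layout produced by serializing the whole list: a single JSON array.
# A file like this is imported with --jsonArray.
as_array = json.dumps(sample_articles)

# Alternative layout: one JSON document per line.
# A file like this would be imported without --jsonArray.
as_lines = "\n".join(json.dumps(doc) for doc in sample_articles)

print(as_array)
print(as_lines)
```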

The second method inserts the list of dictionaries directly into the Mongo database collection with PyMongo, using the following cells.

In [12]:
# Create an instance of MongoClient, using the port number 27017.
from pymongo import MongoClient

mongo = MongoClient(port = 27017)

In [13]:
# Set up a database named "mars_news".
db = mongo["mars_news"]

# Set up a collection named "article_list".
collect = db["article_list"]

# Insert the list of dictionaries.
collect.insert_many(article_list)

<pymongo.results.InsertManyResult at 0x2d1119f76a0>

Regardless of the method used, there should now be a Mongo database named `mars_news` with a collection named `article_list`.
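
One side effect of the PyMongo method is that `insert_many` adds an `_id` key to each dictionary it is given, modifying the list in place. If `article_list` is needed afterwards in its original form (for example, to export it to JSON again), a copy can be inserted instead. The sketch below shows the copying pattern only; `fake_insert_many` is a hypothetical stand-in that mimics the in-place `_id` assignment so the example runs without a Mongo server.

```python
import copy
import json

# Hypothetical stand-in for article_list.
sample_articles = [{"title": "Sample title", "preview": "Sample preview."}]


def fake_insert_many(documents):
    """Hypothetical stand-in: adds an _id to each dict in place, as insert_many does."""
    for index, document in enumerate(documents):
        document.setdefault("_id", f"generated-id-{index}")


# Insert a deep copy so the original dictionaries stay untouched.
docs_to_insert = copy.deepcopy(sample_articles)
fake_insert_many(docs_to_insert)  # with PyMongo: collect.insert_many(docs_to_insert)

# The original list has no _id keys and still serializes cleanly.
print(json.dumps(sample_articles))
```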

In [14]:
# Verify existence of the database.
print(mongo.list_database_names())

['admin', 'config', 'local', 'mars_news']


In [15]:
# Verify existence of the collection.
db = mongo["mars_news"]

print(db.list_collection_names())

['article_list']


In [16]:
# Verify that all documents are accounted for.
collect = db["article_list"]

for result in collect.find():
    print(result)

{'_id': ObjectId('6376e571960fc7567c74103d'), 'title': "Mars InSight Lander to Push on Top of the 'Mole'", 'preview': 'Engineers have a plan for pushing down on the heat probe, which has been stuck at the Martian surface for a year.'}
{'_id': ObjectId('6376e571960fc7567c74103e'), 'title': "Alabama High School Student Names NASA's Mars Helicopter", 'preview': "Vaneeza Rupani's essay was chosen as the name for the small spacecraft, which will mark NASA's first attempt at powered flight on another planet."}
{'_id': ObjectId('6376e571960fc7567c74103f'), 'title': 'NASA Invites Public to Share Excitement of Mars 2020 Perseverance Rover Launch', 'preview': 'There are lots of ways to participate in the historic event, which is targeted for July 30.'}
{'_id': ObjectId('6376e571960fc7567c741040'), 'title': "How NASA's Perseverance Mars Team Adjusted to Work in the Time of Coronavirus ", 'preview': 'Like much of the rest of the world, the Mars rover team is pushing forward with its mission-critic

OPTIONAL: When finished, clean up everything.

In [17]:
# Delete the collection.
db.drop_collection("article_list")
db.list_collection_names()

[]

In [18]:
# Delete the database.
mongo.drop_database(db)
mongo.list_database_names()

['admin', 'config', 'local']