# Getting Started Using Web Scraping Tools

# 10.1.1 Install Your Tools

Before we start scraping data, we need to install a few libraries and tools:
* Splinter to automate a web browser,
* BeautifulSoup to parse and extract the data, and
* MongoDB to hold the data that has been gathered.


![Screen%20Shot%202022-02-27%20at%209.46.39%20AM.png](attachment:Screen%20Shot%202022-02-27%20at%209.46.39%20AM.png)

## Splinter
Splinter is the tool that will automate our web browser as we begin scraping. This means that it will:
* open the browser,
* visit a webpage, and then
* interact with it (such as logging in or searching for an item).

To do all of this, we'll need to install Splinter and ChromeDriver.

In [3]:
!pip install splinter

Collecting splinter
  Downloading splinter-0.17.0-py3-none-any.whl (38 kB)
Installing collected packages: splinter
Successfully installed splinter-0.17.0


## Web-Driver Manager
The webdriver_manager package lets us use a browser driver to scrape websites without having to go through the complicated process of installing the standalone ChromeDriver ourselves.

In [4]:
!pip install webdriver_manager

Collecting webdriver_manager
  Downloading webdriver_manager-3.5.3-py2.py3-none-any.whl (18 kB)
Collecting configparser
  Downloading configparser-5.2.0-py3-none-any.whl (19 kB)
Collecting crayons
  Downloading crayons-0.4.0-py2.py3-none-any.whl (4.6 kB)
Installing collected packages: crayons, configparser, webdriver-manager
Successfully installed configparser-5.2.0 crayons-0.4.0 webdriver-manager-3.5.3


## BeautifulSoup


In [5]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1271 sha256=1fed92d38da2ca6b728470717b74f360a33baf6685f31c2e93399363c5578f82
  Stored in directory: /Users/lucypepe/Library/Caches/pip/wheels/73/2b/cb/099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


## MongoDB

MongoDB (also known as Mongo) is a document database that thrives on chaos. Well, maybe it's not that extreme, but it is far more flexible when it comes to storing data than a structured, relational database queried with SQL. It can handle smaller, more personal projects as well as the larger-scale projects that a company might require. For this module, Mongo is a better choice than a SQL database because the data we'll scrape from the web isn't going to be uniform. For example, how would we break down an image into rows and columns? We can't. But Mongo will store and access it as a document instead.

To start the MongoDB service with Homebrew, run the following command in your terminal: `brew services start mongodb-community@4.4`
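To make the contrast with rows and columns concrete, here is a minimal sketch of how differently shaped scraped items can sit side by side as documents. The field names, the sample values, the `mars_app` database, the `scraped` collection, and the `save_documents` helper are all hypothetical. The helper also assumes that PyMongo is installed and that a MongoDB server is listening on the default local port.

```python
# Two scraped items with different shapes; a document store does not
# require them to share the same fields.
news_item = {
    "type": "news",
    "title": "Example headline",            # made-up sample data
    "summary": "Example teaser paragraph",  # made-up sample data
}
image_item = {
    "type": "image",
    "img_url": "https://example.com/featured.jpg",  # placeholder URL
}
documents = [news_item, image_item]


def save_documents(docs, uri="mongodb://localhost:27017"):
    """Insert the documents into a local MongoDB (hypothetical names)."""
    # Imported here so the rest of the sketch runs without PyMongo.
    from pymongo import MongoClient

    client = MongoClient(uri)
    collection = client["mars_app"]["scraped"]
    result = collection.insert_many(docs)
    client.close()
    return result.inserted_ids


# The two documents do not share a schema.
print(sorted(news_item.keys()))
print(sorted(image_item.keys()))
```

Calling `save_documents(documents)` would store both items in the same collection even though their fields differ, which is the flexibility a fixed table schema does not offer.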

## Flask-PyMongo
To bridge Flask and Mongo, you'll also want to install the [Flask-PyMongo](https://flask-pymongo.readthedocs.io/en/latest/) library. This library can be installed using pip with the following command from your terminal: `pip install Flask-PyMongo`.

In [1]:
!pip install Flask-PyMongo

Collecting Flask-PyMongo
  Downloading Flask_PyMongo-2.3.0-py2.py3-none-any.whl (12 kB)
Collecting PyMongo>=3.3
  Downloading pymongo-4.0.1-cp39-cp39-macosx_10_9_universal2.whl (351 kB)
     |████████████████████████████████| 351 kB 4.7 MB/s eta 0:00:01
Installing collected packages: PyMongo, Flask-PyMongo
Successfully installed Flask-PyMongo-2.3.0 PyMongo-4.0.1
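Once the library is installed, the bridge between Flask and Mongo might look roughly like the sketch below. The `create_app` helper, the `mars_app` database name, and the local connection string are assumptions made for illustration; they are not part of the original notes.

```python
def create_app(mongo_uri="mongodb://localhost:27017/mars_app"):
    """Build a Flask app wired to MongoDB through Flask-PyMongo."""
    # Imported inside the function so that defining it does not require
    # Flask or Flask-PyMongo to be installed yet.
    from flask import Flask
    from flask_pymongo import PyMongo

    app = Flask(__name__)
    # Flask-PyMongo reads the connection string from the MONGO_URI setting.
    app.config["MONGO_URI"] = mongo_uri
    mongo = PyMongo(app)
    return app, mongo
```

After `app, mongo = create_app()`, the database named in the URI is available as `mongo.db`, so a route could read from or write to a collection such as `mongo.db.scraped` (a hypothetical collection name).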


## Additional Libraries: html5lib and lxml
There are two final Python libraries required to run scraping code successfully: html5lib and lxml. Both packages are used to parse HTML in Python, which will be important as you traverse through different web pages to find and collect information.

To install these libraries, first make sure your coding environment is active. Then, type the following commands in your terminal to install them:

In [2]:
!pip install html5lib



In [3]:
!pip install lxml
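As a rough illustration of why more than one parser is worth having, the sketch below feeds the same sloppy HTML to BeautifulSoup with three different parsers. The HTML string and the `parse_with` helper are made up for this example. A parser that is not installed is reported as missing instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately sloppy HTML: the <li> and <p> tags are never closed.
html = "<ul><li>Hello<li>Mars</ul><p>Unclosed paragraph"


def parse_with(parser_name):
    """Return the parsed markup as a string, or None if the parser is missing."""
    try:
        soup = BeautifulSoup(html, parser_name)
    except FeatureNotFound:
        return None
    return str(soup)


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, output in results.items():
    print(f"{name:12} -> {output if output is not None else 'not installed'}")
```

Each parser repairs the broken markup in its own way, so the printed results can differ from one parser to the next.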



In [1]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.1.2-py3-none-any.whl (963 kB)
     |████████████████████████████████| 963 kB 3.6 MB/s eta 0:00:01
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.20.0-py3-none-any.whl (359 kB)
     |████████████████████████████████| 359 kB 23.0 MB/s eta 0:00:01
Collecting outcome
  Downloading outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.13.0-py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 16.4 MB/s eta 0:00:01
Installing collected packages: outcome, h11, wsproto, trio, trio-websocket, selenium
Successfully installed h11-0.13.0 outcome-1.1.0 selenium-4.1.2 trio-0.20.0 trio-websocket-0.9.2 wsproto-1.1.0


# 10.2.1 Use HTML Elements
Every webpage is built using Hypertext Markup Language, more commonly known as HTML.

Our first step will be to explore that design so that we can write a script that knows what it's looking at when it interacts with a webpage.

Open VS Code and create a file named index.html. This file can be saved to your desktop because it's just for practice.

In this blank HTML file, type an exclamation point on the first line and press Enter. This should autofill the editor with everything we need for a basic HTML page.
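To see how that skeleton is organized, the following sketch walks through an approximation of it with Python's built-in HTML parser and lists the opening tags in order. The `BOILERPLATE` string is an assumption about what the shortcut produces (the exact meta tags can vary between editor versions), and `TagCollector` is a helper written only for this example.

```python
from html.parser import HTMLParser

# Approximation of the skeleton the "!" shortcut produces; the exact
# meta tags can vary between editor versions.
BOILERPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>

</body>
</html>
"""


class TagCollector(HTMLParser):
    """Record every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(BOILERPLATE)
print(collector.tags)
# → ['html', 'head', 'meta', 'meta', 'title', 'body']
```

The output shows the nesting a scraper has to navigate: everything lives inside `html`, metadata sits in `head`, and visible content goes in `body`.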

![Screen%20Shot%202022-02-27%20at%2012.16.58%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.16.58%20PM.png)


![Screen%20Shot%202022-02-27%20at%2012.21.44%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.21.44%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.22.41%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.22.41%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.23.34%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.23.34%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.24.21%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.24.21%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.25.03%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.25.03%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.25.43%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.25.43%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.26.34%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.26.34%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.27.09%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.27.09%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.27.45%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.27.45%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.28.25%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.28.25%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.28.54%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.28.54%20PM.png)

![Screen%20Shot%202022-02-27%20at%2012.29.24%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%2012.29.24%20PM.png)

# 10.2.2 Using Chrome Developer Tools

Chrome Developer Tools (also known as DevTools) allows developers to look at the structure of any webpage. It also has a search function, which should help us make more sense of the tags and components that hold the data we are interested in.
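The search box in the DevTools Elements panel accepts CSS selectors, and the same selectors can later be handed to BeautifulSoup. The sketch below shows the idea on made-up markup; the `news`, `card`, and `title` names are hypothetical and do not come from any site discussed here.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a page inspected in DevTools.
html = """
<div id="news">
  <div class="card">
    <div class="title">First headline</div>
    <div class="teaser">First teaser</div>
  </div>
  <div class="card">
    <div class="title">Second headline</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector.
first_title = soup.select_one("div.card div.title")
# select returns every match; "#news" targets the element with id="news".
all_titles = [tag.get_text() for tag in soup.select("#news .title")]

print(first_title.get_text())
print(all_titles)
```

Trying a selector in DevTools first is a quick way to confirm it matches only the elements you want before putting it in a script.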

Let's visit one of the websites Robin plans to use and take a peek at its structure, then practice finding different components.


![Screen%20Shot%202022-02-27%20at%201.59.59%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%201.59.59%20PM.png)




![Screen%20Shot%202022-02-27%20at%202.00.28%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.00.28%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.01.31%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.01.31%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.08.56%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.08.56%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.09.28%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.09.28%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.10.06%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.10.06%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.10.54%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.10.54%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.12.04%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.12.04%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.12.36%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.12.36%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.13.43%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.13.43%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.14.32%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.14.32%20PM.png)

# 10.3.1 Use Splinter
The next part is to use Splinter to automate a browser. This is pretty fun because we'll be able to watch the browser work on its own, using a search bar or clicking a next button, without us clicking anywhere or typing in any fields. The scraping itself is done with BeautifulSoup, and this is where our practice with HTML tags comes in. To scrape the data we want, we'll have to tell BeautifulSoup which HTML tag is being used and whether it has an attribute such as a specific class or id.
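As a small sketch of what telling BeautifulSoup about a tag and its attributes looks like, the example below searches a made-up HTML snippet by tag name combined with an `id` or a `class`. The markup, the `featured` id, and the `fact` class are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup; the id and class names are hypothetical.
html = """
<section id="featured">
  <img class="headerimage" src="images/sample.jpg">
</section>
<ul>
  <li class="fact">First item</li>
  <li class="fact">Second item</li>
  <li>Unclassified item</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name plus id: find returns the first matching element.
featured = soup.find("section", id="featured")
img_src = featured.find("img").get("src")

# Tag name plus class: class_ has a trailing underscore because
# "class" is a reserved word in Python.
facts = [li.get_text() for li in soup.find_all("li", class_="fact")]

print(img_src)
print(facts)
```

The third list item is skipped because it has no `fact` class, which shows how an attribute narrows a search that the tag name alone would leave too broad.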

One of the fun things about web scraping is the automation—watching your script at work.

1. Once you execute your completed scraping script, a new Chrome web browser will pop up with a banner across the top that says "Chrome is being controlled by automated test software."

![Screen%20Shot%202022-02-27%20at%202.18.21%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.18.21%20PM.png)

2. This message lets you know that your Python script is directing the browser. The browser will visit websites and interact with them on its own.

![Screen%20Shot%202022-02-27%20at%202.19.06%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.19.06%20PM.png)

3. Depending on how you've programmed your script, your browser will click buttons, use a search bar, or even log in to a website.

![Screen%20Shot%202022-02-27%20at%202.20.17%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.20.17%20PM.png)
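The steps above could be scripted roughly as follows. This is a sketch written against the Splinter 0.17.0 and webdriver_manager 3.5.3 versions shown in the install output; newer releases may expect the driver path to be passed differently. The `scrape_page_html` helper and its parameters are hypothetical names chosen for this example.

```python
def scrape_page_html(url, wait_selector="body", wait_seconds=1):
    """Open Chrome, visit `url`, and return the rendered HTML as a string."""
    # Imported inside the function so the definition itself does not
    # need Splinter, webdriver_manager, or Chrome to be available.
    from splinter import Browser
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver_manager downloads a matching ChromeDriver and returns its path.
    executable_path = {"executable_path": ChromeDriverManager().install()}
    # headless=False keeps the window visible so you can watch the script work.
    browser = Browser("chrome", **executable_path, headless=False)
    try:
        browser.visit(url)
        # Give the page a moment to load the element we care about.
        browser.is_element_present_by_css(wait_selector, wait_time=wait_seconds)
        return browser.html
    finally:
        # Always close the automated browser, even if something fails.
        browser.quit()
```

A call such as `html = scrape_page_html("https://example.com")` (a placeholder URL) would open the automated Chrome window, load the page, and return its HTML, which can then be passed to BeautifulSoup.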

Navigate to your Mission-to-Mars folder using the terminal, then launch Jupyter Notebook. Create a new .ipynb file to get started; this is where we'll begin our web scraping work. Let's name it "Practice." It can be deleted when we're done, or kept as a reference for later. It's not necessary, but you can keep it in your repo folder and add it to your .gitignore file so that it isn't pushed to GitHub and stays hidden from public view.



![Screen%20Shot%202022-02-27%20at%202.22.41%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.22.41%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.24.51%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.24.51%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.25.17%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.25.17%20PM.png)

![Screen%20Shot%202022-02-27%20at%202.25.44%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%202.25.44%20PM.png)

# 10.3.2 Practice with Splinter and BeautifulSoup

![Screen%20Shot%202022-02-27%20at%209.25.39%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.25.39%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.26.06%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.26.06%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.26.38%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.26.38%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.27.11%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.27.11%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.27.45%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.27.45%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.28.18%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.28.18%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.28.48%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.28.48%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.29.29%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.29.29%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.30.01%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.30.01%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.30.39%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.30.39%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.31.19%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.31.19%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.31.55%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.31.55%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.32.24%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.32.24%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.33.47%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.33.47%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.34.17%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.34.17%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.35.10%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.35.10%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.36.57%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.36.57%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.37.43%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.37.43%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.38.15%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.38.15%20PM.png)

![Screen%20Shot%202022-02-27%20at%209.38.44%20PM.png](attachment:Screen%20Shot%202022-02-27%20at%209.38.44%20PM.png)

# 10.3.3 Scrape Mars Data: The News