# Web Scraping
Web scraping is a technique for automatically accessing a website and extracting large amounts of information from it, which saves the time and effort of collecting the data by hand.

## Basic components of a website

### HTML
HTML stands for Hypertext Markup Language, and every website on the internet uses it to structure the information it displays. The raw HTML of a page is what Python reads when it grabs information. Let's take a look at a simple webpage's HTML:

    <!DOCTYPE html>  
    <html>  
        <head>
            <title>Title on Browser Tab</title>
        </head>
        <body>
            <h1> Website Header </h1>
            <p> Some Paragraph </p>
        </body>
    </html>

Each tag written between < and > marks a specific kind of block on the webpage:

    1. <!DOCTYPE html>: HTML documents always start with this type declaration, which tells the browser
    that it is an HTML file.
    2. The component blocks of the HTML document are placed between <html> and </html>.
    3. Metadata and links to external resources, such as a CSS (Cascading Style Sheets) file or a JS
    (JavaScript) file, are often placed in the <head> block.
    4. The <title> tag defines the title of the webpage (it's what shows up in the browser tab).
    5. Whatever is between the <body> and </body> tags is visible to the site visitor.
    6. Headings are defined by the <h1> through <h6> tags, where the number is the heading level:
    <h1> is the most important and, by default, the largest.
    7. Paragraphs are defined by the <p> tag; this is essentially the normal text on the website.

    There are many more tags than these, such as <a> for hyperlinks, <table> for tables, <tr> for table rows,
    and <td> for table cells.
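
To make the nesting of these blocks concrete, here is a minimal sketch that walks the sample page with Python's built-in `html.parser` module and prints each opening tag indented by its depth. The `TagLister` class and the `SAMPLE_PAGE` name are made up for this illustration; they are not part of any scraping library.

```python
from html.parser import HTMLParser

# The sample page from above, stored as a plain string.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
    <head>
        <title>Title on Browser Tab</title>
    </head>
    <body>
        <h1> Website Header </h1>
        <p> Some Paragraph </p>
    </body>
</html>"""


class TagLister(HTMLParser):
    """Record every opening tag in document order, with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag_name) pairs

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


lister = TagLister()
lister.feed(SAMPLE_PAGE)

# Print the tag tree, one level of indentation per nesting level.
for depth, tag in lister.tags:
    print("    " * depth + f"<{tag}>")
```

The `<!DOCTYPE html>` declaration is not reported as a tag, which matches its role as a type declaration rather than a block of the page.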

### CSS

CSS stands for Cascading Style Sheets. It is what gives "style" to a website, including colors and fonts, and even some animations! CSS uses HTML attributes such as **id** and **class** to connect an HTML element to a style rule, such as a particular color. An **id** identifies a single HTML element and must be unique within the HTML document, so it is basically a single-use connection. A **class** defines a general style that can be attached to any number of HTML elements. For example, if you want only one HTML element to be red, give it an id and style that id; if you want several HTML elements to be red, create a class in your CSS file and add that class to each of those elements.
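
The same **id** and **class** attributes are also what a scraper uses to locate elements. The sketch below assumes the BeautifulSoup library used later in this document is installed; the page, the id `main-title`, and the class `highlight` are invented for the example.

```python
from bs4 import BeautifulSoup

# A made-up page: one element with an id, two elements sharing a class.
STYLED_PAGE = """
<html>
  <head>
    <style>
      #main-title { color: red; }   /* id selector: a single element */
      .highlight  { color: blue; }  /* class selector: any number of elements */
    </style>
  </head>
  <body>
    <h1 id="main-title">Website Header</h1>
    <p class="highlight">First paragraph</p>
    <p class="highlight">Second paragraph</p>
    <p>Unstyled paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(STYLED_PAGE, "html.parser")

# "#name" selects by id, ".name" selects by class (standard CSS selector syntax).
title_tags = soup.select("#main-title")
highlighted = soup.select(".highlight")

print([tag.get_text() for tag in title_tags])
print([tag.get_text() for tag in highlighted])
```

Selecting by id should return at most one element on a well-formed page, while selecting by class can return many.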

## Web scraping tools in Python

We need the following libraries installed to proceed with web scraping:

    conda install requests
    conda install lxml
    conda install bs4

## Gathering data on nuclear reactors in the USA


In [155]:
import requests

res = requests.get("http://www.nrc.gov/reactors/operating/list-power-reactor-units.html")

In [156]:
res.status_code

200

A status code of 200 means the request succeeded, so we were able to access the webpage.
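
Other status codes signal different outcomes. As a rough sketch, the hypothetical helper `describe_status` below (not part of requests) uses Python's standard `http.HTTPStatus` enum to turn a numeric code into a readable summary.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        phrase = "Unknown"

    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational or non-standard"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

In a real script, calling `res.raise_for_status()` on the response object is a common alternative: it raises an exception for 4xx and 5xx responses instead of letting the script continue with an error page.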

In [157]:
title = 'http://www.nrc.gov/reactors/operating/list-power-reactor-units.html'

Now we use BeautifulSoup to analyze the downloaded page. Technically we could write our own custom script to look for items in the **res.text** string, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage.

We use the **.select( )** method to grab elements. We are looking for the 'title' tag, so we pass in 'title'. The method returns a list of every matching element.
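
Here is a minimal, offline sketch of those two steps, building the soup and selecting the title. It parses a small hard-coded string (`html_text`, an assumption for this example) rather than the downloaded page, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Stand-in for res.text: a small HTML document held in a string.
html_text = """<!DOCTYPE html>
<html>
    <head><title>Title on Browser Tab</title></head>
    <body>
        <h1>Website Header</h1>
        <p>Some Paragraph</p>
    </body>
</html>"""

# Build the soup object with Python's built-in HTML parser.
soup = BeautifulSoup(html_text, "html.parser")

# .select() takes a CSS selector and returns a list of matching tags.
title_tags = soup.select("title")
print(title_tags)

# .get_text() strips the markup and returns only the text inside the tag.
print(title_tags[0].get_text())
```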

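In [162]:
from bs4 import BeautifulSoup

# build the soup object from the downloaded HTML before querying it
soup = BeautifulSoup(res.text, 'html.parser')
soup.select('title')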

[<title>NRC: List of Power Reactor Units</title>]

In [168]:
import pandas as pd
import requests

from bs4 import BeautifulSoup

In [169]:
url = "http://www.nrc.gov/reactors/operating/list-power-reactor-units.html"
main_page = requests.get(url)

In [171]:
soup = BeautifulSoup(main_page.content, 'html.parser')

In [172]:
# snip out the table and pass it to a new variable
reactors_table = soup.find('table')

In [173]:
# print reactors_table to verify we have the right thing
print(reactors_table)

<table border="1" cellpadding="5" cellspacing="0" summary="List of Power Reactor Units" width="100%">
<tr valign="top">
<th scope="col">Plant Name<br/>
Docket Number</th>
<th scope="col">License Number</th>
<th scope="col">Reactor<br/>
Type</th>
<th scope="col">Location</th>
<th scope="col">Owner/Operator</th>
<th scope="col">NRC Region</th>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
<td align="center">DPR-51</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/ano2.html">Arkansas Nuclear 2</a><br/>05000368</td>
<td align="center">NPF-6</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a

In [174]:
# use .find_all to create a list of rows in the table
reactor_rows = reactors_table.find_all('tr')

In [175]:
# isolate the fourth row and print it
ex_row = reactor_rows[3]
print(ex_row)

<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>05000334</td>
<td align="center">DPR-66</td>
<td>PWR</td>
<td>17 miles W of McCandless,  PA</td>
<td>FirstEnergy Nuclear Operating Co. </td>
<td align="middle">1</td>
</tr>


In [176]:
# use .find_all again to generate a list of the row's cells and return it
cells = ex_row.find_all('td')
cells

[<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>05000334</td>,
 <td align="center">DPR-66</td>,
 <td>PWR</td>,
 <td>17 miles W of McCandless,  PA</td>,
 <td>FirstEnergy Nuclear Operating Co. </td>,
 <td align="middle">1</td>]

In [177]:
# examine the "contents" of the first item in cells
cells[0].contents

[<a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a>,
 <br/>,
 '05000334']

In [178]:
# isolate and print the name, the link and the docket number
print(cells[0].contents[0].text)
print(cells[0].contents[0].get('href'))
print(cells[0].contents[2])

Beaver Valley 1
/info-finder/reactors/bv1.html
05000334
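
The link printed above is relative to the site root, so it needs the site's address in front of it before it can be requested. One way to build the full address, sketched here with the standard library's `urllib.parse.urljoin`, is to resolve the relative link against the URL of the page it came from. The variable names are chosen for this example.

```python
from urllib.parse import urljoin

# URL of the page that was scraped and a relative link found on it.
page_url = "http://www.nrc.gov/reactors/operating/list-power-reactor-units.html"
relative_link = "/info-finder/reactors/bv1.html"

# urljoin resolves the link the same way a browser would.
full_link = urljoin(page_url, relative_link)
print(full_link)  # http://www.nrc.gov/info-finder/reactors/bv1.html
```

Unlike plain string concatenation, `urljoin` also handles links that are already absolute or that are relative to the current directory.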


In [179]:
scraped_data = []
for row in reactors_table.find_all('tr')[1:]:
    
    # .find_all 'td' tags in the row and put them into a variable
    cells = row.find_all('td')
    
    # extract the cell contents
    reactor_name = cells[0].contents[0].text
    link = 'http://www.nrc.gov' + cells[0].contents[0].get('href')
    docket = cells[0].contents[2]
    license = cells[1].text
    reactor_type = cells[2].text
    location = cells[3].text
    owner = cells[4].text
    region = cells[5].text
    
    # append the collected data to the empty list
    scraped_data.append([reactor_name, link, docket, license, reactor_type, location, owner, region])

In [180]:
scraped_data

[['Arkansas Nuclear 1',
  'http://www.nrc.gov/info-finder/reactors/ano1.html',
  '05000313',
  'DPR-51',
  'PWR',
  '6 miles WNW of Russellville,\xa0\xa0AR',
  'Entergy Nuclear Operations, Inc. ',
  '4'],
 ['Arkansas Nuclear 2',
  'http://www.nrc.gov/info-finder/reactors/ano2.html',
  '05000368',
  'NPF-6',
  'PWR',
  '6 miles WNW of Russellville,\xa0\xa0AR',
  'Entergy Nuclear Operations, Inc. ',
  '4'],
 ['Beaver Valley 1',
  'http://www.nrc.gov/info-finder/reactors/bv1.html',
  '05000334',
  'DPR-66',
  'PWR',
  '17 miles W of McCandless,\xa0\xa0PA',
  'FirstEnergy Nuclear Operating Co. ',
  '1'],
 ['Beaver Valley 2',
  'http://www.nrc.gov/info-finder/reactors/bv2.html',
  '05000412',
  'NPF-73',
  'PWR',
  '17 miles W of McCandless,\xa0\xa0PA',
  'FirstEnergy Nuclear Operating Co. ',
  '1'],
 ['Braidwood 1',
  'http://www.nrc.gov/info-finder/reactors/brai1.html',
  '05000456',
  'NPF-72',
  'PWR',
  '20 miles SSW of Joliet,\xa0\xa0IL',
  'Exelon Generation Co., LLC ',
  '3'],
 ['Br
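
The scraped strings above still contain non-breaking spaces (`\xa0`) and trailing spaces. A small cleaning step, sketched below with a hypothetical `clean_text` helper, could be applied to each cell before the row is appended.

```python
def clean_text(value):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space '\xa0'.
    return " ".join(str(value).split())


raw_location = "6 miles WNW of Russellville,\xa0\xa0AR"
raw_owner = "Entergy Nuclear Operations, Inc. "

print(repr(clean_text(raw_location)))
print(repr(clean_text(raw_owner)))
```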

In [181]:
import pandas as pd
df = pd.DataFrame(scraped_data)
df.head(12)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | Arkansas Nuclear 1 | http://www.nrc.gov/info-finder/reactors/ano1.html | 5000313 | DPR-51 | PWR | 6 miles WNW of Russellville, AR | Entergy Nuclear Operations, Inc. | 4 |
| 1 | Arkansas Nuclear 2 | http://www.nrc.gov/info-finder/reactors/ano2.html | 5000368 | NPF-6 | PWR | 6 miles WNW of Russellville, AR | Entergy Nuclear Operations, Inc. | 4 |
| 2 | Beaver Valley 1 | http://www.nrc.gov/info-finder/reactors/bv1.html | 5000334 | DPR-66 | PWR | 17 miles W of McCandless, PA | FirstEnergy Nuclear Operating Co. | 1 |
| 3 | Beaver Valley 2 | http://www.nrc.gov/info-finder/reactors/bv2.html | 5000412 | NPF-73 | PWR | 17 miles W of McCandless, PA | FirstEnergy Nuclear Operating Co. | 1 |
| 4 | Braidwood 1 | http://www.nrc.gov/info-finder/reactors/brai1.... | 5000456 | NPF-72 | PWR | 20 miles SSW of Joliet, IL | Exelon Generation Co., LLC | 3 |
| 5 | Braidwood 2 | http://www.nrc.gov/info-finder/reactors/brai2.... | 5000457 | NPF-77 | PWR | 20 miles SSW of Joliet, IL | Exelon Generation Co., LLC | 3 |
| 6 | Browns Ferry 1 | http://www.nrc.gov/info-finder/reactors/bf1.html | 5000259 | DPR-33 | BWR | 32 miles W of Huntsville, AL | Tennessee Valley Authority | 2 |
| 7 | Browns Ferry 2 | http://www.nrc.gov/info-finder/reactors/bf2.html | 5000260 | DPR-52 | BWR | 32 miles W of Huntsville, AL | Tennessee Valley Authority | 2 |
| 8 | Browns Ferry 3 | http://www.nrc.gov/info-finder/reactors/bf3.html | 5000296 | DPR-68 | BWR | 32 miles W of Huntsville, AL | Tennessee Valley Authority | 2 |
| 9 | Brunswick 1 | http://www.nrc.gov/info-finder/reactors/bru1.html | 5000325 | DPR-71 | BWR | 30 miles S of Wilmington, NC | Duke Energy Progress, LLC | 2 |
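
The numeric column labels make the frame hard to read. A possible refinement, sketched here on two rows copied from the scraped output, is to pass descriptive column names to `pd.DataFrame`. The names in `column_names` are suggestions for this example and do not come from the NRC page.

```python
import pandas as pd

# Two sample rows in the same order as the scraping loop appends them.
sample_rows = [
    ["Arkansas Nuclear 1", "http://www.nrc.gov/info-finder/reactors/ano1.html",
     "05000313", "DPR-51", "PWR", "6 miles WNW of Russellville, AR",
     "Entergy Nuclear Operations, Inc.", "4"],
    ["Arkansas Nuclear 2", "http://www.nrc.gov/info-finder/reactors/ano2.html",
     "05000368", "NPF-6", "PWR", "6 miles WNW of Russellville, AR",
     "Entergy Nuclear Operations, Inc.", "4"],
]

# Hypothetical column names, one per scraped field.
column_names = ["reactor_name", "link", "docket", "license",
                "reactor_type", "location", "owner", "region"]

df = pd.DataFrame(sample_rows, columns=column_names)
print(df[["reactor_name", "docket", "reactor_type"]])
```

Keeping the docket numbers as strings preserves their leading zeros; converting them to integers would drop the zeros.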
