# Webscraping 101: What to do when APIs fail you.

Webscraping, or extracting data from the HTML (Hypertext Markup Language) code of websites, is a rather hacky substitute for gathering data via flat files (read: csv or Excel) or API calls. Whereas APIs have an existing structure that you can leverage to grab data, with webscraping you actually mine raw HTML code. 

Generally speaking, webscraping is much less performant than getting data from an API. However, technically you have unlimited freedom with what you can scrape. Most APIs have limits to the data you can access for free. Webscraping does not have a physical limitation to what you can access, **WITH A FEW CAVEATS**. Please note, **FOR SOME WEBSITES, IT CAN BE ILLEGAL TO WEBSCRAPE DATA. ALWAYS BE SURE IT IS LEGAL FOR YOU TO SCRAPE BEFORE ACTUALLY DOING SO**. However, for almost all of us, the websites we want to access will have no problem with us scraping data. 

For this tutorial, we will look at a simple web scraping program. There are many free APIs you can use to get stock data, but often they limit the granularity of their data to one row per day (open, high, low, close, avg_volume). Moreover, it can be beneficial to scrape your own data if you are looking for something specific.

### Dependencies

Here are some special dependencies needed to webscrape with Python. They are by no means the only modules you can use. You will learn more about them later on in this tutorial, but feel free to explore the links below as well. Standard packages like numpy and pandas will also be used. Most importantly, make sure you have **Google Chrome** on your computer.

* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [urllib.request](https://docs.python.org/3.0/library/urllib.request.html)
* [ssl](https://docs.python.org/2/library/ssl.html)


## Scraping Part 1: Pulling Scottrade's Top Performers Table

### General Information About HTML

This example is much simpler than **Webscraping 201**, and is a great introduction to the basic process of webscraping using **`BeautifulSoup`** and **`urllib.request`**. To start, be sure that you have **Google Chrome** on your computer. Open up a Google Chrome tab, and navigate to the Scottrade top performers URL seen **[here](https://research.scottrade.com/qnr/Public/InvestorTools/Performers?type=s&results_view=performing)**. This is a list of the companies with the fastest-growing stock prices. Next, right click anywhere on the screen and click **Inspect** or **Inspect Element**, depending on whether you are on a Mac or Windows computer. On the screen, you should now see something like the picture below:

![alt text](http://image.ibb.co/gLWRra/scrape1.png)

What you are seeing next to the actual page is the HTML source code for the webpage. If you put your cursor over the code seen in the upper right, the applicable part of the webpage will light up as well. HTML code is structured into **tags**. For more on HTML, see the website [here](http://html.com/). Basically, tables have table tags, and we need to find the class name of the given table so we can get the data.

First, go to the bottom taskbar and click on the option that says **#CenterContentContainer**. This is the main body of the webpage, which houses the table we need. If you look a little further down in the HTML, you will see a tag that says **`<table class="sortable Performers mb15" width="100%">`**. This is the tag that houses all of the stock data we need. As seen in the picture below, the **`<tbody>`** tag is a subgroup that holds all of the rows of data, identified by the **`<tr>`** tags. This information will come in handy when understanding our code.

![alt text](http://image.ibb.co/n5fajv/scrape2.png)
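
To make the nesting of **`<table>`**, **`<tbody>`**, **`<tr>`** and **`<td>`** concrete, here is a minimal sketch that prints the tag hierarchy of a tiny table. It uses Python's built-in `html.parser` instead of `BeautifulSoup`, purely so it runs with no extra installs. `SAMPLE_HTML`, `TagOutline` and `outline_tags` are hypothetical names made up for this illustration, and the sample table is a hand-written stand-in for the real page, not a copy of it.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the page's table (an assumption, not the real HTML).
SAMPLE_HTML = """
<table class="sortable Performers mb15">
  <thead><tr><th>Symbol</th><th>Sector</th></tr></thead>
  <tbody>
    <tr><td>WTW</td><td>Consumer Non-Cyclicals</td></tr>
    <tr><td>NVDA</td><td>Technology</td></tr>
  </tbody>
</table>
"""


class TagOutline(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1


def outline_tags(html):
    """Return a list of (depth, tag_name) pairs for the given HTML string."""
    parser = TagOutline()
    parser.feed(html)
    return parser.outline


# Print an indented outline: deeper tags are indented further.
for depth, tag in outline_tags(SAMPLE_HTML):
    print("  " * depth + f"<{tag}>")
```

The indented outline mirrors what the Chrome inspector shows: rows sit inside the table body, and cells sit inside rows. Note that the heading row uses **`<th>`** cells while the data rows use **`<td>`** cells.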

### Extracting HTML From Webpage

First, let's get our boilerplate code out of the way. A few notes on the code below. **ssl** "provides access to Transport Layer Security (often known as “Secure Sockets Layer”) encryption and peer authentication facilities for network sockets." Here we use it to build a context that skips certificate verification, which lets us contact the webpage without a verified handshake. That is convenient for a tutorial, but it also means the server's identity is not checked.

Second, the link below is the webpage for Scottrade's top performers.

In [1]:
###################################################
#
# Scrape Scottrade's Top Performers
#
# Created by: Sam Showalter
# Creation Date: 2017-04-01
#
###################################################

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as req
import ssl

#Link to Scottrade's list of top performing public companies
link = "https://research.scottrade.com/qnr/Public/InvestorTools/Performers?type=s&results_view=performing"

#Allowing for unverified handshake with website (skips certificate checks)
context = ssl._create_unverified_context()
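
To see what the unverified context actually changes, the sketch below compares it with the default context from the standard library. No network access is needed. The `describe` helper is a hypothetical name introduced only for this comparison.

```python
import ssl

# Default context: verifies the server certificate and the hostname.
verified = ssl.create_default_context()

# Private helper used in this tutorial: turns both checks off.
unverified = ssl._create_unverified_context()


def describe(ctx):
    """Summarize the two settings that control handshake verification."""
    return {
        "verify_mode": ctx.verify_mode.name,
        "check_hostname": ctx.check_hostname,
    }


print("default:   ", describe(verified))
print("unverified:", describe(unverified))
```

If the site you are scraping has a valid certificate, passing `ssl.create_default_context()` to `urlopen` instead should work just as well and keeps the checks in place.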

Third, we need to open up the URL and extract all of its HTML code. This code is saved as the **`soup`** variable seen below. Next, we need to grab just the HTML for the top performers table. This is where we use the **`class_`** argument of **`find`** to refine our search (the trailing underscore is needed because `class` is a reserved word in Python).

Lastly, we will create lists to store each column of data.

In [5]:
#Open up webpage
page = req.urlopen(link, context = context)

#Extract ALL html code from the webpage as a soup
soup = BeautifulSoup(page, "html.parser")

#Find the table within the soup for top performers
performers_table = soup.find('table', class_='sortable Performers mb15')

#Generate lists for all fields (see below for field descriptions)
symbol =[]
sector =[]
industry =[]
prior_close =[]
five_day_change =[]
four_week_change =[]
fifty2_week_change =[]
market_cap =[]
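
The way `class_` matches is worth a quick look, because a failed lookup returns `None` rather than raising an error. The sketch below is an offline illustration on a hand-written snippet (`SAMPLE_HTML` is an assumption, not the real page). It shows that the full class string matches, that a single class name matches too, and that the same classes in a different order do not.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in with two tables (an assumption, not the real page).
SAMPLE_HTML = """
<table class="other"><tr><td>ignore me</td></tr></table>
<table class="sortable Performers mb15"><tr><td>WTW</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact class string, as used in this tutorial.
exact = soup.find("table", class_="sortable Performers mb15")

# Any single one of the classes also matches the same table.
single = soup.find("table", class_="Performers")

# The same classes in a different order do not match as one string.
reordered = soup.find("table", class_="Performers sortable mb15")

# A class that is not on the page gives None, not an exception.
missing = soup.find("table", class_="no-such-class")

print("exact and single are the same tag:", exact is single)
print("reordered:", reordered)
print("missing:", missing)
```

Because a miss returns `None`, it can be worth checking `performers_table is not None` before looping over it. Otherwise a change in the page's class names shows up later as an `AttributeError`.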

### Extracting Text from HTML Code

Finally, we have the HTML code we need. As of now, the data looks like this:

In [6]:
print(performers_table)

<table cellpadding="0" cellspacing="0" class="sortable Performers mb15" width="100%"><thead><tr><th width="25"> </th><th class="left" valign="bottom" width="160"><span>Symbol</span><br/>Company</th><th class="left" valign="bottom" width="85"><span>Sector</span></th><th class="left" valign="bottom" width="85"><span>Industry</span></th><th class="right" valign="bottom" width=""><a href="?c=&amp;availability=&amp;assetType=&amp;fundFamily=&amp;classification=&amp;type=s&amp;perfSort=DSPriceCurrent&amp;perfOrder=D&amp;results_view=performing">Prior<br/>Close</a><span class="icons"></span></th><th class="right" valign="bottom" width=""><a href="?c=&amp;availability=&amp;assetType=&amp;fundFamily=&amp;classification=&amp;type=s&amp;perfSort=DSPrice5DayPctChg&amp;perfOrder=D&amp;results_view=performing">5 Day<br/>Change</a><span class="icons"></span></th><th class="right" valign="bottom" width=""><a href="?c=&amp;availability=&amp;assetType=&amp;fundFamily=&amp;classification=&amp;type=s&amp;

Yuck. Fortunately, we can pull out just the text from this data using the code below. By going through every row (**`<tr>`**), and gathering the contents (**`<td>`**) of each, we can fill the empty lists we created above.

In [7]:
#Find every row (tr) in the table
for row in performers_table.find_all("tr"):
    #Find every data cell (td) in the current row
    cells = row.find_all("td")
    if len(cells) == 9: #Only body rows have 9 td cells; the heading row uses th
        #Append the first text string of each cell to its list
        #(cells[0] is the blank first column, so it is skipped)
        symbol.append(cells[1].find(string=True))              # Symbol
        sector.append(cells[2].find(string=True))              # Sector
        industry.append(cells[3].find(string=True))            # Industry
        prior_close.append(cells[4].find(string=True))         # Prior Close
        five_day_change.append(cells[5].find(string=True))     # 5 Day Change
        four_week_change.append(cells[6].find(string=True))    # 4 Week Change
        fifty2_week_change.append(cells[7].find(string=True))  # 52 Week Change
        market_cap.append(cells[8].find(string=True))          # Market Cap
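
The same row-by-row pattern can be tried offline on a small hand-written table, which is handy if the live page is unavailable or has changed. This is a sketch under stated assumptions: `SAMPLE_HTML` and `extract_rows` are hypothetical names, the sample has two columns instead of nine, and it uses `get_text(strip=True)` (all text in the cell, with whitespace trimmed) rather than the first text string only.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the performers table (an assumption, not the real page).
SAMPLE_HTML = """
<table class="sortable Performers mb15">
  <thead><tr><th>Symbol</th><th>Sector</th></tr></thead>
  <tbody>
    <tr><td>WTW</td><td>Consumer Non-Cyclicals</td></tr>
    <tr><td>NVDA</td><td>Technology</td></tr>
  </tbody>
</table>
"""


def extract_rows(html, expected_cells):
    """Return the text of every table row that has exactly expected_cells td cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="sortable Performers mb15")
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # The heading row contains th cells only, so it has zero td cells
        # and is skipped by this length check.
        if len(cells) == expected_cells:
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = extract_rows(SAMPLE_HTML, expected_cells=2)
print(rows)
```

Collecting whole rows first and splitting them into columns later is a design choice. The tutorial's approach of one list per column works equally well for a table this size.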

That's it! We have all the data we need. Now we just need to format our lists. As of now, a column list looks like this:

In [8]:
print(symbol)

['WTW', 'NVDA', 'STM', 'SQ', 'ANET', 'TTWO', 'BCOR', 'CGNX', 'BRKS', 'MU', 'AMD', 'ITGR', 'NRG', 'ALGN', 'MTOR', 'W', 'NFLX', 'BABA', 'BBY', 'CSX']


To make everything more visually appealing, let's combine all of these lists into a pandas dataframe.

In [12]:
#Create pandas dataframe of all lists
df=pd.DataFrame(symbol,columns=['SYMBOL'])
df['SECTOR']=sector
df['INDUSTRY']=industry
df['PRIOR_CLOSE']=prior_close
df['5_DAY_CHANGE']=five_day_change
df['4_WEEK_CHANGE']=four_week_change
df['52_WEEK_CHANGE']=fifty2_week_change
df['MARKET_CAP']=market_cap
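
As an aside, the same kind of frame can be built in a single call by passing a dictionary that maps column names to lists. The sketch below uses only three columns and two rows, taken from the output shown in this tutorial. `df_sketch` is a hypothetical name chosen so it does not overwrite `df`. The strip step is there because the scraped industry text appears to carry a trailing space in the printed output.

```python
import pandas as pd

# Two rows taken from the output shown in this tutorial.
symbol = ["WTW", "NVDA"]
sector = ["Consumer Non-Cyclicals", "Technology"]
industry = [
    "Personal & Household Products & Services",
    "Semiconductors & Semiconductor Equipment ",  # trailing space, as it seems to be scraped
]

# Build the frame in one call: dictionary keys become column names.
df_sketch = pd.DataFrame({"SYMBOL": symbol, "SECTOR": sector, "INDUSTRY": industry})

# Every column here holds text, so trim stray whitespace from each one.
for column in df_sketch.columns:
    df_sketch[column] = df_sketch[column].str.strip()

print(df_sketch)
```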

That's all there is to it! Our data is now organized as seen below. For a more advanced version of webscraping, please see the **`Webscrape_201`** tutorial.

In [13]:
#Show Scottrade's top performers
print(df)

   SYMBOL                  SECTOR                                   INDUSTRY  \
0     WTW  Consumer Non-Cyclicals   Personal & Household Products & Services   
1    NVDA              Technology  Semiconductors & Semiconductor Equipment    
2     STM              Technology  Semiconductors & Semiconductor Equipment    
3      SQ              Technology                     Software & IT Services   
4    ANET              Technology                Communications & Networking   
5    TTWO      Consumer Cyclicals                           Leisure Products   
6    BCOR              Technology                     Software & IT Services   
7    CGNX             Industrials          Machinery, Equipment & Components   
8    BRKS              Technology  Semiconductors & Semiconductor Equipment    
9      MU              Technology  Semiconductors & Semiconductor Equipment    
10    AMD              Technology  Semiconductors & Semiconductor Equipment    
11   ITGR              Healthcare       