# <font color=blue> Web Scraping in Python </font>

 #### <font color=red>Resources :</font>
 
 [Udemy Course](https://www.udemy.com/learning-python-for-data-analysis-and-visualization).

 [Web scraping article from **Dataquest**](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
 
 [Article on BeautifulSoup](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html)

We'll go over how to **scrape information** from a webpage using Python. 

**Work Flow :** 

>* We'll go to a website 
>* Decide what information we want
>* Inspect the HTML in the browser to see where and how it is stored
>* **Scrape the information** and 
>* Set it as a pandas DataFrame!

## Some things we should consider before scraping a website:

1) We should check a site's terms and conditions before we scrape it. 

2) We should space out our requests so we don't overload the site's server; flooding it with requests could get us blocked.

3) Scrapers break over time. Web pages change their layout often, so we'll more than likely have to rewrite our code. 

4) Web pages are usually inconsistent, so we'll more than likely have to clean up the data after scraping it.

5) Every web page and situation is different, so we'll have to spend time configuring our scraper.
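
Points 1 and 2 can be partly automated. The sketch below is one possible approach, not a complete answer: it reads a site's `robots.txt` rules (a different thing from its terms and conditions, which still need to be read by a person) and pauses between requests. The `robots.txt` text, the example URLs, and the helper names `build_robot_parser` and `polite_fetch` are all made up for illustration; only the standard-library `urllib.robotparser` and `time` modules are used.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_fetch(urls, fetch, parser, user_agent="*", default_delay=1.0, sleep=time.sleep):
    """Fetch the allowed URLs one by one, pausing between requests."""
    # Use the site's requested crawl delay if it states one, else our own default.
    delay = parser.crawl_delay(user_agent) or default_delay
    pages = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # robots.txt asks crawlers to stay away from this path
        if pages:
            sleep(delay)  # wait before every request after the first one
        pages[url] = fetch(url)
    return pages


# Demo with a stub fetch function and a no-op sleep, so nothing is downloaded.
demo_urls = [
    "http://example.com/reports/a.html",
    "http://example.com/private/b.html",
    "http://example.com/reports/c.html",
]
pages = polite_fetch(
    demo_urls,
    fetch=lambda url: "<html>stub</html>",
    parser=build_robot_parser(ROBOTS_TXT),
    sleep=lambda seconds: None,
)
print(sorted(pages))
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url, timeout=10)` and `sleep` would be left at its default, `time.sleep`.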

#### Sources for Learning HTML

[W3School](http://www.w3schools.com/html/)

[Codecademy](http://www.codecademy.com/tracks/web)

## There are 2 modules we'll need :

> 1) <font color=blue>**requests**</font>, which you can download by typing `pip install requests` or `conda install requests` (for the Anaconda distribution of Python) in your command prompt.

> 2) <font color=blue>**BeautifulSoup**</font>, which you can download by typing `pip install beautifulsoup4` or `conda install beautifulsoup4` (for the Anaconda distribution of Python) in your command prompt.

In [1]:
# Note: the alias "re" shadows the name of Python's standard-library regex module "re"
import requests as re
import pandas as pd
from pandas import Series, DataFrame
from bs4 import BeautifulSoup

[Documentation of the requests module](http://docs.python-requests.org/en/master/)

[Documentation of BeautifulSoup module](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects)

We'll look at some **`legislative reports`** from the University of California Web Page. Feel free to experiment with other webpages, but remember to be cautious and respectful in what you scrape, and how often you do it. Always check the legality of a web scraping job.

In [2]:
url = 'http://www.ucop.edu/operating-budget/budgets-and-reports/legislative-reports/2013-14-legislative-session.html'

### <font color=red>Step 01:</font> Set up `requests` to grab `content` from the `url`

> The first thing we'll need to do to scrape a web page is to download the page. We can download pages using the Python `requests` library. The requests library will make a `GET` request to a web server, which will download the `HTML contents` of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

In [3]:
# Request content from the web page
webpage = re.get(url)
webpage

<Response [200]>

> After running our request, we get a `Response object`. This object has a `status_code` property, which indicates if the page was downloaded successfully.

> A status_code of `200` means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

In [4]:
webpage.status_code

200
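
The rule of thumb about status codes can be written down as a small helper. This is only an illustrative sketch: `describe_status` is a made-up name, and the reason phrases come from the standard-library `http.HTTPStatus` enum rather than from a live response.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a coarse category and the standard reason phrase for a status code."""
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "unknown"  # not a registered status code
    return category, phrase


for code in (200, 404, 503):
    print(code, describe_status(code))
```

With a real response we usually don't need to classify codes by hand: calling `webpage.raise_for_status()` raises an exception when the status code is a 4xx or 5xx error.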

In [5]:
# We can print out the HTML content of the page using the "content" property.
webpage.content



### <font color=red>Step 02:</font> Set the webpage content as a <font color=blue>BeautifulSoup</font> object.

> We have downloaded an HTML document. Now we can use the `BeautifulSoup library` to parse this document.

In [6]:
# Set as Beautiful Soup Object
c = webpage.content
soup = BeautifulSoup(c,'html.parser')

>We can print out the HTML content of the page, formatted nicely, using the `prettify` method on the BeautifulSoup object.

In [None]:
print(soup.prettify())

> As all the tags are nested, we can move through the structure `one level` at a time. We can first select all the elements at the top level of the page using the `children` property of soup. Note that children returns an `iterator` rather than a list, so we need to call the `list function` on it to see its items.

In [7]:
list(soup.children)

['\n',
 'html',
 '\n',
 '[if lt IE 9]><html class="lte-ie8 no-js"  lang="en"><![endif]',
 '\n',
 '[if gt IE 8]><!',
 <html class="no-js" lang="en"><!--<![endif]-->
 <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta charset="utf-8"/>
 <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
 <meta content="" name="description"/>
 <meta content="" name="author"/>
 <title>Legislative reports | UCOP</title>
 <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
 <!--[if lt IE 9]>
       <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
     <![endif]-->
 <!-- Le styles -->
 <!-- main.css - see /_common/files/css/main.less non-minified sources -->
 <link href="/_common/files/css/main.css?v=1.2" media="screen" rel="stylesheet"/>
 <link href="/_common/files/css/print.css" media="print" rel="stylesheet"/>
 <!-- Le fav and touch icons -->
 <link href="/_common/files/i

In [9]:
[type(item) for item in list(soup.children)]

[bs4.element.NavigableString,
 bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.Tag]

> `Doctype` object contains information about the type of the document.

> `NavigableString` represents text found in the HTML document. 

> `Comment` is a special kind of NavigableString that holds an HTML comment.

> `Tag` object represents an HTML tag and can contain other nested tags and text. The most important object type, and the one we'll deal with most often, is the Tag object. The Tag object allows us to navigate through an HTML document, and extract other tags and text. 
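
To see these object types on something smaller than a full page, here is a short sketch using a made-up HTML snippet (`SAMPLE_HTML` is hypothetical, not taken from the UC page). It also shows one way to pick out the top-level `Tag` by type instead of by position, which may be less fragile than a fixed index if the page gains or loses a comment.

```python
from collections import Counter

from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up document used only for this illustration.
SAMPLE_HTML = """<!DOCTYPE html>
<!-- a made-up comment -->
<html><head><title>Sample page</title></head>
<body><p>Hello</p></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Count how many top-level children there are of each object type.
kinds = Counter(type(node).__name__ for node in soup.children)
print(kinds)

# Select the top-level tag by its type rather than by its position in the list.
top_tags = [node for node in soup.children if isinstance(node, Tag)]
html = top_tags[0]
print(html.name)

# soup.html is a shortcut for the first <html> tag in the document.
print(soup.html.title.string)
```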

In [10]:
html = list(soup.children)[6]
html

<html class="no-js" lang="en"><!--<![endif]-->
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="" name="description"/>
<meta content="" name="author"/>
<title>Legislative reports | UCOP</title>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
      <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
<!-- Le styles -->
<!-- main.css - see /_common/files/css/main.less non-minified sources -->
<link href="/_common/files/css/main.css?v=1.2" media="screen" rel="stylesheet"/>
<link href="/_common/files/css/print.css" media="print" rel="stylesheet"/>
<!-- Le fav and touch icons -->
<link href="/_common/files/img/ico/favicon.ico" rel="shortcut icon"/>
<!-- <link href="/files/img/ico/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="/f

### <font color=red>Step 03:</font>  Use `BeautifulSoup` to search for the **`table`** we want to grab!

> If we want to extract every instance of a particular tag, we can use the `find_all` method, which will find all the instances of that tag on a page.

> Note that find_all returns a list, so we'll have to loop through it, or use list indexing, to extract text.

> If we instead only want to find the `first instance of a tag`, we can use the `find` method, which will return a single `Tag` object (or `None` if nothing matches).

In [None]:
# Go to the section of interest - we find the 'div' element's class and id by inspecting the webpage
# The find() method returns only the first descendant of this Tag matching the given criteria
div = soup.find("div",{'class':'list-land','id':'content'})

# Find the tables in the HTML
tables = div.find_all('table')
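
The difference between `find` and `find_all` is easier to see on a tiny example. The snippet below is a sketch on made-up HTML (`SAMPLE_HTML` is hypothetical); it reuses the same `class` and `id` values as the cell above only so the calls look familiar.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables inside the div we are interested in.
SAMPLE_HTML = """
<div id="content" class="list-land">
  <table><tr><td>first table</td></tr></table>
  <table><tr><td>second table</td></tr></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", {"class": "list-land", "id": "content"})

# find_all returns every match, in document order, as a list-like result.
tables = div.find_all("table")
print(len(tables))

# find returns only the first match ...
first = div.find("table")
print(first.get_text(strip=True))

# ... or None when nothing matches, so it is worth checking before using the result.
missing = div.find("ul")
print(missing is None)
```

If the `div` itself were missing from the page, `soup.find(...)` would return `None` and `div.find_all` would raise an `AttributeError`, so a quick `if div is None` check can make a scraper fail with a clearer message.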


Now we need to use Beautiful Soup to find the table entries. 

A `td` tag  defines a standard cell in an HTML table. The `tr` tag defines a row in an HTML table.

We'll parse through our tables object and try to find each cell using the `find_all('td')` method (written `findAll('td')` in older Beautiful Soup code).

There are tons of options to use with find_all in Beautiful Soup. You can read about them [here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all).

In [None]:
# Set up empty data list
data = []

# There is only one table here, so take its rows from the first entry in tables
rows = tables[0].find_all('tr')

# Now grab the first text string of every cell in every row
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        # string=True matches the first text node in the cell (None if the cell has no text);
        # older Beautiful Soup code wrote this as text=True
        text = td.find(string=True)
        print(text)
        data.append(text)
    

Let's see what the data list looks like

In [None]:
data
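
The flat `data` list works, but it loses track of which cells belong to the same row. As an alternative sketch, assuming a table laid out with one date cell and one report cell per row, we can keep each row together. The HTML, dates, and report names below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table: a header row followed by two data rows.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Report</th></tr>
  <tr><td>01/15/14</td><td><a href="a.pdf">Budget summary (pdf)</a></td></tr>
  <tr><td>02/01/14</td><td><a href="b.pdf">Enrollment report (pdf)</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # get_text collects all the text inside a cell, including text nested in links.
    cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

for row in rows:
    print(row)
```

Each entry in `rows` is then a `[date, report]` pair, so there is no need to look back one position in a flat list to find the date.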

> Now we'll use a for loop to go through the list and grab only the cells with a `pdf` file in them. We'll also need to keep track of the index, because the date of each report sits in the cell just before it.

In [None]:
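# Set up empty lists
reports = []
date    = []

# Go find the pdf cells, tracking each cell's position with enumerate
for index, item in enumerate(data):
    # Skip empty cells (None) and cells that don't mention a pdf
    if item and 'pdf' in item:
        # The date sits in the cell just before the report
        date.append(data[index-1])

        # Replace the non-breaking space \xa0 with a regular space
        reports.append(item.replace('\xa0', ' '))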

> You'll notice a line that takes care of **'\xa0'**. This character is the Unicode non-breaking space (`&nbsp;` in HTML), and leaving it in can trigger a `unicode error` (notably in Python 2) or leave odd characters in the output. Web pages can be messy and inconsistent, and it is very likely you'll have to do some research to take care of problems like these.

Here's the link I used to solve this particular issue: [StackOverflow Page](http://stackoverflow.com/questions/10993612/python-removing-xa0-from-string)
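
As a quick illustration of the `\xa0` cleanup, the sketch below compares the targeted `replace` call with Unicode normalization from the standard-library `unicodedata` module. The sample string is made up.

```python
import unicodedata

# Made-up cell text containing non-breaking spaces.
raw = "Budget\xa0Report\xa0(pdf)"

# Option 1: replace only the non-breaking space.
replaced = raw.replace("\xa0", " ")

# Option 2: NFKC normalization maps \xa0 to a regular space,
# but it also rewrites other "compatibility" characters such as ligatures.
normalized = unicodedata.normalize("NFKC", raw)

print(repr(raw))
print(repr(replaced))
print(replaced == normalized)
```

The `replace` call is the more conservative choice when `\xa0` is the only character causing trouble, since it leaves every other character untouched.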

Now all that is left is to organize our data into a pandas DataFrame!

In [None]:
# Set up Dates and Reports as Series
date    = Series(date)
reports = Series(reports)

In [None]:
# Concatenate into a DataFrame
legislative_df = pd.concat([date,reports],axis=1)

In [None]:
# Set up the columns
legislative_df.columns = ['Date','Reports']

In [None]:
# Show the finished DataFrame
legislative_df
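
The three cells above can also be collapsed into a single step by passing a dictionary to `pd.DataFrame`. This is a sketch with invented values (the dates and report names are placeholders, not the real scraped data); it checks that both routes give the same table.

```python
import pandas as pd

# Placeholder values standing in for the scraped lists.
dates = ["01/15/14", "02/01/14"]
reports = ["Budget summary (pdf)", "Enrollment report (pdf)"]

# One step: the dictionary keys become the column names.
legislative_df = pd.DataFrame({"Date": dates, "Reports": reports})

# The Series + concat route used above, for comparison.
via_concat = pd.concat([pd.Series(dates), pd.Series(reports)], axis=1)
via_concat.columns = ["Date", "Reports"]

same = legislative_df.equals(via_concat)
print(legislative_df)
print(same)
```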

### Displaying the full name of the reports 

In [None]:
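# None removes the column width limit (-1, used by older pandas versions, was deprecated in pandas 1.0)
pd.set_option('display.max_colwidth', None)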

In [None]:
legislative_df

There are other less intense options for web scraping:

Check out these two companies:

https://import.io/

https://www.kimonolabs.com/


## Good Job!