# Python Web Scraping Tutorial using BeautifulSoup
   Part of it adapted from Vik Paruchuri
   
   When performing data science tasks, it’s common to want to use data found on the internet. You’ll usually be able to access this data in CSV format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you’ll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.
   
##   The components of a web page
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

1. HTML – contains the main content of the page.
2. CSS – adds styling to make the page look nicer.
3. JS – JavaScript files add interactivity to web pages.
4. Images – image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.

## HTML
HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

> `<html>`

> `</html>`

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

> `<html>
    <head>
    </head>
    <body>
    </body>
> </html>`
>

We still haven’t added any content to our page (that goes inside the body tag), so we again won’t see anything.

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:

>`<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
></html>`

Here’s how this will look:

>Here's a paragraph of text!

>Here's a second paragraph of text!

Tags have commonly used names that depend on their position in relation to other tags:

+ child – a child is a tag inside another tag. So the two p tags above are both children of the body tag.
+ parent – a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
+ sibling – a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
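
The parent and child relationships described above can be traced with Python's built-in `html.parser` module. This is only a minimal sketch for illustration and is not part of the original tutorial: the `NestingRecorder` class and the shortened sample document are made-up names and data.

```python
from html.parser import HTMLParser

class NestingRecorder(HTMLParser):
    """Record (tag, parent) pairs while walking an HTML document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags that are currently open
        self.relations = []   # (tag, parent) pairs in document order

    def handle_starttag(self, tag, attrs):
        # The innermost open tag is the parent of the tag that just opened
        parent = self.open_tags[-1] if self.open_tags else None
        self.relations.append((tag, parent))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to, and including, the most recently opened tag with this name
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

sample = "<html><head></head><body><p>First</p><p>Second</p></body></html>"
recorder = NestingRecorder()
recorder.feed(sample)
for tag, parent in recorder.relations:
    print(tag, "is a child of", parent)
```

Tags that report the same parent, such as the two p tags here, are siblings.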

We can also add properties (called attributes in HTML terminology) to HTML tags that change their behavior:

>`<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
></html>`

Here’s how this will look:

>Here's a paragraph of text! [Learn Data Science Online](https://www.dataquest.io)

>Here's a second paragraph of text! [Python](https://www.python.org)

In the above example, we added two a tags. An a tag is a link, and tells the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common HTML tags. Here are a few others:

+ div – indicates a division, or area, of the page.
+ b – bolds any text inside.
+ i – italicizes any text inside.
+ table – creates a table.
+ form – creates an input form.

For a full list of tags, search online for an HTML element reference.

Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

>`<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
></html>`

Here’s how this will look:

>Here's a paragraph of text! [Learn Data Science Online](https://www.dataquest.io)

>Here's a second paragraph of text! [Python](https://www.python.org)

As you can see, adding classes and ids doesn’t change how the tags are rendered on their own; they only affect the display when a stylesheet or script targets them.

## The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. GET is just one of several types of request we can make with requests. If you want to learn more, check out the [Requests documentation](http://docs.python-requests.org/en/master/).

**Twitter, Spotify, Microsoft, Amazon, Lyft, BuzzFeed, Reddit, The NSA, Her Majesty's Government, Google, Twilio, Runscope, Mozilla, Heroku, PayPal, NPR, Obama for America, Transifex, Native Instruments, The Washington Post, SoundCloud, Kippt, Sony, and Federal U.S. Institutions that prefer to be unnamed claim to use Requests internally.**

Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.

In [None]:
import requests

In [None]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html") #the url of the page you want to download.

After running our request, we get a **Response object**. This object has a status_code property, which indicates if the page was downloaded successfully:

In [None]:
page.status_code

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
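
The rule of thumb above can be written down as a small helper. This is a sketch using only the standard library; `describe_status` is a hypothetical name introduced here, not something provided by requests. In real code, calling `page.raise_for_status()` on the Response object raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know about
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase}: {kind}"

for code in (200, 404, 500):
    print(describe_status(code))
```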

We can print out the HTML content of the page using the content property:

In [None]:
page.content #get content of the page

In [None]:
page.headers #get headers of the page

In [None]:
page.encoding #get encoding of the page
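
The content property holds raw bytes, while the encoding tells us how to turn those bytes into text. The sketch below illustrates the difference without touching the network; the sample string is made up for illustration and stands in for a downloaded page.

```python
# Bytes as they would arrive from a server, similar to page.content
raw = "Café menu".encode("utf-8")
print(raw)

# Decoding with the right encoding gives readable text, similar to page.text
text = raw.decode("utf-8")
print(text)

# Decoding with the wrong encoding does not fail here, but garbles the accented letter
garbled = raw.decode("latin-1")
print(garbled)
```

This is why it is worth checking page.encoding when the text of a page shows unexpected characters.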

### Passing Parameters In URLs

You often want to send some sort of data in the URL’s query string. If you were constructing the URL by hand, this data would be given as key/value pairs in the URL after a question mark, e.g. http://httpbin.org/get?key=val. Requests allows you to provide these arguments as a dictionary of strings, using the params keyword argument. As an example, if you wanted to pass key1=value1 and key2=value2 to http://httpbin.org/get, you would use the following code:

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=payload)

You can see that the URL has been correctly encoded by printing the URL:

In [None]:
r.url
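
To see what an encoded query string looks like without sending a request, a URL can also be assembled by hand with the standard library's `urllib.parse`. This is an illustrative sketch of the same key/value encoding; it is not how Requests is implemented, and the two may differ in edge cases.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the query string and append it after the question mark
url = 'http://httpbin.org/get?' + urlencode(payload)
print(url)

# Going the other way: split a URL and read its query string back into a dict
query = parse_qs(urlsplit(url).query)
print(query)
```

Note that `parse_qs` returns a list for each key, because a key may appear more than once in a query string.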

In [None]:
r = requests.get('http://apple.com')

In [None]:
r.content[:100]

### Custom Headers

If you’d like to add HTTP headers to a request, simply pass in a dict to the headers parameter.

For example, we didn’t specify our user-agent in the previous example:

In [None]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)

Sometimes websites refuse requests from applications such as Python scripts. In that case, you have to send the request as if it came from a web browser.

The user-agent header is the tool for imitating a browser.

In [None]:
#Make your Python request look as if it came from a web browser
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get(url, headers=headers)

In [None]:
r.headers # display the full information of the header of the page received

### Response Types
There are several response types:

+ Text web page
+ Bytes contents, e.g., images, files
+ JSON contents
+ Raw contents

In [None]:
#Text response
r = requests.get('https://api.github.com/events')
r.text[:200]#display the first 200 characters of the content
#with r.text, Requests will automatically decode content from the server.

In [None]:
#bytes response contents
r = requests.get('http://higheredutah.org/wp-content/uploads/2016/03/USU-search-cmte.jpg')
r.content[:100]

In [None]:
from PIL import Image
from io import BytesIO

i = Image.open(BytesIO(r.content))
i.show()

#Save bytes contents
with open('usu.jpg', 'wb') as fd:
    fd.write(r.content)

In [None]:
#JSON response content
r = requests.get('https://api.github.com/events')
r.json()

#### Raw response
In the rare case that you'd like to get the raw socket response from the server, you can access r.raw. If you want to do this, make sure you set stream=True in your initial request. Once you do, you can do this:

In [None]:
r = requests.get('https://api.github.com/events', stream=True)
#r.raw
r.raw.read(10)


In general, however, you should use a pattern like this to save what is being streamed to a file:

In [None]:
filename='test'
with open(filename, 'wb') as fd: # 'wb' is write as bytes
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

Using Response.iter_content will handle a lot of what you would otherwise have to handle when using Response.raw directly. When streaming a download, the above is the preferred and recommended way to retrieve the content. Note that chunk_size can be freely adjusted to a number that may better fit your use cases.

Streaming is especially useful if you are downloading a large file of which you need only a small part: you can stop as soon as you have what you need, which saves download time and storage space.

In [None]:
url='https://www.sec.gov/Archives/edgar/data/7323/0000065984-14-000065.txt' #this file is more than 400 megabytes.
marker=b'<TYPE>GRAPHIC' #stop downloading as soon as this marker appears
#marker=b'</html>' #alternative marker
keep=len(marker)-1 #the marker may be split across two chunks, so hold back the last 12 bytes of each chunk
r = requests.get(url, stream=True)
filename='test.txt'
with open(filename, 'wb') as fd: #write bytes: decoding chunk by chunk can fail if a multi-byte character is split across chunks
    n=0 #count how many full chunks we process before the marker is found
    cont=b'' #bytes held back from the previous chunk; they have not been written yet
    for chunk in r.iter_content(chunk_size=1024*1024):
        test=cont+chunk #prepend the held-back bytes so a marker spanning two chunks is still found
        inde=test.find(marker) #returns -1 if not found, otherwise the index of the first occurrence
        if inde!=-1: #if found, write only the bytes before the marker and stop
            print('found it')
            fd.write(test[:inde])
            break
        fd.write(test[:-keep]) #if not found, write everything except the last 12 bytes
        cont=test[-keep:] #hold back the last 12 bytes for the next chunk
        n+=1
    else:
        fd.write(cont) #marker never found: write the bytes that were still held back
print(n)
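
The tricky part of stopping a streamed download at a marker is that the marker may be split across two chunks. The sketch below isolates that logic so it can be checked on in-memory data, without downloading anything. `copy_until_marker` is a hypothetical helper written for this illustration, and the sample chunks are made up.

```python
import io

def copy_until_marker(chunks, marker, out):
    """Write bytes from chunks to out, stopping just before marker.

    Returns True if the marker was found, False otherwise.
    """
    keep = len(marker) - 1   # longest part of the marker that can end a chunk
    tail = b''               # bytes held back from the previous chunk, not yet written
    for chunk in chunks:
        window = tail + chunk
        index = window.find(marker)
        if index != -1:
            out.write(window[:index])
            return True
        if keep:
            out.write(window[:-keep])
            tail = window[-keep:]
        else:
            out.write(window)
    out.write(tail)          # marker never appeared: flush what was held back
    return False

# The marker is split between the first and the second chunk
chunks = [b'header <TY', b'PE>GRAPHIC binary data']
out = io.BytesIO()
found = copy_until_marker(chunks, b'<TYPE>GRAPHIC', out)
print(found, out.getvalue())
```

With a real streamed response, the same helper could be called as `copy_until_marker(r.iter_content(chunk_size=1024*1024), b'<TYPE>GRAPHIC', fd)`, where `fd` is a file opened in `'wb'` mode.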

## Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse a downloaded document and extract text from its tags. For more information, visit the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

We first download a page, then import the library and create an instance of the BeautifulSoup class to parse the document:

In [None]:
nytimes=requests.get('https://www.nytimes.com/section/technology')

In [None]:
nytimes.text[:1000]

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(nytimes.content, 'html.parser') #we need to specify a parser to soup

In [None]:
soup.get_text()[15000:17000]

### Basic tagging information

In [None]:
soup.a #get the first a tag

In [None]:
soup.a.name #get the tag name

In [None]:
soup.a.string #get the string between the tag

In [None]:
soup.a['class'] #get the 'class' attribute of tag a

In [None]:
soup.a['href'] #get the 'href' attribute of tag a, which is the hyperlink target

In [None]:
soup.title #get the first title tag

In [None]:
soup.title.parent.name # get the parent name of title tag

In [None]:
soup.img #get the first 'img' tag
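
Because the live page changes over time, it can help to try the same attribute access on a small, fixed document. The sketch below reuses the sample HTML from the earlier HTML section; the `<title>` text is an assumption added for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head><title>Sample page</title></head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
    </body>
</html>
"""
sample = BeautifulSoup(html, 'html.parser')

first_link = sample.a
print(first_link.name)           # the tag name
print(first_link.string)         # the text between the opening and closing tag
print(first_link['href'])        # attribute lookup; raises KeyError if the attribute is missing
print(first_link.get('class'))   # .get() returns None instead of raising
print(sample.title.parent.name)  # the tag that contains the title tag
```

Using `.get()` is a safer choice when you are not sure every tag carries the attribute you want.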

### Finding all instances of a tag at once

Attribute access such as `soup.a` returns only the first matching tag. If we want to extract every instance of a tag on a page, we can instead use the find_all method.

`soup.find_all('a')` returns a list of all 'a' tags

`soup.find_all('div')` returns a list of all 'div' tags

In [None]:
#get all hyperlinks of a page
#href=True skips 'a' tags that have no href attribute, which would otherwise raise a KeyError
links=[a['href'] for a in soup.find_all('a', href=True)] #list comprehension, equivalent to:
                                                         #links=[]
                                                         #for a in soup.find_all('a', href=True):
                                                         #    links.append(a['href'])

In [None]:
len(links)

In [None]:
links[:20]

In [None]:
l=set(links) #convert list to set, removing duplicates
print(len(l))

In [None]:
links=list(l)
links[:20]

#### Find all headlines

In [None]:
headlines=[x.get_text() for x in soup.find_all('h2', class_='headline')] #Note the use of class_ with a trailing underscore: class is a reserved word in Python
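
Searching by class and id can be tried on the small sample document from the HTML section, which gives predictable results. This is a sketch on that made-up markup, not on the live page; the words "First" and "Second" shorten the original paragraph text.

```python
from bs4 import BeautifulSoup

html = """
<p class="bold-paragraph">
    First <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
    Second <a href="https://www.python.org" class="extra-large">Python</a>
</p>
"""
sample = BeautifulSoup(html, 'html.parser')

# class_ matches any element that lists the class, even alongside other classes
large = sample.find_all(class_='extra-large')
print([tag.name for tag in large])

# Restrict the search to one tag name
bold_paragraphs = sample.find_all('p', class_='bold-paragraph')
print(len(bold_paragraphs))

# An id is meant to be unique on a page, so find() is enough
learn_link = sample.find(id='learn-link')
print(learn_link['href'])
```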

In [None]:
len(headlines)

In [None]:
h=set(headlines) #removing duplicates

In [None]:
len(h)

In [None]:
h

### What if we only need the latest news?

In [None]:
latest_panel=soup.find('section', id='latest-panel')

In [None]:
latest_panel

In [None]:
hl=latest_panel.li #find the first 'li' tag within latest_panel

In [None]:
hl

In [None]:
hl.a['href'] #get link from hl

In [None]:
hl.h2.get_text().replace('\n','').strip() #get the h2 tag from hl and get the text between tags, removing '\n' and leading/trailing whitespace

In [None]:
hl.time #get the 'time' tag from hl

In [None]:
hl.time['datetime'] #get the 'datetime' attribute of the time tag

### Get all li tags from latest_panel whose id starts with 'story-id'

In [None]:
hls=latest_panel.find_all('li', id=lambda x: x and x.startswith('story-id'))
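
When a function is passed as an attribute filter, BeautifulSoup calls it with the attribute's value, which is None for tags that lack the attribute. That is why the filter checks the value before calling startswith. The sketch below shows this on a small hand-written list; the markup, links, and dates are made up for illustration and only imitate the structure used above.

```python
from bs4 import BeautifulSoup

html = """
<ol>
    <li id="story-id-1"><a href="/first"><h2> First story </h2></a><time datetime="2018-01-01"></time></li>
    <li id="ad-slot"><a href="/ad">Sponsored</a></li>
    <li><a href="/no-id">No id here</a></li>
    <li id="story-id-2"><a href="/second"><h2> Second story </h2></a><time datetime="2018-01-02"></time></li>
</ol>
"""
panel = BeautifulSoup(html, 'html.parser')

def is_story(id_value):
    # id_value is None for tags without an id, so check it before calling startswith
    return id_value is not None and id_value.startswith('story-id')

stories = panel.find_all('li', id=is_story)

# Collect headline text, link, and timestamp for each matching item
rows = [(li.h2.get_text().strip(), li.a['href'], li.time['datetime']) for li in stories]
print(rows)
```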

In [None]:
len(hls)

In [None]:
#get the headline titles, hyperlink, and date time of the headlines
headlines=[(x.h2.get_text().replace('\n','').strip(),x.a['href'], x.time['datetime']) for x in hls]

In [None]:
len(headlines)

In [None]:
headlines