# 🌐 Web Scraping with Python 1

## Why is this useful?
There are many reasons why you would want to scrape data. Some examples are:
- Scrape pages of newspapers to get information around important historical events (e.g. elections, major reforms, armed conflicts)
- Compare prices of different products by scraping the pages
- Find the cheapest flight tickets for your dream holidays!

## 📑 Illustrative example
Imagine that one day, you find yourself thinking:
> Gee, I wonder who the five most popular mathematicians are?

One way to do this is to use [Wikipedia's xTools](https://www.mediawiki.org/wiki/XTools) to measure the popularity of a mathematician by equating popularity and page views. For example, look at the page on [Henri Poincaré](https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Henri_Poincaré). There you can see that Poincaré’s pageviews for the last 60 days are, as of December 2017, around 32,000. 

Next, you Google for “famous mathematicians” and find [this resource](http://www.fabpedigree.com/james/mathmen.htm), which lists 100 names. Now you have both a page listing mathematicians' names and a website that provides information about how “popular” each mathematician is. Now what?

This is where Python and web scraping come in. Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.

## A little more theory

We first need to understand some basics of how the web works and how we can access data on the web.

### HTTP
HTTP, or Hypertext Transfer Protocol, is an application protocol used for communication between distributed and multi-layered systems on the web. The web as we know it today (the World Wide Web) uses HTTP as its main data communication protocol.

HTTP functions as a request-response protocol: one end issues a "request" and the other end receives it and answers with a "response". This is a generic, high-level explanation, but keep in mind that the response can be pretty much anything that is parseable (a JSON document, XML, HTML, an integer number, a URL... you name it).

The classic example of how this works is browsing the web. Usually there is a server (or several) hosting the website you are accessing, in charge of receiving requests and responding with the website's pages (serving them to you). In this case your browser is known as the "client" and the server, well, it's known as the "server".

![](./assets/http.png)
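
To make the request-response cycle concrete, here is a minimal sketch that plays both roles on your own machine using only the Python standard library. The handler class `HelloHandler`, the helper `fetch_from_local_server`, and the page content are made up for illustration; a real website would of course run on a remote server.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """The "server" side: answers every GET request with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>Hello, client!</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request log line


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The "client" side: issue a request, then read the response.
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        result = (
            response.status,
            response.getheader("Content-Type"),
            response.read().decode("utf-8"),
        )
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, content_type, text = fetch_from_local_server()
print(status, content_type)
print(text)
```

The client only ever sees what the server chooses to send back: a status, some headers, and a body.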

### URLs
![](./assets/urls.png)
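
As a rough illustration of the parts a URL is made of, the standard library's `urllib.parse` module can split one into its components. The URL below is a made-up example, not one used elsewhere in this tutorial.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL chosen so that every component is present.
url = "https://www.example.com:8080/path/to/page.html?key=value&lang=en#section-2"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /path/to/page.html
print(parts.query)     # key=value&lang=en
print(parts.fragment)  # section-2

# The query string can be decoded further into a dictionary of lists.
print(parse_qs(parts.query))  # {'key': ['value'], 'lang': ['en']}
```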


### The components of a webpage

When we visit a web page, our web browser makes a request to a web server. This is called a `GET` request, as we are getting the files from the server. The server then sends back the files that tell our browser how to render the page for us. The files fall into a few types:
- HTML: contains the main content of the page
- CSS: the styling (makes the page look snazzy)
- JS: JavaScript files that add interactivity to the pages
- Images

Once these files are received, the web browser renders them and displays the page to us.


### Verbs / Methods
These define the action that should be performed on the host:

- HTTP GET: Requests a representation of the specified resource (Document, HTML Page, Picture, JSON, XML...). Using the GET method should only retrieve data and should have no other effect.


- HTTP POST: Requests that the server accept the data enclosed in its request body as a new resource to be persisted on its end. The POSTed data might be, for instance, a new user who registered on your website, a new message in an instant messaging app, or a comment on a thread. Usually it is something that the server will store on its end for later consumption, processing or usage.


- HTTP HEAD: The HEAD request is identical to the GET request, but instead of the full payload, the response carries only the response headers (meta-information about the response and the server). This is useful for understanding what is running on the server (or how it reacts to different requests) without having to transport the entire content of a standard response.


- HTTP PUT: Similar to the POST request, but this one supplies a URI (identifier) that the server should use to persist the object transported by the PUT request. The catch here is that if an object with the same URI already exists on the server side, it should be overwritten by the one received. This is also known as an UPSERT or merge operation: if the record does not exist it is inserted, otherwise it is updated.


- HTTP DELETE: Requests the deletion of the specified resource


- HTTP OPTIONS: Requests the HTTP methods and actions supported by the server for one specific URL


- HTTP TRACE: Bounces the issued request to the server and back again. This is useful for understanding whether any intermediate servers made any changes to the request you issued, before it reached the target.
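
The methods above can be sketched with the standard library's `urllib.request.Request`, which lets you build a request object without sending anything. The `example.com` endpoints and JSON payloads are hypothetical and serve only to show which requests usually carry a body.

```python
from urllib.request import Request

# Hypothetical endpoints, used only to build request objects; nothing is sent.
collection_url = "http://example.com/users"
item_url = "http://example.com/users/42"

requests_to_build = {
    "GET": Request(item_url, method="GET"),
    "HEAD": Request(item_url, method="HEAD"),
    "POST": Request(collection_url, data=b'{"name": "Ada"}', method="POST"),
    "PUT": Request(item_url, data=b'{"name": "Ada Lovelace"}', method="PUT"),
    "DELETE": Request(item_url, method="DELETE"),
}


def summarize(request):
    """Describe a request by its method, target URL, and whether it has a body."""
    has_body = request.data is not None
    return f"{request.get_method()} {request.full_url} body={has_body}"


for request in requests_to_build.values():
    print(summarize(request))
```

Note that when no `method` is given, `Request` falls back to GET, or to POST if `data` is supplied.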


### HTTP status codes

Status codes are the way the server tells the client what happened with the request it issued. Have you ever tried to access a site and seen the classic "404 - Not Found" screen? Well, it turns out that "404" is the status code that represents the Not Found status.

Each status is represented by its own integer and falls into one of five categories:

- `1XX` - Informational (E.g: 100 - Continue)

- `2XX` - Success (E.g: 200 - OK ; 201 - Created ; 204 - No Content)

- `3XX` - Redirection (E.g: 301 - Moved Permanently)

- `4XX` - Client Error (E.g: 400 - Bad Request ; 401 - Unauthorized ; 404 - Not Found)

- `5XX` - Server Error (E.g: 500 - Internal Server Error ; 501 - Not Implemented)

For a full list of status codes you can try [this link](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes?oldformat=true) or if you are a cat lover you can try this [visual representation of status codes as cats](https://http.cat)
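
The five categories can also be explored from Python. The sketch below relies on the standard library's `http.HTTPStatus` enumeration for the official phrases; the `CATEGORIES` mapping and the `describe_status` helper are names made up for this example.

```python
from http import HTTPStatus

# The first digit of a status code determines its category.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return a string such as '404 - Not Found (Client Error)'."""
    phrase = HTTPStatus(code).phrase
    category = CATEGORIES[code // 100]
    return f"{code} - {phrase} ({category})"


for code in (100, 200, 301, 404, 500):
    print(describe_status(code))
```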

### HTML

HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn't a programming language like Python; it's a markup language that tells a browser how to lay out content. HTML allows you to do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn't a programming language, it isn't nearly as complex as Python.

Let's take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document using just this tag:
    
```html
<html>
</html>
```

Save this as `simple_webpage.html`. If you were to open this file in a browser, you would not see anything. Let's add more content:

```html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
```

Tags have commonly used names that depend on their position in relation to other tags:

- `child`: a child is a tag inside another tag. So the two `p` tags above are both children of the `body` tag.
- `parent`: a parent is the tag another tag is inside. Above, the `html` tag is the parent of the `body` tag.
- `sibling`: a sibling is a tag that is nested inside the same parent as another tag. For example, `head` and `body` are siblings, since they're both inside `html`. Both `p` tags are siblings, since they're both inside `body`.
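
These relationships can be recovered programmatically. The sketch below uses the standard library's `html.parser` (not the scraping tools this tutorial builds towards) to record the parent of every tag in the small page above. `ParentTracker` is a name invented for this example, and it assumes every tag is explicitly closed, which holds for this page but not for all real-world HTML.

```python
from html.parser import HTMLParser

PAGE = """
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
"""


class ParentTracker(HTMLParser):
    """Record a (tag, parent) pair for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags that are currently open
        self.relations = []   # list of (tag, parent) pairs

    def handle_starttag(self, tag, attrs):
        parent = self.open_tags[-1] if self.open_tags else None
        self.relations.append((tag, parent))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


tracker = ParentTracker()
tracker.feed(PAGE)
print(tracker.relations)
# [('html', None), ('head', 'html'), ('body', 'html'), ('p', 'body'), ('p', 'body')]
```

Tags that share the same parent in this list, such as the two `p` tags, are siblings.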


There are many tags that add some functionalities and behaviours to the webpages. For a full list of them [visit this link](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)

Let's add another tag:


```html
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org">The Python website!!</a>
        </p>
    </body>
</html>
```

In the above example, the `<a>` tag tells the browser to render a link to another web page. We also added a `class` attribute to our paragraphs, which gives those elements special properties. Classes are optional.
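
Attributes such as `href` and `class` are exactly what a scraper tends to look for. As a minimal sketch, again using the standard library's `html.parser` rather than a dedicated scraping library, the hypothetical `AttributeCollector` class below pulls the link targets and class names out of the page above.

```python
from html.parser import HTMLParser

PAGE = """
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org">The Python website!!</a>
        </p>
    </body>
</html>
"""


class AttributeCollector(HTMLParser):
    """Collect link targets and class names from opening tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # values of href attributes on <a> tags
        self.classes = []   # (tag, [class names]) pairs

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])
        if "class" in attributes:
            # A class attribute may hold several space-separated names.
            names = (attributes["class"] or "").split()
            self.classes.append((tag, names))


collector = AttributeCollector()
collector.feed(PAGE)
print(collector.links)    # ['https://www.python.org']
print(collector.classes)  # [('p', ['bold-paragraph']), ('p', ['bold-paragraph', 'extra-large'])]
```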

## Getting started with the scraping

We are going to use the Python library [requests](http://docs.python-requests.org/en/master/) to collect data from the web. For this we are going to start with a basic web page:

```python
import requests

url = 'https://goo.gl/FwemWV'

page = requests.get(url)
```

The object `page` is now a response object. This will contain information about our request such as the status, encoding, the content, and much more.

Let's check the status code using the `status_code` attribute and get the `text` from the response.

```python
page.status_code
```

```
200
```

```python
page.text
```

```
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=us-ascii" />\n\n  <title>Turtle Soup</title>\n</head>\n\n<body>\n  <h1>Turtle Soup</h1>\n\n  <p class="verse" id="first">Beautiful Soup, so rich and green,<br />\n  Waiting in a hot tureen!<br />\n  Who for such dainties would not stoop?<br />\n  Soup of the evening, beautiful Soup!<br />\n  Soup of the evening, beautiful Soup!<br /></p>\n\n  <p class="chorus" id="second">Beau--ootiful Soo--oop!<br />\n  Beau--ootiful Soo--oop!<br />\n  Soo--oop of the e--e--evening,<br />\n  Beautiful, beautiful Soup!<br /></p>\n\n  <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br />\n  Game or any other dish?<br />\n  Who would not give all else for two<br />\n  Pennyworth only of Beautiful Soup?<br />\n  Pennyworth only of
```

Here you can see all the content, including the HTML tags. However, there is not much spacing, which makes the content very difficult to read 🤨.
