# General Python Info

`Practical Web Scraping for Data Science: Seppe Vanden Broucke, Bart Baesens`

In [3]:
# Strings
"{} : {}".format("A", "B")

'A : B'

In [4]:
"{0}, {0}, {1}".format("A", "B")

'A, A, B'

# The Web Speaks HTTP

**Steps navigating to a website**
- You enter “www.google.com” into your web browser, which first needs to figure out the IP address for this site. IP stands for “Internet Protocol” and forms a core protocol of the Internet: it enables networks to route communication packets between connected computers, each of which is given an IP address.
- Your browser therefore sets off to find the IP address behind “www.google.com” using DNS (the Domain Name System), checking its own cache first and then asking the operating system.
- If the operating system is also unaware of this domain, the browser will send a DNS request to your router, which is the machine that connects you to the Internet and typically keeps its own DNS cache.
- All of this was done just to figure out the IP address of www.google.com. Your browser can now establish a connection to 172.217.17.68, Google’s web server, and send it an HTTP request for the page.
- Google’s web server then sends back an HTTP reply containing the contents of the page we want to visit. In most cases, this textual content is formatted using HTML.
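
The name lookup in the steps above can be tried from Python. This is only a minimal sketch using the standard library's `socket` module; the helper name `resolve_ipv4` is made up for illustration, and "localhost" is used so the snippet runs without network access.

```python
import socket


def resolve_ipv4(hostname):
    """Return one IPv4 address for hostname, as reported by the OS resolver."""
    # gethostbyname hands the question to the operating system, which
    # decides whether a local answer exists or a DNS server must be asked.
    return socket.gethostbyname(hostname)


# "localhost" resolves without network access. Swapping in "www.google.com"
# needs a connection, and the address returned varies by place and time.
address = resolve_ipv4("localhost")
print(address)
```

The address printed for a public site will usually differ from the 172.217.17.68 quoted above, since large sites answer with different addresses depending on where and when you ask.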

## HTTP in Python: The Requests Library

> Basically, we’re throwing out our web browser and we’re going to surf the web using a Python program instead. This means that our Python program will need to be able to speak and understand HTTP.

In [5]:
import requests
url = 'http://www.webscrapingfordatascience.com/basichttp/'
r = requests.get(url)
print(r.text)

Hello from the web!



In [6]:
# Take a look under the hood
import requests
url = 'http://www.webscrapingfordatascience.com/basichttp/'
r = requests.get(url)

# Which HTTP status code did we get back from the server?
print(r.status_code)
# What is the textual status code?
print(r.reason)
# What were the HTTP response headers?
print(r.headers)
# The request information is saved as a Python object in r.request:
print(r.request)
# What were the HTTP request headers?
print(r.request.headers)
# The HTTP response content:
print(r.text)

200
OK
{'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Date': 'Sat, 14 Dec 2019 11:57:40 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Content-Length': '20', 'Content-Type': 'text/html; charset=UTF-8'}
<PreparedRequest [GET]>
{'User-Agent': 'python-requests/2.13.0', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'keep-alive'}
Hello from the web!



- Here, a status code and message of “200 OK” indicates that everything went well.
- The `headers` attribute of the `requests.Response` object returns a dictionary of the headers the server included in its HTTP reply. This server reports its date and server version, and also provides the “Content-Type” header.
- To get information about the HTTP request that was sent, access the `request` attribute of the `requests.Response` object. This attribute is itself a `requests.PreparedRequest` object (printed above as `<PreparedRequest [GET]>`), containing information about the HTTP request that was prepared.
- Since an HTTP request message also includes headers, we can access the `headers` attribute of this object as well to get a dictionary of the headers that requests included. Note that requests politely reports its “User-Agent” by default.
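
The request side can also be inspected before anything is sent. The sketch below prepares a request through a `requests.Session` without contacting the server; the extra `Accept-Language` header is an assumption added purely for illustration and does not come from the notes.

```python
import requests

url = 'http://www.webscrapingfordatascience.com/basichttp/'

# Build the request without sending it, so the headers can be inspected offline.
session = requests.Session()
request = requests.Request('GET', url, headers={'Accept-Language': 'en'})

# prepare_request merges the session's default headers (including the
# User-Agent) with the headers given on the request itself.
prepared = session.prepare_request(request)

print(prepared.method)
print(prepared.url)
for name, value in prepared.headers.items():
    print(name, ':', value)
```

Passing a `headers` dictionary in the same way to `requests.get` overrides or extends the defaults for a request that is actually sent.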

In [7]:
# Take a look under the hood
import requests
url = 'http://www.webscrapingfordatascience.com/paramhttp/'
r = requests.get(url)

# Which HTTP status code did we get back from the server?
print(r.status_code)
# What is the textual status code?
print(r.reason)
# What were the HTTP response headers?
print(r.headers)
# The request information is saved as a Python object in r.request:
print(r.request)
# What were the HTTP request headers?
print(r.request.headers)
# The HTTP response content:
print(r.text)

200
OK
{'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Date': 'Sat, 14 Dec 2019 12:05:45 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Content-Length': '34', 'Content-Type': 'text/html; charset=UTF-8'}
<PreparedRequest [GET]>
{'User-Agent': 'python-requests/2.13.0', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'keep-alive'}
Please provide a "query" parameter


Try opening this page in your web browser to verify that you get the same result. Now try navigating to the page http://www.webscrapingfordatascience.com/paramhttp/?query=test.

What do you see?
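
The same URL can be built from Python by letting requests append the parameter. This sketch only prepares the request, so it runs without network access; the commented-out lines show how the request would be sent.

```python
import requests

url = 'http://www.webscrapingfordatascience.com/paramhttp/'

# Prepare (but do not send) a GET request carrying a "query" parameter.
prepared = requests.Request('GET', url, params={'query': 'test'}).prepare()
print(prepared.url)
# http://www.webscrapingfordatascience.com/paramhttp/?query=test

# Sending it would look like this (requires network access):
# r = requests.get(url, params={'query': 'test'})
# print(r.text)
```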

The optional “?…” part in URLs is called the “query string,” and it is meant to contain data that does not fit within a URL’s normal hierarchical path structure. You’ve probably encountered this sort of URL many times when surfing the web, for example:
- http://www.example.com/product_page.html?product_id=304
- https://www.google.com/search?dcr=0&source=hp&q=test&oq=test
- http://example.com/path/to/page/?type=animal&location=asia

Query strings in URLs should adhere to the following conventions:
- A query string comes at the end of a URL, starting with a single question mark, “?”.
- Parameters are provided as key-value pairs and separated by an ampersand, “&”.
- The key and value are separated using an equals sign, “=”
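
These conventions can be checked with the standard library's `urllib.parse` module. The sketch below splits one of the example URLs into its parts and rebuilds the query string from a dictionary.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = 'http://example.com/path/to/page/?type=animal&location=asia'

# Everything after the single "?" ends up in the query component.
parts = urlsplit(url)
print(parts.path)   # /path/to/page/
print(parts.query)  # type=animal&location=asia

# parse_qs splits on "&" and "=", collecting the values of each key in a list.
params = parse_qs(parts.query)
print(params)       # {'type': ['animal'], 'location': ['asia']}

# urlencode goes the other way: from key-value pairs to a query string.
rebuilt = urlencode({'type': 'animal', 'location': 'asia'})
print(rebuilt)      # type=animal&location=asia
```

Values are returned as lists because a key may legitimately appear more than once in a query string.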

# Stirring the HTML and CSS Soup

`Chapter 3`
- HTML tags often come in pairs and are enclosed in angle brackets, with `<tagname>` being the opening tag and `</tagname>` the closing tag. Some tags, such as `<br>` and `<img>`, come in an unpaired form and do not require a closing tag. Commonly used tags include `<p>` for paragraphs, `<a>` for links, `<h1>` for headings, and `<table>` for tables.
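
To see paired and unpaired tags in action, the following sketch uses the standard library's `html.parser` to record every opening and closing tag in a small snippet. The class name `TagLogger` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record the names of opening and closing tags in the order seen."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


logger = TagLogger()
logger.feed('<p>Hello <b>world</b><br></p>')
logger.close()

print(logger.opened)  # ['p', 'b', 'br']
print(logger.closed)  # ['b', 'p']
```

The `br` tag shows up only in the list of opening tags, since it has no closing counterpart.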

- Most modern web browsers include a set of powerful developer tools you can use to get an idea of what’s going on regarding HTML, and HTTP too.

**You might wonder why the “View source” option is still useful for looking at a page’s raw HTML when the Elements tab offers a much more user-friendly alternative.**

- A warning is in order here: the “View source” option shows the HTML code *as it was returned by the web server*, and it will contain the **same contents as r.text** when using requests. 

- The view in the Elements tab, on the other hand, provides a “cleaned up” version after the HTML was parsed by your web browser. Overlapping tags are fixed and extra white space is removed, for instance. There might hence be small differences between these two views. 
- In addition, the Elements tab provides a live and dynamic view. Websites can include scripts that are executed by your web browser and which can alter the contents of the page at will. The Elements tab will hence always reflect the current state of the page.



- Next, note that any HTML element in the Elements tab can be right-clicked. The “Copy selector” and “Copy XPath” options in the “Copy” submenu are particularly useful, and we’re going to use them quite often later on.

## Cascading Style Sheets: CSS

While perusing the HTML elements in your browser, you’ve probably noticed that some HTML attributes are present in lots of tags:

- id
- class

In CSS, style information is written down as a list of key-value declarations: the key and value are separated by a colon, and each declaration ends with a semicolon, as follows:

- `color: red;`
- `font-size: 14pt;`
- ...
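
The declaration format is simple enough to take apart by hand. The helper below, `parse_declarations`, is a hypothetical name and a deliberately naive sketch: it assumes no semicolons or colons appear inside values (as they could in a `url(...)` value), so it is not a substitute for a real CSS parser.

```python
def parse_declarations(style):
    """Split a CSS declaration list into a dict of property -> value."""
    declarations = {}
    for statement in style.split(';'):
        # Skip empty pieces, such as the one after the final semicolon.
        if ':' not in statement:
            continue
        key, value = statement.split(':', 1)
        declarations[key.strip()] = value.strip()
    return declarations


style = parse_declarations('color: red; font-size: 14pt;')
print(style)  # {'color': 'red', 'font-size': '14pt'}
```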

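These style declarations can be included in a document in three different ways: inline through a `style` attribute on an element, inside `<style>` tags in the document, or in a separate CSS file referenced with a `<link>` tag (**see page 57 for more details**).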

## The Beautiful Soup Library

Beautiful Soup tries to organize complexity: it helps to parse,
structure and organize the oftentimes very messy web by fixing bad HTML and
presenting us with an easy-to-work-with Python structure.
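
As a small illustration of the "fixing bad HTML" part, the sketch below feeds Beautiful Soup a fragment with two unclosed tags. The fragment is made up, and the exact repair can differ between parsers; `html.parser` is used here because it ships with Python.

```python
from bs4 import BeautifulSoup

# Neither the <p> nor the <b> tag is ever closed in this fragment.
broken_html = '<p>Unclosed <b>bold'
soup = BeautifulSoup(broken_html, 'html.parser')

# The parsed tree is well-formed: both tags are closed when printed.
print(soup.prettify())

# The text inside the repaired <b> tag is still reachable as usual.
print(soup.b.get_text())  # bold
```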

In [1]:
from bs4 import BeautifulSoup

Beautiful Soup’s main task is to take HTML content and transform it into a tree-based
representation. Once you’ve created a BeautifulSoup object, there are two
methods you’ll be using to fetch data from the page:

- `find(name, attrs, recursive, string, **keywords)`
- `find_all(name, attrs, recursive, string, limit, **keywords)`

**`find` returns a single Tag object (or `None` if nothing matches), while `find_all` returns a list of Tag objects. With a Tag, there are a number of interesting things you can do:**

- Access the name attribute to retrieve the tag name.
- Access the contents attribute to get a Python list containing the tag’s children (its direct descendants).
- ...
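
Before moving to a live page, here is a minimal offline sketch of `find`, `find_all`, and the Tag attributes listed above. The HTML snippet, its ids, and its class names are invented for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<h1 id="title" class="main heading">List of <i>episodes</i></h1>
<p class="note">First</p>
<p class="note">Second</p>
<p>Third</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching Tag.
heading = soup.find('h1')
print(heading.name)        # h1
print(heading['id'])       # title
print(heading['class'])    # ['main', 'heading']
print(heading.contents)    # ['List of ', <i>episodes</i>]
print(heading.get_text())  # List of episodes

# find_all returns every match; class_ filters on the class attribute.
notes = soup.find_all('p', class_='note')
print([p.get_text() for p in notes])     # ['First', 'Second']

# limit caps the number of results, and find gives None when nothing matches.
print(len(soup.find_all('p', limit=2)))  # 2
print(soup.find('table'))                # None
```

Because `find` can return `None`, it is worth checking the result before accessing attributes on it when scraping pages whose structure you do not control.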

In [5]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')

In [13]:
# Find the first h1 tag - seems to be a sort of heading
first_h1 = html_soup.find('h1')

print(first_h1.name) # h1

print(first_h1.contents) # ['List of ', [...], ' episodes']

print(str(first_h1))

print(first_h1.text) # List of Game of Thrones episodes
print(first_h1.get_text()) # Does the same

print(first_h1.attrs)

print(first_h1.attrs['id']) # firstHeading
print(first_h1['id']) # Does the same
print(first_h1.get('id')) # Does the same

print('------------ CITATIONS ------------')

# Find the first five cite elements with a citation class
cites = html_soup.find_all('cite', class_='citation', limit=5)

for citation in cites:
    print(citation.get_text())
    
# After the loop, citation still refers to the last cite element;
# find the first a tag inside it

link = citation.find('a')

# ... and show its URL
print(link.get('href'))
print()

h1
['List of ', <i>Game of Thrones</i>, ' episodes']
<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>
List of Game of Thrones episodes
List of Game of Thrones episodes
{'lang': 'en', 'class': ['firstHeading'], 'id': 'firstHeading'}
firstHeading
firstHeading
firstHeading
------------ CITATIONS ------------
Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.
Fleming, Michael (January 16, 2007). "HBO turns Fire into fantasy series". Variety. Archived from the original on May 16, 2012. Retrieved September 3, 2016.
"Game of Thrones". Emmys.com. Retrieved September 17, 2016.
Roberts, Josh (April 1, 2012). "Where HBO's hit 'Game of Thrones' was filmed". USA Today. Archived from the original on April 1, 2012. Retrieved March 8, 2013.
Schwartz, Terri (January 28, 2013). "'Game of Thrones' casts a bear and shoots in Los Angeles for major Seaso