## Advanced Python Features

While the title of this week's lesson is "Advanced" Python features, these are still basic features that all Python developers should be familiar with. True "advanced" features will reveal themselves in your implementations.

Before we go over any new features (some of which we might have seen already), let's take some time to review the features of Python we did not get a chance to explore.

## Collection Data Types

### NamedTuple
A namedtuple is a tuple (immutable list) that has labels for each value that is stored inside of it. This is a quick and easy way to implement immutable "method-less objects."

We implement a namedtuple through the following pattern: `x = namedtuple("Name", ["attr1", "attr2", "attr3"])`

```python
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["x", "y", "z"])

p1 = Coordinate(10, 20, 3)
p2 = Coordinate(5, 3, 19)

# this calculates to 15
print(p1.x + p2.x)

# this also calculates to 15
print(p1[0] + p2[0])

# this calculates to 39
print(p1.y + p2.z)
```
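
To make the "immutable, method-less object" idea concrete, here is a minimal sketch of what happens when we try to change a field, and how to get a modified copy instead. The name `Point` is only an illustrative choice; `_replace` and `_asdict` are methods that every namedtuple provides.

```python
from collections import namedtuple

# "Point" is an illustrative name, not part of the lesson's own code.
Point = namedtuple("Point", ["x", "y", "z"])
p = Point(10, 20, 3)

# Fields cannot be reassigned; the attempt raises AttributeError.
try:
    p.x = 99
except AttributeError as err:
    print("immutable:", err)

# _replace returns a NEW tuple with the given fields changed.
moved = p._replace(x=99)
print(moved)  # Point(x=99, y=20, z=3)
print(p)      # Point(x=10, y=20, z=3)

# _asdict converts the labels and values into a dictionary.
as_dict = p._asdict()
print(as_dict)

# A namedtuple is still a tuple, so unpacking works as usual.
x, y, z = p
```

Because the original tuple is never modified, it is safe to pass a namedtuple around without worrying that another function will change it.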

There are a variety of other useful data structures in the `collections` module, and I encourage you all to explore its documentation: https://docs.python.org/3/library/collections.html

In [1]:
# What is the result of this code?
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["x", "y", "z"])

p1 = Coordinate("53", 20, "19")
p2 = Coordinate("21", 19, "28")

print(p1.x + p2.x)

5321


## Generators

Let's say you are reading in a large file or calculating a series of numbers that is too large to hold in memory all at once. This could easily lead to a runtime memory error if we try to use more memory than the OS makes available to our program.

```python
def infinite_sequence():
    num = 0
    # never returns: it prints forever, so the caller cannot use the values
    while True:
        print(num)
        num += 1
```

```python
def infinite_sequence():
    num = 0
    # this will run "forever"
    while True:
        yield num
        num += 1
```

In general, this is a great memory-saving technique: a generator keeps track of its internal state (here, `num`) and produces values one at a time, only when they are requested, instead of holding them all in memory.
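
As a small sketch of how a caller might consume the generator above, we can pull values one at a time with `next()` or take a bounded slice with `itertools.islice`. The variable names (`gen`, `first`, `next_three`) are just illustrative.

```python
from itertools import islice


def infinite_sequence():
    num = 0
    while True:
        yield num  # pause here and hand num back to the caller
        num += 1


gen = infinite_sequence()

# Each call to next() resumes the function until the next yield.
first = next(gen)
second = next(gen)
print(first, second)  # 0 1

# islice takes a fixed number of items, so the infinite loop is never exhausted.
next_three = list(islice(gen, 3))
print(next_three)  # [2, 3, 4]
```

Note that the generator remembers where it left off: the slice starts at 2 because 0 and 1 were already consumed.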

We can likewise create generators by using parentheses instead of square brackets in a list-comprehension-style expression (this is called a generator expression)!

```python
nums_squared = (num**2 for num in range(5))
```
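
One thing worth seeing in action is that a generator expression is lazy and can only be consumed once. The following is a minimal sketch using the same `nums_squared` expression:

```python
nums_squared = (num**2 for num in range(5))

# Nothing has been computed yet; values are produced on demand.
print(next(nums_squared))  # 0

# list() drains whatever is left.
remaining = list(nums_squared)
print(remaining)  # [1, 4, 9, 16]

# The generator is now exhausted, so a second pass yields nothing.
second_pass = list(nums_squared)
print(second_pass)  # []

# Functions such as sum() accept a generator expression directly,
# so no intermediate list is ever built.
total = sum(num**2 for num in range(5))
print(total)  # 30
```

If you need to loop over the values more than once, build a list instead or recreate the generator.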

I recommend you take a look at the following RealPython article to get a good idea of how generators are used:

https://realpython.com/introduction-to-python-generators/

In [None]:
import sys

# RealPython Example

# list comprehension
nums_squared_lc = [i ** 2 for i in range(10000)]
print(sys.getsizeof(nums_squared_lc))


# generator
nums_squared_gc = (i ** 2 for i in range(10000))
print(sys.getsizeof(nums_squared_gc))

## Decorators

Decorators, in essence, are just functions that wrap around other functions to add functionality. We can always implement our own decorators, but we often use the [functools](https://docs.python.org/3/library/functools.html) module, which provides useful ready-made decorators such as `cache`, which trades a little memory for faster repeated calls.

I recommend you take a look at the following RealPython article(s) to get a good idea of how decorators are used.

https://realpython.com/primer-on-python-decorators/  
https://docs.python.org/3/library/functools.html  
https://refactoring.guru/design-patterns/decorator  

In [None]:
# Real Python example
def my_decorator(func):
    def wrapper():
        print("hello world!")
        func()
        print("goodbye world!")
    return wrapper

@my_decorator
def do_maths():
    print("1 + 1 = 2")

do_maths()
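
The wrapper in `my_decorator` takes no arguments, so it only works on functions that take none. A more general sketch, under the assumption that we want to decorate arbitrary functions, forwards `*args` and `**kwargs` and uses `functools.wraps` to keep the original function's name. The `timed` and `add` names are hypothetical helpers for illustration.

```python
import functools
import time


def timed(func):
    """Report how long each call to func takes (illustrative helper)."""

    @functools.wraps(func)  # copy func's name and docstring onto wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)  # forward every argument unchanged
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} seconds")
        return result  # hand the original return value back to the caller

    return wrapper


@timed
def add(a, b=0):
    return a + b


total = add(2, b=3)
print(total)         # 5
print(add.__name__)  # add
```

Without `functools.wraps`, `add.__name__` would report `wrapper`, which makes debugging decorated functions harder.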

In [None]:
from functools import cache

# caches are super useful when it comes to recursion, let's compare a cached and an uncached version
@cache
def cache_factorial(n):
    return n * cache_factorial(n-1) if n else 1


def factorial(n):
    return n * factorial(n-1) if n else 1

In [None]:
import time

# keep n well below Python's default recursion limit (1000);
# deeper recursive calls would raise RecursionError
start = time.time()
res = cache_factorial(300)
end = time.time()

print("This took ", end - start, " seconds")

start = time.time()
res = factorial(300)
end = time.time()

print("This took ", end - start, " seconds")

In [None]:
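# this smaller value was already computed (and cached) during the larger call above,
# so the cached version returns immediately; it also stays below the recursion limit
start = time.time()
res = cache_factorial(200)
end = time.time()

print("This took ", end - start, " seconds")

start = time.time()
res = factorial(200)
end = time.time()

print("This took ", end - start, " seconds")

# notice the difference in time efficiency!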
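
To see why the cached version is faster, it can help to inspect the cache itself. Functions decorated with `functools.cache` expose a `cache_info()` method; the sketch below uses a hypothetical Fibonacci function `fib` (not part of the cells above) because it calls itself twice per step, which makes the cache hits easy to observe.

```python
from functools import cache


@cache
def fib(n):
    # Each distinct n is computed once; repeated requests are served from the cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result = fib(30)
print(result)  # 832040

# misses = number of distinct arguments computed, hits = calls answered from the cache
info = fib.cache_info()
print(info)
```

Without the decorator, `fib(30)` would make over a million recursive calls; with it, each of the 31 distinct inputs is computed only once.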

## Packing & Unpacking

A feature of Python you all might have seen already in the domain of data-structures is packing & unpacking. 

We can pack several values into a list using the `*` operator on the left-hand side of an assignment, and similarly unpack a list by placing the asterisk in front of the list variable name as we pass it into a function.

**list packing**
```python
a, *b, c = [1, 2, 3, 4, 5, 6]
# a is 1, c is 6, and b collects everything in between: [2, 3, 4, 5]
```

**list unpacking**
```python
def adder(a, b, c, d):
    return a + b + c + d

x = [1, 2, 3, 4]
adder(*x)
```

The same applies for dictionaries, except this time we utilize two asterisks `**`.

**dictionary unpacking**
```python
def adder(a, b, c, d):
    return a + b + c + d

x = {"a": 1, "b": 2, "c": 3, "d": 4}
adder(**x)
```
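
The same two operators also work in the other direction inside a function definition, where they pack any extra arguments. This is a minimal sketch; `describe`, `defaults`, and `overrides` are hypothetical names used only for illustration.

```python
def describe(*args, **kwargs):
    # args is a tuple of the positional arguments,
    # kwargs is a dict of the keyword arguments
    return args, kwargs


positional, keyword = describe(1, 2, 3, name="x", size=4)
print(positional)  # (1, 2, 3)
print(keyword)     # {'name': 'x', 'size': 4}

# ** can also unpack dictionaries into a new dictionary;
# keys that appear later overwrite earlier ones.
defaults = {"a": 1, "b": 2}
overrides = {"b": 20, "c": 30}
merged = {**defaults, **overrides}
print(merged)  # {'a': 1, 'b': 20, 'c': 30}
```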

https://www.geeksforgeeks.org/packing-and-unpacking-arguments-in-python/


Attempt to solve the 3 questions in the below code-block. Questions are labeled as `Q#` and set as comments. Write your answer below or next to the question. Attempt to do this without running!

In [None]:
x = [1, 2, 3, 4]
y = [5, 6, 7, 8]

z = {"a": 1, "b": 2, "c": 3}

def var_reveal(a, b, c):
    print("The value of a is", a)
    print("The value of b is", b)
    print("The value of c is", c)

# Q1: what will be the result of this print statement?
print([*x, *y])

# Q2: what will be the result of this print statement?
print(var_reveal(**z))

x, *y, z = [1, 2, 3, 4]
# Q3: what will be the result of this print statement?
print(y)

## Web-Scraping

Web-scraping is the process of extracting information from a publicly available website.

While web-scrapers aren't illegal (if used in the right context), they are generally frowned upon by most websites. If we request pages from a website too often, we may get blocked or receive an error response such as HTTP 429 (Too Many Requests).
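
One common way to stay on a website's good side is to check its `robots.txt` rules before scraping. The sketch below uses the standard library's `urllib.robotparser`; the rules and the `example.com` URLs are hypothetical and inlined so the code runs without any network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so this sketch runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch tells us whether a given user agent may request a URL.
allowed = parser.can_fetch("*", "https://example.com/pages/simple/")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)  # True False

# crawl_delay is the number of seconds the site asks us to wait between requests.
delay = parser.crawl_delay("*")
print(delay)  # 2
```

In a real scraper we would load the site's actual `robots.txt` and call `time.sleep(delay)` between requests so that we do not flood the server.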

We will be scraping the following website today: https://www.scrapethissite.com/pages/simple/

## HTML

Before we enter the exciting and new world of scraping data from a website, let's familiarize ourselves with `HTML` (HyperText Markup Language).

Now even though there is "language" in the name, we do not consider this to be a "formal" programming language. This is because, at its heart, HTML was engineered to represent information in some structured format rather than to introduce logical structures or data manipulation.

This fact makes websites a prime candidate for data-gathering. The information we want to extract is there; it just unfortunately isn't immediately available to us in a convenient form, and it is most likely unstructured.

## HTML Basics

HTML is composed of important wrappers called `tags`. We encapsulate specific types of information within specific tags. We usually wrap text or other information using the following pattern.

```HTML
<tag> information such as text or a link </tag>
```

Here is an example of information represented using HTML:

```HTML
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is a Heading</h1>
        <p>This is a paragraph.</p>
    </body>
</html>
```

Let's take note of this structure:

* we specify the start of an `HTML` page via the tag `<html>`, and we close the page using `</html>`
* we have two tags that represent sections that wrap around other tags, namely `<head>` and `<body>`
    * Usually `<head>` includes meta (invisible) information
    * Usually `<body>` contains the actual content of the website.
* `<h1>` indicates the top-level (largest) heading
* `<p>` indicates a paragraph
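
To see this nesting programmatically, here is a rough sketch that walks the example page with Python's built-in `html.parser` module (not the bs4 library used later in this lesson). The `TagPrinter` class and the `page` variable are hypothetical names for illustration.

```python
from html.parser import HTMLParser

page = """<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is a Heading</h1>
        <p>This is a paragraph.</p>
    </body>
</html>"""


class TagPrinter(HTMLParser):
    """Record each opening tag and text node, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1  # everything until the closing tag is nested inside

    def handle_endtag(self, tag):
        self.depth -= 1  # the closing tag ends this level of nesting

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the whitespace between tags
            self.lines.append("  " * self.depth + repr(text))


printer = TagPrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

The printed outline shows `<head>` and `<body>` indented under `<html>`, with the text sitting inside the innermost tags.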

Let's go over a few more important tags that we often find:

* `<div class="...">` A division (a generic container) with a specific class  
* `<a href="...">` A hyperlink. It wraps the link text and needs a closing `</a>` tag.
* `<table>` A table of rows & columns

```HTML
<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
</table>
```

One thing I also want to point out is that we rarely write JUST HTML when developing a website; it is usually combined with CSS for styling and JavaScript for interactivity.

https://www.w3schools.com/html/

## Selenium

A common tool we use in conjunction with bs4 (Beautiful Soup 4, the HTML-parsing library covered below) is Selenium.

Selenium is a browser-automation tool that we use to browse websites and web apps. It is super useful for testing a website, but we can also use it for web-scraping.

Before we progress, install `selenium` via pip: https://pypi.org/project/selenium/

```
pip install selenium
pip3 install selenium
```

## BS4 Basics

Let's learn through experimentation

https://pypi.org/project/bs4/

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [7]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

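# make sure window does not open (run headless);
# the old `options.headless = True` attribute was removed in newer Selenium 4 releases
options = Options()
options.add_argument("-headless")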

# create a FireFox browser
driver = webdriver.Firefox(options=options)

# get the html located at our link
url = "https://www.scrapethissite.com/pages/simple/"

# access the url you're trying to look for
driver.get(url)
# get the page source
html = driver.page_source

# close the browser once we have the source
driver.quit()

html

'<html lang="en"><head>\n    <meta charset="utf-8">\n    <title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>\n    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png">\n\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta name="description" content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping.">\n\n    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">\n    <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css">\n    <link rel="stylesheet" type="text/css" href="/static/css/styles.css">\n\n    \n<meta name="robots

In [9]:
from bs4 import BeautifulSoup

# create bs4 object with html that you accessed
# (naming the parser explicitly avoids bs4's "no parser was specified" warning)
soup = BeautifulSoup(html, "html.parser")

In [11]:
print(soup.prettify())

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping." name="description"/>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
  <link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
  <meta content="noindex" name="robots"/

In [12]:
# try out some bs4 basics
soup.title

<title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>

In [15]:
# get body of html, we feed in the "body" html tag
# https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find#find
soup.find('body')

<body>
<nav id="site-nav">
<div class="container">
<div class="col-md-12">
<ul class="nav nav-tabs">
<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                Scrape This Site
                            </a>
</li>
<li class="active" id="nav-sandbox">
<a class="nav-link" href="/pages/">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                Sandbox
                            </a>
</li>
<li id="nav-lessons">
<a class="nav-link" href="/lessons/">
<i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                Lessons
                            </a>
</li>
<li id="nav-faq">
<a class="nav-link" href="/faq/">
<i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                FAQ
                            </a>
</li>
<li class="pull-right" id="nav-login">
<a class="nav-link" href="/logi

In [16]:
# function chaining results in finding nested structures
soup.find('body').find("img")

<img id="nav-logo" src="/static/images/scraper-icon.png"/>

In [13]:
# get all links, we feed in the tags that we are looking for
# https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
soup.find_all('a')

[<a class="nav-link hidden-sm hidden-xs" href="/">
 <img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                 Scrape This Site
                             </a>,
 <a class="nav-link" href="/pages/">
 <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                 Sandbox
                             </a>,
 <a class="nav-link" href="/lessons/">
 <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                 Lessons
                             </a>,
 <a class="nav-link" href="/faq/">
 <i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                 FAQ
                             </a>,
 <a class="nav-link" href="/login/">
                                 Login
                             </a>,
 <a href="/lessons/">4 video lessons</a>,
 <a class="data-attribution" href="http://peric.github.io/GetCountries/" target="_blank">http://peric.github.io/GetCountri
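
Each element returned by `find_all` also gives access to its HTML attributes, so we can read a link's target with `["href"]` or `.get("href")`. The sketch below uses a small inline HTML string (a hypothetical stand-in modeled on the navigation links above) so that it runs without a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the navigation links in the output above.
sample_html = """
<ul>
  <li><a class="nav-link" href="/pages/">Sandbox</a></li>
  <li><a class="nav-link" href="/lessons/">Lessons</a></li>
  <li><a>No target</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .get() returns None when the attribute is missing; ["href"] would raise KeyError.
links = [a.get("href") for a in sample_soup.find_all("a")]
print(links)  # ['/pages/', '/lessons/', None]

# href=True keeps only the anchors that actually carry an href attribute.
with_target = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(with_target)  # ['/pages/', '/lessons/']
```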

In [18]:
# we're interested in getting each country.
# notice that all countries are in a div tag labeled "col-md-4 country"
# https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#attrs
for country in soup.find_all('div', class_="col-md-4 country", limit=5):
    print(country)

<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
                            United Arab Emirates
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
<strong>Population:</strong> <span class="country-population">4975593</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
</div>
</div>
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-ic

In [21]:
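# let's further extract pertinent info from these countries
for country in soup.find_all('div', class_="col-md-4 country", limit=5):
    # the name lives in <h3 class="country-name">, the capital in <span class="country-capital">
    name = country.find('h3', class_="country-name")
    capital = country.find('span', class_="country-capital")
    print(name.text, capital.text)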



                            Andorra
                         Andorra la Vella


                            United Arab Emirates
                         Abu Dhabi


                            Afghanistan
                         Kabul


                            Antigua and Barbuda
                         St. John's


                            Anguilla
                         The Valley


In [26]:
# consider the following string `strip()` function
for country in soup.find_all('div', class_="col-md-4 country", limit=5):
    name = country.find('h3', class_="country-name")
    capital = country.find('span', class_="country-capital")
    print("Name:", name.text.strip(), "\nCapital:", capital.text.strip(), "\n")

Name: Andorra 
Capital: Andorra la Vella 

Name: United Arab Emirates 
Capital: Abu Dhabi 

Name: Afghanistan 
Capital: Kabul 

Name: Antigua and Barbuda 
Capital: St. John's 

Name: Anguilla 
Capital: The Valley 
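
As an alternative sketch, bs4 also supports CSS selectors through `select_one`, and `get_text(strip=True)` combines the text extraction and the `strip()` call in one step. The HTML below is an inline copy of the Andorra block from the earlier output, so this runs without the `soup` object; converting the population and area to numbers is an extra step shown for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of one country block, taken from the output shown earlier.
sample_html = """
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>
"""

block = BeautifulSoup(sample_html, "html.parser")

# "h3.country-name" means: an <h3> tag whose class is country-name
name = block.select_one("h3.country-name").get_text(strip=True)
capital = block.select_one("span.country-capital").get_text(strip=True)

# Scraped text is always a string, so convert numbers explicitly.
population = int(block.select_one("span.country-population").get_text(strip=True))
area = float(block.select_one("span.country-area").get_text(strip=True))

print(name, capital, population, area)  # Andorra Andorra la Vella 84000 468.0
```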



## Storing Data

Usually, we would store scraped data like this in a database. However, in this case we will save the data into a dictionary that we then convert into a pandas DataFrame.

In [29]:
import pandas as pd

countries = {
    "name" : [],
    "capital": []
}

for country in soup.find_all('div', class_="col-md-4 country", limit=5):
    name = country.find('h3', class_="country-name")
    capital = country.find('span', class_="country-capital")

    countries["name"].append(name.text.strip())
    countries["capital"].append(capital.text.strip())

df = pd.DataFrame(countries)
df.head()

|   | name | capital |
|---|------|---------|
| 0 | Andorra | Andorra la Vella |
| 1 | United Arab Emirates | Abu Dhabi |
| 2 | Afghanistan | Kabul |
| 3 | Antigua and Barbuda | St. John's |
| 4 | Anguilla | The Valley |
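
If we want the scraped data to outlive the notebook session, one simple option is to write it to a CSV file. The sketch below uses the standard library's `csv` module with a few of the rows from the output above; `io.StringIO` stands in for a real file so the example leaves nothing on disk, and the `rows` layout (a list of dictionaries) is an assumption made for illustration.

```python
import csv
import io

# A few rows taken from the output above.
rows = [
    {"name": "Andorra", "capital": "Andorra la Vella"},
    {"name": "United Arab Emirates", "capital": "Abu Dhabi"},
    {"name": "Afghanistan", "capital": "Kabul"},
]

# io.StringIO behaves like a text file; for a real file you could use
# open("countries.csv", "w", newline="") instead (file name is hypothetical).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "capital"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same records.
restored = list(csv.DictReader(io.StringIO(csv_text)))
print(restored[0]["capital"])  # Andorra la Vella
```

With pandas, `df.to_csv(...)` would be a one-line way to do the same thing with the DataFrame built above.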
