<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>
<br style="clear: both">
<hr>
<br>

<h1 align='center'>Web</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/internet.jpg" width="300">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"lo"</p>
                <br>
                <p>-The first message sent on the Internet</p>
                <br>
                <p style="font-style: italic;">"login"</p>
                <br>
                <p>-The second message sent on the Internet (one hour after the first message sent on the Internet crashed the Internet)</p>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/Category:Internet#/media/File:Internet_map_1024.jpg'>The Opte Project</a> under the <a href='https://creativecommons.org/licenses/by/2.5/'>CC 2.5 BY</a>
</div>

<hr>

## Generally

Python cut its teeth in the internet age, so both its standard library and its third-party packages offer solid web capabilities. This notebook covers the more common client-side operations: fetching pages and files, parsing HTML, querying REST APIs, and sending email.

---

# Modules covered

### Standard Library
* [email](https://docs.python.org/3.4/library/email.html#module-email)
* [json](https://docs.python.org/3/library/json.html)
* [pathlib](https://docs.python.org/3/library/pathlib.html)
* [smtplib](https://docs.python.org/3/library/smtplib.html)
* [urllib.request](https://docs.python.org/3/library/urllib.request.html#module-urllib.request)
* [urllib.parse](https://docs.python.org/3/library/urllib.parse.html)
* [webbrowser](https://docs.python.org/3/library/webbrowser.html)

### Third Party Libraries
* [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [comtypes.client](https://pythonhosted.org/comtypes/)
* [requests](http://docs.python-requests.org/en/master/)
* [pandas](https://pandas.pydata.org/)


# Modules not covered

### Standard Library
* [ftplib](https://docs.python.org/3/library/ftplib.html)
* [xml](https://docs.python.org/3/library/xml.html)

### Third Party Libraries
* [selenium](http://selenium-python.readthedocs.io/)

---

In [None]:
# Stdlib imports
import email.mime.text
import json
import pathlib
import smtplib
import urllib.parse
import urllib.request
import webbrowser

# Third party imports
import bs4
import comtypes.client 
import pandas as pd
import requests

---

# Web Requests

### Fetching web pages using the standard library

Note: this is for sites outside the bank. Sites inside the bank will require special authentication mechanisms.

Note: at times, the proxies can be finicky. "Rerun login scripts" often helps.

In [None]:
# Define a URL
HTTPS_URL = 'https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail'

# Open the request
r = urllib.request.urlopen(HTTPS_URL)

# Check the status of my request
status_code = r.getcode()
print('The status of my request is {}!\n\n'.format(status_code))

# Convert the raw bytes to text
raw_data = r.read()
text_data = raw_data.decode()

# Do stuff with HTML output
print('Here is the start of our data from {} !\n\n'.format(HTTPS_URL))
print(text_data[:100])

# Close request
r.close()

### Though it's better to do it like this with a context manager:

In [None]:
# Doing the above again as a context manager:
with urllib.request.urlopen(HTTPS_URL) as f:
    text_data = f.read().decode()
    
print('\n\n\nHere is the data again!\n\n\n{}'.format(text_data))
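
The cells above call `.decode()` with no argument, which assumes the page is UTF-8. A minimal sketch of reading the charset from the response headers instead is shown below. The helper name `fetch_text` and the `data:` demo URL are illustrative assumptions, not part of the original notebook; the `data:` URL simply lets the cell run without a network connection.

```python
import urllib.request


def fetch_text(url, default_charset='utf-8'):
    """Fetch a URL and decode the body with the charset the server declares."""
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message; get_content_charset()
        # returns None when the Content-Type header names no charset.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# Hypothetical stand-in for a real page: urllib can open data: URLs offline.
DEMO_URL = 'data:text/plain;charset=utf-8,caf%C3%A9'
print(fetch_text(DEMO_URL))
```

The same function should work unchanged with an `https://` URL such as `HTTPS_URL` when a connection is available.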

### We can also download files as necessary and put them in RAM or write them to the filesystem:

In [None]:
# Define a URL
FOOT_URL = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/Monty_python_foot.png'
FOOT_OUT = './static/monty_python_foot.png'

# Download data
with urllib.request.urlopen(FOOT_URL) as f:
    binary_data = f.read()

# Write data to a file.
with open(FOOT_OUT, 'wb') as f:
    f.write(binary_data)
    
# (You could also write to an in-memory buffer such as io.BytesIO, or to a tempfile.TemporaryFile.)

# Report what happened, using keyword args for the format string (not necessary)
args = {
    'length': len(binary_data), 
    'url': FOOT_URL, 
    'dest': FOOT_OUT
}
print('We downloaded {length} bytes of data from {url} , and wrote it to {dest}!'.format(**args))
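
If the file never needs to touch the disk, the downloaded bytes can be wrapped in an in-memory buffer. The sketch below assumes a small hand-made byte string in place of a real download so it runs offline; the variable names are illustrative.

```python
import io

# Stand-in for the bytes returned by f.read(); a real download works the same way.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
binary_data = PNG_SIGNATURE + b'\x00' * 16

# Wrap the bytes in a file-like object that lives entirely in RAM.
buffer = io.BytesIO(binary_data)

# Anything that expects an open binary file can now read from the buffer.
is_png = buffer.read(8) == PNG_SIGNATURE
print('Looks like a PNG: {}'.format(is_png))

# Rewind before handing the buffer to another consumer.
buffer.seek(0)
```

Libraries that accept file objects can usually take `buffer` directly in place of an opened file.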

### If needed you can break it into chunks:

In [None]:
# Download data using an infinite loop (generally a bad idea)
with open(FOOT_OUT, 'wb') as out_file:
    # Nested context manager
    with urllib.request.urlopen(FOOT_URL) as web_file:
        # Infinite loop
        while True:
            # Read data
            binary_data = web_file.read(8192)
            # If no data is left, read() returns empty bytes (b'') and the loop breaks.
            if not binary_data:
                break
            # Write data
            out_file.write(binary_data)
            print('.', end='')
        print('\nDing! Loops are done.')
    # Web connection closes here
# Binary file closes here
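
The `while True` loop can also be written without an explicit `break`. The sketch below is one possible alternative, using `iter()` with a sentinel; the function name `copy_in_chunks` is an assumption for illustration, and in-memory buffers stand in for the web response and the output file.

```python
import functools
import io


def copy_in_chunks(source, destination, chunk_size=8192):
    """Copy a binary stream chunk by chunk and return the number of bytes written."""
    total = 0
    # iter() keeps calling source.read(chunk_size) until it returns b''.
    for chunk in iter(functools.partial(source.read, chunk_size), b''):
        destination.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for the web connection and the output file.
source = io.BytesIO(b'x' * 20000)
destination = io.BytesIO()
copied = copy_in_chunks(source, destination)
print('Copied {} bytes.'.format(copied))
```

The standard library's `shutil.copyfileobj(source, destination)` performs a similar chunked copy if you do not need the byte count or per-chunk progress output.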

### Usually the "requests" module, which describes itself as "HTTP for Humans", is easier, but it can sometimes run afoul of proxy servers:

In [None]:
# Target URL
TARGET_URL = 'https://www.google.com'

# Certificate location (placeholder: you do not have this file, so this will not work as-is)
CERT_PATH = 'FULL_PATH_TO_CERT.cer' 

# Open the request
r = requests.get(TARGET_URL, verify=CERT_PATH)

# Stop if error
r.raise_for_status()

# As text
text = r.text
print(text[:100])

### Once you have HTML, you can get to scraping using beautifulsoup ("bs4"):

In [None]:
# Load your soup
soup = bs4.BeautifulSoup(text_data, 'lxml')

# Iterate through all the tables to get our table
for table in soup.find_all('table'):
    # Check if it's the table we want (tables with no class attribute are skipped)
    if 'wikitable' in table.get('class', []):
        target_table = table
        # Stop at the first match
        break

# Alternatively we can get it directly with a CSS selector
target_table = soup.select('table.wikitable')[0]
print(str(target_table))

In [None]:
webbrowser.open_new('https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail#Cast')

### And put it in a dataframe for organized manipulation:

In [None]:
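# Now that we have our table, convert it to something we can use.
# Newer pandas versions expect a file-like object rather than a raw HTML string.
import io
df = pd.read_html(io.StringIO(str(target_table)), header=0)[0]

# The cells here are formatted inconsistently, so we rebuild the table by hand:
# for each row, take a cell's <span> text when it has one, else the cell's own text.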
table_data = [
    [
        cell.find('span').text
        if cell.find('span')
        else cell.text
        for cell
        in row.find_all(['th', 'td'])
    ]
    for row
    in target_table.find_all('tr')
]

# Coerce to dataframe
df = pd.DataFrame.from_records(table_data[1:], columns=table_data[0])

# Do arbitrary stuff to our data.
df['LAST_NAME'] = df['Actor\n'].str.split(',').str.get(0)
df['FIRST_NAME'] = df['Actor\n'].str.split(',').str.get(1)
df.head()

### Or if we want to crawl, we can get the links and go from there.

In [None]:
# We can manually put in the base
base = 'https://en.wikipedia.org'

# Or we can get it more generally
url_obj = urllib.parse.urlsplit(HTTPS_URL)
base = url_obj.scheme + '://' + url_obj.hostname

# Get all links.
links = []
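for link in soup.find_all('a'):
    try:
        href = link.attrs['href']
    except KeyError:
        # Anchors without an href attribute are skipped
        continue
    # Absolute URLs are kept as-is; relative ones get the base prepended
    if href.startswith('http'):
        links.append(href)
    else:
        links.append(base + href)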

links[:5]
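
Prepending a base works for root-relative links such as `/wiki/...`, but pages also contain fragments and protocol-relative links. A hedged sketch of resolving them with `urllib.parse.urljoin` follows; the `hrefs` list is made up for illustration and is not taken from the actual page.

```python
import urllib.parse

PAGE_URL = 'https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail'

# Hypothetical href values of the kinds a scraped page usually contains.
hrefs = [
    '/wiki/Terry_Gilliam',              # root-relative
    '#Cast',                            # fragment on the same page
    'https://www.python.org/',          # already absolute
    '//upload.wikimedia.org/foot.png',  # protocol-relative
]

# urljoin resolves each href against the page it was found on.
resolved = [urllib.parse.urljoin(PAGE_URL, href) for href in hrefs]
for url in resolved:
    print(url)
```

Because `urljoin` leaves absolute URLs untouched, no `startswith('http')` check is needed.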

### Where this really comes into its own is with REST APIs to fetch structured data from the internets:

In [None]:
CFPB_ENDPOINT = 'https://data.consumerfinance.gov/resource/jhzv-w97w.json?'

QUERY_ARGS = {
    'state'   : 'MO',
    'product' : 'Credit card'
}

# Let's construct our REST query string
query_string = urllib.parse.urlencode(QUERY_ARGS)
full_url = CFPB_ENDPOINT + query_string

print('We are querying: {} !\n\n'.format(full_url))

# And then fetch our data
with urllib.request.urlopen(full_url) as f:
    json_data = f.read().decode()

# We can load this into native Python datatypes
data = json.loads(json_data)
print(data[0])
print('\n\n')

# Or we can immediately go to pandas (data is already a list of records).
df = pd.DataFrame(data)

# And refine as needed.
df.dropna(subset=['complaint_what_happened']).head(3)
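
To see what `urlencode` does to the query arguments, the sketch below builds the URL and then parses it back. It runs offline; the endpoint is the one from the cell above, written here without the trailing `?`.

```python
import urllib.parse

BASE = 'https://data.consumerfinance.gov/resource/jhzv-w97w.json'
params = {'state': 'MO', 'product': 'Credit card'}

# urlencode escapes each value; the space in 'Credit card' becomes '+'.
query_string = urllib.parse.urlencode(params)
full_url = BASE + '?' + query_string
print(full_url)

# Round trip: split the URL and decode the query back into a dict.
parts = urllib.parse.urlsplit(full_url)
decoded = dict(urllib.parse.parse_qsl(parts.query))
print(decoded)
```

Building the query from a dict rather than by hand keeps special characters in values from corrupting the URL.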

### And again:

In [None]:
# Base URL
FEMA_URL = 'http://www.fema.gov/api/open/v1/DisasterDeclarationsSummaries?$filter=declarationDate%20gt%20\'{}\''

# Let's construct our REST query string
now = pd.Timestamp.now()
# We can use all the handy libraries of Python!
one_hundred_bdays_ago = now - pd.tseries.offsets.BusinessDay(100)
date = one_hundred_bdays_ago.date()
iso_date = date.isoformat()
full_url = FEMA_URL.format(iso_date)

print('Querying {} !'.format(full_url))

# And then fetch our data
with urllib.request.urlopen(full_url) as f:
    json_data = f.read().decode()
    data = json.loads(json_data)

df = pd.DataFrame(data['DisasterDeclarationsSummaries'])
df.head(5)
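
The FEMA URL above hard-codes `%20` for the spaces in the filter expression. As an alternative sketch, the expression can be written in plain text and percent-encoded with `urllib.parse.quote`. The helper name `build_filter_url` and the fixed example date are assumptions for illustration; no request is sent.

```python
import datetime
import urllib.parse


def build_filter_url(base_url, since):
    """Return base_url with a $filter for declarations dated after `since`."""
    filter_expr = "declarationDate gt '{}'".format(since.isoformat())
    # quote() percent-encodes the spaces and the single quotes.
    return base_url + '?$filter=' + urllib.parse.quote(filter_expr, safe='')


BASE_URL = 'http://www.fema.gov/api/open/v1/DisasterDeclarationsSummaries'
example_url = build_filter_url(BASE_URL, datetime.date(2017, 1, 2))
print(example_url)
```

Here the single quotes come out as `%27` rather than literal quotes, which is the same value once the URL is decoded.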

### We can also email stuff.

Note: the method below requires MS Outlook.

In [None]:
# Recipients, each written as <address>;
STRING_OF_RECIPIENTS = '''
<theonaunheim@gmail.com>;
'''

# Easy way to extract emails from formatted string
recipient_series = pd.Series([STRING_OF_RECIPIENTS])
extracted_emails = recipient_series.str.extractall("<(.*?)>")
email_addresses = extracted_emails[0].values

# Render the first rows of the table as HTML
table_html = df.head(5)[['declarationDate', 'declaredCountyArea']].to_html()

EMAIL_HTML = '''

<head>

    <style type="text/css">
    
        body, table, td {{font-family: Segoe UI, sans-serif !important; color: #34282C;}}
        table {{border-width: 20px; width: 100%;}}
        th, td {{text-align: left; padding: 8px; border: 1px solid white; border-collapse: collapse; }}
        th {{background-color: #000080; color: white;}}

    </style>

</head>

<body>

    <p>
        <h1>Automated email sent from Python automation presentation.</h1><br>
        <span>
            <img src="cid:{img_path}" width="100" alt="foot">
            <img src="cid:{img_path}" width="100" alt="foot">
        </span>
        <br>
        I met a traveller from an antique land<br>
        Who said: Two vast and trunkless legs of stone<br>
        Stand in the desert ... near them, on the sand,<br>
        Half sunk, a shattered visage lies, whose frown,<br>
        And wrinkled lip, and sneer of cold command,<br>
        Tell that its sculptor well those passions read<br>
        Which yet survive, stamped on these lifeless things,<br>
        The hand that mocked them and the heart that fed;<br>
        <br>
        And on the pedestal these words appear:<br>
        'My name is Ozymandias, king of kings;<br>
        Look on my works, ye Mighty, and despair!'<br>
        Nothing beside remains. Round the decay<br>
        Of that colossal wreck, boundless and bare<br>
        The lone and level sands stretch far away.<br>
        <br>
    </p>

    {table_content}

</body>

'''.format(
    table_content=table_html,
    img_path=str(pathlib.Path(FOOT_OUT).name)
)

# Create an application object
outlook = comtypes.client.CreateObject("Outlook.Application")

# Create your email
mail = outlook.CreateItem(0)
mail.To = ';'.join(email_addresses)
mail.Subject = 'Ozymandias'

# Add body and attachments
mail.HTMLBody = EMAIL_HTML
veggie_csv = pathlib.Path(FOOT_OUT).parent.parent.absolute() / 'data' / 'iris_dataset.csv'
mail.Attachments.Add(str(veggie_csv.absolute()))
mail.Attachments.Add(str(pathlib.Path(FOOT_OUT).absolute()))
# mail.CC = 'recipient1@usbank.com; recipient2@usbank.com'
# mail.BCC = "alice_bob@usbank.com"

# Send your mail
mail.Send()
# outlook.Quit() # you will probably want to keep this open.

### It's generally preferable to send it directly via server, but that takes some prep:

In [None]:
# Construct your email message
# msg = email.mime.text.MIMEText(EMAIL_HTML, 'html') 
# msg['Subject'] = 'Ozymandias' 
# msg['From'] = 'Percy Bysshe Shelley' 
# msg['To'] = ','.join(email_addresses) 

# Send msg (this is the test server below, your server may differ). 
# s = smtplib.SMTP(host='server', port=25) 
# sendmail expects the recipients as a list of addresses, not one joined string.
# s.sendmail(msg['From'], list(email_addresses), msg.as_string()) 
# s.quit() 
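
For reference, here is a minimal sketch of building a comparable message with the newer `email.message.EmailMessage` class from the standard library. The addresses, body text, and attachment bytes are placeholders, and the message is only constructed, never sent.

```python
from email.message import EmailMessage

# Placeholder addresses for illustration only.
recipients = ['alice@example.com', 'bob@example.com']

msg = EmailMessage()
msg['Subject'] = 'Ozymandias'
msg['From'] = 'shelley@example.com'
msg['To'] = ', '.join(recipients)

# Plain-text body first, then an HTML alternative for clients that render it.
msg.set_content('Look on my works, ye Mighty, and despair!')
msg.add_alternative('<p><b>Look on my works</b>, ye Mighty, and despair!</p>',
                    subtype='html')

# Attach raw bytes; a real script would read these from a file.
msg.add_attachment(b'a,b\n1,2\n', maintype='text', subtype='csv',
                   filename='data.csv')

attachment_names = [part.get_filename() for part in msg.iter_attachments()]
print(msg.get_content_type(), attachment_names)

# To send (assumes a reachable SMTP server, so it is left commented out):
# import smtplib
# with smtplib.SMTP(host='server', port=25) as s:
#     s.send_message(msg)
```

`send_message` reads the recipients from the message headers, so the address list does not have to be passed separately.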

# Additional Learning Resources

* ### [Webscraping on Analytics Vidhya](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)
* ### [Webscraping on Hitchhiker's Guide to Python](http://docs.python-guide.org/en/latest/scenarios/scrape/)
* ### [Email Examples in Python Documentation](https://docs.python.org/3.4/library/email-examples.html)

---

# Next Up: [Database](3_database.ipynb)

<br>

<img style="margin-left: 0;" src="static/database.png" width="20%">

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Applications-database.svg'>Dracos</a> under the <a href='https://creativecommons.org/licenses/by-sa/3.0/deed.en'>CC BY-SA 3.0</a>
</div>

---