# Online Data

Webscraping is the activity of downloading, manipulating, and using information obtained online. Webscraping can get very complicated, and we won't do much in this course. This set of lecture notes can help you get started on the basics. We'll look into this a bit more when we get to regular expressions in a few lectures. 

## Downloading Files

There are several modules for downloading files from the internet. We'll use `urllib`: 

In [1]:
import urllib.request  # plain `import urllib` does not guarantee that the `request` submodule is loaded

Conveniently, one of my colleagues has a copy of the Palmer Penguins dataset on his website:

In [2]:
url = "https://philchodrow.github.io/PIC16A/content/IO_and_modules/IO/palmer_penguins.csv"

Now, let's read in the data

In [3]:
filedata = urllib.request.urlopen(url)
to_write = filedata.read()  # the raw contents of the file, as bytes

# we need to use "wb" instead of "w" here because to_write is bytes, not a string
with open("interweb_penguins.csv", "wb") as f:
    f.write(to_write)

Having run this code, you can check in your file explorer that a file called `interweb_penguins.csv` now lives in the same directory as this notebook. We passed the somewhat unusual flag `"wb"` to `open()` to indicate that we need to write a *binary* file, rather than the usual text file. This is because `to_write`, the return value of `filedata.read()`, is binary data (a `bytes` object) rather than a string. We might ask you in assignments to use this pattern, but we won't evaluate you on it in any timed or closed-book contexts.
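To see the download-and-save pattern and the role of `"wb"` without needing an internet connection, here is a minimal sketch. It assumes a small made-up CSV file in a temporary directory, exposed through a local `file://` URL, as a stand-in for a real web address. The helper name `save_url` is hypothetical, not part of `urllib`.

```python
import tempfile
import urllib.request
from pathlib import Path


def save_url(url, dest):
    """Fetch `url` and write its raw bytes to `dest` (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        data = response.read()  # read() returns bytes, not str
    with open(dest, "wb") as f:  # binary mode, because data is bytes
        f.write(data)
    return data


# Offline stand-in for a website: a made-up local file served via a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_bytes(b"species,island\nAdelie,Torgersen\n")

data = save_url(source.as_uri(), workdir / "copy.csv")
print(type(data))

# Writing the same bytes to a file opened in text mode ("w") raises a TypeError.
try:
    with open(workdir / "wrong.csv", "w") as f:
        f.write(data)
except TypeError as err:
    print("text mode rejected bytes:", err)
```

Replacing `source.as_uri()` with an `https://` address should behave the same way, provided the machine is online and the address is reachable.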

The module `wget` is another popular tool for downloading files from the internet. 

## Data from Websites

Often, we want to access the contents of a webpage. In this case, the `urlopen` function from the `urllib.request` submodule lets us easily read the contents of a desired URL.

In [11]:
from urllib.request import urlopen

Here is the URL of my old website.

In [12]:
url="https://www.math.purdue.edu/~mperlmut/"

The following code reads in the HTML source of the page, as raw bytes:

In [13]:
page=urlopen(url)
html_bits=page.read()

Let's look at a slice of `html_bits`:

In [14]:
html_bits[10000:10500]

b'r><br>Course ID: mcgee93777\n<br><br>Access Code: WSCMMV-QUIPU-BRAND-ASWAN-MIMIR-PIPES\n<br><br>eText ID: educator254232eb\n\n<br><br>________________________________________________________________________\n\n <!---<br><br>ATTENTION: BONUS PRE-EXAM Office Hours will be held Monday 11/5/2012 4-5pm in PHYSICS201 and Tuesday 11/13/2012 also in PHYSICS from 2:30-3:30pm.\n\n <!--- <br><br> To vote on BONUS PRE-EXAM Office Hours, use the link below. Location will be decided Thursday 11/1/2012  \n\n<br><br><a h'

Not too nice! Instead, let's decode the bytes into a string.
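As a small offline illustration of what decoding does, the sketch below uses a short made-up byte string (not taken from the page). Printing a `bytes` object shows a `b'...'` literal with escape sequences such as `\n`, while the decoded string prints with real line breaks and readable characters.

```python
# A made-up snippet standing in for a piece of downloaded page content.
raw = "<br>Café hours\n<br>4-5pm".encode("utf-8")

print(type(raw))
print(raw)  # shown as a bytes literal, with escapes instead of line breaks

text = raw.decode("utf-8")  # interpret the bytes as UTF-8 encoded text

print(type(text))
print(text)  # shown as ordinary text
```

Note that `"utf-8"` is an assumption about how the bytes were encoded. It is a very common choice for web pages, but a page that uses a different encoding may raise a `UnicodeDecodeError` or decode incorrectly.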

In [15]:
html = html_bits.decode("utf-8")

print(html[10000:10500])

r><br>Course ID: mcgee93777
<br><br>Access Code: WSCMMV-QUIPU-BRAND-ASWAN-MIMIR-PIPES
<br><br>eText ID: educator254232eb

<br><br>________________________________________________________________________

 <!---<br><br>ATTENTION: BONUS PRE-EXAM Office Hours will be held Monday 11/5/2012 4-5pm in PHYSICS201 and Tuesday 11/13/2012 also in PHYSICS from 2:30-3:30pm.

 <!--- <br><br> To vote on BONUS PRE-EXAM Office Hours, use the link below. Location will be decided Thursday 11/1/2012  

<br><br><a h


__Better!__ Now, let's say we want to extract all of the links on this website.

We will do this using regular expressions. We will learn more about regular expressions later in the course, but since we haven't covered them yet, you have my permission to stop typing for the rest of the video. 

In [9]:
import re

urls=re.findall(r'href=[\'"]?([^\'">]+)',html)

urls


['/share/styles/main/v2_purdue.css',
 '/share/styles/main/v2_ie6_is_broken.css',
 '/share/styles/2009/bio.css',
 '/share/styles/sos/dir.css',
 '/',
 'http://www.purdue.edu',
 '/about/',
 '/people/faculty/',
 '/people/emeritus/',
 '/people/visiting/',
 '/people/lecturer/',
 '/people/gradstudents/',
 '/people/tas/',
 '/people/staff/',
 '/people/studentorgs/',
 '/about/diversity/',
 '/jobs/',
 '/about/purview/',
 '/academic/',
 '/academic/undergrad/',
 '/academic/grad/',
 '/academic/applied/',
 '/academic/actuary',
 '/academic/courses/',
 '/academic/courses/',
 '/academic/courses/schedule/',
 '/academic/courses/textbook/',
 '/academic/courses/grad/',
 '/academic/tutor/',
 '/academic/courses/oldexams/',
 '/academic/courses/past/',
 '/resources/',
 '/resources/internal/',
 '/resources/computing/',
 '/business/',
 '/resources/gta/',
 '/resources/banner/faculty/',
 '/resources/banner/grad/',
 '/resources/ada/',
 '/research/',
 '/research/faculty/',
 'http://ccam.math.purdue.edu/',
 'http://gm
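To get a feel for what this pattern matches, here is a small sketch that applies it to a made-up HTML snippet (the tags and addresses below are illustrative, not taken from the page). The pattern looks for `href=`, skips one optional quote character, and then captures everything up to the next quote or `>`.

```python
import re

# Made-up HTML with a double-quoted, a single-quoted, and an unquoted link.
sample_html = """<a href="/about/">About</a>
<a href='http://www.example.com'>Example</a>
<a href=/jobs/>Jobs</a>
<p>No link here</p>"""

# href=      the literal attribute name
# [\'"]?     an optional opening quote, single or double
# ([^\'">]+) capture one or more characters that are not a quote or '>'
pattern = r'href=[\'"]?([^\'">]+)'

links = re.findall(pattern, sample_html)
print(links)  # ['/about/', 'http://www.example.com', '/jobs/']
```

Because the captured group only stops at a quote or `>`, an unquoted link followed by other attributes would be captured together with them, so this is a quick approach rather than a full HTML parser.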

If we only want to look at URLs containing `http`, we can use a list comprehension:

In [10]:
[line for line in urls if "http" in line]

['http://www.purdue.edu',
 'http://ccam.math.purdue.edu/',
 'http://gmig.math.purdue.edu/',
 'http://hopf.math.purdue.edu/',
 'https://sites.google.com/site/purduemathbio/',
 'http://www.math.purdue.edu/~mperlmut/',
 'http://www.math.purdue.edu/',
 'http://www.math.purdue.edu/~banuelos',
 'http://www.math.purdue.edu/academic/courses/coursepage?subject=MA&course=16010',
 'http://www.tufts.edu/',
 'http://www.purduebarbell.com',
 'http://portal.mypearson.com/mypearson-login.jsp',
 'http://doodle.com/ftnv2rp4k56bampw',
 'http://www.math.purdue.edu/~krotz']
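One detail worth knowing: `"http" in line` is a substring test, so it keeps any entry that contains `http` anywhere, not only entries that begin with it. The sketch below uses a short made-up list to show the difference. The entry `/docs/http-basics/` is hypothetical and was added only to illustrate the point.

```python
# A short made-up list; the last entry is a hypothetical relative link.
sample_urls = [
    "/about/",
    "http://www.purdue.edu",
    "/people/faculty/",
    "https://sites.google.com/site/purduemathbio/",
    "/docs/http-basics/",
]

# Substring test, as in the cell above: keeps anything containing "http".
contains_http = [line for line in sample_urls if "http" in line]

# Stricter test: keeps only entries that begin with a web scheme.
starts_with_http = [
    line for line in sample_urls if line.startswith(("http://", "https://"))
]

print(contains_http)
print(starts_with_http)
```

On the list of links from the page above, the substring test appears to be enough, since every kept entry starts with `http`.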