# S-109A

## Lab 1: Data Collection - Web Scraping - Data Parsing

**Harvard University**<br>
**Summer 2018**<br>
**Instructors:** Pavlos Protopapas and Kevin Rader<br>
**Lab Instructors:** David Sondak and Will Claybaugh

---

In [1]:
from IPython.display import HTML
style = "<style>div.exercise { background-color: #ffcccc;border-color: #E9967A; border-left: 5px solid #800080; padding: 0.5em;}</style>"
HTML(style)

# Table of Contents
<ol start="0">
  <li> Learning Goals </li>
  <li> Working with `Python` </li>
    <ol>
        <li> functions, modules, and libraries </li>
    </ol>
  <li> `pandas` </li>
  <li> Regular Expressions  </li>
  <li> Beautiful Soup  </li>
</ol>

## Learning Goals

By the end of this lab, you should be able to:
* Know how to use `python` modules and libraries
* Use practical syntax for writing functions in `python`
* Work proficiently with `pandas`
* Know what regular expressions are
* Know how to work with regular expressions in `python`
* Use `Beautiful Soup` to parse `HTML` webpages
* Read and write files

**This lab corresponds to lectures 1 and 2 and maps on to homework 1 (and beyond).**

## <font color='red'> DO PART 1 BEFORE COMING TO LAB</font>

## Part 1: Working with `Python`:  Functions,  Modules, and Libraries
In the first part of this lab, you will brush up on your `python` skills.  You are assumed to have a basic, working knowledge of `python`.  We will go a little beyond this and begin to introduce and use some more interesting `python` features.

### Comments on `Python` Style
 - The Python style guide (how many spaces go around `=` signs? how are function names capitalized?) is available here: [**PEP8**](https://www.python.org/dev/peps/pep-0008/). This is the official style guide for this class.
 - In particular, you are expected to organize your code into functions, and document them via function annotations.
 - For the exact specification refer to [PEP 3107](https://www.python.org/dev/peps/pep-3107/).
 
Here is the basic format for writing annotated functions:
```python
def my_func(name: str, age: int, weight, height: float = 150.0) -> list:
    """short description of the function's effect
    description of input 1
    description of input 2
    ...
    description of output
    """
    code
    
    return [body_mass_index, disease_chance]
```
Things to notice:
1. The argument list to the function contains information about the types.  For example, we can see explicitly that `name` is a `string`.
2. The argument list is followed by an arrow pointing to the return type.  In this case, we can see that the return type should be a `list`.
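
To make the template concrete, here is a minimal runnable sketch of `my_func`. The template leaves the body unspecified, so everything inside it is an assumption made for illustration: the units (kilograms and centimeters), the body mass index formula, and the placeholder value for `disease_chance`.

```python
def my_func(name: str, age: int, weight, height: float = 150.0) -> list:
    """Return a hypothetical [body_mass_index, disease_chance] pair.

    name: the person's name (unused in this sketch)
    age: age in years (unused in this sketch)
    weight: weight in kilograms (assumed unit)
    height: height in centimeters (assumed unit), defaults to 150.0
    returns: list holding the body mass index and a placeholder disease chance
    """
    height_m = height / 100.0                  # centimeters -> meters
    body_mass_index = weight / height_m ** 2   # weight divided by height squared
    disease_chance = 0.0                       # placeholder only, not a real model
    return [body_mass_index, disease_chance]


# Annotations are stored on the function object; Python does not enforce them.
print(my_func.__annotations__)
print(my_func("david", 30, 45.0))
```

The annotations are available at runtime through `my_func.__annotations__`, which is how documentation tools and type checkers can read them.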

### Python Functions
 - If you have not yet written functions as part of your coding education, you should strongly consider taking another coding class before this one. We assume familiarity with the mechanics of writing functions and the reasons to do so.
 - Python does not check the types of the variables at runtime. The documentation above is strictly for human readers; it's perfectly possible to pass a floating point value to the `name` input.
 - Python supports default arguments. If a user calls `my_func('david', 30, 170)`, omitting the height argument, the code will run with height taking the default value of `150.0`.
 - Python functions can be called with named arguments. These can occur in any order. For instance `my_func(age=10, name='will', weight=120)` works as expected, even though the arguments are technically out of order
 - Python supports functions with variable numbers of arguments and variable numbers of key-word arguments. See [Defining Functions in `Python`](https://docs.python.org/3/tutorial/controlflow.html#more-on-defining-functions) for more.
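
The points above can be tried out with a short sketch. The functions `greet` and `total` are hypothetical names invented for this illustration; they are not part of the lab's code.

```python
def greet(name: str, greeting: str = "Hello") -> str:
    """Build a greeting; `greeting` has a default value."""
    return "{0}, {1}!".format(greeting, name)


def total(*args, **kwargs):
    """Sum any number of positional values; an optional `scale` keyword multiplies the sum."""
    return kwargs.get("scale", 1) * sum(args)


print(greet("will"))                         # uses the default greeting
print(greet(greeting="Hi", name="david"))    # keyword arguments in any order
print(greet(3.14))                           # the `str` annotation is not enforced
print(total(1, 2, 3))                        # variable number of positional arguments
print(total(1, 2, 3, scale=10))              # plus a keyword argument
```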

##### Question:  Think of a few reasons why to use function annotations.
##### Question:  Think of a case where it makes sense to have a default value for a particular function.  If you can't think of one yet, consider the default used by the 'join' function discussed later in this lab.

### Python Data Structures
We will not review all `python` data structures here; you are expected to be familiar with the main ones:
* lists
* tuples
* dictionaries
* sets

If the topics in the list above are not familiar to you, then it is absolutely essential that you brush up on them.  Here's the documentation to get you started:  [`python` data structures](https://docs.python.org/3/tutorial/datastructures.html).

However, we will very briefly review dictionaries.  They are more complex than `lists` and `tuples` and are used extensively.

#### Dictionaries
Dictionaries are defined by a `key`-`value` pair.  The `key` is used to index the dictionary.  Let's do a short example to review.

In [2]:
dog_dict = {} # initialize a dictionary

# Populate the dictionary
dog_dict["jack"] = "border collie"
dog_dict["sophi"] = "beagle"
dog_dict["betty"] = "irish wolfhound"

print(dog_dict["jack"]) # Access one element of the dictionary
print("\n") # Just make a space for convenience

print(dog_dict) # Print the entire dictionary

border collie


{'jack': 'border collie', 'sophi': 'beagle', 'betty': 'irish wolfhound'}


You can do much more with dictionaries, but that example encapsulates the basics.
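
As a small sketch of what "much more" can look like, here are a few other common dictionary operations applied to the same data. The key `"rex"` is a made-up example of a missing key.

```python
dog_dict = {"jack": "border collie", "sophi": "beagle", "betty": "irish wolfhound"}

print("jack" in dog_dict)                # membership test on the keys
print(dog_dict.get("rex", "unknown"))    # a default value instead of a KeyError

for name, breed in dog_dict.items():     # iterate over key-value pairs
    print("{0} is a {1}".format(name, breed))

del dog_dict["sophi"]                    # remove an entry
print(len(dog_dict))                     # number of remaining entries
```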

#### Question: Which of the above structures (list, tuples, etc) is best for holding...
- the names of everyone in CS 109?
- a list of all the words in the English language (presuming we'll want to search and decide if a particular user-provided word is one of the correctly-spelled words)?
- student ID associated with each student's name?
- the name associated with each (sequential) student ID?
- the nickname and github username associated with each student ID?

(Answers at the bottom of the notebook)

### Python Exception Handling
Sometimes you make a mistake when writing code.  Rather than crash, the program should fail gracefully, preferably with an informative error message.  These types of considerations are called *exception handling*.  A good way to deal with this in `python` is to use the `try-except` block.  Once again, we won't go deep here.  We'll just show you the basic structure and enthusiastically encourage you to use it when necessary.

Extensive documentation can be found at [Errors and Exceptions](https://docs.python.org/3/tutorial/errors.html).

#### Example
Suppose you write a function that looks like
```python
def bad_func(x: float, y: float) -> float:
    return x / y
```
and you call it with 
```python
bad_func(1, 0)
```
Right away, `python` returns
```
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-10-4a83987fd0cc> in <module>()
      2     return x / y
      3 
----> 4 bad_func(1, 0)

<ipython-input-10-4a83987fd0cc> in bad_func(x, y)
      1 def bad_func(x: float, y: float) -> float:
----> 2     return x / y
      3 
      4 bad_func(1, 0)

ZeroDivisionError: division by zero
```
This is the stack trace and it tells you where the error occurred (you will see this often in this class!). The ultimate problem is often at the very bottom (division by zero, in this case), and from the top down you can see where in the code you were when tragedy struck.

The error was first detected on line 4 at `bad_func(1, 0)` and then on line 2, inside `bad_func()`.  Then we're told that the error was a division by zero.

So informative!  Can't ask for much more than that.  In fact, `python` has a whole host of exceptions that it is aware of.  You can find them at [Built-in Exception](https://docs.python.org/3/library/exceptions.html) in the documentation.

Suppose you don't want your program to die when it reaches an exception and want it to automatically fix the problem (or ignore it) and continue on its merry way.  You can use a `try-except` block to handle this.

In [3]:
def bad_func(x: float, y: float) -> float:
    try:
        result = x/y
    except ZeroDivisionError:
        print("WARNING:")
        print("You set y = 0 but y must be non-zero.")
        print("We are setting y = 1.  This may drastically change your results.")
        y = 1.0
        result = x/y
    return result

x, y = 1.0, 0.0
important_quantity = bad_func(x, y)

print("\n Your important_quantity has a value of {0:3.6f}".format(important_quantity))

WARNING:
You set y = 0 but y must be non-zero.
We are setting y = 1.  This may drastically change your results.

 Your important_quantity has a value of 1.000000


Notice that our program will continue to run past the point of no return.  We were good developers and warned the user about what was happening.

For example, maybe $y$ is the standard deviation of a quantity.  The code must normalize all variables to the standard deviation.  If the user makes a mistake and accidentally calculates the standard deviation to be zero, then nothing else will work.  So, we just warn them what happened and don't carry out the normalization (by setting the standard deviation to $1$) and carry on with the analysis.  Hopefully it doesn't matter, but if it does, we at least warned the user about it.

There are other use cases for exception handling, and you should at least familiarize yourself with them.
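
One such use case is raising an exception yourself when an input is invalid, and letting the caller decide what to do. The sketch below uses a hypothetical `normalize` function (not part of the lab) and also shows the optional `else` and `finally` clauses of a `try` block.

```python
def normalize(values: list, std: float) -> list:
    """Divide each value by std, refusing a non-positive standard deviation."""
    if std <= 0:
        raise ValueError("std must be positive, got {0}".format(std))
    return [v / std for v in values]


try:
    scaled = normalize([1.0, 2.0], 0.0)
except ValueError as err:
    print("Could not normalize:", err)   # runs only if a ValueError was raised
    scaled = None
else:
    print("Normalization succeeded.")    # runs only if no exception was raised
finally:
    print("This line runs whether or not an exception occurred.")
```

Compared with silently replacing `y` by `1.0`, raising an exception forces the caller to make the decision explicitly, which is often the safer design.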

### Python Classes, Modules, and Libraries

#### Classes/Objects
In true object oriented programming (OOP), the developer writes code around things called objects.  An object (or a class) groups together data and functions that operate on that data.  You might know this terminology from *C++* and other languages.

For example, maybe I have one function that calculates the area of a circle and another function that calculates the perimeter.  I could group these two functions together, along with data about the circle, into a class called `circle`.
A user can create a particular circle by *instantiating* a `circle` object, as we do below.

```python
from shapes import circle
my_circle = circle(r=5)
circle_area = my_circle.area()
circle_perimeter = my_circle.perimeter()
```
When a function is part of an object it is called a *method* instead of a function. Notice that methods are accessed using the *dot* notation.
**What really matters for this class is that we will often create an object and use the methods associated with it without having to worry about the object's internal workings**
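
You will not need to write classes in this course, but it can help to see what one might look like on the inside. The following is only a sketch of how a `circle` class could be written; the `shapes` module is hypothetical, so this is an assumed implementation and not actual library code.

```python
import math


class circle:
    """A circle of radius r.

    The lowercase name matches the snippet above; PEP8 would suggest `Circle`.
    """

    def __init__(self, r: float):
        self.r = r                        # data stored on the object

    def area(self) -> float:
        return math.pi * self.r ** 2      # method: uses the object's own data

    def perimeter(self) -> float:
        return 2.0 * math.pi * self.r


my_circle = circle(r=5)
print(my_circle.area(), my_circle.perimeter())
```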


#### Modules
Modules in python contain a bunch of code that logically fits together. Most often this is a bunch of classes and functions that address a particular need. For example, there could be a `shapes.py` file that contains the `circle` class used above, as well as a `triangle` class and maybe even a `will_they_fit` function to tell if one shape can fit inside another. That .py file would then be called a module, and we could import from it.

If we only want particular portions of a module, we use the from ___ import ___ syntax above. If we want the whole module, we can do this:
```python
import shapes
my_circle = shapes.circle(r=5)
my_triangle = shapes.triangle(3,4,5)
circle_area = my_circle.area()
fit_flag = shapes.will_they_fit(my_circle, my_triangle)
```

#### Libraries
Libraries may contain a bunch of modules that go together.  A library usually has a specific directory structure.  We won't discuss these further here, because as a user you only need to know about the import syntax above.

### Take a breath and recap
In this class, you are not expected to write classes, modules or libraries.  However, you will interact with them *all the time*.  You will be expected to replicate the syntax above and use specific modules/libraries on every assignment.

##### Libraries that you will become very comfortable with:
- `numpy`:  [NumPy](http://www.numpy.org/)
- `scipy`:  [SciPy](https://www.scipy.org/)
- `sklearn`:  [scikit-learn](http://scikit-learn.org/stable/index.html)
- `pandas`:  [pandas](https://pandas.pydata.org/)
- `matplotlib`: [matplotlib](https://matplotlib.org/)

We won't go into the inner-workings of how each of these libraries works (e.g. `numpy` is built on top of `C`).  However, throughout the course we will make extensive use of these libraries and we will point out the necessary features as we go along.


## <font color='red'> You're done. Save the rest for lab on Friday</font>

## Part 2:  I/O and Preprocessing
Much of data science and computational science involves reading data from files and writing to files.   This process is generally known as `I/O`.

There are many ways of accomplishing different `I/O` tasks.  `Python` has its own built-in functionality for reading and working with files.  You can also read data with `numpy` and `pandas`, among others.  We won't cover `numpy` for basic input parsing today, but you'll probably be introduced to it eventually.  We will spend a considerable amount of time on `pandas`.

### Part 2.1:  `Python`'s built-in I/O 
We'll work with the small file called `brief_comments.txt` in the `data` directory.

You can read in a file using the `open` function.  There is a "right" way and a "wrong" way of doing this.  The "wrong" way is to call `open` and `close` yourself; the "right" way is to open the file inside a `with` statement.  Both are shown below.

In [4]:
# This approach should not be used!
f = open("data/brief_comments.txt", "r") # Open the file for reading
dogs = f.read() # Read the file
f.close() # Remember to close the file!

In [5]:
# This approach is the correct way, and should always be used.
with open("data/brief_comments.txt", "r") as f:
    dogs = f.read()

#### Observations
The `with` statement does a few things for us automatically.  First, it closes the file for us so we don't need to remember to do this.  It is important to close a file when you're done with it!  There are a few reasons for this:
1. Having too many files open at once consumes resources
2. Not closing a file is sloppy coding
3. You might not see changes to the file until you close it

`with` even closes the file for us if an exception was thrown.  It's nice that the `with` statement handles that for us.
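
The claim about exceptions is easy to check. This sketch writes a throwaway file to a temporary directory (so it does not depend on the `data` directory), raises an exception inside the `with` block, and then inspects the file object's `closed` attribute.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w") as f:
    f.write("Dogs have been with humans for millennia.\n")

try:
    with open(path, "r") as f:
        first_line = f.readline()
        raise RuntimeError("something went wrong while processing")
except RuntimeError:
    pass  # ignore the error; we only care about the state of the file

print(f.closed)  # the file was closed even though an exception was raised
```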

### Part 2.2:  Preprocessing

Now we can do some operations on the text.  We will explore a few methods here:
1. `len`
2. `split`
3. `lower`

There are many other methods as part of the `string` class.  These can be found in the documentation:  [String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [6]:
print(dogs) # What are the contents of the object we just read in?

Dogs have been with humans for millenia.  Although they do not speak human languages (e.g. English or Chinese), they have been watching and observing
us for all that time.  Their emotional intelligence is prodigious.  We are only just beginning to scrape the surface of the mind of dogs.  The nascent
field of dog cognition is beginning to shed light on ways to form meaningful communication with dogs.



In [7]:
type(dogs) # What kind of data are we dealing with?

str

In [8]:
l = len(dogs) # How many characters are in this string?
print(l)

403


In [9]:
dogs[10] # Let's access the 11th item

'b'

That's something of a letdown.  The `string` object is just one giant string, so accessing the 11th item gives us the 11th character.  It would be more useful to access individual words.  We can use the `split` method to accomplish this: [`str.split`](https://docs.python.org/3/library/stdtypes.html#str.split).

In [10]:
words = dogs.split()
print(words)

['Dogs', 'have', 'been', 'with', 'humans', 'for', 'millenia.', 'Although', 'they', 'do', 'not', 'speak', 'human', 'languages', '(e.g.', 'English', 'or', 'Chinese),', 'they', 'have', 'been', 'watching', 'and', 'observing', 'us', 'for', 'all', 'that', 'time.', 'Their', 'emotional', 'intelligence', 'is', 'prodigious.', 'We', 'are', 'only', 'just', 'beginning', 'to', 'scrape', 'the', 'surface', 'of', 'the', 'mind', 'of', 'dogs.', 'The', 'nascent', 'field', 'of', 'dog', 'cognition', 'is', 'beginning', 'to', 'shed', 'light', 'on', 'ways', 'to', 'form', 'meaningful', 'communication', 'with', 'dogs.']


In [11]:
type(words)

list

So `split` returned a `python` `list` by splitting the string into elements separated by white space.  Now let's see what the 11th item is.

In [12]:
words[10]

'not'

Very nice!  Let's explore some of the other cool string operations that we can do.

In [13]:
N = len(words) # Number of words
print("There are {0} words in our brief comments.".format(N))

There are 67 words in our brief comments.


#### Brief Interlude
We used the `format` method on a string.  The *pythonic* way of doing this in `python3` can be found at the following resources:
* [The `format` statement](https://docs.python.org/3/library/stdtypes.html#str.format) --- Syntax for using the `format` statement.
* [Format String Syntax](https://docs.python.org/3/library/string.html#formatstrings) --- Different ways for formatting strings (e.g. printing integers, floats, etc).
* [Formatting Literal Strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) --- An alternative way of formatting strings.  Not used in this lab.
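
A brief sketch of the three styles described in these resources, using made-up values for `N` and `pi_approx`:

```python
N = 67
pi_approx = 3.14159265

print("There are {0} words.".format(N))             # positional index
print("There are {n} words.".format(n=N))           # named field
print("pi is roughly {0:.3f}".format(pi_approx))    # float with three decimal places
print(f"There are {N} words; pi is roughly {pi_approx:.3f}")  # f-string (Python 3.6+)
```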

#### End Brief Interlude

Let's get back to our nice little example.

Suppose we want to count the occurrences of a particular word.  We could use the `count` method (see [`list.count()`](https://docs.python.org/3/tutorial/datastructures.html)).

How many times is the word `dogs` mentioned?

In [14]:
words.count("dogs")

0

That's not correct.  We can see clearly that the word `dogs` is mentioned $3$ times (it's a small enough text that we can count this manually).  The problem is that sometimes `dogs` is capitalized, sometimes it comes with a period, and sometimes it's all lowercase.  We need to do further processing.

In [15]:
more_words = [word.split('.')[0] for word in words] # List comprehension
more_words.count("dogs")

2

We found $2$ dogs!  Still not correct, but better than before.

What just happened here?!

In [16]:
# We can write the list comprehension as a for loop as follows:
more_words1 = []
for word in words:
    inter = word.split('.')
    inter1 = inter[0]
    more_words1.append(inter1)
more_words1.count("dogs")

2

Let's put this all into English.


1. First of all, this time we used `split` to split on periods rather than white space.  Splitting on white space is the default.  Beyond that, you have to tell it what to do the split on.
2. Second, we introduced a **list comprehension**.  The list comprehension structure is extremely useful.  You should take a look at the documentation:  [List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).  Basically, a list comprehension can be used to create a new list by applying operations on an old list.
3. Third, we kept only the first element of each list produced by `split` inside the list comprehension.
  * You see, each time the `split` method is called, it creates a new list.
  * But we know that we don't want a list of lists.
    - Try it out.  Just print the result from `[word.split('.') for word in words]` and see what it looks like.
  * The first element of each nested list is the part of the word before the first period (for a word with no period, it is the whole word), so indexing with `[0]` gives us a string back.  Note that a word such as `'dogs.'` splits into `['dogs', '']`, so the nested lists do not always contain just one element.  If this sentence sounds strange to you, then you probably didn't try things out on your own like we suggested you do.
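
To see the list of lists without printing all of `words`, here is the same operation on a short sample. The list `sample` is a hand-picked assumption, chosen to include words with zero, one, and several periods.

```python
sample = ["dogs.", "(e.g.", "Chinese),", "mind"]

# Splitting on periods returns one list per word.
nested = [word.split('.') for word in sample]
print(nested)

# Keeping only element 0 of each nested list gives plain strings again.
firsts = [word.split('.')[0] for word in sample]
print(firsts)
```

Notice that `'(e.g.'` is reduced to `'(e'`, a reminder that splitting on periods is a blunt tool.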

<div class="exercise"><b>Exercise</b></div>
You should play around with list comprehensions on your own.  Create any list you want.  Then operate on it using a list comprehension.
* The list doesn't have to be strings (make it `int`s if you want to).
* For example, you could make a list of consecutive integers (call it `int_list`).
* Then use a list comprehension to square each integer and put the result in a new list (call it `int_list_sqrd`).

In [17]:
# your code here
my_ints = [i for i in range(-5, 6)]
my_ints2 = [i*i for i in my_ints]
print(my_ints2)

[25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25]


Still, we have more work to do.  Only $2$ occurrences of the word `dogs` were noted.  The other occurrence happens with a capital letter.  Not to fear!  We can convert strings to lower case using the `lower` method.

In [18]:
my_str = 'HELLO Bonnie'
my_str.lower()

'hello bonnie'

<div class="exercise"><b>Exercise</b></div>
Use a list comprehension to create a list of all lower case words starting from the `more_words` list that we just created.  Then print out the number of occurrences of the word `dogs`.

In [19]:
# Your code here
lower_words = [word.lower() for word in more_words]
lower_words.count("dogs")

3
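
The two cleaning steps can also be combined into a single list comprehension. The sketch below is one possible variation, not the lab's official solution: it uses `str.strip` with `string.punctuation` to remove punctuation from both ends of each word, and it runs on a made-up sentence so that it is self-contained.

```python
import string

text = "Dogs are great. We study the mind of dogs.  Some dogs, however, nap."

# Strip punctuation from both ends of each word, then lowercase it.
cleaned = [word.strip(string.punctuation).lower() for word in text.split()]
print(cleaned.count("dogs"))
```

Unlike splitting on `'.'`, this also handles the word `dogs,` followed by a comma.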

<div class="exercise"><b> Exercise </b> </div>
* `hamlet.txt` is in the `data` directory.  Open and read it into a variable called `hamlettext`.
* What is the type of `hamlettext`?  What is its length?  Print the first $500$ items of `hamlettext`.
* Create a list called `hamletwords` where the items are the words of the play.
  * Confirm that the list you created is really a list
  * Confirm that each element of the list is a string
  * Print the first 10 items in the list.  
  * Print "There are $N$ total words in Hamlet.",  where $N$ is the total number of words in Hamlet.
* Using a *list comprehension*, create `hamletwords_lc` which converts the items in `hamletwords` to lower-case. 
* Count the number of occurrences of the word "thou".
* Use `set` to determine the set of unique words in `hamletwords_lc`.  Here's documentation on the `set` datatype:  [Sets](https://docs.python.org/3/tutorial/datastructures.html#sets).
  * Print "There are $M$ unique words in Hamlet.", where $M$ is the number of unique words.  As a sanity check, verify that $M < N$.
  * Your output should be 
  ```
  "There are 7456 unique words in Hamlet."
  ```

In [20]:
# Your code here

# Open file
with open("data/hamlet.txt", "r") as f:
    hamlettext = f.read()

print(type(hamlettext), "\n") # data type
print(len(hamlettext), "\n") # length
print(hamlettext[:500], "\n") # first 500 items

hamletwords = hamlettext.split() # get list of words in hamlet
print(type(hamletwords), "\n") # confirm that it's a list
print(hamletwords[0:10], "\n") # first 10 items
print("There are %d total words in Hamlet.\n" %len(hamletwords)) # Total words in hamlet

hamletwords_lc = [word.lower() for word in hamletwords] # convert to lowercase
print(hamletwords_lc.count("thou"), "\n") # occurrences of thou

uniquewords_lc = set(hamletwords_lc) # unique words
print(len(uniquewords_lc), len(hamletwords_lc), "\n")
print("There are {0} unique words in Hamlet.".format(len(uniquewords_lc)))

<class 'str'> 

173946 

﻿XXXX
HAMLET, PRINCE OF DENMARK

by William Shakespeare




PERSONS REPRESENTED.

Claudius, King of Denmark.
Hamlet, Son to the former, and Nephew to the present King.
Polonius, Lord Chamberlain.
Horatio, Friend to Hamlet.
Laertes, Son to Polonius.
Voltimand, Courtier.
Cornelius, Courtier.
Rosencrantz, Courtier.
Guildenstern, Courtier.
Osric, Courtier.
A Gentleman, Courtier.
A Priest.
Marcellus, Officer.
Bernardo, Officer.
Francisco, a Soldier
Reynaldo, Servant to Polonius.
Players.
Two Clowns,  

<class 'list'> 

['\ufeffXXXX', 'HAMLET,', 'PRINCE', 'OF', 'DENMARK', 'by', 'William', 'Shakespeare', 'PERSONS', 'REPRESENTED.'] 

There are 31659 total words in Hamlet.

95 

7456 31659 

There are 7456 unique words in Hamlet.



### Part 2.3:  Writing Files
So far, we've discussed how to read data from files.  We've used `python`'s built-in functionality.  Of course, you generally want to *write* data to files as well.  What's the point of generating data if you're not saving it somewhere?!

We'll begin this section by generating some data to write.

In [21]:
my_ints = [i for i in range(-5, 6)]
my_ints2 = [i*i for i in my_ints]
print("Our list is {0}.".format(my_ints2))

Our list is [25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25].


Now let's prepare to write this data out.

In [22]:
with open("data/datafile.txt", "w") as dataf:
    # header
    dataf.write("Here is a list of squared ints.\n\n")
    # Columns
    dataf.write("n")
    dataf.write(", ")
    dataf.write("n^2" + "\n")
    # Data
    for i, i2 in zip(my_ints, my_ints2):
        dataf.write("{}, {}\n".format(str(i), str(i2)))

Once again, there are a few things worth mentioning here.

1. This is pretty ugly. We even had to convert the ints to strings before we could write things out.
2. We've introduced the `zip` function.
  * Here's the documentation: [`zip` documentation](https://docs.python.org/3/library/functions.html#zip)
  * Here's the essence:  Pair up the elements of the two lists into tuples.  Now you can iterate over the tuples.
  * More precisely, `zip` aggregates the *iterables* `my_ints` and `my_ints2` into an *iterator* of tuples.  
    - In our case, the iterables here are just lists (they can be iterated on).
    - The *iterator of tuples* is formed by pairing the first elements of each list into a tuple, then the second elements, and so on.
    - The `for` statement steps through those tuples one at a time, unpacking each one into `i` and `i2`.
  * This sounds complicated, but it makes things very clean and nice to work with.  You should practice using `zip` whenever possible.
3. The related cousin of `zip` is `enumerate`:  [`enumerate` documentation](https://docs.python.org/3/library/functions.html#enumerate).  We will use `enumerate` all the time, but not yet.
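
Here is a short sketch of both functions on a smaller list than the one used above. Wrapping `zip` in `list` is done only so that the pairs can be printed; inside a `for` loop it is not needed.

```python
my_ints = [-1, 0, 1, 2]
my_ints2 = [i * i for i in my_ints]

pairs = list(zip(my_ints, my_ints2))   # materialize the iterator to look at it
print(pairs)

# enumerate pairs each element with its position in the list.
for index, value in enumerate(my_ints2):
    print(index, value)
```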

### `json`

We really don't want to write out complex data forms using the `write` method.

Fortunately, there are a bunch of ways to write out data:
* [`pickle`](https://docs.python.org/3/library/pickle.html#module-pickle) --- `python` specific; used for saving and loading `python` objects
* [`xml`](https://docs.python.org/3/library/xml.html) --- Commonly used standard for storing data.
* [`json`](https://docs.python.org/3/library/json.html#module-json) --- Extremely commonly used standard for sharing data.

We will focus on `json` because it is a commonly used standard for data exchange.  Here is a nice little tutorial on `json`: [Working With JSON Data in Python](https://realpython.com/python-json/).

Here are a few comments on `json`:
* Human readable
* Used both within and external to the `python` ecosystem
* Cannot represent all `python` types, but that's usually okay.  If you are only working in `python`, then you should probably just use `pickle`.
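
To illustrate the last point, the sketch below converts a small made-up record to a `json` string and back with `json.dumps` and `json.loads`. Tuples come back as lists, integer keys come back as strings, and a `set` cannot be serialized at all.

```python
import json

record = {1: ("Cloe", 3), "ok": True, "owner": None}

text = json.dumps(record)       # python object -> json string
print(text)

restored = json.loads(text)     # json string -> python object
print(restored)                 # the int key is now a str, the tuple is now a list

try:
    json.dumps({"breeds": {"Beagle", "Border Collie"}})
except TypeError as err:
    print("Cannot serialize:", err)
```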

In [23]:
import json # import the json library

Suppose we have a `python` dictionary containing information about individual dogs in a particular dog shelter.

In [24]:
dog_shelter = {} # Initialize dictionary

# Set up dictionary elements
dog_shelter['dog1'] = {'name': 'Cloe', 'age': 3, 'breed': 'Border Collie', 'playgroup': 'Yes'}
dog_shelter['dog2'] = {'name': 'Karl', 'age': 7, 'breed': 'Beagle', 'playgroup': 'Yes'}

dog_shelter

{'dog1': {'age': 3,
  'breed': 'Border Collie',
  'name': 'Cloe',
  'playgroup': 'Yes'},
 'dog2': {'age': 7, 'breed': 'Beagle', 'name': 'Karl', 'playgroup': 'Yes'}}

We can access the elements of the dictionary like so:

In [25]:
dog_shelter['dog1']

{'age': 3, 'breed': 'Border Collie', 'name': 'Cloe', 'playgroup': 'Yes'}

In [26]:
dog_shelter['dog2']['name']

'Karl'

#### Writing to `json` file

Now we should save the dictionary to a file.  We decide to save it in `json` format because then many people will be able to read it and work with it.

In [27]:
with open('dog_shelter_info.txt', 'w') as output:  
    json.dump(dog_shelter, output)

We can make sure the file is there:

In [28]:
!ls

Lab1-Solutions.ipynb Untitled1.ipynb      images
Lab1.ipynb           data                 lab_files
README.md            dog_shelter_info.txt
Untitled.ipynb       iacs.png


And we can even take a quick look:

In [29]:
!cat dog_shelter_info.txt

{"dog1": {"name": "Cloe", "age": 3, "breed": "Border Collie", "playgroup": "Yes"}, "dog2": {"name": "Karl", "age": 7, "breed": "Beagle", "playgroup": "Yes"}}

By the way, don't worry if you don't understand those last two commands.  Those are command line commands meaning *list files in current directory* (`ls`) and *view contents of file* (`cat`).
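
If you prefer to stay inside `python` (for example on a system where `ls` and `cat` are not available), roughly the same checks can be done with the standard library. This sketch writes a small made-up record to a temporary directory so that it does not touch your working directory; the `indent=2` argument to `json.dump` is optional and just makes the file easier to read.

```python
import json
import os
import tempfile

# Write a small json file into a temporary directory (hypothetical data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dog_shelter_info.txt")
with open(path, "w") as output:
    json.dump({"dog1": {"name": "Cloe", "age": 3}}, output, indent=2)

print(os.listdir(tmp_dir))     # rough equivalent of `ls`

with open(path, "r") as f:     # rough equivalent of `cat`
    contents = f.read()
print(contents)
```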

#### Reading from `json` file

Reading from a `json` file is also very easy.

In [30]:
with open('dog_shelter_info.txt', 'r') as f:
    dog_data = json.load(f)

In [31]:
print(dog_data)

{'dog1': {'name': 'Cloe', 'age': 3, 'breed': 'Border Collie', 'playgroup': 'Yes'}, 'dog2': {'name': 'Karl', 'age': 7, 'breed': 'Beagle', 'playgroup': 'Yes'}}


Let's explore the data structure.  We know that it's a `python` dictionary.

In [32]:
for dogid, info in dog_data.items():
    print(dogid)
    print("{0} is a {1} year old {2}.".format(info['name'], info['age'], info['breed']))
    if info['playgroup'].lower() == 'yes':
        print("{0} can attend playgroup.".format(info['name']))
    else:
        print("{0} is not permitted at playgroup.".format(info['name']))
    print("======================================\n")

dog1
Cloe is a 3 year old Border Collie.
Cloe can attend playgroup.
======================================

dog2
Karl is a 7 year old Beagle.
Karl can attend playgroup.
======================================




##  Part 2: Regular Expressions

### Background and Motivation
*Regular Expressions* (a.k.a. `regex` or `regexp`) are a tool for working with and manipulating text data.  We've already done some text manipulation in this lab.  We've shied away from particularly thorny examples until now.  Using `python`'s string methods is useful, but that approach has its limitations.

Regular expressions provide a set of rules for working with text data.  At first, these expressions look completely foreign (e.g. `([0-9]+(\.[0-9]+){3})`), but once you know some of the basics they're not so bad.

As it turns out, the fundamentals of regular expressions are based upon abstract algebra.  Mathematicians have studied regular expressions simply to lay down and understand their theoretical underpinnings.  We won't go anywhere near that level of detail.  For us, regular expressions will simply be used to process some gnarly text data.

There are a few key `regex` patterns and concepts that you must know and be comfortable with.  The fact is, there are many ways to create a `regex` to search for a particular pattern.  Some approaches are more succinct than others.  As with most things, you will get better the more you practice.  You should try to make your `regex` patterns as crisp as possible while still maintaining readability.

### Some resources
In order to become proficient with `regex`s, you are **strongly encouraged** to take the *RegexOne* tutorial at [https://regexone.com/](https://regexone.com/).  That tutorial is an interactive and accessible introduction to regular expressions and it can be done in an hour.  It contains problems at the end to test your knowledge.  The *RegexOne* website also contains a very nice demo for `Python3`.  This lab will borrow from the *RegexOne* `python` demo to walk you through some concepts.

You may also want to consider the book [Mastering Regular Expressions](http://shop.oreilly.com/product/9780596528126.do) for more details as well as some historical comments.

---

### Learning by Example
Suppose you have a string containing a date:

In [33]:
birthday = "June 11"

You would like to search this string for the month.  For such a simple string, this can easily be done with the `python` string methods.

In [34]:
birth_month = birthday.strip()[:-3]
print(birth_month)

June


We're after much more intense strings, which we'll process with regular expressions.  Let's warm up with a `regex` on this simple string.

In [35]:
regex = r"\w+" # A first regular expression

What in the world does this mean?!  Well, there are a few syntactical details here:
1. The `r` means that the string is a *raw string*.  This tells `python` not to treat backslashes as escape characters, so the backslash in `\w` is kept as a literal backslash.  Raw strings are useful whenever a string contains many backslashes, as `regex` patterns and TeX markup do.
2. The `\w` indicates any alphanumeric character or an underscore.
3. The `+` indicates one or more occurrences.

In English words, we say that `regex` is a regular expression that tries to match one or more occurrences of alphanumeric characters.
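
The effect of the `r` prefix can be checked directly by comparing string lengths. This sketch does not use any regular expression functionality yet.

```python
normal = "\n"    # escape sequence: one newline character
raw = r"\n"      # raw string: a backslash followed by the letter n

print(len(normal), len(raw))

# Without the r prefix, the backslash itself has to be escaped.
print(r"\w+" == "\\w+")
```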

We still haven't specified what string we want to find the matches in.  All we've done so far is specify a `regex`.

Let's remedy that.  We will now use the `python` `re` module to start matching some regular expressions in strings.  Here are two more resources for you:
* [`re` module documentation](https://docs.python.org/3/library/re.html) --- The official `python` documentation on the `re` module
* [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto) --- A gentler introduction to using the `re` module.

Honestly, your best bet is still to start with the resources found on the *RegexOne* site.

In [36]:
import re # Regular expression module
months = re.search(regex, birthday) # Search string for regex
print(months)

<_sre.SRE_Match object; span=(0, 4), match='June'>


We just searched the `birthday` string for the regular expression contained in `regex`.  If the pattern doesn't match, then we get `None` in return, otherwise we get an object that contains some information.  In our case, the pattern matches something in `birthday`.  What information did we get?
* We are told the `start` and `end` of the matching pattern (that is the `span=(0, 4)`)
* We are told what matched (that is the `'June'`)

You can access the starting and ending indices with the `start()` and `end()` methods as follows:

In [37]:
print("The matched pattern starts at index {0} and ends at index {1}.".format(months.start(), months.end()))

The matched pattern starts at index 0 and ends at index 4.
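Besides `start()` and `end()`, a match object also has `group()` and `span()` methods.  The short sketch below repeats the search from above and shows how they relate to each other.

```python
import re

birthday = "June 11"
match = re.search(r"\w+", birthday)

print(match.group())   # the matched text: June
print(match.span())    # (start, end) as a tuple: (0, 4)

# Slicing the original string with start() and end() recovers the same text.
print(birthday[match.start():match.end()])   # June
```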


Note that we could have used a very simple pattern to search for the word `June`:

In [38]:
regex = r"June"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(0, 4), match='June'>

Same answer!

In [39]:
re.search(r"Oct", birthday) # nothing prints out

**Note:** When a regex fails to match, like above, it can look a little weird. Instead of getting an empty match object that prints out, we get `None`, which doesn't display anything. Printing it still works though.

In [40]:
months = re.search(r"Oct", birthday)
print(months) # printing the match object shows us the result, even if no match was found.

None
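Because `re.search` returns `None` when nothing matches, calling a method such as `.group()` directly on the result raises an `AttributeError`.  One way to guard against that is sketched below; the helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the first matching substring, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:        # re.search returns None when nothing matches
        return None
    return match.group()

print(first_match(r"June", "June 11"))   # June
print(first_match(r"Oct", "June 11"))    # None
```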


As already mentioned, regular expressions work directly with text.  You need the fancier stuff when you have more complicated strings.  We'll get to that in a moment.  First, do the following exercise.

<div class=exercise><b>Exercise</b></div>
Consider the string 
```python
statement = "June is a lovely month."
```
* Use a regular expression to find the pattern `June`.
* Create a new string, `fragment` from `statement`, which starts just after the word `June`.

Your output should be ` is a lovely month.`

In [41]:
statement = "June is a lovely month."
regex = r"June"
fragment = statement[re.search(regex, statement).end():]
print(fragment)

 is a lovely month.



Okay, we're ready to move on to more interesting things.  We'll do this in a sequence of demos.

First, let's try to get the day out of the birthday string.  We'll use some more interesting expressions to illustrate some of the important patterns.

#### We can use `\d` to get just digits.

In [42]:
regex = r"\d+"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(5, 7), match='11'>

#### We can use `[a-z]` for characters `a` to `z` and `[0-9]` for digits `0` to `9`.

In [43]:
regex = r"[A-Za-z]+"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(0, 4), match='June'>

Note that we had to specify both capital letters and lowercase letters.  We also needed the `+` pattern to make sure that one or more occurrences of the characters were found.  If not, we would have only gotten one occurrence, as illustrated in the next example.

In [44]:
regex = r"[0-9]"
re.search(regex, birthday)

<_sre.SRE_Match object; span=(5, 6), match='1'>

We only got the first digit, `1`!
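To see how the quantifier changes what gets matched, here is a small side-by-side sketch on the same `birthday` string.  The `{2}` and `*` quantifiers are not used elsewhere in this lab; they are included only for comparison.

```python
import re

birthday = "June 11"

print(re.search(r"[0-9]", birthday).group())     # one digit:          1
print(re.search(r"[0-9]+", birthday).group())    # one or more digits: 11
print(re.search(r"[0-9]{2}", birthday).group())  # exactly two digits: 11

# Careful: * means "zero or more", so it happily matches the empty string
# at the very start of "June 11" and never reaches the digits.
print(re.search(r"[0-9]*", birthday).group() == "")   # True
```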

#### `findall()`

Let's start getting down to business.  We want the actual month and the actual day.  Not the whole thing.  That's not too hard given what we already have at our disposal.

In [45]:
regex_month = r"[A-Za-z]+"
month = re.findall(regex_month, birthday)
print(month)

regex_day = r"\d+"
day = re.findall(regex_day, birthday)
print(day)

['June']
['11']


The `findall()` function returns a list of all the non-overlapping pattern matches.  Very cool.  Now we're ready to move on to another very important concept: *groups*.
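One thing `findall()` throws away is *where* each match was found.  If you need the positions too, `re.finditer` yields full match objects instead.  A minimal sketch, using a made-up string `text`:

```python
import re

text = "June 11th, May 12th"

# findall returns plain strings, so the positions are lost.
print(re.findall(r"\d+", text))   # ['11', '12']

# finditer yields match objects, which still know where each match sits.
for match in re.finditer(r"\d+", text):
    print(match.group(), match.span())   # 11 (5, 7) then 12 (15, 17)
```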

#### Groups
Let's say we have a busy string of birthdays:

In [46]:
birthdays = "June 11th, December 13th, September 21st, May 12th"

We want to get all the months and all the days.  This looks like a job for the `findall()` method.

In [47]:
regex = r"[A-Za-z]+"
bdays = re.findall(regex, birthdays)
print(bdays)

['June', 'th', 'December', 'th', 'September', 'st', 'May', 'th']


That's not right.  Almost, but not quite.  We can fix things in a bunch of ways.  Let's take this opportunity to introduce groups.

In [48]:
regex = r"([A-Za-z]+) (\d+\w+)"
bdays = re.findall(regex, birthdays)
print(bdays)

[('June', '11th'), ('December', '13th'), ('September', '21st'), ('May', '12th')]


Let's try to unpack all of that:
* The parentheses indicate a group.  So, our first set of parentheses indicates that we want a pattern of one or more letters.
* Right after that first group, we have a space.
* Then we have another group.  This time, the group indicates a pattern with one or more occurrences of digits followed by one or more occurrences of any alphanumeric characters.

We could have accomplished the same thing in a number of ways.  Here are a couple more possibilities:
```python
regex = r"([A-Za-z]+)\s(\d+\w+)"
regex = r"([A-Za-z]+)\s(\w+)"
regex = r"([A-Za-z]+) (\d+[a-z]+)"
```
You get the idea.
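Groups are also available on a single match object, not just through `findall()`.  The sketch below uses *named* groups, written `(?P<name>...)`; the group names `month` and `day` are our own choice for illustration.

```python
import re

birthdays = "June 11th, December 13th, September 21st, May 12th"

match = re.search(r"(?P<month>[A-Za-z]+) (?P<day>\d+)", birthdays)
print(match.group(0))         # whole match: June 11
print(match.group(1))         # first group: June
print(match.group("day"))     # named group: 11
print(match.groupdict())      # {'month': 'June', 'day': '11'}
```

Named groups cost a few extra characters, but they tend to make longer patterns easier to read.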

It's also possible to match each month together with its day as a single string, leaving off the ordinal suffix.

In [49]:
regex = r"[A-Za-z]+ \d+"
bdays = re.findall(regex, birthdays)
for bday in bdays:
    print(bday)

June 11
December 13
September 21
May 12


There are many other ways to play with these `regex` patterns.  You will get many chances to do so in your homework.  For now, let's do an exercise.

<div class=exercise><b>Exercise</b></div>
* Open and read the file `shelterdogs.xml` into a string named `dogs`.  It should look like:

```
<?xml version="1.0" encoding="UTF-8"?>

<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```
* Write a regular expression to match the dog names.  That is, you want to match the name inside the name tag: `<name> dog_name </name>`.
  * **Hint:** Use a group.
* Print out each name.

Your output should be 
```python
Cloe
Karl
```

In [97]:
# your code here

# Open file
with open("data/shelterdogs.xml", "r") as f:
    dogs = f.read()

regex = r"<name> (.*) </name>" # regex to get the dog names
names = re.findall(regex, dogs) # find the names according to the regex
names

['Cloe', 'Karl']
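The pattern above works because each `<name>` tag sits on its own line and `.` does not match a newline.  If two tags shared a line, the *greedy* `.*` would grab too much.  The sketch below shows the difference on a made-up one-line string, using the non-greedy form `.*?`.

```python
import re

one_line = "<name> Cloe </name><name> Karl </name>"

greedy = re.findall(r"<name> (.*) </name>", one_line)
lazy = re.findall(r"<name> (.*?) </name>", one_line)

print(greedy)   # ['Cloe </name><name> Karl']
print(lazy)     # ['Cloe', 'Karl']
```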


<div class=exercise><b>Exercise</b></div>
Although you successfully completed the previous exercise, you think it would have been nicer to strip out the first two lines of the `dogs` string.  Do that now.

**Hints:**
* The first line has some special characters in it (e.g. `?`, `"`, `\n`).  You can escape these by using a backslash. For example, `\?` treats `?` like a real question mark.  Otherwise it marks the preceding element as *optional* in regular expressions.
* Consider using `[\n]+` to deal with the end of line character.

Your output should be:
```
<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```

In [98]:
# your code here

regex = r"<\?xml version=\"1.0\" encoding=\"UTF-8\"\?>[\n]+"
start = re.search(regex, dogs).end()
dogs = dogs[start:]
print(dogs)

<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
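As an alternative to slicing, `re.sub` can replace the matched header with an empty string in one step.  This is only a sketch: `sample` is a shortened stand-in for the file contents, and the looser pattern `[^>]*` is our own simplification of the fully escaped header.

```python
import re

# A shortened stand-in for the contents of shelterdogs.xml (assumed for this sketch).
sample = '<?xml version="1.0" encoding="UTF-8"?>\n\n<dogshelter>\n</dogshelter>\n'

# ^ anchors at the start of the string; \s* swallows the blank lines that follow.
body = re.sub(r"^<\?xml[^>]*\?>\s*", "", sample)
print(body)
```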



This ends the `I/O` introduction.  We've discussed the following:
* How to read and write data using straight `python`.
* How to process text data using `python`'s built-in string methods.
* How to read and write `JSON` files.
* How to use regular expressions to process text data.

All of this was fine for the examples that we've done so far.  However, we are interested in working with very complicated text strings (possibly from log files and websites) and messy data.  Regular expressions will still be useful, but there are other tools available to make our lives easier.

First, we will introduce the `python` library `pandas` for working with complicated data types.  After that, we'll introduce the *BeautifulSoup* `python` library for reading and parsing data from websites.

## Part 3:  Introduction to Pandas

We'd like a data structure that can easily store variables of different types, that stores column names, and that we can reference by column name as well as by indexed position.  And it would be nice if this data structure came with built-in functions that we can use to manipulate it. 

`Pandas` is a package/library that does all of this!  The library is built on top of `numpy` (which we haven't introduced yet).  

There are two basic `pandas` objects, *series* and *dataframes*, which can be thought of as enhanced versions of 1D and 2D `numpy` arrays, respectively.  

For reference, here is a useful `pandas` [cheatsheet](https://drive.google.com/folderview?id=0ByIrJAE4KMTtaGhRcXkxNHhmY2M&usp=sharing) and the `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/).

In [52]:
import pandas as pd

### Importing data

Now let's read in some automobile data as a pandas *dataframe* structure.  

In [53]:
# Read in the csv files
dfcars=pd.read_csv("data/mtcars.csv")

# Display the header and the first five rows of data
dfcars.head()

,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


Wow!  That was easy and the output looks very nice.  What we have now is a spreadsheet with indexed rows and named columns, called a *dataframe* in pandas.  `dfcars` is an instance of the `pd.DataFrame` class, created by calling the `pd.read_csv` function, which then calls the DataFrame constructor inside of it. If the last sentence is confusing, don't worry, it will become clearer later.  The take-away is that `dfcars` is a dataframe object, and it has methods (functions) belonging to it. For example, `dfcars.head()` is a method that shows the first 5 rows of the dataframe.

A pandas dataframe is a set of columns pasted together into a spreadsheet, as shown in the schematic below.  The columns in `pandas` are called *series* objects.

![](images/pandastruct.png)

Initial data exploration is as simple as a one-liner.

In [54]:
dfcars.describe()

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.090625,6.1875,230.721875,146.6875,3.596563,3.21725,17.84875,0.4375,0.40625,3.6875,2.8125
std,6.026948,1.785922,123.938694,68.562868,0.534679,0.978457,1.786943,0.504016,0.498991,0.737804,1.6152
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.58125,16.8925,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0


That's about as simple as you could ever ask for.

Returning to the `dfcars` dataframe, we notice that the first column has a bad name: "Unnamed: 0". Let's **clean** it up. 

In [55]:
dfcars=dfcars.rename(columns={"Unnamed: 0":"car name"})
dfcars.head()

,car name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


### Dataframes and Series

Now that we have our automobile data loaded as a dataframe, we'd like to be able to manipulate it and its series, say by calculating statistics and plotting distributions of features.  Fortunately, like arrays and other containers, dataframes and series are listy, so we can apply the list operations we already know to these new containers.  Below we explore our dataframe and its properties.

#### set length

The attribute `shape` tells us the dimensions of the dataframe as a tuple, `(rows, columns)`.  Somewhat surprisingly, the `len` function returns the number of rows in the dataframe, not the number of columns as we might expect given that dataframes are built up from pandas series (columns).  This turns out to be the more useful convention, which is probably why the developers of `pandas` chose it.

In [56]:
print(dfcars.shape)     # 12 columns, each of length 32
print(len(dfcars))      # the number of rows in the dataframe, also the length of a series
print(len(dfcars.mpg))  # the length of a series

(32, 12)
32
32


#### iteration via loops

One consequence of the column-wise construction of dataframes is that iterating directly over a dataframe does not give you its rows.  Instead, it yields the column names, as the for loop below shows.

In [57]:
for ele in dfcars: # iterating over a dataframe yields its column names, like a dictionary's keys
    print(ele)

car name
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb


Or we can call the attribute `columns`.  Notice the `Index` in the output below. We'll return to this shortly. 

In [58]:
dfcars.columns

Index(['car name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs',
       'am', 'gear', 'carb'],
      dtype='object')

We can iterate series in the same way that we iterate lists. Here we print out the number of cylinders for each of the 32 vehicles.  However, you shouldn't do this in general.  Try to use the built-in `pandas` methods.

In [59]:
for ele in dfcars.cyl:
    print(ele)

6
6
4
6
8
6
8
4
4
6
6
8
8
8
8
8
8
4
4
4
4
8
8
8
8
4
4
4
8
6
8
4


How do you iterate over rows?  Dataframes are put together column-by-column and you should be able to write code which never requires iteration over rows. But if you still find a need to iterate over rows, you can do it using `itertuples`.  See the documentation.  

**In general direct iteration through pandas series/dataframes is a bad idea.**

Instead, you should manipulate dataframes and series with `pandas` methods which are written to be very fast (i.e. they access series and dataframes at the `C` level).
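To make that advice concrete, here is a small sketch that counts the six-cylinder cars both ways.  The series `cyl` is a five-value stand-in typed in by hand, not the real column.

```python
import pandas as pd

# Toy stand-in for the cyl column (values assumed for illustration).
cyl = pd.Series([6, 6, 4, 6, 8], name="cyl")

# Loop version: works, but every step runs in the Python interpreter.
count_loop = 0
for value in cyl:
    if value == 6:
        count_loop += 1

# Vectorized version: the comparison and the sum both run inside pandas.
count_vectorized = int((cyl == 6).sum())

print(count_loop, count_vectorized)   # 3 3
print(cyl.mean())                     # 6.0
```

Both versions give the same answer; the vectorized one is shorter and scales much better to large series.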

#### slicing

Let's see how indexing works in dataframes.  Like lists in `python`, dataframes and series are zero-indexed.

In [60]:
dfcars.head()

,car name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [61]:
# index for the dataframe
print(list(dfcars.index))

# index for the cyl series
dfcars.cyl.index

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]


RangeIndex(start=0, stop=32, step=1)

There are two ways to index dataframes:
1. the `loc` property indexes by label name
2. the `iloc` property indexes by integer position.

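We'll illustrate this with a slightly modified version of `dfcars`, created by calling `reindex` with row labels running from $5$ to $36$.  Note that `reindex` does not relabel the existing rows: it keeps the rows whose labels appear in the new index ($5$ through $31$), drops the rest ($0$ through $4$), and fills labels that have no data ($32$ through $36$) with `NaN`.  The `NaN` entries are also why the integer columns are displayed as floats below.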

In [62]:
# create values from 5 to 36
new_index = [i+5 for i in range(32)]

# new dataframe with indexed rows from 5 to 36
dfcars_reindex = dfcars.reindex(new_index)
dfcars_reindex.head()

,car name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
5,Valiant,18.1,6.0,225.0,105.0,2.76,3.46,20.22,1.0,0.0,3.0,1.0
6,Duster 360,14.3,8.0,360.0,245.0,3.21,3.57,15.84,0.0,0.0,3.0,4.0
7,Merc 240D,24.4,4.0,146.7,62.0,3.69,3.19,20.0,1.0,0.0,4.0,2.0
8,Merc 230,22.8,4.0,140.8,95.0,3.92,3.15,22.9,1.0,0.0,4.0,2.0
9,Merc 280,19.2,6.0,167.6,123.0,3.92,3.44,18.3,1.0,0.0,4.0,4.0
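It is easy to mix up `reindex` with relabeling, so here is a tiny sketch of the difference.  The dataframe `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.7]})   # row labels 0, 1, 2

# reindex selects rows by label; label 3 has no data, so it becomes NaN.
selected = toy.reindex([1, 2, 3])
print(selected)

# Assigning to .index relabels the existing rows and keeps all the data.
relabeled = toy.copy()
relabeled.index = [1, 2, 3]
print(relabeled)
```

In `selected`, label 1 still holds the value that was at label 1 in `toy`; in `relabeled`, label 1 holds what used to be the first row.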


We now return the first three rows of `dfcars_reindex` in two different ways, first with `iloc` and then with `loc`. 

With `iloc` we use the command,

In [63]:
dfcars_reindex.iloc[0:3]

,car name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
5,Valiant,18.1,6.0,225.0,105.0,2.76,3.46,20.22,1.0,0.0,3.0,1.0
6,Duster 360,14.3,8.0,360.0,245.0,3.21,3.57,15.84,0.0,0.0,3.0,4.0
7,Merc 240D,24.4,4.0,146.7,62.0,3.69,3.19,20.0,1.0,0.0,4.0,2.0


since `iloc` uses the position in the index. Notice that the argument `0:3` with `iloc` returns the first three rows of the dataframe, which have label names 5, 6, and 7. 

To access the same rows with `loc`, we write,

In [64]:
dfcars_reindex.loc[0:7] # or dfcars_reindex.loc[5:7]

,car name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
5,Valiant,18.1,6.0,225.0,105.0,2.76,3.46,20.22,1.0,0.0,3.0,1.0
6,Duster 360,14.3,8.0,360.0,245.0,3.21,3.57,15.84,0.0,0.0,3.0,4.0
7,Merc 240D,24.4,4.0,146.7,62.0,3.69,3.19,20.0,1.0,0.0,4.0,2.0


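since `loc` indexes via the label name.  Note that, unlike `iloc`, a `loc` slice includes its end label, which is why the row labelled 7 is returned.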

Here's another example where we return three rows of `dfcars_reindex` that correspond to column attributes `mpg`, `cyl`, and `disp`.  First do it with `iloc`:

In [65]:
dfcars_reindex.iloc[2:5, 1:4]

,mpg,cyl,disp
7,24.4,4.0,146.7
8,22.8,4.0,140.8
9,19.2,6.0,167.6


Notice that the rows we're accessing, 2, 3, and 4, have label names 7, 8, and 9, and the columns we're accessing, 1, 2, and 3, have label names `mpg`, `cyl`, and `disp`.  So for both rows and columns, we're accessing elements of the dataframe using the integer position indices.  Now let's do it with `loc`:

In [66]:
dfcars_reindex.loc[7:9, ['mpg', 'cyl', 'disp']]

,mpg,cyl,disp
7,24.4,4.0,146.7
8,22.8,4.0,140.8
9,19.2,6.0,167.6


We don't have to remember that `disp` is the third column of the dataframe --- we can simply access it with `loc` using the label name `disp`. 

Generally we prefer `iloc` for indexing rows and `loc` for indexing columns. 
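Here is a compact sketch of the two indexers side by side.  The dataframe `toy` is made up; its row labels start at 5 so that labels and positions differ.

```python
import pandas as pd

toy = pd.DataFrame(
    {"mpg": [18.1, 14.3, 24.4, 22.8], "cyl": [6, 8, 4, 4]},
    index=[5, 6, 7, 8],
)

print(toy.iloc[0:2])        # positions 0 and 1 -> labels 5 and 6 (end excluded)
print(toy.loc[5:6])         # labels 5 and 6 (end included)
print(toy.loc[7, "mpg"])    # 24.4
print(toy.iloc[2, 0])       # 24.4
```

The two slices return the same rows, but for different reasons: the `iloc` slice stops before position 2, while the `loc` slice includes label 6.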

<div class="exercise"><b>Exercise</b></div>
In this exercise you'll examine the documentation to generate a toy dataframe from scratch.  Go to the documentation and click on "10 minutes to pandas" in the table of contents.  Then do the following:

>1.  Create a series called `column_1` with entries 0, 1, 2, 3.

>2.  Create a second series called `column_2` with entries 4, 5, 6, 7.

>3.  Glue these series into a dataframe called `table`, where the first and second labelled column of the dataframe are `column_1` and `column_2`, respectively.  In the dataframe, `column_1` should be indexed as `col_1` and `column_2` should be indexed as `col_2`.

>4. Oops!  You've changed your mind about the index labels for the columns.  Use `rename` to rename `col_1` as `Col_1` and `col_2` as `Col_2`.  

> *Stretch*: Can you figure out how to rename the row indexes?  Try to rename `0` as `zero`, `1` as `one`, and so on.

In [99]:
# your code here

column_1 = pd.Series(range(4)) # Q1
column_2 = pd.Series(range(4,8)) # Q2
table = pd.DataFrame({'col_1': column_1, 'col_2': column_2}) # Q3
table = table.rename(columns={"col_1": "Col_1", "col_2":"Col_2"}) # Q4

table = table.rename(index={0: "zero", 1: "one", 2: "two", 3: "three"}) # Stretch

table

,Col_1,Col_2
zero,0,4
one,1,5
two,2,6
three,3,7


### Reading `json` into `pandas` dataframe

Before moving on, there is one more convenient thing to discuss.

Hopefully you remember reading and writing data to `json` files from earlier in the lab.

Now that you're equipped with `pandas`, we can discuss reading `json` data into `pandas` dataframes!

Recall that we saved a `json` file earlier called `dog_shelter_info.txt`.  Let's load it up, convert it to a `pandas` dataframe, and take a look.

In [68]:
# Load dog shelter data
with open('dog_shelter_info.txt', 'r') as f:
    dog_data = json.load(f)

dog_data_json_str = json.dumps(dog_data) # Convert data to json string
df = pd.read_json(dog_data_json_str) # Convert to pandas dataframe
df.head() # Look at data

,dog1,dog2
age,3,7
breed,Border Collie,Beagle
name,Cloe,Karl
playgroup,Yes,Yes
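Notice that each dog ended up as a *column*.  If you would rather have one row per dog, you can transpose the dataframe.  The sketch below types the same nested structure in by hand rather than reading the file, so `dog_data` here is an assumption about what the file contains.

```python
import pandas as pd

# Same nested structure as the table above, typed in by hand for this sketch.
dog_data = {
    "dog1": {"name": "Cloe", "age": 3, "breed": "Border Collie", "playgroup": "Yes"},
    "dog2": {"name": "Karl", "age": 7, "breed": "Beagle", "playgroup": "Yes"},
}

df = pd.DataFrame(dog_data)   # outer keys become columns, inner keys become rows
dogs_by_row = df.T            # transpose so that each dog is one row
print(dogs_by_row)
```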


#### Recap

At this point, you have:
* Refreshed your `python` knowledge
* Learned about basic `python` I/O
* Learned basic text processing with `python` string methods
* Learned advanced text processing with regular expressions
* Learned a little bit about the `json` format
* Become proficient with `pandas`

The last part of this lab will focus on working with even uglier data formats.  Specifically, we will look at parsing data from a web page.  Fortunately, there is a wonderful library out there that makes life much easier in this regard.  It is called *BeautifulSoup*.

##  Part 4: Beautiful Soup 
Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference some dataset to another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

---

### `requests`:  Retrieving Data from the Web

In HW1, you will be asked to retrieve some data from the Internet. `Python`'s standard library has long included modules for exactly that (e.g. `urllib2` in Python 2 and `urllib` in Python 3), and there are third-party packages such as `urllib3` as well.

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckily, as with most tasks in `Python`, someone has developed a library that simplifies these tasks. In reality, the requests made both in this lab and in HW1 are fairly simple, and could easily be done using one of the built-in libraries. However, it is better to get acquainted with `requests` as soon as possible, since you will probably need it in the future.

In [69]:
# You tell Python that you want to use a library with the import statement.
import requests

Now that the `requests` library has been imported into our namespace, we can use the functions offered by it.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [70]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Python is an object-oriented language, and everything in it is an object. Even built-in functions such as `len` are just syntactic sugar: `len(x)` calls the object's `__len__()` method behind the scenes.

We will not dwell too long on OO concepts, but some of Python's idiosyncrasies will be easier to understand if we spend a few minutes on this subject.

When you evaluate an object by itself in a notebook cell, such as the `req` object we created above, Python automatically calls the object's `__repr__()` method to display it (`print` uses `__str__()`, which falls back to `__repr__()` if it is not defined). The default implementations of these methods are usually very simple and boring. The `req` object, however, has a custom implementation that shows the object type (i.e. `Response`) and the HTTP status code (200 means the request was successful).

In [71]:
req

<Response [200]>
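To see how these special methods work, here is a toy class of our own.  `Shelter` is hypothetical and has nothing to do with `requests`; it only mimics the style of the `<Response [200]>` display.

```python
class Shelter:
    """Toy class (hypothetical, for illustration only)."""

    def __init__(self, dogs):
        self.dogs = list(dogs)

    def __len__(self):
        # len(shelter) calls this method behind the scenes.
        return len(self.dogs)

    def __repr__(self):
        # Shown when the object is evaluated at the end of a notebook cell.
        return "<Shelter [{} dogs]>".format(len(self))


shelter = Shelter(["Cloe", "Karl"])
print(len(shelter))    # 2
print(repr(shelter))   # <Shelter [2 dogs]>
print(shelter)         # print falls back to __repr__ when __str__ is not defined
```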

Just to confirm, we will call the `type` function on the object to make sure it agrees with the value above.

In [72]:
type(req)

requests.models.Response

Another very nifty Python function is `dir`. You can use it to list all the properties of an object.

By the way, properties starting with a single or double underscore are usually not meant to be called directly.

In [73]:
dir(req)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
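That is a long list.  If you only care about the names that are meant for public use, a list comprehension can filter out everything that starts with an underscore.  The sketch below uses a plain string as the object so that it runs without a network connection; the same idea applies to `req`.

```python
# Works on any object; a plain string is used here so no network call is needed.
obj = "Harvard"

public_names = [name for name in dir(obj) if not name.startswith("_")]
print(public_names[:5])

# Every name in the list can be looked up on the object with getattr.
print(callable(getattr(obj, "upper")))   # True
```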

Right now `req` holds a reference to a *Response* object; but we are interested in the text associated with the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Response` object to a variable.

In [74]:
page = req.text
page[20000:30000]

'alize admissions after the war. The undergraduate college became coeducational after its 1977 merger with <a href="/wiki/Radcliffe_College" title="Radcliffe College">Radcliffe College</a>.</p>\n<p>The university is organized into eleven separate academic units—ten faculties and the <a href="/wiki/Radcliffe_Institute_for_Advanced_Study" title="Radcliffe Institute for Advanced Study">Radcliffe Institute for Advanced Study</a>—with campuses throughout the <a href="/wiki/Greater_Boston" title="Greater Boston">Boston metropolitan area</a>:<sup id="cite_ref-14" class="reference"><a href="#cite_note-14">[14]</a></sup> its 209-acre (85&#160;ha) main campus is centered on <a href="/wiki/Harvard_Yard" title="Harvard Yard">Harvard Yard</a> in Cambridge, approximately 3 miles (5&#160;km) northwest of Boston; the <a href="/wiki/Harvard_Business_School" title="Harvard Business School">business school</a> and athletics facilities, including <a href="/wiki/Harvard_Stadium" title="Harvard Stadium">Har

Great! Now we have the text of the Harvard University Wikipedia page. But this mess of HTML tags would be a pain to parse manually, which is why we will use another very cool Python library called `BeautifulSoup`.

### `BeautifulSoup`

Parsing data would be a breeze if we could always use well formatted data sources, such as CSV, JSON, or XML; but some formats such as HTML are both very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the `import` statement below is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function by its full name (like we did with `requests.get`), but you also don't want to import every single thing from that library into your namespace.

In [75]:
from bs4 import BeautifulSoup

`BeautifulSoup` can deal with `HTML` or `XML` data, so the next line parses the contents of the `page` variable using its `HTML` parser, and assigns the result of that to the `soup` variable.

In [76]:
soup = BeautifulSoup(page, 'html.parser')

In [77]:
type(soup)

bs4.BeautifulSoup

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [78]:
type(page)

str

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the `HTML` content in a nice, indented way.

In [79]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":848045131,"wgRevisionId":848045131,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from March 2018","All articles containing potentially dated statements","CS1: Julian–Gregorian uncertainty","CS1 maint: Extra text: editors list","Webarchive template wayback links","Wikipedia indefinitely move-protected pages","Use md

Looks like it's our page!

We can now reference elements of the `HTML` document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [80]:
soup.title

<title>Harvard University - Wikipedia</title>

This is nice for `HTML` elements that only appear once per page, such as the `title` tag. But what about elements that can appear multiple times?

In [81]:
# Be careful with elements that show up multiple times.
soup.p

<p><b>Harvard University</b> is a private <a href="/wiki/Ivy_League" title="Ivy League">Ivy League</a> <a href="/wiki/Research_university" title="Research university">research university</a> in <a href="/wiki/Cambridge,_Massachusetts" title="Cambridge, Massachusetts">Cambridge</a>, Massachusetts. Established in 1636 and named for its first benefactor clergyman <a href="/wiki/John_Harvard_(clergyman)" title="John Harvard (clergyman)">John Harvard</a>, Harvard is the <a class="mw-redirect" href="/wiki/Colonial_Colleges" title="Colonial Colleges">United States' oldest institution of higher learning</a>,<sup class="reference" id="cite_ref-9"><a href="#cite_note-9">[9]</a></sup> and its history, influence, and wealth have made it one of the world's most prestigious universities.<sup class="reference" id="cite_ref-10"><a href="#cite_note-10">[10]</a></sup> The <a class="mw-redirect" href="/wiki/Harvard_Corporation" title="Harvard Corporation">Harvard Corporation</a> is its first chartered <a

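Uh oh. The dot notation only returned the *first* `p` element on the page. It turns out that the attribute syntax in `BeautifulSoup` is what is called *syntactic sugar*: `soup.p` is shorthand for `soup.find("p")`. That's why it is safer to use the explicit commands behind that syntactic sugar. These are: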
* `BeautifulSoup.find` for getting single elements, and 
* `BeautifulSoup.find_all` for retrieving multiple elements.

In [82]:
len(soup.find_all("p"))

75
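The difference between the dot notation, `find`, and `find_all` is easier to see on a tiny page.  The `html` string below is hand-written for this sketch, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, so the sketch runs without a network connection.
html = "<html><head><title>Demo</title></head><body><p>first</p><p>second</p></body></html>"
toy_soup = BeautifulSoup(html, "html.parser")

print(toy_soup.p.get_text())                            # first
print(toy_soup.find("p").get_text())                    # first
print([p.get_text() for p in toy_soup.find_all("p")])   # ['first', 'second']
print(toy_soup.find("table"))                           # None
```

Note the last line: when nothing matches, `find` returns `None`, much like `re.search` did earlier in the lab.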

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the `HTML` attributes that will be very useful to us is the `class` attribute.

Getting the class of a single element is easy!

In [83]:
soup.table["class"]

['infobox', 'vcard']

Next we will use a *list comprehension* to see all the tables that have a `class` attribute. 

In [84]:
# the classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

[['infobox', 'vcard'],
 ['toccolours'],
 ['plainlinks', 'metadata', 'ambox', 'mbox-small-left', 'ambox-content'],
 ['multicol'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed'],
 ['wikitable'],
 ['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'hlist', 'collapsible', 'autocollapse', 'navbox-inner'],
 [

As already mentioned, we will be using the Demographics table for this lab. The next cell contains the `HTML` elements of said table. We will render it in different parts of the notebook to make it easier to follow along with the parsing steps.

In [85]:
table_demographics = soup.find_all("table", "wikitable")[2]
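
The second positional argument to `find_all` filters on the CSS class. The sketch below shows this on made-up markup; `snippet`, `demo_soup`, and the table contents are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up markup with three tables; only two carry the "wikitable" class.
snippet = """
<table class="infobox vcard"><tr><td>a</td></tr></table>
<table class="wikitable sortable"><tr><td>b</td></tr></table>
<table class="wikitable"><tr><td>c</td></tr></table>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# A string as the second positional argument filters on the CSS class.
# A table matches if "wikitable" is any one of its classes.
wikitables = demo_soup.find_all("table", "wikitable")

# The keyword form is equivalent; class_ avoids clashing with Python's `class` keyword.
wikitables_kw = demo_soup.find_all("table", class_="wikitable")

wikitable_texts = [t.get_text(strip=True) for t in wikitables]
wikitable_texts_kw = [t.get_text(strip=True) for t in wikitables_kw]
print(wikitable_texts)
```

Note that picking a table by position, as with `[2]` above, depends on the page layout at the time of scraping, so it may need adjusting if the page changes.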

In [86]:
from IPython.display import HTML
HTML(str(table_demographics))

| | Undergrad | Graduate | U.S. census |
|---|---|---|---|
| Asian/Pacific Islander | 17% | 11% | 5% |
| Black/non-Hispanic | 6% | 4% | 12% |
| Hispanics of any race | 9% | 5% | 16% |
| White/non-Hispanic | 46% | 43% | 64% |
| Mixed race/other | 10% | 8% | 9% |
| International students | 11% | 27% | N/A |


First we'll use a list comprehension to extract the row (*tr*) elements.

In [87]:
rows = [row for row in table_demographics.find_all("tr")]
print(rows)

[<tr>
<th></th>
<th>Undergrad</th>
<th>Graduate</th>
<th>U.S. census</th>
</tr>, <tr>
<th>Asian/Pacific Islander</th>
<td>17%</td>
<td>11%</td>
<td>5%</td>
</tr>, <tr>
<th>Black/non-Hispanic</th>
<td>6%</td>
<td>4%</td>
<td>12%</td>
</tr>, <tr>
<th>Hispanics of any race</th>
<td>9%</td>
<td>5%</td>
<td>16%</td>
</tr>, <tr>
<th>White/non-Hispanic</th>
<td>46%</td>
<td>43%</td>
<td>64%</td>
</tr>, <tr>
<th>Mixed race/other</th>
<td>10%</td>
<td>8%</td>
<td>9%</td>
</tr>, <tr>
<th>International students</th>
<td>11%</td>
<td>27%</td>
<td>N/A</td>
</tr>]


In [88]:
header_row = rows[0]
HTML(str(header_row))

We will then use a `lambda` expression to replace newline characters with spaces. `Lambda` expressions are to functions what list comprehensions are to loops: namely, a more concise way to achieve the same thing.

In reality, both lambda expressions and list comprehensions are a little different from their function and loop counterparts. But for the purposes of this class we can ignore those differences.

In [89]:
# A lambda expression returns the value of the expression inside it.
# In this case, it will return a string with newline characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")
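
For comparison, here is a small sketch of the same logic written as a named function. `rem_nl_def` and `sample` are names made up for this example.

```python
# The lambda from the cell above, repeated so this sketch is self-contained.
rem_nl = lambda s: s.replace("\n", " ")

# An equivalent named function (rem_nl_def is a name made up for this example).
def rem_nl_def(s):
    """Return s with newline characters replaced by spaces."""
    return s.replace("\n", " ")

sample = "U.S.\ncensus"
print(rem_nl(sample))
print(rem_nl(sample) == rem_nl_def(sample))
```

Both forms behave the same here; the `def` version has the advantage of carrying a docstring.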

#### Splitting the data
Next we extract the text value of the columns. If you look at the table above, you'll see that we have three columns and six rows.

Here we're doing the following:
* Taking the first row (`Python` indices start at zero)
* Iterating over the *th* elements inside it
* Taking the text value of those elements

We should end up with a list of column names.

But there is one little caveat: the first header cell of the table (the one right above the row names) is actually empty. We could add it to our list and then remove it afterwards, but instead we will use the `if` clause inside the list comprehension to filter it out.

In the following cell, `get_text` will return an empty string for the first cell of the table. An empty string is falsy in `Python`, which means that the test will fail and the value will not be added to the list.

In [90]:
# the if col.get_text() takes care of no-text in the upper left
columns = [rem_nl(col.get_text()) for col in header_row.find_all("th") if col.get_text()]
columns

['Undergrad', 'Graduate', 'U.S. census']
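
The filtering step can be sketched without any `HTML` at all. Below, `header_texts` is a hypothetical stand-in for the strings that `col.get_text()` would return for each *th* element.

```python
# Hypothetical header texts, standing in for col.get_text() on each <th>.
header_texts = ["", "Undergrad", "Graduate", "U.S. census"]

# Comprehension with a filter: empty strings are falsy, so they are skipped.
columns_comprehension = [text for text in header_texts if text]

# The same logic as an explicit loop.
columns_loop = []
for text in header_texts:
    if text:
        columns_loop.append(text)

print(columns_comprehension)
print(columns_comprehension == columns_loop)
```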

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row.

In [91]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

['Asian/Pacific Islander',
 'Black/non-Hispanic',
 'Hispanics of any race',
 'White/non-Hispanic',
 'Mixed race/other',
 'International students']

Now we want to transform the strings in the cells to integers.  To do this, we follow a very common `python` pattern:
1. Check if the last character of the string is a percent sign
2. If it is, then convert the characters before the percent sign to integers
3. If one of the prior checks fails, return a value of `None`

These steps can be conveniently packaged into a function using `if-else` statements.

In [92]:
def to_num(s):
    if s[-1] == "%":
        return int(s[:-1])
    else:
        return None

Notice that `Python` slices exclude the upper bound. So the `[:-1]` construct will return all elements of the string, except for the last.

Another nice way to write our `to_num` function would be
```python
def to_num(s):
    return int(s[:-1]) if s[-1] == "%" else None
```
Notice that we only had to write `return` one time and everything conveniently fits on one line.  I'll leave it up to you to decide if it's readable or not.
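
One thing to keep in mind is that `s[-1]` raises an `IndexError` on an empty string. A slightly more defensive sketch is shown below; `to_num_safe` is a hypothetical variant, not part of the lab, and it assumes that surrounding whitespace should be ignored.

```python
def to_num_safe(s):
    """Convert a string such as '17%' to the integer 17, or return None.

    to_num_safe is a hypothetical variant of to_num, not part of the lab.
    """
    s = s.strip()  # drop surrounding whitespace and newlines
    # endswith() is safe on empty strings; isdigit() guards the int() call.
    if s.endswith("%") and s[:-1].isdigit():
        return int(s[:-1])
    return None

print(to_num_safe("17%"), to_num_safe(" 5%\n"), to_num_safe("N/A"), to_num_safe(""))
```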

Now we use the `to_num` function in a list comprehension to parse the table values.

Notice that we have two `for ... in ...` in this list comprehension. That is perfectly valid and somewhat common.

Although there is no real limit to how many iterations you can perform at once, having more than two can be visually unpleasant, at which point either regular nested loops or saving intermediate comprehensions might be a better solution.

In [93]:
values = [to_num(value.get_text()) for row in rows[1:] for value in row.find_all("td")]
values

[17, 11, 5, 6, 4, 12, 9, 5, 16, 46, 43, 64, 10, 8, 9, 11, 27, None]
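
To see how the two `for` clauses are ordered, here is a small sketch on plain lists. `demo_rows` is a hypothetical stand-in for the *td* texts of two table rows.

```python
# Hypothetical stand-in for the parsed table: each inner list plays the role
# of the <td> texts found in one row.
demo_rows = [["17%", "11%", "5%"], ["6%", "4%", "12%"]]

# Two `for` clauses in one comprehension: the leftmost is the outer loop.
flat_comprehension = [cell for row in demo_rows for cell in row]

# Equivalent nested loops.
flat_loop = []
for row in demo_rows:
    for cell in row:
        flat_loop.append(cell)

print(flat_comprehension)
print(flat_comprehension == flat_loop)
```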

The problem with the list above is that the values lost their grouping.

The `zip` function combines two or more sequences element-wise. In `Python` 3 it returns a lazy iterator, so we wrap it in `list` to see the result. So
```python
list(zip([1,2,3], [4,5,6]))
```
would return
```python
[(1, 4), (2, 5), (3, 6)]
```

Next we create three lists corresponding to the three columns by taking every third value, starting at offsets 0, 1, and 2.

In [94]:
stacked_values_lists = [values[i::3] for i in range(len(columns))]
stacked_values_lists

[[17, 6, 9, 46, 10, 11], [11, 4, 5, 43, 8, 27], [5, 12, 16, 64, 9, None]]
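
The slice `values[i::3]` may be easier to follow on a shorter list. The sketch below uses the first six values of the table; `demo_values` and `n_columns` are names introduced only for this illustration.

```python
# First six values from the flattened table, in row-major order.
demo_values = [17, 11, 5, 6, 4, 12]
n_columns = 3

# demo_values[i::n] starts at index i and takes every n-th element after it,
# so each slice collects one column.
demo_columns = [demo_values[i::n_columns] for i in range(n_columns)]
print(demo_columns)

# Alternative: consecutive chunks of length n give one row each.
demo_rows = [demo_values[i:i + n_columns] for i in range(0, len(demo_values), n_columns)]
print(demo_rows)
```

The second form recovers the rows directly, which is another way to restore the grouping.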

We then use `zip`. 

In [95]:
stacked_values = zip(*stacked_values_lists)
list(stacked_values)

[(17, 11, 5), (6, 4, 12), (9, 5, 16), (46, 43, 64), (10, 8, 9), (11, 27, None)]

Notice the use of the `*` in front: that unpacks the list of lists into separate arguments to `zip`. See the ASIDE below.
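
As a small sketch of what the `*` does, the cell below compares the unpacked call with the fully written-out one. `demo_columns` is a shortened, made-up version of the lists above.

```python
demo_columns = [[17, 6], [11, 4], [5, 12]]

# zip(*demo_columns) is the same call as
# zip(demo_columns[0], demo_columns[1], demo_columns[2]).
unpacked = list(zip(*demo_columns))
explicit = list(zip(demo_columns[0], demo_columns[1], demo_columns[2]))
print(unpacked)
print(unpacked == explicit)

# In Python 3, zip returns a lazy iterator that is exhausted after one pass.
lazy = zip(*demo_columns)
first_pass = list(lazy)
second_pass = list(lazy)
print(second_pass)
```

The second pass is empty, which is why it helps to convert the result to a list if it will be used more than once.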

In [96]:
# Here's the original HTML table for visual understanding
HTML(str(table_demographics))

| | Undergrad | Graduate | U.S. census |
|---|---|---|---|
| Asian/Pacific Islander | 17% | 11% | 5% |
| Black/non-Hispanic | 6% | 4% | 12% |
| Hispanics of any race | 9% | 5% | 16% |
| White/non-Hispanic | 46% | 43% | 64% |
| Mixed race/other | 10% | 8% | 9% |
| International students | 11% | 27% | N/A |


---

---

#### Answers to data structures from the beginning of lab
- the names of everyone in CS 109?
    - a list (or a tuple, if the roster won't change)
- a list of all the words in the English language (presuming we'll want to search and decide if a particular user-provided word is one of the correctly-spelled words)?
    - a set! (A list would be very slow to search, and a dictionary mapping each word to "yes" or "no" would be wasteful. We simply want to know whether the user's word is or isn't among the valid words.)
- the github username associated with each student's name?
    - a dictionary (we're mapping strings to other objects)
- the name associated with each (sequential) student ID?
    - a list or tuple. Because the IDs are sequential, a list (which maps indexes 0,1,2,... to its contents) is a good choice. A dictionary could be slower but also suffices.
- the nickname and github username associated with each student ID?
    - None of the above, directly. However, we could set up a list or dictionary where we provide a student ID and get back a two-element list, where element 0 is the nickname and element 1 is the username.
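
The answers above can be sketched in a few lines. All names, IDs, and usernames below are invented for illustration.

```python
# All names, IDs, and usernames below are invented for illustration.
roster = ["Ada", "Grace", "Alan"]                      # list: ordered names
valid_words = {"apple", "banana", "cherry"}            # set: fast membership tests
github_by_name = {"Ada": "ada-l", "Grace": "ghopper"}  # dict: name -> username

# Sequential student IDs 0, 1, 2 map naturally onto list indexes;
# each entry is [nickname, github_username].
info_by_id = [["Addie", "ada-l"], ["Gracie", "ghopper"], ["Al", "aturing"]]

print("banana" in valid_words)
print(github_by_name["Grace"])
nickname, username = info_by_id[2]
print(nickname, username)
```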
    
In order to do well in the course, you should score _at least_ 3/5, and preferably a 4/5.