# Week 3 - More on Tidy Data and Related Python Topics

Week 3 reading: **Pandas for Everyone** chapter 6 (pages 124 - 141)

Outline:
1. Review of Tidy Data criteria
2. Python string tutorial (supporting ch. 6.3.1)
3. Python iterator tutorial (supporting ch. 6.3.3)
4. Python zip tutorial (supporting ch. 6.3.3)
5. Python tuple tutorial (supporting ch. 6.3.3)
6. Using the requests module to gather multi-file data (alternative to ch. 6.6 method)
    1. Wildcard and glob tutorial (supporting ch. 6.6)

## 1. Review of Tidy Data criteria

Last week we briefly discussed Hadley Wickham's 2014 article in the *Journal of Statistical Software* and what he termed "tidy data" as a framework for organizing data for analysis. The paper defines *tidy data* as fitting the following criteria:

* Each row is an observation
* Each column is a variable
* Each type of observational unit forms a table

Chapter 6 of **Pandas for Everyone** describes several ways that data, depending on how it was collected or displayed, can fail to meet the tidy criteria. One example is the "wide" data format shown in section 6.2.1, which the book "melts" into a longer, more analysis-friendly format.

Section 6.2.1 also shows how to name the melted columns, which makes the data easier to identify and simplifies later analysis.
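
As a minimal sketch of that naming step, the example below melts a small, made-up wide table with `pd.melt()`. The data values and the column labels `income` and `count` are illustrative assumptions, not the book's data set; `var_name` and `value_name` are the `melt` parameters that label the two new columns.

```python
import pandas as pd

# Hypothetical wide-format data: one column per income bracket.
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Keep "religion" as an identifier and stack the other columns into rows.
tidy = pd.melt(
    wide,
    id_vars=["religion"],
    var_name="income",     # label for the column holding the old column names
    value_name="count",    # label for the column holding the old cell values
)

print(tidy.columns.tolist())   # ['religion', 'income', 'count']
print(tidy)
```

Each row of `tidy` is now a single observation: one religion, one income bracket, one count.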




<hr>

Because it is relatively easy to load the data and follow along with the book, the rest of this week's material provides tutorials that explain or extend the book's techniques or, in some cases, show alternative techniques for reasons given in the relevant section.

## 2. Python string tutorial (supporting ch. 6.3.1)

**Section 6.3.1** describes splitting columns based on some delimiter, such as an underscore (\_) or a comma (,). Let's take a look at how Python sees and manipulates strings to get a better understanding of what is going on. 

We saw in **week 1** how a Python list is a container for objects, each object individually addressable with an index. You may recall we had a list of names and could display any one of them:

In [7]:
names = ["Bob","Cheryl","Dave","Jenny", "Waldo"]
names[1]

'Cheryl'

Python treats a string much like a list of characters, so each character can be addressed by its index. For example:

In [8]:
py_string = 'I love Python'
py_string[4]

'v'

Strings can be *concatenated*, or added to, with a simple "+" operation:

In [9]:
py_string + ' programming!'

'I love Python programming!'
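
One caveat worth a quick sketch: `+` only joins a string to another string. To combine a string with a number, convert the number with `str()` or use an f-string (the `f'...'` form that appears in the download code later this week). The variable `week` below is just an illustrative name.

```python
py_string = 'I love Python'
week = 3  # hypothetical number to combine with the string

# py_string + week would raise a TypeError, because week is an int.
with_str = py_string + ' in week ' + str(week)

# An f-string inserts the value of each expression inside the braces.
with_fstring = f'{py_string} in week {week}'

print(with_str)
print(with_fstring)
```

Both approaches produce the same text; f-strings tend to be easier to read once several values are involved.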

We can also do some nifty things like counting characters, slicing, and reversing:

In [10]:
print( py_string.count('o') )  # Count the number of 'o' characters

print( py_string[2:6] ) # Show the characters at indexes 2 through 5

print( py_string[::-1] ) # Display string backwards (with a slicing trick)

2
love
nohtyP evol I
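
The slice notation generalizes to `[start:stop:step]`, where the character at `stop` is excluded and any of the three parts can be left out. A few more variations, as a sketch using the same `py_string`:

```python
py_string = 'I love Python'

first_word = py_string[:1]     # from the beginning up to, not including, index 1
last_word = py_string[-6:]     # negative indexes count back from the end
every_other = py_string[::2]   # a step of 2 keeps every second character

print(first_word)    # I
print(last_word)     # Python
print(every_other)   # Ilv yhn
```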


We can even split the string on some character of choice. This will give us back a **list**.

In [11]:
# Split string at each space character.
# Notice the space is NOT included in the list.

py_string.split(' ')

['I', 'love', 'Python']
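
The same method handles the kind of underscore-delimited names that section 6.3.1 works with. The sketch below uses made-up names as stand-ins; the optional second argument to `split()` limits how many splits are performed.

```python
column_name = 'Cases_Guinea'   # hypothetical underscore-delimited name

parts = column_name.split('_')
print(parts)       # ['Cases', 'Guinea']
print(parts[0])    # Cases
print(parts[1])    # Guinea

# With a limit of 1, only the first underscore is used as a split point.
longer_name = 'Deaths_Sierra_Leone'   # hypothetical name with two underscores
limited = longer_name.split('_', 1)
print(limited)     # ['Deaths', 'Sierra_Leone']
```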

Lists of strings can also be re-joined with your choice of separating character. This is one simple way to build a line of a CSV file.

In [12]:
py_list = py_string.split(' ')

','.join(py_list)

'I,love,Python'

The join syntax deserves a little explanation. In Python, everything is an object, and objects have functions (called methods) that operate on the object. Even a literal string like `','` is an object, and its `join()` method says to use that string (the ',' in this case) to separate the strings in the list. We could have put the list back together with spaces like so:

In [13]:
' '.join(py_list) # Notice there is a <space> character between the single quotes.

'I love Python'

In [14]:
# Double quotes work, too

" ".join(py_list)

'I love Python'
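
Since CSV came up, here is a rough sketch of building a few CSV lines by hand. `join()` only accepts strings, so numbers need converting first; the rows below are invented for illustration. For real files with quoting or embedded commas, Python's built-in `csv` module is the safer choice.

```python
rows = [
    ['name', 'score'],   # hypothetical header row
    ['Bob', 90],
    ['Cheryl', 85],
]

# Convert every item to a string, then join each row with commas.
lines = [','.join(str(item) for item in row) for row in rows]

# Join the rows with newline characters to get the full CSV text.
csv_text = '\n'.join(lines)
print(csv_text)
```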

This demonstration only scratches the surface of what can be done with strings. We will discuss more properties of strings in the sections on iterators.

## 3. Python iterator tutorial (supporting ch. 6.3.3)

Another great feature of lists is that they function as containers that we can step through (called *iteration*). 

In Python, the `for` loop was made for iterating over containers like lists:

In [15]:
for name in names:
    print(name)

Bob
Cheryl
Dave
Jenny
Waldo


The general syntax for *iterating* with a `for` loop is:

```
for <item variable> in <container variable>:
    use item variable
```
Really, the Python `for` loop functions like a *for_each* loop in other languages. In fact, thinking about `for` loops is easier if you conceive of them like this:

```
for-each <item> in <container> do:
    operation using item
```

Let's test that out. We can make each name lower case. **Conceptually**:

```
for-each name in names do:
    print(name.lower())
```

Actual Python:

In [16]:
# name is the item and names is the container
for name in names:       
    print(name.lower())

bob
cheryl
dave
jenny
waldo


Formally:

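* __iteration__ : The action of using a loop to apply some code to every item in a container.
* __iterable__ : The container holding the items of interest (anything a `for` loop can step through, such as a list).
* __iterator__ : The object that hands out the container's items one at a time. The `for` loop creates it behind the scenes and assigns each item in turn to the loop variable (`name` above).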
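
Python's documentation uses "iterator" for the object that hands out items one at a time. A minimal sketch of that machinery, using the built-in `iter()` and `next()` functions on the `names` list from earlier:

```python
names = ["Bob", "Cheryl", "Dave", "Jenny", "Waldo"]

name_iterator = iter(names)   # ask the iterable (the list) for an iterator

first = next(name_iterator)   # 'Bob'
second = next(name_iterator)  # 'Cheryl'
print(first, second)

# The iterator remembers its position, so a loop picks up where next() left off.
remaining = [name for name in name_iterator]
print(remaining)              # ['Dave', 'Jenny', 'Waldo']

# The list itself is untouched and can be iterated again from the start.
print(len(names))             # 5
```

A `for` loop performs the `iter()` and `next()` calls automatically, which is why they rarely appear in everyday code.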

Not coincidentally, Python views strings as iterable containers:

In [17]:
# ch is just a variable that I think makes a good
# abbreviation for "character."

for ch in py_string:  
    print(ch)

I
 
l
o
v
e
 
P
y
t
h
o
n


That fact probably isn't useful at this particular time, but it is handy to know and could come up later, during some text processing task.

A particularly useful trick is to use the `enumerate()` function to keep a count of loop iterations. This requires an extra variable alongside our item variable:

In [36]:
for indx, ch in enumerate(py_string):
    print(indx, ch)

0 I
1  
2 l
3 o
4 v
5 e
6  
7 P
8 y
9 t
10 h
11 o
12 n


We will see that trick again below when we only want to read the first 5 lines of a file.
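
`enumerate()` also accepts an optional `start` argument, which is handy when the count should begin at 1 for human readers. A short sketch with the `names` list:

```python
names = ["Bob", "Cheryl", "Dave", "Jenny", "Waldo"]

numbered = []
for position, name in enumerate(names, start=1):
    numbered.append(f'{position}. {name}')

print('\n'.join(numbered))
```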

## 4. Python zip tutorial (supporting ch. 6.3.3)

Section 6.3.3 on page 132 mentions the `zip()` function and gives a minimal demonstration of its functionality. `zip()` is a built-in function in Python. If you were to consult the documentation, it would say something like:

>Zip takes zero or more iterables as arguments and returns an iterator to tuples matched from the indexes of the inputs.

Since it is somewhat silly (though possible) to pass an empty list or a single list to zip, we will pass over those trivial cases. 

In [18]:
list1 = ['red','green','blue']
list2 = ['apple', 'grape', 'berry']

z = zip(list1,list2)

print(type(z))
print(z)

<class 'zip'>
<zip object at 0x10f17aa88>


So, you can see that `z` is a zip object, but what good does that do us? Well, first of all, we can iterate on `z`:

In [19]:
for i in z:
    print(i)

('red', 'apple')
('green', 'grape')
('blue', 'berry')


But, after we have iterated on our zip object, it is empty. Let's try to turn `z` into a list and print it:

In [20]:
list3 = list(z)
print(list3)

[]

The list is empty because the `for` loop already used up every item in `z`. Creating the zip object again gives us a fresh one to convert:


In [21]:
z = zip(list1, list2)
list(z)

[('red', 'apple'), ('green', 'grape'), ('blue', 'berry')]

Let's go ahead and assign that list to a variable so we can work with it:

In [22]:
zlist = list(z)
zlist

[]

Whoops! That "drained" our iterator too! 

Let's try again:

In [23]:
z = zip(list1, list2)
zlist = list(z)
zlist

[('red', 'apple'), ('green', 'grape'), ('blue', 'berry')]

Now each item in the list is a **tuple** (the name generalizes single, double, triple, quadruple, quintuple, sextuple, and so on). We will look at tuples in the next section.
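
Two more behaviors of `zip()` are worth a quick sketch: it accepts more than two iterables, and it stops as soon as the shortest input runs out. The `counts` list below is made up for illustration.

```python
list1 = ['red', 'green', 'blue']
list2 = ['apple', 'grape', 'berry']
counts = [1, 2]   # hypothetical shorter list

# Three inputs give three-item tuples; 'blue' and 'berry' are dropped
# because counts has only two items.
triples = list(zip(list1, list2, counts))
print(triples)   # [('red', 'apple', 1), ('green', 'grape', 2)]

# Converting to a list right away avoids the "drained" surprise:
# a list can be read as many times as needed.
pairs = list(zip(list1, list2))
first_pass = [pair for pair in pairs]
second_pass = [pair for pair in pairs]
print(first_pass == second_pass)   # True
```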

## 5. Python tuple tutorial (supporting ch. 6.3.3)

I have bad news and I have good news. Which do you want to hear first?

Let's start with the bad news: **tuples** are *one more* Python data structure (container) you need to learn.

Now the good news: **Tuples** behave almost exactly like lists, except for one detail: **tuples are immutable**, which is a computer science-y way of saying **_they can't be changed after they are created_**.

Let's look at a couple of examples:

In [24]:
tple_1 = ('red', 'green', 'blue', 'cyan', 'magenta')
tple_1

('red', 'green', 'blue', 'cyan', 'magenta')

In [25]:
tple_1[0]

'red'

In [26]:
len(tple_1)

5

In [27]:
tple_1[::-1]

('magenta', 'cyan', 'blue', 'green', 'red')

In [28]:
tple_1[0] = 'pink'

TypeError: 'tuple' object does not support item assignment

Whoops! There is the immutability I mentioned.

What about our `zlist` object above, which is a list composed of tuples? How do we get to the individual elements of each tuple?

The key lies in understanding that each tuple has indexed elements, and in turn, each tuple is an indexed element of the list:

In [29]:
zlist[0]

('red', 'apple')

OK, so a subscript returns a tuple. And we know that each item of a tuple has a subscript, so, let's just keep piling them on! 

Suppose I wanted the second item in the first tuple:

In [30]:
zlist[0][1]

'apple'

The tuples themselves cannot be changed, but `zlist` is an ordinary list, so any one of its elements can be replaced with a whole new tuple:

In [31]:
zlist[0] = ('yellow', 'banana')
zlist

[('yellow', 'banana'), ('green', 'grape'), ('blue', 'berry')]
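
Because each element of `zlist` is a two-item tuple, Python can also unpack the pieces directly into named variables, which often reads better than stacked subscripts. A sketch, using the original values of `zlist` from before the replacement:

```python
zlist = [('red', 'apple'), ('green', 'grape'), ('blue', 'berry')]

# Unpack one tuple into two variables.
color, fruit = zlist[0]
print(color, fruit)   # red apple

# Unpack inside a for loop, one tuple per pass.
descriptions = []
for color, fruit in zlist:
    descriptions.append(f'{color} {fruit}')
print(descriptions)   # ['red apple', 'green grape', 'blue berry']
```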

## 6. Using the requests module to gather multi-file data (alternative to ch. 6.6 method)

Section 6.6 discusses the fairly common occurrence of data being spread across multiple files. The book uses the standard library's older **urllib** module to retrieve the files named in a list of URLs. The third-party **requests** library is generally easier to use and very widely adopted, so we will look at requests-based retrieval as an alternative.

First, though, I think we should look at the first few lines of the URL data file the book uses, to verify location and content:

In [34]:
!head ../pandas_for_everyone/data/raw_data_urls.txt

https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-10.csv


In [35]:
import os
import requests

I'm also reversing the logic a bit. Instead of using the `break` statement to stop the loop, it is arguably clearer to say that there is a group of steps to be performed if the count is < 5 (remember, 0 to 4 is five things).

In [40]:
url_list = []  # new, empty list to hold URLs we get from the data file.
data_dir = 'data/'
os.makedirs(data_dir, exist_ok=True)  # make sure the output folder exists

with open('../pandas_for_everyone/data/raw_data_urls.txt', 'r') as infile:
    for indx, url in enumerate(infile):
        if indx < 5:
            url = url.strip()  # remove the trailing newline
            filename = url.split('/')[-1]  # text after the last slash
            print(f'Reading {filename}...')
            response = requests.get(url)
            response.raise_for_status()  # stop with an error if the download failed
            new_file = os.path.join(data_dir, filename)
            print(f'Writing {new_file}...\n\n')
            with open(new_file, 'w') as outfile:
                outfile.write(response.text)

Reading fhv_tripdata_2015-01.csv...
Writing data/fhv_tripdata_2015-01.csv...


Reading fhv_tripdata_2015-02.csv...
Writing data/fhv_tripdata_2015-02.csv...


Reading fhv_tripdata_2015-03.csv...
Writing data/fhv_tripdata_2015-03.csv...


Reading fhv_tripdata_2015-04.csv...
Writing data/fhv_tripdata_2015-04.csv...


Reading fhv_tripdata_2015-05.csv...
Writing data/fhv_tripdata_2015-05.csv...




Those are pretty big files, and a normal text editor may have trouble opening them for verification. Let's fall back on the Linux `head` command:

In [43]:
!head data/fhv_tripdata_2015-01.csv

Dispatching_base_num,Pickup_date,locationID
B00013,2015-01-01 00:30:00,
B00013,2015-01-01 01:22:00,
B00013,2015-01-01 01:23:00,
B00013,2015-01-01 01:44:00,
B00013,2015-01-01 02:00:00,
B00013,2015-01-01 02:00:00,
B00013,2015-01-01 02:00:00,
B00013,2015-01-01 02:50:00,
B00013,2015-01-01 04:45:00,


In [44]:
!head data/fhv_tripdata_2015-02.csv

Dispatching_base_num,Pickup_date,locationID
B00013,2015-02-01 00:00:00,
B00013,2015-02-01 00:01:00,
B00013,2015-02-01 00:21:00,
B00013,2015-02-01 01:00:00,
B00013,2015-02-01 02:10:00,
B00013,2015-02-01 03:34:00,
B00013,2015-02-01 03:37:00,
B00013,2015-02-01 03:39:00,
B00013,2015-02-01 03:42:00,


### 6.1. Wildcard and glob tutorial (supporting ch. 6.6)

Page 138 uses the oddly named `glob` library to take advantage of the sequential naming of the data files. The `glob` library uses the asterisk ('__\*__', also known as a star or a splat) as a wildcard and returns a list of all files whose names match a pattern.

In this case, the book shows a filename of `fhv_*`. The splat means 'match anything', so giving `fhv_*` to `glob` returns a list of all files whose names start with `fhv_`.
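
A self-contained sketch of that matching behavior: it creates a few empty files in a temporary directory (the file names are invented for the demo) and asks `glob.glob()` for the ones starting with `fhv_`. The results are sorted because `glob` does not promise any particular order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty demo files; only the first two match the pattern.
    for name in ['fhv_2015-01.csv', 'fhv_2015-02.csv', 'notes.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()

    pattern = os.path.join(tmp_dir, 'fhv_*')
    matches = sorted(glob.glob(pattern))

    # Show just the file names, without the temporary directory path.
    matched_names = [os.path.basename(path) for path in matches]

print(matched_names)   # ['fhv_2015-01.csv', 'fhv_2015-02.csv']
```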

## 7. Alternative merge method (supporting ch. 6.6.1)