# WORKSHOP 1 / Python Files Primer 

* [A quick introduction to Python](01intro_wkshp1nb.ipynb)
* [Python string basics](01strings_wkshp1nb.ipynb)
* **A quick tour of working with files**
* [Introduction to Github](01github_wkshp1nb.ipynb)

# Text Files

Files in Python are exceedingly easy to work with.  There are a few things to remember when we get started:

* the "end of line" is a newline (return),
* the "end of file" is handled for you, so you don't have to worry about checking in the general case,
* try initially to think about a text file as a list of lines, where each line is the next entry in the list.

## Loading a file
A file can be opened using either an absolute path or a path relative to the directory your program runs from.  We'll keep things simple for now and use relative paths throughout.
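To make the difference between the two kinds of path concrete, here is a minimal sketch using the standard `os.path` module. It only builds and inspects path strings, so it does not need the file to exist. It uses the parenthesised `print(...)` form so that it runs on both Python 2 and Python 3.

```python
import os

# The relative path used throughout this workshop
relative_path = os.path.join("data", "sample_data.txt")

# A relative path is resolved against the current working directory
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True
print(absolute_path)
```

If a relative path fails to open, printing `os.getcwd()` is usually the quickest way to see which directory it is being resolved against.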

Say we have a file of numbers, where there are *two* numbers per line, each separated by a tab (whitespace).  Let's first pretend we have a list of such numbers:


In [5]:
lst_of_number_strings = ["123.8    -38.79", "123.95    -39.75", "123.99    -36.79"]
for s in lst_of_number_strings:
    print s

123.8    -38.79
123.95    -39.75
123.99    -36.79


Looks good.  Now let's actually work with a file and try the same thing.  To open the file, use the built-in `open()` function:



In [4]:
f = open("data/sample_data.txt")
for s in f:
    print s


120.04	-42.8

120.16	-39.48

120.25	-37.67

120.26	-36.92

120.26	-36.7

120.3	-39.67

120.34	-37.69

120.4	-40.55

120.4	-36.37

120.42	-36.91

120.49	-40.52

120.49	-42.44

120.49	-41.68

120.64	-37.06

120.67	-36.95

120.67	-40.07

120.71	-40.28

120.73	-40.13

120.73	-42.41

120.84	-39.65

120.91	-37.25

120.92	-37.71

120.95	-36.19

120.97	-39.76

121.02	-41.44

121.09	-39.66

121.1	-37.25

121.11	-42.48

121.11	-36.38

121.11	-38.62

121.14	-39.54

121.25	-38.9

121.26	-38.69

121.3	-37.9

121.31	-39.81

121.43	-42.36

121.48	-38.86

121.55	-41.97

121.56	-37.26

121.56	-36.51

121.57	-37.7

121.58	-41.52

121.64	-37.88

121.66	-40.38

121.79	-40.4

121.88	-42.7

121.92	-39.98

121.93	-39.6

121.94	-37.65

121.95	-40.59

121.95	-40.79

122.18	-37.71

122.2	-36.55

122.21	-42.1

122.32	-38.41

122.33	-42.12

122.37	-40.15

122.49	-38.23

122.52	-41.96

122.54	-40.83

122.59	-39.1

122.7	-41.69

122.78	-40.09

122.79	-37.44

122.8	-37.03

122.81	-38.82

122.83	-39.31

122.84	-37.08

Cool, huh?
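The blank lines in the output above appear because each line read from the file still ends with its own newline, and `print` then adds a second one. The sketch below shows one way to avoid that, and also uses a `with` block so the file is closed automatically. The tiny stand-in file and its name are assumptions made for illustration, so the example runs without `data/sample_data.txt`.

```python
import os
import tempfile

# Write a tiny stand-in file (hypothetical name, same tab-separated layout)
path = os.path.join(tempfile.mkdtemp(), "tiny_sample.txt")
with open(path, "w") as f:
    f.write("120.04\t-42.8\n120.16\t-39.48\n")

# "with" closes the file automatically when the block ends
cleaned = []
with open(path) as f:
    for s in f:
        # each line still carries its trailing "\n"; rstrip removes it
        cleaned.append(s.rstrip("\n"))

# print adds exactly one newline per line, so the output is single-spaced
for line in cleaned:
    print(line)
```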

Now what if we want to do something useful, like **count** the number of lines in the file?  The naive approach is to loop over the file and increment a counter for every line.

In [8]:
f = open("data/sample_data.txt")

line_count = 0
for s in f:
    line_count+=1

print line_count

100
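The same count can be written more compactly by summing a `1` for every line the file yields. This is only a sketch under the assumption of a small temporary file (the file name and contents are made up for illustration), not a replacement for the loop above.

```python
import os
import tempfile

# Hypothetical three-line file standing in for the sample data
path = os.path.join(tempfile.mkdtemp(), "three_lines.txt")
with open(path, "w") as f:
    f.write("1\t2\n3\t4\n5\t6\n")

# Iterating over the file yields one line at a time; add 1 for each
with open(path) as f:
    line_count = sum(1 for _ in f)

print(line_count)  # 3
```

Because the file is read one line at a time, this approach does not need to hold the whole file in memory.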


Nice, way to go!  But we can do a little better: we can load all the lines into a list with `readlines()`.  Try this:

In [10]:
f = open("data/sample_data.txt")
file_as_list = f.readlines()

print file_as_list # ==> ['120.04\t-42.8\n', '120.16\t-39.48\n', '120.25\t-37.67\n' ...]
print len(file_as_list) # ==> 100

['120.04\t-42.8\n', '120.16\t-39.48\n', '120.25\t-37.67\n', '120.26\t-36.92\n', '120.26\t-36.7\n', '120.3\t-39.67\n', '120.34\t-37.69\n', '120.4\t-40.55\n', '120.4\t-36.37\n', '120.42\t-36.91\n', '120.49\t-40.52\n', '120.49\t-42.44\n', '120.49\t-41.68\n', '120.64\t-37.06\n', '120.67\t-36.95\n', '120.67\t-40.07\n', '120.71\t-40.28\n', '120.73\t-40.13\n', '120.73\t-42.41\n', '120.84\t-39.65\n', '120.91\t-37.25\n', '120.92\t-37.71\n', '120.95\t-36.19\n', '120.97\t-39.76\n', '121.02\t-41.44\n', '121.09\t-39.66\n', '121.1\t-37.25\n', '121.11\t-42.48\n', '121.11\t-36.38\n', '121.11\t-38.62\n', '121.14\t-39.54\n', '121.25\t-38.9\n', '121.26\t-38.69\n', '121.3\t-37.9\n', '121.31\t-39.81\n', '121.43\t-42.36\n', '121.48\t-38.86\n', '121.55\t-41.97\n', '121.56\t-37.26\n', '121.56\t-36.51\n', '121.57\t-37.7\n', '121.58\t-41.52\n', '121.64\t-37.88\n', '121.66\t-40.38\n', '121.79\t-40.4\n', '121.88\t-42.7\n', '121.92\t-39.98\n', '121.93\t-39.6\n', '121.94\t-37.65\n', '121.95\t-40.59\n', '121.95\t-40

Notice that the strings in the list include '\t' between the numbers and '\n' at the end of the line (tab and newline, respectively).  Let's work a little example to see what we can do to convert this string into something more useful.

Let's say we want to split this string along the tab character.  Doing so could not be simpler.  Let's see:

In [11]:
example = file_as_list[0] 
print example.split('\t') 

['120.04', '-42.8\n']


This is nice -- but what if we wanted to turn this into a tuple?  Remember we can turn a list into a tuple (and vice versa):

In [14]:
my_tuple = tuple(example.split('\t'))
print my_tuple

('120.04', '-42.8\n')


In this case, we don't gain anything by converting to a tuple, but it can nonetheless be useful down the road.  Notice the `'\n'` at the end of the second string: this is the newline character that was in the original file.  Unfortunately, this character is still with us and we really don't want it to be.  Never fear, there is an easy way out.  All strings have a method called `strip()` that will remove _whitespace characters_ (spaces, newlines, tabs) from the _beginning_ **and** _end_ of the string, like so:

In [16]:
# remove the whitespace (including newlines)
example = file_as_list[0].strip()
print example # ==> '120.04\t-42.8'

# now we can split the string to make a list
print example.split('\t') # ==> ['120.04', '-42.8']


120.04	-42.8
['120.04', '-42.8']
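The pieces we get back are still strings. If we want to do arithmetic with them, one more step is needed. A minimal sketch, assuming every field is a valid decimal number, is to pass each piece to the built-in `float()`:

```python
# First line of the sample file, exactly as readlines() returned it
line = "120.04\t-42.8\n"

fields = line.strip().split("\t")  # ['120.04', '-42.8']

# Convert each string to a number
first = float(fields[0])
second = float(fields[1])

print(first)   # 120.04
print(second)  # -42.8
print(first > second)  # True
```

If a field is not a valid number, `float()` raises a `ValueError`, so real data may need a check or a `try`/`except` around the conversion.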


### Lists again ...
As we can see, lists are really important in Python.  So much so that the language provides a shortcut for building one quickly.  Let's say we want to take the **list of strings** and make it a **list of lists**.  In other words, let's say we want our file to look like the list `[['120.04', '-42.8'], ['120.16', '-39.48'], ...]`.  Getting the list of strings into this form could be very valuable for searching, sorting, etc. down the road.  Let's look at the straightforward way to do this with just loops:


In [19]:
list_of_lists = []
for s in file_as_list:
    s_split = s.strip().split('\t') # ==> ['120.04', '-42.8']
    list_of_lists.append(s_split)
    
print list_of_lists

[['120.04', '-42.8'], ['120.16', '-39.48'], ['120.25', '-37.67'], ['120.26', '-36.92'], ['120.26', '-36.7'], ['120.3', '-39.67'], ['120.34', '-37.69'], ['120.4', '-40.55'], ['120.4', '-36.37'], ['120.42', '-36.91'], ['120.49', '-40.52'], ['120.49', '-42.44'], ['120.49', '-41.68'], ['120.64', '-37.06'], ['120.67', '-36.95'], ['120.67', '-40.07'], ['120.71', '-40.28'], ['120.73', '-40.13'], ['120.73', '-42.41'], ['120.84', '-39.65'], ['120.91', '-37.25'], ['120.92', '-37.71'], ['120.95', '-36.19'], ['120.97', '-39.76'], ['121.02', '-41.44'], ['121.09', '-39.66'], ['121.1', '-37.25'], ['121.11', '-42.48'], ['121.11', '-36.38'], ['121.11', '-38.62'], ['121.14', '-39.54'], ['121.25', '-38.9'], ['121.26', '-38.69'], ['121.3', '-37.9'], ['121.31', '-39.81'], ['121.43', '-42.36'], ['121.48', '-38.86'], ['121.55', '-41.97'], ['121.56', '-37.26'], ['121.56', '-36.51'], ['121.57', '-37.7'], ['121.58', '-41.52'], ['121.64', '-37.88'], ['121.66', '-40.38'], ['121.79', '-40.4'], ['121.88', '-42.7'],

Which gives us what we want!

There is a shorthand way to do the same thing _in one line_ using what Pythonistas call **list comprehensions**.  Don't let the fancy name fool you -- this is a useful technique that will make your life easier and code more readable and compact.  To convert the loop above into a **list comprehension** we first must think from the outside and work our way in.  Let's say we want to make an empty list `[]` and inside that list we want to have the items of the list look like `['first_string', 'second_string']`.

Consider what the loop fragment `for s in file_as_list` does.  It gives us a string such as `'120.04\t-42.8\n'` for each item in `file_as_list`.  Let's assume for a minute that we're allowed to do __one thing__ to each of those items in the loop.  What would it be?  In our case, we want to strip whitespace and split the string at the tab to get the resulting list (`['120.04', '-42.8']`).  Thus, that one thing looks like this: `s.strip().split('\t')`.

Now what we want is a list of all those two-element lists of strings.  Thus, if we put all that together:

In [22]:
list_of_lists = [ s.strip().split('\t')  for s in file_as_list ]
print list_of_lists

[['120.04', '-42.8'], ['120.16', '-39.48'], ['120.25', '-37.67'], ['120.26', '-36.92'], ['120.26', '-36.7'], ['120.3', '-39.67'], ['120.34', '-37.69'], ['120.4', '-40.55'], ['120.4', '-36.37'], ['120.42', '-36.91'], ['120.49', '-40.52'], ['120.49', '-42.44'], ['120.49', '-41.68'], ['120.64', '-37.06'], ['120.67', '-36.95'], ['120.67', '-40.07'], ['120.71', '-40.28'], ['120.73', '-40.13'], ['120.73', '-42.41'], ['120.84', '-39.65'], ['120.91', '-37.25'], ['120.92', '-37.71'], ['120.95', '-36.19'], ['120.97', '-39.76'], ['121.02', '-41.44'], ['121.09', '-39.66'], ['121.1', '-37.25'], ['121.11', '-42.48'], ['121.11', '-36.38'], ['121.11', '-38.62'], ['121.14', '-39.54'], ['121.25', '-38.9'], ['121.26', '-38.69'], ['121.3', '-37.9'], ['121.31', '-39.81'], ['121.43', '-42.36'], ['121.48', '-38.86'], ['121.55', '-41.97'], ['121.56', '-37.26'], ['121.56', '-36.51'], ['121.57', '-37.7'], ['121.58', '-41.52'], ['121.64', '-37.88'], ['121.66', '-40.38'], ['121.79', '-40.4'], ['121.88', '-42.7'],

**Awesome!**
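List comprehensions can also be nested and can carry a condition. The sketch below is one possible extension, not part of the original notebook: it converts every field to a `float`, sorts the rows by the second column, and keeps only some of them. The three input lines are copied from the sample output above but deliberately put out of order. The threshold `-40.0` is an arbitrary value chosen for illustration.

```python
# Three lines from the sample data, deliberately out of order
file_as_list = ["120.16\t-39.48\n", "120.04\t-42.8\n", "120.25\t-37.67\n"]

# Outer comprehension: one item per line; inner: one float per field
rows = [[float(x) for x in s.strip().split("\t")] for s in file_as_list]
print(rows)

# Sorting by the second column is now a numeric sort
by_second = sorted(rows, key=lambda row: row[1])
print(by_second[0])  # [120.04, -42.8]

# A trailing "if" keeps only the rows that pass the test
above = [row for row in rows if row[1] > -40.0]
print(len(above))  # 2
```

Sorting the original strings instead of the floats would compare them character by character, which is why the conversion comes first.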

## Back to files
Now let's get back to files and introduce a few typical text file types.  A lot of the data you might work with will come as a CSV, TSV or something similar.


|File type  | Stands for  |   Looks like |
|----|----|-----|
|CSV  |Comma Separated Values | `a, b, c`  |
|TSV  |Tab Separated Values   | `a <tab> b <tab> c` |

Python provides a library specifically for working with CSV files, which greatly reduces the amount of work you need to do with such files.  Later, when we move to Pandas, we'll advance our knowledge even further and stop caring about the details of the CSV altogether.

For now, though, let's get a quick understanding of working with a CSV file.

### The `csv` library

Once you are clear on the basics of files, you can set them aside and move on to the `csv` library, which takes even more work off your hands.  Essentially, this library raises the level of abstraction at which you work with the file.  In particular, the library (despite its name) lets you process CSV, TSV and any other xSV file that has a well-known delimiter ('|', '~', etc.).

The code will look a lot like your previous code, but with some extra niceties:

In [24]:

import csv
with open('data/sample_data.txt') as f:
    csv_reader = csv.reader(f, delimiter='\t') # necessary because the file is TAB delimited

    for l in csv_reader:
        print l


['120.04', '-42.8']
['120.16', '-39.48']
['120.25', '-37.67']
['120.26', '-36.92']
['120.26', '-36.7']
['120.3', '-39.67']
['120.34', '-37.69']
['120.4', '-40.55']
['120.4', '-36.37']
['120.42', '-36.91']
['120.49', '-40.52']
['120.49', '-42.44']
['120.49', '-41.68']
['120.64', '-37.06']
['120.67', '-36.95']
['120.67', '-40.07']
['120.71', '-40.28']
['120.73', '-40.13']
['120.73', '-42.41']
['120.84', '-39.65']
['120.91', '-37.25']
['120.92', '-37.71']
['120.95', '-36.19']
['120.97', '-39.76']
['121.02', '-41.44']
['121.09', '-39.66']
['121.1', '-37.25']
['121.11', '-42.48']
['121.11', '-36.38']
['121.11', '-38.62']
['121.14', '-39.54']
['121.25', '-38.9']
['121.26', '-38.69']
['121.3', '-37.9']
['121.31', '-39.81']
['121.43', '-42.36']
['121.48', '-38.86']
['121.55', '-41.97']
['121.56', '-37.26']
['121.56', '-36.51']
['121.57', '-37.7']
['121.58', '-41.52']
['121.64', '-37.88']
['121.66', '-40.38']
['121.79', '-40.4']
['121.88', '-42.7']
['121.92', '-39.98']
['121.93', '-39.6']
['121

This may seem unremarkable, but the advantage now is that we can assign the columns names and then reference the data by name instead of by index.  Using `DictReader` and its optional parameter `fieldnames`, we can give the fields whatever names we want and then access them by those names:


In [30]:
import csv
with open('data/sample_data.txt') as f:
    csv_reader = csv.DictReader(f, delimiter='\t', fieldnames=['lat', 'lon'])

    for l in csv_reader:
        print l['lat'] # instead of l[0]


120.04
120.16
120.25
120.26
120.26
120.3
120.34
120.4
120.4
120.42
120.49
120.49
120.49
120.64
120.67
120.67
120.71
120.73
120.73
120.84
120.91
120.92
120.95
120.97
121.02
121.09
121.1
121.11
121.11
121.11
121.14
121.25
121.26
121.3
121.31
121.43
121.48
121.55
121.56
121.56
121.57
121.58
121.64
121.66
121.79
121.88
121.92
121.93
121.94
121.95
121.95
122.18
122.2
122.21
122.32
122.33
122.37
122.49
122.52
122.54
122.59
122.7
122.78
122.79
122.8
122.81
122.83
122.84
122.84
122.88
122.93
122.95
123.02
123.06
123.14
123.17
123.18
123.2
123.21
123.23
123.29
123.3
123.3
123.34
123.35
123.39
123.4
123.54
123.58
123.61
123.63
123.65
123.68
123.69
123.69
123.8
123.84
123.84
123.95
123.99
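The `delimiter` and `fieldnames` options work the same way for other separators. As a small sketch, the example below reads pipe-delimited text. The two lines are hypothetical (values borrowed from the sample data) and are supplied as a plain list, which works because the `csv` readers accept any iterable of strings, not only open files.

```python
import csv

# Hypothetical pipe-delimited lines standing in for a file
lines = ["120.04|-42.8", "120.16|-39.48"]

reader = csv.DictReader(lines, delimiter="|", fieldnames=["lat", "lon"])

pairs = []
for row in reader:
    # values arrive as strings, so convert them before doing arithmetic
    pairs.append((float(row["lat"]), float(row["lon"])))

print(pairs)  # [(120.04, -42.8), (120.16, -39.48)]
```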


You might be wondering, **what if** the file already has a header row ... then we just omit the `fieldnames` and use the names in the header row:


In [2]:
import csv
# assumes a hypothetical copy of the data whose first line is the header row: lat<TAB>lon
with open('data/sample_data_with_header.txt') as f:
    # DictReader has no `header` argument; leaving out fieldnames makes the first row supply the names
    csv_reader = csv.DictReader(f, delimiter='\t')

    for l in csv_reader:
        print l['lat'] # instead of l[0]
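Here is a small self-contained sketch of the header-row case. `DictReader` takes no `header` argument. When `fieldnames` is left out, it simply treats the first row as the column names. The header line `lat<TAB>lon` is an assumption made for illustration, since the sample file itself has no header, and the lines are supplied as an in-memory list rather than a file.

```python
import csv

# Hypothetical tab-separated data whose first line is a header row
lines = ["lat\tlon", "120.04\t-42.8", "120.16\t-39.48"]

# No fieldnames given: the first row is consumed as the column names
reader = csv.DictReader(lines, delimiter="\t")

lats = [row["lat"] for row in reader]

print(reader.fieldnames)  # ['lat', 'lon']
print(lats)               # ['120.04', '120.16']
```

Note that the header row does not show up in `lats`; it is used for the names and then skipped.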

Now go and have some fun with files!