<img src='img/logo.png' />

<img src='img/title.png'>

<img src='img/py3k.png'>

# Table of Contents
* [Set-up](#Set-up)
* [Outline](#Outline)
* [Reading Files](#Reading-Files)
	* [Reading Text](#Reading-Text)
	* [Reading CSV](#Reading-CSV)
	* [File Formats to read](#File-Formats-to-read)
	* [String Parsing for Reads](#String-Parsing-for-Reads)
* [Writing Files](#Writing-Files)
	* [Exercise](#Exercise)
	* [String Formatting](#String-Formatting)
* [Reading and Writing with Pandas](#Reading-and-Writing-with-Pandas)
* [Section Summary](#Section-Summary)

# Set-up

In [None]:
import os
import csv
import random

# Outline

Reading Files
* `open(file,mode='r')`
* `import csv`
* context managers, `with`
* String operations, parsing

Writing Files
* `open(file,mode='w')`
* `print("Hello, world!", file=outfile)`
* String formatting for writes

Reading, Writing Tabular Data
* Pandas
* Columnar data, Tabular data
* CSV, JSON, YAML, etc

Data File Formats
* CSV
* JSON
* YAML
* XML

# Reading Files

Notebooks:
* 16 - Python III - Read Files

Read and Write with `open()`:
> `file_object = open(file_name, mode)`

The modes can be:
* 'r'  read only
* 'r+' reading and writing
* 'w'  write only (existing file will be over-written)
* 'a'  write as append
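
As a minimal sketch of how these modes differ, the cell below writes, appends to, and then overwrites a scratch file. The temporary directory and the name `modes_demo.txt` are assumptions made for illustration and are not part of the course data.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the course data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "modes_demo.txt")

    with open(demo_path, mode='w') as f:   # 'w': create the file (or empty an existing one)
        f.write("first\n")
    with open(demo_path, mode='a') as f:   # 'a': add to the end
        f.write("second\n")
    with open(demo_path, mode='r') as f:   # 'r': read only
        lines_after_append = f.readlines()

    with open(demo_path, mode='w') as f:   # 'w' again: earlier contents are discarded
        f.write("replaced\n")
    with open(demo_path, mode='r') as f:
        lines_after_overwrite = f.readlines()

print(lines_after_append)      # ['first\n', 'second\n']
print(lines_after_overwrite)   # ['replaced\n']
```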

## Reading Text

In [None]:
# Example: Don't need to understand all this, just notice the power of python

# We'll use this utility function throughout to check
# the top few lines of the files we read and write

def head(file_name, n=5):
    """OS/Platform-independent utility function 
        used to inspect the top few lines of our files
    """
    with open(file_name) as myfile:
        for _ in range(n):
            try:
                print(next(myfile), end="")
            except StopIteration: # end of file
                return None

In [None]:
# Now let's look at a text file
file_name = os.path.join("data","lorem-ipsum")
head(file_name)

In [None]:
# Example: simplest input
file_obj  = open(file_name, mode='r')
file_data = file_obj.read()   # raw text, end-of-line characters included
file_obj.close()

print( file_data )

In [None]:
# Exercise: do the same read, but remove all 
# end of lines "\n" and replace them with spaces

print( type( file_data ) )

file_data.replace("\n"," ")

In [None]:
# Example: next simplest input
file_obj   = open(file_name, mode='r')
file_lines = []

# Iterate and read
for line in file_obj:
    file_lines.append(line)

# close
file_obj.close()

# echo
file_lines

In [None]:
# Better way
file_lines = []

with open(file_name) as file_obj:
    for line in file_obj:
        file_lines.append(line)

file_lines
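
One reason the `with` form is usually preferred: the file is closed when the block exits, even if an error is raised while reading. The sketch below simulates such a failure. The scratch file and the `ValueError` are assumptions made for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file so the example does not depend on data/
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w") as f:
        f.write("one\ntwo\n")

    try:
        with open(demo_path) as file_obj:
            first_line = next(file_obj)
            raise ValueError("simulated failure while reading")
    except ValueError:
        pass

    # The with-block closed the file on the way out, despite the error
    closed_after_error = file_obj.closed

print(first_line, end="")
print(closed_after_error)
```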

## Reading CSV

In [None]:
# Let's inspect another text data file
file_name  = os.path.join("data","exoplanets.csv")

head(file_name)

In [None]:
# Let's read the data file.

# Method 1: Not the best way...
import csv
file_obj = open(file_name, mode='rt')  # "t" = "text mode"
csv_obj  = csv.reader(file_obj)

file_lines = []
for row in csv_obj:
    file_lines.append(row)
file_obj.close() # this never executes if reader fails!

file_lines[1]

In [None]:
# Method 2: Smarter way: use a "context manager"
import csv
with open(file_name, 'rt') as file_obj:
    csv_obj = csv.reader(file_obj)
    for row in csv_obj:
        print(row)

In [None]:
# Exercise: 
# Read all rows from "data/AAPL.csv" and store them in a list

# Set-up:
import csv
file_name = os.path.join("data","AAPL.csv")
data_list = []

In [None]:
# Solution:
# notice when you print your output here, that
# a list of rows is not an ideal data structure... see next exercise.

with open(file_name, 'rt') as file_obj:
    csv_obj = csv.reader(file_obj)
    for row in csv_obj:
        data_list.append(row)

In [None]:
data_list[0]

In [None]:
data_list[1]

In [None]:
data_list

In [None]:
# Exercise: Read "data/goog.csv"

# (1) use dict(zip()) (see the reminder cell that follows) to create a dictionary for each row
# such that the column header names are the keys for the row values
# and append each dict to a data list

# (2) print the 1st and 2nd rows from your final list

In [None]:
# Exercise Reminder: using zip() to create dicts
columns = ['Date','Open','High','Low','Close','Volume']
data    = [20160203,2.0,3.0,1.0,2.0,6]
dictionary = dict(zip(columns, data))
print(dictionary)

In [None]:
# Exercise set-up:
import csv
data_file = os.path.join("data","goog.csv")
data_list = []
dictionary = {}
columns = ['Date','Open','High','Low','Close','Volume']

In [None]:
# Solution:
with open(data_file, 'rt') as file_obj:
    csv_obj = csv.reader(file_obj)
    for row in csv_obj:
        dictionary = dict(zip(columns, row))
        data_list.append(dictionary)

In [None]:
record = data_list[1]
print(record)

x = float( record['Close'] )
type(x)
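
As an alternative to building each dictionary by hand, the standard library's `csv.DictReader` takes the keys from the first row of the file. This is a sketch of a different technique, not the exercise solution. It assumes the first row holds the column names, and the in-memory sample data is hypothetical.

```python
import csv
import io

# Hypothetical in-memory sample standing in for a CSV file with a header row
sample_text = (
    "Date,Open,High,Low,Close,Volume\n"
    "20160203,2.0,3.0,1.0,2.0,6\n"
    "20160204,2.5,3.5,2.0,3.0,8\n"
)

records = []
with io.StringIO(sample_text) as file_obj:
    for row in csv.DictReader(file_obj):   # header row supplies the keys
        records.append(dict(row))

close_price = float(records[0]['Close'])   # values are read as strings
print(records[0])
print(close_price)
```

With this approach the header row is consumed as keys, so it never appears in `records` as a data row.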

## File Formats to read

Notebooks:
* 61 - Data Formats - CSV, XLS, SQL.ipynb
* 62 - Data Formats - JSON, YAML, XML.ipynb

## String Parsing for Reads

Now that you can read data from files, you need to be able to parse strings!

Notebooks:
* 07 - Python II - Data Types (string methods)

In [None]:
"Jason Vestuto".count("s") # how many times does "s" substring appear?

In [None]:
"Jason Vestuto".split()    # default delimiter is any run of whitespace

In [None]:
" Jason ".strip()          # remove lead/trail white-space

In [None]:
"Jason ".rstrip()          # remove only trailing white-space

In [None]:
"Jason".replace("J","M")   # substitution

In [None]:
"Jason".endswith("n")

In [None]:
" ".join("Jason")             # separate all elements with " "

In [None]:
" ".join(["Jason","Vestuto"]) # separate all elements with " "

In [None]:
"Vest" in "Jason Vestuto"     # returns bool
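
These methods combine naturally when parsing a line read from a file. A minimal sketch, assuming a hypothetical comma-separated line with stray whitespace:

```python
# Hypothetical raw line, as it might come out of a text file
raw_line = "  20160203, 2.0, 3.0, 1.0, 2.0, 6 \n"

# strip() removes the surrounding whitespace and newline,
# split(",") breaks the line into fields, and each field is stripped again
fields = [field.strip() for field in raw_line.strip().split(",")]

date_str = fields[0]
prices   = [float(value) for value in fields[1:5]]   # Open, High, Low, Close
volume   = int(fields[5])

print(date_str, prices, volume)
```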

# Writing Files

## Exercise

In [None]:
# Example: simplest output
file_name  = os.path.join("tmp","hello.txt")

out_string = "Hello World \n"
outfile = open(file_name, mode='wt')
outfile.write(out_string)
outfile.close()

In [None]:
head(file_name)

In [None]:
# Exercise: Write a 4x6 data table to a file using print()

# hint: out_string = "%0.2f" % random.random()
# hint: print() takes an input arg called "file"
# hint: print() takes an input arg called "end"

# set-up
import random
nrows, ncols = 4, 6
file_name = os.path.join('tmp', 'random_numbers.csv')

# Use the same seed so we all get the same random numbers
random.seed(1234)
random.random()

In [None]:
# Solution1: Write with write()
import random
random.seed(1234)

nrows, ncols = 4, 6
file_name = os.path.join('tmp', 'random_numbers.csv')

with open(file_name, mode='w') as out_file:
    for _ in range(nrows):
        for col in range(ncols):
            value = "%0.2f" % random.random()
            out_file.write( value )
            # separator after every value except the last in the row
            if col != ncols-1:
                out_file.write( ", " )
        out_file.write("\n")

In [None]:
# Check the results
head(file_name)

In [None]:
# Solution2: Write with print()
import random
random.seed(1234)
nrows, ncols = 4, 6
file_name = os.path.join('tmp','random_numbers.csv')

with open(file_name,'w') as out_file:
    for _ in range(nrows):
        for col in range(ncols):
            out_string = "%0.2f" % random.random()
            # comma after every value except the last one in the row
            end = '\n' if col == ncols-1 else ','
            print(out_string, end=end, file=out_file)

In [None]:
# Check the results
head( file_name )
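
For comparison, the same kind of table can be written with `csv.writer` from the standard library, which places the separators and line endings itself. This is a sketch of an alternative technique, not one of the exercise solutions. An in-memory `io.StringIO` buffer stands in for the output file here.

```python
import csv
import io
import random

random.seed(1234)
nrows, ncols = 4, 6

# In-memory stand-in for open(file_name, 'w', newline='')
out_file = io.StringIO(newline='')
writer = csv.writer(out_file)

for _ in range(nrows):
    row = ["%0.2f" % random.random() for _ in range(ncols)]
    writer.writerow(row)   # adds the commas and the line ending

csv_text = out_file.getvalue()
print(csv_text, end="")
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated a second time.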

## String Formatting

Now that you are writing files, you need to know how to format strings!
* Notes based loosely on https://pyformat.info/
* See also https://github.com/ulope/pyformat.info

In [None]:
# Basic Formatting
# Old:
'%s' % ('Hello',) # string % tuple

In [None]:
'%s %s' % ('one', 'two') # more than one value

In [None]:
# New:
'{} {}'.format('one', 'two')

In [None]:
# New: out of order
'{1} {0}'.format('one', 'two')

In [None]:
# New: out of order
print( "The capital of {1:s} is {2:s}, a {0:s} city".format(
            "Music", "Texas", "Austin", "USA")
      )

In [None]:
# New: Using keyword arguments to specify values to format
print("The capital of {state} is {city}".format(
      city="Austin", state="Texas", country="USA"))

In [None]:
# Padding, align right
# Old
'%10s' % ('test',) # notice tuple comma

In [None]:
# New
'{:>10}'.format('test')

In [None]:
# New
'{0:>10}'.format('test') # only one entry, so 0 index, or no index

In [None]:
# Padding, Align left
# Old
'%-10s' % ('test',)

In [None]:
# New
'{0:<10}'.format('test')

In [None]:
# Truncation: format codes
# Old
'%.5s' % ('blizzard',)

In [None]:
# New, with tuple unpacking
tup_of_strs = ('blizzard', 'minneapolis','austin')
'{2:.4}'.format(*tup_of_strs)

In [None]:
# Numerical types: integers
# Old:
'%d' % (42,)
# New:
'{0:d}'.format(42)

In [None]:
# Numerical types: Floats
# Old:
import math
'%f' % (math.pi,)
# New:
'{0:f}'.format(math.pi)

In [None]:
('{:f} '*3).format(1,2,3)

In [None]:
my_nums = [1,2,3]
my_tup = tuple(my_nums)
print( ('{:<10e} '*len(my_nums)).format(*my_tup) )
print( ('{:<10f} '*len(my_nums)).format(*my_tup) )

In [None]:
'{0:f} {1:s}'.format(math.pi, "Jason")

In [None]:
# Padding Integers:
# Old:
'%4d' % (42,)
# New:
'{:4d}'.format(42)

In [None]:
# Padding Floats:
# Old:
'%06.2f' % (math.pi,)
# New:
'{:06.2f}'.format(math.pi)

In [None]:
# Key-value expansion
data = {'first': 'Jason', 'last': 'Vestuto'}
'{first:>.1s}{last:>.7s}'.format(**data).lower()

In [None]:
# New: for details on datetime, see notebook "09 - Python II - More Types"

from datetime import datetime
my_date = datetime(2016, 2, 3, 10, 30, 0)
'{:%Y-%m-%d %H:%M}'.format(my_date)

In [None]:
# Linux users, notice that this syntax is the same as the GNU date command
if os.name == "posix":
    ! date +"%Y-%m-%d %H:%M"
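
To tie formatting back to writing, the sketch below uses the padding and precision codes from this section to print an aligned table through `print(..., file=...)`. The rows are made-up values, and an in-memory `io.StringIO` buffer stands in for a real output file.

```python
import io

header = ('Date', 'Close', 'Volume')
rows = [(20160203, 2.0, 6), (20160204, 3.14159, 12345)]   # hypothetical values

out_file = io.StringIO()   # in-memory stand-in for open(file_name, 'w')

# Left-align the first column, right-align the numeric columns
print('{:<10}{:>8}{:>8}'.format(*header), file=out_file)
for date, close, volume in rows:
    print('{:<10d}{:>8.2f}{:>8d}'.format(date, close, volume), file=out_file)

table_text = out_file.getvalue()
print(table_text, end="")
```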

# Reading and Writing with Pandas

Reading, writing, manipulating data with Pandas

Columnar data, Tabular data

CSV, JSON, YAML, etc

Notebooks:
* 44 - Pandas - Introduction.ipynb
* 46 - Pandas - Data IO.ipynb
* 51 - Pandas - Exercise 1 Messy Data.ipynb
* 53 - Pandas - Exercise 3 Excel Files.ipynb

# Section Summary

Reading and Writing with core Python

Reading Files
* Reading Text
* Reading CSV
* File Formats to read
* String Parsing for Reads

Writing Files
* Writing to various formats
* String Formatting

Reading and Writing with Pandas

<img src='img/copyright.png'>