# Table of Contents
* [Working with Spreadsheets](#Working-with-Spreadsheets)
* [Learning Objectives:](#Learning-Objectives:)
	* [What are Spreadsheets?](#What-are-Spreadsheets?)
		* [What are spreadsheets good for?](#What-are-spreadsheets-good-for?)
	* [Structure of Excel files](#Structure-of-Excel-files)
		* [New XML-based hotness: .xlsx](#New-XML-based-hotness:-.xlsx)
		* [Old binary-format-based, but not busted: .xls](#Old-binary-format-based,-but-not-busted:-.xls)
	* [Structure of ODT (and ODS) files](#Structure-of-ODT-%28and-ODS%29-files)
		* [XML-based .odt and .ods](#XML-based-.odt-and-.ods)
		* [Picking one to use:](#Picking-one-to-use:)
	* [Basic Steps for Programmatically Working with Excel](#Basic-Steps-for-Programmatically-Working-with-Excel)
	* [Notes and Gotchas](#Notes-and-Gotchas)
	* [Exercises](#Exercises)
	* [Optional Exercises](#Optional-Exercises)
		* [What is a cell?](#What-is-a-cell?)


# Working with Spreadsheets

# Learning Objectives:

* Understand the structure of Excel .xlsx files
* Read data from Excel files
* Write data to Excel files

## What are Spreadsheets?

* Spreadsheets are files that can only be modified via lots of mouse-clicking. (Or is that true?)
* Databases
* Todo Lists
* Complex Programs
* A catchall for data for people who don't/can't know any better. (This is not true, but it often feels true.)

### What are spreadsheets good for?

* Rapid prototyping
* Easy to share understanding between technical and non-technical people
* Concrete, visible structure makes them easy for non-programmers to use (and that same ease makes them dangerous)


Microsoft Excel is the dominant spreadsheet program, so we'll focus on that, but give some examples with the OpenDocument formats (.odt for text documents, .ods for spreadsheets), an open standard maintained by OASIS and championed by the Free Software community.

## Structure of Excel files

### New XML-based hotness: .xlsx 

* xlsx defines the structure of Excel spreadsheets that fit into the [OOXML framework](http://www.officeopenxml.com/anatomyofOOXML-xlsx.php). 
* One .xlsx file contains only one workbook (but worksheets in that workbook may refer to other workbooks in other files).
* A .xlsx file is actually a zip file (aka package) containing a number of parts. Some are required, some are not.
  * [Content_Types].xml is required
  * Relationship parts (.rels files) are required; they tie the other parts together (worksheets, styles, external resources, etc.)
* A workbook may contain one or more worksheets
* Each worksheet is kept in a different XML file
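
Because a .xlsx file is a zip package, the standard library is enough to peek inside one. The following is a minimal sketch: `list_package_parts` is a hypothetical helper name, and the path is the sample file used later in this notebook, so nothing is printed if that file is not present.

```python
import os
import zipfile


def list_package_parts(source):
    """Return the names of the parts stored inside a zip-based package."""
    with zipfile.ZipFile(source) as package:
        return package.namelist()


# Path used elsewhere in this notebook; skipped quietly if it is missing.
aapl_xlsx = "data/AAPL01.xlsx"
if os.path.exists(aapl_xlsx):
    for name in list_package_parts(aapl_xlsx):
        print(name)
```

In a real workbook the listing should include `[Content_Types].xml` and one XML part per worksheet.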

### Old binary-format-based, but not busted: .xls

* xls is a binary-format specification that defines the structure of Excel spreadsheets.
* An xls file is "... an OLE compound file. A compound file contains storages, streams, and substreams. Each stream or substream contains a series of binary records. Each binary record contains zero or more structured fields that contain the workbook data." (This brief excerpt is taken from [MSDN](https://msdn.microsoft.com/en-us/library/office/cc313154%28v=office.12%29.aspx).)
* The basic building block of xls files is the binary record. Each record is a variable-length sequence of bytes, and is composed of three things: record type, record size, and data.

In other words, xls is a complex format. (I hate this format now. But in truth, it is actually pretty amazing. Backwards compatible to the beginning of time, made to be fast on old computers (like the kind from 10+ years ago), and designed to solve the problems of the day while still being able to handle the future)
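
To make the "record type, record size, data" layout concrete, here is a rough sketch that walks a stream of such records. It assumes a 2-byte type and a 2-byte size, both little-endian, followed by the data. `iter_records` is a hypothetical helper, and the sample bytes are hand-built for illustration. A real .xls file would first need its workbook stream extracted from the OLE compound file, which libraries such as xlrd handle for you.

```python
import struct


def iter_records(stream):
    """Yield (record_type, record_size, data) for each record in a byte string."""
    offset = 0
    while offset + 4 <= len(stream):
        # Header: 2-byte record type, then 2-byte size of the data that follows.
        record_type, record_size = struct.unpack_from("<HH", stream, offset)
        data = stream[offset + 4:offset + 4 + record_size]
        yield record_type, record_size, data
        offset += 4 + record_size


# Two hand-built records: one with two data bytes, one with no data at all.
sample = (
    struct.pack("<HH", 0x0809, 2) + b"\x00\x06"
    + struct.pack("<HH", 0x000A, 0)
)

for record_type, record_size, data in iter_records(sample):
    print(hex(record_type), record_size, data)
```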

## Structure of ODT (and ODS) files

### XML-based .odt and .ods

* .odt (text documents) and .ods (spreadsheets) are both OpenDocument formats, defined by the [ISO/IEC 26300-1:2015 specification](http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=66363)
* An OpenDocument file is a zip package of XML files whose elements describe spreadsheets, charts, images, text, drawings, etc.
* ods files use the same package structure as odt files; the "ods" extension (together with the MIME type declared inside the package) tells what program should open the file.
  * In other words, there is very little that is special about .ods
* Each spreadsheet element can contain table elements, calculation elements, and lots of other XML elements
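
One way to see how similar the two formats are is to read the `mimetype` entry stored in the package. This is a sketch only: `odf_kind` and `ODF_KINDS` are hypothetical names, and the demo uses a stand-in package built in memory rather than a real spreadsheet.

```python
import io
import zipfile

# MIME types declared by OpenDocument spreadsheets and text documents.
ODF_KINDS = {
    "application/vnd.oasis.opendocument.spreadsheet": "spreadsheet (.ods)",
    "application/vnd.oasis.opendocument.text": "text document (.odt)",
}


def odf_kind(source):
    """Describe an OpenDocument package based on its 'mimetype' entry."""
    with zipfile.ZipFile(source) as package:
        mimetype = package.read("mimetype").decode("ascii").strip()
    return ODF_KINDS.get(mimetype, mimetype)


# Stand-in package: only the mimetype entry, not a complete document.
stand_in = io.BytesIO()
with zipfile.ZipFile(stand_in, "w") as package:
    package.writestr("mimetype", "application/vnd.oasis.opendocument.spreadsheet")
stand_in.seek(0)

print(odf_kind(stand_in))
```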

### Picking one to use:

1. Should you use a spreadsheet in the first place?
  1. How is the data intended to be used?
  2. How much data is there?
  3. Who knows what logic and calculations have to be encoded? A business analyst or accountant?
2. If so...
  1. Is backwards compatibility with Excel 2003 or earlier required? If it is, use .xls; otherwise use .xlsx. (Excel 2007 was the first version to support OOXML natively, according to my "Google archaeology")

## Basic Steps for Programmatically Working with Excel

1. Look at a small sample of your data
2. Test on a small sample of the data
3. Robustify the code
4. Test on a larger sample of the data
5. Iterate
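
As a small illustration of steps 2 to 4, the sketch below first runs on a handful of values and then on everything, with a conversion helper that tolerates blanks and stray text. `to_float` and `highest` are hypothetical helpers, and the values are made up for the example.

```python
from itertools import islice


def to_float(value):
    """Return value as a float, or None when it is blank or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def highest(values, limit=None):
    """Largest numeric value, looking at no more than `limit` values."""
    numbers = (to_float(value) for value in islice(values, limit))
    return max((n for n in numbers if n is not None), default=None)


# Made-up column values, including a blank cell and a stray label.
column = ["1.5", None, "n/a", 3, "2"]

print(highest(column, limit=2))  # small sample first
print(highest(column))           # then the full column
```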

## Notes and Gotchas

* Python indexing is 0-based
* Excel indexing is 1-based
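* This makes for a WEIRD mishmash of indexing techniques
  * worksheet.cell(row=1, column=1) == list(worksheet.rows)[0][0] (in current openpyxl releases worksheet.rows is a generator, so it has to be turned into a list before it can be indexed)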
* openpyxl requires a LOT of memory, even for smallish spreadsheets
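
The off-by-one between the two schemes is easy to see without Excel at all. The sketch below uses a plain list of lists as a stand-in for a worksheet; `column_letters` and `a1_reference` are hypothetical helpers written for this example (openpyxl ships its own conversion utilities).

```python
def column_letters(column):
    """Convert a 1-based column number to Excel letters (1 is A, 27 is AA)."""
    letters = ""
    while column > 0:
        column, remainder = divmod(column - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def a1_reference(row, column):
    """Build an Excel-style reference from 1-based row and column numbers."""
    return "{}{}".format(column_letters(column), row)


# A tiny stand-in worksheet held as a 0-based list of rows.
grid = [
    ["Header1", "Header2"],
    [101, 201],
]

row, column = 2, 1                   # Excel-style, 1-based
print(a1_reference(row, column))     # the reference a spreadsheet user would type
print(grid[row - 1][column - 1])     # the same position in 0-based Python
```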

In [None]:
#conda install openpyxl xlrd xlwt
#This won't work: "conda install xlutils". It is apparently incompatible with python 3.4 (as of 2015-06-25)
import openpyxl
import xlrd
import xlwt

from openpyxl import load_workbook, Workbook

from pprint import pprint

aapl_xlsx = "data/AAPL01.xlsx"

In [None]:
wb = load_workbook(aapl_xlsx)
#A workbook should have one or more worksheets.
#Let's see
pprint(wb.worksheets)
AAPL_ws = wb['AAPL']
pprint(AAPL_ws)

#for row in list(AAPL_ws.rows)[1:10]:
#    for cell in row[:7]:
#        print (cell.value)

#What is the difference between that loop and this one?
for row in AAPL_ws['A2':'F11']:
    for cell in row:
        print (cell.value)

#The top loop is loading ALL columns (from A-ZZZZZZ whatever)
#This is fine if you can wait a while and have lots o' RAM
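
If waiting and RAM are a concern, openpyxl's read-only mode together with `iter_rows` is one option worth trying. The sketch below assumes a reasonably recent openpyxl (the `values_only` argument is not available in older releases) and builds a tiny stand-in workbook in memory with made-up numbers, so it runs without the AAPL file.

```python
import io

from openpyxl import Workbook, load_workbook

# Build a small stand-in workbook in memory (made-up values, not real prices).
source_wb = Workbook()
source_ws = source_wb.active
source_ws.append(["Date", "Open"])
source_ws.append(["2014-01-02", 10.5])
source_ws.append(["2014-01-03", 11.25])
buffer = io.BytesIO()
source_wb.save(buffer)
buffer.seek(0)

# read_only=True streams rows instead of building every cell up front.
wb = load_workbook(buffer, read_only=True)
ws = wb.active

# Only column 2, from row 2 down; values_only yields plain values, not cells.
opens = [row[0] for row in ws.iter_rows(min_row=2, min_col=2, max_col=2,
                                        values_only=True)]
wb.close()
print(opens)
```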

In [None]:
#Iterate over the opening prices and find and print the maximum

#Why use "maximum" instead of easier to write "max"?
maximum = float("-inf")
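#AAPL_ws['B'] is column B (the opening prices); [1:] skips the header row.
#(AAPL_ws.columns is a generator in current openpyxl, so it cannot be indexed directly.)
for cell in AAPL_ws['B'][1:]:
    if maximum < float(cell.value):
        maximum = float(cell.value)
print("The highest opening price is {}".format(maximum))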
        

## Exercises

1. Find and print the maximum volume
2. Sum and print the volume over all time
3. Find and print any differences between the closing price and adjusted closing prices

## Optional Exercises

1. Find and print the maximum volume per year
2. Find and print the maximum and minimum opening price per year
3. Sum and print the volume over each year

In [None]:
my_first_workbook = "data/my_first_spreadsheet.xlsx"
new_wb = Workbook()

#Each workbook has at least one worksheet
ws = new_wb.active
ws.title = "Test1"


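#Cells can be addressed by their coordinate, like a dictionary key.
#(ws.cell('A1') is no longer accepted by current openpyxl; ws.cell() takes row= and column=.)
ws['A1'].value = "Header1"
ws['B1'].value = "Header2"
ws['C1'].value = "Header3"
ws['D1'].value = "Header4"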

for col in range(1,5):
    for row in range(2,10):
        c = ws.cell(column=col, row=row)
        c.value = col*100 + row
        
new_wb.save(my_first_workbook)

### What is a cell?

A cell is a distinct collection of attributes and properties at a particular location (identified by a row and column) inside a worksheet. If that definition is too generic, try this:

"The cell is the primary place in which data is stored and operated on. A cell can have a number of characteristics,
such as numeric, text, date, or time formatting; alignment; font; color; and a border. Each cell is identified by a
cell reference, a combination of its column and row headings." ([ECMA OOXML Part 1](http://www.ecma-international.org/publications/standards/Ecma-376.htm))
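
In openpyxl terms, those characteristics show up as attributes on the cell object. A minimal sketch, assuming openpyxl is installed; the coordinate and value are arbitrary:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Asking for a coordinate creates the cell if it does not exist yet.
cell = ws["B3"]
cell.value = 42

print(cell.coordinate)     # the cell reference (column letter plus row number)
print(cell.row)            # 1-based row number
print(cell.value)          # the stored data
print(cell.data_type)      # openpyxl's short code for the kind of value
print(cell.number_format)  # the display format applied to the value

# The same cell is reachable by row and column numbers.
same_cell = ws.cell(row=3, column=2)
print(same_cell.value)
```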

In [None]:
#Boss says "You did great getting that Apple stock data, but I need one worksheet per year."
#What do?
#We could go in and manually separate each year into a different worksheet (from 2014 to 1980). Yuck!
#We could do it automatically. Yay!


#Basic scheme for the new workbook:
# for each year encountered, make a new worksheet
# populate that worksheet with the data for that year.
aapl_wb = load_workbook(aapl_xlsx)
aapl_ws = aapl_wb.active

headers = list(aapl_ws['A1': 'G1'])[0]
first_data_cell = 'a2'
last_data_cell = 'g%s' % (aapl_ws.max_row)
#last_data_cell = 'g1000'
#str() keeps the slice working whether the date cell holds text or a real date
year = str(aapl_ws.cell(row=2, column=1).value)[:4]

aapl_separated_file = "data/AAPL_separated.xlsx"
aapl_separated_wb = Workbook()

ws = aapl_separated_wb.active
ws.title = year
ws.append([cell.value for cell in headers])

new_worksheets = {year: ws}

for row in aapl_ws[first_data_cell:last_data_cell]:
    #Each of these things is an individual cell
    date, p_open, p_high, p_low, p_close, p_vol, p_adj_close = row
    year = str(date.value)[:4]
    if year not in new_worksheets:
        ws = aapl_separated_wb.create_sheet(title=year)
        new_worksheets[year] = ws
        ws.append([cell.value for cell in headers])
        
    else:
        ws = new_worksheets[year]
        
    ws.append([cell.value for cell in row])
    
aapl_separated_wb.save(aapl_separated_file)