# Data Lists and List Transforms
by Chris North, Virginia Tech

Also: Computational Notebooks, Jupyter & Python, Processing Data Files

Reminder about learning goals:  Concepts + Coding

## Computational Notebooks
* What?
    * Notebook encapsulates Code + Results + Narrative
    * Consists of a series of 2 types of cells:  code+result cell, or markdown cell
    * Example:  https://nbviewer.org/gist/nealcaren/5105037
* Why?
    * Provenance = Sequence of code cells and intermediate results leading up to a solution
    * Incremental, Interactive, Collaborative, Reproducible science, ...
* Use cases
    * Data science process
    * Reporting and presentation
* Tools:
    * Jupyter, colab.research.google.com, deepnote.com, observablehq.com, ...

### Jupyter:
* Anaconda Navigator
* Documentation:  https://jupyter.org
* WebCAT plug-in:  https://github.com/CSSPLICE/webcatjupyterplugin

### Python:
* Anaconda download: https://www.anaconda.com/download/ 
* Documentation: https://www.python.org


## Getting started with Jupyter and Python

### Code cell
* Create a code cell (+)
    * Code completion (tab)
    * Code inspector (shift+tab)
    * Run a cell (shift+return)
* Command mode vs Edit mode
    * Delete a cell (d,d)
* Keyboard Shortcuts:  Jupyter | Help | Keyboard Shortcuts
    * Tutorial: Jupyter | Help | User Interface Tour


In [1]:
print('hello world')

hello world


**bold**

*Virginia Tech is great*

### Markdown cell
* Create a markdown cell (+, toolbar menu | Markdown)
    * Markdown syntax: https://www.markdownguide.org/cheat-sheet/ ,  https://help.github.com/articles/basic-writing-and-formatting-syntax/ 
    * bullet *italic* **bold** `code`
    * [VT](http://vt.edu) or http://dude.com/url 
    * [link within a notebook](#Computational-Notebooks)
* Run a markdown cell

![VT](https://www.assets.cms.vt.edu/images/logo-maroon-whiteBG.svg)


### Computing stuff

How tall are you in inches?

In [2]:
print(5*12+7)
5*12+7

67


67

How tall are you in cm?  
*Write a function*

In [3]:
def inch_to_cm(inch):
    return inch * 2.54

In [4]:
inch_to_cm(67)

170.18

## Data Structures and Transforms:  Lists
Using appropriate data science abstractions

### List data structure
* Sequence of elements:  [1, 2, 3]
* Length:  len(list)
* Mutable:  list[2]=5
* List Comprehensions: https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions
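
The list operations above can be sketched in a few lines. The list `nums` is a made-up example, not data from this notebook:

```python
# `nums` is a hypothetical example list
nums = [1, 2, 3]

length = len(nums)                 # number of elements
nums[2] = 5                        # lists are mutable: nums is now [1, 2, 5]
doubled = [n * 2 for n in nums]    # a list comprehension builds a new list

print(length, nums, doubled)
```

Note that the comprehension leaves `nums` itself unchanged, while the index assignment modifies it in place.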


Example:  heights of students in the class, in inches

In [5]:
heights = [67, 68, 74, 63, 69, 65, 82]
heights

[67, 68, 74, 63, 69, 65, 82]

## Data Transforms using Lists

Functional transforms for answers to simple data questions.
(Without writing loop code!)

* Slice
* Reduce
* Map 
* Filter
* Sort
* Join

### Slice

Select data by index.

`list[start:end:step]`

Example: What are some specific heights in the list? 

In [6]:
heights

[67, 68, 74, 63, 69, 65, 82]

In [7]:
heights[0]

67

In [8]:
heights[-1]

82

In [9]:
heights[0:6:2]

[67, 74, 69]

In [10]:
heights[::-1]

[82, 65, 69, 63, 74, 68, 67]

In [11]:
print(heights)
heights[::-1]
print(heights)

[67, 68, 74, 63, 69, 65, 82]
[67, 68, 74, 63, 69, 65, 82]


In [12]:
heights

[67, 68, 74, 63, 69, 65, 82]

### Reduce

List &rarr; value

Most common reduce transforms are pre-implemented: `sum(list), len(list), min(list), max(list)` are built in, and `numpy.mean(list), numpy.median(list), ...` come from NumPy.

Example:  How many?  What's the mean?  Shortest or tallest?

In [13]:
max(heights)

82

In [14]:
import numpy 
numpy.median(heights)

68.0
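
As a rough sketch, the remaining questions (how many, mean, shortest) can be answered with built-in reducers. The standard-library `statistics` module is used here as an alternative to `numpy`, and `functools.reduce` is shown only to illustrate what a hand-written reduce looks like:

```python
import statistics
from functools import reduce

heights = [67, 68, 74, 63, 69, 65, 82]

count = len(heights)                      # how many?
total = sum(heights)                      # sum of all heights
shortest, tallest = min(heights), max(heights)
mean_height = statistics.mean(heights)    # standard-library mean

# A custom reduce: fold the list into one value with a two-argument function
total_via_reduce = reduce(lambda acc, h: acc + h, heights, 0)

print(count, total, shortest, tallest, mean_height)
```

In practice the pre-implemented reducers are usually preferable; `reduce` is mainly useful when no built-in fits.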

### Map

List &rarr; List

* Long form:  `map(func, list)` where `func(element)` is 1&rarr;1
    * Inline funcs:  `(lambda vars: expression(vars))`
* Short form:  `[expr(var) for var in list]`  (Python **List Comprehension**)
    * List Comprehensions: https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions

Example: What are the heights in metric?

In [15]:
heights

[67, 68, 74, 63, 69, 65, 82]

In [16]:
## Map, using defined function
list(map(inch_to_cm, heights))

[170.18, 172.72, 187.96, 160.02, 175.26, 165.1, 208.28]

In [17]:
## Map, using Lambda function
list(map(lambda i: i * 2.54, heights))

[170.18, 172.72, 187.96, 160.02, 175.26, 165.1, 208.28]

In [18]:
## Map, using Python List Comprehension
[i * 2.54 for i in heights]

[170.18, 172.72, 187.96, 160.02, 175.26, 165.1, 208.28]

### Filter

Select by data criteria:  List &rarr; shorter List

* Long form:  `filter(func, list)`  where `func(element)` &rarr; boolean
* Short form: `[var for var in list if cond(var)]`  (Python **List Comprehension**)

Example:  Which are tall?


In [19]:
heights

[67, 68, 74, 63, 69, 65, 82]

In [20]:
def is_tall(inch):
    return inch >= 6 * 12

In [21]:
[is_tall(i) for i in heights]

[False, False, True, False, False, False, True]

In [22]:
## Filter using defined function
list(filter(is_tall, heights))

[74, 82]

In [23]:
## Filter using lambda function
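
This cell is empty in the notebook. One possible completion, mirroring the lambda version of map shown earlier, might look like this:

```python
heights = [67, 68, 74, 63, 69, 65, 82]

# Keep only heights of at least 6 feet (72 inches), using an inline lambda
tall = list(filter(lambda i: i >= 6 * 12, heights))
print(tall)   # [74, 82]
```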

In [24]:
## Filter, using Python List Comprehension
[i for i in heights if i >= 6*12]

[74, 82]

In [25]:
## Combine Map and Filter in one list comprehension
[i * 2.54 for i in heights if i >= 6*12]

[187.96, 208.28]

### Sort

`sorted(list)` or  `list.sort()`

Example: What is the order of heights, tallest first?

In [26]:
sorted(heights, reverse=True)

[82, 74, 69, 68, 67, 65, 63]
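
The two forms differ in one important way, sketched below: `sorted(list)` returns a new list, while `list.sort()` reorders the existing list and returns `None`. The variable names are illustrative only:

```python
heights = [67, 68, 74, 63, 69, 65, 82]

# sorted() returns a new list and leaves the original untouched
short_to_tall = sorted(heights)
print(short_to_tall)   # [63, 65, 67, 68, 69, 74, 82]
print(heights)         # [67, 68, 74, 63, 69, 65, 82]

# list.sort() reorders the list in place and returns None,
# so work on a copy if the original order still matters
in_place = list(heights)
result = in_place.sort()
print(in_place, result)
```

Because `sorted()` does not modify its input, it composes with the other transforms in the same way `map` and `filter` do.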

## Composition of Transforms

Chaining multiple transforms together 

Example: How many tall people?

*hint: use data as logic, no counter variable needed*

In [27]:
len([i for i in heights if i >= 6*12])

2

Example: Standard Deviation?

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/32e3c0f27c2595926963cc5d8df113e6a12cf917)

In [28]:
## HW2?
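
One way to approach this is as a chain of reduce, map, reduce. The sketch below assumes the formula pictured is the population standard deviation (dividing by N), which is also what `numpy.std` computes by default:

```python
import math

heights = [67, 68, 74, 63, 69, 65, 82]

mean = sum(heights) / len(heights)                   # reduce
squared_devs = [(h - mean) ** 2 for h in heights]    # map
variance = sum(squared_devs) / len(squared_devs)     # reduce
stdev = math.sqrt(variance)

print(stdev)   # about 5.945
```

No loop or accumulator variable is needed; each step is a transform applied to the result of the previous one.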

## Exercises

### Data normalization using Zscore:

$zscore(x) = \frac{(x - mean)}{stdev}$

* normalize data such that: mean=0, stdev=1
* https://en.wikipedia.org/wiki/Standard_score

Problem 1:  Normalize the Heights using Zscore:  


In [29]:
### some useful reduce operators
import numpy
numpy.mean(heights), numpy.std(heights)

(69.71428571428571, 5.94532915619566)

Problem 2: Find heights more than 1 stdev below the mean (zscore < -1)

In [30]:
(75 - numpy.mean(heights)) / numpy.std(heights)

0.8890532629645946

In [31]:
[(i - numpy.mean(heights)) / numpy.std(heights) for i in heights]

[-0.45654086476560113,
 -0.28834159879932664,
 0.7208539969983202,
 -1.129337928630699,
 -0.12014233283305217,
 -0.7929393966981501,
 2.066448124728516]
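
A possible approach to Problem 2 is to filter on the z-score inside a list comprehension. This sketch uses the standard-library `statistics.pstdev` (population standard deviation) in place of `numpy.std`; the two agree for this data:

```python
import statistics

heights = [67, 68, 74, 63, 69, 65, 82]
mean = statistics.mean(heights)
stdev = statistics.pstdev(heights)   # population stdev, like numpy.std's default

# Filter: keep heights whose z-score is below -1
short = [h for h in heights if (h - mean) / stdev < -1]
print(short)   # [63]
```

Only 63 qualifies, which matches the single z-score below -1 in the list above.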

## Functional vs Loops

Functional transforms (e.g. `map()`, `filter()`, List Comprehensions `[for in if]`) 

    vs.

Imperative loops (e.g. `for ... in ...:`)?

* Functional programming: https://en.wikipedia.org/wiki/Functional_programming 
* Functional programming in python:  https://docs.python.org/3.6/howto/functional.html
    * List Comprehensions
    * Lambda functions 
    

In [32]:
### functional map
list(map(inch_to_cm, heights))

# or list comprehension
[inch_to_cm(i) for i in heights]

[170.18, 172.72, 187.96, 160.02, 175.26, 165.1, 208.28]

In [33]:
### imperative loop
cms = []
for i in heights:
    cms.append(i * 2.54)
cms

[170.18, 172.72, 187.96, 160.02, 175.26, 165.1, 208.28]


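Advantages of functional transforms (e.g. `map()`, `filter()`, list comprehensions `[for in if]`) over writing loops (e.g. `for ... in ...:`)?
* Abstraction: loops are low level
* Composable, re-usable:  z = f(g(x))
* Provenance: each transform returns a new list, so the original and intermediate lists are kept
* Usability: focus on math, not data structure maintenance
* Optimization: built-ins such as `map()` and `filter()` run their loop in native code
* Efficiency: independent per-element operations are easier to run in parallel
* More efficiency: lazy evaluation (`map()` and `filter()` return iterators that compute elements only on demand)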
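
The lazy evaluation point can be observed directly. In this sketch, `inch_to_cm_logged` and the `calls` list are hypothetical helpers added only to record when the function actually runs:

```python
heights = [67, 68, 74, 63, 69, 65, 82]

calls = []   # hypothetical log of which elements have been processed

def inch_to_cm_logged(inch):
    calls.append(inch)
    return inch * 2.54

lazy = map(inch_to_cm_logged, heights)   # nothing is computed yet
calls_before = len(calls)                # 0

first = next(lazy)                       # computes only the first element
calls_after_one = len(calls)             # 1

rest = list(lazy)                        # forces the remaining six elements
print(calls_before, calls_after_one, len(rest))
```

This is why the examples in this notebook wrap `map()` and `filter()` in `list(...)`: without it, the result is an iterator rather than a list.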

In [34]:
### Which one is faster?
import time

# named `values` to avoid shadowing the built-in input()
values = range(1, 10000000)  # 1 .. 9,999,999

# using a for loop with func
start = time.perf_counter()   # perf_counter() is the clock intended for timing code
output = []
for i in values:
    output.append(inch_to_cm(i))
end = time.perf_counter()
print( "%.3fs = for loop with func" % (end-start) )

# using a for loop
start = time.perf_counter()
output = []
for i in values:
    output.append(i * 2.54)
end = time.perf_counter()
print( "%.3fs = for loop" % (end-start) )


# using map with func
start = time.perf_counter()
output = list(map(inch_to_cm, values))
end = time.perf_counter()
print( "%.3fs = map with func" % (end-start) )

# using map
start = time.perf_counter()
output = list(map(lambda i: i * 2.54, values))
end = time.perf_counter()
print( "%.3fs = map" % (end-start) )


# using list comprehension with func
start = time.perf_counter()
output = [inch_to_cm(i) for i in values]
end = time.perf_counter()
print( "%.3fs = list comprehension with func" % (end-start) )

# using list comprehension
start = time.perf_counter()
output = [i * 2.54 for i in values]
end = time.perf_counter()
print( "%.3fs = list comprehension" % (end-start) )

1.521s = for loop with func
1.129s = for loop
0.843s = map with func
0.842s = map
1.054s = list comprehension with func
0.629s = list comprehension
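
These timings come from a single run on one machine, so they will vary. For a less noisy comparison, the standard-library `timeit` module can repeat each measurement. The sketch below uses a smaller input than the cell above (an arbitrary choice so it finishes quickly) and compares only two of the variants:

```python
import timeit

def inch_to_cm(inch):
    return inch * 2.54

values = range(1, 100_000)   # smaller input, chosen arbitrarily for a quick run

candidates = {
    "map with func": lambda: list(map(inch_to_cm, values)),
    "list comprehension": lambda: [i * 2.54 for i in values],
}

timings = {}
for label, func in candidates.items():
    # 3 repeats of 5 calls each; the minimum is usually the least noisy figure
    timings[label] = min(timeit.repeat(func, number=5, repeat=3))
    print("%.3fs = %s" % (timings[label], label))
```

In a notebook, the `%timeit` magic offers the same kind of repeated measurement for a single line.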


## Tuples, Dictionaries, Sets, Strings

Other useful list-like data structures:
* Tuple:  `(1, 2, 3)`
* Dictionary:  `{'a':1, 'b':2, 'c':3}`
* Set: `{1, 2, 3}`

In [35]:
heights

[67, 68, 74, 63, 69, 65, 82]

In [36]:
### tuple
t = (1,2,2,3)
t

(1, 2, 2, 3)

In [44]:
(a,b,c,d) = t
b

2

In [45]:
t[2:]

(2, 3)

In [37]:
### dict key/value pairs
d = {'a':1, 'b':2, 'c':3}
d

{'a': 1, 'b': 2, 'c': 3}

In [38]:
### dict access by key
d['a']

1

In [46]:
### set
s= {4, 1, 2, 2, 3}
s

{1, 2, 3, 4}

In [40]:
### string
x = 'some text'
x[3]

'e'

In [48]:
x.capitalize()

'Some text'
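
The same map and filter patterns carry over to these structures. A brief sketch, reusing the `d` and `x` values from the cells above:

```python
d = {'a': 1, 'b': 2, 'c': 3}
x = 'some text'

# Map over a dictionary's key/value pairs (dict comprehension)
squared = {k: v ** 2 for (k, v) in d.items()}      # {'a': 1, 'b': 4, 'c': 9}

# Filter a dictionary by value, keeping the keys
big_keys = [k for (k, v) in d.items() if v >= 2]   # ['b', 'c']

# A set removes duplicates from a list
unique = set([4, 1, 2, 2, 3])                      # {1, 2, 3, 4}

# A string is a sequence of characters, so slicing and comprehensions apply
upper = ''.join([ch.upper() for ch in x[:4]])      # 'SOME'

print(squared, big_keys, unique, upper)
```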

## Join, Zip

columns -> rows
* table = zip(column, column)
* rows = zip(*columns)
* columns = zip(*rows)

In [49]:
heights

[67, 68, 74, 63, 69, 65, 82]

In [51]:
### names
names = ['yalong','chris','mary','jack','e','f','g']
names

['yalong', 'chris', 'mary', 'jack', 'e', 'f', 'g']

In [55]:
dt = list(zip(names, heights))
dt

[('yalong', 67),
 ('chris', 68),
 ('mary', 74),
 ('jack', 63),
 ('e', 69),
 ('f', 65),
 ('g', 82)]

In [56]:
dt[0][0]

'yalong'
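
The `zip(*rows)` form listed above reverses the operation, turning rows back into columns. A small sketch with a shortened version of the lists:

```python
names = ['yalong', 'chris', 'mary']
heights = [67, 68, 74]

rows = list(zip(names, heights))       # columns -> rows
# [('yalong', 67), ('chris', 68), ('mary', 74)]

columns = list(zip(*rows))             # rows -> columns
# [('yalong', 'chris', 'mary'), (67, 68, 74)]

# zip stops at the shortest input, so unequal columns silently lose data
truncated = list(zip(names, heights[:2]))
# [('yalong', 67), ('chris', 68)]

print(rows, columns, truncated)
```

It is worth checking that the columns have equal length before zipping, since no error is raised when they do not.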

## Data Table = List of Tuples

* `[(1,2,3), (4,5,6)]`
* row-major format = list of rows
* column-major format = list of columns
* https://en.wikipedia.org/wiki/Column-oriented_DBMS


In [57]:
[print(t) for t in dt]

('yalong', 67)
('chris', 68)
('mary', 74)
('jack', 63)
('e', 69)
('f', 65)
('g', 82)


[None, None, None, None, None, None, None]

In [60]:
[t[0] for t in dt if t[1] >= 72]

['mary', 'g']

In [63]:
[name for (name, height) in dt if height >= 72]

['mary', 'g']

In [66]:
sorted(dt, key=lambda t:t[1])

[('jack', 63),
 ('f', 65),
 ('yalong', 67),
 ('chris', 68),
 ('e', 69),
 ('mary', 74),
 ('g', 82)]

In [67]:
sorted(dt, key=lambda (name, height):height)

SyntaxError: invalid syntax (1217731844.py, line 1)
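
The error occurs because Python 3 no longer allows tuple unpacking in a function or lambda parameter list (it was removed by PEP 3113). A few alternatives that keep the intent, sketched on a shortened copy of `dt`; `height_of` is a hypothetical helper name:

```python
from operator import itemgetter

dt = [('yalong', 67), ('chris', 68), ('mary', 74), ('jack', 63)]

# Option 1: index into the tuple, as in the earlier working cell
by_height = sorted(dt, key=lambda t: t[1])

# Option 2: operator.itemgetter builds the same key function
by_height_2 = sorted(dt, key=itemgetter(1))

# Option 3: a named function that unpacks inside its body
def height_of(row):
    (name, height) = row
    return height

by_height_3 = sorted(dt, key=height_of)

print(by_height)
```

Unpacking still works in comprehensions (`for (name, height) in dt`), just not in parameter lists.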

## Parsing data from text files

Table = List of Tuples

* `open()`, `read()`, `splitlines()`, `split()`
* tuple assignment:  `(x,y) = (1,2)`
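
As an alternative sketch, the standard-library `csv` module performs the same parsing and also copes with quoted fields that contain commas, which a plain `split(',')` does not. To stay self-contained, the example reads from an in-memory string holding a few rows in the format of `data/pop-table.csv` rather than from the file itself:

```python
import csv
import io

# Stand-in for the file contents: a header plus a few rows in the same format
sample = (
    "Name,State,Population\n"
    "Aleutians East,AK,2305\n"
    "Anchorage,AK,251335\n"
    "Autauga,AL,39381\n"
)

# For a real file, use: with open(path, newline='') as f:
with io.StringIO(sample) as f:
    reader = csv.reader(f)
    header = next(reader)     # consume the header row
    # tuple assignment in the comprehension names each column
    table = [(county, state, int(pop)) for (county, state, pop) in reader]

print(header)
print(table)
```

The `with` block also closes the file automatically, which the bare `open(...).read()` form leaves to the interpreter.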


In [74]:
text = open('data/pop-table.csv').read()
text

"Name,State,Population\nAleutians East,AK,2305\nAleutians West,AK,5259\nAnchorage,AK,251335\nBethel,AK,15525\nBristol Bay,AK,1023\nDillingham,AK,4360\nFairbanks North Star,AK,83374\nHaines,AK,2181\nJuneau,AK,29378\nKenai Peninsula,AK,46151\nKetchikan Gateway,AK,14422\nKodiak Island,AK,14987\nLake and Peninsula,AK,1787\nMatanuska-Susitna,AK,50686\nNome,AK,8754\nNorth Slope,AK,7024\nNorthwest Arctic,AK,6501\nPrince of Wales-Outer Ketchikan,AK,7109\nSitka,AK,8710\nSkagway-Yakutat-Angoon,AK,4655\nSoutheast Fairbanks,AK,6018\nValdez-Cordova,AK,10475\nWade Hampton,AK,6574\nWrangell-Petersburg,AK,7089\nYukon-Koyukuk,AK,6057\nAutauga,AL,39381\nBaldwin,AL,120198\nBarbour,AL,26469\nBibb,AL,17942\nBlount,AL,42721\nBullock,AL,11149\nButler,AL,21798\nCalhoun,AL,117263\nChambers,AL,37262\nCherokee,AL,21038\nChilton,AL,34912\nChoctaw,AL,16079\nClarke,AL,27993\nClay,AL,13551\nCleburne,AL,13272\nCoffee,AL,42359\nColbert,AL,52586\nConecuh,AL,14022\nCoosa,AL,11680\nCovington,AL,37459\nCrenshaw,AL,13624\n

In [76]:
lines = text.splitlines()
lines

['Name,State,Population',
 'Aleutians East,AK,2305',
 'Aleutians West,AK,5259',
 'Anchorage,AK,251335',
 'Bethel,AK,15525',
 'Bristol Bay,AK,1023',
 'Dillingham,AK,4360',
 'Fairbanks North Star,AK,83374',
 'Haines,AK,2181',
 'Juneau,AK,29378',
 'Kenai Peninsula,AK,46151',
 'Ketchikan Gateway,AK,14422',
 'Kodiak Island,AK,14987',
 'Lake and Peninsula,AK,1787',
 'Matanuska-Susitna,AK,50686',
 'Nome,AK,8754',
 'North Slope,AK,7024',
 'Northwest Arctic,AK,6501',
 'Prince of Wales-Outer Ketchikan,AK,7109',
 'Sitka,AK,8710',
 'Skagway-Yakutat-Angoon,AK,4655',
 'Southeast Fairbanks,AK,6018',
 'Valdez-Cordova,AK,10475',
 'Wade Hampton,AK,6574',
 'Wrangell-Petersburg,AK,7089',
 'Yukon-Koyukuk,AK,6057',
 'Autauga,AL,39381',
 'Baldwin,AL,120198',
 'Barbour,AL,26469',
 'Bibb,AL,17942',
 'Blount,AL,42721',
 'Bullock,AL,11149',
 'Butler,AL,21798',
 'Calhoun,AL,117263',
 'Chambers,AL,37262',
 'Cherokee,AL,21038',
 'Chilton,AL,34912',
 'Choctaw,AL,16079',
 'Clarke,AL,27993',
 'Clay,AL,13551',
 'Clebur

In [83]:
dt2 = [tuple(l.split(',')) for l in lines[1:]]
dt2

[('Aleutians East', 'AK', '2305'),
 ('Aleutians West', 'AK', '5259'),
 ('Anchorage', 'AK', '251335'),
 ('Bethel', 'AK', '15525'),
 ('Bristol Bay', 'AK', '1023'),
 ('Dillingham', 'AK', '4360'),
 ('Fairbanks North Star', 'AK', '83374'),
 ('Haines', 'AK', '2181'),
 ('Juneau', 'AK', '29378'),
 ('Kenai Peninsula', 'AK', '46151'),
 ('Ketchikan Gateway', 'AK', '14422'),
 ('Kodiak Island', 'AK', '14987'),
 ('Lake and Peninsula', 'AK', '1787'),
 ('Matanuska-Susitna', 'AK', '50686'),
 ('Nome', 'AK', '8754'),
 ('North Slope', 'AK', '7024'),
 ('Northwest Arctic', 'AK', '6501'),
 ('Prince of Wales-Outer Ketchikan', 'AK', '7109'),
 ('Sitka', 'AK', '8710'),
 ('Skagway-Yakutat-Angoon', 'AK', '4655'),
 ('Southeast Fairbanks', 'AK', '6018'),
 ('Valdez-Cordova', 'AK', '10475'),
 ('Wade Hampton', 'AK', '6574'),
 ('Wrangell-Petersburg', 'AK', '7089'),
 ('Yukon-Koyukuk', 'AK', '6057'),
 ('Autauga', 'AL', '39381'),
 ('Baldwin', 'AL', '120198'),
 ('Barbour', 'AL', '26469'),
 ('Bibb', 'AL', '17942'),
 ('Blount

In [85]:
[(r[0], r[1], int(r[2])) for r in dt2]

[('Aleutians East', 'AK', 2305),
 ('Aleutians West', 'AK', 5259),
 ('Anchorage', 'AK', 251335),
 ('Bethel', 'AK', 15525),
 ('Bristol Bay', 'AK', 1023),
 ('Dillingham', 'AK', 4360),
 ('Fairbanks North Star', 'AK', 83374),
 ('Haines', 'AK', 2181),
 ('Juneau', 'AK', 29378),
 ('Kenai Peninsula', 'AK', 46151),
 ('Ketchikan Gateway', 'AK', 14422),
 ('Kodiak Island', 'AK', 14987),
 ('Lake and Peninsula', 'AK', 1787),
 ('Matanuska-Susitna', 'AK', 50686),
 ('Nome', 'AK', 8754),
 ('North Slope', 'AK', 7024),
 ('Northwest Arctic', 'AK', 6501),
 ('Prince of Wales-Outer Ketchikan', 'AK', 7109),
 ('Sitka', 'AK', 8710),
 ('Skagway-Yakutat-Angoon', 'AK', 4655),
 ('Southeast Fairbanks', 'AK', 6018),
 ('Valdez-Cordova', 'AK', 10475),
 ('Wade Hampton', 'AK', 6574),
 ('Wrangell-Petersburg', 'AK', 7089),
 ('Yukon-Koyukuk', 'AK', 6057),
 ('Autauga', 'AL', 39381),
 ('Baldwin', 'AL', 120198),
 ('Barbour', 'AL', 26469),
 ('Bibb', 'AL', 17942),
 ('Blount', 'AL', 42721),
 ('Bullock', 'AL', 11149),
 ('Butler', 'A

In [88]:
dt3 = [(c,s,int(p)) for (c,s,p) in dt2]
dt3

[('Aleutians East', 'AK', 2305),
 ('Aleutians West', 'AK', 5259),
 ('Anchorage', 'AK', 251335),
 ('Bethel', 'AK', 15525),
 ('Bristol Bay', 'AK', 1023),
 ('Dillingham', 'AK', 4360),
 ('Fairbanks North Star', 'AK', 83374),
 ('Haines', 'AK', 2181),
 ('Juneau', 'AK', 29378),
 ('Kenai Peninsula', 'AK', 46151),
 ('Ketchikan Gateway', 'AK', 14422),
 ('Kodiak Island', 'AK', 14987),
 ('Lake and Peninsula', 'AK', 1787),
 ('Matanuska-Susitna', 'AK', 50686),
 ('Nome', 'AK', 8754),
 ('North Slope', 'AK', 7024),
 ('Northwest Arctic', 'AK', 6501),
 ('Prince of Wales-Outer Ketchikan', 'AK', 7109),
 ('Sitka', 'AK', 8710),
 ('Skagway-Yakutat-Angoon', 'AK', 4655),
 ('Southeast Fairbanks', 'AK', 6018),
 ('Valdez-Cordova', 'AK', 10475),
 ('Wade Hampton', 'AK', 6574),
 ('Wrangell-Petersburg', 'AK', 7089),
 ('Yukon-Koyukuk', 'AK', 6057),
 ('Autauga', 'AL', 39381),
 ('Baldwin', 'AL', 120198),
 ('Barbour', 'AL', 26469),
 ('Bibb', 'AL', 17942),
 ('Blount', 'AL', 42721),
 ('Bullock', 'AL', 11149),
 ('Butler', 'A

### Exercise
What is the average population of a VA county?

In [91]:
# What transforms to use? And in what order?

# filter, map, reduce

numpy.mean([p for (c,s,p) in dt3 if s == 'VA'])


48664.39705882353
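
The one-liner above combines the three transforms. Spelled out step by step, it might look like the sketch below, which uses a few rows in the shape of `dt3` and a hypothetical helper `mean_population`:

```python
# A few rows in the same (county, state, population) shape as dt3
rows = [
    ('Aleutians East', 'AK', 2305),
    ('Anchorage', 'AK', 251335),
    ('Autauga', 'AL', 39381),
    ('Baldwin', 'AL', 120198),
]

def mean_population(table, state):
    """Hypothetical helper: mean population of the rows for one state."""
    selected = [row for row in table if row[1] == state]    # filter
    populations = [pop for (_, _, pop) in selected]         # map
    return sum(populations) / len(populations)              # reduce

al_mean = mean_population(rows, 'AL')
print(al_mean)   # 79789.5
```

As written, the helper raises `ZeroDivisionError` when no row matches the state, so a caller may want to check for an empty selection first.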