# STA 141B Lecture 4

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Remember to fill out the GitHub Username and Project Group Form (link on Piazza)!
* Discussion is now in Wellman 216
* New TA: Shan

### Topics

* Modules and Packages
* Iteration
    - Loops
    - Comprehensions and Generators
* NumPy

### References

* Python for Data Analysis, Ch. 4
* [Python Data Science Handbook][PDSH], Ch. 2

[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/

## Shell Commands from Jupyter

On macOS and Linux, you can run a shell command from a Jupyter notebook by putting a `!` in front of the command in a code cell. On Windows, most UNIX shell commands won't work from Jupyter without additional configuration.

Jupyter runs each shell command in a temporary subshell, so commands like `cd` will not work the way you'd expect: the directory change is lost as soon as the command finishes.

Most of the time, it's easier to just use the shell in a terminal.

But as an example, to check the working directory:

In [1]:
!pwd

/home/nick/university/teach/sta141b/public/lecture/01.17


In [2]:
!cd ..

In [3]:
!pwd

/home/nick/university/teach/sta141b/public/lecture/01.17
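
If you do need to change the working directory from inside a notebook, one option is to do it in Python instead of in a subshell. The sketch below uses the standard library's `os.getcwd()` and `os.chdir()`; the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

start = os.getcwd()                # same information as !pwd

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                  # changes the directory of the Python process itself
    inside = os.getcwd()           # the change persists across statements (and cells)
    os.chdir(start)                # go back before the temporary directory is deleted

print(start)
print(inside)
```

Unlike `!cd ..`, the change made by `os.chdir()` lasts until you change it again, because it applies to the notebook's own Python process.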


## Modules and Packages

A _module_ is a text file that contains Python code, usually a `.py` file.

Python's `import` statement lets us load code from a module to use in our script or notebook. Note: `import` is like a combination of R's `source()` and `library()` functions.

Python provides many built-in modules for common tasks (see [the list][py-modules]). Packages provide even more modules. 

[py-modules]: https://docs.python.org/3/library/index.html

In [4]:
import math

In [5]:
math.pi

3.141592653589793

In [6]:
import numpy

import foo # imports foo.py in the working directory OR imports the package foo

In [7]:
import numpy as np

In [None]:
# Type np. and press Tab to see a list of everything the module provides
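
As a short sketch of the common forms of `import`, using only the built-in `math` module (the alias `m` is just for illustration):

```python
import math                  # access names with the module prefix: math.sqrt, math.pi
import math as m             # the same module under a shorter alias
from math import sqrt, pi    # copy specific names into the current namespace

print(math.sqrt(16))         # prefix form
print(m.sqrt(16))            # alias form
print(sqrt(16))              # bare name

same_module = m is math      # an alias does not load the module a second time
```

The prefix forms make it obvious where a function comes from, which is why `import numpy as np` is the usual convention for NumPy.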

## Iteration

The three most important methods to repeat code for identical or similar tasks are:

1. Loops (`while` and `for`)
2. Comprehensions, Generators, and `map()`
3. Vectorization (NumPy arrays and functions)

These methods have tradeoffs. In general:

* Loops are the most flexible -- particularly `while` loops.
* Generators tend to use the least memory.
* Vectorization tends to be fastest.
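
To make the three methods concrete, here is a minimal sketch that computes the same sum of squares each way. The variable names are illustrative, and the generator expression and NumPy calls are covered later in these notes.

```python
import numpy as np

n = 10

# 1. Loop
total_loop = 0
for i in range(n):
    total_loop += i**2

# 2. Generator expression
total_gen = sum(i**2 for i in range(n))

# 3. Vectorization
total_vec = int((np.arange(n)**2).sum())

print(total_loop, total_gen, total_vec)
```

All three produce the same number; they differ in flexibility, memory use, and speed.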

There are other methods for iteration, like recursion (more info [here][tp1] and [here][tp2]), but they are not common in statistical computing with Python.

[tp1]: http://greenteapress.com/thinkpython2/html/thinkpython2006.html#sec62
[tp2]: http://greenteapress.com/thinkpython2/html/thinkpython2007.html#sec74

### Loop Tips and Tricks

An _iterable_ object is an object that can be iterated over, element-by-element. Examples: tuples, lists, strings.

Python's for-loops can automatically get elements from iterable objects.

In [11]:
# DO THIS
for x in 'hello':
    print(x)

h
e
l
l
o


In [10]:
# NOT THIS
x = 'hello'
for i in [0, 1, 2, 3, 4]:
    print(x[i])
    

h
e
l
l
o


The `range()` function returns a sequence of integers. The start is included and the end is excluded.

In [13]:
for i in range(1, 5):
    print(i)

1
2
3
4


You can use `list()` to convert objects like ranges to lists.

Generally, you'll only need to do this for visual inspection. You DO NOT need to convert ranges into lists to use them in loops.

In [15]:
list(range(5))

[0, 1, 2, 3, 4]
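
`range()` also accepts an optional third argument, the step. A small sketch (the variable names are illustrative):

```python
by_threes = list(range(2, 10, 3))    # start at 2, step by 3, stop before 10
countdown = list(range(5, 0, -1))    # a negative step counts down; 0 is excluded
n_elements = len(range(1_000_000))   # a range knows its length without storing its elements

print(by_threes)
print(countdown)
print(n_elements)
```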

Iterating over a dictionary directly gives you its keys. To iterate over the keys and values together, use the `.items()` method.

In [17]:
x = {'hello': 1, "goodbye": 2}

for elt in x:
    print(elt, x[elt])

hello 1
goodbye 2


In [18]:
for key, val in x.items():
    print(key, val)

hello 1
goodbye 2


In [20]:
list(x.items())

[('hello', 1), ('goodbye', 2)]
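
Dictionaries also have `.keys()` and `.values()` methods for iterating over just one side of each pair. A brief sketch with the same dictionary (here named `d` to avoid clobbering `x`):

```python
d = {'hello': 1, 'goodbye': 2}

keys = list(d.keys())        # same keys you get from `for key in d`
values = list(d.values())    # values only, in the same order as the keys

total = sum(d.values())      # .values() works anywhere an iterable is expected

print(keys, values, total)
```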

_Zipping_ two sequences together means combining them into a sequence of tuples where:

* The first element of each tuple is an element from the first sequence.
* The second element of each tuple is an element from the second sequence.

Usually it only makes sense to zip sequences that are the same length. If the lengths differ, `zip()` stops as soon as the shortest sequence runs out.

The `zip()` function zips two or more sequences. Use it to iterate over multiple sequences at the same time.

In [21]:
x = [1, 2, 3]
y = [4, 5, 6]

for x_elt, y_elt in zip(x, y):
    print(x_elt, y_elt)

1 4
2 5
3 6


In [24]:
list(zip(x, y, [7, 8, 9]))

[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
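
When the sequences have different lengths, `zip()` silently drops the extra elements. If you want to keep them, the standard library's `itertools.zip_longest()` pads the shorter sequence instead. A minimal sketch with made-up data:

```python
from itertools import zip_longest

a = [1, 2, 3]
b = ['x', 'y']

short = list(zip(a, b))                            # stops at the shorter sequence
padded = list(zip_longest(a, b, fillvalue=None))   # pads the shorter sequence

print(short)
print(padded)
```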

The `enumerate()` function zips together index numbers and a sequence, so each element is paired with its position.

In [29]:
# If you absolutely must use index numbers, at least use enumerate() to get them
x = 'hello'

for i, x_elt in enumerate(x):
    print("Position", i, "is", x_elt)

Position 0 is h
Position 1 is e
Position 2 is l
Position 3 is l
Position 4 is o


In [28]:
list(enumerate(x))

[(0, 'h'), (1, 'e'), (2, 'l'), (3, 'l'), (4, 'o')]
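
`enumerate()` takes an optional `start` argument, which is handy when you want 1-based numbering for display. A short sketch:

```python
word = 'hello'

numbered = list(enumerate(word, start=1))   # counting begins at 1 instead of 0

for i, letter in numbered:
    print("Letter", i, "is", letter)
```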

### Comprehensions and Generators

A _comprehension_ is a Python expression that transforms a sequence, element-by-element. The notation is similar to mathematical set notation:

In [30]:
# {x^2 | x in {0, 1, 2, 3, 4}}
[x**2 for x in range(5)]

[0, 1, 4, 9, 16]

In [32]:
import math
[math.sqrt(x) for x in range(5)] # think of this as Python's lapply()

[0.0, 1.0, 1.4142135623730951, 1.7320508075688772, 2.0]

You can include a condition in a comprehension:

In [36]:
# Get all squares of even numbers from 0...10
x = [x**2 for x in range(11) if x % 2 == 0]
x

[0, 4, 16, 36, 64, 100]

In [37]:
[math.sin(y) for y in x]

[0.0,
 -0.7568024953079282,
 -0.2879033166650653,
 -0.9917788534431158,
 0.9200260381967906,
 -0.5063656411097588]

You can also iterate over subelements.

__This is tricky!__ The outermost iterables always come _first_ in the comprehension, which can be counterintuitive.

In [38]:
x = [[1, 2, 3], [4, 5, 6]]

In [39]:
[elt for sublist in x for elt in sublist]

[1, 2, 3, 4, 5, 6]
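
One way to remember the order is to read the `for` clauses left to right as if they were nested loops, with any condition going last. The sketch below adds a condition to show where it lands in both forms (the names `nested`, `evens_comp`, and `evens_loop` are illustrative):

```python
nested = [[1, 2, 3], [4, 5, 6]]

# Comprehension: outer loop first, inner loop second, condition last
evens_comp = [elt for sublist in nested for elt in sublist if elt % 2 == 0]

# The same computation written as explicit loops, in the same order
evens_loop = []
for sublist in nested:
    for elt in sublist:
        if elt % 2 == 0:
            evens_loop.append(elt)

print(evens_comp)
print(evens_loop)
```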

In [None]:
# Equivalent nested loop: the for clauses appear in the same order
for sublist in x:
    for elt in sublist:
        print(elt)

A comprehension surrounded by `[ ]` is called a _list comprehension_ and produces a list.

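A comprehension surrounded by `{ }` (and including `:`) is called a _dictionary comprehension_ and produces a dictionary. Without the `:`, it is a _set comprehension_ and produces a set, which keeps only one copy of each value.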

In [42]:
x = ["hello", "goodbye"]

lens = {name: len(name) for name in x}
lens

{'hello': 5, 'goodbye': 7}

In [45]:
x = {1, 2, 2, 4}
x

{1, 2, 4}

In [46]:
{x**2 for x in [-1, 0, 1]}

{0, 1}
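
Conditions work in dictionary and set comprehensions the same way they do in list comprehensions. A small sketch with made-up words:

```python
words = ["hello", "goodbye", "hi"]

# Dictionary comprehension with a condition: keep only words longer than 2 letters
long_lens = {w: len(w) for w in words if len(w) > 2}

# Set comprehension: duplicate first letters collapse into one element
first_letters = {w[0] for w in words}

print(long_lens)
print(first_letters)
```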

#### Generator Expressions

There's no such thing as a tuple comprehension. Instead, a comprehension surrounded by `( )` is called a _generator expression_.

In [50]:
y = (x**2 for x in range(11) if x % 2 == 0)
y

<generator object <genexpr> at 0x7f6a5c4796d8>

In [51]:
sum(y) # This also forces evaluation

220

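In [49]:
# Note the cell number: this ran on a fresh generator, before sum(y) above.
# A generator can only be consumed once.
list(y) # This is the line where the computation above actually happens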

[0, 4, 16, 36, 64, 100]
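
Keep in mind that a generator can only be consumed once. The sketch below (with the illustrative name `gen`) shows what happens if you run `sum()` and then `list()` on the same generator object:

```python
gen = (x**2 for x in range(11) if x % 2 == 0)

first_pass = sum(gen)     # consumes every element
second_pass = list(gen)   # nothing is left, so this is empty

print(first_pass)
print(second_pass)
```

If you need the elements more than once, either recreate the generator or store the results in a list.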

In [None]:
# Python's itertools module has functions for manipulating generators and iterable objects
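
As a brief sketch of what `itertools` offers, the example below uses three of its functions: `count()` (an endless counter), `islice()` (take a slice of any iterable), and `chain()` (glue iterables together). The variable names are illustrative.

```python
import itertools

# An infinite generator of even numbers; nothing is computed until we ask
evens = (n for n in itertools.count() if n % 2 == 0)

# islice() takes the first 5 elements without trying to exhaust the generator
first_five = list(itertools.islice(evens, 5))

# chain() iterates over several iterables one after another
chained = list(itertools.chain([1, 2], (3, 4), 'ab'))

print(first_five)
print(chained)
```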

In [52]:
# Strange, very computationally expensive way to find numbers with even squares
squares = (x**2 for x in range(11))

# x doesn't exist out here

even_squares = (x for x in squares if x % 2 == 0)
originals = (math.sqrt(x) for x in even_squares)
list(originals)

[0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

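A _generator_ is a special kind of iterable which computes its elements on demand. Example: generator expressions. Ranges are not technically generators, but they also compute their elements on demand instead of storing them.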

Generators are especially useful for working with data that are too large to fit in memory. While making a huge list (say $10^9$ elements) might use enough memory to crash Python, making a generator with the same number of elements uses almost no memory.
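
You can get a rough sense of the difference with `sys.getsizeof()`. This is only a sketch: the function reports the size of the container object itself (not the elements it refers to), and the exact numbers depend on your Python version and platform.

```python
import sys

n = 100_000

as_list = [x**2 for x in range(n)]   # all elements computed and stored now
as_gen = (x**2 for x in range(n))    # elements computed one at a time, later

list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

print(list_bytes, gen_bytes)

# Both give the same answer when consumed
same_total = sum(as_list) == sum(as_gen)
```

The size of the generator stays the same no matter how large `n` is, while the list grows with `n`.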

You can become a generator ninja and see several examples that use real data [here][beazley].

[beazley]: https://speakerdeck.com/dabeaz/generator-tricks-for-systems-programmers-version-3-dot-0

In [55]:
x = range(1_000_000_000_000)

In [58]:
x_iter = iter(x)

In [61]:
next(x_iter) # Returns 2 here because next() had already been called twice

2
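
`iter()` and `next()` are what a `for` loop uses behind the scenes. A minimal sketch on a two-element list shows the whole protocol, including what happens when the elements run out:

```python
letters = iter(['a', 'b'])      # iter() turns an iterable into an iterator

first = next(letters)           # first element
second = next(letters)          # second element

try:
    next(letters)               # no elements left
    exhausted = False
except StopIteration:
    exhausted = True            # this is how a for-loop knows when to stop

fallback = next(letters, 'done')   # a default value avoids the exception

try:
    next(range(3))              # a range is iterable, but not itself an iterator
    range_is_iterator = True
except TypeError:
    range_is_iterator = False

print(first, second, exhausted, fallback, range_is_iterator)
```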

In [62]:
for x in range(1_000_000_000):
    print(x)
    if x > 10:
        break

0
1
2
3
4
5
6
7
8
9
10
11


In [65]:
def foo(x):
    return x

foo(3)

3

### NumPy

NumPy is a Python package that provides tools for numerical computing (the name stands for "Numerical Python"). Since we're using Anaconda, NumPy is already installed.

NumPy is documented [here](https://docs.scipy.org/doc/numpy/).

In [67]:
import numpy as np

NumPy's core feature is the n-dimensional array, or _ndarray_. NumPy arrays are the basis for almost all of Python's scientific computing packages. They are the Python equivalent of R's built-in vectors.

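NumPy arrays use reference semantics! Assigning an array to a new variable does not copy it, so a change made through one name is visible through the other.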
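
A minimal sketch of what reference semantics means in practice (the variable names are illustrative). Note that basic slices of an array are _views_ of the same data, not copies:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a              # another name for the same array, not a copy
c = a.copy()       # an independent copy
v = a[0:2]         # a basic slice is a view of the same data

b[0] = 100         # also changes a
v[1] = 200         # also changes a

print(a)           # modified through both b and v
print(c)           # unaffected
```

Use `.copy()` whenever you need an array you can modify without affecting the original. This is different from R, where vectors behave as if they are copied on assignment.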

#### Creating NumPy Arrays

You can create NumPy arrays from lists:

In [68]:
np.array([1, 2, 3])

array([1, 2, 3])

In [69]:
x = np.array([1, 2, 3])
y = np.array((4.1, 5.2, 6.3))

You can create multidimensional arrays, like matrices, from nested lists.

In [70]:
m = np.array([[1, 2, 3],
              [4, 5, 6]])
m

array([[1, 2, 3],
       [4, 5, 6]])

NumPy also provides several helper functions to create arrays. See the documentation or references for a full list.

As an example, `np.arange()` is the NumPy equivalent of `range()`.

In [71]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [72]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
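
A few more of these helper functions, as a quick sketch (the variable names are illustrative):

```python
import numpy as np

ones = np.ones((2, 3))               # 2-by-3 array filled with 1.0
grid = np.linspace(0, 1, 5)          # 5 evenly spaced points from 0 to 1, endpoints included
steps = np.arange(0, 10, 2)          # like range(0, 10, 2)
table = np.arange(6).reshape(2, 3)   # rearrange a 1-d array into 2 rows and 3 columns

print(ones)
print(grid)
print(steps)
print(table)
```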

#### Inspecting Arrays

The array attributes `.shape` and `.size` contain information about the structure of the array.

In [75]:
x.shape # like R's dim()

(3,)

In [76]:
m.shape

(2, 3)

In [77]:
x.size # like R's length()

3

In [78]:
m.size

6

In [79]:
type(m)

numpy.ndarray

The array attribute `.dtype` contains the data type of the array's elements.

In [80]:
x.dtype

dtype('int64')

In [82]:
x

array([1, 2, 3])

In [81]:
y.dtype

dtype('float64')

In [83]:
y

array([4.1, 5.2, 6.3])

See [here](https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html) or [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html#NumPy-Standard-Data-Types) for a complete list of NumPy data types.
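
NumPy picks a data type automatically, but you can also set it yourself with the `dtype` argument or convert an existing array with `.astype()`. A short sketch with illustrative names:

```python
import numpy as np

as_float = np.array([1, 2, 3], dtype=np.float64)      # request floats explicitly
mixed = np.array([1, 2.5])                            # integers are upcast to float
truncated = np.array([1.9, -1.9]).astype(np.int64)    # conversion truncates toward zero

print(as_float.dtype, mixed.dtype)
print(truncated)
```

Every element of an array has the same type, which is one reason arrays are more compact and faster than lists.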

#### Vectorization

Arithmetic is vectorized for NumPy arrays, which means arithmetic operators are applied element-by-element.

In [84]:
x + y

array([5.1, 7.2, 9.3])

Many of NumPy's functions are also vectorized. In NumPy jargon, vectorized functions are also called _universal functions_ or _ufuncs_.

In [85]:
np.sin(x)

array([0.84147098, 0.90929743, 0.14112001])
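
To connect vectorization back to the earlier iteration methods, the sketch below computes the same result with a list comprehension and with a ufunc, and shows two more vectorized operations. The arrays match `x` and `y` above; the other names are illustrative.

```python
import math
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4.1, 5.2, 6.3])

looped = [math.sin(v) for v in x]   # one Python-level call per element
vectorized = np.sin(x)              # one call for the whole array

scaled = 2 * x + 1                  # a scalar is applied to every element
product = x * y                     # element-by-element, not a dot product

print(vectorized)
print(scaled)
print(product)
```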

#### Indexing

You can subset NumPy arrays with indexes or Boolean arrays. Again, this is similar to R.

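__Be careful!__ Python uses `and` and `or` to combine conditions, but NumPy uses `&` and `|` to combine Boolean arrays element-by-element. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators like `==` and `>`.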

In [87]:
x[0]

1

In [88]:
x[0:2]

array([1, 2])

In [89]:
m[1, 1]

5

In [90]:
m

array([[1, 2, 3],
       [4, 5, 6]])

In [93]:
x[x % 2 == 0]

array([2])
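
A sketch of combining conditions when subsetting, using a made-up array `a`. The last part shows that `and` raises an error for arrays with more than one element:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

both = a[(a > 2) & (a % 2 == 0)]     # greater than 2 AND even
either = a[(a < 2) | (a > 5)]        # less than 2 OR greater than 5
neither = a[~((a < 2) | (a > 5))]    # ~ negates a Boolean array

try:
    a[(a > 2) and (a % 2 == 0)]      # `and` needs a single True/False value
    and_works = True
except ValueError:
    and_works = False

print(both, either, neither, and_works)
```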

In multidimensional arrays, separate indexes for each dimension with commas. The "bare" slice `:` selects everything in one dimension.

__Be careful!__ When subsetting, remember to use `:` where you would use a blank in R.

In [95]:
m

array([[1, 2, 3],
       [4, 5, 6]])

In [98]:
m[:, 0] # this is like R's m[, 1]

array([1, 4])
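
A few more subsetting patterns for two-dimensional arrays, sketched with the same matrix `m` (the other names are illustrative):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

first_row = m[0, :]        # like R's m[1, ]
last_two_cols = m[:, 1:]   # every row, columns 1 onward
big = m[m > 2]             # a Boolean array selects elements and returns a 1-d array

print(first_row)
print(last_two_cols)
print(big)
```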

#### What else can NumPy do?

NumPy also provides functions for:

* Linear algebra (multiplication, transposition, decomposition, ...)
* Random number generation
* Elementary statistics
* Signal processing
* And more...

There isn't time to cover these in detail in lecture, but you can learn more from the documentation and references.
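
As a small taste, here is a sketch that touches linear algebra, statistics, and random number generation. It assumes NumPy 1.17 or newer for `np.random.default_rng()`; the seed value and variable names are arbitrary.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

gram = m @ m.T                      # matrix multiplication and transposition
col_means = m.mean(axis=0)          # mean of each column

rng = np.random.default_rng(141)    # seeded generator, so results are reproducible
draws = rng.normal(size=4)          # four draws from a standard normal distribution

print(gram)
print(col_means)
print(draws)
```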