# STA 141B Lecture 4

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Remember to fill out the GitHub Username and Project Group Form (link on Piazza)!

### Topics

* Modules and Packages
* Iteration
    - Loops
    - Comprehensions and Generators
* NumPy

### References

* Python for Data Analysis, Ch. 4
* [Python Data Science Handbook][PDSH], Ch. 2

[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/

## Shell Commands from Jupyter

On macOS and Linux, you can run shell commands from a Jupyter notebook by putting a `!` in front of the command in a code cell. On Windows, most UNIX shell commands won't work from Jupyter without additional configuration.

Jupyter runs shell commands in a temporary subshell, so commands like `cd` will not work the way you'd expect.

Most of the time, it's easier to just use the shell in a terminal.

But as an example, to check the working directory:
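In a notebook cell on macOS or Linux, the command would be `!pwd`. Because the `!` syntax only works inside Jupyter, the sketch below shows a plain-Python way to get the same information using the standard library. It is an illustration, not the cell from the original notebook.

```python
import os

# Ask Python for the current working directory (works on every OS).
cwd = os.getcwd()
print(cwd)

# In a Jupyter code cell on macOS or Linux, the shell equivalent is:
#   !pwd
```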

## Modules

A _module_ is a text file that contains Python code, usually a `.py` file.

Python's `import` statement lets us load code from a module to use in our script or notebook. Note: `import` is like a combination of R's `source()` and `library()` functions.

Python provides many built-in modules for common tasks (see [the list][py-modules]). Packages provide even more modules. 
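
As a small illustration, here are the three most common forms of `import`, using only built-in modules. The particular modules and variable names are chosen just for this sketch.

```python
# Import a whole module, then access its contents through the module name.
import math
root = math.sqrt(16)

# Import a module under a shorter alias.
import statistics as stats
avg = stats.mean([1, 2, 3, 4])

# Import a single name from a module.
from os.path import join
path = join("data", "file.csv")   # hypothetical file path

print(root, avg, path)
```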

[py-modules]: https://docs.python.org/3/library/index.html

## Iteration

The three most important methods to repeat code for identical or similar tasks are:

1. Loops (`while` and `for`)
2. Comprehensions, Generators, and `map()`
3. Vectorization (NumPy arrays and functions)

These methods have tradeoffs. In general:

* Loops are the most flexible -- particularly `while` loops.
* Generators tend to use the least memory.
* Vectorization tends to be fastest.

There are other methods for iteration, like recursion (more info [here][tp1] and [here][tp2]), but they are not common in statistical computing with Python.

[tp1]: http://greenteapress.com/thinkpython2/html/thinkpython2006.html#sec62
[tp2]: http://greenteapress.com/thinkpython2/html/thinkpython2007.html#sec74

### Loop Tips and Tricks

An _iterable_ object is an object that can be iterated over, element-by-element. Examples: tuples, lists, strings

Python's for-loops can automatically get elements from iterable objects.
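
A minimal sketch of looping directly over two kinds of iterables (the values are made up for illustration):

```python
# A for-loop pulls elements out of an iterable one at a time.
letters = []
for ch in "abc":          # strings yield their characters
    letters.append(ch)

total = 0
for x in (1, 2, 3):       # tuples yield their elements
    total += x

print(letters, total)
```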

The `range()` function returns a sequence of integers.

You can use `list()` to convert objects like ranges to lists.

Generally, you'll only need to do this for visual inspection. You DO NOT need to convert ranges into lists to use them in loops.
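
The sketch below shows both uses: converting a range to a list to look at it, and looping over a range directly. The specific start, stop, and step values are arbitrary.

```python
r = range(5)              # the integers 0, 1, 2, 3, 4 (the endpoint is excluded)
as_list = list(r)         # convert only to inspect the values

squares = []
for i in range(2, 10, 3): # start at 2, stop before 10, step by 3
    squares.append(i ** 2)

print(r, as_list, squares)
```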

You can iterate over the key-value pairs of a dictionary with the `.items()` method. Iterating over a dictionary directly yields only its keys.
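
For example, with a small made-up dictionary:

```python
ages = {"Ann": 31, "Bob": 27}    # hypothetical data

lines = []
for name, age in ages.items():   # each item is a (key, value) tuple
    lines.append(f"{name} is {age}")

keys = [k for k in ages]         # plain iteration gives the keys only

print(lines, keys)
```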

_Zipping_ two sequences together means combining them into a sequence of tuples where:

* The first element of each tuple is an element from the first sequence.
* The second element of each tuple is an element from the second sequence.

Usually it only makes sense to zip sequences that are the same length.

The `zip()` function zips two or more sequences. Use it to iterate over multiple sequences at the same time.
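
A short sketch with made-up data. Note that `zip()` is lazy in Python 3, so `list()` is used here only to display the tuples.

```python
names = ["x", "y", "z"]
values = [1, 2, 3]

pairs = list(zip(names, values))   # materialize the tuples for inspection

labels = []
for name, value in zip(names, values):
    labels.append(f"{name}={value}")

# If the lengths differ, zip() stops at the shortest input.
short = list(zip(names, [10, 20]))

print(pairs, labels, short)
```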

The `enumerate()` function zips together index numbers and a sequence. In other words, the function enumerates a sequence.
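
A minimal example, again with made-up data:

```python
fruits = ["apple", "banana", "cherry"]   # hypothetical data

pairs = list(enumerate(fruits))          # (index, element) tuples

labels = []
for i, fruit in enumerate(fruits):
    labels.append(f"{i}: {fruit}")

print(pairs, labels)
```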

### Comprehensions and Generators

A _comprehension_ is a Python expression that transforms a sequence, element-by-element. The notation is similar to mathematical set notation:
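
For instance, the set $\{x^2 : x \in \{0, 1, 2, 3, 4\}\}$ could be written as the comprehension below. This is a minimal sketch, shown next to the equivalent loop for comparison.

```python
# Comprehension: "x squared, for each x in 0..4".
squares = [x ** 2 for x in range(5)]

# The same computation written as an ordinary loop.
squares_loop = []
for x in range(5):
    squares_loop.append(x ** 2)

print(squares)
```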

You can include a condition in a comprehension:
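
A small sketch: the `if` clause at the end filters which elements are kept.

```python
# Keep only the even numbers, then square them.
even_squares = [x ** 2 for x in range(10) if x % 2 == 0]

print(even_squares)
```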

You can also iterate over subelements.

__This is tricky!__ The outermost iterables always come _first_ in the comprehension, which can be counterintuitive.
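
One way to remember the order is that the `for` clauses appear in the same order as they would in nested loops. A sketch with a made-up nested list:

```python
rows = [[1, 2], [3, 4], [5, 6]]   # hypothetical nested data

# Outer loop (over rows) comes first, inner loop (over each row) second.
flat = [x for row in rows for x in row]

# The equivalent nested loops, in the same order.
flat_loop = []
for row in rows:
    for x in row:
        flat_loop.append(x)

print(flat)
```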

A comprehension surrounded by `[ ]` is called a _list comprehension_ and produces a list.

A comprehension surrounded by `{ }` (and including `:`) is called a _dictionary comprehension_ and produces a dictionary.
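
The two forms side by side, using a made-up list of words:

```python
words = ["a", "bb", "ccc"]   # hypothetical data

# List comprehension: square brackets, one expression per element.
lengths_list = [len(w) for w in words]

# Dictionary comprehension: curly braces and a key: value pair.
lengths_dict = {w: len(w) for w in words}

print(lengths_list, lengths_dict)
```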

#### Generator Expressions

There's no such thing as a tuple comprehension. Instead, a comprehension surrounded by `( )` is called a _generator expression_.

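A _generator_ is a special kind of iterable which computes its elements on demand. Example: generator expressions. Ranges also compute their elements on demand, although strictly speaking a range is a lazy sequence rather than a generator.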

Generators are especially useful for working with data that are too large to fit in memory. While making a huge list (say $10^9$ elements) might use enough memory to crash Python, making a generator with the same number of elements uses almost no memory.
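
The sketch below compares a list with a generator expression over the same elements. It uses $10^6$ elements rather than $10^9$ so that it runs quickly. `sys.getsizeof()` reports only the size of the container itself, so the numbers are a rough indication, not a full memory profile.

```python
import sys

n = 10 ** 6

big_list = [x for x in range(n)]   # all elements stored at once
gen = (x for x in range(n))        # elements computed one at a time

list_bytes = sys.getsizeof(big_list)
gen_bytes = sys.getsizeof(gen)
print(list_bytes, gen_bytes)

# Elements are produced only when requested.
first = next(gen)

# A generator expression can be passed straight to functions like sum().
total = sum(x for x in range(n))
print(first, total)
```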

You can become a generator ninja and see several examples that use real data [here][beazley].

[beazley]: https://speakerdeck.com/dabeaz/generator-tricks-for-systems-programmers-version-3-dot-0

### NumPy

NumPy is a Python package that provides tools for numerical computing (the name stands for "Numerical Python"). Since we're using Anaconda, NumPy is already installed.

NumPy is documented [here](https://docs.scipy.org/doc/numpy/).

```python
import numpy as np
```

NumPy's core feature is the n-dimensional array, or _ndarray_. NumPy arrays are the basis for almost all of Python's scientific computing packages. They are the Python equivalent of R's built-in vectors.

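NumPy arrays use reference semantics! Assigning an array to a new variable does not copy it; both names refer to the same data, so use the `.copy()` method when you need an independent copy.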
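
A minimal sketch of what this means in practice (the values are arbitrary):

```python
import numpy as np

a = np.array([1, 2, 3])
b = a              # b is another name for the same array, not a copy
b[0] = 99          # this also changes a

c = a.copy()       # an independent copy
c[1] = -1          # a is unaffected

print(a, b, c)
```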

#### Creating NumPy Arrays

You can create NumPy arrays from lists:
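
For example (the values are arbitrary):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1.5, 2, 3])   # mixing ints and floats gives a float array

print(x, y)
```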

You can create multidimensional arrays, like matrices, from nested lists.
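
A sketch of a 2-by-3 matrix, where each inner list becomes one row:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m)
print(m.ndim)   # number of dimensions
```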

NumPy also provides several helper functions to create arrays. See the documentation or references for a full list.

As an example, `np.arange()` is the NumPy equivalent of `range()`.
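
A few of these helpers are sketched below. Besides `np.arange()`, the choice of `np.zeros()` and `np.linspace()` is just for illustration; they are not singled out in the lecture.

```python
import numpy as np

a = np.arange(5)            # like range(5), but returns an array
b = np.arange(2, 10, 3)     # start at 2, stop before 10, step by 3
z = np.zeros(3)             # array of three zeros (floats)
grid = np.linspace(0, 1, 5) # 5 evenly spaced points from 0 to 1, inclusive

print(a, b, z, grid)
```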

#### Inspecting Arrays

The array attributes `.shape` and `.size` contain information about the structure of the array.

The array attribute `.dtype` contains the data type of the array's elements.
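
For example, for a small made-up matrix of floats:

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(m.shape)   # length along each dimension: (rows, columns)
print(m.size)    # total number of elements
print(m.dtype)   # data type shared by all elements
```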

See [here](https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html) or [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html#NumPy-Standard-Data-Types) for a complete list of NumPy data types.

#### Vectorization

Arithmetic is vectorized for NumPy arrays, which means arithmetic operators are applied element-by-element.

Many of NumPy's functions are also vectorized. In NumPy jargon, vectorized functions are also called _universal functions_ or _ufuncs_.
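
A short sketch of vectorized arithmetic and one ufunc, `np.sqrt()`, on made-up values:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([10.0, 20.0, 30.0])

total = x + y        # element-by-element addition
scaled = 2 * x       # the scalar is applied to every element
roots = np.sqrt(x)   # a ufunc: square root of every element

print(total, scaled, roots)
```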

#### Indexing

You can subset NumPy arrays with indexes or Boolean arrays. Again, this is similar to R.

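__Be careful!__ Python uses `and` and `or` to combine conditions, but NumPy uses `&` and `|` to combine Boolean arrays element-by-element. Because `&` and `|` bind more tightly than comparison operators, wrap each condition in parentheses.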
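
A sketch of the different ways to subset a one-dimensional array (the values are arbitrary):

```python
import numpy as np

x = np.array([3, 8, 1, 9, 4])

first = x[0]                        # a single index
middle = x[1:3]                     # a slice
picked = x[[0, 3]]                  # a list of indexes

mask = x > 3                        # Boolean array, one entry per element
large = x[mask]                     # keep elements where mask is True

# Combine conditions with & and |, with parentheses around each one.
between = x[(x > 3) & (x < 9)]
extremes = x[(x < 2) | (x > 8)]

print(first, middle, picked, large, between, extremes)
```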

In multidimensional arrays, separate indexes for each dimension with commas. The "bare" slice `:` selects everything in one dimension.

__Be careful!__ When subsetting, remember to use `:` where you would use a blank in R.
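
For example, with a small 3-by-4 matrix built for illustration. The R equivalents in the comments assume 1-based indexing.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

element = m[1, 2]      # row 1, column 2 (counting from 0)
row = m[0, :]          # entire first row; like m[1, ] in R
col = m[:, 1]          # entire second column; like m[, 2] in R
block = m[:2, 1:3]     # first two rows, columns 1 and 2

print(element, row, col, block)
```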

#### What else can NumPy do?

NumPy also provides functions for:

* Linear algebra (multiplication, transposition, decomposition, ...)
* Random number generation
* Elementary statistics
* Signal processing
* And more...

There isn't time to cover these in detail in lecture, but you can learn more from the documentation and references.


#### An Example

Consider a circle with radius 1 circumscribed by a square with side length 2.

The area of the circle is $\pi$, so for a uniform distribution on the square, the probability a point will fall in the circle is $\pi / 4$.

We can estimate this probability by simulation and multiply the estimate by 4 to estimate $\pi$.
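
One possible vectorized sketch of this simulation is below. The seed, the sample size, and the use of `np.random.default_rng()` (available in NumPy 1.17 and later) are assumptions made for this illustration, not choices taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(141)   # arbitrary seed, for reproducibility
n = 1_000_000                      # number of simulated points

# Draw points uniformly on the square [-1, 1] x [-1, 1].
x = rng.uniform(-1, 1, size=n)
y = rng.uniform(-1, 1, size=n)

# A point is inside the unit circle if its distance from the origin is <= 1.
inside = x ** 2 + y ** 2 <= 1

p_hat = inside.mean()   # proportion of points inside the circle
pi_hat = 4 * p_hat      # since the probability is pi / 4

print(pi_hat)
```

Because everything is done with array operations, no explicit Python loop is needed. The estimate gets closer to $\pi$ as `n` grows.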