# Python Tutorial Notebook


## Why Python?

- Accessible language
- Great scientific libraries and support, especially for machine/deep learning
- Jupyter notebooks are great environments for exploring, visualizing, and sharing data analyses


## Python Tips and Gotchas



### I've got (lots of) blank spaces

- Python uses indentation (leading whitespace) to define code blocks, instead of braces or keywords
- Stay consistent with your spacing (2 or 4 spaces) to cleanly delineate code blocks

*Tip: In Colab, you can change your preferred indentation under Tools -> Preferences... -> Editor*
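
As a minimal sketch of how indentation delineates blocks, consider the small function below. The name `describe` is made up for illustration and is not used elsewhere in this notebook.

```python
def describe(num):
  # the body of the function is indented one level
  if num % 2 == 1:
    # the body of the if block is indented one level further
    return "odd"
  # dedenting back to the function level ends the if block
  return "even"

labels = [describe(n) for n in range(4)]
print(labels)
```

Mixing indentation widths inside a single block is a syntax error, which is why staying consistent matters.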


### Comprehensions



#### List comprehensions

- list creation with loops can be made "in-line" with **list comprehensions**
- often results in more concise code

#### Basic Syntax

```python

result_list = [output_exp for var in input_list if (condition on var is true)]
```

In [None]:
nums = range(100000)
squares = []

for num in nums:
  squares.append(num**2)

In [None]:
nums = range(100000)
squares = [num**2 for num in nums]

In [None]:
nums = range(100000)
odd_squares = []

for num in nums:
  if num % 2 == 1:
    odd_squares.append(num**2)

In [None]:
# list comprehensions are (often) more concise
nums = range(100000)
odd_squares = [num ** 2 for num in nums if num % 2 == 1]

#### Dict comprehensions

In [None]:
nums = range(100000)
odd_dict = {}

for num in nums:
  if num % 2 == 1:
    odd_dict[num] = num **2

In [None]:
nums = range(100000)
odd_dict = {num : num ** 2 for num in nums if num % 2 == 1}
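
The same syntax works over any iterable, not just `range`. Here is a small sketch that builds a dict from two parallel lists with `zip`; the lists `names` and `values` are hypothetical examples.

```python
names = ["ones", "twos", "threes"]
values = [1, 2, 3]

# pair the two lists element by element, keeping only the odd values
odd_lookup = {name: value for name, value in zip(names, values) if value % 2 == 1}
print(odd_lookup)
```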


### If it seems like (notebook) Magic, that's because it is

- Jupyter notebooks provide built-in commands called **magics** to address common problems in the data analysis workflow

- *line* magics are denoted by a single `%`
- *cell* magics that apply to the entire cell are denoted by a double `%%`


#### Timing

- likely the most useful magic functions are `time` and `timeit`

- `time` measures how long a single run of the line of code or cell takes
- `timeit` runs multiple trials of the line of code or cell to provide a more accurate measurement of the execution time

In [None]:
%time odd_squares = [num ** 2 for num in range(100000) if num % 2 == 1]

CPU times: user 21.1 ms, sys: 1 ms, total: 22.1 ms
Wall time: 23.6 ms


In [None]:
%timeit odd_squares = [num ** 2 for num in range(100000) if num % 2 == 1]

100 loops, best of 3: 18.6 ms per loop


In [None]:
%%time
# we can also time entire cells; a cell magic must be the first line of the cell

nums = range(100000)
odd_squares = []

for num in nums:
  if num % 2 == 1:
    odd_squares.append(num**2)

CPU times: user 28.4 ms, sys: 1.99 ms, total: 30.4 ms
Wall time: 36 ms


In [None]:
%%timeit

nums = range(100000)
odd_squares = []

for num in nums:
  if num % 2 == 1:
    odd_squares.append(num**2)

10 loops, best of 3: 21.5 ms per loop


**Interpreting timeit output**

- x loops: the number of loops (factors of 10) of the code needed to exceed 0.2 seconds of wall time
- best of 3: run the loop trials three times, take the best results
- x ms per loop: the amount of time the code took to execute under the best run
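
Magics only work inside IPython and Jupyter. In a plain Python script, the standard library `timeit` module gives a comparable measurement. The sketch below assumes 3 trials of 10 loops each; those counts are picked by hand for illustration, whereas `%timeit` chooses the loop count automatically.

```python
import timeit

stmt = "[num ** 2 for num in range(100000) if num % 2 == 1]"

# run 3 trials of 10 loops each; each entry is the total time for one trial
trial_times = timeit.repeat(stmt, repeat=3, number=10)

# mirror the "best of 3 ... per loop" idea: take the fastest trial, divide by the loop count
best_per_loop = min(trial_times) / 10
print("best of 3: {:.2f} ms per loop".format(best_per_loop * 1000))
```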

#### Other useful Magics

- `%who_ls`: lists all active variables in the namespace
- `%reset`: resets the namespace and all named variables
- `%debug`: activate the python debugger
- `%lsmagic`: list all the available magic commands

### Out of Scope?

- A quirk of Python variables is that they are scoped to the innermost enclosing function (or to the module, outside any function); control blocks like `if` and `for` do not create a new scope

In [None]:
# Will this code throw an exception? What is the output?
if True:
  x = "I've been initialized!"
print(x)

In [None]:
# Will this code throw an exception? What is the output?
if False:
  x = "I've been initialized!"
print(x)
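
To see both outcomes without depending on what earlier cells left in the notebook's namespace, here is a sketch that wraps each case in its own function. The function names are made up for illustration. Note that in a notebook where the first cell has already run, `x` still exists globally, so the second cell would print it rather than fail.

```python
def assigned_in_taken_branch():
  if True:
    msg = "I've been initialized!"
  # msg is still visible here: the if block did not create a new scope
  return msg

def assigned_in_skipped_branch():
  if False:
    msg = "I've been initialized!"
  # the assignment never ran, so msg has no value in this function
  return msg

print(assigned_in_taken_branch())

try:
  assigned_in_skipped_branch()
except NameError as err:
  # inside a function this is an UnboundLocalError, a subclass of NameError
  print("raised:", type(err).__name__)
```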

## Case Study: YamSlam!
![](https://alliance.seas.upenn.edu/~cis520/dynamic/2019/wiki/images/yamslam.png)
- Given 3 chances to roll, how likely is it that you will roll 5 of a kind?
- Strategy: Pick the most common #, and re-roll dice that don't match

### Imports

- we need to import the NumPy module below in order to use it
- we often use the `as` syntax to alias the module name to an abbreviation

In [None]:
# subsequent references to the numpy module can use 'np'
import numpy as np

### Rolling 5 Dice



```python
y = np.zeros(5)       
roll_idx = np.array(range(5))
y[roll_idx] = np.floor(np.random.uniform(0,6, roll_idx.shape))
```

---


- `np.zeros(5)` creates an array of zeros of shape (5,)
  - **Note:** this is a 1D array with 5 elements, which is different from a 2D array with only one column, shape (5,1)
- `np.array(range(5))` creates an array \[0,1,2,3,4\] of shape (5,)
- `np.random.uniform(0,6, roll_idx.shape)` samples from the uniform distribution in the range \[0, 6) in the given shape
- `np.floor` rounds down to the nearest integer
  - **Note:** we're using zero-indexed dice, so our possible rolls are \[0,1,2,3,4,5\]
  
*Tip: If you are unsure of a function's arguments or return signature, you can run `help(function_name)` to print the docstring*
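
As an aside, and not the approach this notebook uses, `np.random.randint` can draw integer rolls directly without the `np.floor` step. This is a sketch of that alternative. With the same seed it will generally produce different rolls than the `uniform` version, and the result is an integer array instead of a float one.

```python
import numpy as np

np.random.seed(0)
# randint samples integers from [low, high), so this yields values 0 through 5
rolls = np.random.randint(0, 6, size=5)
print(rolls)
print(rolls.shape, rolls.dtype)
```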

In [None]:
help(np.random.uniform)

#### What is the value and shape of y?

In [None]:
y = np.zeros(5)       
roll_idx = np.array(range(5))
y[roll_idx] = np.floor(np.random.uniform(0,6, roll_idx.shape))
print(y)
print(y.shape)

In [None]:
y = np.zeros((5,1))       
roll_idx = np.array(range(5))
# note the need for a second index, since y is now a 2D array!
y[roll_idx, 0] = np.floor(np.random.uniform(0,6, roll_idx.shape))
print(y)
print(y.shape)

In [None]:
y = np.zeros((5,5))       
roll_idx = np.array(range(5))
# : is shorthand for selecting all indices along an axis. More on indexing later
# What happens if we run y[:, roll_idx] instead?
y[roll_idx, :] = np.floor(np.random.uniform(0,6, roll_idx.shape))
print(y)
print(y.shape)
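
With random values it is hard to see how the assignment above fills the array. The sketch below swaps in a fixed vector `roll` (an assumption, purely for illustration) so the broadcasting pattern is visible.

```python
import numpy as np

roll = np.array([0., 1., 2., 3., 4.])
roll_idx = np.array(range(5))

rows = np.zeros((5, 5))
# the (5,) vector is broadcast across rows: every row becomes a copy of roll
rows[roll_idx, :] = roll

same = np.zeros((5, 5))
# the target is still (5, 5), so the (5,) vector broadcasts the same way
same[:, roll_idx] = roll

cols = np.zeros((5, 5))
# to fill column by column instead, reshape roll to (5, 1)
cols[:, roll_idx] = roll.reshape(5, 1)

print(rows)
print(cols)
```

Here `same` ends up identical to `rows`, while `cols` is its transpose. The shape of the right-hand side, not the position of the `:`, decides how the values are spread.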

### Re-rolling

#### Random number generation
```python
np.random.seed(0)
```

---

- `np.random.seed(0)` sets the seed of NumPy's random number generator to 0, ensuring that subsequent calls to `np.random` functions are reproducible
  - note that any `int`, not just 0, will work as a reproducible seed
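
A quick sketch of what "reproducible" means here: re-seeding with the same value restarts the random sequence, so the same draw comes out again. The variable names `first`, `second`, and `third` are only for illustration.

```python
import numpy as np

np.random.seed(0)
first = np.floor(np.random.uniform(0, 6, 5))

# re-seeding with the same value restarts the sequence, so the draw repeats
np.random.seed(0)
second = np.floor(np.random.uniform(0, 6, 5))

# without re-seeding, the generator simply continues from where it left off
third = np.floor(np.random.uniform(0, 6, 5))

print(np.array_equal(first, second))
```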

In [None]:
# Here we're using the random seed to make sure our first roll is reproducible
np.random.seed(0)
y = np.zeros(5)       
roll_idx = np.array(range(5))
y[roll_idx] = np.floor(np.random.uniform(0,6, roll_idx.shape))
print(y)

[3. 4. 3. 3. 2.]


- 3 is the most common, so we want to re-roll indices 1 and 4

#### How do we get the indices to re-roll?

In [None]:
# check which entries of y aren't 3 
y != 3

array([False,  True, False, False,  True])

In [None]:
# return the indices that are "True" -- interpreted as non-zero by Numpy
# note the return shape! np.nonzero returns a tuple of arrays, one for each axis
np.nonzero(y != 3)

(array([1, 4]),)

In [None]:
# assign the new roll_idx
roll_idx = np.nonzero(y != 3)[0]
print(roll_idx)

[1 4]
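
For 1D arrays, `np.flatnonzero` is a possible shortcut that skips the tuple unpacking. The sketch below hard-codes the roll from above so it runs on its own.

```python
import numpy as np

y = np.array([3., 4., 3., 3., 2.])

# np.nonzero returns a tuple with one index array per axis
via_nonzero = np.nonzero(y != 3)[0]

# np.flatnonzero returns the index array directly (for the flattened input)
via_flatnonzero = np.flatnonzero(y != 3)

print(via_nonzero, via_flatnonzero)
```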


#### The general re-roll case


In [None]:
# count the number of dice we've rolled for each number
counts = [sum(y == i) for i in range(6)]
print(counts)

[0, 0, 1, 3, 1, 0]


In [None]:
# Find the idx of the most common roll
max_idx = np.argmax(counts)
print(max_idx)

3


In [None]:
# update the indices to re-roll accordingly
roll_idx = np.nonzero(y != max_idx)[0]    
print(roll_idx)

[1 4]


### Putting it all together

In [None]:
def yamslam():
  """Plays one round of yamslam, re-rolling 5 dice up to 3 times.
  
  Also prints exuberantly if we do get a yamslam.
  
  Returns:
    int: 1 if we got a yamslam, 0 if not
  """
  y = np.zeros(5)       
  roll_idx = np.array(range(5))
  for reroll in range(3): # 3 rolls in total: the initial roll plus up to 2 re-rolls
    y[roll_idx] = np.floor(np.random.uniform(0,6,roll_idx.shape))
    counts = [sum(y == i) for i in range(6)]
    max_idx = np.argmax(counts)
    
    if np.max(counts) == 5:
      print('YAMSLAM!')
      return 1
    
    roll_idx = np.nonzero(y != max_idx)[0]  
    
  # we've run all 3 re-rolls but still didn't get a yamslam
  return 0      

#### Running multiple trials

In [None]:
%%time

yamslam_trials = []
for i in range(100):
  yamslam_trials.append(yamslam())
  
print("Probability of yamslam: {}".format(np.mean(yamslam_trials)))

YAMSLAM!
YAMSLAM!
YAMSLAM!
YAMSLAM!
Probability of yamslam: 0.04
CPU times: user 54.2 ms, sys: 6.4 ms, total: 60.6 ms
Wall time: 55.3 ms


#### Improving the code

```python
counts = [sum(y == i) for i in range(6)]
max_idx = np.argmax(counts)
```

---

- `np.bincount` produces the same result for `counts` as our list comprehension, provided the rolls are first cast to integers (e.g. `y.astype(int)`)

- More importantly, what is the above code effectively calculating?
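
On the `np.bincount` point, here is a minimal sketch comparing it with the list comprehension. It hard-codes the seeded roll from earlier so it runs on its own. The `minlength=6` argument is an addition here, to guarantee one slot per die face.

```python
import numpy as np

y = np.array([3., 4., 3., 3., 2.])

counts_loop = [sum(y == i) for i in range(6)]

# bincount needs non-negative integers, so cast the float rolls first;
# minlength=6 keeps a slot for every face even if it was never rolled
counts_bin = np.bincount(y.astype(int), minlength=6)

print(counts_bin)
print(np.argmax(counts_bin))
```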

In [None]:
# We're calculating the mode!
from scipy.stats import mode

np.random.seed(0)
y = np.zeros(5)       
roll_idx = np.array(range(5))
y[roll_idx] = np.floor(np.random.uniform(0,6, roll_idx.shape))
roll_idx = np.nonzero(y != mode(y)[0])[0]  

*Tip: before implementing a mathematical operation, check the documentation to see if it's already part of the library -- chances are, it is.*

## Working with Arrays

### Vectorization

- before writing a loop, consider if the operation can be *vectorized*
- vectorization is the application of an operation over an entire array, instead of element by element
- Results in more concise code, and many vectorized implementations of functions are optimized

In [None]:
%%timeit

# Don't do this!!
vec = np.array(range(10000))
sum_v = 0
for i in range(10000):
  sum_v += vec[i]

100 loops, best of 3: 2.96 ms per loop


In [None]:
%%timeit

# Vectorize instead
vec = np.array(range(10000))
sum_v = np.sum(vec)

1000 loops, best of 3: 1.41 ms per loop
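
Vectorization is not limited to reductions like `np.sum`. Arithmetic operators also apply element by element across a whole array. A small sketch:

```python
import numpy as np

vec = np.array(range(10))

# loop version: square each element one at a time
squares_loop = np.array([v ** 2 for v in vec])

# vectorized version: the ** operator applies to the whole array at once
squares_vec = vec ** 2

print(np.array_equal(squares_loop, squares_vec))
```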


#### Watch the axes

- What is the behavior of  `np.sum()` if our array has more than one dimension?
- Use the `axis` argument: specifies which axis to sum along
- Many other vectorized functions take the axis argument, so keep in mind which dimension you want the operation applied to

In [None]:
A = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(A)
print(A.shape)

In [None]:
# By default, axis=None and np.sum will sum all elements of the array
print(np.sum(A))

In [None]:
# axis=0 sums along rows, producing column totals
print(np.sum(A, axis=0))

In [None]:
# axis=1 sums along columns, producing row totals
print(np.sum(A, axis=1))
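
The `axis` argument behaves the same way in other reductions. Here is a short sketch with `np.mean`, `np.max`, and `np.argmax` on the same matrix.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 collapses the rows, giving one value per column
col_means = np.mean(A, axis=0)
# axis=1 collapses the columns, giving one value per row
row_maxes = np.max(A, axis=1)
# argmax also accepts axis: position of the largest entry within each row
row_argmax = np.argmax(A, axis=1)

print(col_means, row_maxes, row_argmax)
```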

- vectorization doesn't always work -- post to Piazza or come to office hours if you have questions!

### Indexing

#### Slicing

- the standard rules of slicing and indexing from Python apply to Numpy arrays
- `i:j:k` syntax corresponds to starting at index `i`, stopping before index `j` (index `j` itself is excluded), with step size `k`
  - omitting `k` implies a step size of 1
- `i:` selects all indices beginning at index `i`
- `:j` selects all indices up to but not including index `j`
- `:` by itself selects all indices along an axis

In [None]:
# start with an array of 0 to 9
vec = np.array(range(10))
print(vec)

In [None]:
# selects indices 2,3,4 (not including 5)
print(vec[2:5])

In [None]:
# selects indices 2,4,6
print(vec[2:7:2])

In [None]:
# selects indices starting with 5 to the end
print(vec[5:])

In [None]:
# selects indices up to, but not including, index 5
print(vec[:5])

In [None]:
# select everything, since vec is a 1D array
print(vec[:])

In [None]:
# same idea with 2D arrays, but with two axes
A = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(A)

In [None]:
# select the element in the 0-index row and 1-index column
print(A[0,1])

In [None]:
# select the 0-index column
print(A[:, 0])

In [None]:
# select the first two rows
print(A[:2, :])

In [None]:
# select the last two columns of the last two rows
print(A[1:, 1:])

#### Logical Indexing

- we can select according to boolean conditions across axes as well

In [None]:
%%timeit

np.random.seed(0)
num_animals = 100000
animal_weights = np.random.uniform(0, 50, num_animals)

# Don't do this!!
is_dog = np.zeros(num_animals)
is_cat = np.zeros(num_animals)

for i in range(num_animals):
  if animal_weights[i] > 30:
    is_dog[i] = 1
  else:
    is_cat[i] = 1

In [None]:
%%timeit

np.random.seed(0)
num_animals = 100000
animal_weights = np.random.uniform(0, 50, num_animals)

# Use logical indexing instead, see the speed difference
is_dog = animal_weights > 30
is_cat = animal_weights <= 30

In [None]:
np.random.seed(0)
num_animals = 100000
animal_weights = np.random.uniform(0, 50, num_animals)

# Use logical indexing for conditional selections
dog_weights = animal_weights[animal_weights > 30]
cat_weights = animal_weights[animal_weights <= 30]
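
Boolean masks can also be combined. The sketch below selects a hypothetical "medium" weight band; the 20 and 30 cutoffs are made up for illustration.

```python
import numpy as np

np.random.seed(0)
animal_weights = np.random.uniform(0, 50, 100000)

# combine conditions with & (and) and | (or); the parentheses are required
# because & and | bind more tightly than the comparison operators
is_medium = (animal_weights > 20) & (animal_weights <= 30)
medium_weights = animal_weights[is_medium]

# a boolean mask can be summed to count how many entries satisfy it
num_medium = np.sum(is_medium)
print(num_medium, len(medium_weights))
```

Python's `and` and `or` keywords do not work element-wise on arrays, so `&` and `|` are used instead.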

## Where to go next

- We didn't cover: plotting, data I/O, debugging, etc.
- Check out the [Python resources](https://alliance.seas.upenn.edu/~cis520/dynamic/2019/wiki/index.php?n=Resources.Resources) on the course wiki
- Post on Piazza or come to office hours if you have any questions!