# Agenda

1. NumPy
    - NumPy arrays
    - Setting + retrieving
    - Broadcasting
    - Boolean / mask arrays for retrieving
    - dtypes and `nan`
2. Pandas
    - series vs. data frames
    - creating a series, from scratch or via NumPy

# What is NumPy? Why do we care?

Python is *not*:

- slim (in terms of memory)
- fast (in terms of execution)



In [1]:
import sys
x = 0

sys.getsizeof(x) # how many bytes does this integer use in memory?

28

In [4]:
x = 100_000_000
sys.getsizeof(x)

28

# What does NumPy do?

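The problem is that everything in Python is an object, and every object carries bookkeeping overhead (such as a type pointer and a reference count) on top of its actual value. That's why even a small integer takes up 28 bytes.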

In C, integers are tiny -- at the most, they're 64 bits (8 bytes) in size.  

NumPy allows us to use C data structures with a Python API.  We (mostly) feel like we're working in Python, but we're gaining the speed and memory usage of C.
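To make the memory difference concrete, here is a rough sketch that compares a Python list of integers with a NumPy array holding the same values. The element count `n` is an arbitrary choice for illustration, and the list figure is only an estimate, since `sys.getsizeof` doesn't account for objects that Python shares behind the scenes.

```python
import sys
import numpy as np

n = 1_000  # hypothetical element count, chosen only for illustration

py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # force 8-byte integers

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)

# nbytes counts only the raw C buffer: n elements * 8 bytes each.
array_bytes = np_arr.nbytes

print(list_bytes, array_bytes)
```

The array's data buffer is exactly `8 * n` bytes, while the list needs several times that.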

The big deal in NumPy is actually one data structure, the NumPy array, aka `ndarray` -- it's an n-dimensional array.  If you're a mathematician or a physicist, then you'll want all of those dimensions.  We'll be using just 1-dimensional arrays and 2D arrays.

A NumPy array actually has two pieces:

- The Python part, which we work with
- The C part, where it allocates memory and works with it at that level



In [5]:
# let's load NumPy!

import numpy as np     

In [6]:
# create a NumPy array
# we *don't* directly use np.ndarray, even though it exists!

# rather, we'll use np.array, and pass it a regular Python list, which it'll turn
# into a NumPy array with the appropriate back-end values

a = np.array([10, 20, 30, 40, 50, 60, 70])
type(a)

numpy.ndarray

In [7]:
a

array([10, 20, 30, 40, 50, 60, 70])

In [8]:
# things that are similar between lists and arrays
# basic retrieval

a[3]   #get the element at index 3

40

In [9]:
# arrays are mutable
a[3] = 41
a

array([10, 20, 30, 41, 50, 60, 70])

In [10]:
# they are iterable, as well -- so we can put them in a for loop

# but DON'T DO THAT! A Python-level loop throws away NumPy's speed.
# Use the array's methods and operators instead.
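As a small illustration of the two styles, the sketch below sums the same array with an explicit `for` loop and with the `sum` method. Both give the same answer; the difference is where the looping happens.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50, 60, 70])

# Slow style: Python pulls each element out of the array one at a time.
total_loop = 0
for value in a:
    total_loop += value

# Preferred style: one call, and the loop runs in compiled code.
total_vectorized = a.sum()

print(total_loop, total_vectorized)
```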

In [11]:
# a few other ways to create NumPy arrays

# (1) get a range, using np.arange (similar to Python's "range" builtin)
a = np.arange(10, 200, 3)   # start at 10, end before 200, step size 3

In [12]:
a

array([ 10,  13,  16,  19,  22,  25,  28,  31,  34,  37,  40,  43,  46,
        49,  52,  55,  58,  61,  64,  67,  70,  73,  76,  79,  82,  85,
        88,  91,  94,  97, 100, 103, 106, 109, 112, 115, 118, 121, 124,
       127, 130, 133, 136, 139, 142, 145, 148, 151, 154, 157, 160, 163,
       166, 169, 172, 175, 178, 181, 184, 187, 190, 193, 196, 199])

In [13]:
# (2) get a bunch of 0s
np.zeros(10)   # it's spelled "zeros" not "zeroes"

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [14]:
# (3) get a bunch of 1s
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [17]:
# (4) get a bunch of random integers

np.random.seed(0)    # start the random-number generator at a known value
np.random.randint(0, 100, 20)   # 20 random ints from 0-100 (not including 100)

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])
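NumPy also offers a newer random-number interface, `np.random.default_rng`. The sketch below assumes a reasonably recent NumPy and is an alternative to `np.random.seed`, not what the cells here use. It produces different numbers than the output shown above, even with the same seed value.

```python
import numpy as np

rng = np.random.default_rng(0)   # a Generator seeded with a known value
r = rng.integers(0, 100, 20)     # 20 ints from 0 up to (not including) 100

print(r)
```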

In [22]:

np.random.seed(0)    # start the random-number generator at a known value
a = np.random.randint(0, 100, 20)
np.unique(a)

array([ 9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65, 67, 70, 83, 87, 88])

In [18]:
# (5) get a bunch of random floats from 0-1

np.random.rand(10)    # each result is a float in that range

array([0.3927848 , 0.83607876, 0.33739616, 0.64817187, 0.36824154,
       0.95715516, 0.14035078, 0.87008726, 0.47360805, 0.80091075])

In [24]:
# methods we can run on a NumPy array

a.sum()

1166

In [27]:
a.size   # this is not a method -- this is a data attribute!

20

In [28]:
a.mean()  # mean

58.3

In [29]:
a.std()   # standard deviation

25.088044961694404

In [30]:
a.min()


9

In [31]:
a.max()

88

In [36]:
a.argmin()  # at what index in a is the min located?

5

In [37]:
a.argmax()   # at what index in a is the max located?

11

In [39]:
mylist = [10, 20, 30]  # regular Python list

# what happens if I add it to itself?
# we get a new list with all of the elements twice
mylist + mylist

[10, 20, 30, 10, 20, 30]

In [40]:
# what happens if I add 5 to mylist?

mylist + 5

TypeError: can only concatenate list (not "int") to list

In [41]:
# What happens if we do the same to a NumPy array?

a = np.array([10, 20, 30])

a + a  

array([20, 40, 60])

In [43]:
# if you add two arrays to each other:
# (a) they must be of the same size (for the 1D arrays we're using here)
# (b) we'll get a new array back, also of the same size, with the addition
#     performed on a per-index basis.

b = np.array([40, 50, 60])
a + b

array([50, 70, 90])

In [44]:
c = np.array([100, 200, 300, 400])

a + c

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

In [45]:
# what happens if we add 5 to a?

a + 5  # this is "broadcasting!"

array([15, 25, 35])

# Broadcasting

If you try to perform a mathematical operation involving 1 array and 1 scalar value, then the operation is performed on each element of the array and that scalar value, resulting in a new array.

In [46]:
a = np.array([10, 20, 30, 40, 50])

a + 3

array([13, 23, 33, 43, 53])

In [47]:
a - 3

array([ 7, 17, 27, 37, 47])

In [48]:
a * 3

array([ 30,  60,  90, 120, 150])

In [49]:
a / 3    # notice -- / always results in a float

array([ 3.33333333,  6.66666667, 10.        , 13.33333333, 16.66666667])

In [50]:
a // 3   # floordiv -- rounds down (toward negative infinity) to a whole number

array([ 3,  6, 10, 13, 16])
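With positive numbers, rounding down and chopping off the fractional part look identical. The difference only shows up with negative values, as in this small sketch (the array `b` is made up for illustration):

```python
import numpy as np

b = np.array([-7, -1, 1, 7])

floored = b // 2               # rounds toward negative infinity
truncated = np.trunc(b / 2)    # rounds toward zero, returns floats

print(floored)
print(truncated)
```

Here `-7 // 2` gives `-4`, while truncating `-3.5` gives `-3`.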

In [51]:
a ** 3

array([  1000,   8000,  27000,  64000, 125000])

In [52]:
a % 3

array([1, 2, 0, 1, 2])

In [54]:
# remember our floating-point random numbers?

np.random.rand(10) * 100   # now I have 10 floats between 0-100

array([52.04774796, 67.88795301, 72.06326547, 58.20197921, 53.73732294,
       75.86156243, 10.59076072, 47.36004193, 18.63323433, 73.69181771])

In [55]:
# remember np.ones and np.zeros?  I can use addition/multiplication to get
# an array of any number I want.

np.ones(10) * 5

array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])

In [56]:
# I can retrieve (as we've seen) from an array
a

array([10, 20, 30, 40, 50])

In [57]:
a[2]

30

In [58]:
a[4]

50

In [59]:
# can I get *both* of these values back?
# yes, with "fancy indexing"
# instead of passing a single index value to [], I pass a list of values
# yes, this means double square brackets!

a[   [2, 4 ]   ]    # exaggerated whitespace for pedagogical effect

array([30, 50])

In [60]:
a[  [2,3,2,3,2,3] ]

array([30, 40, 30, 40, 30, 40])

In [61]:
a[  [2,3,2,3,2,3] ].mean()

35.0

# Exercises with NumPy

1. Get the 10-day forecast for your city. Create two NumPy arrays -- `highs` with the expected high temps and `lows` with the expected low temps.
2. Find the mean high temp in the coming days.
3. Find the std for high temps, as well.
4. If you entered the temperature in Fahrenheit, convert the temperature to Celsius. (If you used Celsius, convert to Fahrenheit.)
5. Calculate the mean difference between highs and lows in the coming days.

In [62]:
highs = np.array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])
lows = np.array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

In [63]:
highs

array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])

In [64]:
lows

array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

In [65]:
# mean high temp
highs.mean()

60.8

In [66]:
highs.sum() / highs.size

60.8

In [67]:
highs.std()

6.539113089708726

In [68]:
# C = (F-32) * (5/9)

(highs-32) * (5/9)

array([10.55555556,  8.88888889, 13.33333333, 16.11111111, 20.        ,
       19.44444444, 17.22222222, 18.88888889, 17.22222222, 18.33333333])

In [69]:
highs-32

array([19, 16, 24, 29, 36, 35, 31, 34, 31, 33])

In [70]:
highs - lows

array([19, 14, 16, 18, 19, 20, 17, 22, 18, 20])

In [71]:
# if I have an array, then I can calculate the mean on it

(highs - lows).mean()

18.3

In [72]:
# this calculates the mean on each array, and then subtracts a float from another float
highs.mean() - lows.mean()

18.299999999999997

In [73]:
# this creates a new array based on invoking - on highs and lows, then 
# calculates the mean on that new array
(highs-lows).mean()   

18.3
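The two approaches are mathematically the same, but floating-point rounding can make the results differ in the last digit. When comparing floats like these, a tolerance-based check such as `np.isclose` is usually safer than `==`. A minimal sketch:

```python
import numpy as np

highs = np.array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])
lows = np.array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

diff_of_means = highs.mean() - lows.mean()
mean_of_diffs = (highs - lows).mean()

# Compare floats with a tolerance rather than with ==
print(np.isclose(diff_of_means, mean_of_diffs))
```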

In [74]:
%timeit highs.mean() - lows.mean()

10.5 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [75]:
%timeit (highs-lows).mean()   

6.52 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [76]:
np.array(np.arange(100))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [77]:
np.array((100, 200, 300))

array([100, 200, 300])

In [78]:
np.array({1:'hello', 2:'goodbye', 3:'whatever'})

array({1: 'hello', 2: 'goodbye', 3: 'whatever'}, dtype=object)

In [79]:
# what about if I try to check equality -- that's also a mathematical operation!

a = np.array([10, 20, 30, 40, 50])

a == 30   # what will be the result?   I get back a boolean array

array([False, False,  True, False, False])

# Mask arrays / mask index

If you invoke `[]` on an array and give it a single integer, you get the value at that index.

If you invoke `[]` on an array and give it a list of integers, you get the values at each of those indexes ("fancy indexing").

If you invoke `[]` on an array and give it a list of *booleans*, you get the values that correspond to `True` values. The `False` values are ignored.

In [80]:
#   10    20    30    40      50

a[[True, False, True, True, False]]

array([10, 30, 40])

In [81]:
# we can combine these latest two techniques to retrieve only 
# the elements of an array that fit a condition.

# (1) we get a boolean array from calculating a==30
# (2) we apply that boolean array to a, getting those elements for which it's true

a[ a==30 ]

array([30])

In [83]:
a[ a < 30 ]    #get the elements less than 30 in a

array([10, 20])

# Using and reading mask indexes

1. You first create a boolean array based on a condition and broadcasting.
2. You apply that boolean array as a mask index to the array.

This means, of course, that you can apply a boolean array created from one array, to another one.  They all have to be the same size, but if they are, this can work.
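For instance, reusing the `highs` and `lows` arrays from the earlier exercise, a mask built from one array can pick elements out of the other. The threshold of 60 and the variable names are just for illustration.

```python
import numpy as np

highs = np.array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])
lows = np.array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

warm_days = highs > 60                 # boolean array built from highs
lows_on_warm_days = lows[warm_days]    # ...applied as a mask to lows

print(lows_on_warm_days)
```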

In [86]:
a = np.array([10, 11, 12, 13, 14, 15, 16])

# I want to see the even numbers in a

a[a%2 == 0]

array([10, 12, 14, 16])

In [87]:
# what happens if I decide I want odd numbers, and I try to skip the
# comparison? You might expect a%2 to act as a mask, with 1 seen as True
# and 0 seen as False. It doesn't!

a[a%2]   # a%2 gives me an array of integers (not booleans), so this is fancy indexing

array([10, 11, 10, 11, 10, 11, 10])
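To actually get the odd numbers, the index has to be a boolean array. Two ways that should work are sketched below: an explicit comparison, or converting the 0/1 integers to booleans with `astype`.

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14, 15, 16])

odds_by_comparison = a[a % 2 == 1]        # explicit comparison gives booleans
odds_by_cast = a[(a % 2).astype(bool)]    # or convert the 0/1 integers to booleans

print(odds_by_comparison)
print(odds_by_cast)
```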

# Exercises:

1. Create an array of 20 random integers from 0 to 100.
2. Find the largest even number.
3. Find the mean of the odd numbers.
4. Find items that are less than the mean.
5. Find items less than the mean-std.

In [89]:
np.random.seed(0)
a = np.random.randint(0, 100, 20)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [92]:
# find the even numbers

a[a%2 == 0]

array([44, 64, 36, 70, 88, 88, 12, 58, 46, 88])

In [93]:
# get the *largest* even number with .max()

a[a%2 == 0].max()

88

In [97]:
# find the mean of the odd numbers

a[a%2 == 1].mean()

57.2

In [100]:
# find items less than the mean

a[a < a.mean()]

array([44, 47,  9, 21, 36, 12, 58, 39, 46])

In [103]:
# find items less than 1 std below the mean

# show me elements of a
# where the elements are less than mean() - std()

a[a < a.mean() - a.std()]

array([ 9, 21, 12])

# Next up

1. More complex conditions (and + or + not)
2. dtypes + `nan`


# More complex conditions

What if I want to find all of the even numbers that are < the mean?

If I have some Python experience, I would try something like this...

In [105]:
a[a % 2 == 0 and a < a.mean()]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

# Boolean context

Remember that in Python, when we have an `if` or a `while`, it looks to its right, and if the value is boolean, it decides what to do.

But if the value is *not* boolean, then `if` and `while` turn that value into a boolean, and then act accordingly. You can do this manually by invoking `bool`.

The problem is that `and` (and `or` and `not`) all work in this way, too. So if you use `and` somewhere, it looks to its left, turns that value into a boolean, and (if it still needs to) does the same with the value on its right before deciding what to do.

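In the above example, `and` tried to turn the value on its left, `a % 2 == 0`, into a single boolean. (Because `and` has lower priority than `==` and `<`, the comparisons run first, so each side of `and` is a whole boolean array.)

In [106]:
# with explicit parentheses, this is what Python evaluated:

a[(a % 2 == 0) and (a < a.mean())]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

NumPy arrays with more than one element do *not* allow you to turn them into booleans: should the result be `True` if any element is true, or only if all of them are? Arrays with one element follow that element, and empty arrays have historically been treated as `False` (a behavior that NumPy has deprecated). Otherwise, you'll get an error.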
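The error message suggests `any` and `all`, which are the explicit ways to reduce a boolean array to one value. A minimal sketch with a made-up three-element mask:

```python
import numpy as np

mask = np.array([True, False, True])

try:
    bool(mask)               # which of the three values should win?
except ValueError as err:
    print("ValueError:", err)

# Say what you mean instead:
print(mask.any())   # True if at least one element is True
print(mask.all())   # True only if every element is True
```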

NumPy solves this problem by *not* using `and`, `or`, and `not`.  Instead, it hijacks the bitwise operators `&`, `|`, and `~`.  These are the operators we should be using in NumPy if we want this sort of thing.

It's not enough to swap out `and` and swap in `&`. You also need to put parentheses around the expressions on each side of `&`, because `&` has higher priority than comparison operators such as `==` and `<`.
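The sketch below shows what goes wrong without the parentheses. The array and the cutoff of 14 are made up for illustration.

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14, 15, 16])

# With parentheses: two boolean arrays combined element by element.
evens_below_14 = a[(a % 2 == 0) & (a < 14)]
print(evens_below_14)

# Without parentheses, & is evaluated first, so Python sees
# a % 2 == (0 & a) < 14, a chained comparison that needs a single bool.
try:
    a[a % 2 == 0 & a < 14]
except ValueError as err:
    print("ValueError:", err)
```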

In [109]:
# this works, and here I split it across lines

a[(a % 2 == 0) & 
  (a < a.mean())]

array([44, 36, 12, 58, 46])

# Basic rule for complex conditions

1. Put each expression in parentheses
2. Use `&`, `|`, and `~`
3. Put each expression on a separate line for greater readability.
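Here is a short sketch that applies all three rules at once, using `~` (not) and `|` (or). The array and the cutoff of 14 are made up for illustration.

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14, 15, 16])

result = a[
    ~(a % 2 == 0)     # not even...
    | (a > 14)        # ...or greater than 14
]
print(result)
```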

# Exercises

1. Create an array of 20 random integers from 1-1,000.
2. What's the smallest even number greater than the mean?
3. Show all numbers that are either < mean-std  or > mean+std.
4. Show odd numbers < mean, and even numbers > mean.

In [110]:
np.random.seed(0)
a = np.random.randint(0, 1000, 20)
a

array([684, 559, 629, 192, 835, 763, 707, 359,   9, 723, 277, 754, 804,
       599,  70, 472, 600, 396, 314, 705])

In [111]:
# smallest even number greater than the mean

(a%2==0)

array([ True, False, False,  True, False, False, False, False, False,
       False, False,  True,  True, False,  True,  True,  True,  True,
        True, False])

In [113]:
(a>a.mean())

array([ True,  True,  True, False,  True,  True,  True, False, False,
        True, False,  True,  True,  True, False, False,  True, False,
       False,  True])

In [116]:
a[(a%2==0) & 
  (a>a.mean())].min()

600

In [119]:
# Show all numbers that are either < mean-std or > mean+std.

a[(a < a.mean()-a.std()) | (a > a.mean()+a.std())]

array([192, 835,   9, 277, 804,  70])

In [122]:
# Show odd numbers < mean, and even numbers > mean.

a[((a%2 == 1) & (a<a.mean())) |
  ((a%2 == 0) & (a>a.mean()))]

array([684, 359,   9, 277, 754, 804, 600])

# Underlying data types -- `dtype`

In C, there isn't any such thing as "an integer." An integer has a certain number of bits. The compiler needs to know how many bits your int will want, so that it can allocate the right amount of memory to that value.

Because NumPy uses C behind the scenes, we need to think (mostly) like C programmers when it comes to our data sizes. We should indicate how many bits we want to use for our values.

This is known as the `dtype`, and every NumPy array has that attribute. All of the values in the array *must* be of the same dtype.

In [123]:
a = np.array([10, 20, 30, 40, 50])
a.dtype

dtype('int64')

In [124]:
# we can express this as:

np.dtype('int64')

dtype('int64')

In [126]:
np.int64   # another way to do it

numpy.int64

In [127]:
# that's not the only option! This means that each int is using 8 bytes (= 64 bits)
# What if I want to save some memory? I can set the dtype when I create the array.

a = np.array([10, 20, 30, 40, 50], dtype=np.int32)
a

array([10, 20, 30, 40, 50], dtype=int32)

In [128]:
# what's the advantage? The values take up half of the memory we needed with int64.
# maybe we can go smaller still?

a = np.array([10, 20, 30, 40, 50], dtype=np.int16)
a

array([10, 20, 30, 40, 50], dtype=int16)

In [129]:
# I can even go down to np.int8, 1 byte / 8 bits for each number

a = np.array([10, 20, 30, 40, 50], dtype=np.int8)
a

array([10, 20, 30, 40, 50], dtype=int8)
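You can check the savings yourself with the `itemsize` attribute (bytes per element) and the `nbytes` attribute (bytes for all of the elements). A small sketch, where the `sizes` dictionary is only there to collect the results:

```python
import numpy as np

sizes = {}
for dt in (np.int64, np.int32, np.int16, np.int8):
    arr = np.array([10, 20, 30, 40, 50], dtype=dt)
    sizes[arr.dtype.name] = (arr.itemsize, arr.nbytes)
    print(arr.dtype.name, arr.itemsize, arr.nbytes)
```

Note that `nbytes` covers only the array's data, not the small Python object that wraps it.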

In [130]:
# if you can shave off memory usage by changing to a smaller dtype, you should!
# but.... int8 can only hold -128 through 127, so larger results wrap around

a * 100   # broadcast *100 to every element of a

array([ -24,  -48,  -72,  -96, -120], dtype=int8)
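One way around the wraparound, sketched below, is to convert to a wider dtype before doing the arithmetic. Here `int16` is wide enough because the largest result, 5000, is below its maximum of 32767.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50], dtype=np.int8)

# Convert to a wider dtype first, then multiply.
wider = a.astype(np.int16) * 100
print(wider)
```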

# Choosing a `dtype`

You are always navigating between two constraints:

- Use the smallest possible `dtype` that you can, to save memory. If we have 1 billion rows of data, and each is an integer, then switching from `int64` to `int8` can save us 7/8 of the memory!
- Don't use too small of a `dtype`, because you might end up ruining/truncating data that spills over the number of bits available.

The practical advice is: Use the smallest `dtype` you can, taking into account some padding or extra space that you might need in the future.
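To see how much room each integer `dtype` gives you, `np.iinfo` reports the smallest and largest values it can hold. A minimal sketch, where the `ranges` dictionary is only there to collect the results:

```python
import numpy as np

ranges = {}
for dt in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(dt)
    ranges[np.dtype(dt).name] = (info.min, info.max)
    print(f"{np.dtype(dt).name}: {info.min} to {info.max}")
```

If the largest value you expect, plus some padding, fits inside a smaller range, that smaller `dtype` is a candidate.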

# What dtypes exist?

- Integers
    - `int64` (default on most platforms)
    - `int32`
    - `int16`
    - `int8`
- Unsigned integers
    - `uint64`
    - `uint32`
    - `uint16`
    - `uint8`
- Floats
    - `float128` (platform-dependent; not available on every system)
    - `float64` (default)
    - `float32`
    - `float16`

In [131]:
a

array([10, 20, 30, 40, 50], dtype=int8)

In [132]:
a = np.array([10, 20, 30])
a

array([10, 20, 30])

In [133]:
a = np.array([10, 20.5, 30])
a

array([10. , 20.5, 30. ])

In [134]:
a.dtype

dtype('float64')

In [135]:
a = np.array([10, 20, 30, 40, 50])
a

array([10, 20, 30, 40, 50])

In [138]:
# what if I want to change the dtype?
# don't do this! Assigning to .dtype reinterprets the existing bytes
# instead of converting the values
a.dtype = np.int32

In [137]:
a

array([10,  0, 20,  0, 30,  0, 40,  0, 50,  0], dtype=int32)

In [139]:
a.dtype = np.int8
a

array([10,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0, 30,
        0,  0,  0,  0,  0,  0,  0, 40,  0,  0,  0,  0,  0,  0,  0, 50,  0,
        0,  0,  0,  0,  0,  0], dtype=int8)

In [140]:
# can I turn it into a bunch of floats?
a.dtype = np.float64

In [141]:
a

array([4.94e-323, 9.88e-323, 1.48e-322, 1.98e-322, 2.47e-322])

In [142]:
# get back to sanity
a.dtype = np.int64

In [143]:
a

array([10, 20, 30, 40, 50])

In [None]:
# if I have an array and I want to turn it into another dtype,
# I use the .astype method, which takes a dtype as an argument and
# returns a new array, which I assign (back) to a variable

a = a.astype(np.float64)
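Because `astype` returns a new array rather than changing the existing one, the original keeps its dtype unless you reassign the variable. A short sketch, using a separate name `f` so that both arrays stay visible:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

f = a.astype(np.float64)   # a new array; a itself is unchanged

print(a.dtype, f.dtype)
print(f)
```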