# Agenda

1. NumPy
    - NumPy arrays
    - Setting + retrieving
    - Broadcasting
    - Boolean / mask arrays for retrieving
    - dtypes and `nan`
2. Pandas
    - series vs. data frames
    - creating a series, from scratch or via NumPy

# What is NumPy? Why do we care?

Python is *not*:

- slim (in terms of memory)
- fast (in terms of execution)



In [1]:
import sys
x = 0

sys.getsizeof(x) # how many bytes does this integer use in memory?

28

In [4]:
x = 100_000_000
sys.getsizeof(x)

28

# What does NumPy do?

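The problem is that everything in Python is an object, and every object carries bookkeeping overhead (such as a reference count and a pointer to its type). That is why even the integer 0 takes 28 bytes above.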

In C, integers are tiny -- at the most, they're 64 bits (8 bytes) in size.  

NumPy allows us to use C data structures with a Python API.  We (mostly) feel like we're working in Python, but we're gaining the speed and memory usage of C.

The big deal in NumPy is actually one data structure, the NumPy array, aka `ndarray` -- it's an n-dimensional array.  If you're a mathematician or a physicist, then you'll want all of those dimensions.  We'll be using just 1-dimensional arrays and 2D arrays.

A NumPy array actually has two pieces:

- The Python part, which we work with
- The C part, where it allocates memory and works with it at that level
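To make the memory difference concrete, here is a minimal sketch comparing a Python list of integers with a NumPy array holding the same values. The variable names and the choice of 1,000 elements are only for illustration. The exact size of the list varies by Python version, so only the array's size is given as a fixed number.

```python
import sys
import numpy as np

n = 1_000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)   # force 8-byte integers for the comparison

# a list stores pointers; each int object lives (and is sized) separately
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)

# an array stores the raw values side by side in one C buffer
arr_bytes = arr.nbytes               # 1_000 elements * 8 bytes = 8000

print(arr.itemsize, arr_bytes, list_bytes)
```

The `itemsize` and `nbytes` attributes report the size of one element and of the whole data buffer, respectively.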



In [5]:
# let's load NumPy!

import numpy as np     

In [6]:
# create a NumPy array
# we *don't* directly use np.ndarray, even though it exists!

# rather, we'll use np.array, and pass it a regular Python list, which it'll turn
# into a NumPy array with the appropriate back-end values

a = np.array([10, 20, 30, 40, 50, 60, 70])
type(a)

numpy.ndarray

In [7]:
a

array([10, 20, 30, 40, 50, 60, 70])

In [8]:
# things that are similar between lists and arrays
# basic retrieval

a[3]   #get the element at index 3

40

In [9]:
# arrays are mutable
a[3] = 41
a

array([10, 20, 30, 41, 50, 60, 70])

In [10]:
# they are iterable, as well -- so we can put them in a for loop

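# but DON'T DO THAT! A Python for loop handles the elements one at a time,
# which throws away the speed that NumPy gives us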
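
As a rough illustration of the alternative, the sketch below computes the same total twice: once with an explicit loop and once with the array's own `sum` method, which does the work in C. The variable names are made up for this example.

```python
import numpy as np

a = np.array([10, 20, 30, 41, 50, 60, 70])

# the slow way: Python visits each element one at a time
total = 0
for x in a:
    total += x

# the NumPy way: one method call, and the loop happens in C
vectorized_total = a.sum()

print(total, vectorized_total)   # both are 281
```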

In [11]:
# a few other ways to create NumPy arrays

# (1) get a range, using np.arange (similar to Python's "range" builtin)
a = np.arange(10, 200, 3)   # start at 10, end before 200, step size 3

In [12]:
a

array([ 10,  13,  16,  19,  22,  25,  28,  31,  34,  37,  40,  43,  46,
        49,  52,  55,  58,  61,  64,  67,  70,  73,  76,  79,  82,  85,
        88,  91,  94,  97, 100, 103, 106, 109, 112, 115, 118, 121, 124,
       127, 130, 133, 136, 139, 142, 145, 148, 151, 154, 157, 160, 163,
       166, 169, 172, 175, 178, 181, 184, 187, 190, 193, 196, 199])

In [13]:
# (2) get a bunch of 0s
np.zeros(10)   # it's spelled "zeros" not "zeroes"

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [14]:
# (3) get a bunch of 1s
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [17]:
# (4) get a bunch of random integers

np.random.seed(0)    # start the random-number generator at a known value
np.random.randint(0, 100, 20)   # 20 random ints from 0-100 (not including 100)

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])
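
As an aside, newer NumPy code often uses a `Generator` object instead of the global `np.random.seed` state. The following is a minimal sketch of that approach, not a replacement for the cell above. The names `rng1` and `rng2` are illustrative, and the numbers it produces will generally differ from the `np.random.randint` output shown here, because the two APIs use different underlying generators.

```python
import numpy as np

# two generators created with the same seed produce the same stream
rng1 = np.random.default_rng(0)
rng2 = np.random.default_rng(0)

x = rng1.integers(0, 100, 20)   # 20 random ints from 0-100 (not including 100)
y = rng2.integers(0, 100, 20)

print(np.array_equal(x, y))     # True
```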

In [22]:

np.random.seed(0)    # start the random-number generator at a known value
a = np.random.randint(0, 100, 20)
np.unique(a)

array([ 9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65, 67, 70, 83, 87, 88])

In [18]:
# (5) get a bunch of random floats from 0-1

np.random.rand(10)    # each result is a float in that range

array([0.3927848 , 0.83607876, 0.33739616, 0.64817187, 0.36824154,
       0.95715516, 0.14035078, 0.87008726, 0.47360805, 0.80091075])

In [24]:
# methods we can run on a NumPy array

a.sum()

1166

In [27]:
a.size   # this is not a method -- this is a data attribute!

20
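
A few other data attributes are worth knowing alongside `size`. This is a small sketch on a made-up three-element array. The exact integer dtype depends on the platform and NumPy version, so the code only checks that it is some integer type.

```python
import numpy as np

a = np.array([10, 20, 30])

print(a.size)    # 3, the total number of elements
print(a.shape)   # (3,), the length along each dimension, as a tuple
print(a.ndim)    # 1, the number of dimensions
print(a.dtype)   # an integer dtype, such as int64 on many systems

is_integer = np.issubdtype(a.dtype, np.integer)
```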

In [28]:
a.mean()  # mean

58.3

In [29]:
a.std()   # standard deviation (population std by default, i.e. ddof=0)

25.088044961694404

In [30]:
a.min()


9

In [31]:
a.max()

88

In [36]:
a.argmin()  # at what index in a is the min located?

5

In [37]:
a.argmax()   # at what index in a is the max located?

11
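
Note that 88 appears more than once in this array, and `argmax` reports only the first occurrence. The sketch below rebuilds the same array by hand (copied from the output above, so it does not depend on the random seed) and uses a comparison to find every position of the maximum. The name `all_max_positions` is just for this example.

```python
import numpy as np

a = np.array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
              87, 46, 88])

first_max = a.argmax()                            # 11, the first index holding the max
all_max_positions = np.flatnonzero(a == a.max())  # array([11, 12, 19])

print(a[first_max])          # 88, same as a.max()
print(all_max_positions)
```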

In [39]:
mylist = [10, 20, 30]  # regular Python list

# what happens if I add it to itself?
# we get a new list with all of the elements twice
mylist + mylist

[10, 20, 30, 10, 20, 30]

In [40]:
# what happens if I add 5 to mylist?

mylist + 5

TypeError: can only concatenate list (not "int") to list

In [41]:
# What happens if we do the same to a NumPy array?

a = np.array([10, 20, 30])

a + a  

array([20, 40, 60])

In [43]:
# if you add two arrays to each other:
# (a) they must be of the same size (more precisely, have compatible shapes)
# (b) we'll get a new array back, also of the same size, with the addition
#     performed on a per-index basis.

b = np.array([40, 50, 60])
a + b

array([50, 70, 90])

In [44]:
c = np.array([100, 200, 300, 400])

a + c

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

In [45]:
# what happens if we add 5 to a?

a + 5  # this is "broadcasting"!

array([15, 25, 35])

# Broadcasting

If you perform a mathematical operation involving one array and one scalar value, the operation is applied to each element of the array together with that scalar, and the result is a new array. The original array is not modified.
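
One way to picture this is that the scalar behaves as if it were stretched into an array of the same shape. The sketch below builds that stretched array explicitly with `np.full` and shows that it gives the same answer. NumPy does not need to create the full array internally, so this is only a mental model, and the name `threes` is made up for the example.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

threes = np.full(a.shape, 3)   # array([3, 3, 3, 3, 3])
explicit = a + threes          # element-by-element addition of two arrays
broadcast = a + 3              # the scalar is broadcast across a

print(np.array_equal(explicit, broadcast))   # True
```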

In [46]:
a = np.array([10, 20, 30, 40, 50])

a + 3

array([13, 23, 33, 43, 53])

In [47]:
a - 3

array([ 7, 17, 27, 37, 47])

In [48]:
a * 3

array([ 30,  60,  90, 120, 150])

In [49]:
a / 3    # notice -- / always results in a float

array([ 3.33333333,  6.66666667, 10.        , 13.33333333, 16.66666667])

In [50]:
a // 3   # floordiv -- rounds the result down to the nearest integer

array([ 3,  6, 10, 13, 16])
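
The `//` operator is floor division: it rounds toward negative infinity. For positive numbers that looks the same as chopping off the fractional part, but the two differ for negative values. Here is a small sketch with a made-up array `b` containing one negative and one positive number.

```python
import numpy as np

b = np.array([-7, 7])

floored = b // 3              # array([-3,  2]), rounds down
truncated = np.trunc(b / 3)   # array([-2.,  2.]), drops the fractional part

print(floored)
print(truncated)
```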

In [51]:
a ** 3

array([  1000,   8000,  27000,  64000, 125000])

In [52]:
a % 3

array([1, 2, 0, 1, 2])

In [54]:
# remember our floating-point random numbers?

np.random.rand(10) * 100   # now I have 10 floats between 0-100

array([52.04774796, 67.88795301, 72.06326547, 58.20197921, 53.73732294,
       75.86156243, 10.59076072, 47.36004193, 18.63323433, 73.69181771])

In [55]:
# remember np.ones and np.zeros?  I can use addition/multiplication to get
# an array of any number I want.

np.ones(10) * 5

array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
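
NumPy also has a function that builds such an array directly. As a minimal sketch, `np.full` takes a shape and a fill value. Passing the float `5.0` matches the float result of `np.ones(10) * 5`, whereas passing the integer `5` would give an integer array instead.

```python
import numpy as np

via_ones = np.ones(10) * 5     # multiply an array of 1s by 5
via_full = np.full(10, 5.0)    # ask for ten copies of 5.0 directly

print(np.array_equal(via_ones, via_full))   # True
```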

In [56]:
# I can retrieve (as we've seen) from an array
a

array([10, 20, 30, 40, 50])

In [57]:
a[2]

30

In [58]:
a[4]

50

In [59]:
# can I get *both* of these values back?
# yes, with "fancy indexing"
# instead of passing a single index value to [], I pass a list of values
# yes, this means double square brackets!

a[   [2, 4 ]   ]    # exaggerated whitespace for pedagogical effect

array([30, 50])

In [60]:
a[  [2,3,2,3,2,3] ]

array([30, 40, 30, 40, 30, 40])

In [61]:
a[  [2,3,2,3,2,3] ].mean()

35.0
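
Two further details about fancy indexing, shown as a short sketch with illustrative names (`idx`, `picked`): the index can be a NumPy array as well as a list, and the result is a new array holding copies of the values, so changing it leaves the original untouched.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

idx = np.array([2, 4])   # a NumPy array works as an index, just like a list
picked = a[idx]          # array([30, 50])

picked[0] = 999          # modify the result...
print(picked)            # array([999,  50])
print(a)                 # ...and a is unchanged: array([10, 20, 30, 40, 50])
```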

# Exercises with NumPy

1. Get the 10-day forecast for your city. Create two NumPy arrays -- `highs` with the expected high temps and `lows` with the expected low temps.
2. Find the mean high temp in the coming days.
3. Find the std for high temps, as well.
4. If you entered the temperature in Fahrenheit, convert the temperature to Celsius. (If you used Celsius, convert to Fahrenheit.)
5. Calculate the mean difference between highs and lows in the coming days.

In [62]:
highs = np.array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])
lows = np.array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

In [63]:
highs

array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])

In [64]:
lows

array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

In [65]:
# mean high temp
highs.mean()

60.8

In [66]:
highs.sum() / highs.size

60.8

In [67]:
highs.std()

6.539113089708726

In [68]:
# C = (F-32) * (5/9)

(highs-32) * (5/9)

array([10.55555556,  8.88888889, 13.33333333, 16.11111111, 20.        ,
       19.44444444, 17.22222222, 18.88888889, 17.22222222, 18.33333333])
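
For anyone who started from Celsius, the reverse conversion is F = C * (9/5) + 32. The sketch below applies it to the Celsius values computed above and checks that we get back the original highs. It uses `np.allclose` rather than `==` because floating-point arithmetic can introduce tiny rounding differences. The names `celsius` and `fahrenheit` are just for this example.

```python
import numpy as np

highs = np.array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])

celsius = (highs - 32) * (5 / 9)      # Fahrenheit -> Celsius
fahrenheit = celsius * (9 / 5) + 32   # Celsius -> Fahrenheit

print(np.allclose(fahrenheit, highs))   # True
```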

In [69]:
highs-32

array([19, 16, 24, 29, 36, 35, 31, 34, 31, 33])

In [70]:
highs - lows

array([19, 14, 16, 18, 19, 20, 17, 22, 18, 20])
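
To finish exercise 5, take the mean of that array of differences. This sketch repeats the two arrays so that it runs on its own. The name `diffs` is made up for the example.

```python
import numpy as np

highs = np.array([51, 48, 56, 61, 68, 67, 63, 66, 63, 65])
lows = np.array([32, 34, 40, 43, 49, 47, 46, 44, 45, 45])

diffs = highs - lows            # per-day difference between high and low
mean_diff = diffs.mean()        # 183 / 10 = 18.3

print(mean_diff)
```

The same number comes out of `highs.mean() - lows.mean()`, up to floating-point rounding, because the mean of the differences equals the difference of the means.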

In [None]:
# if 