# Session 05
07/20

## Notes
- Solutions for labs are in their respective session folders (Sessions 2 and 4)
- Solution for Assignment 1 was under Session 04, but moved to a new directory called **Assignment Solutions** to make it easy to see!
- Assignment 02 is due TODAY at 11pm ET and is accepted until Jul 21, 11am ET with a late penalty. Submissions are not accepted after that, because the solution will be released once the late deadline has passed.
- Assignments are to be done *INDIVIDUALLY* - you are doing yourself and others a disservice by asking/answering assignment questions in a way that does more than hinting! Questions/answers of this kind do not get participation points.
- Midterm: during class time, on Qtools, 90 minutes, single submission, open notes but not open collaboration. 
- I see about 20\% of the class did not attempt Lab2 on Qtools. I re-opened it, so please use it to practice Qtools!
- DataCamp: 12 people (together in both sections) are still MIA. Completion rate of any course is ~10\% which is low.


## Outline
- Questions about Assignment 2
- Numpy Quiz
- Numpy Lecture
- Numpy Lab

If time permits:
- Object Oriented Programming and Classes

# About

![numpy](https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/1200px-NumPy_logo_2020.svg.png)


In this notebook we will run through an introduction to [numpy](https://numpy.org/), or numerical python, which is the foundation for the core data analytics and data science ecosystem.

See the absolute beginners intro at the numpy project library:
https://numpy.org/doc/stable/user/absolute_beginners.html




In [72]:
# import the library to get started
import numpy as np

Numpy is the workhorse underneath a number of our tools that we will be using.  At the core, numpy **arrays** look a lot like `lists` that we worked through last week, except that they have additional properties:

1.  numpy arrays can be multi-dimensional 

2.  numpy arrays allow us to do calculations on each element at the same time (we can't do this with lists).  These are **element-wise** calculations.

3.  unlike lists, numpy arrays hold a single **type**

>  ***From above, you see that I imported numpy as np.  This is the standard convention.***

# Lists to numpy arrays

In [73]:
# create a normal list
my_list = [1,2,4,5,6,7]

## print out the type and the values
print(type(my_list))
print(my_list)

<class 'list'>
[1, 2, 4, 5, 6, 7]


In [74]:
# create a new list and perform some math on it
new_list = [1,2,3]
#new_list = [i **3 for i in new_list]
new_list

[1, 2, 3]

In [75]:
# multiply the list by 3
new_list * 3

[1, 2, 3, 1, 2, 3, 1, 2, 3]

In [76]:
"ba765"*3

'ba765ba765ba765'

The above shows that, just as `+` concatenates lists (or strings), `*` repeats them.

However, when we work with numerical data, we often want to do element-wise math instead:

In [77]:
# numpy to the rescue
np_arr = np.array(new_list)
np_arr

array([1, 2, 3])

In [78]:
type(np_arr)

numpy.ndarray

In [79]:
# we converted our list to a numpy array. now let's try it
np_arr * 3

array([3, 6, 9])

While lists and tuples are powerful in base python, and we will use them later in the course, when it really comes to analyzing data, the foundations of the tools that we will use are in **numpy**.

Simply put, the higher-level tools use numpy under the hood, so it's important for us to have a baseline understanding of numpy before we dive into **pandas**, a popular library for working with data.

# Numpy Arrays - Single Type Only

Because numpy arrays are built to do calculations, helping us with analyzing data, each array must be of a single type. 

In [80]:
# create a normal list
normal_list = [1, 'a', True]
# a list keeps each element's own type; numpy (next cell) will cast them all to one common type
normal_list

[1, 'a', True]

This does what we expected: it printed a list. If we wanted, we could even have a list within a list, as we learned last week.

In [81]:
# convert the list to numpy, as we saw above with np.array(<list>)
np_convert = np.array(normal_list)
np_convert

array(['1', 'a', 'True'], dtype='<U21')

In [82]:
type(np_convert)

numpy.ndarray

In [83]:
np_convert.shape

(3,)

In [84]:
np_convert.ndim

1




Look at what we have above.  Numpy needs all elements in an array to be of the same type.  But it didn't error out, nor did it exclude any elements. What did numpy do?
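
One way to investigate is to inspect the `dtype` attribute. The sketch below (variable names are illustrative, not from the lecture) suggests that numpy casts every element to a common type able to represent all of them; when the mix contains a string, that common type is a string:

```python
import numpy as np

# a mix that includes a string: everything becomes a (unicode) string
mixed = np.array([1, 'a', True])
print(mixed.dtype.kind)            # 'U' means unicode string

# ints and floats: everything becomes a float
ints_and_floats = np.array([1, 2.5])
print(ints_and_floats.dtype)       # float64

# booleans and ints: everything becomes an integer
bools_and_ints = np.array([True, 2])
print(bools_and_ints.dtype.kind)   # 'i' means signed integer
```

The exact integer width (for example `int64`) can depend on the platform, which is why the sketch only checks the kind.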

# Core numpy features

## nd arrays

Multidimensional arrays and lists of lists are conceptually similar:

In [85]:
# See below a list of lists:
lol = [ [1,2], [3,4], [5,6] ]
lol

[[1, 2], [3, 4], [5, 6]]

In [86]:
lol[0][1]

2

In [87]:
# bring this into numpy array
lol_np = np.array(lol)
lol_np

array([[1, 2],
       [3, 4],
       [5, 6]])

In [88]:
lol_2 = np.array([ [1,2,3], [4,5,6] ])
print(lol_2.shape)
lol_2

(2, 3)


array([[1, 2, 3],
       [4, 5, 6]])

In [89]:
# what shape is our array
lol_np.shape

(3, 2)

> It's not exactly the same, but for now, we can think of this as a 3 row / 2 column Excel worksheet.  The `shape` attribute is something we are going to use quite a bit!



In [90]:
# how many dimensions within the array
lol_np.ndim

2

In [91]:
# how many elements total, or the size
lol_np.size

6

In [95]:
# remember, we can always get help
#lol_np?

In [93]:
len(lol_np)

3
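
The attributes above are related to each other. As a small sketch (the array `grid` is an illustrative stand-in for `lol_np`), `ndim` is the number of entries in `shape`, `size` is the product of those entries, and `len()` only reports the first one:

```python
import numpy as np

grid = np.array([[1, 2], [3, 4], [5, 6]])   # illustrative 3x2 array

rows, cols = grid.shape
print(grid.ndim == len(grid.shape))   # True: one shape entry per dimension
print(grid.size == rows * cols)       # True: size is the product of the shape
print(len(grid) == rows)              # True: len() only reports the first axis
```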

## Quick exercise:

Generate an array of 10 integers sampled at random from [0, 10).

Then confirm the length and shape of your array, calculate the mean, and calculate the standard deviation.

In [94]:
np.random.seed(765)
x = np.random.randint(0, 10, size=10)
type(x)
x
# methods to use are .mean() and .std()

array([4, 1, 6, 3, 0, 6, 3, 2, 1, 0])

In [101]:
print(len(x))
print('The Mean is', x.mean())
print('The Standard Deviation is', x.std())

10
The Mean is 2.6
The Standard Deviation is 2.1071307505705477
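
For reference, here is a hedged sketch of what `.std()` computes by default. It assumes the same ten values as the exercise (typed in directly rather than generated from the seed) and checks that the default is the population standard deviation, which divides by n; passing `ddof=1` divides by n - 1 instead:

```python
import numpy as np

x = np.array([4, 1, 6, 3, 0, 6, 3, 2, 1, 0])   # same values as the exercise

# population standard deviation by hand: sqrt(mean((x - mean)^2))
by_hand = np.sqrt(((x - x.mean()) ** 2).mean())
print(np.isclose(by_hand, x.std()))   # True

# sample standard deviation divides by n - 1 instead of n
sample_std = x.std(ddof=1)
print(sample_std > x.std())           # True
```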


In [124]:
np_multi = np.array([  [  [5, 10], [15,20] ], [ [25,30], [35,40] ]   ])
np_multi

array([[[ 5, 10],
        [15, 20]],

       [[25, 30],
        [35, 40]]])

In [118]:
np_multi.shape

(2, 2, 2)



## Slicing

In [125]:
lol_np

array([[1, 2],
       [3, 4],
       [5, 6]])

In [132]:
lol_np.shape

(3, 2)

In [133]:
# lets slice the array - include everything
lol_np[:,:]

array([[1, 2],
       [3, 4],
       [5, 6]])

In [140]:
# get just the first row
lol_np[0, :]

array([1, 2])

In [139]:
#find 3,4
lol_np[1:2]

array([[3, 4]])

Multidimensional arrays can be indexed with a comma-separated list of indices or slices, one per dimension:

In [152]:
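# return just 5
lol_np[2, 0]
# only the last expression in a cell is displayed: rows 1 onward, all columns
# (a slice stop past the end, like 0:3 here, is clipped to the array bounds)
lol_np[1:, 0:3]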

array([[3, 4],
       [5, 6]])

In [161]:
# return just 4
lol_np[1,1]
# first two rows of column 0 (not displayed: only the last expression is shown)
lol_np[:2, 0]
lol_np[1,1]

4
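
The indexing rules used in the cells above can be summarized in a short sketch (again with an illustrative array named `grid`): an integer index drops that axis, a slice keeps it, and slice bounds past the end are clipped rather than raising an error.

```python
import numpy as np

grid = np.array([[1, 2], [3, 4], [5, 6]])

row_as_1d = grid[1]       # integer index: the row axis is dropped
row_as_2d = grid[1:2]     # slice: the row axis is kept
print(row_as_1d.shape)    # (2,)
print(row_as_2d.shape)    # (1, 2)

# one index per axis: row 2, column 0
print(grid[2, 0])         # 5

# slice bounds past the end are clipped instead of raising an error
print(grid[1:, 0:3].shape)   # (2, 2)
```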

Let's look at a larger nd array.

In [162]:
# generate a larger array
np.random.seed(765) # for reproducing example
x2 = np.random.randint(10, 20, size=(3, 4))
x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 17, 14]])

In [167]:
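len(x2)   # 3: length of the first axis (rows)
x2.ndim   # 2: number of dimensions
x2.size   # 12: total number of elements (the only value displayed)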

12

In [163]:
# shape (rows/columns)
x2.shape

(3, 4)

In [168]:
# slicing to get all "rows" and first "column"
x2[:, 0]

array([14, 10, 11])

In [171]:
# 2 rows, 3 cols
x2[:2, :3]

array([[14, 11, 16],
       [10, 16, 13]])

In [172]:
# just get one row
x2[0, :]

array([14, 11, 16, 13])

In [174]:
x2
#normal x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 17, 14]])

In [176]:
# want to reverse?
x2[::-1, :]
# makes the first row the bottom row and the bottom row the top row


array([[11, 10, 17, 14],
       [10, 16, 13, 12],
       [14, 11, 16, 13]])

In [178]:
x2[:, ::-1]
# flipped columns

array([[13, 16, 11, 14],
       [12, 13, 16, 10],
       [14, 17, 10, 11]])

In [177]:
x2[::-1, ::-1]
# flips both the row order and the column order
# doesn't change the original array

array([[14, 17, 10, 11],
       [12, 13, 16, 10],
       [13, 16, 11, 14]])

In [182]:
my_list = [1,2,3,4]
my_list[::-1]
# a step other than -1 (e.g. -2) would skip elements while reversing

[4, 3, 2, 1]

In [183]:
x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 17, 14]])

In [184]:
# we can also do math!
print(x2.mean())   #mean
print(x2.std())    #standard deviation
print(x2.max())    #max


13.083333333333334
2.289771944005681
17


In [185]:
x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 17, 14]])

## Quick exercise 

Update the value of x2 such that the element that was previously 17 becomes 71. You can locate the element manually.

---


In [188]:
# reassign a value: array[row, col] = value
x2[2,2] = 71
x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 71, 14]])
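
A related point worth a hedged sketch: slicing by itself does not modify an array, but a basic slice is a *view* onto the same data, so assigning into the slice does change the original. The names below are illustrative; `.copy()` is the usual way to get an independent array.

```python
import numpy as np

original = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

view = original[:, ::-1]      # basic slicing returns a view, not new data
view[0, 0] = 99               # writes through to original[0, 2]
print(original[0, 2])         # 99

safe = original[:, ::-1].copy()   # .copy() makes an independent array
safe[0, 0] = -1
print(original[0, 2])         # still 99
```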

# Useful methods

In [191]:
# arange: like range, generate an array of numbers with bounds and spacing
x = np.arange(2, 10, 2)
x

array([2, 4, 6, 8])

In [192]:
# do some math or finding location
print("the mean of the array is {}".format( x.mean() ))   # x.mean()
print("the largest value in the array is {}".format( x.max() ))   # x.max()
print("the index for the largest value in the array is {}".format( x.argmax() ))   # x.argmax()
#returns index of the maximum value 


the mean of the array is 5.0
the largest value in the array is 8
the index for the largest value in the array is 3


In [194]:
x

array([2, 4, 6, 8])

Conditionals and Masks - very important:

In [195]:
# we can generate an array of booleans for subsetting - this is very useful!
x > 5

array([False, False,  True,  True])

In [196]:
# we can also do complex logic: logical and
np.logical_and(x > 0, x <5)

array([ True,  True, False, False])

In [197]:
# logical or
np.logical_or(x > 5, x <7)

array([ True,  True,  True,  True])

---

## Discussion on logical_and/or

Boolean arrays are a really important concept to understand.  They are the foundation for how we filter rows/columns in our data analysis in python (pandas, more specifically).  When slicing (i.e. filtering) our data, we keep the `True` elements.




In [200]:
# Let's see this in action
np.random.seed(765)
x = np.random.randint(0, 10, size=100)
x
print(x.shape)
x[:5]

(100,)


array([4, 1, 6, 3, 0])

In [201]:
x.ndim

1

In [209]:
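# save a variable to store our test results
tests = (x >= 4)
tests[:5]
# np.where(condition, value_if_true, value_if_false)
test_2 = np.where(x>=4, True, False)   # same booleans as tests
test_3 = np.where(x>=4, x*10, 0)       # 10x where the test passes, 0 elsewhere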

In [211]:
test_3

array([40,  0, 60,  0,  0, 60,  0,  0,  0,  0, 70, 40, 40,  0,  0, 80,  0,
       70, 80, 70, 70, 80, 60, 80,  0, 80,  0, 60, 90,  0, 70, 40,  0, 90,
       40, 80, 40, 50,  0, 50,  0,  0, 90,  0, 50, 80,  0, 70, 40,  0,  0,
       70,  0, 50,  0, 40,  0, 40, 70,  0, 50, 60, 40, 80,  0, 40, 50,  0,
       40,  0, 80,  0,  0,  0,  0, 80, 80, 50, 60,  0, 70, 80,  0, 50,  0,
       60, 90,  0,  0, 60,  0,  0,  0,  0,  0, 40, 40,  0,  0,  0])

In [205]:
# confirm that they are the same shape
x.shape == tests.shape

True

In [206]:
# finally, keep just those that meet our criteria
# WE ONLY KEEP THE TRUE
keep = x[tests]
keep.shape

(55,)

In [207]:
keep

array([4, 6, 6, 7, 4, 4, 8, 7, 8, 7, 7, 8, 6, 8, 8, 6, 9, 7, 4, 9, 4, 8,
       4, 5, 5, 9, 5, 8, 7, 4, 7, 5, 4, 4, 7, 5, 6, 4, 8, 4, 5, 4, 8, 8,
       8, 5, 6, 7, 8, 5, 6, 9, 6, 4, 4])

> To see a much larger list of what is possible for logical check in numpy, here are the docs:  https://numpy.org/doc/stable/reference/routines.logic.html
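
As a compact sketch of the masking workflow above, conditions can also be combined with the operators `&` (and), `|` (or), and `~` (not), which behave like `np.logical_and`, `np.logical_or`, and `np.logical_not` on boolean arrays. The small array here is chosen for illustration:

```python
import numpy as np

x = np.arange(2, 10, 2)            # array([2, 4, 6, 8])

# & is the element-wise "and"; the parentheses are required
mask = (x > 2) & (x < 8)
print(mask)                         # [False  True  True False]
print(x[mask])                      # [4 6]

# same result as the function form
print(np.array_equal(mask, np.logical_and(x > 2, x < 8)))   # True

# | is "or" and ~ is "not"
print(x[(x < 4) | (x > 6)])         # [2 8]
print(x[~mask])                     # [2 8]
```

The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` in Python.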

-----
-----

## Reshaping
Another method that comes in very handy is `reshape`, which changes an array's shape without changing its data. The name says it all, so we will just show how it works:

In [212]:
my_array = np.arange(10)
my_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [214]:
my_array.reshape((2,5))

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [220]:
my_array_2 = np.arange(12)
my_array_2

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [225]:
my_array_2.reshape((2,6))

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
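
Two details of `reshape` are worth a short sketch (variable names are illustrative): one dimension may be given as `-1` so that numpy infers it, and the requested shape must account for exactly the number of elements in the array.

```python
import numpy as np

values = np.arange(12)

# -1 asks numpy to infer that dimension from the total size
inferred = values.reshape((3, -1))
print(inferred.shape)               # (3, 4)

# the new shape must account for every element
try:
    values.reshape((5, 3))          # 15 slots for 12 elements
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```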

## Mathematical Operations on Axis

Helpful resource:  https://www.sharpsightlabs.com/blog/numpy-axes-explained/

![axis](https://i.imgur.com/4aL45nA.png)

<br>

### What is really happening?

<br>

Axis 0 = we collapse/summarize *across* the rows (one result per column)

![axis0](https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2018/12/numpy-axes-np-sum-axis-0.png)

Axis 1 = we collapse/summarize *across* the columns (one result per row)


![axis1](https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2018/11/numpy-axes-np-sum-axis-1.png)


In [226]:
np.random.seed(765)
a = np.random.randint(0, 100, size=(3, 7))

In [227]:
a

array([[29, 90,  4, 46, 27, 81, 70],
       [92, 79, 60, 19,  0, 38, 18],
       [65, 16,  7, 42, 36, 64, 17]])

In [228]:
type(a)

numpy.ndarray

In [229]:
a.shape

(3, 7)

In [230]:
a.ndim

2

In [231]:
a.mean()

42.857142857142854

In [232]:
a.sum()

900

In [233]:
a.max()

92

In [234]:
a.min()

0

Now we will use the `sum` operation specifying the array axis (refer to images above)

We are summing across axis 0 (the first axis):

In [254]:
a

array([[29, 90,  4, 46, 27, 81, 70],
       [92, 79, 60, 19,  0, 38, 18],
       [65, 16,  7, 42, 36, 64, 17]])

In [242]:
a.sum(axis=0)
# one sum per column (collapses the rows)
a.sum(axis=1)
# one sum per row (collapses the columns)
a.sum(axis=(0, 1))
# sum of all numbers (only this last result is displayed)

900

In [243]:
len(a.sum(axis=0))

7

What if we performed the sum across the other axis, axis 1? How many elements would it return?

In [244]:
a.sum(axis=1)

array([347, 306, 247])
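
The axis behaviour is easier to verify by hand on a small array. This is a minimal sketch with an illustrative 2x3 array; the `keepdims=True` argument is an extra that keeps the collapsed axis with length 1:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])       # shape (2, 3)

col_sums = small.sum(axis=0)        # collapses the 2 rows
row_sums = small.sum(axis=1)        # collapses the 3 columns
print(col_sums)                     # [5 7 9]
print(row_sums)                     # [ 6 15]

# keepdims=True keeps the collapsed axis with length 1
print(small.sum(axis=1, keepdims=True).shape)   # (2, 1)
```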

## Quick exercise:

Find the mean of the above array along various axes.

In [264]:
a.mean(axis=1)


array([49.57142857, 43.71428571, 35.28571429])

In [267]:
a.mean(axis=0)

array([62.        , 61.66666667, 23.66666667, 35.66666667, 21.        ,
       61.        , 35.        ])

## Array operations
Now let's do some array operations 

In [273]:
# array ops
np.random.seed(765)
x = np.random.randint(0, 10, size=(10))
y = np.random.randint(20, 100, size=(10))
print(x)
print(y)

[4 1 6 3 0 6 3 2 1 0]
[27 62 56 84 37 28 49 75 75 26]


In [279]:
x + y

array([31, 63, 62, 87, 37, 34, 52, 77, 76, 26])

In [280]:
x *y

array([108,  62, 336, 252,   0, 168, 147, 150,  75,   0])

In [281]:
np.random.seed(765)
x = np.random.randint(0, 10, size=(9))
y = np.random.randint(20, 100, size=(10))

In [285]:
x - y

ValueError: operands could not be broadcast together with shapes (9,) (10,) 
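
The error message mentions broadcasting, which is numpy's rule for combining arrays of different shapes. As a hedged sketch with illustrative arrays: a scalar combines with any shape, and a 1-D array combines with a 2-D array when their trailing dimensions match, but lengths 9 and 10 cannot be matched up.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))    # shape (2, 3)

print((a + 10).shape)                     # (2, 3): a scalar stretches to any shape
print((a + np.array([1, 2, 3])).shape)    # (2, 3): trailing dimension 3 matches

try:
    np.arange(9) - np.arange(10)          # shapes (9,) and (10,) do not match
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(shapes_compatible)                  # False
```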

In [289]:
print(x)


[4 1 6 3 0 6 3 2 1]


In [290]:
x

array([4, 1, 6, 3, 0, 6, 3, 2, 1])

In [291]:
x.argmax()

2

`np.where` is another very important function for arrays, and since we already saw how to build logical arrays using conditionals earlier, it will look familiar:

In [None]:
?np.where

In [292]:
x

array([4, 1, 6, 3, 0, 6, 3, 2, 1])

In [293]:
np.where(x >  3, True, False)

array([ True, False,  True, False, False,  True, False, False, False])


>np.where is great for feature engineering because we can evaluate our values and recode the columns however we want.

Below is an example of conditional replacement (values that fail the test are replaced with 3):

In [294]:
x

array([4, 1, 6, 3, 0, 6, 3, 2, 1])

In [299]:
np.where(x > 3, x, 3)

array([4, 3, 6, 3, 3, 6, 3, 3, 3])
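
A few more uses of `np.where`, sketched with the same nine values typed in directly. The labels `"high"` and `"low"` are made up for illustration of the recoding idea:

```python
import numpy as np

x = np.array([4, 1, 6, 3, 0, 6, 3, 2, 1])   # same values as above

# recode a numeric column into labels
labels = np.where(x > 3, "high", "low")
print(labels[:3])                   # ['high' 'low' 'high']

# replacing small values with 3 is the same as an element-wise maximum
floored = np.where(x > 3, x, 3)
print(np.array_equal(floored, np.maximum(x, 3)))   # True

# with only a condition, np.where returns the indices where it is True
print(np.where(x > 3)[0])           # [0 2 5]
```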

# Summary

In this notebook, we learned 

- numpy generates arrays that are very powerful for analysis, and can be thought of as enhanced lists

- numpy arrays must be of one type, else they will be converted to a singular type

- numpy arrays can be multidimensional

- numpy arrays can be sliced just like lists, with one slice per dimension, e.g. `[:, :]`

- we can perform mathematical operations using numpy