# About

![numpy](https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/1200px-NumPy_logo_2020.svg.png)


This notebook introduces [numpy](https://numpy.org/), short for Numerical Python, which is the foundation of Python's core data analytics and data science ecosystem.





In [2]:
# import the library to get started
import numpy as np

Numpy is the workhorse underneath many of the tools we will be using.  At their core, numpy **arrays** look a lot like the `lists` we worked through last week, with some important differences:

1.  numpy arrays can be multi-dimensional 

2.  numpy arrays allow us to do calculations on every element at the same time (lists need a loop or comprehension for this).  These are **element-wise** calculations.

3.  unlike lists, numpy arrays hold a single **type**

>  ***From above, you see that I imported numpy as np.  This is the standard convention.***

# Lists to numpy arrays

In [3]:
# create a normal list
my_list = [1,2,4,5,6,7]

## print out the type and the values
print(type(my_list))
print(my_list)

<class 'list'>
[1, 2, 4, 5, 6, 7]


In [4]:
# create a new list and perform some math on it
new_list = [1,2,3]
new_list = [ i *2 for i in new_list]
print(new_list)

[2, 4, 6]


In [5]:
# multiply the list by 3 (this repeats the list rather than scaling the values)
new_list *3

[2, 4, 6, 2, 4, 6, 2, 4, 6]

In [6]:
"ba765"*3

'ba765ba765ba765'

The output above shows that, just as `+` joins lists and strings together, `*` repeats them rather than multiplying each element.
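As a quick check of this behavior, the sketch below contrasts `+` and `*` on a plain list and shows the comprehension needed to scale each value. The variable names are illustrative only.

```python
# plain python lists: + joins, * repeats
base = [2, 4, 6]

joined = base + base      # concatenation
repeated = base * 3       # repetition, not multiplication

# scaling each value needs a loop or a comprehension
scaled = [value * 3 for value in base]

print(joined)
print(repeated)
print(scaled)
```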

In [11]:
# numpy to the rescue
np_arr = np.array(new_list)
np_arr

array([2, 4, 6])

In [10]:
type(np_arr)

numpy.ndarray

In [12]:
# we converted our list to a numpy array. now let's try it
np_arr * 3

array([ 6, 12, 18])

Lists and tuples are powerful in base python, and we will keep using them later in the course, but the data analysis tools we will rely on are built on **numpy**.

Simply put, the higher-level tools use numpy under the hood, so it's important for us to have a baseline understanding of numpy before we dive into **pandas**.

# Numpy Arrays - Single Type Only

Because numpy arrays are built for calculations that help us analyze data, each array must be of a single type.

In [13]:
# create a normal list
normal_list = [1, 'a', True]
normal_list

[1, 'a', True]

This does what we expected: it displayed the list. If we wanted, we could even have a list within a list, as we learned last week.

In [14]:
# convert the list to numpy, as we saw above with np.array(<list>)
np_convert = np.array(normal_list)
np_convert

array(['1', 'a', 'True'], dtype='<U21')

In [15]:
type(np_convert)

numpy.ndarray




Look at what we have above.  Numpy needs all elements in an array to be of the same type.  But it didn't error out, nor did it exclude any elements. What did numpy do?
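One way to investigate this is to inspect the `dtype` attribute. In the sketch below, numpy appears to pick one common type that can represent every element, which for a mix of numbers, text, and booleans is a string type. The variable names are made up for illustration, and the exact integer width can vary by platform.

```python
import numpy as np

mixed = np.array([1, 'a', True])       # numbers, text and booleans together
ints_and_floats = np.array([1, 2.5])   # integer and float together
ints_and_bools = np.array([1, True])   # integer and boolean together

# dtype.kind is a one-letter code: 'U' = unicode string, 'f' = float, 'i' = signed integer
print(mixed.dtype.kind, mixed)
print(ints_and_floats.dtype.kind, ints_and_floats)
print(ints_and_bools.dtype.kind, ints_and_bools)
```

Checking `dtype` right after building an array from mixed data is a cheap way to catch an unintended conversion to strings.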

# Core numpy features

## nd arrays

In [16]:
# remember a list of lists?
lol = [ [1,2], [3,4], [5,6] ]
lol

[[1, 2], [3, 4], [5, 6]]

In [17]:
lol[0][1]

2

In [20]:
test_shape = np.array([1,2])
test_shape.shape

(2,)

In [21]:
# bring this into numpy array
lol_np = np.array(lol)
lol_np

array([[1, 2],
       [3, 4],
       [5, 6]])

In [22]:
# what shape is our array
lol_np.shape

(3, 2)

> It's not exactly the same, but for now, we can think of this as a 3 row / 2 column Excel worksheet.  The `shape` attribute is something we are going to use quite a bit!



In [23]:
# how many dimensions within the array
lol_np.ndim

2

In [24]:
# how many elements total, or the size
lol_np.size

6
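The three attributes above are related. As a small sketch, using a hand-typed copy of the same 3 x 2 array under the hypothetical name `grid`:

```python
import numpy as np

grid = np.array([[1, 2], [3, 4], [5, 6]])

print(grid.shape)   # length along each dimension
print(grid.ndim)    # number of dimensions, same as len(grid.shape)
print(grid.size)    # total elements, the product of the shape
```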

In [25]:
# remember, we can always get help
#lol_np?



## Slicing

In [26]:
lol_np

array([[1, 2],
       [3, 4],
       [5, 6]])

In [27]:
lol_np.shape

(3, 2)

In [28]:
# let's slice the array - get just the first row
lol_np[0]

array([1, 2])

In [29]:
# get a single element: row 2, column 1
lol_np[2,1]

6

Multidimensional arrays can be indexed and sliced with a comma-separated index or slice for each dimension, as in `[row, column]`.

In [30]:
# return just 5 
lol_np[2, 0]

5
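To make the comma syntax concrete, here is a minimal sketch on a hand-typed copy of the same array (the name `grid` is illustrative). It compares comma indexing with the chained indexing we used for a list of lists, and mixes in slices.

```python
import numpy as np

grid = np.array([[1, 2], [3, 4], [5, 6]])

# one index per dimension, separated by a comma
print(grid[2, 0])     # row 2, column 0
print(grid[2][0])     # chained indexing reaches the same element

# a slice can stand in for either index
print(grid[:, 1])     # every row, column 1
print(grid[1:, :])    # rows 1 onward, every column
```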

Let's look at a larger nd array.

In [31]:
# generate a larger array
np.random.seed(765) # for reproducing example
x2 = np.random.randint(10, 20, size=(3, 4))
x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 17, 14]])

In [32]:
# rows/columns
x2.shape

(3, 4)

In [34]:
# all "rows" and first "column"
x2[0:, 0]

array([14, 10, 11])

In [35]:
# 2 rows, 3 cols
x2[:2, :3]

array([[14, 11, 16],
       [10, 16, 13]])

In [36]:
# just get one row
x2[0]

array([14, 11, 16, 13])

In [37]:
# want to reverse both the rows and the columns?
x2[::-1, ::-1]


array([[14, 17, 10, 11],
       [12, 13, 16, 10],
       [13, 16, 11, 14]])
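A step of `-1` can also be applied to one dimension at a time. The sketch below retypes the values of `x2` by hand, under the hypothetical name `grid`, so it does not depend on the random seed. It reverses the rows only, the columns only, and both.

```python
import numpy as np

# same values as the x2 array shown above, typed in by hand
grid = np.array([[14, 11, 16, 13],
                 [10, 16, 13, 12],
                 [11, 10, 17, 14]])

rows_reversed = grid[::-1, :]     # flip the row order only
cols_reversed = grid[:, ::-1]     # flip the column order only
both_reversed = grid[::-1, ::-1]  # flip both

print(rows_reversed)
print(cols_reversed)
# np.flip with no axis argument reverses every axis
print(np.array_equal(both_reversed, np.flip(grid)))
```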

In [38]:
x2

array([[14, 11, 16, 13],
       [10, 16, 13, 12],
       [11, 10, 17, 14]])

In [39]:
# we can also do math!
print(x2.mean())   #mean
print(x2.std())    #standard deviation
print(x2.max())    #max


13.083333333333334
2.289771944005681
17


---


# Useful methods

In [46]:
# arange: like range, generate an array of numbers with bounds and spacing
x = np.arange(2, 10, 2)
x

array([2, 4, 6, 8])
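As a small sketch of how `np.arange` relates to the built-in `range`: the stop value is excluded in both, and `arange` additionally accepts a non-integer step. The variable names are illustrative.

```python
import numpy as np

evens = np.arange(2, 10, 2)            # start, stop (excluded), step
same_as_range = list(range(2, 10, 2))

print(evens.tolist() == same_as_range)

# unlike range, arange also accepts a non-integer step
halves = np.arange(0, 2, 0.5)
print(halves)
```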

In [47]:
# do some math and find locations
print(f"the mean of the array is {x.mean()}")   # x.mean()
print(f"the largest value in the array is {x.max()}")   # x.max()
print(f"the index for the largest value in the array is {x.argmax()}")   # x.argmax()


the mean of the array is 5.0
the largest value in the array is 8
the index for the largest value in the array is 3


In [48]:
x

array([2, 4, 6, 8])

In [49]:
# we can generate an array of booleans for subsetting 
x > 5

array([False, False,  True,  True])

In [50]:
# we can also do complex logic - logical and
np.logical_and(x > 0, x <5)

array([ True,  True, False, False])

In [51]:
# logical or
np.logical_or(x > 5, x <7)

array([ True,  True,  True,  True])
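The same element-wise logic is often written with the `&`, `|`, and `~` operators. In this sketch each comparison is wrapped in parentheses, because these operators bind more tightly than `>` and `<`. The array is rebuilt here under the illustrative name `values`.

```python
import numpy as np

values = np.arange(2, 10, 2)   # array([2, 4, 6, 8])

both = (values > 0) & (values < 5)     # element-wise "and"
either = (values > 5) | (values < 3)   # element-wise "or"
flipped = ~(values > 5)                # element-wise "not"

print(both)
print(either)
print(flipped)
```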

---

#### Discussion on logical_and/or

Boolean arrays are a really important concept to understand.  They are the foundation for how we filter rows/columns in our data analysis in python (pandas, more specifically).  When slicing (i.e. filtering) our data with a boolean array, we keep the elements where the value is `True`.




In [53]:
# Let's see this in action
x = np.random.randint(0, 10, size=100)
print(x.shape)
x[:5]

(100,)


array([2, 0, 3, 0, 8])

In [55]:
x.ndim

1

In [56]:
# save a variable to store our test results
tests = x >= 7
tests[:5]

array([False, False, False, False,  True])

In [57]:
# confirm that they are the same shape
x.shape == tests.shape

True

In [59]:
# finally, keep just those that meet our criteria
# WE ONLY KEEP THE TRUE
keep = x[tests]
print(keep)
keep.shape

[8 8 7 8 8 8 7 8 8 9 9 7 7 8 8 7 8 8 7 9 7 9 8 9 7 9 7]


(27,)

In [60]:
keep

array([8, 8, 7, 8, 8, 8, 7, 8, 8, 9, 9, 7, 7, 8, 8, 7, 8, 8, 7, 9, 7, 9,
       8, 9, 7, 9, 7])
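The same pattern can be checked on a small array whose values are made up for illustration. The sketch below also shows the complement of the mask, and that summing a boolean mask counts the `True` entries.

```python
import numpy as np

scores = np.array([2, 0, 3, 0, 8, 7, 9, 1])   # made-up values for illustration
mask = scores >= 7

kept = scores[mask]          # only positions where mask is True survive
dropped = scores[~mask]      # the complement

# True counts as 1, so summing the mask counts the kept elements
print(mask.sum(), kept.shape)
print(kept, dropped)
```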

> To see a much larger list of what is possible for logical check in numpy, here are the docs:  https://numpy.org/doc/stable/reference/routines.logic.html

-----

## Mathematical Operations on Axis

Helpful resource:  https://www.sharpsightlabs.com/blog/numpy-axes-explained/

![axis](https://i.imgur.com/4aL45nA.png)

<br>

### What is really happening?

<br>

Axis 0 = we collapse/summarize *across* the rows, leaving one result per column

![axis0](https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2018/12/numpy-axes-np-sum-axis-0.png)

Axis 1 = we collapse/summarize *across* the columns, leaving one result per row


![axis1](https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2018/11/numpy-axes-np-sum-axis-1.png)


In [62]:
a = np.random.randint(0, 100, size=(3, 7))

In [63]:
a

array([[56, 33, 77, 40, 88, 99, 32],
       [ 9,  3, 68, 10, 80, 18, 44],
       [77, 38, 93, 43,  9,  9, 40]])

In [64]:
type(a)

numpy.ndarray

In [65]:
a.shape

(3, 7)

In [66]:
a.ndim

2

In [67]:
a.mean()

46.0

In [68]:
a.sum()

966

In [69]:
a.max()

99

In [70]:
a.min()

3

In [73]:
a.sum(axis=0)

array([142,  74, 238,  93, 177, 126, 116])

In [74]:
len(a.sum(axis=0))

7

In [75]:
a.sum(axis=1)

array([425, 232, 309])
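To tie the axis argument back to shapes, the sketch below retypes the values of `a` by hand, under the hypothetical name `table`, so it does not depend on random state. It checks what is left after each collapse.

```python
import numpy as np

# same values as the array a shown above, typed in by hand
table = np.array([[56, 33, 77, 40, 88, 99, 32],
                  [ 9,  3, 68, 10, 80, 18, 44],
                  [77, 38, 93, 43,  9,  9, 40]])

col_totals = table.sum(axis=0)   # rows collapsed: one total per column
row_totals = table.sum(axis=1)   # columns collapsed: one total per row

print(col_totals.shape, row_totals.shape)
# either set of totals adds up to the grand total
print(col_totals.sum(), row_totals.sum(), table.sum())
```

A useful habit is to predict the output shape first: the axis you name is the one that disappears.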

In [76]:
# array ops
x = np.random.randint(0, 10, size=(10))
y = np.random.randint(20, 100, size=(10))
print(x)
print(y)

[0 7 5 5 2 3 8 9 8 7]
[26 30 62 32 71 84 25 48 25 70]


In [77]:
x - y

array([-26, -23, -57, -27, -69, -81, -17, -39, -17, -63])

In [78]:
x *y

array([  0, 210, 310, 160, 142, 252, 200, 432, 200, 490])

In [81]:
x = np.random.randint(0, 10, size=(9))
y = np.random.randint(20, 100, size=(10))
print(x)
print(y)

[9 1 0 2 2 4 2 9 6]
[37 53 63 44 87 48 23 69 73 95]


In [82]:
x - y

ValueError: operands could not be broadcast together with shapes (9,) (10,) 
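The error above comes from numpy's shape rules: element-wise operations need shapes that match or that can be broadcast to a common shape. The sketch below uses small hand-made arrays that are not from this notebook. A scalar and a matching-length row are accepted, while mismatched lengths raise `ValueError`.

```python
import numpy as np

short = np.array([9, 1, 0])
longer = np.array([37, 53, 63, 44])
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(short - 1)        # a scalar is stretched to every element
print(table - short)    # shape (2, 3) with (3,): the row is applied to each row

try:
    short - longer      # shapes (3,) and (4,) cannot be lined up
except ValueError as err:
    message = str(err)
    print(message)
```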

In [83]:
print(x)

[9 1 0 2 2 4 2 9 6]


In [84]:
x

array([9, 1, 0, 2, 2, 4, 2, 9, 6])

In [86]:
x.argmin()

2

In [87]:
# np where
x

array([9, 1, 0, 2, 2, 4, 2, 9, 6])

In [93]:
np.where(x >  4, 1, 0)

array([1, 0, 0, 0, 0, 0, 0, 1, 1])


> `np.where` is great for feature engineering because we can evaluate our values and recode the columns however we want

In [91]:
np.where(x > 4)

(array([0, 7, 8]),)
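The two call styles of `np.where` return different things. With three arguments it chooses between two outputs for every element. With one argument it returns a tuple holding one index array per dimension. A minimal sketch, retyping the values of `x` under the illustrative name `values`:

```python
import numpy as np

values = np.array([9, 1, 0, 2, 2, 4, 2, 9, 6])   # same values as x above

# three arguments: choose between two outputs for every element
flags = np.where(values > 4, "high", "low")

# one argument: a tuple holding one index array per dimension
(positions,) = np.where(values > 4)

print(flags)
print(positions)
print(values[positions])   # same elements as values[values > 4]
```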

# Summary

In this notebook, we learned 

- numpy generates arrays that are very powerful for analysis, and can be thought of as enhanced lists

- numpy arrays must be of one type, else they will be converted to a single type

- numpy arrays can be multidimensional

- numpy arrays can be sliced just like lists, with one slice per dimension, as in `[:, :]`

- we can perform mathematical operations using numpy