<a href="https://colab.research.google.com/github/KuroShiroe/ml-class/blob/main/Christopher_Puglisi_of_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1: Getting started with NumPy

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/numpy.png)

## Task 1: Fundamentals

Let's spend a few minutes just learning some of the fundamentals of NumPy (pronounced num-pie, **not num-pee**).

### what is NumPy
NumPy is a Python library that supports large, multi-dimensional arrays and matrices.

Let's look at an example. Suppose we start with a little table:

| a  | b | c  |  d | e |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 2 | 3 | 4 |
|10| 11| 12 | 13 | 14|
|20| 21 | 22 | 23 | 24 |
|30 | 31 | 32 | 33 | 34 |
|40 |41 | 42 | 43 | 44 |

and we simply want to add 10 to each cell:

| a  | b | c  |  d | e |
| :---: | :---: | :---: | :---: | :---: |
| 10 | 11 | 12 | 13 | 14 |
|20| 21| 22 | 23 | 24|
|30| 31 | 32 | 33 | 34 |
|40 | 41 | 42 | 43 | 44 |
|50 |51 | 52 | 53 | 54 |



First, let's construct a similar 5 x 5 table in plain Python:

In [3]:
a5 = [[x + y * 5 for x in range(5)] for y in range(5)]
a5



[[0, 1, 2, 3, 4],
 [5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

And suppose we have the magic function `addToA5(i)` that will add *i* to each cell in the array:

```
addToA5(10)
a5
```
returns

```
[[10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24],
 [25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34]]
 ```


To make things interesting, instead of a 5 x 5 array, let's make it 1,000 x 1,000 -- so 1 million cells!

In [5]:
a = [[x + y * 1000 for x in range(1000)] for y in range(1000)]


![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/torchdivide.png)


# <font color='#EE4C2C'>You Try ...</font> 
Ok, time to get coding

## <font color='#EE4C2C'>1. addToArray(i)</font> 
Can you write a function `addToArray(i)` so that `addToArray(10)` adds 10 to each cell in our 1000x1000 matrix?

In [46]:
def addToArray(i):
  # add i to every cell of the global 1000 x 1000 list `a`, in place
  for row in a:
    for col in range(len(row)):
      row[col] += i


Let's take a look at how much time it takes to run that function:

In [48]:
%time addToArray(10)

CPU times: user 143 ms, sys: 12.9 ms, total: 156 ms
Wall time: 156 ms


My results were:

    CPU times: user 145 ms, sys: 0 ns, total: 145 ms
    Wall time: 143 ms

So about 1/7 of a second. 

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/numpySmall2.png)
### Doing it using NumPy
Now we will try the same using NumPy.


We can construct the array using
    
    arr = np.arange(1000000).reshape((1000,1000))

Not sure what that line does? NumPy has great online documentation. [Documentation for np.arange](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) says it "Return evenly spaced values within a given interval." Let's try it out:

In [49]:
import numpy as np
np.arange(16)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

So `np.arange(16)` creates a one-dimensional array of 16 sequential integers. [The documentation for reshape](https://numpy.org/doc/1.18/reference/generated/numpy.reshape.html) says, as the name suggests, "Gives a new shape to an array without changing its data." Suppose we want to reshape our one-dimensional array of 16 integers into a 4x4 one. We can do:

In [50]:
np.arange(16).reshape((4,4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

As you can see it is pretty easy to find documentation on Numpy.
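To go one step further with the two functions above, here is a small sketch of two options the documentation describes: `arange` with an explicit start, stop, and step, and `reshape` with `-1` so NumPy works out one dimension for you. The variable names `evens` and `grid` are made up for this illustration.

```python
import numpy as np

# arange also accepts start, stop, and step (the stop value is excluded)
evens = np.arange(0, 10, 2)
print(evens)          # [0 2 4 6 8]

# -1 asks reshape to infer that dimension from the size of the array
grid = np.arange(12).reshape((3, -1))
print(grid.shape)     # (3, 4)
```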

Back to our example of creating a 1000x1000 matrix, we now can time how long it takes to add 10 to each cell.

    %time arr = arr + 10
    
Let's put this all together:

In [51]:
import numpy as np
arr = np.arange(1000000).reshape((1000,1000))
%time arr = arr + 10

CPU times: user 2.86 ms, sys: 2.1 ms, total: 4.96 ms
Wall time: 7.64 ms


My results were

    CPU times: user 1.26 ms, sys: 408 µs, total: 1.67 ms
    Wall time: 1.68 ms

So, depending on your computer, somewhere around 25 to 100 times faster. **That is phenomenally faster!**
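If you are not working in a notebook, `%time` is not available. One possible way to reproduce the comparison is with the standard library's `timeit` module, as in the sketch below. The helper names `add_python` and `add_numpy` are made up for this example, and the timings you see will depend on your machine.

```python
import timeit
import numpy as np

n = 1000
py_list = [[x + y * n for x in range(n)] for y in range(n)]
np_arr = np.arange(n * n).reshape((n, n))

def add_python(table, i):
    # build a new nested list with i added to every cell
    return [[cell + i for cell in row] for row in table]

def add_numpy(arr, i):
    # NumPy applies the addition to every element in one call
    return arr + i

# average over 5 runs of each version
py_seconds = timeit.timeit(lambda: add_python(py_list, 10), number=5) / 5
np_seconds = timeit.timeit(lambda: add_numpy(np_arr, 10), number=5) / 5
print(f"plain Python: {py_seconds * 1000:.1f} ms per run")
print(f"NumPy:        {np_seconds * 1000:.1f} ms per run")
```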


### built in functions
In addition to being faster, NumPy has a wide range of built-in functions. So, for example, instead of writing your own code to calculate the mean, sum, or standard deviation of a multidimensional array, you can just use NumPy:

In [None]:
arr.mean()

In [53]:
arr.sum()

500009500000

In [54]:
 arr.std()

288675.1345946685

So not only is it faster, but NumPy minimizes the code you have to write. A win-win.
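These functions can also work along one dimension at a time through the `axis` argument. Here is a small sketch on a made-up 2 x 3 array called `small`:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(small.sum())          # 21, every element
print(small.sum(axis=0))    # [5 7 9], one total per column
print(small.sum(axis=1))    # [ 6 15], one total per row
print(small.mean(axis=1))   # [2. 5.], one mean per row
```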

Let's continue with some basics.

## NumPy examined 
So NumPy is a library containing a super-fast n-dimensional array object and a load of functions that can operate on those arrays. To use NumPy, we must first load the library into our code and we do that with the statement:


In [55]:
 import numpy as np

Perhaps most of you are saying "fine, fine, I know this already", but let me catch others up to speed. This is just one of several ways we can load a library into Python. We could just say:

In [56]:
 import numpy

and every time we need to use one of the functions built into NumPy we would need to preface that function with `numpy`. So for example, we could create an array with


In [57]:
arr = numpy.array([1, 2, 3, 4, 5])

If we got tired of writing `numpy` in front of every function, instead of typing

In [58]:
import numpy

we could write:

In [59]:
from numpy import *

(where that * means 'everything' and the whole expression means import everything from the NumPy library).  Now we can use any NumPy function without putting `numpy` in front of it:

In [60]:
arr = array([1, 2, 3, 4, 5])

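This may at first seem like a good idea, but it is considered bad form by Python developers: it dumps every NumPy name into your namespace, where it can silently replace names you already rely on (including built-ins such as `sum`), and it hides where each function came from.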
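To see one concrete consequence of the star import, the sketch below runs it inside a scratch dictionary (the name `namespace` is made up for this example) so that your own session is not affected, and then checks which `sum` you end up with.

```python
import builtins
import numpy

# run the star import in a scratch namespace so we can inspect what it brings in
namespace = {}
exec("from numpy import *", namespace)

# the name `sum` now refers to NumPy's function, not Python's built-in
print(namespace["sum"] is numpy.sum)      # True
print(namespace["sum"] is builtins.sum)   # False
```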

The solution is to use what we initially introduced:

In [61]:
 import numpy as np

This makes `np` an alias for `numpy`, so now we put *np* in front of NumPy functions.

In [62]:
 arr = np.array([1, 2, 3, 4, 5])

Of course we could use anything as an alias for `numpy`:

In [63]:
import numpy as myCoolSneakers
arr = myCoolSneakers.array([1, 2, 3, 4, 5])


But it is convention among data scientists, machine learning experts, and the cool kids to use `np`.  One big benefit of this convention is that it makes the code you write more understandable to others and vice versa (I don't need to be scouring your code to find out what `myCoolSneakers.array` does)

## creating arrays

An array in NumPy is called an `ndarray`, short for n-dimensional array. As we will see, ndarrays share some similarities with Python lists. We have already seen how to create one:

In [64]:
arr = np.array([1, 2, 3, 4, 5])

and to display what `arr` equals

In [65]:
arr

array([1, 2, 3, 4, 5])

This is a one dimensional array. The position of an element in the array is called the index. The first element of the array is at index 0, the next at index 1 and so on. We can get the item at a particular index by using the syntax:

In [66]:
 arr[0]

1

In [67]:
arr[3]

4

We can create a 2 dimensional array that looks like

      10  20  30
      40  50  60
 
by:


In [68]:
 arr = np.array([[10, 20, 30], [40, 50, 60]])

and we can show the contents of that array just by using the name of the array, `arr`


In [69]:
arr

array([[10, 20, 30],
       [40, 50, 60]])

We don't need to name arrays `arr`, we can name them anything we want. 

In [70]:
ratings = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [71]:
ratings

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

So far, we've been creating numpy arrays by using Python lists. We can make that more explicit by first creating the Python list and then using it to create the ndarray:

In [72]:
pythonArray = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
sweet = np.array(pythonArray)
sweet

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

We can also create an array of all zeros or all ones directly:

In [73]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [74]:
np.ones((5, 2))

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])
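Note that `zeros` and `ones` produce floats by default (that is what the trailing dots in the output mean). As a small sketch of two related options, you can pass `dtype` to ask for integers, or use `np.full` for any constant value. The names `zeros_int` and `sevens` are made up for this example.

```python
import numpy as np

zeros_int = np.zeros((2, 3), dtype=int)   # integers instead of the default floats
sevens = np.full((2, 2), 7)               # fill with any constant value

print(zeros_int)
print(zeros_int.shape)   # (2, 3)
print(zeros_int.ndim)    # 2
print(sevens)
```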

### indexing
Indexing elements in ndarrays works pretty much the same as it does in Python. We have already seen one example, here is another example with a one dimensional array:


In [87]:
temperatures = np.array([48, 44, 37, 35, 32, 29, 33, 36, 42])
temperatures[0]

48

In [88]:
temperatures[3]

35

and a two dimensional one:

In [89]:
sample = np.array([[10, 20, 30], [40, 50, 60]])
sample[0][1]

20

For numpy ndarrays we can also use a comma to separate the indices of multi-dimensional arrays:

In [90]:
sample[1,2]

60
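The comma syntax also lets you pick out a whole row or a whole column by putting a `:` in the other position. A quick sketch using the same `sample` array:

```python
import numpy as np

sample = np.array([[10, 20, 30], [40, 50, 60]])

print(sample[0, :])    # first row: [10 20 30]
print(sample[:, 1])    # second column: [20 50]
print(sample[:, 1:])   # last two columns of every row
```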

And, like Python you can also get a slice of an array. First, here is the basic Python example:

In [91]:
a = [10, 20, 30, 40, 50, 60]
b = a[1:4]
b

[20, 30, 40]

and the similar NumPy example:

In [92]:
aarr = np.array(a)
barr = aarr[1:4]
barr

array([20, 30, 40])

### Something  wacky to remember
But there is a difference between Python lists and NumPy ndarrays. If I alter the list `b` in Python, the original `a` list is not altered:

In [93]:
b[1] = b[1] + 5

In [94]:
b

[20, 35, 40]

In [95]:
a

[10, 20, 30, 40, 50, 60]

but if we do the same in NumPy:

In [96]:
barr[1] = barr[1] + 5

In [97]:
barr

array([20, 35, 40])

In [98]:
aarr

array([10, 20, 35, 40, 50, 60])

we see that the original array is altered when we modify the slice. That is because a NumPy slice is a view onto the same data rather than a copy, whereas slicing a Python list makes a new list. This may seem wacky to you, or maybe it doesn't. In any case, it is something you will get used to. For now, just be aware of this. It took me a while to stop making mistakes because of this.
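If you need a slice that you can change without touching the original, one option is to call `.copy()` on it. The sketch below uses made-up names `view` and `independent`, and `np.shares_memory` to check which one is still tied to the original array.

```python
import numpy as np

aarr = np.array([10, 20, 30, 40, 50, 60])

view = aarr[1:4]                 # shares its data with aarr
independent = aarr[1:4].copy()   # owns its own data

independent[1] += 5              # aarr is untouched
print(independent)               # [20 35 40]
print(aarr)                      # [10 20 30 40 50 60]
print(np.shares_memory(aarr, view))          # True
print(np.shares_memory(aarr, independent))   # False
```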

## Functions on arrays

NumPy has a wide range of array functions. Here is just a sample.

### Unary functions

#### absolute value

In [99]:
arr = np.array([-2, 12, -25, 0])
arr2 = np.abs(arr)
arr2

array([ 2, 12, 25,  0])

In [100]:
arr = np.array([[-2, 12], [-25, 0]])
arr2 = np.abs(arr)
arr2               

array([[ 2, 12],
       [25,  0]])

#### square

In [103]:
arr = np.array([-1, 2, -3, 4])
arr2 = np.square(arr)
arr2

array([ 1,  4,  9, 16])

#### square root

In [102]:
arr = np.array([[4, 9], [16, 25]])
arr2 = np.sqrt(arr)
arr2

array([[2., 3.],
       [4., 5.]])

### Binary functions

#### add /subtract / multiply / divide


In [104]:
arr1 = np.array([[10, 20], [30, 40]])
arr2 = np.array([[1, 2], [3, 4]])
np.add(arr1, arr2)

array([[11, 22],
       [33, 44]])

In [105]:
np.subtract(arr1, arr2)

array([[ 9, 18],
       [27, 36]])

In [106]:
np.multiply(arr1, arr2)

array([[ 10,  40],
       [ 90, 160]])

In [107]:
np.divide(arr1, arr2)

array([[10., 10.],
       [10., 10.]])
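The ordinary arithmetic operators do the same element-by-element work as these functions, which is why `arr + 10` worked earlier. A short sketch with the same two arrays:

```python
import numpy as np

arr1 = np.array([[10, 20], [30, 40]])
arr2 = np.array([[1, 2], [3, 4]])

print(np.array_equal(arr1 + arr2, np.add(arr1, arr2)))        # True
print(np.array_equal(arr1 * arr2, np.multiply(arr1, arr2)))   # True
print(arr1 // arr2)   # floor division keeps the results as integers
```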

#### maximum / minimum


In [109]:
arr1 = np.array([[10, 2], [3, 40]])
arr2 = np.array([[1, 20], [30, 4]])
np.maximum(arr1, arr2)

array([[10, 20],
       [30, 40]])
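For completeness, here is a sketch of the matching `np.minimum` call on the same two arrays. Note that `np.minimum` compares two arrays element by element, which is different from the `.min()` method that finds the smallest value within one array.

```python
import numpy as np

arr1 = np.array([[10, 2], [3, 40]])
arr2 = np.array([[1, 20], [30, 4]])

smaller = np.minimum(arr1, arr2)   # element-by-element comparison
print(smaller)
# [[1 2]
#  [3 4]]
print(arr1.min())                  # 2, the smallest value inside arr1
```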

These are just examples. There are more unary and binary functions.

## NumPy Uber
Let us say we have Uber drivers at various intersections around Austin. We will represent that as a set of x,y coordinates.

 | Driver |xPos | yPos |
 | :---: | :---: | :---: |
 | Ann | 4 | 5 |
 | Clara | 6 | 6 |
 | Dora | 3 | 1 |
 | Erica | 9 | 5 |
 
 
 Now I would like to find the closest driver to a customer who is at 6, 3.


 ![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/distance.png)
 And to further define *closest* I am going to use what is called **Manhattan Distance**. Roughly put, Manhattan distance is distance if you followed streets. Ann, for example, is two blocks West of our customer and two blocks north. So the Manhattan distance from Ann to our customer is `2+2` or `4`. 
 
 First, to make things easy (and because the data in a NumPy array must be of the same type), I will represent the x and y positions in one NumPy array and the driver names in another:

In [110]:
locations = np.array([[4, 5], [6, 6], [3, 1], [9,5]])
locations

array([[4, 5],
       [6, 6],
       [3, 1],
       [9, 5]])

In [111]:
drivers = np.array(["Ann", "Clara", "Dora", "Erica"])

Our customer is at

In [112]:
cust = np.array([6, 3])

Now we are going to figure out the distance between each of our drivers and the customer:

In [113]:
xydiff = locations - cust
xydiff

array([[-2,  2],
       [ 0,  3],
       [-3, -2],
       [ 3,  2]])

NOTE: displaying the results with `xydiff` isn't a necessary step. I just like seeing intermediate results.
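You may wonder how a 4 x 2 array minus a 2-element array works at all. NumPy stretches the smaller array across each row, a feature it calls broadcasting. The sketch below spells out the same subtraction row by row (the name `by_hand` is made up for this example) and checks that both give the same answer.

```python
import numpy as np

locations = np.array([[4, 5], [6, 6], [3, 1], [9, 5]])
cust = np.array([6, 3])

# what broadcasting does for us, written out one row at a time
by_hand = np.array([row - cust for row in locations])

print(by_hand)
print(np.array_equal(by_hand, locations - cust))   # True
```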

OK, now I am going to sum the absolute values across each row:

In [114]:
distances =np.abs(xydiff).sum(axis = 1)
distances

array([4, 3, 5, 5])

So the output is the array `[4, 3, 5, 5]` which shows that Ann is 4 away from our customer; Clara is 3 away and so on.

Now I am going to sort these using `argsort`:

In [115]:
sorted = np.argsort(distances)
sorted

array([1, 0, 2, 3])

`argsort` returns an array of sorted indices. So the element at position 1 is the smallest followed by the element at position 0 and so on.

Next, I am going to get the first element of that array (in this case 1) and find the name of the driver at that position in the `drivers` array

In [116]:
drivers[sorted[0]]

'Clara'
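The array of sorted indices can also be used to list every driver from closest to farthest, not just the first one. In this sketch, `kind="stable"` is one way to keep tied drivers (Dora and Erica are both 5 away) in their original order, and `np.argmin` is a shortcut when you only need the closest one. The name `order` is made up for this example.

```python
import numpy as np

drivers = np.array(["Ann", "Clara", "Dora", "Erica"])
distances = np.array([4, 3, 5, 5])

order = np.argsort(distances, kind="stable")   # ties keep their original order
print(drivers[order])                  # ['Clara' 'Ann' 'Dora' 'Erica']
print(drivers[np.argmin(distances)])   # Clara
```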

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/torchdivide.png)


# <font color='#EE4C2C'>You Try ...</font> 
Ok, time to get coding

## <font color='#EE4C2C'>2. a function</font> 

Can you put all of the above in a function that takes 3 arguments: the location array, the array containing the names of the drivers, and the array containing the location of the customer? It should return the name of the closest driver.


In [184]:
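def findDriver(distanceArr, driversArr, customerArr):
   # Manhattan distance from each driver location to the customer
   distances = np.abs(distanceArr - customerArr).sum(axis=1)
   # the index of the smallest distance picks the closest driver
   result = driversArr[np.argsort(distances)[0]]
   return result
print(findDriver(locations, drivers, cust)) # this should return Clara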

Clara


### CONGRATULATIONS

Even though this is just an intro to NumPy, I am going to throw some math at you. So far we have been looking at a two dimensional example, x and y (or North-South and East-West) and our distance formula for the distance, Dist between Ann, A and Customer C is

$$ DIST_{AC} = |A_x - C_x | + |A_y - C_y | $$

Now I am going to warp this a bit. In this example, each driver is represented by an array (as is the customer). So, Ann is represented by `[4,5]` and the customer by `[6,3]`. Ann's 0th element is 4 and the customer's 0th element is 6. And, sorry, computer science people start counting at 0 while math people (and all other normal people) start at 1. So we can rewrite the above formula as:

$$ DIST_{AC} = |A_0 - C_0 | + |A_1 - C_1 | $$

That's the distance formula for Ann and the Customer. We can make the formula more general by saying the distance between any two people, let's call them *x* and *y*, is


$$ DIST_{xy} = |x_0 - y_0 | + |x_1 - y_1 | $$

That is the formula for  2 dimensional Manhattan Distance. We can imagine a three dimensional case.  

$$ DIST_{xy} = |x_0 - y_0 | + |x_1 - y_1 | + |x_2 - y_2 | $$

and we can generalize the formula to the n-dimensional case.
 
$$ DIST_{xy}=\sum_{i=0}^{n-1} |x_i - y_i| $$
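As a sketch of how that general formula might look in NumPy, the hypothetical helper `manhattan` below works for any number of dimensions, as long as both inputs have the same length.

```python
import numpy as np

def manhattan(x, y):
    # hypothetical helper: sum of absolute differences over every dimension
    return np.abs(np.asarray(x) - np.asarray(y)).sum()

print(manhattan([4, 5], [6, 3]))         # 4, Ann to the customer
print(manhattan([1, 2, 3], [4, 0, 3]))   # 3 + 2 + 0 = 5
```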

Just in time for a five dimensional example:

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/PyDivideTwo.png)



## <font color='#EE4C2C'>3. The Amazing 5D Music example</font> 

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/artists.png)

Guests went into a listening booth and rated the following tunes:

* [Janelle Monae Tightrope](https://www.youtube.com/watch?v=pwnefUaKCbc)
* [Major Lazer - Cold Water](https://www.youtube.com/watch?v=nBtDsQ4fhXY)
* [Tim McGraw - Humble & Kind](https://www.youtube.com/watch?v=awzNHuGqoMc)
* [Maren Morris - My Church](https://www.youtube.com/watch?v=ouWQ25O-Mcg)
* [Hailee Steinfeld - Starving](https://www.youtube.com/watch?v=xwjwCFZpdns)


Here are the results:

| Guest  | Janelle Monae  | Major Lazer  | Tim McGraw  |  Maren Morris | Hailee Steinfeld| 
|---|---|---|---|---|---|
|  Ann | 4  |  5 | 2  |  1 | 3 |
| Ben  |  3 |  1 |  5 | 4  | 2|
| Jordyn  | 5  |  5 | 2  | 2  | 3|
|  Sam | 4 | 1 | 4 | 4 | 1|
| Hyunseo | 1 | 1 | 5 | 4 | 1 |
| Ahmed | 4 | 5 | 3 |  3 | 1 |

So Ann, for example, really liked Major Lazer and Janelle Monae but didn't care much for Maren Morris.

Let's set up a few numpy arrays.


In [156]:
customers = np.array([[4, 5, 2, 1, 3],
                      [3, 1, 5, 4, 2],
                      [5, 5, 2, 2, 3],
                      [4, 1, 4, 4, 1], 
                      [1, 1, 5, 4, 1],
                      [4, 5, 3, 3, 1]])

customerNames = np.array(["Ann", "Ben", 'Jordyn', "Sam", "Hyunseo", "Ahmed"])



Now let's set up a few new customers:

In [157]:
mikaela = np.array([3, 2, 4, 5, 4])
brandon = np.array([4, 5, 1, 2, 3])

Now we would like to determine  which of our current customers is closest to Mikaela and which to Brandon.


### Work to be done ...
Can you write a function findClosest that takes 3 arguments: customers, customerNames, and an array representing one customer's ratings and returns the name of the closest customer?

Let's break this down a bit.

1. Which line in the NumPy Uber section above will create a new array which is the result of subtracting the Mikaela array from each row of the customers array resulting in

```
array([[ 1,  3, -2, -4, -1],
       [ 0, -1,  1, -1, -2],
       [ 2,  3, -2, -3, -1],
       [ 1, -1,  0, -1, -3],
       [-2, -1,  1, -1, -3],
       [ 1,  3, -1, -2, -3]])
```


In [170]:
customers - mikaela

array([[ 1,  3, -2, -4, -1],
       [ 0, -1,  1, -1, -2],
       [ 2,  3, -2, -3, -1],
       [ 1, -1,  0, -1, -3],
       [-2, -1,  1, -1, -3],
       [ 1,  3, -1, -2, -3]])

2. Which line above will take the array you created and generate a single integer distance for each row representing how far away that row is from Mikaela?  The results will look like:

```
    array([11,  5, 11,  6,  8, 10])
```

In [218]:
xydiff = customers - mikaela
distances = np.abs(xydiff).sum(axis=1)
distances


array([11,  5, 11,  6,  8, 10])

Next, we want a sorted array of indices: the zeroth element of that array will be the closest row to Mikaela, the next element will be the next closest, and so on. The result should be

```
array([1, 3, 4, 5, 0, 2])
```


In [219]:
sorted = np.argsort(distances)
sorted
# TO DO


array([1, 3, 4, 5, 0, 2])

Finally we need the name of the person that is the closest. 

In [220]:
customerNames[sorted[0]]
# TO DO

'Ben'

Okay, time to put it all together. Can you combine all the code you wrote above to finish the following function? So x is the new person and we want to find the closest customer to x.

In [221]:
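def findClosest(customers, customerNames, x):
   # Manhattan distance from every existing customer to x
   distances = np.abs(customers - x).sum(axis=1)
   # a stable sort keeps the earlier row first when distances tie
   closest = np.argsort(distances, kind="stable")
   result = customerNames[closest[0]]
   return result


print(findClosest(customers, customerNames, mikaela)) # Should print Ben
print(findClosest(customers, customerNames, brandon)) # Should print Ann

Ben
Ann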


![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/PyDivideTwo.png)
## <font color='#EE4C2C'>4. Numpy drones</font> 

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/drone2.png)

We are going to start with the same array we did way up above:

 
 | Drone |xPos | yPos |
 | :---: | :---: | :---: |
 | wing_1a | 4 | 5 |
 | wing_2a | 6 | 6 |
 | wing_3a | 3 | 1 |
 | wing_4a | 9 | 5 |
 
 But this time, instead of Uber drivers, think of these as positions of [Alphabet's Wing delivery drones](https://wing.com/). 
 Now we would like to find the closest drone to a customer who is at 7, 1.
 
With the previous example we used Manhattan Distance.  With drones, we can compute the distance as the crow flies -- or Euclidean Distance.  We probably learned how to do this way back in 7th grade when we learned the Pythagorean Theorem which states:

$$c^2 = a^2 + b^2$$

Where *c* is the hypotenuse and *a* and *b* are the two other sides. So, if we want to find *c*:

$$c = \sqrt{a^2 + b^2}$$


If we want to find the distance between a drone *x* and a customer *y*, the formula becomes

$$Dist_{xy} = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}$$

and for `wing_1a`, which is at `[4,5]`, and our customer, who is at `[7,1]`, the formula becomes:

$$Dist_{xy} = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2} = \sqrt{(4-7)^2 + (5-1)^2} =\sqrt{(-3)^2 + 4^2}  = \sqrt{9 + 16} = \sqrt{25} = 5$$

Sweet!  And to generalize this distance formula:

$$Dist_{xy} = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}$$

to n-dimensions:

$$Dist_{xy} = \sqrt{\sum_{i=1}^n (x_i-y_i)^2}$$
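In n dimensions, Euclidean distance means squaring every difference, adding the squares together, and taking a single square root of the total. The sketch below uses a hypothetical helper named `euclidean_distance` for one pair of points and cross-checks it against `np.linalg.norm`, which computes the same length.

```python
import numpy as np

def euclidean_distance(x, y):
    # hypothetical helper: square the differences, add them up, take one square root
    diff = np.asarray(x) - np.asarray(y)
    return np.sqrt(np.square(diff).sum())

print(euclidean_distance([4, 5], [7, 1]))                    # 5.0
print(np.linalg.norm(np.array([4, 5]) - np.array([7, 1])))   # 5.0
```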





Can you write a function euclidean that takes 3 arguments: droneLocation, droneNames, and an array representing one customer's position and returns the name of the closest drone?

First, a helpful hint:


In [222]:
arr = np.array([-1, 2, -3, 4])
arr2 = np.square(arr)
arr2

array([ 1,  4,  9, 16])

In [224]:
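locations = np.array([[4, 5], [6, 6], [3, 1], [9,5]])
drones = np.array(["wing_1a", "wing_2a", "wing_3a", "wing_4a"])
cust = np.array([7, 1])  # the customer position given in the text above

def euclidean(droneLocation, droneNames, x):
   # straight-line distance from every drone to the customer x
   distances = np.sqrt(np.square(droneLocation - x).sum(axis=1))
   # the smallest distance marks the closest drone
   result = droneNames[np.argsort(distances)[0]]
   return result
euclidean(locations, drones, cust) 

'wing_3a'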
