# NumPy Continued

## Searching for Conditions
**np.where()** allows us to search arrays for certain conditions. Given only a condition, it returns a tuple of index arrays, one per dimension, marking where the condition is True.

Setting values based on conditionals is a handy way to generate labeled data for training a model, or to define features within your training data.

In [2]:
import numpy as np
arr1 = np.arange(10)
print(np.where(arr1 > 4)) # With only a condition, np.where returns the indices where it is True

# np.where can also take two parameters defining values for True and False elements
# np.where(condition, x, y) - where true put x, where false put y
print("\nSetting values with np.where():")
print(np.where(arr1 > 4, 1, -1))


(array([5, 6, 7, 8, 9]),)

Setting values with np.where():
[-1 -1 -1 -1 -1  1  1  1  1  1]
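
As a small sketch of the labeling idea above, the true/false values passed to `np.where` can be scalars or arrays. The `scores` data and the 60-point threshold are made up for illustration only.

```python
import numpy as np

# Hypothetical exam scores (made-up data for illustration)
scores = np.array([35, 72, 58, 90, 61])

# Label each score: 1 where the condition holds, 0 elsewhere
labels = np.where(scores >= 60, 1, 0)
print(labels)  # [0 1 0 1 1]

# x and y can also be arrays: keep passing scores, replace the rest with 0
kept = np.where(scores >= 60, scores, 0)
print(kept)  # [ 0 72  0 90 61]
```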


## Finding Values
In addition to positional indexing, we can use boolean arrays (masks) to subset an array.

The boolean array should be the same shape as the array you're using it against.

In [3]:
import numpy as np

arr1 = np.arange(10)
print(arr1[arr1 > 4])

print("\nSetting values using boolean array:")
arr1 = np.arange(100)
arr1[arr1 % 2 == 0] = 0 # Since 'arr1 % 2 == 0' returns a boolean array, it can be used to index
print(arr1)


[5 6 7 8 9]

Setting values using boolean array:
[ 0  1  0  3  0  5  0  7  0  9  0 11  0 13  0 15  0 17  0 19  0 21  0 23
  0 25  0 27  0 29  0 31  0 33  0 35  0 37  0 39  0 41  0 43  0 45  0 47
  0 49  0 51  0 53  0 55  0 57  0 59  0 61  0 63  0 65  0 67  0 69  0 71
  0 73  0 75  0 77  0 79  0 81  0 83  0 85  0 87  0 89  0 91  0 93  0 95
  0 97  0 99]
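
Boolean masks can also be combined, which is often what a real filter needs. The sketch below uses the element-wise operators `&`, `|` and `~`; note that each condition has to be wrapped in parentheses.

```python
import numpy as np

arr = np.arange(10)

# Combine conditions with & (and) and | (or); each condition needs parentheses
middle = arr[(arr > 2) & (arr < 7)]
print(middle)  # [3 4 5 6]

edges = arr[(arr < 2) | (arr > 7)]
print(edges)  # [0 1 8 9]

# ~ negates a mask
not_even = arr[~(arr % 2 == 0)]
print(not_even)  # [1 3 5 7 9]
```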


### In class work

In [8]:
#Problem 1
# Given the random dataset below, put a 0 where sqrt(abs(x + 10)) < 5, otherwise put a 1
import numpy as np

np.random.seed(1) #Setting the seed ensures the same random numbers are always generated
arr1 = np.random.randint(-100, 100, size = 100)


print(np.where(np.sqrt(np.abs(arr1 + 10)) < 5, 0, 1))

"""Result should match:
[1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1,
1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
"""


[1 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1
 1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1]



## Vectorized Operations
Vectorization is a process where we apply an operation across a whole vector/matrix at once. This means we can quickly add/subtract/multiply/divide a single value (a scalar) across an entire vector/matrix without having to loop through the elements.

In [3]:
# Vectorization and Broadcasting

import numpy as np
print("**********\nVectorized Operations:")
# Numpy arrays provide an ability to perform batch operations
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

arr1 = arr + 5 # an operation on the array affects all elements
print(arr1)

print("\n**********\nThis works on a number of operations:")
arr2d = np.array([[1,2,3],
                [4,5,6],
                [7,8,9]])
arr2d *= 3
print(arr2d)

**********
Vectorized Operations:
[[ 6  7  8]
 [ 9 10 11]]

**********
This works on a number of operations:
[[ 3  6  9]
 [12 15 18]
 [21 24 27]]
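
To make the "no loop" point concrete, here is a minimal sketch comparing an explicit Python loop with the vectorized expression. Both produce the same array; the vectorized form simply lets NumPy do the looping internally.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Explicit loop: visit every element and add 5
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] + 5

# Vectorized: one expression, no Python-level loop
vectorized = arr + 5

print(np.array_equal(looped, vectorized))  # True
```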


## Vectorization of Functions
We can also apply certain functions element-wise for a speedup over Python loops. A number of functions are vectorized for numpy arrays. Those that take a single array are called unary ufuncs (universal functions); I've listed the more common ones below:
 - abs - absolute value
 - sqrt - square root
 - square - x^2 for each element
 - exp - e^x for each element
 - log/log2/log10 - logarithms
 - sign - negative/positive/zero
 - ceil/floor
 - isnan
 - isfinite
 - isinf

In [4]:
# Vectorized Functions - Unary ufuncs

"""There are a number of functional operations that are vectorized for numpy arrays.
These functions are called unary ufuncs (universal functions). The more common ones are listed below:
    abs - absolute
    sqrt - square root
    square
    exp - e^x for each element
    logs
    sign - negative/positive/zero
    ceil/floor
    isnan
    isfinite
    isinf
"""
import numpy as np

arr1 = np.random.normal(size = 10)
print("Array 1:")
print(arr1)
print("\nabs(Array 1):")
print(abs(arr1))

arr1 = np.arange(10)
print("\n0-9 squared")
print(np.square(arr1))


# Try out some of the other unary functions


Array 1:
[-0.1203552  -1.94411358 -1.24269682 -0.96788593  1.23273328 -0.29232845
  1.16124405 -0.57457224  0.69944595 -0.38438023]

abs(Array 1):
[0.1203552  1.94411358 1.24269682 0.96788593 1.23273328 0.29232845
 1.16124405 0.57457224 0.69944595 0.38438023]

0-9 squared
[ 0  1  4  9 16 25 36 49 64 81]
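
For the "try out some of the other unary functions" prompt, a short sketch of a few more ufuncs from the list follows. The input values are arbitrary and chosen only so the results are easy to check by hand.

```python
import numpy as np

vals = np.array([-2.5, -1.0, 0.0, 1.5, 4.0])

print(np.sign(vals))   # [-1. -1.  0.  1.  1.]
print(np.ceil(vals))   # [-2. -1.  0.  2.  4.]
print(np.floor(vals))  # [-3. -1.  0.  1.  4.]
print(np.sqrt(np.abs(vals)))  # take abs first: sqrt of a negative gives nan

# The is* functions return boolean arrays, so they also work as masks
special = np.array([1.0, np.nan, np.inf, -np.inf])
print(np.isnan(special))     # [False  True False False]
print(np.isinf(special))     # [False False  True  True]
print(np.isfinite(special))  # [ True False False False]
```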


## Element-to-Element Operations
We can also do element-wise operations between two arrays if they have the same shape.

In [5]:
print("\nCan also do this on array-to-array operations")
arr2 = np.array([[1,1,1],
                 [2,2,2],
                 [3,3,3]])

print(arr2d - arr2) # Matrix (element-wise) subtraction
#print(arr1 - arr2) - won't work, as they are not the same shape


Can also do this on array-to-array operations
[[ 2  5  8]
 [10 13 16]
 [18 21 24]]


In [6]:
arr1 = np.arange(9).reshape(3,3)
arr2 = np.arange(10).reshape(2, 5)

print(f"Array1:\n{arr1}")
print(f"Array2:\n{arr2}")

# Won't work as we have different shapes
arr3 = arr1-arr2
print(arr3)

Array1:
[[0 1 2]
 [3 4 5]
 [6 7 8]]
Array2:
[[0 1 2 3 4]
 [5 6 7 8 9]]


ValueError: operands could not be broadcast together with shapes (3,3) (2,5) 
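
If mismatched shapes are a possibility in your own code, one option is to catch the error instead of letting the cell stop. This is only a sketch of that pattern; the variable names are illustrative.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = np.arange(10).reshape(2, 5)

try:
    result = a - b
except ValueError as err:
    # Shapes (3, 3) and (2, 5) cannot be combined element-wise
    result = None
    print(f"Could not subtract: {err}")
```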

## Broadcasting
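Sometimes, when arrays don't have the same shape, we can still do element-wise operations. This is due to broadcasting: NumPy compares the two shapes starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. Missing leading dimensions are treated as 1, and size-1 dimensions are stretched to match the other array. A scalar is the simplest case, since it broadcasts against any shape.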

In [8]:
# Broadcasting across slices

print("\n**********\nWe can broadcast compatibly shaped arrays:")
arr2d = np.array([[1,2,3],
                [4,5,6],
                [7,8,9]])
tmp_array = np.array([3,2,1])
arr2d = arr2d * tmp_array # (3, 3) * (3,): the 1-D array is treated as shape (1, 3) and applied to every row
print(arr2d)

# This works because the trailing dimensions match (both are 3)
print(f'We have a {arr2d.shape} and {tmp_array.shape}')

print("\n**********\nLooking at more than two dimensions:")
arr2d = np.arange(27).reshape(3,3,3)
tmp_array = np.arange(9).reshape(3,3)
tmp_array2 = np.arange(3).reshape(1,3)

print(arr2d)
print("Sharing two dimensions:")
print(arr2d - tmp_array)
print("\nSharing one dimension:")
print(arr2d * tmp_array2)


**********
We can broadcast compatibly shaped arrays:
[[ 3  4  3]
 [12 10  6]
 [21 16  9]]
We have a (3, 3) and (3,)

**********
Looking at more than two dimensions:
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
Sharing two dimensions:
[[[ 0  0  0]
  [ 0  0  0]
  [ 0  0  0]]

 [[ 9  9  9]
  [ 9  9  9]
  [ 9  9  9]]

 [[18 18 18]
  [18 18 18]
  [18 18 18]]]

Sharing one dimension:
[[[ 0  1  4]
  [ 0  4 10]
  [ 0  7 16]]

 [[ 0 10 22]
  [ 0 13 28]
  [ 0 16 34]]

 [[ 0 19 40]
  [ 0 22 46]
  [ 0 25 52]]]
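
The examples above all broadcast along rows. As a further sketch, the cell below shows a row vector, a column vector, and a case that fails until an axis is added with `np.newaxis`. The array names are illustrative.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])      # shape (2, 1)

print(grid + row)
# [[10 21 32]
#  [13 24 35]]
print(grid + col)
# [[100 101 102]
#  [203 204 205]]

# A (2,) array does not line up with the trailing dimension (3), so this fails
try:
    grid + np.array([1, 2])
except ValueError as err:
    print(err)

# Adding an axis turns it into a (2, 1) column, which does broadcast
shifted = grid + np.array([1, 2])[:, np.newaxis]
print(shifted)
# [[1 2 3]
#  [5 6 7]]
```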


## NaN or Not a Number
It isn't uncommon in analytics to run into a situation where we don't have a valid number to work with, so NumPy provides a special floating-point value, np.nan, to mark a value as missing or undefined, as distinct from a zero or a null value.

**Can anyone give an example of why this might occur?**

In [None]:
# nan or Not a Number:

import numpy as np

print("**********\nNan:")
# np.nan is a special float value (not a number), usually used for missing or NA datapoints
print(np.nan == None)   # False
print(np.nan == np.nan) # False! nan never compares equal to anything, including itself
print(np.isnan(np.nan)) # True: use np.isnan() to check for nan, luckily it's vectorized

arr1 = np.arange(9)
arr1 = arr1.astype('float32') # Force a type conversion with astype(); np.nan only exists for floats
print(arr1.dtype)
arr1[5] = np.nan # Without the conversion above this raises a ValueError, since integers cannot hold nan
print(np.isnan(arr1))
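
Once an array contains nan, ordinary aggregations return nan as well. The sketch below shows two common ways around that: masking with `np.isnan`, or using NumPy's nan-aware `np.nanmean`. The `readings` data is made up for illustration.

```python
import numpy as np

# Hypothetical sensor readings with two missing values
readings = np.array([2.0, np.nan, 4.0, np.nan, 6.0])

missing = np.isnan(readings)
print(missing.sum())              # 2 missing values

print(np.mean(readings))          # nan: arithmetic with nan gives nan
print(readings[~missing].mean())  # 4.0, mean of the valid values only
print(np.nanmean(readings))       # 4.0, same result with the nan-aware function
```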


## Infinity
Sometimes we run into situations where we have infinite values (think functions with asymptotes).

In [None]:
# Finite and Infinite

import numpy as np

print("**********\nDividing by zero, and what it creates:")
arr1 = np.arange(10)
inf = arr1/0 # Uh-oh, this is going to raise some RuntimeWarnings
print(inf) # Notice 0/0 is nan (undefined), everything else is inf

print("**********\nChecking for infinities:")
inf[9] = 10 # Let's set the last element to a finite number
print(np.isinf(inf)) # Notice both np.nan and 10 are not infinite

# np.inf is useful for setting maxes or mins when working with data
"""Example: when comparing distances between datapoints, we don't know the minimum distance in advance.
If we are keeping track of the minimum, we can initialize it to np.inf and update it whenever
an element is less than the current minimum"""
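
The running-minimum idea described above can be sketched as follows. The points and the choice of the origin as the reference are assumptions made for the example.

```python
import numpy as np

# Hypothetical 2-D points; we want the one closest to the origin
points = np.array([[3.0, 4.0], [1.0, 1.0], [6.0, 8.0]])

min_dist = np.inf   # any real distance will be smaller than this
closest = None
for p in points:
    dist = np.sqrt(np.sum(p ** 2))  # Euclidean distance to the origin
    if dist < min_dist:
        min_dist = dist
        closest = p

print(min_dist)  # about 1.414
print(closest)   # [1. 1.]
```

Starting from `np.inf` means the first point always replaces the initial value, so no special case is needed for the first iteration.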



## Statistical Operations

In [None]:
# Mathematical and Statistical Methods

import numpy as np
import numpy.random as rand

print("**********\nUseful Math/Stats Functions:")
arr1 = rand.normal(loc = 10, scale = 3, size = 1000)
arr1_mean = np.mean(arr1) # These functions are vectorized, i.e. designed and optimized for array operations
arr1_sd = np.std(arr1)
arr1_var = np.var(arr1)
arr1_25th = np.percentile(arr1, 25)
print(f"Our mean is {arr1_mean}, which should be close to 10 (our loc)")
print(f"Our sd is {arr1_sd}, which should be close to 3 (our scale)")
print(f"Our var is {arr1_var}, which should be close to the square of our scale, {np.square(3)}")
print(f"Our 25th percentile is {arr1_25th}")
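
To see what `np.var` and `np.std` compute, the sketch below works through the formulas on a small hand-checkable array (made-up data). By default both divide by the number of elements, i.e. they return the population variance and standard deviation.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.sum() / data.size
var = ((data - mean) ** 2).sum() / data.size   # population variance
sd = np.sqrt(var)

print(mean, var, sd)   # 5.0 4.0 2.0
print(np.isclose(var, np.var(data)), np.isclose(sd, np.std(data)))  # True True
```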



In [None]:
arr1 = rand.normal(loc = 10, scale = 3, size = (100, 10))
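arr1_mean = np.mean(arr1) # With no axis given, the mean is taken over every element
arr1_col_mean = np.mean(arr1, axis=0) # axis=0 collapses the rows: one mean per column (10 values)
arr1_row_mean = np.mean(arr1, axis=1) # axis=1 collapses the columns: one mean per row (100 values)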

print(arr1_mean)
print(f"\nColumn-wise mean:\n{arr1_col_mean}")
print(f"\nRow-wise mean:\n{arr1_row_mean}")
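
The `axis` argument is easier to follow on an array small enough to check by hand. A minimal sketch:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(np.mean(small))          # 3.5, mean of all six values
print(np.mean(small, axis=0))  # [2.5 3.5 4.5], one mean per column
print(np.mean(small, axis=1))  # [2. 5.], one mean per row
```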

In [None]:
#Problem 1
"""Generate a random array of data. Compute the mean using Python's standard library (no numpy)
and compare the result with np.mean()"""

