[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SmilodonCub/DS4VS/blob/master/Week5/DS4VS_week5_Numpy.ipynb)

<br> 

# Week 5: `Numpy`



<br>

## a Brief Recap:

* Hello, how are you?
* How are you doing with accessing your data in Python?
* Today: `numpy` and `pandas`
* Next week: Exploratory Data Analysis - data cleaning, inspection and missingness

## Where are the matrices hiding?

Matlab is short for Matrix Laboratory, so it is no surprise that matrices are a very prominent structure in Matlab.  

Those of you coming from a Matlab background have undoubtedly been wondering:  
"Where are my matrices?!"


<img src="https://miro.medium.com/max/698/1*2kqJE0-z_Cjz2gbL5HLBjg.jpeg" width="60%" style="margin-left:auto; margin-right:auto">


## Python's not a Datacentric Language

Python is intended to be a general-purpose language.  
To those in science and engineering, it is not obvious how to use Python for data analysis.  


Because Python was developed as a general-purpose programming language, data analysis was not a primary focus of the standard library. No built-in structure approaches the performance of the Matlab matrix.

<img src="https://img.ifunny.co/images/750f66c2493cd9f571ca255a6ce6d508a45a419071b12d1e0d1006d710476ca0_1.jpg" width="40%" style="margin-left:auto; margin-right:auto">

## If Python wasn't meant for data, why have Data Scientists embraced the language?


<img src="https://i.redd.it/m37hm1opxma41.jpg" width="35%" style="margin-left:auto; margin-right:auto">

## `Numpy`: numerical Python 

Third-party libraries have come to the rescue.  

In Python, MATLAB-like matrix behavior is found in the 'array' objects implemented by the `NumPy` package.  
[The Numpy Manual](https://numpy.org/doc/stable/) 




## `Numpy`: numerical Python 

**`NumPy` Features**:  

* Fast vectorized array operations including:
    - mathematical
    - logical
    - sorting/selecting
    - discrete Fourier transforms
    - basic linear algebra
    - basic statistical operations
    - random simulation
* Effective methods for aggregating/summarizing data and yielding descriptive statistics
* Conditional logic (boolean indexing) can be used instead of loops
* Group-wise data manipulations (functions & transformations)
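
To make the "instead of loops" point concrete, here is a minimal sketch comparing a loop with its vectorized equivalent. The firing-rate values and the cutoff of 25 are made up for illustration.

```python
import numpy as np

# Hypothetical firing rates (spikes/s) for six cells; the values are made up.
rates = np.array([12.0, 45.5, 3.2, 60.1, 27.8, 8.9])

# Loop version: build the labels one element at a time.
labels_loop = []
for r in rates:
    labels_loop.append("high" if r > 25 else "low")

# Vectorized version: the comparison and the selection act on the whole array.
labels_vec = np.where(rates > 25, "high", "low")

# Aggregation without a loop: mean of only the cells above the cutoff.
mean_high = rates[rates > 25].mean()

print(labels_vec)
print(mean_high)
```

Both versions produce the same labels; the vectorized one avoids the explicit Python loop.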

## The NumPy Array

The `NumPy` library provides a multidimensional array object, or `ndarray`.  
The `ndarray` is a fast and flexible container for large homogeneous datasets in Python.  

**`ndarray` features**:  

* can take on many dimensions
* can only hold homogeneous (same type) data
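
A small sketch of what "homogeneous" means in practice: when the inputs have mixed types, `np.array` converts them all to a single common type rather than keeping each element's original type. The variable names are only illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])      # the integers are upcast to float
texty = np.array([1, "two", 3])    # every element becomes a string

# Each array reports exactly one dtype for all of its elements.
print(ints.dtype, mixed.dtype, texty.dtype)

# The number of dimensions is available as .ndim
print(ints.ndim, np.zeros((2, 3, 4)).ndim)
```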


In [None]:
import numpy as np 

arr = np.array( [[1,2,3], [4,5,6], [7,8,9]] )
print( arr )
print( type( arr ) )

### Compare and Contrast Python lists and `NumPy` ndarrays

|          **NumPy Arrays**          |             **Lists**            |
|:----------------------------------:|:--------------------------------:|
| All elements must be the same type | Any combination of types allowed |
|    + does element-wise addition    |       + concatenates a list      |
|          Multidimensional          |           1 dimensional          |
|  supports range & boolean indexing |          range indexing          |


When should you use a Python list, and when a NumPy ndarray?

* Do I need to do math?
* Do I need multiple dimensions?
* Am I storing complex data structures (different types)?
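
The difference in how `+` behaves is easiest to see side by side. This is a minimal illustration with made-up values.

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [10, 20, 30]
a_arr = np.array(a_list)
b_arr = np.array(b_list)

print(a_list + b_list)   # lists: + concatenates
print(a_arr + b_arr)     # arrays: + adds element by element

# A list is happy to hold mixed types, which is when it is the better choice.
mixed_list = [1, "Parvo", 3.5, [4, 5]]
print(mixed_list)
```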

## Creating NumPy arrays

* using the array function.




In [None]:
# np.array() can take any sequence object as an argument
rad = [1,2,3]
radius = np.array( rad )
print( radius )
print( type( radius ) )

In [None]:
# can also pass a list of lists
data = [[ 1,3,5,7,9 ], [0,2,4,6,8]]
data = np.array( data )
print( data )
print( type( data ) )

## Creating NumPy arrays

* other functions: 
    - arange
    - ones
    - zeros
    - empty
    - eye
    
(these are described further in the documentation...)
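
As a starting point, here is one call to each of the functions listed above. The shapes are arbitrary choices for illustration.

```python
import numpy as np

evens = np.arange(0, 10, 2)      # start, stop (exclusive), step
ones = np.ones((2, 3))           # 2x3 array filled with 1.0
zeros = np.zeros(4, dtype=int)   # dtype can be set explicitly
scratch = np.empty((2, 2))       # uninitialized: contents are arbitrary
identity = np.eye(3)             # 3x3 identity matrix

print(evens)
print(ones)
print(zeros)
print(scratch.shape)
print(identity)
```

Note that `np.empty` only allocates memory, so its values should be overwritten before use.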

In [None]:
np.arange( 0, 200, 20 )

In [None]:
# let's try a few others...


## `Numpy`  indexing syntax

Selecting and subsetting ndarrays is very similar to slicing Python lists.  
Numpy also allows for Boolean indexing.  
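
A compact sketch of both styles on a small array (the name `grid` is just for this example):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

first_row = grid[0]            # like indexing a list
block = grid[1:, 1:3]          # rows 1 onward, columns 1 and 2
last_col = grid[:, -1]         # every row, last column
evens = grid[grid % 2 == 0]    # Boolean indexing: keep elements where the mask is True

print(first_row)
print(block)
print(last_col)
print(evens)
```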

In [None]:
arr = np.arange( 36 ).reshape( 6,6 )
#print( arr )

In [None]:
print( arr[0] )
print( arr[0][5] )
print( arr[5,5] )

## Explore NumPy indexing further...

In [None]:
from skimage import data, io
image = data.coffee()
io.imshow(image)
io.show()

In [None]:
type( image )

## Indexing Multidimensional arrays

In [None]:
print( image.shape )
# use indexing syntax to subset the image
subimage = image[ 100:200, 200:400, :]
io.imshow( subimage )
io.show( )

**Take a moment to change the image subset**  

## Indexing with a Boolean Array


In [None]:
bool_array = image < 20
bool_array[ -3:, -3:, :]

In [None]:
image[ bool_array ] = 250
io.imshow( image )
io.show( )

**Take a moment to change the Boolean index**

### Boolean Indexing: another look

In [None]:
#generate some toy data 
celltypes = np.array( [ 'Parvo', 'Magno', 'Konio' ] )
celltypes = celltypes.repeat( [109,148,15] )
np.random.shuffle( celltypes )
celltypes[:10]

In [None]:
#use boolean indexing to select the 'Magno' elements
bool_array = celltypes == 'Magno'
celltypes[ bool_array ]

## Maths Manipulations of `Numpy` Arrays

applying calculations across an entire array

|  **Operation** | **Operator** |
|:--------------:|:------------:|
|    Addition    |       +      |
|   Subtraction  |       -      |
| Multiplication |       *      |
|    Division    |       /      |
| Exponentiation |      **      |

In [None]:
print( radius )

area = np.pi * (radius ** 2 )
print( area )
print( type( area ) )
print( '' )
#np.info( area )

In [None]:
# elementwise operations between numpy arrays
mfr_stim_in_rf = np.array( [ 100, 114, 96, 120 ] )
print( type(mfr_stim_in_rf), mfr_stim_in_rf )
mfr_stim_out_rf = np.array( [ 22, 32, 86, 24 ] )
print( type(mfr_stim_out_rf), mfr_stim_out_rf )
difference = mfr_stim_in_rf - mfr_stim_out_rf  
print( type(difference), difference )

## Using functions to summarize `Numpy` data

In [None]:
np.mean( difference )

In [None]:
print( np.min( difference ) )
print( np.argmin( difference ) )

#max
#argmax
#sum

## Comparison Operators and np.arrays


In [None]:
threshold = 60
meets_criteria = difference>threshold
meets_criteria

## Broadcasting

**Broadcasting** - operations between arrays of different shapes   
**Broadcasting Rule** - Two arrays are compatible for broadcasting if, for each trailing dimension (i.e., starting from the end), the axis lengths match or either of the lengths is 1. Broadcasting is then performed over the missing or length-1 dimensions.
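
The rule can be checked directly by comparing shapes from the last axis backwards. The sketch below uses arbitrary shapes; the incompatible case raises a `ValueError`.

```python
import numpy as np

a = np.ones((5, 6))
row = np.arange(6)                  # shape (6,): last axis matches a's last axis
col = np.arange(5).reshape(5, 1)    # shape (5, 1): length-1 axis is stretched

print((a + row).shape)
print((a + col).shape)

# Shape (5,) is compared against a's last axis (length 6): 5 != 6 and neither is 1.
try:
    a + np.arange(5)
    compatible = True
except ValueError as err:
    compatible = False
    print("not broadcastable:", err)
```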

1. Broadcasting a scalar: simplest case.

In [None]:
arr = np.arange( 5 )
print( arr )
# broadcast the scalar 4 across every element of arr via multiplication
arr * 4

2. row operations

<img src="broadcasting_ax0.png" width="35%" style="margin-left:auto; margin-right:auto">

In [None]:
arr = np.random.randn( 30 ).reshape( 5,6 )
print( arr )

In [None]:
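# use broadcasting to demean (center) the data
# mean(0) averages along axis 0 (down the rows), giving one mean per column
row_means = arr.mean(0)
print( row_means )
print( row_means.shape )
# the length-6 vector of means is broadcast across each of the 5 rows
demeaned = arr - row_means
demeaned.mean(0)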

3. column operations

<img src="broadcasting_ax1.png" width="35%" style="margin-left:auto; margin-right:auto">

In [None]:
print( arr.shape )

In [None]:
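# use broadcasting to demean (center) the data
# mean(1) averages along axis 1 (across the columns), giving one mean per row
col_means = arr.mean(1)
print( col_means )
print( col_means.shape )
# exercise: replace each _____ so the means become a column vector
print( col_means.reshape((_____,1)).shape )

demeaned = arr - col_means.reshape((_____,1))
demeaned.mean(1)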

## reshaping arrays

* arr.reshape((m,n), order='C' | 'F')  (C: row-major; F: column-major, Fortran)
* arr.ravel()
* arr.flatten()
* np.concatenate()
* np.vstack()
* np.hstack()
* np.split()
* np.hsplit()/np.vsplit()
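
A short tour of these functions on a six-element array. This is a sketch with arbitrary shapes, not an exhaustive reference.

```python
import numpy as np

arr = np.arange(6)

c_order = arr.reshape((2, 3))              # fill row by row (the default, order="C")
f_order = arr.reshape((2, 3), order="F")   # fill column by column
print(c_order)
print(f_order)

flat_copy = c_order.flatten()   # always returns a copy
flat_view = c_order.ravel()     # returns a view when possible

stacked_rows = np.vstack([c_order, c_order])          # shape (4, 3)
stacked_cols = np.hstack([c_order, c_order])          # shape (2, 6)
joined = np.concatenate([c_order, c_order], axis=0)   # same result as vstack here

# Split the (2, 6) array back into two equal (2, 3) halves along the columns.
left, right = np.split(stacked_cols, 2, axis=1)
print(stacked_rows.shape, stacked_cols.shape, left.shape)
```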

## linear algebra

* np.dot
* np.linalg.det
* np.linalg.inv
* np.linalg.eig
* np.linalg.svd
* np.linalg.solve
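
Apart from `np.dot`, these routines live in the `np.linalg` submodule. The sketch below runs each one on a small, made-up 2x2 system.

```python
import numpy as np

# Solve the system  2x + y = 3,  x + 3y = 5
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)
print(x)
print(np.dot(A, x))                  # recovers b (up to floating-point error)

det = np.linalg.det(A)               # determinant
A_inv = np.linalg.inv(A)             # matrix inverse
eigvals, eigvecs = np.linalg.eig(A)  # eigenvalues and eigenvectors (as columns)
U, s, Vt = np.linalg.svd(A)          # singular value decomposition
print(det)
```

For solving linear systems, `np.linalg.solve` is generally preferable to multiplying by an explicit inverse.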

## randomness

* np.random.seed
* np.random.permutation
* draw from a distribution:
    - np.random.rand (uniform)
    - np.random.randint
    - np.random.randn (normal)
    - np.random.binomial
    - ... several more
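
These functions are found in the `np.random` submodule. A minimal sketch of each follows; the seed value and the sizes are arbitrary.

```python
import numpy as np

np.random.seed(42)                      # fix the seed so results are reproducible

order = np.random.permutation(5)        # shuffled copy of 0..4
uniform = np.random.rand(3)             # uniform on [0, 1)
dice = np.random.randint(1, 7, size=10) # integers from 1 to 6 (upper bound exclusive)
normal = np.random.randn(2, 3)          # standard normal, shape (2, 3)
flips = np.random.binomial(n=10, p=0.5, size=4)  # successes out of 10 fair coin flips

print(order)
print(uniform)
print(dice)
print(normal)
print(flips)
```

Re-running with the same seed reproduces the same draws, which is handy when sharing simulations.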

## Speaking of randomness, let's take a random walk

**Random Walk/Diffusion Models of Decision making**  

The central idea is that a decision is based on the accumulation of evidence relevant to the decision.  
For perceptual decision making, information takes the form of incoming sensory signals  
Evidence is accumulated and decays over time  
If enough evidence is collected to cross a threshold: Make a decision!  
If a decision is made: Initiate a response!  

<img src="https://www.jneurosci.org/content/jneuro/32/7/2335/F1.large.jpg" width="60%" style="margin-left:auto; margin-right:auto">  

[figure cred](https://www.jneurosci.org/content/32/7/2335)
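
The verbal description above (accumulate evidence, let it decay, decide at a threshold) can be sketched as a leaky accumulator. This is a toy illustration under stated assumptions, not the model from the figure: the function name and the `drift`, `leak`, and `noise_sd` parameters are hypothetical choices made for this example.

```python
import numpy as np

def accumulate_to_threshold(drift, leak, noise_sd, threshold, n_steps=200, seed=0):
    """Return the first step at which evidence reaches threshold, or None.

    Hypothetical toy model: each step adds a noisy sample (mean `drift`)
    and removes a fraction `leak` of the evidence collected so far.
    """
    rng = np.random.RandomState(seed)
    evidence = 0.0
    for t in range(1, n_steps + 1):
        sample = drift + noise_sd * rng.randn()          # incoming sensory signal
        evidence = evidence + sample - leak * evidence   # accumulate, then decay
        if evidence >= threshold:
            return t                                     # decision made at step t
    return None                                          # no decision within n_steps

print(accumulate_to_threshold(drift=1.0, leak=0.0, noise_sd=0.0, threshold=10.0))
print(accumulate_to_threshold(drift=1.0, leak=0.1, noise_sd=0.0, threshold=20.0))
print(accumulate_to_threshold(drift=1.0, leak=0.02, noise_sd=1.0, threshold=20.0))
```

With no noise, the evidence levels off at `drift / leak`, so a threshold above that value is never reached; that is why the second call returns `None`.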

## Let's simulate overly simplified random walks with NumPy

In [None]:
random_walk = [ 0 ]
iterations = 100

for x in range(iterations) :
    # Set step: last element in random_walk
    step = random_walk[ -1 ]
    # Roll the dice
    dice = np.random.randint(1,7)
    # Determine next step
    if dice <= 2:
        step = max( 0, step - 1 )
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    # append step to random_walk
    random_walk.append(step)

# Print random_walk
print( random_walk )
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
plt.plot( random_walk )
plt.show()

In [None]:
# initialize and populate all_walks
all_walks = []
numwalks = 30
for i in range(numwalks) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        # Implement clumsiness
        if np.random.rand() <= 0.01:
            step = step - step/4
        random_walk.append(step)
    all_walks.append(random_walk)

# Convert all_walks to Numpy array: np_aw
np_aw = np.array( all_walks )

# Transpose np_aw: np_aw_t
np_aw_t = np.transpose( np_aw )

# Plot np_aw_t and show
plt.plot( np_aw_t )
plt.show()

## Find out decision threshold crossings

In [None]:
threshold = 80

# did a random walk cross threshold?
crossed_thresh_mask = ( np_aw >= threshold ).any(1)
print( crossed_thresh_mask )

In [None]:
# how many random walks crossed threshold?
crossed_thresh_mask.sum()

In [None]:
# average number of steps to cross threshold
numsteps_crossthresh = ( np_aw[ crossed_thresh_mask ] >= threshold ).argmax( 1 )
numsteps_crossthresh.mean()
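
The `.any(1)` and `.argmax(1)` steps are easier to follow on a tiny, hand-checkable example. The three walks below are made up for illustration.

```python
import numpy as np

# Three made-up walks of five steps each (one walk per row).
walks = np.array([[0, 1, 2, 3, 4],
                  [0, 1, 1, 2, 2],
                  [0, 2, 5, 4, 6]])
threshold = 3

above = walks >= threshold       # Boolean array, same shape as walks
crossed = above.any(axis=1)      # True for each walk that ever reached threshold

# argmax on Booleans returns the index of the first True in each row.
# It also returns 0 for rows with no True at all, so mask with `crossed` first.
first_step = above[crossed].argmax(axis=1)

print(crossed)
print(first_step)
```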

## If you need something to work very fast

`Numba` - works with NumPy-like data to translate Python code into compiled machine code

In [None]:
def mean_distance( x,y ):
    nx = len( x )
    result = 0.0
    count = 0
    for i in range( nx ):
        result += x[i] - y[i]
        count += 1
    return result / count 

x = np.random.randn(10000000)
y = np.random.randn(10000000)

### taking `Numba` for a spin....

In [None]:
%timeit mean_distance( x,y )

In [None]:
import numba as nb
numba_mean_distance = nb.jit( mean_distance )

In [None]:
%timeit numba_mean_distance( x,y )
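
For comparison, the same quantity can be computed with a vectorized NumPy expression and no explicit loop. This sketch repeats the loop function so it is self-contained, and uses smaller arrays than above so it runs quickly.

```python
import numpy as np

def mean_distance(x, y):
    """Loop version: mean of the element-wise differences x[i] - y[i]."""
    result = 0.0
    count = 0
    for i in range(len(x)):
        result += x[i] - y[i]
        count += 1
    return result / count

x = np.random.randn(100_000)
y = np.random.randn(100_000)

loop_result = mean_distance(x, y)
vector_result = np.mean(x - y)   # one vectorized subtraction, then one reduction

print(loop_result, vector_result)
```

In a notebook, `%timeit np.mean(x - y)` can be compared against the timings of the loop and Numba versions.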

## We found where the matrices were hiding. Next we'll look for `pandas`.

<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">