# Introduction to Numpy
- NumPy stands for <u>Num</u>erical <u>Py</u>thon. It is the core library for numerical computation in the Python ecosystem
- Numpy has a powerful multi-dimensional array object (ndarray) that stores data efficiently for quick, easy access and manipulation
- Numpy performs complex array calculations by treating arrays as individual units to avoid using loops and greatly increase speed
- Numpy implements sophisticated math calculation functions efficiently
- Has a C API to connect to libraries written in C, C++, etc.
- ndarray is usually the default or underlying data object in other computational packages

In [32]:
# This cell is for wide display and better use of screen space
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import numpy as np
# When displaying floating-point numbers, keep only 3 decimal places
# This works for all numpy arrays
# np.set_printoptions(precision=3)
# %precision applies to floats in all displayed output, not only numpy arrays
%precision 3

'%.3f'

## Motivating Example
Add elements of the following 2 lists to calculate total expenses for each store and save the result in a list named "total_exp".  

In [2]:
admin_exp = [1224, 1240, 1906, 1303, 1363, 1145, 1237, 1296, 1194, 1231, 1253, 1119]
payroll_exp = [42370.47, 38889.7, 41396.37, 40941.33, 41056.7, 42307.05, 44232.53, 45152.42, 
               46897.79, 44157.78, 44307.4, 43398.66]

### Now You Try 1

In [1]:
# Try to add the two lists directly using "admin_exp + payroll_exp". Can you get total expenses of each store?


In [2]:
# Write a loop to add the corresponding elements in the two lists to get total_exp
# You need to create an empty list named 'total_exp' first. Print out the result
# Did you get the correct result this time?
total_exp = []


## Numpy Array Calculation is Much Faster Than Python List
%time is a magic command in Jupyter Notebook that times the execution of a single Python statement. Magic commands are additional functions outside of normal Python syntax: line magics start with "%" and cell magics start with "%%". Wall time measures how much time elapsed between the start and completion of the execution, which is not necessarily the CPU time spent executing the command. However, because both measurements below are wall times, the comparison is fair.
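
Outside of Jupyter, the same kind of comparison can be sketched with the standard library's `time.perf_counter()`, which also measures elapsed (wall) time. The variable names below are illustrative and are not used elsewhere in this notebook; the measured times will vary from machine to machine.

```python
import time

import numpy as np

n = 1_000_000
demo_arr = np.arange(n)
demo_list = list(range(n))

# Time the vectorized multiplication on the array
start = time.perf_counter()
arr_doubled = demo_arr * 2
numpy_seconds = time.perf_counter() - start

# Time the equivalent explicit loop over the list
start = time.perf_counter()
list_doubled = []
for value in demo_list:
    list_doubled.append(value * 2)
loop_seconds = time.perf_counter() - start

print(f"array: {numpy_seconds:.4f} s, list loop: {loop_seconds:.4f} s")
```

Both approaches produce the same numbers; only the time taken differs.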

In [3]:
# Create a numpy array with 1 million integers. Create a list with 1 million integers. The two data structures have the same contents.
my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [4]:
# %time is a magic command that times the execution of one line of code
# Execute this cell and the next. Your time will be different from mine since we are using 
# different computers. The code multiplies each number in the array by 2.
%time my_arr2 = my_arr * 2

Wall time: 2 ms


In [5]:
%%time
# %%time is a cell magic that times the execution of all code in a cell.
# It must be the first line of the cell.
# The code multiplies each number in the list by 2.
my_list2 = []
for i in my_list:
    my_list2.append(i*2)

Wall time: 104 ms


In [3]:
# Calculate how much faster it is to use Numpy array instead of Python list based on the Wall Time outputs


## Create Numpy Arrays 
1. Numpy arrays can be created by using the array() function. It takes lists, tuples, ndarrays, etc. as arguments. dtype is an optional argument. The dtype of the array will be inferred if it is not passed in.
1. Special arrays can be created with functions such as zeros(), ones() etc.
1. Arrays can also be created using arange() and linspace() functions
    - arange() works similarly to the range() function in Python
    - linspace() takes 3 arguments: start value, stop value and number of elements to generate between start and stop values.
1. full() function creates an array with same-value elements. The required arguments are the shape of the new array and the value for the elements.
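
As a quick illustration of the functions listed above, here is a minimal sketch. The values and variable names are made up for this example and differ from the ones used in the exercises that follow.

```python
import numpy as np

from_list = np.array([1, 2, 3])               # dtype is inferred (an integer type)
as_float = np.array([1, 2, 3], dtype=float)   # dtype is set explicitly
zeros_2x3 = np.zeros((2, 3))                  # 2x3 array of 0.0
int_ones = np.ones(4, dtype=int)              # 1-D array of four integer 1s
steps = np.arange(0, 10, 3)                   # 0, 3, 6, 9 (stop value is excluded)
evenly = np.linspace(0, 1, 5)                 # 0.0, 0.25, 0.5, 0.75, 1.0 (stop value is included)
sevens = np.full((2, 2), 7)                   # 2x2 array filled with 7

print(as_float.dtype, zeros_2x3.shape, steps, evenly)
```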

### Use np.array()

In [4]:
# Create a one-dimensional array from a list [6, 7.5, 8, 8, 1]. Store the result in variable arr1
# Print out the array and the type of the array
# Note the change of element data types in the array
list1 = [6, 7.5, 8, 8, 1]



In [5]:
# Print out list1 and the type of list1. Can you see the difference between the printed results of the array and the list?


#### Write down the differences between the display of a Numpy array and a list:
1. 
1.

In [6]:
# Execute the code in this cell. Note the change of the data type of elements in the array 
# dtype "<U32" represents a Unicode string of up to 32 characters.
# Without using print(), is it easier to differentiate list from array?
np.array([6, 7.5, 8, 8, 1, 'one'])

In [7]:
# Create a two-dimensional array based on the two dimensional list "data". Set dtype to be float. 
# Save the result to variable arr2. Print arr2. Note how the numbers are represented
data = [[1, 2, 3, 4], [5, 6, 7, 8]]


In [8]:
# Create a numpy array based on arr2 just created. Set the dtype to be int. No need to save the array.
# Pay attention to the type of the elements in the new array.


In [9]:
# Create a three-dimensional array using nested list of tuples: [[(1, 2), (3, 4)], [(5, 6), (7, 8)]]
# You can copy and paste the list. Save the result to arr3 and display it.



### Use np.zeros(), ones(), arange(), linspace() and full() to create arrays

In [10]:
# The function zeros() requires only 1 argument, the shape: an integer or a tuple
# The default dtype is float. It can be set to integer with the dtype argument. Create a one-dimensional
# array with 5 zeros


In [11]:
# Create a 3X4 array with ones using np.ones(). Pass in a tuple to specify the dimension


In [12]:
# np.arange() works like range() in Python. It takes up to 3 arguments to create a 1-dimensional array
# start value, stop value (not included) and step (increment). 
# Start value is optional, default is 0
# Step value is optional, default is 1 

# Create an array of even numbers starting with 20 and ending with 38


In [13]:
# np.linspace() creates a 1-dimensional array. 3 arguments: start, stop and number of elements
# Elements are evenly spaced between start and stop (both included). The number of elements
# is specified in the 3rd argument

# Create an array starting with 0, ending with 2 containing 9 elements. 


In [14]:
# np.full(dimension, value) allows you to create an array with the specified value in all elements
# The first argument is the dimension, usually a tuple. The second is the value.
# Create a 2X3 array filled with the integer 5. No need to save the array to a variable


### Now You Try 2

In [15]:
# Create a 2-dimensional numpy array with a nested list below
# without dtype argument. Name it array1. Display it
nested_list = [[34, 45, 12], [9, 10, 5], [33, 80, 65]]


In [16]:
# Create a 2-dimensional numpy array array2 using the same nested
# list with dtype = 'str'. Display it


In [17]:
# Create a 2-dimensional array 'array3' of shape (2, 3) full of ones
# Display it


In [18]:
# Create an array 'array4' using np.arange() that starts with 100,
# ends before 201 and takes 20 as the step. Display it


In [19]:
# Create an array 'array5' using np.linspace() that starts with 0,
# ends with 1 and generates 10 values. Display it


## Array Attributes and Methods
### Array attributes
1. dtype: returns the data type of the elements
1. shape: returns a tuple with the number of elements in each dimension
1. ndim: returns the number of dimensions in an array
1. size: returns the total number of elements in an array
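
A minimal sketch of the four attributes on a small made-up array (the name `grid` is illustrative only):

```python
import numpy as np

grid = np.array([[1.5, 2.0, 3.5], [4.0, 5.5, 6.0]])

print(grid.dtype)  # float64
print(grid.shape)  # (2, 3)
print(grid.ndim)   # 2
print(grid.size)   # 6
```

Note that these are attributes, not methods, so they are accessed without parentheses.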

In [20]:
# Execute the code to create the array and store it in 'arr'
arr = np.array([[2.3, 4.5, 2.7, 9.1], [12, 0.9, 7.8, 2.1], 
                [8.5, 71, 3.6, 8.8]])
arr

In [21]:
# Execute the code to create the array and store it in 'arr0'
arr0 = np.zeros(5, dtype=int)
arr0

In [22]:
# Get the data type of arr


In [23]:
# Get the data type of arr0


In [24]:
# Get the shape of the arr


In [26]:
# Get the shape of the arr0


In [27]:
# Get the number of dimensions of arr


In [28]:
# Get the number of dimensions of arr0


In [29]:
# Get the number of elements in arr


### Array methods
1. reshape(): change the dimension of an array. Takes one argument: the shape of the new array as a tuple. The shape of the new array must be compatible with the old one, i.e. both shapes must hold the same total number of elements
1. sort(): sort the elements of the array in place along the specified axis. By default, the sort is along the last axis (axis=-1), which is axis=1 for a 2-dimensional array.
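
The sketch below illustrates both methods on a small made-up array. The names `flat`, `table`, `by_row` and `by_col` are illustrative; copies are sorted so that the original `table` stays available for comparison.

```python
import numpy as np

flat = np.array([9, 4, 7, 1, 8, 3])
table = flat.reshape((2, 3))   # 6 elements fit a 2x3 shape
# table is [[9, 4, 7],
#           [1, 8, 3]]

by_row = table.copy()
by_row.sort()                  # in place, along the last axis: each row is sorted
# [[4, 7, 9],
#  [1, 3, 8]]

by_col = table.copy()
by_col.sort(axis=0)            # in place, along axis 0: each column is sorted
# [[1, 4, 3],
#  [9, 8, 7]]

# An incompatible shape raises an error: 6 elements cannot fill a 4x2 array
try:
    flat.reshape((4, 2))
except ValueError as err:
    print("reshape failed:", err)
```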

In [32]:
d = np.arange(20, 40, 2)
d

array([20, 22, 24, 26, 28, 30, 32, 34, 36, 38])

In [33]:
# Change the array d to be a 2 X 5 array using reshape()
d = d.reshape((2, 5))
d

array([[20, 22, 24, 26, 28],
       [30, 32, 34, 36, 38]])

In [34]:
# create an array of 20 values between 0 and 1 using linspace(). Reshape it to be 5X4
np.linspace(0, 1, 20).reshape((5, 4))

array([[0.   , 0.053, 0.105, 0.158],
       [0.211, 0.263, 0.316, 0.368],
       [0.421, 0.474, 0.526, 0.579],
       [0.632, 0.684, 0.737, 0.789],
       [0.842, 0.895, 0.947, 1.   ]])

### ndarray Axes: many calculations need to specify the axis along which the calculation happens
- Dimensions of ndarrays are called axes, numbered starting from 0
- A one-dimensional array has one axis: axis 0
- A two-dimensional array has two axes: axis 0 and 1
- A three-dimensional array has three axes: axis 0, 1, and 2
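
One way to see what an axis means is to reduce along it and look at the shape of the result. This is a small sketch with a made-up three-dimensional array; `cube` is an illustrative name, and sum() is used only as an example of a calculation that takes an axis argument.

```python
import numpy as np

cube = np.arange(24).reshape((2, 3, 4))   # three axes with lengths 2, 3 and 4

print(cube.ndim)                # 3
print(cube.sum(axis=0).shape)   # (3, 4): axis 0 is collapsed
print(cube.sum(axis=1).shape)   # (2, 4): axis 1 is collapsed
print(cube.sum(axis=2).shape)   # (2, 3): axis 2 is collapsed
```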

In [19]:
arr = np.array([[2.3, 4.5, 2.7, 9.1], [12, 0.9, 7.8, 2.1], 
                [8.5, 71, 3.6, 8.8]])
arr

array([[ 2.3,  4.5,  2.7,  9.1],
       [12. ,  0.9,  7.8,  2.1],
       [ 8.5, 71. ,  3.6,  8.8]])

In [30]:
# Call sort() on arr without using any argument. Display the array. Along which axis is the sort done?


In [31]:
# Call sort again with argument axis=0. Display the array. Along which axis does the sort happen?


### Now You Try 3

In [33]:
# Create an ndarray using np.arange(). Start value should be 50 
# and stop value should be 550. Use 5 as the step value. Save it in array1 and display it


In [32]:
# Print out the shape and dimension of the array


In [34]:
# reshape array1 into a 2-dimensional 10x10 array using reshape(). Display it


In [35]:
# Create an ndarray using np.linspace() between 0 and 1 with 9 values
# reshape it into a 3X3 array. name it array2. Display it


In [36]:
# Display the data type of array2


## Random Number Generation
1. random is a module in np. All the functions must be called using prefix np.random.
1. Generate a float or an array of floats between 0 and 1: \[0, 1) using rand()
    - With no arguments, a random float number in the range is generated. 
    - With integers indicating number of elements in each dimension, an array with random floats is generated
1. Generate random integers with randint(). All numbers in the range have equal chance to be returned. It can be considered a uniform probability distribution.
    - In the range \[0, n) by passing in one integer n
    - In the range \[low, high) by passing in two integers
    - The shape of the array can be passed in as a tuple to the size argument. If it is omitted, only one number will be generated.
1. Generate random floats from a normal distribution using normal(mean, stdev, size)
    - First argument is mean
    - Second argument is standard deviation
    - Third argument size specifies the shape of the array to be generated
    - If no argument is passed, a float will be randomly generated from a normal distribution with mean=0 and std=1.
1. Generate random integers (numbers of successes) from a binomial distribution using binomial()
    - First argument is number of trials 
    - Second argument is probability of success
    - Third argument is number of random samples to draw or the shape of the array that will be created. 
1. Generate random floats from a uniform distribution with uniform()
    - low: the lower boundary with default value 0, inclusive
    - high: the higher boundary with default value 1, not inclusive
    - size can be an integer or a tuple indicating the shape of the array generated
1. seed(): specify the seed of the random number generator. If an integer is passed in, the random numbers generated can be reproduced.
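
The functions above can be sketched together as follows. The parameter values and variable names are made up for illustration and differ from those in the exercises below.

```python
import numpy as np

np.random.seed(123)  # fix the seed so the draws below are repeatable

one_float = np.random.rand()                        # a single float in [0, 1)
float_grid = np.random.rand(2, 3)                   # 2x3 array of floats in [0, 1)
die_rolls = np.random.randint(1, 7, size=10)        # integers in [1, 7), i.e. 1 to 6
heights = np.random.normal(170, 10, size=(2, 2))    # mean 170, standard deviation 10
successes = np.random.binomial(10, 0.5, size=5)     # successes out of 10 trials with p = 0.5
measurements = np.random.uniform(5, 15, size=4)     # floats in [5, 15)

# Re-seeding with the same integer replays the same sequence of numbers
np.random.seed(123)
replay = np.random.rand()
print(one_float == replay)  # True
```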

In [37]:
# Generate a random float in the range [0, 1). Execute the cell a few times to see the randomness
# Use rand() in the following cells until a cell says another function must be used


In [38]:
# Generate a 1-dimensional array of 4 random floats in the Range: [0, 1)


In [39]:
# Generate a 2 X 3 array with random floats between 0 and 1. 


In [40]:
# Generate a 2X3 array with random numbers between 5 and 15. 


In [41]:
# Generate a random integer in the range [0, n) where n=10. Execute the cell a few times to see the randomness
# Use randint()

In [42]:
# Generate a random integer between 10 and 100


In [43]:
# use random number generator to simulate tossing a coin 100 times. Save the result in a 10X10 array


In [45]:
# generate a random float from a normal distribution with mean 0 and standard deviation 1
# Use normal()


In [44]:
# generate a random float from a normal distribution with mean 150 and standard deviation 30


In [46]:
# generate a 2x2 array of random float from a normal distribution with mean 150 and standard deviation 30


In [47]:
# Set the random seed to 123 to make the output of this cell reproducible.
# Get 5 numbers from a normal distribution with mean of 10 and standard deviation 1. 
# Execute the cell a few times to see whether the output changes


### Now You Try 4

In [48]:
# Create an ndarray with 3 rows and 4 columns filled with random numbers 
# from the standard normal distribution. Store it in 'normal_array' and display it
# Note that a standard normal distribution has mean 0 and stdev 1
# Run the code a couple of times. Do the values in the array change?


In [49]:
# Set the random seed to be 123. Create another ndarray (5X6) with random 
# integers between 300 and 500 named int_array. Display it
# Run the code a couple of times. Do the values in the array change?


In [50]:
# Sort int_array by row and display it


In [51]:
# Sort int_array by column and display it


## Array Indexing and Slicing
Use array indexing and slicing to get or set individual elements or a subset of elements from arrays. <b>Array slices are views, not copies. Any change made to a slice is also made in the source array</b>.
### One-dimensional array indexing and slicing
Indexing and slicing for one-dimensional arrays work very similarly to those of Python lists.
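
The main difference from lists is that a slice of an array is a view. The following sketch, using a made-up array named `source`, shows a change made through a slice appearing in the original array, and how copy() avoids that.

```python
import numpy as np

source = np.arange(10)        # [0, 1, ..., 9]

window = source[2:5]          # a view onto the elements at index 2, 3 and 4
window[0] = 99                # the write goes through to source
print(source[2])              # 99

safe = source[5:8].copy()     # an independent copy of the slice
safe[0] = -1                  # source is not affected
print(source[5])              # 5
```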

In [58]:
# Create a one-dimensional array 'arr1' and have it displayed
arr1 = np.array([6, 7.5, 8, 8, 1])
arr1

array([6. , 7.5, 8. , 8. , 1. ])

In [52]:
# Get the element at index 0


In [53]:
# Get an array with elements at index 1 and 2 using slicing [start:stop:step] just like list slicing
# Note that the ending index value is not included in the returned value


In [54]:
# Get an array with items from index 2 on using slicing


In [55]:
# Get an array with items from the first to the one before the end using slicing


In [56]:
# Get an array with every other item using slicing


In [57]:
# Get an array in reverse order of arr1 using slicing


In [58]:
# Get an array with all the items using slicing


In [59]:
# Set the values of the 3rd and 4th elements to 100


### Two-dimensional array indexing and slicing
Similar to indexing nested lists. Two indexing or slicing expressions, separated by a comma, are needed to specify rows and columns respectively.
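
A minimal sketch of the `[rows, columns]` pattern on a small made-up array (the name `matrix` and the chosen indexes are illustrative only):

```python
import numpy as np

matrix = np.arange(12).reshape((3, 4))
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11]]

element = matrix[1, 2]     # row 1, column 2: the value 6
first_row = matrix[0]      # first row: [0, 1, 2, 3]
column = matrix[:, 1]      # all rows, column 1: [1, 5, 9]
block = matrix[:2, 1:3]    # rows 0-1, columns 1-2: [[1, 2], [5, 6]]
row_2d = matrix[0:1, :]    # slicing the rows keeps two dimensions: shape (1, 4)
```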

In [67]:
# Work on 2-dimensional arrays. Create a 5X4 array with random values from 
# a normal distribution with mean 100 and std 10. Save it to 'arr2d' and display it
arr2d = np.random.normal(100, 10, size=(5, 4))
arr2d

array([[122.074, 105.227, 104.656, 107.249],
       [114.958, 107.466,  88.99 ,  85.897],
       [ 92.523,  90.151,  92.514, 102.404],
       [ 81.444,  82.205,  72.498,  97.658],
       [ 93.04 ,  82.259, 123.616, 100.35 ]])

In [60]:
# get the third row of arr2d using indexing


In [62]:
# get the first 3 rows of arr2d using slicing


In [63]:
# get the 3rd and 4th rows using slicing


In [64]:
# get the element in the third row and 3rd column of arr2d using indexing


In [65]:
# get the second column of arr2d. Note that rows returned must be specified before the column is specified


In [66]:
# get the middle 6 elements using slicing. Find out what rows and columns are needed first


In [67]:
# set the middle 2 elements of the last row to be 5 and display it. First decide the indexes for rows and columns


### Now You Try 5
Complete the questions on the PPT first

In [68]:
# Get the first row in arr2d


In [69]:
# Get the first row as a two-dimensional array with shape (1, 4)


In [70]:
# Get the first column of arr2d


### Boolean Indexing
Use a Boolean array as a mask to select the elements to be returned. <b> A copy, instead of a view of the array, is returned. </b>

    - The Boolean array is passed inside the index operator [] when indexing the array
    - The Boolean array must have the same length as the axis it is indexing
    - Boolean array indexing can also be mixed with slices or integer indexes
    - == means equal to
    - != means not equal to
    - ~ means not
    - & means and
    - | means or
    - When combining conditions with & or |, each condition must be enclosed in (). A condition negated with ~ must also be enclosed in ().
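
The rules above can be sketched with two small made-up arrays. The names `cities` and `readings` are illustrative and unrelated to the arrays used in the exercises.

```python
import numpy as np

cities = np.array(['Oslo', 'Lima', 'Cairo', 'Perth'])
readings = np.array([4, 19, 28, 23])

warm = readings > 20                                # [False False  True  True]
print(cities[warm])                                 # ['Cairo' 'Perth']
print(cities[(readings > 10) & (readings < 25)])    # ['Lima' 'Perth']
print(cities[~warm | (cities == 'Oslo')])           # ['Oslo' 'Lima']

picked = readings[warm]    # a copy: changing it leaves readings untouched
picked[0] = 0
print(readings)            # [ 4 19 28 23]

readings[readings < 10] = 10   # assigning through a mask does change readings
print(readings)                # [10 19 28 23]
```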

In [14]:
# 3 arrays are created from Python lists
names = np.array(['John', 'Christine', 'Joy', 'Grace', 'Tom'])
colors = np.array(['red', 'blue', 'green', 'red', 'white'])
genders = np.array(['male', 'female', 'female', 'female', 'male'])

In [71]:
# get a boolean array for colors not white using !=


In [72]:
# get a boolean array for colors red or green using == and |


In [73]:
# get a boolean array for colors blue and black using == and &


In [74]:
# get a boolean array for colors not red using == and ~


In [75]:
# Get the names whose favorite color is red by passing the boolean array to []


In [76]:
# Get the names of females who don't like green. In [], you should have 2 conditions


In [77]:
# Generate a 5x4 2-dimensional array named color_data for the indexing examples below


In [78]:
# get rows corresponding to color red. This is possible because the array has 5 rows and there are 5 items in colors 


In [79]:
# Get the data that matches color blue and return only the first 3 columns. Combine boolean indexing and slicing


In [80]:
# Get a boolean array by checking elements in color_data > 2. Save it to bool_arr and display



In [81]:
# Set the elements of color_data to be 0 when the value is < 10 using boolean indexing


### Now You Try 6

In [82]:
# Use the names, colors and genders arrays created above, get
# the names of males who favor either red or green


In [83]:
# Create an array with random numbers and display it. It's possible that the randomly generated array has no values
# between -0.5 and 0. Run the code a few times to get values in the range


In [84]:
# Set all values less than 0 and greater than or equal to -0.5 to be 0 using Boolean indexing. Print the array 


## Math and Stats in Numpy
1. Arithmetic calculations such as +, -, *, and / between arrays with the same shape are done element-wise
    - element-wise: the operation is conducted on elements of the arrays in the same position. The result is placed at the position in the new array.
1. Calculation between array and a scalar (a number) is element-wise
1. No loops are needed for this type of calculation. This is called vectorization
1. Calculation using math functions is conducted element-wise without the need to loop, in a vectorized way. Treat arrays like numbers using these functions: abs, exp, log, log10, ceil, floor, add, multiply, minimum, modf, etc. More about ufuncs can be found here: https://numpy.org/doc/stable/reference/ufuncs.html#math-operations
1. Stats functions: sum(), mean(), std(), min(), max(), cumprod(), etc.
    - Can be done for all elements in an array when no argument is passed in
    - Can be done along an axis. If axis = 0, the summary stats by column are returned. To get row statistics, pass in axis = 1.
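
A short sketch of these points with two small made-up arrays (the names `prices` and `units` are illustrative only):

```python
import numpy as np

prices = np.array([[1.0, 4.0, 9.0], [16.0, 25.0, 36.0]])
units = np.array([[2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])

revenue = prices * units     # element-wise product, same shape as the inputs
shifted = prices + 10        # the scalar is applied to every element
roots = np.sqrt(prices)      # math function applied element-wise: [[1, 2, 3], [4, 5, 6]]

total = prices.sum()               # over all elements: 91.0
col_means = prices.mean(axis=0)    # one value per column: [8.5, 14.5, 22.5]
row_means = prices.mean(axis=1)    # one value per row: 14/3 and 77/3
```

No explicit loop appears anywhere; each line operates on whole arrays at once.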

In [137]:
arr = np.random.normal(200, 3, size=(3, 4))
arr

array([[201.662, 198.408, 204.132, 199.57 ],
       [200.061, 199.418, 200.402, 202.113],
       [201.997, 197.305, 204.571, 196.715]])

In [93]:
arr2 = np.linspace(0, 11, 12).reshape(3, 4)
arr2

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [87]:
# Multiply the corresponding elements of arr and arr2 to get a new 3X4 array 


In [86]:
# Add 5 to each element of arr2


In [85]:
# Calculate log10 of the elements in arr using np.log10()


In [88]:
# Get the fractional and integral parts of the elements in arr using np.modf(), which returns 2 arrays


In [89]:
# Use np.add() to add corresponding elements of arr and arr2 


In [90]:
# Get the mean of arr. Note the difference using an axis argument or not and the values axis argument takes


In [91]:
# Get the means of arr by columns


In [92]:
# Get the means of arr by rows


In [93]:
# Get min of arr by rows


In [94]:
# Get the cumulative sum of arr along columns using cumsum()


### Now You Try 7

In [95]:
# Create an ndarray with 20 random integers from 1 to 10. Set the shape 
# to be 4X5. Name it my_array and display it


In [96]:
# Multiply all elements of my_array by 2


In [97]:
# Get the mean and standard deviation of all the values in my_array and print them out


In [98]:
# Get the row means and column means of my_array and print them out respectively


In [99]:
# Get the square root of each element, save the result to sqrt_array and display it


In [100]:
# Add my_array and sqrt_array


## Numpy where
np.where(condition, x, y) can be used to build a new array based on a condition. It takes 3 arguments: condition, x and y. The condition is checked on each element; the result takes the value from x where the condition is true and from y otherwise. The original array is not modified unless the result is assigned back to it.
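
A minimal sketch with a small made-up array (the name `scores` is illustrative only). Either x or y can be a scalar or an array of matching shape.

```python
import numpy as np

scores = np.array([[-1.2, 0.4], [2.5, -0.3]])

signs = np.where(scores > 0, 1, -1)         # [[-1, 1], [1, -1]]
clipped = np.where(scores < 0, 0, scores)   # negatives become 0, other values are kept

print(signs)
print(clipped)
print(scores)   # unchanged: np.where returned new arrays
```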

In [143]:
# Create a random array to be used 
arr = np.random.normal(0, 1, (3, 3))
arr

array([[-0.611, -0.392,  0.14 ],
       [ 0.093,  1.46 ,  1.395],
       [-0.359, -0.549, -2.557]])

In [101]:
# Set the element to be 2 if it is greater than 0. Otherwise, set it to -2


In [102]:
# Set the element to be 2 if it is greater than 0. Otherwise, don't change its value


### Now You Try 8

In [103]:
# Create a 3 X 6 array 'temp1' with values from a normal distribution where
# the mean is 10 and standard dev is 5. 


In [104]:
# Create another 3 X 6 array temp2 with values of random integers from 0 to 10. 


In [105]:
# Use np.where to set values of temp1 to be corresponding values of temp2 if those 
# in temp1 are greater than the mean of temp1. Otherwise, keep the value unchanged
