# Intro to NumPy
NumPy is the fundamental library for scientific computing in Python. It forms the basis for most of the important data science libraries, such as pandas and scikit-learn.

The main data structure that NumPy provides is the n-dimensional array object, or **`ndarray`**. An ndarray may have any number of dimensions. In data science we typically deal with two-dimensional tabular data of rows and columns, so we will begin by creating a two-dimensional array of random dice rolls and doing some basic analysis on it.

In [2]:
import numpy as np
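
The statement that an ndarray may have any number of dimensions can be illustrated with a small sketch. The arrays `one_d`, `two_d`, and `three_d` are hypothetical examples and are not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical example arrays with one, two, and three dimensions
one_d = np.array([1, 2, 3])
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])
three_d = np.zeros((2, 3, 4))

# ndim is the number of dimensions, shape is the length of each dimension
print(one_d.ndim, one_d.shape)      # 1 (3,)
print(two_d.ndim, two_d.shape)      # 2 (2, 3)
print(three_d.ndim, three_d.shape)  # 3 (2, 3, 4)
```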

## Create first array - roll some dice
To get things started, we will create a two-dimensional array of random dice rolls, generated with the **`randint`** function. The first line of code sets the seed of the random number generator so that we all get exactly the same numbers.

The first two parameters of the **`randint`** function provide the lower and upper bounds for the random numbers. The upper bound is not included, so bounds of 1 and 7 produce values from 1 through 6. The third parameter is the shape of the array as a tuple - **`(rows, columns)`**.

In [3]:
np.random.seed(123)
a = np.random.randint(1, 7, (10, 4))
a

array([[6, 3, 5, 3],
       [2, 4, 3, 4],
       [2, 2, 1, 2],
       [2, 1, 1, 2],
       [4, 6, 5, 1],
       [1, 5, 2, 4],
       [3, 5, 3, 5],
       [1, 6, 1, 2],
       [4, 5, 5, 5],
       [2, 6, 4, 3]])
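
As a quick check of the exclusive upper bound, the sketch below confirms that every roll falls between 1 and 6. It also shows the newer `Generator` interface (`np.random.default_rng` and its `integers` method), which is not used elsewhere in this notebook; the names `rolls`, `rng`, and `new_rolls` are illustrative, and the two interfaces are not expected to produce the same numbers for the same seed.

```python
import numpy as np

# Legacy interface used in this notebook: values from 1 up to, but not including, 7
np.random.seed(123)
rolls = np.random.randint(1, 7, (10, 4))
print(rolls.shape)                                      # (10, 4)
print(bool(rolls.min() >= 1), bool(rolls.max() <= 6))   # True True

# Newer Generator interface; the upper bound is also exclusive by default
rng = np.random.default_rng(123)
new_rolls = rng.integers(1, 7, size=(10, 4))
print(bool(new_rolls.min() >= 1), bool(new_rolls.max() <= 6))  # True True
```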

### Accessing elements
In regular Python, the indexing operator, the brackets **`[]`**, selects particular items from objects. This is most commonly done with strings, lists, and dictionaries. ndarrays use the same operator for selecting subsets of data. 

To select a single element, place the integer location (the index) of the row and column inside the brackets separated by a comma.

`array[row_selection, column_selection]`

Below we select the number at row location 6, and column location 2. NumPy arrays are indexed beginning at 0.

In [4]:
a[6, 2]

3

In [8]:
a[0, 0]

6
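
A minimal sketch of single-element selection on a small hand-built array, where the values make the row and column positions easy to read. The array `grid` is hypothetical. Negative indexes count from the end, as with Python lists, and a single integer selects a whole row.

```python
import numpy as np

# Each value encodes its position: tens digit = row + 1, ones digit = column
grid = np.array([[10, 11, 12],
                 [20, 21, 22],
                 [30, 31, 32]])

print(grid[1, 2])    # 22 -> row 1, column 2
print(grid[-1, -1])  # 32 -> last row, last column
print(grid[1])       # [20 21 22] -> the whole of row 1
```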

### Use slice notation to select multiple rows or columns

### `start:stop:step`

Slice notation only works inside the brackets of the selection operator. A slice runs from `start` up to, but not including, `stop`. The step value defaults to 1 if not provided.

In [6]:
# select rows 2 through 4 (the stop value 5 is excluded) of just column 2
a[2:5, 2]

array([1, 1, 5])

In [9]:
# select rows 2 through 4 of columns 1 and 2
a[2:5, 1:3]

array([[2, 1],
       [1, 1],
       [6, 5]])

In [10]:
# select every other row from 3 up to (but not including) 8, along with all columns from 2 to the end
a[3:8:2, 2:]

array([[1, 2],
       [2, 4],
       [1, 2]])
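
One detail worth knowing about slices is that basic slicing returns a view of the original data rather than a copy. The sketch below uses a hypothetical array `table` to show that the stop value is excluded and that writing to a slice changes the original array, while an explicit `.copy()` does not.

```python
import numpy as np

table = np.arange(12).reshape(3, 4)
# table is [[ 0  1  2  3]
#           [ 4  5  6  7]
#           [ 8  9 10 11]]

# rows 0 and 1, columns 1 and 2 (stop values 2 and 3 are excluded)
block = table[0:2, 1:3]
print(block.tolist())   # [[1, 2], [5, 6]]

# the slice is a view, so this assignment also changes table
block[0, 0] = 99
print(table[0, 1])      # 99

# a copy is independent of the original
independent = table[0:2, 1:3].copy()
independent[0, 0] = -1
print(table[0, 1])      # 99
```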

## Slice notation without start or stop values
If you do not provide the start value, the slice begins from the first element. Likewise, not providing the stop value will end the slice at the last value.

In [11]:
# select from row 6 to the end, and the first three columns (0 through 2)
a[6:, :3]

array([[3, 5, 3],
       [1, 6, 1],
       [4, 5, 5],
       [2, 6, 4]])
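
A few more cases of omitted start and stop values, sketched on a hypothetical array `table`. A bare `:` selects everything along that dimension.

```python
import numpy as np

table = np.arange(12).reshape(3, 4)
# table is [[ 0  1  2  3]
#           [ 4  5  6  7]
#           [ 8  9 10 11]]

print(table[:, 0].tolist())     # [0, 4, 8] -> all rows, column 0
print(table[1, :].tolist())     # [4, 5, 6, 7] -> row 1, all columns
print(table[::2, :2].tolist())  # [[0, 1], [8, 9]] -> every other row, first two columns
```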

### Problem 1
<span style="color:green">Create an array with 5 rows and 6 columns with random numbers. Assign it to variable `arr` and output it to the screen.</span>

In [13]:
# your code here
np.random.seed(1234)
arr = np.random.randint(1, 100, (5, 6))
arr

array([[48, 84, 39, 54, 77, 25],
       [16, 50, 24, 27, 31, 44],
       [31, 27, 59, 93, 70, 81],
       [74, 48, 51, 77, 38, 35],
       [39, 68, 12,  1, 76, 81]])

### Problem 2
<span style="color:green">Select single elements of `arr` multiple times.</span>

In [14]:
# your code here
print(arr[0,0])
print(arr[1,2])
print(arr[4,2])

48
24
12


### Problem 3
<span style="color:green">Select many subsets of elements of `arr` using slice notation.</span>

In [15]:
# your code here
print(arr[1:3,2:5])

[[24 27 31]
 [59 93 70]]


## Arithmetic operations on the entire array
Applying an arithmetic operation to an entire array is easy and has the same syntax as operating on two numbers.

In [16]:
# multiply each element by 5
a * 5

array([[30, 15, 25, 15],
       [10, 20, 15, 20],
       [10, 10,  5, 10],
       [10,  5,  5, 10],
       [20, 30, 25,  5],
       [ 5, 25, 10, 20],
       [15, 25, 15, 25],
       [ 5, 30,  5, 10],
       [20, 25, 25, 25],
       [10, 30, 20, 15]])

In [17]:
# subtract 3 from each element
a - 3

array([[ 3,  0,  2,  0],
       [-1,  1,  0,  1],
       [-1, -1, -2, -1],
       [-1, -2, -2, -1],
       [ 1,  3,  2, -2],
       [-2,  2, -1,  1],
       [ 0,  2,  0,  2],
       [-2,  3, -2, -1],
       [ 1,  2,  2,  2],
       [-1,  3,  1,  0]])

## Vectorized Operations
NumPy is very fast by Python standards because its array operations run in pre-compiled C code rather than in the Python interpreter. **Vectorized** is the term for an operation that is applied to many elements at once without explicitly writing a for loop.
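
To make the idea concrete, the sketch below computes the same result twice: once with an explicit for loop and once as a vectorized operation. The names `values`, `looped`, and `vectorized` are illustrative. No timings are shown here; the point is only that both forms give identical results, with the vectorized form being shorter.

```python
import numpy as np

values = np.arange(1, 6)  # [1 2 3 4 5]

# Explicit loop: visit each element one at a time
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 5

# Vectorized: one expression applied to every element
vectorized = values * 5

print(vectorized)                          # [ 5 10 15 20 25]
print(np.array_equal(looped, vectorized))  # True
```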

## Array attributes and methods
Much of the power and functionality of NumPy arrays is accessible through methods called with dot notation. There are also a few useful attributes, which are accessed without parentheses.

In [18]:
# get dimensions
a.shape

(10, 4)

In [19]:
# get number of dimensions
a.ndim

2

In [20]:
# total number of elements
a.size

40

In [21]:
# Transpose array
a.T

array([[6, 2, 2, 2, 4, 1, 3, 1, 4, 2],
       [3, 4, 2, 1, 6, 5, 5, 6, 5, 6],
       [5, 3, 1, 1, 5, 2, 3, 1, 5, 4],
       [3, 4, 2, 2, 1, 4, 5, 2, 5, 3]])
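
The attributes are related to each other, which the following sketch checks on a hypothetical array `table`. It also shows `dtype`, another attribute, which reports the type of the stored elements.

```python
import numpy as np

table = np.ones((10, 4))

print(table.shape)    # (10, 4)
print(table.ndim)     # 2 -> the length of the shape tuple
print(table.size)     # 40 -> the product of the shape values
print(table.T.shape)  # (4, 10) -> transposing swaps rows and columns
print(table.dtype)    # float64
```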

### Descriptive statistics
A number of common descriptive statistics are available as methods. By default, they operate over all elements of the array and return a single value.

In [22]:
a.max()

6

In [23]:
a.min()

1

In [24]:
a.sum()

131

In [25]:
a.mean()

3.275

In [26]:
a.std()

1.6429774800647756

### Reshaping methods

In [27]:
# make a single dimension
a.flatten()

array([6, 3, 5, 3, 2, 4, 3, 4, 2, 2, 1, 2, 2, 1, 1, 2, 4, 6, 5, 1, 1, 5,
       2, 4, 3, 5, 3, 5, 1, 6, 1, 2, 4, 5, 5, 5, 2, 6, 4, 3])

In [None]:
# reshape - pass a tuple of the new shape
# the new shape must hold the same total number of elements (8 * 5 == 40)
a.reshape((8, 5))
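
A minimal sketch of what it means for the new shape to "work", using a hypothetical 40-element array `values`. Passing `-1` for one dimension asks NumPy to infer it, and a shape with the wrong number of elements raises a `ValueError`.

```python
import numpy as np

values = np.arange(40)

# 8 * 5 == 40, so this shape is valid
print(values.reshape((8, 5)).shape)   # (8, 5)

# -1 lets NumPy infer the missing dimension: 40 / 4 == 10
print(values.reshape((-1, 4)).shape)  # (10, 4)

# 7 * 5 == 35 does not match 40 elements, so reshape raises an error
reshape_failed = False
try:
    values.reshape((7, 5))
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```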

## Use the `axis` parameter to apply a method in a single direction
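The `axis` parameter gives descriptive statistics for each row or each column. With `axis=0` the method is applied down the rows, giving one result per column; with `axis=1` it is applied across the columns, giving one result per row.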

In [28]:
# take max of each column
a.max(axis=0)

array([6, 6, 5, 5])

In [29]:
# take max of each row
a.max(axis=1)

array([6, 4, 2, 2, 6, 5, 5, 6, 5, 6])

In [30]:
a.sum(axis=0)

array([27, 43, 30, 31])

In [31]:
# by default axis is set to None
a.sum(axis=None)

131

In [32]:
# not necessary to pass the parameter None to the method
a.sum()

131

![](images/numpy_axis.png)
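
The direction of each axis is easiest to verify on an array small enough to add up by hand. The array `small` below is a hypothetical example.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0 collapses the rows: one sum per column
print(small.sum(axis=0).tolist())  # [5, 7, 9]

# axis=1 collapses the columns: one sum per row
print(small.sum(axis=1).tolist())  # [6, 15]

# axis=None (the default) collapses everything to a single value
print(small.sum())                 # 21
```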

### Problem 4
<span style="color:green">Practice using the basic vectorized arithmetic operations.</span>

In [33]:
# your code here
arr * 3

array([[144, 252, 117, 162, 231,  75],
       [ 48, 150,  72,  81,  93, 132],
       [ 93,  81, 177, 279, 210, 243],
       [222, 144, 153, 231, 114, 105],
       [117, 204,  36,   3, 228, 243]])

### Problem 5
<span style="color:green">Practice calling many of the methods. Use the tab completion help to find them. Change the direction of operation with the **`axis`** parameter.</span>

In [34]:
# your code here
arr.T
arr.max()
arr.min(axis=0)

array([16, 27, 12,  1, 31, 25])

# NumPy functions on arrays
Not all functionality is available as array methods. NumPy provides more functionality with its functions. These are accessed with **`np.`** followed by the function name. You will usually pass the array to the function as the first argument.

In [39]:
b = a - a.mean()
b

array([[ 2.725, -0.275,  1.725, -0.275],
       [-1.275,  0.725, -0.275,  0.725],
       [-1.275, -1.275, -2.275, -1.275],
       [-1.275, -2.275, -2.275, -1.275],
       [ 0.725,  2.725,  1.725, -2.275],
       [-2.275,  1.725, -1.275,  0.725],
       [-0.275,  1.725, -0.275,  1.725],
       [-2.275,  2.725, -2.275, -1.275],
       [ 0.725,  1.725,  1.725,  1.725],
       [-1.275,  2.725,  0.725, -0.275]])

In [40]:
# absolute value. There is no abs method
np.abs(b)

array([[2.725, 0.275, 1.725, 0.275],
       [1.275, 0.725, 0.275, 0.725],
       [1.275, 1.275, 2.275, 1.275],
       [1.275, 2.275, 2.275, 1.275],
       [0.725, 2.725, 1.725, 2.275],
       [2.275, 1.725, 1.275, 0.725],
       [0.275, 1.725, 0.275, 1.725],
       [2.275, 2.725, 2.275, 1.275],
       [0.725, 1.725, 1.725, 1.725],
       [1.275, 2.725, 0.725, 0.275]])

In [41]:
# take the square root of each element
np.sqrt(a)

array([[2.44948974, 1.73205081, 2.23606798, 1.73205081],
       [1.41421356, 2.        , 1.73205081, 2.        ],
       [1.41421356, 1.41421356, 1.        , 1.41421356],
       [1.41421356, 1.        , 1.        , 1.41421356],
       [2.        , 2.44948974, 2.23606798, 1.        ],
       [1.        , 2.23606798, 1.41421356, 2.        ],
       [1.73205081, 2.23606798, 1.73205081, 2.23606798],
       [1.        , 2.44948974, 1.        , 1.41421356],
       [2.        , 2.23606798, 2.23606798, 2.23606798],
       [1.41421356, 2.44948974, 2.        , 1.73205081]])

In [42]:
# chain the round method after the square root
np.sqrt(a).round(1)

array([[2.4, 1.7, 2.2, 1.7],
       [1.4, 2. , 1.7, 2. ],
       [1.4, 1.4, 1. , 1.4],
       [1.4, 1. , 1. , 1.4],
       [2. , 2.4, 2.2, 1. ],
       [1. , 2.2, 1.4, 2. ],
       [1.7, 2.2, 1.7, 2.2],
       [1. , 2.4, 1. , 1.4],
       [2. , 2.2, 2.2, 2.2],
       [1.4, 2.4, 2. , 1.7]])

In [43]:
# some functions do the same things as methods
np.sum(a)

131

In [44]:
# sort works along the last axis by default, so each row is sorted independently
np.sort(a)

array([[3, 3, 5, 6],
       [2, 3, 4, 4],
       [1, 2, 2, 2],
       [1, 1, 2, 2],
       [1, 4, 5, 6],
       [1, 2, 4, 5],
       [3, 3, 5, 5],
       [1, 1, 2, 6],
       [4, 5, 5, 5],
       [2, 3, 4, 6]])
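
`np.sort` also accepts an `axis` parameter. The sketch below, on a hypothetical array `small`, compares the default with `axis=0` and `axis=None`. `np.sort` returns a sorted copy and leaves the original array unchanged.

```python
import numpy as np

small = np.array([[3, 1, 2],
                  [0, 5, 4]])

# default: sort within each row
print(np.sort(small).tolist())             # [[1, 2, 3], [0, 4, 5]]

# axis=0: sort within each column
print(np.sort(small, axis=0).tolist())     # [[0, 1, 2], [3, 5, 4]]

# axis=None: flatten first, then sort
print(np.sort(small, axis=None).tolist())  # [0, 1, 2, 3, 4, 5]

# the original array is untouched
print(small.tolist())                      # [[3, 1, 2], [0, 5, 4]]
```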

### Problem 6
<span style="color:green">Practice calling many NumPy functions. Find them by using tab completion with **`np.`**.  Use the functions that have an array as their first parameter.</span>

In [45]:
# your code here
np.max(arr)
np.sort(arr)
np.clip(a, 3, 5)

array([[5, 3, 5, 3],
       [3, 4, 3, 4],
       [3, 3, 3, 3],
       [3, 3, 3, 3],
       [4, 5, 5, 3],
       [3, 5, 3, 4],
       [3, 5, 3, 5],
       [3, 5, 3, 3],
       [4, 5, 5, 5],
       [3, 5, 4, 3]])

## Comparison operators
The six comparison operators `<`, `>`, `<=`, `>=`, `==`, and `!=` work on all elements of the array. They return an array of booleans of the same shape.

In [36]:
# find all the 6's
a == 6

array([[ True, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False,  True, False, False],
       [False, False, False, False],
       [False,  True, False, False]])

In [37]:
# find out how many 6's are rolled
np.sum(a == 6)

4

In [38]:
# find the proportion of values greater than 3
np.mean(a > 3)

0.45
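
Summing or averaging a boolean array works because `True` counts as 1 and `False` as 0. A minimal sketch on a hypothetical set of five rolls:

```python
import numpy as np

rolls = np.array([6, 2, 6, 1, 4])
is_six = rolls == 6

print(is_six.tolist())        # [True, False, True, False, False]
print(int(is_six.sum()))      # 2 -> number of True values
print(float(is_six.mean()))   # 0.4 -> proportion of True values (2 out of 5)
```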

# Use `&` and `|` for `and` and `or`
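You cannot use the Python keywords **`and`** and **`or`** for combining logical operations on entire arrays. Instead, you must use **`&`** and **`|`**. Wrap each comparison in parentheses, because **`&`** and **`|`** bind more tightly than the comparison operators.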

In [None]:
# which rolls are between 2 and 4
(a >= 2) & (a <= 4)

In [None]:
# proportion of rolls between 2 and 4; for a fair die this should be about 50%
between_2_4 = (a >= 2) & (a <= 4)
between_2_4.mean()
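
The difference between the operators and the keywords can be sketched on a hypothetical array holding one of each die face. The keyword form raises a `ValueError` because Python asks for a single truth value for the whole array.

```python
import numpy as np

rolls = np.array([1, 2, 3, 4, 5, 6])

# element-wise "and": each comparison sits in its own parentheses
between = (rolls >= 2) & (rolls <= 4)
print(between.tolist())  # [False, True, True, True, False, False]

# element-wise "or"
extreme = (rolls == 1) | (rolls == 6)
print(extreme.tolist())  # [True, False, False, False, False, True]

# the keyword version fails for arrays with more than one element
keyword_failed = False
try:
    (rolls >= 2) and (rolls <= 4)
except ValueError as err:
    keyword_failed = True
    print("and failed:", err)
```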

### Problem 7
<span style="color:green">Which column has the highest average roll?</span>

In [46]:
# your code here
# highest of the column averages
maxcolumn = np.average(arr, axis=0).max()
np.average(arr, axis=0)
# flag the column whose average equals that maximum
maxCol = (np.average(arr, axis=0) == maxcolumn)
maxCol

array([False, False, False, False,  True, False])
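
An alternative sketch, not the approach used in the cell above: `np.argmax` returns the position of the largest value directly, which avoids comparing floating-point averages for equality. The array literal is copied from the `arr` output shown in Problem 1, and the names `col_means` and `best_col` are illustrative.

```python
import numpy as np

# values copied from the arr output in Problem 1
arr = np.array([[48, 84, 39, 54, 77, 25],
                [16, 50, 24, 27, 31, 44],
                [31, 27, 59, 93, 70, 81],
                [74, 48, 51, 77, 38, 35],
                [39, 68, 12,  1, 76, 81]])

col_means = arr.mean(axis=0)          # one average per column
best_col = int(np.argmax(col_means))  # index of the largest average

print(best_col)                    # 4
print(float(col_means[best_col]))  # 58.4
```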

### Problem 8
<span style="color:green">Find the average roll for all rolls. Then find the average roll for each row. Which rows have an average that is higher than the average for all rolls?</span>

In [51]:
# your code here
print(np.average(arr))
print(np.average(arr, axis=1))

avgALL = np.average(arr)
avgEACH = np.average(arr, axis=1)
whichGTAvg = (avgEACH > avgALL)
whichGTAvg



49.333333333333336
[54.5        32.         60.16666667 53.83333333 46.16666667]


array([ True, False,  True,  True, False])
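
To turn the boolean result into row numbers, one option is `np.flatnonzero`, which returns the positions of the `True` values. This is a sketch under the assumption that row indexes are the desired answer; the array literal is copied from the `arr` output shown in Problem 1, and the names `row_means` and `above_average` are illustrative.

```python
import numpy as np

# values copied from the arr output in Problem 1
arr = np.array([[48, 84, 39, 54, 77, 25],
                [16, 50, 24, 27, 31, 44],
                [31, 27, 59, 93, 70, 81],
                [74, 48, 51, 77, 38, 35],
                [39, 68, 12,  1, 76, 81]])

row_means = arr.mean(axis=1)            # one average per row
above_average = row_means > arr.mean()  # compare each row to the overall average

print(np.flatnonzero(above_average).tolist())  # [0, 2, 3]
```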

### Resources
+ [NumPy's own tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
+ [Datacamp NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)