# Introduction to NumPy
NumPy is a powerful tool for numerical computation on large data sets. It is a Python library with many built-in functions for fast, vectorized array operations. With NumPy we can perform complex computations on entire arrays without writing Python for loops, and its arrays use memory significantly more efficiently than their pure Python counterparts.
In this Jupyter notebook I will introduce the basic ways of creating NumPy arrays, then cover some common operations and how we can use them for efficient performance.

In [1]:
import numpy as np

## How fast is it?

Before jumping into the details, let's first convince ourselves with a little test that NumPy really is an effective library for numerical computation on large data. We will perform the same task in two ways, once with a NumPy array and once with its pure Python counterpart, and compare how long each one takes to complete.

In [2]:
array = np.arange(10000000)
list_1 = [x for x in range(10000000)]

We have created an array and a list, both containing 10000000 sequential integers (0 through 9999999). We will do a simple task: squaring every one of those integers.

In [3]:
%time array = array ** 2

Wall time: 35 ms


Now let's see how much time it takes if we use a Python for loop (written here as a list comprehension).

In [4]:
%time list_2 = [x ** 2 for x in list_1]

Wall time: 5.72 s


You can see it for yourself. In this run the Python for loop is more than 100 times slower than NumPy's array operation (the exact ratio will differ between machines).
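
The %time magic only works inside Jupyter/IPython. As a rough sketch of how the same comparison could be reproduced in a plain Python script, the standard library's timeit module can be used. The smaller size n and the explicit dtype=np.int64 are my own assumptions, not part of the test above: on platforms where NumPy's default integer is 32-bit, squaring large values would silently overflow, so asking for 64-bit integers keeps the results correct.

```python
import timeit

import numpy as np

n = 1_000_000  # assumption: smaller than the notebook's 10,000,000 so the script finishes quickly

# int64 is requested explicitly so that the squares cannot overflow
# on platforms where the default integer type is 32-bit.
arr = np.arange(n, dtype=np.int64)
lst = list(range(n))

# Compute the squares both ways and confirm they agree.
squared_np = arr ** 2
squared_list = [x ** 2 for x in lst]
same = np.array_equal(squared_np, np.array(squared_list, dtype=np.int64))

# Time each approach three times and keep the best run.
t_numpy = min(timeit.repeat(lambda: arr ** 2, number=1, repeat=3))
t_list = min(timeit.repeat(lambda: [x ** 2 for x in lst], number=1, repeat=3))

print(f"results match: {same}")
print(f"NumPy: {t_numpy:.4f} s, list comprehension: {t_list:.4f} s")
```

The absolute numbers will vary from machine to machine; only the ratio between the two timings is of interest.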

## Creating NumPy arrays
NumPy arrays can be one-dimensional or multidimensional.
Let's first see how we can create a NumPy array from a Python list. We can easily do that with NumPy's built-in 'array()' function.

In [5]:
# Creating one dimensional NP array
list_1 = [1,2,3]
array_1 = np.array(list_1)

# Creating multi dimensional NP array
list_2 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
array_2 = np.array(list_2)

In [6]:
array_1

array([1, 2, 3])

In [7]:
array_2

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

The array() function accepts any sequence-like object (including other arrays). When creating a multidimensional array from a list of lists, the inner lists need to be of equal length. Otherwise NumPy cannot build a regular multidimensional array: older versions create a one-dimensional array with data type "object", while recent versions (1.24 and later) raise a ValueError unless dtype=object is passed explicitly. Try that yourself for better understanding.
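
Here is a small sketch of what happens with inner lists of unequal length. The variable names (ragged, outcome, obj_array) are only illustrative. Because the behaviour without an explicit dtype depends on the installed NumPy version, the first attempt is wrapped in a try/except, and warnings are silenced so that older versions stay quiet.

```python
import warnings

import numpy as np

ragged = [[1, 2, 3], [4, 5]]  # inner lists of unequal length

# Without an explicit dtype the outcome depends on the NumPy version.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        np.array(ragged)
        outcome = "created an object array (older NumPy)"
    except ValueError:
        outcome = "raised ValueError (newer NumPy)"
print(outcome)

# Asking for dtype=object explicitly gives a 1D array whose
# two elements are the original Python lists.
obj_array = np.array(ragged, dtype=object)
print(obj_array.shape, obj_array.dtype)  # (2,) object
```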

If we want to create an array by automatically generating a sequence of values, we can do that with NumPy's "arange()" function.

In [8]:
# arange([start,] stop[, step], dtype=None) -- start and step are optional
array_3 = np.arange(5) # One dimensional
print("array 3: \n{0}".format(array_3)) # Printing the array

array_4 = np.arange(2,5) # One dimensional with specified start
print("\narray 4: \n{0}".format(array_4)) # Printing the array

array_5 = np.arange(2,10,2) # One dimensional with specified start and steps
print("\narray 5: \n{0}".format(array_5)) # Printing the array

array 3: 
[0 1 2 3 4]

array 4: 
[2 3 4]

array 5: 
[2 4 6 8]


Sometimes we might need to create arrays with only ones or zeroes. There are built in functions for it.

In [9]:
array_6 = np.zeros((2,3))
array_6

array([[0., 0., 0.],
       [0., 0., 0.]])

In [10]:
array_7 = np.ones((2,3))
array_7

array([[1., 1., 1.],
       [1., 1., 1.]])

The 1s and 0s are floats by default. If we pass 'int' as the second argument (the dtype), as in the following example, the 1s and 0s will be integers.

In [11]:
array_6 = np.zeros((2,3),'int')
array_6

array([[0, 0, 0],
       [0, 0, 0]])

In [12]:
array_7 = np.ones((2,3),'int')
array_7

array([[1, 1, 1],
       [1, 1, 1]])

There is also a built-in function for creating an identity matrix. For those who are not familiar with it, an identity matrix is a square matrix that has 1s on its main diagonal and 0s everywhere else.

In [13]:
identity = np.identity(3)
identity

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

To create a NumPy array with random data, we can use "random.randn()" (floats drawn from the standard normal distribution) or "random.randint()" (integers from a given range, with the upper bound excluded).

In [14]:
array_8 = np.random.randn(2,3)
array_8

array([[-0.7679555 , -0.70026664, -1.50985308],
       [ 1.86321636,  1.04423447,  1.3670779 ]])

In [15]:
array_9 = np.random.randint(low = 1, high = 10, size = (3,3))
array_9

array([[7, 9, 2],
       [9, 4, 1],
       [2, 1, 9]])
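
Random arrays like the ones above change on every run. If reproducible numbers are needed, one option is NumPy's newer Generator interface, created with np.random.default_rng(). This is a sketch of an alternative to the randn()/randint() calls shown above, not a replacement the notebook depends on, and the seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# Floats drawn from the standard normal distribution, shape (2, 3).
normal_sample = rng.standard_normal((2, 3))

# Integers from 1 up to (but not including) 10, shape (3, 3).
int_sample = rng.integers(low=1, high=10, size=(3, 3))

# A second generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=42)
repeat_sample = rng_again.standard_normal((2, 3))

print(np.array_equal(normal_sample, repeat_sample))  # True
```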

That's all for now. One thing to remember: the data type of an array is not limited to float or integer. Arrays can also hold complex numbers, booleans, strings and other Python objects.

In [16]:
array_1>2

array([False, False,  True])

We can easily convert between data types if the values are convertible.

In [17]:
# Creating an array with strings as data type
string_array = np.array(['1','2','2.5','3.5'])
string_array

array(['1', '2', '2.5', '3.5'], dtype='<U3')

In [18]:
# Now converting its data type to float numbers 
numeric_array = string_array.astype('float')
numeric_array

array([1. , 2. , 2.5, 3.5])

In [19]:
numeric_array.dtype

dtype('float64')
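
A few details of astype() are worth seeing in action. The sketch below uses made-up values to show that astype() returns a new array and leaves the original untouched, that converting floats to integers drops the fractional part, and that a string which does not look like a number cannot be converted.

```python
import numpy as np

string_array = np.array(['1', '2', '2.5', '3.5'])
numeric_array = string_array.astype(float)

# astype returns a new array; the original keeps its string dtype.
print(string_array.dtype.kind)  # U (a unicode string type)
print(numeric_array.dtype)      # float64

# Converting floats to integers truncates toward zero.
truncated = np.array([1.7, -1.7]).astype(int)  # values 1 and -1

# A value that cannot be parsed as a number raises ValueError.
try:
    np.array(['1', 'abc']).astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)  # True
```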

## Attributes

Now that we know how to create NumPy arrays, let's take a look at how we can inspect the attributes of an array. "Shape" is the size of each dimension, i.e. the number of rows and the number of columns for a 2D array. The dimension is the number of axes the array has. And "Size" is the total number of elements the array contains. To get the shape of an array, we use the "shape" attribute. For the number of dimensions, we use "ndim". For the data type, we use the "dtype" attribute. And for the total number of elements in the array, i.e. its size, we use the "size" attribute.

In [20]:
array_2.shape

(4, 3)

In [21]:
array_2.ndim

2

In [22]:
array_2.dtype

dtype('int32')

In [23]:
array_2.size

12

You can change the shape of an array. For that, you can use the built-in resize() method, which modifies the array in place.

In [24]:
array_2.resize(2,6)
array_2.shape

(2, 6)
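
Besides resize(), NumPy arrays also have a reshape() method. As a minimal sketch of the difference (with illustrative variable names): reshape() returns an array with the new shape and leaves the original alone, while resize() changes the array itself and returns nothing.

```python
import numpy as np

a = np.arange(12)

b = a.reshape(3, 4)   # new shape; a itself is unchanged
c = a.reshape(2, -1)  # -1 lets NumPy infer the missing dimension (6 here)

print(a.shape, b.shape, c.shape)  # (12,) (3, 4) (2, 6)
print(np.shares_memory(a, b))     # True: b is a view onto a's data

d = np.arange(12)
returned = d.resize(2, 6)         # resize works in place and returns None
print(d.shape, returned)          # (2, 6) None
```

Because the reshaped array shares memory with the original here, writing to b would also change a.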

## Indexing and Slicing

### 1D array

Indexing and slicing of one dimensional arrays are simple.

In [25]:
# Creating a new array first for demonstration
base_array = np.arange(10)
base_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [26]:
# Calling the 6th element. Remember, indexing in python starts from 0.
base_array[5]

5

In [27]:
#slicing
sliced_array = base_array[2:6] # The upper bound is excluded
sliced_array

array([2, 3, 4, 5])

One thing to note here is that a slice is just a 'view' of the original data, not a new copy. So if we change anything in the sliced array, the main array is mutated too. Let's see.

In [28]:
sliced_array[0] = 20
sliced_array[1] = 30

In [29]:
sliced_array

array([20, 30,  4,  5])

In [30]:
base_array

array([ 0,  1, 20, 30,  4,  5,  6,  7,  8,  9])

In [31]:
sliced_array[:] = 100
sliced_array

array([100, 100, 100, 100])

In [32]:
base_array

array([  0,   1, 100, 100, 100, 100,   6,   7,   8,   9])

To create a copy, we have to use the "copy()" method. Then we can change the copy without mutating the main array.

In [33]:
base_array = np.arange(5)
copied_array = base_array.copy()

copied_array[1] = 100
print(base_array)
print(copied_array)

[0 1 2 3 4]
[  0 100   2   3   4]
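
If you are ever unsure whether two arrays share the same data, np.shares_memory() can tell you. The following sketch, with illustrative variable names, checks a slice and an explicit copy of the same array.

```python
import numpy as np

base = np.arange(10)
view = base[2:6]                # a basic slice is a view
copied = base[2:6].copy()       # an explicit, independent copy

print(np.shares_memory(base, view))    # True
print(np.shares_memory(base, copied))  # False
print(view.base is base)               # True: the view points back to base

# Changing the copy leaves the original untouched.
copied[:] = -1
unchanged = base[2:6].tolist()
print(unchanged)  # [2, 3, 4, 5]
```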


### Multi-dimensional array

Indexing a multidimensional array offers options. We can get an individual element by chaining index operations (indexing the row first, then the element within it), or we can just pass a comma-separated list of indices.

In [34]:
# Creating a 2D array first
base_array = np.array([[1,2,3],[4,5,6],[7,8,9]])
base_array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [35]:
# Getting the first row
base_array[0]

array([1, 2, 3])

In [36]:
# Getting the element from the first column of the first row
base_array[0][0]

1

In [37]:
# or we can pass the indices separated by comma
base_array[0,0]

1

In [38]:
base_array[0,2]


3

You might have guessed already how to slice a 2d np array.

In [39]:
base_array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [40]:
base_array[:2] # selecting every row up to (and including) the 2nd row


array([[1, 2, 3],
       [4, 5, 6]])

In [41]:
base_array[1:] #selecting every row after the first one

array([[4, 5, 6],
       [7, 8, 9]])

In [42]:
base_array[:,:2]
# selecting every column up to the 2nd one; the leading ':' means 'take all rows'

array([[1, 2],
       [4, 5],
       [7, 8]])

In [43]:
base_array[:2,:2] # take all rows up to the 2nd one and all columns up to the 2nd one

array([[1, 2],
       [4, 5]])

In [44]:
base_array[:2,1:] # take all rows up to the 2nd one and all columns from the 2nd one to the last

array([[2, 3],
       [5, 6]])
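
Two details of 2D slicing are easy to miss, so here is a short sketch of them using the same 3x3 array. First, indexing with a single integer drops that axis, while a slice of length one keeps it. Second, a 2D slice is a view just like a 1D slice, so writing to it changes the original array. The names col_1d, col_2d and block are only illustrative.

```python
import numpy as np

base_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_1d = base_array[:, 1]    # integer index drops the column axis
col_2d = base_array[:, 1:2]  # a slice keeps it as an axis of length 1
print(col_1d.shape, col_2d.shape)  # (3,) (3, 1)

# A 2D slice is a view: writing to it changes base_array as well.
block = base_array[:2, :2]
block[0, 0] = 100
print(base_array[0, 0])  # 100
```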

### Boolean Indexing

There is another very interesting type of indexing called boolean indexing. We can pass an array of boolean values as the index, and it will give us only the rows for which the boolean value is 'True'. Let's take a look at the following example.

In [45]:
# Creating a boolean array first
boolean_array = np.array([True, False, True])

In [46]:
# Using boolean indexing
base_array[boolean_array]

array([[1, 2, 3],
       [7, 8, 9]])

This boolean indexing can be used in very creative ways. Let's see that in a simple example.

Suppose we have four different things to mix (sugar, salt, spice, water or whatever). We tried 10 different combinations. Some proved to be a good mix, some a bad mix. We stored the combinations in an array: each column represents an ingredient and each row is a different mix. And we stored the results (whether each mix is good or bad) in another array.

In [47]:
combinations = np.random.randint(low = 1, high = 10, size = (10,4)) # Randomly creating an array for demonstration
combinations

array([[6, 6, 8, 3],
       [9, 8, 2, 7],
       [3, 5, 4, 2],
       [6, 7, 6, 2],
       [9, 2, 6, 4],
       [7, 7, 7, 8],
       [7, 9, 3, 7],
       [4, 7, 5, 6],
       [8, 6, 6, 1],
       [6, 8, 5, 3]])

In [48]:
results = np.array(['g','g','vg','g','b','b','b','vg','b','b']) 
# Randomly created the result, assume g = good, vg = very good, b = bad

Now suppose we want to find the mixes that are very good, the ones that are just good (designated by 'g' in the results), or every mix that is anything but bad. Here is how we can use boolean indexing to do this.

In [49]:
# First see how we can create a boolean array very easily
very_good = results == 'vg'

In [50]:
combinations[very_good]

array([[3, 5, 4, 2],
       [4, 7, 5, 6]])

In [51]:
combinations[results=='g']

array([[6, 6, 8, 3],
       [9, 8, 2, 7],
       [6, 7, 6, 2]])

In [52]:
condition = results=='b'
combinations[~condition] # using ~ before a condition inverts it, yielding every row except those matching the condition

array([[6, 6, 8, 3],
       [9, 8, 2, 7],
       [3, 5, 4, 2],
       [6, 7, 6, 2],
       [4, 7, 5, 6]])

We can reset the values, if we want, using the boolean indexing.

In [53]:
combinations[~condition] = 0
combinations

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [9, 2, 6, 4],
       [7, 7, 7, 8],
       [7, 9, 3, 7],
       [0, 0, 0, 0],
       [8, 6, 6, 1],
       [6, 8, 5, 3]])

In boolean indexing, we have to make sure that the length of the boolean index matches the length of the axis it is indexing.
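
Boolean conditions can also be combined. The sketch below picks the "good or very good" mixes in one step using the | (or) operator, and shows np.isin() as an equivalent shortcut. The mixes array is stand-in data I made up so that the example is deterministic. The last part shows what happens when the boolean index has the wrong length.

```python
import numpy as np

results = np.array(['g', 'g', 'vg', 'g', 'b', 'b', 'b', 'vg', 'b', 'b'])
mixes = np.arange(40).reshape(10, 4)  # stand-in data, one row per mix

# Each comparison needs its own parentheses when combined with | or &.
good_or_very_good = (results == 'g') | (results == 'vg')

# np.isin builds the same mask from a list of accepted values.
same_mask = np.isin(results, ['g', 'vg'])
print(np.array_equal(good_or_very_good, same_mask))  # True

selected = mixes[good_or_very_good]
print(selected.shape)  # (5, 4)

# A boolean index of the wrong length raises IndexError.
try:
    mixes[np.array([True, False, True])]
    mismatch_error = False
except IndexError:
    mismatch_error = True
print(mismatch_error)  # True
```

Python's plain `or` and `and` keywords do not work element-wise on arrays, which is why | and & are used here.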

## Operations

### Basic arithmetic operations

Here I will cover some basic operations like addition, subtraction, multiplication and division, along with some basic matrix operations. There are many other advanced operations available in NumPy, but they cannot all be covered in this "basics" notebook. I will mention some of those functions and encourage you to try them out yourself.

In [54]:
print(base_array)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [55]:
# add
base_array + base_array

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [56]:
# Subtract
base_array - base_array

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [57]:
# Multiplication with a scalar value
base_array * 10

array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [58]:
# Division
base_array / 10

array([[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6],
       [0.7, 0.8, 0.9]])

In [59]:
# Adding a scalar value to every element
base_array + 10

array([[11, 12, 13],
       [14, 15, 16],
       [17, 18, 19]])

In the last addition operation, we added 10 to every element without needing a for loop. This is one of the reasons why we want to use NumPy. Doing such operations without for loops is significantly more efficient on large datasets, as we discussed at the beginning.
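
To make clear what the vectorized addition replaces, here is a sketch that computes the same result with explicit nested for loops and checks that both versions agree. The looped version is written only for comparison; it is not how you would normally do this in NumPy.

```python
import numpy as np

base_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Vectorized: one expression, no Python-level loop.
vectorized = base_array + 10

# Equivalent explicit loops over rows and columns.
looped = np.empty_like(base_array)
for i in range(base_array.shape[0]):
    for j in range(base_array.shape[1]):
        looped[i, j] = base_array[i, j] + 10

print(np.array_equal(vectorized, looped))  # True
```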

### Multiplication of Arrays

Moving on to multiplying one array by another. Multiplication using the * symbol multiplies each element with the corresponding element of the other array.

In [60]:
base_array * base_array

array([[ 1,  4,  9],
       [16, 25, 36],
       [49, 64, 81]])

But if we want to perform matrix multiplication, we can use numpy's dot() function. (For those who are not familiar with matrix multiplication, it is a little bit complex. I do not want to make this notebook any longer with the explanation of how matrix multiplication is done mathematically. Here is a link to help you with that: https://www.mathsisfun.com/algebra/matrix-multiplying.html)

In [61]:
base_array.dot(base_array)
# np.dot(array, array), alternate syntax

array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])
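
For 2D arrays, the @ operator and np.matmul() give the same result as dot(). The sketch below compares the three spellings on the same array; which one to use is mostly a matter of taste.

```python
import numpy as np

base_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

via_dot = base_array.dot(base_array)
via_operator = base_array @ base_array           # matrix multiplication operator
via_matmul = np.matmul(base_array, base_array)

print(np.array_equal(via_dot, via_operator))  # True
print(np.array_equal(via_dot, via_matmul))    # True
```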

### Matrix Transpose

We can transpose an array (exchanging its rows and columns) by using .T after the array.

In [62]:
base_array.T

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

For more complex multidimensional arrays, we can use the "transpose()" function and pass a tuple of the axes in the order we want. For example, array.transpose((1,2,0)) will make the second axis first, the third axis second and the first axis last. (Remember Python indexing: the first axis is indexed 0, the second is 1, the third is 2.)

In [63]:
array = np.arange(16).reshape((2, 2, 4))
array

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [64]:
array.transpose((1, 2, 0))

array([[[ 0,  8],
        [ 1,  9],
        [ 2, 10],
        [ 3, 11]],

       [[ 4, 12],
        [ 5, 13],
        [ 6, 14],
        [ 7, 15]]])
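
One way to convince yourself of what transpose((1, 2, 0)) does is to look at the shapes and at a single element. In the sketch below, the element at position [i, j, k] of the transposed array is the element at [k, i, j] of the original. The name transposed is only illustrative.

```python
import numpy as np

array = np.arange(16).reshape((2, 2, 4))
transposed = array.transpose((1, 2, 0))

# The axes are reordered, so the shape is reordered the same way.
print(array.shape, transposed.shape)  # (2, 2, 4) (2, 4, 2)

# transposed[i, j, k] corresponds to array[k, i, j].
print(transposed[1, 3, 0], array[0, 1, 3])  # 7 7

# .T on a 3D array simply reverses the order of the axes.
print(array.T.shape)  # (4, 2, 2)
```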

## Applying Functions

There are some universal functions which help us perform fast element-wise operations. For example, np.sqrt() computes the square root of each element of the array, and np.exp() returns an array with the exponential of each element.

In [65]:
np.sqrt(base_array)

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974],
       [2.64575131, 2.82842712, 3.        ]])

In [66]:
np.exp(base_array)

array([[2.71828183e+00, 7.38905610e+00, 2.00855369e+01],
       [5.45981500e+01, 1.48413159e+02, 4.03428793e+02],
       [1.09663316e+03, 2.98095799e+03, 8.10308393e+03]])

Some other universal functions are: 
<ul>
    <li>abs, fabs (Compute the absolute value element-wise for integer, floating-point, or complex values)</li> 
    <li>sqrt (Compute the square root of each element)</li>
    <li>square (Compute the square of each element)</li>
    <li>exp (Compute the exponential e^x of each element)</li>
    <li>log, log10, log2, log1p (Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively)</li>
    <li>sign (Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative))</li>
    <li>ceil (Compute the ceiling of each element (i.e., the smallest integer greater than or equal to that number))</li>
    <li>floor (Compute the floor of each element (i.e., the largest integer less than or equal to each element))</li>
    <li>rint (Round elements to the nearest integer, preserving the dtype)</li>
    <li>modf (Return the fractional and integral parts of the array as two separate arrays)</li>
    <li>isnan (Return boolean array indicating whether each value is NaN (Not a Number))</li>
    <li>isfinite, isinf (Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively)</li>
    <li>cos, cosh, sin, sinh, tan, tanh (Regular and hyperbolic trigonometric functions)</li>
    <li>arccos, arccosh, arcsin, arcsinh, arctan, arctanh (Inverse trigonometric and inverse hyperbolic functions)</li>
</ul>
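
A few of the functions from this list can be tried on a small array of made-up values. The sketch below shows sign, ceil, floor, rint and modf; the expected values are given in the comments.

```python
import numpy as np

x = np.array([-1.5, 0.0, 2.25])

signs = np.sign(x)       # -1, 0, 1
ceilings = np.ceil(x)    # -1, 0, 3
floors = np.floor(x)     # -2, 0, 2
rounded = np.rint(x)     # -2, 0, 2 (values exactly halfway round to the even integer)

# modf returns two arrays: the fractional parts and the integral parts.
fractional, integral = np.modf(x)  # (-0.5, 0, 0.25) and (-1, 0, 2)

print(signs, ceilings, floors, rounded)
print(fractional, integral)
```

Note that all of these return floating-point arrays here, because the input array holds floats.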

There are some mathematical functions to evaluate statistics of an array. For example, np.mean() returns the mean value of the given array.

In [67]:
base_array.mean()
# np.mean(array)

5.0

In [68]:
base_array.max()

9

In [69]:
base_array.sum()

45

In [70]:
base_array.cumsum() # cumulative sum

array([ 1,  3,  6, 10, 15, 21, 28, 36, 45], dtype=int32)

Some other mathematical functions are:
<ul>
    <li>std, var (Standard deviation and variance, respectively)</li>
    <li>min, max (Minimum and maximum)</li>
    <li>argmin, argmax (Indices of minimum and maximum elements, respectively)</li>
    <li>cumsum (Cumulative sum of elements starting from 0)</li>
    <li>cumprod (Cumulative product of elements starting from 1)</li>
</ul>
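
Most of these functions also accept an axis argument, which the examples above did not use. As a short sketch on the same 3x3 array: axis=0 works down the rows and gives one result per column, while axis=1 works across the columns and gives one result per row.

```python
import numpy as np

base_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_sums = base_array.sum(axis=0)        # one sum per column: 12, 15, 18
row_sums = base_array.sum(axis=1)        # one sum per row: 6, 15, 24
col_means = base_array.mean(axis=0)      # 4.0, 5.0, 6.0
row_cumsum = base_array.cumsum(axis=1)   # running total within each row

# Without an axis, argmax returns the index into the flattened array.
flat_argmax = base_array.argmax()        # 8, the position of the value 9

print(col_sums, row_sums, col_means, flat_argmax)
print(row_cumsum)
```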

## Conclusion

I hope this notebook gives you a basic idea of NumPy and prepares you to get started. I recommend that you start using NumPy with the basics provided in this notebook. At the initial stage, you might always think first of applying a for loop to your arrays whenever you need to do something; at least that's what I did. But remember, if you do so, you miss the advantage of vectorized operations. No worries, though. You will build up the habit of using vectorized operations if you keep practicing. When you get stuck on a problem or need to perform an operation that is not listed in this notebook, simply do a Google search or search Stack Overflow. There is a good chance that NumPy has a vectorized way of executing your operation without a for loop, and that will usually be the better alternative performance-wise.