# Numpy Introduction

Numpy is a library that you will often use alongside pandas in data science work. The two serve many similar purposes, but numpy has features and abilities that complement pandas. For example, many linear algebra operations are easy to do with numpy. These are not operations you want to do with native python data types because they are much slower. Numpy's core operations run in optimized, compiled code under the hood, so they often take a fraction of the time that regular python takes.
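
As a rough illustration of the speed difference, the sketch below times doubling every element with a python list comprehension against the same operation on a numpy array. The array size, the repeat count, and the names `py_list` and `np_arr` are arbitrary choices made for this illustration, and the exact timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000  # arbitrary size chosen for illustration
py_list = list(range(n))
np_arr = np.arange(n)  # numpy array holding 0, 1, ..., n-1

# Double every element with a pure python list comprehension
py_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=5)

# Double every element with a single numpy operation
np_time = timeit.timeit(lambda: np_arr * 2, number=5)

print(f"Python list: {py_time:.4f} seconds")
print(f"Numpy array: {np_time:.4f} seconds")
```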

## Numpy Arrays

The basic data type in numpy is the array, which is similar to a python list but comes with a rich set of numpy features. To create a numpy array, call np.array() and pass in your list of data.

In [1]:
import numpy as np

#Create a basic numpy array
ar1 = np.array([1, 2, 3])

print(ar1)

[1 2 3]
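
One quick way to see how an array differs from a list is to multiply each by 2. This is a small sketch; the names `py_list` and `np_arr` are illustrative and not part of the example above.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array(py_list)

# Multiplying a list repeats its contents
print(py_list * 2)    # [1, 2, 3, 1, 2, 3]

# Multiplying an array multiplies each element
print(np_arr * 2)     # [2 4 6]

# Arrays also come with built-in methods such as sum and mean
print(np_arr.sum())   # 6
print(np_arr.mean())  # 2.0
```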


Indexing a one dimensional array works the same in numpy as with basic python lists. For example, the following gets the first element and then the first two elements.

In [2]:
#Basic indexing
print(ar1[0])
print(ar1[:2])

1
[1 2]
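
Other list-style indexing carries over as well, such as negative indices and slices with a step. A short sketch, using a hypothetical five element array `ar` so the slices are easier to see:

```python
import numpy as np

ar = np.array([10, 20, 30, 40, 50])

print(ar[-1])    # last element: 50
print(ar[1:4])   # elements at positions 1, 2 and 3: [20 30 40]
print(ar[::2])   # every second element: [10 30 50]
```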


## Extending to 2 Dimensions

One of the biggest differences you will see with numpy is its support for higher dimension arrays. If we give it a nested list, we will get back a 2 dimensional array. So first create the following array.

In [3]:
#Create a two dimensional array
ar1 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9],
               [10, 11, 12]])
print(ar1)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


The shape attribute returns the dimensions of the array as a tuple. In the case of the 2d array, it gives (number of rows, number of columns).

In [4]:
#Print the shape
print(ar1.shape)

(4, 3)
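
Because shape is a tuple, it can be unpacked into separate variables. The sketch below also shows two related attributes, `ndim` and `size`; the variable names `n_rows` and `n_cols` are illustrative.

```python
import numpy as np

ar1 = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

# Unpack the shape tuple into rows and columns
n_rows, n_cols = ar1.shape
print(n_rows, n_cols)   # 4 3

print(ar1.ndim)   # number of dimensions: 2
print(ar1.size)   # total number of elements: 12
```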


If you use the indexing from before, you will get the first row, and the first two rows. You can try it out to see.

In [5]:
#Basic indexing
print(ar1[0])
print()
print(ar1[:2])

[1 2 3]

[[1 2 3]
 [4 5 6]]


One thing that you can't do with basic lists is to index along the columns instead. If you give ":" for the rows (meaning return all rows), then you can use the indexing for the columns to get back the first column as well as the first two columns. Notice how the shapes look for each! The first one will give back a one dimensional array.

In [6]:
#Basic indexing for columns
print(ar1[:,0])
print()
print(ar1[:,:2])

[ 1  4  7 10]

[[ 1  2]
 [ 4  5]
 [ 7  8]
 [10 11]]
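
To make the shape difference concrete, the sketch below prints the shape of each result and shows, for comparison, the loop a nested python list would need to pull out a column. The names `nested` and `first_col` are illustrative.

```python
import numpy as np

ar1 = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

# An integer index drops that dimension, a slice keeps it
print(ar1[:, 0].shape)    # (4,)
print(ar1[:, :2].shape)   # (4, 2)

# With a nested list, the first column needs an explicit loop
nested = ar1.tolist()
first_col = [row[0] for row in nested]
print(first_col)          # [1, 4, 7, 10]
```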


Numpy arrays also support arithmetic operations between arrays. Adding two numpy arrays of the same shape adds them element by element. For example, the following code creates a second array, then adds the first and second arrays.

In [7]:
#Create array 2
ar2 = ar1.copy() * 2

#Print both arrays
print("Array 1:")
print(ar1)
print()

print("Array 2:")
print(ar2)
print()

#Add array 1 and array 2
ar3 = ar1 + ar2
print("Array 3 (Array 1 + Array 2):")
print(ar3)

Array 1:
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

Array 2:
[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]
 [20 22 24]]

Array 3 (Array 1 + Array 2):
[[ 3  6  9]
 [12 15 18]
 [21 24 27]
 [30 33 36]]
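
Addition is not the only operation that works element by element. The sketch below, using two small illustrative arrays `a` and `b`, shows subtraction, multiplication, and adding a single number. It also shows the `@` operator, which performs matrix multiplication instead, since that is easy to confuse with `*`.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

print(b - a)     # element-wise subtraction
print(a * b)     # element-wise multiplication
print(a + 100)   # a single number is applied to every element

# Matrix multiplication uses @ rather than *
print(a @ b)
```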


## Moving to 3+ Dimensions

Numpy also supports arrays with more than two dimensions. Let's start with a basic example. Suppose you sell two products in two stores, which leads to a 2x2 array of sales information. The two columns denote product 1 and product 2, and the two rows denote store 1 and store 2.

In [8]:
#Create the sales array

sales = np.array([[100, 200],
                 [50, 100]])
print(sales)

[[100 200]
 [ 50 100]]


We can index into this array to find different things...

In [9]:
print("Sales for product 1: ")
print(sales[:,0])
print()
print("Sales for product 2: ")
print(sales[:,1])
print()

print("Sales at store 1: ")
print(sales[0])
print()

print("Sales at store 2: ")
print(sales[1])
print()

Sales for product 1: 
[100  50]

Sales for product 2: 
[200 100]

Sales at store 1: 
[100 200]

Sales at store 2: 
[ 50 100]
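
A single value can be picked out by giving both a row index and a column index. A minimal sketch using the same 2x2 sales array:

```python
import numpy as np

sales = np.array([[100, 200],
                  [50, 100]])

# Row index first (store), then column index (product)
print(sales[1, 0])   # store 2, product 1: 50
print(sales[0, 1])   # store 1, product 2: 200
```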



At this point, you may be thinking that this would be much easier with a pandas dataframe, which has labels. Here is the twist though: what if we have a third dimension, the day? We saw one way of representing this with pandas, but it may not always be the most efficient way to store the data. Numpy allows us to have a third dimension (or more). All we need to do is add a third level of nesting and numpy will take care of the rest. To make it as clear as possible how this works, each day's data gets its own named list. First, the three lists of sales data need to be built.

In [10]:
#Create sales data

sales_day1 = [[100, 200],
              [50, 100]]

sales_day2 = [[140, 300],
              [55, 40]]

sales_day3 = [[21, 33],
              [43, 53]]

We can combine these three lists into one larger list to hold them.

In [11]:
#Create the larger list object
sales = [sales_day1, sales_day2, sales_day3]
print(sales)

[[[100, 200], [50, 100]], [[140, 300], [55, 40]], [[21, 33], [43, 53]]]


Finally, this can be converted to a numpy array by passing it to np.array().

In [12]:
#Convert to a numpy array
sales = np.array(sales)
print(sales)

[[[100 200]
  [ 50 100]]

 [[140 300]
  [ 55  40]]

 [[ 21  33]
  [ 43  53]]]
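
Checking the shape confirms how the three levels of nesting map onto dimensions. In the sketch below, the names `n_days`, `n_stores` and `n_products` are illustrative labels for the three axes.

```python
import numpy as np

sales_day1 = [[100, 200], [50, 100]]
sales_day2 = [[140, 300], [55, 40]]
sales_day3 = [[21, 33], [43, 53]]

sales = np.array([sales_day1, sales_day2, sales_day3])

# Axis 0 is the day, axis 1 is the store, axis 2 is the product
n_days, n_stores, n_products = sales.shape
print(sales.shape)   # (3, 2, 2)
print(sales.ndim)    # 3
```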


With our indexing, we can pick specific 2x2 matrices by passing the first index, so to get the second array...

In [13]:
#Get the second array
print(sales[1])

[[140 300]
 [ 55  40]]


Here is where we really get the benefits of numpy, however. What if we wanted to quickly see all sales for product 1 at all stores over all time frames? We can now index along the different dimensions. Our first index is going to be ":" because we want all the days, the second will be ":" because we want all the stores, and then our final index will be "0" because we want to get back only product 1.

In [14]:
print(sales[:,:,0])

[[100  50]
 [140  55]
 [ 21  43]]
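
The sketch below spells out the day, store, product ordering of the indices. It also uses `...` (Ellipsis), which numpy accepts as shorthand for as many ":" as are needed. The name `product1` is illustrative.

```python
import numpy as np

sales = np.array([[[100, 200], [50, 100]],
                  [[140, 300], [55, 40]],
                  [[21, 33], [43, 53]]])

# Index order is day, store, product
product1 = sales[:, :, 0]
print(product1.shape)    # (3, 2): three days by two stores

# A single value: day 3, store 2, product 1
print(sales[2, 1, 0])    # 43

# "..." stands in for all the leading ":" indices
print(np.array_equal(sales[..., 0], product1))   # True
```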


In a similar way, we could get the product 1 sales for only the first and second days by changing the first index to the slice ":2".

In [15]:
print(sales[:2,:,0])

[[100  50]
 [140  55]]


Let's say now that we also wanted to limit this to the first store only. We could switch the second index to 0 as well. Notice how this changes the shape of the array.

In [16]:
print(sales[:2,0,0])

[100 140]


If you needed to preserve the two dimensional shape of the array, one method would be to use the slice ":1" for the second index instead of 0, as shown below.

In [17]:
print(sales[:2,:1,0])

[[100]
 [140]]
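
Printing the shapes makes the difference between an integer index and a length-one slice explicit. The sketch below also shows one alternative, `np.newaxis`, which adds a dimension back after indexing. The names `by_index`, `by_slice` and `with_axis` are illustrative.

```python
import numpy as np

sales = np.array([[[100, 200], [50, 100]],
                  [[140, 300], [55, 40]],
                  [[21, 33], [43, 53]]])

by_index = sales[:2, 0, 0]    # integer index drops the store dimension
by_slice = sales[:2, :1, 0]   # length-one slice keeps it

print(by_index.shape)   # (2,)
print(by_slice.shape)   # (2, 1)

# np.newaxis inserts a new dimension of length one
with_axis = by_index[:, np.newaxis]
print(with_axis.shape)                       # (2, 1)
print(np.array_equal(with_axis, by_slice))   # True
```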


There is, however, built-in functionality that can handle any sort of reshaping you want to do.

## Reshaping Arrays

There are many different ways to change the dimensions of an array. To begin with, there is flatten, which takes any array and turns it into a one dimensional array. For example, look at what happens to the sales array. You can compare where the numbers sit in the flattened output versus the original array.

In [18]:
#Print the flattened array
print(sales.flatten())

[100 200  50 100 140 300  55  40  21  33  43  53]


In [19]:
#Compare to the original array
print(sales)

[[[100 200]
  [ 50 100]]

 [[140 300]
  [ 55  40]]

 [[ 21  33]
  [ 43  53]]]
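
One detail worth knowing is that flatten returns a copy of the data, so changing the flattened result leaves the original array alone. A minimal sketch, where `flat` is an illustrative name:

```python
import numpy as np

sales = np.array([[[100, 200], [50, 100]],
                  [[140, 300], [55, 40]],
                  [[21, 33], [43, 53]]])

flat = sales.flatten()
print(flat.shape)       # (12,)

# Changing the flattened copy does not touch the original
flat[0] = -1
print(flat[0])          # -1
print(sales[0, 0, 0])   # 100
```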


Reshaping is a way to transform between different dimensions. Conceptually, the array is first flattened to a one dimensional array, and then those values fill in an array of the specified shape. Take this six element 1d array to begin with.

In [20]:
#Create the data
ar = np.array([1, 2, 3, 4, 5, 6])
print(ar)

[1 2 3 4 5 6]


To use reshape, you pass the dimensions you want, either as a tuple or as separate integer arguments. The product of these dimensions must equal the total number of elements in the array. For example, the shape 3x2 works because there are 6 total elements. This will return an array with 3 rows and 2 columns.

In [21]:
#Reshape the array
print(ar.reshape((3,2)))

[[1 2]
 [3 4]
 [5 6]]


Of course, we could just as well have a 2x3 array instead...

In [22]:
print(ar.reshape((2,3)))

[[1 2 3]
 [4 5 6]]
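
The sketch below shows what happens when the dimensions do not multiply out to the number of elements, and how -1 can be used to let numpy work out one dimension from the others. The name `error_message` is illustrative.

```python
import numpy as np

ar = np.array([1, 2, 3, 4, 5, 6])

# 4 x 2 would need 8 elements, so numpy raises a ValueError
error_message = ""
try:
    ar.reshape((4, 2))
except ValueError as err:
    error_message = str(err)
    print("Reshape failed:", error_message)

# -1 asks numpy to infer that dimension from the others
print(ar.reshape((3, -1)).shape)   # (3, 2)
print(ar.reshape((-1, 3)).shape)   # (2, 3)
```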


You can also have higher dimension arrays with re-shaping. For example, we might want a 3x2x2 from a length 12 one dimensional array.

In [23]:
#Create data
ar = np.array(list(range(1,13)))
print(ar)
print()

#Print re-shaped data
print(ar.reshape((3,2,2)))

[ 1  2  3  4  5  6  7  8  9 10 11 12]

[[[ 1  2]
  [ 3  4]]

 [[ 5  6]
  [ 7  8]]

 [[ 9 10]
  [11 12]]]
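
To see where each number ends up, it can help to work out the position by hand. With numpy's default ordering, element [i, j, k] of a 3x2x2 array comes from position i*4 + j*2 + k of the one dimensional array. The sketch below checks this for one arbitrary choice of indices; `np.arange(1, 13)` is used here as a shorter way to build the same 1 to 12 array.

```python
import numpy as np

ar = np.arange(1, 13)   # same values as np.array(list(range(1, 13)))
reshaped = ar.reshape((3, 2, 2))

# Pick an arbitrary element and compute its position in the 1d array
i, j, k = 2, 0, 1
flat_position = i * (2 * 2) + j * 2 + k

print(reshaped[i, j, k])   # 10
print(ar[flat_position])   # 10
```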


Reshaping can be applied however you need and has many useful applications. One last example brings back the sales data. You might imagine a scenario where all you care about is seeing the sales numbers across days easily. In that case, it might be useful to transform the array so that each day becomes a single one dimensional row of its sales numbers.

In [24]:
print("Original Sales Data:")
print(sales)
print()

print("Reshaped Sales Data")
print(sales.reshape(3,4))

Original Sales Data:
[[[100 200]
  [ 50 100]]

 [[140 300]
  [ 55  40]]

 [[ 21  33]
  [ 43  53]]]

Reshaped Sales Data
[[100 200  50 100]
 [140 300  55  40]
 [ 21  33  43  53]]
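
With one row per day, per-day calculations become simple. The sketch below, which assumes daily totals are of interest and uses the illustrative name `per_day`, sums across each row.

```python
import numpy as np

sales = np.array([[[100, 200], [50, 100]],
                  [[140, 300], [55, 40]],
                  [[21, 33], [43, 53]]])

per_day = sales.reshape(3, 4)

# Column order in each row: store 1 product 1, store 1 product 2,
# store 2 product 1, store 2 product 2
print(per_day[1])            # day 2: [140 300  55  40]

# Total sales per day, summing across each row
print(per_day.sum(axis=1))   # [450 535 150]
```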
