# Introduction to NumPy

By now, you should be proud of yourself for knowing and working with ```pandas```. Let's go even further, into an even more exciting part!

Now, we're going to start off with a foray into the ```NumPy``` library, which is one of the fundamental packages for scientific computing<br> in Python.

It turns out that the ```pandas DataFrames``` we worked with are actually built off the ```NumPy array``` (which we'll get to), so it's important<br> to have some basic knowledge of what's running under the hood of our ```DataFrames```.

There are loads of things that you can do with NumPy arrays, and today we're going to introduce some of their amazing capabilities.<br> NumPy offers a lot of functionality. To learn more, see the [NumPy website](https://numpy.org/).

## Objectives

By the end of this notebook, you will:
- be able to create NumPy arrays
- have an idea of the many things you can do with NumPy and NumPy arrays
- know the types of situations where you would want to use them

### What's so special about ```NumPy arrays```?

A NumPy array is much faster to interact with and perform certain types of calculations with than a standard Python list. Why is that,<br> though? The two main reasons that they are faster are:

- They are stored as one contiguous block of memory, rather than being spread out across multiple locations like a list.<br>
- Each item in a NumPy array is of the same data type (i.e. all integers, all floats, etc.), rather than a conglomerate of any number<br> of data types (as a list is). We call this idea homogeneity, as opposed to the possible heterogeneity of Python lists.
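
The two points above can be inspected directly on an array. The following is a minimal sketch (the names `demo_arr` and `demo_list` are made up for illustration) that prints the single shared data type, the bytes used per element, and whether the data sits in one contiguous block:

```python
import numpy as np

demo_arr = np.array([1, 2, 3, 4], dtype=np.int64)
demo_list = [1, "two", 3.0, None]  # a list can mix types freely

print(demo_arr.dtype)                  # one dtype shared by every element
print(demo_arr.itemsize)               # bytes used per element
print(demo_arr.nbytes)                 # total bytes in the data buffer (itemsize * size)
print(demo_arr.flags["C_CONTIGUOUS"])  # True when stored as one contiguous block

# Each list element carries its own type, so Python has to check it at run time.
print({type(item).__name__ for item in demo_list})
```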


Just how much faster are they? Let's find out! Let's take the integers from 0 up to (but not including) 1 million, and sum those numbers, timing it with<br> both a list and a NumPy array.

In [12]:
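import numpy as np # Standard alias when importing NumPy - follow this convention!
numpy_array = np.arange(0, 1000000)
# Note: range() gives a lazy range object rather than a list; use list(range(1000000)) for a true list.
python_list = range(1000000)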

In [13]:
python_list

range(0, 1000000)

In [14]:
%timeit np.sum(numpy_array)

615 μs ± 53.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [15]:
%timeit sum(python_list)

14.8 ms ± 563 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%timeit sum(numpy_array)

71.2 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


So, NumPy runs roughly 24 times faster here (615 μs versus 14.8 ms), which is more than an order of magnitude! This is because of those two points above. Because NumPy arrays store data in<br> contiguous blocks of memory, they are able to take advantage of **vectorization**, which is the ability of a CPU to perform one operation<br> on multiple pieces of data at once. In addition, since a NumPy array knows the type of every element it stores (and those types<br> don't change), it doesn't have to waste time checking the type of each element (as Python must for a list). The combination of these two things speeds up our<br> calculation quite a bit.

Notice, too, that we had to specify ```np.sum()``` - NumPy's implementation of sum. When we just used the built-in Python ```sum()``` on the<br> NumPy array, the calculation was actually slower (71.2 ms)! This is because the built-in ```sum()``` walks over the array one element at a time in Python, turning each value into a Python object along the way, so<br> we lose the vectorized code path and pay a price.
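
The ```%timeit``` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library ```timeit``` module can be used instead. The variable names and repeat counts below are arbitrary choices, and the timings will differ from machine to machine:

```python
import timeit
import numpy as np

values = np.arange(1_000_000, dtype=np.int64)

# Both calls compute the same total; only where the loop runs differs.
vectorized_total = np.sum(values)      # loop runs in compiled code
python_total = sum(values.tolist())    # loop runs in the Python interpreter

t_np = timeit.timeit(lambda: np.sum(values), number=20)
t_py = timeit.timeit(lambda: sum(values), number=2)
print(f"np.sum(): {t_np / 20:.6f} s per call")
print(f"sum():    {t_py / 2:.6f} s per call")
```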

It's also worth noting that all we did above was a sum - just a **simple** sum. When we move to doing more complicated operations, we'll<br> save even more time! Let's look at what else NumPy arrays can do...

### Producing a NumPy array

We're not going to cover everything that you can do with NumPy arrays (see the methods [docs](https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html#numpy.ndarray) for that), but we'll look at the basics.

Let's start with how we can create a ```NumPy array```. To do this, we use the ```np.array()``` constructor, which expects some kind of array or<br> something that exposes the array interface (i.e. acts like an array - lists and tuples are examples). So, this means that we can<br> create a NumPy array by passing in a list or tuple.

In [18]:
my_lst_ndarray = np.array([10, 20, 30, 40, 50])
# You can specify the data type upon creation. 
my_tuple_ndarray = np.array((10, 20, 30, 40, 50), np.int32) 
print(my_tuple_ndarray)

[10 20 30 40 50]


Just like we have the shape attribute on pandas DataFrames, we also have it on NumPy arrays.

In [20]:
print(my_lst_ndarray.shape)
print(my_tuple_ndarray.shape)

(5,)
(5,)


We also have the dtype attribute, which will tell us the data type of the objects in our ndarray (n-dimensional array).

In [19]:
print(my_lst_ndarray.dtype)
print(my_tuple_ndarray.dtype)

int64
int32


Let's try something else:

In [None]:

my_lst_ndarray2 = np.array(["1", 2, 3, "10", 5])
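print(my_lst_ndarray2.dtype)

<U21


"U" stands for Unicode string, "<" marks the byte order (little-endian), and "21" is the maximum number of characters each element can hold. NumPy reserves 21 characters here because the list also contains integers, and that is enough room for the text form of any 64-bit integer. Every element in the array was converted to a string.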

In [24]:
# Every element is now a string.
type(my_lst_ndarray2[1])

numpy.str_
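
Because an array holds a single data type, NumPy picks one type that can represent every value you pass in. The short sketch below (with made-up variable names) shows two common cases: integers mixed with floats become floats, and numbers mixed with strings become strings.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.5])
ints_and_strings = np.array([1, "2", 3])

print(ints_and_floats.dtype)        # float64: the integers are promoted to floats
print(ints_and_strings.dtype.kind)  # 'U': everything becomes a Unicode string
print(ints_and_strings)             # ['1' '2' '3']
```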

If you try to tell the ndarray to be a certain data type, it will try to cast every object you pass in to that data type (here a<br>32-bit integer), and fail if it can't cast it to that data type. Below, we are able to cast "10" to a 32-bit integer, so this is fine.

In [26]:
my_lst_ndarray3 = np.array([1, 2, 3, "10", 5], np.int32) 
print(my_lst_ndarray3.dtype)

int32


In [16]:
# This will not work, because Python can't cast the string 'boy' as a 32 bit integer. 
my_lst_ndarray3 = np.array([1, 2, 3, "boy", 5], np.int32) 

ValueError: invalid literal for int() with base 10: 'boy'
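
If the input might contain values that cannot be cast, one option is to catch the ```ValueError```. This is only a sketch: ```to_int32_or_none``` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def to_int32_or_none(values):
    """Hypothetical helper: return an int32 array, or None if a value cannot be cast."""
    try:
        return np.array(values, dtype=np.int32)
    except ValueError as err:
        print(f"Could not cast: {err}")
        return None

castable = to_int32_or_none([1, 2, 3, "10", 5])       # works: "10" becomes 10
not_castable = to_int32_or_none([1, 2, 3, "boy", 5])  # prints the error, returns None
```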

Here are some other methods of constructing a NumPy array. It's helpful to know these exist.

In [18]:
zeros_arr = np.zeros((2,4)) # Create a matrix of zeros with 2 rows and 4 columns. 
zeros_arr

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [14]:
ones_arr = np.ones((9,4))  # Create a matrix of ones with 9 rows and 4 columns.
ones_arr

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [28]:
identity_arr = np.identity(40) # Create an identity matrix with 40 rows and 40 columns. 
identity_arr


array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], shape=(40, 40))

In [16]:
random_arr = np.random.rand(3, 3) # Create a 3x3 array of random floats ranging from 0 to 1. 
random_arr

array([[0.48876054, 0.76902541, 0.93665335],
       [0.5589696 , 0.70939513, 0.16508046],
       [0.18729455, 0.97421708, 0.01038221]])

In [17]:
range_arr = np.arange(0, 15, 0.5) # Create a NumPy array with arguments (start, stop, step_size); stop itself is excluded.
range_arr

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. , 10.5,
       11. , 11.5, 12. , 12.5, 13. , 13.5, 14. , 14.5])
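
Two more constructors are worth a quick look; neither appears in the cells above, so treat this as an optional sketch. ```np.linspace()``` takes the number of points instead of a step size, and a seeded generator from ```np.random.default_rng()``` makes random arrays reproducible between runs. The seed value ```42``` is an arbitrary choice.

```python
import numpy as np

# Five evenly spaced values from 0 to 1, with the end point included.
points = np.linspace(0, 1, 5)
print(points)

# Two generators created with the same seed produce the same numbers.
sample_a = np.random.default_rng(seed=42).random((3, 3))
sample_b = np.random.default_rng(seed=42).random((3, 3))
print(np.array_equal(sample_a, sample_b))
```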

## Mathematics with NumPy arrays

When working with a NumPy array, we have all of the basic mathematics operators available to us: ```+```, ```-```, ```*```, ```/```, ```**```, and ```%```. Frequently,<br> we'll be working with two arrays that are the same size, in which case these operators will just be performed ```element-wise```. For<br> example:

In [83]:
first_arr = np.array([4, 5, 1, 4])
second_arr = np.array([1, 9, 7, 8])
first_arr + second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the addition is then performed. 

array([ 5, 14,  8, 12])

In [84]:
first_arr = np.array([[1, 5], [3, 2]]) # This is now a two-dimensional array. 
second_arr = np.array([[5, 5], [6, 8]]) # This is now a two-dimensional array. 
first_arr * second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the multiplication is then performed.

array([[ 5, 25],
       [18, 16]])

It turns out that our numerical operations can also work when we want to perform an operation between a ```NumPy array``` and a single<br> value. For example, let's say that we want to subtract ```5``` from ```first_arr``` above, or multiply it by ```2```, or find the remainder when<br> everything is divided by ```3```. We can do that via the following:

In [20]:
first_arr = np.array([[1, 2], [3, 4]]) 

In [21]:
first_arr - 5

array([[-4, -3],
       [-2, -1]])

In [22]:
first_arr * 2

array([[2, 4],
       [6, 8]])

In [23]:
first_arr % 3

array([[1, 2],
       [0, 1]])

The concept that allows this to happen is referred to as [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html). It is a concept that will be particularly useful when working<br> with ```NumPy arrays```. Basically, it takes the single number on the right (the ```5```, ```2``` or ```3``` above), and **broadcasts**<br> it to match the shape of ```first_arr``` (2 x 2). After doing so, it then performs the operation element-wise like we saw before.
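
To make the "stretching" visible, here is a small sketch using ```np.broadcast_to()```, which builds the expanded array explicitly. NumPy does not need this call in order to broadcast; it is used here only to show what the scalar conceptually turns into. The names ```base``` and ```stretched``` are made up for the example.

```python
import numpy as np

base = np.array([[1, 2], [3, 4]])

# The scalar 5 viewed as an array with the same shape as base.
stretched = np.broadcast_to(5, base.shape)
print(stretched)

# Subtracting the scalar gives the same result as subtracting the stretched array.
print(np.array_equal(base - 5, base - stretched))
```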

It turns out that things can get a little more intricate than this. If we wanted, we could perform mathematical operations like the<br> above at a column level, or row level. For example, we could subtract off ```3``` from the first column and ```4``` from the second column, or ```3```<br> from the first row and ```4``` from the second row. We would do that via the following:

In [30]:
first_arr = np.array([[3, 2], [1, 4]])

In [31]:
# Here, we subtract 3 off the first column and 4 off the second column. 
first_arr - [3, 4]

array([[ 0, -2],
       [-2,  0]])

In [26]:
# Here, we subtract 3 from the first row and 4 from the second row. 
first_arr - [[3], [4]]

array([[ 0, -1],
       [-3,  0]])
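
The difference between the two cells above comes down to shape. A sketch with explicit arrays (the names are illustrative) makes this easier to see: a shape of ```(2,)``` lines up with the columns, while a shape of ```(2, 1)``` lines up with the rows.

```python
import numpy as np

grid = np.array([[3, 2], [1, 4]])
per_column = np.array([3, 4])        # shape (2,): one value per column
per_row = per_column.reshape(2, 1)   # shape (2, 1): one value per row

col_result = grid - per_column
row_result = grid - per_row
print(per_column.shape, per_row.shape)
print(col_result)
print(row_result)
```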

## More on NumPy Arrays

There are actually quite a number of things we can do. We can index into them, perform calculations, ask for aggregation type metrics,<br> etc.

### Indexing

Let's start with Indexing! With NumPy arrays, we don't have the ```.loc[] or .iloc[]``` methods like we do on a ```DataFrame``` - we simply index<br> into them like we would a list. It's effectively a multidimensional list, though. Therefore, we can pass it multiple indexing values.<br> Let's take a look.

In [36]:
# Reshape will reshape the data to the shape that you tell it to (here 3 rows, 5 columns). 
range_arr = np.arange(0, 15, 1).reshape(3, 5)
range_arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [37]:
# Grab every row, but only the element at index 1 in those rows. 
range_arr[:, 1]

array([ 1,  6, 11])

In [None]:
# With no second index, this defaults to taking the rows. 
range_arr[0:2]

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [38]:
# The first set of numbers refers to the rows to grab, the second set the columns.
range_arr[0:3, 1:4] 

array([[ 1,  2,  3],
       [ 6,  7,  8],
       [11, 12, 13]])
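
A few more indexing patterns follow the same rules. This is a brief sketch on a fresh copy of the 3 x 5 array (named ```table``` here to avoid touching ```range_arr```), showing negative indices, slicing with a step, and selecting with a boolean mask.

```python
import numpy as np

table = np.arange(15).reshape(3, 5)

last_row = table[-1]               # negative indices count from the end
every_other_col = table[:, ::2]    # a step of 2 keeps columns 0, 2 and 4
large_values = table[table > 10]   # a boolean mask returns a 1-D array of matches

print(last_row)
print(every_other_col)
print(large_values)
```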

### Other methods

Again, there is a ton we can do, and we're aiming here to at least get your eyes on a lot of the things that are possible. Google is<br> also amazing for this.

We can perform sums in any direction with a method on the arrays.

In [39]:
range_arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [31]:
# Sum along the rows (i.e. get column totals)
range_arr.sum(axis=0)

array([15, 18, 21, 24, 27])

In [40]:
# Sum along the columns (i.e. get row totals)
range_arr.sum(axis=1)

array([10, 35, 60])

In [41]:
# Get sum of all elements in NumPy array. 
range_arr.sum() 

np.int64(105)
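
One way to keep the ```axis``` argument straight is to remember that it names the dimension that gets collapsed. The sketch below checks the resulting shapes and also uses the ```keepdims``` argument, which keeps the collapsed dimension with length 1 so the result broadcasts against the original array. The array ```data``` is a fresh copy of the 3 x 5 example.

```python
import numpy as np

data = np.arange(15).reshape(3, 5)

col_totals = data.sum(axis=0)                     # collapses the 3 rows -> shape (5,)
row_totals = data.sum(axis=1)                     # collapses the 5 columns -> shape (3,)
row_totals_2d = data.sum(axis=1, keepdims=True)   # shape (3, 1)

# Express each value as a share of its row total.
row_shares = data / row_totals_2d
print(col_totals.shape, row_totals.shape, row_totals_2d.shape)
print(row_shares.sum(axis=1))   # each row of shares adds up to 1
```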

We can also grab the mean, standard deviation, max, and min values along the rows (i.e. for the columns). We could also do this along<br> the columns, or for the array as a whole (just like we did with ```.sum()```).

In [42]:
range_arr.mean(axis=0)

array([5., 6., 7., 8., 9.])

In [None]:
range_arr.std(axis=0)

In [43]:
range_arr.max(axis=0)

array([10, 11, 12, 13, 14])

In [44]:
range_arr.min(axis=0)

array([0, 1, 2, 3, 4])

If we want to instead grab the **index** at which those min and max values occur (either along the rows or columns), then we can use<br> the ```argmin()``` and ```argmax()``` methods available on our NumPy array.

In [85]:
# We see that the mins of each column occur at row 1 (index 0).
range_arr.argmin(axis=0)

array([0, 0, 0, 0, 0])

In [86]:
# We see that the maxs of each column occur at row 3 (index 2).
range_arr.argmax(axis=0) 

array([2, 2, 2, 2, 2])

In [None]:
# Here we get the index of the overall minimum in the flattened array (index 0).
range_arr.argmin()

In [None]:
# Here we get the index of the overall maximum in the flattened array (the last index, 14).
range_arr.argmax() 
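
When called without an axis, ```argmin()``` and ```argmax()``` return a position in the flattened array. If you want the row and column instead, ```np.unravel_index()``` converts the flat position back. A small sketch on a fresh copy of the 3 x 5 array:

```python
import numpy as np

data = np.arange(15).reshape(3, 5)

flat_index = data.argmax()                            # position in the flattened array
row, col = np.unravel_index(flat_index, data.shape)   # matching (row, column) pair
print(flat_index, row, col)
```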

We can get the cumulative sum or product with the following.

In [None]:
# Here it gets the cumsum along the rows (i.e. from top to bottom)
range_arr.cumsum(axis=0)

array([[ 0,  1,  2,  3,  4],
       [ 5,  7,  9, 11, 13],
       [15, 18, 21, 24, 27]])

In [50]:
# Gets the cumprod along the rows
range_arr.cumprod(axis=0) 

array([[  0,   1,   2,   3,   4],
       [  0,   6,  14,  24,  36],
       [  0,  66, 168, 312, 504]])

In [58]:
# We can flatten our arrays as follows. 
range_arr.flatten()  # Always returns a copy.
range_arr.ravel()  # Returns a view when possible; the values look the same in this case. 

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
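
Although the two results look identical, they behave differently when modified. The following sketch, on a fresh array named ```source```, uses ```np.shares_memory()``` to show that ```flatten()``` gives a copy while ```ravel()``` gives a view for this contiguous array.

```python
import numpy as np

source = np.arange(15).reshape(3, 5)
flat_copy = source.flatten()
flat_view = source.ravel()

print(np.shares_memory(source, flat_copy))  # False: flatten() copied the data
print(np.shares_memory(source, flat_view))  # True: ravel() reuses the same memory here

flat_view[0] = 99   # writes through to source
flat_copy[1] = -1   # leaves source untouched
print(source[0, :2])
```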

## Another NumPy method

Since pandas DataFrames are built on NumPy arrays, the methods available on NumPy arrays largely coincide with the methods available<br> on pandas DataFrames.

Many of these methods are available as functions on the ```NumPy``` module itself, as well. Just like we can call the ```argmax()``` method on a<br> NumPy array, we can call ```np.argmax()``` and pass in a list or tuple. Before we move back to DataFrames, let's look at one last method<br> that is available in NumPy, ```np.where()```. ```np.where()``` can help us to find what elements in a NumPy array meet some condition.

In [59]:
my_ndarray = np.array([1, 3, 5, 7, 20, 13, 18, 9, 12])

In [60]:
# Returns the indices where the data meet the condition.
print(np.where(my_ndarray <= 2))  
print(np.where(my_ndarray == 8)) 
print(np.where(my_ndarray > 6)) 

(array([0]),)
(array([], dtype=int64),)
(array([3, 4, 5, 6, 7, 8]),)


## Pivot Tables

Another place where NumPy comes in handy is in creating ```Pivot Tables``` with pandas. A pivot table can automatically sort, count, total, or average<br> the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data.

As you might have guessed, we have functionality to create pivot tables available for our use in pandas. The way that we do this is by<br> calling the ```pivot_table()``` function that is available on the pandas module (which we've stored as ```pd```). As the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) tell us, the<br> ```pivot_table()``` expects a number of different arguments:

1. ```data```: A DataFrame object
2. ```values```: a column or a list of columns to aggregate
3. ```index```: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If<br> an array is passed, it is used in the same manner as column values.
4. ```columns```: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column.<br> If an array is passed, it is used in the same manner as column values.
5. ```aggfunc```: function to use for aggregation, defaulting to np.mean

Notice that by default this uses the mean for the ```aggfunc``` parameter.
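
Before applying this to the wine data, here is a minimal sketch on a tiny hand-made DataFrame. The data and the names ```toy_df```, ```bin``` and ```sugar``` are invented for illustration. It also shows that the same table can be produced with ```groupby()``` followed by ```unstack()```.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6],
    "bin": ["low", "high", "low", "low", "high"],
    "sugar": [2.0, 4.0, 1.0, 3.0, 5.0],
})

# Rows come from 'quality', columns from 'bin', and each cell is the mean of 'sugar'.
toy_pivot = pd.pivot_table(toy_df, values="sugar", index="quality",
                           columns="bin", aggfunc="mean")

# The same summary written as a groupby.
toy_grouped = toy_df.groupby(["quality", "bin"])["sugar"].mean().unstack()

print(toy_pivot)
```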

In [75]:
# Let's recall what the data looks like. 
import pandas as pd
import numpy as np
red_wines_df = pd.read_csv('data/winequality-red.csv', sep=';')
red_wines_df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


Let's take a moment to quickly learn about another pandas function called ```cut()``` that allows us to turn a column with continuous data<br> into categoricals by specifying bins to place them in.

In [76]:
pd.cut(red_wines_df['fixed acidity'], bins=np.arange(4, 17)).head()

0      (7, 8]
1      (7, 8]
2      (7, 8]
3    (11, 12]
4      (7, 8]
Name: fixed acidity, dtype: category
Categories (12, interval[int64, right]): [(4, 5] < (5, 6] < (6, 7] < (7, 8] ... (12, 13] < (13, 14] < (14, 15] < (15, 16]]
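
To see how the bin edges behave, here is a small sketch on a handful of made-up acidity values. By default each bin includes its right edge, so ```8.0``` falls in ```(7, 8]```. Passing ```labels``` replaces the interval with a name of our choosing (here, the left edge of each bin).

```python
import numpy as np
import pandas as pd

acidity = pd.Series([4.6, 7.4, 7.8, 8.0, 11.2])   # invented sample values
edges = np.arange(4, 13)                          # bin edges 4, 5, ..., 12

intervals = pd.cut(acidity, bins=edges)                    # categories are intervals
labelled = pd.cut(acidity, bins=edges, labels=edges[:-1])  # categories are left edges

print(intervals)
print(labelled)
```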

In [78]:
fixed_acidity_bins = np.arange(4, 17)
# Label each bin with its left edge (4, 5, ..., 15) instead of the interval itself.
fixed_acidity_series = pd.cut(red_wines_df['fixed acidity'], bins=fixed_acidity_bins, 
                              labels=fixed_acidity_bins[:-1])
fixed_acidity_series.name = 'fa_bin'
# Attach the binned column to the original DataFrame.
red_wines_df = pd.concat([red_wines_df, fixed_acidity_series], axis=1)

In [79]:
red_wines_df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | fa_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 7 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 7 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 7 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 11 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 7 |


Now we can get the mean residual sugar for each quality category/fixed acidity bin like we did earlier, but with a pivot_table (mean<br> is the default aggregation function).

In [80]:
pd.pivot_table(red_wines_df, values='residual sugar', index='quality', columns='fa_bin')



| quality (rows) / fa_bin (columns) | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NaN | NaN | 1.5 | 3.5375 | 3.4 | NaN | 1.8 | 2.2 | NaN | NaN | NaN | NaN |
| 4 | 1.75 | 5.3 | 2.714286 | 2.453846 | 2.583333 | 2.433333 | 2.566667 | 1.5 | 4.5 | NaN | NaN | NaN |
| 5 | 1.6 | 1.85 | 2.492623 | 2.441331 | 2.496786 | 2.675 | 3.238889 | 2.77 | 2.393333 | 3.133333 | NaN | 5.025 |
| 6 | 2.35 | 2.886538 | 2.556767 | 2.167027 | 2.281731 | 2.801563 | 2.910345 | 2.524359 | 2.9125 | 2.85 | 1.8 | NaN |
| 7 | 2.1 | 1.9 | 2.595 | 2.655 | 2.796429 | 2.8625 | 2.718 | 2.638889 | 4.15 | 2.8 | 2.2 | 3.7 |
| 8 | 2.0 | 1.6 | NaN | 2.316667 | 1.8 | 2.166667 | 3.866667 | 5.2 | 2.2 | NaN | NaN | NaN |


In [81]:
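# We can also specify a function to aggregate with (by default it is mean).
# Passing the name 'max' is equivalent to np.max here; recent pandas versions may warn when given the NumPy function.
pd.pivot_table(red_wines_df, values='residual sugar', index='quality', 
               columns='fa_bin', aggfunc='max')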



| quality (rows) / fa_bin (columns) | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NaN | NaN | 1.8 | 5.7 | 3.4 | NaN | 2.1 | 2.2 | NaN | NaN | NaN | NaN |
| 4 | 2.1 | 12.9 | 5.6 | 4.4 | 6.3 | 3.4 | 3.4 | 1.6 | 4.5 | NaN | NaN | NaN |
| 5 | 1.6 | 2.5 | 7.9 | 8.1 | 7.9 | 13.8 | 15.5 | 5.15 | 4.6 | 4.8 | NaN | 7.5 |
| 6 | 4.3 | 13.9 | 10.7 | 5.5 | 5.1 | 11.0 | 15.4 | 6.2 | 4.3 | 3.8 | 1.8 | NaN |
| 7 | 2.1 | 2.2 | 6.0 | 8.3 | 6.2 | 8.9 | 6.55 | 4.4 | 5.8 | 2.8 | 2.2 | 3.7 |
| 8 | 2.0 | 1.8 | NaN | 3.6 | 1.8 | 2.8 | 6.4 | 5.2 | 2.2 | NaN | NaN | NaN |
