# NumPy

## What is NumPy?

NumPy is an open-source library for working efficiently with arrays. It was created in 2005 by Travis Oliphant, and its name stands for Numerical Python. It is a critical data science library in Python, and many other libraries depend on it.

## Why is NumPy so popular?

NumPy is extremely popular because it dramatically improves the ease and performance of working with multidimensional arrays.

Some of NumPy's advantages:

1. Mathematical operations on NumPy’s ndarray objects are up to 50x faster than iterating over native Python lists using loops. The efficiency gains come primarily from NumPy storing array elements in a single contiguous block of memory, requiring all elements to be the same type, and making full use of modern CPUs. The advantage becomes particularly apparent when operating on arrays with thousands or millions of elements, which are common in data science.
2. It offers an indexing syntax for easily accessing portions of data within an array.
3. It contains built-in functions that improve quality of life when working with arrays and math, such as functions for linear algebra, array transformations, and matrix math.
4. It requires fewer lines of code for most mathematical operations than native Python lists.
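
As a rough illustration of points 1 and 4, the sketch below doubles a million numbers, first with a Python loop and then with a single NumPy expression. The variable names are made up for this example. How much faster the NumPy version runs depends on your machine and the array size, so no timing is claimed here.

```python
import numpy as np

values = list(range(1_000_000))

# Plain Python: visit the list element by element
doubled_list = [v * 2 for v in values]

# NumPy: one vectorized expression over the whole array
arr = np.array(values)
doubled_arr = arr * 2

# Both approaches produce the same numbers
print(doubled_arr[:5])                              # [0 2 4 6 8]
print(np.array_equal(doubled_arr, doubled_list))    # True
```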

### Best place to get more info

The online documentation (https://numpy.org/doc/) is a great place to look for further information on topics introduced in this article. The documentation goes into more detail than this introduction and is continually updated with evolving best practices.

## When should you start using NumPy?

NumPy would be a good candidate for the first library to explore after gaining basic comfort with the Python environment. After NumPy, the next logical choices for growing your data science and scientific computing capabilities might be SciPy and pandas. In short, learn Python, then NumPy, then SciPy or pandas.

## What's the relationship between NumPy, SciPy, Scikit-learn, and Pandas?

- **NumPy** provides a foundation on which other data science packages are built, including SciPy, Scikit-learn, and Pandas.
- **SciPy** provides a collection of modules for scientific computations. It extends NumPy by including integration, interpolation, signal processing, more linear algebra functions, descriptive and inferential statistics, numerical optimizations, and more.
- **Scikit-learn** extends NumPy and SciPy with advanced machine-learning algorithms.
- **Pandas** extends NumPy by providing functions for exploratory data analysis, statistics, and data visualization. It can be thought of as Python's equivalent to Microsoft Excel spreadsheets for working with and exploring tabular data.

## An Alternative to MATLAB?

Many readers will likely be familiar with the commercial scientific computing software MATLAB. When used together with other Python libraries like Matplotlib, NumPy can be considered as a fully-fledged alternative to MATLAB's core functionality.

Python is quite an attractive alternative to MATLAB for the following reasons:

- Python is open-source, which means that you have the option of inspecting the source code yourself.
- Python users have access to a vast and ever-growing range of possibilities.
- Unlike MATLAB, Python and NumPy are free. No further explanation is needed!

## Installation

To check whether NumPy is already installed in your Python environment (it most likely is), run the following command:

In [1]:
import numpy as np

If no error message is returned, that's a good sign NumPy is already available. If you get an error message like

ModuleNotFoundError: No module named 'numpy'

this probably means that NumPy needs to be installed first. If you use the pip Python package manager, the required command is `pip install numpy`. If you need more detailed installation instructions, refer to https://numpy.org/.

## List of useful NumPy functions

NumPy has numerous useful functions. You can see the full list of functions in the NumPy docs. As an overview, here are some of the most popular and useful ones to give you a sense of what NumPy can do. We will cover many of them in this tutorial.

- **Array Creation**: arange, array, copy, empty, empty_like, eye, fromfile, fromfunction, identity, linspace, logspace, mgrid, ogrid, ones, ones_like, r_, zeros, zeros_like
- **Conversions**: ndarray.astype, atleast_1d, atleast_2d, atleast_3d, mat
- **Manipulations**: array_split, column_stack, concatenate, diagonal, dsplit, dstack, hsplit, hstack, ndarray.item, newaxis, ravel, repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack
- **Questions**: all, any, nonzero, where
- **Ordering**: argmax, argmin, argsort, max, min, ptp, searchsorted, sort
- **Operations**: choose, compress, cumprod, cumsum, inner, ndarray.fill, imag, prod, put, putmask, real, sum
- **Basic Statistics**: cov, mean, std, var
- **Basic Linear Algebra**: cross, dot, outer, linalg.svd, vdot

# Section 1: The basics

## NumPy arrays

The NumPy array - an n-dimensional data structure - is the central object of the NumPy package.

A one-dimensional NumPy array can be thought of as a vector, a two-dimensional array as a matrix (i.e., a set of vectors), and a three-dimensional array as a tensor (i.e., a set of matrices).

![image](https://storage.googleapis.com/lds-media/images/numpy-vector-matrix-3d-matrix.width-1200.jpg)

Need more than three dimensions? It's entirely possible to have arrays with many dimensions, including so many dimensions that it's no longer humanly possible to conceptualize them.

### Array data types

An array can consist of integers, floating-point numbers, or strings. Within an array, the data type must be consistent (e.g., all integers or all floats).

Need an array with mixed data types? Consider using NumPy's record array format or pandas dataframes instead (see the Pandas tutorial).

In this article, we'll restrict our focus to conventional NumPy arrays consisting of a single data type.
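
To see the single-data-type rule in action, here is a small sketch (the variable names are illustrative only). When the inputs are mixed, NumPy converts every element to one common type that can represent them all.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])       # the integers are upcast to floats
text = np.array([1, "two", 3.0])    # every element becomes a string

print(ints.dtype.kind)    # 'i' (integer)
print(mixed.dtype)        # float64
print(text.dtype.kind)    # 'U' (Unicode string)
```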

### Defining arrays

We can define NumPy arrays in a number of ways. We'll detail a few of the most common approaches below.

### Using np.array()

To define an array manually, we can use the `np.array()` function. Below, we pass a list of two elements, each of which is a list containing two values. The result is a 2x2 matrix:

In [2]:
np.array([[1,2],[3,4]])

array([[1, 2],
       [3, 4]])

It's as simple as that! Once we have our data in a NumPy array, a vast suite of computing possibilities becomes available. Much of this article is concerned with exploring these possibilities.

NumPy has numerous functions for generating commonly-used arrays without having to enter the elements manually. A few of those are shown below:

### Defining arrays: np.arange()

The function `np.arange()` is great for creating vectors easily. Here, we create a vector with values spanning 1 up to (but not including) 5:

In [3]:
np.arange(1,5)

array([1, 2, 3, 4])
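
`np.arange()` also accepts an optional step, and `np.linspace()` (listed earlier under Array Creation) is a related function that takes the number of points instead of a step. A minimal sketch with made-up variable names:

```python
import numpy as np

odds = np.arange(1, 10, 2)     # start, stop (exclusive), step
print(odds)                    # [1 3 5 7 9]

grid = np.linspace(0, 1, 5)    # 5 evenly spaced points; the endpoint 1 is included
print(grid)                    # values 0, 0.25, 0.5, 0.75 and 1
```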

### Defining arrays: np.zeros, np.ones, np.full

In many programming tasks, it can be useful to initialize a variable and then write a value to it later in the code. If that variable happens to be a NumPy array, a common approach would be to create it as an array with zeros in every element. We can do this using `np.zeros()`. Here, we create an array of zeros with three rows and one column.

In [4]:
np.zeros((3,1))

array([[0.],
       [0.],
       [0.]])

You can also initialize an array with ones instead of zeros:

In [5]:
np.ones((3, 1))

array([[1.],
       [1.],
       [1.]])

`np.full()` creates an array filled with a fixed value, which you must supply as the second argument (there is no default). Here we create a 2x3 array with the number 7 in each element:

In [6]:
np.full((2,3),7)

array([[7, 7, 7],
       [7, 7, 7]])

Making arrays in this way is also helpful for appending columns or rows to an existing array, which will be covered a little later.
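
As a quick preview of that use case, the sketch below appends a placeholder column of zeros to a small array using `np.hstack()`. The array contents and names are invented for illustration.

```python
import numpy as np

table = np.arange(6).reshape(3, 2)       # 3 rows, 2 columns
extra_col = np.zeros((3, 1))             # placeholder column to fill in later

wider = np.hstack((table, extra_col))    # append the column on the right
print(wider.shape)                       # (3, 3)
```

Note that the result is a floating-point array, because `np.zeros()` produces floats by default and an array can hold only one data type.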

### Array shape

All arrays have a shape accessible using `.shape`.

For example, let's get the shape of a vector, matrix, and tensor.

In [7]:
vector = np.arange(5)
print("Vector shape:", vector.shape)

matrix = np.ones([3, 2])
print("Matrix shape:", matrix.shape)

tensor = np.zeros([2, 3, 3])
print("Tensor shape:", tensor.shape)

Vector shape: (5,)
Matrix shape: (3, 2)
Tensor shape: (2, 3, 3)


The shape of the vector is one-dimensional. The first number in its shape is the number of elements (or rows). For the matrix, `.shape` tells us we have three rows and two columns. The tensor is slightly different. The first number is how many matrices/slices we have. The second gives the number of rows. The third provides the number of columns.

If you're familiar with `pandas`, you might have noticed that the syntax for the number of rows and columns is strikingly similar to the equivalent in pandas. As we continue to explore NumPy arrays, you may notice many more similarities.

If we print the tensor, we can see its representation as a list of two 3x3 matrices:

In [8]:
tensor

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

### Reshaping arrays

We can reshape an array into any compatible dimensions using `.reshape`.

For example, say we want a 3x3 matrix where each element is incremented from 1 to 9. Easy:

In [9]:
arr = np.arange(1, 10)
print(arr, '\n')

# Reshape to 3x3 matrix
arr = arr.reshape(3, 3)
print(arr, '\n')

# Reshape back to the original size
arr = arr.reshape(9)
print(arr)

[1 2 3 4 5 6 7 8 9] 

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[1 2 3 4 5 6 7 8 9]


NumPy can infer one of the dimensions if you pass -1 for it. The total number of elements must still divide evenly into the remaining dimensions for the inference to work.

In [10]:
arr = np.arange(1, 10).reshape(3, -1)
print(arr)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
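
Here is a short sketch of what happens when the element count does not fit the requested shape. NumPy raises a `ValueError` rather than guessing:

```python
import numpy as np

arr = np.arange(1, 10)             # 9 elements

print(arr.reshape(-1, 3).shape)    # (3, 3): the -1 is inferred as 3

try:
    arr.reshape(4, -1)             # 9 elements cannot be split into 4 equal rows
except ValueError as err:
    print("Cannot reshape:", err)
```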


### Reading data from a file into an array

Usually, data sets are too large to define manually. Instead, the most common use case is to import data from a data file into a NumPy array.

As an example, let's take some publicly-available data from the U.S. Energy Information Administration. The dataset we'll explore contains information on electricity generation in the USA from a range of sources. You can download the file, MER_T07_02A.csv, here: https://www.eia.gov/totalenergy/data/browser/csv.php?tbl=T07.02A.

Because the data file is a CSV file, we'll use the `csv` module to import the data. It's worth noting that NumPy also has functions to read other types of data files directly into NumPy arrays, such as `np.genfromtxt()` for text files.
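
As a rough sketch of the `np.genfromtxt()` alternative, the example below parses a tiny in-memory CSV. The text and column names are a made-up stand-in, not the EIA file, and `skip_header=1` is used so that the remaining fields can be parsed as numbers.

```python
import io
import numpy as np

# Hypothetical stand-in for a small CSV file with a header row
csv_text = "year,value\n1949,135451.32\n1950,154519.994\n"

# skip_header=1 drops the header line so the rest parses as floats
table = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(table.shape)    # (2, 2)
```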

Here we're just reading the CSV file row-by-row, appending to a list, and then converting to a NumPy array:

In [11]:
import csv

data = []

with open('MER_T07_02A.csv', 'r') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=',')
    for row in file_reader:
        data.append(row)
        
data = np.array(data) #convert the list of lists to a NumPy array

We now have our data stored in a NumPy array that we've named data. For much of the remainder of this article, we'll be exploring how NumPy's functionality can be used to manipulate and gain insights into this data.

First, we'll explore some attributes of the array. One thing that we may want to know about an array is its dimensions:

In [12]:
data.shape

(8529, 6)

For this two-dimensional array, we have 8529 rows and 6 columns of data.

Another property of a NumPy array that we may wish to know is its data type. This information is stored in the `dtype` attribute. Inspecting it reveals that our array is made up of strings:

In [13]:
data.dtype.type

numpy.str_
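
Because every field was read as text, numeric columns need converting before any arithmetic. A minimal sketch, using a few values hand-copied from the Value column and made-up variable names:

```python
import numpy as np

# A few values in the same string form as the "Value" column
raw = np.array(['135451.32', '154519.994', '185203.657'])

values = raw.astype(float)    # convert the strings to floating-point numbers
print(values.dtype)           # float64
print(values.max())           # 185203.657
```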

## Saving

When we are ready to save our data, we can use the `save` function.

In [14]:
np.save('data.npy', data)      # Saves data to a binary file with the .npy extension
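
Saved arrays can be read back with `np.load()`. The round trip below is a sketch that writes to a temporary directory so it leaves no files behind. The file name and array are invented for the example.

```python
import os
import tempfile
import numpy as np

original = np.array([[1, 2], [3, 4]])

# Write to a temporary directory so the example cleans up after itself
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npy")
    np.save(path, original)
    restored = np.load(path)

print(np.array_equal(original, restored))    # True
```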

## Indexing

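At some point, it will become necessary to index (select) subsets of a NumPy array. For instance, you might want to plot one column of data or perform a manipulation of that column. NumPy's indexing notation will look familiar to MATLAB users, with two important differences: NumPy indices start at 0 rather than 1, and the end of a slice is not included in the result.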

### Basics of indexing notation

- **Commas** separate axes of an array.
- **Colons** mean "through", up to but not including the stop index. For example, x[0:4] means the first 4 rows (rows 0 through 3) of x.
- **Negative numbers** mean "from the end of the array." For example, x[-1] means the last row of x.
- **Blanks** before or after colons mean "the rest of". For example, x[3:] means row 3 and all the rows after it. Similarly, x[:3] means all the rows up to (but not including) row 3. x[:] means all rows of x.
- When there are **fewer indices than axes**, the missing indices are considered complete slices. For example, in a 3-axis array, x[0,0] means all data in the 3rd axis of the 1st row and 1st column.
- **Dots** "..." mean as many colons as needed to produce a complete indexing tuple. For example, if x has 5 axes, x[1,2,...] is the same as x[1,2,:,:,:].
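
The rules above can be checked on small throwaway arrays. In this sketch, `x` and `y` are hypothetical arrays created only to show the shapes each expression returns:

```python
import numpy as np

x = np.arange(20).reshape(5, 4)    # 5 rows, 4 columns

print(x[0:4].shape)    # (4, 4): rows 0, 1, 2, 3 (the stop index is excluded)
print(x[-1])           # last row: [16 17 18 19]
print(x[3:].shape)     # (2, 4): rows 3 and 4
print(x[:3].shape)     # (3, 4): rows 0, 1, 2
print(x[1, 2])         # row 1, column 2: 6

y = np.zeros((2, 3, 4))
print(y[0, 0].shape)      # (4,): the missing third index is a complete slice
print(y[1, ...].shape)    # (3, 4): same as y[1, :, :]
```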

In the following code, we'll explore some useful examples of selecting subsets from an array.

### Example

![example](https://storage.googleapis.com/lds-media/images/numpy-indexing-arrays.width-1200.jpg)

#### Indexing example 1: Colons and commas

Let's say we are interested in the first ten rows of column 4 (the fifth column, since numbering starts at 0). We can use the following syntax to index this section of the array: `array[start_row:end_row, col]`

In [15]:
data[0:10,4]

array(['Description', 'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors'], dtype='<U80')

The first row is the header for the column. Column 4 contains a description of energy sectors.

#### Indexing example 2: Colons as *all* rows or columns

A colon can also denote all rows, or all columns. Here, we index all rows of column 4.

In [16]:
data[:,4]

array(['Description', 'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors', ...,
       'Electricity Net Generation Total (including from sources not shown), All Sectors',
       'Electricity Net Generation Total (including from sources not shown), All Sectors',
       'Electricity Net Generation Total (including from sources not shown), All Sectors'],
      dtype='<U80')

#### Indexing example 3: Subset of columns

We can use the same format for any dimension of an array. The general syntax is `array[start_row:end_row, start_col:end_col]`. The following indexes all rows of column 2 up to (but not including) column 4:

In [17]:
data[:,2:4]

array([['Value', 'Column_Order'],
       ['135451.32', '1'],
       ['154519.994', '1'],
       ...,
       ['374205.509', '13'],
       ['404614.884', '13'],
       ['414224.475', '13']], dtype='<U80')

#### Indexing example 4: Explicitly specifying column numbers

What if the columns we need are not next to each other? Instead of indexing a range of columns, it can be useful to specify them explicitly. To explicitly specify particular columns, we just include them in a list. Let's index the five rows after the header, selecting only columns 2 and 3. This time, we'll write the output to a new array named subset that we can re-use in the following example.

In [18]:
subset = data[1:6, [2,3]]
subset

array([['135451.32', '1'],
       ['154519.994', '1'],
       ['185203.657', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

#### Indexing example 5: Mask arrays

Another convenient way to index certain sections of a NumPy array is to use a mask array. A mask array, also known as a logical array, contains boolean elements (i.e. True or False). Indexing of a given array element is determined by the value of the mask array's corresponding element.

First, we define a NumPy array of True/False values, where the True values are the ones we want to keep. Then we mask the `subset` array from the previous example. The result is retaining only the rows that correspond to elements that are True in the mask array.

In [19]:
mask_array = np.array([False, True, False, True, True])

subset[mask_array]

array([['154519.994', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

As you can see, the mask array retained the rows corresponding to True and excluded the ones corresponding to False. It is worth noting that a similar approach is used for indexing pandas dataframes.

Masking is a powerful tool that allows us to index elements based on logical expressions. We'll make good use of it in the case study later in the article.
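
In practice, the mask is usually produced by a comparison rather than typed by hand. A minimal sketch, with an arbitrary threshold and a few numeric values chosen for illustration:

```python
import numpy as np

values = np.array([135451.32, 154519.994, 185203.657, 195436.666])

mask = values > 150000    # boolean array: [False  True  True  True]
print(values[mask])       # keeps only the elements where the mask is True
print(mask.sum())         # 3 elements satisfy the condition
```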

### Concatenating

NumPy also provides useful functions for concatenating (i.e., joining) arrays. Let's say we wanted to restrict our attention to the first and the last three rows of our dataset. First, we'll define new sub-arrays as follows:

In [20]:
array_start = data[:3,:]
array_start

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [21]:
array_end = data[-3:,:]
array_end

array([['ELETPUS', '202106', '374205.509', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202107', '404614.884', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202108', '414224.475', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

To concatenate these arrays we can use `np.vstack`, where the v denotes vertical, or row-wise, stacking of the sub-arrays:

In [22]:
np.vstack((array_start, array_end))

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202106', '374205.509', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202107', '404614.884', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202108', '414224.475', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

Here we've stacked the first three rows and last three rows on top of each other.

The horizontal counterpart of `np.vstack()` is `np.hstack()`, which combines sub-arrays column-wise. For higher dimensional joins, the most common function is `np.concatenate()`. The syntax for this function is similar to the 2D versions, with the additional requirement of specifying the axis along which concatenation should be performed.

Calling `np.concatenate((array_start, array_end), axis=0)` would generate identical output to using `np.vstack()`. For these two-dimensional arrays, passing `axis=1` instead would generate identical output to using `np.hstack()`.
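
The equivalence is easy to confirm on two small arrays. The arrays `top` and `bottom` below are made up for this sketch:

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

rows = np.concatenate((top, bottom), axis=0)    # shape (4, 2)
cols = np.concatenate((top, bottom), axis=1)    # shape (2, 4)

print(np.array_equal(rows, np.vstack((top, bottom))))    # True
print(np.array_equal(cols, np.hstack((top, bottom))))    # True
```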

### Splitting

The opposite of concatenating (i.e., joining) arrays is splitting them. To split an array, NumPy provides the following commands:

- hsplit: Splits an array along the horizontal axis (column-wise)
- vsplit: Splits an array along the vertical axis (row-wise)
- dsplit: Splits an array along the 3rd axis (depth)
- array_split: Splits along an axis you specify, and allows sections of unequal size
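
A brief sketch of these functions on a small example array (the names are illustrative):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

left, right = np.hsplit(grid, 2)    # two arrays of shape (3, 2)
print(left.shape, right.shape)

parts = np.vsplit(grid, 3)          # three arrays of shape (1, 4)
print(len(parts), parts[0].shape)

# array_split also accepts counts that do not divide the axis evenly
uneven = np.array_split(np.arange(7), 3)
print([p.size for p in uneven])     # [3, 2, 2]
```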

### Adding/Removing Elements

NumPy provides several functions for adding or deleting data from an array:

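- resize: Returns a new array with the specified shape. If the new shape is larger, `np.resize` fills the extra cells with repeated copies of the original data, whereas the in-place `ndarray.resize` method pads with zeros.
- append: Returns a new array with values added to the end
- insert: Returns a new array with values inserted before a given index
- delete: Returns a new array with given data removed
- unique: Returns the sorted unique values of an array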
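
The following sketch applies each of these functions to a small one-dimensional array. Note that they all return new arrays and leave the original untouched.

```python
import numpy as np

a = np.array([1, 2, 2, 3])

print(np.append(a, [4, 5]))    # [1 2 2 3 4 5]
print(np.insert(a, 1, 99))     # 99 is inserted before index 1
print(np.delete(a, 0))         # [2 2 3]
print(np.unique(a))            # [1 2 3]
print(np.resize(a, 6))         # [1 2 2 3 1 2]: np.resize repeats the data to fill
print(a)                       # [1 2 2 3]: the original is unchanged
```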

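### Sorting

There are several useful functions for sorting array elements. The `kind` parameter of `np.sort` selects the sorting algorithm; the accepted values are `quicksort` (the default), `heapsort`, `mergesort`, and `stable`. Depending on the data type, the stable options may use timsort or radix sort internally.

For example, here's how you'd use merge sort to sort the values within each row of an array: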

In [23]:
a = np.array([[3,8,1,2], [9,5,4,8]])
np.sort(a, axis=1, kind='mergesort')      # Sort along axis 1, i.e. within each row

array([[1, 2, 3, 8],
       [4, 5, 8, 9]])
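
For comparison, this sketch sorts the same array along the other axis and uses `np.argsort` (listed earlier under Ordering) to get the indices that would sort a row:

```python
import numpy as np

a = np.array([[3, 8, 1, 2], [9, 5, 4, 8]])

by_column = np.sort(a, axis=0)    # sorts down each column
print(by_column.tolist())         # [[3, 5, 1, 2], [9, 8, 4, 8]]

order = np.argsort(a[0])          # indices that would sort the first row
print(order)                      # [2 3 0 1]
print(a[0][order])                # [1 2 3 8]
```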

### No Copy vs. Shallow Copy vs. Deep Copy

A common source of confusion for NumPy beginners is knowing when data is and isn't copied into a new object.

**No copy**: function calls and assignments:

In [24]:
print(id(a))

# Object "b" points to object "a". No new object is created.
b = a       

# Python passes objects as references. No copy is made.
def f(x):   
    print(id(x))
    
f(b)

99622832
99622832


Notice the id of `b` is the same as `a`, even if it's passed into a function.

**View/Shallow Copy**: Arrays that share some data. The view method creates an object looking at the same data. Slicing an array returns a view of that array.

In [25]:
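# View: "a" becomes a new array object that shares its data with "b"
a = b.view()

# Reshaping the view doesn't change the shape of "b"
a = a.reshape((4, 2))    

# Slice: a[:] is a view of "a", so this assignment
# also changes the data seen through "b"
a[:] = 5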

**Deep copy**: Use the `copy` method to make a complete copy of an array and all its data.

In [26]:
c = a.copy()

The `copy()` method creates a new array object `c` that holds the same values as `a` but has its own separate copy of the data.
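
One way to check which of the three cases you are dealing with is `np.shares_memory()`. The sketch below uses fresh, made-up variable names so that it runs on its own:

```python
import numpy as np

a = np.arange(6)
alias = a          # no copy: the same object under a second name
view = a[:3]       # view: a new object that shares data with a
deep = a.copy()    # deep copy: a new object with its own data

print(alias is a)                    # True
print(np.shares_memory(a, view))     # True
print(np.shares_memory(a, deep))     # False

view[0] = 100           # writes through to a, but not to the copy
print(a[0], deep[0])    # 100 0
```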