# Intro

In this notebook, you will learn about
- Getting data from a csv file
- Basic indexing and manipulation
- Saving data for later use with `np.savez`
- Basic built-in NumPy utilities and plotting with Matplotlib.

---

## 1. What is NumPy?

NumPy (Numerical Python) is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The NumPy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

The NumPy library contains multidimensional array and matrix data structures (you’ll find more information about this in later sections). It provides ndarray, a homogeneous n-dimensional array object, with methods to efficiently operate on it. NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

Learn more about NumPy [here!](https://numpy.org/devdocs/user/whatisnumpy.html#whatisnumpy)

To access NumPy and its functions import it in your Python code like this:

In [None]:
import numpy as np

We shorten the imported name to `np` for better readability of code using NumPy. This is a widely adopted convention that you should follow so that anyone working with your code can easily understand it.

### What's the difference between a Python list and a NumPy array?

NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating numerical data inside them. While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous. The mathematical operations that are meant to be performed on arrays would be extremely inefficient if the arrays weren’t homogeneous.
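As a minimal sketch of what homogeneity means in practice (the values here are made up for illustration), a list keeps each element's own type, while `np.array()` converts everything to one common data type:

```python
import numpy as np

# A Python list can hold values of different types side by side.
mixed_list = [1, 2.5, 3]
list_types = [type(item).__name__ for item in mixed_list]

# np.array() picks a single dtype able to represent every element,
# so the integers are converted to floating point numbers here.
mixed_array = np.array(mixed_list)

print(list_types)         # ['int', 'float', 'int']
print(mixed_array.dtype)  # float64
```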

### Why use NumPy?

NumPy arrays are faster and more compact than Python lists. An array consumes less memory and is convenient to use. NumPy uses much less memory to store data and it provides a mechanism of specifying the data types. This allows the code to be optimized even further.
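A rough way to see the memory difference is to compare a list and an array holding the same integers. This is only a sketch: the list estimate below depends on the Python interpreter, and the choice of `np.int64` is an assumption made for the example.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores references to separate int objects, so we count both
# the list itself and the objects it points to (a rough estimate).
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The array stores the raw 8-byte integers in one contiguous buffer.
array_bytes = np_array.nbytes  # 1000 elements * 8 bytes each

print(list_bytes, array_bytes)
```

Choosing a smaller data type, such as `np.int16`, would shrink the array further, which is what specifying data types makes possible.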

### What is an array?

An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype.

An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.

One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data. For this, we can use the `np.array()` function.

For example:

In [None]:
a = np.array([1, 2, 3, 4, 5, 6])

or:

In [None]:
a = np.array([[1, 2, 3, 4], 
              [5, 6, 7, 8], 
              [9, 10, 11, 12]])

We can access the elements in the array using square brackets. When you’re accessing elements, remember that indexing in NumPy starts at 0. That means that if you want to access the first element in your array, you’ll be accessing element “0”. For the two-dimensional array `a` defined above, element “0” is the first row.

In [None]:
print(a[0])
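Beyond a single integer, arrays can also be indexed with a tuple of integers, a boolean mask, or another integer array. The following sketch uses a small made-up array named `grid` to show each form:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)   # 2: the number of dimensions (the rank)
print(grid.shape)  # (2, 3): 2 rows and 3 columns

by_int = grid[0]                   # single integer: the first row
by_tuple = grid[(1, 2)]            # tuple of integers: row 1, column 2
by_bool = grid[grid > 4]           # boolean mask: entries greater than 4
by_array = grid[np.array([1, 0])]  # integer array: row 1, then row 0

print(by_int)    # [1 2 3]
print(by_tuple)  # 6
print(by_bool)   # [5 6]
print(by_array)
```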

You might occasionally hear an array referred to as a “ndarray,” which is shorthand for “N-dimensional array.” An N-dimensional array is simply an array with any number of dimensions. You might also hear 1-D, or one-dimensional array, 2-D, or two-dimensional array, and so on. The NumPy ndarray class is used to represent both matrices and vectors. A vector is an array with a single dimension (there’s no difference between row and column vectors), while a matrix refers to an array with two dimensions. For 3-D or higher dimensional arrays, the term tensor is also commonly used.

![TODO](https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png)

### What are the attributes of an array?

An array is usually a fixed-size container of items of the same type and size. The number of dimensions and items in an array is defined by its shape. The shape of an array is a tuple of non-negative integers that specify the sizes of each dimension.

In NumPy, dimensions are called axes. This means that if you have a 2D array that looks like this:

```python
[[0., 0., 0.],
 [1., 1., 1.]]
```

Your array has 2 axes. The first axis has a length of 2 and the second axis has a length of 3.
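These numbers can be checked directly on the array. The name `example` is just a label chosen for this sketch:

```python
import numpy as np

example = np.array([[0., 0., 0.],
                    [1., 1., 1.]])

print(example.ndim)      # 2: the array has two axes
print(example.shape)     # (2, 3)
print(example.shape[0])  # 2: length of the first axis
print(example.shape[1])  # 3: length of the second axis
```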

Just like in other Python container objects, the contents of an array can be accessed and modified by indexing or slicing the array. Unlike the typical container objects, different arrays can share the same data, so changes made on one array might be visible in another.
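The data sharing mentioned above is easiest to see with a slice. In this sketch (array names are made up), a basic slice gives a view onto the same data, while `.copy()` gives an independent array:

```python
import numpy as np

original = np.array([10, 20, 30, 40])

# A basic slice is a view: it shares memory with the original array.
window = original[1:3]
window[0] = -1  # this write is visible through `original` as well

# An explicit copy owns its data, so changing it leaves `original` alone.
independent = original[1:3].copy()
independent[1] = 999

print(original)                                 # [10 -1 30 40]
print(np.shares_memory(original, window))       # True
print(np.shares_memory(original, independent))  # False
```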

Array attributes reflect information intrinsic to the array itself. If you need to get, or even set, properties of an array without creating a new array, you can often access an array through its attributes.

## 2. A problem to explore

From [Wikipedia](https://en.wikipedia.org/wiki/Cost_of_living):
    
    Cost of living is the cost of maintaining a certain standard of living. Changes in the cost of living over time are often operationalized in a cost-of-living index. Cost of living calculations are also used to compare the cost of maintaining a certain standard of living in different geographic areas. Differences in cost of living between locations can also be measured in terms of purchasing power parity rates. 
    
From [Numbeo](https://www.numbeo.com), it is possible to obtain data about the cost of living and quality of life indices for several cities across the world as a `.csv` (comma separated values) file. We will explore this data using NumPy and its array manipulating capabilities.

### Getting data from a `.csv` file

CSV files are standard in data analysis and other applications. Because we are mostly interested in the array manipulations, we will use the [pandas library](https://pandas.pydata.org/), which is the industry standard, to obtain the data and later convert it into NumPy arrays. 

In [None]:
import pandas as pd

quality_of_life = pd.read_csv('../data/quality_of_life_index.csv')

First, let's explore what is in this file:

In [None]:
quality_of_life

From [Numbeo](https://www.numbeo.com/quality-of-life/indices_explained.jsp):

    Quality of Life Index (higher is better) is an estimation of overall quality of life by using an empirical formula which takes into account purchasing power index (higher is better), pollution index (lower is better), house price to income ratio (lower is better), cost of living index (lower is better), safety index (higher is better), health care index (higher is better), traffic commute time index (lower is better) and climate index (higher is better).

For more details on the other indices, check Numbeo's website.

## 3. Array properties

At this point, because we used pandas to obtain the data from the csv files, what we have are objects called [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):

In [None]:
type(quality_of_life)

To create a NumPy array from this data, you can use the `np.array()` function:

In [None]:
data = np.array(quality_of_life)
data

The last line contains the expression `dtype=object`. As we mentioned before, what makes NumPy arrays efficient is the fact that they contain homogeneous data, meaning that every entry has the same data type. Because we tried creating an array with different data types (integer numbers in the first column, character strings in the second, floating point numbers for the others), the only possible way for this array to contain a homogeneous set of data is by using the `object` data type, which is very general and includes all the data described. However, this is clearly as inefficient as using a plain Python list (maybe even worse!).

To fix this, let's select only a subset of the data to put in our NumPy array.

In [None]:
quality_index = np.array(quality_of_life['Quality of Life Index'])
quality_index

Now, you can check the data type of this new array by querying the `dtype` property of this array:

In [None]:
quality_index.dtype

As expected, the data type for the `quality_index` array is now `float64`, or 64-bit floating point numbers.
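The same effect can be reproduced without the csv file. The sketch below uses hypothetical rows (the city names and numbers are invented) and builds the mixed table directly in NumPy rather than through pandas:

```python
import numpy as np

# Hypothetical rows mixing an integer rank, a city name and a float index.
table = np.array([[1, "City A", 190.5],
                  [2, "City B", 185.2]], dtype=object)
print(table.dtype)  # object

# Selecting only the numeric column and converting it gives a float array.
scores = table[:, 2].astype(np.float64)
print(scores.dtype)  # float64
print(scores)        # [190.5 185.2]
```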

Now, let's look at the "Rank" column of our `quality_of_life` DataFrame:

In [None]:
rank = np.array(quality_of_life['Rank'])
rank

In [None]:
rank.dtype

Here we have integer data, but note that we could have selected a different data type to represent these items by using the `dtype` keyword when calling the `np.array()` function. The available data types are listed in the NumPy dtype documentation.

In [None]:
rank_float = np.array(rank, dtype=np.float64)
rank_float

### Shape and size

At this point, we may also be interested in checking for the physical layout of this array, like its size. If you are used to Python lists, you may be tempted to apply the `len` function to this array:

In [None]:
len(quality_index)

However, there are specific properties provided by NumPy for this information:

- `.ndim` will tell you the number of axes, or dimensions, of the array.
- `.size` will tell you the total number of elements of the array. This is the product of the elements of the array’s shape.
- `.shape` will display a tuple of integers that indicate the number of elements stored along each dimension of the array. If, for example, you have a 2-D array with 2 rows and 3 columns, the shape of your array is (2, 3).

In [None]:
quality_index.ndim

In [None]:
quality_index.size

In [None]:
quality_index.shape

Note that we could also have selected a 2-dimensional subset of the data, by choosing to represent two columns of the initial DataFrame as a NumPy array:

In [None]:
quality_safety = np.array(quality_of_life[['Quality of Life Index', 'Safety Index']])

In [None]:
quality_safety

Now:

In [None]:
len(quality_safety)

In [None]:
quality_safety.ndim

In [None]:
quality_safety.size

In [None]:
quality_safety.shape

In this case, the `len` function only reports the size of the first axis (the number of rows), so it no longer describes the whole array. NumPy arrays require a bit more information, such as `.shape` and `.size`, to be understood correctly.
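A small made-up array shows how `len` relates to the NumPy properties for two-dimensional data:

```python
import numpy as np

pairs = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

print(len(pairs))   # 3: only the length of the first axis
print(pairs.ndim)   # 2
print(pairs.shape)  # (3, 2)
print(pairs.size)   # 6: the product of the entries of the shape, 3 * 2
```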

### Exercises

## 4. Basic indexing and manipulation

In the original `quality_of_life` table, the data corresponding to the city of Lagos in Nigeria had index 244. Let's consult its data from the current `quality_safety` array:

In [None]:
quality_safety[244]

Because of the way we built this array, at position 244 we have two items: the Quality of Life Index in the first position, and the Safety Index in the second position. We can access those individually by using the following syntax:

In [None]:
quality_safety[244, 0]

(Note that it is not necessary to separate each dimension’s index into its own set of square brackets.)
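To illustrate with a small stand-in array (the values are invented), the comma-separated form and the chained form return the same element, and a slice in the first position selects a whole column:

```python
import numpy as np

values = np.array([[10.0, 11.0],
                   [20.0, 21.0],
                   [30.0, 31.0]])

single = values[1, 0]        # row 1, column 0 in one indexing step
chained = values[1][0]       # same element: select row 1, then item 0
first_column = values[:, 0]  # every row, column 0

print(single, chained)  # 20.0 20.0
print(first_column)     # [10. 20. 30.]
```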

We can also access the extremities of our array:

In [None]:
quality_safety[0]

Just like for regular Python lists, we can use the index `-1` to select the last element of our array:

In [None]:
quality_safety[-1]

In [None]:
quality_safety.shape

Note that because this is a two-dimensional array, the shape is a 2-item tuple. 

Because the indexing starts at 0, a shape of `(247, 2)` means an index range from 0 to 246, and if we try to access index 247 this will raise an `IndexError`:

```ipython
quality_safety[247]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/3w/490kdvj917n7kpxjpx14zztw0000gq/T/ipykernel_36431/3641354653.py in <module>
----> 1 quality_safety[247]

IndexError: index 247 is out of bounds for axis 0 with size 247
```
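The same error can be reproduced, and caught, with any array. This sketch uses a made-up array of zeros with 5 rows, where the last valid index along the first axis is 4:

```python
import numpy as np

values = np.zeros((5, 2))

# The last valid index along axis 0 is one less than its length.
last_valid = values.shape[0] - 1

try:
    values[values.shape[0]]  # index 5 does not exist
except IndexError as err:
    message = str(err)

print(last_valid)  # 4
print(message)
```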

### Exercises

## 5. Operations and built-in utilities

Arrays can be operated on as single objects, simplifying the way we manipulate data. Say we wish to increase every item in the array by 1 unit. We could do:

In [None]:
for i in range(quality_index.shape[0]):
    quality_index[i] += 1

In general, looping through an array is not necessary or efficient. Because of the vectorization properties of NumPy arrays, we could instead do:

In [None]:
quality_index += 1

This is valid for other operations as well.

In [None]:
2*quality_index
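As a quick check on made-up data, the explicit loop and the vectorized operation give the same result:

```python
import numpy as np

# Element-by-element update with an explicit loop.
looped = np.array([1.0, 2.0, 3.0])
for i in range(looped.shape[0]):
    looped[i] += 1

# The same update applied to the whole array at once.
vectorized = np.array([1.0, 2.0, 3.0])
vectorized += 1

doubled = 2 * vectorized

print(np.array_equal(looped, vectorized))  # True
print(doubled)                             # [4. 6. 8.]
```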

We can also perform operations on pairs of arrays:

In [None]:
quality_index+rank

There are subtleties when dealing with arrays of different shapes, and we will discuss this later.

Beware: the multiplication symbol `*` means element-by-element multiplication:

In [None]:
product = rank*quality_index

In [None]:
rank.shape

In [None]:
quality_index.shape

In [None]:
product.shape

To perform a matrix product operation, you can use the operator `@`. For two 1-D arrays such as these, the result is their inner product, a single number:

In [None]:
rank@quality_index

For 1-D arrays, this is the same as:

In [None]:
np.dot(rank, quality_index)
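The difference between `*` and `@` is easier to see on small arrays. The arrays `u`, `v` and `m` below are made up for this sketch:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4.0, 5.0, 6.0])

elementwise = u * v  # same shape as the inputs
inner = u @ v        # 1*4 + 2*5 + 3*6, a single number
same = np.dot(u, v)

# For 2-D arrays, @ is the usual row-by-column matrix product.
m = np.array([[1, 2],
              [3, 4]])
matrix_product = m @ m

print(elementwise)     # [ 4. 10. 18.]
print(inner, same)     # 32.0 32.0
print(matrix_product)  # [[ 7 10]
                       #  [15 22]]
```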

There are several built-in utilities that can be applied to a NumPy array. For example, we can compute the maximum and minimum values of an array using

In [None]:
np.max(quality_index), np.min(quality_index)

Note that we can also pick which axes to compute the maximum or minimum for:

In [None]:
np.max(quality_safety, axis=0)

In [None]:
np.max(quality_safety, axis=1)

Other functions include:

In [None]:
np.mean(quality_index)

In [None]:
np.sum(quality_index)

Note also that, if our array has two axes over which to compute the sum, we can tell `sum` which one to use with the `axis` keyword:

In [None]:
quality_safety.shape

Here, `axis=0` sums down the rows, giving one total per column:

In [None]:
np.sum(quality_safety, axis=0)

While `axis=1` sums across the columns, giving one total per row:

In [None]:
np.sum(quality_safety, axis=1)
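On a small made-up array with 3 rows and 2 columns, the effect of the `axis` keyword can be verified by hand:

```python
import numpy as np

small = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])  # shape (3, 2)

column_totals = np.sum(small, axis=0)  # collapses the rows
row_totals = np.sum(small, axis=1)     # collapses the columns
column_max = np.max(small, axis=0)

print(column_totals)  # [ 9 12]
print(row_totals)     # [ 3  7 11]
print(column_max)     # [5 6]
```

The axis passed to the function is the one that disappears from the result: `axis=0` leaves one value per column, and `axis=1` leaves one value per row.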

### Exercises

### Adding, removing and sorting elements

Let's gather more data from our initial dataframe:

In [None]:
cost_index = np.array(quality_of_life['Cost of Living Index'])

Suppose we wish to choose a city with a high quality of life index and a low cost of living index. A (perhaps very simplistic) way of computing this is by computing the ratio of these indexes. The higher the ratio, the closer the city is to our goal. 

In [None]:
ratio = quality_index/cost_index

Now, we can try and organize this data by *stacking* the existing arrays, meaning that we will create a 2-dimensional array from these four arrays. We can do that by using `np.stack()`:

In [None]:
new_index = np.stack((rank, quality_index, cost_index, ratio))

This works, but note the shape of the result:

In [None]:
new_index.shape

It would be more natural to organize these by columns, just like in the original dataframe. Now, we could look at the *transpose* of this 2-dimensional array:

In [None]:
new_index.T

However, looking at the `np.stack` docstring might give us another idea:

In [None]:
np.stack?

Using the `axis` keyword to denote that we want to stack columns (`axis=1`) together, we can get the desired shape:

In [None]:
new_index = np.stack((rank, quality_index, cost_index, ratio), axis=1)

In [None]:
new_index.shape

In [None]:
new_index
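The two stacking layouts can be compared on a pair of short made-up arrays. With the default `axis=0` each input becomes a row, and with `axis=1` each input becomes a column:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

as_rows = np.stack((a, b))             # default axis=0
as_columns = np.stack((a, b), axis=1)

print(as_rows.shape)     # (2, 3)
print(as_columns.shape)  # (3, 2)

# Stacking along axis=1 matches the transpose of the default stack.
print(np.array_equal(as_rows.T, as_columns))  # True
```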

Now that we have computed the `quality_of_life/cost_of_living` ratio, we can sort the data by this ratio. To do that, we can use the `np.sort()` function from NumPy:  

In [None]:
np.sort(ratio)

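We now have the ratios sorted, but we can't identify which city has the best ratio. We can use `np.argsort()` to help with that: it returns the indices that would sort the array in ascending order, so the last index in the result belongs to the city with the highest ratio.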

In [None]:
np.argsort(ratio)

In [None]:
ratio[132]

In [None]:
quality_of_life.iloc[132]
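Rather than reading the index off the printed output, the position of the best value can be taken from the result of `np.argsort()` directly. The sketch below uses invented scores instead of the real ratios:

```python
import numpy as np

scores = np.array([2.5, 0.7, 3.1, 1.9])

order = np.argsort(scores)  # indices that sort the array in ascending order
best = order[-1]            # index of the largest value
best_alt = np.argmax(scores)

# Reversing the order lists the values from highest to lowest.
descending = scores[order[::-1]]

print(order)           # [1 3 0 2]
print(best, best_alt)  # 2 2
print(descending)      # [3.1 2.5 1.9 0.7]
```

The same index could then be passed to `quality_of_life.iloc[...]` to look up the corresponding row.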

## Plotting

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(np.arange(len(cost_index)), cost_index)

## Saving data for later use with `np.savez`

In [None]:
np.savez?

In [None]:
np.savez('cost_index.npz', cost_index)

In [None]:
!ls
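To read the data back, `np.load` can be used on the saved file. In this sketch the values, the temporary directory and the keyword name `cost_index` are assumptions made for the example. Arrays passed positionally to `np.savez` are stored under the names `arr_0`, `arr_1`, and so on, while keyword arguments give them explicit names:

```python
import os
import tempfile
import numpy as np

cost = np.array([55.2, 71.9, 38.4])  # made-up values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cost_index.npz")

    # One array saved positionally (arr_0) and once more under a chosen name.
    np.savez(path, cost, cost_index=cost)

    # np.load returns an archive; each array is retrieved by its name.
    with np.load(path) as archive:
        names = sorted(archive.files)
        restored = archive["cost_index"]

print(names)                           # ['arr_0', 'cost_index']
print(np.array_equal(restored, cost))  # True
```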

---

## Read more

- [NumPy functions and methods overview](https://numpy.org/devdocs/user/quickstart.html#functions-and-methods-overview)
- [NumPy Quickstart guide](https://numpy.org/devdocs/user/quickstart.html)
- [NumPy for absolute beginners](https://numpy.org/devdocs/user/absolute_beginners.html)

## Next

Go to [Notebook 2: How to write efficient code with NumPy](02_How_to_write_efficient_code.ipynb).