# Introduction

In this notebook, you will learn about:
- Getting data from a csv file
- Basic indexing and processing
- Exploring array properties

---

## 1. What is NumPy?

NumPy (Numerical Python) is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The NumPy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

The NumPy library contains multidimensional array and matrix data structures (you’ll find more information about this in later sections). It provides ndarray, a homogeneous n-dimensional array object, with methods to efficiently operate on it. NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

Learn more about NumPy [here](https://numpy.org/devdocs/user/whatisnumpy.html#whatisnumpy).

### Getting started

To access NumPy and its functions, import it in your Python code like this:

In [None]:
import numpy as np

We use the widely adopted convention to shorten `numpy` to `np` in our
examples. The shortened version makes code easier to read. 

### What's the difference between a Python list and a NumPy array?

NumPy gives you fast and efficient ways to create arrays and process
numerical data. A Python list can contain different data types
within a single list, but all of the elements in a NumPy array should be
homogeneous. The mathematical operations that are meant to be performed
on arrays would be extremely inefficient if the arrays weren’t
homogeneous.

#### Homogeneous list and array

`my_list = [1, 2, 3]` and `my_array = np.array([1, 2, 3])`

#### Inhomogeneous list

`my_inhomogenous_list = [1, '2', 'hello']`
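
As a minimal sketch of what homogeneity means in practice, the snippet below converts both lists to arrays. The name `coerced` is illustrative only. When the mixed list is converted, NumPy picks one common type for every element, which in this case is a string type:

```python
import numpy as np

my_list = [1, 2, 3]
my_inhomogenous_list = [1, '2', 'hello']

# A homogeneous list of integers becomes an integer array.
my_array = np.array(my_list)
print(my_array.dtype.kind)    # 'i' stands for signed integer

# A mixed list is coerced to one common type: every element becomes a string.
coerced = np.array(my_inhomogenous_list)
print(coerced.dtype.kind)     # 'U' stands for Unicode string
print(coerced)
```

The integer `1` is no longer a number inside `coerced`, so arithmetic on that array would not behave like arithmetic on `my_array`.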

### Why use NumPy?

NumPy arrays are faster and more compact than Python lists. Because every element has the same type, an array stores its data in far less memory than an equivalent list, and NumPy lets you specify the data type explicitly. This allows the code to be optimized even further.
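
The memory claim can be checked with a rough sketch like the one below. The sizes for the list depend on the Python implementation and platform (the comment assumes CPython), and the variable names are illustrative:

```python
import sys
import numpy as np

n = 1000
as_list = list(range(n))
as_int64 = np.arange(n, dtype=np.int64)
as_int8 = np.zeros(n, dtype=np.int8)

# nbytes counts only the raw element storage: n * itemsize.
print(as_int64.itemsize, as_int64.nbytes)   # 8 bytes per element, 8000 in total
print(as_int8.itemsize, as_int8.nbytes)     # 1 byte per element, 1000 in total

# For the list, count the list object plus every int object it refers to.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
print(list_bytes)   # platform dependent; well above 8000 on CPython
```

Choosing a smaller dtype such as `np.int8` reduces storage further, at the cost of a narrower range of representable values.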

### What is an array?

An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype. In a sense, arrays can be seen as generalized vectors or matrices commonly used in mathematics.

An array can be indexed by a single integer, by a tuple of integers, by booleans, or by another array. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.
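
The different ways of indexing, together with rank and shape, can be sketched on a small array. The array `a` below is an illustrative example built with `np.array()`:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

row = a[1]             # a single integer selects a row: 5, 6, 7, 8
element = a[1, 2]      # a tuple of integers selects one element: 7
masked = a[a > 10]     # a boolean array keeps the matching elements: 11, 12
picked = a[[0, 2]]     # an integer list or array selects rows 0 and 2

print(row, element, masked)
print(picked)
print(a.ndim)          # 2, the number of dimensions (the rank)
print(a.shape)         # (3, 4), the size along each dimension
```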

We can initialize NumPy arrays from Python lists with the `np.array()`
function. Nested lists create two- or higher-dimensional data. 

For example:

In [None]:
a = np.array([1, 2, 3, 4, 5, 6])

or

In [None]:
a = np.array([[1, 2, 3, 4], 
              [5, 6, 7, 8], 
              [9, 10, 11, 12]])

We use square brackets to access the elements in the array. When you’re accessing elements, remember that indexing in NumPy starts at 0. That means that if you want to access the first element in your array, you’ll be accessing element “0”.

In [None]:
print(a[0])

An array is often referred to as an “ndarray,” which is shorthand for
“N-dimensional array.” An N-dimensional array is an array with any
number of dimensions; for example, a 1-D array _or vector_ has a single
dimension and a 2-D array _or matrix_ has two dimensions. The NumPy
ndarray class is used to represent both matrices and vectors. For 3-D or
higher dimensional arrays, the term tensor is also commonly used.

<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" alt="Image showing 3 diagrams representing arrays of different shapes and sizes. First, there is a 1D array, represented by four squares stacked horizontally, an indication that this represents axis 0, and shape (4,). In the center, there is a representation of a 2D array, with 6 squares organized in a 2 rows, 3 columns matrix. It shows axis 0 and axis 1 and shape (2, 3). Finally, a diagram of a 3D array, showing a 'cube' of data with shape (4, 3, 2)- 4 items in axis 0, 3 items in axis 1 and 2 items in axis 2." width="800" align="middle"/>

*Image credits: [Elegant SciPy, O'Reilly](https://github.com/elegant-scipy/elegant-scipy)*

### What are the attributes of an array?

An array is usually a fixed-size container of items of the same type and
size. The number of dimensions and items in an array is defined by its
shape. The shape of an array is a tuple of non-negative integers that
specify the sizes of each dimension.

In NumPy, dimensions are called axes. This means that if you have a 2D array that looks like this:

```python
np.array(
[[0., 0., 0.],
 [1., 1., 1.]]
 )
```

Your array has 2 axes. The first axis has a length of 2 and the second axis has a length of 3.

> **Note** Just like in other Python container objects, the contents of
> an array can be accessed and modified by indexing or slicing the
> array. Unlike the typical container objects, different arrays can
> share the same data, so changes made on one array might be visible in
> another.
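
A minimal sketch of the shared-data behaviour described in the note, using illustrative names: a basic slice is a view on the same data, while `.copy()` creates independent storage.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

view = a[:3]                  # basic slicing returns a view on the same data
view[0] = 99                  # so this write is visible through `a` as well
print(a[0])                   # 99

independent = a[:3].copy()    # .copy() allocates separate storage
independent[1] = -1
print(a[1])                   # still 2, the copy does not affect `a`

print(np.shares_memory(a, view))          # True
print(np.shares_memory(a, independent))   # False
```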

Array attributes reflect information intrinsic to the array itself. If
you need to get, or even set, properties of an array without creating a
new array, you can often access an array through its attributes.

## 2. A problem to explore

From [Wikipedia](https://en.wikipedia.org/wiki/Cost_of_living):
    
    Cost of living is the cost of maintaining a certain standard of living. Changes in the cost of living over time are often operationalized in a cost-of-living index. Cost of living calculations are also used to compare the cost of maintaining a certain standard of living in different geographic areas. Differences in cost of living between locations can also be measured in terms of purchasing power parity rates. 
    
From [Numbeo](https://www.numbeo.com), it is possible to obtain data about the cost of living and quality of life indices for several cities across the world as a `.csv` (comma separated values) file. We will explore this data using NumPy and its array processing capabilities.

### Getting data from a `.csv` file

CSV files are standard in data analysis and other applications. We want
to process data as arrays. We will use the
[pandas library](https://pandas.pydata.org/), which is the industry
standard, to obtain the data and later convert it into NumPy arrays.

In [None]:
import pandas as pd

quality_of_life = pd.read_csv('../data/quality_of_life_index.csv')

First, let's explore what is in this file: 

In [None]:
quality_of_life

From [Numbeo](https://www.numbeo.com/quality-of-life/indices_explained.jsp):

    Quality of Life Index (higher is better) is an estimation of overall quality of life by using an empirical formula which takes into account purchasing power index (higher is better), pollution index (lower is better), house price to income ratio (lower is better), cost of living index (lower is better), safety index (higher is better), health care index (higher is better), traffic commute time index (lower is better) and climate index (higher is better).

For more details on the other indices, check Numbeo's website.

## 3. Array properties

We can identify this `quality_of_life` object by inspecting its type:

In [None]:
type(quality_of_life)

At this point, because we used pandas to obtain the data from the CSV file, what we have is an object called a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

Because we want to work with NumPy arrays, we can convert this data using the `np.array()` function like we did earlier for lists:

In [None]:
data = np.array(quality_of_life)
data

### Data type

The last line in the previous output contains the expression `dtype=object`. As we mentioned before, what makes NumPy arrays efficient is the fact that they contain homogeneous data, meaning that every entry has the same data type. Because we tried creating an array with different data types (integer numbers in the first column, character strings in the second, floating point numbers for the others), the only possible way for this array to contain a homogeneous set of data is by using the `object` data type, which is very general and includes all the data described. However, this is as inefficient as using a plain Python list (maybe even worse!).
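
The effect can be reproduced without the dataset. In the sketch below, the rows and city names are made up, and `dtype=object` is passed explicitly to mimic the kind of array the mixed-type conversion produces:

```python
import numpy as np

# Hypothetical rows: rank (int), city (str), index value (float).
mixed = np.array([[1, "City A", 190.5],
                  [2, "City B", 180.0]], dtype=object)
print(mixed.dtype)     # object
# Each entry is still an ordinary Python object of its original type.
print(type(mixed[0, 0]), type(mixed[0, 1]), type(mixed[0, 2]))

# Selecting only the numeric column and converting it gives a typed array.
values = mixed[:, 2].astype(np.float64)
print(values.dtype)    # float64
print(values.mean())   # 185.25
```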

To fix this, let's select only a subset of the data to put in our NumPy array. Using the pandas syntax, this can be done by, for example, choosing the `'Quality of Life Index'` column from the pandas DataFrame, and converting it into a NumPy array:

In [None]:
quality_index = np.array(quality_of_life['Quality of Life Index'])
quality_index

Now, you can check the data type of this new array by querying the `dtype` property of this array:

In [None]:
quality_index.dtype

As expected, the data type for the `quality_index` array is now `float64`, or 64-bit floating point numbers.

Now, check another subset of this DataFrame, the `Rank` column:

In [None]:
rank = np.array(quality_of_life['Rank'])
rank

In [None]:
rank.dtype

Here we have integer data, but note that we could have selected a different data type to represent these items by using the `dtype` keyword when calling the `np.array()` function. The available data types are listed in NumPy's [Data type objects](https://numpy.org/devdocs/reference/arrays.dtypes.html) documentation. For example, we can try the `np.float64` data type:

In [None]:
rank_float = np.array(rank, dtype=np.float64)
rank_float

In [None]:
rank_float.dtype

### Shape and size

At this point, we may also be interested in checking for the physical layout of this array, like its shape and size. If you are used to Python lists, you may be tempted to apply the `len` function to this array:

In [None]:
len(quality_index)

However, there are specific properties provided by NumPy for this information:

- `.ndim` will tell you the number of axes, or dimensions, of the array.
- `.size` will tell you the total number of elements of the array. This is the product of the elements of the array’s shape.
- `.shape` will display a tuple of integers that indicate the number of elements stored along each dimension of the array. If, for example, you have a 2-D array with 2 rows and 3 columns, the shape of your array is (2, 3).

In [None]:
quality_index.ndim

In [None]:
quality_index.size

In [None]:
quality_index.shape

We could have selected a 2-dimensional subset of the data, by choosing to represent two columns of the initial DataFrame as a NumPy array:

In [None]:
quality_safety = np.array(quality_of_life[['Quality of Life Index', 'Safety Index']])

In [None]:
quality_safety

Now, for this two-dimensional array, we have:

In [None]:
len(quality_safety)

In [None]:
quality_safety.ndim

In [None]:
quality_safety.size

In [None]:
quality_safety.shape

In this case, the `len` function is no longer sufficient: it reports only the length of the first axis (the number of rows). NumPy arrays require a bit more information to be understood correctly, which is what `ndim`, `size` and `shape` provide.
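
The relationship between `len` and the array attributes can be sketched on a small, made-up 2-dimensional array (the values and the name `table` are illustrative):

```python
import numpy as np

# Hypothetical array: 4 rows (cities) and 2 columns (indices).
table = np.array([[190.5, 75.0],
                  [180.0, 60.2],
                  [150.3, 55.1],
                  [120.8, 40.9]])

print(len(table))     # 4, the length of the first axis only
print(table.shape)    # (4, 2)
print(table.ndim)     # 2
print(table.size)     # 8, the product of the entries of shape
```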

---

#### Self-assessment 1

---

### Creating arrays

You can create arrays [in several ways](https://numpy.org/devdocs/user/basics.creation.html). We have seen above how to create an array from a list, but we could also create an array with zeros in every entry, by choosing its shape:

In [None]:
z = np.zeros((2, 3))
z

Similarly, we can create arrays containing only the value `1`:

In [None]:
all_ones = np.ones((2, 6, 3))
all_ones

We can specify the dtype. If we wanted `all_ones` to be an integer array:

In [None]:
all_ones = np.ones((2, 6, 3), dtype=np.int64)
all_ones

It is often useful to create arrays from regularly incrementing values. We can do that with the `np.arange()` function:

In [None]:
np.arange?

We only need to specify the `stop` parameter, but we can be more specific by using the `start` and `step` parameters as needed. For example, to create an array of the even numbers from 10 up to (but not including) 20, we can write:

In [None]:
np.arange(10, 20, 2)

> **Note**: The `stop` value is not included in the output.
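
A short sketch of how `start`, `stop` and `step` interact, with illustrative variable names:

```python
import numpy as np

only_stop = np.arange(5)             # start defaults to 0, step to 1
evens = np.arange(10, 20, 2)         # 20 itself is excluded
evens_incl = np.arange(10, 21, 2)    # move stop past 20 to include it

print(only_stop)     # 0, 1, 2, 3, 4
print(evens)         # 10, 12, 14, 16, 18
print(evens_incl)    # 10, 12, 14, 16, 18, 20
```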

---

#### Self-assessment 2

---

## 4. Basic indexing and processing

In the original `quality_of_life` table, the data corresponding to the city of Lagos in Nigeria had index 244:

In [None]:
quality_of_life

To consult data corresponding to Lagos from the current `quality_safety` array, we will access the entry with index 244:

In [None]:
quality_safety[244]

Take some time to check that the values obtained here are as expected from the DataFrame above.

Because of the way we built this array, at position 244 we have two items: the Quality of Life Index in the first position, and the Safety Index in the second position:

In [None]:
quality_safety

This means that we need two indices in order to access one individual entry, one for each axis. For example, to get the Quality of Life Index located at row 244 and column 0, we can do

In [None]:
quality_safety[244, 0]

> **Note**: It is not necessary to separate each dimension's index
> into its own set of square brackets, but it is allowed e.g.
> `quality_safety[244][0]`
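
Both forms of two-index access can be compared on a small stand-in array (the values and the name `pairs` are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in for a two-column array.
pairs = np.array([[190.5, 75.0],
                  [180.0, 60.2],
                  [150.3, 55.1]])

single_step = pairs[1, 0]    # row 1, column 0 in one indexing operation
two_steps = pairs[1][0]      # first extract row 1, then take its element 0
row = pairs[1]               # the intermediate 1-D row used by the second form

print(single_step, two_steps)   # 180.0 180.0
print(row.shape)                # (2,)
```

The single-bracket form is usually preferred, since it avoids creating the intermediate row array.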

Because of the nested nature of an ndarray, you can think of each row in the `quality_safety` array as a 1D array with two elements. Indeed:

In [None]:
quality_safety[0]

Just like for regular Python lists, we can use the index `-1` to select the last entry in our array:

In [None]:
quality_safety[-1]

This is a two-dimensional array so the shape is a 2-item tuple. 

In [None]:
quality_safety.shape

Because the indexing starts at 0, a shape of `(247, 2)` means an index range from 0 to 246, and if we try to access index 247 this will raise an `IndexError`:

```ipython
quality_safety[247]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/3w/490kdvj917n7kpxjpx14zztw0000gq/T/ipykernel_36431/3641354653.py in <module>
----> 1 quality_safety[247]

IndexError: index 247 is out of bounds for axis 0 with size 247
```
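
The same bounds rule can be sketched on a tiny array, where the name `small` and its shape are illustrative: valid indices along an axis run from 0 to the size of that axis minus 1.

```python
import numpy as np

small = np.zeros((3, 2))             # hypothetical array with 3 rows
last_valid = small.shape[0] - 1      # 2 is the last valid row index
print(small[last_valid])
print(small[-1])                     # -1 refers to the same last row

try:
    small[small.shape[0]]            # index 3 is one past the end
except IndexError as err:
    message = str(err)
    print(message)                   # reports the index, the axis and its size
```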

---

#### Self-assessment 3

---

### Operations between arrays

Let's gather more data from our initial dataframe:

In [None]:
cost_index = np.array(quality_of_life['Cost of Living Index'])

Suppose we wish to choose a city with a high quality of life index and a low cost of living index. A (perhaps very simplistic) way of computing this is by computing the ratio of these indexes. The higher the ratio, the closer the city is to our goal. 

We could do: 

In [None]:
ratio = np.zeros(cost_index.shape)
for i in range(quality_index.shape[0]):
    ratio[i] = quality_index[i]/cost_index[i]

In [None]:
ratio

In general, looping through an array is not necessary or efficient. Because of the vectorization properties of NumPy arrays, we could instead operate on these arrays as single objects, simplifying the way we process the data:

In [None]:
ratio = quality_index/cost_index

This is valid for other operations as well.

In [None]:
2*quality_index

> **Note**: There are subtleties when dealing with arrays of different
> shapes, and we will discuss this later when we talk about
> Broadcasting.
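
That the explicit loop and the vectorized expression give the same result can be checked on small arrays. The values below are made up, and `quality` and `cost` stand in for the dataset columns:

```python
import numpy as np

# Hypothetical index values for three cities.
quality = np.array([190.0, 150.0, 120.0])
cost = np.array([95.0, 50.0, 80.0])

# Explicit loop, one element at a time.
ratio_loop = np.zeros(cost.shape)
for i in range(quality.shape[0]):
    ratio_loop[i] = quality[i] / cost[i]

# Vectorized: the division is applied to every pair of elements at once.
ratio_vec = quality / cost

print(ratio_vec)                               # values 2.0, 3.0, 1.5
print(np.array_equal(ratio_loop, ratio_vec))   # True
```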

**Beware**: Arithmetic operators, including the multiplication operator `*`, act element by element on arrays.


In [None]:
product = rank*quality_index

In [None]:
rank.shape

In [None]:
quality_index.shape

In [None]:
product.shape

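To perform a matrix product operation instead, you can use the operator `@`. For two 1-dimensional arrays like these, the result is their inner product, a single number: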

In [None]:
rank@quality_index

This is the same as 

In [None]:
np.dot(rank, quality_index)
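
The difference between `*` and `@` can be sketched with two short 1-dimensional arrays (illustrative values and names):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([10.0, 20.0, 30.0])

elementwise = u * v    # same shape as the inputs: 10, 40, 90
inner = u @ v          # for 1-D arrays, @ gives the inner product: 140.0

print(elementwise.shape)              # (3,)
print(inner)                          # 140.0
print(inner == elementwise.sum())     # True
print(inner == np.dot(u, v))          # True
```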

---

#### Self-assessment 4

---

### Adding, removing and sorting elements

Now, we can try to organize this data by *stacking* the existing arrays, meaning that we will create a 2-dimensional array from the four 1-dimensional arrays `rank`, `quality_index`, `cost_index` and `ratio`. We can do that by using `np.stack()`:

In [None]:
new_index = np.stack((rank, quality_index, cost_index, ratio))

This works, but note that

In [None]:
new_index.shape

It would be more natural to organize these by columns, just like in the original dataframe. Now, we could look at the *transpose* of this 2-dimensional array:

In [None]:
new_index.T

However, looking at the `np.stack` docstring might give us another idea:

In [None]:
np.stack?

Using the `axis` keyword to denote that we want to stack columns
(`axis=1`) together, we can get the desired shape:

In [None]:
new_index = np.stack((rank, quality_index, cost_index, ratio), axis=1)

In [None]:
new_index.shape

In [None]:
new_index
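
The effect of the `axis` keyword can be seen on small made-up arrays. The names mirror the ones used above, but the values are hypothetical:

```python
import numpy as np

# Three hypothetical 1-D arrays of equal length (4 cities each).
rank = np.array([1, 2, 3, 4])
quality = np.array([190.0, 180.0, 150.0, 120.0])
cost = np.array([95.0, 60.0, 50.0, 80.0])

by_rows = np.stack((rank, quality, cost))           # default is axis=0
by_cols = np.stack((rank, quality, cost), axis=1)

print(by_rows.shape)    # (3, 4): one row per input array
print(by_cols.shape)    # (4, 3): one column per input array
print(np.array_equal(by_rows.T, by_cols))   # True
print(by_cols.dtype)    # float64: the integer ranks are promoted to float
```

Note that the stacked array has a single dtype, so the integer ranks are stored as floating point numbers.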

Now that we have computed the `quality_of_life/cost_of_living` ratio, we can sort the data by this ratio. To do that, we can use the `np.sort()` function from NumPy:  

In [None]:
np.sort(ratio)

We now have the ratios sorted, but we can't identify which city would have the best ratio. We can use `np.argsort()` to help with that:

In [None]:
np.argsort(ratio)

The result shows the original indices of the sorted elements in our array. Because the sort is in ascending order, the last entry is the index of the largest ratio, which corresponds to the last element in the sorted `ratio` array. That element originally had index 132:

In [None]:
ratio[132]

Now, using the pandas `iloc` syntax, we can recover from our original dataset which city this is:

In [None]:
quality_of_life.iloc[132]
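
How `np.sort()` and `np.argsort()` relate can be sketched on a small array of made-up ratios:

```python
import numpy as np

ratio = np.array([2.0, 3.0, 1.5, 2.5])   # hypothetical ratios

order = np.argsort(ratio)    # indices that sort `ratio` in ascending order
print(order)                 # indices 2, 0, 3, 1
print(ratio[order])          # same values as np.sort(ratio)

smallest_idx = order[0]      # first entry: index of the smallest ratio
largest_idx = order[-1]      # last entry: index of the largest ratio
print(smallest_idx, largest_idx)   # 2 1
```

Indexing the original data with `largest_idx` then recovers the row with the highest ratio.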

---

#### Self-assessment 5

---


## Read more

- Some of the content in this page comes from the [NumPy for absolute beginners](https://numpy.org/devdocs/user/absolute_beginners.html) tutorial in the NumPy documentation.
- [NumPy functions and methods overview](https://numpy.org/devdocs/user/quickstart.html#functions-and-methods-overview)
- [NumPy Quickstart guide](https://numpy.org/devdocs/user/quickstart.html)

## Next

Go to [Notebook 2: How to write efficient code with NumPy](02_How_to_write_efficient_code.ipynb).