# Python: The numpy library

**Goal**: manipulate matrices or multidimensional arrays with the numpy package!

## Introduction to numpy

Numpy is one of the fundamental packages for scientific computing in Python. The numpy library (http://www.numpy.org/) lets you perform numerical calculations in Python and makes it easy to work with arrays of numbers. To use numpy, you must first import the package with the instruction ``` import numpy ```. By convention, it is imported under the alias ```np```.

In [1]:
import numpy as np

## Arrays with numpy

In this section, we will discover numpy arrays. The main data structure in numpy is the ``` ndarray ```. Arrays can be created with ``` numpy.array() ```. Square brackets delimit the lists of elements passed to it. In summary, a vector is created from a list and a matrix is created from a list of lists.

In [2]:
vector = np.array([1, 2, 3, 4, 5])
print(vector)

[1 2 3 4 5]


In [3]:
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(matrix)

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]


## Size of an array

When manipulating matrices (matrix multiplication, for example), it is very useful to know the size of the array; otherwise, you may get errors caused by mismatched dimensions. The array attribute ``` ndarray.shape ``` gives this information. For a vector, shape returns a tuple with one element, the number of values. For a matrix, it returns a tuple with two elements: the number of rows and the number of columns.

In [4]:
vector.shape

(5,)

In [5]:
matrix.shape

(3, 3)
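
As a small illustration of why the shape matters, the sketch below checks whether two arrays can be multiplied before doing so. The helper name `can_multiply` is hypothetical (it is not part of numpy), and the arrays are made up for the example.

```python
import numpy as np

def can_multiply(a, b):
    """Return True if the 2-D matrix product a @ b is defined (hypothetical helper)."""
    return a.ndim == 2 and b.ndim == 2 and a.shape[1] == b.shape[0]

a = np.ones((3, 2))   # 3 rows, 2 columns
b = np.ones((2, 4))   # 2 rows, 4 columns

ok = can_multiply(a, b)        # inner dimensions match: 2 == 2
not_ok = can_multiply(b, a)    # inner dimensions differ: 4 != 3

product_shape = (a @ b).shape  # (rows of a, columns of b)
print(ok, not_ok, product_shape)
```

The product has as many rows as the first array and as many columns as the second one.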

## Read dataset using numpy

To read a dataset directly with the numpy library, we use the numpy ``` genfromtxt() ``` function.

In [6]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",")
world_alcohol

array([[      nan,       nan,       nan,       nan,       nan],
       [1.986e+03,       nan,       nan,       nan, 0.000e+00],
       [1.986e+03,       nan,       nan,       nan, 5.000e-01],
       ...,
       [1.986e+03,       nan,       nan,       nan, 2.540e+00],
       [1.987e+03,       nan,       nan,       nan, 0.000e+00],
       [1.986e+03,       nan,       nan,       nan, 5.150e+00]])

In [7]:
print(type(world_alcohol))

<class 'numpy.ndarray'>


We can clearly see that there are a lot of nan values in the result. The file does not actually contain missing values at those positions: they are text entries that numpy could not convert to numbers. We will see later how to correct this problem. When we talk about matrices, we usually expect only numbers in the dataset. Numpy can store many types of data, not only numbers, but in an ordinary numpy array all the values must share the same data type. By default, the ```genfromtxt``` function reads every value as a float; see the documentation: https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html.
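
To make this behaviour concrete, here is a minimal sketch that reads a small CSV text held in memory with `io.StringIO`. The CSV content is made up for illustration and is not the content of world_alcohol.csv.

```python
import io
import numpy as np

# Made-up CSV text used only for this illustration.
csv_text = "Year,Country,Value\n1986,Uruguay,0.5\n1985,Angola,0.39\n"

# With the default float dtype, anything that is not a number becomes nan.
as_float = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(as_float)

# A numpy array holds one data type: mixing numbers and text gives strings.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype.kind)  # 'U' means Unicode string
```

The header row and the country column cannot be parsed as floats, so they show up as nan, while the year and value columns are read normally.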

## Data types in numpy

Here we will see the main data types in numpy, which differ slightly from the Python data types, and how they apply to numpy arrays. The main data types accepted by numpy are: ```bool```, ```integer```, ```float``` and ```string```. To know the data type of an array, we use the ```dtype``` attribute of the numpy ```ndarray``` object.

In [8]:
vector.dtype

dtype('int32')

In [9]:
matrix.dtype

dtype('float64')

In [10]:
world_alcohol.dtype

dtype('float64')
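
The sketch below builds one small array per data type and prints its ```dtype```. The arrays are made up for illustration. The last lines use ```astype()```, which is not covered above, to show one way of converting an array of numeric strings to floats.

```python
import numpy as np

flags = np.array([True, False])      # boolean values
counts = np.array([1, 2, 3])         # integers (int32 or int64, depending on the platform)
ratios = np.array([0.5, 1.62])       # floats
names = np.array(["Wine", "Beer"])   # strings

for arr in (flags, counts, ratios, names):
    print(arr.dtype)

# astype() returns a new array converted to the requested type.
as_numbers = np.array(["0", "0.5", "1.62"]).astype(float)
print(as_numbers.dtype)
```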

## Display data correctly

We have seen previously that our dataset displays nan in some columns. In this section, we will see how to display the data correctly. As a reminder, ```nan``` means ```not a number```. It is not a data type but a special floating-point value, used here to mark entries that could not be read as numbers and, more generally, to say that a ```value is missing```. Other tools also use ```na (not available)``` for items that are not available; plain numpy float arrays rely on ```nan```.

In [11]:
world_alcohol

array([[      nan,       nan,       nan,       nan,       nan],
       [1.986e+03,       nan,       nan,       nan, 0.000e+00],
       [1.986e+03,       nan,       nan,       nan, 5.000e-01],
       ...,
       [1.986e+03,       nan,       nan,       nan, 2.540e+00],
       [1.987e+03,       nan,       nan,       nan, 0.000e+00],
       [1.986e+03,       nan,       nan,       nan, 5.150e+00]])
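
As a short aside, the sketch below shows how nan behaves inside a float array, using made-up values. Because nan is never equal to anything, including itself, ```np.isnan``` is the usual way to locate it.

```python
import numpy as np

values = np.array([1986.0, np.nan, 0.5])

# nan is a float value, so the array keeps a float dtype.
print(values.dtype)

# nan is not equal to anything, including itself, so use np.isnan to find it.
print(np.nan == np.nan)
missing = np.isnan(values)
print(missing)

# Count the missing entries.
n_missing = int(missing.sum())
print(n_missing)
```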

To display the correct values, we use the ```dtype="U75"``` parameter of the ```genfromtxt``` function. This forces numpy to read each value as a ```U75``` string, that is, a Unicode string of at most 75 characters.

In [12]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75")
world_alcohol

array([['Year', 'WHO region', 'Country', 'Beverage Types',
        'Display Value'],
       ['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')
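
To see what the number in the ```U``` format means, the sketch below stores the same made-up strings with two different widths. It suggests why a generous width such as 75 is a safe choice for this kind of data.

```python
import numpy as np

# "U5" keeps at most 5 characters per element; longer strings are cut.
short = np.array(["Western Pacific", "Wine"], dtype="U5")
print(short)

# "U75" leaves room for up to 75 characters, so these values fit unchanged.
wide = np.array(["Western Pacific", "Wine"], dtype="U75")
print(wide)
```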

We notice that our dataset has a header. To remove it, we set the ```skip_header``` parameter to ```1```, which skips the first row.

In [13]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75", skip_header=1)
world_alcohol

array([['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')
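
The effect of ```skip_header``` can also be checked without the file. The sketch below reads a made-up CSV text from memory with `io.StringIO`, once with the header and once without it.

```python
import io
import numpy as np

# Made-up CSV text standing in for a file with one header row.
csv_text = "Year,Country,Value\n1986,Uruguay,0.5\n1985,Angola,0.39\n"

with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype="U75")
without_header = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1
)

print(with_header.shape)     # the header counts as a row
print(without_header.shape)  # the header is skipped
```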

## Extract a value from a numpy array

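Extracting an element from a one-dimensional numpy array (a vector) works as it does for a list. For matrices, indexing is similar to that of a list of lists, except that the row and column indices go inside a single pair of brackets, separated by a comma.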

In [14]:
# example for vector
vector, vector[0]

(array([1, 2, 3, 4, 5]), 1)

In [15]:
# example for matrix
matrix, matrix[0,0]

(array([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]]),
 1.0)
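
The comparison with lists of lists can be sketched as follows, on a made-up 2x3 array. The variable names are chosen for this example only.

```python
import numpy as np

nested = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
matrix_2x3 = np.array(nested)

from_list = nested[1][2]      # lists need one pair of brackets per level
chained = matrix_2x3[1][2]    # the same style also works on arrays
combined = matrix_2x3[1, 2]   # row and column in a single pair of brackets
last = matrix_2x3[-1, -1]     # negative indices count from the end

print(from_list, chained, combined, last)
```

All four expressions select the same element, the one in the second row and third column.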

### Training

In this section, we will try to answer the following questions:

* assign the number of liters of wine drunk by an Ivorian in 1985 to the variable value_ivory_1985; this corresponds to the data in the 3rd row

* assign the name of the country in the 2nd row to the variable second_country_name

In [16]:
world_alcohol

array([['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

In [17]:
# number of liters of wine drunk by an Ivorian in 1985
value_ivory_1985 = world_alcohol[2,4]
print(value_ivory_1985)

1.62


In [18]:
# name of the country in the 2nd row
second_country_name = world_alcohol[1,2]
print(second_country_name)

Uruguay


## Extract a vector of values from a numpy array

Extracting a vector from a numpy array works in the same way as extracting an element, using **slicing** as for a list. Extraction from a matrix is a bit more involved: you can extract a row vector or a column vector or, in a more advanced way, a subset of the matrix.

In [19]:
# example for vector
vector

array([1, 2, 3, 4, 5])

In [20]:
sub_vector = vector[0:3]
sub_vector

array([1, 2, 3])

In [21]:
# example for matrix
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [22]:
# column extraction
first_column = matrix[:,0]
first_column

array([1., 4., 7.])

In [23]:
second_column = matrix[:,1]
second_column

array([2., 5., 8.])

In [24]:
third_column = matrix[:,2]
third_column

array([3., 6., 9.])

In [25]:
# row extraction
first_row = matrix[0,:]
first_row

array([1., 2., 3.])

In [26]:
second_row = matrix[1,:]
second_row

array([4., 5., 6.])

In [27]:
third_row = matrix[2,:]
third_row

array([7., 8., 9.])
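
One detail worth checking is the shape of what comes out. The sketch below, on a made-up 3x3 array named `m`, shows that indexing with an integer returns a one-dimensional vector, while slicing with a range of width 1 keeps two dimensions.

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

col = m[:, 0]        # integer index: the result is 1-D
col_2d = m[:, 0:1]   # slice of width 1: the result stays 2-D
row = m[0]           # shorthand for m[0, :]

print(col.shape, col_2d.shape, row.shape)
```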

### Training

In this section, we will try to answer the following questions:

* assign the whole 3rd column of world_alcohol to the variable countries

* assign the whole 5th column of world_alcohol to the variable alcohol_consumption

In [28]:
countries = world_alcohol[:,2]
countries

array(['Viet Nam', 'Uruguay', "Cte d'Ivoire", ..., 'Switzerland',
       'Papua New Guinea', 'Swaziland'], dtype='<U75')

In [29]:
alcohol_consumption = world_alcohol[:,4]
alcohol_consumption

array(['0', '0.5', '1.62', ..., '2.54', '0', '5.15'], dtype='<U75')

## Extract an array of values from a numpy array

In this section, we will see how to extract an array of values (a sub-matrix) from a two-dimensional numpy array (a matrix). The extraction works in the same way, by slicing both the rows and the columns of the matrix.

In [30]:
# example
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [31]:
first_sub_matrix = matrix[:2,:2]
first_sub_matrix

array([[1., 2.],
       [4., 5.]])

In [32]:
second_sub_matrix = matrix[:2,1:3]
second_sub_matrix

array([[2., 3.],
       [5., 6.]])

In [33]:
third_sub_matrix = matrix[1:3,:2]
third_sub_matrix

array([[4., 5.],
       [7., 8.]])

In [34]:
fourth_sub_matrix = matrix[1:3,1:3]
fourth_sub_matrix

array([[5., 6.],
       [8., 9.]])

These extractions are useful, for example, when the determinant of a matrix has to be calculated by cofactor expansion, which works on sub-matrices.
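
The link with determinants can be sketched as follows, assuming a cofactor expansion along the first row of a 3x3 array. The function name `det_3x3` and the example matrix are made up for illustration; in practice, ```np.linalg.det``` does this work and is used here only as a check. ```np.delete``` is used to drop a column because the middle minor is not a contiguous slice.

```python
import numpy as np

def det_3x3(m):
    """Determinant of a 3x3 array by cofactor expansion along the first row."""
    total = 0.0
    for j in range(3):
        # Minor: drop row 0 and column j, leaving a 2x2 sub-matrix.
        minor = np.delete(m[1:, :], j, axis=1)
        det_minor = minor[0, 0] * minor[1, 1] - minor[0, 1] * minor[1, 0]
        total += (-1) ** j * m[0, j] * det_minor
    return total

a = np.array([[2.0, 0.0, 1.0], [1.0, 3.0, 2.0], [1.0, 1.0, 4.0]])
by_hand = det_3x3(a)
reference = np.linalg.det(a)
print(by_hand, reference)
```

For the first and last columns, the minors are exactly the kind of sub-matrix obtained with slices such as ```[1:3,1:3]``` and ```[1:3,:2]```.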

### Training

In this section, we will try to answer the following questions:

* assign all rows of the first 2 columns of world_alcohol to the variable first_two_columns

* assign the first 10 rows of the first column of world_alcohol to the variable first_ten_years

* assign the first 10 rows of all world_alcohol columns to the variable first_ten_rows

* assign the first 20 rows of the columns at index 1 and 2 of world_alcohol to the variable first_twenty_regions

In [35]:
first_two_columns = world_alcohol[:,:2]
first_two_columns

array([['1986', 'Western Pacific'],
       ['1986', 'Americas'],
       ['1985', 'Africa'],
       ...,
       ['1986', 'Europe'],
       ['1987', 'Western Pacific'],
       ['1986', 'Africa']], dtype='<U75')

In [36]:
first_ten_years = world_alcohol[:10,0]
first_ten_years

array(['1986', '1986', '1985', '1986', '1987', '1987', '1987', '1985',
       '1986', '1984'], dtype='<U75')

In [37]:
first_ten_rows = world_alcohol[:10,:]
first_ten_rows

array([['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ['1986', 'Americas', 'Colombia', 'Beer', '4.27'],
       ['1987', 'Americas', 'Saint Kitts and Nevis', 'Beer', '1.98'],
       ['1987', 'Americas', 'Guatemala', 'Other', '0'],
       ['1987', 'Africa', 'Mauritius', 'Wine', '0.13'],
       ['1985', 'Africa', 'Angola', 'Spirits', '0.39'],
       ['1986', 'Americas', 'Antigua and Barbuda', 'Spirits', '1.55'],
       ['1984', 'Africa', 'Nigeria', 'Other', '6.1']], dtype='<U75')

In [38]:
first_twenty_regions = world_alcohol[:20,1:3]
first_twenty_regions

array([['Western Pacific', 'Viet Nam'],
       ['Americas', 'Uruguay'],
       ['Africa', "Cte d'Ivoire"],
       ['Americas', 'Colombia'],
       ['Americas', 'Saint Kitts and Nevis'],
       ['Americas', 'Guatemala'],
       ['Africa', 'Mauritius'],
       ['Africa', 'Angola'],
       ['Americas', 'Antigua and Barbuda'],
       ['Africa', 'Nigeria'],
       ['Africa', 'Botswana'],
       ['Americas', 'Guatemala'],
       ['Western Pacific', "Lao People's Democratic Republic"],
       ['Eastern Mediterranean', 'Afghanistan'],
       ['Western Pacific', 'Viet Nam'],
       ['Africa', 'Guinea-Bissau'],
       ['Americas', 'Costa Rica'],
       ['Africa', 'Seychelles'],
       ['Europe', 'Norway'],
       ['Africa', 'Kenya']], dtype='<U75')