# 2. Arrays - Part 1

In [1]:
import numpy as np

## Creating arrays
An array is a table of elements. 

It can be created from the values in a list (or tuple) with the `array` function. 

An n-dimensional array can be created by passing a nested list to this function. See also  
http://docs.scipy.org/doc/numpy/user/basics.creation.html

In [2]:
# create array from a list:
print(np.array([1, 3, 5, 7, 9]))

[1 3 5 7 9]


In [3]:
# create array from a tuple:
print( np.array((1, 3, 5, 7, 9)) )
print()

[1 3 5 7 9]



In [4]:
# 2x3 array
print(np.array([[2, 3, 3], [4, 5, 6]]))

[[2 3 3]
 [4 5 6]]


- `empty` creates an uninitialized array (its entries are arbitrary values left over in memory)
- `zeros` creates an array filled with zeros
- `ones` creates an array filled with ones

All these functions take the shape of the array as input. (Shape and dimensionality are explained in more detail in the Dimensions section.)

In [5]:
twoxthree = (2,3)
# create an empty array:
print( np.empty(twoxthree) )

[[  9.88131292e-324   1.48219694e-323   1.48219694e-323]
 [  1.97626258e-323   2.47032823e-323   2.96439388e-323]]


In [6]:
# create array with only zeros:
print( np.zeros((2,3)) )

[[ 0.  0.  0.]
 [ 0.  0.  0.]]


In [7]:
# create array with only ones:
print (np.ones((2,3)) )

[[ 1.  1.  1.]
 [ 1.  1.  1.]]


### Random arrays
`np.random.random` creates an array with numbers drawn uniformly at random from the interval between 0 and 1 (0 included, 1 excluded).

The expression `np.random.uniform(x, y, z)` returns an array of shape `z` with random numbers drawn uniformly from the interval between `x` and `y`.

(The random seed can be fixed to an integer *n* with `np.random.seed(n)`. Note that numpy and base Python use different random number generators, each with its own seed.)
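
The note on seeding can be illustrated with a minimal sketch. The seed values 42 and 0 and the variable names are arbitrary choices for illustration, not part of the original notebook:

```python
import random
import numpy as np

# Seeding NumPy's global generator makes the draws repeatable.
np.random.seed(42)
first = np.random.random(3)
np.random.seed(42)
second = np.random.random(3)
print(np.array_equal(first, second))  # True

# Python's built-in generator keeps its own, independent state:
# seeding it does not affect what NumPy draws.
np.random.seed(42)
random.seed(0)
third = np.random.random(3)
print(np.array_equal(first, third))  # True
```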

In [8]:
np.random.seed(666)
# create array with random numbers between 0 and 1:
print(np.random.random((2,2)))

[[ 0.70043712  0.84418664]
 [ 0.67651434  0.72785806]]


In [9]:
# create array with random numbers between -1 and 1:
print( np.random.uniform(-1, 1, (2,3)) )

[[ 0.90291591 -0.97459361 -0.1728246 ]
 [-0.90237441 -0.80014288  0.01613261]]


Random numbers from the normal distribution:

In [10]:
print(np.random.normal(5, 10, (2,2)))

[[ 11.40573153  -2.86443172]
 [ 11.08869993  -4.31011849]]


In [11]:
# we can specify mean, standard deviation and size by keyword
print(np.random.normal(loc=5, scale=10, size=(2,2)))

[[ 14.78222248  -2.36918061]
 [  2.01267382   0.39412625]]


### Identity matrix
The identity matrix is an $n \times n$ matrix with ones on the diagonal and zeros everywhere else. 
The identity matrix can be created with the `eye(n)` function. Since the identity matrix is always square, only one input parameter is needed to create this 2-dimensional matrix.

In [12]:
# create n by n identity matrix:
print( np.eye(5))

[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]


### Ranges
Numpy has a function to create arrays with numbers within a range: `arange(a, b, s)`, where $a$ is the start point, $b$ is the end point (not included in the result) and $s$ is the step size. The function also accepts floats as inputs.

Another function to create a range is `linspace(a, b, i)`, where $a$ is the start point, $b$ is the end point (included by default), and $i$ is the number of items.


In [13]:
# use fixed step size:
x = np.arange(1.0, 10.0, 2.0)
print(x)

[ 1.  3.  5.  7.  9.]


In [14]:
# use fixed number of items:
y = np.linspace(1.0, 10.0, 5)
print(y)

[  1.     3.25   5.5    7.75  10.  ]
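
The difference in how the two functions treat the end point can be checked with a small sketch. The numbers are chosen only for illustration; `retstep=True` is an optional `linspace` argument that also returns the step size it used:

```python
import numpy as np

# arange stops before the end point: 10.0 is not included.
steps = np.arange(1.0, 10.0, 3.0)
print(steps)  # [1. 4. 7.]

# linspace includes the end point by default and can report the
# step size it used when retstep=True.
points, step = np.linspace(1.0, 10.0, 4, retstep=True)
print(points)  # [ 1.  4.  7. 10.]
print(step)    # 3.0
```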


## Dimensions

The array class in numpy is called `ndarray`. 
An array can be 1-dimensional, which is often called a (row or column) vector. A one-dimensional numpy array has a single axis and is printed like a row vector. In order to create a column vector, you need to use a two-dimensional array, with the second dimension equal to 1. See examples below.


An array can be 2-dimensional, with shape $m \times n$, where $m$ is the number of rows and $n$ the number of columns. 
In a 2-dimensional array, the first dimension indexes the rows and the second dimension indexes the columns.

### Bookshelf analogy
An array can be $n$-dimensional. You can think of $n$-dimensional arrays in terms of the bookshelf analogy:

- 1d array is a single row of a bookshelf, where a book can be identified by its position in the row
- 2d array is the whole bookshelf, where a book can be identified by its row number and its position in the row
- 3d array is a room full of bookshelves, where a book can be identified by the number of the bookshelf, row, and position in the row
- 4d array is a library with rooms with bookshelves, where a book can be identified by the room, bookshelf, row and position in the row
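
The analogy can be made concrete with a short sketch. The library dimensions (2 rooms, 3 bookshelves, 4 rows, 5 books per row) and the variable names are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical library: 2 rooms, 3 bookshelves per room,
# 4 rows per bookshelf, 5 books per row.
library = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))

room = library[1]            # 3d: one room full of bookshelves
bookshelf = library[1, 2]    # 2d: one whole bookshelf
row = library[1, 2, 0]       # 1d: a single row of books
book = library[1, 2, 0, 3]   # a single book

print(room.ndim, bookshelf.ndim, row.ndim)  # 3 2 1
print(book)                                 # 103
```

Each additional index fixes one more level of the hierarchy, so the number of remaining dimensions drops by one each time.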


### Reshaping arrays
A 1D array can be converted to an $n$-dimensional array using the `reshape` function. 

Note that the total number of elements in the array has to be the same as the product of the lengths of the dimensions. For example, if the length of the list is 24, then we can reshape it to a 4 by 6 matrix, but also to a 2 by 3 by 4 array.

Let's assume we have a 2x3x4 matrix, which we will call `z`. Since the index in Python starts at 0, the first element of the array is `z[0, 0, 0]`, and the last element of the array is `z[1, 2, 3]`.
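
As a minimal sketch of the rule above (variable names are illustrative), a 24-element array can take either shape, while a shape whose product is not 24 is rejected:

```python
import numpy as np

values = np.arange(24)

# 24 = 4 * 6 = 2 * 3 * 4, so both reshapes are valid.
matrix = values.reshape((4, 6))
z = values.reshape((2, 3, 4))
print(matrix.shape, z.shape)   # (4, 6) (2, 3, 4)
print(z[0, 0, 0], z[1, 2, 3])  # 0 23

# 5 * 5 = 25 does not match 24 elements, so NumPy raises ValueError.
try:
    values.reshape((5, 5))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)          # True
```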

Let `x` be an ndarray. The following attributes describe its dimensionality:
- `x.ndim` : the number of dimensions.
- `x.shape` : the length of each dimension.
- `x.size` : the total number of elements.

### Creating row vectors

In [15]:
a = np.zeros((3))
print(a); 
print(a.ndim, a.shape, a.size)

[ 0.  0.  0.]
1 (3,) 3


In [16]:
b = np.array([1, 2, 3])
print(b); 
print(b.ndim, b.shape, b.size)

[1 2 3]
1 (3,) 3


In [17]:
c = np.zeros((1,3))
print(c)

[[ 0.  0.  0.]]


In [18]:
d = b.reshape((1,3))
print(d.ndim, d.shape, d.size)

2 (1, 3) 3


### Creating column vectors

In [19]:
a = np.zeros((3, 1))
print(a); 
print(a.ndim, a.shape, a.size)

[[ 0.]
 [ 0.]
 [ 0.]]
2 (3, 1) 3


In [20]:
b_1 = np.array([1,2,3])
print(b_1)
print(b_1.ndim)

[1 2 3]
1


In [21]:
b = np.array([[1], [2], [3]])
print(b); print(b.ndim, b.shape, b.size)

[[1]
 [2]
 [3]]
2 (3, 1) 3


In [22]:
c_row = np.array([1,2,3])
print(c_row)
shape = (3,1)
c = c_row.reshape(shape)
print(c); print(c.ndim, c.shape, c.size)

[1 2 3]
[[1]
 [2]
 [3]]
2 (3, 1) 3


### Creating two-dimensional arrays

Arrays with two or more dimensions can be created by reshaping one-dimensional arrays, or by passing nested lists to the `array` function.

In [5]:
print(np.arange(2,14,2).reshape((3,2)))
print( np.arange(2, 14, 2).reshape((2, 3)) )

[[ 2  4]
 [ 6  8]
 [10 12]]
[[ 2  4  6]
 [ 8 10 12]]


In [8]:
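# rows of unequal length cannot form a regular 2-D array; dtype=object stores the two lists as elements
print( np.array([[1, 3], [2, 4, 7]], dtype=object) )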

[list([1, 3]) list([2, 4, 7])]


In [9]:
# create three dimensional arrays:
z = np.arange(24).reshape((2, 3, 4))

# dimensions/shape/size of the array:
print(z)
print("Number of dimensions:", z.ndim)
print("Length of each dimension:", z.shape)
print("The total number of elements:", z.size)

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Number of dimensions: 3
Length of each dimension: (2, 3, 4)
The total number of elements: 24


### Accessing elements

In [15]:
print("First element:", z[0, 0, 0])
print("Last element:",  z[1, 2, 3])
print("Last element of the first block:",  z[-2, -1, -1])

print("Some element:",  z[1, 0, 1])

First element: 0
Last element: 23
Last element of the first block: 11
Some element: 13


### Accessing rows/columns and other axes

In [27]:
# Extract axes
a = np.random.random(2*3).reshape(2,3)
print(a)
print("First row:\n", a[0, :])
print("First row:\n", a[0,])
print("Second column:\n", a[:, 1])
print("First two columns:\n", a[:, 0:2])

[[ 0.0232363   0.72732115  0.34003494]
 [ 0.19750316  0.90917959  0.97834699]]
First row:
 [ 0.0232363   0.72732115  0.34003494]
First row:
 [ 0.0232363   0.72732115  0.34003494]
Second column:
 [ 0.72732115  0.90917959]
First two columns:
 [[ 0.0232363   0.72732115]
 [ 0.19750316  0.90917959]]


## Exercises

### Exercise 2a.1

Generate a $5\times 5 \times 5$ 3D array of random numbers between -10.0 and 10.0. Reshape it to a $5 \times 25$ matrix, and extract the first two rows of this matrix. 


In [20]:
# 8<----------------
arr = np.random.uniform(-10, 10, (5,5,5))
arr = arr.reshape(5,25)
print(arr[0:2, ])

[[-9.92753851 -4.05688815 -8.86302898 -8.80216892 -8.24236241  1.96921146
   6.60546053  3.60849523  3.52984654  1.02274895 -3.53629001  1.29903919
  -0.02132747 -9.76205648 -6.48675794 -7.58726239 -4.12022222 -5.44268317
   6.25719293 -3.41543908 -1.24154183 -6.84926185 -7.56859192  9.48662909
  -8.2457738 ]
 [ 4.10934545  9.04996795 -0.12640135  7.29514106  5.48565705  6.38965187
   0.46403888 -9.511494    4.94318656  1.17894447 -8.74979298 -3.10333891
  -7.57691365  2.12631957 -1.77326092 -4.49299098 -6.89658519 -4.46067283
   0.68343901  2.95482164  7.32768321  6.66336875 -8.05245273  9.97692237
  -5.63096226]]


### Exercise 2a.2

First create the 2-D array (without typing it in explicitly):
```python
[[1,  6, 11],
 [2,  7, 12],
 [3,  8, 13],
 [4,  9, 14],
 [5, 10, 15]]
```
You may find the function `numpy.transpose` useful for this purpose.

Now extract a new array containing its 2nd and 4th rows.

In [25]:
# 8<----------------
arie = np.arange(1, 16, 1)
arie = arie.reshape(3,5)
arie = np.transpose(arie)
print(arie)

[[ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]
 [ 5 10 15]]


In [28]:
# 8<----------------
print(arie[[1,3], : ])

[[ 2  7 12]
 [ 4  9 14]]
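
As an alternative sketch to the transpose-based solution above, `reshape` accepts an `order` argument, and `order="F"` fills the array column by column. The name `alt` is chosen only for illustration:

```python
import numpy as np

# Fill the 5x3 array column by column (Fortran order) instead of
# reshaping to 3x5 and transposing.
alt = np.arange(1, 16).reshape((5, 3), order="F")
print(alt)

# Same result as the transpose-based solution.
print(np.array_equal(alt, np.arange(1, 16).reshape(3, 5).T))  # True

# 2nd and 4th rows (indices 1 and 3).
print(alt[[1, 3], :])
```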


### Exercise 2a.3

Load the content of the file [population.txt](population.txt) into a numpy array. Extract the first column into a vector and assign it to a variable named `year`, extract the second column and assign it to a variable named `hare`, and so on for all four columns (`year`, `hare`, `lynx`, `carrot`).

Convert the variables `year` and `carrot` into the datatype `int`.


In [36]:
# 8<----------------
arie = np.loadtxt("population.txt")
# np.int was removed from recent NumPy releases; the built-in int works as a dtype
year = arie[:, 0].astype(int)
hare = arie[:, 1]
lynx = arie[:, 2]
carrot = arie[:, 3].astype(int)
print(year)
print(hare)
print(lynx)
print(carrot)

[1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913
 1914 1915 1916 1917 1918 1919 1920]
[30000. 47200. 70200. 77400. 36300. 20600. 18100. 21400. 22000. 25400.
 27100. 40300. 57000. 76600. 52300. 19500. 11200.  7600. 14600. 16200.
 24700.]
[ 4000.  6100.  9800. 35200. 59400. 41700. 19000. 13000.  8300.  9100.
  7400.  8000. 12300. 19500. 45700. 51100. 29700. 15800.  9700. 10100.
  8600.]
[48300 48200 41500 38200 40600 39800 38600 42300 44500 42100 46000 46800
 43800 40900 39400 39000 36700 41800 43300 41300 47300]
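
The same steps can be tried without the file. The sketch below uses an in-memory stand-in built from the first three rows of the output above; the whitespace-separated layout and the `#` header line are assumptions about `population.txt`, not something the exercise confirms:

```python
import io
import numpy as np

# Stand-in for population.txt: the first three rows of the data shown
# above; the whitespace-separated layout and '#' header are assumptions.
text = io.StringIO(
    "# year hare lynx carrot\n"
    "1900 30000 4000 48300\n"
    "1901 47200 6100 48200\n"
    "1902 70200 9800 41500\n"
)

# unpack=True transposes the result so each column can be unpacked
# into its own variable.
year, hare, lynx, carrot = np.loadtxt(text, unpack=True)
year = year.astype(int)
carrot = carrot.astype(int)
print(year)    # [1900 1901 1902]
print(carrot)  # [48300 48200 41500]
```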



### Pandas

- `pandas` is a library that provides data structures for storing and processing tabular data, especially time series. It has some similarities to the statistical language R. 

In [51]:
import pandas as pd

Pandas has a useful function `pd.read_csv` for loading CSV files. Tabular data is stored in a DataFrame object. The DataFrame will be pretty-printed by the Jupyter notebook:

In [42]:
data = pd.read_csv("population.csv", sep='\t', index_col='year')
data.head()

         hare     lynx  carrot
year                          
1900  30000.0   4000.0   48300
1901  47200.0   6100.0   48200
1902  70200.0   9800.0   41500
1903  77400.0  35200.0   38200
1904  36300.0  59400.0   40600


A DataFrame has labels for columns (similar to a numpy structured array) but it also can have labels for rows. The set of labels for rows is called an index. This is especially useful for time series. The index can be accessed via the `.index` attribute:

In [43]:
print(data.index)

Int64Index([1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910,
            1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920],
           dtype='int64', name='year')


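The data in a DataFrame can be accessed by column labels or by row labels. NumPy-style positional indexing applied directly to the DataFrame, as in the next cell, is not supported and raises an error: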


In [45]:
print(data[:,2])

KeyError: 2

In order to access data by row label, use the `.loc` attribute:

In [37]:
# print the 1919 row
print(data.loc[1919])
# print the range of data between 1900 and 1905
print(data.loc[1900:1905])

hare      16200.0
lynx      10100.0
carrot    41300.0
Name: 1919, dtype: float64
         hare     lynx  carrot
year                          
1900  30000.0   4000.0   48300
1901  47200.0   6100.0   48200
1902  70200.0   9800.0   41500
1903  77400.0  35200.0   38200
1904  36300.0  59400.0   40600
1905  20600.0  41700.0   39800
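
For access by integer position rather than by label, pandas provides the `.iloc` indexer, which is not covered in the text above. The sketch below contrasts it with `.loc` on a small stand-in DataFrame built from the first three rows of the data; the name `df` is illustrative:

```python
import pandas as pd

# Small stand-in DataFrame using the first three rows shown above.
df = pd.DataFrame(
    {"hare": [30000.0, 47200.0, 70200.0],
     "lynx": [4000.0, 6100.0, 9800.0],
     "carrot": [48300, 48200, 41500]},
    index=pd.Index([1900, 1901, 1902], name="year"),
)

# Label-based access: row with index label 1901, column "lynx".
print(df.loc[1901, "lynx"])   # 6100.0

# Position-based access: all rows, third column (position 2).
third_column = df.iloc[:, 2]
print(third_column.tolist())  # [48300, 48200, 41500]
```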


The underlying numpy arrays can be accessed using the `.values` attribute.

In [38]:
print(data['lynx'].values)

[  4000.   6100.   9800.  35200.  59400.  41700.  19000.  13000.   8300.
   9100.   7400.   8000.  12300.  19500.  45700.  51100.  29700.  15800.
   9700.  10100.   8600.]


In [39]:
print(data.values)

[[ 30000.   4000.  48300.]
 [ 47200.   6100.  48200.]
 [ 70200.   9800.  41500.]
 [ 77400.  35200.  38200.]
 [ 36300.  59400.  40600.]
 [ 20600.  41700.  39800.]
 [ 18100.  19000.  38600.]
 [ 21400.  13000.  42300.]
 [ 22000.   8300.  44500.]
 [ 25400.   9100.  42100.]
 [ 27100.   7400.  46000.]
 [ 40300.   8000.  46800.]
 [ 57000.  12300.  43800.]
 [ 76600.  19500.  40900.]
 [ 52300.  45700.  39400.]
 [ 19500.  51100.  39000.]
 [ 11200.  29700.  36700.]
 [  7600.  15800.  41800.]
 [ 14600.   9700.  43300.]
 [ 16200.  10100.  41300.]
 [ 24700.   8600.  47300.]]
