# 3. Vectorized operations

In [1]:
import numpy

**Question**

Given two lists `x = [10, 20, 30, 40]` and `y = [5, 7, 52, 34]`, how would we sum the elements at corresponding indices?

NumPy arrays are designed to make such operations easy.
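
As a point of comparison, here is a minimal sketch of the same sum computed once with plain Python lists and once with arrays. The names `list_sum` and `array_sum` are only illustrative.

```python
import numpy

x = [10, 20, 30, 40]
y = [5, 7, 52, 34]

# With plain lists, `x + y` concatenates the lists,
# so the elements have to be paired up explicitly.
list_sum = [a + b for a, b in zip(x, y)]
print(list_sum)

# With arrays, `+` is applied elementwise.
array_sum = numpy.array(x) + numpy.array(y)
print(array_sum)
```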

## Basic operations on arrays with the same shape

The basic operations on arrays are applied elementwise.
These operations are addition, subtraction, multiplication, division and power.
The simplest case is when the two arrays have exactly the same shape: each element is combined with the element at the same position in the other array.

In [2]:
# basic operations between two arrays with the same shape:
x = numpy.array([10, 20, 30, 40])
y = numpy.array([5, 7, 52, 34])

print("y - x = ", y - x)
print("x + y = ", x + y)
print("x * y = ", x * y)
print("x / y = ", x / y)

y - x =  [ -5 -13  22  -6]
x + y =  [15 27 82 74]
x * y =  [  50  140 1560 1360]
x / y =  [ 2.          2.85714286  0.57692308  1.17647059]


## Basic operations on arrays with different shapes

Operations are also allowed between arrays of different shapes, although not every combination of shapes is possible. Such operations rely on *broadcasting*.

For more information about Broadcasting:
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

The most common cases of broadcasting are:

### Array and a scalar. 
There are no restrictions on the shape.

In [5]:
# scalar term
x = numpy.array([20, 25, 30, 35])
print("x - 2 = ", x - 2)
print("x * 2 = ", x * 2)
print("x **2 = ", x**2)

x - 2 =  [18 23 28 33]
x * 2 =  [40 50 60 70]
x **2 =  [ 400  625  900 1225]


### Array and a row vector. 
The number of columns in the array has to be the same as the length of the row vector.

In [6]:
# operations between array and vector
x = numpy.array([[1, 2, 3], [4, 5, 6]])
y = numpy.array([5, 5, 5]) # row vector
z = numpy.array([[1], [2]]) # column vector

print(x)
print()
print(y)
print()
print(z)

[[1 2 3]
 [4 5 6]]

[5 5 5]

[[1]
 [2]]


In [7]:
# array and row vector
print("Operations between x and y which are applied for each row")
print("x + y = \n", x+y)
print("x * y = \n", x*y)

Operations between x and y which are applied for each row
x + y = 
 [[ 6  7  8]
 [ 9 10 11]]
x * y = 
 [[ 5 10 15]
 [20 25 30]]


### Array and a column vector. 

The number of rows in the array has to be the same as the length of the column vector.


In [8]:
# array and column vector
print("Operations between x and z which are applied for each column")
print("x + z = \n", x+z)
print("x * z = \n", x*z)

Operations between x and z which are applied for each column
x + z = 
 [[2 3 4]
 [6 7 8]]
x * z = 
 [[ 1  2  3]
 [ 8 10 12]]
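
The length requirements stated above also explain when broadcasting fails. As a sketch, the example below tries to add a vector of length 2 to a 2x3 array, which raises a `ValueError`, and then reshapes the vector into a column vector so that the operation becomes valid. The names `w` and `error_message` are made up for this illustration.

```python
import numpy

x = numpy.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
w = numpy.array([10, 20])                # shape (2,): not the number of columns

try:
    x + w
    error_message = None
except ValueError as err:
    # Broadcasting fails: the last dimensions (3 and 2) do not match.
    error_message = str(err)
print("x + w failed:", error_message)

# As a column vector of shape (2, 1), w matches the number of rows of x.
print(x + w.reshape(2, 1))
```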


## Vector transformations

Simple examples of operations on vectors are:

- Standardization: `z = (x - mean(x)) / stdev(x)`. Standardized values (z-scores) have zero mean and unit standard deviation. Standardization is often used before applying machine learning algorithms. 

- Feature scaling: `y = (x - min(x)) / (max(x) - min(x))`, which brings the values into the range 0 to 1.

- Conversion between different scales of measurements. Some examples: from Fahrenheit to Celsius, or from Dollars to Euros, or from Inches to Centimetres. 


### Exercise 3.1

Define a function `standardize` which converts a vector of numbers to z-scores.


In [9]:
# 8< ..........................................
x = numpy.random.normal(-1,3,10)
print(x)

def standardize(x):
    return (x - numpy.mean(x))/numpy.std(x)

print(standardize(x))
print(numpy.mean(standardize(x)))
print(numpy.std(standardize(x)))


[ 3.34602705  0.33649955  0.62714828  2.52290647  1.554453   -0.8489329
 -1.33359976  0.76831849 -0.61359304 -0.88223262]
[ 1.89838679 -0.14327811  0.05389812  1.3399814   0.6829821  -0.94747607
 -1.2762743   0.14966806 -0.78782139 -0.97006661]
0.0
1.0


### Exercise 3.2
- Define a function `to_cm` which takes a vector of measurements in inches and converts them to centimeters.
- Define a function `to_celsius` which takes a vector of measurements in Fahrenheit and converts them to Celsius: `C = (F - 32) / 1.8`.


In [10]:
# 8< ..............................................
inch = numpy.array([1.0, 2.0, 10.0])
f = numpy.array([-40, 0.0, 100.0])

def to_cm(x):
    return x * 2.54
def to_celsius(x):
    return (x-32)/1.8

print(to_cm(inch))
print(to_celsius(f))

[  2.54   5.08  25.4 ]
[-40.         -17.77777778  37.77777778]


## Boolean operations on arrays

Boolean conditions can also be applied to arrays. They are evaluated for every element, and the result is an array of `True` and `False` values. The available operators are: equal to (`==`), not equal to (`!=`), greater than (`>`), greater than or equal to (`>=`), less than (`<`), and less than or equal to (`<=`).

In [11]:
# boolean operations on arrays
x = numpy.array([10, 20, 30, 14, 15, 16])
y = numpy.array([7, 5, 5, 7, 5, 7]) 
print("(x > 15) = ", x>15)
print("(y == 7) = ", y==7)

(x > 15) =  [False  True  True False False  True]
(y == 7) =  [ True False False  True False  True]
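
Boolean arrays can themselves be combined and counted. The sketch below reuses `x` and `y` from the cell above, combines two conditions with `&` (elementwise and) and `|` (elementwise or), and counts the matches with `.sum()`. The names `both` and `either` are illustrative.

```python
import numpy

x = numpy.array([10, 20, 30, 14, 15, 16])
y = numpy.array([7, 5, 5, 7, 5, 7])

# The parentheses are required: & and | bind more tightly than comparisons.
both = (x > 15) & (y == 7)
either = (x > 15) | (y == 7)
print("both:  ", both)
print("either:", either)

# True counts as 1, so summing a boolean array counts the matches.
print("number of elements of x above 15:", (x > 15).sum())
```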


## Mathematical functions applied on vectors

Many mathematical functions can be applied to arrays elementwise, such as:

- `numpy.sqrt`: square root
- `numpy.sin`: sine
- `numpy.cos`: cosine
- `numpy.tan`: tangent
- `numpy.exp`: exponential
- `numpy.log`: natural logarithm (base e)
- `numpy.log2`: base-2 logarithm 
- `numpy.log10`: base-10 logarithm

In [12]:
x = numpy.array([1, 2, 3, 4])
print("x = ", x)
print("sqrt(x) = ", numpy.sqrt(x))
print("sin(x) = ", numpy.sin(x) )
print("cos(x) = ", numpy.cos(x) )
print("tan(x) = ", numpy.tan(x) )
print("exp(x) = ", numpy.exp(x) )
print("log(x) = ", numpy.log(x) )
print("log2(x) = ", numpy.log2(x) )
print("log10(x) = ", numpy.log10(x) )

x =  [1 2 3 4]
sqrt(x) =  [ 1.          1.41421356  1.73205081  2.        ]
sin(x) =  [ 0.84147098  0.90929743  0.14112001 -0.7568025 ]
cos(x) =  [ 0.54030231 -0.41614684 -0.9899925  -0.65364362]
tan(x) =  [ 1.55740772 -2.18503986 -0.14254654  1.15782128]
exp(x) =  [  2.71828183   7.3890561   20.08553692  54.59815003]
log(x) =  [ 0.          0.69314718  1.09861229  1.38629436]
log2(x) =  [ 0.         1.         1.5849625  2.       ]
log10(x) =  [ 0.          0.30103     0.47712125  0.60205999]


## Reductions

Some methods can be applied to the entire array or to only one dimension:

- `.sum` and `numpy.cumsum`
- `.min` and `.argmin`
- `.max` and `.argmax`

In [13]:
x = numpy.array([[1, 6, 5], [2, 7, 8]])
print(x)
# functions applied to the entire array:
print("sum:", x.sum())
print("sum:", numpy.sum(x))
print("minimum:", x.min(), "and index of minimum:", x.argmin())
print("maximum:", x.max(), "and index of maximum:", x.argmax())

[[1 6 5]
 [2 7 8]]
sum: 29
sum: 29
minimum: 1 and index of minimum: 0
maximum: 8 and index of maximum: 5



One important thing to notice is that the index returned by `argmin` or `argmax` is the linear index into the flattened array, not the multidimensional index (see [2b_arrays.ipynb](2b_arrays.ipynb)).
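
If the row and column are needed instead, the linear index can be converted with `numpy.unravel_index`. The following is a small sketch using the same array; the names `flat_index`, `row` and `col` are illustrative.

```python
import numpy

x = numpy.array([[1, 6, 5], [2, 7, 8]])

flat_index = x.argmax()  # position in the flattened array
row, col = numpy.unravel_index(flat_index, x.shape)

print("linear index:", flat_index)
print("row, column: ", row, col)
print("value:       ", x[row, col])
```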

### Axis
Many reduction functions have a parameter called `axis`. When `axis=0` the operation is carried out on columns, so that the result has one element per column. When `axis=1` the operation is carried out on rows, so that the result has one element per row.


In [14]:
# functions applied to only one dimension of the array:
print("Sum columns:", x.sum(axis=0))
print("Sum rows:", x.sum(axis=1))
print("Minimum per column:", x.min(axis=0))
print("Maximum per row:", x.max(axis=1))

Sum columns: [ 3 13 13]
Sum rows: [12 17]
Minimum per column: [1 6 5]
Maximum per row: [6 8]


### Exercise 3.3
Define a function `scale` which takes a vector of numbers and brings them into the range from 0 to 1:
$$\mathrm{scale}(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

In [15]:
# 8< .............................................
def scale(x):
    return (x - x.min())/(x.max() - x.min())

z = numpy.arange(0,10)
print(z)
print(scale(z))



[0 1 2 3 4 5 6 7 8 9]
[ 0.          0.11111111  0.22222222  0.33333333  0.44444444  0.55555556
  0.66666667  0.77777778  0.88888889  1.        ]


### Exercise 3.4a

The function `softmax` is often used in machine learning and statistics to convert a vector of arbitrary numbers into a vector of probabilities summing up to $1$. Softmax is computed by taking the exponential of each number, and then dividing each exponential by the sum of all the exponentials:
$$ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{k=1}^N \exp(x_k)}$$

Implement the softmax function. Verify that all numbers in the resulting vector are between 0 and 1. Verify that the resulting numbers sum up to $1$.

In [19]:
# 8< ...........................................
z = numpy.random.normal(0,2,10)
print(z)
def softmax(x):
    E = numpy.exp(x)
    return E /numpy.sum(E)

print(softmax(z))
print(numpy.sum(softmax(z)))
print(numpy.all(softmax(z) >= 0.0))
print(numpy.all(softmax(z) <= 1.0))

[ 0.05522743 -0.53386407  0.54687816  2.16878896 -0.29991347 -0.66503052
 -2.72362797  3.65152789 -0.99394274  0.18224684]
[ 0.01973707  0.01095074  0.03227037  0.16337696  0.01383716  0.00960459
  0.00122586  0.71967454  0.00691248  0.02241024]
1.0
True
True


### Exercise 3.4b

Implement a version of the softmax function which takes a matrix, and converts the values to probabilities such that each column sums up to 1.

In [26]:
# 8< ...........................................
z = numpy.random.normal(0,2,(4,5))
print(z)
print()
def softmax(x):
    E = numpy.exp(x)
    return E /numpy.sum(E, axis=0)

print(softmax(z))
print(numpy.sum(softmax(z), axis=0))
print(numpy.all(softmax(z) >= 0.0))
print(numpy.all(softmax(z) <= 1.0))

[[-0.85204647 -1.87809144 -0.15519115  0.7535107   0.51966451]
 [-3.44600751  1.4940935  -1.42101538 -2.34846816  1.80497619]
 [-0.40157707 -0.00642536  0.11168924 -0.40898966 -2.62792004]
 [-1.30627749 -1.39898731 -1.55924743  3.60366758 -0.73247516]]

[[ 0.30499819  0.02613978  0.3529217   0.0536246   0.20223907]
 [ 0.02279052  0.7617688   0.0995263   0.00241097  0.73125498]
 [ 0.47855695  0.16988542  0.46087488  0.01676859  0.00868733]
 [ 0.19365434  0.042206    0.08667712  0.92719584  0.05781862]]
[ 1.  1.  1.  1.  1.]
True
True


## Sorting

For sorting, use the functions `sort` and `argsort`.

In [27]:
# sorting a 1-dimensional array:
print("Applied to 1-dimensional array")
x = numpy.array([5, 3, 6, 2, 6, 8])
print("Original x:           ", x)
print("Sorted   x:           ", numpy.sort(x))
y = x.argsort()
print("Indices of argsort:   ", y)
print("Sorted using indices: ", x[y])

Applied to 1-dimensional array
Original x:            [5 3 6 2 6 8]
Sorted   x:            [2 3 5 6 6 8]
Indices of argsort:    [3 1 0 2 4 5]
Sorted using indices:  [2 3 5 6 6 8]


**Attention**  The method `.sort` sorts the array in place, that is, destructively: the original order is lost. Use it at your own risk.

In [28]:
x = numpy.array([2, 3, 1])
print("x=", x)
x.sort()
print("x=", x)

x= [2 3 1]
x= [1 2 3]


Sorting can also be done per row or column.

In [29]:
x = numpy.array([[5, 3, 4],[2, 4, 2]])
print("Original: ")
print(x)
print("Columns are sorted: ")
print(numpy.sort(x, axis=0))

print("Rows are sorted: ")
print(numpy.sort(x, axis=1))


Original: 
[[5 3 4]
 [2 4 2]]
Columns are sorted: 
[[2 3 2]
 [5 4 4]]
Rows are sorted: 
[[3 4 5]
 [2 2 4]]


## Reversing

There is a special indexing syntax in `numpy` to obtain a view of the array in the reverse order. 

In [30]:
a = numpy.random.randint(0,10,5)
print(a)
print()
print(a[::-1])

[6 3 1 1 4]

[4 1 1 3 6]
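
Because the reversed array is a view, it shares its data with the original. The sketch below illustrates the consequence: writing through the view changes the original, while an explicit `.copy()` does not. The names `r` and `c` are illustrative, and a fixed array is used instead of random numbers so that the result is reproducible.

```python
import numpy

a = numpy.array([6, 3, 1, 1, 4])
r = a[::-1]  # reversed view, no data is copied

print(numpy.shares_memory(a, r))  # both arrays use the same buffer

r[0] = 99  # the first element of the view is the last element of a
print(a)

c = a[::-1].copy()  # an independent reversed copy
c[0] = -1           # does not affect a
print(a)
```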


### Exercise 3.5

The file `winequality-red.csv` contains measurements of wine samples, together with a quality rating. You can load this data into a structured array like this:

In [34]:
data = numpy.genfromtxt("winequality-red.csv", names=True, delimiter=';')

- Sort the data according to the quality rating, from lowest to highest
- Now sort the wines from highest to lowest

In [35]:
data_s = numpy.sort(data, order='quality')
# Ascending
print(data_s)
print()
# Descending
print(data_s[::-1])

[ (  6.7,  0.76 ,  0.02,  1.8,  0.078,   6.,  12.,  0.996  ,  3.55,  0.63,   9.95,  3.)
 (  6.8,  0.815,  0.  ,  1.2,  0.267,  16.,  29.,  0.99471,  3.32,  0.51,   9.8 ,  3.)
 (  7.1,  0.875,  0.05,  5.7,  0.082,   3.,  14.,  0.99808,  3.4 ,  0.52,  10.2 ,  3.)
 ...,
 ( 10.7,  0.35 ,  0.53,  2.6,  0.07 ,   5.,  16.,  0.9972 ,  3.15,  0.65,  11.  ,  8.)
 ( 11.3,  0.62 ,  0.67,  5.2,  0.086,   6.,  19.,  0.9988 ,  3.22,  0.69,  13.4 ,  8.)
 ( 12.6,  0.31 ,  0.72,  2.2,  0.072,   6.,  29.,  0.9987 ,  2.88,  0.82,   9.8 ,  8.)]

[ ( 12.6,  0.31 ,  0.72,  2.2,  0.072,   6.,  29.,  0.9987 ,  2.88,  0.82,   9.8 ,  8.)
 ( 11.3,  0.62 ,  0.67,  5.2,  0.086,   6.,  19.,  0.9988 ,  3.22,  0.69,  13.4 ,  8.)
 ( 10.7,  0.35 ,  0.53,  2.6,  0.07 ,   5.,  16.,  0.9972 ,  3.15,  0.65,  11.  ,  8.)
 ...,
 (  7.1,  0.875,  0.05,  5.7,  0.082,   3.,  14.,  0.99808,  3.4 ,  0.52,  10.2 ,  3.)
 (  6.8,  0.815,  0.  ,  1.2,  0.267,  16.,  29.,  0.99471,  3.32,  0.51,   9.8 ,  3.)
 (  6.7,  0.76 ,  0.02,  1.

### Rounding 

Rounding functions:
- `numpy.round`
- `numpy.floor`
- `numpy.ceil`


In [36]:
# rounding 
x = 10*numpy.random.random((1,5))
print("not rounded:", x)

x1 = numpy.round(x, decimals = 2)
print("round:", x1)

x2 = numpy.floor(x)
print("round down:", x2)

x3 = numpy.ceil(x)
print("round up:", x3)

not rounded: [[ 6.69805743  0.54245276  6.61575029  5.84070497  9.46216215]]
round: [[ 6.7   0.54  6.62  5.84  9.46]]
round down: [[ 6.  0.  6.  5.  9.]]
round up: [[  7.   1.   7.   6.  10.]]


### Statistics

Statistics functions:

- `numpy.median` : median
- `numpy.mean` : mean
- `numpy.average`: (weighted) average
- `numpy.std` : standard deviation
- `numpy.var` : variance
- `numpy.cov` : covariance matrix
- `numpy.corrcoef` : Pearson product-moment correlation coefficients

These functions can be applied to the entire array, or along only one axis by passing the parameter `axis`. Similar functions exist which ignore NaN values: `nanmedian`, `nanmean`, `nanstd`, `nanvar`.

For more statistical functions in numpy: http://docs.scipy.org/doc/numpy/reference/routines.statistics.html
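
As a brief sketch of the `axis` parameter and the NaN-ignoring variants, the example below uses a small made-up array `m` containing one missing value.

```python
import numpy

m = numpy.array([[1.0, 2.0, numpy.nan],
                 [4.0, 5.0, 6.0]])

# A NaN propagates through the ordinary functions...
print("mean per column:        ", numpy.mean(m, axis=0))
# ...but is skipped by the nan-variants.
print("nanmean per column:     ", numpy.nanmean(m, axis=0))
print("nanmean per row:        ", numpy.nanmean(m, axis=1))
print("nanmedian of everything:", numpy.nanmedian(m))
```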

### Exercise 3.6a

Define a function `print_summary` which takes a structured array of numerical values and prints, for each column, the following basic statistics:

- name (name of the column in the input array)
- mean 
- median
- min (minimum value)
- max (maximum value)
- std (standard deviation)

For example:
```
column: fixed_acidity
mean: 8.31963727329581
median: 7.9
min: 4.6
max: 15.9
std: 1.7405518001102729
column: volatile_acidity
mean: 0.5278205128205128
median: 0.52
min: 0.12
max: 1.58
...
```

### Exercise 3.6b
Modify the above function so that it takes an additional argument where the user can specify the number of decimal digits to display. For example, `print_summary(data, decimals=2)`:
```
column: fixed_acidity
mean: 8.32
median: 7.9
min: 4.6
max: 15.9
std: 1.74
....
```


In [37]:
# 8< ............................................................
import numpy
population = numpy.genfromtxt("populations.txt", names=True)

def print_summary(data):
    for col in data.dtype.names:
        print("column: {}".format(col))
        print("mean: {}".format(numpy.mean(data[col])))
        print("median: {}".format(numpy.median(data[col])))
        print("min: {}".format(numpy.min(data[col])))
        print("max: {}".format(numpy.max(data[col])))
        print("std: {}".format(numpy.std(data[col])))
        
print_summary(population)

column: year
mean: 1910.0
median: 1910.0
min: 1900.0
max: 1920.0
std: 6.0553007081949835
column: hare
mean: 34080.95238095238
median: 25400.0
min: 7600.0
max: 77400.0
std: 20897.906458089667
column: lynx
mean: 20166.666666666668
median: 12300.0
min: 4000.0
max: 59400.0
std: 16254.591536908763
column: carrot
mean: 42400.0
median: 41800.0
min: 36700.0
max: 48300.0
std: 3322.5062255844787
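
For Exercise 3.6b, one possible sketch adds a `decimals` argument and rounds each statistic with `numpy.round` before printing. The helper `summarize` and the small structured array `demo` are assumptions made for this illustration, so that the example does not depend on `populations.txt`; other designs are equally valid.

```python
import numpy

def summarize(data, decimals=2):
    """Return {column: {statistic: rounded value}} for a structured array.

    Hypothetical helper, introduced only to keep the printing separate.
    """
    functions = [("mean", numpy.mean), ("median", numpy.median),
                 ("min", numpy.min), ("max", numpy.max), ("std", numpy.std)]
    return {col: {name: float(numpy.round(f(data[col]), decimals))
                  for name, f in functions}
            for col in data.dtype.names}

def print_summary(data, decimals=2):
    for col, stats in summarize(data, decimals).items():
        print("column: {}".format(col))
        for name, value in stats.items():
            print("{}: {}".format(name, value))

# Made-up data with two named columns.
demo = numpy.array([(1900.0, 30.0), (1901.0, 47.2), (1902.0, 70.2)],
                   dtype=[("year", float), ("hare", float)])
print_summary(demo, decimals=2)
```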


## Python modules

A Python module is a collection of reusable functions. You can create a module by putting some function definitions in a file with the extension `.py`. For example, put some of the functions you defined above in a file called `functions.py`. You can then use them from any notebook or other Python code by importing like this:

```python
from functions import * 
```
This will import all functions from this module, and they can be used directly.

The alternative is:

```python
import functions as F
```
where `F` is some shortened name. If your module has the function `scale`, you will then call it as `F.scale`.

Try this in a new notebook.


**For assignment 1 you will need to submit a Python module with a number of function definitions.** Make sure you understand this concept.