### Additional Math and Stats Functions

Apart from the universal math functions, some of which we saw earlier, NumPy implements many more functions, covering areas such as linear algebra, Fourier transforms, sampling, statistics, etc.

For a complete list, see this link:

https://numpy.org/doc/stable/reference/routines.html

NumPy used to include a few simple finance-related functions (mainly for interest rate calculations), but these were deprecated in NumPy 1.18 and removed in NumPy 1.20; they now live in the separate `numpy-financial` package. (https://numpy.org/neps/nep-0032-remove-financial-functions.html)

For a more in-depth financial package, you could look at **QuantLib**, a C++ library with Python bindings that provides more advanced algorithms, including yield curve models, Monte Carlo methods, solvers, and more.

In this section we'll focus primarily on a few additional math functions and some stats functions.

In [1]:
import numpy as np

We can find the minimum and maximum of an array:

In [2]:
np.amin(np.array([10, 5, 20]))

np.int64(5)

In [3]:
np.amax(np.array([10, 5, 20]))

np.int64(20)

We can perform these calculations with 2-D arrays as well:

In [4]:
m = np.array([[10, 2, 3], [4, 50, 6], [7, 8, 90]])
m

array([[10,  2,  3],
       [ 4, 50,  6],
       [ 7,  8, 90]])

In [5]:
np.amin(m), np.amax(m)

(np.int64(2), np.int64(90))

We can also specify the axis along which we want NumPy to do this:

In [6]:
np.amin(m, axis=0)

array([4, 2, 3])

As you can see, this returned an array containing the minimum value across all rows (axis `0`) for each column.

Alternatively, we can set our axis to `1`, which means we'll get the minimum for each row across all columns:

In [7]:
np.amin(m, axis=1)

array([2, 4, 7])
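
Since `m` is square, it can be hard to tell which axis was collapsed. As a rough sketch, using a made-up 2 x 3 array (the names `a`, `col_min` and `row_min` are just for illustration), the shape of the result shows which axis was reduced away:

```python
import numpy as np

# A 2 x 3 array (hypothetical data), so the two axes have different lengths
a = np.array([[10, 2, 3],
              [4, 50, 6]])

col_min = np.amin(a, axis=0)  # collapses axis 0 (rows): one value per column
row_min = np.amin(a, axis=1)  # collapses axis 1 (columns): one value per row

print(a.shape)        # (2, 3)
print(col_min.shape)  # (3,)
print(row_min.shape)  # (2,)
```

The axis we pass is the one that disappears from the result's shape.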

Other standard stats functions include the median, mean, and standard deviation:

In [8]:
np.median(np.array([1, 2, 3, 4, 5]))

np.float64(3.0)

In [9]:
np.median(np.array([1, 2, 3, 4, 5, 6]))

np.float64(3.5)

For means, we can use the `mean` function (and you can do weighted averages too, using `average`):

In [10]:
np.mean(np.array([1, 2, 3]))

np.float64(2.0)
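
For the weighted case, here is a minimal sketch using `average` with its `weights` argument (the weights below are made up for illustration):

```python
import numpy as np

values = np.array([1, 2, 3])
weights = np.array([3, 1, 1])  # hypothetical weights, one per value

# Weighted average: (1*3 + 2*1 + 3*1) / (3 + 1 + 1)
weighted = np.average(values, weights=weights)

# Without weights, average gives the same result as mean
unweighted = np.average(values)

print(weighted, unweighted)
```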

Standard deviations can be calculated using the `std` function:

In [11]:
np.std(np.array([-2, -1, 0, 1, 2]))

np.float64(1.4142135623730951)
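
Note that, by default, `std` divides by the number of elements `N` (the population standard deviation). If we want the sample standard deviation instead, which divides by `N - 1`, a small sketch using the `ddof` argument looks like this:

```python
import numpy as np

data = np.array([-2, -1, 0, 1, 2])

population_std = np.std(data)      # divides the sum of squares by N
sample_std = np.std(data, ddof=1)  # divides the sum of squares by N - ddof = N - 1

print(population_std, sample_std)
```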

Of course, these functions work on multi-dimensional arrays as well. We just need to specify the axis over which we are calculating the median, mean, etc.

In [12]:
m = np.array(
    [
        [1, 10, 100],
        [2, 20, 200],
        [3, 30, 300],
        [3, 30, 300],
        [4, 40, 400]
    ]
)

To calculate the mean or median for each column, we set our traversal axis to rows (`0`):

In [13]:
np.mean(m, axis=0)

array([  2.6,  26. , 260. ])

In [14]:
np.median(m, axis=0)

array([  3.,  30., 300.])

And to calculate the mean or median for each row, we need to traverse the columns, so we set the axis to `1`:

In [15]:
np.mean(m, axis=1)

array([ 37.,  74., 111., 111., 148.])

In [16]:
np.median(m, axis=1)

array([10., 20., 30., 30., 40.])

We also have functions for finding the sum of the elements, either of the whole array or along some axis.

In [17]:
np.sum(np.arange(1, 10))

np.int64(45)

It will even sum up every element of a multi-dimensional array as we saw in the last coding video.

In [18]:
m = np.arange(1, 10).reshape(3, 3)
m

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [19]:
np.sum(m)

np.int64(45)

But with a multi-dimensional array we can be more specific, by specifying the axis we want to sum along:

In [20]:
np.sum(m, axis=0)

array([12, 15, 18])

In [21]:
np.sum(m, axis=1)

array([ 6, 15, 24])
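
One place where summing along an axis comes in handy is normalizing rows. The sketch below is only an illustration (the names `row_totals` and `row_share` are made up); it uses the `keepdims` argument so the sums keep a shape that lines up with the original array:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)

# keepdims=True keeps the summed axis with length 1, giving shape (3, 1)
row_totals = np.sum(m, axis=1, keepdims=True)

# Each row is divided by its own total, so every row now sums to 1
row_share = m / row_totals

print(row_totals.shape)
print(row_share.sum(axis=1))
```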

NumPy also implements some rounding functions, such as the `around` function:

In [22]:
arr = np.array([1.11, 2.22, 5.55, 6.66])
arr

array([1.11, 2.22, 5.55, 6.66])

In [23]:
np.around(arr, 1)

array([1.1, 2.2, 5.6, 6.7])

In [24]:
np.around(arr)

array([1., 2., 6., 7.])
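
One thing to be aware of is that `around` rounds values that are exactly halfway between two candidates to the nearest even value. A small sketch with values that floats can represent exactly:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

# Exact halves are rounded to the nearest even integer
rounded = np.around(halves)

print(rounded)  # [0. 2. 2. 4.]
```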

Another very handy function is the `histogram` function, which can calculate a frequency distribution of values in an array, using specified bins.

In [25]:
np.random.seed(0)
arr = np.random.randint(1, 10, 20)
arr

array([6, 1, 4, 4, 8, 4, 6, 3, 5, 8, 7, 9, 9, 2, 7, 8, 8, 9, 2, 6],
      dtype=int32)

We want to calculate a frequency distribution of numbers binned as follows:

```
[0, 2) [2, 4) [4, 6) [6, 8) [8, 9]
```

So our bin edges are: `0`, `2`, `4`, `6`, `8` as well as the (inclusive) rightmost edge `9`.

In [26]:
np.histogram(arr, bins=np.array([0, 2, 4, 6, 8, 9]))

(array([1, 3, 4, 5, 7]), array([0, 2, 4, 6, 8, 9]))
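
To see the effect of the half-open bins and the inclusive rightmost edge in isolation, here is a tiny sketch with made-up values:

```python
import numpy as np

values = np.array([1, 2, 2, 3])

# Two bins: [1, 2) and [2, 3]
# The value 3 falls in the last bin because its right edge is inclusive
counts, edges = np.histogram(values, bins=[1, 2, 3])

print(counts)  # [1 3]
```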

We could also just specify a number of bins we want, and let NumPy work out the edges of uniformly wide bins based on the min/max in our array:

In [27]:
np.histogram(arr, bins=4)

(array([3, 4, 4, 9]), array([1., 3., 5., 7., 9.]))
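
As a rough check of where those edges come from, the sketch below rebuilds them by hand with `linspace`, using the same values as `arr` typed out explicitly so the block stands on its own (the name `manual_edges` is just for illustration):

```python
import numpy as np

arr = np.array([6, 1, 4, 4, 8, 4, 6, 3, 5, 8, 7, 9, 9, 2, 7, 8, 8, 9, 2, 6])

counts, edges = np.histogram(arr, bins=4)

# 4 equally wide bins need 5 evenly spaced edges from the min to the max
manual_edges = np.linspace(arr.min(), arr.max(), 4 + 1)

print(edges)
print(manual_edges)
```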

#### Example

Let's go back to an example we did a while back that involved calculating the frequency distribution of some random data when we were studying random numbers.

This function was used to calculate the frequency distribution for integer values:

In [28]:
def freq_distribution(data):
    freq = {}
    for el in data:
        freq[el] = freq.get(el, 0) + 1
    return freq

In [29]:
data = [1, 1, 1, 2, 2, 3]

In [30]:
freq_d = freq_distribution(data)
freq_d

{1: 3, 2: 2, 3: 1}

Then we calculated the relative frequencies:

In [31]:
def relative_freq(freq_dist):
    sum_freq = sum(freq_dist.values())
    return {
        k: v / sum_freq * 100 for k, v in freq_dist.items()
    }

In [32]:
relative_f = relative_freq(freq_d)
relative_f

{1: 50.0, 2: 33.33333333333333, 3: 16.666666666666664}

Then we sorted and transformed this data into a list of tuples for the number and the frequency:

In [33]:
sorted_items = sorted(relative_f.items(), key=lambda x: x[0])
sorted_items

[(1, 50.0), (2, 33.33333333333333), (3, 16.666666666666664)]

And then we did some rough charting for this data:

In [34]:
def chart_freq(data):
    pad = max([len(str(el[0])) for el in data])
    for k, v in data:
        print(f"{str(k).rjust(pad)}| {'*' * round(v)}")

In [35]:
chart_freq(sorted_items)

1| **************************************************
2| *********************************
3| *****************


Now let's do something similar, but using NumPy.

In [36]:
data

[1, 1, 1, 2, 2, 3]

In [37]:
arr = np.array(data, dtype=int)
arr

array([1, 1, 1, 2, 2, 3])

In [38]:
freq, bins = np.histogram(arr, bins=[1, 2, 3, 3])

In [39]:
freq

array([3, 2, 1])

In [40]:
bins

array([1, 2, 3, 3])
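
Note that repeating the last edge in `bins=[1, 2, 3, 3]` makes the final bin the single point `[3, 3]`, which works because the rightmost edge is inclusive. For small integer data like this, an alternative sketch (not what we use in the rest of this example) is `unique` with `return_counts=True`, which returns the distinct values and how often each occurs:

```python
import numpy as np

arr = np.array([1, 1, 1, 2, 2, 3])

# Sorted distinct values, and the number of occurrences of each
values, counts = np.unique(arr, return_counts=True)

print(values)  # [1 2 3]
print(counts)  # [3 2 1]
```

Unlike `histogram`, this only reports values that actually appear in the data, so a missing value gets no entry rather than a count of zero.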

What we really want is not the absolute frequencies in `freq`, but the relative frequencies, i.e. for each element of `freq`, we need to calculate the value:

```
el / sum(frequencies) * 100
```

We have already seen all the functions we need to do this, so let's go ahead and make the calculations:

In [41]:
freq

array([3, 2, 1])

In [42]:
rel = freq / np.sum(freq) * 100
rel

array([50.        , 33.33333333, 16.66666667])

In [43]:
bins

array([1, 2, 3, 3])

Finally, our rough charting function expects a list of tuples, which we can easily get by zipping up list versions of the two arrays. It is important to use the `tolist` method here, since it not only creates Python `list` objects, but also converts the NumPy scalar types to the proper Python equivalents:

In [44]:
data = list(zip(bins.tolist(), rel.tolist()))
data

[(1, 50.0), (2, 33.33333333333333), (3, 16.666666666666664)]

Note that we did not even have to zip `bins[:-1]` to omit the rightmost bin edge, since `zip` will stop at the shortest iterable, which is `rel`.
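
To make the point about `tolist` concrete, here is a small sketch comparing it with simply calling `list` on an array (the variable names are made up for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3])

via_list = list(arr)       # a Python list, but the elements are still NumPy scalars
via_tolist = arr.tolist()  # a Python list of plain Python ints

print(type(via_list[0]))    # a NumPy integer type (exact name depends on the platform)
print(type(via_tolist[0]))  # <class 'int'>
```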

And now we can chart this data:

In [45]:
chart_freq(data)

1| **************************************************
2| *********************************
3| *****************


Let's put this together, starting with what we had done with the Python version earlier:

In [46]:
import random 

def freq_distribution(data):
    freq = {}
    for el in data:
        freq[el] = freq.get(el, 0) + 1
    return freq

def relative_freq(freq_dist):
    sum_freq = sum(freq_dist.values())
    return {
        k: v / sum_freq * 100 for k, v in freq_dist.items()
    }

def chart_freq(data):
    pad = max([len(str(el[0])) for el in data])
    for k, v in data:
        print(f"{str(k).rjust(pad)}| {'*' * round(v)}")
        
def analyze_randint(n, a, b):
    data = [random.randint(a, b) for _ in range(n)]
    
    freq = freq_distribution(data)
    rel = relative_freq(freq)
    
    sorted_items = sorted(rel.items(), key=lambda x: x[0])
    chart_freq(sorted_items)

In [47]:
random.seed(0)

analyze_randint(10_000, 1, 10)

 1| **********
 2| **********
 3| **********
 4| **********
 5| **********
 6| **********
 7| **********
 8| **********
 9| *********
10| **********


Now, let's do the same with NumPy:

In [48]:
def np_analyze_randint(n, a, b):
    data = np.random.randint(a, b + 1, n)
    bins = np.arange(a, b + 2)
    freq, _ = np.histogram(data, bins=bins)
    rel = freq / np.sum(freq) * 100

    sorted_items = list(zip(bins.tolist(), rel.tolist()))
    print(sorted_items)
    chart_freq(sorted_items)

In [49]:
np.random.seed(0)
np_analyze_randint(10_000, 1, 10)

[(1, 9.67), (2, 10.32), (3, 9.66), (4, 10.100000000000001), (5, 9.629999999999999), (6, 10.17), (7, 9.84), (8, 9.69), (9, 10.489999999999998), (10, 10.43)]
 1| **********
 2| **********
 3| **********
 4| **********
 5| **********
 6| **********
 7| **********
 8| **********
 9| **********
10| **********


As you can see, the code to do these manipulations was a lot more concise in NumPy.

There are actually a few improvements we can easily make. Note how the relative frequency values are rounded inside the charting function. We would be better off rounding using NumPy (with vectorization), which should be more efficient than using Python's rounding in a loop.

Furthermore, why are we taking our arrays, transforming them to lists, zipping them up, and then passing them to the charting function? Let's use NumPy arrays instead.

Let's refactor a bit:

In [50]:
def np_chart_freq(keys, values):
    pad = max(len(key) for key in keys)
    for k, v in zip(keys, values):
        print(f"{k.rjust(pad)}| {'*' * v}")
        
def np_analyze_randint(n, a, b):
    data = np.random.randint(a, b + 1, n)
    bins = np.arange(a, b + 2)
    freq, _ = np.histogram(data, bins=bins)
    rel = np.around(freq / np.sum(freq) * 100)

    np_chart_freq(bins[:-1].astype(str), rel.astype(int))

In [51]:
np.random.seed(0)
np_analyze_randint(10, 1, 5)

1| ********************
2| **********
3| **********
4| ****************************************
5| ********************


Note: we can also vectorize the `len` function we used to calculate the padding. This requires slightly more advanced concepts that we are not going to cover in this course, so here I'll just show you how to do it, given that this example is quite simple:

In [52]:
def np_chart_freq(keys, values):
    np_len = np.vectorize(len)
    pad = np.amax(np_len(keys))
    for k, v in zip(keys, values):
        print(f"{k.rjust(pad)}| {'*' * v}")
        
def np_analyze_randint(n, a, b):
    data = np.random.randint(a, b + 1, n)
    bins = np.arange(a, b + 2)
    freq, _ = np.histogram(data, bins=bins)
    rel = np.around(freq / np.sum(freq) * 100)

    np_chart_freq(bins[:-1].astype(str), rel.astype(int))

In [53]:
np.random.seed(0)
np_analyze_randint(10, 1, 5)

1| ********************
2| **********
3| **********
4| ****************************************
5| ********************
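
Keep in mind that `np.vectorize` is mainly a convenience: under the hood it still calls the Python function once per element. For string lengths specifically, NumPy also provides `np.char.str_len`, so one possible alternative for the padding calculation (a sketch, with `keys` built the same way as in our function) is:

```python
import numpy as np

# Same kind of keys our charting function receives: an array of strings
keys = np.arange(1, 11).astype(str)

# Element-wise string lengths, then the largest one
pad = int(np.amax(np.char.str_len(keys)))

print(pad)  # 2, since "10" is the longest key
```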


In [54]:
from time import perf_counter

In [55]:
random.seed(0)
start = perf_counter()
analyze_randint(30_000_000, 1, 10)
end = perf_counter()
print('Elapsed:', end - start)

 1| **********
 2| **********
 3| **********
 4| **********
 5| **********
 6| **********
 7| **********
 8| **********
 9| **********
10| **********
Elapsed: 18.892832199999248


In [56]:
np.random.seed(0)
start = perf_counter()
np_analyze_randint(30_000_000, 1, 10)
end = perf_counter()
print('Elapsed:', end - start)

 1| **********
 2| **********
 3| **********
 4| **********
 5| **********
 6| **********
 7| **********
 8| **********
 9| **********
10| **********
Elapsed: 1.0461904000003415
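
Since both functions print a chart, every timed run also prints that chart. If we only care about the timing, one possible approach (a sketch only; `timed_quiet` and `noisy` are hypothetical names, not part of the code above) is to redirect standard output while the function runs:

```python
import contextlib
import io
from time import perf_counter


def timed_quiet(func, *args, **kwargs):
    """Hypothetical helper: run func once with printed output discarded."""
    sink = io.StringIO()  # printed text goes here instead of the screen
    start = perf_counter()
    with contextlib.redirect_stdout(sink):
        func(*args, **kwargs)
    return perf_counter() - start


def noisy(n):
    # Stand-in for a function that prints a lot
    for i in range(n):
        print(i)


elapsed = timed_quiet(noisy, 1_000)
print(f"Elapsed: {elapsed:.6f}")
```

The time spent building the printed strings is still included; only the display of the output is skipped.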


In [57]:
%timeit analyze_randint(30_000_000, 1, 10)
%timeit np_analyze_randint(30_000_000, 1, 10)

 1| **********
 2| **********
 3| **********
 4| **********
 5| **********
 6| **********
 7| **********
 8| **********
 9| **********
10| **********