In [1]:
import numpy as np

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/775px-NumPy_logo.svg.png)

## The Scientific Python ecosystem is built on Numpy
![](https://gcpy.readthedocs.io/en/latest/_images/state_of_the_stack_2015.png)

## Arrays

The basic object of numpy is the ndarray.

An ndarray is a multidimensional container of items of the same type and size. The number of dimensions and items in an array is defined by its shape.

In [2]:
x = np.array([1,2,3,4])
print(x)
print(f'Shape:{x.shape}')

[1 2 3 4]
Shape:(4,)


In [3]:
# Mixed-type arrays are coerced to the most general type (here, strings) or raise an error
x = np.array([1,2,3,'oops'])
print(x)

['1' '2' '3' 'oops']
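
To see which type numpy settled on, you can inspect the array's `dtype`. The following is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Integers mixed with a float are promoted to a common float type
ints_and_floats = np.array([1, 2, 3.5])
print(ints_and_floats.dtype)        # float64

# Numbers mixed with a string are all converted to strings
with_string = np.array([1, 2, 3, 'oops'])
print(with_string.dtype.kind)       # U (unicode string)

# Forcing a conversion that cannot succeed raises an error instead
try:
    np.array([1, 2, 3, 'oops'], dtype=float)
except ValueError as err:
    print(f'ValueError: {err}')
```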


In [4]:
x = np.array([[1,1],[2,2]])
print('A 2-D array:')
print(x)
print(f'Shape:{x.shape}')

A 2-D array:
[[1 1]
 [2 2]]
Shape:(2, 2)


In [5]:
x = np.array([[[1,1],[2,2]],[[3,3],[4,4]]])
print('A 3-D array:')
print(x)
print(f'Shape:{x.shape}')

A 3-D array:
[[[1 1]
  [2 2]]

 [[3 3]
  [4 4]]]
Shape:(2, 2, 2)


## Vectorization
Arrays are important because they let you express batch operations on data without writing any for loops. This is usually called vectorization. Any arithmetic operation between equal-size arrays is applied elementwise.

Vectorization is more than a concise way of writing computations; it is also **fast**.

For loops are executed by the Python interpreter, while numpy vectorized calculations are offloaded to compiled C or Fortran code.  This has two benefits:
1. Compiled C and Fortran code is much faster than interpreted Python
2. Many numpy operations release the Python GIL while they run, freeing it up for other work
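
As a small illustration of what "elementwise" means, the sketch below checks that a vectorized product matches an explicit Python loop. The seeded generator and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
a = rng.random(1000)
b = rng.random(1000)

# Vectorized: one expression, the loop runs inside compiled code
vectorized = a * b

# Equivalent explicit Python loop, one multiplication per iteration
looped = np.empty_like(a)
for i in range(len(a)):
    looped[i] = a[i] * b[i]

print(np.array_equal(vectorized, looped))  # True
```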

In [6]:
# Initialize two 1-D arrays of length 1,000
a = np.random.rand(1000)
b = np.random.rand(1000)

In [7]:
%%timeit 
a*b

958 ns ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [8]:
%%timeit 
for i in range(1000):
    a[i]*b[i]

350 µs ± 29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


The vectorized calculation in this example is several hundred times faster than the non-vectorized calculation (under 1 µs versus about 350 µs).

**When working with numerical computation on an array, you should always look for a vectorized way of executing.**

## Universal Functions (ufunc)

Numpy refers to its vectorized calculations as universal functions. From the docs: a ufunc is a “vectorized” wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.  There are a [ton of them](https://docs.scipy.org/doc/numpy-1.15.1/reference/ufuncs.html), many of which have shorthand arithmetic notation.  Many numpy functions are also callable as a method on the array (see sum).
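
The sketch below shows the relationship between operators, ufuncs, and methods on a pair of small illustrative arrays.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# np.add is a ufunc object; the + operator is shorthand for it
print(isinstance(np.add, np.ufunc))                  # True
print(np.array_equal(a + b, np.add(a, b)))           # True

# Comparison operators map to ufuncs as well
print(np.array_equal(a > 1.5, np.greater(a, 1.5)))   # True

# Aggregations such as sum are available as functions and as methods
print(np.sum(a) == a.sum())                          # True
```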

In [None]:
x = np.random.rand(1000,1000)
# How many vectorized/ufunc calculations are embedded in this line of code?
np.logical_or(x<.025, x>.975).sum()/(1000**2)

In [None]:
np.divide(np.sum(np.logical_or(np.less(x,.025), np.greater(x,.975))),(1000**2))

#### Exercise 3.1: What is the coefficient of variation (standard deviation/mean) of x?

In [None]:
x = np.arange(100)

In [None]:
# Your answer here:

#%load ./answers/03.1.py

## Slicing
Often you will want to home in on certain data in the array.  If you know the position of the elements in the array, you can use slicing notation, which uses brackets of the form `[start:stop:step]` for every dimension of the array.  Each parameter of the slice is optional.
1. If you omit start, it will start from the first element
2. If you omit stop, it will end at the last element
3. If you omit step, it will grab every element between start and stop
4. If you set step to -1, it will reverse the order of elements

**Slicing is a Python construct not unique to numpy. These rules also apply to lists and pandas objects, in addition to numpy arrays.**
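
The same notation can be tried on a plain list and on an array side by side. The sketch below uses illustrative values and also shows one difference worth knowing: slicing a list makes a copy, while basic slicing of a numpy array returns a view onto the original data.

```python
import numpy as np

values = [10, 20, 30, 40, 50, 60]
arr = np.array(values)

# The same [start:stop:step] notation works on lists and arrays
print(values[1:5:2])   # [20, 40]
print(arr[1:5:2])      # [20 40]
print(arr[::-1])       # [60 50 40 30 20 10]

# A list slice is a copy, so the original list is unchanged
list_slice = values[:2]
list_slice[0] = -1
print(values[0])       # 10

# A numpy slice is a view, so writing to it changes the original array
arr_slice = arr[:2]
arr_slice[0] = -1
print(arr[0])          # -1
```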

In [None]:
x = np.array([1,2,3,4,5,6,7,8,9,10])
print(x[3:6])
print(x[:])
print(x[1::2])
print(x[::-2])

In [None]:
x = np.array([[(y+1)*(x+1) for y in range(10)] for x in range(10)])
print(x)
print(x[5:8,0:3])
print(x[::-1,::-1])

#### Exercise 3.2: Extract the values in the 3rd through 6th index of x in reverse order.

In [None]:
x = np.arange(10)

In [None]:
# Your answer here:

#%load ./answers/03.2.py

## Filtering with boolean arrays
A boolean array is an array whose elements take on True or False values.  You can filter a multidimensional array with a boolean array, and you can create boolean arrays by applying boolean expressions to non-boolean arrays.  To filter elementwise, the boolean array must have the same shape as the array being filtered.
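
A minimal sketch of filtering a 2-D array, using a small illustrative grid. Note that the result of a same-shape boolean filter is always flattened to one dimension.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# A boolean expression on an array produces a boolean array of the same shape
mask = grid > 2
print(mask.shape)                       # (2, 3)

# Filtering a 2-D array with a same-shape mask returns a flat 1-D result
print(grid[mask])                       # [3 4 5 6]

# Combine conditions with & (and) and | (or); the parentheses are required
print(grid[(grid > 2) & (grid < 6)])    # [3 4 5]
```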

In [None]:
x = np.array([1,2,3,4,5,6])
y = np.array([True,False,False,True,True,False])
x[y]

In [None]:
x[x>3]

#### Exercise 3.3: Create an array of values 1 to 100 and filter out all elements that are divisible by 4 using the np.mod ufunc.

In [None]:
x = np.arange(100)+1

In [None]:
# Your answer here:

#%load ./answers/03.3.py

## Aggregates along an Axis (dimension)
There are a variety of aggregate functions you can apply along a dimension.
sum, min, max, count_nonzero, cumsum, var, prod, and percentile are all valid aggregate functions.
To aggregate over a specific axis, use the axis argument.  Most of these functions have nan-prefixed versions 
(for example nansum) that are designed to ignore NaN elements.
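
The axis argument names the dimension that is collapsed by the aggregate. The sketch below uses a small illustrative 2x3 table with one missing value.

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, np.nan, 6.0]])

# axis=0 collapses the rows, giving one value per column
column_totals = np.nansum(table, axis=0)
print(column_totals)              # [5. 2. 9.]

# axis=1 collapses the columns, giving one value per row
row_totals = np.nansum(table, axis=1)
print(row_totals)                 # [ 6. 10.]

# Without the nan-aware version, a single NaN contaminates its row
plain_row_totals = np.sum(table, axis=1)
print(np.isnan(plain_row_totals)) # [False  True]
```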


In [None]:
x = np.array([1,2,3,4,5,6])
print(np.sum(x))
x = np.array([1,2,np.nan,4,5,6])
print(np.sum(x))

In [None]:
x = np.array([1,2,np.nan,4,5,6])
print(np.nansum(x))

In [None]:
x = np.array([[(y+1)*(x+1) for y in range(10)] for x in range(10)])
print(x)
print(np.sum(x,axis=1))
print(np.sum(x))

#### Exercise 3.4: What is the mean of the sum of each column in x?

In [None]:
x = np.array([[(y+1)*(x+1) for y in range(-5,5)] for x in range(10)])

In [None]:
# Your answer here:

#%load ./answers/03.4.py

## Modifying shape
You will often need to append additional dimensions, elements within dimensions, or generally change the shape of a multi-dimensional array.

reshape, expand_dims, and concatenate are your friends for this.
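
A quick way to get comfortable with these functions is to watch how each one changes `shape`. The sketch below is illustrative; passing -1 to reshape asks numpy to infer that dimension from the number of elements.

```python
import numpy as np

x = np.arange(6)                 # shape (6,)

# reshape keeps the same elements in a new shape; -1 lets numpy infer one dimension
grid = x.reshape(2, -1)
print(grid.shape)                # (2, 3)

# expand_dims adds a new axis of length 1
column = np.expand_dims(x, axis=1)
print(column.shape)              # (6, 1)

# concatenate joins arrays along an existing axis
stacked = np.concatenate((grid, grid * 10), axis=0)
print(stacked.shape)             # (4, 3)
```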

In [None]:
x = np.arange(12)+1
print(x)
print(np.reshape(x, (4,3)))
print(np.reshape(x, (3,2,2)))

In [None]:
x = np.arange(10)
print(x)
x = np.expand_dims(x, axis=1)
print(x)
y = 2*x
np.concatenate((x,y),axis=1)

#### Exercise 3.5: Transpose x to a 100 x 1 matrix using reshape.

In [None]:
x = np.arange(100)

In [None]:
# Your answer here:

#%load ./answers/03.5.py

## Real world example

Building a multi-dimensional IBNR model in numpy

### Create Initial multi-dimensional array

In [15]:
import pandas as pd
cas = pd.read_csv(r'http://www.casact.org/research/reserve_data/wkcomp_pos.csv')
# Let's make the pandas dataframe look more triangle like
triangle_frame = pd.pivot_table(cas[cas['DevelopmentYear']<=1997], 
                                values='CumPaidLoss_D', 
                                index=['GRNAME','AccidentYear'], 
                                columns='DevelopmentLag')
# Let's use the reshape method to create a 3-D Matrix of triangles
triangle_array = np.array(triangle_frame).reshape(len(cas['GRNAME'].unique()),
                                                  len(cas['AccidentYear'].unique()),
                                                  len(cas['DevelopmentLag'].unique()))

In [20]:
triangle_frame

CumPaidLoss_D by DevelopmentLag (columns 1-10):
GRNAME,AccidentYear,1,2,3,4,5,6,7,8,9,10
Agway Ins Co,1988,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Agway Ins Co,1989,0.0,0.0,0.0,0.0,0.0,0.0,23.0,23.0,31.0,
Agway Ins Co,1990,0.0,2.0,2.0,2.0,2.0,16.0,16.0,23.0,,
Agway Ins Co,1991,8.0,17.0,25.0,31.0,26.0,29.0,38.0,,,
Agway Ins Co,1992,0.0,0.0,0.0,0.0,0.0,0.0,,,,
Agway Ins Co,1993,0.0,0.0,0.0,0.0,0.0,,,,,
Agway Ins Co,1994,0.0,0.0,0.0,0.0,,,,,,
Agway Ins Co,1995,0.0,0.0,0.0,,,,,,,
Agway Ins Co,1996,0.0,0.0,,,,,,,,
Agway Ins Co,1997,0.0,,,,,,,,,


In [None]:
x

In [None]:
# triangle_array is a set of 10x10 triangles for more than 100 companies.
triangle_array.shape

### Clean up missing and zero values

In [None]:
# This is not a very good triangle to have in our model.
triangle_array[-1]

In [None]:
# Let's get rid of completely empty triangles
triangle_sum = np.nansum(np.nansum(triangle_array, axis=1),axis=1)
triangle_array = triangle_array[triangle_sum!=0,:,:]
# let's turn 0's to nan - this will alleviate issues around dividing by zero
triangle_array[triangle_array==0]=np.nan
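
The clean-up step can be followed on a toy example. The sketch below uses two hypothetical 2x2 triangles (not taken from the CAS data) and sums over both triangle axes in a single call, which is equivalent to the nested `nansum` above.

```python
import numpy as np

# Two hypothetical 2x2 triangles; the second one is completely empty
toy = np.array([[[5.0, 8.0], [6.0, np.nan]],
                [[0.0, 0.0], [0.0, np.nan]]])

# Sum over both triangle axes at once to get one total per company
totals = np.nansum(toy, axis=(1, 2))
print(totals)              # [19.  0.]

# A 1-D boolean mask on the first axis keeps only the non-empty triangles
kept = toy[totals != 0]
print(kept.shape)          # (1, 2, 2)

# Remaining zeros become NaN so later divisions do not divide by zero
kept[kept == 0] = np.nan
```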

### Age-to-age factors

In [None]:
# Let's use slicing to create age-to-age factors
ata_array = triangle_array[:,:-1,1:]/triangle_array[:,:-1,:-1]
# Let's default the completely blank age-to-age columns with 1.0
accident_periods = len(cas['DevelopmentLag'].unique())
ata_array_defaults = np.expand_dims(np.all(np.isnan(ata_array),axis=1),axis=1)
ata_array[np.repeat(ata_array_defaults,accident_periods-1,axis=1)]=1.0
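
To see what the two slices are doing, here is a sketch on a single hypothetical 3x3 cumulative triangle. The numbers are made up for illustration.

```python
import numpy as np

# One hypothetical triangle: rows are accident years, columns are development lags
toy = np.array([[[100.0, 150.0, 165.0],
                 [110.0, 176.0, np.nan],
                 [120.0, np.nan, np.nan]]])

# [:, :-1, 1:] selects lags 2..n and [:, :-1, :-1] selects lags 1..n-1,
# so the division gives next-lag / current-lag for each accident year.
# The last accident year has only one observation, so :-1 drops it.
toy_ata = toy[:, :-1, 1:] / toy[:, :-1, :-1]
print(toy_ata.shape)   # (1, 2, 2)
print(toy_ata)         # factors of roughly 1.5, 1.1 and 1.6, plus one NaN
```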

### LDFs and CDFs

In [None]:
# Let's create an array of LDFs, by taking simple averages of the age-to-age factors.
ldf_array = np.nanmean(ata_array, axis=1)
ldf_array[np.isnan(ldf_array)]=1.0
# Let's create an array of CDFs with a tail factor from our LDFs
cdf_array = ldf_array[:,::-1].cumprod(axis=1)[:,::-1]
tail_factor = 1.0
cdf_array = np.append(cdf_array,np.expand_dims(np.repeat(tail_factor,cdf_array.shape[0]),1),axis=1)
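
The reverse, cumprod, reverse pattern is easier to follow on a tiny input. The sketch below uses hypothetical LDFs for one company and `np.full` to build the tail column, which is an alternative to the `expand_dims`/`repeat` combination above.

```python
import numpy as np

# Hypothetical LDFs for one company: 12-24, 24-36, 36-48
toy_ldf = np.array([[2.0, 1.5, 1.1]])

# Reverse, take the running product, then reverse back, so that
# position k holds the product of all factors from lag k onward
toy_cdf = toy_ldf[:, ::-1].cumprod(axis=1)[:, ::-1]
print(toy_cdf)         # roughly 3.3, 1.65, 1.1

# Append the tail factor as one extra column per company
tail_factor = 1.0
toy_cdf = np.append(toy_cdf, np.full((toy_cdf.shape[0], 1), tail_factor), axis=1)
print(toy_cdf.shape)   # (1, 4)
```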

### Ultimates and IBNR

In [None]:
latest_diagonal = np.nan_to_num(np.diagonal(triangle_array[:,::-1,],axis1=1,axis2=2)[:,::-1])
ultimate = latest_diagonal * cdf_array[:,::-1]
ibnr = ultimate - latest_diagonal
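
The latest-diagonal extraction packs several steps into one line. The sketch below unpacks it on a hypothetical 3x3 triangle, with made-up CDFs ordered by development lag.

```python
import numpy as np

toy = np.array([[[100.0, 150.0, 165.0],
                 [110.0, 176.0, np.nan],
                 [120.0, np.nan, np.nan]]])

# Flipping the accident-year axis moves the latest diagonal onto the main diagonal
flipped = toy[:, ::-1, :]
by_lag = np.diagonal(flipped, axis1=1, axis2=2)   # ordered by development lag
latest = by_lag[:, ::-1]                          # reordered by accident year
print(latest)                                     # [[165. 176. 120.]]

# Hypothetical CDFs at lag 1, 2, 3 (the last one is the tail factor)
toy_cdf = np.array([[1.76, 1.1, 1.0]])

# The oldest accident year sits at the latest lag, so the CDFs are reversed
toy_ultimate = latest * toy_cdf[:, ::-1]
toy_ibnr = toy_ultimate - latest
print(toy_ibnr)                                   # roughly 0, 17.6, 91.2
```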

### The business questions answered by our model

In [None]:
companies = np.array(triangle_frame.index.levels[0])[triangle_sum!=0]
print('How much IBNR does the entire industry need according to this model?')
print(np.sum(ibnr).round(0))
print()
print('What is the average ultimate to paid ratio across the industry?')
print((np.sum(ultimate)/np.sum(latest_diagonal)).round(3))
print()
print('Which company has the highest 12-Ultimate CDF?')
print(companies[np.argmax(cdf_array[:,0])])
print()
print('Which company has the lowest 12-24 LDF?')
print(companies[np.argmin(ldf_array[:,0])])  # column 0 holds the 12-24 factor
print()
print('What is the 95% confidence interval on the estimate of 12-Ultimate CDF?')
sorted_cdf = np.sort(cdf_array[:,0])  # column 0 holds the 12-Ultimate CDF
print((sorted_cdf[int(.025*len(sorted_cdf))], sorted_cdf[int(.975*len(sorted_cdf))]))
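
An interval like the one in the last question is the range between the 2.5th and 97.5th percentiles of the values across companies. As an alternative to sorting and indexing by hand, `np.percentile` computes both bounds in one call. The sketch below uses randomly generated stand-in values, not the CAS data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 12-Ultimate CDFs for 200 companies (made-up data)
sample_cdfs = 1.0 + rng.lognormal(mean=0.0, sigma=0.5, size=200)

# Both percentile bounds in a single call
lower, upper = np.percentile(sample_cdfs, [2.5, 97.5])
print(lower, upper)
```

If the data contains NaN values, `np.nanpercentile` takes the same arguments and ignores them.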

Performance test of the above code:

In [None]:
%%timeit
triangle_array = np.array(triangle_frame).reshape(len(cas['GRNAME'].unique()),len(cas['AccidentYear'].unique()),len(cas['DevelopmentLag'].unique()))
triangle_sum = np.nansum(np.nansum(triangle_array, axis=1),axis=1)
triangle_array = triangle_array[triangle_sum!=0,:,:]
triangle_array[triangle_array==0]=np.nan
ata_array = triangle_array[:,:-1,1:]/triangle_array[:,:-1,:-1]
accident_periods = len(cas['DevelopmentLag'].unique())
ata_array_defaults = np.expand_dims(np.all(np.isnan(ata_array),axis=1),axis=1)
ata_array[np.repeat(ata_array_defaults,accident_periods-1,axis=1)]=1.0
ldf_array = np.nanmean(ata_array, axis=1)
ldf_array[np.isnan(ldf_array)]=1.0
cdf_array = ldf_array[:,::-1].cumprod(axis=1)[:,::-1]
tail_factor = 1.0
cdf_array = np.append(cdf_array,np.expand_dims(np.repeat(tail_factor,cdf_array.shape[0]),1),axis=1)
latest_diagonal = np.nan_to_num(np.diagonal(triangle_array[:,::-1,],axis1=1,axis2=2)[:,::-1])
ultimate = latest_diagonal * cdf_array[:,::-1]
ibnr = ultimate - latest_diagonal

In [None]:
for m in [ i for i in dir(np) if i[0] >= 'a' and i[0]<='z']:
    print(m)

In [14]:
for m in [ i for i in dir(np.random) if i[0] >= 'a' and i[0]<='z']:
    print(m)

absolute_import
bench
beta
binomial
bytes
chisquare
choice
dirichlet
division
exponential
f
gamma
geometric
get_state
gumbel
hypergeometric
info
laplace
logistic
lognormal
logseries
mtrand
multinomial
multivariate_normal
negative_binomial
noncentral_chisquare
noncentral_f
normal
np
operator
pareto
permutation
poisson
power
print_function
rand
randint
randn
random
random_integers
random_sample
ranf
rayleigh
sample
seed
set_state
shuffle
standard_cauchy
standard_exponential
standard_gamma
standard_normal
standard_t
test
triangular
uniform
vonmises
wald
weibull
zipf


In [None]:
# optional pause for something more advanced... 
for m in [ i for i in dir(np) if i[0] >= 'a' and i[0]<='z']:
    print(f'\n\n{m}\n{"="*len(m)}\n')
    print(np.__getattribute__(m).__doc__)

In [None]:
df = pd.DataFrame([(i, getattr(x, i).__doc__) for i in dir(x)], columns=['Method', 'Help'])

In [None]:
cas = pd.read_csv(r'http://www.casact.org/research/reserve_data/wkcomp_pos.csv')
# Let's make the pandas dataframe look more triangle like
triangle_frame = pd.pivot_table(cas[cas['DevelopmentYear']<=1997], 
                                values='CumPaidLoss_D', 
                                index=['GRNAME','AccidentYear'], 
                                columns='DevelopmentLag')

In [29]:
cas.query(' GRCODE==86 and AccidentYear + DevelopmentLag <=1998 ')

Unnamed: 0,GRCODE,GRNAME,AccidentYear,DevelopmentYear,DevelopmentLag,IncurLoss_D,CumPaidLoss_D,BulkLoss_D,EarnedPremDIR_D,EarnedPremCeded_D,EarnedPremNet_D,Single,PostedReserve97_D
0,86,Allstate Ins Co Grp,1988,1988,1,367404,70571,127737,400699,5957,394742,0,281872
1,86,Allstate Ins Co Grp,1988,1989,2,362988,155905,60173,400699,5957,394742,0,281872
2,86,Allstate Ins Co Grp,1988,1990,3,347288,220744,27763,400699,5957,394742,0,281872
3,86,Allstate Ins Co Grp,1988,1991,4,330648,251595,15280,400699,5957,394742,0,281872
4,86,Allstate Ins Co Grp,1988,1992,5,354690,274156,27689,400699,5957,394742,0,281872
5,86,Allstate Ins Co Grp,1988,1993,6,350092,287676,20641,400699,5957,394742,0,281872
6,86,Allstate Ins Co Grp,1988,1994,7,346808,298499,14513,400699,5957,394742,0,281872
7,86,Allstate Ins Co Grp,1988,1995,8,349124,304873,15862,400699,5957,394742,0,281872
8,86,Allstate Ins Co Grp,1988,1996,9,348157,321808,8974,400699,5957,394742,0,281872
9,86,Allstate Ins Co Grp,1988,1997,10,347762,325322,8843,400699,5957,394742,0,281872
