# Numeric Python

## Data Types

An ndarray has to interpret a chunk of memory as a particular type of data, hence the data type (dtype). We can:

(1) assign type information when creating an array;

(2) cast an existing array into a different type. You will, understandably, get an error if you try to cast English letters to integers;

(3) use the Python built-in data types int and float;

(4) rely on inference: more often than not, NumPy can infer a suitable type just fine.

The built-in types work because Python's float is a standard double-precision floating-point number, which is exactly np.float64. Similarly, NumPy maps int to its default integer type, which is typically np.int64 on 64-bit platforms.
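Here is a minimal sketch of the four points above. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# (1) assign the type at creation time
explicit = np.array([1, 2, 3], dtype=np.float32)

# (2) cast an existing array; numeric strings convert fine...
parsed = np.array(['1', '2', '3']).astype(np.int64)

# ...but letters cannot be parsed as integers
try:
    np.array(['a', 'b']).astype(np.int64)
    cast_failed = False
except ValueError:
    cast_failed = True

# (3) the built-in float maps to np.float64
builtin_float = np.array([1, 2, 3], dtype=float)

# (4) inference: one float literal makes the whole array floating point
inferred = np.array([1, 2.5, 3])

print(explicit.dtype, parsed.dtype, builtin_float.dtype, inferred.dtype)
print(cast_failed)
```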

In [1]:
import numpy as np
case_1 = np.array([1,2,3,4,5], dtype=np.int32) #specifying types
case_1_float = np.array([1,2,3,4,5], dtype=np.float64) # a different type
print(case_1.dtype, case_1_float.dtype)

cast_ = case_1.astype('S12') # a byte string of up to 12 bytes; longer values are truncated
print(cast_)
print(cast_.dtype)

cast_ = case_1.astype(np.bytes_) # same cast, NumPy picks the length; np.bytes_ was spelled np.string_ before NumPy 2.0
print(cast_)
print(cast_.dtype)

int32 float64
[b'1' b'2' b'3' b'4' b'5']
|S12
[b'1' b'2' b'3' b'4' b'5']
|S11


In [2]:
case_3 = np.array([1,2,3,3,4], dtype=int)
print(case_3.dtype)

case_4 = np.array(['12', '23'])
print(case_4.dtype)

int64
<U2


Several things to notice here:

(1) 'U' means a Unicode string and 2 is its length in characters. Note that the S12 used earlier gives a byte string, whereas U gives a Unicode string. In other words, we have to decode the bytes (for example as UTF-8) to get a Python str.

(2) Whenever we call astype(), a copy is made by default. To showcase the point, use the flags attribute, see below.

(3) Whenever we slice, however, a view (more like a reference) is generated. Pay attention to the OWNDATA attribute.

In [3]:
cast_.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [4]:
cast_[0:2].flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
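
To tie the three observations together, here is a small sketch (the variable names are made up for illustration). It decodes a byte-string element into a Python str and uses np.shares_memory, which reports whether two arrays overlap in memory, as a complement to the OWNDATA flag.

```python
import numpy as np

numbers = np.array([1, 2, 3], dtype=np.int32)
as_bytes = numbers.astype('S12')

# each element is a bytes object, so decode it to get a Python str
first = as_bytes[0].decode('utf-8')
print(type(first), first)

# astype() returns a new array that does not share memory with its source
print(np.shares_memory(numbers, numbers.astype(np.int32)))  # False

# a basic slice is a view onto the same buffer
print(np.shares_memory(as_bytes, as_bytes[0:2]))  # True
```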

## Slicing
There are several ways to slice an array. Some of them look similar but behave quite differently, especially when it comes to copy versus view. Conclusion first: basic indexing and slicing give a view, while boolean indexing and fancy indexing (indexing with a list or array of integers) give a copy. Let's show them one by one with a two-dimensional array. Last but not least, you can use any combination of them, but the result will be a copy as long as one of the last two methods is involved.

In [5]:
test_array = np.random.randn(10,3)

In [6]:
basic_indexing = test_array[3]
print(basic_indexing.flags.owndata) # False means a view, not a copy
print(basic_indexing.copy().flags.owndata) # an explicit copy owns its data

False
True


In [7]:
basic_slicing = test_array[1:3,:]
print(basic_slicing.flags.owndata)

False


In [8]:
mask = [True]* 5 + [False]*5 #take first half and discard the rest
boolean_indexing = test_array[mask]
print(boolean_indexing.flags.owndata)

True


In [9]:
integer_mask = np.random.choice(np.arange(10), 10)
fancy_indexing = test_array[integer_mask]
print(fancy_indexing.flags.owndata)

True
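
The OWNDATA flag tells us who owns the buffer, but the practical consequence is what happens on assignment. The sketch below (illustrative names, small integer array instead of random data) writes through a view, fails to write through a fancy-indexed copy, and checks a mixed fancy/slice index with np.shares_memory.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)

# basic slicing: a view, so writing to it changes the original
view = a[1:3, :]
view[0, 0] = -1
print(a[1, 0])  # -1

# fancy indexing: a copy, so the original is left untouched
fancy = a[[0, 2]]
fancy[0, 0] = 99
print(a[0, 0])  # 0

# mixing fancy indexing with a slice still produces a copy
mixed = a[[0, 2], 1:3]
print(np.shares_memory(a, mixed))  # False
```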


The author of Python for Data Analysis claims that boolean selection will not fail even if the boolean array is not the correct length. That is not true, at least in recent NumPy versions: a mask of the wrong length raises an IndexError.
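
As a quick check of the claim above, this sketch indexes a 10-row array with a 5-element boolean mask. Recent NumPy versions raise an IndexError here; very old releases may have behaved differently, so treat the exact behavior as version dependent.

```python
import numpy as np

data = np.arange(30).reshape(10, 3)
short_mask = np.array([True] * 5)  # 5 entries for an array with 10 rows

try:
    data[short_mask]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)
```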

Here are more caveats:

(1) The .copy() method always works: you can use it to create an independent copy even when the result comes from basic indexing or slicing.

(2) Slicing and indexing also let us set values. Generally speaking, we can either pass in new data of the right size, or just pass a scalar and let NumPy broadcast it. However, NumPy won't let a wrong shape slide.

In [10]:
setting_data_with_scalar = np.random.randn(10, 2)
setting_data_with_scalar[0, :] = 10
setting_data_with_scalar

array([[10.        , 10.        ],
       [-1.23205384, -0.53359214],
       [-0.57203771, -1.00385177],
       [ 0.06078447,  1.08760236],
       [-0.92981901,  0.05753419],
       [ 0.82754667,  0.12812708],
       [-0.04529814,  0.22594587],
       [-0.58298997, -1.44182617],
       [-0.86815463, -1.01110198],
       [ 1.65811791,  1.3490869 ]])

In [11]:
setting_data_with_array = np.random.randn(10, 2)
setting_data_with_array[0, :] = [10, 10] # this is fine 
setting_data_with_array[0, :] = np.array([10, 10]) # this is also ok
#setting_data_with_array[0, :] = [10, 10, 10] # would raise a ValueError: three values do not fit into two slots
setting_data_with_array

array([[10.        , 10.        ],
       [-0.99661862, -0.03248761],
       [-1.46269561,  0.10846294],
       [ 0.29109391,  0.58900529],
       [ 0.30499742,  0.05051971],
       [ 1.35256148, -1.45365382],
       [-0.24992649,  3.42659455],
       [ 0.83801317, -0.68639557],
       [ 2.3285309 ,  2.55200234],
       [ 0.80087307, -0.28647745]])

Additionally, three operators are both important and annoying. First, ~ means negation: it converts True to False and vice versa. It is a super useful tool, but it works only on NumPy arrays, not on native Python lists.

Secondly, the element-wise "and" and "or" operators are & and |. This is fine, but you cannot use the two English words directly on arrays. And since pandas is built on NumPy, the same annoyance is inherited.
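
A short sketch of the three operators, with made-up variable names. It also shows the two failure modes mentioned above: the word `and` applied to arrays, and ~ applied to a plain list. One extra detail worth knowing is that & and | bind more tightly than comparisons, so inline conditions need parentheses, as in `(values > 0) & (values % 2 == 0)`.

```python
import numpy as np

values = np.array([-2, -1, 0, 1, 2])
positive = values > 0
even = values % 2 == 0

print(values[~positive])        # [-2 -1  0]
print(values[positive & even])  # [2]
print(values[positive | even])  # [-2  0  1  2]

# the English word `and` asks for a single truth value, which an array lacks
try:
    positive and even
    and_failed = False
except ValueError:
    and_failed = True

# ~ is not defined for native Python lists
try:
    ~[True, False]
    list_negation_failed = False
except TypeError:
    list_negation_failed = True

print(and_failed, list_negation_failed)  # True True
```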

## Shape and reshape

Here's something obvious:

1. To get the shape of an array, use the .shape attribute
2. To get the number of dimensions of an array, use the .ndim attribute
3. To reshape the array, use the .reshape() method. It returns a view whenever possible and a copy otherwise
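
A minimal sketch of these three points, using illustrative names. The last part shows a case where reshape cannot return a view: flattening a transposed array in row order needs the elements in an order that no stride pattern over the old buffer can express, so NumPy copies.

```python
import numpy as np

flat = np.arange(6)
grid = flat.reshape(2, 3)

print(grid.shape, grid.ndim)         # (2, 3) 2
print(np.shares_memory(flat, grid))  # True: this reshape is a view

# reshaping the transpose back to 1-D forces a copy
transposed_flat = grid.T.reshape(6)
print(transposed_flat)                          # [0 3 1 4 2 5]
print(np.shares_memory(grid, transposed_flat))  # False
```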

However, there is more to reshape than meets the eye. Think about reshaping a list of numbers into a square matrix. Should you fill it row first or column first? This leads to the distinction between C order and Fortran order. It determines which two numbers are adjacent in memory, and thus makes a performance difference when it comes to aggregation operations. But we'll discuss that later.

In [12]:
np.arange(25).reshape((5,5), order='F')

array([[ 0,  5, 10, 15, 20],
       [ 1,  6, 11, 16, 21],
       [ 2,  7, 12, 17, 22],
       [ 3,  8, 13, 18, 23],
       [ 4,  9, 14, 19, 24]])

In [13]:
np.arange(25).reshape((5,5), order='C') # by row (C order) is the default option

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [14]:
# when we want to get a list from a matrix
# this kind of decision is still relevant
square_matrix = np.arange(25).reshape(5,5)
print(square_matrix.ravel(order='F'))
print(square_matrix.ravel(order='C'))

[ 0  5 10 15 20  1  6 11 16 21  2  7 12 17 22  3  8 13 18 23  4  9 14 19
 24]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24]


One last thing: there are actually two methods in NumPy that do almost the same thing, ravel() and flatten(). Both straighten a matrix into a one-dimensional array. The difference is that flatten() always creates a copy, whereas ravel() returns a view whenever possible. Here's the demo:

In [15]:
square_matrix2 = np.arange(25).reshape(5,5)
view_ = square_matrix2.ravel('C')
copy_ = square_matrix2.flatten('C')
print(view_.flags.owndata) # False: ravel() returned a view
print(copy_.flags.owndata) # True: flatten() returned a copy

False
True
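
One caveat worth sketching (names are illustrative): ravel() only gives a view when the requested order matches the memory layout. Asking a row-major array for column-major order forces a copy.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)  # stored row by row (C order)

row_major = matrix.ravel(order='C')  # matches the layout, so a view
col_major = matrix.ravel(order='F')  # does not match, so a copy

print(np.shares_memory(matrix, row_major))  # True
print(np.shares_memory(matrix, col_major))  # False

# writing through the view changes the original matrix
row_major[0] = 100
print(matrix[0, 0])  # 100
```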


Finally, let's elaborate on the difference between C order and Fortran order and its effect on performance. In C order the last axis varies fastest, whereas in Fortran order the first axis varies fastest. Here's a demo:

In [16]:
np.arange(9).reshape(3,3)

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Remember that axis 0 runs down the rows and axis 1 runs across the columns? If we go by C order, we visit 0 first, then 1, then 2, which means we traverse axis 1 before we jump to the next row. Fortran order is the opposite: we go 0, 3, 6, which means we advance along axis 0 before we advance along axis 1.

Another way to understand this: C order visits the indices [0,0], [0,1], [0,2] before the first index jumps to 1, whereas Fortran order visits [0,0], [1,0], [2,0] and so on.

The whole reason behind this is that data is stored in memory in a linear manner, but we need to organise it in matrix form, and thus we have to make a choice.
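
The layout choice can be inspected directly through the strides attribute, which gives the number of bytes to step along each axis. This is a small sketch with illustrative names. The dtype is fixed to np.int64 so that the stride values do not depend on the platform's default integer size.

```python
import numpy as np

c_arr = np.arange(9, dtype=np.int64).reshape(3, 3)  # C order by default
f_arr = np.asfortranarray(c_arr)                    # same values, column-major memory

# C order: 8 bytes to the next column, 24 bytes to the next row
print(c_arr.strides)  # (24, 8)
# Fortran order: 8 bytes to the next row, 24 bytes to the next column
print(f_arr.strides)  # (8, 24)

# order='K' reads the elements in the order they sit in memory
print(c_arr.ravel(order='K'))  # [0 1 2 3 4 5 6 7 8]
print(f_arr.ravel(order='K'))  # [0 3 6 1 4 7 2 5 8]
```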

This choice means that in the C order layout the elements of a row sit next to each other, hence row operations are faster, and in the Fortran layout column operations are faster. We can easily showcase this on a large dataset, and this is usually one of the directions to squeeze out some performance.

In [17]:
the_c_order = np.random.randn(1000000).reshape((1000, 1000),order='C')
the_fortran_order = np.random.randn(1000000).reshape((1000, 1000), order='F')

In [18]:
%%timeit
c = the_c_order.mean(axis=0) # get the column means

716 µs ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [19]:
%%timeit
f = the_fortran_order.mean(axis=0) # get the column means

398 µs ± 9.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Explanation: we aggregate over each column (along axis 0), and that is how Fortran order organizes data. In this run it nearly doubles the speed! However, after slicing, it's possible that we end up with an array that is neither C contiguous nor Fortran contiguous.
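
A sketch of that last point, with made-up names: taking every other column of a C-ordered array leaves gaps in memory, so the result is contiguous in neither sense. If a contiguous layout matters for later operations, np.ascontiguousarray makes a packed C-order copy (np.asfortranarray does the same for Fortran order).

```python
import numpy as np

table = np.arange(24).reshape(4, 6)
every_other_col = table[:, ::2]  # a strided view with gaps between elements

print(every_other_col.flags.c_contiguous)  # False
print(every_other_col.flags.f_contiguous)  # False

# repack the data into a fresh C-contiguous buffer
packed = np.ascontiguousarray(every_other_col)
print(packed.flags.c_contiguous)                # True
print(np.array_equal(packed, every_other_col))  # True
```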