# Working with Tabular Numeric Data

Python is an excellent text processing tool, but it sometimes fails to deliver adequate numeric performance. That’s where numpy comes to the rescue.

Module *numpy* - Numerical Python (usually imported as np) - is an interface to a family of efficient and parallelizable functions that implement high-performance numerical operations.

*numpy* provides a new Python data structure - the array - and a toolbox of array-specific functions.


## Bridge to Terabytia

If your program needs access to really huge amounts of numerical data - terabytes and more - you can’t avoid using module h5py.

The module is a portal to the HDF5 binary data format that works with a lot of third-party software, such as IDL and MATLAB.

H5py imitates familiar numpy and Python mechanisms, such as arrays and dictionaries.


## Creating Arrays 

*numpy* arrays are more compact and faster than native Python lists, especially in multi-dimensional cases. However, unlike lists, arrays are homogeneous: you cannot mix and match array items that belong to different data types.

Function *array()* creates an array from array-like data. The data can be a list, a tuple or another array.

*numpy* infers the type of the array elements from the data, unless you explicitly pass the dtype parameter. 
*numpy* supports close to twenty data types, such as bool_, int64, uint64, float64, and U32 (for Unicode strings).
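The inference rule is easy to check by hand. The following is a minimal sketch with made-up sample values (the variable names are illustrative only); it prints the inferred types and shows how an explicit dtype overrides them.

```python
import numpy as np

# numpy picks one common type that can hold every item of the source data
ints = np.array([1, 2, 3])            # integers only -> an integer dtype
floats = np.array([1, 2.5, 3])        # a single float promotes all items to float64
words = np.array(["MMM", "ABBV"])     # strings -> fixed-width Unicode, 4 characters wide

# An explicit dtype parameter overrides the inference
forced = np.array([1, 2, 3], dtype=np.float64)

print(ints.dtype, floats.dtype, words.dtype, forced.dtype)
```

The exact integer type depends on the platform, which is one reason to pass dtype explicitly when the width matters.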

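When function *array()* creates an array, it copies the data from the source to the new array by default (copy=True), so later changes to the source do not affect the array. Copying is safe, but it costs time and memory when the amount of data is large.

Many other operations do not copy. Slicing, reshaping, and transposing return a view of the underlying data, and function asarray() reuses an existing numpy array instead of copying it. If the underlying data object changes, the view changes, too. If this behavior is undesirable (which it almost always is, unless the amount of data to copy is prohibitively large), make an explicit copy.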
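Whether two arrays share their data can be checked with function shares_memory(). The sketch below uses small made-up arrays to compare array(), asarray(), and a slice; it is an illustration, not an exhaustive list of the cases in which numpy copies.

```python
import numpy as np

source = np.array([1, 2, 3])
copied = np.array(source)       # array() copies its input by default
aliased = np.asarray(source)    # asarray() reuses an existing array
window = source[:2]             # a slice is a view of the same data

source[0] = 99                  # change the underlying data

print(copied[0], aliased[0], window[0])
print(np.shares_memory(source, copied), np.shares_memory(source, window))
```

Only the copy keeps the old value; the alias and the slice both see the change.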

Contrary to their name, Python “lists” are actually implemented as arrays, not as lists.

Large Python lists require only about 13% more storage than “real” numpy arrays. However, Python executes some simple built-in operations, such as sum(), 5–10 times faster on lists than on arrays. 

Ask yourself if you really need any numpy-specific features before you start a numpy project!



Let’s create our first array—a silly array of the first 10 positive integer numbers:
    

In [3]:
import numpy as np 
numbers = np.array(range(1, 11), copy=True)
print(numbers)

[ 1  2  3  4  5  6  7  8  9 10]


Functions *ones()*, *zeros()*, and *empty()* construct arrays of all ones, all zeros, and all uninitialized entries, respectively. 

They take a required shape parameter - a list or tuple of array dimensions.

In [5]:
print("Function ones()")
ones = np.ones([2, 4], dtype=np.float64)
ones


Function ones()


array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

In [6]:
print("Function zeros()")
zeros = np.zeros([2, 4], dtype=np.float64)
zeros

Function zeros()


array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [7]:
print("Function empty()")
empty = np.empty([2, 4], dtype=np.float64)
# The array content is not always zeros!
empty

Function empty()


array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

*numpy* stores the number of dimensions, the shape, and the data type of an array in the attributes ndim, shape, and dtype.

In [8]:
ones.shape # The original shape, unless reshaped


(2, 4)

In [9]:
numbers.ndim # Same as len(numbers.shape)


1

In [10]:
zeros.dtype

dtype('float64')

Function *eye(N, M=None, k=0, dtype=float)* constructs an N×M eye-dentity matrix with ones on the kth diagonal and zeros elsewhere. Positive *k* denotes the diagonals above the main diagonal. When M is None (default), M=N.

In [11]:
eye = np.eye(3, k=1)
eye

array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  0.]])

In addition to the good old built-in *range()*, numpy has its own, more efficient way of generating arrays of regularly spaced numbers: function *arange([start,] stop[, step,], dtype=None)*.

In [12]:
np_numbers = np.arange(2, 5, 0.25)
np_numbers


array([ 2.  ,  2.25,  2.5 ,  2.75,  3.  ,  3.25,  3.5 ,  3.75,  4.  ,
        4.25,  4.5 ,  4.75])

Just like with *range()*, the value of start can be larger than stop - but then step must be negative, and the order of numbers in the array is decreasing.

In [13]:
np_numbers = np.arange(5, 2, -0.25)
np_numbers

array([ 5.  ,  4.75,  4.5 ,  4.25,  4.  ,  3.75,  3.5 ,  3.25,  3.  ,
        2.75,  2.5 ,  2.25])

In the case of type narrowing (converting to a more specific data type), some information may be lost. 

This is true about any narrowing, not just in numpy.

In [14]:
np_inumbers = np_numbers.astype(int)
np_inumbers

array([5, 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2])
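Note that the conversion above drops the fractional part rather than rounding it. The following sketch, with made-up values, contrasts plain narrowing with rounding first; using rint() before the conversion is one possible choice, not the only one.

```python
import numpy as np

values = np.array([2.75, -2.75, 0.5])

truncated = values.astype(int)           # drops the fraction: 2, -2, 0
rounded = np.rint(values).astype(int)    # rounds to the nearest integer first: 3, -3, 0

print(truncated, rounded)
```

Function rint() rounds ties to the nearest even integer, which is why 0.5 becomes 0.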

To preserve the original data, function copy() creates a copy of an existing array. 
If your array has a billion items, think twice before copying it.

In [16]:
np_inumbers_copy = np_inumbers.copy()
np_inumbers_copy


array([5, 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2])

## Transposing and Reshaping

In [17]:
# Some S&P stock symbols
sap = np.array(["MMM", "ABT", "ABBV", "ACN", "ACE", "ATVI", "ADBE", "ADT"])
sap


array(['MMM', 'ABT', 'ABBV', 'ACN', 'ACE', 'ATVI', 'ADBE', 'ADT'], 
      dtype='<U4')

Reshaping it to 2D

In [18]:
sap2d = sap.reshape(2, 4)
sap2d

array([['MMM', 'ABT', 'ABBV', 'ACN'],
       ['ACE', 'ATVI', 'ADBE', 'ADT']], 
      dtype='<U4')

Reshaping it to 3D

In [19]:
sap3d = sap.reshape(2, 2, 2)
sap3d

array([[['MMM', 'ABT'],
        ['ABBV', 'ACN']],

       [['ACE', 'ATVI'],
        ['ADBE', 'ADT']]], 
      dtype='<U4')

The value of the attribute T is the transposed view of the array (for a one-dimensional array, data.T==data; for a two-dimensional array, the rows and the columns are swapped).

In [11]:
print(sap2d, "\n")
print(sap2d.T)

[['MMM' 'ABT' 'ABBV' 'ACN']
 ['ACE' 'ATVI' 'ADBE' 'ADT']] 

[['MMM' 'ACE']
 ['ABT' 'ATVI']
 ['ABBV' 'ADBE']
 ['ACN' 'ADT']]
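Because T is a view, writing through it changes the original array. Here is a minimal sketch on a small numeric array (the names grid and flipped are made up for this example); it also uses -1 in reshape(), which asks numpy to infer that dimension from the array size.

```python
import numpy as np

grid = np.arange(6).reshape(2, -1)   # -1 lets numpy infer the second dimension (3)
flipped = grid.T                     # shape (3, 2); shares data with grid

flipped[0, 1] = 100                  # the same item as grid[1, 0]

print(grid)
print(np.shares_memory(grid, flipped))
```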


## Indexing and Slicing

numpy arrays support the same indexing [i] and slicing [i:j] operations as Python lists. 

In addition, they implement Boolean indexing: you can use an array of Boolean values as an index, and the result of the selection is the array of items of the original array for which the Boolean index is True.

Suppose your data sponsor told you that all data in the data set dirty are strictly non-negative. 

This means that any negative value is not a true value but an error, and you must replace it with something that makes more sense - say, with a zero. This operation is called data cleaning. 


To clean the dirty data, locate the offending values and substitute them with reasonable alternatives.

In [21]:
dirty = np.array([9, 4, 1, -0.01, -0.02, -0.001])
dirty


array([  9.00000000e+00,   4.00000000e+00,   1.00000000e+00,
        -1.00000000e-02,  -2.00000000e-02,  -1.00000000e-03])

In [22]:
whos_dirty = dirty < 0 
# Boolean array, to be used as Boolean index
whos_dirty

array([False, False, False,  True,  True,  True], dtype=bool)

In [23]:
dirty[whos_dirty] = 0 # Change all negative values to 0
dirty

array([ 9.,  4.,  1.,  0.,  0.,  0.])

Another example:

In [25]:
linear = np.arange(-1, 1.1, 0.2)
linear


array([ -1.00000000e+00,  -8.00000000e-01,  -6.00000000e-01,
        -4.00000000e-01,  -2.00000000e-01,  -2.22044605e-16,
         2.00000000e-01,   4.00000000e-01,   6.00000000e-01,
         8.00000000e-01,   1.00000000e+00])

In [27]:
(linear <= 0.5) & (linear >= -0.5)

array([False, False, False,  True,  True,  True,  True,  True, False,
       False, False], dtype=bool)

In [28]:
linear[(linear <= 0.5) & (linear >= -0.5)]

array([ -4.00000000e-01,  -2.00000000e-01,  -2.22044605e-16,
         2.00000000e-01,   4.00000000e-01])
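The tiny value -2.22044605e-16 in the output above is a floating-point rounding artifact where you would expect a zero. One hedged way to deal with such values is to compare with a tolerance using function isclose(); the sketch below relies on its default tolerances, which may or may not suit your data.

```python
import numpy as np

linear = np.arange(-1, 1.1, 0.2)

near_zero = np.isclose(linear, 0)   # True where the value is within a small tolerance of 0
cleaned = linear.copy()
cleaned[near_zero] = 0.0            # Boolean index, as in the data cleaning example

print(cleaned)
```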

Let’s select the second, the third, and the last stock symbols from our S&P list. (That’s “smart” indexing.) 

In [14]:
sap[[1, 2, -1]]


array(['ABT', 'ABBV', 'ADT'], 
      dtype='<U4')
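One difference between the two mechanisms is worth checking: a "smart" index produces a new array, while a plain slice produces a view. A minimal sketch with made-up numbers:

```python
import numpy as np

values = np.arange(10, 60, 10)    # 10, 20, 30, 40, 50

picked = values[[1, 2, -1]]       # "smart" index: a new array with 20, 30, 50
window = values[1:3]              # slice: a view of 20, 30

picked[0] = -1                    # values is unchanged
window[0] = -2                    # values[1] becomes -2

print(values, picked)
```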

Why not extract all rows in the middle column from the reshaped array? (That’s “smart” slicing.) 

In [29]:
print(sap2d[:, [1]], "\n")
print(sap2d[:, 1])


[['ABT']
 ['ATVI']] 

['ABT' 'ATVI']


In [30]:
# As a column
sap2d[:, [1]]


array([['ABT'],
       ['ATVI']], 
      dtype='<U4')

In [31]:
# As a row
sap2d[:, 1]


array(['ABT', 'ATVI'], 
      dtype='<U4')

## Broadcasting
numpy arrays eagerly engage in vectorized arithmetic operations with other arrays - as long as they are of the same shape. 

To add two arrays element-wise without numpy, you must use a for loop or list comprehension; with numpy, you simply add them up:

In [34]:
a = np.arange(4)
a


array([0, 1, 2, 3])

In [35]:
b = np.arange(1, 5)
b


array([1, 2, 3, 4])

In [36]:
a + b

array([1, 3, 5, 7])

Vectorized operations on arrays of different but compatible shapes are known as broadcasting. 
Broadcasting along two dimensions is possible if they are equal or one of them is 1; a scalar is compatible with an array of any shape:

In [37]:
a = np.arange(4)
print(a,"\n")
a * 5


[0 1 2 3] 



array([ 0,  5, 10, 15])

The star operator * behaves differently in Python and numpy. 

The “core” Python expression seq * 5 replicates list seq five times. 

The same numpy expression multiplies each element of array seq by 5.
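A short side-by-side comparison makes the difference concrete (the name seq follows the text; the other names are illustrative):

```python
import numpy as np

seq = [1, 2, 3]

repeated = seq * 2             # list replication: [1, 2, 3, 1, 2, 3]
scaled = np.array(seq) * 2     # element-wise multiplication: 2, 4, 6

print(repeated)
print(scaled)
```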


Let’s create a diagonal matrix and add some small (but not random) noise to it:

In [18]:
noise = np.eye(4) + 0.01 * np.ones((4, ))
noise

array([[ 1.01,  0.01,  0.01,  0.01],
       [ 0.01,  1.01,  0.01,  0.01],
       [ 0.01,  0.01,  1.01,  0.01],
       [ 0.01,  0.01,  0.01,  1.01]])
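In the example above, a one-dimensional array of four items was stretched over every row of a 4×4 matrix. The following sketch shows the same mechanism with a row and a column, and a pair of shapes that cannot be combined; the arrays are made up for illustration.

```python
import numpy as np

row = np.arange(4)                     # shape (4,)
column = np.arange(3).reshape(3, 1)    # shape (3, 1)

table = 10 * column + row              # broadcast to shape (3, 4)
print(table)

try:
    np.arange(4) + np.arange(3)        # shapes (4,) and (3,) are incompatible
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(shapes_compatible)
```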

But what if you want some small and random noise?

In [19]:
noise = np.eye(4) + 0.01 * np.random.random([4, 4])
np.round(noise, 2) # Returns a rounded copy, discarded here; noise itself is unchanged
noise

array([[  1.00396462e+00,   9.02291297e-03,   9.98907563e-03,
          5.34777270e-03],
       [  2.61289799e-03,   1.00801944e+00,   1.98442700e-04,
          8.99732177e-03],
       [  9.26482386e-03,   8.73207099e-03,   1.00319946e+00,
          5.70741651e-03],
       [  4.68103303e-03,   7.53616292e-03,   5.94327292e-03,
          1.00793390e+00]])
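Random noise differs from run to run, which makes results hard to reproduce. A possible variation, assuming a numpy version that provides default_rng(), is to draw the noise from a seeded generator and keep the rounded result in its own variable. The seed value below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2016)               # seeded generator; the seed is arbitrary
noise = np.eye(4) + 0.01 * rng.random((4, 4))   # uniform noise in [0, 0.01)

rounded = np.round(noise, 2)                    # a new array; noise keeps full precision
print(rounded)
```

Running the cell again with the same seed produces the same matrix.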

## Demystifying Universal Functions
Vectorized universal functions, or ufuncs, are a functional counterpart of broadcasting.
+ arithmetic: add(), multiply(), negative(), exp(), log(), sqrt()
+ trigonometric: sin(),cos(), hypot()
+ bitwise: bitwise_and(), left_shift()
+ relational and logical: less(), logical_not(), equal()
+ maximum() and minimum()
+ functions for working with floating point numbers: isinf(), isfinite(), floor(), isnan()
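A few of the functions from the list can be tried on small made-up arrays. This is only a sampler; each ufunc applies to its arguments item by item.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([4.0, 2.0, 16.0])

roots = np.sqrt(a)                 # 1, 2, 3
sums = np.add(a, b)                # same as a + b
larger = np.maximum(a, b)          # item-wise maximum: 4, 4, 16
smaller = np.less(a, b)            # same as a < b
hypotenuse = np.hypot(3.0, 4.0)    # 5.0

print(roots, sums, larger, smaller, hypotenuse)
```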

Let’s say we recorded the stock prices for the eight symbols in sap before and after the weekend of 01/10/2016 in a one-dimensional array stocks:

In [41]:
stocks=np.array([ 140.49, 0.97, 40.68, 41.53, 55.7 , 57.21, 98.2, 99.19, 109.96, 111.47, 35.71, 36.27, 87.85, 89.11, 30.22, 30.91])
stocks

array([ 140.49,    0.97,   40.68,   41.53,   55.7 ,   57.21,   98.2 ,
         99.19,  109.96,  111.47,   35.71,   36.27,   87.85,   89.11,
         30.22,   30.91])

We want to know which stock prices fell over the weekend. 

We’ll first group the prices by symbols and put newer quotes after older quotes—by reshaping the original array into a 2×8 matrix:

In [42]:
stocks = stocks.reshape(8, 2).T
stocks

array([[ 140.49,   40.68,   55.7 ,   98.2 ,  109.96,   35.71,   87.85,
          30.22],
       [   0.97,   41.53,   57.21,   99.19,  111.47,   36.27,   89.11,
          30.91]])

Now we can apply function *greater()* to both rows, perform column-wise comparison, and find out the symbol of interest using Boolean indexing:

In [43]:
fall = np.greater(stocks[0], stocks[1])
fall


array([ True, False, False, False, False, False, False, False], dtype=bool)

In [44]:
sap[fall]

array(['MMM'], 
      dtype='<U4')

Incidentally, MMM is a Ponzi scheme–based Russian company that has never been listed at any stock exchange. No wonder their stock is in decline.


Universal function isnan() is an excellent tool for locating the outcasts.

In [23]:
# Pretend the new MMM quote is missing
stocks[1, 0] = np.nan
np.isnan(stocks)


array([[False, False, False, False, False, False, False, False],
       [ True, False, False, False, False, False, False, False]], dtype=bool)

In [45]:
# Repair the damage; it can't get worse than this
stocks[np.isnan(stocks)] = 0
stocks

array([[ 140.49,   40.68,   55.7 ,   98.2 ,  109.96,   35.71,   87.85,
          30.22],
       [   0.97,   41.53,   57.21,   99.19,  111.47,   36.27,   89.11,
          30.91]])
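The reason to use isnan() rather than the == operator is that a NaN is not equal to anything, including itself. A minimal sketch with made-up quotes:

```python
import numpy as np

quotes = np.array([140.49, np.nan, 40.68])

by_equality = quotes == np.nan     # all False: NaN never compares equal
by_isnan = np.isnan(quotes)        # finds the missing quote

print(by_equality, by_isnan)
```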

## Understanding Conditional Functions

Function where(c, a, b) is the numpy idea of the ternary operator if-else. 

It takes a Boolean array c and two other arrays a and b and returns array d[i]=a[i] if c[i] else b[i]. 

The arrays a and b must be of the same shape as c or broadcastable to it (for example, either can be a scalar).
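The definition can be checked against an explicit Python loop. The sketch below uses the names c, a, b, and d from the text with made-up values, and adds a case in which a scalar stands in for a whole array.

```python
import numpy as np

c = np.array([True, False, True, False])
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

d = np.where(c, a, b)                   # 1, 20, 3, 40

# The same selection, spelled out item by item
by_hand = np.array([x if cond else y for cond, x, y in zip(c, a, b)])

kept = np.where(a > 2, a, 0)            # a scalar in place of b: 0, 0, 3, 4

print(d, by_hand, kept)
```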


We recorded some S&P stock prices in the array stocks and the matching symbols in the array sap.

To find out which stocks changed substantially (by more than $1.00 per share), let’s replace “small” changes with zeros, locate the non-zero elements, and use their indexes as a “smart index” into the array of stock symbols:

In [46]:
changes = np.where(np.abs(stocks[1] - stocks[0]) > 1.00, stocks[1] - stocks[0], 0)
changes


array([-139.52,    0.  ,    1.51,    0.  ,    1.51,    0.  ,    1.26,    0.  ])

In [47]:
np.nonzero(changes)

(array([0, 2, 4, 6], dtype=int64),)

In [48]:
sap[np.nonzero(changes)]

array(['MMM', 'ABBV', 'ACE', 'ADBE'], 
      dtype='<U4')

You could surely get the same answer using Boolean indexes alone, but it would not be so much fun:

In [49]:
sap[np.abs(stocks[1] - stocks[0]) > 1.00]

array(['MMM', 'ABBV', 'ACE', 'ADBE'], 
      dtype='<U4')

## Aggregating and Ordering Arrays

Let’s extract the stocks that changed either way by more than the average change in the eight-stock portfolio.
But, honestly, mixing positive and negative stock quote changes sounds like another horrible idea.

In [27]:
sap[ np.abs(stocks[0] - stocks[1])
    > np.mean(np.abs(stocks[0] - stocks[1]))]

array(['MMM'], 
      dtype='<U4')
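Aggregating functions such as mean(), max(), and sum() accept an axis parameter, and sort() and argsort() take care of the ordering. The sketch below repeats the sap and stocks data from this section so that it runs on its own; the names changes, highest, averages, and order are made up for illustration.

```python
import numpy as np

sap = np.array(["MMM", "ABT", "ABBV", "ACN", "ACE", "ATVI", "ADBE", "ADT"])
stocks = np.array([[140.49, 40.68, 55.7, 98.2, 109.96, 35.71, 87.85, 30.22],
                   [0.97, 41.53, 57.21, 99.19, 111.47, 36.27, 89.11, 30.91]])

changes = stocks[1] - stocks[0]

highest = stocks.max(axis=0)      # one value per symbol (column)
averages = stocks.mean(axis=1)    # one value per day (row)

order = np.argsort(changes)       # indexes that put changes in ascending order
print(sap[order][0])              # the symbol with the largest drop: MMM
print(np.sort(changes)[0])        # the largest drop itself
```

Function argsort() is handy here because the same index array orders both the changes and the symbols.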

## Treating Arrays as Sets
Function unique(x) returns an array of all unique elements of *x*.

In [51]:
dna = "AGTCCGCGAATACAGGCTCGGT"
dna_as_array = np.array(list(dna))
dna_as_array


array(['A', 'G', 'T', 'C', 'C', 'G', 'C', 'G', 'A', 'A', 'T', 'A', 'C',
       'A', 'G', 'G', 'C', 'T', 'C', 'G', 'G', 'T'], 
      dtype='<U1')

In [52]:
np.unique(dna_as_array)
# You knew it, didn't you?

array(['A', 'C', 'G', 'T'], 
      dtype='<U1')
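Function unique() can also report how many times each element occurs if you pass return_counts=True. A short sketch on the same DNA fragment:

```python
import numpy as np

dna_as_array = np.array(list("AGTCCGCGAATACAGGCTCGGT"))

bases, counts = np.unique(dna_as_array, return_counts=True)
frequency = dict(zip(bases.tolist(), counts.tolist()))

print(frequency)   # {'A': 5, 'C': 6, 'G': 7, 'T': 4}
```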

Function *in1d(needle, haystack)* returns a Boolean array where an element is True if the corresponding element of the needle is in the haystack. The arrays needle and haystack do not have to be of the same shape. Recent numpy versions recommend the closely related function *isin()* instead.


In [31]:
np.in1d(["MSFT", "MMM", "AAPL"], sap)

array([False,  True, False], dtype=bool)
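numpy has a few more set-like functions in the same family. The sketch below tries intersect1d(), setdiff1d(), union1d(), and isin() on the symbols from this section; the array wanted is made up for the example.

```python
import numpy as np

sap = np.array(["MMM", "ABT", "ABBV", "ACN", "ACE", "ATVI", "ADBE", "ADT"])
wanted = np.array(["MSFT", "MMM", "AAPL"])

common = np.intersect1d(wanted, sap)    # symbols present in both arrays
missing = np.setdiff1d(wanted, sap)     # symbols in wanted but not in sap, sorted
combined = np.union1d(wanted, sap)      # sorted union without duplicates
mask = np.isin(wanted, sap)             # Boolean membership test

print(common, missing, len(combined), mask)
```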

## Saving and Reading Arrays

In [33]:
# A silly way to copy an array
np.save("sap.npy", sap)
sap_copy = np.load("sap.npy")

# The files are in a binary format, and only numpy can handle them.
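To confirm that the round trip preserves both the values and the data type, you can compare the arrays after loading. The sketch below writes to a temporary directory so that it leaves no files behind; that choice is an assumption of this example, not a requirement of save() and load().

```python
import os
import tempfile

import numpy as np

sap = np.array(["MMM", "ABT", "ABBV", "ACN", "ACE", "ATVI", "ADBE", "ADT"])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sap.npy")
    np.save(path, sap)           # binary .npy file
    restored = np.load(path)     # read back into memory

print(np.array_equal(sap, restored), restored.dtype)
```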


Another pair of functions, loadtxt() and savetxt(), loads tabular data from a text file and saves an array to a text file.

    arr = np.loadtxt(fname, comments="#", delimiter=None, skiprows=0, dtype=float)


    np.savetxt(fname, arr, fmt="%.18e", delimiter=" ", comments="# ")
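Here is a minimal round trip through a text file. The file name, the header text, and the sample numbers are made up for this sketch; the header line is written with the comment prefix, so loadtxt() skips it when reading.

```python
import os
import tempfile

import numpy as np

table = np.array([[140.49, 0.97],
                  [40.68, 41.53]])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "prices.csv")
    # The header becomes a comment line that starts with "# "
    np.savetxt(path, table, delimiter=",", header="before,after")
    restored = np.loadtxt(path, delimiter=",")

print(restored)
```

Unlike the binary format, the text file can be opened in any editor or spreadsheet, at the cost of size and speed.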
