# Advanced pandas

## Categorical Data

In [1]:
import numpy as np
import pandas as pd

### Categorical Type in pandas

In [2]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

In [3]:
df

   basket_id   fruit  count    weight
0          0   apple      5  3.800056
1          1  orange     14  0.084845
2          2   apple     12  0.241727
3          3   apple      6  3.279358
4          4   apple     12  1.812310
5          5  orange     13  2.051560
6          6   apple      6  2.605890
7          7   apple      6  3.287836


In [4]:
fruit_cat = df['fruit'].astype('category')

In [5]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [6]:
c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

The Categorical object has categories and codes attributes:

In [7]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [8]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
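To make the relationship between the two attributes concrete, here is a minimal sketch that rebuilds the original values by using each code as a position into categories. The names rebuilt and mapping are illustrative only and do not appear elsewhere in these notes.

```python
import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
c = pd.Categorical(fruits)

# Each code is an integer position into c.categories
rebuilt = c.categories.take(c.codes)

# The same code -> category mapping, written out as a plain dict
mapping = dict(enumerate(c.categories))

print(list(rebuilt))
print(mapping)
```

Storing small integer codes plus one copy of each distinct value is what makes the categorical representation compact when values repeat often.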

In [9]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar']) # pd.Categorical()
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [10]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)       # pd.Categorical.from_codes()
my_cats_2

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

In [11]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

In [12]:
my_cats_2.as_ordered()

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

### Computations with Categoricals

In [13]:
np.random.seed(12345)
draws = np.random.randn(1000)

In [14]:
bins = pd.qcut(draws, 4)
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [15]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [16]:
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws).groupby(bins)
           .agg(['count', 'min', 'max']).reset_index())
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


### Better performance with categoricals

In [17]:
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

In [18]:
categories = labels.astype('category')

In [19]:
labels.memory_usage()

80000128

In [20]:
categories.memory_usage()

10000320

### Categorical Methods

In [21]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [22]:
# cat attribute
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [23]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the
four values observed in the data. We can use the set_categories method to change
them:

In [24]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)   # .cat.set_categories()
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [25]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [26]:
cat_s3.cat.remove_unused_categories()          # .cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']

#### Categorical methods for Series in pandas: Python for Data Analysis, page 372

## Advanced GroupBy Use

### Group Transforms and “Unwrapped” GroupBys

In [27]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
df

   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0


In [28]:
g = df.groupby('key').value

In [29]:
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

Suppose instead we wanted to produce a Series of the same shape as df['value'] but
with values replaced by the average grouped by 'key'. We can pass the function
lambda x: x.mean() to transform:

In [30]:
g.transform(lambda x: x.mean())                       # Groupby_obj.transform(func)

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [31]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [32]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [33]:
def normalize(x):
    return (x - x.mean()) / x.std()

In [34]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [35]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [36]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

### Grouped Time Resampling

In [37]:
N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times, 'value': np.arange(N)})
df

                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14


In [38]:
df.set_index('time').resample('5min').count()

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5


In [39]:
df2 = pd.DataFrame({'time': times.repeat(3),            # dt_index.repeat()
                    'key': np.tile(['a', 'b', 'c'],N),  # np.tile()
                    'value': np.arange(N * 3.)})
df2[:7]

                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0


To do the same resampling for each value of 'key', we introduce the pandas.Grouper object:

In [40]:
time_key = pd.Grouper(freq = '5min')
time_key

TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', origin='start_day')

In [41]:
resampled = (df2.set_index('time').groupby(['key', time_key]).sum())
resampled

                         value
key time
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0


In [42]:
resampled.reset_index()

  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0


One constraint of using Grouper() with only a freq argument, as here, is that the
time must be the index of the Series or DataFrame.
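The index requirement applies when Grouper is given only a frequency. As a hedged sketch, pd.Grouper also accepts a key argument naming a datetime column, so the same grouping can be written without set_index. The frame below rebuilds df2 from the cells above; the names by_index and by_column are illustrative.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

# Time as the index, Grouper given only a frequency
by_index = (df2.set_index('time')
            .groupby(['key', pd.Grouper(freq='5min')])
            .sum())

# Time left as a regular column, named through the key argument
by_column = (df2.groupby(['key', pd.Grouper(key='time', freq='5min')])
             .sum())

print(by_column)
```

Both forms should give the same sums per key and five-minute bin; which one reads better depends on whether the time column is needed as the index later in the analysis.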

## Techniques for Method Chaining

The DataFrame.assign method is a functional alternative to column assignments
of the form df[k] = v. Rather than modifying the object in place, it returns a
new DataFrame with the indicated modifications, so assigning a column on a copy
and calling assign on the original give equivalent results.
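The equivalence between column assignment on a copy and assign can be sketched as follows. The frame, the column name k, and the values v are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(3)})
v = df['a'] * 2

# Usual non-functional way: copy, then assign a column in place
df2 = df.copy()
df2['k'] = v

# Functional way: assign returns a new DataFrame
df3 = df.assign(k=v)

print(df2.equals(df3))
print('k' in df.columns)
```

Because assign returns a new object, the original df is left without the new column, which is what makes it convenient inside a chain of method calls.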

One thing to keep in mind when doing method chaining is that you may need to
refer to temporary objects. For example, when a chain starts from a call such as
load_data(), we cannot refer to its result until it has been assigned to a
temporary variable such as df. To help with this, assign and many other pandas
functions accept function-like arguments, also known as callables.
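A minimal sketch of the idea, assuming a hypothetical load_data function that returns a small frame with columns col1 and col2 (both names are invented here). The callable passed to loc or assign receives the intermediate DataFrame at that point in the chain, so no temporary variable is needed.

```python
import pandas as pd

def load_data():
    # Hypothetical stand-in for a real data loader
    return pd.DataFrame({'col1': [1.0, 2.0, 3.0, 4.0],
                         'col2': [-1.0, 0.5, -2.0, 3.0]})

# Step-by-step version using temporary variables
df = load_data()
df2 = df[df['col2'] < 0]
df2 = df2.assign(col1_demeaned=df2['col1'] - df2['col1'].mean())

# Chained version: each lambda is called with the object built so far
result = (load_data()
          .loc[lambda x: x['col2'] < 0]
          .assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean()))

print(result)
```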

Whether you prefer to write code in this style is a matter of taste, and splitting up the
expression into multiple steps may make your code more readable.

### The pipe Method

Suppose that you wanted to be able to demean more than one column and easily
change the group keys.
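One way to sketch this is a small helper that takes the group keys and the columns as arguments and is called through pipe. The function name group_demean and the sample data are assumptions made for this example, not part of pandas.

```python
import pandas as pd

def group_demean(df, by, cols):
    # Hypothetical helper: subtract each group's mean from the given columns
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'col1': [1.0, 3.0, 10.0, 20.0],
                   'col2': [0.0, 2.0, 4.0, 8.0]})

# df.pipe(f, *args) is equivalent to f(df, *args)
result = df.pipe(group_demean, ['key1'], ['col1', 'col2'])
print(result)
```

Writing the step as a function of the DataFrame keeps it reusable and lets it sit in the middle of a method chain.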

# Advanced NumPy

## ndarray Object Internals

In [43]:
# strides: a tuple indicating the number of bytes to "step" in order to
# advance one element along each dimension
np.ones((3, 4, 5), dtype=np.float64).strides

(160, 40, 8)
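As a rough check of where (160, 40, 8) comes from, the strides of a C-contiguous array can be computed by hand from the shape and the item size. The variable by_hand is illustrative.

```python
import numpy as np

arr = np.ones((3, 4, 5), dtype=np.float64)
itemsize = arr.itemsize  # 8 bytes per float64 element

# For a C-contiguous array, stepping along an axis skips every element
# of all the axes that come after it
by_hand = (4 * 5 * itemsize, 5 * itemsize, itemsize)

print(arr.strides == by_hand)

# Transposing only reverses the strides; no data is moved
print(arr.T.strides)
```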

### NumPy dtype Hierarchy

Because there are multiple types of floating-point numbers (float16 through float128), checking that a dtype
is among a list of types would be very verbose. Fortunately, the dtypes have superclasses
such as np.integer and np.floating, which can be used in conjunction with the np.issubdtype function:

In [44]:
ints = np.ones(10, dtype=np.uint16)

In [45]:
floats = np.ones(10, dtype=np.float32)

In [46]:
np.issubdtype(ints.dtype, np.integer)

True

In [47]:
np.issubdtype(floats.dtype, np.character)

False

You can see all of the parent classes of a specific dtype by calling the type's mro
method:

In [48]:
np.float32.mro()

[numpy.float32,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 object]

## Advanced Array Manipulation

### Reshaping Arrays

In [49]:
arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [50]:
arr.reshape(4,2,order='C')

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [51]:
arr.reshape(4,2, order='F').reshape(2,4)

array([[0, 4, 1, 5],
       [2, 6, 3, 7]])

In [52]:
arr = np.arange(15)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [53]:
# One of the passed shape dimensions can be -1, in which case the value used for that
# dimension will be inferred from the data
arr.reshape(-1,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

The opposite operation of reshape from one-dimensional to a higher dimension is
typically known as flattening or raveling:

In [54]:
arr = np.arange(15).reshape(5, 3)
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [55]:
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [56]:
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [57]:
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
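ravel and flatten return the same values here, but they differ in how they treat memory: ravel avoids copying when it can, while flatten always returns a copy. A small sketch of the difference, with illustrative variable names:

```python
import numpy as np

arr = np.arange(15).reshape(5, 3)

raveled = arr.ravel()      # a view here, because arr is contiguous
flattened = arr.flatten()  # always a copy

print(np.shares_memory(arr, raveled))
print(np.shares_memory(arr, flattened))

flattened[0] = -1  # does not affect arr
raveled[1] = 99    # writes through to arr[0, 1]
print(arr[0])
```

If the result will be modified and the original must stay intact, flatten is the safer choice.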

The key difference between C and Fortran order is the way in which the
dimensions are walked:

C/row major order
Traverse higher dimensions first (e.g., axis 1 before advancing on axis 0).

Fortran/column major order
Traverse higher dimensions last (e.g., axis 0 before advancing on axis 1).
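The two traversal orders are easy to compare on a small array by passing the order argument to ravel. This is only an illustrative sketch.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)
# arr is [[0, 1, 2],
#         [3, 4, 5]]

c_order = arr.ravel(order='C')  # walk along axis 1 first (row by row)
f_order = arr.ravel(order='F')  # walk along axis 0 first (column by column)

print(c_order)  # [0 1 2 3 4 5]
print(f_order)  # [0 3 1 4 2 5]
```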

### Concatenating and Splitting Arrays

In [58]:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

In [59]:
np.concatenate([arr1, arr2], axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [60]:
np.concatenate([arr1, arr2], axis=1)

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [61]:
np.vstack((arr1, arr2))            # np.vstack()

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [62]:
np.hstack((arr1, arr2))            # np.hstack()

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [63]:
arr = np.random.randn(5, 2)
arr

array([[-0.33202655,  0.97761764],
       [-0.34480824,  1.32121974],
       [-1.02114479, -1.42017304],
       [-0.82626944,  1.20887706],
       [ 1.08643483,  0.87772803]])

In [64]:
first, second, third = np.split(arr, [1, 3])

In [65]:
first

array([[-0.33202655,  0.97761764]])

In [66]:
second

array([[-0.34480824,  1.32121974],
       [-1.02114479, -1.42017304]])

In [67]:
third

array([[-0.82626944,  1.20887706],
       [ 1.08643483,  0.87772803]])

The value [1, 3] passed to np.split indicates the indices at which to split the array
into pieces.
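Put differently, the split indices behave like slice boundaries along the first axis. A minimal sketch using a small integer array in place of the random one above:

```python
import numpy as np

arr = np.arange(10).reshape(5, 2)

first, second, third = np.split(arr, [1, 3])

# The same pieces written as explicit slices
manual = [arr[:1], arr[1:3], arr[3:]]

for piece, expected in zip((first, second, third), manual):
    print(piece.shape, np.array_equal(piece, expected))
```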

#### Array concatenation functions: Python for Data Analysis, page 456

In [68]:
# Stacking helpers: r_ and c_
arr1 = np.arange(6).reshape((3, 2))
arr2 = np.random.randn(3, 2)

In [69]:
arr1

array([[0, 1],
       [2, 3],
       [4, 5]])

In [70]:
arr2

array([[-1.20209626,  2.18981666],
       [-0.69892964, -1.98375661],
       [-0.60940103,  0.07757414]])

In [71]:
np.r_[arr1,arr2]

array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [-1.20209626,  2.18981666],
       [-0.69892964, -1.98375661],
       [-0.60940103,  0.07757414]])

In [72]:
np.c_[arr1,arr2]

array([[ 0.        ,  1.        , -1.20209626,  2.18981666],
       [ 2.        ,  3.        , -0.69892964, -1.98375661],
       [ 4.        ,  5.        , -0.60940103,  0.07757414]])

In [73]:
np.c_[1:6, -10:-5]

array([[  1, -10],
       [  2,  -9],
       [  3,  -8],
       [  4,  -7],
       [  5,  -6]])

### Repeating Elements: tile and repeat

repeat replicates each element in an array some number
of times, producing a larger array:

In [74]:
arr = np.arange(3)
arr

array([0, 1, 2])

In [75]:
arr.repeat(3)

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

In [76]:
arr.repeat([2,3,4])

array([0, 0, 1, 1, 1, 2, 2, 2, 2])

In [77]:
arr = np.random.randn(2, 2)
arr

array([[-0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347]])

In [78]:
arr.repeat(2,axis=1)

array([[-0.33198823, -0.33198823, -0.02238705, -0.02238705],
       [ 0.15462106,  0.15462106, -0.21486347, -0.21486347]])

In [79]:
arr.repeat([2, 3], axis=0)

array([[-0.33198823, -0.02238705],
       [-0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347],
       [ 0.15462106, -0.21486347],
       [ 0.15462106, -0.21486347]])

tile is a shortcut for stacking copies of an array along an axis:

In [80]:
np.tile(arr,2)

array([[-0.33198823, -0.02238705, -0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347,  0.15462106, -0.21486347]])

The second argument to tile can be a tuple
indicating the layout of the "tiling":

In [81]:
np.tile(arr,(2,1))

array([[-0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347],
       [-0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347]])

In [82]:
np.tile(arr, (3, 2))

array([[-0.33198823, -0.02238705, -0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347,  0.15462106, -0.21486347],
       [-0.33198823, -0.02238705, -0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347,  0.15462106, -0.21486347],
       [-0.33198823, -0.02238705, -0.33198823, -0.02238705],
       [ 0.15462106, -0.21486347,  0.15462106, -0.21486347]])

### Fancy Indexing Equivalents: take and put

In [83]:
arr = np.arange(10) * 100
inds = [7, 1, 2, 6]
arr[inds]

array([700, 100, 200, 600])

In [84]:
arr.take(inds)

array([700, 100, 200, 600])

In [85]:
arr.put(inds, 42)
arr

array([  0,  42,  42, 300, 400, 500,  42,  42, 800, 900])

In [86]:
arr.put(inds, [40, 41, 42, 43])
arr

array([  0,  41,  42, 300, 400, 500,  43,  40, 800, 900])

In [87]:
inds = [2, 0, 2, 1]
arr = np.random.randn(2, 4)
arr

array([[-0.48931766, -0.4130431 ,  0.65650474,  1.05898613],
       [-0.57342861, -0.68584776, -0.21884371, -1.44196261]])

In [88]:
arr[:,inds]

array([[ 0.65650474, -0.48931766,  0.65650474, -0.4130431 ],
       [-0.21884371, -0.57342861, -0.21884371, -0.68584776]])

In [89]:
arr.take(inds, axis=1)

array([[ 0.65650474, -0.48931766,  0.65650474, -0.4130431 ],
       [-0.21884371, -0.57342861, -0.21884371, -0.68584776]])

## Broadcasting

In [90]:
arr = np.random.randn(4, 3)
arr

array([[-0.73343662, -0.05745301,  0.17896382],
       [ 0.17072935,  0.80834265, -0.51660819],
       [ 0.46782339, -1.92930289,  0.73828552],
       [-0.06104182, -1.51593372, -1.0924147 ]])

In [91]:
arr.mean(0)

array([-0.03898142, -0.67358675, -0.17294339])

In [92]:
demeaned = arr - arr.mean(0)
demeaned

array([[-0.6944552 ,  0.61613373,  0.35190721],
       [ 0.20971078,  1.48192939, -0.34366481],
       [ 0.50680482, -1.25571615,  0.91122891],
       [-0.02206039, -0.84234697, -0.91947131]])

In [93]:
demeaned.mean(0)

array([-8.67361738e-18,  0.00000000e+00, -2.77555756e-17])

In [94]:
arr

array([[-0.73343662, -0.05745301,  0.17896382],
       [ 0.17072935,  0.80834265, -0.51660819],
       [ 0.46782339, -1.92930289,  0.73828552],
       [-0.06104182, -1.51593372, -1.0924147 ]])

In [95]:
row_means = arr.mean(1)
row_means.shape

(4,)

In [96]:
row_means.reshape(4, 1)

array([[-0.20397527],
       [ 0.1541546 ],
       [-0.24106466],
       [-0.88979674]])

In [97]:
demeaned = arr - row_means.reshape((4, 1))

In [98]:
demeaned.mean(1)

array([ 1.85037171e-17,  0.00000000e+00, -7.40148683e-17,  3.70074342e-17])

### Broadcasting Over Other Axes

In [99]:
arr = np.zeros((4, 4))
arr_3d = arr[:, np.newaxis, :]
arr_3d.shape

(4, 1, 4)

In [100]:
arr_3d

array([[[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]]])

In [101]:
arr.reshape(4,1,4)

array([[[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]]])

In [102]:
arr_1d = np.random.normal(size=3)
arr_1d[:, np.newaxis]

array([[-0.84017898],
       [ 1.8092765 ],
       [ 0.60639657]])

In [103]:
# demeaning axis 2
arr = np.random.randn(3, 4, 5)
arr

array([[[ 0.3794318 ,  0.08312829,  0.2519332 ,  0.86990763,
         -0.24959093],
        [ 0.53297283,  2.35696402, -1.18967186,  0.18773994,
         -0.6314337 ],
        [-0.92955955, -0.62805982,  0.62138342, -1.03266072,
         -0.94624145],
        [ 0.61367694, -0.8232601 ,  1.00448097, -0.03879877,
          0.84516462]],

       [[ 0.6874872 , -0.04011116,  0.55370256,  0.16678658,
         -0.02933461],
        [ 0.19075351, -0.75766937, -1.19583006, -0.58889431,
          0.93077627],
        [-0.90089073, -0.03133782,  1.2162068 , -0.31347299,
          0.22745571],
        [ 1.01718219,  0.86162825,  0.16998235, -0.72518791,
         -0.15358866]],

       [[-0.6062885 , -0.06694511, -0.26078213,  1.87311377,
         -0.08237701],
        [-0.57168419,  0.36589958, -0.60107633, -0.34760626,
          0.53606199],
        [ 0.84316052, -1.51420343, -1.39828137, -1.45038408,
         -0.04201274],
        [ 0.49061691,  0.40564318,  0.71513586,  0.99847367,
         -0

In [104]:
depth_means = arr.mean(2)
depth_means

array([[ 0.266962  ,  0.25131425, -0.58302763,  0.32025273],
       [ 0.26770611, -0.28417279,  0.03959219,  0.23400324],
       [ 0.1713442 , -0.12368104, -0.71234422,  0.43981687]])

In [105]:
depth_means[:, :, np.newaxis]

array([[[ 0.266962  ],
        [ 0.25131425],
        [-0.58302763],
        [ 0.32025273]],

       [[ 0.26770611],
        [-0.28417279],
        [ 0.03959219],
        [ 0.23400324]],

       [[ 0.1713442 ],
        [-0.12368104],
        [-0.71234422],
        [ 0.43981687]]])

In [106]:
demeaned = arr - depth_means[:, :, np.newaxis]
demeaned.mean(2)

array([[ 0.00000000e+00, -4.44089210e-17,  6.66133815e-17,
        -2.22044605e-17],
       [-1.11022302e-17,  0.00000000e+00, -1.11022302e-17,
         4.44089210e-17],
       [ 3.33066907e-17,  4.44089210e-17,  2.22044605e-17,
        -4.44089210e-17]])

### Setting Array Values by Broadcasting

In [107]:
arr = np.zeros((4, 3))
arr[:] = 5

In [108]:
arr

array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])

In [111]:
col = np.array([1.28, -0.42, 0.44, 1.6])
arr[:]=col[:, np.newaxis]
arr

array([[ 1.28,  1.28,  1.28],
       [-0.42, -0.42, -0.42],
       [ 0.44,  0.44,  0.44],
       [ 1.6 ,  1.6 ,  1.6 ]])

In [114]:
arr[:2] = [[-1.37], [0.509]]
arr

array([[-1.37 , -1.37 , -1.37 ],
       [ 0.509,  0.509,  0.509],
       [ 0.44 ,  0.44 ,  0.44 ],
       [ 1.6  ,  1.6  ,  1.6  ]])

## Advanced ufunc Usage

### ufunc Instance Methods

reduce takes a single array and aggregates its values, optionally along an axis, by performing
a sequence of binary operations.
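For the simplest case, reducing with np.add gives the same result as summing. A short sketch with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)

# Repeatedly applies the binary add: ((0 + 1) + 2) + ... + 9
total = np.add.reduce(arr)
print(total, arr.sum())

# With an axis argument, the reduction runs along that axis only
arr2d = np.arange(6).reshape(2, 3)
row_sums = np.add.reduce(arr2d, axis=1)
print(row_sums)
```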

As a less trivial example, we can use np.logical_and.reduce to check whether
the values in each row of an array are sorted:

In [116]:
arr = np.random.randn(5, 5)

In [117]:
arr[::2].sort(1)

In [118]:
arr

array([[-1.79087524, -0.88215124,  0.04592871,  0.60702157,  0.72776312],
       [-0.36588777,  1.17518389,  0.80768151,  1.48714754,  1.60262925],
       [-0.14449571, -0.02374048,  0.18449371,  0.7713527 ,  1.80907671],
       [ 0.17651717,  1.76308272, -0.43424697, -0.021746  , -2.43582632],
       [-0.55760969, -0.42324862, -0.01296674,  0.42354613,  0.75896272]])

In [119]:
arr[:, :-1]

array([[-1.79087524, -0.88215124,  0.04592871,  0.60702157],
       [-0.36588777,  1.17518389,  0.80768151,  1.48714754],
       [-0.14449571, -0.02374048,  0.18449371,  0.7713527 ],
       [ 0.17651717,  1.76308272, -0.43424697, -0.021746  ],
       [-0.55760969, -0.42324862, -0.01296674,  0.42354613]])

In [120]:
arr[:, 1:]

array([[-0.88215124,  0.04592871,  0.60702157,  0.72776312],
       [ 1.17518389,  0.80768151,  1.48714754,  1.60262925],
       [-0.02374048,  0.18449371,  0.7713527 ,  1.80907671],
       [ 1.76308272, -0.43424697, -0.021746  , -2.43582632],
       [-0.42324862, -0.01296674,  0.42354613,  0.75896272]])

In [123]:
np.logical_and.reduce(arr[:, :-1] < arr[:, 1:], axis=1)

array([ True, False,  True, False,  True])

accumulate is related to reduce like cumsum is related to sum. It produces an array of
the same size with the intermediate "accumulated" values:

In [129]:
arr = np.arange(15).reshape((3, 5))
np.add.accumulate(arr, axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]], dtype=int32)

reduceat performs a "local reduce," in essence an array groupby
operation in which slices of the array are aggregated together. It accepts a sequence of
"bin edges" that indicate how to split and aggregate the values:

In [134]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [137]:
np.add.reduceat(arr, [0,5,8])

array([10, 18, 17], dtype=int32)

The results are the reductions (here, sums) performed over arr[0:5], arr[5:8], and
arr[8:]. As with the other methods, you can pass an axis argument:

In [138]:
arr = np.multiply.outer(np.arange(4), np.arange(5))
arr

array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12]])

In [139]:
np.add.reduceat(arr, [0, 2, 4], axis=1)

array([[ 0,  0,  0],
       [ 1,  5,  4],
       [ 2, 10,  8],
       [ 3, 15, 12]], dtype=int32)
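The correspondence between the bin edges and ordinary slices can be checked directly. This sketch repeats both reduceat calls above and compares them with sums over explicit slices; the names by_slices and by_columns are illustrative.

```python
import numpy as np

arr = np.arange(10)
local = np.add.reduceat(arr, [0, 5, 8])

# The same three sums written out slice by slice
by_slices = [arr[0:5].sum(), arr[5:8].sum(), arr[8:].sum()]
print(local.tolist(), by_slices)

# Along axis 1, the edges [0, 2, 4] select columns 0:2, 2:4, and 4:
arr2d = np.multiply.outer(np.arange(4), np.arange(5))
local2d = np.add.reduceat(arr2d, [0, 2, 4], axis=1)
by_columns = np.stack([arr2d[:, 0:2].sum(1),
                       arr2d[:, 2:4].sum(1),
                       arr2d[:, 4:].sum(1)], axis=1)
print(np.array_equal(local2d, by_columns))
```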

## Structured and Record Arrays

In [158]:
dtype = [('x', np.float64), ('y', np.int32)]
sarr = np.array([(6.4, 2.6), (np.pi, -2)], dtype=dtype)

In [159]:
sarr

array([(6.4       ,  2), (3.14159265, -2)],
      dtype=[('x', '<f8'), ('y', '<i4')])

In [147]:
sarr[0]

(6.4, 2)

In [148]:
sarr['y']

array([ 2, -2])

### Nested dtypes and Multidimensional Fields

In [160]:
dtype = [('x', np.int64, 3), ('y', np.int32)]
arr = np.zeros(5, dtype=dtype)
arr

array([([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0),
       ([0, 0, 0], 0)], dtype=[('x', '<i8', (3,)), ('y', '<i4')])

In [151]:
arr['x']

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=int64)

In [162]:
dtype = [('x', [('a', 'f8'), ('b', 'f4')]), ('y', np.int32)]

In [174]:
data1 = np.array([(1,9),2,3,4,5,6], dtype=dtype)
data1

In [176]:
data2 = np.array([((1, 2), 5), ((3, 4), 6)], dtype=dtype)
data2

array([((1., 2.), 5), ((3., 4.), 6)],
      dtype=[('x', [('a', '<f8'), ('b', '<f4')]), ('y', '<i4')])

## More About Sorting

### Indirect Sorts: argsort and lexsort

In [204]:
values = np.array([5, 0, 1, 3, 2])

In [205]:
indexer = values.argsort()

In [206]:
indexer

array([1, 2, 4, 3, 0], dtype=int64)

In [207]:
values[indexer]

array([0, 1, 2, 3, 5])

In [190]:
arr = np.random.randn(3, 5)
arr[0]=values

In [191]:
arr

array([[ 5.        ,  0.        ,  1.        ,  3.        ,  2.        ],
       [-0.34547689,  1.69149892,  0.12664616,  1.55242374,  1.47384462],
       [-0.88030983, -2.15187097, -0.18839127,  1.0252376 ,  0.06165114]])

In [192]:
arr[:, arr[0].argsort()]

array([[ 0.        ,  1.        ,  2.        ,  3.        ,  5.        ],
       [ 1.69149892,  0.12664616,  1.47384462,  1.55242374, -0.34547689],
       [-2.15187097, -0.18839127,  0.06165114,  1.0252376 , -0.88030983]])

In [193]:
arr[0] = arr[0, arr[0].argsort()]

In [194]:
arr

array([[ 0.        ,  1.        ,  2.        ,  3.        ,  5.        ],
       [-0.34547689,  1.69149892,  0.12664616,  1.55242374,  1.47384462],
       [-0.88030983, -2.15187097, -0.18839127,  1.0252376 ,  0.06165114]])

In [195]:
first_name = np.array(['Bob', 'Jane', 'Steve', 'Bill', 'Barbara'])
last_name = np.array(['Jones', 'Arnold', 'Arnold', 'Jones', 'Walters'])

In [196]:
sorter = np.lexsort((first_name, last_name))
sorter

array([1, 2, 3, 0, 4], dtype=int64)

In [200]:
list(zip(last_name[sorter], first_name[sorter]))

[('Arnold', 'Jane'),
 ('Arnold', 'Steve'),
 ('Jones', 'Bill'),
 ('Jones', 'Bob'),
 ('Walters', 'Barbara')]

lexsort can be a bit confusing the first time you use it because the order in which the
keys are used to order the data starts with the last array passed. Here, last_name was
used before first_name.
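One way to see the key order is to compare lexsort with Python's sorted, using a tuple key in which the last name comes first. This is an illustrative cross-check, not how lexsort is implemented.

```python
import numpy as np

first_name = np.array(['Bob', 'Jane', 'Steve', 'Bill', 'Barbara'])
last_name = np.array(['Jones', 'Arnold', 'Arnold', 'Jones', 'Walters'])

# The last array passed (last_name) is the primary sort key
sorter = np.lexsort((first_name, last_name))

# Same ordering in plain Python: primary key first in the tuple
expected = sorted(range(len(last_name)),
                  key=lambda i: (str(last_name[i]), str(first_name[i])))

print(sorter.tolist(), expected)
```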

### Alternative Sort Algorithms

In [214]:
values = np.array(['2:first', '2:second', '1:first', '1:second', '1:third'])

In [215]:
key = np.array([2, 2, 1, 1, 1])

In [216]:
indexer = key.argsort(kind='mergesort')

In [217]:
indexer

array([2, 3, 4, 0, 1], dtype=int64)

In [218]:
values.take(indexer)

array(['1:first', '1:second', '1:third', '2:first', '2:second'],
      dtype='<U8')

### Partially Sorting Arrays

In [224]:
np.random.seed(1235)

In [225]:
arr = np.random.randn(20)
arr

array([ 0.68938232, -0.03171215,  0.66805361,  0.48883782, -0.67978825,
       -1.30747938,  1.47030437, -1.23102724,  0.95877525,  0.74048962,
        0.70329051, -0.07352256, -1.27431351, -0.23115703,  0.50514333,
        0.51273198,  1.30436025,  0.73063226, -0.53210578,  0.60349086])

In [226]:
np.partition(arr, 3)

array([-1.30747938, -1.27431351, -1.23102724, -0.67978825, -0.53210578,
        0.48883782,  0.60349086,  0.66805361,  0.51273198, -0.03171215,
        0.50514333, -0.07352256, -0.23115703,  0.68938232,  0.70329051,
        0.95877525,  1.30436025,  0.73063226,  1.47030437,  0.74048962])

After you call partition(arr, 3), the first three elements in the result are the smallest
three values in no particular order. numpy.argpartition, similar to numpy.argsort,
returns the indices that rearrange the data into the equivalent order:

In [227]:
indices = np.argpartition(arr, 3)
indices

array([ 5, 12,  7,  4, 18,  3, 19,  2, 15,  1, 14, 11, 13,  0, 10,  8, 16,
       17,  6,  9], dtype=int64)

In [228]:
arr.take(indices)

array([-1.30747938, -1.27431351, -1.23102724, -0.67978825, -0.53210578,
        0.48883782,  0.60349086,  0.66805361,  0.51273198, -0.03171215,
        0.50514333, -0.07352256, -0.23115703,  0.68938232,  0.70329051,
        0.95877525,  1.30436025,  0.73063226,  1.47030437,  0.74048962])
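The guarantee that partition gives can be checked against a full sort. The sketch below uses its own random data (drawn with np.random.default_rng, an assumption of this example rather than the seed used above) and verifies that the first three elements are the three smallest values.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal(20)

k = 3
part = np.partition(arr, k)

# The first k elements are the k smallest values, in no particular order
smallest = np.sort(part[:k])
print(np.array_equal(smallest, np.sort(arr)[:k]))

# Everything after position k is at least as large as the element at k
print(bool((part[k:] >= part[k]).all()))

# argpartition gives indices that produce an equivalent arrangement
idx = np.argpartition(arr, k)
print(np.array_equal(np.sort(arr[idx[:k]]), smallest))
```

When only the few smallest or largest values are needed, partitioning avoids the cost of sorting the whole array.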

### numpy.searchsorted: Finding Elements in a Sorted Array

searchsorted is an array method that performs a binary search on a sorted array,
returning the location in the array where the value would need to be inserted to
maintain sortedness:

In [229]:
arr = np.array([0, 1, 7, 12, 15])
arr.searchsorted(9)

3

In [230]:
arr.searchsorted([0, 8, 11, 16])

array([0, 3, 3, 5], dtype=int64)
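When the array contains repeated values, the insertion point is not unique. By default searchsorted returns the left-most position in a run of equal values; passing side='right' returns the position just after the run. A small sketch on made-up data:

```python
import numpy as np

arr = np.array([0, 0, 0, 1, 1, 1, 1])

left = arr.searchsorted([0, 1])                 # before each run of equal values
right = arr.searchsorted([0, 1], side='right')  # after each run of equal values

print(left)   # [0 3]
print(right)  # [3 7]
```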

In [233]:
data = np.floor(np.random.uniform(0, 10000, size=50))

In [234]:
data

array([2265., 7305., 5024., 1221., 8061., 6671., 9657., 4791., 4583.,
       3739., 1440.,   45., 9769., 4782., 2348.,   84., 1350., 4228.,
       3823., 5626., 6182., 4545.,  707., 9256., 8862., 5921., 1025.,
       7652., 8346., 3131., 3347., 2232., 4761.,  362., 1382.,  247.,
       9090., 5583., 7247., 8436.,  722., 9243., 8323., 9418.,  574.,
       6396., 9067., 6167., 1488., 2930.])

In [235]:
bins = np.array([0, 100, 1000, 5000, 10000])

In [236]:
labels = bins.searchsorted(data)
labels

array([3, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 1, 4, 3, 3, 1, 3, 3, 3, 4, 4, 3,
       2, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 2, 3, 2, 4, 4, 4, 4, 2, 4, 4, 4,
       2, 4, 4, 4, 3, 3], dtype=int64)

In [237]:
pd.Series(data).groupby(labels).mean()

1      64.500000
2     522.400000
3    2970.550000
4    7708.782609
dtype: float64

## Writing Fast NumPy Functions with Numba

#### Python for Data Analysis, pages 476-478

## Advanced Array Input and Output (Memory-Mapped Files, HDF5....)

#### Python for Data Analysis, pages 478-480