# Using masks to filter data in NumPy

In both NumPy and Pandas we can create masks to filter data. Masks are Boolean arrays, that is, arrays of True and False values, and they provide a powerful and flexible way of selecting data.

## Creating a mask

Let's begin by creating an array of 4 rows and 10 columns of random integers drawn uniformly from 0 to 99 (the upper bound of 100 passed to <em>randint</em> is excluded):

In [2]:
import numpy as np

array1 = np.random.randint(0,100,size=(4,10))

print (array1)

[[73  6 10 91  7 10 75 71 38 66]
 [84 65 81  5 44 32 29 13 47 56]
 [42 11  2 81 26 49  2 32 17 25]
 [60 91 41 56 76 48 19 65 68 68]]


Now we'll create a mask that is True wherever a number is greater than 70.

In [3]:
mask = array1 > 70

print(mask)

[[ True False False  True False False  True  True False False]
 [ True False  True False False False False False False False]
 [False False False  True False False False False False False]
 [False  True False False  True False False False False False]]


We can use that mask to extract the numbers. The result is a one-dimensional array of the selected values:

In [4]:
print (array1[mask])

[73 91 75 71 84 81 81 91 76]
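
Extracting with a mask flattens the result, so the positions of the selected values are lost. If the positions matter, <em>np.nonzero</em> returns the row and column indices of the True entries. The following is a minimal sketch on a small fixed array; the name demo and its values are made up for illustration.

```python
import numpy as np

# Small fixed array so the result is reproducible (values are illustrative)
demo = np.array([[73,  6, 10],
                 [84, 65, 81]])

mask = demo > 70

# Boolean indexing returns a 1-D copy of the selected values, row by row
selected = demo[mask]
print(selected)        # [73 84 81]

# np.nonzero gives the row and column index of each True entry
rows, cols = np.nonzero(mask)
print(rows, cols)      # [0 1 1] [0 0 2]

# Changing the copy does not change the original array
selected[0] = -1
print(demo[0, 0])      # 73
```

Because the extracted values are a copy, editing them leaves the source array untouched; assigning through the mask, as in the later section on setting values, is what modifies the array in place.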


## Using any and all

<em>any</em> tests whether at least one value is true, and <em>all</em> tests whether every value is true.

We can apply that to the whole array:

In [5]:
print (mask.any())
print (mask.all())

True
False


Or we can apply it columnwise (by passing axis=0) or rowwise (by passing axis=1):

In [6]:
print ('All tests in a column are true:')
print (mask.all(axis=0))
print ('\nAny test in a row is true:')
print (mask.any(axis=1))

All tests in a column are true:
[False False False False False False False False False False]

Any test in a row is true:
[ True  True  True  True]
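
The axis argument is easy to mix up, so here is a small sketch with a hand-written mask (the name demo_mask and its values are hypothetical) where each of the four reductions gives a different answer.

```python
import numpy as np

# Hypothetical 2 x 3 mask chosen so each reduction gives a different answer
demo_mask = np.array([[True, False, True],
                      [True, False, False]])

# axis=0 collapses the rows: one result per column
print(demo_mask.all(axis=0))   # [ True False False]
print(demo_mask.any(axis=0))   # [ True False  True]

# axis=1 collapses the columns: one result per row
print(demo_mask.all(axis=1))   # [False False]
print(demo_mask.any(axis=1))   # [ True  True]
```

One way to remember it: the axis you pass is the one that disappears, so axis=0 leaves one value per column and axis=1 leaves one value per row.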


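We can invert a mask if needed, so that all trues become false and all falses become true. Comparing with != True does this; the ~ operator is a shorter, more common way to get the same result on a Boolean array. Inverting can be useful, but can also become a little confusing!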

In [7]:
inverted_mask = mask!=True
print (inverted_mask)

[[False  True  True False  True  True False False  True  True]
 [False  True False  True  True  True  True  True  True  True]
 [ True  True  True False  True  True  True  True  True  True]
 [ True False  True  True False  True  True  True  True  True]]
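
As a quick check, the sketch below compares three ways of inverting a mask on a small hand-written example (demo_mask is a hypothetical name). It assumes the input really is a Boolean array.

```python
import numpy as np

demo_mask = np.array([True, False, True])

inverted_a = demo_mask != True          # comparison with True
inverted_b = ~demo_mask                 # bitwise NOT operator
inverted_c = np.logical_not(demo_mask)  # equivalent function call

print(inverted_b)                               # [False  True False]
print(np.array_equal(inverted_a, inverted_b))   # True
print(np.array_equal(inverted_b, inverted_c))   # True
```

Note that ~ only acts as a logical NOT on Boolean arrays. On an integer array it flips the bits of each number instead, so <em>np.logical_not</em> is the safer choice when the data type is uncertain.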


## Adding or averaging trues

Boolean values (True/False) are treated as 1 and 0 in arithmetic. This is useful for counting trues and falses, for example:

In [8]:
print ('Number of trues in array:')
print (mask.sum())

Number of trues in array:
9


In [9]:
print('Number of trues in array by row:')
print (mask.sum(axis=1))

Number of trues in array by row:
[4 2 1 2]


In [10]:
print('Average of trues in array by column:')
print (mask.mean(axis=0))

Average of trues in array by column:
[0.5  0.25 0.25 0.5  0.25 0.   0.25 0.25 0.   0.  ]
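
The same counts can be obtained with <em>np.count_nonzero</em>, and the number of falses follows from the array size. A minimal sketch on illustrative values (the array demo is not from the example above):

```python
import numpy as np

# Illustrative values only
demo = np.array([[73,  6, 10, 91],
                 [84, 65, 81,  5]])
mask = demo > 70

print(mask.sum())                      # 4  (number of trues)
print(np.count_nonzero(mask))          # 4  (same count, different function)
print(np.count_nonzero(mask, axis=1))  # [2 2]  (trues per row)
print(mask.size - mask.sum())          # 4  (number of falses)
print(mask.mean())                     # 0.5  (fraction of trues)
```

The mean of a mask is the fraction of elements that pass the test, which is often easier to interpret than a raw count.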


## Selecting rows or columns based on one value in that row or column

Let's select all columns where the first element (the value in row 0) is equal to or greater than 50:

In [11]:
mask = array1[0,:] >= 50 # colon indicates all columns, zero indicates row 0
print ('\nHere is the mask')
print (mask)
print ('\nAnd here is the mask applied to all columns')
print (array1[:,mask]) # colon represents all rows of chosen columns


Here is the mask
[ True False False  True False False  True  True False  True]

And here is the mask applied to all columns
[[73 91 75 71 66]
 [84  5 29 13 56]
 [42 81  2 32 25]
 [60 56 19 65 68]]


Similarly, if we wanted to select all rows where the second element is equal to or greater than 50:

In [12]:
mask = array1[:,1] >= 50 # colon indicates all rows, 1 indicates column 1 (the second column, as the first is column 0)
print ('\nHere is the mask')
print (mask)
print ('\nAnd here is the mask applied to all rows')
print (array1[mask,:]) # colon represents all columns of chosen rows


Here is the mask
[False  True False  True]

And here is the mask applied to all rows
[[84 65 81  5 44 32 29 13 47 56]
 [60 91 41 56 76 48 19 65 68 68]]
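
A row mask and a column mask can also be applied together. One way to do this, shown here as a sketch rather than the only approach, is <em>np.ix_</em>, which accepts Boolean masks and selects the block of chosen rows and chosen columns. The array demo and the mask names are hypothetical.

```python
import numpy as np

demo = np.array([[73,  6, 10],
                 [84, 65, 81],
                 [60, 91, 41]])

col_mask = demo[0, :] >= 50   # columns whose first element is >= 50
row_mask = demo[:, 1] >= 50   # rows whose second element is >= 50

print(col_mask)   # [ True False False]
print(row_mask)   # [False  True  True]

# np.ix_ builds an open mesh so both masks apply at once
block = demo[np.ix_(row_mask, col_mask)]
print(block)
# [[84]
#  [60]]

# Applying the masks in two steps gives the same block
two_step = demo[row_mask, :][:, col_mask]
print(np.array_equal(block, two_step))   # True
```

Passing both masks directly, as in demo[row_mask, col_mask], pairs the selected indices element by element instead of taking the rows-by-columns block, which is why the sketch uses <em>np.ix_</em> or two separate steps.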


## Using <em>and</em> and <em>or</em>, and combining filters from two arrays

We may create and combine multiple masks. For example, we may have two masks that look for values less than 20 or greater than 80, and then combine those masks with 'or', which is represented by | (the vertical bar, or 'stick').

In [13]:
print ('Mask for values <20:')
mask1 = array1 < 20
print (mask1)

print ('\nMask for values >80:')
mask2 = array1 > 80
print (mask2)

print ('\nCombined mask:')
mask = mask1  | mask2 # | (stick) is used for 'or' with two boolean arrays
print (mask)

print ('\nSelected values using combined mask')
print (array1[mask])

Mask for values <20:
[[False  True  True False  True  True False False False False]
 [False False False  True False False False  True False False]
 [False  True  True False False False  True False  True False]
 [False False False False False False  True False False False]]

Mask for values >80:
[[False False False  True False False False False False False]
 [ True False  True False False False False False False False]
 [False False False  True False False False False False False]
 [False  True False False False False False False False False]]

Combined mask:
[[False  True  True  True  True  True False False False False]
 [ True False  True  True False False False  True False False]
 [False  True  True  True False False  True False  True False]
 [False  True False False False False  True False False False]]

Selected values using combined mask
[ 6 10 91  7 10 84 81  5 13 11  2 81  2 17 91 19]


We can combine these masks in a single line. The parentheses around each comparison are required, because | is evaluated before < and >:

In [14]:
mask = (array1 < 20) | (array1 > 80)
print (mask)

[[False  True  True  True  True  True False False False False]
 [ True False  True  True False False False  True False False]
 [False  True  True  True False False  True False  True False]
 [False  True False False False False  True False False False]]
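
To illustrate why the parentheses matter, the sketch below builds the same combined mask on a three-element array (demo is a made-up example), shows <em>np.logical_or</em> as an equivalent function form, and then shows what happens when the parentheses are left out.

```python
import numpy as np

demo = np.array([5, 50, 95])

# Parentheses make each comparison run before the | operator
mask = (demo < 20) | (demo > 80)
print(mask)   # [ True False  True]

# np.logical_or is an equivalent function form
print(np.logical_or(demo < 20, demo > 80))   # [ True False  True]

# Without parentheses Python evaluates 20 | demo first, which leaves a
# chained comparison that NumPy cannot reduce to a single True/False
try:
    demo < 20 | demo > 80
except ValueError as err:
    print('ValueError:', err)
```

The same ValueError appears if the Python keywords 'or' and 'and' are used between arrays, since they also need a single True/False from each side.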


We can combine masks derived from different arrays, so long as they are the same shape. For example, let's produce another array of random numbers and find the positions where both arrays have values greater than 50. When combining Boolean arrays we represent 'and' with &.

In [15]:
array2 = np.random.randint(0,100,size=(4,10))

print ('Mask for values of array1 > 50:')
mask1 = array1 > 50
print (mask1)

print ('\nMask for values of array2 > 50:')
mask2 = array2 > 50
print (mask2)

print ('\nCombined mask:')
mask = mask1  & mask2 
print (mask)

Mask for values of array1 > 50:
[[ True False False  True False False  True  True False  True]
 [ True  True  True False False False False False False  True]
 [False False False  True False False False False False False]
 [ True  True False  True  True False False  True  True  True]]

Mask for values of array2 > 50:
[[False False  True False  True False  True False False False]
 [ True False False False False False False  True  True  True]
 [False False  True  True  True False  True  True False False]
 [False  True False False  True  True False  True False  True]]

Combined mask:
[[False False False False False False  True False False False]
 [ True False False False False False False False False  True]
 [False False False  True False False False False False False]
 [False  True False False  True False False  True False  True]]


We could shorten this to:

In [16]:
mask = (array1 > 50) & (array2 > 50)
print (mask)

[[False False False False False False  True False False False]
 [ True False False False False False False False False  True]
 [False False False  True False False False False False False]
 [False  True False False  True False False  True False  True]]
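
A common use of a mask built from two arrays is to filter paired data, where each position in both arrays describes the same item. The sketch below uses hypothetical height and weight values to show the combined mask indexing either array.

```python
import numpy as np

# Hypothetical paired measurements with the same shape
heights = np.array([150, 180, 165, 190])
weights = np.array([ 55,  90,  70,  60])

# True where both conditions hold at the same position
mask = (heights > 160) & (weights > 65)
print(mask)            # [False  True  True False]

# The same mask can index either array
print(heights[mask])   # [180 165]
print(weights[mask])   # [90 70]
```

Because the mask records positions rather than values, the selected heights and weights stay aligned with each other.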


## Setting values based on mask

We can use masks to reassign values only for elements that meet the given criteria. For example, we can set all cells with a value less than 50 to zero, and set all other cells to 1.

In [17]:
print ('Array at start:')
print (array1)
mask = array1 < 50
array1[mask] = 0
mask = mask != True # invert mask
array1[mask] = 1
print('\nNew array')
print (array1)

Array at start:
[[73  6 10 91  7 10 75 71 38 66]
 [84 65 81  5 44 32 29 13 47 56]
 [42 11  2 81 26 49  2 32 17 25]
 [60 91 41 56 76 48 19 65 68 68]]

New array
[[1 0 0 1 0 0 1 1 0 1]
 [1 1 1 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0 0 0]
 [1 1 0 1 1 0 0 1 1 1]]


We can shorten this by making the mask implicit in the assignment. The order matters here: the first line writes zeros, which the second test (>= 50) then leaves alone.

In [18]:
array2[array2<50] = 0
array2[array2>=50] = 1

print('New array2:')
print(array2)

New array2:
[[0 0 1 0 1 0 1 0 0 0]
 [1 0 0 0 0 0 0 1 1 1]
 [0 0 1 1 1 0 1 1 0 0]
 [0 1 0 0 1 1 0 1 0 1]]
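
If the original values need to be kept, <em>np.where</em> offers an alternative that builds a new array instead of overwriting the old one. This is a sketch on made-up values, not a replacement for the in-place approach above.

```python
import numpy as np

demo = np.array([[73,  6, 10],
                 [84, 65, 49]])

# np.where(condition, value_if_true, value_if_false) returns a new array
binary = np.where(demo < 50, 0, 1)
print(binary)
# [[1 0 0]
#  [1 1 0]]

# The original array is unchanged
print(demo[0, 0])   # 73

# Casting the inverse mask to integers gives the same result
same = (demo >= 50).astype(int)
print(np.array_equal(binary, same))   # True
```

Since both replacement values are chosen in one call, there is no dependence on the order of two separate assignments.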


## Miscellaneous examples

Select columns where the average value across the column is greater than the average across the whole array, and return both the selected columns and their column numbers.

In [30]:
array = np.random.randint(0,100,size=(4,10))
number_of_columns = array.shape[1]
column_list = np.arange(0, number_of_columns) # create an array of column indices
array_average = array.mean()
column_average = array.mean(axis=0)
column_average_greater_than_array_average = column_average > array_average
selected_columns = column_list[column_average_greater_than_array_average]
selected_data = array[:,column_average_greater_than_array_average]

print ('Selected columns:')
print (selected_columns)
print ('\nSelected data:')
print (selected_data)

Selected columns:
[0 1 3 4 8 9]

Selected data:
[[76 98 95 67 24 60]
 [57 81 59 39 54 61]
 [71 84 74 90 99 80]
 [49 40 74 22 47 24]]
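
The column numbers can also be read straight from the mask with <em>np.flatnonzero</em>, which avoids building a separate array of column indices. A minimal sketch on a small fixed array (demo and its values are illustrative, chosen so the overall mean is 50):

```python
import numpy as np

demo = np.array([[10, 90, 20, 80],
                 [30, 70, 40, 60]])

# Column means are 20, 80, 30, 70; the overall mean is 50
column_mask = demo.mean(axis=0) > demo.mean()

# np.flatnonzero returns the indices of the True entries directly
selected_columns = np.flatnonzero(column_mask)
print(selected_columns)       # [1 3]

selected_data = demo[:, column_mask]
print(selected_data)
# [[90 80]
#  [70 60]]
```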
