In [1]:
%pip install numpy



In [1]:
import numpy as np

# Filtering Data With Logical Indexing

Sometimes you want to remove certain values from your dataset.  In NumPy, this can be done with **Logical Indexing**; in plain Python, it is done with an **If Statement**.
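As a rough side-by-side comparison, the sketch below filters the same values both ways. The data and the names `values`, `kept_list`, and `kept_array` are made up for illustration and are not part of the exercises.

```python
import numpy as np

values = [4, -1, 7, -3, 2]  # made-up example data

# Plain Python: an if statement applied to each element (here in a list comprehension)
kept_list = [v for v in values if v > 0]

# NumPy: logical indexing on an array
arr = np.array(values)
kept_array = arr[arr > 0]

print(kept_list)   # [4, 7, 2]
print(kept_array)  # [4 7 2]
```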

## Step 1: Create a Logical NumPy Array

We can test all of the values in an array at once with a single logical expression.  This is broadcasting, the same mechanism used by the math operations we saw earlier:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> data < 3
array([ True,  True, False, False, False])
```

**Exercises**: Make arrays of True/False values that answer the following questions about the dataset below for each element.

In [2]:
import numpy as np

list_of_values = [3, 7, 10, 2, 1, 7, np.nan, 20, -5]
data = np.array(list_of_values)

*Example*: Where are the values that are greater than zero?

In [3]:
data > 0

array([ True,  True,  True,  True,  True,  True, False,  True, False])
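Note the `False` in the seventh position: comparisons such as `>`, `<`, and `==` involving `np.nan` evaluate to `False`. The sketch below illustrates this with made-up values (the name `sample` is illustrative); `np.isnan` is the NumPy function for locating missing values directly.

```python
import numpy as np

sample = np.array([1.0, np.nan, -2.0])  # made-up example data

gt_zero = sample > 0            # NaN is not greater than zero...
lt_zero = sample < 0            # ...and not less than zero either
is_missing = np.isnan(sample)   # True only where the value is NaN

print(gt_zero)     # [ True False False]
print(lt_zero)     # [False False  True]
print(is_missing)  # [False  True False]
```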

Where are the values that are less than four?

Where are the values that are equal to 7?

Where are the values that are greater than or equal to 7?

Where are the values that are not equal to 7?

## Step 2: Filter with Logical Indexing

If an array of True/False values is used to *index* another array, and both arrays are the same size, it will return all of the values that correspond to the True values of the indexing array:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> mask = data > 3
>>> mask
array([False, False, False,  True,  True])
>>> data[mask]
array([4, 5])
```

Both steps can also be done in a single expression.  Sometimes this can make things clearer!


```python
>>> data[data > 3]
array([4, 5])
```


**Exercises**:  Using the data below, extract only the values that correspond to each question.

In [6]:
data = np.array([3, 1, -6, 8, 20, 2, np.nan, 7, 1, np.nan, 9, 7, 7, -7])
data

array([ 3.,  1., -6.,  8., 20.,  2., nan,  7.,  1., nan,  9.,  7.,  7.,
       -7.])

*Example*: The values that are less than 0

In [7]:
data[data < 0]

array([-6., -7.])

The values that are greater than 3

The values not equal to 7

The values equal to 20

### Statistics on Filtered Data



**Exercises**: Using the following dataset, have Python calculate the answers to the questions below:

In [8]:
data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
data

array([ 3,  1, -6,  8, 20,  2,  7,  1,  9,  7,  7, -7])

*Example*: How many values are greater than 4?  

In [15]:
len(data[data > 4])

6
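Because `True` counts as 1 and `False` as 0, a mask can also be summed or averaged directly, without indexing first. A minimal sketch with made-up values (the name `scores` is illustrative, not from the exercises):

```python
import numpy as np

scores = np.array([2, 9, 4, 7, 10])  # made-up example data
mask = scores > 5                    # True where a score is above 5

count = np.sum(mask)                   # number of True values
proportion = np.mean(mask)             # fraction of True values
filtered_mean = np.mean(scores[mask])  # mean of only the selected values

print(count)          # 3
print(proportion)     # 0.6
print(filtered_mean)  # roughly 8.67
```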

How many values are equal to 7?

What is the mean value of the positive numbers?

What is the mean value of the negative numbers?

What is the median value of the values that are greater than 5?

What proportion of the values are positive?  (hint: sum and len, or mean)

What proportion of the values are less than or equal to 8?

## Modifying Data Using Logical Indexing

Just like normal indexing operations with arrays, logical indexing can be used to *set* new values, in addition to *getting* existing values from an array:

| Example | Description |
| :-- | :-- |
| **`data[data > 5] = 10`** | Set all values greater than 5 to 10  |
| **`data[data > 5] = data[data > 5] * 10`** | Set the values greater than 5 to themselves times 10  |
| **`data[data > 5] *= 10`** | Multiply all the values greater than 5 by 10, setting them in-place.  |
| **`data2 = data.copy()`** | Copy an array to a new variable |
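Setting values this way changes the array in place, which is why `copy()` appears in the table: it keeps an untouched version around. A small sketch with made-up values (the names `original` and `backup` are illustrative):

```python
import numpy as np

original = np.array([1, 6, 3, 8])  # made-up example data
backup = original.copy()           # independent copy; later edits don't affect it

original[original > 5] = 0         # overwrite the selected elements in place

print(original)  # [1 0 3 0]
print(backup)    # [1 6 3 8]
```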

*Example*: Set all positive values in `data` to 100

In [3]:
data = np.arange(-4, 5)
data

array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [4]:
data[data > 0] = 100
data

array([ -4,  -3,  -2,  -1,   0, 100, 100, 100, 100])

Set all negative values in `data` to 0

In [7]:
data = np.array([ 1, -2,  2, -1,  1,  4, -3,  1,  2, -1])
data

array([ 1, -2,  2, -1,  1,  4, -3,  1,  2, -1])

Add 100 to all values in `data` less than 100.

In [13]:
data = np.array([0, 101, 2, 3, 104, 105, 6, 107, 8])
data

array([  0, 101,   2,   3, 104, 105,   6, 107,   8])

Make all the negative values in `data` positive.

In [17]:
data = np.array([1, -2, -3, 4, -5, 6, 7, -8, 9, 10, -11])
data

array([  1,  -2,  -3,   4,  -5,   6,   7,  -8,   9,  10, -11])

Challenge: Set all the values greater than 10 in `data` to random values between 0 and 10 (use the `np.random.randint()` function)

In [19]:
data = np.array([3, 15, 8, 12, 7, 19, 11, 6, 2, 25])
data

array([ 3, 15,  8, 12,  7, 19, 11,  6,  2, 25])

## Using Logical Indexing to Link Two Different Variables in a Dataset

As long as the logical expression produces a mask of True/False values with the same shape as the array it's being used on, it works!  One implication is that we can use one variable in a dataset to index another:

| Syntax | Description |
| :--  | :-- |
| **`data2[data1 > 0]`** | Get the values in `data2` with indices that correspond to the positive values in `data1` |
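As a rough illustration of the pattern, the sketch below builds a mask from one array and applies it to another of the same length. The data and the names `weights`, `sex`, and `female_weights` are made up for this example.

```python
import numpy as np

# Made-up example data: one label per measurement, in the same order
weights = np.array([70, 82, 65, 90])
sex = np.array(['F', 'M', 'F', 'M'])

# The mask is built from `sex` but applied to `weights`
female_weights = weights[sex == 'F']

print(female_weights)  # [70 65]
```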

Get all the "Treatment" group's temperatures

In [30]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

Calculate the mean temperature of the "Control" group

In [32]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

Get the group names for all the temperatures greater than 35, then calculate the proportion of them that are in the Control group

In [None]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])