In [1]:
import numpy as np

### Why are we studying the details of NumPy when our ultimate goal is to analyze tabular data (Excel-sheet style)?
We study NumPy even though our end goal is to analyze tabular data in Pandas, because Pandas is built on top of NumPy. Understanding NumPy makes it easier to work with Pandas and to perform data analysis effectively.

# NumPy Arrays vs Python Lists

Like Python lists, NumPy arrays are ordered sequences of data. However, their properties differ in a few ways:

| Property | Lists | Arrays |
| :--:     | :--:  | :--:   |
| Ordered  | ✔️    | ✔️ |
| Mutable | ✔️    | ✔️ |
| Can Mix Data Types | ✔️ |   |
| Append Data without Copying Whole Structure | ✔️  |  |
| Broadcastable |  | ✔️ |
| Fast Calculations |   | ✔️ |
| Fast Append | ✔️ |  |



### Exercises
Let's explore each property and compare lists and arrays.

#### Ordered Index

**Examples**

Index the third element of these data collections:

List:

In [None]:
x = [10, 20, 30, 40, 50]


Array:

In [None]:
x = np.array([10, 20, 30, 40, 50])


#### Ordered Slicing

Slice out the second through fourth elements of these data collections:

**Examples**

List:

In [None]:
x = [10, 20, 30, 40, 50]


Array:

In [None]:
x = np.array([10, 20, 30, 40, 50])
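
As a minimal sketch of the indexing and slicing operations above (the names `values_list` and `values_array` are illustrative and not part of the exercises): both collections use zero-based indexing, and a slice excludes its stop index.

```python
import numpy as np

values_list = [10, 20, 30, 40, 50]
values_array = np.array(values_list)

# Indexing is zero-based, so index 2 is the third element.
third_from_list = values_list[2]
third_from_array = values_array[2]

# A slice includes the start index and excludes the stop index,
# so 1:4 covers the second through fourth elements.
slice_from_list = values_list[1:4]
slice_from_array = values_array[1:4]

print(third_from_list, third_from_array)
print(slice_from_list, slice_from_array)
```

The syntax is identical for both types; the difference is that slicing a list gives back a list, while slicing an array gives back an array.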


#### "Mutate" a Value inside a Collection

Just as values inside a collection can be retrieved, they can also be re-assigned.  For example, to change the third element of these data collections to the value "A":

```python
data[index] = value
```

List:

In [None]:
x = ["A", "C", "G", "G", "C", "T"]
x[2] = "A"
x

Array:

In [None]:
x = np.array(["A", "C", "G", "G", "C", "T"])
x[2]="A"
x

#### Mixing Data Types

Change the third element of these data collections to the value 40:
```python
data[index] = value
```

List:

In [None]:
x = ["A", "C", "G", "G", "C", "T"]
x[2]=40
x

Array:

In [None]:
x = np.array(["ATG", "CGA", "GCT", "GTT", "CTT", "TGA"])
x[2]=40
x
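
To make the difference visible, here is a small sketch (with illustrative variable names) that checks what each collection holds after the assignment. A list stores the integer as-is, while a string array converts the value to match its own data type.

```python
import numpy as np

labels_list = ["A", "C", "G"]
labels_list[2] = 40                 # the list now mixes strings and an int

labels_array = np.array(["ATG", "CGA", "GCT"])
labels_array[2] = 40                # the value is stored as the string "40"

print(labels_list, type(labels_list[2]))
print(labels_array, labels_array.dtype)
```

The array's `dtype` stays a string type throughout, which is why the "Can Mix Data Types" row in the table is checked only for lists.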

#### Append Values

Append a new value to the end of these data:

List:
```python
data.append(value)
```

In [None]:
x = ["A", "C", "G", "G", "C", "T"]
x.append('A')
x

In [None]:
x = ["A", "C", "G", "G", "C", "T"]
x.insert(3, "A")
x

Array:
```python
np.append(data, value)
```

In [None]:
x = np.array(["A", "C", "G", "G", "C", "T"])
x2 = np.append(x, "A")  # np.append returns a new array; x itself is unchanged
x2
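
One detail worth checking explicitly: unlike `list.append`, `np.append` does not change the array it is given. The sketch below (variable names are illustrative) compares the original and the returned array.

```python
import numpy as np

bases = np.array(["A", "C", "G"])
longer = np.append(bases, "T")   # builds and returns a new array

print(bases)    # the original array still has three elements
print(longer)   # the new array has four
```

To keep the appended value, the result has to be assigned to a variable, either a new one or the original name.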

### Broadcasting

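Run the following code, which tries to multiply every value in the collection by 10. With a list, `data * 10` repeats the list ten times instead, so an explicit loop (here, a list comprehension) is needed. With an array, the multiplication is broadcast to every element.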

List:

In [None]:
data = [1, 2, 3, 4, 5]
data

In [None]:
data * 10

In [None]:
data10 = [x * 10 for x in data]
data10

Array:

In [None]:
data=np.array([1, 2, 3, 4, 5])

In [None]:
data10 = data * 10
data10
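
The three behaviours can be compared side by side in one short sketch (using a factor of 3 and illustrative names to keep the output small):

```python
import numpy as np

numbers = [1, 2, 3]

repeated = numbers * 3                    # list repetition, not arithmetic
scaled_list = [n * 3 for n in numbers]    # element-wise, written as a loop
scaled_array = np.array(numbers) * 3      # element-wise, via broadcasting

print(repeated)       # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(scaled_list)    # [3, 6, 9]
print(scaled_array)   # [3 6 9]
```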

### Fast Calculations

Run the following code, which multiplies every value in the collection by 10.  This time, take a look at how long it takes to run.  Which is faster?

Note: the ```%%timeit``` magic command runs the cell many times and reports the average amount of time each run of the cell's code took.

List:

In [None]:
data = list(range(0, 1_000_000))
# data

In [None]:
%%timeit
data10 = [x * 10 for x in data]

Array:

In [None]:
data = np.arange(0, 1_000_000)
data

In [None]:
%%timeit
data10 = data * 10
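
The `%%timeit` magic only works inside Jupyter/IPython. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The smaller size `n` and the repeat count are arbitrary choices made here so the example finishes quickly.

```python
import timeit
import numpy as np

n = 100_000          # smaller than in the cells above, to keep this quick
runs = 20            # how many times each version is executed

data_list = list(range(n))
data_array = np.arange(n)

# Both approaches should produce the same numbers.
list_result = [x * 10 for x in data_list]
array_result = data_array * 10

# timeit.timeit calls the function `number` times and returns total seconds.
list_seconds = timeit.timeit(lambda: [x * 10 for x in data_list], number=runs)
array_seconds = timeit.timeit(lambda: data_array * 10, number=runs)

print(f"list comprehension: {list_seconds / runs:.6f} s per run")
print(f"array broadcast:    {array_seconds / runs:.6f} s per run")
```

The exact timings depend on the machine, so no expected numbers are shown here.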

### Fast Append

Run the following code, which appends a new value to a list, and then to an array, 10,000 times.  Which is faster?


List:

In [None]:
%%timeit
data = []  # an empty list
for _ in range(10000):  # repeat N times
    data.append("A")



Array:

In [None]:
data

In [None]:
%%timeit
data = np.array([], dtype=str)  # an empty array
for _ in range(10000):  # repeat N times
    data = np.append(data, "A")
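
When values arrive one at a time but an array is wanted at the end, one common pattern (offered here as a suggestion, not as part of the original exercise) is to collect the values in a list and convert to an array once:

```python
import numpy as np

collected = []                    # lists are cheap to append to
for _ in range(10_000):
    collected.append("A")

as_array = np.array(collected)    # convert to an array once, at the end

print(as_array.shape)
```

This avoids building a brand-new array on every iteration, which is what `np.append` does inside the loop above.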

### Summary
If you want maximum flexibility, lists are perfect!  But if your data is complete and well-organized, arrays are quite handy: they are simple to work with and can crunch a lot of numbers in a short time!

# Filtering Data With Logical Indexing

Sometimes you want to remove certain values from your dataset.  In NumPy, this can be done with **Logical Indexing**; in plain Python, it is done with an **If Statement**.

## Step 1: Create a Logical NumPy Array

We can compare every value in an array at once with a single logical expression, which produces an array of True/False values.  This is broadcasting, just as with the math operations we saw earlier:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> data < 3
array([ True,  True, False, False, False])
```

**Exercises**: Make arrays of True/False values that answer the following questions about the dataset below for each element.

In [4]:
list_of_values = [3, 7, 10, 2, 1, 7, np.nan, 20, -5]
data = np.array(list_of_values)

*Example*: Where are the values that are greater than zero in data?

array([ True,  True,  True,  True,  True,  True, False,  True, False])
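
The `False` in the seventh position comes from `np.nan`: any ordinary comparison with nan evaluates to False. The following sketch (with an illustrative three-element array) shows this, along with `np.isnan`, which is the usual way to locate missing values.

```python
import numpy as np

values = np.array([3.0, np.nan, -5.0])

greater_than_zero = values > 0        # nan > 0 is False
equal_to_itself = values == values    # even nan == nan is False
is_missing = np.isnan(values)         # True only where the value is nan

print(greater_than_zero)
print(equal_to_itself)
print(is_missing)
```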

In [8]:
numbers = [3, 7, 10, 2, 1, 7, np.nan, 20, -5]
filtered_numbers = [num > 0 for num in numbers]

# filtered_numbers = [num for num in numbers if num > 0]

print(filtered_numbers) 


[True, True, True, True, True, True, False, True, False]


Where are the values that are equal to 7?

In [None]:
data == 7

## Step 2: Filter with Logical Indexing

If an array of True/False values is used to *index* another array, and both arrays are the same size, it will return all of the values that correspond to the True values of the indexing array:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> mask = data > 3
>>> mask
array([False, False, False,  True,  True])
>>> data[mask]
array([4, 5])
```

Both steps can also be done in a single expression.  Sometimes this can make things clearer!


```python
>>> data[data > 3]
array([4, 5])
```
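
Masks can also be inverted or combined before they are used as an index. The sketch below goes slightly beyond the examples above: it assumes the `~` (not) and `&` (and) operators on boolean arrays, and uses made-up `readings` data.

```python
import numpy as np

readings = np.array([4.0, np.nan, -2.0, 9.0, np.nan, 1.0])

# ~ inverts a boolean mask: keep everything that is not nan.
valid = readings[~np.isnan(readings)]

# & combines two masks element by element.
# Each comparison needs its own parentheses.
small_positive = readings[(readings > 0) & (readings < 5)]

print(valid)            # [ 4. -2.  9.  1.]
print(small_positive)   # [4. 1.]
```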


**Exercises**:  Using the data below, extract only the values that correspond to each question.

In [None]:
data = np.array([3, 1, -6, 8, 20, 2, np.nan, 7, 1, np.nan, 9, 7, 7, -7])
data

*Example*: The values that are less than 0

The values that are greater than 3

The values not equal to 7

The values equal to 20

### Statistics on Filtered Data



**Exercises**: Using the following dataset, have Python calculate the answers to the questions below:

In [9]:
data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
data

array([ 3,  1, -6,  8, 20,  2,  7,  1,  9,  7,  7, -7])

*Example*: How many values are greater than 4?  
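
One possible way to work this example, sketched under the assumption that counting `True` values is what is wanted: since `True` counts as 1 and `False` as 0, summing a mask gives the number of matching values. The filtered array can also be passed to any statistic.

```python
import numpy as np

data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])

mask = data > 4
count_above_4 = mask.sum()           # True counts as 1, False as 0
same_count = len(data[mask])         # equivalent: length of the filtered array
mean_above_4 = data[mask].mean()     # a statistic on the filtered values only

print(count_above_4, same_count, mean_above_4)
```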

How many values are equal to 7?

What is the mean value of the positive numbers?

What is the mean value of the negative numbers?

What is the median value of the values that are greater than 5?

## Modifying Data Using Logical Indexing

Just as with normal indexing, logical indexing can be used to *set* new values in an array, in addition to *getting* values from it:

| Example | Description |
| :-- | :-- |
| **`data[data > 5] = 10`** | Set all values greater than 5 to 10  |
| **`data[data > 5] = data[data > 5] * 10`** | Set the values greater than 5 to themselves times 10  |
| **`data[data > 5] *= 10`** | Multiply all the values greater than 5 by 10, setting them in-place.  |
| **`data2 = data.copy()`** | Copy an array to a new variable |
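
As a short sketch of the last two rows of the table working together (the array contents and names are made up for illustration): copying first keeps the original values available, because the masked assignment changes the array in place.

```python
import numpy as np

original = np.array([2, 8, 4, 10, 6])
modified = original.copy()         # independent copy; edits leave `original` alone

modified[modified > 5] *= 10       # multiply only the values greater than 5

print(original)   # [ 2  8  4 10  6]
print(modified)   # [  2  80   4 100  60]
```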

Example: Set all positive values in `data` to 100

In [None]:
data = np.arange(-4, 5)
data

Set all negative values in `data` to 0

In [None]:
data = np.array([ 1, -2,  2, -1,  1,  4, -3,  1,  2, -1])
data

Add 100 to all values in `data` less than 100.

In [None]:
data = np.array([0, 101, 2, 3, 104, 105, 6, 107, 8])
data

Make all the negative values in `data` positive.

In [None]:
data = np.array([1, -2, -3, 4, -5, 6, 7, -8, 9, 10, -11])
data

Challenge: Set all the values greater than 10 in `data` to random values between 0 and 10 (use the `np.random.randint()` function)

In [None]:
data = np.array([3, 15, 8, 12, 7, 19, 11, 6, 2, 25])
data

## Using Logical Indexing to Link Two Different Variables in a Dataset

Logical indexing works as long as the mask of True and False values has the same shape as the array it is being used on.  One implication is that we can use one variable in a dataset to index another:

| Syntax | Description |
| :--  | :-- |
| **`data2[data1 > 0]`** | Get the values in `data2` with indices that correspond to the positive values in `data1` |
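
A minimal sketch of this idea, using hypothetical paired measurements (the `weights` and `sex` arrays are invented for illustration): position `i` in each array describes the same subject, so a mask built from one array selects the matching entries of the other.

```python
import numpy as np

# Hypothetical data: entry i of each array belongs to the same subject.
weights = np.array([61.5, 72.0, 58.3, 80.1])
sex = np.array(["F", "M", "F", "M"])

is_female = sex == "F"                  # mask built from one variable...
female_weights = weights[is_female]     # ...used to index the other
mean_female_weight = female_weights.mean()

print(female_weights)
print(mean_female_weight)
```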

| temp | group |
| :-- | :-- |
| 35 | Control |
| 38 | Treatment |
| 32 | Treatment |

Get all the "Treatment" group's temperatures

In [14]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

Calculate the mean temperature of the "Control" group

In [None]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

Get the group names for all the temperatures greater than 35

In [None]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

In [None]:
group[temp > 35]

An example of `np.select`; we will do detailed exercises in the pandas session.

In [17]:

data = np.array([1,5,0, 2, 3, 4, 5])

# Define the conditions and corresponding values
conditions = [data < 3, data == 3, data > 3]
values = ['Small', 'Medium', 'Large']

# Apply the conditions and values to the data using np.select
result = np.select(conditions, values)

print(result)

['Small' 'Large' 'Small' 'Small' 'Medium' 'Large' 'Large']
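
In the example above every element matches one of the three conditions. When an element matches none of them, `np.select` falls back to its `default` argument, which is `0` unless specified. The sketch below assumes a made-up `scores` array containing nan and passes a string default so that all labels share one type.

```python
import numpy as np

scores = np.array([1.0, 3.0, np.nan, 7.0])

conditions = [scores < 3, scores == 3, scores > 3]
labels = ["Small", "Medium", "Large"]

# nan fails all three comparisons, so it receives the default label.
result = np.select(conditions, labels, default="Unknown")

print(result)
```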
