# Lesson I

## Indexing and Slicing Arrays

### Indexing 1D arrays

NumPy indexing is zero-based, meaning the first element has index zero. Array indexing uses square brackets, similar to Python lists.

In [2]:
import numpy as np

array = np.array([2, 4, 6, 8, 10])
print(array[3])

8


### Indexing elements in 2D

When indexing a 2D array, give NumPy both a row index and a column index to return a single element. Here, as elsewhere in NumPy, row information comes before column information.

In [3]:
# Sudoku game array
sudoku_game = np.load('datasets/sudoku_game.npy')

print(sudoku_game[2, 4])

0


If we give NumPy just one index when indexing a 2D array, it treats that index as a row index. For example, asking for index 0 returns the first row.

In [4]:
print(sudoku_game[0])

[0 0 5 0 0 9 0 0 1]


To index a column, provide a colon in place of the row index, followed by the column index. The colon by itself tells NumPy that we want all rows.

In [5]:
print(sudoku_game[:, 3])

[0 0 0 0 0 0 5 9]
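The three indexing forms above can be tried without the dataset file. The sketch below is a minimal example that assumes the grid matches the 8 x 9 array printed in the sorting section of this lesson, typed in by hand instead of loaded with **np.load()**. The names element, first_row, and fourth_column are illustrative and not part of the lesson.

```python
import numpy as np

# Assumption: the same values as the sudoku_game array printed in this lesson,
# entered by hand so the example does not depend on the .npy file.
sudoku_game = np.array([
    [0, 0, 5, 0, 0, 9, 0, 0, 1],
    [0, 7, 0, 0, 6, 0, 0, 4, 3],
    [0, 0, 6, 0, 0, 2, 0, 8, 7],
    [1, 9, 0, 0, 0, 7, 4, 0, 0],
    [0, 5, 0, 0, 8, 3, 0, 0, 0],
    [6, 0, 0, 0, 0, 0, 1, 0, 5],
    [0, 0, 3, 5, 0, 8, 6, 9, 0],
    [0, 4, 2, 9, 1, 0, 3, 0, 0],
])

element = sudoku_game[2, 4]        # row index 2, column index 4: a single value
first_row = sudoku_game[0]         # a single index selects a whole row
fourth_column = sudoku_game[:, 3]  # colon keeps every row, 3 picks the column

print(sudoku_game.shape)  # (8, 9)
print(element)            # 0
print(first_row)          # [0 0 5 0 0 9 0 0 1]
print(fourth_column)      # [0 0 0 0 0 0 5 9]
```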


### Slicing 1D arrays

Slicing extracts a subset of an array based on the given indices and returns it as a new array object. With basic slicing, that object is a view that shares its data with the original array rather than a separate copy. To slice, provide a start and stop value, separated by a colon and enclosed in square brackets. The element at the start index is **included** in the result, but the one at the stop index is **not**.

In [6]:
array = np.array([2, 4, 6, 8, 10])
print(array[2:4])

[6 8]
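One consequence of slicing is easy to miss: a basic slice shares memory with the array it came from, so writing to the slice also changes the original. The sketch below illustrates this and shows **.copy()** as one way to get independent data. The names view and independent are illustrative.

```python
import numpy as np

array = np.array([2, 4, 6, 8, 10])

view = array[2:4]                # basic slice: shares memory with array
independent = array[2:4].copy()  # explicit copy: owns its own data

view[0] = 99          # writes through to the original array
independent[1] = -1   # leaves the original array untouched

print(array)        # [ 2  4 99  8 10]
print(independent)  # [ 6 -1]
```

If a slice is only read and never modified, the distinction does not matter, and the view avoids copying data.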


### Slicing 2D arrays

To slice in 2D, we need to tell NumPy how both the rows and the columns should be sliced. Row start and stop indices are followed by column start and stop indices, **separated by a comma**.

In [7]:
print(sudoku_game[3:6, 3:6])

[[0 0 7]
 [0 8 3]
 [0 0 0]]


In addition, we can give NumPy a third number: the **step value**. For example, below we are only looking at the **corner** values of the middle sudoku square.

In [8]:
print(sudoku_game[3:6:2, 3:6:2])

[[0 7]
 [0 0]]
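The start, stop, and step values work the same way in one dimension, and any of them can be omitted. As a small sketch on the 1D array from earlier in the lesson:

```python
import numpy as np

array = np.array([2, 4, 6, 8, 10])

every_other = array[::2]  # start and stop omitted: whole array, step 2
odd_indices = array[1:4:2]  # indices 1 and 3
first_three = array[:3]   # omitted start defaults to 0
from_third = array[3:]    # omitted stop runs to the end

print(every_other)  # [ 2  6 10]
print(odd_indices)  # [4 8]
print(first_three)  # [2 4 6]
print(from_third)   # [ 8 10]
```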


### Sorting arrays

The **np.sort()** function sorts an array along a given axis. Here, NumPy has sorted the values within each row (that is, along the columns axis), so the highest number from each row ends up at the far right.

In [9]:
print("Before Sorting")
print(sudoku_game)

Before Sorting
[[0 0 5 0 0 9 0 0 1]
 [0 7 0 0 6 0 0 4 3]
 [0 0 6 0 0 2 0 8 7]
 [1 9 0 0 0 7 4 0 0]
 [0 5 0 0 8 3 0 0 0]
 [6 0 0 0 0 0 1 0 5]
 [0 0 3 5 0 8 6 9 0]
 [0 4 2 9 1 0 3 0 0]]


In [10]:
print("After Sorting")
print(np.sort(sudoku_game))

After Sorting
[[0 0 0 0 0 0 1 5 9]
 [0 0 0 0 0 3 4 6 7]
 [0 0 0 0 0 2 6 7 8]
 [0 0 0 0 0 1 4 7 9]
 [0 0 0 0 0 0 3 5 8]
 [0 0 0 0 0 0 1 5 6]
 [0 0 0 0 3 5 6 8 9]
 [0 0 0 0 1 2 3 4 9]]


### Axis order

What if we want to sort by row? To do that, we need to understand NumPy axis labels.

In a 2D array, the direction along the **rows** is axis zero. The direction along the **columns** is axis one.

***Remember, a column looks like the number 1!***

#### Sorting by axis

The default axis in **np.sort()** is the last axis of the array passed to it. If a 2D array is being sorted, NumPy sorts by column, since columns are axis 1 and rows are axis 0. This is why the previous **np.sort()** call sorted the array so that the highest number from each row is in the far-right column.

To sort the array by *row*, so that the highest numbers in each column are at the bottom of the array, we set the ***axis keyword argument to zero***.

In [12]:
print("Sorted by axis")
print(np.sort(sudoku_game, axis=0))  

Sorted by axis
[[0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0]
 [0 4 2 0 0 3 1 0 1]
 [0 5 3 0 1 7 3 4 3]
 [1 7 5 5 6 8 4 8 5]
 [6 9 6 9 8 9 6 9 7]]
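The difference between the two axes is easier to see on a very small array. The sketch below uses a made-up 2 x 3 array named grid (not part of the lesson data) and sorts it both ways.

```python
import numpy as np

# Hypothetical small array, chosen so both results are easy to check by eye
grid = np.array([[3, 1, 2],
                 [0, 5, 4]])

by_last_axis = np.sort(grid)           # default, same as axis=1: each row sorted left to right
by_first_axis = np.sort(grid, axis=0)  # each column sorted top to bottom

print(by_last_axis)
# [[1 2 3]
#  [0 4 5]]
print(by_first_axis)
# [[0 1 2]
#  [3 5 4]]
```

Note that **np.sort()** returns a sorted copy, so grid itself keeps its original order.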


## Exercise I 

###  Slicing and indexing trees

Imagine you are a researcher working with data from New York City's tree census. Each row of the 
```tree_census``` 2D array lists information for a different tree: *the tree ID, block ID, trunk diameter, and stump diameter*, in that order. Living trees do not have stump diameters, which explains why there are so many zeros in that column. *Column order is important because NumPy does **not** have column names!*

The first and last three rows of ```tree_census``` are shown below.

In [13]:
# New York City's tree census data
tree_census = np.load('datasets/tree_census.npy')
print(tree_census)

[[     3 501451     24      0]
 [     4 501451     20      0]
 [     7 501911      3      0]
 ...
 [  1198 227387     11      0]
 [  1199 227387     11      0]
 [  1210 227386      6      0]]


In this exercise, you'll be working specifically with the second column, representing block IDs: 
your research requires you to select specific city blocks for further analysis using NumPy slicing 
and indexing.

In [18]:
# Select all rows of the block ID data from the second column
block_ids = tree_census[:, 1]
# Print the first five block_ids
print(block_ids[:5])
# Select the tenth block ID from block_ids (index 9, since indexing is zero-based)
tenth_block_id = block_ids[9]
# Print the tenth block Id
print(tenth_block_id)
# Select five block IDs from block_ids starting with the tenth ID
block_id_slice = block_ids[9:14]
# Print block_id_slice
print(block_id_slice)

[501451 501451 501911 501911 501911]
501911
[501911 501911 501911 501909 501909]


### Stepping into 2D

Now assume that your research requires you to take an admittedly unrepresentative sample of trunk 
diameters, which are located in the third column of ```tree_census```. Getting just a selection of trunk diameters can be done with NumPy's slicing and stepping functionality.


In [26]:
# Create an array of the first 100 trunk diameters from tree_census
hundred_diameters = tree_census[:100, 2]
# Print hundred_diameters
print(hundred_diameters)
# Create an array of trunk diameters with even row indices from 50 to 100 inclusive
every_other_diameter = tree_census[50:101:2, 2]
# print every other diameter
print(every_other_diameter)

[24 20  3  3  4  4  4  4  4  3  3  4  2  2  3  4  4  4  0 14  3  4  7  8
  7  8  7  5  6  5  5 17  0 19 21 18  4  5  3  4  3  4 13 13 13  5  4  4
  4 11  5  4  5  8 51  7  4 15  3  8  6  6  3  4  3  2  3  3  6  5  5  5
  5  9  4  4  7  7  6  5  4  4  5  5  5  7  3  5  3  3  6  6  8  7  4  5
  4  4  4  4]
[ 5  5 51  4  3  6  3  3  3  6  5  5  4  7  6  4  5  5  3  3  6  8  4  4
  4  6]


### Sorting trees

Sometimes it's easiest to understand data when it is sorted according to the value you are most 
interested in. Your new research task is to create an array containing the trunk diameters in the 
New York City tree census, sorted in order from smallest to largest.

In [27]:
# Extract trunk diameters information and sort from smallest to largest
sorted_trunk_diameters = np.sort(tree_census[:, 2], axis=0)
# print sorted trunk diameters
print(sorted_trunk_diameters)

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4
  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

# Lesson II 

## Filtering Arrays

Filtering means selecting data based on whether it meets a condition.

There are two ways to filter in NumPy; each is useful in different situations.

* Masks and fancy indexing

* ```np.where()```

### Boolean Masks

The code to create a mask checks whether a condition is true for each element in an array. The mask itself is an array of Booleans with the same shape as the evaluated array.

In [28]:
# Creating a 1D array from 1 to 5
one_to_five = np.arange(1, 6)
print(one_to_five)

[1 2 3 4 5]


In [29]:
# Creating a mask that selects only the even numbers
mask = one_to_five % 2 == 0
print(mask)

[False  True False  True False]


#### Filtering with fancy indexing

Once we have a Boolean mask indicating which elements the condition holds true for, we can index the array using the mask. This is called **fancy indexing**, and it's useful when we are only interested in the elements that meet a condition.

In [30]:
print(one_to_five[mask])

[2 4]
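A mask is an ordinary NumPy array, so it can be inspected like any other. The following sketch checks its dtype and shape, counts the matches, and uses the ~ operator to select the elements that do not meet the condition.

```python
import numpy as np

one_to_five = np.arange(1, 6)
mask = one_to_five % 2 == 0

print(mask.dtype)          # bool
print(mask.shape)          # (5,), the same shape as one_to_five
print(mask.sum())          # 2, because each True counts as 1
print(one_to_five[mask])   # [2 4]
print(one_to_five[~mask])  # [1 3 5], ~ inverts the mask
```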


### 2D fancy indexing

We may want to filter based on a condition in one row or column but return data from another. Let's say we are assigning partners in a school, and we want to know which class ids have an even number of students.

Class ids are in the left column and class sizes are in the right.

In [31]:
classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])
print(classroom_ids_and_sizes)

[[ 1 22]
 [ 2 21]
 [ 3 27]
 [ 4 26]]


First, we create a mask which checks which values in the second column are divisible by two.

In [32]:
classroom_ids_and_sizes[:, 1] % 2 == 0

array([ True, False, False,  True])

Then, index the first column using that mask so that we return class ids for rows where the class size in the second column meets the condition.

In [33]:
classroom_ids_and_sizes[:, 0][classroom_ids_and_sizes[:, 1] % 2 == 0]

array([1, 4])
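The one-line version above can be hard to read. One possible way to write the same filter is to give the columns and the mask their own names first. The names class_ids, class_sizes, and has_even_size below are illustrative.

```python
import numpy as np

classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])

class_ids = classroom_ids_and_sizes[:, 0]
class_sizes = classroom_ids_and_sizes[:, 1]

has_even_size = class_sizes % 2 == 0  # one Boolean per row

even_class_ids = class_ids[has_even_size]
even_class_rows = classroom_ids_and_sizes[has_even_size]  # the mask selects whole rows

print(even_class_ids)  # [1 4]
print(even_class_rows)
# [[ 1 22]
#  [ 4 26]]
```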

### Fancy indexing vs np.where()

We've seen that fancy indexing returns a filtered array of elements which meet a condition. **np.where()** returns an *array of indices of elements* which meet the condition. This can be useful when indices are needed later to direct NumPy where to apply code.

**np.where()** can be used to combine data as well as to filter arrays: it can pull different elements into a new array based on whether a condition is met.

| Fancy Indexing | ```np.where()``` |
| ---------------|------------------|
| Returns array of elements | Returns array of indices |
|   | Can create an array based on whether elements do or don't meet condition |

### Filtering with np.where()

Using *np.where()* in the classroom example returns indices indicating that the classrooms at indices *zero* and *three* have even numbers of students. Notice the array of indices is enclosed in parentheses: the ***np.where() function actually returns a tuple of arrays.***

Why? Because when the filtered array is multi-dimensional, each element can only be located by including an index for every dimension.

In [34]:
np.where(classroom_ids_and_sizes[:, 1] % 2 == 0)

(array([0, 3], dtype=int64),)
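Because the result is a tuple even in the 1D case, the index array has to be pulled out of it before it can be used on its own. A minimal sketch, using a 1D array of the class sizes from the example (the names result and even_positions are illustrative):

```python
import numpy as np

class_sizes = np.array([22, 21, 27, 26])

result = np.where(class_sizes % 2 == 0)
print(type(result).__name__, len(result))  # tuple 1

even_positions = result[0]          # the single array inside the tuple
print(even_positions)               # [0 3]
print(class_sizes[even_positions])  # [22 26]
```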

#### np.where() element retrieval

Let's look at using np.where() to return the indices of zeros in our sudoku game.

##### A tuple of indices

For a 2D array, **np.where()** returns two sets of indices: one for row indices and one for column indices.

Because of this, it's helpful to unpack the results of *np.where()* into different variables.

In [35]:
row_ind, column_ind = np.where(sudoku_game == 0)
print(row_ind)
print(column_ind)

[0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6
 7 7 7 7]
[0 1 3 4 6 7 0 2 3 5 6 0 1 3 4 6 2 3 4 7 8 0 2 3 6 7 8 1 2 3 4 5 7 0 1 4 8
 0 5 7 8]
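The two index arrays line up position by position: the first row index pairs with the first column index, and so on. The sketch below shows this on a made-up 2 x 3 array named grid, which is small enough to check by hand.

```python
import numpy as np

# Hypothetical small grid with three zeros
grid = np.array([[0, 7, 0],
                 [4, 0, 3]])

row_ind, column_ind = np.where(grid == 0)
print(row_ind)     # [0 0 1]
print(column_ind)  # [0 2 1]

# Pair the arrays up to get (row, column) coordinates
coordinates = list(zip(row_ind.tolist(), column_ind.tolist()))
print(coordinates)  # [(0, 0), (0, 2), (1, 1)]

# Indexing with both arrays returns the matching elements, all zeros here
zeros_found = grid[row_ind, column_ind]
print(zeros_found)  # [0 0 0]
```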


##### Find and replace

The real power of np.where() is its ability to check whether rows, columns or elements meet a condition and then pull one element if the condition is met and another if not.

To replace all zeros in sudoku_game with blank strings, pass a blank string (here, a single space) as the **second argument** to *np.where()*. The **third argument** specifies the value to use when an element does not meet the condition; passing sudoku_game itself keeps the original value.

In [36]:
np.where(sudoku_game == 0, " " , sudoku_game)

array([[' ', ' ', '5', ' ', ' ', '9', ' ', ' ', '1'],
       [' ', '7', ' ', ' ', '6', ' ', ' ', '4', '3'],
       [' ', ' ', '6', ' ', ' ', '2', ' ', '8', '7'],
       ['1', '9', ' ', ' ', ' ', '7', '4', ' ', ' '],
       [' ', '5', ' ', ' ', '8', '3', ' ', ' ', ' '],
       ['6', ' ', ' ', ' ', ' ', ' ', '1', ' ', '5'],
       [' ', ' ', '3', '5', ' ', '8', '6', '9', ' '],
       [' ', '4', '2', '9', '1', ' ', '3', ' ', ' ']], dtype='<U21')
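As the dtype in the output shows, mixing a string replacement with integer data turns every element into a string. If the result needs to stay numeric, one option is to replace with a number instead. The sketch below compares the two on a made-up array named grid; the choice of -1 as a placeholder is an assumption for illustration.

```python
import numpy as np

# Hypothetical small grid with three zeros
grid = np.array([[0, 7, 0],
                 [4, 0, 3]])

blanked = np.where(grid == 0, " ", grid)  # string replacement: result holds strings
flagged = np.where(grid == 0, -1, grid)   # integer replacement: result stays numeric

print(blanked.dtype.kind)  # U, meaning Unicode strings
print(flagged.dtype.kind)  # i, meaning signed integers
print(flagged)
# [[-1  7 -1]
#  [ 4 -1  3]]
```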

## Exercise II 

### Filtering with Masks

In the last lesson, you sorted trees from smallest to largest. Now, you'll use fancy indexing to 
return the row of data representing the largest tree in tree_census. You'll also examine other trees 
located on the same block as the largest tree: are they also large?

In [47]:
# Create an array which contains row data on the largest tree in tree_census
largest_tree_data = tree_census[tree_census[:, 2] == 51]
# print largest tree data
print(largest_tree_data)
# Slice largest tree data to get only the block id
largest_tree_data_id = largest_tree_data[:, 1]
# Print largest tree data id
print(largest_tree_data_id)
# Create an array which contains row data on all trees with largest tree block id
trees_on_largest_tree_block = tree_census[tree_census[:, 1] == 501882]
# print the trees on largest tree block
print(trees_on_largest_tree_block)

[[    61 501882     51      0]]
[501882]
[[    60 501882      8      0]
 [    61 501882     51      0]
 [    62 501882      7      0]
 [    63 501882      4      0]
 [    64 501882     15      0]
 [    65 501882      3      0]
 [    66 501882      8      0]
 [    67 501882      6      0]
 [    68 501882      6      0]
 [    69 501882      3      0]]


### Fancy indexing vs. np.where()

You and your tree research team are double-checking collection data by visiting a few trees in 
person to confirm their measurements. You've been assigned to check the data for trees on block 
313879, and you'd like to make a small array of just the tree data that relates to your work.

In [50]:
# Create the block 313879 array containing trees on block 313879 - Using fancy indexing -
block_313879 = tree_census[tree_census[:, 1] == 313879]
# Print the block
print(block_313879)

[[  1115 313879      3      0]
 [  1116 313879     17      0]]


In [55]:
# Create an array of row indices for trees on block 313879
row_indices = np.where(tree_census[:, 1] == 313879)
# Create an array which only contains data for trees on block 313879
block_313879 = tree_census[row_indices]
# print block
print(block_313879)

[[  1115 313879      3      0]
 [  1116 313879     17      0]]


### Creating arrays from conditions

Currently, the stump diameter and trunk diameter values in tree_census are in two different columns. 
Living trees have a stump diameter of zero while stumps have a trunk diameter of zero. If you'd like 
to include both living trees and stumps in certain research calculations, it might be useful to have 
their diameters together in just one column.

In [58]:
# Create and print a 1D array of tree and stump diameters
trunk_stump_diameters = np.where(tree_census[:, 2] > 0, tree_census[:, 2], tree_census[:, 3])
# print trunk stump diameters
print(trunk_stump_diameters)

[24 20  3  3  4  4  4  4  4  3  3  4  2  2  3  4  4  4  3 14  3  4  7  8
  7  8  7  5  6  5  5 17 31 19 21 18  4  5  3  4  3  4 13 13 13  5  4  4
  4 11  5  4  5  8 51  7  4 15  3  8  6  6  3  4  3  2  3  3  6  5  5  5
  5  9  4  4  7  7  6  5  4  4  5  5  5  7  3  5  3  3  6  6  8  7  4  5
  4  4  4  4  6  5  3  4 12 12 12  5  6  6  6  6  6  5  5  6  7  7 25  5
  5  4  6  6  7 11  6 17 13 14 14 20 15 13  7  7 10 17 14  4  6  7  8  7
  7  6  7  5  2  2  2  2 26 25  2 15  6 20  5  9 15 13 15  3  2 13  6 12
 15 18 22 18 18 15 17  7  3  7  8  4 12 11 12  3  9 12 11 10  8  6  6  7
  7  3 15 12 12  4  5  5  5  4  4  5  4  9  2  4  4  6  5  5  2  5  5  4
  4  5  5  6 11  4  5  7  3 14 11 10  7 15 10  5  6 10 10  6  5  4  4  3
  5  4 14 12 11  8 14 12  9 12 11  7  8 10 10 12 11 12  5  5  6  9  9  8
  5  5  5  6  6 12 12 11 12  8  9  5  5  5  8  2  2  2 14 18 14 14 22 15
 19 14 18  7  7  7  8  8  5 10 14  2  2  2  2 11 12 12  3  3  3  3  3  6
  6  8  2  2 11 11 11  9 11 12 13  9 11  6  4  5  5

# Lesson III

## Adding and removing data

Next stop on our NumPy tour is adding and deleting array elements, an important skill as data is almost never perfectly suited to a task right off the bat.

### Concatenating in NumPy

Concatenation refers to adding data to an array along any existing axis, such as adding columns to a 2D array. We concatenate using the **np.concatenate()** function.

#### Concatenating Rows

Let's go back to our classroom array and add information for two more classrooms, also stored in an array.

To concatenate the two arrays, pass a tuple of the arrays to **np.concatenate()**.

Since concatenating along the first axis is np.concatenate()'s default behaviour, we don't need to explicitly state which axis to concatenate along.

In [60]:
new_classrooms = np.array([[5, 30], [5, 17]])

np.concatenate((classroom_ids_and_sizes, new_classrooms))

array([[ 1, 22],
       [ 2, 21],
       [ 3, 27],
       [ 4, 26],
       [ 5, 30],
       [ 5, 17]])

#### Concatenating Columns

To concatenate along other dimensions, set the **axis keyword** argument.

For example, to add a column to a 2D array, use **axis=1**, since columns lie along axis 1.

In [62]:
grade_levels_and_teachers = np.array([[1, "James"], [1, "George"], [3, "Amy"], [3, "Meehir"]])

np.concatenate((classroom_ids_and_sizes, grade_levels_and_teachers), axis=1)

array([['1', '22', '1', 'James'],
       ['2', '21', '1', 'George'],
       ['3', '27', '3', 'Amy'],
       ['4', '26', '3', 'Meehir']], dtype='<U11')
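A quick way to confirm which axis grew is to compare shapes before and after. The sketch below does this with the classroom arrays from the lesson plus a made-up all-integer column, room_floors, which is an assumption added here so that the result keeps an integer dtype.

```python
import numpy as np

classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])
new_classrooms = np.array([[5, 30], [5, 17]])
# Hypothetical extra column, not part of the lesson data
room_floors = np.array([[1], [1], [2], [2]])

more_rows = np.concatenate((classroom_ids_and_sizes, new_classrooms))  # default axis=0
more_columns = np.concatenate((classroom_ids_and_sizes, room_floors), axis=1)

print(more_rows.shape)                # (6, 2): two rows added
print(more_columns.shape)             # (4, 3): one column added
print(classroom_ids_and_sizes.shape)  # (4, 2): the input array is unchanged
```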

### Shape Compatibility

The arrays to be concatenated **must** have compatible shapes. Specifically, they must have the same shape along all axes except the one being concatenated along.

For example, a three-by-three array cannot be concatenated with a four-by-two array, but it can be concatenated column-wise with a three-by-two array.

The two arrays **must** also have the same number of **dimensions**. This is especially important to remember when concatenating a single row or column of data, usually held in a 1D array.

Turn a 1D array into a 2D array using the **.reshape()** method before concatenating.

To reshape a 1D array in preparation for concatenation with a 2D array, indicate whether the data is vertical or horizontal by setting the length of the flat dimension to one.

In [64]:
array_1D = np.array([1, 2, 3])
column_array_2D = array_1D.reshape((3, 1))
column_array_2D 

array([[1],
       [2],
       [3]])

In [65]:
row_array_2D = array_1D.reshape((1, 3))
row_array_2D

array([[1, 2, 3]])

It is **not** possible to add new dimensions with *np.concatenate()*, since the function only adds data along an existing axis.
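To see the dimension rule in action, the sketch below first tries to concatenate a 1D array onto a 2D array, which raises a ValueError, and then repeats the call after reshaping. The array floors_1d is a made-up example, not part of the lesson data.

```python
import numpy as np

classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])
# Hypothetical 1D data: one floor number per classroom
floors_1d = np.array([1, 1, 2, 2])

failed = False
try:
    # 2D and 1D inputs have different numbers of dimensions
    np.concatenate((classroom_ids_and_sizes, floors_1d), axis=1)
except ValueError as error:
    failed = True
    print("Concatenation failed:", error)

# Reshape to a (4, 1) column so both inputs are 2D with four rows
floors_column = floors_1d.reshape((4, 1))
with_floors = np.concatenate((classroom_ids_and_sizes, floors_column), axis=1)

print(with_floors.shape)  # (4, 3)
print(with_floors[0])     # [ 1 22  1]
```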

### Deleting with np.delete()

NumPy's np.delete() function takes three arguments:

* The array to delete from

* A slice, index, or array of indices to be deleted

* The axis to be deleted along.

For example, to delete the second row from a 2D array, the index to delete is one, and the deletion occurs along the first axis, represented by zero.

In [69]:
classroom_data = np.concatenate((classroom_ids_and_sizes, grade_levels_and_teachers), axis=1)
print(classroom_data)

[['1' '22' '1' 'James']
 ['2' '21' '1' 'George']
 ['3' '27' '3' 'Amy']
 ['4' '26' '3' 'Meehir']]


In [71]:
# Deleting second row
print(np.delete(classroom_data, 1, axis=0))

[['1' '22' '1' 'James']
 ['3' '27' '3' 'Amy']
 ['4' '26' '3' 'Meehir']]


To delete the second column, change the axis keyword argument to one.

In [73]:
classroom_data = np.concatenate((classroom_ids_and_sizes, grade_levels_and_teachers), axis=1)

# Delete the second column
print(np.delete(classroom_data, 1, axis=1))

[['1' '1' 'James']
 ['2' '1' 'George']
 ['3' '3' 'Amy']
 ['4' '3' 'Meehir']]


If you do not specify the axis keyword argument, NumPy deletes the indicated index or indices along a flattened version of the array.

In [75]:
classroom_data = np.concatenate((classroom_ids_and_sizes, grade_levels_and_teachers), axis=1)

print(np.delete(classroom_data, 1))

['1' '1' 'James' '2' '21' '1' 'George' '3' '27' '3' 'Amy' '4' '26' '3'
 'Meehir']
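Two further points about **np.delete()** are worth checking on a small array: it accepts a list of indices, and it returns a new array rather than changing the one passed in. A minimal sketch, using a made-up 3 x 3 array named grid:

```python
import numpy as np

# Hypothetical small array for illustration
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

without_outer_rows = np.delete(grid, [0, 2], axis=0)  # drop the first and last rows
without_middle_col = np.delete(grid, 1, axis=1)       # drop the second column

print(without_outer_rows)  # [[4 5 6]]
print(without_middle_col)
# [[1 3]
#  [4 6]
#  [7 9]]
print(grid.shape)  # (3, 3): grid itself is unchanged
```

To keep the result, assign it to a variable, as the lesson does in its own examples.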


## Exercise III 

### Adding Rows

The research team has discovered two trees that were left off the tree_census. Your task is to add 
rows containing the data for these new trees to the end of the tree_census array. 

In [78]:
# Trees to be added
new_trees = np.array([[1211, 227386, 20, 0], [1212, 227386, 8, 0]])

# Print the shapes of tree_census and new_trees
print(tree_census.shape, new_trees.shape)

# Add rows to tree census which contains data for the new trees
updated_tree_census = np.concatenate((tree_census, new_trees), axis=0)
# print the updated tree census
print(updated_tree_census)

(1000, 4) (2, 4)
[[     3 501451     24      0]
 [     4 501451     20      0]
 [     7 501911      3      0]
 ...
 [  1210 227386      6      0]
 [  1211 227386     20      0]
 [  1212 227386      8      0]]


### Adding Columns

You finished the last set of exercises by creating an array called ```trunk_stump_diameters```, which combined data from the trunk diameter and stump diameter columns into a 1D array. Now, you'll add that 1D array as a column to the ```tree_census``` array.

In [82]:
# Print the shapes of tree census and trunk stump diameters
print(tree_census.shape, trunk_stump_diameters.shape)
# Reshape trunk stump diameters
reshaped_diameters = trunk_stump_diameters.reshape((1000, 1))
# Concatenate reshaped_diameters to tree_census as the last column
concatenated_tree_census = np.concatenate((tree_census, reshaped_diameters), axis=1)    
# print the result
print(concatenated_tree_census)

(1000, 4) (1000,)
[[     3 501451     24      0     24]
 [     4 501451     20      0     20]
 [     7 501911      3      0      3]
 ...
 [  1198 227387     11      0     11]
 [  1199 227387     11      0     11]
 [  1210 227386      6      0      6]]


### Deleting with np.delete()

What if your tree research focuses only on living trees on publicly-owned city blocks? It might be 
helpful to delete some unneeded data like the stump diameter column and some trees located on 
private blocks.


In [84]:
# Delete the stump diameter column from tree_census
tree_census_no_stumps = np.delete(tree_census, 3, axis=1)
# Save the indices of the trees on block 313879
private_block_indices = np.where(tree_census[:, 1] == 313879)
# Delete the rows for trees on block 313879 from tree_census_no_stumps
tree_census_clean = np.delete(tree_census_no_stumps, private_block_indices, axis=0)
# Print the shape of the tree_census_clean
print(tree_census_clean.shape)

(998, 3)
