## **BOOLEAN INDEXING WITH NUMPY**

- We have learnt how to use NumPy for vectorized math operations. To extract more statistical information from a dataset, we need a more advanced technique, which leads us to **Boolean Indexing**.


- Let us take the first step of learning how to **import csv files with NumPy**. The `numpy.genfromtxt()` function is used for this. In this lesson we pass it two arguments:

    `filename` - This is a string representing the name of the file to be read.
    
    `delimiter` - A named argument which specifies the string used to separate each element in the dataset.
    
    
- Now, let us read our taxi dataset using the NumPy way!

In [1]:
import numpy as np
taxi_numpy_list = np.genfromtxt('nyc_taxis.csv', delimiter = ',')
taxi_numpy_shape = taxi_numpy_list.shape
print (taxi_numpy_list[:20])
print (taxi_numpy_shape)

[[       nan        nan        nan        nan        nan        nan
         nan        nan        nan        nan        nan        nan
         nan        nan        nan]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  4.0000e+00 2.1000e+01 2.0370e+03 5.2000e+01 8.0000e-01 5.5400e+00
  1.1650e+01 6.9990e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  1.0000e+00 1.6290e+01 1.5200e+03 4.5000e+01 1.3000e+00 0.0000e+00
  8.0000e+00 5.4300e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 1.2700e+01 1.4620e+03 3.6500e+01 1.3000e+00 0.0000e+00
  0.0000e+00 3.7800e+01 2.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 8.7000e+00 1.2100e+03 2.6000e+01 1.3000e+00 0.0000e+00
  5.4600e+00 3.2760e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 5.5600e+00 7.5900e+02 1.7500e+01 1.3000e+00 0.0000e+00
  0.

- When reading a csv file into a list of lists, we always have to read the file and then convert it to a list before it is ready for use. The good thing about the NumPy way is that this long process is not needed. By default, the `numpy.genfromtxt()` function reads the file and converts every value it finds to a float.


- We can also use the `ndarray.dtype` attribute to see the internal data type that has been used. Here it is `float64`, which allows the numeric values in our dataset to be read without an error.


- After reading the file in the above block, we can see that our dataset includes numbers as well as NaN values. Note that **NaN means Not a Number**: it marks a value that cannot be stored as a number. It can be thought of as a null value.


- NaN often indicates a null or missing value, but in the dataset for this lesson **the nan values in the first row represent the column titles**, which are strings that NumPy cannot convert into floats.
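
- As a minimal sketch of this behaviour, the snippet below reads a tiny in-memory CSV instead of `nyc_taxis.csv`. The column names and values are made up for illustration; only `np.genfromtxt`, `ndarray.dtype` and `np.isnan` are real NumPy names.

```python
import io
import numpy as np

# Hypothetical three-line CSV standing in for the real file (made-up values).
csv_text = "pickup_year,pickup_month,fare_amount\n2016,1,52.0\n2016,2,45.0\n"

# genfromtxt accepts a file-like object as well as a filename.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(sample.dtype)         # float64
print(np.isnan(sample[0]))  # the header row could not be converted: [ True  True  True]
print(sample[1:])           # the numeric rows are read normally
```

- `np.isnan()` is used here because NaN never compares equal to anything, including itself, so `sample[0] == np.nan` would not find it.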


- When importing a list of lists, the header is removed with slicing: `taxi[1:]`. Good news! NumPy makes it easier to remove headers. The `numpy.genfromtxt()` function accepts an **additional argument, `skip_header`**, which helps us to skip the headers of the dataset. The parameter takes an integer which states the number of rows to skip from the start of the file. If `skip_header=1`, it means to **skip the first row of the dataset.** Note that the value is a count of rows and **not an index**, which is why skipping the first row uses 1 and not 0 (as an index would).
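
- A small sketch of the effect of `skip_header`, again on a made-up in-memory CSV rather than the taxi file:

```python
import io
import numpy as np

# Hypothetical CSV text: one header row followed by two data rows.
csv_text = "year,month,fare\n2016,1,52.0\n2016,2,45.0\n"

with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header.shape)     # (3, 3): the header is kept as a row of nan
print(without_header.shape)  # (2, 3): one row fewer, no nan row
```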


- Let us apply this to our taxi dataset.

In [2]:
import numpy as np
taxi_numpy_list = np.genfromtxt('nyc_taxis.csv', delimiter = ',', skip_header= 1)
taxi_numpy_shape = taxi_numpy_list.shape
print (taxi_numpy_list[:20])
print (taxi_numpy_shape)
# check the shape to see whether it has changed.

[[2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  4.0000e+00 2.1000e+01 2.0370e+03 5.2000e+01 8.0000e-01 5.5400e+00
  1.1650e+01 6.9990e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  1.0000e+00 1.6290e+01 1.5200e+03 4.5000e+01 1.3000e+00 0.0000e+00
  8.0000e+00 5.4300e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 1.2700e+01 1.4620e+03 3.6500e+01 1.3000e+00 0.0000e+00
  0.0000e+00 3.7800e+01 2.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 8.7000e+00 1.2100e+03 2.6000e+01 1.3000e+00 0.0000e+00
  5.4600e+00 3.2760e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 5.5600e+00 7.5900e+02 1.7500e+01 1.3000e+00 0.0000e+00
  0.0000e+00 1.8800e+01 2.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 4.0000e+00
  2.0000e+00 2.1450e+01 2.0040e+03 5.2000e+01 8.0000e-01 0.0000e+00
  5.

## **BOOLEAN ARRAYS**

- As stated in the topic of this lesson, we are now shifting focus to one of the most powerful NumPy tools - **Boolean Arrays**. A Boolean array is an array of boolean values; such arrays are also called **Boolean vectors** or **Boolean masks**. A boolean in Python has two values - `True` or `False`.


- Boolean values commonly come from comparison operators i.e. `== < > !=` where `==` means *equal to* and `!=` means *not equal to*. When we explored vector math [here](https://github.com/Tess-hacker/THE-ULTIMATE-GUIDE-TO-UNDERSTANDING-NumPy-ARRAYS/blob/master/INTRODUCTION%20TO%20NUMPY.ipynb), we learnt that an operation between a numpy array and a single value results in a new numpy array. A comparison between a numpy array and a single value also produces a new array, but one that holds boolean values instead of numbers. Let's experiment with this:

In [3]:
print ('The result of an addition operation between a single value and a numpy array is:')
print(np.array([2,4,6,8]) + 10)
print ('\n')
print ('The result of a boolean operation between a single value and a numpy array is:')
print(np.array([2,4,6,8]) < 5)

The result of an addition operation between a single value and a numpy array is:
[12 14 16 18]


The result of a boolean operation between a single value and a numpy array is:
[ True  True False False]


In [4]:
a = np.array([1, 2, 3, 4, 5])
a_booleanresults = a > 3
print ('The result of a boolean operation for numbers greater than 3 is:')
print (a_booleanresults)
print ('\n')
b = np.array(["blue", "blue", "red", "blue"])
print ('The result of a boolean operation for colors = blue is:')
print (b == "blue")
print ('\n')
c = np.array([80.0, 103.4, 96.9, 200.3])
print ('The result of a boolean operation for numbers greater than 100 is:')
print (c>100)

The result of a boolean operation for numbers greater than 3 is:
[False False False  True  True]


The result of a boolean operation for colors = blue is:
[ True  True False  True]


The result of a boolean operation for numbers greater than 100 is:
[False  True False  True]
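
- Because `True` counts as 1 and `False` as 0, a boolean array can also be summed to count how many values passed the comparison. A short sketch reusing the array `c` from the cell above:

```python
import numpy as np

c = np.array([80.0, 103.4, 96.9, 200.3])
mask = c > 100

print(mask.dtype)              # bool
print(mask.sum())              # 2 values are greater than 100
print(np.count_nonzero(mask))  # same count, using a dedicated function
```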


## **BOOLEAN INDEXING**

- Now, we should learn the art of boolean indexing, which means selecting values from an array using a boolean array as the index. For instance, in the code above, if we wanted to collect the values whose result is True into a new array, the following approach can be used:

In [5]:
a_booleanindex = a[a_booleanresults]
print (a_booleanindex)

[4 5]


- From the above example, you can now see that the boolean array acts as a filter on the dataset: the values that line up with False are removed and the values that line up with True are retained.
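
- The filter can be sketched in a few variations. The names `kept`, `dropped` and `inline` are only illustrative; the `~` operator inverts a boolean array.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
mask = a > 3             # one boolean per element of a

kept = a[mask]           # values where the mask is True
dropped = a[~mask]       # ~ flips the mask, so this keeps the other values
inline = a[a > 3]        # the mask can also be written directly inside the brackets

print(kept)     # [4 5]
print(dropped)  # [1 2 3]
print(inline)   # [4 5]
```

- The mask must have the same length as the axis it filters; otherwise NumPy raises an `IndexError`.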


- Let us confirm the number of New York taxi rides for the month of January in our dataset and filter out the results using boolean indexing.

In [6]:
taxi_pickups = taxi_numpy_list[: , 1]
January_pickups = taxi_pickups == 1 # where 1 represents the month of January
January_booleanindex = taxi_pickups[January_pickups]
January_shape = January_booleanindex.shape[0]
end_string = "rides"
print (January_booleanindex)
print ('\n')
print ("The total pickup rides in January are:")
print (January_shape, end_string, sep = " ")

[1. 1. 1. ... 1. 1. 1.]


The total pickup rides in January are:
13481 rides


In [7]:
taxi_pickups = taxi_numpy_list[: , 1]
February_pickups = taxi_pickups == 2 # where 2 represents the month of February
February_booleanindex = taxi_pickups[February_pickups]
February_shape = February_booleanindex.shape[0]
end_string = "rides"
print (February_booleanindex)
print ('\n')
print ("The total pickup rides in February are:")
print (February_shape, end_string, sep = " ")

[2. 2. 2. ... 2. 2. 2.]


The total pickup rides in February are:
13333 rides


- Remember, you can repeat the calculation for the rest of the months to find out which had the highest number of pickup rides.

In [8]:
taxi_pickups = taxi_numpy_list[: , 1]
March_pickups = taxi_pickups == 3 
April_pickups = taxi_pickups == 4 
May_pickups = taxi_pickups == 5 
June_pickups = taxi_pickups == 6 
July_pickups = taxi_pickups == 7
August_pickups = taxi_pickups == 8 
September_pickups = taxi_pickups == 9 
October_pickups = taxi_pickups == 10 
November_pickups = taxi_pickups == 11
December_pickups = taxi_pickups == 12
March_booleanindex = taxi_pickups[March_pickups]
April_booleanindex = taxi_pickups[April_pickups]
May_booleanindex = taxi_pickups[May_pickups]
June_booleanindex = taxi_pickups[June_pickups]
July_booleanindex = taxi_pickups[July_pickups]
August_booleanindex = taxi_pickups[August_pickups]
September_booleanindex = taxi_pickups[September_pickups]
October_booleanindex = taxi_pickups[October_pickups]
November_booleanindex = taxi_pickups[November_pickups]
December_booleanindex = taxi_pickups[December_pickups]
March_shape = March_booleanindex.shape[0]
April_shape = April_booleanindex.shape[0]
May_shape = May_booleanindex.shape[0]
June_shape = June_booleanindex.shape[0]
July_shape = July_booleanindex.shape[0]
August_shape = August_booleanindex.shape[0]
September_shape = September_booleanindex.shape[0]
October_shape = October_booleanindex.shape[0]
November_shape = November_booleanindex.shape[0]
December_shape = December_booleanindex.shape[0]
end_string = "rides"
print ("The total pickup rides in March are:")
print (March_shape, end_string, sep = " ")
print ("The total pickup rides in April are:")
print (April_shape, end_string, sep = " ")
print ("The total pickup rides in May are:")
print (May_shape, end_string, sep = " ")
print ("The total pickup rides in June are:")
print (June_shape, end_string, sep = " ")
print ("The total pickup rides in July are:")
print (July_shape, end_string, sep = " ")
print ("The total pickup rides in August are:")
print (August_shape, end_string, sep = " ")
print ("The total pickup rides in September are:")
print (September_shape, end_string, sep = " ")
print ("The total pickup rides in October are:")
print (October_shape, end_string, sep = " ")
print ("The total pickup rides in November are:")
print (November_shape, end_string, sep = " ")
print ("The total pickup rides in December are:")
print (December_shape, end_string, sep = " ")

The total pickup rides in March are:
15547 rides
The total pickup rides in April are:
14810 rides
The total pickup rides in May are:
16650 rides
The total pickup rides in June are:
15739 rides
The total pickup rides in July are:
0 rides
The total pickup rides in August are:
0 rides
The total pickup rides in September are:
0 rides
The total pickup rides in October are:
0 rides
The total pickup rides in November are:
0 rides
The total pickup rides in December are:
0 rides
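
- The same counts can be produced with a loop instead of one set of variables per month. This is only a sketch on a made-up month column; in the lesson the column would be `taxi_numpy_list[:, 1]`, and `rides_per_month` is a hypothetical name.

```python
import numpy as np

# Made-up month column standing in for taxi_numpy_list[:, 1].
months = np.array([1., 1., 2., 3., 3., 3., 6.])

rides_per_month = {}
for month in range(1, 13):
    # Summing the boolean mask counts the rides in that month.
    rides_per_month[month] = int((months == month).sum())

print(rides_per_month[3])  # 3
print(rides_per_month[7])  # 0, no rides recorded for July in this sample
```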


In [9]:
shapes = (January_shape, February_shape, March_shape, April_shape, May_shape, June_shape)
maximum_ride = max(shapes)
start_string = "The month with the maximum number of rides is May with"
last_string = "rides"
print(start_string, maximum_ride, last_string, sep = " ")

The month with the maximum number of rides is May with 16650 rides
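
- In the cell above the month name is typed by hand. As a sketch, `ndarray.argmax()` can look the name up from the counts printed earlier, so the message stays correct if the data changes. `month_names` and `ride_counts` are illustrative names.

```python
import numpy as np

month_names = ["January", "February", "March", "April", "May", "June"]
# Ride counts printed by the cells above, in the same order as month_names.
ride_counts = np.array([13481, 13333, 15547, 14810, 16650, 15739])

busiest = ride_counts.argmax()  # position of the largest count
print("The month with the maximum number of rides is",
      month_names[busiest], "with", ride_counts[busiest], "rides")
```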


- We have worked with 1-dimensional arrays in the code above. Now, it is time to work with 2-dimensional arrays.


- A boolean array contains no information about how it was created, so it is possible for us to use a boolean array created from one column to index the rest of the array.
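
- A minimal sketch of this idea on a small made-up 2-dimensional array (the column meanings in the comment are assumptions for the example only):

```python
import numpy as np

# Made-up rows: [month, trip_distance, tip_amount]
rides = np.array([[1., 21.0, 11.65],
                  [1., 16.3, 8.0],
                  [2., 12.7, 0.0],
                  [2., 8.7, 5.46]])

february = rides[:, 0] == 2   # mask built from column 0 only

print(rides[february])        # whole rows where the month is 2
print(rides[february, 1:])    # same rows, but only columns 1 and 2
```

- The mask has one entry per row, so it goes in the row position of the index; the column position can still be a slice or a single column number.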


- Using the knowledge we have gained so far, let us find:

    - The average travel speed of each trip
    
    - The trips with travel speeds greater than 20,000 mph
    
    - The maximum value of the tip amount and the trips with the highest tips
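
- Before running this on the full dataset, the speed formula can be checked by hand on a few made-up trips. The sketch assumes the distance is in miles and the trip length in seconds, which is why the length is divided by 3600.

```python
import numpy as np

# Made-up trips: [trip_distance in miles, trip_length in seconds]
trips = np.array([[21.0, 2037.0],
                  [10.0, 1800.0],
                  [23.0, 1.0]])

hours = trips[:, 1] / 3600   # seconds to hours
mph = trips[:, 0] / hours    # miles per hour for each trip

print(mph[1])                # 20.0, i.e. 10 miles in half an hour
print(trips[mph > 20000])    # only the 1-second trip is implausibly fast
```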

In [10]:
# column indexes 7 and 8 are the trip distance and the trip length (in seconds) respectively
trip_milesperhour = taxi_numpy_list[:,7] / (taxi_numpy_list[:,8]/3600)

# now, we find the trips with travel speeds greater than 20,000mph
tripmph_bool = trip_milesperhour > 20000
tripmph_greaterthan20k = taxi_numpy_list[tripmph_bool,5:9]
print (tripmph_greaterthan20k)
print ('\n')
# next, we examine the tip amount column. We do not yet know its maximum value, so we find it first and use it to choose a benchmark
tipamount_column = taxi_numpy_list[: , 12]
tip_benchmark = tipamount_column.max()
print ("The maximum value on the tip amount column is:")
print (tip_benchmark)
print ('\n')
# now that we know the maximum, we set our benchmark to 70 and select the rows with tips greater than this
tipamount_bool = tipamount_column > 70
highest_tip_amounts = taxi_numpy_list[tipamount_bool, 5:14]
print("The highest tips received for all rides are:")
print (highest_tip_amounts)

[[ 2.   2.  23.   1. ]
 [ 2.   2.  19.6  1. ]
 [ 2.   2.  16.7  2. ]
 [ 3.   3.  17.8  2. ]
 [ 2.   2.  17.2  2. ]
 [ 3.   3.  16.9  3. ]
 [ 2.   2.  27.1  4. ]]


The maximum value on the tip amount column is:
100.0


The highest tips received for all rides are:
[[4.0000e+00 2.0000e+00 1.8420e+01 2.9680e+03 5.2000e+01 8.0000e-01
  5.5400e+00 8.0000e+01 1.3834e+02]
 [4.0000e+00 2.0000e+00 3.8200e+01 9.2520e+03 5.2000e+01 8.0000e-01
  5.5400e+00 8.0000e+01 1.3834e+02]
 [2.0000e+00 0.0000e+00 2.6210e+01 1.7029e+04 1.8050e+02 8.0000e-01
  5.5400e+00 1.0000e+02 2.8684e+02]
 [2.0000e+00 2.0000e+00 0.0000e+00 3.0000e+00 2.5000e+00 1.8000e+00
  0.0000e+00 7.5700e+01 8.0000e+01]]


## **ASSIGNING VALUES IN NDARRAYS**

- Apart from selecting values with indexing, we can also replace and assign values in numpy arrays. The indexing techniques which we learnt earlier are what make this possible. Values can be assigned individually or in groups, in both 1-dimensional and 2-dimensional arrays, including whole rows and whole columns.

- Let us practice this below:


In [11]:
# assigning single values
a = np.array(['red','blue','black','blue','purple'])
a[0] = 'orange'
print(a)
print ('\n')
# assigning multiple values
a[3:] = 'pink'
print(a)
print('\n')
# assigning values to a specific index on a ndarray
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1,2] = 99
print(ones)
print ('\n')
# assigning a whole row
ones[0] = 42
print(ones)
print ('\n')
# assigning a whole column
ones[:,2] = 0
print(ones)

['orange' 'blue' 'black' 'blue' 'purple']


['orange' 'blue' 'black' 'pink' 'pink']


[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]


[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]


[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]
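
- One thing to watch when assigning strings: a NumPy string array has a fixed width taken from its longest element at creation, and longer values are cut to fit. The first cell above works because `'orange'` is no longer than `'purple'`. A short sketch with made-up arrays:

```python
import numpy as np

colors = np.array(['red', 'blue'])   # longest element has 4 characters
print(colors.dtype)                  # <U4
colors[0] = 'orange'
print(colors)                        # ['oran' 'blue'], the value was truncated

# Asking for a wider string type up front avoids the truncation.
wide = np.array(['red', 'blue'], dtype='U10')
wide[0] = 'orange'
print(wide)                          # ['orange' 'blue']
```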


In [12]:
# let us now practice these techniques on our nyc dataset
# we work on a copy so that the original dataset is not modified
taxi_numpy_list_duplicate = taxi_numpy_list.copy()
# assign a single value
taxi_numpy_list_duplicate[28214,5] = 1
# assign a whole column
taxi_numpy_list_duplicate[:,0] = 16
# assign the mean of column 7 to two of its rows
new_mean = taxi_numpy_list_duplicate[:,7].mean()
taxi_numpy_list_duplicate[1800:1802,7] = new_mean
print (taxi_numpy_list_duplicate[28214, 5])
print ('\n')
print (taxi_numpy_list_duplicate[:,0])
print ('\n')
print (new_mean)
print ('\n')
print (taxi_numpy_list_duplicate[1800:1802,7])

1.0


[16. 16. 16. ... 16. 16. 16.]


12.6674260830728


[12.66742608 12.66742608]


- In the previous code, we chose the locations to change by their position (row and column numbers). Boolean arrays are much more powerful: they let us choose the locations to change by their values. Let us see an example below, after which we will apply it to the duplicate dataset.

In [13]:
a2 = np.array([1, 2, 3, 4, 5])

a2_bool = a2 > 2

a2[a2_bool] = 99

print(a2)

[ 1  2 99 99 99]


In [14]:
# let us use the shortcut to change the values of the total amount column that are less than 0 to 0
total_column = taxi_numpy_list_duplicate[:, 13]
total_bool = total_column < 0
total_column[total_bool] = 0
print (total_column)

[69.99 54.3  37.8  ... 63.34 44.75 54.84]
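
- Note that `total_column` above is a slice of the duplicate, so it is a view and the assignment also changes the duplicate itself. The sketch below shows this on a small made-up array; `np.shares_memory()` reports whether two arrays use the same underlying data.

```python
import numpy as np

# Made-up data; column 1 plays the role of the total amount column.
data = np.array([[1.0, -5.0],
                 [2.0, 7.5],
                 [3.0, -0.3]])

totals = data[:, 1]        # a basic slice gives a view, not a copy
totals[totals < 0] = 0     # so this assignment writes through to data

print(np.shares_memory(totals, data))  # True
print(data[:, 1])                      # [0.  7.5 0. ]
```

- If the original should stay untouched, take a copy first with `data[:, 1].copy()`.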


- Using boolean arrays, we can also assign values in a two-dimensional array. Here's how:

    - Given a two-dimensional array:
    
        `c = np.array([[1,2,3],[4,5,6],[7,8,9]])`
    
    - We can use one column to perform a comparison with the 2-dimensional array:
        
        `c[:, 1] > 2`
        
    - Then, we can use the boolean array as the row index and the column index to specify the column:
    
        `c[c[:,1]>2 , 1] = 99`
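
- Run end to end, the three steps above look like this (`row_mask` is just an illustrative name for the intermediate boolean array):

```python
import numpy as np

c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row_mask = c[:, 1] > 2   # compares column 1 only: [False  True  True]
c[row_mask, 1] = 99      # rows picked by the mask, column 1 only

print(c)
# [[ 1  2  3]
#  [ 4 99  6]
#  [ 7 99  9]]
```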
        

- Let us put this into practice using the duplicate dataset:

In [15]:
# rebuild the duplicate from the original dataset with an additional column of zeros, which will have column index 15
new_column = np.zeros([taxi_numpy_list.shape[0],1])
taxi_numpy_list_duplicate = np.concatenate([taxi_numpy_list, new_column], axis =1)
print (taxi_numpy_list_duplicate)
# now we compare the values of an existing column and assign values to the new column accordingly. Let us use column index 5, which is the airport location
# for every JFK airport location with 2 as its value, we want to assign the value 1 to the new column
taxi_numpy_list_duplicate[taxi_numpy_list_duplicate[:,5]== 2, 15] = 1
# for every LaGuardia airport location with 3 as its value, we want to assign the value 1 to the new column
taxi_numpy_list_duplicate[taxi_numpy_list_duplicate[:,5]== 3, 15] = 1
# for every Newark airport location with 5 as its value, we want to assign the value 1 to the new column
taxi_numpy_list_duplicate[taxi_numpy_list_duplicate[:,5]== 5, 15] = 1
print (taxi_numpy_list_duplicate[:,15])

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]
[1. 1. 1. ... 1. 1. 1.]
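
- The three assignments above can be combined into one with `np.isin()`, which tests each element against a list of values. This is an alternative sketch on made-up location codes, not the approach used in the cell above; `is_airport` is a hypothetical name for the new column.

```python
import numpy as np

# Made-up location codes standing in for column index 5.
location_codes = np.array([2., 0., 3., 4., 5., 6.])

is_airport = np.zeros(location_codes.shape[0])
# True wherever the code is 2 (JFK), 3 (LaGuardia) or 5 (Newark).
airport_mask = np.isin(location_codes, [2, 3, 5])
is_airport[airport_mask] = 1

print(is_airport)  # [1. 0. 1. 0. 1. 0.]
```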


## **CHALLENGE 1: FINDING THE MOST POPULAR AIRPORT**

- To complete this lesson, we want to work through some challenges. First, we want to find out which airport is the most popular among the three airports. For the exercise, we will use column index 6, which is the `dropoff_location_code`. The three airports which we have and their respective values are:

    - JFK : 2
    
    - LaGuardia : 3
    
    - Newark: 5
    

- We will use Boolean indexing to filter out the rows with the above values for the airports, and then use the `shape` attribute of each result to find the total number of rows containing those values. This will help us to decide which airport has the highest number of visits, making it the most popular destination.


- Kindly note that we are using our original dataset instead of the duplicate.

In [21]:
jfk_airport = taxi_numpy_list[taxi_numpy_list[:,6] == 2]
# print (jfk_airport)
jfk_count = jfk_airport.shape[0]
print ('The total visits to the JFK airport is:')
print (jfk_count)
print ('\n')
laguardia_airport = taxi_numpy_list[taxi_numpy_list[:,6] == 3]
# print (laguardia_airport)
laguardia_count = laguardia_airport.shape[0]
print ('The total visits to the LaGuardia airport is:')
print (laguardia_count)
print ('\n')
newark_airport = taxi_numpy_list[taxi_numpy_list[:,6] == 5]
# print (newark_airport)
newark_count = newark_airport.shape[0]
print ('The total visits to the Newark airport is:')
print (newark_count)

The total visits to the JFK airport is:
11832


The total visits to the LaGuardia airport is:
16602


The total visits to the Newark airport is:
63


- From the above analysis, it can be concluded that LaGuardia has the highest visits thus making it the most popular airport.
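
- The comparison of the three counts can also be left to Python. The sketch below uses a made-up dropoff column (in the lesson it would be `taxi_numpy_list[:, 6]`) and a dictionary of the airport codes listed above; `counts` and `most_popular` are illustrative names.

```python
import numpy as np

airports = {"JFK": 2, "LaGuardia": 3, "Newark": 5}
# Made-up dropoff codes standing in for taxi_numpy_list[:, 6].
dropoff_codes = np.array([2., 3., 3., 0., 5., 3., 2.])

# Count the rows matching each airport code.
counts = {name: int((dropoff_codes == code).sum())
          for name, code in airports.items()}
most_popular = max(counts, key=counts.get)

print(counts)        # {'JFK': 2, 'LaGuardia': 3, 'Newark': 1}
print(most_popular)  # LaGuardia
```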


## **CHALLENGE 2: CLEANING THE DATASET**

- Our second and last challenge is to clean up our dataset. In other words, we want to remove the bad data from the dataset to give us **clean data**. 


- We will begin by removing the rows with an average speed greater than 100mph using Boolean indexing. Afterwards, we will use the `ndarray.mean()` method to calculate the mean of certain columns in the cleaned data, which are:

    - `trip_distance` at column index 7
    
    - `trip_length` at column index 8
    
    - `total_amount` at column index 13

In [24]:
# in this calculation, we make use of trip_milesperhour, which was calculated in an earlier cell
# keep only the rows whose average speed is below 100mph; this is our cleaned data
average_speed = taxi_numpy_list[trip_milesperhour < 100]
# now we calculate the means of the cleaned columns
trip_distance = average_speed[:,7]
mean_tripdistance = trip_distance.mean()
trip_length = average_speed[:,8]
mean_triplength = trip_length.mean()
total_amount = average_speed[:,13]
mean_totalamount = total_amount.mean()
print (mean_tripdistance)
print ('\n')
print (mean_triplength)
print ('\n')
print (mean_totalamount)

12.666396599932893


2239.503657309026


48.98131853260262
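
- A side effect worth knowing: if any trip has a length of 0 seconds, its speed is `inf` or `nan`, and both fail the `< 100` test, so such rows are removed as well. Whether the taxi dataset contains such rows is not shown here; the sketch below uses made-up values to illustrate the behaviour.

```python
import numpy as np

# Made-up trips: distances in miles and lengths in seconds.
distance = np.array([10.0, 5.0, 0.0, 23.0])
seconds = np.array([1800.0, 0.0, 0.0, 1.0])

# Silence the divide-by-zero warnings for this demonstration only.
with np.errstate(divide="ignore", invalid="ignore"):
    mph = distance / (seconds / 3600)

keep = mph < 100   # comparisons with inf or nan are False here
print(mph)
print(keep)        # [ True False False False]
```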


## **CONCLUSION**

- In this lesson, I have shown you:

    - How to use `numpy.genfromtxt()` to read a file into an ndarray.
    
    - What NaN values are.
    
    - What a boolean array is, and how to create one.
    
    - How to use boolean indexing to filter values in one and two-dimensional ndarrays.
    
    - How to assign one or more new values to an ndarray based on their locations.
    
    - How to assign one or more new values to an ndarray based on their values.


- Now, go ahead and try out some exercises using NumPy and see what you have learnt! HAPPY CODING!!