# Pandas and NumPy Fundamentals

## Boolean Indexing with NumPy

In this mission we learned:

- How to use numpy.genfromtxt() to read in an ndarray.
- About NaN values.
- What a boolean array is, and how to create one.
- How to use boolean indexing to filter values in one- and two-dimensional ndarrays.
- How to assign one or more new values to an ndarray based on their locations.
- How to assign one or more new values to an ndarray based on their values.

This is the last mission that deals exclusively with NumPy, but it's certainly not the last time we'll use NumPy. As we move on to other Python data libraries, you'll find yourself using many of these fundamental NumPy concepts. We'll also use NumPy from time to time to create, transform, and otherwise work with tabular data.

In the next mission, we'll start using the pandas library and learn how it compares to NumPy.

### Reading CSV files with NumPy

1. Import the NumPy library and assign it to the alias np.
2. Use the numpy.genfromtxt() function to read the nyc_taxis.csv file into NumPy. Assign the result to taxi.
3. Use the ndarray.shape attribute to assign the shape of taxi to taxi_shape.
4. Use the variable inspector under the code box to view the taxi ndarray and its shape after running your code.

In [1]:
import numpy as np
taxi = np.genfromtxt("../nyc_taxis.csv", delimiter=",")
taxi_shape = taxi.shape

### Reading CSV files with NumPy Continued

1. Use the numpy.genfromtxt() function to again read the nyc_taxis.csv file into NumPy, but this time, skip the first row. Assign the result to taxi.
2. Assign the shape of taxi to taxi_shape.
3. Use the variable inspector under the code box to view the taxi ndarray and its shape after you have run your code.

In [2]:
taxi = np.genfromtxt("../nyc_taxis.csv", delimiter=",", skip_header=1)
taxi_shape = taxi.shape
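
To see why skipping the first row matters, here is a minimal sketch that parses a tiny in-memory CSV instead of nyc_taxis.csv. The column names and values are made up for illustration; the point is only how numpy.genfromtxt() treats a header row when it parses everything as floats.

```python
import io
import numpy as np

# Hypothetical three-column CSV standing in for nyc_taxis.csv.
csv_text = "pickup_year,pickup_month,fare_amount\n2016,1,52.0\n2016,2,45.5\n"

# Without skip_header, the header strings cannot be converted to floats,
# so the whole first row is filled with NaN.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(with_header.shape)                 # (3, 3)
print(np.isnan(with_header[0]).all())    # True

# skip_header=1 drops that row before parsing.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(without_header.shape)              # (2, 3)
```

The difference of one row between the two shapes is the header line.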

### Boolean Arrays

1. Use vectorized boolean operations to:
    - Evaluate whether the elements in array a are less than 3. Assign the result to a_bool.
    - Evaluate whether the elements in array b are equal to "blue". Assign the result to b_bool.
    - Evaluate whether the elements in array c are greater than 100. Assign the result to c_bool.
2. Once you've run your code, use the variable inspector below the code box to view each boolean array.

In [3]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100
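
For reference, the short sketch below reuses the same three arrays and prints what each comparison produces. Each result is an ndarray of dtype bool with the same shape as the array that was compared.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

# Each comparison is applied element by element.
a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100

print(a_bool.dtype)      # bool
print(a_bool.tolist())   # [True, True, False, False, False]
print(b_bool.tolist())   # [True, True, False, True]
print(c_bool.tolist())   # [False, True, False, True]
```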

### Boolean Indexing with 1D ndarrays

1. Calculate the number of rides in the taxi ndarray that are from February:
     - Create a boolean array, february_bool, that evaluates whether the items in pickup_month are equal to 2.
     - Use the february_bool boolean array to index pickup_month. Assign the result to february.
     - Use the ndarray.shape attribute to find the number of items in february. Assign the result to february_rides.
2. Once you have run your code, use the variable inspector to view the number of rides for February.

In [4]:
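# select the pickup_month column (column index 1)
pickup_month = taxi[:,1]

# count the rides from January (pickup_month equal to 1)
january_bool = pickup_month == 1
january = pickup_month[january_bool]
january_rides = january.shape[0]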

In [7]:
pickup_month = taxi[:,1]

february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]
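
The same pattern can be tried on a small array where the answer is easy to check by eye. This is a sketch with made-up month values, not data from nyc_taxis.csv. It also shows that summing the boolean array gives the same count, since True is treated as 1.

```python
import numpy as np

# Made-up month values for illustration.
pickup_month = np.array([1, 2, 2, 3, 2, 1])

february_bool = pickup_month == 2
# Indexing with the boolean array keeps only the positions that are True.
february = pickup_month[february_bool]
february_rides = february.shape[0]

print(february.tolist())          # [2, 2, 2]
print(february_rides)             # 3
print(int(february_bool.sum()))   # 3
```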

### Boolean Indexing with 2D ndarrays

1. Create a boolean array, tip_bool, that determines which rows have values for the tip_amount column of more than 50.
2. Use the tip_bool array to select all rows from taxi with tip amounts of more than 50, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to top_tips.

In [8]:
tip_bool = taxi[:, 12] > 50
top_tips = taxi[tip_bool, 5:14]
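
A small sketch of the same row-and-column selection follows, using a hypothetical 4x4 array in which column index 2 plays the role of tip_amount. Note that a slice's end index is exclusive, which is why columns 5 to 13 inclusive are written as 5:14 above.

```python
import numpy as np

# Hypothetical data; column index 2 stands in for tip_amount.
data = np.array([
    [1.0, 10.0,  5.0,  15.0],
    [2.0, 20.0, 60.0,  80.0],
    [3.0, 30.0,  0.0,  30.0],
    [4.0, 40.0, 75.0, 115.0],
])

# One boolean per row: True where the tip is more than 50.
tip_bool = data[:, 2] > 50

# The boolean array picks the rows; the slice 1:4 picks columns 1 to 3 inclusive.
top_tips = data[tip_bool, 1:4]

print(top_tips.shape)     # (2, 3)
print(top_tips.tolist())  # [[20.0, 60.0, 80.0], [40.0, 75.0, 115.0]]
```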

### Assigning Values in ndarrays

To help you practice without making changes to our original array, we have used the ndarray.copy() method to make taxi_modified, a copy of our original for these exercises.

1. The value at column index 5 (pickup_location) of row index 28214 is incorrect. Use assignment to change this value to 1 in the taxi_modified ndarray.
2. The first column (index 0) contains year values as four-digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the taxi_modified ndarray.
3. The values at column index 7 (trip_distance) of row indexes 1800 and 1801 are incorrect. Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.

In [9]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()
taxi_modified[28214, 5] = 1
taxi_modified[:, 0] = 16
taxi_modified[1800:1802, 7] = taxi_modified[:, 7].mean()
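
The three kinds of location-based assignment can be sketched on a small made-up array (the values below are illustrative only). The sketch also checks that working on a copy leaves the original untouched.

```python
import numpy as np

# Made-up data: year, month, trip distance.
original = np.array([
    [2016.0, 1.0,   3.0],
    [2016.0, 1.0, 900.0],
    [2016.0, 2.0, 850.0],
    [2016.0, 2.0,   5.0],
])
modified = original.copy()

modified[0, 1] = 6        # a single value: row 0, column 1
modified[:, 0] = 16       # a whole column
# A slice of rows in one column. The right-hand side is evaluated first,
# so this mean still includes the two values being replaced.
modified[1:3, 2] = modified[:, 2].mean()

print(modified[1:3, 2].tolist())   # [439.5, 439.5]
print(original[0, 0])              # 2016.0
```

Whether the replacement mean should include the incorrect values is a judgment call; the sketch simply follows the same order of operations as the cell above.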

### Assignment Using Boolean Arrays

We again used the ndarray.copy() method to make taxi_copy, a copy of our original for this exercise.

1. Select the fourteenth column (index 13) in taxi_copy. Assign it to a variable named total_amount.
2. For rows where the value of total_amount is less than 0, use assignment to change the value to 0.

In [10]:
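# this creates a copy of our taxi ndarray
taxi_copy = taxi.copy()
# total_amount is a view of column index 13, so assigning through it updates taxi_copy
total_amount = taxi_copy[:, 13]
# set only the negative totals to 0, not the entire rows
total_amount[total_amount < 0] = 0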
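
When assigning with a boolean array on a 2D ndarray, it matters whether a column is also specified. The sketch below, on a made-up 3x2 array, contrasts three forms: a row mask alone, a row mask plus a column index, and assignment through a column view.

```python
import numpy as np

# Made-up data; column index 1 stands in for total_amount.
base = np.array([
    [1.0, 12.5],
    [2.0, -3.0],
    [3.0,  8.0],
])

# A row mask on its own selects whole rows, so every column is overwritten.
rows_zeroed = base.copy()
rows_zeroed[rows_zeroed[:, 1] < 0] = 0
print(rows_zeroed[1].tolist())   # [0.0, 0.0]

# Adding a column index restricts the assignment to that column.
cell_zeroed = base.copy()
cell_zeroed[cell_zeroed[:, 1] < 0, 1] = 0
print(cell_zeroed[1].tolist())   # [2.0, 0.0]

# A column slice is a view, so assigning through it changes the parent array.
via_view = base.copy()
amount = via_view[:, 1]
amount[amount < 0] = 0
print(via_view[1].tolist())      # [2.0, 0.0]
```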

### Assignment Using Boolean Arrays Continued

We have created a new copy of our taxi dataset, taxi_modified, with an additional column containing the value 0 for every row.

1. In our new column at index 15, assign the value 1 if the pickup_location_code (column index 5) corresponds to an airport location, leaving the value as 0 otherwise, by performing these three operations:
    - For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
    - For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
    - For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [11]:
# create a new column filled with `0`.
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
taxi_modified[taxi_modified[:, 5] == 2, 15] = 1
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1

print(taxi_modified)

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 1.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 1.000e+00]]
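
As an aside, the three separate assignments can also be expressed as a single membership test with numpy.isin(). This is an alternative sketch rather than the approach the exercise asks for, and the location codes in the array below are made up; only the airport codes 2, 3, and 5 come from the exercise.

```python
import numpy as np

# Made-up pickup location codes; 2, 3 and 5 are the airport codes.
pickup_location = np.array([2, 0, 3, 4, 5, 1, 2])

# Step-by-step version: one boolean assignment per airport code.
flag_steps = np.zeros(pickup_location.shape[0])
for code in (2, 3, 5):
    flag_steps[pickup_location == code] = 1

# Single-step version: True wherever the code is one of the airport codes.
flag_isin = np.isin(pickup_location, [2, 3, 5]).astype(float)

print(flag_isin.tolist())                     # [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(np.array_equal(flag_steps, flag_isin))  # True
```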


### Challenge: Which is the most popular airport?
1. Using the original taxi ndarray, calculate how many trips had JFK Airport as their destination:
    - Use boolean indexing to select only the rows where the dropoff_location_code column (column index 6) has a value that corresponds to JFK. Assign the result to jfk.
    - Calculate how many rows are in the new jfk array and assign the result to jfk_count.
2. Calculate how many trips from taxi had LaGuardia Airport as their destination:
    - Use boolean indexing to select only the rows where the dropoff_location_code column (column index 6) has a value that corresponds to LaGuardia. Assign the result to laguardia.
    - Calculate how many rows are in the new laguardia array. Assign the result to laguardia_count.
3. Calculate how many trips from taxi had Newark Airport as their destination:
    - Select only the rows where the dropoff_location_code column has a value that corresponds to Newark, and assign the result to newark.
    - Calculate how many rows are in the new newark array and assign the result to newark_count.
4. After you have run your code, inspect the values for jfk_count, laguardia_count, and newark_count and see which airport has the most dropoffs.

In [12]:
jfk = taxi[taxi[:, 6] == 2]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:, 6] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:, 6] == 5]
newark_count = newark.shape[0]
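
To compare the three counts programmatically rather than by inspection, one option is to loop over the airport codes. The sketch below uses made-up drop-off codes, and the names airport_codes, counts, and busiest are illustrative rather than part of the exercise.

```python
import numpy as np

# Airport codes as given in the exercise; the drop-off values are made up.
airport_codes = {"JFK": 2, "LaGuardia": 3, "Newark": 5}
dropoff_location = np.array([2, 3, 3, 0, 5, 3, 2, 4])

# Count the matching rows for each airport.
counts = {
    name: int(np.count_nonzero(dropoff_location == code))
    for name, code in airport_codes.items()
}

# The key with the largest count is the most popular destination.
busiest = max(counts, key=counts.get)

print(counts)    # {'JFK': 2, 'LaGuardia': 3, 'Newark': 1}
print(busiest)   # LaGuardia
```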

### Challenge: Calculating Statistics for Trips on Clean Data

The trip_mph ndarray has been provided for you.

1. Create a new ndarray, cleaned_taxi, containing only rows for which the values of trip_mph are less than 100.
2. Calculate the mean of the trip_distance column of cleaned_taxi. Assign the result to mean_distance.
3. Calculate the mean of the trip_length column of cleaned_taxi. Assign the result to mean_length.
4. Calculate the mean of the total_amount column of cleaned_taxi. Assign the result to mean_total_amount.

In [13]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:, 7].mean()
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()
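
One detail worth knowing about the speed filter: a trip with a length of 0 produces inf or NaN when dividing, and both compare as False against `< 100`, so such rows are dropped as well. The sketch below illustrates this with made-up values, assuming (as the division by 3600 suggests) that trip_length is in seconds and trip_distance is in miles.

```python
import numpy as np

# Made-up values; units are an assumption (miles and seconds).
trip_distance = np.array([10.0, 2.0, 0.0, 5.0])
trip_length = np.array([1800.0, 0.0, 0.0, 60.0])

# Suppress the divide-by-zero warnings so the example runs quietly.
with np.errstate(divide="ignore", invalid="ignore"):
    trip_mph = trip_distance / (trip_length / 3600)
    keep = trip_mph < 100

# Speeds are 20 mph, inf (2 / 0), NaN (0 / 0) and 300 mph respectively.
print(keep.tolist())   # [True, False, False, False]
```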