# NumPy introduction
---

[NumPy](https://numpydoc.readthedocs.io/en/latest/) is a Python library built around [vectorized operations](https://en.wikipedia.org/wiki/Automatic_vectorization). In practice this means that with NumPy (and [pandas](http://pandas.pydata.org/), which builds on it) Python code can operate on whole arrays at once instead of looping over individual values.
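
As a rough illustration of what "operating on whole arrays" means, the sketch below doubles a few made-up values, first with a plain Python loop and then with a single NumPy expression. The names `values`, `doubled_loop` and `doubled_vectorized` are placeholders used only for this example.

```python
import numpy as np

values = [1.0, 2.5, 4.0]  # made-up sample values

# plain Python: visit the items one at a time
doubled_loop = [v * 2 for v in values]

# NumPy: one expression is applied to every element of the array
doubled_vectorized = np.array(values) * 2

print(doubled_loop)
print(doubled_vectorized)
```

Both approaches produce the same numbers; the NumPy version simply avoids the explicit loop.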

Now let's import the NumPy module as np:

In [15]:
import numpy as np

And work with our dataset `nyc_taxis.csv`. Below is information about selected columns from the dataset:

* `pickup_year` - The year of the trip
* `pickup_month` - The month of the trip (January is 1, December is 12)
* `pickup_day` - The day of the month of the trip
* `pickup_location_code` - The airport or borough where the trip started, as one of eight categories:
  * `0` - Bronx
  * `1` - Brooklyn
  * `2` - JFK Airport
  * `3` - LaGuardia Airport
  * `4` - Manhattan
  * `5` - Newark Airport
  * `6` - Queens
  * `7` - Staten Island
* `dropoff_location_code` - The airport or borough where the trip finished, using the same eight category codes as `pickup_location_code`
* `trip_distance` - The distance of the trip in miles
* `trip_length` - The length of the trip in seconds
* `fare_amount` - The base fare of the trip, in dollars
* `total_amount` - The total amount charged to the passenger, including all fees, tolls and tips

In [16]:
import csv

# read nyc_taxis.csv as a list of lists; `with` closes the file for us
with open("nyc_taxis.csv", "r", newline="") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = [[float(item) for item in row] for row in taxi_list]

The last step is to convert our list of lists into a NumPy n-dimensional array, or ndarray. For now let's think of it as NumPy's version of a list of lists format. To convert from the list type to ndarray, we use the `numpy.array()` [constructor](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html).

In [17]:
taxi = np.array(converted_taxi_list)

Look at both versions of lists:

In [18]:
print(converted_taxi_list[:2])

[[2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 4.0, 21.0, 2037.0, 52.0, 0.8, 5.54, 11.65, 69.99, 1.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 1.0, 16.29, 1520.0, 45.0, 1.3, 0.0, 8.0, 54.3, 1.0]]


In [19]:
print(taxi) # three dots mean that there are more rows / columns between the displayed ones

[[2016.      1.      1.   ...   11.65   69.99    1.  ]
 [2016.      1.      1.   ...    8.     54.3     1.  ]
 [2016.      1.      1.   ...    0.     37.8     2.  ]
 ...
 [2016.      6.     30.   ...    5.     63.34    1.  ]
 [2016.      6.     30.   ...    8.95   44.75    1.  ]
 [2016.      6.     30.   ...    0.     54.84    2.  ]]


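We have a small problem here. `NumPy` picks one print format for the whole array, so when the values differ widely in magnitude the output can become hard to read (for example, it may switch to scientific notation). To make it look nicer we can use [`np.set_printoptions()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html); passing `suppress=True` tells NumPy to print floats in fixed-point notation instead of scientific notation.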

In [20]:
np.set_printoptions(suppress = True)
print(taxi)

[[2016.      1.      1.   ...   11.65   69.99    1.  ]
 [2016.      1.      1.   ...    8.     54.3     1.  ]
 [2016.      1.      1.   ...    0.     37.8     2.  ]
 ...
 [2016.      6.     30.   ...    5.     63.34    1.  ]
 [2016.      6.     30.   ...    8.95   44.75    1.  ]
 [2016.      6.     30.   ...    0.     54.84    2.  ]]
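
The taxi output above looks the same with and without `suppress=True`, so here is a small sketch with made-up values where the difference is visible. It uses the `np.printoptions()` context manager (available in NumPy 1.15 and later) so that the setting only applies inside the `with` block; `small_and_large` is a hypothetical array chosen for this example.

```python
import numpy as np

# made-up values that differ a lot in magnitude
small_and_large = np.array([1.5e-10, 1.0, 123456.789])

# default settings: NumPy falls back to scientific notation here
default_text = str(small_and_large)

# suppress=True: fixed-point notation, only inside this block
with np.printoptions(suppress=True):
    suppressed_text = str(small_and_large)

print(default_text)
print(suppressed_text)
```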


Use the [`ndarray.shape`](http://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape) attribute to see the number of rows and columns (in our case we have a two-dimensional array):

In [21]:
print(taxi.shape)

(89560, 15)
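
The `shape` attribute has one entry per dimension. As a small sketch with made-up arrays (the names `one_d`, `two_d` and `three_d` are placeholders), this prints the number of dimensions (`ndim`) next to the shape for arrays with one, two and three dimensions:

```python
import numpy as np

one_d = np.array([1, 2, 3])
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])
three_d = np.array([[[1, 2], [3, 4]],
                    [[5, 6], [7, 8]]])

# ndim is the number of dimensions, shape is the size along each of them
for arr in (one_d, two_d, three_d):
    print(arr.ndim, arr.shape)
```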


Ndarray stands for 'n-dimensional array'. In programming, array is a term that describes a collection of elements. A list object in Python could be described generically as an array. N-dimensional refers to the fact that ndarrays can have one or more dimensions. Here's an example of different kinds of arrays:

<img src="https://s3.amazonaws.com/dq-content/289/dimensional_arrays.svg" alt="1D-2D-3D-arrays" width="600"/>

Similar to using lists of lists, we use numbers to specify the location of elements of our data that we want to work with. Just like with lists, we call these numbers index values (or collectively, indices).

Unlike with Python lists, every value in an ndarray must be of the same type. For the NYC taxi dataset this does not matter, as all the values are floats.
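
To see the "same type" rule in action, the following sketch builds two arrays from mixed made-up inputs and inspects the resulting `dtype`. NumPy converts the values to a common type rather than raising an error; the variable names are placeholders for this example.

```python
import numpy as np

# integers mixed with a float: everything is stored as a float
ints_and_floats = np.array([1, 2, 3.5])
print(ints_and_floats.dtype)

# numbers mixed with text: everything is stored as text
numbers_and_text = np.array([1, 2, "three"])
print(numbers_and_text.dtype.kind)  # "U" stands for a unicode string type
print(numbers_and_text)
```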

Let's see how slicing works with ndarrays:

In [22]:
print(taxi[:3], "\n") # the first three rows
print(taxi[1], "\n") # a single row
print(taxi[1, 0], "\n") # basically the syntax is ndarray[row, column]
print(taxi[1:4, :3], "\n") # and you can use slicing here as well!
print(taxi[[1, 3], 2:4]) # or even use a list of indices to select specific rows/columns!

[[2016.      1.      1.      5.      0.      2.      4.     21.   2037.
    52.      0.8     5.54   11.65   69.99    1.  ]
 [2016.      1.      1.      5.      0.      2.      1.     16.29 1520.
    45.      1.3     0.      8.     54.3     1.  ]
 [2016.      1.      1.      5.      0.      2.      6.     12.7  1462.
    36.5     1.3     0.      0.     37.8     2.  ]] 

[2016.      1.      1.      5.      0.      2.      1.     16.29 1520.
   45.      1.3     0.      8.     54.3     1.  ] 

2016.0 

[[2016.    1.    1.]
 [2016.    1.    1.]
 [2016.    1.    1.]] 

[[1. 5.]
 [1. 5.]]


Let's practice with slicing and selecting different columns/rows/items:

In [23]:
row_0 = taxi[0]
rows_391_to_500 = taxi[391:501]
row_21_column_5 = taxi[21, 5]
columns_1_4_7 = taxi[:, [1, 4, 7]]
row_99_columns_5_to_8 = taxi[99, 5:9]
rows_100_to_200_column_14 = taxi[100:201, 14]

Operations with ndarrays are much faster than the equivalent operations with lists (roughly 30 times faster in some cases, although the exact speedup depends on the operation and the size of the data). `NumPy` can use any of the standard Python numeric operators to perform vector math:

* `vector_a + vector_b` - Addition
* `vector_a - vector_b` - Subtraction
* `vector_a * vector_b` - Multiplication (this is unrelated to the vector multiplication used in linear algebra)
* `vector_a / vector_b` - Division
* `vector_a % vector_b` - Modulus (find the remainder when vector_a is divided by vector_b)
* `vector_a ** vector_b` - Exponent (raise vector_a to the power of vector_b)
* `vector_a // vector_b` - Floor Division (divide vector_a by vector_b, rounding down to the nearest integer)
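
A minimal sketch of a few of these operators on two made-up vectors. Each operator works element by element, so the result has the same length as the inputs.

```python
import numpy as np

vector_a = np.array([10, 20, 30])
vector_b = np.array([3, 4, 5])

print(vector_a + vector_b)   # [13 24 35]
print(vector_a * vector_b)   # [ 30  80 150]
print(vector_a % vector_b)   # [1 0 0]
print(vector_a // vector_b)  # [3 5 6]
```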


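Our dataset has two columns we can combine: `trip_distance` (in miles) and `trip_length` (in seconds). From these we can calculate the average speed of each trip. Below we convert the distance to kilometers and the length to hours, and then divide one by the other to get the speed in km/h: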

In [24]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_distance_kilometers = 1.609344 * trip_distance_miles
trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
trip_kmh = trip_distance_kilometers / trip_length_hours

NumPy ndarrays have methods for many different calculations. A few key methods are:

* `ndarray.min()` to calculate the minimum value
* `ndarray.max()` to calculate the maximum value
* `ndarray.mean()` to calculate the mean average value
* `ndarray.sum()` to calculate the sum of the values

You can see a full list of ndarray methods in the [NumPy ndarray documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation).

Let's use the methods we've just learned about to calculate the smallest, largest, and mean average speed from our `trip_kmh` ndarray:

In [25]:
kmh_min = trip_kmh.min()
kmh_max = trip_kmh.max()
kmh_mean = trip_kmh.mean()
print(kmh_max, kmh_mean, kmh_min)

133253.6832 51.889412016610855 0.0


We can see that the dataset has some strange figures, such as a maximum speed of about 133,253 km/h.

Anyhow, let's look at how these methods calculate statistics for two-dimensional ndarrays. If we call a method without additional parameters, it returns a single value for the whole array, just as it does with a 1D array:

<img src="https://s3.amazonaws.com/dq-content/289/array_method_axis_none.svg" alt="ndarray.max()" width="600"/>

We can use the axis parameter, and specify a value of 1, which indicates we want to calculate values for each row. If we want to find the maximum value of each column, we use an axis value of 0:

<img src="https://s3.amazonaws.com/dq-content/289/array_method_axis_1.svg" alt="ndarray.max(axis=1)" width="600"/>
<img src="https://s3.amazonaws.com/dq-content/289/array_method_axis_0.svg" alt="ndarray.max(axis=0)" width="600"/>

We can use the `axis` parameter with the calculation methods listed above, and with many other `ndarray` methods. Here's how it works:
<img src="https://s3.amazonaws.com/dq-content/289/axis_param.svg" alt="ndarray.method(axis=0)" width="600"/>

For example, here's how we calculate mean of each column:

In [26]:
taxi_column_means = taxi.mean(axis=0)
print(taxi_column_means)

[2016.            3.61447075   15.69353506    3.84133542    3.08267084
    2.95988164    3.37924297   12.66742608 2235.98110764   38.40448403
    1.20917642    3.53830951    5.81448917   48.96666246    1.29044216]
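
The effect of `axis` is easier to check by hand on a tiny array. The sketch below uses a made-up 2 x 3 array called `grid` (a placeholder name) and compares the three ways of calling `max()`:

```python
import numpy as np

grid = np.array([[1, 9, 3],
                 [7, 2, 8]])

print(grid.max())        # whole array: 9
print(grid.max(axis=1))  # one value per row: [9 8]
print(grid.max(axis=0))  # one value per column: [7 9 8]
```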


Earlier we detected some anomalies in speed. To take a closer look at why we might be getting this value, we're going to do the following:

* Add the `trip_kmh` as a column to our `taxi` ndarray.
* Sort taxi by `trip_kmh`.
* Look at the rows with the highest `trip_kmh` from our sorted ndarray to see what they tell us about these large values.

To add a column we can use the [`numpy.concatenate()`](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.concatenate.html) function. This function accepts:

* A list of ndarrays as the first, unnamed parameter
* An integer for the axis parameter, where 0 will add rows and 1 will add columns

The `numpy.concatenate()` function requires that each array have the same shape, except in the dimension corresponding to `axis`. In other words, if we want to add column(s), the number of rows of each array must be the same, and if we want to add row(s), the number of columns of each array must be the same. Each array must also have the same number of dimensions. We can add another dimension to an array using the [`numpy.expand_dims()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html) function.

Here's a simple example of using these techniques:

In [27]:
ones = np.array([[1, 1, 1],[1, 1, 1]])
zeros = np.array([0, 0, 0])
print(ones)
print(zeros)

[[1 1 1]
 [1 1 1]]
[0 0 0]


In [28]:
combined = np.concatenate([ones,zeros],axis=0) # error because of different dimensions

ValueError: all the input arrays must have same number of dimensions

In [29]:
zeros_2d = np.expand_dims(zeros,axis=0)
print(zeros_2d)

[[0 0 0]]


In [30]:
combined = np.concatenate([ones,zeros_2d],axis=0)
print(combined)

[[1 1 1]
 [1 1 1]
 [0 0 0]]
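
The example above adds a row. As a companion sketch with made-up data (the names `table` and `new_column` are placeholders), adding a column works the same way, except that the 1D array is expanded along `axis=1` so it becomes a single column:

```python
import numpy as np

table = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])                # shape (3, 2)
new_column = np.array([10, 20, 30])       # shape (3,)

# turn the 1D array into a 3 x 1 column, then attach it on the right
new_column_2d = np.expand_dims(new_column, axis=1)
wider = np.concatenate([table, new_column_2d], axis=1)

print(new_column_2d.shape)
print(wider)
```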


Now let's do the same with our dataset:

In [31]:
trip_kmh_2d = np.expand_dims(trip_kmh, axis=1)
taxi = np.concatenate([taxi, trip_kmh_2d], axis = 1)
print(taxi)

[[2016.            1.            1.         ...   69.99
     1.           59.72823093]
 [2016.            1.            1.         ...   54.3
     1.           62.09103259]
 [2016.            1.            1.         ...   37.8
     2.           50.32777543]
 ...
 [2016.            6.           30.         ...   63.34
     1.           35.88688846]
 [2016.            6.           30.         ...   44.75
     1.           68.26115049]
 [2016.            6.           30.         ...   54.84
     2.           59.39241235]]


Now that we've added our `trip_kmh` column to our array, our next step is to sort the array. For this, we'll use the [`numpy.argsort()`](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.argsort.html#numpy.argsort) function. The `numpy.argsort()` function returns the indices which would sort an array.

We need to sort the array by the `trip_kmh` column (`taxi[:, 15]`). To do so, we get the indices that would sort that column and then use them to reorder the rows of the whole array:

In [32]:
trip_kmh_sort = np.argsort(taxi[:, 15])
taxi_sorted = taxi[trip_kmh_sort]
print(taxi_sorted)

[[  2016.           1.           3.      ...     24.84         1.
       0.     ]
 [  2016.           1.          22.      ...     63.34         1.
       0.     ]
 [  2016.           1.          14.      ...     52.8          1.
       0.     ]
 ...
 [  2016.           3.          28.      ...      4.3          2.
   51563.38176]
 [  2016.           2.          13.      ...      3.3          2.
  113555.31264]
 [  2016.           1.          22.      ...      3.3          2.
  133253.6832 ]]
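
To make the two steps easier to follow, here is the same pattern on a tiny made-up array, where the second column plays the role of `trip_kmh`. The name `scores` is a placeholder; reversing the index array with `[::-1]` is one simple way to get descending order.

```python
import numpy as np

scores = np.array([[1, 50.0],
                   [2, 10.0],
                   [3, 30.0]])

# indices that would sort the second column in ascending order
order = np.argsort(scores[:, 1])
print(order)                # [1 2 0]

# reorder whole rows using those indices
print(scores[order])        # ascending by the second column
print(scores[order[::-1]])  # descending by the second column
```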


Let's inspect the last ten rows of the array:

In [33]:
taxi_sorted[-10:]

array([[  2016.     ,      2.     ,     19.     ,      5.     ,
             4.     ,      2.     ,      2.     ,     17.3    ,
             4.     ,      2.5    ,      1.8    ,      0.     ,
             0.     ,      4.3    ,      2.     ,  25057.48608],
       [  2016.     ,      6.     ,      6.     ,      1.     ,
             0.     ,      2.     ,      2.     ,     18.7    ,
             4.     ,      2.5    ,      1.3    ,      0.     ,
             0.     ,      3.8    ,      3.     ,  27085.25952],
       [  2016.     ,      4.     ,     12.     ,      2.     ,
             4.     ,      2.     ,      2.     ,     19.8    ,
             4.     ,      2.5    ,      1.8    ,      0.     ,
             0.     ,      4.3    ,      2.     ,  28678.51008],
       [  2016.     ,      4.     ,     24.     ,      7.     ,
             5.     ,      3.     ,      3.     ,     16.9    ,
             3.     ,     52.     ,      0.8    ,      0.     ,
             0.     ,     52.8    ,  

There is no discernible pattern to the date or time of the trips with unrealistic average speeds. What they have in common is that they are very short rides: all have `trip_length` values of 4 seconds or less, which cannot be reconciled with the trip distances, all of which are more than 16 miles (roughly 26 km).

In each of these rows, the `pickup_location_code` and `dropoff_location_code` are the same. This might suggest that the machines that record the data use the last known GPS signal if they can't find the location, so if a driver starts and finishes a fare quickly, the machine will record an accurate time together with inaccurate location data.

In any case, it's safe to say that the data in these rows is bad, and needs to be removed before any further analysis is performed.

## Boolean indexing and more on NumPy
---

In the first part of the notebook we used Python's built-in csv module to import our CSV as a 'list of lists' and used loops to convert each value to a float before we created our NumPy ndarray.

NumPy has its own function for reading text files into an ndarray: [`numpy.genfromtxt()`](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt). While it has over 20 parameters, in most cases you need only two. Here is the simplified syntax for the function, and an explanation of the two parameters:

```python
np.genfromtxt(filename, delimiter=None)
```
* `filename` - A positional argument, usually a string representing the path to the text file to be read
* `delimiter` - A named argument, specifying the string used to separate each value

Let's read our dataset using this function:

In [34]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter = ',')
print(taxi)

[[    nan     nan     nan ...     nan     nan     nan]
 [2016.      1.      1.   ...   11.65   69.99    1.  ]
 [2016.      1.      1.   ...    8.     54.3     1.  ]
 ...
 [2016.      6.     30.   ...    5.     63.34    1.  ]
 [2016.      6.     30.   ...    8.95   44.75    1.  ]
 [2016.      6.     30.   ...    0.     54.84    2.  ]]


When `numpy.genfromtxt()` reads in a file with its default settings, it tries to convert every value to a float, and values that cannot be converted become NaN ("not a number"). We can use the [`ndarray.dtype`](https://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.dtype.html#numpy.ndarray.dtype) attribute to see the internal datatype that has been used.

In [35]:
print(taxi.dtype)

float64


NaN is most commonly seen when a value is missing, but in this case we have NaN because the first line of our CSV file contains the names of the columns, which cannot be converted to floats. We can remove the header row using slicing, or we can pass an additional parameter, `skip_header`, to the `numpy.genfromtxt()` function. The `skip_header` parameter accepts an integer, the number of rows to skip from the start of the file (note that because this is the number of rows and not an index, skipping the first row requires a value of 1 and not 0).

In [36]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter = ',', skip_header = 1)
print(taxi)

[[2016.      1.      1.   ...   11.65   69.99    1.  ]
 [2016.      1.      1.   ...    8.     54.3     1.  ]
 [2016.      1.      1.   ...    0.     37.8     2.  ]
 ...
 [2016.      6.     30.   ...    5.     63.34    1.  ]
 [2016.      6.     30.   ...    8.95   44.75    1.  ]
 [2016.      6.     30.   ...    0.     54.84    2.  ]]
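
If you want to try `skip_header` without the taxi file, the sketch below reads a tiny made-up CSV from memory. Using `io.StringIO` in place of a file path is a choice made for this example only; `csv_text` and its contents are hypothetical.

```python
import io
import numpy as np

# a tiny made-up CSV: one header row followed by two rows of numbers
csv_text = "a,b\n1,2\n3,4\n"

# without skip_header the header row is read as NaN values
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1 drops the first row before parsing
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header)
print(without_header)
```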


Probably the most powerful indexing technique is boolean indexing, which relies on boolean arrays. A boolean array, as the name suggests, is an array full of boolean values. Boolean arrays are sometimes called boolean vectors or boolean masks.

Let's look at what happens when we perform a boolean operation between an ndarray and a scalar:

In [37]:
print(np.array([2,4,6,8]) < 5)

[ True  True False False]


So we can create a boolean array using vectorized boolean operations. When used as an index, the boolean array acts as a filter: the values that correspond to `True` become part of the resulting ndarray, whereas the values that correspond to `False` are removed.

Here's an example:

In [38]:
an_array = np.array([2,4,6,8])
bool_array = an_array < 5
print(an_array[bool_array])

[2 4]
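
Boolean arrays can also be combined. The sketch below reuses the same four made-up values and shows the element-wise operators `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are needed because these operators bind more tightly than `<` and `>`.

```python
import numpy as np

values = np.array([2, 4, 6, 8])

between = (values > 2) & (values < 8)  # both conditions must hold
outside = (values < 4) | (values > 6)  # at least one condition holds
not_small = ~(values < 5)              # inverts the mask

print(values[between])    # [4 6]
print(values[outside])    # [2 8]
print(values[not_small])  # [6 8]
```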


Now let's use this technique on our dataset. With this we'll find out the number of rides made in January, February and March:

In [39]:
pickup_month = taxi[:,1]

january_bool = pickup_month == 1
january = pickup_month[january_bool]
january_rides = january.shape[0] # see the number of items in the 1-st axis

february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]

march_bool = pickup_month == 3
march = pickup_month[march_bool]
march_rides = march.shape[0]

print(january_rides, february_rides, march_rides)

13481 13333 15547
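
When only the count is needed, one possible shortcut is to count the `True` values in the mask directly, since `True` counts as 1 when summed. This is a sketch on made-up month codes (the array `months` is hypothetical), not a replacement for the cell above:

```python
import numpy as np

months = np.array([1, 2, 1, 3, 1, 2])  # made-up month codes

# summing a boolean array counts its True values
january_count = (months == 1).sum()

# np.count_nonzero gives the same answer
february_count = np.count_nonzero(months == 2)

print(january_count, february_count)
```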


You can combine boolean indexing with any of the indexing methods we used earlier in this notebook. The only limitation is that the boolean array must have the same length as the dimension you're indexing. With that we can, for example, select the columns `pickup_location_code`, `dropoff_location_code`, `trip_distance`, `trip_length`, `fare_amount`, `fees_amount`, `tolls_amount`, `tip_amount`, and `total_amount` for the rows where the tip is more than 50:

In [40]:
tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]
print(top_tips)

[[    4.       2.      21.45  2004.      52.       0.8      0.      52.8
    105.6 ]
 [    3.       4.       9.2   1041.      27.       1.3      5.54    60.
     93.84]
 [    2.       0.      19.8   1671.      52.5      1.3      5.54    59.34
    118.68]
 [    4.       2.      18.42  2968.      52.       0.8      5.54    80.
    138.34]
 [    3.       6.       0.49   158.       3.5      1.8      0.      70.
     75.3 ]
 [    2.       2.       2.7    381.       9.5      0.8      0.      60.
     70.3 ]
 [    3.       4.       9.54  1210.      27.5      0.8      5.54    55.
     88.84]
 [    2.       4.      17.6   3251.      52.       0.8      5.54    65.
    123.34]
 [    4.       2.      38.2   9252.      52.       0.8      5.54    80.
    138.34]
 [    4.       2.      18.    2276.       0.01     0.3      5.54    62.
     67.85]
 [    2.       0.      26.21 17029.     180.5      0.8      5.54   100.
    286.84]
 [    2.       2.       0.      24.       2.5      0.8      0.      58.
 

The last important topic is changing values in an ndarray. Here are some cases:

In [42]:
a = np.array(['red','blue','black','blue','purple'])
print(a)
a[0] = 'orange' # changing just one value in 1D array
print(a)

['red' 'blue' 'black' 'blue' 'purple']
['orange' 'blue' 'black' 'blue' 'purple']


In [43]:
a[3:] = 'pink' # changing several values
print(a)

['orange' 'blue' 'black' 'pink' 'pink']


In [46]:
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
print(ones, '\n')
ones[1,2] = 99 # changing a value in 2D array
print(ones)

[[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]] 

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]


In [47]:
ones[0] = 42 # changing a whole row
print(ones)

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]


In [48]:
ones[:,2] = 0 # and a whole column!
print(ones)

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]


Before making changes in an array, it's useful to make a copy (use `.copy()` method) of that array and work with it instead so the original data is safe. Let's practice with our dataset:

In [49]:
taxi_modified = taxi.copy() # copying the original data
taxi_modified[28214, 5] = 1 # fixing the incorrect 0.0 value
taxi_modified[:, 0] = taxi_modified[:, 0] - 2000 # changing YYYY format to the YY
taxi_modified[[1800, 1801], 7] = taxi_modified[:, 7].mean() # fixing the incorrect values
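
One reason `.copy()` matters is that basic slicing returns a view of the same data, not a new array. The sketch below, on made-up values, shows that writing to a slice changes the original, while writing to a copy does not; the names `view` and `safe` are placeholders.

```python
import numpy as np

original = np.array([1, 2, 3, 4])

view = original[:2]     # basic slicing returns a view of the same data
view[0] = 99            # this also changes `original`

safe = original.copy()  # an independent copy
safe[1] = -1            # `original` is not affected

print(original)  # [99  2  3  4]
print(safe)      # [99 -1  3  4]
```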

The basic syntax for assigning values with boolean indexing looks like this:

```python
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
```

Here are some examples of using it:

In [50]:
# create a new column filled with `0`.
# numpy.zeros(shape, dtype=float, order='C')
# returns a new array of the given shape and type, filled with zeros
zeros = np.zeros([taxi_modified.shape[0], 1])
# note: the new column is attached to the original `taxi`, so the edits
# made to `taxi_modified` in the previous cell are not carried over
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

# The next few lines change the added column (index 15) to 1 if
# the pickup location (index 5) is an airport:
# JFK Airport (2), LaGuardia Airport (3), Newark Airport (5)
taxi_modified[taxi_modified[:, 5] == 2, 15] = 1
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1

[[2016.      1.      1.   ...   69.99    1.      0.  ]
 [2016.      1.      1.   ...   54.3     1.      0.  ]
 [2016.      1.      1.   ...   37.8     2.      0.  ]
 ...
 [2016.      6.     30.   ...   63.34    1.      0.  ]
 [2016.      6.     30.   ...   44.75    1.      0.  ]
 [2016.      6.     30.   ...   54.84    2.      0.  ]]
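
The three assignments above could also be written with a single mask. As a sketch on made-up location codes, `np.isin()` tests each element for membership in a list of values; `location_codes`, `airport_codes` and `flags` are hypothetical names for this example.

```python
import numpy as np

location_codes = np.array([2, 4, 3, 6, 5, 0])  # made-up location codes
airport_codes = [2, 3, 5]                      # JFK, LaGuardia, Newark

# True where the code is one of the airport codes
is_airport = np.isin(location_codes, airport_codes)

# start from zeros and set the flag to 1 for the airport rows
flags = np.zeros(location_codes.shape[0])
flags[is_airport] = 1

print(flags)  # [1. 0. 1. 0. 1. 0.]
```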


Let's find out which of these airports is the most popular drop-off location (column index 6):

In [51]:
jfk = taxi[taxi[:, 6] == 2]
laguardia = taxi[taxi[:, 6] == 3]
newark = taxi[taxi[:, 6] == 5]

jfk_count = jfk.shape[0]
laguardia_count = laguardia.shape[0]
newark_count = newark.shape[0]

print(jfk_count, laguardia_count, newark_count)

11832 16602 63
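
To count every location code in one step instead of one code at a time, one option is `np.unique()` with `return_counts=True`. This is a sketch on made-up drop-off codes; the array `dropoff_codes` is hypothetical.

```python
import numpy as np

dropoff_codes = np.array([2, 3, 3, 5, 2, 3, 4])  # made-up drop-off codes

# the sorted distinct codes and how often each one appears
codes, counts = np.unique(dropoff_codes, return_counts=True)

for code, count in zip(codes, counts):
    print(code, count)
```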


Now let's remove the trips with abnormal speeds from our data and calculate some statistics:

In [57]:
cleaned_taxi = taxi[trip_kmh < 160] # keep only the trips slower than 160 km/h
mean_distance = trip_distance_kilometers.mean() # note: computed over all trips, outliers included
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()
mean_kmh = trip_kmh[trip_kmh < 160].mean()

print(mean_distance, mean_kmh, mean_length, mean_total_amount)

20.386246162236713 37.58339470285745 2239.503657309026 48.98131853260262
