# Introduction to NumPy

In this notebook we will learn the following:

* How to work with data using NumPy and pandas objects?
* How to explore and clean data in pandas?
* How to use pandas and NumPy to analyze data quickly and efficiently?

As we learn NumPy, we'll be analyzing taxi trip data released by the city of New York. The city publishes data on taxis and for-hire vehicles on the Taxi and Limousine Commission (TLC) website. The data covers over 1.3 billion individual trips, reaches back as far as 2009, and is regularly updated.


In [1]:
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
# (the with block closes the file once it has been read)
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

# convert the list of lists to a NumPy ndarray
taxi = np.array(converted_taxi_list)

In [5]:
taxi.shape # provides information about num_rows and num_columns

(89560, 15)

The output of the ndarray.shape attribute gives us a few important pieces of information:

* There are two numbers, which tells us that our ndarray is two-dimensional.
* The first number tells us that the first dimension is 89,560 items long, or put another way, that there are 89,560 rows in our data set.
* The second number tells us that the second dimension is 15 items long, or put another way, that there are 15 columns in our data set.

Note: the data type returned is called a tuple. Tuples are very similar to Python lists, but are immutable (they can't be modified). Tuples are defined and displayed using parentheses () rather than brackets [].
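
To make the shape tuple concrete, here is a minimal sketch on a small, made-up array. The name `sample` and its values are assumptions for illustration only; they are not part of the taxi data set.

```python
import numpy as np

# Hypothetical 3x4 array standing in for the much larger taxi array
sample = np.arange(12, dtype=float).reshape(3, 4)

# shape is a tuple, so it can be unpacked like any other tuple
num_rows, num_columns = sample.shape

print(sample.shape)  # (3, 4)
print(sample.ndim)   # 2, the number of values in the shape tuple

# Tuples are immutable, so item assignment on the shape tuple fails
try:
    sample.shape[0] = 5
except TypeError as exc:
    print("cannot modify a tuple:", exc)
```

Unpacking the tuple into `num_rows` and `num_columns` is often more readable than indexing `shape[0]` and `shape[1]` later on.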

To select data from a two-dimensional ndarray, we use the following syntax:

ndarray[row,column]

# or, to select all columns
# for a given set of rows
ndarray[row]

Where row defines the location along the row axis and column defines the location along the column axis. Both row and column can be one of the following:

* An integer, indicating a specific location, e.g. ndarray[3,0].
* A slice, indicating a range of locations, e.g. ndarray[0:5,6:].
* A colon, indicating every location, e.g. ndarray[:,2].
* A list of values, indicating specific locations, e.g. ndarray[[0,1,3,4],0].
* A boolean array, indicating specific locations. We'll look at this method in detail in the second mission of this course.
* Or any combination of the above.
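
The selection styles listed above can be tried on a small array before applying them to the taxi data. This is only a sketch; `sample` and the result names are hypothetical.

```python
import numpy as np

# Hypothetical 4x5 array whose values make the selections easy to read:
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]]
sample = np.arange(20).reshape(4, 5)

single_item = sample[3, 0]          # integer, integer -> 15
block = sample[0:2, 3:]             # slice, slice -> rows 0-1, columns 3-4
one_column = sample[:, 2]           # colon, integer -> every row of column 2
picked_rows = sample[[0, 1, 3], 0]  # list, integer -> rows 0, 1, 3 of column 0
whole_row = sample[1]               # row only -> every column of row 1

print(single_item)
print(block)
print(one_column)
print(picked_rows)
print(whole_row)
```

Note that an integer index drops a dimension (`one_column` is one-dimensional), while a slice keeps it (`block` is still two-dimensional).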

From the taxi ndarray:
* Select the row at index 0 and assign it to row_0.
* Select every column for the rows at indexes 391 to 500 inclusive and assign them to rows_391_to_500.
* Select the item at row index 21 and column index 5 and assign it to row_21_column_5.

In [6]:
row_0 = taxi[0]
rows_391_to_500 = taxi[391:501,:]
row_21_column_5 = taxi[21,5]
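
The slice 391:501 is used for "indexes 391 to 500 inclusive" because the end of a slice is exclusive. A minimal sketch on a small, made-up array (the name `sample` is an assumption for illustration):

```python
import numpy as np

# Hypothetical one-dimensional array whose values equal their indexes
sample = np.arange(10)

# To include index 5, the slice must stop at 6
indexes_3_to_5_inclusive = sample[3:6]

print(indexes_3_to_5_inclusive)       # [3 4 5]
print(len(indexes_3_to_5_inclusive))  # 3 items: indexes 3, 4 and 5
```

By the same logic, taxi[391:501,:] returns 110 rows (501 - 391).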

In [7]:
cols = [1,4,7]
columns_1_4_7 = taxi[:,cols]
row_99_columns_5_to_8 = taxi[99,5:9]
rows_100_to_200_column_14 = taxi[100:201,14]

In [8]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
trip_mph = trip_distance_miles/trip_length_hours

To make the calculations above, we used operators like the / symbol to perform vectorized operations over our data. NumPy provides a second way to make these calculations: arithmetic functions. Let's look at how we would write the same calculation with the equivalent numpy.divide function:

In [9]:
# using the `/` operator:
trip_mph_1 = trip_distance_miles / trip_length_hours

# using the `numpy.divide()` function:
trip_mph_2 = np.divide(trip_distance_miles,trip_length_hours)

In [10]:
print(trip_mph_1)
print(trip_mph_2)

[37.11340206 38.58157895 31.27222982 ... 22.29907867 42.41551247
 36.90473407]
[37.11340206 38.58157895 31.27222982 ... 22.29907867 42.41551247
 36.90473407]
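
Rather than comparing the printed arrays by eye, the two results can be compared programmatically. The sketch below uses made-up distances and durations (not rows from the taxi data) and also includes a plain Python loop for contrast with the vectorized versions.

```python
import numpy as np

# Made-up trip distances (miles) and durations (seconds) for three trips
distance_miles = np.array([10.0, 3.5, 18.0])
length_seconds = np.array([1200.0, 900.0, 2400.0])

length_hours = length_seconds / 3600  # 3600 seconds is one hour

# Vectorized division, written two equivalent ways
mph_operator = distance_miles / length_hours
mph_function = np.divide(distance_miles, length_hours)

# The same calculation as an element-by-element Python loop
mph_loop = [d / (s / 3600) for d, s in zip(distance_miles, length_seconds)]

# Both vectorized forms give identical arrays
print(np.array_equal(mph_operator, mph_function))  # True

# The loop agrees with them up to floating-point tolerance
print(np.allclose(mph_operator, mph_loop))
```

Which form to use is mostly a matter of style: the operator is shorter, while the function form makes it explicit that a NumPy operation is being applied.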
