# Exercise 3.0: Importing the Honolulu Flights Data Set

For some of the exercises in this chapter we will be working with a data set containing information about all the flights arriving at and departing from the Honolulu airport, HNL, on the island of Oahu in December 2015.

Please run the following code cell, which parses the 'honolulu_flights.csv' file and builds the `HNL_flights_df` `DataFrame`, before trying the exercises in this chapter related to the Honolulu flights data set.

This data set contains the following columns:

| Column |Description|
|:----------|-----------|
| `YEAR` | The year of the flight  |
| `MONTH` |  The month of the flight |
| `DAY` |  The day of the flight |
| `DAY_OF_WEEK` |  The day of the week of the flight |
| `FLIGHT_NUMBER` |  The flight number of the flight |
| `ORIGIN_AIRPORT` |  The origin airport of the flight  |
| `DESTINATION_AIRPORT` |  The destination airport of the flight |
| `DEPARTURE_DELAY` |  The departure delay of the flight  |
| `DISTANCE` |  The distance of the flight in miles |
| `AIR_TIME` |  The flight time without taxiing in minutes |
| `ARRIVAL_DELAY` |  The arrival delay of the flight  |

```python
import pandas as pd

HNL_flights_df = pd.read_csv('honolulu_flights.csv')
```

# Exercise 3.1: Attributes

Which of the following lines of code will give the number of rows and columns, i.e. the `shape` attribute, of the `HNL_flights_df` `DataFrame`? Please note that the output should be: (7975, 11).

A:
```python
HNL_flights_df.size
```

B: 
```python
HNL_flights_df.shape
```

C: 
```python
HNL_flights_df.shape()
```

D: 
```python
pd.shape(HNL_flights_df)
```

**Correct Answer**

B:
```python
HNL_flights_df.shape
```

**Explanation**

A: This is the correct way to access an attribute of a `DataFrame`, but this attribute gives the total number of elements rather than the number of rows and columns.

B: In `Python`, you can access an object’s attribute using the syntax `ObjectName.attributeName`. For instance, if our `DataFrame` is named `df` and our attribute is `shape`, then `df.shape` will return the `shape` attribute of our `DataFrame`, a tuple of the form (rows, columns).

C: This is the syntax used to call a method, but `shape` is an attribute of the `DataFrame` that holds a tuple. This line results in a `TypeError` with the message: "'tuple' object is not callable".

D: The line of code in option D will attempt to access a `shape` attribute of the `pandas` library, which does not exist, and then call it as if it were a function, passing `HNL_flights_df` as an argument. This line results in an `AttributeError` with the message: "module 'pandas' has no attribute 'shape'".
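
The difference between `shape`, `size`, and the failing call in option C can be checked on a small stand-in table. The sketch below assumes a hypothetical three-row `toy_df` in place of the real flight data, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for HNL_flights_df: 3 rows, 2 columns.
toy_df = pd.DataFrame({
    'DISTANCE': [100, 2556, 163],
    'AIR_TIME': [25, 310, 33],
})

print(toy_df.shape)  # tuple of (number of rows, number of columns)
print(toy_df.size)   # total number of elements, i.e. rows * columns

# shape holds a tuple, so calling it like a method raises a TypeError.
try:
    toy_df.shape()
except TypeError as err:
    print(type(err).__name__, err)
```

On the real data, the same `shape` attribute is what yields the (7975, 11) quoted in the exercise.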

# Exercise 3.2: `DataFrame` Methods

Which of the following lines of code will output the average arrival delay time for the flights described in `HNL_flights_df`? Note that the output should be: -2.2572254335260116

A:
```python
HNL_flights_df['ARRIVAL_DELAY'].mean()
```

B:
```python
HNL_flights_df.mean()
```

C:
```python
pd.mean(HNL_flights_df.ARRIVAL_DELAY)
```

D:
```python
HNL_flights_df.describe().ARRIVAL_DELAY
```

**Correct Answer**

A:
```python
HNL_flights_df['ARRIVAL_DELAY'].mean()
```

**Explanation**

A: This answer correctly accesses the `ARRIVAL_DELAY` column of `HNL_flights_df`, and then calls the `mean()` method of the resulting `Series`.

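B: The line of code in this option computes the mean of every column rather than of `ARRIVAL_DELAY` alone. In older versions of `pandas` it returns a `Series` of the average of each of the numerical columns of `HNL_flights_df`; in `pandas` 2.0 and later it raises a `TypeError` because of the string-valued columns unless `numeric_only=True` is passed. Either way, this problem specifically asked for the average arrival delay time.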

C: The `pandas` module does not have a `mean()` function as it is used in this option. This line of code results in an `AttributeError` with the message: "module 'pandas' has no attribute 'mean'".

D: This option will return a `Series` containing several summary statistics for the `ARRIVAL_DELAY` column, including the mean. However, this problem specifically asked for the average arrival delay time.
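
To see how options A and D differ, the following sketch runs both on a hypothetical four-row `toy_df`. The column names mirror the data set, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for HNL_flights_df with one text and one numeric column.
toy_df = pd.DataFrame({
    'ORIGIN_AIRPORT': ['HNL', 'LAX', 'HNL', 'OGG'],
    'ARRIVAL_DELAY': [-10.0, 5.0, 0.0, 13.0],
})

# Option A: select the column, then call the Series method mean().
col_mean = toy_df['ARRIVAL_DELAY'].mean()

# Option D: describe() summarizes the numeric columns; selecting the
# ARRIVAL_DELAY column gives a Series of several statistics.
summary = toy_df.describe()['ARRIVAL_DELAY']

print(col_mean)         # a single number
print(summary)          # count, mean, std, min, quartiles, max
print(summary['mean'])  # the mean is one entry of that Series
```

Option A gives the single number directly, while option D requires one more lookup to extract the mean from the summary.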



# Exercise 3.3: Vectorization

Which of the following lines of code will result in a `Series` that contains the average speed of the plane in miles per minute for each flight described in `HNL_flights_df`? Please note that this can be calculated by dividing the distance by the air time of the flight.

A:
```python
HNL_flights_df.loc[:, 'DISTANCE' / 'AIR_TIME']
```

B:
```python
HNL_flights_df.loc['DISTANCE', :] / HNL_flights_df.loc['AIR_TIME', :]
```

C:
```python
HNL_flights_df.loc[:, 'DISTANCE'] / HNL_flights_df.loc[:, 'AIR_TIME']
```

D:
```python
HNL_flights_df.loc[:, ['DISTANCE', 'AIR_TIME']].divide(HNL_flights_df.loc[:, 'AIR_TIME'], axis='rows')
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

**Correct Answer**

C:
```python
HNL_flights_df.loc[:, 'DISTANCE'] / HNL_flights_df.loc[:, 'AIR_TIME']
```


**Explanation**

A: This line of code will result in a `TypeError` with the message: "unsupported operand type(s) for /: 'str' and 'str'". This is because `Python` evaluates `'DISTANCE' / 'AIR_TIME'` first, attempting to divide one string by the other, before the result is ever used to index `HNL_flights_df`.

B: This option will result in a `KeyError` with a message such as: "the label \[`DISTANCE`] is not in the \[`index`]" (the exact wording depends on the `pandas` version). The `KeyError` is thrown because the syntax used is attempting to access a *row* labeled by the string 'DISTANCE', rather than a column. Remember that we reference by row and then column with the `loc` attribute of the `DataFrame`.

C: This line of code correctly accesses the two columns and performs the vectorized arithmetic to obtain a `Series` with the average speed of the plane in miles per minute for each flight described in `HNL_flights_df`.

D: This choice first creates a `DataFrame` that is a subset of the original `HNL_flights_df` containing only the columns `DISTANCE` and `AIR_TIME`. It then calls the `divide()` method of that `DataFrame` with `axis='rows'`, passing the `AIR_TIME` column of `HNL_flights_df`. The `divide()` call performs vectorized arithmetic on both `DISTANCE` and `AIR_TIME`, so the result is a two-column `DataFrame` rather than a `Series`, but the problem clearly states that only the average speed of the plane in miles per minute is desired.
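
The contrast between options C and D can be sketched on a hypothetical three-row `toy_df` with made-up values. Note that the sketch spells the row axis as `axis='index'`, the documented name, where option D writes `axis='rows'`.

```python
import pandas as pd

# Hypothetical stand-in for HNL_flights_df (distance in miles, time in minutes).
toy_df = pd.DataFrame({
    'DISTANCE': [100, 2400, 160],
    'AIR_TIME': [25, 300, 40],
})

# Option C: element-wise (vectorized) division of two columns gives a Series.
speed = toy_df.loc[:, 'DISTANCE'] / toy_df.loc[:, 'AIR_TIME']
print(speed)

# Option D: dividing a two-column DataFrame by a column, row by row,
# gives a two-column DataFrame; the AIR_TIME column becomes all ones.
both = toy_df.loc[:, ['DISTANCE', 'AIR_TIME']].divide(
    toy_df.loc[:, 'AIR_TIME'], axis='index'
)
print(both)
```

The `DISTANCE` column of `both` matches `speed`, but the extra `AIR_TIME` column is why option D does not answer the question as posed.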


# Exercise 3.4: Broadcasting

Currently the `AIR_TIME` column, which gives the total flight time without taxiing, is in units of minutes. Which of the following lines of code will result in a `Series` that contains the total flight time without taxiing in units of hours? Please note that this can be found by dividing each of the entries in the `AIR_TIME` column by sixty.

A:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / pd.DataFrame([60])
```

B:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / pd.Series([60])
```

C:
```python
HNL_flights_df.loc['AIR_TIME', :] / 60
```

D:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / 60
```

**Correct Answer**

D:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / 60
```


**Explanation**

A, B: This problem can be solved using broadcasting. `pandas` will broadcast a scalar value of a native `Python` type, such as an integer, across every entry of a `Series` or `DataFrame`. It does not broadcast `pandas` types. If a `DataFrame` or `Series` is used as the other operand in an arithmetic operation, as in options A and B, then `pandas` will align the two objects by their index and column labels and will not broadcast values, so positions without a matching label are filled with `NaN`.

C: This line of code attempts to access a row labeled by the string 'AIR_TIME', but since that label does not exist in the `index`, a `KeyError` is thrown. 

D: This line of code correctly accesses the `AIR_TIME` column, and then uses broadcasting to divide each entry by sixty.
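
The difference between broadcasting a scalar (option D) and aligning on labels (option B) shows up clearly on a short, hypothetical `air_time` `Series`. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical AIR_TIME values in minutes, with the default index 0, 1, 2.
air_time = pd.Series([60, 120, 300], name='AIR_TIME')

# Option D: the scalar 60 is broadcast to every entry.
hours = air_time / 60
print(hours)

# Option B: pd.Series([60]) has only the index label 0, so pandas aligns
# on labels and only the first entry is divided; the rest become NaN.
aligned = air_time / pd.Series([60])
print(aligned)
```

Only the scalar version converts every flight time to hours.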

# Exercise 3.5: Subsetting with Comparison Operations


Suppose we want to focus our attention on flights that had arrival delays. Which line of code will correctly return a subset of `HNL_flights_df` containing only flights that had a positive arrival delay?


A:
```python
HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0
```

B:
```python
HNL_flights_df[HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0]
```

C:
```python
HNL_flights_df.loc[:, 'ARRIVAL_DELAY' > 0]
```

D:
```python
HNL_flights_df[HNL_flights_df > 0]
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

**Correct Answer**

B:
```python
HNL_flights_df[HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0]
```


**Explanation**

A: This option will result in a `Series` of booleans indicating the positions of the flights that have a positive arrival delay. This is used to subset `HNL_flights_df` in the correct answer.

B: This line of code correctly creates a `Series` of booleans indicating the positions of the flights that have a positive arrival delay, and then uses it to subset `HNL_flights_df`.

C: This line of code attempts to compare the string 'ARRIVAL_DELAY' to the integer 0, which is undefined in `Python` and results in a `TypeError` with the message: "'>' not supported between instances of 'str' and 'int'".

D: This line of code compares the `HNL_flights_df` to the integer 0 and uses the result to subset `HNL_flights_df`. `Pandas` defines the comparison so this line will run but without the desired results.
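
The two-step nature of the correct answer, building a boolean mask and then using it to subset, can be sketched on a hypothetical `toy_df`. The toy table is all numeric and its values are made up, so the behavior of option D shown here is illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for HNL_flights_df with made-up values.
toy_df = pd.DataFrame({
    'FLIGHT_NUMBER': [10, 20, 30, 40],
    'ARRIVAL_DELAY': [-5.0, 12.0, 0.0, 3.0],
})

# Option A: the comparison alone gives a boolean Series (the mask).
mask = toy_df.loc[:, 'ARRIVAL_DELAY'] > 0
print(mask)

# Option B: indexing with the mask keeps only the rows where it is True.
late = toy_df[mask]
print(late)

# Option D on this all-numeric table: every row is kept, and individual
# entries that are not positive are replaced with NaN.
masked = toy_df[toy_df > 0]
print(masked)
```

Option B removes whole rows, whereas the element-wise mask of option D keeps the original number of rows and only blanks out individual cells.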

# Exercise 3.6: Subsetting with Boolean Operations

Suppose we wanted to focus on flights that had both departure and arrival delays. Which line of code will correctly return a subset of `HNL_flights_df` containing only flights that had both departure and arrival delays?

A:
```python
HNL_flights_df[(HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0) | (HNL_flights_df.loc[:, 'DEPARTURE_DELAY'] > 0)]
```

B:
```python
HNL_flights_df[HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0 & HNL_flights_df.loc[:, 'DEPARTURE_DELAY'] > 0]
```

C:
```python
HNL_flights_df[(HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0) && (HNL_flights_df.loc[:, 'DEPARTURE_DELAY'] > 0)]
```

D:
```python
HNL_flights_df[(HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0) & (HNL_flights_df.loc[:, 'DEPARTURE_DELAY'] > 0)]
```

**Correct Answer**

D:
```python
HNL_flights_df[(HNL_flights_df.loc[:, 'ARRIVAL_DELAY'] > 0) & (HNL_flights_df.loc[:, 'DEPARTURE_DELAY'] > 0)]
```


**Explanation**

A: This option will result in a subset of `HNL_flights_df` with flights that had *either* an arrival delay *or* a departure delay, but the problem statement asks for a subset of flights with both. The '|' operator is used for the boolean *or* operation and '&' is used for the boolean *and* operation.

B: This line of code results in a `TypeError` with a message such as: 'cannot compare a dtyped \[float64] array with a scalar of type \[bool]'. Because '&' binds more tightly than '>', `Python` evaluates `0 & HNL_flights_df.loc[:, 'DEPARTURE_DELAY']` before either comparison. Be sure to use parentheses so that the order of operations is interpreted correctly.

C: This choice uses '&&', which is not valid `Python` syntax and results in a `SyntaxError`. The correct operator for the element-wise boolean *and* operation is '&'.

D: This line of code correctly subsets the `HNL_flights_df` `DataFrame` by using parentheses to ensure the order of operations and applying the '&' operator.
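
One way to sidestep the precedence problem altogether is to name each mask before combining them. The sketch below does this on a hypothetical `toy_df` with made-up values. The names `arr_late` and `dep_late` are illustrative and do not come from the exercise.

```python
import pandas as pd

# Hypothetical stand-in for HNL_flights_df with made-up delay values.
toy_df = pd.DataFrame({
    'ARRIVAL_DELAY': [5.0, -3.0, 8.0, -1.0],
    'DEPARTURE_DELAY': [2.0, 4.0, -6.0, -2.0],
})

# Build each boolean mask separately, so no parentheses are needed later.
arr_late = toy_df.loc[:, 'ARRIVAL_DELAY'] > 0
dep_late = toy_df.loc[:, 'DEPARTURE_DELAY'] > 0

# '&' keeps rows where both masks are True (option D).
both_late = toy_df[arr_late & dep_late]
print(both_late)

# '|' keeps rows where at least one mask is True (option A).
either_late = toy_df[arr_late | dep_late]
print(either_late)
```

Writing the condition inline, as in option D, is equivalent as long as each comparison is wrapped in its own parentheses.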

# Exercise 3.7: Sorting

Which line of code will correctly sort `HNL_flights_df` by the arrival delay in order of greatest to least, i.e. in descending order?

A:
```python
HNL_flights_df.sort_values(by='ARRIVAL_DELAY', ascending = False)
```

B:
```python
HNL_flights_df.sort_values(by='ARRIVAL_DELAY', ascending = True)
```

C:
```python
HNL_flights_df.sort_index()
```

D:
```python
HNL_flights_df.sort(by='ARRIVAL_DELAY', ascending = False)
```

**Correct Answer**

A:
```python
HNL_flights_df.sort_values(by='ARRIVAL_DELAY', ascending = False)
```


**Explanation**

A: This option correctly uses the `sort_values()` method of the `DataFrame` and sets the parameters to the appropriate values.

B: This line of code incorrectly sets the ascending parameter to `True`, but the problem clearly states that `HNL_flights_df` should be sorted in descending order.

C: This choice calls the `sort_index()` method, which will sort `HNL_flights_df` in ascending order using the index, which is not what is desired.

D: This line of code calls a method called `sort()`, which does not exist, so an `AttributeError` is thrown.
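
The effect of the `ascending` parameter can be checked on a hypothetical three-row `toy_df` with made-up values. The sketch also shows that `sort_values()` returns a new `DataFrame` and leaves the original row order untouched.

```python
import pandas as pd

# Hypothetical stand-in for HNL_flights_df with made-up values.
toy_df = pd.DataFrame({
    'FLIGHT_NUMBER': [10, 20, 30],
    'ARRIVAL_DELAY': [-5.0, 12.0, 3.0],
})

# Option A: largest arrival delay first.
descending = toy_df.sort_values(by='ARRIVAL_DELAY', ascending=False)
print(descending)

# Option B: smallest arrival delay first.
ascending = toy_df.sort_values(by='ARRIVAL_DELAY', ascending=True)
print(ascending)

# The original DataFrame keeps its row order.
print(toy_df)
```

To keep the sorted result, assign it to a variable as done here.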