A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 3. Pandas.

In this problem, we will use Pandas to calculate the minimum, maximum, mean, and median values of a column in a CSV (Comma-Separated Values) file.

In [None]:
import numpy as np
import pandas as pd
from nose.tools import assert_equal, assert_almost_equal

Suppose we have a CSV file with 4 columns: `Year`, `Month`, `DayofMonth`, and `ArrDelay`.

```
Year,Month,DayofMonth,ArrDelay
2001,1,17,-3
2001,1,18,4
2001,1,19,23
2001,1,20,10
2001,1,20,20
```

`ArrDelay` represents the arrival delay of a flight on the given date. So the first row says, on January 17, 2001, a flight arrived 3 minutes earlier than scheduled; the second row says, on January 18, 2001, a flight was delayed 4 minutes; and so on. For simplicity, let us suppose that the CSV file has only 5 rows of data, but real-world data will have many more.


In the following cell, we will use Python to create this CSV file and name it `sample.csv`.

In [None]:
csv_text = """Year,Month,DayofMonth,ArrDelay
2001,1,17,-3
2001,1,18,4
2001,1,19,23
2001,1,20,10
2001,1,20,20"""

with open('sample.csv', 'w') as f:
    f.write(csv_text)

In the following code cell, we use an IPython magic function called `%cat` to verify that we have successfully created the CSV file. The `%cat` magic displays the contents of a file.

In [None]:
%cat sample.csv
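
`%cat` relies on the operating system's `cat` command, so it may not be available on every platform (Windows, for instance). The following is a minimal plain-Python sketch of the same check. To stay self-contained it writes a shortened copy of the data to a temporary file with the hypothetical name `demo.csv` instead of reading `sample.csv`.

```python
from pathlib import Path
import tempfile

# A shortened copy of the sample data, used only for this illustration.
csv_text = "Year,Month,DayofMonth,ArrDelay\n2001,1,17,-3\n2001,1,18,4\n"

# Write to a temporary directory so this sketch does not touch sample.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"  # hypothetical file name
    csv_path.write_text(csv_text)
    contents = csv_path.read_text()

# Display the file contents, as %cat would.
print(contents)
```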

In Pandas, the `read_csv()` function makes it painless to read data from a CSV file and create a `DataFrame`.

In [None]:
df = pd.read_csv("sample.csv")
print(df)
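
To confirm how `read_csv()` interpreted the file, it can help to inspect the shape and column types of the result. The following is a minimal sketch. It reads the same five rows from an in-memory string through `io.StringIO`, an assumption made here so that the snippet runs on its own, and stores the result under the illustrative name `sample_df`.

```python
import io

import pandas as pd

# Same contents as sample.csv, held in memory so the snippet is self-contained.
csv_text = """Year,Month,DayofMonth,ArrDelay
2001,1,17,-3
2001,1,18,4
2001,1,19,23
2001,1,20,10
2001,1,20,20"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the header line
print(sample_df.dtypes)         # the type pandas inferred for each column
```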

We will compute some basic statistics of the `ArrDelay` column. The syntax for extracting a specific column from a data frame is similar to using a dictionary.

In [None]:
arr_delay = df["ArrDelay"]

Note that some people prefer to use the dot syntax, which yields exactly the same result for a column such as `ArrDelay`, whose name is a valid Python identifier.

```python
>>> arr_delay = df.ArrDelay
```

They are just two different ways of performing an identical operation.
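
One caveat is worth a quick illustration: the two forms agree only when the column name can be written as a Python attribute. The sketch below uses a small hypothetical frame, `demo_df`, with a made-up column `Dep Delay` that is not part of the assignment data.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "ArrDelay": [-3, 4, 23],
    "Dep Delay": [0, 5, 12],  # hypothetical column whose name contains a space
})

# Both forms return the same Series when the name is a valid identifier.
same = demo_df["ArrDelay"].equals(demo_df.ArrDelay)
print(same)

# Bracket syntax works for any column name, including one with a space.
dep_delay = demo_df["Dep Delay"]
print(dep_delay.tolist())
```

Dot syntax cannot express a name like `Dep Delay`, and a column that shares its name with an existing `DataFrame` attribute or method (`count`, for example) resolves to that attribute instead, so bracket syntax is the safer default.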

Now that we have set up the problem, in the rest of the notebook, we will compute the minimum, maximum, mean, and median values of `arr_delay`.

## Write a function that returns the minimum of a column.

In [None]:
def compute_minimum(column):
    """
    Computes the minimum of 'column'.
    
    Parameters
    ----------
    column: A pandas Series object.
    
    Returns
    -------
    An int.
    """
    
    # YOUR CODE HERE
    
    return minimum

In [None]:
print(compute_minimum(arr_delay))

In [None]:
assert_equal(compute_minimum(arr_delay), -3)

# test some more
data1 = {
    'A': [0, 1, 2, 3, 4],
    'B': [1, 2, 3, 4, np.nan], # append NaN since we need same number of elements
    'C': [4, 3, 2, 1, 0],
    'D': [7, 3, 5, 2, 11]
}
df1 = pd.DataFrame(data1)

assert_equal(compute_minimum(df1['A']), 0)
assert_equal(compute_minimum(df1['B']), 1)
assert_equal(compute_minimum(df1['C']), 0)
assert_equal(compute_minimum(df1['D']), 2)
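
Column `B` in the tests above contains a `NaN`, which affects both the column's type and how reductions treat it. The following sketch illustrates the default behavior with `sum()` and `count()` on a standalone Series named `b` (an illustrative name); it deliberately does not implement any of the functions asked for in this problem.

```python
import numpy as np
import pandas as pd

b = pd.Series([1, 2, 3, 4, np.nan])

print(b.dtype)              # float64, because NaN is a floating-point value
print(b.count())            # number of non-missing entries
print(b.sum())              # missing entries are skipped by default
print(b.sum(skipna=False))  # NaN propagates when skipping is turned off
```

Because the column is stored as floats, a reduction over `B` returns a float such as `1.0` rather than an `int`; `assert_equal` still accepts it since `1.0 == 1` in Python.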

## Write a function that returns the maximum of a column.

In [None]:
def compute_maximum(column):
    """
    Computes the maximum of 'column'.
    
    Parameters
    ----------
    column: A pandas Series object.
    
    Returns
    -------
    An int.
    """

    # YOUR CODE HERE
    
    return maximum

In [None]:
print(compute_maximum(arr_delay))

In [None]:
assert_equal(compute_maximum(arr_delay), 23)

assert_equal(compute_maximum(df1['A']), 4)
assert_equal(compute_maximum(df1['B']), 4)
assert_equal(compute_maximum(df1['C']), 4)
assert_equal(compute_maximum(df1['D']), 11)

## Write a function that returns the mean of a column.

In [None]:
def compute_mean(column):
    """
    Computes the mean of 'column'.
    
    Parameters
    ----------
    column: A pandas Series object.
    
    Returns
    -------
    A float.
    """
    
    # YOUR CODE HERE
    
    return mean

In [None]:
print(compute_mean(arr_delay))

In [None]:
assert_almost_equal(compute_mean(arr_delay), 10.8)
                    
assert_almost_equal(compute_mean(df1['A']), 2.0)
assert_almost_equal(compute_mean(df1['B']), 2.5)
assert_almost_equal(compute_mean(df1['C']), 2.0)
assert_almost_equal(compute_mean(df1['D']), 5.6)
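
The tests for the mean use `assert_almost_equal` rather than `assert_equal` because floating-point arithmetic is not exact. The sketch below shows the underlying issue with plain Python. It assumes that `assert_almost_equal` behaves like `unittest.TestCase.assertAlmostEqual`, which by default rounds the difference of the two values to 7 decimal places and compares it to zero.

```python
import math

total = 0.1 + 0.2

print(total == 0.3)                # exact comparison is sensitive to rounding error
print(round(total - 0.3, 7) == 0)  # difference rounded to 7 decimal places
print(math.isclose(total, 0.3))    # relative-tolerance check from the standard library
```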

## Write a function that returns the median of a column.

In [None]:
def compute_median(column):
    """
    Computes the median of 'column'.
    
    Parameters
    ----------
    column: A pandas Series object.
    
    Returns
    -------
    A float.
    """
    
    # YOUR CODE HERE
    
    return median

In [None]:
print(compute_median(arr_delay))

In [None]:
assert_almost_equal(float(compute_median(arr_delay)), 10.0)

assert_almost_equal(float(compute_median(df1['A'])), 2.0)
assert_almost_equal(float(compute_median(df1['B'])), 2.5)
assert_almost_equal(float(compute_median(df1['C'])), 2.0)
assert_almost_equal(float(compute_median(df1['D'])), 5.0)