Skip to content

Latest commit

 

History

History
76 lines (50 loc) · 3.18 KB

p2.md

File metadata and controls

76 lines (50 loc) · 3.18 KB

Problem 4.2. Simple Statistics II.

  • A template file for this problem is provided: stats2.ipynb

This problem is a continuation of problem 4.1. Recall that you wrote a function named get_stats() that takes a list and returns a tuple of minimum, maximum, mean, and median. To use this function, you have to convert your IPython notebook to a regular .py file. One way to do this is to use the IPython %%script%% magic function; in an IPython notebook cell, type (assuming the filename of your IPython notebook from Problem 4.1 is stats.ipynb):

%%bash
ipython3 nbconvert --to python stats.ipynb

and press shift + enter. This will create a Python script named stats.py.

(Note: If ipython3 nbconvert produces an error and says that there is no module named pygments, you have to run docker pull lcdm/info490 to download the latest image. You also have to stop and remove the running notebook server, and create a new notebook server with the latest image. Remember to back up your notebooks before running a new notebook server, because removing a container will erase all data in the container.)

We will import this as a module in stats2.ipynb:

from stats import get_stats

Function: get_stats()

We will use the function get_stats() to compute basic statistics of a number of columns from the arline performance dataset we downloaded in week 2. Namely, we will use the following columns:

  • Column 15, "ArrDelay": arrival delay, in minutes,
  • Column 16, "DepDelay": departure delay, in minutes, and
  • Column 19, "Distance": distance, in miles.

To extract these columns from the CSV file,

  • Write a function named get_column(filename, n, header = True) that reads the n-th column from a file and returns a list of integers.

  • You may assume that the column is made of integers.

  • We will also use the optional argument header because the first line of our file lists the names of the columns, but we might want to turn this off to handle a file that doesn't have a header.

  • Use a combination of with statement and open() function to open filename in the get_column() function.

    Tip: When I tried to use open() to read 2001.csv, I had the following error:

      'utf-8' codec can't decode byte 0xe4 in position 343: invalid
      continuation byte
    

    You can avoid this error by using encoding='latin-1' option in open().

  • Skip the first line if the header parameter is True; do not skip if it's False.

  • Some columns have missing values 'NA', and you need a way to handle these missing values. If the n-th column is missing, you should not include that column in result; that is, skip all rows with 'NA'. As a result, lists returned from different columns may have different lengths.

Function: print_stats()

We also want to print out the results in a nicely formatted manner.

  • The print_stats(input_list, title = None) function is already for you. You don't have to write this function.

It takes a list of integers and prints out the basic statistics, e.g.

>>> print_stats([1, 2, 3, 4, 5], title = 'Title')
Title
Minimum: 1
Maximum: 5
Mean: 3.0
Median: 3.0