**YOUR NAME HERE**

Spring 2020

CS 251: Data Analysis and Visualization

Project 1: Data Analysis and Visualization

## Due date

Before lab next week, your goal is to write a *draft* of the `Data` class (i.e. try to complete every task in this notebook; *there might be bugs and that's ok: you will have the chance to fix them later without penalty*). You will need `Data` for Project 1.

**Your progress on `Data` will be graded in lab**: 1 point if you make substantial progress, 0 if not.

Get started *immediately* on Task 1. The first lecture of CS 251 should be all that you need to complete this notebook.

## Task 0) Download data and code templates

**TODO:**
- Download `iris.csv`, `iris_bad.csv`, `test_data_spaces.csv`, `test_data_complex.csv`, and `anscombe.csv` datasets. Store them in a `data` subdirectory in your project folder.
- Download the `data.py` code template, which contains method signatures that you need to implement and detailed instructions about how to implement them. 

### Don't skip reading this

**In this class, it is of PARAMOUNT importance that you write code that conforms EXACTLY to the method specifications in the code template. This means that the parameter count, data types, and return types must match the docstring specifications 100%. We grade by running test code, which includes the tests we give you in these notebooks. We provide these tests so that you get rapid feedback on whether your code is working properly. We may also grade with other test code that makes similar assumptions. You will lose points if test code fails to run or returns incorrect results!**

**You should never have to modify test code!**

## Task 1) `Data` class

The `Data` class
- Reads in and stores tables of data contained in .csv files
- Allows the user to select and return variables by their string name, rather than their column index.
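
As a rough illustration of selecting variables by name, the sketch below looks up column indices in a name-to-column dictionary and uses them to index an ndarray. The helper `select_by_name` and the small table are hypothetical and are not part of the `data.py` template.

```python
import numpy as np

# Hypothetical 3-sample, 4-variable table and its name-to-column mapping.
data = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [4.7, 3.2, 1.3, 0.2]])
header2col = {'sepal_length': 0, 'sepal_width': 1,
              'petal_length': 2, 'petal_width': 3}

def select_by_name(data, header2col, names):
    """Return the columns of `data` named in `names`, in that order."""
    cols = [header2col[name] for name in names]
    # Indexing with a list of ints builds a new array (not a view).
    return data[:, cols]

petals = select_by_name(data, header2col, ['petal_length', 'petal_width'])
print(petals.shape)  # (3, 2)
```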

### Overview of CSV files

Your `Data` class parses .csv data files. Here is the assumed structure of each .csv file:
- 1st row: headers (name of each data variable / column)
- 2nd row: Data type. Possible values: `numeric`, `string`, `enum`, or `date`. Numeric types can be either integers or floating point values; strings are arbitrary strings; enum implies there are a finite number of values but they can be strings or numbers; a date should be interpreted as a calendar date.
- 3rd row+: Actual data.

**Your `Data` object should only hold `numeric` data variables (ignore non-numeric columns of data)**
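
The file layout above could be parsed along the following lines. This is only a sketch under stated assumptions: the file contents and the helper `parse_numeric_columns` are made up for illustration, and the `data.py` template may ask for a different structure. It reads the header row, then the type row, then keeps only the columns declared `numeric`.

```python
import csv
import io

# Hypothetical file contents: two numeric variables and one string variable.
sample_csv = """sepal_length,petal_width,species
numeric,numeric,string
5.1,0.2,setosa
4.9,0.2,setosa
"""

def parse_numeric_columns(lines):
    """Return (headers, types, rows), keeping only the 'numeric' columns."""
    reader = csv.reader(lines)
    headers = [h.strip() for h in next(reader)]  # 1st row: variable names
    types = [t.strip() for t in next(reader)]    # 2nd row: data types

    # Positions in the file of the columns declared as numeric
    keep = [i for i, t in enumerate(types) if t == 'numeric']

    rows = []
    for row in reader:                           # 3rd row onward: samples
        if row:                                  # skip blank lines
            rows.append([float(row[i]) for i in keep])

    return [headers[i] for i in keep], [types[i] for i in keep], rows

headers, types, rows = parse_numeric_columns(io.StringIO(sample_csv))
# Column indices refer to positions among the kept (numeric) columns.
header2col = {name: col for col, name in enumerate(headers)}

print(headers)     # ['sepal_length', 'petal_width']
print(header2col)  # {'sepal_length': 0, 'petal_width': 1}
print(rows)        # [[5.1, 0.2], [4.9, 0.2]]
```

Passing `rows` to `np.array` would then give the 2-D ndarray that the tests below expect in `data`.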

### Overview of `Data` class

**TODO:** 

Implement the following methods in `data.py`. As you go, execute (Shift+Return) code in the notebook cells below to test out your code.
- Constructor: Declare/initialize instance variables, start parsing .csv file if its path is provided.
- `read(filepath)`: Reads data from the specified .csv file into the `Data` object.
- `__str__()`: Prepares a nicely formatted string for printing `Data` objects
- `get_headers()`: returns a list of all of the headers.
- `get_types()`: returns a list of all of the variable data types.
- `get_mappings()`: returns the dictionary that maps each variable name to its column index.
- `get_num_dims()`: returns the number of variables (columns).
- `get_num_samples()`: returns the number of samples in the data set.
- `get_sample(rowInd)`: returns the `rowInd`-th data sample.
- `get_all_data()`: returns a copy of the entire dataset.



In [1]:
from data import Data
import numpy as np

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

### 1a) Test constructor and `read`

#### (i) Read data in constructor

In [2]:
iris_filename = 'data/iris.csv'
iris_data = Data(iris_filename)

print(f'Your file path is {iris_data.filepath} and should be data/iris.csv\n')
print(f"Your iris headers are\n{iris_data.headers}\nand should be\n['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n")
print(f"Your iris variable types are\n{iris_data.types}\nand should be\n['numeric', 'numeric', 'numeric', 'numeric']\n")
print(f"Your iris variable mapping is\n{iris_data.header2col}\nand should be\n'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3\n")
print(f'Your data is a ndarray? {isinstance(iris_data.data, np.ndarray)}')
print(f'Your data has {iris_data.data.shape[0]} samples and {iris_data.data.shape[1]} variables/dimensions.\nIt should have 150 samples and 4 variables/dimensions.')

Your file path is data/iris.csv and should be data/iris.csv

Your iris headers are
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
and should be
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

Your iris variable types are
['numeric', 'numeric', 'numeric', 'numeric', 'string']
and should be
['numeric', 'numeric', 'numeric', 'numeric']

Your iris variable mapping is
{'sepal_length': (<DataType.Numeric: 1>, 0), 'sepal_width': (<DataType.Numeric: 1>, 1), 'petal_length': (<DataType.Numeric: 1>, 2), 'petal_width': (<DataType.Numeric: 1>, 3), 'species': (<DataType.NonNumeric: 2>, 0)}
and should be
'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3

Your data is a ndarray? True
Your data has 150 samples and 4 variables/dimensions.
It should have 150 samples and 4 variables/dimensions.


#### (ii) Read data separately

In [3]:
iris_filename = 'data/iris.csv'
iris_data = Data()
print('Before calling read...')
print(f"Your iris headers are None and should be None or []\n")

iris_data.read(iris_filename)

print('After calling read...')
print(f'Your file path is {iris_data.filepath} and should be data/iris.csv\n')
print(f"Your iris headers are\n{iris_data.headers}\nand should be\n['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n")
print(f"Your iris variable types are\n{iris_data.types}\nand should be\n['numeric', 'numeric', 'numeric', 'numeric']\n")
print(f"Your iris variable mapping is\n{iris_data.header2col}\nand should be\n'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3\n")
print(f'Your data is a ndarray? {isinstance(iris_data.data, np.ndarray)}')
print(f'Your data has {iris_data.data.shape[0]} samples and {iris_data.data.shape[1]} variables/dimensions.\nIt should have 150 samples and 4 variables/dimensions.')

Before calling read...
Your iris headers are None and should be None or []

After calling read...
Your file path is data/iris.csv and should be data/iris.csv

Your iris headers are
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
and should be
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

Your iris variable types are
['numeric', 'numeric', 'numeric', 'numeric', 'string']
and should be
['numeric', 'numeric', 'numeric', 'numeric']

Your iris variable mapping is
{'sepal_length': (<DataType.Numeric: 1>, 0), 'sepal_width': (<DataType.Numeric: 1>, 1), 'petal_length': (<DataType.Numeric: 1>, 2), 'petal_width': (<DataType.Numeric: 1>, 3), 'species': (<DataType.NonNumeric: 2>, 0)}
and should be
'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3

Your data is a ndarray? True
Your data has 150 samples and 4 variables/dimensions.
It should have 150 samples and 4 variables/dimensions.


#### (iii) Handle error

This should crash, but with your own error message that helps the user identify the problem and how to fix it.

In [4]:
iris_filename = 'data/iris_bad.csv'
iris_data = Data()
iris_data.read(iris_filename)

data type 5.1 not supported: check row 2 of the data file
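
One possible way to produce a message like this is to validate the type row before reading any data. The sketch below assumes that raising a `ValueError` is acceptable; the helper `check_types` and the constant `SUPPORTED_TYPES` are hypothetical names, and the template's instructions take precedence.

```python
# The four type names listed in the CSV overview.
SUPPORTED_TYPES = ('numeric', 'string', 'enum', 'date')

def check_types(types):
    """Raise ValueError if any entry of the type row is not a supported type."""
    for t in types:
        if t.strip() not in SUPPORTED_TYPES:
            raise ValueError(
                f'data type {t.strip()} not supported: check row 2 of the data file')

# A type row that actually contains data values, as if row 2 were missing.
try:
    check_types(['5.1', '3.5', '1.4', '0.2', 'setosa'])
except ValueError as err:
    message = str(err)
    print(message)  # data type 5.1 not supported: check row 2 of the data file
```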



#### (iv) Test spaces

In [5]:
test_filename = 'data/test_data_spaces.csv'
test_data = Data(test_filename)
print(f'Your test data looks like:\n', test_data.data)

Your test data looks like:
 [[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]]


You should see:

    Your test data looks like:
     [[ 1.  2.  3.  4.]
     [ 5.  6.  7.  8.]
     [ 9. 10. 11. 12.]]
     
Pay attention to the data type! The numbers should be floats, not strings (no quotes around them).
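
A minimal sketch of the cleanup involved, assuming stray spaces follow the commas (the raw lines here are invented, not copied from `test_data_spaces.csv`): strip each header, and convert each value with `float` so that it is stored as a number rather than a string.

```python
# Hypothetical raw lines with stray spaces after the commas.
raw_header = 'headers, spaces, bad, places'
raw_row = '1, 2, 3, 4'

# Strip whitespace from names; convert values from str to float.
headers = [h.strip() for h in raw_header.split(',')]
row = [float(v.strip()) for v in raw_row.split(',')]

print(headers)  # ['headers', 'spaces', 'bad', 'places']
print(row)      # [1.0, 2.0, 3.0, 4.0]
```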

### 1b) Test `__str__`

#### (i) Iris data

In [6]:
iris_filename = 'data/iris.csv'
iris_data = Data(iris_filename)
print(iris_data)

sepal_length sepal_width petal_length petal_width 
numeric numeric numeric numeric 
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.

You should get something that looks like:

    -------------------------------
    data/iris.csv (150x4)
    Headers:
    sepal_length	sepal_width	petal_length	petal_width
    Types:
    numeric	numeric	numeric	numeric
    -------------------------------
    Showing first 5/150 rows.
    5.1	3.5	1.4	0.2
    4.9	3.0	1.4	0.2
    4.7	3.2	1.3	0.2
    4.6	3.1	1.5	0.2
    5.0	3.6	1.4	0.2

    -------------------------------
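
A layout like the one above could be assembled as follows. This is a sketch only: `format_table` is a hypothetical standalone helper, the file path and rows are made up, and the width of the dashed bar is arbitrary. In the `Data` class, `__str__` would return a string built in a similar way from the instance variables.

```python
def format_table(filepath, headers, types, rows, max_rows=5):
    """Build a summary string: title, headers, types, and the first rows."""
    bar = '-' * 31
    lines = [bar,
             f'{filepath} ({len(rows)}x{len(headers)})',
             'Headers:',
             '\t'.join(headers),
             'Types:',
             '\t'.join(types),
             bar]
    # Only announce truncation when some rows are actually left out.
    if len(rows) > max_rows:
        lines.append(f'Showing first {max_rows}/{len(rows)} rows.')
    for row in rows[:max_rows]:
        lines.append('\t'.join(str(v) for v in row))
    lines.append('')
    lines.append(bar)
    return '\n'.join(lines)

# Hypothetical six-sample table, so the output is cut off after five rows.
rows = [[10.0, 8.04], [8.0, 6.95], [13.0, 7.58],
        [9.0, 8.81], [11.0, 8.33], [14.0, 9.96]]
summary = format_table('data/example.csv', ['x', 'y'],
                       ['numeric', 'numeric'], rows)
print(summary)
```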

#### (ii) Anscombe quartet data

In [7]:
ans_filename = 'data/anscombe.csv'
ans_data = Data(ans_filename)
print(ans_data)

x y 
numeric numeric 
[[10.    8.04]
 [ 8.    6.95]
 [13.    7.58]
 [ 9.    8.81]
 [11.    8.33]
 [14.    9.96]
 [ 6.    7.24]
 [ 4.    4.26]
 [12.   10.84]
 [ 7.    4.82]
 [ 5.    5.68]
 [10.    9.14]
 [ 8.    8.14]
 [13.    8.74]
 [ 9.    8.77]
 [11.    9.26]
 [14.    8.1 ]
 [ 6.    6.13]
 [ 4.    3.1 ]
 [12.    9.13]
 [ 7.    7.26]
 [ 5.    4.74]
 [10.    7.46]
 [ 8.    6.77]
 [13.   12.74]
 [ 9.    7.11]
 [11.    7.81]
 [14.    8.84]
 [ 6.    6.08]
 [ 4.    5.39]
 [12.    8.15]
 [ 7.    6.42]
 [ 5.    5.73]
 [ 8.    6.58]
 [ 8.    5.76]
 [ 8.    7.71]
 [ 8.    8.84]
 [ 8.    8.47]
 [ 8.    7.04]
 [ 8.    5.25]
 [19.   12.5 ]
 [ 8.    5.56]
 [ 8.    7.91]
 [ 8.    6.89]]


You should get something that looks like:

    -------------------------------
    data/anscombe.csv (44x2)
    Headers:
    x	y
    Types:
    numeric	numeric
    -------------------------------
    Showing first 5/44 rows.
    10.0	8.04
    8.0	6.95
    13.0	7.58
    9.0	8.81
    11.0	8.33

    -------------------------------

#### (iii) Test data with spaces

In [8]:
test_filename = 'data/test_data_spaces.csv'
test_data = Data(test_filename)
print(test_data)

headers spaces bad places 
numeric numeric numeric numeric 
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]]


You should get something that looks like:

    -------------------------------
    data/test_data_spaces.csv (3x4)
    Headers:
    headers	spaces	bad	places
    Types:
    numeric	 numeric	 numeric	 numeric
    -------------------------------
    1.0	 2	 3	 4
    5.0	6	7	8
    9.0	10	11	12

    -------------------------------

#### (iv) Test data with complex data types

In [9]:
test_filename = 'data/test_data_complex.csv'
test_data = Data(test_filename)
print(test_data)

enumstuff numberstuff datestuff 
enum numeric date 
[[0.0000000e+00 4.0000000e+00 1.2938580e+09]
 [0.0000000e+00 3.0000000e+00 1.2939444e+09]
 [1.0000000e+00 2.0000000e+00 1.2940308e+09]
 [2.0000000e+00 1.0000000e+00 1.2941172e+09]
 [2.0000000e+00 5.0000000e+00 1.2942036e+09]
 [1.0000000e+00 6.0000000e+00 1.2942900e+09]
 [3.0000000e+00 7.0000000e+00 1.2943764e+09]
 [1.0000000e+00 8.0000000e+00 1.2944628e+09]
 [1.0000000e+00 9.0000000e+00 1.2945492e+09]
 [2.0000000e+00 1.0000000e+01 1.2946356e+09]
 [0.0000000e+00 1.1000000e+01 1.2947220e+09]
 [0.0000000e+00 1.5000000e+01 1.2948084e+09]
 [3.0000000e+00 1.4000000e+01 1.3253940e+09]
 [0.0000000e+00 1.3000000e+01 1.3254804e+09]
 [2.0000000e+00 1.2000000e+01 1.3255668e+09]]


You should get something that looks like:

    -------------------------------
    data/test_data_complex.csv (15x1)
    Headers:
    numberstuff
    Types:
    numeric
    -------------------------------
    Showing first 5/15 rows.
    4.0
    3.0
    2.0
    1.0
    5.0

    -------------------------------

### 1c) Test get methods

In [10]:
iris_filename = 'data/iris.csv'
iris_data = Data(iris_filename)

print(f"Your iris headers are\n{iris_data.get_headers()}\nand should be\n['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n")
print(f"Your iris variable types are\n{iris_data.get_types()}\nand should be\n['numeric', 'numeric', 'numeric', 'numeric']\n")
print(f"Your iris variable mapping is\n{iris_data.get_mappings()}\nand should be\n'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3\n")
print(f'Your data has {iris_data.get_num_samples()} samples and {iris_data.get_num_dims()} variables/dimensions.\nIt should have 150 samples and 4 variables/dimensions.\n')
print(f'Your 10th sample is\n{iris_data.get_sample(9)}\nand should be \n[4.9 3.1 1.5 0.1]\n')

dat = iris_data.get_all_data()
dat[0,:] = -9999
new_dat = iris_data.get_all_data()
if new_dat[0, 0] == -9999.:
    print('!!You did not return a copy of your data!!\n')

Your iris headers are
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
and should be
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

Your iris variable types are
['numeric', 'numeric', 'numeric', 'numeric', 'string']
and should be
['numeric', 'numeric', 'numeric', 'numeric']

Your iris variable mapping is
{'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3}
and should be
'sepal_length': 0, 'sepal_width': 1, 'petal_length': 2, 'petal_width': 3

Your data has 150 samples and 5 variables/dimensions.
It should have 150 samples and 4 variables/dimensions.

Your 10th sample is
[[4.9 3.1 1.5 0.1]]
and should be 
[4.9 3.1 1.5 0.1]



TypeError: 'NoneType' object does not support item assignment
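
A `TypeError` like this is what Python raises when item assignment is attempted on `None`, which is what a method returns when it has no `return` statement. The sketch below uses a hypothetical stand-in class, `TinyData`, to show one way `get_all_data` can return a copy and `get_sample` can return a 1-D row; it is not the template's implementation.

```python
import numpy as np

class TinyData:
    """Hypothetical stand-in for the Data class, holding only an ndarray."""

    def __init__(self, data):
        self.data = data

    def get_sample(self, rowInd):
        # An integer index gives a 1-D row; self.data[[rowInd]] would be 2-D.
        return self.data[rowInd]

    def get_all_data(self):
        # Return a copy so that callers cannot modify the stored array.
        return self.data.copy()

tiny = TinyData(np.array([[5.1, 3.5], [4.9, 3.0]]))
dat = tiny.get_all_data()
dat[0, :] = -9999                  # changes only the copy
print(tiny.get_all_data()[0, 0])   # 5.1
print(tiny.get_sample(1).shape)    # (2,)
```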