In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("Lab_1_arrays_and_dictionaries.ipynb")

# Lab 1: Statistical analysis of data using numpy

Read the Lab slides before starting this problem: https://docs.google.com/presentation/d/1lVYGqoStt0ZdnRAYMfF9Km6f0NgMNkuYgINsRhXASwI/edit?usp=sharing

Additional resources: To see what you're doing laid out visually, see the Miro board https://miro.com/app/board/uXjVOWs8R4Y=/?share_link_id=567398312390 (Lab 1 frame), which shows the mapping between the tasks, the code, and the Lab slides.

Motivation: Whether you're in engineering or business or health care - almost any field nowadays - you need to be able to work with data. Just about everything that touches a computer now has the ability to store data. Most of this data will be numbers, but sometimes it will be qualitative data (think 3 people like this, 10 people don't).

You can do a lot of data analysis with spreadsheets, but at some point it's almost always easier to write some code to either *put* data into a spreadsheet in a form that's useful, to *pull* specific data from one (or more) spreadsheets, or to automate some processes (like creating six custom plots from this month's data showing price trends). Being able to write a bit of code to clean up or re-purpose data is really useful, and not too difficult.

- Lab week 1: Read in data, re-arrange it, and use it to do (text-based) statistical analysis
- Lab week 2: Same thing again, but this time with functions so you can re-use code
- Lab week 3: Plot the data you worked with in lab weeks 1 & 2
- Homework weeks 1, 2 & 3:
   - Make the code more general, so you can look at different data channels
   - Make nicer plots

Some notes on the data you'll be working with. This is real data captured by a robotic hand designed to pick fruit. The hand is instrumented with several sensors (IMUs in each of the three fingers, force and torque information at the wrist, and information from the motors driving the three fingers). Each of these sensors outputs a data stream, which we've stored in a csv file.

Big picture: We want to know if we can detect if the apple was picked or not from the sensor data. Each row of the Data/proxy_pick_data.csv file is data from a single picking trial. Each group of n columns represents one time step. We want to plot/analyze data from different data channels to see if there is a difference between the successful and unsuccessful picks.

For this lab the goal is to pull out one data channel (the wrist torque sensor) and print out statistics for failed versus successful picks. Yes, you could do all of this by manually going into the spreadsheet, sorting
columns, and setting up some spreadsheet formulas. That works for one data channel... but what if you want to do a different one? Or the data file format changes because someone added another sensor? Or you're asked to throw out the biggest n samples?

Yes, this is going to be frustrating/seem like a lot of work for nothing the first time you do it. The point is not
to do this particular task, but to learn how to access data in dictionaries, lists, and numpy arrays to "pull out"
data that you're interested in. Yes, I could just tell you to use numpy slicing to pull out every 15th column,
starting with the 3rd column, and sort by the last column, but where would the fun be in that?

Note: Next week we'll take what we write here and put it into functions.

Note 2: The data structure is very similar to the one you used in the lecture activity

In [None]:
# Safety check - json ships with Python, so it never needs installing; numpy does
#  If numpy is already installed this should spit out a bunch of "requirement already satisfied" messages
import sys
!{sys.executable} -m pip install numpy

In [None]:
# Libraries that we need to import - numpy and json (for loading the description file)
import numpy as np
import json

## Reading in data
TODO First step, read in the data from **Data/proxy_pick_data.csv** and put it in a numpy array pick_data. Don't forget to set the delimiter.
 - to find out more about the numpy method **loadtxt**, Google *numpy loadtxt* 
 - there is also an example in a_tutorial_numpy.ipynb
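
Here is a minimal sketch of what the delimiter does in **loadtxt**. It reads made-up comma-separated text from memory (using `io.StringIO`) instead of the lab's file, so the variable names and values are for illustration only.

```python
import io
import numpy as np

# Made-up CSV text standing in for a file on disk (NOT the lab data)
fake_csv = io.StringIO("1.0,2.0,0\n3.0,4.0,1\n")

# delimiter="," tells loadtxt the values are separated by commas
#   (the default is whitespace, which would fail on this text)
toy_data = np.loadtxt(fake_csv, delimiter=",")

# Rows are in shape[0], columns in shape[1]
print(f"Shape of toy data: {toy_data.shape}")
```

With a real file you pass the file name as the first argument instead of the `io.StringIO` object.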


In [None]:
# EXAMPLE CODE
# The example code for the data load is in a_tutorial_numpy.ipynb - search for loadtxt

In [None]:

    
# TODO - put the code to load the data here.
pick_data = ...

In [None]:
# EXAMPLE CODE
#  See the "What are the dimensions of the data?" in lec_act_1_data_structures.ipynb

# Make the same data set we used in lec_act_1_data_structures.ipynb - just a bit more concisely
# Make space
my_test_data = np.zeros((5, 3 * 10 + 1))
# Last column
my_test_data[0::2, -1] = 1

# x-data
x_data_for_one_row = np.linspace(start=0, stop=1.0, num=10)
for r in range(0, my_test_data.shape[0]):
    # loop through each row r
    # Fill in every 3rd column, starting at column 0 and stopping before the end (don't overwrite the good/bad)
    my_test_data[r, 0:-2:3] = x_data_for_one_row

# y-data
my_test_data[:, 1::3] = np.random.uniform(-1.0, 0.0, size=(5, 10))

# z-data
my_test_data[:, 2::3] = np.random.uniform(10.0, 20.0, size=(5, 10))

# Number of rows is in shape[0], columns in shape[1]
num_rows = my_test_data.shape[0]


In [None]:
# TODO 
#  - set the n_picks variable. Do NOT just put in a number - use the variable pick_data to calculate this
#  - change the print line to print out the number of picks
n_picks = ...
print(f"Number of picks: {}")

In [None]:
# EXAMPLE CODE
# Remember you want all of the rows - that's what the : is for
# To get just the last column, use -1
get_last_column = my_test_data[:, -1]
print(f"Last column: {get_last_column}")

# An example of count_nonzero
#  Make a numpy array with 10 elements, all set to 1
an_array_of_ten_ones = np.ones((10))
# Count the number of ones
print(f"How many ones? {np.count_nonzero(an_array_of_ten_ones == 1.0)}")

# Sum will return a double, not an integer. Use int() to change a double to an integer
#   You can tell it's a double by the 10.0 on the print out
print(f"Sum of array as double {np.sum(an_array_of_ten_ones)} and integer {int(np.sum(an_array_of_ten_ones))}")

In [None]:
# TODO - set the variable n_successful and print it out. Do NOT just put in a number - use the variable pick_data
#   Calculating the number of successful picks: The number of values in the last column that are 1 (use np.count_nonzero)

n_successful = ...
print(f"Number of successful picks: {}")

In [None]:
grader.check("count_rows")

#### JSON, lists, and dictionaries: Getting information from a file
The format of the spreadsheet data is given in Data/proxy_data_description.json. 

TODO: Open up the file using VSCode (just click on Data then click on the file) and look through it to see if it makes sense. Also open up proxy_pick_data.csv the same way and make sure you understand the data format (see slides).

- Step 1 (this problem): Figure out how to get the "Data channels" list out of pick_data_description
Note: **pick_data_description** is a dictionary.

- Step 2 (this problem): Find the size of the list

- Step 3 (next problem): Find the total number of dimensions of the data and the number of time steps
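
The steps above can be tried out on a small, made-up description first. This sketch parses JSON text from a string (the keys "Title" and "Sensors" are invented for the example and are not the ones in the lab's file), pulls a list out of the resulting dictionary, and counts its elements.

```python
import json

# Made-up JSON text (hypothetical keys, NOT the lab's description file)
toy_json_text = '{"Title": "toy", "Sensors": [{"name": "a"}, {"name": "b"}]}'

# json.loads parses a string; json.load (used in this lab) parses an open file
toy_description = json.loads(toy_json_text)

# Look up the list by its key, then count its elements with len()
toy_sensors = toy_description["Sensors"]
n_toy_sensors = len(toy_sensors)

print(f"Type: {type(toy_sensors).__name__}, number of elements: {n_toy_sensors}")
```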

In [None]:
# This reads in the json data
# Try-except is just a fancy if-then statement that says if the file is not found, spit out the print statement (instead of
#  the usual incomprehensible python error messages)
try:
    with open("Data/proxy_data_description.json", "r") as fp:
        pick_data_description = json.load(fp)
except FileNotFoundError:
    print(f"The file was not found; check that the data directory is in the current one and the file is in that directory")


### How many sensor data channels?

TODO: Figure out how many different data channels there are.

In [None]:
# EXAMPLE CODE, step 1

my_test_dictionary = {"Key 1 name": "Name",
                      "Key 2 data list": [1, 2, 3]}

# Get list out of the dictionary
list_from_dictionary = my_test_dictionary["Key 2 data list"]
print(f"List {list_from_dictionary}")

# Sum up all of the elements in the list
sum_elems = 0
for item in list_from_dictionary:
    sum_elems += item
                    
print(f"Sum of elements in list is: 1+2+3 = {sum_elems}")

In [None]:
# TODO - use the key "Data channels" to get out the list of data channels from pick_data_description
data_channels = ...
# How many elements does the list have in it?
number_of_data_channels = ...

In [None]:
grader.check("read_json")

### Step 3: Loop over the data channels and add up the total number of dimensions

TODO: Turn this pseudo code into real code

- total number of dimensions = 0

- for each channel in data_channels list
   - add in the number of dimensions (key is "dimensions")

Check in **proxy_pick_data.csv** that the number of dimensions you found was where the data repeats itself

Stuck? Try printing out **pick_data_description** and match that to what you see in the json file. Try getting the first element out (is it a list or a dictionary? How do you access a list or a dictionary element?) and printing it. Repeat until you're sure you know how to get the number of dimensions of the first channel.

Now put it in a **for** loop, looping over the list. Print out each element in the list in the **for** loop.

Now change the print statement to just print out the **dimensions** value.

Now you can do the sum - you can use **x = x + v** or **x += v**.
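
The print-then-sum approach described above might look like this on a made-up list of dictionaries. The keys "label" and "count" are invented for the example; the lab data uses different keys.

```python
# Made-up list of dictionaries (hypothetical keys, NOT the lab data)
toy_items = [{"label": "a", "count": 3},
             {"label": "b", "count": 4},
             {"label": "c", "count": 1}]

toy_total = 0
for toy_item in toy_items:
    # Each element of the list is a dictionary, so get the value out by key
    print(f"{toy_item['label']} has count {toy_item['count']}")
    # Same as toy_total = toy_total + toy_item["count"]
    toy_total += toy_item["count"]

print(f"Total count: {toy_total}")
```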

In [None]:
n_total_dims = 0
# TODO 1: turn this pseudo code into real code. 
# for each item in data channels
#    Get the number of dimensions in that item and add it to n_total_dims
#  Note that each item in data channels is a dictionary - so you'll have to get the number out of the dictionary
#
# TODO: Fill out the print statement with the number of items in the data channels list, and 
# the total number of dimensions you calculated
print(f"Number of data channels items in list: {}, total summed number of dimensions: {n_total_dims}")

In [None]:
# EXAMPLE CODE
num_channels_test_data = 3  # x, y, z - we made the data with three channels

# Get all of the test data EXCEPT the last column
#  The first : is all the rows, the :-1 is all but the last
just_xyz_test_data = my_test_data[:, :-1]

# Get the last column (successful/not)
#   The first : is all the rows, the -1 is JUST the last column
just_successful_test_data = my_test_data[:, -1]

# TODO: Look at both of the above variables in the variable window

# This is the shape of the original data, minus one column
expected_shape = (my_test_data.shape[0], my_test_data.shape[1]-1)
# This one should be that size
assert(just_xyz_test_data.shape == expected_shape)

# And this one is number of rows * 1 size
assert(just_successful_test_data.size == 5)

# Now do the actual divide to get the number of time steps
n_time_steps_test_data = just_xyz_test_data.shape[1] // num_channels_test_data

In [None]:

# TODO: Split pick_data into two parts - the actual channel data and the last column
pick_channel_data = ...
pick_successful = ...


In [None]:

# TODO: Calculate number of time steps
n_time_steps = ...
print(f"Number of time steps: {n_time_steps}")

In [None]:
grader.check("number_dimensions")

##### Data slicing to get out the Wrist torque data

Practice slicing - pull out the X, Y, Z data for the Wrist torque channel for all picks

You are free to use the fact that the name of the data channel you want is Wrist torque, but you should get the actual offset index value from the dictionary, not just do index_wrist_torque_start_index = 3 (suppose someone changed the order of the data...).

There are several ways to do this; the simplest is to loop through all of the data channels looking for the one
that is called "Wrist torque" and then set the index offset value from that. It would be a good idea to check that you actually found the right starting index by looking at the .json file. Don't forget that numpy indexes from 0.

Note: Use ==, not **is**, for the string comparison. 

We'll do this in two parts (second part is next question): 

- TODO: Get the start index from the dictionary
- TODO: Slice pick_channel_data to get out just the Wrist Torque x,y,z data
- TODO: Optional: Reshape this slice to be n_picks X xyz X time data (a 3 dimensional matrix of data)
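
A small sketch of the "loop and look for a name" idea, on a made-up list of channels. The channel names and offsets here are invented; only the pattern (compare with ==, remember the value you found) carries over.

```python
# Made-up channel list (NOT the contents of the lab's json file)
toy_channels = [{"name": "Alpha", "index_offset": 0},
                {"name": "Beta", "index_offset": 4}]

wanted_name = "Beta"
found_offset = -1  # Not a valid offset, so we can tell if nothing matched

for toy_channel in toy_channels:
    # == compares the text of the two strings; "is" asks if they are the same object
    if toy_channel["name"] == wanted_name:
        found_offset = toy_channel["index_offset"]

print(f"Channel {wanted_name} has offset {found_offset}")
```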

In [None]:
# EXAMPLE CODE

# These are examples of how to get data out of the pick_data_description data structure
#   Reminder that you already stored the list "Data channels" in the variable data_channels

# Grab the second dictionary in the list of dictionaries
get_second_dictionary_in_list = data_channels[1]

# Look at proxy_data_description.json - this is one of the dictionaries in that file 
print(f"What is in one dictionary entry:\n {get_second_dictionary_in_list}")

# Using the "name" key to get the name stored in this dictionary
name_in_dictionary = get_second_dictionary_in_list["name"]

# Using the "index_offset" key to get the starting index
start_index_in_dictionary = get_second_dictionary_in_list["index_offset"]

print(f"Channel {name_in_dictionary} starts at {start_index_in_dictionary}")

In [None]:
# This is the name we're searching for. Using a variable so that we can change from Wrist torque to something else later

channel_name = "Wrist torque"
index_wrist_torque_offset = -1  #  Set it to a value that is NOT a valid index
# TODO: Turn this pseudo code into real code
# for each channel in data channels
#     if this channel's name is the one I'm looking for    
#         Set index_wrist_torque_offset to that channel's index offset


# Check that you actually set the value somewhere in the loop - this is "defensive coding"
if index_wrist_torque_offset == -1:
    print(f"Error: No channel {channel_name} found")
    
print(f"Offset for wrist torque: {index_wrist_torque_offset}")

In [None]:
grader.check("channel_index")

### Step 2 - Now use slicing to get out all of the wrist torque data

The goal is to slice the data to get out a numpy array that is n_picks by n_time_steps*3. The 3 is because we have x, y, and z data. This is a bit like the way **my_test_data** was created (create an empty array, set the x, then the y, then the z data)

- First, use the slice operator selecting all rows and columns, **data[:, :]**
- Now change the column slice from all columns (:) to starting at the offset value you just calculated.
- Now change the slice to take a step/stride of **n_total_dims** instead of 1
- Reminder: slicing is  **start:end:step**

Note: You don't need to put an end value in - just leave it blank if you want to go all the way to the end

Remember: The data is in **pick_channel_data**, not **pick_data_description**
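
To see what a start and a step do on both sides of an assignment, here is a sketch on a tiny made-up array. It assumes 5 values per time step and a 2-value channel starting at offset 1; all of the names and sizes are invented for the example.

```python
import numpy as np

# Made-up layout (NOT the lab data): 2 rows, 3 time steps, 5 values per time step
toy_n_total = 5       # how many columns until the data repeats
toy_offset = 1        # where the channel we want starts within each time step
toy_n_dims = 2        # the channel we want has 2 values per time step
toy_all = np.arange(2 * 3 * toy_n_total).reshape(2, 3 * toy_n_total)

# Make space: same number of rows, 3 time steps * 2 values
toy_channel = np.zeros((2, 3 * toy_n_dims))

# Left side: start at 0, step by toy_n_dims. Right side: start at the offset, step by toy_n_total
toy_channel[:, 0::toy_n_dims] = toy_all[:, toy_offset::toy_n_total]
# Second value of the channel: both starting indices go up by 1
toy_channel[:, 1::toy_n_dims] = toy_all[:, toy_offset + 1::toy_n_total]

print(f"First row of original: {toy_all[0, :]}")
print(f"First row of channel:  {toy_channel[0, :]}")
```

Both sides of each assignment select the same number of columns (one per time step), which is why the shapes match.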



In [None]:
# We know that this channel's data has x,y,z values (3 dimensions). Use a variable instead of just the number 3
#  in case we want to change it later
n_dims_for_wrist_torque_data = 3
# Create space for the data
wrist_torque_data = np.zeros((n_picks, n_time_steps * n_dims_for_wrist_torque_data))

#

# TODO 1 Fix this to copy the x data into wrist_torque_data
#  On the left-hand side, the columns should start at zero and increment by 3 (n_dims_for_wrist_torque_data)
#  On the right-hand side, the columns should start at index_wrist_torque_offset and increment by n_total_dims
wrist_torque_data[:, :] = pick_channel_data[:, :]

# TODO 2 Copy the line above and change it to copy the y data. Hint: The increments are the same, but the starting
#  index should be 1 more than the x starting index (on both the left and the right)

# TODO 3 Now do the same for the z data
#  Option 1: Do the same thing you did for y, but add 2 instead of 1
#  Option 2: Edit your y code to do all three (x,y, and z) with a for loop
#    The for loop should go from 0 to 2   for i in range(0, 3):
#    Replace the +1 with +i 

# Cleaning up - if you used numbers for the start index and/or the slice spacing, replace those with index_wrist_torque_offset,
#  n_total_dims, and n_dims_for_wrist_torque_data

print(f"Shape of wrist_torque_data is {wrist_torque_data.shape}, should be 660 X 120")
print(f"  120 is 40 (number of time steps) * 3 (x,y,z)")
print(f"First row, first column value {wrist_torque_data[0, 0]:0.2f}, should be -0.27")
print(f"First row, last column value {wrist_torque_data[0, -1]:0.2f}, should be -0.08")
print(f"Last row, first column value {wrist_torque_data[-1, 0]:0.2f}, should be -0.35")
print(f"Last row, last column value {wrist_torque_data[-1, -1]:0.2f}, should be -0.05")



In [None]:
grader.check("slicing")

### Min/Max/Mean/SD of the x, y, and z values

Now that the wrist torque data is nicely separated out, find the min, max, mean and standard deviation of each of the x, y, and z channels. Put the result into a dictionary.
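
As a quick reminder of the four numpy calls involved, here is a sketch on a tiny made-up array (not the wrist torque data). Note that **np.std** computes the population standard deviation by default (it divides by N, not N-1).

```python
import numpy as np

# Made-up values (NOT the lab data)
toy_values = np.array([[1.0, 2.0],
                       [3.0, 4.0]])

# With no axis argument, each function uses every element in the array
toy_stats = {"Min": np.min(toy_values),
             "Max": np.max(toy_values),
             "Mean": np.mean(toy_values),
             "SD": np.std(toy_values)}

print(f"Stats: {toy_stats}")
```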

In [None]:
# EXAMPLE CODE
# Get the min/max of the x values of the test data (should be 0 and 1) and store it in a dictionary

# Since we have x,y, and z, we'll want a list to store the stats for each dimension
my_list_of_stats = []

# Since we need to do both min and max, create a variable that has the x slice
#    The : says all of the rows; 0:-1:3 says start at column 0, stop before the last
#    column (the successful/not column), and take every 3rd column
x_slice = my_test_data[:, 0:-1:3]

# Put the results of min/max in a dictionary
my_dict = {"Min" : np.min(x_slice),
           "Max" : np.max(x_slice)}

# Put the dictionary with the x min and max into the list
my_list_of_stats.append(my_dict)

print(f"Stats {my_list_of_stats}")

In [None]:
# SCRATCH CELL
# Try editing the above code to do the y and z channels as well - the result should be a list with three elements
#  Option 1: Copy the code (from x_slice through the append ) and then change 0 to 1 to do the y channel.
#  Option 2: Use a for loop over i=0,1,2 and change the 0 to an i
#    Change the variable name from x_slice to something like cur_slice, since it will be the x slice, then the y, then the z

In [None]:
wrist_torque_stats_list = []

# TODO For each of the x,y, and z data channels, calculate the min, max, mean and standard deviation. 
#   Store the values in a dictionary with the keys "Min", "Max", "Mean", and "SD"
#   Put the dictionaries into the wrist_torque_stats_list list
# Your output should look like Data/Lab1_check_results.json


In [None]:
# TEST CODE
#   The correct answers are in Lab1_check_results.json. You can write test code here to check
#   each value in turn, make sure the slicing is the correct size. This will not be graded.

In [None]:
grader.check("statistics")

## Boolean slicing to get successful versus unsuccessful picks out

TODO: Calculate the min of the z values for the successful picks, and the min of the z values for the unsuccessful picks.

The main difference between this problem and the previous one is that in this one you only use some of the rows (instead of all of them like the last problem). The column slice stays the same, but the row slice changes. We're going to use Boolean indexing to do this.

- Step 1: Create a boolean index that is True if the pick is successful, False if it is not
- Step 2: Use the boolean index to select the rows - only select rows where the index is True
- Step 3: Do the same thing again, but this time select rows that are False

Remember that the successful/unsuccessful pick data was stored in **pick_successful**.

Note: Use **== 1**, not **is 1** in the comparison

Use **wrist_torque_data** for the data; remember that the z data starts at column 2, not 0, and the spacing is 3 (x,y,z). Use the variable **n_dims_for_wrist_torque_data** instead of the number 3 in your final version.
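
Here is a sketch of boolean indexing on a tiny made-up array, including one way to get the "False" rows: the **~** operator flips every True to False and vice versa. Comparing with **== 0** would be another option. All names and values are invented for the example.

```python
import numpy as np

# Made-up data (NOT the lab data): one flag per row
toy_flags = np.array([1.0, 0.0, 1.0])
toy_rows = np.array([[10.0, 11.0],
                     [20.0, 21.0],
                     [30.0, 31.0]])

# True where the flag is 1, False otherwise
is_one = toy_flags == 1

# Keep only the rows where the boolean index is True
rows_where_one = toy_rows[is_one, :]
# ~ inverts the boolean index, so this keeps the other rows
rows_where_not_one = toy_rows[~is_one, :]

print(f"Boolean index: {is_one}")
print(f"Rows kept: {rows_where_one.shape[0]}, rows left out: {rows_where_not_one.shape[0]}")
```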

In [None]:
# EXAMPLE CODE

# This creates a boolean index that is True where the data is 1, False otherwise. (Look at it in the variable window
#  and compare it to just_successful_test_data)
b_is_true_and_false = just_successful_test_data == 1

# This gets out just the rows that are true - look at it in the variable window
my_test_data_only_good = my_test_data[b_is_true_and_false, :]

# Now do the min over the z values
xyz_skip = 3 
z_start_offset = 2
min_z = np.min(my_test_data_only_good[:, z_start_offset::xyz_skip])

In [None]:
# TODO: Create a boolean array to pick out the successful rows
bool_array_successful = ...

# TODO: Now use that boolean array plus column slicing to calculate the min of the z values
min_wrist_torque_successful_z = ...

# TODO: Repeat for unsuccessful picks
bool_array_unsuccessful = ...
min_wrist_torque_unsuccessful_z = ...

print(f"Successful: Minimum value {min_wrist_torque_successful_z:0.4f} of wrist torque z channel")
print(f"Unsuccessful: Minimum value {min_wrist_torque_unsuccessful_z:0.4f} of wrist torque z channel")

In [None]:
grader.check("boolean_slicing")

## Hours and collaborators
Required for every assignment - fill out before you hand it in.

Listing names and websites helps you to document who you worked with and what internet help you received in the case of any plagiarism issues. You should list names of anyone (in class or not) who has substantially helped you with an assignment - or anyone you have *helped*. You do not need to list TAs.

Listing hours helps us track if the assignments are too long.

In [None]:

# List of names (creates a set)
worked_with_names = {"not filled out"}
# List of URLS TCW3 (creates a set)
websites = {"not filled out"}
# Approximate number of hours, including lab/in-class time
hours = -1.5

In [None]:
grader.check("hours_collaborators")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Submit just the .ipynb file to Gradescope (Lab 1 Arrays and dictionaries). You do not need to submit the data files. Don't change the provided variable names or autograding will fail. Make sure to do a restart and run all before turning in. Look at the Gradescope grading rubric for code-quality checks.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)