## DS2000 & DS2001 Programming with Data & Practicum
# Homework 2: Arrays and Tables 

### 63 Points

## Due Saturday, January 28 by 10:00PM

**Reading**: 
* [Chapter 4: Data Types](https://inferentialthinking.com/chapters/04/Data_Types.html)
* [Chapter 5: Sequences](https://www.inferentialthinking.com/chapters/05/Sequences.html)
* [Chapter 6: Tables](https://www.inferentialthinking.com/chapters/06/Tables.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

Start early so that you can come to office hours if you're stuck. Check Canvas for the [office hours schedule](https://northeastern.instructure.com/courses/141472/assignments/syllabus).
 
Throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. 

In [None]:
# Don't change this cell; just run it.
import numpy as np
from datascience import *
import otter
grader = otter.Notebook("hw02.ipynb")

## 1. Creating Arrays


An array acts like a container holding a sequence of data items, and these items must have the same data type. We can use the `make_array` function from the datascience package to create an array.  Below are some examples.

In [None]:
fruits = make_array('apple', 'orange','watermelon')
fruits

Don't worry about `dtype='<U10'` in the output above; it means "Unicode string of maximum length 10".

In [None]:
numbers = make_array(2, 5, 6, 8, 10)
numbers
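
The same-type rule described above can be seen by mixing types on purpose. The sketch below is only an illustration: it uses `np.array` as a stand-in for `make_array` (which also returns a NumPy array), and the variable names are made up for this example.

```python
import numpy as np

# np.array is used here as a stand-in for make_array.
whole_and_decimal = np.array([1, 2.5, 3])   # the integers are converted to floats
number_and_text = np.array([1, "two"])      # the number is converted to a string

print(whole_and_decimal.dtype)       # float64
print(number_and_text.dtype.kind)    # U, meaning Unicode string
print(number_and_text.item(0) == "1")  # True: the 1 is now stored as text
```

Because every item ends up with one shared type, it is usually best to keep numbers and strings in separate arrays.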

Many useful functions on numbers are included in the `math` module.  Below we illustrate some function calls using the math module.

In [None]:
# the list of functions in the math module can be found at
# https://docs.python.org/3/library/math.html
# some examples below

import math 

print(math.sin(2.5))   # math.sin() returns the sine of a number
print(math.sqrt(64))   # math.sqrt() returns the square root of a number
print(math.ceil(1.4))  # math.ceil() rounds up to the nearest integer
print(math.floor(1.4)) # math.floor() rounds down to the nearest integer
print(max(numbers))    # max returns the largest number among those in the numbers array from above

**Question 1.1. (2 pts)** Make an array called `weird_numbers` containing the following numbers (in the given order):

1. -2
2. the sine of 1.2
3. the square root of 3
4. 5 to the power of 1.2


*Note:* In DS2000, we use NumPy arrays, which are different from Python lists. If you don't know what Python lists are, that is okay. If you have worked with Python before, please make an **array**, not a Python list.


In [None]:
weird_numbers = ...
weird_numbers 

In [None]:
# Testing Question 1.1
grader.check("q11") 

**Question 1.2. (2 pts)** Make an array called `book_title_words` containing the following three strings: "Eats", "Shoots", and "and Leaves".

In [None]:
book_title_words = ...
book_title_words

In [None]:
# Testing Question 1.2
grader.check("q12")

Strings have a method called `join`.  `join` takes one argument, an array of strings.  It returns a single string.  Specifically, the value of `a_string.join(an_array)` is a single string that's the [concatenation](https://en.wikipedia.org/wiki/Concatenation) ("putting together") of all the strings in `an_array`, **except** `a_string` is inserted in between each string.  Let's see an example to clarify.

In [None]:
month_array = make_array("Jan", "Feb", "Dec")
month_string = "#".join(month_array)
print(month_string) 

birthday_array = make_array("12", "4", "1990")
birthday_string = "/".join(birthday_array)
print(birthday_string)
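
A few more cases may help show that the separator only ever appears *between* items. This is a small sketch with made-up names, using `np.array` in place of `make_array`.

```python
import numpy as np

letters = np.array(["a", "b", "c"])

dashed = "-".join(letters)               # separator between each pair of items
glued = "".join(letters)                 # an empty separator just glues the items together
single = "-".join(np.array(["solo"]))    # one item, so there is nothing to separate

print(dashed)   # a-b-c
print(glued)    # abc
print(single)   # solo
```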

**Question 1.3. (3 pts)** Use the array `book_title_words` and the method `join` to make two strings:

1. "Eats, Shoots, and Leaves" (call this one `with_commas`)
2. "Eats Shoots and Leaves" (call this one `without_commas`)

In [None]:
with_commas = ...
without_commas = ...

# These lines are provided just to print out your answers.
print('with_commas:', with_commas)
print('without_commas:', without_commas)

In [None]:
# Testing Question 1.3
grader.check("q13")

## 2. Indexing Arrays


These exercises give you practice accessing individual elements of arrays.  In Python (and in many programming languages), elements in an array are accessed by *index*.  One important aspect about indexing in computer science is that the **first index is 0** -- so the first element is the element at index 0, the second element is the element at index 1, etc.

*Note:* Please don't use bracket notation when indexing (i.e. `arr[0]`), as this can yield different data type outputs than what we will be expecting. This can cause you to fail an autograder test.

In [None]:
# An example -- creating an array and indexing elements
string_array = make_array('h','ell','o')

print(string_array.item(0))   # print the first element/item, which is at index 0
print(string_array.item(1))   # index 1, the second element
# the size attribute gives the number of elements in an array
string_array.size
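
The note above about bracket notation can be illustrated with a short sketch. The names are made up for this example and `np.array` stands in for `make_array`. Both forms reach the same element, but they hand back different types.

```python
import numpy as np

values = np.array([10, 20, 30])

via_item = values.item(0)      # a plain Python int
via_brackets = values[0]       # a NumPy integer, not a plain int

print(type(via_item))          # <class 'int'>
print(isinstance(via_brackets, np.integer))  # True
print(via_item == via_brackets)              # True: same value, different type
```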

**Question 2.1. (2 pts)** The cell below creates an array of some numbers.  Set `third_element` to the third element of `some_numbers`.

In [None]:
some_numbers = make_array(-1, -3, -6, -10, -15)

third_element = ...
third_element

In [None]:
grader.check("q21")

**Question 2.2. (4 pts)** The next cell creates a table that displays some information about the elements of `some_numbers` and their order.  Run the cell to see the partially-completed table, then fill in the missing information (the cells that say "Ellipsis") by assigning `blank_a`, `blank_b`, `blank_c`, and `blank_d` to the correct elements in the table.

In [None]:
blank_a = ...
blank_b = ...
blank_c = ...
blank_d = ...
elements_of_some_numbers = Table().with_columns(
    "English name for position", make_array("first", "second", blank_a, blank_b, "fifth"),
    "Index",                     make_array(blank_c, 1, 2, blank_d, 4),
    "Element",                   some_numbers)
elements_of_some_numbers

In [None]:
# Testing Question 2.2
grader.check("q22")

**Question 2.3. (1 pt)** You'll sometimes want to find the *last* element of an array.  Suppose an array has 142 elements.  What is the index of its last element?

In [None]:
index_of_last_element = ...

In [None]:
# Testing Question 2.3
grader.check("q23")

More often, you don't know the number of elements in an array (its *size* or *length*).  (For example, it might be a large dataset you found on the Internet.)  The attribute `size` gives the size of an array as an integer. For example, `some_numbers.size` evaluates to 5.  Try running it below.

In [None]:
# note that size is not followed by parentheses
some_numbers.size
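
Since indexing starts at 0, the index of the last element is always one less than the size. Here is a minimal sketch on a made-up four-element array (again using `np.array` in place of `make_array`).

```python
import numpy as np

letters = np.array(["a", "b", "c", "d"])

last_index = letters.size - 1          # 4 elements, so the indices run from 0 to 3
last_letter = letters.item(last_index)

print(last_index)    # 3
print(last_letter)   # d
```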

Other array methods include `sum` and `mean` (which computes the average).  Try running the following cells.

In [None]:
# note sum is followed by parentheses
some_numbers.sum()

In [None]:
# note mean is also followed by parentheses
some_numbers.mean()
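
As a quick check on how these three fit together, the mean is the sum divided by the size. The array below is made up for illustration, with `np.array` standing in for `make_array`.

```python
import numpy as np

scores = np.array([2, 4, 6, 8])

total = scores.sum()       # 2 + 4 + 6 + 8
average = scores.mean()    # total divided by the number of elements

print(total)                            # 20
print(average)                          # 5.0
print(average == total / scores.size)   # True
```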

Let's load a csv file `president_births.csv` which contains a table of all US Presidents who have passed away (except George H. W. Bush who passed away November 30, 2018) sorted by "Birth Year".  Run the next cell.

In [None]:
presidents = Table.read_table("president_births.csv")
presidents

**Question 2.4. (2 pts)** What is the most recent birth year of any deceased president in the table?  Assign that year to `most_recent_birth_year`.

*Hint: Because the table is sorted by birth year, the most recent birth year is the last element of the `Birth Year` column of the `presidents` table.*

In [None]:
most_recent_birth_year = ...
most_recent_birth_year

In [None]:
# Testing Question 2.4
grader.check("q24")

**Question 2.5. (3 pts)** Assign `sum_of_birth_years` to the sum of the first, tenth, and last birth years in the `presidents` table.

In [None]:
sum_of_birth_years = ...

In [None]:
# Testing Question 2.5
grader.check("q25")

**Question 2.6. (2 pts)** What is the average life span (in years) of the presidents in the table?  Assign that average to `average_life_span`.

In [None]:
average_life_span = ...
average_life_span

In [None]:
# Testing Question 2.6
grader.check("q26")

## 3. Basic Array Arithmetic


**Question 3.1. (2 pts)** Multiply the numbers 42, 4224, 42422424, and -250 by 157. Assign each variable below such that `first_product` is assigned to the result of $42 * 157$, `second_product` is assigned to the result of $4224 * 157$, and so on. 

For this question, **don't** use arrays.

In [None]:
first_product = ...
second_product = ...
third_product = ...
fourth_product = ...
print(first_product, second_product, third_product, fourth_product)

In [None]:
# Testing Question 3.1
grader.check("q31")

**Question 3.2. (3 pts)** Now, do the same calculation, but using an array called `numbers` and only a single multiplication (`*`) operator.  Store the 4 results in an array named `products`.
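
As a reminder of how array arithmetic works in general, an arithmetic operator is applied to every element at once. The sketch below uses unrelated, made-up data and `np.array` in place of `make_array`.

```python
import numpy as np

# An array combined with a single number: the number is applied to each element.
lengths_in_feet = np.array([1, 2, 3])
lengths_in_inches = lengths_in_feet * 12
print(lengths_in_inches)    # [12 24 36]

# Two arrays of the same size: elements are paired up by position.
opening = np.array([10.0, 12.5, 9.0])
closing = np.array([11.0, 12.0, 9.5])
change = closing - opening
print(change)               # [ 1.  -0.5  0.5]
```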

In [None]:
numbers = ...
products = ...
products

In [None]:
# Testing Question 3.2
grader.check("q32")

**Question 3.3. (2 pts)** Oops, we made a typo!  Instead of 157, we wanted to multiply each number by 1577.  Compute the correct products in the cell below using array arithmetic.  Notice that your job is really easy if you previously defined an array containing the 4 numbers.

In [None]:
correct_products = ...
correct_products

In [None]:
# Testing Question 3.3
grader.check("q33")

**Question 3.4. (2 pts)** We've loaded an array of temperatures in the next cell.  Each number is the highest temperature observed on a day at a climate observation station, mostly from the US.  Since they're from the US government agency [NOAA](https://www.noaa.gov), all the temperatures are in Fahrenheit.  Convert them all to Celsius by first subtracting 32 from them, then multiplying the results by $\frac{5}{9}$. Make sure to **ROUND** the final result to the nearest integer using the `np.round` function.  An example call: `np.round(weird_numbers)`.
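
Two details of `np.round` are worth knowing before you use it: the result is still an array of floats, and values exactly halfway between two integers are rounded to the nearest *even* integer. The sketch below uses made-up values and `np.array` in place of `make_array`.

```python
import numpy as np

ordinary = np.round(np.array([1.4, 1.6, -1.6]))
halves = np.round(np.array([0.5, 1.5, 2.5, 3.5]))

print(ordinary)   # [ 1.  2. -2.]
print(halves)     # [0. 2. 2. 4.]  (halves go to the nearest even integer)
```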


In [None]:
max_temperatures = Table.read_table("temperatures.csv").column("Daily Max Temperature")

celsius_max_temperatures_rounded = ...
celsius_max_temperatures_rounded

In [None]:
# Testing Question 3.4
grader.check("q34")

**Question 3.5. (2 pts)** The cell below loads all the *lowest* temperatures from each day (in Fahrenheit).  Compute the size of the daily temperature range for each day.  That is, compute the difference between each daily maximum temperature and the corresponding daily minimum temperature.  **Pay attention to the units, give your answer in Celsius!** Make sure **NOT** to round your answer for this question!

In [None]:
min_temperatures = Table.read_table("temperatures.csv").column("Daily Min Temperature")

celsius_temperature_ranges = ...
celsius_temperature_ranges

In [None]:
# Testing Question 3.5
grader.check("q35")

## 4. World Population


The cell below loads a table of estimates of the world population for different years, starting in 1950. The estimates come from the [US Census Bureau website](https://www.census.gov/en.html).

In [None]:
world = Table.read_table("world_population.csv").select('Year', 'Population')
world.show(4)

The name `population` is assigned to an **array** of population estimates which is the `Population` column of the world table.

In [None]:
population = world.column(1)
population

In this question, you will apply some built-in `numpy` functions to this array. `numpy` is a module that is often used in Data Science!   [Section 5.1.1](https://inferentialthinking.com/chapters/05/1/Arrays.html) covers some basic functions of numpy.

<img src="array_diff.png" style="width: 600px;"/>

The difference function `np.diff` subtracts each element in an array from the element after it within the array. As a result, the length of the array `np.diff` returns will always be one less than the length of the input array.  We show an example in the next cell.

In [None]:
# the np.diff function computes the difference between each adjacent pair of elements in an array.
ages = make_array(24,34,21,50)  
np.diff(ages)   # 34-24 =10 , 21-34=-13, 50-21=29

<img src="array_cumsum.png" style="width: 700px;"/>

The cumulative sum function `np.cumsum` outputs an array of partial sums. For example, the third element in the output array corresponds to the sum of the first, second, and third elements.

In [None]:
# np.cumsum means the cumulative sum: for each element, add all elements so far 
myNumbers = make_array(10,20,30,40,45,55)
np.cumsum(myNumbers)     # [10, 10+20, 10+20+30, 10+20+30+40, 10+20+30+40+45, 10+20+30+40+45+55]
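
To compare the shapes of the two results side by side, here is a small sketch on a made-up array (with `np.array` in place of `make_array`): `np.diff` returns one fewer element than its input, while `np.cumsum` returns the same number of elements, and its last element equals the total.

```python
import numpy as np

steps = np.array([3, 7, 4, 10])

changes = np.diff(steps)      # 7-3, 4-7, 10-4
running = np.cumsum(steps)    # 3, 3+7, 3+7+4, 3+7+4+10

print(changes)                # [ 4 -3  6]
print(running)                # [ 3 10 14 24]
print(changes.size, running.size)                         # 3 4
print(running.item(running.size - 1) == steps.sum())      # True
```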

**Question 4.1. (2 pts)** Very often in data science, we are interested in understanding how values change with time. Use `np.diff` and `np.max` (or just `max`) to calculate the largest annual change in population between any two consecutive years.

In [None]:
largest_population_change = ...
largest_population_change

In [None]:
# Testing Question 4.1
grader.check("q41")

**Question 4.2. (1 pt)** What do the values in the resulting array represent (choose one)?

In [None]:
np.cumsum(np.diff(population))

1) The total population change between consecutive years, starting at 1951.

2) The total population change between 1950 and each later year, with later year starting at 1951.

3) The total population change between 1950 and each later year, with later year starting inclusively at 1950.

In [None]:
# Assign cumulative_sum_answer to 1, 2, or 3
cumulative_sum_answer = ...

In [None]:
# Testing Question 4.2
grader.check("q42")

## 5. Tables


**Question 5.1. (3 pts)** Suppose you have 4 apples, 3 oranges, and 3 pineapples.  (Perhaps you're using Python to solve a high school Algebra problem.)  Create a table that contains this information.  It should have two columns: `fruit name` and `count`.  Assign the new table to the variable `fruits`.

**Note:** Use lower-case and singular words for the name of each fruit, like `"apple"`.

*Hint:* See [Chapter 6: Tables](https://www.inferentialthinking.com/chapters/06/Tables.html) for how to create a table.

In [None]:
# Our solution uses 1 statement split over 3 lines.
fruits = ...
         ...
         ...
fruits

In [None]:
# Testing Question 5.1
grader.check("q51") 

**Question 5.2. (2 pts)** The file `inventory.csv` contains information about the inventory at a fruit stand.  Each row represents the contents of one box of fruit. Load it as a table named `inventory` using the `Table.read_table()` function. `Table.read_table(...)` takes one argument (data file name in string format) and returns a table.

In [None]:
inventory = ...
inventory

In [None]:
# Testing Question 5.2
grader.check("q52")

**Question 5.3. (3 pts)** How many pieces of fruit does the inventory contain?  How many strawberries does the inventory contain?

In [None]:
total_count =  ...
print("Total fruit count is ", total_count)
strawberry_count =  ...
print("Strawberry count is ", strawberry_count)

In [None]:
# Testing Question 5.3
grader.check("q53")  

**Question 5.4. (2 pts)** The file `sales.csv` contains the number of fruit sold from each box last Saturday.  It has an extra column called "price per fruit (\$)" that's the price *per item of fruit* for fruit in that box.  The rows are in the same order as the `inventory` table.  Load these data into a table called `sales`.

In [None]:
sales = ...
sales

In [None]:
# Testing Question 5.4
grader.check("q54")

**Question 5.5. (2 pts)** How many fruits did the store sell in total on that day?

In [None]:
total_fruits_sold = ...
total_fruits_sold

In [None]:
# Testing Question 5.5
grader.check("q55")

**Question 5.6. (3 pts)** What was the store's total revenue (the total price of all fruits sold) on that day?

*Hint:* If you're stuck, think first about how you would compute the total revenue from just the grape sales.


In [None]:
total_revenue = ...
total_revenue

In [None]:
# Testing Question 5.6
grader.check("q56")

**Question 5.7. (4 pts)** Make a new table called `remaining_inventory`.  It should have the same rows and columns as `inventory`, except that the amount of fruit sold from each box should be subtracted from that box's count, so that the "count" is the amount of fruit remaining after Saturday.

In [None]:
remaining_inventory = ...
   ...
   ...
   ...
remaining_inventory

In [None]:
# Testing Question 5.7
grader.check("q57")

## 6. Old Faithful

Old Faithful is a geyser in Yellowstone that erupts every 44 to 125 minutes (according to [Wikipedia](https://en.wikipedia.org/wiki/Old_Faithful)). People are [often told that the geyser erupts every hour](http://yellowstone.net/geysers/old-faithful/), but in fact the waiting time between eruptions is more variable. Let's take a look.

In [None]:
old_faithful = Table.read_table('old_faithful.csv')
old_faithful

The column 'waiting' corresponds to waiting times between eruptions.  For example, the time between the end of the first eruption (the first eruption lasted 3.6 minutes) and the beginning of the second eruption is 79 minutes. Below `waiting_times` is assigned to an array of 272 consecutive waiting times between eruptions, taken from a classic 1938 dataset.

In [None]:
waiting_times = old_faithful.column('waiting')
waiting_times

**Question 6.1. (3 pts)** Assign the names `shortest`, `longest`, and `average` so that the `print` statement is correct.

In [None]:
shortest = ...
longest = ...
average = ...

print("Old Faithful erupts every", shortest, "to", longest, "minutes and every", average, "minutes on average.")

In [None]:
grader.check("q61")

**Question 6.2. (2 pts)** Assign `biggest_decrease` to the biggest decrease in waiting time between two consecutive eruptions. For example, the third eruption occurred after 74 minutes and the fourth after 62 minutes, so the decrease in waiting time was 74 - 62 = 12 minutes.

*Hint*: We want to return the absolute value of the biggest decrease.

*Note*: `np.diff()` calculates the difference between subsequent values in an array. For example, calling `np.diff()` on the array `make_array(1, 8, 3, 5)` evaluates to `array([8 - 1, 3 - 8, 5 - 3])`, or `array([7, -5, 2])`.


In [None]:
# np.diff() calculates the difference between subsequent values in a NumPy array.
differences = np.diff(waiting_times)
biggest_decrease = ...
biggest_decrease

In [None]:
grader.check("q62")

**Question 6.3. (4 pts)** The `faithful_with_eruption_nums` table contains two columns: `eruption number`, which represents the number of that eruption, and `waiting`, which represents the time spent waiting after that eruption. For example, take the first two rows of the table:

| eruption number | waiting |
|-----------------|---------|
| 1               | 79      |
| 2               | 54      |

We can read this as follows: after the first eruption, we waited 79 minutes for the second eruption. Then, after the second eruption, we waited 54 minutes for the third eruption.  So we waited a total of 133 minutes between the first and the third eruptions.  

Suppose Oscar and Wendy started watching Old Faithful at the start of the first eruption. Assume that they watch until the end of the tenth eruption. For some of that time they will be watching eruptions, and for the rest of the time they will be waiting for Old Faithful to erupt. How many minutes will they spend waiting for eruptions? 

*Hint #1:* You can start by using the `take` method on the table `faithful_with_eruption_nums`. 

*Hint #2:* `first_nine_waiting_times` must be an array.


In [None]:
faithful = old_faithful.drop("eruptions")
faithful_with_eruption_nums = faithful.with_column("eruption number", np.arange(faithful.num_rows) + 1).select(1, 0)
faithful_with_eruption_nums

In [None]:
first_nine_waiting_times = ...
total_waiting_time_until_tenth = ...
total_waiting_time_until_tenth

In [None]:
grader.check("q63")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## 7. Save, Download, and Submit

Good job!  You've completed Homework 2!

Please save your notebook, download a PDF version of the notebook, and submit it to Gradescope (see instructions in Homework 1).