### *“Creation always involves building upon something else. There is no art that doesn't reuse.”* 

*–Lawrence Lessig (American Attorney)*

By the end of this lab, we'll know how to (re)use some previously coded functions to manipulate arrays.  No one wants to reinvent the wheel, so it pays to know what's already out there for our needs.

### Your Name: Daniel DeLuca

---

## Run the cell below to import what you need for this lab!

In [5]:
import numpy as np     #A package that lets us work with arrays!
import pandas as pd    #A package for working with tables

# Lab F: Arrays

So far, we've used Python to manipulate two types of information: numbers and strings.  But Python has many more data types!

To work with datasets in Python, we often need to work with *collections* of data, like the numbers 2 through 5 or the words "welcome", "to", and "lab".  Python provides many types for working with collections of data.  We will mostly be concerned with these types:

* arrays, dictionaries, and dataframes



In this class, we principally use two kinds of collections:
  * **Arrays:** An array is a collection of many pieces of a single kind of data, kept in order.  An array is like a single column in an Excel spreadsheet.
  * **Tables:** A table is a collection of many pieces of different kinds of data.  It's like an entire Excel spreadsheet.  Each row of a table represents one entity, and each column contains a different kind of data about each entity.

---

# 1. Arrays


Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day. 

Arrays are how we put many values in one place so that we can operate on them as a group.  For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
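As a small-scale sketch of the same idea, the cell below applies an 18% tip to a handful of bills in one step.  The name `bills` and its values are made up for illustration; they are not part of the lab's data.

```python
import numpy as np

# Hypothetical bill amounts, invented for this illustration.
bills = np.array([20.00, 45.50, 12.25])

# One multiplication computes the tip for every bill at once.
tips = 0.18 * bills
print(tips)
```

The result has the same number of elements as `bills`, one tip per bill.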

Concretely, an array is like a column in an Excel spreadsheet.

<img src="https://github.com/kathleen-ryan-DeSales/CS250/blob/main/pictures/excel_array.jpg?raw=true">

# 2. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way.

To create an array, call the function `np.array`.  You pass just one argument to `np.array`, a list of values between hard/square brackets.  The function returns an array with all elements in the list you gave it.  Run this cell to see an example:

In [9]:
np.array([0.125, 4.75, -1.3])

array([ 0.125,  4.75 , -1.3  ])

Each thing in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays are values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.
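Here is a minimal sketch of both ideas, assuming a made-up name `readings` for the array from the example above: the array gets a name, and that name is then passed to two ordinary Python functions.

```python
import numpy as np

# Give an array a name, just like a number or a string.
readings = np.array([0.125, 4.75, -1.3])

# Pass the named array to functions as an argument.
print(len(readings))   # how many elements the array holds
print(max(readings))   # the largest element
```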

---

# **Question 1.** 
Make an array containing the numbers 1, 2, and 3, in that order.  Name it `small_numbers`.

In [10]:
small_numbers = np.array([1, 2, 3])
small_numbers

array([1, 2, 3])

# **Question 2.** 

Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order, and do NOT use approximations.  

Name the array `interesting_numbers`. 

*Hint:* How did you get the values $\pi$ and $e$ earlier?  You can refer to them in exactly the same way here.  

In [11]:
import math #What module do you need to access the pi and e constants?

interesting_numbers = np.array([0, 1, -1, math.pi, math.e])

interesting_numbers

array([ 0.        ,  1.        , -1.        ,  3.14159265,  2.71828183])

# **Question 3.** 
Make an array containing the five strings `"Hello"`, `","`, `" "` (so a blank space), `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the things in the array are strings.
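To make the `dtype` note a little less cryptic, here is a small sketch with made-up strings.  The `U` stands for Unicode text, and the number after it is the length of the longest string in the array; the leading `<` describes byte order and can be ignored for this lab.

```python
import numpy as np

# Hypothetical words; the longest one ("banana") has 6 characters.
fruits = np.array(["fig", "banana", "kiwi"])

# The dtype is a Unicode string type sized to fit the longest element.
print(fruits.dtype)
```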

In [12]:
hello_world_components = np.array(["Hello", ",", " ", "world", "!"])
hello_world_components

array(['Hello', ',', ' ', 'world', '!'], dtype='<U5')

---

# 3. A helpful function: `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.

For example, the value of `np.arange(1, 6, 2)` is an array with elements 1, 3, and 5 -- it starts at 1 and counts up by 2, then stops before 6.  In other words, it's equivalent to `np.array([1, 3, 5])`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `arange` stops *before* the stop value is reached.)
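The two examples above can be checked directly.  The sketch below also adds one made-up case (`tens`) to show the stop value being excluded.  Note that NumPy's own name for the third argument is `step`, and it defaults to 1 when left out.

```python
import numpy as np

odds = np.arange(1, 6, 2)        # start at 1, count by 2, stop before 6
four_to_eight = np.arange(4, 9)  # the step defaults to 1 when omitted
tens = np.arange(0, 50, 10)      # 50 itself is excluded

print(odds)
print(four_to_eight)
print(tens)
```

To include the stop value itself, pick a stop that is slightly larger than the last value you want.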

---

# **Question 4.** 
Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [13]:
import numpy as np
multiples_of_99 =  np.arange (0, 10000, 99)
multiples_of_99

array([   0,   99,  198,  297,  396,  495,  594,  693,  792,  891,  990,
       1089, 1188, 1287, 1386, 1485, 1584, 1683, 1782, 1881, 1980, 2079,
       2178, 2277, 2376, 2475, 2574, 2673, 2772, 2871, 2970, 3069, 3168,
       3267, 3366, 3465, 3564, 3663, 3762, 3861, 3960, 4059, 4158, 4257,
       4356, 4455, 4554, 4653, 4752, 4851, 4950, 5049, 5148, 5247, 5346,
       5445, 5544, 5643, 5742, 5841, 5940, 6039, 6138, 6237, 6336, 6435,
       6534, 6633, 6732, 6831, 6930, 7029, 7128, 7227, 7326, 7425, 7524,
       7623, 7722, 7821, 7920, 8019, 8118, 8217, 8316, 8415, 8514, 8613,
       8712, 8811, 8910, 9009, 9108, 9207, 9306, 9405, 9504, 9603, 9702,
       9801, 9900, 9999])

---

## 3.1 A Real Life Example of `arange`: Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

---

# **Question 5a.** 

Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint:* There were 31 days in December, which is equivalent to $31 \times 24$ hours or $31 \times 24 \times 60 \times 60$ seconds.  So your array should have $31 \times 24$ elements in it.

In [14]:
import numpy as np
collection_times =  np.arange(0, 31 * 24 * 60 * 60, 3600)
collection_times

array([      0,    3600,    7200,   10800,   14400,   18000,   21600,
         25200,   28800,   32400,   36000,   39600,   43200,   46800,
         50400,   54000,   57600,   61200,   64800,   68400,   72000,
         75600,   79200,   82800,   86400,   90000,   93600,   97200,
        100800,  104400,  108000,  111600,  115200,  118800,  122400,
        126000,  129600,  133200,  136800,  140400,  144000,  147600,
        151200,  154800,  158400,  162000,  165600,  169200,  172800,
        176400,  180000,  183600,  187200,  190800,  194400,  198000,
        201600,  205200,  208800,  212400,  216000,  219600,  223200,
        226800,  230400,  234000,  237600,  241200,  244800,  248400,
        252000,  255600,  259200,  262800,  266400,  270000,  273600,
        277200,  280800,  284400,  288000,  291600,  295200,  298800,
        302400,  306000,  309600,  313200,  316800,  320400,  324000,
        327600,  331200,  334800,  338400,  342000,  345600,  349200,
        352800,  356

# **Question 5b.** 

The `len` function  tells you how many elements are in an array.  So check that `collection_times` has $31 \times 24$ elements so that you know you have the right length array from part (a).

In [15]:
len(collection_times)

744

---

# 4. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the [US Census Bureau website](http://www.census.gov/population/international/data/worldpop/table_population.php).)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that soon.

In [7]:
# Read the code comments in this cell, but don't worry too much if you don't understand yet.

population_table = pd.read_csv("world_population.csv")  #Read in the population csv data.

population = population_table["Population"].to_numpy()  #Here we got the Population column and turned it into an array

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [17]:
population[0]

2557628654

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `[0]`, not `[1]`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
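One way to get comfortable with this convention is to try it on an array whose contents are easy to predict.  The sketch below uses a stand-in array called `years` (an assumption for illustration, not part of the dataset) that covers the same span as `population`.

```python
import numpy as np

# A stand-in array of years covering the same span as `population`.
years = np.arange(1950, 2016)

print(len(years))          # number of elements
print(years[0])            # first element: zero elements come before it
print(years[3])            # 4th element: three elements come before it
print(years[1973 - 1950])  # the index of a year is its offset from 1950
```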

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  

**Read and run each cell.**

In [18]:
# The third element in the array is the population in 1952.
population_1952 = population[2]  # So 0 is 1950, 1 is 1951, 2 is 1952
population_1952

2636772306

In [19]:
# The thirteenth element in the array is the population in 1962 (which is 1950 + 12).
population_1962 = population[12]
population_1962

In [20]:
# The 66th element is the population in 2015.
population_2015 = population[65]
population_2015

7256490011

In [21]:
# The array has only 66 elements, so the below doesn't work.
# (There's no element with 66 other elements before it.)
# We can index from 0 to 65, but not 66.

population_2016 = population[66]
population_2016

IndexError: index 66 is out of bounds for axis 0 with size 66
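The same rule holds for any array: the largest valid index is always one less than the length.  The sketch below shows this on a tiny made-up array, and uses `try`/`except` (which this lab has not covered) only so that the cell keeps running after the error.

```python
import numpy as np

# A tiny hypothetical array with 3 elements.
small = np.array([10, 20, 30])

last_index = len(small) - 1   # the largest valid index
print(small[last_index])

try:
    small[len(small)]         # one past the end
except IndexError as error:
    print("IndexError:", error)
```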

In [None]:
# Since np.array returns an array, we can put [3] right after the call
# to get its 4th element, without naming the array first.

np.array([-1, -3, 4, -2])[3]

---

# **Question 6.** 

Set `population_1973` to the world population in 1973, by getting the appropriate element from `population` using `[]` notation.  (To help, run the code block after the next one.  It prints the whole table out.)

In [None]:
population_1973 =  population[23]
population_1973

Run the next cell to visualize the elements of `population` and their indices.  You'll learn next week how to make tables like this!

In [None]:
#Don't worry what's going on in this cell, other than knowing it prints the full table.
pd.set_option("display.max_rows", None, "display.max_columns", None)
population_table

# **Question 7.** 

What's the index of the 31st item in `population`?  Try to answer the question without looking at the table or the data!

In [9]:
index_of_31st_item = 30

Click here for a **hint** to make sure you have the right answer.

<!-- The answer is a number less than 100.  So it is not a year.-->

---

## 5. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `[]` and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from the `math` module and the `[]` notation that you just learned:

In [None]:
import math
population_1950_magnitude = math.log10(population[0])
population_1951_magnitude = math.log10(population[1])
population_1952_magnitude = math.log10(population[2])
population_1953_magnitude = math.log10(population[3])
#etcetera

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.  So the `log10` function performs an elementwise operation on the population column.

---

# **Question 8.** 
Use the log10 function of `numpy` to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

In [None]:
population_magnitudes = np.log10(population)
population_magnitudes

<img src="https://github.com/kathleen-ryan-DeSales/CS250/blob/main/pictures/array_logarithm.jpg?raw=true">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
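As a quick illustration of elementwise application, the sketch below runs two NumPy functions on a made-up array whose results are easy to check by eye.

```python
import numpy as np

# Made-up values chosen so the results are easy to verify.
values = np.array([1, 100, 10000])

magnitudes = np.log10(values)  # base-10 logarithm of each element
roots = np.sqrt(values)        # square root of each element

print(magnitudes)
print(roots)
```

Each result has the same length as `values`, because the function was applied once per element.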

---

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [None]:
#read in a csv of bill data and print it out to see it.
restaurant_bills = pd.read_csv("restaurant_bills.csv")["Bill"].to_numpy()
print("Restaurant bills:\t", restaurant_bills)

#Multiply each bill amount by 0.2.
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="https://github.com/kathleen-ryan-DeSales/CS250/blob/main/pictures/array_multiplication.jpg?raw=true">
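The other arithmetic operators behave the same way.  Here is a short sketch on a made-up array of bills (the values are invented, not read from `restaurant_bills.csv`).

```python
import numpy as np

# Hypothetical bill amounts, invented for this illustration.
bills = np.array([10.0, 20.0, 40.0])

print(bills + 5)    # add a flat fee to every bill
print(bills - 2)    # subtract a coupon from every bill
print(bills * 1.2)  # add a 20% tip to every bill
print(bills / 2)    # split every bill two ways
print(bills ** 2)   # square every bill
```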

---

# **Question 9.** 

Suppose the total charge at a restaurant is the original bill plus the tip.  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`.

In [None]:
total_charges =  (restaurant_bills * 1.2)
total_charges

---

# **Question 10.**  

`more_restaurant_bills.csv` contains 100,000 bills!  Compute the total charge for each one.  Is your code any different from the previous question?

In [None]:
#In this line, we generate an array with the Bill column information.
more_restaurant_bills = pd.read_csv("more_restaurant_bills.csv")["Bill"].to_numpy()

#Update the following code for this question.
more_total_charges =  (more_restaurant_bills * 1.2)
more_total_charges

---

# **Question 11.** 

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
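As a small sketch on made-up numbers, Python's built-in `sum` and NumPy's `np.sum` both collapse an array into a single number.  The array `charges` below is hypothetical; on very large arrays `np.sum` is usually the faster of the two.

```python
import numpy as np

# Hypothetical charges, invented for this illustration.
charges = np.array([12.0, 24.0, 48.0])

total = sum(charges)        # Python's built-in sum works on arrays
total_np = np.sum(charges)  # NumPy's own version gives the same total

print(total, total_np)
```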

What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

In [None]:
sum_of_bills = sum(more_total_charges)
sum_of_bills

Double-click __here__ for a hint.

<!-- The answer is not: 1496441.7200000011
-->
<hr/>

# **Question 12.** 

The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc) arise frequently in computer science.  (For example, you may have noticed that storage on cell phones comes in powers of 2, like 16 GB, 32 GB, or 64 GB.) 

Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from `2^0`.

When done, your answer should be:

```1, 2, 4, 8, 16, 32,  64,  128,  256,  512, 1024, 2048, 4096, ...,  33554432,  67108864, 134217728, 268435456, 536870912```


In [2]:
import numpy as np

powers_of_2 = 2**np.arange(0,30,1)
 
powers_of_2 

array([        1,         2,         4,         8,        16,        32,
              64,       128,       256,       512,      1024,      2048,
            4096,      8192,     16384,     32768,     65536,    131072,
          262144,    524288,   1048576,   2097152,   4194304,   8388608,
        16777216,  33554432,  67108864, 134217728, 268435456, 536870912],
      dtype=int32)

Click here for a **hint**.

<!--

First, worry about listing the first 30 exponents, starting at 0 (0, 1, 2, 3, ...). If you want 30 values in total, what is the last number in this list?

Second, raise 2 to each exponent in the previous array.
-->


---

## 6. Built-In Array functions

**Your textbook lists many built-in functions at the end of Ch 5.1 at this url:  [Section 5.1](https://inferentialthinking.com/chapters/05/1/Arrays.html).**

As expected, different functions expect different arguments and return a different output.  

Read and run each cell below for some examples.


In [None]:
#Here's an example of a function that takes an array as an argument and returns a single value.

my_nums = np.arange(1,4)  #1, 2, 3

my_nums.prod() #Takes the product of the numbers so returns just one number.

In [None]:
#Here's an example of a function that takes an array as an argument and returns an array of values.

my_nums = np.array([1,3,7,15,11,22])  #an array of 6 elements

np.diff(my_nums)  #Takes the difference between the numbers above - so leaves an array of 5 elements

In [None]:
#Here's an example of a function that takes an array of strings and returns an array.

my_strings = np.array(["Hello", "Beekeeper", "Banker", "KathLeen", "peewee"])
print(my_strings)

my_strings = np.char.upper(my_strings) #Makes each string upper case and reassigns 
                                       #these values to the my_strings variable.
my_strings

In [None]:
#Here's an example of a function that takes both an array of strings 
#and a search string; it returns an array.

np.char.count(my_strings, "EE") #Counts how many times the search string appears 
                                #in each element of the array

---

# **Question 13.** 

Find a function in Section 5.1 that will help you find the index of the letter `'K'` in each word of `my_strings` variable (defined above).

So your code should return [-1,  3,  3,  0, -1].  Note that -1 indicates that the word does not contain a `'K'`.  Using -1 to mean "not found" is pretty normal in computer science, since -1 is not considered a valid index.

In [26]:
import numpy as np
np.char.find(my_strings, 'K')

array([-1,  3,  3,  0, -1])

---

# **Question 14.** 

Find a function that will help you take the accumulated product of the values in the `my_nums` array below.

So your output should be: `[1, 3, 21, 315,  3465, 76230]`. 

If you're confused by the output:  `21 = 1 * 3 * 7`.  Also, `3465 = 1*3*7*15*11`.

In [31]:
my_nums = np.array([1,3,7,15,11,22]) 
np.cumprod(my_nums)
 

array([    1,     3,    21,   315,  3465, 76230], dtype=int32)

---

# **Question 15.** 

Find a function that will output True or False for each string in `random_words`, indicating whether the string consists entirely of letters.

So your output should be: `[False,  True, False, False, False]`

In [36]:
random_words = np.array(["12", "AbC", "2755 Station Ave", "Jenny Jenny", "267-5309"])

np.char.isalpha(random_words)

array([False,  True, False, False, False])

---
You're done!  Please submit according to the naming conventions.