# Numpy Tutorial

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load required modules. Each time you start your server, you will need to execute this cell again to load the modules.

Throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. 

In [1]:
# Don't change this cell; just run it. 

import numpy as np

# In this assignment, we use pandas commands to import tables from CSV files and take their columns as arrays.
# Don't worry about those commands; we will learn pandas next class.
import pandas as pd

## 1. Creating Arrays


**Question 1.** Make an array called `weird_numbers` containing the following numbers (in the given order):

1. -2
2. the sine of 1.2
3. 3
4. 5 to the power of the cosine of 1.2

*Hint:* `sin` and `cos` are functions in the `math` module.

In [2]:
# Our solution involved one extra line of code before creating
# weird_numbers.
import math
weird_numbers = np.array([-2, math.sin(1.2), 3, 5**math.cos(1.2)])
weird_numbers

array([-2.        ,  0.93203909,  3.        ,  1.79174913])

**Question 2.** Make an array called `book_title_words` containing the following three strings: "Eats", "Shoots", and "and Leaves".

In [3]:
book_title_words = np.array(['Eats', 'Shoots', 'and Leaves'])
book_title_words

array(['Eats', 'Shoots', 'and Leaves'], dtype='<U10')

Strings have a method called `join`.  `join` takes one argument, an array of strings.  It returns a single string.  Specifically, the value of `a_string.join(an_array)` is a single string that's the [concatenation](https://en.wikipedia.org/wiki/Concatenation) ("putting together") of all the strings in `an_array`, **except** `a_string` is inserted in between each string.
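As a quick illustration of where `join` places the separator, here is a minimal sketch. The array `letters` and the separator `"-"` are made up for this example and are not part of the assignment.

```python
import numpy as np

# Hypothetical example array (not from the assignment)
letters = np.array(["a", "b", "c"])

# The separator string goes between the elements, never at the ends
joined = "-".join(letters)
print(joined)  # a-b-c
```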

**Question 3.** Use the array `book_title_words` and the method `join` to make two strings:

1. "Eats, Shoots, and Leaves" (call this one `with_commas`)
2. "Eats Shoots and Leaves" (call this one `without_commas`)

*Hint:* If you're not sure what `join` does, first try just calling, for example, `"foo".join(book_title_words)` .

In [4]:
with_commas = ", ".join(book_title_words)
without_commas = " ".join(book_title_words)

# These lines are provided just to print out your answers.
print('with_commas:', with_commas)
print('without_commas:', without_commas)

with_commas: Eats, Shoots, and Leaves
without_commas: Eats Shoots and Leaves


**Question 4.** Create a 3x3 matrix with values ranging from 0 to 8.

In [5]:
mat3 = np.array([[0, 1, 2],
                 [3, 4, 5],
                 [6, 7, 8]])

**Question 5.** Find the dimension of the above matrix `mat3`.

In [6]:
mat3.shape

(3, 3)

**Question 6.** Create the following vector and matrices.
* Create a vector with values ranging from 10 to 49
* Create a 3x3 identity matrix
* Create a 3x3 array with random values

In [7]:
# making a vector with values 10 to 49
vector = np.arange(10,50)
print(vector)

# making an identity matrix
identity = np.identity(3)
print(identity)

# making a random 3x3 matrix
rand = np.random.random([3,3])
print(rand)

[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[[0.47645641 0.48172832 0.89672625]
 [0.72490987 0.39112101 0.98476745]
 [0.8985761  0.83968194 0.95149354]]


## 2. Indexing Arrays


These exercises give you practice accessing individual elements of arrays.  In Python (and in many programming languages), elements are accessed by *index*, so the first element is the element at index 0.  
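The sketch below shows zero-based indexing on a small made-up array (`letters` is hypothetical, not part of the assignment). It also shows that the last element sits at index `len(array) - 1`, and that Python accepts negative indices that count from the end.

```python
import numpy as np

# Hypothetical example array (not from the assignment)
letters = np.array(["a", "b", "c", "d"])

first = letters[0]                  # index 0 is the first element
last = letters[len(letters) - 1]    # the last index is always length - 1
also_last = letters[-1]             # negative indices count from the end

print(first, last, also_last)  # a d d
```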

**Question 1.** The cell below creates an array of some numbers.  Set `third_element` to the third element of `some_numbers`.

In [8]:
some_numbers = np.array([-1, -3, -6, -10, -15])

third_element = some_numbers[2]
third_element

-6

**Question 2.** The next cell creates a table that displays some information about the elements of `some_numbers` and their order.  Run the cell to see the partially-completed table, then fill in the missing information in the cell (the strings that are currently "???") to complete the table.

In [9]:
elements_of_some_numbers = pd.DataFrame({
    "English name for position": np.array(["first", "second", "third", "fourth", "fifth"]),
    "Index":                     np.array(["0", "1", "2", "3", "4"]),
    "Element":                   some_numbers})
elements_of_some_numbers

| | English name for position | Index | Element |
|---|---|---|---|
| 0 | first | 0 | -1 |
| 1 | second | 1 | -3 |
| 2 | third | 2 | -6 |
| 3 | fourth | 3 | -10 |
| 4 | fifth | 4 | -15 |


**Question 3.** You'll sometimes want to find the *last* element of an array.  Suppose an array has 142 elements.  What is the index of its last element?

In [10]:
# Indices start at 0, so the last of 142 elements is at index 142 - 1
index_of_last_element = 141

**Question 4.** The cell below creates a 2D array.  
* Set `eg_elem` to the element in the second row and third column (use indexing).
* Set `first_row` to the first row of `eg_mat`. 
* Set `third_column` to the third column of `eg_mat`. 
* Set `two_cols` to the second and third columns of `eg_mat`. 
* Set `sub_mat` to the last two rows and last two columns of `eg_mat`, i.e. $\left(\begin{array}{cc} 
6 & 7\\ 
10 & 11
\end{array}\right)$

In [11]:
eg_mat = np.arange(12).reshape(3,4)
eg_mat

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [12]:
eg_elem = eg_mat[1,2]
eg_elem

6

In [13]:
first_row = eg_mat[0]
first_row

array([0, 1, 2, 3])

In [14]:
third_column = eg_mat[:,2]
third_column

array([ 2,  6, 10])

In [15]:
two_cols = eg_mat[:,[1,2]]
two_cols

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

In [16]:
sub_mat = eg_mat[1:,2:]
sub_mat

array([[ 6,  7],
       [10, 11]])
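One detail worth seeing in isolation is that an integer index drops a dimension while a slice keeps it. The following is a minimal sketch on a hypothetical 2x3 matrix `m` (not part of the assignment).

```python
import numpy as np

# Hypothetical 2x3 matrix: [[0, 1, 2], [3, 4, 5]]
m = np.arange(6).reshape(2, 3)

row_1d = m[0]       # integer index: 1-D array with shape (3,)
row_2d = m[:1]      # slice: still 2-D, with shape (1, 3)
col_1d = m[:, 1]    # one column as a 1-D array: [1, 4]
block = m[:, 1:]    # last two columns: [[1, 2], [4, 5]]

print(row_1d.shape, row_2d.shape, col_1d, block.shape)
```

Both `m[0]` and `m[:1]` contain the same numbers; which one you want depends on whether later code expects a vector or a one-row matrix.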

**Question 5.** Reverse a vector (first element becomes last)

* Method 1: Use the reversed indexing

In [17]:
vec = np.arange(2,10)
vec

array([2, 3, 4, 5, 6, 7, 8, 9])

In [18]:
reversed_index = np.arange(7,-1,-1) 
reversed_index

array([7, 6, 5, 4, 3, 2, 1, 0])

In [19]:
print("vector =", vec)
# Index vec with the reversed indices, so the last element comes first
rev_vec1 = vec[reversed_index]
print("reversed vector =", rev_vec1)

vector = [2 3 4 5 6 7 8 9]
reversed vector = [9 8 7 6 5 4 3 2]


What's the meaning of the arguments of `reversed_index`? What's the meaning of `7`,`-1` and `-1`?

In [20]:
## 7 is the start (included), the first -1 is the stop (excluded, so the last value generated is 0), and the second -1 is the step
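To make the roles of the three `np.arange` arguments concrete, here is a small sketch on a hypothetical four-element vector `v` (not part of the assignment). It assumes only that `np.arange(start, stop, step)` includes `start` and excludes `stop`.

```python
import numpy as np

# np.arange(start, stop, step): start is included, stop is excluded
countdown = np.arange(3, -1, -1)    # [3, 2, 1, 0]

# Hypothetical 4-element vector
v = np.array([10, 20, 30, 40])

# Indexing with an array of indices picks elements in the given order
reversed_v = v[countdown]           # [40, 30, 20, 10]

print(countdown, reversed_v)
```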

* Method 2: Use the `np.flip` function

We have not covered `np.flip` yet, so run the following cell to read its documentation.

In [21]:
# Try to run this cell to learn how to use a new command/function
np.flip?

In [22]:
print("vector =", vec)
rev_vec2 = np.flip(vec)
print("reversed vector =", rev_vec2)

vector = [2 3 4 5 6 7 8 9]
reversed vector = [9 8 7 6 5 4 3 2]


**Question 6.** 
* Find how many non-zero elements are in the 1D array [1,2,0,0,4,0].
* Find how many non-zero elements are in each row of the 2D array 
$\left(\begin{array}{cccc} 
1 & 0 & 2 & 0 \\
0 & 0 & 3 & 4 \\
5 & 6 & 0 & 7 \\
0 & 8 & 9 & 0 \\
\end{array}\right)$
* Find how many non-zero elements are in each column of the above 2D array.

Hint: In Python, the operators `==` and `!=` test whether a condition holds. Applied to an array, they return an array of `True`/`False` values, one per element.

In [23]:
## Here is an example
a = np.arange(3)
print("The output of `a==0`:", a == 0)
print("The output of `a!=0`:", a != 0)

The output of `a==0`: [ True False False]
The output of `a!=0`: [False  True  True]
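Because `True` counts as 1 and `False` as 0, summing a boolean array counts how many elements satisfy the condition. The sketch below illustrates this on hypothetical data (`values` and `grid` are made up); the `axis` argument of `np.sum` chooses whether to count per row or per column.

```python
import numpy as np

# Hypothetical data (not the arrays from the question)
values = np.array([3, 0, 7, 0])
grid = np.array([[1, 0],
                 [0, 0],
                 [2, 5]])

n_nonzero = np.sum(values != 0)        # 2 elements are non-zero
per_row = np.sum(grid != 0, axis=1)    # count within each row: [1, 0, 2]
per_col = np.sum(grid != 0, axis=0)    # count within each column: [2, 1]

print(n_nonzero, per_row, per_col)
```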


In [163]:
matrix = np.array([[1, 0, 2, 0],[0, 0, 3, 4],[5, 6, 0, 7], [0, 8, 9, 0]])
len(matrix[matrix[:,2]==0])

1

In [165]:
# write your answer of question 6 here

# Part A
# `array != 0` is True exactly where an element is non-zero; summing counts the True values
array = np.array([1, 2, 0, 0, 4, 0])
print('The total number of non-zero elements in the array is:', np.sum(array != 0))

print('-------------------------- Part B --------------------------')

matrix = np.array([[1, 0, 2, 0],[0, 0, 3, 4],[5, 6, 0, 7], [0, 8, 9, 0]])
# axis=1 sums across each row
row_counts = np.sum(matrix != 0, axis=1)
for i in range(len(row_counts)):
    print('The total number of non-zero elements in matrix row '+str(i+1)+' is:', row_counts[i])

print('-------------------------- Part C --------------------------')

# axis=0 sums down each column
column_counts = np.sum(matrix != 0, axis=0)
for i in range(len(column_counts)):
    print('The total number of non-zero elements in matrix column '+str(i+1)+' is:', column_counts[i])

The total number of non-zero elements in the array is: 3
-------------------------- Part B --------------------------
The total number of non-zero elements in matrix row 1 is: 2
The total number of non-zero elements in matrix row 2 is: 2
The total number of non-zero elements in matrix row 3 is: 3
The total number of non-zero elements in matrix row 4 is: 2
-------------------------- Part C --------------------------
The total number of non-zero elements in matrix column 1 is: 2
The total number of non-zero elements in matrix column 2 is: 2
The total number of non-zero elements in matrix column 3 is: 3
The total number of non-zero elements in matrix column 4 is: 2


## 3. Basic Array Arithmetic


**Question 1.** Multiply the numbers 42, 4224, 42422424, and -250 by 157.  For this question, **don't** use arrays.

In [55]:
first_product = 42 * 157
second_product = 4224 * 157
third_product = 42422424 * 157
fourth_product = -250 * 157
print(first_product, second_product, third_product, fourth_product)

6594 663168 6660320568 -39250


**Question 2.** Now, do the same calculation, but using an array called `numbers` and only a single multiplication (`*`) operator.  Store the 4 results in an array named `products`.

In [58]:
numbers = np.array([42, 4224, 42422424, -250])
products = numbers * 157
products

array([      6594,     663168, 6660320568,     -39250])

**Question 3.** Oops, we made a typo!  Instead of 157, we wanted to multiply each number by 1577.  Compute the fixed products in the cell below using array arithmetic.  Notice that your job is really easy if you previously defined an array containing the 4 numbers.

In [59]:
fixed_products = numbers * 1577
fixed_products

array([      66234,     6661248, 66900162648,     -394250])

**Question 4.** We've loaded an array of temperatures in the next cell.  Each number is the highest temperature observed on a day at a climate observation station, mostly from the US.  Since they're from the US government agency [NOAA](noaa.gov), all the temperatures are in Fahrenheit.  Convert them all to Celsius by first subtracting 32 from them, then multiplying the results by $\frac{5}{9}$. Make sure to **ROUND** each result to the nearest integer using the `np.round` function.

Hint: The first line of code is beyond our scope for now; we will learn it next week. If you have trouble understanding it, print `max_temperatures` to take a look. It is essentially an array.
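The conversion formula can be checked on a few easy values before applying it to the real data. The sketch below uses a hypothetical array `fahrenheit` (not the assignment's dataset) and applies the same two steps: subtract 32, then multiply by 5/9.

```python
import numpy as np

# Hypothetical temperatures in Fahrenheit (not from temperatures.csv)
fahrenheit = np.array([32.0, 212.0, 50.0, 100.0])

# Array arithmetic applies the formula to every element at once
celsius = (fahrenheit - 32) * 5 / 9

# np.round rounds each element to the nearest integer (the result stays a float array)
rounded = np.round(celsius)

print(rounded)  # freezing, boiling, 10 C, and roughly 38 C
```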

In [63]:
max_temperatures = pd.read_csv("temperatures.csv")["Daily Max Temperature"].to_numpy()

celsius_max_temperatures = np.round((max_temperatures-32)*5/9)
celsius_max_temperatures

array([-4., 31., 32., ..., 17., 23., 16.])

**Question 5.** The cell below loads all the *lowest* temperatures from each day (in Fahrenheit).  Compute the size of the daily temperature range for each day.  That is, compute the difference between each daily maximum temperature and the corresponding daily minimum temperature.  **Give your answer in Celsius!** Make sure **NOT** to round your answer for this question!

In [70]:
min_temperatures = pd.read_csv("temperatures.csv")["Daily Min Temperature"].to_numpy()

celsius_temperature_ranges = (max_temperatures-32)*5/9 - (min_temperatures-32)*5/9
celsius_temperature_ranges

array([ 6.66666667, 10.        , 12.22222222, ..., 17.22222222,
       11.66666667, 11.11111111])

**Question 6.** Compute the matrix multiplication using Python

$$\left(\begin{array}{cccc} 
1 & 2 & 3 & 4 \\
2 & 3 & 4 & 5 \\
5 & 6 & 7 & 8 \\
\end{array}\right) \times 
\left(\begin{array}{cc} 
1 & 2 \\
2 & 3 \\
5 & 6 \\
\end{array}\right) = ?
$$

In [148]:
a = np.array([[1, 2, 3, 4], [2, 3, 4, 5], [5, 6, 7, 8]])
b = np.array([[1, 2], [2, 3], [5, 6]])

np.multiply(a,b)

ValueError: operands could not be broadcast together with shapes (3,4) (3,2) 

In [150]:
# np.multiply works element by element, so it fails for shapes (3,4) and (3,2).
# The matrix product a @ b is not defined either: a has 4 columns but b has 3 rows.
# Transposing a gives shape (4,3), which can be matrix-multiplied by b to give shape (4,2).
product = a.T @ b
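For reference, a matrix product requires the number of columns of the left matrix to equal the number of rows of the right matrix, and NumPy writes it with the `@` operator rather than `*` or `np.multiply`. The following is a minimal sketch on two hypothetical matrices `p` and `q`, chosen so the shapes are compatible.

```python
import numpy as np

# Hypothetical matrices with compatible shapes
p = np.array([[1, 2, 3],
              [4, 5, 6]])     # shape (2, 3)
q = np.array([[1, 0],
              [0, 1],
              [1, 1]])        # shape (3, 2)

# Matrix product: (2, 3) @ (3, 2) gives shape (2, 2)
matrix_product = p @ q

# Element-wise product needs equal (or broadcastable) shapes and is a different operation
elementwise = p * p

print(matrix_product)
print(elementwise.shape)
```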

## 4. World Population


The cell below loads a table of estimates of the world population for different years, starting in 1950. The estimates come from the [US Census Bureau website](https://www.census.gov/en.html).

In [102]:
world = pd.read_csv("world_population.csv")[['Year', 'Population']]
world.head(5)

| | Year | Population |
|---|---|---|
| 0 | 1950 | 2557628654 |
| 1 | 1951 | 2594939877 |
| 2 | 1952 | 2636772306 |
| 3 | 1953 | 2682053389 |
| 4 | 1954 | 2730228104 |


The name `population` is assigned to an array of population estimates.

In [103]:
population = world['Population'].to_numpy()
population

array([2557628654, 2594939877, 2636772306, 2682053389, 2730228104,
       2782098943, 2835299673, 2891349717, 2948137248, 3000716593,
       3043001508, 3083966929, 3140093217, 3209827882, 3281201306,
       3350425793, 3420677923, 3490333715, 3562313822, 3637159050,
       3712697742, 3790326948, 3866568653, 3942096442, 4016608813,
       4089083233, 4160185010, 4232084578, 4304105753, 4379013942,
       4451362735, 4534410125, 4614566561, 4695736743, 4774569391,
       4856462699, 4940571232, 5027200492, 5114557167, 5201440110,
       5288955934, 5371585922, 5456136278, 5538268316, 5618682132,
       5699202985, 5779440593, 5857972543, 5935213248, 6012074922,
       6088571383, 6165219247, 6242016348, 6318590956, 6395699509,
       6473044732, 6551263534, 6629913759, 6709049780, 6788214394,
       6866332358, 6944055583, 7022349283, 7101027895, 7178722893,
       7256490011])

In this question, you will apply some built-in Numpy functions to this array.

<img src="array_diff.png" style="width: 600px;"/>

The difference function `np.diff` subtracts from each element in an array the element that precedes it. As a result, the length of the array `np.diff` returns will always be one less than the length of the input array.

<img src="array_cumsum.png" style="width: 700px;"/>

The cumulative sum function `np.cumsum` outputs an array of partial sums. For example, the third element in the output array corresponds to the sum of the first, second, and third elements.
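Both functions are easy to check by hand on a short array. Here is a small sketch with a hypothetical input `x` (not the population data).

```python
import numpy as np

# Hypothetical input array
x = np.array([2, 5, 4, 10])

# Each element minus the one before it: 5-2, 4-5, 10-4
steps = np.diff(x)        # [3, -1, 6]

# Partial sums: 2, 2+5, 2+5+4, 2+5+4+10
running = np.cumsum(x)    # [2, 7, 11, 21]

print(steps, running)
```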

**Question 1.** Very often in data science, we are interested in understanding how values change with time. Use `np.diff` and `np.max` (or just `max`) to calculate the largest annual change in population between any two consecutive years.

In [105]:
largest_population_change = np.max(np.diff(population))
largest_population_change

87515824

**Question 2.** Describe in words the result of the following expression. What do the values in the resulting array represent (choose one)?

 #### `np.diff(population)` gives the change in population between each pair of consecutive years (for `[a, b, c]` it returns `[b-a, c-b]`). `np.cumsum` then adds these changes up as running totals, so the element at index `i` equals `population[i+1] - population[0]`: the total change in population from 1950 to each later year, starting with 1951.

In [106]:
np.cumsum(np.diff(population)) # for those of you who know how to program, can you rewrite this using a loop?

array([  37311223,   79143652,  124424735,  172599450,  224470289,
        277671019,  333721063,  390508594,  443087939,  485372854,
        526338275,  582464563,  652199228,  723572652,  792797139,
        863049269,  932705061, 1004685168, 1079530396, 1155069088,
       1232698294, 1308939999, 1384467788, 1458980159, 1531454579,
       1602556356, 1674455924, 1746477099, 1821385288, 1893734081,
       1976781471, 2056937907, 2138108089, 2216940737, 2298834045,
       2382942578, 2469571838, 2556928513, 2643811456, 2731327280,
       2813957268, 2898507624, 2980639662, 3061053478, 3141574331,
       3221811939, 3300343889, 3377584594, 3454446268, 3530942729,
       3607590593, 3684387694, 3760962302, 3838070855, 3915416078,
       3993634880, 4072285105, 4151421126, 4230585740, 4308703704,
       4386426929, 4464720629, 4543399241, 4621094239, 4698861357])

1) The total population change between consecutive years, starting at 1951.

2) The total population change between 1950 and each later year, starting at 1951.

3) The total population change between 1950 and each later year, starting inclusively at 1950.

In [108]:
# Assign cumulative_sum_answer to 1, 2, or 3
cumulative_sum_answer = 2
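The code comment above asks whether `np.cumsum(np.diff(...))` can be rewritten with a loop. One possible sketch is shown below; the function name `cumsum_of_diff_loop` and the array `sample` are hypothetical, and other loop formulations would work just as well.

```python
import numpy as np

def cumsum_of_diff_loop(values):
    """Running total of consecutive differences, written with a plain loop."""
    totals = []
    running_total = 0
    for i in range(1, len(values)):
        # Add the change from the previous element to the running total
        running_total += values[i] - values[i - 1]
        totals.append(running_total)
    return np.array(totals)

# Hypothetical values (not the population data)
sample = np.array([10, 13, 12, 20])

loop_result = cumsum_of_diff_loop(sample)
vectorized = np.cumsum(np.diff(sample))

print(loop_result, vectorized)
```

In this sketch each entry of the result also equals `sample[i] - sample[0]` for `i = 1, 2, ...`, because the intermediate terms cancel when the differences are added up.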

## 5. Old Faithful


Old Faithful is a geyser in Yellowstone that erupts every 44 to 125 minutes (according to [Wikipedia](https://en.wikipedia.org/wiki/Old_Faithful)). People are [often told that the geyser erupts every hour](http://yellowstone.net/geysers/old-faithful/), but in fact the waiting time between eruptions is more variable. Let's take a look.

**Question 1.** The first line below assigns `waiting_times` to an array of 272 consecutive waiting times between eruptions, taken from a classic 1938 dataset. Assign the names `shortest`, `longest`, and `average` so that the `print` statement is correct.

In [109]:
waiting_times = pd.read_csv('old_faithful.csv')['waiting'].to_numpy()

shortest = np.min(waiting_times)
longest = np.max(waiting_times)
average = np.mean(waiting_times)

print("Old Faithful erupts every", shortest, "to", longest, "minutes and every", average, "minutes on average.")

Old Faithful erupts every 43 to 96 minutes and every 70.8970588235294 minutes on average.


**Question 2.** Assign `biggest_difference` to the biggest difference in waiting time between two consecutive eruptions. For example, the third eruption occurred after 74 minutes and the fourth after 62 minutes, so the difference in waiting time was 74 - 62 = 12 minutes. 

*Hint*: You'll need an array arithmetic function [mentioned in the textbook](https://www.inferentialthinking.com/chapters/05/1/arrays.html#Functions-on-Arrays).

*Hint 2*: A difference between consecutive waiting times can be negative, so take absolute values before finding the biggest one.

In [120]:
biggest_difference = np.max(np.abs(np.diff(waiting_times)))
biggest_difference

47

**Question 3.** If you expected Old Faithful to erupt every hour, you would expect to wait a total of `60 * k` minutes to see `k` eruptions. Set `difference_from_expected` to an array with 272 elements, where the element at index `i` is the absolute difference between the expected and actual total amount of waiting time to see the first `i+1` eruptions.  *Hint*: You'll need to compare a cumulative sum to a range.

For example, since the first three waiting times are 79, 54, and 74, the total waiting time for 3 eruptions is 79 + 54 + 74 = 207. The expected waiting time for 3 eruptions is 60 * 3 = 180. Therefore, `difference_from_expected.item(2)` should be $|207 - 180| = 27$.
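The worked example above can be reproduced on just the first three waiting times. This is a minimal sketch; the names `first_three`, `actual_total`, `expected_total`, and `gap` are chosen for illustration only.

```python
import numpy as np

# The first three waiting times quoted in the example above
first_three = np.array([79, 54, 74])

# Actual total waiting time after 1, 2, and 3 eruptions
actual_total = np.cumsum(first_three)        # [79, 133, 207]

# Expected total if the geyser erupted every 60 minutes: 60 * k for k = 1, 2, 3
expected_total = 60 * np.arange(1, 4)        # [60, 120, 180]

# Absolute difference between actual and expected totals
gap = np.abs(actual_total - expected_total)  # [19, 13, 27]

print(gap)
```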

In [132]:
expected = np.zeros(272)+60
# np.abs makes every entry the absolute difference, as the question requires
difference_from_expected = np.abs(np.cumsum(waiting_times) - np.cumsum(expected))
difference_from_expected

array([  19.,   13.,   27.,   29.,   54.,   49.,   77.,  102.,   93.,
        118.,  112.,  136.,  154.,  141.,  164.,  156.,  158.,  182.,
        174.,  193.,  184.,  171.,  189.,  198.,  212.,  235.,  230.,
        246.,  264.,  283.,  296.,  313.,  319.,  339.,  353.,  345.,
        333.,  353.,  352.,  382.,  402.,  400.,  424.,  422.,  435.,
        458.,  462.,  455.,  477.,  476.,  491.,  521.,  515.,  535.,
        529.,  552.,  563.,  567.,  584.,  605.,  604.,  628.,  616.,
        638.,  638.,  670.,  688.,  706.,  711.,  724.,  746.,  742.,
        761.,  772.,  774.,  790.,  790.,  808.,  824.,  847.,  862.,
        884.,  894.,  899.,  912.,  940.,  956.,  976.,  964.,  990.,
        990., 1020., 1010., 1028., 1031., 1043., 1067., 1082., 1073.,
       1095., 1097., 1125., 1114., 1137., 1158., 1145., 1169., 1161.,
       1187., 1208., 1223., 1222., 1251., 1270., 1269., 1290., 1280.,
       1305., 1304., 1331., 1324., 1333., 1350., 1346., 1374., 1395.,
       1380., 1402.,

**Question 4.** If instead you guess that each waiting time will be the same as the previous waiting time, by how many minutes would your guess differ from the actual time, on average over every waiting time except the first one?

For example, since the first three waiting times are 79, 54, and 74, the average difference between your guess and the actual time for just the second and third eruption would be $\frac{|79-54|+ |54-74|}{2} = 22.5$.
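The same arithmetic can be checked in code on the three waiting times from the example. This is only a sketch on those three values; `first_three`, `errors`, and `mean_error` are illustrative names.

```python
import numpy as np

# The first three waiting times quoted in the example above
first_three = np.array([79, 54, 74])

# Guessing "same as the previous wait" is off by the absolute consecutive difference
errors = np.abs(np.diff(first_three))   # [25, 20]

# Average error over the second and third eruptions
mean_error = np.mean(errors)            # 22.5

print(errors, mean_error)
```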

In [139]:
# Each guess is the previous waiting time, so the errors are the absolute
# differences between consecutive waiting times; average them into one number.
average_error = np.mean(np.abs(np.diff(waiting_times)))
average_error