# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2018-CS109A/blob/master/content/styles/iacs.png?raw=true"> CS109A/STAT121A Introduction to Data Science 


## Lab 3: `numpy`, plotting, K-NN Regression, Simple Linear Regression
## <font color='red'> PRE-LAB : DO THIS PART BEFORE COMING TO LAB</font>

**Harvard University**<br>
**Fall 2018**<br>
**Instructors:** Pavlos Protopapas and Kevin Rader<br>

---

In [1]:
## Run the cell below to properly highlight the exercises
from IPython.display import HTML
style = "<style>div.exercise { background-color: #ffcccc;border-color: #E9967A; border-left: 5px solid #800080; padding: 0.5em;}</style>"
HTML(style)

# Table of Contents
<ol start="0">
  <li> Learning Goals </li>
  <li> `numpy` </li>
  <li> Creating plots </li>
</ol>

## Learning Goals


By the end of this lab, you should be able to:
* Use `numpy` proficiently and efficiently
* Make great, readable, informative plots with `matplotlib`
* Feel comfortable with simple linear regression
* Feel comfortable with $k$ nearest neighbors

**This lab corresponds to lecture 3 and maps on to homework 2 (and beyond).**



## Numerical Python:  `numpy`
Scientific `Python` code uses a fast array structure, called the `numpy` array. Those who have worked in `Matlab` will find this very natural. For reference, the `numpy` documentation can be found here: [`numpy`](http://www.numpy.org/).


Let's make a numpy array.

In [2]:
import numpy as np

In [3]:
my_array = np.array([1,4,9,16])
my_array

array([ 1,  4,  9, 16])

Numpy arrays support many of the same operations as lists! Below we compute length, slice, and iterate.

In [4]:
print("len(array):", len(my_array)) # Length of array

print("array[2:4]:", my_array[2:4]) # A slice of the array

# Iterate over the array
for ele in my_array:
    print("element:", ele)

len(array): 4
array[2:4]: [ 9 16]
element: 1
element: 4
element: 9
element: 16


**In general you should manipulate numpy arrays by using numpy module functions** (e.g. `np.mean`). This is for efficiency purposes, and a discussion follows below this section.

You can calculate the mean of the array elements either by calling the method `.mean` on a numpy array or by applying the function `np.mean` with the `numpy` array as an argument.

In [5]:
# Two ways of calculating the mean

print(my_array.mean())

print(np.mean(my_array))

7.5
7.5


The way we constructed the `numpy` array above seems redundant. After all we already had a regular `python` list. Indeed, it is the other ways we have to construct `numpy` arrays that make them super useful. 

There are many such `numpy` array *constructors*. Here are some commonly used constructors. Look them up in the documentation.

In [6]:
zeros = np.zeros(10) # generates 10 floating point zeros
zeros

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

`Numpy` gains a lot of its efficiency from being strongly typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a 64-bit float (`float64`).

In [7]:
zeros.dtype

dtype('float64')

In [8]:
np.ones(10, dtype='int') # generates 10 integer ones

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
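A few more constructors are worth knowing. The sketch below is only an illustration (the variable names `evens`, `grid`, and `sevens` are made up for this example) of `np.arange`, `np.linspace`, and `np.full`.

```python
import numpy as np

# np.arange works like Python's range, but returns an array
evens = np.arange(0, 10, 2)        # start, stop (exclusive), step
print(evens)                       # [0 2 4 6 8]

# np.linspace returns evenly spaced points, including both endpoints
grid = np.linspace(0.0, 1.0, 5)    # start, stop (inclusive), number of points
print(grid)                        # [0.   0.25 0.5  0.75 1.  ]

# np.full fills an array with a constant of your choosing
sevens = np.full(3, 7.0)
print(sevens, sevens.dtype)        # [7. 7. 7.] float64
```

Note that `np.arange` excludes the stop value while `np.linspace` includes it.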

Often you will want random numbers. Use the functions in the `np.random` module!

In [9]:
np.random.rand(10) # uniform on [0, 1)

array([0.7541534 , 0.25898681, 0.42937074, 0.77508989, 0.75557664,
       0.45132106, 0.17618301, 0.69506555, 0.85333898, 0.460434  ])
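The numbers above will be different every time the cell is run. If you need repeatable results (say, while debugging), one option is to set a seed first. This is a minimal sketch; the seed value `109` and the variable names are arbitrary choices for this example.

```python
import numpy as np

np.random.seed(109)            # 109 is an arbitrary seed chosen for this sketch
first_draw = np.random.rand(3)

np.random.seed(109)            # resetting the seed restarts the same sequence
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```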

You can generate random numbers from a normal distribution with mean $0$ and variance $1$ as follows:

In [10]:
normal_array = np.random.randn(1000000)
print("The sample mean and standard deviation are {0:17.16f} and {1:17.16f}, respectively.".format(np.mean(normal_array), np.std(normal_array)))

The sample mean and standard deviation are -0.0005217430082452 and 0.9996502296565319, respectively.


#### `numpy` supports vector operations

What does this mean? It means that to add two arrays, instead of looping over each element (e.g. via a list comprehension as in base Python), you get to simply put a plus sign between the two arrays.

In [11]:
ones_array = np.ones(5)
twos_array = 2*np.ones(5)
ones_array + twos_array

array([3., 3., 3., 3., 3.])

Note that this behavior is very different from `python` lists, which just get longer when you try to + them.

In [12]:
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # not what you want

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
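To get an element-wise sum out of plain lists you have to write the loop yourself. The sketch below (with made-up example values) puts the list comprehension next to the `numpy` version so you can compare them.

```python
import numpy as np

first_list = [1., 2., 3.]
second_list = [10., 20., 30.]

# Base Python: pair up the elements and add them one at a time
list_sum = [a + b for a, b in zip(first_list, second_list)]
print(list_sum)                                        # [11.0, 22.0, 33.0]

# numpy: the same element-wise sum with a single plus sign
array_sum = np.array(first_list) + np.array(second_list)
print(array_sum)                                       # [11. 22. 33.]
```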

On some computer chips numpy's addition actually happens in parallel, so speedups can be high. But even on regular chips, the advantage of greater readability is important.

`Numpy` supports a concept known as *broadcasting*, which dictates how arrays of different sizes are combined together. There are too many rules to list all of them here.  Here are two important rules:

1. Multiplying an array by a number multiplies each element by the number
2. Adding a number adds the number to each element.

In [13]:
ones_array + 1

array([2., 2., 2., 2., 2.])

In [14]:
5 * ones_array

array([5., 5., 5., 5., 5.])

This means that if you wanted samples from a normal distribution with mean $5$ and standard deviation $7$ you could do:

In [15]:
normal_5_7 = 5.0 + 7.0 * normal_array

np.mean(normal_5_7), np.std(normal_5_7)

(4.996347798942285, 6.997551607595724)
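The same two rules also work in reverse. As a small sketch with made-up data, subtracting the mean and dividing by the standard deviation rescales an array to have mean $0$ and standard deviation $1$.

```python
import numpy as np

samples = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

# Subtracting a number and dividing by a number both act element by element
standardized = (samples - samples.mean()) / samples.std()

print(samples.mean(), samples.std())   # 5.0 2.0
print(standardized)
print(standardized.mean(), standardized.std())
```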

Now you have seen how to create and work with simple one dimensional arrays in `numpy`.  You have also been introduced to some important `numpy` functionality (e.g. `mean` and `std`).

Next, we push ahead to two-dimensional arrays and begin to dive into some of the deeper aspects of `numpy`.

### 2D arrays
We can create two-dimensional arrays without too much fuss.

In [16]:
# create a 2d-array by handing a list of lists
my_array2d = np.array([ 
    [1, 2, 3, 4], 
    [5, 6, 7, 8], 
    [9, 10, 11, 12] 
])

# you can do the same without the pretty formatting (decide which style you like better)
my_array2d = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ])


# 3 x 4 array of ones
ones_2d = np.ones([3, 4])
print(ones_2d, "\n")

# 3 x 4 array of ones with random noise
ones_noise = ones_2d + 0.01*np.random.randn(3, 4)
print(ones_noise, "\n")

# 3 x 3 identity matrix
my_identity = np.eye(3)
print(my_identity, "\n")

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]] 

[[0.99226045 0.99939867 1.00309239 0.99285125]
 [0.98635533 1.00257217 0.99652809 1.0018203 ]
 [1.005472   1.0043559  1.01463246 0.99853056]] 

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]] 



Like lists, `numpy` arrays are $0$-indexed.  Thus we can access the $n$th row and the $m$th column of a two-dimensional array with the indices $[n - 1, m - 1]$.

In [17]:
print(my_array2d)
print("element [2,3] is:", my_array2d[2, 3])

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
element [2,3] is: 12


Numpy arrays can be sliced, and can be iterated over with loops.  Below is a schematic illustrating slicing two-dimensional arrays.  

 <img src="images/2dindex_v2.png" alt="Drawing" style="width: 500px;"/>
 
Notice that the list slicing syntax still works!  
`array[2:,3]` says "in the array, get rows 2 through the end, column 3"  
`array[3,:]` says "in the array, get row 3, all columns".
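Here is a short sketch of those slices in action. It uses a made-up $4\times 5$ array called `grid` (not the `my_array2d` from above, which only has three rows) so that every index in the text is valid.

```python
import numpy as np

# A 4 x 5 array holding 0..19, so every index used in the text is valid
grid = np.arange(20).reshape(4, 5)
print(grid)

print(grid[2:, 3])    # rows 2 through the end, column 3 -> [13 18]
print(grid[3, :])     # row 3, all columns              -> [15 16 17 18 19]
print(grid[:2, 1:3])  # rows 0-1, columns 1-2           -> [[1 2] [6 7]]
```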

Numpy functions will by default work on the entire array:

In [18]:
np.sum(ones_2d)

12.0

The axis `0` is the one going downwards (i.e. the rows), whereas axis `1` is the one going across (the columns). You will often use functions such as `mean` or `sum` along a particular axis. If you `sum` along axis 0 you are summing across the rows and will end up with one value per column. As a rule, any axis you list in the axis argument will disappear.

In [19]:
np.sum(ones_2d, axis=0)

array([3., 3., 3., 3.])

In [20]:
np.sum(ones_2d, axis=1)

array([4., 4., 4.])
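Because every entry of `ones_2d` is the same, it can be hard to see which direction was summed. The sketch below uses a small made-up array (`scores`) with distinct values and `np.mean` to make the "axis disappears" rule easier to see.

```python
import numpy as np

scores = np.array([[1., 2., 3.],
                   [4., 5., 6.]])      # shape (2, 3)

col_means = np.mean(scores, axis=0)    # axis 0 disappears -> shape (3,)
row_means = np.mean(scores, axis=1)    # axis 1 disappears -> shape (2,)

print(col_means)   # [2.5 3.5 4.5]
print(row_means)   # [2. 5.]
```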

<div class="exercise"><b>Exercise</b></div>
* Create a two-dimensional array of size $3\times 5$ and do the following:
  * Print out the array
  * Print out the shape of the array
  * Create two slices of the array:
    1. The first slice should be the last row and the third through last column
    2. The second slice should be rows $1-3$ and columns $3-5$
  * Square each element in the array and print the result

In [21]:
# your code here
A = np.array([ [5, 4, 3, 2, 1], [1, 2, 3, 4, 5], [1.1, 2.2, 3.3, 4.4, 5.5] ])
print(A, "\n")

# get the shape
dims = A.shape
print(dims, "\n")

# slicing
print(A[-1, 2:], "\n")
print(A[1:3, 3:5], "\n")

# squaring
A2 = A * A
print(A2)

[[5.  4.  3.  2.  1. ]
 [1.  2.  3.  4.  5. ]
 [1.1 2.2 3.3 4.4 5.5]] 

(3, 5) 

[3.3 4.4 5.5] 

[[4.  5. ]
 [4.4 5.5]] 

[[25.   16.    9.    4.    1.  ]
 [ 1.    4.    9.   16.   25.  ]
 [ 1.21  4.84 10.89 19.36 30.25]]


#### `numpy` supports matrix operations
2d arrays are numpy's way of representing matrices. As such there are lots of built-in methods for manipulating them.

Earlier when we generated the one-dimensional arrays of ones and random numbers, we gave `ones` and `random`  the number of elements we wanted in the arrays. In two dimensions, we need to provide the shape of the array, i.e., the number of rows and columns of the array.

In [22]:
three_by_four = np.ones([3,4])
three_by_four

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

You can transpose the array:

In [23]:
three_by_four.shape

(3, 4)

In [24]:
four_by_three = three_by_four.T

In [25]:
four_by_three.shape

(4, 3)

Matrix multiplication is accomplished by `np.dot`. The `*` operator will do element-wise multiplication.

In [26]:
print(np.dot(three_by_four, four_by_three)) # 3 x 3 matrix
np.dot(four_by_three, three_by_four) # 4 x 4 matrix

[[4. 4. 4.]
 [4. 4. 4.]
 [4. 4. 4.]]


array([[3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.]])
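With matrices full of ones it is hard to tell the two kinds of multiplication apart. The sketch below uses two small made-up matrices (`left` and `right`) to contrast `*` with `np.dot`. It also shows the `@` operator, which for 2D arrays gives the same result as `np.dot`.

```python
import numpy as np

left = np.array([[1., 2.],
                 [3., 4.]])
right = np.array([[0., 1.],
                  [1., 0.]])

print(left * right)         # element-wise:   [[0. 2.] [3. 0.]]
print(np.dot(left, right))  # matrix product: [[2. 1.] [4. 3.]]
print(left @ right)         # same matrix product, written with @
```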

Numpy has functions to do the difficult matrix operations that are awful to do by hand.

In [27]:
matrix = np.random.rand(4,4) # a 4 by 4 matrix
matrix

array([[0.49948297, 0.3791904 , 0.44096266, 0.37051393],
       [0.2081437 , 0.3040164 , 0.00407319, 0.62093466],
       [0.43418946, 0.74098855, 0.86031301, 0.58029174],
       [0.55550538, 0.58715387, 0.73228724, 0.0448074 ]])

Let's get the eigenvalues and eigenvectors!

In [28]:
np.linalg.eig(matrix)

(array([ 1.84344652+0.j        , -0.44618896+0.j        ,
         0.15568111+0.11020989j,  0.15568111-0.11020989j]),
 array([[ 0.4413276 +0.j        ,  0.06902564+0.j        ,
         -0.07088047+0.38398144j, -0.07088047-0.38398144j],
        [ 0.26516792+0.j        ,  0.62443802+0.j        ,
          0.68341973+0.j        ,  0.68341973-0.j        ],
        [ 0.69279557+0.j        , -0.0318191 +0.j        ,
         -0.51400629-0.31258417j, -0.51400629+0.31258417j],
        [ 0.50492596+0.j        , -0.77736746+0.j        ,
         -0.13613074-0.00536367j, -0.13613074+0.00536367j]]))

How about inverses?

In [29]:
inv_matrix = np.linalg.inv(matrix) # the inverse matrix
print(inv_matrix)

# check that it's the inverse
np.dot(matrix,inv_matrix)

[[ 3.10581561  0.09941945 -2.13902096  0.64219868]
 [-5.96522573  3.96630181 -0.78191642  4.48864732]
 [ 2.31285014 -3.2346087   2.18316364 -2.57400287]
 [ 1.86436652 -0.34357745  1.08553567 -2.39607755]]


array([[ 1.00000000e+00,  0.00000000e+00, -1.24900090e-16,
         0.00000000e+00],
       [ 1.11022302e-16,  1.00000000e+00, -5.55111512e-17,
         5.55111512e-17],
       [-8.88178420e-16,  0.00000000e+00,  1.00000000e+00,
        -2.22044605e-16],
       [ 0.00000000e+00,  0.00000000e+00, -1.66533454e-16,
         1.00000000e+00]])

Notice that there is a bit of 'rounding error' in the inverse calculation. This is because the computer can't store the exact values the inverse matrix asks for (they have more decimal places than the computer can hold). Built-in numpy routines manage these errors, which is why it's very important to use pre-built tools whenever possible, and to be very cautious when writing your own.

(It happens that there are even more advanced numpy functions like `np.linalg.solve` which are more accurate than just taking the naked inverse)
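As a rough sketch of what that looks like, here is a small made-up system $Ax = b$ (the names `coeffs` and `rhs` are just for this example) solved both ways. On a tiny, well-behaved system like this one the two answers agree. The difference in accuracy only tends to show up on larger or badly conditioned matrices.

```python
import numpy as np

coeffs = np.array([[3., 1.],
                   [1., 2.]])
rhs = np.array([9., 8.])

x_solve = np.linalg.solve(coeffs, rhs)        # solves coeffs x = rhs directly
x_inv = np.dot(np.linalg.inv(coeffs), rhs)    # forms the inverse first, then multiplies

print(x_solve)                                    # approximately [2. 3.]
print(np.allclose(x_solve, x_inv))                # True
print(np.allclose(np.dot(coeffs, x_solve), rhs))  # True
```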

See the documentation to learn more about `numpy` functions as needed.

#### Numpy's layout in the computer's memory
**Advanced coders**: You should notice that access is row-by-row and one dimensional iteration gives a row. This is because `numpy` lays out memory row-wise. <br><br>
**Starting coders**: If you really need a particular section of numpy to run faster, ask an advanced coder to point out improvements in your code. They may have you transpose your arrays. But if your code is fast enough, then it's fast enough.

 <img src="https://aaronbloomfield.github.io/pdr/slides/images/04-arrays-bigoh/2d-array-layout.png" alt="Drawing" style="width: 500px;"/>
 
(from https://aaronbloomfield.github.io)

An often seen idiom allocates a two-dimensional array, and then fills in one-dimensional arrays from some function.  We will discuss why this is a bad design pattern.

In [30]:
# allocate a 2D array
twod = np.zeros([5, 2])
print(twod, "\n")

# Fill in with random numbers
for i in range(twod.shape[0]):
    twod[i, :] = np.random.random(2)
print(twod)

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]] 

[[0.48321035 0.37564029]
 [0.45988703 0.7550365 ]
 [0.80108516 0.43621695]
 [0.35040792 0.97978141]
 [0.03942828 0.30527216]]


In this and many other cases, it is faster to simply do:

In [31]:
twod = np.random.random(size=(5,2))
twod

array([[0.40312569, 0.47105594],
       [0.29186086, 0.05646294],
       [0.49913959, 0.78658258],
       [0.19114486, 0.71519765],
       [0.48748529, 0.91040722]])

### `Numpy` Arrays vs. `Python` Lists?

1. Why the need for `numpy` arrays?  Can't we just use `Python` lists?
2. Iterating over `numpy` arrays is slow. Slicing is faster.

`Python` lists may contain items of different types. This flexibility comes at a price: `Python` lists store *pointers* to memory locations.  On the other hand, `numpy` arrays are typed, where the default type is floating point.  Because of this, the system knows how much memory to allocate, and if you ask for an array of size $100$, it will allocate one hundred contiguous spots in memory, where the size of each spot is based on the type.  This makes access extremely fast.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Drawing" style="width: 500px;"/>

(from the book below)
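You can inspect this fixed-size layout yourself. The sketch below (variable names are made up for this example) uses the array attributes `itemsize` and `nbytes` to show how many bytes each element and the whole array occupy.

```python
import numpy as np

floats = np.zeros(100)                  # default dtype is float64
print(floats.itemsize, floats.nbytes)   # 8 800  (8 bytes per element)

small_ints = np.zeros(100, dtype=np.int32)
print(small_ints.itemsize, small_ints.nbytes)  # 4 400

# A list, by contrast, can mix types, so it cannot be packed this way
mixed_list = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed_list])  # ['int', 'float', 'str']
```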

Unfortunately, looping over an array slows things down. In general you should not access `numpy` array elements by iteration.  This is because of type conversion.  `Numpy` stores integers and floating points in `C`-language format.  When you operate on array elements through iteration, `Python` needs to convert that element to a `Python` `int` or `float`, which is a more complex beast (a `struct` in `C` jargon).  This has a cost.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Drawing" style="width: 500px;"/>

(from the book below)
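To get a feel for the cost, you can time a Python loop against the built-in `np.sum`. This is only a rough sketch: the array size is an arbitrary choice, and the exact timings will depend on your machine, so no expected timings are shown.

```python
import time
import numpy as np

values = np.arange(1000000, dtype=np.float64)

start = time.perf_counter()
loop_total = 0.0
for v in values:              # each element is converted to a Python object
    loop_total += v
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_total = np.sum(values)  # the loop runs inside compiled code
numpy_seconds = time.perf_counter() - start

print("loop: ", loop_seconds, "seconds")
print("numpy:", numpy_seconds, "seconds")
print(np.isclose(loop_total, numpy_total))  # both approaches give the same total
```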

If you want to know more, we suggest that you read [Jake VanderPlas's Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/). You will find that book an incredible resource for this class.

Why is slicing faster? The reason is technical: slicing provides a *view* onto the memory occupied by a `numpy` array, instead of creating a new array. That is the reason the code above this cell works nicely as well. However, if you iterate over a slice, then you have gone back to the slow access.
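One consequence of slices being views is worth seeing for yourself: writing to a slice changes the original array. The sketch below (with made-up variable names) shows this, along with `.copy()` for when you want an independent array.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

window = original[1:4]           # a slice is a view: no data is copied
window[0] = 99                   # writing through the view changes the original
print(original)                  # [ 1 99  3  4  5]

detached = original[1:4].copy()  # .copy() makes an independent array
detached[0] = -1
print(original)                  # still [ 1 99  3  4  5]

print(np.shares_memory(original, window))    # True
print(np.shares_memory(original, detached))  # False
```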

By contrast, functions such as `np.dot` are implemented at `C`-level, do not do this type conversion, and access contiguous memory. If you want this kind of access in `Python`, use the `struct` module or `Cython`. Indeed many fast algorithms in `numpy` and `pandas` are either implemented at the `C`-level, or employ `Cython`.

# Concept Check:
Answer these questions and see the bottom of the lab to check your answers.

1. What is the major benefit of working in `numpy`?
2. Why is it vital to use built-in numpy functions whenever possible?
3. You need to find the running total of each element in a numpy array. For example, the input \[2,5,3,2\] should give the output \[2,7,10,12\]. Rather than writing your own code, think of at least two Google queries to find a relevant numpy function.
4. What `numpy` function does the job above?