### **Please read carefully. Ask questions if you are unsure.**



We use an auto-grader to check your work. If you invent new notation (for example, new variable names), the auto-grader will fail and you will receive no points. We will NOT do manual regrades for failure to use the requested variable names.



Do not reinitialize variables and data provided for you. Please just run the cells when information is initialized for you. DO NOT RETYPE IT unless it is in a static cell (a cell that has no run button).



Not all tests are visible to you. Passing a visible test does not guarantee full credit. Take some time to understand what your code is doing and what it should output so you can check your answers before submission.





---



Estimating Precipitation in Alaska Using Surface Regression
===========================================================



Introduction
------------



In this part of the project, your goal is to perform a 3D surface regression: you will fit a model that lets us estimate the total precipitation during July 2020 at any (longitude, latitude) pair in the state of Alaska.



Our estimation will be based on datasets maintained by the National Oceanic and Atmospheric Administration (NOAA) which provides free access to a variety of weather and climate datasets.



When performing any kind of data analysis, one of the first steps is to get a feel for the data. That can mean many things: understanding the size of the dataset, seeing what the data looks like (does it contain strings, integers, or floats?), and checking whether any data is missing, to name just a few.



In [None]:
using LaTeXStrings, LinearAlgebra, CSV, ProgressBars, Printf, Random, DataFrames, Plots

import GMT
gr()

**The next cell reads in our dataset and visualizes the information.**



**Dataset Citation** : Vose, Russell S., Applequist, Scott, Squires, Mike, Durre, Imke, Menne, Matthew J., Williams, Claude N. Jr., Fenimore, Chris, Gleason, Karin, and Arndt, Derek (2014): Gridded 5km GHCN-Daily Temperature and Precipitation Dataset (nCLIMGRID), Version 1. 202007.prcp.alaska.pnt. NOAA National Centers for Environmental Information. DOI:10.7289/V5SX6B56 Aug 1, 2020.



In [None]:
#= Upload the dataset
and display the size and first 
five rows of the data. 
=#
df = CSV.read("202007_prcp_alaska_1.csv", DataFrame, header=false)
data = Matrix(df)

@show typeof(data)
@show size(data)

# Output the first five rows
show(stdout, "text/plain", data[1:5,:])
println()

The dataset we have just read in contains 3 columns:



* The first column contains longitude values which range from -180 to 0 (western hemisphere) or 0 to 180 (eastern hemisphere).
* The second column contains latitude values which we can think of as horizontal slices across the world. Latitude values can range from 0 to 90 (northern hemisphere) or -90 to 0 (southern hemisphere).
* The third column contains total precipitation information in millimeters.


Note: 300 mm = 30 cm ≈ 1 foot of rain
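
As a minimal sketch of the first-look checks described earlier, the snippet below runs them on a small made-up matrix with the same three-column layout. The name `toy` and all of its values are hypothetical; they are not part of the assignment data and do not replace any provided variable.

```julia
# Hypothetical toy matrix: longitude, latitude, precipitation (mm)
toy = [-150.0 61.2  12.5;
       -149.5 61.3  30.0;
       -134.4 58.3 301.0]

n_rows, n_cols = size(toy)        # number of measurements and columns
has_nan = any(isnan, toy)         # true if any entry is NaN
lon_range = extrema(toy[:, 1])    # (smallest, largest) longitude
lat_range = extrema(toy[:, 2])    # (smallest, largest) latitude

# Convert the largest precipitation value to feet (1 foot = 304.8 mm)
max_precip_ft = maximum(toy[:, 3]) / 304.8

@show n_rows n_cols has_nan lon_range lat_range max_precip_ft
```

The same calls work on the full `data` matrix once it has been loaded.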





---



Below, we give you the data you will use when building your regressor matrix $\Phi$ and when assigning the vector of measured values, $Y$. Please make a note of them and do not recreate/reinitialize them.



In [None]:
#= 
Split the data into geometric coordinates and precipitation values
You will use these later!!
=#

#Use this matrix when building Phi
#column 1 is longitude, column 2 is latitude
dataLongLat=data[:,1:2] 

#This is a vector of measured values, Y
dataPrecip=data[:,3] 

@show size(dataLongLat) 
@show length(dataPrecip) 

This being a larger dataset, let's create a scatter plot using the location of a small portion of the samples to further get a sense of the dataset and the values within it. We will use the provided sample function to do this.



In [None]:
#=
Samples a percentage of the rows of matrix 'data'

Input:
    data    - data matrix to sample rows from
    percent - 0 < percent <= 100
=#
function sample_rows(data, percent)
   
    @assert percent > 0 && percent <= 100
    
    N = size(data, 1)
    M = floor(Int64,percent*N/100)
    
    # Sets the initial condition of the random number generator
    # so that every notebook will arrive at the same results
    Random.seed!(1817);
    center_indices = collect(1:N); 
    indices = shuffle(center_indices)[1:M]
    
    col1 = data[indices, 1]
    col2 = data[indices, 2]
    prcp = data[indices, 3]
    
    # Assemble columns using horizontal concatenation
    # concatenation in this case means placing side by side
    subset = hcat(col1, col2, prcp)

    return subset
end

In [None]:
#=
Sample and display the locations
of 1.5% of the measurements 
in our dataset
=#
percent = 1.5 

dataSubset = sample_rows(data, percent)
# We only want the geometric data for the centers
centers = dataSubset[:,1:2] 

s1 = scatter(
    centers[:,1], 
    centers[:,2], 
    markersize=1,
    label="Measurements",
    xlabel="Longitude",
    ylabel="Latitude",
    title="Locations of Precipitation Measurements"
)

After running the scatter function, you should be able to see the shape of Alaska take form. **Each (x, y) point corresponds to a (longitude, latitude) pair in the dataset.**



Another view of the data is to visualize the precipitation value at each point. We use the GMT package's `bar3` function to create a 3D plot of the precipitation recorded at each location. Before calling the function, however, we need to divide Alaska into a grid of bins. Imagine placing a checkerboard over the above graph. Each square in the checkerboard has a (longitude, latitude) pair associated with it, and each measurement is placed in the square it falls into. If a square contains more than one measurement, we take their average.
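
The binning-and-averaging idea can be sketched in one dimension. The example below is only an illustration under simplifying assumptions: the names `edges_demo`, `x_demo`, and `rain_demo` and their values are made up, and a single axis stands in for the two-dimensional checkerboard.

```julia
# Hypothetical bin edges along one axis and three made-up measurements
edges_demo = [0.0, 1.0, 2.0, 3.0]
x_demo     = [0.2, 0.7, 2.5]      # positions of the measurements
rain_demo  = [10.0, 30.0, 5.0]    # rain recorded at each position

sums   = zeros(length(edges_demo))
counts = zeros(length(edges_demo))

for (x, r) in zip(x_demo, rain_demo)
    # Index of the last edge that is <= x, i.e. the bin that x falls into
    k = searchsortedlast(edges_demo, x)
    sums[k]   += r
    counts[k] += 1
end

# Average per bin; bins with no measurements are set to 0
avg_demo = [c > 0 ? s / c : 0.0 for (s, c) in zip(sums, counts)]
@show avg_demo
```

The first two measurements share a bin, so that bin holds their average, while empty bins stay at zero.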



The pixelate function below helps us with taking our data and turning it into the grid we described. You only need to understand the inputs and outputs to the function, not all of the code.



In [None]:
#=
pixelate(data::AbstractArray, n_xpixels::Int, n_ypixels::Int)

Takes an N x 3 matrix with columns (lon,lat,rain) and returns a grid where each pixel is 
equal to the average of all the rain measurements that fall into that pixel

Inputs:
    data      - the data set
    n_xpixels - the number of pixels on the x axis (longitude)
    n_ypixels - the number of pixels on the y axis (latitude)

Output:
    grid      - a n_ypixels x n_xpixels matrix holding the average 
                rain that fell in that location
=#
function pixelate(data::AbstractArray, n_xpixels::Int, n_ypixels::Int)

    # (lon,lat) coordinates with measured rain in third column
    @assert size(data, 2) == 3
    
    # No nan values exist in data
    @assert !any(isnan.(data))
    
    xmin = min(data[:,1]...)
    xmax = max(data[:,1]...)
    ymin = min(data[:,2]...)
    ymax = max(data[:,2]...)
    
    xedges = collect(range(floor(xmin), ceil(xmax), length=n_xpixels))
    yedges = collect(range(floor(ymin), ceil(ymax), length=n_ypixels))
    
    grid = zeros(n_ypixels, n_xpixels)
    count = zeros(n_ypixels, n_xpixels)
    
    for i in 1:size(data,1)
        x_idx = searchsortedlast(xedges, data[i,1])
        y_idx = searchsortedlast(yedges, data[i,2])
        
        grid[y_idx, x_idx] += data[i,3]
        count[y_idx, x_idx] += 1
    end
    
    grid = grid./ count
    
    # Replace NaN from divide by 0 with 0
    replace!(grid, NaN=>0)
        
    return GMT.mat2grid(grid, x=xedges, y=yedges) 
end

In [None]:
#= 
create a 3D bar graph that
displays the precipitation 
across the state of Alaska
=#

grid = pixelate(dataSubset, 100, 100)
# Compute a colormap with the grid's data range
cmap = GMT.grd2cpt(grid);  

# Plotting function to make a 3D bar graph
GMT.bar3(grid, lw=:thinnest, color=cmap, fmt=:png, show=true, view=(200,50),
         xlabel="Longitude", ylabel="Latitude", title="Precipitation in Alaska (mm)")

We see that not a lot of precipitation fell over most of the state of Alaska. However, in the south of Alaska, near the state capital Juneau, there was quite a bit of precipitation in July relative to the rest of the state.



If you try plotting a higher percentage of measurement samples, you may feel that the 3D bar plot is already dense enough. However, one degree of latitude or longitude can span tens of miles on the ground (about 69 miles per degree of latitude, and less per degree of longitude as you move toward the poles). **We wish to create a model that allows us to estimate the precipitation level at any (longitude, latitude) pair at infinite resolution.**





---



### You will create a model that estimates the amount of precipitation that fell during the month of July anywhere in Alaska. To do this, you will perform a surface regression using radial basis functions (RBFs) as your basis.



**Here is a new RBF function that works for vectors**



In [None]:
# Radial basis function
rbf(x, xc, s) = exp.(-norm(x-xc)^2 / (2*s^2))

#=
Example of how to use the RBF
with vectors instead of scalars
=#

# Each Longitude-Latitude data pair is a 2-vector
x = [3 4] 

# Each center is also a 2-vector 
xc = [3.5 4.5] 

s = 1

# Call the RBF
rbf(x, xc, s)
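
To make the RBF value concrete, here is a small worked sketch. It uses separate, hypothetical names (`rbf_demo`, `x_demo`, `xc_demo`) so that nothing provided in the notebook is overwritten. For the same numbers as above, the squared distance is 0.5² + 0.5² = 0.5, so the result is exp(-0.5 / 2) = exp(-0.25).

```julia
using LinearAlgebra

# Same formula as the provided rbf, under a different (hypothetical) name
rbf_demo(x, xc, s) = exp(-norm(x - xc)^2 / (2 * s^2))

x_demo  = [3.0, 4.0]    # a point in R^2
xc_demo = [3.5, 4.5]    # a center in R^2

val_near  = rbf_demo(x_demo, xc_demo, 1.0)   # exp(-0.25)
val_self  = rbf_demo(x_demo, x_demo, 1.0)    # distance 0 gives exactly 1
val_wider = rbf_demo(x_demo, xc_demo, 2.0)   # larger s decays more slowly

@show val_near val_self val_wider
```

The value is 1 at the center and shrinks toward 0 as the point moves away; a larger `s` makes the bump wider.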

**More helpful functions for later:**



In [None]:
function forwardsub(L, b)
    # Assert no entries in the diagonal of L
    # are 0 (or very close to 0)
    @assert minimum(abs.(diag(L))) > 1e-6
  
    n = length(b)
    x = Vector{Float64}(undef, n)
  
    x[1] = b[1]/L[1,1] 
    for i = 2:n
        x[i] = (b[i]- (L[i,1:i-1])'*x[1:i-1] )/L[i,i] 
        
    end
  
    return x
end

function backwardsub(U, b)
    
    # Assert no entries in the diagonal of U
    # are 0 (or very close to 0)
    @assert minimum(abs.(diag(U))) > 1e-6
    
    n = length(b)
    x = Vector{Float64}(undef, n)

    x[n] = b[n] / U[n,n]
    for i = n-1:-1:1
        x[i] = (b[i] - (U[i,(i+1):n])' * x[(i+1):n]) / U[i,i]
    end
    
    return x    
end



---



### Copy your `least_squares_lu()` function from Part 1



This must be correct for you to move forward with this part of the project. You will earn 1 point for this.
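
If your copied function fails the test below, one way to narrow down the problem is an independent cross-check with Julia's built-in backslash operator, which solves the least-squares problem when the system is overdetermined. This is only a sanity check under hypothetical names (`PhiCheck`, `YCheck`, `alphaCheck`); it is not a substitute for your `least_squares_lu()`.

```julia
using LinearAlgebra

# Same numbers as the test cell, stored under hypothetical names
PhiCheck = [1 1.0; 1 2; 1 4; 1 5]
YCheck   = [2; 3.2; 4.7; 6]

# For a tall matrix, `\` returns the least-squares solution
alphaCheck = PhiCheck \ YCheck

# The residual of the normal equations should be numerically zero
normal_eq_residual = norm(PhiCheck' * PhiCheck * alphaCheck - PhiCheck' * YCheck)

@show alphaCheck normal_eq_residual
```

Your own function should return approximately the same two coefficients.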



In [None]:
# YOUR ANSWER HERE
# This is the only test! No hidden tests!

PhiTest = [ 1 1.0; 1 2; 1 4; 1 5]
YTest = [2; 3.2; 4.7; 6]
alphaStarTest=least_squares_lu(PhiTest,YTest)
display(alphaStarTest)
is_it_correct_check1 = (norm(alphaStarTest- [1.125000000000001;  0.9499999999999996]) < 1e-4) ? "Yes" : "No" 

@show is_it_correct_check1
println("\n If you failed this test, but passed it in Part 1, then you did not copy it exactly \n")
@show @assert is_it_correct_check1 == "Yes"



---



Reminder: `dataLongLat=data[:,1:2]` is a matrix that was defined many cells above. You will need it for the x-values in the RBF function.



You will also need a set of centers for the `x_c` values in your RBF function. They are defined in the next cell. Look below for `centersLongLat=centers[:,1:2]`.



In [None]:
s = 1
percent = 1.5 

# longitude, latitude, precipitation
centers = sample_rows(data, percent)

# only the longitude, latitude pairs
centersLongLat=centers[:,1:2]

@show size(centers)
@show size(centersLongLat)



---



Task 3: Build a Regression Model to Predict the Precipitation in Alaska
-----------------------------------------------------------------------



### **Task 3a**



Modify the helper functions `calc_phi_row()` and `regressor_matrix()` so that they accept data that now lives in $\mathbb{R}^2$.



Think carefully about the variables being used in your function. Are they scalars, vectors, or matrices? This is important because it changes how you index into them.
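
The difference between indexing a vector and indexing a matrix can be seen on a small made-up example. The names `v_demo` and `A_demo` below are hypothetical and unrelated to the assignment variables.

```julia
v_demo = [10, 20, 30]          # a vector with 3 entries
A_demo = [1 2; 3 4; 5 6]       # a 3x2 matrix: 3 rows, 2 columns

len_v   = length(v_demo)       # 3
len_A   = length(A_demo)       # 6: total number of entries, NOT the number of rows
n_rows  = size(A_demo, 1)      # 3: number of rows
row_two = A_demo[2, :]         # second row, returned as a 2-element vector
lin_two = A_demo[2]            # single-index access walks down columns first

@show len_v len_A n_rows row_two lin_two
```

Note that a single index into a matrix returns one scalar entry, while `A_demo[2, :]` returns a whole row.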



Take the function `calc_phi_row(xᵢ, centers, s)` from Part 1 and modify it to work here where `xᵢ` is a vector, `centers` is a matrix, and `s` remains a scalar.



In [None]:
#= 
calc_phi_row() from Part 1
I recommend copying this and then making edits 
to account for the fact that centers is a matrix.
Do not alter the for loop to a nested loop.
A single loop is sufficient.

DO NOT change the variable names!
=#

function calc_phi_row(xᵢ, centers, s)
    # xi is a scalar
    # centers is a vector of centers for the rbf basis elements
    # s is the scale value
    
    # plus one bc we include a constant vector
    NumBasisElements = length(centers) + 1
    
    phi_row = zeros(1, NumBasisElements)
    phi_row[1] = 1 
    for i in 2:NumBasisElements
        phi_row[i] = rbf(xᵢ, centers[i-1], s)
    end 
    return phi_row
end

In [None]:
# YOUR ANSWER HERE
dummy_dataLongLat = [1 3;2 4;3 5;4 6;5 7;6 8;7 9;8 10;9 11]
dummy_centers = [2 5; 4 7; 5 6]
phi_row_1_test = calc_phi_row(dummy_dataLongLat[1,:], dummy_centers,1)
phi_row_5_test = calc_phi_row(dummy_dataLongLat[5,:], dummy_centers,1)

is_it_correct_check1 = isapprox(phi_row_1_test, [1.0  0.082085  3.72665e-6  3.72665e-6], atol = 1e-3) ? "Yes" : "No"
is_it_correct_check2 = isapprox(phi_row_5_test, [1.0  0.00150344  0.606531  0.606531], atol = 1e-3) ? "Yes" : "No"

@show is_it_correct_check1;
@show is_it_correct_check2;

println("\n dummy centers \n")
show(stdout, "text/plain", dummy_centers)

println("\n\n dummy dataLongLat \n")
show(stdout, "text/plain", dummy_dataLongLat)
println("\n")


#= 
The point value for getting this problem correct 
is included in Part B.
In other words, you  may see 0 points
but this function still matters 
and will affect your grade in later problems
=#

This next part has you take your function `regressor_matrix` from Part 1 and modify it to work here. It will not work as is. You have to make edits.



Build the regressor matrix row by row using a SINGLE for loop and your modified function `calc_phi_row(xᵢ, centers, s)`. Once again, be very careful about the sizes of the variables in your function and be careful about how you index into matrices vs vectors. We strongly suggest that you check `size(centersLongLat)`, and while you are at it, you can also check `size(dataLongLat)`.



In [None]:
#= 
regressor_matrix() from Part 1
I recommend copying this and then making edits 
to account for the fact that X is a matrix.
Do not alter the for loop to a nested loop.
A single loop is sufficient.

DO NOT change the variable names!
=#

function regressor_matrix(X, centers, s)
    # X is a vector of points in R
    # centers is a vector of centers for the rbf basis elements
    # s is the scale value 
    N = length(X)
    M = length(centers)
    Phi = Array{Float64, 2}(undef, N, M+1) 
    for i in 1:N
        Phi[i, :] = calc_phi_row(X[i], centers, s)
    end 
    return Phi
end

#=
DESCRIPTION FOR YOUR NEW regressor_matrix()

function regressor_matrix()

Returns the regressor matrix Phi

Inputs:
    X       - an Nx2 matrix holding the X value of all the measurements
    centers - an Mx2 matrix holding the centers of the determined RBFs
    s       - the shared kernel width (RBF width)
=#

In [None]:
# YOUR ANSWER HERE
#= 
If your function is LIKELY correct, this will be its output

9×4 Matrix{Float64}:
 1.0  0.00673795   4.13994e-8   1.12535e-7
 1.0  0.367879     0.00673795   0.0183156
 1.0  0.00673795   0.367879     1.0
 1.0  4.13994e-8   0.00673795   0.0183156
 1.0  3.09882e-12  1.25015e-9   3.72665e-6
 1.0  0.135335     0.135335     0.367879
 1.0  4.53999e-5   0.135335     0.367879
 1.0  5.10909e-12  4.53999e-5   0.00012341
 1.0  1.92875e-22  5.10909e-12  1.38879e-11
=#

regressor_matrix([1 2; 3 4; 5 6; 7 8; 9 3; 4 5; 6 7; 8 9; 10 11], [2 5; 4 7; 5 6], 1)

#= 
The point value for getting this problem correct 
is included in Part B.
In other words, you  may see 0 points
but this function still matters 
and will affect your grade in later problems
=#

### **Task 3b**



You will set up the regression problem and use the functions implemented above to solve for the vector of weights `a_star`. Even though our data has grown in dimension, we are still using the same model for fitting as at the end of Part 1! As we saw when we plotted 1.5% of the points, we could make out the shape of Alaska. Thus 1.5% of the data provides decent enough coverage of the state, so we will use those locations as our basis centers.



In the next cell write the code necessary to solve for the coefficients of our model `a_star`.



$$\hat{y} = a_1 + a_2 f(x; x_{c_1}, s) + a_3 f(x; x_{c_2}, s) + \cdots + a_{M+1} f(x; x_{c_M}, s)$$



We are expecting you to use your function `least_squares_lu` in order to compute `a_star`



* Build `Phi`
* Build `Y`
* Build `a_star`

Hint: use `dataLongLat`, `dataPrecip`, and `centersLongLat` that were created in the first few cells


In [None]:
# YOUR ANSWER HERE
ans1 = isapprox(a_star[1:5], [41.02774014953474, 3001.0797227419307, 163844.96534774158, 28363.22655160181, 7288.967977936066], atol = 1)

is_it_correct_check1 = ans1 ? "Yes" : "No" 

@show is_it_correct_check1;

In [None]:
#=
Use this information to reason
about whether your answer 
is correct or not
=#

@show size(dataLongLat)
@show size(centersLongLat)
@show size(Phi)
@show length(a_star)
@show length(dataPrecip)

println("\n Recall that a_star includes a constant term, which is why it is 
    one longer than the number of rows in centersLongLat \n")

### **Task 3c**



Build a function that computes the amount of precipitation at any position `x=[longitude; latitude]` in Alaska and call it `Precip(x)`.



Hint: Look back at how we built the function `f_hatRBF(x)` for you in Part 1.



In [None]:
# YOUR ANSWER HERE
if isa(Precip([-159.159, 70.5409]), Vector) || isa(Precip([-159.159, 70.5409]), Matrix)
    println("Your `Precip(x)` function implementation is wrong.")
    println("Your `Precip(x)` function should return a real-valued number and not a Vector.\n")
    println("Extract the number from the Vector before returning it from your `Precip(x)` function")
else
    println("Good. Your `Precip(x)` function returns the right value Type.")
    println("Now on to the remaining friendly check...\n")
    println("Is the value returned correct? \n")

    # if the value of is_it_correct_checkN is "Yes", then your answer may be correct. 
    # If the value of is_it_correct_checkN is "No", then your answer is wrong

    is_it_correct_check1 = isapprox(Precip([-159.159, 70.5409]), 21.89695, atol=1e-1) ? "Yes" : "No"   

    @show is_it_correct_check1; 
end;

With your model approximation, `Precip(x)`, you can now provide an estimate of the precipitation at any longitude and latitude pair in Alaska. Let's test it and see how much precipitation, in millimeters, fell in Juneau in July according to our model. We will use a longitude and latitude of (-134.410652, 58.301930) obtained from Google Maps. You should see a value between 290 and 320 mm. Store your answer in a variable named `rain_in_juneau`.



In [None]:
longitude = -134.410652
latitude = 58.301930

In [None]:
# YOUR ANSWER HERE
#= 
If you do not pass the test, take
a look back at how you calculated a_star. 

Remember your answer should be between 290 and 320 mm
=#
@printf("In July 2020, our model predicts that a total of %.2f mm of rain fell in Juneau, Alaska. \n", rain_in_juneau)

### **Task 3d**



Estimate the precipitation that fell over the entire state of Alaska at discretized (longitude, latitude) pairs. We did not get any precipitation measurements off of the coast of Alaska (measurements were only taken over land) so we only estimate with our model if the (longitude, latitude) coordinate falls approximately on land. Your job is to fill in the code in the space provided (*second cell below*).



In [None]:
#=
Divide up the state of Alaska into
a series of squares that look 
like a checkerboard / grid. In 
the next cell we evaluate 
the precipitation for each square 
in the grid.
=#

# Latitude and longitude extrema of Alaska
lon_min = -178.0
lon_max = -130.0
lat_min = 51.0
lat_max = 72.0

# Number of squares we want on the x and y axis for plotting our 3D bar graph
n_xpixels = 200
n_ypixels = 200

lon_edges = collect(range(floor(lon_min), ceil(lon_max), length=n_xpixels));
lat_edges = collect(range(floor(lat_min), ceil(lat_max), length=n_ypixels));
gmt_grid = pixelate(data, n_xpixels, n_ypixels);

The next cell is used to estimate the precipitation across Alaska. `mat[i,j]` should hold the precipitation that fell at `lon` and `lat` which are set at the beginning of the inner for loop each time.



Use the model you just fit, `Precip(x)`, along with the given variables above to estimate the precipitation and set the value in `mat[i,j]`.
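
The row/column convention of the loop can be sketched on a tiny grid. In the example below every name (`lon_demo`, `lat_demo`, `g_demo`, `mat_demo`) is hypothetical, and `g_demo` is an arbitrary stand-in function, not the precipitation model. The point is only that the row index `i` selects a latitude and the column index `j` selects a longitude.

```julia
# Hypothetical coarse grid: 3 longitudes, 2 latitudes
lon_demo = collect(range(-140.0, -130.0, length=3))   # [-140.0, -135.0, -130.0]
lat_demo = collect(range(55.0, 60.0, length=2))       # [55.0, 60.0]

# Arbitrary stand-in for a function of (longitude, latitude)
g_demo(lon, lat) = lon + lat

# Rows correspond to latitude, columns to longitude
mat_demo = zeros(length(lat_demo), length(lon_demo))
for i in eachindex(lat_demo)
    for j in eachindex(lon_demo)
        mat_demo[i, j] = g_demo(lon_demo[j], lat_demo[i])
    end
end

@show size(mat_demo)
@show mat_demo
```

Keeping latitude on the rows matches the `n_ypixels x n_xpixels` layout that `pixelate` returns.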



In [None]:
#Type me exactly and add your code to mat[i,j]
mat = zeros(n_ypixels, n_xpixels)
for i = 1:n_ypixels
    for j = 1:n_xpixels
        lon = lon_edges[j]
        lat = lat_edges[i]
        
        #= 
        Do not estimate if coordinate is not on land
        We check this by seeing if there was any measured
        precipitation near that lon,lat pair in the original data set
        =#
        
        if gmt_grid.z[i,j] > 1e-6
            
            mat[i,j] = #YOUR CODE HERE
            
        end
    end
end

In [None]:
# YOUR ANSWER HERE
@printf("Maximum rain in any place in Alaska measured %0.1f mm \n",maximum(mat))  
@printf("There was on average %0.1f mm of precipitation in Alaska in July 2020 \n", sum(mat)/(length(lat_edges)*length(lon_edges)))
#=
to know if you are correct
check if your plot (below) 
matches what is expected
in the Project 2 guide
=#

In [None]:
#Compare me to the Project 2 Guide!

grid = GMT.mat2grid(mat, x=lon_edges, y=lat_edges)
# Compute a colormap with the grid's data range
cmap = GMT.grd2cpt(grid);  

# Plotting function to make a 3D bar graph
GMT.bar3(grid, lw=:thinnest, color=cmap, fmt=:png, show=true, view=(200,50),
    xlabel="Longitude", ylabel="Latitude", title="Precipitation in Alaska (mm)")

Congratulations, you have fit a surface to the dataset!
-------------------------------------------------------



![](https://media.giphy.com/media/13p77tfexyLtx6/giphy.gif)

It is possible to produce a good final plot and yet have made errors along the way that cancel one another out. You are responsible for checking your work as you go.



### Don't forget to hit the submit button.




