# Data Science Mathematics
# Simple Linear Regression
# In-Class Activity

Refer to your class handout for background information.

In [2]:
import numpy as np
from scipy import stats

Let's instantiate the data set.

In [3]:
submarine_sightings = np.array([1,2,3,4,5,6,7,8,9,10])
cyber_activity_metric = np.array([0.021025,0.022103,0.023237,0.024428,0.025681,0.026997,0.028381,0.029836,0.031366,0.032974])

Now, let's calculate our regression values.

In [4]:
slope, intercept, r_value, p_value, std_err = stats.linregress(submarine_sightings,cyber_activity_metric)

In [5]:
slope, intercept, r_value, p_value, std_err

(0.001324557575757576,
 0.019317733333333337,
 0.9980126947882119,
 6.807697897873586e-11,
 2.9567948236252566e-05)

Next, print the $R^2$ value. How good is your fit?

In [6]:
print('r-squared:', r_value**2)

r-squared: 0.9960293389584285
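As a cross-check on what this number means, here is a minimal sketch that recomputes $R^2$ from the residuals as $1 - SS_{res}/SS_{tot}$. The names `x`, `y`, `fit`, `ss_res`, and `ss_tot` are illustrative and not part of the original activity; the data are copied from the cell above.

```python
import numpy as np
from scipy import stats

# Same data as the activity (short illustrative names, not from the handout)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0.021025, 0.022103, 0.023237, 0.024428, 0.025681,
              0.026997, 0.028381, 0.029836, 0.031366, 0.032974])

fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x

ss_res = np.sum((y - predicted) ** 2)   # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean of y
r_squared = 1 - ss_res / ss_tot

print("r-squared from residuals:", r_squared)
print("r-squared from r_value:  ", fit.rvalue ** 2)
```

For a least-squares line with an intercept, the two values should agree up to floating-point error, so about 99.6% of the variation in the cyber activity metric is accounted for by the fitted line.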


### Question 1

In [15]:
#Question 1 a. get sample correlation coefficient
# \hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{cov(x, y)}{var(x)} = r_{xy} \frac{s_y}{s_x}
# get function for covariance
def get_covariance(x, y, x_bar, y_bar):
    sigma_xy = 0
    for i in range(len(x)): 
        sigma_xy += ((x[i] - x_bar)*(y[i] - y_bar))
    cov = sigma_xy/(len(x)-1)
    return cov
# get sample variance (n - 1 in the denominator), just for fun
def get_variance(x, x_bar):
    sigma_x = 0
    for i in range(len(x)): 
        sigma_x += ((x[i] - x_bar)**2)
    var = sigma_x/(len(x)-1)
    return var
# get means, standard deviation, and variance
sub_bar = np.mean(submarine_sightings)
cyber_bar = np.mean(cyber_activity_metric)
sub_std = np.std(submarine_sightings, ddof=1)
cyber_std = np.std(cyber_activity_metric, ddof=1)
# verify that the hand-written variance function matches NumPy
var1 = np.var(submarine_sightings, ddof=1)
sub_var = get_variance(submarine_sightings, sub_bar)
assert np.isclose(var1, sub_var)
#now get covariance
cov = get_covariance(submarine_sightings, cyber_activity_metric, sub_bar, cyber_bar)
#and check
print(cov, np.cov(submarine_sightings, cyber_activity_metric))
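# get coefficient: r_xy = beta_hat * (s_x / s_y), where beta_hat = cov(x, y) / var(x)
corr_coeff = (cov/sub_var)*(sub_std/cyber_std)
# and check: corr_coeff1 is NumPy's 2x2 correlation matrix; its off-diagonal entry is r_xy
corr_coeff1 = np.corrcoef(submarine_sightings, cyber_activity_metric)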
print("Correlation coefficient is", corr_coeff)

0.012141777777777779 [[9.16666667e+00 1.21417778e-02]
 [1.21417778e-02 1.61465964e-05]]
Correlation coefficient is 0.9980126947882119


In [17]:
# Question 1 b. find Alpha and Beta
beta = corr_coeff*(cyber_std/sub_std)
print("Beta is", beta)
# \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}
alpha = cyber_bar - beta*sub_bar
print("Alpha is", alpha)


Beta is 0.0013245575757575762
Alpha is 0.019317733333333333
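The cell above recovers Beta from the correlation coefficient, which was itself built from the covariance. As a sketch of the more direct route, the snippet below computes Beta as cov(x, y) / var(x) and confirms that it matches the r_xy * s_y / s_x form. The short names `x`, `y`, `beta_direct`, `alpha_direct`, and `beta_via_r` are illustrative and not from the original notebook.

```python
import numpy as np

# Same data as the activity (short illustrative names)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0.021025, 0.022103, 0.023237, 0.024428, 0.025681,
              0.026997, 0.028381, 0.029836, 0.031366, 0.032974])

# Route 1: beta = cov(x, y) / var(x), both with the sample (n - 1) denominator
cov_xy = np.cov(x, y)[0, 1]        # np.cov normalizes by n - 1 by default
var_x = np.var(x, ddof=1)
beta_direct = cov_xy / var_x
alpha_direct = y.mean() - beta_direct * x.mean()

# Route 2: beta = r_xy * s_y / s_x
r_xy = np.corrcoef(x, y)[0, 1]
beta_via_r = r_xy * np.std(y, ddof=1) / np.std(x, ddof=1)

print("Beta (cov/var):  ", beta_direct)
print("Beta (r*s_y/s_x):", beta_via_r)
print("Alpha:           ", alpha_direct)
```

Both routes should reproduce the slope and intercept reported by `stats.linregress` earlier, up to floating-point error.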


Question 1 c. Inspecting the two data sets suggests that a linear relationship exists: the cyber activity metric increases in roughly proportional steps as the number of submarine sightings increases.

Question 1 d. The correlation coefficient (about 0.998) is very close to 1, which indicates a strong positive linear correlation between these data sets. However, the data record only submarine sightings, not the total number of submarines in the region, so the correlation may not fully reflect the relationship between actual submarine presence and cyber activity.


In [15]:
# Question 2: 
# f(m, b) = m^2 + b^2 and \nabla f(m, b) = [2m, 2b]
# a_{n+1} = a_n - \gamma \nabla f(a_n)
# 2 a. \gamma (the "step") is 0.1 and the starting point a_0 is (1, 5)
def gradient_descent(step, pos):  # helper so each learning rate can be run with one call
    # pos is [m, b, f(m, b)] and is updated in place.
    # Values are rounded to 2 decimals at every step, so later steps inherit that rounding.
    for i in range(5):
        pos[0] = round(pos[0] - step * 2 * pos[0], 2)  # m component of a_(n+1)
        pos[1] = round(pos[1] - step * 2 * pos[1], 2)  # b component of a_(n+1)
        pos[2] = round(pos[0]**2 + pos[1]**2, 2)  # f(m, b) evaluated at a_(n+1)
        print("Step {}".format(i+1))
        print(pos)
        print("\n")
step = 0.1
pos = [1, 5, "f(m,b)"]  # [m, b, placeholder for f(m, b) = m**2 + b**2]
gradient_descent(step, pos)


Step 1
[0.8, 4.0, 16.64]


Step 2
[0.64, 3.2, 10.65]


Step 3
[0.51, 2.56, 6.81]


Step 4
[0.41, 2.05, 4.37]


Step 5
[0.33, 1.64, 2.8]




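The model does begin to converge: each step multiplies m and b by 1 - 2(0.1) = 0.8, so the gradient [2m, 2b] and the value of f(m, b) shrink with every iteration (f falls from 26 at the starting point to about 2.8 after five steps).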
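Because the loop above rounds to two decimals at every step, the printed values drift slightly from the exact iterates. The following sketch reruns the same update without rounding and compares it against the closed form a_n = 0.8^n * a_0 that follows from the update rule with a step of 0.1. The function name `gradient_descent_unrounded` and its `n_steps` parameter are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def gradient_descent_unrounded(step, start, n_steps=5):
    """Gradient descent on f(m, b) = m**2 + b**2 with no intermediate rounding."""
    pos = np.asarray(start, dtype=float)
    history = []
    for _ in range(n_steps):
        grad = 2 * pos                  # gradient of m**2 + b**2 is [2m, 2b]
        pos = pos - step * grad         # a_(n+1) = a_n - step * grad f(a_n)
        history.append((float(pos[0]), float(pos[1]), float(np.sum(pos ** 2))))
    return history

history = gradient_descent_unrounded(0.1, [1, 5])
for n, (m, b, f) in enumerate(history, start=1):
    closed_form = 0.8 ** n * np.array([1.0, 5.0])
    matches = np.allclose([m, b], closed_form)
    print(f"Step {n}: m={m:.5f}, b={b:.5f}, f={f:.5f}, matches closed form: {matches}")
```

The unrounded fifth iterate is (0.32768, 1.6384), compared with the rounded (0.33, 1.64) printed above, so the rounding does not change the conclusion.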

In [16]:
#Question 2 b.
step = 0.5
pos = [1,5,"f(m,b)"] # have to reset this value
gradient_descent(step, pos)

Step 1
[0.0, 0.0, 0.0]


Step 2
[0.0, 0.0, 0.0]


Step 3
[0.0, 0.0, 0.0]


Step 4
[0.0, 0.0, 0.0]


Step 5
[0.0, 0.0, 0.0]




The model converged within the first step: with a step of 0.5, a_1 = a_0 - 0.5 * 2 * a_0 = 0, which is exactly the minimum of f.

Question 2 c. The learning rate clearly affected how the model converged. The smaller rate (0.1) produced a gradual descent toward the minimum, while the larger rate (0.5) was big enough that the path of the descent could not be observed at all: the first step landed directly on the minimum at (0, 0). For this particular function that behavior makes sense, but for a function that converges less readily, a change of that size in the learning rate could change the overall conclusions drawn from the descent.
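To illustrate how sensitive the outcome is to the learning rate, here is a rough sketch that runs the same five-step descent on f(m, b) = m^2 + b^2 for several step sizes. Only 0.1 and 0.5 come from the activity; the other values, and the helper name `run_descent`, are assumptions added for comparison.

```python
def run_descent(step, start=(1.0, 5.0), n_steps=5):
    """Return (m, b, f) after n_steps of unrounded gradient descent on m**2 + b**2."""
    m, b = start
    for _ in range(n_steps):
        m -= step * 2 * m   # partial derivative with respect to m is 2m
        b -= step * 2 * b   # partial derivative with respect to b is 2b
    return m, b, m ** 2 + b ** 2

# 0.1 and 0.5 are the activity's step sizes; 0.9, 1.0, and 1.1 are extra illustrative values
for step in (0.1, 0.5, 0.9, 1.0, 1.1):
    m, b, f = run_descent(step)
    print(f"step={step}: m={m:.4f}, b={b:.4f}, f={f:.4f}")
```

Each update multiplies both coordinates by (1 - 2 * step). For this function, the descent therefore shrinks toward the minimum when that factor has magnitude below 1, flips sign forever without progress at a step of 1.0, and moves away from the minimum for larger steps.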