In [1]:
# Enable HTML/CSS
from IPython.core.display import HTML
HTML("<link href='https://fonts.googleapis.com/css?family=Passion+One' rel='stylesheet' type='text/css'><style>div.attn { font-family: 'Helvetica Neue'; font-size: 30px; line-height: 40px; color: #FFFFFF; text-align: center; margin: 30px 0; border-width: 10px 0; border-style: solid; border-color: #5AAAAA; padding: 30px 0; background-color: #DDDDFF; }hr { border: 0; background-color: #ffffff; border-top: 1px solid black; }hr.major { border-top: 10px solid #5AAA5A; }hr.minor { border: none; background-color: #ffffff; border-top: 5px dotted #CC3333; }div.bubble { width: 65%; padding: 20px; background: #DDDDDD; border-radius: 15px; margin: 0 auto; font-style: italic; color: #f00; }em { color: #AAA; }div.c1{visibility:hidden;margin:0;height:0;}div.note{color:red;}</style>")

___
Enter Team Member Names here (double click to edit):

- Travis Peck
- Quinn Matthews
- Christ Hirschbrich
- Tyler Olbright


# In Class Assignment One
In the following assignment you will be asked to fill in Python code and derivations for a number of different problems. Please read all instructions carefully and turn in the rendered notebook (or HTML of the rendered notebook) before the end of class (or right after class). The initial portion of this notebook is given before class and the remainder is given during class. Please answer the initial questions before class, to the best of your ability. Once class has started you may rework your answers to the initial part of the assignment as a team.

<a id="top"></a>
## Contents
* <a href="#Loading">Loading the Data</a>
* <a href="#linearnumpy">Linear Regression</a>
* <a href="#sklearn">Using Scikit Learn for Regression</a>
* <a href="#classification">Linear Classification</a>

________________________________________________________________________________________________________

<a id="Loading"></a>
<a href="#top">Back to Top</a>
## Loading the Data
Please run the following code to read in the "diabetes" dataset from sklearn's data loading module. 

This will load the data into the variable `ds`. `ds` is a `Bunch` object with fields like `ds.data` and `ds.target`. The field `ds.data` holds the continuous features in the dataset. **The object is not a pandas dataframe. It is a NumPy array (a 2-D matrix).** Each row is one observed instance and each column is a different feature. The field `ds.target` holds the continuous value we are trying to predict: each entry in `ds.target` is the target for the corresponding row of the `ds.data` matrix.

In [2]:
from sklearn.datasets import load_diabetes
import numpy as np
from __future__ import print_function


ds = load_diabetes()

# this holds the continuous feature data
# because ds.data is a matrix, there are some special properties we can access (like 'shape')
print('features shape:', ds.data.shape, 'format is:', ('rows','columns')) # there are 442 instances and 10 features per instance
print('range of target:', np.min(ds.target),np.max(ds.target))

features shape: (442, 10) format is: ('rows', 'columns')
range of target: 25.0 346.0


________________________________________________________________________________________________________
<a id="linearnumpy"></a>
<a href="#top">Back to Top</a>
## Using Linear Regression 
In the videos, we derived the formula for calculating the optimal values of the regression weights (you must be connected to the internet for this equation to show up properly):

$$ w = (X^TX)^{-1}X^Ty $$

where $X$ is the matrix of values with a bias column of ones appended onto it. For the diabetes dataset one could construct this $X$ matrix by stacking a column of ones onto the `ds.data` matrix. 

$$ X=\begin{bmatrix}
         & \vdots &        &  1 \\
        \dotsb & \text{ds.data} & \dotsb &  \vdots\\
         & \vdots &         &  1\\
     \end{bmatrix}
$$

**Question 1:** For the diabetes dataset, how many elements will the vector $w$ contain?

In [3]:
# Enter your answer here (or write code to calculate it)
# Number of columns in ds.data plus 1 (for the bias term)
print(ds.data.shape[1]+1)


11


________________________________________________________________________________________________________

**Exercise 1:** In the following empty cell, use the given equation above (using numpy matrix operations) to find the values of the optimal vector $w$. You will need to be sure $X$ and $y$ are created like the instructor talked about in the video. Don't forget to include any modifications to $X$ to account for the bias term in $w$. You might be interested in the following functions:

- `import numpy as np`
- `np.hstack((mat1,mat2))` stack two matrices horizontally, to create a new matrix
- `np.ones((rows,cols))` create a matrix full of ones
- `my_mat.T` takes transpose of numpy matrix named `my_mat`
- `np.dot(mat1,mat2)` or `mat1 @ mat2` is matrix multiplication for two matrices
- `np.linalg.inv(mat)` gets the inverse of the variable `mat`
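
As a rough illustration of how these functions fit together, here is a minimal sketch on made-up data rather than the diabetes dataset. The names `features`, `true_w`, `X_toy`, `y_toy`, and `w_toy` are hypothetical, and the targets are generated without noise so the known weights should be recovered up to floating-point error.

```python
import numpy as np

# Hypothetical toy data: 20 instances, 2 features, targets built from known weights.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 2))
true_w = np.array([2.0, -1.0, 0.5])  # last entry plays the role of the bias

# Append a column of ones so the last weight acts as the bias term.
X_toy = np.hstack((features, np.ones((features.shape[0], 1))))
y_toy = X_toy @ true_w  # noise-free targets

# Normal equation: w = (X^T X)^{-1} X^T y
w_toy = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy
print(w_toy)
```

Because the column of ones is appended last, the bias ends up as the last element of the weight vector. Prepending the column instead would move the bias to the first position.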

In [4]:
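# Write your code here, print the values of the regression weights using the 'print()' function in python
y = ds.target
ones = np.ones((ds.data.shape[0],1)) # bias column of ones, one entry per instance

X = np.hstack((ds.data, ones)) # the bias column is appended last, so the bias is the last weight
Xm = np.matrix(X)
ym = np.matrix(y.reshape((len(y),1))) # targets as a column vector

# w = (X^T X)^-1 X^T y; for np.matrix, '*' is the matrix product and '**-1' is the inverse
wm = (Xm.T*Xm)**-1*Xm.T*ym
print(wm)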




___
<a id="sklearn"></a>
<a href="#top">Back to Top</a>
# Start of Live Session Coding

**Exercise 2:** Scikit-learn also has a linear regression fitting implementation. Look at the scikit learn API and learn to use the linear regression method. The API is here: 

- API Reference: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Use the sklearn `LinearRegression` module to check your results from the previous question. 

**Question 2**: Did you get the same parameters? 

In [5]:
from sklearn.linear_model import LinearRegression

# write your code here, print the values of model by accessing 
#    its properties that you looked up from the API

reg = LinearRegression(fit_intercept=True).fit(ds.data, ds.target)

print('model coefficients are:', reg.coef_)
print('model intercept is', reg.intercept_)
print('Answer to question is:', 'Yes they are the same')

model coefficients are: [ -10.0098663  -239.81564367  519.84592005  324.3846455  -792.17563855
  476.73902101  101.04326794  177.06323767  751.27369956   67.62669218]
model intercept is 152.13348416289597
Answer to question is: Yes they are the same
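
One way to back up the answer to Question 2 numerically is sketched below. It assumes the bias column is appended last, as in Exercise 1, and it uses `np.linalg.solve` in place of an explicit inverse, which is a substitution and not the method from the videos. The names `data`, `X_check`, `w_closed`, `model`, and `w_sklearn` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()

# Feature matrix with the bias column of ones appended last.
X_check = np.hstack((data.data, np.ones((data.data.shape[0], 1))))

# Closed-form weights: solve (X^T X) w = X^T y instead of forming the inverse.
w_closed = np.linalg.solve(X_check.T @ X_check, X_check.T @ data.target)

# sklearn stores the bias separately, so append it to match the ordering of w_closed.
model = LinearRegression(fit_intercept=True).fit(data.data, data.target)
w_sklearn = np.append(model.coef_, model.intercept_)

print(np.allclose(w_closed, w_sklearn))
```

Comparing with `np.allclose` instead of `==` allows for the small floating-point differences between the two solvers.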


________________________________________________________________________________________________________

Recall that to predict the output from our model, $\hat{y}$, from $w$ and $X$ we need to use the following formula:

- $\hat{y}=w^TX^T$, for row vector $\hat{y}$
- OR 
- $\hat{y}=Xw$, for column vector $\hat{y}$

Where $X$ is a matrix with example instances in *each row* of the matrix (and the bias term).

**Exercise 3:** 
- *Part A:* Use matrix multiplication to predict output using numpy, $\hat{y}_{numpy}$. 
 - **Note**: you may need to make the regression weights a column vector using the following code: `w = w.reshape((len(w),1))` This assumes your weights vector is assigned to the variable named `w`.
- *Part B:* Use the sklearn API to get the values for $\hat{y}_{sklearn}$ (hint: use the `.predict` function of the API).
- *Part C:* Calculate the mean squared error between your prediction from numpy and the target, $\frac{1}{M}\sum_{i=1}^{M}(y_i-\hat{y}_{numpy,i})^2$, where $M$ is the number of instances.
- *Part D:* Calculate the mean squared error between your sklearn prediction and the target, $\frac{1}{M}\sum_{i=1}^{M}(y_i-\hat{y}_{sklearn,i})^2$.
 - **Note**: parts C and D can each be completed in one line of code using numpy. There is no need to write a `for` loop.
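
The two prediction forms and the one-line mean squared error can be sketched on a tiny made-up example. All values and names here (`X_small`, `w_small`, `y_small`) are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical toy values; the last column of X_small is the bias column of ones.
X_small = np.array([[1.0, 2.0, 1.0],
                    [3.0, 4.0, 1.0],
                    [5.0, 6.0, 1.0]])
w_small = np.array([0.5, -1.0, 2.0]).reshape((3, 1))  # weights as a column vector
y_small = np.array([1.0, 0.0, -2.0])

yhat_col = X_small @ w_small      # column-vector form Xw, shape (3, 1)
yhat_row = w_small.T @ X_small.T  # row-vector form w^T X^T, shape (1, 3)

# Flatten the column vector first: subtracting shapes (3,) and (3, 1) directly
# would broadcast to a (3, 3) matrix and give a wrong mean.
mse = np.mean((y_small - yhat_col.ravel()) ** 2)
print(yhat_col.ravel(), mse)
```

The row-vector form has shape `(1, M)`, which broadcasts against a 1-D target of length `M` without trouble. The column-vector form needs the flatten step shown above.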

In [6]:
# Use this block to answer the questions
from numpy import dot 

weights = reg.coef_.reshape((len(reg.coef_),1)) # make w a column vector
weights = np.vstack((weights, reg.intercept_)) # append the intercept at the bottom, matching the ones column appended last in X

y_numpy = weights.T.dot(X.T) # row-vector form w^T X^T, shape (1, 442); X is the matrix built in Exercise 1
y_pred_sklearn = reg.predict(ds.data)

MSE_numpy = np.mean((ds.target - y_numpy)**2)
MSE_sklearn = np.mean((ds.target - y_pred_sklearn)**2)

print('MSE Numpy is:', MSE_numpy)
print('MSE Sklearn is:', MSE_sklearn)

MSE Numpy is: 2859.6963475867506
MSE Sklearn is: 2859.6963475867506


________________________________________________________________________________________________________
<a id="classification"></a>
<a href="#top">Back to Top</a>
## Using Linear Classification
Now let's use the code you created to make a classifier with linear boundaries. Run the following code to load the iris dataset.

In [7]:
from sklearn.datasets import load_iris
import numpy as np

# this will overwrite the diabetes dataset
ds = load_iris()
print('features shape:', ds.data.shape) # there are 150 instances and 4 features per instance
print('original number of classes:', len(np.unique(ds.target)))

# now let's make this a binary classification task
ds.target = ds.target>1
print ('new number of classes:', len(np.unique(ds.target)))

features shape: (150, 4)
original number of classes: 3
new number of classes: 2


________________________________________________________________________________________________________

**Exercise 4:** Now use linear regression to come up with a set of weights, `w`, that predict the class value. You can use numpy or sklearn, whichever you prefer. This is exactly like you did before for the *diabetes* dataset. However, instead of regressing to continuous values, you are just regressing to the integer value of the class (0 or 1), like we talked about in the video (using the hard limit function). 
 - **Note**: If you are using numpy, remember to account for the bias term when constructing the feature matrix, `X`.
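
To make the idea of regressing to class labels concrete, here is a minimal sketch on a made-up one-dimensional problem, not the iris data. The names `x`, `labels`, `X_cls`, `w_cls`, and `scores` are hypothetical.

```python
import numpy as np

# Hypothetical 1-D toy problem: small feature values are class 0, large ones are class 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Feature column plus a bias column of ones (bias is the last weight).
X_cls = np.column_stack((x, np.ones_like(x)))

# Same normal equation as for regression, with the 0/1 labels as the targets.
w_cls = np.linalg.inv(X_cls.T @ X_cls) @ X_cls.T @ labels

# The outputs are continuous scores, not class labels yet.
scores = X_cls @ w_cls
print(w_cls, scores)
```

The scores are not restricted to the values 0 and 1, or even to the interval between them, which is why a separate decision function is needed to turn them into class predictions.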
 

In [8]:
# write your code here and print the values of the weights 
from numpy.linalg import inv
from numpy import dot

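# Note: this column of ones is redundant because fit_intercept=True already fits a bias term,
# which is why the first coefficient in the printed output is 0.
X = np.hstack((np.ones((ds.data.shape[0],1)), ds.data))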
y = ds.target

reg = LinearRegression(fit_intercept=True).fit(X, y)

print('model coefficients are:', reg.coef_)
print('model intercept is', reg.intercept_)


model coefficients are: [ 0.         -0.04587608  0.20276839  0.00398791  0.55177932]
model intercept is -0.6952818633256028


________________________________________________________________________________________________________

**Exercise 5:** Finally, use a hard decision function on the output of the linear regression to make this a binary classifier. This is just like we talked about in the video, where the output of the linear regression passes through a function: 

- $\hat{y}=g(w^TX^T)$ where
 - $g(w^TX^T)$ for $w^TX^T < \alpha$ maps the predicted class to `0` 
 - $g(w^TX^T)$ for $w^TX^T \geq \alpha$ maps the predicted class to `1`. 

Here, alpha is a threshold for deciding the class. 

**Question 3**: What value for $\alpha$ makes the most sense? What is the accuracy of the classifier given the $\alpha$ you chose? 

Note: You can calculate the accuracy with the following code: `accuracy = float(sum(yhat==y)) / len(y)` assuming you choose variable names `y` and `yhat` for the target and prediction, respectively.
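
The hard decision function and the accuracy calculation can be sketched as follows. The scores and labels are made up, and the threshold of 0.5 (the midpoint between the class codes 0 and 1) is only one assumed choice for $\alpha$, not necessarily the best one for the iris data.

```python
import numpy as np

# Hypothetical regression outputs and true binary labels.
scores_demo = np.array([-0.1, 0.2, 0.45, 0.5, 0.8, 1.1])
y_demo = np.array([False, False, True, True, True, True])

alpha_demo = 0.5  # assumed threshold: midpoint between the class codes 0 and 1

# g(.) maps scores >= alpha to class 1 (True) and scores < alpha to class 0 (False).
yhat_demo = scores_demo >= alpha_demo

# Fraction of predictions that match the target.
accuracy_demo = float(np.sum(yhat_demo == y_demo)) / len(y_demo)
print(yhat_demo, accuracy_demo)
```

In this toy case the third instance has a score just under the threshold, so it is misclassified and the accuracy is 5 out of 6.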

In [9]:
# use this box to predict the classification output
y_regression = reg.predict(X)

# Find the percentage of 1s in the target
percentage_1 = float(sum(y))/len(y)
print('Percentage of 1s in the target:', percentage_1*100, '%')
percentage_0 = 1 - percentage_1

# Sort the predictions
y_reg_sorted = np.sort(y_regression)

# Find the value that separates the 1s from the 0s
index = int(y_reg_sorted.shape[0]*percentage_0)
alpha = y_reg_sorted[index]
print('alpha:', alpha)

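# Calculate the classification
# Note: strict '>' is used here, while the rule above uses '>='. alpha is itself one of the
# predictions, so the instance whose prediction equals alpha is assigned to class 0.
y_pred = y_regression>alpha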
accuracy = float(sum(y_pred==y))/len(y)
print('Percentage accuracy:', accuracy*100, '%')


Percentage of 1s in the target: 33.33333333333333 %
alpha: 0.5206020028016898
Percentage accuracy: 94.0 %


________________________________________________________________________________________________________

That's all! Please **save (make sure you saved!!!) and upload your rendered notebook** and please include **team member names** in the notebook submission.