# Bias-variance lab

In this lab you'll explore how bias and variance change using a dataset of college statistics.

---

In [1]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd
import patsy

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score, KFold, train_test_split

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')


---

### Load data

Feel free to choose a target variable on your own. I chose "Grad.Rate" as my target variable but it's not required.

You'll want to discard the name of the college, and if you're planning on using the "Private" variable it will have to be changed into 1s and 0s rather than yes/no.

In [2]:
college = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/college_stats/College.csv')

In [3]:
college.head(2)

Unnamed: 0.1,Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56


In [4]:
college.rename(columns={'Unnamed: 0':'college'}, inplace=True)

In [6]:
college.columns = [c.lower().replace('.','_') for c in college.columns]
college.head(2)

Unnamed: 0,college,private,apps,accept,enroll,top10perc,top25perc,f_undergrad,p_undergrad,outstate,room_board,books,personal,phd,terminal,s_f_ratio,perc_alumni,expend,grad_rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56


In [7]:
college.drop('college', axis=1, inplace=True)

In [8]:
grad_rate = college.grad_rate.values
# copy so that later column assignments don't write into a slice of `college`
X = college.iloc[:,:-1].copy()
print(X.columns)

Index([u'private', u'apps', u'accept', u'enroll', u'top10perc', u'top25perc',
       u'f_undergrad', u'p_undergrad', u'outstate', u'room_board', u'books',
       u'personal', u'phd', u'terminal', u's_f_ratio', u'perc_alumni',
       u'expend'],
      dtype='object')


In [9]:
X['private'] = X.private.map(lambda x: 1 if x == 'Yes' else 0)

---

### Cross-validate a linear regression predicting your target variable from the other variables

How does it perform? (For a regressor, `cross_val_score` reports the R² on each fold by default.)

In [10]:
linreg = LinearRegression()
scores = cross_val_score(linreg, X, grad_rate, cv=10)
print(scores)
print(np.mean(scores))

[ 0.44921644  0.35875931  0.53650887  0.48549145  0.17445097  0.38775448
  0.17393223  0.43964213  0.61095627  0.3575135 ]
0.397422563045


---

### Create a function that will iteratively predict your target from different train-test splits

This will be used to calculate the bias and the variance after this.

Your function should:

1. Accept a model, an X predictor matrix/dataframe, a y target variable, and the number of random train-test splits to run.
2. Return a dataframe whose first column holds the true values of y. Each remaining column holds the predicted values of y for the rows that were in the testing set on one split.
3. Iterate through the number of splits.
4. Create a variable that is the list of row numbers. Use this with `train_test_split` to get randomized training rows and testing rows for each iteration.
5. Subset your X and y into training and testing sets.
6. Train your model on the training X and training y.
7. Predict values of y using the testing X.
8. Add the predicted values of y to the dataframe tracking y predictions. The predicted y values should be inserted in the correct rows so that they match the true values of y in the first column. You can index using the test indices you got out of `train_test_split` to do this. (The rows that were part of the training set can be NaN for that iteration.)


In [None]:
def make_multiple_predictions(model, X, y, random_splits_num=200):
    # set up the output dataframe that will be returned by the function:
    # first column will be true values of y:
    predictions = pd.DataFrame({'ytrue':y})
    
    # get out a list that is the row indices - i will use this in train_test_split
    # to divide up my data:
    rows = np.arange(len(y))
    
    # go through the number of random splits i specified:
    for i in range(random_splits_num):
        
        # use train_test_split to get the training indices and testing indices
        # with that rows variable i made earlier.
        # im choosing the test set to be 30% of the whole thing:
        train_inds, test_inds = train_test_split(rows, test_size=0.3)
        
        # subset the x and y into training and testing sets:
        Xtrain, Xtest = X.iloc[train_inds, :], X.iloc[test_inds, :]
        Ytrain, Ytest = y[train_inds], y[test_inds]
        
        # train the model:
        model.fit(Xtrain, Ytrain)
        
        # get out the model predictions for Xtest:
        yhats = model.predict(Xtest)
        
        # put the predicted values into the predictions dataframe I made earlier, 
        # at the correct row (where they correspond to the true y variable in that row):
        predictions['sample'+str(i)] = np.nan
        predictions.iloc[test_inds, -1] = yhats
        
    return predictions
    

---

### Create different predictor datasets

To see what happens to bias and variance as the predictors change, create a few versions of X that have different numbers of predictors in them.

For example, one could have all the other variables, and another one could be predicting only using private vs. public.
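One way to organize the different versions of X is a dictionary keyed by a descriptive name. The sketch below is only an illustration: the helper `make_predictor_sets`, the set names, and the column groupings are hypothetical choices rather than part of the lab, and the two-row frame simply reuses values from the `college.head(2)` output so the example runs on its own.

```python
import pandas as pd

# Two rows copied from the `college.head(2)` output above, with `private`
# already mapped to 1/0. In the lab you would pass your full X instead.
X_small = pd.DataFrame({
    'private': [1, 1],
    'apps': [1660, 2186],
    'accept': [1232, 1924],
    'top10perc': [23, 16],
    'outstate': [7440, 12280],
    'expend': [7041, 10527],
})


def make_predictor_sets(X):
    """Return named versions of X with different numbers of predictors.

    The groupings are illustrative choices, not prescribed by the lab.
    """
    return {
        'all': X,
        'private_only': X[['private']],
        'admissions': X[['apps', 'accept', 'top10perc']],
    }


predictor_sets = make_predictor_sets(X_small)
for name, X_version in predictor_sets.items():
    # report how many predictors each version contains
    print('%s: %d predictors' % (name, X_version.shape[1]))
```

Keeping the versions in one dictionary makes it easy to loop over them later and label the resulting bias and variance numbers by name.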

---

### Use the predict function you wrote above to get the predicted values for each version of the data

Run each version of X through the function with the y target vector. Recall that the output of your function has the true values of y in the first column and the predicted values of y for each train-test split in the remaining columns.

---

### Calculate bias and variance 

I've given you two functions below that calculate bias and variance from a dataframe whose first column holds the true y values and whose other columns hold the predicted y values at each train/test split iteration.

You can use these to calculate the bias and variance for your different sets of predictors. If you have more predictors, the variance of the predictions should generally go up and the bias should go down. Likewise, if you have fewer predictors, the variance should go down and the bias should go up.

If you have an insanely bad model, they both might go up a lot!
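
For reference, the two helper functions compute empirical versions of the quantities written out here. The notation is introduced only for this sketch and does not appear in the lab: $y_i$ is the true value for row $i$, $\hat{y}_{is}$ is the prediction for row $i$ on split $s$, $S_i$ is the set of splits in which row $i$ landed in the testing set, and $n$ is the number of rows.

$$
\bar{\hat{y}}_i = \frac{1}{|S_i|} \sum_{s \in S_i} \hat{y}_{is}
$$

$$
\text{bias}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \bar{\hat{y}}_i - y_i \right)^2,
\qquad
\text{variance} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|S_i|} \sum_{s \in S_i} \left( \hat{y}_{is} - \bar{\hat{y}}_i \right)^2
$$

Because the bias term is measured against the observed $y_i$ rather than a noise-free target, it also absorbs any irreducible noise in the data.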

In [3]:
def calculate_bias_sq(yhats_df):
    # Take out the true values of y that are in the first column:
    ytrue = yhats_df.iloc[:,0].values
    
    # Calculate the mean of the predictions, averaged across the columns.
    # So, all of the predictions for the true y at row 0 would be averaged together
    # and so on for all the rows.
    yhat_means = yhats_df.iloc[:,1:].mean(axis=1).values
    
    # Subtract the true value of y from the mean of the predicted values, and square it.
    elementwise_bias_sq = (yhat_means - ytrue)**2
    
    # Take the mean of those squared bias values (across all y)
    mean_bias_sq = np.mean(elementwise_bias_sq)
    return mean_bias_sq


def calculate_variance(yhats_df):
    # Calculate the mean of the predicted y's across the columns (mean of yhat for each row)
    yhats_means = yhats_df.iloc[:,1:].mean(axis=1)
    
    # subtract the mean of the yhats from the original yhat values (for each row)
    # and square the result. 
    yhats_devsq = yhats_df.iloc[:,1:].subtract(yhats_means, axis=0)**2
    
    # Take the mean of the squared deviations from the mean, then 
    # take the mean of those to get the overall variance across the y observations
    yhats_devsq_means = yhats_devsq.mean(axis=1).values
    return np.mean(yhats_devsq_means)


---

### How does regularization affect bias and variance?

Use Lasso and/or Ridge on your dataset with all the predictor variables. You can feed the lasso or ridge model into the function you wrote earlier to get the predictions using regularization instead of just ordinary least squares regression.

How does using regularization affect bias and variance?
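
As a rough illustration of the comparison, the sketch below refits an unregularized model, a heavily regularized Ridge, and a Lasso on repeated random splits and summarizes the test predictions. Everything in it is an assumption made for the example: the data is synthetic rather than the college dataset, the helper `bias_sq_and_variance` is a hypothetical compact stand-in for the functions in this lab, and the `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the college dataset): 120 rows, 15 predictors.
rng = np.random.RandomState(0)
n_rows, n_cols = 120, 15
X_demo = rng.normal(size=(n_rows, n_cols))
true_coefs = rng.normal(size=n_cols)
y_demo = X_demo.dot(true_coefs) + rng.normal(scale=2.0, size=n_rows)


def bias_sq_and_variance(model, X, y, n_splits=50, seed=0):
    """Refit `model` on repeated random splits and summarize test predictions."""
    rows = np.arange(len(y))
    # One column per split; NaN wherever the row was in the training set.
    preds = np.full((len(y), n_splits), np.nan)
    for i in range(n_splits):
        train_inds, test_inds = train_test_split(
            rows, test_size=0.3, random_state=seed + i)
        model.fit(X[train_inds], y[train_inds])
        preds[test_inds, i] = model.predict(X[test_inds])
    # Keep only rows that landed in a testing set at least once.
    tested = ~np.isnan(preds).all(axis=1)
    preds, y = preds[tested], y[tested]
    # Mean prediction per row, ignoring the splits where the row was in training.
    row_means = np.nanmean(preds, axis=1)
    bias_sq = np.mean((row_means - y) ** 2)
    variance = np.mean(np.nanmean((preds - row_means[:, None]) ** 2, axis=1))
    return bias_sq, variance


# Arbitrary alpha values chosen only to make the contrast visible.
models = {
    'ols': LinearRegression(),
    'ridge_strong': Ridge(alpha=1000.0),
    'lasso': Lasso(alpha=0.5),
}

results = {}
for name, model in models.items():
    b, v = bias_sq_and_variance(model, X_demo, y_demo)
    results[name] = {'bias_sq': b, 'variance': v}
    print('%s: bias^2=%.3f variance=%.3f' % (name, b, v))
```

On your own data, trying a range of `alpha` values rather than a single one gives a clearer picture of how the two quantities trade off as the penalty grows.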