1. Datasets 
  - Synthetic datasets in scikit-learn
  - Low dimensional data vs High dimensional data 
2. Supervised Learning Algorithms 
   - k Nearest Neighbors
   - Generalized Linear Models 
   - GLMs with regularization - Lasso and Ridge GLMs
   - Kernel based support vector models 
   - Decision Trees 
   - Tree Ensembles (Bagging, Random Forest, Boosting with trees) 
   - Neural Networks  
3. Common concepts used when applying these algorithms
   - Algorithm Classification 
     - Parametric/ Non Parametric 
     - Classification / Regression / Unsupervised
   - Bias vs Variance or Underfitting vs Overfitting   
     - Usually related to choice of model hyperparameters or train/test selection
4. Strengths vs Weaknesses of each algorithm  
5. Scikit-learn's general flow of modeling 

## 1. Data sets
- Synthetic datasets in scikit-learn can be generated for a variety of tasks, such as
  experimenting with regression and classification algorithms
- Lower dimensional data sets are useful for understanding concepts like overfitting and underfitting
- BUT conclusions drawn from them sometimes do not generalize well to higher dimensional data sets, because higher
  dimensional data sets can be sparse (more on that to come)
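
A rough sketch of why higher dimensional data tends to be sparse: with the sample size held fixed, the average distance from a point to its nearest neighbor grows as features are added. The uniform random data, the sample size, and the dimensions used here are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
n_samples = 200
mean_nn_dist = {}
for n_features in (2, 10, 100):
    # Uniform points in the unit hypercube (illustrative data only)
    X_demo = rng.uniform(size=(n_samples, n_features))
    # Ask for 2 neighbors: the closest "neighbor" of each point is the point itself
    nn = NearestNeighbors(n_neighbors=2).fit(X_demo)
    dist, _ = nn.kneighbors(X_demo)
    mean_nn_dist[n_features] = dist[:, 1].mean()
    print("features = {0}, mean nearest-neighbor distance = {1:.3f}"
          .format(n_features, mean_nn_dist[n_features]))
```

The same 200 points that look dense in 2 dimensions are far apart in 100 dimensions, which is one reason intuition built on low dimensional plots may not carry over.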

In [2]:
import sklearn
from sklearn.datasets import make_classification, make_regression

#### Simple classification dataset

In [39]:
X, y = make_classification(n_classes= 2, n_samples= 100, n_features= 2,
                           flip_y= 0.01,n_informative= 2, n_redundant= 0)
from matplotlib import pyplot as plt
%matplotlib notebook 
plt.figure()
plt.scatter(x = X[:,0], y = X[:,1], c = y)


#### Complex classification dataset

In [37]:
X, y = make_classification(n_classes= 2, n_samples= 100, n_features= 2,
                           flip_y= 0.5,n_informative= 2, n_redundant= 0)
from matplotlib import pyplot as plt
%matplotlib notebook 
plt.figure()
plt.scatter(x = X[:,0], y = X[:,1], c = y)


#### Dataset for clusters

In [33]:
from sklearn.datasets import make_blobs

In [36]:
X,y = make_blobs(n_samples= 100, n_features= 2, centers= 4, random_state= 121)
plt.figure()
plt.scatter(x = X[:,0], y = X[:,1], c = y)


#### Dataset for regression

In [32]:
X,y = make_regression(n_samples= 100, n_features= 1, n_informative= 1,bias = 10, noise = 30, random_state= 111)
plt.figure()
plt.scatter(X,y)


## 2. Supervised Learning Algorithms

### 2.1 k Nearest Neighbors for classification and regression
#### a. kNN for classification 
- Algorithm 
- Application 
  - Synthetic data set, divide into train and test 
  - Show overfitting and underfitting, and how a balance can be achieved 
    - Use visual representation of decision boundaries to explain  
    
#### b. kNN for regression  
- same as above 

#### c. Pros and cons of knn

#### a.1 Algorithm  - TBD
- Brief overview : http://localhost:8888/notebooks/Documents/DS/Python/UM%20Spcialization/Machine_Learning/Week%201/Intro_to_ML.ipynb

#### a.2 Application

#### a.2.1 Synthetic data set

In [285]:
import numpy as np
import pandas as pd
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_classification
X,y = make_classification(n_classes= 3, n_features= 2, n_informative= 2, n_redundant= 0,
                          n_clusters_per_class= 1, flip_y= 0.2, random_state= 123)

In [286]:
plt.figure()
color_list_light = ['bisque', 'palegreen', 'lightblue']
color_list_dark = ['darkorange', 'darkgreen', 'royalblue']
cmap_points = ListedColormap(color_list_dark)
plt.scatter(X[:,0], X[:,1],c = y, cmap = cmap_points)


#### a.2.2 Overfitting and Underfitting with kNN classifier

### Overfitting vs Underfitting
- Capturing global trends vs capturing local trends (good fit vs overfit)
- Capturing global trends well vs failing to capture them well enough (good fit vs underfit)
- Many reasons for overfitting
  - **Ability to capture the global trend relies on an important assumption:**
    - **The training data is representative of the future data the model will see, which might not be realistically true.**
      - So the choice of training sample is important for the model not to overfit
      - Some model monitoring is required to flag when scoring data starts veering away from the training set
      - We rely on this assumption, and evaluate on a test set to guard against overfitting as much as possible
  - Choice of hyperparameters
    - k in kNN: low k leads to overfit, high k to underfit
    - learning rate in GBM
  - Data sampling
    - If selection bias occurs in data sampling, it can result in overfit or even underfit
  - Too many features, poor variable selection
    - Polynomial regression
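
A minimal sketch of the hyperparameter point for kNN, assuming a synthetic three-class data set and one fixed train/test split (the `random_state` values and the list of k values are illustrative choices): a low k memorizes the training set, while a very high k smooths away real structure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: 100 records, 2 features, 3 classes, 20% of labels flipped
X_demo, y_demo = make_classification(n_classes=3, n_features=2, n_informative=2,
                                     n_redundant=0, n_clusters_per_class=1,
                                     flip_y=0.2, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

scores = {}
for k in (1, 5, 15, 50):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each k
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("k = {0}: train accuracy = {1:.2f}, test accuracy = {2:.2f}"
          .format(k, scores[k][0], scores[k][1]))
```

With k = 1 every training point is its own nearest neighbor, so train accuracy is 1.0 regardless of how noisy the labels are; the gap between train and test accuracy is the signal to watch.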

In [287]:
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier

In [288]:
def knn_classification(X,y,k,train_prop = 0.8):
    """Function to train a kNN classifier on train set. Expects X (2 features) and y as arrays,
    k as number of neighbors
    Returns:
    A plot of decision boundary after training, accuracy score on train and test"""
    X_train, X_test, y_train, y_test = train_test_split(X, y,train_size = train_prop , 
                                                        test_size = (1-train_prop))
    # KNeighborsClassifier accepts y as a 1-D array of shape (n_samples,), so no reshape to a column vector is needed
    print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
          .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))
    #y_train = y_train.reshape((-1,1))
    #y_test = y_test.reshape((-1,1))
    #print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    model = KNeighborsClassifier(n_neighbors = k)
    
    # fit() for kNN essentially memorizes the training data. More to be covered later.
    # Broadly, it may store the data in a tree structure (KD-tree or ball tree) for fast neighbor lookup
    model.fit(X_train, y_train)
    
    train_accuracy = model.score(X = X_train,y = y_train)
    test_accuracy = model.score(X = X_test,y = y_test)
    
    # Plotting decision boundary
    #1. Get a meshgrid using range values of two features
    # 2.Then use the trained model, to predict each record formed by mesh grid 
    #3. plot the region, using colors for region of each class
    
    x_axis_min = np.min(X_train[:,0])
    x_axis_max = np.max(X_train[:,0])
    y_axis_min = np.min(X_train[:,1])
    y_axis_max = np.max(X_train[:,1])

    mesh_step_size = 0.01
    xx, yy = np.meshgrid(np.arange(x_axis_min, x_axis_max + 1,mesh_step_size), 
               np.arange(y_axis_min, y_axis_max + 1, mesh_step_size))
    z = np.concatenate((xx.reshape(1,-1), yy.reshape(1,-1)), axis = 0)
    scored_mesh = model.predict(z.T)
    print("Shapes of arrays of meshgrid , arr1 = {0}, arr2 = {1}, scored_mesh = {2}" 
          .format(xx.shape, yy.shape, scored_mesh.shape))
    scored_mesh = scored_mesh.reshape(xx.shape)
    
    color_list_light = ['bisque', 'palegreen', 'lightblue']
    color_list_dark = ['darkorange', 'darkgreen', 'royalblue']
    cmap_points = ListedColormap(color_list_dark)
    cmap_region = ListedColormap(color_list_light)
    #plt.figure()
    plt.pcolormesh(xx,yy, scored_mesh, cmap = cmap_region)
    plt.scatter(X_train[:,0], X_train[:,1],c = y_train, cmap = cmap_points,
               s = 10, edgecolor = 'black')
    #plt.xlabel('Feature 0')
    #plt.ylabel('Feature 1')
    plt.xlim(x_axis_min, x_axis_max)
    plt.ylim(y_axis_min, y_axis_max)
    plt.title("k = {0}, Train Accuracy = {1}, Test Accuracy = {2}".format(k,train_accuracy, test_accuracy),
             fontsize = 'xx-small')
    plt.tick_params(axis = 'both', labelsize = 'xx-small')
    #plt.subplots_adjust(hspace = 0.3)
    plt.tight_layout()

- The first two charts (both k = 1) show how accuracy on the test set depends on the random chance of which records land in the test set
- For small values of k, the model learns small local patterns rather than general patterns; hence it is more complex (shown through
  the decision boundary) and has low bias (train error 0), but it does not perform as well on the test set (higher variance)
- As k increases, complexity decreases; in this run, test accuracy reaches its maximum at k = 5, where train accuracy is lowest
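
To make the dependence on the random split concrete, here is a small sketch that repeats the train/test split with different seeds and reports the spread of test accuracy for k = 1. The data set parameters and the 20 seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_classes=3, n_features=2, n_informative=2,
                                     n_redundant=0, n_clusters_per_class=1,
                                     flip_y=0.2, random_state=123)

test_scores = []
for seed in range(20):
    # A different seed puts different records in the 20-record test set
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                              random_state=seed)
    model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))

test_scores = np.array(test_scores)
print("test accuracy over 20 splits: min = {0:.2f}, max = {1:.2f}, mean = {2:.2f}"
      .format(test_scores.min(), test_scores.max(), test_scores.mean()))
```

Averaging over several splits (or using cross validation) gives a steadier estimate than any single small test set.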

In [289]:
plt.figure()
plt.subplot(3,2,1)
knn_classification(X,y,k=1)
plt.subplot(3,2,2)
knn_classification(X,y,k=1)
plt.subplot(3,2,3)
knn_classification(X,y,k=3)
plt.subplot(3,2,4)
knn_classification(X,y,k=4)
plt.subplot(3,2,5)
knn_classification(X,y,k=5)
plt.subplot(3,2,6)
knn_classification(X,y,k=6)


Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 627), arr2 = (679, 627), scored_mesh = (425733,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (655, 622), arr2 = (655, 622), scored_mesh = (407410,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679

#### b. kNN for regression
- kNN works similarly for a regression problem; the y values of the k nearest neighbors are combined, by default by
  taking their mean
- The default evaluation metric is R squared (coefficient of determination), i.e. how much of the variation in y
  is explained by the model

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
where TSS is the total sum of squares, RSS is the residual sum of squares, and ESS = TSS - RSS
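
A short worked example of both points, on an assumed synthetic data set: the kNN regression prediction for a point is reproduced by hand as the mean target of its k nearest training points, and R squared is recomputed from RSS and TSS and compared with `score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X_demo, y_demo = make_regression(n_samples=100, n_features=1, noise=30,
                                 random_state=0)
k = 5
model = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo)

# Prediction for one query point = mean of the targets of its k nearest neighbors
query = X_demo[:1]
_, neighbor_idx = model.kneighbors(query)
manual_pred = y_demo[neighbor_idx[0]].mean()

# R^2 = 1 - RSS / TSS, computed by hand
y_hat = model.predict(X_demo)
rss = np.sum((y_demo - y_hat) ** 2)
tss = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - rss / tss

print(manual_pred, model.predict(query)[0])
print(manual_r2, model.score(X_demo, y_demo))
```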

In [293]:
X, y = make_regression(n_samples= 100, n_features= 1, n_informative= 1, 
                n_targets= 1, bias = 50, noise = 30 )
plt.figure()
plt.scatter(X,y)


In [189]:
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

In [294]:
def knn_regression(X,y,k,train_prop = 0.8):
    """Function to split data into train, test with one feature; train a kNN regression, score 
    on test set.
    Returns a plot with decision boundary learnt, and Rsquare on train and test set"""
    X_train, X_test, y_train, y_test = train_test_split(X, y,train_size = train_prop , 
                                                        test_size = (1-train_prop),random_state = 123)
    print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
          .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))
    model = KNeighborsRegressor(n_neighbors = k)
    model.fit(X_train, y_train)
    train_r_sq = model.score(X_train, y_train)
    test_r_sq = model.score(X_test, y_test)
    
    #Construct a series to score using the model, for drawing decision boundary
    x_axis_min = np.min(X_train)
    x_axis_max = np.max(X_train)
    step_size = 0.1
    xx = np.arange(x_axis_min, x_axis_max + 1, step_size).reshape(-1,1)
    print(xx.shape)
    yy = model.predict(xx)
    plt.scatter(X_train, y_train, c = 'royalblue', s = 15, marker = 'o', alpha = 0.5)
    #plt.scatter(xx,yy, c = 'red', marker = 'o', 
    #            s = 15,edgecolor = 'black', alpha = 0.6)
    plt.plot(xx,yy,color = 'red', marker = 'o', markeredgecolor= 'black', 
             linestyle = '-', ms = 3, alpha = 0.5)
    plt.xlim(x_axis_min, x_axis_max)
    plt.title("k = {0}, \n Train R Sq = {1}, \n Test R Sq = {2}".format(k,train_r_sq, test_r_sq),
             fontsize = 'xx-small')
    plt.tick_params(axis = 'both', labelsize = 'xx-small')
    #plt.subplots_adjust(hspace = 0.3)
    plt.tight_layout()


##### Increasing k from 1 moves the model from overfitting toward a better fit; at k = 50 it underfits, and both training and test R squared drop

In [296]:
plt.figure()
plt.subplot(2,2,1)
knn_regression(X,y,k=1)
plt.subplot(2,2,2)
knn_regression(X,y,k=7)
plt.subplot(2,2,3)
knn_regression(X,y,k=15)
plt.subplot(2,2,4)
knn_regression(X,y,k=50)


Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(56, 1)
Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(56, 1)
Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(56, 1)
Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(56, 1)


In [299]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

#### c. Pros and Cons of knn 

|Pros|Cons|
|:-|:-|
|1. Simple to understand; no assumptions are made about the distribution of the data|1. Can be slow, especially at prediction time, when there are many features or training records|
|2. Good for setting a benchmark|2. Performance decreases especially if the data is sparse, with many features|
|3. TBD|3. Predictions change with small changes in the data (open question: could that sensitivity help detect shifts between real-world data and training data?)|


### 2.2 Generalized Linear Models
- GLMs make assumptions about the relationship between the dependent and independent variables
- Model building is hypothesis-oriented: we assume that a linear relationship exists, and test that assumption after
  model training
- Theoretical aspects of linear regression, and the algorithm, are discussed here -
  - http://localhost:8888/notebooks/Documents/DS/Supervised%20Learning/SL201/Linear%20Regression/Linear_Regression_latest.ipynb

#### Key Aspects :
1. Strong assumptions about data relationships
  - Linearity
  - No specification bias
  - Homoscedasticity : GLMs adjust for this when the target has a distribution other than normal
  - No multicollinearity
  - Normality of errors
  - No autocorrelation of errors
2. Loss function - mean residual sum of squares (mean RSS), i.e. mean squared error
3. Model Complexity
  - Without regularization there is no mechanism to control model complexity, nothing like k in kNN
  - Linear models can become complex through an increase in the number of features, feature forms, and interactions
  - With a large number of features, a common measure of complexity is the size of the coefficients,
    which, as we will see later, is used to control complexity
4. Use of linear regression in scikit-learn
5. Comparison with kNN regression
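
A rough sketch of how some of the assumptions listed above might be inspected after fitting, using residuals from an assumed synthetic data set. The split of residuals at the median fitted value is only a crude, illustrative homoscedasticity check, not a formal test.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=100, n_features=1, bias=50,
                                 noise=30, random_state=123)
model = LinearRegression().fit(X_demo, y_demo)
fitted = model.predict(X_demo)
residuals = y_demo - fitted

# With an intercept, least-squares residuals average to zero on the training data
print("mean residual = {0:.2e}".format(residuals.mean()))

# Least-squares residuals are uncorrelated with the feature by construction
resid_feature_corr = np.corrcoef(X_demo[:, 0], residuals)[0, 1]
print("corr(feature, residual) = {0:.2e}".format(resid_feature_corr))

# Crude homoscedasticity check: residual spread for low vs high fitted values
low = residuals[fitted <= np.median(fitted)]
high = residuals[fitted > np.median(fitted)]
print("residual std: low fitted = {0:.1f}, high fitted = {1:.1f}"
      .format(low.std(), high.std()))
```

In practice a residual-vs-fitted scatter plot is the usual way to eyeball linearity and constant variance; the printed numbers here stand in for that plot.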

In [347]:
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples= 100, n_features= 1, n_informative= 1, 
                n_targets= 1, bias = 50, noise = 30, random_state= 123)
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.2,random_state = 123)
print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
       .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))

linear_model = LinearRegression(fit_intercept= True)
linear_model.fit(X_train, y_train)
print('weights = {0:}, intercept = {1}'.format(linear_model.coef_, linear_model.intercept_))
scored_train_linear = linear_model.predict(X_train)
scored_test_linear = linear_model.predict(X_test)
lm_rsq_train = linear_model.score(X_train,y_train)
lm_rsq_test = linear_model.score(X_test,y_test)

# knn model
k = 5
knn_model = KNeighborsRegressor(n_neighbors = k)
knn_model.fit(X_train, y_train)
scored_train_knn = knn_model.predict(X_train)
scored_test_knn = knn_model.predict(X_test)
knn_rsq_train = knn_model.score(X_train, y_train)
knn_rsq_test = knn_model.score(X_test, y_test)

Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
weights = [ 33.930188], intercept = 49.61446610803708


In [360]:
plt.figure()
plt.scatter(X_train, y_train,c = 'royalblue', marker = 'o', alpha = 0.5, s = 20, label = 'Training points')
plt.plot(X_train,scored_train_knn, 'g*', label = 'Knn Model',markersize = 5)
plt.plot(X_train, scored_train_linear,'r-', label = 'Linear Reg.')
plt.title('Fit on train set \n Rsq_train_linear = {0:.2f}, Rsq_train_knn = {1:.2f}'
          .format(lm_rsq_train,knn_rsq_train))
plt.legend()


In [359]:
plt.figure()
plt.scatter(X_test, y_test,c = 'royalblue', marker = 'o', alpha = 0.5, label = 'Test points')
plt.plot(X_test,scored_test_knn, 'g*', markersize = 5,label = 'Knn scored set')
plt.plot(X_test, scored_test_linear,'r-',label = 'Linear Reg. scored set')
plt.title('Score on test set \n Rsq_test_linear = {0:.2f}, Rsq_test_knn = {1:.2f}'
          .format(lm_rsq_test,knn_rsq_test))
plt.legend()


##### Comparison of Linear Regression and kNN regression

|kNN|Linear Regression|
|:-|:-|
|1. Simple to understand; no assumptions are made about the distribution of the data|1. Strong assumptions about relationships in the data; the assumptions require validation|
|2. Can learn non-linear decision boundaries|2. Requires variables to be transformed to address non-linear relationships|
|3. Can be sensitive to changes in data; generally has low bias (can learn non-linear relationships)|3. Stable to changes in data, but generally has high bias (because of its linear structure)|

### 2.3 Lasso and Ridge Regressions
- Ridge Regression
  - L2 penalty in the cost function; scoring remains the same
  - Need for feature scaling
    - the type of scaling depends on the data, the learning task (classification/regression) and the algorithm (more later)
  - Effect of the penalty with and without feature scaling
  - Effect of regularization on the evaluation metric in a learning scenario

#### Ridge Regression
Cost function:

$$f(\beta) = \sum_{i=1}^{m}\left[y^{(i)} - \left(\beta_0 + \beta_1 x^{(i)}_1 + \dots + \beta_n x^{(i)}_n\right)\right]^2 + \alpha \sum_{j=1}^{n} \beta_j^2$$
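
As a hedged illustration of the cost function above: for this objective the minimizer has a closed form, and centering X and y takes the intercept out of the penalized problem. The sketch below, on assumed synthetic data, compares that hand-computed solution with scikit-learn's `Ridge`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10,
                                 bias=50, random_state=0)
alpha = 1.0

# Center X and y so the intercept drops out; only the slopes are penalized
X_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
Xc, yc = X_demo - X_mean, y_demo - y_mean

# Solve (Xc'Xc + alpha * I) beta = Xc'yc for the slopes
beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
# Recover the unpenalized intercept from the means
beta_0 = y_mean - X_mean @ beta

model = Ridge(alpha=alpha, solver="cholesky").fit(X_demo, y_demo)
print(beta, beta_0)
print(model.coef_, model.intercept_)
```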

- **Notice that the bias (intercept) term $\beta_0$ is not included in the regularization term**

- **Alpha**
  - Alpha controls the extent of regularization and is non-negative; alpha = 0 gives unregularized least squares
  - It is a hyperparameter: the choice of alpha affects the learned parameters of the model, and hence
    it is tuned using a validation dataset / cross validation
  - Higher alpha means more regularization
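
One possible way to tune alpha with cross validation is a grid search over candidate values. This is a sketch on assumed synthetic data; the grid of alphas and the 5-fold setting are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_regression(n_samples=200, n_features=50, n_informative=10,
                                 noise=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
# 5-fold cross validation on the training split only; the test split stays untouched
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["alpha"]
print("best alpha = {0}, CV R^2 = {1:.3f}, test R^2 = {2:.3f}"
      .format(best_alpha, search.best_score_, search.score(X_te, y_te)))
```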

#### How does the penalty term avoid overfitting?
  - Empirically, when linear regression picks up noise, its coefficients tend to grow in size.
    The first half of the cost function pushes the fit that way by decreasing RSS, but the second half penalizes
    the increase in coefficient size.
  - This counter-effect helps the learned function generalize better to new data sets
  - Once the coefficients and bias are learned, predictions are generated from them exactly as in unregularized regression
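
A small sketch of the shrinking effect, assuming a synthetic data set with many features relative to records: as alpha grows, the overall size (L2 norm) of the ridge coefficients falls.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Illustrative setup: 40 features but only 60 records, so noise is easy to fit
X_demo, y_demo = make_regression(n_samples=60, n_features=40, n_informative=5,
                                 noise=30, random_state=0)

coef_norms = []
for alpha in (0.01, 1.0, 100.0, 10000.0):
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(model.coef_))
    print("alpha = {0}: L2 norm of coefficients = {1:.2f}"
          .format(alpha, coef_norms[-1]))
```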

#### Feature Scaling
- The features in the regression can be on different scales. **Remember that the learned coefficient values depend
  on the scale of the features used**
- With ridge, all squared coefficients are added together with the same weight of 1, so the penalty works well only
  if each feature is scaled to a common scale
- **Feature scaling is helpful in many respects and is used as a pre-processing step in many algorithms**
  - Esp. regularized linear and logistic regression, neural nets, SVM, and distance-based methods such as kNN
  - Faster convergence of the learning algorithm
  - Fair treatment of all coefficients in the penalty, without any per-feature weighting
- Many options are available for the type of scaling: MinMax, zero mean and unit standard deviation, etc.
- **Use the same scaler on the test set; do not fit the scaler again on test data, as that results in data leakage! Think of
  it this way: you are not supposed to know anything about the test data during training**
- Downside - interpretation is affected: the coefficients refer to the scaled features, and a regularized model's
  predictions with feature scaling will differ from those without it
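
One way to keep the scaler from ever seeing test data is to chain it with the model in a pipeline, so that `fit` learns the scaling from the training split only. A minimal sketch on assumed synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

pipe = make_pipeline(MinMaxScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)  # the scaler's min and max come from the training split only

# score() transforms X_te with the already-fitted scaler, then evaluates the model
print("test R^2 = {0:.3f}".format(pipe.score(X_te, y_te)))

fitted_scaler = pipe.named_steps["minmaxscaler"]
print(fitted_scaler.data_min_, X_tr.min(axis=0))
```

Because the scaler was fit on the training split, scaled test values can fall slightly outside [0, 1]; that is expected and not a sign of an error.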

In [407]:
def load_crime_dataset():
    # Communities and Crime dataset for regression
    # https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

    crime = pd.read_table('CommViolPredUnnormalizedData.txt', sep=',', na_values='?')
    # remove features with poor coverage or lower relevance, and keep ViolentCrimesPerPop target column
    columns_to_keep = [5, 6] + list(range(11,26)) + list(range(32, 103)) + [145]  
    crime = crime.iloc[:,columns_to_keep].dropna()

    X_crime = crime.iloc[:,range(0,88)]
    y_crime = crime['ViolentCrimesPerPop']

    return (X_crime, y_crime)
X_crime, y_crime = load_crime_dataset()
print(X_crime.shape, y_crime.shape)

(1994, 88) (1994,)


#### Effect of feature scaling using two features - MinMaxScaler 
$$x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
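
A tiny worked example of the formula, on a made-up 3 x 2 array (the numbers are illustrative, not taken from the crime data set): applying the formula per column gives the same result as `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: two features on different scales
sample_demo = np.array([[12.0, 20.0],
                        [15.0, 35.0],
                        [30.0, 50.0]])

# Apply the min-max formula column by column
col_min = sample_demo.min(axis=0)
col_max = sample_demo.max(axis=0)
manual = (sample_demo - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(sample_demo)
print(manual)
print(scaled)
```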

In [384]:
from sklearn.preprocessing import MinMaxScaler
sample = X_crime.loc[:,['agePct12t21', 'agePct12t29']]
scaler = MinMaxScaler() 
scaler.fit(sample) # Compute the minimum and maximum to be used for later scaling
print(scaler)
sample_tfd = scaler.transform(sample) # transformation

MinMaxScaler(copy=True, feature_range=(0, 1))


In [392]:
plt.figure()
plt.subplot(1,2,1)
plt.scatter(X_crime['agePct12t21'], X_crime['agePct12t29'])
plt.subplot(1,2,2)
plt.scatter(sample_tfd[:,0], sample_tfd[:,1])



#### Ridge Regression (with normalization) vs Unregularized regression
- Is there an effect on performance?
- See the difference in coefficient values
- See the difference in predictions; how is interpretability affected?
- Finding standard errors of regression coefficients from ridge/lasso
  - https://stats.stackexchange.com/questions/45449/when-using-glmnet-how-to-report-p-value-significance-to-claim-significance-of-pr
- Finding standard errors from regression
  - sklearn's `LinearRegression()` does not report them, but the class can be easily extended. The solution comes from
    solving the matrix algebra
  - https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression
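
A sketch of the matrix algebra for ordinary (unregularized) least squares, assuming the classical linear model assumptions hold: the coefficient covariance is estimated as sigma^2 (X'X)^-1, with X including a column of ones for the intercept. The data set is synthetic and the variable names are illustrative; this does not apply to ridge or lasso coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=20,
                                 bias=5, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)

n, p = X_demo.shape
# Design matrix with a leading column of ones for the intercept
design = np.column_stack([np.ones(n), X_demo])
residuals = y_demo - model.predict(X_demo)

# Unbiased estimate of the error variance: RSS / (n - p - 1)
sigma_sq = residuals @ residuals / (n - p - 1)
cov_beta = sigma_sq * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov_beta))  # first entry belongs to the intercept

params = np.concatenate([[model.intercept_], model.coef_])
t_stats = params / std_err
print("standard errors:", std_err)
print("t statistics:", t_stats)
```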

In [436]:
import time

In [445]:
# Linear model, not regularized
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,test_size = 0.2,random_state = 123)
print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
       .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))

scaler = MinMaxScaler()
X_train_tfd = scaler.fit_transform(X_train)
X_test_tfd = scaler.transform(X_test)

linear_model = LinearRegression(fit_intercept= True)
start = time.time()
linear_model.fit(X_train_tfd, y_train)
end = time.time()
print('weights = {0}, intercept = {1}'.format(linear_model.coef_, linear_model.intercept_))
print('Non zero weights = {}'.format(sum(linear_model.coef_ != 0)))
print('Training Time in ms: {}'.format((end-start)*1000))
#scored_train_linear = linear_model.predict(X_train)
scored_test_linear = linear_model.predict(X_test_tfd)
lm_rsq_train = linear_model.score(X_train_tfd,y_train)
lm_rsq_test = linear_model.score(X_test_tfd,y_test)
print('Linear model: Rsq_train = {0:.3f}, Rsq_test = {1:.3f}'
      .format(lm_rsq_train,lm_rsq_test))

Shapes of X_train = (1595, 88), X_test = (399, 88), y_train = (1595,), y_test = (399,)
weights = [ -4.66111597e+03   9.06174811e+01   5.70695396e+02  -2.15880378e+03
   4.19403309e+02  -5.37842989e+02   2.08168575e+02   1.13622595e+02
  -1.46820299e+03  -4.67894683e+02   5.28784733e+01  -5.29796382e+02
   7.35994039e+02   2.89944896e+02  -5.42105451e+02   6.79391567e+02
   1.61120371e+02  -1.23331429e+03  -4.13040957e+02  -6.55696933e+02
   2.48415091e+02   2.05407766e+02  -4.48274883e+01   5.05638204e+02
  -2.17788924e+02  -2.16398923e+02   2.43728902e+02   4.88217620e+02
   2.50368358e+03   7.60786806e+02   2.10120566e+03  -4.07118575e+03
  -7.64885121e+02   5.39498150e+02  -2.00586825e+03   4.38460862e+02
  -5.44274640e+01   4.29488395e+02  -7.48045632e+02  -4.99671512e+03
   1.38018430e+03   3.67541894e+03   1.72290189e+02   3.06391658e+02
  -5.10312871e+02   2.57886328e+02   5.94613628e+02  -2.30780610e+03
   2.92212263e+03  -1.98407956e+03  -1.56405572e+02  -1.26701937e+03
   1.1

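##### Ridge
- Several solvers are available; the default `solver='auto'` chooses one based on the type of data
- There is also a setting for the maximum number of iterations (`max_iter`), because some of the solvers are iterative; the ridge objective itself is convex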

In [446]:
from sklearn.linear_model import Ridge
linear_model_ridge = Ridge(random_state = 123)
print(linear_model_ridge)

start = time.time()
linear_model_ridge.fit(X_train_tfd, y_train)
end = time.time()
print('weights = {0}, intercept = {1}'.format(linear_model_ridge.coef_, linear_model_ridge.intercept_))
print('Non zero weights = {}'.format(sum(linear_model_ridge.coef_ != 0)))
print('Training Time in ms: {}'.format((end-start)*1000))
print('Training Iterations = {}'.format(linear_model_ridge.n_iter_))

#scored_train_linear = linear_model.predict(X_train)
scored_test_ridge = linear_model_ridge.predict(X_test_tfd)
ridge_rsq_train = linear_model_ridge.score(X_train_tfd,y_train)
ridge_rsq_test = linear_model_ridge.score(X_test_tfd,y_test)
print('Ridge model: Rsq_train = {0:.3f}, Rsq_test = {1:.3f}'
      .format(ridge_rsq_train,ridge_rsq_test))

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=123, solver='auto', tol=0.001)
weights = [   83.68692279   137.49956796    72.40012861  -539.35205833   -83.70064812
   -11.00498013    81.06409567   114.08385365   -72.64821886  -243.80980423
   -16.34427048  -465.46415762   238.3753617    231.17653078  -392.99739981
   -93.98098234    41.87934902     8.93027448  -273.68752633  -521.67233802
   232.24613418   119.89605543  -150.30961162   200.02148131  -198.44752047
  -142.41408303   124.78213751   245.11750415   385.41436554   414.50875866
  -245.3434229   -100.01843343   243.26260869  -272.27942116 -1080.26095774
   180.74494771   -90.7378779    261.75436146  -515.42426767   -64.85400149
  1278.69197881    47.78977793   217.32171009    16.50675026  -204.84924182
   109.69208994   -30.38976103  -165.48300876    50.19723027    32.95240013
     9.3052496   -543.5373572    103.44418357  -238.53581673   240.6519063
  -213.70935128   -83.2373

In [447]:
plt.figure()
plt.scatter(scored_test_linear, scored_test_ridge, marker = '.')
plt.xlabel('Unregularized')
plt.ylabel("Ridge")


In [448]:
plt.figure()
#len(linear_model_ridge.coef_)
x = list(range(1,89,1))
x_ = [i+0.5 for i in x]
plt.bar(x = x,height = linear_model.coef_, width = 0.5, label = 'unreg')
plt.bar(x = x_ ,height = linear_model_ridge.coef_, width = 0.5, label = 'reg')
plt.legend()


#### Ridge Regression (with normalization and without)
- Is the training time different?
- Is the performance different?
- What about the prediction scores?

5. Scikit-learn's general flow of modeling
   - Train/test sets : `model_selection` (e.g. `train_test_split`)
   - Model estimator : a class used to create a model object
   - fit method : trains the model object on the training data
   - score method : evaluates the fitted model on test data
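
The four steps above can be sketched end to end as follows. The synthetic data and the choice of `LinearRegression` are placeholders; any scikit-learn estimator follows the same pattern.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=100, n_features=1, noise=30,
                                 random_state=0)

# 1. Train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)
# 2. Model estimator: instantiate the class to get a model object
estimator = LinearRegression()
# 3. fit: learn the parameters from the training data
estimator.fit(X_tr, y_tr)
# 4. score: evaluate on the held-out test data (R^2 for regressors)
test_r2 = estimator.score(X_te, y_te)
predictions = estimator.predict(X_te)
print("test R^2 = {0:.3f}".format(test_r2))
```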