1. Datasets 
  - Synthetic datasets in scikit-learn 
  - Low dimensional data vs High dimensional data 
2. Supervised Learning Algorithms 
   - k Nearest Neighbors
   - Generalized Linear Models 
   - GLMs with regularization - Lasso and Ridge GLMs
   - Kernel based support vector models 
   - Decision Trees 
   - Tree Ensembles (Bagging, Random Forest, Boosting with trees) 
   - Neural Networks  
3. Common concepts used when applying these algorithms 
   - Algorithm Classification 
     - Parametric/ Non Parametric 
     - Classification / Regression / Unsupervised
   - Bias vs Variance or Underfitting vs Overfitting   
     - Usually related to choice of model hyperparameters or train/test selection
4. Strengths vs Weaknesses of each algorithm  
5. Scikit-learn's general flow of modeling 

## 1. Data sets 
- Synthetic datasets in scikit-learn can be generated for a variety of tasks, such as
  experimenting with regression and classification algorithms 
- Lower dimensional data sets are useful for understanding concepts like overfitting and underfitting 
- BUT conclusions drawn from them do not always generalize to higher dimensional data sets, because higher 
  dimensional data sets can be sparse (more on that to come)
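
One way to see the sparsity point is to check how far apart points sit as the number of features grows. The sketch below is only an illustration on uniformly random data; the helper `mean_nearest_neighbor_distance` and the sample size are assumptions made for this example, not part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200  # hypothetical sample size, kept fixed while dimension grows


def mean_nearest_neighbor_distance(n_features):
    """Average distance from each point to its closest other point."""
    X = rng.uniform(size=(n_samples, n_features))
    # Pairwise squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    np.fill_diagonal(sq_dists, np.inf)  # ignore the distance of a point to itself
    return np.sqrt(np.maximum(sq_dists, 0).min(axis=1)).mean()


nn_dist = {d: mean_nearest_neighbor_distance(d) for d in (2, 10, 100)}
for d, dist in nn_dist.items():
    print("n_features = {0:3d}, mean nearest-neighbor distance = {1:.3f}".format(d, dist))
```

With the same number of samples, the nearest neighbor of a point moves further away as features are added, which is one reason intuition built on two features may not carry over.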

In [9]:
import sklearn
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib notebook 
from sklearn.datasets import make_classification, make_regression 

#### Simple classification dataset

In [3]:
X, y = make_classification(n_classes= 2, n_samples= 100, n_features= 2,
                           flip_y= 0.01,n_informative= 2, n_redundant= 0)
plt.figure()
plt.scatter(x = X[:,0], y = X[:,1], c = y)


#### Complex classification dataset

In [6]:
X, y = make_classification(n_classes= 2, n_samples= 100, n_features= 2,
                           flip_y= 0.5,n_informative= 2, n_redundant= 0)
from matplotlib import pyplot as plt
%matplotlib notebook 
plt.figure()
plt.scatter(x = X[:,0], y = X[:,1], c = y)


#### Dataset for clusters

In [7]:
from sklearn.datasets import make_blobs

In [8]:
X,y = make_blobs(n_samples= 100, n_features= 2, centers= 4, random_state= 121)
plt.figure()
plt.scatter(x = X[:,0], y = X[:,1], c = y)


#### Dataset for regression

In [9]:
X,y = make_regression(n_samples= 100, n_features= 1, n_informative= 1,bias = 10, noise = 30, random_state= 111)
plt.figure()
plt.scatter(X,y)


## 2. Supervised Learning Algorithms

### 2.1 kNearest Neighbors for classification and regression
#### a. kNN for classification 
- Algorithm 
- Application 
  - Synthetic data set, divide into train and test 
  - Show overfitting and underfitting, and how a balance can be achieved 
    - Use visual representation of decision boundaries to explain  
    
#### b. kNN for regression  
- same as above 

#### c. Pros and cons of knn

#### a.1 Algorithm  - TBD
- Brief overview : http://localhost:8888/notebooks/Documents/DS/Python/UM%20Spcialization/Machine_Learning/Week%201/Intro_to_ML.ipynb

#### a.2 Application

#### a.2.1 Synthetic data set

In [10]:
import numpy as np
import pandas as pd
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_classification
X,y = make_classification(n_classes= 3, n_features= 2, n_informative= 2, n_redundant= 0,
                          n_clusters_per_class= 1, flip_y= 0.2, random_state= 123)

In [11]:
plt.figure()
color_list_light = ['bisque', 'palegreen', 'lightblue']
color_list_dark = ['darkorange', 'darkgreen', 'royalblue']
cmap_points = ListedColormap(color_list_dark)
plt.scatter(X[:,0], X[:,1],c = y, cmap = cmap_points)


#### a.2.2 Overfitting and Underfitting with kNN classifier

### Overfitting vs Underfitting. 
- Capturing global trends vs local trends (Good fit vs Overfit)
- Capturing global trends well vs failing to capture global trends well enough (good fit vs underfit) 
- Many reasons for overfitting
  - **Ability to capture the global trend relies on an important assumption:**
    - **Data is representative of the future data that the model will see, which might not be realistically true.** 
      - So, the choice of training sample is definitely important for the model to not overfit
      - Some model monitoring is required to flag when scoring data starts veering from the training set
      - We rely on this assumption, and evaluate on a test set to guard against overfitting as much as possible 
  - Choice of hyperparameters  
    - k in kNN: low k leads to overfit, high k to underfit 
    - learning rate in GBM
  - Data sampling 
    - If a selection bias occurs in data sampling, it can result in over or even under fit  
  - Too many features, poor variable selection  
    - Polynomial regression 

In [12]:
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier

In [13]:
def knn_classification(X,y,k,train_prop = 0.8):
    """Function to train a kNN classifier on train set. Expects X (2 features) and y as arrays,
    k as number of neighbors
    Returns:
    A plot of decision boundary after training, accuracy score on train and test"""
    X_train, X_test, y_train, y_test = train_test_split(X, y,train_size = train_prop , 
                                                        test_size = (1-train_prop))
    # KNeighborsClassifier accepts y as a 1-D array of shape (n_samples,), so no reshape is needed
    print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
          .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))
    #y_train = y_train.reshape((-1,1))
    #y_test = y_test.reshape((-1,1))
    #print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    model = KNeighborsClassifier(n_neighbors = k)
    
    # fit() shows that kNN is implemented as a lazy learner: it memorizes the training data.
    # More to be covered later. Broadly, it stores the data in a tree form for fast neighbor lookup
    model.fit(X_train, y_train)
    
    train_accuracy = model.score(X = X_train,y = y_train)
    test_accuracy = model.score(X = X_test,y = y_test)
    
    # Plotting decision boundary
    #1. Get a meshgrid using range values of two features
    # 2.Then use the trained model, to predict each record formed by mesh grid 
    #3. plot the region, using colors for region of each class
    
    x_axis_min = np.min(X_train[:,0])
    x_axis_max = np.max(X_train[:,0])
    y_axis_min = np.min(X_train[:,1])
    y_axis_max = np.max(X_train[:,1])

    mesh_step_size = 0.01
    xx, yy = np.meshgrid(np.arange(x_axis_min, x_axis_max + 1,mesh_step_size), 
               np.arange(y_axis_min, y_axis_max + 1, mesh_step_size))
    z = np.concatenate((xx.reshape(1,-1), yy.reshape(1,-1)), axis = 0)
    scored_mesh = model.predict(z.T)
    print("Shapes of arrays of meshgrid , arr1 = {0}, arr2 = {1}, scored_mesh = {2}" 
          .format(xx.shape, yy.shape, scored_mesh.shape))
    scored_mesh = scored_mesh.reshape(xx.shape)
    
    color_list_light = ['bisque', 'palegreen', 'lightblue']
    color_list_dark = ['darkorange', 'darkgreen', 'royalblue']
    cmap_points = ListedColormap(color_list_dark)
    cmap_region = ListedColormap(color_list_light)
    #plt.figure()
    plt.pcolormesh(xx,yy, scored_mesh, cmap = cmap_region)
    plt.scatter(X_train[:,0], X_train[:,1],c = y_train, cmap = cmap_points,
               s = 10, edgecolor = 'black')
    #plt.xlabel('Feature 0')
    #plt.ylabel('Feature 1')
    plt.xlim(x_axis_min, x_axis_max)
    plt.ylim(y_axis_min, y_axis_max)
    plt.title("k = {0}, Train Accuracy = {1}, Test Accuracy = {2}".format(k,train_accuracy, test_accuracy),
             fontsize = 'xx-small')
    plt.tick_params(axis = 'both', labelsize = 'xx-small')
    #plt.subplots_adjust(hspace = 0.3)
    plt.tight_layout()

- The first two charts (both k = 1) show how test accuracy depends on the random chance of which records land in the test set 
- For small values of k, the model learns small local patterns rather than general ones, hence it is more complex (shown through
  the decision boundary) and has low bias (train error 0 for k = 1). But it does not perform as well on the test set (higher variance)
- As k increases, complexity decreases; in fact test accuracy reaches its max for k = 5, when train accuracy is lowest
In [14]:
plt.figure()
plt.subplot(3,2,1)
knn_classification(X,y,k=1)
plt.subplot(3,2,2)
knn_classification(X,y,k=1)
plt.subplot(3,2,3)
knn_classification(X,y,k=3)
plt.subplot(3,2,4)
knn_classification(X,y,k=4)
plt.subplot(3,2,5)
knn_classification(X,y,k=5)
plt.subplot(3,2,6)
knn_classification(X,y,k=6)


Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 595), arr2 = (679, 595), scored_mesh = (404005,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679, 659), scored_mesh = (447461,)
Shapes of X_train = (80, 2), X_test = (20, 2), y_train = (80,), y_test = (20,)
Shapes of arrays of meshgrid , arr1 = (679, 659), arr2 = (679

#### b. kNN for regression
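- kNN for a regression problem works similarly; the y values of the k neighbors are combined, by default by
  taking their mean 
- The default evaluation metric is R squared / coefficient of determination, i.e. how much of the variation 
  in y is explained by the model. 

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}, \qquad ESS = TSS - RSS$$

where TSS is the total sum of squares, RSS the residual sum of squares and ESS the explained sum of squares.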
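
As a quick check of both statements, the sketch below recomputes a kNN prediction as the mean of its neighbors' targets and recomputes R squared from RSS and TSS. The data and the names `X_demo`, `y_demo`, `r2_manual` are assumptions for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X_demo, y_demo = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)
model = KNeighborsRegressor(n_neighbors=5).fit(X_demo, y_demo)
pred = model.predict(X_demo)

# Prediction for the first record = mean of the targets of its 5 nearest neighbors
_, idx = model.kneighbors(X_demo[:1])
neighbor_mean = y_demo[idx[0]].mean()

# R squared from its definition: 1 - RSS / TSS
rss = np.sum((y_demo - pred) ** 2)
tss = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - rss / tss
r2_sklearn = model.score(X_demo, y_demo)

print("neighbor mean = {0:.3f}, model prediction = {1:.3f}".format(neighbor_mean, pred[0]))
print("manual R sq = {0:.4f}, model.score = {1:.4f}".format(r2_manual, r2_sklearn))
```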

In [15]:
X, y = make_regression(n_samples= 100, n_features= 1, n_informative= 1, 
                n_targets= 1, bias = 50, noise = 30 )
plt.figure()
plt.scatter(X,y)


In [16]:
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

In [17]:
def knn_regression(X,y,k,train_prop = 0.8):
    """Function to split data into train, test with one feature; train a kNN regression, score 
    on test set.
    Returns a plot with the fitted curve, and R squared on train and test set"""
    X_train, X_test, y_train, y_test = train_test_split(X, y,train_size = train_prop , 
                                                        test_size = (1-train_prop),random_state = 123)
    print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
          .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))
    model = KNeighborsRegressor(n_neighbors = k)
    model.fit(X_train, y_train)
    train_r_sq = model.score(X_train, y_train)
    test_r_sq = model.score(X_test, y_test)
    
    # Construct a grid of x values to score with the model, for drawing the fitted curve
    x_axis_min = np.min(X_train)
    x_axis_max = np.max(X_train)
    step_size = 0.1
    xx = np.arange(x_axis_min, x_axis_max + 1, step_size).reshape(-1,1)
    print(xx.shape)
    yy = model.predict(xx)
    plt.scatter(X_train, y_train, c = 'royalblue', s = 15, marker = 'o', alpha = 0.5)
    #plt.scatter(xx,yy, c = 'red', marker = 'o', 
    #            s = 15,edgecolor = 'black', alpha = 0.6)
    plt.plot(xx,yy,color = 'red', marker = 'o', markeredgecolor= 'black', 
             linestyle = '-', ms = 3, alpha = 0.5)
    plt.xlim(x_axis_min, x_axis_max)
    plt.title("k = {0}, \n Train R Sq = {1}, \n Test R Sq = {2}".format(k,train_r_sq, test_r_sq),
             fontsize = 'xx-small')
    plt.tick_params(axis = 'both', labelsize = 'xx-small')
    #plt.subplots_adjust(hspace = 0.3)
    plt.tight_layout()


##### Increasing k from 1 (to 7 and 15 in the plots below) takes the model from overfitting to a better fit; then for k = 50 it underfits, and both training and test R squared drop

In [18]:
plt.figure()
plt.subplot(2,2,1)
knn_regression(X,y,k=1)
plt.subplot(2,2,2)
knn_regression(X,y,k=7)
plt.subplot(2,2,3)
knn_regression(X,y,k=15)
plt.subplot(2,2,4)
knn_regression(X,y,k=50)


Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(59, 1)
Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(59, 1)
Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(59, 1)
Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
(59, 1)


In [19]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

#### c. Pros and Cons of knn 

|Pros|Cons|
|:-|:-|
|1. Simple to understand, no assumptions are made on distribution of data|1. If the number of samples or features is large, can be slow to predict, since neighbors are searched at prediction time|
|2. Good for setting a benchmark|2. Performance especially decreases if data is sparse, with many features|
|3. -|3. Predictions change with small changes in data, so the model is sensitive to shifts in real-world data compared to training data|


### 2.2 Generalized Linear Models 
- GLMs make assumptions about the relationship between dependent and independent variables 
- Model building is hypothesis-oriented: we hypothesize that a linear relationship exists, and validate it after
  model training 
- Theoretical aspects of Linear Regression, and the algorithm, are discussed here - 
  - http://localhost:8888/notebooks/Documents/DS/Supervised%20Learning/SL201/Linear%20Regression/Linear_Regression_latest.ipynb

#### Key Aspects :
1. Strong assumptions about data relationships 
  - Linearity
  - No specification bias 
  - Homoscedasticity : this is relaxed in GLMs, where the target can have a distribution other than normal
  - No multicollinearity  
  - Normality of errors 
  - No autocorrelation of errors 
2. Loss Function - Mean Residual Sum of Squares / Mean RSS
3. Model Complexity
  - There is no mechanism to control model complexity (without regularization), nothing like k in kNN
  - Linear models can become complex through an increase in the number of features, feature forms, and interactions 
  - With a large number of features, a common measure of complexity is the size of the coefficients, 
    which, as we will see later, is used to control complexity. 
4. Use of linear regression in scikit-learn 
5. Comparison with kNN regression
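
To make the loss function concrete, the sketch below fits ordinary least squares directly with NumPy and compares the result with `LinearRegression`. It is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and `X_design` are names assumed for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=100, n_features=1, bias=50, noise=30, random_state=123)

# Design matrix with a leading column of ones, so beta[0] plays the role of the intercept
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
# Least squares solution: the beta that minimizes the residual sum of squares
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
mean_rss = np.mean((y_demo - X_design @ beta) ** 2)

model = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)
print("lstsq:   intercept = {0:.4f}, weight = {1:.4f}".format(beta[0], beta[1]))
print("sklearn: intercept = {0:.4f}, weight = {1:.4f}".format(model.intercept_, model.coef_[0]))
print("Mean RSS at the solution = {0:.2f}".format(mean_rss))
```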

In [20]:
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples= 100, n_features= 1, n_informative= 1, 
                n_targets= 1, bias = 50, noise = 30, random_state= 123)
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.2,random_state = 123)
print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
       .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))

linear_model = LinearRegression(fit_intercept= True)
linear_model.fit(X_train, y_train)
print('weights = {0:}, intercept = {1}'.format(linear_model.coef_, linear_model.intercept_))
scored_train_linear = linear_model.predict(X_train)
scored_test_linear = linear_model.predict(X_test)
lm_rsq_train = linear_model.score(X_train,y_train)
lm_rsq_test = linear_model.score(X_test,y_test)

# knn model
k = 5
knn_model = KNeighborsRegressor(n_neighbors = k)
knn_model.fit(X_train, y_train)
scored_train_knn = knn_model.predict(X_train)
scored_test_knn = knn_model.predict(X_test)
knn_rsq_train = knn_model.score(X_train, y_train)
knn_rsq_test = knn_model.score(X_test, y_test)

Shapes of X_train = (80, 1), X_test = (20, 1), y_train = (80,), y_test = (20,)
weights = [ 33.930188], intercept = 49.61446610803708


In [21]:
plt.figure()
plt.scatter(X_train, y_train,c = 'royalblue', marker = 'o', alpha = 0.5, s = 20, label = 'Training points')
plt.plot(X_train,scored_train_knn, 'g*', label = 'Knn Model',markersize = 5)
plt.plot(X_train, scored_train_linear,'r-', label = 'Linear Reg.')
plt.title('Fit on train set \n Rsq_train_linear = {0:.2f}, Rsq_train_knn = {1:.2f}'
          .format(lm_rsq_train,knn_rsq_train))
plt.legend()


In [22]:
plt.figure()
plt.scatter(X_test, y_test,c = 'royalblue', marker = 'o', alpha = 0.5, label = 'Test points')
plt.plot(X_test,scored_test_knn, 'g*', markersize = 5,label = 'Knn scored set')
plt.plot(X_test, scored_test_linear,'r-',label = 'Linear Reg. scored set')
plt.title('Score on test set \n Rsq_test_linear = {0:.2f}, Rsq_test_knn = {1:.2f}'
          .format(lm_rsq_test,knn_rsq_test))
plt.legend()


##### Comparison of Linear Regression and kNN regression

|kNN|Linear Regression|
|:-|:-|
|1. Simple to understand, no assumptions are made on distribution of data|1. Strong assumptions on relationships in data, assumptions require validation|
|2. Can learn non-linear decision boundaries|2. Requires variables to be transformed to address non-linear relationships|
|3. Can be sensitive to changes in data, generally has low bias (can learn non-linear relationships)|3. Stable to changes in data, but generally has high bias (because of linear structure)|

### 3. Lasso and Ridge Regressions 
- Ridge Regression 
  - L2 penalty in cost function, scoring remains same
  - Need for feature scaling
    - type of scaling depends on data, learning task (class/reg) and algorithm (more later)
  - Effect of penalty with and without Feature scaling. 
  - Effect of regularization on evaluation metric in a learning scenario

#### Ridge Regression. 
Cost function:

$$f(\beta) = \sum_{i=1}^m\left[y^{(i)} - \left(\widehat{\beta_0} + \widehat{\beta_1}x^{(i)}_1 + ... + \widehat{\beta_n}x^{(i)}_n\right)\right]^2 + \alpha \sum_{j=1}^n\beta^2_j$$

- **Notice the bias term (intercept $\beta_0$) is not used in the regularization term: the penalty sum starts at j = 1**

- **Alpha** 
  - Alpha controls the extent of regularization and must be non-negative (alpha = 0 gives back unregularized least squares) 
  - It is a hyperparameter: the choice of alpha affects the learned parameters of the model, and hence 
    it is tuned using a validation dataset / cross validation 
  - Higher alpha means more regularization  
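
The cost function above has a closed-form minimizer, which makes the role of alpha easy to inspect. The sketch below centers the data so that the intercept stays out of the penalty, solves $(X^TX + \alpha I)\beta = X^Ty$, and compares with `Ridge`. The data and names (`X_demo`, `beta_closed`, `coef_norms`) are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X_demo, y_demo = make_regression(n_samples=50, n_features=5, noise=10, random_state=0)
alpha = 10.0

# Centering X and y removes the intercept from the problem, so it is not penalized
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
beta_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
intercept_closed = y_demo.mean() - X_demo.mean(axis=0) @ beta_closed

ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
print("closed form:", np.round(beta_closed, 3), round(float(intercept_closed), 3))
print("sklearn    :", np.round(ridge.coef_, 3), round(float(ridge.intercept_), 3))

# Higher alpha means more regularization: the coefficient vector shrinks
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in (0.1, 10.0, 1000.0)]
print("coefficient norms for alpha = 0.1, 10, 1000:", np.round(coef_norms, 3))
```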

#### How does the penalty term avoid overfitting? 
  - Empirically, an unregularized linear regression tends to assign large coefficients to important variables, and these
    large coefficients can cause overfitting.
    The first half of the cost function drives learning in that direction by decreasing RSS, but the second half penalizes 
    increases in coefficient size. 
  - This counter-effect means that the coefficient size of important variables is limited, and so the model is able 
    to generalize better to newer data sets
  - Once the coefficients and intercept are learned, predictions are generated from them exactly as in unregularized regression  

#### Feature Scaling 
- The features in the regression can be on different scales. **Remember the learned coefficient values depend
  on the scale of the features used** 
- When using ridge, all squared coefficients are added together with the same weight of 1, so the penalty is only fair
  if each feature is brought to a common scale 
- **Feature scaling is helpful in many respects and is used as a pre-processing step in many algorithms**
  - Esp. regularized linear and logistic regression, neural nets, SVM, and distance-based methods like kNN
  - Faster convergence of the learning algorithm 
  - Coefficients are penalized fairly, without needing a separate weight per feature
- Many options are available for the type of scaling: MinMax, zero mean and unit std, etc.   
- **Use the same scaler on the test set; do not fit the scaler again on test data, as that results in data leakage! Think of
  it this way: you are not supposed to know anything about the test data during training, but you would end up using the extreme values
  of the test set in scaling, which might give overly good results on the test set. Doing this in kaggle might
  be desirable to exploit the leakage :D**  
- Downside - interpretation is affected: coefficients are expressed in units of the scaled features, and for a regularized
  model the predictions also differ from those obtained without feature scaling
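
One way to make the "fit the scaler on train only" rule hard to break is to chain the scaler and the model in a scikit-learn pipeline. The sketch below is an illustration on synthetic data; the names `X_demo`, `X_tr`, `X_te`, and `pipe` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# fit() learns min/max from the training data only, then trains the ridge model on scaled data
pipe = make_pipeline(MinMaxScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)

# score() reuses the training min/max to transform the test data; nothing is refit
test_r2 = pipe.score(X_te, y_te)
scaler = pipe.named_steps['minmaxscaler']
print("Per-feature minimum learned from train:", np.round(scaler.data_min_, 3))
print("Test R sq = {0:.3f}".format(test_r2))
```

Since the scaler never sees the test set, scaled test values can fall slightly outside [0, 1]; that is expected and is not a sign of a bug.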

In [93]:
def load_crime_dataset():
    # Communities and Crime dataset for regression
    # https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

    crime = pd.read_table('CommViolPredUnnormalizedData.txt', sep=',', na_values='?')
    # remove features with poor coverage or lower relevance, and keep ViolentCrimesPerPop target column
    columns_to_keep = [5, 6] + list(range(11,26)) + list(range(32, 103)) + [145]  
    crime = crime.iloc[:,columns_to_keep].dropna()

    X_crime = crime.iloc[:,range(0,88)]
    y_crime = crime['ViolentCrimesPerPop']

    return (X_crime, y_crime)
X_crime, y_crime = load_crime_dataset()
print(X_crime.shape, y_crime.shape)

(1994, 88) (1994,)


#### Effect of feature scaling using two features - MinMaxScaler 
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

In [24]:
from sklearn.preprocessing import MinMaxScaler
sample = X_crime.loc[:,['agePct12t21', 'agePct12t29']]
scaler = MinMaxScaler() 
scaler.fit(sample) # Compute the minimum and maximum to be used for later scaling
print(scaler)
sample_tfd = scaler.transform(sample) # transformation

MinMaxScaler(copy=True, feature_range=(0, 1))


In [25]:
plt.figure()
plt.subplot(1,2,1)
plt.scatter(X_crime['agePct12t21'], X_crime['agePct12t29'])
plt.subplot(1,2,2)
plt.scatter(sample_tfd[:,0], sample_tfd[:,1])


#### Ridge Regression (with normalization) vs Unregularized regression 
- Performance difference 
  - Ridge is fast, possibly due to optimizations in solver
- See the difference in coef. values  
  - Ridge coefficients are smaller ( see bar chart)
- See the difference in predictions, and how interpretability is affected 
  - Scatter on test set shows predictions differ; there is no pattern of ridge being consistently higher or lower
  - Interpretability is affected, because you can no longer reason that a coefficient represents the effect of a unit   
    change in the original variable, keeping all others the same. However, you can compare coefficient sizes to gauge importance.
    Some normalizations, like z-score standardization, still allow reasoning in terms of a one standard deviation change.
- Finding standard errors in regression coefficients from ridge/lasso
  - https://stats.stackexchange.com/questions/45449/when-using-glmnet-how-to-report-p-value-significance-to-claim-significance-of-pr
- Finding standard errors from regression 
  - sklearn's LinearRegression() does not provide them, but the class can be easily extended, or use statsmodels. 
  The solution comes from solving the matrix algebra 
  - The diagonal of $\hat{\sigma}^2 (X^TX)^{-1}$ gives the variance of the betas, where $\hat{\sigma}^2$ is the estimated residual variance
  - https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression  
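
The matrix algebra for unregularized regression can be sketched in a few lines. This is an illustration under the usual OLS assumptions, on synthetic data; `X_design`, `sigma2`, and `std_errors` are names made up for this example and are not attributes of `LinearRegression`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)

# Design matrix with a column of ones, so the intercept gets a standard error too
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
residuals = y_demo - model.predict(X_demo)

# Unbiased estimate of the residual variance: RSS / (m - number of estimated parameters)
dof = X_design.shape[0] - X_design.shape[1]
sigma2 = residuals @ residuals / dof

# Var(beta) = sigma^2 * (X^T X)^-1; the standard errors are the square roots of its diagonal
cov_beta = sigma2 * np.linalg.inv(X_design.T @ X_design)
std_errors = np.sqrt(np.diag(cov_beta))
print("Standard errors (intercept first):", np.round(std_errors, 4))
```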

In [26]:
import time

In [43]:
# Linear model, not regularized
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,test_size = 0.2,random_state = 123)
print("Shapes of X_train = {0}, X_test = {1}, y_train = {2}, y_test = {3}"  
       .format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))

scaler = MinMaxScaler()
X_train_tfd = scaler.fit_transform(X_train)
X_test_tfd = scaler.transform(X_test)

linear_model = LinearRegression(fit_intercept= True)
start = time.time()
linear_model.fit(X_train_tfd, y_train)
end = time.time()
print('weights = {0}, intercept = {1}'.format(linear_model.coef_, linear_model.intercept_))
print('Non zero weights = {}'.format(sum(linear_model.coef_ != 0)))
print('Training Time in ms: {}'.format((end-start)*1000))
#scored_train_linear = linear_model.predict(X_train)
scored_test_linear = linear_model.predict(X_test_tfd)
lm_rsq_train = linear_model.score(X_train_tfd,y_train)
lm_rsq_test = linear_model.score(X_test_tfd,y_test)
print('Linear model: Rsq_train = {0:.3f}, Rsq_test = {1:.3f}'
      .format(lm_rsq_train,lm_rsq_test))

Shapes of X_train = (1595, 88), X_test = (399, 88), y_train = (1595,), y_test = (399,)
weights = [ -4.66111597e+03   9.06174811e+01   5.70695396e+02  -2.15880378e+03
   4.19403309e+02  -5.37842989e+02   2.08168575e+02   1.13622595e+02
  -1.46820299e+03  -4.67894683e+02   5.28784733e+01  -5.29796382e+02
   7.35994039e+02   2.89944896e+02  -5.42105451e+02   6.79391567e+02
   1.61120371e+02  -1.23331429e+03  -4.13040957e+02  -6.55696933e+02
   2.48415091e+02   2.05407766e+02  -4.48274883e+01   5.05638204e+02
  -2.17788924e+02  -2.16398923e+02   2.43728902e+02   4.88217620e+02
   2.50368358e+03   7.60786806e+02   2.10120566e+03  -4.07118575e+03
  -7.64885121e+02   5.39498150e+02  -2.00586825e+03   4.38460862e+02
  -5.44274640e+01   4.29488395e+02  -7.48045632e+02  -4.99671512e+03
   1.38018430e+03   3.67541894e+03   1.72290189e+02   3.06391658e+02
  -5.10312871e+02   2.57886328e+02   5.94613628e+02  -2.30780610e+03
   2.92212263e+03  -1.98407956e+03  -1.56405572e+02  -1.26701937e+03
   1.1

##### Ridge 
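- Several solvers are available; the default setting solver='auto' chooses one based on the data
- There is a setting for the maximum number of iterations as well, because some of the solvers are iterative (the ridge objective itself is convex)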

In [45]:
from sklearn.linear_model import Ridge
alpha = 10
linear_model_ridge = Ridge(random_state = 123, alpha = alpha)
print(linear_model_ridge)

start = time.time()
linear_model_ridge.fit(X_train_tfd, y_train)
end = time.time()
print('weights = {0}, intercept = {1}'.format(linear_model_ridge.coef_, linear_model_ridge.intercept_))
print('Non zero weights = {}'.format(sum(linear_model_ridge.coef_ != 0)))
print('Training Time in ms: {}'.format((end-start)*1000))
print('Training Iterations = {}'.format(linear_model_ridge.n_iter_))

#scored_train_linear = linear_model.predict(X_train)
scored_test_ridge = linear_model_ridge.predict(X_test_tfd)
ridge_rsq_train = linear_model_ridge.score(X_train_tfd,y_train)
ridge_rsq_test = linear_model_ridge.score(X_test_tfd,y_test)
print('Ridge model: Rsq_train = {0:.3f}, Rsq_test = {1:.3f}'
      .format(ridge_rsq_train,ridge_rsq_test))

Ridge(alpha=10, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=123, solver='auto', tol=0.001)
weights = [  8.14911404e+01   3.52829223e+01  -6.03710692e+01  -1.34158560e+02
  -9.10297373e+01   1.35954506e+01   8.12344460e+01   1.45161495e+02
   1.64306175e+01  -5.85520739e+01  -5.83946968e+01  -2.79415736e+02
   2.21514075e+01   1.80561250e+02  -1.57640618e+02  -8.05768885e+00
   4.40108697e+01   7.40478159e+01  -4.27080233e+00  -1.37406439e+02
   1.11906750e+02   2.22756235e+01  -5.82868380e+01  -2.90671086e-01
  -1.32659778e+02  -4.15636500e+01  -3.40758021e+00   7.12533864e+01
   1.71446659e+02   1.62042481e+02   1.10501731e+01   7.98067849e+01
   1.17955210e+02  -3.90660138e+02  -5.84064347e+02  -2.11553517e+02
  -2.79298151e+02   7.69811389e+01  -1.76002536e+02   7.51397117e+01
   7.64500643e+02   5.77113865e+01   9.76514579e+01  -4.14052890e+00
  -4.13240694e+01   5.91559883e+01   5.17148306e+00  -9.99493940e+00
   2.95477549e+01   4.64041949e+01

In [40]:
plt.figure()
plt.scatter(scored_test_linear, scored_test_ridge, marker = '.')
plt.xlabel('Unregularized')
plt.ylabel("Ridge")
ax = plt.gca()
mn, mx = ax.get_xlim()
arr = range(int(mn),int(mx),1)
plt.plot(arr,arr,'r--', alpha = 0.5)


### Notice that ridge reduces coefficient size wherever the unregularized regression has a large coefficient (not for all variables; for some the coefficient increases)

In [63]:
plt.figure()
x = np.arange(1, 89)
# Draw the two sets of coefficients side by side (instead of stacking them)
# so that their sizes can be compared directly for each feature
plt.bar(x = x - 0.2, height = linear_model.coef_, width = 0.4, label = 'unreg')
plt.bar(x = x + 0.2, height = linear_model_ridge.coef_, width = 0.4, label = 'reg')
plt.legend()


#### Ridge Regression (with normalization and without)
- Ridge without feature scaling should not even be considered, because weighting all the betas equally (by 1) in the L2
  term would not be right when the features are on different scales

### Regularization is especially useful when m (training samples) < n (features). As the data set size increases, the effect of regularization in making the model generalize well decreases

#### Lasso Regression  
- Uses an L1 penalty term 
- **The L1 penalty shrinks the coefficients of weak variables to exactly 0, unlike ridge, and hence serves 
  as a form of variable selection**
- Alpha is a tuneable parameter  

Cost function:

$$f(\beta) = \sum_{i=1}^m\left[y^{(i)} - \left(\widehat{\beta_0} + \widehat{\beta_1}x^{(i)}_1 + ... + \widehat{\beta_n}x^{(i)}_n\right)\right]^2 + \alpha \sum_{j=1}^n|\beta_j|$$



#### Lasso vs Ridge 
- use ridge when several variables can contribute small-medium effects
- can use lasso when only some variables can contribute medium-large effects
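
The difference is easy to see on data where only a few features matter. The sketch below is an illustration on synthetic data with 3 informative features out of 20; the names and the alpha value are assumptions for this example. Note that scikit-learn's `Lasso` scales the squared-error term by 1/(2m), so its alpha is not directly comparable with the alpha in the cost function written above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 20 features carry signal; the rest are noise
X_demo, y_demo = make_regression(n_samples=100, n_features=20, n_informative=3,
                                 noise=5, random_state=0)

lasso = Lasso(alpha=2.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=2.0).fit(X_demo, y_demo)

# Lasso sets weak coefficients to exactly zero; ridge only shrinks them
n_nonzero_lasso = int(np.sum(lasso.coef_ != 0))
n_nonzero_ridge = int(np.sum(ridge.coef_ != 0))
print("Non zero weights: lasso = {0}, ridge = {1}".format(n_nonzero_lasso, n_nonzero_ridge))
```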

#### Lasso with tuning of alpha using a grid, max_iter

In [51]:
X_train_tfd.shape

(1595, 88)

In [76]:
from sklearn.linear_model import Lasso
alpha = [0.1,0.2, 0.3,0.4,0.5,0.6,0.7,.8,.9,1]
lasso_train_rsq = []
lasso_test_rsq = []
for al in alpha:
    linear_model_lasso = Lasso(random_state = 123, alpha = al)
    linear_model_lasso.fit(X_train_tfd, y_train)
    # Count the features that survive the L1 penalty at this alpha
    print('Non zero weights = {}'.format(sum(linear_model_lasso.coef_ != 0)))

    # R squared on the scaled train and test sets
    lasso_train_rsq.append(linear_model_lasso.score(X_train_tfd,y_train))
    lasso_test_rsq.append(linear_model_lasso.score(X_test_tfd,y_test))

for al,train_score,test_score in zip(alpha, lasso_train_rsq,lasso_test_rsq):
    print('Alpha = {0},Train Rsq = {1:.3f}, Test Rsq = {2:.3f}'.format(al,train_score,test_score))

Non zero weights = 65
Non zero weights = 49
Non zero weights = 40
Non zero weights = 38
Non zero weights = 35
Non zero weights = 30
Non zero weights = 28
Non zero weights = 25
Non zero weights = 25
Non zero weights = 25
Alpha = 0.1,Train Rsq = 0.670, Test Rsq = 0.611
Alpha = 0.2,Train Rsq = 0.663, Test Rsq = 0.613
Alpha = 0.3,Train Rsq = 0.659, Test Rsq = 0.614
Alpha = 0.4,Train Rsq = 0.655, Test Rsq = 0.615
Alpha = 0.5,Train Rsq = 0.652, Test Rsq = 0.616
Alpha = 0.6,Train Rsq = 0.649, Test Rsq = 0.617
Alpha = 0.7,Train Rsq = 0.647, Test Rsq = 0.616
Alpha = 0.8,Train Rsq = 0.645, Test Rsq = 0.615
Alpha = 0.9,Train Rsq = 0.644, Test Rsq = 0.613
Alpha = 1,Train Rsq = 0.643, Test Rsq = 0.611
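
The loop above picks alpha by looking at test scores, which uses up the test set. An alternative that keeps the test set untouched is cross-validation on the training data. The sketch below shows this with `GridSearchCV` on synthetic data; the data and the names `X_demo`, `search`, and `best_alpha` are assumptions for this illustration, not the crime data set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_regression(n_samples=300, n_features=30, n_informative=5,
                                 noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# Scaling lives inside the pipeline, so it is refit on the training part of every fold
pipe = make_pipeline(MinMaxScaler(), Lasso(max_iter=10000))
search = GridSearchCV(pipe, param_grid={'lasso__alpha': alphas}, cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_['lasso__alpha']
test_r2 = search.score(X_te, y_te)  # the test set is used once, after alpha is chosen
print("Best alpha by 5-fold CV = {0}, Test Rsq = {1:.3f}".format(best_alpha, test_r2))
```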


In [77]:
plt.style.use('bmh')
plt.figure()
plt.plot(np.array(alpha), np.array(lasso_train_rsq), 'r-',label = 'train',alpha = 0.6)
plt.plot(alpha, lasso_test_rsq,'b-' ,label = 'test',alpha = 0.6)
plt.legend()


#### Retrain for alpha = 0.6 on full data set

In [106]:
full_model = Lasso(alpha = 0.6, random_state= 123)
scaler = MinMaxScaler()
X_crime_tfd = scaler.fit_transform(X_crime)
full_model.fit(X_crime_tfd,y_crime)
score = full_model.score(X_crime_tfd, y_crime)
result = pd.DataFrame({'name' : X_crime.columns, 'coef' :full_model.coef_},
             columns = ['name', 'coef']).sort_values(by = 'coef', ascending = False)
result_ = result.loc[result['coef'] !=0,:]

#### Coefficient sizes plotted in sorted order; only 28 non-zero coefficients

In [127]:
plt.figure()
_ =plt.bar(x = range(len(result_['name'])),
        height = result_['coef'],color = 'blue')
_ =plt.gca().set_xticks(np.arange(len(result_))) # ensures all tick labels appear
_=plt.gca().set_xticklabels(labels = result_['name'], rotation = -90, fontsize = 'xx-small')



#### 4. Polynomial features/ Regression  
**Polynomial features** 
- Constructing polynomial features means taking the available features and creating multiplicative combinations of them up to a chosen degree. E.g., for features $x_1$ and $x_2$, degree 2 adds $x_1^2, x_2^2, x_1x_2$.  
**Why Polynomial features** 
- Exploratory analysis can suggest that the relationship between predictor and predictand is non-linear (e.g. squared), or 
  you can hypothesize that a multiplicative relationship, rather than an additive one, is appropriate to predict the predictand.
  **This is often called feature interaction, like $x_1x_2$**.   
  - $Salary = \beta_0 + \beta_1*Age + \beta_2 * Gender$ implies that the average salary of one gender exceeds that of the other by the same amount ($\beta_2$) irrespective of age. This may not be true in the data; you can check it, and if it does not hold you
  should introduce an interaction feature. 
  So, $Salary = \beta_0 + \beta_1*Age + \beta_2 * Gender + \beta_3 * Gender * Age$  would be appropriate 
  - Polynomial features still allow you to fit a linear regression, as the model is linear in the $\beta$'s, but 
  they let you capture a non-linear relationship between predictor and predictand, weighted linearly by the parameters.   
  - Polynomial features allow you to build more complex features, which can be especially handy in classification tasks;
  kernelized support vector machines are an example of this. 
  - Other non-linear transformations are technically called non-linear basis functions.  
**Risks of using polynomial features** 
- Can overfit, so they are usually used together with regularization 
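
The salary example above can be sketched with simulated data. Everything in this block is an assumption made for illustration: the data is synthetic, the coefficient values are made up, and the variable names (`age`, `gender`, `salary`) are hypothetical. It fits the additive model and the model with the `Gender * Age` interaction column and compares their fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: all coefficient values below are assumptions.
rng = np.random.RandomState(0)
n = 200
age = rng.uniform(20, 60, size=n)
gender = rng.randint(0, 2, size=n)            # 0/1 indicator variable

# The simulated "truth" contains an interaction: the age slope is 1.0 for
# gender == 0 and 1.5 for gender == 1.
salary = (30 + 1.0 * age + 5 * gender + 0.5 * gender * age
          + rng.normal(0, 1, size=n))

# Additive design matrix vs. design matrix with the interaction column.
X_additive = np.column_stack([age, gender])
X_interact = np.column_stack([age, gender, gender * age])

additive = LinearRegression().fit(X_additive, salary)
interact = LinearRegression().fit(X_interact, salary)

r2_additive = additive.score(X_additive, salary)
r2_interact = interact.score(X_interact, salary)

print('R^2 additive    = {:.3f}'.format(r2_additive))
print('R^2 interaction = {:.3f}'.format(r2_interact))
print('Estimated beta_3 (Gender * Age) = {:.2f}'.format(interact.coef_[2]))
```

Both models are ordinary linear regressions; the only difference is the extra column, which is why the interaction model remains linear in its parameters.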
  

In [5]:
# The PolynomialFeatures class constructs features for a specified degree: all polynomial
# combinations of the features with degree <= the chosen degree (default 2) are built,
# and you can choose whether to include the bias term as well
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures?
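
As a minimal sketch of what the class produces, the block below expands a single sample with two features. The input values are arbitrary and chosen only so each output column is easy to recognize.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with x1 = 2 and x2 = 3 (arbitrary illustrative values).
X = np.array([[2.0, 3.0]])

# Degree-2 expansion including the bias column.
# Column order is: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)
print(X_poly)

# interaction_only=True keeps only products of distinct features,
# so the pure powers x1^2 and x2^2 are dropped: 1, x1, x2, x1*x2
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=True)
X_inter = inter.fit_transform(X)
print(X_inter)
```

The number of generated columns grows quickly with the degree and the number of input features, which is one reason the expansion is usually paired with a regularized model such as Ridge or Lasso.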

#### 5. Logistic Regression

- Predicts a categorical/nominal variable (order does not matter), binary or multi-class 
- $\widehat{y} = P(y = 1) = \sigma(z)$
where $z = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$ and $\sigma(z) = \frac{1}{1 + e^{-z}}$ 
- The logistic function maps the weighted linear combination of predictors onto the range 0 to 1. 
So, it is a GLM, still a linear model in parameters 

- Logistic Regression 
  - Cost function  
  $J = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)}) \right]$
  - Using a function to convert from predicted probabilities to classes  
  - Visualize probability function, and classification
- Decision boundary variation using different cut off values 
- Regularized Logistic regression
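
The pieces listed above can be sketched in plain NumPy. This is an illustration only: the coefficients `beta_0` and `beta_1` and the toy data are assumed values, not fitted ones, and the helper names `sigmoid` and `log_loss_cost` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_cost(y, y_hat):
    """Average cross-entropy cost J for labels y and predicted probabilities y_hat."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Assumed (not fitted) coefficients and a tiny one-feature data set.
beta_0, beta_1 = -4.0, 1.0
x = np.array([1.0, 3.0, 4.0, 5.0, 7.0])
y = np.array([0, 0, 1, 1, 1])

# Predicted probabilities P(y = 1) and the resulting cost.
y_hat = sigmoid(beta_0 + beta_1 * x)
cost = log_loss_cost(y, y_hat)
print('Predicted probabilities:', np.round(y_hat, 3))
print('Cost J = {:.3f}'.format(cost))

# Converting probabilities to classes: a different cut-off moves the
# decision boundary and can change the predicted labels.
for cutoff in (0.3, 0.5, 0.7):
    predicted = (y_hat >= cutoff).astype(int)
    print('cutoff = {0}: predicted classes = {1}'.format(cutoff, predicted))
```

Raising the cut-off makes the model more conservative about predicting the positive class, which trades recall for precision.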

In [40]:
fruits = pd.read_table('/Users/sumad/Documents/DS/Python/\
UM Spcialization/Machine_Learning/fruit_data_with_colors.txt')
fruits['Target'] = fruits['fruit_name'] == 'apple'
data = fruits.loc[:,['Target','width', 'height']]
#Y =  fruits.loc[mask, 'fruit_label']

In [41]:
fruits.shape

(59, 8)

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
train, test = train_test_split(data,test_size = 0.3, random_state = 123, 
                             stratify = data['Target'])

In [45]:
print(train.shape, test.shape)
# Use the actual set sizes rather than hard-coded counts
print('Train proportions --', sum(train['Target']==1)/len(train), sum(train['Target']==0)/len(train))
print('Test proportions --', sum(test['Target']==1)/len(test), sum(test['Target']==0)/len(test))

(41, 3) (18, 3)
Train proportions -- 0.317073170732 0.682926829268
Test proportions -- 0.333333333333 0.666666666667


In [46]:
LogisticRegression?

4. Scikit-learn's general flow of modeling  
   - Train/test sets : split the data using `model_selection` (e.g. `train_test_split`)
   - Model estimator : a class used to create a model object 
   - fit method : trains the model object on the training data
   - score method : evaluates the fitted model object on the test data
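
The four steps above can be sketched end to end. The block uses a synthetic data set from `make_classification` rather than the fruit data, so it runs on its own; the sample size, split ratio and `random_state` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-feature, two-class data (illustrative only).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=123)

# 1. Train/test sets: stratify so both sets keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# 2. Model estimator: create the model object.
model = LogisticRegression()

# 3. fit: train the model object on the training set.
model.fit(X_train, y_train)

# 4. score: evaluate (accuracy for classifiers) on train and test sets.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('Train accuracy = {:.3f}, Test accuracy = {:.3f}'.format(train_acc, test_acc))
```

For regressors such as `Lasso`, the same `score` call returns $R^2$ instead of accuracy, which is what the alpha sweep earlier in this notebook reports.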