## Linear Regression Learning Rate and Number of Iterations

This notebook demonstrates the effect of the learning rate and the number of iterations on the performance of a basic linear regression model. It is self-contained and will run once the necessary packages are installed (scikit-learn, numpy, pandas, matplotlib, bokeh, ipywidgets). Note: Anaconda distributions include all of these packages.

The models presented here are simple linear regression models. The dataset used is the diabetes dataset (available through the scikit-learn package), which contains 442 samples, 10 features and 1 target. To keep the model a simple (single-variable) linear regression, only one feature (column index 2, body mass index) is used to predict the target.
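For intuition about what the learning rate and the iteration count control, here is a minimal sketch of batch gradient descent for a one-feature linear model. It is an illustration only: the `gradient_descent` helper and the synthetic data are assumptions made for this sketch, and `SGDRegressor` (used later in the notebook) updates on one sample at a time rather than on the full batch.

```python
import numpy as np

def gradient_descent(x, y, learning_rate, n_iterations):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(n_iterations):
        error = (w * x + b) - y
        # Gradients of mean((w*x + b - y)^2) with respect to w and b
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        # Step against the gradient, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic data (hypothetical): true slope 3, true intercept 2, small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

for lr in (0.001, 0.01, 0.1):
    w, b = gradient_descent(x, y, learning_rate=lr, n_iterations=500)
    print(f"lr={lr}: w={w:.3f}, b={b:.3f}")
```

With the same number of iterations, a smaller learning rate takes smaller steps and so ends further from the true slope and intercept.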

In [1]:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

In [3]:
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]

In [4]:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

Because this is a basic example, it is acceptable to split the training and testing sets by position (i.e., the last 20 entries form the test set). In general, however, this is bad practice. Ideally, you should **randomly** split the data into training and testing sets according to a pre-defined ratio appropriate for your problem (70:30, 80:20, etc.).
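A random split can be sketched with scikit-learn's `train_test_split`. The 80:20 ratio, the seed value and the variable names below are assumptions chosen for illustration; the rest of this notebook keeps the positional split.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the data and keep the same single feature as the notebook
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]

# Shuffle and split 80:20; random_state fixes the shuffle for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```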

In [5]:
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

In [6]:
# Create linear regression object
regr = linear_model.LinearRegression()

In [7]:
regr2 = linear_model.SGDRegressor()

In [8]:
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [9]:
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

In [10]:
# Visualization loop: for each learning rate, refit the model from scratch
# with max_iter = 1..1999 and record the fitted intercept and coefficient.
# Because tol is set, a fit may stop before reaching max_iter.
import warnings
with warnings.catch_warnings():
    # Silence warnings (e.g. convergence warnings for small max_iter)
    warnings.simplefilter("ignore")
    results001 = []
    for i in range(1, 2000):
        regr001 = linear_model.SGDRegressor(max_iter=i, tol=0.001, learning_rate='constant', eta0=0.001)
        regr001.fit(diabetes_X_train, diabetes_y_train)
        results001.append([i, regr001.intercept_.item(), regr001.coef_.item()])
    results005 = []
    for i in range(1, 2000):
        regr005 = linear_model.SGDRegressor(max_iter=i, tol=0.001, learning_rate='constant', eta0=0.005)
        regr005.fit(diabetes_X_train, diabetes_y_train)
        results005.append([i, regr005.intercept_.item(), regr005.coef_.item()])
    results01 = []
    for i in range(1, 2000):
        regr01 = linear_model.SGDRegressor(max_iter=i, tol=0.001, learning_rate='constant', eta0=0.01)
        regr01.fit(diabetes_X_train, diabetes_y_train)
        results01.append([i, regr01.intercept_.item(), regr01.coef_.item()])
    results05 = []
    for i in range(1, 2000):
        regr05 = linear_model.SGDRegressor(max_iter=i, tol=0.001, learning_rate='constant', eta0=0.05)
        regr05.fit(diabetes_X_train, diabetes_y_train)
        results05.append([i, regr05.intercept_.item(), regr05.coef_.item()])
    results1 = []
    for i in range(1, 2000):
        regr1 = linear_model.SGDRegressor(max_iter=i, tol=0.001, learning_rate='constant', eta0=0.1)
        regr1.fit(diabetes_X_train, diabetes_y_train)
        results1.append([i, regr1.intercept_.item(), regr1.coef_.item()])
    results5 = []
    for i in range(1, 2000):
        regr5 = linear_model.SGDRegressor(max_iter=i, tol=0.001, learning_rate='constant', eta0=0.5)
        regr5.fit(diabetes_X_train, diabetes_y_train)
        results5.append([i, regr5.intercept_.item(), regr5.coef_.item()])
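Since `tol` is set in the loop above, `max_iter` acts as an upper bound on the number of epochs rather than an exact count. A minimal sketch of how one might check this uses the fitted estimator's `n_iter_` attribute, which reports the epochs actually run. The `epochs_run` dictionary and the seed are assumptions for this sketch, and the values printed will depend on the scikit-learn version.

```python
import warnings
import numpy as np
from sklearn import datasets, linear_model

# Same feature and positional training split as the notebook
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]
X_train, y_train = X[:-20], y[:-20]

epochs_run = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for eta0 in (0.001, 0.5):
        model = linear_model.SGDRegressor(
            max_iter=1999, tol=0.001, learning_rate='constant',
            eta0=eta0, random_state=0,  # seed chosen arbitrarily for this sketch
        )
        model.fit(X_train, y_train)
        # n_iter_ is the number of epochs run before the stopping criterion was met
        epochs_run[eta0] = model.n_iter_
        print(f"eta0={eta0}: ran {model.n_iter_} of at most {model.max_iter} epochs")
```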


In [11]:
# Format the results as pandas dataframes
columns = ['Iteration', 'Intercept', 'Coefficient']
results001 = pd.DataFrame(results001, columns=columns)
results005 = pd.DataFrame(results005, columns=columns)
results01 = pd.DataFrame(results01, columns=columns)
results05 = pd.DataFrame(results05, columns=columns)
results1 = pd.DataFrame(results1, columns=columns)
results5 = pd.DataFrame(results5, columns=columns)


In [12]:
#import the necessary libraries from the bokeh visualization package
from bokeh.io import output_notebook, show, push_notebook
from bokeh.layouts import column, row
from bokeh.plotting import figure
from bokeh.models import Slider, ColumnDataSource, CategoricalColorMapper
from ipywidgets import interact
output_notebook()

In [13]:
#format the training and testing sets as pandas dataframes to enable easy conversion to Column Data Source
training = []
for i in range (0,len(diabetes_X_train)):
    training.append([diabetes_X_train[i].item(), diabetes_y_train[i].item(),'Training Set'])
training = pd.DataFrame(training, columns = ['X','Y','Label']) 

testing = []
for i in range (0,len(diabetes_X_test)):
    testing.append([diabetes_X_test[i].item(), diabetes_y_test[i].item(),'Testing Set'])
testing = pd.DataFrame(testing, columns = ['X','Y','Label']) 

In [14]:
#Create Column Data Source - Bokeh Data Structure
dataset = pd.concat([training,testing])
dataset = ColumnDataSource(dataset)

In [15]:
# Create Plots p (scatterplot) and p2 (bar graph)
x_vals = np.linspace(-0.1402752958985185,0.22055522598066002,100)

p = figure(x_range=( -0.1402752958985185, 0.22055522598066002), y_range=(23, 348), title = 'Visualization of Learning Rate vs. Iteration',plot_width=400, plot_height=400)
color_mapper = CategoricalColorMapper(factors=['Training Set', 'Testing Set'], palette=['red', 'blue'])
p.scatter('X', 'Y', source = dataset, color={'field': 'Label', 'transform': color_mapper}, legend = 'Label');
l = p.line(x_vals , y=results001.loc[0].Coefficient * x_vals + results001.loc[0].Intercept , line_color = 'green',line_width=4)
p.legend.location = 'bottom_right'
p.xaxis.axis_label = 'Feature Value'
p.yaxis.axis_label = 'Label Value'
# score() returns the coefficient of determination (R²) on the test set, not the RSS.
# The models scored here are the last ones fitted in the loop (max_iter = 1999).
p2 = figure(title = 'Learning Rate vs. R² Scores (max_iter = 1999)', x_range = ['0.001', '0.005', '0.01', '0.05', '0.1', '0.5'],plot_width=400, plot_height=400)
p2.vbar(x=['0.001', '0.005', '0.01', '0.05', '0.1', '0.5'],width = 0.5, bottom = 0, 
         top = [regr001.score(diabetes_X_test, diabetes_y_test), regr005.score(diabetes_X_test, diabetes_y_test),
         regr01.score(diabetes_X_test, diabetes_y_test), regr05.score(diabetes_X_test, diabetes_y_test),
         regr1.score(diabetes_X_test, diabetes_y_test), regr5.score(diabetes_X_test, diabetes_y_test)], color = 'firebrick')
p2.xaxis.axis_label = 'Learning Rate'
p2.yaxis.axis_label = 'R² Score'

In [16]:
#create a slider and dropdown and define an update function
# Map each learning rate offered in the dropdown to its results table
results_by_lr = {0.001: results001, 0.005: results005, 0.01: results01,
                 0.05: results05, 0.1: results1, 0.5: results5}

def update(iteration, lr):
    # Row iteration-1 holds the fit obtained with max_iter = iteration
    fit = results_by_lr[lr].loc[iteration - 1]
    l.data_source.data['x'] = x_vals
    l.data_source.data['y'] = fit.Coefficient * x_vals + fit.Intercept
    push_notebook()
# Show the resulting figures
show(row(p,p2), notebook_handle = True)
interact(update, iteration = (1,1999,1), lr=[0.001,0.005,0.01,0.05,0.1,0.5])

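Select each learning rate from the drop-down menu and use the slider to explore how the predicted line (green) changes as the iteration cap increases. Note that the bar chart plots the value returned by `score`, which is the coefficient of determination (R²) on the test set, not a residual sum of squares (RSS). Use both plots to answer the following questions: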

1. What do you notice when you vary the learning rate?

2. Did some learning rates perform better than others?

3. Try re-running the entire notebook. Do you get the same results? Why or why not?
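When reading the bar chart, it helps to know how the plotted score relates to the RSS. For scikit-learn regressors, `score` returns R² = 1 - RSS / TSS, where TSS is the total sum of squares of the test targets around their mean. The sketch below checks this relationship for the closed-form `LinearRegression` model on the same positional split; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn import datasets, linear_model

# Same feature and positional split as the notebook
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

model = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residual sum of squares and total sum of squares on the test set
rss = np.sum((y_test - y_pred) ** 2)
tss = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - rss / tss

print(f"RSS          = {rss:.1f}")
print(f"R² (manual)  = {r2_manual:.3f}")
print(f"R² (score)   = {model.score(X_test, y_test):.3f}")
```

A lower RSS is better, whereas a higher R² is better (1.0 is a perfect fit, and values can be negative for a model that does worse than predicting the mean).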