In [1]:
from sklearn.datasets import load_iris

In [2]:
(train_input, train_output) = load_iris(return_X_y = True)

Loaded Fisher's iris dataset, which consists of 150 samples. The inputs are sepal length and width and petal length and width (all in cm), and the output is the iris species (Iris setosa, Iris versicolor, or Iris virginica), encoded as the integers 0, 1, 2. The three classes are equally represented (50 samples of each) in the dataset.
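As a quick check of the description above, the following sketch loads the same dataset as a Bunch object (rather than with return_X_y = True) so that the feature and species names can be inspected. The variable names (iris, features, labels, class_counts) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the dataset as a Bunch object so the metadata is available too.
iris = load_iris()
features, labels = iris.data, iris.target

print(iris.feature_names)  # names of the four input columns
print(iris.target_names)   # species names, indexed by the integer label

# Count how many samples carry each integer label.
class_counts = np.bincount(labels)
print(class_counts)
```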

In [3]:
train_input.shape

(150, 4)

In [4]:
train_output.shape

(150,)

In [5]:
train_input[0]

array([5.1, 3.5, 1.4, 0.2])

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

train_input = scaler.fit_transform(train_input)

In [7]:
train_input.mean(axis = 0)

array([-1.69031455e-15, -1.84297022e-15, -1.69864123e-15, -1.40924309e-15])

In [8]:
train_input.std(axis = 0)

array([1., 1., 1., 1.])

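Centered and scaled each of the input features to zero mean and unit variance using a sklearn.preprocessing.StandardScaler object. The column means above are of order 1e-15 rather than exactly zero because of floating-point rounding.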
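To make the transformation concrete, here is a minimal sketch comparing StandardScaler with the formula (x - mean) / std computed by hand. It assumes the population standard deviation (NumPy's default, ddof = 0); the names raw_input, scaled and manual are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

raw_input, _ = load_iris(return_X_y=True)

scaler = StandardScaler()
scaled = scaler.fit_transform(raw_input)

# Same transformation by hand: subtract the column mean, then divide by
# the column standard deviation (population version, ddof=0).
manual = (raw_input - raw_input.mean(axis=0)) / raw_input.std(axis=0)
matches_manual = np.allclose(scaled, manual)
print(matches_manual)

# The fitted scaler keeps the statistics, so the scaling can be undone.
recovered = scaler.inverse_transform(scaled)
```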

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_input, test_input, train_output, test_output = \
    train_test_split(train_input, train_output, test_size = .33)

In [11]:
train_input.shape

(100, 4)

In [12]:
test_input.shape

(50, 4)

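Split the data into 100 training samples and 50 testing samples using the train_test_split helper function from sklearn.model_selection. Because no 'random_state' is passed, the split, and hence the scores reported below, will vary from run to run.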
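One possible variation, sketched here under assumptions that are not in the notebook: fix random_state for a reproducible split, pass stratify so that each class keeps roughly the same share in both parts, and fit the scaler on the training part only so that test samples do not influence the scaling statistics. The seed value 0 and all variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Reproducible, class-balanced split (seed chosen arbitrarily).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, stratify=y, random_state=0)

# Fit the scaling statistics on the training part only, then apply the
# same transformation to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
print(np.bincount(y_test))  # test samples per class
```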

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

In [14]:
model1 = LogisticRegression()

In [15]:
model1.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [16]:
model1.fit(train_input, train_output)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
model1.score(test_input, test_output)

0.96

LogisticRegression fits a linear classifier by minimizing a binary or categorical cross-entropy loss. For multiclass problems, the 'multi_class' parameter selects a one-vs-rest (= 'ovr') or categorical (= 'multinomial') approach. The default setting, multi_class = 'auto', uses the categorical approach for multiclass data unless the solver is 'liblinear', in which case it falls back to one-vs-rest.
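To illustrate the categorical approach, the sketch below checks that the class probabilities of a default LogisticRegression on three-class data are the softmax of its linear scores. It assumes the default 'lbfgs' solver and scikit-learn 0.22 or later; the names clf, scores and softmax are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

clf = LogisticRegression().fit(X, y)

# Linear scores, one column per class: X @ coef_.T + intercept_
scores = clf.decision_function(X)

# Softmax over the class axis (shifted by the row maximum for stability).
shifted = scores - scores.max(axis=1, keepdims=True)
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

probabilities = clf.predict_proba(X)
print(np.allclose(softmax, probabilities))
```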

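The loss function is further specified by the 'penalty' parameter, which selects the regularization term (L^2 by default), and the 'C' parameter, which sets the inverse regularization strength. Smaller 'C' means stronger regularization, so 'C' plays the role of the reciprocal of the 'alpha' parameter in the SGDClassifier class described below (up to a factor of the number of samples). One could also use L^1 regularization ('penalty' = 'l1') or a combination of L^2 and L^1 regularization ('penalty' = 'elasticnet', with the mix set by the 'l1_ratio' parameter), provided the chosen solver supports it; the default 'lbfgs' solver supports only L^2 or no penalty.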
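The effect of 'C' can be seen in a small sketch: with the default L^2 penalty, a smaller 'C' should shrink the fitted coefficients toward zero. The three values of C and the raised max_iter are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # max_iter is raised so that weakly regularized fits can converge.
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Overall size of the fitted weight matrix (Frobenius norm).
    coef_norms[C] = np.linalg.norm(clf.coef_)

for C, norm in coef_norms.items():
    print(C, norm)
```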

Having specified the loss function, we must also choose an optimization algorithm via the 'solver' parameter. The available solvers are gradient-based, coordinate-descent, or (quasi-)Newton methods (e.g. the default L-BFGS algorithm, 'lbfgs'). The 'max_iter' parameter caps the number of solver iterations, and the 'tol' parameter sets the convergence tolerance at which the solver stops before reaching that cap.
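As a rough illustration of how 'tol' and 'max_iter' interact, the sketch below fits the same model with a tight and a loose tolerance and reads the number of iterations actually used from the fitted n_iter_ attribute. The tolerance values and the cap of 200 are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Same solver and iteration cap, different convergence tolerances.
tight = LogisticRegression(solver='lbfgs', max_iter=200, tol=1e-6).fit(X, y)
loose = LogisticRegression(solver='lbfgs', max_iter=200, tol=1e-1).fit(X, y)

# n_iter_ is an array; take its maximum as the iteration count.
tight_iterations = int(tight.n_iter_.max())
loose_iterations = int(loose.n_iter_.max())
print(tight_iterations, loose_iterations)
```

If a fit reaches max_iter before meeting the tolerance, scikit-learn emits a ConvergenceWarning rather than raising an error.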

Performance of the model is reported by the .score() method, which returns the mean classification accuracy on an (input, output) collection of data.
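The accuracy returned by .score() can be reproduced by hand as the fraction of correct predictions, as in this short sketch. The fixed random_state and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# score(X, y) takes the inputs first, then the true outputs.
accuracy_from_score = clf.score(X_test, y_test)

# Fraction of test samples whose predicted label matches the true label.
accuracy_by_hand = np.mean(clf.predict(X_test) == y_test)
print(accuracy_from_score, accuracy_by_hand)
```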

In [18]:
# Note: 'log' was renamed 'log_loss' in scikit-learn 1.1 and removed in 1.3.
model2 = SGDClassifier(loss = 'log')

In [19]:
model2.get_params()

{'alpha': 0.0001,
 'average': False,
 'class_weight': None,
 'early_stopping': False,
 'epsilon': 0.1,
 'eta0': 0.0,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'optimal',
 'loss': 'log',
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'n_jobs': None,
 'penalty': 'l2',
 'power_t': 0.5,
 'random_state': None,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [20]:
model2.fit(train_input, train_output)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [21]:
model2.score(test_input, test_output)

0.94

SGDClassifier applies stochastic gradient descent to a given loss function.  

The optimization objective is specified via the 'loss' parameter, providing the loss function, the 'penalty' parameter, providing the regularization term, and the 'alpha' parameter, providing the regularization coefficient. By default, we have a linear SVM loss function ('loss' = 'hinge') with L^2 regularization ('penalty' = 'l2') and regularization coefficient 'alpha' = .0001. We set 'loss' = 'log' to use the usual binary cross-entropy (applied one-vs-rest on our multiclass dataset). One could also use L^1 regularization ('penalty' = 'l1') or a combination of L^2 and L^1 regularization ('penalty' = 'elasticnet'), with the mix set by the 'l1_ratio' parameter.

Next, we have parameters governing the stochastic gradient descent algorithm. In particular, we specify a learning rate schedule (constant or adaptive) using various parameters ('learning_rate', 'eta0', 'power_t') and cap the number of passes (epochs) over the data set via the 'max_iter' parameter. Early stopping criteria can also be specified (via e.g. the 'early_stopping', 'n_iter_no_change', 'validation_fraction' and 'tol' parameters).

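Notice that there is no specification of batch size in the parameter list. The .fit() method always performs stochastic gradient descent with single-sample updates: the gradient is estimated, and the weights updated, one training sample at a time. To train incrementally over subsamples of the data (for example, chunks that do not fit in memory), use the .partial_fit() method, which makes a single pass over the samples it is given. Related parameters include 'shuffle' (whether to reshuffle the training data after each epoch), 'random_state' (the seed for that shuffling) and 'warm_start' (whether a repeated call to .fit() starts from the previous solution).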

As with the LogisticRegression class, performance of the model is reported by the .score() method, which returns the mean classification accuracy on an (input, output) collection of data.