## A "Hello World" Example of Machine Learning - Revisited

Loading the Iris dataset from scikit-learn. 

The first column represents the sepal length, the second column the sepal width, the third column the petal length, and the fourth column the petal width of the flower samples. The classes (species) are already converted to integer labels, where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.

Here, we are using only two features: the first and fourth columns (sepal length and petal width).

In [1]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
#iris.data

In [2]:
X = iris.data[:, [0, 3]]

In [3]:
y = iris.target

In [4]:
print('Class labels:', np.unique(y))

Class labels: [0 1 2]
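
As a quick check of the column description above, the dataset object itself carries the feature and class names. The sketch below only reads attributes of the object returned by `load_iris()`; the variable name `selected` is introduced here for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Column names and class names as stored in the dataset object.
print(iris.feature_names)
print(iris.target_names)

# Columns 0 and 3 are the two features used in this section.
selected = [iris.feature_names[i] for i in (0, 3)]
print(selected)
```

The position of a name in `target_names` corresponds to the integer label in `iris.target`.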


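Many scikit-learn classifiers, including the Perceptron and SGDClassifier used below, support multi-class classification via the One-versus-Rest (OvR) method: one binary classifier is trained per class, and the class with the highest score is predicted.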
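
To make the OvR idea concrete, the following minimal sketch fits a perceptron on the two selected features (unscaled, for brevity) and inspects the fitted weights. It assumes the same hyperparameters used later in this section; the names `clf`, `scores`, and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron

iris = datasets.load_iris()
X = iris.data[:, [0, 3]]
y = iris.target

clf = Perceptron(max_iter=100, eta0=0.1, random_state=42).fit(X, y)

# One row of weights (and one intercept) per class: three binary
# "this class vs. the rest" classifiers.
print(clf.coef_.shape)
print(clf.intercept_.shape)

# predict() returns the class whose binary classifier gives the highest score.
scores = clf.decision_function(X)
manual_pred = clf.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_pred, clf.predict(X)))
```

With three classes and two features, the weight matrix has three rows and two columns.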

Splitting data into 70% training and 30% test data:

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

In [6]:
print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
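
The equal class counts above come from `stratify=y`. As a rough illustration, the sketch below repeats the split with and without stratification; the variable names with `_s` and `_r` suffixes are introduced here, and the counts of the non-stratified split are not guaranteed to be balanced.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [0, 3]], iris.target

# Stratified split: class proportions of y are preserved in both parts.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Plain random split: class counts depend on the shuffle.
_, _, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.3, random_state=1)

print('Stratified train/test:', np.bincount(y_train_s), np.bincount(y_test_s))
print('Random train/test:    ', np.bincount(y_train_r, minlength=3),
      np.bincount(y_test_r, minlength=3))
```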


### Standardizing the features

In [7]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # rescale each feature to zero mean and unit standard deviation
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
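
Standardization computes z = (x - mean) / std for each feature, where the mean and standard deviation are estimated from the training data only and then reused for the test data. The sketch below reproduces the transform by hand to show this; `mu`, `sigma`, and `X_test_manual` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [0, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)

# Per-feature statistics of the training set (population std, ddof=0).
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Manual standardization of the test set with the training statistics.
X_test_manual = (X_test - mu) / sigma
print(np.allclose(X_test_manual, sc.transform(X_test)))

# The standardized training set has mean ~0 and std ~1 per feature.
X_train_std = sc.transform(X_train)
print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

Note that the standardized test set will generally not have exactly zero mean, because it is scaled with the training statistics.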

In [8]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=0.1,
           fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=42, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

### Test the model with the hold-out test set

In [9]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

Misclassified samples: 10


In [10]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Accuracy: 0.7777777777777778
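
The accuracy is simply the fraction of test samples classified correctly, so it can be recovered from the misclassification count. A small worked check using the numbers reported above (45 test samples, 10 misclassified):

```python
n_test = 45            # 30% of 150 samples
n_misclassified = 10   # from the output above

# Accuracy = correctly classified samples / all test samples.
accuracy = (n_test - n_misclassified) / n_test
print(accuracy)
```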


In [11]:
X_new = [[1.1, 0.2],[0.4, 1.9], [1.4, 0.2]]  # BAD: raw values passed to a model trained on standardized features, so the predictions are unreliable
y_new = ppn.predict(X_new)
y_new

array([2, 2, 2])
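
Because the model was trained on standardized features, new samples should pass through the same fitted scaler before prediction. The following is a minimal sketch of that step; the measurements in `X_new` (sepal length and petal width in cm) are hypothetical values chosen for illustration, and no particular prediction is claimed for them.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [0, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)
ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(sc.transform(X_train), y_train)

# Hypothetical new flowers: [sepal length (cm), petal width (cm)].
X_new = np.array([[5.1, 0.2], [6.0, 1.3], [6.9, 2.3]])

# Apply the scaler fitted on the training data, then predict.
X_new_std = sc.transform(X_new)
y_new = ppn.predict(X_new_std)
print(y_new)
```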

### Evaluate the model using cross validation

In [12]:
from sklearn.model_selection import cross_val_score
cross = cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy") # four folds, produces accuracy scores
cross

array([0.88888889, 0.76923077, 0.53846154, 0.80769231])

In [13]:
import statistics
statistics.mean(cross)

0.7510683760683761
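
In the cells above, the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. One possible alternative, sketched here under the same hyperparameters, is to wrap the scaler and the model in a pipeline so that scaling is refitted inside every fold. The name `pipe` is illustrative, and the resulting scores may differ slightly from those shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [0, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# The scaler is fitted only on the training part of each fold.
pipe = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=100, eta0=0.1, random_state=42))

scores = cross_val_score(pipe, X_train, y_train, cv=4, scoring="accuracy")
print(scores)
print('mean:', scores.mean(), 'std:', scores.std())
```

Since `cross_val_score` returns a NumPy array, `scores.mean()` can be used in place of `statistics.mean`.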

Note: a separate run with two features gave cross-validation accuracy scores of array([0.88888889, 0.48148148, 0.7037037 , 0.95833333]), which differ from the output above.

### Exercise 1: Use all four features to train the model and use cross-validation to check whether the results are better. Briefly explain why.

In [14]:
X = iris.data
y = iris.target

Splitting data into 70% training and 30% test data:

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

Standardizing the features

In [16]:
sc = StandardScaler() 
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [17]:
ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=0.1,
           fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=42, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

In [18]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Misclassified samples: 3
Accuracy: 0.9333333333333333


In [19]:
cross = cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")
cross

array([0.92592593, 0.88461538, 0.84615385, 0.88461538])

In [20]:
statistics.mean(cross)

0.8853276353276354

In this experiment, using more features produced a better perceptron model. With only two features (columns 0 and 3), the test accuracy was 0.778 (10 of 45 samples misclassified). With all four features, it rose to 0.933 (3 of 45 misclassified). More features do not guarantee a better model in general; the likely reason here is that the two additional columns (sepal width and petal length) carry information that helps separate the classes.

The cross-validation results show the same pattern. With two features, the mean cross-validation accuracy was 0.751 ([0.88888889, 0.76923077, 0.53846154, 0.80769231]). With all four features, it was 0.885 ([0.92592593, 0.88461538, 0.84615385, 0.88461538]). The fold scores were also less spread out with four features (0.85 to 0.93, versus 0.54 to 0.89 with two).

The number of folds used in cross-validation also matters. Each fold must be large enough to represent the dataset as a whole. With cv=4 on 105 training samples, each validation fold holds only about 26 samples, so a single misclassification shifts that fold's score by roughly 4 percentage points. If the folds are too small or unrepresentative, the resulting accuracy estimates will not be reliable.
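
To see how the number of folds affects the size of each validation fold, the sketch below lists the fold sizes for a few values of k. It assumes `StratifiedKFold`, which is what `cross_val_score` uses for a classifier when `cv` is an integer; the dictionary `fold_sizes` is an illustrative name.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

fold_sizes = {}
for k in (3, 4, 5, 10):
    skf = StratifiedKFold(n_splits=k)
    # Number of validation samples in each of the k folds.
    fold_sizes[k] = [len(test_idx) for _, test_idx in skf.split(X_train, y_train)]
    print(k, 'folds ->', fold_sizes[k])
```

More folds mean more training data per fit but smaller validation folds, so each individual fold score becomes noisier.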

### Exercise 2: Try the scikit-learn stochastic gradient descent model instead of the perceptron. Using cross-validation, evaluate how the model performs in terms of accuracy with both two features and all four features.

In [21]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)
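
Note that a default SGDClassifier is not a perceptron trained differently: as the parameter listings below show, it uses hinge loss with an L2 penalty, which corresponds to a linear SVM objective. The sketch below inspects those defaults and, as an assumption for comparison only, builds a perceptron-like configuration (`sgd_perceptron` is an illustrative name, with `eta0=0.1` chosen to mirror the perceptron used earlier).

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier

# Defaults: hinge loss with L2 regularization.
sgd = SGDClassifier(random_state=42)
print(sgd.get_params()["loss"], sgd.get_params()["penalty"])

# Assumed perceptron-like setup: perceptron loss, constant learning
# rate, and no regularization penalty.
sgd_perceptron = SGDClassifier(
    loss="perceptron", learning_rate="constant", eta0=0.1,
    penalty=None, random_state=42)

iris = datasets.load_iris()
sgd_perceptron.fit(iris.data, iris.target)
print(sgd_perceptron.coef_.shape)
```

This difference in loss and regularization is one reason the two models can give different accuracies on the same data.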

Two Features:

In [22]:
X = iris.data[:, [0, 3]] # Only two columns
y = iris.target

Splitting data into 70% training and 30% test data:

In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

Standardizing the features

In [24]:
sc = StandardScaler() 
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [25]:
sgd.fit(X_train_std, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [26]:
y_pred = sgd.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Misclassified samples: 1
Accuracy: 0.9777777777777777


In [27]:
cross = cross_val_score(sgd, X_train_std, y_train, cv=4, scoring="accuracy")
print(cross)
print(statistics.mean(cross))

[0.92592593 0.76923077 0.96153846 0.80769231]
0.8660968660968661


Using two features with the stochastic gradient descent model, the produced accuracy score was 0.9777777777777777. The average of the cross validation scores was 0.8660968660968661 ([0.92592593 0.76923077 0.96153846 0.80769231]).

Four Features:

In [28]:
X = iris.data
y = iris.target

Splitting data into 70% training and 30% test data:

In [29]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

Standardizing the data

In [30]:
sc = StandardScaler() 
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [31]:
sgd.fit(X_train_std, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [32]:
y_pred = sgd.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Misclassified samples: 5
Accuracy: 0.8888888888888888


In [33]:
cross = cross_val_score(sgd, X_train_std, y_train, cv=4, scoring="accuracy")
print(cross)
print(statistics.mean(cross))

[1.         0.96153846 0.84615385 0.88461538]
0.9230769230769231


Using all four features with the stochastic gradient descent model, the produced accuracy score was 0.8888888888888888. The average of the cross validation scores was 0.9230769230769231 ([1.         0.96153846 0.84615385 0.88461538]).

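On the hold-out test set, the stochastic gradient descent model scored lower with four features (0.889) than with two (0.978). One possible explanation is that the noisy SGD updates settle on a slightly different solution in the higher-dimensional feature space. The evidence is mixed, however: the mean cross-validation score moved in the opposite direction, from 0.866 with two features to 0.923 with four. Because the test set holds only 45 samples, the gap amounts to 4 misclassified samples, so it may simply reflect this particular split.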
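
One way to check whether a difference like this is robust is to repeat the hold-out evaluation over several different splits. The sketch below is an assumption-laden illustration rather than part of the original exercise: the helper `holdout_accuracy`, the choice of 20 split seeds, and the use of a pipeline are all introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
y = iris.target
feature_sets = {
    "two features": iris.data[:, [0, 3]],
    "four features": iris.data,
}

def holdout_accuracy(X, y, split_seed):
    """Train on 70% of the data and return accuracy on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=split_seed, stratify=y)
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

results = {}
for name, X in feature_sets.items():
    # Hypothetical choice: 20 different train/test splits.
    accs = np.array([holdout_accuracy(X, y, seed) for seed in range(20)])
    results[name] = accs
    print(f"{name}: mean={accs.mean():.3f}, std={accs.std():.3f}")
```

If the standard deviation across splits is comparable to the gap between the two means, a single split is not enough to rank the two feature sets.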