## A "Hello World" Example of Machine Learning - Revisit

Loading the Iris dataset from scikit-learn. 

The first column represents sepal length, the second sepal width, the third petal length, and the fourth petal width of the flower samples. The classes (species) are already encoded as integer labels, where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.

Here, we use only two features: the third and fourth columns (petal length and petal width).

In [13]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()  # load the built-in Iris dataset (an object with .data and .target fields)
#iris.data

In [14]:
X = iris.data[:, [2, 3]]  # use only two features: petal length and petal width

In [15]:
y = iris.target  # integer class labels, stored in the .target field



In [16]:
print('Class labels:', np.unique(y))  # np.unique returns the distinct values found in the target array

Class labels: [0 1 2]


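Many scikit-learn linear classifiers, including the Perceptron used below, support multi-class classification via the One-versus-Rest (OvR) method: one binary classifier is trained per class, and the class whose classifier gives the highest score is predicted.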
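To make the OvR idea concrete, here is a minimal sketch that fits a Perceptron on the two petal features and inspects the fitted model. It assumes the same hyperparameters used later in this notebook; the names `clf`, `scores`, and `manual_pred` are introduced only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length and petal width
y = iris.target

clf = Perceptron(max_iter=100, eta0=0.1, random_state=42)
clf.fit(X, y)

# With three classes, OvR yields one weight vector per class:
# coef_ has one row per class and one column per feature.
print(clf.coef_.shape)

# decision_function returns one score per (sample, class) pair;
# the predicted class is the one with the highest score.
scores = clf.decision_function(X)
manual_pred = clf.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_pred, clf.predict(X)))
```

Each row of `coef_` can be read as a separate "this class versus the rest" binary classifier.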


Splitting data into 70% training and 30% test data:

In [17]:
from sklearn.model_selection import train_test_split

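X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
# train_test_split shuffles X and y together and splits them into two parts;
# stratify=y keeps the class proportions the same in both parts.
# The model is trained only on X_train and y_train.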

In [18]:
print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

# np.bincount counts how many times each integer label (0, 1, 2) occurs.



Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
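The following sketch illustrates what `np.bincount` returns and what `stratify=y` contributes to the split. The toy array and the unstratified comparison split are assumptions added for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Index i of the result holds the number of times the value i occurs.
toy_counts = np.bincount(np.array([0, 1, 1, 2, 2, 2]))
print(toy_counts)  # [1 2 3]

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# Same split as in the notebook: class proportions are preserved.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Without stratify, the per-class counts in the test set are left to chance.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=1)

print('stratified:  ', np.bincount(y_test_strat, minlength=3))
print('unstratified:', np.bincount(y_test_plain, minlength=3))
```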


### Standardizing the features:

In [19]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # rescale each feature to zero mean and unit standard deviation
sc.fit(X_train)  # estimate the mean and standard deviation from the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# The scaler is fit on X_train only and then applied to both X_train and X_test,
# so no information from the test set is used during training.
# y_train and y_test are class labels, not input features, so they are not scaled.
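As a rough check of what `StandardScaler` computes, the sketch below reproduces the transformation by hand using the training mean and standard deviation. The variable names `mu`, `sigma`, and `manual_test_std` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)

# Per-feature statistics estimated from the training data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0)

# The test set is scaled with the *training* mean and standard deviation.
manual_test_std = (X_test - mu) / sigma
print(np.allclose(manual_test_std, sc.transform(X_test)))

# The standardized training features have mean ~0 and standard deviation ~1.
X_train_std = sc.transform(X_train)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```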

In [20]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=0.1,
           fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=42, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

### Test the model with the hold-out test set

In [21]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

Misclassified samples: 2


In [22]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Accuracy: 0.9555555555555556


In [23]:
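# Note: these samples are passed to the model unscaled, although ppn was trained on
# standardized features; applying sc.transform(X_new) first would match the training setup.
X_new = [[1.1, 0.2], [0.4, 1.9], [1.4, 0.2]]
y_new = ppn.predict(X_new)
y_new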

array([1, 2, 1])
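Because the perceptron was trained on standardized features, new samples would normally go through the same fitted scaler before prediction. The sketch below shows one way to do that, assuming the new values are raw petal length and petal width in centimetres; `X_new_std` is a name introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)
ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(sc.transform(X_train), y_train)

# Scale the new samples with the scaler fitted on the training data,
# then predict on the scaled values.
X_new = np.array([[1.1, 0.2], [0.4, 1.9], [1.4, 0.2]])
X_new_std = sc.transform(X_new)
y_new = ppn.predict(X_new_std)

print(y_new)
print(iris.target_names[y_new])  # map integer labels back to species names
```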

### Evaluate the model using cross validation

In [24]:
from sklearn.model_selection import cross_val_score
cross_val_score(ppn, X_train_std, y_train, cv=4,
                scoring="accuracy")  # cv=4 means 4-fold cross-validation

array([0.88888889, 0.48148148, 0.7037037 , 0.95833333])

Result from an earlier run, for comparison: array([0.81481481, 0.81481481, 0.81481481, 0.875     ])
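To show what `cross_val_score` does with `cv=4`, the sketch below repeats the procedure by hand: the training data is split into four stratified folds, and a fresh copy of the model is trained on three folds and scored on the fourth. It assumes the default behaviour for classifiers (a `StratifiedKFold` without shuffling); `manual_scores` and `cv_scores` are names introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
X_train_std = StandardScaler().fit_transform(X_train)

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=4).split(X_train_std, y_train):
    model = clone(ppn)  # untrained copy with the same hyperparameters
    model.fit(X_train_std[train_idx], y_train[train_idx])
    fold_pred = model.predict(X_train_std[val_idx])
    manual_scores.append(accuracy_score(y_train[val_idx], fold_pred))
manual_scores = np.array(manual_scores)

cv_scores = cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")
print('manual loop:    ', manual_scores)
print('cross_val_score:', cv_scores)
```

Each fold score is computed on a small validation set, so a handful of misclassified samples can move a single fold's accuracy considerably.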

### Exercise 1: What if we use all four features to train the model? Are the results better? Why? 

X = iris.data  # all four features; iris.data[:, [0, 3]] would select only sepal length and petal width

With the two petal features the classes are already well separated, so adding more features does not necessarily improve the accuracy.
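One way to check this empirically is to run the same perceptron pipeline on both feature sets and compare the hold-out accuracy. The helper `evaluate_perceptron` below is hypothetical (it is not part of the original notebook) and simply reuses the split and hyperparameters from above; the resulting numbers are not claimed here and may differ between scikit-learn versions.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()


def evaluate_perceptron(columns):
    """Hypothetical helper: train on the given feature columns, return test accuracy."""
    X, y = iris.data[:, columns], iris.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1, stratify=y)
    sc = StandardScaler().fit(X_train)
    ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
    ppn.fit(sc.transform(X_train), y_train)
    return accuracy_score(y_test, ppn.predict(sc.transform(X_test)))


acc_two = evaluate_perceptron([2, 3])         # petal length and petal width
acc_four = evaluate_perceptron([0, 1, 2, 3])  # all four features
print('Two features: ', acc_two)
print('Four features:', acc_four)
```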

### Exercise 2: Try scikit-learn's stochastic gradient descent classifier instead of the perceptron. Use cross-validation to evaluate how the model performs in terms of accuracy with two features and with four features.

In [25]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)
sgd.fit(X_train_std, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [26]:
y_pred = sgd.predict(X_test_std)

In [27]:
print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Accuracy: 1.0


In [28]:
cross_val_score(sgd, X_train_std, y_train, cv=4,
                scoring="accuracy")

array([1.        , 0.92592593, 0.7037037 , 0.91666667])
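The cells above cover only the two-feature case. A possible sketch for the full comparison is shown below. It wraps the scaler and the classifier in a pipeline, which is a deviation from the notebook: the scaler is then re-fit inside every cross-validation fold rather than once on the whole training set. The dictionary `feature_sets` and the result names are introduced here for illustration, and no particular scores are claimed.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
feature_sets = {
    'two features': [2, 3],         # petal length and petal width
    'four features': [0, 1, 2, 3],  # all columns
}

results = {}
for name, columns in feature_sets.items():
    X, y = iris.data[:, columns], iris.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1, stratify=y)
    # Scaling and classification are cross-validated together.
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
    results[name] = cross_val_score(model, X_train, y_train, cv=4,
                                    scoring="accuracy")

for name, fold_scores in results.items():
    print(name, fold_scores, 'mean:', fold_scores.mean())
```

Comparing the mean and the spread of the fold scores gives a steadier picture than a single hold-out accuracy.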