### sklearn / scikit-learn library
The scikit-learn (sklearn) library should already be installed, since it is included in the Anaconda distribution.

If it is not, open the CMD.exe Prompt from Anaconda. From this prompt you can use the conda package manager:

- conda install -c anaconda scikit-learn
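
To check from Python whether the package is available, a minimal sketch like the following may help. It uses only the standard library; the helper name `is_installed` is hypothetical, and the package is looked up under its import name `sklearn`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("sklearn"):
    import sklearn
    print("scikit-learn version:", sklearn.__version__)
else:
    print("scikit-learn not found; install it with the conda command above")
```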

Scikit-learn has a few ready-to-use datasets.

The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets: 

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

This module has a function load_iris() that loads the Iris dataset.

By the way, by Python naming convention, function and method names use lowercase words separated by underscores to improve readability.

This function returns a Bunch, a dictionary-like object that contains, among others:
- data: {ndarray, dataframe} of shape (150, 4) - The data matrix.
- target: {ndarray, Series} of shape (150,) - The classification target.
- feature_names: list - The names of the dataset columns.
- target_names: list - The names of target classes.

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
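
As a small sketch of what "dictionary-like" means in practice, the entries listed above can be read either with a key or as an attribute. The example also uses the `return_X_y=True` option of `load_iris()`, which returns the data and target directly instead of a Bunch. The variable names are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both dictionary-style and attribute-style access.
same_object = iris["data"] is iris.data
print(same_object)
print(sorted(iris.keys()))

# return_X_y=True skips the Bunch and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```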


Iris dataset:
- introduced by Ronald Fisher in 1936. It contains measurements of 150 Iris flowers (sepal length, sepal width, petal length, petal width) from 3 categories of Iris flowers: setosa, versicolor, virginica.
- it is a commonly used dataset in machine learning tutorials
- Number of Instances: 150 (50 in each of three classes)
- 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
- 'target_names': array(['setosa', 'versicolor', 'virginica'])
- the dependent variable (target) is categorical

For a labelled dataset with a categorical target, we can use classification.
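
The class balance stated above (50 instances in each of three classes) can be checked with a short sketch. It assumes NumPy's `bincount`, which counts how often each non-negative integer label occurs.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many instances carry each numeric label (0, 1, 2).
counts = np.bincount(iris.target)

for name, count in zip(iris.target_names, counts):
    print(name, count)
```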

### NumPy
Scikit-learn uses the NumPy ndarray data structure. ndarray is shorthand for N-dimensional array. Common operations on ndarrays:
- Changing the shape of an array: https://numpy.org/doc/stable/user/quickstart.html#shape-manipulation
    - ndarray.shape - attribute that holds the shape of the array
    - ndarray.reshape(rows, cols) - returns the array with a new shape
- Flattening an array: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html?highlight=flatten#numpy.ndarray.flatten
    - ndarray.flatten() - returns a copy of the array collapsed into one dimension
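
A minimal sketch of the shape operations listed above, using a small made-up array rather than the Iris data:

```python
import numpy as np

a = np.arange(6)       # 1-D array: [0 1 2 3 4 5]
b = a.reshape(2, 3)    # same values arranged as 2 rows, 3 columns
c = b.flatten()        # 1-D copy of b

print(a.shape, b.shape, c.shape)  # (6,) (2, 3) (6,)

# flatten() returns a copy, so changing c does not change b.
c[0] = 99
print(b[0, 0])  # 0
```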



In [1]:
# import the load_iris function from the datasets module
from sklearn.datasets import load_iris

iris_dataset = load_iris() # load the Iris dataset that ships with scikit-learn

print(type(iris_dataset)) # it is a Bunch: a dictionary-like object (key: value data structure)
print(iris_dataset)

<class 'sklearn.utils.Bunch'>
{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
    

In [2]:
iris_feature_names = iris_dataset['feature_names']
print(iris_feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [3]:
iris_target_names = iris_dataset['target_names']
print(iris_target_names)

['setosa' 'versicolor' 'virginica']


In [4]:
# X_iris - input data
X_iris = iris_dataset['data']

print(type(X_iris)) # it is a NumPy ndarray
print(X_iris) # 2-dimensional array

<class 'numpy.ndarray'>
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3

In [5]:
X_iris.shape # display the ndarray shape: 150 instances (rows), 4 features (columns)

(150, 4)

In [6]:
# Y_iris - output data
Y_iris = iris_dataset['target']

print(type(Y_iris))
print(Y_iris)

# Machine learning algorithms work with numbers, so the classes are encoded as integers.
# The values 0, 1, 2 correspond to the class names 'setosa', 'versicolor', 'virginica'.

<class 'numpy.ndarray'>
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
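
To translate the numeric labels back into class names, one option is to index the `target_names` array with the `target` array. This is a sketch that relies on NumPy integer-array indexing; the variable name `names` is only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# target_names is an ndarray, so it can be indexed with the array of labels.
names = iris.target_names[iris.target]

print(names[0], names[50], names[100])  # setosa versicolor virginica
```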


In [7]:
Y_iris.shape # 150 instances. The number of input rows (instances) and the number of target values must be the same.

(150,)

# SVC class

Following the scikit-learn chart: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

- are there more than 50 instances/records of data? Yes
- predicting a category? Yes
- labelled data? Yes
- fewer than 100k samples? Yes
- choose the linear SVC estimator (here linear means kernel='linear'; we will pass this parameter to the constructor)


https://scikit-learn.org/stable/modules/svm.html#classification

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

By Python naming convention, class names use the CapWords (CamelCase) style.
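
The two naming conventions mentioned in this notebook can be seen side by side in a tiny sketch. The class `FlowerSample` and the function `describe_sample` are hypothetical names made up for illustration; they are not part of scikit-learn.

```python
class FlowerSample:                      # class name: CapWords
    def __init__(self, sepal_length, sepal_width):
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width

    def sepal_area(self):                # method name: lowercase with underscores
        return self.sepal_length * self.sepal_width


def describe_sample(sample):             # function name: lowercase with underscores
    return f"sepal area: {sample.sepal_area():.2f}"


print(describe_sample(FlowerSample(5.1, 3.5)))
```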

Methods:
- fit(X, y[, sample_weight]) - Fit the SVM model according to the given training data.
- predict(X) - Perform classification on samples in X.
- score(X, y[, sample_weight]) - Return the mean accuracy on the given test data and labels.


The constructor has many parameters (hyperparameters), all with default values:

sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

Usually the following hyperparameters are changed to tune the model:
- C - regularization parameter, by default C=1.0. The strength of the regularization is inversely proportional to C.
- kernel - by default kernel='rbf', which stands for radial basis function. Other commonly used options: 'linear', 'poly' (polynomial).
- degree - only used with the polynomial kernel
- gamma - kernel coefficient for the 'rbf', 'poly' and 'sigmoid' kernels
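
One possible way to tune these hyperparameters is a grid search with cross-validation. The sketch below uses `GridSearchCV` from `sklearn.model_selection`; the candidate values in `param_grid` are arbitrary assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid: each dict lists only the parameters relevant to that kernel.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1.0, 10.0]},
    {"kernel": ["rbf"], "C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1]},
    {"kernel": ["poly"], "C": [1.0], "degree": [2, 3]},
]

# Try every combination with 5-fold cross-validation and keep the best one.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best combination depends on the data and on the grid, so the printed values are only meaningful for this particular setup.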



In [8]:
from sklearn.svm import SVC # import the SVC class

In [9]:
# create the model. It is a classification model
# pass parameters to the SVC constructor. We use kernel='linear', as suggested by the scikit-learn algorithm cheat-sheet

model = SVC(kernel='linear') # create an object of class SVC

In [10]:
# train the model using the training dataset. Use X data as input and Y data as target
# the fit() method trains the model

model.fit(X_iris, Y_iris)

SVC(kernel='linear')

In [11]:
print(model.get_params()) # you can display all parameters if you want to check them

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}


In [12]:
# predict the result

# We do not have a separate test dataset. Normally we should have one.
# We could split the original dataset into training and testing parts (typically in a 70/30 proportion), but the dataset is already small.
# Instead, we reuse the same dataset that we used for training, so the result tends to overestimate how well the model generalizes.

Y_predicted = model.predict(X_iris)

print(Y_predicted)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
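
The 70/30 split mentioned in the comments of the cell above could look like the following sketch. It uses `train_test_split` from `sklearn.model_selection`; the `random_state=0` seed and the `stratify=y` option (which keeps the class proportions in both parts) are assumptions added for reproducibility, not choices made in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the instances for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = SVC(kernel="linear")
model.fit(X_train, y_train)          # train only on the training part

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)

# Accuracy on instances the model has not seen during training.
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```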


In [13]:
# We can visually compare the predicted data with the target data
print(Y_iris)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [14]:
# We can use metrics.
# The score(X, y) method returns the mean accuracy
# X - test samples
# y - true labels for X

# we do not have a test dataset, so we use the training dataset

accuracy = model.score(X_iris, Y_iris)
print(accuracy)

# Accuracy is about 0.99 (149 of 150 instances classified correctly)

0.9933333333333333
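
Accuracy is the fraction of instances whose predicted label equals the true label, so it can be recomputed by hand as a check. The sketch below retrains the same kind of model so that it is self-contained; `np.flatnonzero` is used to list the positions of the misclassified instances.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel="linear").fit(X, y)
y_pred = model.predict(X)

# Fraction of predictions that match the true labels.
manual_accuracy = np.mean(y_pred == y)

# Indices of the instances that were predicted wrongly.
wrong = np.flatnonzero(y_pred != y)

print(manual_accuracy, model.score(X, y))
print(wrong)
```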


In [15]:
# We can also use a confusion matrix to check which instances were wrongly predicted

from sklearn.metrics import confusion_matrix

# Note: the predictions are passed first here, so rows are predicted classes and columns are true classes.
# scikit-learn documents the argument order as confusion_matrix(y_true, y_pred), which would give the transposed matrix.
confusion_matrix_result = confusion_matrix(Y_predicted, Y_iris)
print(confusion_matrix_result)

# The confusion matrix uses the same label order: 'setosa', 'versicolor', 'virginica'.
# 3 classes - the matrix is 3x3
# Columns represent true labels. Rows represent predicted labels.
# Setosa and virginica are predicted correctly.
# One instance of versicolor was wrongly predicted as virginica.

[[50  0  0]
 [ 0 49  0]
 [ 0  1 50]]
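
The cell above passes the predictions as the first argument. The sketch below uses small hand-made label arrays (assumed data, not taken from the Iris results) to show how the argument order decides what rows and columns mean: with the documented order `confusion_matrix(y_true, y_pred)`, rows are true classes and columns are predicted classes, and swapping the arguments transposes the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])   # one class-1 sample predicted as class 2

cm = confusion_matrix(y_true, y_pred)          # rows: true, columns: predicted
cm_swapped = confusion_matrix(y_pred, y_true)  # rows: predicted, columns: true

print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
print(cm_swapped)
# [[2 0 0]
#  [0 1 0]
#  [0 1 2]]
```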


In [16]:
# As you can see, it takes just a few lines of code to build a machine learning model :)

# Here is the full necessary code without comments, besides this one :)

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

iris_dataset = load_iris()
X_iris = iris_dataset['data']
Y_iris = iris_dataset['target']

model = SVC(kernel='linear')
model.fit(X_iris, Y_iris)

Y_predicted = model.predict(X_iris)

accuracy = model.score(X_iris, Y_iris)
confusion_matrix_result = confusion_matrix(Y_predicted, Y_iris)

print(accuracy)
print(confusion_matrix_result)

0.9933333333333333
[[50  0  0]
 [ 0 49  0]
 [ 0  1 50]]
