## Machine learning is about extracting knowledge from data

Example problems where machine learning algorithms can be useful:  
- analyzing DNA sequences
- providing personalized treatments
- finding distant planets
- discovering new particles  

Many tasks can be solved with handcoded rules. An ML program instead builds its rules from data, and it needs a large collection of data to do so.




Types of machine learning:  
**Supervised learning ("learning with a teacher")** - ML algorithms that learn from input/output pairs of data (building the correct output for each input object is often a manual process).  
**Unsupervised learning ("learning without a teacher")** - ML algorithms for which only the input objects are known; no outputs are given.  
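The difference between the two settings shows up directly in what is passed to `fit`. The sketch below is only an illustration: the six toy points and the choice of `KMeans` as the unsupervised algorithm are assumptions made for this example, not part of the notes above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated groups of 2-D points.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # known outputs (the "teacher")

# Supervised: the model is fitted on inputs AND their known outputs.
classifier = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
predicted_label = classifier.predict(np.array([[1.1, 1.0]]))[0]

# Unsupervised: the model is fitted on inputs only and has to find
# structure (here: two clusters) without ever seeing y_toy.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = clusterer.labels_

print(predicted_label)  # a label taken from y_toy
print(cluster_ids)      # arbitrary cluster numbers, not labels from y_toy
```

The cluster numbers carry no meaning of their own; only the grouping of the points does.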

*For both types of ML it is important to have a representation of your data that a computer (and your program) can understand.*  
**Feature extraction (feature engineering)** - the practice of building a good representation of your data.
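As one possible illustration of building such a representation, the sketch below turns raw records with mixed field types into a numeric matrix. The records and their field names are hypothetical, and `DictVectorizer` is just one of several scikit-learn tools that could be used here.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical raw records: one numeric field and one categorical field.
raw_records = [
    {"petal_length_cm": 1.4, "color": "purple"},
    {"petal_length_cm": 4.7, "color": "blue"},
    {"petal_length_cm": 6.0, "color": "purple"},
]

# Numeric fields are kept as they are; each category value becomes
# its own 0/1 column (one-hot encoding).
vectorizer = DictVectorizer(sparse=False)
features = vectorizer.fit_transform(raw_records)

print(vectorizer.feature_names_)  # column names of the numeric matrix
print(features)                   # one row per record, one column per feature
```

The resulting array has the rows-are-samples, columns-are-features layout that scikit-learn estimators expect.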

Each algorithm differs in the kind of data and the problem setting it works best for. While building a model, keep the following questions in mind:  
- What *question* am I trying to answer? *Can* the collected data *answer* that question?
- Have I collected *enough* data?
- How will I *measure the success* of my solution?
- What *features of the data* did I extract and use for this task?
- How will the ML solution *interact* with other parts of this project?

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import sys
import sklearn
import matplotlib
import scipy
import IPython


print(f'Python version is {sys.version}')
print(f'NumPy version is {np.__version__}')
print(f'Scipy version is {scipy.__version__}')
print(f'Sklearn version is {sklearn.__version__}')
print(f'Pandas version is {pd.__version__}')
print(f'matplotlib version is {matplotlib.__version__}')
print(f'IPython version is {IPython.__version__}')

Python version is 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
NumPy version is 1.20.1
Scipy version is 1.6.2
Sklearn version is 0.24.1
Pandas version is 1.2.4
matplotlib version is 3.3.4
IPython version is 7.22.0


## Classifying Iris
A hobby botanist is interested in distinguishing the species of some iris flowers that she has found. She has collected some measurements associated with each iris (the length and width of the petals and sepals). She also has the measurements of some irises that have been identified by an expert botanist.

**Our goal**: build a machine learning model that learns from the measurements of irises whose species is known and predicts the species of a new iris.

In [5]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()  # Bunch object (behaves like a dictionary) holding the data
iris_dataset.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [15]:
print(f'Names of target (we predict targets for inputs): {iris_dataset.target_names}')
print(f'Names of feature (in input data): {iris_dataset.feature_names}')
print()
print(f'Full description:')
print(iris_dataset.DESCR)

Names of target (we predict targets for inputs): ['setosa' 'versicolor' 'virginica']
Names of feature (in input data): ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Full description:
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1

In [18]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset.data, iris_dataset.target,
                                                   test_size=0.25,
                                                   random_state=0)
print(f'Original size of input data is {iris_dataset.data.shape}')
print(f'Train-data size is {X_train.shape} and test-data size is {X_test.shape}')

Original size of input data is (150, 4)
Train-data size is (112, 4) and test-data size is (38, 4)
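Two details of the split above are easy to check: fixing `random_state` makes the shuffled split reproducible, and 25% of 150 samples is 37.5, which is rounded up to the 38 test samples reported. A small sketch, assuming the same `load_iris` data and the same arguments as in the cell above:

```python
import math

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state should give the same shuffled split.
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
expected_test_size = math.ceil(0.25 * len(X))  # 37.5 rounded up

print(same_split)
print(len(X_train), len(X_test), expected_test_size)
print(np.bincount(y_train), np.bincount(y_test))  # per-class counts in each part
```

The per-class counts are worth a glance because a purely random split does not guarantee that every species is equally represented in both parts.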


In [25]:
iris_dataframe = pd.DataFrame(X_train,
                             columns=iris_dataset.feature_names)
iris_dataframe.head(5)

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.9               3.0                4.2               1.5
1                5.8               2.6                4.0               1.2
2                6.8               3.0                5.5               2.1
3                4.7               3.2                1.3               0.2
4                6.9               3.1                5.1               2.3


In [28]:
pd.plotting.scatter_matrix(iris_dataframe, c = y_train, figsize=(8,8))

[Figure: 4x4 scatter matrix of the training data. Both axes run over sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm); points are colored by species label.]

In [30]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn)

KNeighborsClassifier(n_neighbors=3)
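To make the idea behind `KNeighborsClassifier` concrete, here is a minimal NumPy sketch of k-nearest-neighbors prediction: measure the distance from the new point to every training point, take the `k` closest, and let them vote. It is an illustration under simplifying assumptions (Euclidean distance, plain majority vote, hypothetical toy data), not scikit-learn's actual implementation, which may differ in tie-breaking and uses faster search structures.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict the label of a single point by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_first_group = knn_predict_one(X_toy, y_toy, np.array([1.1, 1.0]))
label_near_second_group = knn_predict_one(X_toy, y_toy, np.array([5.0, 5.0]))
print(label_near_first_group, label_near_second_group)
```

Note that "fitting" such a model amounts to storing the training data; all the work happens at prediction time.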


In [34]:
x_from_test = X_test[:1, :]
y_from_test = y_test[:1]

y_from_modelprediction = knn.predict(x_from_test)

print(f'True label: {iris_dataset.target_names[y_from_test]}')
print(f'Predicted label: {iris_dataset.target_names[y_from_modelprediction]}')

True label: ['virginica']
Predicted label: ['virginica']


In [35]:
y_predict = knn.predict(X_test)

print('True labels of the test set:')
print(y_test)
print('Predicted labels of the test set:')
print(y_predict)

True labels of the test set:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
Predicted labels of the test set:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [38]:
print('accuracy: ', knn.score(X_test, y_test))

accuracy:  0.9736842105263158
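For a classifier, `score` reports accuracy: the fraction of test samples whose predicted label equals the true label. The sketch below recomputes it by hand. The two arrays are copied from the printed output earlier in this notebook, so they are an assumption of this example rather than live model output; they differ only in the last element.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# True test labels, copied from the printed output above.
y_test = np.array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0,
                   0, 2, 1, 0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1])

# Predicted labels: identical except for the last sample (2 instead of 1).
y_predict = y_test.copy()
y_predict[-1] = 2

# Accuracy is the mean of the element-wise "prediction is correct" mask.
manual_accuracy = np.mean(y_predict == y_test)
n_correct = int(np.sum(y_predict == y_test))

print(f'{n_correct} of {len(y_test)} predictions are correct')
print(manual_accuracy)
print(accuracy_score(y_test, y_predict))  # same quantity via scikit-learn
```

With one mistake in 38 samples, the hand-computed value should agree with the accuracy printed by `knn.score`.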
