# Scikit-Learn and an introduction to machine learning

- Ships with many freely available datasets.
- Its algorithms handle supervised and unsupervised learning tasks efficiently.
- Provides many modules for training and prediction.
- Offers models for each type of problem.
- Model stability.
- Open-source license.

It helps organize the work around a problem-solution approach.

**Requirements/recommendations**
- create separate objects for the features and the response,
- features and responses should hold numeric values only,
- features and responses should be stored as ndarrays (NumPy),
- the shapes of the features and the response must be compatible (the same number of samples),
- by convention the features are named `X` and the response `y`.
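
A minimal sketch of the layout these requirements describe, using made-up numbers: the features go into a 2-D NumPy array of shape `(n_samples, n_features)` and the responses into a separate 1-D array with one entry per sample. The values and variable names are illustrative assumptions.

```python
import numpy as np

# Features: one row per sample, one column per feature (values are made up).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Responses: one numeric value per sample, kept in a separate object.
y = np.array([3.1, 2.4, 4.6, 7.2])

# The sizes must agree: as many responses as there are feature rows.
print(X.shape, y.shape)
assert X.shape[0] == y.shape[0]
```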

# Regression

Scikit-learn has a built-in **linear regression** model.
<br />
$y=\beta_0+\beta_1x+u$,<br /> where $u$ is the residual, $\beta_0+\beta_1x$ is the predicted value, $y$ is the actual value, and $\beta_1=\frac{dy}{dx}$ is the slope.

The residual can be understood as the difference between a data point and the proposed straight line.
The fit is most often obtained with the least-squares approach, which minimizes the sum of squared errors $SSE$; $SSR$ measures the variation explained by the fitted line.<br />
$SSR=\sum_i(\hat{y}_i-\bar{y})^2$ <-- difference between the estimated value and the mean of $y$ ($\bar{y}$)<br />
$SSE=\sum_i(y_i-\hat{y}_i)^2$ <-- difference between the actual value and the estimated value<br />
In these equations: $\hat{y}=\hat{\beta_0}+\hat{\beta_1}x$<br />
$\bar{y}$ is the mean of the response values.
<br /><br />
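
The quantities $SSR$ and $SSE$ can be checked numerically. The sketch below uses a made-up five-point sample and the closed-form least-squares estimates for a single feature; the total sum of squares `sst` is introduced here only for illustration, to show that for a least-squares line with an intercept the two sums add up to it.

```python
import numpy as np

# Made-up sample: y is roughly 2 + 3x plus a small disturbance.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

# Closed-form least-squares estimates for one feature.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x                 # predicted values
ssr = np.sum((y_hat - y.mean()) ** 2)     # variation explained by the line
sse = np.sum((y - y_hat) ** 2)            # variation left in the residuals
sst = np.sum((y - y.mean()) ** 2)         # total variation of y

print(beta0, beta1)
print(ssr, sse, sst)
```
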
**sklearn.linear_model.LinearRegression( fit_intercept = True, normalize = False, copy_X = True, n_jobs = 1 )**
<br />
In the above: <br />
*fit_intercept* - whether to estimate the intercept (the point where the line crosses the Y axis) <br />
*normalize* - rescales the features before fitting by subtracting the mean and dividing by the L2 norm (deprecated in scikit-learn 1.0 and removed in 1.2) <br />
*copy_X* - copies `X` instead of overwriting it <br />
*n_jobs* - number of parallel jobs used for the computation<br />
<br />
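
To illustrate `fit_intercept`, the sketch below fits the same made-up points twice: once with an estimated intercept and once with the line forced through the origin. The data are an assumption chosen so that the true intercept is clearly non-zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 5 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 5.0 + 2.0 * X[:, 0]

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
through_origin = LinearRegression(fit_intercept=False).fit(X, y)

# With an intercept the line is recovered; without one the slope has to
# compensate for the missing constant term.
print(with_intercept.intercept_, with_intercept.coef_)
print(through_origin.intercept_, through_origin.coef_)
```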


In [2]:
import numpy as np
import pandas as pd

In [3]:
# load a sample dataset (load_boston was removed in scikit-learn 1.2, so this cell needs an older version):
from sklearn.datasets import load_boston
bostonDataset = load_boston()

In [4]:
# print the dataset description
print( bostonDataset['DESCR'] )

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [5]:
# display the features described in the dataset:
print( bostonDataset['feature_names'] )

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [6]:
# store the data in a pandas DataFrame
dfBostonDataset = pd.DataFrame( bostonDataset.data )

In [7]:
dfBostonDataset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [8]:
# the column names are needed
bostonDataset.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [9]:
# assign the column names:
dfBostonDataset.columns = bostonDataset.feature_names

In [10]:
dfBostonDataset

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [11]:
dfBostonDataset.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [12]:
dfBostonDataset.shape

(506, 13)

In [13]:
# target values (the correct answers for the model)
bostonDataset.target

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21

In [14]:
bostonDataset.target.shape

(506,)

# Demo 02
**Linear regression**

In [15]:
import numpy as np
import pandas as pd

**fetching sample data from sklearn.datasets** <br />
this is what the data looks like straight after loading

In [19]:
#importing boston dataset
from sklearn.datasets import load_boston
bostonDataset = load_boston()
bostonDataset

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

In [18]:
type(bostonDataset)

sklearn.utils.Bunch

In [25]:
# create a pandas DataFrame
dfBostonDataset = pd.DataFrame(bostonDataset.data)
dfBostonDataset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


**There are no column names, so they have to be set explicitly**

In [26]:
dfBostonDataset.columns = bostonDataset.feature_names
dfBostonDataset

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [28]:
# The target values are still missing; they need to be added as a separate column.
dfBostonDataset['Price'] = bostonDataset.target # <- see bostonDataset above, right after loading
dfBostonDataset

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Price
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0
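
The DataFrame can also be assembled in a single step. The sketch below is an assumption-laden stand-in: it uses three columns and two rows copied from the table above instead of the real `bostonDataset` object, passes the column names at construction time and appends the target with `assign`.

```python
import numpy as np
import pandas as pd

# Stand-in arrays; in the notebook these would be bostonDataset.data,
# bostonDataset.feature_names and bostonDataset.target.
data = np.array([[0.00632, 18.0, 2.31],
                 [0.02731, 0.0, 7.07]])
feature_names = ["CRIM", "ZN", "INDUS"]
target = np.array([24.0, 21.6])

# Column names are given up front; assign() returns a copy with the extra column.
df = pd.DataFrame(data, columns=feature_names).assign(Price=target)
print(df)
```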


# Storing the data once more - this time as NumPy ndarrays

These are what the regression models operate on later

In [29]:
X_features = bostonDataset.data
type(X_features)

numpy.ndarray

In [30]:
Y_features = bostonDataset.target
type(Y_features)

numpy.ndarray

**Importing the estimator**

In [31]:
from sklearn.linear_model import LinearRegression
lineReg = LinearRegression()

In [32]:
# fit the estimator to the data
lineReg.fit(X_features, Y_features)

LinearRegression()

In [42]:
# display the intercept (the point where the fit crosses the Y axis)
print("The estimated intercept is: {}".format(lineReg.intercept_))

The estimated intercept is: 36.45948838509001


In [43]:
# display the slope coefficients (one per feature):
print("The coefficient is: {}".format(lineReg.coef_))

The coefficient is: [-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]


# Splitting the data


In [46]:
from sklearn.model_selection import train_test_split

In [47]:
X_train, X_test, Y_train, Y_test = train_test_split( X_features, Y_features )

In [51]:
print( X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
print( X_train.shape[0]/X_features.shape[0],  Y_train.shape[0]/Y_features.shape[0] )

(379, 13) (127, 13) (379,) (127,)
0.7490118577075099 0.7490118577075099
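
By default `train_test_split` holds out 25% of the samples, which matches the ratio printed above, and shuffles differently on every run. The sketch below, on stand-in random data with the same shape as the Boston feature matrix, shows the `test_size` and `random_state` parameters; the chosen values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the Boston feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = rng.normal(size=506)

# test_size sets the held-out fraction; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```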


In [52]:
lineReg.fit(X_train,Y_train)

LinearRegression()

In [55]:
lineReg.predict( X_test )

array([31.81275197, 15.83130476, 29.35394282, 15.39836728, 15.59230017,
       18.5647666 , 22.59876364, 15.58747892, 28.42281509, 18.42513646,
       38.44686691, 32.49637707, 18.57002287, 20.17469889, 43.99807935,
       23.13900809, 26.88583055, 16.26472592, 17.5125672 , 22.38392322,
       21.83319096, 17.51410927, 27.55333338, 30.28003598, 22.75425093,
       32.83087951, 14.18719795, 24.51150246, 29.61941863, 25.27249857,
       -5.57409871, 10.8661912 , 24.34515697, 20.92308647, 26.53487503,
       28.82511199, 24.2682514 , 18.7578764 , 15.38833129, 20.19684822,
       24.47652284, 26.7360816 , 23.89798353,  9.61846214, 20.87990052,
       29.52866788, 22.98815156, 17.66548409, 24.2840109 , 23.57898685,
       28.48330574, 18.3658425 , 32.63127079, 22.42997028, 26.62686092,
       17.32221547, 10.24107416, 17.22900732,  9.0844547 , 18.11291801,
       26.86969158, 20.7207333 , 30.13821913, 20.85728303, 17.48450639,
       24.02284195, 17.36427367, 28.61586379, 16.547154  , 18.76

In [56]:
lineReg.predict( X_test ) - Y_test

array([ 7.12751970e-01,  4.31304756e-01, -7.46057178e-01,  5.19836728e+00,
       -1.60769983e+00,  2.46476660e+00,  2.99876364e+00, -1.25210764e-02,
        4.72281509e+00,  4.32513646e+00, -6.35313309e+00,  5.49637707e+00,
        1.70022868e-01,  2.47469889e+00, -6.00192065e+00,  1.93900809e+00,
       -9.31416945e+00,  6.06472592e+00, -3.28743280e+00,  3.48392322e+00,
        1.13319096e+00, -3.28589073e+00,  9.53333383e-01,  8.80035983e-01,
        6.25425093e+00,  1.33087951e+00,  4.58719795e+00,  1.61150246e+00,
        5.19418627e-01,  3.67249857e+00, -1.25740987e+01, -1.13380880e+00,
        9.45156972e-01,  1.32308647e+00,  3.23487503e+00,  1.25111987e-01,
       -5.33174860e+00,  4.15787640e+00, -3.11668713e-01, -4.30315178e+00,
       -2.55234772e+01,  1.53608160e+00, -5.02016468e-01, -4.98153786e+00,
        4.77990052e+00,  6.02866788e+00,  2.68815156e+00, -1.83451591e+00,
        3.78401090e+00,  6.78986851e-01,  5.18330574e+00, -1.63415750e+00,
        6.31270786e-01,  

In [58]:
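# assessment of the prediction:
# note: np.mean(err) ** 2 squares the mean residual; the MSE proper is np.mean(err ** 2)
print("Squared mean residual is %.2f" % (np.mean(lineReg.predict(X_test) - Y_test) ** 2))

Squared mean residual is 0.39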


In [57]:
# R^2 score (coefficient of determination):
print("The R^2 score for the model is: {}".format(lineReg.score(X_test,Y_test)))

The R^2 score for the model is: 0.6927072778555858
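
A hedged sketch of how the mean squared error and the $R^2$ score can be computed with `sklearn.metrics`. It runs on synthetic data with made-up coefficients rather than on the Boston set, so the numbers it prints are not comparable with the ones above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (made-up coefficients and noise level).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE: square each residual first, then average.
mse_manual = np.mean((y_pred - y_test) ** 2)
mse_sklearn = mean_squared_error(y_test, y_pred)

# For regressors, .score() returns the coefficient of determination R^2.
r2 = r2_score(y_test, y_pred)
print(mse_manual, mse_sklearn, r2, model.score(X_test, y_test))
```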


# Logistic regression
**k-NN here**


In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [30]:
from sklearn.datasets import load_iris
dataset = load_iris()
type(dataset)

sklearn.utils.Bunch

In [31]:
print(dataset.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [32]:
dataset.data.shape

(150, 4)

In [33]:
dataset

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [34]:
X = dataset.data
y = dataset.target

In [35]:
print( type(X), type(y))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [36]:
X.shape

(150, 4)

In [37]:
y.shape

(150,)

In [21]:
# k-nearest neighbours classifier using a single neighbour
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier( n_neighbors = 1 )
print(classifier)

KNeighborsClassifier(n_neighbors=1)


In [23]:
classifier.fit(X,y)

KNeighborsClassifier(n_neighbors=1)

In [24]:
# test data:
X_test = [[3,5,4,1],[5,3,4,2]]

In [25]:
classifier.predict(X_test)

array([1, 1])

**Logistic regressor estimator**

In [50]:
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(max_iter=200)

In [51]:
estimator.fit(X,y)

LogisticRegression(max_iter=200)

In [52]:
estimator.predict(X_test)

array([0, 1])
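
The two classifiers disagree on the first hand-made sample, and both were fitted on the full dataset, so nothing above measures how well either generalizes. One possible way to compare them, sketched here under the assumption of a 70/30 stratified split, is to score both on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the flowers, keeping the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "k-NN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "logistic regression": LogisticRegression(max_iter=200),
}

# Fit each model on the training part and measure accuracy on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, scores[name])
```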

# Clustering

**k-means is used here**

In [53]:
#importing needed libraries
import numpy as np
#importing the model
from sklearn.cluster import KMeans
#importing the dataset {actually a generator for data}
from sklearn.datasets import make_blobs

In [54]:
# n_samples - number of points, distributed equally between the clusters
n_samples = 300 
# seed used when initializing the centroids
random_states = 20

#generate the data
X, y = make_blobs( n_samples = n_samples, n_features = 5, random_state = None)
#make a prediction
Ypredict = KMeans( n_clusters = 3, random_state = random_states ).fit_predict(X)
print(Ypredict)

[2 0 1 2 0 1 1 0 2 2 1 2 1 1 0 0 1 2 0 0 1 1 2 0 0 2 0 1 1 1 1 0 1 2 2 2 1
 0 2 0 2 1 2 2 1 2 0 1 0 1 2 1 0 0 1 1 1 1 0 2 2 1 0 0 2 1 0 2 1 1 2 2 2 0
 1 2 2 0 1 2 2 0 0 1 0 2 2 0 0 0 1 0 0 2 0 1 2 2 2 0 0 2 2 0 0 1 1 0 2 0 1
 2 2 0 1 0 2 2 1 0 2 0 0 1 2 0 0 1 0 2 2 2 0 0 0 0 1 1 1 2 0 2 1 0 0 1 2 0
 0 1 2 0 2 0 1 1 0 2 2 2 0 2 1 1 1 0 0 1 1 2 2 2 2 0 1 1 2 1 1 0 1 2 1 2 0
 0 2 1 0 1 1 0 1 1 2 2 0 1 2 2 0 1 1 1 0 1 0 1 1 0 2 1 2 2 0 2 2 0 1 2 1 2
 2 0 0 0 0 0 1 1 1 1 2 1 0 0 1 1 2 2 2 2 1 1 1 2 1 0 2 1 1 0 0 0 2 0 2 0 0
 0 1 2 1 2 2 2 1 2 2 2 1 2 0 2 2 1 0 0 2 0 1 2 0 2 2 0 1 2 1 2 0 1 1 0 1 0
 1 0 1 0]


In [55]:
print(X)

[[-3.43502107 -6.38484689  9.35925714  3.80827972 -2.43787094]
 [-2.66424315 -8.50374218  4.91188899 -7.39342593 -7.85808909]
 [-1.52391668 -3.88182004 10.01815524 -7.33057823  3.33719457]
 ...
 [-2.51601923 -9.03003068  5.56556266 -8.5093775  -7.71323994]
 [ 0.21028554 -5.20426236  9.33479209 -8.68695106  2.03149233]
 [-3.3080214  -8.65480984  5.6897379  -8.48382703 -7.88933248]]
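
Because `make_blobs` is called with `random_state = None` above, the points and labels differ from run to run. The sketch below fixes the seed in both the generator and the model (the seed value and `n_init=10` are assumptions) and inspects the fitted centroids and the within-cluster sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Fixing random_state in make_blobs makes the generated points repeatable.
X, y_true = make_blobs(n_samples=300, n_features=5, centers=3, random_state=20)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=20)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)   # one centroid per cluster
print(np.bincount(labels))             # number of points assigned to each cluster
print(kmeans.inertia_)                 # within-cluster sum of squared distances
```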


# Dimensionality reduction - PCA

In [57]:
#import what is required
from sklearn.decomposition import PCA
#import data generator
from sklearn.datasets import make_blobs

In [59]:
#define settings
n_sample = 20
random_state = 20

# generate the data with 10 features
X, y = make_blobs(n_samples = n_sample, n_features = 10, random_state = None)

In [62]:
#checking the dataset
print(X.shape, y.shape )

(20, 10) (20,)


In [63]:
pca = PCA( n_components = 3 )

In [65]:
pca.fit(X)
print(pca.explained_variance_ratio_)

[0.67010837 0.30840069 0.00744888]


In [66]:
type(pca)

sklearn.decomposition._pca.PCA

In [68]:
type(pca.components_)

numpy.ndarray

In [67]:
# viewing the first principal component:
pca.components_[0]

array([ 0.17735203,  0.38567402, -0.4676124 , -0.18532309, -0.48415253,
        0.43564659,  0.25759541, -0.24769839,  0.10277859, -0.06580928])

**applying the reduction**

In [71]:
reducedData = pca.transform(X)
type(reducedData)

numpy.ndarray

In [74]:
reducedData.shape

(20, 3)

In [75]:
print(reducedData)

[[-15.23211396  -8.3687772   -0.35808634]
 [-17.28775863  -7.75526311  -0.97555995]
 [ -3.45414809  14.31483618  -1.03294392]
 [-15.49202514  -9.05637301  -0.09514083]
 [-15.19106215  -6.80152789   3.15661751]
 [ 18.44662495  -3.52122875   2.2478196 ]
 [ 18.47323646  -5.3382637   -0.14692448]
 [ 18.45711423  -4.4203729   -2.1496997 ]
 [ -1.6265372   13.37640551   1.49811176]
 [ 17.70118954  -5.89776626  -0.35964163]
 [-16.94228484  -5.9387317    0.09241809]
 [ -2.84841764  13.7401264    0.17877295]
 [ -3.05655984  16.34196235  -1.42681364]
 [ 16.70189463  -3.67939953  -1.37865308]
 [-15.49793556  -7.24149364   0.63456172]
 [ -3.91166376  14.47836033   2.19835386]
 [ 17.61522901  -5.75417519   1.94346488]
 [-14.21761201  -9.11766264  -2.44933275]
 [ -1.87633199  15.29885618  -1.45779909]
 [ 19.23916198  -4.65951142  -0.11952495]]
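
What `transform` does can be reproduced by hand: the data are centred with the fitted mean and multiplied by the component vectors. The sketch below checks this on freshly generated blobs (the seed is an assumption, so the numbers differ from the output above) and also prints the cumulative explained variance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=20, n_features=10, random_state=20)

pca = PCA(n_components=3)
reduced = pca.fit_transform(X)          # fit and project in one call

# Projection by hand: centre the data, then multiply by the component vectors.
manual = (X - pca.mean_) @ pca.components_.T

print(reduced.shape)
print(np.cumsum(pca.explained_variance_ratio_))
```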


# Pipelines

In [76]:
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression

from sklearn.decomposition import PCA

**Chain the estimators**

In [77]:
estimator = [('dim_reduction',PCA()),('linear_model',LinearRegression())]

**Put the list into Pipeline object**

In [78]:
pipelineEstimator = Pipeline(estimator)

**Check the pipeline**

In [79]:
pipelineEstimator

Pipeline(steps=[('dim_reduction', PCA()), ('linear_model', LinearRegression())])

**View the steps**

In [80]:
pipelineEstimator.steps[0]

('dim_reduction', PCA())

In [81]:
pipelineEstimator.steps[1]

('linear_model', LinearRegression())

In [82]:
pipelineEstimator.steps

[('dim_reduction', PCA()), ('linear_model', LinearRegression())]
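
The pipeline above is only constructed, never fitted. A minimal sketch of using one end to end, on synthetic data with made-up coefficients and an assumed `n_components=3`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic regression data (made-up coefficients).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

pipe = Pipeline([
    ("dim_reduction", PCA(n_components=3)),
    ("linear_model", LinearRegression()),
])

# fit() runs PCA.fit_transform and then fits the regression on the reduced data.
pipe.fit(X, y)
predictions = pipe.predict(X[:5])

# Individual steps stay reachable by name.
print(pipe.named_steps["dim_reduction"].n_components_)
print(predictions.shape)
```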

# Model persistence and Evaluation

In [83]:
from sklearn.datasets import load_iris
irisDataset = load_iris()

In [85]:
print(irisDataset.feature_names, irisDataset.target_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] ['setosa' 'versicolor' 'virginica']


In [97]:
X = irisDataset.data
y = irisDataset.target

print( type(X), type(y) )
print( X.shape, y.shape )

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(150, 4) (150,)


In [98]:
#new object for prediction:
X_test = [[3,5,4,1],[5,3,4,2]]

In [104]:
# import the logistic regression estimator
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(max_iter=200)

In [105]:
estimator.fit(X,y)

LogisticRegression(max_iter=200)

In [106]:
estimator.predict(X_test)

array([0, 1])

**For model persistence the pickle module is needed**

In [107]:
import pickle as pkl

In [109]:
persistModel = pkl.dumps(estimator)
print(persistModel)

b'\x80\x04\x95\x04\x03\x00\x00\x00\x00\x00\x00\x8c\x1esklearn.linear_model._logistic\x94\x8c\x12LogisticRegression\x94\x93\x94)\x81\x94}\x94(\x8c\x07penalty\x94\x8c\x02l2\x94\x8c\x04dual\x94\x89\x8c\x03tol\x94G?\x1a6\xe2\xeb\x1cC-\x8c\x01C\x94G?\xf0\x00\x00\x00\x00\x00\x00\x8c\rfit_intercept\x94\x88\x8c\x11intercept_scaling\x94K\x01\x8c\x0cclass_weight\x94N\x8c\x0crandom_state\x94N\x8c\x06solver\x94\x8c\x05lbfgs\x94\x8c\x08max_iter\x94K\xc8\x8c\x0bmulti_class\x94\x8c\x04auto\x94\x8c\x07verbose\x94K\x00\x8c\nwarm_start\x94\x89\x8c\x06n_jobs\x94N\x8c\x08l1_ratio\x94N\x8c\x0en_features_in_\x94K\x04\x8c\x08classes_\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h\x1c\x8c\x05dtype\x94\x93\x94\x8c\x02i4\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x0c\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x94t\x94b\x8c\x05coef_\x

We can save a model into a file:

In [111]:
from joblib import dump, load

In [112]:
dump(persistModel,'regModel.pkl')

['regModel.pkl']

In [113]:
newEstimator = load('regModel.pkl') #load from joblib

In [114]:
newEstimator # the loaded object is the pickled byte string, not the estimator

b'\x80\x04\x95\x04\x03\x00\x00\x00\x00\x00\x00\x8c\x1esklearn.linear_model._logistic\x94\x8c\x12LogisticRegression\x94\x93\x94)\x81\x94}\x94(\x8c\x07penalty\x94\x8c\x02l2\x94\x8c\x04dual\x94\x89\x8c\x03tol\x94G?\x1a6\xe2\xeb\x1cC-\x8c\x01C\x94G?\xf0\x00\x00\x00\x00\x00\x00\x8c\rfit_intercept\x94\x88\x8c\x11intercept_scaling\x94K\x01\x8c\x0cclass_weight\x94N\x8c\x0crandom_state\x94N\x8c\x06solver\x94\x8c\x05lbfgs\x94\x8c\x08max_iter\x94K\xc8\x8c\x0bmulti_class\x94\x8c\x04auto\x94\x8c\x07verbose\x94K\x00\x8c\nwarm_start\x94\x89\x8c\x06n_jobs\x94N\x8c\x08l1_ratio\x94N\x8c\x0en_features_in_\x94K\x04\x8c\x08classes_\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h\x1c\x8c\x05dtype\x94\x93\x94\x8c\x02i4\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x0c\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x94t\x94b\x8c\x05coef_\x

In [118]:
# so we also need to call .loads() from pickle
newEstimator = pkl.loads(newEstimator)

In [119]:
newEstimator.predict(X_test)

array([0, 1])
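
The round trip through `pickle.dumps` is not required: `joblib.dump` can serialize the estimator object itself, and `joblib.load` then returns a usable estimator. A sketch of that shorter route, with a hypothetical file name placed in a temporary directory:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=200).fit(X, y)

# Hypothetical file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "logreg.joblib")

dump(estimator, path)          # joblib serializes the estimator object itself
restored = load(path)          # and returns a ready-to-use estimator

X_new = [[3, 5, 4, 1], [5, 3, 4, 2]]
print(restored.predict(X_new))
```

As with pickle, such files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.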

# Metrics