

Reference:

1.   Raschka, S., and Mirjalili, V. (2019). *Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow*. Chapter 3.

---


# First steps with scikit-learn

In this section, we will take a look at the **scikit-learn** API, which combines a user-friendly and consistent interface with highly optimized implementations of several classification algorithms. The scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models.



We start by loading the Iris dataset from scikit-learn and keeping only two of its four features: the third column, which holds the petal length, and the fourth column, which holds the petal width of the flower examples. The classes are already converted to integer labels, where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.

In [1]:
from IPython.display import Image
%matplotlib inline

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

print('Class labels:', np.unique(y))



Class labels: [0 1 2]
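As a quick sanity check on the column and label conventions described above, the following sketch prints the feature names and class names that ship with the dataset object. It only uses the `feature_names` and `target_names` attributes of the object returned by `load_iris()`; the loop and the printed labels are illustrative and not part of the original notebook.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Column names of iris.data; indices 2 and 3 are the petal measurements.
print('Feature names:', iris.feature_names)

# Class names in label order: the integer label is the index into this array.
for label, name in enumerate(iris.target_names):
    print(label, '->', name)

# Same column selection as in the cell above: petal length and petal width.
X = iris.data[:, [2, 3]]
print('Shape of X:', X.shape)
```

Selecting columns with the list `[2, 3]` keeps `X` two-dimensional, with one row per flower and one column per selected feature.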


The `np.unique(y)` function returned the three unique class labels stored in `iris.target`. As we can see, the Iris flower class names (Iris-setosa, Iris-versicolor, and Iris-virginica) are already stored as integers (here: 0, 1, 2).

To evaluate how well a trained model performs on unseen data, we will further split the dataset into separate training and test datasets.


In [2]:
# Splitting data into 70% training and 30% test data:


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)



Note that the `train_test_split` function already shuffles the dataset internally before splitting it. Via the `random_state` parameter, we provided a fixed random seed (`random_state=1`) for the internal pseudo-random number generator that is used for this shuffling. Using such a fixed `random_state` ensures that our results are reproducible.
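The effect of a fixed seed can be checked directly. The sketch below repeats the split twice with `random_state=1` and compares the resulting arrays element by element; the variable names (`split_a`, `split_b`, `same_seed_identical`) are made up for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

# Two independent calls with the same seed.
split_a = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

# Each call returns (X_train, X_test, y_train, y_test); compare them pairwise.
same_seed_identical = all(
    np.array_equal(a, b) for a, b in zip(split_a, split_b)
)
print('Same seed gives identical splits:', same_seed_identical)
```

If `random_state` is left at its default of `None`, each call draws a fresh shuffle, so the examples that end up in the test set can change from run to run.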

Lastly, we took advantage of the built-in support for stratification via `stratify=y`. In this context, stratification means that the `train_test_split` function returns training and test subsets that have the same proportions of class labels as the input dataset. We can use NumPy's `bincount` function, which counts the number of occurrences of each value in an array, to verify that this is indeed the case:

In [3]:
print('Labels count in y:', np.bincount(y))
print('Labels count in y_train:', np.bincount(y_train))
print('Labels count in y_test:', np.bincount(y_test))


Labels count in y: [50 50 50]
Labels count in y_train: [35 35 35]
Labels count in y_test: [15 15 15]
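To see what `stratify=y` contributes, one can run the same split with and without it and compare the class counts. This is a minimal sketch; the names `y_train_strat`, `y_train_plain`, and so on are chosen for this illustration. Without stratification the per-class counts depend on the shuffle, so no particular output is claimed for that case.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

# Stratified split: class proportions of y are preserved in both subsets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Unstratified split: class counts are left to the random shuffle.
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=1)

print('Stratified train/test:  ',
      np.bincount(y_train_strat), np.bincount(y_test_strat))
# minlength=3 keeps one count per class even if a class were missing.
print('Unstratified train/test:',
      np.bincount(y_train_plain, minlength=3),
      np.bincount(y_test_plain, minlength=3))
```

Stratification matters most for small or imbalanced datasets, where a purely random split can leave a class under-represented in the test set.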



**Standardizing the features:**

Many learning algorithms behave better when all features are on a comparable scale. Here, we will standardize the features using the `StandardScaler` class from scikit-learn's `preprocessing` module:


In [4]:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

Using the preceding code, we loaded the `StandardScaler` class from the `preprocessing` module and initialized a new `StandardScaler` object that we assigned to the `sc` variable. Using the `fit` method, `StandardScaler` estimated the parameters **𝜇** (sample mean) and **𝜎** (standard deviation) for each feature dimension from the training data. By calling the `transform` method, we then standardized the training data using those estimated parameters, **𝜇** and **𝜎**. Note that we used the same scaling parameters to standardize the test dataset, so that the values in the training and test datasets are comparable to each other.
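The following sketch makes the standardization step explicit by reproducing `transform` by hand from the fitted parameters. It relies on the `mean_` and `scale_` attributes of a fitted `StandardScaler`; the names `mu`, `sigma`, `X_train_manual`, and `X_test_manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler()
sc.fit(X_train)

# Parameters estimated from the training data only, one value per feature.
mu = sc.mean_
sigma = sc.scale_

# Manual standardization: subtract the mean, divide by the standard deviation.
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

print('Matches transform (train):',
      np.allclose(X_train_manual, sc.transform(X_train)))
print('Matches transform (test): ',
      np.allclose(X_test_manual, sc.transform(X_test)))

# The training features end up with mean 0 and standard deviation 1;
# the test features are only approximately so, since mu and sigma
# were not estimated from them.
print('Train mean:', X_train_manual.mean(axis=0).round(6))
print('Train std: ', X_train_manual.std(axis=0).round(6))
print('Test mean: ', X_test_manual.mean(axis=0).round(6))
print('Test std:  ', X_test_manual.std(axis=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation (the same as NumPy's default `np.std` with `ddof=0`). Fitting on the training data alone keeps information about the test set out of the preprocessing step.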

# Training a perceptron via scikit-learn