**Classifying Iris Species**

This notebook builds a machine learning model from measurements of irises whose species is already known, so that the species of a new iris can be predicted.

*The first step is to load the dataset:*

In [21]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [22]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys())) # load_iris returns a dictionary-like object with keys and values

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [23]:
print(iris_dataset['DESCR'][:193] + "\n...") #The key 'DESCR' contains a short description of the entire dataset

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...


In [24]:
print("Target names: {}".format(iris_dataset['target_names'])) #In target_names are the flower species that we want to predict

Target names: ['setosa' 'versicolor' 'virginica']


In [25]:
print("Feature names: \n{}".format(iris_dataset['feature_names'])) #This shows a description of features

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [26]:
print("Type of data: {}".format(type(iris_dataset['data']))) # The 'data' field holds the measurements: sepal length, sepal width, petal length and petal width

Type of data: <class 'numpy.ndarray'>


In [27]:
print("Shape of data: {}".format(iris_dataset['data'].shape)) #The number of flowers obtained and the measurements that were taken for each flower are shown

Shape of data: (150, 4)


In [28]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5])) # Each row holds the four measurements of one flower

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [29]:
print("Type of target: {}".format(type(iris_dataset['target']))) #The target contains the species of each flower that was measured

Type of target: <class 'numpy.ndarray'>


In [30]:
print("Shape of target: {}".format(iris_dataset['target'].shape)) #Show the shape of the target

Shape of target: (150,)


In [31]:
print("Target:\n{}".format(iris_dataset['target'])) # The species are encoded as the integers 0, 1 and 2

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
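
The integer codes can be translated back into species names by indexing `target_names` with the `target` array. A minimal sketch, reloading the dataset so that the snippet runs on its own; the variable names `species` and `counts` are illustrative and do not appear in the original notebook:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Index the array of names with the integer codes (NumPy fancy indexing)
species = iris_dataset['target_names'][iris_dataset['target']]
print(species[:3], species[-3:])

# Count how many flowers carry each code (0, 1, 2)
counts = np.bincount(iris_dataset['target'])
for name, count in zip(iris_dataset['target_names'], counts):
    print("{}: {}".format(name, count))
```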


So far we have only loaded the data that will be used for training. Before relying on the model, we need to check that its predictions can be trusted. We cannot evaluate the model on the same data that was used to build it, because it could simply remember those points and would then always predict their labels correctly. Instead we need data that the model has not seen before. The data used to build the model is called the training data, and the held-out data used to assess it is called the test data.

The **train_test_split** function shuffles the dataset and splits it into two parts. By default, it takes 75% of the rows as the training set and the remaining 25% as the test set.

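For the test to be meaningful, the test set must contain flowers from every class. The data points are sorted by species, as the target array above shows, so taking the last 25% of the rows without shuffling would give a test set containing only one class.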
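
To illustrate why the split needs to draw from every class, the sketch below compares a split without shuffling to the default shuffled split. It relies on the `shuffle` parameter of `train_test_split`; the variable names `y_test_unshuffled` and `y_test_shuffled` are illustrative only:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# shuffle=False keeps the original order, so the test set is the last 25% of the rows
_, _, _, y_test_unshuffled = train_test_split(
    iris_dataset['data'], iris_dataset['target'], shuffle=False)
print("Classes in unshuffled test set: {}".format(np.unique(y_test_unshuffled)))

# With shuffling (the default), the test set draws from the whole dataset
_, _, _, y_test_shuffled = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
print("Classes in shuffled test set: {}".format(np.unique(y_test_shuffled)))
```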

The **random_state** parameter passed to **train_test_split** fixes the seed of the pseudorandom shuffle, so running the same call again produces the same split and the results are reproducible.
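
The effect of `random_state` can be checked with a small sketch: two calls with the same seed should return identical splits. The suffixes `_a` and `_b` and the name `same_split` are illustrative only:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two calls with the same seed produce exactly the same split
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, random_state=0)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Same split with the same seed: {}".format(same_split))

# By default a quarter of the rows goes to the test set
print("Train rows: {}, test rows: {}".format(len(X_train_a), len(X_test_a)))
```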

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

In [33]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

X_train shape: (112, 4)
y_train shape: (112,)


In [34]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_test shape: (38, 4)
y_test shape: (38,)


Before building the machine learning model, it is a good idea to inspect the data, to see whether the task can be solved easily and whether the information needed to separate the species is actually contained in the measurements. One way to do this is with scatter plots of every pair of features.

In [43]:
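import pandas as pd
import mglearn  # helper package from the source book; provides the cm3 colormap
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# scatter_matrix lives in pandas.plotting; the old pd.scatter_matrix alias raises AttributeError
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)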
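
As a text-only complement to the pair plot, the training data can also be summarized per species. The sketch below is an addition, not part of the original notebook: it repeats the split so that it runs on its own, and the `species` column and the `class_means` name are illustrative. If the averages differ clearly between species, the classes are likely to be separable from these measurements.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Put the training measurements into a labeled table
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])
iris_dataframe['species'] = iris_dataset['target_names'][y_train]

# Average each measurement per species
class_means = iris_dataframe.groupby('species').mean()
print(class_means)
```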

We can now build the machine learning model. We use the k-nearest neighbors classifier: to make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point, then assigns the label of that training point to the new data point.
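
The nearest-neighbor rule described above can be sketched directly in NumPy. This is an illustrative re-implementation using Euclidean distance, not the code that scikit-learn runs internally; the function name `predict_nearest_neighbor` is hypothetical:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

def predict_nearest_neighbor(X_train, y_train, X_query):
    """Label each query row with the label of its closest training row."""
    # Pairwise differences, shape (n_query, n_train, n_features)
    differences = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distance from every query row to every training row
    distances = np.sqrt((differences ** 2).sum(axis=2))
    # Index of the closest training row for each query row
    nearest = distances.argmin(axis=1)
    return y_train[nearest]

# One hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
X_query = np.array([[5, 2.9, 1, 0.2]])
manual_prediction = predict_nearest_neighbor(X_train, y_train, X_query)
print("Manual prediction: {}".format(iris_dataset['target_names'][manual_prediction]))
```

Computing every pairwise distance is fine for 112 training points; for large datasets, tree-based or approximate searches are usually preferred.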

In [36]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [37]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

We can now use this model to make predictions on new data for which the correct label is unknown.

In [38]:
import numpy as np
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)


In [39]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']


Now we can measure the accuracy of the model, that is, the fraction of test flowers for which it predicts the correct species.

In [40]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [41]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

Test set score: 0.97


In [42]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Test set score: 0.97
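
The two cells above compute the same quantity. As a hedged sketch, the snippet below recomputes it a third way with `accuracy_score` from `sklearn.metrics` (not used in the original notebook) and counts the misclassified test flowers; the names `manual`, `from_score`, `from_metric` and `n_wrong` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three equivalent ways to compute the fraction of correct predictions
manual = np.mean(y_pred == y_test)
from_score = knn.score(X_test, y_test)
from_metric = accuracy_score(y_test, y_pred)
print(manual, from_score, from_metric)

# Count the misclassified test flowers
n_wrong = int(np.sum(y_pred != y_test))
print("Misclassified: {} of {}".format(n_wrong, len(y_test)))
```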


**Source**

Müller, A. C., & Guido, S. (2016). *Introduction to machine learning with Python: A Guide for Data Scientists*. Beijing: O'Reilly Media.