
#  Creating Test and Train Datasets

###   Iris Flower Dataset

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper
"The use of multiple measurements in taxonomic problems."

It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor).

Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

The dataset contains 150 records with 5 attributes:

    Petal Length
    Petal Width
    Sepal Length
    Sepal Width
    Class (Species)

### sklearn.datasets.load_iris

Load and return the iris dataset (classification).

The iris dataset is a classic and very easy multi-class classification dataset.

    Classes               3
    Samples per class     50
    Samples total         150
    Dimensionality        4
    Features              real, positive
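
The figures in the summary above can be checked directly on the loaded object. The following is a minimal sketch; the names `iris_check` and `samples_per_class` are illustrative and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the dataset into a Bunch object (illustrative name, not from the notebook)
iris_check = load_iris()

# Feature matrix: one row per sample, one column per measurement
n_samples, n_features = iris_check.data.shape

# Count how many samples carry each class label (0, 1, 2)
samples_per_class = np.bincount(iris_check.target)

print(n_samples, n_features)           # 150 4
print(len(iris_check.target_names))    # 3
print(samples_per_class)               # [50 50 50]
print(iris_check.feature_names)        # names of the four measured features
print(bool((iris_check.data > 0).all()))  # True: all features are positive
```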

In [None]:
#  Importing Iris dataset from sklearn.datasets
from sklearn.datasets import load_iris

# Importing train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [None]:
#  Assigning dataset info to iris Bunch object
iris = load_iris()

#  Displaying values in iris Bunch object
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [None]:
#  Displaying the data portion of the Bunch object (flower sepal/petal dimensions)
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [None]:
#  Displaying the target portion of the Bunch object (number assigned to each iris variant type)
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
#  Displaying the target_names portion of the Bunch object (iris variant types)
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
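
Because `target` holds integer codes and `target_names` holds the matching names, indexing one array with the other recovers a species name for every sample. This is a small sketch of that idea; `iris_demo` and `species` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()

# Index the names array with the integer labels to get one name per sample
species = iris_demo.target_names[iris_demo.target]

# Samples 0, 50 and 100 are the first sample of each class
print(species[0], species[50], species[100])  # setosa versicolor virginica
```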

In [None]:
#  Splitting the data (inputs and outputs) into train and test groups
#  (by default, 25% of the samples go to the test group)
x_train, x_test, ytrain, ytest = train_test_split(iris.data, iris.target)

In [None]:
#  Displaying content of the x_train array (selected inputs to train the model)
x_train

array([[5.1, 3.7, 1.5, 0.4],
       [6.7, 3.1, 4.7, 1.5],
       [6.3, 2.7, 4.9, 1.8],
       [5.8, 4. , 1.2, 0.2],
       [6.7, 3. , 5. , 1.7],
       [5.1, 3.8, 1.9, 0.4],
       [5.5, 2.6, 4.4, 1.2],
       [6.4, 3.2, 5.3, 2.3],
       [5.2, 2.7, 3.9, 1.4],
       [5.6, 2.8, 4.9, 2. ],
       [7.2, 3.2, 6. , 1.8],
       [6.1, 3. , 4.6, 1.4],
       [6.4, 2.8, 5.6, 2.2],
       [4.5, 2.3, 1.3, 0.3],
       [5. , 3.5, 1.6, 0.6],
       [5.4, 3.7, 1.5, 0.2],
       [6.5, 3. , 5.2, 2. ],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.7, 5.3, 1.9],
       [6.7, 3.1, 4.4, 1.4],
       [5.4, 3.4, 1.7, 0.2],
       [4.3, 3. , 1.1, 0.1],
       [5.3, 3.7, 1.5, 0.2],
       [6.3, 2.5, 4.9, 1.5],
       [4.8, 3. , 1.4, 0.3],
       [6.2, 2.2, 4.5, 1.5],
       [6.5, 3. , 5.8, 2.2],
       [5.1, 3.8, 1.6, 0.2],
       [5.7, 3.8, 1.7, 0.3],
       [5.5, 3.5, 1.3, 0.2],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5.6, 2.9, 3.6, 1.3],
       [5.9, 3.2, 4.8, 1.8],
       [7.7, 3

In [None]:
#  Displaying the content of the x_test array (selected inputs to test the model)
x_test

array([[5.5, 4.2, 1.4, 0.2],
       [5.8, 2.6, 4. , 1.2],
       [6.5, 3.2, 5.1, 2. ],
       [5.7, 2.9, 4.2, 1.3],
       [5.5, 2.5, 4. , 1.3],
       [6.7, 2.5, 5.8, 1.8],
       [4.6, 3.6, 1. , 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.1, 3.4, 1.5, 0.2],
       [6.6, 2.9, 4.6, 1.3],
       [6.2, 3.4, 5.4, 2.3],
       [5.7, 2.8, 4.5, 1.3],
       [7.6, 3. , 6.6, 2.1],
       [5.9, 3. , 5.1, 1.8],
       [5.2, 4.1, 1.5, 0.1],
       [6.1, 2.8, 4. , 1.3],
       [7.2, 3.6, 6.1, 2.5],
       [4.9, 2.5, 4.5, 1.7],
       [6.1, 2.6, 5.6, 1.4],
       [6.7, 3.3, 5.7, 2.1],
       [5.5, 2.4, 3.8, 1.1],
       [6.1, 2.9, 4.7, 1.4],
       [5.8, 2.7, 3.9, 1.2],
       [7.7, 2.6, 6.9, 2.3],
       [6. , 2.7, 5.1, 1.6],
       [5.7, 2.6, 3.5, 1. ],
       [5.7, 4.4, 1.5, 0.4],
       [6. , 2.2, 5. , 1.5],
       [7.1, 3. , 5.9, 2.1],
       [6.3, 2.9, 5.6, 1.8],
       [4.4, 2.9, 1.4, 0.2],
       [5.4, 3.9, 1.3, 0.4],
       [6.7, 3.1, 5.6, 2.4],
       [6. , 3.4, 4.5, 1.6],
       [4.9, 3

In [None]:
#  Displaying the shape (rows, columns) of the test arrays
print(x_test.shape, ytest.shape)

#  Displaying the shape (rows, columns) of the train arrays
print(x_train.shape, ytrain.shape)

(38, 4) (38,)
(112, 4) (112,)
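
The 38/112 split follows from the default behavior of `train_test_split`: when no size is given, 25% of the samples are placed in the test group, rounded up. The sketch below makes that explicit and adds two optional arguments that the notebook does not use: `random_state` (the value 42 is an arbitrary choice) to make the shuffle repeatable, and `stratify` to keep the class proportions similar in both groups. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# return_X_y=True returns the feature matrix and the labels directly
features, labels = load_iris(return_X_y=True)

# test_size=0.25 matches the default used when no size is given;
# random_state fixes the shuffle; stratify keeps class proportions similar
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

print(X_te.shape, y_te.shape)   # (38, 4) (38,)
print(X_tr.shape, y_tr.shape)   # (112, 4) (112,)
print(np.bincount(y_te))        # test samples per class, roughly equal
```

Without `random_state`, each run produces a different shuffle, so the rows shown in `x_train` and `x_test` will differ from one execution to the next.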


##  Creating Train and Test Groups Using Pandas

In [None]:
#  Importing Pandas
import pandas as pd

#  Importing LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

In [None]:
#  Reading dataset from CSV file IRIS.csv and assigning to a DataFrame called iris
iris = pd.read_csv('IRIS.csv')
#iris=pd.read_csv('Iris.csv')

#  Displaying dataframe
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [None]:
#  This copy of the dataset has no "Id" column; if yours has one, drop it first:
#  data = iris.drop("Id", axis=1)
data = iris
#  Dropping the "species" column from dataframe data to create a dataframe for all the inputs
data = data.drop("species", axis=1)

#  Displaying dataframe with all the inputs
data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [None]:
#  Assigning the "species" column from the iris dataframe to target (a pandas Series)
target = iris['species']

#  Displaying the values of the target array
target

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: species, Length: 150, dtype: object

### sklearn.preprocessing.LabelEncoder

LabelEncoder is a utility class that encodes target labels as integers between 0 and n_classes-1.

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) into numerical labels.

This transformer should be used to encode target values, i.e. y, and not the input X.
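
As a small illustration of this behavior, the sketch below encodes a short, made-up list of species labels. The list `sample_labels` and the other variable names are hypothetical; only `LabelEncoder`, `fit_transform`, `classes_`, and `inverse_transform` come from scikit-learn.

```python
from sklearn.preprocessing import LabelEncoder

# A short, made-up list of text labels (not taken from the CSV file)
sample_labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()

# fit_transform learns the sorted set of classes and returns their integer codes
encoded = encoder.fit_transform(sample_labels)

print(encoder.classes_)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
print(encoded)           # [2 0 1 0]

# inverse_transform maps the integer codes back to the original text labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded) == sample_labels)  # True
```

The integer assigned to each class is its position in the sorted list of unique labels, which is why the encoded species column matches the numbering used by `load_iris`.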

In [None]:
#  Transforming the text labels in the target array to numeric classes using LabelEncoder
label_encoder=LabelEncoder()
target=label_encoder.fit_transform(target)

#  Displaying the new values in the target array (output)
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
#  Splitting the data (inputs and outputs) into train and test groups
x_train, x_test, ytrain, ytest = train_test_split(data, target)

In [None]:
#  Displaying the shape (rows, columns) of the test arrays
print(x_test.shape, ytest.shape)

#  Displaying the shape (rows, columns) of the train arrays
print(x_train.shape, ytrain.shape)

(38, 4) (38,)
(112, 4) (112,)
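
When the inputs are passed as a DataFrame, `train_test_split` returns DataFrames whose index still holds the original row labels, which makes it possible to trace each selected row back to the full dataset. The sketch below illustrates this under one assumption: since the CSV file is not available here, the DataFrame is built from `load_iris(as_frame=True)` as a stand-in, so its column names differ from those in the CSV. All variable names and the `random_state` value are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the CSV file: build a DataFrame from the bundled dataset
frame = load_iris(as_frame=True).frame

# Separate the inputs (four measurements) from the output (class label)
inputs = frame.drop("target", axis=1)
outputs = frame["target"].to_numpy()

x_tr, x_te, y_tr, y_te = train_test_split(inputs, outputs, random_state=0)

print(type(x_tr).__name__, type(y_tr).__name__)  # DataFrame ndarray
print(x_tr.shape, x_te.shape)                    # (112, 4) (38, 4)

# The DataFrame index keeps the original row labels, so the labels
# looked up by index must match the labels returned by the split
aligned = bool((outputs[x_tr.index.to_numpy()] == y_tr).all())
print(aligned)  # True
```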
