# About the Dataset
> In this classification project, we use the length and width of each flower's sepals and petals to predict which of three iris species the flower belongs to. If the model works well, we can apply it to unseen data and predict the species of a flower without knowing its true class.
<center>
<img src="https://pbs.twimg.com/media/FdVDXfzVUAA2cvw.png" width="400" height="200" />
</center>

# Import libraries

In [28]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# for graphs, visuals, etc.
import matplotlib.pyplot as plt
import seaborn as sn
# interactive plotting (Plotly, with cufflinks for plotting from DataFrames)
import plotly.express as px
import cufflinks as cf
cf.go_offline()


## Import data
- The Iris dataset ships with scikit-learn, so we only need to load it.

In [29]:
from sklearn.datasets import load_iris
data = load_iris()
# print the dataset description from the documentation
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [30]:
# Dataset Keys
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
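The `frame` key listed above is `None` in a default `load_iris()` call. As a minimal sketch, assuming a scikit-learn version that supports the `as_frame` parameter (0.23 or later), the loader can return the features and target as one pandas DataFrame. The variable names `iris` and `frame` are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# 'frame' now holds the four feature columns plus a 'target' column
frame = iris.frame
print(frame.shape)
print(list(frame.columns))
```

This is an alternative to building the DataFrame by hand; the manual route is still useful for seeing how the pieces fit together.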

In [31]:
# What the data looks like
# load_iris returns a dictionary-like Bunch; the measurements and class labels are NumPy arrays stored under 'data' and 'target'
print(data.data[:5])
print(data.feature_names)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [32]:
# and our target (the class label) is:
print(data.target)
print(data.target_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']


# What is our problem and what should we do?
- We have four measurements for each flower (sepal length, sepal width, petal length, and petal width), and we want to use them to predict which class the iris belongs to (setosa, versicolor, or virginica).

  Because the target is one of three discrete classes, this is a classification problem.
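To make the idea of classifying unseen flowers concrete, here is a rough sketch of the overall workflow: hold out part of the data, fit a model on the rest, and score it on the held-out rows. The k-nearest-neighbors model, the 30% test size, and the `random_state` value are assumptions chosen for illustration, not necessarily the choices made later in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# hold out 30% of the rows as "unseen" data; stratify keeps the three classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# k-nearest neighbors is an assumed stand-in model, used only for illustration
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# fraction of held-out flowers whose class was predicted correctly
accuracy = model.score(X_test, y_test)
print(f"accuracy on held-out data: {accuracy:.2f}")
```

Scoring on rows the model never saw during fitting is what supports the claim that it could work on new flowers.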

# Create DataFrame
- We will create a pandas DataFrame because it makes the data easier to work with.

In [33]:
tod = pd.DataFrame(data.data)
tod.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


**The columns are not labeled yet, so we can refer to the scikit-learn documentation for more information about the data and its features.**

In the documentation, the features are listed as:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

**What do these features refer to?**
- The sepals are the green parts around the petals that enclose the flower before it blooms. Their length and width, measured in centimeters, serve as features for predicting the type of iris.
- The petals are the colorful, leaf-like parts of the flower. Their length and width are also measured in centimeters.
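The feature names described above are also stored on the dataset object as `data.feature_names`, so the columns can be labeled when the DataFrame is created. This is a small sketch of that alternative; the name `labeled` is hypothetical. Note that these names carry a `(cm)` suffix, unlike the shorter names this notebook assigns by hand.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# take the column names from the dataset object instead of typing them by hand
labeled = pd.DataFrame(data.data, columns=data.feature_names)
print(labeled.columns.tolist())
```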

In [34]:
# Let's rename the columns after these features so we know what each variable refers to when we model the data later.
tod.columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
# now each column in the DataFrame is labeled with the feature it holds
tod.head()

|   | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [35]:
# build a one-column DataFrame that holds the target
target = pd.DataFrame(data.target)
target = target.rename(columns = {0: 'target'})
print(target.head())
print('the target column has 3 unique values: ',target.target.unique())

   target
0       0
1       0
2       0
3       0
4       0
the target column has 3 unique values:  [0 1 2]


**What do these numbers refer to in the target column?**
- 0 = Iris Setosa
- 1 = Iris Versicolor
- 2 = Iris Virginica
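The code-to-name mapping above does not have to be kept in your head: `data.target_names` stores the names in code order, so indexing it with the integer codes translates them directly. A minimal sketch follows; the `species` Series is a hypothetical addition and is not part of the DataFrame built in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# target_names is a NumPy array, so indexing it with the integer codes
# translates every code to its species name in one step
species = pd.Series(data.target_names[data.target], name="species")
print(species.unique())
```

A readable label column like this is mainly useful for plots and tables; the integer `target` column is what the model is trained on.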

In [36]:
# add the target column to the DataFrame
tod = pd.concat([tod, target], axis = 1)
tod.head()
# This will allow us to see how the features line up with the class determinations

|   | sepal length | sepal width | petal length | petal width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
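With the features and the target side by side, one quick way to see how the measurements line up with the classes is to average each feature per class. The sketch below rebuilds the DataFrame so it can run on its own; assigning `data.target` directly as a column is an assumed shortcut that should give the same result as the concat step, and `class_means` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
tod = pd.DataFrame(
    data.data,
    columns=['sepal length', 'sepal width', 'petal length', 'petal width'],
)
# direct column assignment, used here in place of pd.concat
tod['target'] = data.target

# mean of every feature within each class (0, 1, 2)
class_means = tod.groupby('target').mean()
print(class_means)
```

Large gaps between the class means of a feature suggest that the feature will help separate the classes.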


# Statistical description

In [37]:
# general statistical overview of the data columns
# this is also a good way to check for extreme outliers: if the min or max is far from the mean, investigate the data further
tod.describe()

|   | sepal length | sepal width | petal length | petal width | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |
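Eyeballing the min and max against the mean works for a table this small. For a more systematic check, one option is the 1.5 × IQR rule of thumb sketched below. This rule is an assumption introduced here for illustration, not a method the notebook itself applies, and the names `features` and `is_outlier` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
features = pd.DataFrame(data.data, columns=data.feature_names)

# interquartile range per column, from the same quartiles describe() reports
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# flag values more than 1.5 * IQR below the first or above the third quartile
is_outlier = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

# number of flagged values in each column
print(is_outlier.sum())
```

A flagged value is only a candidate for a closer look; in measured biological data it may well be a genuine observation.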


# Data Cleaning
> It is important to look through your data, make sure it is clean, and begin exploring the relationships between the features and the target variable. Since this is a relatively simple dataset, not much cleaning is needed, but it is still important to go through these steps every time.

In [38]:
print('see what type of data you have:\n',tod.dtypes)
print('<---------------------------------------------------->')
print('check for missing values:\n',tod.isnull().sum())

see what type of data you have:
 sepal length    float64
sepal width     float64
petal length    float64
petal width     float64
target            int64
dtype: object
<---------------------------------------------------->
check for missing values:
 sepal length    0
sepal width     0
petal length    0
petal width     0
target          0
dtype: int64
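Beyond data types and missing values, two other routine checks are repeated rows and class balance. The sketch below shows one possible way to run them. It loads the data through `as_frame=True` (assuming scikit-learn 0.23 or later) so that it runs on its own, and the names `n_duplicates` and `class_counts` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

frame = load_iris(as_frame=True).frame

# rows that repeat an earlier row exactly (features and target alike)
n_duplicates = int(frame.duplicated().sum())
print("duplicate rows:", n_duplicates)

# number of samples per class; a balanced dataset shows equal counts
class_counts = frame["target"].value_counts().sort_index()
print(class_counts)
```

Repeated rows are not necessarily errors when measurements are rounded to one decimal place, so whether to drop them is a judgment call.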


# Exploratory Data Analysis (EDA)