## A step-by-step machine learning project with the Iris Flower dataset
#### The Iris flower data set is a set of measurements published by Ronald Fisher, a statistician and biologist, in 1936. It describes particular biological characteristics of various types of Iris flowers, specifically, the length and width of both the petals and the sepals.
![image.png](attachment:image.png)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

## 1. Load the dataset

In [None]:
# load the iris dataset
iris_data = pd.read_csv('../input/Iris.csv')

## 2. Quick dataset summary

### 2.0 Do we have missing data?

In [None]:
# my personal reusable function for detecting missing data
def missing_value_describe(data):
    # check missing values in training data
    missing_value_stats = (data.isnull().sum() / len(data)*100)
    missing_value_col_count = sum(missing_value_stats > 0)
    missing_value_stats = missing_value_stats.sort_values(ascending=False)[:missing_value_col_count]
    print("Number of columns with missing values:", missing_value_col_count)
    if missing_value_col_count != 0:
        # print out column names with missing value percentage
        print("\nMissing percentage (descending):")
        print(missing_value_stats)
    else:
        print("No missing data!!!")
missing_value_describe(iris_data)

### Great!!! We have no missing data
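
Since the Iris data has no gaps, the check above never shows its "missing" branch. The sketch below runs the same percentage calculation on a small made-up frame (`toy` and its values are hypothetical, not Iris rows) to show what the report would contain if some values were missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Iris data) with a few gaps
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, 4.6],
    "SepalWidthCm": [3.5, 3.0, np.nan, np.nan],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that have gaps, largest percentage first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Only the two columns with gaps are listed; `PetalLengthCm` is complete, so it is filtered out.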

In [None]:
# take a peek
iris_data.head()

### Id is the unique identifier for each flower. In this machine learning project, it will not help with our model's training and testing. Let's drop the Id column first

In [None]:
iris_data = iris_data.drop(['Id'], axis=1)
iris_data.columns

### From the printed column names, we can see the "Id" column is dropped now.

### 2.1 Dimension of the dataset

In [None]:
# dimension
print("the dimension:", iris_data.shape)

### We can see we have a dataset with 150 observations, and each observation has 5 columns (the "Id" column was dropped above).
### 4 of the columns are numeric attributes we can use to train machine learning models, and the last column is the label of a given flower.

### 2.2 Statistical summary using .describe()

In [None]:
print(iris_data.describe())

### Let's interpret the above statistical description of our dataset:
### The description shows that every attribute has a low std (standard deviation), below 2 cm in each case
### The range of SepalLengthCm is: 4.3 - 7.9
### The range of SepalWidthCm is: 2.0 - 4.4
### The range of PetalLengthCm is: 1.0 - 6.9
### The range of PetalWidthCm is: 0.1 - 2.5
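
The ranges above were read off the `min` and `max` rows of `.describe()` by eye. As a rough sketch, the same numbers can be computed directly; the `toy` frame here is a hypothetical stand-in with only three rows, chosen so that its extremes match the petal ranges listed above.

```python
import pandas as pd

# Hypothetical toy measurements (not the real Iris rows)
toy = pd.DataFrame({
    "PetalLengthCm": [1.0, 4.5, 6.9],
    "PetalWidthCm": [0.1, 1.5, 2.5],
})

summary = toy.describe()

# Range = max - min, computed per column from the describe() table
value_range = summary.loc["max"] - summary.loc["min"]
print(value_range)
```

Applying the same two lines to `iris_data.describe()` should give the width of each range in one step.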

### 2.3 Distribution of each class
### Since we are predicting the class of a given flower, let's examine the class distribution of this dataset

In [None]:
# class distribution
print(iris_data.groupby('Species').size())

## 3. Explore data with visualization

In [None]:
# import plotting tool
import matplotlib.pyplot as plt

### 3.1 Let's visualize the distribution:

In [None]:
# iris flower dataset class distribution
nameplot = iris_data['Species'].value_counts().plot.bar(title='Flower class distribution')
nameplot.set_xlabel('class',size=20)
nameplot.set_ylabel('count',size=20)

### From the above visualization and the summary, we can see that each class has the same number of observations. Such a balanced dataset is ideal for a machine learning project.

### 3.2 Box and Whisker plot:
### We will use it to see how the values are distributed in each attribute

In [None]:
# box and whisker plots
iris_data.plot(kind='box', subplots=True, layout=(2,2), 
               sharex=False, sharey=False, title="Box and Whisker plot for each attribute")
plt.show()

### 3.3 Histogram:
### A histogram is a very important tool for visualizing the dataset's value distribution.

In [None]:
# plot histogram
iris_data.hist()
plt.show()

### The above Box and Whisker plot and histogram show that 2 of the attributes have an approximately normal distribution. Many machine learning algorithms assume normally distributed features, so we can use this distribution to model our data.

### 3.4 Multivariate scatter plot:

### Multivariate scatter plot helps us to visualize the pair-wise relationship in our dataset

In [None]:
import seaborn as sns
sns.set(style="ticks")
sns.pairplot(iris_data, hue="Species")

### In the above scatter plot, we can see that PetalWidthCm and PetalLengthCm have the strongest pair-wise relationship for classification. The classes are clearly separated in the pair-wise scatter plot between PetalWidthCm and PetalLengthCm

## 4. Data Modeling:
### Classification problem: our goal is to predict the flower 'Species' from 4 given features: 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', and 'PetalWidthCm'.

### 4.1 Train-Test Split:
### We will use scikit-learn to split the data into random train and test subsets for training and testing the machine learning models.
### Our X will be the features of the flowers and Y will be the labels of the flowers

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# we will split the data into 80% training data and 20% testing data with a random seed of 7
X = iris_data.drop(['Species'], axis=1)
Y = iris_data['Species']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)

In [None]:
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
print("Y_train.shape:", Y_train.shape)
print("Y_test.shape:", Y_test.shape)
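
A purely random split does not guarantee that the 20% test set keeps the equal class balance seen earlier. One optional variation, sketched below, is to pass `stratify` to `train_test_split`. This is not what the split above does. The sketch also uses scikit-learn's bundled copy of Iris (`load_iris`) instead of `../input/Iris.csv`, so that it runs on its own; the `*_demo`, `X_tr`, and `X_te` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# scikit-learn's bundled copy of Iris stands in for ../input/Iris.csv here
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# number of test flowers per class
test_class_counts = np.bincount(y_te)
print(test_class_counts)
```

With a stratified split, each species contributes the same number of flowers to the test set, which makes per-class test metrics easier to compare.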

### 4.2 Models Building
### Let's build multiple machine learning models to evaluate how they will perform on our classification problem

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

### Train the models and evaluate them with 10-fold cross validation

In [None]:
# models
models = []

# linear models
models.append(('LR', LogisticRegression(solver='liblinear', multi_class="auto")))
models.append(('LDA', LinearDiscriminantAnalysis()))

# nonlinear models
models.append(('CART', DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('SVC', SVC(gamma="auto")))

# evaluate each model in turn
print("Model Accuracy:")
names = []
accuracy = []
for name, model in models:
    # 10-fold cross validation to evaluate the model; the folds are unshuffled,
    # and recent scikit-learn versions reject random_state unless shuffle=True
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    
    # display the cross validation results of the current model
    names.append(name)
    accuracy.append(cv_results)
    msg = "%s: accuracy=%f std=(%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
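
To make the 10-fold procedure more concrete, here is a minimal sketch of what `KFold` hands to `cross_val_score` on each round. The array `rows` is a placeholder with 120 entries, which matches the size of the 80% training split; no model is trained here.

```python
import numpy as np
from sklearn.model_selection import KFold

# 120 placeholder rows, matching the size of the 80% training split
rows = np.arange(120)

kfold_demo = KFold(n_splits=10)
fold_sizes = []
for train_idx, valid_idx in kfold_demo.split(rows):
    # each round holds out one fold for validation and trains on the rest
    fold_sizes.append((len(train_idx), len(valid_idx)))

print(len(fold_sizes), fold_sizes[0])
```

Every row is used for validation exactly once, so the mean of the 10 scores summarizes performance over the whole training set, and their standard deviation shows how sensitive a model is to the particular fold.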

### Let's visualize the cross-validation accuracy with a box plot

In [None]:
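# one column of fold accuracies per model, drawn as one box each
cv_scores = pd.DataFrame(dict(zip(names, accuracy)))
ax = sns.boxplot(data=cv_scores)
ax.set_title('Model Accuracy Comparison')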

### From the above box plot, we can see that the accuracy of the KNN, GNB, and SVC models has a small deviation, although the GNB model's lowest accuracy score is near 0.825.

### Let's test the KNN, GNB, and SVC models on the test data and output their accuracy together with a confusion matrix to select a model

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# models
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('SVC', SVC(gamma="auto")))

### We will evaluate the test results with the [accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), [confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/), and [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) from scikit-learn

In [None]:
# reusable function to test our model
def test_model(model):
    model.fit(X_train, Y_train) # train on the whole training set
    predictions = model.predict(X_test) # predict on test set
    
    # output model testing results
    print("Accuracy:", accuracy_score(Y_test, predictions))
    print("Confusion Matrix:")
    print(confusion_matrix(Y_test, predictions))
    print("Classification Report:")
    print(classification_report(Y_test, predictions))

In [None]:
# predict values with our test set
for name, model in models:
    print("----------------")
    print("Testing", name)
    test_model(model)

### The highest testing accuracy is 0.93, from the Support Vector Classifier.
### The SVC's confusion matrix has the highest diagonal values, which indicates that SVC predicted the class type better than the other 2 models.
### From the above confusion matrices and classification reports, the SVC model is the best model for our classification problem.
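
The link between a confusion matrix's diagonal and accuracy can be checked on a tiny example. The labels below are made up for illustration and are not the notebook's results: rows of the matrix are the true classes, columns are the predicted classes, so the diagonal counts the correct predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, made up for illustration (not the notebook's results)
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
y_true = ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3
y_pred = (
    ["Iris-setosa"] * 3
    + ["Iris-versicolor", "Iris-versicolor", "Iris-virginica"]
    + ["Iris-virginica"] * 3
)

cm = confusion_matrix(y_true, y_pred, labels=labels)

# correct predictions sit on the diagonal, so accuracy = trace / total
diag_accuracy = np.trace(cm) / cm.sum()
print(cm)
print(diag_accuracy, accuracy_score(y_true, y_pred))
```

The single off-diagonal entry is the one versicolor flower predicted as virginica. Comparing models by their diagonals is therefore equivalent to comparing their accuracy, while the off-diagonal cells show which classes get confused.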

## Conclusion:
### This kernel described and explored the classic Iris dataset with data visualizations. We also experimented with 6 machine learning models: 2 linear and 4 non-linear models.
### I examined the training results with 10-fold cross validation and chose SVC as the best model based on its test accuracy, confusion matrix, and classification report.