<div class="alert alert-info" style="background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Iris Data classification</h2>
</div>

## Table of contents 
1. Importing Libraries
2. Load Data
3. Understand the Data
4. Data Visualization
5. Train-Validation Split

<div class="alert alert-info" style="background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Importing Libraries</h2>
</div>

In [1]:
import os
import requests
import sys

import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score 
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


<div class="alert alert-info" style="background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Load Data</h2>
</div>

In [2]:
def read_data():
    
    ''' Read the iris data from data/iris.csv. If the local file is missing or unreadable,
    download the dataset from the UCI repository, cache it as data/iris.csv and return it.'''
    
    colnames = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
    try:
        df = pd.read_csv('data/iris.csv')
        df = df[colnames]
    except Exception:
        try:
            data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
            # Column names are documented at https://archive.ics.uci.edu/ml/datasets/iris

            r = requests.get(data_url)
            # Stop on an HTTP error instead of caching an error page as CSV
            r.raise_for_status()
            # Create the data directory if it does not exist yet
            os.makedirs('data/', exist_ok=True)
            with open("data/iris.csv", 'wb') as f:
                f.write(r.content)

            # The raw file has no header row, so the column names are supplied here
            df = pd.read_csv('data/iris.csv', names=colnames)
            # Rewrite the cached file with a header row and without the index column
            df.to_csv('data/iris.csv', index=False)
        except Exception as e:
            print('Error in downloading the Dataset:', e)
            raise
    return df
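
The raw UCI file has no header row, which is why the download branch passes `names=` to `pd.read_csv`. A minimal sketch of that behaviour, using two hand-typed rows held in memory (the `raw` string and the `sample` name are illustrative assumptions, not part of the notebook):

```python
import io

import pandas as pd

# Two records laid out like the raw UCI file: comma-separated, no header line.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"

colnames = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']

# With names=, every line is read as data and the columns get these labels.
# Without it, pandas would use the first record as the header.
sample = pd.read_csv(io.StringIO(raw), names=colnames)
print(sample.shape)
print(sample.dtypes)
```
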

<div class="alert alert-info" style="background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Understand the Data </h2>
</div>

The Iris data has 150 instances and 5 attributes: four independent variables (sepal length, sepal width, petal length and petal width) and one dependent variable (Species).

All four independent variables are measured on the same scale (centimeters) and have similar ranges, between 0 and 8 centimeters. There are three duplicate observations, so removing them reduces the dataset to 147 instances. No null values were found. A few outliers are replaced with the median rather than removed, because the sample size is very small.

The dependent variable has three unique values with 50 instances each. After removing duplicates, no class differs in size from another by more than 4%. The threshold depends on the dataset, but a class is usually considered imbalanced only when it is about 30% to 60% larger than another, so the classes here can be treated as balanced.
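
The checks described above (duplicates, null values and class balance) can be sketched on a small made-up frame. The `toy` data below is an assumption for illustration only and is not the iris data:

```python
import pandas as pd

# Hypothetical miniature dataset: rows 1 and 4 repeat rows 0 and 3.
toy = pd.DataFrame({
    'PetalLengthCm': [1.4, 1.4, 4.7, 5.1, 5.1, 6.0],
    'Species': ['setosa', 'setosa', 'versicolor',
                'virginica', 'virginica', 'virginica'],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Total number of missing values across all columns.
n_nulls = int(deduped.isnull().sum().sum())

# Instances per class, used to judge class balance.
class_counts = deduped['Species'].value_counts()

print(n_duplicates, len(deduped), n_nulls)
print(class_counts)
```
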

In [3]:
def data_shape(df):
    
    ''' return shape of dataset'''
    
    print(df.shape)
    return df.shape

In [4]:
def duplicate_drop(df):
    
    ''' if there are duplicate records this function will drop them'''
    
    try:       
        df_duplicate = df[df.duplicated()]
        if len(df_duplicate)>0:
            df = df.drop_duplicates().reset_index(drop=True)
        else:
            pass
    except Exception as e:
        print('Error when addressing Duplicates: ', e)
    return df

In [5]:
def remove_outlier(rw):
    
    ''' Replace outliers with the median. Values above 75 percentile + 1.5*InterQuartileRange(IQR) 
     or lower than 25 percentile - 1.5*InterQuartileRange(IQR) are outliers.
     The input Series is modified in place and also returned.'''
    
    try:        
        upper = rw.quantile(0.75)
        lower = rw.quantile(0.25)
        iqr = upper - lower
        factor = iqr*1.5
        # Strict comparisons: values exactly on a fence are kept, as the docstring states
        cond = (rw > upper+factor) | (rw < lower-factor)
        rw.loc[cond] = rw.median()
    except Exception as e:
        print('Error when addressing Outliers: ', e)
    return rw
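
As a worked example of the IQR rule in the docstring, the sketch below applies it to a made-up series with one extreme value. The numbers and variable names are illustrative assumptions. It uses `Series.mask` to return a new series rather than editing the input in place:

```python
import pandas as pd

# Hypothetical measurements; 100.0 is the obvious extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

q1 = values.quantile(0.25)   # 25th percentile
q3 = values.quantile(0.75)   # 75th percentile
iqr = q3 - q1

# Fences: anything beyond these limits counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)

# Replace flagged values with the median of the original series.
cleaned = values.mask(is_outlier, values.median())

print(lower_fence, upper_fence)
print(cleaned.tolist())
```

Here the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5. Only 100.0 falls outside them and is replaced by the median, 5.5.
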

<div class="alert alert-info" style="background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>4. Data Visualization </h2>
</div>

It looks like perhaps two of the input variables have a Gaussian distribution. Pandas, Matplotlib and Seaborn are the libraries used for the plots.

In [20]:
def data_visualization(df):   
    ''' show a few visualization plots to better understand the data''' 
    try:      
        print('Plotting BOX PLOT')
        df.plot(kind='box', subplots=True, layout=(2,3), sharex=False, sharey=False)
        plt.show()
        
        print('Plotting HISTOGRAM')
        df.hist()
        plt.show()
        
        print('Plotting HEATMAP to see Correlations')
        plt.figure(figsize=(12,10)) 
        # Correlate numeric columns only; the Species column holds strings
        sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='RdYlGn')
        plt.show()
        
        print('Plotting Scatter Plot for data variations') 
        sns.pairplot(df, hue='Species', markers='*')
        plt.show()
    except Exception as e:
        print('Error in plotting Visualizations: ', e)

<div class="alert alert-info" style="background-color:#006a79; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>5. Train-Validation Split </h2>
</div>

The loaded dataset is split into two parts: 80% is used to train the models and 20% is held back as a validation dataset.

In [21]:
def Data_split(df):  
    '''split the dataset into training and validation sets''' 
    try:
        # Select by position so the feature array keeps a numeric dtype
        X = df.iloc[:, 0:4].values
        Y = df.iloc[:, 4].values
        validation_size = 0.20
        seed = 7
        X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
    except Exception as e:
        print('Error in creating training and validation sets :', e)
        raise
    return X_train, X_validation, Y_train, Y_validation
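
To show what an 80/20 split yields on 147 rows, here is a minimal sketch on synthetic data of the same shape. The random features and the labels `'a'`, `'b'`, `'c'` are assumptions for illustration. The `stratify=Y` argument is an optional addition that the function above does not use; it keeps the class proportions equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 147 rows, 4 numeric features, 3 classes of 49 rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(147, 4))
Y = np.repeat(['a', 'b', 'c'], 49)

# Same test_size and random_state as Data_split; stratify is the optional extra.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7, stratify=Y
)

print(X_train.shape, X_validation.shape)

# Class counts in the validation part.
labels, counts = np.unique(Y_validation, return_counts=True)
print(dict(zip(labels, counts)))
```

With 147 rows, scikit-learn rounds the validation part up to 30 rows, which leaves 117 rows for training.
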