# **Iris Dataset Information**

Iris is a genus of 260–300 species of flowering plants with showy flowers. The name is also widely used as a common name for all Iris species, as well as for some species belonging to closely related genera. Irises are extensively grown as ornamental plants in home and botanical gardens. The flower colors include white, pink, orange, purple, and lavender.

**The data contains the following attributes:**
1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
5. Species: 1 - Iris Setosa, 2 - Iris Versicolor, 3 - Iris Virginica



Load the important required libraries

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Load the dataset now

In [None]:
iris = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")

Check the first and last records of the dataset.

In [None]:
iris.head(2)

In [None]:
iris.tail(2)

Find the shape of the data: the total number of rows and columns.

In [None]:
iris.shape

Find out the information about the data.

In [None]:
iris.info()

# **Preview of Data**
1. There are 150 observations with 4 features each (sepal length, sepal width, petal length, petal width).
2. There are no null values, so we don't have to worry about that.
3. There are 50 observations of each species (setosa, versicolor, virginica).
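
The three points above can be checked directly. This is a minimal sketch that assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV loaded in this notebook, so the column names differ from those in `IRIS.csv`:

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
frame = load_iris(as_frame=True).frame

n_rows, n_cols = frame.shape                  # 4 feature columns plus 'target'
n_missing = int(frame.isnull().sum().sum())   # total null values in the table
class_counts = frame["target"].value_counts().sort_index()

print(n_rows, n_cols - 1)       # observations, features
print(n_missing)                # number of null values
print(class_counts.tolist())    # observations per species
```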

Let's check count for each species.

In [None]:
iris['species'].value_counts()

* We can see here that the Iris dataset is balanced: it consists of 150 instances, 50 for each species.

**There are 3 classes (target labels):**
* Iris Setosa
* Iris Versicolor
* Iris Virginica


Some Exploratory Data Analysis With Iris

In [None]:
# The first plot call returns the Axes; the other species are drawn on the same Axes
ax = iris[iris.species=='Iris-setosa'].plot(kind='scatter',x='sepal_length',y='sepal_width',color='orange', label='Setosa')
iris[iris.species=='Iris-versicolor'].plot(kind='scatter',x='sepal_length',y='sepal_width',color='blue', label='versicolor',ax=ax)
iris[iris.species=='Iris-virginica'].plot(kind='scatter',x='sepal_length',y='sepal_width',color='yellow', label='virginica',ax=ax)
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.set_title("Sepal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()




The above graph shows the relationship between sepal length and sepal width. Now we will check the relationship between petal length and petal width.

In [None]:
# The first plot call returns the Axes; the other species are drawn on the same Axes
ax = iris[iris.species=='Iris-setosa'].plot(kind='scatter',x='petal_length',y='petal_width',color='orange', label='Setosa')
iris[iris.species=='Iris-versicolor'].plot(kind='scatter', x='petal_length',y='petal_width',color='blue', label='versicolor',ax=ax)
iris[iris.species=='Iris-virginica'].plot(kind='scatter', x='petal_length',y='petal_width',color='yellow',label='virginica',ax=ax)
ax.set_xlabel("Petal Length")
ax.set_ylabel("Petal Width")
ax.set_title("Petal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()


As we can see, the petal features give a better cluster division than the sepal features. This indicates that the petals may support better and more accurate predictions than the sepals. We will check that later.
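
One way to put a number on this visual impression is a per-feature ANOVA F-score, which compares the spread between species with the spread within each species. The following is a sketch rather than a step from the original analysis; it assumes scikit-learn's `f_classif` and its bundled Iris copy instead of `IRIS.csv`:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
data = load_iris()

# f_classif returns one F-score and one p-value per feature column
f_scores, p_values = f_classif(data.data, data.target)
for name, score in zip(data.feature_names, f_scores):
    print(f"{name:20s} F = {score:.1f}")
```

A larger F-score means the species means lie further apart relative to the scatter within each species.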

Now let us see how the lengths and widths are distributed.

In [None]:
iris.hist(edgecolor="black", linewidth=1.4)
fig=plt.gcf()
fig.set_size_inches(12,8)
plt.show()

Now let us see how the length and width vary according to the species

In [None]:
plt.figure(figsize=(10,8))
plt.subplot(2,2,1)
sns.violinplot(x='species',y='petal_length',data=iris)
plt.subplot(2,2,2)
sns.violinplot(x='species',y='petal_width',data=iris)
plt.subplot(2,2,3)
sns.violinplot(x='species',y='sepal_length',data=iris)
plt.subplot(2,2,4)
sns.violinplot(x='species',y='sepal_width',data=iris)

In [None]:
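# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(iris,hue="species",height=6).map(sns.histplot,"petal_length",kde=True,stat="density").add_legend();

From the above plot, we see that on the basis of petal length setosa is separable, while the other two species overlap.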

In [None]:
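# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(iris,hue="species",height=6).map(sns.histplot,"petal_width",kde=True,stat="density").add_legend();

From the above plot, we see that on the basis of petal width setosa is separable, while the other two species overlap.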

In [None]:
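# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(iris,hue="species",height=6).map(sns.histplot,"sepal_length",kde=True,stat="density").add_legend();

From the above plot, we see that on the basis of sepal length all species overlap.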

In [None]:
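# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(iris,hue="species",height=6).map(sns.histplot,"sepal_width",kde=True,stat="density").add_legend();

From the above plot, we see that on the basis of sepal width all species overlap heavily.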
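
The overlap described for the four distribution plots can also be read off the per-species minimum and maximum of each feature. This is a small sketch, assuming scikit-learn's bundled Iris copy (whose column names, such as `petal length (cm)`, differ from those in `IRIS.csv`):

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
data = load_iris(as_frame=True)
df = data.frame.copy()

# Replace the integer target with readable species names
names = [str(n) for n in data.target_names]
df["species"] = df["target"].map(dict(enumerate(names)))

# Minimum and maximum of every feature within each species
ranges = df.drop(columns="target").groupby("species").agg(["min", "max"])
print(ranges.T)
```

If one species' maximum for a feature lies below another species' minimum, that feature separates the two completely; otherwise their ranges overlap.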

In [None]:
sns.set_style('whitegrid')
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(data=iris,hue='species',palette=['orange','blue','yellow'])

**From the above plot we can see that:**

* In the case of sepal length and sepal width, setosa is easily separable, but versicolor and virginica have some overlap.
* In the case of petal length and petal width, all the species are quite separable, so these are the useful features for distinguishing flower types.

In [None]:
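plt.figure(figsize=(7,5))
# Drop the non-numeric species column; recent pandas versions raise an error on it in corr()
sns.heatmap(iris.drop(columns='species').corr(),annot=True,cmap='cubehelix_r')
# draws a heatmap from the correlation matrix of the four numeric features
plt.show()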

The correlation between sepal width and sepal length is very low, whereas the correlation between petal width and petal length is very high.
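
The values in the heatmap are Pearson correlation coefficients (the pandas default for `corr()`): the covariance of two columns divided by the product of their standard deviations. The sketch below computes this by hand for the two feature pairs; the helper `pearson` is a hypothetical name introduced for illustration, and scikit-learn's bundled Iris copy is assumed in place of `IRIS.csv`:

```python
import numpy as np
from sklearn.datasets import load_iris

def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

# Columns: sepal length, sepal width, petal length, petal width
X = load_iris().data
r_sepal = pearson(X[:, 0], X[:, 1])
r_petal = pearson(X[:, 2], X[:, 3])
print(f"sepal length vs sepal width: {r_sepal:.2f}")
print(f"petal length vs petal width: {r_petal:.2f}")
```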

We will use all the features for training the algorithm and check the accuracy.

Import all the necessary packages for the various classification algorithms.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn import svm 
from sklearn.neighbors import KNeighborsClassifier


Splitting The Data into Training And Testing Dataset

In [None]:
train, test = train_test_split(iris, test_size = 0.3)
print(train.shape)
print(test.shape)
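
The split above is random, so the rows (and therefore the accuracies further down) change on every run, and the three species need not be equally represented in the test set. A possible variation, shown here as a sketch on scikit-learn's bundled Iris copy, fixes the seed with `random_state` and keeps the class balance with `stratify`; the seed value 0 is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X, y = load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify keeps species proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # test rows per species
```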

In [None]:
train_X = train[['sepal_length','sepal_width','petal_length','petal_width']]
train_y=train.species
test_X= test[['sepal_length','sepal_width','petal_length','petal_width']]
test_y =test.species

Let's check the train and test datasets.

In [None]:
train_X.head(2)

In [None]:
test_X.head(2)

In [None]:
train_y.head()

Logistic Regression

In [None]:
model=LogisticRegression()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(test_y,prediction))  # arguments: true labels, then predictions

Decision Tree

In [None]:
model=DecisionTreeClassifier()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(test_y,prediction))  # arguments: true labels, then predictions

K-Nearest Neighbours

In [None]:
model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the K-Nearest Neighbours is',metrics.accuracy_score(test_y,prediction))  # arguments: true labels, then predictions

K-Nearest is giving very good accuracy.
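
The accuracy of a single random split depends on which rows land in the test set, and `n_neighbors=3` is only one possible setting. As a sketch (not part of the original analysis), 5-fold cross-validation over a few values of `k` gives a steadier picture; the candidate values and scikit-learn's bundled Iris copy are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X, y = load_iris(return_X_y=True)

mean_scores = {}
for k in (1, 3, 5, 7, 9):
    # Accuracy on each of 5 folds, averaged into one score per k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean accuracy {mean_scores[k]:.3f}")
```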

Support Vector Machine (SVM)

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the SVM is',metrics.accuracy_score(test_y,prediction))  # arguments: true labels, then predictions

# **Observations:**

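* The plots suggest that training on the petal features should give much better accuracy than training on the sepal features. Note that the models above were trained on all four features, so this comparison is not measured directly by them.
* This is consistent with the heatmap above, where the correlation between sepal width and sepal length was very low whereas the correlation between petal width and petal length was very high.
* Thank You!!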
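
The petal-versus-sepal comparison can be tested by training the same model on each pair of features separately. This is a minimal sketch under stated assumptions: scikit-learn's bundled Iris copy instead of `IRIS.csv`, logistic regression with `max_iter=1000`, and 5-fold cross-validation rather than the single split used above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X, y = load_iris(return_X_y=True)

# Columns 0-1 are the sepal measurements, columns 2-3 the petal measurements
feature_sets = {"sepal": X[:, :2], "petal": X[:, 2:]}

mean_accuracy = {}
for name, features in feature_sets.items():
    model = LogisticRegression(max_iter=1000)
    mean_accuracy[name] = cross_val_score(model, features, y, cv=5).mean()
    print(f"{name} features only: mean accuracy {mean_accuracy[name]:.3f}")
```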

**Modelling with PCA**

In [None]:
x = iris.drop(['species'],axis=1)
y = iris.species

In [None]:
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)

In [None]:
iris_cov_matrix = np.cov(x.T)
iris_cov_matrix

In [None]:
eig_vals, eig_vecs = np.linalg.eig(iris_cov_matrix)
print('\nEigenvalues \n%s' %eig_vals)
print('Eigenvectors \n%s' %eig_vecs)

In [None]:
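eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) pairs from largest to smallest eigenvalue
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])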

In [None]:
total = sum(eig_vals)
var_exp = [(i / total)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print('Variance captured by each component is \n',var_exp)
print('Cumulative variance captured as we travel with each component \n',cum_var_exp)

In [None]:
from sklearn import decomposition

pca = decomposition.PCA(n_components=2)

x_transform = pca.fit_transform(x)

In [None]:
pc_df = pd.DataFrame(data = x_transform, columns = ['PC1', 'PC2'])
pc_df['species'] = y

In [None]:
pc_df.head()

In [None]:
pca.get_covariance()

In [None]:
explained_variance=pca.explained_variance_ratio_
explained_variance

Together, the first two principal components capture 95.80% of the variance: the first principal component contains 72.77% and the second contains 23.03%. The third and fourth principal components contain the rest of the variance in the dataset.
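
The percentages reported by `explained_variance_ratio_` should match each eigenvalue's share of the total, as computed by hand earlier. The sketch below checks this on scikit-learn's bundled Iris copy (an assumption; its values may differ slightly from `IRIS.csv`, so the printed numbers need not equal those quoted above). It uses `np.linalg.eigh`, which suits symmetric matrices, in place of `np.linalg.eig`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X = StandardScaler().fit_transform(load_iris().data)

# eigh returns eigenvalues in ascending order, so reverse them
eig_vals = np.linalg.eigh(np.cov(X.T))[0][::-1]
ratio_from_eig = eig_vals / eig_vals.sum()

pca = PCA(n_components=2).fit(X)
print(ratio_from_eig[:2])             # share of variance from the eigenvalues
print(pca.explained_variance_ratio_)  # share of variance reported by PCA
```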