# Iris Information

Iris is a genus of 260–300 species of flowering plants with showy flowers. It takes its name from the Greek word for a rainbow, which is also the name of the Greek goddess of the rainbow, Iris. "Iris" is also widely used as a common name for all Iris species, as well as for some species belonging to closely related genera. Irises are extensively grown as ornamental plants in home and botanical gardens. Flower colors include white, pink, orange, purple, and lavender.

Almost 300 species of Iris have been described. For our data science purposes, we will perform exploratory data analysis (EDA) on the following 3 species:

* Setosa
* Versicolor
* Virginica

The flowers are classified by the following features:

* sepal length in cm
* sepal width in cm
* petal length in cm
* petal width in cm

![](https://miro.medium.com/max/3500/1*f6KbPXwksAliMIsibFyGJw.png)
![](https://content.codecademy.com/programs/machine-learning/k-means/iris.svg)

**Let us start working with the Iris dataset**

**Load the important required libraries**

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

**Load the dataset now**

In [None]:
iris = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")

**Checking the first 5 and last 5 records of the dataset**

In [None]:
iris.head(5)

In [None]:
iris.tail(5)

**Finding the shape of the data: the total number of rows and columns**

In [None]:
iris.shape

**Finding information of the data with data types, columns and null values**

In [None]:
iris.info()

**The Iris dataset has 4 features (sepal length, sepal width, petal length, and petal width) and one label (species).**

**There are also no null values in the dataset, so we can go ahead.**
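
The null check can also be made explicit rather than read off `info()`. The following is a minimal sketch on a made-up `toy` frame (not the real file); with the actual data the same calls would be applied to `iris`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, purely for illustration.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan],
    "species": ["setosa", "setosa", "virginica"],
})

# Number of missing entries per column.
missing_per_column = toy.isnull().sum()
# True when any value is missing anywhere in the frame.
has_missing = bool(toy.isnull().values.any())

print(missing_per_column)
print(has_missing)
```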

**Let's check the count for each species**

In [None]:
iris['species'].value_counts()

**We can see here that the Iris dataset is balanced. It consists of 150 instances in 3 classes (labels): Iris Setosa, Iris Versicolor, and Iris Virginica, each with 50 instances.**
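
Balance can be checked programmatically instead of by eye. This is a small sketch using a hypothetical stand-in for the `species` column; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the `species` column (not the real dataset).
toy = pd.DataFrame(
    {"species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3}
)

# Number of rows per class.
counts = toy["species"].value_counts()

# The data is balanced when every class has the same count,
# i.e. the counts contain exactly one distinct value.
is_balanced = counts.nunique() == 1

print(counts)
print("balanced:", is_balanced)
```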

In [None]:
sns.countplot(x='species', data=iris)
plt.show()

**Since the dataset is balanced, every species has the same count.**

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.FacetGrid(iris,hue="species",height=5).map(sns.histplot,"petal_length",kde=True,stat="density").add_legend();

**From the above plot, we see that setosa is separable on the basis of petal length, while the other two species overlap.**

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.FacetGrid(iris,hue="species",height=5).map(sns.histplot,"petal_width",kde=True,stat="density").add_legend();

**From the above plot, we see that setosa is separable on the basis of petal width, while the other two species overlap.**

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.FacetGrid(iris,hue="species",height=5).map(sns.histplot,"sepal_length",kde=True,stat="density").add_legend();

**From the above plot, we see that all species overlap on the basis of sepal length.**

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.FacetGrid(iris,hue="species",height=5).map(sns.histplot,"sepal_width",kde=True,stat="density").add_legend();

**From the above plot, we see that all species overlap heavily on the basis of sepal width.**

In [None]:
plt.figure(figsize=(7,7))
sns.set_style('whitegrid')
sns.scatterplot(x=iris['sepal_length'], y=iris['sepal_width'], hue=iris['species'], palette=['green','orange','dodgerblue'])

**From the above plot of sepal length against sepal width, the setosa variety is easily distinguishable. Versicolor and virginica overlap, so they are harder to distinguish.**

In [None]:
plt.figure(figsize=(7,5))
sns.set_style('whitegrid')
sns.scatterplot(x=iris['petal_length'], y=iris['petal_width'], hue=iris['species'], palette=['green','orange','dodgerblue'])

**From the above plot of petal length against petal width, the setosa variety is easily distinguishable. Versicolor and virginica overlap only slightly, so the petal measurements separate the species better than the sepal measurements.**

In [None]:
sns.set_style('whitegrid')
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(data=iris, hue='species', palette=['green','orange','dodgerblue'])

**From the above plot we can see that:**

1. In the case of sepal length & sepal width, setosa is easily separable, but versicolor & virginica have some overlap.
2. In the case of petal length & petal width, all the species are quite separable, so these are the most useful features for distinguishing flower types.
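
Visual separability can be backed up with a simple numeric check: compare the per-species value ranges of a feature and see whether they intersect. The sketch below uses made-up petal lengths and a hypothetical helper `ranges_overlap`; it is one possible check, not part of the original analysis.

```python
import pandas as pd

# Made-up petal lengths for illustration; not rows from the real dataset.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
    "petal_length": [1.4, 1.7, 3.5, 5.0, 4.8, 6.0],
})

# Minimum and maximum of the feature within each species.
ranges = toy.groupby("species")["petal_length"].agg(["min", "max"])

def ranges_overlap(a, b):
    """True when the [min, max] intervals of species a and b intersect."""
    return bool(
        ranges.loc[a, "min"] <= ranges.loc[b, "max"]
        and ranges.loc[b, "min"] <= ranges.loc[a, "max"]
    )

print(ranges)
print(ranges_overlap("setosa", "versicolor"))
print(ranges_overlap("versicolor", "virginica"))
```

With the real data, the same grouping would be applied to `iris` for each of the four feature columns.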

# Principal Component Analysis (PCA)

In [None]:
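plt.figure(figsize=(7,7))
# correlate only the numeric feature columns; the species label is a string
p=sns.heatmap(iris.drop(columns='species').corr(), annot=True,cmap='RdYlGn')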

**From the above plot, the sepal_width feature seems to be less relevant in explaining the target class than the other features.**

**Modelling with PCA**

In [None]:
x = iris.drop(['species'],axis=1)
y = iris.species

**Scaling the Data**

In [None]:
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)
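
To make the scaling step concrete, here is a minimal NumPy sketch of the same standardization on made-up numbers. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Made-up measurements (rows = flowers, columns = features); not real Iris rows.
X = np.array([[5.0, 3.0, 1.5],
              [6.0, 2.5, 4.5],
              [7.0, 3.5, 6.0],
              [6.0, 3.0, 4.0]])

# Standardize column by column: subtract the mean, divide by the standard deviation.
# ddof=0 is the population standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```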

**Calculate the covariance matrix:**

In [None]:
iris_cov_matrix = np.cov(x.T)
iris_cov_matrix
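
As a sanity check on what `np.cov` computes, the sketch below rebuilds the covariance matrix by hand. The data is synthetic and only stands in for the four scaled features. Because `np.cov` divides by `n - 1` while the scaling above divides by `n`, the diagonal comes out slightly above 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data standing in for the four scaled Iris features.
X = rng.normal(size=(150, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n = X.shape[0]
# np.cov expects variables in rows, hence the transpose; it divides by n - 1.
cov = np.cov(X.T)
# The same matrix written out explicitly (columns already have zero mean).
cov_manual = X.T @ X / (n - 1)

print(np.allclose(cov, cov_manual))
print(np.diag(cov))  # each entry equals n / (n - 1)
```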

**Calculating the eigenvalues and eigenvectors of the covariance matrix:**

In [None]:
eig_vals, eig_vecs = np.linalg.eig(iris_cov_matrix)
print('\nEigenvalues \n%s' %eig_vals)
print('Eigenvectors \n%s' %eig_vecs)
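
A covariance matrix is symmetric, so `np.linalg.eigh` is an alternative worth knowing: it returns real eigenvalues in ascending order. The sketch below uses a small made-up symmetric matrix in place of `iris_cov_matrix` and verifies the defining relation between eigenvalues and eigenvectors.

```python
import numpy as np

# A small symmetric matrix standing in for the covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialised for symmetric matrices: real eigenvalues, ascending order.
vals, vecs = np.linalg.eigh(A)

# Reorder to descending; the columns of `vecs` are reordered to match.
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Each column v of `vecs` satisfies A @ v = lambda * v.
print(vals)
print(np.allclose(A @ vecs, vecs * vals))
```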

**Sorting the list of eigenvalues in descending order:**

In [None]:
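eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# sort the (eigenvalue, eigenvector) pairs by eigenvalue, largest first
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)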
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

**Selecting the number of principal components:**

In [None]:
total = sum(eig_vals)
var_exp = [(i / total)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print('Variance captured by each component is \n',var_exp)
print('Cumulative variance captured as we travel with each component \n',cum_var_exp)

As we can see, the first two components account for nearly 96% of the total variance. If we use only these two components, we can halve the number of columns in the dataset (2 instead of 4).
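
The choice of how many components to keep can be automated with a variance threshold. The helper `n_components_for` below is hypothetical, and the eigenvalues passed to it are illustrative numbers rather than the notebook's output.

```python
import numpy as np

def n_components_for(eigenvalues, threshold=0.9):
    """Smallest number of components whose cumulative variance share reaches `threshold`."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    # Index of the first cumulative share >= threshold, converted to a count.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Illustrative eigenvalues only (not taken from the notebook's output).
example_vals = [2.9, 0.9, 0.15, 0.05]
print(n_components_for(example_vals, threshold=0.9))
```

In the notebook, `eig_vals` would be passed in place of `example_vals`.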

In [None]:
from sklearn import decomposition

pca = decomposition.PCA(n_components=2)

x_transform = pca.fit_transform(x)
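
The projection can also be reproduced by hand from the eigenvectors, which ties this step back to the manual calculation above. This is a sketch on synthetic data standing in for the scaled features. The signs of the resulting columns may differ from scikit-learn's output, since an eigenvector is only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic correlated data standing in for the scaled Iris features.
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix, sorted by descending eigenvalue.
vals, vecs = np.linalg.eigh(np.cov(X.T))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Project onto the two leading eigenvectors to get the PC1/PC2 scores.
W = vecs[:, :2]
scores = X @ W

# The variance of each score column equals the corresponding eigenvalue,
# and the two columns are uncorrelated.
score_cov = np.cov(scores.T)
print(np.allclose(np.diag(score_cov), vals[:2]))
print(np.isclose(score_cov[0, 1], 0.0, atol=1e-8))
```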

In [None]:
pc_df = pd.DataFrame(data = x_transform, columns = ['PC1', 'PC2'])
pc_df['species'] = y

In [None]:
pc_df.head()

In [None]:
pca.get_covariance()

In [None]:
explained_variance=pca.explained_variance_ratio_
explained_variance

**Together, the first two principal components capture 95.80% of the variance. The first principal component accounts for 72.77% of the variance and the second for 23.03%. The third and fourth principal components contain the rest of the variance of the dataset.**

In [None]:
sample_df = pd.DataFrame({'var':pca.explained_variance_ratio_,
             'PC':['PC1','PC2']})
sns.barplot(x='PC',y="var", data=sample_df)

In [None]:
plt.subplots(figsize=(7,7))
sns.scatterplot(data=pc_df, x="PC1", y="PC2", hue='species', s=100)

**Setosa is well separated from the other two classes.
Iris-virginica and Iris-versicolor overlap slightly and could be better separated.**