# Iris Dataset Exploration

## Explore the similarities, differences and relationships among three different Iris species in terms of their sepal and petal widths and lengths

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
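# Note: pandas_profiling has been renamed ydata_profiling in newer releases;
# if this import fails, try `from ydata_profiling import ProfileReport` instead.
from pandas_profiling import ProfileReport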
from sklearn.datasets import load_iris
%matplotlib inline

## Load Dataset, Explore and Display Features

In [None]:
iris = load_iris()
# Stack the 4 feature columns and the target column into one DataFrame
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

In [None]:
iris

In [None]:
iris_df

In [None]:
iris_df['target'] = iris_df['target'].replace([0,1,2],['setosa', 'versicolor', 'virginica'])

In [None]:
iris_df

In [None]:
iris_df.shape

In [None]:
iris_df.info()

In [None]:
iris_df['target'].describe()

In [None]:
iris_df.describe()

### Observations
* The dataset contains 150 observations, with 4 predictive attributes and 1 target variable
* The 4 predictive attributes are numerical; the target variable is categorical
* The target labels were changed from numeric codes (0, 1, 2) to descriptive strings for ease of analysis
* The target variable has 3 unique values: setosa, versicolor and virginica, each a species of Iris
* See the `describe()` table above for numerical measures of the 4 predictive attributes. Note:
    * petal length has the largest range and greatest variation of the 4 attributes, and also has the greatest difference between its mean and median
    * for the other three attributes, the mean is close to the median, which suggests the mean is not strongly affected by outliers or skew
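
The mean-versus-median comparison above can be illustrated with a small standalone sketch. The numbers below are toy values chosen for illustration, not Iris measurements, and `mean_median_gap` is a hypothetical helper name. It shows that a gap between mean and median can come from clustered data as well as from outliers.

```python
from statistics import mean, median


def mean_median_gap(values):
    """Return mean minus median; a gap far from zero hints at skew or clustering."""
    return mean(values) - median(values)


# Toy values (not Iris measurements): one symmetric sample, one with two clusters.
symmetric = [4.0, 5.0, 6.0, 7.0, 8.0]
two_clusters = [1.0, 1.0, 1.0, 5.0, 6.0, 6.0, 7.0]

gap_symmetric = mean_median_gap(symmetric)
gap_clusters = mean_median_gap(two_clusters)

print("symmetric gap:", gap_symmetric)
print("two-cluster gap:", round(gap_clusters, 3))
```

On the notebook's data, the same comparison could be made with `iris_df.mean(numeric_only=True) - iris_df.median(numeric_only=True)`.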

## Explore the dataset using tools and libraries available in Python

In [None]:
profile = ProfileReport(iris_df)
profile

### Observations
* The dataset has no missing values
* The distributions of sepal length and sepal width are approximately normal
* The distributions of petal length and petal width are both bimodal, with two distinct groupings
* Correlation - because the 4 predictive attributes are all numerical, refer to the Pearson's r chart, above:
    * Sepal width and sepal length appear to be only weakly correlated
    * Petal width and petal length appear to be highly correlated (positive correlation)
    * Petal length and sepal length appear to be strongly correlated (positive correlation)
    * Petal width and sepal length also appear to be positively correlated, though less so than petal length and sepal length
* Correlation - see pair plot graphs below for visual confirmation of the above correlation observations
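
For readers unfamiliar with Pearson's r, here is a minimal sketch of how the coefficient is computed from two equal-length lists. The function name `pearson_r` and the toy inputs are illustrative assumptions, not part of the profiling report.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations (unnormalized spreads).
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy inputs (not Iris measurements).
xs = [1.0, 2.0, 3.0, 4.0]
r_up = pearson_r(xs, [2.0, 4.0, 6.0, 8.0])    # perfectly increasing line
r_down = pearson_r(xs, [8.0, 6.0, 4.0, 2.0])  # perfectly decreasing line

print(round(r_up, 6), round(r_down, 6))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. For the notebook's data, `iris_df.drop(columns='target').corr()` should give the full matrix directly.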

## Calculations of dot product, norm and distance

In [None]:
# Calculate the dot product of the sepal length vector and the sepal width vector
s_length = iris_df['sepal length (cm)']
s_width = iris_df['sepal width (cm)']
s_length.dot(s_width)

In [None]:
# Calculate the Euclidean (L2) norm of the sepal length vector
s_length = iris_df['sepal length (cm)']
sepal_length_norm = np.linalg.norm(s_length)
sepal_length_norm

In [None]:
# Calculate the Euclidean distance between the sepal length vector and the sepal width vector
s_length = iris_df['sepal length (cm)']
s_width = iris_df['sepal width (cm)']
dist = np.linalg.norm(s_length - s_width)
dist
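
To make explicit what the three cells above compute, here is a small pure-Python sketch of the same quantities on toy vectors. The helper names `dot`, `norm` and `distance` are illustrative, and the vectors are not Iris data. It also checks the identity that ties the three together: ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b).

```python
import math


def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length: square root of the vector's dot product with itself."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance: norm of the elementwise difference."""
    return norm([x - y for x, y in zip(a, b)])


# Toy vectors (not Iris measurements).
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

lhs = distance(a, b) ** 2
rhs = norm(a) ** 2 + norm(b) ** 2 - 2 * dot(a, b)

print("dot:", dot(a, b))
print("identity holds:", math.isclose(lhs, rhs))
```

Note that the norm and distance of these 150-element columns grow with the number of rows, so they describe the columns as whole vectors rather than any individual flower.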

## Visualizations

In [None]:
# Compare sepal width to sepal length
plt.figure(figsize=(20,12))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', data=iris_df);

In [None]:
# Compare sepal width to sepal length, color coded by petal length
plt.figure(figsize=(20,12))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', data=iris_df, hue='petal length (cm)', palette='Blues_r');

In [None]:
# Compare sepal width to sepal length, color coded by petal width
plt.figure(figsize=(20,12))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', data=iris_df, hue='petal width (cm)', palette='Blues_r');

In [None]:
# Scatter plot of petal length vs petal width
plt.figure(figsize=(20,12))
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', data=iris_df);

In [None]:
# Compare petal width to petal length, color coded by sepal length
plt.figure(figsize=(20,12))
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', data=iris_df, hue='sepal length (cm)', palette='Blues_r');

In [None]:
# Compare petal width to petal length, color coded by sepal width
plt.figure(figsize=(20,12))
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', data=iris_df, hue='sepal width (cm)', palette='Blues_r');

In [None]:
sns.pairplot(iris_df, hue='target', palette="prism_r");

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x = 'target', y = 'sepal length (cm)', data = iris_df);

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x = 'target', y = 'sepal width (cm)', data = iris_df);

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x = 'target', y = 'petal length (cm)', data = iris_df);

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x = 'target', y = 'petal width (cm)', data = iris_df);
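
As a reading aid for the box plots above, the sketch below computes the quantities a box plot typically draws: the quartiles, the interquartile range (IQR), and the fences beyond which points are shown as outliers. It assumes the common 1.5 x IQR rule, which I believe is seaborn's default (`whis=1.5`). The helper `box_stats` and the toy values are illustrative, and quartile conventions differ slightly between libraries, so results may not match a plot exactly.

```python
from statistics import quantiles


def box_stats(values, whis=1.5):
    """Return quartiles, IQR, outlier fences, and the points outside the fences."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Toy values (not Iris measurements): one point sits far above the rest.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 30]
stats = box_stats(toy)

for name, value in stats.items():
    print(f"{name}: {value}")
```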

In [None]:
g = sns.PairGrid(iris_df)
g.map_diag(plt.hist)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot);

### Observations (this is just a start, more to add......)
* Across all species combined, there doesn't appear to be a clear relationship between sepal width and sepal length
* Both the records with shorter petals (1-3 cm in length) and those with narrower petals (0.4-0.8 cm in width) primarily correspond to records with a sepal length less than 6 cm and a sepal width greater than 3 cm

## Conclusions: (these are rough, just getting ideas down)
* Sepal length and sepal width, taken together, are good indicators of the species setosa
* Petal length and petal width, taken together, separate all three species fairly well