# Iris Case Study

### A brief (and incomplete) introduction to Pandas

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy as sp
import sklearn as sk # data mining tools
import matplotlib.pyplot as plt # plotting
import seaborn as sns # advanced plotting
from pandas.plotting import scatter_matrix
import warnings
warnings.filterwarnings("ignore")

## 1. Dataset description
As a first step, we load the whole Iris dataset and familiarize ourselves with its features.

In [None]:
iris = pd.read_csv("data/iris.csv")
iris.head()

In [None]:
iris.describe()

<a id='transform'></a>
## 2. Data Transformation
In this stage, we will clean our data by 
 1. handling missing information, 
 2. creating new features for analysis, and 
 3. converting fields to the correct format for calculations and presentation.

#### Missing values

In [None]:
iris.isnull().sum()

In [None]:
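# Assign the filled column back: calling fillna(inplace=True) on a selected column
# may act on a temporary copy and leave `iris` unchanged in recent pandas versions
iris['sepal_length'] = iris['sepal_length'].fillna(iris['sepal_length'].median())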
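
The median imputation above can be checked on a tiny example. This is a minimal sketch on made-up numbers (the `toy` frame is hypothetical, not the Iris data); it assigns the filled column back to the frame and counts missing values before and after.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers, not the Iris data)
toy = pd.DataFrame({"sepal_length": [5.1, 4.9, np.nan, 6.3, 5.8]})

n_missing_before = toy["sepal_length"].isnull().sum()

# Median of the observed values; NaN entries are skipped by default
median_value = toy["sepal_length"].median()

# Fill the gap and assign the result back to the column
toy["sepal_length"] = toy["sepal_length"].fillna(median_value)

n_missing_after = toy["sepal_length"].isnull().sum()
print(n_missing_before, median_value, n_missing_after)
```

The median is often preferred over the mean here because a few extreme values shift it less.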

#### Feature creation

In [None]:
iris['New_Column'] = iris['sepal_length'] * 100

In [None]:
iris.head()

#### Binning continuous variables

In [None]:
iris['sepal_length_bin'] = pd.qcut(iris['sepal_length'], 4) # qcut: quantile-based bins with roughly equal counts
iris['sepal_width_bin'] = pd.cut(iris['sepal_width'].astype(int), 5) # cut: equal-width bins (astype(int) truncates the widths first)

iris[['sepal_length', 'sepal_length_bin', 'sepal_width', 'sepal_width_bin']].head()

In [None]:
iris['sepal_length_bin'].value_counts()

In [None]:
iris['sepal_width_bin'].value_counts()
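
The difference between the two binning functions is easier to see on skewed data. The sketch below uses a made-up series (`values` is hypothetical, not an Iris column): `pd.qcut` places the edges at quantiles, while `pd.cut` spaces them evenly over the value range.

```python
import pandas as pd

# Made-up, right-skewed values (not the Iris data)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40])

# qcut: bin edges are quantiles, so each bin holds about the same number of rows
by_quantile = pd.qcut(values, 4)

# cut: bin edges are evenly spaced between min and max, so counts can differ a lot
by_width = pd.cut(values, 4)

# sort=False keeps the bins in their natural left-to-right order
quantile_counts = by_quantile.value_counts(sort=False)
width_counts = by_width.value_counts(sort=False)

print(quantile_counts)
print(width_counts)
```

With skewed data, equal-width bins put most rows in the first bin, whereas quantile bins stay balanced.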

#### Feature Reshaping 

Last, but certainly not least, we'll deal with formatting. Our categorical data is imported as objects, which makes mathematical calculations difficult. We will convert the object-typed `class` column to integer codes.

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

iris['class_code'] = label.fit_transform(iris['class'])
iris[['class', 'class_code']].head()
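
Integer codes and dummy (one-hot) variables are two different encodings. The following sketch contrasts them using pandas only, on made-up labels (the `toy` frame and its label spellings are assumptions, not read from `data/iris.csv`). Like `LabelEncoder`, the categorical codes follow the sorted order of the labels.

```python
import pandas as pd

# Made-up class labels (assumed to resemble the values in the 'class' column)
toy = pd.DataFrame({"class": ["setosa", "virginica", "versicolor", "setosa"]})

# Integer codes: categories are sorted alphabetically, then numbered from 0
categories = toy["class"].astype("category")
toy["class_code"] = categories.cat.codes
code_to_label = dict(enumerate(categories.cat.categories))

# Dummy (one-hot) variables: one indicator column per class
dummies = pd.get_dummies(toy["class"], prefix="class")

print(code_to_label)
print(dummies.columns.tolist())
```

Integer codes are compact but imply an ordering between classes; dummy variables avoid that at the cost of extra columns.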

## 3. Exploratory Analysis
Now that our data is cleaned, we will explore our data with descriptive and graphical statistics to describe and summarize our variables. 

### 3.A Features Distributions

To see how the values of a continuous feature are distributed, we can use a KDE (Kernel Density Estimate) plot.

In [None]:
pl = iris['petal_length'].plot.kde()

In [None]:
sl = iris['sepal_length'].plot.kde()

In [None]:
pw = iris['petal_width'].plot.kde()

In [None]:
sw = iris['sepal_width'].plot.kde()
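
To make the idea behind a KDE plot concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate. The helper `gaussian_kde_1d`, the sample values, and the fixed bandwidth of 0.5 are all assumptions chosen for illustration; `plot.kde()` delegates to SciPy and selects a bandwidth automatically unless `bw_method` is passed.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Made-up measurements with two clusters (not the Iris data)
samples = np.array([1.4, 1.5, 1.3, 4.7, 4.5, 5.1, 5.9])
grid = np.linspace(-2.0, 10.0, 1201)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density integrates to 1; approximate the integral with a Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve that follows individual samples, while a larger one smooths the two clusters together.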

#### Conditional Feature Distribution

We can also build KDE plots of a single feature grouped by a categorical variable, which yields one curve per class.

In [None]:
ax = iris.groupby(['class']).sepal_width.plot.kde()
plt.legend()
plt.show()

In [None]:
ax = iris.groupby(['class']).petal_width.plot.kde()
plt.legend()
plt.show()

In [None]:
ax = iris.groupby(['class']).petal_length.plot.kde()
plt.legend()
plt.show()

In [None]:
ax = iris.groupby(['class']).sepal_length.plot.kde()
plt.legend()
plt.show()

### 3.B Histogram plot
We can also use histograms instead of KDE plots to show the binned distribution of each feature.

In [None]:
sx = iris.sepal_length.plot.hist(bins=10)

In [None]:
sx = iris.sepal_width.plot.hist(bins=10)

In [None]:
sx = iris.petal_length.plot.hist(bins=10)

In [None]:
sx = iris.petal_width.plot.hist(bins=10)
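
The numbers behind a histogram can be inspected without plotting. This sketch uses `np.histogram` on made-up values (not the Iris data) and three bins instead of ten to keep the output short.

```python
import numpy as np

# Made-up measurements (not the Iris data)
values = np.array([4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 7.0, 7.9])

# counts[i] is the number of values between edges[i] and edges[i + 1];
# the edges are evenly spaced between the minimum and the maximum
counts, edges = np.histogram(values, bins=3)

print(counts)
print(edges)
```

There is always one more edge than there are bins, and the counts add up to the number of values.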

#### (Conditional, Stacked) histograms
Pandas does not offer a simple way to visualise conditional histograms. <br/> 
To overcome this limitation, we can define a dedicated function as follows:

In [None]:
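def conditional_histogram(df, column):
    """Plot a stacked histogram of `column` with one series per class."""

    # One single-column frame per class; codes 0, 1, 2 are assumed to map to
    # Setosa, Versicolor, Virginica (the sorted order used by LabelEncoder)
    booldf1 = pd.DataFrame(df[df['class_code']==0][column])
    booldf1.columns = ['Setosa']
    booldf2 = pd.DataFrame(df[df['class_code']==1][column])
    booldf2.columns = ['Versicolor']
    booldf3 = pd.DataFrame(df[df['class_code']==2][column])
    booldf3.columns = ['Virginica']
    # Align the three frames side by side on the original row index
    row_concat = pd.concat([booldf1, booldf2, booldf3], axis=1)

    ax = row_concat.plot.hist(stacked=True, alpha=0.6)
    ax.set_xlabel(column)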

In [None]:
conditional_histogram(iris, "sepal_length")

In [None]:
conditional_histogram(iris, "petal_length")
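
The counts that a conditional histogram displays can also be tabulated directly. The sketch below is one possible way to do it, combining `pd.cut` with `pd.crosstab` on a made-up frame (`toy` and the bin edges are assumptions, not taken from the Iris data).

```python
import pandas as pd

# Made-up measurements for two classes (not the Iris data)
toy = pd.DataFrame({
    "class": ["setosa"] * 4 + ["versicolor"] * 4,
    "sepal_length": [4.6, 5.0, 5.1, 5.4, 5.5, 6.0, 6.4, 7.0],
})

# Assign each row to a bin; intervals are closed on the right,
# so 5.0 falls in (4.0, 5.0]
bins = pd.cut(toy["sepal_length"], bins=[4.0, 5.0, 6.0, 7.0])

# Rows are bins, columns are classes, cells are row counts
table = pd.crosstab(bins, toy["class"])
print(table)
```

Each column of the table corresponds to one coloured series in the stacked histogram.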

### 3.C Bar charts
Unlike histograms, which plot quantitative data grouped into bins or intervals, bar charts plot categorical data.

In [None]:
sx = iris.groupby(['class']).petal_length.count().plot.barh()

### 3.D Dispersion and Outliers

A box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers.

In [None]:
sl_box = iris.boxplot(['sepal_length'], showfliers=True)

In [None]:
sw_box = iris.boxplot(['sepal_width'], showfliers=True)

In [None]:
pl_box = iris.boxplot(['petal_length'], showfliers=True)

In [None]:
pw_box = iris.boxplot(['petal_width'], showfliers=True)
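
The quantities drawn in a box plot can be computed by hand. This sketch uses made-up values (not the Iris data) and assumes matplotlib's default whisker rule, under which points farther than 1.5 times the interquartile range from the box are drawn as outliers.

```python
import numpy as np

# Made-up measurements with one low and one high extreme value
values = np.array([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points that still lie inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything beyond the fences is shown as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers)
```

Setting `showfliers=False` in the plotting call hides these points without changing the box or the whiskers.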

#### Conditional box plots

In [None]:
pw_by_class = iris.boxplot(['petal_width'], by=['class'])

In [None]:
pl_by_class = iris.boxplot(['petal_length'], by=['class'])

In [None]:
sl_by_class = iris.boxplot(['sepal_length'], by=['class'])

In [None]:
sw_by_class = iris.boxplot(['sepal_width'], by=['class'])

## 4. Correlations

A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables.

Several types of correlation coefficient exist, each with its own definition, range of applicability, and characteristics. They all take values in the range from −1 to +1, where +1 indicates the strongest possible agreement and −1 the strongest possible disagreement. By default, pandas uses the Pearson correlation.
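
As an illustration of how two coefficients can disagree, the sketch below compares Pearson with the rank-based Spearman coefficient through the `method` argument of `DataFrame.corr`. The data are made up: `y` increases monotonically with `x`, but not linearly.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but along a curve
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

# Pearson (the default) measures linear association
pearson = toy.corr().loc["x", "y"]

# Spearman applies the same idea to ranks, so it measures monotonic association
spearman = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

Spearman reaches its maximum because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a straight line.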

In [None]:
iris = iris.drop(['New_Column', 'sepal_length_bin', 'sepal_width_bin'], axis=1)

#### Correlation matrix

The correlation matrix holds the Pearson correlation coefficients between the columns of a data matrix. That is, the entry in row i and column j of the correlation matrix is the correlation between column i and column j of the original matrix. Note that the diagonal elements of the correlation matrix are all 1, since each is the correlation of a column with itself. The correlation matrix is also symmetric, since the correlation of column i with column j is the same as the correlation of column j with column i.

In [None]:
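# numeric_only=True (pandas >= 1.5) skips the text 'class' column,
# which would otherwise raise an error in pandas 2.x
corr = iris.corr(numeric_only=True)
plt.subplots(figsize=(9, 6))
hm = sns.heatmap(corr,
                 xticklabels=corr.columns.values,
                 yticklabels=corr.columns.values, annot=True)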
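
The properties described above can be verified numerically. This sketch builds a small made-up frame (`toy` is hypothetical, not the Iris data), checks the diagonal and the symmetry, and recomputes one entry from the Pearson formula. It assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import numpy as np
import pandas as pd

# Made-up measurements plus a text column (not the Iris data)
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["setosa", "setosa", "versicolor",
              "versicolor", "virginica", "virginica"],
})

# numeric_only=True leaves the text column out of the matrix
corr = toy.corr(numeric_only=True)

# Pearson coefficient by hand: covariance divided by the product of the spreads
x, y = toy["petal_length"], toy["petal_width"]
dx, dy = x - x.mean(), y - y.mean()
manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

diagonal_is_one = np.allclose(np.diag(corr), 1.0)
is_symmetric = np.allclose(corr, corr.T)
print(diagonal_is_one, is_symmetric, round(manual, 3))
```

Because the matrix is symmetric, a heatmap shows every pair twice; reading only the upper or lower triangle is enough.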

#### Scatter plots

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
sm = scatter_matrix(iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

Scatter plots can also be generated individually.

In [None]:
af = iris.plot.scatter(x='petal_length', y='petal_width')