This analysis examines correlations among the measurements in the Iris data set.

In [1]:
# Importing libraries

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Stats
from scipy import stats
from scipy.stats import pearsonr, linregress

#plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook
%matplotlib inline

In [2]:
# Loading the data
data = sns.load_dataset('iris')

# Taking a peek at the data
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
# Correlation matrix of the numeric columns, to see which measurements are correlated.
# The non-numeric species column is dropped first; recent pandas versions raise an error otherwise.

data.drop('species', axis=1).corr()

,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.11757,0.871754,0.817941
sepal_width,-0.11757,1.0,-0.42844,-0.366126
petal_length,0.871754,-0.42844,1.0,0.962865
petal_width,0.817941,-0.366126,0.962865,1.0


The matrix shows strong positive correlations between sepal_length and petal_length (0.87), between sepal_length and
petal_width (0.82), and between petal_length and petal_width (0.96). It would be interesting to see whether these
correlations hold when the data is broken down by species.
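
A pooled correlation can differ from the correlation inside each group, and can even change sign. The sketch below illustrates this with synthetic data, not the Iris measurements: the group names, centers, and noise levels are made up for illustration, chosen so that two groups each trend upward while the pooled data trends downward.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def make_group(name, x_center, y_center, n=50):
    """Hypothetical helper: one group whose y rises with x around its own center."""
    x = rng.normal(x_center, 0.3, size=n)
    y = y_center + 0.8 * (x - x_center) + rng.normal(0.0, 0.15, size=n)
    return pd.DataFrame({"group": name, "x": x, "y": y})

# Group "b" sits at larger x but smaller y than group "a".
toy = pd.concat(
    [make_group("a", 5.0, 3.5), make_group("b", 6.5, 2.8)],
    ignore_index=True,
)

# Correlation over all rows versus correlation inside each group.
pooled_r = toy["x"].corr(toy["y"])
within_r = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

print("pooled:", round(pooled_r, 3))
print("within:", {k: round(v, 3) for k, v in within_r.items()})
```

The pooled coefficient comes out negative while both within-group coefficients are positive, which is why a per-species breakdown is worth checking before interpreting the pooled matrix.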

In [4]:
# Correlations between the variables, broken down by species.

data.groupby('species').corr()

species,,petal_length,petal_width,sepal_length,sepal_width
setosa,petal_length,1.0,0.33163,0.267176,0.1777
setosa,petal_width,0.33163,1.0,0.278098,0.232752
setosa,sepal_length,0.267176,0.278098,1.0,0.742547
setosa,sepal_width,0.1777,0.232752,0.742547,1.0
versicolor,petal_length,1.0,0.786668,0.754049,0.560522
versicolor,petal_width,0.786668,1.0,0.546461,0.663999
versicolor,sepal_length,0.754049,0.546461,1.0,0.525911
versicolor,sepal_width,0.560522,0.663999,0.525911,1.0
virginica,petal_length,1.0,0.322108,0.864225,0.401045
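virginica,petal_width,0.322108,1.0,0.281108,0.537728
virginica,sepal_length,0.864225,0.281108,1.0,0.457228
virginica,sepal_width,0.401045,0.537728,0.457228,1.0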


Here we can see the correlations within each species. Note that this does not tell us whether the correlations are significant, so let's test their significance next.
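
As a rough sketch of what such a significance test involves: for a Pearson coefficient r computed from n pairs, the usual test statistic is t = r * sqrt((n - 2) / (1 - r^2)), compared against a t distribution with n - 2 degrees of freedom. The example below uses synthetic stand-in data (the means, slope, and noise level are assumptions, not Iris values) and checks that `scipy.stats.pearsonr` reports the same two-sided p-value as the hand calculation.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data: 50 paired measurements, like the rows of one species.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 0.35, size=50)
y = 0.8 * x + rng.normal(0.0, 0.25, size=50)

# pearsonr returns the coefficient and its two-sided p-value.
r, p = stats.pearsonr(x, y)

# The same p-value by hand, from the t statistic with n - 2 degrees of freedom.
n = len(x)
t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
p_manual = 2.0 * stats.t.sf(abs(t), df=n - 2)

print("r = {:.3f}, p = {:.3g}, p by hand = {:.3g}".format(r, p, p_manual))
```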

In [5]:
# Splitting the data into one DataFrame per species so that the correlations
# and their significance can be presented per species for easier viewing
setosa_df = data.loc[data.species == 'setosa']
vers_df = data.loc[data.species == 'versicolor']
vir_df = data.loc[data.species == 'virginica']

In [6]:
# Dropping the species column so that only the numeric measurements remain
setosa_df = setosa_df.drop('species', axis=1)
vers_df = vers_df.drop('species', axis=1)
vir_df = vir_df.drop('species', axis=1)

In [7]:
# Let's see the correlation and significance of the setosa species

df = setosa_df
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])

# df.shape[1] is the number of columns, so every pair of variables is visited.
# pd.ols and DataFrame.icol were removed from pandas; for a one-predictor fit,
# linregress gives the same p-value as the regression F-test.
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        fit = linregress(df.iloc[:, j], df.iloc[:, i])
        pval[i, j] = fit.pvalue

print("Correlation matrix for the Setosa species")
rho


Correlation matrix for the Setosa species



,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,0.742547,0.267176,0.278098
sepal_width,0.742547,1.0,0.1777,0.232752
petal_length,0.267176,0.1777,1.0,0.33163
petal_width,0.278098,0.232752,0.33163,1.0


In [8]:
print("Significance values of the correlation matrix for the Setosa species")
DataFrame(pval, index=rho.index, columns=rho.columns)

Significance values of the correlation matrix for the Setosa species


,sepal_length,sepal_width,petal_length,petal_width
sepal_length,0.0,6.709843e-10,0.060698,0.050526
sepal_width,6.709843e-10,0.0,0.216979,0.103821
petal_length,0.06069778,0.2169789,0.0,0.018639
petal_width,0.05052644,0.1038211,0.018639,0.0


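At the 0.05 level, only two Setosa correlations are significant: sepal_length with sepal_width (p ≈ 6.7e-10) and petal_length with petal_width (p ≈ 0.019). The correlation between sepal_length and petal_width falls just short (p ≈ 0.051).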
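
To read such a table systematically, each distinct pair can be compared against a chosen threshold. The sketch below re-enters the Setosa p-values shown above and lists the pairs below 0.05; the threshold `alpha` and the variable names are illustrative choices, not part of the original notebook.

```python
from itertools import combinations

import pandas as pd

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Setosa p-values copied from the output above.
pvals = pd.DataFrame(
    [
        [0.0, 6.709843e-10, 0.060698, 0.050526],
        [6.709843e-10, 0.0, 0.216979, 0.103821],
        [0.060698, 0.216979, 0.0, 0.018639],
        [0.050526, 0.103821, 0.018639, 0.0],
    ],
    index=cols,
    columns=cols,
)

alpha = 0.05  # assumed significance threshold

# Visit each unordered pair once, skipping the diagonal.
significant = {
    (a, b): pvals.loc[a, b]
    for a, b in combinations(cols, 2)
    if pvals.loc[a, b] < alpha
}

for pair, p in significant.items():
    print(pair, p)
```

This prints the sepal_length/sepal_width and petal_length/petal_width pairs only.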

In [9]:
df = vers_df
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])

# Loop over every pair of columns; linregress replaces the removed pd.ols
# and gives the same p-value for a one-predictor fit.
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        fit = linregress(df.iloc[:, j], df.iloc[:, i])
        pval[i, j] = fit.pvalue

print("Correlation matrix for the Versicolor species")
rho

Correlation matrix for the Versicolor species




,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,0.525911,0.754049,0.546461
sepal_width,0.525911,1.0,0.560522,0.663999
petal_length,0.754049,0.560522,1.0,0.786668
petal_width,0.546461,0.663999,0.786668,1.0


In [10]:
print("Significance values of the correlation matrix for the Versicolor species")
DataFrame(pval, index=rho.index, columns=rho.columns)

Significance values of the correlation matrix for the Versicolor species


,sepal_length,sepal_width,petal_length,petal_width
sepal_length,0.0,8.77186e-05,2.586189e-10,4.035422e-05
sepal_width,8.77186e-05,0.0,2.302168e-05,1.466661e-07
petal_length,2.586189e-10,2.302168e-05,0.0,1.271916e-11
petal_width,4.035422e-05,1.466661e-07,1.271916e-11,0.0


Unlike in the Setosa species, every pair of measured variables is significantly correlated in Versicolor (all p-values are below 0.001).

In [11]:
df = vir_df
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])

# Loop over every pair of columns; linregress replaces the removed pd.ols
# and gives the same p-value for a one-predictor fit.
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        fit = linregress(df.iloc[:, j], df.iloc[:, i])
        pval[i, j] = fit.pvalue

print("Correlation matrix for the Virginica species")
rho

Correlation matrix for the Virginica species




,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,0.457228,0.864225,0.281108
sepal_width,0.457228,1.0,0.401045,0.537728
petal_length,0.864225,0.401045,1.0,0.322108
petal_width,0.281108,0.537728,0.322108,1.0


In [12]:
print("Significance values of the correlation matrix for the Virginica species")
DataFrame(pval, index=rho.index, columns=rho.columns)

Significance values of the correlation matrix for the Virginica species


,sepal_length,sepal_width,petal_length,petal_width
sepal_length,0.0,0.000843,6.661338e-16,0.047981
sepal_width,0.0008434625,0.0,0.003897704,5.6e-05
petal_length,6.661338e-16,0.003898,0.0,0.022536
petal_width,0.04798149,5.6e-05,0.02253577,0.0


As with Versicolor, every pair of measured variables is significantly correlated in Virginica at the 0.05 level, although sepal_length and petal_width only just pass (p ≈ 0.048).
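
The three per-species cells repeat the same loop, so they could be folded into one helper applied to each group. The sketch below is one possible way to do that, not the notebook's own method: `corr_pvalues` is a hypothetical name, it uses `scipy.stats.pearsonr` rather than a regression fit, and it runs on synthetic data with made-up column and species labels so that it does not depend on downloading the Iris file.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def corr_pvalues(df):
    """Hypothetical helper: two-sided Pearson p-values for every pair of columns.

    The diagonal is left at 0.0, matching the tables above.
    """
    cols = list(df.columns)
    pvals = pd.DataFrame(
        np.zeros((len(cols), len(cols))), index=cols, columns=cols
    )
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            _, p = pearsonr(df[a], df[b])
            pvals.loc[a, b] = p
            pvals.loc[b, a] = p
    return pvals

# Synthetic stand-in for the Iris frame: three numeric columns and a label column.
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
toy["species"] = np.repeat(["s1", "s2"], 30)

# One p-value table per species, keyed by species name.
results = {
    name: corr_pvalues(group.drop("species", axis=1))
    for name, group in toy.groupby("species")
}

for name, table in results.items():
    print(name)
    print(table)
```

With the real data, the same dictionary comprehension over `data.groupby('species')` would produce the three tables in one pass.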