# Iris FunTime with Python!

### Let's check out the widely-used [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).  We'll do some cleaning, troubleshooting, and analyzing.  

First, let's import some usual suspects:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
# from sklearn import linear_model, datasets  ## sklearn.datasets contains famous datasets
from sklearn import datasets

%matplotlib inline

In [None]:
iris_dataset = datasets.load_iris() ## load_iris() comes from sklearn.datasets and returns a dict-like Bunch object
type(iris_dataset)

In [None]:
iris = iris_dataset.data ## the feature data
iris_features = iris_dataset.feature_names ## feature (column) names
iris_target = iris_dataset.target ## target data
iris_target_names = iris_dataset.target_names

## print the type of each piece (labels match the variables being inspected)
print('data: %s' % type(iris))
print('feature_names: %s' % type(iris_features))
print('target: %s' % type(iris_target))

Let's see what these species look like IRL while learning how to load an image from URL:

In [None]:
from IPython.display import Image, HTML

In [None]:
print('Iris setosa:')
Image(url = 'https://upload.wikimedia.org/wikipedia/commons/5/56/\
Kosaciec_szczecinkowaty_Iris_setosa.jpg', width=250)

In [None]:
print('Iris versicolor:')
Image(url = 'https://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg', \
     width = 350)

In [None]:
name = 'virginica'
print('Iris %s:' % name)
Image(url = 'https://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg', \
     width = 450)

The image below is rendered in markdown (click on it to see the code):

![Iris virginica](assets/Iris_virginica.png)

Let's check out the dataset:

In [None]:
iris[0:10]

In [None]:
iris[4][0]

In [None]:
iris[0][4]  ## IndexError: each row has only 4 features (indices 0-3)

In [None]:
iris_features

In [None]:
df = pd.DataFrame(iris)
df.head()

Let's try to add the designated target column to the dataframe:

In [None]:
df['species'] = iris.target
df.head()

Uh oh, that doesn't work.  If you get an `AttributeError` like this, look at the object you called the attribute on and fix that object; in our case, `iris` is just the NumPy array of feature data, so it has no `target` attribute.  We should've used `iris_dataset` instead (see the fix a few cells down).
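To see where the attribute actually lives, here is a minimal sketch (assuming the same `load_iris()` call as above) that checks both objects before using them:

```python
from sklearn import datasets

iris_dataset = datasets.load_iris()   # Bunch object holding data, target, names, ...
iris = iris_dataset.data              # plain NumPy array of the feature values

# hasattr() is a quick way to check which object carries an attribute
print(hasattr(iris, 'target'))          # the array has no 'target' attribute
print(hasattr(iris_dataset, 'target'))  # the dataset object does

# Accessing the missing attribute raises the error discussed above
try:
    iris.target
except AttributeError as err:
    print('AttributeError: %s' % err)
```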

In [None]:
iris_target_names ## we created this earlier

In [None]:
list(iris_target_names)

In [None]:
df.columns = iris_features
df['target'] = iris_dataset.target
df['species'] = df['target'].apply(lambda x: iris_target_names[x]) #bc ^ names are ordered 0,1,2
df.head(2)

Those feature column names are a little long, let's edit!

In [None]:
feature_dict = {
    'sepal length (cm)' : 'sepal_L',
    'sepal width (cm)' : 'sepal_W',
    'petal length (cm)' : 'petal_L',
    'petal width (cm)' : 'petal_W'
}

df = df.rename(columns = feature_dict)
df.head(2)

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

Should the 'target' column really be treated as an integer?

... Nope, because it's categorical: the integers 0, 1, 2 are just labels for the three species!  But we're not going to use it anyway, so we won't bother changing it.  
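If you did want to change it, one option is pandas' `category` dtype.  This is only a sketch on a toy Series standing in for `df['target']` (the toy values are an assumption, not the real column):

```python
import pandas as pd

# Toy stand-in for df['target']; the values are labels, not quantities
target = pd.Series([0, 0, 1, 2, 1], name='target')

# astype('category') tells pandas to treat the values as discrete labels
as_category = target.astype('category')

print(target.dtype)
print(as_category.dtype)
print(list(as_category.cat.categories))
```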

In [None]:
df['sepal_L'].describe()

In [None]:
df['sepal_L'].hist()
plt.show()

Don't print out really long dataframes or method returns like this!  ...

In [None]:
df['sepal_L'].value_counts()
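To keep output like this short, one option is to chain `.head()` onto the result.  A small sketch on a toy Series (the values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df['sepal_L']
lengths = pd.Series([5.0, 5.1, 5.0, 4.9, 5.0, 5.1, 6.3, 4.7])

# value_counts() sorts by count (descending), so head(3) keeps the most common values
top_counts = lengths.value_counts().head(3)
print(top_counts)

# nunique() gives the number of distinct values without listing them all
print(lengths.nunique())
```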

Let's reassign all sepal lengths of 5.0 cm to 4500, just for fun:

In [None]:
df.loc[df['sepal_L'] == 5.0,['sepal_L']] = 4500

How many sepals of length 4500 are there?

Let's count them two different ways:

In [None]:
df['sepal_L'].value_counts()[4500]

In [None]:
len(df[df['sepal_L'] == 4500])
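A third way, sketched here on a toy Series rather than the real column: a boolean mask sums to the number of `True` entries, so there's no need to build the filtered dataframe first.

```python
import pandas as pd

# Toy stand-in for df['sepal_L'] after the reassignment
sepal_L = pd.Series([5.1, 4500, 4.7, 4500, 5.4, 4500], name='sepal_L')

mask = sepal_L == 4500       # boolean Series, True where the value matches
n_flagged = int(mask.sum())  # True counts as 1, False as 0

print(n_flagged)
```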

In [None]:
df.loc[df['sepal_L'] == 4500, :].index

In [None]:
df.loc[df['sepal_L'] == 4500, :][5:7]

In [None]:
df.loc[df['sepal_L'] == 4500, ['sepal_W']][5:7]

In [None]:
df.iloc[7:10, 2:]

In [None]:
df.loc[df['sepal_L'] == 4500, 'sepal_L'] = 5.0  ## use .loc: chained indexing (df[mask]['sepal_L'] = ...) would assign to a copy
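Careful when putting values back: chained indexing like `df[mask]['sepal_L'] = 5.0` assigns to a temporary copy and leaves `df` unchanged, while a single `.loc` call modifies `df` itself.  A minimal sketch on a toy dataframe (the toy values are assumptions for illustration):

```python
import warnings
import pandas as pd

toy = pd.DataFrame({'sepal_L': [4500.0, 5.1, 4500.0],
                    'sepal_W': [3.6, 3.5, 3.4]})
mask = toy['sepal_L'] == 4500

# Chained indexing: toy[mask] is a new object, so the assignment never reaches toy.
# pandas usually warns about this; the warning is silenced here for the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy[mask]['sepal_L'] = 5.0
after_chained = toy['sepal_L'].tolist()
print(after_chained)   # still contains 4500.0

# Single .loc call: rows and column are selected together, so toy is modified
toy.loc[mask, 'sepal_L'] = 5.0
print(toy['sepal_L'].tolist())
```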

In [None]:
df['petal_L'].hist(bins = 8, label = 'legends are awesome!', alpha = 0.6, color = 'thistle')
df['petal_W'].hist(bins = 4, label = 'totally!  petal width', alpha = 0.6, color = 'lightblue')
plt.title('Awesome Histo', fontsize = 18, y = 1.03)

plt.xlim([0.0, 7])
plt.xticks(range(0, 8, 1), fontsize = 12, rotation = -45) # Ticks go from 0 - 7 by jumps of 1
plt.xlabel('x-axis woohoo', fontsize = 14)

plt.ylim([0, 70])
plt.yticks(np.arange(0, 80, 10), fontsize = 12)
plt.ylabel('Count', fontsize = 14)
plt.grid(which='major', axis = 'x', color = 'lime')
plt.grid(which='major', axis = 'y', color = 'magenta', alpha = 0.5)

plt.axvline(2.45, linewidth = 0.4)
plt.axhline(35, linewidth = 1)

plt.legend(loc = 'upper right', fontsize = 14)      #LEGENDS

plt.show()

In [None]:
df['species'].value_counts()

In [None]:
data = [df['sepal_W'], df['petal_L'], df['petal_W']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data, notch = False, sym = 'gD')  ## no notches; outliers drawn as green diamonds

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Iris Features', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('measurement (cm)', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 3                              # Number of boxes
ax1.set_xlim(0.5, numBoxes + 0.5)         # + 0.5 sets space on sides
ax1.set_ylim(0, 10)                       # Set y-axis range
xtickNames = plt.setp(ax1, xticklabels=['sepal width', 'petal length', 'petal width'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(10, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()

In [None]:
df.head(2)

In [None]:
df = df.drop('target', axis = 1)
df.head(2)

In [None]:
big_df = pd.concat([df.drop('species', axis = 1), pd.get_dummies(df['species'])], axis = 1)
big_df.head(2)

Read more about try/except [here](https://docs.python.org/3/tutorial/errors.html).

In [None]:
try:
    del big_df['setosa']
except KeyError:  ## column is already gone (e.g. the cell was run twice)
    pass

This is one of the scalers you can use; NB they're also built-in to some of the regression functions:
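As a hedged illustration only, here is a sketch using scikit-learn's `StandardScaler` on a small made-up array; the choice of scaler and the toy values are assumptions, not necessarily what was originally intended here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 4 samples, 2 features (made-up values)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.3, 3.3],
              [5.8, 2.7]])

# StandardScaler centers each column at 0 and rescales it to unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # per-column means, approximately 0
print(X_scaled.std(axis=0))   # per-column standard deviations, approximately 1
```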

In [None]:
sns.pairplot(big_df.iloc[:,0:4])
plt.show()

Colormap reference [here](http://matplotlib.org/examples/color/colormaps_reference.html) (scroll down, too!).

In [None]:
sns.heatmap(df.iloc[:,0:3].corr())
plt.show()

In [None]:
#Change colors on the correlation matrix

sns.heatmap(df.iloc[:,0:3].corr(), cmap= 'jet')
plt.show()

Set up a color mapping!

In [None]:
species_dict = {
    'setosa' : 'red',
    'virginica' : 'blue',
    'versicolor' : 'goldenrod'
}

colors = df['species'].apply(lambda x: species_dict[x])

In [None]:
df.loc[df['sepal_L'] == 4500,['sepal_L']] = 5.0

In [None]:
plt.figure(figsize=(12,8))

for name in iris_target_names:
    plt.scatter(df[df['species'] == name]['sepal_L'], df[df['species'] == name]['petal_L'], \
                color = species_dict[name], alpha = 0.6, s = 40, label = name)

plt.title('Iris Set: Petal Length vs. Sepal Length', fontsize = 24, y = 1.03)

# y 
# plt.ylabel('Actual Length (cm)', fontsize = 18)
# plt.yticks(np.arange(4.0, 9.0, 0.5), fontsize = 12)
# plt.xlim([4, 8.5])

# # x 
# plt.xlabel('Predicted Length (cm)', fontsize = 18)
# plt.xticks(np.arange(4.0, 9.0, 0.5), fontsize = 12)
# plt.ylim([4, 8.5])

# plt.plot([4, 8.5], [4, 8.5], '--', linewidth = 1, color = 'darkgreen', alpha = 0.6)

plt.legend(loc = 'center right', fontsize = 12)

plt.grid(True)

plt.show()

If you want to try jittering the points, there are helpful tips [here](http://stackoverflow.com/questions/8671808/matplotlib-avoiding-overlapping-datapoints-in-a-scatter-dot-beeswarm-plot).
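One simple approach, sketched below, is to add a little random noise to the coordinates before plotting.  The `jitter` helper, its `scale` default, and the toy values are hypothetical and only for illustration:

```python
import numpy as np

def jitter(values, scale=0.03, seed=0):
    """Return a copy of values with uniform noise in [-scale, scale] added."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-scale, scale, size=values.shape)

# Toy stand-in for overlapping sepal lengths
sepal_L = np.array([5.0, 5.0, 5.0, 5.1, 5.1])
jittered = jitter(sepal_L)

print(jittered)
```

You could then pass something like `jitter(df['sepal_L'])` to `plt.scatter` in place of the raw column; keeping `scale` well below the 0.1 cm measurement step keeps the points near their true values.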

In [None]:
plt.figure(figsize=(12,8))

plt.scatter(df[df['species'] == 'setosa']['sepal_L'], df[df['species'] == 'setosa']['petal_L'], \
                color = species_dict['setosa'], alpha = 0.6, s = 40, label = 'setosa')

plt.scatter(df[df['species'] == 'virginica']['sepal_L'], df[df['species'] == 'virginica']['petal_L'], \
                color = species_dict['virginica'], alpha = 0.6, s = 40, label = 'virginica')
plt.scatter(df[df['species'] == 'versicolor']['sepal_L'], df[df['species'] == 'versicolor']['petal_L'], \
                color = species_dict['versicolor'], alpha = 0.6, s = 40, label = 'versicolor')

plt.legend(loc = 'center right', fontsize = 12)
plt.show()


# Markdown crash course below!  Click on the cell to see relevant code.  Find more [here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).

_hey look, i'm in italics_ 

**and i'm in bold!**

~~i'm not here~~

# All different sizes
## All different sizes
### All different sizes
#### All different sizes
##i forgot to put a space after the pound symbols, i'm not what you want

![i have a bad file path, notice where this text appears](/assets/nonexistentpicture!.png)


i'm from file!
![i'm a .png loaded from file](assets/admiral-grace-murray-hopper.png)


i'm from a website!
![i'm a .png hosted elsewhere](http://africanrubiz.org/wp-content/uploads/2014/10/grace-hopper-4.jpg)