# Learning Unit 2 - Visualization - Example

Welcome to the fascinating world of visualization! This example will take you through the bits and bobs you'll need to explore your data. Then you can go do fun things. I promise.

In [None]:
cd ..  

In [None]:
# importing libraries 
from IPython.display import Image
from sklearn import tree 

# image related libraries 
import pydotplus
import seaborn as sns 
import graphviz
from matplotlib import pyplot as plt 
from bokeh.io import output_notebook

# the following are for the charts to display inline 
output_notebook()  
%matplotlib inline

from utils import load_data, visualizations   # <-- All the stuff we need is here, life rocks 

# Inbuilt Pandas plotting 

Pandas brilliantly has a bunch of plots already built in. I suggest you take a dataset, take a new cell, and write ``df1.plot??`` to see what they've included in the plot function call. 

Here we will just see a couple of basic ones. 

In [None]:
# first, let's get some data 
df1 = load_data.get_correlated_data()

In [None]:
df1.head()  # now, let's observe how boring this dataset is 

#### Line charts

In [None]:
df1['a'].plot()  # we won't be using lots of line charts today, but just so you know where to find them 
plt.show()       # this is just so that it will show in line and pretty 

#### Bar charts

[Uuuh a Bar Chart! Fascinating!](https://media.giphy.com/media/3o6Ztip1Hq0NNGfqyk/giphy.gif)

In [None]:
df1.mean().plot(kind='bar')  # <-- plot the means, as a bar chart 
plt.show()

#### Scatterplots 

We'll use a wrapper around bokeh to make slightly better ones, but it's useful to know this one because it's so simple.

In [None]:
df1.plot(kind='scatter', x='a', y='b')  # <-- here you need to declare what are your x and y 
plt.show()
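
The same `kind=` argument reaches the other built-in chart types too. Here is a minimal sketch with a histogram and a box plot. The `demo` frame is synthetic stand-in data made up for this illustration (it is not `df1`), so the shapes you see are just an assumption about what "typical" numeric columns look like.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in data (assumption): two loosely related numeric columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200)})
demo["b"] = demo["a"] * 0.5 + rng.normal(scale=0.5, size=200)

# Draw two pandas plots side by side by passing each one its own axes.
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
demo["a"].plot(kind="hist", bins=20, ax=ax_hist, title="hist")  # distribution of one column
demo.plot(kind="box", ax=ax_box, title="box")                   # spread of every column
plt.show()
```

Passing `ax=` is optional, but it is handy whenever you want several pandas plots in one figure.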

# Seaborn 

[Seaborn](https://seaborn.pydata.org/examples/index.html) is one of the few libraries that is simultaneously crazy simple and extremely powerful. It's well worth learning.

In our case, we'll use it mostly for heatmaps, but it's got lots of great stuff. 

Let's get some temperature data:

In [None]:
temperatures = load_data.get_temperature_data()

What does this data look like? 

In [None]:
temperatures.head()

Can I maybe see this in a less boring way? 

In [None]:
sns.heatmap(temperatures, annot=True)
plt.show()

# Bokeh

Bokeh is extremely powerful, but its flexibility comes at the price of being hugely verbose.

Pros: 
- You can do [pretty much anything](http://bokeh.pydata.org/en/latest/docs/gallery.html#gallery) with it once you know how to use it 

Cons: 
- The library is evolving very fast, and the documentation is pretty [rough to navigate](https://imgs.xkcd.com/comics/manuals.png).

In this tutorial we'll be using some wrappers that will serve most practical purposes. 

In [None]:
cross = load_data.get_cross_data() # Let's get some more data 

In [None]:
cross.head()

Suggestion: Before running the next cell, try to get an intuition of the `cross` dataset by using the other visualizations you've seen. 

### Scatterplot, with colors! 

In [None]:
visualizations.plot_scatter_3_features(cross, 'a', 'b', 'c', 'My first bokeh plot')

Visualizations are brilliant for exactly this reason: we now know exactly what this dataset looks like, even though we couldn't have worked it out by looking at statistics.

Interestingly, a linear model such as a Logistic Regression would hardly be able to do better than random at predicting this (try it!), even though to our eyes the pattern is obvious.
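
If you want to see the "try it!" in action without touching `cross`, here is a minimal sketch on a synthetic stand-in. The helper `make_xor_data` and its quadrant-style labelling are assumptions made for illustration; the real `cross` dataset may be shaped differently, so treat the numbers as indicative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def make_xor_data(n=2000, seed=0):
    """Hypothetical stand-in for `cross`: label is 1 when a and b share a sign."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, 2))      # columns play the role of 'a' and 'b'
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # plays the role of 'c'
    return X, y


X, y = make_xor_data()

# 5-fold cross-validated accuracy of a plain (linear) logistic regression.
logreg_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(f"share of class 1:             {y.mean():.2f}")
print(f"logistic regression accuracy: {logreg_acc:.2f}")
```

On data like this no single straight line separates the two classes, so the accuracy should land close to the class balance, which is what "no better than random" means here.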

# Fitting a tree 

A great way to get a quick understanding of a dataset is to fit a simple decision tree to it, and then to visualize the tree.

Let's use the sklearn _tree.[DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)_:

In [None]:
dt = tree.DecisionTreeClassifier(max_depth=3) # instantiate the model 
dt.fit(cross[['b', 'a']], cross['c'])         # fit the model to the features a and b, to predict c 

And now for the fun part. Visualizing with Graphviz is simple, and extremely powerful.

In [None]:
dot_data = tree.export_graphviz(dt,
                                out_file=None,
                                impurity=True,
                                feature_names=['b', 'a'],
                                class_names=['red', 'green'],
                                filled=True, rounded=True,
                                special_characters=True)

graph = graphviz.Source(dot_data) 
graph

Very interesting. Now consider that first split. Pretty weird huh? Any idea why it split at such a stupid place? 

Essentially, it needed to break symmetry. The ideal split for us (to our eyes at least) would be across the middle, but that split would not have had any entropy gain, meaning no reduction in impurity. So the tree quite reasonably picked the split that did maximize the gain, and then went on to quickly grasp the whole dataset.
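
To put a number on "no gain", here is a rough sketch that computes the impurity decrease by hand for a split down the middle versus a sweep of other thresholds. It uses Gini impurity, which is sklearn's default `criterion` for `DecisionTreeClassifier`. The data is a synthetic quadrant pattern made up for this illustration (an assumption, not the real `cross`), and `gini` and `impurity_decrease` are hypothetical helpers, not sklearn functions.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a binary 0/1 label array (0 for an empty array)."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)


def impurity_decrease(feature, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = labels.size
    children = (left.size / n) * gini(left) + (right.size / n) * gini(right)
    return gini(labels) - children


# Synthetic stand-in (assumption): label is 1 when a and b share a sign.
rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, size=2000)
b = rng.uniform(-1, 1, size=2000)
c = (a * b > 0).astype(int)

# Try a range of thresholds on feature a, including the "obvious" middle one.
thresholds = np.linspace(-0.9, 0.9, 37)
gains = np.array([impurity_decrease(a, c, t) for t in thresholds])

middle_gain = impurity_decrease(a, c, 0.0)
best_threshold = thresholds[gains.argmax()]

print(f"gain when splitting a at 0.0: {middle_gain:.5f}")
print(f"best threshold on a: {best_threshold:+.2f} (gain {gains.max():.5f})")
```

On a pattern like this every single split gains almost nothing, so whichever threshold happens to score highest wins, and that is rarely the one our eyes would pick.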

_Note: You might have noticed we fit the tree with `dt.fit(cross[['b', 'a']], cross['c'])` instead of `dt.fit(cross[['a', 'b']], cross['c'])`.  
This was for a [very technical reason](https://i.imgflip.com/1ydlm5.jpg)_

# Great, we're done! Now on to the exercise :)