# Visualizations

Visualizing your data well is essential to understanding it. So far, we've used pyplot only to produce basic plots; now we'll look at a few plot types in more depth.

- The standard visualization package of Python is `matplotlib`, which works fine and is very customizable, but doesn't always look great.
- `Seaborn` is another package that is built on top of `matplotlib` and looks a little better.
- `Vincent` is another package you might want to try out later.
- We use the magic command `%matplotlib inline` so that images are embedded in the notebook.
- See the <a href="http://matplotlib.org/users/pyplot_tutorial.html" target="_blank">pyplot tutorial</a> for a quick Getting Started guide.

We will go over a few exercises today with some of the options out there using `matplotlib.pyplot` as well as `seaborn`.

In [None]:
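import numpy as np   # used by the random-data cells below
import pandas as pd  # used by the DataFrame cells below
from matplotlib import pyplot as plt
%matplotlib inline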

- The most common function of matplotlib is `plot`.
- Type `?plt.plot` to see the doc string.
- It takes either a single series (plotted against sequential indexes on the x axis) or two series (the first giving the x values, the second the y values).
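
The two calling styles described above can be sketched side by side. This is a minimal illustration; the `xx` values below are made up for the example and are not part of the notebook's data.

```python
from matplotlib import pyplot as plt

yy = [x ** 2 + 1 for x in range(10)]

# One series: matplotlib uses the positions 0, 1, 2, ... as the x values.
line_single, = plt.plot(yy, 'r:')

# Two series: the first gives the x values, the second the y values.
xx = [2 * x for x in range(10)]  # hypothetical x values, chosen for illustration
line_pair, = plt.plot(xx, yy, 'b-')
```

Both lines share the same y values, but the second one is stretched horizontally because its x values are spaced two apart.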

In [None]:
xx = range(10)
yy = [x ** 2 + 1 for x in xx]
plt.plot(xx, yy, 'r:')  # 'r' = red, and ':' = dotted line

Almost everything is customizable in `matplotlib`.

In [None]:
plt.title("Hello, world!")
plt.xlabel("x-axis"), plt.ylabel("y-axis")
plt.xlim(0, 5), plt.ylim(0, 70)
plt.plot(xx, yy, 'bs-', label="parabola")
plt.plot(range(6), range(6), 'y*:', markersize=10, label="straight line")

# I often write `f = ` on the last line to suppress printed output such as `<matplotlib.lines.Line2D at 0x10c750390>`
f = plt.legend()

In [None]:
x = pd.DataFrame( np.random.rand(10), index=pd.date_range('2015-01-01','2015-10-31',freq='m'),columns=['Rets'] )

In [None]:
x.cumsum().plot()
plt.title('Cumulative Return')
plt.xlabel('Month')
plt.ylabel('Return')

In [None]:
x = pd.DataFrame( np.random.rand(10,2), 
                  index=pd.date_range('2015-01-01','2015-10-31',freq='m'),
                  columns=['ARets','BRets'] )

In [None]:
x.cumsum().plot()
plt.xlabel('Month')
plt.ylabel('Return')
plt.title('ARet vs BRet Cumulative Return')

In [None]:
x = pd.DataFrame( np.random.randn(100000) )

In [None]:
# A histogram is a great way to see the distribution of a series.
# I.e. if you don't know how a series is distributed, plot a histogram!
# Use ?plt.hist or ?x.hist to get more information, or just hit shift-tab while you're typing out the function
x.hist()
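
To see what a histogram actually computes, here is a minimal sketch using `np.histogram`, which returns the bin counts without drawing anything. The seed and the choice of 20 bins are arbitrary assumptions made for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
samples = rng.randn(100000)

# Split the range of the data into 20 equal-width bins and count the samples in each.
counts, edges = np.histogram(samples, bins=20)

# Show each bin's left edge, right edge, and count.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(round(left, 2), round(right, 2), count)
```

Calling `plt.hist(samples, bins=20)` draws these same counts as bars.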

In [None]:
# Let's go back to a bivariate data frame
x = pd.DataFrame( np.random.rand(10,2), 
                  index=pd.date_range('2015-01-01','2015-10-31',freq='m'),
                  columns=['ARets','BRets'] )

In [None]:
# Let's plot 'ARets' and 'BRets' as points on a plot
plt.scatter( x['ARets'], x['BRets'] )
plt.xlabel('ARet')
plt.ylabel('BRet')
plt.legend(['ABRet Points'])

# Seaborn

Let's see how good the graphs look with `seaborn`. 

`import seaborn as sb`

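Once seaborn's theme is active, you can call pyplot functions as is and you'll get prettier plots. Seaborn versions before 0.8 applied the theme on import; newer versions need an explicit `sb.set()` call. In addition, the seaborn package itself has additional plots you can call. Let's rerun the scatter plot to see how much prettier it is.

In [None]:
import seaborn as sb
sb.set()  # apply seaborn's default theme (required since seaborn 0.8)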

In [None]:
# Let's plot 'ARets' and 'BRets' as points on a plot
plt.scatter( x['ARets'], x['BRets'] )
plt.xlabel('ARet')
plt.ylabel('BRet')
plt.legend(['ABRet Points'])

How about a histogram?

In [None]:
x = pd.DataFrame( np.random.randn(100000) )
x.hist()

We'll try using seaborn more throughout the class, and will have an example later today.

# Anscombe's Quartet
Source: [Wikipedia](https://en.wikipedia.org/wiki/Anscombe's_quartet)

To start off with, let's take a look at Anscombe's Quartet. It is a collection of four different datasets, where each dataset has an $x$ and $y$ dimension.

Their basic statistics look very similar, but the datasets themselves are not identical.

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
data = pd.DataFrame({'x1': [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
     'y1': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
     'x2': [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0,],
     'y2': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
     'x3': [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0,],
     'y3': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
     'x4': [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
     'y4': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]})
data  # This dataframe contains Anscombe's quartet, i.e. (x1,y1), (x2,y2), (x3,y3), (x4,y4)

# Exercise

**1) For each of the four datasets (i.e., for $(x_1, y_1)$, for $(x_2, y_2)$, etc), compute some basic statistics:**

For $x_i$ and $y_i$, generate:
- the mean and the standard deviation
- the correlation between $x_i$ and $y_i$. Hint: Use `np.corrcoef` or the Pandas dataframe function `corr()`.

Hint: Create labels of $x_i$ and $y_i$ to be able to index into the dataframe for each $i$. This way, you can subset the columns in the `data` dataframe, and call `mean()` and `std()`.

Hint2: The results of Pandas dataframe calls to `mean()`, `std()`, `corr()`, etc. can be tuple unpacked:

```
df = pd.DataFrame([[0,1],[2,3]], index = [0,1],columns=['A','B'])
asum, bsum = df.sum()
print(asum, bsum)
```
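
As a small illustration of the correlation hint, the sketch below compares `np.corrcoef` with the pandas `corr` methods. The `toy` dataframe is made up for this example and is not part of the quartet.

```python
import numpy as np
import pandas as pd

# Toy data (not part of the quartet): b is an exact linear function of a.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [3.0, 5.0, 7.0, 9.0]})

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_numpy = np.corrcoef(toy['a'], toy['b'])[0, 1]

# Series.corr returns the number directly; DataFrame.corr returns the full matrix.
r_pandas = toy['a'].corr(toy['b'])
r_matrix = toy.corr().loc['a', 'b']

print(r_numpy, r_pandas, r_matrix)
```

One detail to keep in mind when comparing results: pandas `std()` computes the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to the population version (`ddof=0`).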

In [None]:
for i in range(1,5):
    x,y = "x" + str(i), "y" + str(i)
    # Your code here

So these datasets must be exactly the same...?

**2) Generate a scatterplot for each set of $x_i$ and $y_i$.**

Hint: Use the `plt.scatter` function, passing in the array of x values and the array of y values you would like to plot.
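
As a rough sketch of how `plt.scatter` combines with a grid of subplots, the example below draws four panels of random toy points (not the quartet). The seed, figure size, and panel titles are arbitrary choices for this illustration.

```python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable

fig = plt.figure(figsize=(8, 6))  # one figure that holds all four panels
for i in range(1, 5):
    ax = plt.subplot(2, 2, i)  # panel i of a 2x2 grid; numbering starts at 1
    ax.scatter(rng.rand(20), rng.rand(20))
    ax.set_title("panel " + str(i))
```

Creating the figure once, before the loop, keeps all four panels on the same figure.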

In [None]:
plt.figure(figsize=(18,4))  # create one figure up front so all four panels share it
for i in range(1,5):
    x,y = "x"+str(i), "y"+str(i)
    plt.subplot(2,2,i)  # subplot indices start at 1
    # Your code here

**3) Seaborn is a visualization package that is built on top of Matplotlib Pyplot. It makes plots look a lot nicer, and has some additional plotting capabilities (for instance heat maps).**

If you finish the exercise early, go ahead and install `seaborn`.

Go to your terminal, and type in `pip install seaborn`. Afterwards, try importing seaborn like so:

`import seaborn`

# Importance of Good Visualization

We've seen how important it is to visualize our data. Datasets whose aggregate statistics are nearly the same can still look vastly different! Now, let's draw a few different plots to get familiar with the various plot types.

**4) Plot the function $f(x) = x^3 - 2x^2 + 4$ on the domain $[-5, 5]$. You can plot at each integer point by using `range`, or use `np.linspace` to generate more finely spaced points.**
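
As a hint for the sampling step, here is a minimal sketch of `np.linspace` applied to a stand-in function (`np.sin`, deliberately not the function from the exercise). The choice of 201 points is arbitrary.

```python
import numpy as np
from matplotlib import pyplot as plt

# 201 evenly spaced points from -5 to 5, both endpoints included.
xs = np.linspace(-5, 5, 201)

# A stand-in function (not the one from the exercise), evaluated on the whole array at once.
ys = np.sin(xs)

line, = plt.plot(xs, ys)
```

Unlike `range`, which only yields integers, `np.linspace` gives fractional steps (0.05 here), so the curve looks smooth.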

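Let's generate an arbitrary dataset. The cell below assumes `func` has already been defined, for example as a Python version of the function $f$ from exercise 4.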

In [None]:
import numpy as np  # import numpy -- we will cover this later
import pandas as pd
N = 1000  # number of dots
x = np.random.random(N)  # get N random values between 0 and 1
df = pd.DataFrame(dict(
        A=x,
        B=func(x),
        C=func(x) / func(x - 1)))
df.head()

**Scatterplot A against B, A against C, and B against C**

In [None]:
# Scatter of A vs B here


In [None]:
# Scatter of A vs C here


In [None]:
# Scatter of B vs C here


**Plot A against B, and make the size of the dots correspond with 100 * C. Hint: plt.scatter can take in an array for the `s` argument for indicating size**

**Plot B against C, and make the _color_ of the dots correspond with A. (With a grey colormap, e.g. `cmap='gray'`, this results in a grey scale.)**
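
The `s` and `c` arguments of `plt.scatter` can be sketched on toy data as follows. The arrays `px`, `py`, and `weight` are made up for this illustration and are unrelated to `df`.

```python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.RandomState(0)       # fixed seed so the sketch is repeatable
px, py = rng.rand(50), rng.rand(50)  # toy point coordinates
weight = rng.rand(50)                # toy third variable in [0, 1)

# `s` sets each marker's area (in points squared), one value per point.
sized = plt.scatter(px, py, s=100 * weight)

plt.figure()
# `c` maps each value to a color through the colormap; 'gray' gives a grey scale.
colored = plt.scatter(px, py, c=weight, cmap='gray')
plt.colorbar(colored)  # legend for the value-to-color mapping
```

Without an explicit `cmap`, matplotlib falls back to its default colormap, which is not grey in recent versions.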


## Further reading

- <b>Pandas</b>:
<a href="http://pandas.pydata.org/pandas-docs/stable/10min.html" target="_blank">10 Minutes to Pandas</a> and
<a href="http://pandas.pydata.org/pandas-docs/stable/tutorials.html" target="_blank">tutorials</a>
of the official documentation<br>
Also recommended is the book <i>Python for Data Analysis</i>, O'Reilly Media.
- <b>Matplotlib pyplot</b>:
<a href="http://matplotlib.org/users/pyplot_tutorial.html" target="_blank">tutorial</a> 
of the official documentation
- <b>Seaborn</b>:
<a href="http://stanford.edu/~mwaskom/software/seaborn/" target="_blank">website</a>
- <b>Vincent</b>:
<a href="https://vincent.readthedocs.org/en/latest/" target="_blank">website</a>