# Scientific Python Bootcamp Day 3

Prepared and presented by John Russell and Ian Hunt-Isaak

This notebook is available on [Github]()

### Outline for the Day
- Making beautiful plots for presentations
- Crash course in covariance
- Application: financial time series data
- Time for questions, help, and other applications

### Making Beautiful Plots

It is one thing to make a plot for yourself, e.g. to make sure a function does what you want or to do some preliminary data visualization. But when the time comes to turn your plot in as part of an assignment or to present it in slides, you need to make the plot clear and readable for your audience.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

#import matplotlib as mpl



In [None]:
#generate some data to plot - use lorenz equations from yesterday

def lorenz(t, r, rho, sigma, beta):
    x,y,z = r
    dxdt = sigma*(y-x)
    dydt = x*(rho-z) - y
    dzdt = x*y-beta*z
    return np.array([dxdt, dydt, dzdt])

init_vals = np.array([1,1,1])
lorenz_sol = solve_ivp(lorenz, (0,100), init_vals, t_eval = np.linspace(0,100,50000), args=(28, 10, 8/3))

In [None]:
plt.style.use('default')
plt.plot(lorenz_sol.y[0], lorenz_sol.y[1])
plt.savefig("bad_plot1.png")
plt.show()

[Look at this plot in google slides](https://docs.google.com/presentation/d/1m_e95QT_hWmRM7InbBNS_zTQLK61z9B0MyIgTXlMY-k/edit#slide=id.g7ddea531b5_0_45)

### What's wrong here?
- Background - bad for notebook but maybe good for white slides
- Line - too thick - cannot see all the features
- Size - relatively small
- Labels - no title or axis labels (and no legend, though that is not relevant here)
- Font - too small for people to read

In [None]:
#These are some matplotlib configurations that I like to use.
import matplotlib as mpl
mpl.rc("font", family = "serif") #Serif font in matplotlib
mpl.rc("figure",figsize=(9,6)) #Increase default figure size
%config InlineBackend.figure_format = 'retina' #Render the plots more nicely

#the settings below make plots look better if you're using the dark theme for jupyter
mpl.style.use('dark_background') #Use a dark background for matplotlib figures 
plt.rcParams.update({"figure.facecolor": "111111", #show figures here with a matching background
                     "savefig.facecolor": "212121"}) #save figures with the color of google slides "Simple Dark" Theme

In [None]:
#improved version

plt.figure(figsize=(9,6))#arguments are width and height in inches
plt.plot(lorenz_sol.y[0], lorenz_sol.y[1], linewidth = .75) #make line narrow to show all features. For simple plots they should be thicker.
plt.title('Two Dimensions of the Lorenz Attractor',fontsize=20)
plt.xlabel('X',fontsize=14)
plt.ylabel('Y', fontsize=14)
plt.tight_layout()
plt.savefig('good_plot1.png',dpi=200)
plt.show()

### Covariance and introduction to Principal Component Analysis

In [None]:
mean = np.array([4.5,3])
cov = np.array([[9, 5],
                [5, 8]])

In [None]:
pts = np.random.multivariate_normal(mean, cov, size=300)

In [None]:
plt.figure(figsize=(9,9))
plt.scatter(pts[:,0],pts[:,1], c='xkcd:peacock blue')
plt.title("Some synthetic data - Positive Covariance", fontsize=20)
plt.xlim([-5,13])
plt.ylim([-6,12]) #set the y range separately; calling xlim twice would just overwrite the x range
plt.xlabel("X",fontsize=14)
plt.ylabel("Y",fontsize=14)

plt.show()

In [None]:
neg_cov = np.array([[9, -5],
                    [-5, 8]])

neg_cov_pts = np.random.multivariate_normal(mean, neg_cov, 300)

no_cov = np.array([[9, 0],
                   [0, 8]])

no_cov_pts = np.random.multivariate_normal(mean, no_cov, 300)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(18,9))
ax[0].scatter(no_cov_pts[:,0],no_cov_pts[:,1],color='xkcd:pinkish purple')
ax[0].set_title('Data with No covariance',fontsize=18)
ax[1].scatter(neg_cov_pts[:,0],neg_cov_pts[:,1], color='xkcd:dark yellow')
ax[1].set_title('Data with Negative covariance',fontsize=18)
plt.show()

### How do we know the covariance structure of our data?

What we want to compute is called the covariance matrix. If that matrix is called $C$ and we want to know how a variable $x_i$ covaries with $x_j$ then we just need to look at $C_{i,j}$. This is all super easy in numpy.

In [None]:
empirical_cov = np.cov(pts.T) #transpose because we want how X covaries with Y, not how each point covaries with every other point
                              # I actually got this wrong at first, but it's easy to check the shape of the covariance matrix to know

In [None]:
empirical_cov.shape

In [None]:
print(empirical_cov)
print('Covariance of X and Y:', empirical_cov[0,1] )
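
To see why the transpose matters, here is a small self-contained sketch. It regenerates data like `pts` under a fixed seed (the seed and the names `sample`, `wrong`, `right`, and `also_right` are just for illustration). By default `np.cov` treats each row as a variable, so passing the untransposed array gives a 300 by 300 matrix instead of the 2 by 2 one we want.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
# 300 observations (rows) of 2 variables (columns), shaped like `pts`
sample = rng.multivariate_normal([4.5, 3], [[9, 5], [5, 8]], size=300)

wrong = np.cov(sample)                     # each of the 300 rows is treated as a variable
right = np.cov(sample.T)                   # variables along the rows after transposing
also_right = np.cov(sample, rowvar=False)  # same result without the transpose

print(wrong.shape)  # (300, 300)
print(right.shape)  # (2, 2)
print(np.allclose(right, also_right))
```

Using `rowvar=False` is an alternative to transposing when your data has one observation per row.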

### Let's just work with the positive covariance data for now

If we call our current variables $x$ and $y$ then there is a sense that they aren't the "best" axes to view our data.

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(pts[:,0],pts[:,1], c='xkcd:peacock blue')
plt.title("Some synthetic data - Positive Covariance", fontsize=20)
plt.xlim([-5,13])
plt.ylim([-6,12]) #set the y range separately; calling xlim twice would just overwrite the x range
plt.xlabel("X",fontsize=14)
plt.ylabel("Y",fontsize=14)
plt.show()

Principal component analysis (PCA) tells us that the "best" axes to look at are the eigenvectors of the covariance matrix.

Don't worry if you haven't seen eigenvectors and eigenvalues before. In brief, if $A$ is a matrix, $x$ is a vector, and

$Ax = \lambda x,$

where $\lambda$ is a scalar, then we say $x$ is an eigenvector of $A$ with eigenvalue $\lambda$.
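
As a quick sanity check of this definition, the sketch below computes the eigenvectors of a small symmetric matrix and confirms that $Ax = \lambda x$ holds for each one. The matrix is the covariance used to generate the synthetic data; the names `A`, `vals`, and `vecs` are just for this example.

```python
import numpy as np

A = np.array([[9.0, 5.0],
              [5.0, 8.0]])

# eigh is meant for symmetric matrices; the eigenvectors are the COLUMNS of `vecs`
vals, vecs = np.linalg.eigh(A)

for lam, v in zip(vals, vecs.T):  # iterate over columns by transposing
    # A @ v should equal lam * v up to floating-point error
    print(lam, np.allclose(A @ v, lam * v))
```

Note that `np.linalg.eigh` returns the eigenvalues in ascending order.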

How do we represent our data according to the "best" axes, or principal components? With a relatively simple matrix multiplication.

In [None]:
eigenvalues, eigenvectors = np.linalg.eigh(empirical_cov)

In [None]:
transformed = pts@eigenvectors #project each point onto the eigenvectors (pts is not mean-centered here)

# One trick: for PCA it is best to sort the components by descending eigenvalue,
# but np.linalg.eigh returns them in ascending order, so we reverse the columns.
transformed = transformed[:,::-1]

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(pts[:,0],pts[:,1], c='xkcd:peacock blue')
plt.title("Some synthetic data - Positive Covariance", fontsize=20)
plt.xlim([-5,13])
plt.ylim([-6,12]) #set the y range separately; calling xlim twice would just overwrite the x range
plt.xlabel("X",fontsize=14)
plt.ylabel("Y",fontsize=14)
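#draw each eigenvector from the mean, scaled by the standard deviation along it (sqrt of its eigenvalue)
plt.arrow(mean[0],mean[1],-np.sqrt(eigenvalues[0])*eigenvectors[0,0],-np.sqrt(eigenvalues[0])*eigenvectors[1,0],
          head_width=0.5,fc='xkcd:mango', ec='xkcd:mango')
plt.arrow(mean[0],mean[1],np.sqrt(eigenvalues[1])*eigenvectors[0,1], np.sqrt(eigenvalues[1])*eigenvectors[1,1],
          head_width=0.5,fc='xkcd:mango', ec='xkcd:mango')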
plt.show()

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(transformed[:,0], transformed[:,1],c='xkcd:peacock blue')
plt.title("Same synthetic data plotted by Principal Components", fontsize=20)
plt.ylim([-8,8])
plt.xlim([-14,4])
plt.xlabel('PC1',fontsize=14)
plt.ylabel('PC2', fontsize=14)
plt.show()
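
One way to convince yourself that the principal components really are "better" axes: in the new coordinates the variables no longer covary. The sketch below is self-contained and uses freshly generated data under a fixed seed. The names `data`, `scores`, and `order` are illustrative, and the mean-centering step is an addition that the cells above do not perform.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
data = rng.multivariate_normal([4.5, 3], [[9, 5], [5, 8]], size=300)

C = np.cov(data.T)
vals, vecs = np.linalg.eigh(C)

# reorder so the direction with the largest variance comes first
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# centering puts the cloud at the origin (an extra step, not done above)
scores = (data - data.mean(axis=0)) @ vecs

# the covariance of the transformed data is diagonal,
# and the diagonal entries are the eigenvalues
print(np.round(np.cov(scores.T), 6))
print(np.round(vals, 6))
```

The off-diagonal entries come out as zero up to rounding, and the variance along PC1 is the largest eigenvalue.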

## The Math behind Principal Component Analysis - Variance and Covariance

*This section is beyond the scope of this bootcamp, but it provides the equations behind covariance and PCA.*

Many of you have probably taken a statistics class and had to compute the variance or standard deviation of some data. As a refresher, if we have $N$ observations of a random variable $x$ then the variance is

$$\sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n-\mu)^2$$

where $\mu$ is the mean or average of the $x$ values in the dataset.

The covariance generalizes this concept to the case where one is interested in multiple random variables that may be correlated. If $x$ and $y$ are random variables which we have observed $N$ times then we can say

$$\text{Cov}(x,y) = \frac{1}{N} \sum_{n=1}^{N} (x_n-\mu_x)(y_n -\mu_y).$$

Note the following important relations:

$$\text{Cov}(x,x) = \sigma^2_x$$
$$ \text{Cov}(x,y) = \text{Cov}(y,x).$$
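
The formulas above can be checked directly against numpy. One detail to be aware of: the equations here divide by $N$, while `np.cov` divides by $N-1$ by default and only uses $N$ when called with `bias=True`. The sketch below is self-contained; the helper `cov_1_over_n` and the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(2)  # seed chosen only for reproducibility
xy = rng.multivariate_normal([4.5, 3], [[9, 5], [5, 8]], size=300)
x, y = xy[:, 0], xy[:, 1]
N = len(x)

def cov_1_over_n(a, b):
    """Covariance with the 1/N normalization used in the formula above."""
    return np.mean((a - a.mean()) * (b - b.mean()))

manual = cov_1_over_n(x, y)
numpy_biased = np.cov(x, y, bias=True)[0, 1]  # normalized by N
numpy_default = np.cov(x, y)[0, 1]            # normalized by N - 1

print(np.isclose(manual, numpy_biased))
print(np.isclose(numpy_default, manual * N / (N - 1)))
print(np.isclose(cov_1_over_n(x, x), np.var(x)))           # Cov(x, x) is the variance
print(np.isclose(cov_1_over_n(x, y), cov_1_over_n(y, x)))  # symmetry
```

For 300 points the two normalizations differ by well under one percent, but the difference matters for small samples.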

In general, if there are $M$ random variables that may be correlated, call them $x^{(i)}$, it is convenient to consider a *covariance matrix*: a matrix $C$ such that

$$C_{i,j} = \text{Cov}(x^{(i)},x^{(j)}).$$

Note that the two identities discussed above imply that
1. The diagonal elements of the covariance matrix are the variances of each variable, and
1. The covariance matrix is symmetric.
1. For the mathematicians: it is also positive semidefinite, since the variance of any linear combination of the variables, $a^T C a$, is never negative.
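
These properties are easy to verify numerically. The sketch below is self-contained and uses made-up data (the names `obs`, `t`, and `degenerate` are illustrative). The last part builds a deliberately redundant pair of variables to show that an eigenvalue can be zero up to rounding, so the guarantee is "never negative" rather than "strictly positive".

```python
import numpy as np

rng = np.random.default_rng(3)  # seed chosen only for reproducibility
obs = rng.normal(size=(500, 3))  # 500 observations of 3 variables
C = np.cov(obs.T)

print(np.allclose(C, C.T))                               # symmetric
print(np.allclose(np.diag(C), obs.var(axis=0, ddof=1)))  # diagonal holds the variances
print(np.linalg.eigvalsh(C))                             # all eigenvalues are positive here

# Two perfectly dependent variables: the second is just twice the first
t = rng.normal(size=500)
degenerate = np.cov(np.vstack([t, 2 * t]))
eigs = np.linalg.eigvalsh(degenerate)
print(eigs)  # the smallest eigenvalue is zero up to floating-point error
```

Note `ddof=1` in the variance call: it matches the $N-1$ normalization that `np.cov` uses by default.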


The covariance of different variables in a dataset is one of the most important things one can learn about their data and there are lots of ways of using the information in the covariance matrix to analyze and understand the data. 