# Guess the correlation

Get an intuitive sense of what correlation "looks like" with this [online game](http://guessthecorrelation.com).

# Anscombe's Quartet  

Summary statistics like mean, standard deviation, correlation, variance explained, t-tests, F-statistics, etc., are fantastic tools.  Still, data sets can be fundamentally very different and yet "look the same" when viewed through these lenses.  [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) gives a nice example.  Data visualization adds a very useful sanity check for important similarities/differences that summary statistics may miss.
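As a small illustration, the sketch below computes a few summary statistics for the four data sets of the quartet. The values are the standard published quartet, typed in by hand; the helper `pearson_r` and the dictionary names are just illustrative choices, not part of any notebook in this series.

```python
from statistics import mean, variance

# Anscombe's quartet: data sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Collect (mean x, var x, mean y, var y, correlation) for each data set.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), variance(xs), mean(ys), variance(ys), pearson_r(xs, ys))
    print(name, [round(value, 2) for value in summary[name]])
```

The four rows of numbers come out nearly identical, even though scatter plots of the four data sets look nothing alike.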

# Canonical examples in ANOVA

This notebook provides a concrete visual description of the type of least-squares model that underlies ANOVA.  There are lots of other **very useful** practical aspects of working with ANOVAs that you can **absorb with very little time + effort**.  One good way to get that experience is to look at about 8 carefully chosen examples of different outcomes. [This website](https://psychstat3.missouristate.edu/Documents/MultiBook3/Mlt08.htm) has those examples.  Just scroll to the bottom!  The main rule of thumb when expanding your ANOVA abilities is to find resources that work for you: search online, ask friends and teachers, and check back in from time to time.  There are good resources out there; it just takes some effort to find them.
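To make the least-squares connection concrete, here is a minimal sketch of a one-way ANOVA computed by hand. The three groups are made-up numbers chosen only for illustration, and the variable names are hypothetical. The model predicts every observation by its group mean (the least-squares fit when the only predictor is group membership), and the F-statistic compares the variation between group means to the variation left over within groups.

```python
from statistics import mean

# Hypothetical data: three groups of three observations each.
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 3.0, 4.0],
    "c": [5.0, 6.0, 7.0],
}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_values)
n_total = len(all_values)
n_groups = len(groups)

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())

# Within-group sum of squares: residuals after predicting each point by its group mean.
ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

# F-statistic: ratio of the two mean squares.
df_between = n_groups - 1
df_within = n_total - n_groups
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.3f}, SS within = {ss_within:.3f}, F = {f_stat:.3f}")
```

A large F means the group means differ by more than the within-group scatter would suggest; turning F into a p-value requires the F distribution with `df_between` and `df_within` degrees of freedom, which this sketch leaves out.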


# Relationships between number of dependent variables, independent variables, and samples

Try re-running the 2-independent-variable linear regression notebook using each of the following point clouds (or just imagine that you did: if you get to the end of this section and feel you understand everything, then you don't actually need to run the data).  Note that each of the clouds labeled "few" is a subsample of one of the clouds labeled "many."

Now re-run your analysis with different subsamples of the larger clouds (again, imagining the results is fine if you end up agreeing with everything in this section).  You may find that the results you get for different subsamples of the "thin" cloud change a lot more than the ones sampled from the "wide" cloud.  In the thin cloud the two independent variables are nearly equal, so the data say very little about how the plane tilts in the direction across the cloud, and a few points can swing the fit.  This is a general problem in regression analysis, often called multicollinearity.  **When your independent variables are strongly related to one another, your best-fit line / plane / etc. can become unstable.**


```python
import numpy as np
import plotly
import plotly.graph_objects as go

#   SET THE NUMBER OF POINTS TO GENERATE
sample_size     = 50

#   SET THE NUMBER OF POINTS TO SUBSAMPLE
subsample_size  = 8

#   DETERMINE WHICH POINTS WE'LL SUBSAMPLE
subsample_set   = np.random.choice(sample_size, subsample_size, replace=False)
print(f"we'll subsample these rows of the matrix {subsample_set}")

#   OUR Z-COORDINATES WILL BE RANDOM NOISE; PICK A SCALE FOR THE NOISE
z_scale = 0.05

#   GENERATE TWO POINT CLOUDS
#   "WIDE": X AND Y ARE INDEPENDENT, EACH UNIFORM ON [0, 1]
pcloud_widemany = np.random.rand(sample_size, 3) * [1, 1, 2 * z_scale]
#   "THIN": X AND Y SHARE ONE RANDOM VALUE PLUS A LITTLE NOISE, SO THEY ARE NEARLY EQUAL
pcloud_thinmany = np.concatenate(
    (
        np.repeat(np.random.rand(sample_size, 1), 2, axis=1) + np.random.rand(sample_size, 2) * z_scale,
        np.random.rand(sample_size, 1) * 2 * z_scale,
    ),
    axis=1,
)

#   SUBSAMPLE THE CLOUDS (THE SAME ROWS ARE TAKEN FROM EACH)
pcloud_widefew = pcloud_widemany[subsample_set, :]
pcloud_thinfew = pcloud_thinmany[subsample_set, :]

#   PLOT THE FOUR CLOUDS TO GET A SENSE OF WHAT THEY LOOK LIKE
for pcloud in [pcloud_widemany, pcloud_widefew, pcloud_thinmany, pcloud_thinfew]:
    trace = go.Scatter3d(x=pcloud[:, 0], y=pcloud[:, 1], z=pcloud[:, 2], mode='markers')
    scene = dict(xaxis=dict(range=[0, 1]), yaxis=dict(range=[0, 1]), zaxis=dict(range=[-0.5, 0.5]), aspectmode='cube')
    figure = go.Figure(data=[trace], layout=go.Layout(scene=scene))
    figure.show()
```

Sample output (the chosen rows change on every run, since no random seed is set):

```
we'll subsample these rows of the matrix [29  0 37 43 38  2  1 33]
```
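If you would rather see the instability as numbers than imagine it, the following is a rough sketch under some assumptions: it rebuilds two clouds with the same recipe as the code above, fits the plane `z = a + b*x + c*y` by least squares to many random subsamples of each cloud, and reports how much the fitted slopes vary. The fixed seed, the number of repeats `n_repeats`, and the helper names `fit_plane` and `slope_spread` are choices made for this illustration, not part of the regression notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable

sample_size, subsample_size, z_scale = 50, 8, 0.05
n_repeats = 200  # hypothetical number of subsamples to compare

# Rebuild a "wide" and a "thin" cloud with the same recipe as the cell above.
wide = rng.random((sample_size, 3)) * [1, 1, 2 * z_scale]
shared = rng.random((sample_size, 1))
thin = np.concatenate(
    (
        np.repeat(shared, 2, axis=1) + rng.random((sample_size, 2)) * z_scale,
        rng.random((sample_size, 1)) * 2 * z_scale,
    ),
    axis=1,
)

def fit_plane(cloud):
    """Least-squares fit of z = a + b*x + c*y; returns the slopes (b, c)."""
    design = np.column_stack((np.ones(len(cloud)), cloud[:, 0], cloud[:, 1]))
    coeffs, *_ = np.linalg.lstsq(design, cloud[:, 2], rcond=None)
    return coeffs[1:]

def slope_spread(cloud):
    """Standard deviation of the fitted slopes across random subsamples."""
    slopes = [
        fit_plane(cloud[rng.choice(sample_size, subsample_size, replace=False)])
        for _ in range(n_repeats)
    ]
    return np.std(slopes, axis=0)

spread_wide = slope_spread(wide)
spread_thin = slope_spread(thin)
print("slope spread, wide cloud:", spread_wide)
print("slope spread, thin cloud:", spread_thin)
```

Since z is pure noise in both clouds, the "true" slopes are zero either way; the point is only that the thin cloud's fitted slopes should scatter much more widely from subsample to subsample than the wide cloud's.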
