# Why Is Statistical Thinking Hard?

When I first learned statistics, it felt like black magic. If I followed a set of principles someone once proved, I could turn the data I see into some kind of statement about the world. Data in, bold claim out. Sometimes the principles would make sense, and sometimes they wouldn't. I could rarely "see" them in action in a way that gave me practical, concrete intuition. 

After completing my M.S. in Statistics from Stanford, working as a data scientist for five years, and communicating statistical results to everyone from CEOs to product people to actual scientists, I have a lot of opinions on why stats is so hard to see intuitively. Statistics involves backward thinking, like trying to find something you've misplaced. You collect some data you've observed, and from that data you try to generalize about things you didn't observe. How do you do this? How would you actually know that you were right and your reasoning worked? Much of the basic statistical curriculum was designed during the era of pen and paper. Brilliant researchers used mathematical proofs to demonstrate that if you followed a certain technique or principle under certain conditions, you would make a safe trip from your data to the fact you want to deduce about the world. Black magic indeed!

Enough abstraction: let's turn to a simple example. Suppose I want to know what color to paint university classrooms as part of my university's efforts to create a supportive educational environment. I design a survey asking students what color--red, yellow, or blue--makes for the nicest classrooms. After asking 100 students who walked by on the quad, I have some data with which to answer my question. Let's say 43 students preferred red, 41 preferred yellow, and 16 preferred blue. What do I do with this information?

I could just look at the most popular color in my dataset, which is red. But is that valid? Almost as many students liked yellow as liked red. How do I know that this set of 100 students didn't randomly have more red fans than the general population, and that yellow isn't actually more preferred? Speaking of random, you've probably heard that random selection is important (which, it looks like, I didn't actually do--I surveyed students who happened to walk by on the quad, not students drawn at random from the whole student body). Randomization is supposed to do something to make the sample more representative, but how big of a sample is required? Were 100 students enough? Obviously if I picked just one student (sample size one) and went with their preferred color, I wouldn't be able to reliably generalize from it even though I randomized. So sample size does matter. But how big is big enough? Is there a one-size-fits-all sample size that would let you believe the study results?

In a typical statistics course you'll learn some complicated formulas to go back from your data to the "truth" with some probabilistic guarantees. You'll learn best practices and mathematical techniques that can answer the questions we've just asked.

But you won't see the logic for yourself. This is too bad because in the computer era, you can! Using simulations and sampling, you can start with a hypothetical "true" situation, then see for yourself what happens if you run your experiment. 

Let's try it! Suppose that the real preferences of the student population were such that 45% preferred yellow, 40% preferred red, and 15% preferred blue. Now we decide to sample 100 of these students, truly at random, and see what the most preferred color is. Is this a good experiment? How likely is it that yellow will be the winner in our experiment the way it is in our hypothetical reality? 

We're going to code! Don't worry if you've not seen much code before--in Ch. 1 we'll cover the bare bones essentials of coding so you can follow the rest of the course. For now, skip the coding if you don't understand it, and just focus on the words and pictures.

(note to self this is where I can do a sample size calculation based on a uniform prior over these three params, and ask for each what the sample for N would do and for what percent of the samples the most popular value would return the "right answer"; do this after picking one particular "truth" and running this)

In [11]:
import numpy as np
import pandas as pd

In [5]:
colors = ['yellow', 'red', 'blue']
preferences = [0.45, 0.4, 0.15]

Notice that in the two lines above we've already written down all the information we need for our hypothetical "true situation": among the student population, 45% prefer yellow, 40% prefer red, and 15% prefer blue. This is our first example of what we call a distribution in statistics.

Next, let's simulate an experiment where we ask 100 students about their preferences. We do this by sampling. We "draw" 100 samples, where each sample is a person with a preferred color. Since 45% of the population prefers yellow, we give a 45% chance that a single sample has the value yellow, and so on for red and blue. Basically this is no different than flipping 100 coins, but for coins there are two options (heads, tails) instead of three (yellow, red, blue), and the chances we'd draw those options are (50,50) instead of (45, 40, 15).

Python makes it really easy for us to draw samples from distributions. Let's do it!

In [35]:
number_of_samples = 100 # set the number of samples we'll draw

np.random.seed(2827) # setting a 'random seed' ensures that we get the same random sample when we rerun our code
dataset = np.random.choice(colors , size = number_of_samples, p=preferences, replace=True)
dataset

array(['yellow', 'red', 'red', 'red', 'blue', 'yellow', 'blue', 'blue',
       'red', 'yellow', 'yellow', 'red', 'yellow', 'red', 'yellow',
       'blue', 'red', 'red', 'yellow', 'blue', 'red', 'yellow', 'yellow',
       'yellow', 'blue', 'yellow', 'yellow', 'yellow', 'yellow', 'red',
       'red', 'yellow', 'red', 'blue', 'blue', 'red', 'red', 'yellow',
       'yellow', 'yellow', 'yellow', 'yellow', 'red', 'red', 'yellow',
       'blue', 'yellow', 'red', 'yellow', 'yellow', 'red', 'yellow',
       'red', 'yellow', 'yellow', 'red', 'yellow', 'blue', 'blue', 'blue',
       'red', 'blue', 'red', 'red', 'blue', 'blue', 'blue', 'red',
       'yellow', 'yellow', 'yellow', 'red', 'yellow', 'red', 'yellow',
       'blue', 'yellow', 'red', 'yellow', 'red', 'red', 'red', 'yellow',
       'red', 'yellow', 'red', 'yellow', 'red', 'red', 'red', 'red',
       'blue', 'red', 'red', 'yellow', 'red', 'yellow', 'yellow', 'red',
       'yellow'], dtype='<U6')

Now we've got an array called "dataset" containing 100 values, representing 100 simulated samples from our hypothetical student population with its preferences. Let's take a deeper look at our simulated dataset!

In [37]:
np.unique(dataset, return_counts=True) # get the unique values in our dataset, and how many times they each showed up

(array(['blue', 'red', 'yellow'], dtype='<U6'), array([18, 40, 42]))

In our simulated study, 18 respondents preferred blue, 40 preferred red, and 42 preferred yellow. Interesting results! Of course, the preferences in our study don't exactly match the preferences in the general population (or else it would have been 15 preferring blue, 40 preferring red, and 45 preferring yellow). However, if we'd used this study to conclude that yellow was the favorite option, we would have been right!
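To compare the study with the population more directly, the counts can be turned into proportions. The snippet below is a small sketch that copies the counts printed above; the names `observed_counts` and `true_preferences` are introduced here for illustration only.

```python
import numpy as np

# Counts copied from the np.unique output above, in (blue, red, yellow) order.
values = np.array(['blue', 'red', 'yellow'])
observed_counts = np.array([18, 40, 42])

# The hypothetical "true" preferences, in the same (blue, red, yellow) order.
true_preferences = np.array([0.15, 0.40, 0.45])

# Divide by the total number of respondents to get sample proportions.
observed_proportions = observed_counts / observed_counts.sum()

for color, observed, truth in zip(values, observed_proportions, true_preferences):
    print(f"{color}: sample {observed:.2f} vs. population {truth:.2f}")
```

Each sample proportion lands within a few percentage points of the true value, but none of them matches it exactly.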

But would team yellow always come out on top? Would our experiment always give the correct answer? Probably not! Let's use simulations to answer the question: for our simulated student population, what percent of the time will the most popular color in our study of 100 random students match the true most popular color, yellow?

The magic of computers: I can run 1000 simulated experiments in not much more time than it takes to run 1 simulated experiment. 

In [38]:
np.random.seed(1835)
colors = ['yellow', 'red', 'blue']
preferences = [0.45, 0.4, 0.15]
samples_for_1000_experiments = (1000,100)
thousand_datasets = np.random.choice(colors , size = samples_for_1000_experiments, p=preferences, replace=True)
thousand_datasets

array([['red', 'yellow', 'red', ..., 'yellow', 'yellow', 'yellow'],
       ['yellow', 'yellow', 'red', ..., 'yellow', 'red', 'yellow'],
       ['red', 'yellow', 'yellow', ..., 'red', 'blue', 'red'],
       ...,
       ['red', 'yellow', 'yellow', ..., 'blue', 'red', 'blue'],
       ['red', 'yellow', 'yellow', ..., 'blue', 'blue', 'blue'],
       ['yellow', 'blue', 'blue', ..., 'yellow', 'yellow', 'yellow']],
      dtype='<U6')

Wow! Just a few lines of code, and you can already see that our array has different groups of colors, with each group representing one unique simulated experiment. Let's take a look at the differences between the first and second experiment, just for clarity.

In [43]:
np.unique(thousand_datasets[0], return_counts=True) # just like last time, but we use the index 0 to read just the first dataset from our thousand datasets

(array(['blue', 'red', 'yellow'], dtype='<U6'), array([17, 31, 52]))

In [44]:
np.unique(thousand_datasets[1], return_counts=True) # instead of grabbing the first dataset (index 0), we grab the second (index 1)

(array(['blue', 'red', 'yellow'], dtype='<U6'), array([ 9, 42, 49]))

We see that for the first and second experiments, yellow still wins! Note that Python likes to count starting at 0, not starting at 1; this will come up a lot but you'll get used to it quickly. Ok, so let's do this for every dataset in our thousand datasets. We'll give it a shot by writing our first for loop of the class! Again, don't worry if the code doesn't make sense--we'll cover for loops soon enough.

In [67]:
most_popular_colors = [] # make an empty list
for i in range(0,1000): # make a for loop that loops over our 1000 experiments! for each experiment...
    my_dataset = thousand_datasets[i] 
    values, counts = np.unique(my_dataset, return_counts=True) # we grab the counts
    most_popular_color = values[counts.argmax()] # cute trick to find the most popular color per experiment
    most_popular_colors.append(most_popular_color) # and then we add it to our list!

Now we have a new list called most_popular_colors with the winning color from each of our 1000 simulated experiments. Very cool! Let's just check that the results we have make sense.

In [69]:
len(most_popular_colors) # let's check how long the list is; it's 1000 long, so looks like we have a value per simulation

1000

In [70]:
most_popular_colors[0:10] # just checking the first ten values; looks like yellow wins often but red wins sometimes!

['yellow',
 'yellow',
 'yellow',
 'yellow',
 'yellow',
 'red',
 'yellow',
 'red',
 'red',
 'yellow']

In [64]:
np.unique(most_popular_colors, return_counts=True) # counts of the winners!

(array(['red', 'yellow'], dtype='<U6'), array([328, 672]))

Let's interpret these results. Across our 1000 simulated experiments, blue was never the most popular color. But in 328 of those experiments, red would have been (incorrectly!) chosen as the most popular color! This is actually momentous. We can see that under our simulated situation, our proposed experiment would have given us the correct answer ~67% of the time, while giving us the incorrect answer ~33% of the time!
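The same tally can also be expressed as win rates without a loop. What follows is a sketch, not the notebook's own run: it assumes a fresh `np.random.default_rng` generator with an arbitrary seed, so the exact numbers will differ a little from the 672 and 328 above, and the names `datasets`, `winners`, and `win_rates` are made up for this example.

```python
import numpy as np

# Assumption: a new generator with an arbitrary seed, so results will not
# exactly reproduce the counts shown above.
rng = np.random.default_rng(0)

colors = np.array(['yellow', 'red', 'blue'])
preferences = [0.45, 0.40, 0.15]
n_experiments, n_students = 1000, 100

# Each row is one simulated experiment of 100 students.
datasets = rng.choice(colors, size=(n_experiments, n_students), p=preferences)

# Count how often each color appears in every experiment (one column per color).
counts = np.stack([(datasets == color).sum(axis=1) for color in colors], axis=1)

# The winner of each experiment is the color with the largest count.
# Note: on a tie, argmax keeps the first color listed in `colors`.
winners = colors[counts.argmax(axis=1)]

# Fraction of experiments in which each color came out on top.
win_rates = {str(color): float((winners == color).mean()) for color in colors}
print(win_rates)
```

One design detail worth noticing is tie handling: here a tie goes to whichever color is listed first, which is a choice rather than a fact about the data, and a different tie rule would shift the win rates slightly.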

# implications:

talk about statistical design
talk about connection to confidence intervals
talk about the difference being in the real world we don't know the underlying distribution, but we can use similar logic. And furthermore, whatever technique we use to go from sample to population, we can see for ourselves if it works across populations! ps make that a whole chapter showing that the vanilla bootstrap confidence interval works
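
One way the design question might be explored is to repeat the simulation for several sample sizes and watch how often yellow wins. This is only a sketch under the same hypothetical "truth" (45% yellow, 40% red, 15% blue); the function `yellow_win_rate`, the seed, and the list of sample sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: arbitrary seed for reproducibility

preferences = [0.45, 0.40, 0.15]  # yellow, red, blue (the hypothetical truth)
n_experiments = 2000              # simulated studies per sample size


def yellow_win_rate(sample_size):
    """Fraction of simulated studies in which yellow is the strict winner."""
    # multinomial draws the (yellow, red, blue) counts for each study directly,
    # which is equivalent to sampling students one by one and tallying them.
    counts = rng.multinomial(sample_size, preferences, size=n_experiments)
    yellow, red, blue = counts[:, 0], counts[:, 1], counts[:, 2]
    # Ties are counted as failures here, which is a deliberate (strict) choice.
    return float(((yellow > red) & (yellow > blue)).mean())


for sample_size in [10, 100, 500, 2000]:
    print(sample_size, yellow_win_rate(sample_size))
```

Running it shows the win rate climbing as the sample grows, which is the kind of evidence a sample size decision could be based on.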

# What IS Probability

What IS Probability?
Python Fundamentals
Pandas Fundamentals
Probability Problems
Conditioning?
Demo CLT?
Bootstrap (Vanilla Interval)
Demo that the bootstrap works by pulling a real population, sampling it, bootstrapping the sample and getting CIs, and showing what percent of CIs have coverage. 
Then throw a spanner in the works: make the tails long!
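
The bootstrap coverage demo outlined above could look roughly like this. It is a sketch under loud assumptions: the population is synthetic (no real dataset is named in the outline), the interval is the percentile bootstrap interval (which may or may not be what "vanilla" refers to), and every name and number below is chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumption: arbitrary seed

# Assumption: a synthetic stand-in for the "real population".
population = rng.normal(loc=170.0, scale=10.0, size=100_000)
true_mean = population.mean()

n_experiments = 500   # number of simulated studies
sample_size = 50      # observations per study
n_resamples = 1000    # bootstrap resamples per study

covered = 0
for _ in range(n_experiments):
    # One study: draw a sample from the population.
    sample = rng.choice(population, size=sample_size, replace=False)

    # Resample the sample with replacement and record each resample's mean.
    idx = rng.integers(0, sample_size, size=(n_resamples, sample_size))
    boot_means = sample[idx].mean(axis=1)

    # Percentile interval: the middle 95% of the bootstrap means.
    lower, upper = np.percentile(boot_means, [2.5, 97.5])

    # Did this study's interval contain the true population mean?
    covered += int(lower <= true_mean <= upper)

coverage = covered / n_experiments
print(f"Coverage of the nominal 95% interval: {coverage:.3f}")
```

To try the long-tail variation, one option is to swap the population line for a skewed distribution such as `rng.lognormal(...)` and compare the coverage.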

More spanners: temperature problem sampling across different times, as well as different sample size problem at different schools

# Why Statistics Feels Hard

People use statistics to make educated guesses about what they don't know, based on what they do know. You get some data or observations, and based on this data you try to say something about the world. This requires a kind of backwards thinking, which is unintuitive. Forward thinking says: because X happened, Y happened after. Backwards thinking says, I can only see Y: what do I think X is?

Let's take an example to make things clearer. Let's say I want to know the average height in the USA. How am I supposed to ever know what that is? I can't measure every single person. The best I can do is to measure some people, and then try to work backwards from the heights I've measured to say something about all the heights I didn't measure.

And how will I ever know if I'm right, since I never measure every height? Learning happens when people can see when they've made mistakes and where those mistakes are. But since we never have the real "truth" that we're trying to estimate statistically, it's hard to get a feel for whether our techniques work. All we really end up with are magic formulas that are hard to understand, that someone told us work.

# Statistics in the Computer Age

When most of the statistics you see in stats 101 was developed, people didn't have computers. Computations were done by hand, so it was important to have formulas that would produce the information we needed. Statistics was taught with all this in mind. Students were trained to reach these kinds of analytic solutions (the technical math term for exact answers written as a formula), or to derive new ones.

In the computer age, the situation is completely different. Now, we can use computer simulations to generate a "ground truth" situation easily, and then take samples from it to create our dataset. Since we have both the data and the ground truth, we can easily see how our technique creates a picture about the truth from the data we have. We can directly see what works and what doesn't work. 
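
As a minimal sketch of this idea, applied to the height question from earlier: invent a population, record its true average, then pretend we can only measure a few people. The population below is made up (a normal distribution with an assumed mean of 170 cm and standard deviation of 10 cm), not real height data, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # assumption: arbitrary seed

# Assumption: an invented population of one million heights in cm (not real data).
population_heights = rng.normal(loc=170.0, scale=10.0, size=1_000_000)

# The "ground truth" that we would never get to see in a real study.
true_average = population_heights.mean()

# Measure only 200 people and use their average as the estimate.
sample = rng.choice(population_heights, size=200, replace=False)
estimate = sample.mean()

print(f"true average:             {true_average:.1f} cm")
print(f"estimate from 200 people: {estimate:.1f} cm")
print(f"error:                    {estimate - true_average:+.1f} cm")
```

Because the truth is known here, the error of the estimate can be read off directly, which is exactly what a real study never allows.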

Today it is easier than ever to understand statistics, as long as we teach from a simulation- and sampling-centered point of view.

# Example: Central Limit Theorem

Do an example 
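
One possible shape for this example, offered as a sketch rather than the finished demo: draw many samples from a strongly skewed distribution and check that their means behave the way the Central Limit Theorem predicts. The exponential distribution, the sample size of 50, and the seed are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # assumption: arbitrary seed

n_experiments = 10_000
sample_size = 50

# Exponential with mean 1 and standard deviation 1: strongly right-skewed.
draws = rng.exponential(scale=1.0, size=(n_experiments, sample_size))

# One sample mean per simulated experiment.
sample_means = draws.mean(axis=1)

# CLT prediction: the means are roughly normal, centered on 1,
# with standard deviation 1 / sqrt(sample_size).
predicted_sd = 1.0 / np.sqrt(sample_size)
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"sd of sample means:   {sample_means.std():.3f}")
print(f"predicted sd:         {predicted_sd:.3f}")

# For a normal distribution, about 68% of values fall within one sd of the mean.
within_one_sd = float(np.mean(np.abs(sample_means - 1.0) < predicted_sd))
print(f"share within one predicted sd: {within_one_sd:.3f}")
```

Even though individual draws are far from bell-shaped, the share of sample means within one predicted standard deviation should come out close to the normal benchmark.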

