In [15]:
# Unit 4 - SciPy - Days 1, 2 & 3 - 01.27.18

#### STATISTICAL CONCEPTS

Introduction

Say you work for a major social media website. Your boss always says "data drives all our decisions" and it seems to be true. Metrics are collected on all users of your website, terabytes of data stored in replicated databases.

One day, your boss wants to know if college students are engaging with your website. You pull up the records for users in that age bracket and look at them one by one. The first person only spent half a second on your website before closing the tab, which doesn't look good. But the second person was on the site for thirty minutes! That's a running average of about 15 minutes of site time per user, but you still have half a million records to look at. On top of that, you need to compare it against other age brackets (and the overall average). That's going to take a lot of time if you do it all by hand, and you're still not sure what your methodology is for proving college students spend enough time on your website to be "engaged".

Data analysts attempt to say meaningful things about the persuasions of people. Often, we want to know if a change or difference we see in a dataset is "real" or if it’s just a normal fluctuation or a result of the specific sample of people we have chosen to measure. A difference we observe in our data is only important if we can be reasonably sure that it is representative of the population as a whole, and reasonably sure that our result is repeatable.

This question of whether a difference is significant or not is essential to making decisions based on that difference. Some instances where this might come up include:

* Performing an A/B test — are the different observations really the results of different conditions (i.e., Condition A vs. Condition B)? Or just the result of random chance?
* Conducting a survey — is the fact that men gave slightly different responses than women a real difference between men and women? Or just the result of chance?

In this lesson, we will cover the fundamental concepts that will help us create tests to measure our confidence in our statistical results:

* Sample means and population means
* The Central Limit Theorem
* Why we use hypothesis tests
* What errors we can come across and how to classify them

Are the Millennials Engaged?

You work at the global megacorp social network SpyPy. SpyPy has 1.5 billion daily users, and you want to make sure that people in the millennial age bracket are engaging with your website. Your boss seems particularly frazzled by this question, and she's put it on you to find out. You decide that "engagement" means spending more than the site-wide average of seven minutes on the website. You fire up your data-science stack in Python and first check the millennials' average time, which turns out to be nearly 11 whole minutes! But you can't tell whether they're really spending more time or whether it's just random chance that a few of your users left the browser open and walked away. You write the following code:

In [5]:
import spypy
from scipy.stats import ttest_1samp

millenial_times = spypy.get_site_times_for_demographic('millenial')
t_stat, p_val = ttest_1samp(millenial_times, 7)

if p_val < .05:
    print("The Millennials are engaged!")
else:
    print("The Millennials are not engaged :(!")
The Millennials are engaged!
SpyPy: We're Significantly Different
Well, that's great news! Millennials are, for the most part, spending around 11 minutes on your website. But before you break out the champagne glasses, your boss is in a frenzy again, this time about Metropolitan Statistical Areas (MSAs). You're tasked with finding out if people in cooler climates post more pictures on SpyPy than people in warmer climates. You cross-reference with weather data and run a statistical test on the info.

In [2]:
from scipy.stats import ttest_ind

warmer_weather_picture_count = spypy.get_number_pictures_for_climate('hot')
colder_weather_picture_count = spypy.get_number_pictures_for_climate('cold')

t_stat, p_val = ttest_ind(warmer_weather_picture_count, colder_weather_picture_count)

if p_val < .05:
    print("People from colder climates post a different number of pictures compared to people from warmer climates")
else:
    print("Climate doesn't appear to affect the number of pictures posted")
Climate doesn't appear to affect the number of pictures posted
SpyPy: Because We Care About Your Data
Seems like climate probably doesn't really affect the number of times people post pictures. Not really sure why that would've been the case anyway. SpyPy has a new feature that you think will get people to interact with the website for longer: SpyPy Stories. It is preliminarily being launched to 8 million users and the internal goal is to get 2 million people to post SpyPy Stories in the first week. Unfortunately, only 1,997,893 people posted SpyPy Stories this week. We want to know if this is a significant difference from our goal -- did we pretty much meet it or did we seriously miss? You know how to answer this question:

In [7]:
# binomtest replaces the older binom_test, which recent SciPy releases no longer provide
from scipy.stats import binomtest

number_of_trials = 8000000
expected_successes = 2000000
actual_successes = 1997893
expected_success_rate = float(expected_successes) / float(number_of_trials)

p_val = binomtest(actual_successes, n=number_of_trials, p=expected_success_rate).pvalue
if p_val < 0.05:
    print("We missed our target by a significant amount")
else:
    print("We just missed our target by a very small amount!")
We just missed our target by a very small amount!
Looks like we came very close to hitting our target for SpyPy Stories! You've saved the day so many times already! Your boss comes by to thank you for all the hard work you put in today and says you've made significant contributions to the team. You tell her you're not sure if that's true, but you definitely have a way of finding out.

#### Sample Mean and Population Mean

Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure 10 trees and get an average height of 32 ft. On Tuesday, you measure 12 different trees and reach an average height of 35 ft. On Wednesday, you measure the remaining 11 trees in the park, whose average height is 31 ft. Overall, the average height for all trees in your local park is 32.8 ft.

The individual measurements on Monday, Tuesday, and Wednesday are called samples. A sample is a subset of the entire population. The mean of each sample is the sample mean and it is an estimate of the population mean.

Note that the sample means (32 ft., 35 ft., and 31 ft.) were all close to the population mean (32.8 ft.), but were all slightly different from the population mean and from each other.
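The overall figure of 32.8 ft. is not the plain average of the three sample means, because the three days covered different numbers of trees. A minimal sketch of the arithmetic, using only the counts and means from the example above (the variable names are just for illustration):

```python
# Sample sizes and sample means from the oak tree example above
sample_sizes = [10, 12, 11]
sample_means = [32.0, 35.0, 31.0]  # feet

# Weight each sample mean by the number of trees it covers
total_height = sum(n * m for n, m in zip(sample_sizes, sample_means))
total_trees = sum(sample_sizes)
population_mean = total_height / total_trees

# A plain average of the three sample means ignores the unequal sample sizes
unweighted_mean = sum(sample_means) / len(sample_means)

print("Population mean: {:.1f} ft".format(population_mean))
print("Unweighted mean of sample means: {:.1f} ft".format(unweighted_mean))
```

Because the three samples together cover every tree in the park, the weighted result is the population mean itself rather than an estimate of it.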

For a population, the mean is a constant value no matter how many times it's recalculated. But with a set of samples, the mean will depend on exactly what samples we happened to choose. From a sample mean, we can then extrapolate the mean of the population as a whole. There are many reasons we might use sampling, such as:

* We don't have data for the whole population.
* We have the whole population data, but it is so large that it is infeasible to analyze.
* We can provide meaningful answers to questions faster with sampling.

When we have a numerical dataset and want to know the average value, we calculate the mean. The cell below builds a population, draws five random samples of 30 values from it, and compares each sample mean with the population mean.

In [3]:
import numpy as np

population = np.random.normal(loc=65, scale=3.5, size=300)
population_mean = np.mean(population)

print("Population Mean: {}".format(population_mean))

sample_1 = np.random.choice(population, size=30, replace=False)
sample_2 = np.random.choice(population, size=30, replace=False)
sample_3 = np.random.choice(population, size=30, replace=False)
sample_4 = np.random.choice(population, size=30, replace=False)
sample_5 = np.random.choice(population, size=30, replace=False)

sample_1_mean = np.mean(sample_1)
sample_2_mean = np.mean(sample_2)
sample_3_mean = np.mean(sample_3)
sample_4_mean = np.mean(sample_4)
sample_5_mean = np.mean(sample_5)

print("Sample 1 Mean: {}".format(sample_1_mean))
print("Sample 2 Mean: {}".format(sample_2_mean))
print("Sample 3 Mean: {}".format(sample_3_mean))
print("Sample 4 Mean: {}".format(sample_4_mean))
print("Sample 5 Mean: {}".format(sample_5_mean))

Population Mean: 65.12536191458324
Sample 1 Mean: 63.977841081057115
Sample 2 Mean: 64.73636469663214
Sample 3 Mean: 65.23451342150472
Sample 4 Mean: 64.46840179993049
Sample 5 Mean: 65.80276375041846


#### Central Limit Theorem

Perhaps, this time, you're a tailor of school uniforms at a middle school. You need to know the average height of people from 10-13 years old in order to know which sizes to make the uniforms. Knowing the best decisions are based on data, you set out to do some research at your local middle school. Organizing with the school, you measure the heights of some students. Their average height is 57.5 inches. You know a little about sampling and decide that measuring 30 out of the 300 students gives enough data to assume 57.5 inches is roughly the average height of everyone at the middle school. You set to work with this dimension and make uniforms that fit people of this height, some smaller and some larger.

Unfortunately, when you go about making your uniforms, many reports come back saying that they are too small. Something must have gone wrong with your decision-making process! You go back to collect the rest of the data: you measure the sixth graders one day (56.7 inches on average, not so bad), the seventh graders after that (59 inches tall on average), and the eighth graders the next day (61.7 inches!). Your sample mean was far off from your population mean. How did this happen?

Well, your sample selection was skewed to one direction of the total population. It looks like you must have measured more sixth graders than is representative of the whole middle school. How do you get an average sample height that looks more like the average population height?

In the previous exercise, we looked at different sets of samples taken from a population and how the mean of each set could be different from the population mean. This is a natural consequence of the fact that a set of samples has less data than the population to which it belongs. If our sample selection is poor then we will have a sample mean seriously skewed from our population mean.

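There is a reliable way to reduce the risk of an unrepresentative sample mean: take a larger random sample. The sample mean of a larger sample will tend to approximate the population mean more closely. The Central Limit Theorem makes this precise: if the sample size is large enough, the means of repeated random samples are approximately normally distributed around the population mean, and their spread shrinks as the sample size grows. A larger sample only helps if it is drawn at random, though. Measuring more sixth graders would not have fixed the uniform problem above.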

Later, we'll learn how to put numeric values on "large enough" and "sufficiently close".

In [11]:
# Create population and find population mean

# loc = mean (center) of the distribution
# scale = std (spread or width) of the distribution

population = np.random.normal(loc=65, scale=100, size=3000)
population_mean = np.mean(population)

# Select increasingly larger samples
extra_small_sample = population[:10]
small_sample = population[:50]
medium_sample = population[:100]
large_sample = population[:500]
extra_large_sample = population[:1000]

# Calculate the mean of those samples
extra_small_sample_mean = np.mean(extra_small_sample)
small_sample_mean = np.mean(small_sample)
medium_sample_mean = np.mean(medium_sample)
large_sample_mean = np.mean(large_sample)
extra_large_sample_mean = np.mean(extra_large_sample)

# Print them all out!
print("Extra Small Sample Mean: {}".format(extra_small_sample_mean))
print("Small Sample Mean: {}".format(small_sample_mean))
print("Medium Sample Mean: {}".format(medium_sample_mean))
print("Large Sample Mean: {}".format(large_sample_mean))
print("Extra Large Sample Mean: {}".format(extra_large_sample_mean))

print("\nPopulation Mean: {}".format(population_mean))

Extra Small Sample Mean: 81.63923937890398
Small Sample Mean: 77.48585150954135
Medium Sample Mean: 62.78963852914493
Large Sample Mean: 55.80149840714335
Extra Large Sample Mean: 60.42100224819855

Population Mean: 64.24944252840096
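The cell above looks at one sample of each size. To see the pattern the Central Limit Theorem describes, it can help to draw many samples of each size and look at how widely their means are spread. The sketch below does that on a freshly generated population; the seed, the number of repeats, and the sample sizes are arbitrary choices for illustration, not part of the original exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical population with the same parameters as the cell above
population = rng.normal(loc=65, scale=100, size=3000)
population_std = np.std(population)

spread_by_size = {}
for sample_size in [10, 100, 1000]:
    # Draw 2000 random samples of this size and record each sample mean
    sample_means = [
        np.mean(rng.choice(population, size=sample_size, replace=True))
        for _ in range(2000)
    ]
    spread_by_size[sample_size] = np.std(sample_means)
    print("n = {:4d}: std of sample means = {:6.2f}, sigma / sqrt(n) = {:6.2f}".format(
        sample_size,
        spread_by_size[sample_size],
        population_std / np.sqrt(sample_size)))
```

The spread of the sample means should track the population standard deviation divided by the square root of the sample size, which is one way to put numbers on "large enough" and "sufficiently close".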


#### Hypothesis Tests

When observing differences in data, a data analyst understands the possibility that these differences could be the result of random chance.

Suppose we want to know if men are more likely to sign up for a given programming class than women. We invite 100 men and 100 women to this class. After one week, 34 women sign up, and 39 men sign up. More men than women signed up, but is this a "real" difference?

We have taken sample means from two different populations, men and women. We want to know if the difference that we observe in these sample means reflects a difference in the population means. To formally answer this question, we need to re-frame it in terms of probability:

"If men and women have the same level of interest in this class, what is the probability of seeing a difference at least as large as the one we observed just by chance?"

In other words, "If we gave the same invitation to every person in the world, would more men still sign up?"

A more formal version is: "If the two population means are the same, what is the probability that chance alone would produce a difference in the sample means at least as large as the one we observed?"

These statements are all ways of expressing a null hypothesis. A null hypothesis is a statement that the observed difference is the result of chance.

Hypothesis testing is a mathematical way of determining whether we can be confident that the null hypothesis is false. Different situations will require different types of hypothesis testing, which we will learn about in the next lesson.
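One informal way to get a feel for the sign-up example is to simulate it. The sketch below assumes the null hypothesis is true (men and women share a single sign-up rate, estimated by pooling both groups), repeats the 100-versus-100 experiment many times, and counts how often chance alone produces a gap of 5 or more sign-ups. This is a simulation under stated assumptions, not one of the formal tests covered later, and the seed and simulation count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

invited_per_group = 100
men_signups, women_signups = 39, 34
observed_difference = abs(men_signups - women_signups)

# Under the null hypothesis both groups share one sign-up rate,
# estimated here by pooling the two groups
pooled_rate = (men_signups + women_signups) / (2 * invited_per_group)

# Simulate many repeats of the experiment in a world where the null is true
n_simulations = 100000
simulated_men = rng.binomial(invited_per_group, pooled_rate, size=n_simulations)
simulated_women = rng.binomial(invited_per_group, pooled_rate, size=n_simulations)
simulated_differences = np.abs(simulated_men - simulated_women)

# Fraction of simulated experiments with a gap at least as large as the one we saw
fraction_as_extreme = np.mean(simulated_differences >= observed_difference)
print("Share of null simulations with a gap of 5 or more: {:.3f}".format(fraction_as_extreme))
```

A gap of this size turns up in roughly half of the simulated experiments, so the observed difference on its own gives little reason to doubt the null hypothesis.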

#### Type I Or Type II

When we rely on automated processes to make our decisions for us, we need to be aware of how this automation can lead to mistakes. Computer programs are as fallible as the humans who design them. As humans capable of programming, the responsibility is on us to understand what can go wrong and what we can do to contain these foreseeable problems.

In statistical hypothesis testing, we concern ourselves primarily with two types of error. The first kind of error, known as a Type I error, is finding a correlation between things that are not related. This error is sometimes called a "false positive" and occurs when the null hypothesis is rejected even though it is true.

For example, let's say you conduct an A/B test for an online store and conclude that interface B is significantly better than interface A at directing traffic to a checkout page. You have rejected the null hypothesis that there is no difference between the two interfaces. If, in reality, your results were due to the groups you happened to pick, and there is actually no significant difference between interface A and interface B in the greater population, you have been the victim of a false positive.

The second kind of error, a Type II error, is failing to find a correlation between things that are actually related. This error is referred to as a "false negative" and occurs when the null hypothesis is not rejected even though it is false.

For example, with the A/B test situation, let's say that after the test, you concluded that there was no significant difference between interface A and interface B. If there actually is a difference in the population as a whole, your test has resulted in a false negative.

In [12]:
def intersect(list1, list2):
  return [sample for sample in list1 if sample in list2]

# the true positives and negatives:
actual_positive = [2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47]
actual_negative = [1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49]

# the positives and negatives we determine by running the experiment:
experimental_positive = [2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49]
experimental_negative = [1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48]

#define type_i_errors and type_ii_errors here
type_i_errors = intersect(actual_negative,experimental_positive)
type_ii_errors = intersect(actual_positive,experimental_negative)

print(type_i_errors)
print(type_ii_errors)

[4, 9, 11, 13, 15, 16, 17, 19, 20, 22, 26, 27, 28, 35, 36, 40, 46, 49]
[6, 25, 29, 30, 33, 42, 44, 47]
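Counting the errors is more informative when the counts are turned into rates. As a small follow-up sketch, the snippet below reuses the same labels, finds the two kinds of error with set intersections (an alternative to the list-based `intersect` helper above), and divides by the number of actual negatives and actual positives. The rate names are chosen here for illustration.

```python
# Same labels as in the cell above
actual_positive = [2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47]
actual_negative = [1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31,
                   34, 35, 36, 37, 40, 41, 43, 46, 48, 49]
experimental_positive = [2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24,
                         26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49]
experimental_negative = [1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48]

# Type I: actually negative, but the experiment called it positive
type_i_errors = sorted(set(actual_negative) & set(experimental_positive))
# Type II: actually positive, but the experiment called it negative
type_ii_errors = sorted(set(actual_positive) & set(experimental_negative))

# Express each error count as a share of the cases where that error was possible
false_positive_rate = len(type_i_errors) / len(actual_negative)
false_negative_rate = len(type_ii_errors) / len(actual_positive)

print("Type I errors: {} of {} actual negatives ({:.0%})".format(
    len(type_i_errors), len(actual_negative), false_positive_rate))
print("Type II errors: {} of {} actual positives ({:.0%})".format(
    len(type_ii_errors), len(actual_positive), false_negative_rate))
```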


#### P-Values

We have discussed how a hypothesis test is used to assess a null hypothesis. A hypothesis test provides a numerical answer, called a p-value, that helps us decide how confident we can be in the result. A p-value is the probability of seeing a result at least as extreme as the one we observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true.

A p-value of 0.05 would mean that, if the null hypothesis were true, there would be a 5% chance of seeing a difference at least as large as the one we observed. When comparing two samples, this generally means a 5% chance of such a difference arising even though the two population means are the same.

Before conducting a hypothesis test, we choose the threshold (the significance level) that the p-value must fall below before we conclude that the results are significant. A higher threshold is more likely to give a false positive, so if we want to be very sure that the result is not due to just chance, we will select a very small threshold.

It is important that we choose the threshold before we see the results. If we wait until after we see the results, we might pick our threshold such that we get the result we want to see. For instance, if we're trying to publish our results, we might set a threshold such that our results are significant. Choosing our threshold in advance helps keep us honest.

Generally, we want a p-value of less than 0.05, meaning that if there were no real difference, results like ours would turn up by random chance less than 5% of the time.
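The link between the 0.05 cutoff and false positives can be checked with a simulation. The sketch below runs many experiments in which the null hypothesis is true by construction (every sample really does come from a distribution with mean 30) and counts how often a 1 Sample T-Test still reports p < 0.05. The population parameters, seed, and experiment count are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_mean = 30
n_experiments = 5000
sample_size = 30

# Each row is one experiment; the null hypothesis (mean of 30) is true in all of them
samples = rng.normal(loc=true_mean, scale=5, size=(n_experiments, sample_size))
p_values = ttest_1samp(samples, true_mean, axis=1).pvalue

# Any "significant" result here is a false positive, since the null is true
false_positive_rates = {}
for threshold in [0.05, 0.01]:
    false_positive_rates[threshold] = np.mean(p_values < threshold)
    print("Cutoff {}: {:.1%} of experiments were false positives".format(
        threshold, false_positive_rates[threshold]))
```

The share of false positives should land close to the cutoff itself, which is why a stricter cutoff gives fewer false alarms.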

In [13]:
def accept_null_hypothesis(p_value):
  """
  Returns True when we fail to reject the null hypothesis, False when we reject it

  Takes a p-value as its input and assumes p < 0.05 is significant
  """
  return not (p_value < .05)


hypothesis_tests = [0.1, 0.009, 0.051, 0.012, 0.37, 0.6, 0.11, 0.025, 0.0499, 0.0001]

for p_value in hypothesis_tests:
    print(p_value, accept_null_hypothesis(p_value))

0.1 True
0.009 False
0.051 True
0.012 False
0.37 True
0.6 True
0.11 True
0.025 False
0.0499 False
0.0001 False


#### Types of Hypothesis Test

When we are trying to compare datasets, we often need a way to be confident knowing if datasets are significantly different from each other.
Some situations involve correlating numerical data, such as:

* a professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?
* a manager of a chain of stores wants to know if certain locations have different revenues on different days of the week. Are the revenue differences a result of natural fluctuations or a significant difference between the stores' sales patterns?
* a PM for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?

Others involve categorical data, such as:

* a pollster wants to know if men and women have significantly different yogurt flavor preferences. Does a result where men more often answer "chocolate" as their favorite reflect a significant difference in the population?
* do different age groups have significantly different emotional reactions to different ads?

In this lesson, you will learn about how we can use hypothesis testing to answer these questions. There are several different types of hypothesis tests for the various scenarios you may encounter. Luckily, SciPy has built-in functions that perform all of these tests for us, normally using just one line of code.

For numerical data, we will cover:

* One Sample T-Tests
* Two Sample T-Tests
* ANOVA
* Tukey Tests

For categorical data, we will cover:

* Binomial Tests
* Chi Square

After this lesson, you will have a wide range of tools in your arsenal to find meaningful correlations in data.
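As a rough preview of what "one line of code" looks like, the sketch below calls four SciPy functions on made-up data modeled on the scenarios above. All of the numbers are hypothetical and generated only so the calls have something to run on. Binomial and Tukey tests are left out here.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, f_oneway, chi2_contingency

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# One Sample T-Test: hypothetical exam scores compared with an expected average of 75
exam_scores = rng.normal(78, 10, size=40)
_, p_exam = ttest_1samp(exam_scores, 75)

# Two Sample T-Test: hypothetical time on page (seconds) for two homepage versions
version_a = rng.normal(60, 15, size=200)
version_b = rng.normal(64, 15, size=200)
_, p_homepage = ttest_ind(version_a, version_b)

# ANOVA: hypothetical daily revenue for three store locations
store_1 = rng.normal(1000, 100, size=30)
store_2 = rng.normal(1020, 100, size=30)
store_3 = rng.normal(1100, 100, size=30)
_, p_stores = f_oneway(store_1, store_2, store_3)

# Chi Square: hypothetical yogurt survey, rows = men/women, columns = chocolate/vanilla
flavor_counts = [[30, 20], [25, 25]]
chi2, p_flavor, dof, expected = chi2_contingency(flavor_counts)

for name, p in [("exam", p_exam), ("homepage", p_homepage),
                ("stores", p_stores), ("yogurt", p_flavor)]:
    print("{:9s} p-value: {:.4f}".format(name, p))
```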

In [14]:
# Day 4 & 5 - Hypothesis Testing

In [16]:
# https://s3.amazonaws.com/codecademy-content/courses/learn-hypothesis-testing/index.html

#### 1 Sample T-Testing

Let's imagine the fictional business BuyPie, which sends ingredients for pies to your household, so that you can make them from scratch. Suppose that a product manager wants the average age of visitors to BuyPie.com to be 30. In the past hour, the website had 100 visitors and the average age was 31. Are the visitors too old? Or is this just the result of chance and a small sample size?

We can test this using a univariate T-test. A univariate T-test compares a sample mean to a hypothetical population mean. It answers the question "If the sample came from a distribution with the desired mean, how likely would we be to see a sample mean at least this far from it?"

When we conduct a hypothesis test, we want to first create a null hypothesis, which is a prediction that there is no significant difference. The null hypothesis that this test examines can be phrased as such: "The set of samples belongs to a population with the target mean".

The result of the 1 Sample T Test is a p-value, which will tell us whether or not we can reject this null hypothesis. Generally, if we receive a p-value of less than 0.05, we can reject the null hypothesis and state that there is a significant difference.

SciPy has a function called ttest_1samp, which performs a 1 Sample T-Test for you.

ttest_1samp requires two inputs, a distribution of values and an expected mean:

tstat, pval = ttest_1samp(example_distribution, expected_mean)
print(pval)

It also returns two outputs: the t-statistic (which we won't cover in this course), and the p-value, which tells us how likely we would be to see a sample mean at least this far from the specified mean if the sample really came from a distribution with that mean.

In [22]:
from scipy.stats import ttest_1samp
import numpy as np

ages = [32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22]
print(ages)

ages_mean = np.mean(ages)
print(ages_mean)

tstat, pval = ttest_1samp(ages, 30)
print(pval)

[32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22]
31.0
0.560515588817
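The course doesn't cover the t-statistic, but it can demystify the p-value to see where it comes from. The sketch below recomputes the result for the same `ages` list by hand, using the standard one-sample formula t = (sample mean - expected mean) / (sample standard deviation / sqrt(n)) and a two-sided tail probability from the t distribution, then compares it with `ttest_1samp`. The intermediate variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

ages = [32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22]
expected_mean = 30

n = len(ages)
sample_mean = np.mean(ages)
sample_std = np.std(ages, ddof=1)        # ddof=1 gives the sample standard deviation
standard_error = sample_std / np.sqrt(n)

# How many standard errors the sample mean sits away from the expected mean
t_manual = (sample_mean - expected_mean) / standard_error

# Two-sided p-value: chance of a t-statistic at least this far from 0 in either direction
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = ttest_1samp(ages, expected_mean)

print("Manual: t = {:.4f}, p = {:.4f}".format(t_manual, p_manual))
print("SciPy:  t = {:.4f}, p = {:.4f}".format(t_scipy, p_scipy))
```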


#### One Sample T-Test II

In the last exercise, we got a p-value that was much higher than 0.05, so we cannot reject the null hypothesis. Does this mean that if we wait for more visitors to BuyPie, the average age would definitely be 30 and not 31? Not necessarily. In fact, in this case, we know that the mean of our sample was 31.

P-values give us an idea of how confident we can be in a result. Just because we don’t have enough data to detect a difference doesn’t mean that there isn’t one. Generally, the more samples we have, the smaller a difference we’ll be able to detect. You can learn more about the exact relationship between the number of samples and detectable differences in the Sample Size Determination course.
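The effect of sample size can be sketched with simulated visitors. The snippet below assumes the true average age really is 31 (one year above the target of 30) with a standard deviation of 6, both made-up values, and checks how often a 1 Sample T-Test detects that difference at three sample sizes.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

target_mean = 30
true_mean = 31      # assumed: the real average age is one year above the target
true_std = 6        # assumed spread of visitor ages
n_experiments = 1000

detection_rate = {}
for sample_size in [14, 100, 1000]:
    # Each row is one simulated experiment with sample_size visitors
    samples = rng.normal(true_mean, true_std, size=(n_experiments, sample_size))
    p_values = ttest_1samp(samples, target_mean, axis=1).pvalue
    # Share of experiments that correctly flag the one-year difference
    detection_rate[sample_size] = np.mean(p_values < 0.05)
    print("n = {:4d}: difference detected in {:.1%} of experiments".format(
        sample_size, detection_rate[sample_size]))
```

With only a handful of visitors the one-year gap is rarely detected, while with a thousand visitors it is detected almost every time.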

To gain some intuition on how our confidence levels can change, let's explore some distributions with different means and how our p-values from the 1 Sample T-Tests change.

In [1]:
from scipy.stats import ttest_1samp
import numpy as np

correct_results = 0 # Start the counter at 0

daily_visitors = np.genfromtxt("daily_visitors.csv", delimiter=",")

for i in range(1000): # 1000 experiments
    tstat, pval = ttest_1samp(daily_visitors[i], 30)
    print(pval)
    if pval < .05:
        correct_results += 1

print("We correctly recognized that the distribution was different in " + str(correct_results) + " out of 1000 experiments.")

0.236959424736
0.00551175004638
0.236367958654
0.107835178116
0.00441488212078
0.162140482055
0.160082923667
0.00875290803148
0.00941375984411
0.288298628474
0.0353104675891
0.214476925759
0.000622771825632
9.58825887696e-06
0.235150922975
0.000263423248628
0.803928495457
0.0301657393052
0.670698861832
0.0873428605512
2.90788083399e-05
0.0264925568316
0.0446724516499
0.135443319261
0.0166128984321
0.148920193059
0.0376451522887
0.0153763697662
0.156798777942
0.150650541039
0.0685476997161
0.149465791272
0.000352701205788
0.0116419012384
0.799349315992
0.0162535069246
0.0271314568134
0.0483578873883
0.117582878505
0.880142334258
0.0554769642956
0.00508984769088
0.00667972461828
0.124786445326
0.00373992490997
0.00167155300361
0.00129447711818
0.153507674005
0.779846772104
0.00857641019387
0.502710657141
0.0442291447039
0.000903939671667
0.886965065712
0.325156446902
0.0045742054095
0.00407277310834
0.0291031195054
0.00306993347409
0.00958306001491
0.0264925568316
0.0146684812185
0.02931

0.294417580629
0.316460247962
0.297722485183
0.00398631098586
1.9391479076e-05
0.0045955190297
0.0062149980348
0.190870304988
2.42371064346e-05
0.350270810402
0.0423034528744
0.00162034916754
0.137759156645
0.408954450785
2.44281226327e-05
0.0123627273644
0.0400285991124
0.0612313924946
0.457107041163
0.121627248209
0.0968096267634
0.415063036729
0.222016419317
0.0911853184305
0.176100182289
0.0442580449604
0.129073862946
0.00607905757011
0.16841141717
0.702367104873
0.000919347898962
0.0102484131512
0.0730851687622
0.00938754512334
2.61766492247e-05
0.153396314022
0.293108061111
9.62743412039e-06
0.0303209845675
0.191222998452
0.002642233465
0.00914769823213
0.0208730217155
0.73657874168
0.0479556179482
0.00558496607806
0.0967790311638
0.2115728073
0.00483603790147
0.0403897121108
0.048541705983
0.110225662954
0.0119434578357
0.538584578817
0.07067455465
3.39155031656e-06
0.38222136967
0.101372805748
0.0132427429222
0.0277435464136
0.00134275920143
0.834978676372
0.0011808456111
0.082