In [38]:
import pandas as pd
import numpy as np
import random
random.seed(666)  # call the function; `random.seed=(666)` would overwrite it with an int


# What Place Does Probability Occupy in Data Science?

    A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. 
                - Joshua Blumenstock
---

## Probability Examples in Data Science: A/B Testing

   One example of how data scientists use probability is A/B testing. An A/B test is a simple experiment that compares two versions of something to see which version performs better. It can be applied to almost anything: which color of t-shirt sells better, which style of ad leads to more purchases, or which version of a website draws more traffic. Because a lot of data is needed for the results to be meaningful, A/B testing is mainly used in online applications to optimize websites and online ads. To run the test, you show version A of an ad to half of the people and version B to the other half, then see which version has the higher probability of being clicked. For the results to be useful, versions A and B should differ in as few ways as possible. Otherwise it is hard to say which change caused the difference in performance. 
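
As a rough illustration of the procedure described above, the sketch below simulates an A/B test. The click probabilities, the visitor count, and the names `TRUE_CLICK_RATE` and `VISITORS` are all made up for this example; in a real test the true click rates are unknown and are exactly what the experiment tries to estimate.

```python
import random

rng = random.Random(0)  # local generator so the example is repeatable

# ASSUMPTION: hypothetical "true" click probabilities for each version.
TRUE_CLICK_RATE = {"A": 0.10, "B": 0.12}
VISITORS = 10_000

shown = {"A": 0, "B": 0}
clicks = {"A": 0, "B": 0}

for _ in range(VISITORS):
    version = rng.choice(["A", "B"])   # randomly assign each visitor to a version
    shown[version] += 1
    if rng.random() < TRUE_CLICK_RATE[version]:
        clicks[version] += 1

for version in ("A", "B"):
    rate = clicks[version] / shown[version]
    print(f"Version {version}: {clicks[version]} clicks / {shown[version]} views = {rate:.3f}")
```

With only a few hundred visitors the two estimated rates can easily come out in the wrong order, which is one way to see why a lot of data is needed.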

---
## Probability Examples in Data Science: Simulations

   If you don't have data, you can simulate the problem to generate data that helps answer your questions. For example, suppose there are two types of ads on YouTube: skippable ads, which are cheaper, and unskippable ads, which are more expensive. To find out which one is the better purchase, we can build a simulation. For each type of ad there is a probability that a user will click on it, and after the click there is some probability that the user will make a purchase. If we have a rough idea of those probabilities for each type of ad, and a rough idea of how much people spend when they make a purchase, we can simulate the situation and use the generated data to evaluate which type of ad is the best value for the money. We don't have to actually purchase both types of ad and then collect data on their effectiveness. 
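
A minimal sketch of such a simulation follows. Every number in it (ad costs, click and purchase probabilities, average spend, number of impressions) is an invented assumption, and `simulate` is a hypothetical helper written for this example, not part of any library.

```python
import random

rng = random.Random(1)

# ASSUMPTION: all of these values are made up for illustration only.
ADS = {
    "skippable":   {"cost": 0.02, "p_click": 0.01, "p_buy": 0.05},
    "unskippable": {"cost": 0.05, "p_click": 0.03, "p_buy": 0.05},
}
AVG_PURCHASE = 40.0    # assumed average spend per purchase, in dollars
IMPRESSIONS = 100_000  # assumed number of times each ad is shown

def simulate(ad, n, rng):
    """Return simulated profit (revenue minus ad cost) for n impressions of one ad type."""
    revenue = 0.0
    for _ in range(n):
        clicked = rng.random() < ad["p_click"]
        # A purchase can only happen after a click.
        if clicked and rng.random() < ad["p_buy"]:
            revenue += AVG_PURCHASE
    return revenue - ad["cost"] * n

results = {name: simulate(ad, IMPRESSIONS, rng) for name, ad in ADS.items()}
for name, profit in results.items():
    print(f"{name}: simulated profit ${profit:,.2f}")
```

Changing the assumed probabilities and rerunning shows how sensitive the conclusion is to those rough estimates.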
    
---    
## Definitions

**Experiment**: An experiment is an operation or procedure carried out under controlled conditions in order to discover an unknown effect or law, to test or establish a hypothesis, or to illustrate a known law. 
    - Merriam-Webster Dictionary

A/B testing is an example of an experiment. 
    
**Sample Space**: A sample space is a collection of all possible outcomes of a random experiment. A sample space may be continuous or discrete. 
    - Merriam-Webster Dictionary

The sample space is usually denoted S. For a coin flip the sample space is {H, T}, where H is heads and T is tails. For a roll of a die the sample space is {1, 2, 3, 4, 5, 6}. If we flip two coins the sample space is {(H,H), (H,T), (T,H), (T,T)}. These are all examples of discrete sample spaces, meaning the outcomes of the experiment can be counted; in each of these examples there is a finite number of them. We can also have a continuous sample space. For example, the sample space of the heights of NBA players is continuous: there is a lower and an upper limit to the possible heights, but between those values there are an infinite number of possible heights. 
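
The discrete sample spaces above can be enumerated in code. This sketch uses `itertools.product` from the standard library; the variable names are just for illustration.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Sample space for flipping two coins: every ordered pair of single-coin outcomes.
two_coins = list(product(coin, repeat=2))
print(two_coins)       # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]

# The same idea for rolling two dice gives 6 * 6 outcomes.
two_dice = list(product(die, repeat=2))
print(len(two_dice))   # 36
```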

**Event**: An event is a subset of the possible outcomes of an experiment. 
    - Merriam-Webster Dictionary

An event is usually denoted E. Following the sample space examples above, heads is a possible event for a single coin flip, and (H,T) is a possible event for two coin flips. An NBA player being 6' tall is an event in the continuous sample space of heights. 

**Probability**: Probability is the ratio of the number of outcomes in an exhaustive set of equally likely outcomes that produce a given event to the total number of possible outcomes. It can also mean a logical relation between statements such that the evidence confirming one confirms the other to some degree. 
    - Merriam-Webster Dictionary
    
The probabilities of all the outcomes in a sample space must sum to one. In a discrete sample space, an event with probability 1 is certain to occur and an event with probability 0 cannot occur. 

To calculate the probability of an event in a discrete sample space with equally likely outcomes, count the number of outcomes in the sample space that belong to the event and divide by the total number of outcomes in the sample space. For example, if we have a bag of 10 skittles with 3 yellow skittles, 2 purple skittles, 4 red skittles, and 1 green skittle, then we have a sample space of { y, y, y, p, p, r, r, r, r, g }. If we want to know the probability of picking a purple skittle, we count the number of purple skittles and divide by the total number of skittles in the sample space to get 2/10, or a 20% probability of picking a purple skittle. 
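
The counting rule for the skittles example can be checked with a few lines of Python. This is a small sketch using `collections.Counter` and `fractions.Fraction` from the standard library; the one-letter color codes match the sample space written above.

```python
from collections import Counter
from fractions import Fraction

# The bag from the example: 3 yellow, 2 purple, 4 red, 1 green.
bag = ["y"] * 3 + ["p"] * 2 + ["r"] * 4 + ["g"] * 1

counts = Counter(bag)
# Probability of each color = (number of that color) / (total number of skittles).
probabilities = {color: Fraction(n, len(bag)) for color, n in counts.items()}

print(probabilities["p"])            # 1/5
print(float(probabilities["p"]))     # 0.2
print(sum(probabilities.values()))   # 1
```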

For continuous sample spaces it gets more complicated, because there are an infinite number of possible outcomes. The probability of a basketball player being exactly 6' tall is effectively zero because there are infinitely many possible heights the player could have. So instead we ask: what is the probability that the basketball player is 6' tall with a tolerance of 0.1 feet? To calculate this we need data on NBA players' heights so that we can estimate a probability density function (pdf). Once we have the pdf, we can integrate it from 5.9' to 6.1'.  

$$P(5.9 \le X \le 6.1)=\int_{5.9}^{6.1}f(x)\,dx$$
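
To make the integral concrete, the sketch below assumes that heights follow a normal distribution. The mean of 6.5 feet and standard deviation of 0.3 feet are invented for illustration and are not real NBA statistics. It computes the area under the pdf between 5.9 and 6.1 in two ways: from the cumulative distribution function, and by adding up thin rectangles.

```python
from statistics import NormalDist

# ASSUMPTION: heights are normally distributed with these made-up parameters (in feet).
heights = NormalDist(mu=6.5, sigma=0.3)

lower, upper = 5.9, 6.1

# Area under the pdf between the limits, via the cumulative distribution function.
p_exact = heights.cdf(upper) - heights.cdf(lower)

# The same area approximated by summing thin rectangles (midpoint rule).
steps = 1_000
width = (upper - lower) / steps
p_numeric = sum(heights.pdf(lower + (i + 0.5) * width) for i in range(steps)) * width

print(f"P({lower} <= X <= {upper}) from the cdf:        {p_exact:.4f}")
print(f"P({lower} <= X <= {upper}) from the rectangles: {p_numeric:.4f}")
print(f"pdf at exactly 6.0: {heights.pdf(6.0):.4f} (a density, not a probability)")
```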

**Long Run Frequencies of an Event**: If the probability of an event occurring is 1 in 6, that doesn't mean you are guaranteed to see the event if you run the experiment 6 times. It is possible to roll a die 6 times and never roll a 1, but if you keep rolling that die 1,000 or 10,000 more times, the fraction of rolls that come up 1 will settle close to 1/6. When we talk about probability we are talking about the long run frequency of an event. 
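
A quick way to see this is to simulate die rolls and watch the fraction of 1s as the number of rolls grows. The helper `frequency_of_one` is a hypothetical function written for this sketch.

```python
import random

rng = random.Random(42)  # local generator so the example is repeatable

def frequency_of_one(n_rolls, rng):
    """Roll a fair six-sided die n_rolls times and return the fraction of rolls that were 1."""
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return rolls.count(1) / n_rolls

for n in (6, 60, 600, 6_000, 60_000):
    print(f"{n:>6} rolls: fraction of 1s = {frequency_of_one(n, rng):.4f}")

print(f"theoretical value: {1/6:.4f}")
```

The small runs can land far from 1/6, while the large runs should sit close to it.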

--- 
## Python's Random Module

The random module in Python allows us to create random numbers, randomly shuffle a list, randomly sample from a probability distribution, and many other things. If you want your random numbers to be reproducible from run to run and across machines, you need to set the seed by calling random.seed() with a fixed value. Otherwise Python seeds the generator from the operating system's randomness source or the current time, so every run produces different numbers.  
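
The effect of the seed can be checked directly: resetting the generator to the same seed replays the same sequence. The seed value 666 is the one used at the top of this notebook; any fixed value works.

```python
import random

random.seed(666)
first = [random.randint(1, 6) for _ in range(5)]

random.seed(666)   # reset to the same seed
second = [random.randint(1, 6) for _ in range(5)]

print(first)
print(second)
print(first == second)   # True: same seed, same sequence
```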


In [21]:
# you can use random.random() to create a random float in the range [0.0, 1.0)
a=random.random()
print(a)

# if you want a random float that is larger than one you can use random.uniform() 
b = random.uniform(1,100)
print(b)

#if you want a random int you can use random.randint()
c=random.randint(1,15)
print(c)
# both the 1 and the 15 are inclusive meaning this could return a 1 or a 15

0.6447343807627546
36.94071181525108
7


In [22]:
# you can use random.choice() to randomly select an element of a list

colors=['red','orange','yellow', 'green','blue','indigo', 'violot', 'peach', 'black', 'white']
random.choice(colors)

'white'

In [23]:
# if you want more than one random element from the list you can use random.choices()
random.choices(colors,k=10) #k is the number of random samples that you want

['green',
 'violot',
 'peach',
 'green',
 'white',
 'blue',
 'green',
 'violot',
 'yellow',
 'yellow']

In [24]:
# you can weight the random choices as well. 
random.choices(colors,weights=[5,10,20,5,5,15,15,5,6,4],k=10) 

# The weights do not have to add up to any particular total. random.choices() adds 
# up all the values in the weights list (here the total is 90), and the probability of 
# each element is its weight/total. So for the color red the weight is 5, and 5/90 is 
# about 5.6%, so it has roughly a 5.6% probability of being selected on each draw

['green',
 'indigo',
 'orange',
 'orange',
 'violot',
 'violot',
 'peach',
 'indigo',
 'yellow',
 'violot']

In [25]:
# random.sample() samples without replacement: once an element has been picked it cannot 
# be picked again in the same call. The original list itself is not modified. 
random.sample(colors,k=10) 

['violot',
 'indigo',
 'yellow',
 'blue',
 'peach',
 'orange',
 'red',
 'white',
 'green',
 'black']
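
The difference between sampling with and without replacement can be verified on a small list. The list `letters` below is a made-up example, kept separate from `colors` so the cells above are unaffected.

```python
import random

letters = ['a', 'b', 'c', 'd', 'e']

picked = random.sample(letters, k=3)
print(picked)
print(len(set(picked)) == len(picked))   # True: no element is drawn twice
print(len(letters))                      # 5: the original list is unchanged

# Asking for more elements than the list holds is an error for random.sample(),
# whereas random.choices() would simply repeat elements.
try:
    random.sample(letters, k=6)
except ValueError as err:
    print("ValueError:", err)
```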

---
## Practice Problem
For practice see if you can create a roulette wheel simulation using the random module. A roulette wheel has 18 red spaces, 18 black spaces, and 2 green spaces where the ball can land. 

In [None]:
## Write your code here. 

---
We can use the random module and NumPy's cumsum() function, which cumulatively sums a series, to demonstrate how the long run frequency of an event works for flipping a coin. 

In [39]:
S=[0,1] # sample space of 0 for tails and 1 for heads 
flips=random.choices(S,k=1000) 
cumsumFlips=np.cumsum(flips)
print("-"*30)
print(flips[0:10])
print("-"*30)
print(list(flips[0:10]))
print("-"*30)


------------------------------
[0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
------------------------------
[0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
------------------------------


In [41]:
# if we normalize cumsumFlips by the number of trials we can see what fraction of the flips
# have been heads
Pheads=cumsumFlips/range(1,len(cumsumFlips)+1)
# now let's look at the first 5 flips, and at flips 995 through 999 near the end

print("-"*60)
print(flips[0:5])
print("-"*60)
print(list(cumsumFlips[0:5]))
print("-"*60)
print(Pheads[0:5])

print("="*60)

print("-"*60)
print(flips[-6:-1])
print("-"*60)
print(list(cumsumFlips[-6:-1]))
print("-"*60)
print(Pheads[-6:-1])

------------------------------------------------------------
[0, 0, 1, 0, 1]
------------------------------------------------------------
[0, 0, 1, 1, 2]
------------------------------------------------------------
[0.         0.         0.33333333 0.25       0.4       ]
------------------------------------------------------------
[0, 1, 1, 0, 0]
------------------------------------------------------------
[494, 495, 496, 496, 496]
------------------------------------------------------------
[0.49648241 0.49698795 0.49749248 0.49699399 0.4964965 ]


After the first 5 flips, heads had come up only 40% of the time, but by flip 999 (the last one shown, because the slice [-6:-1] stops one short of the end) heads had come up about 49.6% of the time. 