In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline

## Randomness

Let's briefly review the principles of randomness we discussed last week.

In [None]:
#Randomly Take a Value From a List
np.random.choice(['a', 'b', 'c'])

In [None]:
#Randomly Take an Integer in Range of Numbers (the Upper Bound, 5, Is Excluded)
np.random.randint(1,5)

In [None]:
#Randomly Take a Float Value in a Range of Numbers
np.random.uniform(0,2)

We can also use the 'size' argument in any of these functions to return more than one result.

In [None]:
np.random.uniform(0,2, size=3)

What if we want reproducible code that uses a 'random' function? We can call the 'np.random.seed' function so that our 'random' values will be the same every time.

**Note that for this to work reliably in Jupyter Notebook, you need to call it in the same cell as your random code. If you are running multiple random functions that you want to be reproducible, you need to put this line of code in every cell that has a random function!**

In [None]:
np.random.seed(42)
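
As a quick check of the seeding behavior described above, the sketch below resets the seed before a draw and confirms that the draw repeats. The variable names ('first_draw', 'second_draw', 'third_draw') are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# Set the seed, then draw three floats between 0 and 2
np.random.seed(42)
first_draw = np.random.uniform(0, 2, size=3)

# Resetting to the same seed replays the same sequence of 'random' values
np.random.seed(42)
second_draw = np.random.uniform(0, 2, size=3)

# Without resetting the seed, the next draw continues the sequence and differs
third_draw = np.random.uniform(0, 2, size=3)

print(np.array_equal(first_draw, second_draw))
print(np.array_equal(first_draw, third_draw))
```

This is why the seed line has to be rerun alongside the random code: a cell that draws without reseeding picks up wherever the sequence left off.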

Below, we can see our results when we randomly draw a float value between 0 and 2, repeated 10,000 times.

In [None]:
plt.hist(np.random.uniform(0,2, size=10000))

## Bernoulli Distribution

The Bernoulli distribution models a single trial with two possible outcomes, like flipping a coin once. We could simulate this by picking a random value between 0 and 1 and saying that anything at or above 0.5 is a 'head' and anything below 0.5 is a 'tail'.

In [None]:
np.random.random()

Let's say we wanted to "flip 100 coins" using this method.

In [None]:
#Note that last week I was using the 'append' function to add to our list of simulated trials, but the below is actually better practice.
random_numbers = np.empty(100)
for i in range(100):
    x = np.random.random()
    if x >= 0.5:
        random_numbers[i] = 1
    else:
        random_numbers[i] = 0
plt.hist(random_numbers)
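
The same 100 coin flips can also be produced without a loop. The following is a minimal sketch, assuming the same "at or above 0.5 is a head" rule; the names 'draws' and 'flips' are illustrative, and the seed value is arbitrary.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Draw 100 floats in [0, 1) in a single call
draws = np.random.random(100)

# Compare every draw to 0.5 at once, then convert True/False to 1/0
flips = (draws >= 0.5).astype(int)

print(flips.sum(), "heads out of", flips.size)
```

Passing 'flips' to 'plt.hist' gives the same kind of plot as the loop version.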

This is nice to play around with, but we can also use the 'scipy.stats' module to create a simulation.

In [None]:
dist = stats.bernoulli(0.5)
random_numbers = dist.rvs(size=100)
plt.hist(random_numbers)

The distribution object also gives us the key properties of the distribution.

In [None]:
#Mean
dist.mean()

In [None]:
#Variance
dist.var()

In [None]:
#PMF of 0
dist.pmf(0)

In [None]:
#PMF of 1
dist.pmf(1)

In [None]:
#CDF of 0
dist.cdf(0)

In [None]:
#CDF of 1
dist.cdf(1)
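
For a Bernoulli distribution with success probability p, the mean is p, the variance is p(1 - p), the PMF is 1 - p at 0 and p at 1, and the CDF reaches 1 at 1. The sketch below compares those closed-form values to what 'scipy.stats' reports; the name 'bern' is illustrative.

```python
import numpy as np
import scipy.stats as stats

p = 0.5
bern = stats.bernoulli(p)

# Mean and variance from the closed-form expressions
print(np.isclose(bern.mean(), p))
print(np.isclose(bern.var(), p * (1 - p)))

# PMF: probability of a 0 is 1 - p, probability of a 1 is p
print(np.isclose(bern.pmf(0), 1 - p), np.isclose(bern.pmf(1), p))

# CDF: P(X <= 0) is 1 - p, and P(X <= 1) covers every outcome
print(np.isclose(bern.cdf(0), 1 - p), np.isclose(bern.cdf(1), 1.0))
```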

## Binomial Distribution

Let's look at the binomial distribution, which counts the number of heads when we flip 100 coins!

In [None]:
dist = stats.binom(n=100, p=0.5)

Below is the number of heads we get in each of 100 repetitions of flipping 100 coins.

In [None]:
dist.rvs(size=100)

In [None]:
plt.hist(dist.rvs(size=100))

In [None]:
#Mean
dist.mean()

In [None]:
#Variance
dist.var()

In [None]:
#PMF of 50
dist.pmf(50)

In [None]:
#CDF of 50
dist.cdf(50)

In [None]:
#Manual Recreation of a CDF
x = 0
for i in range(51):
    x += dist.pmf(i)
x
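
The values above can also be reproduced from the binomial formula, where the PMF at k is C(n, k) * p^k * (1 - p)^(n - k), the mean is n * p, and the variance is n * p * (1 - p). The following is a sketch with illustrative names ('binom_dist', 'manual_pmf_50', 'manual_cdf_50'); 'math.comb' requires Python 3.8 or later.

```python
from math import comb

import numpy as np
import scipy.stats as stats

n, p = 100, 0.5
binom_dist = stats.binom(n=n, p=p)

# PMF at 50 from the formula C(n, k) * p^k * (1 - p)^(n - k)
manual_pmf_50 = comb(n, 50) * p**50 * (1 - p) ** (n - 50)

# CDF at 50 as a vectorized sum of the PMF over k = 0, 1, ..., 50
manual_cdf_50 = binom_dist.pmf(np.arange(51)).sum()

print(np.isclose(manual_pmf_50, binom_dist.pmf(50)))
print(np.isclose(manual_cdf_50, binom_dist.cdf(50)))
print(np.isclose(binom_dist.mean(), n * p))
print(np.isclose(binom_dist.var(), n * p * (1 - p)))
```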

Say we have only 3 trials instead of 100 - we can plot the PMF value for each possible number of successes.

In [None]:
values = np.array([])
pmf_values = np.array([])
for i in range(4):
    values = np.append(values, i)
    pmf_values = np.append(pmf_values, stats.binom.pmf(n=3, p=0.5, k=i))
plt.hist(values, weights=pmf_values)
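
The 'pmf' and 'cdf' functions in 'scipy.stats' also accept an array of k values, so the same numbers can be computed without a loop or 'np.append'. This is a sketch of that alternative; the names 'k_values', 'pmf_vectorized', and 'cdf_vectorized' are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Every possible number of heads in 3 flips: 0, 1, 2, 3
k_values = np.arange(4)

# Passing the whole array evaluates the PMF and CDF at every k in one call
pmf_vectorized = stats.binom.pmf(k_values, n=3, p=0.5)
cdf_vectorized = stats.binom.cdf(k_values, n=3, p=0.5)

print(pmf_vectorized)
print(cdf_vectorized)
```

The PMF values come out to 1/8, 3/8, 3/8, and 1/8, and the CDF is their running total.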

## An Aside: Creating Nicer Plots in Matplotlib

Now is as good a time as any to learn a few more things about how to make better visualizations in Python with Matplotlib. We briefly covered some of this in Week 2.

First, we can put in a line of code to make our plot bigger. We can use the 'figsize' argument to put in a custom size for the width and height of our graph, in that order.

In [None]:
#Plot a Histogram - The Fig Line is to Make the Graph Bigger
fig = plt.figure(figsize=(10,10))
plt.hist(values, weights=pmf_values)

Now we can customize the ticks along the X axis.

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(4))
plt.hist(values, weights=pmf_values)

We can add a title...

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(4))
plt.hist(values, weights=pmf_values)
fig.suptitle('PMFs for 3 Coin Flips', fontsize=15, y=0.92)

And labels for our X and Y axes:

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(4))
plt.hist(values, weights=pmf_values)
plt.xlabel('# of Heads')
plt.ylabel('Probability')
fig.suptitle('PMFs for 3 Coin Flips', fontsize=15, y=0.92)

A title and X and Y labels are **required** for any visualizations submitted for your project.

### Back to Binomial

Below is a CDF plot for the different values in 3 trials.

In [None]:
values = np.array([])
cdf_values = np.array([])
for i in range(4):
    values = np.append(values, i)
    cdf_values = np.append(cdf_values, stats.binom.cdf(n=3, p=0.5, k=i))

You can use either a bar graph or line graph to plot a CDF, but a stepped line graph makes more sense, because the CDF of a discrete distribution jumps at each value and stays flat in between.

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(4))
plt.plot(values, cdf_values,  drawstyle='steps-post', linestyle='-')
plt.xlabel('# of Heads')
plt.ylabel('Probability')
fig.suptitle('CDFs for 3 Coin Flips', fontsize=15, y=0.92)

## Geometric Distribution

We can also use a geometric distribution to see how many coins we will need to flip to get our first head.

In [None]:
dist = stats.geom(0.5)

In [None]:
dist.mean()

In [None]:
dist.var()

In [None]:
dist.pmf(1)

In [None]:
dist.pmf(2)

In [None]:
dist.pmf(3)

In [None]:
dist.cdf(3)

In [None]:
values = np.array([])
pmf_values = np.array([])
#Why is the minimum one here?
for i in range(1,6):
    values = np.append(values, i)
    pmf_values = np.append(pmf_values, stats.geom.pmf(p=0.5, k=i))
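
One way to think about the question in the comment above: the geometric distribution counts the flip on which the first head appears, so the smallest possible value is 1. Its PMF at k is (1 - p)^(k - 1) * p, meaning k - 1 tails followed by one head, and its CDF at k is 1 - (1 - p)^k. The sketch below checks both formulas against 'scipy.stats'; the names 'manual_pmf' and 'manual_cdf' are illustrative.

```python
import numpy as np
import scipy.stats as stats

p = 0.5
k = np.arange(1, 6)

# k - 1 tails in a row, then a head on flip k
manual_pmf = (1 - p) ** (k - 1) * p

# P(first head within k flips) = 1 - P(k tails in a row)
manual_cdf = 1 - (1 - p) ** k

print(np.allclose(manual_pmf, stats.geom.pmf(k, p)))
print(np.allclose(manual_cdf, stats.geom.cdf(k, p)))

# Zero flips can never produce a head, so the PMF at 0 is 0
print(stats.geom.pmf(0, p))
```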

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(1,6))
plt.hist(values, weights=pmf_values)
plt.xlabel('# of Trials Until First Head')
plt.ylabel('Probability')
fig.suptitle('PMFs for # of Coin Flips Until First Head', fontsize=15, y=0.92)

In [None]:
values = np.array([])
cdf_values = np.array([])
for i in range(1,6):
    values = np.append(values, i)
    cdf_values = np.append(cdf_values, stats.geom.cdf(p=0.5, k=i))

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(1,6))
plt.plot(values, cdf_values,  drawstyle='steps-post', linestyle='-')
plt.xlabel('# of Trials Until First Head')
plt.ylabel('Probability')
fig.suptitle('CDFs for # of Coin Flips Until First Head', fontsize=15, y=0.92)

## Poisson Distribution

Say, on average, 2 trains arrive every ten minutes at the 145th Street A stop.

In [None]:
dist = stats.poisson(2)

In [None]:
dist.mean()

In [None]:
dist.var()

In [None]:
values = np.array([])
pmf_values = np.array([])
for i in range(10):
    values = np.append(values, i)
    pmf_values = np.append(pmf_values, stats.poisson.pmf(mu=2, k=i))
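
The Poisson PMF at k is e^(-lambda) * lambda^k / k!, and both the mean and the variance equal lambda. The following sketch checks this for the train example with lambda = 2; the names 'lam', 'pois', and 'manual_pmf' are illustrative.

```python
from math import exp, factorial

import numpy as np
import scipy.stats as stats

lam = 2  # average number of trains per ten-minute window
pois = stats.poisson(lam)

# PMF from the formula e^(-lambda) * lambda^k / k! for k = 0..9
manual_pmf = np.array([exp(-lam) * lam**k / factorial(k) for k in range(10)])

print(np.allclose(manual_pmf, stats.poisson.pmf(np.arange(10), mu=lam)))

# For a Poisson distribution the mean and the variance are both lambda
print(pois.mean(), pois.var())
```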

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(10))
plt.hist(values, weights=pmf_values)
plt.xlabel('# of Trains')
plt.ylabel('Probability')
fig.suptitle('PMFs for # of Trains That Will Arrive in a 10-Minute Window', fontsize=15, y=0.92)

In [None]:
values = np.array([])
cdf_values = np.array([])
for i in range(10):
    values = np.append(values, i)
    cdf_values = np.append(cdf_values, stats.poisson.cdf(mu=2, k=i))

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(range(10))
plt.plot(values, cdf_values,  drawstyle='steps-post', linestyle='-')
plt.xlabel('# of Trains')
plt.ylabel('Probability')
fig.suptitle('CDFs for # of Trains That Will Arrive in a 10-Minute Window', fontsize=15, y=0.92)

As we mentioned in class, the Poisson distribution can be used as an approximation to the binomial distribution when there is a large number of trials (n > 100) and a low probability of success (p < 0.05). Let's see if this works out!

Let's look at a binomial distribution with 100 trials and a 0.01 probability of success per trial and compare the PMF values for 0-4 under both the binomial distribution and the Poisson distribution. In this case, the Poisson distribution will have a lambda of 1 (100 * 0.01 = 1).

In [None]:
[stats.binom.pmf(p=0.01, n=100, k=i) for i in range(5)]

In [None]:
[stats.poisson.pmf(mu=1, k=i) for i in range(5)]

Quite close!
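
To put a number on "quite close", one option is to look at the largest absolute gap between the two sets of PMF values. This is a sketch using the same n = 100 and p = 0.01; the names 'binom_pmf', 'poisson_pmf', and 'max_gap' are illustrative.

```python
import numpy as np
import scipy.stats as stats

n, p = 100, 0.01
k = np.arange(5)

binom_pmf = stats.binom.pmf(k, n=n, p=p)
poisson_pmf = stats.poisson.pmf(k, mu=n * p)  # lambda = n * p = 1

# Largest absolute difference between the two PMFs over k = 0..4
max_gap = np.max(np.abs(binom_pmf - poisson_pmf))
print(max_gap)
```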

## Discrete Distributions

We can also build a custom distribution, such as the results of rolling two dice, which we covered in class.

In [None]:
two_dice = np.array([])
for i in range(1,7):
    for j in range(1,7):
        two_dice = np.append(two_dice, i + j)
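
The nested loop can also be written as a single NumPy call. The sketch below uses 'np.add.outer', which adds every face of the first die to every face of the second; the names 'faces' and 'two_dice_vectorized' are illustrative, and the result holds integers rather than floats.

```python
import numpy as np

faces = np.arange(1, 7)  # the six faces of one die

# 6 x 6 table of sums, flattened into a single array of 36 outcomes
two_dice_vectorized = np.add.outer(faces, faces).ravel()

print(len(two_dice_vectorized))
print(np.unique(two_dice_vectorized, return_counts=True))
```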

In [None]:
two_dice

In [None]:
len(two_dice)

Similar to the code we used for Monty Hall last week, we can use the 'np.unique' function to see the unique results in our output, as well as the number of times they occur. As a reminder, this code takes all the unique values found in our answer and gives us a raw count of how many times each value occurs in our array.

In [None]:
np.transpose(np.unique(two_dice, return_counts=True))

Transposing it is good for visualization purposes, but let's look at the actual output, two arrays - one with the unique results of two dice rolls and one with the respective counts for each of those results.

In [None]:
np.unique(two_dice, return_counts=True)

We can then assign these two arrays to variables.

In [None]:
two_dice_unique, counts = np.unique(two_dice, return_counts=True)

In [None]:
two_dice_unique

In [None]:
counts

We can see that the sum of the 'counts' variable is equal to the size of the sample space (6 * 6 = 36 outcomes).

In [None]:
np.sum(counts)

Below we can get the PMF of each result:

In [None]:
counts/np.sum(counts)

In [None]:
np.transpose((two_dice_unique, counts/np.sum(counts)))

And below we can get the CDF of each result:

In [None]:
np.cumsum(counts)/np.sum(counts)

In [None]:
np.transpose((two_dice_unique, np.cumsum(counts)/np.sum(counts)))
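
Once we have the outcomes and their probabilities, 'scipy.stats.rv_discrete' can wrap them in a distribution object with the same 'mean', 'var', 'pmf', and 'cdf' methods we used for the built-in distributions. This is a sketch of one way to do that, not something covered in class; the block rebuilds the outcomes so it runs on its own, and the names 'sums', 'outcomes', 'pmf', and 'dice_dist' are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Rebuild the 36 two-dice sums and their probabilities
faces = np.arange(1, 7)
sums = np.add.outer(faces, faces).ravel()
outcomes, counts = np.unique(sums, return_counts=True)
pmf = counts / counts.sum()

# Wrap the (outcome, probability) pairs in a custom discrete distribution
dice_dist = stats.rv_discrete(name="two_dice_sum", values=(outcomes, pmf))

print(dice_dist.mean(), dice_dist.var())
print(dice_dist.pmf(7), dice_dist.cdf(7))
```

The object also supports 'dice_dist.rvs(size=...)' for simulating rolls.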

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(two_dice_unique)
plt.hist(two_dice_unique, weights=counts/np.sum(counts), bins=25)
plt.xlabel('Dice Roll Results')
plt.ylabel('Probability Mass Function')
fig.suptitle('PMFs for Rolling Two Dice', fontsize=15, y=0.92)

In [None]:
fig = plt.figure(figsize=(10,10))
plt.xticks(two_dice_unique)
plt.plot(two_dice_unique, np.cumsum(counts)/np.sum(counts),  drawstyle='steps-post', linestyle='-')
plt.xlabel('Dice Roll Results')
plt.ylabel('Cumulative Distribution Function')
fig.suptitle('CDFs for Two Dice', fontsize=15, y=0.92)