# Lab 2 Discrete Probability Distributions

In this lab we are going to investigate the discrete probability distributions which we will see in week 4.



## Part 1: Plotting Histograms of Data 

First we'll generate and work with some random data. We'll use the random package to help with this. 
We create a list with 20 random integers between 0 and 20. 

Note that range(n) produces the numbers from 0 to n-1, so range(20) gives us 20 numbers to iterate over. See here:


In [None]:
list(range(20))

Often, we don't really care what the exact numbers are, because we are just using them to create an index, i, that we iterate over. In some cases, we don't actually compute anything with i, like in the example below. 

In [None]:
import random
r1 = [random.randint(0,20) for i in range(20)]
r1

Depending on the random number generator chosen, the randomly generated numbers will come from different distributions. Here, you might wonder if the (0,20) in the random number generation function really gives numbers up to 20 or only up to 19 - can you think of a quick test to find out? 

You should find that it does include both endpoints. 
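One possible quick test is sketched below. It assumes that a large enough sample will contain every possible value, so the smallest and largest numbers drawn reveal the endpoints. The sample size of 10000 and the name `draws` are illustrative choices.

```python
import random

# Draw many values so that every possible outcome almost certainly appears
draws = [random.randint(0, 20) for _ in range(10000)]

# If both endpoints are included we expect to see 0 and 20 here
print(min(draws), max(draws))
```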


Next we make a histogram. I'm going to introduce a new set of (not so random) data, which will be useful for illustrating some issues with histograms. 

In [None]:
r2 = [4,4,5,5,5,5,7,8,9,9,10,10,11,15,16,16,18,18,18,18]

import matplotlib.pyplot as plt
plt.hist(r2)


Can you identify two separate problems with this histogram? 

<details> 
    <summary markdown="span"> Click here to reveal </summary> 
    
    1. It doesn't show the whole range of possible values. Since we are treating the data as a sample of integers between 0 and 20, we might want that whole range shown. 
    2. The numbers 4 and 5 are in the same bin. If you look at the output that Python gives just above the plot, the second array shows the bin edges, and we can see that the first bin runs from 4.0 to 5.4.
    
</details>
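If you would like to inspect the bin edges without reading them off the plot output, one option is `np.histogram`, which by default uses 10 equal-width bins, the same default as `plt.hist`. This is only a sketch; the names `counts` and `edges` are illustrative.

```python
import numpy as np

r2 = [4,4,5,5,5,5,7,8,9,9,10,10,11,15,16,16,18,18,18,18]

# np.histogram computes the bins without drawing anything
counts, edges = np.histogram(r2)

print(edges)   # bin edges, starting at the smallest data value
print(counts)  # number of data points in each bin
```

The first two edges are 4.0 and 5.4, so the values 4 and 5 both fall into the first bin.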
    
    

Let's try for a better histogram. Take a look at the [documentation for pyplot.hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) to see what parameters are available. 

<details><summary markdown="span"> Click here for a hint </summary> 
In particular we might be interested in bins, range and align. If you want to set a parameter you just include something like bins=20 in your hist command. 
</details>

In [None]:
# Try some code here to fix these problems


plt.hist(r2,range=(0,20),bins=20)




Here are some further tricks. I will use plt.xticks to set labels on the x axis and align='left' to centre the bars over the integers. We align left because the bins have the form [1,2), [2,3) and so on, so each bar should sit over the left edge of its bin.

Rather than specifying the number of bins, I pass a Python range as bins. Its values are used as the bin edges, so each bin starts at an integer. 

You don't need to know all of this in advance. You can try it, see how it looks, and then adjust parameters to deal with problems. 


In [None]:
plt.xticks(range(21))
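# bins=range(22) gives the bin edges 0,1,...,21; range=(0,20) has no effect when bins is a sequence of edges
plt.hist(r2,bins=range(22),range=(0,20),align='left')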





Now try something similar with your random data r1 

Now, generate a lot more data points - say 1000, and plot the corresponding histogram - can you guess what it should look like before plotting? 


It should look fairly flat. This is because each integer is selected with equal probability. Next we will instead generate random data using other probability distributions. 

## Part 2: The binomial distribution - experimenting with data 

Now we will generate some random data using the binomial distribution. We can access probability distributions using the stats package in scipy. 
If you get an error because scipy is not installed then you need to install it. You can do this in Anaconda. In the main left hand menu go to "Environments", then search for scipy, there should be an option to install it. 



In [None]:

from scipy import stats

bn1 = stats.binom.rvs(n=5,p=0.5,size=20)
bn1




Can you interpret what this data represents? Try changing the parameters in the definition to explore what happens to the data. 

<details> 
    <summary markdown="span"> Click here to reveal </summary> 
    
            Each entry in the list is the number of successes in a series of 5 random trials with a
    0.5 probability of success. 
    
</details>

In [None]:
# now let's plot a histogram. I'll redefine bn1 first in case you changed the definition above

bn1 = stats.binom.rvs(n=5,p=0.5,size=20)

# put in code to make a histogram here 





Once you have some nice code to produce a histogram, you can rerun it to see different random data. 
Now change it to generate much more data: use 20 trials and generate 1000 numbers. Call this bn2. 


In [None]:
# define bn2 here 







We should see that the shape is very different to the distribution we had for the uniform random data in part 1.
Now try changing the probability of success and look at the result. 

In [None]:
# try plotting a couple of different probabilities 
# do a probability lower than 0.5, call it bn3







In [None]:
# do a probability higher than 0.5, call it bn4 







We can also calculate the mean and variance of each sample. Because bn1 is a NumPy array, these are available as array methods: for example, for bn1 we can do bn1.mean() and bn1.var(). 
Try calculating the mean and variance of all of our binomial random samples so far. 
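One detail worth knowing: by default the NumPy `.var()` method divides by the number of data points, and the optional `ddof` argument changes the divisor. A minimal sketch of the difference follows; the small array `sample` is made up for illustration.

```python
import numpy as np

sample = np.array([1, 2, 3, 4])  # made-up data for illustration

print(sample.mean())        # 2.5
print(sample.var())         # sum of squared deviations divided by 4
print(sample.var(ddof=1))   # sum of squared deviations divided by 3
```

For large samples the two versions are almost identical, so either is fine for the comparisons in this lab.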


In [None]:

print('The mean of bn1 is ',bn1.mean())
print('The variance of bn1 is',bn1.var())

#do the rest yourself here 








The theoretical mean for the binomial distribution is np, but for a given random sample it won't be exactly that. We do expect that as we increase the amount of data we look at, we will approach this. 

We will now carry out some experiments in order to observe this. We will generate a series of data sets and compute the mean for each, and look at what happens as the size of the data sets increases, then we can plot this data alongside the theoretical mean. 

**Try repeating for different values of n,p and size to explore.**



In [None]:

import numpy as np
# we initialise a list for data. This is so we can build our data by adding one value at a time to the existing data 
# rather than generating a whole new data set each time. Because D is a plain Python list, 
# it has no .mean() method, so we use numpy's np.mean function instead. 
D = []

# we initialise a list which will store the means 
M = []

# I also make one for the theoretical mean, this is constant here, but setting it up this 
# way will be useful for what we do next. By building this list in the same loop as M we ensure that
# they are the same size and can be plotted together 
tm=[]

# it is also useful to set n and p here. 
n=10
p=0.2


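size = 100

# loop exactly `size` times so that we end up with `size` data points
for i in range(size):
    D.append(stats.binom.rvs(n,p))
    M.append(np.mean(D))
    tm.append(n*p)
    
# this gives the x-axis of our plot: the number of data points used for each mean 
x = np.arange(1,size+1)  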

plt.plot(x,M) 
plt.plot(x,tm)
plt.show()
    
    

    
    
    
    
    
    

Now do the same analysis for the variance. Measure the variance as your sample size changes or as the number of trials changes. The theoretical value for the variance is np(1-p). Produce versions of the plot above for the variance. Once again, experiment with different parameters. 
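If you want to double-check the theoretical values np and np(1-p) before plotting, scipy can report them directly. The sketch below uses `stats.binom.stats`, which returns the mean and variance by default; the values of `n` and `p` are the ones used in the experiment above.

```python
from scipy import stats

n = 10
p = 0.2

# theoretical mean and variance of the binomial distribution
mean, var = stats.binom.stats(n, p)

print(mean, n * p)            # both should be close to 2.0
print(var, n * p * (1 - p))   # both should be close to 1.6
```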

## Part 3: Binomial distribution calculations 

We have been working with random data generated using a binomial distribution, now we will work with the actual distribution and see how to calculate probabilities for binomial random variables. 

In [None]:
# first we will set n and p 
n = 5 
p = 0.5 

# now we define a binomial random variable 
# by default it expects n and p first so we can just put those as arguments 
X = stats.binom(n,p)

# now we can work out probabilities with the probability mass function (pmf) and cumulative distribution function (cdf) 

print('An example of the pmf is X.pmf(3)=',X.pmf(3))
print('An example of the cdf is X.cdf(3)=',X.cdf(3))







Can you see what these are calculating? 
Try a few more values to investigate 


In [None]:
# try here 


<details> 
    <summary markdown="span"> Click here for an explanation </summary> 
    
            The pmf gives the probability of any particular outcome. For example X.pmf(3) gives the probability of exactly 3 successes in n trials with a probability p of success each time. 
    The cdf gives the probability of up to that many successes so X.cdf(3) gives the probability of 3 or fewer successes. 
    
</details>
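To connect these functions to the binomial formula, the sketch below recomputes one pmf value by hand and rebuilds one cdf value as a running total of the pmf. The names `by_hand` and `running_total` are illustrative, and `math.comb` needs Python 3.8 or later.

```python
from math import comb
from scipy import stats

n = 5
p = 0.5
X = stats.binom(n, p)

# pmf by hand: (n choose k) * p^k * (1-p)^(n-k)
k = 3
by_hand = comb(n, k) * p**k * (1 - p)**(n - k)
print(by_hand, X.pmf(k))          # both approximately 0.3125

# cdf as a running total of the pmf from 0 up to k
running_total = sum(X.pmf(j) for j in range(k + 1))
print(running_total, X.cdf(k))    # both approximately 0.8125
```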

We can plot them as well. This time we'll see how to do bar plots in pyplot. 

In [None]:
# first we make a range of x values 
x = range(n+1)

plt.bar(x,X.cdf(x),label='cdf')
plt.bar(x,X.pmf(x),label='pmf')
plt.legend()
plt.show()









Using these functions calculate the following. Suppose a fair coin is tossed 5 times, what is the probability of 

(a) getting exactly 1 head

(b) getting at most 2 heads 


In [None]:
#calculate here 




Now suppose a biased coin is tossed 1000 times. It has a probability of 0.4 of coming up heads. Find the probability that 

(a) there are at least 430 heads 

(b) there are at most 390 heads 

(c) there are between 300 and 400 heads 
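The cdf only answers "at most" questions directly, so "at least" and "between" questions need a little rearranging. Here is a sketch of the idea with deliberately different numbers from the exercise: the variable `Y` and the parameters 10 and 0.3 are illustrative, and "between" is taken to include both endpoints.

```python
from scipy import stats

# Illustrative parameters, deliberately different from the exercise
Y = stats.binom(10, 0.3)

# "at least 4" is the complement of "at most 3"
at_least_4 = 1 - Y.cdf(3)

# "between 2 and 5 inclusive" is "at most 5" minus "at most 1"
between_2_and_5 = Y.cdf(5) - Y.cdf(1)

# cross-check both against direct sums of the pmf
print(at_least_4, sum(Y.pmf(k) for k in range(4, 11)))
print(between_2_and_5, sum(Y.pmf(k) for k in range(2, 6)))
```

Each printed pair should agree up to rounding error.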

Finally, try plotting the pmf and cdf for large n and various choices of p. In this case there are so many bars that they are indistinct, so we use a line plot instead. To do this, simply use plt.plot rather than plt.bar (remember we did these in Lab 1). 

You'll find that it looks like a smooth line and the shape might look familiar - for certain parameters the binomial distribution is approximated by the normal distribution - we'll come back to that later in the course when we look at continuous probability distributions. 

For larger n these won't work well on the same plot so you might want to do them separately instead. 
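One way to draw them separately is to put the pmf and cdf on two sets of axes side by side. This is only a sketch of one possible layout; the names `n_large`, `p_large` and `Z` and the figure size are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n_large = 1000   # illustrative choice of a large n
p_large = 0.4
Z = stats.binom(n_large, p_large)
k = np.arange(n_large + 1)

# two separate axes because the pmf values are tiny compared with the cdf
fig, (ax_pmf, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))
ax_pmf.plot(k, Z.pmf(k))
ax_pmf.set_title('pmf')
ax_cdf.plot(k, Z.cdf(k))
ax_cdf.set_title('cdf')
plt.show()
```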

## Extension: the Poisson Distribution 
scipy.stats has many discrete probability distributions. Another which we have seen is the Poisson distribution. You can try exploring that in a similar way to what we have done with the binomial distribution. 
See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html
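As a starting point, here is a minimal sketch that mirrors the binomial steps above. The rate `mu = 3` and the names `P` and `ps1` are illustrative choices, not part of the lab.

```python
from scipy import stats

mu = 3  # illustrative average rate, not taken from the lab
P = stats.poisson(mu)

print(P.pmf(2))            # probability of exactly 2 events
print(P.cdf(2))            # probability of at most 2 events
print(P.mean(), P.var())   # both equal mu for a Poisson distribution

# random data, generated in the same way as for the binomial
ps1 = stats.poisson.rvs(mu, size=20)
print(ps1)
```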