# Histogram Tutorial

Histograms are often used to look for patterns in collected data. In particular, a histogram shows how repeated measurements of the same quantity are distributed, which makes histograms especially useful in statistics.

Python offers several handy tools for making and interpreting histograms. 

In [None]:
# Import block
import numpy as np               # for math
import matplotlib.pyplot as plt  # for plotting
from scipy.stats import norm     # for normal curves

## Generating Some Data to Play With

First, we'll generate some data to play with.

In [None]:
# Generate 1000 data points from a normal distribution with mean 0 and standard deviation 1
myData = norm.rvs(loc=0, scale=1, size=1000)
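
Because `norm.rvs` draws random numbers, every run of the cell above yields a different sample. Here is a minimal sketch of a reproducible draw, assuming an arbitrary seed of 42 and the illustrative names `seededData` and `repeatData` (neither the seed nor the names are part of the original tutorial):

```python
import numpy as np
from scipy.stats import norm

# Assumption: the seed value 42 is arbitrary; any fixed integer works.
rng = np.random.default_rng(42)

# Passing the generator as random_state makes the draw repeatable.
seededData = norm.rvs(loc=0, scale=1, size=1000, random_state=rng)

# A fresh generator with the same seed reproduces the same sample.
repeatData = norm.rvs(loc=0, scale=1, size=1000,
                      random_state=np.random.default_rng(42))
print(np.array_equal(seededData, repeatData))
```

Fixing the seed is handy while developing a notebook, since the histograms stop changing between runs.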

## Plot the Data as a Histogram
This is a rough first pass at using the histogram function that is available from `matplotlib`. 

In [None]:
plt.hist(myData)
plt.title("My First Histogram")
plt.xlabel("Measurements (unit-less)")
plt.ylabel("Number of Measurements per Bin")
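
The bar heights drawn by `plt.hist` are counts per bin. As a sketch of how to inspect those numbers directly, `np.histogram` returns the counts and the bin edges without drawing anything. The sample is regenerated here with an arbitrary seed so the block runs on its own; `sampleData` is an illustrative name.

```python
import numpy as np
from scipy.stats import norm

# Assumption: random_state=0 is an arbitrary seed used only for this sketch.
sampleData = norm.rvs(loc=0, scale=1, size=1000, random_state=0)

# With no bins argument, np.histogram uses 10 equal-width bins
# spanning the full range of the data.
counts, edges = np.histogram(sampleData)

print("Counts per bin:", counts)
print("Bin edges:", np.round(edges, 2))
print("Total measurements:", counts.sum())
```

Note that there is always one more edge than there are bins.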

## Choosing the Bin Width

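Histograms can look very different depending on the bins that are chosen. In the example above, the bins are chosen automatically: by default, `matplotlib` splits the range of the data into 10 equal-width bins.

In the example below, I choose the bin edges (and therefore the bin width) explicitly.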


In [None]:
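binMin = -4.25   # lower edge of the first bin (renamed to avoid shadowing the built-in min)
binMax = 4.25    # upper edge of the last bin (renamed to avoid shadowing the built-in max)
nbins = 17
myBins = np.linspace(binMin, binMax, nbins + 1)
#  This is an array of bin boundaries (nbins + 1 edges).
#  I am choosing equally spaced bins between -4.25 and 4.25.
#  There will be 17 bins.
#  Each bin will have a width of 0.5.
#  Bins will be centered on whole and half numbers (e.g., -4, -3.5, -3, ..., 3.5, 4).
print(myBins)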
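
The claims in the comments above can be checked by computing the widths and centers from the edges. This is a small sketch; `edges`, `widths`, and `centers` are illustrative names.

```python
import numpy as np

edges = np.linspace(-4.25, 4.25, 17 + 1)   # 18 edges define 17 bins

widths = np.diff(edges)                    # distance between neighboring edges
centers = (edges[:-1] + edges[1:]) / 2     # midpoint of each bin

print("Number of bins:", len(widths))
print("All widths equal 0.5:", np.allclose(widths, 0.5))
print("Bin centers:", np.round(centers, 2))
```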

In [None]:
plt.hist(myData, bins=myBins) # plot a histogram using my bins
plt.title("Histogram with Custom Bins")
plt.xlabel("Measurements (unit-less)")
plt.ylabel("Measurements per 0.5 (unit-less)")

Notice how the shape of the distribution changes slightly with the choice of bins. 

The choice of histogram bins is driven by the data you are visualizing. Bins that are too narrow leave only a few measurements in each bin, so random fluctuations hide the overall pattern. Bins that are too wide average the data too much and wash out real structure. Try playing with the histogram bin definition to see how the number of bins changes the resulting histogram.
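
One way to explore this is to draw the same sample with several bin counts side by side. The sketch below assumes three arbitrary bin counts (5, 17, and 100) and a regenerated sample with an arbitrary seed. It also prints the number of bins suggested by NumPy's `"auto"` rule, which is one possible starting point rather than a recommendation from this tutorial.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Assumption: random_state=0 is an arbitrary seed used only for this sketch.
sampleData = norm.rvs(loc=0, scale=1, size=1000, random_state=0)

# Assumption: arbitrary choices meant to look too wide, moderate, and too narrow.
binCounts = [5, 17, 100]
fig, axes = plt.subplots(1, len(binCounts), figsize=(12, 3), sharex=True)

for ax, n in zip(axes, binCounts):
    # n bins need n + 1 equally spaced edges
    ax.hist(sampleData, bins=np.linspace(-4.25, 4.25, n + 1))
    ax.set_title("%d bins" % n)
    ax.set_xlabel("Measurements (unit-less)")
axes[0].set_ylabel("Measurements per Bin")

# NumPy can also suggest bin edges from the data itself.
autoEdges = np.histogram_bin_edges(sampleData, bins="auto")
print("'auto' rule suggests %d bins" % (len(autoEdges) - 1))
```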

In *real* analyses, the bin choice is often a source of *systematic error* that must be accounted for.


## Fitting a Histogram with a Normal Curve

Suppose that we want to fit a normal (or Gaussian) curve to the distribution to measure its mean and width (standard deviation). This can be done with `norm.fit`, which estimates both parameters directly from the unbinned data, so the result does not depend on the choice of bins.

In [None]:
# fit a normal curve
mean, stdev = norm.fit(myData)
stderr = stdev/np.sqrt(myData.size)
print("Fit Results:")
print("Mean=%2.3f  Standard Deviation=%2.2f  Standard Error=%2.3f" % (mean,stdev,stderr))

Note that our results are consistent with the parameters used to generate the data. Remember that the data were generated with the mean equal to 0.00 and the standard deviation equal to 1.00:
$$ \bar{x} = 0.00$$
$$ \sigma_{x} = 1.00.$$

See the assignment and Taylor's *An Introduction to Error Analysis*, Chapter 4 for a more detailed explanation of average, standard deviation, and standard error. 
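
For a normal distribution, the fitted values can be reproduced by hand: `norm.fit` returns the sample mean and the standard deviation computed with a divisor of N rather than N-1. Here is a sketch of the comparison, using a regenerated sample with an arbitrary seed and illustrative variable names:

```python
import numpy as np
from scipy.stats import norm

# Assumption: random_state=0 is an arbitrary seed used only for this sketch.
sampleData = norm.rvs(loc=0, scale=1, size=1000, random_state=0)

fitMean, fitStdev = norm.fit(sampleData)

manualMean = np.mean(sampleData)           # average of the measurements
manualStdev = np.std(sampleData)           # divisor N (ddof=0), matches norm.fit
sampleStdev = np.std(sampleData, ddof=1)   # divisor N-1, the usual sample estimate
manualStderr = sampleStdev / np.sqrt(sampleData.size)  # uncertainty on the mean

print("norm.fit: mean=%.3f  stdev=%.3f" % (fitMean, fitStdev))
print("by hand:  mean=%.3f  stdev=%.3f (N)  stdev=%.3f (N-1)"
      % (manualMean, manualStdev, sampleStdev))
print("standard error of the mean=%.3f" % manualStderr)
```

With 1000 points, the N and N-1 versions of the standard deviation differ by only about 0.05%, so either gives essentially the same standard error.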

In [None]:
# Plot the histogram with the normal curve.
# density=True normalizes the histogram to a probability density,
# so the total area of the bars (height times bin width) is equal to one.
plt.hist(myData, bins=myBins, density=True, label="Data") # plot a histogram using my bins
plt.title("Histogram with Custom Bins and a Fit")
plt.xlabel("Measurements (unit-less)")
plt.ylabel("Probability Density (unit-less)")

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
myFitFunction = norm.pdf(x, mean, stdev)
plt.plot(x, myFitFunction, 'k', linewidth=1, label="Fit")
plt.text(2,0.3,"Average = %2.2f"%(mean))
plt.text(2,0.275,"St. Dev = %2.2f"%(stdev))
plt.legend()

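Note that in this case, I have changed the y-axis so that it now shows the **probability density** of a given measurement. The probability of landing in a bin is the bar height multiplied by the bin width. For example, the bin centered at 0 has a height of roughly 0.4 and a width of 0.5, so there is about a 20% chance that we will measure a number between -0.25 and 0.25.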
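
The probability for a single bin can be checked against a normal curve by differencing the cumulative distribution function at the bin edges. The sketch below assumes the generating parameters (mean 0, standard deviation 1) rather than the fitted values; `lo`, `hi`, `pBin`, and `pApprox` are illustrative names.

```python
from scipy.stats import norm

lo, hi = -0.25, 0.25   # edges of the bin centered at 0

# Exact probability of landing in the bin: CDF(hi) - CDF(lo)
pBin = norm.cdf(hi, loc=0, scale=1) - norm.cdf(lo, loc=0, scale=1)

# Approximation: density at the bin center times the bin width
pApprox = norm.pdf(0, loc=0, scale=1) * (hi - lo)

print("P(-0.25 < x < 0.25) = %.3f" % pBin)
print("density x width     = %.3f" % pApprox)
```

Both values come out close to 0.20, which is the bar height near 0.4 multiplied by the bin width of 0.5.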