# Weighted Averages

### Goals:

1. To review the concept of weighted averages.
2. To understand when it makes sense to use weighted averages. 
3. To understand how histograms and weighted averages are tools that can be used to summarize large data sets as a much smaller set of numbers.

### Timing

1. Try to finish this notebook in 10-15 minutes

### Question and Answer Template

You can go to the link below and choose "File" -> "Make a copy" to create your own Google Doc, which you can use to fill in the answers to the questions in this week's notebooks.

https://docs.google.com/document/d/1uNPXYCd6IF-jAnPyq7k9pFP2x7VIohZZstGqm6t0WC8/edit?usp=sharing

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

###  New functions we will use in this module

| Function Name            | What it does |
| - | - |
|    rng.integers          | generates a random integer |
|    rng.uniform           | generates a random real number from a flat or 'uniform' distribution |
|    plt.hist              | Makes a "histogram" plotting the number of values that fall into a set of bins |
|    plt.xlabel            | Set the x-axis label of a figure (also plt.ylabel) |  
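
As a quick orientation, here is a minimal sketch of the two random-number functions from the table. The name `demo_rng` and the seed `0` are arbitrary choices made for this illustration; they are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical generator for this sketch only; the seed is arbitrary.
demo_rng = np.random.default_rng(0)

# A single random integer from 1 to 6; endpoint=True makes `high` inclusive.
one_roll = demo_rng.integers(low=1, high=6, endpoint=True)

# Passing `size` returns an array of that many random integers.
many_rolls = demo_rng.integers(low=1, high=6, endpoint=True, size=5)

# Five random real numbers drawn uniformly from the interval [0.5, 6.5).
reals = demo_rng.uniform(low=0.5, high=6.5, size=5)

print(one_roll, many_rolls, reals)
```

Without `endpoint=True`, `rng.integers` excludes `high`, so a six-sided die would need `high=7`.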

## Weighted averages, in the context you are probably most familiar with them

As students, you are surely familiar with weighted averages as they apply to course grades.  For example, you might be told something like: "Homework will be 20% of your grade, the two short mid-term exams will be 20% each, and the final exam will be 40%".

### Question:  

#### 1.1  Write down the formula for grades that corresponds to the sentence above.

## Summary data and weighted averages

Now we are going to work through an exercise that shows another context in which weighted averages occur.  



In [None]:
rng = np.random.default_rng(42)

In [None]:
def rollD6(rng, nTimes):
    return rng.integers(low=1, high=6, endpoint=True, size=nTimes)

In [None]:
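# Roll the die 60 times.
diceRolls = rollD6(rng, 60)

# Count the number of times each value occurred. minlength=7 keeps the
# array at length 7 (indices 0-6) even if a 6 is never rolled.
# Note: `values` holds the counts per face, `weights` holds the face values.
values = np.bincount(diceRolls, minlength=7)
weights = np.arange(7)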

In [None]:
print("Dice roll data: ",diceRolls,"\n")
print("Counts per value: ", values)
print("Weights:          ", weights)

### Now let's write down the equation for the mean of the data two different ways

#### Using the individual rolls

It would look something like 

(1 + 5 + 4 + 3 + 3 + 6 + ... + 5 + 5) / 60

#### Using the bin counts

It would look something like

((10 * 1) + (5 * 2) + (12 * 3) + (9 * 4) + (16 * 5) + (8 * 6)) / 60

### Formulas

mean = $\frac{\sum_i x_i}{n}$

weighted mean = $\frac{\sum_i w_i * x_i}{\sum_i w_i}$
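
As a small sketch of the two formulas, the snippet below uses the illustrative bin counts from the "Using the bin counts" example (not the actual output of the random rolls). The names `faces`, `counts` and `rolls` are chosen for this illustration. Here the counts play the role of the weights $w_i$ and the face values play the role of $x_i$.

```python
import numpy as np

# Face values x_i and illustrative counts w_i (copied from the example above).
faces = np.arange(1, 7)
counts = np.array([10, 5, 12, 9, 16, 8])

# Weighted mean written out exactly as in the formula.
weighted_by_hand = np.sum(counts * faces) / np.sum(counts)

# The same quantity using numpy's built-in weighted average.
weighted_np = np.average(faces, weights=counts)

# Expand the counts back into 60 individual rolls and take the plain mean.
rolls = np.repeat(faces, counts)
plain_mean = np.mean(rolls)

print(weighted_by_hand, weighted_np, plain_mean)
```

All three numbers agree, because weighting each face by its count is just a compact way of adding up every individual roll.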

#### Let's compute both of those using numpy and compare them to the numpy.mean() function

In [None]:
mean_v1 = np.sum(diceRolls) / len(diceRolls)
mean_v2 = np.sum(values*weights) / len(diceRolls)
mean_check = np.mean(diceRolls)

In [None]:
print("Mean:          ", mean_v1)
print("Weighted mean: ", mean_v2)
print("Check:         ", mean_check)

You can also programmatically check that two numbers (arrays) are equal up to computer precision using `np.isclose` (`np.allclose`)
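
To see why "equal up to computer precision" matters, here is a minimal sketch using a standard floating-point example. The variable names are chosen for this illustration only.

```python
import numpy as np

a = 0.1 + 0.2
b = 0.3

# Exact comparison fails because of floating-point round-off.
exact_equal = (a == b)

# np.isclose compares within a small tolerance instead.
close_equal = np.isclose(a, b)

# np.allclose does the same for whole arrays and returns a single True/False.
arrays_close = np.allclose(np.array([a, 1.0]), np.array([b, 1.0]))

print(exact_equal, close_equal, arrays_close)
```

This is why the comparison of the means uses `np.isclose` rather than `==`.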

In [None]:
print(f"""
Mean == Weighted mean  : {np.isclose(mean_v1, mean_v2)}
Mean == Check          : {np.isclose(mean_v1, mean_check)}
Weighted mean == Check : {np.isclose(mean_v2, mean_check)}
""")

#### Pro-tip, array multiplication in numpy:

When you multiply two numpy arrays of the same shape, such as (values*weights), numpy multiplies each element in values by the corresponding element in weights.
It is equivalent to

    n = len(values)
    outArray = np.zeros(n)
    for i in range(n):
        outArray[i] = values[i] * weights[i]
        
Or, written mathematically:

$\mathbf{v} = \mathbf{x}\mathbf{w}$ is equivalent to $v_i = x_i * w_i$ for each element $i$, and we use **bold** to indicate arrays.
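
The element-by-element rule can be checked on a tiny example. This is only a sketch with made-up arrays `x` and `w`; it also shows that summing the element-wise product gives the same number as a dot product, which is the numerator of the weighted mean.

```python
import numpy as np

# Made-up arrays for illustration.
x = np.array([1.0, 2.0, 3.0])
w = np.array([10.0, 20.0, 30.0])

# Vectorized element-wise product: v_i = x_i * w_i.
v_vectorized = x * w

# The same result computed with an explicit loop.
v_loop = np.zeros(len(x))
for i in range(len(x)):
    v_loop[i] = x[i] * w[i]

# Summing the element-wise product is the same as the dot product.
sum_of_products = np.sum(x * w)
dot_product = np.dot(x, w)

print(v_vectorized, v_loop, sum_of_products, dot_product)
```

The vectorized form is shorter and usually much faster than the loop for large arrays.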
        

In [None]:
values*weights

### When summary data "loses information"

Now, instead of rolling a die, let's pick a bunch of real numbers between 0.5 and 6.5 and use a histogram to summarize that information.

The "a.u." on the axes labels stands for "arbitrary units".
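
The binning step itself can be sketched without making a plot, using `np.histogram`. The array `sample` below is made up for this illustration and is not the random data used in the notebook.

```python
import numpy as np

# Ten made-up real numbers between 0.5 and 6.5.
sample = np.array([0.7, 1.2, 1.4, 2.9, 3.1, 3.3, 4.8, 5.6, 6.1, 6.4])

# Seven edges define six bins of width 1: 0.5, 1.5, ..., 6.5.
edges = np.linspace(0.5, 6.5, 7)

# np.histogram returns the counts per bin and the bin edges it used.
counts, edges_out = np.histogram(sample, bins=edges)

# The center of each bin is the midpoint of its two edges.
centers = (edges_out[:-1] + edges_out[1:]) / 2

print("Counts: ", counts)
print("Centers:", centers)
```

After binning, each number is only known to lie somewhere in its bin, so the ten original values have been reduced to six counts.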

In [None]:
dataSample = rng.uniform(low=0.5, high=6.5, size=60)
hist = plt.hist(dataSample, bins=np.linspace(0.5, 6.5, 7))
plt.xlabel("Value [a.u.]")
plt.ylabel("Trials [a.u.]")
plt.show()

In [None]:
# This grabs the bin values and bin edges from the hist data structure that matplotlib returned
values = hist[0]
edges = hist[1]
centers = (edges[0:-1] + edges[1:])/2.

print("Average bin content:  ", np.mean(values))
print("Average value:        ", np.mean(dataSample))
print("Average binned value: ", np.sum(values*centers) / len(dataSample))

### Questions for discussion

#### 2.1 Explain, in your own words, the difference between the three values computed in the previous cell.  

#### 2.2 How would these numbers change if you changed the bin size when histogramming the data?  E.g., which would get bigger if you used smaller bins, which would get smaller, and which would stay the same?

In [None]:
# This is a cell to try out different binnings for summarizing the data