# Why use a weighted average

The table below shows the price each of three customers pays for the same product, along with the quantity each one buys.

<img src="shoe.png">

If someone asked for the average price of our shoes, the simple average of the three prices would be:

<img src="price.png">

While this is an accurate average, it does not make intuitive sense as our average selling price. It is especially misleading if we want to use the average for revenue projections.

If you look at the numbers, you can see we are selling far more shoes below $200 than above $200. An average of $216.67 therefore does not reflect the real average selling price in the market.

What would be more useful is to weight those prices based on the quantity purchased. Let’s build a weighted average such that the average shoe price will be more representative of all customers’ purchase patterns.

<img src="avg.png">
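As a minimal sketch of the arithmetic, the snippet below compares the two averages in plain Python. The prices and quantities are hypothetical values chosen for illustration; they are not the figures from the shoe table.

```python
# Hypothetical prices and quantities (NOT the values from the shoe table)
prices = [100, 120, 250]
quantities = [500, 300, 20]

# Simple average: every price counts equally
simple_avg = sum(prices) / len(prices)

# Weighted average: sum(price * quantity) / sum(quantity)
weighted_avg = sum(p * q for p, q in zip(prices, quantities)) / sum(quantities)

print(round(simple_avg, 2), round(weighted_avg, 2))
```

Because most units in this made-up example sell at the lower prices, the weighted figure lands well below the simple mean.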

To build that function, let's import our modules and read in our Excel file:

In [2]:
import pandas as pd
import numpy as np

# Recent pandas versions use sheet_name; the older sheetname keyword was removed
sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheet_name="projections")
sales.head()

| | Account | Name | State | Rep | Manager | Current_Price | Quantity | New_Product_Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 714466 | Trantow-Barrows | MN | Craig Booker | Debra Henley | 500 | 100 | 550 |
| 1 | 737550 | Fritsch, Russel and Anderson | MN | Craig Booker | Debra Henley | 600 | 90 | 725 |
| 2 | 146832 | Kiehn-Spinka | TX | Daniel Hilton | Debra Henley | 225 | 475 | 255 |
| 3 | 218895 | Kulas Inc | TX | Daniel Hilton | Debra Henley | 290 | 375 | 300 |
| 4 | 412290 | Jerde-Hilpert | WI | John Smith | Debra Henley | 375 | 400 | 400 |


If we want a simple mean, the built-in functions calculate it easily:

In [14]:
x = sales["Current_Price"].mean()

y = sales["New_Product_Price"].mean()

print(x)
print(y)

405.4166666666667
447.0833333333333


# Grouping Data with the Weighted Average

The pandas groupby method is commonly used to summarize data. For instance, if we want to look at the mean of Current_Price by manager, it is simple with groupby:

In [15]:
sales.groupby("Manager")["Current_Price"].mean()

Manager
Debra Henley     423.333333
Fred Anderson    387.500000
Name: Current_Price, dtype: float64

A plain mean ignores how many units each customer buys. To get a weighted average instead, define a custom function that takes the names of the value and weight columns and calculates the weighted average, then use apply to run it against the grouped data.

In [16]:
def wavg(group, avg_name, weight_name):
    """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instances, we may not have weights, so just return the mean. Customize this if your business case
    should return otherwise.
    """
    d = group[avg_name]
    w = group[weight_name]
    total = w.sum()
    # Dividing NumPy scalars by zero gives inf/nan with a warning instead of
    # raising ZeroDivisionError, so test the total weight explicitly.
    if total == 0:
        return d.mean()
    return (d * w).sum() / total

In [17]:
wavg(sales, "Current_Price", "Quantity")

342.5406871609403
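It is worth checking what happens when a group's weights sum to zero. The sketch below uses a small made-up DataFrame (the `toy` data and the `wavg_checked` name are assumptions for illustration, not part of the sales file) and tests the total weight explicitly, because dividing NumPy scalars by zero yields nan with a warning rather than raising ZeroDivisionError.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "B" sold no units, so its weights sum to zero
toy = pd.DataFrame({
    "grp": ["A", "A", "B", "B"],
    "price": [100.0, 200.0, 300.0, 500.0],
    "qty": [1, 3, 0, 0],
})

def wavg_checked(group, avg_name, weight_name):
    """Weighted average that falls back to the plain mean when weights sum to zero."""
    d = group[avg_name]
    w = group[weight_name]
    total = w.sum()
    if total == 0:
        return d.mean()
    return (d * w).sum() / total

# Naive division for group "B": no exception is raised, the result is nan
b = toy[toy["grp"] == "B"]
with np.errstate(divide="ignore", invalid="ignore"):
    naive_b = (b["price"] * b["qty"]).sum() / b["qty"].sum()

# Iterate over the groups and apply the checked version to each one
result = {name: wavg_checked(g, "price", "qty") for name, g in toy.groupby("grp")}
print(naive_b, result)
```

Whether the plain mean is the right fallback depends on the business case; returning nan may be the more honest answer in some reports.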

The nice thing is that this will also work on grouped data. The key is that we need to use apply in order for pandas to pass the various groupings to the function.

In [18]:
sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")

Manager
Debra Henley     340.665584
Fred Anderson    344.897959
dtype: float64

Using this on our projected price is easy because you just need to pass in a new column name:

In [19]:
sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")

Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64

It is also possible to group by multiple criteria and the function will make sure that the correct data is used in each grouping:

In [20]:
sales.groupby(["Manager", "State"]).apply(wavg, "New_Product_Price", "Quantity")

Manager        State
Debra Henley   MN       632.894737
               TX       274.852941
               WI       440.000000
Fred Anderson  CA       446.428571
               NV       325.000000
               WA       610.000000
dtype: float64
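For larger datasets, apply with a Python function can be slow. One possible alternative, sketched here on hypothetical data rather than the sales file, is to compute price times quantity per row and then divide two grouped sums. The `toy` frame and the `weighted` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the sales DataFrame
toy = pd.DataFrame({
    "Manager": ["A", "A", "B", "B"],
    "Price": [100.0, 200.0, 300.0, 500.0],
    "Quantity": [1, 3, 2, 2],
})

# Numerator (price * quantity) and denominator (quantity), summed per group
sums = (
    toy.assign(weighted=toy["Price"] * toy["Quantity"])
       .groupby("Manager")[["weighted", "Quantity"]]
       .sum()
)

# Weighted average per manager
wavg_by_manager = sums["weighted"] / sums["Quantity"]
print(wavg_by_manager)
```

This gives the same result as the weighted-average function, but it does not include a fallback for groups whose weights sum to zero.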

# Multiple Aggregations

One final item I wanted to cover is the ability to perform multiple aggregations on data. For instance, if we want the mean of one column, the median of another, and both the sum and mean of a third, we can define a dictionary that maps column names to the aggregation functions to call. Then we pass it to agg on the grouped data.

In [21]:
f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}
sales.groupby("Manager").agg(f)

| Manager | New_Product_Price (mean) | Current_Price (median) | Quantity (sum) | Quantity (mean) |
|---|---|---|---|---|
| Debra Henley | 471.666667 | 437.5 | 1540 | 256.666667 |
| Fred Anderson | 422.5 | 375.0 | 1225 | 204.166667 |
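If the two-level column header is inconvenient, named aggregation (available in pandas 0.25 and later) is one way to get flat column names. This is a sketch on hypothetical data; the `toy` frame and the output column names are assumptions.

```python
import pandas as pd

# Hypothetical data standing in for the sales DataFrame
toy = pd.DataFrame({
    "Manager": ["A", "A", "B", "B"],
    "Price": [100.0, 200.0, 300.0, 500.0],
    "Quantity": [1, 3, 2, 2],
})

# Each keyword becomes an output column: name=(source column, aggregation)
flat = toy.groupby("Manager").agg(
    price_mean=("Price", "mean"),
    price_median=("Price", "median"),
    quantity_sum=("Quantity", "sum"),
)
print(flat)
```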


Here is the approach I use to combine multiple custom functions into a single DataFrame.

First create two Series of the various weighted averages, then combine them into a single DataFrame and give its columns meaningful labels:

In [22]:
data_1 = sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
data_2 = sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))
summary.columns = ["New Product Price","Current Product Price"]
summary.head()

| Manager | New Product Price | Current Product Price |
|---|---|---|
| Debra Henley | 372.646104 | 340.665584 |
| Fred Anderson | 377.142857 | 344.897959 |
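Another possible approach is to have a single function return a Series with one entry per result, so that apply builds the combined DataFrame in one pass. The sketch below uses hypothetical data; the `toy` frame, its column names, and the `weighted_prices` helper are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the sales DataFrame
toy = pd.DataFrame({
    "Manager": ["A", "A", "B", "B"],
    "Current": [100.0, 200.0, 300.0, 500.0],
    "New": [110.0, 210.0, 330.0, 530.0],
    "Quantity": [1, 3, 2, 2],
})

def weighted_prices(group):
    """Return both weighted averages for one group as a labeled Series."""
    w = group["Quantity"]
    return pd.Series({
        "New Product Price": (group["New"] * w).sum() / w.sum(),
        "Current Product Price": (group["Current"] * w).sum() / w.sum(),
    })

# Select only the needed columns so the grouping column is not passed to the function
combined = toy.groupby("Manager")[["Current", "New", "Quantity"]].apply(weighted_prices)
print(combined)
```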


# Using NumPy for weighted average

In [23]:
np.average(sales["Current_Price"], weights=sales["Quantity"])

342.54068716094031

In [24]:
sales.groupby("Manager").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))

Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64
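One difference to keep in mind: as far as I can tell, np.average raises ZeroDivisionError when the weights sum to zero, so a fallback has to be handled explicitly. A minimal sketch, assuming the plain mean is an acceptable fallback (the `np_wavg` helper name is hypothetical):

```python
import numpy as np

def np_wavg(values, weights):
    """Weighted average via np.average, falling back to the plain mean
    if NumPy reports that the weights sum to zero."""
    try:
        return float(np.average(values, weights=weights))
    except ZeroDivisionError:
        return float(np.mean(values))

normal = np_wavg([100.0, 200.0], [1, 3])
fallback = np_wavg([300.0, 500.0], [0, 0])
print(normal, fallback)
```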