# 3.1 Measures of Central Location
**Central location** refers to the way data cluster around a center or middle value. In essence, we want a single typical value that describes our data: the typical return on an investment, the typical price of a home in a neighborhood, or the typical opinion of a group of people.

## The Mean
The **arithmetic mean** is the main measure of central location. We refer to it as the **mean** or the **average**.
$$
\bar{x} = \frac{\sum x_i}{n}
$$
$$
\mu = \frac{\sum x_i}{N}
$$
Here $\bar{x}$ is the sample mean of the sample observations $x_1, x_2, \ldots, x_n$, and $\mu$ is the population mean of the population observations $x_1, x_2, \ldots, x_N$.
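
As a minimal sketch of the formula, the sample mean can be computed by hand as the sum of the observations divided by their count. The values in `returns` are made up for illustration and are not taken from `Growth_Value.csv`.

```python
from statistics import mean

# Hypothetical sample of annual returns (in percent); illustrative values only.
returns = [4.0, -2.0, 10.0, 8.0]

# Sample mean: sum of the observations divided by the sample size n.
n = len(returns)
x_bar = sum(returns) / n

# The standard library gives the same value.
assert x_bar == mean(returns)
print(x_bar)  # 5.0
```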

In [1]:
""" Example 3.1
We want to find the mean for Growth and Value, respectively.
"""
import pandas as pd

# Read data.
grow_val = pd.read_csv('Growth_Value.csv', index_col=False)
grow_val.head(5)

|   | Year | Growth | Value |
|---|------|--------|-------|
| 0 | 1984 | -5.5 | -8.59 |
| 1 | 1985 | 39.91 | 22.1 |
| 2 | 1986 | 13.03 | 14.74 |
| 3 | 1987 | -1.7 | -8.58 |
| 4 | 1988 | 16.05 | 29.05 |


In [2]:
# Averages.
grow_val[['Growth', 'Value']].mean()

Growth    15.755
Value     12.005
dtype: float64

In [3]:
""" Example 3.2

We're trying to understand the salaries at a firm.
As we'll see, this example shows the weakness of 
the mean when we have outliers.
"""
# Create data.
sals = [40000, 40000, 65000, 90000, 100000, 145000, 150000, 550000]
sals_df = pd.DataFrame({'Salary': sals})
sals_df.head(5)

|   | Salary |
|---|--------|
| 0 | 40000 |
| 1 | 40000 |
| 2 | 65000 |
| 3 | 90000 |
| 4 | 100000 |


In [4]:
# Average.
sals_df.Salary.mean()

147500.0

In [5]:
# Does this reflect the typical salary?
below_avg_sal = sals_df['Salary'] < sals_df.Salary.mean()
sals_df[below_avg_sal]

|   | Salary |
|---|--------|
| 0 | 40000 |
| 1 | 40000 |
| 2 | 65000 |
| 3 | 90000 |
| 4 | 100000 |
| 5 | 145000 |


Six of the eight employees at the firm make less than the average. The large salary of the President pulls the mean upward, so it no longer reflects a typical salary.
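
One way to see how much a single observation moves the mean is to recompute it without the largest value. The sketch below reuses the salary data from Example 3.2; dropping the maximum is only an illustration here, not a recommended way to handle outliers.

```python
# Salary data from Example 3.2.
sals = [40000, 40000, 65000, 90000, 100000, 145000, 150000, 550000]

mean_all = sum(sals) / len(sals)

# Recompute the mean after dropping the single largest salary.
without_max = sorted(sals)[:-1]
mean_without_max = sum(without_max) / len(without_max)

print(mean_all)          # 147500.0
print(mean_without_max)  # 90000.0
```

Removing one observation out of eight lowers the mean by more than a third, which is why the mean alone can mislead when outliers are present.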

## The Median
Since the mean can be distorted by outliers, we turn to the **median** as another measure of central location. The median is the middle value of a variable: it divides the sorted data in half, so that half of the observations lie above it and half lie below. With an even number of observations, the median is the average of the two middle values.

In [6]:
""" Example 3.3

We're trying to understand the salaries at a firm.
Since the mean doesn't capture much of the central
salary, we'll use the median.
"""
sals_df.Salary.median()

95000.0

Note that four salaries lie below the median of $\$95,000$ and four lie above. This gives us a better idea of the typical salary at this firm.
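
The median of the salary data can also be reproduced by hand: sort the values and take the middle one, or the average of the two middle ones when the count is even. This is a minimal sketch; `median_by_hand` is a hypothetical helper name, not part of pandas.

```python
# Salary data from Example 3.2.
sals = [40000, 40000, 65000, 90000, 100000, 145000, 150000, 550000]

def median_by_hand(values):
    """Return the middle value of the sorted data, or the average
    of the two middle values when the number of values is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand(sals))  # 95000.0
```

With eight salaries, the two middle values are 90000 and 100000, and their average is the median.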

## The Mode
The **mode** is the observation that occurs most frequently. A variable can have more than one mode or no mode. If there's one mode, we say the variable is unimodal. If two or more modes exist, the variable is multimodal.

In [7]:
""" Example 3.4

Using the same firm's salary data.
"""
sals_df.Salary.mode()

0    40000
Name: Salary, dtype: int64
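
The salary data has a single mode. To illustrate the multimodal case, the sketch below uses `statistics.multimode` from the standard library (Python 3.8+), which returns every value tied for most frequent. The `sizes` list is hypothetical data made up for this example.

```python
from statistics import multimode

# Salary data from Example 3.2: 40000 appears twice, so there is one mode.
sals = [40000, 40000, 65000, 90000, 100000, 145000, 150000, 550000]
salary_modes = multimode(sals)
print(salary_modes)  # [40000]

# Hypothetical data in which two values tie for most frequent (multimodal).
sizes = [8, 9, 9, 10, 10, 11]
size_modes = multimode(sizes)
print(size_modes)  # [9, 10]
```

When every value occurs exactly once, `multimode` returns all of the values, a case that many texts describe as having no mode.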

In [8]:
""" Example 3.6

Descriptive statistics of the
Growth_Value.csv.
"""
grow_val.head(5)

|   | Year | Growth | Value |
|---|------|--------|-------|
| 0 | 1984 | -5.5 | -8.59 |
| 1 | 1985 | 39.91 | 22.1 |
| 2 | 1986 | 13.03 | 14.74 |
| 3 | 1987 | -1.7 | -8.58 |
| 4 | 1988 | 16.05 | 29.05 |


In [9]:
# Descriptive statistics.
grow_val[['Growth', 'Value']].describe()

|       | Growth | Value |
|-------|--------|-------|
| count | 36.0 | 36.0 |
| mean | 15.755 | 12.005 |
| std | 23.799285 | 17.979187 |
| min | -40.9 | -46.52 |
| 25% | 2.86 | 1.7025 |
| 50% | 15.245 | 15.38 |
| 75% | 36.9725 | 22.4375 |
| max | 79.48 | 44.08 |


## Subsetted Means
As we discussed in Chapter 1, subsetting data can provide valuable insights. Let's examine measures of central tendency using the data of an online retail company.

In [10]:
""" Example 3.7

We want to understand the spending behavior
of customers during the holiday season.
"""
# Read data.
online = pd.read_csv('Online.csv', index_col=False)
online.head(5)

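|   | Customer | Sex | Clothing | Health | Tech | Misc |
|---|----------|-----|----------|--------|------|------|
| 0 | 1 | Female | 246 | 185 | 64 | 75 |
| 1 | 2 | Male | 171 | 78 | 345 | 10 |
| 2 | 3 | Female | 95 | 15 | 47 | 90 |
| 3 | 4 | Male | 125 | 16 | 493 | 13 |
| 4 | 5 | Female | 368 | 100 | 82 | 109 |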


In [11]:
# Gender mask.
fem = online['Sex'] == 'Female'
cols = ['Clothing', 'Health', 'Tech', 'Misc']

# Mean spending per category for each group.
f_means = online.loc[fem, cols].mean()
m_means = online.loc[~fem, cols].mean()

# Results DF. Look values up by label; positional lookups such as
# f_means[0] on a label-indexed Series are deprecated in recent pandas.
result = pd.DataFrame({'Sex': ['Female', 'Male']})
for col in cols:
    result[col] = [round(f_means[col], 2), round(m_means[col], 2)]
result

|   | Sex | Clothing | Health | Tech | Misc |
|---|-----|----------|--------|------|------|
| 0 | Female | 225.67 | 100.25 | 47.1 | 159.88 |
| 1 | Male | 97.93 | 100.64 | 310.97 | 85.84 |


## The Weighted Mean
So far, every observation has contributed equally to the arithmetic mean. The **weighted mean** lets some observations contribute more than others. For example, a course grade is usually a weighted mean: exams typically count for more than quizzes or homework.

Let $w_1, w_2, ..., w_n$ denote the weights of the sample observations $x_1, x_2, ... , x_n$ such that $w_1 + w_2 + ... + w_n = 1$. The weighted mean for the sample is then:
$$
\bar{x} = \sum w_i x_i
$$
For a frequency distribution, we substitute the relative frequency of the $i$th interval for $w_i$ and the midpoint of the $i$th interval for $x_i$. The population weighted mean is computed similarly.

In [12]:
""" Example 3.8

A student scores 60 on Exam 1, a 70 on Exam 2,
and 80 on Exam 3. What's the average score for
the course if the Exams are weighted 25%, 25%, 
and 50% of the grade, respectively?
"""
w = [0.25, 0.25, 0.5]
scores = [60, 70, 80]

grade = sum(wi * si for wi, si in zip(w, scores))
grade

72.5
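
The definition above assumes weights that sum to 1. When raw weights are given instead (for example, points or credit hours), dividing by their total gives the same result. The sketch below shows this with the scores from Example 3.8; `weighted_mean` is a hypothetical helper written for this illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean; weights are normalized, so they need not sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total

scores = [60, 70, 80]

# Weights from Example 3.8 (already sum to 1).
grade_normalized = weighted_mean(scores, [0.25, 0.25, 0.5])
print(grade_normalized)  # 72.5

# Raw weights in the same 1:1:2 ratio give the same grade.
grade_raw = weighted_mean(scores, [1, 1, 2])
print(grade_raw)  # 72.5
```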

In [13]:
""" Example 3.9

In Ch2 we made a freq. distr. to
summarize house prices. Let's find
the mean house price!
"""
# Interval midpoints: 150, 250, ..., 650.
interval_mids = [50 + (i + 1) * 100 for i in range(6)]
# Relative frequency of each interval (these sum to 1).
freqs_ = [0.225, 0.4, 0.2, 0.1, 0.05, 0.025]

avg_hs_price = sum(f * m for f, m in zip(freqs_, interval_mids))
avg_hs_price

292.5