#  Defining the Population

### Introduction

While not strictly a form of bias, problems can also arise from how we define the population.  Defining the population means specifying the items or individuals that we wish to study, as well as the units of measurement.  When we sample, we need to make sure that our study stays consistent with that definition.

**Extrapolation** 

Extrapolation occurs when we draw a conclusion about something beyond the range of the data.
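To make the risk concrete, here is a minimal sketch using made-up data. It assumes the true relationship is `y = x**2` but that we only observed `x` between 0 and 5. The data, the straight-line fit, and the helper `predict` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: y = x**2, observed only for x from 0 to 5.
x_observed = np.arange(0, 6)
y_observed = x_observed ** 2

# Fit a straight line to the observed range.
slope, intercept = np.polyfit(x_observed, y_observed, deg=1)

def predict(x):
    """Prediction from the fitted line (illustrative helper)."""
    return slope * x + intercept

# Inside the observed range, the line is a rough but usable summary.
inside_error = abs(predict(3) - 3 ** 2)

# Far outside the observed range, the same line is badly wrong.
outside_error = abs(predict(20) - 20 ** 2)

print(inside_error, outside_error)
```

The fitted line stays close to the data where we have observations, but the error grows quickly once we ask about values of `x` that we never sampled.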

**Units of Measurement**

The other component we said is important to specify is the unit of measurement.  For example, let's say we are trying to get a sense of a university's average class size.  Here is some data on its classes.

In [35]:
import pandas as pd

names = ['English', 'History', 'Biology', 'Psych', 'Advanced Stats', 'US History', 'Law', 'Public Policy']
df = pd.DataFrame([15, 12, 25, 150, 10, 200, 5, 150], columns = ['class size'], index = names)
df

| | class size |
|---|---|
| English | 15 |
| History | 12 |
| Biology | 25 |
| Psych | 150 |
| Advanced Stats | 10 |
| US History | 200 |
| Law | 5 |
| Public Policy | 150 |


Now if we look at the average class size we get the following.

In [36]:
df['class size'].mean()

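70.875

This average counts each class once.  We can also change the unit of measurement from the class to the seat, so that a class appears once for every student sitting in it.  Below, we repeat each class size that many times.
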
In [50]:
import numpy as np
size_by_seat = np.hstack([np.repeat(val, val) for val in df['class size'].values])

size_by_seat[:20]

array([15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 12, 12,
       12, 12, 12])

In [51]:
size_by_seat.shape

(567,)

So here we see that there are 567 total classroom seats in the school.  The first seat is in a class with 15 students, while the 16th seat is in a class with 12 students.  Now let's look at the mean class size per seat.

In [52]:
size_by_seat.mean()

151.88536155202823
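The average more than doubles because a class of 200 contributes 200 seats while a class of 5 contributes only 5.  In other words, the per-seat mean is a weighted average of the class sizes in which each class is weighted by its own size, that is, the sum of the squared sizes divided by the sum of the sizes.  The sketch below checks this against the numbers above; the variable names `class_sizes`, `per_class_mean`, and `per_seat_mean` are only illustrative.

```python
import numpy as np

# Class sizes copied from the table above.
class_sizes = np.array([15, 12, 25, 150, 10, 200, 5, 150])

# Unit of measurement = class: every class counts once.
per_class_mean = class_sizes.mean()

# Unit of measurement = seat: every class counts once per enrolled student,
# which is a weighted average with the class sizes themselves as weights.
per_seat_mean = np.average(class_sizes, weights=class_sizes)

# The same quantity built the long way, by repeating each size once per seat.
size_by_seat = np.repeat(class_sizes, class_sizes)

print(per_class_mean)  # 70.875
print(per_seat_mean)   # approximately 151.885
print(np.isclose(per_seat_mean, size_by_seat.mean()))  # True
```

Neither number is wrong.  They answer different questions, which is why the unit of measurement needs to be stated along with the population.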

### Unit of Measurement Challenge

Now let's take another look at our NBA data.  Do you see any statistics that are aggregated, and that would give different averages depending on the unit of measurement we choose?  Prove it.

In [53]:
import pandas as pd
df = pd.read_csv('./nba_combined.csv', index_col = 0)

In [54]:
df[:2].T

| | 0 | 1 |
|---|---|---|
| player_id | klebima01 | wrighde01 |
| name | Maxi Kleber | Delon Wright |
| weight | 240 | 183 |
| birth_date | 1992-01-29 | 1992-04-26 |
| height | 82 | 77 |
| nationality | Germany | United States of America |
| team_abbreviation | DAL | DAL |
| most_recent_season | 2019 | 2019 |
| box_plus_minus | 0.3 | 2.2 |
| games_played | 209 | 263 |


### Resources

* Class size example from [Think Stats](https://greenteapress.com/thinkstats/)
* For more issues with grouping data, see [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox).