# Introduction to Sampling
In statistics, the set of all individuals relevant to a particular statistical question is called a **population**. For our analyst's question, all the people inside the company were relevant, so the population in this case consisted of all the people in the company.

If we instead tried to find out whether people at international companies are satisfied at work, then our group of over 50000 employees would become a sample: there are many international companies out there, and ours is just one of them.

A smaller group selected from a population is called a **sample**. When we select a smaller group from a population, we are sampling. In our example, the data analyst took a sample of approximately 100 people from a population of over 50000 people.
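As a minimal sketch of what sampling looks like in code, the snippet below draws 100 individuals from a population of 50000. The employee IDs are hypothetical placeholders and the exact sizes are assumptions taken from the rounded figures above. It uses only Python's standard library.

```python
import random

# Hypothetical population: employee IDs 1..50000 (illustrative, not real data).
population = list(range(1, 50001))

# Seeded generator so the same sample is drawn on every run.
rng = random.Random(1)

# Simple random sampling without replacement:
# each employee can be selected at most once.
sample = rng.sample(population, k=100)

print(len(population))  # size of the population
print(len(sample))      # size of the sample
```

Sampling without replacement is assumed here because it matches the idea of surveying each selected employee once.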

Populations do not necessarily consist of people. Behavioral scientists, for instance, often try to answer questions about populations of monkeys, rats or other lab animals. In a similar way, other people try to answer questions about countries, companies, vegetables, soils, pieces of equipment produced in a factory, etc.

![s1m1_units.svg](attachment:s1m1_units.svg)


## Sampling error
When we sample, the data we get might be more or less similar to the data in the population. For instance, let's say we know that the average salary in our company is $34500, and the proportion of women is 60\%. If we take two samples, we will likely find that their results differ slightly from these values and from each other.

A sample is by definition an incomplete set of data for the question we're trying to answer. For this reason, there's almost always some difference between the metrics of a population and the metrics of a sample. This difference can be seen as an error, and because it's the result of sampling, it's called **sampling error**.
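The effect can be reproduced with a small simulation. The sketch below builds a made-up company whose salaries and gender split are generated to sit near the figures quoted above (the normal distribution, the spread of 8000, and the helper `summarize` are all assumptions for illustration), then compares two random samples against the full population.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 50000 employees (generated values, not real data).
# Each employee is a (salary, is_woman) pair.
population = [
    (rng.gauss(34500, 8000), rng.random() < 0.6)
    for _ in range(50000)
]

def summarize(group):
    """Return (mean salary, proportion of women) for a group of employees."""
    mean_salary = statistics.mean(salary for salary, _ in group)
    prop_women = sum(is_woman for _, is_woman in group) / len(group)
    return mean_salary, prop_women

# Metrics computed on the whole population.
param_salary, param_women = summarize(population)
print("population", round(param_salary), round(param_women, 2))

# The same metrics computed on two independent samples of 100 employees.
for label in ("sample 1", "sample 2"):
    sample = rng.sample(population, k=100)
    stat_salary, stat_women = summarize(sample)
    print(label, round(stat_salary), round(stat_women, 2))
```

Because the population itself is randomly generated, its metrics land close to, but not exactly on, 34500 and 0.6. The two samples should differ from the population and from each other.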

A metric specific to a population is called a **parameter**, while one specific to a sample is called a **statistic**. 

Sampling error can therefore be expressed as the difference between a parameter and a statistic:

**sampling error = parameter − statistic**
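A worked example with numbers small enough to check by hand may help. The six salaries below are invented for illustration, and the sample is picked by hand rather than at random so the arithmetic stays easy to follow: the population mean is 35000, the sample mean is 34000, and the sampling error is 35000 − 34000 = 1000.

```python
import statistics

# Hypothetical population of six salaries (made-up values).
population = [30000, 32000, 34000, 36000, 38000, 40000]

# A hand-picked sample of three of those salaries (illustrative only).
sample = [30000, 34000, 38000]

parameter = statistics.mean(population)  # metric of the population
statistic = statistics.mean(sample)      # same metric, computed on the sample

# sampling error = parameter - statistic
sampling_error = parameter - statistic
print(sampling_error)
```

With this definition, a positive error means the sample underestimates the parameter, and a negative error means it overestimates it.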


## Dataset

The data set is about basketball players in the WNBA (Women's National Basketball Association) and contains general information about the players, along with their metrics for the 2016-2017 season. It was put together by Thomas De Jonghe and can be downloaded from Kaggle, where you can also find useful documentation for the data set.

https://www.kaggle.com/jinxbe/wnba-player-stats-2017
http://www.wnba.com/stats/player-stats/

In [1]:
import pandas as pd
wnba = pd.read_csv('wnba.csv')

# Explore dataset
print(wnba.head())
print(wnba.tail())

# get dimensions
print(wnba.shape)

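# Parameter: maximum number of games played across the whole population
parameter = wnba['Games Played'].max()

# Draw a reproducible random sample of 30 players
sample = wnba['Games Played'].sample(n=30, random_state=1)

# Statistic: the same metric computed on the sample
statistic = sample.max()
sampling_error = parameter - statistic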

print(sampling_error)

              Name Team  Pos  Height  Weight        BMI Birth_Place  \
0    Aerial Powers  DAL    F     183    71.0  21.200991          US   
1      Alana Beard   LA  G/F     185    73.0  21.329438          US   
2     Alex Bentley  CON    G     170    69.0  23.875433          US   
3  Alex Montgomery  SAN  G/F     185    84.0  24.543462          US   
4     Alexis Jones  MIN    G     175    78.0  25.469388          US   

           Birthdate  Age         College ...  OREB  DREB  REB  AST  STL  BLK  \
0   January 17, 1994   23  Michigan State ...     6    22   28   12    3    6   
1       May 14, 1982   35            Duke ...    19    82  101   72   63   13   
2   October 27, 1990   26      Penn State ...     4    36   40   78   22    3   
3  December 11, 1988   28    Georgia Tech ...    35   134  169   65   20   10   
4     August 5, 1994   23          Baylor ...     3     9   12   12    7    0   

   TO  PTS  DD2  TD3  
0  12   93    0    0  
1  40  217    0    0  
2  24  218    0  