# Sampling using Python

## Outline
- What is Sampling
- Simple Random Sample
- Stratified Random Sample
- Wrap up


  
## What is Sampling
   
When you can't collect data from an entire population, sampling is performed to collect a representative "sample" of the whole population. Even in relatively small populations, the data may be needed urgently, and including everyone in the population in your data collection may take too long.


Probability sampling means that every member of a population has a known, nonzero chance of being selected; in the simplest case, that chance is the same for everyone. For example, if you pick one person at random from a population of 100 people, each person has a 1 in 100 chance of being chosen. With non-probability sampling, the chances are unknown and usually unequal. For example, a person might have a better chance of being chosen if they live close to the researcher or have access to a computer.

Probability sampling gives you the best chance to create a sample that is truly representative of the population.
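To make the equal-chance case concrete, here is a rough sketch that repeatedly picks one member from a population of 100 and checks that each member turns up about 1 time in 100. The ID numbers, the seed, and the number of draws are assumptions chosen for illustration only.

```python
import random
from collections import Counter

# Hypothetical population: 100 people identified by the IDs 0..99.
population = list(range(100))

rng = random.Random(42)  # fixed seed so the run is repeatable
n_draws = 100_000

# Pick one person at a time, with every person equally likely.
counts = Counter(rng.choice(population) for _ in range(n_draws))

# Each empirical frequency should be close to 1/100 = 0.01.
frequencies = {person: counts[person] / n_draws for person in population}
print("lowest frequency:", min(frequencies.values()))
print("highest frequency:", max(frequencies.values()))
```

The lowest and highest frequencies should both sit near 0.01, and they move closer to it as `n_draws` grows.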

<img src="images/population_vs_sample.png" width="400">


## Simple Random Sample

Simple random sample: Every member, and every possible set of members of a given size, has an equal chance of being included in the sample. A random number generator or some other chance process is needed to get a simple random sample.

Example: A teacher puts students' names in a hat and draws names without looking to get a sample of students.

Why it's good: Random samples are usually fairly representative since they don't favor certain members.

<img src="images/simple_random_sample.png" width="200">

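`random.sample(population, k)`

The `random.sample` function takes two required arguments: `population`, the sequence to draw from, and `k`, the number of items to draw. It returns a new list of `k` items chosen without replacement.

`population` must be a sequence such as a list or tuple. Passing a set was deprecated in Python 3.9 and raises `TypeError` from Python 3.11 onward, so convert sets and dictionary views to a list first.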
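A minimal sketch of the two arguments in use, based on the names-in-a-hat example. The student names and the seed are made up for illustration.

```python
import random

# Hypothetical population: the names in the teacher's hat.
students = ["Ana", "Ben", "Chloe", "Dev", "Elif", "Farid"]

random.seed(0)  # optional: makes the draw repeatable
chosen = random.sample(students, k=3)
print(chosen)

# Sampling is without replacement: no position is picked twice,
# and the original list is left untouched.
print(len(set(chosen)) == 3, len(students) == 6)

# Asking for more items than the population holds is an error.
try:
    random.sample(students, k=10)
except ValueError as err:
    print("ValueError:", err)
```

Seeding is only needed when you want the same sample on every run, for example in a tutorial or a test.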

In [11]:
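# Select items from a list
import random

print("random module's sample in Python ")
# Avoid naming the variable `list`, which would shadow the built-in type
numbers = [20, 40, 80, 100, 120]
print("choosing 3 random items from a list using sample function",
      random.sample(numbers, k=3))

# sample() picks by position, so repeated values can appear in the result
numbers = [20, 40, 20, 20, 20]
print("choosing 3 random items from a list using sample function",
      random.sample(numbers, k=3))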

random module's sample in Python 
choosing 3 random items from a list using sample function [20, 100, 120]
choosing 3 random items from a list using sample function [40, 20, 20]


In [12]:
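# Select items from a set
weight_set = {40, 50, 55, 65, 75, 80}
# random.sample needs a sequence (sets raise TypeError in Python 3.11+),
# so convert the set to a sorted list first
print("choosing 4 random items from a set using sample method ",
      random.sample(sorted(weight_set), k=4))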

choosing 4 random items from a set using sample method  [65, 80, 50, 40]


In [13]:
# Select items from a dictionary
marks_dict = {
    "Kelly": 55,
    "jhon": 70,
    "Donald": 60,
    "Lennin": 50,
}
# dict.items() is a view, not a sequence, so wrap it in list() before sampling
print("choosing 2 random items from a dictionary using sample method ",
      random.sample(list(marks_dict.items()), k=2))

choosing 2 random items from a dictionary using sample method  [('Kelly', 55), ('Donald', 60)]


## Stratified Random Sample 

A stratified sample is one that ensures that subgroups (strata) of a given population are each adequately represented within the whole sample population of a research study.

Stratified random sample: The population is first split into groups. The overall sample consists of some members from every group. The members from each group are chosen randomly.

Example: A student council surveys 100 students by getting random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.

Why it's good: A stratified sample guarantees that every group is represented in the sample, which a simple random sample cannot promise, especially for small groups.
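The student council example can be sketched with the standard library alone: split the population into strata, then take a simple random sample inside each one. The helper `stratified_sample`, the student IDs, and the 200-students-per-grade figure are hypothetical and exist only for this illustration.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum members from each stratum.

    `strata` maps a group label to the list of members in that group.
    """
    rng = random.Random(seed)
    return {
        label: rng.sample(members, k=n_per_stratum)
        for label, members in strata.items()
    }


# Hypothetical school: 200 students per grade, identified by made-up IDs.
grades = ["freshman", "sophomore", "junior", "senior"]
students_by_grade = {
    grade: [f"{grade}_{i}" for i in range(200)] for grade in grades
}

survey = stratified_sample(students_by_grade, n_per_stratum=25, seed=1)

for grade, chosen in survey.items():
    print(grade, len(chosen))

total = sum(len(chosen) for chosen in survey.values())
print("total surveyed:", total)
```

This sketch takes the same number from every stratum, as in the example. When the strata differ in size, a common alternative is to make each stratum's share of the sample proportional to its share of the population.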

<img src="images/stratified_random_sample.png" width="200">



In [14]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Six observations (rows of X) with class labels y; the labels act as the strata
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 2, 0, 1, 2])

# Five random 50/50 splits, each preserving the class proportions of y
# as closely as possible
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
print(sss)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN:", X_train, y_train, "TEST:", X_test, y_test)

StratifiedShuffleSplit(n_splits=5, random_state=0, test_size=0.5,
            train_size=None)
TRAIN: [[1 2]
 [3 4]
 [3 4]] [2 1 0] TEST: [[1 2]
 [3 4]
 [1 2]] [0 2 1]
TRAIN: [[1 2]
 [1 2]
 [3 4]] [1 0 2] TEST: [[3 4]
 [1 2]
 [3 4]] [1 2 0]
TRAIN: [[3 4]
 [3 4]
 [3 4]] [1 0 2] TEST: [[1 2]
 [1 2]
 [1 2]] [0 2 1]
TRAIN: [[3 4]
 [1 2]
 [3 4]] [2 0 1] TEST: [[1 2]
 [1 2]
 [3 4]] [2 1 0]
TRAIN: [[1 2]
 [1 2]
 [1 2]] [1 2 0] TEST: [[3 4]
 [3 4]
 [3 4]] [2 1 0]


## Wrap up
We discussed:
- What is Sampling
- Simple Random Sample
- Stratified Random Sample