# Statistical Sampling

- What is sampling?

Sampling is a method for learning about a population from the statistics of a subset of that population (a sample), without having to investigate every individual.

Let’s understand this at a more intuitive level through an example.

We want to find the average height of all adult males in Alexandria. The population of Alexandria was nearly 5.2 million as of October 2018, so measuring every adult male in the city is practically impossible. Instead we choose a ***sample***: a group of people that represents the population well, and we compute the average height within that group.

- Why work with only a sample of the data? Why not use all of it?

Collecting all the data is:

1) Costly in time and money.

2) Often impractical.

3) Limited by the memory of our devices and cloud services. Google Colab, for example, provides 12 GB of memory.

- Do we need rules for sampling, or can we just take any **sample**?

Let's return to our example of finding the average male height. Suppose we go to a basketball court and use the professional basketball players there as our sample. This is not a good sample: basketball players are generally taller than the average male, so the sample would overestimate the average male height.
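The effect of a badly chosen sample can be illustrated with a small simulation. This is only a sketch: the heights are simulated from a normal distribution with an assumed mean of 172 cm and standard deviation of 7 cm (not real data for Alexandria), and the "basketball players" are simply taken to be the tallest people in that simulated population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 adult male heights in cm (simulated, not real data).
population = rng.normal(loc=172, scale=7, size=100_000)

# Biased sample: only the 500 tallest men, standing in for basketball players.
biased_sample = np.sort(population)[-500:]

# Random sample of the same size, drawn without replacement.
random_sample = rng.choice(population, size=500, replace=False)

print(f"population mean:    {population.mean():.1f}")
print(f"biased sample mean: {biased_sample.mean():.1f}")
print(f"random sample mean: {random_sample.mean():.1f}")
```

The random sample mean should land close to the population mean, while the biased sample mean sits far above it.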

## Steps for Sampling

- Step 1: Clearly define the target population.

For our example: males over 18 years old who live in Alexandria.

- Step 2: Build the sampling frame, the list of items or people forming the population from which the sample is taken.

For our example: a list containing the names of all males over 18 in Alexandria.

- Step 3: Choose a sampling method.

The available methods are covered later in this session.

- Step 4: Define the sample size.

The sample size is the number of individuals or items in the sample. It should be large enough to make inferences about the population with the desired level of accuracy and precision.

All else being equal, a larger sample gives a more precise estimate, although the gain shrinks as the sample grows, and a larger sample does not correct a biased selection method.

For our example: let's say 30,000 adult males from Alexandria.

- Step 5: Collect the data.
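The link between sample size and precision in Step 4 can be checked numerically. For a simple random sample, the standard error of the sample mean is roughly the population standard deviation divided by the square root of the sample size. The sketch below assumes a simulated population of heights (not real data) and a hypothetical helper, `spread_of_sample_means`, that measures how much the sample mean varies across repeated samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of heights in cm (simulated, not real data).
population = rng.normal(loc=172, scale=7, size=100_000)

def spread_of_sample_means(n, repeats=200):
    """Standard deviation of the sample mean over repeated samples of size n."""
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(repeats)]
    return float(np.std(means))

# Compare three sample sizes, each ten times larger than the previous one.
spreads = {n: spread_of_sample_means(n) for n in (100, 1_000, 10_000)}
for n, spread in spreads.items():
    print(f"n = {n:>6}: spread of sample mean ~ {spread:.3f} cm")
```

Each tenfold increase in sample size should shrink the spread by roughly a factor of three, which is why very large samples bring diminishing returns.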

# Different Types of Sampling Techniques

- **Probability Sampling**: In probability sampling, every element of the population has a known, non-zero chance of being selected (an equal chance in the simplest designs). Probability sampling gives us the best chance of creating a sample that is truly representative of the population.

- **Non-Probability Sampling**: In non-probability sampling, selection is not random, so some elements have an unknown or zero chance of being selected. Consequently, there is a significant risk of ending up with a non-representative sample that does not produce generalizable results.

## Types of Probability Sampling

## - Simple Random Sampling
In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population. To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.

 - Our example:

You want to select a simple random sample of 30,000 adult males of Alexandria. You assign a number to every adult male in the database, from 1 to 3 million, and use a random number generator to select 30,000 numbers.

## - Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.

Example: with an interval of 10 and a starting point of 6, the sample is (6, 16, 26, 36, ..., 166, ...).
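A minimal sketch of systematic sampling follows. It assumes a hypothetical frame of 200 numbered members and a target sample of 20, so the interval is 200 / 20 = 10. The starting position is drawn at random within the first interval, which is a common way to keep the selection from depending on the researcher's choice.

```python
import numpy as np

rng = np.random.default_rng(2)

population_ids = np.arange(1, 201)               # hypothetical frame: members numbered 1..200
sample_size = 20
interval = len(population_ids) // sample_size    # k = N / n = 10

# Pick a random starting position within the first interval,
# then take every k-th member from there.
start = int(rng.integers(0, interval))
systematic_sample = population_ids[start::interval]

print(systematic_sample)
```

Note that if the list has a periodic pattern that lines up with the interval, a systematic sample can be skewed, so the ordering of the frame matters.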

## - Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.

To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic (e.g. gender, age range, income bracket, job role).

Example: strata defined by age: ([18, 25], [26, 30], [31, 35], ...). A random sample is then drawn from each stratum.
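Stratifying by age bracket can be sketched with pandas. The data here is simulated, and the column names `age` and `stratum`, the bracket boundaries, and the 10% sampling fraction are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical frame of 1,000 adult males with ages 18-60 (simulated).
people = pd.DataFrame({"age": rng.integers(18, 61, size=1_000)})

# Assign each person to an age stratum (bins are right-inclusive).
bins = [17, 25, 30, 35, 60]
labels = ["18-25", "26-30", "31-35", "36-60"]
people["stratum"] = pd.cut(people["age"], bins=bins, labels=labels)

# Draw 10% from every stratum so each keeps its share of the sample.
stratified = people.groupby("stratum", observed=True).sample(frac=0.1, random_state=0)

# The stratum proportions in the sample should mirror those in the population.
print(people["stratum"].value_counts(normalize=True).sort_index().round(2))
print(stratified["stratum"].value_counts(normalize=True).sort_index().round(2))
```

Sampling the same fraction from each stratum is proportional allocation; other designs deliberately oversample small strata and reweight afterwards.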

## - Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the whole population. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.

1) Create subgroups (clusters) that are similar to one another.

2) Randomly choose one or more subgroups and include every member of each chosen subgroup.
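The two steps above can be sketched as follows. The example assumes a hypothetical frame in which each person belongs to one of 20 districts, with the districts acting as clusters. The column names and the choice of 4 clusters are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical frame: 1,000 people spread over 20 districts (the clusters).
people = pd.DataFrame({
    "person_id": np.arange(1_000),
    "district": rng.integers(0, 20, size=1_000),
})

# Randomly choose 4 whole districts, then keep everyone who lives in them.
chosen_districts = rng.choice(people["district"].unique(), size=4, replace=False)
cluster_sample = people[people["district"].isin(chosen_districts)]

print(sorted(chosen_districts.tolist()))
print(len(cluster_sample))
```

Unlike stratified sampling, only the chosen clusters contribute to the sample, so the result is only as representative as the clusters are similar to the population.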

## Non-probability sampling methods
In a non-probability sample, individuals are selected based on non-random criteria, and not every individual has a chance of being included.

## - Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible to the researcher.

Our example: taking my friends and family as a sample.

## - Voluntary response sampling
Similar to a convenience sample, a voluntary response sample is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves (e.g. by responding to a public online survey).

## - Judgment sampling
In judgment sampling, the researcher or a domain expert uses their own judgment to choose whom to ask to participate.

## - Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other participants. The number of people you have access to “snowballs” as you get in contact with more people.

In [4]:
# Simple random sampling:
import numpy as np

datastored  = np.array([12,20,34,32,13,31,12,76,42,78,10,54,12,13])
index_sample= np.random.choice(range(14), 5, replace=False)
index_sample

array([6, 0, 2, 8, 4])

In [5]:
rand_sampled_data = []
for index in index_sample:
  rand_sampled_data.append(datastored[index])
print(rand_sampled_data)

[12, 12, 34, 42, 13]


In [27]:
import pandas as pd

df = pd.read_csv('/content/Social_Network_Ads.csv')

In [28]:
df

Unnamed: 0,User ID,Age,EstimatedSalary,Purchased
0,15624510,19,19000,0
1,15810944,35,20000,0
2,15668575,26,43000,0
3,15603246,27,57000,0
4,15804002,19,76000,0
...,...,...,...,...
395,15691863,46,41000,1
396,15706071,51,23000,1
397,15654296,50,20000,1
398,15755018,36,33000,0


In [29]:
sample_df = df.sample(10)
sample_df

Unnamed: 0,User ID,Age,EstimatedSalary,Purchased
349,15721835,38,61000,0
154,15605327,40,47000,0
386,15724150,49,39000,1
85,15663939,31,118000,1
288,15649668,41,79000,0
304,15598070,40,60000,0
347,15768151,54,108000,1
266,15721592,40,75000,0
193,15662901,19,70000,0
147,15749130,41,30000,0


In [30]:
# Systematic sampling:

index_sample = range(3,14,2)
sys_sampled_data = []

for index in index_sample:
  sys_sampled_data.append(datastored[index])
print(sys_sampled_data)

[32, 31, 76, 78, 54, 13]


In [32]:
X = df.iloc[:,:-1].values
y = df['Purchased']

In [33]:
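# Stratified sampling:
# train_test_split with stratify=y keeps the class proportions of y in both parts,
# so the 25% "test" part serves as a stratified sample of the data.

from sklearn.model_selection import train_test_split

X_train, X_sampled, y_train, y_sampled = train_test_split(
    X, y,
    stratify=y,
    test_size=0.25,
)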

In [34]:
X_sampled

array([[15586757,       39,   134000],
       [15753102,       35,    97000],
       [15591915,       33,    51000],
       [15735549,       39,   134000],
       [15688172,       59,    76000],
       [15660541,       40,    57000],
       [15707634,       49,    28000],
       [15638646,       48,   141000],
       [15596761,       36,   125000],
       [15671766,       26,    72000],
       [15575002,       35,    60000],
       [15724423,       40,    75000],
       [15800515,       38,   113000],
       [15778368,       54,    70000],
       [15637593,       37,    79000],
       [15787550,       42,    54000],
       [15710257,       35,    25000],
       [15619465,       48,    30000],
       [15746139,       20,    86000],
       [15815236,       45,   131000],
       [15594041,       49,    36000],
       [15796351,       36,   118000],
       [15611191,       53,    82000],
       [15670619,       25,    90000],
       [15601550,       36,    54000],
       [15593014,       2

In [35]:
y_sampled.shape

(100,)

## Oversampling

- What is it?

Oversampling amplifies the minority class so that the dataset becomes balanced.

- Why do we use it?

An imbalanced dataset can cause a machine learning model to perform poorly, particularly on the minority class.

- How is it done?

One common approach is the Synthetic Minority Oversampling Technique (SMOTE). SMOTE selects examples of the minority class that are close in the feature space, draws a line between them, and creates a new sample at a point along that line.

Check the paper: 
https://arxiv.org/pdf/1106.1813.pdf
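The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration rather than the library's implementation: the four minority points, the helper name `smote_like_point`, and the choice of `k=2` neighbours are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_point(points, k=2):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                            # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_point(minority) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.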


In [42]:
from collections import Counter

counter = Counter(y)
counter
## 143 -> Purchased = 1
## 257 -> Purchased = 0

Counter({0: 257, 1: 143})

In [37]:
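from imblearn.over_sampling import SMOTE
# Recent imbalanced-learn releases use sampling_strategy (formerly ratio)
# and fit_resample (formerly fit_sample).
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)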



In [38]:
X_sm

array([[15624510,       19,    19000],
       [15810944,       35,    20000],
       [15668575,       26,    43000],
       ...,
       [15703775,       51,    27479],
       [15745551,       47,   144378],
       [15617377,       51,    34156]])

In [39]:
X_sm.shape

(514, 3)

In [43]:
counter = Counter(y_sm)
counter

Counter({0: 257, 1: 257})

## Resources:

article : https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/

article : https://www.scribbr.com/methodology/sampling-methods/