# Sampling

## Loading a Simple Dataset

In [3]:
import pandas as pd
import random
import numpy as np

In [4]:
dataset = pd.read_csv("/content/drive/MyDrive/Estatística para Ciência de Dados e Machine Learning/Bases de dados/census.csv")

In [5]:
dataset.head()

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## **Simple Random Sampling**

### Concept

A method in which every element of the population has the same probability of being selected. The choice is made entirely at random, usually by drawing lots.

### Main characteristic

* Equal probability for every individual.

### Formulas for the sample size

* First, compute the first approximation n<sub>0</sub>
$$
n_0 = \frac{1}{E_0^2}
$$

* Then, correct it for the population size N to get the sample size n
$$
n = \frac{N \cdot n_0}{N + n_0}
$$

* Simplified formula, obtained by substituting n<sub>0</sub> = 1 / E<sub>0</sub><sup>2</sup>
$$
n = \frac{N}{1 + N \cdot E_0^2}
$$

#### Where:
* N -> population size
* n -> sample size
* n<sub>0</sub> -> first approximation of the sample size
* E<sub>0</sub> -> tolerable sampling error, expressed as a proportion (for example, 0.05 for 5%)
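
As a quick check of the formulas above, the sketch below computes the sample size both ways (two-step and simplified) and shows that they agree. The function names are hypothetical helpers, not part of the notebook; N is the number of rows reported for the census dataset later in this notebook, and the 5% error is an assumed example value.

```python
def initial_approximation(e0):
    """First approximation n_0 = 1 / E_0^2 (does not depend on N)."""
    return 1 / e0 ** 2


def corrected_sample_size(N, e0):
    """Two-step version: n = N * n_0 / (N + n_0)."""
    n0 = initial_approximation(e0)
    return N * n0 / (N + n0)


def simplified_sample_size(N, e0):
    """One-step version: n = N / (1 + N * E_0^2)."""
    return N / (1 + N * e0 ** 2)


# Assumed example values: 32561 rows and a 5% sampling error.
N, e0 = 32561, 0.05

n0 = initial_approximation(e0)
two_step = corrected_sample_size(N, e0)
one_step = simplified_sample_size(N, e0)

print(f"n0 = {n0:.2f}")
print(f"two-step n = {two_step:.2f}")
print(f"one-step n = {one_step:.2f}")
```

Both versions give a value of roughly 395, so the simplified formula is just the two-step calculation with n<sub>0</sub> substituted in.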

### When to use:
* When every element of the population can be given the same chance of being chosen.
* When the population is relatively homogeneous.
* When you have a complete list of the population.


Example: draw 100 students from a list of all the students at the university.
<br><br>

**Advantage:** simple and unbiased.

**Disadvantage:** may not represent small subgroups well.

In [31]:
def sample_size(N, e):
    """Calculates the sample size for a finite population of size N and sampling error e."""
    # n = N / (1 + N * e^2), rounded to the nearest integer.
    return round(N / (1 + N * e ** 2))


def random_simple_sampling(dataset, n):
    """Draws a simple random sample of n rows (without replacement)."""
    # The parameter is named n so that it does not shadow the sample_size function.
    return dataset.sample(n=n, random_state=1)


def random_simple_sampling_by_error(dataset, sampling_error):
    """Draws a simple random sample whose size is derived from a sampling error."""
    n = sample_size(len(dataset), sampling_error)
    return random_simple_sampling(dataset, n)

In [32]:

df_random_simple_sampling = random_simple_sampling_by_error(dataset, 0.05)


In [33]:
df_random_simple_sampling.shape

(395, 16)

In [34]:
df_random_simple_sampling.head()

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income,group
9646,62,Self-emp-not-inc,26911,7th-8th,4,Widowed,Other-service,Not-in-family,White,Female,0,0,66,United-States,<=50K,2
709,18,Private,208103,11th,7,Never-married,Other-service,Other-relative,White,Male,0,0,25,United-States,<=50K,0
7385,25,Private,102476,Bachelors,13,Never-married,Farming-fishing,Own-child,White,Male,27828,0,50,United-States,>50K,2
16671,33,Private,511517,HS-grad,9,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,<=50K,5
21932,36,Private,292570,11th,7,Never-married,Machine-op-inspct,Unmarried,White,Female,0,0,40,United-States,<=50K,6


## **Systematic Sampling**

### Concept
* Elements are selected by a fixed rule after an initial random draw. Usually one element is taken every k positions.

### Main characteristic

* Regular selection after a random starting point.

### Sampling interval formula
$$
k = \frac{N}{n}
$$

### Where:
* N = population size
* n = sample size
* k = selection interval (in practice rounded down to an integer)

Example: select 1 out of every 10 elements.

### When to use:
* When the population is organized as a list or sequence.
* When you want a faster method than a pure random draw.
* When there is no repeating pattern in the list.

### How it works:
* Pick a starting element at random, then select every k-th element after it.

Example:
* Interview 1 out of every 10 customers who enter a store.

<br><br>
**Advantage:** easy to apply.

**Caution:** do not use it if the order of the data follows a periodic pattern.
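
To illustrate the caution above, here is a minimal sketch with made-up data: daily sales that repeat every 7 days, with higher values on weekends. All numbers are hypothetical. When the interval k equals the period of the pattern, the sample only ever sees one day of the week and the estimated mean is badly biased; an interval that does not line up with the period behaves much better.

```python
import numpy as np

# Hypothetical weekly pattern (Monday..Sunday), repeated for 52 weeks.
weekly_pattern = np.array([100, 100, 100, 100, 100, 300, 300])
sales = np.tile(weekly_pattern, 52)

true_mean = sales.mean()

# k = 7 matches the period: each sample contains a single weekday.
only_saturdays = sales[5::7]   # start = 5, every 7th element
only_mondays = sales[0::7]     # start = 0, every 7th element

# k = 5 does not line up with the 7-day period, so all weekdays appear.
mixed_days = sales[0::5]

print("true mean:          ", round(true_mean, 2))
print("k=7, start=5 (Sat): ", only_saturdays.mean())
print("k=7, start=0 (Mon): ", only_mondays.mean())
print("k=5, start=0:       ", round(mixed_days.mean(), 2))
```

With k = 7 the estimate depends entirely on the starting day, which is why the order of the data should be checked for periodic patterns before sampling systematically.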

In [10]:
sample_n = 100
population_N = len(dataset)

# Calculate the sampling interval (k) by dividing the total population by the desired sample size.
# This determines how often a sample is taken from the population.
interval_k = population_N // sample_n

population_N, sample_n, interval_k

(32561, 100, 325)

In [11]:
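# Draw the random starting index (here between 0 and sample_n, inclusive).
# Note: a strict systematic sample draws the start from the first interval,
# i.e. from 0 to interval_k - 1, so that every element can be selected.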
random.seed(1)
first_sample = random.randint(0, sample_n)
first_sample

17

In [12]:
# Generates a sequence of indices starting from the first sample up to the population size, using a fixed interval
np.arange(first_sample, population_N, step=interval_k)

array([   17,   342,   667,   992,  1317,  1642,  1967,  2292,  2617,
        2942,  3267,  3592,  3917,  4242,  4567,  4892,  5217,  5542,
        5867,  6192,  6517,  6842,  7167,  7492,  7817,  8142,  8467,
        8792,  9117,  9442,  9767, 10092, 10417, 10742, 11067, 11392,
       11717, 12042, 12367, 12692, 13017, 13342, 13667, 13992, 14317,
       14642, 14967, 15292, 15617, 15942, 16267, 16592, 16917, 17242,
       17567, 17892, 18217, 18542, 18867, 19192, 19517, 19842, 20167,
       20492, 20817, 21142, 21467, 21792, 22117, 22442, 22767, 23092,
       23417, 23742, 24067, 24392, 24717, 25042, 25367, 25692, 26017,
       26342, 26667, 26992, 27317, 27642, 27967, 28292, 28617, 28942,
       29267, 29592, 29917, 30242, 30567, 30892, 31217, 31542, 31867,
       32192, 32517])

In [13]:
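def systematic_sampling(dataset, sample_size):
    """Draws a systematic sample of roughly sample_size rows (one row every k positions)."""
    interval_k = len(dataset) // sample_size
    random.seed(1)
    # randint is inclusive on both ends, so the upper bound is interval_k - 1
    # to keep the starting index inside the first interval.
    first_sample = random.randint(0, interval_k - 1)
    indexes = np.arange(first_sample, len(dataset), step=interval_k)
    return dataset.iloc[indexes]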

In [14]:
sample_n = sample_size(len(dataset), 0.05)
df_systematic_sampling = systematic_sampling(dataset, sample_n)
df_systematic_sampling

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income
17,32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,<=50K
99,32,Federal-gov,249409,HS-grad,9,Never-married,Other-service,Own-child,Black,Male,0,0,40,United-States,<=50K
181,43,Private,114580,Some-college,10,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K
263,59,Private,146013,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,4064,0,40,United-States,<=50K
345,43,Self-emp-not-inc,241895,Bachelors,13,Never-married,Sales,Not-in-family,White,Male,0,0,42,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32161,37,Private,143771,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,60,United-States,>50K
32243,33,Private,198069,HS-grad,9,Never-married,Machine-op-inspct,Not-in-family,White,Male,0,0,65,United-States,<=50K
32325,31,State-gov,110714,Some-college,10,Never-married,Other-service,Own-child,White,Female,0,0,37,United-States,<=50K
32407,33,Private,232475,Some-college,10,Never-married,Sales,Own-child,White,Male,0,0,45,United-States,<=50K


## **Cluster Sampling**

### Concept

* The population is divided into natural groups (schools, neighborhoods, companies), some groups are drawn at random, and every element of the drawn groups is analyzed.

### Main characteristic

* Whole groups are drawn, not individuals.

### Formula

* ❌ There is no specific formula, because the method depends on drawing the clusters rather than on a per-individual calculation.

* 📌 The focus is on logistics rather than on mathematical calculation.

### When to use:

* When the population is very large or geographically dispersed.
* When individuals are hard to reach but whole groups are easy to reach.

### How it works:

* Divide the population into natural groups and draw some whole groups at random.

### Example:

* Draw a few schools and interview all the students of those schools.

<br>

**Advantage:** reduces costs.

**Disadvantage:** lower precision when the groups differ a lot from one another.
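
The school example above can be sketched as follows when the natural groups are already stored in a column. This is only an illustration: the `students` table, the `school` column, and the `cluster_sample` helper are hypothetical and are not part of the census dataset used in this notebook.

```python
import random

import pandas as pd

# Hypothetical population: 14 students spread over 4 schools.
students = pd.DataFrame({
    "school": ["A"] * 3 + ["B"] * 4 + ["C"] * 2 + ["D"] * 5,
    "student_id": range(14),
})


def cluster_sample(df, cluster_col, n_clusters, seed=None):
    """Draw n_clusters whole clusters at random and keep all of their rows."""
    rng = random.Random(seed)
    clusters = sorted(df[cluster_col].unique())
    chosen = rng.sample(clusters, k=n_clusters)
    return df[df[cluster_col].isin(chosen)]


sample = cluster_sample(students, "school", n_clusters=2, seed=1)
print(sample["school"].value_counts())
```

Every student of a drawn school ends up in the sample, so the sample size depends on which schools are drawn rather than being fixed in advance.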

In [18]:
"""
Divide the population size by the desired number of groups.
This value determines how many individuals will be in each group.
"""
quantity_in_group = len(dataset) // 10
quantity_in_group

3256

In [20]:
"""
Iterates over the rows of the dataset to split them into consecutive groups.
The group id assigned to each row is stored in the groups list.
The id_group counter identifies the group to which each individual belongs.
Note: because the check below is `count > quantity_in_group`, each full group holds quantity_in_group + 1 rows.
"""
groups = []
id_group = 0
count = 0

for _ in dataset.iterrows():
  groups.append(id_group)
  count += 1
  if count > quantity_in_group:
    id_group += 1
    count = 0

print(groups)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... (output truncated)

In [27]:
np.unique(groups, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([3257, 3257, 3257, 3257, 3257, 3257, 3257, 3257, 3257, 3248]))

In [30]:
np.shape(groups), dataset.shape

((32561,), (32561, 16))
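
As an alternative to the row-by-row loop, the group ids can be computed in one vectorized step. This is a sketch under the assumption that nearly equal group sizes are acceptable; `assign_groups` is a hypothetical helper, and it does not reproduce the exact group sizes shown above.

```python
import numpy as np


def assign_groups(n_rows, n_groups):
    """Return a group id (0 .. n_groups - 1) for each of n_rows consecutive rows."""
    # Integer arithmetic spreads the rows so that group sizes differ by at most 1.
    return np.arange(n_rows) * n_groups // n_rows


group_ids = assign_groups(32561, 10)
ids, counts = np.unique(group_ids, return_counts=True)
print(ids)
print(counts)
```

The result can be assigned to a column in the same way as the `groups` list, and it always yields exactly the requested number of groups.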

In [21]:
"""
Adds the groups list as a new column to the dataset.
This column will be used to identify the group to which each individual belongs.
"""
dataset["group"] = groups
dataset.head()

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income,group
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,0


In [22]:
dataset.tail()

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income,group
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K,9
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K,9
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K,9
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K,9
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K,9


In [88]:
"""
Choose a random group and obtain the data from dataset belonging to that group.
"""
random.seed(1)
group_random = random.randint(0, 9)

df_grouping = dataset[dataset["group"] == group_random]
total_register_in_group = df_grouping["group"].value_counts()

print(df_grouping.shape, end="\n\n")
print(total_register_in_group)

(326, 16)

group
2    326
Name: count, dtype: int64


In [90]:
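def simpling_grouping(dataset, group_quantity):
  """Splits the rows into consecutive groups and returns one randomly chosen group."""
  interval = len(dataset) // group_quantity

  groups = []
  id_group = 0
  count = 0

  for _ in range(len(dataset)):
    groups.append(id_group)
    count += 1
    if count > interval:
      count = 0
      id_group += 1

  # Note: this adds (or overwrites) the "group" column on the caller's DataFrame.
  dataset["group"] = groups
  random.seed(1)
  # randint is inclusive, so draw only among the group ids that actually exist.
  selected_group = random.randint(0, groups[-1])
  return dataset[dataset["group"] == selected_group]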


In [89]:
df_grouping = simpling_grouping(dataset, 100)
df_grouping.shape, df_grouping["group"].value_counts()

((326, 16),
 group
 17    326
 Name: count, dtype: int64)

## **Stratified Sampling**

### Concept

* The population is divided into homogeneous strata (sex, age, department, etc.) and a random sample is drawn within each stratum.

### Main characteristic

* Every stratum is represented.

### Proportional allocation formula
$$
n_h = \frac{N_h}{N} \cdot n
$$

### Where:

* n<sub>h</sub> = sample size in stratum h
* N<sub>h</sub> = size of stratum h
* N = total population
* n = total sample size
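
A small worked example of the proportional allocation formula, using made-up stratum sizes. The `proportional_allocation` helper and the department sizes are hypothetical; note that rounding each n<sub>h</sub> can, in general, make the parts add up to slightly more or less than n.

```python
def proportional_allocation(stratum_sizes, n):
    """Compute n_h = N_h / N * n for every stratum, rounded to an integer."""
    N = sum(stratum_sizes.values())
    return {name: round(N_h * n / N) for name, N_h in stratum_sizes.items()}


# Hypothetical company with 1000 employees in three departments.
strata = {"administration": 200, "sales": 300, "production": 500}
allocation = proportional_allocation(strata, n=100)
print(allocation)
```

Here each department contributes to the sample in the same proportion it has in the population: 20%, 30% and 50% of the 100 sampled employees.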

### When to use:

* When the population has important subgroups (strata).
* When you want to guarantee that every stratum is represented.

### How it works:

* Divide the population into homogeneous strata and draw a random sample from each one.

### Example:

* Split employees by department (administration, sales, production) and draw people from each department.
<br>

**Advantage:** higher precision and representativeness.

**Disadvantage:** requires more planning.
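
One possible way to implement the steps above is with pandas, drawing the same fraction of rows from every stratum so that the allocation is proportional. This is a sketch on a hypothetical toy table; the `stratified_sampling` helper is an assumption, not code from the notebook. The same idea could be tried on the census data with a column such as `income` as the stratum.

```python
import pandas as pd

# Hypothetical toy population with two strata of very different sizes.
population = pd.DataFrame({
    "income": ["<=50K"] * 80 + [">50K"] * 20,
    "person_id": range(100),
})


def stratified_sampling(df, stratum_col, frac, seed=1):
    """Draw the same fraction of rows at random from every stratum."""
    return df.groupby(stratum_col).sample(frac=frac, random_state=seed)


sample = stratified_sampling(population, "income", frac=0.1)
print(sample["income"].value_counts())
```

Because each stratum is sampled separately, the small `>50K` stratum is guaranteed to appear in the sample, which a simple random sample of the same size does not guarantee.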


# **Reservoir Sampling**