### Data Sampling Techniques

### Simple Random Sampling

A fixed number of elements is drawn at random from the population, so every element has the same chance of being selected.

In [4]:
import pandas as pd

Loading the dataset.

In [6]:
df = pd.read_csv("covid19.csv")

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50982 entries, 0 to 50981
Data columns (total 12 columns):
case_id               0 non-null float64
provincial_case_id    50982 non-null int64
age                   50982 non-null object
sex                   50982 non-null object
health_region         50982 non-null object
province              50982 non-null object
country               50982 non-null object
date_report           50982 non-null object
report_week           50982 non-null object
has_travel_history    1150 non-null object
locally_acquired      574 non-null object
case_source           50982 non-null object
dtypes: float64(1), int64(1), object(10)
memory usage: 4.7+ MB


In [8]:
df.head()

Unnamed: 0,case_id,provincial_case_id,age,sex,health_region,province,country,date_report,report_week,has_travel_history,locally_acquired,case_source
0,,1,50-59,Male,Toronto,Ontario,Canada,2020-01-25,2020-01-19,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
1,,2,50-59,Female,Toronto,Ontario,Canada,2020-01-27,2020-01-26,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
2,,1,40-49,Male,Vancouver Coastal,BC,Canada,2020-01-28,2020-01-26,t,,https://news.gov.bc.ca/releases/2020HLTH0015-0...
3,,3,20-29,Female,Middlesex-London,Ontario,Canada,2020-01-31,2020-01-26,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
4,,2,50-59,Female,Vancouver Coastal,BC,Canada,2020-02-04,2020-02-02,f,Close Contact,https://news.gov.bc.ca/releases/2020HLTH0023-0...


Drawing a sample of only 1,000 records from the dataset.


In [13]:
df_sample = df.sample(n=1000)

In [14]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 48471 to 15468
Data columns (total 12 columns):
case_id               0 non-null float64
provincial_case_id    1000 non-null int64
age                   1000 non-null object
sex                   1000 non-null object
health_region         1000 non-null object
province              1000 non-null object
country               1000 non-null object
date_report           1000 non-null object
report_week           1000 non-null object
has_travel_history    31 non-null object
locally_acquired      18 non-null object
case_source           1000 non-null object
dtypes: float64(1), int64(1), object(10)
memory usage: 101.6+ KB


Specifying the sample size as a fraction of the dataset (here 10%).

In [15]:
df_sample = df.sample(frac=0.10)

In [16]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5098 entries, 15139 to 14714
Data columns (total 12 columns):
case_id               0 non-null float64
provincial_case_id    5098 non-null int64
age                   5098 non-null object
sex                   5098 non-null object
health_region         5098 non-null object
province              5098 non-null object
country               5098 non-null object
date_report           5098 non-null object
report_week           5098 non-null object
has_travel_history    135 non-null object
locally_acquired      72 non-null object
case_source           5098 non-null object
dtypes: float64(1), int64(1), object(10)
memory usage: 517.8+ KB
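
The two calls above return different rows on every run. A minimal sketch of a repeatable draw, assuming a small made-up DataFrame (`df_demo`) in place of the COVID-19 table; `random_state` is a real `DataFrame.sample` parameter, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 50 rows, two columns.
df_demo = pd.DataFrame({
    "case": np.arange(50),
    "province": ["Quebec"] * 30 + ["Ontario"] * 20,
})

# Fixed-size sample; random_state makes the draw repeatable.
sample_n = df_demo.sample(n=10, random_state=42)

# Fractional sample: 10% of 50 rows gives 5 rows.
sample_frac = df_demo.sample(frac=0.10, random_state=42)

# Same seed, same rows.
sample_again = df_demo.sample(n=10, random_state=42)

print(len(sample_n), len(sample_frac))
print(sample_n.equals(sample_again))
```

Sampling is without replacement by default, so no row appears twice in either sample.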


### Stratified Random Sampling

Importing the train_test_split function to perform the sampling.

In [17]:
from sklearn.model_selection import train_test_split

Counting the records per province.

In [19]:
df['province'].value_counts()  # record count per province

Quebec           25757
Ontario          16337
Alberta           4850
BC                2053
Nova Scotia        915
Saskatchewan       366
Manitoba           272
NL                 258
New Brunswick      118
PEI                 27
Repatriated         13
Yukon               11
NWT                  5
Name: province, dtype: int64

Generating the stratified sample.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('province',axis=1),
                                                    df['province'],
                                                    stratify=df['province'],
                                                    test_size=0.20)

Checking the shape of the data.

In [22]:
y_test.shape  # 20% of the total

(10197,)
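
The 10197 above is slightly more than 20% of 50982 (which is 10196.4). A quick check of the arithmetic, assuming scikit-learn rounds a fractional `test_size` up to the next whole row, which is consistent with the shape shown:

```python
import math

n_rows = 50982     # total rows reported by df.info()
test_size = 0.20   # fraction passed to train_test_split

# Assumption: the test set size is the ceiling of test_size * n_rows.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_test, n_train)  # 10197 40785
```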

Checking the record counts per province in the sample.

In [23]:
y_test.value_counts()

Quebec           5152
Ontario          3267
Alberta           970
BC                411
Nova Scotia       183
Saskatchewan       73
Manitoba           54
NL                 52
New Brunswick      24
PEI                 5
Repatriated         3
Yukon               2
NWT                 1
Name: province, dtype: int64
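
Raw counts are hard to compare by eye. A minimal sketch that puts the population and sample proportions side by side, using made-up labels (`labels`, `features` and the group names are hypothetical) instead of the province column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for df['province'].
labels = pd.Series(["A"] * 700 + ["B"] * 250 + ["C"] * 50, name="group")
features = pd.DataFrame({"value": range(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, test_size=0.20, random_state=0
)

# Share of each group in the full data and in the stratified sample.
comparison = pd.DataFrame({
    "population": labels.value_counts(normalize=True),
    "sample": y_te.value_counts(normalize=True),
})
print(comparison)
```

With stratification the two columns should agree closely; the same comparison on `df['province']` and `y_test` would show whether the small provinces keep their share.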

### Systematic Sampling

Generating the random seed, which is used below as the sampling interval.

In [24]:
import numpy as np

In [25]:
semente = np.random.choice(np.arange(1, 11), 1)  # values from 1 to 10; choice(10, 1) would draw from 0 to 9, and a step of 0 breaks np.arange

In [26]:
semente

array([6])

Generating indices from the seed.

In [27]:
indices = np.arange(0, 100, semente[0])  # pass the scalar step, not the length-1 array

In [28]:
indices  # the seed was 6, so the indices advance in steps of 6; the spacing depends on the seed value

array([ 0,  6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96])

Building the sample from the indices.

In [32]:
amostra = df.loc[indices,:]

Inspecting the sample data.

In [30]:
amostra  # showing the rows at the selected indices

Unnamed: 0,case_id,provincial_case_id,age,sex,health_region,province,country,date_report,report_week,has_travel_history,locally_acquired,case_source
0,,1,50-59,Male,Toronto,Ontario,Canada,2020-01-25,2020-01-19,t,,(1) https://news.ontario.ca/mohltc/en/2020/01/...
6,,4,30-39,Female,Vancouver Coastal,BC,Canada,2020-02-06,2020-02-02,t,,https://news.gov.bc.ca/releases/2020HLTH0025-0...
12,,6,60-69,Male,Toronto,Ontario,Canada,2020-02-27,2020-02-23,f,Close Contact,(1) https://news.ontario.ca/mohltc/en/2020/02/...
18,,11,60-69,Male,Durham,Ontario,Canada,2020-02-29,2020-02-23,f,Close Contact,https://news.ontario.ca/mohltc/en/2020/02/onta...
24,,16,60-69,Female,York,Ontario,Canada,2020-03-03,2020-03-01,t,,https://toronto.ctvnews.ca/three-new-cases-of-...
30,,10,60-69,Male,Vancouver Coastal,BC,Canada,2020-03-03,2020-03-01,t,,https://news.gov.bc.ca/releases/2020HLTH0058-0...
36,,22,60-69,Male,Toronto,Ontario,Canada,2020-03-05,2020-03-01,t,,https://news.ontario.ca/mohltc/en/2020/03/onta...
42,,18,50-59,Female,Vancouver Coastal,BC,Canada,2020-03-05,2020-03-01,t,,https://news.gov.bc.ca/releases/2020HLTH0062-0...
48,,25,50-59,Male,Toronto,Ontario,Canada,2020-03-06,2020-03-01,t,,https://news.ontario.ca/mohltc/en/2020/03/onta...
54,,22,50-59,Male,Fraser,BC,Canada,2020-03-07,2020-03-01,t,,https://news.gov.bc.ca/releases/2020HLTH0064-0...


Counting the records.

In [31]:
amostra.info()  # 17 records

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 0 to 96
Data columns (total 12 columns):
case_id               0 non-null float64
provincial_case_id    17 non-null int64
age                   17 non-null object
sex                   17 non-null object
health_region         17 non-null object
province              17 non-null object
country               17 non-null object
date_report           17 non-null object
report_week           17 non-null object
has_travel_history    17 non-null object
locally_acquired      2 non-null object
case_source           17 non-null object
dtypes: float64(1), int64(1), object(10)
memory usage: 1.7+ KB
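
The sample above always starts at row 0 and only covers the first 100 rows. A sketch of a common variant that spans the whole table: the interval is derived from the desired sample size and the starting row is drawn at random inside the first interval. The DataFrame `df_demo`, the target of 50 rows and the seed 0 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 1000 rows.
df_demo = pd.DataFrame({"case": np.arange(1000)})

sample_size = 50
step = len(df_demo) // sample_size    # sampling interval k = N // n

rng = np.random.default_rng(0)        # seeded generator for repeatability
start = int(rng.integers(0, step))    # random start in [0, k)

# Take every k-th row, beginning at the random start.
systematic = df_demo.iloc[start::step]

print(step, len(systematic))
```

Using `iloc` selects by position, so the approach still works when the index is not a clean 0..N-1 range.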
