# Sampling Techniques
<br>

<img src="images/probability-sampling.png" width="50%">

### 1. Random Sampling Technique

In [2]:
import pandas as pd
import seaborn as sns

In [3]:
penguins = sns.load_dataset('penguins')
penguins = penguins.dropna(how="any")
penguins.head()

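|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |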


In [4]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB


In [5]:
df_sample = penguins.sample(n=10)

In [6]:
df_sample = penguins.sample(frac=0.10)  # as a fraction: 10% of the dataset

In [7]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33 entries, 175 to 207
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            33 non-null     object 
 1   island             33 non-null     object 
 2   bill_length_mm     33 non-null     float64
 3   bill_depth_mm      33 non-null     float64
 4   flipper_length_mm  33 non-null     float64
 5   body_mass_g        33 non-null     float64
 6   sex                33 non-null     object 
dtypes: float64(4), object(3)
memory usage: 2.1+ KB
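
Neither call to `sample` above fixes a seed, so each run of the notebook returns different rows. The sketch below shows one way to make the draw repeatable with the `random_state` parameter. The toy `df` is a hypothetical stand-in for `penguins`, used only so the example runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame.
df = pd.DataFrame({
    "species": ["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20,
    "body_mass_g": list(range(3000, 3100)),
})

# Fixing random_state makes the draw repeatable across runs.
sample_a = df.sample(frac=0.10, random_state=42)
sample_b = df.sample(frac=0.10, random_state=42)

print(len(sample_a))              # number of rows drawn (10% of 100)
print(sample_a.equals(sample_b))  # True when both draws picked the same rows
```

By default `sample` draws without replacement, so no row appears twice in the result.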


### 2. Stratified Sampling Technique

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
penguins['species'].value_counts()

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

In [10]:
X_train, X_test, y_train, y_test = train_test_split(penguins.drop('species', axis=1),
                                                    penguins['species'],
                                                    stratify=penguins['species'],
                                                    test_size=0.2)

In [11]:
y_test.shape

(67,)

In [12]:
y_test.value_counts()

Adelie       29
Gentoo       24
Chinstrap    14
Name: species, dtype: int64
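
The counts above show that `stratify` kept the class proportions nearly intact: Adelie is 146/333 ≈ 43.8% of the full data and 29/67 ≈ 43.3% of the test split. When only a stratified sample is needed, not a train/test split, a possible alternative is pandas' `groupby(...).sample(...)`. The sketch below uses a hypothetical toy `df` in place of `penguins` and compares the class shares before and after sampling.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame.
df = pd.DataFrame({
    "species": ["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20,
    "body_mass_g": list(range(3000, 3100)),
})

# Draw 20% from each species separately, so every stratum is represented
# in proportion to its size.
stratified = df.groupby("species").sample(frac=0.20, random_state=0)

# Compare class shares in the population and in the sample.
population_share = df["species"].value_counts(normalize=True).sort_index()
sample_share = stratified["species"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"population": population_share, "sample": sample_share}))
```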

### 3. Systematic Sampling Technique

In [13]:
import numpy as np

In [14]:
# Draw the sampling interval from 1..9; a step of 0 would make np.arange fail
step = np.random.choice(np.arange(1, 10), 1)

In [15]:
step

array([7])

In [16]:
index = np.arange(0, 100, step[0])

In [17]:
index

array([ 0,  7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98])

In [18]:
# Build the sample from the labels in index. .loc is label-based,
# so this raises a KeyError if a label was removed by dropna
sample = penguins.loc[index,:]

In [19]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 98
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            15 non-null     object 
 1   island             15 non-null     object 
 2   bill_length_mm     15 non-null     float64
 3   bill_depth_mm      15 non-null     float64
 4   flipper_length_mm  15 non-null     float64
 5   body_mass_g        15 non-null     float64
 6   sex                15 non-null     object 
dtypes: float64(4), object(3)
memory usage: 960.0+ bytes


In [20]:
sample

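|    | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|----|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0  | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 7  | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | Male |
| 14 | Adelie | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | Male |
| 21 | Adelie | Biscoe | 37.7 | 18.7 | 180.0 | 3600.0 | Male |
| 28 | Adelie | Biscoe | 37.9 | 18.6 | 172.0 | 3150.0 | Female |
| 35 | Adelie | Dream | 39.2 | 21.1 | 196.0 | 4150.0 | Male |
| 42 | Adelie | Dream | 36.0 | 18.5 | 186.0 | 3100.0 | Female |
| 49 | Adelie | Dream | 42.3 | 21.2 | 191.0 | 4150.0 | Male |
| 56 | Adelie | Biscoe | 39.0 | 17.5 | 186.0 | 3550.0 | Female |
| 63 | Adelie | Biscoe | 41.1 | 18.2 | 192.0 | 4050.0 | Male |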
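
The cells above always start at row 0, stop at label 100, and select by label with `.loc`. Because `dropna` left gaps in the labels (the `info()` output shows 333 entries spanning labels 0 to 343), evenly spaced labels are not evenly spaced rows. A minimal sketch of a position-based variant with a random start follows. The toy `df`, the interval `k = 7`, and the names `start` and `positions` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the index, mimicking rows removed by dropna.
df = pd.DataFrame({"value": np.arange(100)}).drop(index=[3, 8, 9])

rng = np.random.default_rng(0)
k = 7                             # sampling interval (assumed for this example)
start = int(rng.integers(0, k))   # random starting position in [0, k)

# Every k-th position from the random start, over the whole DataFrame.
positions = np.arange(start, len(df), k)

# iloc selects by position, so gaps in the index labels do not matter.
systematic = df.iloc[positions]
print(start, len(systematic))
```

Choosing the start at random within the first interval gives every row the same chance of selection, which a fixed start at row 0 does not.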


### 4. Cluster Sampling Technique

In [21]:
from sklearn.cluster import KMeans

In [22]:
features = penguins.columns[2:6]
features

Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], dtype='object')

In [23]:
# It is a good idea to rescale the features before running k-means
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
penguins_scaled = scaler.fit_transform(penguins[features])
penguins_scaled = pd.DataFrame(penguins_scaled, columns=features)
penguins_scaled.describe()

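|       | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|-------|----------------|---------------|-------------------|-------------|
| count | 333.0 | 333.0 | 333.0 | 333.0 |
| mean  | 0.432465 | 0.483912 | 0.490966 | 0.418627 |
| std   | 0.198861 | 0.234433 | 0.237555 | 0.223671 |
| min   | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 0.269091 | 0.297619 | 0.305085 | 0.236111 |
| 50%   | 0.450909 | 0.5 | 0.423729 | 0.375 |
| 75%   | 0.6 | 0.666667 | 0.694915 | 0.576389 |
| max   | 1.0 | 1.0 | 1.0 | 1.0 |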


In [24]:
kmeans = KMeans(7)

In [25]:
clus = kmeans.fit_predict(penguins_scaled)

In [26]:
penguins['cluster'] = clus
penguins['cluster'].value_counts()

0    88
3    63
4    48
1    37
6    34
5    32
2    31
Name: cluster, dtype: int64

In [27]:
# numeric_only=True skips the non-numeric 'sex' column, which newer pandas refuses to average
clusterDesc = penguins.iloc[:, 2:].groupby('cluster').mean(numeric_only=True).round(3)
clusterDesc.insert(0, 'size', penguins['cluster'].value_counts())

In [28]:
clusterDesc

| cluster | size | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---------|------|----------------|---------------|-------------------|-------------|
| 0 | 88 | 37.876 | 17.581 | 187.011 | 3398.58 |
| 1 | 37 | 47.67 | 15.197 | 217.135 | 5192.568 |
| 2 | 31 | 51.058 | 19.471 | 200.839 | 4004.032 |
| 3 | 63 | 40.424 | 19.262 | 194.095 | 4101.19 |
| 4 | 48 | 45.148 | 14.098 | 211.771 | 4616.146 |
| 5 | 32 | 47.7 | 17.722 | 192.5 | 3542.969 |
| 6 | 34 | 50.874 | 16.047 | 225.059 | 5655.882 |
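
The cells above assign each row to a cluster and describe the clusters, but they stop before drawing a sample. One way to finish the step, assuming one-stage cluster sampling (pick whole clusters at random and keep every row in them), is sketched below. The toy `df`, the choice of two clusters, and the names `chosen` and `cluster_sample` are assumptions for illustration. Classical cluster sampling usually relies on naturally occurring groups, so treating k-means labels as the clusters is a choice carried over from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 7 clusters of 10 rows each, mimicking the 'cluster' column.
df = pd.DataFrame({
    "cluster": np.repeat(np.arange(7), 10),
    "value": np.arange(70),
})

rng = np.random.default_rng(0)

# Stage 1: pick 2 of the 7 clusters at random, without replacement.
chosen = rng.choice(df["cluster"].unique(), size=2, replace=False)

# Stage 2 (one-stage design): keep every row that belongs to a chosen cluster.
cluster_sample = df[df["cluster"].isin(chosen)]
print(len(cluster_sample))
```

In a two-stage design, the last step would instead draw a random subset of rows within each chosen cluster.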


### 5. Judgmental or Purposive Sampling Technique