In [None]:
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, load_breast_cancer

# Sampling Algorithms

In [None]:
cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# In scikit-learn's encoding of this dataset, 0 is malignant and 1 is benign.
data["Target"] = pd.Series(cancer.target).map({0: "malignant", 1: "benign"})
data.head()

In [None]:
data["Target"].value_counts()

In [None]:
data["Target"].value_counts(normalize=True)

* **Sampling** is a *process used in statistical analysis in which a predetermined number of observations are taken from a larger population.*

---

## 1. Simple Random Sampling
* **Simple random sampling** is the *basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population).* Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection. 

![](https://research-methodology.net/wp-content/uploads/2015/04/Simple-random-sampling2.png)

In [None]:
data.sample(n=5)

In [None]:
data.sample(frac=0.12)
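
The draws above change on every run. As a minimal sketch, assuming a toy `population` DataFrame (illustrative data, not the breast cancer set), passing `random_state` to `DataFrame.sample` makes a simple random sample repeatable:

```python
import numpy as np
import pandas as pd

# Toy population of 100 rows (hypothetical data used only for illustration).
population = pd.DataFrame({"id": np.arange(100)})

# Fixing random_state makes the draw repeatable across runs.
sample_a = population.sample(n=10, random_state=42)
sample_b = population.sample(n=10, random_state=42)

# Sampling is without replacement by default, so no row appears twice.
print(sample_a["id"].is_unique)
# The same seed yields the same rows.
print(sample_a.equals(sample_b))
```

Leaving `random_state` unset keeps the draw different on each run, which is usually what you want outside of demonstrations and tests.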

## 2. Stratified Sampling

---

* **Stratified random sampling** is a method of sampling that *involves dividing a population into smaller subgroups* known as **strata**. The strata are formed based on members' shared attributes or characteristics, such as income or educational attainment.

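* **Stratified random sampling** is also called *proportional random sampling*. It is sometimes loosely called *quota random sampling*, although quota sampling in the strict sense is a non-probability method.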

<img src="https://www.qualtrics.com/m/assets/wp-content/uploads/2021/08/Screen-Shot-2021-08-31-at-10.17.31-AM.png" alt="Drawing" style="width: 500px;"/>


In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns=["Target"]), 
                                                    data["Target"],
                                                    stratify=data["Target"],
                                                    test_size=0.2)

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)
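
Stratification is not limited to train/test splits. The sketch below, which assumes a hypothetical `toy` DataFrame with an 80/20 class split, draws the same fraction from every stratum with `groupby(...).sample`, so the sample keeps the population's class proportions:

```python
import pandas as pd

# Hypothetical toy data: 80 rows of class "a" and 20 rows of class "b".
toy = pd.DataFrame({
    "value": range(100),
    "label": ["a"] * 80 + ["b"] * 20,
})

# Draw 25% of the rows from each stratum separately.
stratified = toy.groupby("label").sample(frac=0.25, random_state=0)

# The class proportions in the sample match those of the full data.
print(stratified["label"].value_counts(normalize=True))
```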

## 3. Systematic Sampling

Systematic sampling is a probability sampling approach in which elements of the target population are selected at a fixed sampling interval, beginning from a random starting point.

We calculate the sampling interval by dividing the entire population size by the desired sample size.
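
As a rough sketch of that calculation, the hypothetical helper `systematic_indices` below (its name and `seed` parameter are assumptions made for illustration) derives the interval from the population and sample sizes, picks a random start inside the first interval, and returns the row positions to select:

```python
import numpy as np

def systematic_indices(population_size, sample_size, seed=None):
    """Return row positions for a systematic sample with a random start."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    # Sampling interval = population size divided by desired sample size.
    interval = population_size // sample_size
    rng = np.random.default_rng(seed)
    # Random starting point within the first interval.
    start = rng.integers(0, interval)
    return start + interval * np.arange(sample_size)

# Example: 50 rows out of a population of 569 gives an interval of 11.
idx = systematic_indices(569, 50, seed=0)
print(len(idx), np.diff(idx)[0])
```

The returned positions can be passed to `DataFrame.iloc` to pull the sampled rows.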

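Note that systematic sampling usually produces a random sample, but <b>it does not address bias in the resulting sample</b>. For example, if the rows follow an ordering or periodic pattern that lines up with the sampling interval, the sample can be skewed.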

In [None]:
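def systematic_sampling(df, step):
    # Pick a random starting row within the first interval, then take every `step`-th row.
    start = np.random.randint(0, step)
    indexes = np.arange(start, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample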

In [None]:
systematic_sampling(data, 5)

## 4. Cluster Sampling

Cluster sampling is a probability sampling technique where we divide the population into multiple clusters (groups) based on certain clustering criteria. We then randomly select one or more clusters using simple random or systematic sampling.

Basic idea:
* Fit K-Means to assign each observation to a cluster.
* Sample an <strong>equal number of observations</strong> from each cluster.
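
The two steps above can be sketched as follows. The choice of three clusters, the feature scaling, and the cap of 20 rows per cluster are assumptions made for illustration, not values taken from the original notebook:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Load the features as a DataFrame.
features = load_breast_cancer(as_frame=True).data

# Scale the features so no single column dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# Step 1: fit K-Means and attach a cluster label to every row (k=3 is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clustered = features.assign(cluster=kmeans.fit_predict(scaled))

# Step 2: draw the same number of rows from each cluster,
# capped by the size of the smallest cluster.
per_cluster = min(20, clustered["cluster"].value_counts().min())
sample = clustered.groupby("cluster").sample(n=per_cluster, random_state=0)

print(sample["cluster"].value_counts())
```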



## Random Undersampling and Oversampling

---

![](https://miro.medium.com/max/700/0*u6pKLqdCDsG_5kXa.png)

* A widely adopted technique for dealing with highly imbalanced datasets is called resampling. It consists of *removing samples from the majority class* (**under-sampling**) and/or *adding more examples from the minority class* (**over-sampling**).

In [None]:
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=100, random_state=10
)
X = pd.DataFrame(X)
X['Target'] = y

We can now do random oversampling and undersampling using:

In [None]:
num_0 = len(X[X['Target']==0])
num_1 = len(X[X['Target']==1])

# random undersample: keep only num_1 majority rows, drawn without replacement
undersampled_data = pd.concat([X[X['Target']==0].sample(num_1), X[X['Target']==1]])
print(len(undersampled_data))

In [None]:
# random oversample: draw minority rows with replacement until they match the majority count
oversampled_data = pd.concat([X[X['Target']==0], X[X['Target']==1].sample(num_0, replace=True)])
print(len(oversampled_data))

In [None]:
data["Target"].value_counts()

In [None]:
# Oversample the minority class by appending a random 30% of the malignant rows as duplicates.
new_data = pd.concat([data[data["Target"] == "malignant"].sample(frac=0.3), data])
new_data["Target"].value_counts()