# DATA SAMPLING
In supervised machine learning, we want our model to take independent variables as input and provide dependent variables as output.
To check the effectiveness of our model, we need to test it on a set of unseen sample data.
We can do this by splitting a labelled dataset into a **training** and a **testing** dataset through sampling.
Sampling means selecting a portion of a labelled dataset as a subset of the whole.
The original dataset is called the **population** and the subset is called the **sample**.

Data sampling comes in the following types:
- **Sampling without replacement:** Every time we pick an instance from the population into the sample dataset, we remove that instance from the population so that it cannot be selected into the sample dataset again.
- **Sampling with replacement:** We pick an instance from the population without removing it from the population. In this way, the same element can be selected multiple times in the sample. This is generally done when we have very little data. This technique is called **bootstrapping** and allows us to evaluate our model on a small dataset.
- **Stratified random sampling:** In this mode of sampling, we group the instances based on a single feature and then select instances from each group into the sample. We choose the samples in such a way that the distribution of values for that feature in the sample remains the same as its distribution in the population.
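
The difference between drawing with and without replacement can be sketched with pandas' `DataFrame.sample`. The toy `population` frame below is hypothetical and only serves as an illustration; it is not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy population of 10 labelled instances
population = pd.DataFrame({"id": range(10), "label": [0, 1] * 5})

# Without replacement: each instance can appear at most once in the sample
without_repl = population.sample(n=5, replace=False, random_state=0)

# With replacement (bootstrapping): the same instance may be drawn repeatedly,
# so the sample can be as large as the population itself
with_repl = population.sample(n=10, replace=True, random_state=0)

print(without_repl["id"].is_unique)
print(with_repl["id"].nunique(), "distinct instances out of", len(with_repl))
```

A bootstrap sample of the same size as the population will usually contain repeated instances, which is why the count of distinct instances is typically smaller than the sample size.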

In [9]:
import pandas as pd

In [82]:
titanic = pd.read_csv("data/titanic.csv")
titanic

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [83]:
# Let us treat survival as the dependent variable (the target, conventionally called y)
dependent_variable = "Survived"
y = titanic[[dependent_variable]]
# y.head()

In [84]:
# Now we will create a list of independent variables out of the column names of the dataset and remove the "dependent_variable" from it
independent_variable = list(titanic.columns)
independent_variable.remove(dependent_variable)
independent_variable

['PassengerId',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [85]:
# Now we will select all the independent variables as the feature set (conventionally called x)
x = titanic[independent_variable]
# x.head()
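
As an alternative to building the list of column names by hand, the same separation into features and target can be sketched with `DataFrame.drop`. The miniature `toy` frame and the names `features` and `labels` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset with a similar layout
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Survived": [0, 1, 1, 1],
})

target = "Survived"
features = toy.drop(columns=[target])  # every column except the target
labels = toy[[target]]                 # double brackets keep a one-column DataFrame

print(list(features.columns))
print(labels.shape)
```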

**SIMPLE RANDOM SAMPLING**

In [98]:
# The train_test_split function from the model_selection subpackage of the sklearn library allows us to perform simple random sampling on our dataset
from sklearn.model_selection import train_test_split

In [99]:
x_train, x_test, y_train, y_test = train_test_split(x, y)
# x_train contains the independent variables of the training set
# y_train contains the dependent variables of the training set
# x_test contains the independent variables of the testing set
# y_test contains the dependent variables of the testing set

We can get information about our testing and training datasets.
By default, train_test_split allocates **25% of the dataset to the testing** dataset and the remaining **75% to the training** dataset.

In [100]:
x_train.shape
# The independent variables of our training dataset contain 668 rows and 11 columns

(668, 11)

In [101]:
x_test.shape
# The independent variables of our testing dataset contain 223 rows and 11 columns

(223, 11)

In [102]:
y_train.shape
# The dependent variable of our training dataset contains 668 rows and 1 column

(668, 1)

In [103]:
y_test.shape
# The dependent variable of our testing dataset contains 223 rows and 1 column

(223, 1)

In [104]:
# We can change the default allocation by passing the "train_size" or "test_size" parameter

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
# This allocates 50% of the total dataset to the testing dataset and the remaining 50% to the training dataset
x_train.shape

(445, 11)
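
The shapes above can be reproduced on a stand-in array of the same length. This is a minimal sketch: `rows` is a hypothetical placeholder for the 891-row dataset, and the rounding behaviour it shows (the test count is rounded up when the fraction does not divide evenly) is consistent with the shapes printed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a dataset with 891 rows
rows = np.arange(891)

# Default split: 25% testing, 75% training
default_train, default_test = train_test_split(rows, random_state=0)

# Explicit test_size: half of the rows go to the testing set
half_train, half_test = train_test_split(rows, test_size=0.5, random_state=0)

# Explicit train_size: the testing set receives whatever is left over
big_train, big_test = train_test_split(rows, train_size=0.8, random_state=0)

print(len(default_train), len(default_test))
print(len(half_train), len(half_test))
print(len(big_train), len(big_test))
```

Because 891 is odd, a 50% split cannot be exact, which is why the training set above holds 445 rows rather than 446.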

**STRATIFIED RANDOM SAMPLING**

In [105]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=999)
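
The effect of the random_state argument used in the split above can be checked with a short sketch. The `rows` array is hypothetical; the point is only that repeating a split with the same seed returns the same instances.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # hypothetical data

# Two splits with the same seed
_, test_a = train_test_split(rows, test_size=0.1, random_state=999)
_, test_b = train_test_split(rows, test_size=0.1, random_state=999)

# A split with a different seed
_, test_c = train_test_split(rows, test_size=0.1, random_state=1)

print(np.array_equal(test_a, test_b))  # same seed, so the same instances are selected
print(np.array_equal(test_a, test_c))  # a different seed will usually select different instances
```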

In [107]:
x["Sex"].value_counts(normalize=True)

male      0.647587
female    0.352413
Name: Sex, dtype: float64

In [108]:
x_test["Sex"].value_counts(normalize=True)

male      0.788889
female    0.211111
Name: Sex, dtype: float64

In [112]:
# Now we will split the data using stratified random sampling and stratifying via the feature "Sex"
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=999, stratify=x["Sex"])
# The random_state parameter seeds the random shuffling so that the same instances are selected into the training and testing datasets on every execution
# Leaving it as None (the default) generally gives a different split on each execution
x_test["Sex"].value_counts(normalize=True)
# From the output, we see that the stratified random sampling most closely mimics the original data's feature distribution

male      0.644444
female    0.355556
Name: Sex, dtype: float64
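
The same comparison can be sketched on a small synthetic dataset where the group proportions are known in advance. The `toy` frame and its `group` column are hypothetical and only illustrate the behaviour of the `stratify` argument.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 70% of rows in group "a", 30% in group "b"
toy = pd.DataFrame({
    "group": ["a"] * 70 + ["b"] * 30,
    "value": range(100),
})

# Plain random split: the test-set proportions can drift away from 70/30
_, plain_test = train_test_split(toy, test_size=0.2, random_state=999)

# Stratified split: each group keeps its share in both subsets
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=999, stratify=toy["group"]
)

print(plain_test["group"].value_counts(normalize=True))
print(strat_test["group"].value_counts(normalize=True))
```

With stratification, the 20-row test set holds 14 rows from group "a" and 6 from group "b", matching the 70/30 proportions of the full dataset.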