# How to Sample Data in Python

## Learning Objectives
To get an unbiased assessment of a supervised machine learning model's performance, we need to evaluate it on data that it did not encounter during training. To accomplish this, we split the data into a training subset and a test subset before building the model. One common way to do so is to create non-overlapping subsets of the original data using one of several **sampling** approaches. By the end of this tutorial, you will have learned:

+ how to split data using simple random sampling
+ how to split data using stratified random sampling
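
The idea of splitting data into non-overlapping subsets by simple random sampling can be sketched with the standard library alone. This is a minimal illustration of the concept, not how any particular library implements it; the function name `simple_random_split` and the toy row count are hypothetical.

```python
import math
import random


def simple_random_split(n_rows, test_fraction, seed):
    """Return (train_idx, test_idx): disjoint, randomly chosen row indices."""
    indices = list(range(n_rows))
    # Shuffle the row indices in place with a seeded generator for repeatability
    random.Random(seed).shuffle(indices)
    # Assumption: round the test size up so the test set is never empty
    n_test = math.ceil(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = simple_random_split(n_rows=20, test_fraction=0.25, seed=42)
print(len(train_idx), len(test_idx))   # 15 5
print(set(train_idx) & set(test_idx))  # set(), so no row appears in both subsets
```

Every row lands in exactly one subset, which is what keeps the test data unseen during training.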

In [10]:
import pandas as pd

# Load the vehicles dataset and preview the first five rows
vehicles = pd.read_csv("vehicles.csv")
vehicles.head()

| | citympg | cylinders | displacement | drive | highwaympg | make | model | class | year | transmissiontype | transmissionspeeds | co2emissions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 6 | 4.1 | 2-Wheel Drive | 19.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 | 555.4375 |
| 1 | 14.0 | 8 | 5.0 | 2-Wheel Drive | 20.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 | 555.4375 |
| 2 | 18.0 | 8 | 5.7 | 2-Wheel Drive | 26.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 | 484.761905 |
| 3 | 21.0 | 6 | 4.3 | Rear-Wheel Drive | 31.0 | Cadillac | Fleetwood/DeVille (FWD) | Large Cars | 1984 | Automatic | 4 | 424.166667 |
| 4 | 14.0 | 8 | 4.1 | Rear-Wheel Drive | 19.0 | Cadillac | Brougham/DeVille (RWD) | Large Cars | 1984 | Automatic | 4 | 555.4375 |


In [11]:
# Separate the dependent variable (y) and independent variables (X)
y = vehicles["co2emissions"]
X = vehicles.drop(columns=["co2emissions"])

In [12]:
# Preview the independent variables
X.head()

| | citympg | cylinders | displacement | drive | highwaympg | make | model | class | year | transmissiontype | transmissionspeeds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 6 | 4.1 | 2-Wheel Drive | 19.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 |
| 1 | 14.0 | 8 | 5.0 | 2-Wheel Drive | 20.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 |
| 2 | 18.0 | 8 | 5.7 | 2-Wheel Drive | 26.0 | Buick | Electra/Park Avenue | Large Cars | 1984 | Automatic | 4 |
| 3 | 21.0 | 6 | 4.3 | Rear-Wheel Drive | 31.0 | Cadillac | Fleetwood/DeVille (FWD) | Large Cars | 1984 | Automatic | 4 |
| 4 | 14.0 | 8 | 4.1 | Rear-Wheel Drive | 19.0 | Cadillac | Brougham/DeVille (RWD) | Large Cars | 1984 | Automatic | 4 |


In [13]:
# Preview the dependent variable
y.head()

0    555.437500
1    555.437500
2    484.761905
3    424.166667
4    555.437500
Name: co2emissions, dtype: float64

## How to split data using Simple Random Sampling

In [14]:
from sklearn.model_selection import train_test_split

In [9]:
# Split the data into training and testing sets using simple random sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [15]:
# Confirm the number of rows and columns in each subset
print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Testing set: X_test={X_test.shape}, y_test={y_test.shape}")

Training set: X_train=(27734, 11), y_train=(27734,)
Testing set: X_test=(9245, 11), y_test=(9245,)
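
The shapes above follow from the total row count and `test_size=0.25`, and the arithmetic can be checked by hand. The sketch below assumes the test size is rounded up to a whole row and the remaining rows go to training, which is consistent with the printed shapes.

```python
import math

# Row counts taken from the shapes printed above
n_train, n_test = 27734, 9245
n_total = n_train + n_test  # rows in the full dataset

# Assumption: the test size is rounded up and the rest goes to training
expected_test = math.ceil(0.25 * n_total)
expected_train = n_total - expected_test

print(n_total, expected_test, expected_train)  # 36979 9245 27734
```

Because 36,979 is not divisible by 4, the test set holds slightly more than 25% of the rows.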


## How to split data using Stratified Random Sampling

In [16]:
# Stratified random sampling based on the 'drive' column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=X["drive"])
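
Conceptually, stratified sampling splits each group separately so that every group contributes the same fraction of its rows to the test set. The following is a minimal standard-library sketch of that idea only; it is not scikit-learn's implementation, and the function `stratified_split` and the toy `drive` labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Split row indices so each label keeps roughly the same share in both subsets."""
    rng = random.Random(seed)
    # Group row indices by their label (the stratum)
    strata = defaultdict(list)
    for idx, label in enumerate(labels):
        strata[label].append(idx)

    train_idx, test_idx = [], []
    for members in strata.values():
        rng.shuffle(members)
        # Assumption: round each stratum's test count to the nearest whole row
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# Toy data: 80 front-wheel, 16 rear-wheel, and 4 all-wheel drive vehicles
labels = ["Front-Wheel Drive"] * 80 + ["Rear-Wheel Drive"] * 16 + ["All-Wheel Drive"] * 4
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=42)

print(Counter(labels[i] for i in test_idx))
```

With these toy labels the test set receives 20, 4, and 1 rows from the three groups, so even the smallest group is represented in the same 80/16/4 ratio as the full data.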

In [17]:
# Original distribution of the 'drive' column
original_distribution = X["drive"].value_counts(normalize=True)

# Distribution in the test set
test_distribution = X_test["drive"].value_counts(normalize=True)

print("Original Distribution:\n", original_distribution)
print("\nTest Set Distribution:\n", test_distribution)

Original Distribution:
 drive
Rear-Wheel Drive     0.356797
Front-Wheel Drive    0.353552
All-Wheel Drive      0.239893
4-Wheel Drive        0.036480
2-Wheel Drive        0.013278
Name: proportion, dtype: float64

Test Set Distribution:
 drive
Rear-Wheel Drive     0.356842
Front-Wheel Drive    0.353488
All-Wheel Drive      0.239913
4-Wheel Drive        0.036452
2-Wheel Drive        0.013304
Name: proportion, dtype: float64
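
To quantify how closely the two distributions agree, the printed proportions can be compared directly. The sketch below copies the values from the output above into plain dictionaries; the variable names are illustrative.

```python
# Proportions copied from the output above
original = {
    "Rear-Wheel Drive": 0.356797,
    "Front-Wheel Drive": 0.353552,
    "All-Wheel Drive": 0.239893,
    "4-Wheel Drive": 0.036480,
    "2-Wheel Drive": 0.013278,
}
test = {
    "Rear-Wheel Drive": 0.356842,
    "Front-Wheel Drive": 0.353488,
    "All-Wheel Drive": 0.239913,
    "4-Wheel Drive": 0.036452,
    "2-Wheel Drive": 0.013304,
}

# Absolute difference in proportion for each drive type
differences = {drive: abs(test[drive] - original[drive]) for drive in original}
largest = max(differences, key=differences.get)

print(f"Largest gap: {largest} ({differences[largest]:.6f})")
```

The largest gap is under 0.0001 (Front-Wheel Drive), which indicates that the stratified test set mirrors the original `drive` distribution almost exactly.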
