# How to Sample Data in Python

## Learning Objectives
To get an unbiased assessment of a supervised machine learning model's performance, we need to evaluate it on data it did not see during training. To do this, we split the data into a training subset and a test subset before building the model. A common way to do so is to create non-overlapping subsets of the original data using one of several **sampling** approaches. By the end of this tutorial, you will have learned:

+ how to split data using simple random sampling
+ how to split data using stratified random sampling
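
Before turning to a library, it may help to see what a simple random split does under the hood. The following is a minimal sketch using only the standard library; the helper name `simple_random_split` and its parameters are hypothetical and are not part of pandas or scikit-learn.

```python
import random

def simple_random_split(rows, test_fraction=0.25, seed=None):
    """Shuffle row positions and cut them into two non-overlapping subsets."""
    rng = random.Random(seed)              # seeded generator for repeatable splits
    indices = list(range(len(rows)))
    rng.shuffle(indices)                   # every row is equally likely to land in test
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(100))                    # stand-in for 100 data rows
train, test = simple_random_split(rows, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

Because the two subsets are built from disjoint sets of positions, no row can appear in both the training and the test data.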

In [None]:
!wget https://raw.githubusercontent.com/Davron030901/Pandas/main/data/vehicles.csv

--2024-12-18 19:40:25--  https://raw.githubusercontent.com/Davron030901/Pandas/main/data/vehicles.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3243664 (3.1M) [text/plain]
Saving to: ‘vehicles.csv’


2024-12-18 19:40:25 (57.9 MB/s) - ‘vehicles.csv’ saved [3243664/3243664]



In [None]:
import pandas as pd
vehicles = pd.read_csv("vehicles.csv")
vehicles

,citympg,cylinders,displacement,drive,highwaympg,make,model,class,year,transmissiontype,transmissionspeeds,co2emissions
0,14.0,6,4.1,2-Wheel Drive,19.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,555.437500
1,14.0,8,5.0,2-Wheel Drive,20.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,555.437500
2,18.0,8,5.7,2-Wheel Drive,26.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,484.761905
3,21.0,6,4.3,Rear-Wheel Drive,31.0,Cadillac,Fleetwood/DeVille (FWD),Large Cars,1984,Automatic,4,424.166667
4,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4,555.437500
...,...,...,...,...,...,...,...,...,...,...,...,...
36974,17.0,8,4.7,Rear-Wheel Drive,25.0,Mercedes-Benz,SL550,Two Seaters,2018,Automatic,9,442.000000
36975,16.0,8,6.2,Rear-Wheel Drive,25.0,Chevrolet,Corvette,Two Seaters,2018,Manual,7,466.000000
36976,15.0,8,6.2,Rear-Wheel Drive,22.0,Chevrolet,Corvette,Two Seaters,2018,Manual,7,503.000000
36977,12.0,12,6.5,Rear-Wheel Drive,16.0,Ferrari,812 Superfast,Two Seaters,2018,Automatic,7,661.000000


In [None]:
response = 'co2emissions'  # the column we want to predict
y = vehicles[[response]]   # double brackets keep y as a one-column DataFrame
y

,co2emissions
0,555.437500
1,555.437500
2,484.761905
3,424.166667
4,555.437500
...,...
36974,442.000000
36975,466.000000
36976,503.000000
36977,661.000000


In [None]:
predictors = list(vehicles.columns)  # start from every column name
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds',
 'co2emissions']

In [None]:
predictors.remove(response)  # remove the response; what remains are the predictor columns
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds']

In [None]:
X = vehicles[predictors]  # predictor columns only
X

,citympg,cylinders,displacement,drive,highwaympg,make,model,class,year,transmissiontype,transmissionspeeds
0,14.0,6,4.1,2-Wheel Drive,19.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
1,14.0,8,5.0,2-Wheel Drive,20.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
2,18.0,8,5.7,2-Wheel Drive,26.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
3,21.0,6,4.3,Rear-Wheel Drive,31.0,Cadillac,Fleetwood/DeVille (FWD),Large Cars,1984,Automatic,4
4,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4
...,...,...,...,...,...,...,...,...,...,...,...
36974,17.0,8,4.7,Rear-Wheel Drive,25.0,Mercedes-Benz,SL550,Two Seaters,2018,Automatic,9
36975,16.0,8,6.2,Rear-Wheel Drive,25.0,Chevrolet,Corvette,Two Seaters,2018,Manual,7
36976,15.0,8,6.2,Rear-Wheel Drive,22.0,Chevrolet,Corvette,Two Seaters,2018,Manual,7
36977,12.0,12,6.5,Rear-Wheel Drive,16.0,Ferrari,812 Superfast,Two Seaters,2018,Automatic,7
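
The same separation of predictors and response can also be written with `DataFrame.drop`, which avoids building the column list by hand. The sketch below uses a tiny made-up table in place of `vehicles.csv`; the names `toy`, `X_toy`, and `y_toy` and the values in the table are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up table standing in for vehicles.csv (values are illustrative only).
toy = pd.DataFrame({
    "citympg": [14.0, 21.0, 17.0],
    "drive": ["2-Wheel Drive", "Rear-Wheel Drive", "Rear-Wheel Drive"],
    "co2emissions": [555.4, 424.2, 442.0],
})

response = "co2emissions"
y_toy = toy[[response]]                # one-column DataFrame holding the response
X_toy = toy.drop(columns=[response])   # every column except the response

print(list(X_toy.columns))
print(y_toy.shape)
```

`drop` returns a new DataFrame and leaves `toy` unchanged, so the cell can be re-run without side effects.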


## How to split data using Simple Random Sampling

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
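# Default split: 25% of the rows go to the test set; no random_state, so the rows chosen vary per run
X_train, X_test, y_train, y_test = train_test_split(X, y)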

In [None]:
X_train.shape

(27734, 11)

In [None]:
y_train.shape

(27734, 1)

In [None]:
X_test.shape

(9245, 11)

In [None]:
y_test.shape

(9245, 1)

In [None]:
# Hold out 40% of the rows for testing instead of the default 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test.shape

(14792, 11)
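
The test-set sizes above follow from simple arithmetic. The sketch below assumes that a fractional `test_size` is multiplied by the number of rows and rounded up, with the remainder going to training; that assumption reproduces the shapes shown in this section. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size=0.25):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

n_rows = 27734 + 9245            # total rows, taken from the shapes shown above
print(split_sizes(n_rows))       # default 25% test split
print(split_sizes(n_rows, 0.4))  # 40% test split
```

With 36,979 rows, a 25% test split gives 9,245 test rows and a 40% test split gives 14,792, matching the `X_test.shape` outputs.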

## How to split data using Stratified Random Sampling

In [None]:
# Baseline for comparison: a simple random split with a small (1%) test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.01,
                                                    random_state=1234)

In [None]:
# Share of each drive type in the full dataset
X['drive'].value_counts(normalize=True)

drive,proportion
Rear-Wheel Drive,0.356797
Front-Wheel Drive,0.353552
All-Wheel Drive,0.239893
4-Wheel Drive,0.03648
2-Wheel Drive,0.013278


In [None]:
# Share of each drive type in the unstratified test set
X_test['drive'].value_counts(normalize=True)

drive,proportion
Front-Wheel Drive,0.364865
Rear-Wheel Drive,0.332432
All-Wheel Drive,0.248649
4-Wheel Drive,0.035135
2-Wheel Drive,0.018919


In [None]:
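# stratify keeps each drive type's share in the train and test sets close to its share in the full data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.01,
                                                    random_state=1234,
                                                    stratify=X['drive'])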

In [None]:
X_test['drive'].value_counts(normalize=True)

drive,proportion
Rear-Wheel Drive,0.356757
Front-Wheel Drive,0.354054
All-Wheel Drive,0.240541
4-Wheel Drive,0.035135
2-Wheel Drive,0.013514
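
To make the idea behind stratification concrete, here is a minimal standard-library sketch that samples the same fraction from each group separately. It illustrates the principle only and is not how scikit-learn implements `stratify`; the helper `stratified_split` and the made-up labels are assumptions for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=None):
    """Split row positions so each label keeps roughly the same share in test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):     # group row positions by label
        by_label[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_label.values():
        rng.shuffle(idx)                   # simple random sample within each group
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Made-up labels: 60% "rear", 30% "front", 10% "all"
labels = ["rear"] * 600 + ["front"] * 300 + ["all"] * 100
train_idx, test_idx = stratified_split(labels, test_fraction=0.1, seed=1234)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each group contributes 10% of its own rows, so the test set keeps the 60/30/10 mix of the full data regardless of the seed.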
