## Train Test Frameworks

This exercise practices the syntax of the scikit-learn functions and classes that split data into train and test sets. The goal is to become familiar with these splitting methods before tackling the more complex activities at the end of the day.

In [1]:
# import numpy
import numpy as np

In [2]:
X = np.random.normal(0,1,20).reshape(10,2)
y = np.random.normal(0,1,10)

* print X

In [3]:
X

array([[ 1.3877933 ,  1.60384871],
       [ 0.60139115,  0.20210125],
       [-0.88720544,  0.0981454 ],
       [ 1.03326459,  1.15454856],
       [-0.40488078, -0.54574946],
       [-1.43722709,  0.57815935],
       [-0.99470768,  0.1400356 ],
       [ 1.31046234,  1.38165797],
       [ 1.07974144,  0.47131451],
       [-0.49312581, -0.22751545]])

* print y

In [4]:
y

array([-1.26942355,  1.31116259, -0.98547213,  1.06533484, -0.47688888,
        0.8126839 , -1.15373361, -0.13222034, -1.99881119,  0.93899591])

_____________________________
### Holdout split

* import the **train_test_split** function from sklearn

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split

* split the data into a train set and a test set, using a 70:30 or an 80:20 ratio.

In [6]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate the data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Option 1: 70:30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Option 2: 80:20 split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

print("Training features:\n", X_train)
print("Testing features:\n", X_test)
print("Training labels:\n", y_train)
print("Testing labels:\n", y_test)

Training features:
 [[-0.01782158  0.32332653]
 [-0.29224817  1.51231378]
 [-2.80072587 -1.1873131 ]
 [ 0.82150614 -1.12522936]
 [ 1.47478955  0.12038937]
 [ 0.38831536  0.26161022]
 [ 1.37283032 -0.33877649]]
Testing features:
 [[-0.66424461 -1.68533485]
 [-0.28539027 -1.25501525]
 [ 1.09339799 -0.20501463]]
Training labels:
 [ 1.19361929  0.67038402  0.28686753 -0.12074403 -0.53543687 -0.77438016
 -1.41391759]
Testing labels:
 [0.58888824 1.19849894 1.03566103]


* print X (the full array) and compare its row order with the X_train printed above

In [7]:
X

array([[-0.01782158,  0.32332653],
       [-0.28539027, -1.25501525],
       [-2.80072587, -1.1873131 ],
       [ 0.38831536,  0.26161022],
       [ 1.47478955,  0.12038937],
       [ 1.09339799, -0.20501463],
       [ 1.37283032, -0.33877649],
       [-0.29224817,  1.51231378],
       [-0.66424461, -1.68533485],
       [ 0.82150614, -1.12522936]])

* split the data again but now with the parameter shuffle = False

In [8]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate the data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Option 1: 70:30 split without shuffling
# (random_state is omitted: it has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, shuffle=False)

# Option 2: 80:20 split without shuffling
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

print("Training features:\n", X_train)
print("Testing features:\n", X_test)
print("Training labels:\n", y_train)
print("Testing labels:\n", y_test)

Training features:
 [[-1.06906527 -2.03954461]
 [ 0.59095945 -1.37429794]
 [ 0.75856288 -1.80475041]
 [ 0.58207566  0.29801092]
 [ 0.34396212  0.84798659]
 [-0.02543102  0.49573413]
 [ 0.97797957  0.02245508]]
Testing features:
 [[ 0.1515946   1.7851486 ]
 [-0.48779753 -0.90112864]
 [ 1.73852486  1.70896164]]
Training labels:
 [-0.63639292  0.32383032  0.28187436 -0.43502837  0.46266739 -0.87881314
  1.48706418]
Testing labels:
 [-1.60129494 -1.12882591 -0.64952823]


* print X_train

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate the data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Split the data without shuffling (70:30 split)
# (random_state is omitted: it has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, shuffle=False)

# Print the training features
print("Training features:\n", X_train)

Training features:
 [[-1.87752922  0.18259681]
 [-1.78290684  0.1338984 ]
 [-0.1159386  -0.7058418 ]
 [ 0.40144568  0.4858517 ]
 [ 0.90469406  0.12726077]
 [-0.51398259  0.28052366]
 [-1.18347583 -1.04218992]]


* print the shape of X_train and X_test

In [10]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate the data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Split the data without shuffling (70:30 split)
# (random_state is omitted: it has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, shuffle=False)

# Print the shapes of the training and testing feature sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)

Shape of X_train: (7, 2)
Shape of X_test: (3, 2)
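
The effect of `shuffle` is easier to see when an array of row positions is split alongside `X`. The sketch below is illustrative only: the seeded generator, the seed value, and the names `idx`, `idx_tr_s`, `idx_tr_o`, and so on are assumptions made for this example and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator so the example is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, 20).reshape(10, 2)
idx = np.arange(len(X))  # row positions, split together with X

# Default behaviour: rows are shuffled before splitting
X_tr_s, X_te_s, idx_tr_s, idx_te_s = train_test_split(
    X, idx, test_size=0.30, random_state=42
)

# shuffle=False: the first 7 rows go to train, the last 3 rows to test
X_tr_o, X_te_o, idx_tr_o, idx_te_o = train_test_split(
    X, idx, test_size=0.30, shuffle=False
)

print("Shuffled train rows:", idx_tr_s)
print("Shuffled test rows: ", idx_te_s)
print("Ordered train rows: ", idx_tr_o)
print("Ordered test rows:  ", idx_te_o)
```

With `shuffle=False` the split is a plain slice of the data, which is usually what you want for time-ordered observations; for independent observations the shuffled default is the safer choice.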


_________________________________
### K-fold split 

* import the **KFold** class from sklearn

In [11]:
from sklearn.model_selection import KFold

* instantiate KFold with k=5

In [12]:
from sklearn.model_selection import KFold

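# Instantiate KFold with k=5 (shuffle=True with a fixed seed, so the test folds are not contiguous;
# the default, shuffle=False, would give test folds [0 1], [2 3], ...)
kf = KFold(n_splits=5, shuffle=True, random_state=42)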

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Apply KFold
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("Training indices:", train_index)
    print("Testing indices:", test_index)
    print("X_train:\n", X_train)
    print("X_test:\n", X_test)
    print("y_train:\n", y_train)
    print("y_test:\n", y_test)
    print("-" * 30)

Training indices: [0 2 3 4 5 6 7 9]
Testing indices: [1 8]
X_train:
 [[ 0.66270781  0.81474124]
 [ 0.58573283 -0.02768514]
 [ 1.17023947 -0.62029072]
 [-0.71046623 -1.48307451]
 [ 0.36847152 -1.44181645]
 [ 1.17773179 -0.39835878]
 [-0.56702026  1.49261945]
 [-1.45197908 -0.65124083]]
X_test:
 [[ 0.53636202 -0.1452261 ]
 [ 1.58861765 -0.04236291]]
y_train:
 [ 0.10096713 -1.90504676  1.29843983 -1.06703727 -0.12571111 -0.10795595
  0.10205291  0.68410502]
y_test:
 [-0.99972414 -0.18816473]
------------------------------
Training indices: [1 2 3 4 6 7 8 9]
Testing indices: [0 5]
X_train:
 [[ 0.53636202 -0.1452261 ]
 [ 0.58573283 -0.02768514]
 [ 1.17023947 -0.62029072]
 [-0.71046623 -1.48307451]
 [ 1.17773179 -0.39835878]
 [-0.56702026  1.49261945]
 [ 1.58861765 -0.04236291]
 [-1.45197908 -0.65124083]]
X_test:
 [[ 0.66270781  0.81474124]
 [ 0.36847152 -1.44181645]]
y_train:
 [-0.99972414 -1.90504676  1.29843983 -1.06703727 -0.10795595  0.10205291
 -0.18816473  0.68410502]
y_test:
 [ 0.100...
[output truncated]

* iterate over train_index and test_index in kf.split(X) and print them

In [13]:
from sklearn.model_selection import KFold
import numpy as np

# Instantiate KFold with k=5 (default shuffle=False, so the test folds are contiguous)
kf = KFold(n_splits=5)

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Iterate over the KFold splits
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}")
    print("Training indices:", train_index)
    print("Testing indices:", test_index)
    print()

Fold 1
Training indices: [2 3 4 5 6 7 8 9]
Testing indices: [0 1]

Fold 2
Training indices: [0 1 4 5 6 7 8 9]
Testing indices: [2 3]

Fold 3
Training indices: [0 1 2 3 6 7 8 9]
Testing indices: [4 5]

Fold 4
Training indices: [0 1 2 3 4 5 8 9]
Testing indices: [6 7]

Fold 5
Training indices: [0 1 2 3 4 5 6 7]
Testing indices: [8 9]



* instantiate KFold with k=5 and shuffle=True

In [14]:
from sklearn.model_selection import KFold
import numpy as np

# Instantiate KFold with k=5 and shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Iterate over the KFold splits
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}")
    print("Training indices:", train_index)
    print("Testing indices:", test_index)
    print()

Fold 1
Training indices: [0 2 3 4 5 6 7 9]
Testing indices: [1 8]

Fold 2
Training indices: [1 2 3 4 6 7 8 9]
Testing indices: [0 5]

Fold 3
Training indices: [0 1 3 4 5 6 8 9]
Testing indices: [2 7]

Fold 4
Training indices: [0 1 2 3 5 6 7 8]
Testing indices: [4 9]

Fold 5
Training indices: [0 1 2 4 5 7 8 9]
Testing indices: [3 6]



* iterate over train_index and test_index in kf.split(X) and print them

In [15]:
from sklearn.model_selection import KFold
import numpy as np

# Instantiate KFold with k=5 and shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Iterate over the KFold splits and print indices
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}")
    print("Training indices:", train_index)
    print("Testing indices:", test_index)
    print()

Fold 1
Training indices: [0 2 3 4 5 6 7 9]
Testing indices: [1 8]

Fold 2
Training indices: [1 2 3 4 6 7 8 9]
Testing indices: [0 5]

Fold 3
Training indices: [0 1 3 4 5 6 8 9]
Testing indices: [2 7]

Fold 4
Training indices: [0 1 2 3 5 6 7 8]
Testing indices: [4 9]

Fold 5
Training indices: [0 1 2 4 5 7 8 9]
Testing indices: [3 6]
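
A quick way to check what K-fold guarantees is to collect the test folds and confirm that every row lands in exactly one of them, with or without shuffling. This is a minimal sketch under assumptions: the placeholder data and the `folds` dictionary are invented for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10
X = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder data

folds = {}
for shuffle in (False, True):
    # random_state should only be set when shuffle=True
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=42 if shuffle else None)
    folds[shuffle] = [test_index.tolist() for _, test_index in kf.split(X)]
    print(f"shuffle={shuffle}: test folds = {folds[shuffle]}")

    # Every row index appears in exactly one test fold
    all_test = sorted(i for fold in folds[shuffle] for i in fold)
    assert all_test == list(range(n_samples))
```

Shuffling changes which rows share a fold, but not the fold sizes or the fact that the test folds partition the data.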



_______________________________________
### Leave-One-Out split
This technique is the Leave-p-out method from the previous readings with p=1: each observation is used as the test set exactly once, and the remaining observations form the train set.
- This is a popular method for tiny datasets.
- It is slow on bigger datasets, because the model is fit once per observation, and it can lead to overfitting on a final model.
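
The relationship to Leave-p-out can be checked directly. The sketch below compares the test indices produced by `LeaveOneOut`, `LeavePOut(p=1)`, and `KFold` with one fold per observation; the helper `collect_test_indices` and the placeholder data are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.arange(12).reshape(6, 2)  # 6 placeholder observations


def collect_test_indices(splitter, data):
    """Return the test indices of every split as plain lists."""
    return [test_index.tolist() for _, test_index in splitter.split(data)]


loo_tests = collect_test_indices(LeaveOneOut(), X)
lpo_tests = collect_test_indices(LeavePOut(p=1), X)
kf_tests = collect_test_indices(KFold(n_splits=len(X)), X)

print("LeaveOneOut:      ", loo_tests)
print("LeavePOut(p=1):   ", lpo_tests)
print("KFold(n_splits=n):", kf_tests)
```

All three splitters hold out one observation at a time, so for unshuffled data they should produce the same sequence of test sets.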

* import the **LeaveOneOut** class from sklearn

In [16]:
from sklearn.model_selection import LeaveOneOut

* instantiate LeaveOneOut

In [17]:
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Instantiate LeaveOneOut
loo = LeaveOneOut()

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Iterate over the LeaveOneOut splits
for train_index, test_index in loo.split(X):
    print("Training indices:", train_index)
    print("Testing indices:", test_index)
    print()

Training indices: [1 2 3 4 5 6 7 8 9]
Testing indices: [0]

Training indices: [0 2 3 4 5 6 7 8 9]
Testing indices: [1]

Training indices: [0 1 3 4 5 6 7 8 9]
Testing indices: [2]

Training indices: [0 1 2 4 5 6 7 8 9]
Testing indices: [3]

Training indices: [0 1 2 3 5 6 7 8 9]
Testing indices: [4]

Training indices: [0 1 2 3 4 6 7 8 9]
Testing indices: [5]

Training indices: [0 1 2 3 4 5 7 8 9]
Testing indices: [6]

Training indices: [0 1 2 3 4 5 6 8 9]
Testing indices: [7]

Training indices: [0 1 2 3 4 5 6 7 9]
Testing indices: [8]

Training indices: [0 1 2 3 4 5 6 7 8]
Testing indices: [9]



* iterate over train_index and test_index in loo.split(X) and print them

In [18]:
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Instantiate LeaveOneOut
loo = LeaveOneOut()

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Iterate over the LeaveOneOut splits and print indices
for train_index, test_index in loo.split(X):
    print("Training indices:", train_index)
    print("Testing indices:", test_index)
    print()

Training indices: [1 2 3 4 5 6 7 8 9]
Testing indices: [0]

Training indices: [0 2 3 4 5 6 7 8 9]
Testing indices: [1]

Training indices: [0 1 3 4 5 6 7 8 9]
Testing indices: [2]

Training indices: [0 1 2 4 5 6 7 8 9]
Testing indices: [3]

Training indices: [0 1 2 3 5 6 7 8 9]
Testing indices: [4]

Training indices: [0 1 2 3 4 6 7 8 9]
Testing indices: [5]

Training indices: [0 1 2 3 4 5 7 8 9]
Testing indices: [6]

Training indices: [0 1 2 3 4 5 6 8 9]
Testing indices: [7]

Training indices: [0 1 2 3 4 5 6 7 9]
Testing indices: [8]

Training indices: [0 1 2 3 4 5 6 7 8]
Testing indices: [9]



* print the number of splits

In [19]:
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Instantiate LeaveOneOut
loo = LeaveOneOut()

# Example data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# Print the number of splits
n_splits = loo.get_n_splits(X)
print("Number of splits:", n_splits)

Number of splits: 10
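
To see why Leave-One-Out becomes expensive on bigger datasets, it can help to compare its number of splits with a fixed 5-fold split as the dataset grows. This is a rough sketch: the sample sizes and the `counts` dictionary are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

counts = {}
for n_samples in (10, 100, 1000):
    X = np.zeros((n_samples, 2))  # placeholder data; only the row count matters
    n_loo = LeaveOneOut().get_n_splits(X)     # one split per observation
    n_kfold = KFold(n_splits=5).get_n_splits(X)  # always 5
    counts[n_samples] = (n_loo, n_kfold)
    print(f"n={n_samples}: LeaveOneOut splits={n_loo}, 5-fold splits={n_kfold}")
```

Each split means one model fit, so the cost of Leave-One-Out grows linearly with the number of observations while K-fold stays at k fits.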
