## **ShuffleSplit**

**What it is:**  
- Generates a user-defined number of independent random train/test splits.
- Unlike KFold, the same data point may appear in multiple test sets.

**When to use it:**  
- When you want to perform random subsampling cross-validation.  
- Useful when you want to choose the number of splits and the test-set size independently of each other (in KFold, the test-set size is fixed by the number of folds).

**Key points:**  
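- Each split is drawn independently by reshuffling the sample indices and cutting them into a train part and a test part.
- Within a single split, train and test never overlap, but across splits there is no guarantee that every sample is tested exactly once: a sample can land in several test sets, or in none.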


In [2]:
from sklearn.model_selection import ShuffleSplit
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
for train_index, test_index in ss.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)


TRAIN: [0 7 2 9 4 3 6] TEST: [8 1 5]
TRAIN: [5 3 4 7 9 6 2] TEST: [0 1 8]
TRAIN: [6 8 5 3 7 1 4] TEST: [9 2 0]
TRAIN: [2 8 0 3 4 5 9] TEST: [1 7 6]
TRAIN: [8 0 7 6 3 2 9] TEST: [1 5 4]
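
Because every split is drawn independently, the number of splits can even exceed the number of samples, which `KFold` does not allow. A minimal sketch of that difference on the same toy array (the value `n_splits=25` is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(10, 2)  # same toy data: 10 samples

# 25 random splits of a 10-sample dataset is fine for ShuffleSplit.
ss = ShuffleSplit(n_splits=25, test_size=0.3, random_state=42)
splits = list(ss.split(X))
print("ShuffleSplit produced", len(splits), "splits")

# Within any single split, train and test indices never overlap.
all_disjoint = all(
    len(np.intersect1d(train_index, test_index)) == 0
    for train_index, test_index in splits
)
print("Train/test disjoint in every split:", all_disjoint)

# KFold cannot create more folds than there are samples.
try:
    list(KFold(n_splits=25).split(X))
    kfold_failed = False
except ValueError as err:
    kfold_failed = True
    print("KFold raised ValueError:", err)
```

The price of this flexibility is the one noted in the key points: coverage of the dataset by the test sets is left to chance.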


In [3]:
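import numpy as np
import pandas as pd
import seaborn as sns

# Load the example "tips" dataset that ships with seaborn.
df = sns.load_dataset('tips')

# Bin the continuous tip amount into six ordered categories:
# (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, inf].
df["tip-range"] = pd.cut(
    df['tip'],
    bins=[0., 1., 2., 3., 4., 5., np.inf],
    labels=[1, 2, 3, 4, 5, 6]
)

# Share of rows that fall into each tip range.
df['tip-range'].value_counts() / len(df)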

tip-range
2    0.303279
3    0.278689
4    0.233607
5    0.094262
6    0.073770
1    0.016393
Name: count, dtype: float64

In [7]:
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
features = df.drop(columns=['tip', 'tip-range'])
for fold, (train_ids, test_ids) in enumerate(ss.split(features, df['tip-range'])):
    print(f"################### Fold {fold + 1} ###################")

    # split() yields positional indices, so select rows with .iloc rather than .loc
    # (.loc only gives the same rows while the DataFrame keeps its default index).
    train_distribution = df['tip-range'].iloc[train_ids].value_counts(normalize=True)
    test_distribution = df['tip-range'].iloc[test_ids].value_counts(normalize=True)

    print("Train Distribution:\n", train_distribution)
    print("Test Distribution:\n", test_distribution)
    
    print("############################################################\n")


################### Fold 1 ###################
Train Distribution:
 tip-range
2    0.294118
3    0.264706
4    0.247059
5    0.100000
6    0.082353
1    0.011765
Name: proportion, dtype: float64
Test Distribution:
 tip-range
2    0.324324
3    0.310811
4    0.202703
5    0.081081
6    0.054054
1    0.027027
Name: proportion, dtype: float64
############################################################

################### Fold 2 ###################
Train Distribution:
 tip-range
2    0.317647
3    0.276471
4    0.247059
6    0.076471
5    0.058824
1    0.023529
Name: proportion, dtype: float64
Test Distribution:
 tip-range
3    0.283784
2    0.270270
4    0.202703
5    0.175676
6    0.067568
1    0.000000
Name: proportion, dtype: float64
############################################################

################### Fold 3 ###################
Train Distribution:
 tip-range
2    0.311765
3    0.264706
4    0.241176
5    0.088235
6    0.076471
1    0.017647
Name: proportion, dtype: float