### sklearn.model_selection

### 1) train_test_split

What: Split arrays or DataFrames into random train and test subsets.
When: Quick holdout split (baseline evaluation, debugging, simple experiments).

Key args:

1. test_size / train_size (float or int) — portion or absolute count.

2. random_state — seed for reproducibility.

3. shuffle — whether to shuffle before splitting (True by default).

4. stratify — array to preserve class proportions (important for classification).

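Pitfalls: For time series, a shuffled split leaks future samples into the training set; pass shuffle=False (or use TimeSeriesSplit) so the test set comes after the training set in time. Use stratify for imbalanced classes; note that stratify requires shuffle=True.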

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
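
To make the effect of stratify concrete, here is a minimal sketch on made-up data (the 90/10 class imbalance and the array shapes are assumptions chosen for illustration, not part of the original example). It checks that both subsets keep the minority share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (illustrative): 90 samples of class 0, 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both subsets keep the 10% minority share.
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```

Without stratify, a small test set like this one could easily end up with zero or several extra minority samples.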


### 2) KFold

What: K-fold cross-validation splitter (no stratification). Splits data into n_splits consecutive folds.

Key args:

1. n_splits (default 5)

2. shuffle — whether to shuffle before splitting (False by default)

3. random_state — seed when shuffle=True

When: Regression or when class balance across folds is not required.

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
    # NumPy indexing; for a pandas DataFrame use X.iloc[train_idx] instead
    X_tr, X_val = X[train_idx], X[val_idx]

### 3) StratifiedKFold

What: Like KFold, but preserves class proportions across folds (classification).

When: Classification tasks with imbalanced classes.


In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr, val in skf.split(X, y):
    ...


### 4) GroupKFold

What: Ensure samples from the same group (e.g., patient, user, store) are all in either train or test, never split.

When: You have grouped data and want to avoid leakage between groups.

Key arg: groups passed to .split(X, y, groups).

In [None]:
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
for tr, val in gkf.split(X, y, groups):
    ...
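
A small sketch of the no-leakage guarantee, assuming a toy dataset with four hypothetical patient IDs (the data and group labels are invented for illustration). For every fold it collects the groups that appear on both sides of the split, which should always be empty.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 samples, 3 per (hypothetical) patient.
X = np.arange(24).reshape(12, 2)
y = np.tile([0, 1], 6)
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)

gkf = GroupKFold(n_splits=4)
overlaps = []
for tr, val in gkf.split(X, y, groups):
    # Groups present in both the training and the validation indices.
    shared = set(groups[tr]) & set(groups[val])
    overlaps.append(shared)

print(overlaps)
```

Note that n_splits cannot exceed the number of distinct groups.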


### 5) TimeSeriesSplit

What: For time-series cross-validation. Produces forward-chaining splits: train indices up to time t, test indices after t.

Key args: n_splits, max_train_size (optional cap on the training window), test_size, gap (number of samples to skip between train and test).

When: Time series forecasting or data with temporal order.

Pitfall: Do not shuffle! Keep chronological order.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for tr, val in tscv.split(X):
    # tr < val in time
    ...
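
The forward-chaining pattern is easiest to see on a tiny index range. This is a sketch on eight made-up, time-ordered samples with default settings (no gap, no max_train_size).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered samples (illustrative).
X = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(tr.tolist(), val.tolist()) for tr, val in tscv.split(X)]
for tr, val in splits:
    print("train:", tr, "validate:", val)
# train: [0, 1] validate: [2, 3]
# train: [0, 1, 2, 3] validate: [4, 5]
# train: [0, 1, 2, 3, 4, 5] validate: [6, 7]
```

The training window grows with each split while the validation block always lies strictly after it.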

### 6) LeaveOneOut / LeavePOut

What: Extreme CV: LOO leaves one sample out each iteration; LeavePOut leaves p samples out.

When: Very small datasets (LOO). Expensive for large datasets.

Example (LOO):


In [None]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
for tr, val in loo.split(X):
    ...
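
To see why these splitters get expensive, the following sketch counts the splits for a made-up array of five samples. LeaveOneOut yields one split per sample, while LeavePOut yields one split per combination of p samples.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

# Five samples (illustrative).
X = np.arange(10).reshape(5, 2)

n_loo = LeaveOneOut().get_n_splits(X)    # one split per sample
n_lpo = LeavePOut(p=2).get_n_splits(X)   # "5 choose 2" splits
print(n_loo, n_lpo)  # 5 10
```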


### 7) ShuffleSplit / StratifiedShuffleSplit

What: Repeated random train/test splits (randomized). Stratified version preserves class proportions.

When: You want multiple random splits rather than contiguous folds.

In [None]:
from sklearn.model_selection import ShuffleSplit
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
for tr, val in ss.split(X):
    ...


### 8) RepeatedKFold / RepeatedStratifiedKFold

What: Repeat KFold / StratifiedKFold multiple times with different shuffles (gives more stable estimate).

Key args: n_repeats, random_state.

In [None]:
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)


### 9) PredefinedSplit

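What: Use a user-specified array to assign each sample to a test fold (useful for custom CV splits or temporal splitting).

Example:
If test_fold = [-1,-1,0,0,1,1], samples marked -1 are never used for testing (they are in every training set). Samples marked 0 form the test set of fold 0 and samples marked 1 form the test set of fold 1; in each fold, all remaining samples are used for training.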

In [None]:
from sklearn.model_selection import PredefinedSplit, cross_val_score
ps = PredefinedSplit(test_fold)
scores = cross_val_score(model, X, y, cv=ps)
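
The fold assignment for test_fold = [-1,-1,0,0,1,1] can be checked directly by listing the index pairs the splitter yields. This sketch needs no model or data, only the test_fold array.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.array([-1, -1, 0, 0, 1, 1])
ps = PredefinedSplit(test_fold)

# Each entry is (train indices, test indices) for one fold.
splits = [(tr.tolist(), te.tolist()) for tr, te in ps.split()]
print(splits)
# [([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]
```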


### 10) cross_val_score

What: Train estimator over cross-validation splits and return an array of scores.

Key args:

estimator, X, y, cv (int or splitter), scoring (string or callable), n_jobs, verbose, fit_params (deprecated in scikit-learn 1.4 in favor of params)

Returns: 1-D array of scores (one per CV split). For negated error metrics such as neg_root_mean_squared_error, flip the sign to recover the error value.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse = -scores.mean()


### 11) cross_validate

What: More flexible than cross_val_score. Can return multiple metrics, fit times, score times, and training scores.

Key args:

scoring can be a list/dict of metrics

return_train_score=True if you want train metrics

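returns a dict with fit_time and score_time, plus test_score (and train_score when return_train_score=True). With multiple metrics, the score keys become test_<scorer_name> and train_<scorer_name> instead.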

In [None]:
from sklearn.model_selection import cross_validate
res = cross_validate(pipe, X, y, cv=5, scoring=['r2','neg_root_mean_squared_error'], return_train_score=True)
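
A minimal sketch of what the returned dict looks like with two metrics. The synthetic regression data and the plain Ridge estimator are assumptions standing in for the pipe, X, and y used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic data (illustrative stand-in for X, y).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

res = cross_validate(
    Ridge(), X, y, cv=5,
    scoring=["r2", "neg_root_mean_squared_error"],
    return_train_score=True,
)

# One array of length 5 (one value per fold) under each key.
keys = sorted(res)
rmse = -res["test_neg_root_mean_squared_error"].mean()
print(keys)
print(rmse)
```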


### 12) cross_val_predict

What: Generate cross-validated estimates for each input sample (out-of-fold predictions). Useful for stacking, blending, OOF predictions.

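Important: Each prediction comes from a model that did not see that sample during training (good for meta-models). The cv strategy must assign every sample to exactly one test fold, so splitters such as ShuffleSplit or TimeSeriesSplit do not work here.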

In [None]:
from sklearn.model_selection import cross_val_predict
oof_preds = cross_val_predict(pipe, X, y, cv=5, method='predict')
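
For stacking, out-of-fold probabilities are often more useful than hard labels. The sketch below assumes a synthetic binary classification problem and a LogisticRegression base model (both chosen for illustration) and uses method='predict_proba' to build one meta-feature column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data (illustrative).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# One row of class probabilities per sample, each produced by a model
# that was fitted without that sample.
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Probability of the positive class, usable as an input to a meta-model.
meta_feature = oof_proba[:, 1]
print(oof_proba.shape, np.round(meta_feature[:3], 3))
```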


### 13) GridSearchCV

What: Exhaustive search over parameter grid with cross-validation. Selects best parameter combination.

Key args:

estimator, param_grid (dict: param names -> list), cv, scoring, n_jobs, refit (if True, fits best estimator on full data), verbose.

Usage: For parameters inside a pipeline, use <step_name>__<param> names, for example model__alpha for the alpha parameter of the step named model.

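Pitfalls:

Expensive for large grids: the number of fits is (number of parameter combinations) × (number of CV folds), plus one final refit.

Use n_jobs to parallelize, and pre_dispatch to limit how many jobs are queued at once (this caps memory use).

Use RandomizedSearchCV when the grid is large.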

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'model__alpha': [0.1, 1.0, 10.0]}
gs = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1)
gs.fit(X_train, y_train)
best = gs.best_estimator_
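
The snippet above assumes an existing pipeline and training data. A self-contained sketch, with synthetic data and hypothetical step names "scale" and "model", shows how the step name ties into the param_grid keys and how to read the results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative).
X, y = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)

# The step name "model" is what makes the key "model__alpha" valid.
pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

gs = GridSearchCV(pipeline, param_grid, cv=5,
                  scoring="neg_root_mean_squared_error")
gs.fit(X, y)

best_rmse = -gs.best_score_   # flip the sign of the negated metric
print(gs.best_params_, best_rmse)
```

With refit=True (the default), gs.best_estimator_ is the whole pipeline refitted on all data passed to fit, so the scaler is included.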


### 14) RandomizedSearchCV

What: Sample parameter combinations from distributions (a cheaper alternative to GridSearchCV).

Key args: param_distributions (dict of distributions or lists), n_iter, cv, random_state.

When: Large hyperparameter spaces or expensive models.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
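# Lists are sampled without replacement; this space has only 3 x 3 = 9 combinations, so keep n_iter <= 9
param_dist = {'model__n_estimators': [100,300,500], 'model__max_depth': [3,5,7]}
rs = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=5, cv=5, n_jobs=-1, random_state=42)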
rs.fit(X_train, y_train)
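
Randomized search pays off most with continuous distributions rather than lists. The following sketch assumes a Ridge pipeline on synthetic data and draws alpha from scipy.stats.loguniform; the range 1e-3 to 1e2 is an arbitrary choice for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative).
X, y = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)
pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# alpha is sampled log-uniformly, so each decade is equally likely.
param_dist = {"model__alpha": loguniform(1e-3, 1e2)}

rs = RandomizedSearchCV(
    pipeline, param_distributions=param_dist, n_iter=10, cv=5,
    scoring="neg_root_mean_squared_error", random_state=42,
)
rs.fit(X, y)
print(rs.best_params_)
```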


### 15) ParameterGrid / ParameterSampler

What: Utility classes to create (grid) or sample (random) parameter combos for custom loop.

In [None]:
from sklearn.model_selection import ParameterGrid
grid = list(ParameterGrid({'a':[1,2], 'b':[True,False]}))
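
A short sketch of both utilities side by side, using the same toy parameter space. Each yields plain dicts that can be passed to set_params or unpacked into a constructor inside a custom loop.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

space = {"a": [1, 2], "b": [True, False]}

# Every combination: 2 x 2 = 4 dicts.
grid = list(ParameterGrid(space))

# A random subset of 3 combinations (reproducible via random_state).
sampled = list(ParameterSampler(space, n_iter=3, random_state=0))

for params in grid:
    # In a real loop this is where a model would be built and evaluated.
    print(params)
```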


### 16) permutation_test_score

What: Statistical significance test for model score: shuffles labels to compute null distribution of the score. Returns observed score, permutation scores, p-value.

When: Test if model performance is significantly different from chance.

In [None]:
from sklearn.model_selection import permutation_test_score
score, perm_scores, pvalue = permutation_test_score(model, X, y, scoring='accuracy', cv=5, n_permutations=100, n_jobs=-1)
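
A runnable sketch on synthetic data, assuming a LogisticRegression classifier (an illustrative choice). The p-value is the fraction of permutations whose score is at least as good as the observed one, with one added to numerator and denominator, so its smallest possible value is 1 / (n_permutations + 1).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Synthetic data with real signal (illustrative).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    scoring="accuracy", cv=5, n_permutations=50, random_state=0,
)
print(score, perm_scores.mean(), pvalue)
```

Raising n_permutations gives a finer-grained p-value at proportionally higher cost.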


### 17) learning_curve

What: Compute training and cross-validation scores for different training set sizes. Useful to diagnose bias/variance and whether more data helps.

Key args: train_sizes, cv, scoring, n_jobs.

In [None]:
import numpy as np
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(pipe, X, y, cv=5, train_sizes=np.linspace(0.1,1.0,5), scoring='neg_root_mean_squared_error')
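
The returned score arrays have one row per training size and one column per CV fold, so they are usually averaged across folds before plotting. A sketch with synthetic data and a Ridge model (both assumed for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic data (illustrative).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_root_mean_squared_error",
)

# Average over folds (axis=1) and flip the sign to get RMSE per training size.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -test_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_rmse, val_rmse):
    print(n, round(tr, 2), round(va, 2))
```

A validation curve that keeps dropping as the training size grows suggests more data would help; a large, persistent gap between the two curves points to overfitting.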


### 18) validation_curve

What: Evaluate model performance as a function of a single hyperparameter (useful for seeing under/overfitting behavior relative to a parameter).

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
param_range = [1,10,100,1000]
train_scores, test_scores = validation_curve(Ridge(), X, y, param_name='alpha', param_range=param_range, cv=5, scoring='neg_root_mean_squared_error')


### 19) check_cv

Note: check_cv lives in sklearn.model_selection, but it is mainly used internally to turn a cv argument into a splitter object; users rarely call it directly. Pass cv=5 (int) for the default 5-fold splitter, or pass a splitter object.

### 20) Choosing cv argument (int vs splitter)

cv=int → use the default splitter (no shuffling):

classifier with binary or multiclass y → StratifiedKFold

anything else (e.g., regression) → KFold

cv=splitter_object → pass any of the splitters above (KFold, TimeSeriesSplit, etc.)

cv=PredefinedSplit → custom splits
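
The int-to-splitter rule can be inspected with check_cv, which performs this resolution. A minimal sketch with a made-up binary target:

```python
import numpy as np
from sklearn.model_selection import check_cv

# Binary target (illustrative).
y_class = np.array([0, 1] * 10)

# classifier=True mimics what happens when the estimator is a classifier.
cv_clf = check_cv(5, y_class, classifier=True)
cv_reg = check_cv(5, y_class, classifier=False)

names = (type(cv_clf).__name__, type(cv_reg).__name__)
print(names)  # ('StratifiedKFold', 'KFold')
```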

### 21) scoring parameter

Accepts strings (e.g., 'accuracy', 'roc_auc', 'neg_mean_squared_error') or a callable with the signature scorer(estimator, X, y).

For loss metrics where lower is better, sklearn often uses negative versions (neg_mean_squared_error) so cross_val_score returns higher-is-better numbers. Convert sign when interpreting.

In [None]:
scores = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=5)
rmse = -scores.mean()
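
When no built-in string fits, a scorer can be built from a metric function with make_scorer. The sketch below wraps mean_absolute_error purely for illustration (a built-in 'neg_mean_absolute_error' string already exists); the synthetic data and Ridge model are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# greater_is_better=False makes the scorer return the negated error,
# keeping the "higher is better" convention.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

scores = cross_val_score(Ridge(), X, y, scoring=mae_scorer, cv=5)
mae = -scores.mean()
print(mae)
```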


### 22) Parallelism & performance

n_jobs=-1 to use all CPUs (useful for cross_val_score, GridSearchCV, etc.). Be mindful of memory.

verbose is useful for monitoring progress on long runs.

For nested parallelism (e.g., an estimator with its own n_jobs inside a grid search), set n_jobs at one level only to avoid oversubscribing the CPUs.