
# Split approaches

* **Regression:** Split randomly: the targets are continuous, so there are no classes to stratify on, and samples are assumed to be IID.
* **Classification:** Use a stratified split, meaning each subset keeps the same class proportions as the full dataset (important for balanced evaluation).
* **Anomaly Detection:** Train only on normal data and reserve anomalies for validation/testing.
* **Clustering:** Randomly split (or use all data) since there are no labels to guide splitting.
* **Forecasting:** Split chronologically to preserve time order and prevent future data leakage.



# Datasets used for data split examples

**California Housing (Regression):**
Tabular dataset (20,640 samples, 8 numeric features such as median income, house age, population, and latitude/longitude) for predicting median house value; used for regression tasks.

https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

---

**Wine (Classification):**
Tabular dataset (178 samples, 13 chemical features such as alcohol, ash, magnesium) classifying wines into 3 cultivars; used for multi-class classification.

https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset

---

**Synthetic Anomaly Data (Anomaly Detection):**
Artificial dataset (1000 samples × 10 features) with 1% labeled as anomalies; used to simulate rare-event detection scenarios like fraud or fault detection.

https://scikit-learn.org/stable/datasets/sample_generators.html (conceptual reference for synthetic data generation)

---

**Blobs (Clustering):**
Synthetic dataset (1000 samples, 5 numeric features) generated around 3 cluster centers; used for testing unsupervised clustering algorithms.

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

---

**Synthetic Time Series (Forecasting):**
Simulated dataset (1000 daily records with date and noisy sine value column) representing temporal data for time series forecasting examples.

https://pandas.pydata.org/docs/reference/api/pandas.date_range.html (used to generate time series dates)


# Data split for Regression

- Regression tasks predict continuous values.
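- Direct stratification isn't possible (no discrete labels); if the target distribution is heavily skewed, stratifying on binned target values is a common workaround.
- Split randomly (assuming IID data).
- Consider k-fold cross-validation instead of a fixed validation set if the dataset is small.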

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Load example regression dataset
X, y = fetch_california_housing(return_X_y=True)

# Typical split: 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
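
The k-fold alternative mentioned in the list above can be sketched as follows. This is a minimal illustration, not part of the original notebook: `load_diabetes` (442 samples, bundled with scikit-learn) stands in for a small regression dataset, and `LinearRegression` is only a placeholder model.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: load_diabetes stands in for a "small" regression dataset
X_small, y_small = load_diabetes(return_X_y=True)

# Shuffle before folding, since the rows are assumed to be IID
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Across the 5 folds, every sample is used for validation exactly once
fold_sizes = [len(val_idx) for _, val_idx in kf.split(X_small)]
print("Validation fold sizes:", fold_sizes)

# Assumption: LinearRegression is a placeholder for whatever model is being evaluated
scores = cross_val_score(LinearRegression(), X_small, y_small, cv=kf, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print(f"Mean R^2: {scores.mean():.3f}")
```

A separate test set can still be held out before running cross-validation, so that the final evaluation uses data the folds never saw.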



# Data split for Classification

- Always use `stratify=y` for classification splits to preserve label distribution.
- This keeps a rare class from being over- or under-represented (or missing entirely) in any subset.
- For small datasets, use StratifiedKFold or StratifiedShuffleSplit.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Stratified split ensures class balance in all sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)
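
For small datasets, the `StratifiedKFold` option listed above might look like the sketch below. It only checks that each validation fold keeps roughly the class proportions of the full Wine dataset; the variable names are illustrative and no model is trained.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold

X_w, y_w = load_wine(return_X_y=True)

# Class proportions in the full dataset (3 cultivars)
full_props = np.bincount(y_w) / len(y_w)
print("Full dataset proportions:", np.round(full_props, 3))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Track the largest deviation between a fold's proportions and the full dataset's
max_gap = 0.0
for fold, (train_idx, val_idx) in enumerate(skf.split(X_w, y_w)):
    val_props = np.bincount(y_w[val_idx], minlength=3) / len(val_idx)
    max_gap = max(max_gap, float(np.abs(val_props - full_props).max()))
    print(f"Fold {fold}: validation proportions = {np.round(val_props, 3)}")

print(f"Largest deviation from the full proportions: {max_gap:.3f}")
```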

# Data split for Anomaly Detection

- Training data: ONLY normal samples (unsupervised anomaly detection).
- Validation/Test: Mix normal + anomaly for evaluation.
- Don't use a stratified split: it would place a share of the anomalies in the training set, and the class imbalance is intentional.
- Ensure anomalies don’t leak into training.



In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

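# Example: 99% normal, 1% anomalies
# Note: all rows come from the same distribution; the labels only illustrate the split mechanics
rng = np.random.default_rng(42)  # seeded for reproducibility
n_samples = 1000
X = rng.standard_normal((n_samples, 10))
y = np.zeros(n_samples)
y[:10] = 1  # mark the first 10 rows (1%) as anomalies

# Split the normal samples only, so that no anomaly ends up in training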
X_train, X_temp, y_train, y_temp = train_test_split(
    X[y == 0], y[y == 0], test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Add anomalies to val/test sets
X_val = np.vstack([X_val, X[y == 1][:5]])
y_val = np.concatenate([y_val, y[y == 1][:5]])
X_test = np.vstack([X_test, X[y == 1][5:]])
y_test = np.concatenate([y_test, y[y == 1][5:]])
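
To show how such a split is typically used, here is a hedged sketch that fits a detector on normal samples only and scores a mixed evaluation set. Two things are assumptions that do not appear in the notebook: the anomalies are shifted by +6 in every feature so that they are actually detectable, and `IsolationForest` is just one possible detector.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Assumption: anomalies are shifted by +6 per feature so they differ from normal data
X_normal = rng.normal(size=(990, 10))
X_anom = rng.normal(loc=6.0, size=(10, 10))

# Only normal samples are split; the training part never sees an anomaly
X_tr, X_hold = train_test_split(X_normal, test_size=0.3, random_state=42)

# Evaluation set: held-out normal samples plus all anomalies
X_eval = np.vstack([X_hold, X_anom])
y_eval = np.concatenate([np.zeros(len(X_hold)), np.ones(len(X_anom))])

# Assumption: IsolationForest is a placeholder detector
model = IsolationForest(random_state=42).fit(X_tr)

# score_samples is higher for more "normal" points, so negate it to get an anomaly score
anomaly_score = -model.score_samples(X_eval)
auc = roc_auc_score(y_eval, anomaly_score)
print(f"ROC AUC on the mixed evaluation set: {auc:.3f}")
```

In practice the evaluation set would itself be divided into validation and test parts, as in the cell above, so that thresholds are tuned on one and reported on the other.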

# Data split for Clustering

- No y labels → unsupervised.
- Often, all data is used for training (since we want full structure).
- Use a hold-out set to test cluster stability/generalization.
- Alternative: Cross-validation via repeated sub-sampling (e.g. evaluating clustering consistency).

In [None]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=1000, centers=3, n_features=5, random_state=42)

# Random split since there are no labels
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
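
One way to use the hold-out set for the stability check mentioned above is sketched below: cluster the training part, assign the held-out points to those clusters, and compare that assignment with a clustering fitted on the hold-out alone. The choice of `KMeans` and of the adjusted Rand index as the agreement measure are assumptions, not something the notebook prescribes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

X_b, _ = make_blobs(n_samples=1000, centers=3, n_features=5, random_state=42)
X_b_train, X_b_test = train_test_split(X_b, test_size=0.2, random_state=42)

# Clustering learned on the training part, then applied to the hold-out
km_train = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_b_train)
labels_from_train = km_train.predict(X_b_test)

# Independent clustering fitted on the hold-out only
km_test = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_from_test = km_test.fit_predict(X_b_test)

# The adjusted Rand index ignores label permutations; values near 1 mean stable clusters
ari = adjusted_rand_score(labels_from_train, labels_from_test)
print(f"Agreement between the two clusterings (ARI): {ari:.3f}")
```

Repeating this over several random sub-samples gives the sub-sampling variant of the same idea.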

# Data split for Forecasting

- NEVER shuffle time series — order matters.
- Train/Val/Test split must preserve temporal order.
- Alternative: Use expanding-window or rolling-window cross-validation.
- Validation simulates “future unseen” predictions.

In [None]:
import numpy as np
import pandas as pd

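# Simulated time series data: a sine wave plus Gaussian noise
n_days = 1000
rng = np.random.default_rng(42)  # seeded for reproducibility
dates = pd.date_range(start="2020-01-01", periods=n_days, freq="D")
data = pd.DataFrame({
    "date": dates,
    "value": np.sin(np.arange(n_days) / 50) + rng.standard_normal(n_days) * 0.1,
})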

# Split chronologically — no random shuffle
train_size = int(len(data) * 0.7)
val_size = int(len(data) * 0.15)

train = data.iloc[:train_size]
val = data.iloc[train_size:train_size + val_size]
test = data.iloc[train_size + val_size:]
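
The expanding-window alternative listed above can be sketched with scikit-learn's `TimeSeriesSplit`. The series here is a noise-free sine wave used only to have something to index; the point is the fold boundaries, not the values.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative series: 1000 ordered observations
n_obs = 1000
values = np.sin(np.arange(n_obs) / 50)

# Each fold trains on everything before its validation block (expanding window)
tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(values))

for i, (train_idx, val_idx) in enumerate(folds):
    print(
        f"Fold {i}: train rows 0..{train_idx[-1]} ({len(train_idx)} rows), "
        f"validation rows {val_idx[0]}..{val_idx[-1]} ({len(val_idx)} rows)"
    )
```

Passing `max_train_size` to `TimeSeriesSplit` caps the length of the training window, which approximates the rolling-window variant.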

In [None]:
# ================================================================
# Unified EDA Summary for All Five ML Task Types
# ================================================================

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing, load_wine, make_blobs
from sklearn.model_selection import train_test_split

pd.set_option("display.float_format", lambda x: f"{x:,.3f}")

def print_section(title):
    print("\n" + "=" * 70)
    print(f"📘 {title}")
    print("=" * 70)

def summarize_split(name, X_train, X_val=None, X_test=None, y_train=None, y_val=None, y_test=None):
    """Summarizes numeric stats across full dataset vs splits."""
    def summary(arr):
        if arr is None or len(arr) == 0: # Handle empty arrays
            return {"mean": np.nan, "std": np.nan, "shape": (0,) if arr is None else arr.shape}
        arr = np.array(arr)
        return {
            "mean": np.mean(arr, axis=0)[:3] if arr.ndim > 1 else np.mean(arr),
            "std": np.std(arr, axis=0)[:3] if arr.ndim > 1 else np.std(arr),
            "shape": arr.shape
        }

    print(f"\n{name} Split Summary:")
    print(f"Train size: {len(X_train)}")
    if X_val is not None:
        print(f"Val size: {len(X_val)}")
    if X_test is not None:
        print(f"Test size: {len(X_test)}")

    all_X = [X_train]
    if X_val is not None and len(X_val) > 0:
        all_X.append(X_val)
    if X_test is not None and len(X_test) > 0:
        all_X.append(X_test)
    full_X = np.vstack(all_X)


    print("Feature mean (first 3 dims):")
    print(f"  Full:  {summary(full_X)['mean']}")
    print(f"  Train: {summary(X_train)['mean']}")
    if X_val is not None:
        print(f"  Val:   {summary(X_val)['mean']}")
    if X_test is not None:
        print(f"  Test:  {summary(X_test)['mean']}")


    if y_train is not None:
        all_y = [y_train]
        if y_val is not None and len(y_val) > 0:
             all_y.append(y_val)
        if y_test is not None and len(y_test) > 0:
             all_y.append(y_test)
        full_y = np.concatenate(all_y)
        print("Target mean/std:")
        print(f"  Full:  mean={np.mean(full_y):.3f}, std={np.std(full_y):.3f}")
        print(f"  Train: mean={np.mean(y_train):.3f}, std={np.std(y_train):.3f}")
        if y_val is not None:
             print(f"  Val:   mean={np.mean(y_val):.3f}, std={np.std(y_val):.3f}")
        if y_test is not None:
             print(f"  Test:  mean={np.mean(y_test):.3f}, std={np.std(y_test):.3f}")


# ================================================================
# Regression
# ================================================================
print_section("Regression — California Housing")
# Load California Housing once as a Bunch so that the feature names are available
housing = fetch_california_housing()
X_cal, y_cal = housing.data, housing.target
print(pd.DataFrame(X_cal, columns=housing.feature_names).describe().T)
# Recreate the 70/15/15 split from the regression section
X_train_reg, X_temp_reg, y_train_reg, y_temp_reg = train_test_split(X_cal, y_cal, test_size=0.3, random_state=42)
X_val_reg, X_test_reg, y_val_reg, y_test_reg = train_test_split(X_temp_reg, y_temp_reg, test_size=0.5, random_state=42)
summarize_split("California Housing", X_train_reg, X_val_reg, X_test_reg, y_train_reg, y_val_reg, y_test_reg)

# ================================================================
# Classification
# ================================================================
print_section("Classification — Wine Dataset")
# Load the Wine data once as a Bunch so that the feature names are available
wine = load_wine()
X_wine, y_wine = wine.data, wine.target
print("Class distribution (full):", dict(zip(*np.unique(y_wine, return_counts=True))))
print(pd.DataFrame(X_wine, columns=wine.feature_names).describe().T)
# Recreate the stratified 70/15/15 split from the classification section
X_train_clf, X_temp_clf, y_train_clf, y_temp_clf = train_test_split(
    X_wine, y_wine, test_size=0.3, stratify=y_wine, random_state=42
)
X_val_clf, X_test_clf, y_val_clf, y_test_clf = train_test_split(
    X_temp_clf, y_temp_clf, test_size=0.5, stratify=y_temp_clf, random_state=42
)
summarize_split("Wine Classification", X_train_clf, X_val_clf, X_test_clf, y_train_clf, y_val_clf, y_test_clf)

# ================================================================
# Anomaly Detection
# ================================================================
print_section("Anomaly Detection — Synthetic")
# Recreate the data from the anomaly detection section
n_samples = 1000
X_ano = np.random.randn(n_samples, 10)
y_ano = np.zeros(n_samples)
y_ano[:10] = 1  # mark 1% anomalies
print("Overall Feature Summary (first 3 features):")
print(pd.DataFrame(X_ano[:, :3], columns=["f1", "f2", "f3"]).describe().T)
print("Anomaly ratio full:", np.mean(y_ano))
# Recreate the split from the anomaly detection section: train on normal samples only,
# then add the anomalies to the validation and test sets (no stratification)
is_anomaly = y_ano == 1
X_train_ano, X_temp_ano, y_train_ano, y_temp_ano = train_test_split(
    X_ano[~is_anomaly], y_ano[~is_anomaly], test_size=0.3, random_state=42
)
X_val_ano, X_test_ano, y_val_ano, y_test_ano = train_test_split(
    X_temp_ano, y_temp_ano, test_size=0.5, random_state=42
)
X_val_ano = np.vstack([X_val_ano, X_ano[is_anomaly][:5]])
y_val_ano = np.concatenate([y_val_ano, y_ano[is_anomaly][:5]])
X_test_ano = np.vstack([X_test_ano, X_ano[is_anomaly][5:]])
y_test_ano = np.concatenate([y_test_ano, y_ano[is_anomaly][5:]])
summarize_split("Anomaly Detection", X_train_ano, X_val_ano, X_test_ano, y_train_ano, y_val_ano, y_test_ano)


# ================================================================
# Clustering
# ================================================================
print_section("Clustering — Synthetic Blobs")
# Regenerate the Blobs data (same parameters and seed as the clustering section)
X_blobs, _ = make_blobs(n_samples=1000, centers=3, n_features=5, random_state=42)
print(pd.DataFrame(X_blobs[:, :3], columns=["f1", "f2", "f3"]).describe().T)
# Recreate the 80/20 split from the clustering section (no validation set, no labels)
X_train_clus, X_test_clus = train_test_split(X_blobs, test_size=0.2, random_state=42)
summarize_split("Clustering", X_train_clus, X_test=X_test_clus)

# ================================================================
# Forecasting
# ================================================================
print_section("Forecasting — Synthetic Time Series")
# Recreate the time series and the chronological 70/15/15 split from the forecasting section
dates = pd.date_range(start="2020-01-01", periods=1000, freq="D")
data_ts = pd.DataFrame({"date": dates, "value": np.sin(np.arange(1000) / 50) + np.random.randn(1000)*0.1})
train_size_ts = int(len(data_ts) * 0.7)
val_size_ts = int(len(data_ts) * 0.15)

train_ts = data_ts.iloc[:train_size_ts]
val_ts = data_ts.iloc[train_size_ts:train_size_ts + val_size_ts]
test_ts = data_ts.iloc[train_size_ts + val_size_ts:]

print(data_ts.describe().T)
print("\nSplit Sizes:")
print(f"Train: {len(train_ts)} | Val: {len(val_ts)} | Test: {len(test_ts)}")
print("Mean/std per split:")
print(f"  Train: mean={train_ts['value'].mean():.3f}, std={train_ts['value'].std():.3f}")
print(f"  Val:   mean={val_ts['value'].mean():.3f}, std={val_ts['value'].std():.3f}")
print(f"  Test:  mean={test_ts['value'].mean():.3f}, std={test_ts['value'].std():.3f}")

# ================================================================
# Done
# ================================================================
print_section("All EDA Comparisons Completed")


📘 Regression — California Housing
                count      mean       std      min       25%       50%       75%        max
MedInc     20,640.000     3.871     1.900    0.500     2.563     3.535     4.743     15.000
HouseAge   20,640.000    28.639    12.586    1.000    18.000    29.000    37.000     52.000
AveRooms   20,640.000     5.429     2.474    0.846     4.441     5.229     6.052    141.909
AveBedrms  20,640.000     1.097     0.474    0.333     1.006     1.049     1.100     34.067
Population 20,640.000 1,425.477 1,132.462    3.000   787.000 1,166.000 1,725.000 35,682.000
AveOccup   20,640.000     3.071    10.386    0.692     2.430     2.818     3.282  1,243.333
Latitude   20,640.000    35.632     2.136   32.540    33.930    34.260    37.710     41.950
Longitude  20,640.000  -119.570     2.004 -124.350  -121.800  -118.490  -118.010   -114.310

