# Introduction to MLOps in Jupyter Notebooks

Run each code sample below, then work through the questions that follow it.

**Assumptions**: Participants have basic familiarity with Python, Pandas, Scikit-learn, and Jupyter Notebook usage.

## Exercise 1: Reproducibility
### Sample A:

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
import random

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)
print(f"Data split with test_size=0.2, random_state=321")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Data scaled using StandardScaler")

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)
print(f"Trained KNeighborsClassifier with n_neighbors=5")

y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)

print(f"\n--- Results ---")
print(f"Parameters: test_size=0.2, random_state=321, n_neighbors=5")
print(f"Achieved Accuracy: {acc:.4f}")

Data split with test_size=0.2, random_state=321
Data scaled using StandardScaler
Trained KNeighborsClassifier with n_neighbors=5

--- Results ---
Parameters: test_size=0.2, random_state=321, n_neighbors=5
Achieved Accuracy: 0.9333


### *Questions*:
1. Run code sample A. Note the accuracy.
2. Change `test_size` from 0.2 to 0.3 and re-run all cells. How easy was it? What's the new accuracy?
3. How would you systematically try 5 different `random_state` values (e.g., 0, 42, 100, 123, 2024)?
4. How do you reliably track which parameters (`random_state`, `test_size`, `n_neighbors`) produced the best score?
5. Could a teammate easily understand and run this notebook to get the exact same result?

### Sample B:

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from typing import Dict, Any, Tuple
import numpy as np

CONFIG: Dict[str, Any] = {
    "data_source": "sklearn_iris",
    "test_size": 0.3,
    "random_state": 123,
    "stratify": True,
    "model_params": {
        "n_neighbors": 5,
    },
    "features": ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], # Explicit feature names
    "target": 'target'
}

print("--- Configuration ---")
for key, value in CONFIG.items():
    print(f"{key}: {value}")

# --- Helper Functions ---

def load_data(source: str, feature_cols: list, target_col: str) -> Tuple[pd.DataFrame, pd.Series]:
    """Loads data based on the source specified in config."""
    print(f"\nLoading data from: {source}")
    if source == "sklearn_iris":
        iris = load_iris()
        df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                          columns=feature_cols + [target_col])
        # Convert target to int if needed (already is for iris)
        df[target_col] = df[target_col].astype(int)
    else:
        raise ValueError(f"Unsupported data source: {source}")

    X = df[feature_cols]
    y = df[target_col]
    print(f"Features: {list(X.columns)}")
    print(f"Target: {target_col}")
    print(f"Data shape: {X.shape}, Target shape: {y.shape}")
    return X, y

def split_data(X: pd.DataFrame, y: pd.Series, test_size: float, random_state: int, stratify: bool) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Splits data into training and testing sets."""
    print(f"\nSplitting data: test_size={test_size}, random_state={random_state}, stratify={stratify}")
    stratify_col = y if stratify else None
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        stratify=stratify_col
    )
    print(f"Train shapes: X={X_train.shape}, y={y_train.shape}")
    print(f"Test shapes: X={X_test.shape}, y={y_test.shape}")
    return X_train, X_test, y_train, y_test

def preprocess_data(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray, StandardScaler]:
    """Scales the feature data using StandardScaler."""
    print("\nPreprocessing data using StandardScaler")
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print("Scaling complete.")
    return X_train_scaled, X_test_scaled, scaler

def train_model(X_train: np.ndarray, y_train: pd.Series, model_params: Dict[str, Any]) -> KNeighborsClassifier:
    """Trains a KNeighborsClassifier model."""
    print(f"\nTraining KNeighborsClassifier with params: {model_params}")
    model = KNeighborsClassifier(**model_params)
    model.fit(X_train, y_train)
    print("Training complete.")
    return model

def evaluate_model(model: KNeighborsClassifier, X_test: np.ndarray, y_test: pd.Series) -> float:
    """Evaluates the model and returns accuracy."""
    print("\nEvaluating model")
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Evaluation complete.")
    return acc

# --- Main Workflow ---

X, y = load_data(CONFIG["data_source"], CONFIG["features"], CONFIG["target"])

X_train, X_test, y_train, y_test = split_data(
    X, y,
    test_size=CONFIG["test_size"],
    random_state=CONFIG["random_state"],
    stratify=CONFIG["stratify"]
)

X_train_scaled, X_test_scaled, scaler_object = preprocess_data(X_train, X_test)

trained_model = train_model(X_train_scaled, y_train, CONFIG["model_params"])

accuracy = evaluate_model(trained_model, X_test_scaled, y_test)

print(f"\n--- Results ---")
print(f"Configuration Used: {CONFIG}")
print(f"Achieved Accuracy: {accuracy:.4f}")


--- Configuration ---
data_source: sklearn_iris
test_size: 0.3
random_state: 123
stratify: True
model_params: {'n_neighbors': 5}
features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target: target

Loading data from: sklearn_iris
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target: target
Data shape: (150, 4), Target shape: (150,)

Splitting data: test_size=0.3, random_state=123, stratify=True
Train shapes: X=(105, 4), y=(105,)
Test shapes: X=(45, 4), y=(45,)

Preprocessing data using StandardScaler
Scaling complete.

Training KNeighborsClassifier with params: {'n_neighbors': 5}
Training complete.

Evaluating model
Evaluation complete.

--- Results ---
Configuration Used: {'data_source': 'sklearn_iris', 'test_size': 0.3, 'random_state': 123, 'stratify': True, 'model_params': {'n_neighbors': 5}, 'features': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'target': 

### *Questions*:

1. Revisit the questions for code sample A.
2. How easy is it to change `test_size` or `random_state` now?
3. How would you systematically try 5 different `random_state` values?
4. How do you reliably track which parameters produced the best score?
5. What problems still remain?
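
One possible approach to questions 3 and 4 is sketched below. It is a minimal illustration, not the intended solution: `BASE` and `run_once` are hypothetical names, and the pipeline is a condensed version of Sample B (Iris data, `StandardScaler`, `KNeighborsClassifier`). Each run stores its full configuration next to its score, so the best parameters can be read from the table rather than from memory.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical base configuration, mirroring the values in Sample B.
BASE = {"test_size": 0.3, "random_state": 123, "n_neighbors": 5}


def run_once(config):
    """Split, scale, train, and evaluate once for the given configuration."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=config["test_size"],
        random_state=config["random_state"],
        stratify=y,
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = KNeighborsClassifier(n_neighbors=config["n_neighbors"])
    model.fit(X_train_scaled, y_train)
    return accuracy_score(y_test, model.predict(X_test_scaled))


# Try each seed and keep the configuration together with its score.
records = []
for seed in [0, 42, 100, 123, 2024]:
    config = {**BASE, "random_state": seed}
    records.append({**config, "accuracy": run_once(config)})

summary = pd.DataFrame(records).sort_values("accuracy", ascending=False)
print(summary.to_string(index=False))
```

The table lives only in memory, so it is lost when the kernel restarts. That is one of the problems question 5 points at.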

## Exercise 2: Tracking Experiments
### Sample C:

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from typing import Dict, Any, Tuple, List
import numpy as np
import joblib # For saving models (artifacts)
import os # For interacting with the file system
import time # To make filenames unique if runs are fast

print("--- Setup: Loading Libraries ---")
print(f"Pandas: {pd.__version__}")
import sklearn
print(f"Scikit-learn: {sklearn.__version__}")
print(f"Joblib: {joblib.__version__}")

# Define base settings, hyperparameters
BASE_CONFIG: Dict[str, Any] = {
    "data_source": "sklearn_iris",
    "test_size": 0.3,
    "random_state": 42, # Using a fixed state for splitting consistency across runs
    "stratify": True,
    "features": ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    "target": 'target',
    "artifact_dir": "manual_models" # Directory to save models
}

print("\n--- Base Configuration ---")
for key, value in BASE_CONFIG.items():
    print(f"{key}: {value}")

# Create artifact directory if it doesn't exist
if not os.path.exists(BASE_CONFIG["artifact_dir"]):
    os.makedirs(BASE_CONFIG["artifact_dir"])
    print(f"\nCreated directory for saving models: {BASE_CONFIG['artifact_dir']}")

def load_data(source: str, feature_cols: list, target_col: str) -> Tuple[pd.DataFrame, pd.Series]:
    """Loads data based on the source specified in config."""
    # print(f"\nLoading data from: {source}") # Reduced verbosity for loop
    if source == "sklearn_iris":
        iris = load_iris()
        df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                          columns=feature_cols + [target_col])
        df[target_col] = df[target_col].astype(int)
    else:
        raise ValueError(f"Unsupported data source: {source}")
    X = df[feature_cols]
    y = df[target_col]
    return X, y

def split_data(X: pd.DataFrame, y: pd.Series, test_size: float, random_state: int, stratify: bool) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Splits data into training and testing sets."""
    # print(f"\nSplitting data: test_size={test_size}, random_state={random_state}, stratify={stratify}")
    stratify_col = y if stratify else None
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        stratify=stratify_col
    )
    return X_train, X_test, y_train, y_test

def preprocess_data(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray, StandardScaler]:
    """Scales the feature data using StandardScaler."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler

# Updated train_model function for Decision Tree
def train_model(X_train: np.ndarray, y_train: pd.Series, model_params: Dict[str, Any]) -> DecisionTreeClassifier:
    """Trains a DecisionTreeClassifier model."""
    print(f"\nTraining DecisionTreeClassifier with params: {model_params}")
    # Ensure only valid DT parameters are passed
    valid_dt_params = {k: v for k, v in model_params.items() if k in DecisionTreeClassifier().get_params()}
    model = DecisionTreeClassifier(**valid_dt_params, random_state=BASE_CONFIG["random_state"]) # Add base random state for tree consistency
    model.fit(X_train, y_train)
    return model

def evaluate_model(model: DecisionTreeClassifier, X_test: np.ndarray, y_test: pd.Series) -> Dict[str, float]:
    """Evaluates the model and returns a dictionary of metrics (e.g., accuracy)."""
    print("\nEvaluating model")
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    # Could add more metrics here (precision, recall, f1, etc.)
    metrics = {"accuracy": acc}
    return metrics

# --- Experiment Setup ---
print("\n--- Defining Experiment Runs ---")

# Define parameter sets to try
experiment_params: List[Dict[str, Any]] = [
    {"model_type": "DecisionTree", "max_depth": 2, "criterion": "gini"},
    {"model_type": "DecisionTree", "max_depth": 3, "criterion": "gini"},
    {"model_type": "DecisionTree", "max_depth": 4, "criterion": "gini"},
    {"model_type": "DecisionTree", "max_depth": 5, "criterion": "gini"},
    {"model_type": "DecisionTree", "max_depth": None, "criterion": "gini"}, # None = no limit
    {"model_type": "DecisionTree", "max_depth": 3, "criterion": "entropy"},
    {"model_type": "DecisionTree", "max_depth": 5, "criterion": "entropy"},
]

print(f"Defined {len(experiment_params)} experiment runs.")

# List to store results from each run
results_list: List[Dict[str, Any]] = []

# --- Execute Experiment Loop ---
print("\n--- Starting Experiment Tracking Loop ---")

# 1. Load and Split Data (do this once outside the loop if splitting is consistent)
X, y = load_data(BASE_CONFIG["data_source"], BASE_CONFIG["features"], BASE_CONFIG["target"])
X_train, X_test, y_train, y_test = split_data(
    X, y,
    test_size=BASE_CONFIG["test_size"],
    random_state=BASE_CONFIG["random_state"],
    stratify=BASE_CONFIG["stratify"]
)
# 2. Preprocess Data (do this once based on the training split)
X_train_scaled, X_test_scaled, scaler = preprocess_data(X_train, X_test)
# (Optional) Save the scaler as an artifact too
scaler_filename = os.path.join(BASE_CONFIG["artifact_dir"], f"scaler_rs_{BASE_CONFIG['random_state']}.joblib")
joblib.dump(scaler, scaler_filename)
print(f"Saved scaler to {scaler_filename}")


# Loop through each parameter configuration
for i, params in enumerate(experiment_params):
    run_id = i + 1
    print(f"\n--- Running Experiment {run_id}/{len(experiment_params)} ---")
    print(f"Parameters: {params}")

    # 3. Train Model using current params
    model = train_model(X_train_scaled, y_train, params)

    # 4. Evaluate Model
    metrics = evaluate_model(model, X_test_scaled, y_test)
    print(f"Metrics: {metrics}")

    # 5. (Task) Store Parameters and Metrics
    run_result = {
        "run_id": run_id,
        **params, # Unpack parameter dictionary
        **metrics # Unpack metrics dictionary
    }
    results_list.append(run_result)

    # 6. Artifact Logging: Save the trained model
    try:
        # Create a descriptive filename
        metric_str = f"acc_{metrics['accuracy']:.3f}".replace('.', '_') # Make filename safe
        param_str = f"depth_{params.get('max_depth', 'None')}_crit_{params.get('criterion', 'na')}"
        timestamp = int(time.time()) # Available for uniqueness if needed; not currently part of the filename below
        model_filename = os.path.join(
            BASE_CONFIG["artifact_dir"],
            f"model_run_{run_id}_{param_str}_{metric_str}.joblib"
        )
        # Save the model object
        joblib.dump(model, model_filename)
        print(f"Saved model artifact to: {model_filename}")
        # Add artifact path to results for tracking
        run_result["model_artifact"] = model_filename
    except Exception as e:
        print(f"Error saving model for run {run_id}: {e}")
        run_result["model_artifact"] = None # Indicate saving failed

print("\n--- Experiment Loop Finished ---")

# --- Result Analysis ---
print("\n--- Analyzing Results ---")

# Task: Convert list of results into a Pandas DataFrame
results_df = pd.DataFrame(results_list)

# Task: Display and sort the DataFrame
print("Results Summary DataFrame:")
# Display all columns clearly
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000) # Adjust width for better console display
print(results_df)

print("\nResults Sorted by Accuracy (Descending):")
# Sort to find the best performing parameters based on accuracy
results_df_sorted = results_df.sort_values(by="accuracy", ascending=False)
print(results_df_sorted)

# Identify best run parameters
if not results_df_sorted.empty:
    best_run = results_df_sorted.iloc[0]
    print("\n--- Best Performing Run ---")
    print(f"Run ID: {best_run['run_id']}")
    print(f"Parameters: {best_run.filter(items=experiment_params[0].keys()).to_dict()}") # Show only hyperparams
    print(f"Accuracy: {best_run['accuracy']:.4f}")
    print(f"Model Artifact: {best_run['model_artifact']}")
else:
    print("\nNo results to analyze.")

print("\n--- Listing Saved Artifacts ---")
print(f"Files in '{BASE_CONFIG['artifact_dir']}':")
try:
    saved_files = os.listdir(BASE_CONFIG["artifact_dir"])
    if saved_files:
        for f in saved_files:
            print(f"- {f}")
    else:
        print("(No files found)")
except FileNotFoundError:
    print(f"Error: Directory '{BASE_CONFIG['artifact_dir']}' not found.")



--- Setup: Loading Libraries ---
Pandas: 2.2.2
Scikit-learn: 1.5.1
Joblib: 1.4.2

--- Base Configuration ---
data_source: sklearn_iris
test_size: 0.3
random_state: 42
stratify: True
features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target: target
artifact_dir: manual_models

Created directory for saving models: manual_models

--- Defining Experiment Runs ---
Defined 7 experiment runs.

--- Starting Experiment Tracking Loop ---
Saved scaler to manual_models\scaler_rs_42.joblib

--- Running Experiment 1/7 ---
Parameters: {'model_type': 'DecisionTree', 'max_depth': 2, 'criterion': 'gini'}

Training DecisionTreeClassifier with params: {'model_type': 'DecisionTree', 'max_depth': 2, 'criterion': 'gini'}

Evaluating model
Metrics: {'accuracy': 0.8888888888888888}
Saved model artifact to: manual_models\model_run_1_depth_2_crit_gini_acc_0_889.joblib

--- Running Experiment 2/7 ---
Parameters: {'model_type': 'DecisionTree', 'max_depth': 3, 'criterion': 

### Can you identify the key tasks in code sample C?
- Explicitly defined parameter sets to try.
- Looped through experiments, training, and evaluating for each set.
- Manually collected parameters and metrics into a list.
- Used Pandas to analyze the collected results.
- Manually saved model files (artifacts) with descriptive names.
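
A saved artifact is only useful if it can be loaded again and behaves the same. The sketch below is an assumption-laden illustration rather than part of Sample C: it trains one small decision tree on unscaled Iris data, writes it to a temporary directory with `joblib.dump`, reloads it with `joblib.load`, and checks that the predictions match. The file name `model_demo.joblib` is made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train one model (scaling is skipped here to keep the sketch short).
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Save to a throwaway directory so the example leaves manual_models untouched.
artifact_dir = tempfile.mkdtemp()
model_path = os.path.join(artifact_dir, "model_demo.joblib")
joblib.dump(model, model_path)

# Reload and confirm the artifact reproduces the original predictions.
reloaded = joblib.load(model_path)
same_predictions = np.array_equal(model.predict(X_test), reloaded.predict(X_test))
print(f"Reloaded model reproduces predictions: {same_predictions}")
print(f"Reloaded max_depth: {reloaded.get_params()['max_depth']}")
```

Note that the reloaded object carries its hyperparameters, but nothing about the data or preprocessing that produced it.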


## Exercise 3

In [47]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import warnings

warnings.filterwarnings('ignore', category=UserWarning, module='sklearn') # Suppress simple warnings for clarity

# --- Setup: Shared Model Configuration and Dummy Data Generation ---

# Fixed model configuration for all experiments in this activity
MODEL_PARAMS = {'solver': 'liblinear', 'random_state': 42, 'C': 1.0}
TEST_SIZE = 0.3
RANDOM_STATE = 42

# Generate Dummy Data Subsets
def generate_data(seed, num_samples=100, scale_factor=1.0, extra_noise_col=False):
    """Generates simple reproducible data."""
    np.random.seed(seed)
    X = pd.DataFrame({
        'feature1': np.random.rand(num_samples) * 10 * scale_factor,
        'feature2': np.random.rand(num_samples) * 5 * scale_factor
    })
    if extra_noise_col:
        X['noise_feature'] = np.random.randn(num_samples) * 0.1

    # Simple target variable based on feature1
    y = (X['feature1'] > (5 * scale_factor)).astype(int)
    return X, y

# Create two distinct data versions
print("\n--- Generating Simulated Data Versions ---")
data_A_X, data_A_y = generate_data(seed=1, num_samples=100, scale_factor=1.0)
data_B_X, data_B_y = generate_data(seed=1, num_samples=120, scale_factor=1.2) # More samples, different scale

# Save dummy data to CSV (optional, can use DataFrames directly)
data_A_X.assign(target=data_A_y).to_csv("data_subset_A.csv", index=False)
data_B_X.assign(target=data_B_y).to_csv("data_subset_B.csv", index=False)
print("Created 'data_subset_A.csv' (100 samples, scale 1.0)")
print("Created 'data_subset_B.csv' (120 samples, scale 1.2)")


# --- Part 1: Code Change Impact ---
print("\n--- Part 1: Impact of Code Changes (Preprocessing Logic) ---")

# Define preprocessing function v1
def preprocess_v1(df: pd.DataFrame) -> pd.DataFrame:
    """Version 1: Scales only feature1 using MinMaxScaler."""
    print("Using preprocess_v1: Scaling feature1 with MinMaxScaler.")
    scaler = MinMaxScaler()
    df_processed = df.copy()
    df_processed['feature1'] = scaler.fit_transform(df_processed[['feature1']])
    # feature2 is untouched
    return df_processed

# Define preprocessing function v2
def preprocess_v2(df: pd.DataFrame) -> pd.DataFrame:
    """Version 2: Scales both features using StandardScaler."""
    print("Using preprocess_v2: Scaling feature1 & feature2 with StandardScaler.")
    scaler = StandardScaler()
    df_processed = df.copy()
    # Apply scaling to all columns present
    df_processed[df_processed.columns] = scaler.fit_transform(df_processed)
    return df_processed

# -- Run with preprocess_v1 --
print("\nRunning workflow with preprocess_v1 using data_subset_A...")
X_train, X_test, y_train, y_test = train_test_split(data_A_X, data_A_y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
X_train_p1 = preprocess_v1(X_train)
X_test_p1 = preprocess_v1(X_test) # In practice, fit on train, transform test

model_v1 = LogisticRegression(**MODEL_PARAMS)
model_v1.fit(X_train_p1, y_train)
y_pred_v1 = model_v1.predict(X_test_p1)
score_S1 = accuracy_score(y_test, y_pred_v1)
print(f"Score (S1) with preprocess_v1: {score_S1:.4f}")

# -- Run with preprocess_v2 --
print("\nRunning workflow with preprocess_v2 using data_subset_A...")
X_train, X_test, y_train, y_test = train_test_split(data_A_X, data_A_y, test_size=TEST_SIZE, random_state=RANDOM_STATE) # Re-split same data
X_train_p2 = preprocess_v2(X_train)
X_test_p2 = preprocess_v2(X_test) # In practice, fit on train, transform test

model_v2 = LogisticRegression(**MODEL_PARAMS) # Same model config
model_v2.fit(X_train_p2, y_train)
y_pred_v2 = model_v2.predict(X_test_p2)
score_S2 = accuracy_score(y_test, y_pred_v2)
print(f"Score (S2) with preprocess_v2: {score_S2:.4f}")




--- Generating Simulated Data Versions ---
Created 'data_subset_A.csv' (100 samples, scale 1.0)
Created 'data_subset_B.csv' (120 samples, scale 1.2)

--- Part 1: Impact of Code Changes (Preprocessing Logic) ---

Running workflow with preprocess_v1 using data_subset_A...
Using preprocess_v1: Scaling feature1 with MinMaxScaler.
Using preprocess_v1: Scaling feature1 with MinMaxScaler.
Score (S1) with preprocess_v1: 0.9000

Running workflow with preprocess_v2 using data_subset_A...
Using preprocess_v2: Scaling feature1 & feature2 with StandardScaler.
Using preprocess_v2: Scaling feature1 & feature2 with StandardScaler.
Score (S2) with preprocess_v2: 0.9667
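
The comments in the cell above note that, in practice, a scaler should be fit on the training split and then reused on the test split, whereas `preprocess_v1` and `preprocess_v2` refit a new scaler on every frame they receive. A minimal sketch of the train-fitted variant follows. It uses toy data in the style of `generate_data` and the same model settings; it is an illustration under those assumptions, and its score is not expected to match S1 or S2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data in the style of generate_data (two uniform features, threshold target).
rng = np.random.RandomState(1)
X = pd.DataFrame({
    "feature1": rng.rand(100) * 10,
    "feature2": rng.rand(100) * 5,
})
y = (X["feature1"] > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

model = LogisticRegression(solver="liblinear", random_state=42, C=1.0)
model.fit(X_train_scaled, y_train)
score = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Accuracy with train-fitted scaler: {score:.4f}")
```

Because the test rows are transformed with the training statistics, the test set never influences the preprocessing.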


### *Questions*:
Notice the scores differ: S1 (v1) vs S2 (v2), even with the same data and model parameters.

If you just saw Score S2 months later, how would you know *exactly* which preprocessing logic (v1 or v2) was used?

How would you revert back to the code that produced S1 if v2 performed worse?
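
One way to make these questions answerable is to store a code identifier with every score. The sketch below is a hedged illustration: `current_git_commit` and `make_run_record` are hypothetical helpers, and the approach assumes the notebook or script lives in a Git repository. It records the commit hash from `git rev-parse HEAD` together with an explicit preprocessing label, and falls back to `None` when Git or a repository is not available.

```python
import subprocess


def current_git_commit():
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return result.stdout.strip()


def make_run_record(score, preprocess_version):
    """Bundle a score with the identifiers needed to trace it back to code."""
    return {
        "score": score,
        "preprocess_version": preprocess_version,  # explicit label such as "v1" or "v2"
        "git_commit": current_git_commit(),
    }


# Example: the S2 score from the output above, tagged with the logic that produced it.
record = make_run_record(score=0.9667, preprocess_version="v2")
print(record)
```

With a commit hash on record, returning to the code behind S1 becomes a matter of checking out that commit, provided the code was committed before the run.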


In [51]:
# -- Run with data_subset_A --
print("\nRunning workflow with preprocess_v1 using data_subset_A...")
# We can reuse the results from Part 1 or recalculate for clarity:
X_train, X_test, y_train, y_test = train_test_split(data_A_X, data_A_y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
X_train_p1_dataA = preprocess_v1(X_train)
X_test_p1_dataA = preprocess_v1(X_test) # Fit on train, transform test ideally
model_dataA = LogisticRegression(**MODEL_PARAMS)
model_dataA.fit(X_train_p1_dataA, y_train)
y_pred_dataA = model_dataA.predict(X_test_p1_dataA)
score_S1_data = accuracy_score(y_test, y_pred_dataA)
print(f"Score (S1_data) with preprocess_v1 and data_A: {score_S1_data:.4f}")

# -- Run with data_subset_B --
print("\nRunning workflow with preprocess_v1 using data_subset_B...")
# Now use data_B but the SAME preprocessing code (v1) and model params
X_train, X_test, y_train, y_test = train_test_split(data_B_X, data_B_y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
X_train_p1_dataB = preprocess_v1(X_train)
X_test_p1_dataB = preprocess_v1(X_test) # Fit on train, transform test ideally
model_dataB = LogisticRegression(**MODEL_PARAMS) # Same model config
model_dataB.fit(X_train_p1_dataB, y_train)
y_pred_dataB = model_dataB.predict(X_test_p1_dataB)
score_S2_data = accuracy_score(y_test, y_pred_dataB)
print(f"Score (S2_data) with preprocess_v1 and data_B: {score_S2_data:.4f}")



Running workflow with preprocess_v1 using data_subset_A...
Using preprocess_v1: Scaling feature1 with MinMaxScaler.
Using preprocess_v1: Scaling feature1 with MinMaxScaler.
Score (S1_data) with preprocess_v1 and data_A: 0.9000

Running workflow with preprocess_v1 using data_subset_B...
Using preprocess_v1: Scaling feature1 with MinMaxScaler.
Using preprocess_v1: Scaling feature1 with MinMaxScaler.
Score (S2_data) with preprocess_v1 and data_B: 0.8889


### *Questions*:
Notice the scores differ: S1_data (Data A) vs S2_data (Data B), even with the same code and model parameters.

If you got the result S2_data, how would you know *exactly* which dataset ('data_subset_A.csv' or 'data_subset_B.csv') was used for training?

How could you reproduce the S1_data result if 'data_subset_A.csv' was accidentally overwritten or deleted?
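
A content hash is one simple way to identify a dataset version. The sketch below is a minimal illustration, not a replacement for a data versioning tool: `file_sha256` is a hypothetical helper, `generate_data` is a trimmed copy of the exercise's function that returns features and target in one frame, and files are written to a temporary directory. It fingerprints both CSV files, then simulates losing file A and regenerating it from the recorded seed and settings.

```python
import hashlib
import os
import tempfile

import numpy as np
import pandas as pd


def generate_data(seed, num_samples=100, scale_factor=1.0):
    """Same recipe as the exercise's generate_data, without the optional noise column."""
    np.random.seed(seed)
    X = pd.DataFrame({
        "feature1": np.random.rand(num_samples) * 10 * scale_factor,
        "feature2": np.random.rand(num_samples) * 5 * scale_factor,
    })
    y = (X["feature1"] > (5 * scale_factor)).astype(int)
    return X.assign(target=y)


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


workdir = tempfile.mkdtemp()
path_a = os.path.join(workdir, "data_subset_A.csv")
path_b = os.path.join(workdir, "data_subset_B.csv")
generate_data(seed=1, num_samples=100, scale_factor=1.0).to_csv(path_a, index=False)
generate_data(seed=1, num_samples=120, scale_factor=1.2).to_csv(path_b, index=False)

hash_a = file_sha256(path_a)
hash_b = file_sha256(path_b)
print(f"A: {hash_a[:12]}...  B: {hash_b[:12]}...")

# Simulate losing file A, then rebuild it from the recorded generation settings.
os.remove(path_a)
generate_data(seed=1, num_samples=100, scale_factor=1.0).to_csv(path_a, index=False)
restored_matches = file_sha256(path_a) == hash_a
print(f"Regenerated file matches recorded hash: {restored_matches}")
```

Regeneration works here only because the data is synthetic and seeded. Real datasets usually need the file itself to be stored and versioned.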


## Further discussion

In Exercise 2, we saved models like 'model_run_4_depth_5_crit_gini_acc_0_933.joblib'.

Now imagine finding such a file months later.

How can you reliably determine:
- Which *exact version* of the preprocessing code (like v1 or v2) was used?
- Which *exact version* of the dataset (like data_A or data_B) was used?
- Which *exact set* of hyperparameters was used? (Some are encoded in the filename, but that is neither complete nor verified.)

Simply having the saved model file is often not enough for reproducibility or debugging.
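
One lightweight option is to write a metadata file next to each model. The sketch below illustrates the idea under stated assumptions: the field names, the `"v1"` label, and the placeholder data hash are all hypothetical, and the model is a small decision tree trained on Iris purely so there is something to save.

```python
import json
import os
import platform
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
params = {"max_depth": 5, "criterion": "gini", "random_state": 42}
model = DecisionTreeClassifier(**params).fit(X, y)

artifact_dir = tempfile.mkdtemp()
model_path = os.path.join(artifact_dir, "model_demo.joblib")
joblib.dump(model, model_path)

# Everything needed to trace the model back to its code, data, and settings.
metadata = {
    "model_file": os.path.basename(model_path),
    "hyperparameters": params,
    "preprocess_version": "v1",                    # hypothetical label
    "data_sha256": "<hash of the training file>",  # placeholder, filled in by a real run
    "python_version": platform.python_version(),
    "sklearn_version": sklearn.__version__,
}

# Store the metadata under the same base name as the model file.
metadata_path = os.path.splitext(model_path)[0] + ".json"
with open(metadata_path, "w") as handle:
    json.dump(metadata, handle, indent=2)

with open(metadata_path) as handle:
    restored = json.load(handle)
print(json.dumps(restored, indent=2))
```

Keeping the two files under the same base name makes it harder for the model and its description to drift apart, though nothing enforces it.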


## Takeaway

ML results are highly sensitive to changes in CODE (e.g., preprocessing, model architecture), DATA (e.g., samples, features, distribution), and PARAMETERS.
Without robust versioning of all these components, it's extremely difficult to:

- Reproduce past results.
- Understand why performance changed.
- Debug issues reliably.
- Collaborate effectively.

This simulation motivates the use of dedicated version control tools (like Git for code, DVC for data) and experiment tracking platforms (like MLflow) in MLOps workflows.