# Week 1: Introduction to MLOps & Version Control

## 📚 Notebook Overview
This notebook demonstrates a complete Machine Learning pipeline using the Iris dataset.

**Key Concepts Covered:**
- Data loading and preprocessing
- Feature engineering
- Model training with RandomForest
- Hyperparameter tuning with Hyperopt (TPE algorithm)
- Model evaluation
- Scikit-learn Pipelines for reproducibility

**Dataset:** Iris Dataset (150 samples, 4 features, 3 classes)
- Features: Sepal Length, Sepal Width, Petal Length, Petal Width
- Target: Species (Setosa, Versicolor, Virginica)
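
If `Iris.csv` is not at hand, a table of the same shape can be rebuilt from the copy of Iris bundled with scikit-learn. The sketch below is a stand-in, not this notebook's loader: the column names (`Id`, `SepalLengthCm`, ..., `Species`) and the `Iris-` label prefix are assumptions chosen to mimic the CSV used later and may not match your file exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of Iris that ships with scikit-learn
iris = load_iris()

# Column names are assumed, chosen to mirror the CSV used in this notebook
feature_columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
iris_df = pd.DataFrame(iris.data, columns=feature_columns)

# Map integer class codes (0, 1, 2) to species names with an assumed "Iris-" prefix
iris_df["Species"] = ["Iris-" + str(iris.target_names[code]) for code in iris.target]

# Add a 1-based Id column at the front, as in the CSV
iris_df.insert(0, "Id", list(range(1, len(iris_df) + 1)))

print(iris_df.shape)
print(iris_df["Species"].value_counts())
```

The resulting frame has 150 rows, four measurement columns, and three species labels, which matches the dataset description above.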

In [1]:
# ============================================================================
# IMPORT LIBRARIES
# ============================================================================
# Data manipulation and analysis
import pandas as pd  # For loading and manipulating tabular data (DataFrames)

# Scikit-learn: Machine Learning library
from sklearn.model_selection import train_test_split  # Split data into train/test sets
from sklearn.model_selection import cross_val_score   # Perform k-fold cross-validation
from sklearn.ensemble import RandomForestClassifier   # Random Forest algorithm for classification
from sklearn.metrics import accuracy_score            # Calculate accuracy metric

# Hyperopt: Hyperparameter optimization library
from hyperopt import fmin      # Function minimization (optimization)
from hyperopt import tpe       # Tree of Parzen Estimators algorithm (smart search)
from hyperopt import hp        # Define hyperparameter search spaces

# Scikit-learn Pipeline components for building reproducible ML workflows
from sklearn.pipeline import Pipeline              # Create sequential transformation pipeline
from sklearn.compose import ColumnTransformer      # Apply different transformations to different columns
from sklearn.preprocessing import StandardScaler   # Standardize features (mean=0, std=1)
from sklearn.impute import SimpleImputer           # Handle missing values (not used in this example)



## 🔧 Helper Functions

These modular functions make our code:
- **Reusable**: Can be called multiple times
- **Testable**: Each function can be tested independently
- **Maintainable**: Easy to update and debug
- **Production-ready**: Follows software engineering best practices
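
To make the "testable" point concrete, here is a minimal sketch of a training helper that receives its data as arguments instead of reading global variables. The name `train_classifier_explicit` is hypothetical, and the data comes from scikit-learn's bundled Iris copy, so this illustrates the idea rather than reproducing the notebook's own function.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_classifier_explicit(X, y, test_size=0.2, random_state=42):
    """Hypothetical helper: train a Random Forest and return (model, accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # A fixed seed on the model makes repeated calls comparable
    model = RandomForestClassifier(n_estimators=20, random_state=random_state)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


# Because the function has no hidden inputs, a test can call it with any data
X_demo, y_demo = load_iris(return_X_y=True)
model_a, acc_a = train_classifier_explicit(X_demo, y_demo)
model_b, acc_b = train_classifier_explicit(X_demo, y_demo)

# Same inputs and same seed should give the same accuracy
print(acc_a == acc_b)
```

Passing `X` and `y` explicitly also means the function can be exercised on a tiny synthetic dataset in a unit test, without loading the CSV first.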

In [2]:
# ============================================================================
# FUNCTION 1: READ CSV FILE
# ============================================================================
def read_csv(file_path):
    """
    Load data from a CSV file into a pandas DataFrame.
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV file (relative or absolute)
    
    Returns:
    --------
    pd.DataFrame
        Loaded data as a pandas DataFrame
    
    Example:
    --------
    >>> data = read_csv('Iris.csv')
    """
    # pd.read_csv() reads CSV files and automatically infers data types
    return pd.read_csv(file_path)


# ============================================================================
# FUNCTION 2: CREATE FEATURES (FEATURE ENGINEERING)
# ============================================================================
def create_features(data):
    """
    Perform feature engineering on the dataset.
    
    Feature engineering is the process of creating new features from existing ones
    to improve model performance. Examples include:
    - Creating interaction features (e.g., sepal_length * sepal_width)
    - Polynomial features (e.g., sepal_length^2)
    - Binning continuous variables
    - Encoding categorical variables
    
    Parameters:
    -----------
    data : pd.DataFrame
        Input dataset
    
    Returns:
    --------
    pd.DataFrame
        Dataset with engineered features
    
    Note:
    -----
    For this Iris dataset example, we don't create additional features
    because the original 4 features are already highly predictive.
    In real-world scenarios, feature engineering is crucial for model performance.
    """
    # No feature creation for this simple example
    # The Iris dataset's original features are sufficient for classification
    return data


# ============================================================================
# FUNCTION 3: TRAIN CLASSIFIER MODEL
# ============================================================================
def train_classifier(data):
    """
    Train a Random Forest classifier on the provided data.
    
    This function performs:
    1. Train-test split (80-20 split)
    2. Model initialization
    3. Model training
    4. Prediction on test set
    5. Accuracy calculation
    
    Parameters:
    -----------
    data : Not used directly (uses global X, y variables)
    
    Returns:
    --------
    tuple
        (trained_model, accuracy_score)
    
    Note:
    -----
    This function assumes X and y are defined globally.
    In production, you should pass X and y as parameters.
    """
    # Split data into training (80%) and testing (20%) sets
    # random_state=42 ensures reproducibility (same split every time)
    # X_train: Features for training
    # X_test: Features for testing
    # y_train: Target labels for training
    # y_test: Target labels for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X,                    # Feature matrix
        y,                    # Target vector
        test_size=0.2,        # 20% of data for testing
        random_state=42       # Seed for reproducibility
    )

    # Initialize Random Forest Classifier with default parameters
    # Random Forest is an ensemble method that builds multiple decision trees
    # and combines their predictions (voting for classification)
    # Default parameters: n_estimators=100, max_depth=None, etc.
    model = RandomForestClassifier()
    
    # Train the model on training data
    # .fit() learns patterns from X_train to predict y_train
    model.fit(X_train, y_train)

    # Make predictions on the test set
    # .predict() uses the trained model to predict labels for X_test
    y_pred = model.predict(X_test)
    
    # Calculate accuracy: correct predictions / total predictions
    # (with three classes, this is the fraction of exact label matches)
    accuracy = accuracy_score(y_test, y_pred)

    # Return both the trained model and its accuracy
    return model, accuracy


# ============================================================================
# FUNCTION 4: OBJECTIVE FUNCTION FOR HYPERPARAMETER TUNING
# ============================================================================
def objective(params):
    """
    Objective function for Hyperopt optimization.
    
    This function is called by Hyperopt's optimization algorithm (TPE)
    to evaluate different hyperparameter combinations.
    
    How it works:
    1. Receives a set of hyperparameters from Hyperopt
    2. Trains a model with those hyperparameters
    3. Evaluates the model using cross-validation
    4. Returns a score (lower is better for fmin)
    
    Parameters:
    -----------
    params : dict
        Dictionary of hyperparameters to test
        Example: {'n_estimators': 50, 'max_depth': 10}
    
    Returns:
    --------
    float
        Negative mean cross-validation accuracy
        (negative because fmin minimizes, but we want to maximize accuracy)
    
    Note:
    -----
    Cross-validation provides a more robust estimate of model performance
    than a single train-test split.
    """
    # Create a Random Forest model with the hyperparameters provided by Hyperopt
    # **params unpacks the dictionary: {'n_estimators': 50} -> n_estimators=50
    model = RandomForestClassifier(**params)
    
    # Perform 5-fold cross-validation
    # Cross-validation splits data into 5 parts:
    # - Train on 4 parts, validate on 1 part
    # - Repeat 5 times (each part used as validation once)
    # - Returns 5 accuracy scores
    # .mean() calculates the average of these 5 scores
    score = cross_val_score(
        model,      # Model to evaluate
        X,          # Feature matrix
        y,          # Target vector
        cv=5        # Number of folds (5-fold cross-validation)
    ).mean()        # Average accuracy across all folds
    
    # Return negative score because fmin() MINIMIZES the objective function
    # We want to MAXIMIZE accuracy, so we minimize negative accuracy
    # Example: accuracy=0.95 -> return -0.95 (lower is better for fmin)
    return -score


# ============================================================================
# FUNCTION 5: EVALUATE MODEL ON TEST SET
# ============================================================================
def evaluate_model(model, X_test, y_test):
    """
    Evaluate a trained model on the test set.
    
    This function provides the final performance metric on unseen data.
    
    Parameters:
    -----------
    model : sklearn estimator
        Trained machine learning model
    X_test : pd.DataFrame or np.array
        Test features
    y_test : pd.Series or np.array
        True labels for test set
    
    Returns:
    --------
    float
        Accuracy score (0.0 to 1.0)
    
    Example:
    --------
    >>> accuracy = evaluate_model(model, X_test, y_test)
    >>> print(f"Test Accuracy: {accuracy:.2%}")  # Output: Test Accuracy: 96.67%
    """
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    
    # Calculate and return accuracy
    # Accuracy = (Number of correct predictions) / (Total predictions)
    accuracy = accuracy_score(y_test, y_pred)
    
    return accuracy

## 📊 Data Loading

Loading the Iris dataset from CSV file.

**Note:** The file path has been updated to use the correct location (`Iris.csv` in the current directory).

In [3]:
# ============================================================================
# LOAD DATA FROM CSV FILE
# ============================================================================

# Define the file path to the Iris dataset
# Using relative path: file is in the same directory as this notebook
file_path = "Iris.csv"

# Load the CSV file into a pandas DataFrame using our custom function
# This reads the entire dataset into memory
data = read_csv(file_path)

# Display the first few rows to verify data loaded correctly
print("Dataset loaded successfully!")
print(f"Shape: {data.shape}")  # (rows, columns)
print(f"\nFirst 5 rows:")
data.head()  # Shows first 5 rows by default

Dataset loaded successfully!
Shape: (150, 6)

First 5 rows:


Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


## 🔍 Data Exploration

Let's examine the dataset structure and contents.

In [4]:
# ============================================================================
# EXPLORE THE DATASET
# ============================================================================

# Display basic information about the dataset
print("Dataset Information:")
print("=" * 50)

# Show dataset dimensions
print(f"Number of samples (rows): {data.shape[0]}")
print(f"Number of features (columns): {data.shape[1]}")

# Show column names and data types
print(f"\nColumn names: {list(data.columns)}")

# Check for missing values
print(f"\nMissing values per column:")
print(data.isnull().sum())

# Show class distribution (how many samples per species)
print(f"\nClass distribution:")
print(data['Species'].value_counts())

# Display the full dataset
data

Dataset Information:
Number of samples (rows): 150
Number of features (columns): 6

Column names: ['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']

Missing values per column:
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

Class distribution:
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64


Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


## 🔄 ML Pipeline Implementation

Building a complete machine learning pipeline with:
1. **Data Preprocessing**: Feature scaling
2. **Model Training**: Random Forest Classifier
3. **Hyperparameter Tuning**: Hyperopt with TPE algorithm

### Why Use Pipelines?
- **Reproducibility**: Same transformations applied consistently
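- **Prevents Data Leakage**: Preprocessing statistics (such as the scaler's mean and standard deviation) are fitted on the training data only and then reused unchanged on the test data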
- **Simplifies Deployment**: Single object contains all steps
- **Easy to Version Control**: Entire workflow in one object
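
The data-leakage point can be checked directly. In the sketch below, which uses scikit-learn's bundled Iris copy and a plain `StandardScaler` step rather than a `ColumnTransformer`, the scaler inside the fitted pipeline holds the mean of the training rows only, so test rows never influence preprocessing.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])

# fit() sees only the training rows, so the scaler statistics come from them alone
demo_pipeline.fit(X_tr, y_tr)

fitted_scaler = demo_pipeline.named_steps["scaler"]
uses_train_mean = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print("Scaler mean equals training-set mean:", uses_train_mean)

# predict() and score() reuse those stored statistics on the test rows
test_accuracy = demo_pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Scaling the full dataset before splitting would instead mix test-set statistics into training, which is the leak the pipeline avoids.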

In [5]:
# ============================================================================
# MAIN PIPELINE EXECUTION
# ============================================================================

# Step 1: Apply feature engineering (if any)
# ----------------------------------------
# In this case, no new features are created, but this step is included
# to demonstrate a complete ML workflow
data = create_features(data)


# Step 2: Prepare Features (X) and Target (y)
# ----------------------------------------
# Separate the dataset into:
# - X: Feature matrix (all columns except 'Species' and 'Id')
# - y: Target vector (only 'Species' column)

# Drop 'Species' (target) and 'Id' (not a feature) columns to create feature matrix
# axis=1 means drop columns (axis=0 would drop rows)
X = data.drop(['Species', 'Id'], axis=1)

# Extract only the 'Species' column as the target variable
y = data['Species']

# Display shapes to verify correct separation
print(f"Feature matrix (X) shape: {X.shape}")  # Should be (150, 4)
print(f"Target vector (y) shape: {y.shape}")   # Should be (150,)
print(f"\nFeature columns: {list(X.columns)}")


# Step 3: Split Data into Training and Test Sets
# ----------------------------------------
# Split ratio: 80% training, 20% testing
# random_state=42 ensures the same split every time (reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X,                    # Feature matrix
    y,                    # Target vector
    test_size=0.2,        # 20% of data reserved for testing (30 samples)
    random_state=42       # Seed for reproducibility
)

print(f"\nTraining set size: {X_train.shape[0]} samples")  # 120 samples
print(f"Test set size: {X_test.shape[0]} samples")        # 30 samples


# Step 4: Define the ML Pipeline
# ----------------------------------------
# A pipeline chains multiple steps together:
# 1. Preprocessing (StandardScaler)
# 2. Model (RandomForestClassifier)

pipeline = Pipeline([
    # First step: Preprocessing
    # ColumnTransformer allows different transformations for different columns
    ('preprocessor', ColumnTransformer(
        transformers=[
            # Apply StandardScaler to all numeric columns
            # StandardScaler: transforms features to have mean=0 and std=1
            # Formula: z = (x - mean) / std
            # This ensures all features are on the same scale
            ('num', StandardScaler(), X.columns)
        ],
        # 'passthrough' means any columns not specified are passed unchanged
        remainder='passthrough'
    )),
    
    # Second step: Classification model
    # RandomForestClassifier with default hyperparameters
    # Default: n_estimators=100, max_depth=None, min_samples_split=2, etc.
    ('classifier', RandomForestClassifier())
])

print("\nPipeline created successfully!")
print("Pipeline steps:")
for step_name, step_obj in pipeline.steps:
    print(f"  - {step_name}: {type(step_obj).__name__}")


# Step 5: Train the Pipeline
# ----------------------------------------
# .fit() trains the entire pipeline:
# 1. Fits StandardScaler on X_train (calculates mean and std)
# 2. Transforms X_train using the fitted scaler
# 3. Trains RandomForest on the scaled data
print("\nTraining the model...")
pipeline.fit(X_train, y_train)
print("Training complete!")


# Step 6: Make Predictions on Test Set
# ----------------------------------------
# .predict() applies the entire pipeline:
# 1. Scales X_test using the SAME scaler fitted on X_train
# 2. Makes predictions using the trained RandomForest
y_pred = pipeline.predict(X_test)


# Step 7: Evaluate Model Performance
# ----------------------------------------
# Calculate accuracy: percentage of correct predictions
accuracy = accuracy_score(y_test, y_pred)

# Display results
print("\n" + "="*50)
print("MODEL EVALUATION RESULTS")
print("="*50)
print(f"Model accuracy on test set: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Correct predictions: {round(accuracy * len(y_test))}/{len(y_test)}")  # round() avoids float truncation (e.g., 28.999... -> 28)
print("="*50)

Feature matrix (X) shape: (150, 4)
Target vector (y) shape: (150,)

Feature columns: ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

Training set size: 120 samples
Test set size: 30 samples

Pipeline created successfully!
Pipeline steps:
  - preprocessor: ColumnTransformer
  - classifier: RandomForestClassifier

Training the model...
Training complete!

MODEL EVALUATION RESULTS
Model accuracy on test set: 1.0000 (100.00%)
Correct predictions: 30/30


## 🎯 Hyperparameter Tuning with Hyperopt

### What is Hyperparameter Tuning?
Hyperparameters are settings that control the learning process (e.g., number of trees, tree depth).
Unlike model parameters (learned from data), hyperparameters must be set before training.
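
The distinction can be seen on a scikit-learn estimator. In this small sketch (bundled Iris copy, arbitrary illustrative values), the hyperparameters are readable before training, while the learned trees only exist after `fit()`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Hyperparameters: chosen by us, available before any training happens
forest = RandomForestClassifier(n_estimators=15, max_depth=3, random_state=0)
hyperparameters = forest.get_params()
print(hyperparameters["n_estimators"], hyperparameters["max_depth"])

# Learned state: created by fit(); scikit-learn marks it with a trailing underscore
has_trees_before_fit = hasattr(forest, "estimators_")
forest.fit(X_demo, y_demo)
number_of_fitted_trees = len(forest.estimators_)
print(has_trees_before_fit, number_of_fitted_trees)
```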

### Why Use Hyperopt?
- **Smarter than Grid Search**: Uses Bayesian optimization (learns from previous trials)
- **Faster**: Doesn't try every combination
- **TPE Algorithm**: Tree of Parzen Estimators - models P(x|y) and P(y)

### Hyperparameters We're Tuning:
1. **n_estimators**: Number of trees in the forest (10-100)
2. **max_depth**: Maximum depth of each tree (1-20)

### Alternative Approach (Commented):
Brute force with nested loops would try ALL combinations:
- 91 values for n_estimators × 20 values for max_depth = 1,820 combinations!
- Hyperopt samples the space adaptively and evaluates only 100 combinations (`max_evals=100`)
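
The count above can be reproduced in plain Python. The same snippet also illustrates a detail that matters when reading Hyperopt results: for a search space built with `hp.choice`, `fmin` reports the 0-based position of the chosen option rather than the option itself, so the position has to be mapped back to a value (Hyperopt's `space_eval` does this for a real search space). The index values `43` and `12` below are illustrative inputs only.

```python
from itertools import product

n_estimators_options = range(10, 101)  # 10, 11, ..., 100
max_depth_options = range(1, 21)       # 1, 2, ..., 20

# Exhaustive grid search would evaluate every (n_estimators, max_depth) pair
grid = list(product(n_estimators_options, max_depth_options))
print(len(n_estimators_options), len(max_depth_options), len(grid))

# Map option positions (as reported for hp.choice) back to actual values
example_indices = {"n_estimators": 43, "max_depth": 12}  # illustrative only
example_values = {
    "n_estimators": n_estimators_options[example_indices["n_estimators"]],
    "max_depth": max_depth_options[example_indices["max_depth"]],
}
print(example_values)
```

Because both ranges start above zero, an index and the value it stands for differ, which is easy to overlook when the reported numbers happen to look plausible.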

In [6]:
# ============================================================================
# HYPERPARAMETER TUNING WITH HYPEROPT
# ============================================================================

# Note: Brute force approach (commented below) would be:
# for n_est in range(10, 101):           # 91 values
#     for max_d in range(1, 21):         # 20 values
#         # Train and evaluate model     # Total: 91 × 20 = 1,820 iterations!
# This is computationally expensive and inefficient.

print("Starting Hyperparameter Optimization...")
print("="*50)

# Define the search space for hyperparameters
# ----------------------------------------
# hp.choice() samples from a discrete list of values
# range(10, 101) creates [10, 11, 12, ..., 100]
space = {
    # Number of trees in the forest: try values from 10 to 100
    # More trees generally improve performance but increase training time
    'n_estimators': hp.choice('n_estimators', range(10, 101)),
    
    # Maximum depth of each tree: try values from 1 to 20
    # Deeper trees can model complex patterns but may overfit
    # Shallower trees are simpler but may underfit
    'max_depth': hp.choice('max_depth', range(1, 21))
}

print("Search space defined:")
print(f"  - n_estimators: 10 to 100 (91 possible values)")
print(f"  - max_depth: 1 to 20 (20 possible values)")
print(f"  - Total combinations: 91 × 20 = 1,820")
print(f"  - Hyperopt will intelligently sample ~100 combinations\n")

# Run Hyperopt optimization
# ----------------------------------------
# fmin() finds the hyperparameters that MINIMIZE the objective function
best_params = fmin(
    fn=objective,           # Function to minimize (our objective function)
    space=space,            # Hyperparameter search space defined above
    algo=tpe.suggest,       # Algorithm: Tree of Parzen Estimators (TPE)
                            # TPE is a Bayesian optimization algorithm that:
                            # 1. Tries random combinations initially
                            # 2. Learns which regions of hyperparameter space work well
                            # 3. Focuses search on promising regions
    max_evals=100           # Maximum number of hyperparameter combinations to try
                            # More evaluations = better results but longer runtime
)

print("\n" + "="*50)
print("OPTIMIZATION COMPLETE!")
print("="*50)
print(f"Best hyperparameters found:")
print(f"  - n_estimators: {best_params['n_estimators']}")
print(f"  - max_depth: {best_params['max_depth']}")
print("="*50)

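# Note: for hp.choice(), fmin() returns the 0-based INDEX of each chosen option,
# not the option itself. Index i corresponds to range(10, 101)[i] for
# n_estimators and range(1, 21)[i] for max_depth;
# hyperopt.space_eval(space, best_params) performs this mapping.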

Starting Hyperparameter Optimization...
Search space defined:
  - n_estimators: 10 to 100 (91 possible values)
  - max_depth: 1 to 20 (20 possible values)
  - Total combinations: 91 × 20 = 1,820
  - Hyperopt will intelligently sample ~100 combinations

100%|██████████| 100/100 [00:50<00:00,  2.00trial/s, best loss: -0.9666666666666668]

OPTIMIZATION COMPLETE!
Best hyperparameters found:
  - n_estimators: 43
  - max_depth: 12


## 📋 Display Best Hyperparameters

Let's examine the optimal hyperparameters found by Hyperopt.

In [7]:
# ============================================================================
# DISPLAY BEST HYPERPARAMETERS
# ============================================================================

# Show the raw result returned by fmin()
# Because the search space uses hp.choice(), this dictionary holds the 0-based
# option indices for n_estimators and max_depth, not the values themselves
print("Optimal Hyperparameters:")
print(best_params)

# Additional information about what these values mean
print("\nInterpretation:")
print(f"  - The Random Forest should use {best_params['n_estimators']} trees")
print(f"  - Each tree should have a maximum depth of {best_params['max_depth']}")
print("\nThese values were found to give the best cross-validation accuracy.")

Optimal Hyperparameters:
{'max_depth': 12, 'n_estimators': 43}

Interpretation:
  - The Random Forest should use 43 trees
  - Each tree should have a maximum depth of 12

These values were found to give the best cross-validation accuracy.


## 🔄 Retrain Model with Optimized Hyperparameters

Now that we have the best hyperparameters, let's train a new model and compare performance.
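
A single 30-sample test split can hide small differences between models, so one hedged alternative is to compare the two configurations with cross-validation on the training data. The sketch below uses scikit-learn's bundled Iris copy; the "tuned" values `n_estimators=50` and `max_depth=5` are placeholders, not results from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

candidates = {
    "default": RandomForestClassifier(random_state=42),
    # Placeholder hyperparameters, used only to have a second model to compare
    "tuned (placeholder)": RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=42
    ),
}

cv_results = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training rows only; the test set stays untouched
    fold_scores = cross_val_score(model, X_tr, y_tr, cv=5)
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between two models is larger than the noise.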

In [8]:
# ============================================================================
# RETRAIN MODEL WITH BEST HYPERPARAMETERS
# ============================================================================

print("Retraining model with optimized hyperparameters...")
print("="*50)

# Create a new pipeline with optimized hyperparameters
# ----------------------------------------
# fmin() returns hp.choice() option INDICES, so map them back to the actual
# hyperparameter values before building the model
from hyperopt import space_eval
best_values = space_eval(space, best_params)

optimized_pipeline = Pipeline([
    # Same preprocessing step as before
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), X.columns)
        ],
        remainder='passthrough'
    )),
    
    # Updated classifier with the BEST hyperparameter values from Hyperopt
    # **best_values unpacks the dictionary into keyword arguments
    ('classifier', RandomForestClassifier(**best_values, random_state=42))
])

# Train the optimized pipeline
# ----------------------------------------
optimized_pipeline.fit(X_train, y_train)

# Make predictions with optimized model
# ----------------------------------------
y_pred_optimized = optimized_pipeline.predict(X_test)

# Evaluate optimized model
# ----------------------------------------
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)

# Display comparison
# ----------------------------------------
print("\n" + "="*50)
print("PERFORMANCE COMPARISON")
print("="*50)
print(f"Default hyperparameters accuracy:   {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Optimized hyperparameters accuracy: {accuracy_optimized:.4f} ({accuracy_optimized*100:.2f}%)")
print(f"\nImprovement: {(accuracy_optimized - accuracy)*100:.2f} percentage points")
print("="*50)

# Note: For the Iris dataset, both models may achieve 100% accuracy
# because it's a relatively simple classification problem.
# Hyperparameter tuning shows more significant improvements on complex datasets.

Retraining model with optimized hyperparameters...

PERFORMANCE COMPARISON
Default hyperparameters accuracy:   1.0000 (100.00%)
Optimized hyperparameters accuracy: 1.0000 (100.00%)

Improvement: 0.00 percentage points


## 📊 Summary and Key Takeaways

### What We Accomplished:
1. ✅ Loaded and explored the Iris dataset
2. ✅ Built a reproducible ML pipeline with preprocessing and modeling
3. ✅ Trained a Random Forest classifier
4. ✅ Performed hyperparameter tuning using Hyperopt (TPE algorithm)
5. ✅ Compared default vs. optimized model performance

### MLOps Best Practices Demonstrated:
- **Modularity**: Functions for each step (reusable, testable)
- **Reproducibility**: Fixed random seeds, version-controlled code
- **Pipelines**: Prevents data leakage, simplifies deployment
- **Automation**: Hyperparameter tuning instead of manual search
- **Documentation**: Comprehensive comments and markdown cells

### Next Steps in MLOps Journey:
- Version control this notebook with Git
- Experiment tracking (MLflow, Weights & Biases)
- Model serialization (save trained models)
- CI/CD pipelines for automated training
- Model deployment as REST API
- Monitoring and retraining strategies
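
As a small taste of the model serialization step listed above, the sketch below saves a fitted model with `joblib` and loads it back. The file name `iris_model.joblib` is hypothetical, a temporary directory is used so nothing is left on disk, and the model is trained on scikit-learn's bundled Iris copy rather than on this notebook's pipeline.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)           # write the fitted model to disk
    restored_model = joblib.load(model_path) # read it back as a new object

# The restored model should reproduce the original predictions
predictions_match = bool((model.predict(X_demo) == restored_model.predict(X_demo)).all())
print(predictions_match)
```

The same two calls work on a whole `Pipeline` object, which keeps preprocessing and model together in one file.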

---

**Congratulations!** 🎉 You've completed Week 1 of MLOps learning!

In [None]:
# ============================================================================
# ADDITIONAL EXPERIMENTS (OPTIONAL)
# ============================================================================

# This cell is left empty for you to experiment with:
# - Different hyperparameter ranges
# - Other classification algorithms (SVM, Gradient Boosting, etc.)
# - Feature engineering ideas
# - Different evaluation metrics (precision, recall, F1-score)
# - Cross-validation strategies

# Example experiments you could try:
# 1. Add more hyperparameters to tune (min_samples_split, min_samples_leaf)
# 2. Try different scalers (MinMaxScaler, RobustScaler)
# 3. Create polynomial features
# 4. Visualize decision boundaries
# 5. Plot feature importances

# Happy experimenting! 🚀