# Python Assignment: Model Pipeline Serialization and Deployment with Joblib

This assignment focuses on a crucial aspect of MLOps: packaging your entire machine learning workflow, from preprocessing to prediction, into a single, deployable unit. You will build a Scikit-learn pipeline, train it, serialize it using `joblib`, and then demonstrate how to load and use it to make predictions on new, raw data. Because the serialized pipeline carries its fitted preprocessing steps with it, the transformations learned during training are applied identically at prediction time, which keeps deployment consistent and reproducible.

## Part 1: Data Generation and Feature Engineering Setup (30 points)

We'll create a synthetic dataset that requires various preprocessing steps, laying the groundwork for building a robust pipeline.
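
One detail worth checking when injecting missing values into a categorical column built with NumPy: an array of one-character labels gets a fixed-width string dtype, and assigning `np.nan` into it stores a truncated string rather than a real missing value. The sketch below is illustrative only (the names `fixed` and `flexible` are made up for this example) and compares that case with an object-dtype array, which keeps `NaN` intact.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Fixed-width string array: NumPy infers a one-character string dtype.
fixed = rng.choice(["A", "B", "C"], size=6)
fixed[0] = np.nan  # converted to text and truncated, so no missing value is stored

# Object array: elements are arbitrary Python objects, so NaN stays a float.
flexible = rng.choice(["A", "B", "C"], size=6).astype(object)
flexible[0] = np.nan

n_missing_fixed = int(pd.Series(fixed).isna().sum())
n_missing_flexible = int(pd.Series(flexible).isna().sum())

print("fixed-width dtype:", fixed.dtype, "missing:", n_missing_fixed)
print("object dtype:", flexible.dtype, "missing:", n_missing_flexible)
```

If the fixed-width variant is used, the imputer in a later pipeline would see no missing categorical values and the encoder would instead see an extra, spurious category.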

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib # For serialization
import warnings

warnings.filterwarnings("ignore") # Suppress warnings for cleaner output
np.random.seed(42) # for reproducibility

# 1.1 Generate Synthetic Dataset
#    Create a DataFrame with:
#    - `numerical_feature_1`: Continuous, some missing values (NaN)
#    - `numerical_feature_2`: Continuous, no missing values
#    - `categorical_feature`: Categorical, 3-4 unique values, some missing values (NaN)
#    - `target`: A continuous target variable related to the features.
#    Ensure `n_samples` is at least 500.

n_samples = 500

# Generate numerical features
num_f1 = np.random.rand(n_samples) * 100
num_f2 = np.random.normal(loc=50, scale=15, size=n_samples)

# Generate categorical feature (object dtype so that NaN can be stored later)
categories = ['A', 'B', 'C', 'D']
cat_f = np.random.choice(categories, size=n_samples, p=[0.4, 0.3, 0.2, 0.1]).astype(object)

# Generate target variable with some noise.
# It is computed from the complete features, before any values are removed,
# so the target contains no NaN. The category effect is applied per row.
cat_effect = np.select([cat_f == 'A', cat_f == 'B'], [10, 5], default=0)
target = 2 * num_f1 + 0.5 * num_f2 + cat_effect + np.random.randn(n_samples) * 5

# Introduce missing values in num_f1
missing_indices_f1 = np.random.choice(n_samples, size=int(0.05 * n_samples), replace=False)
num_f1[missing_indices_f1] = np.nan

# Introduce missing values in cat_f
missing_indices_cat = np.random.choice(n_samples, size=int(0.03 * n_samples), replace=False)
cat_f[missing_indices_cat] = np.nan

# Create DataFrame
data = pd.DataFrame({
    'numerical_feature_1': num_f1,
    'numerical_feature_2': num_f2,
    'categorical_feature': cat_f,
    'target': target
})

print("Original Data Head:\n", data.head())
print("\nOriginal Data Info:")
data.info()

# Define features and target
X = data[['numerical_feature_1', 'numerical_feature_2', 'categorical_feature']]
y = data['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTrain set shape: {X_train.shape}, Test set shape: {X_test.shape}")


## Part 2: Building and Training the Pipeline (40 points)

You will now construct a comprehensive Scikit-learn pipeline that handles all preprocessing steps and integrates the final model.
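
As a warm-up, here is a minimal sketch of the same pattern on a tiny, unrelated toy dataset. The column names `size` and `city`, the target values, and the choice of `LinearRegression` are assumptions made purely for illustration; they are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "size": [50.0, np.nan, 80.0, 65.0, 120.0, 95.0],
    "city": pd.Series(["X", "Y", np.nan, "X", "Y", "X"], dtype=object),
})
toy_target = np.array([100.0, 150.0, 160.0, 130.0, 260.0, 190.0])

# One sub-pipeline per column type.
numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column to its sub-pipeline, then chain the model after it.
pre = ColumnTransformer(transformers=[
    ("num", numeric, ["size"]),
    ("cat", categorical, ["city"]),
])
model = Pipeline(steps=[("preprocessor", pre), ("regressor", LinearRegression())])
model.fit(toy, toy_target)

transformed = model.named_steps["preprocessor"].transform(toy)
preds = model.predict(toy)
print("Transformed shape:", transformed.shape)
print("Number of predictions:", preds.shape[0])
```

Fitting the outer `Pipeline` fits the imputers, scaler, and encoder on the training data only, and `predict` reuses those fitted statistics on whatever raw frame it receives.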

In [None]:
# 2.1 Define Preprocessing Steps
#    - For `numerical_feature_1`: Impute missing values with the mean, then apply `StandardScaler`.
#    - For `numerical_feature_2`: Apply `StandardScaler` directly.
#    - For `categorical_feature`: Impute missing values with the most frequent value, then apply `OneHotEncoder`.
#    Use `SimpleImputer` from `sklearn.impute`, and `StandardScaler` and `OneHotEncoder` from `sklearn.preprocessing`.

# Define column types
numerical_features_with_missing = ['numerical_feature_1']
numerical_features_no_missing = ['numerical_feature_2']
categorical_features = ['categorical_feature']

# Create preprocessing pipelines for numerical and categorical features

# TODO: Numerical pipeline for features with missing values
numerical_transformer_missing = Pipeline(steps=[
    # ('imputer', SimpleImputer(strategy='mean')),
    # ('scaler', StandardScaler())
])

# TODO: Numerical pipeline for features without missing values
numerical_transformer_no_missing = Pipeline(steps=[
    # ('scaler', StandardScaler())
])

# TODO: Categorical pipeline
categorical_transformer = Pipeline(steps=[
    # ('imputer', SimpleImputer(strategy='most_frequent')),
    # ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# 2.2 Create a ColumnTransformer
#    Combine the preprocessing steps for different column types using `ColumnTransformer`.

preprocessor = ColumnTransformer(
    transformers=[
        # TODO: Add your transformers here
        # ('num_missing', numerical_transformer_missing, numerical_features_with_missing),
        # ('num_no_missing', numerical_transformer_no_missing, numerical_features_no_missing),
        # ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep other columns if any (not strictly needed here)
)

print("Preprocessor created:\n", preprocessor)


# 2.3 Build and Train the Full Pipeline
#    Create a full `Pipeline` that first applies the `preprocessor` and then trains a `RandomForestRegressor`.
#    Train this entire pipeline on your `X_train` and `y_train` data.
#    Evaluate the pipeline's performance on the `X_test` and `y_test` data (RMSE and R2 Score).

# TODO: Create the full pipeline
full_pipeline = Pipeline(steps=[
    # ('preprocessor', preprocessor),
    # ('regressor', RandomForestRegressor(n_estimators=100, random_state=42)) # or LinearRegression()
])

print("\nFull Pipeline created:\n", full_pipeline)

# Train the pipeline
print("\nTraining the full pipeline...")
# TODO: Fit the pipeline
# full_pipeline.fit(X_train, y_train)
print("Pipeline training complete.")

# Evaluate the pipeline
print("\nEvaluating the pipeline performance...")
# TODO: Make predictions and calculate metrics
# pipeline_predictions = full_pipeline.predict(X_test)
# pipeline_rmse = np.sqrt(mean_squared_error(y_test, pipeline_predictions))
# pipeline_r2 = r2_score(y_test, pipeline_predictions)

# print(f"Pipeline Test RMSE: {pipeline_rmse:.4f}")
# print(f"Pipeline Test R2 Score: {pipeline_r2:.4f}")


## Part 3: Serialization with `joblib` (15 points)

You will now save the entire trained pipeline to disk using `joblib`. This is essential for deploying the exact trained state of your model and its preprocessing steps.
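
The round trip can be sketched on a small stand-in estimator before applying it to the full pipeline. In this example the file name `demo_model.joblib`, the demo arrays, and the use of a temporary directory are assumptions chosen to keep the example self-contained; `compress=3` is an optional setting, not a requirement.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small stand-in model (y = 2x + 1) so there is fitted state to save.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(fitted, path, compress=3)  # compression trades speed for file size
    restored = joblib.load(path)

# The restored object should behave exactly like the original.
same_predictions = bool(np.allclose(fitted.predict(X_demo), restored.predict(X_demo)))
print("Predictions match after reload:", same_predictions)
```

Like `pickle`, `joblib.load` can execute arbitrary code while deserializing, so it should only be used on files from a trusted source.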

In [None]:
# 3.1 Save the Trained Pipeline
#    Use `joblib.dump()` to save the `full_pipeline` object to a file named `model_pipeline.joblib`.
#    Explain why `joblib` is often preferred over Python's built-in `pickle` for serializing Scikit-learn models and pipelines.

pipeline_filename = 'model_pipeline.joblib'

print(f"\nSaving the pipeline to {pipeline_filename}...")
# TODO: Save the pipeline
# joblib.dump(full_pipeline, pipeline_filename)
print("Pipeline saved successfully.")

### Explanation: Why Joblib over Pickle for Scikit-learn?
*(Write your explanation here)*


## Part 4: Loading and Deployment Simulation (15 points)

Demonstrate how to load the saved pipeline and use it to make predictions on new, raw data, mimicking a production environment.
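
In a production-like setting, incoming records may have fields in a different order or missing entirely. The sketch below shows one possible way to align raw input with the training columns before calling `predict`. The helper name `to_model_frame` and the decision to reject unknown fields are assumptions made for illustration; the column names are the ones used in this assignment.

```python
import pandas as pd

EXPECTED_COLUMNS = ["numerical_feature_1", "numerical_feature_2", "categorical_feature"]


def to_model_frame(records, expected_columns=EXPECTED_COLUMNS):
    """Build a DataFrame whose columns match the training layout (hypothetical helper)."""
    if isinstance(records, dict):
        records = [records]  # a single record becomes a one-row frame
    frame = pd.DataFrame(records)
    unexpected = sorted(set(frame.columns) - set(expected_columns))
    if unexpected:
        raise ValueError(f"Unexpected fields: {unexpected}")
    # reindex fixes the column order and adds absent columns filled with NaN,
    # which the pipeline's imputers can then handle.
    return frame.reindex(columns=expected_columns)


frame = to_model_frame({"numerical_feature_2": 55.0, "categorical_feature": "B"})
print(frame)
```

The resulting frame can be passed straight to a loaded pipeline's `predict` method, since all preprocessing happens inside the pipeline.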

In [None]:
# 4.1 Load the Pipeline
#    Load the `model_pipeline.joblib` file back into memory using `joblib.load()`.

print(f"\nLoading the pipeline from {pipeline_filename}...")
# TODO: Load the pipeline
# loaded_pipeline = joblib.load(pipeline_filename)
print("Pipeline loaded successfully.")

# 4.2 Simulate New Raw Data for Prediction
#    Create a new `pd.DataFrame` that represents unseen, raw input data (i.e., not yet preprocessed).
#    Ensure it has the same column names and data types as your original training data.

new_raw_data = pd.DataFrame({
    'numerical_feature_1': [85.0, np.nan, 23.5],
    'numerical_feature_2': [45.1, 60.5, 38.0],
    'categorical_feature': ['A', 'C', np.nan]
})
print("\nNew Raw Data for Prediction:\n", new_raw_data)

# 4.3 Make Predictions with the Loaded Pipeline
#    Use the `loaded_pipeline` to make predictions on the `new_raw_data`.
#    Observe that the pipeline automatically handles all preprocessing steps before prediction.

print("\nMaking predictions on new raw data with loaded pipeline...")
# TODO: Make predictions
# new_predictions = loaded_pipeline.predict(new_raw_data)
# print(f"Predictions: {new_predictions}")

# 4.4 (Tougher) Create a simple prediction function/endpoint simulation
#    Define a function that takes raw input data (e.g., a dictionary or list) and uses the loaded pipeline to return a prediction.
#    This simulates what a web API endpoint might do.

def predict_from_raw_input(input_dict: dict, pipeline) -> float:
    # Wrap the dict in a list so pandas builds a one-row DataFrame
    input_df = pd.DataFrame([input_dict])
    # predict() returns an array; convert the single value to a plain Python float
    prediction = pipeline.predict(input_df)[0]
    return float(prediction)

sample_input = {
    'numerical_feature_1': 75.2,
    'numerical_feature_2': 55.0,
    'categorical_feature': 'B'
}

print(f"\nSimulating API prediction for: {sample_input}")
# TODO: Call your prediction function
# api_prediction = predict_from_raw_input(sample_input, loaded_pipeline)
# print(f"API Prediction: {api_prediction:.2f}")


## Part 5: Reflection and Best Practices (10 points)

Answer the following questions in a markdown cell below.

### Your Answers to Reflection Questions:

1.  **What are the primary advantages of saving an entire machine learning pipeline (preprocessing + model) instead of just the trained model?** (List at least 3 advantages)

    * **Advantage 1:** _(Your answer here)_
    * **Advantage 2:** _(Your answer here)_
    * **Advantage 3:** _(Your answer here)_

2.  **When deploying a `joblib`-serialized pipeline to a production environment (e.g., a Docker container or cloud function), what key dependencies or considerations must you ensure are present in that environment?**

    _(Your answer here)_

3.  **Are there any potential issues or limitations with using `joblib` for model serialization, especially in large-scale or long-term production scenarios? (e.g., version compatibility, cross-language support)**

    _(Your answer here)_

4.  **Briefly compare and contrast `joblib` with other model serialization formats/tools you might know (e.g., `pickle`, ONNX, PMML). When might you choose one over the other?**

    _(Your answer here)_
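
As a starting point for thinking about questions 2 and 3, the sketch below records the library versions used at training time so they can be compared with the deployment environment. The helper name `build_artifact_metadata` and the metadata layout are hypothetical; this is one possible convention, not a `joblib` feature.

```python
import json
import platform

import joblib
import numpy as np
import sklearn


def build_artifact_metadata(artifact_name):
    """Collect the versions a serialized pipeline depends on (hypothetical helper)."""
    return {
        "artifact": artifact_name,
        "python": platform.python_version(),
        "scikit-learn": sklearn.__version__,
        "joblib": joblib.__version__,
        "numpy": np.__version__,
    }


metadata = build_artifact_metadata("model_pipeline.joblib")
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Storing such a file next to the `.joblib` artifact makes it easier to pin matching versions in a `requirements.txt` or container image.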


## Deliverables:

1.  This completed Jupyter Notebook (`joblib_pipeline_deployment_assignment.ipynb`) with all code cells executed and reflection questions answered.
2.  The `model_pipeline.joblib` file generated by your code (you can optionally include it in your submission if submitting as a zipped folder).