### Ensuring Feature Consistency Between Training & Inference Pipelines

**Task 1**: Consistent Feature Preparation
- Step 1: Write a function for data preprocessing and imputation shared by both training and inference pipelines.
- Step 2: Demonstrate consistent application on both datasets.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Sample training data with missing values
train_data = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'income': [50000, 60000, 55000, np.nan, 65000],
    'score': [200, 220, 210, 230, np.nan]
})

# Sample inference data with missing values
inference_data = pd.DataFrame({
    'age': [28, np.nan, 50],
    'income': [52000, 62000, np.nan],
    'score': [205, 215, 225]
})

# Step 1: Shared preprocessing function
def preprocess_data(df, imputer=None, scaler=None, fit=True):
    """
    Preprocesses the input DataFrame by imputing missing values and scaling features.

    Parameters:
    - df: pd.DataFrame, input data
    - imputer: sklearn SimpleImputer instance or None
    - scaler: sklearn StandardScaler instance or None
    - fit: bool, whether to fit imputer and scaler on this data

    Returns:
    - scaled_data: np.ndarray, imputed and scaled features
    - imputer: fitted SimpleImputer instance
    - scaler: fitted StandardScaler instance
    """
    # Only numeric columns can be mean-imputed and scaled
    numeric_cols = df.select_dtypes(include='number').columns

    if not fit and (imputer is None or scaler is None):
        # Inference must reuse the transformers fitted on the training data
        raise ValueError("fit=False requires an already fitted imputer and scaler")

    if imputer is None:
        imputer = SimpleImputer(strategy='mean')
    if scaler is None:
        scaler = StandardScaler()

    if fit:
        # Training: learn the imputation means and scaling statistics from this data
        imputed_data = imputer.fit_transform(df[numeric_cols])
        scaled_data = scaler.fit_transform(imputed_data)
    else:
        # Inference: apply the statistics learned during training, never refit
        imputed_data = imputer.transform(df[numeric_cols])
        scaled_data = scaler.transform(imputed_data)

    return scaled_data, imputer, scaler


# Step 2: Apply preprocessing consistently

# Preprocess training data (fit=True)
X_train_processed, fitted_imputer, fitted_scaler = preprocess_data(train_data, fit=True)

print("Processed training data:")
print(X_train_processed)

# Preprocess inference data using the fitted imputer and scaler (fit=False)
X_infer_processed, _, _ = preprocess_data(inference_data, imputer=fitted_imputer, scaler=fitted_scaler, fit=False)

print("\nProcessed inference data:")
print(X_infer_processed)


Processed training data:
[[-1.5 -1.5 -1.5]
 [-0.5  0.5  0.5]
 [ 0.  -0.5 -0.5]
 [ 1.5  0.   1.5]
 [ 0.5  1.5  0. ]]

Processed inference data:
[[-0.9 -1.1 -1. ]
 [ 0.   0.9  0. ]
 [ 3.5  0.   1. ]]
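The zeros in the inference output above are the imputed cells: each missing value is filled with the training mean, which scales to exactly 0. The sketch below checks this directly and, for contrast, shows how far the imputation means would drift if the imputer were refit on the inference data instead. It rebuilds the same sample data so it runs on its own; the variable names (`train`, `infer`, `mean_shift`, and so on) are illustrative and not part of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'income': [50000, 60000, 55000, np.nan, 65000],
    'score': [200, 220, 210, 230, np.nan],
})
infer = pd.DataFrame({
    'age': [28, np.nan, 50],
    'income': [52000, 62000, np.nan],
    'score': [205, 215, 225],
})

# Fit both transformers on the training data only
imputer = SimpleImputer(strategy='mean').fit(train)
scaler = StandardScaler().fit(imputer.transform(train))

# Per-column means learned from the training data
train_means = imputer.statistics_

# Apply the fitted transformers to the inference data
infer_scaled = scaler.transform(imputer.transform(infer))

# Cells that were missing in the raw inference data
missing_mask = infer.isna().to_numpy()
filled_values = infer_scaled[missing_mask]

# For contrast: means an imputer would learn if (incorrectly) refit on inference data
refit_means = SimpleImputer(strategy='mean').fit(infer).statistics_
mean_shift = refit_means - train_means

print("Training means:", train_means)
print("Scaled values of imputed inference cells:", filled_values)
print("Shift in means if refit on inference data:", mean_shift)
```

A non-zero shift means the same raw value would be imputed and scaled differently at inference time than it was during training, which is the inconsistency the shared function is meant to prevent.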


**Task 2**: Pipeline Integration
- Step 1: Use sklearn pipelines to encapsulate the preprocessing steps.
- Step 2: Configure identical pipelines for both training and inference.

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Sample training data with missing values
train_data = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'income': [50000, 60000, 55000, np.nan, 65000],
    'score': [200, 220, 210, 230, np.nan]
})

# Sample inference data with missing values
inference_data = pd.DataFrame({
    'age': [28, np.nan, 50],
    'income': [52000, 62000, np.nan],
    'score': [205, 215, 225]
})

# Step 1: Create sklearn pipeline for preprocessing
preprocessing_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 2: Fit pipeline on training data and transform training data
X_train_processed = preprocessing_pipeline.fit_transform(train_data)

print("Processed training data:")
print(X_train_processed)

# Step 3: Use the same fitted pipeline to transform inference data
X_infer_processed = preprocessing_pipeline.transform(inference_data)

print("\nProcessed inference data:")
print(X_infer_processed)


Processed training data:
[[-1.5 -1.5 -1.5]
 [-0.5  0.5  0.5]
 [ 0.  -0.5 -0.5]
 [ 1.5  0.   1.5]
 [ 0.5  1.5  0. ]]

Processed inference data:
[[-0.9 -1.1 -1. ]
 [ 0.   0.9  0. ]
 [ 3.5  0.   1. ]]


**Task 3**: Saving and Loading Preprocessing Models
- Step 1: Save the transformation model after fitting it to the training data.
- Step 2: Load and apply the saved model during inference.

In [3]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import joblib  # For saving and loading models

# Sample training data
train_data = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'income': [50000, 60000, 55000, np.nan, 65000],
    'score': [200, 220, 210, 230, np.nan]
})

# Sample inference data
inference_data = pd.DataFrame({
    'age': [28, np.nan, 50],
    'income': [52000, 62000, np.nan],
    'score': [205, 215, 225]
})

# Step 1: Create pipeline
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 2: Fit pipeline on training data
preprocessing_pipeline.fit(train_data)

# Step 3: Save the fitted pipeline to disk
joblib.dump(preprocessing_pipeline, 'preprocessing_pipeline.joblib')
print("Preprocessing pipeline saved to 'preprocessing_pipeline.joblib'")

# --- Later or in a different script ---

# Step 4: Load the saved pipeline
loaded_pipeline = joblib.load('preprocessing_pipeline.joblib')
print("Preprocessing pipeline loaded.")

# Step 5: Apply loaded pipeline on inference data
X_infer_processed = loaded_pipeline.transform(inference_data)

print("\nProcessed inference data using loaded pipeline:")
print(X_infer_processed)


Preprocessing pipeline saved to 'preprocessing_pipeline.joblib'
Preprocessing pipeline loaded.

Processed inference data using loaded pipeline:
[[-0.9 -1.1 -1. ]
 [ 0.   0.9  0. ]
 [ 3.5  0.   1. ]]
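Before relying on a saved pipeline, it can help to confirm that the reloaded object reproduces the original transformation. The following is a minimal sketch of such a round-trip check. It writes to a temporary directory rather than the working directory, and it stores the scikit-learn version next to the pipeline in a plain dictionary; that bundle layout (the `pipeline` and `sklearn_version` keys) is an illustrative convention assumed here, not a joblib or scikit-learn feature.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'income': [50000, 60000, 55000, np.nan, 65000],
    'score': [200, 220, 210, 230, np.nan],
})
infer = pd.DataFrame({
    'age': [28, np.nan, 50],
    'income': [52000, 62000, np.nan],
    'score': [205, 215, 225],
})

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
]).fit(train)

# Bundle the fitted pipeline with the library version used to fit it
bundle = {'pipeline': pipeline, 'sklearn_version': sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessing_pipeline.joblib')
    joblib.dump(bundle, path)
    bundle_loaded = joblib.load(path)

loaded = bundle_loaded['pipeline']

# The reloaded pipeline should carry the same learned statistics ...
same_means = np.array_equal(
    pipeline.named_steps['imputer'].statistics_,
    loaded.named_steps['imputer'].statistics_,
)
# ... and produce the same output on the inference data
same_output = np.allclose(pipeline.transform(infer), loaded.transform(infer))

print("Saved with scikit-learn", bundle_loaded['sklearn_version'])
print("Same imputation means:", same_means)
print("Same transformed output:", same_output)
```

Recording the version is useful because pickled scikit-learn estimators are not guaranteed to load correctly under a different library version, so the inference script can compare the stored value with its own `sklearn.__version__` before transforming.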
