# Model Persistence and Inference with Joblib in a Random Forest Pipeline

This section summarizes how to:
- train a Random Forest model on California housing data,
- save the model and preprocessing pipeline using <b>joblib</b>, and
- reuse the saved model later for inference on new data (<b>input.csv</b>).

This approach avoids retraining the model on every run, which saves time and compute and keeps results reproducible.

# Why These Steps?

## 1. Why Train Once and Save?
- Training models repeatedly is time-consuming and computationally expensive.
- Saving the trained model and preprocessing pipeline lets you load them and run inference at any later time. In the code below, both are bundled into a single fitted `Pipeline` and saved as `pipeline.pkl`.

## 2. Why Use a Preprocessing Pipeline?
- Raw data needs to be cleaned, scaled, and encoded before model training.
- A Pipeline automates this transformation and ensures identical preprocessing during inference.

## 3. Why Use Joblib?
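- `joblib` efficiently serializes objects that contain large NumPy arrays, such as fitted scikit-learn models.
- For such objects it is often faster than the standard `pickle` module, and it supports built-in compression through the `compress` argument.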
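
As a rough illustration of the save and load round trip, the sketch below fits a small model on synthetic data, writes it with `joblib.dump`, and reads it back with `joblib.load`. The data, the file name `demo_model.pkl`, and the temporary directory are assumptions made for this example only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to obtain a fitted estimator to serialize.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path, compress=3)            # compress=3 trades speed for a smaller file
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Round trip preserved predictions:", same_predictions)
```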

## 4. Why the If-Else Logic?
- The program checks if a saved model exists.
- If not, it trains and saves the model.
- If it does, it skips training and only runs inference, saving time.
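
The check described above could look roughly like the sketch below. The helper `load_or_train`, the synthetic data, and the file name `demo_model.pkl` are hypothetical and chosen for illustration; the real program would build the full preprocessing pipeline instead of a bare model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def load_or_train(model_file, X, y):
    """Return (model, trained_now): load model_file if it exists, else train and save it."""
    if not os.path.exists(model_file):
        # No saved model yet: train once and persist the result.
        model = RandomForestRegressor(n_estimators=20, random_state=42)
        model.fit(X, y)
        joblib.dump(model, model_file, compress=3)
        return model, True
    else:
        # A saved model exists: skip training and load it.
        return joblib.load(model_file), False


# Synthetic stand-in data (not the housing dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo.sum(axis=1)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, "demo_model.pkl")
    first_model, first_trained = load_or_train(model_file, X_demo, y_demo)
    second_model, second_trained = load_or_train(model_file, X_demo, y_demo)

print("First call trained a model:", first_trained)
print("Second call trained a model:", second_trained)
```

The first call trains and saves, and the second call finds the file and only loads it.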

# Full Code

```python
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# ---------------- LOAD DATA ----------------
data = pd.read_csv("housing.csv")

TARGET = "median_house_value"
X = data.drop(columns=[TARGET])
y = data[TARGET]

# ---------------- COLUMN TYPES ----------------
num_features = X.select_dtypes(include=["int64", "float64"]).columns
cat_features = X.select_dtypes(include=["object", "string"]).columns

# ---------------- PREPROCESSING ----------------
num_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

cat_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])

# ---------------- FULL PIPELINE ----------------
pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
])

# ---------------- TRAIN ----------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline.fit(X_train, y_train)

# ---------------- SAVE PIPELINE ----------------
joblib.dump(pipeline, "pipeline.pkl", compress=3)

print("✅ New pipeline.pkl saved successfully")
```

Output:

```
✅ New pipeline.pkl saved successfully
```


##### How We Checked It
- Create an empty output.csv file, write the model's predictions for input.csv into it, and compare those predictions with the expected outputs for the input rows.
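
One way this check could be scripted is sketched below. The helper `run_inference`, the `predicted_` column prefix, and the synthetic stand-in data are assumptions for illustration. The sketch also assumes input.csv may still contain the target column, which the source does not state. With the real files, `run_inference("pipeline.pkl", "input.csv", "output.csv")` would be the only call needed.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

TARGET = "median_house_value"


def run_inference(pipeline_path, input_path, output_path):
    """Load a saved pipeline, predict on input_path, and write results to output_path."""
    pipeline = joblib.load(pipeline_path)
    input_data = pd.read_csv(input_path)
    # Drop the target if the input still carries it; the pipeline expects features only.
    features = input_data.drop(columns=[TARGET], errors="ignore")
    output = input_data.copy()
    output["predicted_" + TARGET] = pipeline.predict(features)
    output.to_csv(output_path, index=False)
    return output


# --- Demo with synthetic stand-in data (not the real housing.csv) ---
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "median_income": rng.uniform(1, 10, size=100),
    "total_rooms": rng.uniform(500, 5000, size=100),
})
demo[TARGET] = 40000 * demo["median_income"] + rng.normal(0, 5000, size=100)

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
demo_pipeline.fit(demo.drop(columns=[TARGET]), demo[TARGET])

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, "pipeline.pkl")
    input_path = os.path.join(tmp_dir, "input.csv")
    output_path = os.path.join(tmp_dir, "output.csv")

    joblib.dump(demo_pipeline, pipeline_path, compress=3)
    demo.head(10).to_csv(input_path, index=False)

    result = run_inference(pipeline_path, input_path, output_path)
    saved = pd.read_csv(output_path)

# Compare predictions with the known target values in the input rows.
mean_abs_error = (result[TARGET] - result["predicted_" + TARGET]).abs().mean()
print(f"Rows written: {len(saved)}, mean absolute error: {mean_abs_error:.1f}")
```

Because output.csv keeps the input columns next to the predictions, the two can be compared row by row or summarized with an error metric as above.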

### Summary
With this setup, our ML pipeline is:
- Efficient – No retraining needed if the model exists.
- Reproducible – Same preprocessing logic every time.
- Portable – The saved pipeline.pkl can be loaded and reused on other systems that have a compatible Python and scikit-learn setup.