### Step 5: Inference and Deployment

We exported the trained model with `joblib` and tested it on a single sample input. This section lays the foundation for integrating the model into a user-friendly interface such as a Streamlit web app.


# Step 5: Model Inference and Usage

This section demonstrates how to load the trained models and use them to predict lung cancer status for new samples.



## Saving the Models

All trained models are serialized with `joblib` and stored in the `models/` directory. This lets the models be reused without retraining and deployed through web or API frameworks such as Streamlit or Flask.

Saved assets include:
- `logistic_model.pkl`
- `random_forest_model.pkl`
- `svm_model.pkl`
- `scaler.pkl` (for feature scaling)
- `imputer.pkl` (for handling missing values)
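
The saving step can be sketched as follows. This is a minimal illustration, not the notebook's actual training code: the data is synthetic, the hyperparameters are placeholders, and a temporary directory stands in for `models/`. Only the file names are taken from the list above.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption: 5 features).
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# Placeholder estimators; the real hyperparameters are not shown in this section.
models = {
    "logistic_model.pkl": LogisticRegression(max_iter=1000),
    "random_forest_model.pkl": RandomForestClassifier(n_estimators=20, random_state=0),
    "svm_model.pkl": SVC(),
}

# A temporary directory stands in for the project's models/ folder.
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir(parents=True, exist_ok=True)

for filename, model in models.items():
    model.fit(X, y)
    joblib.dump(model, models_dir / filename)  # one file per fitted estimator

print(sorted(p.name for p in models_dir.iterdir()))
```

Saving each estimator to its own file keeps the models independent, so a deployment can load only the one it needs.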



## Loading and Making Predictions

We load the saved models and apply them to new sample inputs. For this demonstration, we select a few samples from the test set and predict their labels.

We also show how the prediction can be integrated into an interactive application for end users (e.g., doctors or lab technicians).
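
A minimal sketch of the load-and-predict step, assuming a model saved as `logistic_model.pkl`. The data here is synthetic and the model is trained inline only so the example runs on its own; in the project, the file would already exist in `models/`.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the processed cfDNA features (assumption).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train and save a model so there is something to load.
model_path = Path(tempfile.mkdtemp()) / "logistic_model.pkl"
joblib.dump(LogisticRegression(max_iter=1000).fit(X_train, y_train), model_path)

# Inference: load the saved model and predict a few held-out samples.
loaded_model = joblib.load(model_path)
samples = X_test[:3]
predictions = loaded_model.predict(samples)

for i, pred in enumerate(predictions):
    print(f"sample {i}: predicted={pred}, actual={y_test[i]}")
```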


## Example Output

- Input: Normalized feature vector from an unseen patient sample  
- Output: `Prediction: Lung Cancer` or `Prediction: Normal`
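
Turning a numeric class into the display strings above could look like this. The encoding (1 = lung cancer, 0 = normal) is an assumption, since the label encoding is not shown in this section, and `format_prediction` is a hypothetical helper name.

```python
# Assumption: the label column encodes lung cancer as 1 and normal as 0.
LABEL_NAMES = {0: "Normal", 1: "Lung Cancer"}


def format_prediction(predicted_class):
    """Turn a numeric class into the display string used in this notebook."""
    return f"Prediction: {LABEL_NAMES[int(predicted_class)]}"


print(format_prediction(1))
print(format_prediction(0))
```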

This simple interface will later be enhanced into a user-friendly app using **Streamlit**.



## Ready for Deployment

With the models and preprocessing pipeline saved, this notebook serves as the foundation for:
- Building a diagnostic app using **Streamlit**
- Hosting a REST API endpoint
- Integrating into clinical decision support systems
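
Whichever front end is chosen, it needs a function that turns a request into a prediction. The sketch below is framework-agnostic and purely illustrative: `handle_request` and `ThresholdModel` are hypothetical names, the stub stands in for a model loaded with `joblib.load()`, and the 1 = lung cancer encoding is an assumption. The feature names are the ones used in the preprocessing cell of this notebook.

```python
import json

# Feature names as used in the preprocessing cell of this notebook.
FEATURE_ORDER = ["gene 1", "gene 2", "gene 3", "miRNA_21", "miRNA_34a"]


class ThresholdModel:
    """Hypothetical stand-in exposing predict() like a scikit-learn estimator."""

    def predict(self, rows):
        return [1 if sum(row) > 0 else 0 for row in rows]


def handle_request(body, model):
    """Parse a JSON request body, run the model, and return a JSON response."""
    payload = json.loads(body)
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        return json.dumps({"error": f"missing features: {missing}"})
    # Build the row in the same column order the model was trained on.
    row = [float(payload[name]) for name in FEATURE_ORDER]
    predicted = int(model.predict([row])[0])
    label = "Lung Cancer" if predicted == 1 else "Normal"  # assumed encoding
    return json.dumps({"prediction": label})


request_body = json.dumps({name: 0.5 for name in FEATURE_ORDER})
print(handle_request(request_body, ThresholdModel()))
```

Keeping this logic in one plain function means the same code can sit behind a Streamlit button or a REST endpoint.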




### Saving and Reloading Imputer

To handle missing values consistently across training and inference, we use `SimpleImputer` with the `"mean"` strategy. The imputer is fitted on the training data and saved with `joblib`. Reloading it later ensures that the same per-feature fill values learned during fitting are applied at deployment time.
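
A small worked example of that round trip, using a made-up array rather than the project's data. The fitted column means live in `statistics_`, and the reloaded imputer fills missing values with exactly those numbers.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training data with missing entries (not the project's dataset).
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
# A new sample with every value missing, to expose the fill values.
X_new = np.array([[np.nan, np.nan]])

# Fit on training data only, then persist the fitted object.
imputer = SimpleImputer(strategy="mean").fit(X_train)
imputer_path = Path(tempfile.mkdtemp()) / "imputer.pkl"
joblib.dump(imputer, imputer_path)

# Later (for example in another notebook): reload and transform, never refit.
reloaded = joblib.load(imputer_path)
filled = reloaded.transform(X_new)

print(reloaded.statistics_)  # per-column means learned from X_train
print(filled)
```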


In [3]:
import pandas as pd
from sklearn.impute import SimpleImputer
import joblib

# Load your dataset
df = pd.read_csv(r"C:\Users\sanja\cfDNA_LungCancer_ML\data\processed\merged_labeled_light.csv", index_col=0)

# Drop the label column to keep only the features
X = df.drop(columns=["Label"])

# Fit the imputer (using mean strategy)
imputer = SimpleImputer(strategy='mean')
imputer.fit(X)

# Save the fitted imputer to disk
joblib.dump(imputer, r"C:\Users\sanja\cfDNA_LungCancer_ML\models\imputer.pkl")

# Reload it to confirm the saved file can be read back
imputer = joblib.load(r"C:\Users\sanja\cfDNA_LungCancer_ML\models\imputer.pkl")


> **Note:** If you run `joblib.load('imputer.pkl')` in a new notebook before the file has been created, it will raise a `FileNotFoundError`.
> The imputer must first be **fitted on your dataset** using `SimpleImputer` and then **saved** using `joblib.dump()` before it can be reused across notebooks.
> 
> Additionally, `.pkl` files like `imputer.pkl` are **binary files** and should not be opened directly in Jupyter. Always load them through code using `joblib.load()`.
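
One way to make a missing artifact easier to diagnose is a small guard around the load call. This is a sketch; `load_artifact` is a hypothetical helper, not part of `joblib`.

```python
from pathlib import Path

import joblib


def load_artifact(path):
    """Load a joblib file, with an explicit hint if it has not been saved yet."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; fit the object and save it with joblib.dump() first"
        )
    return joblib.load(path)


try:
    load_artifact("no_such_dir/imputer.pkl")
except FileNotFoundError as err:
    print(err)
```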


In [5]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import joblib

# ✅ Load your CSV file (update path if needed)
df = pd.read_csv(r"C:\Users\sanja\cfDNA-Lung-Cancer-ML\data\processed\merged_labeled_light.csv", index_col=0)

# ✅ Features and target based on your CSV
REQUIRED_FEATURES = ['gene 1', 'gene 2', 'gene 3', 'miRNA_21', 'miRNA_34a']

# ✅ Filter only these columns
X = df[REQUIRED_FEATURES]

# ✅ Step 1: Imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
joblib.dump(imputer, "imputer.pkl")

# ✅ Step 2: Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
joblib.dump(scaler, "scaler.pkl")

print("✅ Imputer and Scaler saved successfully!")



✅ Imputer and Scaler saved successfully!
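
At inference time the saved objects are applied with `transform()`, not `fit_transform()`, so new samples are processed with the statistics learned earlier. The sketch below assumes the same five feature names as the cell above. The training values are made up, `preprocess` is a hypothetical helper, and a temporary directory stands in for wherever the `.pkl` files are stored.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

REQUIRED_FEATURES = ['gene 1', 'gene 2', 'gene 3', 'miRNA_21', 'miRNA_34a']

# Made-up training values; column means are 7.5, 8.5, 9.5, 10.5, 11.5.
train = pd.DataFrame(np.arange(20, dtype=float).reshape(4, 5), columns=REQUIRED_FEATURES)

# Fit and save the preprocessing objects (mirrors the cell above).
artifact_dir = Path(tempfile.mkdtemp())
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
scaler.fit(imputer.fit_transform(train))
joblib.dump(imputer, artifact_dir / "imputer.pkl")
joblib.dump(scaler, artifact_dir / "scaler.pkl")


def preprocess(new_df, artifact_dir):
    """Apply the saved imputer and scaler to new samples without refitting."""
    saved_imputer = joblib.load(artifact_dir / "imputer.pkl")
    saved_scaler = joblib.load(artifact_dir / "scaler.pkl")
    X_new = new_df[REQUIRED_FEATURES]  # enforce the training column order
    return saved_scaler.transform(saved_imputer.transform(X_new))


# A new sample with one missing value, which is filled with the training mean.
new_sample = pd.DataFrame([{
    "gene 1": 7.5, "gene 2": np.nan, "gene 3": 9.5,
    "miRNA_21": 10.5, "miRNA_34a": 11.5,
}])
X_ready = preprocess(new_sample, artifact_dir)
print(X_ready.shape)
```

Selecting `REQUIRED_FEATURES` explicitly guards against a reordered or wider input frame silently shifting columns.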
