### Step 5: Inference and Deployment

We exported the trained model using Joblib and tested it on a single sample input. This section lays the foundation for integrating the model into a user-friendly interface like a Streamlit web app.


# Step 5: Model Inference and Usage

This section shows how to load the trained models and use them to predict lung cancer status for new samples.



## Saving the Models

All trained models are serialized with `joblib` and stored in the `models/` directory, so they can be reused without retraining and loaded from web or API frameworks such as Streamlit or Flask.

Saved assets include:
- `logistic_model.pkl`
- `random_forest_model.pkl`
- `svm_model.pkl`
- `scaler.pkl` (for feature scaling)
- `imputer.pkl` (for handling missing values)



## Loading and Making Predictions

We load the saved models together with the saved imputer and scaler, and apply them to new sample inputs. New inputs must pass through the same imputation and scaling steps as the training data before they reach the model. For this demonstration, we select a few samples from the test set and predict their labels.

The same prediction step can be integrated into an interactive application for end users (e.g., doctors or lab technicians).
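
The loading and prediction steps can be sketched as follows. This is a minimal sketch, assuming the three artifacts saved in this notebook (`imputer.pkl`, `scaler.pkl`, `random_forest_model.pkl`) exist in `models/` and that inputs use the five feature columns from the training cell. The helper names `load_artifacts` and `predict_samples` are hypothetical, and the sample values in the usage comment are made up for illustration.

```python
from pathlib import Path

import joblib
import pandas as pd

# Feature order must match the order used when the model was trained.
FEATURES = ["gene1", "gene2", "gene3", "miRNA_21", "miRNA_34a"]


def load_artifacts(model_dir="models"):
    """Load the fitted imputer, scaler, and classifier saved with joblib."""
    model_dir = Path(model_dir)
    imputer = joblib.load(model_dir / "imputer.pkl")
    scaler = joblib.load(model_dir / "scaler.pkl")
    model = joblib.load(model_dir / "random_forest_model.pkl")
    return imputer, scaler, model


def predict_samples(raw_samples, imputer, scaler, model):
    """Apply the saved preprocessing steps, then predict one label per row."""
    X = pd.DataFrame(raw_samples, columns=FEATURES)
    X_imputed = imputer.transform(X)        # fill gaps with the stored training means
    X_scaled = scaler.transform(X_imputed)  # reuse the stored training mean and std
    return model.predict(X_scaled)


# Usage (illustrative values; requires the saved files in models/):
# imputer, scaler, model = load_artifacts("models")
# print(predict_samples([[0.8, 1.2, float("nan"), 2.1, 0.4]], imputer, scaler, model))
```

Note that `transform` is used here rather than `fit_transform`, so the statistics learned during training are reused and nothing is re-estimated from the new samples.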


## Example Output

- Input: Normalized feature vector from an unseen patient sample  
- Output: `Prediction: Lung Cancer` or `Prediction: Normal`

This simple interface will later be enhanced into a user-friendly app using **Streamlit**.
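
The output strings above could be produced from the model's raw prediction with a small helper like the one below. It is only a sketch: the encoding `1 = Lung Cancer`, `0 = Normal` is an assumption about the `Label` column, and `LABEL_NAMES` and `format_prediction` are hypothetical names.

```python
# Assumed label encoding (not confirmed by the dataset): 0 = normal, 1 = lung cancer.
LABEL_NAMES = {0: "Normal", 1: "Lung Cancer"}


def format_prediction(label):
    """Turn a raw numeric model output into the message shown to the user."""
    name = LABEL_NAMES.get(int(label), f"Unknown label {label}")
    return f"Prediction: {name}"


print(format_prediction(1))  # Prediction: Lung Cancer
print(format_prediction(0))  # Prediction: Normal
```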



## Ready for Deployment

With the models and preprocessing pipeline saved, this notebook serves as the foundation for:
- Building a diagnostic app using **Streamlit**
- Hosting a REST API endpoint
- Integrating into clinical decision support systems




### Saving and Reloading Imputer

To handle missing values consistently across training and inference, we use `SimpleImputer` with the "mean" strategy. This imputer is fitted on the training data and saved using `joblib`. It can be reloaded later to ensure the same transformation logic is applied during model deployment.
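
To illustrate why saving the fitted imputer matters, the sketch below fits a `SimpleImputer` on a tiny synthetic matrix, saves it, reloads it, and applies it to a new row. The data values are invented for illustration and a temporary directory stands in for `models/`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer

# Synthetic training matrix with missing entries (illustrative values only).
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learns the column means 2.0 and 15.0

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "imputer.pkl"
    joblib.dump(imputer, path)     # save the fitted object
    reloaded = joblib.load(path)   # reload it, as a new notebook would

# A new sample with both values missing is filled with the training means.
X_new = np.array([[np.nan, np.nan]])
filled = reloaded.transform(X_new)
print(filled)
```

The reloaded object carries the means learned from the training data in its `statistics_` attribute, so inference does not depend on whatever values happen to be present in the new samples.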


In [7]:
import pandas as pd
print(df.columns.tolist())


['Unnamed: 0', 'gene 1', 'gene 2', 'gene 3', 'miRNA_21', 'miRNA_34a', 'Label']


In [11]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib
import os

# Load your cleaned data
df = pd.read_csv("C:/Users/sanja/cfDNA-Lung-Cancer-ML/data/processed/merged_labeled_light.csv")

# Rename columns if needed
df = df.rename(columns={
    'gene 1': 'gene1',
    'gene 2': 'gene2',
    'gene 3': 'gene3'
})
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

# Select only the 5 required features
features = ['gene1', 'gene2', 'gene3', 'miRNA_21', 'miRNA_34a']
X = df[features]
y = df['Label']

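# Train-test split first, so preprocessing statistics come from the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess: fit imputer and scaler on the training split, then apply them to the test split
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Train Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)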

# Save model, imputer, scaler
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/random_forest_model.pkl")
joblib.dump(imputer, "models/imputer.pkl")
joblib.dump(scaler, "models/scaler.pkl")

print("✅ Retrained model with 5 features saved successfully.")




✅ Retrained model with 5 features saved successfully.


> **Note:** Calling `joblib.load('models/imputer.pkl')` in a new notebook before that file has been saved raises a `FileNotFoundError`. Using the `imputer` variable without first defining or loading it raises a `NameError`.  
> The imputer must first be **fitted on your training data** using `SimpleImputer` and then **saved** using `joblib.dump()` before it can be reused across notebooks.
> 
> Additionally, `.pkl` files like `imputer.pkl` are **binary files** and should not be opened directly in Jupyter. Always load them through code using `joblib.load()`.
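
One way to make a missing artifact easier to diagnose is to check for the file before loading it. The helper below is a small sketch; the name `load_artifact` and the wording of the error message are hypothetical.

```python
from pathlib import Path

import joblib


def load_artifact(path):
    """Load a joblib file, with a clearer error if it has not been saved yet."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Fit the object and save it with joblib.dump() first."
        )
    return joblib.load(path)


# Usage (requires the file to exist):
# imputer = load_artifact("models/imputer.pkl")
```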
