# Blood Cancer Risk Prediction Backend Prototype

This notebook demonstrates a machine learning pipeline for blood cancer risk prediction based on blood work parameters.

## Steps Covered:
1. **Exploratory Data Analysis (EDA)** and Data Loading
2. **Preprocessing**: Encoding categorical variables and splitting the data into training and test sets
3. **Model Training**: Random Forest Classifier
4. **Evaluation**: Assessing model performance
5. **Model Serialization**: Saving the model for deployment
6. **API Prototype**: A lightweight Flask API to serve predictions


In [11]:
# Import necessary libraries
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import warnings

warnings.filterwarnings('ignore')

## 1. Load Dataset

In [12]:
# Check if dataset exists, if not, generate it (Self-contained backup)
import os
if not os.path.exists('blood_cancer_dataset.xlsx'):
    print("Dataset not found. Generating dummy dataset...")
    # Generate dummy data
    data = {
        'Gender': np.random.choice(['Male', 'Female'], 3000),
        'Age': np.random.randint(18, 90, 3000),
        'WBC': np.random.uniform(3000, 50000, 3000),
        'Hgb': np.random.uniform(5, 18, 3000),
        'Platelet': np.random.uniform(50000, 600000, 3000),
        'RBC': np.random.uniform(2.5, 6.5, 3000),
        'HCT': np.random.uniform(20, 55, 3000),
        'MCV': np.random.uniform(60, 110, 3000),
        'Neutrophil': np.random.uniform(20, 90, 3000),
        'Lymphocyte': np.random.uniform(10, 80, 3000)
    }
    df_gen = pd.DataFrame(data)
    # Simple rule for labels
    conditions = [
        (df_gen['WBC'] > 20000) | (df_gen['Platelet'] < 100000),
        (df_gen['WBC'] > 11000) & (df_gen['WBC'] <= 20000)
    ]
    df_gen['Risk_Label'] = np.select(conditions, ['High Risk', 'Moderate Risk'], default='Low Risk')
    df_gen.to_excel('blood_cancer_dataset.xlsx', index=False)
    print("Dataset generated.")

# Load the dataset
df = pd.read_excel('blood_cancer_dataset.xlsx')
print(f"Shape: {df.shape}")
df.head()

Shape: (3000, 11)


,Gender,Age,WBC,Hgb,Platelet,RBC,HCT,MCV,Neutrophil,Lymphocyte,Risk_Label
0,Male,65,18205.929387,13.141371,407000.457296,2.517127,42.431208,62.548603,32.103756,76.588536,Moderate Risk
1,Female,83,34257.631483,13.697696,361939.698352,6.041755,53.064817,105.606795,64.828428,35.254925,High Risk
2,Male,87,29159.18614,6.641685,354386.25864,2.765963,23.934165,104.16026,84.357639,71.423883,High Risk
3,Male,45,19108.353495,17.295094,67521.219391,6.312186,35.815053,108.539907,55.619615,18.420202,High Risk
4,Male,43,9325.084554,6.723463,308116.647998,3.627303,26.734386,95.826359,40.858219,53.191204,Low Risk


## 2. Preprocessing

In [13]:
# Encode Gender (Male=1, Female=0)
le_gender = LabelEncoder()
df['Gender'] = le_gender.fit_transform(df['Gender'])

# Split Features and Target
X = df.drop('Risk_Label', axis=1)
y = df['Risk_Label']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Set structure:", X_train.shape)

Training Set structure: (2400, 10)
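
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `Female` becomes 0 and `Male` becomes 1. The sketch below is a small standalone check of that mapping, not part of the pipeline itself; the names `encoder` and `mapping` are illustrative. It also shows that an unseen label raises `ValueError`, which matters later when the API encodes user input.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the two labels used in the dataset; the order of appearance is irrelevant
encoder = LabelEncoder()
encoder.fit(["Male", "Female", "Male"])

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}

# A label that was not seen during fit cannot be transformed
try:
    encoder.transform(["Unknown"])
    unseen_label_rejected = False
except ValueError:
    unseen_label_rejected = True
```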


## 3. Model Training

In [14]:
# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training set
rf_model.fit(X_train, y_train)

print("Model trained.")

Model trained.
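
A fitted random forest exposes `feature_importances_`, which gives a quick view of the columns the trees rely on. The sketch below is a standalone illustration under stated assumptions: it regenerates a small synthetic sample using the same labeling rule as the fallback generator in section 1, with only four of the ten columns, and the names `demo`, `forest` and `importances` are illustrative. On the notebook's own model the equivalent would be `pd.Series(rf_model.feature_importances_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Small synthetic sample (assumption: same value ranges as the fallback generator)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "WBC": rng.uniform(3000, 50000, n),
    "Platelet": rng.uniform(50000, 600000, n),
    "Hgb": rng.uniform(5, 18, n),
    "MCV": rng.uniform(60, 110, n),
})

# Same labeling rule as the generator: only WBC and Platelet decide the label
labels = np.select(
    [
        (demo["WBC"] > 20000) | (demo["Platelet"] < 100000),
        (demo["WBC"] > 11000) & (demo["WBC"] <= 20000),
    ],
    ["High Risk", "Moderate Risk"],
    default="Low Risk",
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(demo, labels)

# Importances sum to 1; sort so the most influential columns come first
importances = pd.Series(forest.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```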


## 4. Evaluation

In [15]:
# Predictions
y_pred = rf_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9983333333333333

Classification Report:
                precision    recall  f1-score   support

    High Risk       1.00      1.00      1.00       393
     Low Risk       1.00      1.00      1.00       105
Moderate Risk       0.99      1.00      1.00       102

     accuracy                           1.00       600
    macro avg       1.00      1.00      1.00       600
 weighted avg       1.00      1.00      1.00       600
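
The near-perfect accuracy is consistent with how the fallback dataset is labeled: `Risk_Label` is a deterministic function of `WBC` and `Platelet`, so the forest only has to recover a few thresholds. A minimal sketch, assuming the loaded file came from the generator in section 1 (the helper name `rule_label` is hypothetical), restates that rule in plain Python and checks it against the five rows shown by `df.head()`.

```python
def rule_label(wbc, platelet):
    """Reproduce the labeling rule used by the fallback dataset generator."""
    if wbc > 20000 or platelet < 100000:
        return "High Risk"
    if 11000 < wbc <= 20000:
        return "Moderate Risk"
    return "Low Risk"

# (WBC, Platelet, Risk_Label) copied from the df.head() output in section 1
sample_rows = [
    (18205.929387, 407000.457296, "Moderate Risk"),
    (34257.631483, 361939.698352, "High Risk"),
    (29159.18614, 354386.25864, "High Risk"),
    (19108.353495, 67521.219391, "High Risk"),
    (9325.084554, 308116.647998, "Low Risk"),
]

matches = [rule_label(wbc, platelet) == label for wbc, platelet, label in sample_rows]
print(all(matches))  # True
```

If the rule reproduces every label, the reported score measures how well the model recovers that rule rather than how well it would perform on clinical data.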



## 5. Save Model

In [17]:
# Save Model and LabelEncoder
joblib.dump(rf_model, 'blood_cancer_model.pkl')
joblib.dump(le_gender, 'gender_encoder.pkl')
print("Model saved to blood_cancer_model.pkl")

Model saved to blood_cancer_model.pkl
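
The saved model expects its inputs in the same column order used during training, so it can help to store that order next to the `.pkl` files. The sketch below is one possible approach, not something the notebook does: the file name `model_metadata.json` and the helpers `save_metadata` and `load_metadata` are hypothetical, and the demo writes to a temporary directory to avoid touching the working folder.

```python
import json
import os
import tempfile

# Column order and class labels as they appear earlier in this notebook
FEATURE_ORDER = ["Gender", "Age", "WBC", "Hgb", "Platelet",
                 "RBC", "HCT", "MCV", "Neutrophil", "Lymphocyte"]
CLASS_LABELS = ["High Risk", "Low Risk", "Moderate Risk"]

def save_metadata(path, feature_order, class_labels):
    """Write the feature order and class labels as JSON (hypothetical helper)."""
    payload = {"feature_order": list(feature_order), "class_labels": list(class_labels)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)

def load_metadata(path):
    """Read the metadata back as a plain dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = os.path.join(tmp_dir, "model_metadata.json")
    save_metadata(metadata_path, FEATURE_ORDER, CLASS_LABELS)
    restored = load_metadata(metadata_path)

print(restored["feature_order"] == FEATURE_ORDER)  # True
```

In the notebook itself, `list(X.columns)` and `list(rf_model.classes_)` could be passed instead of the hard-coded lists.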


## 6. Flask API Endpoint
Run this cell to start a local server that accepts prediction requests from the frontend.
**Note:** The server runs in a background thread, so interrupting the kernel may not stop it. Restart the kernel to shut the server down.

In [19]:
from flask import Flask, request, jsonify
from flask_cors import CORS
import threading

app = Flask(__name__)
CORS(app) # Allow Cross-Origin requests from our frontend

model = joblib.load('blood_cancer_model.pkl')
gender_encoder = joblib.load('gender_encoder.pkl')

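# Feature order must match the columns the model was trained on
FEATURE_ORDER = ['Gender', 'Age', 'WBC', 'Hgb', 'Platelet', 'RBC', 'HCT', 'MCV', 'Neutrophil', 'Lymphocyte']
NUMERIC_REQUIRED = ['WBC', 'Hgb', 'Platelet', 'RBC', 'HCT', 'MCV', 'Neutrophil', 'Lymphocyte']

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # silent=True returns None instead of raising on a missing or malformed JSON body
        data = request.get_json(silent=True)
        if not isinstance(data, dict):
            return jsonify({'error': 'Request body must be a JSON object'}), 400

        # Report missing blood work values explicitly instead of failing inside float()
        missing = [name for name in NUMERIC_REQUIRED if data.get(name) is None]
        if missing:
            return jsonify({'error': 'Missing fields: ' + ', '.join(missing)}), 400

        # Normalize case ('male' -> 'Male'); default to 'Male' if Gender is missing
        gender = data.get('Gender')
        gender_cleaned = str(gender).strip().capitalize() if gender else 'Male'
        try:
            gender_val = int(gender_encoder.transform([gender_cleaned])[0])
        except ValueError:
            gender_val = 0  # Fallback for labels the encoder has not seen (0 = Female)

        # Build a single-row DataFrame so the feature names match those seen during training
        row = {name: float(data[name]) for name in NUMERIC_REQUIRED}
        row['Gender'] = gender_val
        row['Age'] = float(data.get('Age', 30))  # Default age 30 if missing
        input_df = pd.DataFrame([row], columns=FEATURE_ORDER)

        prediction = model.predict(input_df)[0]

        # Cast the NumPy string to a plain str before serializing
        return jsonify({'risk_prediction': str(prediction)})
    except Exception as e:
        return jsonify({'error': str(e)}), 400

def run_app():
    app.run(port=5000, debug=False, use_reloader=False)

# Run in a daemon thread so it doesn't block the notebook or keep the process alive on exit
t = threading.Thread(target=run_app, daemon=True)
t.start()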
print("API Server running on port 5000... success")

API Server running on port 5000... success
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
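
Once the server is running, the endpoint can be exercised from any HTTP client. The sketch below uses only the standard library and assumes the server is reachable at `http://127.0.0.1:5000/predict`, as shown in the log output; `build_request` and `predict_risk` are hypothetical helper names, and the sample values are taken from the last row of the `df.head()` output. The network call is left commented out so the block can be read and run without a live server.

```python
import json
import urllib.request

PREDICT_URL = "http://127.0.0.1:5000/predict"

def build_request(payload, url=PREDICT_URL):
    """Create a JSON POST request for the /predict endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict_risk(payload, url=PREDICT_URL, timeout=5):
    """Send the payload and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError for non-2xx responses such as 400.
    """
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Keys follow the feature names used by the endpoint
sample_payload = {
    "Gender": "Male", "Age": 43, "WBC": 9325.08, "Hgb": 6.72,
    "Platelet": 308116.65, "RBC": 3.63, "HCT": 26.73, "MCV": 95.83,
    "Neutrophil": 40.86, "Lymphocyte": 53.19,
}

request_obj = build_request(sample_payload)

# Requires the Flask server from the cell above to be running:
# result = predict_risk(sample_payload)
# print(result["risk_prediction"])
```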
