# Breast Cancer Prediction Model
This notebook walks through loading and preprocessing the data, then training, evaluating, and saving a Random Forest model that predicts breast cancer diagnosis. It also includes code for deploying the model with Streamlit.

## Step 1: Import Libraries
We'll start by importing the libraries needed for data handling, preprocessing, and saving/loading the model.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import joblib

## Step 2: Load and Prepare the Data
Load the breast cancer dataset, drop unnecessary columns, encode the target variable, and split into training and testing sets.

In [None]:
cancer_data = pd.read_csv('breast_cancer.csv').drop(columns=['id'])
X = cancer_data.drop(columns=['diagnosis'])
y = LabelEncoder().fit_transform(cancer_data['diagnosis'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

To see the structure of the data:

In [None]:
cancer_data.head()
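The integer codes produced by `LabelEncoder` matter later, when predictions are mapped back to readable labels. The sketch below shows how to inspect that mapping on a few toy labels. It assumes the `diagnosis` column holds the strings `'M'` and `'B'`, which the notebook does not state explicitly; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'diagnosis' column (assumed values: 'M' and 'B').
labels = np.array(['M', 'B', 'B', 'M', 'B'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes,
# so the position of each label is the code it was assigned.
label_mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(label_mapping)
```

Keeping a reference to the fitted encoder, rather than creating it inline, makes it possible to check `classes_` or call `inverse_transform` later.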

## Step 3: Standardize the Data
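Standardize the features so that each has zero mean and unit variance. The scaler is fit on the training set only and then applied to the test set, which avoids leaking test-set statistics into training. Tree-based models such as Random Forest do not depend on feature scale, so this step mainly keeps the pipeline reusable with scale-sensitive models.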

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
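As a rough illustration of what the scaler does, the following sketch fits a `StandardScaler` on a tiny made-up training array and applies it to one unseen row. The arrays and the name `toy_scaler` are assumptions for the example, not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test arrays with two features on different scales.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

print('Learned means:', toy_scaler.mean_)
print('Scaled training column means:', train_scaled.mean(axis=0))
print('Scaled test row:', test_scaled)
```

The test row is scaled with the training mean and standard deviation, so its values are not forced to zero mean themselves.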

## Step 4: Set Up and Train the Model
We use a Random Forest classifier, which often performs well on structured (tabular) data. Setting `class_weight='balanced'` weights each class inversely to its frequency in the training set.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)
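To make the effect of `class_weight='balanced'` more concrete, the sketch below computes the balanced weights for a small imbalanced label array using scikit-learn's `compute_class_weight` helper, which applies the formula `n_samples / (n_classes * count_per_class)`. The label array is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: six samples of class 0 and two of class 1.
y_imbalanced = np.array([0] * 6 + [1] * 2)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_imbalanced)

# The rarer class receives the larger weight.
weight_by_class = {int(c): float(w) for c, w in zip(classes, weights)}
print(weight_by_class)
```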

## Step 5: Evaluate the Model
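Evaluate the model on the held-out test set using accuracy, precision, recall, and F1 score. The summary metrics below use weighted averaging across both classes; the classification report breaks them down per class.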

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

print('\nClassification Report:\n', classification_report(y_test, y_pred))
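For a diagnosis task, the number of missed malignant cases is often of particular interest. A confusion matrix makes this visible. The sketch below uses made-up label lists and assumes class `1` denotes the malignant class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true and predicted labels; 1 is assumed to mean malignant.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Recall for the positive class only: tp / (tp + fn).
malignant_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)

print('False negatives (missed malignant cases):', fn)
print('Recall for class 1:', malignant_recall)
```

The same calls can be applied to `y_test` and `y_pred` from the evaluation cell.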

## Step 6: Save the Model and Scaler
Save the trained Random Forest model and scaler so they can be used in the Streamlit app.

In [None]:
joblib.dump(rf_model, 'rf_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

print('Random Forest model and scaler saved as .pkl files.')
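A quick way to check that a saved model is usable is to reload it and compare its predictions with those of the in-memory model. The sketch below does this with a small throwaway model on random data and a temporary directory, so it does not touch `rf_model.pkl`; all names in it are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random toy data; the target depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(toy_model.predict(X_toy), reloaded.predict(X_toy)))
print('Round trip preserved predictions:', same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so it can help to record that version alongside the files.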

## Step 7: Create a Streamlit App
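We now create a Streamlit app that loads the model and scaler, lets users upload new data, and generates predictions. Streamlit apps run as standalone scripts rather than as notebook cells, so save the code below to a file (for example `app.py`, a name chosen here for illustration) and launch it with `streamlit run app.py`.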

In [None]:
import joblib
import pandas as pd
import streamlit as st

@st.cache_resource
def load_model_and_scaler():
    model = joblib.load('rf_model.pkl')
    scaler = joblib.load('scaler.pkl')
    return model, scaler

# Main app
st.title('Breast Cancer Prediction using Random Forest')
st.write('Upload a CSV file with the same feature columns as the training data to get predictions.')

# Load model and scaler
model, scaler = load_model_and_scaler()

# File uploader
uploaded_file = st.file_uploader('Upload a CSV file', type=['csv'])
if uploaded_file is not None:
    input_data = pd.read_csv(uploaded_file)
    input_data = input_data.drop(columns=['id', 'diagnosis'], errors='ignore')

    # Scale the input data
    input_data_scaled = scaler.transform(input_data)

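    # Make predictions
    predictions = model.predict(input_data_scaled)
    # LabelEncoder assigns codes in sorted label order, so 0 is benign and
    # 1 is malignant (assuming the training labels were 'B' and 'M')
    predictions = ['Benign' if pred == 0 else 'Malignant' for pred in predictions]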

    # Display results
    st.write('Predictions:')
    st.write(predictions)
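An uploaded file may be missing columns or list them in a different order than the training data, in which case `scaler.transform` can fail or raise a hard-to-read error. One possible guard is a small helper that checks and reorders the columns before scaling. The function `align_features` below is a hypothetical helper, not part of the app above, and the column names in the demo are made up.

```python
import pandas as pd


def align_features(input_df, expected_columns):
    """Return input_df restricted to expected_columns, in that order.

    Raises ValueError naming any expected column that is absent.
    """
    missing = [col for col in expected_columns if col not in input_df.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    return input_df[list(expected_columns)]


# Demo with made-up column names: an extra 'id' column and a different order.
uploaded = pd.DataFrame({'id': [1], 'texture_mean': [17.5], 'radius_mean': [14.2]})
aligned = align_features(uploaded, ['radius_mean', 'texture_mean'])
print(list(aligned.columns))
```

In the app, the expected column list could come from `scaler.feature_names_in_`, which scikit-learn (version 1.0 and later) sets when the scaler is fit on a DataFrame, as it is in this notebook. Wrapping the call in `try`/`except ValueError` and showing the message with `st.error` would give users a readable explanation.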