<a href="https://colab.research.google.com/github/RishiMotwani/TumorClassification/blob/main/Untitled0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tumor Classification with Support Vector Machines (SVMs)

## Introduction

This project demonstrates a beginner-friendly approach to classifying tumors as benign or malignant using a Support Vector Machine (SVM) model. We will use the Wisconsin Breast Cancer Dataset, a widely used dataset for binary classification tasks. The project will cover the essential steps of a machine learning workflow, including data acquisition, preprocessing, model implementation, training, evaluation, and prediction on new data. By following this guide, you will gain practical experience in building and evaluating a classification model using scikit-learn.

## Data acquisition


Load the Wisconsin Breast Cancer Dataset directly from the scikit-learn library.


In [None]:
from sklearn.datasets import load_breast_cancer

# Load the dataset
cancer = load_breast_cancer()
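
Before preprocessing, it may help to check how the dataset is laid out, in particular which integer label stands for which diagnosis. The sketch below uses only attributes of the object returned by `load_breast_cancer`; the names `label_map` and `class_counts` are illustrative and not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Dimensions of the feature matrix: (number of samples, number of features)
print("Samples, features:", cancer.data.shape)

# target_names is ordered by label index: position 0 names label 0, and so on
label_map = {int(i): str(name) for i, name in enumerate(cancer.target_names)}
print("Label encoding:", label_map)

# Number of samples per class, keyed by class name
class_counts = Counter(label_map[int(t)] for t in cancer.target)
print("Class counts:", dict(class_counts))
```

In scikit-learn's copy of the dataset, label 0 is malignant and label 1 is benign, which matters when reading precision, recall, and the confusion matrix.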

## Data preprocessing and exploration


Separate features (X) from the target variable (y), scale the features using StandardScaler, and split the data into training and testing sets.


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Separate features (X) from the target variable (y)
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)

# Scale the features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets to verify the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (455, 30)
Shape of X_test: (114, 30)
Shape of y_train: (455,)
Shape of y_test: (114,)
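
One caveat about the preprocessing cell: the scaler is fitted on all 569 rows before the split, so the mean and variance of the test rows leak into the scaling. The effect is probably small here, but a common way to avoid it is to split first and let a pipeline fit the scaler on the training rows only. The following is a minimal sketch of that alternative, not the notebook's method; the names `X_raw`, `leak_free_model`, and `test_accuracy` are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_breast_cancer(return_X_y=True, as_frame=True)

# Split the unscaled data first so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then trains the SVC on the scaled result
leak_free_model = make_pipeline(StandardScaler(), SVC())
leak_free_model.fit(X_tr, y_tr)

# score() applies the training-set scaling to X_te before predicting
test_accuracy = leak_free_model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Because the pipeline stores the fitted scaler, new samples passed to `leak_free_model.predict` are scaled the same way without a separate `transform` call.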


## Model implementation and training


Initialize and train an SVC model.


In [None]:
from sklearn.svm import SVC

# Instantiate the SVC model
# Using the default rbf kernel
model = SVC()

# Train the model
model.fit(X_train, y_train)
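
`SVC()` with no arguments relies on scikit-learn's defaults. As a small illustration, the sketch below fits the same kind of model on a freshly scaled copy of the data and prints the settings that were used; `X_demo`, `y_demo`, and `clf` are placeholder names, not names from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

clf = SVC()
clf.fit(X_demo, y_demo)

# Hyperparameters the estimator was constructed with
params = clf.get_params()
print("kernel:", params["kernel"])  # kernel function
print("C:", params["C"])            # regularization parameter (smaller C = stronger regularization)
print("gamma:", params["gamma"])    # RBF kernel coefficient setting

# Number of support vectors found for each class, in label order
print("Support vectors per class:", clf.n_support_)
```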

## Model evaluation


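Calculate key performance metrics (accuracy, precision, recall, F1-score) on the test data, then interpret them alongside the confusion matrix. scikit-learn encodes this dataset as 0 = malignant and 1 = benign, so the default positive class for precision, recall, and F1-score is benign.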


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("\nConfusion Matrix:\n", conf_matrix)

# Explanation of the metrics and confusion matrix

# Label encoding: in load_breast_cancer, 0 = malignant and 1 = benign.
# The metric functions above use pos_label=1 by default, so "positive" here means benign.

# Accuracy: The proportion of correctly classified instances (both benign and malignant) out of the total instances.
# It tells us how often the model is correct overall.

# Precision: The proportion of true positive predictions (correctly identified benign tumors) out of all positive predictions (all instances predicted as benign).
# It tells us how many of the predicted benign tumors were actually benign. High precision means few malignant tumors were mislabeled as benign.

# Recall (Sensitivity): The proportion of true positive predictions (correctly identified benign tumors) out of all actual positive instances (all actual benign tumors).
# It tells us how many of the actual benign tumors were correctly identified. High recall means few benign tumors were mislabeled as malignant.

# F1-score: The harmonic mean of precision and recall. It provides a single score that balances both.
# It is more informative than accuracy when the class distribution is uneven.

# Confusion Matrix: A table of counts with actual classes as rows and predicted classes as columns, in label order (0 = malignant, 1 = benign):
# [[malignant predicted malignant, malignant predicted benign],
#  [benign predicted malignant,    benign predicted benign]]
# With benign as the positive class, the top-right cell holds the false positives (Type I errors) and the bottom-left cell the false negatives (Type II errors).
# The top-right cell is the clinically costlier error: a malignant tumor reported as benign.

Accuracy: 0.9736842105263158
Precision: 0.9722222222222222
Recall: 0.9859154929577465
F1-score: 0.9790209790209791

Confusion Matrix:
 [[41  2]
 [ 1 70]]
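
Because label 1 (benign) is the default positive class, the precision and recall printed above describe how well benign tumors are identified. To see the same figures for the malignant class, `pos_label=0` can be passed to the metric functions. The sketch below rebuilds label vectors that reproduce the recorded confusion matrix, so it runs without the trained model; `y_true_demo` and `y_pred_demo` are stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Label vectors that reproduce the recorded matrix [[41, 2], [1, 70]]
# (0 = malignant, 1 = benign; rows are actual classes, columns are predictions)
y_true_demo = [0] * 43 + [1] * 71
y_pred_demo = [0] * 41 + [1] * 2 + [0] * 1 + [1] * 70

print(confusion_matrix(y_true_demo, y_pred_demo))

# Default pos_label=1: benign is treated as the positive class
benign_precision = precision_score(y_true_demo, y_pred_demo)
benign_recall = recall_score(y_true_demo, y_pred_demo)

# pos_label=0: malignant is treated as the positive class
malignant_precision = precision_score(y_true_demo, y_pred_demo, pos_label=0)
malignant_recall = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(f"Benign    precision={benign_precision:.4f} recall={benign_recall:.4f}")
print(f"Malignant precision={malignant_precision:.4f} recall={malignant_recall:.4f}")
```

The benign figures match the recorded output (70/72 and 70/71). For the malignant class, precision is 41/42 and recall is 41/43, so recall on the clinically important class is somewhat lower than the headline number. On the notebook's own variables, the equivalent call would be `recall_score(y_test, y_pred, pos_label=0)`.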


## Summary

### Data Analysis Key Findings

*   The Wisconsin Breast Cancer Dataset was successfully loaded from scikit-learn.
*   The features and target variable were separated, features were scaled using `StandardScaler`, and the data was split into training and testing sets with an 80/20 ratio.
*   An SVC model with the default RBF kernel was initialized and trained on the training data.
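*   The trained model achieved high performance on the test set: Accuracy: 0.9737, Precision: 0.9722, Recall: 0.9859, and F1-score: 0.9790. Precision, recall, and F1-score are computed with benign (label 1) as the positive class, which is scikit-learn's default.
*   The confusion matrix showed 3 misclassifications out of 114 test samples: 41 malignant tumors correctly identified, 2 malignant tumors predicted as benign, 1 benign tumor predicted as malignant, and 70 benign tumors correctly identified. With benign as the positive class, these are 41 True Negatives, 2 False Positives, 1 False Negative, and 70 True Positives.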

### Insights or Next Steps

*   The SVM with default hyperparameters misclassified only 3 of 114 test samples. The 2 malignant tumors predicted as benign are the errors that matter most clinically.
*   Hyperparameter tuning, for example with `GridSearchCV` over `C` and `gamma`, may improve the model's performance further.
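
As a rough illustration of the tuning step mentioned in the list above, the sketch below runs a small grid search over `C` and `gamma` inside a pipeline, so that scaling is refitted within each cross-validation fold. The grid values, the 5-fold setting, and the names `pipe`, `param_grid`, and `search` are assumptions chosen for the example, not recommendations drawn from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled independently
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Step-name prefix "svc__" routes each parameter to the SVC step
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_gs_train, y_gs_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_gs_test, y_gs_test))
```

The test split is touched only once, after the search has finished, so the final score is not influenced by the parameter selection.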


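## Prediction on new data


Prompt for 30 comma-separated feature values, in the order of `cancer.feature_names`, scale them with the fitted scaler, and predict the class with the trained model.

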
In [None]:
import numpy as np
import pandas as pd

# Prompt the user for a comma-separated string of 30 numerical values
user_input_string = input("Please enter 30 numerical values separated by commas: ")

# Convert the input string to a list of floats
try:
    user_data_list = [float(x.strip()) for x in user_input_string.split(',')]
    if len(user_data_list) != 30:
        raise ValueError("Incorrect number of values provided.")
except ValueError as e:
    print(f"Invalid input: {e}")
    user_data_list = None # Set to None to indicate invalid input

# Proceed only if input is valid
if user_data_list is not None:
    # Convert the list to a numpy array and reshape it for prediction
    user_data_array = np.array(user_data_list).reshape(1, -1)

    # Convert the numpy array to a pandas DataFrame with the same column names as the training data
    # Ensure that 'cancer.feature_names' is accessible from previous steps
    try:
        user_data_df = pd.DataFrame(user_data_array, columns=cancer.feature_names)

        # Scale the user data with the scaler fitted during preprocessing
        # (in this notebook it was fitted on the full feature matrix before the split)
        user_data_scaled_array = scaler.transform(user_data_df)

        # Convert the scaled numpy array back to a pandas DataFrame with feature names for prediction
        user_data_scaled = pd.DataFrame(user_data_scaled_array, columns=cancer.feature_names)


        # Predict the class using the trained model
        # Ensure the model is accessible (assuming it's named 'model' from training)
        prediction = model.predict(user_data_scaled)

        # Interpret the prediction
        if prediction[0] == 0:
            print("The model predicts the tumor is Malignant.")
        else:
            print("The model predicts the tumor is Benign.")

    except NameError:
        print("Error: Scaler, model, or cancer.feature_names not found. Please ensure previous steps were executed.")
    except Exception as e:
        print(f"An error occurred during prediction: {e}")

Please enter 30 numerical values separated by commas: 84358402	M	20.29	14.34	135.1	1297	0.1003	0.1328	0.198	0.1043	0.1809	0.05883	0.7572	0.7813	5.438	94.44	0.01149	0.02461	0.05688	0.01885	0.01756	0.005115	22.54	16.67	152.2	1575	0.1374	0.205	0.4	0.1625	0.2364	0.07678	
Invalid input: could not convert string to float: '84358402\tM\t20.29\t14.34\t135.1\t1297\t0.1003\t0.1328\t0.198\t0.1043\t0.1809\t0.05883\t0.7572\t0.7813\t5.438\t94.44\t0.01149\t0.02461\t0.05688\t0.01885\t0.01756\t0.005115\t22.54\t16.67\t152.2\t1575\t0.1374\t0.205\t0.4\t0.1625\t0.2364\t0.07678'
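
The run above fails because the pasted row is tab-separated and starts with two non-feature fields (what appear to be an ID and a diagnosis letter), while the cell expects exactly 30 comma-separated numbers. One possible way to make the input step more forgiving is sketched below. `parse_feature_row` is a hypothetical helper, not part of the notebook, and it assumes that any extra leading fields are metadata that can be dropped.

```python
import re


def parse_feature_row(raw, n_features=30):
    """Parse a delimited row into exactly n_features floats.

    Assumption: any leading fields beyond n_features (such as an ID and a
    diagnosis letter) are metadata and can be dropped.
    """
    # Split on commas, tabs, or other whitespace; ignore empty fields
    fields = [f for f in re.split(r"[,\s]+", raw.strip()) if f]
    if len(fields) < n_features:
        raise ValueError(f"Expected {n_features} values, got {len(fields)}.")
    # Keep only the last n_features fields; float() raises ValueError on non-numeric text
    return [float(f) for f in fields[-n_features:]]


# The row that was rejected in the run above
raw_row = (
    "84358402\tM\t20.29\t14.34\t135.1\t1297\t0.1003\t0.1328\t0.198\t0.1043\t"
    "0.1809\t0.05883\t0.7572\t0.7813\t5.438\t94.44\t0.01149\t0.02461\t0.05688\t"
    "0.01885\t0.01756\t0.005115\t22.54\t16.67\t152.2\t1575\t0.1374\t0.205\t0.4\t"
    "0.1625\t0.2364\t0.07678\t"
)

values = parse_feature_row(raw_row)
print(len(values), values[:3])
```

Silently dropping leading fields is convenient for rows copied from the raw data file, but it would also hide a mistake such as typing 31 numbers. A stricter variant could reject any row whose field count is not exactly 30 or 32.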
