# Classifying Cancer Cells Using Support Vector Machines

## Diagnosing breast lumps as benign or malignant
In the United States, almost 270,000 cases of breast cancer are diagnosed each year. While screening advances have reduced breast cancer fatalities, breast cancer remains the second leading cause of cancer-related death in women.

Once a lump is found, doctors perform additional tests to determine whether the mass is cancerous and to identify possible treatments. A biopsy is a diagnostic test that removes tissue or fluid from an affected area so it can be examined under a microscope or with other tests. Most lumps found in breast tissue are benign, or non-cancerous. Benign lumps may be removed, but usually do not require further treatment. Malignant lumps, however, are cancerous and can spread to other parts of the body without treatment.

## Wisconsin Breast Cancer Database
The Wisconsin Breast Cancer Database contains data from 569 scans of cells from biopsied breast tissue. Image recognition was used to calculate features describing the shape and texture of the cells in each image, such as cell radius, perimeter, and texture. The database also records whether each cell sample was benign or malignant. Since the correct cell type is known, classifying cells is a supervised learning task. In the breast cancer dataset, malignant samples are labeled Diagnosis=1 and benign samples are labeled Diagnosis=0.

## Support vector classification based on radius and texture
Support vector machines are another possible classification model. Like k-nearest neighbors, support vector machines depend on the selection of hyperparameters. A hyperparameter is a value, chosen before training, that defines a specific variant of a machine learning model. For k-nearest neighbors, a hyperparameter is the number of neighbors, k. Support vector machines have several hyperparameters, including the kernel function and the slope of the loss function, which sets how strongly misclassified points are penalized. The cells below vary only the kernel.
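As a rough illustration of how hyperparameters are fixed before training, the sketch below builds two support vector classifiers on a tiny made-up dataset. It assumes scikit-learn's `svm.SVC`, where `kernel` selects the kernel function and `C` weights the penalty on misclassified points. Mapping `C` to the loss-related hyperparameter is an assumption, and the toy points and variable names are hypothetical.

```python
from sklearn import svm

# Hypothetical toy data: two features per sample, two well-separated classes
X_toy = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
         [3.0, 3.0], [2.6, 3.2], [3.3, 2.7]]
y_toy = [0, 0, 0, 1, 1, 1]

# Hyperparameters are chosen when the model object is created, before fitting
linear_variant = svm.SVC(kernel='linear', C=1.0)
rbf_variant = svm.SVC(kernel='rbf', C=10.0)

new_points = [[0.1, 0.1], [3.0, 2.9]]
for model in (linear_variant, rbf_variant):
    model.fit(X_toy, y_toy)
    params = model.get_params()
    print(params['kernel'], params['C'], model.predict(new_points))
```

Each combination of `kernel` and `C` is a different variant of the same model family, which is why the accuracy has to be checked separately for each choice.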

In [1]:
# Import packages for classification, data splitting, evaluation, and scaling
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

In [2]:
# Import packages for numerical computing and data handling
import numpy as np
import pandas as pd

In [3]:
# Load the dataset, drop rows with missing values, and encode Diagnosis (M=1, B=0)
cancer = pd.read_csv('WisconsinBreastCancerDatabase.csv').dropna()
cancer = cancer.replace(to_replace=['M', 'B'], value=[1, 0])
cancer

Unnamed: 0,ID,Diagnosis,Radius mean,Texture mean,Perimeter mean,Area mean,Smoothness mean,Compactness mean,Concavity mean,Concave points mean,...,Radius worst,Texture worst,Perimeter worst,Area worst,Smoothness worst,Compactness worst,Concavity worst,Concave points worst,Symmetry worst,Fractal dimension worst
0,842302,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [4]:
# Create dataframe X with features "Radius mean" and "Texture mean"
X = cancer[['Radius mean', 'Texture mean']]

# Create series y with the target "Diagnosis"
y = cancer['Diagnosis']


In [5]:
# Set a random seed so the train/test split is reproducible
seed = 123

In [6]:
# Create training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)
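To show what the split above produces, here is a small sketch on placeholder arrays with the same number of rows as the dataset (569). The arrays `X_demo` and `y_demo` are made up and carry no real measurements; the point is only the resulting sizes and the effect of `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 569 rows and 2 feature columns (not the real dataset)
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.arange(569) % 2

# Two splits with the same random_state give identical partitions
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)             # (398, 2) (171, 2)
print(np.array_equal(X_te, split_b[1]))   # True
```

With `test_size=0.3`, 171 of the 569 rows are held out for testing, which matches the totals in the confusion matrices printed later in the notebook.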

In [7]:
# Scale the input features using the StandardScaler 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
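To make the scaling step concrete, the sketch below reproduces by hand what `StandardScaler` computes on a small made-up array: each column has the training mean subtracted and is divided by the training standard deviation, and the test rows reuse those same training statistics. The arrays `train` and `test` are hypothetical stand-ins for `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and testing features
train = np.array([[10.0, 15.0], [14.0, 20.0], [18.0, 25.0]])
test = np.array([[12.0, 30.0]])

# Fit on the training data only, then apply the same transform to the test data
demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)
test_scaled = demo_scaler.transform(test)

# Manual version: z = (x - mean) / std, using training statistics
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))  # True
print(np.allclose(test_scaled, manual_test))    # True
```

Fitting the scaler on the training set alone keeps information about the test set out of the model, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.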


In [8]:
# Initialize an SVM model with a linear kernel
linearSVM = svm.SVC(kernel='linear')

# Fit model using X_train_scaled and y_train
linearSVM.fit(X_train_scaled, y_train)

# Find the predicted classes for X_test
y_pred = linearSVM.predict(X_test_scaled)


In [9]:
# Calculate accuracy score for the linear SVM
score = linearSVM.score(X_test_scaled, y_test)
print('Accuracy score is ', score)

Accuracy score is  0.8713450292397661



In [10]:
# Print confusion matrix for the linear SVM
print(confusion_matrix(y_test, y_pred))

[[101   2]
 [ 20  48]]
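The confusion matrix above can be read with a few lines of arithmetic. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, ordered 0 (benign) then 1 (malignant). The sketch below copies the printed values by hand; the variable names are illustrative.

```python
# Confusion matrix printed above for the linear SVM
matrix = [[101, 2],
          [20, 48]]

# Row 0: true benign; row 1: true malignant
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total       # share of all samples classified correctly
sensitivity = tp / (tp + fn)       # share of malignant samples detected
specificity = tn / (tn + fp)       # share of benign samples detected

print(total, round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# 171 0.8713 0.7059 0.9806
```

The accuracy agrees with the score reported for the linear SVM, but the sensitivity shows that 20 of the 68 malignant samples in the test set were classified as benign.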



In [11]:
# Initialize an SVM model with a radial basis function (rbf) kernel
rbfSVM = svm.SVC(kernel='rbf')

# Fit model using X_train_scaled and y_train
rbfSVM.fit(X_train_scaled, y_train)

# Find the predicted classes for X_test
y_pred = rbfSVM.predict(X_test_scaled)


In [12]:
# Calculate accuracy score for the rbf SVM
score = rbfSVM.score(X_test_scaled, y_test)
print('Accuracy score is ', score)

Accuracy score is  0.9005847953216374



In [13]:
# Print confusion matrix for the rbf SVM
print(confusion_matrix(y_test, y_pred))

[[102   1]
 [ 16  52]]



In [14]:
# Initialize an SVM model with a polynomial (poly) kernel
polySVM = svm.SVC(kernel='poly')

# Fit model using X_train_scaled and y_train
polySVM.fit(X_train_scaled, y_train)

# Find the predicted classes for X_test
y_pred = polySVM.predict(X_test_scaled)


In [15]:
# Calculate accuracy score for the poly SVM
score = polySVM.score(X_test_scaled, y_test)
print('Accuracy score is ', score)

Accuracy score is  0.7777777777777778



In [16]:
# Print confusion matrix for the poly SVM
print(confusion_matrix(y_test, y_pred))

[[103   0]
 [ 38  30]]
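To compare the three kernels side by side, the sketch below collects the three confusion matrices printed in this notebook and computes the accuracy and the number of malignant samples each model missed. The values are copied by hand from the outputs above; the dictionary and variable names are illustrative.

```python
# Confusion matrices printed above, keyed by kernel
results = {
    'linear': [[101, 2], [20, 48]],
    'rbf':    [[102, 1], [16, 52]],
    'poly':   [[103, 0], [38, 30]],
}

summary = {}
for kernel, ((tn, fp), (fn, tp)) in results.items():
    summary[kernel] = {
        'accuracy': (tn + tp) / (tn + fp + fn + tp),
        'missed_malignant': fn,             # malignant samples predicted benign
        'sensitivity': tp / (tp + fn),      # share of malignant samples detected
    }
    print(kernel, summary[kernel])

# Kernel that detects the largest share of malignant samples
best = max(summary, key=lambda k: summary[k]['sensitivity'])
print(best)  # rbf
```

On this split, the rbf kernel has both the highest accuracy and the fewest missed malignant samples. The poly kernel produces no false positives but misses 38 of the 68 malignant samples. Since all three models use default settings and only two features, these results compare the defaults rather than the best each kernel can do.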

