### SVM on breast cancer data

The features in this dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

The dataset comprises __30 features__ (mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error, worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, and worst fractal dimension) and __a target (type of cancer)__.

This data has __two types__ of cancer classes: __malignant (harmful)__ and __benign (not harmful)__.

In [None]:
#Import scikit-learn dataset library
from sklearn import datasets

In [None]:
#Load dataset
cancer = datasets.load_breast_cancer()

In [None]:
# print the names of the 30 features
print("Features: ", cancer.feature_names)

In [None]:
# print the class labels ('malignant', 'benign')
print("Labels: ", cancer.target_names)

In [None]:
# show the shape of the feature matrix: (number of samples, number of features)
cancer.data.shape

In [None]:
# print the cancer data features (top 5 records)
print(cancer.data[0:5])

In [None]:
# print the cancer labels (0:malignant, 1:benign)
print(cancer.target)
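
Before modelling, it can help to check how the two classes are balanced, since the balance affects how accuracy, precision, and recall should be read. The following is a minimal sketch; the variable name `counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Count how many samples fall into each class (0: malignant, 1: benign)
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(f"{name}: {count} samples ({count / counts.sum():.1%})")
```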

To estimate how the model performs on data it has not seen, __dividing__ the dataset into a training set and a test set is __ALWAYS a good strategy__.

Split the dataset by using the function __train_test_split()__. 

In [None]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)  # 70% training and 30% test
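
As an optional variation on the split above, `train_test_split()` accepts a `stratify` argument that keeps the class proportions roughly equal in both subsets. This is a sketch under the assumption that a stratified split is wanted; the notebook itself does not use one, and the `_s` suffixed names are chosen here so that the original variables are not overwritten.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

# Stratified 70/30 split (assumption: stratify is not part of the original split)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    cancer.data, cancer.target,
    test_size=0.3, random_state=109,
    stratify=cancer.target,
)

print("Train shape:", X_train_s.shape, "Test shape:", X_test_s.shape)
# The target is 0/1, so the mean is the share of class 1 (benign)
print("Benign share (train):", round(y_train_s.mean(), 3))
print("Benign share (test): ", round(y_test_s.mean(), 3))
```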

Let's build a support vector machine model. First, import the SVM module, then create a support vector classifier object by passing `kernel='linear'` to `SVC()`.

In [None]:
#Import svm model
from sklearn import svm

In [None]:
#Create a svm Classifier (clf)
clf = svm.SVC(kernel='linear') # Linear Kernel

In [None]:
#Train the model using the training sets
clf.fit(X_train, y_train)

In [None]:
#Predict the response for test dataset
y_pred = clf.predict(X_test)
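
SVMs are sensitive to the scale of the input features, and the features here span very different ranges (areas in the hundreds, smoothness values well below 1). A possible variation, not part of the original workflow, is to standardize the features inside a pipeline before fitting the classifier. The name `scaled_clf` is hypothetical.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)

# The scaler is fitted on the training data only, then applied to the test data
scaled_clf = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
scaled_clf.fit(X_train, y_train)

scaled_accuracy = scaled_clf.score(X_test, y_test)
print("Accuracy with scaling:", scaled_accuracy)
```

Putting the scaler inside the pipeline keeps test-set statistics from leaking into training.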

In [None]:
# import useful functions
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

In [None]:
# view the report
print(classification_report(y_test, y_pred, digits=3))

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
from matplotlib import pyplot as plt
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.grid(False)
plt.show()

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

In [None]:
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
# Model Precision: what percentage of tuples predicted as positive are actually positive?
print("Precision:",metrics.precision_score(y_test, y_pred))

In [None]:
# Model Recall: what percentage of positive tuples are labelled as such?
print("Recall:",metrics.recall_score(y_test, y_pred))
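
To make the difference between precision and recall concrete, both can be computed by hand from the four cells of a confusion matrix. The sketch below uses small made-up label vectors (`y_true_toy` and `y_pred_toy` are illustrative, not taken from the cancer data). Note that scikit-learn treats label 1 as the positive class by default, which is benign in this dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration only
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)  # of everything predicted positive, how much is correct
recall_manual = tp / (tp + fn)     # of everything actually positive, how much was found

print("Precision:", precision_manual, precision_score(y_true_toy, y_pred_toy))
print("Recall:   ", recall_manual, recall_score(y_true_toy, y_pred_toy))
```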

In [None]:
# Get the ROC curve and AUC for calibration (training) and test
from sklearn.metrics import roc_curve, auc

In [None]:
y_train_pred = clf.decision_function(X_train)    
y_test_pred = clf.decision_function(X_test) 

In [None]:
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label=" AUC TRAIN ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label=" AUC TEST ="+str(auc(test_fpr, test_tpr)))
plt.plot([0,1],[0,1],'g--')
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AUC(ROC curve)")
plt.grid(color='black', linestyle='-', linewidth=0.5)
plt.show()
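
If only the AUC value is needed and not the curve itself, `roc_auc_score()` computes it directly from the labels and decision scores. The sketch below refits the same linear SVC so that it runs on its own; the names `test_scores`, `auc_from_curve`, and `auc_direct` are introduced here for illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)

clf = svm.SVC(kernel='linear').fit(X_train, y_train)

# Signed distance to the separating hyperplane, used as a ranking score
test_scores = clf.decision_function(X_test)

# Two routes to the same number: area under the explicit curve, or the direct helper
fpr, tpr, _ = roc_curve(y_test, test_scores)
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_test, test_scores)

print("AUC from curve:", round(auc_from_curve, 4))
print("AUC direct:    ", round(auc_direct, 4))
```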

Can you do this for the iris dataset?

In [None]:
# A silly suggestion...
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe
data = pd.read_csv(url, names=colnames)

In [None]:
data
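
One possible starting point for the iris exercise is sketched below. It assumes the copy of the data bundled with scikit-learn (`datasets.load_iris()`) rather than the UCI URL above, so that it runs without a network connection; the variable names are illustrative. Because iris has three classes, `precision_score()` and `recall_score()` would need an `average` argument, and the ROC analysis would have to be done per class, so the sketch stops at the classification report and confusion matrix.

```python
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled iris data instead of the UCI download
iris = datasets.load_iris()

X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=109, stratify=iris.target)

# SVC handles the three classes internally with a one-vs-one scheme
iris_clf = svm.SVC(kernel='linear').fit(X_tr, y_tr)
iris_pred = iris_clf.predict(X_te)

iris_accuracy = metrics.accuracy_score(y_te, iris_pred)
iris_cm = metrics.confusion_matrix(y_te, iris_pred)

print("Accuracy:", iris_accuracy)
print(metrics.classification_report(y_te, iris_pred,
                                    target_names=iris.target_names, digits=3))
print(iris_cm)
```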