# <center> SHREYAANS NAHATA: 19BCE2686 </center>

## Q01: Implement a Decision Tree classifier for the Breast Cancer Wisconsin dataset (load_breast_cancer) and evaluate it with precision, recall (sensitivity), and specificity.

#### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np

#### Loading the breast cancer dataset

In [2]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print('Target Names: ', data.target_names)
print('Feature Names: ', data.feature_names)

Target Names:  ['malignant' 'benign']
Feature Names:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


#### Extracting the data attributes and target labels

In [3]:
X = data.data      # Data attributes
y = data.target    # Target Labels
print('Shape of the data (examples, features):', X.shape)

Shape of the data (examples, features): (569, 30)


#### First four rows in the variable 'X'

In [4]:
print(X[:4])

[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414

#### Splitting the dataset into train and test sets

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, train_size = 0.8)
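
The split above leaves the class proportions in each partition to chance. A minimal sketch, assuming the malignant/benign ratio should be preserved in both partitions, passes `stratify=y` to `train_test_split`. The `_s` suffixed variable names are hypothetical and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class ratio approximately equal in both partitions
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, train_size=0.8, random_state=10, stratify=y
)

# Fraction of benign (label 1) samples overall, in train, and in test
overall_ratio = y.mean()
train_ratio = y_train_s.mean()
test_ratio = y_test_s.mean()
print('Benign fraction (overall, train, test):', overall_ratio, train_ratio, test_ratio)
```

With stratification the three printed fractions should be close to one another, which makes test metrics less sensitive to an unlucky split.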

#### Training the Decision Tree Classifier on the train set

In [6]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier()
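
To see what a fitted tree has learned, its size and split rules can be printed. The sketch below is illustrative only: `small_tree`, `max_depth=2`, and `random_state=0` are assumptions chosen to keep the printout short, and it fits on the full dataset rather than on the train split used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# max_depth=2 and random_state=0 are assumptions that keep the output short and repeatable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

print('Depth:', small_tree.get_depth())
print('Leaves:', small_tree.get_n_leaves())

# export_text renders the learned split rules as indented plain text
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Calling `get_depth()` and `get_n_leaves()` on the unrestricted `clf` from the cell above gives a rough idea of how complex the default tree has become.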

#### Predicting labels on the test set

In [7]:
y_pred = clf.predict(X_test)

#### Calculating Accuracy Score, Precision Score, Sensitivity (Recall) and Specificity on the train data

In [8]:
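from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Predict once on the train set and reuse the result
y_train_pred = clf.predict(X_train)

# Calculating the confusion matrix
# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
confusion = confusion_matrix(y_true=y_train, y_pred=y_train_pred)
print('Confusion Matrix: \n', confusion)

# Calculating the sensitivity: recall of class 0 (malignant), treated here as the positive class
sensitivity = confusion[0,0]/(confusion[0,0]+confusion[0,1])

# Calculating the specificity: recall of class 1 (benign), treated here as the negative class
specificity = confusion[1,1]/(confusion[1,0]+confusion[1,1])

# Note: precision_score uses pos_label=1 (benign) by default, so it reports the precision of the benign class
print('Accuracy Score on train data:', accuracy_score(y_true=y_train, y_pred=y_train_pred)*100)
print('Precision Score on train data:', precision_score(y_true=y_train, y_pred=y_train_pred)*100)
print('Recall Sensitivity of train data:', sensitivity*100)
print('Specificity of train data:', specificity*100)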

Confusion Matrix: 
 [[173   0]
 [  0 282]]
Accuracy Score on train data: 100.0
Precision Score on train data: 100.0
Recall Sensitivity of train data: 100.0
Specificity of train data: 100.0
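
A score of 100% on every train metric usually indicates that the unrestricted tree has memorised the training set, so train scores say little about generalisation. One common way to get a less optimistic estimate is k-fold cross-validation. The sketch below is an illustration, not part of the original workflow: `tree`, `cv=5`, and `random_state=0` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# random_state=0 is an assumption, added only to make the sketch repeatable
tree = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validated accuracy: each fold is scored on data the tree did not see
cv_scores = cross_val_score(tree, X, y, cv=5, scoring='accuracy')
print('Fold accuracies:', cv_scores)
print('Mean accuracy:', cv_scores.mean())
```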


#### Calculating Accuracy Score, Precision Score, Sensitivity (Recall) and Specificity on the test data

In [9]:
confusion = confusion_matrix(y_true=y_test, y_pred=y_pred)
print('Confusion Matrix: \n', confusion)
sensitivity = confusion[0,0]/(confusion[0,0]+confusion[0,1])
specificity = confusion[1,1]/(confusion[1,0]+confusion[1,1])

print('Accuracy Score on test data:', accuracy_score(y_true=y_test, y_pred=y_pred)*100)
print('Precision Score on test data:', precision_score(y_true=y_test, y_pred=y_pred)*100)
print('Recall Sensitivity of test data:', sensitivity*100)
print('Specificity of test data:', specificity*100)

Confusion Matrix: 
 [[36  3]
 [ 8 67]]
Accuracy Score on test data: 90.35087719298247
Precision Score on test data: 95.71428571428572
Recall Sensitivity of test data: 92.3076923076923
Specificity of test data: 89.33333333333333
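
The metrics above mix two conventions: the index arithmetic treats malignant (label 0) as the positive class, while `precision_score` defaults to `pos_label=1` (benign). A minimal sketch of computing all three metrics under one convention follows. It assumes malignant is the positive class, and it rebuilds label vectors from the test confusion matrix printed above rather than rerunning the classifier; `y_true`, `y_hat`, `POSITIVE`, and `NEGATIVE` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Rebuild label vectors that reproduce the test confusion matrix reported above:
# [[36, 3], [8, 67]]  (rows: true class, columns: predicted class)
y_true = np.array([0] * 39 + [1] * 75)
y_hat = np.array([0] * 36 + [1] * 3 + [0] * 8 + [1] * 67)

# Assumption: malignant (label 0) is the positive class
POSITIVE = 0
NEGATIVE = 1

# labels=[POSITIVE, NEGATIVE] orders the matrix so that ravel() yields tp, fn, fp, tn
tp, fn, fp, tn = confusion_matrix(y_true, y_hat, labels=[POSITIVE, NEGATIVE]).ravel()

sensitivity = recall_score(y_true, y_hat, pos_label=POSITIVE)   # tp / (tp + fn)
specificity = recall_score(y_true, y_hat, pos_label=NEGATIVE)   # tn / (tn + fp)
precision = precision_score(y_true, y_hat, pos_label=POSITIVE)  # tp / (tp + fp)

print('Sensitivity:', sensitivity * 100)
print('Specificity:', specificity * 100)
print('Precision:', precision * 100)
```

Under this convention sensitivity and specificity match the values printed above, but precision becomes 36/44 (about 81.8%), noticeably lower than the benign-class precision of 95.7%.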


#### Tuning the Decision Tree to increase the accuracy using criterion='entropy' ('min_samples_split' is left at its default value of 2)

In [10]:
clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=2)
clf.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy')
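
Because `min_samples_split=2` is scikit-learn's default, the cell above effectively changes only the split criterion. A hedged sketch of actually searching over `min_samples_split` with cross-validation is given below. The grid values, `cv=5`, and `random_state=0` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)

# Hypothetical grid: the candidate values are illustrative only
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 5, 10, 20],
}

# random_state=0 is an assumption, added only to make the sketch repeatable
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', search.best_score_)
print('Test accuracy of the refit model:', search.score(X_test, y_test))
```

Selecting hyperparameters by cross-validation on the train split keeps the test set untouched until the final evaluation.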

#### Calculating Accuracy Score, Precision Score, Sensitivity (Recall) and Specificity on the train data using criterion='entropy'

In [11]:
confusion = confusion_matrix(y_true=y_train, y_pred=clf.predict(X_train))
print('Confusion Matrix: \n', confusion)
sensitivity = confusion[0,0]/(confusion[0,0]+confusion[0,1])
specificity = confusion[1,1]/(confusion[1,0]+confusion[1,1])

print('Accuracy Score on train data:', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train))*100)
print('Precision Score on train data:', precision_score(y_true=y_train, y_pred=clf.predict(X_train))*100)
print('Recall Sensitivity of train data:', sensitivity*100)
print('Specificity of train data:', specificity*100)

Confusion Matrix: 
 [[173   0]
 [  0 282]]
Accuracy Score on train data: 100.0
Precision Score on train data: 100.0
Recall Sensitivity of train data: 100.0
Specificity of train data: 100.0


#### Calculating Accuracy Score, Precision Score, Sensitivity (Recall) and Specificity on the test data using criterion='entropy'

In [12]:
confusion = confusion_matrix(y_true=y_test, y_pred=clf.predict(X_test))
print('Confusion Matrix: \n', confusion)
sensitivity = confusion[0,0]/(confusion[0,0]+confusion[0,1])
specificity = confusion[1,1]/(confusion[1,0]+confusion[1,1])

print('Accuracy Score on the test data:', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test))*100)
print('Precision Score on the test data:', precision_score(y_true=y_test, y_pred=clf.predict(X_test))*100)
print('Recall Sensitivity of the test data:', sensitivity*100)
print('Specificity of the test data:', specificity*100)

Confusion Matrix: 
 [[36  3]
 [ 6 69]]
Accuracy Score on the test data: 92.10526315789474
Precision Score on the test data: 95.83333333333334
Recall Sensitivity of the test data: 92.3076923076923
Specificity of the test data: 92.0
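
The four evaluation cells in this notebook repeat the same arithmetic, which is how mislabeled print statements tend to creep in. One possible refactoring is a small helper that takes the split name as an argument. `report_metrics` is a hypothetical function, not part of scikit-learn, and it keeps this notebook's conventions (sensitivity as recall of class 0, specificity as recall of class 1, default `precision_score`). The demo labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

def report_metrics(y_true, y_hat, split_name):
    """Print and return the notebook's metrics as percentages.

    Hypothetical helper: sensitivity is the recall of class 0 (malignant),
    specificity is the recall of class 1 (benign), and precision uses
    scikit-learn's default pos_label=1.
    """
    confusion = confusion_matrix(y_true, y_hat, labels=[0, 1])
    metrics = {
        'accuracy': accuracy_score(y_true, y_hat) * 100,
        'precision': precision_score(y_true, y_hat) * 100,
        'sensitivity': confusion[0, 0] / confusion[0].sum() * 100,
        'specificity': confusion[1, 1] / confusion[1].sum() * 100,
    }
    for name, value in metrics.items():
        print(f'{name} on {split_name} data: {value:.2f}')
    return metrics

# Usage with made-up labels (not results from this notebook)
demo = report_metrics([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], 'demo')
```

In the notebook itself the call would look like `report_metrics(y_test, clf.predict(X_test), 'test')`.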
