# Assignment 4 - Supervised Learning
##  Breast Cancer Diagnosis Classification 
![BreastCancer](https://archive.ics.uci.edu/ml/assets/MLimages/Large14.jpg)


### Data Set Information:

This is a subset of the [UCI Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.



### Attribute Information:

Label
 - Diagnosis (1 = M = malignant, 0 = B = benign)
 
 
Features 1-5 : Five real-valued features are computed for each cell nucleus:

- a) radius (mean of distances from center to points on the perimeter)
- b) texture (standard deviation of gray-scale values)
- c) perimeter
- d) area
- e) smoothness (local variation in radius lengths)

### Main Task:
Breast cancer prediction: binary classification of the diagnosis (M/B) using the classification models we have covered.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import export_graphviz
# plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay.from_estimator
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, accuracy_score, roc_curve, auc, roc_auc_score


### Task 1: Load the data set and have a general overview on it

In [None]:
BreastCancer=pd.read_csv('/home/nofe/lms/ds/breastcancerdata.csv')
BreastCancer.head()

In [None]:
# Note: using mean_radius as the index removes it from the feature columns used for training
BreastCancer.set_index('mean_radius', inplace=True)
BreastCancer.head()

### Task 2: Split the dataset into 75% Training and 25% Testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(BreastCancer.drop('diagnosis', axis=1), 
                                                    BreastCancer['diagnosis'], test_size=0.25, 
                                                    random_state=101)
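A 75/25 split can also be stratified so that both sets keep the same share of malignant cases. The sketch below is only an illustration on made-up data: `X_demo` and `y_demo` are hypothetical stand-ins for the assignment's features and `diagnosis` column, and `stratify` is an optional extra that the task does not require.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 samples, 5 features, 37% positive labels
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([1] * 74 + [0] * 126)

# stratify=y_demo keeps the class proportions (almost) identical in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101, stratify=y_demo
)

print('train size:', len(y_tr), 'test size:', len(y_te))
print('positive share in train: %.2f' % y_tr.mean())
print('positive share in test:  %.2f' % y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more or fewer malignant cases than the training set, which makes the reported scores harder to compare.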

### Task 3: Try all the covered classification models (you may use others):
- print summary classification report
- plot Normalized confusion matrix
- plot ROC Curve (you may use [plot_roc_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html) function from sklearn instead of doing it manually)
- print the accuracy_score & roc_auc_score scores
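Because the same four steps are repeated for every model, they can be wrapped in one function. The sketch below is one possible way to do that, not part of the assignment: `evaluate_model` is a hypothetical helper name, and the data comes from scikit-learn's bundled copy of the Wisconsin diagnostic data set rather than the assignment's CSV file, so the numbers it prints are not the assignment's results. The bundled copy encodes malignant as 0, so the labels are flipped to match the encoding used here (1 = malignant).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit `model`, print its reports, return (accuracy, ROC AUC)."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Column 1 of predict_proba is the probability of label 1 (malignant)
    y_score = model.predict_proba(X_test)[:, 1]
    print(classification_report(y_test, pred))
    # normalize='true' scales each row (true class) so that it sums to 1
    print(confusion_matrix(y_test, pred, normalize='true'))
    return accuracy_score(y_test, pred), roc_auc_score(y_test, y_score)


# Stand-in data: the first five columns of the bundled copy are the mean
# radius, texture, perimeter, area and smoothness.
data = load_breast_cancer()
X = data.data[:, :5]
y = 1 - data.target  # bundled copy uses 0 = malignant, so flip to 1 = malignant

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=101)
model = make_pipeline(StandardScaler(), LogisticRegression())
acc, auc_value = evaluate_model(model, X_tr, X_te, y_tr, y_te)
print('accuracy score = %.2f' % acc)
print('roc auc score = %.2f' % auc_value)
```

Any classifier that implements `predict_proba` can be passed in the same way, which keeps the per-model cells short and the metrics computed identically.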

In [None]:
# Logistic regression
reg = LogisticRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))
# Normalized confusion matrix
print('Normalized confusion matrix')
# 'orange' is not a valid matplotlib colormap name; 'Oranges' is
plot_confusion_matrix(reg, X_test, y_test, normalize='true', cmap='Oranges')
plt.show()
# Compute fpr, tpr, thresholds and roc auc from the predicted probability of class 1
y_score = reg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
# Print the accuracy_score & roc_auc_score scores

lg_a = accuracy_score(y_test, pred)
lg_r = roc_auc

print('accuracy score = %.2f' % lg_a)
print('roc auc score = %.2f' % lg_r)
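The `normalize='true'` option used for the confusion matrix plot divides each row by the number of samples whose true label is that row's class. A small sketch with made-up labels (the ten values below are hypothetical, not taken from the data set) shows the same calculation done by hand:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten cases (1 = malignant, 0 = benign)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

counts = confusion_matrix(y_true, y_pred)
# Divide each row by its total so that every row sums to 1
by_hand = counts / counts.sum(axis=1, keepdims=True)
from_sklearn = confusion_matrix(y_true, y_pred, normalize='true')

print(counts)
print(by_hand)
print(np.allclose(by_hand, from_sklearn))
```

In the normalized matrix the bottom-right entry is the sensitivity (here 3 of 4 malignant cases, 0.75) and the top-left entry is the specificity.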

In [None]:
#knn
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
# Printing confusion matrix, classification report and accuracy score
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
print(accuracy_score(y_test,pred))
# Compute fpr, tpr, thresholds and roc auc
y_score = knn.predict_proba(X_test)[::,1]
fpr, tpr, thresholds = roc_curve(y_test,y_score)
roc_auc = roc_auc_score(y_test,y_score)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
KNN_a = accuracy_score(y_test, pred)
KNN_r = roc_auc

print('accuracy score = %.2f' % KNN_a)
print('roc auc score = %.2f' % KNN_r)


In [None]:
# Decision tree
DT = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pred = DT.predict(X_test)
#Normalized confusion matrix
print ('Normalized confusion matrix')
plot_confusion_matrix(DT, X_test, y_test, normalize='true', cmap ='Blues')  
plt.show() 

In [None]:
# Compute fpr, tpr, thresholds and roc auc
y_score = DT.predict_proba(X_test)[::,1]
fpr, tpr, thresholds = roc_curve(y_test,y_score)
roc_auc = roc_auc_score(y_test,y_score)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
Decision_Tree_a = accuracy_score(y_test, pred)
Decision_Tree_r = roc_auc
print('accuracy score = %.2f' % Decision_Tree_a)
print('roc auc score = %.2f' % Decision_Tree_r)

In [None]:
# Random forest
RF = RandomForestClassifier(n_estimators=100, random_state=0)
RF.fit(X_train, y_train)
pred = RF.predict(X_test)
# Printing confusion matrix, classification report and accuracy score
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
print(accuracy_score(y_test,pred))
#Normalized confusion matrix
print ('Normalized confusion matrix')
plot_confusion_matrix(RF, X_test, y_test, normalize='true', cmap ='Blues')  
plt.show() 

In [None]:
# Compute fpr, tpr, thresholds and roc auc
y_score = RF.predict_proba(X_test)[::,1]
fpr, tpr, thresholds = roc_curve(y_test,y_score)
roc_auc = roc_auc_score(y_test,y_score)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")

Random_Forest_a = accuracy_score(y_test, pred)
Random_Forest_r = roc_auc
print('accuracy score = %.2f' % Random_Forest_a)
print('roc auc score = %.2f' % Random_Forest_r)

In [None]:
#Making the object of GNB

GNB = GaussianNB()
GNB.fit(X_train, y_train)
pred = GNB.predict(X_test)
# Printing confusion matrix, classification report and accuracy score
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
print(accuracy_score(y_test,pred))

#Normalized confusion matrix
print ('Normalized confusion matrix')
plot_confusion_matrix(GNB, X_test, y_test, normalize='true', cmap ='Blues')  
plt.show() 
# Compute fpr, tpr, thresholds and roc auc
y_score = GNB.predict_proba(X_test)[::,1]
fpr, tpr, thresholds = roc_curve(y_test,y_score)
roc_auc = roc_auc_score(y_test,y_score)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
Naive_Bayes_a = accuracy_score(y_test, pred)
Naive_Bayes_r = roc_auc
print('accuracy score = %.2f' % Naive_Bayes_a)
print('roc auc score = %.2f' % Naive_Bayes_r)

In [None]:

# SVM needs features to be scaled
# Creating the SVM model (named svm_clf so the imported SVC class is not overwritten
# and the cell can be re-run)
svm_clf = make_pipeline(StandardScaler(),
                        SVC(probability=True))

svm_clf.fit(X_train, y_train)
pred = svm_clf.predict(X_test)
# Printing confusion matrix, classification report and accuracy score
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))
# Compute fpr, tpr, thresholds and roc auc
y_score = svm_clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
SVM_a = accuracy_score(y_test, pred)
SVM_r = roc_auc
print('accuracy score = %.2f' % SVM_a)
print('roc auc score = %.2f' % SVM_r)

### Task 4: Compare all model results:
- in one cell, print all accuracy and AUC scores
- which model has the highest accuracy and which has the highest AUC?
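One compact way to compare the models is to collect the scores in a table and let pandas pick the best row per metric. The sketch below uses made-up placeholder numbers and hypothetical model names (`model_a`, `model_b`, `model_c`); in the notebook the values would come from the variables computed in Task 3, such as `lg_a` and `lg_r`.

```python
import pandas as pd

# Placeholder scores for illustration only (not results from this assignment)
scores = {
    'model_a': {'accuracy': 0.90, 'auc': 0.93},
    'model_b': {'accuracy': 0.88, 'auc': 0.96},
    'model_c': {'accuracy': 0.85, 'auc': 0.85},
}

# One row per model, one column per metric
results = pd.DataFrame.from_dict(scores, orient='index')
print(results)

# idxmax returns the row label (model name) of the largest value in a column
best_accuracy = results['accuracy'].idxmax()
best_auc = results['auc'].idxmax()
print('highest accuracy:', best_accuracy)
print('highest auc:', best_auc)
```

Note that `idxmax` returns only the first model when several are tied, so ties still need to be checked by looking at the table.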

In [None]:
print('for logistic regression:')
print('accuracy score = %.2f' % lg_a)
print('roc auc score = %.2f' % lg_r)
print('for knn:')
print('accuracy score = %.2f' % KNN_a)
print('roc auc score = %.2f' % KNN_r)
print('for decision tree:')
print('accuracy score = %.2f' % Decision_Tree_a)
print('roc auc score = %.2f' % Decision_Tree_r)
print('for random forest:')
print('accuracy score = %.2f' % Random_Forest_a)
print('roc auc score = %.2f' % Random_Forest_r)
print('for naive bayes:')
print('accuracy score = %.2f' % Naive_Bayes_a)
print('roc auc score = %.2f' % Naive_Bayes_r)
print('for svm:')
print('accuracy score = %.2f' % SVM_a)
print('roc auc score = %.2f' % SVM_r)

In [None]:
# Which model has the highest accuracy and which has the highest AUC?
# Random Forest and SVM have the highest accuracy.
# Naive Bayes and SVM have the highest AUC score.

### Task 5: For cancer diagnosis problems, which metric do you think is more important, accuracy or AUC? Why? Depending on your choice, which model would you consider the best?


In [None]:
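# AUC is the more important metric for cancer diagnosis. When comparing the performance of
# machine learning algorithms, AUC is generally considered a more appropriate evaluation
# indicator than accuracy, because it is computed from the predicted probabilities across
# all decision thresholds rather than from a single cut-off.
# That deeper look at the predictions is what diagnostics needs, since the threshold can
# then be chosen to limit missed malignant cases.
# Based on the Task 4 results, SVM is the best choice: it is among the top models on both
# AUC and accuracy.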
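A small worked example helps show how the two metrics can disagree. The sketch below uses a made-up, heavily imbalanced sample (95 benign, 5 malignant) and a deliberately useless "model" that assigns every case the same low score; the sample sizes and the 0.5 cut-off are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical sample: 95 benign (0) and 5 malignant (1) cases
y_true = np.array([0] * 95 + [1] * 5)

# A useless "model": every case gets the same low malignancy score
y_score = np.full(100, 0.1)
y_pred = (y_score >= 0.5).astype(int)  # so every case is predicted benign

acc = accuracy_score(y_true, y_pred)          # share of correct predictions
auc_value = roc_auc_score(y_true, y_score)    # ranking quality of the scores
sensitivity = recall_score(y_true, y_pred)    # share of malignant cases found

print('accuracy    = %.2f' % acc)
print('roc auc     = %.2f' % auc_value)
print('sensitivity = %.2f' % sensitivity)
```

Accuracy comes out at 0.95 even though no malignant case is detected (sensitivity 0.00), while the AUC of 0.50 correctly signals that the scores carry no information. This is the kind of situation in which accuracy alone is misleading for diagnosis.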