## Importing libraries and dataset

Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

Reading the dataset

In [None]:
breast_cancer_df=pd.read_csv("cancer_detection.csv")
# Drop the columns that are not used as features; a list keeps the labels ordered
breast_cancer_df=breast_cancer_df.drop(columns=["Unnamed: 32","id"])
print(breast_cancer_df.head())

## Performing EDA on the dataset

Converting the categorical columns to numeric columns 

In [None]:
le=LabelEncoder()
breast_cancer_df["diagnosis"]=le.fit_transform(breast_cancer_df["diagnosis"])
print(breast_cancer_df.head())
print(breast_cancer_df["diagnosis"].value_counts())
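
To check which integer each class receives, the fitted encoder's `classes_` attribute can be inspected. The following is a minimal sketch on a toy series; it assumes the diagnosis column holds the labels "B" and "M", which the notebook itself does not state.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the diagnosis column (assumed values "B" and "M")
toy_diagnosis = pd.Series(["M", "B", "B", "M", "B"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_diagnosis)

# classes_ is sorted, so a label's position in the array is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Under that assumption, "B" would be encoded as 0 and "M" as 1, because `LabelEncoder` assigns codes in sorted label order.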

We first want to look at how the diagnosis is distributed, i.e. what proportion of the tumors were diagnosed as malignant or benign.

In [None]:
sns.barplot(x=breast_cancer_df["diagnosis"].value_counts().index,y=breast_cancer_df["diagnosis"].value_counts(),hue=breast_cancer_df["diagnosis"].value_counts().index)
plt.xlabel("Type of tumor")
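
The bar plot shows absolute counts. If proportions are wanted instead, `value_counts(normalize=True)` returns them directly. A small sketch on toy values (not the real dataset):

```python
import pandas as pd

# Toy encoded diagnosis column: five 0s and three 1s (illustrative values)
toy_diagnosis = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# normalize=True divides each count by the total number of rows
proportions = toy_diagnosis.value_counts(normalize=True)
print(proportions)
```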

Identify the correlations

In [None]:
# print(breast_cancer_df.columns)
correlation_matrix=breast_cancer_df.corr()
plt.figure(figsize=(20,20))
sns.heatmap(correlation_matrix,annot=True)
correlation_coefficient=correlation_matrix.loc["diagnosis"]
print(correlation_coefficient.dtype)
# print(temp_correlation_coefficient[1])
# print(correlation_coefficient[correlation_coefficient>=0.5])

Since many features influence whether a tumor is malignant or benign, we keep only those that are most strongly correlated with the diagnosis. Here we extract the 15 features with the highest absolute correlation; these features are then used to build our models.

In [None]:
new_correlation_coefficient=abs(correlation_coefficient)
temp_correlation_coefficient=new_correlation_coefficient.sort_values(ascending=False)
top=temp_correlation_coefficient[1:16]
bottom=temp_correlation_coefficient[16:]
# print(top)
print("\n")
top_correlation_coefficient=correlation_coefficient[top.index]
print(top_correlation_coefficient)
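
The selection step can also be illustrated on synthetic data. The sketch below uses hypothetical column names (`target`, `strong_pos`, `strong_neg`, `noise`) and drops the target by name instead of slicing from position 1, which avoids relying on the target sorting first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic frame: two features tied to the target, one unrelated (all hypothetical)
toy_df = pd.DataFrame({
    "target": target,
    "strong_pos": target + rng.normal(0, 0.1, n),
    "strong_neg": -target + rng.normal(0, 0.1, n),
    "noise": rng.normal(0, 1, n),
})

# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy_df.corr()["target"].drop("target")

# Rank by absolute value so strongly negative features are kept as well
top_k = corr_with_target.abs().sort_values(ascending=False).head(2)

# Look the signed coefficients back up for the selected features
selected = corr_with_target[top_k.index]
print(selected)
```

Ranking on the absolute value but reporting the signed coefficient mirrors what the cell above does with `top.index`.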

In [None]:
new_breast_cancer_df=breast_cancer_df.drop(labels=bottom.index,axis=1)
print(new_breast_cancer_df.head())

We would now try to visualize the correlations of the features by using a bar chart.

In [None]:
sns.barplot(x=top_correlation_coefficient.index,y=top_correlation_coefficient)
plt.xticks(rotation=75)
plt.xlabel("Features")

We can also get an idea about how the values of the features themselves are distributed using a histogram for each of these features. 

In [None]:
for col in new_breast_cancer_df.columns:
    plt.figure()
    sns.histplot(new_breast_cancer_df[col],kde=True)

## Data Preprocessing

The histograms above show that the features span very different ranges. This can cause problems during modelling, because features with larger magnitudes may be given undue weight. We therefore standardize the inputs so that every feature has zero mean and unit variance.

In [None]:
scaler=StandardScaler()
diagnosis_df=new_breast_cancer_df["diagnosis"]
scaled_breast_cancer_array=scaler.fit_transform(new_breast_cancer_df[top.index])
print(scaled_breast_cancer_array.shape)
scaled_breast_cancer_inputs=pd.DataFrame(scaled_breast_cancer_array,columns=top.index)
scaled_breast_cancer_df=scaled_breast_cancer_inputs.join(diagnosis_df)
print(scaled_breast_cancer_df.head(),"\n",scaled_breast_cancer_df["diagnosis"].value_counts())
for col in scaled_breast_cancer_df.columns:
    print(scaled_breast_cancer_df[col].describe())
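
The effect of `StandardScaler` is easiest to verify on a small array. The sketch below uses made-up values for two features on very different scales and checks that both end up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values)
toy_features = np.array([[1.0, 1000.0],
                         [2.0, 2000.0],
                         [3.0, 3000.0],
                         [4.0, 4000.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_features)

# Each column is centred on its mean and divided by its population standard deviation
column_means = toy_scaled.mean(axis=0)
column_stds = toy_scaled.std(axis=0)
print(column_means)
print(column_stds)
```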

Splitting the dataset into training and testing sets. The model is trained using only the training set while the test set is then used to check whether the model performs similarly when confronted with previously unknown data. 

In [11]:
breast_cancer_train,breast_cancer_test=train_test_split(scaled_breast_cancer_df,test_size=0.2,random_state=42,shuffle=True,stratify=scaled_breast_cancer_df["diagnosis"])

The diagnosis column is categorical and is also our target variable, so we want its classes to appear in the same proportion in both the training and the test set. This is why the split is stratified on the diagnosis column: we do not want a higher share of 0s in one set and a higher share of 1s in the other.

In [None]:
breast_cancer_train.diagnosis.value_counts()

In [None]:
breast_cancer_test.diagnosis.value_counts()
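
Comparing class shares rather than raw counts makes the effect of `stratify` easier to see. A minimal sketch on a toy frame with a 70/30 class split (illustrative, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 70 rows of class 0 and 30 rows of class 1
toy_df = pd.DataFrame({
    "feature": np.arange(100),
    "diagnosis": [0] * 70 + [1] * 30,
})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["diagnosis"]
)

# The mean of a 0/1 column is the share of class 1
train_share = toy_train["diagnosis"].mean()
test_share = toy_test["diagnosis"].mean()
print(train_share, test_share)
```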

## Model Selection

We now compare the results obtained with various classification models and note down their evaluation metrics. Models such as Logistic Regression, Decision Tree, Random Forest, Naive Bayes, KNN, etc. are used for the comparison, and each is evaluated with a confusion matrix, an accuracy score and a classification report. We would also like to look at the weights assigned to each of the features by these models and compare them with the coefficients obtained during the data analysis.

In [14]:
def model_evaluation_train(model):
    train_inputs=breast_cancer_train[top.index]
    train_targets=breast_cancer_train["diagnosis"]
    model.fit(train_inputs,train_targets)
    train_predictions=model.predict(train_inputs)
    print(classification_report(train_targets,train_predictions))
    print(confusion_matrix(train_targets,train_predictions))
    train_accuracy=accuracy_score(train_targets,train_predictions)
    print(train_accuracy)

    test_inputs=breast_cancer_test[top.index]
    test_targets=breast_cancer_test["diagnosis"]
    test_predictions=model.predict(test_inputs)
    print(classification_report(test_targets,test_predictions))
    print(confusion_matrix(test_targets,test_predictions))
    test_accuracy=accuracy_score(test_targets,test_predictions)
    print(test_accuracy)

    return train_accuracy,test_accuracy

    # weights=model.coef_
    # print(weights)
    # sns.barplot(x=top.index,y=weights.flatten())
    # plt.xticks(rotation=75)
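
The weight comparison mentioned above is only possible for models that expose weights. The sketch below introduces a hypothetical helper, `feature_weights`, which reads `coef_` from linear models and `feature_importances_` from tree-based ones, and returns `None` otherwise. It assumes a binary target and uses synthetic data, not the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def feature_weights(fitted_model, feature_names):
    """Return per-feature weights as a Series, or None if the model exposes none.

    Hypothetical helper; assumes a binary classifier so coef_ has a single row.
    """
    if hasattr(fitted_model, "coef_"):
        values = np.ravel(fitted_model.coef_)
    elif hasattr(fitted_model, "feature_importances_"):
        values = fitted_model.feature_importances_
    else:
        return None
    return pd.Series(values, index=feature_names)


# Synthetic data: the label depends only on the "informative" column
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "informative": rng.normal(0, 1, 300),
    "noise": rng.normal(0, 1, 300),
})
toy_y = (toy_X["informative"] > 0).astype(int)

linear_weights = feature_weights(LogisticRegression().fit(toy_X, toy_y), toy_X.columns)
forest_weights = feature_weights(
    RandomForestClassifier(random_state=0).fit(toy_X, toy_y), toy_X.columns
)
print(linear_weights)
print(forest_weights)
```

Note that coefficients and importances are on different scales, so they can be compared in ranking but not in magnitude.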

Logistic Regression

In [None]:
logistic_train_accuracy,logistic_test_accuracy=model_evaluation_train(LogisticRegression())

Decision Tree

In [None]:
dt_train_accuracy,dt_test_accuracy=model_evaluation_train(DecisionTreeClassifier())

Random Forest

In [None]:
rf_train_accuracy,rf_test_accuracy=model_evaluation_train(RandomForestClassifier())

KNN

In [None]:
knn_train_accuarcy,knn_test_accuracy=model_evaluation_train(KNeighborsClassifier())

KNN with custom number of neighbours

In [None]:
modifiedknn_train_accuracy,modifiedknn_test_accuracy=model_evaluation_train(KNeighborsClassifier(n_neighbors=3))

Naive Bayes

In [None]:
bayes_train_accuracy,bayes_test_accuracy=model_evaluation_train(GaussianNB())

Stochastic Gradient Descent

In [None]:
sgd_train_accuracy,sgd_test_accuracy=model_evaluation_train(SGDClassifier())

SVC

In [None]:
svc_train_accuracy,svc_test_accuracy=model_evaluation_train(SVC())

XGBoost

In [None]:
xg_train_accuracy,xg_test_accuracy=model_evaluation_train(XGBClassifier())

Gradient Boosting

In [None]:
gb_train_accuracy,gb_test_accuracy=model_evaluation_train(GradientBoostingClassifier())

## Visualizing the results and comparing the models

In [None]:
# Collect (model name, train accuracy, test accuracy) for every model evaluated above
results=[
    ("LogisticRegression",logistic_train_accuracy,logistic_test_accuracy),
    ("DecisionTreeClassifier",dt_train_accuracy,dt_test_accuracy),
    ("RandomForestClassifier",rf_train_accuracy,rf_test_accuracy),
    ("KNeighborsClassifier",knn_train_accuarcy,knn_test_accuracy),
    ("KNeighborsClassifierWith3Neighbours",modifiedknn_train_accuracy,modifiedknn_test_accuracy),
    ("Naive Bayes",bayes_train_accuracy,bayes_test_accuracy),
    ("SGDClassifier",sgd_train_accuracy,sgd_test_accuracy),
    ("SVC",svc_train_accuracy,svc_test_accuracy),
    ("XGBClassifier",xg_train_accuracy,xg_test_accuracy),
    ("GradientBoostingClassifier",gb_train_accuracy,gb_test_accuracy),
]

# Building the DataFrame from tuples keeps the scores numeric, so no conversion is needed
results_df=pd.DataFrame(results,columns=["models","train_scores","test_scores"])
# print(results_df.head(),"\n",results_df["train_scores"].dtype,results_df["test_scores"].dtype)

results_df.plot(kind="bar",xlabel="models",ylabel="scores",x="models",color=["mediumpurple","rebeccapurple"],title="Model Comparison")
plt.xticks(rotation=85)
plt.ylim([0.9,1.0])
plt.show()

## Conclusion

From the model comparison bar chart we can observe that: 
1. Some models, such as Decision Tree, Random Forest, XGBoost and Gradient Boosting, have overfitted the data: they reach 100% accuracy on the training data, while their accuracy on the test data is lower.
2. The SVC model shows similar accuracy on the training and test data, and its accuracy of almost 96% is also high. This indicates that the SVC model could be a suitable choice for breast cancer detection.
3. The k-nearest neighbours model was used with two different settings: the default, where the number of neighbours is 5, and a modified version with 3 neighbours. The modified model gives a higher accuracy score than the default, although the gap between its training and test accuracy is larger.
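
The overfitting pattern in point 1 can be reproduced on synthetic data. The sketch below is illustrative only: it uses `make_classification` in place of the real features and compares an unrestricted decision tree with one limited to `max_depth=3`, a value chosen arbitrarily for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (illustrative, not the notebook's dataset)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

trees = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Record train accuracy, test accuracy and the gap between them for each tree
scores = {}
for name, tree in trees.items():
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    scores[name] = (train_acc, test_acc, train_acc - test_acc)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

An unrestricted tree memorizes the training set, so its training accuracy says little about generalization; the train/test gap is the more informative number.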