# **OncoScope: Harnessing the Power of Machine Learning for Early Detection and Treatment of Cancer**

## **Problem Statement:** Develop a predictive model that accurately classifies breast tumors as malignant or benign based on their characteristics. Healthcare providers can then use the model to aid in the diagnosis and treatment of breast cancer patients.

# Importing the libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score

# Importing the dataset

In [None]:
dataset = pd.read_csv('breast_cancer.csv')

# Viewing the dataset

In [None]:
dataset.head()

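| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |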


# Getting the shape of the dataset

In [None]:
dataset.shape


(683, 11)

# Getting summary information about the dataset

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           683 non-null    int64
 1   Clump Thickness              683 non-null    int64
 2   Uniformity of Cell Size      683 non-null    int64
 3   Uniformity of Cell Shape     683 non-null    int64
 4   Marginal Adhesion            683 non-null    int64
 5   Single Epithelial Cell Size  683 non-null    int64
 6   Bare Nuclei                  683 non-null    int64
 7   Bland Chromatin              683 non-null    int64
 8   Normal Nucleoli              683 non-null    int64
 9   Mitoses                      683 non-null    int64
 10  Class                        683 non-null    int64
dtypes: int64(11)
memory usage: 58.8 KB


# Getting descriptive statistics of the dataset

In [None]:
dataset.describe()

| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 |
| mean | 1076720.0 | 4.442167 | 3.150805 | 3.215227 | 2.830161 | 3.234261 | 3.544656 | 3.445095 | 2.869693 | 1.603221 | 2.699854 |
| std | 620644.0 | 2.820761 | 3.065145 | 2.988581 | 2.864562 | 2.223085 | 3.643857 | 2.449697 | 3.052666 | 1.732674 | 0.954592 |
| min | 63375.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 877617.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171795.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238705.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 6.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |
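
The `Class` statistics above allow a quick sanity check on class balance. As a rough sketch, assuming `Class` only takes the values 2 and 4 (consistent with the min, quartiles, and max shown, though not proven by them) and assuming the common convention that 2 means benign and 4 means malignant (the notebook does not state this), the reported mean is enough to back out the per-class counts:

```python
# Back out class counts from the describe() output above.
# Assumption: Class takes only the values 2 and 4.
# Assumption: 2 = benign, 4 = malignant (not stated in the notebook).
n_samples = 683
class_mean = 2.699854  # mean of the Class column from describe()

# If a fraction f of rows has Class == 4, the mean equals 2 + 2 * f.
frac_class_4 = (class_mean - 2) / (4 - 2)
n_class_4 = round(frac_class_4 * n_samples)
n_class_2 = n_samples - n_class_4

print(f"Class 2: {n_class_2}, Class 4: {n_class_4}")
print(f"Share of Class 4: {n_class_4 / n_samples:.1%}")
```

Under these assumptions the classes are moderately imbalanced, so accuracy is a reasonable but incomplete metric. Running `dataset['Class'].value_counts()` would confirm the counts directly.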


# Checking for missing values

In [None]:
dataset.isnull().sum()

Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

# Dividing dataset into Matrix of Features (X) and Dependent variable (y)

In [None]:
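# Skip column 0 (the sample ID, which carries no predictive signal);
# the nine columns after it are the features and the last column is the label.
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values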
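
To make the slicing concrete, here is a small sketch that rebuilds the five rows shown by `dataset.head()` and applies the same `iloc` calls. The `sample` DataFrame is a stand-in for the full dataset, used only for illustration.

```python
import pandas as pd

columns = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]
# The five rows printed by dataset.head() earlier in the notebook.
rows = [
    [1000025, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2],
    [1002945, 5, 4, 4, 5, 7, 10, 3, 2, 1, 2],
    [1015425, 3, 1, 1, 1, 2, 2, 3, 1, 1, 2],
    [1016277, 6, 8, 8, 1, 3, 4, 3, 7, 1, 2],
    [1017023, 4, 1, 1, 3, 2, 1, 3, 1, 1, 2],
]
sample = pd.DataFrame(rows, columns=columns)

# Column 0 (the sample ID) is skipped; the last column is the label.
X_sample = sample.iloc[:, 1:-1].values
y_sample = sample.iloc[:, -1].values

print(X_sample.shape)  # rows x feature columns
print(y_sample)
```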

# Splitting the dataset into Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)
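
A quick way to see what `test_size=0.2` means for 683 rows is to split a placeholder array of the same length. The sketch below uses dummy data (`X_dummy` and `y_dummy` are illustrative names, not the real features) and only inspects the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (683).
X_dummy = np.arange(683).reshape(-1, 1)
y_dummy = np.zeros(683, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=0
)

# The test-set size is rounded up: ceil(0.2 * 683) = 137.
print(len(X_tr), len(X_te))
```

The 137 test rows match the totals of the confusion matrices printed for each model. Passing `stratify=y` is an optional variation that keeps the class ratio the same in both subsets; the notebook does not use it.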

# Logistic Regression Classification model

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# Fit on the training set, then evaluate on the held-out test set
LR_classifier = LogisticRegression(random_state=0)
LR_classifier.fit(X_train, y_train)
y_pred = LR_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred)*100}%")

Confusion Matrix:
 [[84  3]
 [ 3 47]]
Accuracy : 95.62043795620438%
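
Accuracy hides which kind of error the model makes, and in a diagnostic setting a missed malignant tumor is usually the costlier one. As a sketch, the per-class rates can be read off the confusion matrix above. It relies on scikit-learn's layout (rows are true labels, columns are predicted labels, in sorted label order) and on the assumption, not stated in the notebook, that the second class (4) is the malignant one.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class).
cm_lr = np.array([[84, 3],
                  [3, 47]])

# Assumption: the second row/column is the malignant ("positive") class.
tn, fp = cm_lr[0]
fn, tp = cm_lr[1]

accuracy = (tp + tn) / cm_lr.sum()
sensitivity = tp / (tp + fn)  # share of positive cases that were caught
specificity = tn / (tn + fp)  # share of negative cases correctly cleared

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
```

The same arithmetic applies to any of the other confusion matrices in the notebook.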


# SVM Classification model

In [None]:
from sklearn.svm import SVC
SVM_classifier = SVC(kernel='linear', random_state=0)
SVM_classifier.fit(X_train, y_train)
# Evaluate the linear SVM on its own predictions
y_pred_SVM = SVM_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_SVM)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred_SVM)*100}%")


Confusion Matrix:
 [[84  3]
 [ 3 47]]
Accuracy : 95.62043795620438%


# Decision Tree Classification model

In [None]:
from sklearn.tree import DecisionTreeClassifier
DT_classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
DT_classifier.fit(X_train, y_train)
y_pred_DT = DT_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_DT)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred_DT)*100}%")


Confusion Matrix:
 [[84  3]
 [ 3 47]]
Accuracy : 95.62043795620438%
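
The `criterion='entropy'` setting asks the tree to choose splits that reduce the Shannon entropy of the class labels. As a worked illustration, the entropy of the test-set labels can be computed from the row totals of the confusion matrix above (87 and 50 samples). The tree itself starts from the analogous value on the training set, which the notebook does not print. The `entropy` helper is written for this example and is not a scikit-learn function.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # drop empty classes to avoid dividing by zero
    return float(np.sum(p * np.log2(1.0 / p)))

# Row totals of the confusion matrix above: 84 + 3 and 3 + 47.
h_test = entropy([87, 50])
print(f"Entropy of test labels: {h_test:.3f} bits")

# Reference points: a pure node versus an evenly mixed node.
h_pure = entropy([10, 0])
h_mixed = entropy([5, 5])
print(h_pure, h_mixed)
```

A split is considered useful when the weighted entropy of the child nodes is lower than that of the parent.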


# SVM Kernel Classification model

In [None]:
from sklearn.svm import SVC
Kernel_classifier = SVC(kernel='rbf' , random_state=0)
Kernel_classifier.fit(X_train, y_train)
y_pred = Kernel_classifier.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred)*100}%")


Confusion Matrix:
 [[83  4]
 [ 1 49]]
Accuracy : 96.35036496350365%


# Random Forest Classification model

In [None]:
from sklearn.ensemble import RandomForestClassifier
RF_classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
RF_classifier.fit(X_train, y_train)
# Evaluate the random forest on its own predictions
y_pred_RF = RF_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_RF)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred_RF)*100}%")


Confusion Matrix:
 [[83  4]
 [ 1 49]]
Accuracy : 96.35036496350365%


# Naive Bayes Classification model

In [None]:
from sklearn.naive_bayes import GaussianNB
NB_classifier = GaussianNB()
NB_classifier.fit(X_train, y_train)
# Evaluate Naive Bayes on its own predictions
y_pred_NB = NB_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_NB)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred_NB)*100}%")


Confusion Matrix:
 [[83  4]
 [ 1 49]]
Accuracy : 96.35036496350365%


# KNN Classification model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNN_classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
KNN_classifier.fit(X_train, y_train)
# Evaluate KNN on its own predictions
y_pred_KNN = KNN_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_KNN)
print(f"Confusion Matrix:\n {cm}")
print(f"Accuracy : {accuracy_score(y_test, y_pred_KNN)*100}%")


Confusion Matrix:
 [[83  4]
 [ 1 49]]
Accuracy : 96.35036496350365%
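
With `metric='minkowski'` and `p=2`, KNN measures closeness with ordinary Euclidean distance. A small sketch using the feature values of the first two rows from `dataset.head()` shows the equivalence. The `minkowski` function here is a helper written for illustration, not a scikit-learn function.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Feature values (ID and Class removed) of the first two rows of head().
row0 = np.array([5, 1, 1, 1, 2, 1, 3, 1, 1])
row1 = np.array([5, 4, 4, 5, 7, 10, 3, 2, 1])

d_p2 = minkowski(row0, row1, p=2)
d_euclid = float(np.linalg.norm(row0 - row1))
d_p1 = minkowski(row0, row1, p=1)  # p=1 gives Manhattan distance instead

print(d_p2, d_euclid)  # the two values agree up to rounding
print(d_p1)
```

Because every feature in this dataset is on the same 1 to 10 scale, no single feature dominates the distance.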


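## **Conclusion:** On the 137-sample test set, the SVM kernel (RBF), Naive Bayes, Random Forest, and K-Nearest Neighbors (KNN) models (96.35% accuracy each) scored higher than the Logistic Regression, linear SVM, and Decision Tree models (95.62% accuracy each) for predicting breast cancer. The gap amounts to one test sample (132 versus 131 correct predictions out of 137), so the ranking is tentative, and the identical scores within each group are worth re-checking against each model's own predictions.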
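
Because every model is trained and scored the same way, the comparison can be written as one loop in which each classifier is evaluated on its own predictions. The sketch below is an illustration rather than the notebook's code: it uses synthetic placeholder data from `make_classification` (683 rows, 9 features) instead of `breast_cancer.csv`, so its accuracies say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): swap in X and y from the real dataset.
X_demo, y_demo = make_classification(
    n_samples=683, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same model settings as the cells in the notebook.
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf", random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    ),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)  # fresh predictions for this model only
    results[name] = accuracy_score(y_te, y_hat)
    print(name)
    print(confusion_matrix(y_te, y_hat))
    print(f"Accuracy: {results[name] * 100:.2f}%\n")
```

A single 137-row test set separates models by only a sample or two, so `cross_val_score` from `sklearn.model_selection` is a common way to get a steadier comparison.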