![breast_cancer.webp](attachment:breast_cancer.webp)

### Introduction

Breast cancer is one of the most common types of cancer in women worldwide and can often be fatal (Ferlay et al., 2010). A proper diagnosis of breast cancer is attained through the examination of many mammographic and clinical features, and a diagnostic system should be able to differentiate between benign and malignant masses. To help radiologists decide whether a tumor is benign or malignant, the possibility of using automated tools has been explored. Artificial intelligence and machine learning methods have been widely applied to distinguishing benign from malignant tumors. Because cancer is potentially fatal, the accuracy of a model's predictions is widely regarded as the main criterion for selecting it. Artificial Neural Networks (ANN) have gained considerable interest in medical analysis because of their ability to model non-linear relationships.
This study aims to showcase the use of ANN in breast cancer diagnosis using the Wisconsin Breast Cancer dataset from Kaggle.

### Objective

The objective of this study is to create an ANN-based model that classifies a breast mass as benign or malignant.


### About the Dataset
The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The dataset is also available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

The Attribute Information is as follows:

* ID number
* Diagnosis (M = malignant, B = benign)
* Columns 3-32: thirty real-valued input features

Ten real-valued features are computed for each cell nucleus:

* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension ("coastline approximation" - 1)

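For each image, the mean, the standard error, and the "worst" (largest) value of each of these ten features are reported, which gives 30 features in total. All feature values are recorded with four significant digits.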
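
The column names in the loaded table follow a regular pattern: each of the ten measurements is combined with one of three suffixes (`_mean`, `_se`, `_worst`). The sketch below rebuilds the 30 expected feature names from that pattern. The helper name `expected_feature_columns` is hypothetical, and the spellings (for example `concave points` with a space) are taken from the column listing that appears later in this notebook.

```python
from itertools import product

# The ten nucleus measurements, spelled as they appear in the CSV header.
MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes used in the column names: mean, standard error, and "worst".
STATISTICS = ["mean", "se", "worst"]


def expected_feature_columns():
    """Return the 30 feature column names, grouped by statistic."""
    return [f"{m}_{s}" for s, m in product(STATISTICS, MEASUREMENTS)]


feature_columns = expected_feature_columns()
print(len(feature_columns))   # 30
print(feature_columns[:3])    # ['radius_mean', 'texture_mean', 'perimeter_mean']
```

A list like this can serve as a quick sanity check after loading, for instance by comparing it against the columns of the DataFrame to spot a renamed or missing feature.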


### Data Exploration

In [1]:
import sys
# Install TensorFlow into the interpreter that runs this kernel.
# The path is quoted so that directories containing spaces do not break the shell command.
!"{sys.executable}" -m pip install tensorflow


In [3]:
#Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf

### Step 1: Load the Dataset

In [4]:
# Read in the data
df = pd.read_csv("../data/data.csv")
df.head(10) # show the first ten rows

|   | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |
| 5 | 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ... | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 | NaN |
| 6 | 844359 | M | 18.25 | 19.98 | 119.6 | 1040.0 | 0.09463 | 0.109 | 0.1127 | 0.074 | ... | 27.66 | 153.2 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 | NaN |
| 7 | 84458202 | M | 13.71 | 20.83 | 90.2 | 577.9 | 0.1189 | 0.1645 | 0.09366 | 0.05985 | ... | 28.14 | 110.6 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.1151 | NaN |
| 8 | 844981 | M | 13.0 | 21.82 | 87.5 | 519.8 | 0.1273 | 0.1932 | 0.1859 | 0.09353 | ... | 30.73 | 106.2 | 739.3 | 0.1703 | 0.5401 | 0.539 | 0.206 | 0.4378 | 0.1072 | NaN |
| 9 | 84501001 | M | 12.46 | 24.04 | 83.97 | 475.9 | 0.1186 | 0.2396 | 0.2273 | 0.08543 | ... | 40.68 | 97.65 | 711.4 | 0.1853 | 1.058 | 1.105 | 0.221 | 0.4366 | 0.2075 | NaN |


### Step 2: Data Preprocessing

#### Data Cleaning
In this step, we drop all the irrelevant columns from the dataset, encode the diagnosis column and check for missing values.

In [5]:
# Drop irrelevant columns: the row identifier and the empty trailing column
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

# Encode the diagnosis column: malignant (M) = 1, benign (B) = 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

# Check for missing values in each column
print(df.isnull().sum())

# Count the rows in each diagnosis class (0 = benign, 1 = malignant)
df['diagnosis'].value_counts()

diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64


diagnosis
0    357
1    212
Name: count, dtype: int64

There are no missing attribute values, and the class distribution is as follows:
* 357 benign
* 212 malignant
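
The two classes are not evenly balanced, which matters when reading accuracy figures later on. As a rough illustration, the sketch below turns the counts above into class shares and computes the accuracy that a trivial classifier would reach by always predicting the majority class. The variable names are illustrative only.

```python
# Class counts reported by value_counts() above (0 = benign, 1 = malignant).
class_counts = {"benign": 357, "malignant": 212}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

print(total)  # 569
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any trained model should clearly beat this baseline (about 63%) before its accuracy is considered meaningful.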

#### Feature Scaling
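First, we separate the independent variables (the 30 features) from the dependent variable (the diagnosis column). We then standardize the features with StandardScaler and hold out 20% of the rows as a test set. No label encoder is needed at this point because the diagnosis column was already mapped to 0 and 1 during data cleaning.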

In [7]:
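X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Split into train and test sets first, so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)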
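
Standardization rescales a feature to zero mean and unit variance with z = (x - mean) / std. To make that concrete, the sketch below applies the formula by hand to a toy column: the first five `radius_mean` values from the preview table play the role of training data, and a sixth value from the same table plays the role of a held-out row. The split into "train" and "held-out" here is purely illustrative. The population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

# First five radius_mean values from the preview table, used as a toy training column.
train_values = [17.99, 20.57, 19.69, 11.42, 20.29]
# Another radius_mean value from the same table, treated here as a held-out row.
held_out_value = 13.71

# Statistics are estimated from the training values only.
mu = fmean(train_values)
sigma = pstdev(train_values)  # population standard deviation (ddof = 0)


def standardize(x, mean, std):
    """Apply z = (x - mean) / std to a single value."""
    return (x - mean) / std


train_scaled = [standardize(x, mu, sigma) for x in train_values]
# The held-out value reuses the training mean and std; it is never used to fit them.
held_out_scaled = standardize(held_out_value, mu, sigma)

print([round(z, 3) for z in train_scaled])
print(round(held_out_scaled, 3))
```

After scaling, the training values have mean 0 and standard deviation 1, while the held-out value is simply expressed in the same units.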


### Step 3: Build ANN Model
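
As a starting point for this step, the sketch below shows one possible small feed-forward network built with `tf.keras`. It is not the notebook's final model: the layer sizes (16 and 8 units), the ReLU activations, and the Adam optimizer are assumptions chosen only for illustration, and the helper name `build_model` is hypothetical. The input width of 30 matches the number of feature columns, and the single sigmoid output gives the predicted probability of the malignant class (encoded as 1). A random batch stands in for real data so that the block runs on its own.

```python
import numpy as np
import tensorflow as tf

N_FEATURES = 30  # 10 nucleus measurements x (mean, se, worst)


def build_model(n_features: int = N_FEATURES) -> tf.keras.Model:
    """Build and compile a small binary classifier (illustrative architecture)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(16, activation="relu"),    # assumed hidden size
        tf.keras.layers.Dense(8, activation="relu"),     # assumed hidden size
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(malignant)
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()

# Random stand-in batch with the same width as the standardized features.
rng = np.random.default_rng(0)
demo_batch = rng.normal(size=(4, N_FEATURES)).astype("float32")
probs = model.predict(demo_batch, verbose=0)
print(probs.shape)  # (4, 1)
```

With the real data, the same model could be trained by calling `model.fit` on the standardized training set and evaluated on the test set with the `classification_report` and `confusion_matrix` functions imported earlier.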

### References
1. Sepandi M, Taghdir M, Rezaianzadeh A, Rahimikazerooni S. Assessing Breast Cancer Risk with an Artificial Neural Network. Asian Pac J Cancer Prev. 2018 Apr 25;19(4):1017-1019. doi: 10.22034/APJCP.2018.19.4.1017. PMID: 29693975; PMCID: PMC6031801.
2. Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 2021 Sep 27;13(1):152. doi: 10.1186/s13073-021-00968-x. PMID: 34579788; PMCID: PMC8477474.