<h1 style="color: #001a79;"> Wisconsin Breast Cancer Analysis Notebook</h1>

<h3 style="color: #001a79;">For this notebook you will need the following packages:</h3>

In [1]:
import pandas as pd

<h2 style="color: #001a79;">Introduction</h2>

<hr style="border-top: 1px solid #001a79;" />

Breast cancer is a common cancer in Ireland, with more than 3,500 women and approximately 35 men diagnosed each year. It occurs when cells in the breast grow and divide in an uncontrolled way, creating a mass of tissue called a tumor. Signs of breast cancer can include a lump in the breast, a change in the size of the breast and changes to the skin of the breast. Breast cancer is treated with surgery, radiotherapy, chemotherapy, hormone therapy and targeted therapies, depending on the type.

<img src="content/breast-cancer.jpg" alt="Breast cancer" style="width: 350px;"/> 

Sources:<br>
<a href="https://www.cancer.ie/cancer-information-and-support/cancer-types/breast-cancer#:~:text=Each%20year%20in%20Ireland%2C%20more,therapies%2C%20depending%20on%20the%20type." target="_blank">Irish Cancer Society: Breast Cancer</a><br>
<a href="https://my.clevelandclinic.org/health/diseases/3986-breast-cancer" target="_blank">Cleveland Clinic: Breast Cancer</a>

The <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29" target="_blank">Wisconsin Breast Cancer (Diagnostic) dataset</a> (WBC) from the UCI Machine Learning Repository is a classification dataset that records measurements for breast cancer cases. There are two classes, benign and malignant:

<table style="border-collapse: collapse; width: 80%; height: 261px;" border="1">
<tbody>
<tr>
<td style="width: 50%; text-align: center;">
<h2><strong>1. Benign</strong></h2>
</td>
<td style="width: 50%; text-align: center;">
<h2><strong>2. Malignant</strong></h2>
</td>
</tr>
<tr>
<td style="width: 50%; text-align: center;">
<h3>Grow slowly and have distinct borders.</h3>
</td>
<td style="width: 50%; text-align: center;">
<h3>Can grow quickly and have irregular borders.</h3>
</td>
</tr>
<tr>
<td style="width: 50%; text-align: center;">
<h3>Do not invade surrounding tissue.</h3>
</td>
<td style="width: 50%; text-align: center;">
<h3>Often invade surrounding tissue.</h3>
</td>
</tr>
<tr>
<td style="width: 50%; text-align: center;">
<h3>Do not invade other parts of the body.</h3>
</td>
<td style="width: 50%; text-align: center;">
<h3>Can spread to other parts of the body through a process called metastasis.&nbsp;</h3>
</td>
</tr>
</tbody>
</table>

<img src="content/tumortype.PNG" alt="Benign versus Malignant" style="width: 550px;"/> 

Sources:<br>
<a href="https://jamanetwork.com/journals/jamaoncology/fullarticle/2768634" target="_blank">JAMA Oncology: Benign vs Malignant Tumors</a><br>
<a href="https://www.technologynetworks.com/cancer-research/articles/benign-vs-malignant-tumors-364765" target="_blank">Technology Networks (Cancer Research): Benign vs Malignant Tumors</a>

<h2 style="color: #001a79;">Breast Cancer Wisconsin (Diagnostic) Data Set</h2>

<hr style="border-top: 1px solid #001a79;" />

### Background

The diagnosis of breast tumors has traditionally been performed by a full biopsy, which involves extracting sample cells or tissue for examination to determine the presence or extent of disease. Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital, instead used fine needle aspiration (FNA): a small amount of breast tissue or fluid is extracted from the suspicious area with a thin, hollow needle and then checked for cancer cells.

Dr. Wolberg took fluid samples from patients with solid breast masses and used an interactive computer system that evaluates and diagnoses based on cytologic features derived directly from a digital scan of the FNA slides. The user initializes active contour models, known as snakes, near the boundaries of a set of cell nuclei. The snakes are then deformed to the exact shape of the nuclei, which allows precise, automated analysis of nuclear size, shape and texture. Ten such features are computed for each nucleus, and the mean value, largest (or 'worst') value and standard error of each feature are found over the range of isolated cells.


Samples from 569 patients provided the data used to develop this system. The program uses a curve-fitting algorithm to compute ten features for each cell in the sample, then calculates the mean value, extreme value and standard error of each feature for the image, returning 30 real-valued variables.
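
The aggregation step described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the original program: the function name `summarize_feature` and the radius values are hypothetical, and "worst" is taken here as the single largest value, as the text states.

```python
import math
import statistics


def summarize_feature(values):
    """Return (mean, standard error, worst) for one feature measured on several nuclei."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation divided by sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    # "Worst" is assumed to be the largest value observed across the nuclei.
    worst = max(values)
    return mean, se, worst


# Hypothetical radius measurements for five nuclei in one image (illustrative only).
radii = [10.0, 12.0, 11.0, 13.0, 14.0]
radius_mean, radius_se, radius_worst = summarize_feature(radii)
print(radius_mean, radius_se, radius_worst)
```

Repeating this for each of the ten features gives the 30 values recorded per sample.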

Sources:<br>
<a href="https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-breast.html#:~:text=During%20a%20fine%20needle%20aspiration,needle%20biopsy%20is%20often%20preferred" target="_blank">Cancer.org: Fine Needle Aspiration (FNA) of the Breast</a><br>
<a href="https://www.sciencedirect.com/science/article/abs/pii/030438359490099X" target="_blank">William H. Wolberg - Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates</a>

### Importing the Data

Source: <a href="https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data" target="_blank">Kaggle: Breast Cancer Wisconsin (Diagnostic) Data Set</a>

In [12]:
df = pd.read_csv("data.csv")

In [4]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


### About the Data

The data is made up of 569 rows, each representing a sample.

Its 33 columns are made up of:
1. The patient's ID and their diagnosis (`M` for malignant, `B` for benign). These are of type integer (`int64`) and string (`object`) respectively.
2. Ten real-valued features that are computed for each cell nucleus:
    - **Radius** - mean of distances from center to points on the perimeter
    - **Texture** - standard deviation of gray-scale values
    - **Perimeter** 
    - **Area**
    - **Smoothness** - local variation in radius lengths
    - **Compactness** - $\frac{\text{perimeter}^2}{\text{area}} - 1.0$
    - **Concavity** - severity of concave portions of the contour
    - **Concave Points** - number of concave portions of the contour
    - **Symmetry**
    - **Fractal dimension** - "coastline approximation" minus 1
    
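    For each of these ten features, the mean value, extreme ('worst') value and standard error are calculated, giving 30 feature columns with the suffixes `_mean`, `_se` and `_worst`. All feature values are floats recorded to four significant digits.
3. Missing attribute values: none among the ID, diagnosis and feature columns. The final column, `Unnamed: 32`, is empty in the rows shown above and carries no measurement; it is most likely produced by a trailing comma on each line of the CSV file.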
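
A quick way to check for missing values is to count them per column and drop any column that is entirely empty. The sketch below does this on a three-row excerpt built from the `df.head()` values above rather than on `data.csv` itself; the trailing comma on each line is an assumption about how the file is laid out, and the names `sample_csv`, `sample` and `cleaned` are illustrative.

```python
import io

import pandas as pd

# A three-row excerpt using values from the df.head() output above.
# The trailing comma on every line mimics the layout assumed for data.csv.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean,\n"
    "842302,M,17.99,10.38,\n"
    "842517,M,20.57,17.77,\n"
    "84300903,M,19.69,21.25,\n"
)
sample = pd.read_csv(sample_csv)

# Count missing values per column; only the unnamed trailing column is non-zero.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop columns that contain no data at all, then confirm the remaining shape.
cleaned = sample.dropna(axis="columns", how="all")
print(cleaned.shape)
```

On the full dataset the same two calls, `df.isna().sum()` and `df.dropna(axis="columns", how="all")`, should behave the same way if `Unnamed: 32` is empty throughout.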

In [6]:
df.shape

(569, 33)

In [8]:
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
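
The 30 feature columns in this listing follow a regular pattern: ten base feature names, each combined with one of three suffixes. As a small sketch, the names can be rebuilt programmatically; the variable names `base_features`, `suffixes` and `feature_columns` are illustrative and do not appear elsewhere in the notebook.

```python
# Ten base features, in the order they appear in the column listing.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics per feature: mean, standard error and worst (largest) value.
suffixes = ["mean", "se", "worst"]

# Suffix varies slowest, matching the column order: all *_mean, then *_se, then *_worst.
feature_columns = [f"{base}_{suffix}" for suffix in suffixes for base in base_features]
print(len(feature_columns))
```

A list like this makes it easy to select only the measurements, for example with `df[feature_columns]`, leaving out `id`, `diagnosis` and `Unnamed: 32`.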

In [10]:
df.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     