# Predicting Breast Cancer - Data Wrangling

## Capstone Project Two: Springboard Data Science Career Track 

### Notebook by Manthan Desai

## Introduction:

## 1. Problem Statement:

Breast cancer is a common and dangerous disease that forms in the cells of the breast. Nearly 12% of women worldwide are affected by it. Early detection remains vital for successful treatment and improved outcomes. Machine learning algorithms can help improve the accuracy of breast cancer detection by analyzing large datasets for relevant trends and the most important features.

In the context of this problem, how can tumors be classified as benign or malignant with a minimum accuracy of 80% based on nine features that describe the tumor?

## 2. The Data:

The dataset is acquired from OpenML.org
(https://www.openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfInstances=between_10000_100000&id=251). 

The dataset comprises the following fields:

 - id - Patient ID
 - Clump_Thickness - Indicates grouping of cancer cells in multilayer (Values range from 1-10).
 - Cell_Size_Uniformity - Indicates metastasis to lymph nodes (Values range from 1-10).
 - Cell_Shape_Uniformity - Identifies cancerous cells of varying size (Values range from 1-10).
 - Marginal_Adhesion - Quantifies loss of adhesion in cells (Values range from 1-10).
 - Single_Epi_Cell_Size - Quantifies the size of the epithelial cells (Values range from 1-10).
 - Bare_Nuclei - Quantifies the presence of bare nuclei in the cells (Values range from 1-10).
 - Bland_Chromatin - Quantifies the presence of bland chromatin in the cells (Values range from 1-10).
 - Normal_Nucleoli - Quantifies the presence of normal nucleoli in the cells (Values range from 1-10).
 - Mitoses - Quantifies the stage of Mitoses in the cells (Values range from 1-10).
 - Class - The target variable that classifies tumors as malignant (1) or benign (0). The raw file stores the label as the strings `malignant` and `benign`.
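The field list above can double as a lightweight schema check before any cleaning starts. The sketch below is illustrative only: `missing_columns` and `encode_class` are hypothetical helper names, the toy frame is made up, and the mapping simply assumes the labels are the strings `malignant` and `benign`.

```python
import pandas as pd

# Feature columns named in the field list (each expected to lie in 1-10).
FEATURE_COLUMNS = [
    "Clump_Thickness", "Cell_Size_Uniformity", "Cell_Shape_Uniformity",
    "Marginal_Adhesion", "Single_Epi_Cell_Size", "Bare_Nuclei",
    "Bland_Chromatin", "Normal_Nucleoli", "Mitoses",
]
EXPECTED_COLUMNS = ["id"] + FEATURE_COLUMNS + ["Class"]


def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def encode_class(labels):
    """Map string labels to the malignant=1 / benign=0 encoding (assumed labels)."""
    return labels.map({"malignant": 1, "benign": 0})


# Toy frame for illustration only; the values are not taken from the dataset.
toy = pd.DataFrame({"id": [1, 2], "Class": ["malignant", "benign"]})
toy_missing = missing_columns(toy)
toy_encoded = encode_class(toy["Class"]).tolist()
print(toy_missing)
print(toy_encoded)
```

Any label outside the mapping would come back as `NaN`, which makes unexpected class values easy to spot.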

## 3. Library Imports

In [1]:
import os
import pandas as pd
import numpy as np

from library.sb_utils import save_file

## 4. Data Collection

In [2]:
bc_data = pd.read_csv('../raw_data/breast_cancer_dataset.csv')

In [3]:
bc_data.head()

| | id | Clump_Thickness | Cell_Size_Uniformity | Cell_Shape_Uniformity | Marginal_Adhesion | Single_Epi_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7.581819 | 9.745087 | 1.0 | 4.50341 | 7.03993 | 10.0 | 4.412282 | 10.0 | 5.055266 | malignant |
| 1 | 2 | 5.210921 | 8.169596 | 7.841875 | 6.033275 | 4.269619 | 10.0 | 4.236312 | 4.84535 | 1.0 | malignant |
| 2 | 3 | 4.0 | 4.594296 | 2.33038 | 2.0 | 3.0 | 1.0 | 10.701823 | 1.101305 | 1.0 | benign |
| 3 | 4 | 2.428871 | 1.0 | 1.0 | 1.0 | 4.099291 | 1.0 | 2.0 | 1.0 | 1.0 | benign |
| 4 | 5 | 8.855971 | 2.697539 | 6.047068 | 3.301891 | 3.0 | 1.0 | 5.297592 | 4.104791 | 3.115741 | malignant |


## 5. Data Definition

In [4]:
bc_data.columns

Index(['id', 'Clump_Thickness', 'Cell_Size_Uniformity',
       'Cell_Shape_Uniformity', 'Marginal_Adhesion', 'Single_Epi_Cell_Size',
       'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses',
       'Class'],
      dtype='object')

In [5]:
bc_data.dtypes

id                         int64
Clump_Thickness          float64
Cell_Size_Uniformity     float64
Cell_Shape_Uniformity    float64
Marginal_Adhesion        float64
Single_Epi_Cell_Size     float64
Bare_Nuclei              float64
Bland_Chromatin          float64
Normal_Nucleoli          float64
Mitoses                  float64
Class                     object
dtype: object

In [6]:
#Check the summary statistics
bc_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 39366.0 | 19683.5 | 11364.129685 | 1.0 | 9842.25 | 19683.5 | 29524.75 | 39366.0 |
| Clump_Thickness | 39366.0 | 4.394013 | 2.812104 | 0.73546 | 2.243989 | 4.0 | 5.630522 | 13.717991 |
| Cell_Size_Uniformity | 39366.0 | 3.13007 | 3.039493 | 0.564014 | 1.0 | 1.0 | 4.553797 | 10.933095 |
| Cell_Shape_Uniformity | 39366.0 | 3.203657 | 2.975983 | 1.0 | 1.0 | 1.0 | 4.966335 | 12.604289 |
| Marginal_Adhesion | 39366.0 | 2.827221 | 2.872543 | 1.0 | 1.0 | 1.0 | 3.551935 | 11.158505 |
| Single_Epi_Cell_Size | 39366.0 | 3.209844 | 2.220422 | 1.0 | 2.0 | 2.0 | 4.003072 | 14.414889 |
| Bare_Nuclei | 39366.0 | 3.497453 | 3.619992 | -0.117818 | 1.0 | 1.0 | 6.333434 | 13.160789 |
| Bland_Chromatin | 39366.0 | 3.409142 | 2.422371 | 1.0 | 2.0 | 3.0 | 4.561324 | 12.005376 |
| Normal_Nucleoli | 39366.0 | 2.894595 | 3.069489 | 0.758343 | 1.0 | 1.0 | 3.797023 | 10.700432 |
| Mitoses | 39366.0 | 1.591809 | 1.706766 | 1.0 | 1.0 | 1.0 | 1.0 | 12.044924 |


In [7]:
#Calculate the range of the dataframe
df_max = bc_data.drop(['id','Class'],axis=1).max()
df_min = bc_data.drop(['id','Class'],axis=1).min()
df_range = df_max - df_min
df_range

Clump_Thickness          12.982531
Cell_Size_Uniformity     10.369081
Cell_Shape_Uniformity    11.604289
Marginal_Adhesion        10.158505
Single_Epi_Cell_Size     13.414889
Bare_Nuclei              13.278607
Bland_Chromatin          11.005376
Normal_Nucleoli           9.942089
Mitoses                  11.044924
dtype: float64
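The range is simply the column maximum minus the column minimum. As a rough alternative to the two separate `drop(...)` calls, the sketch below collects min, max, and range into one table with `DataFrame.agg`. The toy frame and its values are invented for illustration and are not drawn from the dataset.

```python
import pandas as pd

# Toy feature frame for illustration only.
toy = pd.DataFrame({
    "Clump_Thickness": [1.0, 4.0, 13.5],
    "Mitoses": [1.0, 1.0, 2.5],
})

# agg returns one row per statistic; transpose so each feature is a row.
summary = toy.agg(["min", "max"]).T
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

Keeping the three statistics side by side makes it easier to see which end of the 1-10 domain a feature violates.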

In [8]:
#Calculate the median of the dataframe
bc_data.drop(['id','Class'],axis=1).median()

Clump_Thickness          4.0
Cell_Size_Uniformity     1.0
Cell_Shape_Uniformity    1.0
Marginal_Adhesion        1.0
Single_Epi_Cell_Size     2.0
Bare_Nuclei              1.0
Bland_Chromatin          3.0
Normal_Nucleoli          1.0
Mitoses                  1.0
dtype: float64

In [9]:
#Calculate the mode of the dataframe
bc_data.drop(['id','Class'],axis=1).mode().T

| | 0 |
|---|---|
| Clump_Thickness | 1.0 |
| Cell_Size_Uniformity | 1.0 |
| Cell_Shape_Uniformity | 1.0 |
| Marginal_Adhesion | 1.0 |
| Single_Epi_Cell_Size | 2.0 |
| Bare_Nuclei | 1.0 |
| Bland_Chromatin | 3.0 |
| Normal_Nucleoli | 1.0 |
| Mitoses | 1.0 |


In [10]:
bc_data.nunique()

id                       39366
Clump_Thickness          26387
Cell_Size_Uniformity     17780
Cell_Shape_Uniformity    16030
Marginal_Adhesion        13232
Single_Epi_Cell_Size     10910
Bare_Nuclei               8561
Bland_Chromatin          11991
Normal_Nucleoli          11095
Mitoses                   2988
Class                        2
dtype: int64

## 6. Data Cleaning

### Check for missing values:

In [11]:
bc_data.isnull().sum()

id                       0
Clump_Thickness          0
Cell_Size_Uniformity     0
Cell_Shape_Uniformity    0
Marginal_Adhesion        0
Single_Epi_Cell_Size     0
Bare_Nuclei              0
Bland_Chromatin          0
Normal_Nucleoli          0
Mitoses                  0
Class                    0
dtype: int64

### Check for outliers (invalid data): 

Additional information provided with the dataset indicates that each of the nine features ('Clump_Thickness', 'Cell_Size_Uniformity', 'Cell_Shape_Uniformity', 'Marginal_Adhesion', 'Single_Epi_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses') takes values between 1 and 10.

Therefore, the dataset needs to be checked for invalid outliers, defined as values below 1 or greater than 10.
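A minimal sketch of that definition, assuming the features sit in a pandas DataFrame: comparing the whole feature block against the bounds yields both counts per column in one pass. The function name `count_out_of_range` and the toy values are hypothetical. The notebook cells that follow perform the same check one column at a time.

```python
import pandas as pd

VALID_MIN, VALID_MAX = 1, 10  # documented feature domain


def count_out_of_range(df, columns, low=VALID_MIN, high=VALID_MAX):
    """Count values below `low` and above `high` for each listed column."""
    features = df[columns]
    return pd.DataFrame({
        "below_min": (features < low).sum(),
        "above_max": (features > high).sum(),
    })


# Toy frame for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Bare_Nuclei": [-0.1, 1.0, 10.0, 13.2],
    "Mitoses": [1.0, 1.0, 12.0, 1.0],
})
counts = count_out_of_range(toy, ["Bare_Nuclei", "Mitoses"])
print(counts)
```

The comparisons are strict, so the boundary values 1 and 10 count as valid.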

In [12]:
#min outliers are values < 1
def check_min_outlier(column_name):
    return len(bc_data[bc_data[column_name]<1])

In [13]:
print('Number of min outliers in Clump_Thickness: ', check_min_outlier('Clump_Thickness'))
print('Number of min outliers in Cell_Size_Uniformity: ', check_min_outlier('Cell_Size_Uniformity'))
print('Number of min outliers in Cell_Shape_Uniformity: ', check_min_outlier('Cell_Shape_Uniformity'))
print('Number of min outliers in Marginal_Adhesion: ', check_min_outlier('Marginal_Adhesion'))
print('Number of min outliers in Single_Epi_Cell_Size: ', check_min_outlier('Single_Epi_Cell_Size'))
print('Number of min outliers in Bare_Nuclei: ', check_min_outlier('Bare_Nuclei'))
print('Number of min outliers in Bland_Chromatin: ', check_min_outlier('Bland_Chromatin'))
print('Number of min outliers in Normal_Nucleoli: ', check_min_outlier('Normal_Nucleoli'))
print('Number of min outliers in Mitoses: ', check_min_outlier('Mitoses'))

Number of min outliers in Clump_Thickness:  3
Number of min outliers in Cell_Size_Uniformity:  8
Number of min outliers in Cell_Shape_Uniformity:  0
Number of min outliers in Marginal_Adhesion:  0
Number of min outliers in Single_Epi_Cell_Size:  0
Number of min outliers in Bare_Nuclei:  35
Number of min outliers in Bland_Chromatin:  0
Number of min outliers in Normal_Nucleoli:  3
Number of min outliers in Mitoses:  0


In [14]:
#max outliers are values > 10
def check_max_outlier(column_name):
    return len(bc_data[bc_data[column_name]>10])

In [15]:
print('Number of max outliers in Clump_Thickness: ', check_max_outlier('Clump_Thickness'))
print('Number of max outliers in Cell_Size_Uniformity: ', check_max_outlier('Cell_Size_Uniformity'))
print('Number of max outliers in Cell_Shape_Uniformity: ', check_max_outlier('Cell_Shape_Uniformity'))
print('Number of max outliers in Marginal_Adhesion: ', check_max_outlier('Marginal_Adhesion'))
print('Number of max outliers in Single_Epi_Cell_Size: ', check_max_outlier('Single_Epi_Cell_Size'))
print('Number of max outliers in Bare_Nuclei: ', check_max_outlier('Bare_Nuclei'))
print('Number of max outliers in Bland_Chromatin: ', check_max_outlier('Bland_Chromatin'))
print('Number of max outliers in Normal_Nucleoli: ', check_max_outlier('Normal_Nucleoli'))
print('Number of max outliers in Mitoses: ', check_max_outlier('Mitoses'))

Number of max outliers in Clump_Thickness:  1329
Number of max outliers in Cell_Size_Uniformity:  1597
Number of max outliers in Cell_Shape_Uniformity:  1212
Number of max outliers in Marginal_Adhesion:  1343
Number of max outliers in Single_Epi_Cell_Size:  510
Number of max outliers in Bare_Nuclei:  69
Number of max outliers in Bland_Chromatin:  184
Number of max outliers in Normal_Nucleoli:  11
Number of max outliers in Mitoses:  261


Based on the results of these queries, a significant subset of the dataset contains max outliers. Removing every row with a max outlier would drop close to 5,000 entries, or about 13% of the data. That is too much to remove from the classification problem without compromising the integrity of the results, so these rows are retained.
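A row can exceed 10 in more than one column, so the per-column counts above cannot simply be added to get the number of affected rows. One way to count affected rows directly is a row-wise `any` over the comparison, sketched here on invented data; `rows_with_any_above` is a hypothetical helper name.

```python
import pandas as pd


def rows_with_any_above(df, columns, high=10):
    """Return (row count, share of rows) with at least one value above `high`."""
    mask = (df[columns] > high).any(axis=1)
    return int(mask.sum()), float(mask.mean())


# Toy frame for illustration only: row 0 exceeds 10 in both columns.
toy = pd.DataFrame({
    "Clump_Thickness": [11.0, 4.0, 12.0, 2.0],
    "Mitoses": [12.0, 1.0, 1.0, 1.0],
})
n_rows, share = rows_with_any_above(toy, ["Clump_Thickness", "Mitoses"])
print(n_rows, share)
```

In the toy frame the per-column counts sum to three, yet only two rows are affected, which is the same kind of gap to expect between summed counts and dropped rows in the real data.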

Only a small number of values are min outliers (49 in total across the four affected columns), so removing those rows does not materially alter the results.
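The removal can also be expressed as a single row mask instead of one filter per column. This is a sketch under the assumption that every listed column should be at least 1; `drop_rows_below` is a hypothetical name and the toy values are invented.

```python
import pandas as pd


def drop_rows_below(df, columns, low=1):
    """Keep only rows where every listed column is >= `low`.

    Note: a NaN compares as False, so rows with missing values would be dropped too.
    """
    keep = (df[columns] >= low).all(axis=1)
    return df.loc[keep].copy()


# Toy frame for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Bare_Nuclei": [-0.1, 1.0, 5.0],
    "Normal_Nucleoli": [1.0, 0.8, 2.0],
})
cleaned = drop_rows_below(toy, ["Bare_Nuclei", "Normal_Nucleoli"])
print(cleaned)
```

Returning a copy avoids pandas chained-assignment warnings if the cleaned frame is modified later.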

In [16]:
def remove_min_outliers(column_name):
    return (bc_data[bc_data[column_name]>=1])

In [17]:
bc_data = remove_min_outliers('Clump_Thickness')
bc_data = remove_min_outliers('Cell_Size_Uniformity')
bc_data = remove_min_outliers('Cell_Shape_Uniformity')
bc_data = remove_min_outliers('Marginal_Adhesion')
bc_data = remove_min_outliers('Single_Epi_Cell_Size')
bc_data = remove_min_outliers('Bare_Nuclei')
bc_data = remove_min_outliers('Bland_Chromatin')
bc_data = remove_min_outliers('Normal_Nucleoli')
bc_data = remove_min_outliers('Mitoses')

In [18]:
print('Number of min outliers in Clump_Thickness: ', check_min_outlier('Clump_Thickness'))
print('Number of min outliers in Cell_Size_Uniformity: ', check_min_outlier('Cell_Size_Uniformity'))
print('Number of min outliers in Cell_Shape_Uniformity: ', check_min_outlier('Cell_Shape_Uniformity'))
print('Number of min outliers in Marginal_Adhesion: ', check_min_outlier('Marginal_Adhesion'))
print('Number of min outliers in Single_Epi_Cell_Size: ', check_min_outlier('Single_Epi_Cell_Size'))
print('Number of min outliers in Bare_Nuclei: ', check_min_outlier('Bare_Nuclei'))
print('Number of min outliers in Bland_Chromatin: ', check_min_outlier('Bland_Chromatin'))
print('Number of min outliers in Normal_Nucleoli: ', check_min_outlier('Normal_Nucleoli'))
print('Number of min outliers in Mitoses: ', check_min_outlier('Mitoses'))

Number of min outliers in Clump_Thickness:  0
Number of min outliers in Cell_Size_Uniformity:  0
Number of min outliers in Cell_Shape_Uniformity:  0
Number of min outliers in Marginal_Adhesion:  0
Number of min outliers in Single_Epi_Cell_Size:  0
Number of min outliers in Bare_Nuclei:  0
Number of min outliers in Bland_Chromatin:  0
Number of min outliers in Normal_Nucleoli:  0
Number of min outliers in Mitoses:  0


In [19]:
bc_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 39317.0 | 19680.358649 | 11363.605788 | 1.0 | 9839.0 | 19679.0 | 29520.0 | 39366.0 |
| Clump_Thickness | 39317.0 | 4.393465 | 2.811865 | 1.0 | 2.244085 | 4.0 | 5.629741 | 13.717991 |
| Cell_Size_Uniformity | 39317.0 | 3.129445 | 3.039286 | 1.0 | 1.0 | 1.0 | 4.553463 | 10.933095 |
| Cell_Shape_Uniformity | 39317.0 | 3.203115 | 2.975756 | 1.0 | 1.0 | 1.0 | 4.965623 | 12.604289 |
| Marginal_Adhesion | 39317.0 | 2.826947 | 2.872436 | 1.0 | 1.0 | 1.0 | 3.551612 | 11.158505 |
| Single_Epi_Cell_Size | 39317.0 | 3.20955 | 2.220396 | 1.0 | 2.0 | 2.0 | 4.002653 | 14.414889 |
| Bare_Nuclei | 39317.0 | 3.500404 | 3.620691 | 1.0 | 1.0 | 1.0 | 6.338096 | 13.160789 |
| Bland_Chromatin | 39317.0 | 3.40823 | 2.421583 | 1.0 | 2.0 | 3.0 | 4.559898 | 12.005376 |
| Normal_Nucleoli | 39317.0 | 2.894197 | 3.06932 | 1.0 | 1.0 | 1.0 | 3.793476 | 10.700432 |
| Mitoses | 39317.0 | 1.591045 | 1.705605 | 1.0 | 1.0 | 1.0 | 1.0 | 12.044924 |


In [20]:
#save the cleaned dataframe as csv for the next steps
datapath = '../data'
save_file(bc_data, 'bc_data_cleaned.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\bc_data_cleaned.csv"
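`save_file` comes from the project's own `library.sb_utils` module, whose source is not shown here. For readers without that module, a rough stand-in built on plain pandas might look like the following. `save_csv` and its `overwrite` flag are hypothetical names, and this is not the helper's actual implementation.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, filename, directory, overwrite=False):
    """Write df to directory/filename as CSV; refuse to clobber unless asked."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    # index=False keeps the row index out of the file, so reloading
    # does not produce an extra unnamed column.
    df.to_csv(path, index=False)
    return path


# Toy frame and a temporary directory, for illustration only.
toy = pd.DataFrame({"id": [1, 2], "Class": ["malignant", "benign"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_csv(toy, "bc_data_cleaned.csv", tmp_dir)
    reloaded = pd.read_csv(saved_path)
print(reloaded)
```

Raising an error instead of prompting keeps the function usable in non-interactive runs; pass `overwrite=True` to replace an existing file.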
