<h1 style="color:blue">Cancer Diagnosis Prediction - Preprocessing</h1>

# Dataset


<a href="https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic">Breast Cancer Wisconsin (Diagnostic)</a>


<a href="https://github.com/uci-ml-repo/ucimlrepo/tree/main">GitHub: ucimlrepo package</a>



Citation: Wolberg, William, Mangasarian, Olvi, Street, Nick, and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.

**Additional Variable Information**
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)

Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)
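
The column names used throughout this notebook follow from the list above: each of the ten base measurements appears three times with a numeric suffix. A minimal sketch that rebuilds the 30 feature names is given here. The reading of the suffixes 1, 2 and 3 as mean, standard error and worst value is taken from the UCI dataset description and is an assumption as far as this notebook is concerned; `BASE_FEATURES`, `SUFFIXES` and `feature_columns` are names introduced only for illustration.

```python
# Ten base measurements from the variable information above,
# spelled the way they appear in the dataset's column names.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Assumption (UCI description): 1 = mean, 2 = standard error, 3 = worst value.
SUFFIXES = {1: "mean", 2: "standard error", 3: "worst"}

# Suffix varies slowest, so the order is radius1 ... fractal_dimension1, radius2, ...
feature_columns = [f"{name}{suffix}" for suffix in SUFFIXES for name in BASE_FEATURES]

print(len(feature_columns))
print(feature_columns[:3], feature_columns[-1])
```

Comparing such a list against `df_original.columns` (minus `ID` and `Diagnosis`) would be one way to confirm that the loaded table has the expected layout.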

## 1. Import

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns

# dotenv import
from dotenv import load_dotenv
import os
from tqdm.auto import tqdm

# Datasets
from ucimlrepo import fetch_ucirepo, list_available_datasets  # Libraries for the Breast Cancer dataset



## 2. Breast Cancer Dataset - Load Data

In [2]:
!Dir

 Volume in drive C is Windows
 Volume Serial Number is 20BD-6958

 Directory of C:\Users\andre\Documents\IRONHACK\UNIT_8\Project\Cancer_Diagnosis_Prediction\notebooks

10.04.2024  14:27    <DIR>          .
08.04.2024  20:46    <DIR>          ..
10.04.2024  09:55    <DIR>          .ipynb_checkpoints
10.04.2024  09:57               606 01_CDP_EDA.ipynb
10.04.2024  14:27         1.859.574 01_CDP_Preprocessing.ipynb
               2 File(s),      1.860.180 bytes
               3 Dir(s), 653.112.172.544 bytes free


In [3]:
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 

In [4]:
# data (as pandas dataframes) 
df_original = breast_cancer_wisconsin_diagnostic.data.original

## 3. Breast Cancer Dataset - Overview 

In [5]:
# Display first 5 rows of the full dataset (ID, features and target)
df_original.head()

| | ID | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | ... | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 | Diagnosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 842302 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | M |
| 1 | 842517 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | M |
| 2 | 84300903 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | M |
| 3 | 84348301 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | M |
| 4 | 84358402 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | M |


In [6]:
# Dataframe structure
print("no of rows:", df_original.shape[0], "and no of columns:", df_original.shape[1])

no of rows: 569 and no of columns: 32


In [7]:
# Column Names
df_original.columns

Index(['ID', 'radius1', 'texture1', 'perimeter1', 'area1', 'smoothness1',
       'compactness1', 'concavity1', 'concave_points1', 'symmetry1',
       'fractal_dimension1', 'radius2', 'texture2', 'perimeter2', 'area2',
       'smoothness2', 'compactness2', 'concavity2', 'concave_points2',
       'symmetry2', 'fractal_dimension2', 'radius3', 'texture3', 'perimeter3',
       'area3', 'smoothness3', 'compactness3', 'concavity3', 'concave_points3',
       'symmetry3', 'fractal_dimension3', 'Diagnosis'],
      dtype='object')

In [8]:
# Info stats
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  569 non-null    int64  
 1   radius1             569 non-null    float64
 2   texture1            569 non-null    float64
 3   perimeter1          569 non-null    float64
 4   area1               569 non-null    float64
 5   smoothness1         569 non-null    float64
 6   compactness1        569 non-null    float64
 7   concavity1          569 non-null    float64
 8   concave_points1     569 non-null    float64
 9   symmetry1           569 non-null    float64
 10  fractal_dimension1  569 non-null    float64
 11  radius2             569 non-null    float64
 12  texture2            569 non-null    float64
 13  perimeter2          569 non-null    float64
 14  area2               569 non-null    float64
 15  smoothness2         569 non-null    float64
 16  compactness2        569 non-null    float64
 ...

In [9]:
breast_cancer_wisconsin_diagnostic.keys()

dict_keys(['data', 'metadata', 'variables'])

In [10]:
# Metadata
print(breast_cancer_wisconsin_diagnostic.metadata['additional_info']['variable_info'])

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)

Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)


In [11]:
breast_cancer_wisconsin_diagnostic.variables['name']

0                     ID
1              Diagnosis
2                radius1
3               texture1
4             perimeter1
5                  area1
6            smoothness1
7           compactness1
8             concavity1
9        concave_points1
10             symmetry1
11    fractal_dimension1
12               radius2
13              texture2
14            perimeter2
15                 area2
16           smoothness2
17          compactness2
18            concavity2
19       concave_points2
20             symmetry2
21    fractal_dimension2
22               radius3
23              texture3
24            perimeter3
25                 area3
26           smoothness3
27          compactness3
28            concavity3
29       concave_points3
30             symmetry3
31    fractal_dimension3
Name: name, dtype: object

In [12]:
breast_cancer_wisconsin_diagnostic.metadata['data_url']

'https://archive.ics.uci.edu/static/public/17/data.csv'

## 4. Breast Cancer Dataset - Unique Values

In [13]:
# Number of unique values per column of df_original
df_original.nunique()

ID                    569
radius1               456
texture1              479
perimeter1            522
area1                 539
smoothness1           474
compactness1          537
concavity1            537
concave_points1       542
symmetry1             432
fractal_dimension1    499
radius2               540
texture2              519
perimeter2            533
area2                 528
smoothness2           547
compactness2          541
concavity2            533
concave_points2       507
symmetry2             498
fractal_dimension2    545
radius3               457
texture3              511
perimeter3            514
area3                 544
smoothness3           411
compactness3          529
concavity3            539
concave_points3       492
symmetry3             500
fractal_dimension3    535
Diagnosis               2
dtype: int64

In [14]:
df_original['Diagnosis'].unique()

array(['M', 'B'], dtype=object)

## 5. Breast Cancer Dataset - Descriptive Statistics

In [15]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  569 non-null    int64  
 1   radius1             569 non-null    float64
 2   texture1            569 non-null    float64
 3   perimeter1          569 non-null    float64
 4   area1               569 non-null    float64
 5   smoothness1         569 non-null    float64
 6   compactness1        569 non-null    float64
 7   concavity1          569 non-null    float64
 8   concave_points1     569 non-null    float64
 9   symmetry1           569 non-null    float64
 10  fractal_dimension1  569 non-null    float64
 11  radius2             569 non-null    float64
 12  texture2            569 non-null    float64
 13  perimeter2          569 non-null    float64
 14  area2               569 non-null    float64
 15  smoothness2         569 non-null    float64
 16  compactness2        569 non-null    float64
 ...

In [16]:
df_original.describe()

| | ID | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | ... | radius3 | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 30371830.0 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 125020600.0 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 8670.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


In [17]:
df_original.describe(include='object')

| | Diagnosis |
| --- | --- |
| count | 569 |
| unique | 2 |
| top | B |
| freq | 357 |


In [18]:
df_original['Diagnosis'].value_counts()

Diagnosis
B    357
M    212
Name: count, dtype: int64
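
To put these counts into perspective, the class shares can be derived directly from them. The sketch below rebuilds the counts by hand from the output above instead of reading them from `df_original`; `counts`, `shares` and `imbalance_ratio` are illustrative names. On the real dataframe, `df_original['Diagnosis'].value_counts(normalize=True)` should yield the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"B": 357, "M": 212}, name="count")

# Share of each class among all 569 samples.
shares = counts / counts.sum()
print(shares.round(3))

# Majority-to-minority ratio as a single imbalance figure.
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

Roughly 63% of the samples are benign and 37% malignant, so the imbalance is moderate rather than extreme.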

## 6. Breast Cancer Dataset - Data Cleaning & Transformation

**Column Names**

In [19]:
df_original.columns

Index(['ID', 'radius1', 'texture1', 'perimeter1', 'area1', 'smoothness1',
       'compactness1', 'concavity1', 'concave_points1', 'symmetry1',
       'fractal_dimension1', 'radius2', 'texture2', 'perimeter2', 'area2',
       'smoothness2', 'compactness2', 'concavity2', 'concave_points2',
       'symmetry2', 'fractal_dimension2', 'radius3', 'texture3', 'perimeter3',
       'area3', 'smoothness3', 'compactness3', 'concavity3', 'concave_points3',
       'symmetry3', 'fractal_dimension3', 'Diagnosis'],
      dtype='object')

**Missing Values**

In [20]:
df_original.isna().sum()

ID                    0
radius1               0
texture1              0
perimeter1            0
area1                 0
smoothness1           0
compactness1          0
concavity1            0
concave_points1       0
symmetry1             0
fractal_dimension1    0
radius2               0
texture2              0
perimeter2            0
area2                 0
smoothness2           0
compactness2          0
concavity2            0
concave_points2       0
symmetry2             0
fractal_dimension2    0
radius3               0
texture3              0
perimeter3            0
area3                 0
smoothness3           0
compactness3          0
concavity3            0
concave_points3       0
symmetry3             0
fractal_dimension3    0
Diagnosis             0
dtype: int64
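
The per-column listing can be condensed into a few summary numbers. The following sketch uses a small hand-made stand-in for `df_original` (values taken from the first rows shown earlier) and also counts duplicated IDs and duplicated rows. The duplicate checks are an addition that this notebook does not perform itself, although the 569 unique `ID` values reported in section 4 already suggest there are none.

```python
import pandas as pd

# Toy stand-in for df_original; only three rows and three columns.
df = pd.DataFrame({
    "ID": [842302, 842517, 84300903],
    "radius1": [17.99, 20.57, 19.69],
    "Diagnosis": ["M", "M", "M"],
})

total_missing = int(df.isna().sum().sum())        # missing cells over all columns
duplicate_ids = int(df["ID"].duplicated().sum())  # sample IDs that occur more than once
duplicate_rows = int(df.duplicated().sum())       # rows identical in every column

print(total_missing, duplicate_ids, duplicate_rows)
```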

**Encode Target Variable**

In [21]:
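# Encode the target: malignant (M) -> 1, benign (B) -> 0.
# Note: re-running this cell on already encoded data would set every value to 0.
df_original['Diagnosis'] = df_original['Diagnosis'].apply(lambda x: 1 if x == 'M' else 0)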
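
An alternative way to write this encoding is an explicit mapping that refuses unknown labels, which guards against running the cell twice. This is a sketch, not the method used in this notebook; `LABEL_MAP` and `encode_diagnosis` are hypothetical names.

```python
import pandas as pd

LABEL_MAP = {"B": 0, "M": 1}  # benign -> 0, malignant -> 1

def encode_diagnosis(labels: pd.Series) -> pd.Series:
    """Map 'B'/'M' to 0/1 and raise if any other value is present."""
    unexpected = set(labels.unique()) - set(LABEL_MAP)
    if unexpected:
        raise ValueError(f"Unexpected diagnosis labels: {sorted(unexpected, key=str)}")
    return labels.map(LABEL_MAP).astype("int64")

encoded = encode_diagnosis(pd.Series(["M", "B", "M"], name="Diagnosis"))
print(encoded.tolist())
```

Calling `encode_diagnosis` on the already encoded series raises a `ValueError` instead of silently turning every label into 0.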

**Separate Features/Target**

In [22]:
# Drop the identifier and the target so that only the 30 feature columns remain
features = df_original.drop(columns=['ID', 'Diagnosis'])
features.columns

Index(['radius1', 'texture1', 'perimeter1', 'area1', 'smoothness1',
       'compactness1', 'concavity1', 'concave_points1', 'symmetry1',
       'fractal_dimension1', 'radius2', 'texture2', 'perimeter2', 'area2',
       'smoothness2', 'compactness2', 'concavity2', 'concave_points2',
       'symmetry2', 'fractal_dimension2', 'radius3', 'texture3', 'perimeter3',
       'area3', 'smoothness3', 'compactness3', 'concavity3', 'concave_points3',
       'symmetry3', 'fractal_dimension3'],
      dtype='object')

In [23]:
# Encoded target column (1 = malignant, 0 = benign)
target = df_original['Diagnosis']
target

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: Diagnosis, Length: 569, dtype: int64

## 7. Breast Cancer Dataset - Write to CSV

In [69]:
# Save the cleaned dataset (ID column and encoded Diagnosis included) without the index
df_original.to_csv('../data/cleaned/df_clean.csv', index=False)
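
`to_csv` does not create missing folders, so the write fails if `../data/cleaned` does not exist yet. The sketch below creates the folder first and then reads the file back to confirm that the round trip preserves the data. It writes a toy dataframe to a temporary directory instead of the project path; `out_path` and `reloaded` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned dataframe.
df = pd.DataFrame({
    "ID": [842302, 842517],
    "radius1": [17.99, 20.57],
    "Diagnosis": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "cleaned" / "df_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if columns, dtypes or values changed on the way.
pd.testing.assert_frame_equal(reloaded, df)
print(list(reloaded.columns))
```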

<h1 style="color:green">Summary & Conclusion - Preprocessing</h1>

- **Column names** are consistent and descriptive -> no changes applied
- No **missing values** in the dataset -> no changes applied
- **Data types** are numerical except for the target column -> target encoded, no changes applied to numerical columns
- Target column ('Diagnosis') shows **class imbalance**: 357 benign vs. 212 malignant samples (see Notebook 02)
- Features are on **different scales** -> use a scaler for ML (see Notebook 03)
- Encoded Diagnosis labels:
    - Benign (B) = 0
    - Malignant (M) = 1
 
<b style="color:#2f4f4f">Overall, the provided dataset is clean in terms of table structure, data types and missing values. The recommended downstream steps are feature scaling, handling the class imbalance and further processing required for machine learning (ML) development (e.g. feature selection).</b>
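
As an illustration of the scaling recommendation, here is a minimal z-score standardization on three columns from the first rows shown in section 3. Notebook 03 is not part of this document, so the scaler actually used there is unknown; this is only a sketch of the general idea. Using `ddof=0` (population standard deviation) is a choice made here to match the convention of scikit-learn's `StandardScaler`.

```python
import pandas as pd

# Toy features: note the very different magnitudes of the three columns.
features = pd.DataFrame({
    "radius1": [17.99, 20.57, 19.69, 11.42, 20.29],
    "area1": [1001.0, 1326.0, 1203.0, 386.1, 1297.0],
    "smoothness1": [0.1184, 0.08474, 0.1096, 0.1425, 0.1003],
})

# Z-score: subtract each column's mean, then divide by its standard deviation.
scaled = (features - features.mean()) / features.std(ddof=0)

# After scaling, every column has mean ~0 and standard deviation ~1.
print(scaled.mean().abs().max())
print(scaled.std(ddof=0).round(6).tolist())
```

In a modelling workflow the mean and standard deviation would normally be estimated on the training split only and then applied to the test split, to avoid leaking information.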