# Data cleaning

This notebook is optional. It explains how we arrived at the data-cleaning steps implemented in the `preprocess_data` function in `utils.py`.

# Imports

In [None]:
# https://stackoverflow.com/questions/36786722/how-to-display-full-output-in-jupyter-not-only-last-result
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Imports
import pandas as pd
import numpy as np

# **1. Load Data**

In [None]:
data_source = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
data = pd.read_csv(data_source, header=None)
data

# Dataset: **Breast Cancer Wisconsin (Original)**

Data source [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data) and documentation [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names).

Data summary:
```
 Sample code number          : Id number
 Clump Thickness             : 1–10
 Uniformity of Cell Size     : 1–10
 Uniformity of Cell Shape    : 1–10
 Marginal Adhesion           : 1–10
 Single Epithelial Cell Size : 1–10
 Bare Nuclei                 : 1–10
 Bland Chromatin             : 1–10
 Normal Nucleoli             : 1–10
 Mitoses                     : 1–10
 Class                       : 2 for benign, 4 for malignant
```
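
The raw file has no header row, so with `header=None` the columns load as the integers 0 to 10. A minimal sketch of how the names from the summary above could be attached follows. The `column_names` list and the `name_columns` helper are illustrative names, not necessarily what `utils.py` uses, and the demo row is a placeholder rather than real data:

```python
import pandas as pd

# Column names taken from the data summary above
# (assumption: the file stores the columns in this same order).
column_names = [
    "Sample code number",
    "Clump Thickness",
    "Uniformity of Cell Size",
    "Uniformity of Cell Shape",
    "Marginal Adhesion",
    "Single Epithelial Cell Size",
    "Bare Nuclei",
    "Bland Chromatin",
    "Normal Nucleoli",
    "Mitoses",
    "Class",
]


def name_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose 11 positional columns carry readable names."""
    named = df.copy()
    named.columns = column_names
    return named


# Placeholder row standing in for the frame loaded with header=None.
demo = pd.DataFrame([[1, 5, 1, 1, 1, 2, "1", 3, 1, 1, 2]])
named_demo = name_columns(demo)
print(named_demo.columns.tolist())
```

In the notebook this would be applied as `data = name_columns(data)`, after which columns such as `data['Bare Nuclei']` and `data['Class']` can be accessed by name.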



Much better!

Are there any missing values?

In [None]:
data.isna().sum()

No missing values are reported, nice!

What are the data types of our data?



In [None]:
data.dtypes

We expect every column to be numeric, and most are (`int64` means each value in the column is stored as a 64-bit integer). For some reason, though, the values of the column `Bare Nuclei` are stored as text rather than numbers, which doesn't make sense for a 1–10 score. Let's see what could be wrong here:

In [None]:
data['Bare Nuclei'].unique()

Looking at the unique values of this column, we see that at least one observation contains `'?'`. A single `'?'` is enough to force pandas to store the whole column as strings. What we want instead is to store everything as numbers and replace `'?'` with the special value `NaN`, which pandas treats as a missing value.
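
To see the effect in isolation, here is a small sketch on a made-up three-row CSV (not the real dataset). It suggests why the earlier `isna()` check found nothing: a `'?'` is an ordinary string, not a null.

```python
import io

import pandas as pd

# Toy CSV with one '?' placeholder (made-up values, not the real dataset).
toy_csv = "1\n?\n3\n"
toy = pd.read_csv(io.StringIO(toy_csv), header=None)

# The column cannot be parsed as numbers, so every value stays a string.
print(toy[0].tolist())
print(pd.api.types.is_numeric_dtype(toy[0]))

# '?' is a regular string, so it is not counted as missing.
print(toy[0].isna().sum())
```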

In [None]:
positions_with_interrogation_mark = data['Bare Nuclei'] == '?'
data.loc[positions_with_interrogation_mark,'Bare Nuclei'] = np.nan
data['Bare Nuclei'] = data['Bare Nuclei'].astype(float) # floats are numbers in python
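
As an aside, the same conversion can be written more compactly. The sketch below uses made-up values in place of the real column and shows two alternatives: `pd.to_numeric` with `errors="coerce"`, and the `na_values` parameter of `pd.read_csv`. Neither is necessarily what `preprocess_data` does.

```python
import io

import numpy as np
import pandas as pd

# Made-up values standing in for the 'Bare Nuclei' column.
raw = pd.Series(["1", "10", "?", "4"], name="Bare Nuclei")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.dtype)
print(cleaned.isna().sum())

# Alternatively, declare '?' as a missing-value marker when loading the file.
toy = pd.read_csv(io.StringIO("1\n10\n?\n4\n"), header=None, na_values="?")
print(toy[0].isna().sum())
```

Note that `errors="coerce"` silently converts every unparseable token, not just `'?'`, so the explicit comparison used above is the safer choice when we want to be sure which values are being replaced.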

Now we need to do something with the NaNs. We can either:
1. Fill in their values with something
2. Remove these observations because we have no values for this column


Let's fill in the missing values with the column average so that we keep these data points, giving our ML models more examples to learn from. Even though these observations will be noisy in this feature, they may still help the model learn patterns between the other features and the target.

In [None]:
average_value = data['Bare Nuclei'].mean()
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(average_value) # fill nan values with the average value of Bare Nuclei from the whole dataset
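
To make the trade-off between the two options concrete, here is a small sketch on a made-up four-row frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing 'Bare Nuclei' value out of four rows.
toy = pd.DataFrame({
    "Bare Nuclei": [1.0, np.nan, 10.0, 4.0],
    "Class": [2, 2, 4, 4],
})

# Option 1: fill the gap with the column mean (mean() skips NaN by default).
filled = toy.copy()
filled["Bare Nuclei"] = filled["Bare Nuclei"].fillna(filled["Bare Nuclei"].mean())

# Option 2: drop the rows where 'Bare Nuclei' is missing.
dropped = toy.dropna(subset=["Bare Nuclei"])

print(len(filled), len(dropped))       # 4 3
print(filled["Bare Nuclei"].tolist())  # [1.0, 5.0, 10.0, 4.0]
```

Filling keeps every row at the cost of one invented value, while dropping keeps only observed values at the cost of one row. If the data is later split into training and test sets, computing the mean on the training split only would avoid leaking information from the test set.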

Now we confirm that the data types make sense and that no NaNs remain.

In [None]:
data.dtypes

In [None]:
data.isna().sum()

All good!

The last problem we observe is that our target column, which contains the values we want to predict, has values either 2 or 4:

In [None]:
data['Class'].unique()

For binary classification, the target is conventionally encoded as 0 or 1. So we map `2` to 0 because it means **benign**, and `4` to 1 because it means **malignant**, which is the class we are interested in predicting.

In [None]:
data["Class"] = data["Class"].replace({2:0, 4:1})
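
A slightly more defensive variant, sketched on made-up labels, uses `map` instead of `replace`. The name `class_mapping` is illustrative. With `replace`, a value missing from the dictionary is left untouched, whereas `map` turns it into `NaN`, which makes unexpected labels easy to detect:

```python
import pandas as pd

# Made-up labels using the dataset's encoding: 2 = benign, 4 = malignant.
labels = pd.Series([2, 4, 4, 2, 2], name="Class")

class_mapping = {2: 0, 4: 1}
binary = labels.map(class_mapping)

# map() yields NaN for any value missing from the mapping,
# so this check fails loudly if an unexpected label appears.
assert binary.notna().all()
print(binary.tolist())  # [0, 1, 1, 0, 0]
```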

In [None]:
data

And we are all set!