# Wisconsin Breast Cancer Data Analysis - Data Cleaning

## Introduction
Early detection of malignancy in a breast lump is key to a high probability of surviving breast cancer. Many imaging techniques have been developed for detecting breast cancer. In this project, we will use machine learning algorithms to classify the diagnosis from features computed from a breast image.

### Dataset

The Wisconsin Breast Cancer (Diagnostic) dataset was obtained from Kaggle. It has 569 samples, of which 212 are malignant and 357 are benign. Ten real-valued features are computed for each sample:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness ($\text{perimeter}^2$ / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour) 
- symmetry
- fractal dimension ("coastline approximation" - 1)

Three different measurements are computed for each of the features described above:

- Mean value
- Standard error
- Worst or largest value

This gives 30 features, or 32 columns in total once id and diagnosis are included.
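As a quick sanity check on that count, the column names can be generated from the ten base features and the three measurement suffixes. This is a minimal sketch; the list names `base_features`, `suffixes` and `feature_columns` are illustrative and are not part of the notebook.

```python
# Ten base features, spelled as they appear in the dataset's column names.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three measurements per feature: mean, standard error, worst.
suffixes = ["mean", "se", "worst"]

# All mean columns first, then all se columns, then all worst columns.
feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]
all_columns = ["id", "diagnosis"] + feature_columns

print(len(feature_columns), len(all_columns))
```

The two printed numbers are the feature count and the total column count described above.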

## Objective
The aim of this data analysis project is to walk users through the process of classifying the data into two diagnosis classes, malignant and benign, using machine learning algorithms.

The project is divided into three notebooks:
- The current notebook covers data cleaning and preprocessing.
- A second notebook covers summary statistics.
- A third notebook covers the machine learning process.

In [1]:
#Import all packages needed
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [2]:
#read csv file and display 2 lines of dataframe
df = pd.read_csv("data.csv")
df.head(2)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,


In [3]:
#print a concise summary of a dataframe.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5


## Preprocessing Data

**Observation**: Apart from the `Unnamed: 32` column, which appears empty in the preview above, the dataset looks clean with no missing values, and all features have the correct data types.
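One way to check an observation like this directly is to count missing values per column. The sketch below uses a small made-up frame (the name `toy` and its values are hypothetical) rather than the real data, so it only illustrates the pattern.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame that mimics the shape of the problem:
# ordinary columns plus one column that contains only NaN.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 20.57, 11.42],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing = toy.isna().sum()

# Columns in which every single value is missing.
empty_columns = missing[missing == len(toy)].index.tolist()

print(missing)
print(empty_columns)
```

Running the same two lines on `df` would show which columns, if any, contain missing values before anything is dropped.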

Things to be done:

1) Drop irrelevant columns ['Unnamed: 32', 'id']

2) Group the data into three parts (mean values, se values and worst values) and save each part.

3) Scale the feature data using MinMaxScaler from scikit-learn.

4) Rename the values in the diagnosis column

In [4]:
#drop columns
df.drop(['Unnamed: 32', 'id'],axis=1,inplace=True)
df.head(2)

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [6]:
#rename the M and B values in diagnosis to Malignant and Benign
#assign the result back rather than using inplace=True on a selected column
df['diagnosis'] = df['diagnosis'].replace({'M': 'Malignant', 'B': 'Benign'})
print(df['diagnosis'].value_counts()) #confirm update

Benign       357
Malignant    212
Name: diagnosis, dtype: int64
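The counts printed above can be turned into class shares with a few lines of plain Python. This is a small sketch that hard-codes the two counts from the output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"Benign": 357, "Malignant": 212}

total = sum(counts.values())
# Fraction of the dataset that falls in each class.
shares = {label: n / total for label, n in counts.items()}
# How many more benign than malignant samples there are.
difference = counts["Benign"] - counts["Malignant"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print("difference:", difference)
```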



### Feature Scaling using MinMaxScaler

It is useful to check whether the classes are balanced. The dataset is moderately imbalanced: there are 357 benign tumors, 145 more than the 212 malignant ones. Separately from class balance, the features are measured on very different scales (in the preview above, area_mean is in the thousands while smoothness_mean is around 0.1), so we will use MinMaxScaler() from scikit-learn to normalize (scale) our features. By default it rescales each feature to the range [0,1]; a different range such as [-1,1] can be requested through the `feature_range` parameter.
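To make the scaling concrete: min-max scaling maps each value $x$ of a feature to $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. The sketch below applies that formula by hand with NumPy to a small made-up column. It illustrates the idea only and is not scikit-learn's implementation.

```python
import numpy as np

# Toy column of values (hypothetical, not taken from the dataset).
x = np.array([10.0, 15.0, 20.0, 30.0])

# Subtract the column minimum, then divide by the column range.
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.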

Let us split the dataset into categorical and quantitative data so that we apply the scaling function to the quantitative data alone.
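The notebook performs this split by column position. As an alternative sketch, pandas can also pick the numeric columns by dtype with `select_dtypes`, which does not depend on where the categorical column sits. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one categorical column and two numeric columns
# (illustrative values, not the real dataset).
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 13.5, 12.4],
    "texture_mean": [10.4, 14.4, 15.7],
})

# Categorical target column.
target = toy["diagnosis"]
# Every numeric column, regardless of its position in the frame.
features = toy.select_dtypes(include="number")

print(list(features.columns))
```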

In [5]:
#split data
target = df.loc[:,'diagnosis']
print(target.head(2))
features = df.iloc[:,1:]
features.head(2)

0    M
1    M
Name: diagnosis, dtype: object


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [7]:
from sklearn.preprocessing import MinMaxScaler
#fit the scaler on the features and transform them in one step
scaler = MinMaxScaler()
d = scaler.fit_transform(features)
#rebuild a dataframe from the scaled array, keeping the original column names
names = features.columns
df_transformed = pd.DataFrame(columns = names, data = d)

In [8]:
#print few lines of the transformed features
df_transformed.head(2)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,0.605518,...,0.620776,0.141525,0.66831,0.450698,0.601136,0.619292,0.56861,0.912027,0.598462,0.418864
1,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,0.141323,...,0.606901,0.303571,0.539818,0.435214,0.347553,0.154563,0.192971,0.639175,0.23359,0.222878


All features now have values in the range [0,1]. Putting the features on a common scale helps many machine learning algorithms, particularly those based on distances or gradient descent. We will now add the target column back to our data frame and save the data for use in the other notebooks.
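Rather than reading the range off the first two rows, it can be checked for every column at once. The sketch below runs on a small made-up frame named `scaled`; in the notebook the same check would apply to `df_transformed`. The tolerance `tol` is an assumption to allow for floating-point rounding.

```python
import pandas as pd

# Hypothetical scaled features standing in for df_transformed.
scaled = pd.DataFrame({
    "radius_mean": [0.0, 0.52, 1.0],
    "texture_mean": [0.02, 1.0, 0.0],
})

tol = 1e-9  # small allowance for floating-point rounding
col_min = scaled.min()
col_max = scaled.max()

# True only if every column lies within [0, 1] (up to the tolerance).
in_range = bool(((col_min >= -tol) & (col_max <= 1 + tol)).all())

print(in_range)
```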

In [9]:
#add the target feature to transformed data at position 0, name it diagnosis.
df_transformed.insert(0, 'diagnosis', target)
df_transformed.head(2)

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,Malignant,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,...,0.620776,0.141525,0.66831,0.450698,0.601136,0.619292,0.56861,0.912027,0.598462,0.418864
1,Malignant,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,...,0.606901,0.303571,0.539818,0.435214,0.347553,0.154563,0.192971,0.639175,0.23359,0.222878


In [10]:
#save clean data
df_transformed.to_csv('data_clean.csv', index=False)

In [11]:
df_transformed.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

We can see from the index above that three different measurements (mean, standard error and worst) were made per feature. Our aim in the following sections is therefore to slice the dataframe into three.

### Indexing and Slicing Data in Pandas
Split the data into three dataframes (mean, standard error and worst).

In [17]:
#mean dataframe
df_transformed_mean = df_transformed.iloc[:,:11]
df_transformed_mean.head(2)

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,Malignant,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,0.605518
1,Malignant,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,0.141323


Now that we have a dataframe containing only our mean data, we will strip the `_mean` suffix from the column names.

In [18]:
#create a list of column names with the '_mean' suffix removed
l = [i[:-5] if i.endswith('_mean') else i for i in df_transformed_mean.columns]
print(l)
        

['diagnosis', 'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 'concavity', 'concave points', 'symmetry', 'fractal_dimension']


In [19]:
# assign the names in the list l to the dataframe columns
df_transformed_mean.columns = l 
#print 2 lines of dataframe to confirm change
df_transformed_mean.head(2)

Unnamed: 0,diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave points,symmetry,fractal_dimension
0,Malignant,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,0.605518
1,Malignant,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,0.141323


Let's save the mean dataframe for use in a future notebook.

In [20]:
df_transformed_mean.to_csv('cancer_data_means.csv', index=False)

### Selecting multiple ranges

Selecting the columns for the mean dataframe was straightforward because the columns we needed (diagnosis and the mean columns) were adjacent. We run into a small issue when we try to do the same for the standard errors or the worst (maximum) values: 'diagnosis' is separated from the rest of the columns we need, so we can't specify all of them in one range.

We can achieve this with the `np.r_` object in [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.r_.html), which concatenates several slices into a single array of indices.

Credit to this [stackoverflow link](https://stackoverflow.com/questions/41256648/select-multiple-ranges-of-columns-in-pandas-dataframe)
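To see what `np.r_` produces before using it inside `iloc`, the index array can be printed on its own. A minimal sketch, where the variable name `idx` is illustrative:

```python
import numpy as np

# Concatenate the slice [:1] (column 0, diagnosis) with the
# slice [11:21] (the ten standard error columns).
idx = np.r_[:1, 11:21]

print(idx)
print(len(idx))
```

The result is an ordinary integer array, which is why it can be passed to `iloc` as a list of column positions.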

In [21]:
# create the standard errors dataframe

df_transformed_SE = df_transformed.iloc[:, np.r_[:1,11:21]]
df_transformed_SE.columns = l
# view the first few rows to confirm this was successful
df_transformed_SE.head(2)

Unnamed: 0,diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave points,symmetry,fractal_dimension
0,Malignant,0.356147,0.120469,0.369034,0.273811,0.159296,0.351398,0.135682,0.300625,0.311645,0.183042
1,Malignant,0.156437,0.082589,0.12444,0.12566,0.119387,0.081323,0.04697,0.253836,0.084539,0.09111


Again let's save the standard errors dataframe for use in a future notebook.

In [22]:
df_transformed_SE.to_csv('cancer_data_SE.csv', index=False)

In [23]:
# create the worst or maximum dataframe

df_transformed_max = df_transformed.iloc[:, np.r_[:1,21:31]]
df_transformed_max.columns = l
# view the first few rows to confirm this was successful
df_transformed_max.head(2)

Unnamed: 0,diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave points,symmetry,fractal_dimension
0,Malignant,0.620776,0.141525,0.66831,0.450698,0.601136,0.619292,0.56861,0.912027,0.598462,0.418864
1,Malignant,0.606901,0.303571,0.539818,0.435214,0.347553,0.154563,0.192971,0.639175,0.23359,0.222878


Again let's save the maximum dataframe for use in a future notebook.

In [24]:
df_transformed_max.to_csv('cancer_data_max.csv', index=False)
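The three slices above rely on column positions. As an alternative sketch, the same split can be driven by the column-name suffixes, which keeps working if the column order changes. The helper `split_by_suffix` and the frame `toy` are hypothetical and not part of the notebook.

```python
import pandas as pd

def split_by_suffix(df, suffix, target="diagnosis"):
    """Return the target column plus every column ending in `suffix`,
    with the suffix stripped from the column names."""
    cols = [c for c in df.columns if c.endswith(suffix)]
    subset = df[[target] + cols]
    return subset.rename(columns={c: c[: -len(suffix)] for c in cols})

# Tiny made-up frame with one feature measured three ways.
toy = pd.DataFrame({
    "diagnosis": ["Malignant", "Benign"],
    "radius_mean": [0.5, 0.2],
    "radius_se": [0.3, 0.1],
    "radius_worst": [0.6, 0.25],
})

# One dataframe per measurement type.
parts = {s: split_by_suffix(toy, s) for s in ("_mean", "_se", "_worst")}

print(parts["_se"])
```

Each entry of `parts` could then be written out with `to_csv(..., index=False)` in the same way as the three files saved above.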