# Wisconsin Breast Cancer Data Analysis - Data Cleaning

## Introduction
Early detection of malignancy in a breast lump is key to a high probability of surviving breast cancer. Many imaging techniques have been developed for detecting breast cancer. In this project, we will use machine learning algorithms to accurately classify the diagnosis from breast imaging data.

### Dataset

The Wisconsin Breast Cancer (Diagnostic) dataset was downloaded from Kaggle. It has 569 items, of which 212 are malignant and 357 are benign. Ten real-valued features are computed for each item:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness ($\text{perimeter}^2$ / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour) 
- symmetry
- fractal dimension ("coastline approximation" - 1)

Three different measurements are computed for each of the features described above:

- Mean value
- Standard error
- Worst (largest) value

This gives 30 features, or 32 columns in total once id and diagnosis are included.
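
The column count can be checked with a short sketch. The suffixes mean, se and worst follow the column names that appear later in this notebook; the variable names are illustrative only and are not part of the original analysis.

```python
# Ten base features listed above.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three measurements are reported for every base feature.
suffixes = ["mean", "se", "worst"]

# One column per (measurement, feature) pair, e.g. "radius_mean".
feature_columns = [
    f"{name}_{suffix}" for suffix in suffixes for name in base_features
]
all_columns = ["id", "diagnosis"] + feature_columns

print(len(feature_columns))  # 30
print(len(all_columns))      # 32
```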

## Objective
The aim of this data analysis project is to walk users through the process of classifying the data into two diagnosis classes, malignant and benign, using machine learning algorithms.

The project is divided into three notebooks:
- The current notebook covers data cleaning and processing.
- A second notebook covers summary statistics.
- A third notebook covers the machine learning process.

In [280]:
#Import all packages needed
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [281]:
#read csv file and display 5 lines of dataframe
df = pd.read_csv("data.csv")
df.head(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,



In [283]:
#print a concise summary of a dataframe.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

**Observation**: Apart from the empty 'Unnamed: 32' column seen in the preview above, the dataset looks clean, with no missing values, and all features have appropriate data types.
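
One way to back up an observation like this is to count missing values per column. The sketch below uses a small toy frame rather than the real file; its numeric values are made up for illustration, and the 'Unnamed: 32' column is assumed to be entirely empty, as the preview suggests.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns that contain no data at all are candidates for dropping.
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)
```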

Things to be done:

1) Drop the irrelevant columns ['Unnamed: 32', 'id'].

2) Group the data into three parts (mean values, se values and worst values) and save each part.

3) Standardize the feature data using StandardScaler from scikit-learn.

4) Rename the values in the diagnosis column.
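
Steps 3 and 4 could look roughly like the sketch below. It makes several assumptions: it runs on a toy frame with made-up values, it uses plain pandas arithmetic instead of StandardScaler (dividing by the population standard deviation, ddof=0, which is the convention StandardScaler follows), and the replacement labels "Malignant" and "Benign" are a guess at what the renaming is meant to produce.

```python
import pandas as pd

# Toy frame standing in for one of the grouped dataframes (values made up).
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius": [20.0, 12.0, 11.0, 17.0],
    "texture": [10.0, 18.0, 14.0, 22.0],
})
feature_cols = ["radius", "texture"]

# Step 3: z-score each feature column (zero mean, unit variance).
features = toy[feature_cols]
standardized = (features - features.mean()) / features.std(ddof=0)

# Step 4: replace the single-letter codes with readable labels (assumed labels).
labels = toy["diagnosis"].map({"M": "Malignant", "B": "Benign"})

result = pd.concat([labels, standardized], axis=1)
print(result)
```

The scaling is kept separate from the label column so that only numeric columns are transformed.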

In [284]:
# drop the empty 'Unnamed: 32' column and the 'id' identifier column
df.drop(['Unnamed: 32', 'id'],axis=1,inplace=True)
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [285]:
df.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

We observe from the index above that three different measurements (mean, standard error and worst or maximum) were made per feature. Our aim in the following sections is therefore to slice the dataframe into three.

### Indexing and Slicing Data in Pandas
Split the data into three (mean, standard error and worst or maximum).

In [286]:
#mean dataframe
df_mean = df.iloc[:,:11]
df_mean.head(2)

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667


Now that we have a dataframe containing only our mean data, we will rename the columns to drop the '_mean' suffix.

In [287]:
# create a list of the transformed column names,
# stripping the trailing '_mean' (5 characters) from each feature column
l = [i[:-5] if 'mean' in i else i for i in df_mean.columns]
print(l)
        

['diagnosis', 'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 'concavity', 'concave points', 'symmetry', 'fractal_dimension']


In [288]:
# assign the names in the list l to the dataframe columns
df_mean.columns = l 
#print 2 lines of dataframe to confirm change
df_mean.head(2)

Unnamed: 0,diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave points,symmetry,fractal_dimension
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667


Let's save the mean dataframe for use in a future notebook.

In [289]:
df_mean.to_csv('cancer_data_means.csv', index=False)

### Selecting multiple ranges

Selecting the columns for the mean dataframe was straightforward, since the columns we needed (diagnosis and the mean columns) were all adjacent. We run into a small issue when we try to do the same for the standard errors or the worst (or maximum) values: 'diagnosis' is separated from the rest of the columns we need, so we can't specify all of them in one range.

We can achieve this with the np.r_ indexing helper in [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.r_.html), which concatenates several slices into a single array of positions.

Credit to this [stackoverflow link](https://stackoverflow.com/questions/41256648/select-multiple-ranges-of-columns-in-pandas-dataframe).
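
To see what np.r_ produces before passing it to iloc, a minimal sketch is shown here. The slice bounds are the ones used in the cells that follow; the variable names are illustrative.

```python
import numpy as np

# np.r_ is indexed with square brackets, not called like a function.
# Each slice is expanded to its integers and the pieces are concatenated.
se_positions = np.r_[:1, 11:21]
print(se_positions.tolist())     # [0, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

worst_positions = np.r_[:1, 21:31]
print(worst_positions.tolist())  # [0, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
```

Position 0 is the diagnosis column, and the remaining ten positions are one block of measurements.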

In [290]:
# create the standard errors dataframe

df_SE = df.iloc[:, np.r_[:1,11:21]]
df_SE.columns = l
# view the first few rows to confirm this was successful
df_SE.head(2)

Unnamed: 0,diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave points,symmetry,fractal_dimension
0,M,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193
1,M,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532


Again let's save the standard errors dataframe for use in a future notebook.

In [291]:
df_SE.to_csv('cancer_data_SE.csv', index=False)

In [292]:
# create the worst or maximum dataframe

df_max = df.iloc[:, np.r_[:1,21:31]]
df_max.columns = l
# view the first few rows to confirm this was successful
df_max.head(2)

Unnamed: 0,diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave points,symmetry,fractal_dimension
0,M,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


Again let's save the maximum dataframe for use in a future notebook.

In [293]:
df_max.to_csv('cancer_data_max.csv', index=False)
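
As an alternative to positional slicing, the three splits could be made by column-name suffix, which does not depend on column order. This is only a sketch: split_by_suffix is a hypothetical helper, not part of the original notebook, and the toy frame uses a reduced set of columns with illustrative values.

```python
import pandas as pd

def split_by_suffix(frame, suffix, label_col="diagnosis"):
    """Return label_col plus every column ending in '_<suffix>', suffix removed."""
    ending = f"_{suffix}"
    wanted = [c for c in frame.columns if c.endswith(ending)]
    subset = frame[[label_col] + wanted].copy()
    subset.columns = [label_col] + [c[: -len(ending)] for c in wanted]
    return subset

# Toy frame with two base features (values are illustrative only).
toy = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 13.54],
    "area_mean": [1001.0, 566.3],
    "radius_se": [1.095, 0.27],
    "area_se": [153.4, 23.6],
    "radius_worst": [25.38, 15.1],
    "area_worst": [2019.0, 711.2],
})

# One dataframe per measurement type, keyed by suffix.
parts = {s: split_by_suffix(toy, s) for s in ("mean", "se", "worst")}
for s, part in parts.items():
    print(s, list(part.columns))
```

Each resulting frame could then be written out with to_csv(..., index=False), as in the cells above.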